Piyush Rai
Machine Learning (CS771A)
Aug 12, 2016
Learning as Optimization
Supervised learning problem with training data $\{(x_n, y_n)\}_{n=1}^N$
Goal: Find $f : x \mapsto y$ that fits the training data well and is also simple
The function f is learned by solving the following optimization problem
$$\hat{f} = \arg\min_f \sum_{n=1}^N \ell(y_n, f(x_n)) + \lambda R(f)$$
The objective is a sum of the empirical training loss and a regularization term
$\ell(y_n, f(x_n))$ denotes the loss function: the error $f$ makes on example $(x_n, y_n)$
The regularizer $R(f)$ is a measure of the complexity of the function $f$
This is called Regularized Empirical Risk Minimization
The regularization hyperparameter $\lambda$ controls the amount of regularization
For a linear model $f(x) = w^\top x$ with squared loss and regularizer $R(f) = w^\top w$ (regularized least squares), the solution is available in closed form:
$$\hat{w} = \Big( \sum_{n=1}^N x_n x_n^\top + \lambda I_D \Big)^{-1} \sum_{n=1}^N y_n x_n = (X^\top X + \lambda I_D)^{-1} X^\top y$$
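A minimal NumPy sketch of this closed-form solution; the data, the true weights, and the value of $\lambda$ are all made up for illustration:

```python
import numpy as np

# Toy data (made up for illustration): N = 50 examples, D = 3 features
rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))                 # design matrix, rows are x_n
w_true = np.array([1.0, -2.0, 0.5])         # assumed ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=N)   # noisy linear responses

lam = 0.1  # regularization hyperparameter (lambda), chosen arbitrarily

# Closed-form regularized least squares: (X^T X + lam * I_D)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(w_hat)  # close to w_true when the noise and lam are small
```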
Maximum Likelihood Estimation (MLE)
Given $N$ observations $y = \{y_1, y_2, \ldots, y_N\}$
$p(y \mid \theta)$ is also called the likelihood; $p(y_n \mid \theta)$ is the likelihood w.r.t. a single data point
The likelihood will be a function of the parameters $\theta$
Log-likelihood:
$$\mathcal{L}(\theta) = \log p(y \mid \theta) = \log \prod_{n=1}^N p(y_n \mid \theta) = \sum_{n=1}^N \log p(y_n \mid \theta)$$
MLE maximizes the log-likelihood, or equivalently minimizes the negative log-likelihood (NLL):
$$\hat{\theta}_{MLE} = \arg\max_\theta \sum_{n=1}^N \log p(y_n \mid \theta) = \arg\min_\theta \text{NLL}(\theta)$$
where $\text{NLL}(\theta) = -\sum_{n=1}^N \log p(y_n \mid \theta)$
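As a sketch of this recipe, the NLL can also be minimized numerically; here a Gaussian likelihood with unknown mean and unit variance is assumed purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up observations, assumed i.i.d. from a unit-variance Gaussian N(theta, 1)
y = np.array([2.1, 1.9, 2.4, 2.2, 1.8])

# NLL(theta) = -sum_n log p(y_n | theta); additive constants in theta dropped
def nll(theta):
    return 0.5 * np.sum((y - theta) ** 2)

res = minimize_scalar(nll)
print(res.x, y.mean())  # the numerical minimizer matches the sample mean
```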
MLE: An Example
Consider a sequence of $N$ coin-toss outcomes (observations)
Each observation $y_n$ is a binary random variable: Head = 1, Tail = 0
Since each $y_n$ is binary, let's use a Bernoulli distribution to model it:
$$p(y_n \mid \theta) = \theta^{y_n} (1 - \theta)^{1 - y_n}$$
Here $\theta$ is the probability of a head. We want to learn $\theta$ using MLE
Log-likelihood: $\sum_{n=1}^N \log p(y_n \mid \theta) = \sum_{n=1}^N y_n \log\theta + (1 - y_n)\log(1 - \theta)$
Taking the derivative of the log-likelihood w.r.t. $\theta$ and setting it to zero gives
$$\hat{\theta}_{MLE} = \frac{\sum_{n=1}^N y_n}{N}$$
$\hat{\theta}_{MLE}$ in this example is simply the fraction of heads!
What can go wrong with this approach (or MLE in general)?
We haven't regularized $\theta$. It can do badly (i.e., overfit) if there are outliers or if we don't have enough data to learn $\theta$ reliably.
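A tiny sketch of this estimator on made-up data, including the small-sample failure mode just described:

```python
import numpy as np

# Made-up coin-toss outcomes: 1 = head, 0 = tail
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
theta_mle = y.mean()   # fraction of heads = sum_n y_n / N
print(theta_mle)       # 0.7

# The failure mode on a tiny sample: three heads in a row
y_small = np.array([1, 1, 1])
print(y_small.mean())  # 1.0 -- the model now claims tails are impossible
```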
Prior Distributions
In probabilistic models, we can specify a prior distribution $p(\theta)$ on the parameters
Combining the prior with the likelihood via Bayes' rule gives the posterior distribution over the parameters:
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$
Now, instead of doing MLE, which maximizes the likelihood, we can find the $\theta$ that maximizes the posterior probability $p(\theta \mid y)$
$$\hat{\theta}_{MAP} = \arg\max_\theta p(\theta \mid y) = \arg\max_\theta \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \arg\max_\theta \Big[ \log p(y \mid \theta) + \log p(\theta) \Big] = \arg\max_\theta \sum_{n=1}^N \log p(y_n \mid \theta) + \log p(\theta)$$
(The denominator $p(y)$ does not depend on $\theta$ and can be dropped; taking the log does not change the maximizer.)
MAP: An Example
Let's again consider the coin-toss problem (estimating the bias of the coin)
Each likelihood term is Bernoulli: $p(y_n \mid \theta) = \theta^{y_n}(1 - \theta)^{1 - y_n}$
Since $\theta \in (0, 1)$, we assume a Beta prior: $\theta \sim \text{Beta}(\alpha, \beta)$
$$p(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
The log posterior probability for the coin-toss model, up to an additive constant:
$$\log p(\theta \mid y) = \sum_{n=1}^N \big[ y_n \log\theta + (1 - y_n)\log(1 - \theta) \big] + (\alpha - 1)\log\theta + (\beta - 1)\log(1 - \theta) + \text{const}$$
Taking the derivative w.r.t. $\theta$ and setting it to zero gives
$$\hat{\theta}_{MAP} = \frac{\sum_{n=1}^N y_n + \alpha - 1}{N + \alpha + \beta - 2}$$
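A sketch comparing MLE with this MAP estimate; the data and the hyperparameters $\alpha = \beta = 5$ are made up (they act like pseudo-counts of heads and tails):

```python
import numpy as np

y = np.array([1, 1, 1])   # made-up data: three heads in a row
alpha, beta = 5.0, 5.0    # assumed Beta hyperparameters (prior mean 0.5)

theta_mle = y.mean()
theta_map = (y.sum() + alpha - 1) / (len(y) + alpha + beta - 2)

print(theta_mle)  # 1.0   -- overfits the tiny sample
print(theta_map)  # ~0.64 -- pulled toward the prior mean 0.5
```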
Probabilistic Linear Regression
Training data: inputs $x_n \in \mathbb{R}^D$, response $y_n \in \mathbb{R}$
Model with weight vector $w \in \mathbb{R}^D$:
$$y_n = w^\top x_n + \epsilon_n$$
Gaussian noise: $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$, where $\sigma^2$ is the variance of the Gaussian noise
Thus each $y_n$ can be thought of as drawn from a Gaussian, as follows:
$$y_n \sim \mathcal{N}(w^\top x_n, \sigma^2)$$
Goal: Learn the weight vector $w$
Let's look at both MLE and MAP estimation for this probabilistic model
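A minimal sketch of sampling data from this generative model; $w$, $\sigma$, and the inputs are all made up for illustration:

```python
import numpy as np

# Simulate data from y_n ~ N(w^T x_n, sigma^2)
rng = np.random.default_rng(1)
N, D = 100, 2
w = np.array([0.5, -1.0])   # assumed true weight vector
sigma = 0.3                 # noise standard deviation

X = rng.normal(size=(N, D))
y = X @ w + sigma * rng.normal(size=N)   # linear mean plus Gaussian noise
```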
Recall the Gaussian distribution with mean $\mu$ and variance $\sigma^2$:
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big( -\frac{(x - \mu)^2}{2\sigma^2} \Big)$$
Mean: $\mathbb{E}[x] = \mu$
Variance: $\text{var}[x] = \sigma^2$
Plugging the Gaussian likelihood $p(y_n \mid w) = \mathcal{N}(y_n \mid w^\top x_n, \sigma^2)$ into the NLL and dropping terms that do not depend on $w$:
$$\text{NLL}(w) \propto \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w^\top x_n)^2$$
Note that the negative log-likelihood (NLL) is similar to the squared loss function
MLE will therefore give the same solution as (unregularized) least squares
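One can check this equivalence numerically; a sketch on simulated data (same made-up model as above):

```python
import numpy as np

# Made-up data from the model y_n ~ N(w^T x_n, sigma^2)
rng = np.random.default_rng(1)
N, D = 100, 2
w_true = np.array([0.5, -1.0])
X = rng.normal(size=(N, D))
y = X @ w_true + 0.3 * rng.normal(size=N)

# MLE under Gaussian noise = ordinary (unregularized) least squares
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle)  # close to w_true
```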
With a Gaussian prior $p(w) = \mathcal{N}(0, \lambda^{-1} I_D)$ on the weights, maximizing the log posterior is equivalent to minimizing
$$\sum_{n=1}^N (y_n - w^\top x_n)^2 + \lambda\sigma^2\, w^\top w$$
MLE solution:
$$\hat{w}_{MLE} = \arg\min_w \sum_{n=1}^N (y_n - w^\top x_n)^2$$
MAP solution:
$$\hat{w}_{MAP} = \arg\min_w \sum_{n=1}^N (y_n - w^\top x_n)^2 + \lambda\sigma^2\, w^\top w$$
This is exactly the regularized least squares (ridge) objective seen earlier, with regularization constant $\lambda\sigma^2$
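A sketch verifying that this MAP objective is the ridge objective from earlier; $\sigma$, $\lambda$, and the data are made-up assumptions:

```python
import numpy as np

# Made-up data (true weights, noise level, and lambda are all assumptions)
rng = np.random.default_rng(2)
N, D = 100, 3
w_true = rng.normal(size=D)
sigma, lam = 0.5, 2.0

X = rng.normal(size=(N, D))
y = X @ w_true + sigma * rng.normal(size=N)

# Minimizing sum_n (y_n - w^T x_n)^2 + lam * sigma^2 * w^T w in closed form
reg = lam * sigma**2
w_map = np.linalg.solve(X.T @ X + reg * np.eye(D), X.T @ y)

w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle)  # unregularized (MLE / least squares) solution
print(w_map)  # shrunk toward zero relative to w_mle
```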
Next Class:
Probabilistic Models for Classification
(Logistic Regression)