
Learning via Probabilistic Modeling of Data

Piyush Rai
Machine Learning (CS771A)
Aug 12, 2016


Quick Recap of Last Lecture


Learning as Optimization
Supervised learning problem with training data {(x_n, y_n)}_{n=1}^N

Goal: Find f : x → y that fits the training data well and is also simple
The function f is learned by solving the following optimization problem

    \hat{f} = \arg\min_f \sum_{n=1}^{N} \ell(y_n, f(x_n)) + \lambda R(f)

The objective is a sum of the empirical training loss and a regularizer term
ℓ(y_n, f(x_n)) denotes the loss function: the error f makes on example (x_n, y_n)
The regularizer R(f) is a measure of the complexity of the function f
This is called Regularized Empirical Risk Minimization
The regularization hyperparameter λ controls the amount of regularization
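A minimal sketch of this regularized empirical risk objective for a linear model, assuming a generic loss(y, prediction) and regularizer R(w); the function names are made up, and the squared loss and ℓ2 penalty shown are just placeholder choices for the general framework.

import numpy as np

def regularized_empirical_risk(w, X, y, loss, R, lam=0.1):
    # sum_n loss(y_n, w^T x_n)  +  lam * R(w)
    data_term = sum(loss(yn, xn @ w) for xn, yn in zip(X, y))
    return data_term + lam * R(w)

def squared_loss(y, pred):        # one possible choice of loss
    return (y - pred) ** 2

def l2_regularizer(w):            # one possible choice of R(w) = ||w||^2
    return w @ w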


`2 Regularized Linear Regression: Ridge Regression


Linear regression model f(x) = w^T x
Loss function: squared loss; regularizer: ℓ2 norm of w

The resulting Ridge Regression problem is solved as

    \hat{w} = \arg\min_w \sum_{n=1}^{N} (y_n - w^\top x_n)^2 + \lambda ||w||^2

A nice, convex objective function with a unique global minimum

Note: λ = 0 gives the ordinary least squares solution (no regularization)
Can take the derivative w.r.t. w, set it to zero, and get a simple, closed-form solution

    \hat{w} = \Big( \sum_{n=1}^{N} x_n x_n^\top + \lambda I_D \Big)^{-1} \sum_{n=1}^{N} y_n x_n = (X^\top X + \lambda I_D)^{-1} X^\top y

Can also use iterative methods (e.g., gradient descent) to optimize the
objective function and solve for w (for better efficiency)
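A minimal sketch of the closed-form ridge solution above, assuming X is an N × D numpy array, y is a length-N vector, and lam is the regularization hyperparameter λ; the function name is made up for illustration.

import numpy as np

def ridge_closed_form(X, y, lam=1.0):
    D = X.shape[1]
    # w = (X^T X + lam * I_D)^{-1} X^T y, computed via a linear solve
    # rather than an explicit matrix inverse for numerical stability
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)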

Ridge Regression: Effect of Regularization


Consider ridge regression on some data with 10 features (thus the weight
vector w has 10 components)


Learning via Probabilistic Modeling


Probabilistic Modeling of Data


Assume the data y = {y_1, y_2, ..., y_N} is generated from a probability model

    y_n \sim p(y | \theta)

Each data point y_n is a random variable drawn from the distribution p(y | θ)

θ denotes the parameters of the probability distribution
Assume the observations to be independently & identically distributed (i.i.d.)

We wish to learn the parameters θ using the data y = {y_1, y_2, ..., y_N}

Almost any learning problem can be formulated like this



Parameter Estimation in Probabilistic Models


Since the data is i.i.d., the probability of observing the data y = {y_1, y_2, ..., y_N} is

    p(y | \theta) = p(y_1, y_2, \ldots, y_N | \theta) = \prod_{n=1}^{N} p(y_n | \theta)

p(y | θ) is also called the likelihood; p(y_n | θ) is the likelihood w.r.t. a single data point
The likelihood will be a function of the parameters θ

How do we estimate the best model parameters θ?

One option: Find the value of θ that makes the observed data most probable
Maximize the likelihood p(y | θ) w.r.t. θ (Maximum Likelihood Estimation)

Maximum Likelihood Estimation (MLE)


When doing MLE, we typically maximize the log-likelihood instead of the likelihood,
which is easier (and doesn't affect the estimation because log is monotonic)

Log-likelihood:

    L(\theta) = \log p(y | \theta) = \log \prod_{n=1}^{N} p(y_n | \theta) = \sum_{n=1}^{N} \log p(y_n | \theta)

Maximum Likelihood Estimation (MLE)

    \theta_{MLE} = \arg\max_\theta L(\theta) = \arg\max_\theta \sum_{n=1}^{N} \log p(y_n | \theta)

Now this becomes an optimization problem w.r.t. θ



Maximum Likelihood Estimation (MLE)


Maximum Likelihood parameter estimation:

    \theta_{MLE} = \arg\max_\theta \sum_{n=1}^{N} \log p(y_n | \theta)

We can also think of it as minimizing the negative log-likelihood (NLL)

    \theta_{MLE} = \arg\min_\theta NLL(\theta), \quad \text{where} \quad NLL(\theta) = -\sum_{n=1}^{N} \log p(y_n | \theta)

We can think of the negative log-likelihood as a loss function


Thus MLE is equivalent to doing empirical risk (loss) minimization
This view relates the optimization and probabilistic modeling approaches
Something is still missing (we will look at that shortly)


MLE: An Example
Consider a sequence of N coin toss outcomes (observations)
Each observation y_n is a binary random variable: Head = 1, Tail = 0
Since each y_n is binary, let's use a Bernoulli distribution to model it

    p(y_n | \theta) = \theta^{y_n} (1 - \theta)^{1 - y_n}

Here θ is the probability of a head. We want to learn θ using MLE

Log-likelihood:

    \sum_{n=1}^{N} \log p(y_n | \theta) = \sum_{n=1}^{N} \{ y_n \log \theta + (1 - y_n) \log(1 - \theta) \}

Taking the derivative of the log-likelihood w.r.t. θ and setting it to zero gives

    \theta_{MLE} = \frac{\sum_{n=1}^{N} y_n}{N}

θ_MLE in this example is simply the fraction of heads!
What can go wrong with this approach (or MLE in general)?
We haven't regularized θ. It can do badly (i.e., overfit) if there are outliers or
if we don't have enough data to learn θ reliably.
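A minimal sketch of this MLE computation, assuming tosses is a list or array of 0/1 outcomes (1 = head); the function name is made up for illustration.

import numpy as np

def bernoulli_mle(tosses):
    tosses = np.asarray(tosses)
    # theta_MLE = (number of heads) / (number of tosses)
    return tosses.mean()

print(bernoulli_mle([1, 1, 0, 1, 1, 0, 1, 1, 0, 1]))   # 7 heads out of 10 tosses -> 0.7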

Prior Distributions
In probabilistic models, we can specify a prior distribution p(θ) on the parameters θ

The prior distribution plays two key roles

The prior helps us specify that some values of θ are more likely than others
The prior also works as a regularizer for θ (we will see this soon)

Note: A uniform prior distribution is the same as using no prior!




Using a Prior in Parameter Estimation


We can combine the prior p(θ) with the likelihood p(y | θ) using Bayes rule
and compute the posterior distribution over the parameters
The posterior distribution is given by

    p(\theta | y) = \frac{p(y | \theta)\, p(\theta)}{p(y)}

Now, instead of doing MLE which maximizes the likelihood, we can find the θ
that maximizes the posterior probability p(θ | y)

    \theta_{MAP} = \arg\max_\theta \, p(\theta | y)


Maximum-a-Posteriori (MAP) Estimation


We will work with the log posterior probability (it is easier)
    \theta_{MAP} = \arg\max_\theta \, p(\theta | y) = \arg\max_\theta \, \log p(\theta | y)
                 = \arg\max_\theta \, \log \frac{p(y | \theta)\, p(\theta)}{p(y)}
                 = \arg\max_\theta \, \log p(y | \theta) + \log p(\theta)

    \theta_{MAP} = \arg\max_\theta \, \sum_{n=1}^{N} \log p(y_n | \theta) + \log p(\theta)

Same as MLE with an extra log-prior-distribution term (acts as a regularizer)


Can also write the same as the following (equivalent) minimization problem
    \theta_{MAP} = \arg\min_\theta \, NLL(\theta) - \log p(\theta)

When p() is a uniform prior, MAP reduces to MLE



MAP: An Example
Let's again consider the coin-toss problem (estimating the bias of the coin)
Each likelihood term is Bernoulli: p(y_n | θ) = θ^{y_n} (1 - θ)^{1 - y_n}
Since θ ∈ (0, 1), we assume a Beta prior: θ ∼ Beta(α, β)

    p(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}

α, β are called hyperparameters of the prior. Note: Γ is the gamma function.

For the Beta prior, using α = β = 1 corresponds to using a uniform prior distribution



MAP: An Example
The log posterior probability for the coin-toss model:

    \sum_{n=1}^{N} \log p(y_n | \theta) + \log p(\theta)

Ignoring the constants w.r.t. θ, the log posterior probability simplifies to

    \sum_{n=1}^{N} \{ y_n \log \theta + (1 - y_n) \log(1 - \theta) \} + (\alpha - 1) \log \theta + (\beta - 1) \log(1 - \theta)

Taking the derivative w.r.t. θ and setting it to zero gives

    \theta_{MAP} = \frac{\sum_{n=1}^{N} y_n + \alpha - 1}{N + \alpha + \beta - 2}

Note: For α = 1, β = 1, i.e., p(θ) = Beta(1, 1) (which is equivalent to a
uniform prior, hence no regularizer), we get the same solution as MLE
Note: Hyperparameters of a prior distribution usually have an intuitive meaning.
E.g., in the coin-toss example, α - 1 and β - 1 act like pseudo-observations: the expected
numbers of heads and tails, respectively, before tossing the coin
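A minimal sketch of this MAP estimate, assuming tosses is an array of 0/1 outcomes and (alpha, beta) are the Beta prior hyperparameters; setting alpha = beta = 1 recovers the MLE. The function name and default hyperparameter values are made up for illustration.

import numpy as np

def bernoulli_map(tosses, alpha=2.0, beta=2.0):
    tosses = np.asarray(tosses)
    N = len(tosses)
    # theta_MAP = (sum_n y_n + alpha - 1) / (N + alpha + beta - 2)
    return (tosses.sum() + alpha - 1) / (N + alpha + beta - 2)

print(bernoulli_map([1, 1, 0, 1, 1, 0, 1, 1, 0, 1]))   # Beta(2, 2) prior: (7 + 1) / (10 + 2) = 2/3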

Probabilistic Linear Regression


Linear Regression: A Probabilistic View


Given: N training examples {(x_n, y_n)}_{n=1}^N, features x_n ∈ R^D, response y_n ∈ R

Probabilistic view: the responses y_n are generated from a probabilistic model

Assume a noisy linear model with regression weight vector w ∈ R^D:

    y_n = w^\top x_n + \epsilon_n

Gaussian noise: ε_n ∼ N(0, σ²), where σ² is the variance of the Gaussian noise
Thus each y_n can be thought of as drawn from a Gaussian, as follows

    y_n \sim \mathcal{N}(w^\top x_n, \sigma^2)

Goal: Learn the weight vector w (note: σ² is assumed known but can also be learned)

Let's look at both MLE and MAP estimation for this probabilistic model
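A minimal sketch of this generative assumption in code; N, D, the "true" weights, and sigma below are made-up illustration values, not anything from the lecture.

import numpy as np

rng = np.random.default_rng(0)
N, D, sigma = 100, 3, 0.5
w_true = rng.normal(size=D)                    # "true" regression weights
X = rng.normal(size=(N, D))                    # features, assumed given/fixed
y = X @ w_true + sigma * rng.normal(size=N)    # y_n ~ N(w^T x_n, sigma^2)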


Gaussian Distribution: Brief Review


Univariate Gaussian Distribution


Distribution over real-valued scalar r.v. x
Defined by a scalar mean μ and a scalar variance σ²
The distribution is defined as

    \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

Mean: E[x] = μ
Variance: var[x] = σ²
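A minimal sketch evaluating this density directly from the formula above; the function name and the input value are arbitrary.

import numpy as np

def gaussian_pdf(x, mu=0.0, var=1.0):
    # N(x; mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(gaussian_pdf(0.5))   # density of a standard normal at x = 0.5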


Multivariate Gaussian Distribution


Distribution over a multivariate r.v. x ∈ R^D of real numbers
Defined by a mean vector μ ∈ R^D and a D × D covariance matrix Σ

    \mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)

The covariance matrix Σ must be symmetric and positive definite

All eigenvalues of Σ are positive
z^T Σ z > 0 for any nonzero real vector z
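A minimal sketch evaluating this density with scipy.stats.multivariate_normal; mu, Sigma, and x below are made-up values chosen only so that Sigma is symmetric and positive definite.

import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # symmetric, positive definite
x = np.array([0.5, -1.0])
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))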


MLE for Probabilistic Linear Regression


Assuming Gaussian distributed responses y_n, we have

    p(y_n | x_n, w) = \mathcal{N}(w^\top x_n, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_n - w^\top x_n)^2}{2\sigma^2} \right)

Thus the likelihood (assuming i.i.d. responses), or probability of the data, is

    p(y | X, w) = \prod_{n=1}^{N} p(y_n | x_n, w) = \left( \frac{1}{2\pi\sigma^2} \right)^{N/2} \exp\left\{ -\sum_{n=1}^{N} \frac{(y_n - w^\top x_n)^2}{2\sigma^2} \right\}

Note: the x_n (features) are assumed given/fixed. We are only modeling the responses y_n

Log-likelihood (ignoring constants w.r.t. w):

    \log p(y | X, w) \propto -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2

Note that the negative log-likelihood (NLL) is similar to the squared loss function
MLE will give the same solution as the (unregularized) least squares problem
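A minimal sketch making that connection concrete: under the Gaussian noise model, maximizing the likelihood amounts to solving ordinary least squares. X (an N × D array) and y (a length-N vector) are assumed given; the function name is made up for illustration.

import numpy as np

def linreg_mle(X, y):
    # arg min_w sum_n (y_n - w^T x_n)^2, i.e., ordinary least squares
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w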

MAP Estimation for Probabilistic Linear Regression


We want to regularize our model, so we will use a prior distribution on the
weight vector w. We will use a multivariate Gaussian prior with zero mean

    p(w) = \mathcal{N}(0, \gamma^2 I_D) \propto \exp\left( -\frac{w^\top w}{2\gamma^2} \right)

The log-likelihood, as we have already seen, is given by

    \log p(y | X, w) \propto -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2

The MAP objective (log-posterior) will be the log-likelihood + log p(w)

    -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2 - \frac{w^\top w}{2\gamma^2}

Maximizing this is equivalent to minimizing the following w.r.t. w

    \hat{w}_{MAP} = \arg\min_w \sum_{n=1}^{N} (y_n - w^\top x_n)^2 + \frac{\sigma^2}{\gamma^2} w^\top w

Note: Assuming λ = σ²/γ² (the regularization hyperparameter), this is equivalent to the
Ridge Regression problem
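A minimal sketch of this MAP estimate as ridge regression with lam = σ²/γ²; sigma and gamma below are assumed/known values used only for illustration, the function name is made up, and γ² denotes the prior variance as in the reconstruction above.

import numpy as np

def linreg_map(X, y, sigma=0.5, gamma=1.0):
    D = X.shape[1]
    lam = sigma**2 / gamma**2          # regularization strength implied by the prior
    # w_MAP = (X^T X + lam * I_D)^{-1} X^T y  (the ridge regression solution)
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)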

MLE vs MAP Estimation: An Illustration

w_MAP is a compromise between the prior's mean and w_MLE

In this case, doing MAP shrinks the estimate of w towards the prior's mean


MLE vs MAP: Summary


MLE solution:

    \hat{w}_{MLE} = \arg\min_w \sum_{n=1}^{N} (y_n - w^\top x_n)^2

MAP solution:

    \hat{w}_{MAP} = \arg\min_w \sum_{n=1}^{N} (y_n - w^\top x_n)^2 + \frac{\sigma^2}{\gamma^2} w^\top w

Some Take-home messages:


MLE estimation of a parameter leads to unregularized solutions
MAP estimation of a parameter leads to regularized solutions
A Gaussian likelihood model corresponds to using squared loss
A Gaussian prior on the parameters acts as an ℓ2 regularizer
Other likelihoods/priors can be chosen. E.g., using a Laplace likelihood model
can give more robustness to outliers than a Gaussian likelihood; a Laplace prior
distribution ∝ exp(-C ||w||_1) on w would regularize the ℓ1 norm of w

Probabilistic Modeling: Summary


A flexible way to model data by specifying a proper probabilistic model
Can choose likelihoods and priors based on the nature/property of data
Allows us to do Bayesian learning
Allows learning the full distribution of the parameters (note that MLE/MAP
only give a single best answer as a point estimate of the parameters)
Allows getting an estimate of the confidence in the model's prediction (useful for
doing Active Learning)
Allows learning the size/complexity of the model from data (no tuning)
Allows learning the hyperparameters from data (no tuning)
Allows learning in the presence of missing data
.. and many other benefits

MLE/MAP estimation is also related to the optimization view of ML



Next Class:
Probabilistic Models for Classification
(Logistic Regression)

