
Learning via Probabilistic Modeling of Data

Piyush Rai
Machine Learning (CS771A)
Aug 12, 2016


Quick Recap of Last Lecture


Learning as Optimization
Supervised learning problem with training data {(x_n, y_n)}_{n=1}^N

Goal: Find f : x → y that fits the training data well and is also simple
The function f is learned by solving the following optimization problem

    \hat{f} = \arg\min_f \sum_{n=1}^{N} \ell(y_n, f(x_n)) + \lambda R(f)

The objective is a sum of the empirical training loss and a regularizer term
ℓ(y_n, f(x_n)) denotes the loss function: the error f makes on example (x_n, y_n)
The regularizer R(f) is a measure of the complexity of the function f
This is called Regularized Empirical Risk Minimization
The regularization hyperparameter λ controls the amount of regularization
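A minimal sketch of this regularized empirical risk objective for a linear model, assuming a generic loss(y, prediction) and regularizer R(w); the function names are made up, and the squared loss and ℓ2 penalty shown are just placeholder choices for the general framework.

import numpy as np

def regularized_empirical_risk(w, X, y, loss, R, lam=0.1):
    # sum_n loss(y_n, w^T x_n)  +  lam * R(w)
    data_term = sum(loss(yn, xn @ w) for xn, yn in zip(X, y))
    return data_term + lam * R(w)

def squared_loss(y, pred):        # one possible choice of loss
    return (y - pred) ** 2

def l2_regularizer(w):            # one possible choice of R(w) = ||w||^2
    return w @ w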


`2 Regularized Linear Regression: Ridge Regression


Linear regression model f(x) = w^T x
Loss function: squared loss; regularizer: ℓ2 norm of w

The resulting Ridge Regression problem is solved as

    \hat{w} = \arg\min_w \sum_{n=1}^{N} (y_n - w^\top x_n)^2 + \lambda ||w||^2

A nice, convex objective function with a unique global minimum

Note: λ = 0 gives the ordinary least squares solution (no regularization)
Can take the derivative w.r.t. w, set it to zero, and get a simple, closed-form solution

    \hat{w} = \Big( \sum_{n=1}^{N} x_n x_n^\top + \lambda I_D \Big)^{-1} \sum_{n=1}^{N} y_n x_n = (X^\top X + \lambda I_D)^{-1} X^\top y

Can also use iterative methods (e.g., gradient descent) to optimize the
objective function and solve for w (for better efficiency)
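A minimal sketch of the closed-form ridge solution above, assuming X is an N × D numpy array, y is a length-N vector, and lam is the regularization hyperparameter λ; the function name is made up for illustration.

import numpy as np

def ridge_closed_form(X, y, lam=1.0):
    D = X.shape[1]
    # w = (X^T X + lam * I_D)^{-1} X^T y, computed via a linear solve
    # rather than an explicit matrix inverse for numerical stability
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)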

Ridge Regression: Effect of Regularization


Consider ridge regression on some data with 10 features (thus the weight
vector w has 10 components)


Learning via Probabilistic Modeling


Probabilistic Modeling of Data


Assume the data y = {y_1, y_2, ..., y_N} is generated from a probability model

    y_n \sim p(y | \theta)

Each data point y_n is a random variable drawn from the distribution p(y | θ)

θ denotes the parameters of the probability distribution
Assume the observations to be independently & identically distributed (i.i.d.)

We wish to learn the parameters θ using the data y = {y_1, y_2, ..., y_N}

Almost any learning problem can be formulated like this



Parameter Estimation in Probabilistic Models


Since the data is i.i.d., the probability of observing the data y = {y_1, y_2, ..., y_N} is

    p(y | \theta) = p(y_1, y_2, \ldots, y_N | \theta) = \prod_{n=1}^{N} p(y_n | \theta)

p(y | θ) is also called the likelihood; p(y_n | θ) is the likelihood w.r.t. a single data point
The likelihood will be a function of the parameters θ

How do we estimate the best model parameters θ?

One option: Find the value of θ that makes the observed data most probable
Maximize the likelihood p(y | θ) w.r.t. θ (Maximum Likelihood Estimation)

Maximum Likelihood Estimation (MLE)


When doing MLE, we typically maximize the log-likelihood instead of the likelihood,
which is easier (and doesn't affect the estimation because log is monotonic)

Log-likelihood:

    L(\theta) = \log p(y | \theta) = \log \prod_{n=1}^{N} p(y_n | \theta) = \sum_{n=1}^{N} \log p(y_n | \theta)

Maximum Likelihood Estimation (MLE)

    \theta_{MLE} = \arg\max_\theta L(\theta) = \arg\max_\theta \sum_{n=1}^{N} \log p(y_n | \theta)

Now this becomes an optimization problem w.r.t. θ



Maximum Likelihood Estimation (MLE)


Maximum Likelihood parameter estimation:

    \theta_{MLE} = \arg\max_\theta \sum_{n=1}^{N} \log p(y_n | \theta)

We can also think of it as minimizing the negative log-likelihood (NLL)

    \theta_{MLE} = \arg\min_\theta NLL(\theta), \quad \text{where} \quad NLL(\theta) = -\sum_{n=1}^{N} \log p(y_n | \theta)

We can think of the negative log-likelihood as a loss function


Thus MLE is equivalent to doing empirical risk (loss) minimization
This view relates the optimization and probabilistic modeling approaches
Something is still missing (we will look at that shortly)


MLE: An Example
Consider a sequence of N coin toss outcomes (observations)
Each observation y_n is a binary random variable: Head = 1, Tail = 0
Since each y_n is binary, let's use a Bernoulli distribution to model it

    p(y_n | \theta) = \theta^{y_n} (1 - \theta)^{1 - y_n}

Here θ is the probability of a head. We want to learn θ using MLE

Log-likelihood:

    \sum_{n=1}^{N} \log p(y_n | \theta) = \sum_{n=1}^{N} \{ y_n \log \theta + (1 - y_n) \log(1 - \theta) \}

Taking the derivative of the log-likelihood w.r.t. θ and setting it to zero gives

    \theta_{MLE} = \frac{\sum_{n=1}^{N} y_n}{N}

θ_MLE in this example is simply the fraction of heads!
What can go wrong with this approach (or MLE in general)?
We haven't regularized θ. It can do badly (i.e., overfit) if there are outliers or
if we don't have enough data to learn θ reliably.
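A minimal sketch of this MLE computation, assuming tosses is a list or array of 0/1 outcomes (1 = head); the function name is made up for illustration.

import numpy as np

def bernoulli_mle(tosses):
    tosses = np.asarray(tosses)
    # theta_MLE = (number of heads) / (number of tosses)
    return tosses.mean()

print(bernoulli_mle([1, 1, 0, 1, 1, 0, 1, 1, 0, 1]))   # 7 heads out of 10 tosses -> 0.7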

Prior Distributions
In probabilistic models, we can specify a prior distribution p(θ) on the parameters θ

The prior distribution plays two key roles

The prior helps us specify that some values of θ are more likely than others
The prior also works as a regularizer for θ (we will see this soon)

Note: A uniform prior distribution is the same as using no prior!




Using a Prior in Parameter Estimation


We can combine the prior p(θ) with the likelihood p(y | θ) using Bayes rule
and compute the posterior distribution over the parameters
The posterior distribution is given by

    p(\theta | y) = \frac{p(y | \theta)\, p(\theta)}{p(y)}

Now, instead of doing MLE which maximizes the likelihood, we can find the θ
that maximizes the posterior probability p(θ | y)

    \theta_{MAP} = \arg\max_\theta \, p(\theta | y)


Maximum-a-Posteriori (MAP) Estimation


We will work with the log posterior probability (it is easier)
    \theta_{MAP} = \arg\max_\theta \, p(\theta | y) = \arg\max_\theta \, \log p(\theta | y)
                 = \arg\max_\theta \, \log \frac{p(y | \theta)\, p(\theta)}{p(y)}
                 = \arg\max_\theta \, \log p(y | \theta) + \log p(\theta)

    \theta_{MAP} = \arg\max_\theta \, \sum_{n=1}^{N} \log p(y_n | \theta) + \log p(\theta)

Same as MLE with an extra log-prior-distribution term (acts as a regularizer)


Can also write the same as the following (equivalent) minimization problem
    \theta_{MAP} = \arg\min_\theta \, NLL(\theta) - \log p(\theta)

When p() is a uniform prior, MAP reduces to MLE



MAP: An Example
Let's again consider the coin-toss problem (estimating the bias of the coin)
Each likelihood term is Bernoulli: p(y_n | θ) = θ^{y_n} (1 - θ)^{1 - y_n}
Since θ ∈ (0, 1), we assume a Beta prior: θ ∼ Beta(α, β)

    p(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}

α, β are called hyperparameters of the prior. Note: Γ is the gamma function.

For the Beta prior, using α = β = 1 corresponds to using a uniform prior distribution



MAP: An Example
The log posterior probability for the coin-toss model:

    \sum_{n=1}^{N} \log p(y_n | \theta) + \log p(\theta)

Ignoring the constants w.r.t. θ, the log posterior probability simplifies to

    \sum_{n=1}^{N} \{ y_n \log \theta + (1 - y_n) \log(1 - \theta) \} + (\alpha - 1) \log \theta + (\beta - 1) \log(1 - \theta)

Taking the derivative w.r.t. θ and setting it to zero gives

    \theta_{MAP} = \frac{\sum_{n=1}^{N} y_n + \alpha - 1}{N + \alpha + \beta - 2}

Note: For α = 1, β = 1, i.e., p(θ) = Beta(1, 1) (which is equivalent to a
uniform prior, hence no regularizer), we get the same solution as MLE
Note: Hyperparameters of a prior distribution usually have an intuitive meaning.
E.g., in the coin-toss example, α - 1 and β - 1 act like pseudo-observations: the expected
numbers of heads and tails, respectively, before tossing the coin
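A minimal sketch of this MAP estimate, assuming tosses is an array of 0/1 outcomes and (alpha, beta) are the Beta prior hyperparameters; setting alpha = beta = 1 recovers the MLE. The function name and default hyperparameter values are made up for illustration.

import numpy as np

def bernoulli_map(tosses, alpha=2.0, beta=2.0):
    tosses = np.asarray(tosses)
    N = len(tosses)
    # theta_MAP = (sum_n y_n + alpha - 1) / (N + alpha + beta - 2)
    return (tosses.sum() + alpha - 1) / (N + alpha + beta - 2)

print(bernoulli_map([1, 1, 0, 1, 1, 0, 1, 1, 0, 1]))   # Beta(2, 2) prior: (7 + 1) / (10 + 2) = 2/3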

Probabilistic Linear Regression


Linear Regression: A Probabilistic View


Given: N training examples {(x_n, y_n)}_{n=1}^N, features x_n ∈ R^D, response y_n ∈ R

Probabilistic view: the responses y_n are generated from a probabilistic model

Assume a noisy linear model with regression weight vector w ∈ R^D:

    y_n = w^\top x_n + \epsilon_n

Gaussian noise: ε_n ∼ N(0, σ²), where σ² is the variance of the Gaussian noise
Thus each y_n can be thought of as drawn from a Gaussian, as follows

    y_n \sim \mathcal{N}(w^\top x_n, \sigma^2)

Goal: Learn the weight vector w (note: σ² is assumed known but can also be learned)

Let's look at both MLE and MAP estimation for this probabilistic model
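A minimal sketch of this generative assumption in code; N, D, the "true" weights, and sigma below are made-up illustration values, not anything from the lecture.

import numpy as np

rng = np.random.default_rng(0)
N, D, sigma = 100, 3, 0.5
w_true = rng.normal(size=D)                    # "true" regression weights
X = rng.normal(size=(N, D))                    # features, assumed given/fixed
y = X @ w_true + sigma * rng.normal(size=N)    # y_n ~ N(w^T x_n, sigma^2)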


Gaussian Distribution: Brief Review


Univariate Gaussian Distribution


Distribution over real-valued scalar r.v. x
Defined by a scalar mean μ and a scalar variance σ²
The distribution is defined as

    \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

Mean: E[x] = μ
Variance: var[x] = σ²
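A minimal sketch evaluating this density directly from the formula above; the function name and the input value are arbitrary.

import numpy as np

def gaussian_pdf(x, mu=0.0, var=1.0):
    # N(x; mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(gaussian_pdf(0.5))   # density of a standard normal at x = 0.5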


Multivariate Gaussian Distribution


Distribution over a multivariate r.v. x ∈ R^D of real numbers
Defined by a mean vector μ ∈ R^D and a D × D covariance matrix Σ

    \mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)

The covariance matrix Σ must be symmetric and positive definite

All eigenvalues of Σ are positive
z^T Σ z > 0 for any nonzero real vector z
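A minimal sketch evaluating this density with scipy.stats.multivariate_normal; mu, Sigma, and x below are made-up values chosen only so that Sigma is symmetric and positive definite.

import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # symmetric, positive definite
x = np.array([0.5, -1.0])
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))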


MLE for Probabilistic Linear Regression


Assuming Gaussian distributed responses y_n, we have

    p(y_n | x_n, w) = \mathcal{N}(w^\top x_n, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_n - w^\top x_n)^2}{2\sigma^2} \right)

Thus the likelihood (assuming i.i.d. responses), or probability of the data, is

    p(y | X, w) = \prod_{n=1}^{N} p(y_n | x_n, w) = \left( \frac{1}{2\pi\sigma^2} \right)^{N/2} \exp\left\{ -\sum_{n=1}^{N} \frac{(y_n - w^\top x_n)^2}{2\sigma^2} \right\}

Note: the x_n (features) are assumed given/fixed. We are only modeling the responses y_n

Log-likelihood (ignoring constants w.r.t. w):

    \log p(y | X, w) \propto -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2

Note that the negative log-likelihood (NLL) is similar to the squared loss function
MLE will give the same solution as the (unregularized) least squares problem
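A minimal sketch making that connection concrete: under the Gaussian noise model, maximizing the likelihood amounts to solving ordinary least squares. X (an N × D array) and y (a length-N vector) are assumed given; the function name is made up for illustration.

import numpy as np

def linreg_mle(X, y):
    # arg min_w sum_n (y_n - w^T x_n)^2, i.e., ordinary least squares
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w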

MAP Estimation for Probabilistic Linear Regression


We want to regularize our model, so we will use a prior distribution on the
weight vector w. We will use a multivariate Gaussian prior with zero mean

    p(w) = \mathcal{N}(0, \gamma^2 I_D) \propto \exp\left( -\frac{w^\top w}{2\gamma^2} \right)

The log-likelihood, as we have already seen, is given by

    \log p(y | X, w) \propto -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2

The MAP objective (log-posterior) will be the log-likelihood + log p(w)

    -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2 - \frac{w^\top w}{2\gamma^2}

Maximizing this is equivalent to minimizing the following w.r.t. w

    \hat{w}_{MAP} = \arg\min_w \sum_{n=1}^{N} (y_n - w^\top x_n)^2 + \frac{\sigma^2}{\gamma^2} w^\top w

Note: Assuming λ = σ²/γ² (the regularization hyperparameter), this is equivalent to the
Ridge Regression problem
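A minimal sketch of this MAP estimate as ridge regression with lam = σ²/γ²; sigma and gamma below are assumed/known values used only for illustration, the function name is made up, and γ² denotes the prior variance as in the reconstruction above.

import numpy as np

def linreg_map(X, y, sigma=0.5, gamma=1.0):
    D = X.shape[1]
    lam = sigma**2 / gamma**2          # regularization strength implied by the prior
    # w_MAP = (X^T X + lam * I_D)^{-1} X^T y  (the ridge regression solution)
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)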

MLE vs MAP Estimation: An Illustration

w_MAP is a compromise between the prior's mean and w_MLE

In this case, doing MAP shrinks the estimate of w towards the prior's mean


MLE vs MAP: Summary


MLE solution:

    \hat{w}_{MLE} = \arg\min_w \sum_{n=1}^{N} (y_n - w^\top x_n)^2

MAP solution:

    \hat{w}_{MAP} = \arg\min_w \sum_{n=1}^{N} (y_n - w^\top x_n)^2 + \frac{\sigma^2}{\gamma^2} w^\top w

Some Take-home messages:


MLE estimation of a parameter leads to unregularized solutions
MAP estimation of a parameter leads to regularized solutions
A Gaussian likelihood model corresponds to using squared loss
A Gaussian prior on the parameters acts as an ℓ2 regularizer
Other likelihoods/priors can be chosen. E.g., using a Laplace likelihood model
can give more robustness to outliers than a Gaussian likelihood; a Laplace prior
distribution ∝ exp(-C ||w||_1) on w would regularize the ℓ1 norm of w

Probabilistic Modeling: Summary


A flexible way to model data by specifying a proper probabilistic model
Can choose likelihoods and priors based on the nature/property of data
Allows us to do Bayesian learning
Allows learning the full distribution of the parameters (note that MLE/MAP
only give a single best answer as a point estimate of the parameters)
Allows getting an estimate of the confidence in the model's prediction (useful for
doing Active Learning)
Allows learning the size/complexity of the model from data (no tuning)
Allows learning the hyperparameters from data (no tuning)
Allows learning in the presence of missing data
.. and many other benefits

MLE/MAP estimation is also related to the optimization view of ML



Next Class:
Probabilistic Models for Classification
(Logistic Regression)

