
Machine Learning


Approximate Inference
Sargur Srihari
srihari@cedar.buffalo.edu


Plan of Discussion

Need for Approximation


Types of Approximation
Functionals and Variational Methods
Minimizing K-L divergence
Factorized approximations
Examples


Posterior Distribution in Bayesian Methods


The problem of computing the posterior is a special case of the variational inference problem

$p(Z \mid X, \theta) = \dfrac{p(Z, X \mid \theta)}{\sum_Z p(Z, X \mid \theta)}$

We cannot compute the posterior for many interesting models

Consider the Bayesian mixture of Gaussians:

1. Draw $\mu_k \sim \mathcal{N}(0, \sigma^2)$ for $k = 1, \ldots, K$
2. For $i = 1, \ldots, n$:
   (a) Draw $z_i \sim \mathrm{Mult}(\pi)$
   (b) Draw $x_i \sim \mathcal{N}(\mu_{z_i}, \sigma^2)$

Suppressing the fixed parameters, the posterior distribution is

$p(\mu_{1:K}, z_{1:n} \mid x_{1:n}) = \dfrac{\prod_{k=1}^{K} p(\mu_k)\, \prod_{i=1}^{n} p(z_i)\, p(x_i \mid z_i, \mu_{1:K})}{\int_{\mu_{1:K}} \sum_{z_{1:n}} \prod_{k=1}^{K} p(\mu_k)\, \prod_{i=1}^{n} p(z_i)\, p(x_i \mid z_i, \mu_{1:K})}$

The numerator is easy to compute for any configuration of hidden variables


The problem is the denominator.
The integral is not easy to compute
The summation has $K^n$ terms, which is intractable
Situation arises in most interesting problems
Approximate posterior inference is one of the central problems of Bayesian statistics
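The following sketch (not from the slides; toy values) makes the denominator concrete by brute-force summing over all $K^n$ assignments of $z_{1:n}$ for one fixed draw of the component means, using numpy and scipy:

```python
# Minimal sketch: brute-force evaluation of the denominator sum over z_{1:n}
# for a tiny Bayesian mixture of Gaussians, to show why the K^n terms become
# intractable as n grows. (Conditioned on one draw of mu; the full denominator
# also integrates over mu.)
import itertools
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
K, n, sigma = 3, 8, 1.0
mu = rng.normal(0.0, 2.0, size=K)                    # one draw of the component means
x = rng.normal(mu[rng.integers(K, size=n)], sigma)   # simulated data
pi = np.full(K, 1.0 / K)                             # uniform mixing proportions

total = 0.0
for z in itertools.product(range(K), repeat=n):      # K**n configurations
    total += np.prod(pi[list(z)] * norm.pdf(x, mu[list(z)], sigma))

print(f"{K**n} terms summed; p(x | mu) = {total:.3e}")
# Already at n = 50, K = 3 this loop would need ~7e23 iterations.
```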


Central Task in Using Probabilistic Models with Latent Variables

In unsupervised learning there are no labelled samples of the latent variables

1. Evaluation of posterior distribution p(Z|X)


Where Z=latent variables, X=observed data variables
GMM: Z=latent subclasses (z0,..zK), X=observed data
HMM: Z=latent variables (z0,..zK), X=observed data

2. Evaluation of expectations wrt p(Z|X)

E.g., in EM for maximum-likelihood estimation of the parameters of p(X,Z):
evaluate the expectation of the complete-data log-likelihood ln p(X,Z) wrt the posterior p(Z|X)


Need for Approximation for EM


It is often infeasible to evaluate posterior distributions, or expectations with respect to them, because of:
High dimensionality of the latent space
Complex and intractable forms of the expectations

In the case of GMMs we get the expressions

Posterior:

$p(Z \mid X, \pi, \mu, \Sigma) \propto \prod_{n=1}^{N}\prod_{k=1}^{K} \left[\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right]^{z_{nk}}$

Expectation, which is evaluated and maximized using EM:

$\mathbb{E}_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] = \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})\left\{\ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right\}$
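As an illustration (a minimal sketch, not the slides' code; the toy data and parameter values are assumptions), the responsibilities $\gamma(z_{nk})$ and the expected complete-data log-likelihood above can be computed as:

```python
# Minimal sketch: responsibilities gamma(z_nk) and the expected complete-data
# log-likelihood E_Z[ln p(X,Z)] for a GMM, as used in the E-step of EM.
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    # unnormalized log responsibilities: ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_r = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                      for k in range(len(pi))], axis=1)          # shape (N, K)
    log_r -= log_r.max(axis=1, keepdims=True)                    # numerical stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

def expected_complete_data_loglik(X, pi, mus, Sigmas, gamma):
    ll = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                   for k in range(len(pi))], axis=1)
    return np.sum(gamma * ll)

# toy usage with made-up parameters
X = np.random.default_rng(1).normal(size=(100, 2))
pi = np.array([0.5, 0.5]); mus = [np.zeros(2), np.ones(2)]; Sigmas = [np.eye(2)] * 2
g = responsibilities(X, pi, mus, Sigmas)
print(expected_complete_data_loglik(X, pi, mus, Sigmas, g))
```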


Difficulties with Posteriors and Expectations


For continuous variables
Required integrations have no closed form solutions
Dimensionality of space and integrand prohibit
numerical integration

For discrete variables


Summation in marginalization: exponential no. of
states


Types of Approximations
1. Stochastic
Markov chain Monte Carlo
Have allowed use of Bayesian methods across many domains

Computationally demanding
Can generate exact results given unlimited computational resources

2. Deterministic

Variational Inference (or Variational Bayes)


Based on analytical approximations to posterior
e.g., particular factorization or specific parametric form such as
Gaussian

Scale well in large applications


Can never generate exact results


Example of an Inference Problem


Observed Variables
X = {x1,.., xN} N i.i.d. data

Latent Variables and Parameters


Z = {z1,..,zN}

Model for joint distribution p(X,Z) is specified


Goal is to find approximation for posterior
distribution p(Z|X) as well as for p(X)


Idea behind Variational Methods


Pick a family of distributions over the latent variables, with its own variational parameters: q(Z | ν)

Then, find the setting of the parameters that makes q close to the posterior p(Z|X)
Use q with the fitted parameters as a proxy for the posterior
e.g., to make predictions about future data, or to reason about the posterior of the hidden variables
Typically the true posterior is not in the variational family


Role of KL Divergence
Measure closeness of distributions q(Z)
and p(Z|X) using KL divergence
$\mathrm{KL}(q \parallel p) = \mathbb{E}_q\!\left[\ln \frac{q(Z)}{p(Z \mid X)}\right] = \int q(Z)\, \ln \frac{q(Z)}{p(Z \mid X)}\, dZ$

Minimizing KL{q||p} is equivalent to maximizing


L(q) as shown next


Plan of Attack in finding q

$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \parallel p)$   (decomposition of the log marginal probability)

where

$\mathcal{L}(q) = \int q(Z)\, \ln\!\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ$   is the functional we wish to maximize

and

$\mathrm{KL}(q \parallel p) = -\int q(Z)\, \ln\!\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ$   is the Kullback-Leibler divergence between the proposed q and p (the desired posterior of interest), to be minimized

Also applicable to discrete distributions by replacing integrations with summations

Observations on optimization

Lower bound on ln p(X) is L(q)


Maximizing lower bound L(q) wrt distribution q(Z) is equivalent to
minimizing KL Divergence
When KL divergence vanishes q(Z) equals the posterior p(Z|X)

Plan:
We seek that distribution q(Z) for which L(q) is largest
Since true posterior is intractable we consider restricted family for q(Z)
Seek member of this family for which KL divergence is minimized
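A quick numerical check of the decomposition $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \parallel p)$ on a toy discrete latent variable (a sketch, not from the slides):

```python
# Minimal sketch: verify numerically that ln p(X) = L(q) + KL(q||p)
# for an arbitrary q(Z) over a discrete latent variable Z.
import numpy as np

rng = np.random.default_rng(0)
K = 5                                   # number of discrete latent states
joint = rng.random(K)                   # unnormalized p(X, Z=k) for the observed X
p_x = joint.sum()                       # p(X) = sum_Z p(X, Z)
posterior = joint / p_x                 # p(Z | X)

q = rng.random(K); q /= q.sum()         # arbitrary variational distribution q(Z)

L_q = np.sum(q * np.log(joint / q))     # L(q)      = sum_Z q ln{p(X,Z)/q}
kl = np.sum(q * np.log(q / posterior))  # KL(q||p)  = sum_Z q ln{q/p(Z|X)}

print(np.log(p_x), L_q + kl)            # identical up to floating-point error
assert np.isclose(np.log(p_x), L_q + kl)
```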




Variational Inference


Based on Calculus of Variations


Invented by Euler


Standard Calculus concerns derivatives of functions


Function takes variable as input and returns value of function

Functional is a mapping with function as input


Returns value of functional as output

Example of a functional is entropy


$H[p] = -\int p(x)\, \ln p(x)\, dx$

Leonhard Euler
Swiss
Mathematician
1707-1783

Functional Derivative
How does the value of the functional change in response to small changes in the input function

The quantity being maximized is a functional
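As a small illustration (not from the slides), the entropy functional can be evaluated numerically for a Gaussian input function and compared with its known closed form:

```python
# Minimal sketch: evaluate the entropy functional H[p] = -∫ p(x) ln p(x) dx
# numerically for a Gaussian density and compare with the closed form
# 0.5 * ln(2*pi*e*sigma^2).
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

sigma = 1.5
x = np.linspace(-12, 12, 20001)
p = norm.pdf(x, loc=0.0, scale=sigma)

H_numeric = -trapezoid(p * np.log(p), x)
H_exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(H_numeric, H_exact)   # agree to several decimal places
```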



Function versus Functional


Function takes a value of a variable as input and
returns the function value as output
Derivative describes how output varies as we make
infinitesimal changes in input value

Functional takes a function as input and returns


the functional value as output
Derivative of functional describes how value of
functional changes with infinitesimal changes in input
function

Recall Gaussian Processes dealt with


distributions of functions
Now we deal with how to find a function that
maximizes a functional, e.g., entropy


Variational methods
There is nothing intrinsically approximate about variational methods
But they naturally lend themselves to approximation
By restricting the range of functions over which we optimize, e.g.,
Quadratic
Linear combination of fixed basis functions
Factorization assumptions



Variational Approximation Example


Use a parametric distribution q(Z|w)
The lower bound L(q) then becomes a function of w, and we can use standard nonlinear optimization to determine the optimal value of w

[Figure: original distribution compared with its Laplace and variational approximations, together with the corresponding negative logarithms]
Variational distribution is Gaussian


Optimized with respect to mean and variance
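A minimal sketch of this idea (the skewed target density and the Monte Carlo estimate of the bound are assumptions, not the slides' example): fit q(Z|w) = N(m, s²) by maximizing an estimate of the lower bound with a standard nonlinear optimizer.

```python
# Minimal sketch: fit a Gaussian variational distribution q(z|w) = N(z|m, s^2)
# to an unnormalized target by maximizing a Monte Carlo estimate of L(q).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def log_p_tilde(z):
    # unnormalized, skewed target: Gaussian times a logistic factor (assumption)
    return -0.5 * z**2 - np.logaddexp(0.0, -3.0 * z)

eps = np.random.default_rng(0).standard_normal(2000)   # fixed base samples

def neg_elbo(w):
    m, log_s = w
    s = np.exp(log_s)
    z = m + s * eps                                     # reparameterized samples
    # L(q) ≈ E_q[ln p~(z) - ln q(z)]  (up to the unknown normalizing constant)
    return -np.mean(log_p_tilde(z) - norm.logpdf(z, m, s))

res = minimize(neg_elbo, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
m_opt, s_opt = res.x[0], np.exp(res.x[1])
print(f"variational mean {m_opt:.3f}, std {s_opt:.3f}")
```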


Factorized Distribution Approach

Method is called Mean Field Theory in Physics


(also Mean Field Variational Inference)
Restricting the family of distributions
Partition the elements of Z into disjoint groups $Z_i$, $i = 1, \ldots, M$

$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$

Among all distributions q(Z) having this form, we seek the one for which the lower bound L(q), which is a functional, is largest

$\mathcal{L}(q) = \int q(Z)\, \ln\!\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ = \int \prod_i q_i \left\{\ln p(X, Z) - \sum_i \ln q_i\right\} dZ$



Optimal Solution for Factorized Distribution


By substituting the factorized form of q(Z) into L(q), which we wish to maximize, the optimal solution is

$\ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[\ln p(X, Z)\right] + \text{const}$

The expectation is taken wrt all of the other factors $\{q_i\}$ for $i \neq j$

The optimal $q_j(Z_j)$ is obtained by initializing all of the factors $q_i(Z_i)$ and then cycling through the factors, updating each in turn


Properties of Factorized Approximations


Example problem: factorized approximation to a true posterior distribution which is a Gaussian

$p(z) = \mathcal{N}(z \mid \mu, \Lambda^{-1})$ over two correlated variables $z = (z_1, z_2)$, with

$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$

We wish to approximate it using a factorized Gaussian of the form q(z) = q1(z1) q2(z2)


Factorized Approximation of a Bivariate Gaussian

Using the solution for factorized distributions

$\ln q_1^*(z_1) = \mathbb{E}_{z_2}\left[\ln p(z)\right] + \text{const}$

we can identify the mean and precision of the optimal factors as

$q_1^*(z_1) = \mathcal{N}(z_1 \mid m_1, \Lambda_{11}^{-1}) \quad \text{where} \quad m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\left(\mathbb{E}[z_2] - \mu_2\right)$

$q_2^*(z_2) = \mathcal{N}(z_2 \mid m_2, \Lambda_{22}^{-1}) \quad \text{where} \quad m_2 = \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}\left(\mathbb{E}[z_1] - \mu_1\right)$
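A minimal sketch (with example numbers, not from the slides) of the coordinate-cycling scheme from the previous slide applied to this bivariate Gaussian:

```python
# Minimal sketch: cycle the mean-field updates for the two factors of a
# bivariate Gaussian. Each factor's mean is updated in turn from the current
# estimate of the other; the factor precisions are fixed at Lambda_11, Lambda_22.
import numpy as np

mu = np.array([1.0, -1.0])                       # true mean (example values)
Lam = np.array([[2.0, 1.2],                      # true precision matrix
                [1.2, 2.0]])

m = np.zeros(2)                                  # initialize E[z1], E[z2]
for _ in range(50):                              # cycle through the factors
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print("factor means:", m, " true mean:", mu)     # means agree with mu
print("factor variances:", 1 / Lam[0, 0], 1 / Lam[1, 1],
      " true marginal variances:", np.diag(np.linalg.inv(Lam)))
# The means are captured exactly, but the factor variances 1/Lambda_ii are
# smaller than the true marginal variances: q is too compact (next slide).
```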


Two alternative forms of KL Divergence


Factorized approximation of a bivariate Gaussian

Green: correlated Gaussian distribution p(z), shown at 1, 2 and 3 standard deviations
Red: q(z) over the same variables, given by the product of two independent univariate Gaussians

Minimization based on the KL divergence KL(q||p): the mean is correctly captured, but the variance is under-estimated; q is too compact

Minimization based on the reverse KL divergence KL(p||q): the form used in Expectation Propagation



Two alternative forms of KL divergence


Approximating a multimodal distribution by a unimodal one

Blue contours: multimodal distribution p(Z)
Red contours in (a): single Gaussian q(Z) that best approximates p(Z) in the sense of minimizing KL(p||q)
in (b): single Gaussian q(Z) obtained by minimizing KL(q||p)
in (c): a different local minimum of the KL divergence KL(q||p)


Alpha Family of Divergences


Two forms of divergence are members of
the alpha family of divergences
4
D ( p || q) =
(1 p(x) q(x) dx )
1
(1+ )/ 2

(1 )/ 2

where < <


KL(p||Q) corresponds to 1

KL(q||p) corresponds to 1
all D ( p || q ) 0 with equality iff p(x) = q(x)
For

when = 0 we get symmetric divergence


which is linearly related to the Hellinger
Distance (a valid distance measure)
DH ( p || q) =

1/ 2

( p(x)

q(x)1/ 2 ) 2 dx
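A numerical sanity check (a sketch with two assumed Gaussians for p and q) of these limits and of the linear relation $D_0 = 2 D_H$:

```python
# Minimal sketch: evaluate D_alpha(p||q) for two Gaussians and check that it
# approaches KL(p||q) as alpha -> 1 and KL(q||p) as alpha -> -1, and that
# D_0 equals twice the Hellinger distance.
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

x = np.linspace(-15, 15, 40001)
p = norm.pdf(x, 0.0, 1.0)
q = norm.pdf(x, 1.0, 2.0)

def D_alpha(alpha):
    integral = trapezoid(p**((1 + alpha) / 2) * q**((1 - alpha) / 2), x)
    return 4.0 / (1.0 - alpha**2) * (1.0 - integral)

kl_pq = trapezoid(p * np.log(p / q), x)
kl_qp = trapezoid(q * np.log(q / p), x)
hellinger = trapezoid((np.sqrt(p) - np.sqrt(q))**2, x)

print(D_alpha(0.999), kl_pq)            # close to KL(p||q)
print(D_alpha(-0.999), kl_qp)           # close to KL(q||p)
print(0.5 * D_alpha(0.0), hellinger)    # D_0 = 2 * Hellinger distance
```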


Example: Bayesian Inference of the Parameters of a Univariate Gaussian

Inferring the parameters of a single Gaussian
Assume a Gaussian-Gamma conjugate prior distribution over the parameters



Variational Inference of Univariate Gaussian


Variational inference of the mean and precision

Green: contours of the true posterior
The iterative scheme converges to the red contours
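A minimal sketch of the factorized updates q(μ)q(τ) for this model, obtained from the general result $\ln q^* = \mathbb{E}[\ln p] + \text{const}$; the closed-form update equations follow the standard mean-field derivation for this conjugate model, and the data and hyperparameter values are assumptions:

```python
# Minimal sketch: factorized variational updates for N(x | mu, tau^-1) with
# prior p(mu|tau) = N(mu0, (lambda0*tau)^-1) and p(tau) = Gam(a0, b0).
import numpy as np

x = np.random.default_rng(0).normal(2.0, 0.5, size=100)   # simulated observations
N, xbar = len(x), x.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3                   # broad prior (assumption)

E_tau = 1.0                                                 # initialize E[tau]
for _ in range(20):                                         # cycle the two factors
    # q(mu) = N(mu | mu_N, lam_N^-1)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # q(tau) = Gam(tau | a_N, b_N), using E[(x_n - mu)^2] = (x_n - mu_N)^2 + 1/lam_N
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N / lam_N
                      + lam0 * ((mu_N - mu0) ** 2 + 1 / lam_N))
    E_tau = a_N / b_N

print(f"E[mu] = {mu_N:.3f}, E[tau] = {E_tau:.3f}, 1/var(x) = {1/x.var():.3f}")
```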



Variational Mixture of Gaussians


Demonstrates how the Bayesian treatment elegantly resolves the difficulties of maximum likelihood estimation

Conditional distribution of Z given π:

$p(Z \mid \pi) = \prod_{n=1}^{N}\prod_{k=1}^{K} \pi_k^{z_{nk}}$

Conditional distribution of the observed data:

$p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N}\prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}}$

Priors over parameters:

Dirichlet distribution over π:

$p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0) = C(\alpha_0)\prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}$

Gaussian-Wishart over the means and precisions:

$p(\mu, \Lambda) = p(\mu \mid \Lambda)\, p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}\!\left(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_0, \nu_0)$



Variational Distribution
Joint distribution of all the random variables:

$p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)$

[Figure: directed acyclic graph representing the mixture model, with nodes for the mixing coefficients (priors), means and precisions]



Variational Bayesian GMM


Old Faithful data set
K = 6 components
After convergence there are effectively only two components
The density of red ink indicates the mixing coefficients
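For a practical counterpart (an assumption: using scikit-learn's BayesianGaussianMixture rather than the slides' own derivation), fitting K = 6 components to two-cluster data shows the same pruning behaviour:

```python
# Minimal sketch: variational Bayesian GMM with K = 6 components; superfluous
# components receive negligible weight. The two-cluster data below is a
# stand-in for the Old Faithful data set (assumption).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 55.0], [0.3, 6.0], size=(100, 2)),
               rng.normal([4.3, 80.0], [0.4, 6.0], size=(150, 2))])

vbgmm = BayesianGaussianMixture(
    n_components=6,                                   # start with K = 6
    weight_concentration_prior_type="dirichlet_distribution",
    max_iter=500, random_state=0,
).fit(X)

print(np.round(vbgmm.weights_, 3))   # most mixing coefficients shrink toward 0
```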

