
Machine Learning


Approximate Inference
Sargur Srihari
srihari@cedar.buffalo.edu


Plan of Discussion

Need for Approximation


Types of Approximation
Functionals and Variational Methods
Minimizing K-L divergence
Factorized approximations
Examples


Posterior Distribution in Bayesian Methods


The problem of computing the posterior is a special case of the variational inference problem

$p(Z \mid X, \theta) = \dfrac{p(Z, X \mid \theta)}{\sum_Z p(Z, X \mid \theta)}$

We cannot compute the posterior for many interesting models

Consider the Bayesian mixture of Gaussians:

1. Draw $\mu_k \sim \mathcal{N}(0, \sigma^2)$ for $k = 1, \ldots, K$
2. For $i = 1, \ldots, n$:
   (a) Draw $z_i \sim \mathrm{Mult}(\pi)$
   (b) Draw $x_i \sim \mathcal{N}(\mu_{z_i}, \sigma^2)$

Suppressing the fixed parameters, the posterior distribution is

$p(\mu_{1:K}, z_{1:n} \mid x_{1:n}) = \dfrac{\prod_{k=1}^{K} p(\mu_k)\, \prod_{i=1}^{n} p(z_i)\, p(x_i \mid z_i, \mu_{1:K})}{\int_{\mu_{1:K}} \sum_{z_{1:n}} \prod_{k=1}^{K} p(\mu_k)\, \prod_{i=1}^{n} p(z_i)\, p(x_i \mid z_i, \mu_{1:K})}$

The numerator is easy to compute for any configuration of hidden variables


The problem is the denominator.
The integral is not easy to compute
The summation has $K^n$ terms, which is intractable
Situation arises in most interesting problems
Approximate posterior inference is one of the central problems of Bayesian statistics
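The following sketch (not from the slides; toy values) makes the denominator concrete by brute-force summing over all $K^n$ assignments of $z_{1:n}$ for one fixed draw of the component means, using numpy and scipy:

```python
# Minimal sketch: brute-force evaluation of the denominator sum over z_{1:n}
# for a tiny Bayesian mixture of Gaussians, to show why the K^n terms become
# intractable as n grows. (Conditioned on one draw of mu; the full denominator
# also integrates over mu.)
import itertools
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
K, n, sigma = 3, 8, 1.0
mu = rng.normal(0.0, 2.0, size=K)                    # one draw of the component means
x = rng.normal(mu[rng.integers(K, size=n)], sigma)   # simulated data
pi = np.full(K, 1.0 / K)                             # uniform mixing proportions

total = 0.0
for z in itertools.product(range(K), repeat=n):      # K**n configurations
    total += np.prod(pi[list(z)] * norm.pdf(x, mu[list(z)], sigma))

print(f"{K**n} terms summed; p(x | mu) = {total:.3e}")
# Already at n = 50, K = 3 this loop would need ~7e23 iterations.
```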


Central Task in Using Probabilistic Models with Latent Variables

In unsupervised learning there are no labelled samples of the latent variables

1. Evaluation of posterior distribution p(Z|X)


Where Z=latent variables, X=observed data variables
GMM: Z=latent subclasses (z0,..zK), X=observed data
HMM: Z=latent variables (z0,..zK), X=observed data

2. Evaluation of expectations wrt p(Z|X)

E.g., in EM for maximum-likelihood estimation of the parameters of p(X,Z):
evaluate the expectation of the complete-data log-likelihood ln p(X,Z) wrt the posterior p(Z|X)


Need for Approximation for EM


It is often infeasible to evaluate posterior distributions, or expectations with respect to them, because of:
High dimensionality of the latent space
Complex and intractable forms of the expectations

In the case of GMMs we get the expressions

Posterior:

$p(Z \mid X, \pi, \mu, \Sigma) \propto \prod_{n=1}^{N}\prod_{k=1}^{K} \left[\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right]^{z_{nk}}$

Expectation, which is evaluated and maximized using EM:

$\mathbb{E}_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] = \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})\left\{\ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right\}$
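As an illustration (a minimal sketch, not the slides' code; the toy data and parameter values are assumptions), the responsibilities $\gamma(z_{nk})$ and the expected complete-data log-likelihood above can be computed as:

```python
# Minimal sketch: responsibilities gamma(z_nk) and the expected complete-data
# log-likelihood E_Z[ln p(X,Z)] for a GMM, as used in the E-step of EM.
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    # unnormalized log responsibilities: ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_r = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                      for k in range(len(pi))], axis=1)          # shape (N, K)
    log_r -= log_r.max(axis=1, keepdims=True)                    # numerical stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

def expected_complete_data_loglik(X, pi, mus, Sigmas, gamma):
    ll = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                   for k in range(len(pi))], axis=1)
    return np.sum(gamma * ll)

# toy usage with made-up parameters
X = np.random.default_rng(1).normal(size=(100, 2))
pi = np.array([0.5, 0.5]); mus = [np.zeros(2), np.ones(2)]; Sigmas = [np.eye(2)] * 2
g = responsibilities(X, pi, mus, Sigmas)
print(expected_complete_data_loglik(X, pi, mus, Sigmas, g))
```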


Difficulties with Posteriors and Expectations


For continuous variables
Required integrations have no closed form solutions
Dimensionality of space and integrand prohibit
numerical integration

For discrete variables


Summation in marginalization: exponential no. of
states


Types of Approximations
1. Stochastic
Markov chain Monte Carlo
Have allowed use of Bayesian methods across many domains

Computationally demanding
Can generate exact results given unlimited computational resources

2. Deterministic

Variational Inference (or Variational Bayes)


Based on analytical approximations to posterior
e.g., particular factorization or specific parametric form such as
Gaussian

Scale well in large applications


Can never generate exact results


Example of an Inference Problem


Observed Variables
X = {x1,.., xN} N i.i.d. data

Latent Variables and Parameters


Z = {z1,..,zN}

Model for joint distribution p(X,Z) is specified


Goal is to find approximation for posterior
distribution p(Z|X) as well as for p(X)


Idea behind Variational Methods


Pick a family of distributions over the latent variables, with its own variational parameters: q(Z | ν)

Then, find the setting of the parameters that makes q close to the posterior p(Z|X)
Use q with the fitted parameters as a proxy for the posterior
e.g., to make predictions about future data, or to reason about the posterior of the hidden variables
Typically the true posterior is not in the variational family


Role of KL Divergence
Measure closeness of distributions q(Z)
and p(Z|X) using KL divergence
$\mathrm{KL}(q \parallel p) = \mathbb{E}_q\!\left[\ln \frac{q(Z)}{p(Z \mid X)}\right] = \int q(Z)\, \ln \frac{q(Z)}{p(Z \mid X)}\, dZ$

Minimizing KL{q||p} is equivalent to maximizing


L(q) as shown next


Plan of Attack in finding q

$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \parallel p)$   (decomposition of the log marginal probability)

where

$\mathcal{L}(q) = \int q(Z)\, \ln\!\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ$   is the functional we wish to maximize

and

$\mathrm{KL}(q \parallel p) = -\int q(Z)\, \ln\!\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ$   is the Kullback-Leibler divergence between the proposed q and p (the desired posterior of interest), to be minimized

Also applicable to discrete distributions by replacing integrations with summations

Observations on optimization

Lower bound on ln p(X) is L(q)


Maximizing lower bound L(q) wrt distribution q(Z) is equivalent to
minimizing KL Divergence
When KL divergence vanishes q(Z) equals the posterior p(Z|X)

Plan:
We seek that distribution q(Z) for which L(q) is largest
Since true posterior is intractable we consider restricted family for q(Z)
Seek member of this family for which KL divergence is minimized
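A quick numerical check of the decomposition $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \parallel p)$ on a toy discrete latent variable (a sketch, not from the slides):

```python
# Minimal sketch: verify numerically that ln p(X) = L(q) + KL(q||p)
# for an arbitrary q(Z) over a discrete latent variable Z.
import numpy as np

rng = np.random.default_rng(0)
K = 5                                   # number of discrete latent states
joint = rng.random(K)                   # unnormalized p(X, Z=k) for the observed X
p_x = joint.sum()                       # p(X) = sum_Z p(X, Z)
posterior = joint / p_x                 # p(Z | X)

q = rng.random(K); q /= q.sum()         # arbitrary variational distribution q(Z)

L_q = np.sum(q * np.log(joint / q))     # L(q)      = sum_Z q ln{p(X,Z)/q}
kl = np.sum(q * np.log(q / posterior))  # KL(q||p)  = sum_Z q ln{q/p(Z|X)}

print(np.log(p_x), L_q + kl)            # identical up to floating-point error
assert np.isclose(np.log(p_x), L_q + kl)
```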




Variational Inference


Based on Calculus of Variations


Invented by Euler


Standard Calculus concerns derivatives of functions


Function takes variable as input and returns value of function

Functional is a mapping with function as input


Returns value of functional as output

Example of a functional is entropy


$H[p] = -\int p(x)\, \ln p(x)\, dx$

Leonhard Euler
Swiss
Mathematician
1707-1783

Functional Derivative
How does the value of the functional change in response to small changes in the input function

The quantity being maximized is a functional
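As a small illustration (not from the slides), the entropy functional can be evaluated numerically for a Gaussian input function and compared with its known closed form:

```python
# Minimal sketch: evaluate the entropy functional H[p] = -∫ p(x) ln p(x) dx
# numerically for a Gaussian density and compare with the closed form
# 0.5 * ln(2*pi*e*sigma^2).
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

sigma = 1.5
x = np.linspace(-12, 12, 20001)
p = norm.pdf(x, loc=0.0, scale=sigma)

H_numeric = -trapezoid(p * np.log(p), x)
H_exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(H_numeric, H_exact)   # agree to several decimal places
```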



Function versus Functional


Function takes a value of a variable as input and
returns the function value as output
Derivative describes how output varies as we make
infinitesimal changes in input value

Functional takes a function as input and returns


the functional value as output
Derivative of functional describes how value of
functional changes with infinitesimal changes in input
function

Recall Gaussian Processes dealt with


distributions of functions
Now we deal with how to find a function that
maximizes a functional, e.g., entropy


Variational methods
There is nothing intrinsically approximate about variational methods
But they naturally lend themselves to approximation
By restricting the range of functions over which we optimize, e.g.,
Quadratic
Linear combination of fixed basis functions
Factorization assumptions



Variational Approximation Example


Use a parametric distribution q(Z|w)
The lower bound L(q) then becomes a function of w, and we can use standard nonlinear optimization to determine the optimal value of w

[Figure: original distribution compared with its Laplace and variational approximations, together with the corresponding negative logarithms]
Variational distribution is Gaussian


Optimized with respect to mean and variance
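A minimal sketch of this idea (the skewed target density and the Monte Carlo estimate of the bound are assumptions, not the slides' example): fit q(Z|w) = N(m, s²) by maximizing an estimate of the lower bound with a standard nonlinear optimizer.

```python
# Minimal sketch: fit a Gaussian variational distribution q(z|w) = N(z|m, s^2)
# to an unnormalized target by maximizing a Monte Carlo estimate of L(q).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def log_p_tilde(z):
    # unnormalized, skewed target: Gaussian times a logistic factor (assumption)
    return -0.5 * z**2 - np.logaddexp(0.0, -3.0 * z)

eps = np.random.default_rng(0).standard_normal(2000)   # fixed base samples

def neg_elbo(w):
    m, log_s = w
    s = np.exp(log_s)
    z = m + s * eps                                     # reparameterized samples
    # L(q) ≈ E_q[ln p~(z) - ln q(z)]  (up to the unknown normalizing constant)
    return -np.mean(log_p_tilde(z) - norm.logpdf(z, m, s))

res = minimize(neg_elbo, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
m_opt, s_opt = res.x[0], np.exp(res.x[1])
print(f"variational mean {m_opt:.3f}, std {s_opt:.3f}")
```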


Factorized Distribution Approach

Method is called Mean Field Theory in Physics


(also Mean Field Variational Inference)
Restricting the family of distributions
Partition the elements of Z into disjoint groups $Z_i$, $i = 1, \ldots, M$

$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$

Among all distributions q(Z) having this form, we seek the one for which the lower bound L(q), which is a functional, is largest

$\mathcal{L}(q) = \int q(Z)\, \ln\!\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ = \int \prod_i q_i \left\{\ln p(X, Z) - \sum_i \ln q_i\right\} dZ$



Optimal Solution for Factorized Distribution


By substituting the factorized form of q(Z) into L(q), which we wish to maximize, the optimal solution is

$\ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[\ln p(X, Z)\right] + \text{const}$

The expectation is taken wrt all of the other factors $\{q_i\}$ for $i \neq j$

The optimal $q_j(Z_j)$ is obtained by initializing all of the factors $q_i(Z_i)$ and then cycling through the factors, updating each in turn


Properties of Factorized Approximations


Example problem: factorized approximation to a true posterior distribution which is a Gaussian

$p(z) = \mathcal{N}(z \mid \mu, \Lambda^{-1})$ over two correlated variables $z = (z_1, z_2)$, with

$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$

We wish to approximate it using a factorized Gaussian of the form q(z) = q1(z1) q2(z2)


Factorized Approximation of a Bivariate Gaussian

Using the solution for factorized distributions

$\ln q_1^*(z_1) = \mathbb{E}_{z_2}\left[\ln p(z)\right] + \text{const}$

we can identify the mean and precision of the optimal factors as

$q_1^*(z_1) = \mathcal{N}(z_1 \mid m_1, \Lambda_{11}^{-1}) \quad \text{where} \quad m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\left(\mathbb{E}[z_2] - \mu_2\right)$

$q_2^*(z_2) = \mathcal{N}(z_2 \mid m_2, \Lambda_{22}^{-1}) \quad \text{where} \quad m_2 = \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}\left(\mathbb{E}[z_1] - \mu_1\right)$
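A minimal sketch (with example numbers, not from the slides) of the coordinate-cycling scheme from the previous slide applied to this bivariate Gaussian:

```python
# Minimal sketch: cycle the mean-field updates for the two factors of a
# bivariate Gaussian. Each factor's mean is updated in turn from the current
# estimate of the other; the factor precisions are fixed at Lambda_11, Lambda_22.
import numpy as np

mu = np.array([1.0, -1.0])                       # true mean (example values)
Lam = np.array([[2.0, 1.2],                      # true precision matrix
                [1.2, 2.0]])

m = np.zeros(2)                                  # initialize E[z1], E[z2]
for _ in range(50):                              # cycle through the factors
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print("factor means:", m, " true mean:", mu)     # means agree with mu
print("factor variances:", 1 / Lam[0, 0], 1 / Lam[1, 1],
      " true marginal variances:", np.diag(np.linalg.inv(Lam)))
# The means are captured exactly, but the factor variances 1/Lambda_ii are
# smaller than the true marginal variances: q is too compact (next slide).
```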


Two alternative forms of KL Divergence


Factorized approximation of a bivariate Gaussian

Green: correlated Gaussian distribution p(z), shown at 1, 2 and 3 standard deviations
Red: q(z) over the same variables, given by the product of two independent univariate Gaussians

Minimization based on the KL divergence KL(q||p): the mean is correctly captured, but the variance is under-estimated; q is too compact

Minimization based on the reverse KL divergence KL(p||q): the form used in Expectation Propagation



Two alternative forms of KL divergence


Approximating a multimodal distribution by a unimodal one

Blue contours: multimodal distribution p(Z)
Red contours in (a): single Gaussian q(Z) that best approximates p(Z) in the sense of minimizing KL(p||q)
in (b): single Gaussian q(Z) obtained by minimizing KL(q||p)
in (c): a different local minimum of the KL divergence KL(q||p)


Alpha Family of Divergences


Two forms of divergence are members of
the alpha family of divergences
4
D ( p || q) =
(1 p(x) q(x) dx )
1
(1+ )/ 2

(1 )/ 2

where < <


KL(p||Q) corresponds to 1

KL(q||p) corresponds to 1
all D ( p || q ) 0 with equality iff p(x) = q(x)
For

when = 0 we get symmetric divergence


which is linearly related to the Hellinger
Distance (a valid distance measure)
DH ( p || q) =

1/ 2

( p(x)

q(x)1/ 2 ) 2 dx
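A numerical sanity check (a sketch with two assumed Gaussians for p and q) of these limits and of the linear relation $D_0 = 2 D_H$:

```python
# Minimal sketch: evaluate D_alpha(p||q) for two Gaussians and check that it
# approaches KL(p||q) as alpha -> 1 and KL(q||p) as alpha -> -1, and that
# D_0 equals twice the Hellinger distance.
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

x = np.linspace(-15, 15, 40001)
p = norm.pdf(x, 0.0, 1.0)
q = norm.pdf(x, 1.0, 2.0)

def D_alpha(alpha):
    integral = trapezoid(p**((1 + alpha) / 2) * q**((1 - alpha) / 2), x)
    return 4.0 / (1.0 - alpha**2) * (1.0 - integral)

kl_pq = trapezoid(p * np.log(p / q), x)
kl_qp = trapezoid(q * np.log(q / p), x)
hellinger = trapezoid((np.sqrt(p) - np.sqrt(q))**2, x)

print(D_alpha(0.999), kl_pq)            # close to KL(p||q)
print(D_alpha(-0.999), kl_qp)           # close to KL(q||p)
print(0.5 * D_alpha(0.0), hellinger)    # D_0 = 2 * Hellinger distance
```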


Example: Bayesian Inference of the Parameters of a Univariate Gaussian

Inferring the parameters of a single Gaussian
Assume a Gaussian-Gamma conjugate prior distribution over the parameters



Variational Inference of Univariate Gaussian


Variational inference of the mean and precision

Green: contours of the true posterior
The iterative scheme converges to the red contours
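A minimal sketch of the factorized updates q(μ)q(τ) for this model, obtained from the general result $\ln q^* = \mathbb{E}[\ln p] + \text{const}$; the closed-form update equations follow the standard mean-field derivation for this conjugate model, and the data and hyperparameter values are assumptions:

```python
# Minimal sketch: factorized variational updates for N(x | mu, tau^-1) with
# prior p(mu|tau) = N(mu0, (lambda0*tau)^-1) and p(tau) = Gam(a0, b0).
import numpy as np

x = np.random.default_rng(0).normal(2.0, 0.5, size=100)   # simulated observations
N, xbar = len(x), x.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3                   # broad prior (assumption)

E_tau = 1.0                                                 # initialize E[tau]
for _ in range(20):                                         # cycle the two factors
    # q(mu) = N(mu | mu_N, lam_N^-1)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # q(tau) = Gam(tau | a_N, b_N), using E[(x_n - mu)^2] = (x_n - mu_N)^2 + 1/lam_N
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N / lam_N
                      + lam0 * ((mu_N - mu0) ** 2 + 1 / lam_N))
    E_tau = a_N / b_N

print(f"E[mu] = {mu_N:.3f}, E[tau] = {E_tau:.3f}, 1/var(x) = {1/x.var():.3f}")
```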



Variational Mixture of Gaussians


Demonstrates how the Bayesian treatment elegantly resolves the difficulties of maximum likelihood estimation

Conditional distribution of Z given π:

$p(Z \mid \pi) = \prod_{n=1}^{N}\prod_{k=1}^{K} \pi_k^{z_{nk}}$

Conditional distribution of the observed data:

$p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N}\prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}}$

Priors over parameters:

Dirichlet distribution over π:

$p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0) = C(\alpha_0)\prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}$

Gaussian-Wishart over the means and precisions:

$p(\mu, \Lambda) = p(\mu \mid \Lambda)\, p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}\!\left(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_0, \nu_0)$



Variational Distribution
Joint distribution of all the random variables:

$p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)$

[Figure: directed acyclic graph representing the mixture model, with nodes for the mixing coefficients (priors), means and precisions]



Variational Bayesian GMM


Old Faithful data set
K = 6 components
After convergence there are effectively only two components
The density of red ink indicates the mixing coefficients
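For a practical counterpart (an assumption: using scikit-learn's BayesianGaussianMixture rather than the slides' own derivation), fitting K = 6 components to two-cluster data shows the same pruning behaviour:

```python
# Minimal sketch: variational Bayesian GMM with K = 6 components; superfluous
# components receive negligible weight. The two-cluster data below is a
# stand-in for the Old Faithful data set (assumption).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 55.0], [0.3, 6.0], size=(100, 2)),
               rng.normal([4.3, 80.0], [0.4, 6.0], size=(150, 2))])

vbgmm = BayesianGaussianMixture(
    n_components=6,                                   # start with K = 6
    weight_concentration_prior_type="dirichlet_distribution",
    max_iter=500, random_state=0,
).fit(X)

print(np.round(vbgmm.weights_, 3))   # most mixing coefficients shrink toward 0
```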

