
CSE291D Lecture 6

Monte Carlo Methods 2:


Markov chain Monte Carlo

Announcements
• Please hand in homework 1 to the front

• You can also pick up hard copies of the project specification (already on Piazza)

• Homework 2 will be available on Piazza tomorrow

• Project proposals are due on Tuesday!
Probability and Inference
Probability: data generating process → observed data
Inference: observed data → data generating process
Figure based on one by Larry Wasserman, "All of Statistics"
Monte Carlo Inference
• Bayesian inference:
– Computing/approximating the posterior
– Answering queries based on the posterior
(MAP, marginals, posterior predictive,…)
– Constructing a data structure for answering queries

• We need more powerful algorithms for Bayesian inference, to solve real-world, high-dimensional inference problems.
Markov chain Monte Carlo
• Goal: approximate/summarize a distribution,
e.g. the posterior, with a set of samples

• Idea: use a Markov chain to simulate the distribution and draw samples
Example: statistical mechanics

• Suppose we are modeling particles in a gas. The Boltzmann distribution (Gibbs distribution) has the form:
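In standard form, with energy E(x), temperature T, Boltzmann constant k_B, and normalizing constant (partition function) Z:

P(x) = \frac{1}{Z} \exp\left( -\frac{E(x)}{k_B T} \right), \qquad Z = \sum_{x} \exp\left( -\frac{E(x)}{k_B T} \right)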

• This probability distribution arises from the behavior of the particles over time, which could be modeled as a Markov chain.
Markov chain Monte Carlo

• “The tour de force [of Metropolis et al. (1953)] was their realization that
they did not need to simulate the exact dynamics; they only needed to
simulate some Markov chain having the same equilibrium distribution.”
– Charles Geyer, the MCMC Handbook

History of MCMC
• 1953: Metropolis algorithm invented at Los Alamos
National Labs, home of the Manhattan Project

• 1970: Generalized to Metropolis-Hastings. Used by chemists and physicists.

• 1984: Gibbs sampler invented, used for mode finding (Geman and Geman, 1984)

• 1990: Finally brought to the attention of Bayesian statisticians (Gelfand and Smith, 1990)
Learning outcomes
By the end of the lesson, you should be able to:

• Derive Metropolis-Hastings and Gibbs sampling algorithms to simulate from probability models

• Apply these methods to solve practical inference tasks, while sensibly navigating convergence issues
[Clicker question slide: the correct answer is false!]
Markov chain

X1 → X2 → X3 → X4 → X5

• Homogeneous: transition probabilities don’t change over time.

• Transition matrix:

• Transition operator:
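
In the usual notation: the transition matrix T has entries

T_{ij} = P\big( x^{(t+1)} = j \mid x^{(t)} = i \big),

and the transition operator maps the distribution over states forward one step,

p^{(t+1)} = p^{(t)}\, T,

where p^{(t)} is the row vector of state probabilities at step t.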
Markov chain: example

[Figure sequence: simulating the chain for several steps; the distribution over states converges toward the uniform distribution.]
Stationary distribution
• The distribution over states converged towards a particular distribution, in
this case the uniform distribution

• In this case, we say that the Markov chain has reached equilibrium

• The distribution is called a stationary distribution, a.k.a. an equilibrium distribution, a.k.a. an invariant distribution of the chain.
Stationary distribution
• A distribution π is said to be a stationary distribution (a.k.a. invariant distribution) of a Markov chain if
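\pi\, T = \pi, \qquad \text{i.e.} \qquad \pi(x') = \sum_{x} \pi(x)\, T(x' \mid x) \;\; \text{for all states } x'

(the standard definition, writing T for the transition operator of the chain)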

• If x is distributed according to π, it will still be distributed according to π after taking a step in the Markov chain.
Stationary distribution

• A stationary distribution is a (left) eigenvector of the transition matrix that has eigenvalue 1.
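
A quick numerical illustration of this fact (a minimal sketch in Python with NumPy; the 3-state transition matrix is an arbitrary example):

```python
import numpy as np

# Example 3-state transition matrix: rows are current states,
# columns are next states, so each row sums to 1.
T = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])

# A stationary distribution pi satisfies pi T = pi, i.e. pi is a
# left eigenvector of T (an ordinary eigenvector of T transposed)
# with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(T.T)
idx = np.argmin(np.abs(eigvals - 1.0))   # eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                        # normalize to a probability vector

print(pi)        # stationary distribution
print(pi @ T)    # same vector again, up to numerical error
```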

Markov chain Monte Carlo
• Select a Markov chain whose unique stationary
distribution is the target distribution P(x)

• Simulate the Markov chain until you reach equilibrium, then keep the states as samples
Metropolis-Hastings
• Need:
– unnormalized target density P*(x)
– A proposal distribution that depends on the
current value of x, Q(x’;x(t))

Metropolis-Hastings
• In each iteration t:
– Draw x′ from the proposal distribution, x′ ~ Q(x′; x(t))

– Decide whether to accept the proposal, or reject it

– If accept: x(t+1) = x′

– Else: x(t+1) = x(t)
Metropolis-Hastings
acceptance decision
• With symmetric proposal, Q(x′; x) = Q(x; x′), accept the proposal with probability

a = \min\left(1, \frac{P^*(x')}{P^*(x^{(t)})}\right)

Higher probability states should be accepted proportionally more often.

• Asymmetric proposal: accept with probability

a = \min\left(1, \frac{P^*(x')\, Q(x^{(t)}; x')}{P^*(x^{(t)})\, Q(x'; x^{(t)})}\right)

to correct for asymmetry in the proposal.
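
Putting the pieces together, here is a minimal random-walk Metropolis-Hastings sketch in Python (NumPy assumed; the function names and the toy target, a mixture of two 1-D Gaussians as in the example a few slides ahead, are illustrative choices, not the lecture's code):

```python
import numpy as np

def log_p_star(x):
    """Unnormalized log target density log P*(x): an equal mixture of
    two unit-variance 1-D Gaussians centered at -2 and +2."""
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis_hastings(log_p_star, x0, n_steps, step_size=1.0, rng=None):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal,
    so the Q terms cancel in the acceptance ratio."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + step_size * rng.standard_normal()   # propose x' ~ Q(x'; x)
        log_ratio = log_p_star(x_prop) - log_p_star(x)   # log of P*(x') / P*(x)
        if np.log(rng.random()) < log_ratio:
            x = x_prop        # accept: move to the proposed state
        # else: reject and stay at the current state
        samples[t] = x
    return samples

samples = metropolis_hastings(log_p_star, x0=0.0, n_steps=10_000, step_size=2.5)
```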
Proposal distributions
• A common choice of proposal distribution:
a Gaussian centered at the current state x(t).
The variance plays the role of a step size.

• Independence Metropolis-Hastings:
– Proposal is a fixed Gaussian (or other distribution)
which does not depend on x(t).
(can be useful for a unimodal distribution where
we can find the mode, e.g. logistic regression).
Selecting the variance of the proposal
• The variance of the proposal (step size) can have
a big impact on performance, but it may be
difficult to know the best value ahead of time.

• One could try some preliminary runs with several values for the variance of the proposal, and select one that performs well: around 25-50% acceptance rate is typically good.
Example: mixture of two 1-D Gaussians

[Figure sequence: Metropolis-Hastings samples from a mixture of two 1-D Gaussians, with traces plotted against the number of steps.]
Detailed balance
• Detailed balance is a sufficient (but not
necessary) condition for stationarity:
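
P(x_a)\, T(x_b \mid x_a) \;=\; P(x_b)\, T(x_a \mid x_b) \qquad \text{for all pairs of states } x_a, x_b

(standard statement of the condition; P is the target distribution and T the transition operator)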

• If we could magically pick a state from the distribution and take a step in the chain, we’d be just as likely to pick xb and go to xa as we are to pick xa and go to xb.
Stationarity of Metropolis-Hastings
• Want to show detailed balance holds:
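
One way the argument goes (for distinct states x_a ≠ x_b; the case x_a = x_b is immediate). The Metropolis-Hastings transition probability is

T(x_b \mid x_a) = Q(x_b; x_a)\, \min\left(1, \frac{P(x_b)\, Q(x_a; x_b)}{P(x_a)\, Q(x_b; x_a)}\right)

(using P* instead of P gives the same ratio, since the normalizing constant cancels). Then

P(x_a)\, T(x_b \mid x_a) = \min\big( P(x_a)\, Q(x_b; x_a),\; P(x_b)\, Q(x_a; x_b) \big),

which is symmetric in x_a and x_b, so it also equals P(x_b)\, T(x_a \mid x_b). Detailed balance holds, and P is therefore a stationary distribution of the chain.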
Gibbs sampling
• Sampling from a complicated joint distribution P(x) is
hard.

• Often, sampling one variable at a time, given all the others, is much easier.

• Graphical models:
Graph structure gives us the Markov blanket
Gibbs sampling
• Update variables one at a time by drawing
from their conditional distributions
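
In symbols, with D variables, the standard update for the i-th variable at sweep t+1 conditions on the most recent values of all the others:

x_i^{(t+1)} \sim P\left( x_i \,\middle|\, x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_D^{(t)} \right)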

• In each iteration, sweep through and update all of the variables, in any order.
Gibbs sampling

[Figure sequence: Gibbs updates illustrated step by step, resampling one variable at a time from its conditional distribution.]
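
A minimal Gibbs sampling sketch in Python (NumPy assumed), for a toy target with closed-form conditionals: a zero-mean bivariate Gaussian with correlation rho. The function name and target are illustrative choices, not the lecture's code:

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_steps, rng=None):
    """Gibbs sampling for a zero-mean bivariate Gaussian with unit variances
    and correlation rho. Each conditional is a 1-D Gaussian:
        x1 | x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for x2 | x1."""
    rng = np.random.default_rng() if rng is None else rng
    x1, x2 = 0.0, 0.0                       # arbitrary initialization
    cond_std = np.sqrt(1.0 - rho ** 2)      # std. dev. of each conditional
    samples = np.empty((n_steps, 2))
    for t in range(n_steps):
        # One sweep: update each variable from its conditional distribution,
        # given the most recent value of the other.
        x1 = rng.normal(rho * x2, cond_std)
        x2 = rng.normal(rho * x1, cond_std)
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_gaussian(rho=0.9, n_steps=5000)
print(np.corrcoef(samples[1000:].T))   # close to 0.9 once burn-in is discarded
```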
Block Gibbs sampling
• Perform Gibbs updates on groups of variables at once

[Figure: a latent state model with latent states Z1, ..., Z5, observations Y1, ..., Y5, and parameters θk in a plate of size K.]
Convergence of Gibbs sampling

• The target distribution is quite clearly/intuitively invariant to Gibbs updates

• Gibbs updates can be viewed as a special case of Metropolis-Hastings updates, with an acceptance probability of 1.
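
To see the second point, take the proposal to be the conditional itself, Q(x'; x) = P(x'_i \mid x_{-i}) with x'_{-i} = x_{-i}. The Metropolis-Hastings acceptance ratio is then

\frac{P(x')\, Q(x; x')}{P(x)\, Q(x'; x)}
= \frac{P(x'_i \mid x_{-i})\, P(x_{-i}) \cdot P(x_i \mid x_{-i})}{P(x_i \mid x_{-i})\, P(x_{-i}) \cdot P(x'_i \mid x_{-i})} = 1,

so the proposal is always accepted.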
Requirements for MCMC convergence
to the target distribution

1. The target distribution P(x) is a stationary distribution of the Markov chain

2. The Markov chain has a limiting distribution
Reasons that a Markov chain
might not converge to a limiting distribution

• Reducible: Some states can’t be reached from other states.
[Figure: a four-state chain (states 1-4) illustrating reducibility.]

• Periodic: Contains fixed-length cyclic behavior.
[Figure: a four-state chain (states 1-4) illustrating periodic behavior.]
Theorem

• If a Markov chain is irreducible and aperiodic, it has a limiting distribution π, which is its unique stationary distribution
MCMC convergence in practice

• It is generally relatively easy to construct a Markov chain which has the correct limiting distribution

• The challenge is to assess when it has reached this limiting distribution
Burn in
• The initial samples won’t be from the stationary distribution until the chain has converged.
– Use a burn-in period where the initial samples are discarded.

• Multiple chains can help check convergence
Monitoring convergence

[Figure: trace plot of the log-likelihood or log posterior probability against the number of iterations.]
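
A sketch of this kind of diagnostic in Python (NumPy and matplotlib assumed): run a few chains from overdispersed starting points and overlay their log-target traces; after burn-in they should settle around the same level. The toy target matches the earlier Metropolis-Hastings sketch and is an illustrative choice:

```python
import numpy as np
import matplotlib.pyplot as plt

def log_p_star(x):
    # Same toy target as in the earlier sketch: mixture of two 1-D Gaussians.
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

rng = np.random.default_rng(0)
n_steps, step_size = 2000, 2.5

for start in (-10.0, 0.0, 10.0):          # overdispersed starting points
    x, trace = start, []
    for _ in range(n_steps):
        x_prop = x + step_size * rng.standard_normal()
        if np.log(rng.random()) < log_p_star(x_prop) - log_p_star(x):
            x = x_prop
        trace.append(log_p_star(x))
    plt.plot(trace)                        # one trace per chain

plt.xlabel("Number of iterations")
plt.ylabel("Log unnormalized posterior")
plt.show()
```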