
CSE291D Lecture 3

Conjugate Priors
Generative Models for Discrete Data

Participation grades: Poll Everywhere
• I will be recording participation in polls from now on, for your 5% participation grade

• Be sure to log in using the sign-in associated with your UCSD ID!

• If you can’t bring a digital device to class to use Poll Everywhere, let me know.
Learning outcomes
By the end of the lesson, you should be able to:

• Derive posterior distributions for simple conjugate models

• Compute posterior predictive distributions for simple conjugate models

• Build and infer models of discrete data sets using the Beta-Bernoulli, Dirichlet-multinomial, and naïve Bayes-based models
Bayesian statistics: Recap
• Write down your prior beliefs, write down your likelihood, and apply Bayes’ rule:
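  For reference, Bayes’ rule for the posterior (assuming data D and parameters θ):

  p(θ | D) = p(D | θ) p(θ) / p(D)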

• Bayesian inference provides a distribution over parameters, the posterior distribution, not just a point estimate
Conjugate priors
• How should we select our prior distribution?
  – Conjugate priors are a mathematically convenient choice

• Conjugate prior:
  – The posterior is in the same family of distributions as the prior
Why conjugate priors are important
• Tractability for simple models
  – Closed-form solutions to the posterior from Bayes’ rule

• For more complicated models, conjugate priors will also be valuable for approximate inference techniques such as Markov chain Monte Carlo
  – Simple, fast updates
  – Better mixing time (speed of convergence)
Bernoulli model
• Flip a biased coin: what is the probability of heads?

• Likelihood for iid coin tosses
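  Assuming θ = p(heads), tosses x_i ∈ {0, 1}, with N1 heads and N0 tails out of N tosses, the likelihood is:

  p(x_1, …, x_N | θ) = ∏_i θ^(x_i) (1 - θ)^(1 - x_i) = θ^(N1) (1 - θ)^(N0)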
Bernoulli model
• Note how the likelihood worked out nicely for iid draws, since we could add up exponents

• Wouldn’t it be convenient if we could do this for the prior, too?
Beta distribution
• The distribution over [0,1] which is of the right form so that we can add the exponents again

• Normalization constant

  What must Z look like?
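  Assuming hyper-parameters α, β > 0, the beta density has the form:

  p(θ | α, β) = (1/Z) θ^(α-1) (1 - θ)^(β-1),   θ ∈ [0, 1]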
Beta distribution normalization constant

• Is there a closed-form solution to this integral?

  (Yes: the beta function)
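  Assuming the unnormalized form above, the normalizer is:

  Z = ∫_0^1 θ^(α-1) (1 - θ)^(β-1) dθ = B(α, β) = Γ(α) Γ(β) / Γ(α + β)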
Gamma function generalizes factorial

• More generally,

• The properties you might need in this course:
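  The standard identities (assuming the usual notation, with ψ denoting the digamma function):

  Γ(n) = (n - 1)!   for positive integers n
  Γ(x) = ∫_0^∞ t^(x-1) e^(-t) dt
  Γ(x + 1) = x Γ(x)
  ψ(x) = d/dx log Γ(x)   (the digamma function)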
Putting it all together
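  Combining the pieces above (assuming the same notation), the model is:

  θ ~ Beta(α, β)
  x_i | θ ~ Bernoulli(θ),   i = 1, …, N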
Beta-Bernoulli conjugacy
• Prior:

• Likelihood:

• Posterior:
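  Assuming the same notation (N1 heads, N0 tails, prior hyper-parameters α, β):

  Prior:       p(θ) = Beta(θ | α, β) ∝ θ^(α-1) (1 - θ)^(β-1)
  Likelihood:  p(x | θ) = θ^(N1) (1 - θ)^(N0)
  Posterior:   p(θ | x) ∝ θ^(N1+α-1) (1 - θ)^(N0+β-1),  i.e.  θ | x ~ Beta(α + N1, β + N0)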
Beta-Bernoulli conjugacy

• The posterior is another beta!

• The counts got added to the prior counts!
Shortcut – ignore normalizers

• Recognize this as an unnormalized beta distribution. The normalization constant must be the normalizer for a beta distribution.
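  In other words, once the posterior is recognized as proportional to θ^(N1+α-1) (1 - θ)^(N0+β-1), its normalizer must be B(α + N1, β + N0); there is no need to compute the integral explicitly.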
Interpretation of prior hyper-parameters

• The prior gets added to the counts, so we can think of it as “pseudo-counts” of previous “observations.”

• The pseudo-counts can be fractional.
[Figure: example beta densities for Beta(1,1), Beta(1,10), Beta(2,2), Beta(2,3), and Beta(50,5)]
Symmetric beta distributions with hyper-parameters less than one are U-shaped

[Figure: Beta(0.5, 0.5) and Beta(0.005, 0.005) densities]
Beta-Bernoulli posterior predictive

• In this case, the posterior predictive probability of heads is the posterior mean of θ.
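  Assuming the posterior Beta(α + N1, β + N0) derived above:

  p(x_new = 1 | x) = ∫ θ p(θ | x) dθ = E[θ | x] = (N1 + α) / (N + α + β)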
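A minimal sketch in code of the Beta-Bernoulli update and posterior predictive; the prior hyper-parameters and data below are illustrative, not from the lecture:

def beta_bernoulli_posterior(data, alpha=1.0, beta=1.0):
    """Posterior hyper-parameters given 0/1 data: counts get added to pseudo-counts."""
    n1 = sum(data)             # number of heads
    n0 = len(data) - n1        # number of tails
    return alpha + n1, beta + n0

def predictive_prob_heads(data, alpha=1.0, beta=1.0):
    """P(next toss = heads | data) = posterior mean of theta."""
    a_post, b_post = beta_bernoulli_posterior(data, alpha, beta)
    return a_post / (a_post + b_post)

tosses = [1, 1, 0, 1, 0, 1, 1]              # example data: 5 heads, 2 tails
print(beta_bernoulli_posterior(tosses))     # (6.0, 3.0) under a Beta(1, 1) prior
print(predictive_prob_heads(tosses))        # 6 / 9 ≈ 0.667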
Dirichlet-multinomial model
• Multinomial distribution: roll a die N times; how many of each face?

• The data are the counts, not the sequence of draws, so we need to multiply by a combinatorial term

• The Dirichlet distribution is the conjugate prior for the multinomial (and discrete, a.k.a. categorical, a.k.a. multinoulli) distributions
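  Assuming K faces with probabilities θ = (θ_1, …, θ_K) and counts N_1, …, N_K summing to N:

  p(N_1, …, N_K | θ, N) = (N! / (N_1! ⋯ N_K!)) ∏_k θ_k^(N_k)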
Dirichlet distribution
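  Assuming concentration parameters α = (α_1, …, α_K) with α_k > 0, for θ on the probability simplex (θ_k ≥ 0, Σ_k θ_k = 1):

  p(θ | α) = (1 / B(α)) ∏_k θ_k^(α_k - 1),   where  B(α) = ∏_k Γ(α_k) / Γ(Σ_k α_k)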
Multinomial distribution as an urn process
• Place colored balls in an urn, where the number of balls of each color k is proportional to θ_k
  – For each of N observations:
    • Draw a ball from the urn and observe its color k
    • Add one to the count of that color, N_k
    • Place the ball back in the urn
Posterior predictive of the Dirichlet-multinomial model: Polya urn
• Place colored balls in an urn, with α_k balls of each color k, where α is the Dirichlet prior vector
  – For each of N observations:
    • Draw a ball from the urn and observe its color k
    • Add one to the count of that color, N_k
    • Place the ball back, along with a new ball of the same color
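  Assuming prior Dirichlet(α) and observed counts N_k, the posterior predictive for the next draw is:

  p(x_new = k | x) = (N_k + α_k) / (N + Σ_j α_j)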
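A minimal sketch in code of the Polya urn sampler for this predictive process; the prior vector and number of draws are illustrative:

import random

def polya_urn_sample(alpha, n_draws, seed=0):
    """Sequentially draw colors with probability (N_k + alpha_k) / (N + sum(alpha))."""
    rng = random.Random(seed)
    counts = [0] * len(alpha)
    draws = []
    for _ in range(n_draws):
        weights = [a + c for a, c in zip(alpha, counts)]   # pseudo-counts + observed counts
        k = rng.choices(range(len(alpha)), weights=weights)[0]
        counts[k] += 1                                     # "add a ball of the same color"
        draws.append(k)
    return draws, counts

draws, counts = polya_urn_sample(alpha=[1.0, 1.0, 1.0], n_draws=10)
print(draws, counts)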
Dirichlet-multinomial text model
The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 5 22 570 12]

• Goals:
  – A simple generative model for text documents
  – Prediction of new words in a document
    • Text compression
    • Predictive text
    • Classification, clustering
    • Authorship identification
Dirichlet-multinomial text model
The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 5 22 570 12]

• Bag of words: represent a document by its count vector [N_1, N_2, …, N_V]

• Multinomial likelihood model

• Dirichlet prior
Dirichlet-multinomial text model
• Model:

• Dirichlet posterior

• Dirichlet-multinomial posterior predictive
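  Assuming a vocabulary of size V, word counts N_1, …, N_V for a document with N tokens, and prior Dirichlet(α_1, …, α_V):

  Model:                 θ ~ Dirichlet(α),   each word token w_i | θ ~ Discrete(θ)
  Posterior:             θ | w ~ Dirichlet(α_1 + N_1, …, α_V + N_V)
  Posterior predictive:  p(w_new = v | w) = (N_v + α_v) / (N + Σ_u α_u)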
Multinomial Naïve Bayes text classifier
• Goal: classification of documents
  – E.g. what is the subject/topic of this document?
  – Who wrote this document?
  – News article vs. scientific article vs. tweet?

• Model: use our simple Dirichlet-multinomial model for each class
Multinomial Naïve Bayes text classifier
• Model:

• Dirichlet posterior for each class’s parameters

• Use Bayes’ rule to classify, leveraging the posterior predictive distribution
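A minimal sketch in code of such a classifier, using the common Dirichlet-smoothed (posterior-mean) word probabilities; the class names, tokens, and α below are illustrative:

import math
from collections import Counter

def train(docs, labels, alpha=1.0):
    """docs: list of token lists. Returns per-class word counts, class counts, vocabulary."""
    word_counts = {}                      # class -> Counter of word counts
    class_counts = Counter(labels)
    for tokens, y in zip(docs, labels):
        word_counts.setdefault(y, Counter()).update(tokens)
    vocab = {w for tokens in docs for w in tokens}
    return word_counts, class_counts, vocab, alpha

def predict(tokens, model):
    """Pick the class maximizing log p(class) + sum over words of log p(word | class)."""
    word_counts, class_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    scores = {}
    for y, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(class_counts[y] / n_docs)
        for w in tokens:
            # smoothed estimate: (N_wy + alpha) / (N_y + alpha * |V|)
            score += math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)

model = train([["the", "quick", "fox"], ["buy", "cheap", "pills"]], ["news", "spam"])
print(predict(["quick", "fox"], model))   # -> "news"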
Unsupervised multinomial Naïve Bayes
• Goal: unsupervised analysis, clustering of documents

• Model: the same generative model, but we don’t observe class labels
Unsupervised multinomial Naïve Bayes
• Model:

• Dirichlet posterior for each class’s parameters, if we knew the latent class assignments

• Need to use approximate inference: EM, MCMC, or variational Bayes (VB)
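  Assuming K latent classes with mixing weights π and per-class word distributions θ_k, the generative process for document d is:

  z_d ~ Discrete(π),   each word w_{d,i} | z_d ~ Discrete(θ_{z_d})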
Dirichlet-multinomial models as building blocks
• Starting from these simple models, we can extend them to develop:
  – Sophisticated language models
    • Markov dependencies, n-grams, probabilistic context-free grammars, topic models
  – Biological sequence models
    • HMMs whose outputs are DNA/RNA/protein chains
  – Social network models
    • Multinomials draw latent group memberships for social actors in the network
Model-building in practice
• Construct a generative model of your data that captures reasonable assumptions

• Define the data of your problem
  – What is the observed data I want to model?
  – What data do I want to predict?
  – Are there latent variables that would help with modeling, or which I’d like to infer?

• Start with the simplest model you can think of

• “All models are wrong, some models are useful” – G.E. Box
Build a model with standard distributions as building blocks
• Binary variables as coin flips (Bernoulli)?
• Discrete variables as die rolls (categorical)?
• Count vectors as multinomials?
• Continuous variables as Gaussian?
• Dependent latent variables as an HMM chain?
Formulating a generative model
• Having defined our data, latent variables, observed variables, and target variables, define their dependencies.

• An assumed generative process will define the model. Its structure will (typically) be a directed graphical model with a DAG structure.
The art of latent variable modeling: Box’s loop

[Diagram: Box’s loop. Complicated, noisy, high-dimensional data feed a latent variable model (an (algorithm, model) pair carefully co-designed for tractability), which yields low-dimensional, semantically meaningful representations used to explore and predict; then evaluate, understand, and iterate.]
