
CSE291D Lecture 3

Conjugate Priors
Generative Models for Discrete Data

Participation grades: Poll Everywhere
• I will be recording participation in polls from now on, for your 5% participation grade

• Be sure to log in using the sign-in associated with your UCSD ID!

• If you can’t bring a digital device to class to use Poll Everywhere, let me know.
Learning outcomes
By the end of the lesson, you should be able to:

• Derive posterior distributions for simple conjugate models

• Compute posterior predictive distributions for simple conjugate models

• Build and infer models of discrete data sets using the Beta-Bernoulli, Dirichlet-multinomial, and naïve Bayes-based models
Bayesian statistics: Recap
• Write down your prior beliefs, write down your likelihood, and apply Bayes’ rule:
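  For reference, Bayes’ rule for the posterior (assuming data D and parameters θ):

  p(θ | D) = p(D | θ) p(θ) / p(D)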

• Bayesian inference provides a distribution over parameters, the posterior distribution, not just a point estimate
Conjugate priors
• How should we select our prior distribution?
  – Conjugate priors are a mathematically convenient choice

• Conjugate prior:
  – The posterior is in the same family of distributions as the prior
Why conjugate priors are important
• Tractability for simple models
  – Closed-form solutions to the posterior from Bayes’ rule

• For more complicated models, conjugate priors will also be valuable for approximate inference techniques such as Markov chain Monte Carlo
  – Simple, fast updates
  – Better mixing time (speed of convergence)
Bernoulli model
• Flip a biased coin: what is the probability of heads?

• Likelihood for iid coin tosses
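  Assuming θ = p(heads), tosses x_i ∈ {0, 1}, with N1 heads and N0 tails out of N tosses, the likelihood is:

  p(x_1, …, x_N | θ) = ∏_i θ^(x_i) (1 - θ)^(1 - x_i) = θ^(N1) (1 - θ)^(N0)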
Bernoulli model
• Note how the likelihood worked out nicely for iid draws, since we could add up exponents

• Wouldn’t it be convenient if we could do this for the prior, too?
Beta distribution
• The distribution over [0,1] which is of the right form so that we can add the exponents again

• Normalization constant

  What must Z look like?
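  Assuming hyper-parameters α, β > 0, the beta density has the form:

  p(θ | α, β) = (1/Z) θ^(α-1) (1 - θ)^(β-1),   θ ∈ [0, 1]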
Beta distribution normalization constant

• Is there a closed-form solution to this integral?

  (Yes: the beta function)
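  Assuming the unnormalized form above, the normalizer is:

  Z = ∫_0^1 θ^(α-1) (1 - θ)^(β-1) dθ = B(α, β) = Γ(α) Γ(β) / Γ(α + β)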
Gamma function generalizes factorial

• More generally,

• The properties you might need in this course:
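  The standard identities (assuming the usual notation, with ψ denoting the digamma function):

  Γ(n) = (n - 1)!   for positive integers n
  Γ(x) = ∫_0^∞ t^(x-1) e^(-t) dt
  Γ(x + 1) = x Γ(x)
  ψ(x) = d/dx log Γ(x)   (the digamma function)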
Putting it all together
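  Combining the pieces above (assuming the same notation), the model is:

  θ ~ Beta(α, β)
  x_i | θ ~ Bernoulli(θ),   i = 1, …, N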
Beta-Bernoulli conjugacy
• Prior:

• Likelihood:

• Posterior:
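  Assuming the same notation (N1 heads, N0 tails, prior hyper-parameters α, β):

  Prior:       p(θ) = Beta(θ | α, β) ∝ θ^(α-1) (1 - θ)^(β-1)
  Likelihood:  p(x | θ) = θ^(N1) (1 - θ)^(N0)
  Posterior:   p(θ | x) ∝ θ^(N1+α-1) (1 - θ)^(N0+β-1),  i.e.  θ | x ~ Beta(α + N1, β + N0)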
Beta-Bernoulli conjugacy

• The posterior is another beta!

• The counts got added to the prior counts!
Shortcut – ignore normalizers

• Recognize this as an unnormalized beta distribution. The normalization constant must be the normalizer for a beta distribution.
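  In other words, once the posterior is recognized as proportional to θ^(N1+α-1) (1 - θ)^(N0+β-1), its normalizer must be B(α + N1, β + N0); there is no need to compute the integral explicitly.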
Interpretation of prior hyper-parameters

• The prior gets added to the counts, so we can think of it as “pseudo-counts” of previous “observations.”

• The pseudo-counts can be fractional.
[Figure: example beta densities for Beta(1,1), Beta(1,10), Beta(2,2), Beta(2,3), and Beta(50,5)]
Symmetric beta distributions with hyper-parameters less than one are U-shaped

[Figure: Beta(0.5, 0.5) and Beta(0.005, 0.005) densities]
Beta-Bernoulli posterior predictive

• In this case, the posterior predictive probability of heads is the posterior mean of θ.
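  Assuming the posterior Beta(α + N1, β + N0) derived above:

  p(x_new = 1 | x) = ∫ θ p(θ | x) dθ = E[θ | x] = (N1 + α) / (N + α + β)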
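A minimal sketch in code of the Beta-Bernoulli update and posterior predictive; the prior hyper-parameters and data below are illustrative, not from the lecture:

def beta_bernoulli_posterior(data, alpha=1.0, beta=1.0):
    """Posterior hyper-parameters given 0/1 data: counts get added to pseudo-counts."""
    n1 = sum(data)             # number of heads
    n0 = len(data) - n1        # number of tails
    return alpha + n1, beta + n0

def predictive_prob_heads(data, alpha=1.0, beta=1.0):
    """P(next toss = heads | data) = posterior mean of theta."""
    a_post, b_post = beta_bernoulli_posterior(data, alpha, beta)
    return a_post / (a_post + b_post)

tosses = [1, 1, 0, 1, 0, 1, 1]              # example data: 5 heads, 2 tails
print(beta_bernoulli_posterior(tosses))     # (6.0, 3.0) under a Beta(1, 1) prior
print(predictive_prob_heads(tosses))        # 6 / 9 ≈ 0.667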
Dirichlet-multinomial model
• Multinomial distribution: roll a die N times; how many of each face?

• The data are the counts, not the sequence of draws, so we need to multiply by a combinatorial term

• The Dirichlet distribution is the conjugate prior for the multinomial (and discrete, a.k.a. categorical, a.k.a. multinoulli) distributions
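  Assuming K faces with probabilities θ = (θ_1, …, θ_K) and counts N_1, …, N_K summing to N:

  p(N_1, …, N_K | θ, N) = (N! / (N_1! ⋯ N_K!)) ∏_k θ_k^(N_k)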
Dirichlet distribution
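  Assuming concentration parameters α = (α_1, …, α_K) with α_k > 0, for θ on the probability simplex (θ_k ≥ 0, Σ_k θ_k = 1):

  p(θ | α) = (1 / B(α)) ∏_k θ_k^(α_k - 1),   where  B(α) = ∏_k Γ(α_k) / Γ(Σ_k α_k)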
Multinomial distribution as an urn process
• Place colored balls in an urn, where the number of balls of each color k is proportional to θ_k
  – For each of N observations:
    • Draw a ball from the urn and observe its color k
    • Add one to the count of that color, N_k
    • Place the ball back in the urn
Posterior predictive of the Dirichlet-multinomial model: Polya urn
• Place colored balls in an urn, with α_k balls of each color k, where α is the Dirichlet prior vector
  – For each of N observations:
    • Draw a ball from the urn and observe its color k
    • Add one to the count of that color, N_k
    • Place the ball back, along with a new ball of the same color
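  Assuming prior Dirichlet(α) and observed counts N_k, the posterior predictive for the next draw is:

  p(x_new = k | x) = (N_k + α_k) / (N + Σ_j α_j)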
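A minimal sketch in code of the Polya urn sampler for this predictive process; the prior vector and number of draws are illustrative:

import random

def polya_urn_sample(alpha, n_draws, seed=0):
    """Sequentially draw colors with probability (N_k + alpha_k) / (N + sum(alpha))."""
    rng = random.Random(seed)
    counts = [0] * len(alpha)
    draws = []
    for _ in range(n_draws):
        weights = [a + c for a, c in zip(alpha, counts)]   # pseudo-counts + observed counts
        k = rng.choices(range(len(alpha)), weights=weights)[0]
        counts[k] += 1                                     # "add a ball of the same color"
        draws.append(k)
    return draws, counts

draws, counts = polya_urn_sample(alpha=[1.0, 1.0, 1.0], n_draws=10)
print(draws, counts)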
Dirichlet-multinomial text model
The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 5 22 570 12]

• Goals:
  – A simple generative model for text documents
  – Prediction of new words in a document
    • Text compression
    • Predictive text
    • Classification, clustering
    • Authorship identification
Dirichlet-multinomial text model
The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 5 22 570 12]

• Bag of words: represent a document by its count vector [N_1, N_2, …, N_V]

• Multinomial likelihood model

• Dirichlet prior
Dirichlet-multinomial text model
• Model:

• Dirichlet posterior

• Dirichlet-multinomial posterior predictive
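  Assuming a vocabulary of size V, word counts N_1, …, N_V for a document with N tokens, and prior Dirichlet(α_1, …, α_V):

  Model:                 θ ~ Dirichlet(α),   each word token w_i | θ ~ Discrete(θ)
  Posterior:             θ | w ~ Dirichlet(α_1 + N_1, …, α_V + N_V)
  Posterior predictive:  p(w_new = v | w) = (N_v + α_v) / (N + Σ_u α_u)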
Multinomial Naïve Bayes text classifier
• Goal: classification of documents
  – E.g. what is the subject/topic of this document?
  – Who wrote this document?
  – News article vs. scientific article vs. tweet?

• Model: use our simple Dirichlet-multinomial model for each class
Multinomial Naïve Bayes text classifier
• Model:

• Dirichlet posterior for each class’s parameters

• Use Bayes’ rule to classify, leveraging the posterior predictive distribution
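A minimal sketch in code of such a classifier, using the common Dirichlet-smoothed (posterior-mean) word probabilities; the class names, tokens, and α below are illustrative:

import math
from collections import Counter

def train(docs, labels, alpha=1.0):
    """docs: list of token lists. Returns per-class word counts, class counts, vocabulary."""
    word_counts = {}                      # class -> Counter of word counts
    class_counts = Counter(labels)
    for tokens, y in zip(docs, labels):
        word_counts.setdefault(y, Counter()).update(tokens)
    vocab = {w for tokens in docs for w in tokens}
    return word_counts, class_counts, vocab, alpha

def predict(tokens, model):
    """Pick the class maximizing log p(class) + sum over words of log p(word | class)."""
    word_counts, class_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    scores = {}
    for y, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(class_counts[y] / n_docs)
        for w in tokens:
            # smoothed estimate: (N_wy + alpha) / (N_y + alpha * |V|)
            score += math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)

model = train([["the", "quick", "fox"], ["buy", "cheap", "pills"]], ["news", "spam"])
print(predict(["quick", "fox"], model))   # -> "news"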
Unsupervised multinomial Naïve Bayes
• Goal: unsupervised analysis, clustering of documents

• Model: the same generative model, but we don’t observe class labels
Unsupervised multinomial Naïve Bayes
• Model:

• Dirichlet posterior for each class’s parameters, if we knew the latent class assignments

• Need to use approximate inference: EM, MCMC, or variational Bayes (VB)
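  Assuming K latent classes with mixing weights π and per-class word distributions θ_k, the generative process for document d is:

  z_d ~ Discrete(π),   each word w_{d,i} | z_d ~ Discrete(θ_{z_d})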
Dirichlet-multinomial models as building blocks
• Starting from these simple models, we can extend them to develop:
  – Sophisticated language models
    • Markov dependencies, n-grams, probabilistic context-free grammars, topic models
  – Biological sequence models
    • HMMs whose outputs are DNA/RNA/protein chains
  – Social network models
    • Multinomials draw latent group memberships for social actors in the network
Model-building in practice
• Construct a generative model of your data that captures reasonable assumptions

• Define the data of your problem
  – What is the observed data I want to model?
  – What data do I want to predict?
  – Are there latent variables that would help with modeling, or which I’d like to infer?

• Start with the simplest model you can think of

• “All models are wrong, some models are useful” – G.E. Box
Build a model with standard distributions as building blocks
• Binary variables as coin flips (Bernoulli)?
• Discrete variables as die rolls (categorical)?
• Count vectors as multinomials?
• Continuous variables as Gaussian?
• Dependent latent variables as an HMM chain?
Formulating a generative model
• Having defined our data, latent variables, observed variables, and target variables, define their dependencies.

• An assumed generative process will define the model. Its structure will (typically) be a directed graphical model with a DAG structure.
The art of latent variable modeling: Box’s loop

[Diagram: Box’s loop. Complicated, noisy, high-dimensional data feed a latent variable model (an (algorithm, model) pair carefully co-designed for tractability), which yields low-dimensional, semantically meaningful representations used to explore and predict; then evaluate, understand, and iterate.]
