
CSE291D Lecture 10

Evaluating Unsupervised Models

Announcements
• You have until Tuesday to submit Homework 2
– You can submit now if you want to

• Homework 1 is graded, and will be returned to you shortly, once the grades are entered.

The art of latent variable modeling:
Box’s loop
[Figure: Box’s loop. Complicated, noisy, high-dimensional data feed into a latent variable model (an (algorithm, model) pair carefully co-designed for tractability), which produces low-dimensional, semantically meaningful representations; we use these to evaluate, understand, explore, and predict, and then iterate on the model.]

Goals of evaluation

• Model criticism, for improving / refining

• Validation, sanity checking

• Usefulness at a task

• Comparison to competing methods

Evaluation of supervised models
[Figure: a data matrix X (N data points × D features) alongside a column Y of labels or regression outputs. The rows are split into a training set and a test set (or cross-validated). Fit on the training rows, predict the test labels, and measure accuracy or loss.]

Evaluation of unsupervised models
[Figure: the same train/test split of X (D features), but the model now infers latent variables Z rather than predicting labels Y. No labels are available. How to evaluate?]

Learning outcomes
By the end of the lesson, you should be able to:

• Evaluate unsupervised latent variable models using appropriate techniques

• Compute the log-likelihood of held-out data via annealed importance sampling

• Construct appropriate discrepancy functions for posterior predictive checks

Evaluation of unsupervised models

• Quantitative evaluation
– Measurable, quantifiable performance metrics

• Qualitative evaluation
– Exploratory data analysis (EDA) using the model
– Human evaluation, user studies,…
Evaluation of unsupervised models

• Intrinsic evaluation
– Measure inherently good properties of the model
• Fit to the data, interpretability,…

• Extrinsic evaluation
– Study usefulness of model for external tasks
• Classification, retrieval, part of speech tagging,…

Extrinsic evaluation:
What will you use your model for?
• If you have a downstream task in mind, you should probably evaluate based on it!

• Even if you don’t, you could contrive one for evaluation purposes

• E.g. use latent representations for:
– Classification, regression, retrieval, ranking…

Extrinsic evaluation example: document retrieval
• Goal: retrieve relevant documents given a query
• Query likelihood model:
– Each document has a language model
– Score documents by the likelihood of generating the query (a minimal scoring sketch follows below)

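As an illustration of query-likelihood scoring, here is a minimal sketch using a Dirichlet-smoothed unigram language model per document. The function names and the smoothing parameter mu are illustrative assumptions, not from the slides.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, collection_counts,
                         collection_len, mu=2000.0):
    """Score one document: log p(query | document language model).

    Dirichlet-smoothed unigram model (an assumed choice):
    p(w | d) = (count(w, d) + mu * p(w | collection)) / (|d| + mu).
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_coll = collection_counts.get(w, 0) / collection_len
        p_w = (doc_counts.get(w, 0) + mu * p_coll) / (doc_len + mu)
        if p_w == 0.0:   # query term unseen anywhere in the collection: skip it
            continue
        score += math.log(p_w)
    return score

def rank(query, docs):
    """Rank documents (lists of tokens) by the log-likelihood of the query."""
    all_tokens = [w for d in docs for w in d]
    coll_counts, coll_len = Counter(all_tokens), len(all_tokens)
    scored = [(query_log_likelihood(query, d, coll_counts, coll_len), i)
              for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)
```

To use this for extrinsic evaluation of a latent variable model, the per-document language model would be the one implied by the model (e.g. a topic model's word distribution for that document) rather than raw counts.
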
Prediction of held-out data
[Figure: the train/test split of X again. Model parameters Θ are learned from the training rows (together with their latent variables Z), and the fitted model is then asked to predict the held-out test rows.]

Prediction of held-out data
• Compute the log-probability of held-out data under the posterior predictive distribution

• In practice, use MCMC to draw posterior samples of Θ.

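In symbols, with S posterior samples Θ^(s) drawn by MCMC from p(Θ | X_train), the quantity being estimated is the standard Monte Carlo approximation (notation mine, not from the slides):

```latex
\log p(\mathbf{X}_{\text{test}} \mid \mathbf{X}_{\text{train}})
  = \log \int p(\mathbf{X}_{\text{test}} \mid \Theta)\, p(\Theta \mid \mathbf{X}_{\text{train}})\, d\Theta
  \;\approx\; \log \frac{1}{S} \sum_{s=1}^{S} p(\mathbf{X}_{\text{test}} \mid \Theta^{(s)}).
```
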
Computing log-likelihood
from posterior samples

Computing the likelihood with latent variables
• We may need to marginalize out the latent variables to compute the likelihood:

• Sometimes (e.g. mixture models), we can do this analytically, but not always (e.g. topic models).

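The equation missing from this slide is presumably the generic marginalization over the latent variables (notation mine):

```latex
p(\mathbf{x} \mid \Theta)
  \;=\; \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \Theta)
  \;=\; \sum_{\mathbf{z}} p(\mathbf{z} \mid \Theta)\, p(\mathbf{x} \mid \mathbf{z}, \Theta),
```

with integrals replacing sums for continuous latent variables.
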
Computing the likelihood with latent variables
• Mixture model:

• Topic model:
– One topic (cluster assignment) per word, so the state space is exponential in document length

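Written out in standard notation (my reconstruction, not copied from the slides), the two cases are:

```latex
% Mixture model: marginalizing the cluster assignment is a sum over only K terms.
p(\mathbf{x}_i \mid \boldsymbol{\pi}, \boldsymbol{\theta})
  \;=\; \sum_{k=1}^{K} \pi_k \, p(\mathbf{x}_i \mid \theta_k)

% Topic model (LDA-style): one topic assignment z_{dn} per word, so the joint sum
% over assignments for a document of length N_d has K^{N_d} configurations:
p(\mathbf{w}_d \mid \alpha, \boldsymbol{\phi})
  \;=\; \int p(\theta_d \mid \alpha)
        \sum_{\mathbf{z}_d} \prod_{n=1}^{N_d} \theta_{d z_{dn}}\, \phi_{z_{dn},\, w_{dn}}
        \; d\theta_d .
```
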
Simple Monte Carlo estimator of the likelihood

• Pro: easy to do! Con: poor in high dimensions

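The simple Monte Carlo estimator referred to here draws the latent variables from the prior (notation mine):

```latex
\hat{p}(\mathbf{x}) \;=\; \frac{1}{S} \sum_{s=1}^{S} p(\mathbf{x} \mid \mathbf{z}^{(s)}),
\qquad \mathbf{z}^{(s)} \sim p(\mathbf{z}).
```

It is unbiased, but its variance explodes when the prior rarely proposes latent configurations that explain x well, which is the typical situation in high dimensions.
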
Importance sampling to estimate the likelihood

Likelihood = partition function (normalizing constant) of the unnormalized posterior over the latent variables, i.e. p(x, z) viewed as a function of z!

• Importance sampling can estimate a ratio of partition functions, so it can estimate this!

Importance sampling
• Can be used to estimate the ratio of partition functions between p(x) and q(x)

Importance sampling to estimate the likelihood
• Target distribution: the posterior p(z | x), whose unnormalized form is p(x, z) and whose partition function is the likelihood p(x)

• Proposal distribution: a normalized distribution q(z) that we can sample from

• Ratio of partition functions: Z_target / Z_proposal = p(x) / 1, i.e. the likelihood itself

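The underlying identity is the standard importance sampling estimator of a ratio of normalizing constants (notation mine):

```latex
\frac{Z_p}{Z_q}
  \;=\; \mathbb{E}_{q(\mathbf{z})}\!\left[\frac{\tilde{p}(\mathbf{z})}{q(\mathbf{z})}\right]
  \;\approx\; \frac{1}{S}\sum_{s=1}^{S} \frac{\tilde{p}(\mathbf{z}^{(s)})}{q(\mathbf{z}^{(s)})},
  \qquad \mathbf{z}^{(s)} \sim q(\mathbf{z}).
```

Here the unnormalized target is \tilde{p}(z) = p(x, z) and the proposal q(z) is normalized (Z_q = 1), so the average weight estimates Z_p = p(x).
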
How to choose a proposal distribution
• If we use the prior, we recover the simple Monte Carlo algorithm. Importance weights:

• But, we really need a better proposal that works well in high dimensions.

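With the prior as the proposal, the missing importance weights reduce to the likelihood term (my reconstruction):

```latex
w^{(s)} \;=\; \frac{\tilde{p}(\mathbf{z}^{(s)})}{q(\mathbf{z}^{(s)})}
        \;=\; \frac{p(\mathbf{x}, \mathbf{z}^{(s)})}{p(\mathbf{z}^{(s)})}
        \;=\; p(\mathbf{x} \mid \mathbf{z}^{(s)}),
  \qquad \mathbf{z}^{(s)} \sim p(\mathbf{z}),
```

which is exactly the simple Monte Carlo estimator from the previous slides.
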
Annealed importance sampling
(Neal, 2001)

• Scales up importance sampling to high-dimensional data, using MCMC

• Corrects for MCMC convergence failures using importance weights

Annealed importance sampling
Technical details
• Need:
– Target distribution
– Initial distribution, easy to sample from
– A sequence of intermediate distributions, interpolating between the initial and target distributions
– Markov chain transitions that leave each of these distributions invariant (stationary)

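A common concrete choice for the intermediate distributions is Neal's geometric path; the notation below (f_0 for the easy initial distribution, f_n for the unnormalized target, T_j for an MCMC transition leaving f_j invariant) is my reconstruction, used consistently in the following slides:

```latex
f_j(\mathbf{z}) \;=\; f_0(\mathbf{z})^{\,1-\beta_j}\, f_n(\mathbf{z})^{\,\beta_j},
\qquad 0 = \beta_0 < \beta_1 < \dots < \beta_n = 1 .
```

For held-out likelihood estimation one can take f_0(z) = p(z) (the prior, normalized) and f_n(z) = p(x, z), so that Z_n / Z_0 = p(x).
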
Annealed importance sampling
Technical details

• Procedure: For each importance sample,
– Generate z_1 from the initial distribution f_0
– Generate z_2 from z_1 using T_1
– …, and finally generate z_n from z_{n-1} using T_{n-1}

Annealed importance sampling
Technical details

• This procedure generates from an importance sampling proposal distribution, where the proposal is the entire sequence (z_1, …, z_n).

• For this importance sampler, the target distribution is the reverse process, annealing (actually, heating) from the target distribution back to the initial distribution, built from the reversal of each Markov chain transition.

Annealed importance sampling
Technical details
• Compute importance weights for each sampled sequence

• Since (z_1, …, z_n) is an importance sample for the reverse process P, with the forward sequence Q as the proposal, the average weight is an unbiased estimate of the ratio of normalizing constants Z_P / Z_Q

• The normalizing constants of P and Q are the same as those of the target and initial distributions, so we have an unbiased estimate of Z_target / Z_initial, which here is the held-out likelihood

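To make the procedure concrete, here is a minimal AIS sketch using the geometric path above and a random-walk Metropolis transition. The toy example, the schedule, and the step size are illustrative assumptions, not the lecture's code. Each run returns log w with E[w] = Z_n / Z_0; for held-out likelihood, plug in log_f0 = log p(z) and log_fn = log p(x, z).

```python
import numpy as np

def ais_log_ratio(log_f0, log_fn, sample_f0, n_temps=100, n_mcmc=5,
                  step=0.5, rng=None):
    """One AIS run: log of an unbiased estimate of Z_n / Z_0.

    log_f0, log_fn: log unnormalized densities of the initial and target
    distributions; sample_f0(rng) draws an exact sample from the initial one.
    Intermediate distributions use the geometric path
        log f_beta(z) = (1 - beta) * log_f0(z) + beta * log_fn(z).
    """
    rng = np.random.default_rng() if rng is None else rng
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    z = sample_f0(rng)
    log_w = 0.0
    for j in range(1, n_temps + 1):
        # Accumulate the AIS weight: log f_{beta_j}(z) - log f_{beta_{j-1}}(z).
        log_w += (betas[j] - betas[j - 1]) * (log_fn(z) - log_f0(z))
        # Metropolis moves that leave f_{beta_j} invariant.
        for _ in range(n_mcmc):
            prop = z + step * rng.standard_normal(z.shape)
            log_acc = ((1 - betas[j]) * (log_f0(prop) - log_f0(z))
                       + betas[j] * (log_fn(prop) - log_fn(z)))
            if np.log(rng.uniform()) < log_acc:
                z = prop
    return log_w

if __name__ == "__main__":
    # Toy check: initial N(0, 1), target an unnormalized N(3, 0.5^2);
    # the true log normalizing-constant ratio is log(0.5).
    log_f0 = lambda z: -0.5 * np.sum(z ** 2)
    log_fn = lambda z: -0.5 * np.sum((z - 3.0) ** 2) / 0.5 ** 2
    sample_f0 = lambda rng: rng.standard_normal(1)
    logs = np.array([ais_log_ratio(log_f0, log_fn, sample_f0) for _ in range(50)])
    # Average the weights (not the log-weights) to form the final estimate.
    print("log Z_n/Z_0 estimate:", np.logaddexp.reduce(logs) - np.log(len(logs)))
```
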
Using external data for validation
• Sometimes we have external data as a proxy for ground truth, e.g. document metadata

• Sometimes we can obtain a small amount of labeled data for evaluation
– Crowdsourcing?

Example: modeling influence in citation networks
Foulds and Smyth (2013), EMNLP

• Which are the most important articles?

• What are the influence relationships between articles?

Topical influence regression
Foulds and Smyth (2013), EMNLP
[Figure: the model has latent variables for document influence and for citation edge influence.]

Model Validation Using Metadata:
Number of times the citation occurs in the text

Self citations
[Figure: results on the ACL corpus and the NIPS corpus.]

Example: Using labeled data

• Goal: Find problem-reporting forum posts for massive open online courses (MOOCs)

Example: Using labeled data
• Model: A topic model, with seeded topics and hierarchical structure enforced

• Since the topics are seeded, they align with the crowd-sourced labels, so the model can be evaluated on retrieval metrics (e.g. F1 score)

Variation of information
• Challenge: Latent variables are not always aligned with gold-standard labels

• E.g. in an HMM for part-of-speech tagging, the latent states are not aligned with POS tags.

• Solution 1: Manually align latent variables and labels

• Solution 2: Use an information-theoretic measure of the relationship between latent variables and labels

Variation of information

Treat each clustering as a discrete distribution over the cluster assignments

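The quantity this slide builds toward is presumably the standard variation of information between the model's clustering C and the gold-standard labeling C' (notation mine):

```latex
\mathrm{VI}(C, C') \;=\; H(C) + H(C') - 2\,I(C; C')
              \;=\; H(C \mid C') + H(C' \mid C).
```

It is a metric on partitions, is zero exactly when the two partitions agree, and requires no manual alignment of latent states to label values.
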
Posterior predictive checks
• Sampling data from the posterior predictive distribution allows us to “look into the mind of the model” – G. Hinton

“This use of the word mind is not intended to be metaphorical. We believe that a mental state is the state of a hypothetical, external world in which a high-level internal representation would constitute veridical perception. That hypothetical world is what the figure shows.” Geoff Hinton et al. (2006), A Fast Learning Algorithm for Deep Belief Nets.

Posterior predictive checks
• Does data drawn from the model differ from the observed data, in ways that we care about?

• PPC:
– Define a discrepancy function (a.k.a. test statistic) T(X).
• Like a test statistic for a p-value: how extreme is my data set?
– Simulate new data X^(rep) from the posterior predictive
• Use MCMC to sample parameters from the posterior, then simulate data
– Compute T(X^(rep)) and T(X) and compare. Repeat, to estimate the posterior predictive p-value P(T(X^(rep)) ≥ T(X) | X).

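A minimal sketch of the PPC loop described above; posterior_samples, simulate_data, and the example discrepancy are illustrative placeholders for whatever model is being checked, not code from the lecture.

```python
import numpy as np

def posterior_predictive_check(x_obs, posterior_samples, simulate_data,
                               discrepancy, rng=None):
    """Estimate the posterior predictive p-value P(T(X_rep) >= T(X) | X).

    posterior_samples: iterable of parameter draws (e.g. from MCMC).
    simulate_data(theta, rng): draws one replicated data set X_rep given theta.
    discrepancy(x): the test statistic T(.) we care about.
    """
    rng = np.random.default_rng() if rng is None else rng
    t_obs = discrepancy(x_obs)
    t_rep = np.array([discrepancy(simulate_data(theta, rng))
                      for theta in posterior_samples])
    return float(np.mean(t_rep >= t_obs)), t_obs, t_rep

# Example discrepancy in the spirit of Belin & Rubin (1995), discussed next:
# the largest within-subject variance of response times.
def max_subject_variance(x):
    # x: array of shape (n_subjects, n_trials)
    return np.max(np.var(x, axis=1))
```

Posterior predictive p-values very close to 0 or 1 indicate that the model systematically fails to reproduce that aspect of the data.
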
Example: Belin & Rubin (1995)
• Modeling response times for schizophrenics and non-schizophrenics at a task
• Three discrepancies:
– largest observed variance for schizophrenics
– smallest observed variance for schizophrenics
– average within-person variance across all subjects

Qualitative evaluation
• Modeling U.S. presidential State of the Union addresses

J. R. Foulds, S. H. Kumar, and L. Getoor. Latent topic networks: A versatile probabilistic programming framework for topic models. ICML, 2015.

Human evaluation

Best practices

• Baselines: compare to both strong and weak (simple) methods/models
– Can your complicated model actually beat a dumb heuristic? By how much?

• Evaluate with multiple experimental modalities
– e.g. held-out log-likelihood + extrinsic task + qualitative

“Jimmy’s law of evaluation”

[Plot: probability your paper is accepted at an ML conference, as a function of the number of experimental modalities (1 to 3).]

Think-pair-share:
• You have a Bayesian model of the density of geolocated Twitter tweets, per individual, over time, in Southern California. Your end goal is to use the model to detect identity fraud. How will you evaluate it?

• Come up with as many different experiments as you can.

Modeling human location data with mixtures of kernel densities. M. Lichman and P. Smyth. Proceedings of the 20th ACM SIGKDD Conference (KDD 2014).
