
CSE291D Lecture 10

Evaluating Unsupervised Models

Announcements
• You have until Tuesday to submit Homework 2
– You can submit now if you want to

• Homework 1 is graded, and will be returned to you shortly, once the grades are entered.

The art of latent variable modeling:
Box’s loop
[Figure: Box’s loop. Complicated, noisy, high-dimensional data feed into a latent variable model (an (algorithm, model) pair carefully co-designed for tractability), which produces low-dimensional, semantically meaningful representations; we use these to evaluate, understand, explore, and predict, and then iterate on the model.]

Goals of evaluation

• Model criticism, for improving / refining

• Validation, sanity checking

• Usefulness at a task

• Comparison to competing methods

Evaluation of supervised models
[Figure: a data matrix X (N data points × D features) alongside a column Y of labels or regression outputs. The rows are split into a training set and a test set (or cross-validated). Fit on the training rows, predict the test labels, and measure accuracy or loss.]

Evaluation of unsupervised models
[Figure: the same train/test split of X (D features), but the model now infers latent variables Z rather than predicting labels Y. No labels are available. How to evaluate?]

Learning outcomes
By the end of the lesson, you should be able to:

• Evaluate unsupervised latent variable models using appropriate techniques

• Compute the log-likelihood of held-out data via annealed importance sampling

• Construct appropriate discrepancy functions for posterior predictive checks

Evaluation of unsupervised models

• Quantitative evaluation
– Measurable, quantifiable performance metrics

• Qualitative evaluation
– Exploratory data analysis (EDA) using the model
– Human evaluation, user studies,…
Evaluation of unsupervised models

• Intrinsic evaluation
– Measure inherently good properties of the model
• Fit to the data, interpretability,…

• Extrinsic evaluation
– Study usefulness of model for external tasks
• Classification, retrieval, part of speech tagging,…

Extrinsic evaluation:
What will you use your model for?
• If you have a downstream task in mind, you should probably evaluate based on it!

• Even if you don’t, you could contrive one for evaluation purposes

• E.g. use latent representations for:
– Classification, regression, retrieval, ranking…

Extrinsic evaluation example: document retrieval
• Goal: retrieve relevant documents given a query
• Query likelihood model:
– Each document has a language model
– Score documents by the likelihood of generating the query (a minimal scoring sketch follows below)

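As an illustration of query-likelihood scoring, here is a minimal sketch using a Dirichlet-smoothed unigram language model per document. The function names and the smoothing parameter mu are illustrative assumptions, not from the slides.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, collection_counts,
                         collection_len, mu=2000.0):
    """Score one document: log p(query | document language model).

    Dirichlet-smoothed unigram model (an assumed choice):
    p(w | d) = (count(w, d) + mu * p(w | collection)) / (|d| + mu).
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_coll = collection_counts.get(w, 0) / collection_len
        p_w = (doc_counts.get(w, 0) + mu * p_coll) / (doc_len + mu)
        if p_w == 0.0:   # query term unseen anywhere in the collection: skip it
            continue
        score += math.log(p_w)
    return score

def rank(query, docs):
    """Rank documents (lists of tokens) by the log-likelihood of the query."""
    all_tokens = [w for d in docs for w in d]
    coll_counts, coll_len = Counter(all_tokens), len(all_tokens)
    scored = [(query_log_likelihood(query, d, coll_counts, coll_len), i)
              for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)
```

To use this for extrinsic evaluation of a latent variable model, the per-document language model would be the one implied by the model (e.g. a topic model's word distribution for that document) rather than raw counts.
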
Prediction of held-out data
[Figure: the train/test split of X again. Model parameters Θ are learned from the training rows (together with their latent variables Z), and the fitted model is then asked to predict the held-out test rows.]

Prediction of held-out data
• Compute the log-probability of held-out data under the posterior predictive distribution

• In practice, use MCMC to draw posterior samples of Θ.

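In symbols, with S posterior samples Θ^(s) drawn by MCMC from p(Θ | X_train), the quantity being estimated is the standard Monte Carlo approximation (notation mine, not from the slides):

```latex
\log p(\mathbf{X}_{\text{test}} \mid \mathbf{X}_{\text{train}})
  = \log \int p(\mathbf{X}_{\text{test}} \mid \Theta)\, p(\Theta \mid \mathbf{X}_{\text{train}})\, d\Theta
  \;\approx\; \log \frac{1}{S} \sum_{s=1}^{S} p(\mathbf{X}_{\text{test}} \mid \Theta^{(s)}).
```
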
Computing log-likelihood
from posterior samples

Computing the likelihood with latent variables
• We may need to marginalize out the latent variables to compute the likelihood:

• Sometimes (e.g. mixture models), we can do this analytically, but not always (e.g. topic models).

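The equation missing from this slide is presumably the generic marginalization over the latent variables (notation mine):

```latex
p(\mathbf{x} \mid \Theta)
  \;=\; \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \Theta)
  \;=\; \sum_{\mathbf{z}} p(\mathbf{z} \mid \Theta)\, p(\mathbf{x} \mid \mathbf{z}, \Theta),
```

with integrals replacing sums for continuous latent variables.
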
Computing the likelihood with latent variables
• Mixture model:

• Topic model:
– One topic (cluster assignment) per word, so the state space is exponential in document length

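Written out in standard notation (my reconstruction, not copied from the slides), the two cases are:

```latex
% Mixture model: marginalizing the cluster assignment is a sum over only K terms.
p(\mathbf{x}_i \mid \boldsymbol{\pi}, \boldsymbol{\theta})
  \;=\; \sum_{k=1}^{K} \pi_k \, p(\mathbf{x}_i \mid \theta_k)

% Topic model (LDA-style): one topic assignment z_{dn} per word, so the joint sum
% over assignments for a document of length N_d has K^{N_d} configurations:
p(\mathbf{w}_d \mid \alpha, \boldsymbol{\phi})
  \;=\; \int p(\theta_d \mid \alpha)
        \sum_{\mathbf{z}_d} \prod_{n=1}^{N_d} \theta_{d z_{dn}}\, \phi_{z_{dn},\, w_{dn}}
        \; d\theta_d .
```
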
Simple Monte Carlo estimator of the likelihood

• Pro: easy to do! Con: poor in high dimensions

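The simple Monte Carlo estimator referred to here draws the latent variables from the prior (notation mine):

```latex
\hat{p}(\mathbf{x}) \;=\; \frac{1}{S} \sum_{s=1}^{S} p(\mathbf{x} \mid \mathbf{z}^{(s)}),
\qquad \mathbf{z}^{(s)} \sim p(\mathbf{z}).
```

It is unbiased, but its variance explodes when the prior rarely proposes latent configurations that explain x well, which is the typical situation in high dimensions.
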
Importance sampling to estimate the likelihood

Likelihood = partition function (normalizing constant) of the unnormalized posterior over the latent variables, i.e. p(x, z) viewed as a function of z!

• Importance sampling can estimate a ratio of partition functions, so it can estimate this!

Importance sampling
• Can be used to estimate the ratio of partition functions between p(x) and q(x)

Importance sampling to estimate the likelihood
• Target distribution: the posterior p(z | x), whose unnormalized form is p(x, z) and whose partition function is the likelihood p(x)

• Proposal distribution: a normalized distribution q(z) that we can sample from

• Ratio of partition functions: Z_target / Z_proposal = p(x) / 1, i.e. the likelihood itself

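The underlying identity is the standard importance sampling estimator of a ratio of normalizing constants (notation mine):

```latex
\frac{Z_p}{Z_q}
  \;=\; \mathbb{E}_{q(\mathbf{z})}\!\left[\frac{\tilde{p}(\mathbf{z})}{q(\mathbf{z})}\right]
  \;\approx\; \frac{1}{S}\sum_{s=1}^{S} \frac{\tilde{p}(\mathbf{z}^{(s)})}{q(\mathbf{z}^{(s)})},
  \qquad \mathbf{z}^{(s)} \sim q(\mathbf{z}).
```

Here the unnormalized target is \tilde{p}(z) = p(x, z) and the proposal q(z) is normalized (Z_q = 1), so the average weight estimates Z_p = p(x).
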
How to choose a proposal distribution
• If we use the prior, we recover the simple Monte Carlo algorithm. Importance weights:

• But, we really need a better proposal that works well in high dimensions.

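With the prior as the proposal, the missing importance weights reduce to the likelihood term (my reconstruction):

```latex
w^{(s)} \;=\; \frac{\tilde{p}(\mathbf{z}^{(s)})}{q(\mathbf{z}^{(s)})}
        \;=\; \frac{p(\mathbf{x}, \mathbf{z}^{(s)})}{p(\mathbf{z}^{(s)})}
        \;=\; p(\mathbf{x} \mid \mathbf{z}^{(s)}),
  \qquad \mathbf{z}^{(s)} \sim p(\mathbf{z}),
```

which is exactly the simple Monte Carlo estimator from the previous slides.
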
Annealed importance sampling
(Neal, 2001)

• Scales up importance sampling to high-dimensional data, using MCMC

• Corrects for MCMC convergence failures using importance weights

Annealed importance sampling
Technical details
• Need:
– Target distribution
– Initial distribution, easy to sample from
– A sequence of intermediate distributions, interpolating between the initial and target distributions
– Markov chain transitions that leave each of these distributions invariant (stationary)

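A common concrete choice for the intermediate distributions is Neal's geometric path; the notation below (f_0 for the easy initial distribution, f_n for the unnormalized target, T_j for an MCMC transition leaving f_j invariant) is my reconstruction, used consistently in the following slides:

```latex
f_j(\mathbf{z}) \;=\; f_0(\mathbf{z})^{\,1-\beta_j}\, f_n(\mathbf{z})^{\,\beta_j},
\qquad 0 = \beta_0 < \beta_1 < \dots < \beta_n = 1 .
```

For held-out likelihood estimation one can take f_0(z) = p(z) (the prior, normalized) and f_n(z) = p(x, z), so that Z_n / Z_0 = p(x).
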
Annealed importance sampling
Technical details

• Procedure: For each importance sample,
– Generate z_1 from the initial distribution f_0
– Generate z_2 from z_1 using T_1
– …, and finally generate z_n from z_{n-1} using T_{n-1}

Annealed importance sampling
Technical details

• This procedure generates from an importance sampling proposal distribution, where the proposal is the entire sequence (z_1, …, z_n).

• For this importance sampler, the target distribution is the reverse process, annealing (actually, heating) from the target distribution back to the initial distribution, built from the reversal of each Markov chain transition.

Annealed importance sampling
Technical details
• Compute importance weights for each sampled sequence

• Since (z_1, …, z_n) is an importance sample for the reverse process P, with the forward sequence Q as the proposal, the average weight is an unbiased estimate of the ratio of normalizing constants Z_P / Z_Q

• The normalizing constants of P and Q are the same as those of the target and initial distributions, so we have an unbiased estimate of Z_target / Z_initial, which here is the held-out likelihood

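To make the procedure concrete, here is a minimal AIS sketch using the geometric path above and a random-walk Metropolis transition. The toy example, the schedule, and the step size are illustrative assumptions, not the lecture's code. Each run returns log w with E[w] = Z_n / Z_0; for held-out likelihood, plug in log_f0 = log p(z) and log_fn = log p(x, z).

```python
import numpy as np

def ais_log_ratio(log_f0, log_fn, sample_f0, n_temps=100, n_mcmc=5,
                  step=0.5, rng=None):
    """One AIS run: log of an unbiased estimate of Z_n / Z_0.

    log_f0, log_fn: log unnormalized densities of the initial and target
    distributions; sample_f0(rng) draws an exact sample from the initial one.
    Intermediate distributions use the geometric path
        log f_beta(z) = (1 - beta) * log_f0(z) + beta * log_fn(z).
    """
    rng = np.random.default_rng() if rng is None else rng
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    z = sample_f0(rng)
    log_w = 0.0
    for j in range(1, n_temps + 1):
        # Accumulate the AIS weight: log f_{beta_j}(z) - log f_{beta_{j-1}}(z).
        log_w += (betas[j] - betas[j - 1]) * (log_fn(z) - log_f0(z))
        # Metropolis moves that leave f_{beta_j} invariant.
        for _ in range(n_mcmc):
            prop = z + step * rng.standard_normal(z.shape)
            log_acc = ((1 - betas[j]) * (log_f0(prop) - log_f0(z))
                       + betas[j] * (log_fn(prop) - log_fn(z)))
            if np.log(rng.uniform()) < log_acc:
                z = prop
    return log_w

if __name__ == "__main__":
    # Toy check: initial N(0, 1), target an unnormalized N(3, 0.5^2);
    # the true log normalizing-constant ratio is log(0.5).
    log_f0 = lambda z: -0.5 * np.sum(z ** 2)
    log_fn = lambda z: -0.5 * np.sum((z - 3.0) ** 2) / 0.5 ** 2
    sample_f0 = lambda rng: rng.standard_normal(1)
    logs = np.array([ais_log_ratio(log_f0, log_fn, sample_f0) for _ in range(50)])
    # Average the weights (not the log-weights) to form the final estimate.
    print("log Z_n/Z_0 estimate:", np.logaddexp.reduce(logs) - np.log(len(logs)))
```
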
Using external data for validation
• Sometimes we have external data as a proxy for ground truth, e.g. document metadata

• Sometimes we can obtain a small amount of labeled data for evaluation
– Crowdsourcing?

Example: modeling influence in citation networks
Foulds and Smyth (2013), EMNLP

• Which are the most important articles?

• What are the influence relationships between articles?

Topical influence regression
Foulds and Smyth (2013), EMNLP
[Figure: the model has latent variables for document influence and for citation edge influence.]

Model Validation Using Metadata:
Number of times the citation occurs in the text

Self citations
[Figure: results on the ACL corpus and the NIPS corpus.]

Example: Using labeled data

• Goal: Find problem-reporting forum posts for massive open online courses (MOOCs)

Example: Using labeled data
• Model: A topic model, with seeded topics and hierarchical structure enforced

• Since the topics are seeded, they align with the crowd-sourced labels, so the model can be evaluated on retrieval metrics (e.g. F1 score)

Variation of information
• Challenge: Latent variables are not always aligned with gold-standard labels

• E.g. in an HMM for part-of-speech tagging, the latent states are not aligned with POS tags.

• Solution 1: Manually align latent variables and labels

• Solution 2: Use an information-theoretic measure of the relationship between latent variables and labels

Variation of information

Treat each clustering as a discrete distribution over the cluster assignments

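The quantity this slide builds toward is presumably the standard variation of information between the model's clustering C and the gold-standard labeling C' (notation mine):

```latex
\mathrm{VI}(C, C') \;=\; H(C) + H(C') - 2\,I(C; C')
              \;=\; H(C \mid C') + H(C' \mid C).
```

It is a metric on partitions, is zero exactly when the two partitions agree, and requires no manual alignment of latent states to label values.
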
Posterior predictive checks
• Sampling data from the posterior predictive distribution allows us to “look into the mind of the model” – G. Hinton

“This use of the word mind is not intended to be metaphorical. We believe that a mental state is the state of a hypothetical, external world in which a high-level internal representation would constitute veridical perception. That hypothetical world is what the figure shows.” Geoff Hinton et al. (2006), A Fast Learning Algorithm for Deep Belief Nets.

Posterior predictive checks
• Does data drawn from the model differ from the observed data, in ways that we care about?

• PPC:
– Define a discrepancy function (a.k.a. test statistic) T(X).
• Like a test statistic for a p-value: how extreme is my data set?
– Simulate new data X^(rep) from the posterior predictive
• Use MCMC to sample parameters from the posterior, then simulate data
– Compute T(X^(rep)) and T(X) and compare. Repeat, to estimate the posterior predictive p-value P(T(X^(rep)) ≥ T(X) | X).

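A minimal sketch of the PPC loop described above; posterior_samples, simulate_data, and the example discrepancy are illustrative placeholders for whatever model is being checked, not code from the lecture.

```python
import numpy as np

def posterior_predictive_check(x_obs, posterior_samples, simulate_data,
                               discrepancy, rng=None):
    """Estimate the posterior predictive p-value P(T(X_rep) >= T(X) | X).

    posterior_samples: iterable of parameter draws (e.g. from MCMC).
    simulate_data(theta, rng): draws one replicated data set X_rep given theta.
    discrepancy(x): the test statistic T(.) we care about.
    """
    rng = np.random.default_rng() if rng is None else rng
    t_obs = discrepancy(x_obs)
    t_rep = np.array([discrepancy(simulate_data(theta, rng))
                      for theta in posterior_samples])
    return float(np.mean(t_rep >= t_obs)), t_obs, t_rep

# Example discrepancy in the spirit of Belin & Rubin (1995), discussed next:
# the largest within-subject variance of response times.
def max_subject_variance(x):
    # x: array of shape (n_subjects, n_trials)
    return np.max(np.var(x, axis=1))
```

Posterior predictive p-values very close to 0 or 1 indicate that the model systematically fails to reproduce that aspect of the data.
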
Example: Belin & Rubin (1995)
• Modeling response times for schizophrenics and non-schizophrenics at a task
• Three discrepancies:
– largest observed variance for schizophrenics
– smallest observed variance for schizophrenics
– average within-person variance across all subjects

Qualitative evaluation
• Modeling U.S. presidential State of the Union addresses

J. R. Foulds, S. H. Kumar, and L. Getoor. Latent topic networks: A versatile probabilistic programming framework for topic models. ICML, 2015.

Human evaluation

Best practices

• Baselines: compare to both strong and weak (simple) methods/models
– Can your complicated model actually beat a dumb heuristic? By how much?

• Evaluate with multiple experimental modalities
– e.g. held-out log-likelihood + extrinsic task + qualitative

“Jimmy’s law of evaluation”

[Plot: probability your paper is accepted at an ML conference, as a function of the number of experimental modalities (1 to 3).]

Think-pair-share:
• You have a Bayesian model of the density of geolocated Twitter tweets, per individual, over time, in Southern California. Your end goal is to use the model to detect identity fraud. How will you evaluate it?

• Come up with as many different experiments as you can.

Modeling human location data with mixtures of kernel densities. M. Lichman and P. Smyth. Proceedings of the 20th ACM SIGKDD Conference (KDD 2014).
