Maxent Models and Discriminative Estimation
Generative vs. Discriminative models
Christopher Manning
Introduction
So far we've looked at generative models:
Language models, Naive Bayes
But there is now much use of conditional or discriminative probabilistic models in NLP, speech, and IR (and ML generally)
Because:
They give high accuracy performance
They make it easy to incorporate lots of linguistically important features
They allow automatic building of language-independent, retargetable NLP modules
Joint vs. Conditional Models
[Graphical models: in the generative model, the class c generates the data d1, d2, d3; in the discriminative model, the data d1, d2, d3 condition the class c.]
Conditional vs. Joint Likelihood
A joint model gives probabilities P(d,c) and tries
to maximize this joint likelihood.
It turns out to be trivial to choose weights: just relative
frequencies.
A conditional model gives probabilities P(c|d). It
takes the data as given and models only the
conditional probability of the class.
We seek to maximize conditional likelihood.
Harder to do (as we'll see)
More closely related to classification error (a code sketch contrasting the two objectives follows below).
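As a concrete illustration of the contrast, here is a minimal sketch assuming scikit-learn and a made-up four-document dataset: the same count features are estimated once with a joint objective (Naive Bayes) and once with a conditional objective (logistic regression).

```python
# Hypothetical toy data; the point is that both models see identical features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB        # joint: models P(d, c)
from sklearn.linear_model import LogisticRegression  # conditional: models P(c | d)

docs = ["in Monaco", "in Hong Kong", "Hong Kong harbour", "Monaco grand prix"]
labels = ["EUROPE", "ASIA", "ASIA", "EUROPE"]

X = CountVectorizer().fit_transform(docs)            # same bag-of-words features for both

joint = MultinomialNB().fit(X, labels)               # weights are just relative frequencies
conditional = LogisticRegression().fit(X, labels)    # weights maximize conditional likelihood

print(joint.predict(X))
print(conditional.predict(X))
```

The Naive Bayes weights fall out of relative frequencies, while the logistic regression weights are found by iteratively optimizing P(c|d).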
Conditional models work well: Word Sense Disambiguation
Even with exactly the same features, changing from joint to conditional estimation increases performance:

  Training Set Objective   Accuracy
  Joint Like.              86.8
  Cond. Like.              98.5
Features
In these slides and most maxent work: features f are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict.
A feature is a function with a bounded real value: f: C × D → ℝ
Example features (sketched in code below):
f1(c, d) ≡ [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]
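A sketch of these three features as Python indicator functions; the datum representation (a dict with the current and previous word) and the helper predicates are assumptions made for illustration.

```python
# Each feature maps a (class, datum) pair to a bounded real value,
# here 1.0 if the conjunction of conditions holds and 0.0 otherwise.
def is_capitalized(w):            # illustrative helper predicate
    return w[:1].isupper()

def has_accented_latin_char(w):   # illustrative helper predicate
    return any(ch in "áéíóúàèìòùâêîôûãõñç" for ch in w.lower())

def f1(c, d):  # d is assumed to be a dict holding the current and previous word
    return 1.0 if c == "LOCATION" and d["prev"] == "in" and is_capitalized(d["word"]) else 0.0

def f2(c, d):
    return 1.0 if c == "LOCATION" and has_accented_latin_char(d["word"]) else 0.0

def f3(c, d):
    return 1.0 if c == "DRUG" and d["word"].lower().endswith("c") else 0.0
```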
Feature-Based Linear Classifiers
Linear classifiers at classification time:
Linear function from feature sets {fi} to classes {c}.
Assign a weight λi to each feature fi.
We consider each class for an observed datum d.
For a pair (c, d), features vote with their weights:
vote(c) = Σi λi fi(c, d)
[Example: the word "Québec" with previous word "in", scored as LOCATION, PERSON, and DRUG.]
Choose the class c which maximizes Σi λi fi(c, d) (see the sketch below).
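A minimal voting-and-argmax sketch for the "in Québec" example, using the three features above with the weights that appear on the quiz slide below (1.8, -0.6, 0.3); the datum encoding is an assumption made for illustration.

```python
# Linear classification: each feature votes with its weight; pick the argmax class.
CLASSES = ["LOCATION", "PERSON", "DRUG"]

def features(c, d):
    # Compact restatement of f1, f2, f3 as (weight, fires?) pairs; weights from the quiz slide.
    return [
        (1.8,  c == "LOCATION" and d["prev"] == "in" and d["word"][:1].isupper()),
        (-0.6, c == "LOCATION" and any(ch in "áéíóúàèìòùâêîôû" for ch in d["word"].lower())),
        (0.3,  c == "DRUG" and d["word"].lower().endswith("c")),
    ]

def vote(c, d):
    return sum(w for w, fires in features(c, d) if fires)

d = {"prev": "in", "word": "Québec"}
print({c: vote(c, d) for c in CLASSES})          # LOCATION ~ 1.2, PERSON 0.0, DRUG 0.3
print(max(CLASSES, key=lambda c: vote(c, d)))    # LOCATION
```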
Quiz Question
Assuming exactly the same setup (3-class decision: LOCATION, PERSON, or DRUG; 3 features as before; maxent), what are:

P(PERSON | by Goéric) =
P(LOCATION | by Goéric) =
P(DRUG | by Goéric) =

  1.8   f1(c, d) ≡ [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
  -0.6  f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
  0.3   f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]

P(c | d, λ) = exp(Σi λi fi(c, d)) / Σc' exp(Σi λi fi(c', d))

[The word "Goéric" with previous word "by", scored as PERSON, LOCATION, and DRUG; a sketch of the computation follows below.]
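A sketch of the quiz computation. It assumes that for "by Goéric" only f2 fires (for LOCATION, accented character) and f3 fires (for DRUG, ends in "c"), while f1 does not fire because the previous word is "by" rather than "in".

```python
import math

# Total vote for each class on the datum "by Goéric":
# f1 needs the previous word to be "in", so it does not fire;
# f2 (accented Latin char) fires for LOCATION; f3 (ends in "c") fires for DRUG.
votes = {"PERSON": 0.0, "LOCATION": -0.6, "DRUG": 0.3}

Z = sum(math.exp(v) for v in votes.values())
probs = {c: math.exp(v) / Z for c, v in votes.items()}
print(probs)   # roughly PERSON 0.35, LOCATION 0.19, DRUG 0.47
```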
Building a Maxent Model
[NB model with a single feature X1 = M (the word "Monaco"): Class → M.
NB FACTORS: P(A), P(E), P(M|A), P(M|E).
PREDICTIONS: P(A,M), P(E,M), P(A|M), P(E|M).]
Text classification: Asia or Europe
[Training data: documents labeled Europe containing the word Monaco, and documents labeled Asia containing the words Monaco, Hong, and Kong.]
[NB model with two features X1 = H ("Hong") and X2 = K ("Kong"): Class → H, K.
NB FACTORS: P(A), P(E), P(H|A), P(K|A), P(H|E), P(K|E).
PREDICTIONS: P(A,H,K), P(E,H,K), P(A|H,K), P(E|H,K).]
[NB model with three features H ("Hong"), K ("Kong"), M ("Monaco"): Class → H, K, M.
NB FACTORS: P(A), P(E), P(M|A), P(M|E), P(H|A), P(K|A), P(H|E), P(K|E).
PREDICTIONS: P(A,H,K,M), P(E,H,K,M), P(A|H,K,M), P(E|H,K,M).]
Naive Bayes vs. Maxent Models
Naive Bayes models multi-count correlated evidence:
Each feature is multiplied in, even when you have multiple features telling you the same thing (see the toy sketch below).
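A toy illustration of the multi-counting problem, with made-up class priors and word probabilities: "Hong" and "Kong" always co-occur and carry the same evidence, yet Naive Bayes multiplies both factors in and becomes more confident than the evidence warrants.

```python
# Toy illustration of Naive Bayes multi-counting correlated evidence.
P_c = {"EUROPE": 0.5, "ASIA": 0.5}
P_word = {  # hypothetical per-class word probabilities
    "EUROPE": {"Monaco": 0.9, "Hong": 0.05, "Kong": 0.05},
    "ASIA":   {"Monaco": 0.4, "Hong": 0.3,  "Kong": 0.3},
}

def nb_posterior(words):
    scores = {c: P_c[c] for c in P_c}
    for c in scores:
        for w in words:
            scores[c] *= P_word[c][w]
    Z = sum(scores.values())
    return {c: s / Z for c, s in scores.items()}

print(nb_posterior(["Hong"]))          # evidence for ASIA counted once
print(nb_posterior(["Hong", "Kong"]))  # same evidence counted twice: a more extreme posterior
```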
The conditional log-likelihood of the model is:

log P(C|D,λ) = Σ_(c,d)∈(C,D) log P(c|d,λ) = Σ_(c,d)∈(C,D) log [ exp(Σi λi fi(c,d)) / Σc' exp(Σi λi fi(c',d)) ]

log P(C|D,λ) = Σ_(c,d)∈(C,D) log exp(Σi λi fi(c,d)) − Σ_(c,d)∈(C,D) log Σc' exp(Σi λi fi(c',d))

log P(C|D,λ) = N(λ) − M(λ)
The derivative is the difference between the
derivatives of each component
The Derivative I: Numerator
∂N(λ)/∂λi = ∂/∂λi Σ_(c,d)∈(C,D) log exp(Σi' λi' fi'(c,d))
          = ∂/∂λi Σ_(c,d)∈(C,D) Σi' λi' fi'(c,d)
          = Σ_(c,d)∈(C,D) ∂/∂λi Σi' λi' fi'(c,d)
          = Σ_(c,d)∈(C,D) fi(c,d)
          = actual count(fi, C)

The Derivative II: Denominator
∂M(λ)/∂λi = ∂/∂λi Σ_(c,d)∈(C,D) log Σc' exp(Σi' λi' fi'(c',d))
          = Σ_(c,d)∈(C,D) Σc' P(c'|d,λ) fi(c',d)
          = predicted count(fi, λ)
The Derivative III
∂ log P(C|D,λ) / ∂λi = actual count(fi, C) − predicted count(fi, λ)

The optimum parameters are the ones for which each feature's predicted expectation equals its empirical expectation. The optimum distribution is:
Always unique (but parameters may not be unique)
Always exists (if feature counts are from actual data).
These models are also called maximum entropy models because we find the model having maximum entropy and satisfying the constraints:
E_p(fj) = E_p̃(fj), ∀j
(A numerical check of this gradient appears in the sketch below.)
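A sketch that checks this derivative numerically on a tiny made-up problem (two classes, two indicator features, all illustrative): the analytic gradient, actual count minus predicted count, should match a finite-difference estimate of the conditional log-likelihood's slope.

```python
import math

CLASSES = [0, 1]
# Toy training data: (class, datum) pairs, and two illustrative indicator features.
DATA = [(0, "a"), (0, "b"), (1, "a")]
FEATS = [lambda c, d: 1.0 if c == 1 and d == "a" else 0.0,
         lambda c, d: 1.0 if c == 0 else 0.0]

def probs(d, lam):
    scores = [math.exp(sum(l * f(c, d) for l, f in zip(lam, FEATS))) for c in CLASSES]
    Z = sum(scores)
    return [s / Z for s in scores]

def cll(lam):   # conditional log-likelihood of the training data
    return sum(math.log(probs(d, lam)[c]) for c, d in DATA)

def grad(lam, i):   # actual count minus predicted count of feature i
    actual = sum(FEATS[i](c, d) for c, d in DATA)
    predicted = sum(p * FEATS[i](cp, d) for _, d in DATA
                    for cp, p in zip(CLASSES, probs(d, lam)))
    return actual - predicted

lam = [0.5, -0.2]
eps = 1e-6
numeric = (cll([lam[0] + eps, lam[1]]) - cll([lam[0] - eps, lam[1]])) / (2 * eps)
print(grad(lam, 0), numeric)   # the two values should agree
```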
Finding the optimal parameters
We want to choose parameters λ1, λ2, λ3, … that maximize the conditional log-likelihood of the training data:

CLogLik(D) = Σ_{i=1..n} log P(ci | di)

(A gradient-ascent sketch follows below.)
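A minimal gradient-ascent sketch for this objective on made-up Europe/Asia data; the data, features, learning rate, and iteration count are all illustrative, and a real implementation would typically use a quasi-Newton method such as L-BFGS.

```python
import math

CLASSES = ["EUROPE", "ASIA"]
DATA = [("EUROPE", ["Monaco"]),
        ("ASIA", ["Hong", "Kong"]),
        ("ASIA", ["Monaco", "Hong", "Kong"])]
# One indicator feature per (class, word) pair -- all illustrative.
FEATS = [(c, w) for c in CLASSES for w in ["Monaco", "Hong", "Kong"]]

def f(i, c, d):
    fc, fw = FEATS[i]
    return 1.0 if c == fc and fw in d else 0.0

def probs(d, lam):
    scores = [math.exp(sum(lam[i] * f(i, c, d) for i in range(len(FEATS)))) for c in CLASSES]
    Z = sum(scores)
    return dict(zip(CLASSES, (s / Z for s in scores)))

lam = [0.0] * len(FEATS)
for step in range(200):                       # plain gradient ascent on CLogLik(D)
    grad = [0.0] * len(FEATS)
    for c, d in DATA:
        p = probs(d, lam)
        for i in range(len(FEATS)):
            # actual count minus predicted count for this datum
            grad[i] += f(i, c, d) - sum(p[cp] * f(i, cp, d) for cp in CLASSES)
    lam = [l + 0.1 * g for l, g in zip(lam, grad)]

print({c: round(pc, 2) for c, pc in probs(["Monaco"], lam).items()})
```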
[Entropy example: H(p) is maximized by the most uncertain distribution, e.g. pHEADS = 0.5 for a fair coin, subject to the constraints that model feature expectations match empirical ones: E_p[fi] = E_p̃[fi].]
Adding constraints (features):
Lowers maximum entropy
Raises maximum likelihood of data
Brings the distribution further from uniform
[Example constraint: pHEADS = 0.3.]
Convexity
Convex: f(Σi wi xi) ≥ Σi wi f(xi), when Σi wi = 1
[Figure: a convex function vs. a non-convex function.]
Convexity guarantees a single, global maximum because any
higher points are greedily reachable.
Convexity II
Constrained H(p) = −Σ x log x is convex:
−x log x is convex
−Σ x log x is convex (sum of convex functions is convex).
The feasible region of constrained H is a linear subspace (which is
convex)
The constrained entropy surface is therefore convex.
The maximum likelihood exponential model (dual) formulation is also
convex.
Feature Overlap/Feature Interaction
Maxent models handle overlapping features well: adding a second copy of a feature does not change the fitted distribution; the weight is simply split between the copies.

Empirical:        All = 1:          A = 2/3:          A = 2/3 (A added twice):
     A   a             A     a           A     a            A     a
  B  2   1         B  1/4   1/4      B  1/3   1/6       B   1/3   1/6
  b  2   1         b  1/4   1/4      b  1/3   1/6       b   1/3   1/6

Parameters: with one A feature, the column-A cells carry weight λA; with two overlapping A features they carry λ'A + λ''A, the same total.
Example: Named Entity Feature Overlap
Grace is correlated with PERSON, but does not add much evidence on top of already knowing prefix features.

  Feature Type       Feature   PERS    LOC
  Previous word      at        -0.73   0.94
  Current word       Grace     0.03    0.00
  Beginning bigram   <G        0.45    -0.04
Feature Interaction
Maxent models handle overlapping features well, but they do not automatically model feature interactions:

Empirical:        All = 1:          A = 2/3:          B = 2/3:
     A   a             A     a           A     a            A     a
  B  1   1         B  1/4   1/4      B  1/3   1/6       B   4/9   2/9
  b  1   0         b  1/4   1/4      b  1/3   1/6       b   2/9   1/9

Parameters: the column-A cells carry λA; adding the B feature puts λB on the row-B cells, so the (A, B) cell carries λA + λB.
Feature Interaction
If you want interaction terms, you have to add them:

Empirical:        A = 2/3:          B = 2/3:          AB = 1/3:
     A   a             A     a           A     a            A     a
  B  1   1         B  1/3   1/6      B  4/9   2/9       B   1/3   1/3
  b  1   0         b  1/3   1/6      b  2/9   1/9       b   1/3   0

(A numerical sketch of this example follows below.)
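A numerical sketch of the 2×2 example, fitting the cell distribution by plain gradient ascent (illustrative code): with only the A and B features the model cannot zero out the (a, b) cell, but adding the AB interaction feature lets it approach the empirical table.

```python
import math

CELLS = [("A", "B"), ("a", "B"), ("A", "b"), ("a", "b")]
COUNTS = [1, 1, 1, 0]                                   # empirical table from the slide

fA = lambda x: 1.0 if x[0] == "A" else 0.0              # column-A indicator
fB = lambda x: 1.0 if x[1] == "B" else 0.0              # row-B indicator
fAB = lambda x: fA(x) * fB(x)                           # interaction feature

def fit(feats, steps=5000, lr=0.2):
    lam = [0.0] * len(feats)
    N = sum(COUNTS)
    for _ in range(steps):
        scores = [math.exp(sum(l * f(x) for l, f in zip(lam, feats))) for x in CELLS]
        Z = sum(scores)
        p = [s / Z for s in scores]
        # gradient: empirical feature count minus expected feature count under the model
        grad = [sum(n * f(x) for n, x in zip(COUNTS, CELLS))
                - N * sum(pi * f(x) for pi, x in zip(p, CELLS)) for f in feats]
        lam = [l + lr * g for l, g in zip(lam, grad)]
    return p

print([round(v, 2) for v in fit([fA, fB])])        # about [0.44, 0.22, 0.22, 0.11]
print([round(v, 2) for v in fit([fA, fB, fAB])])   # approaches [0.33, 0.33, 0.33, 0.0]
```

With the interaction feature the empty cell is only reached in the limit of infinite weights, which is exactly the overfitting and optimization problem the next bullets describe.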
Lots of sparsity:
Overfitting is very easy; we need smoothing!
Many features seen in training will never occur again at test time.
Optimization problems:
Feature weights can be infinite, and iterative solvers can take a
long time to get to those infinities.
Smoothing: Issues
Assume the following empirical distribution:
Heads Tails
h t
Features: {Heads}, {Tails}
We'll have the following model distribution:

  pHEADS = e^λH / (e^λH + e^λT)        pTAILS = e^λT / (e^λH + e^λT)

Really, only one degree of freedom (λ = λH − λT):

  pHEADS = e^λH e^−λT / (e^λH e^−λT + e^λT e^−λT) = e^λ / (e^λ + e^0)        pTAILS = e^0 / (e^λ + e^0)
Logistic regression model
Smoothing: Issues
The data likelihood in this model is:

  log P(h, t | λ) = h log pHEADS + t log pTAILS
  log P(h, t | λ) = hλ − (t + h) log (1 + e^λ)

[Plots of the likelihood as a function of λ for three datasets:]

  Heads  Tails      Heads  Tails      Heads  Tails
    2      2          3      1          4      0
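A sketch that evaluates this likelihood at a few values of λ for the three datasets (illustrative code): for 2/2 and 3/1 the likelihood has a finite peak, while for 4/0 it keeps rising as λ grows, so the unsmoothed optimum is at λ = ∞.

```python
import math

def cll(h, t, lam):
    # log P(h, t | lambda) = h * lambda - (h + t) * log(1 + e^lambda)
    return h * lam - (h + t) * math.log(1 + math.exp(lam))

for h, t in [(2, 2), (3, 1), (4, 0)]:
    vals = {lam: round(cll(h, t, lam), 2) for lam in [0.0, 1.0, 2.0, 5.0, 10.0]}
    print((h, t), vals)
# For 2/2 the maximum is at lambda = 0; for 3/1 at lambda = log 3;
# for 4/0 the log-likelihood increases monotonically, so lambda wants to go to infinity.
```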
Smoothing: Early Stopping
In the 4/0 case, there were two problems:
The optimal value of λ was ∞, which is a long trip for an optimization procedure.
The learned distribution is just as spiked as the empirical one: no smoothing.
One way to solve both issues is to just stop the optimization early, after a few iterations:
The value of λ will be finite (but presumably big).
The optimization won't take forever (clearly).
Commonly used in early maxent work.
[Input: Heads 4, Tails 0. Output: Heads 1, Tails 0.]
Smoothing: Priors (MAP)
What if we had a prior expectation that parameter values wouldn't be very large?
We could then balance evidence suggesting large parameters (or infinite) against our prior.
The evidence would never totally defeat the prior, and parameters would be smoothed (and kept finite!).
We can do this explicitly by changing the optimization objective to maximum posterior likelihood:

  log P(C, λ | D) = log P(λ) + log P(C | D, λ)
Smoothing: Priors
[Plots: Gaussian priors with 2σ² = 10 and 2σ² = 1.]
Gaussian, or quadratic, or L2 priors:
Intuition: parameters shouldn't be large.
Formalization: prior expectation that each parameter will be distributed according to a Gaussian with mean μ and variance σ²:

  P(λi) = 1 / (σi √(2π)) · exp( −(λi − μi)² / (2σi²) )

Penalizes parameters for drifting too far from their mean prior value (usually μ = 0).
2σ² = 1 works surprisingly well.
Smoothing: Priors
If we use gaussian priors:
Trade off some expectation-matching for smaller parameters.
When multiple features can be recruited to explain a data point, the
more common ones generally receive more weight.
Accuracy generally goes up!
Change the objective to include the prior:

  log P(C|D,λ) = Σ_(c,d)∈(C,D) log P(c|d,λ) − Σi (λi − μi)² / (2σi²) + k    (k a constant)

(A sketch of the penalized objective follows below.)
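A sketch of the penalized objective on the 4/0 coin example (illustrative code, crude grid search): the Gaussian prior keeps the optimal λ finite and pHEADS strictly below 1, and a tighter prior (smaller σ²) pulls the estimate further back toward 0.5.

```python
import math

def cll(h, t, lam):                      # conditional log-likelihood of h heads, t tails
    return h * lam - (h + t) * math.log(1 + math.exp(lam))

def penalized(h, t, lam, sigma2):        # MAP objective with a Gaussian (L2) prior, mean 0
    return cll(h, t, lam) - lam ** 2 / (2 * sigma2)

# For the 4/0 dataset the unpenalized optimum is at lambda = infinity,
# but with the prior a finite lambda wins; a coarse grid search shows it.
grid = [i / 100 for i in range(0, 1001)]
for sigma2 in [10.0, 1.0]:
    best = max(grid, key=lambda lam: penalized(4, 0, lam, sigma2))
    print(sigma2, best, round(1 / (1 + math.exp(-best)), 3))   # sigma^2, lambda*, smoothed pHEADS
```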
Smoothing helps:
Softens distributions.
Pushes weight onto more explanatory features.
Allows many features to be dumped safely into the mix.
Speeds up convergence (if both are allowed to converge)!
Smoothing: Regularization
Talking of priors and MAP estimation is
Bayesian language
In frequentist statistics, people will instead talk
about using regularization, and in particular, a
gaussian prior is L2 regularization