
Maxent Models and Discriminative

Estimation
Generative vs. Discriminative models

Christopher Manning
Introduction
So far we've looked at generative models
Language models, Naive Bayes
But there is now much use of conditional or
discriminative probabilistic models in NLP,
Speech, IR (and ML generally)
Because:
They give high accuracy performance
They make it easy to incorporate lots of linguistically
important features
They allow automatic building of language
independent, retargetable NLP modules
Joint vs. Conditional Models

We have some data {(d, c)} of paired observations d and hidden classes c.

Joint (generative) models place probabilities over both the observed data and
the hidden stuff (generate the observed data from the hidden stuff): P(c,d)
All the classic StatNLP models:
n-gram models, Naive Bayes classifiers, hidden Markov
models, probabilistic context-free grammars, IBM machine
translation alignment models
Joint vs. Conditional Models
Discriminative (conditional) models take the data as given, and put a
probability over hidden structure given the data: P(c|d)
Logistic regression, conditional loglinear or maximum entropy
models, conditional random fields
Also, SVMs, (averaged) perceptron, etc. are discriminative
classifiers (but not directly probabilistic)
Bayes Net/Graphical Models
Bayes net diagrams draw circles for random variables, and lines for
direct dependencies
Some variables are observed; some are hidden
Each node is a little classifier (conditional probability table) based on
incoming arcs

[Diagrams: Naive Bayes (generative): c → d1, d2, d3        Logistic Regression (discriminative): d1, d2, d3 → c]
Conditional vs. Joint Likelihood
A joint model gives probabilities P(d,c) and tries
to maximize this joint likelihood.
It turns out to be trivial to choose weights: just relative
frequencies.
A conditional model gives probabilities P(c|d). It
takes the data as given and models only the
conditional probability of the class.
We seek to maximize conditional likelihood.
Harder to do (as we'll see)
More closely related to classification error.
Conditional models work well: Word Sense Disambiguation

Even with exactly the same features, changing from joint to conditional
estimation increases performance. That is, we use the same smoothing and the
same word-class features; we just change the numbers (parameters).

Training Set
  Objective     Accuracy
  Joint Like.   86.8
  Cond. Like.   98.5

Test Set
  Objective     Accuracy
  Joint Like.   73.6
  Cond. Like.   76.1

(Klein and Manning 2002, using Senseval-1 Data)
Discriminative Model Features

Making features from text for


discriminative NLP models

Christopher Manning
Features
In these slides and most maxent work: features f are
elementary pieces of evidence that link aspects of what we
observe d with a category c that we want to predict
A feature is a function with a bounded real value: f: C × D → ℝ

Example features:
  f1(c, d) ≡ [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
  f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
  f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]

  LOCATION: in Arcadia     LOCATION: in Québec     DRUG: taking Zantac     PERSON: saw Sue

Models will assign to each feature a weight:


A positive weight votes that this configuration is likely
correct
A negative weight votes that this configuration is likely
incorrect
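
As a concrete illustration, here is a minimal Python sketch of the three example features above; the helper predicates and the dict-shaped datum d are hypothetical choices for this sketch, not code from the slides.

```python
def is_capitalized(w):
    return w[:1].isupper()

def has_accented_latin_char(w):
    return any(ch in "àáâäèéêëìíîïòóôöùúûüçñ" for ch in w.lower())

# d is assumed to be a dict holding the current word 'w' and previous word 'w_prev'
def f1(c, d):
    return int(c == "LOCATION" and d["w_prev"] == "in" and is_capitalized(d["w"]))

def f2(c, d):
    return int(c == "LOCATION" and has_accented_latin_char(d["w"]))

def f3(c, d):
    return int(c == "DRUG" and d["w"].lower().endswith("c"))

print(f1("LOCATION", {"w_prev": "in", "w": "Québec"}))       # 1
print(f2("LOCATION", {"w_prev": "in", "w": "Québec"}))       # 1
print(f3("DRUG", {"w_prev": "taking", "w": "Zantac"}))       # 1
```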
Feature Expectations
We will crucially make use of two expectations, the actual or predicted counts of a feature firing:

Empirical count (expectation) of a feature:

  $E_{\text{empirical}}(f_i) = \sum_{(c,d)\in\text{observed}(C,D)} f_i(c,d)$


Model expectation of a feature:

  $E(f_i) = \sum_{(c,d)\in(C,D)} P(c,d)\, f_i(c,d)$
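
A minimal Python sketch of these two expectations, assuming the observed pairs are given as a list and the model's joint distribution P(c,d) as a dict; both data structures are illustrative assumptions.

```python
def empirical_expectation(f, observed_pairs):
    # E_empirical(f) = sum of f(c, d) over the observed (c, d) pairs
    return sum(f(c, d) for (c, d) in observed_pairs)

def model_expectation(f, joint):
    # E(f) = sum of P(c, d) * f(c, d), with joint = {(c, d): P(c, d)}
    return sum(p * f(c, d) for (c, d), p in joint.items())

# Toy usage with one indicator feature f(c, d) = [c == "LOCATION"]
f = lambda c, d: int(c == "LOCATION")
print(empirical_expectation(f, [("LOCATION", "in Québec"), ("PERSON", "saw Sue")]))      # 1
print(model_expectation(f, {("LOCATION", "in Québec"): 0.6, ("PERSON", "saw Sue"): 0.4}))  # 0.6
```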


Features
In NLP uses, usually a feature specifies (1) an indicator function (a yes/no
boolean matching function of properties of the input) and (2) a particular class:

  $f_i(c, d) \equiv [\Phi(d) \wedge c = c_j]$    [Value is 0 or 1]

They pick out a data subset and suggest a label for it.
We will say that Φ(d) is a feature of the data d when, for each c_j, the
conjunction Φ(d) ∧ c = c_j is a feature of the data-class pair (c, d).
Feature-Based Models
The decision about a data point is based only on
the features active at that point.
Text Categorization
  Data: BUSINESS: Stocks hit a yearly low ...
  Label: BUSINESS
  Features: {..., stocks, hit, a, yearly, low, ...}

Word-Sense Disambiguation
  Data: ... to restructure bank:MONEY debt.
  Label: MONEY
  Features: {..., w-1=restructure, w+1=debt, L=12, ...}

POS Tagging
  Data: DT JJ NN ... The previous fall ...
  Label: NN
  Features: {w=fall, t-1=JJ, w-1=previous}
Example: Text Categorization
(Zhang and Oles 2001)
Features are presence of each word in a document and the document class (they
do feature selection to use reliable indicator words)
Tests on classic Reuters data set (and others)
Naïve Bayes: 77.0% F1
Linear regression: 86.0%
Logistic regression: 86.4%
Support vector machine: 86.5%
Paper emphasizes the importance of regularization (smoothing) for successful
use of discriminative methods (not used in much early NLP/IR work)
Other Maxent Classifier Examples
You can use a maxent classifier whenever you want to assign data
points to one of a number of classes:
Sentence boundary detection (Mikheev 2000)
Is a period end of sentence or abbreviation?
Sentiment analysis (Pang and Lee 2002)
Word unigrams, bigrams, POS counts,
PP attachment (Ratnaparkhi 1998)
Attach to verb or noun? Features of head noun, preposition,
etc.
Parsing decisions in general (Ratnaparkhi 1997; Johnson et al. 1999, etc.)
Feature-based Linear Classifiers

How to put features into a classifier

Feature-Based Linear Classifiers
Linear classifiers at classification time:
Linear function from feature sets {fi} to classes {c}.
Assign a weight λi to each feature fi.
We consider each class for an observed datum d
For a pair (c,d), features vote with their weights:
  vote(c) = Σi λi fi(c,d)

  LOCATION: in Québec      DRUG: in Québec      PERSON: in Québec

Choose the class c which maximizes Σi λi fi(c,d)
Feature-Based Linear Classifiers

There are many ways to choose weights for features:

Perceptron: find a currently misclassified example, and


nudge weights in the direction of its correct
classification

Margin-based methods (Support Vector Machines)


Feature-Based Linear Classifiers
Exponential (log-linear, maxent, logistic, Gibbs) models:
Make a probabilistic model from the linear combination Σi λi fi(c,d):
  $P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$

  (exp makes the votes positive; the denominator normalizes the votes)
P(LOCATION | in Québec) = e^1.8 e^-0.6 / (e^1.8 e^-0.6 + e^0.3 + e^0) = 0.586
P(DRUG | in Québec)     = e^0.3 / (e^1.8 e^-0.6 + e^0.3 + e^0) = 0.238
P(PERSON | in Québec)   = e^0 / (e^1.8 e^-0.6 + e^0.3 + e^0) = 0.176
The weights are the parameters of the probability
model, combined via a soft max function
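
These numbers are easy to check; a quick Python sketch using the vote totals implied by the weights in these slides (1.8 - 0.6 = 1.2 for LOCATION, 0.3 for DRUG, 0 for PERSON):

```python
import math

votes = {"LOCATION": 1.8 - 0.6, "DRUG": 0.3, "PERSON": 0.0}

Z = sum(math.exp(v) for v in votes.values())          # normalizer
probs = {c: math.exp(v) / Z for c, v in votes.items()}
print(probs)  # ~{'LOCATION': 0.586, 'DRUG': 0.238, 'PERSON': 0.176}
```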
Feature-Based Linear Classifiers
Exponential (log-linear, maxent, logistic, Gibbs) models:
Given this model form, we will choose parameters {λi}
that maximize the conditional likelihood of the data
according to this model.
We construct not only classifications, but probability
distributions over classifications.
There are other (good!) ways of discriminating classes (SVMs, boosting, even
perceptrons), but these methods are not as trivial to interpret as
distributions over classes.
Aside: logistic regression
Maxent models in NLP are essentially the same as
multiclass logistic regression models in statistics
(or machine learning)
If you have seen these before you might think about:
The parameterization is slightly different in a way that is
advantageous for NLP-style models with tons of sparse
features (but statistically inelegant)
The key role of feature functions in NLP and in this
presentation
The features are more general, with f also being a function of the class
(when might this be useful?)
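
For reference, a minimal scikit-learn sketch of that standard statistics parameterization (shared data features Φ(d), one weight vector per class) on made-up toy data; it illustrates the correspondence and is not code from these slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy indicator-feature matrix (rows: data points, columns: data features Phi(d));
# everything here is made up purely for illustration.
X = np.array([[1, 1, 0],   # an "in Québec"-like datum
              [0, 0, 1],   # a "taking Zantac"-like datum
              [0, 0, 0]])  # a "saw Sue"-like datum
y = np.array(["LOCATION", "DRUG", "PERSON"])

# In recent scikit-learn versions, multiclass problems with the default lbfgs
# solver are fit as a multinomial (softmax) model; C controls the strength of
# the L2 regularization discussed later in these slides.
clf = LogisticRegression(C=1.0)
clf.fit(X, y)
print(clf.predict_proba(X))
```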

Quiz Question
Assuming exactly the same set up (3 class decision:
LOCATION, PERSON, or DRUG; 3 features as before,
maxent), what are:
P(PERSON | by Goric) =
P(LOCATION | by Goric) =
P(DRUG | by Goric) =
 1.8   f1(c, d) ≡ [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
-0.6   f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
 0.3   f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]

  $P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$

  PERSON: by Goric      LOCATION: by Goric      DRUG: by Goric
Building a Maxent Model

The nuts and bolts


Building a Maxent Model
We define features (indicator functions) over data points
Features represent sets of data points which are distinctive
enough to deserve model parameters.
Words, but also word contains number, word ends with ing, etc.

We will simply encode each feature as a unique String


A datum will give rise to a set of Strings: the active features
Each feature fi(c, d) ≡ [Φ(d) ∧ c = cj] gets a real number weight

We concentrate on Φ features, but the math uses the indices i of fi
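
A minimal sketch of this string-encoding idea; the feature strings and datum properties used here are made up for illustration:

```python
feature_index = {}          # feature string -> index i
weights = []                # lambda_i, one entry per feature

def feature_id(feat_string, add=True):
    # Map a feature string to its integer id, growing the index as needed.
    if feat_string not in feature_index:
        if not add:
            return None
        feature_index[feat_string] = len(weights)
        weights.append(0.0)
    return feature_index[feat_string]

def active_features(c, d):
    # d is assumed to be a dict of observed properties of the datum.
    feats = []
    if d.get("w_prev") == "in" and d["w"][:1].isupper():
        feats.append(f"{c}|w-1=in,isCap")
    if d["w"].lower().endswith("c"):
        feats.append(f"{c}|ends-c")
    return [feature_id(s) for s in feats]

print(active_features("LOCATION", {"w_prev": "in", "w": "Québec"}))  # e.g. [0, 1]
```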
Building a Maxent Model
Features are often added during model development to
target errors
Often, the easiest things to think of are features that mark bad
combinations

Then, for any given feature weights, we want to be able to


calculate:
Data conditional likelihood
Derivative of the likelihood wrt each feature weight
Uses expectations of each feature according to the model

We can then find the optimum feature weights (discussed


later).
Naive Bayes vs. Maxent models

Generative vs. Discriminative models: The


problem of overcounting evidence
Christopher Manning
Text classification: Asia or Europe
Training Data
[Two columns of training documents: Europe documents containing Monaco; Asia documents containing Monaco and Hong Kong.]

NB Model: Class → X1=M
NB FACTORS:   P(A) =    P(E) =    P(M|A) =    P(M|E) =
PREDICTIONS:  P(A,M) =    P(E,M) =    P(A|M) =    P(E|M) =
Text classification: Asia or Europe
Training Data
[Two columns of training documents: Europe documents containing Monaco; Asia documents containing Monaco and Hong Kong.]

NB Model: Class → X1=H, X2=K
NB FACTORS:   P(A) =    P(E) =    P(H|A) =    P(K|A) =    P(H|E) =    P(K|E) =
PREDICTIONS:  P(A,H,K) =    P(E,H,K) =    P(A|H,K) =    P(E|H,K) =
Text classification: Asia or Europe
Training Data
[Two columns of training documents: Europe documents containing Monaco; Asia documents containing Monaco and Hong Kong.]

NB Model: Class → H, K, M
NB FACTORS:   P(A) =    P(E) =    P(M|A) =    P(M|E) =    P(H|A) =    P(K|A) =    P(H|E) =    P(K|E) =
PREDICTIONS:  P(A,H,K,M) =    P(E,H,K,M) =    P(A|H,K,M) =    P(E|H,K,M) =
Naive Bayes vs. Maxent Models
Naive Bayes models multi-count correlated evidence
Each feature is multiplied in, even when you have multiple
features telling you the same thing

Maximum Entropy models (pretty much) solve this


problem
As we will see, this is done by weighting features so that model
expectations match the observed (empirical) expectations
Maxent Models and Discriminative
Estimation
Maximizing the likelihood
Exponential Model Likelihood
Maximum (Conditional) Likelihood Models:
Given a model form, choose values of parameters to
maximize the (conditional) likelihood of the data.

  $\log P(C \mid D, \lambda) = \sum_{(c,d)\in(C,D)} \log P(c \mid d, \lambda) = \sum_{(c,d)\in(C,D)} \log \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$
The Likelihood Value
The (log) conditional likelihood of iid data (C,D) according to a maxent
model is a function of the data and the parameters λ:

  $\log P(C \mid D, \lambda) = \log \prod_{(c,d)\in(C,D)} P(c \mid d, \lambda) = \sum_{(c,d)\in(C,D)} \log P(c \mid d, \lambda)$

If there aren't many values of c, it's easy to calculate:

  $\log P(C \mid D, \lambda) = \sum_{(c,d)\in(C,D)} \log \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$
The Likelihood Value
We can separate this into two components:

log P(C | D, l ) =
( c ,d )( C ,D )
log exp li fi (c, d ) -
i

( c ,d )( C ,D )
log exp li fi (c' , d )
c' i

log P(C | D, l ) = N (l ) - M (l )
The derivative is the difference between the
derivatives of each component
The Derivative I: Numerator

  $\frac{\partial N(\lambda)}{\partial \lambda_i} = \frac{\partial}{\partial \lambda_i} \sum_{(c,d)\in(C,D)} \log \exp \sum_{i'} \lambda_{i'} f_{i'}(c,d) = \frac{\partial}{\partial \lambda_i} \sum_{(c,d)\in(C,D)} \sum_{i'} \lambda_{i'} f_{i'}(c,d)$

  $= \sum_{(c,d)\in(C,D)} \frac{\partial}{\partial \lambda_i} \sum_{i'} \lambda_{i'} f_{i'}(c,d) = \sum_{(c,d)\in(C,D)} f_i(c,d)$

Derivative of the numerator is the empirical count(fi, C).
The Derivative II: Denominator
  $\frac{\partial M(\lambda)}{\partial \lambda_i} = \frac{\partial}{\partial \lambda_i} \sum_{(c,d)\in(C,D)} \log \sum_{c'} \exp \sum_{i'} \lambda_{i'} f_{i'}(c',d)$

  $= \sum_{(c,d)\in(C,D)} \frac{1}{\sum_{c''} \exp \sum_{i'} \lambda_{i'} f_{i'}(c'',d)} \, \frac{\partial}{\partial \lambda_i} \sum_{c'} \exp \sum_{i'} \lambda_{i'} f_{i'}(c',d)$

  $= \sum_{(c,d)\in(C,D)} \frac{1}{\sum_{c''} \exp \sum_{i'} \lambda_{i'} f_{i'}(c'',d)} \, \sum_{c'} \exp\Big(\sum_{i'} \lambda_{i'} f_{i'}(c',d)\Big) \frac{\partial}{\partial \lambda_i} \sum_{i'} \lambda_{i'} f_{i'}(c',d)$

  $= \sum_{(c,d)\in(C,D)} \sum_{c'} \frac{\exp \sum_{i'} \lambda_{i'} f_{i'}(c',d)}{\sum_{c''} \exp \sum_{i'} \lambda_{i'} f_{i'}(c'',d)} \, f_i(c',d)$

  $= \sum_{(c,d)\in(C,D)} \sum_{c'} P(c' \mid d, \lambda)\, f_i(c',d) \;=\; \text{predicted count}(f_i, \lambda)$
The Derivative III
  $\frac{\partial \log P(C \mid D, \lambda)}{\partial \lambda_i} = \text{actual count}(f_i, C) - \text{predicted count}(f_i, \lambda)$
The optimum parameters are the ones for which each feature's predicted
expectation equals its empirical expectation. The optimum distribution is:
distribution is:
Always unique (but parameters may not be unique)
Always exists (if feature counts are from actual data).
These models are also called maximum entropy models because we
find the model having maximum entropy and satisfying the constraints:

  $E_p(f_j) = E_{\tilde{p}}(f_j), \quad \forall j$
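
A minimal sketch of this gradient computation in Python; the features(c, d) callback (returning (feature index, value) pairs) and the overall layout are illustrative assumptions, not the slides' code:

```python
import math

def cond_probs(d, classes, features, lam):
    # P(c | d, lambda) via the softmax of the feature votes.
    scores = {c: sum(lam[i] * v for i, v in features(c, d)) for c in classes}
    Z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / Z for c, s in scores.items()}

def gradient(data, classes, features, lam):
    # d(log-likelihood)/d(lambda_i) = actual count(f_i) - predicted count(f_i)
    grad = [0.0] * len(lam)
    for (c, d) in data:
        for i, v in features(c, d):          # actual (empirical) counts
            grad[i] += v
        p = cond_probs(d, classes, features, lam)
        for c2 in classes:                   # predicted (model) counts
            for i, v in features(c2, d):
                grad[i] -= p[c2] * v
    return grad
```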
Finding the optimal parameters
We want to choose parameters λ1, λ2, λ3, ... that maximize the conditional
log-likelihood of the training data:

  $\text{CLogLik}(D) = \sum_{i=1}^{n} \log P(c_i \mid d_i)$

To be able to do that, we've worked out how to calculate the function value
and its partial derivatives (its gradient).
A likelihood surface
Finding the optimal parameters
Use your favorite numerical optimization
package.
Commonly (and in our code), you minimize the negative of
CLogLik
1. Gradient descent (GD); Stochastic gradient descent
(SGD)
2. Iterative proportional fitting methods: Generalized
Iterative Scaling (GIS) and Improved Iterative Scaling
(IIS)
3. Conjugate gradient (CG), perhaps with
preconditioning
4. Quasi-Newton methods limited memory variable
metric (LMVM) methods, in particular, L-BFGS
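
As one concrete illustration of option 4, a minimal scipy sketch that minimizes the negative conditional log-likelihood of a made-up toy problem (one indicator feature that fires for class 0, observed classes [0, 0, 1]); the F[n, c, i] layout is an assumption of this sketch, not the slides' code.

```python
import numpy as np
from scipy.optimize import minimize

# F[n, c, i] = f_i(c, d_n): one feature that fires for class 0 on every datum.
F = np.array([[[1], [0]],
              [[1], [0]],
              [[1], [0]]], dtype=float)
y = np.array([0, 0, 1])                      # observed classes

def neg_cloglik(lam):
    scores = F @ lam                         # (n_data, n_classes) feature votes
    logZ = np.logaddexp.reduce(scores, axis=1)
    return -(scores[np.arange(len(y)), y] - logZ).sum()

result = minimize(neg_cloglik, np.zeros(1), method="L-BFGS-B")
print(result.x)  # ~[0.693]: the weight where predicted count 3*sigma(w) equals actual count 2
```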
Maxent Models and Discriminative
Estimation
The maximum entropy model presentation
Maximum Entropy Models
An equivalent approach:
Lots of distributions out there, most of them very
spiked, specific, overfit.
We want a distribution which is uniform except in
specific ways we require.
Uniformity means high entropy we can search for
distributions which have properties we desire, but also
have high entropy.

"Ignorance is preferable to error and he is less remote from the truth who
believes nothing than he who believes what is wrong." (Thomas Jefferson, 1781)
(Maximum) Entropy
Entropy: the uncertainty of a distribution.
Quantifying uncertainty ("surprise"):
  Event x
  Probability p_x
  "Surprise" log(1/p_x)

Entropy: expected surprise (over p):

  $H(p) = E_p\left[\log_2 \frac{1}{p_x}\right] = -\sum_x p_x \log_2 p_x$

[Plot: H as a function of p_HEADS; a coin flip is most uncertain for a fair coin.]
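
A quick worked check of the coin-flip claim: for a fair coin, $H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$ bit, while for $p_\text{HEADS} = 0.3$, $H = -(0.3 \log_2 0.3 + 0.7 \log_2 0.7) \approx 0.88$ bits, so the fair coin is indeed the point of maximum uncertainty.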
Maxent Examples I
What do we want from a distribution?
  Minimize commitment = maximize entropy.
  Resemble some reference distribution (data).

Solution: maximize entropy H, subject to feature-based constraints:

  $E_p[f_i] = E_{\hat{p}}[f_i]$, i.e. $\sum_{x \in f_i} p(x) = C_i$

Adding constraints (features):
  Lowers maximum entropy
  Raises maximum likelihood of data
  Brings the distribution further from uniform
  Brings the distribution closer to data

[Plots: entropy as a function of p_HEADS; unconstrained, the maximum is at 0.5; with the constraint p_HEADS = 0.3, the maximum moves to 0.3.]

Maxent Examples II
[Plots: -x log x, with maximum at x = 1/e; H(p_H, p_T) on the constraint line p_H + p_T = 1, with the additional constraint p_H = 0.3.]
Maxent Examples III
Let's say we have the following event space:
  NN     NNS    NNP    NNPS   VBZ    VBD

and the following empirical data:
  3      5      11     13     3      1

Maximize H:
  1/e    1/e    1/e    1/e    1/e    1/e

... but we want probabilities, so add the constraint E[NN, NNS, NNP, NNPS, VBZ, VBD] = 1:
  1/6    1/6    1/6    1/6    1/6    1/6

Too uniform!
Maxent Examples IV
N* are more common than V*, so we add the feature fN = {NN, NNS, NNP, NNPS},
with E[fN] = 32/36:
  NN     NNS    NNP    NNPS   VBZ    VBD
  8/36   8/36   8/36   8/36   2/36   2/36

... and proper nouns are more frequent than common nouns, so we add
fP = {NNP, NNPS}, with E[fP] = 24/36:
  4/36   4/36   12/36  12/36  2/36   2/36

We could keep refining the models, e.g., by adding a feature to distinguish
singular vs. plural nouns, or verb types.
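
A quick sanity check on these numbers: subject only to these constraints, the maxent solution spreads probability uniformly within each cell of the partition, so the proper-noun constraint puts 24/36 on {NNP, NNPS} (12/36 each), the remaining 32/36 - 24/36 = 8/36 of noun mass goes to {NN, NNS} (4/36 each), and the leftover 4/36 is split between VBZ and VBD (2/36 each), matching the table above.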
Convexity
  $f\left(\sum_i w_i x_i\right) \ge \sum_i w_i f(x_i), \qquad \sum_i w_i = 1$

[Figure: f(Σ w x) vs. Σ w f(x) for a convex and a non-convex function.]
Convexity guarantees a single, global maximum because any
higher points are greedily reachable.
Convexity II
Constrained H(p) = -Σ x log x is convex:
  -x log x is convex
  -Σ x log x is convex (sum of convex functions is convex).
The feasible region of constrained H is a linear subspace (which is
convex)
The constrained entropy surface is therefore convex.
The maximum likelihood exponential model (dual) formulation is also
convex.
Feature Overlap/Feature Interaction

How overlapping features work in maxent


models
Feature Overlap
Maxent models handle overlapping features well.
Unlike a NB model, there is no double counting!

Empirical:              All = 1:              A = 2/3:              A = 2/3 (again):
      A    a                  A     a               A     a               A     a
  B   2    1            B     1/4   1/4       B     1/3   1/6       B     1/3   1/6
  b   2    1            b     1/4   1/4       b     1/3   1/6       b     1/3   1/6

Parameters in each cell: none; then λ_A in the A column; then λ'_A + λ''_A in the A column.
Example: Named Entity Feature Overlap

Grace is correlated with PERSON, but does not add much evidence on top of
already knowing prefix features.

Local Context:
           Prev    Cur     Next
  State    Other   ???     ???
  Word     at      Grace   Road
  Tag      IN      NNP     NNP
  Sig      x       Xx      Xx

Feature Weights:
  Feature Type          Feature    PERS     LOC
  Previous word         at         -0.73    0.94
  Current word          Grace       0.03    0.00
  Beginning bigram      <G          0.45   -0.04
  Current POS tag       NNP         0.47    0.45
  Prev and cur tags     IN NNP     -0.10    0.14
  Previous state        Other      -0.70   -0.92
  Current signature     Xx          0.80    0.46
  Prev state, cur sig   O-Xx        0.68    0.37
  Prev-cur-next sig     x-Xx-Xx    -0.69    0.37
  P. state - p-cur sig  O-x-Xx     -0.20    0.82
  Total:                           -0.58    2.68
Feature Interaction
Maxent models handle overlapping features well, but do not
automatically model feature interactions.

Empirical:              All = 1:              A = 2/3:              B = 2/3:
      A    a                  A     a               A     a               A     a
  B   1    1            B     1/4   1/4       B     1/3   1/6       B     4/9   2/9
  b   1    0            b     1/4   1/4       b     1/3   1/6       b     2/9   1/9

Parameters in each cell: all 0; then λ_A in the A column; then B row: λ_A+λ_B, λ_B and b row: λ_A, 0.
Feature Interaction
If you want interaction terms, you have to add them:

Empirical:              A = 2/3:              B = 2/3:              AB = 1/3:
      A    a                  A     a               A     a               A     a
  B   1    1            B     1/3   1/6       B     4/9   2/9       B     1/3   1/3
  b   1    0            b     1/3   1/6       b     2/9   1/9       b     1/3   0

A disjunctive feature would also have done it (alone):
      A     a
  B   1/3   1/3
  b   1/3   0
Feature Interaction
For loglinear/logistic regression models in statistics, it is
standard to do a greedy stepwise search over the space of
all possible interaction terms.
This combinatorial space is exponential in size, but that's okay as most
statistics models only have 4-8 features.
In NLP, our models commonly use hundreds of thousands of features, so
that's not okay.
Commonly, interaction terms are added by hand based on
linguistic intuitions.
Example: NER Interaction
Previous-state and current-signature have interactions, e.g. P=PERS-C=Xx
indicates C=PERS much more strongly than C=Xx and P=PERS independently.
This feature type allows the model to capture this interaction.

Local Context:
           Prev    Cur     Next
  State    Other   ???     ???
  Word     at      Grace   Road
  Tag      IN      NNP     NNP
  Sig      x       Xx      Xx

Feature Weights:
  Feature Type          Feature    PERS     LOC
  Previous word         at         -0.73    0.94
  Current word          Grace       0.03    0.00
  Beginning bigram      <G          0.45   -0.04
  Current POS tag       NNP         0.47    0.45
  Prev and cur tags     IN NNP     -0.10    0.14
  Previous state        Other      -0.70   -0.92
  Current signature     Xx          0.80    0.46
  Prev state, cur sig   O-Xx        0.68    0.37
  Prev-cur-next sig     x-Xx-Xx    -0.69    0.37
  P. state - p-cur sig  O-x-Xx     -0.20    0.82
  Total:                           -0.58    2.68
Conditional Maxent Models for
Classification
The relationship between conditional and
joint maxent/exponential models
Classification
What do these joint models of P(X) have to do with conditional models P(C|D)?

Think of the space C×D as a complex X.
  C is generally small (e.g., 2-100 topic classes)
  D is generally huge (e.g., space of documents)

We can, in principle, build models over P(C,D).
This will involve calculating expectations of features (over C×D):

  $E(f_i) = \sum_{(c,d)\in(C,D)} P(c,d)\, f_i(c,d)$

Generally impractical: can't enumerate X efficiently.


Classification II
D may be huge or infinite, but only a few d occur in our data.

What if we add one feature for each d and constrain its expectation to match
our empirical data?

  $\forall d \in D:\quad P(d) = \hat{P}(d)$

Now, most entries of P(c,d) will be zero.

We can therefore use the much easier sum:

  $E(f_i) = \sum_{(c,d)\in(C,D)} P(c,d)\, f_i(c,d) = \sum_{(c,d)\in(C,D),\ \hat{P}(d)>0} P(c,d)\, f_i(c,d)$
Classification III
But if we've constrained the D marginals

  $\forall d \in D:\quad P(d) = \hat{P}(d)$

then the only thing that can vary is the conditional distributions:

  $P(c,d) = P(c \mid d)\, P(d) = P(c \mid d)\, \hat{P}(d)$
Classification IV
This is the connection between joint and
conditional maxent / exponential models:
Conditional models can be thought of as joint models
with marginal constraints.
Maximizing joint likelihood and conditional
likelihood of the data in this model are equivalent!
Smoothing / Priors / Regularization
for Maxent Models
Smoothing: Issues of Scale
Lots of features:
NLP maxent models can have well over a million features.
Even storing a single array of parameter values can have a
substantial memory cost.

Lots of sparsity:
Overfitting is very easy; we need smoothing!
Many features seen in training will never occur again at test time.

Optimization problems:
Feature weights can be infinite, and iterative solvers can take a
long time to get to those infinities.
Smoothing: Issues
Assume the following empirical distribution:
Heads Tails
h t
Features: {Heads}, {Tails}
We'll have the following model distribution:

  $p_\text{HEADS} = \frac{e^{\lambda_H}}{e^{\lambda_H} + e^{\lambda_T}} \qquad p_\text{TAILS} = \frac{e^{\lambda_T}}{e^{\lambda_H} + e^{\lambda_T}}$

Really, there is only one degree of freedom ($\lambda = \lambda_H - \lambda_T$):

  $p_\text{HEADS} = \frac{e^{\lambda_H} e^{-\lambda_T}}{e^{\lambda_H} e^{-\lambda_T} + e^{\lambda_T} e^{-\lambda_T}} = \frac{e^{\lambda}}{e^{\lambda} + e^{0}} \qquad p_\text{TAILS} = \frac{e^{0}}{e^{\lambda} + e^{0}}$

This is the logistic regression model.
Smoothing: Issues
The data likelihood in this model is:
  $\log P(h, t \mid \lambda) = h \log p_\text{HEADS} + t \log p_\text{TAILS}$

  $\log P(h, t \mid \lambda) = h\lambda - (t + h) \log(1 + e^{\lambda})$

[Plots: log P(h, t | λ) as a function of λ for Heads/Tails = 2/2, 3/1, and 4/0.]
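
A short worked step behind these three plots: the derivative of $h\lambda - (t+h)\log(1+e^{\lambda})$ is $h - (t+h)\,\frac{e^{\lambda}}{1+e^{\lambda}}$, which is zero when $\frac{e^{\lambda}}{1+e^{\lambda}} = \frac{h}{h+t}$. For 2/2 that gives $\lambda = 0$; for 3/1, $\lambda = \log 3 \approx 1.1$; for 4/0 the derivative stays positive for every finite $\lambda$, so the maximum is only approached as $\lambda \to \infty$.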
Smoothing: Early Stopping
In the 4/0 case, there were two problems:
  The optimal value of λ was ∞, which is a long trip for an optimization procedure.
  The learned distribution is just as spiked as the empirical one; no smoothing.

One way to solve both issues is to just stop the optimization early, after a few iterations.
  The value of λ will be finite (but presumably big).
  The optimization won't take forever (clearly).
  Commonly used in early maxent work.

  Input:    Heads 4    Tails 0
  Output:   Heads 1    Tails 0
Smoothing: Priors (MAP)
What if we had a prior expectation that parameter values
wouldn't be very large?
We could then balance evidence suggesting large
parameters (or infinite) against our prior.
The evidence would never totally defeat the prior, and
parameters would be smoothed (and kept finite!).
We can do this explicitly by changing the optimization
objective to maximum posterior likelihood:

  $\underbrace{\log P(C, \lambda \mid D)}_{\text{Posterior}} = \underbrace{\log P(\lambda)}_{\text{Prior}} + \underbrace{\log P(C \mid D, \lambda)}_{\text{Evidence}}$
Smoothing: Priors
[Plot: gaussian prior densities for 2σ² = ∞, 2σ² = 10, and 2σ² = 1.]
Gaussian, or quadratic, or L2 priors:
  Intuition: parameters shouldn't be large.
  Formalization: prior expectation that each parameter will be distributed
  according to a gaussian with mean μ and variance σ².

  $P(\lambda_i) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2}\right)$

  ("They don't even capitalize my name anymore!")

  Penalizes parameters for drifting too far from their mean prior value (usually μ = 0).
  2σ² = 1 works surprisingly well.
Smoothing: Priors
If we use gaussian priors:
Trade off some expectation-matching for smaller parameters.
When multiple features can be recruited to explain a data point, the
more common ones generally receive more weight.
Accuracy generally goes up!
Change the objective:

  $\log P(C, \lambda \mid D) = \log P(C \mid D, \lambda) + \log P(\lambda)$

  $\log P(C, \lambda \mid D) = \sum_{(c,d)\in(C,D)} \log P(c \mid d, \lambda) - \sum_i \frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2} + k$

Change the derivative:

  $\partial \log P(C, \lambda \mid D)/\partial \lambda_i = \text{actual}(f_i, C) - \text{predicted}(f_i, \lambda) - (\lambda_i - \mu_i)/\sigma^2$

[Plot: the likelihood surface for 2σ² = ∞, 10, and 1, with prior mean = 0.]
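
A minimal sketch of this modified derivative, assuming the unregularized gradient (actual minus predicted counts) has already been computed as a vector; names are illustrative:

```python
import numpy as np

def regularized_gradient(grad_unreg, lam, mu=0.0, sigma2=1.0):
    # Add the gaussian-prior term: subtract (lambda_i - mu_i) / sigma^2 per feature.
    lam = np.asarray(lam, dtype=float)
    grad_unreg = np.asarray(grad_unreg, dtype=float)
    return grad_unreg - (lam - mu) / sigma2

print(regularized_gradient([2.0, -1.0], [3.0, 0.5]))  # [-1.  -1.5]
```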


Example: NER Smoothing
Because of smoothing, the more common prefix and single-tag features have
larger weights even though entire-word and tag-pair features are more specific.

Local Context:
           Prev    Cur     Next
  State    Other   ???     ???
  Word     at      Grace   Road
  Tag      IN      NNP     NNP
  Sig      x       Xx      Xx

Feature Weights:
  Feature Type          Feature    PERS     LOC
  Previous word         at         -0.73    0.94
  Current word          Grace       0.03    0.00
  Beginning bigram      <G          0.45   -0.04
  Current POS tag       NNP         0.47    0.45
  Prev and cur tags     IN NNP     -0.10    0.14
  Previous state        Other      -0.70   -0.92
  Current signature     Xx          0.80    0.46
  Prev state, cur sig   O-Xx        0.68    0.37
  Prev-cur-next sig     x-Xx-Xx    -0.69    0.37
  P. state - p-cur sig  O-x-Xx     -0.20    0.82
  Total:                           -0.58    2.68
Example: POS Tagging
From (Toutanova et al., 2003):
                       Overall Accuracy   Unknown Word Acc
  Without Smoothing    96.54              85.20
  With Smoothing       97.10              88.20
Smoothing helps:
Softens distributions.
Pushes weight onto more explanatory features.
Allows many features to be dumped safely into the mix.
Speeds up convergence (if both are allowed to converge)!
Smoothing: Regularization
Talking of priors and MAP estimation is
Bayesian language
In frequentist statistics, people will instead talk
about using regularization, and in particular, a
gaussian prior is L2 regularization

The choice of names makes no difference to the


math
Smoothing: Virtual Data
Another option: smooth the data, not the parameters.
Example:

  Heads   Tails          Heads   Tails
  4       0        →     5       1
Equivalent to adding two extra data points.
Similar to add-one smoothing for generative models.
Hard to know what artificial data to create!
Smoothing: Count Cutoffs
In NLP, features with low empirical counts are often dropped.
Very weak and indirect smoothing method.
Equivalent to locking their weight to be zero.
Equivalent to assigning them gaussian priors with mean zero and
variance zero.
Dropping low counts does remove the features which were most in
need of smoothing
and speeds up the estimation by reducing model size
but count cutoffs generally hurt accuracy in the presence of
proper smoothing.
We recommend: don't use count cutoffs unless absolutely necessary
for memory usage reasons.
