
Chapter 5:

Backing-Off Language Models

5.1 Unknown words and unseen unigrams

Goal of this section


How to define a vocabulary for language
tasks
Notion of out-of-vocabulary words

Example Crime and Punishment


Training corpus: 11 000 tokens
Test corpus:
3000 tokens
108 of them not seen in training corpus

Usually training corpus defines vocabulary


Unseen words in test corpus are hence not
part of the vocabulary
OOV-words: out-of-vocabulary words

Example Crime and Punishment


Training corpus: 11 000 tokens
Test corpus:
3000 tokens
108 of them not seen in training corpus

Definition:

OOV-rate = (# unseen words in test corpus) / (# tokens in test corpus)

OOV-Rate on
Crime and Punishment
OOV-Rate = 108/3000 = 0.036 = 3.6%

How to define a vocabulary

Count the frequency of each word in a corpus
Use the most frequent N words for the vocabulary
(see the sketch below)
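A minimal sketch of this recipe and of the OOV-rate computation; the toy corpora and the cutoff are placeholders:

```python
from collections import Counter

def build_vocabulary(train_tokens, max_size):
    """Keep the max_size most frequent training words as the vocabulary."""
    counts = Counter(train_tokens)
    return {w for w, _ in counts.most_common(max_size)}

def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that are not in the vocabulary."""
    unseen = sum(1 for w in test_tokens if w not in vocabulary)
    return unseen / len(test_tokens)

# Toy usage (replace with real tokenized corpora):
train = "the cat sat on the mat the cat".split()
test = "the dog sat on the mat".split()
vocab = build_vocabulary(train, max_size=5)
print(oov_rate(test, vocab))   # 1/6: only "dog" is out of vocabulary
```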

OOV-Rate vs. Vocabulary

OOV rate depends on vocabulary size


Morphologically rich languages have higher OOV rates for the
same vocabulary size
Specifically, the relation is a power law:

OOV-rate ∝ 1 / (size of vocabulary)

Corresponding Problem with Sequences of Words (M-Grams)

Size of vocabulary: N
Number of sequences of length M over N words: N^M
Numerical example: N = 64000 and M = 3: N^M ≈ 3 · 10^14
No realistic possibility to observe all trigrams in a training corpus
→ need methods to account for unseen events

5.2 Maximum Likelihood Estimate of Probabilities

Goal of this section


Simple way to determine a language
model P(w|h)


From Perplexity to Likelihood

PP = P(w_1 ... w_N)^(-1/N) = exp( - Σ_{w,h} f(w,h) · log P(w|h) )

log(PP) = - Σ_{w,h} f(w,h) · log P(w|h) = - (1/N) · Σ_{w,h} N(w,h) · log P(w|h)

with f(w,h) = N(w,h)/N the relative frequency of the event (w,h).

Define the likelihood:

F = Σ_{w,h} N(w,h) · log P(w|h)

Minimizing the perplexity is equivalent to maximizing the likelihood F.

Maximum-Likelihood Estimation of Probabilities

Maximize

F = Σ_{w,h} N(w,h) · log P(w|h)

using the normalization constraint

Σ_w P(w|h) = 1   for all h

Modified likelihood using Lagrange multipliers:

F' = Σ_{w,h} N(w,h) · log P(w|h) + Σ_h λ(h) · ( 1 - Σ_w P(w|h) )

Maximum-Likelihood Estimation of Probabilities (cont.)

F' = Σ_{w,h} N(w,h) · log P(w|h) + Σ_h λ(h) · ( 1 - Σ_w P(w|h) )

∂F' / ∂P(w'|h') = N(w',h') / P(w'|h') - λ(h') = 0   ⟹   P(w'|h') = N(w',h') / λ(h')

Constraint:

1 = Σ_w P(w|h') = Σ_w N(w,h') / λ(h') = N(h') / λ(h')   ⟹   λ(h') = N(h')

Result:

P(w|h) = N(w,h) / N(h)

The maximum likelihood estimate results in relative frequencies.
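As a small sketch of this result, the maximum-likelihood bigram model is just the table of relative frequencies (toy data for illustration):

```python
from collections import Counter

def ml_bigram(tokens):
    """Maximum-likelihood bigram estimate: P(w|h) = N(h,w) / N(h)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))   # N(h,w)
    hist_counts = Counter(tokens[:-1])               # N(h)
    return {(h, w): c / hist_counts[h] for (h, w), c in pair_counts.items()}

tokens = "the cat sat on the mat".split()
p = ml_bigram(tokens)
print(p[("the", "cat")])   # 0.5: "the" occurs twice as a history, once followed by "cat"
# Any bigram that never occurred simply gets probability 0 -- the issue discussed next.
```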

Issues with Maximum Likelihood Estimate

Unseen events
→ vanishing probability estimate
→ speech recognition: such a sequence can never be recognized
→ IR: if a document does not cover all query terms, it cannot be relevant

Need another modeling scheme

5.3 Absolute discounting (Kneser-Ney smoothing, part 1)

Goal of this section


Discuss smoothing techniques that avoid
vanishing probabilities


Backing-off language model

P(w|h) = (N(w,h) - d) / N(h)     for N(w,h) > 0
P(w|h) = α(h) · β(w|h)           otherwise

Backing-off weight α(h)
Backing-off distribution β(w|h) (normalized!)
Discounting parameter d


Calculating the backing-off weight α(h)

Use the normalization:

1 = Σ_w P(w|h) = Σ_{w: N(w,h)>0} (N(w,h) - d) / N(h) + α(h) · Σ_{w: N(w,h)=0} β(w|h)

Solve for α(h):

α(h) = ( 1 - Σ_{w: N(w,h)>0} (N(w,h) - d) / N(h) ) / Σ_{w: N(w,h)=0} β(w|h)
     = d · R(h) / N(h)          with R(h) = Σ_{w: N(w,h)>0} 1

(The sum of β(·|h) over the unseen words equals 1, since β is normalized there.)

Calculating the backing-off weight α(h)

Backing-off weight:

α(h) = d · R(h) / N(h)          with R(h) = Σ_{w: N(w,h)>0} 1

Storing the counts (example: bigrams (v, w))

The counts are stored in a tree:
the root stores N and R;
each first-level node v_i stores N(v_i) and R(v_i);
the leaves below v_i are w_1, ..., w_{R(v_i)} and store the bigram counts N(v_i, w_1), ..., N(v_i, w_{R(v_i)}).

Examples

Example of backing-off code (a rough sketch is given below)
Example of language model
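The code example itself is not part of this handout, so the following is only a rough sketch of the bigram case with absolute discounting; the uniform backing-off distribution over the unseen successor words is an assumption for illustration (later sections derive better choices):

```python
from collections import Counter, defaultdict

class BackingOffBigram:
    """Bigram model with absolute discounting:
    P(w|h) = (N(h,w) - d) / N(h)     if N(h,w) > 0
           = alpha(h) * beta(w|h)    otherwise, with alpha(h) = d * R(h) / N(h)."""

    def __init__(self, tokens, vocab, d=0.7):
        self.d = d
        self.vocab = set(vocab)
        self.counts = defaultdict(Counter)        # counts[h][w] = N(h, w)
        for h, w in zip(tokens, tokens[1:]):
            self.counts[h][w] += 1

    def prob(self, w, h):
        succ = self.counts[h]
        n_h = sum(succ.values())                  # N(h)
        r_h = len(succ)                           # R(h): number of seen successors
        if succ[w] > 0:
            return (succ[w] - self.d) / n_h
        alpha = self.d * r_h / n_h                # backing-off weight
        beta = 1.0 / (len(self.vocab) - r_h)      # uniform over the unseen successors
        return alpha * beta

tokens = "the cat sat on the mat".split()
lm = BackingOffBigram(tokens, vocab=set(tokens), d=0.7)
# The probabilities over the vocabulary sum to one for a seen history:
assert abs(sum(lm.prob(w, "the") for w in lm.vocab) - 1.0) < 1e-9
```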

Influence of the Discounting Parameter

The discounting parameter has a significant impact on the perplexity.
Values of d between 0.7 and 0.9 are typically good choices.

5.4 Leaving-one-out estimate of the discounting parameters

Goal
How to get a good estimate for the
discounting parameter without tuning it
explicitly?


Closed-form optimization solution for the discounting parameter

Idea of cross validation: split the training data into a training part and a cross-validation part; the cross-validation part is used for parameter optimization.

Variants of cross validation


Suppose you want to train a unigram
language model:
What is the smallest possible cross
validation corpus?
How do you avoid fluctuations?


Leaving-one-out
Each unigram of the corpus can be the
cross validation corpus
Calculate average of all possible cross
validation corpora


Change of counts for leaving one out

Corpus: w_1 w_2 w_3 ... w_N. If, e.g., token w_6 is left out:

N(w_6) → N(w_6) - 1
N → N - 1

Leaving-one-out likelihood

F_LOO = Σ_{n=1..N} log P_LOO(w_n) = Σ_{w ∈ W} N(w) · log P_LOO(w)

      = Σ_{w ∈ W: N(w) > 1} N(w) · log( (N(w) - 1 - d) / (N - 1) )
        + Σ_{w ∈ W: N(w) = 1} log( (R - 1) · d / (N - 1) )
        + terms independent of d

Calculate ∂F_LOO / ∂d = 0

Maximizing the Leaving-One-Out Likelihood

Calculate ∂F_LOO / ∂d = 0 (skipping the details of the derivative).

Result (with n_r = number of events seen exactly r times):

0 = n_1 / d - Σ_{r ≥ 2} n_r · r / (r - 1 - d)

→ no closed-form solution for d

Iterative solution

Split off the terms in n_1 and n_2, solve for d, and use the result as an iterative equation:

d_{i+1} = n_1 / ( n_1 + 2·n_2 + Σ_{r ≥ 3} n_r · r · (1 - d_i) / (r - 1 - d_i) )
Approximate solution

Just use the first iteration and initialize with d_0 = 0:

d_1 = n_1 / ( n_1 + 2·n_2 + Σ_{r ≥ 3} n_r · r / (r - 1) )

Approximate solution

Ignore the remaining sum (it changes the discounting parameter only slightly):

d ≈ n_1 / (n_1 + 2·n_2)

Works very well for all practical applications.
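A minimal sketch of this estimate from the count-of-counts; the toy bigram counts are placeholders:

```python
from collections import Counter

def estimate_discount(counts):
    """Approximate leaving-one-out estimate d = n_1 / (n_1 + 2*n_2),
    where n_r is the number of events seen exactly r times."""
    count_of_counts = Counter(counts.values())
    n1, n2 = count_of_counts[1], count_of_counts[2]
    return n1 / (n1 + 2 * n2)

bigram_counts = {("the", "cat"): 1, ("the", "mat"): 1, ("on", "the"): 2}
print(estimate_discount(bigram_counts))   # n_1 = 2, n_2 = 1  ->  2 / 4 = 0.5
```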

Demonstrate the Approximate Solution

→ Test the formula using the Switchboard bigram

Remember Zipf's Law

Zipf (simplified):

N(r) = N_0 / r

(The plot of N(r) marks the steps of length n_1 and n_2.)

Estimate n_1 and n_2

n_1 and n_2 are the lengths of the first and the second step of the Zipf distribution N(r) = N_0 / r^γ:

n_1 ≈ ( N_0 / (1/2) )^{1/γ} - ( N_0 / (3/2) )^{1/γ}

n_2 ≈ ( N_0 / (3/2) )^{1/γ} - ( N_0 / (5/2) )^{1/γ}

Estimate Discounting Parameter

d ≈ n_1 / (n_1 + 2·n_2) = 1 / ( 1 + 2 · n_2/n_1 )
  = 1 / ( 1 + 2 · ( (2/3)^{1/γ} - (2/5)^{1/γ} ) / ( 2^{1/γ} - (2/3)^{1/γ} ) )

For γ = 1:  d = 5/7 ≈ 0.71

The estimate gives only a small degradation compared to the numerically optimized solution.
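A quick numerical check of this formula (γ is the Zipf exponent; γ = 1 reproduces d = 5/7):

```python
def zipf_discount(gamma):
    """d = 1 / (1 + 2*n2/n1) with n1, n2 taken from a Zipf law N(r) = N0 / r**gamma."""
    ratio = ((2 / 3) ** (1 / gamma) - (2 / 5) ** (1 / gamma)) / (
        2 ** (1 / gamma) - (2 / 3) ** (1 / gamma))
    return 1 / (1 + 2 * ratio)

print(zipf_discount(1.0))   # 0.714... = 5/7
print(zipf_discount(1.5))   # smaller value: a steeper Zipf slope gives a smaller d
```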

Slope in Zipf vs. Discounting

The larger the slope, the smaller the discounting parameter.

Zipf for Unigram and Bigram

Longer-range models have larger discounting parameters.

5.5 Other smoothing methods

Goal
Smoothing techniques that are simpler
Smoothing techniques that are used for
other reasons


Floor discounting (add epsilon)

Add a little bit to each count:

P(w|h) = (1/Z) · ( N(w,h) + ε )

1/Z: normalization

1 = Σ_{w=1..V} P(w|h) = (1/Z) · ( N(h) + ε·V )   ⟹   Z = N(h) + ε·V

P(w|h) = ( N(w,h) + ε ) / ( N(h) + ε·V )
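A minimal sketch of floor discounting; the value of ε and the toy counts are placeholders:

```python
def floor_smoothed_prob(w, h, counts, vocab_size, eps=0.01):
    """Floor discounting: P(w|h) = (N(w,h) + eps) / (N(h) + eps * V)."""
    n_wh = counts.get((h, w), 0)
    n_h = sum(c for (hist, _), c in counts.items() if hist == h)
    return (n_wh + eps) / (n_h + eps * vocab_size)

counts = {("the", "cat"): 1, ("the", "mat"): 1}
print(floor_smoothed_prob("cat", "the", counts, vocab_size=5))   # (1 + 0.01) / (2 + 0.05)
```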

Linear Discounting

P(w|h) = (1 - λ) · N(w,h) / N(h) + λ · β(w|h)

Variant:  P(w|h) = (1 - λ(h)) · N(w,h) / N(h) + λ(h) · β(w|h)

Popular value:  λ(h) = n_1(h) / N(h),   with  n_1(h) = Σ_{w: N(w,h)=1} 1

λ: determines the amount of smoothing
β(w|h): "backing-off distribution"

See example.
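A small sketch of the variant with the history-dependent λ(h) = n_1(h)/N(h); the uniform backing-off distribution here is just a placeholder:

```python
from collections import Counter

def linear_discount_prob(w, h, counts, beta):
    """Linear discounting: P(w|h) = (1 - lam) * N(w,h)/N(h) + lam * beta(w|h),
    with lam = n_1(h) / N(h), the number of singleton successors over N(h)."""
    succ = Counter({v: c for (hist, v), c in counts.items() if hist == h})
    n_h = sum(succ.values())
    lam = sum(1 for c in succ.values() if c == 1) / n_h
    return (1 - lam) * succ[w] / n_h + lam * beta(w, h)

counts = {("the", "cat"): 1, ("the", "mat"): 2}
uniform = lambda w, h: 1 / 5              # placeholder backing-off distribution
print(linear_discount_prob("cat", "the", counts, uniform))   # (2/3)*(1/3) + (1/3)*(1/5)
```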

Good-Turing Discounting

Adjusted counts:

N*(wh) = ( N(w,h) + 1 ) · n_{N(w,h)+1} / n_{N(w,h)}     for N(w,h) ≤ 5
N*(wh) = N(w,h)                                         otherwise

Model:

P(w|h) = N*(wh) / N(h)       for N(w,h) > 0
P(w|h) = α(h) · β(w|h)       otherwise

Backing-off weight from normalization:

α(h) = 1 - N*(h) / N(h),     with N*(h) = Σ_{w: N(w,h)>0} N*(wh)
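A sketch of the adjusted counts, with the low-count cutoff at 5 as on the slide; the guard against an empty count-of-counts bin is an implementation assumption:

```python
from collections import Counter

def good_turing_counts(counts, k=5):
    """Adjusted counts N*(wh) = (N(wh) + 1) * n_{N(wh)+1} / n_{N(wh)} for N(wh) <= k,
    and N*(wh) = N(wh) otherwise. n_r = number of events seen exactly r times."""
    n = Counter(counts.values())
    adjusted = {}
    for event, r in counts.items():
        if r <= k and n[r + 1] > 0:          # guard: keep r if n_{r+1} is empty
            adjusted[event] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[event] = r
    return adjusted

counts = {("the", "cat"): 1, ("the", "dog"): 1, ("the", "mat"): 2}
print(good_turing_counts(counts))   # count 1 -> 2 * n_2/n_1 = 1.0; count 2 stays (n_3 = 0)
```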

5.6 Kneser-Ney Smoothing


Kneser-Ney Smoothing
Try to find a dedicated backing-off
distribution
Two different variants
Marginal constraint backing off
Singleton backing-off


Idea

P(w|h) = (N(w,h) - d) / N(h)     for N(w,h) > 0
P(w|h) = α(h) · β(w|h̄)           otherwise

with h̄ being the history h shortened by one word.

Marginal constraint:

p(w|h̄) = Σ_{h'} P(w|h'h̄) · P(h'|h̄)     (**)

with h' being the word removed from the history (h = h'h̄).

→ look for a backing-off distribution β such that (**) is satisfied

Solution

Direct calculation gives the solution

β(w|h̄) ∝ N_+(wh̄),     with N_+(wh̄) = Σ_{h': N(w,h'h̄) > 0} 1

i.e. the number of distinct words h' for which the full event (w, h'h̄) has been seen.

→ "marginal constraint backing-off"

Alternative: Determine the Backing-Off Distribution using Leaving-One-Out

F = F_0 + Σ_{h', h̄, w: N(w,h'h̄) = 1} log( α(h'h̄) · β(w|h̄) ) + Σ_{h̄} μ(h̄) · ( 1 - Σ_w β(w|h̄) )

F_0: likelihood of the events seen more than once
μ(h̄): Lagrange multiplier

Solution for the Leaving-One-Out Approach to the Backing-Off Distribution

Direct calculation of the first derivative gives the solution

β(w|h̄) ∝ N_1(wh̄),     with N_1(wh̄) = Σ_{h': N(w,h'h̄) = 1} 1

→ "singleton backing-off"
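A sketch of the singleton backing-off distribution for a trigram model backing off to bigrams; the data layout (trigram counts keyed by (h', h̄, w)) is an assumption for illustration:

```python
from collections import Counter, defaultdict

def singleton_backoff(trigram_counts):
    """beta(w | hbar) proportional to N_1(w, hbar), the number of distinct
    removed history words h' with N(w, h' hbar) == 1."""
    n1 = defaultdict(Counter)
    for (h_prime, hbar, w), c in trigram_counts.items():
        if c == 1:
            n1[hbar][w] += 1
    beta = {}
    for hbar, counter in n1.items():
        total = sum(counter.values())
        beta[hbar] = {w: c / total for w, c in counter.items()}
    return beta

trigrams = {("a", "b", "c"): 1, ("x", "b", "c"): 1, ("a", "b", "d"): 3}
print(singleton_backoff(trigrams)["b"])   # {'c': 1.0}: both singleton trigrams end in "c"
```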

Comparison: Marginal Constraint vs. Singleton Backing-Off

Backing-off model    | Bigram | Trigram | 4-gram
---------------------|--------|---------|-------
Rel. frequencies     | 160.1  | 146.4   | 147.8
Marginal constraint  | 154.8  | 138.3   | 137.9
Singleton BO         | 156.2  | 142.6   | 141.9

5.7 Pruning of count trees


Goal
Method to decrease size of the tree


Pruning Count-Trees

(Count tree with levels 1, 2, 3.)

Simple criterion for pruning trees: remove all infrequent nodes

Improved Criterion: Change in Likelihood

Metric inspired by the maximum-likelihood objective: if a leaf is pruned, the backing-off probability is used instead of P(w|h).

F_Leaf = N(w,h) · log P(w|h) - N(w,h) · log( α(h) · β(w|h) ) = N(w,h) · log( P(w|h) / ( α(h) · β(w|h) ) )

This is a metric for the leaves of the tree.
Create the measure for sub-trees from the measures of their leaves.
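A minimal sketch of the leaf score; the model quantities P(w|h), α(h) and β(w|h) are passed in as plain numbers here (an assumption, since the slide does not fix an interface):

```python
import math

def leaf_pruning_score(n_wh, p_wh, alpha_h, beta_wh):
    """Change in training likelihood if the leaf (w, h) is pruned and the
    backing-off probability alpha(h) * beta(w|h) is used instead:
    F_Leaf = N(w,h) * log( P(w|h) / (alpha(h) * beta(w|h)) )."""
    return n_wh * math.log(p_wh / (alpha_h * beta_wh))

# Leaves with a small score cost little likelihood and are pruned first.
print(leaf_pruning_score(3, 0.2, 0.1, 0.05))   # 3 * log(40) ≈ 11.1
```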

Effect of language model pruning

Pruned bigram better than pruned trigram
For a given size, pruning gives you the best LM
Results generalize to longer N-grams

Summary


OOV-words
Maximum likelihood estimator
Backing-off language models
Discounting parameter
Kneser-Ney Smoothing
Pruning of trees

