
Chapter 5:

Backing-Off Language Models

5.1 Unknown words and unseen unigrams

Goal of this section


How to define a vocabulary for language
tasks
Notion of out-of-vocabulary words

Example Crime and Punishment


Training corpus: 11 000 tokens
Test corpus:
3000 tokens
108 of them not seen in training corpus

Usually training corpus defines vocabulary


Unseen words in test corpus are hence not
part of the vocabulary
OOV-words: out-of-vocabulary words

Example Crime and Punishment


Training corpus: 11 000 tokens
Test corpus:
3000 tokens
108 of them not seen in training corpus

Definition:

OOV-rate = (# unseen words in test corpus) / (# tokens in test corpus)

OOV-Rate on
Crime and Punishment
OOV-Rate = 108/3000 = 0.036 = 3.6%

How to define a vocabulary

Count the frequency of each word in a corpus
Use the most frequent N words for the vocabulary
(see the sketch below)
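A minimal sketch of this recipe and of the OOV-rate computation; the toy corpora and the cutoff are placeholders:

```python
from collections import Counter

def build_vocabulary(train_tokens, max_size):
    """Keep the max_size most frequent training words as the vocabulary."""
    counts = Counter(train_tokens)
    return {w for w, _ in counts.most_common(max_size)}

def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that are not in the vocabulary."""
    unseen = sum(1 for w in test_tokens if w not in vocabulary)
    return unseen / len(test_tokens)

# Toy usage (replace with real tokenized corpora):
train = "the cat sat on the mat the cat".split()
test = "the dog sat on the mat".split()
vocab = build_vocabulary(train, max_size=5)
print(oov_rate(test, vocab))   # 1/6: only "dog" is out of vocabulary
```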

OOV-Rate vs. Vocabulary

OOV rate depends on vocabulary size


Morphologically rich languages have higher OOV rates for the
same vocabulary size
Specifically, the relation is a power law:

OOV-rate ∝ 1 / (size of vocabulary)

Corresponding Problem with Sequences of Words (M-Grams)

Size of vocabulary: N
Number of sequences of length M over N words: N^M
Numerical example: N = 64000 and M = 3: N^M ≈ 3 · 10^14
No realistic possibility to observe all trigrams in a training corpus
→ need methods to account for unseen events

5.2 Maximum Likelihood Estimate of Probabilities

Goal of this section


Simple way to determine a language
model P(w|h)


From Perplexity to Likelihood

PP = P(w_1 ... w_N)^(-1/N) = exp( - Σ_{w,h} f(w,h) · log P(w|h) )

log(PP) = - Σ_{w,h} f(w,h) · log P(w|h) = - (1/N) · Σ_{w,h} N(w,h) · log P(w|h)

with f(w,h) = N(w,h)/N the relative frequency of the event (w,h).

Define the likelihood:

F = Σ_{w,h} N(w,h) · log P(w|h)

Minimizing the perplexity is equivalent to maximizing the likelihood F.

Maximum-Likelihood Estimation of Probabilities

Maximize

F = Σ_{w,h} N(w,h) · log P(w|h)

using the normalization constraint

Σ_w P(w|h) = 1   for all h

Modified likelihood using Lagrange multipliers:

F' = Σ_{w,h} N(w,h) · log P(w|h) + Σ_h λ(h) · ( 1 - Σ_w P(w|h) )

Maximum-Likelihood Estimation of Probabilities (cont.)

F' = Σ_{w,h} N(w,h) · log P(w|h) + Σ_h λ(h) · ( 1 - Σ_w P(w|h) )

∂F' / ∂P(w'|h') = N(w',h') / P(w'|h') - λ(h') = 0   ⟹   P(w'|h') = N(w',h') / λ(h')

Constraint:

1 = Σ_w P(w|h') = Σ_w N(w,h') / λ(h') = N(h') / λ(h')   ⟹   λ(h') = N(h')

Result:

P(w|h) = N(w,h) / N(h)

The maximum likelihood estimate results in relative frequencies.
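As a small sketch of this result, the maximum-likelihood bigram model is just the table of relative frequencies (toy data for illustration):

```python
from collections import Counter

def ml_bigram(tokens):
    """Maximum-likelihood bigram estimate: P(w|h) = N(h,w) / N(h)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))   # N(h,w)
    hist_counts = Counter(tokens[:-1])               # N(h)
    return {(h, w): c / hist_counts[h] for (h, w), c in pair_counts.items()}

tokens = "the cat sat on the mat".split()
p = ml_bigram(tokens)
print(p[("the", "cat")])   # 0.5: "the" occurs twice as a history, once followed by "cat"
# Any bigram that never occurred simply gets probability 0 -- the issue discussed next.
```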

Issues with Maximum Likelihood Estimate

Unseen events
→ vanishing probability estimate
→ speech recognition: such a sequence can never be recognized
→ IR: if a document does not cover all query terms, it cannot be relevant

Need another modeling scheme

5.3 Absolute discounting (Kneser-Ney smoothing, part 1)

Goal of this section


Discuss smoothing techniques that avoid
vanishing probabilities


Backing-off language model

P(w|h) = (N(w,h) - d) / N(h)     for N(w,h) > 0
P(w|h) = α(h) · β(w|h)           otherwise

Backing-off weight α(h)
Backing-off distribution β(w|h) (normalized!)
Discounting parameter d


Calculating the backing-off weight α(h)

Use the normalization:

1 = Σ_w P(w|h) = Σ_{w: N(w,h)>0} (N(w,h) - d) / N(h) + α(h) · Σ_{w: N(w,h)=0} β(w|h)

Solve for α(h):

α(h) = ( 1 - Σ_{w: N(w,h)>0} (N(w,h) - d) / N(h) ) / Σ_{w: N(w,h)=0} β(w|h)
     = d · R(h) / N(h)          with R(h) = Σ_{w: N(w,h)>0} 1

(The sum of β(·|h) over the unseen words equals 1, since β is normalized there.)

Calculating the backing-off weight α(h)

Backing-off weight:

α(h) = d · R(h) / N(h)          with R(h) = Σ_{w: N(w,h)>0} 1

Storing the counts (example: bigrams (v, w))

The counts are stored in a tree:
the root stores N and R;
each first-level node v_i stores N(v_i) and R(v_i);
the leaves below v_i are w_1, ..., w_{R(v_i)} and store the bigram counts N(v_i, w_1), ..., N(v_i, w_{R(v_i)}).

Examples

Example of backing-off code (a rough sketch is given below)
Example of language model
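The code example itself is not part of this handout, so the following is only a rough sketch of the bigram case with absolute discounting; the uniform backing-off distribution over the unseen successor words is an assumption for illustration (later sections derive better choices):

```python
from collections import Counter, defaultdict

class BackingOffBigram:
    """Bigram model with absolute discounting:
    P(w|h) = (N(h,w) - d) / N(h)     if N(h,w) > 0
           = alpha(h) * beta(w|h)    otherwise, with alpha(h) = d * R(h) / N(h)."""

    def __init__(self, tokens, vocab, d=0.7):
        self.d = d
        self.vocab = set(vocab)
        self.counts = defaultdict(Counter)        # counts[h][w] = N(h, w)
        for h, w in zip(tokens, tokens[1:]):
            self.counts[h][w] += 1

    def prob(self, w, h):
        succ = self.counts[h]
        n_h = sum(succ.values())                  # N(h)
        r_h = len(succ)                           # R(h): number of seen successors
        if succ[w] > 0:
            return (succ[w] - self.d) / n_h
        alpha = self.d * r_h / n_h                # backing-off weight
        beta = 1.0 / (len(self.vocab) - r_h)      # uniform over the unseen successors
        return alpha * beta

tokens = "the cat sat on the mat".split()
lm = BackingOffBigram(tokens, vocab=set(tokens), d=0.7)
# The probabilities over the vocabulary sum to one for a seen history:
assert abs(sum(lm.prob(w, "the") for w in lm.vocab) - 1.0) < 1e-9
```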

Influence of the Discounting Parameter

The discounting parameter has a significant impact on the perplexity.
Values of d between 0.7 and 0.9 are typically good choices.

5.4 Leaving-one-out estimate of the discounting parameters

Goal
How to get a good estimate for the
discounting parameter without tuning it
explicitly?


Closed-form optimization solution for the discounting parameter

Idea of cross validation: split the training data into a training part and a cross-validation part; the cross-validation part is used for parameter optimization.

Variants of cross validation


Suppose you want to train a unigram
language model:
What is the smallest possible cross
validation corpus?
How do you avoid fluctuations?


Leaving-one-out
Each unigram of the corpus can be the
cross validation corpus
Calculate average of all possible cross
validation corpora


Change of counts for leaving one out

Corpus: w_1 w_2 w_3 ... w_N. If, e.g., token w_6 is left out:

N(w_6) → N(w_6) - 1
N → N - 1

Leaving-one-out likelihood

F_LOO = Σ_{n=1..N} log P_LOO(w_n) = Σ_{w ∈ W} N(w) · log P_LOO(w)

      = Σ_{w ∈ W: N(w) > 1} N(w) · log( (N(w) - 1 - d) / (N - 1) )
        + Σ_{w ∈ W: N(w) = 1} log( (R - 1) · d / (N - 1) )
        + terms independent of d

Calculate ∂F_LOO / ∂d = 0

Maximizing the Leaving-One-Out Likelihood

Calculate ∂F_LOO / ∂d = 0 (skipping the details of the derivative).

Result (with n_r = number of events seen exactly r times):

0 = n_1 / d - Σ_{r ≥ 2} n_r · r / (r - 1 - d)

→ no closed-form solution for d

Iterative solution

Split off the terms in n_1 and n_2, solve for d, and use the result as an iterative equation:

d_{i+1} = n_1 / ( n_1 + 2·n_2 + Σ_{r ≥ 3} n_r · r · (1 - d_i) / (r - 1 - d_i) )
Approximate solution

Just use the first iteration and initialize with d_0 = 0:

d_1 = n_1 / ( n_1 + 2·n_2 + Σ_{r ≥ 3} n_r · r / (r - 1) )

Approximate solution

Ignore the remaining sum (it changes the discounting parameter only slightly):

d ≈ n_1 / (n_1 + 2·n_2)

Works very well for all practical applications.
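A minimal sketch of this estimate from the count-of-counts; the toy bigram counts are placeholders:

```python
from collections import Counter

def estimate_discount(counts):
    """Approximate leaving-one-out estimate d = n_1 / (n_1 + 2*n_2),
    where n_r is the number of events seen exactly r times."""
    count_of_counts = Counter(counts.values())
    n1, n2 = count_of_counts[1], count_of_counts[2]
    return n1 / (n1 + 2 * n2)

bigram_counts = {("the", "cat"): 1, ("the", "mat"): 1, ("on", "the"): 2}
print(estimate_discount(bigram_counts))   # n_1 = 2, n_2 = 1  ->  2 / 4 = 0.5
```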

Demonstrate the Approximate Solution

→ Test the formula using the Switchboard bigram

Remember Zipf's Law

Zipf (simplified):

N(r) = N_0 / r

(The plot of N(r) marks the steps of length n_1 and n_2.)

Estimate n_1 and n_2

n_1 and n_2 are the lengths of the first and the second step of the Zipf distribution N(r) = N_0 / r^γ:

n_1 ≈ ( N_0 / (1/2) )^{1/γ} - ( N_0 / (3/2) )^{1/γ}

n_2 ≈ ( N_0 / (3/2) )^{1/γ} - ( N_0 / (5/2) )^{1/γ}

Estimate Discounting Parameter

d ≈ n_1 / (n_1 + 2·n_2) = 1 / ( 1 + 2 · n_2/n_1 )
  = 1 / ( 1 + 2 · ( (2/3)^{1/γ} - (2/5)^{1/γ} ) / ( 2^{1/γ} - (2/3)^{1/γ} ) )

For γ = 1:  d = 5/7 ≈ 0.71

The estimate gives only a small degradation compared to the numerically optimized solution.
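A quick numerical check of this formula (γ is the Zipf exponent; γ = 1 reproduces d = 5/7):

```python
def zipf_discount(gamma):
    """d = 1 / (1 + 2*n2/n1) with n1, n2 taken from a Zipf law N(r) = N0 / r**gamma."""
    ratio = ((2 / 3) ** (1 / gamma) - (2 / 5) ** (1 / gamma)) / (
        2 ** (1 / gamma) - (2 / 3) ** (1 / gamma))
    return 1 / (1 + 2 * ratio)

print(zipf_discount(1.0))   # 0.714... = 5/7
print(zipf_discount(1.5))   # smaller value: a steeper Zipf slope gives a smaller d
```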

Slope in Zipf vs. Discounting

The larger the slope, the smaller the discounting parameter.

Zipf for Unigram and Bigram

Longer-range models have larger discounting parameters.

5.5 Other smoothing methods

Goal
Smoothing techniques that are simpler
Smoothing techniques that are used for
other reasons


Floor discounting (add epsilon)

Add a little bit to each count:

P(w|h) = (1/Z) · ( N(w,h) + ε )

1/Z: normalization

1 = Σ_{w=1..V} P(w|h) = (1/Z) · ( N(h) + ε·V )   ⟹   Z = N(h) + ε·V

P(w|h) = ( N(w,h) + ε ) / ( N(h) + ε·V )
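A minimal sketch of floor discounting; the value of ε and the toy counts are placeholders:

```python
def floor_smoothed_prob(w, h, counts, vocab_size, eps=0.01):
    """Floor discounting: P(w|h) = (N(w,h) + eps) / (N(h) + eps * V)."""
    n_wh = counts.get((h, w), 0)
    n_h = sum(c for (hist, _), c in counts.items() if hist == h)
    return (n_wh + eps) / (n_h + eps * vocab_size)

counts = {("the", "cat"): 1, ("the", "mat"): 1}
print(floor_smoothed_prob("cat", "the", counts, vocab_size=5))   # (1 + 0.01) / (2 + 0.05)
```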

Linear Discounting

P(w|h) = (1 - λ) · N(w,h) / N(h) + λ · β(w|h)

Variant:  P(w|h) = (1 - λ(h)) · N(w,h) / N(h) + λ(h) · β(w|h)

Popular value:  λ(h) = n_1(h) / N(h),   with  n_1(h) = Σ_{w: N(w,h)=1} 1

λ: determines the amount of smoothing
β(w|h): "backing-off distribution"

See example.
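A small sketch of the variant with the history-dependent λ(h) = n_1(h)/N(h); the uniform backing-off distribution here is just a placeholder:

```python
from collections import Counter

def linear_discount_prob(w, h, counts, beta):
    """Linear discounting: P(w|h) = (1 - lam) * N(w,h)/N(h) + lam * beta(w|h),
    with lam = n_1(h) / N(h), the number of singleton successors over N(h)."""
    succ = Counter({v: c for (hist, v), c in counts.items() if hist == h})
    n_h = sum(succ.values())
    lam = sum(1 for c in succ.values() if c == 1) / n_h
    return (1 - lam) * succ[w] / n_h + lam * beta(w, h)

counts = {("the", "cat"): 1, ("the", "mat"): 2}
uniform = lambda w, h: 1 / 5              # placeholder backing-off distribution
print(linear_discount_prob("cat", "the", counts, uniform))   # (2/3)*(1/3) + (1/3)*(1/5)
```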

Good-Turing Discounting

Adjusted counts:

N*(wh) = ( N(w,h) + 1 ) · n_{N(w,h)+1} / n_{N(w,h)}     for N(w,h) ≤ 5
N*(wh) = N(w,h)                                         otherwise

Model:

P(w|h) = N*(wh) / N(h)       for N(w,h) > 0
P(w|h) = α(h) · β(w|h)       otherwise

Backing-off weight from normalization:

α(h) = 1 - N*(h) / N(h),     with N*(h) = Σ_{w: N(w,h)>0} N*(wh)
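A sketch of the adjusted counts, with the low-count cutoff at 5 as on the slide; the guard against an empty count-of-counts bin is an implementation assumption:

```python
from collections import Counter

def good_turing_counts(counts, k=5):
    """Adjusted counts N*(wh) = (N(wh) + 1) * n_{N(wh)+1} / n_{N(wh)} for N(wh) <= k,
    and N*(wh) = N(wh) otherwise. n_r = number of events seen exactly r times."""
    n = Counter(counts.values())
    adjusted = {}
    for event, r in counts.items():
        if r <= k and n[r + 1] > 0:          # guard: keep r if n_{r+1} is empty
            adjusted[event] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[event] = r
    return adjusted

counts = {("the", "cat"): 1, ("the", "dog"): 1, ("the", "mat"): 2}
print(good_turing_counts(counts))   # count 1 -> 2 * n_2/n_1 = 1.0; count 2 stays (n_3 = 0)
```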

5.6 Kneser-Ney Smoothing


Kneser-Ney Smoothing
Try to find a dedicated backing-off
distribution
Two different variants
Marginal constraint backing off
Singleton backing-off


Idea

P(w|h) = (N(w,h) - d) / N(h)     for N(w,h) > 0
P(w|h) = α(h) · β(w|h̄)           otherwise

with h̄ being the history h shortened by one word.

Marginal constraint:

p(w|h̄) = Σ_{h'} P(w|h'h̄) · P(h'|h̄)     (**)

with h' being the word removed from the history (h = h'h̄).

→ look for a backing-off distribution β such that (**) is satisfied

Solution

Direct calculation gives the solution

β(w|h̄) ∝ N_+(wh̄),     with N_+(wh̄) = Σ_{h': N(w,h'h̄) > 0} 1

i.e. the number of distinct words h' for which the full event (w, h'h̄) has been seen.

→ "marginal constraint backing-off"

Alternative: Determine the Backing-Off Distribution using Leaving-One-Out

F = F_0 + Σ_{h', h̄, w: N(w,h'h̄) = 1} log( α(h'h̄) · β(w|h̄) ) + Σ_{h̄} μ(h̄) · ( 1 - Σ_w β(w|h̄) )

F_0: likelihood of the events seen more than once
μ(h̄): Lagrange multiplier

Solution for the Leaving-One-Out Approach to the Backing-Off Distribution

Direct calculation of the first derivative gives the solution

β(w|h̄) ∝ N_1(wh̄),     with N_1(wh̄) = Σ_{h': N(w,h'h̄) = 1} 1

→ "singleton backing-off"
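A sketch of the singleton backing-off distribution for a trigram model backing off to bigrams; the data layout (trigram counts keyed by (h', h̄, w)) is an assumption for illustration:

```python
from collections import Counter, defaultdict

def singleton_backoff(trigram_counts):
    """beta(w | hbar) proportional to N_1(w, hbar), the number of distinct
    removed history words h' with N(w, h' hbar) == 1."""
    n1 = defaultdict(Counter)
    for (h_prime, hbar, w), c in trigram_counts.items():
        if c == 1:
            n1[hbar][w] += 1
    beta = {}
    for hbar, counter in n1.items():
        total = sum(counter.values())
        beta[hbar] = {w: c / total for w, c in counter.items()}
    return beta

trigrams = {("a", "b", "c"): 1, ("x", "b", "c"): 1, ("a", "b", "d"): 3}
print(singleton_backoff(trigrams)["b"])   # {'c': 1.0}: both singleton trigrams end in "c"
```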

Comparison: Marginal Constraint vs. Singleton Backing-Off

Backing-off model    | Bigram | Trigram | 4-gram
---------------------|--------|---------|-------
Rel. frequencies     | 160.1  | 146.4   | 147.8
Marginal constraint  | 154.8  | 138.3   | 137.9
Singleton BO         | 156.2  | 142.6   | 141.9

5.7 Pruning of count trees


Goal
Method to decrease size of the tree


Pruning Count-Trees

(Count tree with levels 1, 2, 3.)

Simple criterion for pruning trees: remove all infrequent nodes

Improved Criterion: Change in Likelihood

Metric inspired by the maximum-likelihood objective: if a leaf is pruned, the backing-off probability is used instead of P(w|h).

F_Leaf = N(w,h) · log P(w|h) - N(w,h) · log( α(h) · β(w|h) ) = N(w,h) · log( P(w|h) / ( α(h) · β(w|h) ) )

This is a metric for the leaves of the tree.
Create the measure for sub-trees from the measures of their leaves.
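A minimal sketch of the leaf score; the model quantities P(w|h), α(h) and β(w|h) are passed in as plain numbers here (an assumption, since the slide does not fix an interface):

```python
import math

def leaf_pruning_score(n_wh, p_wh, alpha_h, beta_wh):
    """Change in training likelihood if the leaf (w, h) is pruned and the
    backing-off probability alpha(h) * beta(w|h) is used instead:
    F_Leaf = N(w,h) * log( P(w|h) / (alpha(h) * beta(w|h)) )."""
    return n_wh * math.log(p_wh / (alpha_h * beta_wh))

# Leaves with a small score cost little likelihood and are pruned first.
print(leaf_pruning_score(3, 0.2, 0.1, 0.05))   # 3 * log(40) ≈ 11.1
```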

Effect of language model pruning

Pruned bigram better than pruned trigram
For a given size, pruning gives you the best LM
Results generalize to longer N-grams

Summary


OOV-words
Maximum likelihood estimator
Backing-off language models
Discounting parameter
Kneser-Ney Smoothing
Pruning of trees

