Definition: the OOV rate is the fraction of running words in a text that are not covered by the vocabulary.

Example: OOV rate on "Crime and Punishment":

OOV-Rate = 108/3000 = 0.036 = 3.6%
[Figure: OOV rate as a function of the size of the vocabulary]
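The OOV-rate computation can be sketched as follows; the word list and vocabulary here are made-up toy data, not the 3000-word sample from the slide:

```python
def oov_rate(running_words, vocabulary):
    """Fraction of running words that are not covered by the vocabulary."""
    oov = sum(1 for w in running_words if w not in vocabulary)
    return oov / len(running_words)

# Toy data (hypothetical, for illustration only):
vocab = {"the", "old", "woman", "axe"}
words = ["the", "old", "pawnbroker", "and", "the", "axe"]
rate = oov_rate(words, vocab)  # 2 of the 6 running words are OOV
```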
Perplexity and Likelihood

PP = \exp\Big(-\sum_{w,h} f(w,h)\,\log P(w|h)\Big)

\log(PP) = -\sum_{w,h} f(w,h)\,\log P(w|h)

⇒ Minimize perplexity.

With the relative frequencies f(w,h) = N(w,h)/N:

\log(PP) = -\frac{1}{N}\sum_{w,h} N(w,h)\,\log P(w|h)

Define the likelihood

F = \sum_{w,h} N(w,h)\,\log P(w|h)

⇒ Maximize likelihood (equivalent to minimizing perplexity).
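The relation between perplexity and likelihood can be checked numerically; the counts below are a hypothetical toy example:

```python
import math

def log_perplexity(counts, model):
    """log PP = -(1/N) * sum over (w,h) of N(w,h) * log P(w|h)."""
    n = sum(counts.values())
    return -sum(c * math.log(model[hw]) for hw, c in counts.items()) / n

# Hypothetical bigram counts: after history 'a', 'b' occurred twice, 'c' once.
counts = {("a", "b"): 2, ("a", "c"): 1}
# The maximum-likelihood model P(w|h) = N(w,h)/N(h) maximizes the likelihood
# and therefore minimizes the perplexity:
model = {("a", "b"): 2 / 3, ("a", "c"): 1 / 3}
pp = math.exp(log_perplexity(counts, model))
```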
Maximum-Likelihood Estimation of Probabilities

Maximize

F = \sum_{w,h} N(w,h)\,\log P(w|h)

subject to the normalization constraint

\sum_w P(w|h) = 1   for all h.

Lagrangian:

F' = \sum_{w,h} N(w,h)\,\log P(w|h) + \sum_h \mu(h)\Big(1 - \sum_w P(w|h)\Big)
Maximum-Likelihood Estimation of Probabilities (cont.)

F' = \sum_{w,h} N(w,h)\,\log P(w|h) + \sum_h \mu(h)\Big(1 - \sum_w P(w|h)\Big)

\frac{\partial F'}{\partial P(w'|h')} = \frac{N(w',h')}{P(w'|h')} - \mu(h') = 0
\quad\Rightarrow\quad P(w'|h') = \frac{N(w',h')}{\mu(h')}

Constraint:

1 = \sum_w P(w|h') = \sum_w \frac{N(w,h')}{\mu(h')} = \frac{N(h')}{\mu(h')}
\quad\Rightarrow\quad \mu(h') = N(h')

Result:

P(w|h) = \frac{N(w,h)}{N(h)}
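The closed-form result P(w|h) = N(w,h)/N(h) is straightforward to implement; this sketch uses the previous word as the history h:

```python
from collections import Counter

def ml_bigram_model(words):
    """Maximum-likelihood estimate P(w|h) = N(w,h)/N(h) for a bigram model."""
    pair_counts = Counter(zip(words, words[1:]))  # N(w,h), h = previous word
    history_counts = Counter(words[:-1])          # N(h)
    return {(h, w): c / history_counts[h] for (h, w), c in pair_counts.items()}

model = ml_bigram_model(["a", "b", "a", "b", "a", "c"])
# model[("a", "b")] = 2/3, model[("a", "c")] = 1/3, model[("b", "a")] = 1.0
```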
Backing-Off Model (absolute discounting)

P(w|h) = \frac{N(w,h) - d}{N(h)}        for N(w,h) > 0
P(w|h) = \alpha(h)\,\beta(w|h)          for N(w,h) = 0

- Backing-off weight \alpha(h)
- Backing-off distribution \beta(w|h) (normalized!)
- Discounting parameter d
Normalization \sum_w P(w|h) = 1 determines \alpha(h):

\alpha(h)\sum_{w:N(w,h)=0}\beta(w|h) = 1 - \sum_{w:N(w,h)>0}\frac{N(w,h)-d}{N(h)} = \frac{d\,R(h)}{N(h)}

so that, with \beta normalized over the unseen words,

\alpha(h) = \frac{d\,R(h)}{N(h)}   with R(h) = \sum_{w:N(w,h)>0} 1
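A minimal sketch of the backing-off model with absolute discounting, using a unigram distribution renormalized over the unseen words as the backing-off distribution; the discount d = 0.5 is an arbitrary choice:

```python
from collections import Counter

def abs_discount_bigram(words, d=0.5):
    """Bigram model with absolute discounting and backing-off (a sketch)."""
    pairs = Counter(zip(words, words[1:]))
    hist = Counter(words[:-1])
    unigram = Counter(words)

    def prob(h, w):
        if pairs[(h, w)] > 0:
            return (pairs[(h, w)] - d) / hist[h]
        # unseen event: distribute the discounted mass d*R(h)/N(h) over the
        # unigram distribution renormalized over the unseen words
        seen = {w2 for (h2, w2) in pairs if h2 == h}
        r_h = len(seen)                                       # R(h)
        z = sum(c for w2, c in unigram.items() if w2 not in seen)
        return d * r_h / hist[h] * unigram[w] / z

    return prob

prob = abs_discount_bigram(["a", "b", "a", "c", "a", "b"])
# P(b|a) = (2 - 0.5)/3 = 0.5; the unseen P(a|a) receives the discounted mass
```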
[Figure: count tree. First-level nodes v_1, v_2, ..., v_R store N(v) and R(v); below each node v, the leaves w_1, ..., w_{R(v)} store the counts N(v w_1), ..., N(v w_{R(v)}); more leaves follow for the remaining nodes.]
Examples
- Example of backing-off code
- Example of language model
Goal

How to get a good estimate of the discounting parameter without tuning it explicitly?

Cross-validation: split the corpus into training data and cross-validation data.
Leaving-one-out
- Each unigram of the corpus can be the cross-validation corpus
- Calculate the average over all possible cross-validation corpora

[Figure: leaving out one occurrence of a word changes its count N(w) to N(w)-1 and the corpus size N to N-1]
Leaving-one-out likelihood

F_{LOO} = \sum_{w \in W} N(w)\,\log P_{LOO}(w)
        = \sum_{w \in W: N(w)>1} N(w)\,\log\frac{N(w)-1-d}{N-1}
        + \sum_{w \in W: N(w)=1} \log\frac{(R-1)\,d}{N-1}

Calculate \frac{\partial F_{LOO}}{\partial d} = 0.
Maximizing Leaving-one-out Likelihood

Calculate \frac{\partial F_{LOO}}{\partial d} = 0:

0 = -\sum_{r \ge 2} n_r\,\frac{r}{r-1-d} + \frac{n_1}{d}

with n_r = number of words occurring exactly r times.
Iterative solution

Rearranging gives the fixed-point iteration

d_{i+1} = \frac{n_1}{\,n_1 + 2 n_2 + \sum_{r \ge 3} n_r\,r\,\frac{1-d_i}{r-1-d_i}\,}
Approximate solution

For d \ll 1:

d \approx \frac{n_1}{\,n_1 + 2 n_2 + \sum_{r \ge 3} n_r\,\frac{r}{r-1}\,}
Approximate solution (cont.)
- Ignoring the remaining sum changes the discounting parameter only slightly:

d \approx \frac{n_1}{n_1 + 2 n_2}
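The closed-form estimate d ≈ n1/(n1 + 2 n2) needs only the count-of-counts statistics:

```python
from collections import Counter

def discount_estimate(words):
    """Leaving-one-out estimate d = n1 / (n1 + 2*n2), where n_r is the
    number of distinct words occurring exactly r times."""
    count_of_counts = Counter(Counter(words).values())
    n1, n2 = count_of_counts[1], count_of_counts[2]
    return n1 / (n1 + 2 * n2)

d = discount_estimate("a b c c d d e e e".split())  # n1 = 2, n2 = 2 -> d = 1/3
```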
Zipf (simplified): the count of the word with rank r is

N(r) = \frac{N_0}{r}

Use this to estimate the ratio n_2 / n_1.

[Figure: Zipf distribution of word counts N(r) over ranks r]
Estimate n1 and n2

Length of the first steps in the Zipf distribution: N(r) = N_0/r rounds to the count k for N_0/(k+1/2) < r \le N_0/(k-1/2), so

n_1 = \frac{N_0}{1/2} - \frac{N_0}{3/2}

n_2 = \frac{N_0}{3/2} - \frac{N_0}{5/2}
This gives

\frac{n_2}{n_1} = \frac{N_0(2/3 - 2/5)}{N_0(2 - 2/3)} = \frac{4/15}{4/3} = \frac{1}{5}

and hence

d = \frac{1}{1 + 2\,n_2/n_1} = \frac{5}{7} \approx 0.71

This estimate gives only a small degradation compared to the optimal numerical solution.
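The Zipf-based prediction d = 5/7 can be checked numerically by generating counts N(r) = round(N0/r) and tallying n1 and n2 (N0 = 10**4 is an arbitrary choice):

```python
# Generate Zipf counts N(r) = round(N0 / r) and tally n1, n2.
N0 = 10**4
n1 = n2 = 0
r = 1
while True:
    c = round(N0 / r)
    if c == 0:
        break
    n1 += (c == 1)
    n2 += (c == 2)
    r += 1
d = n1 / (n1 + 2 * n2)  # should be close to the predicted 5/7
```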
[Figure: estimated discounting parameters for different m-gram orders]

Longer-range models have larger discounting parameters.
Goal
- Smoothing techniques that are simpler
- Smoothing techniques that are used for other reasons
Additive smoothing:

P(w|h) = \frac{N(w,h) + 1}{N(h) + V}

Normalization: Z = N(h) + V (with V = \sum_w 1 the vocabulary size) ensures \sum_w P(w|h) = 1.
Linear Discounting

P(w|h) = (1-\lambda)\,\frac{N(w,h)}{N(h)} + \lambda\,\beta(w|h)

Variant:

P(w|h) = (1-\lambda(h))\,\frac{N(w,h)}{N(h)} + \lambda(h)\,\beta(w|h)

Popular value:

\lambda(h) = \frac{n_1(h)}{N(h)}   with n_1(h) = \sum_{w:N(w,h)=1} 1

See example.
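A sketch of linear discounting with the history-dependent weight λ(h) = n1(h)/N(h), backing off to the unigram distribution:

```python
from collections import Counter

def linear_discount_model(words):
    """Bigram model with linear discounting, lambda(h) = n1(h)/N(h)."""
    pairs = Counter(zip(words, words[1:]))
    hist = Counter(words[:-1])
    unigram = Counter(words)
    total = sum(unigram.values())

    def prob(h, w):
        n1_h = sum(1 for (h2, _), c in pairs.items() if h2 == h and c == 1)
        lam = n1_h / hist[h]  # popular choice for lambda(h) from the slide
        return (1 - lam) * pairs[(h, w)] / hist[h] + lam * unigram[w] / total

    return prob

prob = linear_discount_model("a b a c a b".split())
```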
Good-Turing Discounting

Adjusted counts:

N^*(w,h) = (N(w,h)+1)\,\frac{n_{N(w,h)+1}}{n_{N(w,h)}}   for N(w,h) \le 5
N^*(w,h) = N(w,h)                                         otherwise

Model:

P(w|h) = \frac{N^*(w,h)}{N(h)}       for N(w,h) > 0
P(w|h) = \alpha(h)\,\beta(w|h)       for N(w,h) = 0

with

\alpha(h) = 1 - \frac{N^*(h)}{N(h)}
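The adjusted counts are easy to compute from the count-of-counts n_r; this sketch keeps the raw count when the formula is undefined (n_{N+1} = 0) or above the cutoff:

```python
from collections import Counter

def good_turing_counts(counts, k=5):
    """Good-Turing adjusted counts N* = (N+1) * n_{N+1} / n_N for N <= k."""
    n_r = Counter(counts.values())  # n_r = number of events seen exactly r times
    adjusted = {}
    for event, n in counts.items():
        if n <= k and n_r[n + 1] > 0:
            adjusted[event] = (n + 1) * n_r[n + 1] / n_r[n]
        else:
            adjusted[event] = float(n)  # raw count above the cutoff
    return adjusted

counts = Counter("a b b c c c c c c c".split())  # a:1, b:2, c:7
adj = good_turing_counts(counts)  # N*(a) = 2 * n_2/n_1 = 2.0; c stays at 7.0
```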
Kneser-Ney Smoothing
- Try to find a dedicated backing-off distribution
- Two different variants:
  - Marginal-constraint backing-off
  - Singleton backing-off

Idea: keep the absolute-discounting model

P(w|h) = \frac{N(w,h) - d}{N(h)}     for N(w,h) > 0
P(w|h) = \alpha(h)\,\beta(w|h)       for N(w,h) = 0

and choose \beta so that the marginal constraint (**) holds:

\sum_{h'} N(h')\,P(w|h') = N(w)
Solution

Direct calculation gives continuation counts: for the bigram case,

\beta(w) = \frac{N_+(w)}{\sum_{w'} N_+(w')}   with N_+(w) = \sum_{v:N(v,w)>0} 1

i.e. \beta(w) counts the distinct histories in which w has been observed, rather than the raw frequency of w.
Marginal constraint backing-off: maximize

F_0 = \sum_{w,h} N(w,h)\,\log\big(\alpha(h)\,\beta(w|h)\big) + \sum_h \mu(h)\Big(1 - \sum_w \beta(w|h)\Big)

with Lagrange multipliers \mu(h).

Singleton backing-off: replace the continuation counts by singleton counts,

\beta(w) \propto N_1(w)   with N_1(w) = \sum_{v:N(v,w)=1} 1

(bigram case; N_1 counts the predecessors v after which w was seen exactly once).
Perplexity results:

                      Bigram   Trigram   4-gram
Rel. frequencies       160.1     146.4    147.8
Marginal constraint    154.8     138.3    137.9
Singleton BO           156.2     142.6    141.9
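A sketch of the marginal-constraint Kneser-Ney backing-off distribution for the bigram case, where β(w) is proportional to the number of distinct predecessor words of w rather than its raw frequency:

```python
from collections import Counter

def kneser_ney_backoff(words):
    """Backing-off distribution beta(w) proportional to the number of
    distinct predecessors N_plus(w) = |{v : N(v,w) > 0}| (bigram case)."""
    predecessors = {}
    for v, w in zip(words, words[1:]):
        predecessors.setdefault(w, set()).add(v)
    n_plus = {w: len(vs) for w, vs in predecessors.items()}
    z = sum(n_plus.values())
    return {w: c / z for w, c in n_plus.items()}

beta = kneser_ney_backoff("a b a c b b c a".split())  # beta('b') = 3/7
```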
Goal

Method to decrease the size of the count tree.
Pruning Count-Trees

[Figure: count tree with levels 1, 2, and 3; pruning removes nodes below a chosen level]
Improved Criterion: Change in Likelihood
- Metric inspired by the maximum-likelihood objective
- If a leaf is pruned, the backing-off probability is used instead

\Delta F = N(w,h)\,\log\frac{P(w|h)}{\alpha(h)\,\beta(w|h)}
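The pruning criterion is a per-leaf loss in log-likelihood; leaves with the smallest loss are pruned first (the numbers below are hypothetical):

```python
import math

def pruning_loss(n_wh, p_wh, alpha_h, beta_wh):
    """Change in log-likelihood when leaf (w,h) is pruned and replaced by
    the backing-off probability: N(w,h) * log(P(w|h) / (alpha(h)*beta(w|h)))."""
    return n_wh * math.log(p_wh / (alpha_h * beta_wh))

# Hypothetical leaf seen 3 times: P(w|h) = 0.3, backing-off prob 0.5*0.2 = 0.1
loss = pruning_loss(3, 0.3, 0.5, 0.2)  # = 3 * log(3)
```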
Summary
- OOV words
- Maximum-likelihood estimator
- Backing-off language models
- Discounting parameter
- Kneser-Ney smoothing
- Pruning of count trees