Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Slide 1
A Markov System
Has N states, called s1, s2 .. sN
[Diagram: three states s1, s2, s3; N=3, t=0]
Slide 2
A Markov System
Has N states, called s1, s2 .. sN
There are discrete timesteps, t=0, t=1, ... On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}.
[Diagram: N=3, t=0, current state qt = q0 = s3]
Slide 3
A Markov System
There are discrete timesteps, t=0, t=1, ... On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}. Between each timestep, the next state is chosen randomly.
[Diagram: N=3, t=1, current state qt = q1 = s2]
Slide 4
A Markov System
Has N states, called s1, s2 .. sN. There are discrete timesteps, t=0, t=1, ... On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}. Between each timestep, the next state is chosen randomly. The current state determines the probability distribution for the next state.
Slide 5
[Diagram: N=3, t=1, current state qt = q1 = s2]
P(qt+1 = s1 | qt = s3) = 1/3
P(qt+1 = s2 | qt = s3) = 2/3
P(qt+1 = s3 | qt = s3) = 0
A Markov System
Has N states, called s1, s2 .. sN. There are discrete timesteps, t=0, t=1, ... On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}. Between each timestep, the next state is chosen randomly. The current state determines the probability distribution for the next state.
Slide 6
[Diagram: transition probabilities shown on the arcs; N=3, t=1, qt = q1 = s2]
Markov Property
qt+1 is conditionally independent of {qt-1, qt-2, ..., q1, q0} given qt. In other words: P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history). Question: what would be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4)?
Slide 7
Answer: [Bayes net: a chain q0 → q1 → q2 → q3 → q4]
P(qt+1 = s1 | qt = s2) = 1/2,  P(qt+1 = s2 | qt = s2) = 1/2,  P(qt+1 = s3 | qt = s2) = 0
P(qt+1 = s1 | qt = s1) = 0,   P(qt+1 = s2 | qt = s1) = 0,   P(qt+1 = s3 | qt = s1) = 1
Markov Property
qt+1 is conditionally independent of {qt-1, qt-2, ..., q1, q0} given qt. In other words: P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history). Question: what would be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4)?
[Answer diagram: Bayes net chain q0 → q1 → q2 → q3 → q4; state diagram with transition probabilities; N=3, t=1, qt = q1 = s2]
Slide 8
Answer: a Bayes net chain q0 → q1 → q2 → q3 → q4.

Each state si has a probability table giving the distribution of the next state:

            P(qt+1=s1|qt=si)  P(qt+1=s2|qt=si)  ...  P(qt+1=sj|qt=si)  ...  P(qt+1=sN|qt=si)
  i = 1          a11               a12          ...       a1j          ...       a1N
  i = 2          a21               a22          ...       a2j          ...       a2N
  :               :                 :                       :                      :
  i = N          aN1               aN2          ...       aNj          ...       aNN

Each of these probability tables is identical at every timestep.

Notation: aij = P(qt+1 = sj | qt = si)
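As a concrete illustration (not from the slides), the three-state system above can be written down and simulated directly; a minimal sketch, with the transition probabilities read off the earlier slides and an arbitrary rng seed:

```python
import numpy as np

# Transition matrix: row i holds P(q_{t+1} = s_j | q_t = s_i).
A = np.array([
    [0.0, 0.0, 1.0],      # from s1
    [0.5, 0.5, 0.0],      # from s2
    [1/3, 2/3, 0.0],      # from s3
])

rng = np.random.default_rng(0)

def next_state(i: int) -> int:
    """Sample the next state index given the current state index i (0-based)."""
    return rng.choice(len(A), p=A[i])

# Simulate a few steps starting from s3 (index 2), as on the slides.
q = 2
for t in range(5):
    q = next_state(q)
    print(f"t={t+1}: q = s{q+1}")
```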
Slide 9
A Blind Robot
A human and a robot wander around randomly on a grid
[Grid diagram: R marks the robot, H marks the human]
Note: N (number of states) = 18 × 18 = 324
Slide 10
STATE q = (position of robot, position of human)
Dynamics of System
q0 = [initial configuration shown on the grid]
Each timestep the human moves randomly to an adjacent cell. And the robot also moves randomly to an adjacent cell.
Typical Questions: What's the expected time until the human is crushed like a bug? What's the probability that the robot will hit the left wall before it hits the human? What's the probability the robot crushes the human on the next time step?
Slide 11
Example Question
It's currently time t, and the human remains uncrushed. What's the probability of crushing occurring at time t + 1?
If the robot is blind: we can compute this in advance. (We'll do this first.)
If the robot is omnipotent (i.e. it knows the state at time t): it can compute the answer directly. (Too easy. We won't do this.)
If the robot has some sensors, but incomplete state information: Hidden Markov Models are applicable! (Main body of the lecture.)
Slide 12
What is P(qt = s)? (Slides 13–18)

Define pt(i) = P(qt = si). p0(i) = P(q0 = si) is given by the starting distribution. Then, for each j:

pt+1(j) = P(qt+1 = sj)
        = Σi=1..N P(qt+1 = sj ∧ qt = si)
        = Σi=1..N P(qt+1 = sj | qt = si) P(qt = si)
        = Σi=1..N aij pt(i)

Remember, aij = P(qt+1 = sj | qt = si).
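A small sketch of this update (not from the slides), continuing the 3-state toy example assumed above with a start in s3:

```python
import numpy as np

# aij = P(q_{t+1} = s_j | q_t = s_i), rows indexed by the current state.
A = np.array([
    [0.0, 0.0, 1.0],      # from s1
    [0.5, 0.5, 0.0],      # from s2
    [1/3, 2/3, 0.0],      # from s3
])

# p0: start in s3 with certainty, as in the running example.
p = np.array([0.0, 0.0, 1.0])

for t in range(3):
    # p_{t+1}(j) = sum_i a_ij * p_t(i)  ==  p @ A
    p = p @ A
    print(f"t={t+1}: p = {p}")
```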
Slide 19
Hidden State
It's currently time t, and the human remains uncrushed. What's the probability of crushing occurring at time t + 1?
If the robot is blind: we can compute this in advance. (We'll do this first.)
If the robot is omnipotent (i.e. it knows the state at time t): it can compute the answer directly. (Too easy. We won't do this.)
If the robot has some sensors, but incomplete state information: Hidden Markov Models are applicable! (Main body of the lecture.)
Slide 20
Hidden State
The previous example tried to estimate P(qt = si) unconditionally (using no observed evidence). Suppose we can observe something that's affected by the true state. Example: proximity sensors (they tell us the contents of the 8 adjacent squares).
[Diagram: true state qt on the grid; W denotes WALL. Next to it, the uncorrupted observation of the 8 squares adjacent to the robot.]
Slide 22
Ot is noisily determined depending on the current state. Assume that Ot is conditionally independent of {qt-1, qt-2, ..., q1, q0, Ot-1, Ot-2, ..., O1, O0} given qt. In other words:
P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history)
[Diagram: true state qt (W denotes WALL) and the corresponding uncorrupted observation]
Slide 23
Ot is noisily determined depending on the current state. Assume that Ot is conditionally independent of {qt-1, qt-2, ..., q1, q0, Ot-1, Ot-2, ..., O1, O0} given qt. In other words:
P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history)
Question: what'd be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4, O0, O1, O2, O3, O4)?
[Diagram: true state qt (W denotes WALL) and the corresponding uncorrupted observation]
Slide 24
Answer: [Bayes net: a chain q0 → q1 → q2 → q3 → q4, with each Ot a child of qt (O0 from q0, O1 from q1, ..., O4 from q4)]
Example: noisy proximity sensors (they unreliably tell us the contents of the 8 adjacent squares).
Ot is noisily determined depending on the current state. Assume that Ot is conditionally independent of {qt-1, qt-2, ..., q1, q0, Ot-1, Ot-2, ..., O1, O0} given qt. In other words:
P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history)
Question: what'd be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4, O0, O1, O2, O3, O4)?
[Diagram: true state qt (W denotes WALL) and the uncorrupted observation]
Slide 25
Answer: [Bayes net: a chain q0 → q1 → q2 → q3 → q4, with each Ot a child of qt]
Notation: bi(k) = P(Ot = k | qt = si)
Example: noisy proximity sensors (they unreliably tell us the contents of the 8 adjacent squares).
Each state has a probability table over the M possible observation symbols:

            P(Ot=1|qt=si)  P(Ot=2|qt=si)  ...  P(Ot=k|qt=si)  ...  P(Ot=M|qt=si)
  i = 1         b1(1)          b1(2)      ...      b1(k)      ...      b1(M)
  i = 2         b2(1)          b2(2)      ...      b2(k)      ...      b2(M)
  :               :              :                   :                   :
  i = N         bN(1)          bN(2)      ...      bN(k)      ...      bN(M)

Ot is conditionally independent of {qt-1, qt-2, ..., q1, q0, Ot-1, Ot-2, ..., O1, O0} given qt. In other words: P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history).
[Diagram: true state qt (W denotes WALL) and the uncorrupted observation]
Slide 26
Slide 29
Slide 30
Slide 31
Slide 32
Slide 33
Slide 34
Slide 35
Slide 36
Question 1: State Estimation. What is P(qT = Si | O1 O2 ... OT)?
Question 2: Most Probable Path. Given O1 O2 ... OT, what is the most probable path that I took?
Question 3: Learning HMMs. Given O1 O2 ... OT, what is the maximum-likelihood HMM that could have produced this string of observations?
[Diagram: trellis fragment with transition probabilities aAB, aBA, aBC, aCB, aCC and emission probabilities bA(Ot-1), bB(Ot), bC(Ot+1); example states labelled Bus, Eat, Walk]
Slide 37
T = # timesteps, N = # states
Slide 38
*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257–286, 1989. Available from http://ieeexplore.ieee.org/iel5/5/698/00018626.pdf?arnumber=18626
For a particular trial:
Let T be the number of observations. T is also the number of states passed through.
O = O1 O2 .. OT is the sequence of observations.
Q = q1 q2 .. qT is the notation for a path of states.
λ = ⟨N, M, {πi}, {aij}, {bi(j)}⟩ is the specification of an HMM.
Slide 40
Here's an HMM
[Diagram: states S1 (emits X or Y), S2 (emits Y or Z), S3 (emits X or Z), with transition arcs labelled 1/3 and 2/3]
Start randomly in state 1 or 2. Choose one of the output symbols in each state at random.
N = 3, M = 3
π1 = 1/2      π2 = 1/2      π3 = 0
a11 = 0       a12 = 1/3     a13 = 2/3
a21 = 1/3     a22 = 0       a23 = 2/3
a31 = 1/3     a32 = 1/3     a33 = 1/3
b1(X) = 1/2   b1(Y) = 1/2   b1(Z) = 0
b2(X) = 0     b2(Y) = 1/2   b2(Z) = 1/2
b3(X) = 1/2   b3(Y) = 0     b3(Z) = 1/2
Slide 41
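For concreteness (not part of the original slides), here is that example HMM written out as arrays; the variable names are arbitrary choices and are reused by the later sketches:

```python
import numpy as np

# Example HMM from the slide: states S1, S2, S3 and symbols X, Y, Z.
states  = ["S1", "S2", "S3"]
symbols = ["X", "Y", "Z"]

pi = np.array([0.5, 0.5, 0.0])          # initial state distribution pi_i

A = np.array([                           # a_ij = P(q_{t+1} = S_j | q_t = S_i)
    [0.0, 1/3, 2/3],
    [1/3, 0.0, 2/3],
    [1/3, 1/3, 1/3],
])

B = np.array([                           # b_i(k) = P(O_t = symbol k | q_t = S_i)
    [0.5, 0.5, 0.0],                     # S1 emits X or Y
    [0.0, 0.5, 0.5],                     # S2 emits Y or Z
    [0.5, 0.0, 0.5],                     # S3 emits X or Z
])
```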
Here's an HMM
[State diagram as above: S1 emits X or Y, S2 emits Y or Z, S3 emits X or Z; transition arcs labelled 1/3 and 2/3]
Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations:

Slide 42: 50-50 choice between S1 and S2                       states: __ __ __   obs: __ __ __
Slide 43: 50-50 choice between X and Y                         states: S1 __ __   obs: __ __ __
Slide 44: go to S3 with probability 2/3 or S2 with prob. 1/3   states: S1 __ __   obs: X __ __
Slide 45: 50-50 choice between Z and X                         states: S1 S3 __   obs: X __ __
Slide 46: each of the three next states is equally likely      states: S1 S3 __   obs: X X __
Slide 47: 50-50 choice between Z and X                         states: S1 S3 S3   obs: X X __
Slide 48: done                                                 states: S1 S3 S3   obs: X X Z
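A short sampler sketch (again not from the slides) that generates a state path and observation sequence the same way; it reuses the pi, A, B, states, symbols arrays and numpy import assumed above:

```python
def sample_hmm(pi, A, B, T, rng=None):
    """Generate (states, observations) of length T from an HMM."""
    rng = rng or np.random.default_rng()
    N, M = B.shape
    path, obs = [], []
    q = rng.choice(N, p=pi)                # start randomly according to pi
    for _ in range(T):
        path.append(q)
        obs.append(rng.choice(M, p=B[q]))  # emit a symbol from state q
        q = rng.choice(N, p=A[q])          # move to the next state
    return path, obs

path, obs = sample_hmm(pi, A, B, T=3)
print([states[i] for i in path], [symbols[k] for k in obs])
```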
State Estimation
[State diagram as before]
Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations:
[The observer sees the observations X X Z, but the states are hidden]
Slide 49
[State diagram as before]
Prob. of a series of observations:
P(O) = P(O1 O2 O3) = Σ over Q ∈ Paths of length 3 of P(O ∧ Q) = Σ over Q ∈ Paths of length 3 of P(O|Q) P(Q)
How do we compute P(Q) for an arbitrary path Q?
How do we compute P(O|Q) for an arbitrary path Q?
Slide 50
[State diagram as before]
How do we compute P(Q) for an arbitrary path Q?
P(Q) = P(q1, q2, q3)
     = P(q1) P(q2, q3 | q1)              (chain rule)
     = P(q1) P(q2 | q1) P(q3 | q2, q1)   (chain)
     = P(q1) P(q2 | q1) P(q3 | q2)       (why?)
Example in the case Q = S1 S3 S3:
P(Q) = 1/2 × 2/3 × 1/3 = 1/9
How do we compute P(O|Q) for an arbitrary path Q?
Slide 51
[State diagram as before]
How do we compute P(O|Q) for an arbitrary path Q?
P(O|Q) = P(O1 O2 O3 | q1 q2 q3)
       = P(O1 | q1) P(O2 | q2) P(O3 | q3)   (why?)
Example in the case Q = S1 S3 S3:
P(O|Q) = P(X | S1) P(X | S3) P(Z | S3) = 1/2 × 1/2 × 1/2 = 1/8
Slide 52
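A tiny check of those two numbers (a sketch reusing the pi, A, B arrays assumed earlier; path and observation indices are 0-based):

```python
def path_prob(pi, A, q):
    """P(Q) = pi(q1) * product over t of a_{q_t, q_{t+1}}."""
    p = pi[q[0]]
    for a, b in zip(q, q[1:]):
        p *= A[a, b]
    return p

def obs_given_path_prob(B, q, o):
    """P(O|Q) = product over t of b_{q_t}(O_t)."""
    p = 1.0
    for s, k in zip(q, o):
        p *= B[s, k]
    return p

q = [0, 2, 2]          # S1 S3 S3
o = [0, 0, 2]          # X  X  Z
print(path_prob(pi, A, q))           # 1/2 * 2/3 * 1/3 = 1/9
print(obs_given_path_prob(B, q, o))  # 1/2 * 1/2 * 1/2 = 1/8
```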
[State diagram as before]
Computing P(O) = P(O1 O2 O3) this way would need 27 P(Q) computations and 27 P(O|Q) computations.
A sequence of length 20 would need 3.5 billion P(Q) computations and 3.5 billion P(O|Q) computations.
So let's be smarter.
Slide 53
Slide 54
αt(i) = P(O1 O2 ... Ot ∧ qt = Si | λ)
(αt(i) can be defined stupidly by considering all paths of length t. How?)
α1(i) = P(O1 ∧ q1 = Si) = P(q1 = Si) P(O1 | q1 = Si) = πi bi(O1)
Slide 55
αt+1(j) = P(O1 O2 ... Ot Ot+1 ∧ qt+1 = Sj)
        = Σi=1..N P(O1 O2 ... Ot ∧ qt = Si ∧ Ot+1 ∧ qt+1 = Sj)
        = Σi=1..N P(Ot+1, qt+1 = Sj | O1 O2 ... Ot ∧ qt = Si) P(O1 O2 ... Ot ∧ qt = Si)
        = Σi=1..N P(Ot+1, qt+1 = Sj | qt = Si) αt(i)
        = Σi=1..N P(qt+1 = Sj | qt = Si) P(Ot+1 | qt+1 = Sj) αt(i)
        = Σi=1..N aij bj(Ot+1) αt(i)
Slide 56
In our example:
αt(i) = P(O1 O2 ... Ot ∧ qt = Si),  α1(i) = bi(O1) πi,  αt+1(j) = Σi aij bj(Ot+1) αt(i)
[State diagram as before]
WE SAW O1 O2 O3 = X X Z
α1(1) = 1/4    α1(2) = 0      α1(3) = 0
α2(1) = 0      α2(2) = 0      α2(3) = 1/12
α3(1) = 0      α3(2) = 1/72   α3(3) = 1/72
Slide 57
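A sketch of the forward pass for this example (not from the slides; it reuses pi, A, B, and symbol indices 0, 1, 2 stand for X, Y, Z):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: row t-1 of the result holds alpha_t(i) from the slides."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, T):
        # alpha_{t+1}(j) = sum_i a_ij * b_j(O_{t+1}) * alpha_t(i)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

alpha = forward(pi, A, B, obs=[0, 0, 2])   # observations X X Z
print(alpha)                               # rows: [1/4, 0, 0], [0, 0, 1/12], [0, 1/72, 1/72]
print(alpha[-1].sum())                     # P(O1 O2 O3) = sum_i alpha_3(i)
```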
Easy Question
We can cheaply compute αt(i) = P(O1 O2 ... Ot ∧ qt = Si). (How) can we cheaply compute P(O1 O2 ... Ot)?
Slide 58
Easy Question
We can cheaply compute αt(i) = P(O1 O2 ... Ot ∧ qt = Si).
(How) can we cheaply compute P(O1 O2 ... Ot)?  Answer: Σi=1..N αt(i)
(How) can we cheaply compute P(qt = Si | O1 O2 ... Ot)?  Answer: αt(i) / Σj=1..N αt(j)
Slide 59
Most Probable Path: given O1 O2 ... OT, we want Q* = argmax over Q of P(Q | O1 O2 ... OT).
DEFINE: δt(i) = the probability of the path of length t−1 with the maximum chance of doing all these things: OCCURRING, and ENDING UP IN STATE Si, and PRODUCING OUTPUT O1 ... Ot.
DEFINE: mppt(i) = that path. So δt(i) = Prob(mppt(i)).
Slide 61
Now, suppose we have all the δt(i)'s and mppt(i)'s for all i. HOW TO GET δt+1(j) and mppt+1(j)?
[Trellis diagram: S1 with prob = δt(1), S2 with prob = δt(2), ..., SN with prob = δt(N) at time t (qt), all candidate predecessors of Sj at time t+1 (qt+1)]
Slide 62
The most prob path with last two states Si Sj is the most prob path to Si, followed by the transition Si → Sj.
Slide 63
The most prob path with last two states Si Sj is the most prob path to Si, followed by the transition Si → Sj.
What is the prob of that path?
δt(i) × P(Si → Sj ∧ Ot+1 | λ) = δt(i) aij bj(Ot+1)
SO the most probable path to Sj has Si* as its penultimate state, where i* = argmax over i of δt(i) aij bj(Ot+1).
Slide 64
The most prob path with last two states Si Sj is the most prob path to Si, followed by the transition Si → Sj.
What is the prob of that path? δt(i) × P(Si → Sj ∧ Ot+1 | λ) = δt(i) aij bj(Ot+1)
SO the most probable path to Sj has Si* as its penultimate state, where i* = argmax over i of δt(i) aij bj(Ot+1).
Summary: δt+1(j) = δt(i*) ai*j bj(Ot+1) and mppt+1(j) = mppt(i*) Si*, with i* defined as above.
Slide 65
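A compact Viterbi sketch built on those two recursions (not from the slides; it reuses the pi, A, B, states arrays assumed earlier):

```python
def viterbi(pi, A, B, obs):
    """Return (best_path, best_prob): the most probable state path for obs."""
    T, N = len(obs), len(pi)
    delta = pi * B[:, obs[0]]                 # delta_1(i)
    back = np.zeros((T, N), dtype=int)        # argmax (i*) pointers
    for t in range(1, T):
        # scores[i, j] = delta_t(i) * a_ij * b_j(O_{t+1})
        scores = delta[:, None] * A * B[:, obs[t]]
        back[t] = scores.argmax(axis=0)       # i* for each destination j
        delta = scores.max(axis=0)            # delta_{t+1}(j)
    # Trace back from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())

path, prob = viterbi(pi, A, B, obs=[0, 0, 2])   # observations X X Z
print([states[i] for i in path], prob)
```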
Slide 67
Inferring an HMM
Remember, we've been doing things like P(O1 O2 .. OT | λ). That λ is the notation for our HMM parameters. Now we have some observations and we want to estimate λ from them.
AS USUAL: we could use
(i) MAX LIKELIHOOD: λ = argmax over λ of P(O1 .. OT | λ)
(ii) BAYES: work out P(λ | O1 .. OT), and then take E[λ] or max P(λ | O1 .. OT)
Slide 68
Σt=1..T−1 γt(i) = expected number of transitions out of state i during the path
Σt=1..T−1 ξt(i, j) = expected number of transitions from state i to state j during the path
Slide 69
HMM estimation
Notice:
Σt=1..T−1 ξt(i, j) / Σt=1..T−1 γt(i) = (expected number of transitions from i to j) / (expected number of transitions out of i), which is a natural re-estimate of aij.
Slide 70
We want (Slides 71–76):

aij^new = (Expected # transitions i → j | λold, O1, O2, ... OT)
          / Σk=1..N (Expected # transitions i → k | λold, O1, O2, ... OT)

        = Σt=1..T−1 P(qt+1 = sj, qt = si | λold, O1, O2, ... OT)
          / Σk=1..N Σt=1..T−1 P(qt+1 = sk, qt = si | λold, O1, O2, ... OT)

        = Sij / Σk=1..N Sik

where Sij = Σt=1..T−1 P(qt+1 = sj, qt = si, O1, ... OT | λold)
          = Σt=1..T−1 aij αt(i) bj(Ot+1) βt+1(j)

(Here βt+1(j) = P(Ot+2 ... OT | qt+1 = sj, λold) is the backward counterpart of αt(i), computed by an analogous recursion run from the end of the sequence.)
Slide 77
EM for HMMs
If we knew λ we could estimate EXPECTATIONS of quantities such as
  Expected number of times in state i
  Expected number of transitions i → j
If we knew the quantities such as
  Expected number of times in state i
  Expected number of transitions i → j
we could compute the MAX LIKELIHOOD estimate of λ = ⟨{aij}, {bi(j)}, πi⟩.
Roll on the EM Algorithm...
Slide 78
EM 4 HMMs
1. Get your observations O1 ... OT.
2. Guess your first estimate λ(0), k = 0.
3. k = k + 1.
4. Given O1 ... OT and λ(k), compute γt(i), ξt(i, j) for 1 ≤ t ≤ T, 1 ≤ i ≤ N, 1 ≤ j ≤ N.
5. Compute the expected frequency of state i and the expected frequency of transitions i → j.
6. Compute new estimates of aij, bj(k), πi accordingly. Call them λ(k+1).
7. Go to 3, unless converged.
Also known (for the HMM case) as the BAUM-WELCH algorithm.
Slide 79
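A minimal sketch of one such re-estimation step for aij (not from the slides; it reuses pi, A, B and the forward() function assumed in the earlier sketches, and only updates the transition matrix from a single observation sequence):

```python
def backward(pi, A, B, obs):
    """beta[t, i] corresponds to beta_{t+1}(i) in the slides (0-based rows)."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, obs):
    """One EM (Baum-Welch) re-estimation of A from a single observation sequence."""
    alpha, beta = forward(pi, A, B, obs), backward(pi, A, B, obs)
    T, N = len(obs), len(pi)
    S = np.zeros((N, N))                 # S[i, j] = sum_t P(q_t=i, q_{t+1}=j, O | lambda_old)
    for t in range(T - 1):
        S += alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
    row = S.sum(axis=1, keepdims=True)
    # a_ij^new = S_ij / sum_k S_ik (rows that were never visited are left as zeros)
    return np.divide(S, row, out=np.zeros_like(S), where=row > 0)

A_new = baum_welch_step(pi, A, B, obs=[0, 0, 2])
print(A_new)
```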
Bad News
There are lots of local minima
Good News
The local minima are usually adequate models of the data.
Notice
EM does not estimate the number of states. That must be given.
Often, HMMs are forced to have some links with zero probability. This is done by setting aij = 0 in the initial estimate λ(0).
Easy extension of everything seen today: HMMs with real-valued outputs.
Slide 80
Trade-off between too few states (inadequately modeling the structure in the data) and too many (fitting the noise). Thus #states is a regularization parameter.
Bad News: there are lots of local minima.
Good News: the local minima are usually adequate models of the data.
Blah blah blah... bias variance tradeoff... blah blah... cross-validation... blah blah... AIC, BIC... blah blah (same ol' same ol').
Notice: EM does not estimate the number of states. That must be given. Often, HMMs are forced to have some links with zero probability (by setting aij = 0 in the initial estimate λ(0)). Easy extension of everything seen today: HMMs with real-valued outputs.
Slide 81