Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Slide 1
A Markov System
Has N states, called s1, s2 .. sN
[Diagram: three states s1, s2, s3; N=3, t=0]
Slide 2
A Markov System
Has N states, called s1, s2 .. sN
There are discrete timesteps, t=0, t=1, ... On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}.
[Diagram: N=3, t=0, current state qt = q0 = s3]
Slide 3
A Markov System
There are discrete timesteps, t=0, t=1, ... On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}. Between each timestep, the next state is chosen randomly.
[Diagram: N=3, t=1, current state qt = q1 = s2]
Slide 4
A Markov System
Has N states, called s1, s2 .. sN. There are discrete timesteps, t=0, t=1, ... On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}. Between each timestep, the next state is chosen randomly. The current state determines the probability distribution for the next state.
Slide 5
[Diagram: N=3, t=1, current state qt = q1 = s2]
P(qt+1 = s1 | qt = s3) = 1/3
P(qt+1 = s2 | qt = s3) = 2/3
P(qt+1 = s3 | qt = s3) = 0
A Markov System
Has N states, called s1, s2 .. sN. There are discrete timesteps, t=0, t=1, ... On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}. Between each timestep, the next state is chosen randomly. The current state determines the probability distribution for the next state.
Slide 6
[Diagram: transition probabilities shown on the arcs; N=3, t=1, qt = q1 = s2]
Markov Property
qt+1 is conditionally independent of {qt-1, qt-2, ..., q1, q0} given qt. In other words: P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history). Question: what would be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4)?
Slide 7
Answer: [Bayes net: a chain q0 → q1 → q2 → q3 → q4]
P(qt+1 = s1 | qt = s2) = 1/2,  P(qt+1 = s2 | qt = s2) = 1/2,  P(qt+1 = s3 | qt = s2) = 0
P(qt+1 = s1 | qt = s1) = 0,   P(qt+1 = s2 | qt = s1) = 0,   P(qt+1 = s3 | qt = s1) = 1
Markov Property
qt+1 is conditionally independent of {qt-1, qt-2, ..., q1, q0} given qt. In other words: P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history). Question: what would be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4)?
[Answer diagram: Bayes net chain q0 → q1 → q2 → q3 → q4; state diagram with transition probabilities; N=3, t=1, qt = q1 = s2]
Slide 8
Answer: a Bayes net chain q0 → q1 → q2 → q3 → q4.

Each state si has a probability table giving the distribution of the next state:

            P(qt+1=s1|qt=si)  P(qt+1=s2|qt=si)  ...  P(qt+1=sj|qt=si)  ...  P(qt+1=sN|qt=si)
  i = 1          a11               a12          ...       a1j          ...       a1N
  i = 2          a21               a22          ...       a2j          ...       a2N
  :               :                 :                       :                      :
  i = N          aN1               aN2          ...       aNj          ...       aNN

Each of these probability tables is identical at every timestep.

Notation: aij = P(qt+1 = sj | qt = si)
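As a concrete illustration (not from the slides), the three-state system above can be written down and simulated directly; a minimal sketch, with the transition probabilities read off the earlier slides and an arbitrary rng seed:

```python
import numpy as np

# Transition matrix: row i holds P(q_{t+1} = s_j | q_t = s_i).
A = np.array([
    [0.0, 0.0, 1.0],      # from s1
    [0.5, 0.5, 0.0],      # from s2
    [1/3, 2/3, 0.0],      # from s3
])

rng = np.random.default_rng(0)

def next_state(i: int) -> int:
    """Sample the next state index given the current state index i (0-based)."""
    return rng.choice(len(A), p=A[i])

# Simulate a few steps starting from s3 (index 2), as on the slides.
q = 2
for t in range(5):
    q = next_state(q)
    print(f"t={t+1}: q = s{q+1}")
```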
Slide 9
A Blind Robot
A human and a robot wander around randomly on a grid
[Grid diagram: R marks the robot, H marks the human]
Note: N (number of states) = 18 × 18 = 324
Slide 10
STATE q = (position of robot, position of human)
Dynamics of System
q0 = [initial configuration shown on the grid]
Each timestep the human moves randomly to an adjacent cell. And the robot also moves randomly to an adjacent cell.
Typical Questions: What's the expected time until the human is crushed like a bug? What's the probability that the robot will hit the left wall before it hits the human? What's the probability the robot crushes the human on the next time step?
Slide 11
Example Question
It's currently time t, and the human remains uncrushed. What's the probability of crushing occurring at time t + 1?
If the robot is blind: we can compute this in advance. (We'll do this first.)
If the robot is omnipotent (i.e. it knows the state at time t): it can compute the answer directly. (Too easy. We won't do this.)
If the robot has some sensors, but incomplete state information: Hidden Markov Models are applicable! (Main body of the lecture.)
Slide 12
What is P(qt = s)? (Slides 13–18)

Define pt(i) = P(qt = si). p0(i) = P(q0 = si) is given by the starting distribution. Then, for each j:

pt+1(j) = P(qt+1 = sj)
        = Σi=1..N P(qt+1 = sj ∧ qt = si)
        = Σi=1..N P(qt+1 = sj | qt = si) P(qt = si)
        = Σi=1..N aij pt(i)

Remember, aij = P(qt+1 = sj | qt = si).
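A small sketch of this update (not from the slides), continuing the 3-state toy example assumed above with a start in s3:

```python
import numpy as np

# aij = P(q_{t+1} = s_j | q_t = s_i), rows indexed by the current state.
A = np.array([
    [0.0, 0.0, 1.0],      # from s1
    [0.5, 0.5, 0.0],      # from s2
    [1/3, 2/3, 0.0],      # from s3
])

# p0: start in s3 with certainty, as in the running example.
p = np.array([0.0, 0.0, 1.0])

for t in range(3):
    # p_{t+1}(j) = sum_i a_ij * p_t(i)  ==  p @ A
    p = p @ A
    print(f"t={t+1}: p = {p}")
```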
Slide 19
Hidden State
It's currently time t, and the human remains uncrushed. What's the probability of crushing occurring at time t + 1?
If the robot is blind: we can compute this in advance. (We'll do this first.)
If the robot is omnipotent (i.e. it knows the state at time t): it can compute the answer directly. (Too easy. We won't do this.)
If the robot has some sensors, but incomplete state information: Hidden Markov Models are applicable! (Main body of the lecture.)
Slide 20
Hidden State
The previous example tried to estimate P(qt = si) unconditionally (using no observed evidence). Suppose we can observe something that's affected by the true state. Example: proximity sensors (they tell us the contents of the 8 adjacent squares).
[Diagram: true state qt on the grid; W denotes WALL. Next to it, the uncorrupted observation of the 8 squares adjacent to the robot.]
Slide 22
Ot is noisily determined depending on the current state. Assume that Ot is conditionally independent of {qt-1, qt-2, ..., q1, q0, Ot-1, Ot-2, ..., O1, O0} given qt. In other words:
P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history)
[Diagram: true state qt (W denotes WALL) and the corresponding uncorrupted observation]
Slide 23
Ot is noisily determined depending on the current state. Assume that Ot is conditionally independent of {qt-1, qt-2, ..., q1, q0, Ot-1, Ot-2, ..., O1, O0} given qt. In other words:
P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history)
Question: what'd be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4, O0, O1, O2, O3, O4)?
[Diagram: true state qt (W denotes WALL) and the corresponding uncorrupted observation]
Slide 24
Answer: [Bayes net: a chain q0 → q1 → q2 → q3 → q4, with each Ot a child of qt (O0 from q0, O1 from q1, ..., O4 from q4)]
Example: noisy proximity sensors (they unreliably tell us the contents of the 8 adjacent squares).
Ot is noisily determined depending on the current state. Assume that Ot is conditionally independent of {qt-1, qt-2, ..., q1, q0, Ot-1, Ot-2, ..., O1, O0} given qt. In other words:
P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history)
Question: what'd be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4, O0, O1, O2, O3, O4)?
[Diagram: true state qt (W denotes WALL) and the uncorrupted observation]
Slide 25
Answer: [Bayes net: a chain q0 → q1 → q2 → q3 → q4, with each Ot a child of qt]
Notation: bi(k) = P(Ot = k | qt = si)
Example: noisy proximity sensors (they unreliably tell us the contents of the 8 adjacent squares).
Each state has a probability table over the M possible observation symbols:

            P(Ot=1|qt=si)  P(Ot=2|qt=si)  ...  P(Ot=k|qt=si)  ...  P(Ot=M|qt=si)
  i = 1         b1(1)          b1(2)      ...      b1(k)      ...      b1(M)
  i = 2         b2(1)          b2(2)      ...      b2(k)      ...      b2(M)
  :               :              :                   :                   :
  i = N         bN(1)          bN(2)      ...      bN(k)      ...      bN(M)

Ot is conditionally independent of {qt-1, qt-2, ..., q1, q0, Ot-1, Ot-2, ..., O1, O0} given qt. In other words: P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history).
[Diagram: true state qt (W denotes WALL) and the uncorrupted observation]
Slide 26
Slide 29
Slide 30
Slide 31
Slide 32
Slide 33
Slide 34
Slide 35
Slide 36
Question 1: State Estimation. What is P(qT = Si | O1 O2 ... OT)?
Question 2: Most Probable Path. Given O1 O2 ... OT, what is the most probable path that I took?
Question 3: Learning HMMs. Given O1 O2 ... OT, what is the maximum-likelihood HMM that could have produced this string of observations?
[Diagram: trellis fragment with transition probabilities aAB, aBA, aBC, aCB, aCC and emission probabilities bA(Ot-1), bB(Ot), bC(Ot+1); example states labelled Bus, Eat, Walk]
Slide 37
T = # timesteps, N = # states
Slide 38
*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257–286, 1989. Available from http://ieeexplore.ieee.org/iel5/5/698/00018626.pdf?arnumber=18626
For a particular trial:
Let T be the number of observations. T is also the number of states passed through.
O = O1 O2 .. OT is the sequence of observations.
Q = q1 q2 .. qT is the notation for a path of states.
λ = ⟨N, M, {πi}, {aij}, {bi(j)}⟩ is the specification of an HMM.
Slide 40
Here's an HMM
[Diagram: states S1 (emits X or Y), S2 (emits Y or Z), S3 (emits X or Z), with transition arcs labelled 1/3 and 2/3]
Start randomly in state 1 or 2. Choose one of the output symbols in each state at random.
N = 3, M = 3
π1 = 1/2      π2 = 1/2      π3 = 0
a11 = 0       a12 = 1/3     a13 = 2/3
a21 = 1/3     a22 = 0       a23 = 2/3
a31 = 1/3     a32 = 1/3     a33 = 1/3
b1(X) = 1/2   b1(Y) = 1/2   b1(Z) = 0
b2(X) = 0     b2(Y) = 1/2   b2(Z) = 1/2
b3(X) = 1/2   b3(Y) = 0     b3(Z) = 1/2
Slide 41
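For concreteness (not part of the original slides), here is that example HMM written out as arrays; the variable names are arbitrary choices and are reused by the later sketches:

```python
import numpy as np

# Example HMM from the slide: states S1, S2, S3 and symbols X, Y, Z.
states  = ["S1", "S2", "S3"]
symbols = ["X", "Y", "Z"]

pi = np.array([0.5, 0.5, 0.0])          # initial state distribution pi_i

A = np.array([                           # a_ij = P(q_{t+1} = S_j | q_t = S_i)
    [0.0, 1/3, 2/3],
    [1/3, 0.0, 2/3],
    [1/3, 1/3, 1/3],
])

B = np.array([                           # b_i(k) = P(O_t = symbol k | q_t = S_i)
    [0.5, 0.5, 0.0],                     # S1 emits X or Y
    [0.0, 0.5, 0.5],                     # S2 emits Y or Z
    [0.5, 0.0, 0.5],                     # S3 emits X or Z
])
```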
Here's an HMM
[State diagram as above: S1 emits X or Y, S2 emits Y or Z, S3 emits X or Z; transition arcs labelled 1/3 and 2/3]
Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations:

Slide 42: 50-50 choice between S1 and S2                       states: __ __ __   obs: __ __ __
Slide 43: 50-50 choice between X and Y                         states: S1 __ __   obs: __ __ __
Slide 44: go to S3 with probability 2/3 or S2 with prob. 1/3   states: S1 __ __   obs: X __ __
Slide 45: 50-50 choice between Z and X                         states: S1 S3 __   obs: X __ __
Slide 46: each of the three next states is equally likely      states: S1 S3 __   obs: X X __
Slide 47: 50-50 choice between Z and X                         states: S1 S3 S3   obs: X X __
Slide 48: done                                                 states: S1 S3 S3   obs: X X Z
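A short sampler sketch (again not from the slides) that generates a state path and observation sequence the same way; it reuses the pi, A, B, states, symbols arrays and numpy import assumed above:

```python
def sample_hmm(pi, A, B, T, rng=None):
    """Generate (states, observations) of length T from an HMM."""
    rng = rng or np.random.default_rng()
    N, M = B.shape
    path, obs = [], []
    q = rng.choice(N, p=pi)                # start randomly according to pi
    for _ in range(T):
        path.append(q)
        obs.append(rng.choice(M, p=B[q]))  # emit a symbol from state q
        q = rng.choice(N, p=A[q])          # move to the next state
    return path, obs

path, obs = sample_hmm(pi, A, B, T=3)
print([states[i] for i in path], [symbols[k] for k in obs])
```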
State Estimation
[State diagram as before]
Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations:
[The observer sees the observations X X Z, but the states are hidden]
Slide 49
[State diagram as before]
Prob. of a series of observations:
P(O) = P(O1 O2 O3) = Σ over Q ∈ Paths of length 3 of P(O ∧ Q) = Σ over Q ∈ Paths of length 3 of P(O|Q) P(Q)
How do we compute P(Q) for an arbitrary path Q?
How do we compute P(O|Q) for an arbitrary path Q?
Slide 50
[State diagram as before]
How do we compute P(Q) for an arbitrary path Q?
P(Q) = P(q1, q2, q3)
     = P(q1) P(q2, q3 | q1)              (chain rule)
     = P(q1) P(q2 | q1) P(q3 | q2, q1)   (chain)
     = P(q1) P(q2 | q1) P(q3 | q2)       (why?)
Example in the case Q = S1 S3 S3:
P(Q) = 1/2 × 2/3 × 1/3 = 1/9
How do we compute P(O|Q) for an arbitrary path Q?
Slide 51
[State diagram as before]
How do we compute P(O|Q) for an arbitrary path Q?
P(O|Q) = P(O1 O2 O3 | q1 q2 q3)
       = P(O1 | q1) P(O2 | q2) P(O3 | q3)   (why?)
Example in the case Q = S1 S3 S3:
P(O|Q) = P(X | S1) P(X | S3) P(Z | S3) = 1/2 × 1/2 × 1/2 = 1/8
Slide 52
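A tiny check of those two numbers (a sketch reusing the pi, A, B arrays assumed earlier; path and observation indices are 0-based):

```python
def path_prob(pi, A, q):
    """P(Q) = pi(q1) * product over t of a_{q_t, q_{t+1}}."""
    p = pi[q[0]]
    for a, b in zip(q, q[1:]):
        p *= A[a, b]
    return p

def obs_given_path_prob(B, q, o):
    """P(O|Q) = product over t of b_{q_t}(O_t)."""
    p = 1.0
    for s, k in zip(q, o):
        p *= B[s, k]
    return p

q = [0, 2, 2]          # S1 S3 S3
o = [0, 0, 2]          # X  X  Z
print(path_prob(pi, A, q))           # 1/2 * 2/3 * 1/3 = 1/9
print(obs_given_path_prob(B, q, o))  # 1/2 * 1/2 * 1/2 = 1/8
```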
[State diagram as before]
Computing P(O) = P(O1 O2 O3) this way would need 27 P(Q) computations and 27 P(O|Q) computations.
A sequence of length 20 would need 3.5 billion P(Q) computations and 3.5 billion P(O|Q) computations.
So let's be smarter.
Slide 53
Slide 54
αt(i) = P(O1 O2 ... Ot ∧ qt = Si | λ)
(αt(i) can be defined stupidly by considering all paths of length t. How?)
α1(i) = P(O1 ∧ q1 = Si) = P(q1 = Si) P(O1 | q1 = Si) = πi bi(O1)
Slide 55
αt+1(j) = P(O1 O2 ... Ot Ot+1 ∧ qt+1 = Sj)
        = Σi=1..N P(O1 O2 ... Ot ∧ qt = Si ∧ Ot+1 ∧ qt+1 = Sj)
        = Σi=1..N P(Ot+1, qt+1 = Sj | O1 O2 ... Ot ∧ qt = Si) P(O1 O2 ... Ot ∧ qt = Si)
        = Σi=1..N P(Ot+1, qt+1 = Sj | qt = Si) αt(i)
        = Σi=1..N P(qt+1 = Sj | qt = Si) P(Ot+1 | qt+1 = Sj) αt(i)
        = Σi=1..N aij bj(Ot+1) αt(i)
Slide 56
In our example:
αt(i) = P(O1 O2 ... Ot ∧ qt = Si),  α1(i) = bi(O1) πi,  αt+1(j) = Σi aij bj(Ot+1) αt(i)
[State diagram as before]
WE SAW O1 O2 O3 = X X Z
α1(1) = 1/4    α1(2) = 0      α1(3) = 0
α2(1) = 0      α2(2) = 0      α2(3) = 1/12
α3(1) = 0      α3(2) = 1/72   α3(3) = 1/72
Slide 57
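A sketch of the forward pass for this example (not from the slides; it reuses pi, A, B, and symbol indices 0, 1, 2 stand for X, Y, Z):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: row t-1 of the result holds alpha_t(i) from the slides."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, T):
        # alpha_{t+1}(j) = sum_i a_ij * b_j(O_{t+1}) * alpha_t(i)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

alpha = forward(pi, A, B, obs=[0, 0, 2])   # observations X X Z
print(alpha)                               # rows: [1/4, 0, 0], [0, 0, 1/12], [0, 1/72, 1/72]
print(alpha[-1].sum())                     # P(O1 O2 O3) = sum_i alpha_3(i)
```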
Easy Question
We can cheaply compute αt(i) = P(O1 O2 ... Ot ∧ qt = Si). (How) can we cheaply compute P(O1 O2 ... Ot)?
Slide 58
Easy Question
We can cheaply compute αt(i) = P(O1 O2 ... Ot ∧ qt = Si).
(How) can we cheaply compute P(O1 O2 ... Ot)?  Answer: Σi=1..N αt(i)
(How) can we cheaply compute P(qt = Si | O1 O2 ... Ot)?  Answer: αt(i) / Σj=1..N αt(j)
Slide 59
Most Probable Path: given O1 O2 ... OT, we want Q* = argmax over Q of P(Q | O1 O2 ... OT).
DEFINE: δt(i) = the probability of the path of length t−1 with the maximum chance of doing all these things: OCCURRING, and ENDING UP IN STATE Si, and PRODUCING OUTPUT O1 ... Ot.
DEFINE: mppt(i) = that path. So δt(i) = Prob(mppt(i)).
Slide 61
Now, suppose we have all the δt(i)'s and mppt(i)'s for all i. HOW TO GET δt+1(j) and mppt+1(j)?
[Trellis diagram: S1 with prob = δt(1), S2 with prob = δt(2), ..., SN with prob = δt(N) at time t (qt), all candidate predecessors of Sj at time t+1 (qt+1)]
Slide 62
The most prob path with last two states Si Sj is the most prob path to Si, followed by the transition Si → Sj.
Slide 63
The most prob path with last two states Si Sj is the most prob path to Si, followed by the transition Si → Sj.
What is the prob of that path?
δt(i) × P(Si → Sj ∧ Ot+1 | λ) = δt(i) aij bj(Ot+1)
SO the most probable path to Sj has Si* as its penultimate state, where i* = argmax over i of δt(i) aij bj(Ot+1).
Slide 64
The most prob path with last two states Si Sj is the most prob path to Si, followed by the transition Si → Sj.
What is the prob of that path? δt(i) × P(Si → Sj ∧ Ot+1 | λ) = δt(i) aij bj(Ot+1)
SO the most probable path to Sj has Si* as its penultimate state, where i* = argmax over i of δt(i) aij bj(Ot+1).
Summary: δt+1(j) = δt(i*) ai*j bj(Ot+1) and mppt+1(j) = mppt(i*) Si*, with i* defined as above.
Slide 65
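A compact Viterbi sketch built on those two recursions (not from the slides; it reuses the pi, A, B, states arrays assumed earlier):

```python
def viterbi(pi, A, B, obs):
    """Return (best_path, best_prob): the most probable state path for obs."""
    T, N = len(obs), len(pi)
    delta = pi * B[:, obs[0]]                 # delta_1(i)
    back = np.zeros((T, N), dtype=int)        # argmax (i*) pointers
    for t in range(1, T):
        # scores[i, j] = delta_t(i) * a_ij * b_j(O_{t+1})
        scores = delta[:, None] * A * B[:, obs[t]]
        back[t] = scores.argmax(axis=0)       # i* for each destination j
        delta = scores.max(axis=0)            # delta_{t+1}(j)
    # Trace back from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())

path, prob = viterbi(pi, A, B, obs=[0, 0, 2])   # observations X X Z
print([states[i] for i in path], prob)
```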
Slide 67
Inferring an HMM
Remember, we've been doing things like P(O1 O2 .. OT | λ). That λ is the notation for our HMM parameters. Now we have some observations and we want to estimate λ from them.
AS USUAL: we could use
(i) MAX LIKELIHOOD: λ = argmax over λ of P(O1 .. OT | λ)
(ii) BAYES: work out P(λ | O1 .. OT), and then take E[λ] or max P(λ | O1 .. OT)
Slide 68
Σt=1..T−1 γt(i) = expected number of transitions out of state i during the path
Σt=1..T−1 ξt(i, j) = expected number of transitions from state i to state j during the path
Slide 69
HMM estimation
Notice:
Σt=1..T−1 ξt(i, j) / Σt=1..T−1 γt(i) = (expected number of transitions from i to j) / (expected number of transitions out of i), which is a natural re-estimate of aij.
Slide 70
We want (Slides 71–76):

aij^new = (Expected # transitions i → j | λold, O1, O2, ... OT)
          / Σk=1..N (Expected # transitions i → k | λold, O1, O2, ... OT)

        = Σt=1..T−1 P(qt+1 = sj, qt = si | λold, O1, O2, ... OT)
          / Σk=1..N Σt=1..T−1 P(qt+1 = sk, qt = si | λold, O1, O2, ... OT)

        = Sij / Σk=1..N Sik

where Sij = Σt=1..T−1 P(qt+1 = sj, qt = si, O1, ... OT | λold)
          = Σt=1..T−1 aij αt(i) bj(Ot+1) βt+1(j)

(Here βt+1(j) = P(Ot+2 ... OT | qt+1 = sj, λold) is the backward counterpart of αt(i), computed by an analogous recursion run from the end of the sequence.)
Slide 77
EM for HMMs
If we knew λ we could estimate EXPECTATIONS of quantities such as
  Expected number of times in state i
  Expected number of transitions i → j
If we knew the quantities such as
  Expected number of times in state i
  Expected number of transitions i → j
we could compute the MAX LIKELIHOOD estimate of λ = ⟨{aij}, {bi(j)}, πi⟩.
Roll on the EM Algorithm...
Slide 78
EM 4 HMMs
1. Get your observations O1 ... OT.
2. Guess your first estimate λ(0), k = 0.
3. k = k + 1.
4. Given O1 ... OT and λ(k), compute γt(i), ξt(i, j) for 1 ≤ t ≤ T, 1 ≤ i ≤ N, 1 ≤ j ≤ N.
5. Compute the expected frequency of state i and the expected frequency of transitions i → j.
6. Compute new estimates of aij, bj(k), πi accordingly. Call them λ(k+1).
7. Go to 3, unless converged.
Also known (for the HMM case) as the BAUM-WELCH algorithm.
Slide 79
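A minimal sketch of one such re-estimation step for aij (not from the slides; it reuses pi, A, B and the forward() function assumed in the earlier sketches, and only updates the transition matrix from a single observation sequence):

```python
def backward(pi, A, B, obs):
    """beta[t, i] corresponds to beta_{t+1}(i) in the slides (0-based rows)."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, obs):
    """One EM (Baum-Welch) re-estimation of A from a single observation sequence."""
    alpha, beta = forward(pi, A, B, obs), backward(pi, A, B, obs)
    T, N = len(obs), len(pi)
    S = np.zeros((N, N))                 # S[i, j] = sum_t P(q_t=i, q_{t+1}=j, O | lambda_old)
    for t in range(T - 1):
        S += alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
    row = S.sum(axis=1, keepdims=True)
    # a_ij^new = S_ij / sum_k S_ik (rows that were never visited are left as zeros)
    return np.divide(S, row, out=np.zeros_like(S), where=row > 0)

A_new = baum_welch_step(pi, A, B, obs=[0, 0, 2])
print(A_new)
```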
Bad News
There are lots of local minima
Good News
The local minima are usually adequate models of the data.
Notice
EM does not estimate the number of states. That must be given.
Often, HMMs are forced to have some links with zero probability. This is done by setting aij = 0 in the initial estimate λ(0).
Easy extension of everything seen today: HMMs with real-valued outputs.
Slide 80
Trade-off between too few states (inadequately modeling the structure in the data) and too many (fitting the noise). Thus #states is a regularization parameter.
Bad News: there are lots of local minima.
Good News: the local minima are usually adequate models of the data.
Blah blah blah... bias variance tradeoff... blah blah... cross-validation... blah blah... AIC, BIC... blah blah (same ol' same ol').
Notice: EM does not estimate the number of states. That must be given. Often, HMMs are forced to have some links with zero probability (by setting aij = 0 in the initial estimate λ(0)). Easy extension of everything seen today: HMMs with real-valued outputs.
Slide 81