
An Improved HMM Speech Recognition Model

Lichi Yuan 1,2
1 School of Information Technology, Jiangxi University of Finance & Economics, Nanchang 330013, China
2 College of Information Science and Engineering, Central South University, Changsha 410083, China
Email: yuan_lichi@hotmail.com

Abstract
In order to overcome the defects of the duration modeling of the homogeneous HMM in speech recognition and the unrealistic assumption that successive observations within a state are independent and identically distributed, a new statistical model, the Markov Family model (MFM), is proposed in this paper. The independence assumption of the HMM is replaced by a conditional independence assumption in the Markov Family model. We have successfully applied the Markov Family model to speech recognition and propose the duration distribution based MFM recognition model (DDBMFM), which takes the duration distribution into account and integrates frame based and segment based acoustic modeling techniques. Speaker independent continuous speech recognition experiments show that this new recognition model achieves higher performance than standard HMM recognition models.

1. Introduction
The advent of the hidden Markov model [1] (HMM) has brought about considerable progress in speech recognition technology over the last two decades, and nowhere has this progress been more evident than in the area of large vocabulary speaker independent continuous speech recognition. In laying out a pattern recognition framework for variable-length patterns such as speech, one usually assumes that the pattern is generated by an unobservable Markov chain of basic units, each of which captures the local salient features of the signal and can be modeled by a probabilistic distribution with fixed parameters. Frame based approaches, such as HMM based approaches, further assume that, within each unit, the acoustic frames are independent and stationary. While these assumptions lead to tractable and efficient implementations, they are inconsistent with the properties of speech. Another major unrealistic assumption of HMMs is that the probability density functions which model the durations of the states are exponential, which is not appropriate for modeling the speech events that are characterized by HMM states.

In recent years, researchers have examined alternatives to the HMM for representing speech acoustics. One such alternative is the segment model [9], which is a generalization of the HMM. The segment model explicitly represents the speech dynamics and the temporal correlations between frames. Segment based methods, on the other hand, do not employ the piecewise stationarity and conditional independence assumptions. However, these systems are usually more expensive and, although they have produced encouraging performance on small vocabulary or isolated recognition tasks, their effectiveness for large vocabulary continuous speech recognition (LVCSR) remains an open issue.

In order to cope with these deficiencies of the classical HMM, the Markov Family model, a new statistical model, is proposed in this paper. The Markov Family model (MFM) is constructed on a Markov family consisting of multiple stochastic processes that have probability relations with each other. The independence assumption of the HMM is replaced by a conditional independence assumption in the MFM. We have successfully applied the Markov Family model to speech recognition and propose the duration distribution based MFM recognition model (DDBMFM), which takes the duration distribution into account and integrates frame based and segment based acoustic modeling techniques. Experimental results show that a large vocabulary speaker independent continuous speech recognition system using this approach has a greatly improved recognition rate.


2. Hidden Markov Model and Markov Family Model


Definition 2.1 (Hidden Markov Model). A hidden Markov model [1,3] (HMM) is a five-tuple (S, A, V, B, π), where:

S = {s_1, ..., s_N} is a finite set of states;

V = {v_1, ..., v_M} is a finite observation alphabet;

π = {π_1, ..., π_N} is the distribution of initial states, where

\pi_i = P(X_1 = s_i), \quad 1 \le i \le N    (1)

A = (a_{i,j})_{N \times N} is a probability distribution on state transitions, where

a_{i,j} = P(X_{t+1} = s_j \mid X_t = s_i)    (2)

is the probability of a transition to state s_j from s_i;

B = (b_{j,k})_{N \times M} is a probability distribution on symbol emissions, where

b_{i,k} = P(o_t = v_k \mid X_t = s_i), \quad 1 \le k \le M, \ 1 \le i \le N    (3)

is the probability of observing the symbol v_k when in state s_i.
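For illustration, the following is a minimal sketch of such a five-tuple and of the standard forward recursion for computing P(O | model); the two-state, three-symbol parameter values are toy numbers assumed purely for the example and are not taken from any trained model.

    import numpy as np

    # Toy HMM with N = 2 states and M = 3 observation symbols (assumed values).
    pi = np.array([0.6, 0.4])                    # initial state distribution, eq. (1)
    A = np.array([[0.7, 0.3],                    # A[i, j] = P(X_{t+1} = s_j | X_t = s_i), eq. (2)
                  [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1],               # B[i, k] = P(o_t = v_k | X_t = s_i), eq. (3)
                  [0.1, 0.3, 0.6]])

    def forward_likelihood(obs):
        """P(O | model) by the forward algorithm; obs is a list of symbol indices."""
        alpha = pi * B[:, obs[0]]                # initialisation
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]        # induction over time
        return alpha.sum()                       # termination

    print(forward_likelihood([0, 2, 1, 1]))

In a practical recogniser these quantities are of course kept in the log domain to avoid numerical underflow.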

Definition 2.2 (Markov Family Model). Let {X_t}_{t≥1} = {x_{1,t}, ..., x_{m,t}}_{t≥1} be an m-dimensional stochastic vector whose component variables X_i = {x_{i,t}}_{t≥1}, 1 ≤ i ≤ m, take values in finite sets S_i, 1 ≤ i ≤ m. We say that these component variables X_i, 1 ≤ i ≤ m, construct an m-dimensional Markov Family model if they satisfy the following conditions:

1. Each component variable X_i, 1 ≤ i ≤ m, is an n_i-order Markov chain:

P(x_{i,t} \mid x_{i,1}, \ldots, x_{i,t-1}) = P(x_{i,t} \mid x_{i,t-n_i+1}, \ldots, x_{i,t-1})    (4)

2. The value a variable takes at time t depends only on its own previous values before time t and on the values that the other variables take at time t:

P(x_{i,t} \mid x_{1,1}, \ldots, x_{1,t}, \ldots, x_{i,1}, \ldots, x_{i,t-1}, \ldots, x_{m,1}, \ldots, x_{m,t}) = P(x_{i,t} \mid x_{i,t-n_i+1}, \ldots, x_{i,t-1}, x_{1,t}, \ldots, x_{i-1,t}, x_{i+1,t}, \ldots, x_{m,t})    (5)

3. Conditional independence:

P(x_{i,t-n_i+1}, \ldots, x_{i,t-1}, x_{1,t}, \ldots, x_{i-1,t}, x_{i+1,t}, \ldots, x_{m,t} \mid x_{i,t}) = P(x_{i,t-n_i+1}, \ldots, x_{i,t-1} \mid x_{i,t})\, P(x_{1,t} \mid x_{i,t}) \cdots P(x_{i-1,t} \mid x_{i,t})\, P(x_{i+1,t} \mid x_{i,t}) \cdots P(x_{m,t} \mid x_{i,t})    (6)

Condition 1 means that the Markov Family model is constructed on multiple stochastic processes; from this point of view, the standard HMM can be regarded as a special case of the MFM. Condition 2 describes the relations among the Markov chains of the MFM, and it also simplifies the calculation of the model. According to Condition 3, the previous n_i − 1 values that a variable X_i takes before time t and the values that the other variables take at time t are independent once the value that X_i takes at time t is known. From a statistical point of view, the assumption of independence is stronger than the assumption of conditional independence, and conditional independence can be inferred from independence; therefore the conditional independence assumption of the Markov Family model is more realistic than the independence assumption of the HMM.
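To illustrate how Conditions 2 and 3 are used in computation, the sketch below evaluates the conditional distribution of eq. (5) for a hypothetical two-component MFM with first-order own history: by the Bayes rule and eq. (6), the own history and the other component's current value contribute independent likelihood factors. All probability tables and values are invented for the example.

    import numpy as np

    # Hypothetical two-component MFM over binary alphabets.
    p_x1 = np.array([0.5, 0.5])                  # P(x_{1,t} = c), assumed prior
    p_prev_given_x1 = np.array([[0.8, 0.2],      # [r, c] = P(x_{1,t-1} = r | x_{1,t} = c)
                                [0.2, 0.8]])
    p_x2_given_x1 = np.array([[0.9, 0.3],        # [o, c] = P(x_{2,t} = o | x_{1,t} = c)
                              [0.1, 0.7]])

    def conditional_x1(prev_x1, x2_now):
        """P(x_{1,t} | x_{1,t-1}, x_{2,t}) using the factorisation of eq. (6)."""
        joint = p_x1 * p_prev_given_x1[prev_x1, :] * p_x2_given_x1[x2_now, :]
        return joint / joint.sum()               # normalise (Bayes rule)

    print(conditional_x1(prev_x1=0, x2_now=1))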

3. Duration Distribution Based MFM Recognition Model

Hidden Markov modeling techniques have been applied successfully to speech recognition problems. However, a major weakness of the HMM is that the probability density functions which model the durations of the states are assumed to be exponential, which is not appropriate for modeling the speech events characterized by HMM states. In order to cope with this deficiency, duration modeling has been widely employed in speech recognition to help confine the search process and thus improve the recognition accuracy. The analysis unit used in duration modeling can be a speech segment such as an HMM state, a phone, an initial/final, a syllable, or even a word for Mandarin speech. The most popular duration model used in HMM based speech recognition is the state duration model [7]. Another issue is that in continuous speech the speaking rate differs greatly among speakers and speaking environments. Not only inter-speaker variability but also intra-speaker variability causes differences in speaking rate, because speakers tend to vary their speaking rate in different environments. It is well known that the performance of speech recognition systems degrades when the speaking rate differs substantially from the average speaking rate. The recognition experiment in [3] also shows that duration information is most effective on data with a low speaking rate, moderately effective on data with a medium speaking rate, and has little effect on data with a fast speaking rate. To overcome the defects mentioned above, we apply the Markov Family model to speech recognition and propose the duration distribution based MFM recognition model (DDBMFM), which takes the duration distribution into account.

The basic recognition unit most commonly used in statistical speech recognition models is the phone. Assume s_l, l = 1 ~ L, stands for a state, where L is the total number of states in the phone model; the variable x_n is the state at time n (n ≥ 1), y_n is the observation feature in state x_n, and τ_n stands for the duration of state x_n. Some related parameters are defined as follows:

a_l = a_{s_l} = P(x_n = s_l), \quad l = 1 \sim L
a_{i,j} = a_{s_i, s_j} = P(x_{n+1} = s_j \mid x_n = s_i), \quad i, j = 1 \sim L
b_l(y_n) = b_{s_l}(y_n) = P(y_n \mid x_n = s_l), \quad l = 1 \sim L

We also assume that w_v, v = 1 ~ V, stands for a word, where V is the total number of words in the vocabulary of the recognition system, and word w_v has L_v states in total, represented by s_l^v, l = 1 ~ L_v. W = {w_1, w_2, ..., w_N} is a word string hypothesis for a given acoustic observation sequence O = {o_1, o_2, ..., o_T}, s_j^i is the j-th (1 ≤ j ≤ L_i) state of the word w_i, and τ_{i,j} stands for the duration of state s_j^i. The corresponding state sequence can then be represented by

S = \{\underbrace{s_1^1, \ldots, s_1^1}_{\tau_{1,1}}, \ldots, \underbrace{s_{L_1}^1, \ldots, s_{L_1}^1}_{\tau_{1,L_1}}, \ldots, \underbrace{s_j^i, \ldots, s_j^i}_{\tau_{i,j}}, \ldots, \underbrace{s_{L_N}^N, \ldots, s_{L_N}^N}_{\tau_{N,L_N}}\}

The task of speech recognition is to find the most likely word sequence W = {w_1, w_2, ..., w_N} for an input observation sequence O = {o_1, o_2, ..., o_T}, which can be formulated as a search for the most likely sequence W given the observation sequence O:

\arg\max_W P(S, W \mid O) = \arg\max_W \frac{P(O \mid W)\, P(W)}{P(O)} = \arg\max_W P(O \mid W)\, P(W)    (7)

where

P(W) = \prod_i p(w_i \mid w_{i-K+1}, w_{i-K+2}, \ldots, w_{i-1}), \quad K \ge 2    (8)

In practice, the K-gram language models people usually use are those for K = 2, 3, 4, referred to as bigram, trigram, and four-gram, respectively.
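As a simple illustration of eq. (8), the sketch below estimates K-gram probabilities from counts with add-one smoothing; the toy corpus, the value of K, and the smoothing choice are assumptions made only for the example and do not describe the language model actually used in the experiments.

    from collections import Counter

    def kgram_lm(corpus_tokens, K=3):
        """Return a function p(history, word) estimated with add-one smoothing."""
        vocab = set(corpus_tokens)
        ctx_counts, full_counts = Counter(), Counter()
        for i in range(K - 1, len(corpus_tokens)):
            ctx = tuple(corpus_tokens[i - K + 1:i])
            ctx_counts[ctx] += 1
            full_counts[ctx + (corpus_tokens[i],)] += 1
        def prob(history, word):
            ctx = tuple(history[-(K - 1):])
            return (full_counts[ctx + (word,)] + 1) / (ctx_counts[ctx] + len(vocab))
        return prob

    # Illustrative usage on a toy word sequence.
    p = kgram_lm("the cat sat on the mat the cat ran".split(), K=3)
    print(p(["the", "cat"], "sat"))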

Let the sum below run over the set of all possible state sequences S corresponding to the word sequence W; then

P(O \mid W) = \sum_S P(O, S \mid W) = \sum_S P(O \mid S, W)\, P(S \mid W)    (9)

Now let

T_1 = 1, \quad T_i = \sum_{u=1}^{i-1} \sum_{v=1}^{L_u} \tau_{u,v} + 1, \quad 2 \le i \le N
T_{i,1} = T_i, \quad T_{i,j} = T_i + \sum_{v=1}^{j-1} \tau_{i,v}, \quad 2 \le j \le L_i

be segment points, and suppose O = {o_1, o_2, ..., o_T} is an M-order Markov chain; then

P(O \mid S, W) = \prod_{i=1}^{N} \prod_{j=1}^{L_i} \prod_{k=1}^{\tau_{i,j}-1} P(o_{T_{i,j}+k} \mid o_{T_{i,j}+k-M+1}^{T_{i,j}+k-1}, s_j^i, w_i) \approx \prod_{i=1}^{N} \prod_{j=1}^{L_i} \prod_{k=1}^{\tau_{i,j}-1} P(o_{T_{i,j}+k} \mid o_{T_{i,j}+k-M+1}^{T_{i,j}+k-1}, s_j^i)    (10)

To simplify the computation of the conditional probability P(o_{T_{i,j}+k} | o_{T_{i,j}+k-M+1}^{T_{i,j}+k-1}, s_j^i), we use an auto-regressive model to characterize the frame correlation between successive observation vectors, i.e. we let

\hat{o}_{T_{i,j}+k} = \sum_{h=1}^{M-1} \Phi_h\, o_{T_{i,j}+k-h}    (11)

be the predicted observation of o_{T_{i,j}+k} according to o_{T_{i,j}+k-M+1}^{T_{i,j}+k-1}, where the prediction matrices Φ_h are diagonal, and

n_{T_{i,j}+k} = o_{T_{i,j}+k} - \hat{o}_{T_{i,j}+k}    (12)

is the noise signal between the actual observation o_{T_{i,j}+k} and the predicted observation \hat{o}_{T_{i,j}+k}. Then the conditional probability becomes

P(o_{T_{i,j}+k} \mid o_{T_{i,j}+k-M+1}^{T_{i,j}+k-1}, s_j^i) = P(n_{T_{i,j}+k} \mid s_j^i)    (13)

We assume that the probability density function P(n_{T_{i,j}+k} | s_j^i) is a multivariate Gaussian mixture density whose mean vectors and covariance matrices are determined by the state s_j^i.
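The following sketch shows eqs. (11)-(13) in code for a single frame: a diagonal auto-regressive predictor forms the predicted observation, and the residual is scored under a state-dependent density. The feature dimension, the prediction coefficients, and the use of a single Gaussian instead of a Gaussian mixture are simplifying assumptions for illustration only.

    import numpy as np
    from scipy.stats import multivariate_normal

    D, M = 4, 3                                   # feature dimension and Markov order (assumed)
    rng = np.random.default_rng(0)

    # Hypothetical diagonal prediction matrices Phi_1, ..., Phi_{M-1}, eq. (11)
    Phi = [np.diag(rng.uniform(0.2, 0.5, D)) for _ in range(M - 1)]
    # Hypothetical state-dependent residual density: one Gaussian instead of a mixture, eq. (13)
    state_mean, state_cov = np.zeros(D), 0.1 * np.eye(D)

    def frame_log_prob(frames, t):
        """log P(o_t | o_{t-M+1}^{t-1}, s) via the AR residual of eqs. (11)-(13)."""
        o_hat = sum(Phi[h - 1] @ frames[t - h] for h in range(1, M))   # predicted frame, eq. (11)
        residual = frames[t] - o_hat                                   # noise signal, eq. (12)
        return multivariate_normal.logpdf(residual, mean=state_mean, cov=state_cov)

    frames = rng.normal(size=(10, D))             # stand-in observation sequence
    print(frame_log_prob(frames, t=5))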

The probability P(S | W) in equation (9) can be calculated as follows:

P(S \mid W) = P(\underbrace{s_1^1, \ldots, s_1^1}_{\tau_{1,1}} \mid w_1) \prod_{j=2}^{L_1} P(\underbrace{s_j^1, \ldots, s_j^1}_{\tau_{1,j}} \mid \underbrace{s_{j-1}^1, \ldots, s_{j-1}^1}_{\tau_{1,j-1}}, w_1) \prod_{i=2}^{N} \Big[ P(\underbrace{s_1^i, \ldots, s_1^i}_{\tau_{i,1}} \mid \underbrace{s_{L_{i-1}}^{i-1}, \ldots, s_{L_{i-1}}^{i-1}}_{\tau_{i-1,L_{i-1}}}, w_{i-1}, w_i) \prod_{j=2}^{L_i} P(\underbrace{s_j^i, \ldots, s_j^i}_{\tau_{i,j}} \mid \underbrace{s_{j-1}^i, \ldots, s_{j-1}^i}_{\tau_{i,j-1}}, w_i) \Big]    (14)

where

P(\underbrace{s_j^i, \ldots, s_j^i}_{\tau_{i,j}} \mid \underbrace{s_{j-1}^i, \ldots, s_{j-1}^i}_{\tau_{i,j-1}}, w_i) = P(x_{n+1} = s_j^i, \tau_{n+1} = \tau_{i,j} \mid x_n = s_{j-1}^i, \tau_n = \tau_{i,j-1}, w_i) = P(\tau_{n+1} = \tau_{i,j} \mid x_{n+1} = s_j^i, x_n = s_{j-1}^i, \tau_n = \tau_{i,j-1}, w_i)\, P(x_{n+1} = s_j^i \mid x_n = s_{j-1}^i, w_i)    (15)

The conditional probability P(x_{n+1} = s_j^i | x_n = s_{j-1}^i, w_i) can take the approximation

P(x_{n+1} = s_j^i \mid x_n = s_{j-1}^i) = a_{s_{j-1}^i, s_j^i}

According to the properties of the Markov Family model, we have

P(\tau_{n+1} = \tau_{i,j} \mid x_{n+1} = s_j^i, x_n = s_{j-1}^i, \tau_n = \tau_{i,j-1}, w_i)
= P(\tau_{n+1} = \tau_{i,j} \mid x_{n+1} = s_j^i, \tau_n = \tau_{i,j-1}, w_i)
= \frac{P(x_{n+1} = s_j^i, \tau_n = \tau_{i,j-1} \mid \tau_{n+1} = \tau_{i,j}, w_i)\, P(\tau_{n+1} = \tau_{i,j} \mid w_i)}{P(x_{n+1} = s_j^i, \tau_n = \tau_{i,j-1} \mid w_i)}
= \frac{P(x_{n+1} = s_j^i \mid \tau_{n+1} = \tau_{i,j}, w_i)\, P(\tau_n = \tau_{i,j-1} \mid \tau_{n+1} = \tau_{i,j}, w_i)\, P(\tau_{n+1} = \tau_{i,j} \mid w_i)}{P(x_{n+1} = s_j^i, \tau_n = \tau_{i,j-1} \mid w_i)}    (16)

Based on the Bayes rule, we have

P(x_{n+1} = s_j^i \mid \tau_{n+1} = \tau_{i,j}, w_i) = \frac{P(\tau_{n+1} = \tau_{i,j} \mid x_{n+1} = s_j^i, w_i)\, P(x_{n+1} = s_j^i \mid w_i)}{P(\tau_{n+1} = \tau_{i,j} \mid w_i)}    (17)

P(\tau_n = \tau_{i,j-1} \mid \tau_{n+1} = \tau_{i,j}, w_i) = \frac{P(\tau_{n+1} = \tau_{i,j} \mid \tau_n = \tau_{i,j-1}, w_i)\, P(\tau_n = \tau_{i,j-1} \mid w_i)}{P(\tau_{n+1} = \tau_{i,j} \mid w_i)}    (18)

Substituting Equations (18) and (17) into Equation (16), and treating x_{n+1} and τ_n as independent given w_i, we get

P(\tau_{n+1} = \tau_{i,j} \mid x_{n+1} = s_j^i, \tau_n = \tau_{i,j-1}, w_i) \approx \frac{P(\tau_{n+1} = \tau_{i,j} \mid x_{n+1} = s_j^i, w_i)\, P(\tau_{n+1} = \tau_{i,j} \mid \tau_n = \tau_{i,j-1}, w_i)}{P(\tau_{n+1} = \tau_{i,j} \mid w_i)}    (19)

The other terms of Equation (14) can be calculated similarly. The calculation above of the conditional probability P(τ_{n+1} = τ_{i,j} | τ_n = τ_{i,j-1}, w_i) exploits the correlation between the durations of two neighboring speech units, and the resulting system is called a bigram of duration. We can also exploit the correlation among the durations of r neighboring speech units; the corresponding system is called an r-gram of duration. To cope with the data sparseness problem, the conditional probability P(τ_{n+1} = τ_{i,j} | τ_n = τ_{i,j-1}, w_i) can take the approximation P(τ_{n+1} = τ_{i,j} | τ_n = τ_{i,j-1}) or be estimated by a smoothing method:

P(\tau_{n+1} = \tau_{i,j} \mid \tau_n = \tau_{i,j-1}, w_i) \approx (1 - \lambda_{w_i})\, P(\tau_{n+1} = \tau_{i,j} \mid \tau_n = \tau_{i,j-1}) + \lambda_{w_i}\, P(\tau_{n+1} = \tau_{i,j} \mid \tau_n = \tau_{i,j-1}, w_i)    (20)

where λ_w, 0 < λ_w < 1, is a smoothing parameter.
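As a purely illustrative sketch of eq. (20) (the duration pairs and the value of λ are invented), a sparse word-dependent duration bigram can be interpolated with a word-independent one estimated from pooled data:

    from collections import Counter

    def duration_bigram(pairs):
        """MLE of P(tau_next | tau_prev) from observed (tau_prev, tau_next) pairs."""
        joint, marginal = Counter(pairs), Counter(p for p, _ in pairs)
        return lambda prev, nxt: joint[(prev, nxt)] / marginal[prev] if marginal[prev] else 0.0

    # Hypothetical duration pairs pooled over all words vs. seen for one word w.
    global_pairs = [(3, 4), (3, 4), (3, 5), (4, 4), (4, 5), (5, 5)]
    word_pairs   = [(3, 5), (3, 4)]

    p_global, p_word = duration_bigram(global_pairs), duration_bigram(word_pairs)

    lam = 0.3   # smoothing parameter, 0 < lambda < 1 (assumed value)
    def p_smoothed(prev, nxt):
        """Eq. (20): interpolate the sparse word-dependent estimate with the global one."""
        return (1 - lam) * p_global(prev, nxt) + lam * p_word(prev, nxt)

    print(p_smoothed(3, 4))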

It should be stressed that in this paper τ_{i,j} stands for the duration of a state, but it can equally be the duration of a phone, a syllable, or even a word, so the proposed duration model has great flexibility. The traditional HMM is a homogeneous Markov process: its state transition probability a_{i,i} is a constant independent of time, so the duration τ of state i satisfies an exponential distribution:

P(\tau) = a_{i,i}^{\tau - 1}\, (1 - a_{i,i}), \quad \tau \ge 1

This is inappropriate given the nature of speech. In order to cope with this deficiency, some authors have proposed modeling the state duration explicitly; one solution is to insert the duration information into the model during the recognition process, and several parametric forms can be adopted for the duration distribution, such as the Gaussian function and the Gamma function. In this paper, we assume that the probability density function of the state duration is a Gaussian density function.
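To illustrate the difference between the two duration models, the sketch below compares the implicit exponential duration score of a standard HMM with an explicit Gaussian duration score; the self-loop probability, mean, and standard deviation are assumed values, not trained parameters.

    import math

    def log_dur_hmm(tau, a_ii):
        """Implicit HMM duration: P(tau) = a_ii^(tau - 1) * (1 - a_ii)."""
        return (tau - 1) * math.log(a_ii) + math.log(1.0 - a_ii)

    def log_dur_gaussian(tau, mean, std):
        """Explicit Gaussian duration model, as assumed for the DDBMFM."""
        return -0.5 * math.log(2 * math.pi * std ** 2) - (tau - mean) ** 2 / (2 * std ** 2)

    # Hypothetical state: self-loop probability 0.8, typical duration 8 +/- 2 frames.
    for tau in (2, 8, 20):
        print(tau, round(log_dur_hmm(tau, 0.8), 2), round(log_dur_gaussian(tau, 8.0, 2.0), 2))

The exponential score always favors the shortest duration, whereas the Gaussian score peaks at the typical duration of the unit, which is the behavior the explicit duration model is intended to capture.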

4. Experimental Results
The use of the new model was investigated on a large vocabulary speaker independent continuous speech recognition task. The Wall Street Journal corpus was designed to provide general-purpose speech data with large vocabularies. In these experiments the standard SI-84 training material, containing 7240 sentences from 84 speakers (42 male / 42 female), is used to build the phone models, and the Wall Street Journal 20k open-vocabulary test set is used for testing. The baseline system was a gender-independent, within-word-triphone, mixture-Gaussian, tied-state HMM system. In this model set, all the speech models had three emitting states in a left-to-right topology. The evaluations were based on the word error rate percentage (% WER), and the experimental results are compared in Table 1.
Table 1. Performance of the baseline system and the new system

model         WER (%)
baseline      11.8
new system    10.4

From Table 1 it can be seen that the efficacy of the proposed approach is verified and a significant improvement in model quality and recognition performance is obtained. The application of the generalized decision tree to word-boundary dependent acoustic models, for example, reduced the word error rate on the 20k-WSJ test data by up to 11.9% relative to the HMM baseline system.

5. References

[1] Rabiner L., Juang B. H., Fundamentals of Speech Recognition. Prentice Hall, New Jersey, USA, 1993.
[2] Lai W. H., Chen S. H., Analysis of Syllable Duration Models for Mandarin Speech. Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2002, vol. I, 497-500.
[3] Wang W. J., Chen S. H., The Study of Prosodic Modeling for Mandarin Speech. Proc. Int. Computer Symposium (ICS), 2002, vol. 2, 1777-1784.
[4] Wang Zuoying, Xiao Xi, Duration Distribution Based HMM Speech Recognition Models. Chinese Journal of Electronics, 32(1), 2004, 46-49.
[5] Shinoda K., Lee C. H., A Structural Bayes Approach to Speaker Adaptation. IEEE Transactions on Speech and Audio Processing, 9(3), 2001, 276-287.
[6] Chang E., Zhou J. L., Di S., Huang C., Lee K. F., Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones. Proc. ICSLP 2000, Oct. 2000, vol. II, 983-986.
[7] Vaseghi S., State Duration Modeling in Hidden Markov Models. Signal Processing, 41, 1995, 31-41.
[8] Mitchell C. D., Jamieson L. H., Modeling Duration in a Hidden Markov Model with the Exponential Family. Proc. IEEE ICASSP-93, Minneapolis, MN, USA, 1993, vol. 2, 331-334.
[9] Hon H.-W., Wang K., Unified Frame and Segment Based Models for Automatic Speech Recognition. Proc. ICASSP 2000, Turkey.
