Lichi Yuan 1,2
1 School of Information Technology, Jiangxi University of Finance & Economics, Nanchang 330013, China
2 College of Information Science and Engineering, Central South University, Changsha 410083, China
Email: yuan_lichi@hotmail.com
Abstract
In order to overcome the defects of the duration modeling of the homogeneous HMM in speech recognition, and the unrealistic assumption that successive observations within a state are independent and identically distributed, a new statistical model, the Markov Family model (MFM), is proposed in this paper. The independence assumption is replaced by a conditional independence assumption in the Markov Family model. We have successfully applied the Markov Family model to speech recognition and propose the duration distribution based MFM recognition model (DDBMFM), which takes duration distribution into account and integrates frame-based and segment-based acoustic modeling techniques. Speaker-independent continuous speech recognition experiments show that this new recognition model has higher performance than standard HMM recognition models.
1. Introduction
The advent of the hidden Markov model [1] (HMM) has brought about considerable progress in speech recognition technology over the last two decades, and nowhere has this progress been more evident than in large-vocabulary speaker-independent continuous speech recognition. In laying out a pattern recognition framework for variable-length patterns such as speech, one usually assumes the pattern is generated by an unobservable Markov chain of basic units, each of which captures the local salient features of the signal and can be modeled by a probabilistic distribution with fixed parameters. Frame-based approaches, such as those built on HMMs, further assume that, within each unit, the acoustic frames are independent and stationary. While these assumptions lead to tractable and efficient implementations, they are inconsistent with the properties of speech. Another major unrealistic assumption of HMMs is that the probability density functions which model the durations of the states are exponential, which is not appropriate for the speech events characterized by HMM states.

In recent years, researchers have examined alternatives to the HMM for representing speech acoustics. One such alternative is the segment model [9], a generalization of the HMM. The segment model explicitly represents speech dynamics and temporal correlations between frames; segment-based methods do not employ the piecewise-stationarity and conditional independence assumptions. However, these systems are usually more expensive and, although they have produced encouraging performance for small-vocabulary or isolated recognition tasks, their effectiveness on large vocabulary continuous speech recognition (LVCSR) remains an open issue.

In order to cope with these deficiencies of the classical HMM, a new statistical model, the Markov Family model (MFM), is proposed in this paper. The MFM is constructed on a Markov family consisting of multiple stochastic processes that have probabilistic relations with each other. The independence assumption of the HMM is replaced by a conditional independence assumption in the MFM. We have successfully applied the Markov Family model to speech recognition and propose the duration distribution based MFM recognition model (DDBMFM), which takes duration distribution into account and integrates frame-based and segment-based acoustic modeling techniques. Experimental results show that a large-vocabulary speaker-independent continuous speech recognition system using this approach has a greatly improved recognition rate.
ICALIP2008
2. Markov Family Model

We first recall the notation of the standard HMM. V = {v_1, ..., v_M} is a finite observation alphabet; π = {π_1, ..., π_N} is the distribution of initial states, where

  π_i = P(X_1 = s_i), 1 ≤ i ≤ N    (1)

A = (a_{i,j})_{N×N} is a probability distribution on state transitions, where

  a_{i,j} = P(X_{t+1} = s_j | X_t = s_i)    (2)

is the probability of a transition to state s_j from s_i; and B = (b_{j,k})_{N×M} is a probability distribution on symbol emissions, where b_{j,k} is the probability of emitting symbol v_k when in state s_j (3).

Let {X_t}_{t≥1} = {(x_{1,t}, ..., x_{m,t})}_{t≥1} be an m-dimensional stochastic vector, whose componential variables X_i = {x_{i,t}}_{t≥1}, 1 ≤ i ≤ m, take values in finite sets S_i, 1 ≤ i ≤ m. We say these componential variables form a Markov family if they satisfy the following conditions:

1. Each componential process X_i is a Markov chain (of some order n_i).

2. What value a variable will take at time t is related only to its previous values before time t and to the values that the rest of the variables take at time t.

3. The previous n_i values that a variable X_i takes before time t and the values that the rest of the variables take at time t are conditionally independent given the value that X_i takes at time t:

  P(x_{i,t−n_i}, ..., x_{i,t−1}, x_{1,t}, ..., x_{m,t} | x_{i,t}) = P(x_{i,t−n_i}, ..., x_{i,t−1} | x_{i,t}) P(x_{1,t} | x_{i,t}) ⋯ P(x_{m,t} | x_{i,t})    (4)

Condition 1 means that the Markov Family model is constructed on multiple stochastic processes; from this point, it can be said that the standard HMM is a special case of the MFM. Condition 2 describes the relations among the Markov chains of the MFM, and it also simplifies the calculation of the model. According to Condition 3, the previous n_i values that a variable X_i takes before time t and the values that the rest of the variables take at time t are independent if the value that X_i takes at time t is known. From the viewpoint of statistics, the independence assumption is stronger than the conditional independence assumption, and conditional independence can be inferred from independence. So the conditional independence assumption in the Markov Family model is more realistic than the independence assumption in the HMM.
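Condition 3 can be illustrated with a small numerical sketch. The distributions below are hypothetical (randomly generated, not from any trained model); the point is only that a joint distribution built as P(c)·P(a|c)·P(b|c) satisfies the factorization of the conditional independence assumption exactly:

```python
import numpy as np

# Hypothetical two-process family: given the current value c of chain X_i,
# the previous value a of X_i and the simultaneous value b of another chain
# are conditionally independent (Condition 3). We build a joint distribution
# with exactly this structure and verify the factorization numerically.
rng = np.random.default_rng(0)

n = 3                                  # alphabet size for all variables
p_c = rng.dirichlet(np.ones(n))        # P(c)
p_a_c = rng.dirichlet(np.ones(n), n)   # P(a | c), row index = c
p_b_c = rng.dirichlet(np.ones(n), n)   # P(b | c), row index = c

# Joint P(a, b, c) = P(c) P(a|c) P(b|c)
joint = np.einsum('c,ca,cb->abc', p_c, p_a_c, p_b_c)

# Check: P(a, b | c) == P(a|c) P(b|c) for every c
p_ab_c = joint / joint.sum(axis=(0, 1), keepdims=True)
factored = np.einsum('ca,cb->abc', p_a_c, p_b_c)
print("conditional independence holds:", np.allclose(p_ab_c, factored))
```

An unconditionally independent pair would satisfy the stronger identity P(a, b) = P(a)·P(b) as well; the MFM only requires the weaker conditional form above.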
3. Duration Distribution Based MFM Recognition Model

We apply the Markov Family model to speech recognition and propose the duration distribution based MFM recognition model (DDBMFM), which takes duration distribution into account. The basic recognition unit commonly used in stochastic speech recognition models is the phone. Let s_l, l = 1, ..., L, stand for the states, where L is the total number of states in a phone model.

The task of speech recognition is to find the most likely word sequence W = {w_1, w_2, ..., w_N} for an input observation sequence O = {o_1, o_2, ..., o_T}, and it can be formulated as a search for the most likely sequence W given the observation sequence O:

  argmax_W P(W | O) = argmax_W P(O | W) P(W) / P(O) = argmax_W P(O | W) P(W)    (7)

where the language model probability is

  P(W) = ∏_{i=1}^{N} p(w_i | w_{i−K+1}, w_{i−K+2}, ..., w_{i−1}), K ≥ 2    (8)

In practice, the K-gram language models that people usually use are K = 2, 3, 4, referred to as bigram, trigram, and four-gram, respectively. Let Ω be the set of possible state sequences S corresponding to the word sequence W; then

  P(O | W) = Σ_{S∈Ω} P(O, S | W) = Σ_{S∈Ω} P(O | S, W) P(S | W)    (9)

Let τ_{i,j} denote the duration of the j-th state s_j^i of the i-th word, so that a state sequence with durations can be written as

  S = {s_1^1, ..., s_1^1, ..., s_j^i, ..., s_j^i, ..., s_{L_N}^N, ..., s_{L_N}^N}

where each run s_j^i, ..., s_j^i consists of τ_{i,j} copies of the state. Now let

  T_1 = 1,  T_i = Σ_{u=1}^{i−1} Σ_{v=1}^{L_u} τ_{u,v} + 1,  2 ≤ i ≤ N
  T_{i,1} = T_i,  T_{i,j} = T_i + Σ_{v=1}^{j−1} τ_{i,v},  2 ≤ j ≤ L_i    (10)

so that T_{i,j} is the index of the first frame aligned with state s_j^i. Then

  P(O | S, W) = ∏_{i=1}^{N} ∏_{j=1}^{L_i} ∏_{k=0}^{τ_{i,j}−1} P(o_{T_{i,j}+k} | o_{T_{i,j}+k−M+1}^{T_{i,j}+k−1}, s_j^i, w_i)
             ≈ ∏_{i=1}^{N} ∏_{j=1}^{L_i} ∏_{k=0}^{τ_{i,j}−1} P(o_{T_{i,j}+k} | o_{T_{i,j}+k−M+1}^{T_{i,j}+k−1}, s_j^i)

To simplify the computation of this conditional probability, we use an auto-regressive model to characterize the frame correlation between successive observation vectors, i.e. let

  ô_{T_{i,j}+k} = Σ_{h=1}^{M−1} Φ_h o_{T_{i,j}+k−h}    (11)

be the predicted observation of o_{T_{i,j}+k} according to o_{T_{i,j}+k−M+1}^{T_{i,j}+k−1}, where the prediction matrices Φ_h of successive observation vectors are diagonal matrices, and evaluate the state output density on the prediction residual o_{T_{i,j}+k} − ô_{T_{i,j}+k}    (12)
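The frame-index bookkeeping above (T_i and T_{i,j}) can be sketched directly. The durations below are hypothetical; indices are 1-based as in the text:

```python
# Sketch of the frame-index bookkeeping: given hypothetical state durations
# tau[i][j] (word i, state j), compute T_i (first frame of word i) and
# T_{i,j} (first frame of state j of word i), both 1-based.
tau = [[3, 2, 4],      # durations of the L_1 = 3 states of word 1
       [2, 5, 1]]      # durations of the L_2 = 3 states of word 2

def frame_indices(tau):
    T, T_ij = [], []
    start = 1                          # T_1 = 1
    for durs in tau:
        T.append(start)
        row, s = [], start
        for d in durs:
            row.append(s)              # T_{i,j} = T_i + sum of previous durations
            s += d
        T_ij.append(row)
        start = s                      # T_{i+1} = T_i + sum over v of tau_{i,v}
    return T, T_ij

T, T_ij = frame_indices(tau)
print(T)      # [1, 10]
print(T_ij)   # [[1, 4, 6], [10, 12, 17]]
```

Each state s_j^i then owns the frame block T_{i,j}, ..., T_{i,j} + τ_{i,j} − 1, which is exactly the range the inner product over k runs over.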
By the definition of the MFM, P(S | W) can be decomposed run by run:

  P(S | W) = P(s_1^1, ..., s_1^1 | w_1) ∏_{j=2}^{L_1} P(s_j^1, ..., s_j^1 | s_{j−1}^1, ..., s_{j−1}^1, w_1)
           × ∏_{i=2}^{N} [ P(s_1^i, ..., s_1^i | s_{L_{i−1}}^{i−1}, ..., s_{L_{i−1}}^{i−1}, w_{i−1}, w_i) ∏_{j=2}^{L_i} P(s_j^i, ..., s_j^i | s_{j−1}^i, ..., s_{j−1}^i, w_i) ]    (14)

where each run s_j^i, ..., s_j^i consists of τ_{i,j} copies of the state. For the factors inside the product,

  P(s_j^i, ..., s_j^i | s_{j−1}^i, ..., s_{j−1}^i, w_i)
  = P(x_{n+1} = s_j^i, τ_{n+1} = τ_{i,j} | x_n = s_{j−1}^i, τ_n = τ_{i,j−1}, w_i)
  = P(τ_{n+1} = τ_{i,j} | x_{n+1} = s_j^i, x_n = s_{j−1}^i, τ_n = τ_{i,j−1}, w_i) P(x_{n+1} = s_j^i | x_n = s_{j−1}^i, w_i)    (15)

where the conditional probability P(x_{n+1} = s_j^i | x_n = s_{j−1}^i, w_i) can take the approximation

  P(x_{n+1} = s_j^i | x_n = s_{j−1}^i) = a_{s_{j−1}^i, s_j^i}

For the duration factor we have

  P(τ_{n+1} = τ_{i,j} | x_{n+1} = s_j^i, x_n = s_{j−1}^i, τ_n = τ_{i,j−1}, w_i)
  ≈ P(τ_{n+1} = τ_{i,j} | x_{n+1} = s_j^i, τ_n = τ_{i,j−1}, w_i)
  = P(x_{n+1} = s_j^i, τ_n = τ_{i,j−1} | τ_{n+1} = τ_{i,j}, w_i) P(τ_{n+1} = τ_{i,j} | w_i) / P(x_{n+1} = s_j^i, τ_n = τ_{i,j−1} | w_i)
  ≈ P(x_{n+1} = s_j^i | τ_{n+1} = τ_{i,j}, w_i) P(τ_n = τ_{i,j−1} | τ_{n+1} = τ_{i,j}, w_i) P(τ_{n+1} = τ_{i,j} | w_i) / P(x_{n+1} = s_j^i, τ_n = τ_{i,j−1} | w_i)    (16)

where the last step uses the conditional independence assumption of the MFM. Based on the Bayes rule, applying it to each factor in turn yields

  P(τ_{n+1} = τ_{i,j} | x_{n+1} = s_j^i, τ_n = τ_{i,j−1}, w_i) ≈ P(τ_{n+1} = τ_{i,j} | x_{n+1} = s_j^i, w_i) P(τ_{n+1} = τ_{i,j} | τ_n = τ_{i,j−1}, w_i) / P(τ_{n+1} = τ_{i,j} | w_i)    (19)

The other parameters of Equation (14) can be calculated similarly. The calculation above of the conditional probability P(τ_{n+1} = τ_{i,j} | τ_n = τ_{i,j−1}, w_i) utilizes the correlation information between the durations of two neighboring speech units, so this system is called a bigram of duration. We can also utilize the correlation information between the durations of r neighboring speech units; the corresponding system is called an r-gram of duration. To solve the problem of sparse data, the conditional probability P(τ_{n+1} = τ_{i,j} | τ_n = τ_{i,j−1}, w_i) can take the approximation P(τ_{n+1} = τ_{i,j} | τ_n = τ_{i,j−1}) or be estimated by a smoothing method:

  P(τ_{n+1} = τ_{i,j} | τ_n = τ_{i,j−1}, w_i) ≈ (1 − λ_{w_i}) P(τ_{n+1} = τ_{i,j} | τ_n = τ_{i,j−1}) + λ_{w_i} P(τ_{n+1} = τ_{i,j} | τ_n = τ_{i,j−1}, w_i)    (20)

where λ_w, 0 < λ_w < 1, is a smoothing parameter.

It is necessary to stress that in this paper τ_{i,j} stands for the duration of a state, but it can also be the duration of a phone, a syllable, or even a word, etc., so the duration model we propose has great flexibility. The traditional HMM is a homogeneous Markov process: its state self-transition probability a_{i,i} is a constant independent of time, so the duration τ of state i follows an exponential (geometric) distribution:

  P(τ) = a_{i,i}^{τ−1} (1 − a_{i,i}), τ ≥ 1

This is inappropriate given the nature of speech. In order to cope with this deficiency, some authors have proposed modeling the state duration explicitly; one solution is to insert the duration information into the model during recognition, and several functional forms can be adopted for the parametric duration distribution, such as the Gaussian function and the Gamma function. In this paper, we assume that the probability density function of the state duration is a Gaussian density function.
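The contrast between the HMM's implicit duration model and an explicit Gaussian one can be sketched numerically. The self-loop probability, mean, and variance below are illustrative values, not estimates from any corpus:

```python
import math

# An HMM with self-loop probability a_ii implies a geometric state duration:
# P(tau) = a_ii^(tau-1) * (1 - a_ii), tau >= 1 -- monotonically decreasing,
# so the most likely duration is always tau = 1, unlike real speech.
a_ii = 0.8
geom = [a_ii ** (t - 1) * (1 - a_ii) for t in range(1, 21)]

# Explicit Gaussian duration model (mean and variance would be estimated
# from training alignments; the values here are illustrative).
mu, sigma = 6.0, 2.0
gauss = [math.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
         for t in range(1, 21)]

print(max(range(1, 21), key=lambda t: geom[t - 1]))   # 1: geometric always peaks at tau = 1
print(max(range(1, 21), key=lambda t: gauss[t - 1]))  # 6: Gaussian peaks near its mean
```

The geometric curve assigns its maximum to a one-frame stay regardless of a_ii, while the Gaussian can place its mode at a phonetically plausible duration, which is the motivation for Equation (20)-style explicit duration scoring.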
4. Experimental Results
An investigation of the use of the new model was carried out on a large-vocabulary speaker-independent continuous speech recognition task. The Wall Street Journal corpus was designed to provide general-purpose speech data with large vocabularies. In these experiments the standard SI-84 training material, containing 7240 sentences from 84 speakers (42 male / 42 female), was used to build the phone models, and the Wall Street Journal 20k open-vocabulary test set was used for testing. The baseline system was a gender-independent, within-word-triphone, mixture-Gaussian, tied-state HMM system. In this model set, all the speech models had three emitting states in a left-to-right topology. The evaluations were based on the word error rate percentage (% WER), and the experimental results are compared in Table 1.
Table 1. Performance of the baseline system and the new system

  model     | baseline | new system
  WER (%)   |   11.8   |   10.4
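As a quick check, the relative improvement implied by the two absolute WER figures in Table 1 can be computed directly:

```python
# Relative word-error-rate reduction implied by Table 1.
baseline_wer, new_wer = 11.8, 10.4
rel_reduction = 100 * (baseline_wer - new_wer) / baseline_wer
print(round(rel_reduction, 1))  # 11.9 (percent relative)
```

This matches the relative reduction figure quoted in the discussion of the results.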
From Table 1 it can be seen that the efficacy of the proposed approach was verified and a significant improvement in model quality and recognition performance was obtained. The application of the generalized decision tree to word-boundary-dependent acoustic models, for example, reduced the word error rate for the 20k-WSJ test data by up to 11.9% relative to the HMM baseline system.

5. References

[1] Rabiner L., Juang B. H., Fundamentals of Speech Recognition. New Jersey, USA: Prentice Hall, 1993.
[2] Lai W. H., Chen S. H., Analysis of Syllable Duration Models for Mandarin Speech. Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2002, vol. I, 497-500.
[3] Wang W. J., Chen S. H., The Study of Prosodic Modeling for Mandarin Speech. Proc. Int. Computer Symposium (ICS), 2002, vol. 2, 1777-1784.
[4] Wang Zuoying, Xiao Xi, Duration Distribution Based HMM Speech Recognition Models. Chinese Journal of Electronics, 32(1), 2004, 46-49.
[5] Shinoda K., Lee C. H., A Structural Bayes Approach to Speaker Adaptation. IEEE Transactions on Speech and Audio Processing, 9(3), 2001, 276-287.
[6] Chang E., Zhou J. L., Di S., Huang C., Lee K. F., Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones. Proc. ICSLP 2000, Oct. 2000, vol. II, 983-986.
[7] Vaseghi S., State Duration Modeling in Hidden Markov Models. Signal Processing, 41, 1995, 31-41.
[8] Mitchell C. D., Jamieson L. H., Modeling Duration in a Hidden Markov Model with the Exponential Family. Proc. IEEE ICASSP-93, Minneapolis, MN, USA, 1993, vol. 2, 331-334.
[9] Hon H.-W., Wang K., Unified Frame and Segment Based Models for Automatic Speech Recognition. Proc. ICASSP 2000, Turkey.