I. INTRODUCTION
In recent years, applications requiring speech recognition
techniques have grown considerably, ranging from speaker-independent
to speaker-adapted systems, in low- and high-noise
environments. Moreover, speech recognition systems have
been designed to achieve both high processing performance
and accuracy. To achieve these aims, the model size is
an important aspect to analyze, as it is directly related to
the processing load and to the model's classification capability.
This work therefore presents a new approach to determining the number
of Gaussian components in continuous speech recognition
systems: the new Gaussian Elimination Algorithm (GEA). The
results are compared to those of the classical BIC and of an entropy-based method.
II. BACKGROUND
The statistical modeling problem has some aspects that do
not depend on the specific task for which the model was
designed. One classical problem that must be overcome is
over-parameterization [1], which may occur with large models,
that is, models with an excessive number of parameters. In
general, such models present a low training error rate, due to
their high flexibility, but their performance on a test database
is almost always unsatisfactory. On the other hand, models
with an insufficient number of parameters cannot even be trained.
There is thus a trade-off between robustness and
trainability, which must be taken into account in order to
obtain a high-performance system. In the speech recognition
context, the percentage of correctly recognized words (for the
SVCS corpus) or phones (for the TIMIT corpus), and/or the
phone recognition accuracy (for the TIMIT corpus), are used as performance measures.
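These two figures follow the usual string-alignment scoring: with N reference labels and S substitutions, D deletions, and I insertions obtained from a minimum-edit-distance alignment, the percentage correct is (N - D - S)/N and the accuracy is (N - D - S - I)/N. A minimal sketch of that computation (the helper names and the toy phone strings are ours, not from the paper):

```python
def align_counts(ref, hyp):
    """Return (S, D, I) from a minimum-edit-distance alignment of hyp vs ref."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total_errors, S, D, I) for ref[:i] aligned against hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):                       # only deletions
        t, s, d, ins = cost[i - 1][0]
        cost[i][0] = (t + 1, s, d + 1, ins)
    for j in range(1, m + 1):                       # only insertions
        t, s, d, ins = cost[0][j - 1]
        cost[0][j] = (t + 1, s, d, ins + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            diag = cost[i - 1][j - 1]
            up, left = cost[i - 1][j], cost[i][j - 1]
            cost[i][j] = min(
                (diag[0] + sub, diag[1] + sub, diag[2], diag[3]),  # match/subst.
                (up[0] + 1, up[1], up[2] + 1, up[3]),              # deletion
                (left[0] + 1, left[1], left[2], left[3] + 1),      # insertion
            )
    _, s, d, ins = cost[n][m]
    return s, d, ins

def percent_correct(ref, hyp):
    s, d, _ = align_counts(ref, hyp)
    return 100.0 * (len(ref) - d - s) / len(ref)

def percent_accuracy(ref, hyp):
    s, d, ins = align_counts(ref, hyp)
    return 100.0 * (len(ref) - d - s - ins) / len(ref)

# Toy example: one substitution (ae -> eh) and one inserted "t".
ref = ["sil", "k", "ae", "t", "sil"]
hyp = ["sil", "k", "eh", "t", "t", "sil"]
```

Note that insertions penalize the accuracy but not the percentage correct, which is why the two measures can rank systems differently.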
The authors are with the Digital Speech Processing Laboratory, State
University of Campinas, C.P. 6101, CEP: 13083-852, Campinas-SP, Brazil.
email: gfgyared@yahoo.com, {fabio,selmini}@decom.fee.unicamp.br
SBrT
B. TIMIT Corpus
The TIMIT corpus is a benchmark for testing continuous
phone recognition systems [5], [6]. The training set contains
3696 sentences and the complete test set contains 1344 sentences.
Following the convention [7], [8], 48 different context-independent
phones are used for training the models and 39
phones for recognition. The same model and parameter
configuration described for the SVCS corpus is used. The
recognition task is performed by HTK [4] using a phone-level bigram language model with a scale factor of 2.
The BIC [10], [11], [12] has been widely used within
statistical modeling, in many different areas, for model structure
selection. The main concept supporting the BIC is the
principle of parsimony, that is, the selected model should be
the one with the smallest complexity and the highest capacity to
model the training data. This can be observed directly from
Equation (4):
$$M_S^{BIC} = \arg\max_{M_l^j} \left[ \sum_{t=1}^{N_j} \log P\left(o_t^j \,\middle|\, M_l^j\right) - \frac{l}{2} \log N_j \right], \qquad (4)$$

where $N_j$ is the number of frames assigned to state $j$ and $l$ is the number of free parameters of the candidate model $M_l^j$. The expressions of the entropy-based method, Equations (1) and (2), and the occupation ratio of Equation (5) are:

$$H = \frac{1}{2} \sum_{d=1}^{D} \log_{10}\left(2\pi e\, s_d^2\right), \qquad (1)$$

$$\sum_{i=1} N_{s_i} H_{s_i}, \qquad (2)$$

$$P_{wg}^{(i;j;s)} = \frac{N_{wg}^{(i;j;s)}}{N_{frames}^{(s)}}. \qquad (5)$$
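As a concrete illustration of the parsimony trade-off in Equation (4), the sketch below scores candidate models by their log-likelihood minus the $(l/2)\log N$ penalty and keeps the best one. The model labels, likelihoods, and parameter counts are made up for the example; they are not the paper's systems.

```python
import math

def bic_score(log_likelihood, n_params, n_frames):
    """BIC-style score: log P(O | M) - (l / 2) * log(N)."""
    return log_likelihood - 0.5 * n_params * math.log(n_frames)

def select_by_bic(candidates, n_frames):
    """candidates: iterable of (label, log_likelihood, n_params) tuples."""
    return max(candidates, key=lambda c: bic_score(c[1], c[2], n_frames))[0]

# Toy numbers: the likelihood keeps improving with more Gaussians, but the
# penalty eventually outweighs the gain, so the middle model is selected.
models = [("4 Gauss.", -13800.0, 312),
          ("8 Gauss.", -12300.0, 624),
          ("12 Gauss.", -12230.0, 936)]
best = select_by_bic(models, n_frames=5000)
```

The penalty grows linearly in the parameter count, so doubling the Gaussians must buy a likelihood gain of at least $(\Delta l/2)\log N$ to be selected.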
Fig. 1. Multidimensional Gaussian.

Fig. 2. Panels (a) and (b): a data sample $x_d$ and its reflection $2\mu_d - x_d$ about the mean $\mu_d$.
$$f(O_t) = \frac{1}{(2\pi)^{dim/2}\,|\Sigma|^{1/2}}\; e^{-(x-\mu)^{T} \Sigma^{-1} (x-\mu)/2}, \qquad (6)$$

$$f(O_t) = \prod_{d=1}^{dim} \frac{1}{\sqrt{2\pi\sigma_d^2}}\; e^{-(x_d-\mu_d)^2/2\sigma_d^2}, \qquad (7)$$

$$GIM^{(i;j;s)}(O_t) = \prod_{d=1}^{dim} \frac{2}{\sqrt{\pi}} \int_{0}^{z_d} e^{-z_d^2}\, dz_d, \qquad (8)$$

where $dim$ is the feature vector dimension, $z_d = \frac{x_d - \mu_{dij}}{\sqrt{2}\,\sigma_{dij}}$,
and $O_t = (x_1, x_2, \ldots, x_{dim})$ is the feature vector. The values
$\mu_{dij}$ and $\sigma_{dij}$ correspond to the mean and standard deviation,
respectively, along dimension $d$, of Gaussian $i$ that belongs
to state $j$.

The GIM of Gaussian $i$, which belongs to state $j$, is
calculated for every data sample of state $s$, so that the mean
value of the GIM can be obtained with respect to each existing state,
as shown in Equation (9) below:

$$P_{GIM}^{(i;j;s)} = \frac{1}{N_s} \sum_{t=1}^{N_s} GIM^{(i;j;s)}(O_t), \qquad (9)$$

$$DC(j) = \frac{\left[\prod_{i=1}^{M_j} P_{GIM}^{(i;j;j)}\right]^{K}}{\frac{1}{N-1} \sum_{s \neq j} \sum_{i=1}^{M_j} P_{GIM}^{(i;j;s)}}, \qquad (10)$$

where $K$ is the rigour exponent, $M_j$ is the number of Gaussians in state $j$, and $N$ is the total number of states. If the
logarithm of the Discriminative Constant (DC) is taken, the
resulting expression in Equation (11) is similar to Equation
(4), in the sense that the first term measures the capacity of
modeling the correct patterns and the second is a penalty term:

$$\log DC(j) = K \sum_{i=1}^{M_j} \log P_{GIM}^{(i;j;j)} - \log\!\left[\frac{1}{N-1} \sum_{s \neq j} \sum_{i=1}^{M_j} P_{GIM}^{(i;j;s)}\right]. \qquad (11)$$

The difference resides in the second term, which is a penalization
due to the number of parameters in the model in Equation
(4), and a penalization due to the interference in the
other state models in Equation (11).
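Reading the $\frac{2}{\sqrt{\pi}}\int_0^{z_d} e^{-u^2}du$ factor in Equation (8) as the error function $\mathrm{erf}(z_d)$, the quantities $P_{GIM}$ and $DC$ can be sketched as below for diagonal-covariance Gaussians. This is an illustrative reading with toy data, not the authors' implementation, and the DC denominator follows our reconstruction of Equation (10).

```python
import math

def gim(o, mean, std):
    """Equation (8) with the integral read as erf: product over dimensions of
    erf(|x_d - mu_d| / (sqrt(2) * sigma_d))."""
    return math.prod(math.erf(abs(x - m) / (math.sqrt(2) * s))
                     for x, m, s in zip(o, mean, std))

def p_gim(samples, mean, std):
    """Equation (9): mean GIM of one Gaussian over the N_s samples of a state."""
    return sum(gim(o, mean, std) for o in samples) / len(samples)

def discriminative_constant(p_own, p_interf, K):
    """Equation (10) as reconstructed: [product of own-state P_GIM]^K divided
    by the average interference P_GIM over the other N - 1 states."""
    return math.prod(p_own) ** K / (sum(p_interf) / len(p_interf))
```

Since every $P_{GIM}$ lies in $(0, 1)$, raising the numerator to a larger rigour exponent $K$ shrinks $DC$, which is how $K$ makes the criterion more restrictive.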
The main idea is to eliminate Gaussians from a well-trained
model with a fixed number of Gaussian components per state
and to observe the new DC value obtained. The Discriminative
Constant (DC) given above may increase or decrease
depending on the relevance of the eliminated Gaussian. In
this manner, the rigour exponent plays an important role in the
Gaussian selection, since it makes the discriminative criterion
DC more restrictive, that is, the greater the exponent, the
more rigorous the criterion is and, therefore, the fewer Gaussians
are eliminated.
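The elimination step described above can be sketched as follows. The `dc` function here is a stand-in for the Discriminative Constant of Equation (10), and the acceptance rule (keep an elimination only when DC does not decrease) is our reading of the text; the component indices and the toy scoring are invented for the example.

```python
def eliminate_gaussians(components, dc):
    """Greedily drop components whose removal does not lower dc(components)."""
    kept = list(components)
    improved = True
    while improved:
        improved = False
        for i in range(len(kept)):
            if len(kept) == 1:
                break  # never empty a state model entirely
            trial = kept[:i] + kept[i + 1:]
            if dc(trial) >= dc(kept):
                kept = trial
                improved = True
                break  # restart the scan after each accepted elimination
    return kept

# Toy DC: components 0 and 1 carry the discrimination; every extra component
# only adds a small interference cost, so the others should be eliminated.
def toy_dc(comps):
    return sum(1.0 for c in comps if c in (0, 1)) - 0.1 * len(comps)

result = eliminate_gaussians([0, 1, 2, 3, 4], toy_dc)
```

A stricter criterion (e.g. requiring a strict increase of DC, as a larger rigour exponent effectively does) would keep more components.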
The procedure described above is applied to each HMM
state model. Once the discriminative Gaussian selection is
performed, pairs of Gaussians x and y of a given state are compared
through the distance measure M_dxy in Equation (12), which normalizes the
distance d_xy between the two components by their P_GIM values:

$$M_{d_{xy}} = \frac{d_{xy}}{\sum_{i=x,y} P_{GIM}^{(i;j;j)}}, \qquad (12)$$

$$\sum_{s \neq j} \sum_{i=x,y} P_{GIM}^{(i;j;s)}. \qquad (13)$$

VII. EXPERIMENTS

A. SVCS Database

In the experiments, the proposed method was used to
optimize the baseline system containing 12 Gaussians per
state (1296 Gaussians), which gives the highest percentage
of correctly recognized words (81.01%) for this database.
The tests with the SVCS database have shown that the new
GEA was able to outperform both the Bayesian Information
Criterion [18] and the entropy-based method presented in
Section IV.

The highest performances given by the systems with a varying number of
Gaussians per state obtained from the three methods,
as well as the empirical parameters used, are presented in Table I.

TABLE I
THE FOLLOWING INCREMENTS IN THE NUMBER OF GAUSSIANS WERE TESTED:

Method         | Empirical parameter              | Reduced model size | Economy of Gaussians (%) | Perc. Corr. (%) | Differ. in perform. (%)
Entropy-based  | Increm. N Gauss. = 70            | 1040               | 19.8                     | 81.62           | +0.61
BIC            | regular. paramet. (λ) = 0.05     | 1110               | 14.4                     | 80.75           | -0.26
GEA            | Distance threshold (M_dxy) = 4.5 | 1055               | 18.6                     | 82.34           | +1.33

The GEA performance depends on the segmentation used in
the discriminative analysis, and also on the effective reduction
of the excess Gaussians present in the models. The SVCS
database is not manually segmented and, in this case, the
segmentation must be obtained automatically, through the Viterbi alignment.

B. TIMIT Database

The first step was to obtain three baseline systems for the
TIMIT corpus. The performance of these systems is shown
in Table II.

TABLE II

Number of Gaussians | Percentage Correct (%) | Recognition Accuracy (%)
1152                | 70.94                  | 63.7
2304                | 73.03                  | 65.87
4608                | 73.11                  | 66.14
The results obtained from the GEA for the TIMIT corpus
are shown in Table III.

TABLE III
SYSTEMS OBTAINED AFTER APPLYING THE GEA USING A HMM SEGMENTATION.

Baseline system (Num. of Gauss.) | K  | M_dxy | Num. of Gauss. after the GEA
1152                             | 50 | 0     | 1114
2304                             | 5  | 14    | 2056
4608                             | 10 | –     | 3825

As can be noted, the highest economy (17%) was achieved
for the 4608-Gaussian baseline system, from which 783
Gaussians were eliminated without any loss in performance.
In addition, the smaller the system, the lower the economy
obtained, which seems reasonable, since the model
topology selection process is constrained to at least the same
performance as the original system. The lack of acoustic
resolution makes it difficult to distinguish different
acoustic patterns, which may increase the recognition errors and
thus must be avoided.

2) The Quality of the Segmentation Used: The segmentation
used in the discriminative analysis is important for labeling
each acoustic vector and thus for calculating the DC value
of each state model. In this manner, it is interesting to define
a methodology for choosing the most appropriate recognition
system for the Viterbi alignment task. The percentage of errors
observed in the segmentation [19], [20] generated by the 4608-Gaussian
baseline system is presented in Table IV.

TABLE IV

Segmentation Error | Percentage Correct (%)
10 ms              | 48.2
20 ms              | 81.2
30 ms              | 92.7
40 ms              | 96.3
50 ms              | 98.1

The relation between the recognition performance, in terms of phone
recognition rate and accuracy, and the quality of the segmentation
generated by the Viterbi alignment (compared against the
manual segmentation of the TIMIT database) was calculated
for each system. The results are shown in Figures 3 and 4 for
the percentage of segmentation errors smaller than 50 ms.

Fig. 3. Relation between the percentage of segmentation errors and the phone
recognition rate.

Fig. 4. Relation between the percentage of segmentation errors and the
recognition accuracy.

The correlation coefficients between the percentage of segmentation
errors and the recognition performances, for different error thresholds,
are presented in Table V.

TABLE V
CORRELATION COEFFICIENTS (C.C.) BETWEEN THE PERCENTAGE OF
SEGMENTATION ERRORS AND THE RECOGNITION PERFORMANCE.

Segmentation Error | Recognition Accuracy
10 ms              | c.c. = 0.34
20 ms              | c.c. = 0.11
30 ms              | c.c. = 0.12
40 ms              | c.c. = 0.34
50 ms              | c.c. = 0.61
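The coefficients in Table V are ordinary correlation coefficients; a minimal Pearson computation is sketched below. The five (x, y) pairs are illustrative placeholders, not the paper's measurements.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Placeholder data: fraction of segmentation errors smaller than 50 ms per
# system (x) against that system's recognition accuracy in % (y).
seg_quality = [0.974, 0.976, 0.978, 0.980, 0.982]
accuracy = [63.7, 64.9, 65.1, 65.9, 66.1]
cc = pearson(seg_quality, accuracy)
```

With only a handful of systems per threshold, such coefficients are noisy, which is consistent with the low values reported for the tighter thresholds.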
VIII. CONCLUSION

The new GEA outperformed the BIC and the entropy-based
method in determining systems with a better compromise
between complexity and recognition performance. In addition,
the reduced system also gave an increase of 1.33% in performance
when compared to the baseline system that gives the
highest recognition performance for the SVCS database used
in the experiments.

The experiments with the TIMIT database have shown that
the new GEA was able to determine reduced systems with
at least the same performance as the original system. It was
also verified that the correlation between the percentage of
segmentation errors and the recognition performance is not
high (c.c. = 0.61 for the conditions used in the experiments).
Furthermore, the recognition accuracy seems to be more
appropriate than the phone recognition rate for choosing the
recognition system for the Viterbi alignment task; that is, the
system that gives the highest accuracy should be used in the
Viterbi alignment task in order to obtain the segmentation for
the GEA.

Future work will extend the GEA concepts to HMM
complexity control in context-dependent phone models and
also to other pattern recognition problems.

IX. ACKNOWLEDGEMENT

We would like to thank CNPq (Conselho Nacional de
Desenvolvimento Científico e Tecnológico) for the research
funding.
REFERENCES

[1] L. A. Aguirre, An Introduction to Systems Identification: Linear and
Non-linear Techniques Applied to Real Systems (in Portuguese). Editora
UFMG, 2000.
[2] L. R. Bahl, F. Jelinek, and R. L. Mercer, "A maximum likelihood
approach to continuous speech recognition," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 5, pp. 179-190, 1983.
[3] C. A. Ynoguti, "Continuous speech recognition using hidden Markov
models (in Portuguese)," Ph.D. dissertation, Universidade Estadual de
Campinas, 1999.
[4] The HTK Book, Cambridge University Engineering Department, 2002.
[5] S. Kapadia, V. Valtchev, and S. J. Young, "MMI training for continuous
phoneme recognition on the TIMIT database," in IEEE International
Conference on Acoustics, Speech and Signal Processing, 1993.
[6] L. F. Lamel and J. L. Gauvain, "High performance speaker-independent
phone recognition using CDHMM," in Eurospeech 1993, 1993.
[7] K.-F. Lee and H.-W. Hon, "Speaker-independent phone recognition
using hidden Markov models," IEEE Transactions on Acoustics, Speech
and Signal Processing, vol. 37, no. 11, pp. 1641-1648, 1989.
[8] K.-F. Lee, H.-W. Hon, and M.-Y. Hwang, "Recent progress in the SPHINX
speech recognition system," in Workshop on Speech and Natural
Language, 1989.
[9] Y. J. Chung and C. K. Un, "Use of different numbers of mixtures in
continuous density hidden Markov models," Electronics Letters, vol. 29,
no. 9, pp. 824-825, 1993.
[10] I. Csiszár and P. C. Shields, "The consistency of the BIC Markov order
estimator," in IEEE International Symposium on Information Theory,
2000.
[11] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite
mixture models," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 24, no. 3, pp. 381-396, 2002.
[12] S. S. Chen and P. S. Gopalakrishnan, "Clustering via the Bayesian
information criterion with applications in speech recognition," in IEEE
International Conference on Acoustics, Speech and Signal Processing,
1998.