I. INTRODUCTION
In recent years, applications requiring speech recognition
techniques have grown considerably, ranging from speaker-independent
to speaker-adapted systems, in low- and high-noise
environments. Moreover, speech recognition systems have
been designed to achieve both high processing performance
and accuracy. To achieve these aims, the model size is
an important aspect to analyze, as it is directly related to
the processing load and to the model's classification capability.
This work therefore presents a new approach to determining the number
of Gaussian components in continuous speech recognition
systems: the new Gaussian Elimination Algorithm (GEA). The
results are compared to those of the classical BIC and of an entropy-based method.
II. BACKGROUND
The statistical modeling problem has some aspects that do
not depend on the specific task for which the model was
designed. One classical problem that must be overcome is
over-parameterization [1], which may occur with large models,
that is, models with an excessive number of parameters. In
general, such models present a low training error rate, due to
their high flexibility, but their performance on a test database
is almost always unsatisfactory. On the other hand, models
with an insufficient number of parameters cannot even be trained.
There is thus a trade-off between robustness and
trainability, which must be taken into account in order to
obtain a high-performance system. In the speech recognition
context, the percentage of correctly recognized words (for the
SVCS corpus) or phones (for the TIMIT corpus), and/or the
phone recognition accuracy (for the TIMIT corpus), are used as performance measures.
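These two figures follow the usual string-alignment scoring: with N reference labels and S substitutions, D deletions, and I insertions obtained from a minimum-edit-distance alignment, the percentage correct is (N - D - S)/N and the accuracy is (N - D - S - I)/N. A minimal sketch of that computation (the helper names and the toy phone strings are ours, not from the paper):

```python
def align_counts(ref, hyp):
    """Return (S, D, I) from a minimum-edit-distance alignment of hyp vs ref."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total_errors, S, D, I) for ref[:i] aligned against hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):                       # only deletions
        t, s, d, ins = cost[i - 1][0]
        cost[i][0] = (t + 1, s, d + 1, ins)
    for j in range(1, m + 1):                       # only insertions
        t, s, d, ins = cost[0][j - 1]
        cost[0][j] = (t + 1, s, d, ins + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            diag = cost[i - 1][j - 1]
            up, left = cost[i - 1][j], cost[i][j - 1]
            cost[i][j] = min(
                (diag[0] + sub, diag[1] + sub, diag[2], diag[3]),  # match/subst.
                (up[0] + 1, up[1], up[2] + 1, up[3]),              # deletion
                (left[0] + 1, left[1], left[2], left[3] + 1),      # insertion
            )
    _, s, d, ins = cost[n][m]
    return s, d, ins

def percent_correct(ref, hyp):
    s, d, _ = align_counts(ref, hyp)
    return 100.0 * (len(ref) - d - s) / len(ref)

def percent_accuracy(ref, hyp):
    s, d, ins = align_counts(ref, hyp)
    return 100.0 * (len(ref) - d - s - ins) / len(ref)

# Toy example: one substitution (ae -> eh) and one inserted "t".
ref = ["sil", "k", "ae", "t", "sil"]
hyp = ["sil", "k", "eh", "t", "t", "sil"]
```

Note that insertions penalize the accuracy but not the percentage correct, which is why the two measures can rank systems differently.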
The authors are with the Digital Speech Processing Laboratory, State
University of Campinas, C.P. 6101, CEP: 13083-852, Campinas-SP, Brazil.
email: gfgyared@yahoo.com, {fabio,selmini}@decom.fee.unicamp.br
SBrT
B. TIMIT Corpus
The TIMIT corpus is a benchmark for testing continuous
phone recognition systems [5], [6]. The training set contains
3696 sentences and the complete test set contains 1344 sentences.
Following the convention [7], [8], 48 different context-independent
phones are used for training the models and 39
phones for recognition. The same model and parameter
configuration described for the SVCS corpus is used. The
recognition task is performed by HTK [4] using a phone-level bigram language model with a scale factor of 2.
The BIC [10], [11], [12] has been widely used within
statistical modeling, in many different areas, for model structure
selection. The main concept supporting the BIC is the
principle of parsimony, that is, the selected model should be
the one with the smallest complexity and the highest capacity to
model the training data. This can be observed directly from
Equation (4):
$$M_S^{BIC} = \arg\max_{M_l^j} \left[ \sum_{t=1}^{N_j} \log P\left(o_t^j \,\middle|\, M_l^j\right) - \frac{l}{2} \log N_j \right], \qquad (4)$$

where $N_j$ is the number of frames assigned to state $j$ and $l$ is the number of free parameters of the candidate model $M_l^j$. The expressions of the entropy-based method, Equations (1) and (2), and the occupation ratio of Equation (5) are:

$$H = \frac{1}{2} \sum_{d=1}^{D} \log_{10}\left(2\pi e\, s_d^2\right), \qquad (1)$$

$$\sum_{i=1} N_{s_i} H_{s_i}, \qquad (2)$$

$$P_{wg}^{(i;j;s)} = \frac{N_{wg}^{(i;j;s)}}{N_{frames}^{(s)}}. \qquad (5)$$
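As a concrete illustration of the parsimony trade-off in Equation (4), the sketch below scores candidate models by their log-likelihood minus the $(l/2)\log N$ penalty and keeps the best one. The model labels, likelihoods, and parameter counts are made up for the example; they are not the paper's systems.

```python
import math

def bic_score(log_likelihood, n_params, n_frames):
    """BIC-style score: log P(O | M) - (l / 2) * log(N)."""
    return log_likelihood - 0.5 * n_params * math.log(n_frames)

def select_by_bic(candidates, n_frames):
    """candidates: iterable of (label, log_likelihood, n_params) tuples."""
    return max(candidates, key=lambda c: bic_score(c[1], c[2], n_frames))[0]

# Toy numbers: the likelihood keeps improving with more Gaussians, but the
# penalty eventually outweighs the gain, so the middle model is selected.
models = [("4 Gauss.", -13800.0, 312),
          ("8 Gauss.", -12300.0, 624),
          ("12 Gauss.", -12230.0, 936)]
best = select_by_bic(models, n_frames=5000)
```

The penalty grows linearly in the parameter count, so doubling the Gaussians must buy a likelihood gain of at least $(\Delta l/2)\log N$ to be selected.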
Fig. 1. Multidimensional Gaussian.

Fig. 2. Panels (a) and (b): a data sample $x_d$ and its reflection $2\mu_d - x_d$ about the mean $\mu_d$.
$$f(O_t) = \frac{1}{(2\pi)^{dim/2}\,|\Sigma|^{1/2}}\; e^{-(x-\mu)^{T} \Sigma^{-1} (x-\mu)/2}, \qquad (6)$$

$$f(O_t) = \prod_{d=1}^{dim} \frac{1}{\sqrt{2\pi\sigma_d^2}}\; e^{-(x_d-\mu_d)^2/2\sigma_d^2}, \qquad (7)$$

$$GIM^{(i;j;s)}(O_t) = \prod_{d=1}^{dim} \frac{2}{\sqrt{\pi}} \int_{0}^{z_d} e^{-z_d^2}\, dz_d, \qquad (8)$$

where $dim$ is the feature vector dimension, $z_d = \frac{x_d - \mu_{dij}}{\sqrt{2}\,\sigma_{dij}}$,
and $O_t = (x_1, x_2, \ldots, x_{dim})$ is the feature vector. The values
$\mu_{dij}$ and $\sigma_{dij}$ correspond to the mean and standard deviation,
respectively, along dimension $d$, of Gaussian $i$ that belongs
to state $j$.

The GIM of Gaussian $i$, which belongs to state $j$, is
calculated for every data sample of state $s$, so that the mean
value of the GIM can be obtained with respect to each existing state,
as shown in Equation (9) below:

$$P_{GIM}^{(i;j;s)} = \frac{1}{N_s} \sum_{t=1}^{N_s} GIM^{(i;j;s)}(O_t), \qquad (9)$$

$$DC(j) = \frac{\left[\prod_{i=1}^{M_j} P_{GIM}^{(i;j;j)}\right]^{K}}{\frac{1}{N-1} \sum_{s \neq j} \sum_{i=1}^{M_j} P_{GIM}^{(i;j;s)}}, \qquad (10)$$

where $K$ is the rigour exponent, $M_j$ is the number of Gaussians in state $j$, and $N$ is the total number of states. If the
logarithm of the Discriminative Constant (DC) is taken, the
resulting expression in Equation (11) is similar to Equation
(4), in the sense that the first term measures the capacity of
modeling the correct patterns and the second is a penalty term:

$$\log DC(j) = K \sum_{i=1}^{M_j} \log P_{GIM}^{(i;j;j)} - \log\!\left[\frac{1}{N-1} \sum_{s \neq j} \sum_{i=1}^{M_j} P_{GIM}^{(i;j;s)}\right]. \qquad (11)$$

The difference resides in the second term, which is a penalization
due to the number of parameters in the model in Equation
(4), and a penalization due to the interference in the
other state models in Equation (11).
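Reading the $\frac{2}{\sqrt{\pi}}\int_0^{z_d} e^{-u^2}du$ factor in Equation (8) as the error function $\mathrm{erf}(z_d)$, the quantities $P_{GIM}$ and $DC$ can be sketched as below for diagonal-covariance Gaussians. This is an illustrative reading with toy data, not the authors' implementation, and the DC denominator follows our reconstruction of Equation (10).

```python
import math

def gim(o, mean, std):
    """Equation (8) with the integral read as erf: product over dimensions of
    erf(|x_d - mu_d| / (sqrt(2) * sigma_d))."""
    return math.prod(math.erf(abs(x - m) / (math.sqrt(2) * s))
                     for x, m, s in zip(o, mean, std))

def p_gim(samples, mean, std):
    """Equation (9): mean GIM of one Gaussian over the N_s samples of a state."""
    return sum(gim(o, mean, std) for o in samples) / len(samples)

def discriminative_constant(p_own, p_interf, K):
    """Equation (10) as reconstructed: [product of own-state P_GIM]^K divided
    by the average interference P_GIM over the other N - 1 states."""
    return math.prod(p_own) ** K / (sum(p_interf) / len(p_interf))
```

Since every $P_{GIM}$ lies in $(0, 1)$, raising the numerator to a larger rigour exponent $K$ shrinks $DC$, which is how $K$ makes the criterion more restrictive.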
The main idea is to eliminate Gaussians from a well-trained
model with a fixed number of Gaussian components per state
and to observe the new DC value obtained. The Discriminative
Constant (DC) given above may increase or decrease
depending on the relevance of the eliminated Gaussian. In
this manner, the rigour exponent plays an important role in the
Gaussian selection, since it makes the discriminative criterion
DC more restrictive, that is, the greater the exponent, the
more rigorous the criterion is and, therefore, the fewer Gaussians
are eliminated.
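The elimination step described above can be sketched as follows. The `dc` function here is a stand-in for the Discriminative Constant of Equation (10), and the acceptance rule (keep an elimination only when DC does not decrease) is our reading of the text; the component indices and the toy scoring are invented for the example.

```python
def eliminate_gaussians(components, dc):
    """Greedily drop components whose removal does not lower dc(components)."""
    kept = list(components)
    improved = True
    while improved:
        improved = False
        for i in range(len(kept)):
            if len(kept) == 1:
                break  # never empty a state model entirely
            trial = kept[:i] + kept[i + 1:]
            if dc(trial) >= dc(kept):
                kept = trial
                improved = True
                break  # restart the scan after each accepted elimination
    return kept

# Toy DC: components 0 and 1 carry the discrimination; every extra component
# only adds a small interference cost, so the others should be eliminated.
def toy_dc(comps):
    return sum(1.0 for c in comps if c in (0, 1)) - 0.1 * len(comps)

result = eliminate_gaussians([0, 1, 2, 3, 4], toy_dc)
```

A stricter criterion (e.g. requiring a strict increase of DC, as a larger rigour exponent effectively does) would keep more components.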
The procedure described above is applied to each HMM
state model. Once the discriminative Gaussian selection is
performed, pairs of Gaussians x and y of a given state are compared
through the distance measure M_dxy in Equation (12), which normalizes the
distance d_xy between the two components by their P_GIM values:

$$M_{d_{xy}} = \frac{d_{xy}}{\sum_{i=x,y} P_{GIM}^{(i;j;j)}}, \qquad (12)$$

$$\sum_{s \neq j} \sum_{i=x,y} P_{GIM}^{(i;j;s)}. \qquad (13)$$

VII. EXPERIMENTS

A. SVCS Database

In the experiments, the proposed method was used to
optimize the baseline system containing 12 Gaussians per
state (1296 Gaussians), which gives the highest percentage
of correctly recognized words (81.01%) for this database.
The tests with the SVCS database have shown that the new
GEA was able to outperform both the Bayesian Information
Criterion [18] and the entropy-based method presented in
Section IV.

The highest performances given by the systems with a varying number of
Gaussians per state obtained from the three methods,
as well as the empirical parameters used, are presented in Table I.

TABLE I
THE FOLLOWING INCREMENTS IN THE NUMBER OF GAUSSIANS WERE TESTED:

Method         | Empirical parameter              | Reduced model size | Economy of Gaussians (%) | Perc. Corr. (%) | Differ. in perform. (%)
Entropy-based  | Increm. N Gauss. = 70            | 1040               | 19.8                     | 81.62           | +0.61
BIC            | regular. paramet. (λ) = 0.05     | 1110               | 14.4                     | 80.75           | -0.26
GEA            | Distance threshold (M_dxy) = 4.5 | 1055               | 18.6                     | 82.34           | +1.33

The GEA performance depends on the segmentation used in
the discriminative analysis, and also on the effective reduction
of the excess Gaussians present in the models. The SVCS
database is not manually segmented and, in this case, the
segmentation must be obtained automatically, through the Viterbi alignment.

B. TIMIT Database

The first step was to obtain three baseline systems for the
TIMIT corpus. The performance of these systems is shown
in Table II.

TABLE II

Number of Gaussians | Percentage Correct (%) | Recognition Accuracy (%)
1152                | 70.94                  | 63.7
2304                | 73.03                  | 65.87
4608                | 73.11                  | 66.14
The results obtained from the GEA for the TIMIT corpus
are shown in Table III.

TABLE III
SYSTEMS OBTAINED AFTER APPLYING THE GEA USING A HMM SEGMENTATION.

Baseline system (Num. of Gauss.) | K  | M_dxy | Num. of Gauss. after the GEA
1152                             | 50 | 0     | 1114
2304                             | 5  | 14    | 2056
4608                             | 10 | –     | 3825

As can be noted, the highest economy (17%) was achieved
for the 4608-Gaussian baseline system, from which 783
Gaussians were eliminated without any loss in performance.
In addition, the smaller the system, the lower the economy
obtained, which seems reasonable, since the model
topology selection process is constrained to at least the same
performance as the original system. The lack of acoustic
resolution makes it difficult to distinguish different
acoustic patterns, which may increase the recognition errors and
thus must be avoided.

2) The Quality of the Segmentation Used: The segmentation
used in the discriminative analysis is important for labeling
each acoustic vector and thus for calculating the DC value
of each state model. In this manner, it is interesting to define
a methodology for choosing the most appropriate recognition
system for the Viterbi alignment task. The percentage of errors
observed in the segmentation [19], [20] generated by the 4608-Gaussian
baseline system is presented in Table IV.

TABLE IV

Segmentation Error | Percentage Correct (%)
10 ms              | 48.2
20 ms              | 81.2
30 ms              | 92.7
40 ms              | 96.3
50 ms              | 98.1

The relation between the recognition performance, in terms of phone
recognition rate and accuracy, and the quality of the segmentation
generated by the Viterbi alignment (compared against the
manual segmentation of the TIMIT database) was calculated
for each system. The results are shown in Figures 3 and 4 for
the percentage of segmentation errors smaller than 50 ms.

Fig. 3. Relation between the percentage of segmentation errors and the phone
recognition rate.

Fig. 4. Relation between the percentage of segmentation errors and the
recognition accuracy.

The correlation coefficients between the percentage of segmentation
errors and the recognition performances, for different error thresholds,
are presented in Table V.

TABLE V
CORRELATION COEFFICIENTS (C.C.) BETWEEN THE PERCENTAGE OF
SEGMENTATION ERRORS AND THE RECOGNITION PERFORMANCE.

Segmentation Error | Recognition Accuracy
10 ms              | c.c. = 0.34
20 ms              | c.c. = 0.11
30 ms              | c.c. = 0.12
40 ms              | c.c. = 0.34
50 ms              | c.c. = 0.61
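The coefficients in Table V are ordinary correlation coefficients; a minimal Pearson computation is sketched below. The five (x, y) pairs are illustrative placeholders, not the paper's measurements.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Placeholder data: fraction of segmentation errors smaller than 50 ms per
# system (x) against that system's recognition accuracy in % (y).
seg_quality = [0.974, 0.976, 0.978, 0.980, 0.982]
accuracy = [63.7, 64.9, 65.1, 65.9, 66.1]
cc = pearson(seg_quality, accuracy)
```

With only a handful of systems per threshold, such coefficients are noisy, which is consistent with the low values reported for the tighter thresholds.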
VIII. CONCLUSION

The new GEA outperformed the BIC and the entropy-based
method in determining systems with a better compromise
between complexity and recognition performance. In addition,
the reduced system also gave an increase of 1.33% in performance
when compared to the baseline system that gives the
highest recognition performance for the SVCS database used
in the experiments.

The experiments with the TIMIT database have shown that
the new GEA was able to determine reduced systems with
at least the same performance as the original system. It was
also verified that the correlation between the percentage of
segmentation errors and the recognition performance is not
high (c.c. = 0.61 for the conditions used in the experiments).
Furthermore, the recognition accuracy seems to be more
appropriate than the phone recognition rate for choosing the
recognition system for the Viterbi alignment task; that is, the
system that gives the highest accuracy should be used in the
Viterbi alignment task in order to obtain the segmentation for
the GEA.

Future work will extend the GEA concepts to HMM
complexity control in context-dependent phone models and
also to other pattern recognition problems.

IX. ACKNOWLEDGEMENT

We would like to thank CNPq (Conselho Nacional de
Desenvolvimento Científico e Tecnológico) for the research
funding.
REFERENCES

[1] L. A. Aguirre, An Introduction to Systems Identification: Linear and
Non-linear Techniques Applied to Real Systems (in Portuguese). Editora
UFMG, 2000.
[2] L. R. Bahl, F. Jelinek, and R. L. Mercer, "A maximum likelihood
approach to continuous speech recognition," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 5, pp. 179-190, 1983.
[3] C. A. Ynoguti, "Continuous speech recognition using hidden Markov
models (in Portuguese)," Ph.D. dissertation, Universidade Estadual de
Campinas, 1999.
[4] The HTK Book, Cambridge University Engineering Department, 2002.
[5] S. Kapadia, V. Valtchev, and S. J. Young, "MMI training for continuous
phoneme recognition on the TIMIT database," in IEEE International
Conference on Acoustics, Speech and Signal Processing, 1993.
[6] L. F. Lamel and J. L. Gauvain, "High performance speaker-independent
phone recognition using CDHMM," in Eurospeech 1993, 1993.
[7] K.-F. Lee and H.-W. Hon, "Speaker-independent phone recognition
using hidden Markov models," IEEE Transactions on Acoustics, Speech
and Signal Processing, vol. 37, no. 11, pp. 1641-1648, 1989.
[8] K.-F. Lee, H.-W. Hon, and M.-Y. Hwang, "Recent progress in the SPHINX
speech recognition system," in Workshop on Speech and Natural
Language, 1989.
[9] Y. J. Chung and C. K. Un, "Use of different numbers of mixtures in
continuous density hidden Markov models," Electronics Letters, vol. 29,
no. 9, pp. 824-825, 1993.
[10] I. Csiszár and P. C. Shields, "The consistency of the BIC Markov order
estimator," in IEEE International Symposium on Information Theory,
2000.
[11] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite
mixture models," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 24, no. 3, pp. 381-396, 2002.
[12] S. S. Chen and P. S. Gopalakrishnan, "Clustering via the Bayesian
information criterion with applications in speech recognition," in IEEE
International Conference on Acoustics, Speech and Signal Processing,
1998.