Speech processing involves recognition, synthesis, language identification, speaker recognition, and a
host of subsidiary problems regarding variations in speaker and speaking conditions. Notwithstanding
the difficulty of the problems, and the fact that speech processing spans two major areas, acoustic
engineering and computational linguistics, great progress has been made in the past fifteen years, to the
point that commercial speech recognizers are increasingly available in the late 1990s. Still, problems
remain both at the sound level, especially dealing with noise and variation, and at the dialogue and
conceptual level, where speech blends with natural language analysis and generation.
Acoustic variability, due to the fact that the same phonemes pronounced in
different contexts (that is, surrounded by different phonemes) will have
different acoustic realizations (this is called the coarticulation effect).
Additional factors that play a role include the fact that the general prosody
of a sentence modifies the corresponding signal, and that the signal is
different when speech is uttered in various environments, in noise, with
reverberation, with different microphones, or different types of
microphones.
Speaking variability, when the same speaker speaks normally, shouts,
whispers, uses a creaky voice, or has a cold.
Speaker variability, since different speakers have different timbres and
different speaking habits.
Noise and channel distortions are difficult to handle, especially when there is no a
priori knowledge of the noise or of the distortion. These phenomena directly affect
the acoustics of the signal, but may also indirectly modify the voice at the source.
This is known as the Lombard effect, where noise modifies the utterance of the
words (as people tend to speak louder), but may also be reflected in voice changes
due to the psychological awareness of speaking to a machine.
The fact that, contrary to written texts, speech is continuous and has no silence to
separate words, adds extra difficulty. Continuous speech is also difficult to
handle because linguistic phenomena of various kinds may occur at the junctions
between words, or within frequently used words, which are usually short
and therefore much affected by coarticulation.
5.2.2 History of Major Methods, Techniques, and Approaches
Regarding speech synthesis, the origins may be placed very early in time. The first
result in that field may be placed in 1791, when W. von Kempelen demonstrated
his speaking machine, which was built with a mechanical apparatus mimicking the
human vocal apparatus. The next major successful attempt may be placed at the
New York World Fair in 1939, when H. Dudley presented the Voder, based on
electrical devices. In this case, the method was based rather on an analysis-synthesis approach: the sounds were first analyzed and then replayed. In both
cases, it was necessary to learn how to play those very special musical instruments
(one week in the case of the Voder), and the human demonstrating the systems
probably used the now well-known trick of announcing to the audience what they
would hear, and thus inducing the understanding of the corresponding sentence.
Since then, major progress may be reported in that field, with basically two
approaches still reflecting the Von Kempelen/Dudley dichotomy of "Knowledge-Based" vs "Template-Based" approaches. The first approach is based on the
functioning of the vocal tract, which often goes together with formant synthesis
(the formants are the resonances of the vocal tract). The second is based on the
synthesis of pre-analyzed signals, which leads to diphone synthesizers, and more
generally to signal segment concatenation. A speech synthesizer for American
English was designed based on the first approach at MIT (Klatt, 1980), and
resulted in the best synthesizer available at that time. Several works may also be
reported in the field of articulatory synthesis, which aims at mimicking more
closely the functioning of the vocal apparatus. However, the best quality is
presently obtained by diphone based approaches or the like, using simply PCM
60s that find those parameters (that is, train the model) (Baum, 1972), and match in
an optimal way a model with a signal (Viterbi, 1967), similarly to DTW. The
An interesting feature of this approach is that a given model can include
parameters representing different ways of pronouncing a word, for different
speaking styles of the same speaker or for different speakers, and different
pronunciations of the words, with different probabilities. Even more
interestingly, it is possible to train phoneme models instead of word models.
The recognition process may then be expressed as finding the word sequence
which maximizes the probability that the word sequence produced the signal. This
can be simply rewritten as the product of the probability that the signal was
produced by the word sequence (Acoustic Model) and the probability of the word
sequence (Language Model). This latter probability can be obtained by computing
the frequency of the succession of two (bigrams) or three (trigrams) words in texts
or speech transcriptions corresponding to the kind of utterances which will be
considered in the application. It is also possible to consider the probabilities of
grammatical category sequences (biclass and triclass models).
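The bigram estimation described above can be sketched in a few lines: counts of word pairs in a corpus are converted into conditional probabilities, whose product gives the Language Model score of a word sequence. This is a toy illustration on made-up sentences, without the smoothing a real language model would need for unseen bigrams:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate bigram probabilities P(w2 | w1) by relative frequency."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[:-1])             # left-context counts
        bigrams.update(zip(words[:-1], words[1:]))
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

def sequence_prob(lm, sentence):
    """Probability of a word sequence under the bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for pair in zip(words[:-1], words[1:]):
        p *= lm.get(pair, 0.0)  # unseen bigram -> zero without smoothing
    return p

corpus = ["open the door", "close the door", "open the window"]
lm = train_bigram_lm(corpus)
print(sequence_prob(lm, "open the door"))  # 2/3 * 1 * 2/3 * 1 = 4/9
```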
The HMM approach requires very large amounts of data for training, both in terms
of signal and in terms of textual data, and the availability of such data is crucial for
developing technologies and applications, and evaluating systems.
Various techniques have been proposed for the decoding process (depth-first,
breadth-first, beam search, A* algorithm, stack algorithm, Tree Trellis, etc.). This
process is very time consuming, and one research goal is to accelerate the process
without losing quality.
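Among the decoding strategies listed above, the pruning idea behind beam search can be illustrated with a minimal sketch: at each step only the few best-scoring partial hypotheses are kept, trading completeness for speed. The per-position acoustic scores and the `lm_logprob` helper below are hypothetical stand-ins for real acoustic and language models:

```python
def beam_search(frames, lm_logprob, beam_width=2):
    """Beam-search decoding sketch: keep only the beam_width best partial
    hypotheses at each step instead of exploring the full search space."""
    beams = [(0.0, ["<s>"])]
    for frame in frames:
        candidates = []
        for score, hyp in beams:
            for word, acoustic in frame.items():
                candidates.append(
                    (score + acoustic + lm_logprob(hyp[-1], word), hyp + [word]))
        beams = sorted(candidates, key=lambda c: -c[0])[:beam_width]  # prune
    return beams[0]

# hypothetical bigram log-scores and per-position acoustic log-scores
bigram = {("<s>", "a"): -0.2, ("<s>", "the"): -0.4,
          ("a", "cat"): -1.5, ("the", "cat"): -0.3,
          ("a", "cap"): -0.7, ("the", "cap"): -2.0}
def lm_logprob(prev, word):
    return bigram.get((prev, word), -5.0)

frames = [{"a": -1.0, "the": -1.1}, {"cat": -0.5, "cap": -0.6}]
score, hyp = beam_search(frames, lm_logprob)
print(hyp)  # ['<s>', 'the', 'cat']
```

Note that with a narrow beam the locally best first word ("a") is kept but ultimately loses to "the cat", whose combined acoustic and language scores are better overall.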
This statistical approach was proposed in the early 70s. It was developed
throughout the early 80s in parallel with other approaches, as there was no
quantitative way of comparing approaches on a given task. The US Department of
Defense DARPA Human Language Technology program, which started in 1984,
fostered an evaluation-driven comparative research paradigm, which clearly
demonstrated the advantages of the statistical approach (DARPA, 1989-98).
Gradually, the HMM approach became more popular, both in the US and abroad.
In parallel, the connectionist, or neural network (NN), approach was explored
in various fields, including speech processing. This approach is also based on
training, but is considered to be more discriminative than the HMM approach.
However, it is less adequate than HMMs for modeling temporal information. Hybrid systems that
combine HMMs and NNs have therefore been proposed. Though they provide
interesting results, and, in some limited cases, even surpass the pure HMM
approach, they have not proven their superiority.
This history illustrates how problems were attacked and in some cases partly
solved by different techniques: acoustic variability through the use of template matching using DTW in the 70s, followed by stochastic modeling in the 80s,
stay as a research topic for the future, with the goal to make it more natural and
invisible. The systems will thus become more speaker-independent, but will still
have a speaker adaptation component. This adaptation can also be necessary for the
same speaker, if his or her voice changes due to illness, for example.
In speech synthesis, the quality of text-to-speech synthesis is better, but still not
good enough for replacing "canned speech" (constructed by concatenating phrases
and words). The generalization of the use of Text-to-Speech synthesis for
applications such as reading aloud email messages will, however, probably help
make this imperfect voice familiar and acceptable. Further improvement should
therefore be obtained on phoneme synthesis itself, but attention should be placed
on improving the naturalness of the voice. This involves prosody, as it is very
difficult to generate a natural and acceptable prosody from the text, and it may be
somehow easier to do it in the speech generation module of an oral dialogue
system. This also involves voice quality, allowing the TTS synthesis system to
change its voice to interpret the right meaning of a sentence. Voice conversion
(allowing a TTS synthesis system to speak with the voice of the user, after analysis
of this voice) is another area of R&D interest (Abe et al., 1990).
Generally speaking, the research program for the next years should be "to put back
Language into Language Modeling", as proposed by F. Jelinek during the MLIM
workshop. It requires taking into account that the data which has to be modeled
is language, not just sounds, and that it therefore has some specifics, including an
internal structure which involves more than a window of two or three words. This
would suggest going beyond Bigrams and Trigrams, to consider parsing complete
sentences.
In the same way, as suggested by R. Rosenfeld during the MLIM workshop, it may
be proposed "to put Speech back in Speech Recognition", since the data to be
modeled is speech, with its own specifics, such as having been produced by a
human brain through the vocal apparatus. In that direction, it may be mentioned
that the signal processing techniques for signal acquisition were mostly based on
MFCC (Mel Frequency Cepstral Coefficients) in the 80s (Davis and Mermelstein,
1980), and are getting closer to perceptual findings with PLP (Perceptual
Linear Prediction) in the 90s (Hermansky, 1990).
Several application areas are now developing, including consumer electronics
(mobile phones, hand-held organizers), desktop applications (Dictation, OS
navigation, computer games, language learning), telecommunications (auto-attendant, home banking, call-centers). These applications require several
technological advances, including consistent accuracy, speaker-independence and
quick adaptation, consistent handling of Out-Of-Vocabulary words, easy addition
of new words and names, automatic updating of vocabularies, robustness to noise
and channel, barge-in (allowing a human to speak over the system's voice and
interrupt it), and also standard software and hardware compatibility and low cost.
(sitting in front of his computer, in which case a text + graphics output may be
appropriate, or driving his car, in which case, a speech output of a summarized
version of the textual information may be more appropriate, for example).
5.5 Juxtaposition of this Area with Other Areas
Over the years, speech processing has been getting closer to natural language processing,
as speech recognition is shifting to speech understanding and dialogue, and as
speech synthesis becomes increasingly natural and approaches language generation
from concepts in dialogue systems. Speech recognition would benefit from better
language parsing, and speech synthesis would benefit from better morpho-syntactic
tagging and language parsing.
Speech recognition and speech synthesis are used in Machine Translation (Chapter
4) for spoken language translation (Chapter 7).
Speech processing meets Natural Language Processing, but also computer vision,
computer graphics, gestural communication in multimodal communication
systems, with open research issues on the relationship between image, language
and gesture for example (see Chapter 9).
Even imperfect speech recognition meets Information Retrieval (Chapter 2) in
order to allow for multimedia document indexing through speech, and retrieval of
multimedia documents (such as in the US Informedia (Wactlar et al., 1999) and the
EU Thistle or Olive projects). This information retrieval may even be multilingual,
extending the capability of the system to index and retrieve the requested
information, whatever the language spoken by the user, or present in the data.
Information Extraction (Chapter 3) from spoken material is a similar area of
interest, and work has already been initiated in that domain within DARPA's Topic
Detection and Tracking program. Here also, it will benefit from cooperation
between speech and NL specialists and from a multilingual approach, as data is
available on multiple sources in multiple languages worldwide.
Speech recognition, speech synthesis, speech understanding and speech generation
meet in order to allow for oral dialogue. Vocal dialogue will get closer to research
in the area of dialogue modeling (indirect speech acts, beliefs, planning, user
models, etc.). Adding a multilingual dimension empowers individuals and gives
them a universal access to the information world.
5.6 The Treatment of Multiple Languages in Speech Processing
Addressing multilinguality is important in speech processing. A system that
handles several languages is much easier to put on the market than a system that
can only address one language. In terms of research, the structural differences
across languages are interesting for studying any one of them. Rapid deployment
With respect to multilinguality, there are two important questions. First, can data
be shared across languages (if a system is able to recognize one language, will it be
necessary to conduct the same effort to address another one? Or is it possible to
reuse for example the acoustic models of the phonemes that are similar in two
different languages)? Second, can knowledge be shared across languages? (Could
the scientific results obtained in studying one language be used for studying
another language? As the semantic meaning of a sentence remains the same when
it is pronounced in two different languages, it should be possible to model
such knowledge independently of the languages used)?
5.7 Conclusion
Notwithstanding the difficulty of the problems facing speech processing, and
despite the fact that speech processing spans two major areas, acoustic engineering
and computational linguistics, great progress has been made in the past fifteen
years. Commercial speech recognizers are increasingly available today,
complementing machine translation and information retrieval systems in a trio of
Language Processing applications. Still, problems remain both at the sound level,
especially dealing with noise and variations in speaker and speaking condition, and
at the dialogue and conceptual level, where speech blends with natural language
analysis and generation.
5.8 References
Abe, M., S. Nakamura, K. Shikano, and H. Kuwabara. 1990. Voice
conversion through vector quantization. Journal of the Acoustical Society of
Japan, E-11 (71-76).
Baker, J.K. 1975. Stochastic Modeling for Automatic Speech
Understanding. In R. Reddy (ed.), Speech Recognition (521-542).
Academic Press.
Baum, L.E. 1972. An Inequality and Associated Maximization Technique in
Statistical Estimation of Probabilistic Functions of Markov
Processes. Inequalities 3 (1-8).
Black, A. W. and N. Campbell. 1995. Optimising selection of units from
speech databases for concatenative synthesis. Proceedings of the Fourth
European Conference on Speech Communication and Technology (581-584).
Madrid, Spain.
Cole, R., J. Mariani, H. Uszkoreit, N. Varile, A. Zaenen, A. Zampolli, V.
Zue. 1998. Survey of the State of the Art in Human Language Technology.
Speaker recognition
Sadaoki Furui (2008), Scholarpedia, 3(4):3715.
doi:10.4249/scholarpedia.3715
Speaker recognition is the process of automatically recognizing who is speaking by using the
speaker-specific information included in speech waves to verify identities being claimed by
people accessing systems; that is, it enables access control of various services by voice (Furui,
1991, 1997, 2000). Applicable services include voice dialing, banking over a telephone network,
telephone shopping, database access services, information and reservation services, voice mail,
security control for confidential information, and remote access to computers. Another important
application of speaker recognition technology is as a forensics tool.
The fundamental difference between identification and verification is the number of decision
alternatives. In identification, the number of decision alternatives is equal to the size of the
population, whereas in verification there are only two choices, acceptance or rejection, regardless
of the population size. Therefore, speaker identification performance decreases as the size of the
population increases, whereas speaker verification performance approaches a constant
independent of the size of the population, unless the distribution of physical characteristics of
speakers is extremely biased.
There is also a case called open-set identification, in which a reference model for an unknown
speaker may not exist. In this case, an additional decision alternative, "the unknown does not
match any of the models," is required. Verification can be considered a special case of the
open-set identification mode in which the known population size is one. In either verification or
identification, an additional threshold test can be applied to determine whether the match is
sufficiently close to accept the decision, or if not, to ask for a new trial.
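The combined decision rule just described (pick the closest enrolled speaker, then apply a threshold test) can be sketched in a few lines. The speaker names and score values are hypothetical, and higher scores are assumed to mean a closer match:

```python
def open_set_identify(scores, threshold):
    """Open-set identification sketch: select the best-matching enrolled
    speaker, then reject the match as 'unknown' if it falls below threshold.

    scores: dict mapping enrolled speaker name -> similarity score.
    """
    best = max(scores, key=scores.get)          # closed-set identification step
    return best if scores[best] >= threshold else "unknown"  # threshold test

print(open_set_identify({"alice": 0.82, "bob": 0.41}, threshold=0.6))  # alice
print(open_set_identify({"alice": 0.55, "bob": 0.41}, threshold=0.6))  # unknown
```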
The effectiveness of speaker verification systems can be evaluated by using the receiver operating
characteristics (ROC) curve adopted from psychophysics. The ROC curve is obtained by assigning
two probabilities, the probability of correct acceptance (1 − false rejection rate) and the
probability of incorrect acceptance (false acceptance rate), to the vertical and horizontal axes
respectively, and varying the decision threshold. The detection error trade-off (DET) curve is also
used, in which false rejection and false acceptance rates are assigned to the vertical and
horizontal axes respectively. The error curve is usually plotted on a normal deviate scale. With
this scale, a speaker recognition system whose true speaker and impostor scores are Gaussians
with the same variance will result in a linear curve with a slope equal to −1. The DET curve
representation is therefore more easily readable than the ROC curve and allows for a comparison
of the system's performance over a wide range of operating conditions.
The equal-error rate (EER) is a commonly accepted overall measure of system performance. It
corresponds to the threshold at which the false acceptance rate is equal to the false rejection rate.
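The EER can be found by sweeping a decision threshold over the pooled scores and locating the point where the two error rates cross. The following sketch uses made-up genuine and impostor score lists, with higher scores assumed to favor the true speaker:

```python
def equal_error_rate(genuine, impostor):
    """Return the EER: sweep the threshold and take the point where the
    false rejection and false acceptance rates are closest."""
    best = (2.0, None)
    for t in sorted(set(genuine + impostor)):
        frr = sum(g < t for g in genuine) / len(genuine)     # true speakers rejected
        far = sum(i >= t for i in impostor) / len(impostor)  # impostors accepted
        gap = abs(frr - far)
        if gap < best[0]:
            best = (gap, (frr + far) / 2)
    return best[1]

genuine = [0.9, 0.8, 0.7, 0.6]
impostor = [0.65, 0.5, 0.4, 0.3]
print(equal_error_rate(genuine, impostor))  # 0.25: one error of each kind
```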
phoneme or syllable, it generally achieves higher recognition performance than the text-independent method.
There are several applications, such as forensics and surveillance applications, in which
predetermined key words cannot be used. Moreover, human beings can recognize speakers
irrespective of the content of the utterance. Therefore, text-independent methods have attracted
more attention. Another advantage of text-independent recognition is that it can be done
sequentially, until a desired significance level is reached, without the annoyance of the speaker
having to repeat key words again and again.
Both text-dependent and text-independent methods have a serious weakness. That is, these security
systems can easily be circumvented, because someone can play back the recorded voice of a
registered speaker uttering key words or sentences into the microphone and be accepted as the
registered speaker. Another problem is that people often do not like text-dependent systems
because they do not like to utter their identification number, such as their social security number,
within the hearing of other people. To cope with these problems, some methods use a small set of
words, such as digits as key words, and each user is prompted to utter a given sequence of key
words which is randomly chosen every time the system is used. Yet even this method is not
reliable enough, since it can be circumvented with advanced electronic recording equipment that
can reproduce key words in a requested order. Therefore, a text-prompted speaker recognition
method has been proposed in which password sentences are completely changed every time.
DTW-Based Methods
In this approach, each utterance is represented by a sequence of feature vectors, generally short-term spectral feature vectors, and the trial-to-trial timing variation of utterances of the same text
is normalized by aligning the analyzed feature vector sequence of a test utterance to the template
feature vector sequence using a DTW algorithm. The overall distance between the test utterance
and the template is used for the recognition decision. When multiple templates are used to
represent spectral variation, distances between the test utterance and the templates are averaged
and then used to make the decision. The DTW approach has trouble modeling the statistical
variation in spectral features.
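The alignment step can be sketched with the classical dynamic-programming recursion: each cell accumulates the local frame distance plus the cheapest of the three permitted predecessor steps. The tiny one-dimensional "feature vectors" below are illustrative only:

```python
import math

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping between two feature-vector sequences (lists of
    tuples). Returns the accumulated distance of the optimal alignment."""
    n, m = len(seq_a), len(seq_b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(seq_a[i - 1], seq_b[j - 1])  # local frame distance
            # allow diagonal (match), vertical, or horizontal steps
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

template = [(0.0,), (1.0,), (2.0,)]
utterance = [(0.0,), (0.0,), (1.0,), (2.0,)]  # same shape, stretched in time
print(dtw_distance(template, utterance))  # 0.0: warping absorbs the stretch
```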
HMM-Based Methods
An HMM can efficiently model the statistical variation in spectral features. Therefore, HMM-based methods have achieved significantly better recognition accuracies than DTW-based
methods.
Long-Term-Statistics-Based Methods
Long-term sample statistics of various spectral features, such as the mean and variance of
spectral features over a series of utterances, have been used. Long-term spectral averages are
extreme condensations of the spectral characteristics of a speaker's utterances and, as such, lack
the discriminating power of the sequences of short-term spectral features used as models in textdependent methods.
VQ-Based Methods
A set of short-term training feature vectors of a speaker can be used directly to represent the
essential characteristics of that speaker. However, such a direct representation is impractical
when the number of training vectors is large, since the memory and amount of computation
required become prohibitively large. Therefore, attempts have been made to find efficient ways of
compressing the training data using vector quantization (VQ) techniques.
In this method, VQ codebooks, consisting of a small number of representative feature vectors, are
used as an efficient means of characterizing speaker-specific features. In the recognition stage, an
input utterance is vector-quantized by using the codebook of each reference speaker; the VQ
distortion accumulated over the entire input utterance is used for making the recognition
determination.
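A minimal sketch of this scheme: a small codebook is trained per speaker with K-means, and at recognition time the average distance of each input frame to its nearest codeword gives the VQ distortion. The training vectors and codebook size here are toy assumptions:

```python
import math
import random

def train_codebook(vectors, size, iters=20, seed=0):
    """K-means sketch of VQ codebook training: find `size` representative
    vectors that compress the speaker's training frames."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, size)
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for v in vectors:  # assign each frame to its nearest codeword
            idx = min(range(size), key=lambda k: math.dist(v, centroids[k]))
            clusters[idx].append(v)
        for k, members in enumerate(clusters):
            if members:  # recompute the codeword as the cluster mean
                centroids[k] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids

def vq_distortion(utterance, codebook):
    """Average distance of each frame to its nearest codeword; the speaker
    whose codebook yields the smallest distortion is selected."""
    return sum(min(math.dist(f, c) for c in codebook) for f in utterance) / len(utterance)

speaker_a = train_codebook([(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)], size=2)
probe = [(0.05, 0.0), (1.05, 1.0)]
print(vq_distortion(probe, speaker_a))  # small: frames lie near the codewords
```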
In contrast with the memoryless (frame-by-frame) VQ-based method, non-memoryless source
coding algorithms have also been studied using a segment (matrix) quantization technique. The
advantage of a segment quantization codebook over a VQ codebook representation is its
characterization of the sequential nature of speech events. A segment modeling procedure for
constructing a set of representative time normalized segments called filler templates has been
proposed. The procedure, a combination of K-means clustering and dynamic programming time
alignment, provides a means for handling temporal variation.
Ergodic-HMM-Based Methods
The basic structure is the same as the VQ-based method, but in this method an ergodic HMM is
used instead of a VQ codebook. Over a long timescale, the temporal variation in speech signal
parameters is represented by stochastic Markovian transitions between states. This method uses
a multiple-state ergodic HMM (i.e., all possible transitions between states are allowed) to classify
speech segments into one of the broad phonetic categories corresponding to the HMM states. The
automatically obtained categories are often characterized as strong voicing, silence, nasal/liquid,
stop burst/post silence, frication, etc.
The VQ-based method has been compared with the discrete/continuous ergodic HMM-based
method, particularly from the viewpoint of robustness against utterance variations. It was found
that the continuous ergodic HMM method is far superior to the discrete ergodic HMM method
and that the continuous ergodic HMM method is as robust as the VQ-based method when
enough training data is available. However, when little data is available, the VQ-based method is
more robust than the continuous HMM method. Speaker identification rates using the
continuous HMM were investigated as a function of the number of states and mixtures. It was
shown that the speaker recognition rates were strongly correlated with the total number of
mixtures, irrespective of the number of states. This means that using information on transitions
between different states is ineffective for text-independent speaker recognition.
A technique based on maximum likelihood estimation of a Gaussian mixture model (GMM)
representation of speaker identity is one of the most popular methods. This method corresponds
to the single-state continuous ergodic HMM. Gaussian mixtures are noted for their robustness as
a parametric model and for their ability to form smooth estimates of rather arbitrary underlying
densities.
The VQ-based method can be regarded as a special (degenerate) case of a single-state HMM with
a distortion measure being used as the observation probability.
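A GMM speaker score, as in the single-state ergodic HMM case above, is just the frame-averaged log-likelihood under a mixture of Gaussians. The sketch below uses diagonal covariances and a hypothetical two-component model in a one-dimensional feature space:

```python
import math

def gmm_loglik(frames, weights, means, variances):
    """Frame-averaged log-likelihood of an utterance under a
    diagonal-covariance Gaussian mixture model of a speaker."""
    total = 0.0
    for x in frames:
        mix = 0.0
        for w, mu, var in zip(weights, means, variances):
            # log density of this diagonal-Gaussian mixture component
            log_p = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                        for xi, m, v in zip(x, mu, var))
            mix += w * math.exp(log_p)
        total += math.log(mix)
    return total / len(frames)

# hypothetical speaker model: two components in a 1-D feature space
weights = [0.5, 0.5]
means = [(0.0,), (3.0,)]
variances = [(1.0,), (1.0,)]
print(gmm_loglik([(0.1,), (2.9,)], weights, means, variances))
```

Frames near the component means score well; the identification decision picks the speaker model with the highest average log-likelihood.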
Speech-Recognition-Based Methods
The VQ- and HMM-based methods can be regarded as methods that use phoneme-class-dependent speaker characteristics contained in short-term spectral features through implicit
phoneme-class recognition. In other words, phoneme-classes and speakers are simultaneously
recognized in these methods. On the other hand, in the speech-recognition-based methods,
phonemes or phoneme-classes are explicitly recognized, and then each phoneme/phoneme-class
segment in the input speech is compared with speaker models or templates corresponding to that
phoneme/phoneme-class.
A five-state ergodic linear predictive HMM for broad phonetic categorization has been
investigated. In this method, after frames that belong to particular phonetic categories have been
identified, feature selection is performed. In the training phase, reference templates are
generated and verification thresholds are computed for each phonetic category. In the
verification phase, after phonetic categorization, a comparison with the reference template for
each particular category provides a verification score for that category. The final verification
score is a weighted linear combination of the scores for each category. The weights are chosen to
reflect the effectiveness of particular categories of phonemes in discriminating between speakers
and are adjusted to maximize the verification performance. Experimental results showed that
verification accuracy can be considerably improved by this category-dependent weighted linear
combination method.
A speaker verification system using 4-digit phrases has also been tested in actual field conditions
with a banking application, where input speech was segmented into individual digits using a
speaker-independent HMM. The frames within the word boundaries for a digit were compared
with the corresponding speaker-specific HMM digit model and the Viterbi likelihood score was
computed. This was done for each of the digits making up the input utterance. The verification
score was defined to be the average normalized log-likelihood score over all the digits in the
utterance.
A large vocabulary speech recognition system has also been used for speaker verification. With
this approach a set of speaker-independent phoneme models were adapted to each speaker.
Speaker verification consisted of two stages. First, speaker-independent speech recognition was
run on each of the test utterances to obtain phoneme segmentation. In the second stage, the
segments were scored against the adapted models for a particular target speaker. The scores were
normalized by those with speaker-independent models. The system was evaluated using the 1995
NIST-administered speaker verification database, which consists of data taken from the
Switchboard corpus. The results showed that this method did not out-perform Gaussian mixture
models.
unigrams and bigrams from manually transcribed conversations are used to characterize a
particular speaker in a traditional target/background likelihood ratio framework. The use of
support vector machines for performing the speaker verification task based on phone and word
sequences obtained using phone recognizers has been proposed. The benefit of these features was
demonstrated in the NIST extended data task for speaker verification; with enough
conversational data, a recognition system can become familiar with a speaker and achieve
excellent accuracy. The corpus was a combination of phases 2 and 3 of the Switchboard-2
corpora. Each training utterance in the corpus consisted of a conversation side that was
nominally of length 5 minutes (approximately 2.5 minutes of speech) recorded over a land-line
telephone. Speaker models were trained using 1-16 conversation sides. These methods need
utterances at least several minutes long, much longer than those used in conventional speaker
recognition methods.
Parameter-Domain Normalization
As one typical normalization technique in the parameter domain, spectral equalization, the so-called blind equalization method, has been confirmed to be effective in reducing linear channel
effects and long-term spectral variation. This method is especially effective for text-dependent
speaker recognition applications using sufficiently long utterances. In this method, cepstral
coefficients are averaged over the duration of an entire utterance, and the averaged values are
subtracted from the cepstral coefficients of each frame (CMS; cepstral mean subtraction). This
method can compensate fairly well for additive variation in the log spectral domain. However, it
unavoidably removes some text-dependent and speaker-specific features, so it is inappropriate
for short utterances in speaker recognition applications. It has also been shown that time
derivatives of cepstral coefficients (delta-cepstral coefficients) are resistant to linear channel
mismatches between training and testing.
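CMS itself is a one-line idea: average each cepstral coefficient over the utterance and subtract that mean from every frame. The sketch below demonstrates, on made-up two-coefficient frames, that a constant channel offset disappears after subtraction:

```python
def cepstral_mean_subtraction(frames):
    """CMS sketch: subtract the utterance-level mean of each cepstral
    coefficient, removing stationary (linear channel) effects."""
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [tuple(f[d] - mean[d] for d in range(dims)) for f in frames]

# a constant channel offset (+0.5 on every coefficient) disappears after CMS
clean = [(0.0, 1.0), (2.0, -1.0)]
shifted = [(c0 + 0.5, c1 + 0.5) for c0, c1 in clean]
print(cepstral_mean_subtraction(clean) == cepstral_mean_subtraction(shifted))  # True
```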
Likelihood Normalization
A normalization method for likelihood (similarity or distance) values that uses a likelihood ratio
has been proposed. The likelihood ratio is the ratio of the conditional probability of the observed
measurements of the utterance given the claimed identity is correct, to the conditional
probability of the observed measurements given the speaker is an impostor (normalization term).
Generally, a positive log-likelihood ratio indicates a valid claim, whereas a negative value
indicates an impostor. The likelihood ratio normalization approximates optimal scoring in the Bayes
sense.
This normalization method is, however, unrealistic because conditional probabilities must be
calculated for all the reference speakers, which requires large computational cost. Therefore, a set
of speakers, cohort speakers, who are representative of the population distribution near the
claimed speaker has been chosen for calculating the normalization term. Another way of
choosing the cohort speaker set is to use speakers who are typical of the general population. It
was reported that a randomly selected, gender-balanced background speaker population
outperformed a population near the claimed speaker.
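The cohort-based version of this normalization can be sketched as follows: the normalization term is the log of the average likelihood over the cohort, and the sign of the normalized score supports or rejects the claim. The log-likelihood values below are hypothetical:

```python
import math

def normalized_score(log_lik_claimed, cohort_log_liks):
    """Likelihood-ratio normalization sketch: claimed speaker's log-likelihood
    minus the log of the average cohort likelihood (normalization term).
    A positive result supports the claim; a negative one suggests an impostor."""
    cohort_term = math.log(
        sum(math.exp(l) for l in cohort_log_liks) / len(cohort_log_liks))
    return log_lik_claimed - cohort_term

print(normalized_score(-10.0, [-14.0, -13.0, -15.0]) > 0)  # True: beats cohort
print(normalized_score(-14.0, [-10.0, -11.0, -12.0]) > 0)  # False: like cohort
```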
A normalization method based on a posteriori probability has also been proposed. The difference
between the normalization method based on the likelihood ratio and that based on a posteriori
probability is whether or not the claimed speaker is included in the impostor speaker set for
normalization; the cohort speaker set in the likelihood-ratio-based method does not include the
claimed speaker, whereas the normalization term for the a posteriori-probability-based method
is calculated by using a set of speakers including the claimed speaker. Experimental results
indicate that both normalization methods almost equally improve speaker separability and
reduce the need for speaker-dependent or text-dependent thresholding, compared with scoring
using only the model of the claimed speaker.
A method in which the normalization term is approximated by the likelihood for a world model
representing the population in general has also been proposed. This method has an advantage in
that the computational cost for calculating the normalization term is much smaller than in the
original method, since it does not need to sum the likelihood values for cohort speakers. A method
based on tied-mixture HMMs in which the world model is made as a pooled mixture model
representing the parameter distribution for all the registered speakers has been proposed. The
use of a single background model for calculating the normalization term has become the
predominant approach used in speaker verification systems.
Since these normalization methods neglect absolute deviation between the claimed speaker's
model and the input speech, they cannot differentiate highly dissimilar speakers. It has been
reported that a multilayer network decision algorithm makes effective use of the relative and
absolute scores obtained from the matching algorithm.
A family of normalization techniques has been proposed, in which the scores are normalized by
subtracting the mean and then dividing by the standard deviation, both terms having been estimated
from the (pseudo-)impostor score distribution. Different possibilities are available for computing
the impostor score distribution: Znorm, Hnorm, Tnorm, Htnorm, Cnorm and Dnorm (Bimbot et
al., 2004). State-of-the-art text-independent speaker verification techniques combine one or
more parameterization-level normalization approaches (CMS, feature variance normalization,
feature warping, etc.) with world-model normalization and one or more score normalizations.
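Znorm, the simplest member of this family, can be sketched directly from its definition; the impostor scores below are made-up values:

```python
import numpy as np

def znorm(raw_score, impostor_scores):
    """Znorm: centre and scale a raw verification score using the mean
    and standard deviation of an impostor score distribution that was
    estimated offline against the claimed speaker's model."""
    imp = np.asarray(impostor_scores, dtype=float)
    return (raw_score - imp.mean()) / imp.std()

# A raw score far above the impostor distribution maps to a large value.
print(round(znorm(2.0, [-1.2, -0.8, -1.0, -1.1, -0.9]), 2))  # 21.21
```

After normalization, a single speaker-independent threshold can be applied, which is exactly the point of these techniques.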
References
Bimbot, F., Bonastre, J.-F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S.,
Merlin, T., Ortega-Garcia, J., Petrovska-Delacretaz, D. and Reynolds, D. A. (2004) A Tutorial
on Text-Independent Speaker Verification, EURASIP Journal on Applied Signal Processing,
pp. 430-451.
Fauve, B. G. B., Matrouf, D., Scheffer, N. and Bonastre, J.-F. (2007) State-of-the-Art
Performance in Text-Independent Speaker Verification through Open-Source Software,
IEEE Trans. Audio, Speech, and Language Process., 15 (7), pp. 1960-1968.
Furui, S. (1997) Recent Advances in Speaker Recognition, Proc. First Int. Conf. Audio- and
Video-based Biometric Person Authentication, Crans-Montana, Switzerland, pp. 237-252.
Furui, S. (2000) Digital Speech Processing, Synthesis, and Recognition, 2nd Edition, New
York: Marcel Dekker.
Yin, S.-C., Rose, R. and Kenny, P. (2007) A Joint Factor Analysis Approach to Progressive
Model Adaptation in Text-Independent Speaker Verification, IEEE Trans. Audio, Speech,
and Language Process., 15 (7), pp. 1999-2010.
Senior Lecturers: Charles (Pete) Bernardin, Peter A. Blakey, Paul Deignan, Nathan B.
Dodge, James Florence, Jung Lee, Randall E. Lehmann, P. K. Rajasekaran, Ricardo E.
Saad, William (Bill) Swartz, Marco Tacca
UT Dallas Affiliated Faculty: Larry P. Ammann, Leonidas Bleris, Yves J. Chabal, Bruce E.
Gnade, Matthew J. Goeckner, Robert D. Gregg, Jiyoung Kim, Moon J. Kim, David J. Lary, Yang
Liu, Robert L. Rennaker II, Mario A. Rotea, Mathukumalli Vidyasagar, Robert M. Wallace, Steve
Yurkovich
Objectives
The program leading to the MSEE degree provides intensive preparation for professional
practice in a broad spectrum of high-technology areas of electrical engineering. It is designed to
serve the needs of engineers who wish to continue their education. Courses are offered at a time
and location convenient for the student who is employed on a full-time basis.
The objective of the doctoral program in electrical engineering is to prepare individuals to perform
original, leading-edge research in the broad areas of communications and signal processing;
mixed-signal IC design; digital systems; microelectronics and nanoelectronics;
optics and photonics; optical communication devices and systems; power electronics and energy
systems; and wireless communications. Because of our strong collaborative programs with
Dallas-area high-technology companies, special emphasis is placed on preparation for research
and development positions in these high-technology industries.
Facilities
The Erik Jonsson School of Engineering and Computer Science has developed a state-of-the-art
information infrastructure consisting of a wireless network in all buildings and an extensive fiber-optic and copper Ethernet. Through the Texas Higher Education Network, students and faculty
have direct access to most major national and international networks. UT Dallas has an Internet2
connection. In addition, many personal computers and UNIX workstations are available for
student use.
The Engineering and Computer Science Building and the new Natural Science and Engineering
Research Laboratory provide extensive facilities for research in microelectronics,
telecommunications, and computer science. A Class 10000 microelectronics clean room facility,
including e-beam lithography, sputter deposition, PECVD, LPCVD, etch, ash and evaporation, is
available for student projects and research. The Plasma Applications and Science Laboratories
have state-of-the-art facilities for mass spectrometry, microwave interferometry, optical
spectroscopy, optical detection, in situ ellipsometry and FTIR spectroscopy. In addition, a
modified Gaseous Electronics Conference Reference Reactor has been installed for plasma
processing and particulate generation studies. Research in characterization and fabrication of
nanoscale materials and devices is performed in the Nanoelectronics Laboratory. The Optical
Communications Laboratory includes attenuators, optical power meters, lasers, APD/p-i-n
photodetectors, optical tables, and couplers and is available to support system level research in
optical communications. Tissue optics research is also supported in this laboratory. The Photonic
Testbed Laboratory supports research in photonics and optical communications with current-generation optical networking test equipment. The Electronic Materials Processing Laboratory
has extensive facilities for fabricating and characterizing semiconductor and optical devices. The
Photonic Devices and Systems Laboratory houses graduate research projects centered on
optical instrumentation and photonic integrated circuits.
The Renewable Energy and Vehicular Technology Laboratory (REVT-Lab) is equipped with
various sources of renewable energy such as wind and solar, a micro-grid formed by a network
of multi-port power electronic converters, a stationary plug-in hybrid vehicle testbed, a stationary
DFIG-based wind energy emulator, and a series of adjustable-speed motor drive technologies
including PMSM, SRM, and induction motor drives. All of the testbeds are equipped with digital
control, state-of-the-art measurement, and protection devices. The REVT Lab is also equipped
with a cold plasma chamber for hydrogen harvesting and with battery testing facilities. The main focus
of the REVT Lab is to improve the reliability and security of power-electronics-driven technologies
as applied to the utility and vehicular industries.
The Texas Analog Center of Excellence (TxACE) at The University of Texas at Dallas (UT
Dallas) has the mission of leading the country in analog research and education. TxACE
research seeks to create fundamental analog, mixed signal and RF design innovations in
integrated circuits and systems that improve energy efficiency, healthcare, and public safety and
security. The center is supported by Semiconductor Research Corporation, Texas Emerging
Technology Fund, Texas Instruments Inc., the UT System, and UT Dallas. TxACE is the largest
analog technology center in the world on the basis of funding and the number of principal
investigators. The center funds ~70 directed research projects led by ~65 principal and coprincipal investigators from 31 academic institutions including three international institutions.
The Multimedia Communications Laboratory has a dedicated network of PCs, Linux stations,
and multi-processor, high-performance workstations for analysis, design, and simulation of image
and video processing systems. The Signal and Image Processing (SIP) Laboratory has a
dedicated network of PCs equipped with digital cameras and signal processing hardware
platforms allowing the implementation of advanced image processing algorithms. The Statistical
Signal Processing Laboratory is dedicated to research in statistical and acoustic signal
processing for biomedical and non-biomedical applications. It is equipped with high-performance
computers and powerful textual and graphical software platforms to analyze advanced signal
processing methods, develop new algorithms, and perform system designs and simulations. The
Acoustic Research Laboratory provides a number of test-beds and associated equipment for signal
measurements, system modeling, real-time implementation and testing of algorithms related to
audio/acoustic/speech signal processing applications such as active noise control, speech
enhancement, dereverberation, echo cancellation, sensor arrays, psychoacoustic signal
processing, etc.
The Center for Robust Speech Systems (CRSS) is focused on a wide range of research in the
area of speech signal processing, speech and speaker recognition, speech/language technology,
and multi-modal signal processing involving facial/speech modalities. CRSS is affiliated with
HLTRI in the Erik Jonsson School, and collaborates extensively with faculty and programs across
UT Dallas on speech and language research. CRSS supports an extensive network of
workstations, as well as a High-Performance Compute Cluster with over 30TB of disk space and
a 420-CPU ROCS multi-processor cluster. The center is also equipped with several Texas
Instruments processors for real-time processing of speech signals, and two ASHA-certified sound
booths for perceptual/listening based studies and for speech data collection. CRSS supports
mobile speech interactive systems through the UT Drive program for in-vehicle driver-behavior
systems, and multi-modal based interaction systems via image-video-speech research.
The Sensing, Robotics, Vision, Control and Estimation (SeRViCE) Lab focuses on topics of
control and estimation with applications in robotics, autonomous vehicles and sensor
management. Primary expertise is in vision-based control and estimation and nonlinear control,
that is, using cameras as the primary sensor to control robots or other complex systems.
Robotics resources in the lab currently include two Pioneer 3-DX mobile robots from Mobile
Robots Inc. and a Stäubli TX90 robot manipulator with six degrees of freedom, a 7 kg nominal
payload, and torque-level control. Camera resources include multiple web cameras,
three high-quality FireWire color digital video cameras, and an 18 MP digital SLR camera. The
SeRViCE Lab also features general support equipment, including desktop and mobile
workstations, DLP projectors, power tools, hand tools, oscilloscopes, and other electronic
measurement equipment.
The Laboratory for Autonomous Robotics and Systems (LARS) focuses on the development of
novel control theory to support autonomous operation and teleoperation of general robotic systems. Active
research projects include: (a) human-in-the-loop multi-robot telemanipulation, (b) autonomous
networked robotics, and (c) control of bipedal walking robots. The LARS is equipped with a high-speed, high-resolution 8-camera Vicon motion capture system for general-purpose motion
tracking. The LARS possesses various mobile robots to support multi-robot research, including
six Gumstix-controlled iRobot Creates and a Quanser QBall quadrotor UAV. The LARS also
possesses various force-feedback user interface devices, including a Logitech force-feedback
joystick and driving wheel, and a Novint Falcon, a 3-translational-degree-of-freedom Delta-structure desktop haptic device.
The Broadband Communication Laboratory has design and modeling tools for fiber and wireless
transmission systems and networks, and all-optical packet routing and switching. The Advanced
Communications Technologies (ACT) Laboratory provides a design and evaluation environment
for the study of telecommunication systems and wireless and optical networks. ACT has facilities
for designing network hardware, software, components, and applications.
The Center for Systems, Communications, and Signal Processing, with the purpose of promoting
research and education in general communications, signal processing, control systems, medical
and biological systems, circuits and systems and related software, is located in the Erik Jonsson
School.
The Wireless Information Systems (WISLAB) and Antenna Measurement Laboratories have
wireless experimental equipment with a unique multiple antenna testbed to integrate and to
demonstrate radio functions (e.g., WiFi and WiMAX) under different frequency usage
characteristics. With the aid of the Antenna Measurement Lab located in the Waterview Science
and Technology Center (WSTC), the researchers can design, build, and test many types of
antennas.
The Quality of Life Technology Laboratory is a multidisciplinary engineering education, research,
and development laboratory aimed at improving people's quality of life through technological
advancements, innovations, and intelligent system designs. It has design, modeling, and
simulation tools for medical devices and systems.
The faculty of the Erik Jonsson School's Photonic Technology and Engineering Center (PhoTEC)
carry out research in enabling technologies for microelectronics and telecommunications. Current
research areas include nonlinear optics, Raman amplification in fibers, optical switching,
applications of optical lattice filters, microarrays, integrated optics, and optical networking.
In addition to the facilities on campus, cooperative arrangements have been established with
many local industries to make their facilities available to UT Dallas graduate engineering
students.
Admission Requirements
The university's general admission requirements are discussed on the Graduate
Admission page (catalog.utdallas.edu/2014/graduate/admission).
A student lacking undergraduate prerequisites for graduate courses in electrical engineering
must complete these prerequisites or receive approval from the graduate advisor and the course
instructor.
A diagnostic exam may be required. Specific admission requirements follow.
A student entering the MSEE program should meet the following guideline:
An undergraduate preparation equivalent to a baccalaureate degree in electrical engineering.
Degree Requirements
The university's general degree requirements are discussed on the Graduate Policies and
Procedures page (catalog.utdallas.edu/2014/graduate/policies/policy).
The MSEE requires a minimum of 33 semester credit hours.
All students must have an academic advisor and an approved degree plan. These are based
upon the student's choice of concentration (Biomedical Applications of Electrical Engineering;
Circuits and Systems; Communications; Control Systems; Digital Systems; Photonic Devices and
Systems; Power Electronics and Energy Systems; RF and Microwave Engineering; Signal
Processing; Solid State Devices and Micro Systems Fabrication). Courses taken without advisor
approval will not count toward the 33 semester credit hour requirement. Successful completion of
the approved course of studies leads to the MSEE degree.
The MSEE program has both a thesis and a non-thesis option. All part-time MSEE students will
be assigned initially to the non-thesis option. Those wishing to elect the thesis option may do so
by obtaining the approval of a faculty thesis supervisor. With the prior approval of an academic
advisor, non-thesis students may count no more than 3 semester credit hours of research or
individual instruction courses towards the 33 semester credit hour degree requirement.
All full-time, supported students are required to participate in the thesis option. The thesis option
requires nine semester credit hours of research (of which three must be thesis semester credit
hours), a written thesis submitted to the graduate school, and a formal public defense of the
thesis. The supervising committee administers this defense and is chosen in consultation with
the student's thesis advisor prior to enrolling for thesis credit. Research and thesis semester
credit hours cannot be counted in an MSEE degree plan unless a thesis is written and
successfully defended.
Concentrations
One of the nine concentrations listed below, subject to approval by a graduate advisor, must be
used to fulfill the requirements of the MSEE program. Students must achieve an overall GPA
(grade point average) of 3.0 or better, a GPA of 3.0 or better in their core MSEE classes, and a
grade of B- or better in all their core MSEE classes in order to satisfy their degree requirements.
One 5000 level electrical engineering course can be counted towards the graduate semester
credit hours.
It is highly recommended that students take an independent study course with an EE faculty
member that will be counted as one of the EE electives. The independent study course is
intended to gear the coursework towards one of the following research areas in the department:
biosensors, biomedical signal processing, bioinstrumentation, medical imaging, biomaterials, and
bio-applications in RF.
Communications
This curriculum emphasizes the application and theory of all phases of modern communications.
Each student electing this concentration must take four required courses (12 semester credit
hours).
Two of the courses are:
EESC 6349 Random Processes
Control Systems
This curriculum emphasizes methods to predict, estimate, and regulate the behavior of electrical,
mechanical, or other systems including robotics.
Each student electing this concentration must take four required courses (12 semester credit
hours).
Two of the courses are:
EECS 6331 Linear Systems
EESC 6349 Random Processes
The remaining two must be selected from:
EECS 6336 Nonlinear Systems
EEGR 6381 Computational Methods in Engineering
EESC 6343 Detection and Estimation Theory
EESC 6360 Digital Signal Processing I
EESC 6364 Pattern Recognition
EESC 7V85 Special Topics in Signal Processing
Approved electives must be taken to make a total of 33 semester credit hours.
Digital Systems
The goal of the curriculum is to educate students about issues arising in the design and analysis
of digital systems, an area relevant to a variety of high-technology industries. Because the
emphasis is on systems, coursework focuses on three areas: hardware design, software design,
and analysis and modeling.
Each student electing this concentration must take four required courses (12 semester credit
hours):
Two of the courses are:
EEDG 6301 Advanced Digital Logic
EEDG 6304 Computer Architecture
The remaining two must be selected from:
EECT 6325 VLSI Design
EEDG 6302 Microprocessor Systems
EEDG 6345 Engineering of Packet-Switched Networks
Approved electives must be taken to make a total of 33 semester credit hours.
Signal Processing
This curriculum emphasizes the application and theory of signal processing.
Each student electing this concentration must take four required courses (12 semester credit
hours).
Two of the courses are:
EESC 6349 Random Processes
EESC 6360 Digital Signal Processing I
The remaining two must be selected from:
Admission Requirements
Degree Requirements
The university's general degree requirements are discussed on the Graduate Policies and
Procedures page (catalog.utdallas.edu/2014/graduate/policies/policy).
Each program for doctoral study is individually tailored to the student's background and research
objectives by the student's supervisory committee. The program will require a minimum of 75
semester credit hours beyond the baccalaureate degree. These credits must include at least 30
semester credit hours of graduate level courses beyond the baccalaureate level in the major
concentration. All PhD students must demonstrate competence in the master's level core
courses in their research area. All students must have an academic advisor and an approved
plan of study.
Also required are:
A research oriented oral qualifying examination (QE) demonstrating competence in the
PhD candidate's research area. A student must make an oral presentation based on a
review of 2 to 4 papers followed by a question-answer session. Admission to PhD
candidacy is based on two criteria: Graded performance in the QE and GPA in
graduate-level organized courses. A student entering the PhD program with an MSEE
must pass this exam within 3 long semesters, and a student entering without an MSEE
must pass this exam within 4 long semesters. A student has at most two attempts at
this qualifying exam. The exam will be given during the fall and spring semesters.
A comprehensive exam consisting of: a written dissertation proposal, a public seminar,
and a private oral examination conducted by the PhD candidate's supervising
committee. At least half of the supervising committee must consist of core EE faculty
members, and it must be chaired or co-chaired by an EE faculty member.
Completion of a major research project culminating in a dissertation demonstrating an
original contribution to scientific knowledge and engineering practice. The dissertation
will be defended publicly. The rules for this defense are specified by the Office of the
Dean of Graduate Studies. Neither a foreign language nor a minor is required for the
PhD. However, the student's supervisory committee may impose these or other
requirements that it feels are necessary and appropriate to the student's degree
program.
Research
The principal concentration areas for the MSEE program are: Biomedical Applications of
Electrical Engineering; Circuits and Systems; Communications; Control Systems; Digital
Systems; Photonic Devices and Systems; Power Electronics and Energy Systems; RF and
Microwave Engineering; Signal Processing; Solid State Devices and Micro Systems Fabrication.
Besides courses required for each concentration, a comprehensive set of electives is available in
each area.
Doctoral level research opportunities include: VLSI design and test, analog and mixed-signal
circuits and systems, RF and microwave engineering, biomedical applications of electrical
engineering, power electronics, renewable energy, motors and drives, vehicular technology,
computer architecture, embedded systems, computer aided design (CAD), ASIC design
methodologies, high-speed system-on-chip design and test, reconfigurable computing, network
processor design, interconnection networks, nonlinear signal processing, smart antennas and
array processing, statistical and adaptive signal processing, multimedia signal processing, image
processing, real-time imaging, medical image analysis, pattern recognition, speech processing
and recognition, control theory, robotics, digital communications, modulation and coding,
electromagnetic-wave propagation, diffractive structures, fiber and integrated photonics,
nonlinear optics, optical transmission systems, all-optical networks, optical investigation of
material properties (reflectometry and ellipsometry), optical instrumentation, lasers, quantum-well
optical devices, theory and experiments in semiconductor-heterostructure devices, plasma
deposition and etching, nanoelectronics, wireless communication, network protocols and
evaluation, mobile computing and networking, and optical networking.
Interdisciplinary Opportunities: Continuing with the established tradition of research at UT Dallas,
the Electrical Engineering Program encourages students to interact with researchers in the
strong basic sciences and mathematics programs. Cross-disciplinary collaborations have been established
with the Chemistry, Mathematics, and Physics programs of the School of Natural Sciences and
with faculty in the School of Behavioral and Brain Sciences.
Text dependent (constrained): the subject has to say a fixed phrase (password) which is the same for
enrollment and for verification, or the subject is prompted by the system to repeat a randomly generated
phrase.
Text independent (unconstrained): recognition is based on whatever words the subject says.
Text-dependent recognition has better performance for subjects who cooperate, but text-independent recognition
is more flexible in that it can be used with non-cooperating individuals.
Basically, identification or authentication using speaker recognition consists of four steps:
1. voice recording
2. feature extraction
3. pattern matching
4. decision (accept / reject)
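The four steps can be sketched as one skeleton function; the feature extractor and matcher below are toy stand-ins (hypothetical, not any particular system's components):

```python
import numpy as np

def verify(recording, claimed_model, extract_features, match, threshold):
    """Toy end-to-end pipeline: recording -> features -> match -> decision.
    The components are injected so the skeleton stays system-agnostic."""
    features = extract_features(recording)   # step 2: feature extraction
    score = match(features, claimed_model)   # step 3: pattern matching
    return score >= threshold                # step 4: accept / reject

def extract(x):
    # Stand-in feature extractor: mean energy per 160-sample frame.
    return np.square(x).reshape(-1, 160).mean(axis=1)

def match(features, model):
    # Stand-in matcher: negative distance to a stored mean-energy value.
    return -np.abs(features.mean() - model)

recording = np.ones(1600) * 0.5              # step 1: a fake "recording"
print(verify(recording, 0.25, extract, match, -0.05))  # True
```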
Visualization of the acoustic pattern of the voice: loudness of the input vs. time.
Depending on the application, a voice recording is made using a local, dedicated system or remotely (e.g. by
telephone). The acoustic patterns of speech can be visualized as loudness or frequency vs. time. Speaker recognition
systems analyze the frequency as well as attributes such as dynamics, pitch, duration and loudness of the signal.
During feature extraction the voice recording is cut into windows of equal length; these cut-out samples are
called frames, which are often 10 to 30 ms long.
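A minimal framing routine, assuming a 1-D signal and the common choice of overlapping windows (the hop between windows shorter than the frame length):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Cut a 1-D signal into fixed-length frames.
    frame_ms is the window length (10-30 ms is typical); hop_ms is the
    shift between successive windows, so neighbouring frames overlap."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])

# One second at 16 kHz -> 25 ms frames (400 samples) every 10 ms.
frames = frame_signal(np.zeros(16000), 16000)
print(frames.shape)  # (98, 400)
```

Each frame is then typically converted to a compact feature vector (e.g. cepstral coefficients) before matching.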
Pattern matching is the actual comparison of the extracted frames with known speaker models (or templates); this
results in a matching score which quantifies the similarity between the voice recording and a known speaker model.
Pattern matching is often based on Hidden Markov Models (HMMs), a statistical model which takes into account the
underlying variations and temporal changes of the acoustic pattern.
Alternatively, Dynamic Time Warping (DTW) is used; this algorithm measures the similarity between two sequences that vary
in speed or time, even if this variation is non-linear, such as when the speaking speed changes during the sequence.
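A textbook DTW sketch over 1-D sequences, using the absolute difference as the local cost (real systems compare frame-level feature vectors instead):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences.
    Finds the minimum-cost alignment, so sequences that differ only
    in (possibly non-linear) speed still score as similar."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# A stretched copy of the same contour aligns perfectly.
slow = [0, 0, 1, 1, 2, 2, 3, 3]
fast = [0, 1, 2, 3]
print(dtw_distance(slow, fast))  # 0.0
```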
Some systems use "anti-speaker" techniques such as cohort models.
Nuance is a US-based company and a major player in speech recognition; it has also developed
a product for speaker recognition called Nuance Verifier.
Voice Trust is a German company specialized in speaker recognition solutions.
Obviously, for people who are mute or who have problems with their voice due to severe
illness, this biometric solution is not usable.
Uniqueness
Permanence
An issue with speaker recognition is that the voice changes with ageing, and is also
Collectability
Voice recordings are easy to obtain and do not require expensive hardware. The real
advantage of voice recognition is that it can be done over telephone lines or using
computer microphones, with variable recording and transmission quality. Pattern
matching algorithms must be able to handle ambient noise and the differing quality of the
recordings.
Acceptability
Circumvention
A major issue with speaker recognition is spoofing using voice recordings. The risk of
spoofing with voice recordings can be mitigated if the system requests a randomly
generated phrase to be repeated; an impostor cannot anticipate the random phrase
that will be required and therefore cannot attempt a playback spoofing attack.
Performance
Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the
process of accepting or rejecting the identity claim of a speaker. Most applications
in which a voice is used as the key to confirm the identity of a speaker are
classified as speaker verification.
A recorded voice of a registered speaker saying the key words or sentences can be accepted as
the registered speaker. To cope with this problem, there are methods in which a
small set of words, such as digits, are used as key words and each user is prompted
to utter a given sequence of key words that is randomly chosen every time the
system is used. Yet even this method is not completely reliable, since it can be
deceived with advanced electronic recording equipment that can reproduce key
words in a requested order. Therefore, a text-prompted (machine-driven text-dependent) speaker recognition method has recently been proposed by [MF93b].
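The text-prompted idea can be sketched as follows; the digit vocabulary and helper names are illustrative, and the speaker score is assumed to come from a separate verification model:

```python
import random

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def make_prompt(n_words=4, rng=None):
    """Pick a fresh random key-word sequence for each session, so a
    pre-recorded utterance is unlikely to match the prompted text."""
    rng = rng or random.Random()
    return [rng.choice(DIGITS) for _ in range(n_words)]

def accept(prompted, recognized, speaker_score, threshold):
    """Accept only if the prompted words were recognized AND the
    speaker-verification score clears the decision threshold."""
    return recognized == prompted and speaker_score >= threshold

prompt = make_prompt(rng=random.Random(7))
print(accept(prompt, prompt, speaker_score=1.3, threshold=0.5))  # True
print(accept(prompt, prompt, speaker_score=0.2, threshold=0.5))  # False
```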
There remain problems which must be solved in the future. The reader is referred to the
following papers for more general reviews:
[Fur86a,Fur89,Fur91,Fur94,O'S86,RS91].