
Speaker Recognition using Vector Quantization and Gaussian Mixture Models

Sharat Chikkerur sharat@mit.edu December 6, 2006

Abstract
Speaker recognition is the process of identifying a person from his or her voice. It is a challenging task to separate the speaker identity information (who is speaking) from the speech content itself (what is being said). Speaker recognition has several useful applications, including biometric authentication and intuitive human-computer interaction. In this report, we critically compare two popular approaches to speaker recognition: vector quantization [4] and Gaussian mixture modeling [3]. We compare the feature extraction, the complexity of model building and recognition, and the accuracy of the two approaches. Existing work in the literature selects a model either from language priors (such as the number of phonemes in the language) or through a cross-validation process. In this report, we attempt to perform automatic model selection based on the Bayesian information criterion (BIC) for both the VQ- and GMM-based approaches, and compare their performance with that obtained by brute-force model selection. We perform a quantitative evaluation over a subset of the TIMIT speech database using 550 utterances from 110 distinct speakers.

1 Introduction

Speaker recognition is a very popular form of biometric authentication due to its ease of use, accuracy and ability to perform authentication over phone lines. It also has useful applications in building intuitive human-computer interfaces. It is a common mistake to conflate speech recognition and speaker recognition. In speech recognition, the problem is to identify spoken words irrespective of the person speaking them; the aim is to capture the similarities in the words spoken. In speaker recognition, on the other hand, we try to distinguish between speakers based on how they speak those words, enhancing the differences in the way the words are spoken. These differences arise from the shape of a person's vocal tract and from the speech habits and accent acquired over the years. The problem is referred to in the literature by various terms such as speaker identification, talker identification and talker authentication [1]. Speaker recognition is a broad problem that includes both speaker verification and speaker identification. In speaker verification, the user claims an identity and the claim is verified by means of his voice. In speaker identification, the identity of the person is not known: given a sample of speech, it has to be matched against the speech samples already in the database in order to come up with a possible identity. It must be noted that the user may or may not actually exist in the database; this is known as an open-set identification problem. Further, speaker recognition can be classified as text-dependent or text-independent, depending on whether the input speech is constrained or not. In this report, we implement and study two forms of closed-set, text-independent speaker identification systems.

Table 1: Biometrics

Physical:   Fingerprints, Face, Hand geometry, Iris
Behavioral: Speech, Signature, Gait
Biological: DNA, Body odor

Figure 1: Generic speaker recognition system

2 Overview

Figure 1 shows a generic speaker recognition system. At the time of enrollment, a speech sample is acquired from the user in a controlled and supervised manner. The accuracy of the system relies upon the length and signal-to-noise ratio of this signal. The system then preprocesses the speech signal to perform silence removal, loudness equalization and other signal conditioning operations. Feature extraction is then performed on the processed signal in order to extract speaker-discriminatory information from it. This discriminatory information forms the speaker model, which can be stochastic, statistical or simply a template [1]. The model must have high inter-speaker variability and low intra-speaker variability. At the time of verification, a speech sample is acquired from the user. This sample may be of short duration or taken under uncontrolled conditions. The claimed identity is known in the case of verification and is not known in the case of identification. The recognition system has to extract the features from this sample and compare them against the models stored beforehand. In the following sections, we discuss the process of feature extraction and speaker modeling in more detail.

3 Speaker Modeling and Recognition

Figure 2 shows a simplified signal model of the human speech system; a more detailed model is discussed in [1]. The excitation source may be periodic, as in the case of voiced speech (e.g. /a/, /e/), or random, as in the case of unvoiced speech (/s/, /sh/). The vocal tract is modeled as a time-variant filter that modulates the signal from either excitation source. The purpose of the speaker recognition system is to accurately model the filter characteristics. Speech is produced by the modulation of airflow by the vocal tract, so the vocal tract shape is the primary distinguishing parameter that affects the speech of a person. The vocal tract modifies the spectral content of the speech signal as it passes through it. For each individual, the vocal tract resonates at different frequencies; these frequencies are called formants. The shape of the vocal tract can be estimated or parameterized based on the spectral characteristics and the location of the formants. The human speech system is driven by an excitation source, namely airflow from the lungs; the path of the airflow through the vocal tract modulates the signal to a significant degree, and the extent of modulation depends on the speaker.

Speaker modeling involves the representation of an utterance as a sequence of feature vectors. The model must have high inter-speaker variability and low intra-speaker variability. Utterances spoken by the same person at different times result in similar yet different sequences of feature vectors. The purpose of this modeling is to capture these variations in the extracted set of features.

Figure 2: Physical and computational models for speech production

There are two types of speech models used in speaker recognition: stochastic models and template models. The stochastic model treats the speech production process as a parametric random process and assumes that the parameters of the underlying stochastic process can be estimated in a precise, well-defined manner. The hidden Markov model is a very popular stochastic model. The template model attempts to model the speech production process in a non-parametric manner by retaining a number of sequences of feature vectors derived from multiple utterances of the same word by the same person. This project uses the template model with normalized cepstral coefficients.

3.1 Feature Extraction

The digitized sound signal is first preprocessed: silence zones are removed by means of a short-time energy calculation. Segments of 15 ms are used for this purpose; a segment whose energy is less than some threshold relative to the average energy of the entire signal is discarded. A high-emphasis filter H(z) = 1 - 0.95 z^{-1} is applied to the speech signal. The speech signal is then divided into analysis frames over which the signal can be assumed to be stationary: a 15 ms Hamming window is applied to the emphasized speech every 7.5 ms. Each frame of speech is then represented using mel-frequency cepstral coefficients. This representation is popular in both the speech and speaker recognition communities. It is obtained by treating the spectrum of the speech sample itself as a signal and obtaining a compact representation of it using another transform (the cosine transform in this case). Cepstral features capture the gross shape of the spectrum, which characterizes the shape of the vocal tract and hence the user [2]. The mel-frequency transformation warps the frequency axis to a scale that is closer to human perception. These two steps are combined to derive the mel-frequency cepstral coefficients, as shown in Figure 3.

Figure 3: Process of extracting cepstral features
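To make the pipeline concrete, the following Python sketch mirrors the steps above: silence removal by short-time energy, the high-emphasis filter, 15 ms Hamming windows advanced every 7.5 ms, and 12 cepstral coefficients per frame. It assumes the librosa library for the MFCC computation; the function name, the 0.12 energy threshold and the mel-band count are illustrative, not taken from the original implementation.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

def extract_features(signal, sr=16000, frame_ms=15, hop_ms=7.5,
                     n_coeffs=12, energy_threshold=0.12):
    """Silence removal + pre-emphasis + 12 MFCCs per frame (a sketch)."""
    frame = int(sr * frame_ms / 1000)   # 15 ms -> 240 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)       # 7.5 ms -> 120 samples

    # Short-time energy based silence removal: drop 15 ms segments whose
    # energy falls below a threshold relative to the signal average.
    segments = [signal[i:i + frame] for i in range(0, len(signal) - frame, frame)]
    energies = np.array([np.sum(s ** 2) for s in segments])
    voiced = [s for s, e in zip(segments, energies)
              if e > energy_threshold * energies.mean()]
    speech = np.concatenate(voiced)

    # High-emphasis filter H(z) = 1 - 0.95 z^-1
    emphasized = np.append(speech[0], speech[1:] - 0.95 * speech[:-1])

    # 12 mel-frequency cepstral coefficients per Hamming-windowed frame.
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=n_coeffs,
                                n_fft=frame, hop_length=hop,
                                window='hamming', n_mels=40)
    return mfcc.T  # one 12-dimensional feature vector per frame
```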

3.2 Model Building

During training, the speaker is asked to speak several sentences of unconstrained speech (3 sentences in our case). After feature extraction, each 30 ms signal frame is described using 12 cepstral coefficients. These coefficients roughly encode a description of the sound/phoneme contained in the frame. However, even this compact representation is not convenient, since the number of features depends on the length of the training speech. In order to convert this variable-length representation to a fixed-length one, only the gross statistics of the features are stored in lieu of the entire collection. In this report, we describe two ways of capturing the distribution of these speech features.

3.2.1 Vector quantization

Given a sequence of training feature vectors x_1, x_2, x_3, \ldots, x_N, x_i \in R^n, we partition the space R^n into M distinct regions S_1, S_2, \ldots, S_M with centers c_1, c_2, \ldots, c_M, such that the average distortion D is minimized. The distortion D is given by

    D = \frac{1}{N} \sum_{i=1}^{N} \min_j d(x_i, c_j)    (1)

K-means is one algorithm for solving this problem, where the c_j represent the cluster centers; other algorithms include LBG [4] and related clustering methods. During recognition, the input speech is used to extract a sequence of features x_1, x_2, \ldots, x_L, whose length need not be the same as that of the training set. For each of these feature vectors, the distance to the closest cluster center is computed and accumulated over the entire sequence. The similarity to model i, with centers (c^i_1, c^i_2, \ldots, c^i_M), is given by the average distance

    D_i = \frac{1}{L} \sum_{l=1}^{L} \min_j d(x_l, c^i_j)    (2)

The identity of the speaker is established as the one corresponding to the model with the least distance.
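A minimal sketch of this scheme in Python, using scipy's k-means and vector-quantization routines; the codebook size and helper names are illustrative:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq  # assumed available

def train_codebook(features, M=64):
    """Learn an M-entry codebook (the centers c_j of Eq. 1) from a
    speaker's training feature vectors (one row per frame)."""
    codebook, _ = kmeans(features.astype(float), M)
    return codebook

def avg_distortion(features, codebook):
    """Average distortion D_i of Eq. 2: mean distance from each test
    vector to its nearest codebook entry."""
    _, dists = vq(features, codebook)
    return dists.mean()

# Identification: the speaker whose codebook yields the smallest average
# distortion over the test utterance, e.g.
#   best = min(codebooks, key=lambda s: avg_distortion(X_test, codebooks[s]))
```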

3.2.2 Gaussian mixture model

In this method, the distribution of the feature vectors x is modeled explicitly using a mixture of M Gaussians,

    p(x) = \sum_{i=1}^{M} p_i b_i(x)    (3)

    b_i(x) = \frac{1}{|2\pi\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)    (4)

Here \mu_i and \Sigma_i represent the mean and covariance of the i-th mixture. Given the training data x_1, x_2, \ldots, x_N and the number of mixtures M, the parameters \mu_i, \Sigma_i, p_i are learnt using expectation maximization. During recognition, the input speech is again used to extract a sequence of features x_1, x_2, \ldots, x_L. The match between the given sequence and a model is obtained by computing the log-likelihood of the sequence under the model. The identity of the speaker is assigned based on the model that provides the highest likelihood of the observed data.
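The corresponding sketch for the GMM approach, assuming scikit-learn's GaussianMixture (which fits Eqs. 3-4 by expectation maximization); the diagonal-covariance choice and the mixture count are illustrative assumptions:

```python
from sklearn.mixture import GaussianMixture  # assumed available

def train_gmm(features, M=32):
    """Fit an M-component mixture (Eq. 3) to a speaker's training
    feature vectors via EM."""
    return GaussianMixture(n_components=M, covariance_type='diag').fit(features)

def avg_log_likelihood(features, gmm):
    """Average per-frame log p(x | model); higher means a better match."""
    return gmm.score(features)

# Identification: the speaker whose model maximizes the likelihood of
# the observed features, e.g.
#   best = max(models, key=lambda s: avg_log_likelihood(X_test, models[s]))
```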

3.3 Model Selection

Both the vector quantization and Gaussian mixture models assume that the number of cluster centers/mixtures is given during training. This is a classical model selection problem. In the case of VQ, a large number of clusters decreases the average distortion due to quantization; in the case of GMM, a large number of mixtures increases the likelihood of the observed data. However, the number of clusters/mixtures cannot be increased indefinitely, since this leads to poor generalization ability: simpler models provide better generalization than more complex models. Thus, there is a conflicting trade-off between generalization and the ability to fit the training data. Existing methods in the literature solve this problem by determining the number of clusters/mixtures M through a cross-validation set, heuristics, or prior knowledge about how the data is generated. Here, we try to determine the number of clusters by jointly optimizing the log likelihood and the complexity of the clustering/mixture solution. We rely upon the Bayesian information criterion (BIC) to achieve this: we select the model that minimizes the score

    BIC = -2L + k \log(n)                  , for GMM    (5)
    BIC = n \log(RSS/n) + k \log(n)        , for VQ     (6)

where L is the log likelihood of the entire data, n is the number of points, k is the number of free parameters in the model, and RSS is the residual sum of squares of the VQ solution.
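A sketch of BIC-driven order selection under these definitions. For the GMM case it uses scikit-learn's built-in .bic() method (which computes -2L + k log n); for VQ, Eq. 6 is computed directly, taking k = Md free parameters for M centers in R^d. The candidate model orders are an illustrative assumption:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq
from sklearn.mixture import GaussianMixture

CANDIDATES = (2, 4, 8, 16, 32, 64)  # illustrative sweep of model orders

def select_gmm_order(X):
    """Return the mixture count minimizing the BIC score of Eq. 5."""
    scores = {M: GaussianMixture(n_components=M, covariance_type='diag')
                 .fit(X).bic(X) for M in CANDIDATES}
    return min(scores, key=scores.get)

def select_vq_order(X):
    """Return the codebook size minimizing the least-squares BIC of Eq. 6."""
    n, d = X.shape
    scores = {}
    for M in CANDIDATES:
        codebook, _ = kmeans(X.astype(float), M)
        _, dists = vq(X, codebook)
        rss = np.sum(dists ** 2)                        # residual sum of squares
        scores[M] = n * np.log(rss / n) + M * d * np.log(n)
    return min(scores, key=scores.get)
```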

4 Evaluation

4.1 Dataset

We performed our evaluation on the TIMIT speech database (http://www.ldc.upenn.edu/). The TIMIT corpus of read speech was designed to provide speech data for the development and evaluation of automatic speech recognition systems. However, the large number of distinct speakers it contains also makes it suitable for evaluating speaker recognition systems. TIMIT contains a total of 6300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. Out of this large set, we chose 550 utterances from 110 distinct speakers to evaluate our system.

4.2 Results

For each user, three utterances were randomly chosen as the training set and the two remaining utterances were used as the test set. A positive identification was said to occur only when the minimum distance belonged to the correct user. Note that random guessing would produce less than 1% accuracy, so this multi-class categorization task is a hard one. The results of our evaluation are shown in Figure 4.

Figure 4: Evaluation results

4.3 Discussion

We can make the following observations about the results:

1. Both VQ and GMM perform far above chance (< 1%), indicating that the cepstral features capture speaker-specific information.

2. The accuracy of VQ increases with the number of cluster centers, whereas GMM seems more robust and stable with respect to model selection.

3. Performing automatic model selection gives favorable results in both the VQ and GMM cases, but does not necessarily perform better than brute-force parameter selection.
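The evaluation protocol can be summarized by the following sketch, which works for either model family. The helper names are hypothetical: train_fn and score_fn stand for the training and scoring functions sketched earlier, and lower_is_better selects between VQ distance (minimized) and GMM likelihood (maximized).

```python
import random
import numpy as np

def evaluate(features_by_speaker, train_fn, score_fn, lower_is_better=True):
    """Closed-set identification accuracy: for each speaker, 3 random
    utterances enroll the model and the remaining 2 serve as tests."""
    models, tests = {}, []
    for spk, utts in features_by_speaker.items():
        utts = list(utts)
        random.shuffle(utts)
        models[spk] = train_fn(np.vstack(utts[:3]))  # pool 3 training utterances
        tests += [(spk, u) for u in utts[3:]]

    pick = min if lower_is_better else max
    correct = sum(
        pick(models, key=lambda s: score_fn(x, models[s])) == spk
        for spk, x in tests
    )
    return correct / len(tests)
```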

5 Conclusion

We presented an introduction to the problem of speaker recognition and discussed speaker modeling and recognition. We presented two separate implementations of a speaker identification system, using mel-cepstral coefficients as the feature representation and VQ and GMM as classifiers. We also explored the use of BIC-based model selection in both of these implementations and found that it does not necessarily give better performance. Overall, the results show that probabilistic modeling methods such as Gaussian mixture models are more robust to the choice of model parameters than non-parametric methods such as VQ.


References
[1] J. P. Campbell. Speaker recognition: A tutorial. Proceedings of the IEEE, 85(9), 1997.

[2] L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Prentice Hall International, 1978.

[3] D. A. Reynolds and R. C. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1), 1995.

[4] F. K. Soong, A. E. Rosenberg, B. H. Juang, and L. R. Rabiner. A vector quantization approach to speaker recognition. AT&T Technical Journal, 66(2), 1987.
