
Final Report on Speech Recognition Project
Ceren Burak Da
040100531
Introduction
This project aims at the design of the pre-processing, clustering, and classifier blocks of a speech recognition system. The computations are implemented in C/C++ by the author herself, and the visualization materials are generated in MATLAB. The documentation of the code is given in the appendix.
Pre-Processing Block
1. Silence is trimmed from the recording.
2. RMS normalization is applied.
3. Each frame is Hanning windowed.
4. The FFT of each windowed frame is taken.
5. The Hertz scale is mapped onto the Mel scale.
6. Triangular Mel filters are applied to the spectrum.
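The windowing and Mel-mapping steps above can be sketched in C++ as follows. This is a minimal illustration, not the project code itself: the function names are illustrative, and the 2595·log10(1 + f/700) form of the Mel mapping is one common convention for the scale of Stevens et al. [2].

```cpp
#include <cmath>
#include <vector>

// Hanning window of length N, applied to each frame before the FFT.
std::vector<double> hanning(int N) {
    const double PI = std::acos(-1.0);
    std::vector<double> w(N);
    for (int n = 0; n < N; ++n)
        w[n] = 0.5 - 0.5 * std::cos(2.0 * PI * n / (N - 1));
    return w;
}

// Hertz <-> Mel mapping (one common parameterization of the Mel scale).
double hz_to_mel(double f) { return 2595.0 * std::log10(1.0 + f / 700.0); }
double mel_to_hz(double m) { return 700.0 * (std::pow(10.0, m / 2595.0) - 1.0); }
```

The triangular Mel filters would then be laid out with edges equally spaced on the Mel axis and mapped back to Hertz with `mel_to_hz`.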

Cepstrum Analysis and Homomorphic Deconvolution
Cepstrum analysis is a nonlinear signal-processing technique that is useful in speech processing and recognition applications. Bogert, Healy, and Tukey defined the cepstrum and quefrency in 1963; Oppenheim (1964) defined homomorphic systems. The transformation of a signal into its cepstrum is in fact a homomorphic transformation that maps convolution into addition.
Let us have a sampled signal x[n] that is composed of the sum of a signal v[n] and an echo (a shifted and scaled copy) of it:

x[n] = v[n] + a v[n - n0] = v[n] * (δ[n] + a δ[n - n0])

Since convolution in the time domain corresponds to multiplication in the frequency domain,

X(e^{jω}) = V(e^{jω}) (1 + a e^{-jω n0})

Take the magnitude of both sides:

|X(e^{jω})| = |V(e^{jω})| |1 + a e^{-jω n0}|

The nonlinear technique applied in finding the cepstrum is the logarithm, so take the logarithm of each side. Since the logarithm of a product is the sum of the logarithms of its factors,

log|X(e^{jω})| = log|V(e^{jω})| + log|1 + a e^{-jω n0}|

Define C_x(e^{jω}) = log|X(e^{jω})|. To go back to the time domain, we use the inverse DTFT. Finally, one obtains the following quefrency-domain equation:

c_x[n] = (1/2π) ∫_{-π}^{π} log|X(e^{jω})| e^{jωn} dω = c_v[n] + c_e[n]

where c_x[n] is the cepstrum of x[n], and the echo term c_e[n] appears as peaks at multiples of n0.
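The real-cepstrum computation described above can be sketched as follows. A naive O(N²) DFT is used here for clarity (the project code would use an FFT), and the small epsilon guarding log(0) is an implementation choice.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Real cepstrum: c[n] = IDFT{ log|DFT{x}| }. Since log|X(k)| is real and
// even for real x[n], the inverse transform reduces to a cosine sum.
std::vector<double> real_cepstrum(const std::vector<double>& x) {
    const int N = (int)x.size();
    const double PI = std::acos(-1.0);
    std::vector<double> logmag(N);
    for (int k = 0; k < N; ++k) {
        std::complex<double> X(0.0, 0.0);
        for (int n = 0; n < N; ++n)
            X += x[n] * std::polar(1.0, -2.0 * PI * k * n / N);
        logmag[k] = std::log(std::abs(X) + 1e-12);  // epsilon guards log(0)
    }
    std::vector<double> c(N, 0.0);
    for (int n = 0; n < N; ++n) {
        for (int k = 0; k < N; ++k)
            c[n] += logmag[k] * std::cos(2.0 * PI * k * n / N);
        c[n] /= N;  // inverse DFT of a real, even spectrum
    }
    return c;
}
```

Applied to a unit impulse plus a scaled copy at lag n0, the output shows the expected cepstral peak at quefrency n0.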
Speech Production Model based on
Cepstrum Analysis
Voiced sounds are produced by exciting the vocal
tract with quasi-periodic pulses of air flow caused by
the opening and closing of the glottis.
Fricative sounds are produced by forming a constriction somewhere in the vocal tract and forcing air through it so that turbulence is created, thereby producing a noise-like excitation.
Plosive sounds are produced by completely closing off the vocal tract, building up pressure behind the closure, and then suddenly releasing it.
Figure 17: Discrete-time speech production model; picture courtesy of Oppenheim and Schafer, Discrete-Time Signal Processing [5].
Parameters in the model
1. The coefficients of V(z), the mathematical representation of the vocal tract, which is simply a general IIR filter; the locations of its poles and zeros change the sound.
2. The mode of excitation of the vocal-tract system: a periodic impulse train or random noise.
3. The amplitude of the excitation signal.
4. The pitch period of the excitation for voiced speech, i.e., the reciprocal of the fundamental frequency of the voiced sound.
Let us assume that the model is valid and fixed over a short time period of about 10 ms, so we can apply cepstrum analysis to a short segment of length L (= 1024) samples. A window w[n] is applied to the segment so that it tapers smoothly to zero at both ends; therefore, the input to the homomorphic system is

x[n] = w[n] (p[n] * v[n])

If we further assume that w[n] varies slowly with respect to the variations of v[n], this reduces to

x[n] ≈ (w[n] p[n]) * v[n] = p_w[n] * v[n]

If p[n] is a train of impulses with period N_p,

p[n] = Σ_k δ[n - k N_p]

then p_w[n] is a train of impulses weighted by the window. By applying cepstrum analysis, we obtain

c_x[n] ≈ c_{p_w}[n] + c_v[n]

where c_{p_w}[n] contributes peaks at quefrencies that are multiples of the pitch period N_p.
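Since the window-weighted impulse train shows up as cepstral peaks at multiples of the pitch period, the pitch can be estimated by peak-picking over a quefrency range. A minimal sketch; the function name and search bounds are illustrative:

```cpp
#include <vector>

// Estimate the pitch period as the quefrency of the largest cepstral peak
// inside [min_lag, max_lag]. The lower bound excludes the low-quefrency
// vocal-tract part of the cepstrum (e.g., lags below ~2.5 ms in samples).
int pitch_period(const std::vector<double>& c, int min_lag, int max_lag) {
    int best = min_lag;
    for (int n = min_lag; n <= max_lag && n < (int)c.size(); ++n)
        if (c[n] > c[best]) best = n;
    return best;
}
```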
MFCC and Delta Coefficients Calculation
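Delta coefficients are commonly computed with the regression formula d_t = Σ_{n=1}^{M} n (c_{t+n} - c_{t-n}) / (2 Σ_{n=1}^{M} n²) applied to each cepstral coefficient's trajectory over the frames. A sketch under the assumption of a window half-width M = 2 and edge clamping:

```cpp
#include <algorithm>
#include <vector>

// Delta coefficients of one cepstral coefficient's trajectory c[t] over the
// frames, using the standard regression formula. Frame indices are clamped
// at the segment edges; the half-width M = 2 is an assumed default.
std::vector<double> delta(const std::vector<double>& c, int M = 2) {
    const int T = (int)c.size();
    double denom = 0.0;
    for (int n = 1; n <= M; ++n) denom += 2.0 * n * n;
    std::vector<double> d(T, 0.0);
    for (int t = 0; t < T; ++t) {
        double num = 0.0;
        for (int n = 1; n <= M; ++n) {
            int tp = std::min(T - 1, t + n);  // clamp at the edges
            int tm = std::max(0, t - n);
            num += n * (c[tp] - c[tm]);
        }
        d[t] = num / denom;
    }
    return d;
}
```

On a linear trajectory the interior deltas equal the slope, which is a quick sanity check for the formula.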

Clustering and Classification
K-means clustering is applied to each training file, and the resulting clusters are used to generate the confusion matrices and tables.
KNN (k-nearest-neighbour) classification is then applied to recognize a set of test words.
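The KNN step can be sketched as a Euclidean-distance majority vote. All names here are illustrative, and the container layout (one feature vector per training row, with a parallel label array) is an assumption about how the data is stored:

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Classify a query feature vector by a majority vote among its k nearest
// training vectors (squared Euclidean distance preserves the ordering).
int knn_classify(const std::vector<std::vector<double>>& train,
                 const std::vector<int>& labels,
                 const std::vector<double>& query, int k) {
    std::vector<std::pair<double, int>> dist;  // (squared distance, label)
    for (int i = 0; i < (int)train.size(); ++i) {
        double d = 0.0;
        for (int j = 0; j < (int)query.size(); ++j) {
            double diff = train[i][j] - query[j];
            d += diff * diff;
        }
        dist.push_back({d, labels[i]});
    }
    std::sort(dist.begin(), dist.end());  // nearest neighbours first
    std::map<int, int> votes;
    for (int i = 0; i < k && i < (int)dist.size(); ++i)
        ++votes[dist[i].second];
    int best = -1, bestCount = -1;
    for (const auto& v : votes)
        if (v.second > bestCount) { bestCount = v.second; best = v.first; }
    return best;
}
```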
Results are reported for the following cases:
1. Vowels, unequal a priori probabilities
2. Vowels, equal a priori probabilities, each class having 97 feature vectors
3. Vowels, equal a priori probabilities, each class having 194 feature vectors
4. Consonants, unequal a priori probabilities
5. Consonants, equal a priori probabilities, each class having 194 feature vectors
6. Confusion table for consonants
7. KNN classification

References
[1] W. Press, S. Teukolsky, W. Vetterling, B. Flannery, Numerical Recipes in C++: The Art of Scientific Computing, 2002.
[2] S. S. Stevens, J. Volkmann, E. B. Newman, "A Scale for the Measurement of the Psychological Magnitude Pitch," J. Acoust. Soc. Am., vol. 8, no. 3, pp. 185-190, 1937.
[3] X. Huang, A. Acero, H. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, New Jersey, 2001.
[4] L. Muda, M. Begam, I. Elamvazuthi, "Voice Recognition Algorithms Using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," Journal of Computing, vol. 2, no. 3, pp. 138-143, 2010.
[5] A. V. Oppenheim, R. W. Schafer, Discrete-Time Signal Processing, 3rd ed., Pearson International Edition.
[6] S. B. Davis, P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," Haskins Laboratories Status Report on Speech Research, 1980.
[7] J. Ye, "Speech Recognition Using Time Domain Features from Phase Space Reconstructions," PhD thesis, Marquette University, Wisconsin, US, 2004.
[8] B. Plannerer, An Introduction to Speech Recognition, 2005.
[9] R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, John Wiley & Sons, 2000.
[10] L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall Signal Processing Series.
[11] H. Artuner, "The Design and Implementation of a Turkish Speech Phoneme Clustering System," PhD thesis, Hacettepe University, Turkey, 1994.
