
Speech Recognition using Wavelets

Dr. T. Kishore Kumar, Souvik Sarkar, Gangadhar V.

ABSTRACT:

The problem of speech recognition is addressed using the wavelet transform as a means to help match phonemes from a speech signal. This work uses a template of pre-recorded, wavelet-transformed phonemes as its basis for comparison. This application illustrates how wavelets can be used for better accuracy in speech recognition.
I. INTRODUCTION

Speech recognition is a fascinating application of digital signal processing (DSP) with many real-world uses. It can automate tasks that previously required hands-on human interaction, such as recognizing simple spoken commands to perform actions like turning on lights or shutting a door. To increase the recognition rate, techniques such as neural networks and hidden Markov models can be used. Recent technological advances have made recognition of more complex speech patterns possible. Despite these breakthroughs, however, current efforts are still far from 100% recognition of natural human speech. Much more research and development in this area is needed before DSP comes close to achieving the speech recognition ability of a human being.
II. PROBLEM DEFINITION: DIGIT RECOGNITION

Recognizing natural speech is a challenging task. Human speech is parameterized over many variables, such as amplitude, pitch, and phonetic emphasis, that vary from speaker to speaker. The problem becomes easier, however, when we look at certain subsets of human speech. For instance, vowels and consonants in the English language are produced in different ways by the vocal tract and accordingly possess unique features that can be exploited to differentiate them from each other. This project aims to identify speech by its transient characteristics, which includes recognition of consonants. We chose the spoken digits one to five as our study set since they are short monosyllabic words with a detectable amount of transient behavior. The time domain representation of a spoken five is shown in Figure 2.1.

Figure 2.1

As the figure shows, each signal possesses both periodic (steady-state) behavior and transient behavior. The periodic sections - in general the latter portion of the signals - correspond to the pronunciation of vowels, while the transient, spiky sections correspond to the pronunciation of consonants. Consonants are physically generated by the stopping of air, which intuitively confirms their transient behavior.
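The vowel/consonant contrast can be illustrated numerically. The following sketch is our own toy illustration, not part of the original experiment: it builds a synthetic signal (a noise burst standing in for a consonant, followed by a sustained tone standing in for a vowel) and computes the zero-crossing rate per frame, which is high for the noise-like transient and low for the periodic part.

```python
import numpy as np

# Toy signal mimicking a spoken digit: a short noisy burst (consonant-like
# transient) followed by a sustained tone (vowel-like periodic section).
fs = 8000
rng = np.random.default_rng(0)
burst = rng.standard_normal(400) * np.hanning(400)        # ~50 ms transient
tone = np.sin(2 * np.pi * 200 * np.arange(2000) / fs)     # ~250 ms "vowel"
x = np.concatenate([burst, tone])

# Zero-crossing rate per 100-sample frame: the noisy transient crosses zero
# far more often per sample than the slowly oscillating tone.
frames = x[: len(x) // 100 * 100].reshape(-1, 100)
zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
print(zcr[:4].mean(), zcr[-4:].mean())  # high for the burst, low for the tone
```

The frame size, sampling rate, and 200 Hz "vowel" frequency are arbitrary choices for the demonstration.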

III. THE APPROACH: TIME FREQUENCY ANALYSIS

Just looking at the time domain signals is not going to cut it. We want to base our comparisons on as few features as possible, and the time signals, although they appear different to the human eye, require integrating too many details to yield a succinct criterion of discernment. One possible solution is to take the signal into the frequency domain via the Fourier Transform, to determine whether any salient frequencies distinguish one digit from another. The Fourier Transform, however, projects signals onto complex sines and cosines - infinitely long signals. The fact that we are dealing with transient characteristics - very short signals - hints that the Fourier basis may not be the best choice for analysis. In fact, because transient signals are localized in time, they are very rich in spectral content: many Fourier components are required to synthesize temporally localized signals. We want pithy comparisons, not long-winded ones - otherwise we might as well compare the time domain representations. Clearly we need a basis that matches the transient signal better - one that carries both temporal location, like an impulse, and frequency content, like a sinusoid.
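The claim that transients are "rich in spectral content" can be made concrete with a small numerical sketch (our own illustration, not from the original work): count how many Fourier bins are needed to capture 95% of the energy of a short click versus a steady tone. The frame length and pulse shape are arbitrary choices.

```python
import numpy as np

n = 1024

def bins_for_95pct(sig):
    """Number of Fourier bins needed to capture 95% of the signal's energy."""
    energy = np.sort(np.abs(np.fft.rfft(sig)) ** 2)[::-1]
    return int(np.searchsorted(np.cumsum(energy), 0.95 * energy.sum())) + 1

# A transient "click": 16 nonzero samples inside a 1024-sample frame.
click = np.zeros(n)
click[500:516] = np.hanning(16)

# A steady tone occupying the whole frame, for contrast.
tone = np.sin(2 * np.pi * 64 * np.arange(n) / n)

print(bins_for_95pct(click), bins_for_95pct(tone))  # many bins vs. one bin
```

The short click spreads its energy over dozens of frequency bins, while the tone concentrates essentially all of its energy in a single bin - exactly the mismatch between transient signals and the Fourier basis described above.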
IV. DAUBECHIES WAVELET BASIS

It turns out that basis functions fitting the bill do exist - namely wavelets. Wavelets are a cross between the impulse and the sinusoid: a wavelet dies off at negative and positive infinity, giving it localization in time, while its wiggle gives it frequency content. For our project we chose the 32-point Daubechies wavelet, generated by the function daubcqf.m from the Rice Wavelet Toolbox for MATLAB, for two reasons: first, it is a common default wavelet for time-frequency analysis, and second, it looks a lot like the transient parts of speech.

The 32-point Daubechies wavelet is shown in Figure 4.2, and a few other wavelets are shown in Figure 4.3.

Figure 4.2

Figure 4.3
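The original used daubcqf.m from the Rice toolbox; as a rough modern substitute (our choice, not the authors' code), the same wavelet family is available in the PyWavelets package, where the 32-tap Daubechies filter is named db16:

```python
import pywt

# dbN in PyWavelets denotes the Daubechies wavelet with 2N filter taps,
# so the "32 point" Daubechies wavelet corresponds to db16.
w = pywt.Wavelet('db16')
print(len(w.dec_lo))  # 32 coefficients in the decomposition low-pass filter

# Sample the scaling and wavelet functions on a fine grid
# (e.g. to reproduce a plot like Figure 4.2):
phi, psi, xgrid = w.wavefun(level=8)
```

The refinement level passed to `wavefun` only controls how finely the wavelet shape is sampled for plotting.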

With Fourier analysis we compared our signals to a basis of sinusoids that differ in frequency. With wavelet analysis we compare our signals to a basis of wiggles that differ in both frequency and temporal location. Surprisingly, the whole set is generated from a single prototype, or mother, wavelet. The wavelet W may be expressed as a function of two parameters - frequency and time delay:

W = g(f*t + t')

where t is time, f is frequency, and t' is the time delay. Varying the two parameters of the wavelet has physical consequences. We use the mother wavelet X, shown in Figure 4.4, to demonstrate these changes.

Figure 4.4

By varying f we can compress or dilate the prototype wavelet to obtain wavelets of higher or lower frequency respectively - much like varying the frequency w in a sine function sin(wt). Figure 4.5 shows the result of multiplying the f of X by a factor of 0.5.

Figure 4.5

By varying t' we can translate the wavelet in time. Figure 4.6 shows the result of subtracting some delay from the argument of X.

Figure 4.6

By varying both parameters we can generate a wide family of wavelets, each representing a different frequency content within a different time interval. Once a set of wavelets is generated from the prototype wavelet, the signal is projected onto the set via the dot product - in more formal terminology, the wavelet transform. If the two parameters f and t' are stepped through continuously, we have the continuous wavelet transform; if they are stepped through discretely, we have the discrete wavelet transform. For our project we chose the discrete wavelet transform (DWT), using the function mdwt from the Rice Wavelet Toolbox for MATLAB. The DWT steps through frequency and time by factors of two; hence it projects the signal onto a set of octaves - wavelets that differ in frequency by factors of two. The majority of our working recognition algorithm relied on differentiating digits by their octaves.

V. OUR APPROACHES TO THE PROJECT:
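The basic operation used throughout this section - projecting a signal onto seven octaves with the DWT - can be sketched with the PyWavelets `wavedec` function (our substitution for the Rice toolbox's mdwt; the test signal here is synthetic):

```python
import numpy as np
import pywt

# A 1024-sample stand-in for a recorded digit.
rng = np.random.default_rng(1)
signal = rng.standard_normal(1024)

# 7-level DWT with the 32-tap Daubechies wavelet (db16).
coeffs = pywt.wavedec(signal, 'db16', level=7)

# coeffs = [approximation, octave 7, octave 6, ..., octave 1]:
# each successive detail band covers frequencies a factor of two higher,
# and the coefficient arrays roughly halve in length per level.
for i, c in enumerate(coeffs):
    print(i, len(c))
```

Note that PyWavelets pads the signal at the boundaries (symmetric extension by default), so the coefficient counts are slightly larger than an exact halving would give.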

The very first step of our project is, of course, making templates of the digits against which input signals are compared. For each digit, we recorded 21 samples from seven different sources - all males - and wavelet transformed each one of them. We then took the average of the coefficients as our template. A Daubechies wavelet of length 32 is used in this project, and the level of the transform is seven (the level is simply the number of octaves the signal is projected onto); these numbers were obtained by trial and error. The wavelet transform function we used can be found in the Rice Wavelet Toolbox for MATLAB. At first, we tried to compare the entire input signal to the templates. The very first approach we took was the mean square difference comparison, in which we subtract the template from the input signal, square the differences, and sum the result - hoping that the digit the input signal corresponds to will give the minimum value. This approach works very well with the signals the templates were made from; however, it fails completely on signals outside the template set. We then tried other comparison methods: comparing the absolute values of the coefficients, normalizing the signal before comparing, and taking the dot product of the input signal with the templates. Among these methods, the dot product gives the best result. We dot the input with each of the templates, and due to the nature of the dot product, the digit the signal corresponds to yields the largest value. As a different approach, we analyzed the octaves and found that we can differentiate 2 and 3 from 1, 4, and 5 by looking at the amplitude of the third octave: if the amplitude is small, the number is either 2 or 3; otherwise, it is 1, 4, or 5.
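The template-and-dot-product scheme described above can be sketched as follows. This is our reconstruction using PyWavelets rather than the Rice toolbox, and all "recordings" here are synthetic random waveforms, so only the pipeline shape - transform, average into templates, classify by largest dot product - follows the paper:

```python
import numpy as np
import pywt

def wavelet_features(signal):
    """Level-7 DWT with db16 (32 taps), flattened into one coefficient vector."""
    return np.concatenate(pywt.wavedec(signal, 'db16', level=7))

rng = np.random.default_rng(2)

# Synthetic stand-ins: one prototype waveform per digit, plus noisy "takes"
# playing the role of the 21 recordings per digit.
prototypes = {d: rng.standard_normal(2048) for d in range(1, 6)}

def take(d):
    """One noisy synthetic recording of digit d."""
    return prototypes[d] + 0.3 * rng.standard_normal(2048)

# Template = average of the wavelet-transformed samples for each digit.
templates = {d: np.mean([wavelet_features(take(d)) for _ in range(21)], axis=0)
             for d in prototypes}

def classify(signal):
    """Dot the input's coefficients with every template; the largest wins."""
    v = wavelet_features(signal)
    return max(templates, key=lambda d: np.dot(v, templates[d]))

print(classify(take(4)))
```

With real speech the dot product is far less clear-cut than with these synthetic signals, which is why the octave-based tests described next were added.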

We then analyze other octaves to differentiate between 2 and 3, and among 1, 4, and 5. For 2 and 3, we look at the second octave: we threshold the region and count the number of samples above the threshold. If that number is large, we probably have a 2; if it is small, the signal is likely a 3. Of course, there is always the chance that the count falls in the region between a 2 and a 3; in that case, we use the dot product comparison to identify the signal. We used the same approach to identify 1, 4, and 5, except that these three numbers differ mostly in the fourth octave instead of the second.

Octave comparison between one and two

These three numbers have a similar mean value in the fourth octave; however, they differ in how much they fluctuate in that region. Four's coefficients fluctuate with large amplitude in the fourth octave, one's also fluctuate but with less amplitude, while five's coefficients remain roughly constant in this region. Therefore, to distinguish between them, we threshold the first part of the octave and count the number of samples above the threshold. The value of the threshold is picked so that 5 will have only a few coefficients above it while 4 will have many. If a definite conclusion is not reached, the dot product method is used again to identify the input signal.

VI. CONCLUSION:

Wavelets prove to be an effective method for analyzing speech signals that contain both steady state characteristics (vowels) and transient characteristics (consonants), since different combinations of vowels and consonants have distinct signatures in different octaves. We did encounter a problem when recognizing the digit four: its octaves are almost the same as those of five, so a different approach was needed for this case. With more time, we are confident that further differences between the digits could be found in the octaves, giving even better results. Possible future directions for this project: experiment with the continuous wavelet transform, which steps through frequencies and times continuously, instead of the discrete wavelet transform used here; and find a method to break up words like "twenty-one", which would make possible the recognition of all integers.
