METHODOLOGY
Acquire Speech Signal -> Preprocessing -> Feature Extraction -> Feature Matching -> Recognized Command
Spectral Shaping

According to the source-filter model, the speech signal x(n) is the convolution of the excitation signal e(n) with the vocal tract impulse response h(n):

x(n) = e(n) * h(n)
Changing the shape of the vocal tract changes the spectral shape of the speech signal, thus articulating different speech sounds. The most valuable information for a speech recognizer is contained in the way the spectral shape of the speech signal changes over time. Direct computation of the power spectrum from the speech signal results in a spectrum containing ripples caused by the excitation spectrum E(f). A smooth spectral shape without the ripples, representing the vocal tract transfer function H(f), has to be estimated.
Cepstral Transformation
|X(f)| = |E(f)| · |H(f)|

log|X(f)| = log(|E(f)| · |H(f)|) = log|E(f)| + log|H(f)|
Interpret this log-spectrum as a time signal. The ripples caused by E(f) would then have a high "frequency". Hence, by applying a kind of low-pass filtering, we can obtain the smooth spectral shape. The inverse Fourier transform of the log-spectrum brings us back to the time domain, giving the so-called cepstrum. Low-pass filtering is done by setting the higher-order cepstral coefficients to zero and then transforming back to the frequency domain. This process of filtering in the cepstral domain is called liftering.
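The liftering procedure above can be sketched as follows. This is a minimal illustration, not the project's LabVIEW implementation; the function name and the choice of keeping 14 coefficients (matching the liftering described later in this deck) are assumptions.

```python
import numpy as np

def smooth_spectrum_via_cepstrum(frame, n_keep=14):
    """Estimate the smooth spectral envelope of one speech frame
    by low-pass liftering in the cepstral domain."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # log|X(f)|
    cepstrum = np.fft.irfft(log_mag)             # back to the "time" (quefrency) domain
    # Liftering: zero out the high-quefrency coefficients that carry the
    # excitation ripples; keep the first n_keep and their symmetric tail.
    liftered = np.zeros_like(cepstrum)
    liftered[:n_keep] = cepstrum[:n_keep]
    liftered[-(n_keep - 1):] = cepstrum[-(n_keep - 1):]
    return np.fft.rfft(liftered).real            # smooth log-magnitude spectrum

frame = np.random.randn(320)                     # one 20 ms frame at 16 kHz
envelope = smooth_spectrum_via_cepstrum(frame)   # 161 smooth log-spectrum points
```

Transforming back to the frequency domain yields the ripple-free envelope that represents H(f).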
A common way to do mel frequency warping is to use triangle-shaped filters in the spectral domain to build a weighted sum over the power spectrum coefficients which lie within each window. This gives a new set of coefficients known as the mel spectral coefficients. A cepstral transformation is then performed on them to extract the Mel Frequency Cepstral Coefficients (MFCC). The MFCC are used directly for recognition instead of being transformed back to the frequency domain.
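The warping and cepstral steps can be sketched like this. The filter count (26) and coefficient count (14) are illustrative assumptions, not values taken from this project; the DCT-II is used as the cepstral transformation, as is standard for MFCC.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters, equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                    # rising edge of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                    # falling edge of the triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frame, sr=16000, n_filters=26, n_coeffs=14):
    power = np.abs(np.fft.rfft(frame)) ** 2      # half of the power spectrum
    mel_spec = mel_filterbank(n_filters, len(frame), sr) @ power
    log_mel = np.log(mel_spec + 1e-10)
    # DCT-II of the log mel spectrum gives the MFCC
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs),
                                  (2 * n + 1) / (2 * n_filters)))
    return dct @ log_mel

coeffs = mfcc(np.random.randn(320))              # 14 MFCC for one 20 ms frame
```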
Feature Matching
Each utterance is divided into frames of 20 ms. The MFCC of each frame is computed and represented by a vector; hence each utterance is represented by a vector sequence X = {x0, x1, ..., x(Tx-1)}. The distance between individual vectors is found using the Euclidean distance formula.
DTW Algorithm
Finding the optimal alignment path
DTW Algorithm
Key points to find the optimal path: a grid point (i, j) in the optimal path can have the predecessors (i-1, j), (i-1, j-1) and (i, j-1).
Bellman's Principle: if Popt is the optimal path through the matrix of grid points beginning at (0,0) and ending at (Tw-1, Tx-1), and grid point (i,j) is part of path Popt, then the partial path from (0,0) to (i,j) is also part of Popt. An accumulated distance matrix is created according to the formula

D(i, j) = d(i, j) + min{ D(i-1, j), D(i-1, j-1), D(i, j-1) }

where d(i, j) is the Euclidean distance between the i-th vector of W and the j-th vector of X.
The accumulated distance at the point (Tw-1, Tx-1) is the distance between the vector sequences W and X.
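The DTW recurrence described above can be sketched as follows; this is an illustrative implementation of the same predecessor rule and accumulated distance matrix, not the project's LabVIEW code.

```python
import numpy as np

def dtw_distance(W, X):
    """Accumulated DTW distance between two MFCC vector sequences.

    W: (Tw, d) array, X: (Tx, d) array. The local cost is the Euclidean
    distance; allowed predecessors of (i, j) are (i-1, j), (i-1, j-1)
    and (i, j-1), per Bellman's principle."""
    Tw, Tx = len(W), len(X)
    d = np.linalg.norm(W[:, None, :] - X[None, :, :], axis=2)  # local distances
    D = np.full((Tw, Tx), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(Tw):
        for j in range(Tx):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                D[i - 1, j] if i > 0 else np.inf,
                D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                D[i, j - 1] if j > 0 else np.inf,
            )
            D[i, j] = d[i, j] + best_prev
    return D[Tw - 1, Tx - 1]
```

For recognition, the input utterance is compared against every dictionary entry this way and the entry with the smallest accumulated distance wins.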
Front Panel
Block Diagram
Step 2: Pre-processing
Preprocessing of the input speech signal consists of the following steps:
2.1 Pre-Emphasis
The goal of pre-emphasis is to compensate for the high-frequency part that is suppressed by the human sound production mechanism. The speech signal is therefore passed through an FIR high-pass filter, which increases the magnitude of the higher frequencies relative to the other frequencies and hence improves the overall signal-to-noise ratio:

y(n) = x(n) - 0.95 · x(n-1)
2.2 Framing
The input speech signal is segmented into small frames of 20ms length with 50% overlap with the adjoining frames to create continuity.
2.3 Windowing
Each frame is multiplied by a Hamming window in the time domain. This helps reduce the discontinuity at the start and end of each frame:

w(n) = 0.54 - 0.46 · cos(2πn / (N-1)), 0 ≤ n ≤ N-1
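Steps 2.1 to 2.3 can be sketched together as below. The function name and the 16 kHz sample rate are assumptions for illustration; the pre-emphasis coefficient (0.95), 20 ms frame length, 50% overlap and Hamming window follow the steps above.

```python
import numpy as np

def preprocess(signal, sr=16000, frame_ms=20, alpha=0.95):
    """Pre-emphasis, 20 ms framing with 50% overlap, Hamming windowing."""
    # Pre-emphasis high-pass FIR: y(n) = x(n) - 0.95 x(n-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    N = int(sr * frame_ms / 1000)          # samples per frame (320 at 16 kHz)
    hop = N // 2                           # 50% overlap
    n_frames = 1 + (len(emphasized) - N) // hop
    # Hamming window: w(n) = 0.54 - 0.46 cos(2 pi n / (N-1))
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    frames = np.stack([emphasized[i * hop: i * hop + N] * window
                       for i in range(n_frames)])
    return frames

x = np.random.randn(9600)                  # 0.6 s of audio at 16 kHz
frames = preprocess(x)                     # shape (59, 320)
```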
Sometimes spikes due to external noise cross the threshold and contribute a 1 to the Boolean array. To remove these spikes, a Median Filter VI in LabVIEW with left and right rank 3 is used. The median filter replaces the i-th element of the Boolean array with the median of the elements {i-3, i-2, i-1, i, i+1, i+2, i+3}. Hence the median filter smooths the Boolean array.
Now the Peak Detector VI in LabVIEW is used to find the indices of the start and end of the utterance. Using these indices, the corresponding frames containing the utterance are extracted. N.B.: In my project, all commands were of length less than 0.6 s. Sometimes spikes due to noise remained even after the median filter, and hence the ending index was not detected accurately. But the start index was detected accurately most of the time, so I extracted 0.6 s of sound after the start index.
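The spike-removal idea can be sketched in a few lines. This mimics the behaviour of the LabVIEW Median Filter VI with a 7-point window (rank 3 on each side) on the Boolean activity array; the function names are illustrative.

```python
import numpy as np

def smooth_activity(active, rank=3):
    """Median-filter a Boolean voice-activity array with a
    (2*rank + 1)-point window, edge-padded at the ends."""
    active = np.asarray(active, dtype=float)
    padded = np.pad(active, rank, mode='edge')
    medians = np.array([np.median(padded[i:i + 2 * rank + 1])
                        for i in range(len(active))])
    return medians > 0.5

def start_index(active):
    """Index of the first active frame after smoothing, or None."""
    smoothed = smooth_activity(active)
    idx = np.flatnonzero(smoothed)
    return int(idx[0]) if len(idx) else None
```

An isolated noise spike (a single 1 surrounded by 0s) is removed by the 7-point median, so only a sustained run of active frames produces a start index.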
An FFT is done on each frame of the utterance and half of the spectrum is taken. The spectrum of each frame is warped onto the mel scale, giving the mel spectral coefficients. A discrete cosine transform is then applied to the mel spectral coefficients of each frame, yielding the MFCC. The first two coefficients of the obtained MFCC are removed, as they varied significantly between different utterances of the same word. Liftering is done by setting all MFCC except the first 14 to zero. The first MFCC of each frame is replaced by the log energy of that frame. Delta and acceleration coefficients are computed from the MFCC to increase the dimension of the feature vector of each frame, thereby increasing accuracy.
Delta coefficients are found from the following equation; the value of p chosen was 1.

d(t) = [ Σ k · (c(t+k) - c(t-k)) for k = 1..p ] / [ 2 · Σ k² for k = 1..p ]

With p = 1 this reduces to d(t) = (c(t+1) - c(t-1)) / 2.
Acceleration coefficients are found by replacing the MFCC in the above equation with the delta coefficients. The feature vector is normalized by subtracting its mean from each element. Thus each frame of the utterance is converted into a feature vector of dimension 35.
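The delta, acceleration and normalization steps can be sketched as follows, assuming the standard regression form of the delta equation. The final dimension depends on how many base coefficients are kept, so this sketch does not reproduce the exact 35-dimensional layout of the project.

```python
import numpy as np

def delta(features, p=1):
    """Delta coefficients over a (T, d) feature matrix:
    d(t) = sum_{k=1..p} k*(c(t+k) - c(t-k)) / (2 * sum_{k=1..p} k^2).
    With p = 1 this is d(t) = (c(t+1) - c(t-1)) / 2, edge-padded."""
    T = len(features)
    padded = np.pad(features, ((p, p), (0, 0)), mode='edge')
    denom = 2 * sum(k * k for k in range(1, p + 1))
    return sum(k * (padded[p + k: p + k + T] - padded[p - k: p - k + T])
               for k in range(1, p + 1)) / denom

def feature_vectors(mfcc_frames):
    """Stack MFCC, delta and acceleration, then mean-normalize."""
    d = delta(mfcc_frames)
    a = delta(d)                     # acceleration = delta of the deltas
    feats = np.hstack([mfcc_frames, d, a])
    return feats - feats.mean(axis=0)
```

Applying the deltas twice triples the per-frame dimension, and the mean subtraction centers each feature across the utterance.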
Limitations
Environment Dependent
The input speech feature vector is compared with a set of feature vectors in the dictionary that were recorded in a particular environment. When the system is used in a different environment, the efficiency decreases unless the threshold and the dictionary are updated accordingly.
Speaker Dependent
As the dictionary is trained by a particular user, the VI outputs consistent results only when used by that same user; accuracy drops for other speakers.
Questions?
Thank You