
Arid Zone Journal of Engineering, Technology and Environment, September, 2018; Vol. 14(3): 478-490
Copyright © Faculty of Engineering, University of Maiduguri, Maiduguri, Nigeria.
Print ISSN: 1596-2490, Electronic ISSN: 2545-5818, www.azojete.com.ng

MATHEMATICAL PROFILE OF AUTOMATIC SPEECH RECOGNITION ALGORITHM

S. A. Y. Amuda1 and Oladimeji Ibrahim2

1Computer Engineering Department, University of Ilorin, Ilorin, Nigeria.
2Electrical & Electronics Engineering, University of Ilorin, Ilorin, Nigeria.
Corresponding author’s e-mail address: amudasulyman@gmail.com

Abstract
This work provides mathematical insight into the Automatic Speech Recognition (ASR) algorithm, such that the intricacy of the system is reduced to a simple correlation between the ASR algorithm and its physical form, using a mathematical flowchart that clearly and uniquely shows the link from one stage of the algorithm to the next. The mathematical profile of the ASR algorithm starts from the data input module and proceeds through the noise cancellation module, voice activity detection module, pre-processing module and Linear Predictive Coding (LPC) based feature extraction module; it then provides alternate routes for Dynamic Time Warping (DTW) and Hidden Markov Model (HMM) based pattern matching, after which the output is fed to the final decision module of the ASR algorithm. Modern research outputs have improved the robustness of each stage of the algorithm, but the approach used here focuses on the basics of each stage, which supports an easy and better understanding of the ASR system. It also aids evaluation and creates the necessary intuition for decoding problems of recent ASR systems for new researchers in the area.
Keywords: Mathematical profile, ASR algorithm, Sequence of the stages, Speech conversion
1.0 Introduction
Automatic Speech Recognition (ASR) research spans over five decades, with the research outputs well publicised in terms of physical products and research documentation (docsoft.com, June, 2009). In spite of the popularity of this research area, the engineering behind it still mystifies aspiring researchers in third world countries, even though the research products are common in most of these countries. The quagmire is due to the fact that most literature on the subject does not provide a basic, sequential mathematical and practical insight into the speech recognition algorithm. Different approaches have been used to present the mathematical profiles of the ASR system (Rabiner and Juang, 1993; Huang et al., 2001; William et al., 2006; Wilcox and Bush, 1997; Deller et al., 2000), and in such presentations the ASR mathematical expressions are either floating or focused on a sub-section of the ASR algorithm; it therefore becomes difficult to gain a detailed mathematical insight into the working of the ASR algorithm.
In some cases, authors even present work on speech processing as though it encompasses the speech recognition system, which further confuses the reader or beginner in the research area. ASR is essentially a matter of extracting speech features and parameterising those features for a decision algorithm that determines the output of the system. Some ASR algorithms use either the Mel Frequency Cepstral Coefficient (MFCC) or relative spectral transform-perceptual linear prediction (RASTA-PLP) for their feature vectors, and the Gaussian Mixture Model (GMM) or Hidden Markov Model (HMM) as the acoustic parametric model with different training criteria. On the decision side, sequence discriminative training algorithms such as Minimum Classification Error (MCE) and Minimum Phone Error (MPE) are used to further improve ASR accuracy (Reynolds and Heck, 2000; Yu and Deng, 2015).
Though the mathematical profile presented here cannot capture every antecedent approach to these ASR stages, the overall focus is on simplifying the complexity of ASR algorithm development by providing a sequential mathematical profile of a robust ASR system, taking cognizance of the different approaches at each stage while narrowing down to the mathematical basics required at that

stage of the algorithm. The algorithm profile, which is based on the presentation of Amuda (2013), starts from the audio input and proceeds through speech conditioning, word isolation, feature extraction, parameterisation, classification training and, finally, the decision stage.
2.0 Data Preparation
Speech recording is done in different file formats, indicated by the extension letters attached to the file name, which give an insight into how the speech data is stored or compressed in the file. The generally adopted format for ASR algorithms is the wave file format, with extension .wav, for storing speech data; this is an uncompressed Pulse Code Modulated (PCM) wave file peculiar to Microsoft systems, with a Resource Interchange File Format (RIFF) header (Bourke, 2001; Gunter, 1995). In designing the wave reader algorithm there is a need to confirm at an early stage whether the input file is a wave file or not, and to know all the associated parameters of the file as a precaution for subsequent data processing. Therefore, for proper representation, analysis and reconstruction of the audio signal, it is helpful to know the frequency at which the sound data is sampled, the number of bits per sample, the number of channels and whether the file is compressed or uncompressed. This information, as shown in Table 1, is contained in the file header.
The header information is embedded in three chunks. The first is the RIFF chunk, sometimes referred to as the wave descriptor chunk, which describes the structure of the wave file by giving the chunk size and the wave format. The second is the format chunk, which shows the audio format, the number of channels, the sample rate, the byte rate and the bits per sample. Lastly, the data chunk gives the size and boundary of the audio/speech data. As implied by Table 1, the first forty-four bytes are the associated header information and the actual speech/audio data starts at the forty-fifth byte (offset 44). The pseudo code for the wave file reader, using the file format as a guide, is as follows:
Step 1. Get the data wave file.
Step 2. Define the wave file chunks.
Step 3. Create a temporary working data file to store the extracted speech data.
Step 4. Get all the wave file header information bytes of the chunks.
Step 5. Get the actual speech bytes and store them in the temporary working file.
Step 6. Repeat Step 5 till the end of the file.
Step 7. Stop.

Table 1. The canonical structure of a wave file in Resource Interchange File Format (RIFF)

Chunk Name                     | Sub-Chunk Name  | Field Output                                              | Field Bytes | Offset Bytes
The RIFF chunk (12 bytes long) | Chunk ID        | "RIFF"                                                    | 4           | 0
                               | Chunk Size      | Length of the chunk                                       | 4           | 4
                               | Format          | "WAVE"                                                    | 4           | 8
The "format" or "fmt " chunk   | Sub-Chunk ID    | "fmt "                                                    | 4           | 12
(24 bytes long)                | Sub-Chunk Size  | Length of format chunk (0x10)                             | 4           | 16
                               | Audio Format    | 1 = uncompressed PCM                                      | 2           | 20
                               | No. of Channels | 1 = mono, 2 = stereo                                      | 2           | 22
                               | Sample Rate     | Samples per second                                        | 4           | 24
                               | Byte Rate       | Bytes per second                                          | 4           | 28
                               | Block Alignment | 1 = 8-bit mono, 2 = 8-bit stereo/16-bit mono, 4 = 16-bit stereo | 2     | 32
                               | Bits per Sample | Bits per sample                                           | 2           | 34
The "data" chunk               | Sub-Chunk ID    | "data"                                                    | 4           | 36
                               | Sub-Chunk Size  | Length of chunk                                           | 4           | 40
                               | Data            | Raw speech/sound data                                     | Data size   | 44
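As a concrete illustration of Table 1 and the pseudo code above, the following minimal Python sketch parses the canonical 44-byte header and extracts the raw speech bytes. It assumes a fixed-layout uncompressed PCM file; a production reader would walk the chunk list rather than rely on fixed offsets.

```python
import struct

def read_wave(path):
    """Parse the canonical 44-byte RIFF/WAVE header of Table 1 and
    return the header fields plus the raw PCM speech bytes."""
    with open(path, "rb") as f:
        header = f.read(44)
        # Steps 2 and 4: confirm the file is RIFF/WAVE via its chunk IDs.
        if header[0:4] != b"RIFF" or header[8:12] != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        audio_format, channels = struct.unpack("<HH", header[20:24])
        sample_rate, byte_rate = struct.unpack("<II", header[24:32])
        block_align, bits_per_sample = struct.unpack("<HH", header[32:36])
        data_size = struct.unpack("<I", header[40:44])[0]
        # Steps 5 and 6: read the actual speech bytes of the "data" chunk.
        speech = f.read(data_size)
    return {"format": audio_format, "channels": channels,
            "sample_rate": sample_rate, "byte_rate": byte_rate,
            "block_align": block_align, "bits_per_sample": bits_per_sample,
            "speech": speech}
```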


3.0 The Mathematical Profile of the ASR System

An overview of the ASR algorithm is shown in Figure 1. It presents the flow chart of the practical progression of the stages involved in the ASR system, in two modules together with their interface. The flow chart is based on the assumption that the training module uses clean (noiseless) speech samples, while the second module, the recognition module, is based on a real-life scenario. In the recognition module the input comes from two microphones: one captures the speech with the background noise and the other captures the background noise only. The mathematical specifications of each stage, as presented in the flow chart of Figure 1, are then used to give the detailed mathematical profile of the whole system in the flow chart of Figure 2.

[Figure 1 shows two parallel pipelines. Recognition module: noise signal and noisy speech signal → anti-aliasing → noise removal algorithm → voice activity detection → speech signal pre-processing → feature extraction → vector quantisation → matching algorithm (either DTW or HMM, against the stored template) → decision algorithm (against a threshold) → verdict. Training module: clean speech → anti-aliasing → voice activity detection → speech signal pre-processing → feature extraction → vector quantisation → template.]
Figure 1: Flowchart of a typical ASR Algorithm with the Training and Recognition Modules

Signal capture: the noise signal and the noisy speech signal are each captured by a microphone (30 Hz – 4 kHz, 120 Ω, ±3 dB) and sampled at 8 kHz with 16 bits/sample.

Anti-aliasing:
$$x(n) = \sum_{k=0}^{N-1} b[k]\, x(n-k), \qquad b[k] = \begin{cases} \dfrac{\sin[\omega_c (k-M)]}{\pi (k-M)}, & k \neq M \\[4pt] \dfrac{\omega_c}{\pi}, & k = M \\[2pt] 0, & \text{otherwise} \end{cases}$$
where M is the index of the middle coefficient, obtained from the kernel size through M = N/2, and the normalised cut-off frequency is computed by $\omega_c = 2\pi f_c / f_s$, where $f_c$ and $f_s$ are the cut-off and sampling frequencies respectively.
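The windowed-sinc kernel above can be generated directly. The sketch below is a minimal numpy version, assuming an odd kernel size N; practical designs usually also taper the kernel with a window, which the flowchart omits.

```python
import numpy as np

def antialias_kernel(fc, fs, N=101):
    """Truncated-sinc low-pass kernel b[k] with cut-off fc and sampling
    rate fs (Hz); M = N//2 is the index of the middle coefficient."""
    wc = 2.0 * np.pi * fc / fs            # normalised cut-off frequency
    M, k = N // 2, np.arange(N)
    denom = np.pi * (k - M)
    denom[M] = 1.0                        # avoid 0/0 at k = M (value set below)
    b = np.sin(wc * (k - M)) / denom
    b[M] = wc / np.pi                     # limiting value at the centre tap
    return b

# Example: low-pass the input at 4 kHz before 8 kHz processing.
# y = np.convolve(x, antialias_kernel(4000, 16000), mode="same")
```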

Noise cancellation (by spectral subtraction):
$$SNR\,(\mathrm{dB}) = 10 \log_{10} \frac{\text{Speech energy}}{\text{Noise energy}}$$

Noise spectrum and noisy-word spectrum (using the DFT):
$$S(k) = \sum_{n=0}^{N-1} s(n)\, e^{-j\frac{2\pi}{N}kn}, \qquad X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi}{N}kn}, \qquad k = 0, 1, \ldots, N-1$$

Spectral subtraction:
$$G(k) = X(k) - S(k)$$

Spectral normalisation:
$$B(k) = L\, G(k) + w$$
where B(k) and G(k) are the normalised and unnormalised spectral vectors, w is a function of the instantaneous signal-to-noise ratio (SNR) equalisation of the noisy and noise inputs, and L is the normalisation matrix.

Clean speech (using the IDFT):
$$x'(n) = \frac{1}{N} \sum_{k=0}^{N-1} B(k)\, e^{j\frac{2\pi}{N}kn}, \qquad n = 0, 1, \ldots, N-1$$
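A minimal sketch of the spectral-subtraction path, assuming frame-synchronous inputs from the two microphones. It subtracts magnitudes rather than complex spectra and floors the result at zero, a common safeguard that the flowchart does not spell out; the noisy phase is reused for the inverse DFT.

```python
import numpy as np

def spectral_subtract(noisy, noise):
    """Estimate clean speech x'(n) from equal-length frames of the
    noisy-speech and noise-only microphones."""
    X = np.fft.fft(noisy)                           # noisy-word spectrum X(k)
    S = np.fft.fft(noise)                           # noise spectrum S(k)
    G = np.maximum(np.abs(X) - np.abs(S), 0.0)      # subtracted magnitude
    cleaned = np.fft.ifft(G * np.exp(1j * np.angle(X)))
    return np.real(cleaned)                         # clean speech estimate
```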

Pre-emphasis:
$$x(n) = x'(n+1) - a_{fac}\, x'(n), \qquad a_{fac} = 0.97, \quad n = 1, 2, \ldots, N$$
where N is the data length.


End-point detection / voice activity:

Speech energy detection:
$$E_m \triangleq \sum_{n=1}^{M-1} |x(n)|^2$$

Zero-crossing measurement:
$$Z_m = \frac{1}{2N} \sum_{n=1}^{M-1} \big| \operatorname{sgn}[x(n)] - \operatorname{sgn}[x(n-1)] \big|, \qquad Z'_m = \begin{cases} 1, & Z_m \geq 0 \\ 0, & Z_m < 0 \end{cases}$$

Detection thresholds, computed over the first 11 frames:
$$E_{Thres} = \frac{1}{11} \sum_{m=1}^{11} E_m, \qquad ZCR_{Thres} = \frac{1}{11} \sum_{m=1}^{11} Z'_m$$
$$x(n) = \begin{cases} x(n), & E_n \geq E_{Thres} \ \text{or} \ Z_n \geq ZCR_{Thres} \\ 0, & \text{otherwise} \end{cases}$$
where 0 marks the noise region and x(n) the valid speech.
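A compact numpy sketch of the end-point detection rule, assuming 20 ms frames at 8 kHz (160 samples) and, for simplicity, using the raw zero-crossing rate Z_m rather than the binarised Z'_m:

```python
import numpy as np

def endpoint_detect(x, frame_len=160):
    """Zero out frames whose energy and zero-crossing rate both fall
    below thresholds averaged over the first 11 (assumed silent) frames."""
    n = len(x) // frame_len
    frames = x[: n * frame_len].reshape(n, frame_len).copy()
    energy = np.sum(frames ** 2, axis=1)                              # E_m
    zcr = 0.5 * np.mean(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)
    e_thres, z_thres = energy[:11].mean(), zcr[:11].mean()
    noise = ~((energy >= e_thres) | (zcr >= z_thres))
    frames[noise] = 0.0                                # noise region -> 0
    return frames.reshape(-1)
```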

Framing:
$$x'(n) = x(n)\, f(F), \qquad F \triangleq \frac{L + s_f}{w_f + s_f}$$
where $w_f$ is the frame width (20 milliseconds), $s_f$ is the overlap length (10 milliseconds) and F is the average number of frames per word of length L.

Windowing:
$$x''(n) = x'(n)\, W(n)$$
where the Hamming window function W(n), with the window sequence starting from time n = 0 through the window length N, is defined by
$$W(n) = \begin{cases} 0.54 - 0.46 \cos\!\big(\tfrac{2\pi n}{N-1}\big), & 0 \leq n \leq N-1 \\ 0, & \text{otherwise} \end{cases}$$
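Framing and Hamming windowing take a few lines of numpy; this sketch assumes 8 kHz speech, 20 ms frames and a 10 ms shift:

```python
import numpy as np

def frame_and_window(x, fs=8000, width_ms=20, shift_ms=10):
    """Split x into overlapping frames and apply the Hamming window
    W(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)) to each frame."""
    width = fs * width_ms // 1000
    shift = fs * shift_ms // 1000
    n_frames = 1 + (len(x) - width) // shift
    frames = np.stack([x[i * shift: i * shift + width]
                       for i in range(n_frames)])
    return frames * np.hamming(width)
```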


Feature extraction (LPC-based): from the above, let $S(i,j) = x''(n)$.

Autocorrelation method:
$$r_i = \sum_{j=1}^{N-1} S_{i,j}\, S_{i,(j+i)}$$
where $r_i$ are the autocorrelation coefficients for $i = 0, 1, 2, \ldots, P-1$. The coefficient sequence (P = 12) is computed recursively using the Levinson-Durbin recursion:
$$k_i = \Big[ r_i + \sum_{j=1}^{i-1} a_j^{(i-1)}\, r_{(i-j)} \Big] \Big/ E^{(i-1)}$$
$$a_i^{(i)} = -k_i, \qquad a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{(i-j)}^{(i-1)} \quad \text{for } j = 1, \ldots, i-1$$
$$E^{(i)} = (1 - k_i^2)\, E^{(i-1)}$$
updating the reflection coefficient (k) and the prediction error (E), and computing the new coefficients at each order i.

Taking the delta and delta-delta:
$$d_t = \frac{\sum_{n=1}^{P} n\,\big(a_{(t+n)} - a_{(t-n)}\big)}{2 \sum_{n=1}^{P} n^2}$$
where $d_t^{(0)}$ gives the delta coefficients and $d_t^{(1)}$ the delta-delta coefficients, for $a_i = d_t^{(0)}$ and P the number of coefficients (P = 12).

Cepstral coefficients (LPCC):
$$c_n = -a_n + \frac{1}{n} \sum_{i=1}^{n-1} (n-i)\, a_i\, c_{(n-i)}$$
where $c_n$ is the cepstral coefficient, equivalent to taking the magnitude and logarithm of the coefficients in the FFT-based method.
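The autocorrelation, Levinson-Durbin and LPCC recursions above translate almost line-for-line into code. The sketch below is a minimal single-frame version with P = 12; sign conventions for the a-coefficients vary between texts, so treat it as one consistent choice rather than the only one.

```python
import numpy as np

def lpc_cepstra(frame, p=12):
    """LPC coefficients by Levinson-Durbin, then LPCC coefficients,
    for one windowed speech frame."""
    r = np.array([np.dot(frame[: len(frame) - i], frame[i:])
                  for i in range(p + 1)])             # autocorrelation r_i
    a = np.zeros(p + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]                # update a_j
        a[i] = k
        err *= 1.0 - k * k                                 # prediction error E
    c = np.zeros(p + 1)
    for n in range(1, p + 1):                              # LPCC recursion
        c[n] = -a[n] - sum((1 - i / n) * a[i] * c[n - i] for i in range(1, n))
    return a[1:], c[1:]
```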


Vector quantisation (LBG algorithm): from the above, let $T = c_n$ be the training sequence with M source vectors, hence $T = \{x_1, x_2, \ldots, x_M\}$; with dimension k, each vector becomes $x_m = (x_{m,1}, x_{m,2}, \ldots, x_{m,k})$. C is the codebook with N codevectors, $C = \{C_1, C_2, \ldots, C_N\}$, each a k-dimensional vector $C_n = (C_{n,1}, C_{n,2}, \ldots, C_{n,k})$.

Initialising process:
$$C_1^* = \frac{1}{M} \sum_{m=1}^{M} x_m, \qquad D_{ave}^* = \frac{1}{Mk} \sum_{m=1}^{M} \| x_m - C_1^* \|^2$$
where $D_{ave}$ is the distortion measure used to determine the partitioning, by ensuring that it is minimised.

Splitting process, for $i = 1, 2, \ldots, N$, where ε is a small splitting (perturbation) parameter:
$$C_i^{(0)} = (1 + \epsilon)\, C_i^*, \qquad C_{N+i}^{(0)} = (1 - \epsilon)\, C_i^*$$

Iteration process: let $D_{ave}^{(0)} = D_{ave}^*$ by setting $i = 0$.
(i) For $m = 1, 2, \ldots, M$, find the minimum value of $\| x_m - C_n^{(i)} \|^2$ over all $n = 1, 2, \ldots, N$; with $n^*$ the index achieving the minimum, set $Q(x_m) = C_{n^*}^{(i)}$.
(ii) Update the codevectors for $n = 1, 2, \ldots, N$ by
$$C_n^{(i+1)} = \frac{\sum_{Q(x_m) = C_n^{(i)}} x_m}{\sum_{Q(x_m) = C_n^{(i)}} 1}$$
after which i is again set by $i = i + 1$.
(iii) Calculate
$$D_{ave}^{(i)} = \frac{1}{Mk} \sum_{m=1}^{M} \| x_m - Q(x_m) \|^2$$
and if $\big(D_{ave}^{(i-1)} - D_{ave}^{(i)}\big) / D_{ave}^{(i-1)} > \epsilon$, go back to stage (i); else set $D_{ave}^* = D_{ave}^{(i)}$ so that $C_n^* = C_n^{(i)}$ are the final codevectors for all $n = 1, 2, \ldots, N$.
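The LBG loop maps directly onto numpy. A minimal sketch, assuming the target codebook size is a power of two and using the same relative-distortion stopping rule (the division assumes non-zero distortion):

```python
import numpy as np

def lbg(train, n_codes=8, eps=0.01):
    """Grow a codebook from the global centroid by repeated splitting,
    refining each size with nearest-neighbour / centroid iterations."""
    codebook = train.mean(axis=0, keepdims=True)          # C1*
    while len(codebook) < n_codes:
        codebook = np.vstack([(1 + eps) * codebook,       # splitting step
                              (1 - eps) * codebook])
        d_prev = np.inf
        while True:
            # (i) nearest codevector for every training vector
            dists = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            nearest = dists.argmin(axis=1)
            # (iii) average distortion D_ave and relative-change test
            d_ave = dists[np.arange(len(train)), nearest].mean()
            if (d_prev - d_ave) / d_ave <= eps:
                break
            d_prev = d_ave
            # (ii) centroid update of each non-empty partition
            for n in range(len(codebook)):
                if np.any(nearest == n):
                    codebook[n] = train[nearest == n].mean(axis=0)
    return codebook
```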


Dynamic Time Warping (DTW):

Euclidean distance metric:
$$d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$$
where d(x, y) gives the local distance metric (LDM) between the word template $x_i$ $(x_1, x_2, \ldots, x_N)$ and the test word $y_i$ $(y_1, y_2, \ldots, y_M)$.

Normalisation:
$$D(i, j) = \min_{1 \leq k \leq M} \big[ D(i-1, k) + d(k, j) \big]$$
where D(i, j) gives the accumulated distance of the local distance d(k, j) over the M possible moves.
$$s_i = D\big(i+1,\, s_{(i+1)}\big), \qquad i = N-1, N-2, \ldots, 1$$
such that the optimal path is $(s_1, s_2, \ldots, s_N)$ for the optimal minimum distance D(N, M) when $s_N = M$.
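A standard dynamic-programming realisation of the accumulated distance D(i, j); this sketch restricts the "M possible moves" of the flowchart to the usual three-step pattern (match, insertion, deletion):

```python
import numpy as np

def dtw_distance(template, test):
    """Accumulated DTW distance D(N, M) between two feature sequences."""
    n, m = len(template), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - test[j - 1])  # local distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```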


Hidden Markov Model (HMM):

1. Computing the observation sequence probability $\alpha_t(j)$ using the forward algorithm, after determining the probabilities $a_{ij}$, $b_i$ and $\pi_i$ for a given model defined as $\Phi = (A, B, \pi)$. Initialisation of the sequence:
$$\alpha_1(i) = \pi_i\, b_i(X_1) \qquad \text{for } 1 \leq i \leq N$$
The induction stage, which generates the probability that the HMM is in state j:
$$\alpha_t(j) = \Big[ \sum_{i=1}^{N} \alpha_{(t-1)}(i)\, a_{ij} \Big] b_j(X_t) \qquad \text{for } 2 \leq t \leq T;\ 1 \leq j \leq N$$
Sum up all the observation sequence probabilities:
$$P(X \mid \Phi) = \sum_{i=1}^{N} \alpha_T(i)$$

2. Computing the state probabilities for a model and sequence using the Viterbi algorithm. Initialisation of the state probabilities:
$$V_1(i) = \pi_i\, b_i(X_1), \qquad B_1(i) = 0 \qquad \text{for } 1 \leq i \leq N$$
The induction stage, which generates the state probability at time t:
$$V_t(j) = \max_{1 \leq i \leq N} \big[ V_{(t-1)}(i)\, a_{ij} \big]\, b_j(X_t), \qquad B_t(j) = \arg\max_{1 \leq i \leq N} \big[ V_{(t-1)}(i)\, a_{ij} \big] \qquad \text{for } 2 \leq t \leq T$$
Terminate after finding the best score $P^*$ within time T:
$$P^* = \max_{1 \leq i \leq N} [V_T(i)], \qquad q_T^* = \arg\max_{1 \leq i \leq N} [V_T(i)]$$
then back-track to determine the best sequence path:
$$q_t^* = B_{(t+1)}\big(q_{(t+1)}^*\big) \qquad \text{for } t = T-1, T-2, \ldots, 1$$
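The forward recursion is a pair of matrix operations per time step. A minimal sketch for a discrete-emission HMM (the Viterbi recursion is obtained by replacing the sum with a max and keeping back-pointers):

```python
import numpy as np

def forward(pi, A, B, obs):
    """P(X | model) by the forward algorithm.

    pi: (N,) initial probabilities; A: (N, N) transition matrix;
    B: (N, K) emission probabilities; obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]            # initialisation, t = 1
    for x_t in obs[1:]:                  # induction, t = 2..T
        alpha = (alpha @ A) * B[:, x_t]
    return alpha.sum()                   # P(X | model)
```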


3. Estimating or updating the model parameters. The initialisation:
$$\beta_T(i) = \frac{1}{N} \qquad \text{for } 1 \leq i \leq N$$
The induction process:
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(X_{(t+1)})\, \beta_{(t+1)}(j) \qquad \text{for } t = T-1, \ldots, 1;\ 1 \leq i \leq N$$
The re-estimation process for the state transition at a given time t:
$$\gamma_t(i, j) = \frac{\alpha_{(t-1)}(i)\, a_{ij}\, b_j(X_t)\, \beta_t(j)}{\sum_{k=1}^{N} \alpha_T(k)}$$
The updated normalisation process, updating the model parameters:
$$\hat{a}_{ij} = \sum_{t=1}^{T} \gamma_t(i, j) \Big/ \sum_{t=1}^{T} \sum_{k=1}^{N} \gamma_t(i, k)$$
and, for an M data sequence, the updating of the transition matrix of the model:
$$\hat{a}_{ij} = \sum_{m=1}^{M} \sum_{t=1}^{T_m} \gamma_t^m(i, j) \Big/ \sum_{m=1}^{M} \sum_{t=1}^{T_m} \sum_{k=1}^{N} \gamma_t^m(i, k)$$
$$\hat{b}_j(k) = \sum_{t \,:\, X_t = o_k} \sum_i \gamma_t(i, j) \Big/ \sum_{t=1}^{T} \sum_i \gamma_t(i, j)$$
The updating of the output distribution of the model, as a mixture of Gaussians with stream weights $\gamma_s$:
$$b_j(x_t) = \prod_{s=1}^{S} \Big[ \sum_{m=1}^{M_s} c_{jsm}\, N(x_t;\, \mu_{jsm}, \Sigma_{jsm}) \Big]^{\gamma_s}$$
$$N(x;\, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, e^{-\frac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu)}$$
The updated model parameters are thereby normalised and sum to one, where the mean μ and covariance Σ of the model are computed as a contribution from the statistics of the sequence of observations:
$$\mu'_j = \frac{1}{T} \sum_{t=1}^{T} x_t \qquad \text{and} \qquad \Sigma'_j = \frac{1}{T} \sum_{t=1}^{T} (x_t - \mu_j)(x_t - \mu_j)'$$
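The re-estimation of the transition matrix can be written compactly for a single discrete observation sequence. A sketch of one Baum-Welch pass (the common $1/P(X\mid\Phi)$ factor is omitted since it cancels in the row normalisation):

```python
import numpy as np

def reestimate_A(pi, A, B, obs):
    """One re-estimation of the transition matrix A for a discrete HMM,
    via the forward/backward passes and the summed gamma_t(i, j)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                          # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0 / N                             # initialisation, as above
    for t in range(T - 2, -1, -1):                 # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = np.zeros((N, N))
    for t in range(1, T):                          # sum gamma_t(i, j) over t
        gamma += np.outer(alpha[t - 1], B[:, obs[t]] * beta[t]) * A
    return gamma / gamma.sum(axis=1, keepdims=True)
```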


Maximum a posteriori (MAP) adaptation:
$$\hat{\mu}_{jm} = \frac{N_{jm}}{N_{jm} + \tau}\, \bar{\mu}_{jm} + \frac{\tau}{N_{jm} + \tau}\, \mu_{jm}$$
$$\bar{\mu}_{jm} = \sum_{r=1}^{R} \sum_{t=1}^{T_r} L_{jm}^{r}(t)\, O_t^{r} \Big/ \sum_{r=1}^{R} \sum_{t=1}^{T_r} L_{jm}^{r}(t)$$
where $\mu_{jm}$ is the speaker-independent mean and $\bar{\mu}_{jm}$ is the mean of the observed adaptation data, for the update stream of state j and mixture component m, with weighting component τ for the data occupation likelihood $N_{jm}$.
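The MAP mean update is a single convex combination per mixture component. A minimal sketch (the default value of τ is an assumption; in practice it is tuned empirically):

```python
import numpy as np

def map_adapt_mean(mu_si, mu_obs, n_occ, tau=10.0):
    """MAP update of a Gaussian mean: mu_si is the speaker-independent
    mean, mu_obs the mean of the adaptation data, n_occ the occupation
    likelihood N_jm, tau the prior weighting."""
    w = n_occ / (n_occ + tau)
    return w * np.asarray(mu_obs) + (1.0 - w) * np.asarray(mu_si)
```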


Decision algorithm: the recognition decision is based on the generated matching score û, which is determined for a test word $x_{tw}$ and k word templates $x_{tp}$ through
$$\hat{u} = \arg\min_{tp} \| x_{tw} - x_{tp} \| \qquad \text{for } tp = 1, 2, \ldots, k$$
such that the tp-th word is recognised only when û is within the threshold value T:
$$\text{verdict} = \begin{cases} \text{recognised}, & \hat{u} \leq T \\ \text{not recognised}, & \text{otherwise} \end{cases}$$
Figure 2: Flowchart showing the mathematical profile and specifications at each stage of a typical
robust ASR algorithm
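Finally, the decision rule reduces to a thresholded nearest-template search. A small sketch, assuming the matching scores come from the DTW or HMM stage above:

```python
import numpy as np

def decide(scores, threshold):
    """Return the index of the best-matching template, or None when the
    best score fails the threshold test (score <= T means recognised)."""
    best = int(np.argmin(scores))
    return best if scores[best] <= threshold else None

# Example: decide([4.2, 1.3, 7.9], threshold=2.0) -> 1 (second template)
```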

4.0 The Algorithm Yardstick

The algorithm presents the background material of an ASR system, using mathematical expressions for the deep and dynamic generative models at each stage of the ASR algorithm. The sequential presentation adopted above, where the raw speech input passes through the feature extraction stage to the more complex classification stage, provides the clarification that enables the reader to relate the literary outline to the technical or practical view of the algorithm. It also illustrates the unifying relationship between the output format of each stage and the input of the next, for all the stages involved in the algorithm.
In outline, the algorithm starts from raw speech or audio signal capture through the microphones, which convert the acoustic signal to an electrical signal and then to a digital signal. The details of this conversion to a digital signal are provided at the first stage of the algorithm, followed by anti-aliasing; the noisy part of the main speech is then removed through spectral subtraction and normalisation. This ensures that only the clean speech is passed on for further processing and mitigates the effects of the spectral subtraction on the speech signal.

Though speech is considered continuous, the truth is that there are noticeable silences in between the words, and these silent portions need to be removed to avoid wasting time and resources. Hence, the speech signal is subjected to the voice activity detection stage, where only the speech portion is captured for onward processing. Compact, clean speech samples are then guaranteed as the speech signal is passed through framing and windowing, after which the feature extraction process is applied. The feature extraction is Linear Predictive Coding (LPC) based, and the algorithm also provides links for both delta-delta and cepstral coefficient generation for the feature parameterisation training through Vector Quantisation (VQ). The classification stage provides for either the Dynamic Time Warping (DTW) route or the Hidden Markov Model (HMM) route. The output from either of the two routes is then used at the decision stage to determine whether the word or speech is recognised or not.

5.0 Conclusion
This paper provides mathematical insight into a typical ASR algorithm, which helps to simplify the basics of the algorithm development for each stage involved in an ASR system. Though there have been several modifications at different stages of the algorithm over time, without a good understanding of the basic stages of the ASR algorithm it is very difficult to catch up with those recent modifications. The ASR algorithm adopted in this paper is based on the general background model at each stage of the modules in the ASR algorithm, such that one is able to link any modification of any of the stages to the basic model of that stage. This enhances the understanding of the need for the modifications and helps the new or young ASR researcher to exploit the ASR algorithm and make the necessary contributions to the research field. The paper also provides practical insight into ASR development, supplying the basic research tools needed to localise the ASR algorithm and meet the challenge of building ASR products suitable for local speech adaptation and its environmental features.

References
Amuda, SAY. 2013. Development of Isolated Speech Recognition Algorithm Robust to Noise and
Selected Nigerian Accents. PhD. Dissertation submitted to Electrical & Electronics Engineering
Department, University of Ilorin, Ilorin, Nigeria.
Acero, A. and Stern, RM. 2014. Robust Speech Recognition by Normalization of the Acoustic
Space. citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.2056
Annaji, R. and Shrisha, R. 2000. Parallelization of the LBG Vector Quantisation Algorithm for
Shared Memory Systems. International Journal of Image Processing, vol. 3, issue 4, pp. 170-183.
Becchetti, C. and Ricotti, LP. 2004. Speech Recognition. Published by John Wiley & Sons,
Chichester, West Sussex, England.
Boll, SF. 1979. Suppression of Acoustic Noise in Speech Using Spectral Subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2): 113-120.
Bourke, P. 2001. WAVE Sound File Format.
Deller, JR., Hansen, JHL. and Proakis, JG. 2000. Discrete-Time Processing of Speech Signals.
IEEE Press, the Institute of Electrical and Electronics Engineers, Inc. New York, USA.
Gaikwad, SK., Gawali, BW. and Yannawar, P. 2010. A Review on Speech Recognition Technique. International Journal of Computer Applications, vol. 10, No. 3, pp. 16-24.


George, AS. 2005. Using Spectral Subtraction to Enhance Speech and Increase Performance in Automatic Speech Recognition. Maryland Engineering Research Internship Teams Program.
Gray, RM. 1984. Vector Quantisation. IEEE ASSP Magazine, pp. 4-29 (http://www.data-compression.com/VQ).
Gunter, B. 1995. File Formats Handbook, Published by International Thompson Computer Press,
Boston.
Hollmen, J., Tresp, V. and Simula, O. 2000. Learning Vector Quantization Algorithm for
Probabilistic Models. Proceedings of European Signal Processing Conference, Vol. 2, pp. 721-724.
Huang, X., Acero, A. and Hon, H. 2001. Spoken Language Processing: A Guide to Theory,
Algorithm and System Development. Published by Prentice Hall International Inc. New Jersey,
USA.
Jedruszek, J. 2000. Speech Recognition. Published by Alcatel Communication Review.
Loizou, PC. 2006. Speech Enhancement: Theory and Practice. Published by CRC Press.
Mann, TP. 2006. Numerically Stable Hidden Markov Model Implementation.
Markel, JD. and Gray, AH. 1973. On Autocorrelation Equations as Applied to Speech Analysis.
IEEE Transact on Audio and Electroacoustics, AU-21, pp. 69-79.
McClellan, S. and Gibson, JD. 1997. Speech Signal Processing (Coding, Transmission and Storage). Published by ITU, CRC Press.
Parikh, GK. 2002. The Effect of Noise on the Spectrum of Speech. M.Sc. Thesis in
Telecommunication Engineering, University of Texas, Dallas.
Rabiner, L. and Juang, BH. 1993. Fundamentals of Speech Recognition. Published by Prentice-
Hall International, Inc., Englewood Cliffs, New Jersey, USA.
Reynolds, DA. and Heck, LP. 2000. Automatic Speaker Recognition. Conference Paper at AAAS
2000 meeting on Human, Computer and Speech.
Wilcox, LD. and Bush, MA. 1997. Speech Signal Processing (Speech Recognition). Published by
ITU, CRC press.
William, KM. and Douglas, BW. 2006. The Digital Signal Processing Handbook. Published by CRC Press, 1998, www.DSPguide.com.
Wrigley, SN. 1999. Speech Recognition by Dynamic Time Warping. www.dcs.shef.ac.uk.
Yu, D. and Deng, L. 2015. Automatic Speech Recognition: A Deep Learning Approach. Springer-
Verlag London.

