
2012 International Conference on Communication Systems and Network Technologies

Speech and Speaker Recognition System using Artificial Neural Networks and Hidden Markov Model

Niladri Sekhar Dey*, Ramakanta Mohanty, K. L. Chugh
Department of Computer Science and Engineering, MLR Institute of Technology, Hyderabad, AP, India
niladridey@mlrinstitutions.ac.in, ramakanta5a@gmail.com, kl_chugh@hotmail.com

* Corresponding author

Abstract— Aiming towards automatic machine learning from humans, a methodology for speech recognition with speaker identification based on the Hidden Markov Model is a demand of science, particularly for security. Inspired by the same, we propose a methodology to identify the speaker and detect the speech. Within our research, acquisition of the speech signal, analysis of the spectrogram, neutralization, extraction of features for recognition, and mapping of speech using Artificial Neural Networks are presented. In our investigation, such a method of mapping is realized using the back propagation rule of neural networks. This algorithm is especially suitable for mapping large sets of input and output speech. Additionally, recognition of the speaker using the Hidden Markov Model is also presented in this paper.

Keywords- Neutralization of Speech; Speaker Identification; Speech Recognition; Spectrogram Analysis of Speech
I. INTRODUCTION

Speech is the most efficient way to train a machine or communicate with a machine. This work focuses on recognizing, at high speed, the word or phrase spoken by a human. But speech is a biometric property, and biometric properties of humans are difficult to recognize; hence human-led learning processes for machines have not been very successful, although the recognition of speech using pattern mapping algorithms is successful, especially with continuous learning algorithms such as the back propagation algorithm using any combination of feedforward and feedback networks with gradient searching capabilities [1-3]. The voice or speech signal is generated by the vocal muscles, depends on the physical structure of the human, and varies regularly. Moreover, the signal is also affected by the emotions of the speaker. Therefore it is required to neutralize the signals to derive emotion-free inputs to the detection system, analyze the spectrogram of the speech signal, and further map the input spectrogram set to the output spectrogram set. As the signal is free from emotions after neutralization, and the spectrogram analysis generates the best input sets, which can be mapped to the output dataset obtained during training, we propose an algorithm to map the input spectrum (depicted in Fig. 1) to the machine voice spectrum (depicted in Fig. 3) for speech recognition.

[Figure] Fig. 1. Sample Input Spectrogram

[Figure] Fig. 3. Machine Voice Spectrogram

The rest of the paper is organized in the following manner. A brief discussion of computer-human communication techniques is presented in Section II, Section III describes the proposed methodology, the setup and outputs are presented in Section IV, Section V presents results and discussions, and finally Section VI concludes the paper.

II. RELATED WORK

Improvement of computer-human communication has been attempted for a long time. Still, due to a significant number of drawbacks, those prototypes lead to multiple malfunctions. For our convenience, we classify the prototypes into five different types to justify the improvised method of learning.

In Standard Recorded Speech Processing systems, the captured sound signals are matched with standard inputs. But the captured sound signals vary from speaker to speaker by age, sex, anatomic variation, and emotion. As most of these types of systems do not provide any training features, their success rate is limited.

In Natural Language Processing systems, theoretical algorithms are followed, where most of the time the contribution of noise is not properly included; hence it reflects in the output error.

In Accent Neutral Recognition systems, the processing is done based on the processed parameters without considering the stresses on syllables. Hence, in the majority of cases the recognition fails. Moreover, if these types of systems do not include training, the failure rate is once again high.

In Processing of Continuous Speech systems, the phrases are considered for processing. But the processing capabilities are limited in speed of computation. Moreover, the speed of speakers varies, which makes the system unstable in recognition of speech.

In Hardware Dependent Processing systems, variation of hardware, such as the quality of the microphones, affects the results.

In this research we discuss a type of prototype where the variation of speakers is resolved with the help of speech training of the system. Moreover, the neutralization process can reduce the emotional effects of the speech signals along with noise. As the learning is also parameter based, the effects of variation in speech can also be trained into the system. The utilization of artificial neural networks furthermore makes the processing faster.
III. PROPOSED METHODOLOGY

A. Strategy: The algorithm involves the acquisition of the speech signal, processing, matching with the training data, and also speaker identification using HMM. The algorithm is divided into six parts (a code sketch of the overall flow follows Fig. 4):
• Acquisition of the speech signal
• Spectrogram analysis
• Reduction of Noise
• Normalization of signal
• Recognition using ANN
• Recognition of speaker using HMM

[Figure: flow diagram — Acquisition of Speech Signal, Spectrogram Analysis, Noise Reduction, Normalization, Recognition using ANN / Speaker Recognition, Output Generation] Fig. 4. Process Flow
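To make the strategy concrete, the following is a minimal Python sketch of the six-part flow of Fig. 4. It is an illustration under our own simplifying assumptions, not the implementation used in the experiments; every stage function here is a toy stand-in for the processing detailed in subsections B-G.

```python
import numpy as np

# Toy stand-ins for the stages of Fig. 4 (assumptions, not the paper's code).
def acquire(raw, fs):        # B. Acquisition, Eq. (1): just cast to float
    return np.asarray(raw, dtype=float)

def spectrogram(x):          # C. Stand-in for the STFT of Eq. (2)
    return np.abs(np.fft.rfft(x))

def reduce_noise(s):         # D. Noise reduction, Eq. (4): subtract a floor
    return np.maximum(s - np.quantile(s, 0.1), 0.0)

def normalize(s):            # E. Loudness (RMS) normalization, Eq. (6)
    return s / max(np.sqrt(np.mean(s ** 2)), 1e-12)

def pipeline(raw, fs):
    """Six-part flow: acquire -> spectrogram -> denoise -> normalize -> ANN/HMM."""
    s = normalize(reduce_noise(spectrogram(acquire(raw, fs))))
    accepted = s.max() > 1.0  # toy ANN decision (F); the cutoff is arbitrary
    return "word recognized (ANN)" if accepted else "rejected -> HMM speaker stage (G)"

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
print(pipeline(np.cos(2 * np.pi * 440 * t), fs))
```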
B. Acquisition of the speech signal: A sound signal is a physical quantity which can be measured as well as formulated. Fig. 4 demonstrates the process flow. The input is a noisy and destabilized speech signal, which is basically a non-stationary sinusoidal signal, mathematically defined as

x(t) = \sum_{k=0}^{n} A_k(t) \cos(\omega_i(t)\, t + \varphi_i(t))    (1)
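As an illustration of Eq. (1), the following numpy sketch synthesizes such a non-stationary sum of sinusoids with time-varying amplitude, frequency, and phase; the 16 kHz rate and three components are our own assumptions.

```python
import numpy as np

fs = 16000                                  # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)               # one second of signal

# x(t) = sum_k A_k(t) cos(omega_i(t) * t + phi_i(t)), per Eq. (1)
x = np.zeros_like(t)
for k in range(1, 4):                       # three components for illustration
    A = (1.0 / k) * (1 + 0.2 * np.sin(2 * np.pi * 2 * t))      # A_k(t), slow tremolo
    omega = 2 * np.pi * (120 * k + 30 * np.sin(2 * np.pi * t))  # omega_i(t), vibrato
    phi = 0.1 * k                                               # phi_i(t), fixed offset
    x += A * np.cos(omega * t + phi)

print(x.shape, x.min(), x.max())
```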

C. Spectrogram Analysis: The most common format is a graph with two geometric dimensions, where the horizontal axis represents time and the vertical axis frequency. In a spectrogram, a third dimension, indicating the amplitude of a particular frequency at a particular time, is represented by the intensity or color of each point in the image. The basic mathematics is

x'(t) = \mathrm{STFT}\{x(t)\} \equiv X(\tau, \omega) = \int_{t=0}^{n} x(t)\, w(t-\tau)\, e^{-j\omega t}\, dt    (2)

Hence, the analysis of the non-stationary signal, combining Eq. (1) and Eq. (2), is

x'(t) = X(\tau, \omega) = \int_{t=0}^{n} \sum_{k=0}^{n} A_k(t) \cos(\omega_i(t)\, t + \varphi_i(t))\, w(t-\tau)\, e^{-j\omega t}\, dt    (3)
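A short-time Fourier transform such as Eq. (2) can be computed, for example, with scipy; the window length and the toy chirp input below are illustrative assumptions.

```python
import numpy as np
from scipy import signal

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
x = np.cos(2 * np.pi * (200 + 50 * t) * t)      # toy non-stationary x(t)

# Discrete counterpart of Eq. (2): window w(t - tau), Fourier transform per frame.
f, tau, Zxx = signal.stft(x, fs=fs, window="hann", nperseg=512)
spec_db = 20 * np.log10(np.abs(Zxx) + 1e-12)    # intensity = the third dimension
print(spec_db.shape)                            # (frequency bins, time frames)
```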
D. Reduction of Noise: Most electronic recording equipment leaves an effect of noise on the recorded sound signal. But when the signal is to be processed for speech recognition, even minimal noise can weigh down the neural network during training and processing. Hence we try to reduce the noise before processing the signal with the ANN. The mathematical representation is

x''(t) = \int_{0}^{B} \log_2\!\left(1 + \frac{x'(t)}{n(t)}\right) df    (4)

where B is the bandwidth of the channel and n(t) is the noise. Substituting Eq. (3) into Eq. (4),

x''(t) = \int_{0}^{B} \log_2\!\left(1 + \frac{\int_{-\infty}^{\infty} \sum_{k=0}^{n} A_k(t) \cos(\omega_i(t)\, t + \varphi_i(t))\, w(t-\tau)\, e^{-j\omega t}\, dt}{n(t)}\right) df    (5)

After the reduction of noise the signal is stable and ready to be normalized.
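Eq. (4) has the form of a band-limited log-SNR integral over [0, B]. The sketch below evaluates it numerically per STFT frame; estimating n(t) from the quietest frames is our own assumption, not the paper's noise model.

```python
import numpy as np
from scipy import signal

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
rng = np.random.default_rng(0)
x = np.cos(2 * np.pi * (200 + 50 * t) * t) + 0.05 * rng.standard_normal(t.size)

f, tau, Zxx = signal.stft(x, fs=fs, nperseg=512)
mag = np.abs(Zxx)                                # x'(t) magnitude per (bin, frame)
noise = np.quantile(mag, 0.1, axis=1, keepdims=True) + 1e-12  # assumed n(t)

# Eq. (4): x''(t) = integral_0^B log2(1 + x'(t)/n(t)) df, with B = fs/2,
# approximated by the trapezoidal rule over frequency.
x_dd = np.trapz(np.log2(1.0 + mag / noise), dx=f[1] - f[0], axis=0)
print(x_dd.shape)                                # one value per time frame
```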

E. Normalization of Signal: Normalization of the sound signal is really effective when the signal needs spectrogram analysis before the recognition process. Normalization of a sound signal can be peak normalization or loudness normalization. As our process is focused on power spectrums, we choose loudness normalization as the best method. It can be achieved by calculating the root mean square of the power spectrum. The mathematical representation is

x'''(t)_{RMS} = \sqrt{\frac{1}{n}\,[x''(t)]^2}    (6)

After the normalization the speech signal is ready to be processed. Combining Eq. (5) and Eq. (6),

x'''(t)_{RMS} = \sqrt{\frac{1}{n}\left[\int_{0}^{B} \log_2\!\left(1 + \frac{\int_{-\infty}^{\infty} \sum_{k=0}^{n} A_k(t) \cos(\omega_i(t)\, t + \varphi_i(t))\, w(t-\tau)\, e^{-j\omega t}\, dt}{n(t)}\right) df\right]^2}    (7)
F. Recognition using ANN: The most important part of speech recognition is to recognize the speech signal, after normalization, using Artificial Neural Networks. Here we propose a multilayer pattern mapping neural network which works on the principle of the back propagation algorithm [4-7]. The design of this multilayer neural network is based on research on various neural network systems [Fig. 5]. The specialty of this model is the flexible and expandable hidden layer for recognition [8-10].

[Figure] Fig. 5. Multilayer Pattern Mapping Network

The input to this model is the processed speech signal and the output is the decision. The intermediate hidden layers adjust the weight for each node. During training the network is expandable; hence not only new data but also new parameters can be stored. X''' is considered as the input to the neural network, where X''' is defined as

X''' = \sum_{i=1}^{n} x'''_i    (8)

Combining Eq. (7) and Eq. (8),

X''' = \sum_{i=1}^{n}\left[\frac{1}{n}\left[\int_{0}^{B} \log_2\!\left(1 + \frac{\int_{-\infty}^{\infty} \sum_{k=0}^{n} A_k(t) \cos(\omega_i(t)\, t + \varphi_i(t))\, w(t-\tau)\, e^{-j\omega t}\, dt}{n(t)}\right) df\right]^2\right]_i    (9)

The weight matrix is defined in the first hidden layer as W_I and in the output layer as W_J. The weight matrix in the hidden and expandable layers is denoted as W_K, where

W_I = \sum_{i=0}^{n} w_i    (10)

W_K = \sum_{k=0}^{n} w_k    (11)

W_J = \sum_{j=0}^{n} w_j    (12)

w_j = \sum_{K=0}^{n} W_k    (13)

w_J = \sum_{K=0}^{n} \sum_{k=0}^{n} w_k    (14)

Moreover, we represent the composite weight matrix as

W = W_I \cdot W_J \cdot W_K    (15)

The output of this neural network is the decision, which we represent as Y, formulated as

Y = f(X \cdot W) = f(X''' \cdot W_I \cdot W_J \cdot W_K)    (16)

Combining Eq. (9) and Eq. (16),

Y = f\!\left(\sum_{i=1}^{n}\left[\frac{1}{n}\left[\int_{0}^{B} \log_2\!\left(1 + \frac{\int_{-\infty}^{\infty} \sum_{k=0}^{n} A_k(t) \cos(\omega_i(t)\, t + \varphi_i(t))\, w(t-\tau)\, e^{-j\omega t}\, dt}{n(t)}\right) df\right]^2\right]_i \cdot W_I \cdot W_J \cdot W_K\right)    (17)

Combining Eq. (10), (11), (12), (13) and (17),

Y = f\!\left(\sum_{i=1}^{n}\left[\frac{1}{n}\left[\int_{0}^{B} \log_2\!\left(1 + \frac{\int_{-\infty}^{\infty} \sum_{k=0}^{n} A_k(t) \cos(\omega_i(t)\, t + \varphi_i(t))\, w(t-\tau)\, e^{-j\omega t}\, dt}{n(t)}\right) df\right]^2\right]_i \cdot \sum_{i=0}^{n} w_i \cdot \sum_{K=0}^{n} \sum_{k=0}^{n} w_k \cdot W_K\right)

which makes the final formulation for speech recognition.
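To illustrate the pattern-mapping network of Eqs. (8)-(18), here is a minimal numpy back-propagation MLP with a growable hidden layer. It is a toy sketch of the idea (sigmoid output f, weight blocks standing in for W_I and W_J), not the authors' network; the layer sizes, learning rate, and data are assumptions.

```python
import numpy as np

class ExpandableMLP:
    """Sketch of a pattern-mapping MLP trained with back propagation,
    whose hidden layer can grow to store new parameters (section F)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        self.rng = np.random.default_rng(seed)
        self.WI = self.rng.normal(0, 0.1, (n_in, n_hidden))   # input -> hidden
        self.WJ = self.rng.normal(0, 0.1, (n_hidden, n_out))  # hidden -> output

    def forward(self, X):
        self.H = np.tanh(X @ self.WI)                    # hidden activations
        return 1 / (1 + np.exp(-(self.H @ self.WJ)))     # Y = f(X . W), Eq. (16)

    def train(self, X, Y, lr=0.5, epochs=2000):
        for _ in range(epochs):
            out = self.forward(X)
            d_out = (out - Y) * out * (1 - out)             # output delta (MSE)
            d_hid = (d_out @ self.WJ.T) * (1 - self.H**2)   # back-propagated delta
            self.WJ -= lr * self.H.T @ d_out / len(X)
            self.WI -= lr * X.T @ d_hid / len(X)

    def grow(self, extra):
        """Expand the hidden layer with `extra` fresh units."""
        n_in, n_out = self.WI.shape[0], self.WJ.shape[1]
        self.WI = np.hstack([self.WI, self.rng.normal(0, 0.1, (n_in, extra))])
        self.WJ = np.vstack([self.WJ, np.zeros((extra, n_out))])

# Toy usage: map 4-dim "spectrogram feature" vectors to 2 word classes.
X = np.array([[0., 0, 1, 1], [1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1]])
Y = np.array([[1., 0], [0, 1], [1, 0], [0, 1]])
net = ExpandableMLP(4, 6, 2)
net.train(X, Y)
net.grow(2)                       # accommodate new features, then retrain
net.train(X, Y, epochs=500)
print(np.round(net.forward(X), 2))
```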

G. Recognition of Speaker using HMM: The Hidden Markov Model (HMM) is a powerful statistical tool for modeling generative sequences that can be characterized by an underlying process generating an observable sequence. HMMs have found application in many areas of signal processing, in particular speech processing, phrase chunking, and extracting target information from documents. Here we propose an HMM for recognition of the speaker [Fig. 6]. The input of the Hidden Markov Model is the data rejected by the ANN.

[Figure: the model connects ANN training, activation, and processing to an acceptance/rejection decision driven by frequency, time, and amplitude parameters] Fig. 6. Proposed Hidden Markov Model

The calculation of the cut-off value is pre-decided, and in the processing phase the calculation with the input data will be done. During processing, the dataset will be judged against the cut-off value; if not rejected, then the analysis of frequency, time, and amplitude will be done and the extraction of a new feature will be calculated. With the new acceptance parameter, the artificial neural network will be retrained and the hidden layer in the ANN will be expanded, as sketched below.
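A hedged sketch of this speaker stage using the third-party hmmlearn package: one Gaussian HMM per enrolled speaker scores the ANN-rejected sequence, and a pre-decided cut-off on the per-frame log-likelihood accepts or rejects the best match. The features, model sizes, and cut-off value are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from hmmlearn import hmm   # third-party package, assumed available

rng = np.random.default_rng(0)

# Toy per-frame feature sequences (e.g., frequency/time/amplitude features)
# for two enrolled speakers; real input would be the ANN-rejected data.
speaker_a = rng.normal(0.0, 1.0, (200, 3))
speaker_b = rng.normal(2.0, 1.0, (200, 3))

# One small Gaussian HMM per enrolled speaker.
models = {}
for name, data in [("A", speaker_a), ("B", speaker_b)]:
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
    m.fit(data)
    models[name] = m

CUTOFF = -6.0   # hypothetical pre-decided cut-off (log-likelihood per frame)

def identify(seq):
    """Score a rejected sequence against each speaker model; accept the
    best match only if it clears the pre-decided cut-off, else reject."""
    scores = {n: m.score(seq) / len(seq) for n, m in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > CUTOFF else None

test = rng.normal(2.0, 1.0, (80, 3))   # unseen utterance from speaker B
print(identify(test))                  # expect "B" if above the cut-off
```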

IV. SETUP AND OUTPUT

During the research we used the most eligible setup and equipment to get the best analysis results, though the cheapest hardware is also compatible with the work. We used an Earthworks SR-30 for recording, based on David Blackmer's principle that extending a microphone's frequency response beyond the normal range of hearing allows for a much higher-definition sound. To handle the wide variety of measurement tasks in product development, an instrument must offer ample functionality and excellent performance in all areas of interest. The R&S®FSU spectrum analyzer fully meets these requirements, with 3.6/8/26/43/46/50/67 GHz models, a displayed average noise level of –152 dBm at 2 GHz and –148 dBm at 26 GHz (1 Hz bandwidth), and typically 77 dB ACLR for 3GPP (typically 84 dB with noise correction). For the recording we used general-purpose hardware for storage.

[Figure] Fig. 7. Acquisition of speech signal (Dataset 1)

[Figure] Fig. 8. Analysis of Spectrogram (Dataset 2)

Table 1: Analysis of training parameters and features

Base | Data Set-1 | Data Set-2 | Data Set-3 | Data Set-4 | Data Set-5 | Sample
12 | -27.422125 | -27.422125 | -27.422125 | -27.422125 | -27.422125 | -27.422125
23 | -27.554403 | -27.554403 | -27.554403 | -27.554403 | -27.554403 | -27.554403
35 | -26.37182 | -26.37182 | -26.37182 | -26.37182 | -26.37182 | -26.37182
47 | -27.20977 | -27.20977 | -27.20977 | -27.20977 | -27.20977 | -27.20977
59 | -25.490559 | -25.490559 | -25.490559 | -25.490559 | -25.490559 | -25.490559

[Figure] Fig. 9. Analysis of Acceptance report and Error Graph

V. RESULT AND DISCUSSION

We recognized speech and speakers using an Artificial Neural Network based on speech frequencies ranging from 12 Hz to 586 Hz, with five different datasets generated from five different speakers. The input datasets consist of intensity values from spectrum analysis of the speech signal. During the spectrogram analysis, amplitude extraction of the speech signal is done, which is considered as the input to the neural network. Before considering the amplitudes we processed the signals for neutralization. As the effect of noise is reduced, we obtained a speech signal ranging from 0 dB to 140 dB. Still, this signal is not ready to be processed by the artificial neural network, as it is not yet normalized and contains multiple peak variations due to syllable and accent effects. Therefore we neutralize the speech signal, whereby the long spectra became feasible to study and the silent spectra became remarkably separable from the long spectra, making the speech signal ready to be processed with artificial neural networks. The ANN recognizes the speech signals based on the predefined acoustic parameters. The eight acoustic features that were used included the four formants F1 through F4, the spectral slope, the harmonic difference H1-H2, and the aperiodicity and periodicity contents of the speech signal. Moreover, during training the nodes are expandable to accommodate the new features. The use of the hidden Markov model finally decides the rejection or addition of a new parameter in the hidden layers. The calculation of the cut-off value is pre-decided, and in the processing phase the calculation with the input data will be done. During processing, the dataset will be judged against the cut-off value; if not rejected, then the analysis of frequency, time, and amplitude will be done and the extraction of a new feature will be calculated.
In the last decade, this line of work has continued with the EARS project, which undertook recognition of Mandarin and Arabic in addition to English, and the GALE project, which focused solely on Mandarin and Arabic and required translation simultaneously with speech recognition. Still, those systems are not free from errors, as most of the time theoretical algorithms are applied. We have produced a system tested on the word "Hello" for five different speakers [Table 1] and matched against the system voice. In the analysis of the acceptance report [Fig. 9] we noticed that in more than eighty percent of the cases, without much training, the data are accepted. In the remaining fifteen percent of cases, new parameters need to be extracted for training and recognition.

VI. CONCLUSION

We presented the application of artificial neural networks and the hidden Markov model for speech and speaker recognition. The work majorly focuses on the acquisition of the speech signal, analysis of the spectrogram, neutralization, extraction of features for recognition, and mapping of speech using Artificial Neural Networks. Additionally, recognition of the speaker using the Hidden Markov Model is also part of this work, which generates the new features for recognition. This work will be generalized in the future for human-led machine learning.

ACKNOWLEDGMENT

Parts of this research were supported by Prof. P. Rammohan Rao, Mr. L. Naveen Kumar, Mr. K. Sandeep, and Prof. N. Subba Reddy.

REFERENCES

[1] C. T. Chen, W. D. Chang, "A Feedforward Neural Network with Function Shape Autotuning", Neural Networks, Vol. 9, No. 4, pp. 627-641, June 1996.
[2] F. Piazza, A. Uncini, M. Zenobi, "Artificial Neural Networks with Adaptive Polynomial Activation Function", in Proceedings of IJCNN, Beijing, China, pp. II-343-349, Nov. 1992.
[3] F. Piazza, A. Uncini, M. Zenobi, "Neural Networks with Digital LUT Activation Function", in Proceedings of IJCNN, Nagoya, Japan, pp. II-1401-1404, 1993.
[4] J.-N. Hwang, S.-R. Lay, M. Maechler, R. D. Martin, J. Schimert, "Regression Modeling in Back-Propagation and Projection Pursuit Learning", IEEE Transactions on Neural Networks, 5(2), pp. 342-353.
[5] S. Guarnieri, F. Piazza, A. Uncini, "Multilayer Neural Networks with Adaptive Spline-based Activation Functions", in Proceedings of WCNN '95, Washington D.C., USA, pp. I-695-I-699, 1995.
[6] L. Vecci, P. Campolucci, F. Piazza, A. Uncini, "Approximation Capabilities of Adaptive Spline Neural Networks", in Proceedings of ICNN '97, Houston, TX, 1997.
[7] L. Vecci, F. Piazza, A. Uncini, "Learning and Approximation Capabilities of Adaptive Spline Activation Function Neural Networks", accepted for publication in Neural Networks.
[8] E. Catmull, R. Rom, "A Class of Local Interpolating Splines", in R. E. Barnhill, R. F. Riesenfeld (eds.), Computer Aided Geometric Design, Academic Press, New York, 1974, pp. 317-326.
[9] N. Benvenuto, M. Marchesi, F. Piazza, A. Uncini, "A Comparison between Real and Complex-valued Neural Networks in Communication Applications", in Proceedings of ICANN '91, Espoo, Finland, June 1991.
[10] H. Leung, S. Haykin, "The Complex Backpropagation Algorithm", IEEE Trans. Acoust., Speech and Signal Process., Vol. ASSP-39, pp. 2101-2104, Sept. 1991.
