You are on page 1of 20

Int J Speech Technol

DOI 10.1007/s10772-017-9400-x

An experimental framework forArabic digits speech recognition


innoisy environments
AzzedineTouazi1,2 MohamedDebyeche1

Received: 10 October 2016 / Accepted: 15 January 2017


Springer Science+Business Media New York 2017

Abstract In this paper we present an experimental frame- effectiveness of syllable-like unit in building Arabic digits
work for Arabic isolated digits speech recognition named recognition system, which exceeds word-like unit by an
ARADIGITS-2. This framework provides a performance overall Word Accuracy Rate of 0.44 and 0.58% for clean
evaluation of Modern Standard Arabic devoted to a Dis- and multi-condition training modes, respectively.
tributed Speech Recognition system, under noisy environ-
ments at various Signal-to-Noise Ratio (SNR) levels. The Keywords Spoken Arabic digits recognition Distributed
data preparation and the evaluation scripts are designed speech recognition ETSI-AFE AURORA-2 database
by deploying a similar methodology to that followed in Hidden Markov Models
AURORA-2 database. The original speech data contains
a total of 2704 clean utterances, spoken by 112 (56 male
and 56 female) Algerian native speakers, down-sampled 1Introduction
at 8 kHz. The feature vectors, which consist of a set of
Mel Frequency Cepstral Coefficients and log energy, are The recent progress in wireless applications such as speech
extracted from speech samples using ETSI Advanced recognition over mobile devices has led to the development
Front-End (ETSI-AFE) standard; whereas, the Hidden of clientserver recognition systems, also known as Dis-
Markov Models (HMMs) Toolkit is used for building the tributed Speech Recognition (DSR) (Pearce 2000). In DSR
speech recognition engine. The recognition task is con- architecture, the front-end client is located in the terminal
ducted in speaker-independent mode by considering both device and it is connected over a protected data channel
word and syllable as acoustic units. Therefore, an optimal to a remote back-end recognition server. Although many
fitting of HMM parameters, as well as the temporal deriva- technological developments have been made, the existing
tives window, is carried out through a series of experiments speech recognition performance still needs improvement,
performed on the two training modes: clean and multi- particularly when the speech utterances are exposed to high
condition. Better results are obtained by exploiting the noise environments.
polysyllabic nature of Arabic digits. These results show the With regard to developing evaluation databases for DSR
systems in multiple languages, there have been the activi-
* Azzedine Touazi ties of AURORA working group (Hirsch and Pearce 2000;
touazi.azzedine@gmail.com; atouazi@cdta.dz Pearce 2001; AURORA 2006). Their evaluation scenarios
Mohamed Debyeche have had a considerable impact on noisy speech recogni-
mdebyeche@gmail.com tion research; this includes AURORA-2, AURORA-3,
1
AURORA-4, and AURORA-5 databases. AURORA-2 is
Speech Communication andSignal Processing Laboratory
a small vocabulary evaluation of noisy connected digits
(LCPTS), Faculty ofElectronics andComputer Science,
University ofScience andTechnology Houari Boumediene for American English talkers; the task performs speaker-
(USTHB), Bab Ezzouar, Algiers, Algeria independent recognition of isolated and connected digits,
2
Center forDevelopment ofAdvanced Technologies (CDTA), with and without adding background noise. AURORA-3
Baba Hassen, Algiers, Algeria consists of a noisy small vocabulary speech recorded

13
Vol.:(0123456789)
Int J Speech Technol

inside cars, and it serves to test the frond-end from dif- speech database for MSA native speakers from 11 Arab
ferent languages, namely Finnish (Nokia 2000), Danish countries, in which a total of 415 sentences are recorded by
(Lindberg 2001), Spanish (Macho 2000), German (Netsch 40 speakers (20 male and 20 female). This database takes
2001), and Italian (Knoblich 2000). AURORA-4 pro- into account the speaker variability such as gender, age,
vides a large vocabulary continuous speech recognition country, and region, with the motivation of making it suit-
tasks, which aims to compare the effectiveness of differ- able for the design and development of Arabic continuous
ent DSR front-end algorithms. In addition to its similarity speech recognition systems.
to AURORA-2, AURORA-5 covers the distortion effects Moreover, there exist data centers that provide relevant
caused by the hands-free speech input inside a room. Fur- speech databases for both MSA and MCA. For example,
thermore, the AURORA tasks have been distributed with the European Language Resources Association (ELRA),
the HMMs Toolkit (HTK) scripts, which allow attaining where the most popular project was NEMLAR broadcast
easily the baseline performance for further speech recogni- news speech Arabic corpus. This project is composed of
tion research. about 40h of MSA recorded from four different radio sta-
Numerous evaluation methodologies and frameworks for tions (ELRA 2005). Also, the Linguistic Data Consortium
AURORA-2 were developed by the working group of the (LDC) has developed recently a new database that contains
information processing society for Japanese noisy speech 590h of recorded Arabic speech from 269 male and female
recognition, namely CENSREC-1/AURORA-2J, CEN- speakers. The LDC recordings are conducted by the speech
SREC-2, CENSREC-3, and CENSREC-4 (Nakamura etal. group at King Saud University in different noise environ-
2005; Fujimoto etal. 2006; Nishiura etal. 2008). The first ments for read and spontaneous speech (LDC 2014). How-
developed version, AURORA-2J, contains Japanese noisy ever, with such diversity and richness of Arabic language,
connected digit utterances and their associated HTK evalu- there is a difficult process to generate from the existing
ation scripts. CENSREC-2 is another database for evalua- databases a proper dataset for the problem at hand, as well
tion of noisy continuous digits recognition whose data were as for testing the industrialized speech platforms.
recorded in real car driving environments. CENSREC-3 Automatic recognition of spoken digits is essential in
database contains speech utterances, of isolated words, many DSR application areas for different languages. Com-
recorded in similar environments to those considered in pared to other commonly used languages, a limited num-
CENSREC-2. The last developed database, CENSREC-4, ber of recent efforts on building Arabic digit recognizers
is an evaluation framework of distant-talking connected have been conducted. Among the previous researches,
digit utterances in various reverberation conditions. using either Artificial Neural Networks (ANNs) or Hidden
Arabic is one of the most widely spoken languages in the Markov Models (HMMs) as recognition engine, the works
world. Nowadays, it is considered as the fifth widely used presented in (Alotaibi et al. 2003; Alotaibi 2005, 2008;
language, the native language of more than 350million of Hyassat and Abu Zitar 2006; Amrouche et al. 2010; Ma
people (World Bank 2016) as well as the liturgical language & Zeng 2012; Hajj and Awad 2013). However, there is a
for over a billion Muslims around the world. In the Arab lack of common Arabic digits database for evaluation and
world today, there are two forms of Arabic, The Modern results comparison of the systems proposed. This is prin-
Standard Arabic (MSA) and the Modern Colloquial Ara- cipally due to the differences in the types of features and
bic (MCA). MSA, commonly known as the modern form noises used, and differences in the testing methodologies.
of classical Arabic (or Quranic Arabic), is considered as The main objective of this work is to investigate Ara-
the official language in academic institutions, written and bic spoken digits from the speech recognition point of
broadcasted Arabic media, and civil services. The vernacu- view. We introduce an Arabic noisy speech speaker-inde-
lar or colloquial form is the most used when people speak pendent isolated digits database and its evaluation scripts,
about everyday life topics. Moreover, there exists a variety named ARADIGITS-2. This database is particularly con-
of MCA forms from different Arabic regions (e.g. Middle ceived to evaluate the recognition performance of MSA
East, North Africa, and Egypt). digits in a DSR system. The data preparation (i.e. speech
Lack of Arabic language resources is one of major files, used noises, and text transcriptions) and the HTK
issues confronted by the Arabic speech research commu- evaluation scripts are designed by drawing inspiration from
nity. Among the most relevant developed corpora one can AURORA-2 database (Hirsch and Pearce 2000). The spo-
cite the Orientel project (Siemund et al. 2002). This pro- ken digit utterances are 112 Algerian MSA native speakers
ject covers an important package of data collection, on (56 male and 56 female) corrupted by additive noises at dif-
both MSA and MCA ranging from Mediterranean to Mid- ferent Signal-to-Noise Ratio (SNR) levels. The European
dle East countries, including Turkey and Cyprus, as well Telecommunications Standards Institute Advanced Front-
as applications for mobile and multi-modal platforms. End (ETSI-AFE) standard (ETSI 2007) is used for Mel
Abushariah etal. (2012) have developed a large vocabulary Frequency Cepstral Coefficients (MFCCs) feature vectors

13
Int J Speech Technol

extraction and compression. Whereas, the recognition task toolkit. A detailed description of the ARADIGITS-2 data
is performed using the two training modes: clean (that is, preparation and HTK parameterization is the object of
models are trained with clean data and the test is performed Sect. 3. The recognition system performance obtained by
with noisy data) and multi-condition (that is, training is empirically fine-tuning the suitable recognition parameters,
performed with clean and noisy data). for both word and syllable-like acoustic models, is pre-
For small vocabulary tasks such as digits recognition, sented in Sect.4. Finally, we summarize the conclusion of
word acoustic unit is used more commonly. In this work, the presented work in Sect.5, as well as further work that
a syllable-based recognition system is also designed. The needs to be completed.
application of syllable unit is motivated by the polysyl-
labic nature of Arabic digits that has many differences in
terms of syllable types and numbers (Naveh-Benjamin and
Ayres 1986; Ryding 2005). When compared to other lan- 2DSR standards andAURORA2 database
guages such as English, Arabic digits have about twice as
many syllables per digit as those in English. For example, 2.1DSR standards
in Naveh-Benjamin & Ayres (1986) for the chosen four lan-
guages, English, Spanish, Hebrew, and Arabic, the mean As depicted in Fig. 1, the main idea of DSR consists of
number of syllables per word for the digits (06, 8, 9) is 1, using a local front-end terminal from which the MFCC
1.625, 1.875, and 2.25, respectively. vectors are extracted and transmitted, through an error
The motivation for building DSR system using Algerian protected data channel, to a remote back-end recognition
MSA rather than Algerian MCA is two-fold: (i) As the case server. Compared to the traditional network-based auto-
of different Arabic countries, the divergence among sev- matic speech recognition, a DSR system provides specific
eral Algerians MCA makes very complex the task of col- benefits for mobile services, such as (i) acoustic noise
lecting data and designing a common recognition system. compensation at the client side, (ii) low bit-rate transmis-
It is therefore similar to the case of MCA digits recogni- sion over data channel, and (iii) improved recognition
tion where the digits are pronounced quietly different from performance.
place to place and town to town and (ii) In some DSR ser- In the basic DSR standard ETSI Front-End (ETSI-FE)
vices where the front-end spoken digit numbers are highly (ETSI 2003a), the speech features (i.e. MFCC compo-
important, such as bank account, credit card and insurance nents) are derived from the extracted speech frames, in
identification, the use of a recognition system based on the front-end part, at frame length of 25 ms with frame
MSA may guarantee more accurate performance. shift of 10 ms, using Hamming windowing. A Fourier
The remainder of this paper is structured as follows: In transform is then performed and followed by a Mel filter
Sect.2, we present a general overview of ETSI DSR stand- bank with 23 bands in the frequency range from 64 up
ards, AURORA-2 database, and HTK speech recognition to 4 kHz. The extracted features are the first 12 MFCCs

Fig.1DSR system architecture Speech Utterance


Front-end (Client)

MFCC
Extraction Source/Channel Encoder

Bitstream

Back-end (Server)

HMMs Time derivatives


estimation

Recognition
Decision
HMM-based MFCC
Recognizer Channel/Source Decoder

13
Int J Speech Technol

(C1-C12) and the energy coefficients, such as (C0) and the these additional components will increase relatively the
log energy (log E) in each extracted frame. compression bit-rate.
The different blocks of the ETSI front-end MFCC An advanced front-end feature extraction and com-
extraction algorithm are illustrated in Fig.2. pression algorithms (ETSI-AFE) (ETSI 2007) have been
The abbreviation of each block is listed as bellow: published by ETSI for robust speech recognition. The
standardized AFE provides considerable improvements in
ADC: Analog to Digital Converter recognition performance in the presence of background
Offcom: Offset Compensation noise. In the feature extraction part of the ETSI-AFE stand-
Framing: Frame length is 25 ms, with frame shift is ard, noise reduction is performed first, which is based on
10ms Wiener filtering theory. Then, MFCCs coefficients and log
PE: Pre-emphasis Filter, with a factor of 0.97 energy are computed from the de-noised signal and blind
Log E: Log Energy Computation equalization is applied to cepstral features. Voice activity
W: Hamming Windowing detection (VAD) for the non-speech frame dropping is also
FFT: Fast Fourier Transform (only magnitude compo- implemented in the front-end feature extractor. The VAD
nents are considered) flag is used for excluding the non-speech frames from the
MF: Mel filter bank with 23 frequency bands recognition task.
LOG: Nonlinear Transformation On the server side, unlike the conventional ETSI-FE
DCT: Discrete Cosine Transform standard, where the cepstral derivatives are computed
through the HTK recognition engine using centered finite
In the compression task (i.e. source coding), the difference approximation (Young et al. 2006), ETSI-AFE
14-dimentional feature vector [C1, C2, ..., C12, C0, log E] includes additional scripts that compute these coefficients
is split into seven sub-vectors, and each of them is quan- based on polynomial approximation (more details are pro-
tized with its own 2-dimensional vector quantizer. The vided in Sect.3.2). Also, in ETSI-AFE back-end side, the
resulting compression bit-rate is 4400 bps and 4800 bps energy coefficients C0 and log E are both used in the recog-
when the overhead and error protection bits are included nition task by employing the following combination:
(i.e. channel coding). In the back-end side delta and
deltadelta coefficients, or time derivatives, are estimated
Ccomb = C0 + logE, (1)
and appended to the 13 static features [C1, C2, ..., C12, C0 where and are set to 0.6/23 and 0.4, respectively (ETSI
or log E], to obtain a total of 39 elements for each feature 2007).
vector.
In some DSR applications, for example, in human 2.2AURORA2 database andHTK toolkit
assisted dictation, the machine and the human recognition
are mixed in the same application, so it may be necessary The original high quality TIDigits database (Leonard
to reconstruct the speech signal at the back-end. The ETSI 1984) is the source speech of AURORA-2 database that
Extended Front-End (ETSI-EFE) standard (ETSI 2003b) consists of isolated and connected digits task. It provides
provides additional parameters such as voicing class and speech samples and scripts to perform speaker-independent
fundamental frequency, which are extracted at the front- speech recognition experiments in clean and noisy condi-
end. These parameters allow reconstructing the speech tions. This database has been prepared by down-sampling
signal at the back-end side. Therefore, the transmission of from the original 20kHz sampling frequency to 8kHz with

Fig.2 Block diagram of fea- Input


ture extraction algorithm (ETSI speech
2003a) Framing
ADC Offcom PE W FFT MF

LOG
Log E

DCT

Feature vector = [C1, C2, ..., C12, C0, log E]

13
Int J Speech Technol

an ideal low pass filter. An additional filtering is applied linear left-to-right with no skips over states, also known
with the two standard frequency characteristics: G.712 as Bakis topology (Bakis 1976). Two silence models are
(ITU-T 1996) and Modified Intermediate Reference System defined, i.e. sil (silence) and sp (short pause). The
(MIRS) (ITU-T 1992). sil model has three emitting states and each state has
AURORA-2 contains eight types of realistic additive six mixtures, while the sp model has only a single
noises with stationary and non-stationary segments (subur- state. Each word model has 16 states with three Gaussian
ban train (subway), babble, car, exhibition hall, restaurant, mixtures per state (in HTK structure, two dummy states
street, airport, and train station) at different SNR levels are added at the beginning and at the end of the given
(clean, 20, 15, 10, 5, 0, and 5 dB). This database contains set of states). Each Gaussian component is defined by the
two training sets of 8440 utterances for each one (clean and global means and variances of acoustic coefficients.
multi-condition sets), and three test sets (set A, set B, and The overall Word Accuracy Rate (WAR) of recogni-
set C). tion experiments, conducted on AURORA-2 task, using
The clean training set is filtered with the G.712 charac- ETSI-AFE standard are summarized in Tables 1 and 2
teristic without any noise added. In multi-condition training (Hirsch and Pearce 2006). These experiments are per-
set, the same utterances are equally split into 20 subsets, formed on different training modes, with and without (i.e.
after filtering with the G.712 characteristic. These subsets baseline) MFCCs compression, and without including the
are corrupted by four noises (subway, babble, car, and exhi- optional VAD parameter.
bition hall) at five different SNR levels (clean, 20, 15, 10,
and 5 dB).
The first test set called test set A, consists of 28,028
utterances filtered with the G.712 characteristic using four 3ARADIGITS2 data description andsystem
different noises, namely subway, babble, car, and exhibi- parameterization
tion hall. In total, this test set consists of 28 subsets where
the noises are added at seven different SNR levels (clean, As outlined in the previous section, the aim of the present
20, 15, 10, 5, 0, and 5 dB). Test set A contains the same work is to develop an Arabic small vocabulary isolated
noises to those used in the multi-condition training set; digits database for DSR applications. The used spoken
this leads to a high match between training and test data. words in ARADIGITS-2 database are the Arabic digits
The second test set called test set B, which is created by zero through nine, which are polysyllabic (except for the
the same way as test set A (i.e. same clean utterances fil- monosyllabic digit zero). Arabic digits include almost
tered with the G.712) but by using four different noises, all types of MSA syllables, which are limited to the fol-
restaurant, street, airport, and train station. The third test lowing combinations of consonants /C/ and vowels /V/
set called test set C contains 14,014 utterances distributed (Ryding 2005):
into 14 subsets, where two different types of noises are con-
sidered, subway and street. In test set C, speech and noises Full form pronunciation syllables: consists of (i) short
are first filtered with the MIRS (i.e. in order to simulate or weak syllables /CV/ (consonantshort vowel) and (ii)
the frequency characteristics received from the terminal long or strong syllables /CVV/ (consonantlong vowel)
device), and then these noises are added at different SNR or /CVC/ (consonantshort vowelconsonant).
levels (clean, 20, 15, 10, 5, 0, and 5 dB). Furthermore, a Additional pause form pronunciation syllables: con-
full description of AURORA-2 database is given in Hirsch sists of super-strong syllables /CVVC/ (consonantlong
& Pearce (2000). vowelconsonant) or /CVCC/ (consonantshort vowel
The HTK toolkit is principally designed for building consonantconsonant).
HMM-based speech processing tools, in particular recog-
nizers. It consists of a set of library modules and tools in C
source code available from http://htk.eng.cam.ac.uk/. The
tools provide sophisticated facilities for speech analysis,
HMM training, testing, and results analysis. There are two
major processing stages involved. Firstly, the HTK training Table1Overall WAR (%) for the AURORA-2 baseline
tools are used to estimate the parameters of a set of HMMs
Training mode Set A Set B Set C Overall
using training utterances and their associated transcrip- (020)
tions. Secondly, unknown utterances are transcribed using dB
the HTK engine.
Clean 87.74 87.09 85.44 87.02
In AURORA-2 task, the model set contains 11 whole
Multi-condition 92.29 91.77 90.77 91.78
word HMM models (digits 0 to 9 and oh) which are

13
Int J Speech Technol

Table2Overall WAR (%) for the AURORA-2 including AFE Algerian native speakers, by considering speaker variabil-
encoder ity such as gender and age (from 18 to 50) and speaking
Training mode Set A Set B Set C Overall style. The speech utterances were recorded in a large and
(020) a very quiet auditory room (with a capacity of up to 1800
dB people), using a high quality microphone. At the time of
Clean 87.81 87.09 85.77 87.11 the recording, the room was largely empty with an over-
Multi-condition 92.14 91.54 90.61 91.59 all environmental Sound Pressure Level (SPL) less than
35 dB. A total of 3600 speech utterances were originally
recorded at 22.050 kHz sampling frequency and was then
Table3Pronunciations of Arabic digits (Alotaibi 2003) down-sampled into 16kHz.
Digit Arabic Pronunciation Syllables Number of
From the original 3600 utterances of ARADIGITS we
(english writing syllables selected and manually verified the transcriptions of a total
writing) of 2704 utterances spoken by 112 speakers, 56 males and
56 females. Some utterances are discarded due to bad qual-
One w-hd CV-CVC 2
ity of recordings and mispronunciations. The wave speech
Two ath-nyn CVC-CVC 2
files are down-sampled from 16 to 8 kHz and then repre-
Three th-l-thh CV-CV-CVC 3
sented as a 16-bit little-endian byte order. These tasks
Four ar-b-ah CVC-CV- 3
CVC can be performed thanks to the Praat script (Boersma and
Five khm-sh CVC-CVC 2 Weenink 2015). The selection of training and test sets is
Six st-th CVC-CVC 2 made in such a way to equilibrate between speaker and gen-
Seven sb-ah CVC-CVC 2 der (equally distributed). In the training set, a total of 1840
Eight th-m-nyh CV-CV-CVC 3 utterances (4052 syllables), produced by 34 males and 34
Nine ts-h CVC-CVC 2 females, are used. In the test set, 864 utterances (1906 syl-
Zero sfr CVCC 1 lables), pronounced by the rest of speakers (i.e. 22 males
and 22 females), are used. The occurrence frequency of
each digit is shown in Table 4, indicating that the occur-
The corresponding pronunciations of the MSA ten dig- rences are balanced to have approximately the same num-
its, as well as the number and types of used syllables, are ber of utterances of each digit.
given in Table3. By following the same way as AURORA-2 database,
ARADIGITS-2 consists of two training sets (for clean and
3.1Data preparation multi-condition training modes), and three test sets, test set
A, test set B, and test set C, where each set is split into four
The speech utterances used in ARADIGITS-2 are gath- sub-sets. The speech signals are filtered with the two standard
ered from the original ARADIGITS spoken Arabic iso- frequency characteristics G712 and/or MIRS (ITU-T 1992;
lated digits database. This database was initially used in ITU-T 1996), and then they are corrupted by eight different
the recognition system evaluation proposed by Amrouche noises at different SNR levels (20, 15, 10, 5, 0, and 5 dB).
et al. (2010). ARADIGITS is a set of Arabic digit utter- The noises used, for considering the variability of environ-
ances pronounced in MSA (from 0 to 9) collected from 120 mental conditions, are issued from AURORA-2 database,

Table4Distribution of Digit Number of Number of Total digits Number of training Number of Total syl-
ARADIGIT-2 original clean training digits test digits syllables test syllables lables
speech data
One 183 85 268 366 170 536
Two 183 84 267 366 168 534
Three 183 87 270 549 261 810
Four 190 87 277 570 261 831
Five 185 87 272 370 174 544
Six 181 87 268 362 174 536
Seven 184 87 271 368 174 542
Eight 182 88 270 546 264 810
Nine 186 88 274 372 176 548
Zero 183 84 267 183 84 267
Total 1840 864 2704 4052 1906 5958

13
Int J Speech Technol

such as suburban train, babble, car, exhibition hall, restaurant, where ES and EN represent the total energy of the speech
street, airport, and train-station. The open-source Filtering signal and the total energy of noise signal, respectively. The
and Noise Adding Tool (FaNT) (Hirsch 2005) is used to fil- speech energy is estimated by using the ITU-T P.56 voltme-
ter speech signals with the appropriate filter characteristic as ter function from the ITU-T Software Tool Library (Neto
defined by the International Telecommunication Union (ITU) 1999).
standards. This tool allows adding a noise signal to clean In total, each ARADIGITS-2 test set consists of 6048
speech at a desired SNR level, which can be expressed as: utterances, and 1840 utterances for each training set. Fig-
( ) ures3 and 4 show a detailed block scheme which describes
ES
SNR(dB) = 10log10 , (2) how the training set is divided between different noises and
EN
noise levels and how each of the test sets are divided between

Fig.3Training dataset in
ARADIGITS-2 (clean and 1840 original training utterances
multi-condition modes)

G712 filtering

Added noises:
Subway, babble, car, and exhibition

SNR levels:
Clean, 20, 15, 10, and 5 dB

Multi-condition training set


Clean training set
(20 subsets)

Fig.4Test dataset in ARA-


DIGITS-2 (test set A, B, and C) 864 original test utterances

G712 filtering G712 filtering G712+MIRS filtering

Adding noises: Adding noises: Adding noises:


Subway, babble, car, and Restaurant, street, Subway, Street, babble,
exhibition airport, and train and restaurant

SNR levels: Clean, 20, 15, 10, 5, 0, and -5 dB

Test set A: Test set B: Test set C:


6048 utterances split into 6048 utterances split into 6048 utterances split into
28 subsets 28 subsets 28 subsets

13
Int J Speech Technol

noise and noise levels. However, unlike AURORA-2, in test



2 TM (2M + 1)t2
set C of ARADIGITS-2 we add two additional sub-sets with wt = ,
M
two noise environments which contain non-stationary seg- 2
TM (2M + 1)

t4 (6)
ments, namely babble, and restaurant. t=M

3.2Time derivatives estimation where


M
In speech recognition systems there are, generally, two ways
to approximate the time derivatives or dynamic features.
TM = t2 . (7)
t=M
Firstly, by exploring the discrete time representation of cep-
stral coefficients, a finite difference approximation is per- However, in ETSI-AFE, the weights are normalized
formed by simply using a first or second-order finite differ- by the maximum weight, such that the largest weight is
ence (e.g. the case of HTK). However, this approximation is equal to 1. Thus, the normalized weights at time t, t
intrinsically noisy and the cepstral sequence usually cannot and t, can be expressed as:
be expressed in a suitable form for differentiation (Soong
wt
1988; Rabiner 1993). The second way, which is the more w t = , (8)
max(w)
appropriate one, approximates the derivatives by the use of
an orthogonal polynomial of each cepstral coefficient trajec-
wt
tory over a finite length window (Furui 1981, 1986; Rabiner w t = . (9)
max(w)
1993).
As defined in the ETSI-AFE standard, the first and second
derivative coefficients, i and i, respectively, for each
3.3HTK parameterization
feature component Ci (i.e. C1- C12 and Ccomb) are computed
on a 9-frame window (i.e. derivative window or the interval
MFCC coefficients are extracted for each speech utter-
over which the derivatives are estimated) using the following
ance based on the client front-end ETSI-AFE extraction
weighted sums (ETSI 2007):
algorithm. The final feature vector consists of the first 12
M MFCC coefficients (C1-C12) and the combination of C0
and log E (Ccomb). Our choice of ETSI-AFE, for MFCCs

C i = w t Ci (t), M = 4, (3)
t=M extraction, is motivated by its specifications which allow
reducing considerably the effect of background noise.
M Also, it is expected that ETSI-AFE will give improved
recognition performance compared to the conventional

C i = w t Ci (t), M = 4, (4)
t=M ETSI-FE for all languages (Pearce 2001).
The HTK speech recognizer engine with software ver-
where t is the frame time index. The respective weighting
sion 3.4 (Young etal. 2006) is used to evaluate the recogni-
coefficients are set as follows:
tion performance of ARADIGITS-2. The digits are mod-
w = [1, 0.75, 0.5, 0.25, 0, 0.25, 0.5, 0.75, 1]. elled as: (i) whole word model, i.e. the recognition unit
w = [1, 0.25, 0.285714, 0.607143, 0.714286, should comprise the whole word and (ii) syllable model
where each word is mapped onto its syllable representation.
0.607143, 0.285714, 0.25, 1]
The number of units is the 10 Arabic digits for word model
According to the weighting coefficient values and from (3)
and (4), one may deduce that the first and second derivatives
are calculated by fitting the cepstral trajectory with a second
order polynomial over a window of 2M+1 frame length
with M=4. For a given M the weighting coefficients wt 1 2 3
and wt, at time t, are estimated by the following formulas
(Rabiner 1993):
t
wt =
TM
, (5)
Fig.5Three states silence model architecture (with two dummy
states)

13
Int J Speech Technol

Table5Speech utterance
analysis and HTK configuration Database name ARADIGITS-2
RAW format 16 bit signed little endian sample format
Sampling frequency 8kHz
MFCC extraction algorithm ETSI-AFE 202 050 Ver. 1.1.5
Cespstral parameters (C1,..., C12, Ccomb)
Derivatives estimation Fitting second order polynomial (Rabiner 1993)
Recognition engine HTK Ver 3.4
HMM model for word -based acoustic unit Ten isolated Arabic digits+silence model
HMM model for syllable-based acoustic unit 19 syllables+silence model
State transition configuration (word or syllable) Left-to-right model without skip
Silence model Three emitting states (with a transition structure)
HTK parameter type MFCC_E_D_A
HTK pruning threshold and limit 250.0 150.0 1000.0

and 19 syllables for syllable model (from Table 3, the 19 where N is the total number of words in the test set, S is
syllables are obtained by eliminating the redundant ones). the number of substitution errors, D is the number of dele-
The silence model is considered as the same case as tion errors, and I is the number of insertion errors (Young
AURORA-2; it consists of three states with a transition et al. 2006). We should point out that the ARADIGITS-2
structure, and six Gaussian mixtures for each state (see results are presented with respect to the convention used in
Fig.5); however, the inter-word short pause sp model is AURORA-2 database, where the overall WAR is calculated
not considered, as the database contains only isolated dig- by considering the performance for SNRs from 20dB down
its. The MFCC extraction is followed by the global means to 0dB.
and the global variances computation. Then, HMM for Initially, we conducted an experiment on ARADIG-
each unit initialized with global means and variances is ITS-2 with similar AURORA-2 model parameters, such
created. The embedded Baum-Welch for HMM parameter as 16 states and three Gaussian mixtures for each rec-
re-estimation scheme is used in the training process. In the ognition unit (i.e. syllable or word). The derivatives are
testing process, the forward/backward algorithm is per- computed on a 9-frame window using formulas (3) and
formed, where the most probable pronunciation is assigned (4). We should highlight that in case of considering syl-
to each word from the transcription file. lables as acoustic models; they can be concatenated to
The ARADIGITS-2 data analysis and HTK parameters form word models according to a syllable pronunciation
are given in Table 5; except to the derivatives window dictionary (this dictionary is generated from pronuncia-
length, the number of emitting states of each digit model, tions in Table3). Table6 shows the overall baseline per-
as well as the number of Gaussian mixtures per state. The formance for both cases word and syllable-based recogni-
optimal configuration of these parameters will be the objec- tion with both clean and multi-condition training modes.
tive of the next section. In the following subsections, a series of experi-
ments will be focused on studying the effects related to
the derivative window lengths, the number of states, as
4System testing andevaluation well as the number of Gaussian mixtures per state. Also,
these parameters are fine-tuned on ARADIGIT-2. This
Recognition experiments are conducted in speaker-inde- optimization will be performed with respect to the best
pendent mode; it means that the data in the test set does
not cross with those in the training set. The baseline sys- Table6Overall WAR (%) for the ARADIGITS-2 using the same
tem uses a feature vectors, which consists of 39-compo- AURORA-2 topology parameters
nent including the 13 extracted static coefficients (C1-C12,
Recognition unit Training mode Set A Set B Set C Overall
Ccomb) and the corresponding first and second deriva- (020)
tive parameters (that is, the used HTK parameter type is dB
MFCC_E_D_A). The recognition performance is measured
Word Clean 84.70 86.25 81.18 84.04
in terms of WAR given by the following formula:
Multi-condition 95.19 95.46 93.47 94.71
S+D+I Syllable Clean 76.23 79.05 73.56 76.28
WAR = 1 , (10)
N Multi-condition 96.00 96.00 93.91 95.30

13
Int J Speech Technol

Fig.6WAR (%) versus derivative window length for ARADIGITS-2 Fig.8WAR (%) versus derivative window length for AURORA-2
(word model) (word model)

Table7Overall WAR (%) for the ARADIGITS-2 using 11-frame


derivative window

Recognition unit Training mode Set A Set B Set C Overall


(020)
dB

Word Clean 85.30 87.99 82.57 85.29


Multi-condition 95.58 95.28 93.87 94.91
Syllable Clean 79.70 81.99 76.06 79.25
Multi-condition 96.25 95.97 94.24 95.49

derivative features estimated over long intervals may help


Fig.7WAR (%) versus derivative window length for ARADIGITS-2 most when the mismatch of training and test is greatest.
(syllable model)
The same experiment is carried out on AURORA-2
database; this can show how Arabic compares to English
recognition performance obtained for each parameter in digits recognition. Results in Figs. 6 and 8 (i.e. when
variation. using the same acoustic unit) indicate that the maximum
accuracy rates, for English digits, are obtained when a
4.1Effects ofderivative window length 9-frame window is used. However, comparing to English
digits, more number of frames may be needed to accurately
The first and second derivative window length is initially represent the transitional information from one phoneme
chosen from the results of the recognition experiments, to another in Arabic digits. Furthermore, as mentioned in
with the two training modes, summarized in Figs.6 and Furui (1986), one of the possible causes of variation in the
7. Here the derivative window length is varied from 5 to optimal window length might be caused by the difference
21 frames, where the same AURORA-2 HMM parame- of languages; however, this still remains to be investigated.
ters are considered. We note that the two derivative win-
dows have the same lengths. 4.2Effects ofnumber ofstates andGaussian mixtures
Comparison of recognition results indicates that gen- perstate
erally the maximum accuracy rates are obtained when
an 11-frame window is used to calculate the derivatives. We will now study the effects related to the number of
However, it can be observed that for the case of syllable- states and mixtures in Arabic digits. We first adopt an
based models trained with clean speech, the maximum 11-frame derivative window for both HMM models (word
performance is achieved at 17-frame window, with a or syllable) and training conditions. After finding the opti-
degraded recognition rate comparing to the case of word mal number of states and number of mixtures, another fine-
level unit. This may be due to the fact that the chosen tuning of the derivatives window length will be performed.
model parameters are inappropriate. Furthermore, stud- The overall WAR using 11-frame is shown in Table7.
ies on effects of derivative window, e.g. in Applebaum The issue of the number of states to use in each unit
and Hanson (1991) and Lee etal. (1996), expect that the model and the number of mixture densities per state is

13
Int J Speech Technol

Fig.9WAR (%) versus number of mixtures and states in clean train- Fig.11WAR (%) versus number of mixtures and states in clean
ing (word model) training (syllable model)

Fig.10WAR (%) versus number of mixtures and states in multi-con- Fig.12WAR (%) versus number of mixtures and states in multi-con-
dition training (word model) dition training (syllable model)

highly dependent on both the amount of training data and To illustrate the effects of varying the number of states
the vocabulary words (Rabiner et al. 1989). In this work, and the number of mixtures, Figs.9, 10, 11, and 12 show
the number of states is globally optimized, where each the overall WAR in clean and multi-condition training
acoustic model is allowed to have the same number of modes. It can be seen that the overall best accuracies are
states. This implies that the models will work best when achieved at 16 states and the number of Gaussian mix-
they represent units with the same number of phonemes tures can take the two values 3 or 6. However, the reached
(Rabiner 1989). The optimization procedure is principally maximum accuracy of syllable model remains degraded,
based on varying the number of states for each model (4, 8, in clean training mode. We will show later how the accu-
12, 16), and the number of Gaussian mixtures for each state racy of syllable model is improved through a number of
(3, 6, 9, , 27). However, other optimal combinations may investigations.
be found by further increasing the number of states or mix-
tures, but with relatively more computational complexity. 4.3Fitting ofmodels parameters
During the training process, the mixture densities are
gradually increased as follows. We start by a single Gauss- Previously, various tests are conducted in order to show
ian density per state for all acoustic models. We collect the effects of derivative window length and HMM topol-
statistics from the training data and estimate the model ogy parameters (i.e. the number of states and the number
parameters by applying three re-estimation iterations. Then, of Gaussian mixtures). The optimal combination of initial
the mixture densities are increased and the parameters are estimates is to adopt an 11-frame derivative window, 16
re-estimated by applying three further iterations. This pro- states and three or six Gaussian mixtures per state. How-
cess is repeated until the number of desired mixtures is ever, further investigations are needed to further optimize
obtained. Finally, further seven re-estimation iterations are the parameters, as well as to provide more reliable assess-
performed (Hirsch and Pearce 2000). ment of syllable and word based recognition systems.

13
Int J Speech Technol

Table8Overall WAR (%) for high and low SNR levels, in clean More precisely, to assess the effects of noise on recogni-
training mode tion performance, results in Tables 9, 10, 11 and 12 dis-
Recognition unit SNR levels (dB) Set A Set B Set C Overall play detailed information about confusion among digits in
(020) different training modes for a set of high noisy utterances
dB extracted from test set A. Confusion matrices reveal the
Word (05) 69.27 74.02 64.35 69.21 zero model as the cause for most confusions, in clean
(1020) 95.99 97.30 94.72 96.00 training mode. However, the confusion effect of zero
Syllable (05) 56.42 59.55 49.31 55.09 model in word-based system is relatively less impor-
(1020) 95.22 96.95 93.90 95.36 tant with respect to syllable-based system. From results
obtained with multi-condition training, we expect that the
digit zero is more susceptible to noisy environments than
Table8 shows more details of results of Table7 in which the other digits.
an analysis of WAR versus SNR levels is conducted with Furthermore, in case of considering word unit, although
clean training. The accuracies are analyzed for high SNRs zero segments are masked by noise; this digit can be eas-
which stand for (1020) dB and low SNRs which stand ily discriminated as it is the only monosyllabic one with
for (05) dB. We can see that the use of syllable like unit the shortest duration; however, when the syllable model is
may lead to performance that is prone to low recognition adopted, the recognition of class zero will be more dif-
rates for low SNRs. This is due to its weak robustness to ficult as the models are all monosyllabic, so the discrimina-
high noise levels. However, the results using both acoustic tion by segment duration may be less. In other word, the
units are somewhat similar, in case of low noises (i.e. high use of an acoustic unit with a longer duration facilitates
SNRs). exploitation of simultaneously temporal and spectral varia-
tions (Gish and Ng 1996; Ganapathiraju etal. 2001).

Table9Confusion matrix for Digit One Two Three Four Five Six Seven Eight Nine Zero WAR (%)
word model (clean training)
One 106 1 1 5 0 0 0 1 0 56 62.35
Two 0 153 0 0 0 0 0 0 0 15 91.07
Three 0 5 133 3 1 0 0 2 1 29 76.44
Four 0 4 2 108 43 0 1 0 3 13 62.07
Five 0 2 0 1 152 0 0 1 0 18 87.36
Six 0 5 0 0 8 62 0 0 57 42 35.63
Seven 0 3 2 13 18 0 70 0 6 62 40.23
Eight 0 7 4 0 0 0 0 101 0 64 57.39
Nine 0 5 1 1 5 0 0 0 153 11 86.93
Zero 0 3 1 1 0 1 1 0 2 159 94.64
Overall 69.27

Table10Confusion matrix for Digit One Two Three Four Five Six Seven Eight Nine Zero WAR (%)
word model (multi-condition
training) One 165 1 0 2 0 0 0 1 0 1 97.06
Two 0 165 1 0 0 0 0 2 0 0 98.21
Three 0 7 161 2 0 0 0 1 0 3 92.53
Four 0 0 0 171 2 0 0 0 0 1 98.28
Five 0 0 1 3 167 1 0 0 0 2 95.98
Six 0 6 0 0 0 136 0 0 25 7 78.16
Seven 0 0 0 19 0 0 145 0 0 10 83.33
Eight 0 12 3 0 0 0 1 155 0 5 88.07
Nine 0 7 0 2 2 10 3 0 152 0 86.36
Zero 1 1 1 0 0 3 3 0 1 158 94.05
Overall 91.15

13
Int J Speech Technol

4.3.1Fixing thezero model (unvoiced fricative) followed by the short vowel /i/ and
then it ends with two consonants, namely /f/ (unvoiced
In Arabic language, the word zero, pronounced as fricative) and /r/ (voiced lateral) (Alotaibi 2005).
sfr, is a monosyllable word with type /CVCC/. This The digit zero is mainly dominated by unvoiced pho-
type of syllable is less frequent and it occurs only at the nemes. Generally, this part of speech can be masked more
word-final position or in monosyllabic words (Al-Zabibi easily by the background noise, because it is often acous-
1990). The digit sfr consists of mainly unvoiced pho- tically noise-like and its energy is usually much weaker
nemes and it is usually the shortest among all the other than that of voiced speech (Hu and Wang 2008). Therefore,
digits. It starts with the relatively long consonant /s/ we expect that more number of mixture densities will be

Table11Confusion matrix for Digit One Two Three Four Five Six Seven Eight Nine Zero WAR (%)
syllable model (clean training)
One 62 0 0 0 0 0 0 0 0 108 36.47
Two 0 137 0 0 0 0 0 0 0 31 81.55
Three 0 1 97 0 0 0 4 0 0 72 55.75
Four 0 6 0 75 36 0 3 0 6 48 43.10
Five 0 0 0 0 118 0 1 0 0 55 67.82
Six 0 2 0 0 9 30 0 0 37 96 17.24
Seven 0 0 0 0 2 1 63 0 4 104 36.21
Eight 0 4 1 0 0 0 0 85 0 86 48.30
Nine 0 5 0 0 3 1 0 0 141 26 80.11
Zero 0 0 0 0 0 0 0 0 2 166 98.81
Overall 56.37

Table12Confusion matrix for Digit One Two Three Four Five Six Seven Eight Nine Zero WAR (%)
syllable model (multi-condition
training) One 164 0 0 0 0 0 4 0 0 2 96.47
Two 0 163 1 0 0 0 1 0 2 1 97.02
Three 0 5 156 0 0 0 4 0 0 9 89.66
Four 0 0 0 169 3 0 1 0 0 1 97.13
Five 0 0 0 1 170 0 2 0 0 1 97.70
Six 0 1 0 0 1 139 0 0 23 10 79.89
Seven 0 0 0 1 0 1 167 0 1 4 95.98
Eight 1 13 4 0 0 0 3 142 0 13 80.68
Nine 0 4 0 0 0 10 2 0 159 1 90.34
Zero 0 0 0 0 0 2 3 0 2 161 95.83
Overall 92.01

Fig.13WAR (%) by varying


number of mixtures of zero
model in clean training

13
Int J Speech Technol

Fig.14WAR (%) by varying


number of mixtures of zero
model in multi-condition train-
ing

Table13Optimal number of Recognition unit Training mode Zero model Remaining Number of
mixtures for each recognition models states per
unit in ARADIGIT-2 model

Word Clean 18 6 16
Multi-condition 9 6 16
Syllable Clean 27 3 16
Multi-condition 12 3 16

Table14Confusion matrix for Digit One Two Three Four Five Six Seven Eight Nine Zero WAR (%)
word model after fixing zero
model (clean training) One 122 18 4 9 8 0 2 2 1 4 71.76
Two 0 168 0 0 0 0 0 0 0 0 100
Three 0 16 147 0 4 0 0 0 1 6 84.48
Four 0 5 1 109 51 0 3 0 4 1 62.64
Five 0 5 0 1 166 0 0 0 0 2 95.40
Six 0 18 0 0 13 67 0 0 68 8 38.51
Seven 0 8 5 13 35 0 82 0 8 23 47.13
Eight 0 44 4 0 1 0 0 110 0 17 62.50
Nine 0 10 1 1 8 0 0 0 155 1 88.07
Zero 0 18 3 0 5 2 4 0 5 131 77.98
Overall 72.74

needed to capture the large amount of variations in the fea- We conducted a second experiment with the new opti-
ture space of this digit. The solution of increasing acoustic mized mixtures by considering the same previous test
resolution using sub-word modeling has demonstrated its utterances for low SNRs, in order to see the effects of the
effectiveness in improving the recognition performance, in confusion among digits. Tables14, 15, 16 and 17 show
previous research studies, e.g. (Lee etal. 1990). that using the optimal densities of digit zero increases
Various experiments have been conducted in order to the number of correctly predicted elements from the
assess the effects of only increasing the number of mixtures confusion matrix diagonal. It can also be noticed that the
of the digit zero (3, 6,, and 36 mixtures). As shown overall system accuracy using syllable as acoustic unit
in Figs.13 and 14, the degraded performance due to high outperforms word unit.
noise level is compensated by increasing the zero model Furthermore, it can be observed from previous results
densities. The optimal numbers of mixtures in which the (see Fig. 11) that there is no improvement in system
best overall accuracies are achieved, when the rest of the performance if mixtures of all syllables are increased
models are trained either by three or six mixtures, are illus- together. This could explain the effects on performance
trated in Table13. by increasing only the mixtures of zero model.

13
Int J Speech Technol

Table15Confusion matrix for Digit One Two Three Four Five Six Seven Eight Nine Zero WAR (%)
word model after fixing zero
model (multi-condition training) One 165 2 0 2 0 0 0 1 0 0 97.06
Two 0 163 1 0 0 1 0 3 0 0 97.02
Three 1 3 164 1 0 1 1 1 0 2 94.25
Four 0 0 0 173 1 0 0 0 0 0 99.43
Five 0 1 2 3 165 0 1 0 0 2 94.83
Six 0 4 0 0 1 138 0 0 24 7 79.31
Seven 0 0 0 24 0 0 144 0 0 6 82.76
Eight 0 10 2 0 0 0 0 162 0 2 92.05
Nine 0 6 0 1 1 13 2 0 153 0 86.93
Zero 2 0 3 0 1 4 8 1 1 148 88.10
Overall 91.15

Table16Confusion matrix Digit One Two Three Four Five Six Seven Eight Nine Zero WAR (%)
for syllable model after fixing
zero model (clean training) One 143 7 1 0 3 1 3 0 4 8 84.12
Two 0 166 0 0 0 0 0 0 0 2 98.81
Three 0 17 138 0 4 1 4 1 0 9 79.31
Four 0 8 0 90 54 1 6 0 13 2 51.72
Five 0 5 0 0 165 0 1 0 0 3 94.83
Six 0 8 0 0 16 49 0 0 87 14 28.16
Seven 0 5 2 0 11 12 113 0 17 14 64.94
Eight 0 37 6 0 1 0 0 113 0 19 64.20
Nine 0 8 0 0 2 1 0 0 163 2 92.61
Zero 0 11 2 0 6 0 4 0 6 139 82.74
Overall 74.02

Table17Confusion matrix Digit One Two Three Four Five Six Seven Eight Nine Zero WAR (%)
for syllable model after fixing
zero model (multi-condition One 166 0 0 0 0 0 4 0 0 0 97.65
training)
Two 0 164 1 0 0 0 1 0 1 1 97.62
Three 2 6 160 0 0 0 4 0 0 2 91.95
Four 0 0 0 169 3 0 2 0 0 0 97.13
Five 0 0 0 0 171 0 2 0 0 1 98.28
Six 0 1 0 0 1 142 2 0 22 6 81.61
Seven 0 0 0 1 0 0 169 0 2 2 97.13
Eight 1 15 4 0 0 0 3 145 0 8 82.39
Nine 0 4 0 0 0 10 1 0 160 1 90.91
Zero 0 0 0 0 0 2 6 0 3 157 93.45
Overall 92.77

4.3.2Optimizing thederivative window length be seen that the optimal derivative window is changed
only when using syllable model in clean training. The
We performed another experiment to determine the opti- new window length for syllable model is reduced from
mal derivative window length, based on the maximum 17-frame to 13-frame. Other results in Table18 show that
achieved performance. In this experiment, the recognition better overall recognition performance is obtained when
systems use the obtained optimal mixtures for both acous- using syllable model in both training conditions, with the
tic units (refer to Table13). Comparing results in Figs.15 new optimized configuration.
and 16 with the previous results in Figs. 6 and 7, it can

13
Int J Speech Technol

levels. The HMM models are trained by considering the


following optimized parameters: (i) Each syllable model
has 16 states, with three mixtures per state; except to the
zero model which has 27 and 12 mixtures per state, for
clean and multi-condition, respectively (refer to Table 13)
and (ii) 13-frame and 11-frame sizes are used to estimate
the derivatives for clean and multi-condition, respectively
(refer to Fig. 16). Similar to AURORA-2 database, the
silence model has three states and each state has six mix-
tures. The recognition performance is evaluated for the case
of uncompressed MFCCs (baseline) and for the case of
compressed MFCCs using ETSI-AFE encoder at 4400 bps.
From Tables19 and 20, it can be observed an expected
Fig.15WAR (%) versus derivative window length after fixing overall improvement in performance of test set A compared
zero model (word-based recognition) to test set B in multi-condition training mode. This can be
justified by the fact that test set A contains the same noises
as used in multi-condition training mode. Also, the results
show degraded recognition performance for test set C. This
degradation is due to the effect of MIRS filtering (i.e. han-
dling convolutional distortion). Comparing to clean train-
ing mode, we notice a graceful degradation on the perfor-
mance of test clean utterances in multi-condition training
mode. This can be interpreted by the improvement on noisy
speech at the cost of scarifying the recognition performance
of the clean one (Cui and Gong 2007). It can be seen that
the noises which contain the non-stationary segments such
as babble, restaurant, airport, and train-station (Hirsch and
Pearce 2000) do not reduce considerably the performance
with respect to the rest of noises.
Results in Tables 21 and 22 show a graceful degrada-
Fig.16WAR (%) versus derivative window length after fixing tion in the recognition performance, using the quantiza-
zero model (syllable-based recognition) tion codebooks of ETSI-AFE encoder with multi-condition
training (with 0.21% of relative degradation rate). However,
the recognition performance could be further improved
Table18Overall WAR (%) for the ARADIGITS-2 after fixing if the quantization codebooks are re-estimated by the use
zero model of new MFCC vectors extracted from Arabic corpus. We
Recognition unit Training mode Set A Set B Set C Overall should point out that the results correspond to the case
(020) where the HMM models are trained on uncompressed fea-
dB tures; this means that there is a mismatch between training
Word Clean 86.85 89.54 86.53 87.64 and test data.
Multi-condition 95.74 95.97 94.38 95.36
Syllable Clean 87.80 90.16 86.27 88.08
Multi-condition 96.57 96.11 95.14 95.94 5Conclusion

In this paper we presented ARADIGIT-2, an HMM-based


4.4Detailed experimental results using syllable unit speaker-independent Arabic digits speech recognition
database. Designed within an experimental framework,
A series of tests have been conducted in order to find an ARADIGIT-2 provides a performance evaluation of MSA
optimal configuration for the ARADIGIT-2 task, where we isolated digits under noisy environments at various SNR
demonstrated that using syllable unit with the appropriate levels. Generally, we followed the same methodology
parameters leads to best performance. Tables 19, 20, 21 as AURORA-2 database, which is widely used for noise
and 22, summarize a detailed ARADIGITS-2 recognition robust DSR systems evaluation. ARADIGITS-2 is aimed
performance, for different noise types and various SNR to make available a corpus of Arabic speech data, in order

13
Int J Speech Technol

Table19Detailed WAR (%) for the ARADIGITS-2 baseline in clean training


Noise Clean 20 dB 15 dB 10 dB 5 dB 0 dB 5 dB Overall (020
dB)

Set A
Subway 100 98.15 94.44 88.89 81.48 64.35 37.96 87.80
Babble 100 99.54 99.07 94.91 86.57 65.74 34.26
Car 100 99.54 99.54 95.37 83.33 64.81 30.56
Exhibition 100 97.69 96.76 94.91 85.65 65.28 34.26
Overall 100 98.73 97.45 93.52 84.26 65.05 34.26
Set B
Restaurant 100 99.54 98.61 93.06 81.02 62.96 37.50 90.16
Street 100 99.54 99.07 96.76 90.28 68.98 31.02
Airport 100 99.54 98.15 94.91 88.89 69.44 43.06
Train 100 99.54 99.07 97.22 92.59 74.07 32.41
Overall 100 99.54 98.73 95.49 88.20 68.86 36.00
Set C
Subway 100 96.76 91.67 88.43 77.31 59.72 31.48 86.27
Street 100 99.07 99.07 95.83 87.96 63.89 26.39
Babble 100 97.69 95.83 88.89 76.85 55.56 30.09
Restaurant 100 100 99.07 95.37 86.57 69.91 41.20
Overall 100 98.38 96.41 92.13 82.17 62.27 32.29
Overall (sets 100 98.88 97.53 93.71 84.88 65.39 34.18 88.08
A. B. C)

Table20Detailed WAR (%) for the ARADIGITS-2 baseline in multi-condition training


Noise Clean 20 dB 15 dB 10 dB 5 dB 0 dB 5 dB Overall (020
dB)

Set A
Subway 100 99.07 98.61 99.07 96.76 90.74 62.50 96.57
Babble 99.54 100 99.54 99.07 96.30 86.11 50.93
Car 100 99.54 99.54 99.54 98.15 86.57 57.41
Exhibition 100 99.07 99.07 97.22 96.30 91.20 62.96
Overall 99.89 99.42 99.19 98.73 96.88 88.66 58.45
Set B
Restaurant 100 99.54 99.07 99.54 96.76 82.41 53.70 96.11
Street 99.54 99.54 99.54 99.07 97.69 87.04 56.48
Airport 100 100 99.54 99.54 96.76 85.19 56.48
Train 100 100 99.54 99.07 96.76 85.65 60.19
Overall 99.89 99.77 99.42 99.31 96.99 85.07 56.71
Set C
Subway 100 99.07 99.07 99.07 97.22 86.57 52.78 95.14
Street 99.54 99.54 99.54 99.54 97.22 83.80 47.22
Babble 100 100 99.54 98.15 91.67 76.39 40.74
Restaurant 100 99.54 99.07 98.15 94.44 85.19 54.63
Overall 99.89 99.54 99.31 98.73 95.14 82.99 48.84
Overall (sets 99.89 99.58 99.31 98.92 96.34 85.57 54.67 95.94
A. B. C)

13
Int J Speech Technol

Table21Detailed WAR (%) for the ARADIGITS-2 including AFE encoder in clean training
Noise Clean 20 dB 15 dB 10 dB 5 dB 0 dB 5 dB Overall (020
dB)

Set A
Subway 100 97.69 95.37 89.35 82.87 62.96 34.72 89.10
Babble 99.54 99.54 98.15 94.91 89.35 67.59 35.19
Car 100 99.54 99.54 97.22 88.43 70.83 27.78
Exhibition 100 98.15 96.30 93.98 89.35 70.83 35.19
Overall 99.89 98.73 97.34 93.87 87.50 68.05 33.22
Set B
Restaurant 100 99.54 99.07 91.67 81.02 57.41 33.33 89.95
Street 99.54 99.54 99.07 97.22 89.81 69.44 31.48
Airport 100 100 98.15 94.91 88.89 68.98 41.67
Train 100 100 99.07 97.69 91.20 76.39 30.09
Overall 99.89 99.77 98.84 95.37 87.73 68.06 34.14
Set C
Subway 100 95.83 92.13 87.04 77.78 57.87 28.24 85.81
Street 100 99.07 98.15 94.91 87.96 60.19 23.61
Babble 100 98.61 95.37 88.89 79.17 55.56 27.78
Restaurant 100 99.54 97.69 95.37 87.96 67.13 39.35
Overall 100 98.26 95.84 91.55 83.22 60.19 29.75
Overall (sets 99.92 98.92 97.34 93.60 86.15 65.43 32.37 88.29
A. B. C)

Table22Detailed WAR (%) for the ARADIGITS-2 including AFE encoder in multi-condition training
Noise Clean 20 dB 15 dB 10 dB 5 dB 0 dB 5 dB Overall (020
dB)

Set A
Subway 100 98.61 98.61 98.61 96.30 87.96 62.96 96.37
Babble 99.54 100 100 99.07 96.76 86.11 50.46
Car 100 99.54 99.54 99.54 98.15 87.50 57.41
Exhibition 100 99.07 98.15 97.69 96.30 89.81 61.11
Overall 99.89 99.31 99.08 98.73 96.88 87.85 57.99
Set B
Restaurant 100 99.07 99.54 99.54 94.44 81.48 51.85 96.04
Street 99.54 99.54 99.54 99.54 98.15 88.43 58.80
Airport 100 99.54 99.54 99.54 96.76 85.19 58.80
Train 100 100 99.54 99.54 96.76 85.19 58.80
Overall 99.89 99.54 99.54 99.54 96.53 85.07 57.06
Set C
Subway 100 98.15 98.15 98.15 94.44 87.04 53.70 94.79
Street 99.54 99.54 99.54 99.54 98.15 83.80 47.22
Babble 100 100 99.54 98.15 91.20 76.85 39.81
Restaurant 100 99.54 99.07 97.69 93.98 83.33 49.54
Overall 99.89 99.31 99.08 98.38 94.44 82.76 47.57
Overall (sets 99.89 99.38 99.23 98.88 95.95 85.22 54.21 95.73
A. B. C)

13
Int J Speech Technol

to allow researchers and developers to perform evaluation Alotaibi, Y. A. (2008). Comparative study of ANN and HMM to Ara-
of their developed algorithms, as well as for building DSR bic digits recognition systems. Journal of King Abdulaziz Uni-
versitys, 19(1), 4359.
applications which need Arabic digits as input data. Al-Zabibi, M. (1990). An acoustic-phonetic approach in automatic
Although the word unit is the frequently used in build- Arabic speech recognition (Doctoral dissertation, The British
ing digits recognition engines, we have also adopted the Library in Association with UMI,1990).
syllable unit in building ARADIGIT-2. The use of syllable Amrouche, A., Debyeche, M., Taleb-Ahmed, A., Rouvaen, J. M., &
Yagoub, M. C. E. (2010). An efficient speech recognition system
unit is motivated by the polysyllabic nature of Arabic digits in adverse conditions using the nonparametric regression. Engi-
comparing to other languages such as English. To improve neering Applications of Artificial Intelligence, 23(1), 8594.
the recognizer performance, a series of experiments have Applebaum, T. H., & Hanson, B. (1991). Regression features for rec-
been conducted in order to find the optimal combination of ognition of speech in quiet and in noise. In Proceedings of the
international conference on acoustics, speech, and signal pro-
parameters of acoustic units, especially for the monosyl- cessing, ICASSP, (pp.985988).
labic Arabic digit zero. The parameters of interest are: AURORA project. (2006). AURORA speech recognition experi-
the number of states per model, the number of Gaussian mental framework. Retrieved September 15, 2016, from http://
mixtures per state, and the derivative window length. We AURORA.hsnr.de/index.html
Bakis, R. (1976). Continuous speech recognition via centisecond
found that syllable-like unit fits better comparing to word acoustic states. The Journal of the Acoustical Society of America,
like-unit. The recognition performance using syllable unit 59(S1), S97.
exceeds word unit by an overall WAR of 0.44 and 0.58% Boersma, P., & Weenink, D. (2015). Praat: Doing phonetics by
for clean and multi-condition training modes, respectively. computer. Version 5.4.08. Retrieved September 15, 2016, from
http://www.praat.org/
However, finding more effective configuration remains to Cui, X., & Gong, Y. (2007). A study of variable-parameter Gaussian
be established. mixture hidden Markov modeling for noisy speech recognition.
The final obtained results, using syllable unit, are prom- IEEE Transactions on Acoustics, Speech, and Signal Processing,
ising. These results correspond to the cases of both uncom- 15(4), 13661376.
ELRA. (2005). NEMLAR broadcast news speech corpus. ELRA cata-
pressed and compressed features of ETSI-AFE DSR stand- logue, ELRA-S0219. Retrieved September 15, 2016, from http://
ard, with an overall recognition performance on both clean catalog.elra.info/product_info.php?products id = 874
and multi-condition modes of (88.08, 95.94%), and (88.29, ETSI document ES 201 108. (2003a). Speech processing, transmis-
95.73%), for uncompressed and compressed MFCCs, sion, and quality aspects (stq): Distributed speech recognition;
front-end feature extraction algorithm; compression algorithms.
respectively. Version 1.1.3.
In addition, since the ETSI-AFE standard has been ETSI document ES 202 211. (2003b). Speech processing, transmis-
tested on a range of languages, the elaborated ARADIG- sion, and quality aspects (STQ): Distributed speech recognition;
ITS-2 made the possibility of testing this standard in extended front-end feature extraction algorithm; compression
algorithms; back-end speech reconstruction algorithm. Version
Arabic language. However, our future work will focus on 1.1.1.
extending the DSR ARADIGITS-2 to a large vocabulary ETSI document ES 202 050. (2007). Speech processing, transmis-
Arabic continuous speech recognition database, by con- sion, and quality aspects (STQ): Distributed speech recognition;
structing a mixture of word, syllable, and phoneme-based advanced front-end feature extraction algorithm; compression
algorithms. Version 1.1.5.
acoustic units. Fujimoto, M., Takeda, K., & Nakamura, S. (2006). CENSREC-3: An
evaluation framework for Japanese speech recognition in real
Acknowledgements This work has been supported in part by the car-driving environments. IEICE Transactions on Information
LCPTS laboratory project. We would like to thank Dr Abderrahmane and Systems, 89(11), 27832793.
Amrouche for making many suggestions which have been exception- Furui, S. (1981). Cepstral analysis technique for automatic speaker
ally helpful in carrying out this research work. We also would like verification. Acoustics, IEEE Transactions on Acoustics, Speech,
to thank Dr. Amr Ibrahim El-Desoky Mousa for providing support in and Signal Processing, 29(2), 254272.
interpreting results. Furui, S. (1986). Speaker-independent isolated word recognition
using dynamic features of speech spectrum. IEEE Transactions
on Acoustics, Speech, and Signal Processing, 34(1), 5259.
Ganapathiraju, A., Hamaker, J., Picone, J., Ordowski, M., & Dod-
dington, G. R. (2001). Syllable-based large vocabulary continu-
References ous speech recognition. IEEE Transactions on Acoustics, Speech,
and Signal Processing, 9(4), 358366.
Abushariah, M. A., Ainon, R. N., Zainuddin, R., Elshafei, M., & Gish, H., & Ng, K. (1996). Parametric trajectory models for speech
Khalifa, O. O. (2012). Phonetically rich and balanced text and recognition. In Proceedings of the international conference on
speech corpora for Arabic language. Language Resources and spoken language processing, ICSLP, (pp.466469).
Evaluation, 46(4), 601634. Hajj, N., & Awad, M. (2013). Weighted entropy cortical algorithms
Alotaibi, Y. A. (2003). High performance Arabic digits recognizer for isolated Arabic speech recognition. In Proceedings of the
using neural networks. In Proceedings of the international joint International Joint Conference on Neural Networks, IJCNN,
conference on neural networks, IJCNN, (pp.670674). (pp.17).
Alotaibi, Y. A. (2005). Investigating spoken Arabic digits in speech Hirsch, H.-G., & Pearce, D. (2000). The AURORA experimental
recognition setting. Information Sciences, 173(1), 115139. framework for the performance evaluation of speech recognition

13
Int J Speech Technol

systems under noisy conditions. In Proceedings of ISCA tutorial Netsch, L. (2001). Description and baseline results for the sub-
and research workshop, (pp.181188). set of the Speechdat-Car German database used for ETSI STQ
Hirsch, H-G. (2005). FaNT, filtering and noise adding tool. Retrieved AURORA WI008 advanced DSR front-end evaluation. Texas
September 15, 2016, from http://dnt.kr.hsnr.de/ Instruments. AU/273/00.
Hirsch, H-G., & Pearce, D. (2006). Applying the advanced ETSI fron- Nishiura, T., Nakayama, M., Denda, Y., Kitaoka, N., Yamamoto, K.,
tend to the AURORA-2 task. technical report, Version 1.1. Yamada, T., et al. (2008). Evaluation framework for distant-
Hu, G., & Wang, D. (2008). Segregation of unvoiced speech from talking speech recognition under reverberant: Newest part of the
nonspeech interference. The Journal of the Acoustical Society of CENSREC Series. In Proceedings of the language resources and
America, 124(2), 13061319. evaluation conference, LREC, (pp.18281834).
Hyassat, H., & Abu Zitar, R. (2006). Arabic speech recognition using Nokia. (2000). Baseline results for subset of Speechdat-Car Finnish
SPHINX engine. International Journal of Speech Technology, database used for ETSI STQ WI008 advanced front-end evalua-
9(3), 133150. tion. AU/225/00.
ITU-T, Recommendation P.830. (1992). Subjective performance Pearce, D. (2000). Enabling new speech driven services for mobile
assessment of telephone-band and wideband digital codecs. devices: An overview of the ETSI standards activities for distrib-
Geneva, Switzerland. uted speech recognition. In Proceedings of the voice input/output
ITU-T, Recommendation G.712. (1996). Transmission performance applied society conference, AVIOS (pp. 8386). San Jose:
characteristics for pulse code modulation channels, Geneva, AVIOS
Switzerland. Pearce, D. (2001). Developing the ETSI AURORA advanced distrib-
Knoblich, U. (2000). Description and baseline results for the subset uted speech recognition front-end & what next?. In Proceedings
of the Speechdat-Car Italian database used for ETSI STQ Aurora of the workshop on automatic speech recognition and under-
WI008 advanced DSR front-end evaluation. Alcatel. AU/237/00. standing, ASRU, (pp.131134).
Lee, C. H., Rabiner, L., Pieraccini, R., & Wilpon, J. (1990). Acoustic Rabiner, L. R. (1989). A tutorial on Hidden Markov models and
modeling of subword units for speech recognition. In Proceed- selected applications in speech recognition. Proceedings of the
ings of the international conference on acoustics, speech, and IEEE, 77(2), 257286.
signal processing, ICASSP, (pp.721724). Rabiner, L. R., & Juang, B. H. (1993). Fundamentals of speech recog-
Lee, C. H., Soong, F. K., & Paliwal, K. K. (1996). Automatic speech nition (Vol.14). Englewood Cliffs: PTR Prentice Hall.
and speaker recognition: advanced topics (Vol. 355). London: Rabiner, L. R., Wilpon, J. G., & Soong, F. K. (1989). High perfor-
Springer Science & Business Media. mance connected digit recognition using hidden Markov models.
Leonard, R. (1984). A database for speaker-independent digit recog- IEEE Transactions on Acoustics, Speech, and Signal Processing,
nition. In Proceedings of the international conference on acous- 37(8), 12141225.
tics, speech, and signal processing, ICASSP, (pp.328331). Ryding, K. C. (2005). A reference grammar of modern standard Ara-
Lindberg, B. (2001). Danish Speechdat-Car Digits database for bic. Cambridge: Cambridge University Press.
ETSI STQ AURORA advanced DSR. CPK, Aalborg University. Siemund, R., Heuft, B., Choukri, K., Emam, O., Maragoudakis, E.,
AU/378/01. Tropf, H., et al. (2002). OrienTel: Arabic speech resources for
Ma, D., & ZENG, X. (2012). An improved VQ based algorithm for the IT market. In Proceedings of the language resources and
recognizing speaker-independent isolated words. In Proceedings evaluation conference, LREC.
of the international conference on machine learning and cyber- Soong, F. K., & Rosenberg, A. E. (1988). On the use of instantane-
netics, ICMLC, (pp.792796). ous and transitional spectral information in speaker recognition.
Macho, D. (2000). Spanish SDC-AURORA database used for ETSI IEEE Transactions on Acoustics, Speech, and Signal Processing,
STQ AURORA WI008 advanced DSR front-end evaluation, 36(6), 871879.
description and baseline results. Barcelona: Universitat Politec- The Linguistic Data Consortium. (2014). King Saud University data-
nica de Catalunya (UPC). AU/271/00. base. Retrieved September 15, 2016, from https://catalog.ldc.
Nakamura, S., Takeda, K., Yamamoto, K., Yamada, T., Kuroiwa, S., upenn.edu/ldc2014s02
Kitaoka, N., Nishiura, T., Sasou, A., Mizumachi, M., Miyajima, World Bank (2016). Retrieved September 15, 2016, from http://data.
C., Fujimoto, M., & Endo, T. (2005). AURORA-2J: An evalu- worldbank.org/region/ARB
ation framework for Japanese noisy speech recognition. IEICE Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X.,
Transaction on Information and Systems, 88(3), 535544. Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., &
Naveh-Benjamin, M., & Ayres, T. J. (1986). Digit span, reading rate, Woodland, P. (2006). The HTK Book. Version 3.4. Cambridge:
and linguistic relativity. The Quarterly Journal of Experimental Cambridge University, Engineering Department.
Psychology, 38(4), 739751.
Neto, S.F.D.C. (1999). The ITU-T software tool library. International
Journal of Speech Technology, 2(4), 259272.

13

You might also like