Abstract—This letter presents a voice activity detection (VAD) approach using non-negative sparse coding to improve the detection performance in low signal-to-noise ratio (SNR) conditions. The basic idea is to use features extracted from a noise-reduced representation of original audio signals. We decompose the magnitude spectrum of an audio signal on a speech dictionary learned from clean speech and a noise dictionary learned from noise samples. Only the coefficients corresponding to the speech dictionary are considered and used as the noise-reduced representation of the signal for feature extraction. A conditional random field (CRF) is used to model the correlation between feature sequences and voice activity labels along audio signals. Then, we assign the voice activity labels for a given audio signal by decoding the CRF. Experimental results demonstrate that our VAD approach performs well in low SNR conditions.

Index Terms—Conditional random fields, noise reduction, non-negative sparse coding, voice activity detection.

Manuscript received January 19, 2013; revised March 04, 2013; accepted March 07, 2013. Date of publication March 14, 2013; date of current version March 28, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jeronimo Arenas-Garcia. The authors are with the Beijing Lab of Intelligent Information Technology and the School of Computer Science, Beijing Institute of Technology, Beijing, China (e-mail: tengpeng@bit.edu.cn; jiayunde@bit.edu.cn). Digital Object Identifier 10.1109/LSP.2013.2252615

I. INTRODUCTION

Voice activity detection (VAD) is used to detect the presence of speech in an audio signal. It plays an important role in numerous modern speech communication systems. In the last decade, since Sohn et al. [1] proposed a VAD algorithm with impressive performance in 1999, there have been many variants of VAD focusing on approaches using statistical models [2]-[4]. Regarding VAD as a binary classification problem, researchers have employed feature extraction methods as well as classifiers based on statistical learning theory in their VAD approaches [5]-[7]. For example, You et al. [7] proposed a VAD algorithm based on a sparse coding technique, aiming to improve the noise-robustness of features for speech detection, and Saito et al. [5] developed a VAD system based on a conditional random field (CRF) [8] using multiple popular features. However, most of these approaches used features extracted directly from representations of the mixture of speech and noise. The capability of the features for speech/pause discrimination might be seriously degraded in low signal-to-noise ratio (SNR) conditions. To mitigate the degradation, we propose a VAD approach via noise reduction using non-negative sparse coding, in which features for speech detection are extracted from a noise-reduced representation of original audio signals.

II. AUDIO SIGNAL ANALYSIS USING NON-NEGATIVE SPARSE CODING

Let V = [v_1, ..., v_T] be the magnitude spectrum of an audio signal with T time frames, where v_t = [V_{1t}, ..., V_{Ft}]^T denotes the magnitude of the t-th time frame; f is the frequency-bin index, and F is the number of frequency bins. v_t can be approximated by a linear combination of an over-complete set of bases {w_k}, k = 1, ..., K, with weights h_t, i.e., v_t ≈ W h_t, where W = [w_1, ..., w_K] is called a dictionary, and h_t is a coefficient vector. Denoting H = [h_1, ..., h_T], the magnitude spectrum of the signal can be decomposed under the non-negativity constraints W ≥ 0 and H ≥ 0 using non-negative matrix factorization (NMF) as V ≈ WH. NMF gives a "parts"-based representation of the signal, as only additive combinations of the bases in W are allowed in the approximation. NMF with a typical sparseness constraint on H is namely non-negative sparse coding (NSC) [9]. NSC is an attractive middle-level signal representation method for noise-robust feature extraction. It can be achieved by minimizing the distance between the signal and its approximation:

(W^, H^) = argmin_{W ≥ 0, H ≥ 0} (1/2) ||V - WH||_F^2 + λ Σ_{k,t} H_{kt}   (1)

where W^ and H^ are the estimated optimal values of W and H, respectively; ||·||_F denotes the Frobenius norm; H_{kt} is the (k, t)-th element of H, and the non-negative constant λ controls the sparsity of H. (1) is usually subject to ||w_k||_2 = 1, so that the elements of H provide a power-based representation of V. Note that NSC is equivalent to NMF when λ = 0. The dictionary W can be learned according to (1) using the NSC algorithm proposed in [10]. The decomposition of V into H on a given dictionary can be achieved by using the same algorithm but with a fixed W. Additionally, we define V^ = W^ H^ as the reconstruction of V, and V - V^ as the residual.

III. SPARSE REPRESENTATION FOR VAD

We aim to extract features for speech detection from a noise-reduced representation of original audio signals. Since the magnitude spectrum of an audio signal is approximately the sum of the speech magnitude spectrum and the noise magnitude spectrum [11], V can be decomposed as

V ≈ W_s H_s + W_n H_n   (2)

where W_s H_s and W_n H_n denote the contributions of speech and noise in the magnitude spectrum, respectively; W_s denotes a speech dictionary (with K_s bases) which is over-complete and learned from clean speech signals using NSC, in order to obtain noise-
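As a concrete sketch of the decomposition on a fixed dictionary described in Section II, the following numpy snippet applies the standard multiplicative update for NMF with an L1 penalty on the coefficients [9], [10]. It is an illustrative sketch, not the implementation used in this letter; the function name, the value of λ (`lam`), and the iteration count are assumptions.

```python
import numpy as np

def nsc_coefficients(V, W, lam=0.01, n_iter=200, eps=1e-9):
    """Sparse-code magnitude spectra V (F x T) on a FIXED dictionary W (F x K).

    Minimizes 0.5 * ||V - W H||_F^2 + lam * sum(H) subject to H >= 0,
    using the standard multiplicative update for NMF with an L1 penalty
    on H. Illustrative sketch only.
    """
    K, T = W.shape[1], V.shape[1]
    rng = np.random.default_rng(0)
    H = rng.random((K, T)) + eps          # positive initialization
    WtV = W.T @ V
    WtW = W.T @ W
    for _ in range(n_iter):
        # Multiplicative update keeps H non-negative throughout.
        H *= WtV / (WtW @ H + lam + eps)
    return H

# Toy usage: a dictionary of 4 random non-negative unit-norm bases, 10 frames.
rng = np.random.default_rng(1)
W = np.abs(rng.standard_normal((16, 4)))
W /= np.linalg.norm(W, axis=0)            # unit-norm bases, as assumed in (1)
H_true = np.abs(rng.standard_normal((4, 10)))
V = W @ H_true
H = nsc_coefficients(V, W)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # small relative residual
```

In the VAD setting, the same routine would be run with the concatenated dictionary [W_s, W_n] held fixed, after which only the rows of H belonging to W_s are retained.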
If W_n H_n is discarded, the noise contribution is supposed to be reduced away from V. Therefore, under the assumption described in (3), only H_s is considered in our VAD approach and used as the noise-reduced representation of the speech in V, independently of the noise. Then, feature vectors to describe the speech activity are extracted from H_s. At the t-th time frame, we extract a feature vector x_t from h_{s,t} consisting of three statistics of the coefficients in h_{s,t}, i.e., MAX, mean square root and mean:

x_t = [ max_k H_{s,kt},  sqrt((1/K_s) Σ_k H_{s,kt}^2),  (1/K_s) Σ_k H_{s,kt} ]^T   (4)

for measuring the presence of speech.

IV. VAD CONTEXT MODELING BASED ON CRF

The goal of a VAD task is to give a sequence of voice activity labels y = (y_1, ..., y_T) along a given audio signal, where y_t ∈ {0, 1} indicates the speech absence or presence at the t-th time frame. Let x_t be an observed feature vector derived from v_t, and correspondingly let x = (x_1, ..., x_T) be an observed feature vector sequence along V. We model the correlation between x and y using CRFs with a linear chain structure, i.e., first-order state dependency as depicted in Fig. 1. In the linear chain, the cliques include pairs of neighboring labels (y_{t-1}, y_t) and feature-label pairs (x_t, y_t). Let exponentiated feature functions be the positive-valued potential functions of the cliques. Given an observation x with T time frames and parameters Λ, the distribution over a label sequence y can be defined as

p(y | x; Λ) = (1/Z(x)) Π_{t=1}^{T} Ψ(y_{t-1}, y_t, x_t)   (5)

Z(x) = Σ_y Π_{t=1}^{T} Ψ(y_{t-1}, y_t, x_t)   (6)

where Z(x) is the observation-dependent normalization. In our approach, the potential Ψ is computed, in terms of weighted sums over the features of the cliques, by

Ψ(y_{t-1}, y_t, x_t) = exp( λ^T f(y_{t-1}, y_t) + μ^T g(x_t, y_t) )   (7)

with transition feature functions defined as

f_{ij}(y_{t-1}, y_t) = δ(y_{t-1} = i) δ(y_t = j),  i, j ∈ {0, 1}   (8)

and observation feature functions defined as

g_i(x_t, y_t) = x_t δ(y_t = i),  i ∈ {0, 1}   (9)

with parameters Λ = {λ, μ}. Note that the model with tied parameters is used across all cliques, in order to seamlessly handle sequences of arbitrary length. Given a fully labeled training set {(x^(n), y^(n))}, the CRF parameters Λ are estimated by maximizing the conditional log-likelihood:

Λ^ = argmax_Λ Σ_n log p(y^(n) | x^(n); Λ)   (10)

With a trained CRF, the best activity label sequence y^ conditioned on a given feature vector sequence x is estimated by decoding the CRF, i.e., solving

y^ = argmax_y p(y | x; Λ^)   (11)

The decoding is usually achieved using the Viterbi algorithm, obtaining a hard decision of y^ (as shown in [5]). However, we employ the Forward-Backward algorithm to calculate the marginal p(y_t = 1 | x_{t-δ}, ..., x_{t+δ}), serving as the a posteriori probability of speech presence at the t-th time frame, where δ controls the range of the context that is considered. Then, we actually obtain an activity label sequence determined by a threshold θ:

y^_t = 1 if p(y_t = 1 | x) ≥ θ, and y^_t = 0 otherwise,   (12)

so that a trade-off between the detection probability and the false alarm probability of VAD can be easily made by tuning θ.

V. EXPERIMENTS

The TIMIT [12] corpus with its word transcriptions is used in the experiments for the VAD performance evaluation. Four typical noise sources from the NOISEX-92 [13] corpus, including the F-16, factory, white and babble noises, are selected for the simulation of real noisy environments. We randomly select 128 sentences, of which 8 sentences (excluding the two dialect sentences) were spoken by each of 16 speakers from the TIMIT TEST set. 64 sentences from half of the speakers are concatenated as a long utterance, with silence of random length (from 1 to 3 seconds) inserted between every pair of adjacent sentences, and the remaining 64 sentences are concatenated in the same way. These two long utterances are about 338 seconds and 331 seconds long and contain 51.6% and 50.3% speech, respectively. The first long utterance
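The three frame-level statistics in (4) can be sketched as follows. This is an illustrative numpy helper, not the letter's implementation; reading "mean square root" as the root mean square of the coefficients is an assumption.

```python
import numpy as np

def frame_features(Hs):
    """Per-frame statistics of speech-dictionary coefficients, as in (4).

    Hs: (K_s, T) non-negative coefficient matrix on the speech dictionary.
    Returns a (T, 3) array whose rows are [max, root mean square, mean]
    of the coefficients of each frame.
    """
    mx = Hs.max(axis=0)                     # MAX over bases
    rms = np.sqrt(np.mean(Hs ** 2, axis=0)) # root mean square (assumed)
    mean = Hs.mean(axis=0)                  # mean over bases
    return np.stack([mx, rms, mean], axis=1)

# Toy usage: two bases, two frames; the second frame has no speech energy.
Hs = np.array([[1.0, 0.0],
               [3.0, 0.0]])
feats = frame_features(Hs)
print(feats)
```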
TENG AND JIA: VOICE ACTIVITY DETECTION VIA NOISE REDUCING USING NON-NEGATIVE SPARSE CODING 477
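The soft decision of Section IV, i.e., forward-backward marginals thresholded by θ as in (12), can be sketched for a binary linear-chain model as follows. The inputs `unary` and `trans` stand in for the clique log-potentials of (7); they are assumed placeholders, not the CRF parameters learned in this letter.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along the given axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def speech_posterior(unary, trans):
    """p(y_t = 1 | x) for a binary linear-chain CRF via Forward-Backward.

    unary: (T, 2) per-frame log-potentials (observation cliques),
    trans: (2, 2) label-transition log-potentials.
    """
    T = unary.shape[0]
    alpha = np.zeros((T, 2))
    beta = np.zeros((T, 2))
    alpha[0] = unary[0]
    for t in range(1, T):                    # forward pass
        alpha[t] = unary[t] + logsumexp(alpha[t - 1][:, None] + trans, axis=0)
    for t in range(T - 2, -1, -1):           # backward pass
        beta[t] = logsumexp(trans + (unary[t + 1] + beta[t + 1])[None, :], axis=1)
    log_z = logsumexp(alpha[-1], axis=0)     # normalization log Z(x)
    return np.exp(alpha[:, 1] + beta[:, 1] - log_z)

# Toy usage: three speech-like frames followed by three noise-like frames.
unary = np.array([[0.0, 2.0]] * 3 + [[2.0, 0.0]] * 3)
post = speech_posterior(unary, np.zeros((2, 2)))
theta = 0.5
labels = (post >= theta).astype(int)         # hard labels via (12)
print(labels)  # first three frames flagged as speech
```

Raising θ lowers the false alarm probability at the cost of the detection probability, which is exactly the trade-off tuned in the ROC evaluations.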
Fig. 3. ROC curves of the VAD approaches under (a) F-16, (b) factory, (c) white and (d) babble noises at SNR dB, where "Ours(I)", "Ours(II)" and "Simple NSC" denote our VAD approach using Method-1, Method-2 and the simple NSC method for estimating H_s, respectively.

Fig. 4. ROC curves of the VAD approaches under F-16 noise in high- and medium-SNR (20 dB and 5 dB, respectively) conditions.

...between the bases from the speech and noise dictionaries. Our Method-2 performs better than Method-1 in the medium- and low-SNR F-16 noise conditions, but worse in the high-SNR condition due to over-noise-reduction. In high-SNR conditions, the performance of our approach is degraded by the high possibility of over-noise-reduction, which arises from both the inaccurate estimation of H_s and the noise-insensitivity caused by the very-low-SNR training utterance for the CRF.

VI. CONCLUSION

We have proposed a VAD approach using NSC to improve the detection performance by using a noise-reduced representation for feature extraction. We have decomposed the magnitude spectrum of an audio signal into coefficients on a clean speech dictionary and a noise dictionary, and used only the coefficients corresponding to the speech dictionary as the noise-reduced representation for feature extraction.

ACKNOWLEDGMENT

The authors thank Prof. Xiangjian He for his English corrections to this manuscript.

REFERENCES

[1] J. Sohn, N. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Process. Lett., vol. 6, no. 1, pp. 1-3, 1999.
[2] Y. Cho, K. Al-Naimi, and A. Kondoz, "Improved voice activity detection based on a smoothed statistical likelihood ratio," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 2001, vol. 2, pp. 737-740.
[3] J. Ramírez, J. Segura, J. Górriz, and L. García, "Improved voice activity detection using contextual multiple hypothesis testing for robust speech recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 8, pp. 2177-2189, 2007.
[4] Y. Suh and H. Kim, "Multiple acoustic model-based discriminative likelihood ratio weighting for voice activity detection," IEEE Signal Process. Lett., vol. 19, no. 8, pp. 507-510, 2012.
[5] A. Saito, Y. Nankaku, A. Lee, and K. Tokuda, "Voice activity detection based on conditional random fields using multiple features," in Proc. Interspeech, 2010, pp. 2086-2089.
[6] J. Wu and X. Zhang, "Efficient multiple kernel support vector machine based voice activity detection," IEEE Signal Process. Lett., vol. 18, no. 8, pp. 466-469, 2011.
[7] D. You, J. Han, G. Zheng, and T. Zheng, "Sparse power spectrum based robust voice activity detector," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 2012, pp. 289-292.
[8] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. Int. Conf. Machine Learning, 2001, pp. 282-289.
[9] P. Hoyer, "Non-negative sparse coding," in Proc. IEEE Workshop on Neural Networks for Signal Processing, 2002, pp. 557-565.
[10] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," J. Mach. Learn. Res., vol. 11, pp. 19-60, 2010.
[11] K. W. Wilson, B. Raj, P. Smaragdis, and A. Divakaran, "Speech denoising using nonnegative matrix factorization with priors," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 2008, pp. 4029-4032.
[12] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, and N. Dahlgren, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM," NIST, 1993.
[13] Rice University, NOISEX-92 Database [Online]. Available: http://spib.rice.edu/spib/select_noise.html
[14] H.-G. Hirsch, FaNT: Filtering and Noise Adding Tool [Online]. Available: http://aurora.hsnr.de