
A Tour Through the Wonderful World of Speech Enhancement
Femi Odelowo

Definition
Speech enhancement is concerned with improving some perceptual aspect of speech that has been degraded by additive noise
Source: Speech Enhancement: Theory and Practice, P. C. Loizou

Perceptual aspects typically are the quality and/or intelligibility of the source signal
Algorithms can be broadly grouped depending on whether there is a single input channel or multiple input channels
Single microphone or single channel speech enhancement
Microphone array or multichannel speech enhancement

Focus is single channel enhancement

Signal Model
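The additive noise model is the most commonly considered model:
y(n) = x(n) + d(n)
which, after the STFT, becomes
Y(k, l) = X(k, l) + D(k, l)
where y(n) is the noisy signal, x(n) is the desired speech signal, and d(n) is the additive noise; k is the frequency bin and l is the frame index
The noise is assumed to be independent of the speech signal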
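To make the additive model concrete, here is a minimal sketch that builds a noisy signal at a chosen SNR from a clean signal and a noise recording. The function names, the sine-wave stand-in for speech, and the sampling rate are illustrative assumptions, not part of the talk.

```python
import math
import random


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power equals snr_db,
    then add it to the speech (additive noise model)."""
    noise = noise[: len(speech)]  # assumes the noise is at least as long as the speech
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    scaled_noise = [scale * d for d in noise]
    noisy = [x + d for x, d in zip(speech, scaled_noise)]
    return noisy, scaled_noise


random.seed(0)
# A 200 Hz tone sampled at 8 kHz stands in for the clean speech signal.
speech = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(8000)]
noise = [random.gauss(0.0, 1.0) for _ in range(8000)]
noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=10.0)
achieved_db = 10.0 * math.log10(mean_power(speech) / mean_power(scaled_noise))
```

The "10dB Signal" conditions shown later in the deck correspond to this kind of mixture, although how the author generated them is not stated.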

Processing Flow/Block Diagram


The noisy signal is broken into overlapping frames
Individual frames are processed
The enhanced speech signal is reassembled using the overlap-add method

Block diagram: STFT -> Parameter Estimation -> Gain Calculation -> Spectral Modification -> Inverse STFT, with the Phase carried from the STFT to the Inverse STFT
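The frame, process, and overlap-add flow described above can be sketched as follows. This is a simplified illustration under stated assumptions: a periodic Hann analysis window with 50% overlap (which sums to one across overlapping frames), and a hypothetical `process_frame` hook that stands in for the STFT, gain, and inverse STFT stages.

```python
import math


def hann_periodic(n):
    """Periodic Hann window; copies shifted by n/2 sum to exactly one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlap_add_process(signal, frame_len, process_frame):
    """Split into 50%-overlapping windowed frames, process each frame,
    and rebuild the output by overlap-add."""
    hop = frame_len // 2
    window = hann_periodic(frame_len)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        enhanced = process_frame(frame)  # hypothetical per-frame enhancement
        for i in range(frame_len):
            out[start + i] += enhanced[i]
    return out


signal = [math.sin(0.05 * n) for n in range(256)]
# With an identity "enhancement" the interior of the signal is rebuilt exactly.
rebuilt = overlap_add_process(signal, frame_len=32, process_frame=lambda f: f)
```

Only samples covered by two frames are reconstructed at full scale; the first and last half-frames are tapered by the window, which practical systems handle by padding the signal.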

Algorithms
Spectral subtraction
Conceptually the simplest to design/implement
Based on the assumed additive nature of the noise

Statistical model-based algorithms


Based on a statistical estimation framework
Includes the Wiener filter and several minimum mean-square error (MMSE) algorithms

Subspace algorithms
Based on a linear algebra framework
Typically use eigenvalue/eigenvector decomposition or SVD

Machine learning algorithms


The big bad new kid on the block
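Of the families listed above, spectral subtraction is the easiest to sketch. The example below shows one common form, power spectral subtraction with a spectral floor, applied to the power spectrum of a single frame. The `floor` constant and the function name are assumptions chosen for illustration; the talk does not specify a particular variant.

```python
import math


def spectral_subtraction_gain(noisy_power, noise_power, floor=0.01):
    """Per-bin amplitude gain from power spectral subtraction.

    noisy_power: |Y(k)|^2 for one frame.
    noise_power: estimated noise power per bin.
    floor: hypothetical fraction of the noisy power kept as a minimum,
           which avoids negative power estimates.
    """
    gains = []
    for py, pd in zip(noisy_power, noise_power):
        if py <= 0.0:
            gains.append(0.0)
            continue
        clean_power = max(py - pd, floor * py)
        gains.append(math.sqrt(clean_power / py))
    return gains


noisy_power = [4.0, 1.0, 0.5]
noise_power = [1.0, 1.0, 1.0]
gains = spectral_subtraction_gain(noisy_power, noise_power)
```

Bins where the noise estimate meets or exceeds the noisy power fall back to the floor. Isolated bins that randomly escape the floor from frame to frame are one source of the musical noise artifact.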

Problems With Classical Methods


Algorithms need a good noise and/or SNR estimate
Mathematical accuracy is not necessarily the best!
Noise estimation is worse with lower SNR

Enhanced sound is plagued with a distorted background


Referred to as musical noise

Poor performance in non-stationary noise
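The dependence on a good noise estimate can be illustrated with a toy noise tracker gated by a simple energy-based voice activity detector (VAD). This is only a sketch: the `smoothing` and `threshold` values are hypothetical, and it is not claimed to be the VAD used for the examples in this talk.

```python
def vad_noise_tracker(noisy_power, init_noise, smoothing=0.9, threshold=2.0):
    """Track noise power in one frequency bin.

    A frame is treated as speech when its power exceeds threshold times the
    current noise estimate; the estimate is updated only in non-speech frames.
    """
    noise = init_noise
    estimates = []
    for py in noisy_power:
        speech_present = py > threshold * noise
        if not speech_present:
            noise = smoothing * noise + (1.0 - smoothing) * py
        estimates.append(noise)
    return estimates


# The noise level jumps from 1.0 to 3.0 after a burst of speech (10.0).
frames = [1.0, 1.0, 10.0, 10.0, 3.0, 3.0]
estimates = vad_noise_tracker(frames, init_noise=1.0)
```

In this run the estimate stays near 1.0 even after the noise level has risen to 3.0, because the louder noise is mistaken for speech. This is the kind of failure that makes non-stationary noise difficult for a VAD-gated estimator.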

Examples using the Wiener Filter

The Wiener filter seeks to minimize the mean-square error E[e^2(n)]
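Under the additive model with uncorrelated speech and noise, minimizing the mean-square error in each frequency bin gives the familiar gain xi / (1 + xi), where xi is the ratio of speech power to noise power. The snippet below is a minimal sketch of that gain; the function names are illustrative.

```python
import math


def wiener_gain(speech_psd, noise_psd):
    """Per-bin Wiener gain Ps / (Ps + Pn), written in terms of the SNR xi."""
    xi = speech_psd / noise_psd
    return xi / (1.0 + xi)


def gain_to_db(gain):
    """Express an amplitude gain in dB."""
    return 20.0 * math.log10(gain)


# Gains at -10, 0, and 10 dB SNR with unit noise power.
snrs_db = [-10.0, 0.0, 10.0]
wiener_gains = [wiener_gain(10.0 ** (s / 10.0), 1.0) for s in snrs_db]
wiener_gains_db = [gain_to_db(g) for g in wiener_gains]
```

The gain approaches 1 at high SNR and 0 at low SNR, so any error in the speech or noise PSD estimate translates directly into too much or too little attenuation, which is what the following plots examine.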

Exhibition Noise, 10 dB Signal, Simple VAD
[Figure: two panels plotted against Time (sec). "True Noise PSD vs. Noise PSD Estimates for f = 500", y-axis PN (dB): True Noise and Est. Noise for parameter values 0.7, 0.9, 1, 2, and 5. "Clean Signal PSD vs. Signal PSD Estimates for f = 500", y-axis PS (dB): True Signal and Est. Signal for parameter values 0.7, 1, and 5.]

Exhibition Noise, 10 dB Signal, IMCRA Algorithm
[Figure: two panels plotted against Time (sec). "True Noise PSD vs. Noise PSD Estimates With IMCRA Noise Estimation for f = 500", y-axis PN (dB): True Noise and Est. Noise for parameter values 0.7, 0.9, 1, 2, and 5. "Clean Signal PSD vs. Signal PSD Estimates With IMCRA Noise Estimation for f = 500", y-axis PS (dB): True Signal and Est. Signal for parameter values 0.7, 1, and 5.]

Exhibition Noise, 10 dB Signal, Enhanced Speech
[Audio samples: Noisy Signal; Enhanced Signal, oracle PSDs; Enhanced Signal for parameter values 0.7, 1, and 5, each with Simple VAD and with Improved MCRA Noise Estimation.]

Restaurant Noise, 10 dB Signal, Enhanced Speech
[Audio samples: Noisy Signal; Enhanced Signal, oracle PSDs; Enhanced Signal for parameter values 0.7, 1, and 5, each with Simple VAD and with Improved MCRA Noise Estimation.]

SNR & Wiener Gain Estimation, Car Noise, 10 dB
[Figure: two panels plotted against Time (sec). "True vs. Estimated SNR for Representative Frequency Bin", y-axis SNR (dB): True SNR, DD SNR Estimate, Anderson DD Estimate. "True vs. Estimated Wiener Gains for Representative Frequency Bin", y-axis Wiener Gain (dB): True, DD Gain, Anderson DD Gain.]

Wiener Gains, Car Noise, 10 dB Signal
[Figure: "Ideal vs. Realized Wiener Gains, Ephraim-Malah DD SNR Update". Wiener Gains (0 to 1) plotted against SNR (dB) from -50 to 50: Ideal and Realized.]

Important Statistical Models


Statistical models are based on a probabilistic model of
the DFT components of speech and additive noise
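Signal Model: Y(k, l) = X(k, l) + D(k, l), where Y, X, and D are the DFT coefficients of the noisy speech, clean speech, and noise in frequency bin k of frame l, and A(k, l) = |X(k, l)| is the clean spectral amplitude
Short Time Spectral Amplitude (STSA) estimator
Also called the Ephraim-Malah estimator
Obtained as the MMSE estimate of the spectral amplitude: A_hat(k, l) = E[ A(k, l) | Y(k, l) ]

Log Spectral Amplitude (LSA) estimator
Also due to Y. Ephraim and D. Malah
Obtained as the MMSE estimate of the log spectral amplitude: A_hat(k, l) = exp( E[ ln A(k, l) | Y(k, l) ] )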
A variant of this algorithm, the optimally-modified LSA (OMLSA) estimator by I. Cohen, is typically used as a benchmark for the classical algorithms

A Machine Learning Approach


Can we learn a gain function based on the SNR
estimates that performs better than the Wiener gain?
A generalized additive model (GAM) was fitted to the
true Wiener gain using the decision-directed SNR, a
posteriori SNR, and noise estimates as covariates
A GAM is a flexible modeling framework in which a
linear predictor depends on either parametric or nonparametric functions of predictor variables
Results showed improved performance over Wiener
filtering.
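The decision-directed SNR and a posteriori SNR used as covariates above can be computed per frequency bin roughly as follows. This is a sketch of the standard Ephraim-Malah decision-directed recursion paired with a Wiener gain; the smoothing constant `alpha`, the floor `xi_min`, and the first-frame initialization are assumptions, not values from the talk, and the GAM fit itself is not shown.

```python
def decision_directed_snr(noisy_power, noise_power, alpha=0.98, xi_min=1e-3):
    """Track a priori SNR (xi) and a posteriori SNR (gamma) for one bin.

    noisy_power: |Y|^2 per frame; noise_power: noise power estimate per frame.
    """
    xi_track, gamma_track = [], []
    prev_ratio = None  # previous frame's |X_hat|^2 / noise power
    for py, pd in zip(noisy_power, noise_power):
        gamma = py / pd                    # a posteriori SNR
        ml_term = max(gamma - 1.0, 0.0)    # instantaneous SNR estimate
        if prev_ratio is None:
            xi = ml_term                   # assumed first-frame initialization
        else:
            xi = alpha * prev_ratio + (1.0 - alpha) * ml_term
        xi = max(xi, xi_min)
        gain = xi / (1.0 + xi)             # Wiener gain from the estimated xi
        prev_ratio = gain * gain * gamma   # enhanced power over noise power
        xi_track.append(xi)
        gamma_track.append(gamma)
    return xi_track, gamma_track


xi_track, gamma_track = decision_directed_snr([2.0, 2.0, 2.0], [1.0, 1.0, 1.0])
```

Because `alpha` is close to one, the a priori SNR estimate leans heavily on the previous frame's output, which smooths the gain over time but lags behind sudden changes in the true SNR.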

Performance of the GAM Model
[Figure: four panels plotted against input SNR (dB): Mean PESQ Score, Mean COMP SIG Score, Mean COMP BAK Score, and Mean COMP OVL Score. Curves in each panel: Learned Response, True Signal/Noise WF, DD Wiener Filter.]

Performance of the GAM Model (contd.)
[Figure: four panels, one each for 0 dB, 5 dB, 10 dB, and 15 dB signals, showing Mean PESQ Score by Noise Type (airport, babble, car, exhibition, restaurant, station, street, train). Curves in each panel: Learned Response, True Signal/Noise WF, DD Wiener Filter.]

Other Machine Learning Approaches


Independent Component Analysis
Non-negative Matrix Factorization
Deep Neural Networks
Very recent and have produced the best results
Some interesting results from the publication by Yong Xu et al. are at http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN_taslp.html
More research is needed on how to obtain the best
performance

Other Research Areas


Speech enhancement based on phase spectrum modification
Phase spectrum compensation (PSC) algorithm by K. Wojcicki et al. performed as well as or slightly better than the STSA estimator
Research results suggest the analysis window used and sidelobe attenuation levels are important

Enhancement utilizing both magnitude and phase correction
Idea is to gain the best of both worlds
Results varied when the PSC and STSA estimators were combined

Questions/Discussion
