
Departement Elektrotechniek

ESAT-SISTA/TR 1999-33

SVD-based Optimal Filtering with Applications to Noise Reduction in Speech Signals (1)

Simon Doclo (2), Marc Moonen (2)

April 10, 1999

Internal report

This report is available by anonymous ftp from ftp.esat.kuleuven.ac.be in the directory pub/SISTA/doclo/reports/99-33.ps.gz

(2) ESAT (SISTA) - Katholieke Universiteit Leuven, Kardinaal Mercierlaan 94, 3001 Leuven (Heverlee), Belgium, Tel. 32/16/321899, Fax 32/16/321970, WWW: http://www.esat.kuleuven.ac.be/sista. E-mail: simon.doclo@esat.kuleuven.ac.be. Simon Doclo is a Research Assistant supported by the I.W.T. (Flemish Institute for Scientific and Technological Research in Industry). Marc Moonen is a Research Associate with the F.W.O. - Vlaanderen (Fund for Scientific Research - Flanders). This research work was carried out at the ESAT laboratory of the Katholieke Universiteit Leuven, in the framework of the F.W.O. Research Project nr. G.0295.97, Design and implementation of adaptive digital signal processing algorithms for broadband applications, the Interuniversity Attraction Pole IUAP P4-02 (1997-2001), Modeling, Identification, Simulation and Control of Complex Systems, initiated by the Belgian State, Prime Minister's Office - Federal Office for Scientific, Technical and Cultural Affairs, and the IT-project Multimicrophone Signal Enhancement Techniques for handsfree telephony and voice controlled systems (MUSETTE) (AUT/970517/Philips ITCL) of the I.W.T. (Flemish Institute for Scientific and Technological Research in Industry), and was partially sponsored by Philips-ITCL. The scientific responsibility is assumed by its authors.

SVD-based optimal filtering with applications to noise reduction in speech signals

Simon Doclo
ESAT - SISTA, Katholieke Universiteit Leuven
Kardinaal Mercierlaan 94, 3001 Leuven, Belgium
E-mail: simon.doclo@esat.kuleuven.ac.be

Marc Moonen
ESAT - SISTA, Katholieke Universiteit Leuven
Kardinaal Mercierlaan 94, 3001 Leuven, Belgium
E-mail: marc.moonen@esat.kuleuven.ac.be

April 10, 1999

Abstract

In this report, a compact review is given of a class of SVD-based signal enhancement procedures, which amount to a specific optimal filtering technique for the case where the so-called `desired response' signal cannot be observed. A number of simple properties (e.g. symmetry properties) of the obtained estimators are derived, which to our knowledge have not been published before and which are valid for the white noise case as well as for the coloured noise case. A standard procedure based on averaging is also investigated, leading to serious doubts about the necessity of the averaging step.
When this technique is applied to multi-microphone noise reduction, the optimal filter exhibits a kind of beamforming behaviour for highly correlated noise sources. Compared to standard beamforming algorithms, its performance is equally good for highly correlated noise sources. For less correlated noise sources, a situation where standard beamforming typically fails, it is shown that its performance is better than that of standard beamforming techniques. Finally it is shown by simulations that this technique is more robust to environmental changes, such as source movement, microphone displacement and microphone amplification, than standard beamforming techniques.

Contents

1 Introduction  2

2 SVD-based optimal filtering  4
  2.1 Preliminaries  4
  2.2 SVD-based filtering  6
  2.3 Error covariance matrix  7
  2.4 White noise case  8
  2.5 Time series filtering  9
  2.6 Time series filtering and averaging  13
  2.7 Multichannel time series filtering  17
  2.8 Conclusion  24

3 Beamforming behaviour of multichannel filtering  25
  3.1 Preliminaries  25
  3.2 Spatio-temporal white noise  28
    3.2.1 Broadband source  29
    3.2.2 Smallband source  36
  3.3 Localized noise source  38
  3.4 Real-world situation  40

4 Comparison to standard beamforming algorithms  42
  4.1 Standard beamforming algorithms  42
  4.2 General configuration  44
  4.3 Comparison  45
  4.4 Dependence on noiseframe  46

5 Robustness issues  52
  5.1 Source movement  52
  5.2 Microphone displacement  55
  5.3 Microphone amplification  58
  5.4 Conclusion  59

6 Conclusion  62

Acknowledgments  62

A Derivative to vectors and matrices  63
  A.1 Derivative to vectors  63
  A.2 Derivative to matrices  64

B Eigenvectors of symmetric Toeplitz and block-Toeplitz matrices  66
  B.1 Structured matrices  66
  B.2 Symmetry properties of eigenvectors  68
1 Introduction

In many speech communication applications, like audio-conferencing and hands-free mobile telephony, the recorded and transmitted speech signals contain a considerable amount of acoustic noise. This is mainly due to the fact that the speaker is located at a certain distance from the recording microphones, which allows the microphones to record the noise sources too. Background noise can stem from stationary noise sources like a fan, but most of the time the background noise is non-stationary and broadband, with a spectral density depending upon the environment. The background noise causes a signal degradation which can lead to total unintelligibility of the speech and which decreases the performance of speech coding and speech recognition systems. Therefore efficient noise reduction algorithms are called for.
During the last years some techniques for noise reduction in speech have been proposed which are based on the singular value decomposition (SVD) [1] [2] [3]. Most of these techniques only deal with the one-microphone case and therefore have to rely on signal-specific characteristics. Speech signals can be assumed to consist of several formants. The interpretation which is given to most one-microphone SVD-based noise reduction techniques is that these techniques try to extract the most important formants from the noisy speech signal [4], thereby reducing the amount of noise.
When using a microphone array, the spatial configuration of the speech/noise sources and the microphone array constitutes an important aspect which should not be neglected. Therefore multi-microphone algorithms should not only exploit signal characteristics, but should also exploit the characteristics of the channel between the speech/noise sources and the microphone array. Although the SVD-based multi-microphone extensions which have been proposed [5] exploit the signal characteristics in a more robust way, they still don't exploit channel characteristics.
Section 2 describes a class of SVD-based signal enhancement procedures, which amount to a specific optimal filtering technique for the case where the so-called `desired response' signal cannot be observed. It is shown that this optimal filter can be written as a function of the generalized singular vectors and singular values of a so-called speech and noise data matrix. A number of simple symmetry properties of the optimal filter are derived, which are valid for the white noise case as well as for the coloured noise case. Also the averaging step of the standard one-microphone SVD-based noise reduction techniques [2] [4] is investigated, leading to serious doubts about the necessity of this averaging step. When applying the SVD-based optimal filtering technique to multiple channels, a number of additional symmetry properties can be derived, depending on the structure of the noise covariance matrix.
In Section 3 the SVD-based optimal filtering technique is applied to multi-microphone noise reduction in speech. For some contrived examples it is shown that this technique exhibits a kind of beamforming behaviour. When considering spatio-temporal white noise on all microphones, it is shown by simulations that the directivity pattern of the SVD-based optimal filter is focused towards the speech source. When considering a localized noise source (and no multipath propagation), it is shown that a zero is steered towards this noise source.
Section 4 further compares the performance of the SVD-based optimal filtering technique with standard beamforming algorithms [6] (delay-and-sum, Griffiths-Jim and Generalized Sidelobe Canceller (GSC) [7] [8] [9] [10]). Adaptive Griffiths-Jim beamformers perform particularly well when the noise on the different microphones is highly correlated. When the noise is less correlated, the performance of these beamformers drops considerably. It is shown by simulations that for highly correlated noise sources the SVD-based optimal filtering technique performs as well as adaptive Griffiths-Jim beamformers, and that for less correlated noise sources it continues to perform better. In this section the dependence of the performance of the SVD-based optimal filtering technique on the length and the starting point of the noiseframe is also investigated.
Section 5 discusses the issue of robustness. It is known that standard beamforming algorithms are rather sensitive to incorrect estimation of the source direction and to uncalibrated microphone arrays. It is shown by simulations that the SVD-based optimal filtering technique is more robust to environmental changes, such as source movement, microphone displacement and microphone amplification, than standard beamforming techniques.

2 SVD-based optimal filtering

2.1 Preliminaries

Consider the following filtering problem (figure 1): $u_k \in \mathbb{R}^N$ is the filter input vector at time $k$, $y_k$ is the filter output at time $k$,
$$ y_k = u_k^T w = w^T u_k, \qquad (2.1) $$
$d_k$ is the desired filter output (`desired response') at time $k$, $e_k$ is the error at time $k$,
$$ e_k = d_k - y_k, \qquad (2.2) $$
and $w \in \mathbb{R}^N$ is the optimal filter. All signals are supposed to be real-valued.


dk

uk

yk

+
-

ek

Figure 1: Optimal ltering problem with desired response dk


The MSE (mean square error) cost function for optimal filtering is
$$ J_{MSE}(w) = E\{e_k^2\} = E\{(d_k - y_k)^2\} = E\{(d_k - w^T u_k)^2\} = E\{d_k^2\} - 2 w^T E\{u_k d_k\} + w^T E\{u_k u_k^T\}\, w \qquad (2.3) $$
The optimal filter is found by setting the derivative $\partial J_{MSE} / \partial w$ equal to zero. Using the expressions from Appendix A.1, we obtain the Wiener-Hopf equations [11]:
$$ \frac{\partial J_{MSE}}{\partial w} = -2\, E\{u_k d_k\} + 2\, E\{u_k u_k^T\}\, w = 0. \qquad (2.4) $$
The optimal filter $w_{WF}$ is the well-known Wiener filter:
$$ w_{WF} = E\{u_k u_k^T\}^{-1}\, E\{u_k d_k\} \qquad (2.5) $$
It is also possible to consider multiple right-hand side problems, i.e. work with a desired vector signal $d_k \in \mathbb{R}^N$ instead of a scalar $d_k$ (figure 2). The filter output vector $y_k \in \mathbb{R}^N$ is obtained as
$$ y_k^T = u_k^T W, \qquad (2.6) $$
with $W \in \mathbb{R}^{N \times N}$ the optimal filter. The $i$-th column of $W$ is then an optimal filter for the $i$-th component of $d_k$.

Figure 2: Optimal filtering problem with desired response vector $d_k$


The corresponding formulae are:
$$ J_{MSE}(W) = E\{\|e_k\|_2^2\} = E\{\|d_k - y_k\|_2^2\} = E\{(d_k - W^T u_k)^T (d_k - W^T u_k)\} = E\{d_k^T d_k\} - 2\, E\{u_k^T W d_k\} + E\{u_k^T W W^T u_k\} \qquad (2.7) $$
The optimal filter is found by setting the derivative $\partial J_{MSE} / \partial W$ equal to zero. Using the expressions from Appendix A.2, we obtain
$$ \frac{\partial J_{MSE}}{\partial W} = -2\, E\{u_k d_k^T\} + 2\, E\{u_k u_k^T\}\, W = 0. \qquad (2.8) $$
The optimal Wiener filter $W_{WF}$ is
$$ W_{WF} = E\{u_k u_k^T\}^{-1}\, E\{u_k d_k^T\} \qquad (2.9) $$
If $E\{u_k u_k^T\}$ and $E\{u_k d_k^T\}$ are known, the problem is solved conceptually.
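As a minimal numerical illustration (not part of the original report), the sketch below estimates the multiple right-hand-side Wiener filter of equation (2.9) from sample averages; the function name and the data-matrix layout (one observation of $u_k$ and $d_k$ per row) are our own assumptions.

```python
import numpy as np

def wiener_filter(U, D):
    """Wiener filter W = E{u u^T}^{-1} E{u d^T} (cf. eq. 2.9), estimated from
    data matrices U (p x N) and D (p x N) whose rows are observations of
    u_k and d_k respectively."""
    p = U.shape[0]
    Ruu = U.T @ U / p      # sample estimate of E{u_k u_k^T}
    Rud = U.T @ D / p      # sample estimate of E{u_k d_k^T}
    return np.linalg.solve(Ruu, Rud)
```

The $i$-th column of the returned matrix is then the estimated optimal filter for the $i$-th component of $d_k$, as stated above.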

In the following, we consider problems where only observations of $u_k$ are available, and the observed signal $u_k$ contains a signal-of-interest $s_k$ (e.g. a speech signal) plus additive noise $n_k$,
$$ u_k = s_k + n_k \qquad (2.10) $$
If we consider speech applications and use a robust speech/noise detection algorithm [12] [13], noise-only observations can be made during speech pauses (time $k_0$),
$$ u_{k_0} = 0 + n_{k_0}, \qquad (2.11) $$
which allows to estimate the spatial and temporal colour of the noise.
Our goal is to reconstruct the signal-of-interest $s_k$ (during speech activity) from $u_k$ by means of a linear filter $W$. In the optimal filter context this means that the desired signal $d_k$ is in fact equal to the signal-of-interest $s_k$,
$$ d_k = s_k, \qquad (2.12) $$
but that now the desired signal $d_k$ is an unobservable signal. The optimal solution (Wiener filter) is still given by
$$ W_{WF} = E\{u_k u_k^T\}^{-1}\, E\{u_k s_k^T\}, \qquad (2.13) $$
but obtaining an estimate for $E\{u_k s_k^T\}$ is not straightforward.

2.2 SVD-based filtering

If we assume that we observe $u_{k_0} = n_{k_0}$ during speech pauses, then we can use such observations to estimate
$$ E\{n_{k_0} n_{k_0}^T\} = E\{u_{k_0} u_{k_0}^T\}. \qquad (2.14) $$
If we assume (short-term) noise stationarity, we have
$$ E\{n_k n_k^T\} = E\{n_{k_0} n_{k_0}^T\}, \qquad (2.15) $$
which means that we are able to estimate $E\{n_k n_k^T\}$.
During speech activity, we observe both the signal-of-interest and the noise signal,
$$ u_k = s_k + n_k, \qquad (2.16) $$
and we can use such observations to estimate $E\{u_k u_k^T\}$.
If we assume that $s_k$ and $n_k$ are statistically independent ($E\{s_k n_k^T\} = 0$), then
$$ E\{u_k u_k^T\} = E\{s_k s_k^T\} + E\{s_k n_k^T\} + E\{n_k s_k^T\} + E\{n_k n_k^T\} = E\{s_k s_k^T\} + E\{n_k n_k^T\} \qquad (2.17) $$
Given $E\{u_k u_k^T\}$ and $E\{n_k n_k^T\}$, we can thus compute $E\{s_k s_k^T\}$.
Finally, from the assumed independence of $s_k$ and $n_k$ it also follows that
$$ E\{u_k s_k^T\} = E\{s_k s_k^T\} + E\{n_k s_k^T\} = E\{s_k s_k^T\}, \qquad (2.18) $$
so that the optimal filter $W_{WF}$ is given by
$$ W_{WF} = E\{u_k u_k^T\}^{-1}\, E\{s_k s_k^T\} = E\{u_k u_k^T\}^{-1}\, \big( E\{u_k u_k^T\} - E\{n_k n_k^T\} \big) \qquad (2.19) $$
PS: Note that if the desired response vector $d_k$ were $n_k$ instead of $s_k$, then the optimal estimator $W_{WF}^n$ for $n_k$ would be
$$ W_{WF}^n = E\{u_k u_k^T\}^{-1}\, E\{u_k n_k^T\} = E\{u_k u_k^T\}^{-1}\, E\{n_k n_k^T\} = I - W_{WF}. \qquad (2.20) $$
This means that an optimal estimate for $n_k$ is obtained by subtracting the optimal estimate for $s_k$ from $u_k$, and vice versa.
PS: Note that if the additive noise is zero ($E\{n_k n_k^T\} = 0$), then $W_{WF} = I$.
An interesting and useful simplification in formula (2.19) for $W_{WF}$ is derived from the joint diagonalization (generalized eigenvalue decomposition) [14] of the symmetric matrices $E\{u_k u_k^T\}$ and $E\{n_k n_k^T\}$,
$$ E\{u_k u_k^T\} = X \,\mathrm{diag}\{\sigma_i^2\}\, X^T, \qquad E\{n_k n_k^T\} = X \,\mathrm{diag}\{\eta_i^2\}\, X^T, \qquad (2.21) $$
with $X$ an invertible, but not necessarily orthogonal, matrix. Note that $\mathrm{diag}\{\sigma_i^2\}$ represents a diagonal matrix with diagonal elements $\sigma_i^2$, $i = 1 \ldots N$, and that $\mathrm{diag}\{\eta_i^2\}$ is similarly defined.
In practice, $X$, $\sigma_i^2$ and $\eta_i^2$ are computed by means of a generalized singular value decomposition of the data matrices $U_k \in \mathbb{R}^{p \times N}$ and $N_k \in \mathbb{R}^{q \times N}$ (with $p$ and $q$ typically larger than $N$),
$$ U_k = \begin{bmatrix} u_k^T \\ u_{k+1}^T \\ \vdots \\ u_{k+p-1}^T \end{bmatrix}, \qquad N_k = \begin{bmatrix} n_k^T \\ n_{k+1}^T \\ \vdots \\ n_{k+q-1}^T \end{bmatrix}, \qquad (2.22) $$
such that $E\{u_k u_k^T\} \simeq U_k^T U_k$ and $E\{n_k n_k^T\} \simeq N_k^T N_k$. The generalized singular value decomposition of the matrices $U_k$ and $N_k$ is defined as
$$ U_k = U \,\mathrm{diag}\{\sigma_i\}\, X^T, \qquad N_k = V \,\mathrm{diag}\{\eta_i\}\, X^T, \qquad (2.23) $$
with $U \in \mathbb{R}^{p \times N}$ and $V \in \mathbb{R}^{q \times N}$ orthogonal matrices, $X \in \mathbb{R}^{N \times N}$ an invertible matrix and $\sigma_i / \eta_i$ the generalized singular values.
By substituting the above formulas into formula (2.19), one obtains
$$ W_{WF} = X^{-T} \,\mathrm{diag}\Big\{\frac{\sigma_i^2 - \eta_i^2}{\sigma_i^2}\Big\}\, X^T \qquad (2.24) $$
In fact, the filter $W_{WF}$ belongs to a more general class of estimators, which can be described by
$$ W = X^{-T} \,\mathrm{diag}\{f(\sigma_i^2, \eta_i^2)\}\, X^T. \qquad (2.25) $$
This formula can be interpreted as follows:
- $X^{-T}$ is an analysis filterbank which performs a transformation from the time domain to a transform domain;
- $f(\sigma_i^2, \eta_i^2)$ is a function which modifies the transform domain parameters;
- $X^T$ is a synthesis filterbank which performs a transformation from the transform domain back to the time domain.
By using the function $f(\sigma_i^2, \eta_i^2) = \frac{\sigma_i^2 - \eta_i^2}{\sigma_i^2}$ and using the generalized eigenvectors $X$, one obtains the optimal filter defined in formula (2.24).
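As an illustration of equations (2.21)-(2.25) (our own sketch, not part of the report), the filter can be built from the correlation matrices via a generalized symmetric eigenvalue decomposition, assuming $E\{n_k n_k^T\}$ is positive definite. The report itself computes $X$ and the spectra directly from the GSVD of the data matrices, which is numerically preferable; the route below is mathematically equivalent up to scaling.

```python
import numpy as np
from scipy.linalg import eigh

def svd_based_filter(Ruu, Rnn, f=None):
    """Estimator W = X^{-T} diag{f(sigma_i^2, eta_i^2)} X^T (cf. eq. 2.25)
    from the joint diagonalization of Ruu = E{u u^T} and Rnn = E{n n^T}.
    Default f gives the Wiener filter of eq. (2.24)."""
    # eigh solves Ruu v = lambda Rnn v with V^T Rnn V = I.  In the notation
    # of (2.21) this corresponds to X = V^{-T}, sigma_i^2 = lambda_i, eta_i^2 = 1.
    lam, V = eigh(Ruu, Rnn)            # Rnn must be positive definite
    if f is None:
        f = lambda s2, e2: (s2 - e2) / s2          # Wiener gain
    gains = f(lam, np.ones_like(lam))
    # W = X^{-T} diag{gains} X^T = V diag{gains} V^{-1}
    return V @ np.diag(gains) @ np.linalg.inv(V)
```

Passing a different `f` (e.g. a 0/1 truncation rule) yields other members of the general class of estimators discussed below.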

2.3 Error covariance matrix

The estimation error $e_k$ is defined as
$$ e_k = s_k - y_k = s_k - W_{WF}^T u_k \qquad (2.26) $$
The expected value of the estimation error is
$$ E\{e_k\} = E\{s_k - W_{WF}^T u_k\} = (I - W_{WF}^T)\, E\{s_k\} - W_{WF}^T\, E\{n_k\}, \qquad (2.27) $$
which is zero if $E\{s_k\} = 0$ and $W_{WF} = 0$, or if $E\{n_k\} = 0$ and $W_{WF} = I$.
The error covariance matrix is computed as
$$ \begin{aligned} E\{e_k e_k^T\} &= E\{(s_k - W_{WF}^T u_k)(s_k - W_{WF}^T u_k)^T\} \\ &= E\{s_k s_k^T\} - W_{WF}^T E\{u_k s_k^T\} - E\{s_k u_k^T\}\, W_{WF} + W_{WF}^T E\{u_k u_k^T\}\, W_{WF} \\ &\overset{(2.13)}{=} E\{s_k s_k^T\} - E\{s_k u_k^T\}\, W_{WF} \\ &\overset{(2.18)}{=} E\{s_k s_k^T\}\, (I - W_{WF}) \\ &\overset{(2.17)}{=} \big( E\{u_k u_k^T\} - E\{n_k n_k^T\} \big) (I - W_{WF}) \\ &\overset{(2.19)}{=} E\{n_k n_k^T\}\, W_{WF} \end{aligned} \qquad (2.28) $$
A similar formula is obtained in [15]. In particular, we are interested in the diagonal elements of the error covariance matrix $\{E\{n_k n_k^T\}\, W_{WF}\}_{ii}$, since these elements indicate how well $\{s_k\}_i$ (the $i$th component of $s_k$) is estimated.
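A small sketch (our own, hedged illustration) of how these diagonal elements can be used to pick the best-estimated component, anticipating the estimator $w_{WF}^{\min}$ used later in the report:

```python
import numpy as np

def best_estimator(W_wf, Rnn):
    """Select the column of W_WF whose component has the smallest diagonal
    entry of the error covariance E{e e^T} = Rnn @ W_WF (cf. eq. 2.28)."""
    err_var = np.diag(Rnn @ W_wf)        # error variance of each estimated component
    i_best = int(np.argmin(err_var))
    return W_wf[:, i_best], i_best, err_var
```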

2.4 White noise case

In the white noise case, we have
$$ E\{n_k n_k^T\} = \eta^2 I, \qquad (2.29) $$
with $\eta^2$ the power of the white noise process. Obviously this simplifies the formulas considerably. The joint diagonalization reduces to an eigenvalue decomposition of the form
$$ E\{u_k u_k^T\} = X \,\mathrm{diag}\{\sigma_i^2\}\, X^T, \qquad E\{n_k n_k^T\} = \eta^2 I = X\, \eta^2\, X^T, \qquad (2.30) $$
hence with $X$ an orthogonal matrix. By using $X^{-T} = X$ and $\eta_i = \eta$, the optimal filter becomes
$$ W_{WF} = X \,\mathrm{diag}\Big\{\frac{\sigma_i^2 - \eta^2}{\sigma_i^2}\Big\}\, X^T \qquad (2.31) $$
Often the noise power $\eta^2$ can be estimated from the smallest singular values of $E\{u_k u_k^T\}$ (e.g. after assuming a low rank model for $E\{s_k s_k^T\}$, which is approximately valid for speech signals [16]). This means that speech detection is no longer necessary, and that the method also applies to non-speech applications.
In the white noise case, the error covariance matrix $E\{e_k e_k^T\}$ reduces to
$$ E\{e_k e_k^T\} = E\{(s_k - W_{WF}^T u_k)(s_k - W_{WF}^T u_k)^T\} = E\{n_k n_k^T\}\, W_{WF} = \eta^2\, W_{WF} \qquad (2.32) $$
PS: In the white noise case, from the orthogonality of $X$, it follows that every diagonal element in $W_{WF}$ is limited between 0 and 1:
$$ \{W_{WF}\}_{ii} = X(i,:) \,\mathrm{diag}\Big\{\frac{\sigma_j^2 - \eta^2}{\sigma_j^2}\Big\}\, X(i,:)^T = \sum_{j=1}^N \frac{\sigma_j^2 - \eta^2}{\sigma_j^2}\, X(i,j)^2, \qquad (2.33) $$
$$ 0 \le \{W_{WF}\}_{ii} \le \sum_{j=1}^N X(i,j)^2 = 1 \qquad (\text{since } 0 \le \eta^2 \le \sigma_j^2), \qquad (2.34) $$
which means that the estimate for $\{s_k\}_i$ contains a contribution $\alpha \{u_k\}_i$ with $0 \le \alpha \le 1$, and that $\alpha = 1$ in the noiseless case ($\eta = 0$).
PS: In the white noise case, $W_{WF}$ is a symmetric matrix. This means that if the estimate for $\{s_k\}_i$ contains a contribution $\{u_k\}_j$, then the estimate for $\{s_k\}_j$ contains a contribution $\{u_k\}_i$ (`reciprocity').
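A minimal sketch of the white-noise-case filter of equation (2.31), with the noise power estimated from the smallest eigenvalues as suggested above. This is our own illustration: the assumed signal rank `rank` and the clipping of the gains to [0, 1] are practical assumptions, not prescriptions from the report.

```python
import numpy as np

def white_noise_filter(Ruu, rank):
    """White-noise-case filter (cf. eq. 2.31): eigendecomposition of Ruu,
    with the noise power eta^2 estimated as the mean of the N - rank
    smallest eigenvalues (low-rank model for the signal)."""
    lam, X = np.linalg.eigh(Ruu)          # ascending eigenvalues, X orthogonal
    eta2 = lam[: len(lam) - rank].mean()  # noise-floor estimate
    gains = np.clip((lam - eta2) / lam, 0.0, 1.0)
    return X @ np.diag(gains) @ X.T       # symmetric, diagonal entries in [0, 1]
```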

2.5 Time series filtering

Let us now assume the vector $u_k$ is taken from a time series $u(k)$, i.e.
$$ u_k = \begin{bmatrix} u(k) & u(k-1) & u(k-2) & \ldots & u(k-N+1) \end{bmatrix}^T \qquad (2.35) $$
and similarly
$$ s_k = \begin{bmatrix} s(k) & s(k-1) & s(k-2) & \ldots & s(k-N+1) \end{bmatrix}^T. \qquad (2.36) $$
The data matrices $U_k \in \mathbb{R}^{p \times N}$ and $N_k \in \mathbb{R}^{q \times N}$, as defined in equation (2.22), now are Toeplitz matrices, e.g.
$$ U_k = \begin{bmatrix} u_k^T \\ u_{k+1}^T \\ u_{k+2}^T \\ \vdots \\ u_{k+p-1}^T \end{bmatrix} = \begin{bmatrix} u(k) & u(k-1) & u(k-2) & \ldots & u(k-N+1) \\ u(k+1) & u(k) & u(k-1) & \ldots & u(k-N+2) \\ u(k+2) & u(k+1) & u(k) & \ldots & u(k-N+3) \\ \vdots & \vdots & \vdots & & \vdots \\ u(k+p-1) & u(k+p-2) & u(k+p-3) & \ldots & u(k+p-N) \end{bmatrix} \qquad (2.37) $$
For wide-sense stationary (WSS) processes $s(k)$, the autocorrelation function
$$ \varphi(\tau) = E\{s(k)\, s(k-\tau)\} \qquad (2.38) $$
is only dependent on the time difference $\tau$, and is a symmetric function,
$$ \varphi(\tau) = \varphi(-\tau), \qquad (2.39) $$
such that the correlation matrices $E\{u_k u_k^T\}$ and $E\{s_k s_k^T\}$ are symmetric Toeplitz matrices, e.g.
$$ E\{s_k s_k^T\} = \begin{bmatrix} \varphi(0) & \varphi(1) & \varphi(2) & \ldots & \varphi(N-1) \\ \varphi(1) & \varphi(0) & \varphi(1) & \ldots & \varphi(N-2) \\ \varphi(2) & \varphi(1) & \varphi(0) & \ldots & \varphi(N-3) \\ \vdots & \vdots & \vdots & & \vdots \\ \varphi(N-1) & \varphi(N-2) & \varphi(N-3) & \ldots & \varphi(0) \end{bmatrix}. \qquad (2.40) $$
Symmetric Toeplitz matrices belong to the class of double symmetric matrices, which are symmetric about both the main diagonal and the secondary diagonal. The eigenvectors of such matrices are known to have special symmetry properties [17] [18]. For specific notation and properties, we refer to appendix B.

Theorem 1. If the filter $W_{WF}$ is constructed according to equations (2.19)/(2.24), then $W_{WF}$ satisfies
$$ W_{WF} = J\, W_{WF}\, J \qquad (2.41) $$
$$ W_{WF}^T = J\, W_{WF}^T\, J \qquad (2.42) $$
with $J$ a matrix with all ones along its secondary diagonal and zeros everywhere else (reversal matrix as defined in equation (B.4)). These properties hold in the white noise case as well as in the coloured noise case.

Proof: Since $E\{u_k u_k^T\}$ and $E\{n_k n_k^T\}$ are symmetric Toeplitz, they satisfy
$$ E\{u_k u_k^T\} = J\, E\{u_k u_k^T\}\, J \qquad (2.43) $$
$$ E\{n_k n_k^T\} = J\, E\{n_k n_k^T\}\, J \qquad (2.44) $$
According to lemmas 1 and 2 in appendix B, it follows that
$$ E\{u_k u_k^T\}^{-1} = J\, E\{u_k u_k^T\}^{-1}\, J \qquad (2.45) $$
$$ E\{u_k u_k^T\}^{-1} E\{n_k n_k^T\} = J\, E\{u_k u_k^T\}^{-1} E\{n_k n_k^T\}\, J \qquad (2.46) $$
The optimal filter $W_{WF}$, defined in equation (2.19), is
$$ W_{WF} = I - E\{u_k u_k^T\}^{-1} E\{n_k n_k^T\} \qquad (2.47) $$
From this, it follows that
$$ J\, W_{WF}\, J = J \big( I - E\{u_k u_k^T\}^{-1} E\{n_k n_k^T\} \big) J = I - E\{u_k u_k^T\}^{-1} E\{n_k n_k^T\} = W_{WF} \qquad (2.48) $$
According to lemma 1 in appendix B, it follows that
$$ J\, W_{WF}^T\, J = W_{WF}^T \qquad (2.49) $$
The properties $J W_{WF} J = W_{WF}$ and $J W_{WF}^T J = W_{WF}^T$ mean that the $i$th row/column of $W_{WF}$ is equal to the $(N+1-i)$th row/column in reverse order. In the white noise case $W_{WF}$ is a symmetric matrix. From the property $J W_{WF} J = W_{WF}$ it then follows that $W_{WF}$ is a double-symmetric matrix in the white noise case.

Theorem 2. If the filter $W_{WF}$ belongs to the more general class of estimators, defined in equation (2.25),
$$ W_{WF} = X^{-T} \,\mathrm{diag}\{f(\sigma_i^2, \eta_i^2)\}\, X^T, \qquad (2.50) $$
the properties of equations (2.41)/(2.42) still hold, in the white noise case as well as in the coloured noise case.

Proof: The joint diagonalization of $E\{u_k u_k^T\}$ and $E\{n_k n_k^T\}$, as defined in equation (2.21), is
$$ E\{u_k u_k^T\} = X \,\mathrm{diag}\{\sigma_i^2\}\, X^T, \qquad E\{n_k n_k^T\} = X \,\mathrm{diag}\{\eta_i^2\}\, X^T. \qquad (2.51) $$
Therefore
$$ E\{u_k u_k^T\}^{-1} E\{n_k n_k^T\} = X^{-T} \,\mathrm{diag}\Big\{\frac{\eta_i^2}{\sigma_i^2}\Big\}\, X^T \qquad (2.52) $$
is the eigenvector decomposition of $E\{u_k u_k^T\}^{-1} E\{n_k n_k^T\}$, with $X$ an invertible, but not necessarily orthogonal matrix (orthogonal only in the white noise case).
Because
$$ E\{u_k u_k^T\}^{-1} E\{n_k n_k^T\} = J\, E\{u_k u_k^T\}^{-1} E\{n_k n_k^T\}\, J, \qquad (2.53) $$
the eigenvectors (columns of $X^{-T}$) are known to have symmetry properties, in particular (see equation (B.13))
$$ J\, X^{-T} = X^{-T} \,\mathrm{diag}\{\pm 1\}. \qquad (2.54) $$
With this, one obtains
$$ J\, W_{WF}\, J = J\, X^{-T} \,\mathrm{diag}\{f(\sigma_i^2, \eta_i^2)\}\, X^T\, J = X^{-T} \,\mathrm{diag}\{\pm 1\} \,\mathrm{diag}\{f(\sigma_i^2, \eta_i^2)\} \,\mathrm{diag}\{\pm 1\}\, X^T = X^{-T} \,\mathrm{diag}\{f(\sigma_i^2, \eta_i^2)\}\, X^T = W_{WF} \qquad (2.55) $$
Rank truncation, for instance, is the basis for a popular estimation procedure in the white noise case [1], where
$$ f(\sigma_i^2, \eta^2) = 1 \;\; \text{if } \sigma_i^2 \ge \eta^2, \qquad f(\sigma_i^2, \eta^2) = 0 \;\; \text{if } \sigma_i^2 < \eta^2. \qquad (2.56) $$
If we consider only the first generalized eigenvector, corresponding to the maximum generalized eigenvalue $\sigma_1^2 / \eta_1^2$, i.e.
$$ f(\sigma_i^2, \eta_i^2) = 1 \;\; \text{if } i = 1, \qquad f(\sigma_i^2, \eta_i^2) = 0 \;\; \text{otherwise}, \qquad (2.57) $$
then the estimate $\hat{s}_k = W_{WF}^T u_k$ will have maximal signal-to-noise ratio (SNR) [19], but the signal will be distorted (for some applications this distortion can however be tolerated). This means that the optimal filter $W_{WF}$ as defined in equation (2.24) with $f(\sigma_i^2, \eta_i^2) = \frac{\sigma_i^2 - \eta_i^2}{\sigma_i^2}$ will not produce maximal signal-to-noise ratio. Instead, by minimizing the mean squared error (MSE), this filter also takes signal distortion into account.

Note that an estimate $\hat{s}_k$ for $s_k$ is obtained as
$$ \hat{s}_k = \begin{bmatrix} \hat{s}(k) \\ \hat{s}(k-1) \\ \hat{s}(k-2) \\ \vdots \\ \hat{s}(k-N+1) \end{bmatrix} = W_{WF}^T \begin{bmatrix} u(k) \\ u(k-1) \\ u(k-2) \\ \vdots \\ u(k-N+1) \end{bmatrix} \qquad (2.58) $$
We will use a more explicit notation as follows
$$ \begin{bmatrix} \hat{s}_{k:k-N+1}(k) \\ \hat{s}_{k:k-N+1}(k-1) \\ \hat{s}_{k:k-N+1}(k-2) \\ \vdots \\ \hat{s}_{k:k-N+1}(k-N+1) \end{bmatrix} = W_{WF}^T \begin{bmatrix} u(k) \\ u(k-1) \\ u(k-2) \\ \vdots \\ u(k-N+1) \end{bmatrix} \qquad (2.59) $$
where $\hat{s}_{k:k-N+1}(l)$ means that an estimate for $s(l)$ is obtained as a linear combination of $u(k)$, $u(k-1)$, ..., $u(k-N+1)$. For $N$ odd, the middle row in $W_{WF}^T$ produces the estimate $\hat{s}_{k:k-N+1}(k - \frac{N-1}{2})$, where $s(k - \frac{N-1}{2})$ is estimated from $u(k - \frac{N-1}{2})$ together with $\frac{N-1}{2}$ earlier samples and $\frac{N-1}{2}$ later samples of $u$.
The property $J W_{WF}^T J = W_{WF}^T$ then indicates that for $N$ odd, the middle row in $W_{WF}^T$ is symmetric, and hence represents a linear phase filter. Note that a zero phase property has been attributed to an SVD and rank truncation based estimator for the white noise case, if an additional averaging step (see also section 2.6) is included [20]. For the coloured noise case [2] [3] [4], a similar linear phase property had apparently not been derived yet.
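A small numerical sketch of the time-series case (our own illustration; the biased lag-window correlation estimate is an assumption). It builds the symmetric Toeplitz correlation matrices from scalar speech and noise time series, forms the filter of equation (2.47) and checks the symmetry of Theorem 1.

```python
import numpy as np
from scipy.linalg import toeplitz

def time_series_filter(u, n, N):
    """N-taps SVD-based optimal filter for a scalar time series (section 2.5).
    The sample correlation matrices are symmetric Toeplitz, so Theorem 1
    (W_WF = J W_WF J) holds exactly."""
    def corr_toeplitz(x):
        r = np.array([np.dot(x[: len(x) - k], x[k:]) / len(x) for k in range(N)])
        return toeplitz(r)                          # estimate of E{x_k x_k^T}
    Ruu, Rnn = corr_toeplitz(u), corr_toeplitz(n)
    W = np.eye(N) - np.linalg.solve(Ruu, Rnn)       # eq. (2.19)/(2.47)
    J = np.eye(N)[::-1]                             # reversal matrix
    assert np.allclose(J @ W @ J, W, atol=1e-8)     # Theorem 1
    return W
```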


2.6 Time series filtering and averaging

From
$$ \begin{bmatrix} \hat{s}_{k:k-N+1}(k) \\ \hat{s}_{k:k-N+1}(k-1) \\ \hat{s}_{k:k-N+1}(k-2) \\ \vdots \\ \hat{s}_{k:k-N+1}(k-N+1) \end{bmatrix} = W_{WF}^T\, u_k \qquad (2.60) $$
it follows that
$$ \begin{bmatrix} \hat{s}_{k:k-N+1}(k) & \hat{s}_{k+1:k-N+2}(k+1) & \ldots & \hat{s}_{k+N-1:k}(k+N-1) \\ \hat{s}_{k:k-N+1}(k-1) & \hat{s}_{k+1:k-N+2}(k) & \ldots & \hat{s}_{k+N-1:k}(k+N-2) \\ \hat{s}_{k:k-N+1}(k-2) & \hat{s}_{k+1:k-N+2}(k-1) & \ldots & \hat{s}_{k+N-1:k}(k+N-3) \\ \vdots & \vdots & & \vdots \\ \hat{s}_{k:k-N+1}(k-N+1) & \hat{s}_{k+1:k-N+2}(k-N+2) & \ldots & \hat{s}_{k+N-1:k}(k) \end{bmatrix} = W_{WF}^T \begin{bmatrix} u_k & u_{k+1} & \ldots & u_{k+N-1} \end{bmatrix} \qquad (2.61) $$
It is seen that several (maximum $N$) estimates are obtained for one and the same sample $s(l)$. As an example, $N$ estimates for $s(k)$ are available on the main diagonal. If $w(i,j)$ denotes the $(i,j)$-element of $W_{WF}$, one can obtain an explicit formula for all these estimates together:
$$ \begin{bmatrix} \hat{s}_{k:k-N+1}(k) \\ \hat{s}_{k+1:k-N+2}(k) \\ \vdots \\ \hat{s}_{k+N-2:k-1}(k) \\ \hat{s}_{k+N-1:k}(k) \end{bmatrix} = \bar{W}_{WF}^T \begin{bmatrix} u(k+N-1) \\ u(k+N-2) \\ \vdots \\ u(k+1) \\ u(k) \\ \vdots \\ u(k-N+2) \\ u(k-N+1) \end{bmatrix} \qquad (2.62) $$
with
$$ \bar{W}_{WF}^T = \begin{bmatrix} 0 & 0 & \ldots & 0 & w(1,1) & \ldots & w(N-1,1) & w(N,1) \\ 0 & 0 & \ldots & w(1,2) & w(2,2) & \ldots & w(N,2) & 0 \\ \vdots & & & & \vdots & & & \vdots \\ 0 & w(1,N-1) & \ldots & w(N-2,N-1) & w(N-1,N-1) & \ldots & 0 & 0 \\ w(1,N) & w(2,N) & \ldots & w(N-1,N) & w(N,N) & \ldots & 0 & 0 \end{bmatrix} \qquad (2.63) $$
i.e. the $i$th row of $\bar{W}_{WF}^T$ contains the $i$th row of $W_{WF}^T$, preceded by $N-i$ zeros and followed by $i-1$ zeros.
From $J W_{WF}^T J = W_{WF}^T$ it immediately follows that
$$ \bar{W}_{WF}^T = J\, \bar{W}_{WF}^T\, J, \qquad (2.64) $$
with $J$ denoting reversal matrices of the appropriate dimensions.

The question now arises which estimate, out of the $N$ available estimates for $s(k)$, is the best. The answer is given by the error covariance matrix (see section 2.3) $E\{e_k e_k^T\} = E\{n_k n_k^T\}\, W_{WF}$. The smallest element on the main diagonal of the error covariance matrix corresponds to the best estimator. From here on the best estimator, which is the corresponding row of $W_{WF}^T$, will be denoted as $w_{WF}^{\min}$.
The question remains if perhaps an even better estimate for $s(k)$ can be obtained by linearly combining the $N$ available estimates. This question is apparently not easily answered. An obvious choice could be to average over all available estimates, a technique which is often applied to rank truncation based estimation [1] [2] [3] [4] [20],
$$ \tilde{s}_{k+N-1:k-N+1}(k) = \begin{bmatrix} \tfrac{1}{N} & \tfrac{1}{N} & \ldots & \tfrac{1}{N} \end{bmatrix} \begin{bmatrix} \hat{s}_{k:k-N+1}(k) \\ \hat{s}_{k+1:k-N+2}(k) \\ \vdots \\ \hat{s}_{k+N-1:k}(k) \end{bmatrix} = \underbrace{\begin{bmatrix} \tfrac{1}{N} & \tfrac{1}{N} & \ldots & \tfrac{1}{N} \end{bmatrix} \bar{W}_{WF}^T}_{\tilde{w}^T} \begin{bmatrix} u(k+N-1) \\ u(k+N-2) \\ \vdots \\ u(k) \\ \vdots \\ u(k-N+2) \\ u(k-N+1) \end{bmatrix} \qquad (2.65) $$
Here $\tilde{s}_{k+N-1:k-N+1}(k)$ is estimated from $u(k)$ together with $(N-1)$ earlier samples and $(N-1)$ later samples. The $(2N-1)$-taps filter $\tilde{w}$ is obtained by averaging over the available $N$-taps filters $W_{WF}^T(i,:)$. From the symmetry property of $\bar{W}_{WF}^T$ it is readily seen that $\tilde{w}$ is symmetric, and hence represents a linear phase filter. A crucial question is whether the $(2N-1)$-taps estimator $\tilde{w}$ is better than the individual $N$-taps estimators $W_{WF}^T(i,:)$ it is computed from. Specifically, $\tilde{w}$ should be compared with the symmetric middle row of $W_{WF}^T$ (if $N$ is odd), which represents a linear phase filter that uses $\frac{N-1}{2}$ earlier samples and $\frac{N-1}{2}$ later samples.
First, it can be verified that $\tilde{w}$ is not an optimal filter, i.e.
$$ \tilde{s}_{k+N-1:k-N+1}(k) \neq \hat{s}_{k+N-1:k-N+1}(k) \qquad (2.66) $$
Note that $\tilde{s}_{k+N-1:k-N+1}(k)$ corresponds to a linear-phase $(2N-1)$-taps estimator $\tilde{w}$, obtained by averaging over a collection of $N$-taps estimators. On the other hand, $\hat{s}_{k+N-1:k-N+1}(k)$ corresponds to a linear-phase $(2N-1)$-taps estimator $\hat{w}$, which is obtained by applying the usual Wiener filter formulas to a $(2N-1)$-dimensional vector $u_k$. So, in general, $\hat{w}$ is a function of $\varphi(0), \varphi(1), \ldots, \varphi(2N-2)$, whereas $\tilde{w}$ will only be a function of $\varphi(0), \varphi(1), \ldots, \varphi(N-1)$. This means that $\hat{s}_{k+N-1:k-N+1}(k)$ and $\tilde{s}_{k+N-1:k-N+1}(k)$ are not the same, except for contrived examples. The following example further illustrates this.

Example: As an example, consider a white noise case in which the noise power $\eta^2$ dominates the signal, so that $E\{u_k u_k^T\} \approx \eta^2 I$. Then
$$ W_{WF} = E\{u_k u_k^T\}^{-1} E\{s_k s_k^T\} \simeq \frac{1}{\eta^2}\, E\{s_k s_k^T\} = \frac{1}{\eta^2} \begin{bmatrix} \varphi(0) & \varphi(1) & \ldots & \varphi(N-1) \\ \varphi(1) & \varphi(0) & \ldots & \varphi(N-2) \\ \vdots & \vdots & & \vdots \\ \varphi(N-1) & \varphi(N-2) & \ldots & \varphi(0) \end{bmatrix}, $$
where the correlation matrix $E\{s_k s_k^T\}$ is a symmetric Toeplitz matrix. The matrix $\bar{W}_{WF}^T$ then has the form
$$ \bar{W}_{WF}^T = \frac{1}{\eta^2} \begin{bmatrix} 0 & \ldots & 0 & \varphi(0) & \varphi(1) & \ldots & \varphi(N-2) & \varphi(N-1) \\ 0 & \ldots & \varphi(1) & \varphi(0) & \varphi(1) & \ldots & \varphi(N-2) & 0 \\ \vdots & & & \vdots & & & & \vdots \\ \varphi(N-1) & \varphi(N-2) & \ldots & \varphi(1) & \varphi(0) & \ldots & 0 & 0 \end{bmatrix} $$
It is readily verified that the $(2N-1)$-taps estimator $\tilde{w}$, obtained through averaging, is
$$ \tilde{w} \simeq \frac{1}{N \eta^2} \begin{bmatrix} 1\,\varphi(N-1) & 2\,\varphi(N-2) & \ldots & N\,\varphi(0) & \ldots & 2\,\varphi(N-2) & 1\,\varphi(N-1) \end{bmatrix}, $$
whereas the corresponding $(2N-1)$-taps optimal linear-phase filter $\hat{w}$ (middle row of $W_{WF}^T$ applied to a $(2N-1)$-dimensional vector) is
$$ \hat{w} \simeq \frac{1}{\eta^2} \begin{bmatrix} \varphi(N-1) & \varphi(N-2) & \ldots & \varphi(0) & \ldots & \varphi(N-2) & \varphi(N-1) \end{bmatrix}. $$
Secondly, simulations indicate that the obtained error variance for the $(2N-1)$-taps estimator $\tilde{w}$ is mostly comparable to the error variances for the original $N$-taps estimators $W_{WF}^T(i,:)$, and always larger than the error variance for the best $N$-taps estimator $w_{WF}^{\min}$.

Simulation example: Consider two (stationary) unit-variance white noise processes $s(k)$ and $n(k)$, $k = 1 \ldots L$. The input signal $u(k)$ is constructed as the sum of the useful signal $s(k)$ and the scaled noise $\eta\, n(k)$,
$$ u(k) = s(k) + \eta\, n(k), \qquad k = 1 \ldots L, \qquad (2.67) $$
with $\eta^2$ the power of the noise process, which is assumed to be known. The correlation matrix $E\{u_k u_k^T\} \in \mathbb{R}^{N \times N}$ is computed as
$$ E\{u_k u_k^T\} = \frac{1}{L}\, U^T U, \qquad (2.68) $$
with $L$ the length of the signals and $U \in \mathbb{R}^{L \times N}$ the data matrix defined as in equation (2.22). Since the noise is white, the noise correlation matrix $E\{n_k n_k^T\} \in \mathbb{R}^{N \times N}$ is
$$ E\{n_k n_k^T\} = \eta^2 I. \qquad (2.69) $$
Both the optimal filter $W_{WF}$, which consists of $N$ $N$-taps estimators $W_{WF}^T(i,:)$, and the $(2N-1)$-taps estimator $\tilde{w}$, obtained through averaging, are computed from these correlation matrices. Also $\hat{s}(k) = U\, W_{WF}$, which consists of $N$ estimates $\hat{s}_i(k) = U\, W_{WF}(:,i)$, and $\tilde{s}(k)$, obtained through averaging, are computed. The error variances $\hat{\varepsilon}_i$, $i = 1 \ldots N$, and $\tilde{\varepsilon}$ are defined as
$$ \hat{\varepsilon}_i = \frac{1}{L} \sum_{k=1}^{L} \big( s(k) - \hat{s}_i(k) \big)^2, \quad i = 1 \ldots N, \qquad (2.70) $$
$$ \tilde{\varepsilon} = \frac{1}{L} \sum_{k=1}^{L} \big( s(k) - \tilde{s}(k) \big)^2. \qquad (2.71) $$
For $N = 9$ and $L = 10^5$, the error variances $\hat{\varepsilon}_i$, $i = 1 \ldots N$, and $\tilde{\varepsilon}$ are compared for two different noise powers ($\eta^2 = 0.5$ and $\eta^2 = 2$) in figure 3. As can be seen from the simulations, the $(2N-1)$-taps estimator $\tilde{w}$ is not always better than the individual $N$-taps estimators $W_{WF}(:,i)$ it is computed from. Moreover, there always seem to exist $N$-taps estimators $W_{WF}(:,i)$ which give rise to a lower error variance than the $(2N-1)$-taps estimator $\tilde{w}$.
Figure 3: Error variance comparison between the $(2N-1)$-taps estimator $\tilde{w}$ and the original $N$-taps estimators $W_{WF}(:,i)$ for different noise powers ($N = 9$, $L = 10^5$, $\eta^2 = 0.5$ and $\eta^2 = 2$).
Hence averaging does not seem to be a well-founded operation, while on the other hand it certainly increases computational complexity, since it requires $(2N-1)$-taps filtering instead of $N$-taps filtering. If minimal error variance is sought, we therefore suggest to pick the $N$-taps estimator $w_{WF}^{\min}$ corresponding to the smallest diagonal element in the error covariance matrix. If the linear phase property is desirable, we suggest to pick the $N$-taps estimator given by the middle row of $W_{WF}^T$ (for $N$ odd).
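The sketch below reconstructs the simulation example in spirit (our own code, with assumed sample-alignment conventions and a fixed random seed, not the authors' original implementation). It builds the $N$ individual estimates of $s(k)$, their average, and compares the error variances.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
N, L, eta2 = 9, 10**5, 0.5                      # filter length, signal length, noise power
s = rng.standard_normal(L)
u = s + np.sqrt(eta2) * rng.standard_normal(L)  # u(k) = s(k) + eta n(k)   (eq. 2.67)

r_uu = np.array([np.dot(u[:L - k], u[k:]) / L for k in range(N)])
Ruu, Rnn = toeplitz(r_uu), eta2 * np.eye(N)     # eqs. (2.68), (2.69)
W = np.eye(N) - np.linalg.solve(Ruu, Rnn)       # optimal filter W_WF, eq. (2.19)

# Rows of U are u_t^T = [u(t), u(t-1), ..., u(t-N+1)]
U = np.lib.stride_tricks.sliding_window_view(u, N)[:, ::-1]
S_hat = U @ W                                   # S_hat[r, i] estimates s(r + N-1 - i)

# Align the N estimates of the same sample s(k), k = N-1 ... L-N  (0-based)
k = np.arange(N - 1, L - N + 1)
est = np.stack([S_hat[k - (N - 1) + i, i] for i in range(N)], axis=1)
err_opt = ((est - s[k, None]) ** 2).mean(axis=0)       # N individual error variances
err_avg = ((est.mean(axis=1) - s[k]) ** 2).mean()      # averaged estimator

print("individual:", np.round(err_opt, 3), " averaged:", round(err_avg, 3))
print("best individual:", round(err_opt.min(), 3))
```

In our runs the averaged estimator is comparable to the typical individual estimators but never beats the best one, consistent with the conclusion above.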

2.7 Multichannel time series filtering

Consider $M$ channels where each channel $m_j(k)$, $j = 1 \ldots M$, consists of a filtered version of the desired signal $s(k)$ and an additive noise term $n_j(k)$,
$$ m_j(k) = h_j(k) * s(k) + n_j(k), \qquad (2.72) $$
with $h_j(k)$ the filter for the $j$th channel. This situation arises e.g. when we have a microphone array recording both a desired signal and background noise in a room, as depicted in figure 4.

Figure 4: Microphone array recording desired signal and background noise


The vector $u_k \in \mathbb{R}^{MN}$ now takes the form
$$ u_k = \begin{bmatrix} m_{1k} \\ m_{2k} \\ \vdots \\ m_{Mk} \end{bmatrix}, \qquad (2.73) $$
with
$$ m_{jk} = \begin{bmatrix} m_j(k) & m_j(k-1) & \ldots & m_j(k-N+1) \end{bmatrix}^T. \qquad (2.74) $$
The vectors $s_k$ and $n_k$ are similarly defined. The data matrix $U_k \in \mathbb{R}^{p \times MN}$ as defined in equation (2.22) then takes the form
$$ U_k = \begin{bmatrix} U_{1k} & U_{2k} & \ldots & U_{Mk} \end{bmatrix}, \qquad (2.75) $$
with
$$ U_{jk} = \begin{bmatrix} m_j(k) & m_j(k-1) & m_j(k-2) & \ldots & m_j(k-N+1) \\ m_j(k+1) & m_j(k) & m_j(k-1) & \ldots & m_j(k-N+2) \\ m_j(k+2) & m_j(k+1) & m_j(k) & \ldots & m_j(k-N+3) \\ \vdots & \vdots & \vdots & & \vdots \\ m_j(k+p-1) & m_j(k+p-2) & m_j(k+p-3) & \ldots & m_j(k+p-N) \end{bmatrix} \qquad (2.76) $$
Using the same formulas as for the one-channel case, the optimal filter $W_{WF}$ and the best $(MN)$-taps estimator $w_{WF}^{\min}$ can be computed.
For stationary signals we can use the same correlation matrices for all samples. Therefore the estimated signal $\hat{s}(k)$ can be computed as
$$ \hat{s}(k) = \begin{bmatrix} \hat{s}(k) \\ \hat{s}(k+1) \\ \vdots \\ \hat{s}(k+p-1) \end{bmatrix} = \begin{bmatrix} U_{1k} & U_{2k} & \ldots & U_{Mk} \end{bmatrix} w_{WF}^{\min}. \qquad (2.77) $$
This filter operation can be considered as a multichannel filter, where each of the $M$ channels is filtered with an $N$-taps filter $A_j$, where
$$ w_{WF}^{\min} = \begin{bmatrix} A_1^T & A_2^T & \ldots & A_M^T \end{bmatrix}^T. \qquad (2.78) $$
This is depicted in figure 5.
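A minimal sketch of this filter-and-sum structure (our own illustration; the function name, the causal `lfilter` convention and the channel ordering inside `w_min` are assumptions consistent with equation (2.78)):

```python
import numpy as np
from scipy.signal import lfilter

def filter_and_sum(mics, w_min, N):
    """Split the best MN-taps estimator w_min = [A_1; ...; A_M] (cf. eq. 2.78)
    into M channel filters of length N and apply the filter-and-sum structure
    of figure 5: each microphone signal is filtered with its A_j and the
    filter outputs are summed."""
    M = len(mics)
    A = np.asarray(w_min).reshape(M, N)          # A[j] is the N-taps filter of channel j+1
    return sum(lfilter(A[j], [1.0], mics[j]) for j in range(M))
```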
Figure 5: Multichannel filtering


If we consider no multipath effects, i.e. the filter $h_j(k) = 1$ for each channel $j$, then we can prove some additional symmetry properties for the optimal filter $W_{WF}$. In this case the desired signal in each channel is $s(k)$,
$$ m_j(k) = s(k) + n_j(k). \qquad (2.79) $$
In the following we will only consider symmetry properties for the 2-channel case. However these properties can easily be extended to more than 2 channels. For 2 channels, we have the following data model,
$$ m_1(k) = s(k) + n_1(k), \qquad m_2(k) = s(k) + n_2(k), \qquad (2.80) $$
such that the vectors $u_k$ and $n_k$ can be written as
$$ u_k = \begin{bmatrix} m_{1k} \\ m_{2k} \end{bmatrix} = \begin{bmatrix} s_k + n_{1k} \\ s_k + n_{2k} \end{bmatrix} \qquad (2.81) $$
and
$$ n_k = \begin{bmatrix} n_{1k} \\ n_{2k} \end{bmatrix}. \qquad (2.82) $$

Consider the following notations for the correlation matrices:
$$ \begin{aligned} R_{uu} &= E\{u_k u_k^T\} & (2.83) \\ R_{nn} &= E\{n_k n_k^T\} & (2.84) \\ R_s &= E\{s_k s_k^T\} & (2.85) \\ R_{n11} &= E\{n_{1k} n_{1k}^T\} & (2.86) \\ R_{n12} &= E\{n_{1k} n_{2k}^T\} = R_{n21}^T & (2.87) \\ R_{n21} &= E\{n_{2k} n_{1k}^T\} = R_{n12}^T & (2.88) \\ R_{n22} &= E\{n_{2k} n_{2k}^T\} & (2.89) \end{aligned} $$
If we assume the desired signal and the noise are uncorrelated, then the correlation matrix $R_{uu}$ can be written as
$$ R_{uu} = E\left\{ \begin{bmatrix} s_k + n_{1k} \\ s_k + n_{2k} \end{bmatrix} \begin{bmatrix} s_k^T + n_{1k}^T & s_k^T + n_{2k}^T \end{bmatrix} \right\} = \underbrace{E\left\{ \begin{bmatrix} s_k \\ s_k \end{bmatrix} \begin{bmatrix} s_k^T & s_k^T \end{bmatrix} \right\}}_{R_{ss}} + \underbrace{E\left\{ \begin{bmatrix} n_{1k} \\ n_{2k} \end{bmatrix} \begin{bmatrix} n_{1k}^T & n_{2k}^T \end{bmatrix} \right\}}_{R_{nn}}, \qquad (2.90) $$
with
$$ R_{ss} = \begin{bmatrix} R_s & R_s \\ R_s & R_s \end{bmatrix} \qquad (2.91) $$
$$ R_{nn} = \begin{bmatrix} R_{n11} & R_{n12} \\ R_{n12}^T & R_{n22} \end{bmatrix} \qquad (2.92) $$

First we will discuss the symmetry properties of the correlation matrix $R_{ss}$ and the conditions under which the correlation matrix $R_{nn}$ exhibits these symmetry properties.

Property 1. Because of the specific form of the correlation matrix $R_{ss}$ in equation (2.91), this matrix exhibits the following properties (for notation, see Appendix B):
1. $R_{ss}$ is a symmetric Toeplitz matrix
2. $J R_{ss} J = R_{ss}$
3. $S R_{ss} S = R_{ss}$

Proof:
1. Since $R_s$ is a symmetric Toeplitz matrix, it is easily verified that $R_{ss}$ is Toeplitz and that
$$ R_{ss}^T = \begin{bmatrix} R_s^T & R_s^T \\ R_s^T & R_s^T \end{bmatrix} = \begin{bmatrix} R_s & R_s \\ R_s & R_s \end{bmatrix} = R_{ss} $$
2. Since $R_s$ is a symmetric Toeplitz matrix, $J R_s J = R_s$ and
$$ J R_{ss} J = \begin{bmatrix} 0 & J \\ J & 0 \end{bmatrix} \begin{bmatrix} R_s & R_s \\ R_s & R_s \end{bmatrix} \begin{bmatrix} 0 & J \\ J & 0 \end{bmatrix} = \begin{bmatrix} J R_s J & J R_s J \\ J R_s J & J R_s J \end{bmatrix} = R_{ss} $$
3. Since $R_{ss}$ is block-Toeplitz and block-symmetric,
$$ S R_{ss} S = \begin{bmatrix} 0 & I \\ I & 0 \end{bmatrix} \begin{bmatrix} R_s & R_s \\ R_s & R_s \end{bmatrix} \begin{bmatrix} 0 & I \\ I & 0 \end{bmatrix} = \begin{bmatrix} R_s & R_s \\ R_s & R_s \end{bmatrix} = R_{ss} $$

For the noise correlation matrix $R_{nn}$ we will discuss the conditions under which $R_{nn}$ exhibits symmetry properties.

Property 2. The noise correlation matrix $R_{nn}$ has the property $J R_{nn} J = R_{nn}$ iff
$$ J R_{n22} J = R_{n11}, \qquad J R_{n12} J = R_{n12}^T \quad (R_{n12} \text{ is centro-symmetric}). \qquad (2.93) $$
Proof: Trivial by equating the following expressions:
$$ R_{nn} = \begin{bmatrix} R_{n11} & R_{n12} \\ R_{n12}^T & R_{n22} \end{bmatrix}, \qquad J R_{nn} J = \begin{bmatrix} 0 & J \\ J & 0 \end{bmatrix} \begin{bmatrix} R_{n11} & R_{n12} \\ R_{n12}^T & R_{n22} \end{bmatrix} \begin{bmatrix} 0 & J \\ J & 0 \end{bmatrix} = \begin{bmatrix} J R_{n22} J & J R_{n12}^T J \\ J R_{n12} J & J R_{n11} J \end{bmatrix} $$
A sufficient condition for $R_{n12}$ being centro-symmetric is $R_{n12}$ being Toeplitz. For stationary noise sources, $R_{n11}$ and $R_{n22}$ are symmetric Toeplitz matrices, such that the condition $J R_{n22} J = R_{n11}$ implies that $R_{n11} = R_{n22}$.

Property 3. The noise correlation matrix $R_{nn}$ has the property $S R_{nn} S = R_{nn}$ iff
$$ R_{n11} = R_{n22}, \qquad R_{n12} = R_{n12}^T \quad (R_{n12} \text{ is symmetric}). \qquad (2.94) $$
Proof: Trivial by equating the following expressions:
$$ R_{nn} = \begin{bmatrix} R_{n11} & R_{n12} \\ R_{n12}^T & R_{n22} \end{bmatrix}, \qquad S R_{nn} S = \begin{bmatrix} 0 & I \\ I & 0 \end{bmatrix} \begin{bmatrix} R_{n11} & R_{n12} \\ R_{n12}^T & R_{n22} \end{bmatrix} \begin{bmatrix} 0 & I \\ I & 0 \end{bmatrix} = \begin{bmatrix} R_{n22} & R_{n12}^T \\ R_{n12} & R_{n11} \end{bmatrix} $$
For different types of noise correlation matrices $R_{nn}$ we will now discuss the symmetry properties of the optimal Wiener filter $W_{WF}$,
$$ W_{WF} = X^{-T} \,\mathrm{diag}\Big\{\frac{\sigma_i^2 - \eta_i^2}{\sigma_i^2}\Big\}\, X^T, \qquad (2.95) $$
which can be written as
$$ W_{WF} = R_{uu}^{-1} (R_{uu} - R_{nn}) = R_{uu}^{-1} R_{ss} = \begin{bmatrix} R_s + R_{n11} & R_s + R_{n12} \\ R_s + R_{n12}^T & R_s + R_{n22} \end{bmatrix}^{-1} \begin{bmatrix} R_s & R_s \\ R_s & R_s \end{bmatrix}, \qquad (2.96) $$
and the symmetry properties of the more general class of estimators
$$ W_{WF} = X^{-T} \,\mathrm{diag}\{f(\sigma_i^2, \eta_i^2)\}\, X^T. \qquad (2.97) $$
For convenience, we will partition the matrix $W_{WF}$ into four parts:
$$ W_{WF} = \begin{bmatrix} W_{WF}^{11} & W_{WF}^{12} \\ W_{WF}^{21} & W_{WF}^{22} \end{bmatrix}. \qquad (2.98) $$
Property 4. Because of the specific form of the optimal Wiener filter $W_{WF}$ in equation (2.96), this filter exhibits the properties
$$ W_{WF}^{11} = W_{WF}^{12} \qquad (2.99) $$
$$ W_{WF}^{21} = W_{WF}^{22} \qquad (2.100) $$
Proof: Writing $R_{uu}^{-1} = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix}$, it can easily be verified that
$$ W_{WF} = \begin{bmatrix} R_s + R_{n11} & R_s + R_{n12} \\ R_s + R_{n12}^T & R_s + R_{n22} \end{bmatrix}^{-1} \begin{bmatrix} R_s & R_s \\ R_s & R_s \end{bmatrix} = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix} \begin{bmatrix} R_s & R_s \\ R_s & R_s \end{bmatrix} = \begin{bmatrix} (\Gamma_{11} + \Gamma_{12}) R_s & (\Gamma_{11} + \Gamma_{12}) R_s \\ (\Gamma_{21} + \Gamma_{22}) R_s & (\Gamma_{21} + \Gamma_{22}) R_s \end{bmatrix} $$

Case 1: The noise correlation matrix $R_{nn}$ has the form
$$ R_{nn} = \begin{bmatrix} R_{n11} & R_{n12} \\ R_{n12}^T & R_{n11} \end{bmatrix} \qquad (2.101) $$
with $R_{n11}$ symmetric Toeplitz and $R_{n12}$ Toeplitz (but not symmetric).

Since $J R_{ss} J = R_{ss}$ and $J R_{nn} J = R_{nn}$ (see property 2), it follows that $J R_{uu} J = R_{uu}$, $J R_{uu}^{-1} J = R_{uu}^{-1}$ and $J R_{uu}^{-1} R_{nn} J = R_{uu}^{-1} R_{nn}$. From this last property it follows, similarly to the proof of theorem 1, that for the optimal filter $J W_{WF} J = W_{WF}$.
Therefore the matrix $W_{WF}$ has the form
$$ W_{WF} = \begin{bmatrix} W_{WF}^{11} & W_{WF}^{11} \\ W_{WF}^{21} & W_{WF}^{21} \end{bmatrix}, \qquad (2.102) $$
with
$$ J\, W_{WF}^{11}\, J = W_{WF}^{21}. \qquad (2.103) $$
Similarly to the proof of theorem 2, one would expect that the general class of estimators, as described in equation (2.97), exhibits the same symmetry properties as the optimal filter. However, not all eigenvectors of the matrix $R_{uu}^{-1} R_{nn}$ are symmetric or skew-symmetric, such that for $X^{-T}$ (of which the columns are the eigenvectors),
$$ J\, X^{-T} \neq X^{-T} \,\mathrm{diag}\{\pm 1\}. \qquad (2.104) $$
This can be explained because $R_{uu}^{-1} R_{nn}$ has an eigenvalue 1 with multiplicity $N$ ($R_{uu}$ and $R_{nn}$ have $N$ eigenvalues which are the same). The eigenspace corresponding to this eigenvalue consists of $N$ eigenvectors which are a linear combination of symmetric and skew-symmetric vectors, and hence are neither symmetric nor skew-symmetric [17]. Therefore the general class of estimators exhibits no symmetry properties at all.
However, if we only retain the $N$ eigenvectors $X_1$ which are symmetric or skew-symmetric and discard the $N$ eigenvectors $X_2$ which are neither symmetric nor skew-symmetric, then we can prove the same symmetry properties (2.102) and (2.103) for the general class of estimators. If we assume that the matrix $X^{-T}$ has the form
$$ X^{-T} = \begin{bmatrix} X_1 & X_2 \end{bmatrix}, \qquad (2.105) $$
and the diagonal matrix in equation (2.97) is of the form
$$ \mathrm{diag}\{f(\sigma_i^2, \eta_i^2)\} = \begin{bmatrix} \Lambda & 0 \\ 0 & 0 \end{bmatrix}, \qquad (2.106) $$
with $\Lambda \in \mathbb{R}^{N \times N}$ a diagonal matrix, then the general class of estimators exhibits the same symmetry properties as the optimal filter.

Case 2: The noise correlation matrix $R_{nn}$ has the form
$$ R_{nn} = \begin{bmatrix} R_{n11} & R_{n12} \\ R_{n12} & R_{n11} \end{bmatrix} \qquad (2.107) $$
with $R_{n11}$ and $R_{n12}$ symmetric Toeplitz.

Because this is a special case of equation (2.101), the same symmetry property holds for the optimal filter: $J W_{WF} J = W_{WF}$.
Since $S R_{ss} S = R_{ss}$ and $S R_{nn} S = R_{nn}$ (see property 3), it follows that $S R_{uu} S = R_{uu}$, $S R_{uu}^{-1} S = R_{uu}^{-1}$ and $S R_{uu}^{-1} R_{nn} S = R_{uu}^{-1} R_{nn}$. From this last property it follows, similarly to the proof of theorem 1, that for the optimal filter
$$ S\, W_{WF}\, S = S \big( I - R_{uu}^{-1} R_{nn} \big) S = I - R_{uu}^{-1} R_{nn} = W_{WF}. \qquad (2.108) $$
Therefore the matrix $W_{WF}$ has the form
$$ W_{WF} = \begin{bmatrix} W_{WF}^{11} & W_{WF}^{11} \\ W_{WF}^{11} & W_{WF}^{11} \end{bmatrix}, \qquad (2.109) $$
with
$$ J\, W_{WF}^{11}\, J = W_{WF}^{11}. \qquad (2.110) $$
This means that the filters for the 2 channels are equal.
For the general class of estimators, as described in equation (2.97), these symmetry properties don't hold in all cases. The reason is the same as for case 1, i.e.
$$ J\, X^{-T} \neq X^{-T} \,\mathrm{diag}\{\pm 1\}, \qquad (2.111) $$
with $X^{-T}$ the matrix containing the eigenvectors of $R_{uu}^{-1} R_{nn}$. The property which does hold in all cases is
$$ S\, X^{-T} = X^{-T} \,\mathrm{diag}\{\pm 1\}. \qquad (2.112) $$
Therefore the general class of estimators always satisfies $S W_{WF} S = W_{WF}$ and has the form
$$ W_{WF} = \begin{bmatrix} W_{WF}^{11} & W_{WF}^{12} \\ W_{WF}^{12} & W_{WF}^{11} \end{bmatrix}, \qquad (2.113) $$
but it only exhibits the additional symmetry properties (2.109) and (2.110) if the diagonal matrix in equation (2.97) is of the form
$$ \mathrm{diag}\{f(\sigma_i^2, \eta_i^2)\} = \begin{bmatrix} \Lambda & 0 \\ 0 & 0 \end{bmatrix}, \qquad (2.114) $$
with $\Lambda \in \mathbb{R}^{N \times N}$ a diagonal matrix, under the same conditions as in case 1.

Case 3: If in case 2 the noise sources $n_1(k)$ and $n_2(k)$ are uncorrelated ($R_{n12} = 0$), then the noise correlation matrix $R_{nn}$ has the form
$$ R_{nn} = \begin{bmatrix} R_{n11} & 0 \\ 0 & R_{n11} \end{bmatrix}, \qquad (2.115) $$
with $R_{n11}$ symmetric Toeplitz.
The conclusions regarding symmetry properties are the same as in case 2, for the optimal filter as well as for the general class of estimators.

Case 4: If in case 3 the uncorrelated noise sources $n_1(k)$ and $n_2(k)$ are white noise sources with the same noise power $\eta^2$, then the noise correlation matrix $R_{nn}$ has the form
$$ R_{nn} = \begin{bmatrix} \eta^2 I & 0 \\ 0 & \eta^2 I \end{bmatrix}. \qquad (2.116) $$
The conclusions regarding symmetry properties are the same as in case 2, except for the additional property that $W_{WF}$ is symmetric, for the optimal filter as well as for the general class of estimators.
In this case the optimal filter $W_{WF}$ has the form
$$ W_{WF} = \begin{bmatrix} W_{WF}^{11} & W_{WF}^{11} \\ W_{WF}^{11} & W_{WF}^{11} \end{bmatrix}, \qquad (2.117) $$
with
$$ W_{WF}^{11} = J\, W_{WF}^{11}\, J \qquad (2.118) $$
$$ W_{WF}^{11} = (W_{WF}^{11})^T. \qquad (2.119) $$

2.8 Conclusion

In this section we have described a class of SVD-based signal enhancement procedures, which amount to a specific optimal filtering technique for the case where the so-called `desired response' signal $d_k = s_k$ cannot be observed. It is shown that this optimal filter $W_{WF}$ can be written as a function of the generalized singular vectors and singular values of a so-called speech data matrix $U_k$ and noise data matrix $N_k$. When applying this filtering technique to time series, a number of simple symmetry properties are derived, which prove to be valid for the white noise case as well as for the coloured noise case. Also the averaging step of the standard one-microphone SVD-based noise reduction techniques is investigated, leading to serious doubts about the necessity of this averaging step, which increases computational complexity but does not improve performance. When applying the SVD-based optimal filtering technique to multiple channels, a number of additional symmetry properties can be derived, depending on the structure of the noise covariance matrix.

3 Beamforming behaviour of multichannel filtering

In this section we will discuss the frequency and spatial filtering properties of the SVD-based estimators, described in section 2, when applied to multichannel noise reduction in speech signals. As already mentioned in section 2.7, applying the SVD-based optimal filtering technique to multiple channels can be considered as a multichannel filtering operation for which a beamforming interpretation can be given.
These beamforming properties will be examined for different simulated situations. We will consider a desired signal (broadband/smallband) arriving at a microphone array from different directions and a diffuse or localized noise source. We will also examine the performance of this noise reduction technique for a real-world situation. In the ideal case, the spatial beamforming pattern should amplify in the direction of the signal source and should attenuate in the direction of the localized noise. We will see that the SVD-based optimal filtering technique exhibits such behaviour. In the following sections we will further compare the SVD-based optimal filtering technique with standard beamforming algorithms.
First we will discuss the room configuration, the speech and noise signals used, and the SVD-based noise reduction technique in full detail.

3.1 Preliminaries

Consider a linear equi-spaced microphone array consisting of $M$ microphones. The microphone array is recording both a desired speech signal $s(k)$ and background noise $n(k)$, as depicted in figure 6. The $j$th microphone signal can be written as
$$ m_j(k) = s_j(k) + n_j(k), \qquad j = 1 \ldots M, \qquad (3.1) $$
where $s_j(k)$ is the speech signal in the $j$th microphone signal and $n_j(k)$ is the noise in the $j$th microphone signal.

Figure 6: Microphone array recording desired signal and background noise


In order to improve the signal-to-noise ratio of the microphone signals mj (k) and
hence reduce the background noise, we use the multichannel lter structure, as depicted in gure 7, which lters and sums the di erent microphone signals. The main
25

di culty lies in nding the optimal lters Aj . For nding these lters we will use the
SVD-based optimal ltering technique, described in section 2.
A1
A2
e

+
+
+
+

A3
A4

11
00
00
11

m1

Desired signal

m2
m3
m4

Noise
Microphone
array

Figure 7: Multichannel ltering


The clean speech signal s(k) is an 8 kHz speech signal (20000 samples), which
is depicted in gure 8. The dotted line indicates the region of speech activity
(speech/silence detection). Speech is present in samples 7041 : : : 15360]. The signal sj (k) is the speech signal in the j th microphone signal, which is the clean speech
signal s(k) ltered with the acoustic impulse response of the room. In this section
we will only consider a pure delay environment without multipath e ects, where the
speech signals sj (k) are delayed versions of each other. If the desired signal impinges
on the microphone array at an angle , the delay (number of samples) between two
adjacent microphones is
(3.2)
= d cos f ;

with d the distance between the microphones, c the speed of sound (c ' 340 ms ) and
fs the sampling frequency, such that

sj+m (k) = sj (k ? m ):
(3.3)
If 2= Z, than the di erent speech signals sj (k) can be constructed by ltering the
clean speech signal s(k) with an interpolation lter.
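The following sketch shows one possible way to realize such a fractional delay (our own illustration: the report does not specify which interpolation filter was used, and the windowed-sinc design and tap count here are assumptions).

```python
import numpy as np

def fractional_delay(x, delay, ntaps=81):
    """Delay signal x by a (possibly non-integer) number of samples using a
    windowed-sinc interpolation filter (one possible choice of interpolation
    filter; ntaps should be odd)."""
    n = np.arange(ntaps) - (ntaps - 1) / 2
    h = np.sinc(n - delay) * np.hamming(ntaps)      # shifted, windowed sinc
    y = np.convolve(x, h)
    start = (ntaps - 1) // 2                        # remove the filter's group delay
    return y[start:start + len(x)]

# Example: microphone j+m carries the source delayed by m*Delta samples,
# with Delta = d*cos(theta)*fs/c as in eqs. (3.2)-(3.3)
d, theta, fs, c = 0.05, np.deg2rad(45), 8000.0, 340.0
Delta = d * np.cos(theta) * fs / c
```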
In this section we will consider two kinds of noise sources:
- spatio-temporal white noise (diffuse noise), where the noise $n_j(k)$ in the $j$th microphone signal is temporally white noise and is uncorrelated with the noise $n_l(k)$ in the $l$th microphone signal,
$$ E\{n_j(k)\, n_l(k)\} = 0, \qquad j \neq l; \qquad (3.4) $$
- a localized white noise source $n(k)$ which impinges on the microphone array at an angle $\theta$, such that the noise signals $n_j(k)$ are delayed versions of each other (analogous to the speech signal) and are correlated with each other. The different noise signals $n_j(k)$ are constructed by filtering the white noise signal $n(k)$ with an interpolation filter.

As already indicated in equation (3.1), the $j$th microphone signal $m_j(k)$ is a noisy speech signal, which is the sum of $s_j(k)$ and $n_j(k)$. Such a signal is depicted in figure 8, where the dotted line indicates the region of speech activity (speech/noise detection).
Figure 8: Clean and noisy speech signal


Using the signals mj (k) and nj (k) we construct the speech data matrix Uk 2 Rp MN
and the noise data matrix Nk 2 Rp MN , as de ned in equations (2.22) and (2.75),
where N denotes the length of the lters Aj ,

with

8
>
>
>
>
>
Ujk
>
>
>
>
<
>
>
>
>
>
Njk
>
>
>
>
:

Uk =
Nk =

U1k U2k : : : UMk


N1k N2k : : : NMk ;

(3.5)

2 m (k)
3
mj (k ? 1)
mj (k ? 2) : : : mj (k ? N + 1)
j
66 mj (k + 1)
mj (k)
mj (k ? 1) : : : mj (k ? N + 2) 77
66 mj (k + 2)
mj (k + 1)
mj (k)
: : : mj (k ? N + 3) 77
64
75
..
..
..
..
.
.
.
.
2 mj (kn +(kp) ? 1) mnj ((kk+?p1)? 2) mnj ((kk +? p2)? 3) : : :: : : n m(kj (?k N+ p+?1)N3)
j
j
j
66 nj (kj + 1)
nj (k)
nj (k ? 1) : : : nj (k ? N + 2) 77
66 nj (k + 2)
nj (k + 1)
nj (k)
: : : nj (k ? N + 3) 77
64
75
..
..
..
..
.
.
.
.
nj (k + p ? 1) nj (k + p ? 2) nj (k + p ? 3) : : : nj (k + p ? N )

(3.6)

27

For constructing the speech data matrix $U_k$ we use 2000 samples $m_j(8000 \ldots 9999)$. For constructing the noise data matrix $N_k$ we use the same frame $n_j(8000 \ldots 9999)$. In practice this is never possible, since the noise data matrix can only be constructed during periods where no speech is present. However, for simulated situations the total noise signal $n_j(k)$ is known. In section 4 it will be seen that for stationary noise, constructing the noise data matrix $N_k$ from a different frame than the speech data matrix $U_k$ has no influence with regard to the performance.
Using the generalized singular value decomposition of $U_k$ and $N_k$,
$$ U_k = U \,\mathrm{diag}\{\sigma_i\}\, X^T, \qquad N_k = V \,\mathrm{diag}\{\eta_i\}\, X^T, \qquad (3.7) $$
we can compute the optimal Wiener filter $W_{WF}$,
$$ W_{WF} = X^{-T} \,\mathrm{diag}\Big\{\frac{\sigma_i^2 - \eta_i^2}{\sigma_i^2}\Big\}\, X^T. \qquad (3.8) $$
By choosing the column of $W_{WF}$ corresponding to the smallest element on the diagonal of the matrix $N_k^T N_k\, W_{WF}$ (cf. the error covariance matrix of equation (2.28)) we obtain the best estimator $w_{WF}^{\min}$. The filter $w_{WF}^{\min}$ consists of the $M$ filters of length $N$,
$$ w_{WF}^{\min} = \begin{bmatrix} A_1^T & A_2^T & \ldots & A_M^T \end{bmatrix}^T. \qquad (3.9) $$
The resulting estimated (enhanced) signal $\hat{s}(k)$ is computed by filtering and summing the microphone signals $m_j(k)$ with the filters $A_j$ over their total length (20000 samples),
$$ \hat{s}(k) = \sum_{j=1}^{M} A_j(k) * m_j(k). \qquad (3.10) $$
For comparison purposes the signal-to-noise ratio (SNR) will be used. The SNR of a signal $x(k)$ is defined as
$$ \mathrm{SNR}(x) = \frac{\big( \sum x^2(k) \big)_{\text{speech}}}{\big( \sum x^2(k) \big)_{\text{noise}}}, \qquad (3.11) $$
which is the energy of the signal $x(k)$ during speech periods divided by the energy of the signal $x(k)$ during noise periods. Therefore a speech/noise detection is necessary, as indicated by the dotted line in figure 8.
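For concreteness, a small sketch of the SNR measure of equation (3.11), expressed in dB as the results below are quoted (our own illustration; the index arrays for the speech and noise-only regions are assumed to come from the speech/noise detector, e.g. the sample range quoted above).

```python
import numpy as np

def snr_db(x, speech_idx, noise_idx):
    """SNR of eq. (3.11) in dB: energy of x during speech periods divided by
    the energy of x during noise-only periods."""
    x = np.asarray(x)
    e_speech = np.sum(x[speech_idx] ** 2)   # energy during speech activity
    e_noise = np.sum(x[noise_idx] ** 2)     # energy during noise-only periods
    return 10.0 * np.log10(e_speech / e_noise)

# e.g. with speech assumed in samples 7041...15360 of a 20000-sample signal:
# speech_idx = np.arange(7041, 15361)
# noise_idx  = np.r_[np.arange(0, 7041), np.arange(15361, 20000)]
```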

3.2 Spatio-temporal white noise

In this case the noise $n_j(k)$ in the $j$th microphone signal is temporally white noise and is uncorrelated with the noise $n_l(k)$ in the $l$th microphone signal,
$$ E\{n_j(k)\, n_l(k)\} = 0, \qquad j \neq l. \qquad (3.12) $$

3.2.1 Broadband source

We will discuss the SNR improvement and the frequency behaviour when the broadband speech source is in front of the microphone array ($\theta = 90°$), and we will discuss the spatial filtering properties (beamforming behaviour) for $\theta = 90°$ and $\theta = 45°$. The distance $d$ between the different microphones is 5 cm.
When the speech source is in front of the microphone array ($\theta = 90°$), the microphone signals $m_j(k)$ are
$$ m_j(k) = s(k) + n_j(k), \qquad (3.13) $$
with $n_j(k)$ temporally white noise. The noise power is chosen such that the SNR of the first (noisy) microphone signal $m_1(k)$ is 3.08 dB. We have varied the number of channels $M$ from 1 to 10 and the filterlength $N$ from 1 to 20. Figure 9 shows the SNR of the enhanced signal $\hat{s}(k)$ for different values of $M$ and $N$. As can be clearly seen, the SNR of the enhanced signal $\hat{s}(k)$ improves when the number of channels $M$ increases and when the filterlength $N$ increases. However, from a certain filterlength on (in this specific case $N \simeq 8$), the SNR improvements are marginal.
Figure 10 depicts the noisy speech signal (first microphone signal $m_1(k)$), the enhanced signals $\hat{s}(k)$ and the amplitude of the frequency response $|H_j(f)|$ of the $M$ filters $A_j$, with
$$ H_j(f) = \sum_{k=1}^{N} A_j(k) \exp\Big( -j 2\pi \frac{f}{f_s} (k-1) \Big). \qquad (3.14) $$
The number of channels $M$ is 2 and the filterlength $N$ is 2, 5 and 10. The SNR of the enhanced signals is 6.49 dB ($N = 2$), 9.82 dB ($N = 5$) and 10.74 dB ($N = 10$).
In figure 11 the number of channels $M$ is 5. The SNR of the enhanced signals is 8.92 dB ($N = 2$), 13.12 dB ($N = 5$) and 13.73 dB ($N = 10$).
In figure 12 the number of channels $M$ is 10. The SNR of the enhanced signals is 11.37 dB ($N = 2$), 15.76 dB ($N = 5$) and 16.52 dB ($N = 10$).
As already indicated in section 2.7 for uncorrelated white noise sources with the same noise power (case 4), theoretically the filters $A_j$ for the $M$ different channels should be the same. This can be verified from the frequency responses in figures 10, 11 and 12.
We will now compare the spatial filtering properties (beamforming behaviour) when the speech source impinges on the microphone array at $\theta = 90°$ and $\theta = 45°$. Ideally the spatial beamforming pattern should amplify in the direction of the desired signal. The spatial beamforming pattern $H(f, \theta)$ is a function of both frequency $f$ and angle $\theta$ and can be calculated as
$$ H(f, \theta) = \sum_{l=1}^{M} H_l(f) \exp\Big( j 2\pi f\, \frac{(l-1)\, d \cos\theta}{c} \Big), \qquad (3.15) $$
with $H_l(f)$ defined in equation (3.14). The number of channels $M$ is 5 and the filterlength $N$ is 10.
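The directivity pattern of equations (3.14)-(3.15) can be evaluated numerically as sketched below (our own illustration; the default parameters mirror the simulation setup quoted above, d = 5 cm, 8 kHz sampling, c about 340 m/s, and the frequency/angle grids are assumptions).

```python
import numpy as np

def directivity_pattern(A, fs=8000.0, d=0.05, c=340.0, nfft=512, ntheta=181):
    """|H(f, theta)| of eqs. (3.14)-(3.15) for M channel filters A (M x N array),
    assuming a uniform linear array with spacing d."""
    M, N = A.shape
    f = np.arange(1, nfft // 2) * fs / nfft                     # frequency grid (Hz)
    k = np.arange(N)
    H = np.exp(-2j * np.pi * np.outer(f, k) / fs) @ A.T         # H[f, l] = H_l(f), eq. (3.14)
    theta = np.linspace(0.0, np.pi, ntheta)                     # angle grid (rad)
    l = np.arange(M)
    steer = np.exp(2j * np.pi * f[:, None, None]
                   * (l[None, :, None] * d) * np.cos(theta)[None, None, :] / c)
    return f, theta, np.abs(np.einsum('fl,flt->ft', H, steer))  # eq. (3.15)
```

With the filters $A_j$ obtained from $w_{WF}^{\min}$ this produces the beampatterns plotted in figures 13-16.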

Figure 9: SNR of enhanced signal $\hat{s}(k)$ for spatio-temporal white noise and speech source in front of the microphone array ($\theta = 90°$). Number of channels $M$ varies from 1 to 10 and filterlength $N$ varies from 1 to 20.

Figure 10: Noisy and enhanced signals and frequency response $|H_j(f)|$ of the filters $A_j$ for spatio-temporal white noise and speech source in front of the microphone array ($\theta = 90°$). Number of channels $M = 2$ and filterlength $N = 2, 5, 10$.

Figure 11: Noisy and enhanced signals and frequency response $|H_j(f)|$ of the filters $A_j$ for spatio-temporal white noise and speech source in front of the microphone array ($\theta = 90°$). Number of channels $M = 5$ and filterlength $N = 2, 5, 10$.

Figure 12: Noisy and enhanced signals and frequency response $|H_j(f)|$ of the filters $A_j$ for spatio-temporal white noise and speech source in front of the microphone array ($\theta = 90°$). Number of channels $M = 10$ and filterlength $N = 2, 5, 10$.

For θ = 90°, figure 13 depicts the noisy speech signal (first microphone signal m1(k)), the enhanced signal ŝ(k), the amplitude of the frequency response |Hj(f)| for the M filters Aj, and the amplitude of the spatial beamforming pattern |H(f, θ)| for all frequencies f and for one specific frequency f = 1000 Hz. As can be seen from the spatial beamforming pattern for f = 1000 Hz, the directivity gain is maximal for the direction θ = 90°. This is even better illustrated in figure 14, where the spatial beamforming pattern is plotted for every frequency f = i·100 Hz, i = 1 ... 40. For every frequency the directivity gain is maximal for the direction θ = 90°. However, for low frequencies the spatial selectivity is very poor.
In figure 15 the angle θ = 45°. As can be seen from the spatial beamforming pattern for f = 1000 Hz, the directivity gain is maximal for the direction θ = 45°. This is even better illustrated in figure 16, where the spatial beamforming pattern is plotted for every frequency f = i·100 Hz, i = 1 ... 40. For most frequencies the directivity gain is maximal for the direction θ = 45°. However, for low frequencies the spatial selectivity is very poor.
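The spatial patterns above are computed from the filter coefficients alone. As a rough illustration of how such a pattern can be evaluated, the following Python sketch (our own helper, not taken from the report; it assumes a linear array with spacing d, sampling rate fs, an assumed speed of sound c = 340 m/s, and a hypothetical M×N coefficient matrix A holding the filters Aj in its rows) sums the per-channel frequency responses Hj(f) with the phase shifts of a far-field plane wave from direction θ:

```python
import numpy as np

def spatial_pattern(A, d=0.05, fs=8000, c=340.0, freqs=None, angles_deg=None):
    """Sketch: |H(f, theta)| of an M-channel FIR filter bank A (M x N).

    Each channel response H_j(f) is weighted with the phase delay of a
    far-field plane wave arriving from direction theta and summed.
    """
    M, N = A.shape
    if freqs is None:
        freqs = np.arange(100, 4001, 100.0)          # f = i*100 Hz, i = 1..40
    if angles_deg is None:
        angles_deg = np.arange(0, 181)
    angles = np.deg2rad(angles_deg)
    n = np.arange(N)
    H = np.zeros((len(freqs), len(angles)), dtype=complex)
    for fi, f in enumerate(freqs):
        Hj = A @ np.exp(-2j * np.pi * f * n / fs)     # per-channel H_j(f)
        for ai, th in enumerate(angles):
            tau = np.arange(M) * d * np.cos(th) / c   # inter-microphone delays
            H[fi, ai] = np.sum(Hj * np.exp(-2j * np.pi * f * tau))
    return np.abs(H)
```

Evaluating the returned magnitude at f = 1000 Hz over θ = 0° ... 180° should reproduce the kind of polar plot shown in figures 13–16.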
Figure 13: Noisy and enhanced signal, frequency response |Hj(f)| of the filters Aj and spatial beamforming pattern |H(f, θ)| for spatio-temporal white noise and speech source in front of microphone array (θ = 90°). Number of channels M = 5 and filterlength N = 10.

Figure 14: Spatial beamforming pattern |H(f, θ)| with f = i·100 Hz, i = 1 ... 40, for spatio-temporal white noise and speech source in front of microphone array (θ = 90°). Number of channels M = 5 and filterlength N = 10.
Figure 15: Noisy and enhanced signal, frequency response |Hj(f)| of the filters Aj and spatial beamforming pattern |H(f, θ)| for spatio-temporal white noise and speech source at θ = 45°. Number of channels M = 5 and filterlength N = 10.

Figure 16: Spatial beamforming pattern |H(f, θ)| for spatio-temporal white noise and speech source at θ = 45°. Number of channels M = 5 and filterlength N = 10.

3.2.2 Smallband source


To further illustrate the frequency filtering properties of the SVD-based noise reduction technique, we have filtered the broadband speech signal s(k) with a bandpass filter between 1800 and 1880 Hz and have used this (smallband) filtered speech signal sf(k) as the speech source. The amplitude of the frequency response of this bandpass filter is depicted in figure 17. Ideally, the frequency response Hj(f) of the filters Aj should be confined to this region, since there is no useful signal present in other frequency bands.
Figure 17: Frequency response of the bandpass filter between 1800 and 1880 Hz.
The speech source sf(k) impinges on the microphone array at θ = 90° and the noise is spatio-temporal white. Figure 18 depicts the noisy speech signals (first microphone signal m1(k)), the enhanced signals ŝ(k) and the amplitude of the frequency response |Hj(f)| for the M filters Aj. The number of channels M is 5 and the filterlength N is 10, 20 and 50. The higher the filterlength, the better the frequency response Hj(f) of the filters Aj is confined to the region between 1800 and 1880 Hz.
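For completeness, a narrowband test signal like sf(k) can be generated along the following lines; this is a minimal Python sketch with an assumed FIR order and design method (the report does not specify how the bandpass filter of figure 17 was designed):

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs = 8000                                  # sampling rate used in the report
# sharp linear-phase FIR bandpass between 1800 and 1880 Hz (assumed 400th order)
h = firwin(401, [1800, 1880], pass_zero=False, fs=fs)

def narrowband_speech(s):
    """Filter the clean speech s(k) to obtain the smallband signal sf(k)."""
    return lfilter(h, 1.0, s)
```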
Figure 18: Noisy and enhanced signals and frequency response |Hj(f)| of the filters Aj for spatio-temporal white noise and smallband speech source in front of microphone array (θ = 90°). Number of channels M = 5 and filterlength N = 10, 20, 50.

3.3 Localized noise source

We will now consider a localized white noise source n(k) which impinges on the microphone array at a certain angle, such that the noise signals nj(k) are delayed versions of each other. We will discuss the spatial filtering properties (beamforming behaviour) when the speech source is in front of the microphone array (θ = 90°) and the noise source impinges on the microphone array at θ = 150°. The distance d between the different microphones is 5 cm. Ideally, the spatial beamforming pattern should amplify in the direction of the desired signal and should attenuate (place a zero) in the direction of the localized noise source. The noise power is chosen such that the SNR of the first (noisy) microphone signal m1(k) is 3.02 dB. The number of channels M is 2 and 5 and the filterlength N is 10.
For M = 2, figure 19 depicts the noisy speech signal (first microphone signal m1(k)), the enhanced signal ŝ(k), the amplitude of the frequency response |Hj(f)| for the M filters Aj, and the amplitude of the spatial beamforming pattern |H(f, θ)| for all frequencies f and for one specific frequency f = 1000 Hz. The SNR of ŝ(k) is 17.20 dB.
Figure 19: Noisy and enhanced signal, frequency response |Hj(f)| of the filters Aj and spatial beamforming pattern |H(f, θ)| for localized white noise source (θ = 150°) and speech source (θ = 90°). Number of channels M = 2 and filterlength N = 10.

Figure 20: Spatial beamforming pattern |H(f, θ)| with f = i·100 Hz, i = 1 ... 40, for localized white noise source (θ = 150°) and speech source in front of microphone array (θ = 90°). Number of channels M = 2 and filterlength N = 10.
Figure 21: Noisy and enhanced signal, frequency response |Hj(f)| of the filters Aj and spatial beamforming pattern |H(f, θ)| for localized white noise source (θ = 150°) and speech source (θ = 90°). Number of channels M = 5 and filterlength N = 10.

Figure 22: Spatial beamforming pattern |H(f, θ)| with f = i·100 Hz, i = 1 ... 40, for localized white noise source (θ = 150°) and speech source in front of microphone array (θ = 90°). Number of channels M = 5 and filterlength N = 10.
In figure 20, where the spatial beamforming pattern is plotted for every frequency f = i·100 Hz, i = 1 ... 40, it is illustrated that for most frequencies the directivity gain is very small for θ = 150°, the direction of the noise source. Only for low frequencies the spatial selectivity is rather poor.
In figure 21 the number of channels M = 5. The SNR of the enhanced signal ŝ(k) is 21.21 dB. As can be seen in figure 22, where the spatial beamforming pattern is plotted for every frequency f = i·100 Hz, i = 1 ... 40, the directivity gain is very small for θ = 150°, the direction of the noise source.

3.4 Real-world situation

We will briefly describe the performance of the SVD-based noise reduction technique for a real-world example. The signals have been recorded in the ESAT SpeechLab with a microphone array consisting of 6 microphones (distance between microphones d = 5 cm). The speech source and the noise source are localized sources. The SNR of the (noisy) microphone signal m1(k) is 13.70 dB. The filterlength N used is 20.
When we use the optimal Wiener filter WWF,

    WWF = X^{-T} diag{ (σ_i² − η_i²) / σ_i² } X^T,    (3.16)

the SNR of the enhanced signal ŝ(k) is 16.65 dB, only a small improvement.
When we use the more general class of estimators,

    WWF = X^{-T} diag{ f(σ_i², η_i²) } X^T,    (3.17)

with the rank-truncating function f(σ_i², η_i²),

    f(σ_i², η_i²) = 1  if i = 1,
                  = 0  otherwise,    (3.18)

the SNR of the enhanced signal ŝ(k) is 24.50 dB, a considerable improvement. In this method we only consider the first generalized eigenvector, corresponding to the maximum generalized eigenvalue, as described in equation (2.57). Although the enhanced signal ŝ(k) is distorted, this distortion can be tolerated in this case.
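The two estimators above are easy to prototype. The Python sketch below is our own illustration (function names and the use of a generalized symmetric eigenvalue solver are our choices; the report itself works with the GSVD of the speech and noise data matrices): it forms the correlation matrices Uk^T Uk and Nk^T Nk and builds both the full Wiener filter of eq. (3.16) and the rank-1 truncated filter of eqs. (3.17)–(3.18).

```python
import numpy as np
from scipy.linalg import eigh

def svd_based_filters(Uk, Nk):
    """Sketch: Wiener and rank-1 truncated estimators from speech and noise
    data matrices Uk, Nk, via the equivalent symmetric-definite generalized
    eigenvalue problem of (Uk^T Uk, Nk^T Nk).  Rnn must be positive definite
    (i.e. the noiseframe must contain enough rows)."""
    Ruu = Uk.T @ Uk                     # speech(+noise) correlation matrix
    Rnn = Nk.T @ Nk                     # noise correlation matrix
    w, V = eigh(Ruu, Rnn)               # Ruu V = Rnn V diag(w), V^T Rnn V = I
    Vinv = np.linalg.inv(V)
    # full Wiener gains (sigma_i^2 - eta_i^2)/sigma_i^2 -> (w_i - 1)/w_i here
    W_wf = V @ np.diag((w - 1.0) / w) @ Vinv
    # rank-truncating function of eq. (3.18): keep only the principal
    # generalized eigenvector (largest generalized eigenvalue)
    g = np.zeros_like(w)
    g[np.argmax(w)] = 1.0
    W_r1 = V @ np.diag(g) @ Vinv
    return W_wf, W_r1
```

With Rnn whitened to the identity by the eigh factorization, the generalized eigenvalues w_i play the role of σ_i²/η_i², so the Wiener gains (σ_i² − η_i²)/σ_i² become (w_i − 1)/w_i.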

4 Comparison to standard beamforming algorithms


In this section we will compare the performance of the SVD-based noise reduction technique with standard beamforming techniques (delay-and-sum beamforming and Griffiths-Jim beamforming). The criterion which is compared is the signal-to-noise ratio (SNR) of the enhanced signals, as defined in equation (3.11). We will show that the SVD-based procedure performs better than the Griffiths-Jim beamformer in all situations, i.e. for all reverberation times. We will also discuss the dependence of the performance of the SVD-based technique on the choice of the noiseframe. As already indicated in section 3, in practice it is never possible to choose the noiseframe (noise samples used in the noise data matrix Nk) at the same moment as the speechframe (samples used in the speech data matrix Uk), because we can only construct the noise data matrix Nk during periods where no speech is present. If we choose the noiseframe at a different moment than the speechframe, performance will decrease in general. However, by making the noiseframe long enough, performance can be made equally good.

4.1 Standard beamforming algorithms

We will briefly discuss two beamforming techniques: fixed delay-and-sum beamforming [6] [21] and adaptive Griffiths-Jim beamforming [7] [8] [9] [10].
Figure 23 depicts a fixed delay-and-sum beamformer. In order to achieve a spatial alignment of the microphone array with the speech source, which impinges on the microphone array at an angle θ, the different microphone signals mj(k) are delayed with τj,

    τj = (j − 1) d cos θ / c,    (4.1)

with d the distance between the microphones and c the speed of sound. In order to compute τj, the angle θ first needs to be estimated from the different microphone signals mj(k), e.g. using some generalized cross-correlation method [22]. A delay-and-sum beamformer offers only a limited spatial selectivity, especially for low frequencies.
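As a small illustration of eq. (4.1), a delay-and-sum beamformer can be sketched in Python as follows (integer-sample delays only, edge effects ignored; d = 5 cm and fs = 8 kHz are the values used later in the simulations, while the speed of sound c = 340 m/s is our own assumption):

```python
import numpy as np

def delay_and_sum(mics, theta_deg, d=0.05, fs=8000, c=340.0):
    """Sketch of a delay-and-sum beamformer.

    mics      : array of shape (M, K) with the M microphone signals
    theta_deg : assumed direction of arrival of the speech source
    The steering delays tau_j = (j-1) d cos(theta) / c of eq. (4.1) are
    rounded to an integer number of samples before aligning the channels.
    """
    M, K = mics.shape
    theta = np.deg2rad(theta_deg)
    tau = np.arange(M) * d * np.cos(theta) / c       # delays in seconds
    shifts = np.round(tau * fs).astype(int)
    out = np.zeros(K)
    for j in range(M):
        out += np.roll(mics[j], -shifts[j])          # crude alignment
    return out / M
```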

Figure 23: Delay-and-sum beamformer.


Figure 24: Griffiths-Jim beamformer (fixed beamformer, blocking matrix and multi-channel adaptive filter).


Better performance can be obtained by using an adaptive Griffiths-Jim beamformer, as depicted in figure 24. This structure is also known as a Generalized Sidelobe Canceller. It mainly consists of three parts:
- a fixed delay-and-sum beamformer, in order to achieve a spatial alignment of the microphone array with the speech source. This fixed beamformer creates a so-called speech reference, which already has a better SNR than the individual microphone signals.
- a blocking matrix B, which creates a so-called noise reference. This noise reference is created by placing zeros in the direction of the speech source, such that the noise reference contains as little speech signal as possible and is in fact a reference for the noise. If necessary, different noise references can be created.
- a standard multi-channel adaptive filter, using the noise reference as input signal and the speech reference as desired signal [11]. To allow some acausal taps in the adaptive filter, the speech reference is sometimes delayed.
The adaptive filter will remove from the speech reference that part of the speech reference which is correlated with the noise reference. If the noise in the different microphone signals is correlated and the speech can be assumed uncorrelated with the noise, then the adaptive filter will remove a considerable amount of noise from the speech reference. So adaptive Griffiths-Jim beamformers will perform considerably better for highly correlated noise than for uncorrelated (diffuse) noise.
Because of multipath effects the noise reference can never be perfect (except in a pure delay environment). A problem arises when the noise reference also contains part of the speech signal. In that case the adaptive filter will also remove part of the speech signal from the speech reference, distorting the resulting signal. In order to avoid this signal cancellation, no adaptation of the adaptive filter is allowed during speech periods [10]. For good performance, adaptive Griffiths-Jim beamformers therefore need a robust speech/noise detection.
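A compact Python sketch of this structure for a broadside speech source (the case used later in section 4.2) is given below. The blocking row, delay and NLMS step size are illustrative assumptions in the spirit of, but not identical to, the parameter choices listed in section 4.2:

```python
import numpy as np

def griffiths_jim(mics, L=500, mu=1.0, delay=250, vad=None, eps=1e-8):
    """Sketch of a Griffiths-Jim (GSC) beamformer for a broadside source.

    mics : (M, K) microphone signals, speech source assumed at theta = 90
           degrees, so the fixed beamformer is a plain sum (tau_j = 0).
    vad  : optional boolean array of length K, True where speech is present;
           the adaptive filter is frozen there to limit signal cancellation.
    """
    M, K = mics.shape
    speech_ref = mics.mean(axis=0)                   # fixed (sum) beamformer
    # one noise reference: blocking row [M-1, -1, ..., -1]; it sums to zero,
    # so the broadside (equal-delay) speech component is cancelled
    noise_ref = (M - 1) * mics[0] - mics[1:].sum(axis=0)
    w = np.zeros(L)                                  # one-channel adaptive filter
    out = np.zeros(K)
    for k in range(L, K):
        x = noise_ref[k - L:k][::-1]                 # filter input vector
        d = speech_ref[k - delay]                    # delayed speech reference
        e = d - w @ x                                # enhanced output sample
        out[k] = e
        if vad is None or not vad[k]:                # adapt during noise only
            w += mu * e * x / (x @ x + eps)          # NLMS update
    return out
```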

4.2 General configuration

We will briefly discuss the room configuration, the signals used and the parameters of the algorithms.
The room contains a microphone array, a speech source and a noise source, and is depicted in figure 25. The room has the following dimensions: 7 m × 3.5 m × 3 m. The linear equi-spaced microphone array has 5 microphones and the distance d between the microphones is 5 cm. The position of microphone j is [2 + (j − 1)·0.05, 0.5, 1]. The position of the speech source is in front of the microphone array: [2.1, 2, 1]. The position of the noise source is [6, 2.5, 1].
Because we will compare the performance of the different algorithms for correlated as well as for uncorrelated noise, the reverberation of the room is an important parameter. Reverberation is described by the reverberation time T60, which can be expressed as a function of the reflection coefficient ρ (0 ≤ ρ ≤ 1) of the walls (assuming all the walls have the same reflection coefficient), according to Eyring's formula:

    T60 = 0.163 V / (−S ln ρ),    (4.2)

with V the volume of the room and S the total surface of the room. For large reflection coefficients (ρ ≈ 1), Eyring's formula reduces to Sabine's law:

    T60 = 0.163 V / (S (1 − ρ)).    (4.3)

The reflection coefficient ρ is a necessary parameter for calculating the room impulse response through the image method described in [23].
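Under the reconstruction of eqs. (4.2)–(4.3) above, the reverberation time used in the simulations follows directly from the wall reflection coefficient; a small Python sketch for the room of figure 25 (volume and surface computed from the stated dimensions, helper names our own):

```python
import numpy as np

V = 7 * 3.5 * 3                              # room volume in m^3
S = 2 * (7 * 3.5 + 7 * 3 + 3.5 * 3)          # total wall/floor/ceiling surface

def t60_eyring(rho, V=V, S=S):
    """Reverberation time from the wall reflection coefficient rho (eq. 4.2)."""
    return 0.163 * V / (-S * np.log(rho))

def t60_sabine(rho, V=V, S=S):
    """Sabine's approximation (eq. 4.3), valid for rho close to 1."""
    return 0.163 * V / (S * (1.0 - rho))
```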

Figure 25: Room configuration (microphone array, desired speech source and noise source).


The signals used are the same as in section 3: an 8 kHz clean speech signal (20000 samples), depicted in figure 8, and a temporal white noise source.

In total the performance of 5 algorithms will be compared: delay-and-sum beamformer, Griffiths-Jim beamformer, iterated Griffiths-Jim beamformer, SVD-based optimal filtering 1 and SVD-based optimal filtering 2. The parameters for these different algorithms are now briefly explained:
- Delay-and-sum beamformer: because the speech source is located in front of the microphone array, the delay-and-sum beamformer becomes a simple sum beamformer (τj = 0).
- Griffiths-Jim beamformer: the fixed beamformer has τj = 0. The delay for the speech reference is 250 taps. We will consider only one noise reference (because there is only one noise source) and the blocking matrix B is [4 −1 −1 −1 −1]. The filterlength of the (one-channel) adaptive filter is 500. The adaptive filtering algorithm we use is NLMS (normalized least mean squares) with adaptation parameter μ = 1 [11]. The filter coefficients are not adapted during speech periods, in order to limit signal cancellation and distortion.
- Iterated Griffiths-Jim beamformer: the same parameters as for the Griffiths-Jim beamformer, except for the fact that the adaptive filter is reiterated on the same data for different values of the adaptation parameter (μ = 1, 0.5, 0.2, 0.1, 0.05). The smaller the adaptation parameter μ, the slower the convergence, but the smaller the excess error [11].
- SVD-based optimal filtering 1: the same procedure which has been described in full detail in section 3.1. The SVD1-procedure constructs the speech data matrix Uk and the noise data matrix Nk using the same frame (as already indicated, this is never possible in practice). The start of the speechframe and the noiseframe is sample 8000, and the length of the speechframe and the noiseframe is 2000. The filterlength N for the filters Aj is 10, 20 and 50.
- SVD-based optimal filtering 2: the difference between the SVD1- and the SVD2-procedure is the fact that the SVD2-procedure constructs the noise data matrix Nk from a different frame than the speech data matrix Uk. The start of the speechframe is sample 8000, the start of the noiseframe is sample 3000, and the length of the speechframe and the noiseframe is 2000. The filterlength N for the filters Aj is 10, 20 and 50.

4.3 Comparison

We will compare the SNR of the enhanced signal ŝ(k) for the different algorithms. This comparison will be made for different reverberation times T60 of the room. A low reverberation time means that multipath effects are not significant (because the walls don't reflect much) and that the direct signal will dominate. This means that the noise signals arriving at the microphone array are highly correlated. A high reverberation time means that multipath effects are significant (because the walls reflect) and that the direct signal will no longer dominate. This means that the noise signals arriving at the microphone array are highly uncorrelated (diffuse noise).
For different reverberation times T60, figure 26 compares the performance of the different beamforming techniques (delay-and-sum beamformer, Griffiths-Jim beamformer and iterated Griffiths-Jim beamformer). As can be seen, for small reverberation times (highly correlated noise) the Griffiths-Jim beamformer performs much better than for high reverberation times (highly uncorrelated noise). This is normal, because the Griffiths-Jim beamformer is designed for correlated noise, not for diffuse noise (see section 4.1). As expected, when iterating the Griffiths-Jim beamformer, the performance increases.
Figure 27 compares the performance of the delay-and-sum beamformer and the Griffiths-Jim beamformer with the SVD-based optimal filtering 1 (filterlength N = 10, 20, 50). Unlike the Griffiths-Jim beamformer, the SVD-based optimal filtering 1 still performs well for diffuse noise (high reverberation times). As can be seen, for all reverberation times the SVD-based optimal filtering 1 performs better than the Griffiths-Jim beamformer, if the filterlength N is high enough (in this case N = 20 suffices). The higher the filterlength N, the better the performance.
Figures 28, 29 and 30 compare the performance of the SVD-based optimal filtering techniques 1 and 2 for different filterlengths (N = 10, 20, 50). As can clearly be seen, the performance of SVD-based optimal filtering 2 is always worse than the performance of SVD-based optimal filtering 1. This difference in performance increases as the filterlength N increases. SVD-based optimal filtering 2 is the technique we have to use in practice, since the noise data matrix Nk can only be constructed during periods where no speech is present. A crucial question now arises: can SVD-based optimal filtering 2 perform equally well as SVD-based optimal filtering 1? The next section answers this question by investigating the dependence of the performance of SVD-based optimal filtering on the noiseframe.

4.4 Dependence on noiseframe

In this section we will investigate the dependence of the performance of the SVD-based optimal filtering technique on the noiseframe, when the noiseframe is different from the speechframe. We will investigate the dependence on the start and the length of the noiseframe. It will be shown that the SVD-based optimal filtering 2 can perform equally well as the SVD-based optimal filtering 1 if the noiseframe is made long enough. In this section we will only consider filterlength N = 50, because the higher the filterlength, the larger the difference between SVD-based optimal filtering 1 and 2.
Figure 31 compares the performance of the SVD-based optimal filtering 2 for different lengths of the noiseframe (L = 1000 ... 7000). The start of the noiseframe is sample 3000 and the filterlength N is 50. As can be seen, there is a considerable dependence of the performance of the SVD-based optimal filtering 2 on the length of the noiseframe: the larger the noiseframe, the higher the performance. The reason is that if the noiseframe is made larger, a better estimate of the noise correlation matrix E{nk nk^T} can be made. However, the disadvantage is that by making the noiseframe larger to obtain an acceptable performance, we have to assume that the noise is stationary enough.
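To make the role of the noiseframe length concrete, the following Python sketch shows one way to build the noise data matrix Nk from a noise-only frame (the per-channel stacking is our reading of the construction in section 3, and the helper names are ours). A longer frame simply adds rows and therefore averages the noise correlation estimate Nk^T Nk over more noise snapshots (after normalizing by the number of rows):

```python
import numpy as np

def channel_data_matrix(x, N):
    """Sketch: data matrix of one channel; each row is the length-N snapshot
    [x(k) x(k-1) ... x(k-N+1)]."""
    K = len(x)
    return np.array([x[k - N + 1:k + 1][::-1] for k in range(N - 1, K)])

def noise_data_matrix(noise_frame, N):
    """Stack the M per-channel data matrices side by side: Nk = [N1k ... NMk].

    noise_frame : array of shape (M, L) with a noise-only segment of the M
                  microphone signals; a larger L gives more rows in Nk.
    """
    return np.hstack([channel_data_matrix(ch, N) for ch in noise_frame])
```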
Figure 26: SNR of noisy signal m1(k) and SNR of enhanced signal ŝ(k) for delay-and-sum beamformer and Griffiths-Jim beamformer.

Figure 27: SNR of noisy signal m1(k) and SNR of enhanced signal ŝ(k) for delay-and-sum beamformer, Griffiths-Jim beamformer and SVD-based optimal filtering technique 1 (N = 10, 20, 50).

Figure 28: SNR of noisy signal m1(k) and SNR of enhanced signal ŝ(k) for delay-and-sum beamformer, Griffiths-Jim beamformer and SVD-based optimal filtering techniques 1 and 2 (N = 10).

Figure 29: SNR of noisy signal m1(k) and SNR of enhanced signal ŝ(k) for delay-and-sum beamformer, Griffiths-Jim beamformer and SVD-based optimal filtering techniques 1 and 2 (N = 20).

Figure 30: SNR of noisy signal m1(k) and SNR of enhanced signal ŝ(k) for delay-and-sum beamformer, Griffiths-Jim beamformer and SVD-based optimal filtering techniques 1 and 2 (N = 50).
Figure 32 compares the performance of the SVD-based optimal filtering 2 for different lengths and starting points of the noiseframe. The length L of the noiseframe is varied from 1000 to 7000 and the start of the noiseframe is varied from 2000 to 18000 (where possible). The reflection coefficient ρ of the walls of the room is 0.5. As can be seen, there is no considerable dependence of the performance on the starting point of the noiseframe. However, for some L (especially L = 2000) there is a noticeable peak when the starting point of the noiseframe is sample 8000, which is also the start of the speechframe. As already indicated by the previous figure, the larger the noiseframe, the higher the performance.
Figure 33 compares the performance of the SVD-based filtering techniques 1 and 2 for different lengths of the noiseframe (L = 2000 and L = 7000). It can be seen that for L = 7000 the SVD-based optimal filtering 2 performs equally well as the SVD-based optimal filtering 1. However, in this case the length of the noiseframe (7000) is larger than the length of the speechframe (2000). In real-time processing an exponentially decaying window can be used, with different time constants for the speechframe and the noiseframe.
The overall conclusion is that the SVD-based optimal filtering 2 can perform equally well as the SVD-based optimal filtering 1 if the noiseframe is made large enough.

Figure 31: SNR of enhanced signal ŝ(k) for SVD-based optimal filtering technique with different lengths of the noiseframe (N = 50).

Figure 32: SNR of enhanced signal ŝ(k) for SVD-based optimal filtering technique with different lengths and starting points of the noiseframe (N = 50 and ρ = 0.5).
Figure 33: SNR of noisy signal m1(k) and SNR of enhanced signal ŝ(k) for delay-and-sum beamformer, Griffiths-Jim beamformer and SVD-based optimal filtering techniques 1 and 2 (N = 50 and L = 2000, 7000).


5 Robustness issues
A very important issue for all multi-microphone filtering techniques is robustness. We consider three major problems:
- source movement: the direction of the speech source is wrongly estimated. This is only important for the beamforming techniques, since the SVD-based optimal filtering technique needs no direction estimation of the speech source.
- microphone displacement: we assume a linear equi-spaced microphone array, but in fact the microphones are not equally spaced.
- microphone characteristics: we assume that the characteristics (gain, spatial directivity, frequency behaviour) of all microphones are equal, but in fact the characteristics are different. We will only consider a different gain for the microphones.
We will consider the same room configuration used in section 4 and depicted in figure 25. The number of microphones M is 5 and the filterlength N of the filters is 50.

5.1 Source movement

Consider the situation where the speech source impinges on the microphone array at an angle θ ≠ 90°, while we assume that the speech source is in front of the microphone array (θ = 90°). We will not consider multipath effects for the speech source (only an interpolation filter), but we will still consider multipath effects for the noise source in order to simulate the correlated/uncorrelated nature of the noise. We will investigate the robustness of the delay-and-sum beamformer, the Griffiths-Jim beamformer and the SVD-based optimal filtering 1 for this situation.
Figure 34 shows the performance of the delay-and-sum beamformer for different angles θ. Since we assume the speech source is in front of the microphone array (θ = 90°), the delays τj are 0. As expected, the performance decreases if the angle θ deviates from the nominal position θ = 90°.
Figure 35 shows the performance of the Griffiths-Jim beamformer for different angles θ. For angles θ > 90° the performance decreases, while for angles θ < 90° the performance increases! We think this (somewhat strange) behaviour can be explained by considering the spatial directivity pattern of the Griffiths-Jim beamformer. The spatial directivity pattern will certainly have a zero in the direction of the noise source (in this case θ = 152.55°), such that the performance of the beamformer will decrease in the neighbourhood of this direction. However, it is possible that the spatial directivity pattern has its maximum for an angle θ ≠ 90°, such that maximum performance is obtained for this specific angle. Since we don't impose any additional constraints with regard to the shape of the spatial directivity pattern other than placing a zero in the direction of the noise, we cannot predict the exact form of the spatial directivity pattern. Therefore it is difficult to predict the performance of the Griffiths-Jim beamformer for different angles θ, since the performance depends on the particular configuration.

Figure 34: SNR of the enhanced signal ŝ(k) for the delay-and-sum beamformer for different angles θ.
Figure 35: SNR of the enhanced signal ŝ(k) for the Griffiths-Jim beamformer for different angles θ.

Figure 36: SNR of the enhanced signal ŝ(k) for SVD-based optimal filtering 1 for different angles θ.

Figure 37: SNR-difference between SVD-based optimal filtering 1 and the Griffiths-Jim beamformer for different angles θ.

Figure 36 shows the performance of the SVD-based optimal filtering 1 for different angles θ. For some angles and reverberation times the performance decreases, while for other angles and reverberation times the performance increases. No specific conclusion can be drawn from this figure.
Figure 37 shows the difference in performance between the SVD-based optimal filtering 1 and the Griffiths-Jim beamformer for different angles θ. The SVD-based optimal filtering 1 would be more robust than the Griffiths-Jim beamformer if the difference in performance increased the more the angle θ deviates from the nominal position θ = 90°. However, this behaviour cannot be observed for all reverberation times in figure 37. Therefore we have to conclude that for source movement, the SVD-based optimal filtering technique is not more robust than the Griffiths-Jim beamformer. However, for all angles θ the performance of the SVD-based optimal filtering 1 is still better than the performance of the Griffiths-Jim beamformer.
5.2 Microphone displacement

Consider the situation where the linear microphone array is not equi-spaced, while we assume it is equi-spaced. We will consider a displacement of the second microphone in the x-direction towards the first microphone. The nominal position of the second microphone is xmic2 = 2.05.
Figures 38 and 39 show the performance of the delay-and-sum beamformer and the Griffiths-Jim beamformer for different microphone positions xmic2. As expected, the performance decreases if the microphone position xmic2 deviates from the nominal position xmic2 = 2.05.
Figure 40 shows the performance of the SVD-based optimal filtering 1 for different microphone positions xmic2. As can be seen, there is no significant difference in performance if the microphone position xmic2 deviates from the nominal position xmic2 = 2.05.
Figure 41 shows the difference in performance between the SVD-based optimal filtering 1 and the Griffiths-Jim beamformer for different microphone positions xmic2. Because the difference in performance increases the more the microphone position xmic2 deviates from the nominal position xmic2 = 2.05, we can conclude that for microphone displacement, the SVD-based optimal filtering technique is more robust than the Griffiths-Jim beamformer.

Figure 38: SNR of the enhanced signal ŝ(k) for the delay-and-sum beamformer for different microphone positions xmic2.

Figure 39: SNR of the enhanced signal ŝ(k) for the Griffiths-Jim beamformer for different microphone positions xmic2.
Figure 40: SNR of the enhanced signal ŝ(k) for SVD-based optimal filtering 1 for different microphone positions xmic2.

Figure 41: SNR-difference between SVD-based optimal filtering 1 and the Griffiths-Jim beamformer for different microphone positions xmic2.

5.3 Microphone amplification

Consider the situation where not all microphone characteristics are equal. We will consider the specific situation where the second microphone has a gain γ ≠ 1, while all other microphones have γ = 1.
Figures 42 and 43 show the performance of the delay-and-sum beamformer and the Griffiths-Jim beamformer for different gains γ. As expected, the performance decreases if the gain γ deviates from the nominal gain γ = 1.
Figure 44 shows the performance of the SVD-based optimal filtering 1 for different gains γ. Theoretically it can be proven that the SVD-based optimal filtering technique is independent of different gains for the different microphones.
Consider the speech data matrix Uk and the noise data matrix Nk as defined in equation (3.5) for multichannel time series filtering,

    Uk = [ U1k  U2k  ...  UMk ],
    Nk = [ N1k  N2k  ...  NMk ],    (5.1)

where M is the number of microphones. If we assume that each microphone j is multiplied by a gain γj, then the modified speech data matrix U'k becomes

    U'k = [ γ1 U1k  γ2 U2k  ...  γM UMk ] = Uk Γ,    (5.2)

with Γ ∈ R^{MN×MN} and Γj ∈ R^{N×N},

    Γ = diag{ Γ1, Γ2, ..., ΓM },   Γj = γj I_N,    (5.3)

with N the filterlength of the filters Aj. The modified noise data matrix N'k is similarly defined as

    N'k = [ γ1 N1k  γ2 N2k  ...  γM NMk ] = Nk Γ.    (5.4)

The modified speech and noise correlation matrices Φ'uu and Φ'nn then are

    Φ'uu = U'k^T U'k = Γ Uk^T Uk Γ = Γ Φuu Γ,
    Φ'nn = N'k^T N'k = Γ Nk^T Nk Γ = Γ Φnn Γ,    (5.5)

such that the modified optimal filter W'WF becomes

    W'WF = Φ'uu^{-1} (Φ'uu − Φ'nn)
         = Γ^{-1} Φuu^{-1} Γ^{-1} (Γ Φuu Γ − Γ Φnn Γ)
         = Γ^{-1} Φuu^{-1} (Φuu − Φnn) Γ
         = Γ^{-1} WWF Γ.    (5.6)

The modified estimated signal ŝ'k then becomes

    ŝ'k = U'k W'WF = Uk Γ Γ^{-1} WWF Γ = Uk WWF Γ = ŝk Γ,    (5.7)

such that any column ŝ'k(:, i) of ŝ'k is just a scaled version of ŝk(:, i), which means that ŝ'k(:, i) and ŝk(:, i) have the same signal-to-noise ratio.
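The scaling argument above is easy to check numerically. The following small Python sketch (random toy data, helper names our own) verifies eqs. (5.6) and (5.7) for the correlation-matrix form WWF = Φuu^{-1}(Φuu − Φnn):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 3, 4, 200                        # channels, filterlength, rows
Uk = rng.standard_normal((K, M * N))       # toy speech data matrix
Nk = rng.standard_normal((K, M * N))       # toy noise data matrix

def wiener(U, Nmat):
    Ruu, Rnn = U.T @ U, Nmat.T @ Nmat
    return np.linalg.solve(Ruu, Ruu - Rnn)          # Phi_uu^{-1}(Phi_uu - Phi_nn)

gains = np.array([0.5, 2.0, 3.0])          # per-microphone gains gamma_j
G = np.kron(np.diag(gains), np.eye(N))     # Gamma = diag{gamma_j I_N}

W  = wiener(Uk, Nk)
Wg = wiener(Uk @ G, Nk @ G)
assert np.allclose(Wg, np.linalg.inv(G) @ W @ G)    # eq. (5.6)
assert np.allclose((Uk @ G) @ Wg, (Uk @ W) @ G)     # eq. (5.7): scaled columns
```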
Figure 45 shows the difference in performance between the SVD-based optimal filtering 1 and the Griffiths-Jim beamformer for different gains γ. Because the difference in performance increases the more the gain γ deviates from the nominal gain γ = 1, we can conclude that for microphone amplification, the SVD-based optimal filtering technique is more robust than the Griffiths-Jim beamformer.

5.4 Conclusion

Taking into account that in real life the used algorithm has to be robust against a combination of all three deviations (source movement, microphone displacement and microphone characteristics), we can conclude that the SVD-based optimal filtering technique is more robust than standard beamforming techniques.

Figure 42: SNR of the enhanced signal ŝ(k) for the delay-and-sum beamformer for different gains γ.

Figure 43: SNR of the enhanced signal ŝ(k) for the Griffiths-Jim beamformer for different gains γ.

Figure 44: SNR of the enhanced signal ŝ(k) for SVD-based optimal filtering 1 for different gains γ.

Figure 45: SNR-difference between SVD-based optimal filtering 1 and the Griffiths-Jim beamformer for different gains γ.

6 Conclusion
In this report we have described a class of SVD-based signal estimation procedures, which amount to a specific optimal filtering problem for the case where the so-called `desired response' signal cannot be observed. It is shown that this optimal filter can be written as a function of the generalized singular vectors and singular values of a so-called speech and noise data matrix. A number of simple symmetry properties of the optimal filter are derived, which are valid for the white noise case as well as for the coloured noise case. Also the averaging step of the standard one-microphone SVD-based noise reduction techniques is investigated, leading to serious doubts about the necessity of this averaging step. When applying the SVD-based optimal filtering technique to multiple channels, a number of additional symmetry properties can be derived, depending on the structure of the noise covariance matrix.
When this SVD-based optimal filtering technique is applied to multi-microphone noise reduction in speech, it is shown that this technique exhibits beamforming properties. When considering spatio-temporal white noise on all microphones, it is shown that the directivity pattern of the SVD-based optimal filter is focused towards the speech source. When considering a localized noise source (and no multipath propagation), it is shown that a zero is steered towards this noise source.
When we further compare the performance of the SVD-based optimal filtering technique with standard beamforming algorithms, it is shown by simulations that for highly correlated noise sources the SVD-based optimal filtering technique performs equally well as adaptive Griffiths-Jim beamformers, and that for less correlated noise sources it continues to perform better. It is also noted that the length of the noiseframe plays an important role with regard to the performance of the optimal filtering technique.
Finally, it is shown by simulations that the SVD-based optimal filtering technique is more robust to environmental changes, such as source movement, microphone displacement and microphone amplification, than standard beamforming techniques.

Acknowledgments
Simon Doclo is a Research Assistant supported by the I.W.T. (Flemish Institute for Scientific and Technological Research in Industry). Marc Moonen is a Research Associate with the F.W.O.-Vlaanderen (Fund for Scientific Research-Flanders). This research work was carried out at the ESAT laboratory of the Katholieke Universiteit Leuven, in the framework of the IT-project Multimicrophone Signal Enhancement Techniques for handsfree telephony and voice controlled systems (MUSETTE) (AUT/970517/MUSETTE) of the I.W.T. (Flemish Institute for Scientific and Technological Research in Industry), and was partially sponsored by Philips-ITCL. The scientific responsibility is assumed by its authors.


Appendix A: Derivatives with respect to vectors and matrices

Consider the vectors u ∈ R^N, d ∈ R^N and w ∈ R^N, and the matrices A ∈ R^{N×N} and W ∈ R^{N×N},

    w = [w_1 w_2 ... w_N]^T,   u = [u_1 u_2 ... u_N]^T,   d = [d_1 d_2 ... d_N]^T,    (A.1)

    A = [a_ij],   W = [w_ij],   i, j = 1 ... N.    (A.2)

A.1 Derivatives with respect to vectors

    J = w^T u = u^T w   ⟹   ∂J/∂w = u.    (A.3)

Proof:

    J = w^T u = Σ_{i=1}^{N} u_i w_i   ⟹   ∂J/∂w_k = u_k   ⟹   ∂J/∂w = [u_1 u_2 ... u_N]^T = u.    (A.4)

    J = w^T A w   ⟹   ∂J/∂w = (A + A^T) w,    (A.5)
    J = w^T u u^T w   ⟹   ∂J/∂w = 2 u u^T w.    (A.6)

Proof:

    J = w^T A w = Σ_{i=1}^{N} Σ_{j=1}^{N} w_i a_ij w_j   ⟹   ∂J/∂w_k = Σ_{j=1}^{N} a_kj w_j + Σ_{i=1}^{N} a_ik w_i = A(k,:) w + A(:,k)^T w,    (A.7)

so that ∂J/∂w = (A + A^T) w. Equation (A.6) follows by taking A = u u^T, which is symmetric. For symmetric A:

    ∂J/∂w = 2 A w.    (A.8)

A.2 Derivatives with respect to matrices

    J = u^T W d   ⟹   ∂J/∂W = u d^T.    (A.9)

Proof:

    J = u^T W d = Σ_{i=1}^{N} Σ_{j=1}^{N} u_i w_ij d_j   ⟹   ∂J/∂w_kl = u_k d_l   ⟹   ∂J/∂W = u d^T.    (A.10)

    J = u^T W W^T u   ⟹   ∂J/∂W = 2 u u^T W.    (A.11)

Proof:

    J = u^T W W^T u = Σ_{k=1}^{N} ( Σ_{i=1}^{N} u_i w_ik ) ( Σ_{j=1}^{N} u_j w_jk ),
    ∂J/∂w_pq = 2 u_p Σ_{j=1}^{N} u_j w_jq = 2 u_p ( u^T W(:,q) ),    (A.12)

so that

    ∂J/∂W = 2 [ u u^T W(:,1)  u u^T W(:,2)  ...  u u^T W(:,N) ] = 2 u u^T W.    (A.13)
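The identities above can be verified numerically with central finite differences; a small Python sketch (our own helper, tolerances chosen loosely):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
u, d, w = rng.standard_normal((3, N))
A = rng.standard_normal((N, N))
W = rng.standard_normal((N, N))

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        i = it.multi_index
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

# dJ/dw of w^T A w is (A + A^T) w          (eq. A.5)
assert np.allclose(num_grad(lambda v: v @ A @ v, w), (A + A.T) @ w, atol=1e-5)
# dJ/dW of u^T W d is u d^T                 (eq. A.9)
assert np.allclose(num_grad(lambda X: u @ X @ d, W), np.outer(u, d), atol=1e-5)
# dJ/dW of u^T W W^T u is 2 u u^T W         (eq. A.11)
assert np.allclose(num_grad(lambda X: u @ X @ X.T @ u, W),
                   2 * np.outer(u, u) @ W, atol=1e-5)
```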

Appendix B: Eigenvectors of symmetric Toeplitz and block-Toeplitz matrices


B.1 Structured matrices

Consider the matrix A ∈ R^{N×N}, the vector v ∈ R^N and the matrix B ∈ R^{Np×Np},

    A = [a_ij] (i, j = 1 ... N),   v = [v_1 v_2 ... v_N]^T,    (B.1)

    B = [b_ij] (i, j = 1 ... N),   with blocks b_ij ∈ R^{p×p}.    (B.2)

Definition 1. A is called symmetric iff (if and only if) A is symmetric along its main diagonal (a_ij = a_ji),

    A is symmetric  ⟺  A = A^T.    (B.3)

Definition 2. The reversal matrix J ∈ R^{N×N} is a matrix with ones along the secondary diagonal and zeros everywhere else,

    J = [ 0 ... 0 1 ; ... ; 0 1 ... 0 ; 1 0 ... 0 ].    (B.4)

JA reverses the rows of A, AJ reverses the columns of A and JAJ reverses both the rows and the columns of A. J is an orthogonal and symmetric matrix, hence J = J^T, JJ = I and J^{-1} = J.

Definition 3. A is called centro-symmetric iff A is symmetric along its secondary diagonal (a_ij = a_{N−j+1,N−i+1}),

    A is centro-symmetric  ⟺  JA = A^T J.    (B.5)

Definition 4. A is called double-symmetric (or symmetric centro-symmetric) iff A is symmetric and centro-symmetric (a_ij = a_ji = a_{N−i+1,N−j+1}),

    A is double-symmetric  ⟺  A = A^T and JA = A^T J  ⟹  JAJ = A.    (B.6)

From the property JAJ = A alone, nothing can be concluded about the symmetry nor the centro-symmetry of the matrix A.
Example:

    A = [ 1 2 3 ; 4 5 4 ; 3 2 1 ].

If JAJ = A, this simply means that the i-th row/column of A is equal to the (N − i + 1)-th row/column of A in reverse. For N odd, this means the middle row/column of A is symmetric.
Definition 5. v is called symmetric iff Jv = v and skew-symmetric iff Jv = −v.

Definition 6. A is called Toeplitz iff the elements along the diagonals of A are constant,

    A = [ a_11 a_12 a_13 ... a_1N ; a_21 a_11 a_12 ... a_{1,N−1} ; a_31 a_21 a_11 ... a_{1,N−2} ; ... ; a_N1 a_{N−1,1} a_{N−2,1} ... a_11 ].    (B.7)

As can be readily verified, all Toeplitz matrices are centro-symmetric.

Definition 7. A is symmetric Toeplitz iff it is both symmetric and Toeplitz,

    A = [ a_11 a_12 a_13 ... a_1N ; a_12 a_11 a_12 ... a_{1,N−1} ; a_13 a_12 a_11 ... a_{1,N−2} ; ... ; a_1N a_{1,N−1} a_{1,N−2} ... a_11 ].    (B.8)

As can be readily verified, all symmetric Toeplitz matrices are double-symmetric.

Definition 8. The block-reversal matrix S ∈ R^{Np×Np} is a matrix with identity matrices I_p ∈ R^{p×p} along its secondary diagonal and zeros everywhere else,

    S = [ 0 ... 0 I_p ; ... ; 0 I_p ... 0 ; I_p 0 ... 0 ].    (B.9)

S is an orthogonal and symmetric matrix, hence S = S^T, SS = I and S^{-1} = S.

Definition 9. B is called block-symmetric iff the p×p block-matrices b_ij are symmetric along the main diagonal of the matrix B (b_ij = b_ji),

    B = [b_ij] = [ b_11 b_12 b_13 ... b_1N ; b_12 b_22 b_23 ... b_2N ; b_13 b_23 b_33 ... b_3N ; ... ; b_1N b_2N b_3N ... b_NN ].    (B.10)

In general, block-symmetric matrices are not symmetric. A block-symmetric matrix B is symmetric if all block-matrices b_ij are symmetric (b_ij = b_ij^T).

Definition 10. B is called block-Toeplitz iff the p×p block-matrices b_ij along the diagonals of B are constant,

    B = [b_ij] = [ b_11 b_12 b_13 ... b_1N ; b_21 b_11 b_12 ... b_{1,N−1} ; b_31 b_21 b_11 ... b_{1,N−2} ; ... ; b_N1 b_{N−1,1} b_{N−2,1} ... b_11 ].    (B.11)

In general, block-Toeplitz matrices are not Toeplitz.

B.2 Symmetry properties of eigenvectors

Theorem 1 [17]. If A ∈ R^{N×N} satisfies JAJ = A and has N distinct eigenvalues, then A has ⌈N/2⌉ symmetric eigenvectors and ⌊N/2⌋ skew-symmetric eigenvectors which span the eigenspace of A, where ⌈x⌉ represents the smallest integer greater than or equal to x and ⌊x⌋ represents the largest integer smaller than or equal to x.
Proof: A has N orthonormal eigenvectors v_i which are unique apart from their sign. For any eigenvector v_i:

    A v_i = λ_i v_i  ⟹  JA v_i = λ_i J v_i  ⟹  A J v_i = λ_i J v_i.    (B.12)

Thus J v_i is an eigenvector of A corresponding to λ_i. Since the N eigenvalues of A are distinct and J v_i has the same norm as v_i, it follows that J v_i = ±v_i, so that v_i is either symmetric or skew-symmetric. The only possible way for the eigenspace to consist of N mutually orthogonal, symmetric or skew-symmetric nonzero eigenvectors is that it consists of ⌈N/2⌉ symmetric eigenvectors and ⌊N/2⌋ skew-symmetric eigenvectors.
The result of theorem 1 is still true if the multiplicity of some eigenvalues is larger than 1, but the proof is different. However, in some cases A will then have eigenvectors which are a linear combination of symmetric and skew-symmetric vectors, and hence are neither symmetric nor skew-symmetric [17].

Corollary 1. If A = X Λ X^{-1} is the eigenvalue decomposition of A, then

    J X = X diag{±1}.    (B.13)

Corollary 2. Since JAJ = A holds for all double-symmetric and symmetric Toeplitz matrices A, all eigenvectors of double-symmetric and symmetric Toeplitz matrices are symmetric or skew-symmetric.
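Theorem 1 and corollary 2 are easy to illustrate numerically. The following Python sketch draws a random symmetric Toeplitz matrix (which has distinct eigenvalues with probability one) and counts the symmetric and skew-symmetric eigenvectors:

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
N = 7
A = toeplitz(rng.standard_normal(N))       # symmetric Toeplitz, so JAJ = A
J = np.fliplr(np.eye(N))                   # reversal matrix

_, V = np.linalg.eigh(A)                   # eigenvectors in the columns of V
n_sym = sum(np.allclose(J @ v, v) for v in V.T)
n_skew = sum(np.allclose(J @ v, -v) for v in V.T)
print(n_sym, n_skew)                       # expect ceil(N/2) and floor(N/2)
```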
Lemma 1. If JAJ = A, then JA^T J = A^T and JA^{-1} J = A^{-1} (if A is invertible).
Proof:

    A^T = (JAJ)^T = J^T A^T J^T = J A^T J,
    A^{-1} = (JAJ)^{-1} = J^{-1} A^{-1} J^{-1} = J A^{-1} J.

The inverse of a nonsingular double-symmetric matrix is also double-symmetric [18]. The inverse of a nonsingular symmetric Toeplitz matrix is double-symmetric, but not necessarily Toeplitz.

Lemma 2. Consider A ∈ R^{N×N} and B ∈ R^{N×N}. If JAJ = A and JBJ = B, then J(A + B)J = A + B and J(AB)J = AB.
The sum of two double-symmetric matrices A and B is also double-symmetric. The sum of two symmetric Toeplitz matrices A and B is also symmetric Toeplitz. The product of two double-symmetric matrices A and B is double-symmetric only if AB = BA. The product of two symmetric Toeplitz matrices A and B is double-symmetric only if AB = BA, and is not necessarily Toeplitz.

Theorem 2. If B ∈ R^{Np×Np} satisfies SBS = B and has Np distinct eigenvalues, then all Np eigenvectors v_i of B satisfy the property S v_i = ±v_i.
Proof: B has Np orthonormal eigenvectors v_i which are unique apart from their sign. For any eigenvector v_i:

    B v_i = λ_i v_i  ⟹  SB v_i = λ_i S v_i  ⟹  B S v_i = λ_i S v_i.    (B.14)

Thus S v_i is an eigenvector of B corresponding to λ_i. The Np eigenvalues of B are distinct and S v_i has the same norm as v_i, so that S v_i = ±v_i.
Although it hasn't been proven, we believe that the result of theorem 2 is still true if the multiplicity of some eigenvalues is larger than 1. However, we also believe that in some cases B will then have eigenvectors v_i which are a linear combination of vectors x which satisfy Sx = x and vectors y which satisfy Sy = −y, such that S v_i ≠ ±v_i.

Corollary 3. If B = X Λ X^{-1} is the eigenvalue decomposition of B, then

    S X = X diag{±1}.    (B.15)

Corollary 4. Since SBS = B holds for all matrices B which are both block-symmetric and block-Toeplitz, all eigenvectors v_i of these matrices satisfy S v_i = ±v_i.

Lemma 3. If SBS = B, then SB^T S = B^T and SB^{-1} S = B^{-1} (if B is invertible).
Proof:

    B^T = (SBS)^T = S^T B^T S^T = S B^T S,
    B^{-1} = (SBS)^{-1} = S^{-1} B^{-1} S^{-1} = S B^{-1} S.

Lemma 4. Consider B ∈ R^{Np×Np} and C ∈ R^{Np×Np}. If SBS = B and SCS = C, then S(B + C)S = B + C and S(BC)S = BC.
The sum of two block-symmetric matrices B and C is also block-symmetric. The sum of two block-Toeplitz matrices B and C is also block-Toeplitz.

The properties proven in theorems 1 and 2 and lemmas 1, 2, 3 and 4 hold for any transformation matrix T and data matrix A which satisfy

    T A T = A,   T = T^T,   T = T^{-1}.    (B.16)

References

[1] M. Dendrinos, S. Bakamidis, and G. Carayannis, "Speech enhancement from noise: A regenerative approach," Speech Communication, vol. 10, pp. 45-57, Feb. 1991.
[2] S. H. Jensen, P. C. Hansen, S. D. Hansen, and J. A. Sørensen, "Reduction of Broad-Band Noise in Speech by Truncated QSVD," IEEE Trans. Speech, Audio Processing, vol. 3, pp. 439-448, Nov. 1995.
[3] P. S. K. Hansen, Signal Subspace Methods for Speech Enhancement. PhD thesis, Technical University of Denmark, Lyngby, Denmark, Sept. 1997.
[4] P. C. Hansen and S. H. Jensen, "FIR Filter Representations of Reduced-Rank Noise Reduction," IEEE Trans. Signal Processing, vol. 46, pp. 1737-1741, June 1998.
[5] I. Dologlou, J.-C. Pesquet, and J. Skowronski, "Projection-based rank reduction algorithms for multichannel modelling and image compression," Signal Processing, vol. 48, pp. 97-109, Jan. 1996.
[6] B. D. Van Veen and K. M. Buckley, "Beamforming: A Versatile Approach to Spatial Filtering," IEEE ASSP Magazine, pp. 4-24, Apr. 1988.
[7] O. L. Frost III, "An Algorithm for Linearly Constrained Adaptive Array Processing," Proc. IEEE, vol. 60, pp. 926-935, Aug. 1972.
[8] L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. Antennas Propag., vol. 30, pp. 27-34, Jan. 1982.
[9] K. M. Buckley, "Broad-Band Beamforming and the Generalized Sidelobe Canceller," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 34, pp. 1322-1323, Oct. 1986.
[10] S. Nordholm, I. Claesson, and B. Bengtsson, "Adaptive Array Noise Suppression of Handsfree Speaker Input in Cars," IEEE Trans. Vehicular Technology, vol. 42, pp. 514-518, Nov. 1993.
[11] S. Haykin, Adaptive Filter Theory. Information and system sciences series, Prentice Hall, 3rd ed., 1996.
[12] F. Xie and S. Van Gerven, "Comparative study of 3 speech detection methods," Tech. Rep. MI2-SPCH-95-8, ESAT, K.U.Leuven, Belgium, Oct. 1995.
[13] S. Doclo and E. De Clippel, "Verbetering van spraakverstaan bij hoortoestellen via adaptieve ruisonderdrukking in reële tijd," Master's thesis, K.U.Leuven, Belgium, 1997. UDC: 681.5.017(043).
[14] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD: Johns Hopkins University Press, 3rd ed., 1996.
[15] L. L. Scharf, Statistical Signal Processing: Detection, Estimation and Time Series Analysis. Addison Wesley, 1st ed., July 1991.
[16] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 34, pp. 744-754, Aug. 1986.
[17] P. Butler and A. Cantoni, "Eigenvalues and eigenvectors of symmetric centrosymmetric matrices," Linear Algebra and its Applications, vol. 13, pp. 275-288, Mar. 1976.
[18] J. Makhoul, "On the Eigenvectors of Symmetric Toeplitz Matrices," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 29, pp. 868-872, Aug. 1981.
[19] B. De Moor, J. Staar, and J. Vandewalle, "Oriented Energy and Oriented Signal-to-Signal Ratio Concepts in the Analysis of Vector Sequences and Time Series," in SVD and Signal Processing: Algorithms, Applications and Architectures (E. F. Deprettere, ed.), pp. 209-232, North-Holland: Elsevier Science Publishers B.V., 1988.
[20] I. Dologlou and G. Carayannis, "Physical Representation of Signal Reconstruction from Reduced Rank Matrices," IEEE Trans. Signal Processing, vol. 39, pp. 1682-1684, July 1991.
[21] D. Van Compernolle and S. Van Gerven, "Beamforming with Microphone Arrays," Tech. Rep. MI2-SPCH-94-6, ESAT, K.U.Leuven, Belgium, July 1994.
[22] G. C. Carter, Coherence and Time Delay Estimation. New York: IEEE Press, 1993.
[23] J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics," Journal of the Acoustical Society of America, vol. 65, pp. 943-950, Apr. 1979.
