ESAT-SISTA/TR 1999-33
ESAT (SISTA) - Katholieke Universiteit Leuven, Kardinaal Mercierlaan 94, 3001 Leuven (Heverlee), Belgium, Tel. 32/16/321899, Fax 32/16/321970, WWW: http://www.esat.kuleuven.ac.be/sista. E-mail: simon.doclo@esat.kuleuven.ac.be. Simon Doclo is a Research Assistant supported by the I.W.T. (Flemish Institute for Scientific and Technological Research in Industry). Marc Moonen is a Research Associate with the F.W.O. - Vlaanderen (Fund for Scientific Research - Flanders). This research work was carried out at the ESAT laboratory of the Katholieke Universiteit Leuven, in the framework of the F.W.O. Research Project nr. G.0295.97 (Design and implementation of adaptive digital signal processing algorithms for broadband applications), the Interuniversity Attraction Pole IUAP P4-02 (1997-2001, Modeling, Identification, Simulation and Control of Complex Systems), initiated by the Belgian State, Prime Minister's Office - Federal Office for Scientific, Technical and Cultural Affairs, and the IT-project Multimicrophone Signal Enhancement Techniques for handsfree telephony and voice controlled systems (MUSETTE) (AUT/970517/Philips ITCL) of the I.W.T. (Flemish Institute for Scientific and Technological Research in Industry), and was partially sponsored by Philips-ITCL. The scientific responsibility is assumed by its authors.
Abstract

In this report, a compact review is given of a class of SVD-based signal enhancement procedures, which amount to a specific optimal filtering technique for the case where the so-called 'desired response' signal cannot be observed. A number of simple properties (e.g. symmetry properties) of the obtained estimators are derived, which to our knowledge have not been published before and which are valid for the white noise case as well as for the coloured noise case. Also, a standard procedure based on averaging is investigated, leading to serious doubts about the necessity of the averaging step.

When applying this technique to multi-microphone noise reduction, the optimal filter exhibits a kind of beamforming behaviour for highly correlated noise sources. When comparing this technique to standard beamforming algorithms, its performance is equally good for highly correlated noise sources. For less correlated noise sources - a situation where standard beamforming typically fails - it is shown that its performance is better than that of standard beamforming techniques. Finally, it is shown by simulations that this technique is more robust to environmental changes, such as source movement, microphone displacement and microphone amplification, than standard beamforming techniques.
Contents

1 Introduction 2
2 SVD-based optimal filtering 4
  2.1 Preliminaries 4
  2.2 SVD-based filtering 6
  2.3 Error covariance matrix 7
  2.4 White noise case 8
  2.5 Time series filtering 9
  2.6 Time series filtering and averaging 13
  2.7 Multichannel time series filtering 17
  2.8 Conclusion 24
5 Robustness issues 52
  5.1 Source movement 52
  5.2 Microphone displacement 55
  5.3 Microphone amplification 58
  5.4 Conclusion 59
6 Conclusion 62
Acknowledgments 62
A Derivative to vectors and matrices 63
1 Introduction
In many speech communication applications, like audio-conferencing and hands-free
mobile telephony, the recorded and transmitted speech signals contain a considerable
amount of acoustic noise. This is mainly due to the fact that the speaker is located
at a certain distance from the recording microphones, which allows the microphones
to record the noise sources too. Background noise can stem from stationary noise
sources like a fan, but most of the time the background noise is non-stationary and
broadband, with a spectral density depending upon the environment. The background noise causes a signal degradation which can lead to total unintelligibility of the speech and which decreases the performance of speech coding and speech recognition systems. Therefore, efficient noise reduction algorithms are called for.
During the last years, some techniques for noise reduction in speech have been proposed which are based on the singular value decomposition (SVD) [1, 2, 3]. Most of these techniques only deal with the one-microphone case and therefore have to rely on signal-specific characteristics. Speech signals can be assumed to consist of several formants. The interpretation given to most one-microphone SVD-based noise reduction techniques is that they try to extract the most important formants from the noisy speech signal [4], thereby reducing the amount of noise.
When using a microphone array, the spatial configuration of the speech/noise sources and the microphone array constitutes an important aspect which should not be neglected. Therefore multi-microphone algorithms should not only exploit signal characteristics, but should also exploit the characteristics of the channel between the speech/noise sources and the microphone array. Although the SVD-based multi-microphone extensions which have been proposed [5] exploit the signal characteristics in a more robust way, they still do not exploit channel characteristics.
Section 2 describes a class of SVD-based signal enhancement procedures, which amount to a specific optimal filtering technique for the case where the so-called 'desired response' signal cannot be observed. It is shown that this optimal filter can be written as a function of the generalized singular vectors and singular values of a so-called speech and noise data matrix. A number of simple symmetry properties of the optimal filter are derived, which are valid for the white noise case as well as for the coloured noise case. Also, the averaging step of the standard one-microphone SVD-based noise reduction techniques [2, 4] is investigated, leading to serious doubts about the necessity of this averaging step. When applying the SVD-based optimal filtering technique to multiple channels, a number of additional symmetry properties can be derived, depending on the structure of the noise covariance matrix.
In Section 3 the SVD-based optimal filtering technique is applied to multi-microphone noise reduction in speech. For some contrived examples it is shown that this technique exhibits a kind of beamforming behaviour. When considering spatio-temporal white noise on all microphones, it is shown by simulations that the directivity pattern of the SVD-based optimal filter is focused towards the speech source. When considering a localized noise source (and no multipath propagation), it is shown that a zero is steered towards this noise source.
Section 4 further compares the performance of the SVD-based optimal filtering technique with standard beamforming algorithms [6] (delay-and-sum, Griffiths-Jim and Generalized Sidelobe Canceller (GSC) [7, 8, 9, 10]). Adaptive Griffiths-Jim beamformers perform particularly well when the noise on the different microphones is highly correlated. When the noise is less correlated, the performance of these beamformers drops considerably. It is shown by simulations that for highly correlated noise sources the SVD-based optimal filtering technique performs equally well as adaptive Griffiths-Jim beamformers, and that for less correlated noise sources it performs better than these beamformers. In this section the dependence of the performance of the SVD-based optimal filtering technique on the length and the starting point of the noise frame is also investigated.
Section 5 discusses the issue of robustness. It is known that standard beamforming algorithms are rather sensitive to incorrect estimation of the source direction and to uncalibrated microphone arrays. It is shown by simulations that the SVD-based optimal filtering technique is more robust to environmental changes, such as source movement, microphone displacement and microphone amplification, than standard beamforming techniques.
2 SVD-based optimal filtering

2.1 Preliminaries
Consider the following filtering problem (figure 1): u_k ∈ R^N is the filter input vector at time k, y_k is the filter output at time k,

  y_k = u_k^T w = w^T u_k,   (2.1)

d_k is the desired filter output ('desired response') at time k, and e_k is the error at time k,

  e_k = d_k − y_k.   (2.2)

The filter w is found by minimizing the mean squared error

  J_MSE = E{e_k²}.   (2.3)

Setting the gradient to zero,

  ∂J_MSE/∂w = −2 E{u_k d_k} + 2 E{u_k u_k^T} w = 0,   (2.4)

yields the Wiener filter

  w_WF = E{u_k u_k^T}^{-1} E{u_k d_k}.   (2.5)
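The normal equations (2.4)-(2.5) can be illustrated with a small numerical sketch (this is an illustration, not part of the report; the filter w_true and the noise level are made-up toy values). The Wiener filter computed from sample estimates of E{u_k u_k^T} and E{u_k d_k} approaches the underlying filter as the amount of data grows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 4, 50_000

# toy problem: d(k) is a fixed FIR combination of the input vector u_k,
# observed in additive noise (w_true is a made-up example filter)
w_true = np.array([0.5, -0.3, 0.2, 0.1])
U = rng.standard_normal((K, N))                # rows are u_k^T
d = U @ w_true + 0.1 * rng.standard_normal(K)  # desired response

# sample estimates of E{u u^T} and E{u d}
Ruu = U.T @ U / K
rud = U.T @ d / K

# Wiener solution w_WF = E{u u^T}^{-1} E{u d}  (eq. 2.5)
w_wf = np.linalg.solve(Ruu, rud)
assert np.allclose(w_wf, w_true, atol=0.02)
```

With 50 000 samples the sample-covariance estimate is accurate enough that w_wf matches w_true to within a few percent.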
It is also possible to consider multiple right-hand side problems, i.e. to work with a desired vector signal d_k ∈ R^N instead of a scalar d_k (figure 2). The filter output vector y_k ∈ R^N is obtained as

  y_k^T = u_k^T W,   (2.6)

with W ∈ R^{N×N} the optimal filter. The i-th column of W is then an optimal filter for the i-th component of d_k. Setting the gradient of the mean squared error to zero,

  ∂J_MSE/∂W = −2 E{u_k d_k^T} + 2 E{u_k u_k^T} W = 0.   (2.8)
If we assume that we observe u_k = n_k' during speech pauses, then we can use such observations to estimate

  E{n_k' n_k'^T} = E{u_k u_k^T}.   (2.14)

If the noise is stationary,

  E{n_k n_k^T} = E{n_k' n_k'^T},   (2.15)

which means that we are able to estimate E{n_k n_k^T}.

During speech activity, we observe both the signal-of-interest and the noise signal,

  u_k = s_k + n_k,   (2.16)

and we can use such observations to estimate E{u_k u_k^T}. If we assume that s_k and n_k are statistically independent (E{s_k n_k^T} = 0), then

  E{u_k u_k^T} = E{s_k s_k^T} + E{s_k n_k^T} + E{n_k s_k^T} + E{n_k n_k^T}
               = E{s_k s_k^T} + E{n_k n_k^T}.   (2.17)

Given E{u_k u_k^T} and E{n_k n_k^T}, we can thus compute E{s_k s_k^T}. Finally, from the assumed independence of s_k and n_k it also follows that

  E{u_k s_k^T} = E{s_k s_k^T} + E{n_k s_k^T} = E{s_k s_k^T},   (2.18)

so that the optimal filter W_WF is given by

  W_WF = E{u_k u_k^T}^{-1} E{s_k s_k^T} = E{u_k u_k^T}^{-1} (E{u_k u_k^T} − E{n_k n_k^T}).   (2.19)

PS: Note that if the desired response vector d_k were n_k instead of s_k, then the optimal estimator W_WF^n for n_k would be

  W_WF^n = E{u_k u_k^T}^{-1} E{u_k n_k^T} = E{u_k u_k^T}^{-1} E{n_k n_k^T} = I − W_WF.   (2.20)

This means that an optimal estimate for n_k is obtained by subtracting the optimal estimate for s_k from u_k, and vice versa.

PS: Note that if the additive noise is zero (E{n_k n_k^T} = 0), then W_WF = I.
An interesting and useful simplification in formula (2.19) for W_WF is derived from the joint diagonalization

  E{u_k u_k^T} = X diag{σ_i²} X^T
  E{n_k n_k^T} = X diag{η_i²} X^T,   (2.21)

with X an invertible, but not necessarily orthogonal, matrix. Note that diag{σ_i²} represents a diagonal matrix with diagonal elements σ_i², i = 1...N, and that diag{η_i²} is similarly defined.

In practice, X, σ_i² and η_i² are computed by means of a generalized singular value decomposition of the data matrices U_k ∈ R^{p×N} and N_k ∈ R^{q×N} (with p and q typically larger than N),

  U_k = [ u_k  u_{k+1}  ...  u_{k+p−1} ]^T,  N_k = [ n_k  n_{k+1}  ...  n_{k+q−1} ]^T,   (2.22)

such that E{u_k u_k^T} ≈ U_k^T U_k and E{n_k n_k^T} ≈ N_k^T N_k. The generalized singular value decomposition of the matrices U_k and N_k is defined as

  U_k = U diag{σ_i} X^T
  N_k = V diag{η_i} X^T,   (2.23)

with U ∈ R^{p×N} and V ∈ R^{q×N} orthogonal matrices, X ∈ R^{N×N} an invertible matrix and σ_i/η_i the generalized singular values.

By substituting the above formulas into formula (2.19), one obtains

  W_WF = X^{-T} diag{(σ_i² − η_i²)/σ_i²} X^T.   (2.24)

In fact, the filter W_WF belongs to a more general class of estimators, which can be described by

  W = X^{-T} diag{f(σ_i², η_i²)} X^T.   (2.25)
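As a sketch of how the joint diagonalization (2.21) and the filter (2.24) can be computed (an illustration, not the report's own implementation): a symmetric-definite generalized eigenvalue solver yields X and the ratios σ_i²/η_i² directly. SciPy's convention normalizes V^T R_nn V = I, which corresponds to scaling each column of X so that η_i² = 1; the toy covariance matrices below are made up.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
N = 8

# toy covariances: low-rank "speech" plus full-rank "noise"
S = rng.standard_normal((N, 2)); Rss = S @ S.T
B = rng.standard_normal((N, N)); Rnn = B @ B.T + N * np.eye(N)
Ruu = Rss + Rnn

# joint diagonalization: Ruu = X diag{s2} X^T, Rnn = X diag{h2} X^T.
# eigh solves Ruu v = lam Rnn v with V^T Rnn V = I, so X^{-T} = V,
# s2 = lam and h2 = 1 (the common scaling is absorbed in X)
lam, V = eigh(Ruu, Rnn)
W = V @ np.diag((lam - 1.0) / lam) @ np.linalg.inv(V)

# must agree with the closed form W_WF = Ruu^{-1} (Ruu - Rnn)  (eq. 2.19)
W_direct = np.linalg.solve(Ruu, Ruu - Rnn)
assert np.allclose(W, W_direct)
```

The same X could equivalently be obtained from the GSVD of the data matrices U_k and N_k themselves, which is numerically preferable when the covariances are ill-conditioned.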
2.4 White noise case

In the white noise case, the noise covariance matrix is

  E{n_k n_k^T} = σ² I,   (2.29)

with σ² the power of the white noise process. Obviously this simplifies the formulas considerably. The joint diagonalization reduces to an eigenvalue decomposition of the form

  E{u_k u_k^T} = X diag{σ_i²} X^T   (2.30)
  E{n_k n_k^T} = σ² I = X σ² I X^T,   (2.31)

where X is now an orthogonal matrix.

Often the noise power σ² can be estimated from the smallest singular values of E{u_k u_k^T} (e.g. after assuming a low rank model for E{s_k s_k^T}, which is approximately valid for speech signals [16]). This means that speech detection is no longer necessary, and that the method also applies to non-speech applications.

In the white noise case, the error covariance matrix E{e_k e_k^T} reduces to

  E{e_k e_k^T} = E{(s_k − W_WF^T u_k)(s_k − W_WF^T u_k)^T} = σ² W_WF.   (2.32)
PS: In the white noise case, from the orthogonality of X, it follows that every diagonal element in W_WF is limited between 0 and 1:

  {W_WF}_ii = X(i,:) diag{(σ_j² − σ²)/σ_j²} X(i,:)^T = Σ_{j=1}^N ((σ_j² − σ²)/σ_j²) X(i,j)²   (2.33)

  0 ≤ {W_WF}_ii ≤ Σ_{j=1}^N X(i,j)² = 1  (since σ² ≤ σ_j²),   (2.34)

which means that the estimate for {s_k}_i contains a contribution α {u_k}_i with 0 ≤ α ≤ 1, and that α = 1 in the noiseless case (σ = 0).

PS: In the white noise case, W_WF is a symmetric matrix. This means that if the estimate for {s_k}_i contains a contribution {u_k}_j, then the estimate for {s_k}_j contains a contribution {u_k}_i ('reciprocity').
2.5 Time series filtering

Let us now assume the vector u_k is taken from a time series u(k), i.e.

  u_k = [ u(k)  u(k−1)  ...  u(k−N+1) ]^T,   (2.35)

and similarly

  U_k = [ u(k)      u(k−1)    u(k−2)    ...  u(k−N+1)
          u(k+1)    u(k)      u(k−1)    ...  u(k−N+2)
          u(k+2)    u(k+1)    u(k)      ...  u(k−N+3)
          ...
          u(k+p−1)  u(k+p−2)  u(k+p−3)  ...  u(k+p−N) ].   (2.36)

The autocorrelation function of the signal s(k) is

  φ(τ) = E{s(k) s(k−τ)}   (2.38)

and satisfies

  φ(τ) = φ(−τ),   (2.39)

such that the correlation matrices E{u_k u_k^T} and E{s_k s_k^T} are symmetric Toeplitz matrices, e.g.

  E{s_k s_k^T} = [ φ(0)    φ(1)    φ(2)    ...  φ(N−1)
                   φ(1)    φ(0)    φ(1)    ...  φ(N−2)
                   φ(2)    φ(1)    φ(0)    ...  φ(N−3)
                   ...
                   φ(N−1)  φ(N−2)  φ(N−3)  ...  φ(0)   ].   (2.40)

Symmetric Toeplitz matrices belong to the class of double symmetric matrices, which are symmetric about both the main diagonal and the secondary diagonal. The eigenvectors of such matrices are known to have special symmetry properties [17, 18]. For specific notation and properties, we refer to appendix B.
  E{u_k u_k^T}^{-1} = J E{u_k u_k^T}^{-1} J   (2.45)

  E{u_k u_k^T}^{-1} E{n_k n_k^T} = J E{u_k u_k^T}^{-1} E{n_k n_k^T} J.   (2.46)

The optimal filter W_WF, defined in equation (2.19), is

  W_WF = I − E{u_k u_k^T}^{-1} E{n_k n_k^T},   (2.47)

such that

  J W_WF J = W_WF   (2.48)
  J W_WF^T J = W_WF^T.   (2.49)

The properties J W_WF J = W_WF and J W_WF^T J = W_WF^T mean that the i-th row/column of W_WF is equal to the (N+1−i)-th row/column in reverse order. In the white noise case W_WF is a symmetric matrix. From the property J W_WF J = W_WF it then follows that W_WF is a double-symmetric matrix in the white noise case.
Theorem 2 If the filter W_WF belongs to the more general class of estimators, defined in equation (2.25),

  W_WF = X^{-T} diag{f(σ_i², η_i²)} X^T,   (2.50)

the properties of equation (2.42) still hold, in the white noise case as well as in the coloured noise case.

Proof: The joint diagonalization of E{u_k u_k^T} and E{n_k n_k^T}, as defined in equation (2.21), is

  E{u_k u_k^T} = X diag{σ_i²} X^T   (2.51)
  E{n_k n_k^T} = X diag{η_i²} X^T.

Therefore

  E{u_k u_k^T}^{-1} E{n_k n_k^T} = X^{-T} diag{η_i²/σ_i²} X^T   (2.52)

is the eigenvector decomposition of E{u_k u_k^T}^{-1} E{n_k n_k^T}, with X an invertible, but not necessarily orthogonal, matrix (orthogonal only in the white noise case). Because

  E{u_k u_k^T}^{-1} E{n_k n_k^T} = J E{u_k u_k^T}^{-1} E{n_k n_k^T} J,   (2.53)

the eigenvectors are symmetric or skew-symmetric, i.e.

  J X^{-T} = X^{-T} diag{±1}.   (2.54)

Consequently

  J W_WF J = J X^{-T} diag{f(σ_i², η_i²)} X^T J
           = X^{-T} diag{±1} diag{f(σ_i², η_i²)} diag{±1} X^T
           = X^{-T} diag{f(σ_i², η_i²)} X^T
           = W_WF.   (2.55)
Rank truncation, for instance, is the basis for a popular estimation procedure in the white noise case [1], where

  f(σ_i², σ²) = 1  if σ_i² > σ²
              = 0  if σ_i² ≤ σ².   (2.56)

  f(σ_i², η_i²) = 1  if i = 1
                = 0  otherwise.   (2.57)
Applying the filter to the vector u_k yields

  ŝ_k = [ ŝ(k)  ŝ(k−1)  ŝ(k−2)  ...  ŝ(k−N+1) ]^T = W_WF^T [ u(k)  u(k−1)  u(k−2)  ...  u(k−N+1) ]^T   (2.58)

or, with more explicit notation,

  [ ŝ_{k:k−N+1}(k)  ŝ_{k:k−N+1}(k−1)  ŝ_{k:k−N+1}(k−2)  ...  ŝ_{k:k−N+1}(k−N+1) ]^T
    = W_WF^T [ u(k)  u(k−1)  u(k−2)  ...  u(k−N+1) ]^T,   (2.59)

where ŝ_{k:k−N+1}(l) means that an estimate for s(l) is obtained as a linear combination of u(k), u(k−1), ..., u(k−N+1). For N odd, the middle row in W_WF^T produces the estimate ŝ_{k:k−N+1}(k − (N−1)/2), where s(k − (N−1)/2) is estimated from u(k − (N−1)/2) together with (N−1)/2 earlier samples and (N−1)/2 later samples of u.

The property J W_WF^T J = W_WF^T then indicates that for N odd, the middle row in W_WF^T is symmetric, and hence represents a linear phase filter. Note that a zero phase property has been attributed to an SVD and rank truncation based estimator for the white noise case, if an additional averaging step (see also section 2.6) is included [20]. For the coloured noise case [2, 3, 4], a similar linear phase property had apparently not been derived yet.
2.6 Time series filtering and averaging

From

  [ ŝ_{k:k−N+1}(k)  ŝ_{k:k−N+1}(k−1)  ...  ŝ_{k:k−N+1}(k−N+1) ]^T = W_WF^T u_k   (2.60)

it follows that

  [ ŝ_{k:k−N+1}(k)      ŝ_{k+1:k−N+2}(k+1)    ...  ŝ_{k+N−1:k}(k+N−1)
    ŝ_{k:k−N+1}(k−1)    ŝ_{k+1:k−N+2}(k)      ...  ŝ_{k+N−1:k}(k+N−2)
    ŝ_{k:k−N+1}(k−2)    ŝ_{k+1:k−N+2}(k−1)    ...  ŝ_{k+N−1:k}(k+N−3)
    ...
    ŝ_{k:k−N+1}(k−N+1)  ŝ_{k+1:k−N+2}(k−N+2)  ...  ŝ_{k+N−1:k}(k)     ]
  = W_WF^T [ u_k  u_{k+1}  ...  u_{k+N−1} ].   (2.61)
It is seen that several (maximum N) estimates are obtained for one and the same sample s(l). As an example, N estimates for s(k) are available on the main diagonal. If w(i,j) denotes the (i,j)-element of W_WF, one can obtain an explicit formula for all these estimates together:

  [ ŝ_{k:k−N+1}(k)
    ŝ_{k+1:k−N+2}(k)
    ...
    ŝ_{k+N−2:k−1}(k)
    ŝ_{k+N−1:k}(k)   ] = W̄_WF^T [ u(k+N−1)  u(k+N−2)  ...  u(k+1)  u(k)  ...  u(k−N+2)  u(k−N+1) ]^T,   (2.62)

with W̄_WF^T the N × (2N−1) banded matrix whose i-th row contains N−i leading zeros, the taps w(1,i), w(2,i), ..., w(N,i), and i−1 trailing zeros:

  W̄_WF^T = [ 0       ...     0       w(1,1)  w(2,1)  ...  w(N,1)
              0       ...     w(1,2)  w(2,2)  ...     w(N,2)  0
              ...
              w(1,N)  w(2,N)  ...     w(N,N)  0       ...  0      ].   (2.63)

From J W_WF^T J = W_WF^T it immediately follows that

  W̄_WF^T = J W̄_WF^T J.   (2.64)

The question now arises which estimate, out of the N available estimates for s(k), is the best. The answer is given by the error covariance matrix (see section 2.3) E{e_k e_k^T} = E{n_k n_k^T} W_WF. The smallest element on the main diagonal of the error covariance matrix corresponds to the best estimator. From here on the best estimator, which is the corresponding row of W_WF^T, will be denoted as w_WF^min.
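The selection of w_WF^min can be sketched as follows (an illustration with toy covariance matrices and white noise, not the report's code): compute W_WF, form the error covariance E{n_k n_k^T} W_WF, and take the row of W_WF^T with the smallest diagonal error.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 7

# toy covariances (assumed estimated from speech and noise frames)
S = rng.standard_normal((N, 2)); Rss = S @ S.T   # low-rank "speech"
Rnn = 0.5 * np.eye(N)                            # white noise, sigma^2 = 0.5
Ruu = Rss + Rnn

W = np.linalg.solve(Ruu, Rss)          # W_WF = Ruu^{-1} Rss  (eq. 2.19)
err_cov = Rnn @ W                      # E{e e^T} = E{n n^T} W_WF
i_best = int(np.argmin(np.diag(err_cov)))
w_min = W.T[i_best]                    # best N-taps estimator w_WF^min
```

In the white noise case W_WF is symmetric positive semidefinite, so the diagonal error terms are nonnegative and the comparison is well defined.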
The question remains if perhaps an even better estimate for s(k) can be obtained by linearly combining the N available estimates. This question is apparently not easily answered. An obvious choice could be to average over all available estimates, a technique which is often applied to rank truncation based estimation [1, 2, 3, 4, 20]:

  s̃_{k+N−1:k−N+1}(k) = [ 1/N  1/N  ...  1/N ] [ ŝ_{k:k−N+1}(k)  ŝ_{k+1:k−N+2}(k)  ...  ŝ_{k+N−1:k}(k) ]^T
                      = [ 1/N  1/N  ...  1/N ] W̄_WF^T [ u(k+N−1)  u(k+N−2)  ...  u(k−N+1) ]^T
                      = w̃^T [ u(k+N−1)  u(k+N−2)  ...  u(k−N+1) ]^T.   (2.65)

Here s̃_{k+N−1:k−N+1}(k) is estimated from u(k) together with (N−1) earlier samples and (N−1) later samples. The (2N−1)-taps filter w̃ is obtained by averaging over the available N-taps filters W_WF^T(i,:). From the symmetry property of W_WF it is readily seen that w̃ is symmetric, and hence represents a linear phase filter. A crucial question is whether the (2N−1)-taps estimator w̃ is better than the individual N-taps estimators W_WF^T(i,:) it is computed from. Specifically, w̃ should be compared with the symmetric middle row of W_WF^T (for N odd), which represents a linear phase filter that uses (N−1)/2 earlier samples and (N−1)/2 later samples.

First, it can be verified that w̃ is not an optimal filter, i.e.

  s̃_{k+N−1:k−N+1}(k) ≠ ŝ_{k+N−1:k−N+1}(k).   (2.66)

Note that s̃_{k+N−1:k−N+1}(k) corresponds to a linear-phase (2N−1)-taps estimator w̃, obtained by averaging over a collection of N-taps estimators. On the other hand, ŝ_{k+N−1:k−N+1}(k) corresponds to a linear-phase (2N−1)-taps estimator ŵ, which is obtained by applying the usual Wiener filter formulas to a (2N−1)-dimensional vector u_k. So, in general, ŵ is a function of φ(0), φ(1), ..., φ(2N−2), whereas w̃ will only be a function of φ(0), φ(1), ..., φ(N−1). This means that ŝ_{k+N−1:k−N+1}(k) and s̃_{k+N−1:k−N+1}(k) are not the same, except for contrived examples. The following example further illustrates this.
Example: As an example, consider a white noise case with σ² much larger than the eigenvalues of E{s_k s_k^T}. Then

  W_WF = E{u_k u_k^T}^{-1} E{s_k s_k^T} ≈ (1/σ²) E{s_k s_k^T}
       = (1/σ²) [ φ(0)    φ(1)    ...  φ(N−1)
                  φ(1)    φ(0)    ...  φ(N−2)
                  ...
                  φ(N−1)  φ(N−2)  ...  φ(0)   ],

where the correlation matrix E{s_k s_k^T} is a symmetric Toeplitz matrix. The matrix W̄_WF^T then has the banded form

  W̄_WF^T = (1/σ²) [ 0       ...  0     φ(0)  φ(1)  ...  φ(N−1)
                     0       ...  φ(1)  φ(0)  φ(1)  ...  φ(N−2)  0
                     ...
                     φ(N−1)  ...  φ(1)  φ(0)  0     ...  0       ].

In the simulations the correlation matrices are estimated from the data, with L the length of the signals and U ∈ R^{L×N} the data matrix defined as in equation (2.22). Since the noise is white, the noise correlation matrix E{n n^T} ∈ R^{N×N} is

  E{n n^T} = σ² I.   (2.69)
Both the optimal filter W_WF, which consists of N N-taps estimators W_WF^T(i,:), and the (2N−1)-taps estimator w̃, obtained through averaging, are computed from these correlation matrices. Also ŝ(k) = U W_WF, which consists of N estimates ŝ_i(k) = U W_WF(:,i), and s̃(k), obtained through averaging, are computed. The error variances σ̂_i, i = 1...N, and σ̃ are defined as

  σ̂_i = (1/L) Σ_{k=1}^L (s(k) − ŝ_i(k))²,  i = 1...N   (2.70)

  σ̃ = (1/L) Σ_{k=1}^L (s(k) − s̃(k))².   (2.71)

For N = 9 and L = 10^5, the error variances σ̂_i, i = 1...N, and σ̃ are compared for two different noise powers (σ² = 0.5 and σ² = 2) in figure 3. As can be seen from the simulations, the (2N−1)-taps estimator w̃ is not always better than the individual N-taps estimators W_WF(:,i) it is computed from. Moreover, there always seem to exist N-taps estimators W_WF(:,i) which give rise to a lower error variance than the (2N−1)-taps estimator w̃.
Figure 3: Error variance comparison between the (2N−1)-taps estimator w̃ ('Averaged') and the original N-taps estimators W_WF(:,i) ('Optimal') for different noise powers (N = 9, L = 10^5; left panel: σ² = 0.5, right panel: σ² = 2; error variance versus estimator index i)
Hence averaging does not seem to be a well-founded operation, while on the other hand it certainly increases computational complexity, since it requires (2N−1)-taps filtering instead of N-taps filtering. If minimal error variance is sought, we therefore suggest picking the N-taps estimator w_WF^min corresponding to the smallest diagonal element in the error covariance matrix. If the linear phase property is desirable, we suggest picking the N-taps estimator given by the middle row of W_WF^T (for N odd).
2.7 Multichannel time series filtering

Consider a microphone array recording both a desired signal and background noise in a room, as depicted in figure 4.

[Figure 4: a signal source reaching the microphone array via a direct path and reflections, together with interfering noise sources; the microphone signals are processed by SVD-based multi-microphone signal enhancement]
The stacked input vector is now

  u_k = [ m_{1k}  m_{2k}  ...  m_{Mk} ]^T,   (2.73)

with

  m_{jk} = [ m_j(k)  m_j(k−1)  ...  m_j(k−N+1) ]^T.   (2.74)

The vectors s_k and n_k are similarly defined. The data matrix U_k ∈ R^{p×MN} as defined in equation (2.22) then takes the form

  U_k = [ U_{1k}  U_{2k}  ...  U_{Mk} ],   (2.75)

with

  U_{jk} = [ m_j(k)      m_j(k−1)    m_j(k−2)    ...  m_j(k−N+1)
             m_j(k+1)    m_j(k)      m_j(k−1)    ...  m_j(k−N+2)
             m_j(k+2)    m_j(k+1)    m_j(k)      ...  m_j(k−N+3)
             ...
             m_j(k+p−1)  m_j(k+p−2)  m_j(k+p−3)  ...  m_j(k+p−N) ].   (2.76)
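A per-channel data matrix U_jk of the form (2.76) can be built with a sliding-window view (illustration only; the helper name and the toy values of k, N and p are made up):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def channel_data_matrix(m, k, N, p):
    """Rows [m(k+r), m(k+r-1), ..., m(k+r-N+1)] for r = 0..p-1 (eq. 2.76)."""
    windows = sliding_window_view(m, N)          # windows[i] = m[i : i+N]
    rows = windows[k - N + 1 : k - N + 1 + p]    # window ending at sample k+r
    return rows[:, ::-1]                         # reverse into descending time

m = np.arange(100.0)
U = channel_data_matrix(m, k=10, N=4, p=3)
assert np.allclose(U[0], [10, 9, 8, 7])   # first row: m(10), m(9), m(8), m(7)
assert np.allclose(U[2], [12, 11, 10, 9])
```

The full multichannel data matrix U_k is then obtained by horizontally stacking the per-channel blocks, e.g. np.hstack([channel_data_matrix(m_j, k, N, p) for m_j in mics]).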
Using the same formulas as for the one-channel case, the optimal filter W_WF and the best (MN)-taps estimator w_WF^min can be computed. For stationary signals we can use the same correlation matrices for all samples. Therefore the estimated signal ŝ(k) can be computed as

  ŝ(k) = [ ŝ(k)  ŝ(k+1)  ...  ŝ(k+p−1) ]^T = [ U_{1k}  U_{2k}  ...  U_{Mk} ] w_WF^min.   (2.77)

This filter operation can be considered as a multichannel filter, where each of the M channels is filtered with an N-taps filter A_j, where

  w_WF^min = [ A_1^T  A_2^T  ...  A_M^T ]^T.   (2.78)

This is depicted in figure 5.
[Figure 5: multichannel filtering interpretation: each microphone signal m_1 ... m_4 (desired signal plus noise) is filtered with a filter A_1 ... A_4 and the filter outputs are summed]
For the two-channel case (M = 2) with equal speech contributions in both channels, we have

  u_k = [ m_{1k} ; m_{2k} ] = [ s_k + n_{1k} ; s_k + n_{2k} ],   (2.81)

  n_k = [ n_{1k} ; n_{2k} ].   (2.82)

Define the correlation matrices

  Φ_uu  = E{u_k u_k^T}   (2.83)
  Φ_nn  = E{n_k n_k^T}   (2.84)
  Φ_s   = E{s_k s_k^T}   (2.85)
  Φ_n11 = E{n_{1k} n_{1k}^T}   (2.86)
  Φ_n12 = E{n_{1k} n_{2k}^T} = Φ_n21^T   (2.87)
  Φ_n21 = E{n_{2k} n_{1k}^T} = Φ_n12^T   (2.88)
  Φ_n22 = E{n_{2k} n_{2k}^T}.   (2.89)

If we assume the desired signal and the noise are uncorrelated, then the correlation matrix Φ_uu can be written as

  Φ_uu = E{ [ s_k ; s_k ] [ s_k^T  s_k^T ] } + E{ [ n_{1k} ; n_{2k} ] [ n_{1k}^T  n_{2k}^T ] } = Φ_ss + Φ_nn,   (2.90)

with

  Φ_ss = [ Φ_s  Φ_s
           Φ_s  Φ_s ],   (2.91)

  Φ_nn = [ Φ_n11    Φ_n12
           Φ_n12^T  Φ_n22 ].   (2.92)
First we will discuss the symmetry properties of the correlation matrix Φ_ss and the conditions under which the correlation matrix Φ_nn exhibits these symmetry properties.

Property 1 Because of the specific form of the correlation matrix Φ_ss in equation (2.91), this matrix exhibits the following properties (for notation, see Appendix B):
1. Φ_ss is a symmetric Toeplitz matrix
2. J Φ_ss J = Φ_ss
3. S Φ_ss S = Φ_ss

Proof: Since J Φ_s J = Φ_s (Φ_s is symmetric Toeplitz),

  J Φ_ss J = [ 0  J ; J  0 ] [ Φ_s  Φ_s ; Φ_s  Φ_s ] [ 0  J ; J  0 ] = [ J Φ_s J  J Φ_s J ; J Φ_s J  J Φ_s J ] = Φ_ss,

  S Φ_ss S = [ 0  I ; I  0 ] [ Φ_s  Φ_s ; Φ_s  Φ_s ] [ 0  I ; I  0 ] = Φ_ss.
For the noise correlation matrix nn we will discuss the conditions under which nn
exhibits symmetry properties.
n22 J
n12 J
=
=
n11
T
n12
nn J
= nn , i
(2.93)
nn =
nn J
= J0 J0
n11
T
n12
n12
n22
0 J = J n22 J J Tn12 J
J 0
J n12 J J n11 J
n12
n22
A su cient condition for n12 being centro-symmetric, is n12 being Toeplitz. For
stationary noise sources, n11 and n22 are symmetric Toeplitz matrices, such that
the condition J n22 J = n11 implies that n11 = n22 .
=
=
n22
T
n12
nn S = nn, i
(2.94)
( n12 is symmetric)
nn =
nnS =
0 I
I 0
n11
T
n12
20
n12
n22
n12
n22
0 I =
I 0
n22
n12
T
n12
n11
2
For different types of noise correlation matrices Φ_nn we will now discuss the symmetry properties of the optimal Wiener filter W_WF,

  W_WF = X^{-T} diag{(σ_i² − η_i²)/σ_i²} X^T   (2.95)
       = [ Φ_s + Φ_n11    Φ_s + Φ_n12
           Φ_s + Φ_n12^T  Φ_s + Φ_n22 ]^{-1} [ Φ_s  Φ_s
                                               Φ_s  Φ_s ],   (2.96)

and the symmetry properties of the more general class of estimators

  W = X^{-T} diag{f(σ_i², η_i²)} X^T.   (2.97)

For convenience, we will partition the matrix W_WF into four parts:

  W_WF = [ W_WF^11  W_WF^12
           W_WF^21  W_WF^22 ].   (2.98)
Property 4 Because of the specific form of the optimal Wiener filter W_WF in equation (2.96),

  W_WF^11 = W_WF^12   (2.99)
  W_WF^21 = W_WF^22.   (2.100)

Proof: Writing the inverse of Φ_uu in block form as [ Γ_11  Γ_12 ; Γ_21  Γ_22 ],

  W_WF = [ Γ_11  Γ_12 ; Γ_21  Γ_22 ] [ Φ_s  Φ_s ; Φ_s  Φ_s ]
       = [ (Γ_11 + Γ_12) Φ_s  (Γ_11 + Γ_12) Φ_s ; (Γ_21 + Γ_22) Φ_s  (Γ_21 + Γ_22) Φ_s ],

such that the two block columns are equal.

Case 1: If the noise correlation matrix has the form

  Φ_nn = [ Φ_n11    Φ_n12
           Φ_n12^T  Φ_n11 ],   (2.101)

with Φ_n11 symmetric Toeplitz and Φ_n12 Toeplitz (but not symmetric), then for the optimal filter

  J W_WF^11 J = W_WF^12.   (2.103)

Similarly to the proof of theorem 2, one would expect that the general class of estimators, as described in equation (2.97), exhibits the same symmetry properties as the optimal filter. However, not all eigenvectors of the matrix Φ_uu^{-1} Φ_nn are symmetric or skew-symmetric, such that for X^{-T} (of which the columns are the eigenvectors),

  J X^{-T} ≠ X^{-T} diag{±1}.   (2.104)

If we partition

  X^{-T} = [ X_1  X_2 ],   (2.105)

and if

  J X_1 = X_2 Λ,   (2.106)

with Λ ∈ R^{N×N} a diagonal matrix, then the general class of estimators exhibits the same symmetry properties as the optimal filter.
Case 2: If in addition Φ_n12 is symmetric, i.e.

  Φ_nn = [ Φ_n11  Φ_n12
           Φ_n12  Φ_n11 ],   (2.107)

then, because this is a special case of equation (2.101), the same symmetry properties hold for the optimal filter, J W_WF J = W_WF. Since S Φ_ss S = Φ_ss and S Φ_nn S = Φ_nn (see property 3), it follows that S Φ_uu S = Φ_uu, and hence

  S W_WF S = W_WF,   (2.108)

so that W_WF has the form

  W_WF = [ W_WF^11  W_WF^12
           W_WF^12  W_WF^11 ],   (2.109)

with

  J W_WF^11 J = W_WF^11.   (2.110)

This means that the filters for the 2 channels are equal.

For the general class of estimators, as described in equation (2.97), these symmetry properties do not hold in all cases. The reason is the same as for case 1, i.e.

  J X^{-T} ≠ X^{-T} diag{±1},   (2.111)

with X^{-T} the matrix containing the eigenvectors of Φ_uu^{-1} Φ_nn. The property which does hold in all cases is

  S X^{-T} = X^{-T} diag{±1}.   (2.112)

Therefore the general class of estimators always satisfies S W_WF S = W_WF and has the form

  W_WF = [ W_WF^11  W_WF^12
           W_WF^12  W_WF^11 ],   (2.113)

but it only exhibits the additional symmetry properties (2.109) and (2.110) if the diagonal matrix in equation (2.97) is of the form

  diag{f(σ_i², η_i²)} = [ Λ  0
                          0  Λ ],   (2.114)

with Λ ∈ R^{N×N} diagonal.
Case 3: If in case 2 the noise sources n_1(k) and n_2(k) are uncorrelated (Φ_n12 = 0), then the noise correlation matrix Φ_nn has the form

  Φ_nn = [ Φ_n11  0
           0      Φ_n11 ].   (2.115)

The conclusions regarding symmetry properties are the same as in case 2, for the optimal filter as well as for the general class of estimators.

Case 4: If in case 3 the uncorrelated noise sources n_1(k) and n_2(k) are white noise sources with the same noise power σ², then the noise correlation matrix Φ_nn has the form

  Φ_nn = [ σ² I  0
           0     σ² I ].   (2.116)

The conclusions regarding symmetry properties are the same as in case 2, except for the additional property that W_WF is symmetric, for the optimal filter as well as for the general class of estimators. In this case the optimal filter W_WF has the form

  W_WF = [ W_WF^11  W_WF^11
           W_WF^11  W_WF^11 ],   (2.117)

with

  W_WF^11 = J W_WF^11 J   (2.118)
  W_WF^11 = (W_WF^11)^T.   (2.119)
2.8 Conclusion

In this section we have described a class of SVD-based signal enhancement procedures, which amount to a specific optimal filtering technique for the case where the so-called 'desired response' signal d_k = s_k cannot be observed. It is shown that this optimal filter W_WF can be written as a function of the generalized singular vectors and singular values of a so-called speech data matrix U_k and noise data matrix N_k. When applying this filtering technique to time series, a number of simple symmetry properties are derived, which prove to be valid for the white noise case as well as for the coloured noise case. Also, the averaging step of the standard one-microphone SVD-based noise reduction techniques is investigated, leading to serious doubts about the necessity of this averaging step, which increases computational complexity but does not improve performance. When applying the SVD-based optimal filtering technique to multiple channels, a number of additional symmetry properties can be derived, depending on the structure of the noise covariance matrix.
3.1 Preliminaries

Each microphone signal consists of a speech component and a noise component,

  m_j(k) = s_j(k) + n_j(k),  j = 1...M,   (3.1)

where s_j(k) is the speech signal in the j-th microphone signal and n_j(k) is the noise in the j-th microphone signal.
[Figure: a signal source reaching the microphone array via a direct path and reflections, together with interfering noise sources; the microphone signals are processed by SVD-based multi-microphone signal enhancement]
The difficulty lies in finding the optimal filters A_j. For finding these filters we will use the SVD-based optimal filtering technique described in section 2.

[Figure: each microphone signal m_1 ... m_4 (desired signal plus noise) is filtered with a filter A_1 ... A_4 and the filter outputs are summed]
with d the distance between the microphones, c the speed of sound (c ' 340 ms ) and
fs the sampling frequency, such that
sj+m (k) = sj (k ? m ):
(3.3)
If 2= Z, than the di erent speech signals sj (k) can be constructed by ltering the
clean speech signal s(k) with an interpolation lter.
In this section we will consider two kinds of noise sources:
- spatio-temporal white noise (diffuse noise), where the noise n_j(k) in the j-th microphone signal is temporally white and is uncorrelated with the noise n_l(k) in the l-th microphone signal, (3.4)
- a localized white noise source n(k) which impinges on the microphone array at an angle θ, such that the noise signals n_j(k) are delayed versions of each other (analogous to the speech signal) and are correlated with each other. The different noise signals n_j(k) are constructed by filtering the white noise signal n(k) with an interpolation filter.
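A minimal sketch of how such delayed microphone signals can be generated from a single source signal. All numerical values (d, c, f_s, the angle and the toy signal) are example assumptions, and simple linear interpolation via np.interp stands in for the report's unspecified interpolation filter:

```python
import numpy as np

def delayed_copies(s, M, delta):
    """Generate M microphone signals, each delayed by (j-1)*delta samples
    relative to the first, using linear interpolation for fractional delays."""
    k = np.arange(len(s))
    return np.array([np.interp(k - (j - 1) * delta, k, s) for j in range(1, M + 1)])

fs, d, c = 8000.0, 0.05, 340.0       # sampling rate, mic spacing, speed of sound
theta = np.deg2rad(45.0)             # example angle of incidence
delta = fs * d * np.cos(theta) / c   # inter-microphone delay in samples

s = np.sin(2 * np.pi * 200 * np.arange(2000) / fs)  # toy "speech" signal
mics = delayed_copies(s, M=4, delta=delta)
```

For a 200 Hz tone at 8 kHz, the linear-interpolation error stays small, so each channel is close to an exactly delayed copy of the source.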
[Figure 8: microphone signal 1, amplitude versus time (samples).]
with

U_k = [ U_1k  U_2k  ...  U_Mk ],    N_k = [ N_1k  N_2k  ...  N_Mk ],    (3.5)

where U_jk and N_jk are the p × N Toeplitz data matrices

         | m_j(k)      m_j(k-1)    m_j(k-2)    ...  m_j(k-N+1) |
         | m_j(k+1)    m_j(k)      m_j(k-1)    ...  m_j(k-N+2) |
U_jk  =  | m_j(k+2)    m_j(k+1)    m_j(k)      ...  m_j(k-N+3) |
         | ...                                                  |
         | m_j(k+p-1)  m_j(k+p-2)  m_j(k+p-3)  ...  m_j(k+p-N) |

         | n_j(k)      n_j(k-1)    n_j(k-2)    ...  n_j(k-N+1) |
         | n_j(k+1)    n_j(k)      n_j(k-1)    ...  n_j(k-N+2) |
N_jk  =  | n_j(k+2)    n_j(k+1)    n_j(k)      ...  n_j(k-N+3) |
         | ...                                                  |
         | n_j(k+p-1)  n_j(k+p-2)  n_j(k+p-3)  ...  n_j(k+p-N) |

                                                                (3.6)
For constructing the speech data matrix U_k we use 2000 samples m_j(8000 ... 9999).
For constructing the noise data matrix N_k we use the same frame n_j(8000 ... 9999).
In practice this is never possible, since the noise data matrix can only be constructed
during periods where no speech is present. However, for simulated situations the total
noise signal n_j(k) is known. In section 4 it will be seen that for stationary noise,
constructing the noise data matrix N_k from a different frame than the speech data
matrix U_k has no influence on the performance.
Using the generalized singular value decomposition of U_k and N_k,

U_k = U diag{σ_i} X^T,    N_k = V diag{η_i} X^T,    (3.7)

we can compute the optimal Wiener filter W_WF,

W_WF = X^{-T} diag{ (σ_i² − η_i²) / σ_i² } X^T.    (3.8)
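Since U_k^T U_k = X diag{σ_i²} X^T and N_k^T N_k = X diag{η_i²} X^T, the filter (3.8) equals (U_k^T U_k)^{-1} (U_k^T U_k − N_k^T N_k), which can be evaluated without forming the GSVD explicitly. A minimal numerical sketch on random stand-in data (not the report's speech frames):

```python
import numpy as np

rng = np.random.default_rng(0)
p, MN = 200, 8                          # rows and columns of the data matrices
U = rng.standard_normal((p, MN))        # stand-in for the speech data matrix U_k
N = 0.5 * rng.standard_normal((p, MN))  # stand-in for the noise data matrix N_k

Ruu = U.T @ U                           # = X diag{sigma_i^2} X^T
Rnn = N.T @ N                           # = X diag{eta_i^2} X^T

# W_WF = X^{-T} diag{(sigma_i^2 - eta_i^2)/sigma_i^2} X^T
#      = (U^T U)^{-1} (U^T U - N^T N)
W_WF = np.linalg.solve(Ruu, Ruu - Rnn)
```

With vanishing noise the filter reduces to the identity, as expected from the diagonal weights tending to 1.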
By choosing the column of W_WF corresponding to the smallest element on the diagonal
of the filtered noise correlation matrix W_WF^T N_k^T N_k W_WF, we obtain the best
estimator w_WF^min. The filter w_WF^min consists of the M filters A_j of length N,

w_WF^min = [ A_1  A_2  ...  A_M ].    (3.9)
The resulting estimated (enhanced) signal ŝ(k) is computed by filtering and summing
the microphone signals m_j(k) with the filters A_j over their total length (20000 samples),

ŝ(k) = Σ_{j=1}^{M} A_j(k) ∗ m_j(k).    (3.10)
For comparison purposes the signal-to-noise ratio (SNR) will be used. The SNR of a
signal x(k) is defined as

SNR(x) = ( Σ x²(k) )_speech / ( Σ x²(k) )_noise,    (3.11)

which is the energy of the signal x(k) during speech periods divided by the energy of
the signal x(k) during noise periods. Therefore a speech/noise detection is necessary,
as indicated by the dotted line in figure 8.
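The SNR definition (3.11) can be sketched directly, assuming a boolean speech/noise detection mask is available (here constructed by hand for a toy signal, not obtained from an actual detector):

```python
import numpy as np

def snr_db(x, speech):
    """SNR per eq. (3.11): energy during speech periods over energy
    during noise-only periods, expressed in dB. `speech` is a boolean mask."""
    e_speech = np.sum(x[speech] ** 2)
    e_noise = np.sum(x[~speech] ** 2)
    return 10.0 * np.log10(e_speech / e_noise)

rng = np.random.default_rng(1)
noise = 0.1 * rng.standard_normal(2000)
x = noise.copy()
x[500:1500] += np.sin(2 * np.pi * 0.05 * np.arange(1000))  # "speech" burst
mask = np.zeros(2000, dtype=bool)
mask[500:1500] = True
```

For the pure-noise signal the measure is close to 0 dB, while the speech burst pushes it well above 10 dB.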
In this case the noise n_j(k) in the j-th microphone signal is temporally white and
is uncorrelated with the noise n_l(k) in the l-th microphone signal. (3.12)
The amplitude of the frequency response of the filters A_j is computed as

H_j(f) = Σ_{k=1}^{N} A_j(k) exp( −j 2π f (k−1) / f_s ).    (3.14)

The number of channels M is 2 and the filter length N is 2, 5 and 10. The SNR of the
enhanced signals is 6.49 dB (N = 2), 9.82 dB (N = 5) and 10.74 dB (N = 10).
In figure 11 the number of channels M is 5. The SNR of the enhanced signals is 8.92 dB
(N = 2), 13.12 dB (N = 5) and 13.73 dB (N = 10).
In figure 12 the number of channels M is 10. The SNR of the enhanced signals is
11.37 dB (N = 2), 15.76 dB (N = 5) and 16.52 dB (N = 10).
As already indicated in section 2.7, for uncorrelated white noise sources with the same
noise power (case 4), theoretically the filters A_j for the M different channels should
be the same. This can be verified for the frequency responses in figures 10, 11 and 12.
We will now compare the spatial filtering properties (beamforming behaviour) when
the speech source impinges on the microphone array at θ = 90° and θ = 45°. Ideally
the spatial beamforming pattern should amplify in the direction of the desired signal.
The spatial beamforming pattern H(f, θ) is a function of both frequency f and angle θ
and can be calculated as

H(f, θ) = Σ_{l=1}^{M} H_l(f) exp( j 2π f (l−1) d cos θ / c ),    (3.15)

with H_l(f) defined in equation (3.14). The number of channels M is 5 and the filter
length N is 10.
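Equation (3.15) can be evaluated numerically. The sketch below uses example values for d, c and f_s and unit filters A_j (a plain sum beamformer, not filters obtained from the optimal filtering procedure); for that choice the pattern peaks broadside, at θ = 90°:

```python
import numpy as np

def spatial_pattern(A, f, theta, d=0.05, c=340.0, fs=8000.0):
    """|H(f, theta)| per eq. (3.15); A is an (M, N) array of filters A_j."""
    M, N = A.shape
    k = np.arange(N)
    H = A @ np.exp(-2j * np.pi * f * k / fs)        # H_j(f) per eq. (3.14)
    l = np.arange(M)
    steer = np.exp(2j * np.pi * f * l * d * np.cos(theta) / c)
    return np.abs(H @ steer)

A = np.zeros((5, 10)); A[:, 0] = 1.0                # unit filters: sum beamformer
thetas = np.deg2rad(np.arange(0, 181, 5))
pattern = np.array([spatial_pattern(A, 1000.0, t) for t in thetas])
```

At 1000 Hz the five channels add coherently only for cos θ = 0, so the maximum equals M = 5.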
Figure 9: SNR of enhanced signal ŝ(k) for spatio-temporal white noise and speech
source in front of the microphone array (θ = 90°). The number of channels M varies
from 1 to 10 and the filter length N varies from 1 to 20.
Figure 10: Noisy and enhanced signals and frequency response |H_j(f)| of the filters
A_j for spatio-temporal white noise and speech source in front of the microphone array
(θ = 90°). Number of channels M = 2 and filter length N = 2, 5, 10.
Figure 11: Noisy and enhanced signals and frequency response |H_j(f)| of the filters
A_j for spatio-temporal white noise and speech source in front of the microphone array
(θ = 90°). Number of channels M = 5 and filter length N = 2, 5, 10.
Figure 12: Noisy and enhanced signals and frequency response |H_j(f)| of the filters
A_j for spatio-temporal white noise and speech source in front of the microphone array
(θ = 90°). Number of channels M = 10 and filter length N = 2, 5, 10.
For θ = 90°, figure 13 depicts the noisy speech signal (first microphone signal m_1(k)),
the enhanced signal ŝ(k), the amplitude of the frequency response |H_j(f)| for the M
filters A_j, and the amplitude of the spatial beamforming pattern |H(f, θ)| for all
frequencies f and for one specific frequency f = 1000 Hz. As can be seen from the
spatial beamforming pattern for f = 1000 Hz, the directivity gain is maximal for the
direction θ = 90°. This is even better illustrated in figure 14, where the spatial
beamforming pattern is plotted for every frequency f = i · 100 Hz, i = 1 ... 40. For
every frequency the directivity gain is maximal for the direction θ = 90°. However,
for low frequencies the spatial selectivity is very poor.
In figure 15 the angle θ = 45°. As can be seen from the spatial beamforming pattern
for f = 1000 Hz, the directivity gain is maximal for the direction θ = 45°. This is
even better illustrated in figure 16, where the spatial beamforming pattern is plotted
for every frequency f = i · 100 Hz, i = 1 ... 40. For most frequencies the directivity
gain is maximal for the direction θ = 45°. However, for low frequencies the spatial
selectivity is very poor.
Figure 13: Noisy and enhanced signal, frequency response |H_j(f)| of the filters A_j
and spatial beamforming pattern |H(f, θ)| for spatio-temporal white noise and speech
source in front of the microphone array (θ = 90°). Number of channels M = 5 and
filter length N = 10.
[Figure 14: Spatial beamforming pattern |H(f, θ)| for every frequency f = i · 100 Hz, i = 1 ... 40, for spatio-temporal white noise and speech source in front of the microphone array (θ = 90°).]
Figure 15: Noisy and enhanced signal, frequency response |H_j(f)| of the filters A_j
and spatial beamforming pattern |H(f, θ)| for spatio-temporal white noise and speech
source at θ = 45°. Number of channels M = 5 and filter length N = 10.
Figure 16: Spatial beamforming pattern |H(f, θ)| for spatio-temporal white noise and
speech source at θ = 45°. Number of channels M = 5 and filter length N = 10.
Figure 17: Frequency response of the bandpass filter between 1800 and 1880 Hz.
The speech source s_f(k) impinges on the microphone array at θ = 90° and the noise is
spatio-temporally white. Figure 18 depicts the noisy speech signal (first microphone
signal m_1(k)), the enhanced signals ŝ(k) and the amplitude of the frequency response
|H_j(f)| for the M filters A_j. The number of channels M is 5 and the filter length N
is 10, 20 and 50. The higher the filter length, the better the frequency response
H_j(f) of the filters A_j is confined to the region between 1800 and 1880 Hz.
Figure 18: Noisy and enhanced signals and frequency response |H_j(f)| of the filters A_j
for spatio-temporal white noise and narrowband speech source in front of the microphone
array (θ = 90°). Number of channels M = 5 and filter length N = 10, 20, 50.
We will now consider a localized white noise source n(k) which impinges on the
microphone array at a certain angle, such that the noise signals n_j(k) are delayed
versions of each other. We will discuss the spatial filtering properties (beamforming
behaviour) when the speech source is in front of the microphone array (θ = 90°)
and the noise source impinges on the microphone array at θ = 150°. The distance d
between the different microphones is 5 cm. Ideally the spatial beamforming pattern
should amplify in the direction of the desired signal and should attenuate (place a
zero) in the direction of the localized noise source. The noise power is chosen such
that the SNR of the first (noisy) microphone signal m_1(k) is 3.02 dB. The number of
channels M is 2 and 5 and the filter length N is 10.
For M = 2, figure 19 depicts the noisy speech signal (first microphone signal m_1(k)),
the enhanced signal ŝ(k), the amplitude of the frequency response |H_j(f)| for the M
filters A_j, and the amplitude of the spatial beamforming pattern |H(f, θ)| for all
frequencies f and for one specific frequency f = 1000 Hz. The SNR of ŝ(k) is 17.20 dB.
Figure 19: Noisy and enhanced signal, frequency response |H_j(f)| of the filters A_j
and spatial beamforming pattern |H(f, θ)| for a localized white noise source (θ = 150°)
and speech source (θ = 90°). Number of channels M = 2 and filter length N = 10.
Figure 21: Noisy and enhanced signal, frequency response |H_j(f)| of the filters A_j
and spatial beamforming pattern |H(f, θ)| for a localized white noise source (θ = 150°)
and speech source (θ = 90°). Number of channels M = 5 and filter length N = 10.
We will briefly describe the performance of the SVD-based noise reduction technique
for a real-world example. The signals have been recorded in the ESAT SpeechLab
with a microphone array consisting of 6 microphones (distance between microphones
d = 5 cm). The speech source and the noise source are localized sources. The SNR
of the (noisy) microphone signal m_1(k) is 13.70 dB. The filter length N used is 20.
When we use the optimal Wiener filter W_WF,

W_WF = X^{-T} diag{ (σ_i² − η_i²) / σ_i² } X^T,    (3.16)

the SNR of the enhanced signal ŝ(k) is 16.65 dB, only a small improvement.
When we use the more general class of estimators,

W_WF = X^{-T} diag{ f(σ_i², η_i²) } X^T,    (3.17)

with

f(σ_i², η_i²) = (σ_i² − η_i²) / σ_i²   if i = 1,
f(σ_i², η_i²) = 0                      otherwise,    (3.18)

the SNR of the enhanced signal ŝ(k) is 24.50 dB, a considerable improvement. In this
method we only consider the first generalized eigenvector, corresponding to the maximum
generalized eigenvalue, as described in equation (2.57). Although the enhanced
signal ŝ(k) is distorted, this distortion can be tolerated in this case.
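The estimator keeping only the first generalized singular direction can be sketched numerically. The joint decomposition is computed here via noise whitening and a symmetric eigendecomposition (one valid GSVD normalization, with η_i = 1 and σ_i² equal to the generalized eigenvalues); the data matrices are random stand-ins, not the recorded signals:

```python
import numpy as np

rng = np.random.default_rng(2)
p, MN = 200, 8
U = rng.standard_normal((p, MN))        # stand-in speech data matrix
N = 0.5 * rng.standard_normal((p, MN))  # stand-in noise data matrix
Ruu, Rnn = U.T @ U, N.T @ N

# Whiten the noise: with X = V^{-T}, eta_i = 1 and sigma_i^2 = lam_i.
L = np.linalg.cholesky(Rnn)
Linv = np.linalg.inv(L)
lam, Q = np.linalg.eigh(Linv @ Ruu @ Linv.T)   # ascending eigenvalues
V = Linv.T @ Q                                  # V^T Rnn V = I, V^T Ruu V = diag(lam)

# Weighting per (3.18): keep only the largest generalized singular direction.
f = np.zeros(MN)
f[-1] = (lam[-1] - 1.0) / lam[-1]               # (sigma^2 - eta^2)/sigma^2 for i = 1
W_rank1 = V @ np.diag(f) @ np.linalg.inv(V)
```

Retaining all weights instead of only the first recovers the full Wiener filter (3.16), which provides a consistency check on the decomposition.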
We will briefly discuss two beamforming techniques: fixed delay-and-sum beamforming
[6][21] and adaptive Griffiths-Jim beamforming [7][8][9][10].
Figure 23 depicts a fixed delay-and-sum beamformer. In order to achieve a spatial
alignment of the microphone array with the speech source, which impinges on the
microphone array at an angle θ, the different microphone signals m_j(k) are delayed
by τ_j,

τ_j = (j − 1) d cos θ / c,    (4.1)

with d the distance between the microphones. In order to compute τ_j, the angle θ
first needs to be estimated from the different microphone signals m_j(k), e.g. using
some generalized cross-correlation method [22]. A delay-and-sum beamformer offers
a limited spatial selectivity, especially for low frequencies.
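A minimal delay-and-sum sketch (integer-sample delays only; the steering delays τ_j are assumed known here rather than estimated with a cross-correlation method, and all signal parameters are toy values): aligning and averaging the channels attenuates the uncorrelated noise while keeping the aligned source.

```python
import numpy as np

def delay_and_sum(mics, taus):
    """Delay channel j by tau_j samples (integer sketch; np.roll wraps
    around at the edges, acceptable for long signals) and average."""
    out = np.zeros(mics.shape[1])
    for m, tau in zip(mics, taus):
        out += np.roll(m, int(round(tau)))
    return out / len(mics)

rng = np.random.default_rng(3)
n, M, D = 4000, 4, 3                     # samples, channels, per-channel delay
s = np.sin(2 * np.pi * 0.01 * np.arange(n))          # toy source signal
mics = np.array([np.roll(s, -j * D) + 0.5 * rng.standard_normal(n)
                 for j in range(M)])                  # delayed + noisy copies
taus = [j * D for j in range(M)]         # steering delays, here known exactly
y = delay_and_sum(mics, taus)
```

Averaging M aligned channels divides the uncorrelated noise power by roughly M, which is the limited gain the text refers to.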
[Figure 23: Fixed delay-and-sum beamformer: the microphone array signals m_1 ... m_4 are delayed and summed.]
[Figure 24: Griffiths-Jim beamformer: a fixed beamformer produces the speech reference, a blocking matrix produces the noise references, and a multi-channel adaptive filter (F_1 ... F_M) subtracts the noise estimate.]
We will briefly discuss the room configuration, the signals used and the parameters of
the algorithms used.
The room contains a microphone array, a speech source and a noise source, and is
depicted in figure 25. The room has the following dimensions: 7 m × 3.5 m × 3 m. The
linear equi-spaced microphone array has 5 microphones and the distance d between
the microphones is 5 cm. The position of microphone j is [2+(j−1)·0.05 0.5 1]. The
position of the speech source is in front of the microphone array: [2.1 2 1]. The
position of the noise source is [6 2.5 1].
Because we will compare the performance of the different algorithms for correlated
as well as for uncorrelated noise, the reverberation of the room is an important
parameter. Reverberation is described by the reverberation time T_60, which can be
expressed as a function of the reflection coefficient ρ (0 ≤ ρ ≤ 1) of the walls
(assuming all the walls have the same reflection coefficient), according to Eyring's
formula:

T_60 = 0.163 V / ( −S log ρ ),    (4.2)

with V the volume of the room and S the total surface of the room. For large
reflection coefficients (ρ ≈ 1), Eyring's formula reduces to Sabine's law:

T_60 = 0.163 V / ( S (1 − ρ) ).    (4.3)

The reflection coefficient ρ is a necessary parameter for calculating the room impulse
response through the image method described in [23].
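The reverberation-time formulas can be checked numerically for the room above (7 m × 3.5 m × 3 m). A small sketch (the form of Eyring's formula in terms of the reflection coefficient is a reconstruction); near ρ = 1 the two formulas agree, since −log ρ ≈ 1 − ρ:

```python
import math

V = 7.0 * 3.5 * 3.0                           # room volume (m^3)
S = 2 * (7.0 * 3.5 + 7.0 * 3.0 + 3.5 * 3.0)   # total wall surface (m^2)

def t60_eyring(rho):
    """Eyring's formula, eq. (4.2), with reflection coefficient rho."""
    return 0.163 * V / (-S * math.log(rho))

def t60_sabine(rho):
    """Sabine's law, eq. (4.3), valid for rho close to 1."""
    return 0.163 * V / (S * (1.0 - rho))
```

More reflective walls give longer reverberation times, matching the correlated-versus-diffuse noise discussion in section 4.3.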
[Figure 25: Room configuration: microphone array, desired signal (speech source) and noise source.]
In total the performance of 5 algorithms will be compared: delay-and-sum beamformer,
Griffiths-Jim beamformer, iterated Griffiths-Jim beamformer, SVD-based optimal
filtering 1 and SVD-based optimal filtering 2. The parameters for these different
algorithms are now briefly explained:
- Delay-and-sum beamformer: because the speech source is located in front of the microphone array, the delay-and-sum beamformer becomes a simple sum beamformer (τ_j = 0).
- Griffiths-Jim beamformer: the fixed beamformer has τ_j = 0. The delay for the speech reference is 250 taps. We will consider only one noise reference (because there is only one noise source) and the blocking matrix B is [4 −1 −1 −1 −1]. The filter length of the (one-channel) adaptive filter is 500. The adaptive filtering algorithm we use is NLMS (normalized least mean squares) with adaptation parameter μ = 1 [11]. The filter coefficients are not adapted during speech periods, in order to limit signal cancellation and distortion.
- Iterated Griffiths-Jim beamformer: same parameters as for the Griffiths-Jim beamformer, except for the fact that the adaptive filter is reiterated on the same data for different values of the adaptation parameter μ (μ = 1, 0.5, 0.2, 0.1, 0.05). The smaller the adaptation parameter μ, the slower the convergence, but the smaller the excess error [11].
- SVD-based optimal filtering 1: the same procedure as described in full detail in section 3.1. The SVD1 procedure constructs the speech data matrix U_k and the noise data matrix N_k from the same frame (as already indicated, this is never possible in practice). The start of the speech frame and the noise frame is sample 8000, and the length of the speech frame and the noise frame is 2000. The filter length N for the filters A_j is 10, 20 and 50.
- SVD-based optimal filtering 2: the difference between the SVD1 and the SVD2 procedure is that the SVD2 procedure constructs the noise data matrix N_k from a different frame than the speech data matrix U_k. The start of the speech frame is sample 8000, the start of the noise frame is sample 3000, and the length of the speech frame and the noise frame is 2000. The filter length N for the filters A_j is 10, 20 and 50.
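The NLMS update used in the adaptive stage can be sketched as follows. This is a generic single-channel NLMS noise canceller, not the exact implementation above: the unknown acoustic path, signal lengths and the 8-tap filter are toy assumptions chosen so the example runs quickly.

```python
import numpy as np

def nlms(noise_ref, primary, n_taps, mu, eps=1e-8):
    """NLMS adaptive noise canceller: predict the noise in `primary`
    from `noise_ref` and return the error (the cleaned output) and weights."""
    w = np.zeros(n_taps)
    e = np.zeros(len(primary))
    for k in range(n_taps - 1, len(primary)):
        x = noise_ref[k - n_taps + 1:k + 1][::-1]  # most recent sample first
        y = w @ x
        e[k] = primary[k] - y
        w += mu * e[k] * x / (x @ x + eps)         # normalized LMS update
    return e, w

rng = np.random.default_rng(4)
n = 5000
ref = rng.standard_normal(n)                       # noise reference channel
h = np.array([0.5, -0.3, 0.2])                     # toy unknown acoustic path
primary = np.convolve(ref, h, mode="full")[:n]     # noise seen by the primary channel
e, w = nlms(ref, primary, n_taps=8, mu=1.0)
```

With μ = 1 and a white reference the filter identifies the path quickly; smaller μ slows convergence but reduces the excess error, as the text notes.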
4.3 Comparison
We will compare the SNR of the enhanced signal ŝ(k) for the different algorithms.
This comparison will be made for different reverberation times T_60 of the room. A low
reverberation time means that multipath effects are not significant (because the walls
do not reflect much) and that the direct signal dominates; the noise signals arriving
at the microphone array are then highly correlated. A high reverberation time means
that multipath effects are significant (because the walls do reflect) and that the
direct signal no longer dominates; the noise signals arriving at the microphone array
are then largely uncorrelated (diffuse noise).
For different reverberation times T_60, figure 26 compares the performance of the
different beamforming techniques (delay-and-sum beamformer, Griffiths-Jim beamformer
and iterated Griffiths-Jim beamformer). As can be seen, for small reverberation
times (highly correlated noise) the Griffiths-Jim beamformer performs much better
than for high reverberation times (highly uncorrelated noise). This is to be expected,
because the Griffiths-Jim beamformer is designed for correlated noise, not for diffuse
noise (see section 4.1). As expected, when iterating the Griffiths-Jim beamformer,
the performance increases.
Figure 27 compares the performance of the delay-and-sum beamformer and the
Griffiths-Jim beamformer with SVD-based optimal filtering 1 (filter length N = 10, 20, 50).
Unlike the Griffiths-Jim beamformer, SVD-based optimal filtering 1 still performs well
for diffuse noise (high reverberation times). As can be seen, for all reverberation
times SVD-based optimal filtering 1 performs better than the Griffiths-Jim beamformer
if the filter length N is high enough (in this case N = 20 suffices). The higher the
filter length N, the better the performance.
Figures 28, 29 and 30 compare the performance of the SVD-based optimal filtering
techniques 1 and 2 for different filter lengths (N = 10, 20, 50). As can clearly be
seen, the performance of SVD-based optimal filtering 2 is always worse than the
performance of SVD-based optimal filtering 1. This difference in performance increases
as the filter length N increases. SVD-based optimal filtering 2 is the technique we
have to use in practice, since the noise data matrix N_k can only be constructed during
periods where no speech is present. A crucial question now arises: can SVD-based
optimal filtering 2 perform as well as SVD-based optimal filtering 1? The next section
answers this question by investigating the performance of SVD-based optimal filtering
as a function of the noise frame.
In this section we will investigate the dependence of the performance of the SVD-based
optimal filtering technique on the noise frame, when the noise frame is different from
the speech frame. We will investigate the dependence on the start and the length of
the noise frame. It will be shown that SVD-based optimal filtering 2 can perform as
well as SVD-based optimal filtering 1 if the noise frame is made long enough. In this
section we will only consider filter length N = 50, because the higher the filter
length, the larger the difference between SVD-based optimal filtering 1 and 2.
Figure 31 compares the performance of SVD-based optimal filtering 2 for different
lengths of the noise frame (L = 1000 ... 7000). The start of the noise frame is
sample 3000 and the filter length N is 50. As can be seen, the performance of
SVD-based optimal filtering 2 depends considerably on the length of the noise frame:
the larger the noise frame, the higher the performance. The reason is that a larger
noise frame yields a better estimate of the noise correlation matrix E{n_k n_k^T}.
The disadvantage is that by making the noise frame larger to obtain an acceptable
performance, we have to assume that the noise is sufficiently stationary.
Figure 32 compares the performance of SVD-based optimal filtering 2 for different
lengths and starting points of the noise frame. The length L of the noise frame is
Figure 26: SNR of noisy signal m_1(k) and SNR of enhanced signal ŝ(k) for the delay-and-sum beamformer and the Griffiths-Jim beamformer.
Figure 27: SNR of noisy signal m_1(k) and SNR of enhanced signal ŝ(k) for the delay-and-sum beamformer, the Griffiths-Jim beamformer and SVD-based optimal filtering technique 1 (N = 10, 20, 50).
Figure 28: SNR of noisy signal m_1(k) and SNR of enhanced signal ŝ(k) for the delay-and-sum beamformer, the Griffiths-Jim beamformer and SVD-based optimal filtering techniques 1 and 2 (N = 10).
Figure 29: SNR of noisy signal m_1(k) and SNR of enhanced signal ŝ(k) for the delay-and-sum beamformer, the Griffiths-Jim beamformer and SVD-based optimal filtering techniques 1 and 2 (N = 20).
Figure 30: SNR of noisy signal m_1(k) and SNR of enhanced signal ŝ(k) for the delay-and-sum beamformer, the Griffiths-Jim beamformer and SVD-based optimal filtering techniques 1 and 2 (N = 50).
varied from 1000 to 7000 and the start of the noise frame is varied from 2000 to 18000
(where possible). The reflection coefficient ρ of the walls of the room is 0.5. As can
be seen, the performance does not depend considerably on the starting point of the
noise frame. However, for some L (especially L = 2000) there is a noticeable peak
when the starting point of the noise frame is sample 8000, which is also the start of
the speech frame. As already indicated in the previous figure, the larger the noise
frame, the higher the performance.
Figure 33 compares the performance of the SVD-based filtering techniques 1 and 2 for
different lengths of the noise frame (L = 2000 and L = 7000). It can be seen that for
L = 7000 SVD-based optimal filtering 2 performs as well as SVD-based optimal filtering 1.
However, in this case the length of the noise frame (7000) is larger than the length
of the speech frame (2000). In real-time processing an exponentially decaying window
can be used, with different time constants for the speech frame and the noise frame.
The overall conclusion is that SVD-based optimal filtering 2 can perform as well as
SVD-based optimal filtering 1 if the noise frame is made large enough.
Figure 31: SNR of enhanced signal ŝ(k) for the SVD-based optimal filtering technique with different lengths of the noise frame (N = 50).
Figure 32: SNR of enhanced signal ŝ(k) for the SVD-based optimal filtering technique with different lengths and starting points of the noise frame (N = 50 and ρ = 0.5).
Figure 33: SNR of noisy signal m_1(k) and SNR of enhanced signal ŝ(k) for the delay-and-sum beamformer, the Griffiths-Jim beamformer and SVD-based optimal filtering techniques 1 and 2 (N = 50 and L = 2000, 7000).
5 Robustness issues
A very important issue for all multi-microphone filtering techniques is robustness.
We consider three major problems:
- source movement: the direction of the speech source is wrongly estimated. This is only important for the beamforming techniques, since the SVD-based optimal filtering technique needs no direction estimate of the speech source.
- microphone displacement: we assume a linear equi-spaced microphone array, but in fact the microphones are not equally spaced.
- microphone characteristics: we assume that the characteristics (gain, spatial directivity, frequency behaviour) of all microphones are equal, but in fact the characteristics differ. We will only consider a different gain for the microphones.
We will consider the same room configuration as used in section 4 and depicted in
figure 25. The number of microphones M is 5 and the filter length N of the filters is 50.
Consider the situation where the speech source impinges on the microphone array at
an angle θ ≠ 90°, while we assume that the speech source is in front of the microphone
array (θ = 90°). We will not consider multipath effects for the speech source (only
an interpolation filter), but we will still consider multipath effects for the noise
source in order to simulate the correlated/uncorrelated nature of the noise. We will
investigate the robustness of the delay-and-sum beamformer, the Griffiths-Jim
beamformer and SVD-based optimal filtering 1 in this situation.
Figure 34 shows the performance of the delay-and-sum beamformer for different
angles θ. Since we assume the speech source is in front of the microphone array
(θ = 90°), the delays τ_j are 0. As expected, the performance decreases as the angle θ
deviates from the nominal position θ = 90°.
Figure 35 shows the performance of the Griffiths-Jim beamformer for different angles θ.
For angles θ > 90° the performance decreases, while for angles θ < 90° the performance
increases! We think this (somewhat strange) behaviour can be explained by considering
the spatial directivity pattern of the Griffiths-Jim beamformer. The spatial
directivity pattern will certainly have a zero in the direction of the noise source
(in this case θ = 152.55°), such that the performance of the beamformer decreases in
the neighbourhood of this direction. However, it is possible that the spatial
directivity pattern has its maximum for an angle θ ≠ 90°, such that maximum performance
is obtained for this specific angle. Since we do not impose any additional constraints
on the shape of the spatial directivity pattern other than placing a zero in the
direction of the noise, we cannot predict its exact form. Therefore it is difficult
to predict the performance of the Griffiths-Jim beamformer for different angles θ,
since the performance depends on the particular configuration.
Figure 34: SNR of the enhanced signal ŝ(k) for the delay-and-sum beamformer for different angles θ.
Figure 35: SNR of the enhanced signal ŝ(k) for the Griffiths-Jim beamformer for different angles θ.
Figure 36: SNR of the enhanced signal ŝ(k) for SVD-based optimal filtering 1 for different angles θ.
Figure 37: SNR difference between SVD-based optimal filtering 1 and the Griffiths-Jim beamformer for different angles θ.
Figure 36 shows the performance of SVD-based optimal filtering 1 for different angles θ.
For some angles and reverberation times the performance decreases, while for other
angles and reverberation times it increases. No specific conclusion can be drawn from
this figure.
Figure 37 shows the difference in performance between SVD-based optimal filtering 1
and the Griffiths-Jim beamformer for different angles θ. SVD-based optimal filtering 1
would be more robust than the Griffiths-Jim beamformer if the difference in performance
increased the more the angle θ deviates from the nominal position θ = 90°. However,
this behaviour cannot be observed for all reverberation times in figure 37. Therefore
we have to conclude that for source movement, the SVD-based optimal filtering technique
is not more robust than the Griffiths-Jim beamformer. However, for all angles θ the
performance of SVD-based optimal filtering 1 is still better than the performance of
the Griffiths-Jim beamformer.
Consider the situation where the linear microphone array is not equi-spaced, while we
assume it is. We will consider a displacement of the second microphone in the
x-direction towards the first microphone. The nominal position of the second microphone
is x_mic2 = 2.05.
Figures 38 and 39 show the performance of the delay-and-sum beamformer and the
Griffiths-Jim beamformer for different microphone positions x_mic2. As expected, the
performance decreases as the microphone position x_mic2 deviates from the nominal
position x_mic2 = 2.05.
Figure 40 shows the performance of SVD-based optimal filtering 1 for different
microphone positions x_mic2. As can be seen, there is no significant difference in
performance when the microphone position x_mic2 deviates from the nominal position
x_mic2 = 2.05.
Figure 41 shows the difference in performance between SVD-based optimal filtering 1
and the Griffiths-Jim beamformer for different microphone positions x_mic2. Because
the difference in performance increases the more the microphone position x_mic2
deviates from the nominal position x_mic2 = 2.05, we can conclude that for microphone
displacement, the SVD-based optimal filtering technique is more robust than the
Griffiths-Jim beamformer.
Figure 38: SNR of the enhanced signal ŝ(k) for the delay-and-sum beamformer for different microphone positions x_mic2.
Figure 39: SNR of the enhanced signal ŝ(k) for the Griffiths-Jim beamformer for different microphone positions x_mic2.
Figure 40: SNR of the enhanced signal ŝ(k) for SVD-based optimal filtering 1 for different microphone positions x_mic2.
Figure 41: SNR difference between SVD-based optimal filtering 1 and the Griffiths-Jim beamformer for different microphone positions x_mic2.
Consider the situation where not all microphone characteristics are equal. We will
consider the specific situation where the second microphone has a gain γ ≠ 1, while
all other microphones have γ = 1.
Figures 42 and 43 show the performance of the delay-and-sum beamformer and the
Griffiths-Jim beamformer for different gains γ. As expected, the performance decreases
as the gain γ deviates from the nominal gain γ = 1.
Figure 44 shows the performance of SVD-based optimal filtering 1 for different gains γ.
Theoretically it can be proven that the SVD-based optimal filtering technique is
independent of the individual microphone gains.
Consider the speech data matrix U_k and the noise data matrix N_k as defined in equation (3.5) for multichannel time series filtering,

    U_k = [ U_k^1  U_k^2  ...  U_k^M ]
    N_k = [ N_k^1  N_k^2  ...  N_k^M ],                                            (5.1)

where M is the number of microphones. If we assume that the signal of each microphone j is multiplied by a gain α_j, then the modified speech data matrix U'_k becomes

    U'_k = [ α_1 U_k^1  α_2 U_k^2  ...  α_M U_k^M ]
         = U_k diag( Γ_1, Γ_2, ..., Γ_M ) = U_k Γ,                                 (5.2)

with Γ ∈ R^{MN×MN} and Γ_j ∈ R^{N×N},

    Γ_j = α_j I_N,                                                                 (5.3)

with N the filter length of the filters A_j. The modified noise data matrix N'_k is similarly defined as

    N'_k = [ α_1 N_k^1  α_2 N_k^2  ...  α_M N_k^M ] = N_k Γ.                       (5.4)

The modified speech and noise correlation matrices Φ'_uu and Φ'_nn then are

    Φ'_uu = Γ Φ_uu Γ,    Φ'_nn = Γ Φ_nn Γ,                                         (5.5)

such that the modified optimal filter becomes

    W'_WF = Γ^{-1} W_WF Γ,                                                         (5.6)

and the modified enhanced signal

    ŝ'_k = U'_k W'_WF = U_k Γ Γ^{-1} W_WF Γ = U_k W_WF Γ = ŝ_k Γ,                  (5.7)
such that any column ŝ'_k(:,i) of ŝ'_k is just a scaled version of ŝ_k(:,i), which means that ŝ'_k(:,i) and ŝ_k(:,i) have the same signal-to-noise ratio.
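The gain-invariance argument can also be checked numerically. The sketch below is not part of the report: the correlation matrices are arbitrary positive-definite stand-ins for the speech and noise correlation matrices, and the filter is computed with the common Wiener-filter form W_WF = (Φ_uu + Φ_nn)^{-1} Φ_uu, which is assumed here. It verifies that scaling the channels by Γ transforms the optimal filter as W'_WF = Γ^{-1} W_WF Γ, as stated in equation (5.6):

```python
import numpy as np

rng = np.random.default_rng(0)
M, Nf = 3, 4          # microphones, filter length per channel
dim = M * Nf

# Arbitrary symmetric positive-definite "speech" and "noise" correlation matrices
A = rng.standard_normal((dim, dim)); Phi_uu = A @ A.T
B = rng.standard_normal((dim, dim)); Phi_nn = B @ B.T + dim * np.eye(dim)

def wiener(Phi_uu, Phi_nn):
    # Assumed Wiener-filter form: W_WF = (Phi_uu + Phi_nn)^{-1} Phi_uu
    return np.linalg.solve(Phi_uu + Phi_nn, Phi_uu)

# Per-microphone gains alpha_j, expanded to Gamma = diag(alpha_j I_N)
alpha = np.array([1.0, 0.5, 2.0])
Gamma = np.kron(np.diag(alpha), np.eye(Nf))

W  = wiener(Phi_uu, Phi_nn)
W2 = wiener(Gamma @ Phi_uu @ Gamma, Gamma @ Phi_nn @ Gamma)

# Scaled channels give W' = Gamma^{-1} W Gamma, so the output SNR is unchanged
assert np.allclose(W2, np.linalg.inv(Gamma) @ W @ Gamma)
```

The check passes for any invertible Γ, since (Γ X Γ)^{-1} = Γ^{-1} X^{-1} Γ^{-1} cancels the gains inside the matrix inverse.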
Figure 45 shows the difference in performance between the SVD-based optimal filtering 1 and the Griffiths-Jim beamformer for different gains α. Because the difference in performance increases the more the gain α deviates from the nominal gain α = 1, we can conclude that with respect to microphone amplification the SVD-based optimal filtering technique is more robust than the Griffiths-Jim beamformer.
5.4 Conclusion
Taking into account that in real life the algorithm used has to be robust against a combination of all three deviations (source movement, microphone displacement and microphone characteristics), we can conclude that the SVD-based optimal filtering technique is more robust than standard beamforming techniques.
[Plot: SNR (dB) versus Reverberation Time (200-2000 ms), one curve per gain α = 1, 0.5, 1.5, 2, 2.5, 3, 4, 5]
Figure 42: SNR of the enhanced signal s^(k) for the delay-and-sum beamformer for different gains α
[Plot: SNR (dB) versus Reverberation Time (200-2000 ms), one curve per gain α = 1, 0.5, 1.5, 2, 2.5, 3, 4, 5]
Figure 43: SNR of the enhanced signal s^(k) for the Griffiths-Jim beamformer for different gains α
[Plot: SNR (dB) versus Reverberation Time (200-2000 ms), one curve per gain α = 1, 0.5, 1.5, 2, 2.5, 3, 4, 5]
Figure 44: SNR of the enhanced signal s^(k) for the SVD-based optimal filtering 1 for different gains α
[Plot "Difference SVD1 (50) and GRJ (5chan) SNR for gain mic2": SNR difference (dB) versus Reverberation Time (200-2000 ms), one curve per gain α = 1, 0.5, 1.5, 2, 2.5, 3, 4, 5]
Figure 45: SNR difference between the SVD-based optimal filtering 1 and the Griffiths-Jim beamformer for different gains α
6 Conclusion
In this report we have described a class of SVD-based signal estimation procedures, which amount to a specific optimal filtering problem for the case where the so-called `desired response' signal cannot be observed. It is shown that this optimal filter can be written as a function of the generalized singular vectors and singular values of a so-called speech and noise data matrix. A number of simple symmetry properties of the optimal filter are derived, which are valid for the white noise case as well as for the coloured noise case. Also the averaging step of the standard one-microphone SVD-based noise reduction techniques is investigated, leading to serious doubts about the necessity of this averaging step. When applying the SVD-based optimal filtering technique to multiple channels, a number of additional symmetry properties can be derived, depending on the structure of the noise covariance matrix.
When this SVD-based optimal filtering technique is applied to multi-microphone noise reduction in speech, it is shown that this technique exhibits beamforming properties. When considering spatio-temporal white noise on all microphones, it is shown that the directivity pattern of the SVD-based optimal filter is focused towards the speech source. When considering a localized noise source (and no multipath propagation), it is shown that a zero is steered towards this noise source.
When we further compare the performance of the SVD-based optimal filtering technique with standard beamforming algorithms, it is shown by simulations that for highly correlated noise sources the SVD-based optimal filtering technique performs as well as adaptive Griffiths-Jim beamformers, and that for less correlated noise sources it performs even better. It is also noted that the length of the noise frame plays an important role with regard to the performance of the optimal filtering technique.
Finally, it is shown by simulations that the SVD-based optimal filtering technique is more robust to environmental changes, such as source movement, microphone displacement and microphone amplification, than standard beamforming techniques.
Acknowledgments
Simon Doclo is a Research Assistant supported by the I.W.T. (Flemish Institute for Scientific and Technological Research in Industry). Marc Moonen is a Research Associate with the F.W.O.-Vlaanderen (Fund for Scientific Research - Flanders). This research work was carried out at the ESAT laboratory of the Katholieke Universiteit Leuven, in the framework of the IT-project Multimicrophone Signal Enhancement Techniques for handsfree telephony and voice controlled systems (MUSETTE) (AUT/970517/MUSETTE) of the I.W.T. (Flemish Institute for Scientific and Technological Research in Industry), and was partially sponsored by Philips-ITCL. The scientific responsibility is assumed by its authors.
Consider the vectors w, u, d ∈ R^N and the matrices A, W ∈ R^{N×N}:

    w = [w_1 w_2 ... w_N]^T,   u = [u_1 u_2 ... u_N]^T,   d = [d_1 d_2 ... d_N]^T,     (A.1)

    A = [a_ij],   W = [w_ij],   i, j = 1 ... N.                                        (A.2)

If J = w^T u, then

    ∂J/∂w = u.                                                                         (A.3)

Proof:

    J = w^T u = sum_{i=1}^{N} u_i w_i   ⇒   ∂J/∂w_k = u_k,

so that

    ∂J/∂w = [∂J/∂w_1  ∂J/∂w_2  ...  ∂J/∂w_N]^T = [u_1  u_2  ...  u_N]^T = u.           (A.4)

If J = w^T A w, then

    ∂J/∂w = (A + A^T) w,                                                               (A.5)

and in particular, for J = w^T u u^T w,

    ∂J/∂w = 2 u u^T w.                                                                 (A.6)

Proof:

    J = w^T A w = sum_{i=1}^{N} sum_{j=1}^{N} w_i a_ij w_j,

    ∂J/∂w_k = ∂/∂w_k ( w_k a_kk w_k + sum_{j=1, j≠k}^{N} w_k a_kj w_j + sum_{i=1, i≠k}^{N} w_i a_ik w_k )
            = sum_{j=1}^{N} a_kj w_j + sum_{i=1}^{N} a_ik w_i = A(k,:) w + A(:,k)^T w,

so that, stacking the rows A(k,:) and columns A(:,k),

    ∂J/∂w = (A + A^T) w.                                                               (A.7)

For symmetric A (and in particular for A = u u^T), this reduces to

    ∂J/∂w = 2 A w.                                                                     (A.8)

If J = u^T W d, then

    ∂J/∂W = u d^T.                                                                     (A.9)

Proof:

    J = u^T W d = sum_{i=1}^{N} sum_{j=1}^{N} u_i w_ij d_j   ⇒   ∂J/∂w_kl = u_k d_l,

so that

    ∂J/∂W = [∂J/∂w_kl]_{k,l=1...N} = [u_k d_l]_{k,l=1...N} = u d^T.                    (A.10)

If J = u^T W W^T u, then

    ∂J/∂W = 2 u u^T W.                                                                 (A.11)

Proof:

    J = u^T W W^T u = sum_{i=1}^{N} ( sum_{j=1}^{N} u_j w_ji ) ( sum_{k=1}^{N} w_ki u_k ),

    ∂J/∂w_pq = ∂/∂w_pq ( u_p w_pq w_pq u_p + sum_{j=1, j≠p}^{N} u_j w_jq w_pq u_p + sum_{k=1, k≠p}^{N} u_p w_pq w_kq u_k )
             = 2 u_p sum_{j=1}^{N} u_j w_jq = 2 u_p u^T W(:,q),                        (A.12)

so that

    ∂J/∂W = 2 [u_p u^T W(:,q)]_{p,q=1...N} = 2 u u^T W.                                (A.13)
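These matrix-gradient identities lend themselves to a quick finite-difference check. The following sketch is illustrative only (not from the report); it verifies ∂(w^T A w)/∂w = (A + A^T) w and ∂(u^T W W^T u)/∂W = 2 u u^T W numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
A = rng.standard_normal((N, N))
W = rng.standard_normal((N, N))
w = rng.standard_normal(N)
u = rng.standard_normal(N)

def num_grad(f, X, eps=1e-6):
    # Central finite-difference gradient of scalar f at X (vector or matrix)
    G = np.zeros_like(X)
    it = np.nditer(X, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

# d(w^T A w)/dw = (A + A^T) w
g1 = num_grad(lambda x: x @ A @ x, w)
assert np.allclose(g1, (A + A.T) @ w, atol=1e-5)

# d(u^T W W^T u)/dW = 2 u u^T W
g2 = num_grad(lambda X: u @ X @ X.T @ u, W)
assert np.allclose(g2, 2 * np.outer(u, u) @ W, atol=1e-5)
```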
Consider the matrix A ∈ R^{N×N}, the vector v ∈ R^N and the block matrix B ∈ R^{Np×Np}:

    A = [a_ij] = [a_11 a_12 ... a_1N; a_21 a_22 ... a_2N; ...; a_N1 a_N2 ... a_NN],
    v = [v_i] = [v_1 v_2 ... v_N]^T,                                                   (B.1)

    B = [b_ij] = [b_11 b_12 ... b_1N; b_21 b_22 ... b_2N; ...; b_N1 b_N2 ... b_NN],
    with b_ij ∈ R^{p×p}.                                                               (B.2)

Definition 1  A is called symmetric iff (if and only if) A is symmetric along its main diagonal:

    A is symmetric  ⇔  A = A^T.                                                        (B.3)

J is the N×N exchange matrix, with ones on the anti-diagonal and zeros elsewhere:

    J = [0 ... 0 1; 0 ... 1 0; ...; 1 0 ... 0].                                        (B.4)

JA reverses the rows of A, AJ reverses the columns of A and JAJ reverses both the rows and the columns of A. J is an orthogonal and symmetric matrix, hence J = J^T, JJ = I and J^{-1} = J.

    A is centro-symmetric  ⇔  JA = A^T J.                                              (B.5)

    A = A^T and JA = A^T J   ⇒   JAJ = A.                                              (B.6)

From the property JAJ = A, nothing can be concluded about the symmetry nor the centro-symmetry of the matrix A.

Example:

    A = [1 2 3; 4 5 4; 3 2 1]

If JAJ = A, this simply means that the i-th row/column of A is equal to the (N - i + 1)-th row/column of A in reverse. For N odd, this means the middle row/column of A is symmetric.
A general Toeplitz matrix has the form

    A = [a_11     a_12     a_13     ...  a_1N    ;
         a_21     a_11     a_12     ...  a_1,N-1 ;
         a_31     a_21     a_11     ...  a_1,N-2 ;
         ...                                     ;
         a_N1     a_N-1,1  a_N-2,1  ...  a_11    ],                                    (B.7)

and a symmetric Toeplitz matrix the form

    A = [a_11  a_12  a_13  ...  a_1N    ;
         a_12  a_11  a_12  ...  a_1,N-1 ;
         a_13  a_12  a_11  ...  a_1,N-2 ;
         ...                            ;
         a_1N  ...   a_13  a_12  a_11   ].                                             (B.8)

As can be readily verified, all symmetric Toeplitz matrices are double-symmetric.

S is the Np×Np block exchange matrix, with p×p identity matrices I_p on the block anti-diagonal and zeros elsewhere:

    S = [0    ...  0    I_p;
         0    ...  I_p  0  ;
         ...               ;
         I_p  0    ...  0  ].                                                          (B.9)
Definition 9  B is called block-symmetric iff the p×p block matrices b_ij are symmetric along the main diagonal of the matrix B (b_ij = b_ji):

    B = [b_ij] = [b_11  b_12  b_13  ...  b_1N ;
                  b_12  b_22  b_23  ...  b_2N ;
                  b_13  b_23  b_33  ...  b_3N ;
                  ...                         ].                                       (B.10)

A block-Toeplitz matrix has the form

    B = [b_ij] = [b_11     b_12     b_13     ...  b_1N    ;
                  b_21     b_11     b_12     ...  b_1,N-1 ;
                  b_31     b_21     b_11     ...  b_1,N-2 ;
                  ...                                     ;
                  b_N1     b_N-1,1  b_N-2,1  ...  b_11    ].                           (B.11)

The result of theorem 1 is still true if the multiplicity of some eigenvalues is larger than 1, but the proof is different. However, in some cases, A will then have eigenvectors which are a linear combination of symmetric and skew-symmetric vectors, and hence, are neither symmetric nor skew-symmetric [17].
If SBS = B, then also

    B^T = (SBS)^T = S^T B^T S^T = S B^T S,
    B^{-1} = (SBS)^{-1} = S^{-1} B^{-1} S^{-1} = S B^{-1} S.

Lemma 4  Consider B ∈ R^{Np×Np} and C ∈ R^{Np×Np}. If SBS = B and SCS = C, then S(B + C)S = B + C and S(BC)S = BC.

The sum of two block-symmetric matrices B and C is also block-symmetric. The sum of two block-Toeplitz matrices B and C is also block-Toeplitz.

The properties proven in theorems 1 and 2 and lemmas 1, 2, 3 and 4 hold for any transformation matrix T and data matrix A which satisfy

    TAT = A,   T^T = T,   T^{-1} = T.                                                  (B.16)
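The symmetry notions of this appendix are easy to check numerically. The sketch below is illustrative only (not from the report); it builds a symmetric Toeplitz matrix and verifies that it is both symmetric and centro-symmetric, hence double-symmetric:

```python
import numpy as np

N = 5
c = np.arange(1.0, N + 1)  # first row/column of the Toeplitz matrix
# Symmetric Toeplitz matrix: A[i, j] = c[|i - j|]
A = np.array([[c[abs(i - j)] for j in range(N)] for i in range(N)])

# Exchange matrix J: ones on the anti-diagonal
J = np.fliplr(np.eye(N))

# Symmetric:           A = A^T
assert np.allclose(A, A.T)
# Centro-symmetric:    J A = A^T J
assert np.allclose(J @ A, A.T @ J)
# Hence double-symmetric: J A J = A
assert np.allclose(J @ A @ J, A)
```

The same check with S = kron(J, I_p) in place of J illustrates the block versions of these properties.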
References
[1] M. Dendrinos, S. Bakamidis, and G. Carayannis, "Speech enhancement from noise: A regenerative approach," Speech Communication, vol. 10, pp. 45-57, Feb. 1991.
[2] S. H. Jensen, P. C. Hansen, S. D. Hansen, and J. A. Sørensen, "Reduction of Broad-Band Noise in Speech by Truncated QSVD," IEEE Trans. Speech, Audio Processing, vol. 3, pp. 439-448, Nov. 1995.
[3] P. S. K. Hansen, Signal Subspace Methods for Speech Enhancement. PhD thesis, Technical University of Denmark, Lyngby, Denmark, Sept. 1997.
[4] P. C. Hansen and S. H. Jensen, "FIR Filter Representations of Reduced-Rank Noise Reduction," IEEE Trans. Signal Processing, vol. 46, pp. 1737-1741, June 1998.
[5] I. Dologlou, J.-C. Pesquet, and J. Skowronski, "Projection-based rank reduction algorithms for multichannel modelling and image compression," Signal Processing, vol. 48, pp. 97-109, Jan. 1996.
[6] B. D. Van Veen and K. M. Buckley, "Beamforming: A Versatile Approach to Spatial Filtering," IEEE ASSP Magazine, pp. 4-24, Apr. 1988.
[7] O. L. Frost III, "An Algorithm for Linearly Constrained Adaptive Array Processing," Proc. IEEE, vol. 60, pp. 926-935, Aug. 1972.
[8] L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. Antennas Propag., vol. 30, pp. 27-34, Jan. 1982.
[9] K. M. Buckley, "Broad-Band Beamforming and the Generalized Sidelobe Canceller," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 34, pp. 1322-1323, Oct. 1986.
[10] S. Nordholm, I. Claesson, and B. Bengtsson, "Adaptive Array Noise Suppression of Handsfree Speaker Input in Cars," IEEE Trans. Vehicular Technology, vol. 42, pp. 514-518, Nov. 1993.
[11] S. Haykin, Adaptive Filter Theory. Information and system sciences series, Prentice Hall, 3rd ed., 1996.
[12] F. Xie and S. Van Gerven, "Comparative study of 3 speech detection methods," Tech. Rep. MI2-SPCH-95-8, ESAT, K.U.Leuven, Belgium, Oct. 1995.
[13] S. Doclo and E. De Clippel, "Verbetering van spraakverstaan bij hoortoestellen via adaptieve ruisonderdrukking in reële tijd" (Improvement of speech intelligibility in hearing aids through real-time adaptive noise reduction), Master's thesis, K.U.Leuven, Belgium, 1997. UDC: 681.5.017(043).
[14] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD: Johns Hopkins University Press, 3rd ed., 1996.