
Bell Laboratories

subject: A Fast Normalized Cross-Correlation DTD Combined with a Robust Multi-Channel Fast Recursive Least Squares Algorithm
Work Project No. MA30033002
File Case 38794-43

date: February 18, 2000

from: Tomas Gaensler
      Dept. 10009657
      MH 2D-514
      (908) 582-7287
      gaensler@bell-labs.com

      Jacob Benesty
      Dept. 10009657
      MH 2D-518
      (908) 582-7643
      jbenesty@bell-labs.com

10009657-000218-02TM

TECHNICAL MEMORANDUM

1 Introduction
Ideally, acoustic echo cancelers (AECs) remove undesired echoes that result from coupling between
the loudspeaker and the microphone used in full-duplex hands-free telecommunication systems.
Figure 1 shows a diagram of a single-channel AEC. The far-end speech signal x(n) goes through
the echo path represented by a filter h(n) to produce the echo, y'(n), which is picked up by the
microphone together with the near-end talker signal v(n) and ambient noise w(n). The microphone
signal is denoted y(n).

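Figure 1: Block diagram of a basic AEC setup and double-talk detector. The diagram shows the far-end signal x(n), the DTD, the adaptive algorithm with the model filter ĥ(n), the echo path h(n), the echo estimate ŷ(n), the echo y'(n), the near-end disturbance v(n) + w(n), the microphone signal y(n), and the error e(n).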

The echo path is modeled by an adaptive FIR filter, ĥ(n), which subtracts a replica of the echo
from the return channel and thereby achieves cancellation. This may look like a straightforward
system identification task for the adaptive filter; however, most conversations contain
double-talk situations that make identification much more problematic than it might appear at
first glance. Double-talk occurs when the speech of the two talkers arrives simultaneously at the
canceler, i.e. x(n) ≠ 0 and v(n) ≠ 0. The situation with near-end talk only, x(n) = 0 and v(n) ≠ 0,
can be regarded as an “easy-to-detect” double-talk case. This will be evident from the analysis shown
in the next section.

Lucent Technologies – PROPRIETARY
Use pursuant to Company Instructions

In the double-talk situation, the near-end speech acts as a strong uncorrelated noise to the adap-
tive algorithm. The disturbing near-end speech may cause the adaptive filter to diverge. Hence,
annoying audible echo will pass through to the far-end. The common way to alleviate this problem
is to slow down or completely halt the filter adaptation when presence of near-end speech is de-
tected. This is the very important role of the Double-Talk Detector (DTD). The basic double-talk
detection scheme computes a detection statistic, ξ, and compares it with a preset threshold, T .
An indisputable fact, however, is that all double-talk detectors have a probability of not detecting
double-talk when it occurs, i.e. they have a nonzero probability of miss, Pm ≠ 0. Requiring the probability
of miss to be smaller will inevitably increase the probability of false alarm and hence slow down the
convergence rate of the acoustic echo canceler. For some DTDs, requiring a small probability of
miss increases the sensitivity to echo path changes. As a consequence, no matter what DTD is used,
undetected near-end speech will perturb the adaptive algorithm from time to time. In practice,
what has been done in the past is: first the DTD is designed to be “as good as” one can afford
and then, the adaptive algorithm is slowed down so that it copes with the detection errors made
by the DTD. This is natural to do since if the adaptive algorithm is very fast, it can react faster
to situation changes (e.g. double-talk) than the DTD, and thus diverge. However, this approach
severely penalizes the convergence rate of the AEC in the ideal situation, i.e. when far-end, but no
near-end, talk is present.
In the light of these facts, we look at adaptive algorithms that can handle at least a small amount
of double-talk without diverging. This is achieved by combining the following three concepts which
will be studied throughout this memorandum:

• Adaptive algorithms,
• Robust statistics,
• Double-talk detection schemes.

A large number of DTD schemes, i.e. detection statistics, have been proposed since the echo
canceler was invented, [1]. The Geigel algorithm [2] has proven successful in network echo cancelers;
however, it does not always provide reliable performance when used in the acoustic situation. This
is because its assumption of a minimum echo path attenuation may not be valid in the acoustic
case. Other methods based on cross-correlation and coherence [3, 4, 5, 6] have been studied and
appear to be more appropriate for AEC applications. A DTD based on multi-statistic testing in
combination with modeling of the echo path by two filters was proposed in [7].
In this technical memorandum, we will present a method for handling double-talk in which a
robust fast recursive least squares algorithm (FRLS) and the normalized cross-correlation (NCC)
DTD, [6], are used. The NCC DTD is developed into a fast version, called FNCC, by reusing
computational results from the FRLS algorithm. The major advantage of this detector is that it is
not dependent on the actual echo path attenuation the way, e.g., the Geigel DTD is. This makes it
much more suitable for the acoustic application. This NCC detector is also shown to be a Neyman-
Pearson detector, [8], under certain assumptions. Furthermore, as a by-product of the fast NCC
derivation, we show that the FNCC DTD uses the same calculation structure, but not test statistics,
as the two-path method proposed in [7]. Hence, our derivation theoretically justifies the two-path
structure. Since the FNCC detector relies on the fast recursive least squares algorithm (FRLS),
robust multi-channel versions of the FNCC algorithm will also be presented.
This memorandum is organized as follows. Section 2 presents the generic double-talk detector and
describes the properties of DTDs and adaptive algorithms. We devote Section 3 to robust recursive
least squares algorithms, since the combination of robust algorithms and double-talk detectors has
been found useful. Furthermore, recursive least squares algorithms are especially suited for
multi-channel acoustic echo cancellation. Section 4 reviews the normalized cross-correlation DTD.
A fast version of this algorithm (FNCC) is derived and a link is established between this algorithm
and the two-path structure. The proposed FNCC algorithm is evaluated in Section 5, where we
also present simulations showing the effectiveness of the DTD and an FRLS-based acoustic echo
canceler. Section 6 ends this memorandum with conclusions and remarks for further studies.

2 The Generic Double-Talk Detector Scheme


All types of double-talk detectors basically operate in the same manner. Thus, the general procedure
for handling double-talk is described by the following four steps, [9]:

1. A detection statistic ξ is formed using available signals, e.g. x, y, e, etc., and the estimated
filter coefficients ĥ.
2. The detection statistic ξ is compared to a preset threshold T , and double-talk is declared if
ξ < T.
3. Once double-talk is detected, it is declared for a minimum period of time Thold . During periods
for which double-talk is declared, the filter adaptation is disabled.
4. If ξ ≥ T continuously for an interval Thold , the filter resumes adaptation. The comparison of
ξ to T continues, and double-talk is declared again when ξ < T .

The hold time Thold in Step 3 and Step 4 is necessary to suppress detection dropouts due to the
noisy behavior of the detection statistic. Although there are some possible variations, most of the
DTD algorithms have this basic form and differ only in the detection statistic.
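
The four steps above can be sketched as follows. This is only a minimal illustration, not the implementation used in this memorandum: the function name `run_dtd` is hypothetical, the hold time is assumed to be expressed in samples, and adaptation is assumed to resume at the sample where ξ ≥ T has held for `hold` consecutive samples.

```python
def run_dtd(xi, threshold, hold):
    """Return per-sample adaptation flags (True = adapt) for a generic DTD.

    xi        : sequence of detection-statistic values, one per sample
    threshold : preset threshold T (double-talk is declared when xi < T)
    hold      : hold time T_hold, expressed in samples (an assumption)
    """
    adapt = []
    hold_counter = 0  # samples of hold time still to elapse
    for value in xi:
        if value < threshold:
            # Steps 2 and 3: declare double-talk and (re)start the hold period.
            hold_counter = hold
        elif hold_counter > 0:
            # Step 4: xi >= T, but the hold interval has not yet elapsed.
            hold_counter -= 1
        # Adaptation is enabled only when no hold time remains.
        adapt.append(hold_counter == 0)
    return adapt
```

Any detection statistic (Geigel, NCC, and so on) can be fed in as `xi`; only the computation of ξ differs between detectors.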
An “optimum” decision variable ξ for double-talk detection should behave as follows:

• If v(n) = 0 (double-talk is not present), ξ ≥ T,
• If v(n) ≠ 0 (double-talk is present), ξ < T,
• ξ is insensitive to echo path variations.

The threshold T must be a constant, independent of the data. Moreover, it is desirable that the
decisions are made without introducing any delay (or at least minimizing the introduced delay) in
the updating of the model filter. Delayed decisions adversely affect the performance of the AEC.

2.1 Properties of DTDs and Adaptive Algorithms

Important issues that have to be addressed when designing a DTD and an echo canceler are:

(i) Under what circumstances does double-talk disturb the adaptive filter?
(ii) What characterizes “good” double-talk detectors?
(iii) In what situation does the DTD operate well/poorly?


The performance of an AEC is often evaluated through its mean square error (MSE) or
through its misalignment $\varepsilon = \|\mathbf{h} - \hat{\mathbf{h}}\|_2 / \|\mathbf{h}\|_2$ in different situations. The expected misalignment
is given by the formula

$$E\{\varepsilon^2(n)\} \approx \frac{(1-\lambda)L}{\mathrm{ENR}}, \qquad (1)$$

where E{·} denotes mathematical expectation, λ is a constant parameter of the recursive least
squares adaptive algorithm and L is the length of the adaptive filter. See Appendix A for the derivation
of this formula. That is, for given λ and L, the level of convergence (misalignment performance) is governed
by the echo-to-near-end speech power ratio, $\mathrm{ENR} = \sigma_{y'}^2/\sigma_v^2$ (see footnote 1). Regarding question (i) stated above:
assuming equally strong talkers, a high echo path attenuation makes the
adaptive filter very sensitive to double-talk, whereas a low attenuation means lower
sensitivity. Consequently, the ENR is important to consider when setting the parameters of the
DTD.
A DTD can partially be characterized by using general detection theory [5, 6, 9]. It is possible
to objectively evaluate and compare the performance of different DTDs by means of the probability
of correct detection, Pd, and the probability of false alarm, Pf (a false alarm occurs when double-talk is
declared although it is not actually present). Moreover, the theory justifies a method to select the
threshold T for the detection statistic, ξ. As far as (ii) is concerned, we can state that a well
designed DTD has a low probability of false alarm while the probability of detection is set to an
acceptable level. This level depends on the sensitivity of the adaptive algorithm.
Item (iii) can now be addressed in terms of the probability of not detecting (missing) double-talk when
it is present, Pm = 1 − Pd. For the NCC DTD described in Section 4, this probability is given by

$$P_m = P\{\xi \ge T \mid \text{double-talk is present}\} = P\left\{\xi_N \ge T\sqrt{1 + \frac{1}{\mathrm{ENR}}}\right\}, \qquad (2)$$

where P{·} denotes “probability of” and ξN is the normalized test statistic. Pm is a monotonically
decreasing function of $T\sqrt{1 + 1/\mathrm{ENR}}$. This means that as the ENR decreases, so does the probability of
miss Pm. Details of the derivation of this formula are found in Appendix B.
To summarize, both the performance of the adaptive filter and the DTD are governed by the
ENR. However, while the performance of the adaptive filter degrades with a decreasing ENR, the
performance of the DTD improves, so these properties somewhat balance each other. Moreover, the
DTD has to be faster, i.e. gather reliable statistics faster, than the adaptive filter. Hence, faster
convergence of the AEC results in higher requirements of the DTD. As a consequence it is important
to evaluate the echo canceler and the DTD jointly as a system.
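
As a rough numerical illustration of how the ENR governs the adaptive filter through (1), the following sketch evaluates the expected misalignment in dB. The function name and the example values (L = 1024 and λ = 1 − 1/(3L)) are assumptions chosen only for illustration; they are not settings taken from this memorandum.

```python
import math

def expected_misalignment_db(lam, filter_len, enr_db):
    """Evaluate (1), E{eps^2} ~ (1 - lam) * L / ENR, and return it in dB."""
    enr = 10.0 ** (enr_db / 10.0)   # ENR converted from dB to a power ratio
    return 10.0 * math.log10((1.0 - lam) * filter_len / enr)

# Hypothetical example values: L = 1024 taps, lam = 1 - 1/(3L).
L_taps = 1024
lam = 1.0 - 1.0 / (3 * L_taps)
level_equal_talkers = expected_misalignment_db(lam, L_taps, 0.0)    # ENR = 0 dB
level_weak_near_end = expected_misalignment_db(lam, L_taps, 20.0)   # ENR = 20 dB
```

Under these assumptions, each 10 dB reduction of the ENR raises the expected misalignment level by 10 dB, which is the degradation of the adaptive filter referred to above.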

Footnote 1: In the denominator of this ratio, we should also add the power of the ambient noise w(n). However, in
most situations we can neglect w(n) in comparison to v(n), and it is more convenient to define the echo-to-ambient noise
ratio separately, which is done in Section 5.

2.2 The Geigel DTD

In this section, we give a brief description of a very simple DTD algorithm due to A. A. Geigel [2],
since it will be used for comparisons in this memorandum. The Geigel DTD declares presence of near-end
speech whenever

$$\xi^{(g)} = \frac{\max\{|x(n)|, \ldots, |x(n - L_g + 1)|\}}{|y(n)|} < T, \qquad (3)$$

where Lg and T are suitably chosen constants. This detection scheme is based on a waveform level
comparison between the microphone signal y(n) and the far-end speech x(n), assuming that the near-end
speech v(n) at the microphone will typically be at the same level as, or stronger than, the echo
y'(n). The maximum of the Lg most recent samples of x(n) is taken for the comparison because of
the unknown delay in the echo path. The threshold T compensates for the gain of the echo path
response h, and is often set to 2 for network echo cancelers because the hybrid (echo path) loss is
typically about 6 dB or more. For an AEC, however, it is not easy to set a universal threshold that
works reliably in all the various situations, because the loss through the acoustic echo path can vary
greatly depending on many factors. For Lg, one easy choice is to set it equal to the adaptive
filter length L, since we can assume that the echo path is covered by this length.
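
The Geigel test (3) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `geigel_dtd` is hypothetical, the hold logic of Section 2 is omitted, and the comparison is written as max|x| < T·|y(n)| so that samples with y(n) = 0 need no special handling.

```python
import numpy as np

def geigel_dtd(x, y, window, threshold):
    """Return a boolean array, True where test (3) declares near-end speech.

    x, y      : far-end and microphone signals (1-D, equal length)
    window    : L_g, number of most recent far-end samples searched
    threshold : T, e.g. 2 for a hybrid with about 6 dB of loss
    """
    x = np.abs(np.asarray(x, dtype=float))
    y = np.abs(np.asarray(y, dtype=float))
    detected = np.zeros(len(x), dtype=bool)
    for n in range(len(x)):
        # Largest far-end magnitude among the L_g most recent samples.
        x_max = x[max(0, n - window + 1): n + 1].max()
        # Equivalent to x_max / |y(n)| < T, without dividing by zero.
        detected[n] = x_max < threshold * y[n]
    return detected
```

With T = 2, near-end speech is declared as soon as the microphone magnitude exceeds half of the recent far-end peak, which corresponds to the 6 dB loss assumption discussed above.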

3 Robust Recursive Least Squares Algorithms


A robust approach to handling double-talk has been studied and proven very successful in the network
echo canceler case [10], where the combination of outlier-resistant adaptive algorithms and a Geigel
DTD was studied. For the acoustic case, it is desirable to use a more appropriate DTD, e.g., the
NCC DTD, and combine it with a robust adaptive algorithm.
Our objective is to apply the robust recursive least squares algorithm to multi-channel acoustic
echo cancellation. We will therefore derive a robust multi-channel fast recursive least squares
algorithm, robust FRLS. For simplicity, this is done in steps, first finding a robust
single-channel recursive least squares algorithm. By robustness we mean insensitivity to small
deviations of the real distribution from the assumed model distribution [11].
A recursive least squares algorithm can be regarded as a maximum likelihood estimator in which
the error distribution has been assumed to be Gaussian. The performance of an algorithm optimized
for Gaussian noise could be very poor in, e.g., the AEC application because of the unexpected number
of large noise values (double-talk disturbances) that are not modeled by the Gaussian law. Thus,
when deriving algorithms for the AEC application, the probability density function (PDF) of the
noise model should be a long-tailed PDF in order to take the noise bursts due to DTD failures into
account; see [11, 12] for the definition of a long-tailed PDF. Hence, we are interested in distributional
robustness, since the shape of the true underlying distribution deviates from the assumed model,
which usually is the Gaussian law.
As explained in [11], a robust procedure should achieve the following:

• It should have a reasonably good efficiency at the assumed model.


• It should be robust in the sense that small deviations from the model assumptions should
impair the performance only slightly.
• Somewhat larger deviations from the model should not cause a catastrophe.

In this study, we assume that the PDF, p(z), of the disturbing noise has a tail which is heavier
than that of the Gaussian density, pG(z). By this we mean that p(z) > pG(z), ∀ |z| ≥ |z0|. Furthermore,
p(z) should be chosen such that the derivative of ln[p(z)] is bounded when |z| → ∞. Although the
difference between p(z) and the Gaussian PDF is small, whether the derivative of ln[p(z)] is bounded
or not makes all the difference between a robust and a non-robust approach.
In the following, we show how to derive a robust recursive least squares adaptive algorithm (see footnote 2)
from a PDF, p(z), and how to successfully apply it to the problem of acoustic echo cancellation by
combining the resulting algorithm with the fast normalized cross-correlation DTD.

Footnote 2: “Least squares” is a misnomer because we do not minimize the least squares criterion but rather maximize a
likelihood function. However, the name “least squares” is convenient to use here.


3.1 Derivation of the Single-Channel Robust RLS

In the context of echo cancellation, the error signal at time n between the system and model filter
outputs is given by (see Fig. 1)

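$$e(n) = y(n) - \hat{y}(n), \qquad (4)$$

where

$$\hat{y}(n) = \hat{\mathbf{h}}^T \mathbf{x}(n) \qquad (5)$$

is an estimate of the output signal y(n),

$$\hat{\mathbf{h}} = \begin{bmatrix} \hat{h}_0 & \hat{h}_1 & \cdots & \hat{h}_{L-1} \end{bmatrix}^T$$

is the model filter, and

$$\mathbf{x}(n) = \begin{bmatrix} x(n) & x(n-1) & \cdots & x(n-L+1) \end{bmatrix}^T$$

is a vector containing the last L samples of the input signal x. Superscript T denotes transpose of
a vector or a matrix.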
The first step in the derivation of a robust algorithm is to choose an optimization criterion. Consider
the following function:

$$J(\hat{\mathbf{h}}) = \rho\left[\frac{e(n)}{s(n)}\right], \qquad (6)$$

where

$$\rho(z) \sim -\ln[p(z)] \qquad (7)$$

is a convex function (i.e., ρ''(z) > 0) and s(n) is a positive scale factor more thoroughly described
below.
Newton-type algorithms minimize the optimization criterion (6) by the following recursion (see
[13] for the Newton algorithm):

$$\hat{\mathbf{h}}(n) = \hat{\mathbf{h}}(n-1) - \mathbf{R}_\psi^{-1}\,\nabla J\left[\hat{\mathbf{h}}(n-1)\right], \qquad (8)$$

where $\nabla J(\hat{\mathbf{h}})$ is the gradient of the optimization criterion w.r.t. ĥ and $\mathbf{R}_\psi$ is (an approximation
of) $E\{\nabla^2 J[\hat{\mathbf{h}}(n-1)]\}$.
 
The gradient of $J(\hat{\mathbf{h}})$ w.r.t. ĥ is

$$\nabla J(\hat{\mathbf{h}}) = -\frac{\mathbf{x}(n)}{s(n)}\,\psi\left[\frac{e(n)}{s(n)}\right], \qquad (9)$$

where

$$\frac{d}{dz}\rho(z) = \rho'(z) = \psi(z). \qquad (10)$$

The second derivative of $J(\hat{\mathbf{h}})$ is

$$\nabla^2 J(\hat{\mathbf{h}}) = \frac{\mathbf{x}(n)\mathbf{x}^T(n)}{s^2(n)}\,\psi'\left[\frac{e(n)}{s(n)}\right], \qquad (11)$$

where

$$\psi'(z) > 0, \quad \forall z. \qquad (12)$$

Good choices of ψ(·) will be exemplified in the next section.


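In this study we choose the following approximation of $\mathbf{R}_\psi$,

$$\mathbf{R}_\psi = \frac{\psi'\left[\frac{e(n)}{s(n)}\right]}{s^2(n)}\,\mathbf{R}(n), \qquad (13)$$

where

$$\mathbf{R}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{x}(i)\mathbf{x}^T(i) = \lambda\mathbf{R}(n-1) + \mathbf{x}(n)\mathbf{x}^T(n), \qquad (14)$$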

and λ (0 < λ ≤ 1) is an exponential forgetting factor. This choice of Rψ will allow us to derive a
fast version of the robust algorithm.
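We can now deduce a robust single-channel recursive least squares (RLS) adaptive algorithm
from (4), (5), (8), (9), (11), and (13):

$$e(n) = y(n) - \hat{\mathbf{h}}^T(n-1)\mathbf{x}(n), \qquad (15)$$

$$\hat{\mathbf{h}}(n) = \hat{\mathbf{h}}(n-1) + \frac{s(n)}{\psi'\left[\frac{e(n)}{s(n)}\right]}\,\mathbf{k}(n)\,\psi\left[\frac{e(n)}{s(n)}\right], \qquad (16)$$

where $\mathbf{k}(n) = \mathbf{R}^{-1}(n)\mathbf{x}(n)$ is the Kalman gain [14].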

3.2 Examples of Robust PDFs

Two examples of PDFs whose log-likelihood functions have a bounded first derivative, and which
are therefore appropriate for modeling the disturbing noise, are

$$p_1(z) = \frac{1}{2}\,e^{-\ln[\cosh(\pi z/2)]}, \qquad (17)$$

$$p_2(z) = \begin{cases} \dfrac{1-\epsilon}{\sqrt{2\pi}}\,e^{-z^2/2}, & |z| \le k_0 \\[2ex] \dfrac{1-\epsilon}{\sqrt{2\pi}}\,e^{-|z|k_0 + k_0^2/2}, & |z| > k_0 \end{cases} \qquad (18)$$

where the mean and the variance are respectively equal to 0 and 1. The second one, p2(z), is known
as the PDF of the least informative distribution [11]. Figure 2 compares these PDFs to the Gaussian
density:

$$p_G(z) = \frac{1}{\sqrt{2\pi}}\exp\left(-z^2/2\right). \qquad (19)$$
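
The nonlinearities ψ(z) = −d ln p(z)/dz that follow from these three densities can be written down directly. The sketch below is illustrative only: the function names are hypothetical and the default k0 = 1.5 is an assumed value, not one stated in this memorandum.

```python
import numpy as np

def psi_cosh(z):
    """psi(z) for p1(z): (pi/2) * tanh(pi*z/2), bounded by +/- pi/2."""
    return 0.5 * np.pi * np.tanh(0.5 * np.pi * z)

def psi_limiter(z, k0=1.5):
    """psi(z) for p2(z): a limiter that clips z to [-k0, k0].

    k0 = 1.5 is a hypothetical default chosen for illustration.
    """
    return np.clip(z, -k0, k0)

def psi_gauss(z):
    """psi(z) for the Gaussian density: psi(z) = z, which is unbounded."""
    return np.asarray(z, dtype=float)
```

For a large normalized error such as z = 100, the first two functions return at most π/2 and k0 respectively, whereas the Gaussian choice passes the outlier through unchanged; this bounded influence is what makes the first two choices robust.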

Table 1 shows the corresponding functions ψ(·) and ψ'(·).

Table 1: Nonlinear robustness functions and their derivatives. θ(z) is the Heaviside function, i.e.
θ(z) = 1 for z > 0 and θ(z) = 0 for z < 0.

PDF       ψ(z)                                                   ψ'(z)
p1(z)     $\frac{\pi}{2}\tanh\left(\frac{\pi}{2}z\right)$        $\frac{\pi^2}{4\cosh^2\left(\frac{\pi}{2}z\right)}$
p2(z)     $\max\{-k_0, \min(z, k_0)\}$                           $\theta(z + k_0) - \theta(z - k_0)$
pG(z)     $z$                                                    $1$

3.3 Scale Factor Estimation

An important part of the algorithm is the estimate of the scale factor s. Traditionally, the scale
is used to make a robust algorithm invariant to the noise level [11]. In our approach, however, it
should reflect the minimum mean square error, be robust to short burst disturbances (double-talk
in our application), and track slow changes of the residual error (echo path changes). The choice of
this method of estimating s is justified in [10, 15]. We have chosen the scale factor estimate as

$$s(n+1) = \lambda_s s(n) + (1 - \lambda_s)\,\frac{s(n)}{\psi'\left[\frac{e(n)}{s(n)}\right]}\,\left|\psi\left[\frac{e(n)}{s(n)}\right]\right|, \qquad (20)$$

$$s(0) = \sigma_x,$$
which is very simple to implement. With this choice, the current estimate of s is governed by the level
of the error signal in the immediate past over a time interval roughly equal to 1/(1 − λs). Assume
that ψ(·) has been derived from p2(z) of the previous section. The behaviour of the robust algorithm can
then be explained as follows. When the algorithm has not yet converged, s is large. Hence the limiter
is in its linear portion and the robust algorithm behaves roughly like the conventional RLS
algorithm. When double-talk occurs, the error is determined by the limiter and by the scale of the
error signal during the recent past, before the double-talk occurred. Thus the divergence
rate is reduced for a duration of about 1/(1 − λs) samples. This gives ample time for the DTD to act. If there
is a system change, the adaptive AEC algorithm does not track immediately. However, as the scale
estimator tracks the larger error signal, the nonlinearity is scaled up and the convergence rate of the
AEC accelerates. The trade-off between robustness and tracking rate of the adaptive algorithm is
thus governed by the tracking rate of the scale estimator, which is controlled by the single parameter
λs.
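
To make the interplay between the limiter and the scale estimate concrete, one iteration of the single-channel recursion (15), (16), (20) can be sketched as below. Several things here are assumptions of the sketch rather than statements of this memorandum: ψ is the limiter derived from p2(z), the factor ψ'(·) is replaced by the constant 1 (the exact derivative of the limiter vanishes outside [−k0, k0], which would make the division in (16) and (20) undefined), the Kalman gain is obtained from the matrix inversion lemma instead of a fast algorithm, and the function name and default parameter values are hypothetical.

```python
import numpy as np

def robust_rls_step(h, P, s, x, y, lam=0.999, lam_s=0.997, k0=1.5):
    """One iteration of a single-channel robust RLS sketch.

    h : filter estimate h(n-1), shape (L,)
    P : inverse correlation matrix R^{-1}(n-1), shape (L, L)
    s : scale estimate s(n)
    x : excitation vector [x(n), ..., x(n-L+1)], shape (L,)
    y : microphone sample y(n)
    """
    # Kalman gain k(n) = R^{-1}(n) x(n) via the matrix inversion lemma.
    Px = P @ x
    alpha = lam + x @ Px
    k = Px / alpha
    P = (P - np.outer(k, Px)) / lam

    e = y - h @ x                                   # a priori error, (15)
    psi = np.clip(e / s, -k0, k0)                   # limiter from p2(z)
    # ASSUMPTION: psi'(.) is replaced by 1 so that the update stays bounded.
    h = h + s * k * psi                             # (16) with psi' = 1
    s = lam_s * s + (1.0 - lam_s) * s * abs(psi)    # (20) with psi' = 1
    return h, P, s, e
```

In this form a large error, such as one caused by undetected near-end speech, moves the coefficients by at most k0·s(n) times the Kalman gain, while the scale grows by at most a factor of λs + (1 − λs)k0 per sample.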

3.4 Generalization to the Multi-Channel Case


So far we have only discussed single-channel algorithms. In this section, we derive a robust
multi-channel RLS algorithm. For this case, we assume that we have P incoming transmission room
signals, $x_i(n)$, i = 1, ..., P. Define the P input excitation vectors as

$$\mathbf{x}_i(n) = \begin{bmatrix} x_i(n) & x_i(n-1) & \cdots & x_i(n-L+1) \end{bmatrix}^T, \quad i = 1, 2, \ldots, P.$$

Let the error signal at time n between one arbitrary microphone output y(n) in the receiving
room and its estimate be

$$e(n) = y(n) - \sum_{i=1}^{P} \hat{\mathbf{h}}_i^T \mathbf{x}_i(n), \qquad (21)$$

where

$$\hat{\mathbf{h}}_i = \begin{bmatrix} \hat{h}_{i,0} & \hat{h}_{i,1} & \cdots & \hat{h}_{i,L-1} \end{bmatrix}^T, \quad i = 1, 2, \ldots, P$$

are the P modeling filters. In general, we have P microphones in the receiving room as well, i.e. P
return channels, and thus P² adaptive filters. From now on, though, we will look at just one return
channel because the derivation is similar for the other P − 1 channels. By stacking the P modeling
filters and excitation vectors according to

$$\hat{\mathbf{h}}(n) = \begin{bmatrix} \hat{\mathbf{h}}_1^T(n) & \hat{\mathbf{h}}_2^T(n) & \cdots & \hat{\mathbf{h}}_P^T(n) \end{bmatrix}^T,$$

$$\mathbf{x}(n) = \begin{bmatrix} \mathbf{x}_1^T(n) & \mathbf{x}_2^T(n) & \cdots & \mathbf{x}_P^T(n) \end{bmatrix}^T,$$

we can write the multi-channel error criterion with respect to the modeling filters in the same manner
as in the single-channel case described in Section 3.1,

$$J_M(\hat{\mathbf{h}}) = \rho\left[\frac{e(n)}{s(n)}\right]. \qquad (22)$$

This leads to the adaptive algorithm

$$e(n) = y(n) - \hat{\mathbf{h}}^T(n-1)\mathbf{x}(n), \qquad (23)$$

$$\hat{\mathbf{h}}(n) = \hat{\mathbf{h}}(n-1) + \frac{s(n)}{\psi'\left[\frac{e(n)}{s(n)}\right]}\,\mathbf{k}(n)\,\psi\left[\frac{e(n)}{s(n)}\right], \qquad (24)$$

where s(n) is the previously discussed scale factor and $\mathbf{k}(n) = \mathbf{R}^{-1}(n)\mathbf{x}(n)$ is the multi-channel
Kalman gain.

Table 2: Variable definitions of the multi-channel FRLS algorithm.

Variables                                                                                                         Matrix size
$\boldsymbol{\chi}(n) = [x_1(n)\; x_2(n)\; \cdots\; x_P(n)]^T$                                                     (P × 1)
$\mathbf{x}(n) = [\boldsymbol{\chi}^T(n)\; \cdots\; \boldsymbol{\chi}^T(n-L+1)]^T$                                 (PL × 1)
$\hat{\mathbf{h}}(n) = [\hat{h}_{1,0}(n)\; \hat{h}_{2,0}(n)\; \cdots\; \hat{h}_{P-1,L-1}(n)\; \hat{h}_{P,L-1}(n)]^T$  (PL × 1)
$\mathbf{A}(n), \mathbf{B}(n)$ = forward and backward prediction matrices                                         (PL × P)
$\mathbf{E}_A(n), \mathbf{E}_B(n)$ = forward and backward prediction error covariance matrices                    (P × P)
$\mathbf{e}_A(n), \mathbf{e}_B(n)$ = forward and backward prediction error vectors                                (P × 1)
$\mathbf{k}(n)$ = a posteriori Kalman gain vector                                                                 (PL × 1)
$\mathbf{k}'(n)$ = a priori Kalman gain vector                                                                    (PL × 1)
$\alpha(n) = \lambda + \mathbf{x}^T(n)\mathbf{R}^{-1}(n-1)\mathbf{x}(n)$                                           (1 × 1)
$\varphi(n) = \lambda/\alpha(n)$, maximum likelihood variable                                                     (1 × 1)
$\kappa \in [1.5, 2.5]$, stabilization parameter                                                                  (1 × 1)
$\lambda \in (0, 1]$, forgetting factor                                                                           (1 × 1)

3.5 The Robust Multi-Channel Fast RLS

A robust multi-channel fast RLS (FRLS) can be derived by using the a priori Kalman gain $\mathbf{k}'(n) =
\mathbf{R}^{-1}(n-1)\mathbf{x}(n)$. For details of the derivation, see [16]. This a priori Kalman gain can be computed
recursively with 5P²L multiplications, and the error and adaptation parts with 2PL multiplications.
“Stabilized” versions of FRLS (with P²L more multiplications) exist in the literature, but they are
not significantly more stable than their non-stabilized counterparts for non-stationary signals like
speech. Our approach to handling this problem is simply to re-initialize the predictor-based variables
when instability is detected by means of the maximum likelihood variable. This variable is
inherent in the FRLS computations.
Table 2 defines the variables in the robust multi-channel FRLS algorithm. Note that the channels
of the filter vector and of the state vector x(n) are interleaved in this algorithm. The robust FRLS algorithm,
with a complexity of 6P²L + 2PL, is found in Table 3.
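
The interleaved ordering of the state vector defined in Table 2 can be illustrated with a short sketch. The function name `interleaved_state` and the array layout (one row per channel) are assumptions made for this example only.

```python
import numpy as np

def interleaved_state(signals, n, filter_len):
    """Build x(n) = [chi^T(n), ..., chi^T(n-L+1)]^T from P channel signals.

    signals    : array of shape (P, N), one row per far-end channel
    n          : current time index (requires n >= filter_len - 1)
    filter_len : L, number of taps per channel
    """
    signals = np.asarray(signals, dtype=float)
    # Columns n, n-1, ..., n-L+1 are the vectors chi(n), ..., chi(n-L+1).
    recent = signals[:, n - filter_len + 1: n + 1][:, ::-1]
    # Column-major flattening stacks the chi vectors, interleaving channels.
    return recent.flatten(order="F")
```

For two channels with L = 2, the result is ordered as x1(n), x2(n), x1(n − 1), x2(n − 1), so that each new sample vector χ(n) enters as one block of P elements at the top of x(n).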


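Table 3: The robust multi-channel FRLS algorithm.

Prediction                                                                                                        Matrix size
$\mathbf{e}_A(n) = \boldsymbol{\chi}(n) - \mathbf{A}^T(n-1)\mathbf{x}(n-1)$                                         (P × 1)
$\alpha_1(n) = \alpha(n-1) + \mathbf{e}_A^T(n)\mathbf{E}_A^{-1}(n-1)\mathbf{e}_A(n)$                                (1 × 1)
$\begin{bmatrix} \mathbf{t}(n) \\ \mathbf{m}(n) \end{bmatrix} = \begin{bmatrix} \mathbf{0}_{P \times 1} \\ \mathbf{k}'(n-1) \end{bmatrix} + \begin{bmatrix} \mathbf{I}_{P \times P} \\ -\mathbf{A}(n-1) \end{bmatrix} \mathbf{E}_A^{-1}(n-1)\mathbf{e}_A(n)$   ((PL + P) × 1)
$\mathbf{E}_A(n) = \lambda[\mathbf{E}_A(n-1) + \mathbf{e}_A(n)\mathbf{e}_A^T(n)/\alpha(n-1)]$                      (P × P)
$\mathbf{A}(n) = \mathbf{A}(n-1) + \mathbf{k}'(n-1)\mathbf{e}_A^T(n)/\alpha(n-1)$                                   (PL × P)
$\mathbf{e}_{B1}(n) = \mathbf{E}_B(n-1)\mathbf{m}(n)$                                                              (P × 1)
$\mathbf{e}_{B2}(n) = \boldsymbol{\chi}(n-L) - \mathbf{B}^T(n-1)\mathbf{x}(n)$                                      (P × 1)
$\mathbf{e}_B(n) = \kappa\,\mathbf{e}_{B2}(n) + (1-\kappa)\,\mathbf{e}_{B1}(n)$                                     (P × 1)
$\mathbf{k}'(n) = \mathbf{t}(n) + \mathbf{B}(n-1)\mathbf{m}(n)$                                                     (PL × 1)
$\alpha(n) = \alpha_1(n) - \mathbf{e}_{B2}^T(n)\mathbf{m}(n)$                                                       (1 × 1)
$\mathbf{E}_B(n) = \lambda[\mathbf{E}_B(n-1) + \mathbf{e}_{B2}(n)\mathbf{e}_{B2}^T(n)/\alpha(n)]$                   (P × P)
$\mathbf{B}(n) = \mathbf{B}(n-1) + \mathbf{k}'(n)\mathbf{e}_B^T(n)/\alpha(n)$                                       (PL × P)

Filtering                                                                                                         Matrix size
$e(n) = y(n) - \hat{\mathbf{h}}^T(n-1)\mathbf{x}(n)$                                                               (1 × 1)
$\hat{\mathbf{h}}(n) = \hat{\mathbf{h}}(n-1) + \dfrac{s(n)}{\alpha(n)\,\psi'\left[\frac{e(n)}{s(n)}\right]}\,\mathbf{k}'(n)\,\psi\left[\dfrac{e(n)}{s(n)}\right]$   (PL × 1)
$s(n+1) = \lambda_s s(n) + (1-\lambda_s)\,\dfrac{s(n)}{\psi'\left[\frac{e(n)}{s(n)}\right]}\,\left|\psi\left[\dfrac{e(n)}{s(n)}\right]\right|$   (1 × 1)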

4 The Normalized Cross-Correlation (NCC) DTD


A decision variable for double-talk detection that has the potential of fulfilling the properties of
an “optimum” variable described in Section 2 was proposed in [6, 9]. This variable is formed by
normalizing the cross-correlation vector between the excitation vector x(n) and the return signal
y(n). Normalization results in a natural threshold level for ξ when v(n) = 0. In Appendix C we
show that this detector can be viewed as a special case of a Neyman-Pearson binary statistical test,
[8]. The decision variable is given by

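$$\xi = \sqrt{\mathbf{r}^T(\sigma_y^2\mathbf{R})^{-1}\mathbf{r}}, \qquad (25)$$

where $\mathbf{R} = E\{\mathbf{x}(n)\mathbf{x}^T(n)\}$, $\mathbf{r} = E\{\mathbf{x}(n)y(n)\}$, and $\sigma_y^2 = E\{y^2(n)\}$.
Moreover, we can rewrite ξ as

$$\xi = \sqrt{\frac{\mathbf{h}^T\mathbf{R}\mathbf{h}}{\mathbf{h}^T\mathbf{R}\mathbf{h} + \sigma_v^2}}, \qquad (26)$$

by using the identities

$$\mathbf{r} = \mathbf{R}\mathbf{h} \qquad (27)$$

and

$$\sigma_y^2 = \mathbf{r}^T\mathbf{R}^{-1}\mathbf{r} + \sigma_v^2. \qquad (28)$$

It is easily deduced from (26) that ξ = 1 for v(n) = 0 and ξ < 1 for v(n) ≠ 0. Note also that ξ is
independent of the echo path when v(n) = 0.
The detection statistic (25) is rather complex to calculate directly and is therefore unsuitable for implementation.
However, a fast version of this algorithm can be derived by recursively updating $\mathbf{R}^{-1}\mathbf{r}$ using the
Kalman gain $\mathbf{R}^{-1}\mathbf{x}(n)$ [14], which is calculated in the FRLS algorithm. The resulting DTD, called
the Fast NCC, will be derived next.

Table 4: Matrix inversion lemma and other relations.

$\mathbf{R}^{-1}(n) = \lambda^{-1}\mathbf{R}^{-1}(n-1) - \lambda^{-1}\,\dfrac{\mathbf{R}^{-1}(n-1)\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{R}^{-1}(n-1)}{\alpha(n)}$
$\mathbf{k}(n) = \mathbf{R}^{-1}(n)\mathbf{x}(n) = \dfrac{1}{\alpha(n)}\mathbf{R}^{-1}(n-1)\mathbf{x}(n)$, a posteriori Kalman gain
$\mathbf{k}'(n) = \mathbf{R}^{-1}(n-1)\mathbf{x}(n)$, a priori Kalman gain
$\alpha(n)\,\mathbf{x}^T(n)\mathbf{k}(n) = \mathbf{x}^T(n)\mathbf{R}^{-1}(n-1)\mathbf{x}(n)$
$\alpha(n)\,\mathbf{r}^T(n-1)\mathbf{k}(n) = \mathbf{r}^T(n-1)\mathbf{R}^{-1}(n-1)\mathbf{x}(n) = \hat{\mathbf{h}}_b^T(n-1)\mathbf{x}(n) = \hat{y}(n)$, estimated echo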
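
As an illustration of the behaviour of the statistic (25), the sketch below estimates ξ from sample averages for a synthetic echo, with and without a near-end signal. Everything in it is an assumption made for illustration: white Gaussian noise stands in for speech, the four-tap echo path is hypothetical, and the function name `ncc_statistic` is not part of this memorandum.

```python
import numpy as np

def ncc_statistic(x, y, filter_len):
    """Sample estimate of xi = sqrt(r^T R^{-1} r / sigma_y^2), cf. (25)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Rows of X are the excitation vectors [x(n), ..., x(n-L+1)].
    X = np.array([x[n - filter_len + 1: n + 1][::-1]
                  for n in range(filter_len - 1, len(x))])
    y = y[filter_len - 1:]
    R = X.T @ X / len(y)             # estimate of E{x x^T}
    r = X.T @ y / len(y)             # estimate of E{x y}
    sigma_y2 = y @ y / len(y)        # estimate of E{y^2}
    return np.sqrt(r @ np.linalg.solve(R, r) / sigma_y2)

rng = np.random.default_rng(0)
x = rng.standard_normal(4000)                    # stand-in far-end signal
h = np.array([0.5, -0.3, 0.2, 0.1])              # hypothetical echo path
echo = np.convolve(x, h)[:len(x)]
v = rng.standard_normal(len(x)) * np.std(echo)   # near-end at ENR = 0 dB

xi_echo_only = ncc_statistic(x, echo, len(h))
xi_double_talk = ncc_statistic(x, echo + v, len(h))
```

Without near-end signal the estimate equals 1 up to rounding, whatever the echo path, and with equally strong near-end signal it falls to roughly the square root of 1/2, as predicted by (26). The matrix solve in every evaluation is what the fast version derived next avoids.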

4.1 Derivation of a Fast NCC (FNCC)

Estimated quantities of the cross-correlation and the near-end signal power are introduced for the
derivation of the FNCC DTD. Equation (25) may be written as

$$\xi^2(n) = \frac{\mathbf{r}^T(n)\mathbf{R}^{-1}(n)\mathbf{r}(n)}{\sigma_y^2(n)} = \frac{\eta^2(n)}{\sigma_y^2(n)}, \qquad (29)$$

where we have squared the statistic for simplicity. The correlation variables are estimated as

$$\mathbf{r}(n) = \lambda\mathbf{r}(n-1) + \mathbf{x}(n)y(n), \qquad (30)$$

$$\mathbf{R}(n) = \lambda\mathbf{R}(n-1) + \mathbf{x}(n)\mathbf{x}^T(n), \qquad (31)$$

$$\sigma_y^2(n) = \lambda\sigma_y^2(n-1) + y^2(n). \qquad (32)$$

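Table 4 reviews some useful relations that are frequently used in the following. Let us continue by
looking at the statistic η²(n):

$$\begin{aligned}
\eta^2(n) &= \left[\lambda\mathbf{r}^T(n-1) + y(n)\mathbf{x}^T(n)\right]\mathbf{R}^{-1}(n)\left[\lambda\mathbf{r}(n-1) + \mathbf{x}(n)y(n)\right] \\
&= \lambda^2\mathbf{r}^T(n-1)\mathbf{R}^{-1}(n)\mathbf{r}(n-1) + \lambda\mathbf{r}^T(n-1)\mathbf{R}^{-1}(n)\mathbf{x}(n)y(n) \\
&\quad + \lambda y(n)\mathbf{x}^T(n)\mathbf{R}^{-1}(n)\mathbf{r}(n-1) + y^2(n)\mathbf{x}^T(n)\mathbf{R}^{-1}(n)\mathbf{x}(n) \\
&= \lambda^2\mathbf{r}^T(n-1)\left[\lambda^{-1}\mathbf{R}^{-1}(n-1) - \lambda^{-1}\alpha(n)\mathbf{k}(n)\mathbf{k}^T(n)\right]\mathbf{r}(n-1) \\
&\quad + 2\lambda y(n)\mathbf{k}^T(n)\mathbf{r}(n-1) + y^2(n)\mathbf{x}^T(n)\mathbf{k}(n) \\
&= \lambda\mathbf{r}^T(n-1)\mathbf{R}^{-1}(n-1)\mathbf{r}(n-1) - \lambda\alpha(n)\left[\mathbf{k}^T(n)\mathbf{r}(n-1)\right]^2 + 2\lambda y(n)\mathbf{k}^T(n)\mathbf{r}(n-1) \\
&\quad + y^2(n)\left[1 - \frac{\lambda}{\alpha(n)}\right] \\
&= \lambda\eta^2(n-1) - \frac{\lambda}{\alpha(n)}\hat{y}^2(n) + 2\frac{\lambda}{\alpha(n)}y(n)\hat{y}(n) + \left[1 - \frac{\lambda}{\alpha(n)}\right]y^2(n) \\
&= \lambda\eta^2(n-1) - \varphi(n)\hat{y}^2(n) + 2\varphi(n)\hat{y}(n)y(n) + \left[1 - \varphi(n)\right]y^2(n) \\
&= \lambda\eta^2(n-1) + y^2(n) - \varphi(n)e^2(n), \qquad (33)
\end{aligned}$$

where ϕ(n) = λ/α(n) is the likelihood variable and e(n) = y(n) − ŷ(n) is the residual error. We
find that the statistics needed to form the test statistic of the FNCC DTD are given by the simple
first-order recursions (32) and (33). Table 5 gives the essential calculations for both the DTD and
the echo canceler, assuming that the Kalman gain has been calculated by the FRLS algorithm
in Table 3. Note that we need to distinguish between the echo path estimate calculated in the DTD,
from now on denoted ĥb(n), and the estimate calculated in the echo canceler, ĥ(n).

Table 5: The FNCC double-talk detector and echo canceler.

Double-talk detector                                                                                              Multiplications
$\sigma_y^2(n) = \lambda\sigma_y^2(n-1) + y^2(n)$                                                                   2
$e_b(n) = y(n) - \hat{\mathbf{h}}_b^T(n-1)\mathbf{x}(n)$                                                            PL
$\eta^2(n) = \lambda\eta^2(n-1) + y^2(n) - \varphi(n)e_b^2(n)$                                                      3
$\eta(n)/\sigma_y(n) < T \Rightarrow$ double-talk, µ = 0
$\eta(n)/\sigma_y(n) \ge T \Rightarrow$ no double-talk, µ = 1
$\hat{\mathbf{h}}_b(n) = \hat{\mathbf{h}}_b(n-1) + \mathbf{k}'(n)\,\dfrac{e_b(n)}{\varphi(n)}$                      PL

Robust echo cancellation                                                                                          Multiplications
$e(n) = y(n) - \hat{\mathbf{h}}^T(n-1)\mathbf{x}(n)$                                                                PL
$\hat{\mathbf{h}}(n) = \hat{\mathbf{h}}(n-1) + \mu\,\dfrac{s(n)}{\varphi(n)\,\psi'\left[\frac{e(n)}{s(n)}\right]}\,\mathbf{k}'(n)\,\psi\left[\dfrac{e(n)}{s(n)}\right]$   PL
$s(n+1) = \lambda_s s(n) + (1-\lambda_s)\,\dfrac{s(n)}{\psi'\left[\frac{e(n)}{s(n)}\right]}\,\left|\psi\left[\dfrac{e(n)}{s(n)}\right]\right|$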
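
The first-order recursion (33) can be checked numerically against the direct evaluation of η²(n) = rᵀ(n)R⁻¹(n)r(n). The sketch below is a verification aid under stated assumptions, not the memorandum's implementation: the signals are synthetic white noise, the echo path is hypothetical, R(0) = δI is an assumed initialization, the a priori gain is obtained with a linear solve rather than the FRLS algorithm, and the background filter is updated with the a posteriori gain k(n) = k'(n)/α(n) from Table 4.

```python
import numpy as np

rng = np.random.default_rng(1)
L, lam, delta = 4, 0.99, 1e-2                  # hypothetical settings
x_sig = rng.standard_normal(300)
h_true = np.array([0.5, -0.3, 0.2, 0.1])       # hypothetical echo path
y_sig = np.convolve(x_sig, h_true)[:len(x_sig)] + 0.1 * rng.standard_normal(300)

R = delta * np.eye(L)       # R(0), regularised so that it is invertible
r = np.zeros(L)             # r(0)
h_b = np.zeros(L)           # background estimate, equal to R^{-1}(0) r(0)
eta2 = 0.0                  # eta^2(0) = r^T(0) R^{-1}(0) r(0)
max_gap = 0.0

for n in range(L - 1, len(x_sig)):
    x = x_sig[n - L + 1: n + 1][::-1]          # [x(n), ..., x(n-L+1)]
    y = y_sig[n]
    k_prior = np.linalg.solve(R, x)            # k'(n) = R^{-1}(n-1) x(n)
    alpha = lam + x @ k_prior
    phi = lam / alpha                          # likelihood variable
    e_b = y - h_b @ x                          # a priori background error
    eta2 = lam * eta2 + y * y - phi * e_b * e_b    # fast recursion (33)
    h_b = h_b + k_prior * e_b / alpha          # RLS update with k(n) = k'(n)/alpha(n)

    R = lam * R + np.outer(x, x)               # (31)
    r = lam * r + x * y                        # (30)
    direct = r @ np.linalg.solve(R, r)         # eta^2(n) evaluated directly
    max_gap = max(max_gap, abs(eta2 - direct))
```

Under these assumptions `max_gap` stays at rounding-error level, which confirms that the detector only needs the scalar recursions (32) and (33) once the Kalman gain and the likelihood variable are available.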

4.2 The Link Between FNCC and the Two-Path DTD Scheme
One common way to handle double-talk is to use the two-path system [7]. In this case there is
an adaptive background filter modeling the echo path and a fixed foreground filter which cancels
the echo. In this memorandum, the combination of a background and a foreground filter is referred to
as a two-path structure. Whenever the background filter is performing well according to some criteria, it is
copied to the foreground filter. An interesting interpretation of the echo canceler and DTD shown
in Table 5 is that they naturally have a two-path structure: the DTD in Table 5 represents
the background adaptive filter, and the AEC the foreground filter. What distinguishes Table 5 from
the original two-path structure is that the foreground filter is also adaptive. Hence, the FNCC
DTD/AEC theoretically justifies the two-path structure in the echo cancellation problem.

5 Evaluation and Simulations


The adaptive algorithm for the acoustic echo canceler has been derived for the multi-channel case.
In these simulations, we prefer to do most of the evaluations using the single-channel version since
these simulations are less complex to perform. The performance of the robust two-channel algorithm
is shown for a double-talk situation. All results are directly applicable to the general multi-channel
case.

5.1 Receiver Operating Characteristics (ROC)

When evaluating a detector it is customary to show the probability of detection, Pd , versus the
probability of false alarm, Pf . This curve is known as the receiver operating characteristic (ROC) of
the detector and has been widely used for characterizing detectors in communication systems and
radar. The ROC curve of a detector should always be above the “chance line”, i.e. Pd = Pf , which
characterizes the performance for a pure guess.
We can adopt the ROC technique for characterizing double-talk detectors as well. This has
previously been suggested in [5, 9]. One problem that has to be kept in mind though, is that speech
has a time-varying power level. This may give less consistent results when estimating Pd and Pf
than what would be the case for a stationary signal. Nevertheless, for comparative studies of the
DTDs the estimated ROC is still a useful performance metric.
Estimation of Pf and Pd using a set of speech data obtained from a database [17] was made
according to the procedure presented in [9]. The files used are sentences (040), (046), and (062) of
talker M25, and sentences (012) and (075) of talkers F14, F45, M08, and M51. Furthermore, all sentences
are normalized to have the same average power level.

Echo path. The measured acoustic path, “sgi2cl.mat” channel-b, is used as echo path.
However, it is decimated to 8 kHz sampling rate resulting in a total length of 2048 coefficients,
see Fig. 3. It is also normalized so that $\sigma_y^2 = \sigma_x^2$ for the actual speech data. The original echo
path is available in the database: /u/drrm/matlab/desktop.
Probability of false alarm. When estimating the probability of false alarm we use all
sentences above as far-end speech. This corresponds to a speech sequence of 21.8 seconds at 8
kHz sampling rate. The signal (echo)-to-ambient noise ratio, $\mathrm{SNR} = 10 \log_{10}\{\sigma_y^2/\sigma_w^2\}$, is set
to 30 dB.
Probability of detection. The three sentences of talker M25, a total of 5 seconds of data,
are used as far-end speech when the probability of detection is estimated. As near-end, the
remaining sentences are used, each about 2 seconds long. In our case we investigate the
performance when the average echo-to-near-end ratio, $\mathrm{ENR} = 10 \log_{10}\{\sigma_y^2/\sigma_v^2\}$, is 0 dB since
it is natural to assume equally strong talkers.
Practical modifications of the DTD and echo canceler. Some practical modifications
have been incorporated in the double-talk detector and canceler in order to improve overall
performance. The exact implementation of the AEC and FNCC DTD is found in Table 6,
Appendix D.
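
The level settings above (SNR = 30 dB, ENR = 0 dB) amount to scaling the disturbance against the echo power. The following is a minimal sketch of that scaling, not the memorandum's own code; the helper names `power` and `scale_to_ratio_db` are hypothetical, and white Gaussian sequences stand in for the speech and noise signals.

```python
import math
import random

def power(signal):
    """Average power (mean square) of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)

def scale_to_ratio_db(reference, disturbance, ratio_db):
    """Scale `disturbance` so that 10*log10(P_reference / P_disturbance) = ratio_db."""
    target_power = power(reference) / (10.0 ** (ratio_db / 10.0))
    gain = math.sqrt(target_power / power(disturbance))
    return [gain * d for d in disturbance]

# Stand-in signals (assumption: white Gaussian instead of real speech).
rng = random.Random(0)
echo = [rng.gauss(0.0, 1.0) for _ in range(8000)]      # plays the role of the echo
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]     # plays the role of w(n)
near_end = [rng.gauss(0.0, 1.0) for _ in range(8000)]  # plays the role of v(n)

w = scale_to_ratio_db(echo, noise, 30.0)    # SNR = 30 dB
v = scale_to_ratio_db(echo, near_end, 0.0)  # ENR = 0 dB

snr_db = 10.0 * math.log10(power(echo) / power(w))
enr_db = 10.0 * math.log10(power(echo) / power(v))
```

Because speech power varies over time, these ratios hold only on average over the whole sequence, which is the sense in which the memorandum uses them.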


The thresholds of the Geigel and FNCC detectors are chosen such that their probabilities of
false alarm range from approximately 0 to about 0.25. For these thresholds we then estimate the
corresponding probability of detection (see [9] for details of the estimation of the probabilities). Results
from these simulations are shown in Fig. 4. The results are consistent with those found in [9], and
by use of the ROC it is possible to set thresholds such that the DTDs are compared fairly.
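
One point on an ROC curve can be estimated by counting detector decisions against known ground truth. The sketch below uses only the plain definitions of Pd and Pf; the exact procedure of [9] may differ in detail, and the function name `roc_point` and the toy sequences are illustrative assumptions.

```python
def roc_point(decisions, truth):
    """Return (Pd, Pf) from per-sample detector decisions and ground truth.

    decisions[i] is True when the detector declares double-talk;
    truth[i] is True when near-end speech is actually present.
    """
    hits = sum(1 for d, t in zip(decisions, truth) if d and t)
    false_alarms = sum(1 for d, t in zip(decisions, truth) if d and not t)
    n_double_talk = sum(1 for t in truth if t)
    n_far_end_only = len(truth) - n_double_talk
    pd = hits / n_double_talk if n_double_talk else 0.0
    pf = false_alarms / n_far_end_only if n_far_end_only else 0.0
    return pd, pf

# Toy data: 4 far-end-only samples followed by 4 double-talk samples.
truth = [False, False, False, False, True, True, True, True]
decisions = [False, True, False, False, True, False, True, True]
pd, pf = roc_point(decisions, truth)
```

Sweeping the detector threshold and repeating the count gives the full curve; a detector is useful only where its points lie above the chance line Pd = Pf.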

5.2 AEC/DTD Performance During Double-Talk

Our purpose here is to show that the FNCC DTD is very suitable for the acoustic case. For these
simulations, we use a subset of the speech data described in Section 5.1. Data and algorithm settings
are:

Far-end: Speech from talker M25 repeated twice, giving a total of 10 seconds of data. The
standard deviation of the signal is σx = 1900.
Near-end: Speech from talker M51, sentence (075).
Levels: ENR = 0 dB, SNR = 30 dB.
Echo path: “sgi2cl.mat” channel-b, 8 kHz sampling frequency, N = 2048, normalized so that
$\sigma_y^2 = \sigma_x^2$.
FRLS parameters: L = 1024 (128 ms), $\lambda = 1 - 1/(3L)$, $\kappa = 0$, $\hat{\mathbf{h}}_0 = \mathbf{0}$.
FNCC parameters: $\lambda_1 = 1 - 2/L$, T = 0.85 ⇒ Pd = 0.37, Pf ≈ 0, Fig. 4.
Geigel parameter: T = 1.20 ⇒ Pd = 0.37, Pf ≈ 0.03, Fig. 4.
Robust settings: $\psi\left(\frac{e(n)}{s(n)}\right) = \max\left\{-k_0, \min\left(\frac{e(n)}{s(n)}, k_0\right)\right\}$, $\lambda_s = 0.9975$, $k_0 = 1.5$, $\beta \approx 0.60655$, $s_0 = 1000$, and
$\psi'\left(\frac{e(n)}{s(n)}\right) = 1$ or $0.5$, according to the criterion in Appendix D.
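
The limiter in the robust settings can be illustrated numerically. This is a minimal sketch assuming the signed form ψ(z) = max{−k0, min(z, k0)} with z = e(n)/s(n); the function names are hypothetical and the error values are made up for illustration.

```python
def psi(z, k0=1.5):
    """Limiter psi(z) = max(-k0, min(z, k0))."""
    return max(-k0, min(z, k0))

def limited_error(e, s, k0=1.5):
    """Error after limiting, expressed in the units of e: s * psi(e / s)."""
    return s * psi(e / s, k0)

s0 = 1000.0                          # initial scale factor from the settings
small = limited_error(300.0, s0)     # inside the linear region, passes through
large = limited_error(-20000.0, s0)  # saturated: clipped to -k0 * s0 = -1500
```

With s0 = 1000 and k0 = 1.5, any error whose magnitude exceeds 1500 contributes no more to the update than an error of 1500 would, which is what bounds the damage done by undetected near-end speech.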

The choice of thresholds for the detectors is based on practical aspects, e.g. it is preferable to
have a low probability of false alarm and a reasonable probability of detection rather than a high
probability of detection and a higher false alarm rate. Normally one would compare detectors set to
have the same probability of false alarm. However, for a reasonable detection rate (0.35-0.7) the
FNCC DTD has a false alarm rate of virtually zero in our conditions. Forcing the Geigel DTD to have
such a low false alarm rate would result in a detection rate also close to zero. We therefore choose thresholds
for the DTDs such that they have the same probability of detection. This of course has the effect
that the Geigel detector exhibits a higher false alarm rate. The performance is measured by means
of the mean square error (MSE). We estimate the MSE according to
$$\mathrm{MSE}(n) = 10 \log_{10}\left\{\frac{\mathrm{LPF}\left([e(n) - v(n) - w(n)]^2\right)}{\mathrm{LPF}\left([y'(n)]^2\right)}\right\} \qquad (34)$$

where LPF denotes a first-order lowpass filter with a pole at 0.999.
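
The MSE estimate in (34) can be sketched as follows. The memorandum specifies only a first-order lowpass with a pole at 0.999; the unity-gain recursion used here is an assumption (its gain cancels in the ratio anyway), and the function names and the small floor that guards the logarithm are illustrative.

```python
import math

def lpf(samples, pole=0.999):
    """First-order recursive lowpass: out(n) = pole*out(n-1) + (1-pole)*in(n)."""
    out, state = [], 0.0
    for x in samples:
        state = pole * state + (1.0 - pole) * x
        out.append(state)
    return out

def mse_db(e, v, w, echo, pole=0.999, floor=1e-12):
    """MSE(n) in dB: smoothed residual echo power over smoothed echo power."""
    num = lpf([(ei - vi - wi) ** 2 for ei, vi, wi in zip(e, v, w)], pole)
    den = lpf([yi ** 2 for yi in echo], pole)
    return [10.0 * math.log10(max(n, floor) / max(d, floor))
            for n, d in zip(num, den)]

# Toy check: a canceler that leaves 10% of the echo amplitude gives -20 dB.
echo = [math.sin(0.1 * n) for n in range(2000)]
v = [0.0] * 2000
w = [0.0] * 2000
e = [0.1 * y for y in echo]
curve = mse_db(e, v, w, echo)
```

Subtracting v(n) and w(n) from e(n) before squaring is possible only in simulation, where the near-end speech and noise are known; it isolates the residual echo from the disturbances.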
Figure 5 shows far-end speech, double-talk and mean square error performance of the FRLS and
robust FRLS algorithms when either the FNCC or the Geigel detector is used as DTD. The time
intervals for which the DTDs have indicated double-talk are also shown below the MSE-curves. From
these results we draw the following conclusions: Looking at the MSE-performance of the non-robust
FRLS algorithm, both detectors miss the very first parts of the double-talk and therefore the FRLS
AEC algorithm diverges. During the double-talk period, detection misses of the Geigel DTD result
in further performance loss for the canceler. No divergence is seen for the robust FRLS, regardless
of DTD, since the robustness handles the situation when there are detection misses.
What remains of interest is how much convergence degrades when either the FNCC or the Geigel
DTD is combined with the robust or the non-robust FRLS. This is studied in the following section.

5.3 AEC/DTD Sensitivity to a Time-Varying Echo Path


For the purpose of showing how the DTD/AEC system handles echo path changes we have chosen
to study four cases. The same speech data and algorithm settings as in the previous section are
used (except no double-talk is introduced). In each simulation, the echo path undergoes one of four
changes after the FRLS algorithm has been adapting for 2 seconds. The changes that have been
tested are:

1. Time shift: $h(n) \to h(n - 5)$.
2. Time advance: $h(n) \to h(n + 14)$.
3. 6 dB increase of echo path gain: $h(n) \to 2h(n)$.
4. 6 dB decrease of echo path gain: $h(n) \to \frac{1}{2}h(n)$.

In reality there are infinitely many possibilities for echo path variations. These four cases were
chosen because they represent the basic variations that may occur.
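
The four echo path changes can be applied to a stored impulse response as sketched below. The memorandum does not say how the ends of the response are treated when it is shifted; zero-filling and keeping the length fixed are assumptions here, as are the function names and the toy response.

```python
def delay(h, k):
    """h(n) -> h(n - k): shift right by k samples, zero-fill, keep the length."""
    return [0.0] * k + h[:len(h) - k]

def advance(h, k):
    """h(n) -> h(n + k): shift left by k samples, zero-fill, keep the length."""
    return h[k:] + [0.0] * k

def scale(h, gain):
    """h(n) -> gain * h(n); gain 2 is about +6 dB, gain 0.5 about -6 dB."""
    return [gain * c for c in h]

# Toy echo path with two non-zero taps (the memorandum uses a measured path).
h = [0.0] * 32
h[16], h[17] = 1.0, -0.5

changed = {
    "time shift": delay(h, 5),
    "time advance": advance(h, 14),
    "gain increase": scale(h, 2.0),
    "gain decrease": scale(h, 0.5),
}
```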
Of major concern, when an echo path change occurs, is that the false alarm rate of the DTD
may increase. That would undoubtedly reduce convergence rate.
Figures 6 and 7 show mean square error performance of the FRLS and robust FRLS algorithms
combined with either the FNCC or the Geigel detector as DTD. From these curves we draw the
following conclusions: The FNCC DTD (Fig. 6) does not have a significant increase of false alarms
due to echo path changes. For the time advance change, Fig. 6b, there are a few false alarms visible.
With use of the foreground/background structure in the FNCC/robust FRLS combination, we detect
the echo path change and improve convergence rate, through $\psi'\{\cdot\}$, of the foreground robust FRLS
algorithm such that it becomes as fast as the (non-robust) FRLS algorithm.
The Geigel DTD has about the same false alarm rate after time shift/advance echo path changes,
Figs. 7a, b. The convergence rate of both robust and non-robust FRLS is affected negatively by
these false alarms (cf. Figs. 6a, b). When the echo path gain increases, Fig. 7c, the false alarm rate
becomes so high that the adaptation of the robust FRLS is halted and the convergence time of the
(non-robust) FRLS algorithm is tripled. For a decrease in echo path gain there is, of course, no
negative effect on the Geigel DTD performance.

5.4 Two-Channel AEC/DTD Performance During Double-Talk


For this stereo simulation, we use real-life speech data described in [18]. These recordings were made
in HuMaNet room B [19] and are available on a CD database (“Recording2”). The microphone and
loudspeaker setup in the near-end room is shown in Fig. 8, and estimated acoustic responses and
magnitude functions are shown in Figs. 9 and 10. In order to have a unique solution for the echo
canceler, the far-end signals have been processed with a non-linearity according to [20]. We
down-sampled the data by a factor of 2 in order to have a sampling rate of 8 kHz. Furthermore, we added
near-end speech to the microphone signals for the purpose of showing the double-talk behaviour of
the robust two-channel FRLS/FNCC DTD.
Basic speech and algorithm settings are:


Far-end: Stereo recordings from talker M25 (same talker as in the previous section). The
distortion parameter of the non-linearity is 0.5, [20]. That is, the far-end channels are pre-processed
according to:
$$x'_{(\cdot)} = x_{(\cdot)} + \frac{0.5}{2}\left[x_{(\cdot)} \pm |x_{(\cdot)}|\right], \qquad (35)$$
where (·) denotes either the left or the right channel. The positive half-wave is added to the left
channel and the negative to the right.
Near-end: Added speech from talker M51, sentence (075).
Levels: ENR ≈ 0 dB, SNR ≈ 38 dB.
FRLS parameters: L = 1024 (128 ms), λ = 1 − 1/(3 · 2L), κ = 1, ĥ0 = 0.
FNCC parameters: λ1 = 1 − 1/L, T = 0.90 ⇒ Pd = 0.63, Pf ≈ 0, Fig. 4.

Figure 11 shows the mean square error performance during double-talk. We find that the two-channel
algorithm behaves as well as the single-channel algorithm presented in the previous section.
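
The half-wave pre-processing of the far-end channels in (35) can be sketched sample by sample. This is an illustration only; the function name `half_wave_preprocess` and the two-sample test signals are assumptions.

```python
def half_wave_preprocess(left, right, alpha=0.5):
    """Add a scaled positive half-wave to the left channel and a scaled
    negative half-wave to the right channel; alpha is the distortion parameter."""
    new_left = [x + 0.5 * alpha * (x + abs(x)) for x in left]
    new_right = [x + 0.5 * alpha * (x - abs(x)) for x in right]
    return new_left, new_right

left, right = half_wave_preprocess([1.0, -1.0], [1.0, -1.0])
# left  -> [1.5, -1.0]: only positive samples are modified
# right -> [1.0, -1.5]: only negative samples are modified
```

Since the two channels are distorted differently, they are no longer linearly related through the far-end room alone, which is what gives the two-channel identification problem a unique solution [20].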

6 Conclusions
In this technical memorandum, we have combined methods from robust statistics with recursive
least squares filtering and double-talk detection in order to solve problems that arise in acoustic echo
cancellation. An acoustic echo cancellation system comprising a robust multi-channel adaptive
algorithm and a fast normalized cross-correlation double-talk detector has been presented. Major
results of the FNCC detector are:

• It has a much higher accuracy than other DTDs. That is, lower false alarm rate at a fixed
probability of detection.
• It is not dependent on any assumptions about the echo path e.g., the echo path gain. This is
a very important property since the variability of acoustic echo paths is high.

• It has low added complexity when implemented together with the FRLS algorithm.
• The required nonlinear function (ψ(·)) in the robust FRLS results in some minor performance
loss (slower convergence rate) compared to the non-robust FRLS. However, we can easily
compensate for this loss by combining the FNCC decision with some simple statistics, see
Table 6. Poor convergence is thereby efficiently detected and the convergence rate can thus be
increased.

Furthermore, theoretical analysis and simulations have shown that:

• The derivation of FNCC theoretically justifies the previously proposed two-path structure. It
should not be regarded as a proof of optimality of the two-path structure.
• Requiring a higher probability of detection (double-talk), i.e. increasing the threshold,
increases the false alarm rate after echo path changes.
• Increasing λ in FRLS will not result in as low divergence as that of the robust FRLS during
double-talk, if both algorithms are constrained to have the same convergence rate.


The combination of robust statistics, an optimal³ adaptive algorithm (FRLS), and the FNCC DTD
results in an “explosive cocktail” where theoretical and practical aspects complement each other, e.g.
robustness makes it possible to lower the probability of detection in the DTD. This in turn results
in better performance (lower false alarm rate) of the DTD during echo path changes and thus faster
convergence of the echo canceler.
So far, we have confined our study to the fullband case of adaptive filtering. The presented ideas
could of course be generalized to a subband or frequency domain scheme. For example, in a subband
implementation it is not required to have double-talk detectors in each subband, but rather in a
few selected ones. This would significantly decrease overall complexity. The FNCC detector uses
intermediate calculations of the FRLS algorithm. Another way of implementing this detector is to
base the detection statistic on a projection of the near-end data onto a subspace spanned by the
far-end signal. An algorithm that relies on this principle is the affine projection algorithm, [21]. In
this case, we can use the same techniques as the fast affine projection algorithm, [22], in order to
have a fast version of the DTD. It is also noted that for a projection order equal to one, we get the
normalized least mean square algorithm which will result in a very low complexity system of DTD
and AEC.

Acknowledgments
The authors would like to thank G. W. Elko and M. M. Sondhi for discussions and comments.

Tomas Gaensler

MH-10009657-TG/JB-tg Jacob Benesty

Atts.
References
Appendices A-D
Tables 6
Figures 2-11

³ In the sense that it is a maximum likelihood algorithm if the disturbance is distributed according to our assumptions.

References
[1] M. M. Sondhi, “An adaptive echo canceler,” The Bell Syst. Techn. J., vol. XLVI, no. 3, pp. 497–510, March 1967.
[2] D. L. Duttweiler, “A twelve-channel digital echo canceler,” IEEE Trans. Commun., vol. 26, no. 5, pp. 647–653, May 1978.
[3] H. Ye and B. X. Wu, “A new double-talk detection algorithm based on the orthogonality theorem,” IEEE Trans. Commun., vol. 39, no. 11, pp. 1542–1545, Nov. 1991.
[4] R. D. Wesel, “Cross-correlation vectors and double-talk control for echo cancellation,” Unpublished work, 1994.
[5] T. Gänsler, M. Hansson, C-J. Ivarsson, and G. Salomonsson, “A double-talk detector based on coherence,” IEEE Trans. Commun., vol. 44, no. 11, pp. 1421–1427, Nov. 1996.
[6] J. Benesty, D. R. Morgan, and J. H. Cho, “A new class of doubletalk detectors based on cross-correlation,” IEEE Trans. Speech and Audio Processing, to appear.
[7] K. Ochiai, T. Araseki, and T. Ogihara, “Echo canceler with two echo path models,” IEEE Trans. Commun., vol. COM-25, no. 6, pp. 589–595, June 1977.
[8] L. L. Scharf, Statistical Signal Processing, Electrical and Computer Engineering: Digital Signal Processing. Addison Wesley, 1991.
[9] J. H. Cho, D. R. Morgan, and J. Benesty, “An objective technique for evaluating doubletalk detectors in acoustic cancelers,” IEEE Trans. Speech Audio Processing, vol. 7, no. 6, pp. 718–724, Nov. 1999.
[10] T. Gänsler, S. L. Gay, M. M. Sondhi, and J. Benesty, “Double-talk robust fast converging algorithms for network echo cancellation,” IEEE Trans. Speech Audio Processing, to appear.
[11] P. J. Huber, Robust Statistics, pp. 68–71, 135–138, New York: Wiley, 1981.
[12] S. A. Kassam and H. V. Poor, “Robust techniques for signal processing: a survey,” Proc. of the IEEE, vol. 73, pp. 433–481, Mar. 1985.
[13] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall PTR, Upper Saddle River, N.J., 1993.
[14] S. Haykin, Adaptive Filter Theory, Information and System Sciences Series. Prentice-Hall, 3rd edition, 1995.
[15] J. Benesty and T. Gaensler, “A robust fast recursive least squares adaptive algorithm,” Lucent, Bell Laboratories technical memorandum, 10006957-991213-34TM, 1999.
[16] M. G. Bellanger, Adaptive Digital Filters and Signal Analysis, Marcel Dekker, 3rd edition, 1987.
[17] J. Schroeter, “A high-quality speech database,” AT&T Bell Laboratories technical memorandum, BL011227-921202-07TM, 1992.
[18] P. Eneroth, S. L. Gay, J. Benesty, and T. Gaensler, “Real-time implementation of a stereophonic acoustic echo canceler,” Lucent Bell Laboratories technical memorandum, BL011332-981217-30TM, 1998.
[19] D. A. Berkley and J. L. Flanagan, “HuMaNet: an experimental human-machine communications network based on ISDN wideband audio,” AT&T Tech. J., vol. 69, pp. 87–99, Sept./Oct. 1990.
[20] J. Benesty, D. R. Morgan, and M. M. Sondhi, “A better understanding and an improved solution to the specific problems of stereophonic acoustic echo cancellation,” IEEE Trans. Speech Audio Processing, vol. 6, no. 2, pp. 156–165, March 1998.
[21] K. Ozeki and T. Umeda, “An adaptive filtering algorithm using an orthogonal projection to an affine subspace and its properties,” Electr. and Commun. in Japan, vol. 67-A, no. 5, 1984.
[22] S. L. Gay and S. Tavathia, “The fast affine projection algorithm,” in Proc. IEEE ICASSP, 1995, pp. 3023–3026.

A The Misalignment Formula
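In general, the misalignment is dependent on the type of algorithm. In this appendix, we derive the
expected misalignment for the recursive least squares algorithm. Define the expected value of the
misalignment as

$$E\{\varepsilon^2(n)\} = \frac{E\{\|\mathbf{h} - \hat{\mathbf{h}}(n)\|^2\}}{\|\mathbf{h}\|^2}. \qquad (36)$$

Let $\boldsymbol{\epsilon}(n) = \mathbf{h} - \hat{\mathbf{h}}(n)$. For the RLS algorithm it is known that, [14],

$$E\{\boldsymbol{\epsilon}^T(n)\boldsymbol{\epsilon}(n)\} = \mathrm{tr}\{\mathbf{K}(n)\}, \qquad (37)$$
$$\mathbf{K}(n) = \sigma_v^2 (1 - \lambda)\mathbf{R}^{-1}, \qquad (38)$$

where $\mathbf{R} = E\{\mathbf{x}(n)\mathbf{x}^T(n)\}$, ($L \times L$). Assuming the signal x(n) is uncorrelated with variance $\sigma_x^2$ we
find

$$E\{\varepsilon^2(n)\} = \frac{\sigma_v^2 (1 - \lambda) L}{\sigma_x^2 \|\mathbf{h}\|^2} = \frac{(1 - \lambda) L}{\mathrm{ENR}}. \qquad (39)$$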
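
As a worked instance of (39), the sketch below evaluates the expected misalignment for the single-channel settings of Section 5.2 (L = 1024, λ = 1 − 1/(3L), ENR = 0 dB). It relies on the same white-input assumption as (39); the function name is hypothetical.

```python
import math

def expected_misalignment(L, lam, enr_db):
    """Expected normalized misalignment (1 - lambda) * L / ENR, ENR given in dB."""
    enr = 10.0 ** (enr_db / 10.0)
    return (1.0 - lam) * L / enr

L = 1024
lam = 1.0 - 1.0 / (3.0 * L)
m = expected_misalignment(L, lam, 0.0)  # (1/(3L)) * L / 1 = 1/3
m_db = 10.0 * math.log10(m)             # roughly -4.8 dB
```

With this choice of λ the filter length cancels, so under these assumptions the result depends only on the factor 3 in λ and on the ENR.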

B Derivation of Probability of Miss


Assume we have the NCC detector in (25). This statistic has a certain variance and its probability
of miss is given by

$$P_m = P\{\xi(n) \geq T \mid \text{double-talk is present}\}. \qquad (40)$$

As the threshold T increases, this probability decreases. Let us normalize ξ(n) by dividing by its
standard deviation, $\sqrt{E\{\xi^2(n)\}}$. Now

$$E\{\xi^2(n)\} \approx \frac{\sigma_y^2}{\sigma_y^2 + \sigma_v^2} = \frac{1}{1 + \frac{1}{\mathrm{ENR}}}, \qquad (41)$$

which leads to

$$P_m = P\left\{\frac{\xi(n)}{\sqrt{E\{\xi^2(n)\}}} \geq T\sqrt{1 + \frac{1}{\mathrm{ENR}}}\right\}. \qquad (42)$$

We now see the influence that the echo-to-near-end ratio has on the probability of miss: as ENR
decreases, the normalized threshold $T\sqrt{1 + 1/\mathrm{ENR}}$ grows, and so $P_m$ decreases.
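
The dependence on ENR in (42) can be made concrete by evaluating the normalized threshold T·sqrt(1 + 1/ENR) for a few ratios. This is an illustrative calculation only; the function name and the chosen ENR values are assumptions, and T = 0.85 is the FNCC threshold from Section 5.2.

```python
import math

def normalized_threshold(T, enr_db):
    """Right-hand side of (42): T * sqrt(1 + 1/ENR), with ENR given in dB."""
    enr = 10.0 ** (enr_db / 10.0)
    return T * math.sqrt(1.0 + 1.0 / enr)

T = 0.85
levels_db = (10.0, 0.0, -10.0)
thresholds = [normalized_threshold(T, enr_db) for enr_db in levels_db]
# The list is increasing: a stronger near-end talker (lower ENR) raises the
# level the normalized statistic must reach for a miss to occur.
```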

C The Neyman-Pearson Detector


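Assume we have the following binary hypothesis test:

$$H_0: \; y(n) = \mathbf{h}^T\mathbf{x}(n) \qquad (43)$$
$$H_1: \; y(n) = v(n) + \mathbf{h}^T\mathbf{x}(n) \qquad (44)$$

and let $\mathbf{y}(n) = [y(n)\; y(n-1)\; \ldots\; y(n-N+1)]^T$. The Neyman-Pearson detection statistic
(likelihood ratio) is then given by

$$L[\mathbf{y}(n)] = \frac{p[\mathbf{y}(n) \mid H_1]}{p[\mathbf{y}(n) \mid H_0]} > \gamma''. \qquad (46)$$

Imposing the assumption of white Gaussian signals⁴ we find the following log-likelihood function:

$$l'[\mathbf{y}(n)] = \underbrace{\frac{N}{2}\ln\left\{\frac{\sigma_{h^T x}^2}{\sigma_v^2 + \sigma_{h^T x}^2}\right\}}_{\text{Independent of data}} + \frac{1}{2}\,\frac{\sigma_v^2}{\sigma_{h^T x}^2\left(\sigma_v^2 + \sigma_{h^T x}^2\right)}\,\mathbf{y}^T(n)\mathbf{y}(n) > \gamma', \qquad (47)$$
$$\Updownarrow \qquad (48)$$
$$l[\mathbf{y}(n)] = \mathbf{y}^T(n)\mathbf{y}(n) > \gamma. \qquad (49)$$

It is sufficient to use $l[\mathbf{y}(n)]$ as test statistic since only the term $\mathbf{y}^T(n)\mathbf{y}(n)$ is data dependent. To
make the statistic independent of h we can normalize it with $\hat{\mathbf{y}}^T(n)\hat{\mathbf{y}}(n)$, which relates to the variance
of the estimated echo. Furthermore, we invert the result, which gives:

$$\xi^2 = \frac{\hat{\mathbf{y}}^T(n)\hat{\mathbf{y}}(n)}{l[\mathbf{y}(n)]} = \frac{\hat{\mathbf{y}}^T(n)\hat{\mathbf{y}}(n)}{\mathbf{y}^T(n)\mathbf{y}(n)}. \qquad (50)$$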

Comparing (50) with (29) it is found that the latter is an exponential window implementation of the
former which establishes the relation between the NCC detector and the Neyman-Pearson binary
hypothesis test.
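
A block-wise version of the statistic in (50) can be tried on synthetic data. The sketch assumes white Gaussian signals and a perfectly estimated echo (ŷ equal to the true echo), which is far more idealized than the speech case in the memorandum; the function name and block length are hypothetical.

```python
import math
import random

def ncc_statistic(y_hat, y):
    """xi = sqrt(y_hat^T y_hat / y^T y) over one block of samples."""
    num = sum(a * a for a in y_hat)
    den = sum(b * b for b in y)
    return math.sqrt(num / den)

rng = random.Random(1)
N = 4000
echo = [rng.gauss(0.0, 1.0) for _ in range(N)]  # echo, also used as y_hat
near = [rng.gauss(0.0, 1.0) for _ in range(N)]  # near-end v(n) at ENR = 0 dB

xi_h0 = ncc_statistic(echo, echo)                                   # no near-end
xi_h1 = ncc_statistic(echo, [a + b for a, b in zip(echo, near)])    # double-talk
```

Under H0 the statistic equals 1; under H1 at ENR = 0 dB it falls to roughly 1/sqrt(2), which is below a threshold such as T = 0.85 and would therefore be flagged as double-talk.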

D Practical Modifications
Sections 3.5 and 4.1 give the theoretical derivations of the robust FRLS and the FNCC detector.
In a realistic situation with non-stationary input (speech), some of the assumptions used in these
derivations are violated. We have therefore slightly modified the algorithms in order to handle these
problems better.
Table 6 presents the modified system of DTD and robust FRLS used in the simulations in
Section 5. To be able to freely choose a faster response of the FNCC detector than that of the
FRLS algorithm, we introduce a new forgetting factor, $\lambda_1$, in (53) and (54). A preferable choice
is $\lambda_1 = 1 - 2/L$. Theoretically, (29) should always be between 0 and 1. However, in the practical
algorithm where (33) is used, this property can be violated. This may happen when the background
filter is seriously perturbed, e.g. by double-talk, or when an echo path change has occurred. Equations (52)-
(56) below modify the detector such that $\eta^2(n)$ and $\sigma_y^2(n)$ are not updated when $y^2(n) - \varphi(n)e^2(n)$
in (33) is negative.
Equations (51) and (57) are new background and foreground energy estimators, respectively. If the background
filter is performing better than the foreground, and at the same time the non-linear robustness
function is in its saturated region, we know that an echo path change has occurred; thus the convergence
rate is increased by reducing $\psi'(\cdot)$. This control is implemented in (58)-(60).
Finally, robustness is maintained, in spite of detection misses, over longer periods of double-talk
by keeping the scale factor small, (61)-(63).

⁴ This is a very crude approximation of the real situation, which is neither Gaussian nor white.

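Table 6: The FNCC double-talk detector and echo canceler with modifications used in all simulations.

Step                                                                                  Multiplications   Eq.

Double-talk detector
$e_b(n) = y(n) - \hat{\mathbf{h}}_b^T(n-1)\mathbf{x}(n)$                                   PL
$\sigma_{e_b}^2(n) = \lambda_1\sigma_{e_b}^2(n-1) + e_b^2(n)$                              2                 (51)
if $y^2(n) \geq \varphi(n)e_b^2(n)$                                                                          (52)
    $\sigma_y^2(n) = \lambda_1\sigma_y^2(n-1) + y^2(n)$                                    2                 (53)
    $\eta^2(n) = \lambda_1\eta^2(n-1) + y^2(n) - \varphi(n)e_b^2(n)$                       3                 (54)
else
    $\sigma_y^2(n) = \sigma_y^2(n-1)$                                                                        (55)
    $\eta^2(n) = \eta^2(n-1)$                                                                                (56)
end
$\eta(n)/\sigma_y(n) < T \Rightarrow$ double-talk, $\mu = 0$
$\eta(n)/\sigma_y(n) \geq T \Rightarrow$ no double-talk, $\mu = 1$
$\hat{\mathbf{h}}_b(n) = \hat{\mathbf{h}}_b(n-1) + \frac{e_b(n)}{\varphi(n)}\mathbf{k}(n)$   PL

Robust echo cancellation
$e(n) = y(n) - \hat{\mathbf{h}}^T(n-1)\mathbf{x}(n)$                                       PL
$\sigma_e^2(n) = \lambda_1\sigma_e^2(n-1) + e^2(n)$                                        2                 (57)
if $\sigma_{e_b}^2(n) < \sigma_e^2(n)/\sqrt{2}$ & $\psi\left(\frac{e(n)}{s(n)}\right) = k_0$                 (58)
    $\psi'_f\left(\frac{e(n)}{s(n)}\right) = 0.5$                                                            (59)
else
    $\psi'_f\left(\frac{e(n)}{s(n)}\right) = 1$                                                              (60)
end
$\hat{\mathbf{h}}(n) = \hat{\mathbf{h}}(n-1) + \mu\,\frac{s(n)}{\psi'_f\left(\frac{e(n)}{s(n)}\right)\varphi(n)}\,\mathbf{k}(n)\,\psi\left(\frac{e(n)}{s(n)}\right)$   PL
if (no double-talk)                                                                                          (61)
    $s(n+1) = \lambda_s s(n) + (1 - \lambda_s)\,\frac{s(n)}{\psi'_f\left(\frac{e(n)}{s(n)}\right)}\left|\psi\left(\frac{e(n)}{s(n)}\right)\right|$   3   (62)
else
    $s(n+1) = \lambda_s s(n) + (1 - \lambda_s)s_{\min}$                                    2                 (63)
end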
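
The detector part of Table 6, steps (52)-(56) and the threshold test, can be sketched as a small stateful class. The background error e_b(n) and the variable ϕ(n) come from the FRLS recursions, which are not reproduced here, so they are plain inputs. The class name, the initial value of the two estimates, and the toy input values are assumptions that the memorandum does not specify.

```python
import math

class FnccDecision:
    """Bookkeeping of the detection statistic, steps (52)-(56) of Table 6."""

    def __init__(self, lambda1, threshold, init=1e-6):
        self.lambda1 = lambda1
        self.threshold = threshold
        self.sigma_y2 = init  # running estimate sigma_y^2(n)
        self.eta2 = init      # running estimate eta^2(n)

    def step(self, y, e_b, phi):
        """Return mu: 1 when no double-talk is declared (adapt), else 0."""
        innovation = y * y - phi * e_b * e_b
        if innovation >= 0.0:                                     # (52)
            self.sigma_y2 = self.lambda1 * self.sigma_y2 + y * y  # (53)
            self.eta2 = self.lambda1 * self.eta2 + innovation     # (54)
        # otherwise both estimates are held, (55)-(56)
        xi = math.sqrt(self.eta2 / self.sigma_y2)
        return 1 if xi >= self.threshold else 0

L = 1024
dtd = FnccDecision(lambda1=1.0 - 2.0 / L, threshold=0.85)

# Phase 1: background filter cancels perfectly (e_b = 0), so xi stays at 1.
for _ in range(2000):
    mu_clean = dtd.step(y=1.0, e_b=0.0, phi=1.0)

# Phase 2: a large background error, as near-end speech would cause.
for _ in range(2000):
    mu_double_talk = dtd.step(y=1.0, e_b=0.9, phi=1.0)
```

In the second phase the ratio of the two estimates decays toward 1 − 0.81 = 0.19, so the statistic settles near 0.44 and adaptation is frozen (μ = 0).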

[Plot, two panels. (a) Vertical axis: ln[p_(·)(z)]; horizontal axis: z. (b) Horizontal axis: z.]

Figure 2: (a) PDFs on a logarithmic scale. (b) Corresponding ψ(z) functions. Dashed: pG (z),
Dotted: p1 (z), Solid: p2 (z).

[Plot. Horizontal axis: time in ms, 0 to 250.]

Figure 3: Echo path response used to simulate the acoustic path.


[Plot. Vertical axis: probability of detection; horizontal axis: probability of false alarm, 0 to 0.25.]

Figure 4: Receiver operating characteristic (ROC) of the FNCC () and Geigel (∗) DTDs. ENR = 0
dB and SNR = 30 dB.

[Plot, three panels (a)-(c). Vertical axis of (b) and (c): dB; horizontal axis: seconds, 0 to 10.]

Figure 5: Double-talk situation. Robust FRLS (Solid line), FRLS (Dashed line). (a) Far-end speech
(upper), near-end speech (lower). (b) Mean square error with the FNCC DTD. (c) Mean square
error with the Geigel DTD. The rectangles in the bottom of (b) and (c) indicate where double-talk
has been detected.

[Plot, four panels (a)-(d). Vertical axis: dB; horizontal axis: seconds, 0 to 10.]

Figure 6: Mean square error during echo path change. The echo path changes at 2 seconds: (a)
Time shift of 5 samples. (b) Time advance of 14 samples. (c) 6 dB increase of echo path gain. (d) 6
dB decrease of echo path gain. Robust FRLS with the FNCC DTD (Solid line), (non-robust) FRLS
with the FNCC DTD (Dashed line).

[Plot, four panels (a)-(d). Vertical axis: dB; horizontal axis: seconds, 0 to 10.]

Figure 7: Mean square error of the robust FRLS with the Geigel DTD. Robust FRLS with the
Geigel DTD (Solid line), (non-robust) FRLS with the Geigel DTD (Dashed line).

[Diagram. Marked dimensions: 0.35 m and 0.80 m.]

Figure 8: Near-end room microphone and loudspeaker setup.

[Plot, four panels (a)-(d). Horizontal axis: time in ms, 0 to 100.]

Figure 9: Near-end room impulse responses. (a) From left speaker to left microphone. (b) From
right speaker to right microphone. (c) From left speaker to right microphone. (d) From right speaker
to left microphone.

[Plot, four panels (a)-(d). Horizontal axis: frequency in kHz, 0 to 8.]

Figure 10: Magnitude of the frequency response of the near-end room. Other conditions as in
Fig. 9.

[Plot, three panels (a)-(c). Vertical axis of (b) and (c): dB; horizontal axis: seconds, 0 to 6.]

Figure 11: Double-talk situation. Robust two-channel FRLS with FNCC DTD. (a) Left channel
far-end speech (upper), near-end speech (lower). (b) Mean square error of left channel. (c) Mean
square error of right channel. Other conditions same as in Fig. 5.

Bell Laboratories
Document Cover Sheet for Technical Memorandum

Title: A Fast Normalized Cross-Correlation DTD Combined with a Robust Multi-Channel Fast Recursive Least Squares Algorithm

Authors Electronic Address Location Phone Company (if other than Lucent–BL)
Tomas Gaensler gaensler@bell-labs.com MH 2D-514 (908) 582-7287
Jacob Benesty jbenesty@bell-labs.com MH 2D-518 (908) 582-7643

Document No. Filing Case No. Work Project No.


10009657-000218-02TM 38794-43 MA30033002

Keywords:
Adaptive Filter, RLS, FRLS, Robustness, Double-Talk Detector, Acoustic Echo Cancellation

MERCURY Announcement Bulletin Sections


CMM– Communications MAS– Math and Statistics

Abstract

In acoustic echo cancellation, double-talk detection is a challenging problem. This is mainly due to the fact that the
echo path is time-varying. As a result, it is hard to discriminate between echo path changes and double-talk. In this
technical memorandum we develop a fast normalized cross-correlation (FNCC) method for double-talk detection that
is independent of echo path gain but relies on some of the computations performed in the fast recursive least squares
algorithm (FRLS) used in the acoustic echo canceler (AEC). For highest performance of the AEC during double-talk
and echo path changes, robustness is introduced in the FRLS algorithm. The two-channel FRLS algorithm has been
proven to perform well for stereophonic acoustic echo cancellation. We therefore generalize the algorithms for the
multi-channel case. The combination of the FNCC detector and a robust FRLS algorithm results in a system with
extremely low sensitivity to double-talk, few detection errors, and fast convergence after echo path changes.

Pages of Text 17 Other Pages 16 Total 33


No. Figs. 11 No. Tables 6 No. Refs. 22



Initial Distribution Specifications 10009657-000218-02TM (page ii of ii)

Cover Sheet Only

A. V. Aho (NJ7460 2C-447A)
W. F. Brinkman (NJ9620 1C-224)
A. N. Netravali (NJ9620 6A-412)
W. H. Ninke (NJ9620 2B-403)
I. Saniee (NJ9620 2C-300)
R. D. Slusky (NJ9620 3B-503)
MTS 1133

Complete Copy

S. R. Ahuja (NJ9620 2D-540)
DH 1133
SUP 1133
MTS 11332
J. C. Baumhauer (IND590 0-12)
K. M. Brown (NJ9620 3B-519)
J. Cezanne (NJ7460 2F-202A)
J. Chen (NJ7460 1L-415)
C.-S. Chuang (NJ7460 2G-230)
D. L. Duttweiler (NJ7460 4E-538)
W. E. Keasler (NJ7460 2N-203C)
V. B. Lawrence (NJ7460 2F-602A)
T. L. Marzetta (NJ9620 2C-373)
J. E. Mazo (NJ9620 2C-182)
D. Mitra (NJ9620 2C-325)
K. Ramanan (NJ9620 2C-319)
D. G. Shaw (NJ7460 1C-419)
A. K. Stenard (NJ7460 2D-329)

Business Communication Systems


J. R. Celli (NJ7460 1E-631A)
W. A. Ford (NJC240 3K-474)

Network Systems
G. A. Baylock (NJ7460 2D-323)
S. K. Gupta (NJ7460 2E-338)
C. S. Ryan (NJ7460 2D-319)

Microelectronics
E. Diethorn (NJ9620 2C-578)

Future Lucent Distribution Release to any Lucent employee (excluding contract employees)

Author Signatures

Tomas Gaensler Jacob Benesty

Organizational Approvals

M. M. Sondhi G. W. Elko B. H. Juang - DH
