
Solving the cocktail party problem using deep neural networks

For many years, the cocktail party problem has been considered the holy grail of speech processing. To solve the cocktail party problem, the speech signals of all speakers recorded by a single microphone have to be retrieved. However, these speakers can talk simultaneously, which makes the source (or speaker) separation problem much harder. Furthermore, most applications require the separation algorithm to be speaker independent, which means that no prior information about the speakers is known.

If we were able to determine a speech track for every speaker present, this would be of great help in applications such as hearing aids and automatic transcription of meetings, as well as a preprocessing stage for voice command applications and natural language interfaces such as Siri, Google Now, Cortana and so on.

Recently (2016), major steps have been made in solving the cocktail party problem using Deep Neural Networks (DNNs). In general, DNNs try to retrieve high-level features from low-level (or input) features, using multiple layers of hidden units. For this task we want to know which parts (time-frequency bins) of the recorded audio spectrogram belong to which speaker. The network proposed in [1] maps each bin of the audio spectrogram to a so-called embedding space, after which a simple clustering mechanism is used to assign bins to the corresponding speaker.
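
To make this mapping concrete, the following is a minimal sketch of such a deep-clustering-style pipeline. It is not the baseline code of this project nor the exact architecture of [1]: a recurrent network turns every time-frequency bin of the mixture spectrogram into a short embedding vector, and k-means clustering of those vectors yields a binary mask per speaker. The use of TensorFlow/Keras with scikit-learn and all sizes (embedding dimension, LSTM width, spectrogram shape) are illustrative assumptions.

# Minimal sketch of a deep-clustering-style separation pipeline (illustrative only).
# Assumes TensorFlow 2.x / Keras and scikit-learn; sizes are not those of [1].
import numpy as np
import tensorflow as tf
from sklearn.cluster import KMeans

T, F, D = 100, 129, 20        # time frames, frequency bins, embedding dimension

def build_embedding_net():
    """Map every time-frequency bin of a (T, F) spectrogram to a D-dim embedding."""
    inp = tf.keras.Input(shape=(T, F))
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(300, return_sequences=True))(inp)
    x = tf.keras.layers.Dense(F * D, activation='tanh')(x)       # one D-vector per bin
    emb = tf.keras.layers.Reshape((T, F, D))(x)
    emb = tf.keras.layers.Lambda(
        lambda t: tf.math.l2_normalize(t, axis=-1))(emb)         # unit-norm embeddings
    # In [1] the network is trained so that embeddings of bins dominated by the same
    # speaker end up close together (the deep clustering loss); training is omitted here.
    return tf.keras.Model(inp, emb)

def separate(model, mixture_spec, num_speakers=2):
    """Cluster the bin embeddings and turn the cluster labels into binary masks."""
    emb = model(mixture_spec[np.newaxis])[0].numpy()             # (T, F, D)
    labels = KMeans(n_clusters=num_speakers).fit_predict(emb.reshape(-1, D))
    masks = [(labels.reshape(T, F) == k) for k in range(num_speakers)]
    return [mixture_spec * m for m in masks]                     # masked spectrograms

model = build_embedding_net()
estimates = separate(model, np.random.rand(T, F).astype(np.float32))   # dummy input
print([e.shape for e in estimates])                              # [(100, 129), (100, 129)]

Applying the mask to the mixture spectrogram and inverting the transform then gives one estimated signal per speaker.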
Impressive results are achieved. However, the generalizability and robustness of this technique can be questioned. The multi-speaker mixtures are artificially created by mixing together two or more independent utterances. These utterances come from the Wall Street Journal (WSJ) database, which contains studio recordings of sentences from the Wall Street Journal read out loud. It is unclear how the DNN would perform in other scenarios where one or more of the following changes are made (a simulation sketch follows the list):
• Microphone: Different microphones have different transfer functions in the frequency domain. For example, the spectrogram of a mixture recorded with a high-quality microphone will differ from that of one recorded with a (cell)phone.
• Reverberation: How well does the DNN cope with reverberation? Is there a difference between outdoors and indoors? How much reverberation can be tolerated?
• Read versus spontaneous speech: The WSJ database consists of read sentences. The way we talk spontaneously differs from the way we read out loud.
• (Non-)stationary noise: In the original experiments of [1], there are no added noise sources, only speech. Does speech source separation still work in the presence of stationary noise (e.g. a fan) and non-stationary noise (refrigerator, construction site, music, …)?
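
To make these scenarios concrete, the sketch below shows one way they could be simulated on clean utterances before mixing: a crude band-pass "microphone" filter, convolution with a room impulse response, and additive noise at a chosen signal-to-noise ratio. The filter settings, the synthetic impulse response, the SNR and the placeholder signals are assumptions for illustration; real experiments would use recorded impulse responses and noise.

# Minimal sketch of simulating more realistic recording conditions (illustrative only).
import numpy as np
from scipy.signal import butter, lfilter, fftconvolve

fs = 16000                                   # sampling rate in Hz

def phone_microphone(speech):
    """Crude microphone model: band-pass filter mimicking a narrow phone response."""
    b, a = butter(4, [300 / (fs / 2), 3400 / (fs / 2)], btype='band')
    return lfilter(b, a, speech)

def add_reverberation(speech, rir):
    """Convolve dry speech with a (measured or simulated) room impulse response."""
    return fftconvolve(speech, rir)[:len(speech)]

def add_noise(speech, noise, snr_db):
    """Scale a noise recording (fan, music, ...) to the requested SNR and add it."""
    noise = np.resize(noise, len(speech))
    gain = np.sqrt(np.sum(speech ** 2) / (np.sum(noise ** 2) * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Placeholder signals standing in for loaded utterances, an RIR and a noise file.
spk1, spk2 = np.random.randn(3 * fs), np.random.randn(3 * fs)
noise = np.random.randn(3 * fs)
rir = np.random.randn(fs // 2) * np.exp(-np.arange(fs // 2) / (0.05 * fs))   # decaying tail

# Degraded two-speaker mixture for a robustness experiment:
mixture = add_noise(phone_microphone(add_reverberation(spk1, rir)) +
                    phone_microphone(add_reverberation(spk2, rir)),
                    noise, snr_db=10)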

In the first phase you will analyze to what extent the DNN lacks robustness in such scenarios. Since this technique is new, little research has been done on this so far. Afterwards, you will investigate how to adapt the network to improve performance in these more realistic scenarios. Experiments will be done using TensorFlow, a toolkit for research on DNNs. Baseline code will be provided.
Promotor
Hugo Van hamme (ESAT-A 02.84)
Supervision
Jeroen Zegers (ESAT-A 02.87)
Workload
Literature and study: 20%
Analysis and problem statement: 40%
Implementation and experimenting: 40%
Number of students
1

[1] Hershey, J. R.; Chen, Z.; Le Roux, J.; Watanabe, S., "Deep Clustering: Discriminative Embeddings for Segmentation and Separation", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31-35.
