Professional Documents
Culture Documents
ABSTRACT AUDIO
CQT
TRANSCRIPTION HMM PIANO-ROLL
MODEL POSTPROCESSING
MIDI scale
80
Flute 60 96
Guitar 40 76 60
Harpsichord 28 88
40
Oboe 58 91
100 200 300 400 500 600 700 800 900 1000
Piano 21 108 (a)
Violin 55 100
100
Table 1. MIDI note range of the instrument templates used
MIDI scale
in the proposed transcription system. 80
60
the spectral template P (|z), where is the log-frequency
40
index. In Table 1, the pitch range of each instrument used
100 200 300 400 500 600 700 800 900 1000
for template extraction is shown.
(b)
2.2 Transcription Model Figure 2. (a) The transcription matrix P (p, t) of the first
Utilizing the extracted instrument templates and by extend- 10s of the MIREX woodwind quintet. (b) The pitch ground
ing the shift-invariant PLCA algorithm, a model is pro- truth of the same recording. The abscissa corresponds to
posed which supports the use of multiple pitch and instru- 10ms.
ment templates in a convolutive framework, thus support- P
ing tuning changes and frequency modulations. By consid- ,s P (p, f, s|, t)P (, t)
ering the input CQT spectrum as a probability distribution P (f |p, t) = P (5)
f,,s P (p, f, s|, t)P (, t)
P (, t), the proposed model can be formulated as:
P
X
,f P (p, f, s|, t)P (, t)
P (, t) = P (t) P (|s, p) P (f |p, t)P (s|p, t)P (p|t) P (s|p, t) = P (6)
p,s s,,f P (p, f, s|, t)P (, t)
(1) P
where P (|s, p) is the spectral template that belongs to in- ,f,s P (p, f, s|, t)P (, t)
P (p|t) = P (7)
strument s and MIDI pitch p = 21, . . . , 108, P (f |p, t) is p,,f,s P (p, f, s|, t)P (, t)
the time-dependent impulse distribution that corresponds
to pitch p, P (s|p, t) is the instrument contribution for each It should be noted that since the instrument-pitch tem-
pitch in a specific time frame, and P (p|t) is the pitch prob- plates have been extracted during the training stage, the up-
ability distribution for each time frame. date rule for the templates (4) is not used, but is included
By removing the convolution operator, the model of (1) for the sake of completeness. Using these constant tem-
can be expressed as: plates, convergence is quite fast, usually requiring 10-20
X iterations. The resulting piano-roll transcription matrix is
P (, t) = P (t) P (f |s, p)P (f |p, t)P (s|p, t)P (p|t) given by:
p,f,s
(2) P (p, t) = P (t)P (p|t) (8)
In order to only utilize each template P (|s, p) for detect-
ing the specific pitch p, the convolution of P (|s, p) In Fig. 2, the transcription matrix P (p, t) for an excerpt of
P (f |p, t) takes place using an area spanning one semitone the MIREX multi-F0 woodwind quintet recording can be
around the ideal position of p. Since 60 bins per octave are seen, along with the corresponding pitch ground truth.
used in the CQT spectrogram, f has a length of 5. In order for the algorithm to provide as meaningful so-
The various parameters in (1) can be estimated using lutions as possible, sparsity is encouraged on transcription
iterative update rules derived from the EM algorithm. For matrix P (p|t), expecting that only few notes are present at
the expectation step the update rule is: a given time frame. In addition, sparsity can be enforced
P (p, f, s|, t) = to matrix P (s|p, t), meaning that for each pitch at a given
time frame, only a few instrument sources contributes to its
P ( f |s, p)P (f |p, t)P (s|p, t)P (p|t) production. The same technique used in [5] was employed
P (3)
p,f,s P ( f |s, p)P (f |p, t)P (s|p, t)P (p|t) for controlling sparsity, by modifying the update equations
For the maximization step, the update equations for the (6) and (7):
proposed model are: P
P ,f P (p, f, s|, t)P (, t)
f,t P (p, f, s| + f, t)P ( + f, t) P (s|p, t) = P P (9)
P (|s, p) = P (4)
,t,f P (p, f, s| + f, t)P ( + f, t) s ,f P (p, f, s|, t)P (, t)
P
For the Piano-only Note Tracking task, the submitted
,f,s P (p, f, s|, t)P (, t)
P (p|t) = (10) system (BD3) ranked 1st/2nd using the onset-only met-
P P rics and 5th using the onset-offset metrics. This again
p ,f,s P (p, f, s|, t)P (, t)
indicates that the system over-estimated note durations.
By setting , > 1, the entropy in matrices P (s|p, t) and
P (p|t) is lowered and sparsity is enforced.
4. REFERENCES
2.3 Postprocessing [1] E. Benetos and S. Dixon. Multiple fundamental fre-
Instead of simply thresholding P (p, t) for extracting the quency estimation using spectral structure and tem-
piano-roll transcription as in [5], additional postprocessing poral evolution rules. In Music Information Retrieval
is applied in order to perform note smoothing and tracking. Evaluation eXchange, Utrecht, Netherlands, August
Hidden Markov models (HMMs) [9] have been used in the 2010.
past for note smoothing in signal processing-based tran- [2] E. Benetos and S. Dixon. Multiple-instrument poly-
scription approaches (e.g. [8]). Here, a similar approach to phonic music transcription using a convolutive prob-
the HMM smoothing procedure employed in [8] is used, abilistic model. In 8th Sound and Music Computing
but modified for the probabilistic framework of the pro- Conf., pages 1924, July 2011.
posed transcription system.
Each pitch p is modeled by a two-state HMM, denot- [3] V. Emiya, R. Badeau, and B. David. Multipitch estima-
ing pitch activity/inactivity. The hidden state sequence for tion of piano sounds using a new probabilistic spectral
each pitch is given by Qp = {qp [t]}. MIDI files from smoothness principle. IEEE Trans. Audio, Speech, and
the RWC database [4] from the classic and jazz subgen- Language Processing, 18(6):16431654, August 2010.
res were employed in order to estimate the state priors
P (qp [1]) and the state transition matrix P (qp [t]|qp [t 1]) [4] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka.
for each pitch p. For each pitch, the most likely state se- RWC music database: music genre database and musi-
quence is given by: cal instrument sound database. In Int. Conf. Music In-
formation Retrieval, October 2003.
p = arg max
Y
Q P (qp [t]|qp [t 1])P (op [t]|qp [t]) (11)
qp [t] t [5] G. Grindlay and D. Ellis. A probabilistic subspace
which can be computed using the Viterbi algorithm [9]. model for multi-instrument polyphonic transcription.
For estimating the observation probability for each active In 11th Int. Society for Music Information Retrieval
pitch P (op [t]|qp [t] = 1), we use a sigmoid curve which Conf., pages 2126, August 2010.
has as input the transcription piano-roll P (p, t) from the [6] A. Klapuri and M. Davy, editors. Signal Process-
output of the transcription model: ing Methods for Music Transcription. Springer-Verlag,
1 New York, 2nd edition, 2006.
P (op [t]|qp [t] = 1) = (12)
1 + eP (p,t)
[7] G. Mysore and P. Smaragdis. Relative pitch estimation
The result of the HMM postprocessing step is a binary of multiple instruments. In IEEE Int. Conf. Acoustics,
piano-roll transcription which can be used for evaluation. Speech, and Signal Processing, pages 313316, April
2009.
2.4 System Variants
Three variants of the system are utilized for MIREX eval- [8] G. Poliner and D. Ellis. A discriminative model for
uation; one trained on orchestral instruments only for the polyphonic piano transcription. EURASIP J. Advances
multiple-F0 estimation task (BD1), one trained on orches- in Signal Processing, (8):154162, January 2007.
tral instruments plus piano for the note tracking task (BD2), [9] L. R. Rabiner. A tutorial on hidden Markov models
and a system trained on piano templates for the piano-only and selected applications in speech recognition. Proc.
note tracking task (BD3). of the IEEE, 77(2):257286, February 1989.