Abstract
Recent advances in communication, digital music creation, and computer storage technology have led to rapid growth of online music repositories in both number and size, where automatic content-based indexing is critical for users to identify potentially favorite music pieces. Timbre recognition is one of the important subtasks for such indexing. Much research has been carried out to explore new sound features that describe the characteristics of a musical sound. The Moving Picture Experts Group (MPEG) provides a standard set of multimedia features, including low-level acoustical features based on the latest research in this area. This paper introduces our newly designed temporal features for automatic indexing of musical sounds and evaluates them against MPEG-7 descriptors and other popular features.
1. Introduction
In recent years, researchers have investigated a large number of acoustical features for building computational models of automatic music timbre estimation. Timbre is the quality of sound that distinguishes one musical instrument from another, across a wide variety of instrument families and individual categories. It is a rather subjective quality, defined by ANSI as the attribute of auditory sensation in terms of which a listener can judge that two sounds, similarly presented and having the same loudness and pitch, are different. Such a definition is subjective and of little direct use for automatic timbre classification. Therefore, musical sounds must be very carefully parameterized to allow automatic timbre recognition. The practical use of timbre-based grouping of music is discussed in detail in [3]. Several specific challenges arise in this parameterization.
2. Features

The feature set combines the MPEG-7 low-level audio descriptors with other popular spectral, harmonic and temporal features. The main descriptors are summarized below.

MPEG-7 spectrum centroid and spread. With $P_x(n)$ the power spectrum of a frame computed from an $NFFT$-point FFT and $f(n)$ the frequency of coefficient $n$,
$$C = \frac{\sum_n \log_2(f(n)/1000)\, P_x(n)}{\sum_n P_x(n)}, \qquad S = \sqrt{\frac{\sum_n \big(\log_2(f(n)/1000) - C\big)^2 P_x(n)}{\sum_n P_x(n)}}.$$

MPEG-7 spectrum flatness. For each frequency band $b$ spanning coefficients $il(b)$ to $ih(b)$,
$$SFM_b = \frac{\Big(\prod_{i=il(b)}^{ih(b)} c(i)\Big)^{\frac{1}{ih(b)-il(b)+1}}}{\frac{1}{ih(b)-il(b)+1}\sum_{i=il(b)}^{ih(b)} c(i)},$$
where $c(i)$ are the power spectrum coefficients of the band.

MPEG-7 spectrum basis and projection. Each frame $x_t$ of the decibel-scale spectrum $P(n) = 10\log_{10} P_x(n)$ is normalized by its norm $r_t = \sqrt{\sum_k x_{t,k}^2}$, giving $\tilde{x}_t = x_t / r_t$. The normalized frames are stacked into
$$\tilde{X} = \big[\tilde{x}_1 \;\; \tilde{x}_2 \;\; \cdots \;\; \tilde{x}_M\big]^T,$$
the singular value decomposition $\tilde{X} = USV^T$ is computed, the first $k$ basis functions $V_K = [v_1 \; v_2 \; \cdots \; v_k]$ are retained, and each frame is projected onto them:
$$y_t = \big[\, r_t \;\; \tilde{x}_t^T v_1 \;\; \tilde{x}_t^T v_2 \;\; \cdots \;\; \tilde{x}_t^T v_k \,\big].$$

MPEG-7 harmonicity. With $s(j)$ the signal samples, the harmonic ratio of frame $i$ is the maximum over lag $k$ of the normalized autocorrelation
$$r(i,k) = \frac{\sum_{j=m}^{m+n-1} s(j)\, s(j-k)}{\sqrt{\sum_{j=m}^{m+n-1} s(j)^2 \sum_{j=m}^{m+n-1} s(j-k)^2}}, \qquad H(i) = \max_k r(i,k).$$
The upper limit of harmonicity is obtained from the ratio
$$a(f_{lim}) = \frac{\sum_{f=f_{lim}}^{f_{max}} p'(f)}{\sum_{f=f_{lim}}^{f_{max}} p(f)},$$
where $p'(f)$ is the power spectrum of the comb-filtered signal and $p(f)$ that of the original signal.

MPEG-7 harmonic spectral centroid, spread, deviation and variation. With $A(i,k)$ and $f(i,k)$ the amplitude and frequency of harmonic $k$ in frame $i$, $K = nb\_harmo$ harmonics and $nb\_frames$ frames,
$$IHSC(i) = \frac{\sum_{k=1}^{K} f(i,k)\, A(i,k)}{\sum_{k=1}^{K} A(i,k)}, \qquad HSC = \frac{1}{nb\_frames}\sum_{i=1}^{nb\_frames} IHSC(i),$$
$$IHSS(i) = \frac{1}{IHSC(i)} \sqrt{\frac{\sum_{k=1}^{K} \big(f(i,k) - IHSC(i)\big)^2 A(i,k)^2}{\sum_{k=1}^{K} A(i,k)^2}}, \qquad HSS = \frac{1}{nb\_frames}\sum_{i} IHSS(i).$$
The spectral envelope is $SE(i,1) = \frac{A(i,1)+A(i,2)}{2}$ and $SE(i,k) = \frac{1}{3}\sum_{j=-1}^{1} A(i,k+j)$ for $k = 2, \ldots, K-1$, and
$$IHSD(i) = \frac{\sum_{k=1}^{K} \big|\log_{10} A(i,k) - \log_{10} SE(i,k)\big|}{\sum_{k=1}^{K} \log_{10} A(i,k)}, \qquad HSD = \frac{1}{nb\_frames}\sum_{i} IHSD(i),$$
$$IHSV(i) = 1 - \frac{\sum_{k=1}^{K} A(i-1,k)\, A(i,k)}{\sqrt{\sum_{k=1}^{K} A(i-1,k)^2}\, \sqrt{\sum_{k=1}^{K} A(i,k)^2}}, \qquad HSV = \frac{1}{nb\_frames}\sum_{i} IHSV(i).$$

MPEG-7 temporal descriptors. With $T0$ and $T1$ the times at which the signal envelope starts and reaches its maximum, and $SEnv(n)$ the signal envelope sampled at rate $sr$,
$$LAT = \log_{10}(T1 - T0), \qquad TC = \frac{\sum_{n=1}^{length(SEnv)} (n/sr)\, SEnv(n)}{\sum_{n=1}^{length(SEnv)} SEnv(n)}.$$

Other spectral features. With $S(k)$ the magnitude spectrum of the whole sound and $X_i(k)$ that of frame $i$,
$$SC = \frac{\sum_{k=1}^{NFFT} f(k)\, S(k)}{\sum_{k=1}^{NFFT} S(k)}, \qquad C_i = \frac{\sum_{k=1}^{N/2} f(k)\, X_i(k)}{\sum_{k=1}^{N/2} X_i(k)},$$
the spectral rolloff of frame $i$ is the frequency $R_i$ for which $\sum_{k=1}^{R_i} X_i(k) = C \sum_{k=1}^{N/2} X_i(k)$ for a fixed fraction $C$, and the spectral flux is
$$F_i = \sum_{k=1}^{N/2} \big( X_i(k) - X_{i-1}(k) \big)^2.$$

Harmonic amplitude features. With $A_n$ the amplitude of harmonic partial $n$ and $f_n$ its frequency,
$$Od = \frac{\sum_{k=2} A_{2k-1}^2}{\sum_n A_n^2}, \qquad Ev = \frac{\sum_{k=1} A_{2k}^2}{\sum_n A_n^2}, \qquad B = \frac{\sum_n n\, A_n}{\sum_n A_n},$$
$$Tr_1 = \frac{A_1^2}{\sum_n A_n^2}, \qquad h_{3,4} = \frac{\sum_{i=3}^{4} A_i^2}{\sum_j A_j^2}, \qquad h_{5,6,7} = \frac{\sum_{i=5}^{7} A_i^2}{\sum_j A_j^2}, \qquad h_{8,9,10} = \frac{\sum_{i=8}^{10} A_i^2}{\sum_j A_j^2},$$
$$fd_m = \frac{\sum_k A_k \big(f_k/(k f_1)\big)}{\sum_k A_k}.$$

Zero crossings. The sign function used for the zero-crossing based features is
$$sign(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0. \end{cases}$$
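To make the descriptor definitions above concrete, the following minimal Python sketch (our illustration, not the implementation used in this work) computes three of them, the frame-level spectral centroid, the log attack time, and the temporal centroid, for a raw waveform; the FFT size, frame length, and the 2% envelope threshold used to locate the start of the attack are illustrative assumptions.

```python
import numpy as np

def spectral_centroid(frame, sr, nfft=2048):
    """Spectral centroid C_i = sum(f(k) X(k)) / sum(X(k)) of one frame."""
    spectrum = np.abs(np.fft.rfft(frame, n=nfft))          # magnitude spectrum X(k)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / sr)               # bin frequencies f(k)
    return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)

def rms_envelope(signal, frame=512):
    """Crude RMS envelope, one value per frame (illustrative smoothing choice)."""
    n_frames = len(signal) // frame
    return np.array([np.sqrt(np.mean(signal[i * frame:(i + 1) * frame] ** 2))
                     for i in range(n_frames)])

def log_attack_time(signal, sr, frame=512):
    """LAT = log10(T1 - T0): T0 = envelope start, T1 = envelope maximum."""
    env = rms_envelope(signal, frame)
    t1 = np.argmax(env) * frame / sr                         # time of envelope maximum
    t0 = np.argmax(env > 0.02 * env.max()) * frame / sr      # time envelope exceeds 2% of max
    return np.log10(max(t1 - t0, 1.0 / sr))

def temporal_centroid(signal, sr, frame=512):
    """TC = sum(t(n) SEnv(n)) / sum(SEnv(n)) over the frame-wise envelope."""
    env = rms_envelope(signal, frame)
    times = np.arange(len(env)) * frame / sr
    return np.sum(times * env) / (np.sum(env) + 1e-12)

# usage on a synthetic 440 Hz tone with a short attack ramp
sr = 44100
t = np.linspace(0, 1.0, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t) * np.minimum(t / 0.05, 1.0)
print(spectral_centroid(tone[:2048], sr), log_attack_time(tone, sr), temporal_centroid(tone, sr))
```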
3. Classification
The classifiers applied in investigations of musical instrument recognition represent practically all known methods. In our research we have so far applied Naïve Bayesian networks and the Decision Tree J48 classifier to numerous music sound objects to explore the effectiveness of our new descriptors.
Naïve Bayesian classification is a widely used statistical approach that represents the dependence structure between multiple variables by a specific type of graphical model, in which probabilities and conditional-independence statements are strictly defined. It has been successfully applied to speech recognition [26][14].
The joint probability distribution function is
$$p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p(x_i \mid \pi_i),$$
where $x_1, \ldots, x_N$ are the feature variables and $\pi_i$ denotes the set of parents of $x_i$ in the network.
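For illustration, here is a minimal sketch of such a classifier in Python using scikit-learn's GaussianNB (the simplest Naïve Bayesian model with the full independence assumption, not the exact WEKA setup used in our experiments), applied to a hypothetical descriptor matrix; the feature and class names are placeholders.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: rows = sound objects, columns = acoustic descriptors
# (e.g. HSC, HSS, LAT, TC, ...); labels = instrument family names.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))                  # 200 sounds, 12 descriptors
y = rng.choice(["string", "woodwind", "percussion"], size=200)

# Under the naive independence assumption, p(x1,...,xN | class) factorizes
# into a product of per-feature Gaussians, mirroring the product form above.
clf = GaussianNB().fit(X, y)
print(clf.predict(X[:5]))                       # predicted families for 5 sounds
print(clf.predict_proba(X[:1]))                 # class posteriors for one sound
```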
4. Experiments
We used a database of 3294 music recording sound objects covering 109 instruments from the McGill University Master Samples CD collection, which has been widely used for research on musical instrument recognition all over the world. We implemented a hierarchical structure in which we first discriminate among instrument families (woodwind, string, harmonic percussion and non-harmonic), and then discriminate the sounds into the different types of instruments within each family. The woodwind family had 22 different instruments, the string family contained seven, the harmonic percussion family included nine, and the non-harmonic family contained 73. All classifiers were evaluated with 10-fold cross validation, i.e., each fold split the data into 90% training and 10% testing. We used WEKA for all classifications.
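The two-stage scheme can be sketched as follows; this is an illustrative Python analogue of the hierarchical setup (the actual experiments were run in WEKA), with synthetic data standing in for the McGill sound objects and descriptor values.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def hierarchical_cv(X, family, instrument, n_splits=10):
    """Predict the family first, then the instrument within the predicted family."""
    acc = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train, test in skf.split(X, instrument):
        fam_clf = DecisionTreeClassifier().fit(X[train], family[train])
        # one instrument-level classifier per family, trained on that family's sounds
        inst_clf = {f: DecisionTreeClassifier().fit(X[train][family[train] == f],
                                                    instrument[train][family[train] == f])
                    for f in np.unique(family[train])}
        pred_fam = fam_clf.predict(X[test])
        pred_inst = np.array([inst_clf[f].predict(x.reshape(1, -1))[0]
                              for f, x in zip(pred_fam, X[test])])
        acc.append(np.mean(pred_inst == instrument[test]))
    return np.mean(acc)

# illustrative synthetic data: 400 sounds, 20 descriptors, 4 families x 5 instruments
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20))
family = np.repeat(["woodwind", "string", "percussion", "no-pitch"], 100)
instrument = np.array([f"{f}-{i % 5}" for i, f in enumerate(family)])
print(hierarchical_cv(X, family, instrument))
```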
In this research we compared hierarchical classification with non-hierarchical classification in our experiments. The hierarchical classification schema is shown in the following figure, and the classification accuracies are summarized in the tables below.
Classification accuracy at each node of the hierarchy:

                 Family      No-pitch    Percussion   String      Woodwind
J48-Tree         91.726%     77.943%     86.0465%     76.669%     75.761%
Naïve Bayesian   72.6868%    75.2169%    88.3721%     66.6021%    78.0158%

                 Family      No-pitch    Percussion   String      Woodwind
J48-Tree         86.434%     73.7299%    85.2484%     72.4272%    67.8133%
Naïve Bayesian   64.7041%    66.2949%    84.9379%     61.8447%    67.8133%

Overall (non-hierarchical) classification accuracy, all features vs. MPEG features:

                 All         MPEG
J48-Tree         70.4923%    65.7256%
Naïve Bayesian   68.5647%    56.9824%
6. Acknowledgment
This work is supported by the National Science
Foundation under grant IIS-0414815.
7. References
[1] Atkeson, C.G., Moore A.W., and Schaal, S. (1997).
Locally Weighted Learning for Control, Artificial
Intelligence Review. Feb. 11(1-5), 75-113.
[2] Balzano, G.J. (1986). What are musical pitch and timbre? Music Perception: An Interdisciplinary Journal, 3, 297-314.
[3] Bregman, A.S. (1990). Auditory scene analysis, the
perceptual organization of sound, MIT Press
[4] Cadoz, C. (1985). Timbre et causalité, Unpublished paper, Seminar on Timbre, Institut de Recherche et Coordination Acoustique/Musique, Paris, France, April 13-17.
[5] Dziubinski, M., Dalka, P. and Kostek, B. (2005). Estimation of Musical Sound Separation Algorithm Effectiveness Employing Neural Networks, Journal of Intelligent Information Systems, 24(2/3), 133-158.
[6] Eronen, A. and Klapuri, A. (2000). Musical Instrument Recognition Using Cepstral Coefficients and Temporal Features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Plymouth, MA, 753-756.
[7] Fujinaga, I., McMillan, K. (2000) Real time Recognition
of Orchestral Instruments, International Computer Music
Conference, 141-143.
[8] Gillet, O. and Richard, G. (2005) Drum Loops Retrieval
from Spoken Queries, Journal of Intelligent Information
Systems, 24(2/3), 159-177
[9] Herrera, P., Peeters, G., Dubnov, S. (2003). Automatic Classification of Musical Instrument Sounds, Journal of New Music Research, 32(1), 3-21.
[10] Kaminskyj, I., Materka, A. (1995) Automatic source
identification of monophonic musical instrument sounds,
the IEEE International Conference On Neural Networks.
[11] Kostek, B. and Wieczorkowska, A. (1997). Parametric
Representation of Musical Sounds, Archive of Acoustics,
Institute of Fundamental Technological Research,
Warsaw, Poland, 22(1), 3-26.
[12] le Cessie, S. and van Houwelingen, J.C. (1992). Ridge Estimators in Logistic Regression, Applied Statistics, 41(1), 191-201.