SMC 2017
July 5-8, 2017
Espoo, Finland
Edited by Tapio Lokki, Jukka Pätynen, and Vesa Välimäki
Aalto University
School of Science
School of Electrical Engineering
School of Arts, Design, and Architecture
Published by Aalto University
ISBN 978-952-60-3729-5
ISSN 2518-3672
http://smc2017.aalto.fi
Table of Contents
Contents iii
Preface viii
Committee and sponsors ix
Reviewers x
Special Issue, call for papers xi
Keynotes
Toshifumi Kunimoto 1
History of Yamaha's electronic music synthesis
Simon Waters 2
Revisiting the ecosystemic approach
Anssi Klapuri 4
Learnings from developing the recognition engine for a musical instrument
learning application
Posters
Victor Lazzarini 5
The design of a lightweight DSP programming library
Luca Turchet, Carlo Fischione, Mathieu Barthet 13
Towards the internet of musical things
Dario Sanfilippo, Agostino Di Scipio 21
Environment-Mediated Coupling of Autonomous Sound-Generating Systems in
Live Performance: An Overview of the Machine Milieu Project
Oliver Hödl, Geraldine Fitzpatrick, Fares Kayali, Simon Holland 28
Design Implications for Technology-Mediated Audience Participation in Live Music
Hirofumi Takamori, Haruki Sato, Takayuki Nakatsuka, Shigeo Morishima 35
Automatic Arranging Musical Score for Piano using Important Musical Elements
Leo McCormack, Vesa Välimäki 42
FFT-based Dynamic Range Compression
Hanns Holger Rutz, Robert Höldrich 50
A Sonification Interface Unifying Real-Time and Offline Processing
Manuel Brásio, Filipe Lopes, Gilberto Bernardes, Rui Penha 58
QUALIA: A Software for Guided Meta-Improvisation Performance
Eduardo Micheloni, Marcella Mandanici, Antonio Rodà, Sergio Canazza 63
Interactive Painting Sonification Using a Sensor-Equipped Runway
Matteo Lionello, Marcella Mandanici, Sergio Canazza, Edoardo Micheloni 71
Interactive Soundscapes: Developing a Physical Space Augmented through
Dynamic Sound Rendering and Granular Synthesis
Anne-Marie Burns, Sébastien Bel, Caroline Traube 77
Learning to Play the Guitar at the Age of Interactive and Collaborative
Web Technologies
Pierre Beauguitte, Bryan Duggan, John D. Kelleher 85
Key Inference from Irish Traditional Music Scores and Recordings
Jeffrey M. Morris 92
Ferin Martino's Tour: Lessons from Adapting the Same Algorithmic Art
Installation to Different Venues and Platforms
Tejaswinee Kelkar, Alexander Refsum Jensenius 98
Exploring Melody and Motion Features in Sound-Tracings
Ryan Kirkbride 104
Troop: A Collaborative Tool for Live Coding
Yusuke Wada, Yoshiaki Bando, Eita Nakamura, Katsutoshi Itoyama,
Kazuyoshi Yoshii 110
An Adaptive Karaoke System that Plays Accompaniment Parts of Music
Audio Signals Synchronously with Users' Singing Voices
Jose J. Valero-Mas, José M. Iñesta 117
Experimental Assessment of Descriptive Statistics and Adaptive
Methodologies for Threshold Establishment in Onset Selection Functions
Raul Masu, Andrea Conci, Cristina Core, Antonella de Angeli, Fabio Morreale 125
Robinflock: a Polyphonic Algorithmic Composer for Interactive Scenarios
with Children
Philippe Kocher 133
Technology-Assisted Performance of Polytemporal Music
Peter Lennox, Ian McKenzie 139
Investigating Spatial Music Qualia through Tissue Conduction
Social interaction
Charles P. Martin, Jim Torresen 175
Exploring Social Mobile Music with Tiny Touch-Screen Performances
Peter Ahrendt, Christian C. Bach, Sofia Dahl 181
Does Singing a Low-Pitch Tone Make You Look Angrier?
Juan Carlos Vasquez, Koray Tahiroğlu, Niklas Pöllänen, Johan Kildal,
Teemu Ahmaniemi 188
Monitoring and Supporting Engagement in Skilled Tasks: From Creative
Musical Activity to Psychological Wellbeing
Alexander Refsum Jensenius, Agata Zelechowska, Victor Gonzalez Sanchez 195
The Musical Influence on People's Micromotion when Standing Still in Groups
Zhiyan Duan, Chitralekha Gupta, Graham Percival, David Grunberg and Ye Wang 200
SECCIMA: Singing and Ear Training for Children with Cochlear Implants via
a Mobile Application
Liang Chen, Christopher Raphael 287
Renotation of Optical Music Recognition Data
Filipe Lopes 294
WALLACE: Composing Music for Variable Reverberation
Automatic systems and interactive performance
Andrea Valle, Mauro Lanza 391
Systema Naturae: Shared Practices Between Physical Computing and
Algorithmic Composition
Ken Déguernel, Jérôme Nika, Emmanuel Vincent, Gérard Assayag 399
Generating Equivalent Chord Progressions to Enrich Guided Improvisation:
Application to Rhythm Changes
Torsten Anders 407
A Constraint-Based Framework to Model Harmony for Music Composition
Gleb Rogozinsky, Mihail Chesnokov, Eugene Cherny 415
pch2csd: an Application for Converting Nord Modular G2 Patches into
Csound Code
Flavio Everardo 422
Towards an Automated Multitrack Mixing Tool Using Answer Set Programming
Johnty Wang, Robert Pritchard, Bryn Nixon, Marcelo M. Wanderley 429
Explorations with Digital Control of MIDI-enabled Pipe Organs
Léopold Crestel, Philippe Esling 434
Live Orchestral Piano, a System for Real-Time Orchestral Music Generation
Yuta Ojima, Tomoyasu Nakano, Satoru Fukayama, Jun Kato, Masataka Goto,
Katsutoshi Itoyama, Kazuyoshi Yoshii 443
A Singing Instrument for Real-Time Vocal-Part Arrangement of Music
and Audio Signals
Jennifer MacRitchie, Andrew J. Milne 450
Evaluation of the Learnability and Playability of Pitch Layouts in New
Musical Instruments
Preface
We are proud to present the 14th edition of the Sound and Music Computing Conference
(SMC-17). The conference is hosted by Aalto University in Otaniemi, Espoo, Finland.
This year's theme, "Wave in time and space", contains a language joke, as the Finnish
word "aalto", appearing in the name of our university, means "wave". When Aalto University
was formed in 2010 by merging three Finnish universities, the Helsinki University of Technology,
the Helsinki School of Economics, and the University of Art and Design Helsinki, the SMC-related activities
were distributed in three different units, the Department of Signal Processing and
Acoustics, the Department of Media Technology (currently Computer Science), and the
Department of Media. In spring 2017, the Aalto Acoustics Lab was founded as a research
center gathering all SMC-related researchers of Aalto University. We have our own office
building and acoustic measurement laboratories next to each other in the main Aalto
campus in Otaniemi, Espoo. The Acoustics Lab was in fact the old name of the acoustics
research unit in the technical university, but it was abandoned about ten years ago, when
the university structure was re-organized. It was now the right time to start using it again.
The SMC-17 conference is the first common activity of the new Aalto Acoustics Lab: the
chair Vesa Välimäki, the papers chair Tapio Lokki, and the music chair Koray Tahiroğlu
represent the three different units, which meet at the Aalto Acoustics Lab.
The scientific program of SMC-17 includes 3 keynote lectures, 45 oral presentations, and
20 posters. In addition, there are 15 sound installations or fixed media, and 14 music pieces
to be presented in three concerts. Furthermore, a demo session at the Aalto Acoustics Lab
is open to all SMC-17 goers. The 65 papers published in these proceedings were selected
out of 85 submissions by the program committee. We are honored to welcome keynote
speakers Dr. Toshifumi Kunimoto, a famous synthesizer designer from Yamaha
(Hamamatsu, Japan), Prof. Anssi Klapuri, the CTO of Yousician (Helsinki, Finland), a
company producing musical games, and Dr. Simon Waters (Belfast, UK), a reputable
composer of electroacoustic music.
The SMC conference series has run for many years without standing financial support from
any association. It consistently attracts many participants, who sponsor this
conference series by paying the registration fee, which can be kept relatively low in
comparison to some other international conferences. With the help of external sponsors, it
is possible to not only offer access to all oral and poster presentations, keynote talks, and
demos, but also offer free lunches, coffee breaks, concerts, and a banquet to all SMC-17
participants. We would like to express our gratitude to our sponsors: Genelec, Native
Instruments, Applied Sciences Basel, the Acoustical Society of Finland, and the Federation
of Finnish Learned Societies.
We welcome you all to Finland and to Aalto University, which you may remember as the
wave university. We hope that you will enjoy SMC-17!
Committee
Chair Prof. Vesa Välimäki
Sponsors
Reviewers
Special Issue, Call for Papers
Sound and music computing is a young and highly multidisciplinary research field. It
combines scientific, technological, and artistic methods to produce, model, and
understand audio and sonic arts with the help of computers. Sound and music
computing borrows methods, for example, from computer science, electrical
engineering, mathematics, musicology, and psychology.
In this Special Issue, we want to address recent advances in the following topics:
Analysis, synthesis, and modification of sound
Automatic composition, accompaniment, and improvisation
Computational musicology and mathematical music theory
Computer-based music analysis
Computer music languages and software
High-performance computing for audio
Interactive performance systems and new interfaces
Multi-modal perception and emotion
Music information retrieval
Music games and educational tools
Music performance analysis and rendering
Robotics and music
Room acoustics modeling and auralization
Social interaction in sound and music computing
Sonic interaction design
Sonification
Soundscapes and environmental arts
Spatial sound
Virtual reality applications and technologies for sound and music
Submissions are invited for both original research and review articles. Additionally,
invited papers based on excellent contributions to recent conferences in this field
will be included in this Special Issue; for example, from the 2017 Sound and Music
Computing Conference SMC-17. We hope that this collection of papers will serve as
an inspiration for those interested in sound and music computing.
Accepted papers will be published continuously in the journal (as soon as accepted)
and will be listed together on the special issue website. Research articles, review
articles, as well as short communications are invited. For planned
papers, a title and short abstract (about 100 words) can be sent to the Editorial
Office for announcement on this website.
Submitted manuscripts should not have been published previously, nor be under
consideration for publication elsewhere (except conference proceedings papers). All
manuscripts are thoroughly refereed through a single-blind peer-review process. A
guide for authors and other relevant information for submission of manuscripts is
available on the Instructions for Authors page. Applied Sciences is an international
peer-reviewed open access monthly journal published by MDPI.
Please visit the Instructions for Authors page before submitting a manuscript.
The Article Processing Charge (APC) for publication in this open access journal is
1200 CHF (Swiss Francs). Submitted papers should be well formatted and use good
English. Authors may use MDPI's English editing service prior to publication or during
author revisions.
Guest Editors
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
Keynote 1: History of Yamaha's electronic music synthesis
About the speaker: Toshifumi Kunimoto was born in Sapporo, Hokkaido, Japan in
1957. He received the B.S., M.S. and Ph.D. degrees for his work on ARMA digital
filter design from the Faculty of Engineering, Hokkaido University, Sapporo, Japan
in 1980, 1982, and 2017, respectively. In 1982 Toshifumi Kunimoto joined Yamaha,
where he has designed large-scale integration (LSI) systems for numerous musical
instruments such as the Electone, Yamaha's trademark electronic organ line. He has
created numerous signal processing technologies used in Yamaha synthesizers and
pro-audio equipment. Among his designs are the famous Yamaha VL1 Virtual
Acoustic Synthesizer and the CP1 electronic piano. He has also contributed to
several Japanese specialized music and audio publications with articles describing
the behavior of analog audio equipment such as guitar stompboxes. Since 2008, he
has worked at Yamaha's Center for Research & Development in Hamamatsu, Japan.
Keynote 2: Revisiting the ecosystemic approach
About the speaker: Dr. Simon Waters joined the staff of the Sonic Arts Research
Centre at Queen's University Belfast in September 2012, moving from his previous
role as Director of Studios at the University of East Anglia (1994 - 2012). He initially
established a reputation as an electroacoustic composer, with awards and
commissions and broadcasts in the UK and abroad, working with many
contemporary dance and physical theatre companies and visual artists including
Ballet Rambert, Adventures in Motion Pictures and the Royal Opera House Garden
Venture, and with pioneering multimedia theatre practitioners Moving Being.
From a focus on electroacoustic works for contemporary dance in the 1980s, his
work has shifted from studio-based acousmatic composition to a position which
reflects his sense that music is at least as concerned with human action as with
acoustic fact. His current research explores continuities between tactility and
hearing, and is particularly concerned with space as informed by and informing
human presence, rather than as a benign "parameter". He has supervised over fifty
research students in areas as diverse as improvising machines, inexpertise and
marginal competence in musical interaction, silence and silencedness, soundscape
His publications are similarly diverse, recent examples exploring empathy and ethics
in improvised conduct, early nineteenth century woodwind instrument production in
London, the notion of the "local" in a ubiquitously digitized world, and hybrid
physical/virtual instrument design.
Keynote 3:
Learnings from developing the recognition engine for a musical instrument learning
application
Yousician develops an educational platform that helps our users to learn to play a
musical instrument. In a typical setting, the user has a mobile device in front of her
and plays a real acoustic instrument to interact with the application running on the
device. The application listens to the user's performance via the device's built-in
microphone and gives real-time feedback about the performance, while also showing
written music on the screen and playing an accompanying backing track. In this talk,
I will discuss the learnings that we had while building the music recognition engine
for our application. There are several technical challenges in using a musical
instrument as a "game controller", particularly so in a cross-platform environment
with a very heterogeneous set of devices. The most obvious challenge is to
recognise the notes that the user is playing, as the quality and acoustic
characteristics of instruments vary widely. The noise conditions for the recognition
are often quite demanding: the accompanying music from the device speakers tends
to leak back to the microphone and correlates strongly with what the user is
supposed to play. At the same time, the application should be robust to background
noises such as someone talking in the background and potentially other students
playing a few meters away in a classroom setting. Timing and synchronisation set
another challenge: giving feedback about performance timing requires
synchronisation accuracy down to roughly 10 ms, whereas the audio I/O latency
uncertainties are often an order of magnitude larger on an unknown device. I will
discuss some of our solutions to these challenges and our workflow for developing
and testing the recognition DSP. I will also discuss briefly the audio processing
architecture that we use and the reasons why we decided to build our own audio
engine (on top of JUCE) to be in full control of the audio capabilities. I will end with
a more personal note as a person who moved from academia to a startup
roughly six years ago, reflecting on the most satisfying vs. frustrating moments and
the most interesting or surprising aspects during that time.
About the speaker: Anssi Klapuri received his Ph.D. degree from Tampere University
of Technology (TUT), Finland in 2004. He visited as a post-doc researcher at École
Centrale de Lille, France, and Cambridge University, UK, in 2005 and 2006,
respectively. He led the Audio Research Group at TUT until 2009 and then worked
as a Lecturer in the Centre for Digital Music at Queen Mary University of London in
2010–2011. Since 2011, he has worked as the CTO of Yousician, developing
educational applications for learning to play musical instruments. His research
interests include audio signal processing, auditory modeling, and machine learning.
Victor Lazzarini
Maynooth University,
Ireland
victor.lazzarini@nuim.ie
virtual methods. Instead, processing methods may or may not exist in a derived class,
depending on what they are supposed to implement. They can be given any name, although
the established informal nomenclature is to call them process().

The principal base class in the library is AudioBase. This is subclassed to implement
all different audio-handling objects, from synthesis/processing to signal buffers,
function tables, delay lines, and audio IO. The layout of AudioBase is informed by a
number of basic design decisions that underpin the principles of AuLib.

2.1 Stateful vs. stateless representations

One of the basic motivations for AuLib was to place self-standing algorithms, previously
implemented as free functions, within a wrapping object allowing the safe-keeping of
internal states. Let's explore this idea with a simple example of a sinusoidal
oscillator, described by eq. 1.

    s(t) = a(t) sin(2π ∫ f(t) dt)    (1)

A C implementation of such a function would have to take account of the
sample-by-sample phase values that are produced by the integration of the time-varying
frequency f(t). Typically, in a sane implementation, the current value of the phase
would be kept externally to the function, and modified as a side effect (listing 1).

Listing 1. C function implementing eq. 1

    const double twopi = 8. * atan(1.);

    double sineosc(double a, double f,
                   double *ph, double sr) {
      double s = a * sin(*ph);
      *ph += twopi * f / sr;
      return s;
    }

While this is entirely appropriate to demonstrate and expose the oscillator algorithm
for study, it is clearly not robust enough to be incorporated into a library. Quite
rightly, users would expect to be able to use such functions to implement multiple
oscillators, in banks, or for amplitude or frequency modulation. In this context, a
programmer could inadvertently supply a single phase address to a series of calls to
such functions when implementing a bank of oscillators and would clearly fail to get
the intended result. While it could work when carefully employed, such a stateless
presentation of the algorithm is clearly incomplete.

While there are ways of describing a sine wave oscillator in a stateless or purely
functional fashion, once we are committed to define the computation in a stateful form,
we need to provide a means to keep an account of the current state. Clearly, a
self-contained oscillator will need to maintain the last computed value of the phase,
as the algorithm contains an integration. For this, we can wrap the whole algorithm in
a class that models its state and the means to get an output sample. A minimal C++
class implementing such an oscillator is shown in listing 2.

Listing 2. C++ class implementing eq. 1

    struct SineOsc {
      double m_ph;
      double m_sr;

      SineOsc(double ph, double sr)
        : m_ph(ph), m_sr(sr) {};

      double process(double a, double f) {
        double s = a * sin(m_ph);
        m_ph += twopi * f / m_sr;
        return s;
      }
    };

With an object-oriented implementation, the stateful description of the algorithm is
complete and provides enough robustness for use in a variety of contexts. Likewise, if
we look across the various types of DSP operations that a library would hope to
implement, we will see all sorts of state variables involved. This provides enough
motivation for the wrapping of such algorithms in C++ classes.

2.2 Abstraction and encapsulation

In fact, by clearly describing an algorithm as having a state and a means of computing
its output, we are abstracting the DSP object as a specific data type. This
encapsulates all the kinds of operations we would expect to be able to apply to such an
object. What are the things we would like any DSP algorithm to contain? It would be
useful, for instance, for it to hold its output so that we only need to compute it
once. Basic attributes such as the sampling rate and the frame size (number of channels
in an interleaved signal) would also be essential.

For efficient implementation, the output should not be limited to sample-by-sample
computation (as in the minimal example of the oscillator in listing 2). Typically, we
would want a block of frames to be generated for each call of a processing method,
which may vary in size. A means of registering whether the object is in an error state
would also be useful for program diagnostics. In this formulation, a class that models
a generic audio DSP object would contain the following attributes (listing 3).

Listing 3. Attributes of the audio DSP base class

    class AudioBase {
     protected:
      uint32_t m_nchnls;            // no. of channels
      uint32_t m_vframes;           // vector size
      std::vector<double> m_vector;
      double m_sr;                  // sampling rate
      uint32_t m_error;             // error code
      ...
    };

These are protected so that no unintended modification is allowed. This class is for
all practical purposes a wrapper around an audio vector (of double floating-point
samples).
Methods for basic manipulation are also added: scale, offset, modulation, mixing, and
sample access are provided through overloaded operators. Setting and getting samples
off the vector (single-channel samples, full blocks, etc.) is also provided, as well as
means to modify the vector size and methods to get the value of the object attributes.

2.3 Code re-use

Since we have embraced, for good reasons, the object-oriented approach, it is very
useful to take advantage of inheritance, as well as composition. For this reason, I
have designed the class hierarchy from the most general to the most specific, although
overall the tree is not very deep (six levels at most). We will see two specific cases
in more detail in section 4, but as an example, the ResonZ class shows how the re-use
of code can be employed. In fig. 1, we see that it is subclassed from a series of
parents. This shows an example of how each subclass represents a mostly small
modification of its parent, with most of its code re-used. Another benefit is that if a
modification needs to be made (e.g. a bug fix), it does not need to be reproduced at
several places (which opens the door for introducing small errors at these different
locations).

Code re-use through composition is also employed throughout the library. For example,
the Delay class holds an AudioBase object that implements its delay line, using the
inlined access methods provided in that class. The Balance class, which implements
envelope following and signal amplitude balancing, is made up of two Rms objects that
are used to measure the RMS amplitude of input signals. Rms itself is a specialisation
of a first-order low-pass filter class. In another example, the TableSet class, which
is a utility class for the band-limited oscillator class BlOsc, is made up of a vector
of FourierTable objects containing waveform tables.
2.4 Connectivity
Some special attention has been given to the ways in which
objects can be easily connected with each other and with
code from other libraries (both in C and C++). To achieve this, there are two ways
in which input and output to objects can be handled:
    AuLib::Fir::dsp(const double *sig) {
      double out = 0;
      uint32_t N = m_ir.tframes();
      for(uint32_t i = 0; i < m_vframes; i++) {
        m_delay[m_pos] = sig[i];
        m_pos = m_pos != N-1 ? m_pos + 1 : 0;
        for(uint32_t j = 0, rp = m_pos; j < N; j++) {
          out += m_delay[rp] * m_ir[N-1-j];
          rp = rp != N-1 ? rp + 1 : 0;
        }
        m_vector[i] = out;
        out = 0.;
      }
      return vector();
    }

5. AN ECHO PROGRAM

To complete the discussion, an echo program is shown in listing 7. Fig. 5 shows the
flowchart for a single audio input channel (multichannel input is allowed, with stereo
output).

Figure 5. The signal flowchart for a single input channel in the echo program:
SoundIn -> Chn -> Delay (with a feedback sum at its input) -> Pan -> SigBus -> SoundOut.

At the top of the program listing, lines 1 to 6 include the required AuLib classes,
SoundOut, SoundIn, Delay, Chn, SigBus and Pan. A sound input object is constructed in
line 17, taking as a source either the name of a soundfile, "adc" for realtime audio,
or "stdin" for standard input. Arrays of channel readers, delay lines and panners are
created next according to the requested number of channels. Lines 24 and 25 contain the
mixer object and sound output constructors. Again, the output destination can be a
soundfile name, "dac" for realtime, or "stdout" for standard output.

The processing loop, lines 33 to 44, is made up of calls to the various DSP objects
through their operator(), plus the clearing of the signal bus mixer. It runs until the
input is finished, which will depend on the source of samples. Note also the use of the
overloaded operator += (in line 38) as a means of adding the dry and wet effect signals
at the pan processing input.

6. CONCLUSIONS

This paper described the design of a simple, lightweight audio DSP library in C++. The
main motivation is to provide a platform to develop and collect algorithms for the
study, teaching and research of audio programming. The library classes are effectively
thin wrappers that envelope succinct and efficient implementations of DSP operations.
The code has been designed to be robust enough for general-purpose deployment in audio
processing applications. Source code and multi-platform build scripts are provided at

https://github.com/vlazzarini/aulib

AuLib is free software, licensed under the GNU Lesser General Public License.

7. REFERENCES

[1] V. Lazzarini and F. Accorsi, "Designing a sound object library," in Proceedings of
the XVIII Brazilian Computer Society Conference, vol. III, Belo Horizonte, 1998,
pp. 95-104.

[2] P. Cook and G. Scavone, "The Synthesis ToolKit (STK)," in Proceedings of the ICMC
99, vol. III, Berlin, 1999, pp. 164-166.

[3] V. Lazzarini, "The SndObj sound object library," Organised Sound, no. 5, pp. 35-49,
2000.

[4] B. Stroustrup, The C++ Programming Language, 2nd ed. Addison-Wesley, 1991.

[5] V. Lazzarini, "Time-domain signal processing," in The Audio Programming Book,
R. Boulanger and V. Lazzarini, Eds. MIT Press, 2010, pp. 463-512.

[6] ISO/IEC, "ISO international standard ISO/IEC 14882:2011, programming language
C++," 2011. [Online]. Available: http://www.iso.org/iso/iso_catalogue/catalogue_tc/
catalogue_detail.htm?csnumber=50372

[7] V. Lazzarini, "AuLib documentation, v.1.0 beta," 2017. [Online]. Available:
http://vlazzarini.github.io/aulib/

[8] ISO/IEC, "ISO international standard ISO/IEC 14882:2014, programming language
C++," 2014. [Online]. Available: http://www.iso.org/iso/home/store/catalogue_ics/
catalogue_detail_ics.htm?csnumber=64029

[9] V. Lazzarini, "The development of computer music programming systems," Journal of
New Music Research, no. 42, pp. 97-110, 2013.
mission-critical control. This is pushing the development of the emerging paradigm of
the Tactile Internet [10] and millimeter-wave (mmWave) communications [11, 12]. The
Tactile Internet refers to an Internet network where the communication delay between a
transmitter and a receiver would be so low that even information associated with human
touch, vision, and audition could be transmitted back and forth in real time, making
remote interaction experiences, virtual or human, effective, high-quality and
realistic. This novel area focuses on the problem of designing communication networks,
both wireless and wired, capable of ensuring ultra-low-latency communications, with
end-to-end delays of the order of a few milliseconds. The technological vision of the
Tactile Internet is that the round-trip time to send information from a source to a
destination and back experiences latencies below 5 ms. An important aspect of the
Tactile Internet vision is reliability and low latency for wireless communications. One
of the emerging technologies for wireless communication is mmWaves. This technology
uses wireless frequencies within the range of 10 to 300 GHz and offers data rates of
gigabits per second over short distances. These frequencies make possible the design of
small antennas that can be easily embedded in musical instruments (as proposed in the
Smart Instruments concept [13]). Moreover, the high data rates will enable the
transmission of multimodal content in high resolution.

2.2 Digital ecosystems

Digital ecosystems are the result of recent developments of digital network
infrastructure inheriting from principles of ecological systems [14]. They are
collaborative environments where species/agents form a coalition to reach specific
goals. Value is created by making connections through collective (swarm) intelligence
and by promoting collaboration. The concepts underlying digital ecosystems match well
the proposed goals of the Internet of Musical Things, where multiple actors, smart
devices and intelligent services interact to enrich musical experiences.

2.3 Networked music performance systems

Networked music performance (NMP) systems were proposed to enable collaborative music
creation over a computer network and have been the object of scientific and artistic
investigations [9, 15-17]. A notable example is the ReacTable [18], a tangible
interface consisting of a table capable of tracking objects that are moved on its
surface to control the sonic output. The ReacTable allows multiple performers to
simultaneously interact with objects either placed on the same table or on several
networked tables in

2.4 Participatory live music performance systems

Within interactive arts, participatory live music performance (PLMP) systems
capitalising on information and communication technologies have emerged to actively
engage audiences in the music creation process [19]. These systems disrupt the
traditional unidirectional chain of musical communication from performers to audience
members who are passive from the creative point of view (see e.g. [19-22]). Interaction
techniques for technology-mediated audience participation have been proposed exploiting
a wide range of media and sensors, from mobile devices [19-21, 23, 24] to tangible
interfaces [25] such as light sticks [26] (see [19] for a review and classification
framework). Most PLMP systems require the audience to use a single type of device and
application. Nevertheless, different types of devices could be exploited simultaneously
to enrich interaction possibilities. To date, audience creative participation has
mainly been based on manual controls or gestures using smartphones (e.g., screen touch,
tilt). Expressive modalities could be increased by tracking physiological parameters
(e.g., electrodermal activity, heart rate) [27, 28] at the individual and collective
levels using devices specifically designed for this purpose, or by tracking more
complex audience behaviors and body gestures. Furthermore, means of interaction in
current PLMP systems typically rely on the auditory or visual modalities, while the
sense of touch has scarcely been explored to create more engaging musical experiences.

2.5 Smart Instruments

Recently, a new class of musical instruments has been proposed, the Smart Instruments
[13]. In addition to the sensor and actuator enhancements provided in the so-called
augmented instruments [29, 30], Smart Instruments are characterised by embedded
computational intelligence, a sound processing and synthesis engine, bidirectional
wireless connectivity, an embedded sound delivery system, and a system for feedback to
the player. Smart Instruments bring together separate strands of augmented instruments,
networked music and Internet of Things technology, offering direct point-to-point
communication between each other and other portable sensor-enabled devices, without
need for a central mediator such as a laptop. Interoperability is a key feature of
Smart Instruments, which are capable of directly exchanging musically relevant
information with one another and communicating with a diverse network of external
devices, including wearable technology, mobile phones, virtual reality headsets and
large-scale concert hall audio and lighting systems.

The company MIND Music Labs 4 has recently develo-
geographically remote locations.
ped the Sensus Smart Guitar [13, 31] which, to the best
Networked collaborative music creations can occur over of our knowledge, is the first musical instrument to en-
a Wide Area Network (WAN), or a Local Area Network compass all of the above features of a Smart Instrument.
(LAN) and in particular over a wireless one (WLAN) [17], Such an instrument is based on a conventional electroa-
and different methods have been proposed for each of these coustic guitar that is augmented with IoT technologies. It
configurations. In [9], the authors provide a comprehen- involves several sensors embedded in various parts of the
sive overview of hardware and software technologies en- instrument, which allow for the tracking of a variety of
abling NMP, including low-latency codecs, frameworks,
protocols, as well as perceptually relevant aspects. 4 www.mindmusiclabs.com
SMC2017-14
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
gestures of the performer. These are used to modulate the instrument's sound thanks to an embedded platform for digital audio effects. Acoustical sounds are produced by the instrument itself by means of an actuation system that transforms the instrument's resonating wooden body into a loudspeaker. Furthermore, the Sensus Smart Guitar is equipped with bidirectional wireless connectivity, which makes possible the transmission and reception of different types of data from the instrument to a variety of smart devices and vice versa.

2.6 Smart wearables

The last decade has witnessed a substantial increase in the prevalence of wearable sensor systems, including electronic wristbands, watches and small sensor tokens clipped to a belt or held in a pocket. Many such devices (e.g., Fitbit, www.fitbit.com) target the personal health and fitness sectors. They typically include inertial measurement units (IMUs) for capturing body movement and sensors for physiological data (e.g., body temperature, galvanic skin response, heart rate). Such devices, here referred to as smart wearables, include wireless communication options to link to mobile phones or computers. In some cases, a small display, speaker or tactile actuator may be included. A distinguishing characteristic of wearable devices is their unobtrusiveness: they are designed to be worn during everyday activity and to passively collect data without regular intervention by the user. Such features make these devices suitable to track and collect body movements and physiological responses of audience members during live concerts. However, to date, this challenge has been scarcely addressed.

Moreover, to date, the use of wearable devices exploiting the tactile channel in musical applications has been rather limited. Notable exceptions are Rhythm'n'Shoes, a wearable shoe-based audio-tactile interface equipped with bidirectional wireless transmission [32], and Mood Glove, a glove designed to amplify the emotions expressed by music in film through haptic sensations [33].

2.7 Virtual reality, augmented reality, and 360° videos

The last two decades have seen an increase of both academic and industrial research in the fields of virtual reality (VR) and augmented reality (AR) for musical applications. Several virtual musical instruments have been developed (for a recent review see [34]), while musical instruments augmented with sensors, such as the Sensus Smart Guitar, have been used to interactively control virtual reality scenarios displayed on head-mounted displays (HMDs) [31]. AR has been used to enhance performance stages for augmented concert experiences, as well as for participatory performance applications. Mazzanti et al. proposed the augmented stage [35], an interactive space for both performers and audience members, where AR techniques are used to superimpose a performance stage with a virtual environment populated with interactive elements. Spectators contribute to the visual and sonic outcome of the performance by manipulating virtual objects via their mobile phones. Berthaut et al. proposed Reflets, a mixed-reality environment that allows one to display virtual content on stage, such as 3D virtual musical interfaces or visual augmentations of instruments and performers [36]. Poupyrev et al. proposed the augmented groove, a musical interface for collaborative jamming where AR, 3D interfaces, as well as physical, tangible interaction are used for conducting multimedia musical performance [37].

Immersive virtual environments have been proposed as a means to provide new forms of musical interaction. For instance, Berthaut et al. proposed the 3D reactive widgets, graphical elements that enable efficient and simultaneous control and visualisation of musical processes, along with Piivert, an input device developed to manipulate such widgets, and several techniques for 3D musical interaction [38, 39].

The growing availability of 360° videos has recently opened new opportunities for the entertainment industry, as musical content can be delivered through VR devices offering experiences unprecedented in terms of immersion and presence. Recent examples include Orchestra VR, a 360° 3D performance featuring the opening of Beethoven's Fifth Symphony performed by the Los Angeles Philharmonic Orchestra, accessible via an app for various VR headsets (www.laphil.com/orchestravr); Paul McCartney's 360° cinematic concert experience app, allowing the experience of recorded concerts with 360° video and 3D audio using Google's Cardboard HMD; Los Angeles radio station KCRW, which launched a VR app for intimate and immersive musical performances (www.kcrw.com/vr); and FOVE's Eye Play The Piano project, which allows disabled children to play a real acoustic piano using eye-tracking technologies embedded in HMDs (www.eyeplaythepiano.com/en).

3. THE INTERNET OF MUSICAL THINGS

The proposed Internet of Musical Things (IoMUT) relates to the network of physical objects (Musical Things) dedicated to the production of, interaction with, or experience of musical content. Musical Things embed electronics, sensors, data forwarding and processing software, and network connectivity enabling the collection and exchange of data for musical purposes. A Musical Thing can take the form of a Smart Instrument, a Smart Wearable, or any other smart device utilised to control, generate, or track responses to music content. For instance, a Smart Wearable can track simple movements, complex gestures, as well as physiological parameters, but can also provide feedback leveraging the senses of audition, touch, and vision.

The IoMUT arises from (but is not limited to) the holistic integration of the current and future technologies mentioned in Section 2. The IoMUT is based on a technological infrastructure that supports multidirectional wireless communication between Musical Things, both locally and remotely. Within the IoMUT, different types of devices for performers and audience are exploited simultaneously to enrich interaction possibilities. This multiplies affordances and ways to track performers' and audience mem-
bers' creative controls or responses. The technological infrastructure of the IoMUT consists of hardware and software (such as sensors, actuators, devices, networks, protocols, APIs, platforms, clouds, services), but, differently from the IoT, these are specific to the musical case. In particular, for the most typical case of real-time applications, the infrastructure ensures communications with low latency, high reliability, high quality, and synchronization between connected devices.

Such an infrastructure enables an ecosystem of interoperable devices connecting performers with each other, as well as with audiences. Figure 1 shows a conceptual diagram of the different components that are interconnected in our vision of the IoMUT ecosystem. As can be seen in the diagram, the interactions between the human actors (performers and audience members) are mediated by Musical Things. Such interactions can be both co-located (see blue arrows), when the human actors are in the same physical space (e.g., concert hall, public space), or remote, when they take place in different physical spaces that are connected by a network (see black arrows).

Regarding co-located interactions, these can be based on point-to-point communications between a Musical Thing in possession of the performer and a Musical Thing in possession of an audience member (see the blue dashed arrows), but also between one or more Musical Things of the performers towards one or more Musical Things for the audience as a whole, and vice versa (see the blue solid line arrow). An example of the latter case could be that of one or more Smart Instruments affecting the lighting system of a concert hall. Regarding remote interactions, these can occur not only between audience members/performers present at the concert venue and remote audience members/performers (see the solid black arrows), but also between remote audience members/performers (see the black dashed arrows).

The communication between Musical Things is achieved through APIs (application programming interfaces, indicated in Figure 1 with the small red rectangles), which we propose could be based on a unified API specification (the IoMUT API specification). The interactions mentioned above, based on the exchange of multimodal creative content, are made possible thanks to Services (indicated with the green areas). For instance, these can be services for creative content analysis (such as multi-sensor data fusion [40], music information retrieval [41]), services for creative content mapping (between analysis and devices), or services for creative content synchronization (between devices). In particular, the implementation of novel forms of interaction that leverage different sensory modalities makes the definition of Multimodal Mapping Strategies necessary. These strategies consist of the process of transforming, in real time, the sensed data into control data for perceptual feedback (haptic, auditory, visual).

4. IMPLICATIONS

Thanks to the IoMUT it is possible to reimagine the live music performance art and music teaching by providing a technological ecosystem that multiplies the possibilities of interaction between audiences, performers, students, teachers, as well as their instruments and machines. This has the potential to revolutionise the way to experience, compose, and learn music, as well as even record it, by adding other modalities to audio. In particular, the IoMUT has the potential to make NMP and PLMP more engaging and more expressive, because it uses a radically novel approach to address the fundamental obstacles of state-of-the-art methods, which hinder efficient, meaningful and expressive interactions between performers and between performers and audiences.

The IoMUT ecosystem can support new performer-performer and audience-performer interactions, not possible before. Examples of use cases include novel forms of: jamming (e.g., using apps running on smartphones to control the sound engine of a smart instrument); enhanced concert experiences (e.g., audience members in possession of haptic-feedback smart wearables feel the vibrato of a smart violin or the rhythm of a smart drum; the emotional response of the audience is used to control the timbre of Smart Instruments or the behavior of stage equipment such as projections, smoke machines, lights); remote rehearsals (point-to-point audio streaming between Smart Instruments).

To date, no human-computer interaction system for musical applications enables the many interaction pathways and mapping strategies we envision in the IoMUT: one-to-one, one-to-many, many-to-one, many-to-many, in both co-located and remote situations. In contrast to traditional acoustic instruments, the IoMUT framework allows one to establish composed electronic instruments where the control interface and the process of sound production are decoupled [42]. New situations of performative agency [42] can be envisioned by letting audience members and the intelligence derived within the IoMUT digital ecosystem influence the outcome of specific musical performances.

By applying the IoT to music we envision going from the traditional musical chain (i.e., composers writing musical content for performers, who deliver it to a unique and creatively passive audience) to a musical mesh where the possibilities of interaction are countless. We envision both common (co-located participatory music performance in a concert hall) and extreme scenarios (massive open online music performance gathering thousands or hundreds of thousands of participants in a virtual environment).

Combining such an IoMUT musical mesh model with VR/AR applications makes it possible to enable new forms of music learning, co-creation, and immersive and augmented concert experiences. For instance, a violin player could see through AR head-mounted displays semantic and visual information about what another performer is playing, or be able to follow the score without having to look at a music stand. An audience member could virtually experience walking on stage or feeling in the skin of a performer. A whole audience could affect the lighting effects on stage based on physiological responses sensed with wireless smart wristbands. Such smart wristbands could also be used to understand audiences' affective responses. An audience could engage in the music creation process at specific times in a performance as prepared in a composer's score. A concert-
[Figure 1: conceptual diagram of the IoMUT ecosystem, showing Services mediating co-located and remote interactions. Legend: API; co-located interactions between individual actors; co-located interactions as a whole; interactions between remote actors; interactions between co-located and remote actors.]

goer could have access to augmented programme notes guiding and preparing them prior to the concert experience by learning more about historical and compositional aspects and listening to different renderings interactively, and letting them see the evolution in the score as the music is being played, or additional information about the soloist during the concert.

In a different vein, the IoMUT has the potential to generate new business models that could exploit the information collected by Musical Things in many ways. For instance, such information could be used to understand customer behaviour, to deliver specific services, to improve products and concert experiences, and to identify and intercept the so-called business moments (defined by Gartner Inc., www.gartner.com/newsroom/id/2602820).

5. CURRENT CHALLENGES

The envisioned IoMUT poses both technological and artistic or pedagogical challenges. Regarding the technological challenges, it is necessary that the connectivity features of the envisioned Musical Things go far beyond the state-of-the-art technologies available today for the music domain. Between a sensor that acquires a measurement of a specific auditory, physiological, or gestural phenomenon, and the receiver that reacts to that reading over a network, there is a chain of networking and information processing components, which must be appropriately addressed in order to enable acceptable musical interactions over the network. Nevertheless, current NMP systems suffer from transmission issues of latency, jitter, synchronization, and audio quality: these hinder the real-time interactions that are essential to collaborative music creation [9]. It is also important to notice that for optimal NMP experiences, several aspects of musical interaction must be taken into account besides the efficient transmission of audio content. Indeed, during co-located musical interactions musicians rely on several modalities in addition to the sounds generated by their instruments, which include for instance the visual feedback from gestures of other performers, related tactile sensations, or the sound reverberation of the space [43]. However, providing realistic performance conditions over a network represents a significant engineering challenge due to the extremely strict requirements in terms of network latency and multimodal content quality, which are needed to achieve a high-quality interaction experience [44].
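The latency chain just described can be made concrete with a back-of-the-envelope budget. The following sketch is illustrative only: the function and all component values (buffer size, codec delay, routing overhead, fibre distance) are assumptions chosen for the example, not figures taken from the systems surveyed here.

```python
# Illustrative one-way latency budget for a networked music
# performance (NMP) link. All component values are assumptions
# for the sake of the example, not measurements.

SPEED_IN_FIBER_M_PER_S = 2.0e8  # light travels at roughly 2/3 c in fibre


def one_way_latency_ms(distance_m: float, buffer_frames: int,
                       sample_rate_hz: int, codec_ms: float,
                       network_overhead_ms: float) -> float:
    """Sum the main contributors to one-way audio latency, in ms."""
    buffering_ms = 1000.0 * buffer_frames / sample_rate_hz  # A/D + D/A blocks
    propagation_ms = 1000.0 * distance_m / SPEED_IN_FIBER_M_PER_S
    return buffering_ms + propagation_ms + codec_ms + network_overhead_ms


# Two performers 300 km apart, 64-frame buffers at 48 kHz,
# 1 ms of codec delay and 1 ms of routing/queueing overhead.
one_way = one_way_latency_ms(300_000, 64, 48_000,
                             codec_ms=1.0, network_overhead_ms=1.0)
round_trip = 2 * one_way
print(f"one-way: {one_way:.2f} ms, round-trip: {round_trip:.2f} ms")
```

Even with these optimistic per-component figures, propagation alone consumes a large share of the budget over a few hundred kilometres, which illustrates why round-trip targets of a few milliseconds, such as those of the Tactile Internet, are so demanding for NMP.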
The most important specific technological challenges of the IoMUT, against the more general ones from the IoT, are the requirements of very short latency, high reliability, high audio/multimodal content quality, and synchronization to be ensured for musical communication. This implies the creation of a technological infrastructure that is capable of transmitting multimodal content, and in particular audio, from one musician to another/others not only in hi-fi quality, but also with a negligible amount of latency, which enables performers to play in synchrony. Current IoT scientific methods and technologies do not satisfy the tight constraints needed for the real-time transmission of audio/multimodal content both at short and at large distances [9, 17]. The envisioned Tactile Internet [10] is expected to solve such issues at least in part.

Establishing the Tactile Internet vision will require, however, the redesign of networking protocols, from the physical layer to the transport layer. For example, at the physical layer, messages will inevitably have to be short to ensure the desired low latencies, and this will pose restrictions on the data rates. At the routing layer, the protocols will have to be optimized for low delays rather than for high throughput of information. Possible avenues for future research include: the identification of all the components of a high-performance network infrastructure that is physically capable of delivering low latency across the radio as well as the core network segments; the proposition of new distributed routing decision methods capable of minimizing the delay by design; and the investigation of new dynamic control and management techniques based on optimization theory, capable of configuring and allocating the proper data plane resources across different domains to build low-latency end-to-end services.

The IoMUT digital ecosystems can benefit from ongoing work on the semantic web. Metadata related to multimodal creative content can be represented using vocabulary defined in ontologies with associated properties. Such ontologies enable the retrieval of linked data within the ecosystem driven by the needs of specific creative music services (e.g., a service providing the vibrato of notes played by a guitar player to enact events in the Musical Thing of an audience member). Alignment and mapping/translation techniques can be developed to enable semantic information integration [14] within the IoMUT ecosystem.

An important aspect of the IoMUT regards the interconnection of different types of devices. Such devices target performers or audiences (both co-located and remote), and are used to generate, track, and/or interpret multimodal musical content. This poses several other technological challenges. These include the need for ad-hoc protocols and interchange formats for musically relevant information that have to be common to the different Musical Things, as well as the definition of common APIs specifically designed for IoMUT applications.

Wearable systems present many opportunities for novel forms of musical interaction, especially involving multiple sensory modalities. Related design challenges concern the optimization for musical qualities of the sensor and actuator capabilities of these devices (e.g., temporal precision, low latency, synchronicity of audio, visual, and tactile modalities). Another related challenge is how to effectively use multiple sensory modalities in PLMP systems. In particular, the haptic modality leveraged by Smart Wearables could have a high-impact potential on the musical experience of the audience.

A major challenge to the approach of Multimodal Mapping Strategies consists of how to determine mappings flexible enough to allow for musical participation that is expressive and meaningful to both experts and novices. These mappings could be based on features extracted in real time from sensor data and musical audio analysis. Multi-sensor data fusion techniques [40] could be exploited for this purpose, which explicitly account for the diversity in the acquired data (e.g., in relation to sampling rates, dimensionality, range, and origin).

Moreover, the IoMUT demands new analytic approaches: new analytic tools and algorithms are needed to process large amounts of music-related data in order to retrieve information useful to understand and exploit user behaviours.

Regarding the non-technological challenges posed by the IoMUT, from the artistic perspective, previous attempts to integrate audiences into performances have not completely released the barriers inhibiting musical interactivity between performers and audiences. For a performance to be truly interactive, each member of the audience should be as individually empowered as the performers on stage. To achieve this, it is necessary to reimagine musical performances and invent new compositional paradigms that can catalyse, encompass and incorporate multiple and diverse contributions into a coherent and engaging whole. This poses several challenges: how can we compose music that gives individual freedom to several (potentially hundreds or thousands of) performers so that they feel empowered without compromising the quality of the performance? How do we manage all of these inputs from individuals who have different backgrounds, sensibilities and skills, and leverage them so that the result is satisfying for all of them and each person can still recognize his or her own individual contributions? How do musicians compose and rehearse for a concert where they do not fully control the end result? In short, how do we use technology to integrate each individual expression into an evolving whole that binds people together?

A framework such as the IoMUT, and what it entails for artistic and pedagogical agendas, will require assessment. This could pave the way for novel research on audience reception, interactive arts, education and aesthetics. Such research would, for instance, help to reflect on the roles of constraints, agency and identities in the participatory arts. Finally, issues related to the security and privacy of information should also be addressed, especially if such a system were to be deployed for the masses.

6. CONCLUSIONS

The IoMUT relates to wireless networks of Musical Things that allow for the interconnection of and interaction between performers, audiences, and their smart devices. Using the IoMUT we proposed a model transforming the traditional linear
composer-performer-audience musical chain into a musical mesh interconnecting co-located and/or remote performers and audiences. Many opportunities are enabled by such a model, which multiplies the possibilities of interaction and communication between audiences, performers, their instruments and machines. On the other hand, the IoMUT poses both technological and non-technological challenges that we expect will be faced in upcoming years by both academic and industrial research.

Acknowledgments

The work of Luca Turchet is supported by an MSCA Individual Fellowship of the European Union's Horizon 2020 research and innovation programme, under grant agreement No. 749561. The work of Carlo Fischione is supported by the TNG SRA ICT project TouCHES - Tactile Cyberphysical Networks, funded by the Swedish Research Council. Mathieu Barthet acknowledges support from the EU H2020 Audio Commons grant (688382).

7. REFERENCES

[1] L. Atzori, A. Iera, and G. Morabito, "The internet of things: a survey," Computer Networks, vol. 54, no. 15, pp. 2787-2805, 2010.

[2] W. Dargie and C. Poellabauer, Fundamentals of Wireless Sensor Networks: Theory and Practice. John Wiley & Sons, 2010.

[3] E. Borgia, "The Internet of Things vision: Key features, applications and open issues," Computer Communications, vol. 54, pp. 1-31, 2014.

[4] A. Hazzard, S. Benford, A. Chamberlain, C. Greenhalgh, and H. Kwon, "Musical intersections across the digital and physical," in Digital Music Research Network Abstracts (DMRN+9), 2014.

[5] A. Willig, "Recent and emerging topics in wireless industrial communication," IEEE Transactions on Industrial Informatics, vol. 4, no. 2, pp. 102-124, 2008.

[6] IEEE Std 802.15.4-2006, "Part 15.4: Wireless Medium Access Control (MAC) and Physical Layer (PHY) Specifications for Low-Rate Wireless Personal Area Networks (WPANs)," IEEE, September 2006.

[7] S. Landstrom, J. Bergstrom, E. Westerberg, and D. Hammarwall, "NB-IoT: a sustainable technology for connecting billions of devices," Ericsson Technology Review, vol. 93, no. 3, pp. 1-12, 2016.

[8] C. Rottondi, M. Buccoli, M. Zanoni, D. Garao, G. Verticale, and A. Sarti, "Feature-based analysis of the effects of packet delay on networked musical interactions," Journal of the Audio Engineering Society, vol. 63, no. 11, pp. 864-875, 2015.

[9] C. Rottondi, C. Chafe, C. Allocchio, and A. Sarti, "An overview on networked music performance technologies," IEEE Access, 2016.

[10] G. P. Fettweis, "The Tactile Internet: applications and challenges," IEEE Vehicular Technology Magazine, vol. 9, no. 1, pp. 64-70, 2014.

[11] H. Shokri-Ghadikolaei, C. Fischione, G. Fodor, P. Popovski, and M. Zorzi, "Millimeter wave cellular networks: A MAC layer perspective," IEEE Transactions on Communications, vol. 63, no. 10, pp. 3437-3458, 2015.

[12] H. Shokri-Ghadikolaei, C. Fischione, P. Popovski, and M. Zorzi, "Design aspects of short-range millimeter-wave networks: A MAC layer perspective," IEEE Network, vol. 30, no. 3, pp. 88-96, 2016.

[13] L. Turchet, A. McPherson, and C. Fischione, "Smart Instruments: Towards an ecosystem of interoperable devices connecting performers and audiences," in Proceedings of the Sound and Music Computing Conference, 2016, pp. 498-505.

[14] H. Boley and E. Chang, "Digital ecosystems: Principles and semantics," in Inaugural IEEE International Conference on Digital Ecosystems and Technologies, 2007.

[15] A. Barbosa, "Displaced soundscapes: A survey of network systems for music and sonic art creation," Leonardo Music Journal, vol. 13, pp. 53-59, 2003.

[16] G. Weinberg, "Interconnected musical networks: Toward a theoretical framework," Computer Music Journal, vol. 29, no. 2, pp. 23-39, 2005.

[17] L. Gabrielli and S. Squartini, Wireless Networked Music Performance. Springer, 2016.

[18] S. Jorda, G. Geiger, M. Alonso, and M. Kaltenbrunner, "The reacTable: exploring the synergy between live music performance and tabletop tangible interfaces," in Proceedings of the 1st International Conference on Tangible and Embedded Interaction, 2007, pp. 139-146.

[19] Y. Wu, L. Zhang, N. Bryan-Kinns, and M. Barthet, "Open Symphony: Creative participation for audiences of live music performances," IEEE Multimedia Magazine, pp. 48-62, 2017.

[20] G. Fazekas, M. Barthet, and M. B. Sandler, "Novel methods in facilitating audience and performer interaction using the Mood Conductor framework," ser. Lecture Notes in Computer Science. Springer-Verlag, 2013, vol. 8905, pp. 122-147.

[21] L. Zhang, Y. Wu, and M. Barthet, "A web application for audience participation in live music performance: The Open Symphony use case," in Proceedings of the International Conference on New Interfaces for Musical Expression, 2016, pp. 170-175.

[22] A. Clement, F. Ribeiro, R. Rodrigues, and R. Penha, "Bridging the gap between performers and the audience using networked smartphones: the a.bel system," in Proceedings of the International Conference on Live Interfaces, 2016.
ABSTRACT

Today, a large variety of technical configurations are used in live performance contexts. In most of them, computers and other devices usually act as powerful yet subordinated agencies, typically piloted by performers: with few notable exceptions, large-scale gestures and structural developments are left either to the performers' actions or to well-planned automations and/or composing algorithms. At the same time, the performance environment is either ignored or tuned out, ideally kept neutral with regard to the actual sound events and the overall performance process. This paper describes a different approach. The authors investigate the complex dynamics arising in live performance when multiple autonomous sound systems are coupled through the acoustic environment. In order to allow for more autonomous and adaptive (or better: ecosystemic) behaviour on the part of the machines, the authors suggest that the notion of interaction should be replaced with that of a permanent and continuing structural coupling between machine(s), performer(s) and environment(s). More particularly, the paper deals with a specific configuration of two (or more) separate computer-based audio systems co-evolving in their autonomic processes based on permanent mutual exchanges through and with the local environment, i.e., in the medium of sound only. An attempt is made at defining a self-regulating, situated, and hybrid dynamical system having its own agency and manifesting its potential behaviour in the performance process. Human agents (performers) can eventually intrude and explore affordances and constraints specific to the performance ecosystem, possibly biasing or altering its emergent behaviours. In so doing, what human agents actually achieve is to specify their role and function in the context of a larger, distributed kind of agency created by the whole set of highly interdependent components active in sound. That may suggest new solutions in the area of improvised or structured performance and in the sound arts in general.

Overall, the approach taken here allows for an empirical investigation of the mobile and porous boundaries between the kind of environmental agency connoting densely connected networks, on one hand, and the freedom (however restricted) in independent and intentional behaviour on the part of human agents intruding the ecosystem, on the other.

Copyright: (c) 2016 Dario Sanfilippo et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. INTRODUCTION

The research documented here stems from the collaboration between two sound artists implementing and exploring complex dynamical systems in the context of live performance [1, 2, 3]. Born out of hands-on experimentation in the authors' private workspace, the collaboration started in 2014 and developed through different stages until 2016, when it was given the project title Machine Milieu.

In usual performance ecosystems [4], computers and other devices act as powerful yet subordinated agencies, typically piloted by practitioners, such that large-scale gestures and structural developments arise from either performers' actions or well-planned automations and/or composing algorithms. To a large extent, the environment is either ignored or tuned out, perceived as ideally neutral with regard to the actual sound events and musical gestures. Our research aims instead at establishing a structural, multi-directional connection between the three main agential nodes in such a performance ecosystem (performer(s), machine(s) and environment(s)) as something entirely mediated by the sounding environment itself. The basic idea is to let every agency develop and sonically manifest itself as a function of all of the other agencies involved. Essential is an effort aimed at having the machines (computers) somehow make sense of what happens sound-wise in the local, shared environment, and act accordingly.

One may say, with an altogether different terminology, that the goal here is to create and situate a performative agency providing the conditions for the emergence [5] of a self through the perturbation and the exploration of the surrounding sound environment, the latter representing a complex structured ensemble standing for non-selves, i.e., standing for other selves [6]: a coherent system develops a sense of its self as each of its components affects all of the others, while the environment to which it is structurally coupled is actively and inevitably influencing the whole process, i.e., the behaviour of the single components as well as their interactions [7]. In essence, this is a recursive process consistent with Gregory Bateson's definition of information as something built and processed by a (cognitive) system coupled to an environment [8]. The performative agency we aim to design and explore is a minimally cognitive system [9, 10] that construes information about its surroundings in order to establish a positive, indeed constructive relationship with the forces present and active in the surroundings. (To what extent that may be pursued through feature-extraction methods and sound descriptors is, in our view, a central issue of theoretical as well as technical relevance, yet we will have to leave
it for a more specific discussion, in another paper.) In a Batesonian perspective, a bit of information is notoriously defined as a difference which makes a difference: a differential quantum that travels and spreads across a circuit and undergoes an overall process of recursive interactions and transformations. As an ensemble, the recursive process has no single site of agency, no specific site or centre of global and unilateral control over the ensemble or any of the single parts. However, while the interacting parts let a whole emerge (a system that cannot be reduced to the single separate parts), the emergent whole in turn may eventually bias or bend the individual parts in their further doing (downward causation) [11] and thus reinforce the ensemble in its consistent and distinct dynamical behaviour.

We call music the traces of such a process in sound. It should now be clear why we are not referring to our collaboration simply as a duo (two human agents, each playing its own device, overlapping its own sound output with that of the partner): it would be more appropriate to think of the overall performance ecosystem explored here as an integrated, hybrid organism, an assemblage made of humans, machines and environment(s), as well as of a rich set of mediations and negotiations allowing for their interdependency. The structure of this assemblage is essentially that of a complex dynamical system [11, 12]. We believe that the realisation of a distributed and parallel interaction like the one described here, based on a tight sound relationship to the actual performance site, results in a peculiar performative approach and opens onto innovative aesthetic developments in the practice of live sound art.

2. THE IMPORTANCE OF AUTONOMY AND FEEDBACK IN MUSIC SYSTEMS

The condition of mutual influence is fundamental and implicit in any notion of interaction. In our view, however, in order to achieve such a condition both humans and machines should be capable of autonomous dynamical behaviours, so that they can act (sound-wise) in the environment while also adapting to the (sounding) actions of other forces and systems. In that sense, automated systems (not to be confused with autonomous ones) are not at all adequate: in designs based on pre-determined or stochastic scheduling of sonic events, for example exploiting pseudo-random number generators and other abstract formal rules, musical articulations are operated in a domain that is both independent of, and fundamentally (in)different to, the domain where musical action takes place, namely sound. Across the many levels of the performance ecosystem, we avoid relinquishing action to any such independent agency, and try to lean as much as possible on sites of agency situated in the experiential milieu of sound, namely as constantly mediated by the surrounding environment. On a general level, defining autonomy in music systems is a rather difficult task [13]. In a bio-cybernetic view [7, 14], autonomous agency in a domain requires a structural openness, a constant exposure to an external space. Different from automation, a fruitful notion of autonomy includes an operational openness to heterogeneous forces and actors in the environment: the system's necessary closure (defining its identity, its self) consists in a loop onto itself through the environment, which provides a pool of possible sources of information, and the space where other agencies are themselves active in their autonomic process.

Leaning on independent sources of information and agency leads to a lack of contextuality and coherence between sonic contents and musical developments. An approach based on autonomous dynamical systems can prove crucial, and perhaps necessary, in order to avoid a certain lack of organicity and to shape up a performative situation more consistent with the metaphor of a live (living) practice and a live (lived) experience of sound and music.

The implementation of multiple feedback delay networks is a very central factor in our practice, for the peculiar dynamical characteristics they exhibit seem well suited to human-machine interactions in the context of music performance [15]. In the particular case of Machine Milieu, we have a distributed and structural interaction between all of the variables and domains involved in the electroacoustic chain (as one can see by following the oriented arrows in Figure 1): from microphones and loudspeakers to the digital processes, as well as to the performers and the environment itself. The recursive interactions born in the feedback loops allow the overall process to develop from the micro-time-scale properties (signal contours and related timbral percepts) to the formal unfolding and behavioural transitions in the systems. Because of the nonlinearities included at various levels (due to the particular audio transformations, as well as to the circuitry of the analog transducers involved), any slight perturbation recirculates in the process in a way that can potentially have significant short- and long-term effects for all of the variables and domains, resulting in a very delicate and non-trivial coexistence and binding of all components in the live environment. When detected differences in the medium are truly informative (to use Batesonian terminology, when information is indeed construed in the coupling of two systems), a larger process is started, in the context of which sound is revealed capable of shaping itself in an organic and expressive way: thus, the agency of the overall system allows for the emergence not only of specific timbral or gestural shapes, but also of larger-scale articulations (musical form).

Indeed, besides being central to any approach to physical modelling of sound [16, 17, 18], in a larger cybernetic perspective feedback mechanisms [19] are considered the key to the modelling of artificial forms of intelligent behaviour, perhaps aimed less at the simulation of something already existing, and more at the manifestation of emergent entities with their own personality (or, in our case, musicality). Information and computation, as referred to any entity having cognitive functions, were historically defined by Heinz von Foerster as recursive processes [20] in a system having sufficient complexity in dealing with the environment and the minimal requisite variety (as defined by W. R. Ashby [21]) in order to be stable and support its self.

3. OVERVIEW DESCRIPTION OF MACHINE MILIEU

3.1 Setup

The main components and relationships in the Machine Milieu project are schematised in Figure 1.
[Diagram: two mirrored halves, Complex System A (with Performer A) and Complex System D (with Performer D), each comprising paired CSP and ASP units, interconnected through the environment.]

Figure 1. Overall configuration of the Machine Milieu project. Oriented arrows represent the signal flow across the network. By environment we mean here the performance room, with its specific acoustics, i.e., the spatial niche which actually shapes the sound produced by the loudspeakers (as well as by the performers and the audience). The complex audio systems are sound-generating systems consisting of time-variant networks of nonlinear processing units, interrelated through positive and negative feedback loops. Each complex system includes audio signal processing (ASP) units as well as control signal processing (CSP) units: the first include DSP algorithms meant to process sound signals; the second include DSP algorithms meant to analyse incoming signals and turn the resulting signal-descriptor data into control signals useful to drive the state variables of the ASP. These two subsystems mutually influence each other. By dropping the two performers, we can still rely on an autonomous, unsupervised dynamical system.
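The CSP/ASP coupling described in the caption can be illustrated with a minimal sketch. This is not the authors' Pure Data/Kyma implementation but a hypothetical Python toy: a CSP derives a smoothed loudness (RMS) estimate from an incoming signal block and maps it onto the value range of an ASP state variable (here, a simple gain), while the ASP output feeds back into the analysis, standing in for the room closing the loop. All names and parameter values are illustrative assumptions.

```python
import math

def rms(block):
    """Loudness estimate: root-mean-square over one windowed segment."""
    return math.sqrt(sum(x * x for x in block) / len(block))

def map_linear(value, in_lo, in_hi, out_lo, out_hi):
    """Map an analysis value onto the range of an ASP control variable."""
    t = min(max((value - in_lo) / (in_hi - in_lo), 0.0), 1.0)
    return out_lo + t * (out_hi - out_lo)

class ControlSignalProcessor:
    """CSP sketch: turns incoming audio into a slowly evolving control signal."""
    def __init__(self, smoothing=0.9):
        self.smoothing = smoothing
        self.state = 0.0

    def analyse(self, block):
        loudness = rms(block)
        # one-pole smoothing keeps the control signal in the sub-audio range
        self.state = self.smoothing * self.state + (1.0 - self.smoothing) * loudness
        return map_linear(self.state, 0.0, 1.0, 0.2, 1.0)

class AudioSignalProcessor:
    """ASP sketch: a gain stage whose state variable is driven by the CSP."""
    def __init__(self):
        self.gain = 1.0

    def process(self, block, control):
        self.gain = control  # control signal drives the state variable
        return [x * self.gain for x in block]

# Mutual influence in miniature: the ASP output is what the CSP analyses
# on the next round (a software stand-in for the acoustic feedback path).
csp, asp = ControlSignalProcessor(), AudioSignalProcessor()
block = [0.5, -0.5, 0.5, -0.5]  # stand-in for a microphone segment
for _ in range(3):
    ctl = csp.analyse(block)
    block = asp.process(block, ctl)
```

In the actual project the loop is closed acoustically (loudspeakers, room, microphones) rather than by passing buffers in software, and the coupling involves many such control signals at once.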
The overall setup includes two performers (called A and D); two sound-generating systems (real-time computer-operated signal processing algorithms), both including audio signal processing (ASP) as well as control signal processing (CSP) units; a set of microphones; a set of loudspeakers; and of course a performance space (Environment).

The interrelatedness among the involved components can be clarified as follows. Each of the two performers' actions is a function of the sonic context as perceived in the environment. The state of each of the two audio systems is dependent on the sonic context (captured through microphones and internally analysed) as well as on the performers' direct access to the internal values of both the audio and the control signal processing algorithms included. Finally, by making the computer processing output audible at specific positions in the local environment, the loudspeakers act as the very means that elicit the performance space's acoustical response, thus affecting both the performers and the computer processes.

In this general setup, performers may eventually be dropped out. In that case, we have an entirely autonomous and unsupervised ecosystem consisting in the coupling of computational devices, a number of transducers and the room acoustics.

3.2 Technical aspects

The complex audio systems implemented are time-variant feedback delay networks through which a number of nonlinear components are interrelated. Both positive and negative feedback mechanisms are established in order to maximise counterbalancing structures in the network and achieve higher degrees of variety in the resulting behaviours. The two audio systems could be divided into at least two subsystems, although they should always be considered as a single unit given their structural synergy. All signal processing is done through Pure Data Vanilla (http://msp.ucsd.edu/software.html) and Kyma/Pacarana (http://kyma.symbolicsound.com/), using exclusively time-domain processing methods, with the microphone signals as inputs. (Frequency-domain methods are set aside mainly for reasons of economy in computational load.)

The ASP units include:

- asynchronous granulation
various (re)sampling methods ysis of transients. In contrast with those, density is proba-
waveshaping (nonlinear distortion) bly a rather unusual kind of descriptor, and is understood
feedback and cascaded FM here in terms of RMS values calculated over multiple ex-
PWM pulse-width modulation tended signal segments (in the order of few to several sec-
simple high-pass and low-pass filtering, as well as onds), eventually correlated with peak envelope tracking
all-pass and comb filtering (attack transients).
delay networks (FDN, feedback delay networks) In actuality, as anticipated above, the analysis data are
for us a source out of which, with a little signal process-
We dispense ourselves with the details: the listed audio ing, we can shape up multiple control signals. Indeed,
signal processing methods, and the related control vari- the observed sonic characteristics are then combined or
ables, will be certainly well-known to the reader. However, used individually and eventually mapped (linearly or non-
it should be noted that, while normally working in the sub- linearly) over value ranges compatible with those of con-
audio frequency range, control variables (CSP outputs) are trol variables in the ASP algorithms. But they are, too,
actually worked out here as audio signals in our implemen- more extensively processed (via simple filters, delay units,
tation. Therefore, while maintaining their control function, etc.) to create a variety of viable control signals. We ac-
here they are occasionally mapped in the audible range and knowledge a creative role in the shaping of several con-
accordingly used as modulation signals (i.e., with audible trol signals out of one or anyway few sound sources: in
spectral changes). It might also be worth noting that FDN this perspective indeed a broader conceptual shift is taking
configurations may act in several different ways: here, they place from interactive composing to composing the in-
are often used to dispatch signals across the set of audio teractions [1]. We also explore the idea of higher-order
processing methods, sometimes creating recursive paths, analytical data created by statistics and measurements of
and thus contributing to articulate layers of sonic transfor- (lower-level) extracted information. Control signals based
mations through larger time frames; yet in certain circum- on higher-order data are especially useful to affect longer-
stances they will also act in a more typical way, i.e., as re- term developments in the variable space, thus achieving a
verb units (in which case included variables are those you sense of global orientation across prolonged sound textures
would expect from a reverberator unit). and global musical shapes.
The CSP units involve either simple mappings or more Once applied to audio processing variables, control sig-
elaborate transformations of data created by the real-time nals will affect either subtle or overt changes in the sound,
analysis and evaluation of a limited range of sonorous as- and thus because sound is the very source of control
pects in the audio signal. There is today a very large body signals they will loop back onto themselves, indirectly
of work on feature-extraction and descriptors (see re- affecting their own subsequent unfolding. This is feed-
lated topics and broader questions in [22, 23, 24]) which back in the low-frequency domain (second or higher-order
represents for us a relevant shared common knowledge. emergent patterns heard as longer-term patterns or events).
However, in our project we do not rely on any library of de- The specific final mapping is what will determine whether
scriptors or other existing resources of the kind. Rather, we the relationship between variables and control signals makes
prefer implementing ourselves simple but efficient meth- for a positive or negative feedback loop.
ods matching specific requirements, that of implying very It is important to notice that, as heard by any of the
limited computational load (all processes must be highly individual ecosystem components, the sonic environment
efficient under real-time computation constraints) and that comprises ones own sound (the particular components)
of coming up with a small but varied set of hypothesis on as well as the sound output of other sources in the environ-
broad but auditorily important aspects of the sound in the ment (including the environment itself, if not acoustically
performance space. dry or idle). Accordingly, of special interest is the genera-
In our work, the main sonic features subject to tracking tion of control signals based on feature-extraction methods
algorithms include: somehow able to distinguish and track down, from within
the total sound, those events or properties coming from
loudness ones own contribution in the total system and those com-
density ing from others.
brightness
3.3 Modes of performance
noisiness
roughness The complete resultant infrastructure can be seen as a densely
interconnected network. Densely connected network sys-
Loudness estimations are obtained as RMS values calcu- tems have been fruitfully investigated in the context of al-
lated over subsequent windowed segments of the incom- gorithmically based live performance practices e.g., The
ing signal (as an alternative, we sometimes integrate am- Hub [25, 26] and the early League of Automatic Music
plitude values over the signal chunks). Windowed seg- Composers [27], and resurface today in collective live cod-
ments can be of different sizes, and the size can also change ing practices. In [26] a discussion is proposed as to the
during the performance as a function of some other con- ecosystemic (or simply technical) nature of such networked
trol signal. Brightness and noisiness are calculated via performance situations. Here, we consider that in those ex-
original CPU-efficient algorithms operating in the time do- amples the network interconnectedness typically relies on
main (zero-crossing rate, differentiation, spectral median) abstract, formal protocols of music data and their trans-
and/or via averaged responses of large-width band-pass fil- fer along lines of digital connections (MIDI, OSC, etc.),
ters. Roughness estimation uses peak envelope for the anal- if not on more general computer network protocols, un-
SMC2017-24
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
related to music. Leaning on previous works [1, 2, 3], back, and very difficult to realise: the artefacts and traces
the main sites of agency in the network are better under- of such an almost impossible task actually resonate in the
stood as components in a sonic ecosystem, i.e., they are musical unfolding of the performance itself.
structurally coupled in the medium of sound. The indi- In the third approach, instead, there is a sharper separa-
vidual functions as well as the interdependencies among tion between performer actions and the overall ecosystem
the ecosystem components remain under the spell of the process. The goal, in this case, is to do as least as possi-
permanent mechanical (acoustical) mediation of the local ble in order to let the systems start sounding in the room
environment. Taking up anthropologist Tim Ingolds cri- thus eventually interacting between themselves. In this sit-
tique of the widespread notion of network (see [28], p. 70), uation, each one system is actually interfering with its own
in our case the connections out of which a larger dynami- sound as heard in the surrounding environment but through
cal ensemble is formed, are not between nodes but rather the other system. It was interesting to hear them creating
along lines (acoustical propagation in the air). Understood sonic materials and gestures somewhat different form those
as ensembles relinquished in a physical (and hypertechnol- generated when operating each for itself. After setting up
ogised) environment, and acted upon (or let unsupervised) the initial conditions (tuning of variables, placement of
by human agents, such situated networks of sonic interac- microphones and speakers), we would let the overall pro-
tions make it difficult to tell what is the place and meaning cess develop until satisfying behaviours emerged; eventu-
of computing in such configurations of humans and non- ally, when the potential seemed to have exhausted, we in-
humans [29]. At the same time, they allow us to emphasise tervened slightly altering the setup and defining new initial
the very materiality and situatedness of algorithmic struc- conditions, letting the process manifest new emergent be-
tures, quite usually considered as purely formal construc- haviours (we would then keep proceeding like that for a
tions disconnected from and independent of any sensible few times).
context [30]. In retrospect, the three approaches taken are all consis-
By entering the otherwise unsupervised autonomous sys- tent with the Machine Milieu project. They represent for
tem, performers have two fundamental ways through which us different ways to investigate the boundaries between, on
they can interact with the machines. On one hand, they one hand, the kind of environmental agency [14] connoting
can operate directly on the ASP and CSP, namely by man- our performance ecosystem and, on the other,the (small but
ually varying the variables in the audio transformation al- significant) margin of manoeuvre in the independent, tar-
gorithms (via a computer GUI or an external controller), geted and intentional actions taken on by human agents.
or by reconfiguring the mapping and the dispatching of What remains constant, and central, across these differ-
control signals. Alternatively, performers can for example ent performative approaches, is the notion that the main
change the position of the microphones in the performance system components involved (either including or not in-
space, or creatively modify the frequency response of mi- cluding humans) are permanently coupled to each other.
crophones with small resonators such as pipes or boxes. Properly speaking, they do not interact there is no discrete
Either ways, relying on an attitude of improvisation cer- event in space that can be called an interaction and that
tainly represents a straightforward and consistent approach might be entirely independent in time from (few or sev-
when it comes to performing with feedback systems [31, eral) earlier exchanges. Rather, there is a continuing flow
32], especially in consideration that improvisation is itself of mutual exchanges or reciprocal determinations. Within
intrinsically a feedback process (current actions are mostly an ecosystemic perspective, the notion of interaction is not
determined by carefully listening and promptly reacting to at all satisfactory, and should be replaced by a notion of
whatever results from earlier actions). Indeed, we have structural coupling (itself a term coming from general sys-
often adopted a largely improvisational approach during tem theory, and more specifically connoting the way living
the Machine Milieu sessions we were able to have so far. systems use to deal with their environment [7]).
However we thought that some alternative ways of going
should be considered: even when radical, improvisation
4. CONCLUSIVE REMARKS
may lead to higher-level patterns and gestural behaviours
which all too directly connote (and delimit) the range of In this paper, we have overviewed some artistic research
possible interesting situations. More particularly, we felt issues central in the Machine Milieu project, in the context
that improvisation would be good in order to explore spe- of which we explore the environment-mediated coupling of
cific aspects of the network dynamics in our performance autonomous generative audio systems in live performance.
ecosystem, but could prevent us from understanding and By creating a hybrid (digital, analog and mechanical) as-
creatively investigating other aspects, specifically with re- semblage for creatively experimenting with the interde-
gard to the potential autonomic behaviours it may engen- pendencies among situated autonomous systems, we try to
der. open a space for performers and listeners to ponder ques-
We explored two main alternatives. The first is actually tions of context awareness, ecosystemic dynamics, mate-
just a small shift away from radical improvisation: because riality of algorithms in daily life, while also empirically
the two autonomous sound-generating systems were capa- touching on more fundamental questions of autonomy and
ble of exhibiting peculiar dynamical behaviours, we tried dynamical behaviour, on the other. We are inclined to con-
to contrast that attitude with our interventions in an attempt sider such notions crucial in music systems intended for
at forcing them to a somewhat more static behaviours (pro- live performance, and this is especially true when as we
longed sonic textures, rich in short-term variations, but con- do here one replaces a notion of interaction (nowadays
sistent and static on the longer run). In systemic terms, too charged with misunderstandings and often treated in
this way of acting in the system is a form of negative feed- trivial ways) with the systemic notion of structural cou-
SMC2017-25
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
pling (among machines, performers, and environments). The implicit statement is that, in line with larger theoretical efforts today emphasising the environmentalisation of agency (or the agentivity of overly technologised environments) [14, 33], the realisation of a kind of distributed musical agency may result in an understanding of musical creativity as a phenomenon that is emergent and indeed ecosystemic.

Furthermore, the dialectical interweaving of individual action and autonomous ecosystemic processes allows a variety of performative practices to be explored which, on the sole basis of sound-mediated interactions, projects the interdependencies and codependencies of part and whole (i.e., of systems and subsystems, as well as of structurally coupled but different domains) to the level of larger-scale developments in a performance: any notion of form then becomes closer to the ensemble of component parts, either contrasted or harmoniously interwoven in their temporal and spatial activity. In the particular experience reported above, we touched on this issue when contrasting the autonomic behaviours of unsupervised, environment-mediated machines against the direct and radical improvisation approach of practitioners entering the performance ecosystem.

In further work, we expect that approaches such as the one taken here may contribute to a deeper understanding of what can be considered "live" in sound art and music performance with live electronics. Unlike other authors who have tackled this issue [34, 35], we are convinced that, in and of itself, the sheer presence of human performers in the context, and in the foreground, of an overly technologised playground is insufficient to clarify hybrid assemblages merging human and machine agency such as those deployed in the Machine Milieu project. The implicit and unquestioned opposition of living (hence cognitive) vs. non-living (hence non-cognitive) systems, reflected in other similar dualities such as human vs. inhuman and natural vs. artificial, is today probably untenable as a basis for understanding what (or who) is "live" in live electronics. Maybe we should look for what is no longer human in living systems, and what is already all too human in artificial (i.e., humanly construed) systems.

5. REFERENCES

[1] A. Di Scipio, "Sound is the interface: from interactive to ecosystemic signal processing," Organised Sound, vol. 8, no. 3, pp. 269-277, 2003.

[2] A. Di Scipio, "Émergence du son, son d'émergence: Essai d'épistémologie expérimentale par un compositeur," Intellectica, no. 48-49, pp. 221-249, 2008.

[3] D. Sanfilippo, "Turning perturbation into emergent sound, and sound into perturbation," Interference: A Journal of Audio Culture, no. 3, 2013.

[4] S. Waters, "Performance ecosystems: Ecological approaches to musical interaction," EMS: Electroacoustic Music Studies Network, 2007.

[5] P. A. Corning, "The re-emergence of emergence: A venerable concept in search of a theory," Complexity, vol. 7, no. 6, pp. 18-30, 2002.

[6] A. Di Scipio, "Listening to yourself through the otherself: on background noise study and other works," Organised Sound, vol. 16, no. 2, pp. 97-108, 2011.

[7] H. R. Maturana and F. J. Varela, Autopoiesis and Cognition. Dordrecht: Reidel, 1980.

[8] G. Bateson, Steps to an Ecology of Mind. University of Chicago Press, 1972.

[9] A. Etxeberria, J. J. Merelo, and A. Moreno, "Studying organisms with basic cognitive capacities in artificial worlds," Cognitiva, vol. 3, no. 2, pp. 203-218, 1994.

[10] X. Barandiaran and A. Moreno, "On what makes certain dynamical systems cognitive: A minimally cognitive organization program," Adaptive Behavior, vol. 14, no. 2, pp. 171-185, 2006.

[11] R. Benkirane, E. Morin, I. Prigogine, and F. Varela, La complexité, vertiges et promesses: 18 histoires de sciences. Le Pommier, 2002.

[12] M. Mitchell, "Complex systems: Network thinking," Artificial Intelligence, vol. 170, no. 18, pp. 1194-1212, 2006.

[13] O. Bown and A. Martin, "Autonomy in music-generating systems," in Eighth Artificial Intelligence and Interactive Digital Entertainment Conference, 2012, p. 23.

[14] B. Clarke and M. B. Hansen, Emergence and Embodiment: New Essays on Second-Order Systems Theory. Duke University Press, 2009.

[15] D. Sanfilippo and A. Valle, "Feedback systems: An analytical framework," Computer Music Journal, vol. 37, no. 2, pp. 12-27, 2013.

[16] K. Karplus and A. Strong, "Digital synthesis of plucked-string and drum timbres," Computer Music Journal, vol. 7, no. 2, pp. 43-55, 1983.

[17] P. R. Cook, "A meta-wind-instrument physical model, and a meta-controller for real-time performance control," 1992.

[18] D. Rocchesso and J. O. Smith, "Circulant and elliptic feedback delay networks for artificial reverberation," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 1, pp. 51-63, 1997.

[19] F. Heylighen and C. Joslyn, "Cybernetics and second-order cybernetics," Encyclopedia of Physical Science & Technology, vol. 4, pp. 155-170, 2001.

[20] H. von Foerster, Understanding Understanding: Essays on Cybernetics and Cognition. Springer Science & Business Media, 2007.

[21] W. Ashby, "Requisite variety and its implications for the control of complex systems," Cybernetica, vol. 1, pp. 83-99, 1958.
and other musicians (p. 454). In all of these cases, however, the feedback was focused on particular aspects of the provided system for audience participation or on the specific performance concept.

In contrast, Mazzanti et al. [8] propose six metrics to describe and evaluate concepts for participatory performances at a more general level. They directly address aspects of participatory performances conceptually and technically, although still tied closely to their particular Augmented Stage platform and to participants' feedback collected during its evaluation.

With the piece Experimence [12], we followed a different approach and composed a song with audience participation in mind. We reflected on the creative process of this composition and concluded with rather general variables. These variables describe considerations for a composition before any particular technology for such an interactive performance system is available.

In most published studies, participants' feedback about technology-mediated audience participation (TMAP) is generally positive [2-4, 6, 7]. Feedback in such studies tends to be limited to particular technical systems and to be focused on details of interaction modalities and on desired improvements or additional features. Audiences often express a wish for more control [3, 6]. Musicians, however, appear to be far more sceptical towards new ways of audience participation [6]. Yet the literature does not contain much evidence or discussion about these concerns on the musicians' side.

Overall, musicians and audiences have distinctive requirements, as does musical coherence, and there can be wide variation within both groups. As examples in the literature suggest, the effective design of TMAP generally requires balancing knowledge from diverse perspectives and taking into account the requirements of the different roles in live music performance.

The present study is unusual in surveying these requirements from two different perspectives and without any particular TMAP concept or technology in mind. The aim was to identify general design implications as well as potential design strategies for future case studies.

3. SURVEY

The survey was designed to be conducted online using the free open-source software LimeSurvey 1. Participants could choose between a German and an English version.

1 https://www.limesurvey.org (last access 15.05.2017)

3.1 Questionnaire Design

The questionnaire began with questions about basic information, followed by questions about music-related information in general and live music in particular, and finally focused on audience participation in live music. We used these four sections to guide the participant through the survey step by step, from very general questions to very particular ones concerning audience participation.

In the second part of the questionnaire the participants had to rate various statements from their point of view and experience using different Likert scales [13, 14]. These statements were primarily informed by literature investigating the experiences of spectators attending live music performances [6, 15]. The original questionnaire as well as detailed survey results are available in Appendix B in [16].

Question                                            Spectators  Musicians
Age (younger than 29)                                   59%        47%
Gender (male)                                           57%        80%
Education (college or higher)                           43%        70%
Playing an instrument or vocal training (yes)           75%       100%
Attending or playing concerts (once/month or more)      52%        21%

Table 1. Demographics

3.2 Analysis Approach

For the analysis of the results, descriptive statistics and quantitative methods were used. Ways of presenting these results include bar charts showing frequencies of responses as percentages of the whole sample [17]. This analysis approach mainly concerns the results presented in 4.1 and 4.4. For questions which allowed a wide range of response options, statistical measures of central tendency 2 were calculated for easier interpretation [18]. The results of these questions are presented in 4.2, 4.3, and 4.5.

2 median (Md), mode (Mo), and interquartile range (IQR)

4. RESULTS

The survey was carried out online over a period of three weeks, resulting in 254 responses. For the analysis, incomplete responses (27) were excluded, which left 227 complete datasets (169 spectators, 58 musicians).

Different channels, mainly Austrian and British, were used to distribute the survey link. Among these channels were mailing lists of universities, music-related projects and communities, personal contacts of the involved researchers, and social media. Furthermore, distribution by companies in the music business (e.g. labels, concert organisers), music-related magazines, and broadcasting stations was requested; these inquiries mainly remained unconfirmed, though.

4.1 Demographics and Music-Related Information

Musicians and spectators were analysed separately. Table 1 gives a demographic overview of the dataset. In both target groups about half were younger than 29, and there was a good balance among spectators between male and female, while the musicians were predominantly male. Three quarters of all spectators (75%) played instruments or had vocal training. The musicians were not explicitly asked whether they had musical training or not; this was assumed, based on their decision to fill out the survey as a musician.
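As a hedged illustration of the analysis steps described above (excluding incomplete responses, splitting the sample by role, and computing the central-tendency measures of footnote 2: median Md, mode Mo, and interquartile range IQR), the following Python sketch uses the standard-library statistics module. The record structure and the Likert ratings are invented for illustration; only the counts (254 collected, 27 incomplete, 169 spectators, 58 musicians) come from the survey.

```python
# Sketch of the descriptive analysis; data below is hypothetical, not the
# actual LimeSurvey export or survey responses.
from statistics import median, mode, quantiles

# 254 hypothetical responses; the roles of the incomplete ones are unknown,
# so they are simply flagged as incomplete here.
responses = ([{"role": "spectator", "complete": True}] * 169
             + [{"role": "musician", "complete": True}] * 58
             + [{"role": None, "complete": False}] * 27)

complete = [r for r in responses if r["complete"]]       # 227 complete datasets
spectators = [r for r in complete if r["role"] == "spectator"]
musicians = [r for r in complete if r["role"] == "musician"]

# Invented 4-point Likert ratings (1 = strongly disagree ... 4 = strongly agree).
ratings = [4, 4, 3, 4, 2, 4, 3, 4, 4, 3, 4, 1]

md = median(ratings)                 # Md: central tendency
mo = mode(ratings)                   # Mo: most frequent rating
q1, _, q3 = quantiles(ratings, n=4)  # quartiles ('exclusive' method by default)
iqr = q3 - q1                        # IQR: spread of the middle 50% of ratings

print(len(complete), len(spectators), len(musicians))  # 227 169 58
print(md, mo, iqr)
```

Read this way, an IQR of 1 (as reported below for the "volume of certain instruments" statement) means the middle half of the ratings spans a single scale step, i.e. a comparatively stable opinion.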
the category dramaturgy (Md=3/Mo=3; 3 is "tend to agree"). The statement "volume of certain instruments" even has an interquartile range of 1, which indicates less dispersion and a more stable "tend to agree". In most cases spectators have a neutral opinion. Musicians have much stronger opinions: most of them tend to agree with being able to influence visuals or dramaturgical elements, but, except for general volume, most disagree with all statements within the category sound.

Finally, survey participants had to rate statements about how TMAP could actually work. These statements included general strategies to involve the audience, concrete examples of participation, and actual technologies that might be used.

Among the group of spectators, one statement clearly stands out, "the artist meets the expectations of the audience": most spectators strongly agree with it (Md=4/Mo=4) and the values are not widely dispersed (IQR=1). Most spectators could also imagine using phonometers to measure the noise level and making a certain creative contribution. As sensor technology, they could imagine cameras for visual recognition and floor sensors. Most spectators disagree with voting, controlling sound or visuals actively, or providing personal data (e.g. heart rate). In most cases, musicians have a similar opinion to the spectators, or agree even more strongly than audience members. With active sound control or sensor data, however, they strongly disagree.

5. DISCUSSION

The survey presented in this paper explores the design space of technology-mediated audience participation (TMAP) in live music from the perspective of two key stakeholder groups (musicians and audiences). We now discuss the previously described survey results and revisit the four research questions. By answering the first three research questions and discussing notable tendencies in the results, we identify implications that concern the design of TMAP.

In particular, we look step by step at noticeable differences in the statistical values of the results and draw conclusions in relation to TMAP. We finish the discussion by revisiting the fourth research question, taking the outcomes of the whole survey into consideration to elaborate and propose implications for the design of TMAP. To start the discussion, we address the first research question: What are the musical preferences, motivations, and behavioural tendencies of spectators and musicians in live concerts?

5.1 Music-Related Information

Looking at the musical training of spectators, three quarters stated that they play instruments. This number is relatively high, bearing in mind that those who filled out the survey as spectators do not consider themselves musicians. The fact that 75% of the spectators have musical training supports the credibility of this survey's responses regarding music-related questions to spectators. Furthermore, these numbers highlight the issue of musically trained spectators among the audience or, more generally, the need to consider possible skills among the audience for the design of TMAP. We refer to this as skilfulness.

A little more than half of the surveyed spectators attend live concerts at least once a month, which is a good number of people regularly experiencing live music. Among the musicians, only a fifth play concerts with the same regularity. Inverting this number, the majority of the musicians play live concerts less than once a month. This is lower than one might expect when asking people who consider themselves musicians. A possible explanation could be that a certain number of musicians have above-average experience and regularly play live concerts, but do not make a living from music. Following this assumption, the aforementioned one fifth could be considered professionals, which seems appropriate for someone who plays a concert every month or more often. For the design of TMAP this means it is important to consider the professional level of musicians and their live performances. In conclusion, we refer to this as masterfulness.

5.2 Motivation to Play or Attend Live Concerts

The participants' motivation in relation to live concerts showed strong agreement in terms of having a distinctive experience. For spectators, the strongest motivation for visiting concerts is to have a unique and special experience. Similarly, most musicians want to create a unique and special experience. Additionally, spectators agreed that live music is better than listening to records. This raises the implication of distinctiveness: it refers to the distinctive experience TMAP should create in a live concert.

Following the previous implication, this suggests that TMAP should always create a distinctive experience. At the same time, however, only few spectators agreed to be involved in a show. Musicians, on the other hand, are more prepared to involve the audience, but this is still the second lowest among their ratings. In addition, many spectators want to focus on the music without distraction, while most musicians agree on focusing on playing music. Strictly speaking, we could interpret this as an indication that people are not really interested in TMAP. However, it could also mean that we should focus on a well-considered and unobtrusive involvement of spectators and musicians when utilising TMAP to create a unique live music experience for everybody involved. This highlights an implication for the design of TMAP we call obtrusiveness.

Two other statements sharing high agreement among the spectators are about being part of an audience and expressing themselves to show excitement. Again, with TMAP in mind, this indicates the importance of the spectators being able to act expressively and to identify themselves with the whole audience. This leads to expressiveness, which means the design of TMAP needs to consider forms of interaction that enable the spectators to be expressive, whether as individuals, in smaller groups, or as a whole audience. In parallel, Mazzanti et al. [8] refer to expressiveness to some extent with their dimensions "Active/Passive Audience Affinity" and "Audience Interaction Transparency".

The social context was also identified as important. We interpret the high agreement of spectators to meet other
people as sociability. This sociability refers to social aspects in relation to TMAP. For example, there could be a certain social motivation to enable TMAP on the side of the artist, or to participate as a spectator. Either way, it allows spectators to socialise to some extent, whether with friends or by meeting other people.

Spectators want to have a unique experience and prefer live performed music. Most important for musicians is to be on stage and play music publicly. This very high agreement among musicians on being exposed on stage, together with the wish for liveness among the audience, raises the implication of exposure. This not only highlights the importance for musicians of being on stage, but also the need to design TMAP in a way that considers the exposed situation of musicians.

5.3 Behaviour at Live Concerts

Behaviour at live concerts, for both spectators and musicians, is related to expressiveness, as discussed above. However, we can see certain differences in behaviour for different song types. This shows that the spectators' and musicians' behaviour depends on the songs and the mood they create among spectators, and this raised the implication of mood. The challenge for the design of TMAP is to consider the mood and the resulting behaviour of a participating audience as well as of the musicians. For instance, some audiences might have a certain mood-driven behaviour a priori (e.g. according to a style of music) and others might change their behaviour according to a particular song that creates a different mood (e.g. the hit of a band versus a new song no one knows).

The wide distribution of values for statements of what spectators do every time (sing along) or what they never do (shout/whistle) supports the assumption of a higher dispersion of ratings for these statements. This means that although most spectators always sing along, there is a certain number of spectators who will not always do it. We call these anticipated differences in behaviour among spectators diversity.

It is clearer, however, that spectators do not tend to use objects during a concert. Three such statements are among the four lowest rated (using a lighter, phone, or camera). In addition, these three are the only ones among all behaviours requiring an object or specific thing. This raises the implication of objects, which means we need to consider the role of tangible interfaces in the interaction design for TMAP.

Musicians show that they like to communicate with spectators, whether passively, when smiling at them or watching their reactions, or actively, when making announcements. We argue that all of these show some sort of appreciation of the audience. While making announcements also has an informational purpose, the other two show that most musicians care about their spectators, whether by just observing them to see a reaction or by actively smiling at them. As we know from Lee et al. [7], spectators report "feeling connected" (p. 454) when experiencing TMAP. We call this the implication of communication. When TMAP actually happens during a live concert, it most likely needs some sort of communication, whether done by the musicians themselves, by a moderator, or in a self-explanatory way.

5.4 Mobile Technologies during Live Concerts

The second research question was: How do spectators and musicians use mobile technology during live concerts? With the implication of objects, we already looked at phone use in relation to behaviour during songs. Additional survey results show that only 56% use their phones at least once during a live concert. The only particular reason spectators use their phones is to take pictures; most spectators never use their mobile phones for any other purpose. In conclusion, we define this as readiness. For TMAP this means we need to consider to what extent an audience is ready for a particular form of participation. This readiness could be in terms of general availability (e.g. having a device such as a mobile phone that is capable of something) or in terms of a certain knowledge or habit (e.g. using the mobile phone for a particular purpose).

In general, musicians do not often use mobile technology for their performances. What we do not learn from the results are the reasons why most of the musicians do not use mobile technologies for their performances. This might be for practical reasons, because they just do not need them for artistic purposes, but it could also be a kind of refusal. In conclusion, we raise the implication of openness. This openness is somewhat similar to readiness, but focuses more on the musicians' relation towards technology as an important part of TMAP.

5.5 Opinion About TMAP

With the third research question we asked: What are the concerns of spectators and musicians regarding TMAP? Overall, most spectators tend to agree more on influencing elements of sound (e.g. volume) or dramaturgy (e.g. song selection) in a live concert. Most musicians tend to agree on letting the audience participate in visuals (e.g. lights) or dramaturgy as well, but strongly disagree on an influence on sound. Interpreting the figures, it is noticeable that most spectators do not care too much about visuals, while most musicians would somehow offer them the chance to participate in light effects, for instance. Regarding sound, most spectators would like to have some influence, while most musicians do not want the audience to influence the sound. In relation to TMAP we call this appropriateness. This means the actual impact of participation on some performance element has to be chosen and designed in a way both spectators and musicians can live with.

From the results we know that most spectators tend to agree on influencing the sound to some extent. When asked whether they could imagine using a smartphone app to control the sound actively, however, most spectators strongly disagreed. In a similarly contradictory way, most spectators tend to agree on influencing the choice of songs, but strongly disagree on using a smartphone app for voting. This inconsistency raises the implication of contradiction, which describes the challenge of finding a compromise to resolve a contradictory situation. In the same way, other studies report that audiences wish to have more influence when experiencing TMAP, while musicians tend to be sceptical towards giving them more control [3, 6].
Most spectators and musicians tend to agree that it would make a live concert more exciting if the audience could make a certain creative contribution; musicians agree with that even more strongly. We do not know which form of creativity musicians had in mind when they rated the statements. If we consider sound as one of the most important creativity-related aspects of a live concert, this contradicts the musicians' refusal of an impact on sound, as we already know. In conclusion, we define the implication of creativity as dealing with the challenge of to what extent TMAP is, or has to be, a creative contribution to a live concert.

5.6 Implications for Design of TMAP

The final research question builds on the previous three to draw conclusions: What implications for the design of TMAP can be identified? By answering the previous three research questions and discussing the results of all survey sections step by step, we identified 16 implications concerning the design of TMAP in live music. Some of them primarily address either spectators' or musicians' requirements; others concern both.

In some cases, we identified these implications by drawing together notable results across survey questions and sections and by considering spectators' as well as musicians' motivation, behaviour, and opinion. Hence, some implications might overlap to some extent, e.g. readiness and openness. Both address attitudes and habits of the spectators and musicians in relation to technology that might have an impact on the design of TMAP. Nonetheless, readiness highlights more the technological availability and habits of spectators, while openness rather concerns the musicians' relation towards technology.

Besides readiness and openness, obtrusiveness in particular showed a potential scepticism in relation to TMAP on both sides, although various case studies in the literature [2-4, 6, 7] report mostly positive feedback from participants who actually experienced TMAP at live concerts. This indicates a certain difficulty in envisioning TMAP without having experienced it, as our survey participants had to do.

From a structural point of view, these implications stand by themselves rather than forming a complete set of design strategies. However, they are an important step towards generalising design strategies around TMAP, and they serve as a well-founded starting point for actual design processes. Furthermore, these implications for the design of TMAP complement the range of design implications derived from particular case studies, such as the ones presented in the discussion of related work.

5.7 Limitations

The survey presented in this paper investigates a diverse range of aspects that concern the design of TMAP. While the results we discussed mainly address a series of general design implications beyond specific use cases, we do not know much about demographic data, about which population the results are representative of, or whether study participants had prior experience with TMAP. It can be difficult to imagine the implications of TMAP for those who have never experienced such an interactive performance. This is a limitation in terms of what we have learned about the participants' opinion about TMAP, as we cannot judge whether people answered based on their expectations or prior experience.

Although all design implications are derived from the survey results in the same way, they address the design of TMAP on different levels. Some of them are more obvious for design (e.g. distinctiveness, sociability, expressiveness), while others are more difficult to consider and orient to (e.g. mood, openness).

6. CONCLUSION

In the existing literature, the identification of issues for the design of technology-mediated audience participation (TMAP) in live music has mostly been based on concrete case studies and derived from case-specific feedback of participants. This has resulted in an identifiable gap in design knowledge. Consequently, we conducted an online survey to collect quantitative data about live music and TMAP on a more general basis, detached from any particular case study. The results were based on 227 complete responses of spectators and musicians, analysed through the use of descriptive statistics. With a step-by-step discussion of these results across survey sections and both target groups, we identified 16 key issues for the design of TMAP in live music. These issues are skilfulness, expressiveness, diversity, objects, readiness, masterfulness, exposure, communication, openness, creativity, distinctiveness, obtrusiveness, sociability, mood, appropriateness, and contradiction.

The idea most desired by spectators was to select the songs played during a concert by using TMAP. Visuals in general are sometimes offered by musicians as an element for control by TMAP, but do not concern audience members much. In the case of sound, audience members mostly wish to control the volume of the music, whereas the musicians mostly reject any influence on sound.

Finally, there are preferences about particular technologies for audience participation. Candidate technologies include recognition systems such as cameras, floor sensors, and phonometers. In the case of smartphone technologies in particular, opinions are divided.

6.1 Future Work

Throughout the discussion, we found additional possible design directions to be investigated in further studies. These rather concrete design ideas concern the possible impact on performance elements and technological preferences.

In general, musicians agree with a creative contribution from the audience. We cannot draw further conclusions from the results about how this creative contribution might look. Thus, it should be a potential focus for further studies.

As mentioned in the limitations section, little demographic information is known about the sample of survey participants. This could be a good starting point for further studies to investigate specific target demographic groups (e.g. based on knowledge, experience, or genre).

Methodologically, future work could explore people's opinions using qualitative methods. In this way, we could
better understand why some ideas for TMAP are preferred and others are not.

7. REFERENCES

[1] M. Neuhaus, "The Broadcast Works and Audium," Zeitgleich: The Symposium, the Seminar, the Exhibition, pp. 1-19, 1994.

[2] G. McAllister and M. Alcorn, "Interactive performance with wireless PDAs," in Proceedings of the International Computer Music Conference, 2004, pp. 1-4.

[3] J. Freeman, "Large Audience Participation, Technology, and Orchestral Performance," in Proceedings of the International Computer Music Conference, 2005, pp. 757-760.

[4] M. Feldmeier and J. Paradiso, "An Interactive Music Environment for Large Groups with Giveaway Wireless Motion Sensors," Computer Music Journal, vol. 31, no. 1, pp. 50-67, 2007.

[5] N. Weitzner, J. Freeman, S. Garrett, and Y.-L. Chen, "massMobile - an Audience Participation Framework," in Proceedings of the International Conference on New Interfaces for Musical Expression, University of Michigan, Ann Arbor, 2012, pp. 1-4.

[6] O. Hödl, F. Kayali, and G. Fitzpatrick, "Designing Interactive Audience Participation Using Smart Phones in a Musical Performance," in Proceedings of the International Computer Music Conference, 2012, pp. 236-241.

[7] S. Lee and J. Freeman, "echobo: A Mobile Music Instrument Designed for Audience To Play," in Proceedings of the International Conference on New Interfaces for Musical Expression, KAIST, Daejeon, Korea, 2013, pp. 450-455.

[8] D. Mazzanti, V. Zappi, D. Caldwell, and A. Brogni, "Augmented Stage for Participatory Performances," in Proceedings of the International Conference on New Interfaces for Musical Expression, London, UK, 2014, pp. 29-34.

[9] B. V. Hout, M. Funk, L. Giacolini, J. Frens, and B. Hengeveld, "Experio: a Design for Novel Audience Participation in Club Settings," in Proceedings of the International Conference on New Interfaces for Musical Expression, Goldsmiths University, London, UK, 2014, pp. 46-49.

[10] K. Hayes, M. Barthet, Y. Wu, L. Zhang, and N. Bryan-Kinns, "A Participatory Live Music Performance with the Open Symphony System," in Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems - CHI EA '16. New York, NY, USA: ACM Press, 2016, pp. 313-316.

[11] G. Levin, "Dialtones (A Telesymphony)," 2001. [Online]. Available: http://www.flong.com/projects/telesymphony/

[12] O. Hödl, G. Fitzpatrick, and S. Holland, "Experimence: Considerations for Composing a Rock Song for Interactive Audience Participation," in Proceedings of ICMC, 2014, pp. 169-176.

[13] P. H. Rossi, J. D. Wright, and A. B. Anderson, Handbook of Survey Research. New York: Academic Press, 1983.

[14] R. Porst, Fragebogen - Ein Arbeitsbuch, 3rd ed. VS Verlag für Sozialwissenschaften, 2011.

[15] K. Burland and S. Pitts, Eds., Coughing and Clapping: Investigating Audience Experience. Surrey: Ashgate, 2014.

[16] O. Hödl, "The Design of Technology-Mediated Audience Participation in Live Music," PhD Thesis, Vienna University of Technology, 2016. [Online]. Available: http://repositum.tuwien.ac.at/obvutwhs/download/pdf/1375209

[17] N. Blaikie, Analyzing Quantitative Data: From Description to Explanation. London: Sage Publications Ltd, 2003.

[18] S. Jamieson, "Likert scales: how to (ab)use them," Medical Education, vol. 38, no. 12, pp. 1217-1218, 2004.
SMC2017-34
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
ABSTRACT

There is a demand for arranging music composed using multiple instruments for a solo piano because there are several pianists who wish to practice playing their favorite songs or music. Generally, the method used for piano arrangement entails reducing original notes to fit on a two-line staff. However, a fundamental solution that improves originality and playability in conjunction with score quality continues to elude approaches proposed by extant studies. Hence, the present study proposes a new approach to arranging a musical score for the piano by using four musical components, namely melody, chords, rhythm, and the number of notes, that can be extracted from an original score. The proposed method involves inputting an original score and subsequently generating both right- and left-hand playing parts of piano scores. With respect to the right part, optional notes from a chord were added to the melody. With respect to the left part, appropriate accompaniments were selected from a database comprising pop musical piano scores. The selected accompaniments are considered to correspond to the impression of an original score. High-quality solo piano scores reflecting original characteristics were generated and considered as part of playability.

1. INTRODUCTION

Players often enjoy music while performing an instrument. However, it is typically difficult to play music that is composed using multiple instruments and vocal parts, and this is applicable to orchestral, chamber, and pop music. The main reason is that it is necessary to coordinate several individuals who can play an instrument and to adjust their schedules accordingly. Therefore, several studies have focused on support for solo playing by arranging an original score. Additionally, solo arrangements aid in playing comfortably and discovering new aspects of the music. They also increase a player's motivation by playing their favorite music. Conversely, creating an arranged score involves considerable labor and expertise. Hence, the present study investigates a problem involving an automated arrangement for a solo instrument based on music composed of multiple playing parts. Arranging for instruments that can simultaneously play a plurality of notes, such as the piano, guitar, and organ, involves various problems such as expressing impressions of original songs and considering playability. This is because an instrument that can play a plurality of notes has an extended range of expression compared with an instrument that only plays monotonic notes. The study focuses on pop music as it involves music composed using multiple instruments. The study also focuses on a piano arrangement since the piano is one of the most famous instruments that generate a wide range of expressions. Therefore, the present study examines the automatic arrangement of pop music for a solo piano.

Generally, extant studies have discussed automating an arrangement for a piano from music composed of multiple parts. Fujita et al. [1] arranged an ensemble score for a piano by using a melody part and a bass part that were extracted from an original score. They output piano scores corresponding to different levels of difficulty based on a user's skill level by considering a maximum musical interval and a minimum key stroke duration. Chiu et al. [2] analyzed the role of each part in an original score. The analysis resulted in generating a solo piano score by considering the preservation of original composition, the maximum number of simultaneous key strokes, and the maximum musical interval of each hand. Onuma et al. [3] analyzed a process of a piano arrangement used by arrangers, constructed a piano arrangement system based on specific problems that occur in the piano arrangement process, and proposed solutions for the same. When all the parts of an original score are summarized into a two-line staff, the notes leading to a problem in an arrangement are automatically detected and conveyed to a user. The choice of a solution depends on the user. Onuma et al. constructed a solo piano arrangement system corresponding to a user's performance skill. Nakamura et al. [4] considered playability based on a piano fingering model by using a merged-output Hidden Markov Model. A continuous flow of notes was considered by the fingering model, and two indices were optimized, namely preservation of an original impression and playability. The problem was solved by considering simultaneous playability as well as a continuous flow of notes in both hands.

A common vein in previous studies involved reducing and selecting notes from an original score in a solo piano arrangement. Extant studies either used original notes directly or used octave shifts. This method prevents a piano-arranged score from dissonance and preserves an original score's impression. Additionally, they can generate a playable score by considering simultaneous playability and a time sequence of notes when notes are re-

Copyright: © 2017 First author et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
duced and selected. However, there still remain problems in that the outputted piano-arranged score includes notes that are difficult to perform. A potential reason for this problem is that the definition of playability is not appropriate. Nakamura et al. [4] used a fingering model to consider playability. However, this entails considering several playability factors in addition to the simultaneous maximum number of key strokes, the simultaneous maximum musical interval, and the minimum duration of time between key strokes, such as extension of the maximum musical interval by arpeggio, the cost of fingering, and variation in difficulty that depends on the tempo. Therefore, it is difficult to define an appropriate restriction with respect to playability. The lack of an appropriate restriction of playability can generate a piano score that is not playable in specific places. Extant studies also did not discuss the quality of a piano arrangement. However, a piano arrangement is not beautiful if it does not consider a continuous flow of musical sounds, such as loudness of sound or change in volume. For example, suppose two phrases, namely "A" and "B", are involved in a piece of music, and the scores condensed into two-line staffs from "A" and "B" are the same. Phrase "A" consists of a few instrument parts, and phrase "B" consists of several instrument parts. When piano reduction is considered, the output piano scores are not different between "A" and "B" because both condensed scores created by collecting all notes in the original score are almost the same. Therefore, it is difficult to express a continuous flow of musical sounds. Onuma et al. [3] considered this problem by analyzing a piano arrangement process that is actually performed by arrangers. However, the system is not automatic since it requires manual corrections by a user.

The present study involves proposing a new approach to arrange a musical score for a piano. Fig. 1 shows an overview of the proposed method. When a piano-arranged score is generated, four important musical components

(1) Arranging a musical piece that cannot be played by a single instrument into one that can be played by a specific instrument.
(2) Transcribing the atmosphere of a musical piece into, mainly, a jazz version, a Chopin version, or a rock version.
(3) Simplifying an original score to make it easy to play.

Previous studies examined arranging with respect to (2) and (3) ([5-8] and [9-11]). However, there is a paucity of studies examining arranging with respect to (1). The present study focuses on arranging with respect to (1) and sets the conditions required for a high-quality piano arrangement in (1) as shown below:

(i) A melody line is the highest pitch in each measure.
(ii) A chord in a piano-arranged score corresponds to an original one.
(iii) An accompaniment suits the rhythm part in an original musical piece.
(iv) A piano-arranged score reflects loudness in an original musical piece.
(v) A piano-arranged score is playable.

The reasons for setting the aforementioned five conditions are as follows. (i) The note that corresponds to the highest pitch can correspond to a melody that determines important characteristics of the music because it is more impressive when compared with other notes. Therefore, the impression of a piano-arranged score is significantly different from an original musical piece without condition (i). (ii) Dissonance occurs when a chord in the piano-arranged score does not match an original chord. Subsequently, a piano-arranged score corresponds to an uncomfortable piano-arranged score. Conditions (iii) and (iv)
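Condition (i) amounts to taking the top voice of each measure as the melody. A minimal sketch, with a hypothetical data layout of our own (each measure given as the list of MIDI pitch numbers sounding in it), not the authors' implementation:

```python
def melody_line(measures):
    """Condition (i): for each measure, take the highest sounding
    pitch (MIDI note numbers) as the melody note."""
    return [max(pitches) for pitches in measures]

# Two measures: C major triad (C4 E4 G4), then D minor triad (D4 F4 A4).
print(melody_line([[60, 64, 67], [62, 65, 69]]))  # [67, 69]
```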
are a few notes that are not on the chord Dm when notes are simply translated based on the root note. In this case, the notes are relocated to the nearest constituent notes of the chord Dm.

It is considered that the generated left-hand part score is playable since it comprises an existing piano-arranged score. If the existing piano-arranged score is playable, then the selected accompaniment in the database is also playable. The process of left-hand part generation satisfies condition (v), which relates to a high-quality piano arrangement as discussed in section 2.

5.3 Correction

There is a possibility of overlapping with respect to the notes of the right- and left-hand parts if the parts generated in sections 5.1 and 5.2 are simply combined. In this case, the problem could be solved by reducing either the right-hand or the left-hand part. However, this method of reduction results in a few problems. For example, the impression of a piano-arranged score departs from an original score because the accent becomes weaker due to the reduction of notes. Therefore, this problem is instead solved by relocating an overlapped note in the left-hand part to the nearest constituent note that is lower than the overlapped note. This operation enables the generation of a playable piano-arranged score while preserving an original impression. This process satisfies conditions (ii), (iv), and (v), which are related to a high-quality piano arrangement as discussed in section 2.

Song                          Arithmetic mean of rho   Variance of rho   Subjective evaluation (proposed)   Subjective evaluation (arranger)
Zenzenzense                   0.59                     0.19              5.4                                5.8
Senbonzakura                  0.44                     0.39              5.7
Sugar Song and Bitter Step    0.34                     0.43              5.6

Table 1. Spearman's rank correlation coefficient and subjective evaluation value.

Figure 7. The original score of RWC-MDB-P-2001 No. 5.
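The relocation rule of the correction step can be sketched as follows. This is an illustrative helper of our own (pitches as MIDI note numbers, the chord given as a set of pitch classes), not the authors' implementation:

```python
def relocate_overlap(note, chord_pcs):
    """Move an overlapped left-hand note down to the nearest chord tone
    strictly below it. `note` is a MIDI pitch number; `chord_pcs` is a
    set of pitch classes 0-11, e.g. D minor = {2, 5, 9}."""
    candidate = note - 1
    while candidate % 12 not in chord_pcs:
        candidate -= 1
    return candidate

# A3 (57) collides with the right hand over a D-minor chord;
# it is relocated down to the nearest lower chord tone, F3 (53).
print(relocate_overlap(57, {2, 5, 9}))  # 53
```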
fessional arranger [12] is also evaluated and compared with the best arrangement generated by the proposed method. The result of the evaluation is shown in Tab. 1. With respect to playability, two individuals who could play the piano evaluated it by playing the best piano-arranged score produced by the proposed method. The obtained results corresponded to views suggesting that there are a few difficult points, although an unplayable point does not exist.

A part of RWC-MDB-P-2001 No. 5, recorded in the RWC Music Database, is shown as an example of the result [16]. The original score is shown in Fig. 7, and the results of piano arrangement are shown in Fig. 8. The bass part is set as the rhythm part, and the threshold is set as 0.5. Fig. 8 shows scores (a), (b), and (c) in ascending order of the values of rho.

7. CONSIDERATIONS

The results confirmed that correlations existed between rho and the subjective evaluations, based on Spearman's rank correlation coefficient. Specifically, with respect to "Zenzenzense", the variance is lower than that of the others, and a strong positive correlation exists. It is assumed that rhythm is significantly involved in this result. The piano arrangement of "Senbonzakura" involves a rhythm that is slightly different from that of the original without discomfort, since the original rhythm is simple and thus easier to fit to any rhythm. Consequently, it is considered that a piano arrangement score that is slightly different from an original score can also correspond to a high-quality piano arrangement. With respect to "Sugar Song and Bitter Step", it is difficult to determine an optimum accompaniment from the accompaniment database since the rhythm of the original song is quite complicated. With respect to the rhythm perspective in "Zenzenzense", an optimum accompaniment could be selected because the original song involves a moderately complex rhythm. With respect to the calculation of the similarity of rhythm, a more appropriate selection of accompaniment is expected by applying different weights to each beat based on the importance of beats, such as strong beats and weak beats. With respect to the quality of a piano arrangement, all songs were generally evaluated well. The piano-arranged score produced by the proposed method corresponded to approximately 93% of the performance relative to arrangements by a professional arranger. Two evaluators indicated that the piano-arranged score produced by the proposed method was better than the piano arrangement by a professional arranger. In conclusion, the results confirmed the usefulness of the proposed method.

With respect to playability, unplayable parts were not revealed since published scores were used for the left-hand part. If the published scores are playable, then a generated left-hand score is also playable. However, the evaluators indicated that there were some parts that were not easy to play. Thus, it was assumed that the right-hand part, which includes notes added to the melody, is not considered with respect to playability in the proposed method. Therefore, this problem can be solved by considering the continuity of the notes.

8. CONCLUSIONS

In this paper, a piano-arranged score is generated by using four musical components extracted from an original score. Furthermore, the study involved defining five conditions necessary for a high-quality piano arrangement. Subsequently, the piano arrangement process is based on the five conditions. The proposed method involves constructing an accompaniment database in the form of accompaniment matrices. The right-hand part is generated by adding notes of a chord. The left-hand part is generated by selection from the accompaniment database and harmonizing the selected accompaniments. Finally, the right-hand part and the left-hand part are combined with the correction of overlapping notes.

The proposed method highlights a novelty of the piano arranging process. Previous studies used all the information in each part, so the generated piano-arranged scores are directly affected by the original score. In contrast, the present study involved generating a piano-arranged score by using musical components that are abstracted from musical scores. Therefore, it is easy to generate a piano-arranged score from acoustic signals. A future study will involve generating a piano-arranged score from original acoustic signals. Additionally, future research will examine the construction of an accompaniment database divided into difficulty levels. It will also explore generating a piano-arranged score with respect to difficulty levels. Furthermore, future aims also include increasing the musical components of the database to select an appropriate accompaniment.

Acknowledgments

This work was supported by JST ACCEL Grant Number JPMJAC1602, Japan.

9. REFERENCES

[1] K. Fujita et al., "A proposal for piano score generation that considers proficiency from multiple parts," in Tech. Rep. Special Interest Group on Music and Computer, 2008, pp. 47-52.

[2] S.-C. Chiu et al., "Automatic system for the arrangement of piano reductions," in Proc. 11th IEEE International Symposium on Multimedia, 2009, pp. 459-464.

[3] S. Onuma et al., "Piano arrangement system based on composers' arrangement processes," in Proc. ICMC, 2010, pp. 191-194.

[4] E. Nakamura et al., "Automatic Piano Reduction from Ensemble Scores Based on Merged-Output Hidden Markov Model," in Proc. ICMC, 2015, pp. 298-305.
Figure 2: The block diagram of the time-domain feed-forward DRC, derived from [14]. Note that the slin signal can be
identical to the xlin signal, in order to perform typical dynamic range compression behaviour; or alternatively, they can be
independent signals, in order to perform the side-chain compression technique.
$$
y_G[n] =
\begin{cases}
s_{\mathrm{dB}}[n], & 2(s_{\mathrm{dB}}[n] - T) < -W \\[4pt]
s_{\mathrm{dB}}[n] + \dfrac{\left(\frac{1}{R} - 1\right)\left(s_{\mathrm{dB}}[n] - T + \frac{W}{2}\right)^{2}}{2W}, & 2\,\lvert s_{\mathrm{dB}}[n] - T \rvert \le W \\[4pt]
T + \dfrac{s_{\mathrm{dB}}[n] - T}{R}, & 2(s_{\mathrm{dB}}[n] - T) > W,
\end{cases}
\tag{2}
$$

where $T$ and $W$ are the threshold and knee width parameters, respectively, expressed in decibels, and $R$ is the compression ratio.

The envelope detector is then formulated as [28]

$$
v_L[n] =
\begin{cases}
\alpha_A\, v_L[n-1] + (1 - \alpha_A)\, b_G[n], & b_G[n] > v_L[n-1] \\[4pt]
\alpha_R\, v_L[n-1] + (1 - \alpha_R)\, b_G[n], & b_G[n] \le v_L[n-1],
\end{cases}
\tag{3}
$$
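The gain computer (2) and envelope detector (3) can be sketched numerically as follows. This is our NumPy illustration (variable names are ours, not from the paper), and it assumes a strictly positive knee width $W$:

```python
import numpy as np

def gain_computer(s_db, T, R, W):
    """Soft-knee static gain curve of eq. (2): target output level in dB
    for each decibel-domain input level (threshold T, ratio R, knee W > 0)."""
    y = np.where(2 * (s_db - T) > W,            # above the knee: compress
                 T + (s_db - T) / R,
                 s_db)                           # below the knee: unity gain
    knee = 2 * np.abs(s_db - T) <= W             # inside the knee: quadratic blend
    y = np.where(knee,
                 s_db + (1 / R - 1) * (s_db - T + W / 2) ** 2 / (2 * W),
                 y)
    return y

def envelope_detector(b_g, alpha_a, alpha_r):
    """One-pole smoothing of eq. (3), switching per sample between
    the attack and release coefficients."""
    v = np.zeros_like(b_g)
    prev = 0.0
    for n, b in enumerate(b_g):
        alpha = alpha_a if b > prev else alpha_r
        v[n] = alpha * prev + (1 - alpha) * b
        prev = v[n]
    return v
```

For example, with T = -20 dB, R = 4 and W = 10 dB, an input level of -10 dB (fully above the knee) maps to -17.5 dB.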
Figure 4: Block diagram of the proposed time-frequency domain DRC design, where TFT/FB denotes a time-frequency
transform or perfect reconstruction filterbank. Note, however, that if x(t) and s(t) are the same signal, then the TFT/FB
should be applied only once.
transformed into the time-frequency domain, via a Short-Time Fourier Transform (STFT) or a perfect reconstruction filterbank. Therefore, time-domain frames of $N$ input samples $\mathbf{x} = [x_0, \ldots, x_{N-1}]^{T}$ are now represented as their time-frequency domain counterparts $X(t, f)$, where $t$ refers to the down-sampled time index and $f \in [0, \ldots, L-1]$ to the frequency indices given by a hop size $L$.

The proposed time-frequency domain architecture is shown in Fig. 4 and shares many similarities with its time-domain counterpart (see Fig. 2). However, due to the change in domain, it is now the energy of each time-frequency index that is used as the input for the gain computer

$$X_G(t, f) = 10 \log_{10} \lvert S(t, f) \rvert^{2}, \tag{5}$$

where $X_G$ and $S$ refer to the side-chain energy and the side-chain signal, respectively.

For the gain computer, the frequency-dependent energy values are compared against their respective threshold parameters, so that the dynamic range compression may be applied in a frequency-dependent manner (note that the time and frequency indices $(t, f)$ have been omitted for compactness)

$$
Y_G =
\begin{cases}
X_G, & 2(X_G - T) < -W \\[4pt]
X_G + \dfrac{\left(\frac{1}{R} - 1\right)\left(X_G - T + \frac{W}{2}\right)^{2}}{2W}, & 2\,\lvert X_G - T \rvert \le W \\[4pt]
T + \dfrac{X_G - T}{R}, & 2(X_G - T) > W,
\end{cases}
\tag{6}
$$

where $Y_G$ is the new target gain, expressed in decibels; $T$ and $W$ are the threshold and knee width parameters, respectively, which are also expressed in decibels; and $R$ is the compression ratio.

The envelope detector also works in a similar manner to that of the time-domain implementation in (3)

$$
V_G =
\begin{cases}
\alpha_A\, V_G z^{-1} + (1 - \alpha_A) B_G, & B_G > V_G z^{-1} \\[4pt]
\alpha_R\, V_G z^{-1} + (1 - \alpha_R) B_G, & B_G \le V_G z^{-1},
\end{cases}
\tag{7}
$$

where $V_G$ refers to the output of the envelope detector in decibels and $B_G$ refers to the input gain, calculated as the difference between the gain computer estimate and the energy of the side-chain, $B_G = Y_G - X_G$. However, the attack and release time constants must now take into account the change of domain, so they are formulated as $\alpha_A = e^{-1/(\mathrm{attack}/L)}$ and $\alpha_R = e^{-1/(\mathrm{release}/L)}$, respectively, where the attack and release times are given as time-domain samples per millisecond.

Due to the nature of spectral processing, a spectral floor parameter is now recommended to mitigate audible artefacts that can occur when a frequency band is too harshly attenuated. Therefore, the final temporally-dependent and frequency-dependent gain factors $C(t, f)$ are given as

$$C(t, f) = \max\!\left(10^{\frac{V_G(t, f)}{20}},\; 10^{\frac{\Gamma}{20}}\right), \tag{8}$$

where $\Gamma$ denotes the spectral floor in decibels. These gain factors should be multiplied element-wise with the input signal, in order to yield the dynamically compressed audio signal, which can be auditioned after performing an appropriate inverse time-frequency transform.

5. AN FFT-BASED IMPLEMENTATION OF THE PROPOSED DESIGN

In order to investigate the behaviour and performance of the proposed design, the algorithms presented were implemented using MATLAB and then realised as a Virtual Studio Technology (VST) audio plug-in.¹ The graphical user interface (GUI) was designed using the open source JUCE framework and is depicted in Fig. 5.

The alias-free STFT filterbank in [29], which uses analysis and synthesis windows that are optimised to reduce temporal aliasing, was selected as the time-frequency transform for the implementation. A hop size of 128 samples and an FFT length of 1024 samples were selected, with a further sub-division of the lowest four bands to increase low-frequency resolution. This additional filtering is similar to the hybrid filtering performed in the filterbanks used for spatial audio coding [30]; therefore, there are 133 frequency bands in total, with centre frequencies that roughly correspond to the resolution of human hearing. A sample rate of 48 kHz and a frame size of 512 samples were hard coded into the plug-in.

¹ Various pre-rendered sound examples and Mac OSX/Windows versions of the VST audio plug-in can be found on the companion web-page: http://research.spa.aalto.fi/publications/papers/smc17-fft-drc/
(a) T = -20 dB, R = 8:1, W = 0 dB, Att/Rel = 10/80 ms. (b) T = -50 dB, R = 8:1, W = 0 dB, Att/Rel = 10/80 ms.

Figure 6: Spectrograms of the drum kit sample (top left, top right) and the output audio, when utilising the time-domain (middle left) and the proposed (middle right) DRC designs; also shown are the frequency-dependent gain factors for the time-domain (bottom left) and the proposed (bottom right) DRC designs.

Figure 7: Spectrograms of white noise (top) and after frequency-dependent ducking using a drum kit sample (middle); also shown are the frequency-dependent gain factors (bottom). T = -50 dB, R = 8:1, W = 0 dB, Att/Rel = 10/150 ms.

Figure 8: Spectrograms of white noise (top) and after frequency-dependent ducking using a speech sample (middle); also shown are the frequency-dependent gain factors (bottom). T = -80 dB, R = 20:1, W = 0 dB, Att/Rel = 10/800 ms.
resolution multi-band compression.

Regarding the presented FFT-based implementation, future work could investigate how best to design a GUI that accommodates the additional parameters, or study how one might automate them, such as in [31].

[6] J. C. Schmidt and J. C. Rutledge, "Multichannel dynamic range compression for music signals," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-96), vol. 2, Atlanta, GA, USA, May 1996, pp. 1013-1016.

[7] U. Zölzer, Digital Audio Signal Processing. John Wiley & Sons, 2008.

[8] P. Raffensperger, "Toward a wave digital filter model of the Fairchild 670 limiter," in Proc. 15th Int. Conf. Digital Audio Effects (DAFx-12), York, UK, Sept. 2012, pp. 195-202.

[9] A. J. Oliveira, "A feedforward side-chain limiter/compressor/de-esser with improved flexibility," Journal of the Audio Engineering Society, vol. 37, no. 4, pp. 226-240, Apr. 1989.

[10] Music Radar, "How to create classic pumping sidechain compression," http://www.musicradar.com/tuition/tech/how-to-create-classic-pumping-sidechain-compression, 2015, [Online; accessed 01-Feb-2017].

[11] Sound on Sound, "Parallel compression," http://www.soundonsound.com/techniques/parallel-compression, 2013, [Online; accessed 27-Jan-2017].

[12] MusicTech, "Pro Tools tutorial: Cutting edge production techniques: Serial compression," http://www.musictech.net/2015/05/pro-tools-serial-compression/, May 2015, [Online; accessed 28-Jan-2017].

[19] D. K. Bustamante and L. D. Braida, "Multiband compression limiting for hearing-impaired listeners," Journal of Rehabilitation Research and Development, vol. 24, no. 4, pp. 149-160, 1987.

[20] B. Kollmeier, J. Peissig, and V. Hohmann, "Real-time multiband dynamic compression and noise reduction for binaural hearing aids," Journal of Rehabilitation Research and Development, vol. 30, no. 1, pp. 82-94, 1993.

[21] E. Lindemann, "The continuous frequency dynamic range compressor," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, Oct. 1997, pp. 1-4.

[22] Voxengo, "Soniformer audio plugin," http://www.voxengo.com/product/soniformer/, 2016, [Online; accessed 31-Jan-2017].

[23] E. Perez-Gonzalez and J. Reiss, "Automatic equalization of multichannel audio using cross-adaptive methods," in Audio Engineering Society Convention 127, New York, NY, USA, Oct. 2009.

[24] J. D. Reiss, "Intelligent systems for mixing multichannel audio," in Proc. IEEE 17th Conf. Digital Signal Processing (DSP), Corfu, Greece, July 2011, pp. 1-6.
[Figure 1: block diagram of the SysSon platform and IDE, connecting the workspace (model and view), sonification definitions, data sources, parameters, and real-time programs, with the sonification designer, compiler, and program sources between the scientist (user) and the real-time environment, alongside a general IDE (IntelliJ IDEA) for application development and sonification prototyping.]

Figure 2. Temperature anomalies (-5 °C blue, +5 °C red), equatorial, in 10 km to 50 km altitude, Jan 2003 to Jan 2014. (Vertical axis: altitude [m]; horizontal axis: monthly time stamps.)
3.1 Applying the UGen Model to Offline Processing

At first glance, offline and real-time processing should be quite similar, only distinguished by the kind of scheduler that runs the nodes of the graph, in the former case being busy incessantly, and in the latter case paying attention to deadlines. Figure 5 shows the example of an envelope generator in real-time (SuperCollider) and in offline (FScape) code. The envelope goes from one to zero in 200 frames, then back to one in 200 frames, and remains there for another 400 frames. Three subtle differences can be seen already:

- The phase argument to the sawtooth generator differs. While in SuperCollider the ranges of some parameters have historically been chosen to minimise mathematical operations on the server side, in FScape we favour consistency across various UGens over additional arithmetical operations required.

- In SuperCollider we must specify whether the UGen uses audio or control rate, while in FScape these categories do not exist. Instead, the block sizes of signals depend on the particular UGen and may even vary across the inputs and outputs of a UGen, although a nominal block size can be specified that determines the default buffer size for dynamic signals.

- The frequency is given in Hertz in SuperCollider. In FScape it is given as a normalised value, as there is no notion of a global sampling rate. We were contemplating the introduction of a SampleRate UGen for convenience, but since the system was primarily designed to transform audio files, one would mostly want to work directly with the respective sampling rates of the input files. In the cases where images or videos are processed, the notion of an audio sampling rate would also be meaningless. In fact, many UGens

Both systems use a weakly typed GE (graph element) for the values that can be used as parameters to UGens (constants are graph elements, as are UGens themselves), but while on the SuperCollider server all streams are restricted to 32-bit floating point resolution, in FScape we adopted multiple numeric types, consisting currently of 32-bit and 64-bit integer numbers and 64-bit floating point numbers. Some UGens are generic in that they can natively work with different numeric types, preserving particular precision and semantics, while others coerce a GE into a particular type. For example, ArithmSeq(0.0, 4), similar to Dseries in SuperCollider, produces the arithmetic series 0.0, 4.0, 8.0, ... with floating point numbers, while ArithmSeq(0L, 4) produces the series 0L, 4L, 8L, ... with 64-bit integer numbers. This approach is still under evaluation.

The way UGens are constructed from the user-facing side is very similar, including lazy expansion of elements that allows the preservation of pseudo-UGens (e.g. in serialisation) and the late binding of resources needed from the environment (e.g. automatic adaptation of the expanded UGen graph to variable types and dimensions of control inputs), multi-channel expansion, and the composition with unary and binary operators. But when it comes to the implementation of the UGens, the systems use very different architectures, which are summarised in Table 1.

3.2 Interfacing with an Offline Graph

It remains to define an input/output interface for integrating offline processing graphs with other objects inside SysSon. All objects share as a common interface a dictionary, called an attribute map, that can be used to freely associate attributes with an object, using conventional string keys. In the real-time case, for example, when one creates a named control "freq".kr, the control input is looked up in the attribute map of a containing Proc object (the object wrapping the UGen graph) at key "freq". This could be a scalar value, a break-point function, or any other object that could be
SMC2017-53
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
val m = Matrix("var")
val v0 = m.valueSeq
val v = Gate(v0, !v0.isNaN)
val mn = RunningMin(v).last
val mx = RunningMax(v).last
MkDouble("min", mn)
MkDouble("max", mx)
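The listing above opens a matrix variable, gates out NaN (missing) samples, and writes the resulting minimum and maximum into the attribute map via MkDouble. The same dataflow can be sketched in plain Python (a simplified stand-in, assuming the matrix values arrive as a flat sequence of floats):

```python
import math

def matrix_min_max(values):
    """Mirror the Gate / RunningMin / RunningMax chain: scan a value
    stream, skip NaN samples, and return the final (min, max)."""
    mn, mx = math.inf, -math.inf
    for v in values:
        if math.isnan(v):      # Gate(v0, !v0.isNaN) drops NaN samples
            continue
        mn = min(mn, v)        # RunningMin(v).last
        mx = max(mx, v)        # RunningMax(v).last
    return mn, mx

print(matrix_min_max([3.0, float("nan"), -1.5, 7.25]))  # (-1.5, 7.25)
```

Unlike this eager loop, the FScape graph processes the stream in blocks and only emits the two values once the stream ends.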
nodes. This is generally helpful when real-time properties are not required, because the system tends to adjust itself to the particular data-rates produced in the nodes. Nonetheless, whenever the program contains a diamond topology, where a sink and source are connected by more than one path, this can result in a stalling process, and appropriate intermediate buffers must be inserted. Akka currently has no means to automatically detect these situations, and so it is the programmer's responsibility to manually insert buffers. BufferDisk is the simplest form of such a buffer, as it is unbounded and written to disk if it becomes too large. Alternatively, we could have just duplicated the valueWindow call, although we would then also duplicate a few more nodes and thus require more computation.

In a traditional signal processing scenario, it is easy enough to calculate the necessary buffers, and a future version of FScape can probably accomplish this automatically, but with non-vectorial structural data such as the output of the blob detection, we fear that manual control will still be required. Again, some research on multi-rate management exists, e.g. [8], but this is focused on real-time scenarios (with properties such as de-coupling that do not apply here) and vectorial data. Another related aspect is the complexity of implementing multi-rate UGens with many inputs and outputs. As Arumí Albó has noted, handling multi-rate signals inside modules yields complex code in every concrete module [9, p. 145].

Figure 9. Stages in the development of the blob sonification (timeline from Oct 2016 to Feb 2017, alternating between the general IDE and the SysSon IDE)
4. DISCUSSION AND CONCLUSION

We have presented a new architecture within the SysSon sonification platform to integrate real-time sound synthesis with offline pre-processing stages, using a similarly constructed DSL for writing UGen graphs. The work is the result of quick iterations that responded to feedback from climatologists, to whom we are trying to give a tool that they can use all the way from data import to model selection and parametrisation. The architecture was elaborated through the case study of using a blob detection algorithm to sonify the QBO phenomenon. After presenting a set of rendered sounds to the climatologists, 2 it became quite clear that allowing them to navigate interactively through the data and adjust different sound parameters would be a requisite for advancing the sound models, so having an integrated data pre-processing stage was an important step towards this goal.

Another important result of this research is a better understanding of the development and transfer processes that take place at the various stages of experimenting with new sonification approaches and feeding them back into a stable platform. A pattern of oscillation emerges in this process between greenfield experimentation in the external IDE and reformulation within the inner IDE for deployment. This movement is illustrated in Figure 9.

A number of concrete suggestions can be made with respect to the future development of this architecture:

- It would be extremely interesting to have an interactive worksheet or REPL mode for offline programs. That is to say, the ability to execute these programs step by step, validate the intermediate outputs, re-adjust, etc., akin to the notebook mode of operation that has gained much traction with iPython and Jupyter [10].

- Offline programs could become a formalisation within the application code base itself for various situations:

  - Real-time UGens could be added that implicitly spawn pre-processing stages, freeing the authors of sound models from having to create separate offline objects for trivial tasks. For example, querying the minimum and maximum value of a matrix would be better suited as operators directly available within the real-time program.

  - The user (scientist) facing side of the editor could have a richer palette of matrix transformation operators. For example, ad-hoc averaging across a dimension or down-sampling would be a very useful operation offered to the user, and of which a particular sonification model

2 See the revised sets labelled WegC_170508-... at https://soundcloud.com/syssonproject/sets/ (accessed 18-May-2017)
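As an illustration of the kind of matrix operator meant in the last point, ad-hoc averaging across a dimension and block down-sampling can be sketched in a few lines (plain Python lists stand in for SysSon's matrix objects; the function names are illustrative only):

```python
def average_over(matrix, dim):
    """Average a 2-D matrix (a list of rows) over dimension 0 (collapse
    rows, one mean per column) or dimension 1 (collapse columns)."""
    if dim == 0:
        n = len(matrix)
        return [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]
    return [sum(row) / len(row) for row in matrix]

def downsample(seq, factor):
    """Down-sample a sequence by averaging consecutive blocks of `factor`."""
    return [sum(seq[i:i + factor]) / factor
            for i in range(0, len(seq) - factor + 1, factor)]

m = [[1.0, 2.0], [3.0, 4.0]]
print(average_over(m, 0))                    # [2.0, 3.0]
print(downsample([1.0, 3.0, 5.0, 7.0], 2))   # [2.0, 6.0]
```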
responsible for routing the graphics resulting from the input signal analysis to the instrumentalists, thus creating a feedback system with complex mapping and contrapuntal capabilities. The system was developed in Max/MSP [4] and Processing [10]. Max/MSP is used to analyse and process the live audio input signals, whose result is then sent to Processing, which interprets the data to create graphics on a rolling screen.

2.1 Flow of data

The screen is divided into two parts: a) from the centre to the left, it shows the representation of the sound analysis of the instrumentalist in real time; b) from the right border to the central bar, the gesture indication suggested by the pilot composer. The drawn graphics include circumferences expressing the loudness of each Bark frequency band at each 1024 ms analysis slice (represented vertically). To distinguish participating instruments, different colours are applied. Fig. 3 shows an individual monitor.
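The paper does not specify how the Bark bands are computed; a common approximation is Zwicker and Terhardt's frequency-to-Bark formula, sketched here only as an illustration of such an analysis:

```python
import math

def hz_to_bark(f):
    """Zwicker/Terhardt approximation of the Bark critical-band scale."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

# map an analysis bin's centre frequency to a Bark band index
print(int(hz_to_bark(1000.0)))  # 8
```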
and Music Computing Group of the University of Padova 2 to produce a sonification of one of his works. The artist invited us to choose one painting among a series of 9, in which various leaf textures are depicted. Although the textures vary in density, leaf dimension and color, the paintings are united by the presence of a single subject (a carpet of dry leaves) uniformly occupying their entire surface. This suggests that the sound rendering should likewise be made regular and constant, although with some degree of variability inside. Moreover, the paintings seem to be the result of the superimposition of different layers representing leaves of similar color, as confirmed by the creative process declared by the artist himself. This suggests the possibility of decomposing the image into different color matrices and considering the painting as a sort of three-dimensional object of various superimposed layers. This idea envisions that the various layers should not simply be seen as one above the other on the surface of the painting, but that they can be disposed in the space in front of it. Thus, the painting is stretched from the wall into the space, allowing a user to walk through its various layers. The resulting interaction space is depicted in Figure 2, where the user walks

2. SYSTEM ARCHITECTURE

The installation is characterized by two main elements: a wooden runway that leads the visitor toward the projection of the painting, and an audio reproduction system, fed by a sound background derived from the painting and by the sonification of the user's steps. Fig. 3 shows the schema of the installation, outlining the components of these two main elements. The soundscape of the installation changes according to the distance of the visitors from the image of the painting. In order to detect the position on the wooden runway, a system of eight piezoelectric sensors is used (Sensing). The acquired signals are then processed by means of an ad-hoc algorithm based on the TDOA approach (Localization). Other localization approaches were considered; one of them exploited computer vision algorithms, but this was discarded because of issues in an unmanaged environment with multiple people interacting with the installation. Furthermore, it was necessary to detect both the user's position and his steps in order to sonify them. In addition, the installation has been developed in collaboration with a company specialized in the diagnosis of wooden boards and the analysis of wave propagation times, which led to the development of this specific system.

The acquired information is used to control the soundscape, which is produced by a bank of oscillators fed by pink noise. Its outputs (Soundscape reproduction) are differently weighted according to an image processing algorithm that extracts the needed features (Image feature extraction) from the three color layers of the painting (Color matrix extraction). Furthermore, the runway is provided with an LED strip that changes its color according to the color matrix of the painting. This provides visual feedback so as to outline different portions of the runway, each of which corresponds to a different color layer. Further details on the elements illustrated in Fig. 3 will be described in the following sections.

3. FROM THE PAINTING TO THE SOUND

The basic idea of this image sonification project is to consider the leaves as the elements of a mask, in which the colored spots represent open holes on a solid background.

2 The website of the SMC group of the University of Padova is at http://smc.dei.unipd.it/
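The decomposition into color matrices sketched above amounts to masking pixels by hue. A minimal Python illustration using the standard colorsys module (the data layout and hue thresholds are illustrative assumptions, not those of the actual implementation):

```python
import colorsys

def hue_layer(pixels, hue_lo, hue_hi):
    """Keep the value (lightness) of RGB pixels (components in [0, 1])
    whose hue lies in [hue_lo, hue_hi]; zero out all other pixels."""
    layer = []
    for row in pixels:
        out = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            out.append(v if hue_lo <= h <= hue_hi else 0.0)
        layer.append(out)
    return layer

# a 1x2 image: one yellowish pixel (hue ~ 1/6) and one blue pixel (hue ~ 2/3)
img = [[(0.9, 0.9, 0.1), (0.1, 0.1, 0.9)]]
print(hue_layer(img, 0.10, 0.25))  # [[0.9, 0.0]]
```

Running this per hue range yields one lightness matrix per color layer, i.e. the "score" matrices described in the text.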
Table 1. Combinations of the step spectra (columns: H, B, hB, lB) with the background spectra (rows: H, B, hB, lB).
Transposing this interpretation into musical terms, the background becomes a complex wide-range spectrum of frequencies, and the holes are the points where the single frequencies can be heard. In this perspective, the function of the image is similar to that of an orchestral score, where each line corresponds to a musical instrument or orchestral section. The black holes (single notes or groups of notes) activate the section, while the white lines (rests) mute them. Hence, the spectrum plays the role of the whole orchestra and the image the role of the musical score. To realize this idea, we selected the painting with the lowest degree of leaf density among the 9 proposed (see Figure 1a). Following the idea of the various superimposed layers, we also needed to extract different color matrices from the original painting. To do so, we analyzed the image using the hue-saturation-value color model, where hue refers to the pure color, saturation to the quantity of white, and value to the lightness [13]. Choosing an appropriate hue range, it is possible to obtain matrices that represent only the leaves of the selected color range. According to the suggestion of the artist, we extracted three matrices corresponding approximately to the primary colors (red, green and blue), shifting the green matrix a little towards yellow, which is the predominant color of the painting. The result of this process is depicted in Figure 1b, where a mask of yellow and green leaves extracted from the original image is shown. 3 To provide a rich background texture for the image sonification, we employed a bank of 38 band-pass resonant filters 4 fed with pink noise and with separately controlled outputs. To superimpose the mask on the background texture, we scaled the yellow matrix into a matrix of 38 rows and 127 columns, in which only the lightness values are reported (see Figure 4). This matrix can be considered the real sonification score, because it provides columns of lightness values which are scanned at a regular speed from left to right and vice versa. This timing mechanism provides the necessary link between the visual representation and the sequential nature of music. The time-controlled columns of lightness values are converted into numeric controls for a mixer matrix, thus making it possible to weight the filter outputs according to the lightness of every single pixel. This process results in different harmonies coming from the various weights assigned to the spectral components. For instance, in the case represented in Figure 4, the marked column has higher values in the lower part of the painting and very small values in the higher part. This opens the lowest frequencies of the spectrum much more than the highest, giving greater weight to the lowest components. The situation is very similar to a 38-voice chorale 5 where the dynamics of the various voices depend on the pixel lightness values.

3.1 The Installation Soundscape

The characteristic of the sonification engine described above is that it stays very close to the image features and is easily adaptable to produce a painting-related soundscape. For this project we focused our attention on subtractive synthesis due to its flexibility and richness of expressive possibilities. The bank of filters can be fed with different input sounds and can be easily tuned according to various spectral models, which can be changed in relation to the color of the matrix employed as musical score. Moreover, the resonant filter outputs allow a kind of smearing effect which smooths the harmonic transitions that characterize our image sonification approach, giving the audio rendering great variety in a seamless way. Another possibility offered by our sonification system is to implement a second bank of filters with the same characteristics as the first, fed by a file reproducing the sound of a step on a carpet of leaves, which can be triggered by impulses coming from a user's real step on the wooden runway. Thus, not only would this produce step sounds completely different from the real ones, but these steps could be tuned according to a spectrum complementary to that used for the background. In the actual project we experimented with four spectra: H (a harmonic spectrum of 38 frequencies starting from 100 Hz), B (38 frequencies extracted from the full spectrum of a bell sound), hB (38 frequencies extracted from the highest band of the same bell spectrum) and lB (the same as above in the lowest band). Table 1 shows the full range of possibilities for combining a background spectrum with its complementaries, with a wide number

3 Of course many other options are available to extract color matrices from an image, starting with the number of matrices to be extracted and with their color range. In our case the choice of extracting three matrices depends not only on the artist's suggestion but also on the actual available space for the interactive sonification on the sensor-equipped runway (see Section ??).
4 The sonification is implemented in a Max/MSP patch employing the fffb object. See https://docs.cycling74.com/max5/refpages/msp-ref/fffb.html for reference.
5 The chorale is a vocal polyphonic composition, usually a religious hymn, typical of the sacred literature of the German Protestant Church under Martin Luther (1523). The chorale is characterized by a regular homo-rhythmic proceeding of the voices, exactly in the same way as the scanned columns of the image matrix.
of soundscape generation possibilities, which can be chosen according to the user's position on the sensor-equipped runway 6.

4. THE SENSOR-EQUIPPED RUNWAY

Figure 5: Schema of the position detection system: Sensing is responsible for acquiring the signals from the runway; Localization detects the steps and the position.

The objective of the sensing section is to acquire the footstep-induced structural vibration through sensors mounted on the runway. The sensing module consists of data acquisition using eight piezoelectric sensors, signal amplification, and digitization by means of a digital audio interface. The localization section is mainly divided into: STFT, footstep detection and TDOA estimation (see Fig. 5).

4.2 Sensing

Usual applications of floor vibration detection use sophisticated sensors like the SM-24 geophone, which can detect very small vibrations with high precision, at the cost of a very high price. In this application, due to the small size of the runway and the resonance of the steps, a good SNR can be ensured with piezoelectric sensors of 2 cm diameter. They are equally distributed on the runway in order to divide it into twelve areas, each of which is identified by the four nearest sensors, which the localization system will use. As described in Section 3, the painting is divided into three color layers, each of which is assigned to the section of

6 Some sound examples of these possibilities can be found at https://youtu.be/wzE17agkZ7I
7 A description of this tool can be found at http://microtec.eu/en/catalogue/products/viscan/
8 GRF (Ground Reaction Force) is, according to Newton's Law of Reaction, the force equal in magnitude but opposite in direction produced by the ground as a reaction to the force the body exerts on the ground. The ground reaction force is used as propulsion to initiate and control the movement, and it is normally measured by force sensor plates.
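The authors' localization algorithm is ad hoc and not reproduced here; as a generic textbook illustration, the time difference of arrival between two sensor channels can be estimated as the lag that maximises their cross-correlation:

```python
def tdoa_samples(x, y):
    """Estimate the delay of y relative to x, in samples, as the lag
    maximising the cross-correlation of the two equal-length signals."""
    n = len(x)
    best_lag, best_val = 0, float("-inf")
    for lag in range(-n + 1, n):
        acc = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                acc += x[i] * y[j]
        if acc > best_val:
            best_lag, best_val = lag, acc
    return best_lag

# an impulse reaching sensor y two samples after sensor x
x = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(tdoa_samples(x, y))  # 2
```

Dividing the lag by the sampling rate and multiplying by the wave propagation speed in the board gives the path-length difference from which a position among the twelve areas can be inferred.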
(a) Graph of TDOA trend for hits in position 1

Acknowledgments

This work has been partially supported by Microtec s.r.l. (Grant n. 98/2015).
7. REFERENCES
[1] L. Bullivant, Responsive Environments: Architecture,
Art and Design. London: V & A Publications, 2006.
[9] F. Sparacino, "Scenographies of the past and museums of the future: from the wunderkammer to body-driven interactive narrative spaces," in Proceedings of the 12th Annual ACM International Conference on Multimedia, 2004, pp. 72-79.

[10] A. Webb, A. Kerne, E. Koh, P. Joshi, T. Park, and R. Graeber, "Choreographic buttons: promoting social interaction through human movement and clear affordances," in Proceedings of the 14th Annual ACM International Conference on Multimedia, 2006, pp. 451-460.

[11] K. Gronbaek, O. Iversen, K. Kortbek, K. Nielsen, and L. Aagaard, "iGameFloor - a platform for co-located collaborative games," in Proceedings of the International Conference on Advances in Computer Entertainment (ACE), Salzburg, Austria, 2007, pp. 64-71.

[12] K. Gronbaek, O. Iversen, K. Kortbek, K. Nielsen, and L. Aagard, "Interactive floor support for kinesthetic interaction in children learning environments," in Proc. of INTERACT, Rio de Janeiro, Brazil, 2007.

[13] M. D. Fairchild, Color Appearance Models. John Wiley & Sons, 2013.

[14] G. Succi, G. Prado, R. Gampert, T. Pedersen, and H. Dhaliwa, "Problems in seismic detection and tracking," in Proc. of SPIE, vol. 404, 2000, p. 165.

[15] G. Succi, D. Clapp, R. Gampert, and G. Prado, "Footstep detection and tracking," in Proc. of SPIE, vol. 4393, 2001, p. 22.

[16] H. Amick, "A frequency-dependent soil propagation model," in Proc. of SPIE, 1999.

[17] R. Vera-Rodriguez, N. Evans, R. Lewis, B. Fauve, and J. Mason, "An experimental study on the feasibility of footsteps as a biometric," in Proceedings of the 15th European Signal Processing Conference (EUSIPCO'07), 2007, pp. 748-752.

[18] S. Winiarski and A. Rutkowska-Kucharska, "Estimated ground reaction force in normal and pathological gait," Acta of Bioengineering and Biomechanics, vol. 11, no. 1, 2009.

[19] M. Syczewska and T. Oberg, "Mechanical energy levels in respect to the center of mass of trunk segments during walking in healthy and stroke subjects," Gait & Posture, 2001, p. 131.

[20] V. Inman, H. Ralston, and F. Todd, Human Walking. Baltimore, London: Williams & Wilkins, 1981.

[21] D. Salvati and S. Canazza, "Incident signal power comparison for localization of concurrent multiple acoustic sources," The Scientific World Journal, vol. 2014, 13 pages, 2014.

[22] D. Salvati and S. Canazza, "Adaptive time delay estimation using filter length constraints for source localization in reverberant acoustic environments," IEEE Signal Processing Letters, vol. 20, no. 5, pp. 507-510, 2013.

[23] I. A. Veres and M. B. Sayir, "Wave propagation in a wooden bar," Ultrasonics, 2004.

[24] S. Canazza, C. Fantozzi, and N. Pretto, "Accessing tape music documents on mobile devices," ACM Trans. Multimedia Comput. Commun. Appl., vol. 12, no. 1s, pp. 20:1-20:20, Oct. 2015. [Online]. Available: http://doi.acm.org/10.1145/2808200

[25] F. Bressan and S. Canazza, "The challenge of preserving interactive sound art: a multilevel approach," Int. J. of Arts and Technology, vol. 7, no. 4, pp. 294-315, 2014.

[26] C. Fantozzi, F. Bressan, N. Pretto, and S. Canazza, "Tape music archives: from preservation to access," International Journal on Digital Libraries, pp. 1-17, 2017. [Online]. Available: http://dx.doi.org/10.1007/s00799-017-0208-8
Copyright: © 2017 Matteo Lionello et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1.2 Interactive Soundscapes

Interactive Soundscape is a 3x4 m floor under the range of a camera. When the user enters the installation, s/he is projected into a sonic environment composed of sounds
related to a real environment 1. The aims of this project are both the spatial exploration of the reconstructed soundscape and the structural exploration of the sound elements peculiar to that soundscape. Using a granular synthesis process, brief time-windows are extracted from the sounds composing the soundscape. The user, changing her/his position inside the installation, modifies the parameters applied to the processing of the sound sources. Thus, by moving around, it is possible to explore the internal structure of the sounds, projecting the user into a world as real as the original one and emphasizing the acoustic properties of each sound. The role of the artist takes a step back, leaving the tools for musical composition to an audience not necessarily musically educated and not even aware of them. Besides the artistic purposes, the installation can be used in the pedagogical field, looking at how children manage spatial information in order to study cognitive processes [6] [7]. When the user gets feedback from the different sounds distributed over the augmented space, cognitive maps are built by abstraction processes. Thus, when a user enters the tracked area, s/he can use what has been previously learned, enriching her/his cognitive map of the surrounding space.

2. RELATED WORK

Spatial cognition is widely used in interactive installations such as soundscape rendering and audio augmented environments. The design of a composed soundscape for virtual environments requires a formal description and organization of the sound events which are used in it. The distinction between background audio and sonic events has been well defined by the Music Technology Group at Pompeu Fabra University in Barcelona. They developed an online platform, Soundscape Modeling, that widely describes the state of the art about the possible applications of soundscape rendering and design, providing a powerful tool for the authoring process. An online platform facilitates the process and generation of realistic soundscapes, focusing especially on computer games and mobile and web applications. Sample research from Freesound.org 2 is simplified by an automatic audio classification of sound features to compose graph models related to particular taxonomies. The soundscape is composed according to a two-layer architecture: on the first layer all the background sounds are propagated, whereas on an upper layer there are dynamic sound events with which the users are able to interact. In a server/client architecture, the server gets real-time information from the clients' coordinates related to a space, providing a web stream of the soundscape. Combining SuperCollider 3, Freesound and GoogleMaps, it is possible to provide soundscapes formed by a collage of sounds according to a specific position in a virtual city-visit in GoogleMap 4. Mobiles equipped with GPS and Internet can be used to enrich the experience of walking through a city, listening to a particular soundscape that is streamed by the application and guided by the client's position. Compared to Soundscape Modeling, where the interaction is interfaced through GPS or a relative position in a virtual world, detecting a user's interaction in an augmented reality environment can be a challenging topic. Responsive floor interactive systems are one of the main systems used to detect the user's interaction. They are mostly implemented through computer vision, audio capture and tracking techniques. For the Swiss Expo in 2002 in Neuchatel, a complex artistic installation called Ada [8] was developed, which consisted of an artificial organism capable of interacting actively with its visitors, capturing their interaction and replying using visual and sound activities. The aim of the project was to provide Ada with the ability to express its internal status, which varied according to the input from people's interaction. Her core was composed of a tracking system recording weight, direction, speed and location of the visitors walking on its responsive floor; a vision system which used cameras controlled by the localization of the users for visual information; and an audio system to help identify salient individuals. Besides computer vision and microphones, another technique used for tracking the user inside an augmented environment is triangulation through wireless devices. The LISTEN [9] project supplies museum and art exhibition visitors with an augmented and personalized visit. Through motion-tracked wireless headphones, users exploring the physical space get immersed in a virtual environment which augments the space itself. A soundscape is composed dynamically, differently for each visitor, according to how s/he moves inside the space, to previously collected information on her/his path, and to preferences and interests gathered during her/his walk. The experience provided by the installation varies from person to person, according to their behavior in the environment, thus allowing a higher level of visitor engagement.

3. SYSTEM DESCRIPTION

Interactive Soundscape uses a set of four loudspeakers, one camera, and a computer running the two parts of the software and connected to both the audio system and the camera. It is presented as an indoor space under the range of a camera placed at the top of the room with its optical axis perpendicular to the floor. The installation is an interactive system composed of a tracking system capable of producing a pair of Cartesian coordinates of a user. Her/his relative position is transferred via OSC 5 [10] to a module written in PureData 6 [11] to process the audio data. The user entering the tracked zone is virtually transposed into a certain sonic environment built around her/him.

1 A video of the installation is available at https://vimeo.com/146137215.
2 Freesound.org is a collaborative on-line sound database, providing meta-data and geo-localization data for the uploaded sounds. http://freesound.org
3 SuperCollider is a development environment and programming language used for real-time audio synthesis. http://supercollider.github.io/
4 Google Maps is a web mapping service developed by Google, including satellite images, topographic city and rural maps, and 360 degree street view perspective. https://maps.google.com/
5 Open Sound Control is a networking protocol developed at CNMAT by Adrian Freed and Matt Wright for musical instrument controls to communicate in a network.
6 PureData is an open-source visual programming language developed by Miller Puckette; more information can be found at https://puredata.info/
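The coordinate hand-off can be illustrated with a minimal encoder for the OSC 1.0 binary message layout (pure Python; the "/position" address and the two-float payload are assumptions for illustration, not the installation's actual namespace):

```python
import struct

def _osc_string(s):
    """OSC string: ASCII bytes, NUL-terminated, padded to 4 bytes."""
    b = s.encode("ascii") + b"\x00"
    return b + b"\x00" * (-len(b) % 4)

def osc_message(address, *args):
    """Encode an OSC message whose arguments are all float32."""
    type_tags = "," + "f" * len(args)
    payload = b"".join(struct.pack(">f", a) for a in args)  # big-endian float32
    return _osc_string(address) + _osc_string(type_tags) + payload

msg = osc_message("/position", 1.5, 2.0)
print(len(msg))  # 24 bytes: 12 (padded address) + 4 (",ff") + 2 * 4 (floats)
```

In practice one would send such messages over UDP to the PureData patch, which unpacks the two coordinates and maps them to the synthesis parameters.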
According to the user's real location, her/his virtual position is modified as well and, consequently, the processing of the audio data is shaped differently. The result is a soundwalk obtained through an acoustically augmented reality.

3.1 The Soundscapes Design

While the granularization technique is traditionally applied to a fixed physical support, this installation produces a real-time result of the interaction of a user with a physical environment. The human-machine interaction takes place through the spatial cognition of the user, who decides to move according to her/his spatial perception combined with her/his auditory memory of the areas s/he has already explored. As a tribute to the work Pacific Fanfare by Barry Truax, it was decided to test the installation with a sound environment similar to the one composed by the artist: a harbor soundscape.

In order to reproduce this soundscape, train and plane sounds were used to simulate the harbor's industrial area, seagull and bell sounds for the docks, waves, boat whistles, and distant storms to simulate the farthest zone of the docks towards the sea, and human activity sounds to simulate the inner land of the harbor.^7 The wave and human activity sounds are used as the background of the soundscape and are processed with granular synthesis in the central circular area shown in Figure 1, while the other files are used as events (some of them processed with reverberation and others with granular synthesis).

Other configurations for different soundscapes are possible. Starting from a set of audio files for a particular context, a supervisor can load them into the PureData patch, where the virtual location of each audio source is decided by setting a pair of coordinates. Those coordinates refer to the location where the user will hear that audio at its maximal amplification. Likewise, the way the amplitude decreases as the user moves away from an audio source can be set. The granular zone is set by choosing the radius and center of the circle over which the granular synthesis takes place. It is necessary to select the audio files to use for the granular synthesis; the files not selected will be processed with a deep reverb. When the user walks across the boundary distinguishing the two different areas, a 1400 ms crossfade is applied between the two flows - one made from granular synthesis and the other from the sum of the amplitudes of the audio sources.

Figure 1. Graphic representation of the sound event positioning on the Interactive Soundscapes active area, with a user's soundwalk path.

3.2 The Design of the Parameters

The source location for the audio files can be chosen inside as well as outside the augmented space. If the coordinates are set outside, the user will never be able to reach that source and its sound will never reach maximal intensity. The synthesis parameters depend on the distance between the user and the center of the granular zone. Along the edge, the grains are 100 ms long, the density is maximized (100% probability for two grains to succeed one another), the band-pass filter applied to the grains uses low cut-off frequencies between 150 and 300 Hz with quality factors ranging between 5 and 20, and the variance parameter describing the heterogeneity of the granular stream modeling is set to 0.1, so that all the streams are modeled within a narrow neighborhood of the values set by the user's position. As the user gets closer to the center of the granular zone, the soundscape becomes more and more chaotic: in the center, the grain duration is set to 10 ms, the density is 25%, and the cut-off frequencies and quality factors reach high values. The granularization happens in real time during the playback of the audio, extracting every grain in each stream one after another from the original audio file. Figure 1 shows a representation of the composed soundscape and the spatial organization of sounds inside the augmented space. The square area represents the area covered by the camera, the red area is where the granular synthesis occurs, and the blue and gray shaded backgrounds starting from the top and the bottom of the figure represent the amplitudes of the two different background sounds which are mixed together. The diamonds refer to the events composing the soundscape on the background layer.

^7 All the audio files have been selected from http://freesound.org

4. SYSTEM ARCHITECTURE

Figure 2. Flow chart of the system architecture of Interactive Soundscapes.
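The distance-to-parameter mapping described in Section 3.2 can be sketched as a simple interpolation. This is an illustrative reading, not the authors' patch: the paper gives only the end-point values (100 ms grains and maximal density at the edge, 10 ms and 25% at the centre), so the linear ramp in between, and the linear shape of the 1400 ms crossfade, are assumptions:

```python
def grain_params(distance: float, radius: float) -> dict:
    """Interpolate grain parameters between the centre and the edge of
    the granular zone. d is clamped to [0, 1]: 0 = centre, 1 = edge."""
    d = max(0.0, min(distance / radius, 1.0))
    return {
        "duration_ms": 10.0 + d * (100.0 - 10.0),  # 10 ms centre -> 100 ms edge
        "density":     0.25 + d * (1.0 - 0.25),    # 25% centre -> 100% edge
        "variance":    0.1,                        # heterogeneity of the streams
    }

def crossfade_gains(t_ms: float, fade_ms: float = 1400.0) -> tuple:
    """Gains for the 1400 ms crossfade applied after the user crosses the
    boundary: (granular stream gain, summed background stream gain)."""
    a = max(0.0, min(t_ms / fade_ms, 1.0))
    return (a, 1.0 - a)
```

The filter cut-off frequencies and quality factors would follow analogous ramps (150-300 Hz and Q of 5-20 at the edge, higher values at the centre); their exact centre values are not stated in the paper, so they are omitted here.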
The system architecture is made of two different software modules (Fig. 2). The first module is designed to process the images captured by a camera oriented perpendicularly to the floor. The camera covers a surface of variable dimension depending on the height at which it is positioned. A software tool named Zone Tracker [12] processes the visual information, subtracting the instantaneous images from a dry image of the ground. From the resulting image, refined by averages of morphological transformations, a blob is obtained, and its barycenter represents the user's position in the area covered by the camera. The coordinates of the user's barycenter are sent, through OSC, to a second software module responsible for the interactive sound processing. The second module is written in PureData; it processes the user's position together with the loaded audio files, creating for each file an audio stream scaled in amplitude as a function of the user's position. A second stream is also created for each file, performing a granular synthesis of the first one.

4.1 Granular synthesis

The stream of grains is obtained by imposing on an audio signal a train of window functions of short length. The granular synthesis has been designed to operate on the individual audio stream of each source. In order to obtain 64 granular sub-streams for each source, the main stream has been recursively divided into 4 sub-streams 3 times, as shown in Figure 3. To make the final stream as rich and varied as possible, each of the first 4 sub-streams (see Figure 4) is defined by a set of parameters controlling the duration, density, mid-frequency, and Q factor of the granular stream. Each of these parameters is then stochastically modeled depending on a dynamic variance parameter. All 64 sub-streams are then summed into one resulting output stream. Furthermore, the resulting streams of all sources are directed into one final output audio flow. To develop each of the 4 sub-flows sharing the same set of control parameters in a varied and plentiful way, they have been modeled through a probability distribution such that no two sub-streams are identical in any aspect.

Figure 3. Granular stream architecture for two different audio sources. Each original audio input is split into 4 different classes of streams characterized by different granular parameters. The resulting asynchronous granular streams are summed together into a synthesized output stream.

Figure 4. Grain structure. Each of the 64 audio streams for each audio source is filtered by a band-pass filter depending on the user's position and enveloped according to the input parameters.

5. EVALUATION

To provide a first evaluation of the soundscape generated in the Interactive Soundscapes environment, we prepared an online listening test with the following aims:

3. to see how listeners link the computer-generated soundscape to a real environment.

Thirty-one invited listeners took part in the test, 25 males and 6 females. Subjects were aged between 20 and 57 years (mean = 31.5 years, SD = 8.8). Seventeen subjects had studied a musical instrument for more than 5 years (54.8%), nine had studied a musical instrument for less than 5 years (29%), and five had never studied a musical instrument (16.1%). For the test we selected an audio excerpt recorded during an exploratory session performed by a single user, whose soundwalk path is depicted in Figure 1. The excerpt lasts approximately 1 min and 30 s and comprises both soundscape and granular-synthesis-produced sounds, with the granular processing beginning at 16 s. Figure 5 reports a rough reconstruction of the event appearance and timing in the first 21 s of the audio excerpt chosen for the users' evaluation. In this representation, at least five events are detected before the user enters the granular processing zone (waves, train, birds, bell, and seagulls).^8 The audio excerpt is not a simple soundscape recording but a computer-generated soundscape which refers to a real soundscape without being real. Beyond reverberation effects, it contains sounds obtained through a granular synthesis process which can significantly affect the perception of the user. Thus, to assess if the parameters of the audio

^8 This interpretation is not unique. Due to the complexity of the soundscape recordings employed and to the differences in the listening conditions (which in an online test cannot be easily controlled), different listeners may report different events or a different event order.
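The sub-stream construction of Section 4.1 (one input split into 4 sub-streams, recursively, 3 times, giving 4^3 = 64) and the short grain windows can be sketched as follows. This is a simplified reading: the Gaussian jitter and the Hann window are assumptions (the paper specifies only a "variance parameter" and "a train of window functions of short length"), and, unlike this sketch, the paper assigns a distinct base parameter set to each of the first four sub-streams:

```python
import math
import random

def substream_params(base: dict, variance: float = 0.1, seed: int = 0) -> list:
    """Recursively split one parameter set into 4 sub-streams, 3 times,
    yielding 4**3 = 64 sub-streams, each jittered around the shared base
    values so that no two sub-streams are identical."""
    rng = random.Random(seed)
    streams = [dict(base)]
    for _ in range(3):  # three successive 4-way splits: 1 -> 4 -> 16 -> 64
        streams = [
            {k: v * (1.0 + rng.gauss(0.0, variance)) for k, v in s.items()}
            for s in streams for _ in range(4)
        ]
    return streams

def hann_grain(samples: list, start: int, length: int) -> list:
    """Extract one grain from the source samples and shape it with a Hann
    window, one possible choice for the short window in the grain train."""
    return [
        samples[start + n] * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / (length - 1)))
        for n in range(length)
    ]
```

Summing the 64 windowed sub-streams, and then the per-source results, would give the single output flow described above.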
[Figure 5. Event appearance and timing in the first 21 s of the audio excerpt chosen for the users' evaluation.]

[Figure. Users' ratings on the unpleasant-pleasant axis.]
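Zone Tracker [12] performs the background-subtraction and barycentre computation described in Section 4. Purely as an illustration of those two steps (and not of Zone Tracker's actual implementation, which also applies morphological transformations), a toy version over grayscale frames stored as nested lists might read:

```python
def blob_barycenter(frame, background, threshold=30):
    """Subtract a dry background image from the current frame, threshold
    the absolute difference, and return the barycentre (x, y) of the
    resulting blob, i.e. the tracked user's position. Returns None when
    no pixel exceeds the threshold (no user in the tracked zone)."""
    xs, ys = [], []
    for y, (row, bg_row) in enumerate(zip(frame, background)):
        for x, (p, b) in enumerate(zip(row, bg_row)):
            if abs(p - b) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (sum(xs) / len(xs), sum(ys) / len(ys))
```

The resulting pair of coordinates is what the first module forwards over OSC to the PureData module.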
granted in advance, considering that the audio excerpt is the product of a soundwalk in a virtual environment and not the reproduction of a real soundscape. Due to methodological and technical difficulties, it was not possible to assess whether subjects could perceive the sound changes obtained through the granular synthesis and how they evaluated them. This is an important task which requires strict control of the listening conditions and greater control of the audio file content. As a matter of fact, the presence of a great number of different sounds, both in the foreground and in the background, can confuse the listeners and does not guarantee sufficient sound discrimination. No textual description of the physical place which could have generated the soundscape was delivered to the test subjects. They spontaneously identified it mainly as "train station" and "harbor", showing that the sound rendering is recognizable as a real soundscape and that it is consistent with what we wanted to reproduce.

6.1 Further work

The next steps for this work will be to place it in a 3D audio context using a wave-field synthesizer and to enable the system to detect interactions from several people at the same time. The sound information will be local instead of being the same in the whole area, and the parameters collected from the participants will be used only for the manipulation of the dynamic events over the soundscape.

Acknowledgments

We want to thank Leonardo Amico for his support in programming the Zone Tracker application.

7. REFERENCES

[1] R. M. Schafer, The Soundscape: Our Sonic Environment and the Tuning of the World. Inner Traditions/Bear & Co, 1993.

[2] B. Truax, "Soundscape composition as global music: Electroacoustic music as soundscape," Organised Sound, vol. 13, no. 2, pp. 103-109, August 2008.

[3] F. Pachet, "Active listening: What is in the air?" Sony CSL, Tech. Rep., 1999.

[4] P. Oliveros, Deep Listening: A Composer's Sound Practice. iUniverse, 2005.

[5] (1999). [Online]. Available: http://www.sfu.ca/truax/vanscape.html

[6] A. Camurri, S. Canazza, C. Canepa, G. L. Foresti, A. Rodà, G. Volpe, and S. Zanolla, "The stanza logo-motoria: an interactive environment for learning and communication," in Proceedings of the SMC Conference 2010, Barcelona, 2010.

[7] M. Mandanici, A. Rodà, and S. Canazza, "The harmonic walk: an interactive physical environment to learn tonal melody accompaniment," Advances in Multimedia, vol. 2016, p. 2, 2016.

[8] K. Eng, A. Bäbler, U. Bernardet, M. Blanchard, M. Costa, T. Delbruck, R. J. Douglas, K. Hepp, D. Klein, J. Manzolli et al., "Ada - intelligent space: an artificial creature for the Swiss Expo.02," in Robotics and Automation, 2003. Proceedings. ICRA'03. IEEE International Conference on, vol. 3. IEEE, 2003, pp. 4154-4159.

[9] G. Eckel, "Immersive audio-augmented environments," in Proceedings of the Fifth International Conference on Information Visualisation, 2001.

[10] M. Wright, "Open Sound Control - a new protocol for communicating with sound synthesizers," in Proceedings of the 1997 International Computer Music Conference, 1997, pp. 101-104.

[11] M. Puckette, "Pure Data: another integrated computer music environment," in Proceedings of the Second Intercollege Computer Music Concerts, 1996, pp. 37-41. [Online]. Available: https://puredata.info/

[12] S. Zanolla, S. Canazza, A. Rodà, A. Camurri, and G. Volpe, "Entertaining listening by means of the Stanza Logo-Motoria: an Interactive Multimodal Environment," Entertainment Computing, vol. 4, no. 4, pp. 213-220, 2013.

[13] Ö. Axelsson, M. E. Nilsson, and B. Berglund, "A principal components model of soundscape perception," The Journal of the Acoustical Society of America, vol. 128, no. 5, pp. 2836-2846, 2010.

[14] J. Posner, J. A. Russell, and B. S. Peterson, "The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology," Development and Psychopathology, vol. 17, no. 3, pp. 715-734, 2005.
Learning to play the guitar at the age of interactive and collaborative Web
technologies
ABSTRACT

Recent advances in online technologies are changing the way we approach instrumental music education. The diversity of online music resources has increased through the availability of digital scores, video tutorials, and music applications, creating the need for a cohesive, integrated learning experience. This article will first explore how different technological innovations have shaped music education and how their current transposition into the digital world is creating new learning paradigms. As a practical illustration of these paradigms, we will present the development of a new online learning environment that allows master-apprentice or self-taught learning, using interactive and collaborative tools.

1. INTRODUCTION

In order to understand how technology can help a self-taught learner, or a learner in between lessons with a master, we must present the specificities of private and group lessons. In this introductory section, we will identify the learning processes in place and determine which are missing when the learner is left alone. We will then review how technology has been used to mitigate these gaps in the past and how new Web technologies can be used today.

1.1 Group Music Classes vs Private Music Lessons

Table 1 describes the master's and the learner's activities. It identifies the modalities in place and the type of feedback that is possible between them. In the typical context of a private lesson, prior to any lesson, both protagonists will discuss the learner's intrinsic motivations. The master will adapt his teaching according to the learner's objectives and prior knowledge. This differentiated teaching will continuously occur throughout the lesson series, since the master will adapt his teaching to the learner's progress. During a lesson, the learner observes and receives direct instructions from the master, but he also plays with his master. That allows him to regulate his action according to what he observes, hears, and feels. The learner also has the ability to discuss his sensations with his master and to receive feedback on his actions. All these interactions are believed to contribute to learning and to help maintain motivation.

Similar activities can be observed in group lessons. In that context, there might be less emphasis on individual objectives and progression, but a gained benefit from interaction with peers. In a group context, the attention of the master is shared between the learners. Learners are able to regulate their action with respect to the master, to the group, and among themselves. Learners can discuss their sensations and request feedback from peers, and consequently form a community of practice.

1.2 Informal learning in-between lessons or in the case of self-taught learners

Table 2 presents the tasks of the learner in-between lessons (or in the case of a self-taught learner) and relates these to the learning and teaching activities presented in Table 1. It shows which technologies can be used to mitigate the absence of a master and of peers. For each activity, the table presents different types of technologies and gives a few examples of existing commercial and research solutions.

1.2.1 Search for Material

One of the tasks the learner might want to perform is to search for new material. A typical self-taught learner will want to search for his favourite artists' songs. Websites for sharing tablatures have existed on the Web for a long time. At first, this information was stored in simple text files using ASCII characters to represent strings and fret positions, techniques, and chord information. The Ultimate-Guitar Website [16] (see Table 2 for all technology references) is one example of this type of sharing platform. Nowadays, these sharing platforms have evolved and can display the score in standard western notation and advanced tablature notation. They also offer playback functionalities and permit, for example, modifying the playback tempo or looping sections of the score (Songsterr [15] is one of these). The most elaborate interfaces will even link and synchronously play audio and video recordings of a performance (Soundslice [14], JellyNote [13]).

Music information retrieval algorithms offer new possibilities to extract playing instructions directly from a recording. There already exist some commercial software applications where one can input an audio recording and

Copyright: © 2017 Anne-Marie Burns, Sébastien Bel, Caroline Traube et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
retrieve an analysis of the chords and notes played (Chordify [17], Capo [18], GuitarMaster [19]). Non-commercial applications have also been developed in the MIR research field to automatically retrieve gestural information from a combination of audio-video or motion capture recordings of a performance, with the use of predictive models [1].

1.2.2 Seeking information, demonstrations and feedback

Two other tasks the learner wants to perform are to search for information and instructions and to find models to observe. These two tasks might be complemented by the will to receive feedback on the comprehension, and on the execution on the instrument, of what was learnt. Various sources of information for these tasks include text-based questions and answers on forums or social networks, blog posts, and dedicated Websites exploiting various types of media. Nowadays, audio and video recording production is easily available to anyone. Distribution is also facilitated by the use of general platforms for videos (Youtube [28], DailyMotion [30], Vimeo [29]) and audio (Soundcloud [36]). These recordings can then be linked and included in the previously mentioned forums, blogs, and Websites. Waldron [22-27] made an extensive study of different online communities of practice where members are seen as prosumers, i.e. users that both produce and consume material posted by the community. In these communities, members can, for example, produce instructional audio-video recordings or post advice on techniques they master, and, on the other hand, post their questions or recordings of their performances to request feedback and comments from the other members. The previously mentioned research projects on automatic analysis of performance can also be used for the task of providing automatic feedback to the users. However, to our knowledge, current commercial products including automatic feedback only use audio analysis for pitch and rhythm detection (Yousician [38], JellyNote [13]).

A special case of applications can include all this information on a single support. Just as a master would annotate the learner's score so that he can go home and practice with that supplemental information, digital scores have made possible the concept of the augmented score. Augmented scores were first introduced in the VEMUS/IMUTUS projects [32, 33] and took the form of annotated scores where the annotations could take various forms: graphical, textual, aural, and sonic. VEMUS/IMUTUS also introduced the idea of displaying various analysis curves of the performance in parallel to the score (frequency curves, sonograms, envelope curves, etc.). Nowadays, augmented scores can include data from multimodal recordings of performances and from the analysis of these performances.

1.2.3 Play With Others

Another activity that is frequent in private and group lessons is to play along with the master or with peers. This is a rewarding activity and an important motivational factor. It is often something that a learner will miss when he is alone to practice. As we have seen, nowadays most score editors and score sharing platforms offer playback functionalities. These virtual players can use different technologies, from MIDI playback to sound synthesis or sampling, to synchronization with audio or even audio-video recordings. Some will only play the instrument, while others will also include accompaniment by other instruments. Dedicated software like Band-in-a-Box [40] has long been used for that purpose. However, playing along with a virtual peer or band is not as rewarding as playing with real peers. For that purpose, projects like I-Maestro [34] introduced the idea of playing over distance. Game platforms like Rocksmith [37] or learning environments like Yousician [38] also use the idea of challenging other users in a networked performance of a score.

1.2.4 Structure the Learning Path

Finally, as we have seen, the availability of online music resources for learning is vast and disparate. Learners often
have to search among many different sources to find resources that fit their needs. This in itself can be a source of lost motivation. However, there exist platforms that try to solve that problem by providing a complete environment dividing the learning into achievable and significant steps that can be quantitatively and qualitatively evaluated. These platforms are of two kinds: serious games and learning management systems (LMS). There is no absolute definition of serious games, but what is generally agreed upon is that they are a category of games that uses entertainment and multimedia to convey an experience to the user [41]. In the case of music education, that experience will take the form of a musical skill. Serious games for music are derived from the electronic rhythm games that appeared in the 1990s. They have the specificity of using real instruments as game controllers. Rocksmith [37] from Ubisoft is a well-known example for the guitar. Magna Quest [39] is another example, emerging from research and applied to the violin.

On the other hand, LMS dedicated to instrumental learning - or, said otherwise, computer-assisted music pedagogy for instrumental learning - are derived from the digital form of a pedagogy named programmed learning. Current LMS use game design elements to keep the learner's motivation; this is what is called gamification [42]. In an LMS, learning can be organized in levels where the learner can progress along different paths. The learners can compare themselves with others through the use of leaderboards. They can follow their progress with progression bars and dashboards. They can unlock new learning paths by taking tests and receiving badges. These badges can be shared on social networks. Learning analytics tools can also be used to inform the learners of their progress and recommend activities to go further. All these tools are at the disposal of pedagogical content developers to help them create motivating learning environments. The research projects we reviewed all include an LMS. I-Maestro [34] also includes the idea of collaborative work, while PRAISE [31] introduces social network tools with the Music Circle, where learners and masters can share their performances and annotate other learners' performance recordings.

2. THE NOVAXE PROJECT

This project will offer a new gesture-rich guitar notation as well as an augmented score, allowing learners to access different dimensions of the performance and to create links between the gestural instructions found in a score and the ones captured during real performances using a set of sensors. It implies the creation of analysis tools to extract score and gestural parameters from a performance. These parameters are useful to align the score and the performance, to complement the score with performance instructions, to transcribe a score from a performance, and to provide feedback to a learner on his performance of a score. The project aims to explore the hypothesis that interaction and collaboration through peer reviews and emulation, combined with observational learning from a master but also from self-observation, can be beneficial for musical instrument learning.

This project was created to address some of the fundamental problems in modern music education. It consists of three main facets: advanced notation, Web-based pedagogical technology, and a highly structured educational system. The main focus of the project is to enhance the musical education experience by making it more intuitive, rewarding, and efficient for both the casual student and the serious apprentice. The project
Figure 1: A Novaxe score in the Web platform player with all edition and property panels open.

is designed to answer three main challenges currently faced by guitar teachers.

First, the majority of guitar students use various notations: standard notation, standard arranged notation, basic tablature, advanced tablature, lead sheets, and the chord-and-lyric style. Each of these systems has strengths and weaknesses, often increasing workload while compromising learning efficiency. The project's alternate notation alleviates these core challenges faced by many music students.

Second, social networks and cloud computing have changed the way we socialize, collaborate, and learn. Internet services and software are more connected than ever. With the propagation of Web applications, driven by recent technological advances, we can access complex data-driven applications from personal computers as well as from a myriad of smart devices. As we have seen in the first part of this article, music learning and production have not escaped these new paradigms. As Martín et al. note: with the rise of online music communities using performance or pedagogical applications, there is an increasing need for tools for manipulating music scores. There is also a need for Web-based tools for visualizing, playing, and editing [scores] collaboratively [9].

Finally, learning to play a musical instrument is a long process, as musical gesture falls into a category of gestures referred to as expert gestures. This type of gesture requires years of deliberate practice and hours of repetitive exercises in order to achieve a high level of proficiency. These high-level skills are commonly transmitted from a master to an apprentice through hours of guided observation. However, this teaching method has always faced some issues. The periods of contact between the master and the apprentice are often short and punctual. In between these periods, the apprentice is left all alone for self-learning. In the case of pure online learning, the master-to-student link is even compromised by the lack of physical proximity. Sensors, sound, and video analysis are used to mitigate this indirection and provide the learner with valuable feedback. This helps to maintain the learner's motivation and reduces the risk of developing a wrong technique that would then take even more practice time to correct.

2.1 The Online Learning Environment

This project is therefore an attempt to answer these challenges faced by music and guitar teaching in the age of interactive and collaborative Web technology. The pedagogy and the innovative musical notation used in the project were developed by Vandendool over more than ten years of personal experience as a private guitar teacher [43, 44]. The pedagogy includes concepts from music theory such as scale degrees, chord tones, harmonic function of chords, the rhythm necklace, and technical information for both hands. The alternative notation system of the project has the capacity to display this extra information, giving the student a very high level of technical and musical insight. This information is leveraged to create relationships between scores according to the learner's skill set, using characteristics such as the progression of chord degrees, fret position, technical information, and difficulty.

These pedagogical tools are ported to the digital world. The platform in development includes a score editor and player entirely based on the most recent Web technologies (HTML5, CSS3, and several Javascript libraries). It permits importing scores from other digital formats like Guitar Pro [5] and MusicXML [45] and converting them into the project's notation format. It is also possible to create scores from scratch or to complement imported ones with the editor. The editor lets the user add technical and theoretical information to the score, including right-hand technique, left-hand fingering, the notes' function in a chord
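The annotation layers just listed (right-hand technique, left-hand fingering, note function in a chord) suggest a per-note data structure behind such an augmented score. The sketch below is purely hypothetical: the field names and the toy difficulty measure are illustrative assumptions, not the Novaxe format:

```python
# A hypothetical augmented-score event: one played note carrying the extra
# technical and theoretical layers described above. Field names are
# illustrative only, not the actual Novaxe notation format.
note_event = {
    "measure": 4,
    "beat": 2.5,
    "string": 3,
    "fret": 7,
    "left_hand_finger": 2,        # left-hand fingering (1-4)
    "right_hand_technique": "i",  # e.g. a p/i/m/a picking indication
    "chord_tone": "3rd",          # function of the note in the chord
    "scale_degree": 5,
}

def difficulty(events) -> float:
    """A toy difficulty measure: average fret distance between consecutive
    notes, one of the characteristics (fret position, techniques) that
    could relate scores to a learner's skill set."""
    frets = [e["fret"] for e in events]
    if len(frets) < 2:
        return 0.0
    return sum(abs(a - b) for a, b in zip(frets, frets[1:])) / (len(frets) - 1)
```

A search engine over such records could then index scores by chord-degree progression, technique, or difficulty, in the spirit of the relationships between scores described above.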
environment (for example, teachers can use that feature to share exercises with their students). The alpha version of the player offers different playback functionalities. Users can modify the tempo and loop over sections of the score. The notes to be played can be visualized using the Novaxe notation system or on a fretboard representation. An online editor is already partially developed and will become available in subsequent versions. A search engine based on musical information from the score (e.g. chord progressions, techniques) is also in development. We believe that such a basic set of features, based on the experience of an individual guitar master, was necessary to start a discussion with potential users of the project. The Agile development methodology will subsequently help us develop a platform that comes closer to user needs at each iteration.

3. CONCLUSIONS

The ideal master-to-learner musical instrument learning situation involves different modalities, from reading a symbolic representation of a musical excerpt to regulating one's actions by visual and aural comparison with an expert musician or with peers. This complex feedback loop is lost in the periods between lessons or in the case of self-taught learning. Technological innovations can be used to regain access to some of these modalities, for example with sound and video recordings. However, even with recordings, what is still missing is the possibility to interact with and receive feedback from a master or from peers. This has been partially alleviated by the use of social networks and automatic feedback tools. Nowadays, the diversity of online music resources is creating the need for a cohesive and integrated learning experience. We have seen how the computer games industry has influenced the way we approach the development of virtual learning environments, in every field of education but also in music. We have seen how game elements can be used to maintain the learner's motivation even in the case of repetitive exercises. We have seen how these LMSs can be coupled with social networks to create communities of practice around a learning platform.

The use of digital and Web technologies in instrumental learning opens new perspectives. For example, as Waldron notes, proximity to a master or to other players is no longer a requirement for belonging to a community of practice around a particular instrument or style [23, 25, 26]. It also relates symbolic musical notation more closely to performance through the use of augmented scores, giving the user access to more dimensions of a musical excerpt. However, it also raises new questions. It is not yet clear what use the learner is able to make of all the extra information he is receiving in the form of graphical representations of his performance, or how he is able to use the diverse types of feedback. Does that really provide him with different perspectives on the material to learn, and will it teach him not only about the subject matter but also about himself as a learner [48]? Or does it make him dependent on the learning environment? Is the learner still able to develop his own proprioception, or is he always relying on the system? The use of an artificial intelligence agent to take care of the user's preferences and learning style might seem appealing, but might it come at the risk of limiting the learner to what is naturally attractive to him? Answering these questions is probably a required step in developing the critical thought necessary to give these technologies their proper place in the arsenal of musical learning tools.

Acknowledgments

This research is funded by a Mitacs acceleration grant in the context of the Novaxe gesture toolkit project.² The authors would like to thank the whole development team: Mark, Andrew, Christelle, Mathieu and Etienne, and the industrial partner, Orval Vandendool.

² https://www.mitacs.ca/en/projects/novaxe-gesture-toolkit

4. REFERENCES

[1] A. Perez-Carrillo, J.-L. Arcos, and M. Wanderley, "Estimation of guitar fingering and plucking controls based on multimodal analysis of motion, audio and musical score," in Proceedings of the 11th International Symposium on CMMR, Plymouth, United Kingdom, June 16-19 2015.

[2] Finale. [Online]. Available: http://www.finalemusic.com

[3] Sibelius. [Online]. Available: http://www.avid.com/sibelius

[4] Musescore. [Online]. Available: https://musescore.org

[5] Guitar Pro. [Online]. Available: http://www.guitar-pro.com/

[6] TuxGuitar. [Online]. Available: http://tuxguitar.herac.com.ar

[7] Power Tab. [Online]. Available: http://www.power-tab.net

[8] Jamshake. [Online]. Available: https://www.jamshake.com

[9] D. Martin, T. Neullas, and F. Pachet, "Leadsheetjs: A javascript library for online lead sheet editing," in TENOR 2015, First International Conference on Technologies for Music Notation and Representation, Paris, France, 28-30 May 2015.

[10] Noteflight. [Online]. Available: https://www.noteflight.com

[11] Scorio. [Online]. Available: http://www.scorio.com

[12] Flat.io. [Online]. Available: https://flat.io

[13] Jellynote. [Online]. Available: https://www.jellynote.com

[14] Soundslice. [Online]. Available: https://www.soundslice.com
[15] Songsterr. [Online]. Available: https://www.songsterr.com

[16] Ultimate Guitar. [Online]. Available: https://www.ultimate-guitar.com

[17] Chordify. [Online]. Available: https://chordify.net

[18] Capo. [Online]. Available: http://supermegaultragroovy.com/products/capo/

[19] Guitarmaster. [Online]. Available: http://www.guitarmaster.co.uk

[20] C. Dittmar, E. Cano, J. Abeßer, and S. Grollmisch, "Music information retrieval meets music education," in Multimodal Music Processing, ser. Dagstuhl Follow-Ups, M. Müller, M. Goto, and M. Schedl, Eds. Germany: Dagstuhl Publishing, 2012, vol. 3, ch. 6, pp. 95-120.

[21] Hooktheory. [Online]. Available: https://www.hooktheory.com

[22] J. Waldron and K. K. Veblen, "The medium is the message: Cyberspace, community, and music learning in the Irish traditional music virtual community," Journal of Music, Technology and Education, vol. 1, no. 2-3, pp. 99-111, November 2008.

[23] J. Waldron, "Exploring a virtual music community of practice: Informal music learning on the internet," Journal of Music, Technology and Education, vol. 2, no. 2-3, pp. 97-112, November 2009.

[24] ——, "Locating narratives in postmodern spaces: A cyber ethnographic field study of informal music learning in online community," Action, Criticism & Theory for Music Education, vol. 10, no. 2, December 2011.

[25] ——, "Conceptual frameworks, theoretical models and the role of YouTube: Investigating informal music learning and teaching in online music community," Journal of Music Education and Technology, vol. 4, no. 2-3, pp. 189-200, 2011.

[26] ——, "YouTube, fanvids, forums, vlogs and blogs: Informal music learning in a convergent on- and offline music community," International Journal of Music Education, pp. 1-16, February 2013.

[27] ——, "User-generated content, YouTube and participatory culture on the web: Music teaching and learning in two contrasting online communities," Music Education Research, vol. 15, no. 3, pp. 257-274, April 2013.

[28] Youtube. [Online]. Available: https://www.youtube.com

[29] Vimeo. [Online]. Available: https://vimeo.com

[30] Dailymotion. [Online]. Available: http://www.dailymotion.com

[31] Practice and performance analysis inspiring social education. [Online]. Available: http://www.iiia.csic.es/praise/

[32] D. Fober, S. Letz, and Y. Orlarey, "Vemus - feedback and groupware technologies for music instrument learning," in Proceedings of the 4th Sound and Music Computing Conference, Lefkada, Greece, 11-13 July 2007, pp. 117-123.

[33] S. Raptis, A. Askenfelt, D. Fober, A. Chalamandaris, E. Schoonderwaldt, S. Letz, A. Baxevanis, K. F. Hansen, and Y. Orlarey, "IMUTUS - An Effective Practicing Environment For Music Tuition," in Proc. of the International Computer Music Conference, 2005, pp. 383-386.

[34] K. Ng, B. Ong, T. Weyde, and K. Neubarth, "Interactive multimedia technology-enhanced learning for music with i-maestro," in World Conference on Educational Multimedia, Hypermedia and Telecommunications, June 2008.

[35] Music Circle. [Online]. Available: http://goldsmiths.musiccircleproject.com

[36] Soundcloud. [Online]. Available: https://soundcloud.com

[37] Rocksmith. [Online]. Available: https://rocksmith.ubisoft.com

[38] Yousician. [Online]. Available: http://yousician.com

[39] F. Dubé, J. Kiss, and H. Bouldoire, "Gamification and learner engagement: A learning the violin implementation interface example," in Proceedings of the International Symposium Learning and Teaching Music in the Twenty-First Century: The Contribution of Science and Technology LTM21/AEM21, Montreal, Quebec, Canada, 2015.

[40] Band-in-a-Box. [Online]. Available: http://www.pgmusic.com

[41] F. Laamarti, M. Eid, and A. El Saddik, "An overview of serious games," International Journal of Computer Games Technology, p. 15, October 2014.

[42] S. Deterding, D. Dixon, R. Khaled, and L. Nacke, "From game design elements to gamefulness: Defining gamification," in Proceedings of the 15th International Academic MindTrek Conference: Envisioning Future Media Environments, Tampere, Finland, September 28-30 2011. New York, NY, USA: ACM, pp. 9-15.

[43] M. Vandendool, "Musical notation systems for guitar fretboard, visual displays thereof, and uses thereof," US Patent App. 13/740,842, July 18 2013. [Online]. Available: http://www.google.ch/patents/US20130180383
Key detection is a common task in Music Information Retrieval (MIR), and has been a part of the Music Information Retrieval Evaluation eXchange (MIREX) since its start in 2005. Motivations for it include automatic analysis and annotation of large databases. The present study focuses on key identification for Irish traditional music tunes. Key-finding algorithms are tested on two collections of audio recordings, of session and solo recordings, representing 636 tunes overall. Symbolic transcriptions have been compiled for all the tunes. A range of methods from the literature is benchmarked on both the audio and symbolic data. Some modifications to these methods, as well as new methods, including a set of parametric models, are presented and tested.

A musical key consists of a tonic note, represented by a pitch class (C, C#, D, ...), and a mode, or rather mode family, which can be minor or major. Consequently, there are always 24 candidate keys, for the 12 semitones of the octave and the two considered modes. Enharmonic equivalence is used, which means that we do not distinguish between different spellings of the same note in twelve-tone equal temperament, e.g. D# and Eb. Throughout the paper we adopt the convention of denoting major keys by upper-case letters, and minor keys by lower-case ones.

The standard approach to identifying the key of a musical piece is to use key-profiles [1]. They can be seen as vectors assigning weights to the twelve semitones, denoted (p[i]), i = 0, ..., 11. One key-profile per mode is defined for the tonic note C, and transposition to another tonic note yields the remaining candidates, giving a set P of 24 key-profiles representing the candidate keys; the estimated key is the one whose profile best matches the excerpt.

This method only needs a pitch class histogram of the musical excerpt. From a symbolic representation, obtaining this is straightforward. From an audio representation, an extra step of computing a chromagram is required, which does not present any significant difficulty. It can however add some noise to the histogram because of the harmonics present in an audio signal. The resulting pitch class histogram is, in the classification proposed in [2], a low-level global descriptor.

Other methods for key identification are based on higher-level features. For example, in [3] the intervals of a melody are analyzed, which presupposes that an automatic transcription of the signal has been performed beforehand. In [4], an HMM is trained to estimate the key from a sequence of chords.

In preparing this paper we also experimented with some machine learning (ML) models for key detection (such as multinomial regression models). Generally, ML models perform best on relatively balanced datasets; consequently, in order to train our ML models we introduced transposed tunes into the dataset to balance the distribution of the data. However, the performance of these ML models was relatively weak, and as a result we include neither a description of them nor the data preparation carried out for them in this paper.

The main contribution of this paper is a set of new key-profiles, outperforming existing key-profiles on the corpora considered. In Section 2, we present the key-finding method and existing key-profiles defined in the literature. In Section 3, new key-profiles are introduced. Section 4 gives a description of the datasets. Section 5 gives the details and results of the experiment using the profiles.

Copyright: © 2017 Pierre Beauguitte et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
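The chromagram computation mentioned above can be approximated in a few lines. The sketch below is our own simplified stand-in (not the feature extractor used in the paper): it maps STFT bin frequencies to the nearest equal-tempered pitch class and accumulates spectral magnitudes into a 12-bin pitch class histogram.

```python
import numpy as np

def pitch_class_histogram(signal, sr, frame=4096, hop=2048):
    """Crude chromagram-style pitch class histogram.

    Each STFT bin within a melodic frequency range is mapped to the
    nearest equal-tempered pitch class (C = 0) and its magnitude is
    accumulated over all frames.
    """
    hist = np.zeros(12)
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    window = np.hanning(frame)
    for start in range(0, len(signal) - frame + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * window))
        for f, mag in zip(freqs, spectrum):
            if 55.0 <= f <= 2000.0:
                midi = 69 + 12 * np.log2(f / 440.0)  # 440 Hz = MIDI 69 = A
                hist[int(np.round(midi)) % 12] += mag
    return hist / hist.max() if hist.max() > 0 else hist
```

A pure 440 Hz tone, for instance, yields a histogram dominated by pitch class 9 (A); on real audio, harmonics spread energy to other bins, which is the noise the text refers to.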
Section 6 presents a parametric extension of one of the models, the method used to select the parameters, and the performance of this model. Finally, Section 7 discusses the study and indicates future ideas for research.

2. RELATED WORK

2.1 Triads

Certainly the most naive way to define a key-profile is to consider only the triad of the tonic chord. For example, in C it is expected that the pitch classes of the tonic C, the third E and the fifth G will be the most frequent, as reflected in the key-profiles:

ptriad(C) = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
ptriad(c) = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]

2.2 Krumhansl-Kessler

The key-profiles established in [5] were obtained from perceptual experiments, in contrast with the triads presented above, which were motivated by music theory. Subjects were asked to rate how well the different pitch classes fit with a short musical excerpt establishing a key. The Krumhansl-Kessler key-profiles are a well-known method for key detection.

pKK(C) = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
pKK(c) = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]

2.3 Lerdahl's Basic Spaces

The basic spaces defined by Lerdahl in [6] are derived from the diatonic scale of each key. Different weights are given to the degrees of the scale: 5 for the tonic (index 0 for C and c), 4 for the fifth (index 7 for C and c), 3 for the third (index 4 for C, 3 for c), 2 for the rest of the diatonic scale, and 1 for the remaining semitones.

pLerdahl(C) = [5, 1, 2, 1, 3, 2, 1, 4, 1, 2, 1, 2]
pLerdahl(c) = [5, 1, 2, 3, 1, 2, 1, 4, 2, 1, 2, 1]

It is worth noting that the natural minor scale, or Aeolian scale, is considered here: the natural seventh (index 10) is taken as part of the scale, not the augmented seventh (index 11) as would be the case with the harmonic minor scale.

2.4 Leman's Tone Center Images

In [7], the simple residue image (or R-image) of a chord is generated as a weighted combination of the undertone series of the tonic. The tone center images are then derived by summing the R-images of the chords present in the common cadences. The three typical cadences selected in [7] are

I - IV - V - I
I - II - V - I
I - VI - V - I

where the type of each chord depends on the scale considered. In the major case, the classic major scale (Ionian) is used. However, in the minor case the harmonic scale is chosen, where the seventh degree is one semitone higher than in the natural scale.

The Tone Center Images (TCI) are then obtained by summing the R-images of the chords, weighted by how often they occur in the cadences, and normalizing:

6·I + 3·V + II + IV + VI

After normalization, the key-profiles obtained are:

pLeman(C) = [0.36, 0.05, 0.21, 0.08, 0.24, 0.21, 0.05, 0.31, 0.07, 0.24, 0.09, 0.10]
pLeman(c) = [0.34, 0.11, 0.15, 0.25, 0.11, 0.25, 0.02, 0.31, 0.24, 0.09, 0.12, 0.14]

3. NEW PROFILES

In this section two new pairs of key-profiles are introduced.

3.1 Basic spaces for modal scales

The key-profiles introduced in 2.3 are based on the natural scales, i.e. the Ionian mode for major and the Aeolian mode for minor. In Irish music, two other modes are commonly used: the Mixolydian mode (major with a minor seventh) and the Dorian mode (minor with a major sixth). The two basic spaces are modified so that they are suited to both major modes (Ionian and Mixolydian) and both minor modes (Aeolian and Dorian). This is done by setting p[10] = p[11] in the major case, and p[9] = p[8] in the minor case:

pLerdahl*(C) = [5, 1, 2, 1, 3, 2, 1, 4, 1, 2, 2, 2]
pLerdahl*(c) = [5, 1, 2, 3, 1, 2, 1, 4, 2, 2, 2, 1]

This idea of considering the minor and major seventh as equivalent has already been used for the task of Irish traditional music tune transcription and recognition in [8].

3.2 Cadences

These key-profiles are inspired by Leman's tone center images. As mentioned in 2.4, cadences play an important role in establishing a tonal center. In Irish traditional music, the most common cadence is I - IV - V - I [9]. In the case of minor tunes, we also consider the chord sequence VII - VII - I - I, often used in accompaniments. Consequently, the formulae to obtain the key-profiles are

Major: 2·I + IV + V
Minor: 4·I + 2·VII + IV + V

Instead of considering R-images, chords are simply represented by their triads, as introduced in 2.1. The resulting key-profiles are:

pCadences(C) = [3, 0, 1, 0, 2, 1, 0, 3, 0, 1, 0, 1]
pCadences(c) = [5, 0, 3, 4, 0, 3, 0, 5, 1, 0, 3, 0]
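As a concrete illustration of how such profiles are built and used, the sketch below reconstructs the Cadences profiles from weighted triads and then matches a pitch-class histogram against all 24 transposed candidates. The helper names are ours; the triad qualities assumed for the minor-mode degrees (minor triads on i, iv and v, a major triad on the flat seventh) are an inference that reproduces the vectors printed above; and the correlation-based matching is the generic key-profile method, not necessarily the authors' exact implementation.

```python
import numpy as np

def triad(root, minor=False):
    """12-element indicator vector for a root-position triad."""
    p = [0] * 12
    for interval in (0, 3 if minor else 4, 7):
        p[(root + interval) % 12] += 1
    return p

def weighted_sum(terms):
    """Sum weighted 12-element vectors: terms = [(weight, vector), ...]."""
    out = [0] * 12
    for w, vec in terms:
        out = [o + w * v for o, v in zip(out, vec)]
    return out

# Major: 2*I + IV + V, with major triads on C, F and G.
p_cad_major = weighted_sum([(2, triad(0)), (1, triad(5)), (1, triad(7))])

# Minor: 4*I + 2*VII + IV + V; assumed minor triads on c, f and g,
# and a major triad on the flat seventh (Bb).
p_cad_minor = weighted_sum([(4, triad(0, minor=True)),
                            (2, triad(10)),
                            (1, triad(5, minor=True)),
                            (1, triad(7, minor=True))])

NAMES = ['C', 'C#', 'D', 'Eb', 'E', 'F', 'F#', 'G', 'Ab', 'A', 'Bb', 'B']

def estimate_key(histogram, major_profile, minor_profile):
    """Pick the candidate key whose transposed profile correlates best."""
    h = np.asarray(histogram, dtype=float)
    best, best_r = None, -np.inf
    for tonic in range(12):
        for profile, major in ((major_profile, True), (minor_profile, False)):
            # Rotate the C-based profile so index `tonic` becomes the tonic.
            p = np.roll(np.asarray(profile, dtype=float), tonic)
            r = np.corrcoef(h, p)[0, 1]
            if r > best_r:
                best = NAMES[tonic] if major else NAMES[tonic].lower()
                best_r = r
    return best
```

For instance, feeding the G-transposed major profile itself in as a histogram returns 'G', since a profile correlates perfectly with itself.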
Finally, these profiles are also modified to account for the Mixolydian and Dorian modes:

pCadences*(C) = [3, 0, 1, 0, 2, 1, 0, 3, 0, 1, 1, 1]
pCadences*(c) = [5, 0, 3, 4, 0, 3, 0, 5, 1, 1, 3, 0]

4. DATASETS

This section introduces the datasets used for this study.

[Figure: histograms of the number of tunes per key in the two collections; one collection spans the keys D, G, a, e, A, b and C, the other D, G, a, e, A, C, d, b, g and F.]
            FSaudio  FSsymb  GLaudio  GLsymb
Triad        0.873    0.795   0.671    0.709
KK           0.854    0.869   0.665    0.696
Lerdahl      0.887    0.890   0.689    0.777
Leman        0.848    0.829   0.646    0.666
Lerdahl*     0.893    0.890   0.711    0.798
Cadences     0.883    0.874   0.677    0.770
Cadences*    0.902    0.818   0.713    0.750

Table 1: MIREX scores of the key-profiles on the four corpora

Given the good performance of key-profiles which assign different weights to degrees of the scale (see Section 2.3), we believe that the performance of the Cadences key-profiles can also be improved by introducing weights, specifically by assigning different weights to the tonic, third and fifth degrees in the triads used to build the Cadences profiles presented in Section 3.2. To test this hypothesis we ran a second experiment, which is described in Section 6.
By selecting the set of weights with the highest performance on the datasets, this model selection methodology suffers from the problem of "peeking" introduced in the preceding paragraph. In other words, we will find a single best set of weights on the dataset, but the performance of this model on the dataset will not be a realistic measure of the model on unseen data. We will refer to this methodology as the Best-Weights method.

An alternative methodology is to use a process called 10-fold cross-validation, following the methodology of [11]. The focus of a 10-fold cross-validation process is to estimate the average performance of the models generated by a machine learning algorithm and hyper-parameter set³ on unseen data. In a 10-fold cross-validation, a dataset is split into 10 equally sized subsets, or folds. Then 10 experiments are run (one per fold). In each experiment, 9/10 of the data is used to train a model and the remaining 1/10 (the fold) is used to evaluate the model performance. The overall performance of the algorithm and hyper-parameters is then calculated by aggregating the confusion matrices generated by each of the 10 experiments and calculating a performance metric on the resulting matrix. The advantage of a cross-validation methodology is that in each of the 10 experiments distinct data is used to train and test the model, so the final overall accuracy score is representative of the likely performance of a model trained with the tested algorithm on unseen data.

In our context, training a model involves selecting the set of weights that performs best on a dataset. So, to utilize a cross-validation methodology to evaluate the weighted cadences approach, we need to run a grid-search process in each of the 10 experiments⁴ and select the best set of weights for that experiment based on the performance on the 9/10 training portion of the data, and then evaluate the performance of these best weights on the (unseen) remaining 1/10 of the data. The advantage of the cross-validation methodology is that it provides us with an estimate of the likely performance of weighted cadences key-profiles on unseen data. This is because in each of the 10 weight-tuning and model-evaluation experiments (one experiment per fold), distinct sets of tunes are used for weight tuning and evaluation. Consequently, the accuracy scores returned from this process reflect the accuracies of models on unseen tunes (i.e. tunes that were not seen during weight tuning). The drawback of this approach, however, is that the weight-tuning process in each of these experiments may return different sets of weights. So, although the overall accuracy score provides an indication of the likely performance of a set of weighted cadences key-profiles on unseen data, it does not provide an accuracy measure for cadences key-profiles with a specific (fixed) set of weights.

As a result, we decided to apply both methodologies to the weighted cadences key-profiles. First we applied a cross-validation process to estimate the performance of weighted cadences key-profiles on unseen data. This first step also serves to determine the best grid size to use. We then applied the Best-Weights method introduced above, with the chosen grid size, to find the best single set of weights on our dataset.

³ Hyper-parameters are parameters of a machine learning algorithm, as distinct from parameters of a model.
⁴ As opposed to the single grid-search process that would be used in the Best-Weights method.

6.3 Results

The only hyper-parameter in this experiment is g, the width of the grid. A wide range allows a better fit on the training data, but poses a risk of overfitting it, resulting in poor performance on the test sets. The experiment was performed for g ranging from 2 to 10. The grid size g = 3, allowing the weights wi to take values in [1, 2, 3], gave the best results, and is used for the following results.

              FSaudio  FSsymb  GLaudio  GLsymb
Cadences(W)    0.891    0.879   0.702    0.769
Cadences*(W)   0.908    0.840   0.720    0.745

Table 2: MIREX scores computed from the aggregate matrices after cross-validation on the 4 corpora

Scores calculated from the aggregate matrices after the cross-validation on each of the four datasets are presented in Table 2. The models Cadences*(W) outperform all other methods on the two audio corpora. However, the Lerdahl* key-profiles evaluated in Experiment 1 remain the best performing ones on the symbolic data. Consequently, the rest of this section focuses on Cadences*(W) on the audio datasets.

The result of the cross-validation method suggests that the models Cadences*(W) generalize well to unseen audio data. In order to obtain one single model (as opposed to the multiple ones resulting from the 10 folds), the Best-Weights method was then performed on the combined dataset (FS + GL)audio. Grouping the two collections of audio recordings means that the profiles should perform well on both heterophonic and monophonic recordings. The weights obtained are (3, 1, 2), corresponding to the intuition that the tonic and fifth are more important than the third, as in Section 2.3. The resulting key-profiles are:

pCadences*(3,1,2)(C) = [8, 0, 2, 0, 2, 3, 0, 7, 0, 1, 1, 1]
pCadences*(3,1,2)(c) = [14, 0, 4, 4, 0, 7, 0, 11, 1, 1, 7, 0]

With these profiles, the MIREX scores are 0.901 on FSaudio and 0.730 on GLaudio, to be compared with the scores in Table 1. The lower score on FS is not unexpected: the grid search maximizes the overall score across the combined audio collection, regardless of the scores on the individual collections. The overall MIREX score on the combined collection is 0.819, compared to 0.811 with the non-parametric Cadences* profiles.

The confusion matrices for these new key-profiles on the audio collections, and for the Lerdahl* ones on the symbolic datasets (on which they are still the highest scoring ones), are given in Tables 3 to 6. Rows indicate the actual keys in the ground-truth annotations, while columns indicate estimated keys. Keys that never occur in either the ground truth or the estimations are omitted.
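The two-level procedure described above (a grid search over weight triples inside each fold of a 10-fold cross-validation) can be sketched as follows. The dataset representation, the `evaluate` callback and the fold construction are illustrative stand-ins rather than the authors' code.

```python
import itertools
import random

def weight_grid(g):
    """All (tonic, third, fifth) weight triples with values in 1..g."""
    return list(itertools.product(range(1, g + 1), repeat=3))

def cross_validate(tunes, evaluate, g=3, n_folds=10, seed=0):
    """Estimate unseen-data performance of grid-searched weights.

    tunes: a list of items, e.g. (histogram, true_key) pairs.
    evaluate(weights, subset) -> score of the weighted profiles on subset.
    Returns the mean score over the held-out folds.
    """
    shuffled = list(tunes)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        # Inner grid search: tune the weights on the 9/10 training
        # portion only, so the held-out fold is never "peeked" at.
        best = max(weight_grid(g), key=lambda w: evaluate(w, train))
        scores.append(evaluate(best, folds[i]))
    return sum(scores) / n_folds
```

The Best-Weights method, by contrast, is a single `max(weight_grid(g), key=...)` call over the full dataset, which is why its score is optimistic for unseen data.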
       D    G   a  e  A  b  C  d
  D  149    1   0  0  0  0  0  0
  G    4  119   0  0  0  0  0  0
  a    5    9   7  0  0  0  1  1
  e    8    3   1  3  0  0  0  0
  A    2    0   0  0  9  0  0  0
  b    2    0   0  0  0  1  0  0
  C    0    0   0  0  0  0  1  0

  Correct 289, Fifth 6, Relative 6, Parallel 0, Neighbour 17, Other 8

Table 3: Confusion matrix of the Cadences*(3,1,2) profiles on FSaudio

       D    G   a   e  A  b  C  f#  B
  D  131    3   1   0  0 14  0   1  0
  G    0  113   1   4  0  1  0   0  0
  a    0    7  11   1  4  0  0   0  0
  e    1    1   0  11  0  2  0   0  0
  A    0    0   0   0 11  0  0   0  0
  b    0    0   0   0  0  2  0   0  1
  C    0    0   0   0  0  0  1   0  0

  Correct 280, Fifth 0, Relative 19, Parallel 5, Neighbour 9, Other 9

Table 4: Confusion matrix of the Lerdahl* profiles on FSsymb

      D   G   a   e  A  C  d  b  g  F
  D  82   4   6   1  1  0  4  0  0  0
  G   6  71   4   1  0  0  0  0  0  0
  a   5   9  17   0  1  3  2  0  0  0
  e  10   6   5  13  2  0  0  0  0  0
  A   7   0   2   0 11  0  1  0  0  0
  C   0   1   2   0  0  5  1  0  0  0
  d   1   0   2   0  0  2  3  0  0  0
  b   3   0   0   1  0  0  0  0  0  0
  g   0   0   0   0  0  2  0  0  2  0
  F   0   0   0   0  0  0  0  0  0  1

  Correct 205, Fifth 16, Relative 15, Parallel 8, Neighbour 27, Other 29

Table 5: Confusion matrix of the Cadences*(3,1,2) profiles on GLaudio

      D   G   a   e  A  C  d  b  g  F  f#
  D  85   1   5   1  0  0  0  6  0  0  0
  G   2  71   1   7  0  0  0  1  0  0  0
  a   1  13  18   3  1  1  0  0  0  0  0
  e   3   1   0  26  1  0  0  5  0  0  0
  A   4   0   1   0 14  0  0  0  0  0  2
  C   0   1   4   0  0  4  0  0  0  0  0
  d   0   0   1   0  0  1  6  0  0  0  0
  b   0   0   0   0  0  0  0  4  0  0  0
  g   0   0   0   0  0  0  0  0  3  1  0
  F   0   0   0   0  0  0  1  0  0  0  0

  Correct 231, Fifth 3, Relative 22, Parallel 2, Neighbour 20, Other 22

Table 6: Confusion matrix of the Lerdahl* profiles on GLsymb
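The MIREX score summarizing such a confusion matrix is a weighted accuracy. The standard MIREX weighting (correct 1, fifth errors 0.5, relative errors 0.3, parallel errors 0.2, everything else 0) is an assumption on our part, but it reproduces the scores reported in this paper when applied to the summary counts above.

```python
def mirex_score(correct, fifth, relative, parallel, other):
    """Weighted MIREX key-detection score.

    `other` covers every remaining error type, including the
    neighbour errors discussed in the text, which MIREX does
    not reward.
    """
    total = correct + fifth + relative + parallel + other
    return (correct + 0.5 * fifth + 0.3 * relative + 0.2 * parallel) / total

# The summary counts of the first confusion matrix (neighbour and
# other errors pooled into `other`) reproduce the 0.901 reported
# for FS audio.
score = mirex_score(289, 6, 6, 0, 17 + 8)
```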
The three types of errors taken into account in the MIREX evaluation metrics are highlighted in different shades of blue. Another error occurs frequently in this experiment, named "neighbour". The neighbour relationship is defined as follows:

two keys are neighbours if one is major, the other minor, and the minor one has a tonic one tone above the tonic of the major one.

To take a concrete example, D and e are neighbour keys. In terms of modes, the scales of D Mixolydian and of e Aeolian contain the exact same pitch classes. It is not rare in Irish music that a tune labelled as e minor changes its tonic center for a few bars to the neighbour key D, e.g. the well-known Cooley's reel. On both audio datasets, this type of error is the most common. As such, and although the MIREX evaluation metrics do not take these errors into account, reporting them is relevant.

Relative keys are the next most common errors, on both audio and symbolic datasets. The scales of two relative keys contain, as is the case with neighbour keys, the same pitch classes, if one considers the Aeolian mode. Changes of tonic center in a tune between its key and the relative key are also quite common in Irish music. The high frequencies of these two types of errors can be explained by the specific characteristics of Irish traditional music.

Table 7 gives the percentages of correctly inferred keys per mode. A clear difference in performance appears between the major and minor keys, suggesting that minor keys are harder to detect than major keys.

         FSaudio  FSsymb  GLaudio  GLsymb
Major     97.5%    91.1%   80.6%    82.5%
Minor     26.8%    58.5%   39.3%    64.0%

Table 7: Proportions of correct inference per mode

7. CONCLUSION AND FUTURE WORK

In this paper, a range of existing key detection algorithms was tested on datasets of audio and symbolic Irish music. Modifications of these models, and a new set of key-profiles, were introduced, which improved the performance on both types of representation. Error analysis showed that the most common confusions were between relative and neighbour keys, which reflects some specific characteristics of Irish music.

As stated in [9], it is sometimes difficult to pinpoint the key of an Irish tune. The key annotations on the datasets were made by the first author, and it is possible that other annotators could annotate some of the tunes differently. A way to quantify these ambiguities will be to gather annotations from other experienced musicians in order to obtain a Cohen's kappa coefficient. It is expected that most errors made by the key-matching algorithms will be on tunes where annotators disagree.

8. REFERENCES

[1] D. Temperley, The Cognition of Basic Musical Structures. Cambridge, Mass: MIT Press, 2001.

[2] E. Gómez, "Tonal Description of Polyphonic Audio for Music Content Processing," INFORMS J. on Computing, vol. 18, no. 3, pp. 294-304, Jan. 2006.

[3] S. T. Madsen, G. Widmer, and J. Kepler, "Key-Finding with Interval Profiles," in ICMC, 2007.

[4] K. Noland and M. B. Sandler, "Key Estimation Using a Hidden Markov Model," in Proceedings of the 7th International Society for Music Information Retrieval Conference, 2006, pp. 121-126.

[5] C. L. Krumhansl, Cognitive Foundations of Musical Pitch, ser. Oxford Psychology Series. Oxford, New York: Oxford University Press, 1990.

[6] F. Lerdahl, "Tonal Pitch Space," Music Perception: An Interdisciplinary Journal, vol. 5, no. 3, pp. 315-349, Apr. 1988.

[7] M. Leman, Music and Schema Theory, ser. Springer Series in Information Sciences, T. S. Huang, T. Kohonen, M. R. Schroeder, and H. K. V. Lotsch, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1995, vol. 31, DOI: 10.1007/978-3-642-85213-8.

[8] B. Duggan, "Machine annotation of traditional Irish dance music," Ph.D. dissertation, Dublin Institute of Technology, 2009.

[9] F. Vallely, Ed., The Companion to Irish Traditional Music, 2nd ed. Cork: Cork University Press, 2011.

[10] F. Korzeniowski and G. Widmer, "Feature learning for chord recognition: the deep chroma extractor," in Proceedings of the 17th International Society for Music Information Retrieval Conference, New York, USA, 2016.

[11] J. D. Kelleher, B. Mac Namee, and A. D'Arcy, Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. Cambridge, Massachusetts: The MIT Press, 2015.
Jeffrey M. Morris
Texas A&M University
morris@tamu.edu
normalize to the level of its video noise and become frenetic. This functioned nicely to gain the attention of people in nearby spaces; they knew something was happening over there without having to be there. Then, when they entered the camera's view and the camera normalized to their movement, the energetic burst followed by silence had the effect of a shy performer who didn't realize he had an audience.

This remains the most effective presentation scenario for the software alone, without user controls.

5. SIDE LOBBY WITH DIGITAL PLAYER PIANO

The Society for Electroacoustic Music in the United States (SEAMUS) 2015 conference was held in the Moss Arts Center at Virginia Tech (Virginia Polytechnic Institute and State University). Ferin was placed in the atrium outside one of the concert venues [6]. This was a quiet, low-traffic area, so I felt a need to restrict its maximum activity so it wouldn't be too frenetic for the peaceful space. It proved to be a favorite lounge for some participants to relax in. Since it was at a music conference, the addition of the Yamaha Disklavier digital player piano made engagement very rewarding for those who ventured to duet with the software. Besides the rewarding effect of hearing the software mirror and embellish one's own playing, the struggle over the keyboard is an especially exciting form of engagement. The visitor, not knowing when or what the software will play, is drawn into a hyper-alert improvisation mindset involving intense critical listening as well as continually adjusted plans for when and what to play next. At times, visitors have noted that the software played notes just as the visitor was reaching for them (simultaneously frustrating and gratifying); at other times, the software would engage in parallel accompaniment or musical dialogue with the visitor at the opposite end of the keyboard's range.

Even for viewers who didn't play the piano, the effect of watching a human interface (the keyboard) play automatically can evoke the image of a ghost playing the keyboard. They also got the thrill of watching the piano-playing visitors essentially become part of the artwork.

In order to more fully engage with the site, I incorporated another performing algorithm of mine that captured sounds from the environment, mixed them with the piano sound, and folded them together into rich contrapuntal textures. I originally intended to use plant and tree motion to influence Ferin, but there was a surprising dearth of wind at the time. As the demonstration video shows, this results in a very different character during the night compared to the day. It is busy at night because of increased wildlife activity as well as the software normalizing to the low light levels, in addition to the camera's infrared illuminators changing the range of things that can be seen. During the day, it played mildly as visitors walked into the area, exposed to the bright sun, but tended to stare back as visitors stopped to ponder the sculpture. It did tend to become more responsive to subtle motions of the visitors, like shifting their weight to one leg, repositioning arms, or turning heads.

For video documentation of this exhibition, see [8].

7. IMPROVISATION WITH SAXOPHONE

In a Disklavier-themed concert at Texas A&M University titled Hot-Wired Piano, I performed with Ferin and colleague Jayson Beaster-Jones playing tenor saxophone. It was amusing to reflect on whether this was a duo or a trio. Since I was there interacting with the software but not making any sound directly, this trio consisted of only two humans and two instruments, but one of the instruments wasn't being played by either of the humans. In a sense, the saxophone's sound was also (at times) influencing the software, because I replaced the keyboard input with pitch recognition.

At first, having simply replaced the keyboard input with pitch recognition, I quickly discovered that the pitch detection was so accurate that the algorithm would too often play in unison with the saxophonist. While this was not a musical goal, and it wasn't conducive to building counterpoint, it also starkly highlighted expressive differences in intonation between the saxophone and the piano, which yielded undesirable sonic effects. In response, I added a gradually changing pitch offset between the pitch recogni-
tion and the note-triggering parts of the algorithm, and I
6. OUTDOORS WITH MULTIPLE CAMERAS AND created a key control that would momentarily unmute the
SPEAKERS, WITHOUT DISPLAY audio input, so that I could have the algorithm listen to the
saxophonist only at strategic moments.
I-Park is a sculpture garden in East Haddam, Connecticut, With this unmute control in addition to the camera in-
focusing on ephemeral artworks engaging with nature. [7] put, I functioned more like a conductor, musically flowing
For this exhibition, I cleared a circular path in an undevel- my hands in front of the camera to shape Ferins musical
oped part of the land. Fortuitous developments with other flow. This type and range of motion felt most comfortable
works on the property led me to install a sculpture built to my human physiology. It did loosely allow me to draw
from parts of a grand piano and two upright pianos that on my rudimentary conducting skills, even though the ges-
had been left outdoors for years. I arranged them to re- tures bore no particular meaning to the software. Rather,
semble the carcass of a large mysterious beast and placed my conducting experience informed my movements as a
speakers in footlights surrounding it, to make it look like limited kind of dance: patterns that allowed me to sustain
the overgrown site of a forgotten spectacle. Because of the certain levels of continuous motion when desired and pat-
circular path, I embedded four cameras in the sculpture, terns to organize space to make room for quick, disjunct
in order to capture motion in all directions from a suitable gestures.
distance. I also use opportunity to develop a weather-proof For the audience, this experience charged the confusion
housing for the computer and other electronics. and (hopefully) curiosity about which sounds and motions
SMC2017-94
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
were causing which musical results, as well as tracing the life of some events as they are fed back in loops, changing between sounds and movements, influencing and sometimes echoing or elaborating on each other.

8. COMPUTER-ASSISTED COMPOSITION

Ferin has also been used successfully in computer-assisted composition [9]. In most cases, this workflow has looked much like a duo version of the live performance scenario, with me conducting in front of the camera and playing occasional brief motives on the keyboard to influence the software in desired directions. The latter has especially been useful in building large-scale form, since as of yet, the software has no long-term memory. Left alone, motives only stay active in the music to the extent they happen to be captured in and played from the rolling buffer.
A computer-aided composition work session might typically involve a few takes like this and then turn to the editing stage. Given the software's capricious rubato playing, automated transcription from MIDI to music notation is not straightforward. It is helpful to create a MIDI reference track in a MIDI sequencer to accompany the recorded performance, by tapping out the basic metronome pulse where it should occur in the music, following the tempo as it changes. However, when a sequencer reinterprets the performance by mapping it to the MIDI reference track, all tempo information is lost from the resulting music notation, and needs to be manually interpreted into standard terms such as allegro and ritardando. Similar work is necessary for its rapid expressive dynamics as well.
This juncture, buried in tedium and practicality, is where Ferin Martino the composer is separated from Ferin Martino the performer. Musical nuances from Ferin's performance are stripped away, leaving room for interpretation by the next (human) performer. As some of the aura (after Walter Benjamin [10]) of Ferin's performance is stripped away, the composition can gain new aura through its capacity for many variations in interpretations by other performers (cf. Eco [2]).
This algorithm's output is surprisingly pianistic to the ear, especially for its relative simplicity (one page of code) and emergent nature, and curiously, it does not sound characteristic, or even very good, using any other instrument's sound, even other keyboard instruments. One aspect contributing to its pianistic character must be its anthropomorphic design: with trichords played by a virtual left hand with five fingers, melodies played by a virtual right hand, pitch intervals and inter-onset (time) intervals kept within human scale, and drunk-walk pseudorandom number generators to keep passages from being too disjunct. However, these design factors would make it just as good a fit for any keyboard instrument, including a harpsichord, organ, and electric piano, so there must be more to its good fit for the piano's sound.
It seems that the algorithm's output has been tuned to the modern pianoforte's timbre (the spectrum and amplitude envelope) inadvertently, to the exclusion of success with other timbres. Ferin Martino's playing is often dense in pitch and in time. The harpsichord's bright spectrum makes such passages sound too harsh, and temporal density only allows the attack transients to come through while masking the more attractive and interesting decay of each note. In contrast, the electric piano's darker tone prevents individual notes from being discerned, often resulting in muddy sound masses. Organ tones yield similar results because of their mostly flat amplitude envelopes, keeping note onsets from standing out, masked by the sustain of previous notes or with attention distracted by abrupt cutoffs; of course, organs are capable of quite lyrical legato playing, but Ferin's emergent proclivity toward compound melodies (a la Hindemith) is made distractingly disjunct with organ-like amplitude envelopes. Experiments with mallet keyboard instrument sounds were more pleasing in regard to these factors, but they lose Ferin's frequent expressive use of the damper pedal (and vibraphone suffers the same as the electric piano), besides the fact that it loses touch with human-scale playability on mallet instruments.
This highlights an important lesson in orchestration: with a long-established discrete pitch system, scales, a canon comprising music that heavily stresses melodies, chords, and human-scale rhythms, and an abundance of theoretical analysis focusing on these parameters and not others, it is easy to imagine most instruments, especially the piano, as neutral, general-purpose music generators, when really, each instrument's idiosyncrasies shape the musical decisions of the composer, even if unbeknownst to the composer. Since Ferin Martino has no capacity for evaluating its own output, especially with regard to the spectral and temporal features described above, this is a crucial stage in the development of a performer algorithm. This relates in spirit to William Sethares's work in tuning systems, in which he demonstrates that consonance and dissonance are closely tied to timbre and tuning system more than Fortean interval class vectors [11].
To finish the description of the transcription process: Ferin's dense playing usually ends up yielding sheet music requiring advanced piano skill to play. This makes it very amenable to treating Ferin's printout as a piano score to be orchestrated for small chamber ensembles, such as piano duo, piano trio (piano, violin, and cello), and string quartet. The manual orchestration process allows me (as the supervising composer, if not the one executing each note choice) to apply the considerations in the preceding paragraphs in order to map the musical content to fit each instrument as idiomatically and effectively as possible. The fact that Ferin's playing sounds more naturally pianistic than it looks in sheet music is also telling; it suggests that the character of the piano-human combination has its own idiosyncrasies, a subset of the full characteristic possibilities of the piano itself (including other playing scenarios).
One such piece was premiered by the Apollo Chamber Players, a string quartet based in Houston, Texas. The main material was created in a single take with Ferin. I was sculpting Ferin's behavior by moving my hands in front of the camera and occasionally introducing motives of a single repeated pitch via the piano keyboard. This created a dynamic in the piece of striving for stasis and falling away from it, as I would play repeated notes at a few certain
junctures, and Ferin would copy parts of that motive and carry it off in sequences. After the initial performance, I distributed the voices for a string quartet. To build a more coherent form (without trying to cover up Ferin's capricious nature), I added a brief introduction comprising three slow cascading iterations of the repeated-pitch motive, and I inserted a partial recapitulation before the final sequence: at a point where the music seems stuck in a loop, I inserted a break, then a sequence of truncated clips from previous moments, in the order they happened, but always interrupted. This section is played sul ponticello to set it off as a copy of previous events, as if retracing one's steps before continuing. This recapitulation leads to the repeated motive where the original music was stopped, and it continues from that point in full voice, going on to the end. I gave it the title The Garden of Forking Paths to reflect on the many musical sequences Ferin tests before moving on, as well as its ultimately through-composed winding path.

9. EXPERIMENTS WITH DANCE

Given these experiences with tuning the software to respond to human movement, it was natural to consider adapting this work for dance. However, these experiments quickly highlighted differences between a camera with video analysis and the human eye with human cognition. As dancers explored the full stage space, they would range from being so close to the camera that most of their bodies could not be seen to being so far at the end of the stage that their whole bodies only occupied a small portion of the camera's view. A camera operator would naturally want to zoom in and out and pan around to keep the dancer's full body in the video frame. However, this revealed that moving the camera even slightly can yield a maximal amount of change in the video signal even when there is no movement on stage. Seeing the need for more development for this kind of performance, this exploratory performance was put on hold.
The problem of dancers' distance might be addressed by an automatic spatial normalizing process that crops out any space that contains no movement, leaving a tight-framed rectangle inside to evaluate. However, this will require many more decisions, thereby shaping the character of the work. Additionally, the quick pace of dancers' movement might be outside the software's scope of responsiveness that was effective for casual visitors. Making it more responsive may result in a less interesting one-to-one kind of interaction. The notion of camera operator as performer is an intriguing one worthy of future investigation, but it too will require many character-defining decisions. This line of inquiry may need to use a completely different performing algorithm from Ferin, but informed by these experiences.

10. EXPERIMENTS WITH PAINTING

Working with a painter for visual input has avoided the challenges described with dancers, since a painter is largely engaged with a two-dimensional vertical plane, like the camera and screen. April Zanne Johnson [12] already uses music and sound as a stimulus to influence her mark-making vocabulary in her drawing and painting practice. She also sometimes thinks kinetically about the act of painting: making painting gestures in the air before and after touching the brush to the canvas in a process of contemplation, winding up, and following through. Both of these facts have made collaborations with Ferin Martino very fruitful. This creates a fascinating multi-modal creative feedback loop as she paints in response to the music and her motion while painting in turn shapes the music. Ferin's capacity for endless through-composed playing also suits this scenario well. Together, they produce an experience of painting and composition as a realtime process laid out across time, besides the final artifacts that result from the process.

11. OTHER FUTURE WORK

As highlighted throughout the discussion above, there are many areas where future development is possible. Further work will investigate possibilities for building larger formal structures over time, including buffers on medium and long time scales, as well as memory cues as in Butch Morris's Conduction technique for conducting ensembles of improvisers. The software could also generate its own MIDI reference track for beat mapping during transcription. Further software routines could be created to automatically interpret tempo and dynamics changes into standard words and symbols. Synthetic timbres can be explored, sharing the key features of the piano and avoiding the problematic features described above. Also in this vein, the software can be tuned to suit other instruments. This would be an especially interesting inquiry for thinking about composition, since it turns out that many of its most interesting melodic lines incorporate both virtual hands to build compound melodies or sequences. Finally, the software could be taken to another level of sophistication and self-governance if it could monitor the sonic features of its own output and adjust its note decisions accordingly.
However, as these opportunities are opened up above, the discussion also highlights distinctive features of the character of this algorithm that would be significantly changed, perhaps to become less intriguing or idiosyncratic, or at least becoming something that should be considered a different process altogether. These advances will be considered carefully with this potential sacrifice in mind. It will more likely spawn a number of different performer agents rather than numerous upgrades to this one.

12. CONCLUSIONS

It is delightfully gratifying that something originally meant to be an afternoon project to demonstrate a Disklavier digital player piano, with a modicum of interactivity, for a tour of university dignitaries would provide so rich an output, become so fascinating a collaborator, open up deep questions of the nature of composition and performance, and reach audiences in Milan, Athens, and in the United States, Texas, the Southeast, the Mid-Atlantic, and New England. These experiences show that simple, perhaps even simplistic solutions can reach new levels of elegance and open
up new dimensions to contemplate, as with Alexander's solution to the Gordian knot and with Cantor's diagonal. In the end, an impatient attempt to simply make music has revealed myriad questions that are answered in that process, whether the composer takes conscious responsibility for them, leaves them to tacit intuition, or lets the governing technology fill in those answers according to what comes most naturally to it, for better or worse.

13. EXAMPLES

Video, audio, and musical scores resulting from this work can be found at [13].

Acknowledgments

Thanks to saxophonist Jayson Beaster Jones and painter April Zanne Johnson for joining me in developing these productions, and to the I-Park Foundation and the Atlantic Center for the Arts for sponsoring this work.

14. REFERENCES

[1] D. Cope, Computer Models of Musical Creativity. Cambridge, MA, USA: MIT Press, 2005.

[2] U. Eco, The Open Work. Cambridge, MA, USA: Harvard University Press, 1962/1989.

[3] J. M. Morris, "Ferin Martino: A small piano algorithm and its lessons on creativity, interaction, and expression," in 3rd International Workshop on Musical Metacreation (MUME 2014). Artificial Intelligence and Interactive Digital Entertainment, 2014, pp. 40-44.

[4] Triennale di Milano, "History and mission," September 2016, http://www.triennale.org/en/chi-siamo/storia-e-mission/.

[5] A. S. Onassis Public Benefit Foundation, "Onassis Cultural Center Athens," December 2015, http://www.onassis.org/en/cultural-center.php.

[6] J. Morris, "Ferin Martino demo," Video, September 2015, https://www.youtube.com/watch?v=iaDjezcrzHA.

[7] I-Park Foundation, "I-Park international artist-in-residence program," March 2003, http://www.i-park.org/.

[8] J. Morris, "Where Ferin was," November 2015, http://www.morrismusic.org/2015/where-ferin-was.

[9] J. M. Morris, "Collaborating with machines: Hybrid performances allow a different perspective on generative art," in Proceedings of the Generative Art International Conference. Polytechnic University of Milan, Italy, 2013, pp. 114-125.

[10] W. Benjamin, Illuminations: Essays and Reflections. Berlin, Germany: Schocken Books, 1936/1969, ch. The Work of Art in the Age of Mechanical Reproduction, pp. 217-252.

[11] W. A. Sethares, Tuning, Timbre, Spectrum, Scale, 2nd ed. New York, NY, USA: Springer, 2004.

[12] "Neurologically produced synesthetic forms and color: Studio visit with April Zanne Johnson," http://artmazemag.com/april-zanne-johnson/.

[13] J. Morris, "Ferin Martino," July 2014, http://www.morrismusic.org/ferinmartino.
(Figure 1 panels: pitch (Hz) over time (centiseconds) for Melodies 1-4.)
Figure 1. Examples of pitch features of selected melodies, extracted through autocorrelation.

the generalisability of such a method is harder to evaluate than those of the symbolic methods.

1.4 Method for similarity finding

While some of these methods use matrix similarity computation [14], others use edit distance-based metrics [15] and string matching methods [16]. Extraction of sound signals to symbolic data that can then be processed in any of these ways is yet another method to analyse melodic contour. This paper focuses on evaluating melodic contour features through comparison with motion contours, as opposed to comparison with other melodic phrases. This would shed light on whether the perception of contour as a feature is even consistent and measurable, or whether we need other types of features to capture contour perception.
Yet another question is how to evaluate contours and their behaviours when dealing with data such as motion responses

(Figure panels: left-hand and right-hand displacement (mm) plots for Participant 1.)

2. BACKGROUND

2.1 Pitch Height and Melodic Contour

This paper is concerned with melody, that is, sequences of pitches, and how people trace melodies with their hands. Pitch appears to be a musical feature that people easily relate to when tracing sounds, even when the timbre of the sound changes independently of the pitch [17-19]. Melodic contour has been studied in terms of symbolic pitch [20, 21]. Eitan explores the multimodal associations of pitch height and verticality in his papers [22, 23]. Our subjective experience of melodic contours in cross-cultural contexts is also investigated in Eerola's paper [24].
The ups and downs in melody have often been compared to other multimodal features that also seem to have up-down contours, such as words that signify verticality. This attribution of pitch to verticality has also been used as a feature in many visualization algorithms. In this paper, we focus particularly on the vertical movement in the tracings of participants, to investigate if there is, indeed, a relationship with the vertical contours of the melodies. We also want to see if this relationship can be extracted through features that have been explored to represent melodic contour. If the features proposed for melodic contours are not enough, we wish to investigate other methods that can be used to represent a common feature vector between melody and motion in the vertical axis. All 4 melodies in the small dataset that we created for the purposes of this experiment are represented as pitch in Figure 1.
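The pitch curves in Figure 1 come from autocorrelation analysis of the audio. As a minimal illustration of that idea (a sketch under our own assumptions, not the extraction code used for the figure), a frame-wise autocorrelation pitch tracker might look like this; the frame size, hop size, and pitch search range are arbitrary choices:

```python
import numpy as np

def autocorr_pitch(signal, sr, frame=2048, hop=512, fmin=80.0, fmax=1000.0):
    """Estimate a pitch contour (in Hz) by picking, for each frame, the
    strongest autocorrelation lag inside the range [sr/fmax, sr/fmin]."""
    lo, hi = int(sr / fmax), int(sr / fmin)
    contour = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame]
        x = x - x.mean()                                  # remove DC offset
        ac = np.correlate(x, x, mode="full")[frame - 1:]  # lags 0..frame-1
        lag = lo + int(np.argmax(ac[lo:hi]))              # best lag in range
        contour.append(sr / lag if ac[lag] > 0 else 0.0)  # 0 Hz marks unvoiced
    return np.array(contour)

# A 220 Hz sine yields a contour close to 220 Hz in every frame.
sr = 8000
t = np.arange(sr) / sr
contour = autocorr_pitch(np.sin(2 * np.pi * 220 * t), sr)
```

The lag search is restricted to a plausible vocal range so that the trivial peak at lag 0 is never chosen; the resolution of such a tracker is limited to integer sample lags, which is sufficient for contour-level comparisons.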
2.2 Categories of contour feature descriptors

In the following paragraphs, we will describe how the feature sets selected for comparison in this study are computed. The feature sets that come from symbolic notation analysis are revised to compute the same features from the pitch-extracted profiles of the melodic contours.

2.2.1 Feature 1: Sets of signed pitch movement direction

These features are described in [6], and involve a description of the points in the melody where the pitch ascends or descends. This method is applied by calculating the first derivatives of the pitch contours, and assigning a change of sign whenever the spike in the velocity is greater than or less than the standard deviation of the velocity. This helps us come up with the transitions that are more important to the melody, as opposed to movement that stems from vibratos, for example.

2.2.2 Feature 2: Initial, Final, High, Low features

Adams and Morris [7, 8] propose models of melodic contour typologies and melodic contour description models that rely on encoding melodic features using these descriptors, creating a feature vector of those descriptors. For this study, we use the feature set containing the initial, final, high, and low points of the melodic and motion contours, computed directly from normalized contours.

2.2.3 Feature 3: Relative interval encoding

In these sets of features, for example as proposed by Friedman, Quinn, and Parsons [6, 9, 14], the relative pitch distances are encoded either as a series of ups and downs, combined with features such as operators (<, =, >), or as distances of relative pitches in terms of numbers. Each of these methods employs a different strategy to label the high and low

3. EXPERIMENT DESCRIPTION

The experiment was designed so that subjects were instructed to perform hand movements as if they were creating the melodic fragments that they heard. The idea was that they would shape the sound with their hands in physical space. As such, this type of free-hand sound-tracing task is quite different from some sound-tracing experiments using pen on paper or on a digital tablet. Participants in a free-hand tracing situation would be less fixated upon the precise locations of all of their previous movements, thus giving us an insight into the perceptually salient properties of the melodies that they choose to represent.

3.1 Stimuli

We selected 16 melodic fragments from four genres of music that use vocalisations without words:

1. Scat singing
2. Western classical vocalise
3. Sami joik
4. North Indian music

The melodic fragments were taken from real recordings, containing complete phrases. This retained the melodies in the form that they were sung and heard in, thus preserving their ecological quality. The choice of vocal melodies was both to eliminate the effect of words on the perception of music, and to eliminate the possibility of imitating the sound-producing actions on instruments (air-instrument performance) [25].
There was a pause before and after each phrase. The phrases were an average of 4.5 seconds in duration (s.d. 1.5 s). These samples were presented in two conditions: (1) the real recording, and (2) a re-synthesis through a sawtooth wave from an autocorrelation analysis of the pitch profile. There was thus a total of 32 stimuli per participant.
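To make the descriptor families above concrete, here is a small sketch of Feature 1 (signed movement directions thresholded by the velocity's standard deviation) and an up/down relative encoding in the spirit of Feature 3. This is our own illustrative reading of the descriptions, not the study's actual feature-extraction code:

```python
import numpy as np

def signed_directions(contour):
    """Feature 1 sketch: keep only movements whose velocity exceeds one
    standard deviation, encoded as +1 (ascent) or -1 (descent); smaller
    wiggles (e.g. vibrato) are discarded."""
    v = np.diff(contour)          # first derivative of the pitch contour
    big = np.abs(v) > v.std()     # threshold at one standard deviation
    return [int(s) for s in np.sign(v[big])]

def relative_encoding(contour):
    """Feature 3 sketch: encode each successive pitch pair with one of
    the operators <, =, > (previous pitch versus next pitch)."""
    return ["<" if d > 0 else ">" if d < 0 else "=" for d in np.diff(contour)]

pitches = [200.0, 200.0, 250.0, 240.0, 300.0, 300.0]
dirs = signed_directions(pitches)      # [1, 1]: only the two large ascents
symbols = relative_encoding(pitches)   # ['=', '<', '>', '<', '=']
```

The same two functions apply unchanged to a vertical motion contour sampled from motion capture, which is what makes a direct melody-to-motion comparison of these feature vectors possible.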
3.2 Participants

A total of 32 participants (17 female, 15 male) were recruited to move to the melodic stimuli in our motion capture lab. The mean age of the participants was 31 years (SD = 9). The participants were recruited from the University of Oslo, and included students and employees who were not necessarily from a musical background.
The study was reported to, and obtained ethical approval from, the Norwegian Centre for Research Data. The participants signed consent forms and were free to withdraw during the experiment if they wished.
(Figure 6 panels: mean motion of RHZ and mean segmentation bins of pitches for Melodies 1-4; axes: displacement in mm and pitch movement in Hz.)
Figure 6. Plots of the representation of Features 1 and 3. These features are compared to analyse the similarity of the contours.
5.2 Confusion between tracing and target melody

As seen in the confusion matrix in Figure 7, the tracings are not clearly classified as target melodies by direct comparison of the contour values themselves. This indicates that although the feature vectors might show a strong trend in vertical motion mapping to pitch contours, this is not enough for significant classification. This demonstrates the need for feature vectors that adequately describe what is going on in music and motion.

6. DISCUSSION

A significant problem when analysing melodies through symbolic data is that a lot of the representation of texture, as explained regarding Melody 2, gets lost. Vibratos, ornaments, and other elements that might be significant for the perception of musical motion cannot be captured efficiently through these methods. However, these ornaments certainly seem salient for people's bodily responses. Further work needs to be carried out to explain the relationship of ornaments and motion, and this relationship might have little or nothing to do with vertical motion.
We also found that the performance of a tracing is fairly intuitive to the eye. The decisions for choosing particular methods of expressing the music through motion do not appear odd when seen from a human perspective, and yet characterizing what the significant features are for this cross-modal comparison is a much harder question.
Our results show that vertical motion seems to correlate with pitch contours in a variety of ways, but most significantly when calculated in terms of signed relative values. Signed relative values, as in Feature 3, also maintain the context of the melodic phrase itself, and this is seen to be significant for sound-tracings. Interval distances matter less than the current ambit of the melody that is being traced. Other contours apart from pitch and melody are also significant for this discussion, especially timbral and dynamic changes. However, the relationships between those and motion were beyond the scope of this paper. The interpretation of motion other than just vertical motion is also not handled within this paper.
The features that were shown to be significant can be applied to the whole dataset to see relationships between vertical motion and melody. Contours of dynamic and timbral change could also be compared with melodic tracings using the same methods.

7. REFERENCES

[1] A. Marsden, "Interrogating melodic similarity: a definitive phenomenon or the product of interpretation?" Journal of New Music Research, vol. 41, no. 4, pp. 323-335, 2012.

[2] C. Cronin, "Concepts of melodic similarity in music copyright infringement suits," Computing in Musicology: A Directory of Research, no. 11, pp. 187-209, 1998.

[3] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, "Query by humming: musical information retrieval in an audio database," in Proceedings of the Third ACM International Conference on Multimedia. ACM, 1995, pp. 231-236.

[4] L. Lu, H. You, H. Zhang et al., "A new approach to query by humming in music retrieval," in ICME, 2001, pp. 22-25.
Ryan Kirkbride
University of Leeds
sc10rpk@leeds.ac.uk
This style of synchronisation is used by Ogborn [15] and his network music duo, Very Long Cat, which synchronises Live Coding and tabla drumming over the internet.

Alternatively, two or more Live Coders might perform together at the same location in front of an audience, with each performer's laptop creating sound separately. Depending on the style of music, this requires each machine to be temporally synchronised in real time, which is not always easy to accomplish. A common way to do this is through a clock-sharing protocol where each machine constantly listens to a designated time-keeper, as implemented by the Birmingham Ensemble for Electroacoustic Research [4] and the Algorave pioneers Slub [5], for example. This requires an extra layer of configuration and is still liable to lag between performers' downbeats depending on the network being used.

Another problem arising from this style of performance is the effect it has on the audience. Multiple Live Coders working on individual portions of code will usually have their work on separate screens, some or all of which will be projected for the audience. This can result in a non-optimal audience experience, as there may either be too much going on or, if there aren't enough projectors available, not enough. One possible solution to this problem is the shared-buffer networked Live Coding system Extramuros [8]. As mentioned previously, this software uses a client-server model to connect multiple performers within a single web page. This allows them to create their own code and to request and modify other performers' code in the same window, reducing the number of screens that need to be projected during the performance. However, as the number of connected users, and consequently text boxes, increases, the font size must be reduced and the code becomes less legible.

Gabber [9], the network performance extension to the JavaScript-based language Gibber, works in a similar way but also allows users to interact with each other's code directly within the same text buffer. This allows only one screen to be projected while still displaying all performers' work. Originally Gabber used a single, shared code editor, but this was found to be problematic, as program fragments frequently shifted in screen location during performances and disturbed the coding of performers. Gabber has since moved to a more distributed model. The single, shared text buffer model may have proved problematic in this instance, but it has seen much mainstream success, most notably in Google Docs [16], and has prompted me to create a similar tool for concurrent Live Coding collaboration entitled Troop.

2. DESIGN

The purpose of the Troop software is to allow multiple Live Coders to collaborate within the same document in such a way that audience members will also be able to identify the changes made by the different performers. This section outlines the steps taken to achieve this and briefly describes the two types of network architecture used in this project.

2.1 Development Language

Troop is designed to work with the Live Coding language FoxDot [6] but, like Extramuros, it can be language-neutral with a small amount of configuration. The programming language Python 1 is useful for fast software development and comes with a built-in package for designing graphical user interfaces (GUIs) called Tkinter. As FoxDot is written in Python, it is easy to pipe commands to its interpreter from Troop.

2.2 Interface

The philosophy behind Troop is that all performers seemingly share the same text buffer and contribute to its content at the same time. To allow each performer to differentiate their own contributions from others', each connected performer is given a different coloured label containing their name. This label's position within the input text box is mapped to the location of the respective performer's text cursor, as shown in Figure 1.

Figure 1. A screenshot of the first iteration of the Troop interface with three connected users.

As time elapses during a performance and the text buffer begins to fill up, it is not always possible to separate the individual contributions and it becomes unclear who has written what, even to the performers. This problem has been identified by Xambó et al. [17], who state that there is a challenge in identifying "how to know what is each other's code, as well as how to know who has modified it."

To help combat this problem and differentiate each performer's contributions, each connected performer is allocated a different colour for text (see Figure 2) and highlighting. By doing this, performers can leave traces of their own coloured code throughout the communal text buffer. Editing a block of another performer's code interweaves their colours and thought processes, creating a lasting visual testament to a collaborative process; or at least until someone else makes their own edit. The colour of the text entered by each performer matches the colour of their marker to retain a consistency in each performer's identity. It is customary in Live Coding for evaluated code to be temporarily highlighted, and doing this in separate colours allows both audience and performers to identify the source of that action.

1 http://python.org/
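The per-performer colouring described above maps naturally onto Tkinter text "tags". The following is a minimal illustrative sketch, not Troop's actual source: the palette, peer names, and helper functions are invented for the example.

```python
# Sketch (not Troop's source) of per-performer coloured text using
# Tkinter text tags: every character a performer types is inserted
# under their own tag, so it stays visually attributable.
PALETTE = ["#d9534f", "#5bc0de", "#5cb85c", "#f0ad4e"]

def colour_for(peer_index):
    # Cycle through the palette as performers connect.
    return PALETTE[peer_index % len(PALETTE)]

def build_shared_editor(peer_names):
    # Not called here (needs a display); shows how the tags would be set up.
    import tkinter as tk  # imported locally so the colour helper works headless
    root = tk.Tk()
    text = tk.Text(root)
    text.pack()
    for i, name in enumerate(peer_names):
        text.tag_config(name, foreground=colour_for(i))
    return root, text

def insert_from_peer(text, name, index, chars):
    # Insert a peer's keystrokes under their colour tag.
    text.insert(index, chars, name)

print(colour_for(0), colour_for(4))  # peer 4 wraps around to the first colour
```

Because a tag travels with the characters it was applied to, edits inside another performer's block interleave the colours exactly as the paper describes.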
[Figure 3: network diagram labelled "Sound Output", "FoxDot", and "Server".]
Performers' contributions are also measured in terms of the quantity of characters present. In the bottom right corner of Figure 2 there is a bar-chart that displays the proportion of code contributed by each performer. As a collaborative performance tool it is particularly interesting to keep track of this information and use it as a creative constraint for a performance (see Section 4.2.1 for a discussion of this).

2.3 Network Architecture

The Troop software uses a client-server model but can be set up in two ways based on which machine will be handling code execution.

2.3.1 Server-Side Execution Model

This first design uses a centralised synchronous networked performance model similar to that of LOLC [13] and transfers messages using TCP/IP to ensure all messages are sent over the network. The server runs on a machine dedicated to listening to client applications (see Figure 3), which send keystroke information, and it handles the code execution and sound creation. Incoming keystroke data are stored in a queue structure to avoid sending data from two clients simultaneously in differing orders to other clients. As the number of connected clients increases, so too would the chance of this error occurring. A disadvantage of storing all keystroke data in a queue on the server is that there is sometimes a noticeable latency between pressing a key and the corresponding letter appearing in the text editor, depending on the network.

In this topology only the server machine executes code and schedules musical events, which eliminates any need for clock synchronisation. This method also utilises a centralised shared name-space, which doesn't require information about the code to be sent to the client applications. The disadvantage of this approach is that all performers must be local to the server (a client and server can be run on the same machine together) if they want to hear the performance, unless they stream the audio through another program from the server machine.

2.3.2 Client-Side Execution Model

When performers are not co-located it might be required that code is run on the client machines as opposed to the server. In this case, the Troop server is run with an extra command-line argument:

./run-server.py --client

This signals to the server to send information about what code to evaluate back to the client. Figure 4 shows how the code execution and sound creation are shifted to the client's responsibility in this set-up. A client-server model was adopted over a peer-to-peer topology as it meant that the client application did not have to be reconfigured based on where the code was being executed, and the centralised server queue structure for storing keystrokes could still be used.

The nature of Troop's concurrent text editing means that the code in each client's text buffer is identical. This means that there is no need to use a shared name-space, such as in [14, 18], as the data is reproduced on each connected machine. An advantageous consequence of this is that Troop need only send raw text across the network, and no serialising of other data types is required for the program to be reproduced on multiple machines.

[Figure 4: network diagram with a central "Server" connected to client machines, each running FoxDot and producing its own sound output.]

Figure 4. Network graph of Troop's client-side execution model.
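The ordering guarantee at the heart of the server-side queue (Section 2.3.1) can be sketched in a few lines of Python. This is an illustrative model, not Troop's implementation: the class and method names are invented, and plain lists stand in for client sockets; the point is only that routing every keystroke through one server-side queue makes all clients replay the messages in the same order.

```python
import queue
import threading

# Sketch (not Troop's code) of a central keystroke queue: messages from
# all clients pass through one queue, so every client receives them in
# the same arrival order, however the keystrokes raced on the network.
class KeystrokeServer:
    def __init__(self):
        self.inbox = queue.Queue()   # serialises messages from all clients
        self.clients = []            # stand-ins for client connections
        self.lock = threading.Lock()

    def connect(self):
        client_log = []
        with self.lock:
            self.clients.append(client_log)
        return client_log

    def receive(self, sender, keystroke):
        # Called as keystroke messages arrive from any client.
        self.inbox.put((sender, keystroke))

    def broadcast_all(self):
        # Drain the queue, forwarding each message to every client in
        # arrival order; simultaneous edits can no longer be reordered.
        while not self.inbox.empty():
            msg = self.inbox.get()
            with self.lock:
                for c in self.clients:
                    c.append(msg)

server = KeystrokeServer()
a, b = server.connect(), server.connect()
server.receive("alice", "p")
server.receive("bob", "x")
server.broadcast_all()
print(a == b)  # True: both clients saw the same order
```

The latency cost the paper mentions follows directly from this design: every local keystroke must round-trip through the server before it is confirmed in a consistent position.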
2.4 Extending the Software for Other Languages

Troop contains a module called interpreter.py that defines the classes used for communicating with host Live Coding languages. Currently there are interpreter classes for the FoxDot and SuperCollider languages. If a language can take a raw string input and convert it to usable code, then Troop will be able to communicate with it. Specifying the lang flag at the command line, followed by the name of the desired language, will start the client with a different interpreter to FoxDot, like so:

./run-client.py --lang SuperCollider

By giving performers flexible access to multiple Live Coding languages, Troop can enhance musical expressivity and broaden the channels for collaboration.

3. SYNCHRONISATION

3.1 Maintaining Text Consistency

One of the most difficult aspects of this project was making sure that the contents of the text buffer for each connected client remained identical. In Troop's first iteration, text would be added immediately on the local client and then sent to the server to be forwarded to the other connected clients. Problems arose mainly when two clients were editing the same line, which caused the characters to be in different orders for each client. To combat this, the clients in the next version of Troop would send all keystroke information to the server, which dispersed it to each client in the same order. This worked very well if the server application was running on the local machine, but there were noticeable delays between key-press and text appearing on screen when connecting to a public server running Troop in a London data centre.

A solution was to combine these two input systems to maximise efficiency and still maintain consistency between clients' text. When a key is pressed, Troop checks if any other clients are currently editing the same line as the local client and, if there are, it sends all the information to the server to make sure the characters are entered in the same order for every client. If the local client is the only one editing a line, the character is added immediately on the local machine and then sent to the server to tell the other clients to add the character. This reduces latency for character inputs when clients are not interfering with one another but also maintains consistency at times when they are.

3.2 Limiting Random Values

When random number generators are utilised to define pitches and rhythms, their values may become inconsistent between machines, and each performer will consequently hear a different piece of music. This can be combated by selecting the random number generator's seed at the start of the performance. In FoxDot, for example, evaluating the command Random.seed(1) will do so for all connected clients and will consequently mean that any random elements, although they appear random, will be completely deterministic for that session.

3.3 Clock Synchronisation

Along with information about each client's keystrokes, the Troop server sends a "ping" message to each connected client once every second. This serves two purposes: the first is to periodically check that each client can still be contacted and remove those that cannot from its address book, and the second is to aid clock synchronisation. A consequence of this ping message is that the settime method of Troop's interpreter class (Section 2.4) is called with a value, which can then communicate with whatever time-keeping data structure is used by the host language, if need be.

4. CONCLUSIONS

This paper has presented the Live Coding environment Troop and discussed its potential as a tool for enhancing channels for collaboration and communication in Live Coding ensembles.

4.1 Applications in Teaching

Music has been used as a useful teaching analogy for computer science in younger age groups, as demonstrated by software such as Scratch [19] and the Live Coding language Sonic Pi [20], and collaborative Live Coding can help move student thinking from an individual to a social plane [17]. By running several Troop servers in a classroom environment, Troop would encourage collaborative work between students and allow a single teacher to monitor the work of multiple groups of students from a single machine.

4.2 Future Extensions

4.2.1 Adding Creative Constraints

Troop is an environment that facilitates group work over a network, but each contributor in a document still maintains their own identity within the process through the use of coloured code. Maintaining this identity requires the number of characters entered by each user to be stored. This information could be used to add a constraint to a particular session to explore different collaborative approaches to Live Coding.

On each keystroke it would be possible to test a constraint and then only add the character to the text buffer if that test returns true. An example of this might be to disallow a single user from taking up more than 50% of the text buffer, which could easily be achieved using the short function outlined in pseudo-code in Figure 5. The start of the performance would be difficult, as performers would have to work together slowly to make sure they don't spend long periods of time waiting for other users to catch up, but as the total number of characters in the text buffer increased, so too would the flexibility for writing larger portions of code in one go.

Another example might be to ban a random user from typing for a number of seconds to try and force them to look at what the rest of their ensemble are doing, or to only allow them to edit a single line of code written by one
of their co-performers. The next iteration of Troop will implement a constraints.py module that will contain several creative constraints and allow users to define their own functions.

    def constraint(text, client):
        if client.chars > text.chars / 2:
            return False
        else:
            return True

Figure 5. Example pseudo-code for a creative constraint for Troop.

4.2.2 Web Browser Version

Live Coding environments such as Gibber [10] and LiveCodeLab [21] run in most standard web browsers and require no installation on the client machine. This lowers the barriers to entry and also makes the application platform-independent (provided the web browser itself runs correctly). Moving Troop to a JavaScript-based client could help provide a simpler set-up procedure for those newer to computer programming.

4.3 Evaluation

Currently a small group of performers are learning to use the FoxDot and Troop software in order to perform at an event at the end of April 2017. Feedback from both the users and audience members will be collected, assessed, and will contribute to the further development of Troop.

Acknowledgments

This research is funded by the White Rose College of Arts and Humanities (http://www.wrocah.ac.uk).

5. REFERENCES

[1] N. Collins, A. McLean, J. Rohrhuber, and A. Ward, "Live coding in laptop performance," Organised Sound, vol. 8, no. 3, pp. 321–330, 2003.

[2] TOPLAP, "TOPLAP — the home of live coding," http://toplap.org/, 2004, accessed: 08/12/16.

[3] T. Magnusson, "Herding cats: Observing live coding in the wild," Computer Music Journal, vol. 38, no. 1, pp. 8–16, 2014.

[4] S. Wilson, N. Lorway, R. Coull, K. Vasilakos, and T. Moyers, "Free as in beer: Some explorations into structured improvisation using networked live-coding systems," Computer Music Journal, vol. 38, no. 1, pp. 54–64, 2014.

[5] A. McLean, "Reflections on live coding collaboration," in Proceedings of the Third Conference on Computation, Communication, Aesthetics and X, Universidade do Porto, Porto, 2015, p. 213.

[6] R. Kirkbride, "FoxDot: Live coding with Python and SuperCollider," in Proceedings of the International Conference of Live Interfaces, 2016, pp. 194–198.

[7] J. McCartney, "Rethinking the computer music language: SuperCollider," Computer Music Journal, vol. 26, no. 4, pp. 61–68, 2002.

[8] D. Ogborn, "d0kt0r0/extramuros: language-neutral shared-buffer networked live coding system," https://github.com/d0kt0r0/extramuros, 2016, accessed: 13/12/16.

[9] C. Roberts, K. Yerkes, D. Bazo, M. Wright, and J. Kuchera-Morin, "Sharing time and code in a browser-based live coding environment," in Proceedings of the First International Conference on Live Coding. Leeds, UK: ICSRiM, University of Leeds, 2015, pp. 179–185.

[10] C. Roberts and J. Kuchera-Morin, "Gibber: Live coding audio in the browser." Ann Arbor, MI: Michigan Publishing, University of Michigan Library, 2012.

[11] J. Rohrhuber, A. de Campo, R. Wieser, J.-K. van Kampen, E. Ho, and H. Hölzl, "Purloined letters and distributed persons," in Music in the Global Village Conference (Budapest), 2007.

[12] A. de Campo and J. Rohrhuber, "Republic," https://github.com/supercollider-quarks/Republic, accessed: 05/02/2017.

[13] J. Freeman and A. Van Troyer, "Collaborative textual improvisation in a laptop ensemble," Computer Music Journal, vol. 35, no. 2, pp. 8–21, 2011.

[14] A. C. Sorensen, "A distributed memory for networked livecoding performance," in Proceedings of the ICMC2010 International Computer Music Conference, 2010, pp. 530–533.

[15] D. Ogborn, "Live coding together: Three potentials of collective live coding," Journal of Music, Technology & Education, vol. 9, no. 1, pp. 17–31, 2016.

[16] Google, "Google Docs — create and edit documents online, for free," https://www.google.co.uk/docs/about/, 2017, accessed: 02/02/17.

[17] A. Xambó, J. Freeman, B. Magerko, and P. Shah, "Challenges and new directions for collaborative live coding in the classroom," 2016.

[18] S. W. Lee and G. Essl, "Models and opportunities for networked live coding," in Live Coding and Collaboration Symposium, vol. 1001, 2014, pp. 48 1092121.

[19] A. Ruthmann, J. M. Heines, G. R. Greher, P. Laidler, and C. Saulters II, "Teaching computational thinking through musical live coding in Scratch," in Proceedings of the 41st ACM Technical Symposium on Computer Science Education. ACM, 2010, pp. 351–355.
Yusuke Wada Yoshiaki Bando Eita Nakamura Katsutoshi Itoyama Kazuyoshi Yoshii
Department of Intelligence Science and Technology
Graduate School of Informatics, Kyoto University, Japan
{wada,yoshiaki,enakamura,itoyama,yoshii}@sap.ist.i.kyoto-u.ac.jp
[Figure: the spectrogram of the original audio signal is decomposed by robust NMF into a sparse matrix (singing voice) and a low-rank matrix (accompaniment).]
[Figures 5 and 6: diagrams of online DTW between the MFCC+F0 features of the separated singing voice and the user's singing voice; labels include "MFCC+F0 of user's singing voice", "Time scale of user's singing voice", "Warping path calculated by online DTW", and "MaxRunCount".]

Figure 5. A warping path obtained by online DTW.

Figure 6. Online DTW with input length W = 8, search width c = 4, and path constraint MaxRunCount = 4. All calculated cells are framed in bold and colored sky blue, and the optimal path is colored orange.
where the corresponding α and β represent the shape and rate parameters of the gamma distribution.

To force the sparse components to take nonnegative values, gamma priors with rate parameters given Jeffreys hyperpriors are put on those components as follows:

p(S | α^s, β^s) = ∏_{f,t} G(s_{ft} | α^s, β^s_{ft}),  (6)

p(β^s_{ft}) ∝ (β^s_{ft})^{−1},  (7)

where α^s represents the hyperparameter of the gamma distribution that controls the sparseness. Using Eqs. (3)–(7), the expected values of W, H, and S are estimated in a mini-batch style using a VB technique.

3.4 Audio-to-Audio Alignment between Singing Voices

We use an online version of dynamic time warping (DTW) [22] that estimates an optimal warping path between the user's singing voice and the original singing voice separated from the music audio signal (Figure 5). Since the timbres and F0s of the user's singing voice can significantly differ from those of the original singing voice according to the singing skills of the user, we focus on both MFCCs and F0s of the singing voices for calculating a cost matrix in DTW. To estimate the F0s, saliency-based F0 estimation called subharmonic summation [23] is performed.

First, MFCCs and F0s are extracted from the mini-batch spectrogram of the separated voice X = {x_1, …, x_W} and that of the user's singing voice Y = {y_1, …, y_W}. Representing the concatenated vectors of the MFCCs and F0 extracted from x_i and y_i as x̃_i = {m_i^(x), f_i^(x)} and ỹ_i = {m_i^(y), f_i^(y)}, the sequences X̃ = {x̃_1, …, x̃_W} and Ỹ = {ỹ_1, …, ỹ_W} are input to the online DTW. In this concatenation, the weight of the F0 is smaller than that of the MFCCs, because the F0 would be much less stable than the MFCCs when the user has poor singing skills. If those mini-batches are not silent, the MFCCs and F0s are extracted and the cost matrix D = {d_{i,j}} (i = 1, …, W; j = 1, …, W) is updated according to Algorithms 1 and 2 with constraint parameters W, c, and MaxRunCount, i.e., a partial row or column of D is calculated as follows:

d_{i,j} = ||x̃_i − ỹ_j|| + min{ d_{i,j−1}, d_{i−1,j}, d_{i−1,j−1} }.  (8)

The variables s and t in Algorithms 1 and 2 represent the current positions in the feature sequences X̃ and Ỹ, respectively. The online DTW calculates the optimal warping path L = {o_1, …, o_l}, o_k = (i_k, j_k) (0 ≤ i_k ≤ i_{k+1} ≤ n; 0 ≤ j_k ≤ j_{k+1} ≤ n), using the root mean square for ||x̃_i − ỹ_j||, incrementally and without backtracking. (i_k, j_k) means that the frame x̃_{i_k} corresponds to the frame ỹ_{j_k}. Figure 6 shows an example of how the cost matrix and the warping path are calculated. Each number in Figure 6 represents the order in which the cost matrix is calculated. The parameter W is the length of the input mini-batch. If the warping path reaches the W-th row or column, the calculation stops. If the warping path ends at (W, k) (k < W), the next warping path starts from that point. c restricts the calculation of the cost matrix: at most c successive elements are calculated for each calculation of the cost matrix. MaxRunCount restricts the shape of the warping path: the warping path is incremented by at most MaxRunCount. The function GetInc decides whether to increment the row, the column, or both of the warping path. If the row is incremented from the position (s, t) in the cost matrix, then at most c successive elements from (s − c, t) to (s, t) are calculated. Otherwise, successive elements from (s, t − c) to (s, t) are calculated.

For the system reported here, we set the parameters to W = 300, c = 4, and MaxRunCount = 3.

3.5 Time Stretching of Accompaniment Sounds

Given the warping path L = {o_1, …, o_l}, the system calculates a series of stretch rates R = {r_1, …, r_W} for each frame of a mini-batch. The stretch rate of the i-th frame, r_i, is given by

r_i = (number of occurrences of i in {i_1, …, i_l}) / (number of occurrences of i in {j_1, …, j_l}).  (9)
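Equations (8) and (9) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the dictionary-based cost matrix and all function names are invented, and the incremental scheduling of Algorithms 1 and 2 (the W, c, and MaxRunCount bookkeeping) is omitted.

```python
import math

# Sketch (not the paper's code) of the DTW cell update of Eq. (8) and
# the per-frame stretch rate of Eq. (9).
def cost(x, y):
    # Root-mean-square distance between feature vectors (||x - y|| in Eq. 8).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def update_cell(D, X, Y, i, j):
    # Eq. (8): d[i,j] = ||x_i - y_j|| + min(d[i,j-1], d[i-1,j], d[i-1,j-1]).
    prev = min(D.get((i, j - 1), math.inf),
               D.get((i - 1, j), math.inf),
               D.get((i - 1, j - 1), math.inf))
    if prev == math.inf:       # top-left corner has no predecessor
        prev = 0.0
    D[(i, j)] = cost(X[i], Y[j]) + prev
    return D[(i, j)]

def stretch_rates(path, W):
    # Eq. (9): r_i = (#occurrences of i among the i_k) / (#occurrences among the j_k).
    i_counts = [sum(1 for ik, _ in path if ik == i) for i in range(W)]
    j_counts = [sum(1 for _, jk in path if jk == i) for i in range(W)]
    return [ic / jc if jc else 1.0 for ic, jc in zip(i_counts, j_counts)]

path = [(0, 0), (1, 0), (2, 1), (3, 2)]   # two X frames map onto Y frame 0
print(stretch_rates(path, W=4))           # → [0.5, 1.0, 1.0, 1.0]
```

A rate below 1 means the accompaniment frame must be compressed (the user sang faster than the original at that point), and a rate above 1 means it must be stretched.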
alignment accuracy was not very high. This was because the separated singing voices contained musical noise, and the time stretching led to further noise, giving inaccurate MFCCs.

The number of songs used in this evaluation is rather small. We plan further evaluation using more songs.

There are many possibilities for improving the accuracy. First, HMM-based methods would be superior to the DTW-based method: HMM-based methods learn from previous inputs, unlike DTW-based methods, and this would be useful for improving the accuracy. Second, simultaneous estimation of the alignment and the tempo would improve the accuracy, since the result of tempo estimation could help to predict the alignment. An approach for this kind of alignment is a method using particle filters [11].

4.3 Subjective Evaluation

Each of four subjects was asked to sing a Japanese popular song after listening to the songs in advance. The songs used for evaluation were an advertising jingle "Hitachi no Ki", a rock song "Rewrite" by Asian Kung-fu Generation, a popular song "Shonen Jidai" by Inoue Yosui, and a popular song "Kimagure Romantic" by Ikimono Gakari. The subjects were then asked two questions: (1) whether the automatic accompaniment was accurate, and (2) whether the user interface was appropriate.

The responses by the subjects are shown in Table 1. The responses indicate, respectively, that the automatic accompaniment was partially accurate and practical, and that the user interface was useful.

            question (1)     question (2)
subject 1   partially yes    partially yes
subject 2   yes              partially yes
subject 3   partially yes    no
subject 4   yes              partially yes

Table 1. The results of the subjective evaluation.

The subjects also gave several opinions about the system. First, the accompaniment sounds were of low quality, and it was not obvious whether the automatic accompaniment was accurate. We first need to evaluate the quality of the singing voice separation. An approach to this problem could be to add a mode that plays a click sound according to the current tempo. Second, some of the subjects did not understand what the displayed spectrograms represented. Some explanation should be added for further user-friendliness, or only the stretch rate and F0 trajectories should be displayed.

The number of test samples used in this subjective evaluation is rather small. We plan further evaluation with more subjects.

5. CONCLUSION

This paper presented a novel adaptive karaoke system that plays back accompaniment sounds separated from music audio signals while adjusting the tempo of those sounds to that of the user's singing voice. The main components of the system are singing voice separation based on online VB-RNMF and audio-to-audio alignment between singing voices based on online DTW. This system enables a user to expressively sing an arbitrary song by dynamically changing the tempo of their singing voice. The quantitative and subjective experimental results showed the effectiveness of the system.

We plan to improve the separation and alignment of singing voices. Using the tempo estimation result would help improve the audio-to-audio alignment. Automatic harmonization of the user's singing voice would be an interesting function for a smart karaoke system. Another important research direction is to help users improve their singing skills by analyzing their weak points from the history of the matching results between the user's and original singing voices.

Acknowledgments

This study was supported by JST OngaCREST and OngaACCEL Projects and JSPS KAKENHI Nos. 26700020, 24220006, 26280089, 16H01744, and 15K16654.

6. REFERENCES

[1] M. Hamasaki et al., "Songrium: Browsing and listening environment for music content creation community," in Proc. SMC, 2015, pp. 23–30.

[2] Y. Bando et al., "Variational Bayesian multi-channel robust NMF for human-voice enhancement with a deformable and partially-occluded microphone array," in Proc. EUSIPCO, 2016, pp. 1018–1022.

[3] H. Tachibana et al., "A real-time audio-to-audio karaoke generation system for monaural recordings based on singing voice suppression and key conversion techniques," J. IPSJ, vol. 24, no. 3, pp. 470–482, 2016.

[4] W. Inoue et al., "Adaptive karaoke system: Human singing accompaniment based on speech recognition," in Proc. ICMC, 1994, pp. 70–77.

[5] R. B. Dannenberg, "An on-line algorithm for real-time accompaniment," in Proc. ICMC, 1984, pp. 193–198.

[6] B. Vercoe, "The synthetic performer in the context of live performance," in Proc. ICMC, 1984, pp. 199–200.

[7] C. Raphael, "Automatic segmentation of acoustic musical signals using hidden Markov models," IEEE Trans. on PAMI, vol. 21, no. 4, pp. 360–370, 1999.

[8] A. Cont, "A coupled duration-focused architecture for realtime music to score alignment," IEEE Trans. on PAMI, vol. 32, no. 6, pp. 974–987, 2010.

[9] E. Nakamura et al., "Outer-product hidden Markov model and polyphonic MIDI score following," J. New Music Res., vol. 43, no. 2, pp. 183–201, 2014.

[10] T. Nakamura et al., "Real-time audio-to-score alignment of music performances containing errors and arbitrary repeats and skips," IEEE/ACM TASLP, vol. 24, no. 2, pp. 329–339, 2016.

[11] N. Montecchio et al., "A unified approach to real time audio-to-score and audio-to-audio alignment using sequential Monte Carlo inference techniques," in Proc. ICASSP, 2011.
of the system (e.g., [10]), there is still a need to further Finally, it must be mentioned that the analysis parameters
study such processes. Thus, in this work we perform an of both algorithms have been configured to a window size
experimental study assessing different ideas which, to the of 92.9 ms with a 50 % of overlapping factor.
best of our knowledge, no previous work has considered.
Particularly, we study the possibility of considering other 3. ONSET SELECTION FUNCTIONS
percentile values (i.e., other statistical tendency measures)
different to the median, somehow extending the work by In this section we introduce the different OSF methods
Kauppinen [13] but particularised to onset detection. We considered. Given that the contemplated ODF processes
also examine the influence of the window size in adaptive depict onset candidates as local maxima in O(t), the stu-
methodologies. Eventually, we compare static and adap- died OSF methods search for peaks above a certain thres-
tive threshold procedures to check their respective capabi- hold value (t) for removing spurious detections.
lities. As introduced, the idea in this work is to assess the fol-
The rest of the paper is structured as follows: Section 2 lowing OSF criteria: a) considering percentile values diffe-
introduces the ODF methods considered for this work; Sec- rent to the median one; b) assessing the influence of the
tion 3 explains the different OSF approaches assessed and window size in adaptive methodologies; and c) compa-
compared; Section 4 describes the evaluation methodology considered; Section 5 introduces and analyses the results obtained; finally, Section 6 presents the conclusions and proposes future work.

2. ONSET DETECTION FUNCTIONS

Two different ODF algorithms were used for our experimentation, selected according to the good results reported for them in the literature: the Semitone Filter Bank (SFB) [7] and the SuperFlux (SuF) [17, 18] algorithms. For the rest of the paper we refer to the time series resulting from the ODF process as O(t).

As an energy-based ODF method, SFB analyses the evolution of the magnitude spectrogram under the assumption that harmonic sounds are being processed. A semitone filter bank is applied to each frame of the magnitude spectrogram, with the different filters centred at each of the semitones of the well-tempered scale, and the energy (root mean square value) of each band is retrieved. After that, the first derivative across time is obtained for each single band. As only energy rises may point out onset information, negative outputs are zeroed. Finally, all bands are summed and normalised to obtain the O(t) function.

The SuF method builds on the Spectral Flux [19] signal descriptor and extends it. Spectral Flux retrieves the O(t) function from the positive deviations of the bin-wise difference between two consecutive frames of the magnitude spectrogram. SuF substitutes the difference between consecutive analysis windows by a process of tracking spectral trajectories in the spectrum, together with a maximum filtering process that suppresses vibrato articulations, as they tend to increase false detections.

Given that the different ODF processes may not span the same range, we apply the following normalisation so that the resulting function satisfies O(t) ∈ [0, 1]:

    O(t) = (O(t) − min{O(t)}) / (max{O(t)} − min{O(t)})    (1)

where min{·} and max{·} retrieve the minimum and maximum values of O(t), respectively.

ring the performance of static against adaptive methodologies. Mathematically, these criteria can be formalised as in Eq. 2:

    θ(t_i) = μ{O(t_wi)} + P(n){O(t_wi)}    (2)

where t_wi ∈ [t_i − W/2, t_i + W/2], W denotes the size of the sliding window, and μ{·} and P(n){·} denote the average and the nth percentile value of the sample distribution at issue, respectively.

For assessing the influence of the percentile, 20 values equally spaced in the range n ∈ [0, 100] were considered. The particular case of P(50) is equivalent to the median value of the distribution.

In terms of window sizes, 20 values equally spaced in the range W ∈ [0.2, 5] s were also considered. Additionally, the value W = 1.5 s proposed by West and Cox [20] was included as a reference.

In order to obviate the adaptive capabilities of the OSF, W was set to the length of O(t) to assess the static methodologies.

In addition, the case of manually imposing the threshold value θ(t_i) = T was considered. For that, we establish 20 values equally spaced in the range T ∈ [0, 1], as O(t) is normalised to that range.

For the rest of the paper, the following notation will be used: μ and P denote the use of the mean or the percentile, respectively, whereas μ+P stands for the sum of both descriptors; adaptive approaches are distinguished from the static ones by incorporating t as a subindex, showing their temporal dependency (i.e., μ_t, P_t, and μ_t+P_t in opposition to μ, P, and μ+P); T denotes the manual imposition of the threshold value; lastly, B is used to denote the case in which no threshold is applied and all local maxima are retrieved as onsets: θ_B(t_i) = 0.

Finally, Fig. 2 shows some graphical examples of OSF processes applied to a given O(t) function.

4. METHODOLOGY

4.1 Corpus

The dataset used for the evaluation is the one proposed by Böck et al. [14]. It comprises 321 monaural recordings sampled at 44.1 kHz covering a wide variety of polyphony degrees and instrumentation.
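The normalisation of Eq. (1), the adaptive threshold of Eq. (2), and the local-maximum selection can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the authors' code: the function names, the frame-rate parameter `fps`, and the exact peak-selection rule are our own choices.

```python
def normalise(odf):
    """Min-max normalisation of an ODF time series to [0, 1] (Eq. 1)."""
    lo, hi = min(odf), max(odf)
    return [(x - lo) / (hi - lo) for x in odf]

def percentile(values, n):
    """n-th percentile (n in 0..100) with linear interpolation."""
    s = sorted(values)
    k = (len(s) - 1) * n / 100.0
    f = int(k)
    c = min(f + 1, len(s) - 1)
    return s[f] + (s[c] - s[f]) * (k - f)

def adaptive_threshold(odf, fps, window_s, n, use_mean=True, use_pct=True):
    """Eq. 2: theta(t_i) = mu{O(t_wi)} + P(n){O(t_wi)} computed over a
    sliding window of window_s seconds centred on each frame i."""
    half = int(round(window_s * fps / 2))
    theta = []
    for i in range(len(odf)):
        w = odf[max(0, i - half): i + half + 1]
        t = 0.0
        if use_mean:
            t += sum(w) / len(w)      # mu{.}
        if use_pct:
            t += percentile(w, n)     # P(n){.}
        theta.append(t)
    return theta

def select_onsets(odf, theta):
    """Keep the local maxima of the ODF that exceed the threshold;
    with theta = 0 everywhere this reproduces the baseline B."""
    return [i for i in range(1, len(odf) - 1)
            if odf[i] > odf[i - 1] and odf[i] >= odf[i + 1]
            and odf[i] > theta[i]]
```

Setting `window_s` to the whole signal duration reproduces the static μ, P, and μ+P strategies, while passing a constant list as `theta` reproduces the manually imposed threshold T.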
SMC2017-118
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
[Figure 2: two panels showing an OSF applied to a normalised O(t), amplitude (0 to 1) against Time (s), 0 to 2.5 s.]

(a) Static threshold θ(t_i) = 0.4 when manually imposing a static threshold value (T). Symbol (#) remarks the positions selected as onsets.

(b) Adaptive threshold θ(t_i) using the percentile scheme (P_t): solid and dashed styles represent the 75th and 50th percentiles, respectively, while symbols (#) and (○) depict the positions selected as onsets for each criterion. The sliding window has been set to W = 1.5 s.

Figure 2: Examples of OSF applied to a given O(t).

The total duration of the set is 1 hour and 42 minutes, and it contains 27,774 onsets. The average duration per file is 19 seconds (the shortest lasts 1 second and the longest one extends up to 3 minutes) with an average figure of 87 onsets per file (minimum of 3 onsets and maximum of 1,132 onsets). For these experiments no partitioning in terms of instrumentation, duration, or polyphony degree was considered, as we aim at assessing broader tendencies.

4.2 Performance measurement

Given the impossibility of defining an exact point in time to be the onset of a sound, correctness evaluation requires a certain tolerance [21]. In this regard, an estimated onset is typically considered to be correct (True Positive, TP) if its corresponding ground-truth annotation is within a 50 ms time lapse of it [1]. False Positive errors (FP) occur when the system overestimates onsets, and False Negative errors (FN) when the system misses elements present in the reference annotations.

Based on the above, we may define the F-measure (F1) figure of merit as follows:

    F1 = 2·TP / (2·TP + FP + FN)    (3)

where F1 ∈ [0, 1], the extreme values 0 and 1 denoting the worst and best performance of the system, respectively. Other standard metrics such as Precision (P) and Recall (R) have been obviated for simplicity.

5. RESULTS

In this section we introduce and analyse the results obtained for the different experiments proposed. All figures presented in this section show the weighted average of the results obtained for each of the audio files, considering the number of onsets each one contains.

Table 1 shows the results obtained for the proposed comparison of static and adaptive threshold methodologies for both ODF processes considered. In this particular case, the window size for adaptive methodologies has been fixed to W = 1.5 s as in West and Cox [20].

An initial remark according to the results obtained is that, in general, the figures obtained with the sliding-window methodology do not remarkably differ from the ones obtained with the static one, independently of the ODF process. This can be clearly seen when P and μ+P are respectively compared to P_t and μ_t+P_t: all percentile values considered retrieve considerably similar results, except for the case when high percentile values are considered, in which the P_t method shows its superior capabilities. Additionally, while P approaches show their best performance for percentile values in the range [60, 75], the μ+P methods obtain their optimal results in the lower ranges, more precisely around [0, 30].

Regarding the use of the μ methodologies, results obtained for the static and adaptive methodologies did not differ for the SFB function. On the contrary, when considering the SuF function, the adaptive methodology performed slightly worse than the static one. Finally, as these methods do not depend on any external configuration, their performance does not vary with the percentile parameter.

In terms of the T strategy, its performance matched the static P and μ+P methods in the sense that the performance degrades as the introduced threshold is increased. Nevertheless, this method showed its best performance when the threshold value considered lies in the range [0.10, 0.30].

As a general comparison of the methods considered, the μ methods showed a very steady performance paired with high performance results. Also, while μ+P methods generally outperform P strategies in terms of peak performance, the particular case of P_t approaches showed less dependency on the percentile configuration value.

Given that all previous experimentation has been done considering a fixed window value of W = 1.5 s, we need to study the influence of that parameter on the overall performance of the system. In this regard, Tables 2 and 3 show the results for the P_t and μ_t+P_t methods considering different window sizes for the SFB and SuF processes, respectively. Additionally, Figures 3 and 4 graphically show these results for the P_t and μ_t+P_t methods, respectively. As the figures obtained for both ODF processes qualitatively show the same trends, we shall analyse them jointly.

Checking Tables 2 and 3, results for the P_t method show that larger windows tend to increase the overall performance of the detection, at least when not considering an extreme percentile value. For instance, let us focus on the P_t^(74) method for the SuF process (Table 3): when considering a window of W = 0.20 s, the performance is F1 = 0.67, which progressively improves as the window size is increased, reaching a score of F1 = 0.73 when W = 5.00 s.

As commented, this premise does not hold when percentile values are set to their possible extremes. For low
                          SFB                                                SuF
Th/Pc    T     μ     P    μ+P   μ_t   P_t  μ_t+P_t      T     μ     P    μ+P   μ_t   P_t  μ_t+P_t
0.00   0.65  0.73  0.65  0.73  0.72  0.65  0.72       0.64  0.77  0.64  0.77  0.74  0.64  0.76
0.11   0.71  0.73  0.65  0.72  0.72  0.65  0.71       0.76  0.77  0.64  0.77  0.74  0.64  0.76
0.21   0.75  0.73  0.66  0.71  0.72  0.66  0.70       0.65  0.77  0.65  0.77  0.74  0.64  0.76
0.32   0.73  0.73  0.66  0.70  0.72  0.66  0.69       0.51  0.77  0.66  0.77  0.74  0.65  0.76
0.42   0.66  0.73  0.68  0.68  0.72  0.67  0.67       0.39  0.77  0.68  0.76  0.74  0.67  0.75
0.53   0.56  0.73  0.69  0.66  0.72  0.69  0.63       0.29  0.77  0.70  0.75  0.74  0.69  0.74
0.63   0.44  0.73  0.70  0.62  0.72  0.70  0.58       0.21  0.77  0.73  0.73  0.74  0.71  0.71
0.74   0.31  0.73  0.70  0.53  0.72  0.70  0.50       0.15  0.77  0.74  0.68  0.74  0.72  0.65
0.84   0.19  0.73  0.65  0.39  0.72  0.63  0.36       0.10  0.77  0.68  0.56  0.74  0.65  0.54
0.95   0.10  0.73  0.40  0.13  0.72  0.42  0.12       0.06  0.77  0.42  0.29  0.74  0.44  0.27
1.00   0.05  0.73  0.05  0.00  0.72  0.27  0.00       0.05  0.77  0.05  0.00  0.74  0.29  0.00

Table 1: Results in terms of the F1 score for the dataset considered for the ODF methods introduced. A 1.5-second window has been used for the methods based on a sliding window (μ_t, P_t, μ_t+P_t), following the work by West and Cox [20]. Bold elements represent the best figures for each ODF method considered. Due to space requirements we have selected the threshold and percentile parameters (Th/Pc) showing the general tendency. In this same regard, both threshold and percentile parameters are expressed in the range [0, 1].
Table 2: Results in terms of the F1 score for the SFB process. Bold elements represent the best figures obtained for each
percentile value considered. Due to space requirements, the most representative percentile parameters (Pc) and window
sizes (W ) have been selected to show the general tendency.
Table 3: Results in terms of the F1 score for the SuF process. Bold elements represent the best figures obtained for each
percentile value considered. Due to space requirements, the most representative percentile parameters (Pc) and window
sizes (W ) have been selected to show the general tendency.
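The evaluation protocol of Section 4.2 (tolerance-based matching followed by Eq. (3)) can be condensed into a small helper. This is an illustrative sketch of ours, not the authors' evaluation code; the greedy one-to-one matching of each estimate to the closest still-unmatched annotation is an assumption.

```python
def f_measure(estimated, reference, tol=0.05):
    """F1 (Eq. 3) with a +/-50 ms tolerance: an estimate is a true
    positive if a still-unmatched ground-truth onset lies within
    tol seconds of it; each annotation may be matched only once."""
    unmatched = list(reference)
    tp = 0
    for t in sorted(estimated):
        # closest annotation that has not been matched yet
        best = min(unmatched, key=lambda r: abs(r - t), default=None)
        if best is not None and abs(best - t) <= tol:
            unmatched.remove(best)
            tp += 1
    fp = len(estimated) - tp   # spurious detections
    fn = len(reference) - tp   # missed annotations
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

For instance, two correct detections plus one spurious one against two annotations yield F1 = 4 / (4 + 1 + 0) = 0.8.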
percentile values, window size seems to be irrelevant to the system. For instance, when considering the SFB method P_t^(11), the performance measure was always F1 = 0.65 independently of the W value considered.
On the other extreme, very high percentile values suffer a performance decrease as larger window sizes are used. As an example, for the SuF configuration, P_t^(100) with W = 5.00 s achieved an F1 = 0.12, while with W = 0.20 s this figure rose to F1 = 0.67.

Results for the μ_t+P_t method showed similar tendencies since, when not considering extreme percentile values, the overall performance increased as larger windows were used. Nevertheless, in opposition to the P_t case, this particular configuration showed an improvement tendency as W is progressively increased for low percentile values.

In general it was observed that, for all W window sizes, the best percentile configurations seemed to be in the range [60, 70] for the P_t approach and in the range [0, 20] for the μ_t+P_t case. This fact somehow confirms the hypothesis suggesting that the median value may not always be the best percentile to consider.

Checking now the results considering Figs. 3 and 4, some additional remarks, which are more difficult to check in the aforementioned tables, may be pointed out.

The first one is that the selection of the proper parameters of window size and percentile factor is crucial. For both P_t and μ_t+P_t methods, there is a turning point at which the performance degrades to values lower than the considered baseline B (no OSF applied). For the P_t method there is a clear point for this change in tendency around the 85th percentile for any window size considered. However, for the μ_t+P_t approach there was not a unique point but a range, which remarkably varied depending on the type of ODF and window size considered.

Another point to highlight is that the static methodologies (P and μ+P) consistently define the upper bound in performance before the so-called turning point. This fact somehow confirms the initial idea of using large windows (in the limit, one single window considering the whole O(t) function) in opposition to small windows.

The results obtained when considering window sizes in the range [0, 1] seconds, which include the performance of the considered reference window of W = 1.5 s, achieved results similar to the obtained upper bound. The other considered window sizes showed a remarkable variability in the performance, ranging from figures similar to the upper bound to figures close to the baseline B.

As a general summary of all the experimentation performed, we may conclude that adaptive OSF methodologies may be avoided, as static approaches obtain similar results with less computational cost. Particularly, in our experiments, methods considering the percentile (P) or the mean and percentile (μ+P) reported the upper bounds in performance.

Nevertheless, the particular percentile value used remarkably affects the performance. For P, the best results seem to be obtained when the percentile parameter was set in the range [60, 70]. For μ+P the best figures were obtained when this parameter was set to a very low value.

Finally, the commonly considered median value for the OSF did not report the best results in our experiments. These results point out that the median statistical descriptor may not always be the most appropriate to be used, it being necessary to tune it for each particular dataset.

5.1 Statistical significance analysis

In order to perform a rigorous analysis of the results obtained and derive strong conclusions out of them, we now perform a set of statistical tests. Specifically, these analyses were performed with the non-parametric Wilcoxon rank-sum and Friedman tests [22], which avoid any assumption about the distribution followed by the figures obtained. The former method is helpful for comparisons of different distributions in a pairwise fashion, while the latter generalises this pairwise comparison to a generic number of distributions.

In this work the Wilcoxon rank-sum test was applied to assess whether there are significant differences among all the methods proposed using the results in Table 1. The single scores obtained for each Th/Pc constitute a sample of the distribution for the OSF method to be tested. The results obtained when considering a significance level of p < 0.05 can be checked in Table 4.

Attending to the results of the Wilcoxon test, an initial point to comment is that method T is the least robust of all considered, due to its significantly lower performance in all cases. Oppositely, method μ can be considered the best strategy as it outperforms all other strategies. The adaptive equivalent of this technique, μ_t, achieves similar performance to the static one except for the case of SuF, in which it does not outperform the rest of the methods as consistently as in the static case. Methods P and P_t show an intermediate performance between the previously commented extremes. Their results are not competitive with any of the μ or μ_t strategies. These strategies are generally tied with methods μ+P and μ_t+P_t, as they typically show significantly similar performances, with punctual cases in which one of those methods improves on the rest.

We now consider the statistical analysis of the influence of the window size (W) and the percentile values (Pc) on the overall performance. Given that now we have two variables to be evaluated, we consider the use of the Friedman test. The idea of this experiment is to assess whether variations in these variables report statistically significant differences in the F1 score, so we assume that the null hypothesis to be rejected is that no difference should be appreciated. Note that this type of analysis does not report whether a particular strategy performs better than another one, but simply whether the differences when using different parameters are relevant.

For performing this experiment we consider the data in Tables 2 and 3, as they contain all the information required for the analysis. In this regard, Tables 5 and 6 show the p significance values obtained when measuring the influence of W and Pc, respectively.

Attending to the results obtained, we can check the remarkable influence of the W parameter on the overall results, as all cases report very low p scores which reject the null hypothesis. The only exception is P_t in the SFB case, in which p is not that remarkably low, but would still reject the null hypothesis considering a typical significance threshold of p < 0.05.
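For reference, the pairwise comparison underlying Table 4 can be illustrated with a self-contained rank-sum statistic (normal approximation with midranks for ties). In practice a library routine such as scipy.stats.ranksums would be used; this dependency-free version is our own sketch, not the authors' analysis code.

```python
import math

def rank_sum_z(a, b):
    """Wilcoxon rank-sum z-statistic for two independent samples."""
    pooled = sorted((v, i) for i, v in enumerate(a + b))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mid = (i + j) / 2 + 1          # average rank for tied values
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = mid
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(ranks[:n1])                # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2      # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd
```

A |z| above roughly 1.96 corresponds to the p < 0.05 significance level used in the paper; two identical samples yield z = 0.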
[Figure 3: two panels (SFB, top; SuF, bottom) plotting F1 (0.3 to 0.8) against Percentile (Pc), 0 to 100.]

Figure 3: Evolution of the F1 score when considering the P_t strategy for the dataset proposed.
[Figure 4: two panels (SFB, top; SuF, bottom) plotting F1 (0.3 to 0.8) against Percentile (Pc), 0 to 100.]

Figure 4: Evolution of the F1 score when considering the μ_t+P_t strategy for the dataset proposed.
In terms of the percentile value (Pc), the p significance values obtained clearly show the importance of this parameter in the design of the experiment. This points out the clear need to consider it as another parameter in the design of onset detection systems.

Finally, note that this statistical analysis confirms the initial conclusions depicted previously: static approaches are significantly competitive when compared to adaptive methods; window sizes for adaptive methods remarkably influence the performance of the system; when considering
                        SFB                                       SuF
Method     T   μ   P  μ+P μ_t P_t μ_t+P_t      T   μ   P  μ+P μ_t P_t μ_t+P_t
T          –   <   <   <   <   <   <           –   <   <   <   <   <   <
μ          >   –   >   >   >   >   >           >   –   >   >   >   >   >
P          >   <   –   =   <   =   =           >   <   –   =   <   >   =
μ+P        >   <   =   –   <   =   >           >   <   =   –   =   =   >
μ_t        >   <   >   >   –   >   >           >   <   >   =   –   >   =
P_t        >   <   =   =   <   –   >           >   <   <   =   <   –   =
μ_t+P_t    >   <   =   <   <   <   –           >   <   =   <   =   =   –

Table 4: Results obtained for the Wilcoxon rank-sum test applied to the comparison results depicted in Table 1. Symbols (>), (<), and (=) state that the onset detection capability of the method in the row is significantly higher, lower, or not different from the method in the column. Symbol (–) depicts that the comparison is obviated. A significance level of p < 0.05 has been considered.
OSF methods relying on statistical descriptions, the percentile should be considered as another design parameter due to its relevance in the results.

6. CONCLUSIONS

Onset detection, the retrieval of the starting points of note events in audio music signals, has been largely studied because of its remarkable usefulness in the Music Information Retrieval field. These schemes usually work on a two-stage basis: the first stage derives a time series out of the initial signal whose peaks represent the candidate onset positions; the second assesses this function by setting a threshold to minimise the errors and maximise the overall performance. While the first of such processes has received significant attention from the community, the second stage has not been that deeply studied.

In this work we have studied the actual influence of a proper design of this second stage on the overall performance. More precisely, we addressed three main points: measuring the benefits of static and adaptive threshold techniques, the use of different statistical descriptions for setting this threshold, and the influence of the considered window size in adaptive methodologies.

Our experiments show that static methodologies are clearly

Acknowledgments

This research work is partially supported by Vicerrectorado de Investigación, Desarrollo e Innovación de la Universidad de Alicante through the FPU program (UAFPU2014-5883) and the Spanish Ministerio de Economía y Competitividad through project TIMuL (No. TIN2013-48152-C2-1-R, supported by EU FEDER funds). The authors would also like to thank Sebastian Böck for kindly sharing the onset dataset.

7. REFERENCES

[1] J. P. Bello, L. Daudet, S. A. Abdallah, C. Duxbury, M. E. Davies, and M. B. Sandler, "A Tutorial on Onset Detection in Music Signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 13, no. 5, pp. 1035–1047, 2005.

[2] D. P. W. Ellis, "Beat tracking by dynamic programming," Journal of New Music Research, vol. 36, no. 1, pp. 51–60, 2007.

[3] M. Alonso, G. Richard, and B. David, "Tempo Estimation for Audio Recordings," Journal of New Music Research, vol. 36, no. 1, pp. 17–25, 2007.
[4] E. Benetos and S. Dixon, "Polyphonic music transcription using note onset and offset detection," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Prague, Czech Republic, 2011, pp. 37–40.

[5] J. Schlüter and S. Böck, "Improved musical onset detection with convolutional neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 6979–6983.

[6] R. Zhou, M. Mattavelli, and G. Zoia, "Music Onset Detection Based on Resonator Time Frequency Image," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 8, pp. 1685–1695, 2008.

[7] A. Pertusa, A. Klapuri, and J. M. Iñesta, "Recognition of Note Onsets in Digital Music Using Semitone Bands," in Proceedings of the 10th Iberoamerican Congress on Pattern Recognition, CIARP, Havana, Cuba, 2005, pp. 869–879.

[8] J. P. Bello and M. B. Sandler, "Phase-based note onset detection for music signals," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Hong Kong, Hong Kong, 2003, pp. 49–52.

[9] E. Benetos and Y. Stylianou, "Auditory Spectrum-Based Pitched Instrument Onset Detection," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 1968–1977, 2010.

[10] C. Rosão, R. Ribeiro, and D. Martins de Matos, "Influence of peak picking methods on onset detection," in Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR, Porto, Portugal, 2012, pp. 517–522.

[11] S. Dixon, "Onset detection revisited," in Proceedings of the 9th International Conference on Digital Audio Effects, DAFx-06, Montreal, Canada, 2006, pp. 133–137.

[15] C. Duxbury, J. P. Bello, M. Davies, and M. Sandler, "Complex Domain Onset Detection for Musical Signals," in Proceedings of the 6th International Conference on Digital Audio Effects, DAFx-03, London, UK, 2003, pp. 90–93.

[16] J. P. Bello, C. Duxbury, M. Davies, and M. B. Sandler, "On the use of phase and energy for musical onset detection in the complex domain," IEEE Signal Processing Letters, vol. 11, pp. 553–556, 2004.

[17] S. Böck and G. Widmer, "Maximum Filter Vibrato Suppression for Onset Detection," in Proceedings of the 16th International Conference on Digital Audio Effects (DAFx-13), Maynooth, Ireland, September 2013, pp. 55–61.

[18] ——, "Local Group Delay based Vibrato and Tremolo Suppression for Onset Detection," in Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), Curitiba, Brazil, November 2013, pp. 589–594.

[19] P. Masri, "Computer Modeling of Sound for Transformation and Synthesis of Musical Signals," PhD Thesis, University of Bristol, UK, 1996.

[20] K. West and S. Cox, "Finding An Optimal Segmentation for Audio Genre Classification," in Proceedings of the 6th International Conference on Music Information Retrieval, ISMIR, London, UK, 2005, pp. 680–685.

[21] P. Leveau and L. Daudet, "Methodology and tools for the evaluation of automatic onset detection algorithms in music," in Proceedings of the 5th International Symposium on Music Information Retrieval, ISMIR, Barcelona, Spain, 2004, pp. 72–75.

[22] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
ABSTRACT

The purpose of this paper is to present Robinflock, an algorithmic system for automatic polyphonic music generation to be used with interactive systems targeted at children. The system allows real-time interaction with the generated music. In particular, a number of parameters can be independently manipulated for each melody line. The design of the algorithm was informed by the specific needs of the targeted scenario. We discuss how the specific needs of polyphony with children influenced the development choices. Robinflock was tested in a field study in a local kindergarten involving 27 children over a period of seven months.

1. INTRODUCTION

Polyphonic music is widespread in the western music tradition. It consists of at least two melodic lines that are simultaneously performed and that produce coherent harmonies when they overlap. Polyphonic music originated with vocal music, grew during the Renaissance [22], and has been fundamental for classical composers (e.g. Bach, Haydn, and Mozart), for modern composers (e.g. Schoenberg and Stockhausen), and for music studies [31]. Performing polyphonic music is particularly demanding as it requires musicians to master score reading and the ability to play with other performers. The difficulties of performing polyphonic music are even greater for young children due to their limited musical experience. Despite these difficulties, children seem to have an innate predilection for polyphony [33].

We argue that an algorithmic solution can increase children's confidence with polyphonic music. In particular, an algorithmic composition system that generates polyphonic music in real time can be applied in interactive situations to familiarise children with polyphony. We propose to delegate part of the complexities of performing polyphonic music to the computer, enabling children to directly manipulate different lines by interacting with a number of parameters.

The contribution of this paper is the proposition of a real-time algorithmic composer of polyphonic music. Indeed, to the best of our knowledge, related studies in algorithmically generated polyphonic music never address interactive purposes. As a consequence, the development of an algorithm that generates polyphonic music in real time was not strictly necessary. Previous algorithmic systems for polyphonic music generation exist; however, they have been developed from a computational linguistic perspective, aiming at generating polyphonic music that avoids composition and counterpoint errors. The objective of these studies was to imitate specific styles [8] or to apply compositional rules derived from music theory [2, 10, 32]. These algorithms successfully generated good-quality counterpoint compositions, but their application domain was limited to off-line, non-interactive music.

The novelty of Robinflock lies in the possibility to independently manipulate each musical line on several musical parameters in real time. Robinflock was developed in the context of Child Orchestra [9], a research project aimed at exploring new solutions to use technology in the context of music making for children. In this paper, we describe the technical implementation of the algorithm, underlining how the technical choices are grounded on specific requirements: design solutions were taken to meet a number of requirements related to the specific interactive scenario of polyphonic music for children. The architecture is based on Robin, an algorithm developed by some of the authors that generates piano music for interactive purposes [24, 26].

Robinflock was adopted in the Child Orchestra field study, which involved 27 kindergarten children over a period of seven months [9]. The results of this field study suggested that the algorithm is particularly effective in helping children familiarise with polyphonic music.

The remainder of this paper is organized as follows. Section 2 presents related work in algorithmic composition and in music for children. Section 3 describes in detail the architecture of Robinflock. Section 4 introduces the field work in which the algorithm was used and presents the results. We conclude this paper with discussions and considerations on future work.

Copyright: © 2017 Raul Masu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
SMC2017-126
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
[4]. Their system, which was designed with the help of four music experts, proposed a metaphor based on body movements. The movements of the children were captured by a tracking system and mapped into a percussive sound output, which varied according to the children's movements. Specifically, tempo was mapped to the speed of the children, volume to the intensity of the activity, and pitch to the proximity among children. The system did not control harmonic and melodic features, and consequently did not require a complex algorithmic system to deal with such elements. Another example is the Continuator, which is particularly relevant for the scope of this paper as it involves algorithmic elements [29, 1]. The Continuator takes as input the music performed on the keyboard and plays back coherent excerpts that mirror the performer's play. The system facilitated piano improvisation by automatically generating musical replies. The system proved effective in promoting a flow state for children [1], but does not aim at acculturating children in any particular musical technicality.

3. SYSTEM ARCHITECTURE

The algorithmic composer was developed following a number of musical needs that emerged during a preliminary design phase. The design phase, whose details are discussed in a related publication [34], involved experts in the domain of music theory, music teaching, and pedagogy, who participated in four focus groups and one workshop. We present here the main musical needs that emerged in the design phase.

Figure 1. Scheme of the architecture of the algorithm.

The generated music should be varied, coherent, and compliant with compositional rules, thus aiming to provide children with a valid stimulus for learning by ear [14]. Moreover, the system is required to allow multiple children to interact with it at the same time. Thus, each child needs to have control over a specific voice of the music. In particular, low-level parameters should be directly accessible to actively stimulate the children and foster participation. The parameters individuated during the design phase were volume, speed, and articulation.

These interactive and musical requirements can be met by using an algorithmic solution insofar as it allows generating and manipulating the different lines in real time. This would provide each child with control over his or her own melodic line, thus allowing children to modify the music without reciprocal interference. Furthermore, as opposed to a solution based on triggering pre-recorded melodies, algorithmic composition guarantees continuous access and control to structural parameters. Finally, an algorithmic solution can guarantee a virtually infinite number of compositions but, at the same time, restrain the musical outputs to correct melodic and harmonic rules.

As highlighted in the related work section, most of the systems that algorithmically generate polyphonic music are not designed for interactive purposes [2, 10, 12, 23, 30, 32]. We therefore decided to base the architecture of the algorithm on Robin, an algorithm previously developed by some of the authors [24, 26], which follows a rule-based approach to manipulate a number of musical parameters in real time (i.e. speed, articulation, melody direction, timbre, volume, consonance, and mode). Robin
controlled a single voice with accompaniment generated using some piano clichés of classical music, and was adopted in several interactive contexts [25, 27]. However, as each child needs to control a different voice, the algorithm had to be extended to support multiple voices. To this end, we implemented an algorithm that allows independent manipulation of each line of polyphonic music. Robinflock was developed with a rule-based approach.

Similarly to most of the interactive systems previously described [7, 21, 34], the structure of Robinflock is modular. The algorithm is composed of three modules that separately deal with harmony, rhythm, and melody. The Harmony Generation Module (HGM) follows traditional harmonic rules [31, 35], implemented with a first-order Markov chain. The Rhythm Generation Module (RGM) is composed of four modules that compute the rhythm of each line. The rhythm is computed every quarter, while the harmony is computed once every four quarters; a global metronome synchronises the two modules. The Melodies Generation Module (MGM) fills all four rhythms with notes that are coherent with the current chord, to avoid counterpoint errors such as parallel fifths and octaves [17]. The generated music is performed with four FM synthesisers, one for each line. The overall architecture of Robinflock can be seen in Figure 1.

Robinflock is developed in Max-MSP, and each module is implemented in JavaScript using the Max JS object, with the only exception of the rhythm module, which is composed of four JS objects, one for each line.

To ensure time synchronisation of the four voices, a metronome is set to the minimum metric unit (the sixteenth) and each voice is encoded using arrays of sixteenths. All four arrays are generated by the MGM, which combines the four rhythm arrays generated by the RGMs with the current harmony. A metronome scans the four arrays at the same time: each cell of the array (a sixteenth note) contains a MIDI pitch value in case a new note has to be performed, or a -1 flag if a new note is not required. In the latter case, the previous note continues to play. These MIDI values are sent to the four synthesisers, which are also implemented in Max-MSP.

3.1 Harmony Generation Module

Traditionally, in tonal/modal music, harmony is organized on the basis of chord progressions and cadences. One common approach to managing harmony is that of generative grammar, which can compute good-quality chord progressions but is based on a time-span reduction of the musical structure and requires a priori knowledge of the overall structure [38]. Following the design requirements, we needed a high level of flexibility in real time. Therefore, we computed harmony as a stream of chords that are not generated a priori. Each chord is defined as a scale degree and computed as a state of a Markov chain. Since the historical works of Xenakis, Markov chains have been adopted in algorithmic composition [39] to manage temporal successions of musical events. Concerning harmony, the transition probabilities between successive chords have already been modelled as a Markov process [33]. Chord transition data can be extracted from the analysis of existing music [8], from surveys of music theory [33], or following a personal compositional aesthetic [39].

To guarantee a high degree of musical adaptability, the correlation between chords and degrees does not depend on previous states of the system: a first-order Markov process determines the harmonic progression as a continuous stream of chords. We implemented the matrix following a rule-based approach, deriving the transition probabilities from harmony manuals (Table 1) [31, 35].

As the music was required to offer a variety of stimuli, Robinflock was developed to compute music in all the major and minor keys, and in all the Gregorian modes. To cope with this requirement, the system computes the harmonic progression by applying the scale degree to the desired tonality or modality. For example, a first degree in the C major key corresponds to the C-E-G chord, while the same degree in the G minor key corresponds to the G-Bb-D chord.

       I     II    III   IV    V     VI    VII
I      0     0,05  0,07  0,35  0,35  0,08  0,1
II     0,05  0     0,05  0,15  0,65  0,2   0
III    0     0,07  0     0,2   0,8   0,65  0
IV     0,15  0,15  0,05  0     0,6   0,05  0
V      0,64  0,05  0,05  0,13  0     0,13  0
VI     0,06  0,35  0,12  0,12  0,35  0     0
VII    1     0     0     0     0     0     0

Table 1. The Markov matrix in the Harmony Module.

The harmonic rhythm of Robinflock is similar to that of Robin: each chord corresponds to a bar, and every eight chords/bars the system is forced to cadence to a tonic, dominant, or subdominant (I, IV, V). The introduction of cadences keeps the harmony coherent while maintaining real-time flexibility.

3.2 Rhythm Generation Module

The Rhythm Generation Module (RGM) manages the rhythm of each musical line autonomously and is therefore composed of four different instances of the same agent. During the design process, speed was identified as one of the musical parameters that should be manipulated autonomously on each voice. To keep coherence among the four voices, we decided to express speed as note density; changing the BPM value of each voice independently would have produced unwanted results. The rhythm density ranges from a continuum of sixteenth notes up to a single four-quarter note, with a simple metric division (no tuplets are adopted).

To reduce latency, thus guaranteeing a real-time response, the rhythm is computed at each quarter (four times per bar). Nevertheless, in low note-density situations it could be necessary to have notes of two, three, or four quarters. Consequently, the length of a note has to be modified while the note is playing. The length of a note is thereby not computed at its beginning but defined at its end, allowing continuous access to the note density parameter.
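As an illustration of the first-order Markov harmony of Section 3.1, the sampling of scale degrees and the mapping of a degree onto a triad in a given key can be sketched in a few lines of Python. This is our own sketch, not the authors' Max/JS code; since two rows of the published matrix do not sum to one, every row is normalised before sampling.

```python
import random

# Transition weights from the current scale degree (row) to the next
# (column), after Table 1; rows are normalised before sampling.
TABLE = [
    [0.00, 0.05, 0.07, 0.35, 0.35, 0.08, 0.10],  # from I
    [0.05, 0.00, 0.05, 0.15, 0.65, 0.20, 0.00],  # from II
    [0.00, 0.07, 0.00, 0.20, 0.80, 0.65, 0.00],  # from III
    [0.15, 0.15, 0.05, 0.00, 0.60, 0.05, 0.00],  # from IV
    [0.64, 0.05, 0.05, 0.13, 0.00, 0.13, 0.00],  # from V
    [0.06, 0.35, 0.12, 0.12, 0.35, 0.00, 0.00],  # from VI
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],  # from VII
]

MAJOR = [0, 2, 4, 5, 7, 9, 11]          # semitone offsets of the major scale
NATURAL_MINOR = [0, 2, 3, 5, 7, 8, 10]  # ...and of the natural minor scale

def next_degree(current):
    """Sample the next scale degree (0 = I ... 6 = VII) from the matrix."""
    row = TABLE[current]
    total = sum(row)
    return random.choices(range(7), weights=[w / total for w in row])[0]

def triad(degree, tonic_pc, scale):
    """Stack thirds on a scale degree, returning three pitch classes."""
    return [(tonic_pc + scale[(degree + step) % 7]) % 12 for step in (0, 2, 4)]

def progression(length):
    """A continuous stream of degrees, starting from the tonic."""
    degrees = [0]
    while len(degrees) < length:
        degrees.append(next_degree(degrees[-1]))
    return degrees

# The paper's example: degree I is C-E-G in C major, G-Bb-D in G minor.
print(triad(0, 0, MAJOR))          # pitch classes [0, 4, 7]
print(triad(0, 7, NATURAL_MINOR))  # pitch classes [7, 10, 2]
```

Note that row VII of the table is deterministic, so a leading-tone chord always resolves to the tonic, mirroring the cadential behaviour described in the text.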
To this end, the rhythm was encoded using note ON-OFF messages. Given that each line is rhythmically computed autonomously, we use the note ON message also to terminate the previous note. The ON-OFF messages are stored in an array - the Rhythm Array - of four Boolean values. Each position of the array corresponds to a sixteenth note: 1 means a new note, and 0 means that the previous note continues to play. For example, the array [1,1,1,1] corresponds to four sixteenth notes and the array [1,0,0,0] corresponds to one quarter note (Figure 2 and Figure 3).

Figure 2. Rhythmic pattern corresponding to the array [1,1,1,1].

Figure 3. Rhythmic pattern corresponding to the array [1,0,0,0].

The density can range between an entire 4/4 bar of sixteenths and a minimum of a single four-quarter note. The corresponding arrays for one entire bar are [1,1,1,1] [1,1,1,1] [1,1,1,1] [1,1,1,1] for the sixteenths (maximum density), and [1,0,0,0] [0,0,0,0] [0,0,0,0] [0,0,0,0] for the four-quarter note (minimum density).

The continuum between the two extremes is discretised into sets of possible rhythms. Each set refers to a basic note length: two-quarter note, one-quarter note, eighth note, and sixteenth note. The sets contain regular and irregular patterns: for instance, the sixteenth-note set contains [1,0,1,1] but also [1,1,0,1], and the quarter-note set contains [1,0,0,0] but also [0,0,1,0]. According to the selected density, a pattern is chosen from the corresponding set.

To avoid illegal syncopation, the RGM saves information on the previous quarter and on which quarter of the bar is currently being computed. For instance, if the density value suddenly jumps from maximum to minimum, the system would compute the succession [1,1,1,1] [0,0,0,0], which produces an illegal rhythm. In this case the system bypasses the rhythm density and computes [1,1,1,1] [1,0,0,0] instead.

3.3 Melodies Generation Module (MGM)

At each quarter, the Melodies Generation Module (MGM) receives the four Rhythm Arrays and the current chord from the RGMs and the HGM, and computes the four melody arrays, one for each voice. The MGM outputs four streams of MIDI notes.

The computation of the notes of the arrays occurs in two steps.

1. The downbeat notes of the four voices (first position in the Rhythm Array) are filled with notes of the chord, as in first-species counterpoint. The first voice to be computed is the bass: it can be the fundamental or the third of the chord. It cannot be the fifth of the chord, to avoid the second inversion of the chord, which was not used in polyphonic music [19]. Then the other voices are computed, in the following order: tenor, alto, and soprano. A chord note is assigned to each voice. If the current Rhythm Array corresponds to the first quarter of the measure, and consequently to a new chord, the algorithm checks that it does not produce a parallel octave or fifth with the previous chord, in a two-step recursive procedure. In the first step, the system selects a note of the chord; in the second step, the system checks that this note does not produce an octave or fifth with the previous quarter. If the selected note produces an illegal parallel octave or fifth, the algorithm goes back to step one and chooses a different note. For the second, third, and fourth quarters of the measure the chord is the same, and the melodies merely reposition chord notes. As a consequence, it is not necessary to check for illegal octaves [31]. Keeping the harmony stable for four quarters was not common in historic counterpoint; however, we adopted this solution to guarantee the flexibility of the note length while keeping the harmony coherent.

2. The other notes of the four voices are computed. The algorithm enters the second step only if the density of the voice is higher than a quarter. Two main cases are identified: i) syncopation: as the chord is the same, the algorithm computes the note as a consonant arpeggio; ii) non-syncopation: the second eighth is computed with a chord note, and the sixteenth upbeat notes are computed as passing notes.

Given the real-time constraints, no prediction of successive notes can be made. Thus, in the current version of the system, some elegant ornamentations of fifth-species counterpoint (e.g. dissonant suspensions, neighbour notes, and double passing tones) are not implemented.

3.4 Synthesiser

The synthesiser is realized using simple FM synthesis with a different carrier-to-modulator ratio for neighbouring voices (bass FM ratio 2, tenor FM ratio 3, alto FM ratio 2, soprano FM ratio 3). The choice of using FM and avoiding samples was suggested by the expert in music pedagogy during the design process. Using a real instrument timbre could indeed cause the child to be positively or negatively emotionally influenced by a specific instrument, and this could decrease his attention on the specific exercise (e.g. the sound of a piano for the daughter of a pianist).

The synthesiser controls both the volume and the articulation. The articulation is manipulated by changing the ADSR (attack, decay, sustain, release) parameters, in a range between 02 02 0 0 for staccato and 200 100 1 100 for legato. These values follow the standard ADSR codification of the Max-MSP object. Attack, decay, and release are expressed in milliseconds. The sustain is a multiplicative factor used to calculate the volume during the sustain of a note; its value is normalized between 0 and 1.
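The Rhythm Array encoding and the MGM output format described above can be condensed into a short sketch. This is an illustrative Python fragment with our own function names, not the authors' Max/JS implementation:

```python
SIXTEENTHS = [1, 1, 1, 1]  # four sixteenth-note onsets in one quarter
QUARTER = [1, 0, 0, 0]     # a single quarter-note onset

def legal_succession(prev_quarter, proposed):
    """The RGM guard against illegal rhythm: a quarter with no onset
    right after four sixteenths would tie the last sixteenth across the
    beat into an illegal syncopation, so the density is bypassed and a
    quarter-note onset is forced instead."""
    if prev_quarter == SIXTEENTHS and proposed == [0, 0, 0, 0]:
        return QUARTER
    return proposed

def fill_melody(rhythm_array, pitches):
    """MGM output format: onsets carry a MIDI pitch, continuations carry
    the -1 flag that lets the previous note keep sounding."""
    it = iter(pitches)
    return [next(it) if onset else -1 for onset in rhythm_array]

def onsets(melody_array):
    """What the metronome scan triggers: (sixteenth index, pitch) pairs;
    -1 cells simply sustain the previous note."""
    return [(i, p) for i, p in enumerate(melody_array) if p != -1]

print(fill_melody(QUARTER, [60]))                  # [60, -1, -1, -1]: a held C
print(legal_succession(SIXTEENTHS, [0, 0, 0, 0]))  # forced to [1, 0, 0, 0]
```

Encoding continuation as a -1 flag rather than a note-length field is what allows the note length to remain undecided until the note actually ends, as the paper describes.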
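The two-step note selection of the MGM, which rejects chord tones that would create parallel octaves or fifths against an already-computed voice, can be sketched as a simplified two-voice check. Names and the fallback behaviour are ours; the paper does not publish its code:

```python
def parallel_motion(prev_lo, prev_hi, new_lo, new_hi, interval):
    """True if two voices form the same interval (semitones mod 12,
    0 = octave/unison, 7 = perfect fifth) before and after moving."""
    return ((prev_hi - prev_lo) % 12 == interval
            and (new_hi - new_lo) % 12 == interval
            and (prev_lo, prev_hi) != (new_lo, new_hi))

def choose_note(candidates, prev_self, prev_other, new_other):
    """Step one picks a chord tone; step two rejects it if it produces a
    parallel octave or fifth against the other voice, in which case the
    procedure recurs with the next candidate."""
    for note in candidates:
        if not any(parallel_motion(prev_other, prev_self, new_other, note, i)
                   for i in (0, 7)):
            return note
    raise ValueError("no legal chord tone available")

# The bass moves C -> D while the tenor held G: taking A (a fifth above D)
# would create parallel fifths, so the tenor falls back to F.
print(choose_note([57, 53], prev_self=55, prev_other=48, new_other=50))  # 53
```

A real four-voice implementation would run this check pairwise against every previously computed voice, but the rejection-and-retry structure is the same.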
4. ROBINFLOCK IN THE CONTEXT OF CHILD ORCHESTRA

This section presents a field application of Robinflock. We describe the practical adoption of Robinflock in a kindergarten and report the main observations collected in this context. Full details on the field study that involved Robinflock can be found in a dedicated paper [9].

4.1 Activities

Robinflock was used over a period of seven months in a local kindergarten. A total of 27 children (14 females) aged 3 and 4 took part in the study. They were divided into two groups: 15 children of 3 years (4 males) and 12 children of 4 years (9 males). Both groups followed the same cycle of 12 lessons, which were facilitated by a music teacher with specific expertise in music pedagogy for pre-schoolers. One kindergarten teacher was also present at each lesson to supervise the activities.

For this field study, two different uses of the system were adopted. In the first half of the lessons, Robinflock was controlled by the music teacher via a mobile application. In the second half of the lessons, the system was directly operated by the children by means of their movements, which were tracked via motion tracking (details on the interfaces below). The observations were conducted by three of the authors.

4.2 Interfaces

Two different interfaces were developed to control Robinflock: a web app, which allows the teacher to remotely manipulate the music using a smartphone, and a motion tracker, which allows the children to directly manipulate the music.

4.2.1 Web app

The web app was used by the teacher in the first phase of the field study to help children familiarise themselves with the music played by Robinflock and to propose a number of exercises that were specifically centred on polyphony.

The implementation of the app was based on web technologies (Node.js and JavaScript), which allow compatibility with smartphones and tablets. The capability of interacting wirelessly through a smartphone allowed the teacher to move freely in the room and to offer support to the children. The application was implemented using sliders and buttons typical of mobile UIs.

The mapping between the movement of the children and the music was discussed with the experts and adapted according to observation of the system. Initially, we planned to map the speed of the melodies to the speed of the children, and the articulation to the distance of the children's barycentre from the floor. During the field work, we realised that the manipulation of more than one parameter would have been too demanding for the children. As a consequence, we maintained only speed.

4.3 Observation

The experience in the kindergarten was evaluated by integrating observations, interviews, and more than 300 drawings sketched by the children at the end of each session. We summarise here the main considerations. Full details on the methodology adopted for the analysis can be found in [9].

The first important observation is that the children identified and reacted properly to the three parameters. We observed how children reacted to changes of musical parameters using specific exercises designed together with the music teacher. We asked children to change their behaviour, for instance to jump outside or inside a carpet, following the changes of the music. Volume was the most difficult parameter for some of the children, as they struggled to perform the exercise. By contrast, differences in articulation and speed were immediately perceived.

Robinflock was also particularly successful in helping children familiarise themselves with polyphonic music. This was assessed by observing how children reacted to changes in specific musical lines. For the polyphony exercises we also asked children to change their behaviour following the changes of the music; in this case, different children were asked to follow the changes of different voices.

Combining the observation of the children's behaviour with the discussion with the music teacher allowed us to understand that children had difficulties manipulating more than one parameter. We also observed that a four-line polyphony was too complex for children to manipulate. For this reason, we reduced the polyphony to two voices.

Despite these limitations, the observations revealed that Robinflock succeeded in encouraging children to actively participate in music making. Comments collected from the children confirmed this finding, as they commented on the experience stating that "I did the music" and "we played the music".

To summarize, Robinflock proved helpful in familiarising children with polyphonic musical structures and their peculiarities.
[13] J. Glover and S. Young, Music in the Early Years. Routledge, 2002.

[14] E. Gordon, A Music Learning Theory for Newborn and Young Children. GIA Publications, 2003.

[15] D. Herremans and K. Sörensen, "Composing first species counterpoint with a variable neighbourhood search algorithm," in Journal of Mathematics and the Arts 6.4, 2012, pp. 169-189.

[16] L. Hetland, "Listening to music enhances spatial-temporal reasoning: Evidence for the 'Mozart effect'," in Journal of Aesthetic Education 34.3/4, 2000, pp. 105-148.

[17] L. A. Hiller Jr. and L. M. Isaacson, "Musical composition with a high-speed digital computer," in Audio Engineering Society Convention 9, Audio Engineering Society, 1957.

[18] E. Jaques-Dalcroze, Rhythm, Music and Education. Read Books Ltd, 2014.

[19] K. Jeppesen, Counterpoint: The Polyphonic Vocal Style of the Sixteenth Century. Courier Corporation, 2013.

[20] S. Kalénine and F. Bonthoux, "Object manipulability affects children's and adults' conceptual processing," in Psychonomic Bulletin & Review 15.3, 2008, pp. 667-672.

[21] G. E. Lewis, "Interacting with latter-day musical automata," in Contemporary Music Review 18.3, 1999, pp. 99-112.

[22] A. Mann, The Study of Fugue. Courier Corporation, 1987.

[23] V. P. Martín-Caro and D. Conklin, "Statistical generation of two-voice florid counterpoint," in Proc. Sound and Music Computing Conference, 2016, pp. 380-387.

[24] F. Morreale and A. De Angeli, "Collaborating with an autonomous agent to generate affective music," in Computers in Entertainment, ACM, 2016.

[25] F. Morreale, A. De Angeli, R. Masu, P. Rota, and N. Conci, "Collaborative creativity: The Music Room," in Personal and Ubiquitous Computing 18(5), Springer, 2014, pp. 1187-1199.

[26] F. Morreale, R. Masu, and A. De Angeli, "Robin: an algorithmic composer for interactive scenarios," in Proc. Sound and Music Computing Conference (SMC 2013), Stockholm, 2013, pp. 207-212.

[27] F. Morreale, R. Masu, A. De Angeli, and P. Rota, "The Music Room," in CHI '13 Extended Abstracts on Human Factors in Computing Systems, ACM, pp. 3099-3102.

[28] G. Nierhaus, Algorithmic Composition: Paradigms of Automated Music Generation. Springer, 2009.

[29] F. Pachet, "The Continuator: Musical interaction with style," in Journal of New Music Research 32.3, 2003, pp. 333-341.

[30] S. Phon-Amnuaisuk and G. Wiggins, "The four-part harmonisation problem: a comparison between genetic algorithms and a rule-based system," in Proc. AISB'99 Symposium on Musical Creativity, London: AISB, 1999, pp. 28-34.

[31] W. Piston, Harmony. Norton, 1948.

[32] J. J. Polito, M. Daida, and T. F. Bersano-Begey, "Musica ex machina: Composing 16th-century counterpoint with genetic programming and symbiosis," in Proc. Int. Conf. on Evolutionary Programming, Springer Berlin Heidelberg, 1997, pp. 113-123.

[33] D. Pond, "A composer's study of young children's innate musicality," in Bulletin of the Council for Research in Music Education, 1981, pp. 1-12.

[34] R. Rowe, "Machine listening and composing with Cypher," in Computer Music Journal 16.1, 1992, pp. 43-63.

[35] A. Schoenberg, Theory of Harmony. Univ. of California Press, 1978.

[36] W. L. Sims, "The effect of high versus low teacher affect and passive versus active student activity during music listening on preschool children's attention, piece preference, time spent listening, and piece recognition," in Journal of Research in Music Education 34.3, 1986, pp. 173-191.

[37] J. M. Thresher, "The contributions of Carl Orff to elementary music education," in Music Educators Journal 50.3, 1964, pp. 43-48.

[38] R. Jackendoff, A Generative Theory of Tonal Music. MIT Press, 1985.

[39] I. Xenakis, Formalized Music: Thought and Mathematics in Composition. Pendragon Press, 1992.

[40] K. Wassermann, M. Blanchard, U. Bernardet, J. Manzolli, and P. F. Verschure, "Roboser: An autonomous interactive musical composition system," in Proc. ICMC, Berlin, 2000.
Philippe Kocher
Institute for Computer Music and Sound Technology
Zurich University of the Arts
philippe.kocher@zhdk.ch
ABSTRACT
[Figure: network diagram; recovered labels "Network", "Ext. Device", "Ext. Devices", "Other Device".]

As far as tempo polyphony is concerned, there are several works in 20th-century classical music that abandon the notion of a common tempo for all performers, of which the earliest examples can be found in the oeuvre of Charles Ives. In these works, different tempos are assigned to individual instruments or musical layers, and in most cases the
Figure 6. Performance of Quadrat. The musicians are positioned at the four corners of the auditorium, approximately 20 m apart.

Figure 7. Performance of Manduria. The musicians are grouped in small sub-ensembles and distributed in a gallery space.

is, approximately 20 metres apart from each other (Fig. 6). Especially the first section of this piece exhibits a constant change back and forth between synchronicity and independence. Every few bars, all instruments play a note exactly together. Between these notes they execute rhythmical structures in different tempos, or in tempos slowly drifting apart. This structure was well perceivable by the listener as a (tempo-)polyphony interspersed with synchronous chords.

Concerning the performance practice, the musicians had to develop a special (but quickly mastered) skill. Being positioned wide apart, they had to adjust their timing precisely to the beats indicated by the application, rather than using their hearing to adapt their playing to the other musicians. However, other aspects of musical interaction, such as control of intonation or dynamic balance, remained the same as in traditional musical practice.

One musical texture posed a problem: at some points, the four musicians have to play interlocking rhythms, which should result in a rhythmic pattern of regular semiquavers circling around the audience. In practice, this was not satisfactorily realisable, for two reasons. On the one hand, playing a tight interlocking rhythm does not simply need the aid of a metronome but rather a very subtle interaction between the performers, which cannot be established over large distances. On the other hand, the visual animation that indicates the tempo is less precise than the punctual beep of a click track. In most other cases this is a considerable advantage, as it does not give the musicians the feeling that they have to fight against the metronome when they get a little behind or ahead of the beat. But for the execution of tight rhythms this flexibility is not particularly helpful.

4.3 Manduria (2016)

Without knowledge of the aforementioned piece, the composer Stefan Wirth came up with a similar idea of spatial distribution of the musicians for his piece Manduria. Seven groups of one to three musicians each were placed around the audience (Fig. 7). The immersive sound thus produced was very impressive, not only because of its spatial quality but also because a dense polyphonic structure, or a superimposition of differently characterised musical layers, can be segregated more easily if the sound sources are distributed in space.

In accordance with this, it was Wirth's intention to assign an individual characteristic to every instrumental group and to produce a very heterogeneous texture. This led to the question of whether spatialisation is in fact musically more effective than tempo polyphony, and will therefore turn out to be the primary purpose of the application Polytempo Network in the future. Although it is too early to make a final assessment, one can certainly say that spatial effects are more straightforwardly composed than tempo polyphonies, which always involve a great deal of tedious mathematics.

Wirth, being an accomplished composer, relied on his experience and wrote a piece that worked well and was successfully performed, but he did not engage in experiments with complex tempo structures. The tempo polyphony in Manduria consists of the superimposition of speeds that can be expressed in notation, e.g. a 4/4 bar in the same time as a 5/4 bar, etc. Furthermore, the basic tempo MM = 72 is present throughout the piece, and all other tempos are proportionally related to it. This allowed Wirth to plot his composition in a traditional score with occasionally coinciding bar lines. That way, he could stay in his familiar working environment and maintain his habitual workflow, which eventually was necessary to deliver a piece of high quality in due time.

4.4 Tempo Studie (2016)

The short composition Tempo Studie was the author's attempt to realise a piece with an open form generated in real time. This trio for violin, bass clarinet and accordion utilises a computer algorithm that randomly chooses fragments and presents them to the performers. Not only the succession of fragments is random-based, but also the
are played several times in a row, in order to enable the comparison of the different variants and consequently the appreciation of the random-based form.

The question whether the availability of technology leads to novel forms of compositional strategies, posed at the beginning of this paper, has to be explored further. This requires more composers with different backgrounds to engage with this technology and create works that can serve as application scenarios. It is of particular importance that future works also involve different genres and styles.

As this is an ongoing project, the application Polytempo Network will be further developed and extended. Two features are to be added with high priority: first, the facility to annotate the score, which is much asked for by performers, and second, the possibility of wireless network communication, which would simplify the technical setup for a rehearsal or a concert a great deal.

Acknowledgments

I would like to thank my fellow composers Stefan Wirth and Andre Meier for their artistic contributions to this research project.

6. REFERENCES

[1] P. Kocher, "Polytempo Network: A System for Technology-Assisted Conducting," in Proceedings of the International Computer Music Conference, Athens, 2014, pp. 532-535.

... Korean Electro-Acoustic Music Society's 2014 Annual Conference, Seoul, 2014.

[10] C. Hope, A. Wyatt, and L. Vickery, "The Decibel ScorePlayer: New Developments and Improved Functionality," in Proceedings of the International Computer Music Conference, 2015, pp. 314-317.

[11] P. Kocher, "Rethinking Open Form," in Proceedings of the Generative Art Conference. Florence: Domus Argenia Publisher, 2016, pp. 339-349.

[12] P. Kocher, "Polytempo Composer: A Tool for the Computation of Synchronisable Tempo Progressions," in Proceedings of the Sound and Music Computing Conference, Hamburg, 2016, pp. 238-242.
mented with some elements of substitution, and so the concepts overlap. Multimodal presentations of certain classes of information might provide richer experiences. Vibrotactile stimuli can be used to enhance perceived low-frequency content, emphasize transients, and steer spatial auditory perception [10]. Philosophically, we think in terms of information channels rather than direct sensory equivalence. Such multimodal interactions will be subjects of future investigative work.

3. SPATIAL MUSIC

There has, in the late 20th and early 21st centuries, been burgeoning interest in spatial, surround, or three-dimensional music. The subject can be discussed in engineering, aesthetic, and perceptual terms [11][12][13]. The underlying principle is that spatial (as opposed to non-spatial) music might provide an enhanced experience in terms of involvement and immersivity. In information-transmission terms, incorporating spatial parameters facilitates greater information throughput, allowing finer detail to be depicted and discerned.

It is acknowledged that perceptual tasks in music listening differ from those in environmental listening. In the latter, requirements for timely detection of threat and reward are presumed to have exerted evolutionary influence on phylogenetic development. Notwithstanding, exaptation [14], whereby evolved mechanisms or capabilities become co-opted for other uses, provides that our spatial abilities are available for the experiencing of music.

Stereo [15] provides a loudspeaker-feed signal set that generates interaural differences (in amplitude and phase) at the ideal listening position, which can produce the powerful illusion of a left-right discriminable stage with multiple spatially separate musical sources, either static or moving. Additionally, spatial reverberant fields (captured or synthesized) can give some sense of ensemble depth (some sources closer than others) and spaciousness. The effect is of a proscenium-arch presentation. The stereo signal set can be listened to over headphones; however, the effect is then generally of a soundstage distributed left-right between the ears, giving a particular in-the-head experience. A binaural signal set can be used (either binaurally recorded or synthesized) to promote externalization (for a discussion see [16]), and in the optimal case, where the head-related transfer function (HRTF) used in the production of the signal set closely matches the HRTF of the listener, strong impressions of an externalized, three-dimensional environment can ensue. However, such close matching is rarely feasible, and the usual experience falls short of the theoretical optimum.

Surround sound, where a complex signal set is fed to multiple loudspeakers surrounding the listener(s), can depict many source locations and movements, and a sense of being immersed in a whole spatial environment. However, perceptions of depth-of-field (variations in perceiver-

The spatial qualia engendered by the various approaches differ: a large and complex system may well give experiences of large environments but may be less competent at producing intimate ones with sources close to the listener; the converse is generally the case with small, intimate systems. Composers of spatial music are thus constrained in what qualia they can attempt to offer. For discussion of spatial music compositional concerns, see for example [17].

In all the above cases, listeners with bilateral or unilateral hearing deficits will experience degraded spatial musical qualia, reducing immersion and impairing enjoyment of the material.

4. SPATIAL TISSUE CONDUCTION

Auditory perception elicited by means of mechanical transduction, i.e. a tuning fork pressed against the cranium, has long been known. Single vibrotactile transducers have been in use in audiology and the hearing-aid industry for decades. Until fairly recently, spatial audio was not thought possible through tissue conduction, because the theorised interaural level differences due to interaural attenuation were not considered sufficient; studies have shown this not to be the case [18][19][20]. In all three experiments to assess lateralisation, stimuli were presented bilaterally with transducers placed in contact with either the mastoid process behind the ear or the condyle just in front of the ear; all produced results similar to those of headphones. These experiments indicate that when sound is presented through tissue conduction we still make use of the same binaural cues as for air-conducted (AC) sound.

Auditory localization depends on the physiological and anatomical properties of the auditory system as well as on behavioral factors. The textbook primary cues for auditory localization are interaural differences and spectral cues [21][22][23]. The ridges and folds in the outer ear reflect and absorb certain frequency components of a sound wave, so the spectral characteristics of a sound wave will differ if it approaches the ear from different directions. Because the shape of the pinnae provides this filtering effect, the elevation and position of sound sources are encoded in direction-dependent spectral cues, allowing us to localize sound sources. Many literary sources agree that vertical information derives exclusively from position-dependent differences in the frequency-filtering properties of the external ear.

Whilst interaural differences akin to those of air conduction may result when sound is presented through tissue conduction, no sound is presented to the outer ear, specifically the pinnae, and vertical information should therefore be absent; some comments suggest this is not the case. This anomaly may arise from fine differences in arrival times caused by propagation along multiple signal pathways from transducer to the basilar membrane. There is also an intriguing possibility of multimodal cueing: binaural auditory cues merging with additional information provided through the somatosensory system via haptic cues [10][24][25][26].
source distance) remains limited. Systems range from When using a multiple transducer array vertical infor-
fairly simple (e.g. Dolby 5.1 surround) to complex (e.g mation is available to the listener as well as externalisa-
high-order ambisonics or wave field synthesis). tion of the perceived sound; how this is the case contin-
ues to be the subject of further investigation.
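As a numerical illustration of the interaural time difference (ITD) cue discussed above, the classic Woodworth spherical-head approximation can be sketched as follows; this is a generic textbook formula, not part of this study's apparatus, and the head radius is a typical assumed value:

```python
import math

def woodworth_itd(azimuth_deg, head_radius_m=0.0875, c=343.0):
    """Approximate ITD (seconds) for a source at the given azimuth
    (0 deg = straight ahead), assuming a rigid spherical head and
    speed of sound c in m/s. Woodworth model: (a/c) * (theta + sin theta)."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / c) * (theta + math.sin(theta))

# A source directly to one side (90 deg) yields the maximum ITD,
# roughly 0.65 ms for an average head.
print(round(woodworth_itd(90.0) * 1000, 3))
```

The monotonic growth of ITD with azimuth is one of the lateralisation cues that the cited experiments found to survive tissue-conducted presentation.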
SMC2017-140
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
5. APPARATUS, MUSICAL MATERIALS AND LISTENING CONDITIONS

5.1 Apparatus

Psychophysical investigation of the dimensions of experience of spatial tissue-conduction listening will prove useful, but for many first-time listeners, bases for comparison may be lacking; a training period targeting specific attributes may be required. Determining what those attributes might be is the aim of the present study.

Figure 1. Multiple transducer array. Sounds presented at: 1: left mastoid; 2: 25 mm above left temple; 3: between forehead and vertex; 4: 25 mm above right temple; 5: right mastoid.

Figure 2. BCT-2 10W transducer.

A prototype headset transducer array using five BCT-1 8 Ω, 90 dB (1 W/1 m) tactile transducers has been used to display a range of spatial soundscapes and music. Each transducer receives a discrete signal set through an individual amplifier; a Focusrite PRO 26 i/o interface provides a FireWire connection to a Mac mini running Reaper DAW. A single BCT-2 10W transducer was also available for listeners to position on the jaw, zygomatic arch or back of their head/neck. A set of banded-style 3M ear plugs was available for listeners, to compare the experience with the plugs in vs. out.

5.2 Listening Materials

Environmental and musical stimuli were processed using a variety of effects and routed in different formats: stereo, modified stereo, ambisonics and direct feed. A 1st-order ambisonic recording of a country park, captured using a Soundfield microphone, provides the ambient background; stereo recordings of bird sounds, a steam train and music, alongside mono FX clips, were used to create the soundscape. Signals were processed using Reaper DAW; signals were spatially encoded using WigWare 1st-order ambisonic panning and decoded through a WigWare 1st-order periphonic ambisonic decoder patched to the transducer array.

5.3 Listening Conditions

5.4 Limitations

Equipment and calibration: the transducers in use have the following known limitations:
Frequency response: 200 Hz to 16 kHz; low frequencies are not well served, resulting in a thin sound for some musical material.
Component matching: the manufacturers do not publish information on performance matching.

With a cohort of 100 and a wide variety of head sizes, precision in determining matched contact force for all transducers was infeasible, possibly resulting in different spatial experiences for different listeners. Additionally, as audiological testing was impossible, variations in hearing acuity could not be taken into account.

The demonstrations took place in an environment with high levels of ambient sound, especially in vocal ranges, entailing concomitant constraints on dynamic range and hence subtlety of detail.

The method of recording responses proved to be suboptimal, as many volunteers described the experience in greater detail verbally than subsequently on paper.

6. RESPONSES AND ANALYSIS

The responses were tabulated for analysis to identify key themes. Of interest were the variations in descriptive language across such a mixture of untrained listeners varying in age, gender, expertise and listening ability. A broad synonymic approach was taken, whereby terms were loosely grouped to form themes. So, for instance, the category weird included terms such as eerie, strange and unusual.
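A minimal sketch of such synonymic grouping; the theme names and synonym sets below are illustrative (only the weird/eerie/strange/unusual grouping comes from the text, the rest are assumed for the example):

```python
# Illustrative synonym-to-theme coding. Only the "weird" grouping
# (eerie, strange, unusual) is taken from the paper; the other
# theme vocabularies are invented for this sketch.
THEMES = {
    "weird": {"weird", "eerie", "strange", "unusual"},
    "spatial": {"spatial", "surround", "3d", "external"},
    "positive": {"good", "great", "pleasant", "interesting"},
}

def code_response(text):
    """Return the set of themes whose terms appear in a free-text response."""
    words = {w.strip(".,;!").lower() for w in text.split()}
    return {theme for theme, terms in THEMES.items() if words & terms}

print(code_response("Strange but interesting, sounded external."))
```

Counting theme co-occurrences over all coded responses then yields tallies of the kind reported in the next section.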
2) Female, age 36, stage manager, non-musician.

We found that the attribute class that mapped most strongly to positive expressions was the interesting category: 35% of comments included interesting and positive descriptors. This was followed by the spatial class: 38% had used spatial terms, and 32% had used spatial and positive terms. Those that referred to the way they felt about the experience also correlated highly with positive: 26% featured positive and feelings terms. Clarity was referred to in conjunction with positive comments in 19% of cases; vibrations were mentioned along with positive comments in 15% of cases.

7. DISCUSSION

Results may indicate that tissue conduction is of more utility to some than to others; variations in comments might also indicate variances in biomechanical and/or neurological auditory processing.

Some volunteers (3%) reported some degree of bilateral or unilateral hearing deficit, but nevertheless reported in spatial and positive terms. Some others (2%) reported spatial anomalies that might indicate a degree of unilateral deficit (the sound field sounded shifted to the left), but it was not within our experimental purview to comment or diagnose. Likewise, some that had used very positive and clarity terms may actually have been observing differences between their normal air-conducted hearing and this experience.

The surprising concomitance of reports of vibrations (which we might have thought was an undesirable percept) and positive comments prompts us to speculate that programme-material-modulated haptic input can contribute to the experience.

Notably, in the case of the weird category (23%), weird and positive comments appeared in conjunction in 11% of all comments; there may be an overlap between the weird and the interesting categories, depending on individuals' use of language. It does appear that the novelty of the experience may be conflated with positive reports, and this in itself does not imply improvement in informational throughput.

…attributes may be favored in different applications, of which personal music listening is only one. Similarly, it may be that a single spatial music encoding regime will not be appropriate for all listeners.

This work has enabled us to identify the following development areas for future research:
Technological: improved signal processing, improved transduction, improved apparatus comfort, developments in multimodal stimuli.
Methodological: precise characterization of listener hearing capabilities; investigation of training periods and of individual preferences for encoding; parameterization of qualia for spatial music listening.

Possible benefits of competent spatial tissue-conduction apparatus include:
Enhanced quality of life for those with conductive hearing loss, through access to personal music listening.
Augmented private perception where unimpeded air-conducted hearing is required.
Diagnostic procedures to identify and isolate conductive hearing loss components.
Improved methodologies for the investigation of multimodally-augmented perception.

9. REFERENCES

[1] G. Martin, W. Woszczyk, J. Corey and R. Quesnel, "Controlling Phantom Image Focus in a Multichannel Reproduction System," in Proc. 107th Convention of the Audio Engineering Society (AES 107), New York, 1999.

[2] J. Blauert and W. Lindemann, "Auditory spaciousness: Some further psychoacoustic analyses," Journal of the Acoustical Society of America, 80(2), 1986, pp. 533-542.
…what is a sound event pattern and what is a motion event pattern in a textural excerpt. We may suspect that this interweaving of sound and motion also extends to aesthetic and affective features, such that e.g. subjective impressions of energy and gait in musical experiences combine sound and body motion sensations.

We here have the non-trivial task of trying to differentiate what goes on in complex drum set textures. First of all, this may be regarded as a more conceptual or ontological issue of trying to distinguish the ingredients in the multimodal mix. We should ask questions about what constitutes a sound event and what constitutes a motion event. Furthermore, we should ask what the criteria are for fusion of small-scale events into larger-scale events, e.g. between what is perceived as individual impacts and what is perceived as a more fused stream of onsets, as in a fill. Also, it will be useful to try to distinguish between what is strictly sound-producing body motion and what may be more sound-facilitating or even sound-accompanying motion of a drummer, e.g. recognizing what may be more communicative or theatrical body motion in a drum set performance [1].

Additionally, we need to take into consideration the acoustic features of the instruments as well as the logistics of the drum set instrument layout, as this is the spatial framework for the choreography of sound-producing body motion.

The next challenge will be to have a reasonably good understanding of what goes on in the production of these (at times quite complex) sound-motion textures. This will involve recording the body motion of the drummer, first of all the hands and arms, but also the whole torso and the head, as well as the feet. The use of both feet in drum sets necessitates balancing the body in the seated position so as to enable free motion of all four limbs (right and left hands/arms, right and left feet). Besides entailing various … producing body motion, as well as to make extensive correlations between these two domains.

Starting with the instruments, we need to consider how the instruments work, i.e. know about their mode of excitation and resonant features, and, importantly, the logistics of their positioning. Needless to say, one of the key elements of drum set performance is the instrument layout, enabling the performer with minimum effort, and at times also great speed, to alternate between the different instruments, cf. the layout of instruments depicted in Figure 1.

In earlier recordings, an acoustical drum set was used, but this required the use of so-called active markers for the infrared motion capture system, because the shiny metal parts of the instruments totally confused the tracking system, so that passive markers could not be used. The use of active markers turned out to be both cumbersome (requiring wiring of the mallets) and not sufficiently reliable, making us use a MIDI drum set (Roland TD-20) in subsequent recording sessions. Using MIDI drums made the use of ordinary passive markers possible, and also gave us precise onset timing from the MIDI data. In our subjective opinion, the sound quality of the MIDI instruments was satisfactory, i.e. seemed to reflect well the different playing modes as well as the resonant instrument properties, and produced satisfactory contextual overlap for sound coarticulation (more on coarticulation below).

Figure 1. Drum set layout with MIDI note numbers: ride cymbal (51), crash #1 (55), crash #2 (52), crash #3 (28), 1st tom (48), 2nd tom (43), 3rd tom (41), snare drum (38), hi-hat (26, 42, 44, 46), bass drum (36).
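The MIDI note numbers in Figure 1 make it straightforward to label onset data by instrument; a small sketch (the `label_onsets` helper is illustrative, not code from the study — only the note-to-instrument mapping is taken from the figure):

```python
# MIDI note numbers of the drum set instruments, as given in Figure 1.
DRUM_MAP = {
    36: "bass drum", 38: "snare drum", 41: "3rd tom", 43: "2nd tom",
    48: "1st tom", 51: "ride cymbal", 55: "crash #1", 52: "crash #2",
    28: "crash #3", 26: "hi-hat", 42: "hi-hat", 44: "hi-hat", 46: "hi-hat",
}

def label_onsets(onsets):
    """Map (time_s, midi_note) onset pairs to (time_s, instrument) pairs.
    Illustrative helper; unknown notes are labelled 'unknown'."""
    return [(t, DRUM_MAP.get(note, "unknown")) for t, note in onsets]

print(label_onsets([(0.0, 36), (0.25, 42), (0.5, 38)]))
# [(0.0, 'bass drum'), (0.25, 'hi-hat'), (0.5, 'snare drum')]
```

Onset streams labelled this way can then be aligned with the motion capture trajectories discussed below.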
Although infrared marker-based motion capture systems presently seem to be the most versatile method for collecting data on sound-producing body motion, they also entail some serious challenges, first of all with the at times very rapid motion in expert drumming. Besides the mentioned issue of unwanted reflections, there is also the issue of dropouts and, most seriously, of not being able to completely capture what goes on even with very high frame rates (sample rates). As shown in [4], it may be necessary to have very high frame rates when studying detailed and fast motion.

Furthermore, capturing the precise moment of impact is not possible even with very high frame rates (in theory, it would require approaching infinitely high sampling rates), so the use of a MIDI drum set is a practical solution in this regard also. As can be seen in Figure 2, there is a discontinuity in the motion trajectories around the point of impact, i.e. at the minima of the motion trajectories. That said, the point of impact may of course be estimated from the position data of the recorded motion frames by various means, such as by functional data analysis (see e.g. [5]).
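As a simple stand-in for the impact estimation mentioned above (the paper points to e.g. functional data analysis; the local-minimum search below is only a rough sketch on toy data):

```python
def estimate_impacts(displacement, fs=240.0):
    """Estimate impact times (seconds) as local minima of a sampled
    vertical displacement trajectory, with fs the frame rate in Hz.
    A rough sketch; not the estimation method used in the study."""
    impacts = []
    for i in range(1, len(displacement) - 1):
        if displacement[i - 1] > displacement[i] < displacement[i + 1]:
            impacts.append(i / fs)
    return impacts

# A toy trajectory with dips (impacts) at samples 2 and 6:
traj = [10, 6, 2, 6, 10, 6, 1, 6, 10]
print(estimate_impacts(traj))
```

Because the true impact generally falls between frames, such frame-level estimates inherit the quantization error that motivates using the MIDI onsets instead.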
[Figure 2 panels: drumstick vertical displacement (mm) vs. time (s), and motion spectrum vs. frequency (Hz), for the slow and fast tempi.]
Figure 2. Two motion tracking examples using passive markers attached to a drumstick, recorded at 240 Hz. The two examples had different tempi (slow, fast), with mean velocities of 1448 mm/s and 3564 mm/s; their peak velocities were 10096 mm/s and 11666 mm/s. Although the mean velocity of the fast motion was almost 2.5 times higher than that of the slow motion, the peak velocities were almost the same. This means that the end velocities of single strokes do not vary significantly with tempo. Also, every time the marker reached the lowest position (point of collision), we observed discontinuities in the trajectories in both cases. It is interesting to note that the dominant frequency of the motion was less than 10 Hz (see the right column), yet the frame rate of 240 Hz was not enough to fully capture the motion.
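Mean and peak velocities of the kind reported in the caption can be obtained from sampled displacement by finite differences; a minimal sketch on toy data (not the recorded trajectories):

```python
def velocities(displacement_mm, fs=240.0):
    """First-difference estimate of speed (mm/s) from a displacement
    trajectory sampled at fs Hz. Illustrative; not the study's code."""
    return [abs(displacement_mm[i + 1] - displacement_mm[i]) * fs
            for i in range(len(displacement_mm) - 1)]

def mean_and_peak(displacement_mm, fs=240.0):
    v = velocities(displacement_mm, fs)
    return sum(v) / len(v), max(v)

# Toy stroke: the marker drops 30 mm over two frames, then returns.
traj = [900, 890, 870, 880, 900]
mean_v, peak_v = mean_and_peak(traj)
print(mean_v, peak_v)
```

Higher-order derivatives (acceleration, jerk) follow by applying the same differencing repeatedly, as discussed in the preprocessing section below.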
With our focus on sound-motion textures, we have the additional challenge of finding out not only what goes on in singular strokes, but also of using an ensemble of markers to capture the overall motion involved in more composite textures.

The motion capture data will, even with much care taken to avoid the problems presented above, usually have some need for preprocessing: typically removing spurious reflections, gap-filling dropouts, assigning markers to points on the body, and applying skeleton models. Usually, there will be a need for further processing that can result in representations as graphics and animations of the motion trajectories, as well as of derivatives, in particular velocity, acceleration, and jerk, and of more cumulative measures such as quantity of motion. Processing is typically done in Matlab and associated toolboxes [6] or other software, e.g. Mokka, and the hope is to develop means for more synoptic overviews, as suggested by Figure 4.

Further on, there are challenges of detecting and representing musically salient features from the continuous stream of motion data, challenges that border on issues of motor control and perception, such as:
Accurate onset detection, both of sound and of motion.
Representing multiple trajectories (i.e. markers) in graphs, see e.g. Figures 4 and 5 below.
Discerning and understanding variability vs. constancy in trajectories, cf. the point made by Nikolai Bernstein of "repetition without repetition", meaning that we very often find approximate (i.e. non-exact) similarities between motion chunks in human behavior [7, p. 1726].

4. CONSTRAINTS

Needless to say, there are a number of constraints on non-electronic sound production in music. First of all, there are constraints of the instruments, such as their repertoire of possible output sounds and modes of performance. This includes various constraints such as the required force of excitation, the reverberation time, the need for damping,
and also the need for positioning the effectors prior to impacts, in turn entailing important logistic elements in the layout of the instruments in the drum set that have consequences for the sound-producing body motion:
All instruments must be reachable from the sitting position.
The drummer must be seated so that both the right and the left foot can be used independently.

And there are of course many constraints on human sound-producing body motion: limitations on speed and amplitude of motion and on endurance, the need for rests, and the need to minimize effort in order to be able to sustain such motion for extended periods of time and to avoid strain injury. In addition to such mostly biomechanical constraints, there are motor control constraints such as the need for hierarchical and anticipatory control. In sum, we have the following main constraints on the production of sound-motion textures:

There are basic motion categories that seem to be grounded in a combination of biomechanical and motor control constraints: impulsive, also sometimes referred to as ballistic, meaning a very rapid and brief muscle contraction followed by relaxation, such as in hitting; iterative, meaning a rapid back-and-forth motion, such as in shaking the wrist or in rolling the hand; sustained, meaning continuous motion and mostly continuous effort, such as in stroking, though this category is not so common in drumming (found in scraping the cymbal, or brushes on drums). Such a motion typology is clearly at work in shaping musical sound; hence, the labels may equally well apply to sound [8].

All motion takes time, i.e. instantaneous displacement is not possible for the human body, and this in turn will lead to so-called temporal coarticulation in relation to sound onset events, i.e. a contextual smearing of motion from event to event [9].

Motion of end effectors, e.g. hands or feet, may also recruit motion of more proximal effectors, e.g. arms, shoulders, torso, and even the whole body. This leads to so-called spatial coarticulation, meaning the recruitment of more effector elements than the end-point effectors, depending on the specific task [9].

Coarticulation is an emergent phenomenon, cf. [10, p. 592]: "a control mechanism in which the different motor elements are not simply linked together in the correct sequence, but are also tuned individually and linked synergistically based on the final goal, with coarticulation observed as an emergent phenomenon." Furthermore, as argued in [10, p. 592], coarticulation is related to chunking, i.e. to "the integration of independent motor elements into a single unit. With chunking, there can be an increase of coarticulation and reduced cognitive demands because less elements are organized for a given motor goal" … "Chunking is also a critical element for automatization to emerge."

Left-hand/right-hand alternation is a way around the constraints of speed, allowing for extremely rapid figures, similar to tremolo alternations in one-hand figures on various other percussion instruments and on keyboards, and actually related to the back-and-forth shake motion of wrist tremolo on string instruments and on the triangle (or other confine-based instruments such as tubular bells).

Phase relationships of left-right, cf. [2], suggest different levels of stability depending on the phase relationships.
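The speed advantage of hand alternation noted above can be illustrated by interleaving two single-hand onset trains; a toy sketch (the stroke rates are arbitrary illustration values, not measured limits):

```python
def alternating_onsets(single_hand_rate_hz, duration_s):
    """Interleave two hands, each limited to single_hand_rate_hz strokes
    per second, with the left hand offset by half a period: the combined
    onset rate doubles. Toy illustration only."""
    period = 1.0 / single_hand_rate_hz
    right = [i * period for i in range(int(duration_s * single_hand_rate_hz))]
    left = [t + period / 2 for t in right]
    return sorted(right + left)

onsets = alternating_onsets(8.0, 1.0)  # each hand at 8 strokes/s
print(len(onsets))                     # 16 onsets in one second
```

The same interleaving logic underlies the ta-ka-ta-ka articulation example discussed in the hierarchies section below.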
[Figure 3 panels: vertical displacement (mm) vs. time (s) for right-foot markers (RTOE, RHEEL, RKNE) and left-foot markers (LTOE, LHEEL, LKNE), and bass drum MIDI onsets (pitch 36), over 10-30 s.]
Figure 3. The increasing density of motion events leads to a phase-transition, similarly to how event proximity may lead to coarticulatory fusion. Here we see how a repeated and accelerating right-foot/left-foot bass drum pedal motion leads to a phase-transition after approximately 20 seconds, with the disappearance of the dip after the impact as well as a diminishing of the amplitude of motion, something that seems reasonable given the speed constraints of the human body.
In addition to fusion by coarticulation, we can also find instances of fusion by so-called phase-transitions in the sound-producing motion [9]. Phase-transition means that a series of otherwise independent events may fuse into what is perceived as a continuous stream because of increased event density, and, conversely, a continuous stream of events may change into a series of distinct events with a decrease in event density. The transition from single strokes to a drum roll, or from single tones to a tremolo or trill, are common examples of this. In Figure 3, we see an example of such a phase-transition in the bass drum foot pedals of the drum set. With an acceleration, we see the change in the motion shape and, at the fastest tempo, also a decrease in the amplitude of the motion.

With both coarticulation and phase-transition we have cases of fusion, which we in general refer to as chunking, and which seem to emerge from a combination of biomechanical and motor control constraints (for the moment, it seems not possible to say exactly what is due to what), i.e. giving us a bottom-up factor in the shaping of sound-motion textures. Additionally, it seems that skill in performance is also about minimization of effort, and that virtuoso instrument performance in general gives an overall impression of ease, something that seems related to the exploitation of motion hierarchies.

5. HIERARCHIES

Using both hands and both feet in at times very dense and rhythmically complex drum set passages is clearly demanding, both in terms of effort and of motor control. From what is known of human motor control (see e.g. [11]), such skillful behavior necessitates that the motion be partially automatic, i.e. needs some kind of control hierarchy. Likewise, the layout of the human body will inherently result in some motion hierarchies, i.e. the fingers, attached to hands, attached to arms, shoulders, torso, and whole body; hence, we have hierarchies both in the spatial and in the temporal domain, similar to having coarticulation in both the spatial and temporal domain.

Spatial hierarchies are evident in task-dependent effector mobilization: for soft and slow textures, there is usually only a need for the most distal effectors, e.g. hand and wrist, whereas in louder, faster and denser textures, it may be necessary to recruit more proximal effectors, e.g. elbow, shoulder, and even torso.

As for temporal hierarchies, these are probably the most crucial for achieving high levels of performance skill. Such hierarchies are again clearly dependent on extensive practice, practice resulting in the exploitation of various synergies (i.e. cooperation of muscle groups) and hierarchies, i.e. top-down 'commands' at intermittent points in time [12]. Hierarchies then go together with a basic discontinuity in human motor control, and are associated with anticipatory motion [13], as well as with high levels of preprogramming, i.e. with what has been called action gestalts [14]. The turn towards discontinuity in human motor control in some recent research is interesting for musical applications, because the demands for very rapid motion would indeed go better with intermittent motor control than with more traditional models of so-called 'closed-loop', i.e. continuous, control, because of the latter's slowness [15].

More specifically, we believe there is evidence for how seemingly complex bimanual tasks can be carried out by a human performer given a hierarchical approach. As suggested in [3], a polyrhythmic passage may be thought of as a singular rhythmic pattern: e.g. a 3-against-4 pattern may be thought of as a series of one dotted eighth note, one sixteenth note, two eighth notes, one sixteenth note, and one dotted eighth note. This very simple example may be seen as what in [2, p. 103] is called a case where "a dual task becomes a single task", and which the authors believe works because "the spatial patterns produced by the 2 hands form a geometric arrangement that can be conceptualized as a unified representation."

Another aspect of facilitating hierarchies is that of alternations between effectors and/or effector positions. It is well known that, for instance, a tremolo on a keyboard or with two mallets on a percussion instrument is much easier using a left-right-left-right hand rolling motion than a series of up-down motions of the whole hand. Likewise, in the vocal apparatus, it is possible to make a very fast articulation when saying ta-ka-ta-ka-tam, whereas saying ta-ta-ta-ta-tam is only possible at a slower rate. In the drum set, this alternation between effectors can lead to very complex passages, and although the motor control factors enabling such virtuosity remain to be explored, the idea of some kind of hierarchical and intermittent control scheme does seem quite plausible.

On the perceptual side, there are probably also sound hierarchies, i.e. the reverberation of the instruments contributing to the basic smearing, the phase-transition leading to fusion (e.g. iterative rather than impulsive sound), and the coarticulatory inclusion of successive sounds (sometimes also simultaneous sounds, e.g. bass drum + the other drums) contributing to holistically experienced chunks of sound-motion textures.

6. SOUND-MOTION OBJECTS

From a combined sound and motion perspective, it seems clear that sound-motion textures in drum set performance are both perceived, and in their production conceived, as chunks, gestalts, or holistic entities; entities that may vary in content, e.g. between single slow strokes and ferocious fills, but are typically experienced as units, as what we like to call sound-motion objects. The main reasons for the idea of sound-motion objects are the following:
Intermittent control, based on the need for anticipation and on the possibilities of action gestalts in human motor control [12, 14, 15].
Intermittent energy infusion, in particular with ballistic motion, i.e. biomechanics in combination with instrument physics (requiring impact excitation), and biomechanical demands for rests [9].
Coarticulation, which is a constraint-based phenomenon that results in object-level fusion, also related to phase-transition in cases of repetitive motion [9].
Chunk timescale being musically most salient, meaning that the timescale of approximately 0.5 to 5 seconds is the most salient for most musical features, cf. the work of Pierre Schaeffer on sonic objects and similar work on timescales in music [8].
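The 3-against-4 decomposition mentioned in Section 5 can be checked numerically by merging the two evenly spaced onset grids; a small sketch (times in sixteenth-note units over a span of twelve sixteenths):

```python
# Merge a 4-onset grid (spacing 3 sixteenths) with a 3-onset grid
# (spacing 4 sixteenths) and list the inter-onset intervals of the
# resulting composite pattern.
def composite_pattern(span=12):
    grid_a = list(range(0, span, 3))   # 4 onsets: 0, 3, 6, 9
    grid_b = list(range(0, span, 4))   # 3 onsets: 0, 4, 8
    onsets = sorted(set(grid_a + grid_b)) + [span]
    return [b - a for a, b in zip(onsets, onsets[1:])]

# Dotted eighth (3), sixteenth (1), eighth (2), eighth (2),
# sixteenth (1), dotted eighth (3) -- as described in the text.
print(composite_pattern())  # [3, 1, 2, 2, 1, 3]
```

Treating the merged sequence as one pattern is exactly the move by which the "dual task becomes a single task" in the cited account.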
SMC2017-149
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
The notion of sound-motion objects is then not only in accordance with some basic features of perception and motor control, but also advantageous in view of various music theory and/or analytic perspectives. With such a central perspective of sound-motion objects, we believe we are in a better position to interpret the various sound and motion data that we collect in our research:
Sound-motion textures may be regarded as holistic objects, often repeated, often with variations, but variation-tolerant in retaining identity, cf. Bernstein's idea of "repetition without repetition" mentioned above [7].
Also in cases of gradually increasing density of fills, e.g. from none, by way of various upbeats and syncopations, to very dense and energetic fills, a hierarchical scheme of salient points in the music, i.e. what could be called "anchoring" [7], seems to be at work.
Focusing on sound-motion objects could also be useful for studying the aesthetic and affective aspects of drum set music, i.e. for studying how the various motion features, in particular the quantity of motion, the velocity, and the jerk (derivative of acceleration), may contribute to the overall experience of the music.
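As a toy illustration of segmenting an onset stream into chunk-timescale sound-motion objects (the gap threshold and data here are illustrative assumptions, not the study's method):

```python
def segment_objects(onset_times, gap_s=0.5):
    """Group a sorted list of onset times (seconds) into chunks wherever
    the inter-onset gap exceeds gap_s. Illustrative sketch only."""
    chunks = [[onset_times[0]]]
    for t in onset_times[1:]:
        if t - chunks[-1][-1] > gap_s:
            chunks.append([t])
        else:
            chunks[-1].append(t)
    return chunks

# A sparse groove followed, after a pause, by a dense fill:
onsets = [0.0, 0.4, 0.8, 1.2, 2.5, 2.6, 2.7, 2.8, 2.9]
print(len(segment_objects(onsets)))  # 2 chunks
```

With suitable thresholds, chunk durations tend to fall in the 0.5-5 s range singled out above as the musically most salient timescale.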
[Figure 4 panels: vertical displacement (mm) vs. time (s); angular velocities (rad/s) of the right (RELB, RWR) and left (LELB, LWR) elbows and wrists; and MIDI pitch vs. time, over 30.5-33.5 s.]
Figure 4. Motion of markers and MIDI onsets vs. time during the ending fill after playing a steady slow groove. Top panel: the vertical displacement of markers on the player's right ankle (lower dotted line), right toe (lower blue line), right hand (upper dotted trace), left hand (dash-dotted line), right stick (upper blue line) and left stick (red line). As can be seen, the traces of the hand markers tend to precede the stick movements slightly, but otherwise maintain a similar shape, albeit with lower amplitude. The ankle marker also precedes the toe marker, but its movement is larger. Middle panels: the angular velocities of the right and left wrists and elbows. Bottom panel: MIDI onset times and drum information.
In our studies of sound-motion textures, we have seen how rather demanding passages may be performed seemingly with ease and elegance, and after quite extensive reviews of our recordings, we believe that the principle of sound-motion objects is crucial both for the performer and for the listener. As an example of this, consider the three-second fill passage from the end of a slow eight-measure groove in Figures 4 and 5. Subjectively, the passage is a different object from the main pattern of the slow groove, but it is right on the beat of the groove. In the course of the fill, the drummer also makes a right torso rotation, probably in order to have the effectors in the optimal position for the final impacts of the fill.

In Figure 4, we see the motion of the hands and the feet in this three-second fill, as well as a zoomed-in image of the angular velocities of the right and left elbows and wrists, indicative of the motion towards the impact points in the fill, and the MIDI onsets of these impact points.

In Figure 5, we see a 3-D rendering of the right and left drumsticks and hands in the same three-second groove, demonstrating the more large-scale motion also involved in this fill.
SMC2017-150
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
Figure 5. The movement of markers attached to the drummer's right/left hands and drumsticks during the final fill depicted in Figure 4 (from 31.4 s to 33.5 s). Every 8th sample point is marked with a circle, and a line has been drawn between the marker on the stick and the one on the hand, connecting the two. The movement starts on the (upper) right-hand side of the figures, where the right hand and stick are lifted in preparation for the first stroke in the fill while the left hand is at a lower amplitude (compare Figure 4). Thereafter the arms move to other drums on the left of the figure, and, finally, back to the snare.
ABSTRACT
1. INTRODUCTION
We previously developed an automatic accompaniment system called Eurydice [1, 2] that can handle tempo changes and note insertion/deviation/substitution mistakes in the accompanied human performance, as well as repeats and skips. Although this system can deal with various mistakes, the musical score of the accompaniment is fixed in Eurydice, since the system follows the player's performance by matching the performance against the music score.
To appropriately accompany improvised performances, the accompaniment system needs to estimate the musical intention of the improvising player. Such systems are called jazz session systems. Previous jazz session systems [3-7] require heuristic rules and human labeling of training data to estimate the musical intention of human players. In contrast, our approach is to statistically learn the relationship between the performances of the different instruments from performance MIDI data, as well as from information contained in lead sheets, in order to avoid the need for parameters set by the developer.
We previously developed an offline jazz piano trio synthesizing system [8] utilizing a database of performances from which appropriate accompaniment performances of bass and drums are selected based on characteristics of the piano MIDI input. Experimental results demonstrated the system's ability to meaningfully accompany piano performances to a certain degree.
This paper describes an offline jazz piano trio synthesizing system based on free generation of accompaniment performances without using the performance database. Instead, the system learns the interrelationship of the instrument performances and their time-series characteristics using hidden Markov models and deep neural networks. The approach follows a recent trend of using deep learning, especially long short-term memory recurrent neural networks (LSTM-RNN) [9], in automatic music generation. The composition system DeepBach [10] uses a bidirectional LSTM to generate Bach-style music. Chu et al. [11] use an LSTM to generate pop music.
As illustrated in Fig. 1, the modeled jazz piano trio consists of a piano, whose performance MIDI data is used as input, and a bass as well as drums, whose performance MIDI data is generated by the computer.
Copyright: © 2017 Takeshi Hori et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
2. MATHEMATICAL MODEL OF THE JAZZ SESSION SYSTEM

2.1 Trajectory model

We describe a musical performance as a trajectory in musical feature space [12]. At every point in time during the performance, the musical features of the current performance state (tempo, harmony, etc.) identify a position in the musical feature space. Consequently, a jazz musician moves through this feature space while performing, resulting in a performance trajectory.
In the case of a jazz session comprising multiple instruments, one can observe a correlation between the trajectories of the different instruments during a good session. Therefore, our aim is to implement a statistically trainable model that is able to learn the correlation between the instruments' trajectories. This correlation is used to generate the performance trajectories of the accompaniment instruments (bass and drums) from the performance trajectory of the main instrument (piano).

2.2 The problems and solution strategies

Since the amount of available jazz performance data was relatively small, three problems had to be addressed in order to use machine learning methods effectively. Firstly, the number of musical features that could possibly be used to describe performances is large, resulting in a very high-dimensional feature space. To keep the dimensionality relatively low, we chose a set of features especially suited to jazz music. Secondly, since the data in feature space is sparse, an interpolation method is required to generate continuous performance trajectories through this space. This problem is addressed by segmenting the feature space into subspaces using a continuous mixture hidden Markov model. Thirdly, the jazz session system needs to learn the non-linear dependency between the performance trajectories of the different instruments, which is achieved using neural networks.

2.2.1 High dimensionality

We selected 68 features that are extractable from the music performance at every bar. The features were chosen based on the opinions of jazz musicians, and are the following:

Piano-specific features
- The range between the highest and lowest notes.
- The number of notes contained in the current diatonic scales, as well as the numbers of tension notes, avoid notes, and blue notes.
- The ratios of diatonic notes, tension notes, avoid notes, and blue notes to the total number of notes in the current bar.

Bass-specific features
- The range between the highest and lowest notes.

Drums-specific features
- The number of notes played by the hi-hat, snare drum, and crash cymbal, and the total number of all other notes.

Common features
- The number of notes, the number of simultaneously played notes, and the average MIDI velocities (loudness).
- The ratio of off-beat notes to all notes in the current bar.
- The ratios of all features (except for the off-beat ratio) with respect to the previous bar, as well as the ratios with respect to the complete performance.

We refer to these features as style features in the following.

2.2.2 Trajectory continuity

To generate performance trajectories from anywhere in feature space, one has to discretize or interpolate the available music performance data. In the proposed jazz session system, this is achieved by clustering in the musical feature spaces of the accompaniment instruments. As a first step, the musical feature vectors are clustered using a continuous mixture hidden Markov model. The observation probabilities related to each hidden state of the HMM are described using a Gaussian mixture model (GMM). Based on a preliminary analysis of the data, the number of hidden states was chosen to be 4, using 4 GMM clusters per state. As a result, the trajectory model is approximated by a stochastic state transition model. A musical performance corresponds to a Viterbi path in the HMM, and the interaction between instruments to the co-occurrence of state transitions.
To increase precision, we further segment each HMM state internally, using a GMM to cluster the vectors of deviation from the HMM state's centroid.
The GMMs approximate a distribution as a mixture of normal distributions, and their updating algorithm applies the auxiliary function method. In the case of the HMM, the state transition probabilities and the states' GMMs are optimized jointly using the auxiliary function method. The auxiliary function has to fulfill Jensen's inequality, formulated as follows:

F(Σ_i λ_i f_i(x; θ)) ≤ Σ_i λ_i F(f_i(x; θ))  (1)
=: G(λ, θ)  (2)
s.t. Σ_i λ_i = 1,  (3)

where G(λ, θ) is the auxiliary function, λ = (λ_1, ..., λ_n) are the auxiliary variables, θ is a parameter, and F(·) is a convex function. x is either a style feature vector (when obtaining the HMM states) or a vector of deviation from an HMM state's centroid (for state-internal clustering).
3.1 Training phase

The training phase consists of the following six steps:

1. Key estimation: estimating the tonic key of every bar of the performances in the training data.

2. Feature extraction: extracting the performance features and computing the style features of all instruments at every bar.

3. Data segmentation: segmenting the performance feature spaces of bass and drums using a continuous mixture HMM.

Fig. 3 illustrates the training process described above.

3.1.1 Long short-term memory (LSTM)

An LSTM [9, 14] is a type of recurrent neural network which can effectively handle time-series data. In our system, an LSTM is used to match piano performance features to the HMM states which are obtained as clusters in the feature spaces of the two other instruments (see Section 2.2.2). This essentially corresponds to identifying non-linear clusters in the piano performance feature space which are related to those of the other instruments. This dependency is illustrated by the red arrows in Fig. 3. The LSTM is trained using MIDI data of jazz piano trio performances.
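The paper does not detail how step 1 (key estimation) is performed. A common baseline is Krumhansl-Schmuckler-style template correlation, sketched below under that assumption (major keys only; profile values from Krumhansl and Kessler — both the method choice and the function name are illustrative, not from the paper):

```python
import math

# Krumhansl-Kessler major-key profile; using it here is an assumption,
# since the paper does not specify its key-finding method.
MAJOR_PROFILE = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19,
                 2.39, 3.66, 2.29, 2.88]

def estimate_key(pitch_classes):
    """Return the most likely major-key tonic (0 = C ... 11 = B) for a
    bar, by correlating its pitch-class histogram with the profile
    rotated to every candidate tonic."""
    hist = [0.0] * 12
    for pc in pitch_classes:
        hist[pc % 12] += 1.0

    def corr(a, b):
        ma, mb = sum(a) / 12.0, sum(b) / 12.0
        num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        den = math.sqrt(sum((x - ma) ** 2 for x in a)
                        * sum((y - mb) ** 2 for y in b))
        return num / den if den else 0.0

    # rotate the profile so that index k becomes the tonic
    scores = [corr(hist, MAJOR_PROFILE[-k:] + MAJOR_PROFILE[:-k])
              for k in range(12)]
    return max(range(12), key=scores.__getitem__)
```

A production system would also need minor-key profiles and duration weighting, but the rotation-and-correlate pattern is the core idea.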
4. EXPERIMENTAL EVALUATION

We trained our jazz piano trio synthesizing system using MIDI data of jazz piano trio performances of 12 different songs. We used the Python library Theano to implement the neural networks, whose properties are listed in Table 1. The 12 training songs contain a total of about 2000 bars, of which about 150 bars were used as the validation set. The matching error of the LSTM, which matches piano performance vectors to performance states of the other instruments, was approximately 26% in the case of the bass and 49% in the case of the drums. The matching error of the DBN, which matches piano feature vectors to HMM state centroid deviation vector clusters, was approximately 28% in the case of the bass and 38% in the case of the drums.

Table 1. System parameters.

  HMM parameters
    Number of states: 4
    Number of mixtures: 4
    Epochs: 100
  GMM parameters (deviation from center)
    Number of mixtures: 8
    Epochs: 200
  LSTM parameters
    Learning rate: 0.0001
    Number of middle-layer units: 30
    Momentum: 0.9
    Dropout: 0.5
    Epochs: 5000
  DBN parameters
    Learning rate: 0.00001
    Number of hidden units (RBM): 50
    Momentum: 0.9
    Weight decay: 0.001
    Objective average activation: 0.01
    Coefficient for sparse regularization: 0.1
    Number of layers: 5
    Prior epochs: 10000
    Epochs: 100000

We let the system generate accompaniment performances for another song and asked 4 students to rate the result using a five-point grading scale (2 students had more than ten years of musical experience). We then compared the scores with the evaluation results of previously used methods based on the selection of performance data from a database [18]. These methods include:

Random: Only harmony constraints are met, and both melody and rhythm pattern are drawn randomly from the performance database.

Nearest neighbor: For each bar, the system searches the database to find the piano performance with the closest resemblance to the input piano performance, and the associated accompaniment performances are used for the respective bar.

NMF: The piano performance feature space is clustered using non-negative matrix factorization (NMF), and the nearest neighbor method is applied within the resulting clusters.

NMF + co-occurrence + trigram: The other instruments' feature spaces are clustered as well (using NMF). To identify the instrument performance interrelationship, co-occurrence matrices of the different instruments' performance feature clusters are obtained. Additionally, for each instrument, trigram probabilities are computed. The accompaniment performances are then chosen by maximizing the trigram probability combined with the co-occurrence probability.

Figure 6. Results of the comparative evaluation experiment. Each evaluation score is the average of the grades given by the four research participants.

The evaluation results are shown in Fig. 6. A significant improvement over the previously used methods was observed. While the best previous method achieved a score of 3.25, the proposed system reached a score of 4.25.

5. CONCLUSION

We modeled a jazz piano trio session as the set of performance trajectories of the trio's instruments. To be able to learn from a small amount of data, we approximated the trajectory model by a stochastic state transition model. We used an LSTM network as well as a DBN to learn the correlation between the performances of the different jazz piano trio instruments from the performance MIDI data of jazz songs. Based on the estimated performance states of the accompaniment instruments, the system generates the melody and rhythm of the accompaniment.
The proposed jazz session system received a higher evaluation score than our previous system. This confirms that the use of subspace clustering is an effective means to increase the precision of the machine learning methods used. However, there is still room for improvement, as the amount of available training data was relatively small, and further investigation of the system and style features might be fruitful.
Another possibility to improve the system might be the use of context clustering. In our system, the context will
be the musical structure or the harmony context (chord progression dependencies). Yoshimura et al. [19] have observed that context clustering can be useful for increasing the effectiveness of the machine learning methods utilized in natural speech synthesizing systems.
Furthermore, one could investigate how to obtain more training data, replace the parametric clustering methods with non-parametric ones, or enable the system to handle audio input data instead of MIDI data. Moreover, our future goal is to develop a real-time jazz session system.

6. REFERENCES

[1] H. Takeda, N. T, S. Sagayama et al., "Automatic accompaniment system of MIDI performance using HMM-based score following," IPSJ SIG Technical Reports, pp. 109–116, 2006.

[2] E. Nakamura, H. Takeda, R. Yamamoto, Y. Saito, S. Sako, S. Sagayama et al., "Score following handling performances with arbitrary repeats and skips and automatic accompaniment," IPSJ Journal, vol. 54, no. 4, pp. 1338–1349, 2013.

[3] S. Wake, H. Kato, N. Saiwaki, and S. Inokuchi, "Cooperative musical partner system using tension parameter: JASPER (jam session partner)," IPSJ Journal, vol. 35, no. 7, pp. 1469–1481, 1994.

[4] I. Hidaka, M. Goto, and Y. Muraoka, "An automatic jazz accompaniment system reacting to solo," IPSJ SIG Technical Reports, vol. 1995, no. 19, pp. 7–12, 1995.

[5] M. Goto, I. Hidaka, H. Matsumoto, Y. Kuroda, and Y. Muraoka, "A jazz session system for interplay among all players I: System overview and implementation on a distributed computing environment," IPSJ SIG Technical Reports, vol. 1996, no. 19, pp. 21–28, 1996.

[6] I. Hidaka, M. Goto, and Y. Muraoka, "A jazz session system for interplay among all players II: Implementation of a bassist and a drummer," IPSJ SIG Technical Reports, vol. 1996, no. 19, pp. 29–36, 1996.

[7] M. Hamanaka, M. Goto, H. Aso, and N. Otsu, "Guitarist simulator: A jam session system statistically learning players' reactions," IPSJ Journal, vol. 45, no. 3, pp. 698–709, 2004.

[11] H. Chu, R. Urtasun, and S. Fidler, "Song from pi: A musically plausible network for pop music generation," arXiv preprint arXiv:1611.03477, 2016.

[12] T. Hori, K. Nakamura, and S. Sagayama, "Statistically trainable model of jazz session: Computational model, music rendering features and case data utilization," IPSJ SIG Technical Reports, vol. 2016, no. 18, pp. 1–6, 2016.

[13] G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[14] A. Graves, "Supervised sequence labelling," in Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012, pp. 5–13.

[15] Y. Freund and D. Haussler, "Unsupervised learning of distributions on binary vectors using two layer networks," in Advances in Neural Information Processing Systems, 1992, pp. 912–919.

[16] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.

[17] M. Tsuchiya, K. Ochiai, H. Kameoka, and S. Sagayama, "Probabilistic model of two-dimensional rhythm tree structure representation for automatic transcription of polyphonic MIDI signals," in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific. IEEE, 2013, pp. 1–6.

[18] T. Hori, K. Nakamura, and S. Sagayama, "Automatic selection and concatenation system for jazz piano trio using case data," in Proc. ISCIE International Symposium on Stochastic Systems Theory and its Applications, vol. 2017, 2017.

[19] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," The IEICE Transactions, vol. 83, no. 11, pp. 2099–2107, 2000.
AUDIO-VISUAL SOURCE ASSOCIATION FOR STRING ENSEMBLES
THROUGH MULTI-MODAL VIBRATO ANALYSIS
[System overview diagram: source association from video via hand tracking, fine-grained motion analysis, and association optimization.]

ABSTRACT

With the proliferation of video content of musical performances, audio-visual analysis becomes an emerging topic. [...] trajectory of each note of each instrument from the audio mixture in a score-informed fashion. Then, we detect vibrato notes by inspecting the periodicity of the pitch fluctuation. On the visual side, we first track the left hand of each string player and then capture the fine motion of the hand using optical flow, resulting in a 1-D hand displacement curve along the principal motion direction. For each source-player pair, we define a matching score based on the cross-correlation between the concatenated pitch trajectories of all vibrato notes and the concatenated displacement-curve segments in the corresponding time ranges. Finally, the bijection between sources and players that maximizes the overall matching score is output as the association result.
This is a complementary extension to our previous work on source association [11], which uses the correlation between bowing strokes and score note onsets. When vibrato is used, the proposed approach offers a more accurate and robust solution, as the correlation between the audio and visual aspects is now performed through the entire process of vibrato notes instead of only the onsets, as in our previous work. This advantage is shown by the better association performance on 19 video recordings of chamber music performances that use at most one non-string instrument. In the following, we first describe related work on audio-visual association in Section 2. We then describe the proposed approach in Section 3. We evaluate the system in Section 4 and conclude the paper in Section 5.

2. RELATED WORK

There has been some previous work on audio-visual source association for speech and music signals. Hershey and Movellan [9] first explored audio-visual synchrony to locate speakers in video. In [12], a region-tracking algorithm was applied to segment the video into the temporal evolution of semantic objects to be correlated with audio energy. Casanovas et al. [13] applied non-linear diffusion to extract the pixels whose motion is most consistent with changes of audio energy. In [14], source localization results are incorporated into graph-cut-based image segmentation to extract the face region of the speaker. Other methods include time-delayed neural networks [15], probabilistic multi-modal generative models [16], and canonical correlation analysis (CCA) [17, 18]. All of the above-mentioned systems, however, assume that there is at most one active sound source at a time.
When multiple sources are active simultaneously, cross-modal association can be utilized to isolate the sounds that correspond to each of the localized visual features. Barzelay and Schechner [19, 20] detected drastic changes (i.e., onsets of events) in audio and video and then used their coincidence to associate audio-visual components belonging to the same source of harmonic sounds. Sigg et al. [21] reformulated CCA by incorporating non-negativity and sparsity constraints on the coefficients of the projection directions, to locate sound sources in movies and separate them by filtering. In [22], the audio and video modalities were decomposed into relevant structures using redundant representations for source localization. Segments where only one source is active are used to learn a timbre model for the separation of that source. These methods, however, only deal with mixtures with at most two active sources. The difficulty of source association increases dramatically as the source number increases.
The audio-visual source association problem for chamber music ensembles is more challenging, since several sound sources are active almost all the time. In [11], we proposed the first approach to audio-visual source association for string ensembles in a score-informed fashion where there are up to five simultaneously active sources (e.g., string quintets). This approach analyzes the large-scale motion from the bowing strokes of string-instrument players and correlates it with note onsets in the score track. The assumptions are that most note onsets correspond to the beginning of bowing strokes, and that different instrumental parts show different rhythmic patterns. When these assumptions are invalid, for example when multiple notes are played within a single bow stroke (i.e., legato bowing) or when different parts show a similar rhythmic pattern, this approach becomes less robust.
The analysis of the fingering-hand motion due to vibrato articulations and its correlation to pitch fluctuations in this paper is a complementary extension of [11]. It extends the large-scale motion analysis of the bowing hand to the fine-grained motion analysis of the fingering hand. It also extends the audio-visual correlation from the discrete note onsets to the entire processes of vibrato notes. As vibrato articulations are common practice for string-instrument players (at least for skilled players), and the vibrato parameters (e.g., rate, depth, and phase) of different players are quite different, this vibrato cue provides robust and complementary information for audio-visual source association for string ensembles.

3. METHOD

3.1 Audio Analysis

The main goal of the audio analysis is to detect and analyze vibrato notes in a score-informed fashion. There has been much work on vibrato analysis for both singing voices [23] and instruments [24], but most of it deals only with monophonic music. Methods that do deal with polyphonic music only analyze vibrato in the melody in the presence of background music [25]. The more challenging problem of vibrato detection and analysis for multiple simultaneous sources is, however, rarely explored. In this paper, we first separate the sound sources using a score-informed source separation approach [26] and then perform pitch analysis on each source to extract and analyze vibrato notes.

3.1.1 Score-informed Source Separation

Musical sound sources are often strongly correlated in time and frequency, and without additional knowledge the separation of the sources is very difficult. When the musical score is available, as for many Western classical music pieces, the separation can be simplified by leveraging the score information. In recent years, many different models
[Figure: pitch (MIDI number) vs. time (s) of the audio and MIDI input.]
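The vibrato detection of Section 3.1 inspects the periodicity of the pitch fluctuation. A minimal sketch of that idea is shown below: a brute-force DFT over candidate fluctuation rates on a mean-removed pitch trajectory. The rate range, extent threshold, and function name are illustrative assumptions, not the paper's actual detector:

```python
import math

def detect_vibrato(pitch, fs=100.0, rate_lo=4.0, rate_hi=9.0,
                   min_extent=0.1):
    """Flag a note as vibrato if its mean-removed pitch trajectory
    (in semitones, sampled at fs Hz) shows a dominant periodicity in
    a typical vibrato-rate range. Thresholds are illustrative."""
    n = len(pitch)
    mean = sum(pitch) / n
    x = [p - mean for p in pitch]
    best_rate, best_mag = 0.0, 0.0
    # brute-force DFT magnitude over candidate fluctuation rates
    f = 1.0
    while f <= 15.0:
        re = sum(x[i] * math.cos(2.0 * math.pi * f * i / fs)
                 for i in range(n))
        im = sum(x[i] * math.sin(2.0 * math.pi * f * i / fs)
                 for i in range(n))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_rate, best_mag = f, mag
        f += 0.25
    extent = (max(pitch) - min(pitch)) / 2.0
    is_vibrato = rate_lo <= best_rate <= rate_hi and extent >= min_extent
    return is_vibrato, best_rate
```

A real implementation would work note-by-note on score-segmented pitch trajectories and use a proper spectrum or autocorrelation, but the periodicity test is the essential cue.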
the rough position of the left hand. This method was originally proposed in [34]. It extracts feature points using the minimum-eigenvalue algorithm in a given region specified in the first video frame and then tracks these points throughout the video based on assumptions such as brightness constancy and spatial coherence. For our implementation (see Fig. 3), we manually specify a rectangular bounding box (70 × 70 pixels) to cover the left hand in the first frame and initialize 30 feature points within the bounding box using a corner-point detector. We then track the points in the following frames and relocate the center of the bounding box to the median x- and y-coordinates of the points. In order to prevent the loss of tracking due to abrupt motion or occlusions, we re-initialize the feature points every 20 frames if some points disappear.

3.2.2 Fine-grained Motion Capture

To capture the fine motion of the left hand, we apply optical flow estimation [35] on the original video frames to calculate pixel-level motion velocities. For each frame, the average motion velocity over all pixels within the bounding box is calculated as u(t) = [u_x(t), u_y(t)]. Notice that the pixel motions in the bounding box contain not only the player's hand vibrato motion but also his/her overall slower body motion while performing the music. This motion not

where v is the eigenvector corresponding to the largest eigenvalue of the PCA of v(t). We then perform an integration of the motion velocity curve over time to calculate a motion displacement curve as

X(t) = ∫₀ᵗ V(τ) dτ.  (5)

This hand displacement curve corresponds to the fluctuation of the vibrating length of the string and hence to the pitch fluctuation of the note. Fig. 5 plots the extracted motion displacement curve X(t) of one player along with the pitch trajectory. We find that when vibrato is detected from the audio (labeled with green rectangles), there are always observable fluctuations in the motion.

[Figure 5 panels: pitch (MIDI number) and motion displacement curve X(t) vs. time.]
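The projection onto the principal motion direction followed by the integration in Eq. (5) can be sketched as follows. This is a pure-Python stand-in: the closed-form 2-D eigenvector and the rectangular-rule integration are implementation choices, not taken from the paper:

```python
import math

def displacement_curve(u, dt):
    """Project per-frame 2-D mean velocities u = [(ux, uy), ...] onto
    the principal motion direction (leading PCA eigenvector, computed
    in closed form for the 2x2 case) and integrate over time, as in
    Eq. (5)."""
    n = len(u)
    mx = sum(p[0] for p in u) / n
    my = sum(p[1] for p in u) / n
    sxx = sum((p[0] - mx) ** 2 for p in u) / n
    syy = sum((p[1] - my) ** 2 for p in u) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in u) / n
    # principal-axis angle of the 2x2 covariance matrix
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    v = (math.cos(theta), math.sin(theta))
    V = [p[0] * v[0] + p[1] * v[1] for p in u]   # V(t) = v . u(t)
    X, acc = [], 0.0
    for vel in V:                                # X(t) = integral of V
        acc += vel * dt
        X.append(acc)
    return X
```

Note that the sign of an eigenvector is arbitrary, so the curve may come out inverted; this does not affect a correlation-based matching score beyond its sign convention.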
the same time range. All of the curves have been normalized to have zero mean and unit root-mean-square (RMS) value. Clearly, Player 1 is a better match. Quantitatively, we measure the similarity between the pitch trajectory of each note of the p-th source and the hand displacement curve of the q-th player in the corresponding time range by

m^[p,q](i) = exp( ⟨F^[p](t_i), X^[q](t_i)⟩ / ( ‖F^[p](t_i)‖ ‖X^[q](t_i)‖ ) ),  (6)

where t_i = [t_i^on, t_i^off] represents the time range of the i-th note from its onset t_i^on to its offset t_i^off, F and X are the normalized pitch trajectories and hand displacement curves, respectively, and ⟨·, ·⟩ represents the vector inner product. Then the overall matching score between the p-th source and the q-th player can be defined as the sum of the similarities of all vibrato notes:

M^[p,q] = Σ_{i=1}^{N_vib^[p]} m^[p,q](i),  (7)

where N_vib^[p] is the number of detected vibrato notes of the p-th source.
For an ensemble with K players, there are K! bijective associations (permutations). Let σ(·) be a permutation function; then the p-th audio source will be associated with the σ(p)-th player. For each permutation, we calculate an overall association score as the product of all pair-wise matching scores, i.e., S = ∏_{p=1}^{K} M^[p,σ(p)]. We then rank all of the association candidates in descending order and return the first candidate as the final association.

4. EXPERIMENTS

We evaluate the proposed approach on the University of Rochester Musical Performance (URMP) dataset [36] (http://www.ece.rochester.edu/projects/air/projects/datasetproject.html). This dataset contains individually recorded but synchronized audio-visual recordings of all instrumental parts of 44 classical ensemble pieces ranging from duets to quintets. For our experiments, we use the 11 string ensemble pieces and 8 other pieces that use only one non-string instrument, totaling 19 pieces: 5 duets, 4 trios, 7 quartets, and 3 quintets. Fig. 8 shows screen shots of the performance videos of all 19 test pieces. Most pieces have a length of 2-3 minutes, providing enough cues for source association. Detailed information about these pieces is listed in Table 1.
The audio is sampled at 48 kHz and processed with a frame length of 42.7 ms and a hop size of 10 ms for the Short-Time Fourier Transform (STFT). The video resolution is 1080p, and the frame rate is 29.97 frames per second. The audio and video streams have been carefully synchronized in the dataset. The estimated pitch trajectories are interpolated to make the first pitch estimate correspond to time 0 ms. The estimated hand displacement curves are also interpolated to give them the same sample rate as the pitch trajectories, i.e., 1 sample every 10 ms. For non-string instruments, where the vibrato is not generated by hand motion, we neither track the hand nor capture the motion; instead, we set the hand displacement curve of these players in Eq. (5) to all zeros.

Figure 7. Box plots and scatter plots of the note-level matching accuracy of the string-instrument sources of all pieces, grouped by polyphony. Each data point represents the accuracy of one source. The blue lines mark the random-guess accuracies. The numbers above the plots are the median/mean values (duets 95.5/96.8, trios 72.7/70.9, quartets 64.8/59.7, quintets 57.8/53.3).

4.1 Note-level Matching Accuracy

As the matching score matrix M^[p,q] used to find the best source association is the sum of the matching scores of all vibrato notes m^[p,q](i), we first evaluate how good the note-level matching scores are. To do so, we calculate a note-level matching accuracy, defined as the percentage of vibrato notes of each source that best match the correct player according to the note-level matching score. Fig. 7 shows box plots and scatter plots of this measure for all sources (excluding the non-string instruments) of all pieces, grouped by their polyphony. Each point represents the accuracy of one source, and the numbers above show the median/mean values. The blue lines mark the accuracies based on random guessing; e.g., the chance of randomly matching a note to the correct player in a quartet is 25%. We find that this accuracy drops as the polyphony increases because of 1) the larger number of incorrect candidates and 2) the lower quality of the pitch trajectories of vibrato notes extracted from high-polyphony audio mixtures. Outliers in the figure are sources that contain only a few vibrato notes.
It is noted that a poorer note-level matching accuracy does not necessarily lead to a bad piece-level source association, because the latter considers all the vibrato notes in a piece. In fact, as long as the note-level matching is somewhat accurate, the ensemble decision considering all of the vibrato notes at the piece level is quite reliable, as illustrated in the following experiment.

4.2 Piece-level Source Association Results

We then evaluate the piece-level source association accuracy. The estimated association for each piece is considered correct if and only if all the sources are associated to the correct players. As the number of permutations is the factorial of the polyphony, this task becomes much
SMC2017-163
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
Figure 8. Screen shots of the 19 test pieces. Due to limited conditions at the recording facilities, all players faced in the same direction, and some parts of certain players were not captured by the camera. These artifacts might have made the visual analysis of left-hand motion easier than in real-world performance recordings.
Table 1. Detailed information of the 19 test pieces together with the source association performance of the proposed approach. Abbreviations of instruments are: violin (Vn.), viola (Va.), cello (Vc.), double bass (D.B.), flute (Fl.), oboe (Ob.), clarinet (Cl.), saxophone (Sax.), and trumpet (Tp.).
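The exhaustive ranking of player permutations described above can be sketched in a few lines; this is an illustrative reimplementation of the idea (assuming NumPy), not the authors' code:

```python
from itertools import permutations

import numpy as np

def associate_sources(M):
    """Rank all K! source-to-player assignments by their association score.

    M[p, q] is the piece-level matching score between audio source p and
    player q (the sum of the note-level scores of source p's vibrato notes).
    The score of a permutation sigma is S = prod_p M[p, sigma(p)]; the
    highest-scoring permutation is returned as the final association.
    """
    K = M.shape[0]
    ranked = sorted(
        ((np.prod([M[p, perm[p]] for p in range(K)]), perm)
         for perm in permutations(range(K))),
        reverse=True)
    best_score, best_perm = ranked[0]
    return best_perm, best_score, ranked

# Toy quartet example: the diagonal (correct) scores dominate, so the
# identity assignment ranks first among the 4! = 24 candidates.
M = np.array([[0.9, 0.2, 0.1, 0.1],
              [0.1, 0.8, 0.2, 0.1],
              [0.2, 0.1, 0.7, 0.2],
              [0.1, 0.1, 0.2, 0.6]])
perm, score, ranked = associate_sources(M)
```

The full ranking is kept because the evaluation below also reports the rank of the correct association among all permutations, not just the top candidate.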
more challenging for pieces with a high polyphony. In our experiments, the proposed approach outputs the correct association for 18 of the 19 pieces, a success rate of 94.7%. This outperforms our previous method [11], whose accuracy was 89.5% on the same pieces. Among the 19 pieces there are in total 65 individual sources, and 63 of them (96.9%) are associated with the correct player, whereas in our previous system only 58 sources were correct.

Table 1 lists the association results on all of the test pieces. For each piece, we show the number of correctly associated sources and the rank of the correct association among all permutations. The proposed approach fails only marginally on one piece (No. 8), where the correct association is ranked second. This failure case is a trio containing a clarinet in which the two string instruments always play plucked (pizzicato) notes, for which vibrato is not generally used.

The main reason that the proposed approach outperforms our previous system [11] on this dataset, as stated in the introduction, is that the vibrato analysis provides more reliable cues for association than the note onset analysis in our previous system. Vibrato, if present, often occurs throughout a note; vibrato patterns of different players are also distinct even when they play notes with the same rhythm. We acknowledge that for performances in which vibrato is not used, the proposed approach would fail. Nonetheless, it provides a complementary approach to our previous system, and combining the two approaches is a future direction of research.

5. CONCLUSIONS

We proposed a methodology for source association by synergistic analyses of vibrato articulations from both audio and video modalities for string ensembles. Specifically, we found that the fine-grained motion of the fingering hand shows a strong correlation with the pitch fluctuation of vibrato notes, which was utilized to solve source association in a score-informed fashion. Experiments showed a high success rate on pieces of different polyphony and proved that vibrato features provide more robust cues for source association than the bowing motion features utilized in our previous work. This technique enables novel and richer music enjoyment experiences that allow users to isolate/enhance sound sources by selecting the players in the video. For future work, we would like to combine the vibrato cues and the bowing cues to achieve more robust association results. We also would like to explore scenarios where the score is not available.

6. REFERENCES

[1] C. Vernallis, Unruly Media: YouTube, Music Video, and the New Digital Cinema. Oxford University Press, 2013.
[2] A. Bazzica, C. C. Liem, and A. Hanjalic, "Exploiting instrument-wise playing/non-playing labels for score synchronization of symphonic music," in Proc. Intl. Society for Music Information Retrieval (ISMIR), 2014.
[3] A. Hrybyk, "Combined audio and video analysis for guitar chord identification," Master's thesis, Drexel University, 2010.
[4] A. Bazzica, J. C. van Gemert, C. C. Liem, and A. Hanjalic, "Vision-based detection of acoustic timed events: a case study on clarinet note onsets," in International Workshop on Deep Learning for Music (DLM), in conjunction with the International Joint Conference on Neural Networks (IJCNN), Anchorage, Alaska (USA), 2017.
[5] K. Dinesh, B. Li, X. Liu, Z. Duan, and G. Sharma, "Visually informed multi-pitch analysis of string ensembles," in Proc. IEEE Intl. Conf. Acoust., Speech, and Signal Process. (ICASSP), 2017.
[6] S. Dahl and A. Friberg, "Visual perception of expressiveness in musicians' body movements," Music Perception: An Interdisciplinary Journal, vol. 24, no. 5, pp. 433-454, 2007.
[7] J. Scarr and R. Green, "Retrieval of guitarist fingering information using computer vision," in Proc. Intl. Conf. Image and Vision Computing New Zealand (IVCNZ), 2010.
[8] D. Murphy, "Tracking a conductor's baton," in Proc. Danish Conf. Pattern Recognition and Image Analysis, 2003.
[9] J. R. Hershey and J. R. Movellan, "Audio vision: Using audio-visual synchrony to locate sounds," in Proc. Neural Inform. Process. Syst., 1999, pp. 813-819.
[10] B. Li, Z. Duan, and G. Sharma, "Associating players to sound sources in musical performance videos," in Late Breaking Demo, Intl. Society for Music Information Retrieval (ISMIR), 2016.
[11] B. Li, K. Dinesh, Z. Duan, and G. Sharma, "See and listen: Score-informed association of sound tracks to players in chamber music performance videos," in Proc. IEEE Intl. Conf. Acoust., Speech, and Signal Process. (ICASSP), 2017.
[12] K. Li, J. Ye, and K. A. Hua, "What's making that sound?" in Proc. ACM Intl. Conf. Multimedia, 2014, pp. 147-156.
[13] A. L. Casanovas and P. Vandergheynst, "Audio-based nonlinear video diffusion," in Proc. IEEE Intl. Conf. Acoustics, Speech, and Signal Process. (ICASSP), 2010, pp. 2486-2489.
[14] Y. Liu and Y. Sato, "Finding speaker face region by audiovisual correlation," in Proc. Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications, 2008.
[15] R. Cutler and L. Davis, "Look who's talking: Speaker detection using video and audio correlation," in Proc. IEEE Intl. Conf. Multimedia and Expo, vol. 3, 2000, pp. 1589-1592.
[16] J. W. Fisher and T. Darrell, "Speaker association with signal-level audiovisual fusion," IEEE Trans. Multimedia, vol. 6, no. 3, pp. 406-413, 2004.
[17] E. Kidron, Y. Y. Schechner, and M. Elad, "Cross-modal localization via sparsity," IEEE Trans. Signal Process., vol. 55, no. 4, pp. 1390-1404, 2007.
[18] H. Izadinia, I. Saleemi, and M. Shah, "Multimodal analysis for identification and segmentation of moving-sounding objects," IEEE Trans. Multimedia, vol. 15, no. 2, pp. 378-390, 2013.
[19] Z. Barzelay and Y. Y. Schechner, "Harmony in motion," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1-8.
[20] Z. Barzelay and Y. Y. Schechner, "Onsets coincidence for cross-modal analysis," IEEE Trans. Multimedia, vol. 12, no. 2, pp. 108-120, 2010.
[21] C. Sigg, B. Fischer, B. Ommer, V. Roth, and J. Buhmann, "Nonnegative CCA for audiovisual source separation," in Proc. IEEE Workshop on Machine Learning for Signal Processing, 2007, pp. 253-258.
[22] A. L. Casanovas, G. Monaci, P. Vandergheynst, and R. Gribonval, "Blind audiovisual source separation based on sparse redundant representations," IEEE Trans. Multimedia, vol. 12, no. 5, pp. 358-371, 2010.
[23] L. Regnier and G. Peeters, "Singing voice detection in music tracks using direct voice vibrato detection," in Proc. IEEE Intl. Conf. Acoust., Speech, and Signal Process. (ICASSP), 2009, pp. 1685-1688.
[24] S. Rossignol, P. Depalle, J. Soumagne, X. Rodet, and J.-L. Collette, "Vibrato: detection, estimation, extraction, modification," in Proc. Digital Audio Effects Workshop (DAFx), 1999, pp. 1-4.
[25] J. Driedger, S. Balke, S. Ewert, and M. Müller, "Template-based vibrato analysis of music signals," in Proc. Intl. Society for Music Information Retrieval (ISMIR), 2016.
[36] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, "Creating a musical performance dataset for multimodal music analysis: Challenges, insights, and applications," IEEE Trans. Multimedia, submitted. Available: https://arxiv.org/abs/1612.08727
ABSTRACT

Stage acoustic preferences and their impact on musicians have been traditionally studied through two approaches: in

This paper uses the real-time auralization system introduced in [5] to conduct experiments on stage acoustics preferences of solo musicians. The auralization method is validated, and a method to modify the directional response of an auralized room is proposed. Two experiments are described: a study of room acoustics preference depending on the performance context, and a study on the preference of directional early energy.

Copyright: © 2017 Sebastià V. Amengual Garí et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
[Plots: frequency responses (FRF, dB) of rooms DST, BS, and KH in the 0-1000 ms, 20-1000 ms, 100-1000 ms, and 500-1000 ms time intervals; auralization error (dB, with 1 and 2 JND bounds); T30 (s); and C80 (dB), each plotted against frequency (100 Hz to 10 kHz) for the real room and its auralization.]
Figure 3: Comparison of frequency response (first row) and auralization error (second row) in different time intervals, and
monaural room parameters (third and fourth row) of the measured rooms and their auralizations. Solid and dashed lines on
the first row correspond to the real room and the auralization, respectively.
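The monaural parameters compared in Figure 3 follow standard room-acoustic definitions (cf. ISO 3382-1 [12]). As an illustrative sketch of one of them, not the authors' analysis code, the clarity index C80 can be computed from a measured or auralized impulse response as:

```python
import numpy as np

def clarity_c80(h, fs):
    """Clarity index C80 in dB from an impulse response h sampled at fs Hz.

    C80 = 10 * log10(E_early / E_late), where E_early is the energy of the
    squared impulse response in the first 80 ms and E_late the remainder.
    """
    n80 = int(round(0.080 * fs))
    energy = h.astype(float) ** 2
    return 10.0 * np.log10(energy[:n80].sum() / energy[n80:].sum())

# Idealized 1 s exponential decay as a toy impulse response
fs = 48_000
t = np.arange(fs) / fs          # 1 s of samples
h = np.exp(-3.0 * t)            # energy envelope decays as exp(-6 t)
c80 = clarity_c80(h, fs)        # about -2.1 dB for this decay
```

The same early/late energy split underlies the comparison of real and auralized rooms: differences smaller than one JND in these parameters are taken as evidence that the auralization is perceptually faithful.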
[Polar plots for rooms DST, BS, and KH, each showing the Back, Front Left, and Front Right directions on a radial scale from -42 dB to -30 dB.]
Figure 4: Broadband spatiotemporal representation of the auralized rooms (backward integrated) analyzed at the musician
position.
tests are implemented, using the same methodology and acoustic conditions, but investigating five different case scenarios or performance contexts through a different question (in the following order):

Practice Technique: In which room do you prefer to practice instrument technique?

Practice Concert: In which room do you prefer to practice a concert piece?

Concert: In which room do you prefer to perform a concert?

Easiness: In which room is it easier to perform?

Quality: Which room provides the best overall acoustic (or sound) quality?

The experiment was completed by 7 different players, and every section of the test lasted approximately 10 minutes. The order of the trials was fully randomized, while the order of the test sections was always the same. Depending on the physical and psychological fatigue of the players, the tests were completed in one or two sessions. Five of the players completed a double pairwise comparison (5 tests x 6 pairs x 2 repetitions) with inverted orders to analyze the consistency of their answers. A preliminary analysis revealed a high consistency; thus, to reduce the fatigue of the players, the last two players completed only a single pairwise comparison. The responses of these last two players were then duplicated to ensure an equal weight of all subjects in the analysis.

The ability to hear oneself and other musicians on stage is an important factor in facilitating ensemble performance, and previous research suggests that the direction of arrival of early reflections on stage can influence the stage acoustics preferences of chamber orchestras [13], as well as of solo musicians [4]. It is not clear, however, how this directionality should be quantified and how it relates to the subjective perception of musicians on stage. The goal of the present study is to provide a methodology that allows a systematic study of those preferences in trumpet players using modified versions of the auralized rooms with different early energy levels and directions.

A pilot test was conducted with 5 participants, evaluating their preference in terms of stage support. The term was discussed with the participants to ensure a correct interpretation, and it was agreed that support is related to the ability to hear their own sound in the room without difficulty or the need to force their instrument (as described by Gade [2]).

The experiment followed the same procedure as the general preference test. In this case there were 3 different rooms with 5 modifications per room (all, front-back, sides, top-down, no-ER). The order of the comparisons was randomized and all the participants rated each pair once, resulting in 30 comparisons (3 rooms x 10 pairs) per participant.

In all the studied halls the starting time of the late reverberation (tend) was set to 100 ms, while the mixing time (tmix) was 45, 45, and 65 ms for BS, DST, and KH, respectively.
Table 1: Duration of trials for the general preference experiment.
[Plots: preference as a function of RT20 (s), Gall (dB), Gearly (dB), and Glate (dB) for the questions Practice Technique, Practice Concert, Concert, Easiness, and Quality, with quadratic fits (adjusted R² = 0.97, 0.98, 0.75, and 0.62 across the panels).]
Figure 7: Quadratic fits of the room preferences and room acoustic parameters. The R² value included in the graph corresponds to the adjusted R².
[Plots: BTL preference of rooms DST, BS, and KH as a function of STearly (dB), with the probability of chance indicated, and BTL preference for each early-reflection modification (all, front-back, sides, top-down, no-ER).]
comfortable and more exhausting. However, in such cases where the playing requires a high level of concentration and the ability to hear one's own mistakes, a very dry room is often preferred. Indeed, a small amount of reverberation is beneficial in keeping a high clarity and reducing the fatigue of the players. When practicing for a concert, the preference shifts towards slightly longer reverberation times and louder rooms, which could be closer to concert conditions. However, it is necessary to retain a certain degree of clarity or early energy, in order to allow the detection and correction of musical performance errors or articulation imperfections. When playing in a concert, musicians often prefer longer reverberation times, which could be related to a bigger projection of the instrument and the feeling of playing in front of a bigger audience. The easiness of playing, or musician comfort, is directly related to the amount of early energy, as musicians feel more supported by the stage conditions and less effort is needed to develop the notes and feel their presence in the room. During the interviews, multiple participants stated that when they feel supported by the room it is easier to relax, reduce fatigue, and ultimately perform better. Finally, although in the studied rooms Quality seems to be quadratically related to RT20, it has to be considered that more aspects, such as tone balance, are involved, and purely energetic terms are insufficient to rate the sound quality of a room.

In addition, it is interesting that during the different parts of the first test (general preference), the played excerpts varied substantially depending on the test question. For Practice Technique, musicians often tended to play fast passages with complex note articulations, meaning that they tended to explore the early response of the room and its clarity. Practice Concert and Concert were usually related to more lyrical passages. However, those technical and lyrical passages could in many cases be parts of the same piece. Additionally, when testing Easiness and Sound Quality, they often played a broad range of pieces with different characters, testing the articulation and judging the tone color and the decay of the rooms. Finally, the trial duration (see Tab. 1) of the Practice Technique case is significantly higher than in the other cases (p < 0.01), which could indicate that there is a learning effect in judging the different rooms and that the task becomes easier once musicians are more familiar with the different conditions.

Regarding the test of directional early energy, most of the musicians commented that in many cases it is very difficult to distinguish the differences between the rooms. In fact, most of the players tended to play notes with clear articulation to explore the early part of the sound. As seen in the results, the direction of arrival of the early energy does not seem to be relevant to their preference judgments; instead, the level of the early energy appears to be more important. It is thus not clear what the directional preference would be if the levels were equal in the different cases. In future experiments, the effect of the level should be considered in order to include a proper equalization of early energy. In
addition, the length of the window and mixing time should be carefully decided to provide a wide and realistic variety of acoustic scenarios. This pilot study demonstrated the potential of manipulating directional energy in virtual environments, allowing a detailed control of the acoustic scene and enabling a systematic study of different stage acoustic phenomena.

5.2 Comparability with previous studies

Previous research on stage acoustics for solo musicians usually studied musicians' preference of stage acoustics, which intuitively translates to a preference to perform in a concert. However, preference could be understood as well in terms of sound quality or the degree of comfort provided by the room. The present study demonstrates that the same musicians judging the same rooms produce very different results depending on the formulation of the question and the judging criteria or performance context. This makes it very difficult to compare the results of the present study with previous research. However, if we compare the concert conditions with previous preference results, the results support former findings. Here, a quasi-linear relationship was found between RT20 and the preference of a room to perform in a concert, and longer reverberation times were also preferred in previous studies (up to 2.3 s [4] and around 1.9 s in [3]). However, the longest (and most preferred) reverberation time in the present test is approximately 1.4 s.

The comparison of stage energy parameters, such as support (ST) or strength (G), depends on the distance between the source and the receiver and on the directivity of the source. In addition, while they have been measured in a standardized way in previous research [2, 3], the calibration procedures, instrument directivity, and miking techniques applied in the virtual environments leave some room for uncertainty in the absolute values, and it is not straightforward to ensure that the same values measured in different virtual environments are equivalent.

6. CONCLUSION

Two experiments on musicians' stage preferences are presented. The results demonstrate that the performance context is a key factor in musicians' preference for a particular stage acoustics. Moreover, the direction of the early energy does not seem to play an important role in determining the preference of stage support, while the level of the early energy is more important. The results seem to partially support previous findings, although the comparability of the results depends greatly on the measurement and calibration processes of the virtual environments.

Future work should increase the range of tested rooms and instruments, and include a careful design of the directional properties of the auralized rooms.

Acknowledgments

The authors want to acknowledge the musicians participating in the experiments for their time and effort. The research work presented in this article has been funded by the European Commission within the ITN Marie Curie Action project BATWOMAN under the 7th Framework Programme (EC grant agreement no. 605867).

7. REFERENCES

[1] S. V. Amengual Garí, W. Lachenmayr, and M. Kob, "Study on the influence of acoustics on organ playing using room enhancement," in Proceedings of the Third Vienna Talk on Music Acoustics, Vienna, Austria, 2015, pp. 252-258.
[2] A. C. Gade, "Investigations of musicians' room acoustic conditions in concert halls. Part I: Methods and laboratory conditions," Acustica, vol. 69, pp. 193-203, 1989.
[3] K. Ueno and H. Tachibana, "Experimental study on the evaluation of stage acoustics by musicians using a 6-channel sound simulation system," Acoust. Sci. & Tech., vol. 24, no. 3, pp. 130-138, 2003.
[4] A. Guthrie, S. Clapp, J. Braasch, and N. Xiang, "Using ambisonics for stage acoustics research," in International Symposium on Room Acoustics, Toronto, Canada, 2013.
[5] S. V. Amengual Garí, D. Eddy, M. Kob, and T. Lokki, "Real-time auralization of room acoustics for the study of live music performance," in Fortschritte der Akustik - DAGA 2016, Aachen, Germany, 2016, pp. 1474-1477.
[6] S. Tervo, J. Pätynen, A. Kuusinen, and T. Lokki, "Spatial decomposition method for room impulse responses," J. Audio Eng. Soc., vol. 61, no. 1/2, pp. 16-27, January/February 2013.
[7] V. Pulkki, "Virtual sound source positioning using vector base amplitude panning," J. Audio Eng. Soc., vol. 45, pp. 456-466, 1997.
[8] J. Pätynen, S. Tervo, and T. Lokki, "Analysis of concert hall acoustics via visualizations of time-frequency and spatiotemporal responses," J. Acoust. Soc. Am., vol. 133, no. 2, pp. 842-857, Feb. 2013.
[9] S. Tervo, J. Pätynen, N. Kaplanis, M. Lydolf, S. Bech, and T. Lokki, "Spatial analysis and synthesis of car audio system and car cabin acoustics with a compact microphone array," J. Audio Eng. Soc., vol. 63, no. 11, pp. 914-925, 2015.
[10] S. Tervo and J. Pätynen. SDM toolbox. [Online]. Available: http://de.mathworks.com/matlabcentral/fileexchange/56663-sdmtoolbox
[11] J. Pätynen, S. Tervo, and T. Lokki, "Amplitude panning decreases spectral brightness with concert hall auralizations," in Audio Engineering Society Conference: 55th International Conference: Spatial Audio, Aug. 2014.
[12] Acoustics - Measurement of room acoustic parameters - Part 1: Performance spaces, Std. ISO 3382-1:2009, 2009.
[13] L. Panton, D. Holloway, D. Cabrera, and L. Miranda, "Stage acoustics in eight Australian concert halls: Acoustic conditions in relation to subjective assessments by a touring chamber orchestra," Acoustics Australia, pp. 1-15.
[14] S. V. Amengual Garí, T. Lokki, and M. Kob, "Live performance adjustments of solo trumpet players due to acoustics," in International Symposium on Musical and Room Acoustics (ISMRA), La Plata, Argentina, 2016.
[15] F. Wickelmaier and C. Schmid, "A Matlab function to estimate choice model parameters from paired-comparison data," Behavior Research Methods, Instruments, and Computers, vol. 36, no. 1, pp. 29-40, 2004.
[16] W. Chiang, C. Shih-tang, and H. Ching-tsung, "Subjective assessment of stage acoustics for solo and chamber music performances," Acta Acustica united with Acustica, vol. 89, no. 5, pp. 848-856, 2003.
ABSTRACT
                      Same Time                          Different Time
 Same Location        Mobile Phone Orchestra [10];       Locative Performance
                      Ensemble Metatone [11];
                      Viscotheque Jams [12]
 Different Location   Networked Musical Performance;     Glee Karaoke [13];
                      Magic Piano [14]                   MicroJam Performances

Table 1. Ensemble performances typically occur with all participants in the same time and place; however, other situations are possible. MicroJam focuses on performances that are distributed (different place) and asynchronous (different time).

tended to be frequent, casual, and ephemeral. Snapchat introduced an interface that optimised replying to many image messages in one session. This app further emphasised ephemerality by introducing images that expire and can only be viewed for a few seconds after being opened. These and other social applications have attracted many millions of users and encouraged their creativity in the written word and photography. While social media is often used to promote music [3], music making has yet to become an important creative part of the social media landscape.

While music is often seen as an activity where accomplishment takes practice and concerted effort, casual musical experiences are well known to be valuable and rewarding creative activities. Accessible music making, such as percussion drum circles, can be used for music therapy [4]. Digital Musical Instruments (DMIs) such as augmented reality instruments [5] and touch-screen instruments [6] have also been used for therapeutic and casual music-making. In the case of DMIs, the interface can be designed to support the creativity of those without musical experience or with limitations on motor control. Apps such as Ocarina [7] and Pyxis Minor [8] have shown that simple touch-screen interfaces can be successful for exploration by novice users as well as supporting sophisticated expressions by practised performers.

Some mobile music apps have included aspects of social music-making. Smule's Leaf Trombone app introduced the idea of a world stage [9]. In this app, users would perform renditions of well-known tunes on a unique trombone-like DMI. Users from around the world were then invited to critique renditions with emoticons and short text comments. World stage emphasised the idea that while the accuracy of a rendition could be rated by the computer, only a human critic could tell if it was ironic or funny. Indeed, Leaf Trombone and other Smule apps have made much progress in integrating musical creation with social media.

1.2 Jamming through Space and Time

While performance and criticism is an important social part of music-making, true musical collaboration involves performing music together. These experiences of group creativity can lead to the emergence of qualities, ideas, and experiences that cannot be easily explained by the actions of the individual participants [15]. Mobile devices have often been used in ensemble situations such as MoPho (Stanford Mobile Phone Orchestra) [10], Pocket Gamelan [16], and Ensemble Metatone [11]; however, in these examples, the musicians played together in a standard concert situation.

Given that mobile devices are often carried by users at all times, it would be natural to ask whether mobile device ensemble experiences can be achieved even when performers are not in a rehearsal space or concert venue. Could users contribute to ensemble experiences at a time and place that is convenient to them? The use of computer interfaces to work collaboratively even when not in the same space and time has been extensively discussed. In HCI, groupware systems have been framed using a time-space matrix to address how they allow users to collaborate in the same and different times and places [17]. For many work tasks, it is now common to collaborate remotely and at different times using tools such as Google Docs or Git; however, distributed and asynchronous musical collaboration is not as widely accepted.

In Table 1, we have applied the time-space matrix to mobile musical performance. Conventional collaborative performances happen at the same time and location. Even with mobile devices, most collaboration has occurred in this configuration. Collaborations with performers distributed in different locations but performing at the same time are often called networked musical performances [18]. Early versions of Smule's Magic Piano [14] iPad app included the possibility of randomly assigned, real-time duets with other users. Networked performances are also possible with conventional mobile DMIs and systems for real-time audio and video streaming.

Performances with participants at different times, the right side of Table 1, are less well-explored than those on the left. Performances where participants are in the same place, but at different times, could come under the banner of locative performances, such as Net Dérive [19] or Sonic City [20], where geographical location is an important input to a musical process.

The final area of the matrix involves music-making with performers in different places and at different times. Glee Karaoke [13] allows users to upload their sung renditions of popular songs, and add layers to other performers' contributions. The focus in this app, however, is on singing along with a backing track, and the mobile device does not really function as a DMI but as an audio recorder. These limitations rule out many musical possibilities; for instance, touch-screen DMIs can create a variety of sounds from similar interfaces, so orchestras of varying instruments could be assembled from many remote participants to improvise original music. It remains to be seen whether large-scale musical collaboration between distributed users is possible. It seems likely, however, that such collaborations would uncover hidden affordances of the medium, as has been seen in other distributed online media such
SMC2017-176
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
as Twitch Plays Pokemon [21]. Our app, MicroJam, also fits into this lower-right quadrant. In the next section, we will describe how this new app enables distributed and asynchronous collaboration on original musical material.
2. MICROJAM
MicroJam is an app for creating, sharing, and collaborating with tiny touch-screen musical performances. This app has been specifically created to interrogate the possibilities for collaborative mobile musical performances that span space and time. While these goals are lofty, the design of MicroJam has been kept deliberately simple. The main screen recalls social-media apps for sharing images. Musical performances in MicroJam are limited to very short interactions, encouraging frequent and ephemeral creative contribution. MicroJam is an iOS app written in Swift and uses Apple's CloudKit service for backend cloud storage. The source code is freely available for use and modification by other researchers and performers [2]. In the following sections we will discuss the design of the app, the format of the tiny musical performances that can be created, and some early experiences and musical performances recorded with users.
2.1 Design
The primary screen for MicroJam is the performance interface, shown in the centre of Figure 2 and called jam!. This screen features a square touch-area which is initially blank. Tapping, swirling, or swiping anywhere in this area will create sounds and also start recording touch activity. All touch interaction in this area is visualised with a simple paint-style drawing that follows the user's touches, and is simultaneously sent to a Pure Data patch running under libpd [22] to be sonified. After five seconds of touch interaction, the recording is automatically stopped (although the performer can continue to interact with the touch area). The recording can be subsequently replayed with the play button, or looped with the jam button.
Users of MicroJam can choose the sound-scheme used to sonify their interactions in the jam interface. At present, four options are available through the settings screen: chirp, a simple theremin-like sound with pitch mapped to the x-dimension of interactions; keys, a Rhodes-like keyboard sound with pitch and timbre mapped to the x- and y-dimensions; strings, a simple modelled string sound that performs mandolin rolls and changes timbre as the performer moves around the screen; and drums, a simple drum machine with different sounds mapped to quadrants of the screen. As each of these sound-schemes is implemented as a Pure Data patch, it is straightforward to add more to MicroJam in future.
Previously recorded performances, and those recorded by other users and downloaded from the server, are listed in the world screen as shown on the left side of Figure 2. Each performance is represented by a visual trace of the touch-drawing captured during recording as well as the name of the contributor and the date of the performance. Any one of these performances can be tapped, which opens the recording back up in the jam screen for playback. When playing back, both the sound and the visualised touch-interactions are replayed in the touch-area.
When viewing a previously saved performance, the user can tap reply to open a new layer on top of the recording. As shown in the right side of Figure 2, the previous as well as the current touch-visualisations are shown, along with separate sonifications for each layer. At present, only one reply is possible in MicroJam; however, the ability to reply to replies, creating multi-layered performances, is planned for future versions.
2.2 Tiny touch-screen performances
MicroJam is intended to provide a musical experience that is similar to other mobile social creativity applications. One innovation in this field has been to constrain contributions, leading to more frequent interactions and possibly higher creativity due to the lower stakes and effort. Musical interactions in MicroJam are similarly constrained to be tiny touch-screen performances: those that are limited in the area and duration of interaction. We define a tiny performance as follows:
1. All touches take place in a square subset of the touch-screen.
2. The duration of the performance is five seconds.
3. Only one simultaneous touch is recorded at a time.
Such performances require very little effort on the part of users. While some users may find it difficult to conceive and perform several minutes of music on a touch-screen device, five seconds is long enough to express a short idea, but short enough to leave users wanting to create another recording. It has been argued that five seconds is enough time to express a sonic object and other salient musical phenomena [23]. While the limitation to a single touch may seem unnecessary on today's multi-touch devices, this stipulation limits tiny performances to monophony. In order to create more complex texture or harmony, performers must collaborate, or record multiple layers themselves.
For transmission and long-term storage, tiny touch-screen performances are stored as simple comma-separated values files. The data format records each touch interaction's time (as an offset in seconds from the start of the performance), whether the touch was moving or not, x and y locations, as well as touch pressure. In MicroJam, the visual trace of performances is also stored for later use in the app, although this could be reconstructed from the touch data.
2.3 User Data and Early Experiences
The prototype version of MicroJam has been distributed and demonstrated to small numbers of researchers and students. These early experiences have allowed us to streamline the creative workflow of the app and to observe the kind of tiny performances that can be created in the jam interface. Since the release of the first prototype, around 200 tiny performances have been collected. Most of the
Figure 2. The MicroJam interface allows users to browse a timeline of performances (left), create new performances
(centre), and reply, or play in duet, with previously recorded performances (right).
participants were computer science and engineering students with little computer music experience; a few music technology students were also included.
The visualisations of a subset of these performances are reproduced in Figure 3. This figure shows the variety of touch-interaction styles that have already been observed in performers. Many of the interactions are abstract, resembling scribbles that show the user experimenting with the synthesis mapping of the jamming interface. In some performances, repeated patterns can be seen, where performers have repeated rhythmic motions in different parts of the touch-area. A number of the performances are recognisable images: figures, faces, and words. These users were particularly interested in how the interface maps between drawing, a visual and temporal activity, and sound. Observing the users, it seems that some focused on what a particular drawing might sound like, while others were interested in what particular sounds look like. At present, these performance recordings have not been analysed with respect to which sound-scheme was in use and how reply layers fit together. Further experiments could seek to understand these relationships.
It has been gratifying to hear that several early users of the app greatly enjoyed the experience of creating tiny performances, and wished that similar interactions could be integrated into existing social apps. These users immediately set about recording multiple performances, exploring the sound-schemes and the creative possibilities of the touch-screen interface. Other users, however, had a lukewarm reaction to the concept of free-form touch-screen musical performance. It could be that casual users do not expect to be able to create original music. After all, musical performance is often (erroneously, we feel) seen as a task only for specialists with a high level of training or talent. It may be a hard sell to ask users to create their own music, and to collaborate with others. Rather than a discouragement, we see this as an opportunity to continue developing mobile music experiences that push the boundaries of everyday music making. Understanding how comfortable users would be composing original tiny performances could be addressed in future studies. The app design could also include more guidance, such as a training system, or more extensive visual feedback, to help users who are unsure about making touch-screen music.
3. CONCLUSIONS AND FUTURE WORK
In this paper we have advocated for social apps for creating music, as opposed to more popular written and visual media. We have argued that such apps could take advantage of the ubiquity of mobile devices by allowing users to collaborate asynchronously and in different locations, and shown that such modes of interaction are relatively unexplored compared to more conventional ensemble performances. Our app, MicroJam, represents a new approach to asynchronous musical collaboration. Taking inspiration from the ephemeral contributions that typify social media apps, MicroJam limits performers to tiny five-second touch-screen performances, but affords them extensive opportunities to browse, play back, and collaborate through responses. MicroJam's tiny performance format includes a complete listing of the touch interactions as well as a simple visualisation of them. This format allows performances to be easily distributed, viewed, and studied. Early experiences with MicroJam have shown that users can engage with the interface to create a range of interesting performances. While some see potential to include music-making in their social media activities, others may lack confidence about producing music, even in the tiny performance format.
There is much scope for refinement and development of
Figure 3. The visual trace of performances in the prototype for MicroJam. Some performances focus on similar touch
gestures in different locations of the jamming area, others appear to be exploratory scribbles, and several focus on visual
interaction.
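The storage format described in Section 2.2, one comma-separated row per touch event carrying a time offset in seconds, a moving flag, x and y locations, and a touch pressure, is simple enough to parse and validate in a few lines. The sketch below is illustrative only: the column names and value conventions are assumptions for this example, not MicroJam's actual schema.

```python
import csv
import io

MAX_DURATION = 5.0  # tiny performances are capped at five seconds

def load_tiny_performance(text):
    """Parse a tiny performance from CSV text into a list of touch records.

    Assumed (hypothetical) columns: time, moving, x, y, pressure.
    time is an offset in seconds from the start of the performance;
    x and y are treated as normalised positions in the square touch-area.
    """
    touches = []
    for row in csv.DictReader(io.StringIO(text)):
        t = float(row["time"])
        x, y = float(row["x"]), float(row["y"])
        # Enforce the tiny-performance constraints from Section 2.2.
        if not 0.0 <= t <= MAX_DURATION:
            raise ValueError(f"touch at {t:.2f} s is outside the 5 s window")
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            raise ValueError("touch is outside the square touch-area")
        touches.append({
            "time": t,
            "moving": row["moving"] == "1",
            "x": x,
            "y": y,
            "pressure": float(row["pressure"]),
        })
    return touches

example = "time,moving,x,y,pressure\n0.00,0,0.10,0.50,0.8\n0.05,1,0.12,0.51,0.9\n"
print(len(load_tiny_performance(example)))  # prints 2
```

Because the format is a flat list of timestamped events, a reply layer could in principle be stored as a second file of the same shape and replayed alongside the original by timestamp.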
[5] A. G. D. Correa, I. K. Ficheman, M. d. Nascimento, and R. d. D. Lopes, "Computer assisted music therapy: A case study of an augmented reality musical system for children with cerebral palsy rehabilitation," in International Conference on Advanced Learning Technologies. IEEE, 2009, pp. 218–220. DOI:10.1109/ICALT.2009.111
[6] S. Favilla and S. Pedell, "Touch screen ensemble music: Collaborative interaction for older people with dementia," in Proceedings of the 25th Australian Computer-Human Interaction Conference, ser. OzCHI '13. New York, NY, USA: ACM, 2013, pp. 481–484. DOI:10.1145/2541016.2541088
[7] G. Wang, "Ocarina: Designing the iPhone's magic flute," Computer Music Journal, vol. 38, no. 2, pp. 8–21, 2014. DOI:10.1162/COMJ_a_00236
[8] T. Barraclough, D. Carnegie, and A. Kapur, "Musical instrument design process for mobile technology," in Proceedings of the International Conference on New Interfaces for Musical Expression, E. Berdahl and J. Allison, Eds. Baton Rouge, Louisiana, USA: Louisiana State University, May 2015, pp. 289–292. [Online]. Available: http://www.nime.org/proceedings/2015/nime2015_313.pdf
[9] G. Wang, S. Salazar, J. Oh, and R. Hamilton, "World stage: Crowdsourcing paradigm for expressive social mobile music," Journal of New Music Research, vol. 44, no. 2, pp. 112–128, 2015. DOI:10.1080/09298215.2014.991739
[10] J. Oh, J. Herrera, N. J. Bryan, L. Dahl, and G. Wang, "Evolving the mobile phone orchestra," in Proceedings of the International Conference on New Interfaces for Musical Expression, ser. NIME '10, K. Beilharz, A. Johnston, S. Ferguson, and A. Y.-C. Chen, Eds. Sydney, Australia: University of Technology Sydney, 2010, pp. 82–87. [Online]. Available: http://www.nime.org/proceedings/2010/nime2010_082.pdf
[11] C. Martin, H. Gardner, and B. Swift, "Tracking ensemble performance on touch-screens with gesture classification and transition matrices," in Proceedings of the International Conference on New Interfaces for Musical Expression, ser. NIME '15, E. Berdahl and J. Allison, Eds. Baton Rouge, LA, USA: Louisiana State University, 2015, pp. 359–364. [Online]. Available: http://www.nime.org/proceedings/2015/nime2015_242.pdf
[12] B. Swift, "Chasing a feeling: Experience in computer supported jamming," in Music and Human-Computer Interaction, ser. Springer Series on Cultural Computing, S. Holland, K. Wilkie, P. Mulholland, and A. Seago, Eds. London, UK: Springer, 2013, pp. 85–99.
[13] R. Hamilton, J. Smith, and G. Wang, "Social composition: Musical data systems for expressive mobile music," Leonardo Music Journal, vol. 21, pp. 57–64, December 2011. DOI:10.1162/LMJ_a_00062
[14] G. Wang, "Game design for expressive mobile music," in Proceedings of the International Conference on New Interfaces for Musical Expression, vol. 16. Brisbane, Australia: Queensland Conservatorium Griffith University, 2016, pp. 182–187. [Online]. Available: http://www.nime.org/proceedings/2016/nime2016_paper0038.pdf
[15] R. K. Sawyer, "Group creativity: Musical performance and collaboration," Psychology of Music, vol. 34, no. 2, pp. 148–165, 2006. DOI:10.1177/0305735606061850
[16] G. Schiemer and M. Havryliv, "Pocket Gamelan: Swinging phones and ad-hoc standards," in Proceedings of the 4th International Mobile Music Workshop, Amsterdam, May 2007.
[17] S. Greenberg and M. Roseman, "Using a room metaphor to ease transitions in groupware," Department of Computer Science, University of Calgary, Tech. Rep. 98/611/02, 1998.
[18] A. Carot, P. Rebelo, and A. Renaud, "Networked music performance: State of the art," in Audio Engineering Society 30th International Conference, Mar 2007. [Online]. Available: http://www.aes.org/e-lib/browse.cfm?elib=13914
[19] A. Tanaka and P. Gemeinboeck, "Net derive: Conceiving and producing a locative media artwork," in Mobile Technologies: From Telecommunications to Media, G. Goggin and L. Hjorth, Eds. London, UK: Routledge, 2008.
[20] L. Gaye, R. Maze, and L. E. Holmquist, "Sonic City: the urban environment as a musical interface," in Proceedings of the International Conference on New Interfaces for Musical Expression, ser. NIME '03. Montreal, Canada: McGill University, 2003, pp. 109–115. [Online]. Available: http://www.nime.org/proceedings/2003/nime2003_109.pdf
[21] S. D. Chen, "A crude analysis of twitch plays pokemon," arXiv preprint arXiv:1408.4925, 2014. [Online]. Available: https://arxiv.org/abs/1408.4925
[22] P. Brinkmann, P. Kirn, R. Lawler, C. McCormick, M. Roth, and H.-C. Steiner, "Embedding Pure Data with libpd," in Proceedings of the Pure Data Convention. Weimar, Germany: Bauhaus-Universitat Weimar, 2011. [Online]. Available: http://www.uni-weimar.de/medien/wiki/PDCON:Conference/Embedding_Pure_Data_with_libpd:_Design_and_Workflow
[23] R. I. Godøy, A. R. Jensenius, and K. Nymoen, "Chunking in music by coarticulation," Acta Acustica united with Acustica, vol. 96, no. 4, pp. 690–700, 2010. DOI:10.3813/AAA.918323
Peter Ahrendt, Christian C. Bach
Aalborg University
Master Program in Sound and Music Computing
{pahren16; cbach12}@student.aau.dk

Sofia Dahl
Aalborg University Copenhagen
Department of Architecture, Design and Media Technology
sof@create.aau.dk
[Figure: MIDI notes (75–105) of sung notes in the High-Pitch and Low-Pitch conditions]
Figure 3. Test interface for choosing the face with the angrier expression
Table 3. Associations between expression and number of high-pitch and low-pitch choices
Apart from eyebrow height, there are other facial cues that observers use to judge what type of behaviour or personality an individual is likely to have. For instance, different facial morphology in both humans and chimpanzees reveals personality traits to observers [19], and an association between the width-to-height ratio and aggression ratings of static images showing neutral faces has been reported [20]. In our study we showed within-participant pairs of pictures to be compared by the assessors, but some of these faces could still have provided strong static cues not associated with the task of singing high and low notes. By rating the aggressiveness or friendliness of individual pictures, possibly combined with eye tracking, it would be possible to better identify the specific role of the eyebrows in the assessment of the faces.
6. CONCLUSION
When singing notes of different pitch height, untrained singers' facial expressions are altered in such a way that when assessors are asked to select the friendlier face, they choose the high-pitch picture over the low-pitch picture to a higher degree [11]. We could replicate this result in our experiment. Furthermore, we could also show the opposite relation, where assessors instructed to select the angrier face to a higher degree choose the low-pitch picture over the high-pitch one. Hence, a task normally associated with music appears to affect dynamic facial expressions in a way that observers detect as emotional expressions, influencing their choices of angrier and friendlier above the level of chance.
Acknowledgments
The authors would like to thank all singing participants and assessors.
Authors Ahrendt and Bach were responsible for the main part of the work: they set up the experiment and performed data collection and analysis under the supervision of author Dahl, as part of a course in the Sound and Music Computing master program at Aalborg University. All authors participated equally in the writing of the final manuscript.
7. REFERENCES
[1] J. W. Davidson, "Visual perception of performance manner in the movements of solo musicians," Psychology of Music, vol. 21, no. 2, pp. 103–113, 1993.
[2] M. Broughton and C. Stevens, "Music, movement and marimba: An investigation of the role of movement and gesture in communicating musical expression to an audience," Psychology of Music, vol. 37, no. 2, pp. 137–153, 2008.
[3] S. Dahl and A. Friberg, "Visual perception of expressiveness in musicians' body movements," Music Perception: An Interdisciplinary Journal, vol. 24, no. 5, pp. 433–454, 2007.
[4] B. W. Vines, C. L. Krumhansl, M. M. Wanderley, and D. J. Levitin, "Cross-modal interactions in the perception of musical performance," Cognition, vol. 101, no. 1, pp. 80–113, 2006.
[5] W. F. Thompson and F. A. Russo, "Facing the music," Psychological Science, vol. 18, no. 9, pp. 756–757, 2007.
[6] W. F. Thompson, F. A. Russo, and S. R. Livingstone, "Facial expressions of singers influence perceived pitch relations," Psychonomic Bulletin & Review, vol. 17, no. 3, pp. 317–322, 2010.
[7] M. Schutz and S. Lipscomb, "Hearing gestures, seeing music: Vision influences perceived tone duration," Perception, vol. 36, no. 6, pp. 888–897, 2007.
[8] V. C. Tartter and D. Braun, "Hearing smiles and frowns in normal and whisper registers," The Journal of the Acoustical Society of America, vol. 96, no. 4, pp. 2101–2107, 1994.
[9] J. J. Ohala, "The acoustic origin of the smile," The Journal of the Acoustical Society of America, vol. 68, no. S1, pp. S33–S33, 1980.
[10] D. Huron, "Affect induction through musical sounds: an ethological perspective," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 370, no. 1664, p. 20140098, 2015.
[11] D. Huron, S. Dahl, and R. Johnson, "Facial expression and vocal pitch height: Evidence of an intermodal association," Empirical Musicology Review, vol. 4, no. 3, pp. 93–100, 2009.
[12] D. Huron and D. Shanahan, "Eyebrow movements and vocal pitch height: evidence consistent with an ethological signal," The Journal of the Acoustical Society of America, vol. 133, no. 5, pp. 2947–2952, 2013.
[13] Propellerhead Software AB, "Reason 9 Operation Manual," Tech. Rep., 2011.
[14] MATLAB App Designer, 2017. [Online]. Available: https://se.mathworks.com/products/matlab/app-designer.html
[15] A. Agresti, An Introduction to Categorical Data Analysis, 2nd ed. Gainesville: John Wiley & Sons Inc., 2007, vol. 22, no. 1.
[16] D. J. Best and D. E. Roberts, "The upper tail probabilities of Spearman's rho," Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 24, no. 3, pp. 377–379, 1975.
[17] J. Ohala, "Signaling with the eyebrows: Commentary on Huron, Dahl, and Johnson," Empirical Musicology Review, vol. 4, no. 3, pp. 101–102, 2009.
[18] S. Chong, J. F. Werker, J. A. Russell, and J. M. Carroll, "Three facial expressions mothers direct to their infants," Infant and Child Development, vol. 12, no. 3, pp. 211–232, 2003.
Juan Carlos Vasquez, Koray Tahiroglu, Niklas Pollonen
Department of Media, School of ARTS
Aalto University
FI-00076 AALTO, Finland
firstname.lastname@aalto.fi

Johan Kildal
IK4-TEKNIKER
Inaki Goenaga 5, Eibar 20600, Spain
johan.kildal@tekniker.es

Teemu Ahmaniemi
Nokia Technologies
Espoo, Finland
teemu.ahmaniemi@nokia.com
els of engagement back to higher levels that are associated with a satisfactory creative experience. Lastly, as happiness and wellbeing crucially depend on the activities we engage in [3], in this user study we aimed at understanding whether a satisfactory musical experience could indeed contribute towards the preservation of psychological wellbeing, as we proposed in the beginning.
2. RELATED WORK
In spite of being a commonly mentioned parameter in music research, the mere act of obtaining a unified and consistent definition of engagement represents a surprising challenge. In the field of education, Appleton [6] points the discussion towards separating engagement from motivation: the latter [7] provides an answer to the reasoning behind a specific behaviour, depending on the characteristics of the personal energy invested in said situation, whereas engagement refers to the level of focused involvement [8], or energy in action, the connection between person and activity [9]. When applied to musical activity, engagement may also have a positive connotation, described as a state of sustained, undivided attention and emotional connection towards creative musical performance [10], or sometimes as a neutral parameter prone to be measured and evaluated in interdependent values for attention, emotion and action [11].
Similarly, there are other studies exploring music as a creative activity that contributes to general wellbeing. An example demonstrates how active music-making improves wellbeing and overall comfort in a large sample of hospitalised children with cancer [12]. This study draws upon previous research featuring live music therapy that promotes interaction through voice and body expressions, recognising the benefits of interactivity as opposed to listening passively to music as a therapy [13]. While previous studies delineate a methodology for therapy rather than the design of an interactive system, there are examples of interactive tools aimed at positively impacting wellbeing through rhythmical movement with music as a central element [14, 15]. There is also the precedent of an interactive application to engage patients with dementia in active music making without pre-existing skills [16]. In the same vein, we found an interactive technology tested with children with disabilities, aimed at studying how music interaction can become potentially health promoting [17].
As a concept, psychological wellbeing is subjective, relative and prone to change with ease. However, it is possible to find specific sonic characteristics that have been shown to positively affect an individual's perception of comfort and wellbeing. For instance, in the context of music, some studies indicate that fast and slow tempos are associated with happiness and sadness, respectively, as are major and minor modes [18, 19]. Loud music has been shown in cross-cultural studies to be a universal cue for anger [20]. Timbre is ambivalent, although several studies indicate that attenuation of high frequencies can be a difference between anger (unfiltered) and tenderness (attenuated) [19]. Although affective associations of both tempo and mode are fairly well established, the effects of other musical characteristics are poorly understood. On the cognitive level, David Huron's ITPRA (Imagination-Tension-Prediction-Response-Appraisal) theory studies the relationships between affective consequences and expectancies in music [21], establishing clear connections. Evidence also suggests that some auditory illusions, such as binaural auditory beats, are associated with less negative mood, notably affecting performance in vigilance tasks of sustained concentration, particularly when they occur in the beta range (between 16 and 24 Hz) [22].
Regarding the concept of musical interaction being perceived as rewarding, the Orff approach states that improvisation with pentatonic scales, with bars and notes removed, is always satisfactory and encourages freedom from the fear of making mistakes [23]. The goal is for everybody to experience success and appreciate the aesthetic in music almost immediately, rather than having to learn notes and rhythms before anything satisfying is accomplished. Percussive patterns were important in his theory, as they appeal to a primal essence in human beings and allow tempo and beat to be experienced rather than learned. We incorporated many of these ideas into our work while designing the response module and its implementation in a music-making creative activity that can link cognitive features and physical movements together to pay attention to immediate responses. This is a generic hypothesis of research in the cognitive neuroscience of music and the concept of music, mind and body couplings that our project takes into account [2, 24]. The former points out the benefits of music as a therapeutic tool and that it can improve recovery in the domains of verbal memory and focused attention. The latter indicates its impact on social and individual development through active engagement with music.
3. NOISA
As mentioned, NOISA (Network Of Intelligent Sonic Agents) is the robotic musical interface that we took into use and built on for this research work. NOISA consists of three instruments for music creation, a Microsoft Kinect 2 camera and a central computer. The three instruments are physically identical to each other. Each instrument is built inside a plastic box. The interface for playing each instrument consists of two handles designed to provide an ergonomic form for grasping them from several directions, which can also position themselves actively thanks to DC motors (the robotic interactions with the physical medium). The handles are attached to a motorised fader, which serves both as position indicator and as active position controller. The hardware in each box includes an Arduino, a Raspberry Pi, two motorised faders with extra-large motors, two capacitive touch sensors on the handles, and an embedded speaker. In each instrument, sound production is largely based on granular synthesis and phase vocoder algorithms de-fragmenting radically small fragments of pre-existing pieces from the classical music repertoire.
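The grain-based sound production described above can be illustrated with a minimal granular synthesis sketch: short windowed fragments are read from a source signal at jittered positions and overlap-added into the output. This is a sketch of the general technique only, not the NOISA implementation; the grain length, hop size, jitter amount, and Hann window below are illustrative assumptions.

```python
import numpy as np

def granulate(source, grain_len=1024, hop=256, jitter=2048, seed=1):
    """Rebuild a sound from short, Hann-windowed grains of a source signal.

    Every `hop` samples, one grain is copied from a randomly jittered
    position near the playhead and overlap-added into the output.
    """
    rng = np.random.default_rng(seed)
    window = np.hanning(grain_len)
    out = np.zeros(len(source) + grain_len)
    for start in range(0, len(source) - grain_len, hop):
        # Choose the grain's source position near the playhead, with jitter.
        src = start + int(rng.integers(-jitter, jitter + 1))
        src = min(max(src, 0), len(source) - grain_len)
        out[start:start + grain_len] += source[src:src + grain_len] * window
    return out[:len(source)]

# Toy source: one second of a 220 Hz sine tone at an 8 kHz sample rate.
sr = 8000
t = np.arange(sr) / sr
resynth = granulate(np.sin(2 * np.pi * 220 * t))
print(resynth.shape)  # prints (8000,)
```

Smaller grains and larger jitter blur the temporal structure of the source material; a phase vocoder, mentioned alongside granulation above, additionally allows time and pitch to be modified independently.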
ity, and we observed if the system could detect the corresponding drop in engagement, and if the instrument could intervene collaboratively until engagement was recovered.
After a general description of the whole session and after obtaining consent, the first part of the user study started with the participant filling out the PANAS questionnaire (the Positive and Negative Affect Schedule) [26]. This is a validated questionnaire that evaluates positive and negative affects using a psychometric scale, and the change in those within a specific period of time during which some treatment activity took place (in our case, a performance session with our collaborative musical instrument). PANAS uses 10 descriptors per category of affects, showing relations between them and personality states and traits. PANAS has been successfully applied to measure overall psychological wellbeing before and after a musically creative activity [27].
The participant was then introduced to the NOISA instrument, with explanations about how it works, followed by a period of hands-on familiarisation with it (typically for about three minutes). Once the participant felt confident enough, the first music performance task was described, which required performing a piece according to a somewhat restrictive set of rules (we wanted to reduce variability in the performing activities at this stage in the study). In particular, the participant would be asked to start performing with a specific instrument (Agent 1), then repeat the given gesture with the middle instrument (Agent 2), finishing by inputting the same repeated movement in the third instrument (Agent 3), resulting in a piece that lasted for between three and five minutes. Once this was done, NOISA would typically have started to make automatic responses. At this moment, the participant was asked to explore freely any of the agents, and once the participant was ready to finish, she/he had to manipulate the handles of Agent 1. The participant was required to stop any input activity as an indication of the end of the music task.
After finishing the task, the participant was asked to fill in the PANAS questionnaire again in order to compare the impact of the task on the overall sense of psychological wellbeing. Following that, the participant was asked to evaluate the system's usability by filling in the online version of the System Usability Scale (SUS) designed by our research group to facilitate the data gathering and analysis. SUS provides a high-level measurement of subjective usability, and over the years the work done by other researchers has indicated that it is a valid and reliable tool [28].
For the second part of the study (designed to observe the engagement tracking and recovery functionalities of our interactive system), we asked the participant to repeat the same musical task, extending it for a period of four minutes. This time we tracked and logged the level of engagement over time. We scheduled two events: a sufficiently distracting one at 1:30 (door opening, person coming in and loudly manipulating a cardboard box) and a more subtle one at 3:00 (phone ringing). We drew inspiration from previous studies measuring the impact of distractions on psychological wellbeing during a skilled activity [29]. The participant was not aware that such events would take place, nor was he/she prepared for them. The total duration of the test was approximately 30 minutes (see Figure 2).
5. RESULTS
The PANAS scores are divided into Positive Affects (PA) and Negative Affects (NA). According to the PANAS documentation, the mean PA score for momentary (immediate) evaluation is 29.7 (SD = 7.9). Scores can range from 10 to 50, with higher scores representing higher levels of positive affect. For Negative Affects, the mean NA score for an equally momentary evaluation is 14.8 (SD = 5.4). Scores can range from 10 to 50, with lower scores representing lower levels of negative affect.
During our tests, we obtained a PA score mean (before the activity) of 30.57 (SD = 3.89) and a PA score mean (after the activity) of 34.21 (SD = 5.14). In addition, we collected an NA score (before the activity) of 15.57 (SD = 6) and an NA score (after the activity) of 12.30 (SD = 3.30). This means an overall improvement of 3.64 PA points and 3.27 NA points. By adding the PA and NA improvements, we found that the overall improvement in the reported psychological wellbeing was 6.91 points, which over the 40-point span of the PANAS range (10 to 50) represents an overall improvement of 17.27% after 3 minutes of activity with our system. In addition, 30% of the participants reported an improvement of more than 10 points in the PA affect score.
In our paired before/after test, for Negative Affects the two-tailed p value equals 0.0029 (t = 3.5979), a statistically significant difference (significance level > 99%). Regarding the Positive Affects, we obtained a p value of 0.0699 (t = 1.9750), which is statistically significant at the 90% level but not at the conventional 95% level.
Regarding usability, the System Usability Scale research
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
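The paired before/after comparison reported above can be sketched in a few lines using only Python's standard library. The score arrays below are hypothetical stand-ins for per-participant Negative Affect sums, not the study's data; the helper computes the paired t statistic from the per-participant differences:

```python
import math
import statistics

def paired_t(before, after):
    """Paired t statistic for matched before/after scores.

    Returns (t, df) with df = n - 1. A positive t here means the
    scores decreased after the activity.
    """
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    sd = statistics.stdev(diffs)                 # sample SD of the differences
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1

# Hypothetical per-participant Negative Affect sums (possible range 10-50):
na_before = [22, 15, 18, 12, 20, 14, 16]
na_after = [17, 13, 14, 11, 15, 12, 13]
t_na, df = paired_t(na_before, na_after)         # scores dropped, so t_na > 0
```

The resulting t is then compared against the critical value from a t-table (for the study's 15 participants, df = 14 and the two-tailed 95% critical value is 2.145), which is how the "> 99%" and "95%" significance levels above are read off.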
Regarding usability, research on the System Usability Scale has established an average score of 68 points on the scale (from 0 to 100). Sauro [30] found that when people rate a system or product with a SUS score of 82, they are likely to recommend it to a friend or colleague (Net Promoter Score). After evaluating our system, the participants gave it a mean SUS value of 73.9 (SD = 11.33), which is 5.9 points above average. One third of the participants gave our system a SUS score of 82 or higher, potentially becoming promoters.

Finally, our engagement-over-time graphs showed that the system was able to identify and measure a distraction event in 73.33% of the participants, impacting the overall engagement by 0.2 points or more on a scale from 0 (totally disengaged) to 1 (fully engaged). Our engagement scale is also divided into categories, where 0 to 0.3 means low engagement, 0.3 to 0.6 medium engagement, and 0.6 to 1 high engagement. Averaging over every distraction that had an impact on the engagement, the sudden decrease was 0.41 points (SD = 1.40). As the NOISA system reacts once it detects a drop in the overall engagement, the level of focus was effectively regained to a stable level, comparable to that before the distraction event, in a mean time of 10.07 seconds (SD = 7.15). Engagement-over-time graphs for participants 10 and 11 (Figures 3 and 4) are included to demonstrate cases where our engagement prediction system detected both distractions (at the 90- and 180-second marks, respectively), managing to regain the focus of the participant through autonomous counteractions.

Figure 3. Engagement over time graph of participant 10.

Figure 4. Engagement over time graph of participant 11.

6. ANALYSIS AND DISCUSSION

The results presented above show that the design of the response module provided beneficial results for the participants. The response module could successfully maintain and strengthen the level of engagement in a skilled activity of musical improvisation with NOISA instruments. The resulting experience of such interaction was also likely to have contributed positively to the psychological wellbeing of the participants.

Judging from the quantitative data obtained through the PANAS questionnaire, the average levels of Positive Affects were higher after finishing the test, and the Negative Affects were lower after completing the task. In the case of the Negative Affects, the difference was found to be highly statistically significant. On the PANAS scale, the Negative Affects scores decreased from above the documented mean (15.57) to below it (12.30). Even though the impact on the Positive Affects was found to be not quite statistically significant, we consider the overall improvement of 17.27% positive, considering the short amount of time spent with the system during our task. The creative activity facilitated by the NOISA system had a statistically significant impact on the Negative Affects described in PANAS (Distressed, Upset, Guilty, Hostile, Irritable, Ashamed, Nervous, Jittery and Afraid), which suggests that the musical task can be successfully applied to help reduce the subjective perception of worry, stress and other negative affective states of the participants.

The engagement estimation results indicated effectiveness in detecting and maintaining the focus level of the participants. In that sense, being able to measure at least one of the distractions in 73.33% of the participants is satisfactory, considering that, while being observed, a participant could feel compelled to keep his/her attention on the task at hand. This situation would prevent us from detecting any distractions in their physical body movements. The response module was capable of recovering the attention and focus of the participants in an average time of 10.07 seconds after the distractions. We can note here that our current contribution to the NOISA system provided an engaging interaction that was gradually and progressively built up by the participants' active and continuous involvement in the musical activity with NOISA instruments.

Similarly, we observed that both our new hardware design and the response module were evaluated by the participants as easy to use, obtaining consistently above-average scores on the SUS scale. In addition, a large portion (one third) of the participants rated it above the 82 mark, which is particularly positive considering that we asked non-specialist individuals to evaluate a new musical interface. Our sample of 15 participants might be considered small; however, Tullis and Stetson [31] have demonstrated the reliability of SUS with smaller samples (8-12 participants).

7. CONCLUSIONS

Based on previous literature, it is clear that designing for wellbeing is strongly linked with the psychological understanding of wellbeing, positive experience and physical activity [3]. The purpose of our project was to establish a situated musical activity for the user to be engaged in, attempting to achieve the goal of enhancing psychological
wellbeing. Our design was grounded in a concrete context (an overall creative musical process) and guided by the functionality (a hand-held interface) as well as through subtle interaction (counter-responsive musical actions). We observed and measured the behaviour of our system as it deals with losses of performer focus provoked by the controlled introduction of external distractors. During the tests, we observed indications that being engaged in our musical activity contributed positively to the participants' psychological wellbeing. Encouraged by our initial results, we plan to continue our research by including in a future study a comparison with a control group, which will enable us to draw more comprehensive, longer-term conclusions about the impact of the activity with NOISA on the psychological wellbeing of the user.

We also aim to investigate the application of a similar engagement-monitoring technique in a broader scope of skilled, potentially mentally absorbing actions with people working in different occupations. A foreseen psychological wellbeing application scenario, to be implemented in future phases of the project, could be the non-intrusive recognition of changes in the patterns of engagement that a person presents when carrying out different creative tasks with digital environments, which could signal changes in their mood, psychological health or deterioration of their cognitive capacity (early diagnostic goal). This application requires a new hand-held interface to be designed and implemented, fully integrated with the engagement monitoring and the new response module of the NOISA system. Through subtle external actions, the new interface and the digital environment could assist the person in regaining focus and engagement with the creative task at hand, which might be beneficial for the successful completion of the task (assistive living goal), and to do so with an enhanced sense of control, accomplishment and satisfaction (psychological wellbeing goal) in a creative process.

Acknowledgments

This project was developed thanks to the support of the Nokia Research Centre in Finland.

8. REFERENCES

[1] D. H. Pink, Drive: The Surprising Truth About What Motivates Us. Penguin, 2011.

[2] C. D. Ryff, "Psychological well-being in adult life," Current Directions in Psychological Science, vol. 4, no. 4, pp. 99-104, 1995.

[3] S. Diefenbach, M. Hassenzahl, K. Eckoldt, L. Hartung, E. Lenz, and M. Laschke, "Designing for well-being: A case study of keeping small secrets," The Journal of Positive Psychology, pp. 1-8, 2016.

[4] K. Tahiroglu, T. Svedström, and V. Wikström, "NOISA: A novel intelligent system facilitating smart interaction," in Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, ser. CHI EA '15. ACM, 2015, pp. 279-282. [Online]. Available: http://doi.acm.org/10.1145/2702613.2725446

[5] K. Tahiroglu, T. Svedström, and V. Wikström, "Musical engagement that is predicated on intentional activity of the performer with NOISA instruments," in Proceedings of the International Conference on New Interfaces for Musical Expression, E. Berdahl and J. Allison, Eds., May 31 - June 3, 2015, pp. 132-135. [Online]. Available: http://www.nime.org/proceedings/2015/nime2015_121.pdf

[6] J. J. Appleton, S. L. Christenson, D. Kim, and A. L. Reschly, "Measuring cognitive and psychological engagement: Validation of the Student Engagement Instrument," Journal of School Psychology, vol. 44, no. 5, pp. 427-445, 2006. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0022440506000379

[7] M. L. Maehr and H. A. Meyer, "Understanding motivation and schooling: Where we've been, where we are, and where we need to go," Educational Psychology Review, vol. 9, no. 4, pp. 371-409, 1997. [Online]. Available: http://dx.doi.org/10.1023/A%3A1024750807365

[8] J. M. Reeve, H. Jang, D. Carrell, S. Jeon, and J. Barch, "Enhancing students' engagement by increasing teachers' autonomy support," Motivation and Emotion, vol. 28, no. 2, pp. 147-169, 2004. [Online]. Available: http://dx.doi.org/10.1023/B%3AMOEM.0000032312.95499.6f

[9] E. Frydenberg, M. Ainley, and V. J. Russell, "Student motivation and engagement," Schooling Issues Digest 2005/2, 2005, 16 p. [Online]. Available: http://www.dest.gov.au/sectors/school_education/publications_resources/schooling_issues_digest/documents/schooling_issues_digest_motivation_engagement_pdf.htm

[10] N. Bryan-Kinns and P. G. Healey, "Decay in collaborative music making," in Proceedings of the International Conference on New Interfaces for Musical Expression, Paris, France, 2006, pp. 114-117. [Online]. Available: http://www.nime.org/proceedings/2006/nime2006_114.pdf

[11] G. Leslie, "Measuring musical engagement," UC San Diego: Music, 2013.

[12] M. E. Barrera, M. H. Rykov, and S. L. Doyle, "The effects of interactive music therapy on hospitalized children with cancer: A pilot study," Psycho-Oncology, vol. 11, no. 5, pp. 379-388, 2002.

[13] J. M. Standley and S. B. Hanser, "Music therapy research and applications in pediatric oncology treatment," Journal of Pediatric Oncology Nursing, vol. 12, no. 1, pp. 3-8, 1995.
[14] P. Keyani, G. Hsieh, B. Mutlu, M. Easterday, and J. Forlizzi, "DanceAlong: Supporting positive social exchange and exercise for the elderly through dance," in CHI '05 Extended Abstracts on Human Factors in Computing Systems. ACM, 2005, pp. 1541-1544.

[15] J. Music and R. Murray-Smith, "Virtual hooping: Teaching a phone about hula-hooping for fitness, fun and rehabilitation," in Proceedings of the 12th International Conference on Human Computer Interaction with Mobile Devices and Services. ACM, 2010, pp. 309-312.

[16] P. Riley, N. Alm, and A. Newell, "An interactive tool to promote musical creativity in people with dementia," Computers in Human Behavior, vol. 25, no. 3, pp. 599-608, 2009.

[17] K. Stensæth and E. Ruud, "An interactive technology for health: New possibilities for the field of music and health and for music therapy? A case study of two children with disabilities playing with ORFI," Series from the Centre for Music and Health, vol. 8, no. 7, pp. 39-66, 2014.

[18] A. Gabrielsson and P. N. Juslin, Emotional Expression in Music. Oxford University Press, 2003.

[19] P. N. Juslin and P. Laukka, "Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening," Journal of New Music Research, vol. 33, no. 3, pp. 217-238, 2004.

[20] L.-L. Balkwill, W. F. Thompson, and R. Matsunaga, "Recognition of emotion in Japanese, Western, and Hindustani music by Japanese listeners," Japanese Psychological Research, vol. 46, no. 4, pp. 337-349, 2004.

[21] D. Huron, Sweet Anticipation: Music and the Psychology of Expectation. MIT Press, 2006.

[22] J. D. Lane, S. J. Kasian, J. E. Owens, and G. R. Marsh, "Binaural auditory beats affect vigilance performance and mood," Physiology & Behavior, vol. 63, no. 2, pp. 249-252, 1998.

[23] A. Long, "Involve me: Using the Orff approach within the elementary classroom," Awards for Excellence in Student Research and Creative Activity, no. 4, 2013.

[24] S. Hallam, "The power of music: Its impact on the intellectual, social and personal development of children and young people," International Journal of Music Education, vol. 28, no. 3, pp. 269-289, 2010.

[25] B. Osborn, "Understanding through-composition in post-rock, math-metal, and other post-millennial rock genres," Music Theory Online, vol. 17, no. 3, 2011.

[26] D. Watson, L. A. Clark, and A. Tellegen, "Development and validation of brief measures of positive and negative affect: The PANAS scales," Journal of Personality and Social Psychology, vol. 54, no. 6, p. 1063, 1988.

[27] A. M. Sanal and S. Gorsev, "Psychological and physiological effects of singing in a choir," Psychology of Music, vol. 42, no. 3, pp. 420-429, 2014.

[28] J. Brooke, "SUS: A retrospective," Journal of Usability Studies, vol. 8, no. 2, pp. 29-40, 2013.

[29] B. Thorgaard, B. B. Henriksen, G. Pedersbæk, and I. Thomsen, "Specially selected music in the cardiac laboratory - an important tool for improvement of the wellbeing of patients," European Journal of Cardiovascular Nursing, vol. 3, no. 1, pp. 21-26, 2004.

[30] J. Sauro, "Measuring usability with the System Usability Scale (SUS)," 2011. [Online]. Available: http://www.measuringusability.com/sus.php [accessed 22.04.2013].

[31] T. S. Tullis and J. N. Stetson, "A comparison of questionnaires for assessing website usability," in Usability Professionals' Association Conference, 2004, pp. 1-12.
ABSTRACT

1. INTRODUCTION

It is commonly assumed that listening to musical sound, and particularly dance music with a clear pulse, makes us move. This assumption is to some extent supported by the literature in embodied music cognition [1, 2], and there are also empirical studies of music-induced motion [3, 4] or motion enhanced by music [5, 6]. Many of these former studies have mainly focused on voluntary and fairly large-scale music-related body motion. As far as we know, there is little empirical evidence of music actually making people move when they try to remain at rest.

Our aim is to investigate the tiniest performable and perceivable human motion, what we refer to as micromotion. Such micromotion is primarily involuntary and performed at a scale that is barely observable to the human eye. Still, we believe that such micromotion may be at the core of our cognition of music at large, being a natural manifestation of internal motor engagement [7].

In our previous studies we have found that subjects exhibit a remarkably consistent level of micromotion when attempting to stand still in silence, even for extended periods of time (10 minutes) [8]. The measured standstill level of a person is also consistent across repeated measures over time [9]. These studies, however, were carried out on small groups of people (2-5), so we have been interested in testing whether these findings also hold true for larger groups.

In this paper we report on a study of music-induced micromotion, focusing on how music influences the motion of people trying to stand still. In order to answer that question, it is necessary to have baseline recordings of how much people move when standing still in silence. More specifically, this paper is aimed at answering the following questions:

- How (much) do people move when trying to stand still?
- How (much) does music influence the micromotion observed during human standstill?

To answer these questions, we have started carrying out a series of group experiments under the umbrella name of the Norwegian Championship of Standstill. The theoretical background of the study and a preliminary analysis have been presented in [10]. This paper presents a quantitative analysis of the data from the 2012 edition of our experiment series.

Figure 1. The setup for the Norwegian Championship of Standstill. Each subject wore a reflective marker on the head, and one static marker was recorded from a standing pole in the middle of the space as a reference.

Copyright: © 2017 et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

2. THE EXPERIMENT

The experiment was carried out in the fourMs motion capture lab at the University of Oslo in March 2012 (Figure 1).

2.1 Participants

A little more than 100 participants were recruited to the study, and they took part in groups consisting of 5-17 participants at a time (see Figure 1 for a picture of the setup). Not every participant completed the task and there were some missing marker data, resulting in a final dataset of
[Figure: amplitude over time (0-400 s), with sessions S1-S7 marked.]
Table 1. Mean QoM values (in mm/s) for all sessions, in both no-sound and sound conditions, as well as for each of the individual music sections.

Overall mean QoM (mm/s):  6.5
Condition                 No sound (3 min)       Sound (3 min)
Mean QoM (mm/s)           6.3                    6.6
Part                      1    2    3    4       5    6    7    8
Mean QoM (mm/s)           6.3  6.2  6.5  6.7     6.5  6.6  6.9  6.7
Standard deviation        1.4  1.8  1.9  1.9     1.7  1.8  3.8  2.3

Figure 4. Cumulative distance travelled for all participants (sessions S1-S7).

Figure 5. Scatter plot showing the linearity between QoM occurring in the horizontal (XY) plane and three-dimensional (XYZ) QoM for the entire data set.

3.3 Horizontal Motion

To answer the question of how people move over time, we computed the planar quantity of motion. The horizontal QoM (over the XY plane) was computed for all participants in order to further test the differences between conditions and stimuli. The mean horizontal QoM was found to be 6.4 mm/s for the entire 6-minute recording (SD = 1.5 mm/s). This value is only marginally smaller than the 6.5 mm/s found for the 3D QoM, suggesting that most motion, in fact, occurred in the XY plane. The relation between horizontal and 3D motion can also be seen in Figure 5.

3.4 Vertical Motion

To investigate the level of vertical motion, we also calculated QoM along the Z-axis. The mean vertical QoM across participants and conditions was 0.73 mm/s (SD = 0.52 mm/s), considerably smaller than the horizontal QoM reported above. This can also be seen in plots of the vertical motion (Figure 7) and of the frontal (YZ) plane (Figure 6), in which the bulk of the motion along the Z axis is below 1 mm/s.

When looking at the differences between conditions, the mean vertical QoM during the no-sound segment of the trials was found to be 0.69 mm/s, while for the sound segment it was 0.77 mm/s.

3.5 Influence of sound on motion

For the 3-minute parts without sound we found an average QoM of 6.3 mm/s (SD = 1.4 mm/s), as opposed to 6.6 mm/s (SD = 2.2 mm/s) for the part with sound. This is not a dramatic difference, but it shows that the musical stimuli did influence the level of standstill. A paired-sample t-test was conducted to evaluate the statistical significance of the observed differences between sound and no-sound conditions across the sample group. The results indicate that the differences in means for three-dimensional QoM were significant at the 95% confidence level (t = 2.48, p = .015). Differences in the planar QoM between the sound (6.5 mm/s) and no-sound (6.2 mm/s) segments of the experiment were also statistically significant (t = 2.5, p < .05), although not considerably larger than those observed for 3D QoM.

These observed differences between sound and no-sound conditions were further explored by conducting a k-means cluster analysis of both 3D and 2D QoM for the entire data set. Using instantaneous QoM as a predictor, two clusters were identified by the implemented algorithm, although, as seen in the silhouette plot in Figure 8, most points in the clusters have silhouette values smaller than 0.3. This indicates that the clusters are not entirely separated, which could be due to the homogeneity of the sample group and the continuous nature of the musical stimuli.

The results are even clearer when looking at the individual stimuli in Table 1, with a QoM of 6.9 mm/s for the electronic dance music sample (#7) and 6.7 mm/s for the salsa excerpt (#8). As such, the results confirm our expectation that music makes you move. Even though the result may not be very surprising in itself, it is interesting to see that even in a competition during which the participants actively try to stand absolutely still, the music has an influence on their motion at what can be termed the micro level.
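The QoM measure used throughout this section can be illustrated with a short sketch. QoM is taken here as the mean frame-to-frame speed of the head marker; the random-walk track below is a synthetic stand-in for real motion-capture data, and the function is an illustration of the computation, not the authors' actual pipeline:

```python
import numpy as np

def quantity_of_motion(pos, fps):
    """Mean marker speed in mm/s.

    pos: (n_frames, n_dims) array of positions in mm. Passing a subset of
    the columns gives the planar (XY) or axis-wise (Z) variants.
    """
    steps = np.diff(pos, axis=0)                       # per-frame displacement vectors
    return float(np.linalg.norm(steps, axis=1).mean() * fps)

# Synthetic stand-in for a head-marker track: a tiny random walk at 100 Hz.
rng = np.random.default_rng(0)
pos = np.cumsum(rng.normal(0.0, 0.02, size=(6000, 3)), axis=0)

qom_3d = quantity_of_motion(pos, fps=100)              # full 3D QoM
qom_xy = quantity_of_motion(pos[:, :2], fps=100)       # horizontal (XY) plane only
qom_z = quantity_of_motion(pos[:, 2:], fps=100)        # vertical (Z) axis only
```

Slicing pos into the silence and music segments before calling the function yields per-condition values comparable to Table 1, and a paired t-test over participants then corresponds to the analysis in Section 3.5.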
Figure 6. Plot showing QoM in the vertical plane (YZ) for the entire data set. The majority of the motion along this direction was below 1 mm/s.

Figure 7. Instantaneous position of the marker along the Z axis (vertical direction).

Figure 8. Silhouette plot from k-means clustering analysis of QoM along the XY plane for the entire data set.

3.6 Age, Height and Gender

We found a significant negative correlation between the average QoM results and the participants' age. Generally, younger participants tended to move more (r = -.278, p < .01), both in the no-sound (r = -.283, p < .01) and sound conditions (r = -.255, p < .05). From the reported demographic information, we also found that the younger participants listened to music more frequently (r = -.267, p < .05) and exercised more (r = -.208, p < .05). The younger participants also reported feeling less tired during the experiment (r = -.35, p < .001), subjectively experienced greater motion (r = -.215, p < .05), and also reported moving more when sound was being played (r = -.22, p < .05).

Unexpectedly, the QoM results did not correlate with the participants' height, which was estimated by calculating the average of each participant's Z-axis values. Due to a lower centre of mass, we would have expected shorter people to show lower QoM results. However, the winner was 192 cm tall, while the runner-up was 165 cm. Also, there were no significant differences in performance between male and female participants (no difference in average QoM, QoM in silence, QoM in music, or QoM difference between the two conditions).

3.7 Effects of group, posture and physical activity

Aiming to evaluate the effects of standing strategies and postures, the participants were allowed to choose their standing posture during the experiment. In the post-experiment questionnaire they were asked to self-report on whether they were standing with their eyes open or closed, and whether they had their knees locked. The majority of the participants reported that they stood with open eyes (N = 62, versus N = 4 for closed eyes and N = 8 for those who switched between open and closed eyes during the experiment). Furthermore, 33 of the participants reported standing with locked knees, 31 switched between unlocked and locked knees, and 10 reported standing with unlocked knees. A one-way ANOVA was performed to test whether any of these factors influenced the average QoM of the participants, but it showed no statistically significant results.

Interestingly, the participants who reported spending more time doing physical exercise tended to move more during the experiment (r = -.299, p < .01). This tendency was particularly evident during the no-sound section (r = -.337, p < .01), but it was also observed during the sound section (r = -.251, p < .05).

Additionally, we compared the average QoM results for all conditions (no-sound, sound, the average of no-sound and sound, and the computed difference between sound and no-sound conditions) between groups of participants. The participants were split into 10 groups of varying age (F(8, 82) = 3.43, p < .05), experience with performing, composing or producing music (F(8, 82) = 2.4, p < .05), size (min = 5, max = 17) and gender proportion. We found no statistically significant differences between groups across these characteristics.

3.8 Subjective experience of motion

After taking part in the experiment, the participants were asked to estimate how much they moved, to what extent the music influenced their movement, and how tiresome the experience felt. Overall, the self-reported tiredness showed some correlation with self-reported motion (r = -.44, p < .001) and with the self-reported experience of moving more to music (r = -.289, p < .01). The kinematic data confirmed this sensation: the more tired the participants felt, the more they moved to music (r = -.228, p < .05) and the greater was the difference in motion between the sound and no-sound conditions (r = -.311, p < .01). More importantly, although the subjective experience of motion did not correlate with the measured level of motion, the participants who reported moving more to music did move more during the sound condition when compared to the no-sound condition (r = -.239, p < .05 for the difference in QoM between music and silence).
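The cluster analysis summarised in Section 3.5 (two k-means clusters with mostly low silhouette values, Figure 8) can be illustrated with a self-contained numpy sketch. The QoM values below are hypothetical, and the implementation is an illustration of the technique, not the study's own code:

```python
import numpy as np

def kmeans_1d(x, k=2, iters=50, seed=0):
    """Plain Lloyd's algorithm on 1-D data; returns per-point labels and centres."""
    rng = np.random.default_rng(seed)
    centres = rng.choice(x, size=k, replace=False)     # initialise from the data
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centres[None, :]), axis=1)
        for j in range(k):                             # update non-empty clusters only
            if np.any(labels == j):
                centres[j] = x[labels == j].mean()
    return labels, centres

def silhouette_1d(x, labels):
    """Per-point silhouette value s = (b - a) / max(a, b) for two clusters."""
    s = np.empty(len(x))
    for i, (xi, li) in enumerate(zip(x, labels)):
        same = x[labels == li]
        other = x[labels != li]
        a = np.abs(same - xi).sum() / (len(same) - 1)  # mean distance within own cluster
        b = np.abs(other - xi).mean()                  # mean distance to the other cluster
        s[i] = (b - a) / max(a, b)
    return s

# Hypothetical windowed QoM values (mm/s): two heavily overlapping groups.
rng = np.random.default_rng(1)
qom = np.concatenate([rng.normal(6.3, 0.3, 40), rng.normal(6.9, 0.3, 40)])
labels, centres = kmeans_1d(qom)
sil = silhouette_1d(qom, labels)
```

Silhouette values close to 1 indicate well-separated clusters, so the values below 0.3-0.4 reported above correspond to heavily overlapping groups like the synthetic ones here.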
tion, the participants who reported moving more to music movement, Music Perception, vol. 28, no. 1, pp.
did move more during the sound condition when compared 5970, 2010. [Online]. Available: http://mp.ucpress.
to the no-sound condition (r = -.239, p < .05 for the differ- edu/content/28/1/59
ence in QoM between music and silence).
[4] B. Burger, M. R. Thompson, G. Luck, S. Saarikallio,
and P. Toiviainen, Influences of Rhythm- and Timbre-
4. CONCLUSIONS Related Musical Features on Characteristics of Music-
This study was aimed at further exploring the magnitude of Induced Movement, Frontiers in Psychology, vol. 4,
micromotion and the influence of music on human stand- 2013. [Online]. Available: http://journal.frontiersin.
still, based on the preliminary work presented in [10]. Quan- org/article/10.3389/fpsyg.2013.00183/abstract
tity of motion (QoM) was shown to be a sensitive measure [5] M. Peckel, T. Pozzo, and E. Bigand, The impact of the
of micromotion for the conditions under analysis. The perception of rhythmic music on self-paced oscillatory
computation of both three-dimensional and planar QoM movements, Frontiers in Psychology, vol. 5, Sep.
showed that micromotion occurred mainly on the horizon- 2014. [Online]. Available: http://journal.frontiersin.
tal plane. Additionally, statistically significant differences org/journal/10.3389/fpsyg.2014.01037/full
were found between no-sound and sound conditions across
the dataset. Two clusters were identified in the data through [6] F. Styns, L. van Noorden, D. Moelants, and
k-means cluster analysis, although most points in the clus- M. Leman, Walking on music, Human Movement
ters had silhouette values below 0.4. This could be due Science, vol. 26, no. 5, pp. 769785, 2007. [Online].
to the continuous nature of the sound stimuli and the small Available: http://linkinghub.elsevier.com/retrieve/pii/
(although statistically significant) differences between con- S0167945707000589
ditions. [7] Y.-H. Su and E. Poppel, Body movement enhances
The analysis revealed some relationships between QoM the extraction of temporal structures in auditory
data and the self-reported characteristics of physical activ- sequences, Psychological Research, vol. 76, no. 3,
ity and demographic information. People who exercised regularly found it more difficult to stand still. Moreover, younger participants tended to move more during both no-sound and sound conditions. These results may suggest that people who tend to be more active struggle to reach and maintain a complete standstill posture, although they might be able to stand normally for longer periods of time and with greater balance. The correlation found between self-reported tiredness and both self-reported and measured motion cannot be considered conclusive, and further studies will focus on a more in-depth assessment of the effects of tiredness in combination with sound stimuli during standstill.

The fact that there were no significant QoM differences between the groups of participants indicates that testing varying numbers of participants at once is a viable way to test our hypotheses. Future work will focus on studying larger sample groups and using different stimuli, with a focus on investigating in more depth how different musical features influence the micromotion of people standing still.

Acknowledgments

This research has been supported by the Norwegian Research Council (#250698) and Arts Council Norway.
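Quantity of motion (QoM), the measure compared between the groups above, is commonly computed from motion-capture data as the average frame-to-frame displacement of a tracked marker. The following is a minimal sketch of that common definition only; the study's exact formula, marker set, and units are not given here, so the names and details below are illustrative assumptions.

```python
import math

def quantity_of_motion(positions):
    """Mean Euclidean frame-to-frame displacement of one tracked marker.

    This is one standard QoM definition; it is not necessarily the exact
    formula used in the study.
    """
    if len(positions) < 2:
        return 0.0
    total = sum(math.dist(p, q) for p, q in zip(positions, positions[1:]))
    return total / (len(positions) - 1)

# Toy trace: one marker's (x, y, z) position in mm over four frames.
trace = [(0.0, 0.0, 1700.0), (0.5, 0.0, 1700.0),
         (0.5, 0.5, 1700.0), (1.5, 0.5, 1700.0)]
print(quantity_of_motion(trace))  # average displacement per frame, in mm
```

In practice the per-frame values would be averaged over each trial and condition before comparing groups.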
SMC2017-199
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
SECCIMA: Singing and Ear Training for Children with Cochlear Implants via
a Mobile Application
Zhiyan Duan, Chitralekha Gupta, Graham Percival, David Grunberg and Ye Wang
National University of Singapore
{zhiyan,chitralekha}@u.nus.edu, graham@percival-music.ca, {grunberg,wangye}@comp.nus.edu.sg
2 Related Work

2.1 Musical Training

Musical training can have a powerful effect on music perception and appreciation. An excellent review of music training for CI users is given in [4]; crucially, it supports the idea that training has the potential to improve music perception and appreciation in this population.

Figure 1: Two interfaces of MOGAT. Left: the Higher-Lower game. Right: the Vocal Matcher game.

A recent examination of the effect of musical training on CI users found that it improved rhythm, timbre, and melodic contour recognition [8]. These results were echoed in [5], showing that melodic contour identification performance improved sharply after four weeks of music training, and no significant decline was found in the follow-up study eight weeks after the training intervention.

2.2 Computer-aided Training Solutions

There are a number of computer-based speech training programs for CI users [9, 10]. However, two recent meta-reviews of computer auditory speech training cautioned that although many studies yielded positive results, the quality of the studies was low to moderate [11, 12]. Other projects explored new possibilities for interaction without performing any usability tests, such as [13], which used an interactive floor to combine movement-based playing and language acquisition, and [14], which allowed children to progress through a story by identifying phonemes.

There are few computer-based music training programs. The Interactive Music Awareness Program (IMAP) is a computer system for adults which allows users to explore various aspects of music [6], including melodies, rhythms, and texture. A 12-week study with 16 adult CI users showed that IMAP increased their ability to recognize musical instruments. Our main inspiration was Mobile Game with Auditory Training for Children with Cochlear Implants (MOGAT) [7]. This project provided three games: Higher Lower, which prompted children to identify whether a pair of sounds was in ascending or descending pitch; Vocal Matcher, which asked children to sing notes with specific pitches; and Ladder Singer, which provided a pitch-graded karaoke interface. However, some flaws in the UI of this system caused problems when it was used by children. These are discussed in Section 3.

2.3 Design for Children with Disabilities

In many cases, children with CIs are not as cognitively developed as children with typical hearing, with literacy, language, and behavioral issues due to the cumulative effect of the hearing loss before diagnosis, waiting time for the surgery, and recovery time after the implant [15, 16].

The specific design needs of deaf or hard-of-hearing children are discussed in two recent papers. [17] noted communication and behavioral difficulties during prototyping sessions, as well as the children being initially hesitant to explore the game being designed. [17] also confirmed that the children were very sensitive to visual changes, quickly noticing (and then fixating on) even minor changes to the background. [18] presented twenty-five guidelines for designing games for deaf children. Many of these guidelines are aimed at mitigating reduced language skills, addressing their sensitivity to visual changes, and mitigating their tendency to be distracted.

3 Design Considerations

The goal of this study is to design a game-based training application on smartphones that can be used to improve the pitch perception and production skills of children with CIs. Therefore, the application was developed to facilitate the following functions:

1. Play music notes audibly and display them visually;
2. Test players' perception of notes and give feedback;
3. Evaluate players' singing voice, and give feedback on the quality of the pitch produced by the player.

We began by examining MOGAT [7] (Figure 1) to identify design flaws. These are split into the following categories.

3.1 Simplicity

Children with CI are generally less developed in literacy and can be distracted by changing visual elements [17, 18]. This means that a UI designed for them must be straightforward and simple.

In MOGAT's Vocal Matcher game (Figure 1), the child had to first listen to a musical note and try to match that note by singing it. On the screen, there was an empty progress bar that gradually filled up as the child sustained the target pitch. When the pitch produced by the child was different from the target pitch, a small arrow appeared, showing the child how to adjust her pitch to approach the target. However, children sometimes adjusted their pitch in the wrong direction. Later examination revealed that there was an ambiguity in this interface. An arrow pointing down can be interpreted in two ways: either "your pitch is too high; please go lower" or "compared to the target pitch, yours is too low; please go higher."

Besides this ambiguity, the arrow design had another potential flaw. Normal progress bars seen in many other interfaces contain no arrows. First, the child was required to notice the arrow, and second, the child had to learn the meaning of the arrow. This added to the cognitive load when children were already trying to sing the correct musical note and fill the progress bar.

Finally, screens in MOGAT included a status bar showing the number of completed questions, and a separate timer which counted up. These added little to the actual gameplay, and in the case of the timer could serve to distract the child by inducing them to watch the changing numbers
Figure 3: The initial design of the 3 games in SECCIMA: Xylophone, High/Low, and SingNRoll (left to right).

Figure 4: SingNRoll changes after round 1. (a) Before. (b) After.
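The SingNRoll mechanic shown in Figure 4 and described in Section 5.3 (pitch extracted in real time with the YIN algorithm [19] drives the ball's vertical position, and sustaining the target pitch within a tolerance rolls the ball toward the hole, while failing rolls it back) can be sketched as below. This is a rough illustration, not the authors' implementation: the function names, semitone tolerance, screen mapping, and progress step are all assumptions, and the YIN pitch tracker itself is left out.

```python
import math

def hz_to_midi(f_hz):
    # Standard conversion: A4 = 440 Hz = MIDI note 69.
    return 69.0 + 12.0 * math.log2(f_hz / 440.0)

def ball_y(pitch_midi, lo=48.0, hi=84.0, height=480):
    # Map MIDI pitch to a screen y coordinate (0 = top), clamped
    # to an assumed visible range of C3..C6.
    p = min(max(pitch_midi, lo), hi)
    return round((1.0 - (p - lo) / (hi - lo)) * height)

def update_roll(progress, pitch_midi, target_midi,
                tol_semitones=1.0, step=0.1):
    """Advance the ball toward the hole while the sung pitch is within
    tolerance of the target; otherwise roll it back, discouraging
    random guesses (as described in Section 5.3)."""
    if abs(pitch_midi - target_midi) <= tol_semitones:
        progress += step
    else:
        progress -= step
    return min(max(progress, 0.0), 1.0)  # 1.0 means the ball reached the hole

# One simulated frame: the player sings ~442 Hz against an A4 target.
sung = hz_to_midi(442.0)            # about 69.08, within 1 semitone of 69
progress = update_roll(0.5, sung, 69.0)
```

In the real application, `update_roll` would run once per analysis frame on the pitch stream returned by the tracker, and winning corresponds to the progress reaching 1.0.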
glossy balls are placed at the bottom of the screen, while the holes to which those balls belong are shown. The balls' color is the same as in the Xylophone game, while the holes are surrounded by a halo of the corresponding color. Upon tapping a ball or a hole, the corresponding musical note is played. To win the game, the player needs to drag the balls to their corresponding holes, ideally by listening to the pitch of each ball and matching it with the correct hole. As the levels of the game progress, the visual clues gradually disappear so as to challenge the player to rely more on her hearing than on her visual ability. From level three on, the halos around the holes are no longer colored, and from level six onward, the colors of some balls disappear as well. The final level of this game has all the colors removed, forcing the player to rely solely on her hearing.

5.3 SingNRoll

In SingNRoll, the player first listens to a musical note (i.e. the target) played by the application, then attempts to reproduce the pitch by singing. The smartphone captures the singing of the player and performs real-time audio analysis to extract the pitch using the YIN algorithm [19]. The user's pitch controls the vertical position of the ball. If the pitch matches the target pitch within a specific tolerance, the ball will be aligned with the hole on screen and start rolling towards the hole, hence the name SingNRoll. By sustaining the target pitch for a certain duration, the player can move the ball to the hole and win the game. If the player fails to hold the target pitch, the ball will gradually roll away from the hole, to discourage random guesses.

6 Iterative Design Process

After the prototype was completed, we followed an iterative design process (IDP) to improve the interface. Four rounds of iterative design were conducted, with 3 children in the first, 5 in the second, and 4 in each of the last two rounds.

6.1 Participants and Procedure

Participants of the user study were recruited from a local primary school dedicated to children with hearing impairment. There were 16 participants in total: 13 boys and 3 girls. Their age range was 9 to 12 years. All of the participants were pre-lingually deaf, and 14 of them were congenitally deaf. Most of them (14 out of 16) had regular access to smartphones and used smartphones in general to play games. The age of the participants' cochlear implants was between 10 and 135 months (mean 80.4, std 29.3). All participants had unilateral cochlear implants.

The user studies were conducted in an empty, quiet classroom on the campus of the primary school where the children were recruited. Each session started with a 5-minute briefing, and then the participant played the game for a maximum duration of 15 minutes with no instructions or hints from the test conductors. Finally, a 5-minute feedback conversation was administered and a questionnaire was filled in.

6.2 Round 1

All participants in this round had difficulty figuring out the SingNRoll game; they tried dragging the ball rather than singing. This might have been caused by the inconsistency between High/Low and SingNRoll, where the action to move a ball in the former was dragging but in the latter switched to singing. Another issue was the lack of an effective prompt to switch from one action to another. The instruction screen was skipped very quickly by most participants.

Based on these observations, we made the following changes to the design for the next round (Figure 4). For SingNRoll, we placed the ball inside a bubble to differentiate it from the balls in the High/Low game. When the player sang, the bubble moved vertically. If the player sustained singing at the target pitch, the bubble gradually grew until reaching some threshold and bursting. By changing the interface dramatically, our aim was to indicate that this game expected different inputs from High/Low. We also removed the hole entirely and replaced it with a target area to avoid any confusion between SingNRoll and High/Low. Another consequence of this change is that it avoided a potential confusion regarding movement. The original design used the Y-axis to indicate pitch (which we retain), and the X-axis to indicate completion. By abandoning X-axis movement and using the bubble's size as a completion metric, we clarify the meaning of movement in the game. Finally, a prompt message was added to remind the player to sing if she remained silent for more than 6 seconds.

6.3 Round 2

Neither the bubble nor the prompt message seemed to have any effect. Only two out of the five participants managed to figure out how to play the game. Although the prompt said explicitly "Sing to move bubble. Break the bubble here to get the Gem.", children did not seem to follow it. Young children with CI have difficulty understanding long
sentences [17, 18], and we failed to adequately heed those warnings. The arrow pointing at the target area encouraged participants to tap on that area, which caused more confusion. A few participants also tried to play the game directly on the instruction screen, because it had a very similar look to the actual game, which caused confusion.

Another subtle consistency issue was observed. Since holes in Xylophone were not tappable, participants considered holes in High/Low to behave the same way. However, the latter were designed to play the note upon tapping.

We made three changes after this round: 1) the instruction screen of SingNRoll was changed from a static image to an animated demo (Figure 5a); 2) all instruction screens received a style update different from the actual games (Figure 5b, 5c); 3) holes in Xylophone and High/Low were given different appearances (Figure 5d, 5e).

Figure 5: UI changes for round 2. (a) Animated tip of SingNRoll; (b) instruction screen before; (c) instruction screen after; (d) Xylophone before; (e) Xylophone after.

6.4 Round 3 & Round 4

The animated cartoon instruction worked well. All four participants learned to play the SingNRoll game very quickly (three of them managed to finish the first level in under 30 seconds). The other two problems observed in round 2 were also resolved by the corresponding adjustments.

With no obvious flaws found in round 3, we added a scoreboard to SingNRoll as an extra motivational component, to see if it could boost excitement.

No new problems were discovered in round 4, but there was also no effect of the scoreboard: all of the children quickly skipped over it. The scoreboard might contain more information than they would like to consume.

6.5 Objective Evaluation

During the iterative design process, participants' interactions with SECCIMA were filmed, from which we calculated the time taken for a participant to complete each game and used this time as an objective measure to quantify the intuitiveness and ease of use of the UI.

For Xylophone and SingNRoll, the completion time was defined as the time from a player entering the game to winning the first level of the game. The completion time of the High/Low game was defined differently, because the first level of High/Low has color strokes around the targets and the participant can win the game by simple color matching. Therefore we defined the completion time of the High/Low game as the time elapsed from the player entering the game to the player winning the third level, where there are no color indications around the targets.

Figure 6: Completion times for the High/Low and SingNRoll games (Time [s] per round, 1-4). DNC indicates that the child did not complete the exercise.

Figure 6 shows the completion times of all participants organized by round number and game. One participant in round 3 was excluded because the corresponding video recording was corrupted. Both the Xylophone game and the High/Low game were completed within a minute (many within 30 seconds). The completion time for the SingNRoll game was more interesting, as this was the challenging part for the participants. In the first two rounds, when there were no animated instructions, 5 out of 8 participants failed to complete the game. Those who managed to complete the game had struggled for quite some time (89, 164, 189 seconds) to succeed. After the introduction of the animated instruction in round three, the majority of the participants (7 out of 8) completed the game without much struggle. The completion time was reduced significantly as well, from more than two minutes to well under one minute.

6.6 Subjective Evaluation

Table 1: Post-session questionnaire for participants.

No.  Statement
Q1   I understood the instructions
Q2   The game was fun
Q3   The game was easy to play
Q4   I want to play this game again
Q5   I want to play this game at home
G1   Rate the Xylophone game
G2   Rate the High/Low game
G3   Rate the SingNRoll game

Subjective measurements were collected via the post-session questionnaire (Table 1, Figure 7). The participants were asked to choose from a Likert scale of 5, with larger numbers corresponding to more positive answers. The responses to Q1 ("I understood the instructions") and Q3 ("The game was easy to play") exhibit a high correlation with the introduction of the animated instruction, as they increase noticeably after round 2. This confirms our observation from the objective evaluation that the animated instruction did help the participants learn the game faster and make the
game easier. The scores of Q2 ("The game was fun") increased marginally over time; however, the large variations in the scores of Q4 and Q5 suggest that these improvements in enjoyment were not significant.

Figure 7: Questionnaire responses per round (rounds 1-4), on a 1-5 response scale for Q1-Q5 and G1-G3 (box plots).

The participants were also asked to rate the three games on a Likert scale of 1 to 5. It can be seen that the scores of Xylophone (G1) and High/Low (G2) are noticeably higher than that of SingNRoll (G3). One possible reason is that in the SingNRoll game participants were facing more challenges, which may have attenuated their enjoyment. But even though SingNRoll is less preferred, its scores are on an upward trend, which points to the effectiveness of the adjustments to the system.

7 Comparison with MOGAT

SECCIMA was compared with MOGAT to assess its intuitiveness. In order to distinguish the two, we will prepend the game names with S- or M-, respectively. Both applications have two main components: a listening game (LG: M-Higher Lower vs. S-High/Low) and a singing game (SG: M-Vocal Matcher vs. S-SingNRoll).

7.1 Participants, Setup and Procedure

A total of 12 participants were involved, of which 7 were congenitally deaf, 4 pre-lingually deaf, and 1 post-lingually deaf. The age range was 7 to 10 years. All participants were regular smartphone users and had used them for playing games. Time since cochlear implantation ranged from 20 to 84 months (mean 45.3, std 22.9).

For SECCIMA, the same equipment as in the iterative design process was used. For MOGAT, we used an iPod Touch (4th gen.). The same audio cable was used to connect the device with participants' implants.

All participants had no prior exposure to either interface. They were randomly divided into two groups, each using one of the apps. The same procedure as in the iterative design process was used, i.e. a pre-session briefing, 15 minutes of game play (video recorded), and a short discussion as well as a post-session questionnaire.

7.2 Objective Evaluation

Objective measures were derived from the video recordings. Since one participant opted out of video recording, we were left with 6 subjects in the SECCIMA group and 5 in the MOGAT group.

Table 2: Number of unnecessary taps from users.

App    No. of UI elements    No. of useless taps
M-LG   10                    11, 16, 32, 36, 50
S-LG   6                     0, 0, 0, 0, 0, 0
M-SG   9                     0, 16, 31, 33, 43
S-SG   7                     0, 0, 0, 4, 10, 55

We counted the unnecessary taps (i.e. tapping on UI elements that are not tappable) by the participants in both groups (shown in Table 2), which was then divided by the total number of UI elements in the respective games to normalize the difference in relative densities of the screen layouts. We used this normalized number of unnecessary taps as an objective measure of the intuitiveness of the UI, with more unnecessary taps indicating a poorer interface. As shown in Figure 8a, the normalized numbers of unnecessary taps in SECCIMA were significantly fewer than in MOGAT for the listening game. Results for the singing game also favored SECCIMA, but the difference was not significant.

Figure 8: Objective comparison (box plots). (a) Normalized unnecessary taps from users (lower is better); p = 0.001, 0.51. (b) Success rate in listening games (higher is better), where "S-H/L x" denotes the first x levels of S-High/Low; p = 0.006, 0.06.

Another objective measure we used was the percentage of successful attempts in the listening game. In both interfaces, the child was supposed to distinguish musical notes and to indicate which one is higher or lower. Choosing randomly would give a 50% success rate. Figure 8b shows the success rates for M-Higher Lower and S-High/Low. With a mean success rate of 50.67% (vs. 81.8% and 80.8% in SECCIMA), children playing M-Higher Lower were likely to be making selections randomly, indicating that the user interface was not intuitive enough for the children to understand. Later levels of S-High/Low were more challenging because they involved more than two musical notes (only the first three levels involved two notes), but as student performance remained high, the interface of SECCIMA was demonstrated to be more effective.

In the singing game, none of the 6 children using MOGAT could figure out how to play the game on their own until the teacher intervened. On the contrary, 3 out of 6 children in the SECCIMA group managed to figure it out on their own, and another one kept blowing air instead of singing. Only two of them failed to understand that they were supposed to make sound. Although not as good as in the iterative design study (in which all but one child in rounds 3 and 4 figured it out), this result was not surprising considering the younger age of this group (7-10 vs. 9-12).

In terms of feedback, the children's pitches in the MOGAT group were mostly monotonic, suggesting that they did not fully understand the feedback. In the SECCIMA group, more children were varying their pitches in response
to the feedback. As an example, Figures 9a and 9b show two typical pitch contours from users of MOGAT and SECCIMA, respectively. The pitch contour from MOGAT was mostly monotonic and did not move towards the target pitch over time. On the other hand, the one from SECCIMA slid nicely towards the target pitch and stayed there.

Figure 9: Comparison of two example pitch contours (Pitch (MIDI) vs. Time (sec), with the target pitch marked). (a) One example of a pitch contour from MOGAT. (b) One example of a pitch contour from SECCIMA.

For the comparison between the two singing games, completion time was not used as a comparison metric, because the game designs were too different to compare fairly. For example, M-Vocal Matcher did not penalize the user for incorrect pitch, which made it much easier than S-SingNRoll, because the latter imposed a penalty for incorrect pitch and therefore was harder to win. For the same reason, success rate was not used either.

7.3 Subjective Evaluation

Figure 10: Questionnaire results (higher is better); see text for details about possible inaccuracies (box plots). p = 0.21, 0.58, 0.03, 0.39, 0.74, 0.49, 0.76.

Subjective measurements were collected via the post-session questionnaire (questions in Table 1, results in Figure 10). The children indicated that SECCIMA was easier to understand and play than MOGAT (Q1 and Q3). It is worth noting that some of the children playing MOGAT indicated they understood the game fully (by rating Q1 with a score of 5), although they did not play the game as they were supposed to. In fact, none of the children in the MOGAT group was able to figure out how to play the singing game (Vocal Matcher) after a long struggle, and the teacher had to intervene. Discussions with their teacher confirmed that they tend to say "Yes" to those questions they do not understand. This could potentially bring inaccuracies to the results. Researchers working on comparisons between people of different ages or mental capabilities may need to take this potential bias into consideration.

The children's teacher who helped us conduct the study provided some feedback on both interfaces. She mentioned that MOGAT was less intuitive. For example, the word "Select" used in MOGAT was not in the children's vocabulary, and the children had no exposure to bar graphs until a later stage in primary school. In contrast, SECCIMA's colored balls were simpler and easier to understand. She specifically pointed out that the arrows (see Figure 1) used by Vocal Matcher in MOGAT were confusing, even to her as an adult. Children were tapping on the flashing arrow even though it was not tappable. On the other hand, the animation used in SECCIMA was more intuitive.

8 Discussion

The interfaces of the SECCIMA games were simpler than MOGAT's, and the visual elements were consistent throughout all the games. The lower number of unnecessary taps and the higher success rate in SECCIMA confirm our hypothesis that increasing the simplicity and the consistency of the games made them easy to follow. Additionally, the reinforcement considerations resulted in a step-by-step learning process in SECCIMA. The lesson about the color and position of the balls in the first game helped the students understand how to drag the balls to the correct positions in the second game. Similarly, lessons from each level were applied to play the next levels. The reduction in unnecessary taps and the more-than-chance success rate of SECCIMA show that the children were learning the games, not playing them randomly. Finally, in SingNRoll, familiarity with floating bubbles may have helped the children understand the feedback on how to change their pitch. Thus, it was a compound effect of all design considerations that made the interface more intuitive and less confusing.

We offer a few recommendations for researchers working with children with CI. (1) Test early, test often: no matter how many hours are spent discussing designs in the lab, something unexpected will always happen when testing with children. That said, it is important to (2) carefully balance the number of children in the design vs. evaluation phases. There is usually a limited number of CI users in each city, and parents and caregivers are appropriately cautious about children's participation in research studies. It might be helpful to (3) involve participants outside of the target population. While the real evaluation should still be done with the target group, some of the flaws could have been detected by children with normal hearing.

It is important to (4) reduce the amount of text and use more visual displays, provided those visual displays do not cause distractions. (5) When text is necessary, words and phrases are preferable to sentences, because these children have difficulties comprehending longer text. It is worth noting that even the use of imagery and animation should be controlled to minimize cognitive load. These findings are consistent with those of [17, 18].

We observed that many young children with CI are not able to express themselves effectively (also noted in [16, 17]); forming sentences can be difficult for them. Therefore, (6) obtaining verbal feedback from them takes time and patience (as with adults with cognitive disabilities [20]). As noted earlier in the comparison study, some children indicated that they completely understood the instructions
while they actually did not. When collecting subjective feedback from these children, this nodding phenomenon should be taken into consideration to avoid potential bias. We therefore suggest to (7) use objective measurements in addition to subjective feedback, such as success rate, number of unnecessary taps, and completion times.

9 Conclusion

We designed SECCIMA to facilitate music training of children with cochlear implants using an iterative design process. The observations and design decisions in each iteration were documented. Both objective and subjective evaluations were performed to assess the effectiveness of the design decisions and the ease of use of the interface. The evaluation results show that the adjustments between iterations improved the usability and made the interface easier to use. A comparison with an existing work was performed and showed that our system is more intuitive and easier for the children to use. The insights and guidelines summarized from this study may be helpful to those who are interested in designing applications for children with CI.

Acknowledgments

The authors would like to thank Ms. Joan Tan for facilitating the user study as well as for many excellent suggestions for our project. We are grateful to Ms. Terry Theseira and her colleagues and students from Canossian School for assistance and participation in the user studies. This project was partially funded by research grant R-252-000-597-281 from the National Research Foundation in Singapore.

10 References

[1] B. S. Wilson and M. F. Dorman, "Cochlear implants: a remarkable past and a brilliant future," Hearing Research, vol. 242, no. 1, pp. 3-21, 2008.

[2] H. J. McDermott, "Music perception with cochlear implants: a review," Trends in Amplification, vol. 8, no. 2, pp. 49-82, 2004.

[3] C. J. Limb and A. T. Roy, "Technological, biological, and acoustical constraints to music perception in cochlear implant users," Hearing Research, vol. 308, pp. 13-26, 2014.

[4] V. Looi, K. Gfeller, and V. Driscoll, "Music appreciation and training for cochlear implant recipients: a review," in Seminars in Hearing, vol. 33, no. 4, 2012, p. 307.

[5] Q.-J. Fu, J. J. Galvin, X. Wang, and J.-L. Wu, "Benefits of music training in Mandarin-speaking pediatric cochlear implant users," Journal of Speech, Language, and Hearing Research, pp. 1-7, 2014.

[6] R. M. van Besouw, B. R. Oliver, M. L. Grasmeder, S. M. Hodkinson, and H. Solheim, "Evaluation of an interactive music awareness program for cochlear implant recipients," Music Perception: An Interdisciplinary Journal, vol. 33, no. 4, pp. 493-508, 2016.

[7] Y. Zhou, K. C. Sim, P. Tan, and Y. Wang, "MOGAT: mobile games with auditory training for children with cochlear implants," in ACM Multimedia. ACM, 2012, pp. 429-438.

[8] …ear training after cochlear implantation," Psychomusicology: Music, Mind, and Brain, vol. 22, no. 2, p. 134, 2012.

[9] Q.-J. Fu and J. J. Galvin III, "Maximizing cochlear implant patients' performance with advanced speech training procedures," Hearing Research, vol. 242, no. 1, pp. 198-208, 2008.

[10] Q.-J. Fu and J. J. Galvin III, "Computer-assisted speech training for cochlear implant patients: feasibility, outcomes, and future directions," in Seminars in Hearing, vol. 28, no. 2. NIH Public Access, 2007.

[11] H. Henshaw and M. A. Ferguson, "Efficacy of individual computer-based auditory training for people with hearing loss: a systematic review of the evidence," PLoS ONE, vol. 8, no. 5, p. e62836, 2013.

[12] R. Pizarek, V. Shafiro, and P. McCarthy, "Effect of computerized auditory training on speech perception of adults with hearing impairment," SIG 7 Perspectives on Aural Rehabilitation and Its Instrumentation, vol. 20, no. 3, pp. 91-106, 2013.

[13] O. S. Iversen, K. J. Kortbek, K. R. Nielsen, and L. Aagaard, "Stepstone: an interactive floor application for hearing impaired children with a cochlear implant," in Interaction Design and Children, ser. IDC '07. New York, NY, USA: ACM, 2007, pp. 117-124. [Online]. Available: http://doi.acm.org/10.1145/1297277.1297301

[14] S. Cano, V. Penenory, C. A. Collazos, H. M. Fardoun, and D. M. Alghazzawi, "Training with Phonak: serious game as support in auditory verbal therapy for children with cochlear implants," in ICTs for Improving Patients Rehabilitation Research Techniques, ser. REHAB '15. New York, NY, USA: ACM, 2015, pp. 22-25. [Online]. Available: http://doi.acm.org/10.1145/2838944.2838950

[15] R. Calderon and M. Greenberg, "Social and emotional development of deaf children: family, school, and program effects," The Oxford Handbook of Deaf Studies, Language, and Education, vol. 1, pp. 188-199, 2011.

[16] S. C. Theunissen, C. Rieffe, M. Kouwenberg, L. J. De Raeve, W. Soede, J. J. Briaire, and J. H. Frijns, "Behavioral problems in school-aged hearing-impaired children: the influence of sociodemographic, linguistic, and medical factors," European Child & Adolescent Psychiatry, vol. 23, no. 4, pp. 187-196, 2014.

[17] L. E. Potter, J. Korte, and S. Nielsen, "Design with the deaf: do deaf children need their own approach when designing technology?" in Interaction Design and Children, ser. IDC '14. New York, NY, USA: ACM, 2014, pp. 249-252.

[18] A. Melonio and R. Gennari, "How to design games for deaf children: evidence-based guidelines," in 2nd International Workshop on Evidence-Based Technology Enhanced Learning, ser. Advances in Intelligent Systems and Computing. Springer International Publishing, 2013, vol. 218, pp. 83-92.

[19] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.

[20] S. Johansson, J. Gulliksen, and A. Lantz, "User participation when users have mental and cognitive disabilities," in Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility, ser. ASSETS '15. New York, NY, USA: ACM, 2015, pp. 69-76. [Online]. Available: http://doi.acm.org/10.1145/2700648.2809849
SMC2017-207
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
Miroslav Malik*, Sharath Adavanne*, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha, and Roman Jarina
Department of Multimedia and Information-Communication Technologies, University of Zilina, Slovakia
firstname.lastname@fel.uniza.sk
Audio Research Group, Department of Signal Processing, Tampere University of Technology, Finland
firstname.lastname@tut.fi
In this paper, we propose a method that exploits the combined modeling capacities of the CNN, RNN, and fully connected (FC) network. We call this the stacked convolutional and recurrent neural network (CRNN). Similar architectures of stacked CNN, RNN, and FC have been used in the literature and shown to consistently outperform previous network configurations for sound event detection [22], speech recognition [23], and music classification [24]. We approach the problem as a regression task and evaluate the method using the EiM dynamic task dataset. We use RMSE as the metric in order to provide a direct comparison with results from existing works. Our proposed method employs only a small fraction of the parameters of the recent state-of-the-art DBLSTM-based system by Li et al. [25] and performs significantly better (i.e., lower RMSE for both arousal and valence recognition). Additionally, we evaluate our method with a raw feature set (log mel-band energies). This is motivated by the fact that neural networks can by themselves learn the first-order derivatives and statistics present in the feature set provided in the EiM dataset. These raw features are extracted from the same songs that were utilized in the EiM dataset. The results using the raw feature set are comparable to those achieved with the Li et al. system.

The rest of the paper is organized as follows. In Section 2 we present our proposed method. Section 3 describes the dataset, features, metric, baseline, and the evaluation procedure. We analyze and discuss the obtained results in Section 4. Finally, we conclude the paper in Section 5.

2. PROPOSED METHOD

The input to the proposed method is a sequence of audio features extracted from audio. We evaluate the method with two separate audio features: baseline (Section 3.2.1) and raw (Section 3.2.2). Our proposed method, illustrated in Figure 1, is a neural network architecture obtained by stacking a CNN with two branches of FC and RNN, one each for arousal and valence. This combined CRNN architecture maps the input audio features into their respective arousal and valence values. The output of the method, arousal and valence values, is in the range [-1, 1].

[Figure 1: Proposed method of stacked convolutional and recurrent neural network for music emotion recognition. The input (baseline openSMILE features, 60x260x1, or log mel-band energies, 60x64x1) passes through a 2D CNN (8 filters, 3x3, ReLU) with batch normalization, then splits into two identical branches, each with a time-distributed FC (8 units), a bidirectional GRU (8 units, tanh), and a time-distributed maxout output node, producing arousal and valence respectively.]

In our method, local shift-invariant features are extracted from the audio features using a CNN with a receptive filter size of 3x3. The feature map of the CNN acts as the input to two parallel but identical branches, one for arousal and the other for valence. Each of these branches consists of an FC layer, a bidirectional gated recurrent unit (GRU) [26], and an output layer consisting of one maxout node [27]. Both the FC and maxout layers have their weights shared across time steps. The FC layer is utilized to reduce the dimensionality of the feature maps from the CNN and, consequently, reduce the number of parameters required by the GRU. The bidirectional GRU layer is utilized to learn the temporal information in the features. The maxout layer is employed as the regression layer due to its ability to approximate a convex, piecewise linear activation function [27].

We use the rectified linear unit (ReLU) [28] activation and batch normalization [29] for the CNN layer. The FC uses the linear activation and the GRUs use the tanh activation. In the bidirectional GRU, we concatenate the forward and backward activations. The network was trained using the backpropagation through time algorithm [30], the Adam optimizer with the default parameters in [31], and the RMSE as the loss function. In order to reduce overfitting of the network to the training data, we use dropout [32] for the CNN and RNN layers and ElasticNet [33] regularization for the weights and activity of the CNN. The network was implemented using the Keras framework [34] and the Theano backend [35].

3. EVALUATION

3.1 Dataset

For the evaluation of our method, we utilized the dataset provided for the MediaEval EiM task [15]. For annotations, Russell's two-dimensional continuous emotional space [36] was used. The arousal and valence values belonging to every 500 ms segment were annotated by five to seven annotators per song from Amazon Mechanical Turk.

The training set consisted of 431 audio excerpts with a length of 45 seconds from the Free Music Archive. The first 15 seconds of each excerpt were used for annotator adaptation. The annotations and features for the remaining 30 seconds were provided as the development set. This amounts to 60 annotations per audio file (30 s of audio with annotations every 500 ms). Each of the annotations is in the range [-1, 1] for both arousal and valence. Negative values represent low arousal/valence and positive values high.

The evaluation set consists of 58 full songs from the royalty-free multitrack MedleyDB dataset and the Jamendo music website. Similar features and annotations as for the development set were provided for the evaluation set.

3.2 Audio features

3.2.1 Baseline features

The feature set that we used is the one that was utilized in the EiM challenge. Most research is based on this feature set, so we call it the baseline feature set. It consists
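The maxout regression layer used here computes the maximum over a set of affine projections of its input; a minimal NumPy sketch of a single maxout unit (illustrative only, with hypothetical weights, not the paper's trained layer):

```python
import numpy as np

def maxout(x, w, b):
    """Maxout unit [27]: the maximum over k affine pieces.
    x: input vector (d,); w: weights (k, d); b: biases (k,).
    With enough pieces this can approximate any convex,
    piecewise linear activation function."""
    return float(np.max(w @ x + b))

# example: with w = [[1], [-1]] and b = [0, 0], a single maxout
# unit over a scalar input realises the absolute-value function
w = np.array([[1.0], [-1.0]])
b = np.zeros(2)
value = maxout(np.array([-0.3]), w, b)  # |x| = 0.3
```

This convexity is what makes a one-node maxout layer a plausible regression output for values in [-1, 1].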
of the mean and standard deviation of 65 low-level acoustic descriptors and their first-order derivatives from the 2013 INTERSPEECH Computational Paralinguistics Challenge [37], extracted with the openSMILE toolbox [38]. The feature set includes mel-frequency cepstral coefficients (MFCCs), spectral features such as flux, centroid, kurtosis, and rolloff, and voice-related features such as jitter and shimmer. The total number of features is 260, and they were extracted from non-overlapping segments of length 500 ms. The mean and standard deviation of the features for each 500 ms segment were estimated from frames of 60 ms with 50 ms (~80%) overlap.

The above baseline features consist of the mean, standard deviation, and first- and second-order derivatives of different raw features. Neural networks can by themselves extract such statistics from the raw features directly. Thus, we argue that these features are not the best choice for neural networks. In order to study the performance of the proposed network with raw features, we employ just the log mel-band energies. Mel-band related features have been used previously for the MediaEval EiM task [18] and the MIREX AMC task [39]. The librosa Python library [40] was used to extract the mel-band features from 500 ms segments, in a similar fashion to the baseline feature set.

3.3 Metric

The RMSE is widely used in many areas as a statistical metric to measure model performance. It represents the standard deviation of the differences between the predicted values and the reference values at the sample level. Given N predicted samples ŷ_n and the corresponding reference samples y_n, the RMSE between them can be written as:

RMSE = sqrt( (1/N) Σ_{n=1}^{N} (ŷ_n − y_n)² )    (1)

3.4 Baseline

The proposed method is compared to the system proposed in [25]. To the best of our knowledge, this method has the top results on the MediaEval EiM dataset using the baseline audio features. This result was obtained using an ensemble of six bidirectional LSTM (BLSTM) networks of five layers and 256 units each, trained on a sequence length of 10. The first two layers of these were pre-trained with the baseline features. The six networks of the ensemble were chosen such that they covered all the training data and sequence lengths of 10, 20, 30, and 60. The output of the ensemble of networks was fused using SVR with a third-order polynomial kernel and an artificial neural network (ANN) of 14 nodes. The average output of the SVR and ANN was used as the final arousal and valence values.

In terms of complexity of this baseline method, a five-layer BLSTM with 256 input and output units has about 2 million parameters. An ensemble of six BLSTMs compounded in this manner will have about 12 million parameters. Additionally, the SVR and ANN add further complexity to the overall method.

[Figure 2: Stacked convolutional and recurrent neural network without branching (CRNN NB). The architecture matches Figure 1, but with a single FC-GRU-maxout branch producing both arousal and valence.]

3.5 Evaluation procedure

The hyperparameters of the proposed method were estimated by varying the number of layers of the CNN, FC, and GRU from one to three, and the number of units for each of these in the set {4, 8, 16, 32}. Identical dropout rates were used for the CNN and GRU, varied in the set {0.25, 0.5, 0.75}. The ElasticNet variables L1 and L2 were each varied in the set {1, 0.1, 0.01, 0.001, 0.0001}. The parameters were decided based on the best RMSE score on the development set, using the baseline features, a mini-batch size of 32, and a maximal sequence length of 60. The mini-batch size of 32 was chosen from the set {16, 32, 64, 128} based on the variance in the training-set error and the number of iterations taken to achieve the best RMSE. The sequence length was chosen to be the same as the number of annotations per audio file in the training set (30 s per audio clip, annotated every 0.5 s); this sequence length was varied from 10 s to 60 s in multiples of 2. The network was trained to achieve the lowest average RMSE of valence and arousal. The best configuration with the least number of parameters had one layer each of CNN, FC, and GRU with eight units each, a dropout rate of 0.25, L1 of 0.1, and L2 of 0.001. Figure 1 shows the configuration and the feature map dimensions of the proposed method.

In order to compare the performance of the CRNN without the two branches (CRNN NB), we removed one branch from the above configuration and trained the CRNN with both valence and arousal on the same branch, as shown in Figure 2.

The proposed CRNN method was also trained with the raw features (log mel-band energies). In order to give a direct comparison of performance, we keep the exact configuration of the proposed CRNN, with only the dropout changed to 0.75. The network was seen to overfit to the training
data with the 0.25 dropout rate that was used for the baseline features. A dropout rate of 0.75 was chosen after tuning in the set {0.25, 0.5, 0.75}.

Table 1: RMSE errors for development and evaluation data using (a) baseline and (b) log mel-band energy features. The numbers reported are the mean and standard deviation over five separate runs of the method with different initializations of the network weights. The proposed CRNN network performance is reported for sequence lengths 10, 20, 30, and 60.

Table 2: RMSE errors for evaluation data using baseline features. The numbers reported for CRNN NB are the mean and standard deviation over five separate runs of the method with different initializations of the network weights.
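The RMSE metric of Eq. (1) is straightforward to compute; a small NumPy sketch with made-up per-segment values (illustrative only):

```python
import numpy as np

def rmse(y_pred, y_ref):
    """Root-mean-square error between predicted and reference values, Eq. (1)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_ref = np.asarray(y_ref, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_ref) ** 2)))

# e.g. per-segment arousal predictions vs. annotations (made-up numbers)
error = rmse([0.1, -0.2, 0.4], [0.0, -0.1, 0.5])  # 0.1
```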
[12] MIREX 2011: MIREX 2010: Audio train/test: Mood classification - MIREX08 dataset, accessed 14.2.2017. [Online]. Available: http://www.music-ir.org/nema_out/mirex2011/results/act/mood_report/summary.html

[13] K. Drossos, A. Floros, A. Giannakoulopoulos, and N. Kanellopoulos, "Investigating the impact of sound angular position on the listener affective state," IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 27-42, Jan 2015.

[25] X. Li, J. Tian, M. Xu, Y. Ning, and L. Cai, "DBLSTM-based multi-scale fusion for dynamic emotion prediction in music," in Multimedia and Expo (ICME), 2016 IEEE International Conference on. IEEE, 2016, pp. 1-6.

[26] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[27] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in International Conference on Machine Learning (ICML), 2013.

[28] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.

[29] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015.

[30] P. J. Werbos, "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550-1560, 1990.

[31] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980 [cs.LG], 2014.

[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research (JMLR), 2014.

[33] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, 2005, pp. 301-320.

[34] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015, accessed 16.01.2017.

[35] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688

[36] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161-1178, 1980.

[37] F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro, and K. R. Scherer, "On the acoustics of emotion in audio: what speech, music, and sound have in common," Frontiers in Psychology, vol. 4, p. 292, 2013. [Online]. Available: http://journal.frontiersin.org/article/10.3389/fpsyg.2013.00292

[38] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp. 835-838.

[39] T. Lidy and A. Schindler, "Parallel convolutional neural networks for music genre and mood classification," MIREX 2016, 2016.

[40] B. McFee, M. McVicar, C. Raffel, D. Liang, O. Nieto, E. Battenberg, J. Moore, D. Ellis, R. Yamamoto, R. Bittner, D. Repetto, P. Viktorin, J. F. Santos, and A. Holovaty, "librosa: 0.4.1," accessed 16.01.2017. [Online]. Available: http://dx.doi.org/10.5281/zenodo.32193
ABSTRACT

The attack phase of sound events plays an important role in how sounds and music are perceived. Several approaches have been suggested for locating salient time points and critical time spans within the attack portion of a sound, and some have been made widely accessible to the research community in toolboxes for Matlab. While some work exists where proposed audio descriptors are grounded in listening tests, the approaches used in two of the most popular toolboxes for musical analysis have not been thoroughly compared against perceptual results. This article evaluates the calculation of attack phase descriptors in the Timbre Toolbox and the MIRtoolbox by comparing their predictions to empirical results from a listening test. The results show that the default parameters in both toolboxes give inaccurate predictions for the sound stimuli in our experiment. We apply a grid search algorithm to obtain alternative parameter settings for these toolboxes that align their estimations with our empirical results.

1. INTRODUCTION

Many music researchers rely on available toolboxes for quantitative descriptions of the characteristics of sound files. These descriptions are commonly referred to as descriptors or features, and they may be used as variables in experimental research or as input to a classification algorithm. In this paper we investigate the descriptors that concern the first part of sonic events, which we will refer to as attack phase descriptors. [Footnote 1: Note that "phase" in this context signifies that these descriptors concern a certain temporal segment of the sound. This differs from the technical meaning of phase in signal processing, such as in a phase spectrum.] In particular, we are interested in the detection of salient time points, such as the point when the sound event is first perceived, the point when it reaches its peak amplitude, and the perceptual attack of the sound event, which will be introduced properly in Section 2. Robust detection of such points in time is essential to obtain accurate values for the attack phase descriptors commonly used in the music information retrieval community, such as log-attack time and attack slope. We compare the results of two popular toolboxes for Matlab, the MIRtoolbox [1] (version 1.6.2) and the Timbre Toolbox [2] (version 1.4), in estimating these salient moments. Sound samples from real music recordings are used to compare the toolbox analysis results to a listening test. As such, our research complements the work by Kazazis et al. [3], where the toolbox results are compared to strictly controlled synthesis parameters using additive synthesis.

In Section 2 we discuss previous work in auditory perception concerned with the attack portion of sonic events. In Section 3 we take a closer look at computational estimation of attack phase descriptors. Further, in Section 4, we show how the algorithms in the MIRtoolbox and Timbre Toolbox compare to our own experimental results, and then we discuss these results and conclude in Section 5.

Copyright: © 2017 K. Nymoen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

2. BACKGROUND

The attack of musical events has been studied from a range of perspectives. Pierre Schaeffer experimented with tape recordings of sounds in the 1960s [4]. By cutting off the beginning of sound events he showed the importance of the attack portion for the perception of sonic events.

A seminal paper in this field is John W. Gordon's study from 1987 of the perceptual attack time of musical tones [5]. Gordon manipulated synthesis parameters in a digital synthesizer and observed how the perceived moment of metrical alignment was affected. A range of parameters were found to be involved. Gordon introduced the term perceptual attack time to describe the moment in time perceived as the rhythmic emphasis of a sound [5]. The term was further explained by Wright, saying that "the perceptual attack time of a musical event is its perceived moment of rhythmic placement" [6]. This definition emphasises that the perceptual attack time of a sound event acts as a reference when aligning it with other events to create a perceptually isochronous sequence, as illustrated in Figure 1. Wright further argues that the perceptual attack of a sound event is best modelled not as a single point in time, but as a continuous probability density function indicating the likelihood as to where a listener would locate the maxima of rhythmic emphasis. These definitions are strongly linked to the concept of perceptual centres (P-centres) in speech timing, defining the moment in time when a brief event is perceived to occur [7]. In effect, when two sounds are perceived as synchronous, it is their P-centres that are aligned.

In addition to the perceptual attack time, Gordon [5] argues that it makes sense to talk about both physical onset time and perceptual onset time for a given sound event.
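The two attack phase descriptors named in the introduction, log-attack time and attack slope, reduce to simple arithmetic once the start and end of the attack are known; a sketch of the textbook (MPEG-7-style) definitions, not the toolboxes' exact code, with made-up time points:

```python
import math

def log_attack_time(t_start, t_end):
    """Log-attack time: log10 of the attack duration in seconds."""
    return math.log10(t_end - t_start)

def attack_slope(e_start, e_end, t_start, t_end):
    """Attack slope: average rate of increase of the energy envelope
    over the attack phase."""
    return (e_end - e_start) / (t_end - t_start)

lat = log_attack_time(0.50, 0.60)            # a 100 ms attack -> -1.0
slope = attack_slope(0.0, 1.0, 0.50, 0.60)   # 10.0 per second
```

Both values clearly depend entirely on how accurately the attack start and end points are detected, which is the topic of the next sections.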
3. COMPUTING ATTACK PHASE DESCRIPTORS

In this section we look more closely at how various attack phase descriptors are computed in the Timbre Toolbox and MIRtoolbox. With reference to the list of descriptors in Table 1, the computational estimation of ph-type descriptors is conceptually quite different from that of pe-type descriptors. While a physical feature of the signal can be calculated accurately, and only depends on algorithmic parameters like filter cutoff frequency or window length, any perceptual feature can only be an estimate of how the particular sound event would be perceived. Neither of the toolboxes states clearly whether the computed descriptors are estimates of perceptual or physical features. Both employ algorithms where the end of the attack range does not necessarily coincide with the peak of the energy envelope. This suggests that the toolboxes take into account the fact that the perceived attack of a sound event might end before the energy peak.

Below, we take a closer look at the attack phase descriptors as they are provided by the toolboxes, and in Section 4 […]. The common basis of most algorithms for calculating attack phase descriptors is some function describing the energy envelope of the sound signal. Attack phase descriptors are calculated based on various thresholds applied to this envelope and its derivatives. The method and the parameters specified in the calculation of the energy envelope strongly influence the estimated attack phase descriptors.

[…] the energy envelope. This makes sense when seen in relation to the previously mentioned MIREX audio onset competition, but is distinctly different to the meaning of onset in our paper. Both toolboxes provide estimates of when the attack of a sound event begins and ends. Both also provide several directly or indirectly related descriptors (such as attack slope and temporal centroid).

Peeters et al. [2] argue that while a common approach in estimating the beginning and end of an attack range is to apply fixed thresholds to the energy envelope, more robust results may be obtained by looking at the slope of the energy envelope before it reaches its peak value. Trumpet sounds are mentioned as a particular reason for this, as their energy envelopes may increase after the attack. Both of the toolboxes we discuss take such an approach, however with different strategies. The two strategies are illustrated in Figure 3, and explained further below.

[Figure 3: Timbre Toolbox attack phase estimation (effort function and mean effort over time in seconds), and MIRtoolbox attack phase estimation (time derivative of the energy envelope, its peak position, and the threshold).]
Peeters [11] suggested the "weakest effort" method, which is implemented in the Timbre Toolbox for estimating the beginning and end of the attack. In this method, a set of i equidistant thresholds θ_i is specified between zero and the peak value of the energy envelope. For each θ_i, a corresponding t_i denotes the first time when the energy envelope crosses the corresponding threshold. Then a set of effort parameters w_{i,i+1} is calculated as the time between each t_i. With the mean effort notated as w̄, the beginning of the attack is calculated as the first w_{i,i+1} that falls below α·w̄, and the end as the first successive w_{i,i+1} that exceeds α·w̄. Both are also adjusted to the local maximum/minimum within the w_{i,i+1} interval. The default value for α is 3.

The MIRtoolbox uses the first time derivative of the energy envelope (e′) as the basis for estimating attack phase descriptors. A peak in e′ necessarily occurs before the energy peak itself, and the attack is predicted to begin when e′ goes above 20% of its peak value, and to end when it falls below 20% of its peak value.

[…] 18 ms (dark sounds with slow attack and long duration). This verifies previous research, which suggests that perceived attack may best be modelled as a range (i.e. a beat bin [12]) or a probability distribution [6], rather than a single point in time. Consequently, we cannot evaluate the toolboxes based on a single value (e.g., their estimation of the beginning of the attack), but rather need to determine the amount of overlap between the calculated attack range and the distribution of perceptual responses.

The use of a simultaneous click track and stimulus track may provide fusion cues that are not inherent to the perceptual attack of the stimulus. Several scholars have argued that an alternating sequence between stimulus and click might be more reliable [9, 13]. We controlled for the effects of event fusion by a corresponding anti-phase alignment task, alternating click and stimulus. The results showed no significant difference between the two response modes [14], which is also in accordance with Villing's reported consistency across these measurement methods [7].
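The MIRtoolbox strategy described above can be sketched as follows; this is a simplified NumPy illustration of the 20% derivative-threshold idea, not the toolbox's actual Matlab implementation:

```python
import numpy as np

def mir_attack_range(env, t, rel_threshold=0.2):
    """Sketch of a MIRtoolbox-style attack estimation: the attack range is
    delimited by where the time derivative of the energy envelope (e')
    rises above, and later falls below, a fraction of its own peak value."""
    d = np.gradient(env, t)              # e', first time derivative
    thr = rel_threshold * d.max()
    peak_idx = int(np.argmax(d))
    # begin: first sample where the derivative reaches the threshold
    begin = int(np.argmax(d >= thr))
    # end: first sample after the derivative peak below the threshold
    after = np.nonzero(d[peak_idx:] < thr)[0]
    end = peak_idx + int(after[0]) if after.size else len(d) - 1
    return t[begin], t[end]

# a toy envelope: linear 1-second rise, then sustain
env = np.concatenate([np.linspace(0.0, 1.0, 11), np.ones(10)])
t = np.linspace(0.0, 2.0, 21)
begin, end = mir_attack_range(env, t)
```

For the toy ramp the detected range spans the rising segment, ending just after the envelope levels off.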
[…] coded (i.e. not intended for user adjustment), one may modify the code in order to run an optimisation algorithm on the parameters. We have used a two-dimensional grid search to minimise the difference between the toolbox results and our perceptual data. For each parameter setting, we computed the Jaccard Index [15] between the estimated attack range and the time span covered by our experimental results (mean±SD). This is a measure of the amount of overlap between the two, and takes a value of 1 if the two ranges are identical, and 0 if there is no overlap. For the default parameter settings, the mean Jaccard Index across all sounds was 0.41 (MIRtoolbox) and 0.25 (Timbre Toolbox).

Figure 6 shows the output of our grid search algorithm for the MIRtoolbox. The mean Jaccard Index across the nine sounds was used as the fitness measure. 30 settings were tested per parameter, in total 900 parameter settings per toolbox. The results from the parameter optimisation process are shown in Table 2, and the corresponding optimised estimates of attack ranges are shown in Figure 5. The Jaccard Index scores for the optimised parameters were 0.57 for the MIRtoolbox and 0.61 for the Timbre Toolbox. The main improvement is the reduced spread of the attack times compared with the default attack estimates, as a result of more detailed energy envelopes.

[Figure 6. Output of the grid search algorithm optimising the frame size (seconds) and threshold (fraction of e′peak) parameters in the MIRtoolbox; colour encodes the Jaccard Index.]

Table 2. Parameters for optimisation, default values and optimised results:

Toolbox        | Envelope parameter                   | Threshold parameter
Timbre Toolbox | LP filter cutoff frequency:          | α:
               |   Default: 5 Hz, Optimised: 37 Hz    |   Default: 3, Optimised: 3.75
MIRtoolbox     | Frame size:                          | Fraction of e′peak:
               |   Default: 0.1 s, Optimised: 0.03 s  |   Default: 0.2, Optimised: 0.075
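The Jaccard Index between two time ranges, as used in the grid search above, can be sketched in a few lines (an illustration of the overlap measure, not the authors' exact code, with made-up numbers):

```python
def jaccard_index(a, b):
    """Overlap of two time ranges a = (start, end), b = (start, end):
    intersection length over union length. 1.0 for identical ranges,
    0.0 for disjoint ones."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# estimated attack range vs. perceptual mean +/- SD span (made-up numbers)
score = jaccard_index((0.10, 0.30), (0.20, 0.40))  # 0.1 / 0.3 = 1/3
```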
[Figure 1. Simplified model comparison of the frame-level approach using mel-spectrogram (left), the frame-level approach using raw waveforms (middle), and the sample-level approach using raw waveforms (right).]
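The key difference between the frame-level and sample-level front ends compared in Figure 1 is the first layer's stride; the arithmetic can be sketched as follows (the sample rate and stride values are illustrative, following the ranges quoted in the surrounding text):

```python
SAMPLE_RATE = 16000  # Hz, the rate assumed in the text

def stride_ms(stride_samples, sr=SAMPLE_RATE):
    """Duration of one hop of the first convolution layer, in milliseconds."""
    return 1000.0 * stride_samples / sr

# frame-level: first-layer strides of 160-320 samples correspond to 10-20 ms,
# comparable to an STFT hop size
frame_level_hop = stride_ms(160)   # 10.0 ms

def total_downsampling(base=3, layers=6):
    """Overall temporal reduction of a stack of `layers` stride/pooling
    stages of size `base`, e.g. 3**6 = 729 samples per output step."""
    return base ** layers

reduction = total_downsampling(3, 6)  # 729, as in the 729 -> 243 example below
```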
Since audio waveforms are one-dimensional data, previous work that takes waveforms as input used a CNN consisting of one-dimensional convolution and pooling stages. While the convolution operation and filter length in the upper layers are usually similar to those used in the image domain, the bottom layer that takes the waveform directly conducts a special operation called strided convolution, which takes a large filter length and strides it by the filter length (or half of it). This frame-level approach is comparable to hopping windows with 100% or 50% hop size in a short-time Fourier transform. In many previous works, the stride and filter length of the first convolution layer were set to 10-20 ms (160-320 samples at 16 kHz audio) [8, 10-12].

In this paper, we reduce the filter length and stride of the first convolution layer to the sample level, which can be as small as 2 samples. Accordingly, we increase the depth of layers in the CNN model. There are some works that use 0.6 ms (10 samples at 16 kHz audio) as a stride length [6, 7], but they used a CNN model with only three convolution layers, which is not sufficient to learn the complex structure of musical signals.

3. LEARNING MODELS

Figure 1 illustrates the three CNN models for the music auto-tagging task that we compare in our experiments. In this section, we describe the three models in detail.

3.1 Frame-level mel-spectrogram model

This is the most common CNN model used in music auto-tagging. Since the time-frequency representation is two-dimensional data, previous work regarded it as either two-dimensional images or a one-dimensional sequence of vectors [11, 13-15]. We only used a one-dimensional (1D) CNN model for the experimental comparisons in our work because the performance gap between 1D and 2D models is not significant and the 1D model can be directly compared to models using raw waveforms.

3.2 Frame-level raw waveform model

In the frame-level raw waveform model, a strided convolution layer is added beneath the bottom layer of the frame-level mel-spectrogram model. The strided convolution layer is expected to learn a filter-bank representation that corresponds to the filter kernels in a time-frequency representation. In this model, once the raw waveforms pass through the first strided convolution layer, the output feature map has the same dimensions as the mel-spectrogram. This is because the stride, filter length, and number of filters of the first convolution layer correspond to the hop size, window size, and number of mel bands in the mel-spectrogram, respectively. This configuration was used for the music auto-tagging task in [11, 12], and so we used it as a baseline model.

3.3 Sample-level raw waveform model

As described in Section 1, the approach using raw waveforms should be able to address log-scale amplitude compression and phase invariance. Simply adding a strided convolution layer is not sufficient to overcome these problems. To improve on this, we add multiple layers beneath the frame level such that the first convolution layer can handle a much smaller number of samples. For example, if the stride of the first convolution layer is reduced from 729 (= 3^6) to 243 (= 3^5), a 3-size convolution layer and a max-pooling layer are added to keep the output dimensions of the subsequent convolution layers unchanged. If we repeatedly reduce the stride of the first convolution layer this way, six convolution layers (five pairs of a 3-size convolution layer and a max-pooling layer, following one 3-size strided convolution layer) will be added (we assume that the temporal dimensionality reduction occurs only through max-pooling and striding, while zero-padding is used in convolution to preserve the size). We describe the configuration strategy of the sample-level CNN model in more detail in the following section.

3.4 Model Design

Since the length of an audio clip is variable in general, the following issues should be considered when configuring the temporal CNN architecture:
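To make the layer bookkeeping of the sample-level configuration concrete, here is a minimal sketch of a 3^n model: one 3-size strided convolution, then n pairs of 3-size convolution and 3-size max-pooling, then a 1-size convolution acting as the output layer. The framework (PyTorch), the filter counts, and the tag count are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn as nn

def sample_level_cnn(n_modules=9, n_filters=128, n_tags=50):
    """Sketch of a 3^n sample-level DCNN: a 3-size strided convolution,
    n pairs of (3-size convolution, 3-size max-pooling), and a 1-size
    convolution that works as a fully-connected output layer."""
    layers = [nn.Conv1d(1, n_filters, kernel_size=3, stride=3),
              nn.BatchNorm1d(n_filters), nn.ReLU()]
    for _ in range(n_modules):
        layers += [nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1),
                   nn.BatchNorm1d(n_filters), nn.ReLU(),
                   nn.MaxPool1d(3)]
    layers += [nn.Conv1d(n_filters, n_tags, kernel_size=1), nn.Sigmoid()]
    return nn.Sequential(*layers)

model = sample_level_cnn(n_filters=32)
x = torch.randn(1, 1, 59049)  # 59049 samples -> 19683 frames after the first layer
y = model(x)                  # shape (1, n_tags, 1)
```

With 59049 input samples the first layer yields 19683 frames, and nine pooling stages of size 3 reduce the temporal dimension to 1, matching the 3^9 configuration.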
4. EXPERIMENTAL SETUP

In this section, we introduce the datasets used in our experiments and describe the experimental settings.

1 MTAT contains 170 hours of audio and MSD contains 1955 hours of audio in total.
2 https://github.com/keunwoochoi/MSD_split_for_tagging
2^n models

model with 16384 samples (743 ms) as input:
model          n    layer     filter length & stride    AUC
64 frames      6    1+6+1     256                       0.8839
128 frames     7    1+7+1     128                       0.8899
256 frames     8    1+8+1     64                        0.8968
512 frames     9    1+9+1     32                        0.8994
1024 frames    10   1+10+1    16                        0.9011
2048 frames    11   1+11+1    8                         0.9031
4096 frames    12   1+12+1    4                         0.9036
8192 frames    13   1+13+1    2                         0.9032

model with 32768 samples (1486 ms) as input:
model          n    layer     filter length & stride    AUC
128 frames     7    1+7+1     256                       0.8834
256 frames     8    1+8+1     128                       0.8872
512 frames     9    1+9+1     64                        0.8980
1024 frames    10   1+10+1    32                        0.8988
2048 frames    11   1+11+1    16                        0.9017
4096 frames    12   1+12+1    8                         0.9031
8192 frames    13   1+13+1    4                         0.9039
16384 frames   14   1+14+1    2                         0.9040

3^n models

model with 19683 samples (893 ms) as input:
model          n    layer     filter length & stride    AUC
27 frames      3    1+3+1     729                       0.8655
81 frames      4    1+4+1     243                       0.8753
243 frames     5    1+5+1     81                        0.8961
729 frames     6    1+6+1     27                        0.9012
2187 frames    7    1+7+1     9                         0.9033
6561 frames    8    1+8+1     3                         0.9039

model with 59049 samples (2678 ms) as input:
model          n    layer     filter length & stride    AUC
81 frames      4    1+4+1     729                       0.8655
243 frames     5    1+5+1     243                       0.8823
729 frames     6    1+6+1     81                        0.8936
2187 frames    7    1+7+1     27                        0.9002
6561 frames    8    1+8+1     9                         0.9030
19683 frames   9    1+9+1     3                         0.9055

4^n models

model with 16384 samples (743 ms) as input:
model          n    layer     filter length & stride    AUC
64 frames      3    1+3+1     256                       0.8828
256 frames     4    1+4+1     64                        0.8968
1024 frames    5    1+5+1     16                        0.9010
4096 frames    6    1+6+1     4                         0.9021

model with 65536 samples (2972 ms) as input:
model          n    layer     filter length & stride    AUC
256 frames     4    1+4+1     256                       0.8813
1024 frames    5    1+5+1     64                        0.8950
4096 frames    6    1+6+1     16                        0.9001
16384 frames   7    1+7+1     4                         0.9026

5^n models

model with 15625 samples (709 ms) as input:
model          n    layer     filter length & stride    AUC
125 frames     3    1+3+1     125                       0.8901
625 frames     4    1+4+1     25                        0.9005
3125 frames    5    1+5+1     5                         0.9024

model with 78125 samples (3543 ms) as input:
model          n    layer     filter length & stride    AUC
625 frames     4    1+4+1     125                       0.8870
3125 frames    5    1+5+1     25                        0.9004
15625 frames   6    1+6+1     5                         0.9041

Table 2. Comparison of various m^n-DCNN models with different input sizes. m refers to the filter length and pooling length of the intermediate convolution layer modules and n refers to the number of the modules. "Filter length & stride" indicates the values of the first convolution layer. In the layer column, the first "1" of 1+n+1 is the strided convolution layer, and the last "1" is a convolution layer which actually works as a fully-connected layer.
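The rows of the table obey a simple bookkeeping relation: the number of frames times the first-layer stride equals the input length in samples, and n = log_m(frames). A quick sanity check:

```python
import math

def check_row(frames, stride, m, input_samples):
    """Verify a table row: frames * stride must equal the input length in
    samples, and n must satisfy m ** n == frames. Returns n."""
    n = round(math.log(frames, m))
    assert m ** n == frames, "frames is not a power of m"
    assert frames * stride == input_samples, "frames * stride != input length"
    return n

# the best configuration reported: m = 3, 19683 frames, first-layer stride 3
assert check_row(19683, 3, 3, 59049) == 9
# a 2^n row: 1024 frames with stride 16 covers 16384 samples
assert check_row(1024, 16, 2, 16384) == 10
```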
4.2 Optimization

We used sigmoid activation for the output layer and binary cross-entropy loss as the objective function to optimize. For every convolution layer, we used batch normalization [20] and ReLU activation. We should note that, in our experiments, batch normalization plays a vital role in training the deep models that take raw waveforms. We applied dropout of 0.5 to the output of the last convolution layer and minimized the objective function using stochastic gradient descent with 0.9 Nesterov momentum. The learning rate was initially set to 0.01 and decreased by a factor of 5 when the validation loss did not decrease for more than 3 epochs. After a total of 4 decreases, the learning rate of the final training stage was 0.000016. Also, we used batch sizes of 23 for MTAT and 50 for MSD, respectively. In the mel-spectrogram model, we performed input normalization simply by dividing by the standard deviation after subtracting the mean value of the entire input data. On the other hand, we did not perform input normalization for the raw waveforms.

5. RESULTS

In this section, we examine the proposed models and compare them to previous state-of-the-art results.

5.1 m^n-DCNN models

Table 2 shows the evaluation results for the m^n-DCNN models on MTAT for different input sizes, numbers of layers, and filter lengths and strides of the first convolution layer. As
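The learning-rate schedule described in Section 4.2 can be sketched as a simple plateau-based decay (an illustrative sketch, not the authors' code; the class name and its interface are assumptions):

```python
class PlateauLR:
    """Divide the learning rate by `factor` when the validation loss has not
    improved for more than `patience` epochs, up to `max_drops` times."""
    def __init__(self, lr=0.01, factor=5.0, patience=3, max_drops=4):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.max_drops, self.drops = max_drops, 0
        self.best, self.wait = float("inf"), 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the rate."""
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
            if self.wait > self.patience and self.drops < self.max_drops:
                self.lr /= self.factor
                self.drops += 1
                self.wait = 0
        return self.lr

sched = PlateauLR()
# four drops from 0.01 by a factor of 5 give 0.01 / 5**4 = 0.000016,
# matching the final learning rate quoted in the text
```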
described in Section 3.4, m refers to the filter length and pooling length of the intermediate convolution layer modules and n refers to the number of the modules. In Table 2, we can first see that the accuracy is proportional to n for most m. Increasing n for a given m and input size indicates that the filter length and stride of the first convolution layer become closer to the sample level (e.g. size 2 or 3). When the first layer reaches this small granularity, the architecture can be seen as a model constructed with the same filter length and sub-sampling length in all convolution layers, as depicted in Table 1. The best results were obtained when m was 3 and n was 9. Interestingly, the length of 3 corresponds to the 3-size spatial filters in the VGG net [2]. In addition, we can see that 1-3 seconds of audio is a reasonable choice of input length to the network in the raw waveform model, as in the mel-spectrogram model.

5.2 Mel-spectrogram and raw waveforms

Considering that the output size of the first convolution layer in the raw waveform models is equivalent to the mel-spectrogram size, we further validate the effectiveness of

input type                     model                 MTAT     MSD
Frame-level (mel-spectrogram)  Persistent CNN [21]   0.9013   -
                               2D CNN [14]           0.894    0.851
                               CRNN [15]             -        0.862
                               Proposed DCNN         0.9059   -
Frame-level (raw waveforms)    1D CNN [11]           0.8487   -
Sample-level (raw waveforms)   Proposed DCNN         0.9055   0.8812

Table 4. Comparison of our works to prior state-of-the-art results

5.3 MSD result and the number of filters

We investigate the capacity of our sample-level architecture even further by evaluating the performance on MSD, which is ten times larger than MTAT. The result is shown in Table 4. While training the network on MSD, the number of filters in the convolution layers was found to affect the performance. According to our preliminary test results, increasing the number of filters from 16 to 512 along the layers was sufficient for MTAT. However, the test on MSD shows that increasing the number of filters in the first convolution layer improves the performance. Therefore, we increased the number of filters in the first convolution layer from 16 to 128.

5.4 Comparison to state-of-the-arts

In Table 4, we compare the performance of the proposed architecture to previous state-of-the-art results on MTAT and MSD. The results show that our proposed sample-level architecture is highly effective compared to them.

5.5 Visualization of learned filters

Visualizing the filters learned at each layer allows a better understanding of representation learning in the hierarchical network. However, many previous works in the music domain are limited to visualizing the learned filters only in the first convolution layer [11, 12].
The gradient ascent method has been proposed for filter visualization [22], and this technique has provided a deeper understanding of what convolutional neural networks learn from images [23, 24]. We applied the technique to our DCNN to observe how each layer hears the raw waveforms. The gradient ascent method is as follows. First, we generate random noise and back-propagate the errors in the network. The loss is set to the target filter activation. Then,
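The gradient ascent loop sketched in Section 5.5 (random-noise input, loss set to the target filter activation) might look like this in a modern autodiff framework; PyTorch, the function name, and the `model`/`layer` placeholders are assumptions, not the authors' implementation:

```python
import torch

def visualize_filter(model, layer, filter_idx, n_samples=729, steps=100, lr=0.1):
    """Gradient ascent on the input waveform: start from random noise and
    repeatedly step in the direction that increases the target filter's
    mean activation in the chosen layer."""
    x = torch.randn(1, 1, n_samples, requires_grad=True)
    activations = {}
    hook = layer.register_forward_hook(lambda mod, inp, out: activations.update(out=out))
    for _ in range(steps):
        model(x)                                         # forward pass; hook stores the layer output
        loss = activations["out"][0, filter_idx].mean()  # target filter activation
        loss.backward()
        with torch.no_grad():
            x += lr * x.grad                             # ascend the gradient
            x.grad.zero_()
    hook.remove()
    return x.detach()
```

Here `model` and `layer` stand for the trained network and the layer whose filter is being visualized; the returned waveform is the input estimate that maximizes that filter's activation.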
Figure 2. Spectrum of the filters in the sample-level convolution layers, sorted by the frequency of the peak magnitude. The x-axis represents the index of the filter, and the y-axis represents the frequency. The model used for visualization is the 3^9-DCNN with 59049 samples as input. Visualization was performed using the gradient ascent method to obtain the input waveform that maximizes the activation of a filter in the layers. To effectively find the filter characteristics, we set the length of the estimated input waveform to 729 samples, which is close to a typical frame size.
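The sorting used in the figure (filters ordered by the frequency of their peak magnitude) can be reproduced roughly as follows; a hedged NumPy sketch with a toy filter bank, not the authors' code:

```python
import numpy as np

def sort_filters_by_peak_frequency(filters, sr=22050):
    """Sort filter impulse responses, shape (n_filters, length), by the
    frequency at which their magnitude spectrum peaks."""
    spectra = np.abs(np.fft.rfft(filters, axis=1))
    peak_bins = spectra.argmax(axis=1)
    order = np.argsort(peak_bins)
    freqs = np.fft.rfftfreq(filters.shape[1], d=1.0 / sr)
    return filters[order], freqs[peak_bins[order]]

# toy usage: three sinusoid-like "filters" with different dominant frequencies
t = np.arange(64)
bank = np.stack([np.sin(2 * np.pi * f * t / 64) for f in (8, 2, 4)])
sorted_bank, peak_freqs = sort_filters_by_peak_frequency(bank)
```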
[4] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649-657.
[5] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," arXiv preprint arXiv:1508.06615, 2015.
[6] D. Palaz, M. M. Doss, and R. Collobert, "Convolutional neural networks-based continuous speech recognition using raw speech signal," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4295-4299.
[7] D. Palaz, R. Collobert et al., "Analysis of CNN-based speech recognition system using raw speech as input," Idiap, Tech. Rep., 2015.
[8] R. Collobert, C. Puhrsch, and G. Synnaeve, "Wav2letter: an end-to-end convnet-based speech recognition system," arXiv preprint arXiv:1609.03193, 2016.
[9] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR abs/1609.03499, 2016.
[10] T. N. Sainath, R. J. Weiss, A. W. Senior, K. W. Wilson, and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in INTERSPEECH, 2015, pp. 1-5.
[11] S. Dieleman and B. Schrauwen, "End-to-end learning for music audio," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6964-6968.
[17] J. Lee and J. Nam, "Multi-level and multi-scale feature aggregation using pre-trained convolutional neural networks for music auto-tagging," arXiv preprint arXiv:1703.01793, 2017.
[18] E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, "Evaluation of algorithms using games: The case of music tagging," in ISMIR, 2009, pp. 387-392.
[19] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere, "The million song dataset," in Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), vol. 2, no. 9, 2011, pp. 591-596.
[20] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[21] J.-Y. Liu, S.-K. Jeng, and Y.-H. Yang, "Applying topological persistence in convolutional neural network for music audio signals," arXiv preprint arXiv:1608.07373, 2016.
[22] D. Erhan, Y. Bengio, A. Courville, and P. Vincent, "Visualizing higher-layer features of a deep network," University of Montreal, vol. 1341, p. 3, 2009.
[23] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014, pp. 818-833.
[24] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427-436.
are trained for fixed combinations of instruments [7-11]. For classical music, we constrain the timbre-informed system to the possible combinations of notes and instruments which exist in a set of scores that have the same instruments. We denote this case as score-constrained source separation. This is a more general case than score-informed source separation [15], as the score is used solely during the training phase to synthesize renditions.
We propose a timbre-informed and score-constrained system to train neural networks for monaural source separation of classical music mixtures. We generate training data through synthesis of a set of scores by considering the following factors: tempo, timbre, dynamics, and circular shifting (local timing variations). In addition, we extend the timbre-informed low-latency system we proposed in [11] to the case of score-constrained classical music source separation. As argued in [11], the model we use has considerably fewer parameters, is easier to train, and can be used in a low-latency scenario. We evaluate several training scenarios on Bach chorale pieces played by four instruments.
The remainder of this paper is structured as follows. In Section 2 we discuss the relation of the proposed method to previous work. We present the architecture of the neural network in Section 3.1, the parameter learning in Section 3.2, the data processing in Section 3.3, and the proposed approaches for generating training data in Section 3.4. The evaluation dataset, method, parameters and results are discussed in Section 4. We conclude with an outlook on the presented system in Section 5.

2. RELATION TO PREVIOUS WORK

As mentioned before, training NMF bases for source separation is carried out either by score synthesis [13, 16] or by analytic approaches which assume learning registers for all considered instruments [6, 15]. In a similar way, we can synthesize the desired pieces or use samples of notes to train a neural network. However, in contrast to NMF, we need to train the network with examples containing the mixtures along with the targeted sources which cover the possible cases one might encounter in real life or in the test dataset. In addition, a deep learning model is optimized on and heavily dependent on the training data, and needs to cover more examples at the training stage than the parametric approaches embodied by NMF.
As a solution to data scarcity and to improve generalization and robustness, deep learning systems rely on data augmentation techniques by applying transformations to the existing data [17, 18]. Choosing realistic transformations depends on the possible axes along which the data varies and on the task, e.g. pitch shifting, time stretching, loudness, randomly mixing different recordings, or background noise. Although our approach can be seen as a data augmentation strategy, we generate new data according to transformations that make more sense for source separation, instead of augmenting existing data with techniques popular in other classification tasks [17, 18]. Furthermore, except for circular shifting [7], we use different transformations than previous deep learning approaches. We consider music samples played with different dynamics and by different instruments (timbre), which is different from simply changing the loudness or the amplitude of a sample. Additionally, we mix the audio files and not the resulting spectrograms, to account for phase cancellation.
Besides [10], deep learning approaches are evaluated in a cross-validation manner [7, 9, 11]. However, we claim that due to the small size of the classical music datasets used for evaluation, these methods are more likely to perform poorly in real-life scenarios where we can have different tempos, timbre, or dynamics. Therefore, varying the training data along these factors adds robustness.

3. METHOD

The diagram of the proposed system is depicted in Figure 1. For the training stage, we depart from the score of a given piece, which we synthesize at various combinations of tempos, timbres and dynamics. We then generate additional data by mixing circular-shifted versions of the audio tracks for the corresponding sources. More details on data generation are given in Section 3.4.

3.1 Network architecture

The convolutional neural network (CNN) used in this paper is a variation of the one we formulated in [11] and comprises two convolutional layers, a dense bottleneck layer having a low number of units, a second dense layer
followed by the inverse of the first two layers for each of the target sources. We kept similar filter shapes in order to preserve a low number of parameters and the low latency of the model.
The input of the network is an STFT magnitude spectrogram X ∈ R^(T×F), where T is the number of time frames (the time context modeled) and F is the number of frequency bins. This input is passed through a vertical convolution layer comprising P1 = 30 filters of size (1, 30) with stride 5, thus yielding the output X1 ∈ R^(P1×T×F1), where F1 = (F − 30)/5 + 1. Then, we have a horizontal convolution comprising P2 = 30 filters of size (2/3·T, 1) with stride 1, which yields the output X2 ∈ R^(P2×T2×F1), where T2 = (T − 2/3·T)/1 + 1. This layer is followed by a dense bottleneck layer of size F3 = 256, which has an input of size P2·T2·F1.
As we need to reconstruct the output for each of the sources j = 1 : J, where J is the total number of separated sources, we perform the inverse operations for the horizontal and the vertical convolutions. To match the dimensions needed to compute the inverse of the second layer, we need to create a dense layer of P2·T2·F1 features for each source, thus having J·P2·T2·F1 units. Consequently, for each of the estimated sources we perform the inverse operation of the convolution layers, i.e. the deconvolution, using the same parameters learned by these layers. The final inverse layer yields a set of estimations Ej ∈ R^(T×F), with j = 1 : J.
The convolutional layers have a linear activation function and the dense ones a rectified linear unit activation function, as in [11].

3.2 Parameter learning

Similarly to [7, 11], the network yields the estimations Ej ∈ R^(T×F) from which we derive the soft masks Mj ∈ R^(T×F), for each source j = 1 : J:

    Mj = |Ej| / (Σ_{j'=1}^{J} |Ej'|)    (1)

Then, the soft masks are multiplied with the original magnitude spectrogram to obtain the corresponding estimated magnitude spectrograms for the sources: X̃j = Mj · X.
The parameters of the network are updated using mini-batch Stochastic Gradient Descent with the AdaDelta algorithm [19], according to the following loss function, which minimizes the Euclidean distance between the estimated X̃j and target Xj sources:

    Loss = Σ_{j=1}^{J} ||X̃j − Xj||²

Because the target sources are harmonic, highly correlated, and play consonant musical phrases, we do not include the dissimilarity cost between the sources as in [11].

3.3 Data processing

a) Training: Generating training data assumes slicing the spectrogram of the mixture X(t, f) and the target spectrograms Xj(t, f) into overlapping blocks of size T, for the time frames t = 1 : T̄ and frequency bins f = 1 : F, where T̄ is the total number of time frames and F is the number of frequency bins of X. We summarize the procedure in Algorithm 1.

Algorithm 1 Generating training data
1: for each piece in the training set do
2:   Compute the STFT of the piece
3:   Initialize the total number of blocks to B = ⌊T̄/O⌋, where O is the step size
4:   for b = 1 : B do
5:     Slice Xb = X(b·O : b·O + T, :)
6:     Slice Xbj = Xj(b·O : b·O + T, :)
7:   end for
8: end for
9: Generate batches by randomly grouping blocks

b) Separation: Separating a target signal involves slicing its magnitude spectrogram. Then, we obtain an estimation of the sources by feeding the blocks forward through the network, and we overlap-add the blocks to obtain the magnitude spectrograms. The steps are presented in Algorithm 2.

Algorithm 2 Separation of a piece
1: Compute the complex-valued STFT; keep the phase
2: Initialize the total number of blocks to B = ⌊T̄/O⌋, where O is the step size
3: for b = 1 : B do
4:   Slice Xb = X(b·O : b·O + T, :)
5:   Feed Xb forward through the CNN, obtaining the magnitude spectrograms X̃bj for the sources j = 1 : J
6: end for
7: for j = 1 : J do
8:   Compute X̃j by overlap-adding the blocks X̃bj
9: end for
10: Use the phase of the original signal and compute the estimated sources with the inverse overlap-add STFT

If we consider that the sources yj(t), j = 1 : J, are linearly mixed, such that x(t) = Σ_{j=1}^{J} yj(t), then we can use the original phase of the audio mixture to obtain the signals associated with the sources yj(t), with an inverse overlap-add STFT, as in [13].

3.4 Data generation

The proposed system is trained with data generated by synthesizing a set of scores at different tempos, using samples of different timbres and dynamics, and then applying circular shifting to the resulting audio files to account for local timing variations. In this paper, we consider four factors:
a) Tempo: The renditions of a classical music piece vary in terms of tempo. Therefore, in order to make the model more robust to variations in tempo, we synthesize the score at different tempos, i.e. we adjust the note duration for each note corresponding to each instrument in the score. The possible tempo variations are represented in the vector a = {a1, a2, ..., aA}, where A is the total number of considered tempos.
b) Timbre: Different recording conditions, instruments, players or playing styles can yield differences in timbre. A single timbre variation comprises samples of the notes in
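The soft-mask step of Eq. (1) in Section 3.2, and its multiplication with the mixture magnitude spectrogram, can be sketched in a few lines of NumPy. This is an illustrative sketch with toy shapes and an added epsilon to guard against silent bins, not the authors' implementation:

```python
import numpy as np

def soft_masks(estimates, eps=1e-12):
    """Soft masks M_j = |E_j| / sum_k |E_k| from J source estimates.

    estimates: array of shape (J, T, F) holding the network outputs E_j.
    eps guards against division by zero in silent time-frequency bins.
    """
    mags = np.abs(estimates)
    return mags / (mags.sum(axis=0, keepdims=True) + eps)

def apply_masks(masks, mixture_mag):
    """Multiply each mask with the mixture magnitude spectrogram X of shape (T, F)."""
    return masks * mixture_mag[None, :, :]

# toy usage: J = 2 sources, T = 4 frames, F = 3 bins
E = np.random.rand(2, 4, 3)
X = np.random.rand(4, 3)
estimates = apply_masks(soft_masks(E), X)  # per-bin source estimates
```

By construction the masks sum to one in every time-frequency bin, so the source estimates sum back to the mixture magnitude.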
and Itakura-Saito distances instead of the Euclidean distance in the cost function.
c) Hardware and software tools: Our work follows the principle of research reproducibility, so the code used in this paper is made available 3. It is built on top of Lasagne, a framework for neural networks using Theano 4. We ran the experiments on a computer with a GeForce GTX TITAN X GPU, an Intel Core i7-5820K 3.3 GHz 6-core processor, and an X99 Gaming 5 ATX DDR4 motherboard.

4.3 Experiments

4.3.1 Convolutional Neural Networks

In order to test various data generation approaches, we train the CNN system of Section 3.1 with different data:
a) CNN Bach10: We train with all 10 pieces in the Bach10 dataset.
b) CNN leave-one-out (LOOCV) on Bach10: We train with nine pieces of Bach10 and test on the remaining piece, repeating this for all 10 pieces of the dataset.
c) CNN Sibelius: We train with all the synthetic pieces in the generated dataset described in Section 4.1a.
d) CNN Sibelius GT: We train with all the synthetic pieces in the generated dataset described in Section 4.1a, synthesized with the score perfectly aligned with the rendition.
e) CNN RWC: We train with all the synthetic pieces in the generated dataset described in Section 4.1b. Since the number of possible combinations between the factors a, c, d, e is too large, we randomly sample 400 points.
f) CNN RWC GT: We train with all the synthetic pieces in the generated dataset described in Section 4.1b, synthesized with the score perfectly aligned with the rendition. In this case, because there are fewer factors to vary, we randomly sample 50 points.

4.3.2 Score-constrained NMF

We compare the proposed approaches with an NMF timbre-informed system based on the multi-source filter model [4, 6], which includes timbre models trained on the RWC dataset. Because we deal with a score-constrained scenario, and for a fair comparison, the gains of the NMF are restricted to the notes in the score, without taking into account the times when the notes are played. Thus, each row of the gains matrix corresponding to a note is set to 1 if a given instrument plays that note in the score. The other values in the gains matrix are set to 0 and do not change during computation, while the values set to 1 evolve according to the energy distributed between the instruments. The NMF parameters are kept fixed as in [4]: 50 iterations for the NMF, beta-divergence distortion β = 1.3, an STFT window length of 96 ms, and a hop size of 11 ms.

4.4 Results and discussion

We present the results of the approaches evaluated in Section 4.3.1 and Section 4.3.2 on the Bach10 and Sibelius datasets. The separated audio files and the computed measures for each file and instrument are made available through Zenodo 5.
We verify that when the training and test datasets coincide, the CNN approach achieves the best performance. We denote this case as Oracle.

4.4.1 Separation results with real recordings: Bach10

We evaluate the considered approaches on the Bach10 dataset. Overall results are presented in Figure 2, while instrument-specific ones are provided in Figure 3.

[Figure 2: bar chart; x-axis: SDR, SIR, SAR; y-axis: mean (dB); approaches: CNN LOOCV, CNN Oracle, CNN RWC, CNN RWC GT, CNN Sibelius, CNN Sibelius GT, NMF]
Figure 2. Results in terms of SDR, SIR, SAR for the Bach10 dataset and the considered approaches: CNN and NMF [6]

The LOOCV approach is trained with examples from the same dataset, having the same timbre and style, and thus achieves approximately 4 dB in SDR. This illustrates the fact that training the neural network on similar pieces, played by the same musicians in the same recording conditions, is beneficial for the system.
The approach CNN RWC, which involves synthesizing the original score with samples comprising a variety of instrument timbres and dynamics, yields results as good as the LOOCV approach (approximately 4 dB SDR). On the other hand, if the training set comprises fewer samples and fewer variations in timbre and dynamics, then we obtain considerably lower results, as in the CNN Sibelius approach, which achieves approximately 1 dB SDR. In fact, the CNN RWC approach has a higher SIR than CNN Sibelius (9 dB compared with 4 dB). Thus, learning to separate more diverse examples reduces the interference.
Synthesizing the score perfectly aligned with the rendition does not improve the results for the considered approaches: CNN RWC GT and CNN Sibelius GT. For this particular CNN architecture and the modeled time context (300 ms), synthesizing the original score with circular shifting to compensate for local timing variations achieves results as good as the ideal case of synthesizing a perfectly aligned score. This is encouraging, considering that a perfectly aligned score is difficult to obtain. Furthermore, a score-following system introduces additional latency.

3 https://github.com/MTG/DeepConvSep
4 Lasagne: http://lasagne.readthedocs.io/en/latest/ and Theano: http://deeplearning.net/software/theano/
5 http://zenodo.org/record/344499#.WLbagSMrIy4
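The gains-matrix initialization described in Section 4.3.2 can be sketched as follows; a minimal sketch in which the note indexing and matrix dimensions are illustrative assumptions:

```python
import numpy as np

def score_constrained_gains(score_notes, n_notes, n_frames):
    """Initialize an NMF gains matrix for one instrument: the row of every
    note the instrument plays somewhere in the score is set to 1 (and is
    then free to evolve during the NMF updates), while all other rows stay
    fixed at 0, without encoding when each note is played."""
    gains = np.zeros((n_notes, n_frames))
    for note in set(score_notes):
        gains[note, :] = 1.0
    return gains

# toy usage: an 88-note range, 100 frames, three notes present in the score
G = score_constrained_gains([60, 62, 64], n_notes=88, n_frames=100)
```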
[Figure 3: bar chart; x-axis: bassoon, clarinet, saxophone, violin; y-axis: SDR (dB); approaches: CNN Oracle, CNN Bach10, CNN RWC, NMF]
Figure 3. Results for each instrument in terms of SDR for the Bach10 dataset and the considered approaches: CNN and NMF [6]

[Figure 4: bar chart; x-axis: SDR, SIR, SAR; y-axis: mean (dB)]
Figure 4. Results in terms of SDR, SIR, SAR for the Sibelius dataset and the considered approaches: CNN and NMF [6]
NVIDIA Corporation. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R). We thank Agustin Martorell for his help with Sibelius and Pritish Chandna for his useful feedback.

6. REFERENCES

[1] E. Vincent, C. Fevotte, R. Gribonval, L. Benaroya, X. Rodet, A. Robel, E. Le Carpentier, and F. Bimbot, "A tentative typology of audio source separation tasks," in 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA), 2003, pp. 715-720.
[2] J. Janer, E. Gomez, A. Martorell, M. Miron, and B. de Wit, "Immersive orchestras: audio processing for orchestral music VR content," in Games and Virtual Worlds for Serious Applications (VS-Games), 2016 8th International Conference on. IEEE, 2016, pp. 1-2.
[3] E. Gomez, M. Grachten, A. Hanjalic, J. Janer, S. Jorda, C. F. Julia, C. C. S. Liem, A. Martorell, M. Schedl, and G. Widmer, "Phenicx: Performances as highly enriched and interactive concert experiences," in SMC, 2013.
[4] M. Miron, J. Carabias-Orti, J. Bosch, E. Gomez, and J. Janer, "Score-informed source separation for multichannel orchestral recordings," Journal of Electrical and Computer Engineering, vol. 2016, 2016.
[5] T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066-1074, 2007.
[6] J. J. Carabias-Orti, T. Virtanen, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Canadas-Quesada, "Musical instrument sound multi-excitation model for non-negative spectrogram factorization," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1144-1158, Oct. 2011.
[7] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," in ICASSP. IEEE, Mar. 2012, pp. 57-60.
[8] E. Grais, M. Sen, and H. Erdogan, "Deep neural networks for single channel source separation," in ICASSP. IEEE, May 2014, pp. 3734-3738.
[9] A. Simpson, G. Roma, and M. Plumbley, "Deep karaoke: Extracting vocals from musical mixtures using a convolutional deep neural network," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 429-436.
[11] P. Chandna, M. Miron, J. Janer, and E. Gomez, "Monoaural audio source separation using deep convolutional neural networks," in International Conference on Latent Variable Analysis and Signal Separation, 2017.
[12] Z. Duan and B. Pardo, "Soundprism: An online system for score-informed source separation of music audio," IEEE Journal of Selected Topics in Signal Processing, pp. 1-12, 2011. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5887382
[13] J. Fritsch and M. Plumbley, "Score informed audio source separation using constrained non-negative matrix factorization and score synthesis," in ICASSP, 2013, pp. 888-891.
[14] G. Widmer and W. Goebl, "Computational models of expressive music performance: The state of the art," Journal of New Music Research, vol. 33, no. 3, pp. 203-216, 2004.
[15] S. Ewert and M. Muller, "Score-informed voice separation for piano recordings," in 12th International Society for Music Information Retrieval Conference, 2011, pp. 245-250.
[16] J. Ganseman, P. Scheunders, G. Mysore, and J. Abel, "Source separation by score synthesis," in International Computer Music Conference, 2010.
[17] J. Schluter and T. Grill, "Exploring data augmentation for improved singing voice detection with neural networks," in 16th International Society for Music Information Retrieval Conference, 2015.
[18] J. Salamon and J. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279-283, 2017.
[19] M. D. Zeiler, "Adadelta: an adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[20] M. Goto, "Development of the RWC music database," in ICA, 2004, pp. 553-556.
[21] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462-1469, Jul. 2006.
[22] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046-2057, Sep. 2011.
SMC2017-233
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
ABSTRACT

We present the evaluation of a sonification approach for the acoustic analysis of tremor diseases. The previously developed interactive tool offers two methods for sonification of measured 3-axis acceleration data of patients' hands. Both sonifications involve a bank of oscillators whose amplitudes and frequencies are controlled by either frequency analysis similar to a vocoder or Empirical Mode Decomposition (EMD) analysis. In order to enhance the distinct rhythmic qualities of tremor signals, additional amplitude modulation based on measures of instantaneous energy is applied. The sonifications were evaluated in two experiments based on pre-recorded data of patients suffering from different tremor diseases. In Experiment 1, we tested the ability to identify a patient's disease by using the interactive sonification tool. In Experiment 2, we examined the perceptual difference between acoustic representations of different tremor diseases. Results indicate that both sonifications provide relevant information on tremor data and may complement already available diagnostic tools.

1. INTRODUCTION

Tremor is a movement disorder which produces involuntary rhythmic oscillation movements of a body part [1]. As it can be caused by various neurological diseases [2], a correct diagnosis is needed quickly to choose the right therapy. Each of these diseases evokes a specific movement pattern which can be recognized visually by specialized neurologists. This visual diagnosis, however, is unreliable, and common approaches for additional ex-post analysis of videos or measured sensor data are time-consuming and cannot be easily integrated into daily clinical practice. A sonification has the advantage that it provides an auditory representation which continuously follows the spectral characteristics of the tremor and thus allows the listener to keep track of the time-dependent spectral structures. Real-time sonification of tremor movement data could therefore become a promising extension to already available diagnostic tools. Sonification has recently been used successfully for therapy of Parkinsonian tremor [3, 4].

In a follow-up to our previous research [5], we developed two sonification methods for tremor analysis which are described in detail in [6]. These are intended to be used interchangeably depending on tremor characteristics and personal preference. Both sonifications and the corresponding graphical user interface are implemented in Pure Data¹. The interface is targeted towards real-time use as a supplementary medical tool in order to improve diagnostic quality. It is employed to extract relevant features of the tremor signal, which are exposed aurally by the developed sonification algorithms. The resulting feature space is high-dimensional and therefore predestined for aural rendering in preference to visual representations. This tool, however, aims neither at providing definite answers nor a final diagnosis of the disease.

In a previous pilot study (see [6]), test participants were asked to identify the tremor diseases of patients by using the interactive sonification interface on pre-recorded movement data of 30 patients who were divided equally into three different groups of diseases (Parkinsonian, Essential, and Psychogenic tremor). Participants used headphones and obtained prior training with the same set of patients. On average, participants reached 61 % correct diagnoses, which is far above chance (1/3). According to participating neurologists, the proposed interface facilitates an insight into the movement pattern of an examined tremor without visual tools. An interactive switch between sonifications did not improve overall sensitivity. However, test participants welcomed the possibility to parametrize the sonifications to personal preference.

In retrospect, this pilot study suffered from unclear data and immature experimental design: it was based on a small sample size, clinical reference diagnoses were not perfectly reliable, and test participants were trained with the same set of patients' tremor data as used in the experiment. The results are therefore of limited significance. As a consequence, we carried out an extended study which is presented here. It includes more test participants and a larger dataset of patients with confirmed diagnoses.

This article is structured as follows. After this introduction, Sec. 2 gives an overview of the technical setup for data acquisition and the two different sonification methods. One uses frequency analysis (Vocoder sonification, Sec. 2.1), while the other is based on Empirical Mode Decomposition analysis (EMD sonification, Sec. 2.2). These sonifications were evaluated with real patients' data in an identification test with the interactive audiovisual user interface as well as in a triangle discrimination test (Sec. 3).

Copyright: © 2017 Marian Weger et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

¹ Pure Data (Pd): http://puredata.info/
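As a concrete illustration of the vocoder-style variant described above, the following Python/numpy sketch drives a harmonically tuned oscillator bank with short-time band energies of an accelerometer-like signal and applies amplitude modulation by the smoothed half-wave rectified input. This is an illustrative approximation only, not the authors' Pure Data implementation; the window length (0.5 s), hop size (0.1 s), audio rate, and fundamental frequency are assumed values.

```python
import numpy as np

def vocoder_sonification(x, fs=1000.0, fs_audio=8000.0, f0=200.0,
                         bands=((2, 4), (4, 6), (6, 9), (9, 13), (13, 20))):
    """Render a tremor acceleration signal x (sampled at fs) as audio.

    Short-time band energies of x control the amplitudes of harmonically
    tuned oscillators (f0, 2*f0, ...); the smoothed half-wave rectified
    input amplitude-modulates the mix.
    """
    n = len(x)
    t_audio = np.arange(int(n / fs * fs_audio)) / fs_audio

    # Short-time band energies via a sliding-window FFT (assumed settings).
    win, hop = int(0.5 * fs), int(0.1 * fs)
    frames = max(1, (n - win) // hop + 1)
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    env_t = (np.arange(frames) * hop + win / 2) / fs

    out = np.zeros_like(t_audio)
    for k, (lo, hi) in enumerate(bands):
        sel = (freqs >= lo) & (freqs < hi)
        energy = np.array([
            np.sum(np.abs(np.fft.rfft(x[i*hop:i*hop+win] * np.hanning(win)))[sel]**2)
            for i in range(frames)])
        energy /= (energy.max() + 1e-12)            # normalized band energy
        amp = np.interp(t_audio, env_t, energy)     # resample envelope to audio rate
        out += amp * np.sin(2 * np.pi * (k + 1) * f0 * t_audio)

    # Global AM by the smoothed half-wave rectified input signal.
    rect = np.maximum(np.interp(t_audio, np.arange(n) / fs, x), 0.0)
    kernel = np.ones(200) / 200                     # ~25 ms moving average at 8 kHz
    am = np.convolve(rect, kernel, mode="same")
    out *= am / (am.max() + 1e-12)
    return out / (np.max(np.abs(out)) + 1e-12)

# Usage: sonify a synthetic 5 Hz "tremor" with additive noise.
rng = np.random.default_rng(0)
tremor = np.sin(2 * np.pi * 5 * np.arange(5000) / 1000.0) \
    + 0.1 * rng.standard_normal(5000)
audio = vocoder_sonification(tremor)
```

The optional FM of the fundamental and the per-stage AGC of the original tool are omitted here for brevity.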
Finally, in Sec. 4, we summarize our findings and give an outlook on future work.

Accompanying sound examples can be found on the project web page [7]. These include stereo recordings of both sonifications with two patients of each tremor type.

2. SONIFICATION APPROACH

Both sonification methods share the basic technical setup as well as some fundamental data conditioning steps. Movement data is recorded by 3-axis accelerometers² attached to the patients' hands and sampled at 1 kHz. Signals are recorded with CED Spike2³ and pre-processed in Matlab. DC removal and a 70 Hz low-pass filter are applied to the acceleration signal in order to cover the typical frequency range of pathological tremor (predominantly 3–15 Hz). Both left and right arm sensors are individually sonified.

As strong amplitude variations can occur between different measurements, Automatic Gain Control (AGC) is applied to the input signal at different stages in both sonifications. The measurement data is further conditioned by a Principal Component Analysis (PCA) [8]. In the context of the presented sonifications, the first principal component is projected on the multichannel data to retrieve the monophonic input signal x[t] (see [5] for a more detailed description).

We will briefly describe the basic sonification algorithms. For further information, please consult [6].

2.1 Vocoder Sonification

The first sonification method is similar to a vocoder. By using a sliding-window FFT, the input signal is divided into 5 frequency bands (2–4 Hz, 4–6 Hz, 6–9 Hz, 9–13 Hz, and 13–20 Hz). Center frequencies and bandwidths have been selected based on experience with the spectra of different tremor types.

Then, the normalized energies in the individual bands are used to control the amplitudes of 5 sinusoidal oscillators which are tuned harmonically to each other, i.e., following the harmonic series (f0, 2 f0, etc.). A Frequency Modulation (FM) with the smoothed half-wave rectified input signal x[t] can be applied optionally. This results in a time-varying fundamental frequency fi(t) instead of a constant f0.

The sum of the five oscillator signals is finally amplitude modulated by the variably smoothed half-wave rectified input signal x[t].

The Vocoder sonification produces a harmonic complex, evoking a clear, optionally time-varying pitch percept (compare sound examples [7]). The time-varying timbral character resembles vocal formants, whereas the overall amplitude modulation adds a rhythmic dimension.

2.2 EMD Sonification

The second sonification is based on Empirical Mode Decomposition analysis. EMD was originally developed by [9] to analyze non-stationary and non-linear signals. The idea of EMD is that complex data sets can be decomposed into a finite (and often small) number of so-called Intrinsic Mode Functions (IMFs). Each IMF represents one mode of the signal. The higher the index of an IMF, the lower its frequency components.

In contrast to Fourier analysis, where a signal is decomposed into a set of pre-defined basis functions, the EMD obtains the basis functions adaptively from the signal. A perfect reconstruction of the original signal is possible via summation of the contained IMFs and the resulting residual signal. The basic EMD algorithm is explained in [9–12].

For each individual IMF, the instantaneous phase, frequency, and amplitude can be obtained from the Hilbert transform. In conjunction with the EMD, this is called the Hilbert-Huang Transform (HHT) [9, 13]. The HHT has been proposed for tremor analysis in recent studies, e.g., [14–16].

For the sonification, only the first five IMFs of the input signal x[t] are determined via EMD. Eventually, these IMFs are individually leveled by AGC. Each IMF then controls the frequency and amplitude of an individual sinusoidal oscillator. Although both the instantaneous amplitude and the frequency can be computed at any time by using the Hilbert transform, we used the generalized zero-crossing method [17] for the frequency, as it was found to provide more stable results. The determined frequencies are then multiplied by a user-controlled constant factor to map the low tremor frequencies to the audible range. Each oscillator is then individually amplitude modulated by the smoothed half-wave rectified IMF signal itself, in order to display the original tremor frequency range as a superposition of rhythmic structures. Finally, the output signal of the sonification is formed by the sum of these five signals.

Due to the specific time-varying characteristics of the tremor signals, the sonic result of the EMD-based sonification resembles the sound of singing birds (compare sound examples [7]). The register of each bird depends on the frequency and amplitude of the corresponding IMF.

3. EVALUATION STUDY

The presented sonification methods as well as the interactive interface were evaluated on the basis of pre-recorded movement data of 97 patients with confirmed diagnoses. The clinical tremor data were collected by the Medical University of Graz and the UCL Institute of Neurology London between 2012 and 2013. Hand acceleration data were recorded for both hands simultaneously in rest (arms hanging) and posture (arms outstretched) conditions. The patients divide unevenly into the four tremor types: 24 Parkinsonian, 19 Essential, 36 Psychogenic, and 18 Dystonic tremor. To force an even distribution, we randomly se-

² Biometrics ACL300 (mass: 10 g, accuracy: 2 % FS): http://www.biometricsltd.com/accelerometer.htm, connected to a CED 1401 interface.
³ CED Spike2: http://ced.co.uk/products/spkovin
Figure 1. Graphical user interface (here in training mode). Annotations in red letters. Visual feedback: a) waveform, b) oscilloscope, c) FFT spectrum, each for left and right channel, d) ratio between rotational and translational movement. Global parameters: e) smoothing of the AM modulator, f) mono/stereo and left/right channel switch, g) sonification switch. Vocoder sonification: h) FM modulation, i) fundamental frequency. EMD sonification: j) frequency factor.
lected 18 patients of each type to form the reduced dataset of 18 × 4 = 72 patients.

Test participants for the extended study were recruited from an expert listening panel [18, 19], a group of musicians and sound engineers with experience in listening tests. Despite the target audience being neurologists, trained listeners were chosen to ensure a best-case scenario and hence a fairer comparison of the results with the currently achieved diagnostic accuracy through visual and computer-aided ex-post analysis methods. Neurologists can acquire these abilities provided that they obtain appropriate ear training. The sonification was presented with closed headphones. Test participants were allowed to take handwritten notes throughout the experiments.

We carried out two experiments which are discussed below: a 4AFC identification task (Sec. 3.1) similar to the pilot study (cf. [6]) and a triangle test (also called 3AFC oddity task) which is described in Sec. 3.2.

3.1 Experiment 1: training and identification task

The basic design of Experiment 1 was similar to the pilot study [6]. It further addresses learning effects over several training sessions. An additional focus of this experiment was the interactive use of sonification parameters. The goal was to verify the results of the pilot study with a larger number of patients and tremor types.

3.1.1 User Interface

Apart from the sonic representation of the tremor data, a simple visualization is provided (see Fig. 1). It shows various visual information, such as a waveform view, oscillogram, level meter, FFT spectrum, the ratio between rotational and translational movement strength, as well as band intensities and IMF frequencies for the individual sonifications. Further, apart from standard controls such as volume, the interface features interactive access to a selection of sonification parameters. Globally for both sonifications, the smoothing of the AM modulator signal as well as stereo/mono playback can be controlled. In addition, each sonification allows individual access to oscillator frequency (base frequency for the Vocoder sonification, multiplicative factor for the EMD sonification) and dedicated gains for left and right arm sensors. Another slider provides control of the FM index for the Vocoder sonification.

3.1.2 Procedure

In Experiment 1, participants had to accomplish three sessions (S1–S3) of one hour on different days with approximately one week of pause in between. Each session started with approx. 30 minutes of free training where participants could freely listen to the sonifications of 8 patients per tremor type (24 patients in total) with disclosed diagnosis. Participants were able to decide by themselves when they felt ready for the next part of the session, in which they had to identify the disease of 24 patients (6 per tre-
Table 1. Overview of the results from Experiment 1. Values in parentheses describe one standard deviation.
mor type) by using the interactive sonification interface. The task followed the same procedure as in the pilot study. Feedback was given after each trial.

For each participant individually, the dataset was split randomly into 3 subsets of 24 patients (6 per tremor type). These 3 subsets were presented in the training and testing parts of the 3 sessions. The presentation order of patients within each session and within each mode (training/testing) was randomized across participants. In the first and in the last session, training was performed with subset A, while a completely new subset (B/C) was introduced for the testing part. In the second session, both training and testing were performed with the same subset (B):

              S1   S2   S3
    Training  A    B    A
    Testing   B    B    C

S1 and S3 are seen as a realistic scenario where known patients' characteristics were applied to unknown patients. In S2, the ability to memorize specific patients' characteristics was tested, comparable to the procedure in the pilot study.

During the testing part of each session, participants were asked to submit a diagnosis (4AFC) and judge their confidence (values from 1 to 100) for each of the 24 patients. Feedback (reference diagnosis) was given after each trial. Sonification-specific parameters as well as the condition (rest/posture) could additionally be controlled by a motorized MIDI controller (Behringer BCF-2000).

3.1.3 Results

Overall results of the identification task are shown in Tab. 1. On average, the test participants achieved 31.5 % correct answers for S1 and S3, while S2 led to 39.1 % correct answers. Confidence intervals in Tab. 1 are based on the inverse Beta CDF. The total percent correct answers lead to a discrimination index d′ of 0.26 for S1/S3 and 0.49 for S2 [20].

The primary test results (percent correct answers) were analyzed by using a binomial test with one variable, disease (4 levels), assuming a constant hit rate of 1/4, a sample size of 384 (24 patients × 16 participants), and a significance level of p=0.05. For all three sessions, the achieved percent correct answers are significantly higher than chance (p=0.002 for S1/S3, p<0.001 for S2). The overall results of the first and third session are identical and therefore considered equivalent.

For further analysis of the results, contingency tables were created for the individual sessions (Tab. 2a–2c) as well as for the pooled results of S1 and S3 (Tab. 2d). The tremor types Parkinsonian, Essential, Psychogenic, and Dystonic are referred to as Par, Ess, Psy, and Dys.

The higher percent correct responses in S2 compared to S1 and S3 are further supported by the higher contingency coefficient⁴ (0.32 vs. 0.22 and 0.25), which means that there is a stronger relation between reference diagnosis and submitted diagnosis for S2 compared to both other sessions.

The pooled results of S1 and S3 (Tab. 2d) reveal a high confusion between Parkinsonian and Psychogenic tremor (31.8 % Par falsely assigned to Psy, 28.1 % Psy falsely assigned to Par). The lowest false positive rate appeared between Parkinsonian and Dystonic tremor (16.1 % Par falsely assigned to Dys, 18.8 % Dys falsely assigned to Par).

A binomial test as above was performed on the results per tremor type. When looking at the pooled results of S1 and S3, only the percent correct identifications of Parkinsonian, Psychogenic, and Dystonic tremor were significantly higher than chance (Par: p=0.014, Psy: p=0.042, Dys: p=0.004), while Essential tremor could not be identified (p=0.107). However, the differences in sensitivity (Par: 0.32, Ess: 0.29, Psy: 0.31, Dys: 0.34) and F-measure (Par: 0.38, Ess: 0.36, Psy: 0.33, Dys: 0.42) are only small. The difference in percent correct responses between the individual tremor types therefore seems marginal.

We did not find any significant clustering of test participants based on their individual contingency tables. Also, the results of those participants who primarily used the vocoder-based sonification did not significantly differ from the results of those who preferred the EMD-based sonification (9 participants vs. 7 participants, 32.3 % vs. 30.4 % correct answers as average of S1 and S3). However, there were five top performers who achieved at least 33.3 % correct answers in every session (average of S1 and S3: 36.6 % correct answers, ranging from 33.3 % to 47.7 %).

On average (pooled over S1 and S3), participants took 38.4 s (SD=30.1 s) to complete one trial and reported a confidence of 39.4 (SD=25.6).

3.1.4 Interactive use of sonification parameters

For the slider- or rotary-based controls (AM smoothing, Vocoder sonification base frequency and FM index, EMD sonification frequency factor), we observed that participants found their preferred values already within the first training session. These parameters were rarely changed during diagnosis.

The sonification switch also exhibited convergent behavior for all but one participant who was continuously switch-

⁴ Pearson's contingency coefficient: K = sqrt(χ² / (N + χ²)), where N is the sample size.
    Type pair    Cluster 1  Cluster 2  Top 5
    Par vs. Ess    44.8       34.4      35.0
    Par vs. Psy    38.5       38.5      50.0
    Par vs. Dys    29.2       62.5      50.0
    Ess vs. Psy    39.6       36.5      48.3
    Ess vs. Dys    42.7       29.2      43.3
    Psy vs. Dys    44.8       54.2      60.0
    Average        39.9       42.5      47.8

Table 4. Average results per pair of tremor types in Experiment 2 for the two main clusters and the 5 best performing participants. Values describe % correct answers.

3.2.3 Discussion

The overall discrimination index d′ = 0.95 in Experiment 2 shows that the oddity task was indeed easier than the identification task of Experiment 1 (d′ = 0.26). However, overall percent correct responses are still too low for direct applications in the medical context.

A cluster analysis on the basis of the test participants' results leads to an almost perfect division into the users of the different sonifications. While the Vocoder sonification seems to provide similar discrimination performance for all tremor pairs, users of the EMD sonification achieved extraordinary results for the pairs Par/Dys and Psy/Dys (at the expense of lower percent correct responses for the other pairs). Each sonification might emphasize different aspects of the observed tremors. If those aspects were known, participants might be able to combine both sonifications in order to improve their sensitivity. The interactive change between both sonifications therefore seems reasonable and might facilitate the construction of a coherent picture based on the observed phenomena in order to make informed decisions.

The five top performers of Experiment 2 (Tab. 4) also achieved high identification rates in Experiment 1 (on average: 33.8 % correct, ranging from 31.3 % to 37.5 %; two of them among the Top 5). This leads to the assumption that at least some test participants found consistent tremor-specific sound characteristics and were able to apply these complex models in an identification task as well as in a discrimination task.

4. CONCLUSIONS AND OUTLOOK

In the presented study, we evaluated tremor disease identification and discrimination by means of interactive sonification.

In Experiment 1, test participants were asked to indicate the tremor type of unknown patients by analyzing pre-recorded movement data with the interactive sonification interface. Average percent correct diagnoses were significantly above chance for patients with Parkinsonian, Psychogenic, and Dystonic tremor, but not for Essential tremor. The overall sensitivity index d′ was 0.26.

In Experiment 2, participants were able to distinguish Parkinsonian tremor from Essential and Dystonic tremor with sensitivity significantly above guessing rate. A significant perceptual difference was also shown between Dystonic and Psychogenic tremor. Overall d′ was 0.95.

During the first experiment, participants were allowed to manipulate critical sonification parameters. All test participants found their preferred sonification and corresponding parameters quickly. Still, the permanent comparison between left and right arm sensors as well as between rest and posture conditions provided useful information. The interactive change of these parameters facilitates the construction of a coherent image of the observed tremor and allows informed decision making.

The proposed sonification interface is not meant to replace currently available diagnostic tools, but rather to complement them. The fact that both sonifications led to different results in the triangle test lets us conjecture that they could supplement each other in a beneficial way, especially if additional knowledge from other diagnostic tools is available. A combination and cross-checking of different sources of information is essential for an efficient and correct diagnosis.

It is obvious that both sonifications as well as the corresponding user interface need to be substantially revised. This revision could be based on the test participants' qualitative feedback which we collected. Informal interviews with test participants revealed that the offered sonification parameters might not be sufficient. One of the top performing participants complained that the first IMF can become obtrusive and often masks the important fine structure of the higher IMFs; however, at the same time, it also contains critical information and can therefore not be omitted. Individual volume control for the IMFs may therefore be an option for future improvements.

Further, we observed that some test participants successfully used the spatial movement visualization for decision making: while Parkinsonian and Dystonic tremor showed almost only rotational movement, Essential and Psychogenic tremor sometimes caused sudden switches towards a translational movement. At this stage, however, the ratio between rotational and translational movement is not conveyed by the sonification. A prior approach mapping this movement quality to a chorus effect (see [6]) did not work due to the parameter's transient behavior. A promising option might be a mapping of the translation/rotation ratio to relative pitch.

Finally, we think that the collected strategies of the test participants could be brought into correspondence with Music Information Retrieval (MIR) analysis descriptors applied to sonifications and raw movement data. Thus, the sonification qualities which have proven to be perceptually relevant in tremor segregation could be enhanced in the next development steps. In general, the re-design of the developed sonifications could be informed by the findings of this combined qualitative and quantitative analysis.

5. ACKNOWLEDGMENTS

We would like to thank Petra Schwingenschuh and the Medical University of Graz, who kindly provided us with the clinical tremor data and thus made this research possible.
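For reference, the mapping between percent correct and the sensitivity index d′ quoted above follows the equal-variance Gaussian m-AFC model tabulated by Hacker and Ratcliff [20]. The table lookup can be reproduced by numerically inverting Pc(d′); the following Python sketch is illustrative and not part of the original study (grid range and bisection bounds are assumed values):

```python
import numpy as np
from math import erf

def dprime_mafc(pc, m=4):
    """Invert proportion correct Pc in an m-alternative forced-choice
    task to the sensitivity index d' under the equal-variance Gaussian
    model: Pc(d') is the probability that the signal observation
    exceeds all m-1 noise observations."""
    t = np.linspace(-8.0, 8.0, 4001)
    dt = t[1] - t[0]
    pdf = lambda u: np.exp(-0.5 * u * u) / np.sqrt(2.0 * np.pi)   # N(0,1) pdf
    cdf = 0.5 * (1.0 + np.array([erf(v / np.sqrt(2.0)) for v in t]))  # N(0,1) cdf
    pc_of = lambda d: float(np.sum(pdf(t - d) * cdf ** (m - 1)) * dt)
    lo, hi = 0.0, 6.0
    for _ in range(60):              # bisection: Pc(d') is monotone in d'
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if pc_of(mid) < pc else (lo, mid)
    return 0.5 * (lo + hi)

# Percent correct scores reported in Sec. 3.1.3:
d_s13 = dprime_mafc(0.315, m=4)   # S1/S3: 31.5 % correct in 4AFC
d_s2 = dprime_mafc(0.391, m=4)    # S2:    39.1 % correct in 4AFC
```

The resulting values are close to the d′ of 0.26 and 0.49 reported for S1/S3 and S2; the oddity-task d′ of Experiment 2 uses a different decision model (Craven [21]) and is not covered by this sketch.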
[4] A.-M. Gorgas, L. Schn, R. Dlapka, J. Doppler, M. Iber, C. Gradl, A. Kiselka, T. Siragy, and B. Horsak, "Short-term effects of real-time auditory display (sonification) on gait parameters in people with Parkinson's disease – a pilot study," 2017, pp. 855–859.

[5] D. Pirró, A. Wankhammer, P. Schwingenschuh, R. Höldrich, and A. Sontacchi, "Acoustic interface for tremor analysis," in Proc. International Conference on Auditory Display (ICAD), 02 2012.

[6] M. Weger, D. Pirró, A. Wankhammer, and R. Höldrich, "Discrimination of tremor diseases by interactive sonification," in 5th Interactive Sonification Workshop (ISon), 12 2016.

[7] IEM. (2017) Acoustic interface for tremor analysis. [Online]. Available: http://tremor.iem.at

[8] J. Shlens, "A tutorial on principal component analysis," Systems Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.

[9] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N.-C. Yen, C. C. Tung, and H. H. Liu, "The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis," Proc. of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 454, no. 1971, pp. 903–995, 1998.

[16] I. Carpinella, D. Cattaneo, and M. Ferrarin, "Hilbert-Huang transform based instrumental assessment of intention tremor in multiple sclerosis," J Neural Eng, vol. 12, no. 4, 08 2015.

[17] N. E. Huang, Z. Wu, S. R. Long, K. C. Arnold, X. Chen, and K. Blank, "On instantaneous frequency," Advances in Adaptive Data Analysis, vol. 1, no. 2, pp. 177–229, 2009.

[18] R. Höldrich, H. Pomberger, and A. Sontacchi, "Recruiting and evaluation process of an expert listening panel," in Fortschritte der Akustik (NAG/DAGA 2009), B. Rinus, Ed., Rotterdam (Netherlands), 03 2009.

[19] M. Frank and A. Sontacchi, "Performance review of an expert listening panel," in Fortschritte der Akustik (DAGA 2012), H. Hanselka, Ed., Berlin (Germany), 09 2012.

[20] M. J. Hacker and R. Ratcliff, "A revised table of d′ for m-alternative forced choice," Perception & Psychophysics, vol. 26, no. 2, pp. 168–170, 1979.

[21] B. J. Craven, "A table of d′ for m-alternative odd-man-out forced-choice procedures," Perception & Psychophysics, vol. 51, no. 4, pp. 379–385, 1992.

[22] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963.
2. METHOD

One of the main aims of the study was to investigate if, and how, auditory feedback alters gaze behavior and the distribution of visual attention in a nonhaptic versus haptic task. More dimensions are covered in [16]. The user was asked to throw a virtual ball into a goal, see Fig. 1. A throwing gesture was chosen since it is an intuitive and simple movement task that requires attention to shift between an object and a target. The gesture could also be intuitively sonified. We define the movement performed in this study as a virtual throwing gesture, since the haptic device that was used has limited possibilities in terms of affording a real throwing gesture.

Figure 1. Screenshot of the graphical interface.

2.1 Hypothesis

We hypothesized that continuous auditory feedback effectively represents temporal information in such a manner that a shift in attention is enabled in a virtual (visual) throwing task. More specifically, our hypothesis was that the use of interactive movement sonification enables a user to shift visual focus from the actual interaction point (the location of the cursor) to a more distant target point in a virtual 3D environment. Sonification can provide meaningful information about a movement. In conditions involving sonification, the user will not be required to focus as much on the ball, since information about the performed movement is communicated directly through sound. This will enable visual focus to shift from the ball to the goal. Moreover, we hypothesized that gaze behavior will be affected differently when haptic force feedback (adding a sense of weight and inertia) is presented simultaneously with movement sonification.

2.2 Participants

A total of 20 participants took part in the experiment. All participants were first-year students at KTH Royal Institute of Technology (11 M, 9 F; age 19-25 years, mean 20.4 years). None of the participants reported any hearing deficiencies or reduced touch sensitivity. All participants reported that they had normal or corrected-to-normal vision. Six of the participants normally wore glasses; two of them wore their glasses during the test sessions and two of them used contact lenses. None of the participants reported any previous experience of using force feedback devices. All participants signed a consent form prior to taking part in the experiment. Six participants were disregarded from the analysis since the level of usable gaze data was below 80% 1 . Yet another participant was disregarded since the headphones accidentally switched off during the experiment. The final sample size for analysis was n=13 (8 M, 5 F; age 19-24 years, mean 20.8 years).

2.3 Apparatus

2.3.1 Technical Setup

The technical setup can be seen in Fig. 2. One desktop computer was used to run the haptic interface and the eye tracking application; one laptop computer was used for sound generation. The 3D interface was designed to allow the user to grasp a ball and to throw it into a circular goal area. A SensAble Phantom Desktop haptic device 2 was used to provide haptic feedback. This haptic device has a pen-like stylus that is attached to a robotic arm. The stylus can be used to haptically explore objects in virtual 3D environments and to provide force feedback. Weight and inertia can also be simulated. In our experiment, a button on the stylus was used to activate the function of grasping an object, enabling the user to lift and move the virtual ball. The 3D application, shown in Fig. 1, was based on the haptic software library Chai3D [17] and developed in C++.

Eye tracking data was captured at a sampling rate of 60 Hz using a commercial X2-60 eye tracker from Tobii Technology 3 connected to an external processing unit. The software Tobii Studio was used for calibration and analysis of the eye tracking data. Two interactive sonification models were implemented using Max/MSP 4 . Communication between Max/MSP and the 3D application was made possible via Open Sound Control (OSC) [18]. A pair of Sennheiser RS 220 headphones was used to provide auditory feedback during the experiment.

2.3.2 Sonification

For conditions involving auditory feedback, the movements of the virtual ball were continuously sonified. Two different sound models were used for sonification of the throwing gesture, i.e. when aiming towards the goal. The first sound model (from now on referred to as S1) was based on a physical model of friction, readily available from the Sound Design Toolkit 5 [19] (SDT). The second sound model (from now on referred to as S2) was obtained by filtering pink noise using the Max object [biquad~] with a low-pass filter setting. We opted for these particular sound

1 100% in this context would imply that gaze data for both eyes were
2 http://www.dentsable.com/haptic-phantom-desktop.htm
3 www.tobiipro.com
4 https://cycling74.com/
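The S2 model described above (pink noise through a low-pass biquad) can be sketched outside Max/MSP. The sketch below is an illustrative stand-in, not the authors' patch: it uses the Voss-McCartney pink-noise approximation and the widely used RBJ audio-EQ-cookbook low-pass coefficients, which belong to the same two-pole/two-zero family as Max's [biquad~]. The cutoff value and its mapping to the ball movement are hypothetical.

```python
import math
import random

def lowpass_biquad_coeffs(fc, fs, q=0.707):
    # RBJ audio-EQ-cookbook low-pass biquad, normalized by a0.
    w0 = 2.0 * math.pi * fc / fs
    alpha = math.sin(w0) / (2.0 * q)
    cw = math.cos(w0)
    a0 = 1.0 + alpha
    b = [(1.0 - cw) / (2.0 * a0), (1.0 - cw) / a0, (1.0 - cw) / (2.0 * a0)]
    a = [1.0, -2.0 * cw / a0, (1.0 - alpha) / a0]
    return b, a

def biquad(x, b, a):
    # Direct-form I difference equation:
    # y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    y = []
    xm1 = xm2 = ym1 = ym2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * xm1 + b[2] * xm2 - a[1] * ym1 - a[2] * ym2
        xm2, xm1 = xm1, xn
        ym2, ym1 = ym1, yn
        y.append(yn)
    return y

def pink_noise(n, rows=16, seed=1):
    # Voss-McCartney approximation: sum of rows refreshed at octave-spaced rates.
    rng = random.Random(seed)
    vals = [rng.uniform(-1.0, 1.0) for _ in range(rows)]
    out = []
    for i in range(n):
        # Refresh the row given by the trailing zeros of the sample counter.
        r = min(((i + 1) & -(i + 1)).bit_length() - 1, rows - 1)
        vals[r] = rng.uniform(-1.0, 1.0)
        out.append(sum(vals) / rows)
    return out

fs = 44100
noise = pink_noise(fs // 10)                   # 100 ms of pink noise
b, a = lowpass_biquad_coeffs(fc=800.0, fs=fs)  # hypothetical cutoff; in the study
s2 = biquad(noise, b, a)                       # this would track the ball movement
```

In an interactive setting the coefficients would be recomputed per control-rate frame as the movement parameter changes, which is essentially what driving [biquad~] from a mapping patch does.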
cussed and explored through sonifications of gaze trajectories, see Sec. 3.4.

3. RESULTS

In order to test the hypotheses, an index of the eye tracking metric (fixation duration or fixation count) for the ball AOI divided by the metric for both the goal AOI and the ball AOI was defined, according to Eq. (1):

    I = Fix_Ball / (Fix_Goal + Fix_Ball)    (1)

This index simplifies the interpretation of whether the participant was focusing mostly on the ball or the goal AOI, without having to take data from both AOIs into consideration simultaneously.

3.1 Total Fixation Duration

Total fixation duration measures the total time that a participant fixated on the goal, ball or the entire screen, respectively. We calculated total fixation duration on the entire screen for the first 15 throwing gestures. Descriptive statistics for this measure are presented in Tab. 1. We can observe that the highest median value was found for condition VH. The lowest median value was observed for condition AV2.

    Condition  Median  IQR
    V          137.39  21.72
    AV1        137.96  24.06
    AV2        125.92  25.44
    VH         142.73  44.49
    AVH1       132.39  19.39
    AVH2       131.92  31.97

Table 1. Total fixation duration on screen (seconds) for the full duration of the first 15 attempts to throw the ball into the goal. IQR = interquartile range. Red cell: highest median value. Green cell: lowest median value.

Filtering recordings to include gaze data only from when the participants were grasping the virtual ball, the total fixation duration index (according to Eq. 1) was computed for each participant and condition. As opposed to the measure presented in Tab. 1, this measure specifically evaluates the effect of movement sonification (not the effect of other auditory events caused by e.g. scoring a goal) on eye tracking metrics, since such data has been filtered out. A total fixation duration index higher than 0.5 indicates longer fixation on the ball AOI than on the goal AOI (when grasping the ball). Descriptive statistics for the measure are presented in Tab. 2. The highest median value was found for the haptic condition AVH2. The lowest median value was observed for the nonhaptic condition AV2. The total fixation duration index per participant and condition is visualized in Fig. 4.

    Condition  Median  IQR
    V          0.53    0.77
    AV1        0.43    0.61
    AV2        0.38    0.70
    VH         0.49    0.69
    AVH1       0.46    0.60
    AVH2       0.66    0.70

Table 2. Total fixation duration index for the first 15 throwing gestures, filtered to include data only when grasping the ball. IQR = interquartile range. Red cell: highest median value. Green cell: lowest median value.

Figure 4. Total fixation duration index values for the first 15 throwing gestures, filtered to include data only when grasping the ball. Participants above the dashed red line focused longer on the ball than on the goal AOI.

3.2 Total Fixation Count

Total fixation count measures the total number of fixations within an area of interest. We calculated the total fixation count on the entire screen for the first 15 throwing gestures. In the descriptive statistics table Tab. 3, we can observe that the highest median value was found for condition V. The lowest median value was observed for condition AV1.

Filtering data to include gaze data only from when the ball was grasped, the total fixation count index (as described in Eq. 1) was computed for each participant and condition. A total fixation count index higher than 0.5 indicates more fixations on the ball AOI than on the goal AOI (when grasping the ball). Descriptive statistics for the measure are presented in Tab. 4. The highest median value was found for the haptic condition AVH2. The lowest medians were observed for the nonhaptic conditions AV1 and AV2.

3.3 Cluster Analysis - Total Fixation Duration

In order to explore similar behavioral patterns among participants, a K-means clustering analysis was performed. The appropriate number of clusters was decided after visually inspecting patterns in Fig. 4. The number of clusters was set to 5, and the initial cluster centers were evaluated based on the data, specifying 20 random starting assignments and selecting the solution with the lowest within-cluster
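Eq. (1) and the clustering procedure can be sketched in a few lines. This is an illustrative stand-in in plain Python (the excerpt does not name the statistics software used); the data and function names are hypothetical.

```python
import random

def fixation_index(fix_ball, fix_goal):
    # Eq. (1): I = Fix_Ball / (Fix_Goal + Fix_Ball).
    # I > 0.5 means longer/more fixation on the ball AOI than the goal AOI.
    return fix_ball / (fix_goal + fix_ball)

def kmeans(points, k, n_init=20, iters=100, seed=0):
    # Plain k-means; among n_init random starts, keep the solution with
    # the lowest within-cluster sum of squares (as described in Sec. 3.3).
    rng = random.Random(seed)
    def dist2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    best = None
    for _ in range(n_init):
        centers = [list(p) for p in rng.sample(points, k)]
        for _ in range(iters):
            labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                      for p in points]
            new_centers = []
            for j in range(k):
                members = [p for p, l in zip(points, labels) if l == j]
                if members:
                    new_centers.append([sum(c) / len(members)
                                        for c in zip(*members)])
                else:
                    new_centers.append(centers[j])  # keep an empty cluster's center
            if new_centers == centers:
                break
            centers = new_centers
        wss = sum(dist2(p, centers[l]) for p, l in zip(points, labels))
        if best is None or wss < best[0]:
            best = (wss, labels)
    return best[1]
```

For the study's data, each point would be a participant's vector of per-condition index values, with k = 5 as chosen from Fig. 4.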
Figure 6. Gaze behavior of participant 10. Points correspond to the coordinates of the averaged left- and right-eye gaze position on the screen.

AVH1, V and VH. The participant focused mostly on the ball in most conditions (all apart from AV1). The coordinates of the averaged left- and right-eye gaze positions are displayed in Fig. 6. The plot suggests that there might be a difference between conditions (the points are more or less scattered in different conditions). For the ball AOI, this participant had total fixation durations in the range between 7.24 and 19.72 seconds (lowest for condition AV1 and highest for condition V) and total fixation counts in the range between 18 and 29 counts (lowest for AV2 and highest for condition V).

3.4.2 Cluster 2 - Participant 22

Generally, participants belonging to cluster 2 focused longer on the ball AOI for the swishing sound model than for the creaking sound model. The total error rate for all conditions for participants in this cluster was 37.5 (median 37.5). Participant 22 performed the experiment in the following condition order: V, AV2, VH, AV1, AVH1 and AVH2. For most of the conditions, this particular participant focused longer on the goal AOI than on the ball AOI. For the ball AOI, participant 22 had total fixation durations in the range between 4.29 and 29.67 seconds (lowest for AV1 and highest for V) and total fixation counts in the range between 22 and 77 counts (lowest for AV1 and highest for condition V). By listening to the sonified gaze trajectories, one can observe that condition V stands out from the other conditions; this condition had overall longer gaze durations compared to the other nonhaptic conditions. Interestingly, we can observe from the metrics in Tab. 6 that this participant focused more on the ball for conditions where no sound was present.

3.4.3 Cluster 3 - Participant 12

Participants belonging to cluster 3 generally focused longer on the goal AOI than on the ball AOI. The total error rate for all conditions for participants in this cluster was 32 (median 26), i.e. the best performance of all clusters. Contrary to the behavior of participants in cluster 2, most participants in cluster 3 focused longer on the ball AOI for the creaking sound model than for the swishing sound model. For participant 12, this tendency can be observed both for haptic and nonhaptic conditions. The participant performed the experiment in the following condition order: AVH1, AV1, AVH2, AV2, V and VH. The coordinates of the averaged left- and right-eye gaze positions are displayed in Fig. 7. As seen in the figure, the gaze points are distributed in a similar pattern in all conditions. For the ball AOI, the participant had total fixation durations in the range between 3.05 and 7.44 seconds (lowest for VH and highest for AV1) and total fixation counts in the range between 10 and 28 counts (lowest for V and highest for AV1). As indicated by the plots presented in Fig. 7, the sonifications of this participant's gaze behavior do sound different in different conditions.

Figure 7. Gaze behavior of participant 12. Points correspond to the coordinates of the averaged left- and right-eye gaze position on the screen.

3.4.4 Cluster 4 - Participant 04

This participant was classified separately in a solo cluster. The total error rate for all conditions for this participant was 52.0. The participant performed the experiment in the following condition order: AVH1, VH, V, AVH1, AV2 and AV1. As indicated by the metrics in Tab. 5, this participant focused approximately as much on the goal AOI as on the ball AOI in all conditions. For the ball AOI, participant 4 had total fixation durations in the range between 15.98 and 31.40 seconds (lowest for V and highest for AVH1) and total fixation counts in the range between 28 and 48 counts (lowest for VH and highest for AVH1). The V condition stands out compared to the other conditions.

3.4.5 Cluster 5 - Participant 15

Participants in cluster 5 showed the opposite behavior of participants in cluster 3: they generally focused more on the ball AOI than on the goal AOI. The total error rate for all conditions for participants in this cluster was 95.0 (median 96.0), i.e. the highest error rate of all clusters. Participant 15 performed the conditions in the following order: V, AV1, VH, AVH1, AV2 and AVH2. The coordinates of the averaged left- and right-eye gaze positions are displayed in Fig. 8. As seen in the figure, gaze trajectories appear to
between focusing mostly on the ball versus the goal AOI for clusters 3 and 5.

5. CONCLUSIONS

This exploratory study focused on the effect of auditory feedback on gaze behavior. Analysis of eye tracking metrics revealed several clusters of different gaze behaviors among only 13 participants. Some of the participants appeared to have been affected by the presence of auditory feedback, whereas others showed similar gaze trajectory patterns across conditions. More thorough studies are required in order to draw any general conclusions regarding the effect of auditory feedback in this context. This exploratory study raises intriguing questions that motivate future large-scale studies on the effect of auditory feedback on gaze behavior.

Acknowledgments

This work was supported by the Swedish Research Council (Grant No. D0511301). It was also partially funded by a grant to Roberto Bresin from KTH Royal Institute of Technology.

6. REFERENCES

[1] J. Vroomen and B. de Gelder, "Sound enhances visual perception: cross-modal effects of auditory organization on vision," Journal of Experimental Psychology: Human Perception and Performance, vol. 26, no. 5, p. 1583, 2000.

[2] L. Shams, Y. Kamitani, and S. Shimojo, "Illusions: What you see is what you hear," Nature, vol. 408, no. 6814, pp. 788-788, 2000.

[3] J. Mishra, A. Martinez, and S. A. Hillyard, "Effect of attention on early cortical processes associated with the sound-induced extra flash illusion," Journal of Cognitive Neuroscience, vol. 22, no. 8, pp. 1714-1729, 2010.

[4] S. L. Calvert and T. L. Gersh, "The selective use of sound effects and visual inserts for children's television story comprehension," Journal of Applied Developmental Psychology, vol. 8, no. 4, pp. 363-375, 1987.

[5] A. R. Seitz, R. Kim, and L. Shams, "Sound facilitates visual learning," Current Biology, vol. 16, no. 14, pp. 1422-1427, 2006.

[6] G. Song, D. Pellerin, and L. Granjon, "Different types of sounds influence gaze differently in videos," Journal of Eye Movement Research, vol. 6, no. 4, pp. 1-13, 2013.

[7] G. Song, D. Pellerin, and L. Granjon, "Sound effect on visual gaze when looking at videos," in Proc. 19th European Signal Processing Conference. IEEE, 2011, pp. 2034-2038.

[8] A. Coutrot, N. Guyader, G. Ionescu, and A. Caplier, "Influence of soundtrack on eye movements during video exploration," Journal of Eye Movement Research, vol. 5, no. 4, p. 2, 2012.

[9] X. Min, G. Zhai, Z. Gao, C. Hu, and X. Yang, "Sound influences visual attention discriminately in videos," in Proc. 6th International Workshop on Quality of Multimedia Experience (QoMEX). IEEE, 2014, pp. 153-158.

[10] V. Losing, L. Rottkamp, M. Zeunert, and T. Pfeiffer, "Guiding visual search tasks using gaze-contingent auditory feedback," in Proc. 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. ACM, 2014, pp. 1093-1102.

[11] E. Boyer, "Continuous auditory feedback for sensori-motor learning," Ph.D. dissertation, Universite Pierre et Marie Curie-Paris VI, 2015.

[12] A. Poole and L. J. Ball, "Eye tracking in HCI and usability research," Encyclopedia of Human Computer Interaction, vol. 1, pp. 211-219.

[13] "Computer interface evaluation using eye movements: methods and constructs," International Journal of Industrial Ergonomics, vol. 24, no. 6, pp. 631-645, 1999.

[14] A. Duchowski, Eye Tracking Methodology: Theory and Practice. Springer Science & Business Media, 2007, vol. 373.

[15] M. A. Just and P. A. Carpenter, "Eye fixations and cognitive processes," Cognitive Psychology, vol. 8, no. 4, pp. 441-480, 1976.

[16] E. Frid, J. Moll, R. Bresin, and E.-L. Sallnas-Pysander, "Influence of movement sonification on performance in a virtual throwing task performed with and without haptic feedback," manuscript submitted for publication.

[17] F. Conti, F. Barbagli, D. Morris, and C. Sewell, "CHAI: an open-source library for the rapid development of haptic scenes," in Proc. IEEE World Haptics Conference, 2005. [Online]. Available: http://www.chai3d.org/

[18] M. Wright, "Open Sound Control: an enabling technology for musical networking," Organised Sound, vol. 10, no. 3, pp. 193-200, 2005.

[19] S. D. Monache, P. Polotti, and D. Rocchesso, "A toolkit for explorations in sonic interaction design," in Proc. 5th Audio Mostly Conference: A Conference on Interaction with Sound. ACM, 2010, p. 1.

[20] J. H. Goldberg and A. M. Wichansky, "Eye tracking in usability evaluation: A practitioner's guide," in The Mind's Eye, J. Hyönä, R. Radach, and H. Deubel, Eds. Amsterdam: North-Holland, 2003, pp. 493-516.
Felipe Orduña-Bustamante
Centro de Ciencias Aplicadas y Desarrollo Tecnológico,
Universidad Nacional Autónoma de México,
Circuito Exterior S/N, Ciudad Universitaria,
Del. Coyoacán, C.P. 04510, México D.F., MEXICO.
felipe.orduna@ccadet.unam.mx
Figure 4. Impulse responses in rooms of space dimensions D = 1 to 6, Ds = D, including the factor (j·)^((Ds-1)/2). Absolute peak amplitudes are normalized to 0.5 vertical units.

Figure 6. Calculation times for the generation of impulse responses in rooms of dimensions D = 1 to 6. Timings of Allen 1979 for N = 37,500 image sources [8], and also estimated (est.) for N = 2,146,689, used here for D = 3.
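The larger source count quoted in the caption is consistent with a full lattice of mirror-image rooms. Assuming reflection orders -n..n along each of the D axes (an assumption on my part; the excerpt only quotes the totals), the count is (2n+1)^D, and N = 2,146,689 for D = 3 corresponds to n = 64:

```python
def image_source_count(n, D):
    # Full image-source lattice: reflection orders -n..n along each of D axes.
    # The lattice form is an assumption; the excerpt only quotes the totals.
    return (2 * n + 1) ** D

# N = 2,146,689 quoted for D = 3 matches a lattice with n = 64 (129^3).
print(image_source_count(64, 3))
```

The Allen-Berkley figure of N = 37,500 is not of the form (2n+1)^3, so it evidently uses a different truncation and is not reproduced by this formula.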
6. CONCLUSIONS

With increasing dimension D, there is increased and more intricate reverberation, with a consequently reduced intelligibility, and even with strong aftermaths at the higher dimensions. These are caused by the lumping together of reflections that get mapped into the same neighboring sampling periods (also perceptually) in certain portions of the impulse response. At the lower dimensions (especially D = 1, 2), there are more chances of flutter and pulsation effects, while at the higher dimensions, sound tends to increasingly acquire hissing and chirping effects.

When using a source transfer function of fixed dimension Ds, Figure 5, reverberation effects are noticeably reduced for rooms with D < Ds, but are over-emphasized when D > Ds. This occurs because, compared to the normal radial decay 1/r^((D-1)/2), the radial decay 1/r^((Ds-1)/2) has little reach for D < Ds, but is more far-reaching for D > Ds.

[6] R. L. Liboff, "Complex Hyperspherical Equations, Nodal-Partitioning, and First-Excited-State Theorems in R^n," International Journal of Theoretical Physics, vol. 41, no. 10, pp. 1957-1970, October 2001.

[7] W.-Q. Zhao, "Relation Between Dimensional and Angular Momentum for Radially Symmetric Potential in N-Dimensional Space," Commun. Theor. Phys., vol. 46, no. 3, pp. 429-434, September 2006.

[8] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943-950, April 1979.

[9] J. S. Avery, "Harmonic polynomials, hyperspherical harmonics, and atomic spectra," Journal of Computational and Applied Mathematics, vol. 233, pp. 1366-1379, 2010.

[10] T. Kereselidze, G. Ckdua, P. Defrance, and J. Ogilvie, "Derivations, properties and applications of Coulomb Sturmians defined in spheroidal coordinates," Molecular Physics, vol. 114, no. 1, pp. 148-161, 2016.

[11] A. Kelloniemi, V. Välimäki, P. Huang, and L. Savioja, "Artificial reverberation using a hyper-dimensional FDTD mesh," in Proc. 13th European Signal Processing Conference. IEEE, 2005, pp. 1-4.

[12] A. Kelloniemi, P. Huang, V. Välimäki, and L. Savioja, "Hyper-dimensional digital waveguide mesh for reverberation modeling," CiteSeerX, oai:CiteSeerX.psu:10.1.1.384.708, 2013.
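The decay laws compared in the last paragraph follow from energy conservation over hyperspherical wavefronts; as a brief restatement (my own, not taken from the paper): the surface of a hypersphere of radius r in D-dimensional space grows as r^(D-1), so a conserved energy flux implies

```latex
|p(r)|^{2}\, r^{D-1} = \mathrm{const}
\quad\Longrightarrow\quad
|p(r)| \propto \frac{1}{r^{(D-1)/2}},
\qquad
\frac{r^{-(D_s-1)/2}}{r^{-(D-1)/2}} = r^{(D-D_s)/2}.
```

The ratio on the right shows the mismatch when a source tuned for dimension Ds is used in a room of dimension D: it shrinks with distance for D < Ds (little reach, reduced reverberation) and grows for D > Ds (over-emphasized reverberation), matching the observation above.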
Figure 3. The recording setup. Left: block scheme of the acquisition chain. Right: a picture of the KEMAR mannequin
positioned at the pianist location. The setup was moved to the center of the room for the recording session.
the second one (reverse setting hereafter), they were reversed so that the left channel was routed to the right headphone channel and vice versa.

Experimental stimuli were then constructed using only a subset of the A tones, namely the five tones A0, A1, A3, A6, A7. This choice was driven by the need to keep the total duration of a single experimental session to a reasonable time.

The duration of the stimuli was limited to 5 s, for the same reason. Additionally, 5 s long stimuli minimized the perceptibility of background noise in the late decay stage of the samples, as discussed in Sec. 3 above. In the active condition, the 5 s stimulus duration was enforced by showing a timer to the subjects and instructing them to release the key after this duration; if the key was not released after this duration, a note-off event was automatically sent.

The total number of trials was 120 (i.e., 5 samples x 2 routing settings x 4 repetitions x 3 conditions), thus 40 for each of the three conditions. In the passive closed and passive open conditions the key velocity was set to 84 in all samples. In the active condition the key velocity was set by the playing subject, controlling the corresponding sample playback.

4.2 Procedure

Subjects with an even ID attended the active, passive closed and passive open condition in sequence; those with an odd ID attended the passive closed, passive open, and active condition instead. Stimuli were randomized within each condition.

After each trial, the subject had to rate the stimulus along two dimensions: the overall sound quality, rated on a 7-point Likert scale (from 1 to 7), and the perceived lateral direction of the sound, again on a 7-point Likert scale (from -3 to +3). Subjective ratings were recorded through a graphical interface on a laptop placed at the piano music stand. The interface was composed of a vertical slider for the sound quality ratings and a horizontal slider for the lateral direction ratings. The same interface guided subjects through the experiment. For the active condition, the actual (MIDI) note values and velocity values of the keys played by subjects were recorded as well.

Experimental sessions started with a brief verbal introduction to the subjects, who were told that the goal of the experiment was to evaluate the quality of piano sound, with no mention of localization or spatial features of the sound. Then the various phases of the experiment were explained. In particular, subjects were instructed to keep their head still during each trial and to maintain the visual gaze on a single location in front of them. This ensured that, during the playback of the binaural stimuli, the reference frame of the subjects' heads remained coherent with that of the KEMAR head (with which the same binaural stimuli were acquired).

Then, a pre-experimental interview was carried out to collect age, self-reporting of hearing problems, and degree of familiarity with the acoustic piano as an instrument. Three alternative answers to the last question were given: performer, listener, no familiarity (meaning that the subject had never experienced the sound of an acoustic piano in live conditions). During the session, the experimenter annotated relevant reactions or behaviors by the subjects. Finally, after the session an informal post-experimental interview was carried out with questions related to the musical competencies of the subject and her/his impressions of the experiment.

Experimental sessions lasted about 35-40 minutes. A total of 13 subjects took part in the experiment (age 17-26 years, average 22.2). None of them was a pianist. Interviews revealed that 8 of them were not familiar with the acoustic piano, 8 were listeners, and 1 was a performer although not at a professional level. Moreover, 6 subjects had no music training, 2 had practiced a musical instrument as amateurs for a short amount of time, and 5 received music training and played a musical instrument although not at a professional level.

4.3 Results

We analyze and discuss results regarding the second dimension rated by subjects, i.e. the perceived lateral direction of the sound. One-way ANOVAs with factor Condition were performed separately for each note, and for the two routing settings independently. Such a choice is preliminary at this stage of the research, since it clearly prevents the interactions among factors from being brought to the surface. Even without analyzing such interactions, the results we found using individual ANOVAs are sufficiently rich to lead to an articulate discussion.

Fig. 5 shows boxplots for the standard routing setting. Physical positions of the five A keys have been normalized in the range [-3,3], see Sec. 4.2. Furthermore, perceived note position means are reported for each of the three listening conditions (passive closed, passive open, active). The result is mixed: lower tones were localized on positions having a clear relationship with the keys; localization, though, ceased to depend on key positions for the higher notes, which in fact were localized slightly leftward from the center.

In the reverse routing setting, similar boxplots were found. Fig. 6 in fact shows that participants located the tones at approximately mirrored positions, again displacing the higher notes to the wrong half of the keyboard.

For each note a one-way ANOVA was conducted among the three listening conditions, separately for the standard and reverse settings. The ANOVA showed significant differences between conditions for note A1 with the standard setting (F(2, 140) = 6.15, p < 0.01), and for note A7 with the reverse setting (F(2, 125) = 5.25, p < 0.01). Pairwise post-hoc t-tests for paired samples were run, using Holm-Bonferroni correction on p-values. The post-hoc analysis revealed that note A1 was localized in significantly different positions in active vs. passive open conditions (t(89) = 3.52, p < 0.01), and in active vs. passive closed conditions (t(89) = 3.17, p < 0.01). The same analysis showed that note A7 was localized in significantly different positions in active vs. passive closed conditions (t(74) = 3.09, p < 0.01), and in passive open vs. passive closed conditions (t(102) = 2.34, p < 0.05).
Figure 5. Key vs. perceived note position for the standard routing setting. Right-to-left boxplots for each note: perceived position in the passive closed condition; perceived position in the passive open condition; perceived position in the active condition. Note positions are normalized in the range [-3,3] and marked with X: A0=-3; A1=-2; A3=-0.5; A6=2; A7=3.

Figure 6. Key vs. perceived note position for the reverse routing setting. Right-to-left boxplots for each note: perceived position in the passive closed condition; perceived position in the passive open condition; perceived position in the active condition. Note positions are normalized in the range [-3,3] and marked with X: A0=3; A1=2; A3=0.5; A6=-2; A7=-3.

A further series of one-way ANOVAs was conducted for each condition and routing setting, this time to uncover significant differences among notes. These ANOVAs showed that all notes were localized on significantly different positions for each condition and routing setting, except in the case of the passive closed condition with the reverse setting. Pairwise post-hoc t-tests for paired samples were run, using Holm-Bonferroni correction on p-values. The post-hoc analysis revealed a number of significantly different pairs. For brevity, individual t-statistics are omitted and only the significant p-values are listed in Table 1.

Table 1. Significantly different note pairs under different listening conditions and routing settings (rows: note pairs A0-A1, A0-A3, A0-A6, A0-A7, A1-A3, A1-A6, A1-A7, A3-A6, A3-A7, A6-A7; columns: A, C, O under each of the standard and reverse routing settings). A: active; C: passive closed; O: passive open. *: p < 0.05; **: p < 0.01.

5. DISCUSSION

First of all, with the standard routing setting listeners did localize all tones without significant support (except for note A0) from the visual or somatosensory feedback. Notes A0, A1 and A3 were localized around the corresponding key positions: in this case the visual and somatosensory feedback in general helped refine the localization, with an apparently stronger role of the somatosensory than the visual feedback in locking the sound source in correspondence of the respective key. The data are also less dispersed when somatosensory perception was enabled. On the contrary, notes A6 and A7 were localized off the key position. In this case the contribution of the visual and somatosensory feedback to the auditory localization is not clear: for note A6, the two sensory modalities seemed to shift the sound source position toward the key; for note A7, they probably wandered around the sound source position where it was localized through listening.

In the reverse routing setting, listeners appear more confused. Notes are largely localized in the normalized portion [0,2] of the keyboard range. Once again there was no significant support from the visual or somatosensory feedback, except for note A7 which, however, suffers from the most inaccurate localization. In general the somatosensory data are more dispersed, suggesting that active playing did not help reduce confusion in subjects when listening to reverse routing.

Within the limits of its significance, the ANOVA among conditions suggests the existence of a progressively supportive role of the visual and then somatosensory feedback in localizing a tone over the corresponding key when the auditory system did not contradict this localization process. Conversely, these two sensory modalities would have no decisive role when the auditory localization failed to match the sound source with the corresponding key position.

The same analysis was repeated for the amateur musicians subgroup (5 subjects), and for subjects with an even and odd ID, who were respectively exposed to a different order of the experimental conditions; refer to the beginning of Sec. 4.2. Results were analyzed informally, due to insufficient population in those subgroups. Amateur musicians provided answers that are less dispersed around the note position, whereas they did not perform differently from non-musicians under reverse routing conditions. The same effect seems to affect subjects with an even ID, i.e., those who attended the active condition first. A previous
SMC2017-259
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
knowledge of music playing, then, could have supported subjects in making more correct decisions in normal routing conditions. Similar support could have helped those subjects who first played the notes during the test.

The ANOVA among note positions did not contradict the previous analysis among conditions, and suggests further observations. First, in the reverse routing setting there was certainly more confusion in localizing the tones at significantly different positions on the keyboard. Visual feedback was relatively more decisive in this sense, not only with the reverse but also with the standard routing setting. In the latter case tone localization was more reliable: Table 1 shows a generally increasing role of the somatosensory (A), auditory (C) and visual (O) modalities in supporting the localization of the notes at different keyboard positions. One would be tempted to sort the rows of that table in key-distance order, and to expect an increasingly different perceived position of the tones based on this order. Unfortunately this is true only to a limited extent: pairs A3-A6 and A3-A7, for instance, which fall about on the middle rows if the table is reorganized in key-distance order, did not give rise to distinct perceived positions on the keyboard under any type of feedback.

A comparison between our analysis and the data in Fig. 2 does not permit aligning this discussion with one possible result suggested by that figure, i.e., that somatosensory feedback may have supported tone localization on the corresponding keys when the auditory feedback was unreliable; in fact, our experiment mainly suggests that visual and somatosensory localization worked better when not contradicting the auditory localization. This misalignment could indeed be expected, since we switched from loudspeaker to headphone listening and furthermore we experimented with participants who were not pianists.

6. CONCLUSIONS

An experiment has been conducted to investigate the role of auditory, visual and somatosensory feedback in piano tone localization. We inherited previous literature about the characteristic transients preceding a steady tone in the piano, along with still partially unstable findings about the spatial realism of piano tones under auditory and somatosensory cues. Moving from this literature, we hypothesized that somatosensory and visual feedback support the auditory localization of piano tones on the corresponding keys. Overall the results suggest that this happens when the auditory feedback is coherent with the somatosensory and visual feedback, in the sense of the hypothesized specific localization process. However, these results are difficult to compare with the previous findings.

Given also the different hypotheses as well as conditions between the previous tests and the experiment proposed here, we remain noncommittal about the existence of an auditory precedence effect leading the localization of piano notes, whose key positions are in any case well known to pianists through their visual and somatosensory experience. Rather, we have collected results under a broad set of conditions which may prove precious for building future experiments. Such experiments, first of all, should include a systematic search for robust acoustic precursors that may be responsible for the activation of the precedence effect.
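As an aside, the Holm-Bonferroni step-down procedure used in the post-hoc analysis can be sketched as follows. This is a generic illustration of the correction, not the authors' actual analysis script, and the p-values in the usage test are invented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Holm-Bonferroni step-down correction: sort the m p-values in
// ascending order and compare the i-th smallest (0-based) against
// alpha / (m - i); at the first failure, that test and all larger
// p-values are declared non-significant. Returns one reject/accept
// flag per input p-value, in the original input order.
std::vector<bool> holmBonferroni(const std::vector<double>& p, double alpha) {
    const std::size_t m = p.size();
    std::vector<std::size_t> idx(m);
    for (std::size_t i = 0; i < m; ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(),
              [&p](std::size_t a, std::size_t b) { return p[a] < p[b]; });
    std::vector<bool> reject(m, false);
    for (std::size_t i = 0; i < m; ++i) {
        if (p[idx[i]] <= alpha / static_cast<double>(m - i)) {
            reject[idx[i]] = true;
        } else {
            break;  // step-down: stop at the first non-rejection
        }
    }
    return reject;
}
```

With four invented p-values {0.01, 0.04, 0.03, 0.005} and alpha = 0.05, the sorted thresholds are 0.0125, 0.0167, 0.025 and 0.05, so only 0.005 and 0.01 survive the correction.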
elements to vibrate, resonate and oscillate so that the building itself becomes a very large musical instrument [1]. Sonic Greenhouse presents a parallel with Byrne's piece in its aim to induce vibration into the building's physical structure, with the notable difference that the present project uses audio-rate actuators rather than sub-audio-frequency mechanical devices such as motors and solenoids. Audio-rate excitation at bass frequencies has been investigated within the different instantiations of the Resonant Architecture project, exciting the lower-spectrum resonant modes of large-scale architectural and sculptural entities [2]. Structural vibration has been used as a compositional element in numerous works by Bill Fontana, for example in the Harmonic Bridge piece at Tate Modern, London [3]. More recently, Rodolphe Alexis has used the greenhouse context for a sound installation, using wave field synthesis to create a soundscape for a flower exhibition at the Natural History Museum in Paris [4]. The present work builds on prior research on structure-borne sound in intermedia contexts by Otso Lähdeoja [5].

The compositional considerations of Sonic Greenhouse have echoes in Janet Cardiff's Forty Part Motet, which sets the listener as the active agent in the unraveling of a spatially distributed audio work, leading to a first-person spatialisation [6]. A similar idea has been explored in Parabolic Dishes by Bernhard Leitner, creating spatial regions sonically, in which the visitors are required to explore the acoustically structured space [7]. In a situation where the visitor is an explorative agent, questions of form and sound spatialisation strategies arise. In our work, this was reflected in a dialogue about placing sounds in space rather than in time, and about creating a generatively changing sonic situation rather than a narrative, echoing Max Neuhaus: "Traditionally composers have located the elements of a composition in time. One idea which I am interested in is locating them, instead, in space, and letting the listener place them in his own time." [8]

3. DESIGN AND IMPLEMENTATION

The Helsinki Winter Garden comprises three different spaces, as shown in Fig. 1: a) the central Palm Room (surface 280 m2), b) the eastern wing's Cactus Room (200 m2) and c) the western wing's social space (170 m2). Sonic Greenhouse was designed in intimate connection with the given space in order to foster integration into its physical structures, plan and human life, acoustics and poetics. A prior period of sensorial research was carried out through multiple visits to the site spanning over a year, observing how the staff and visitors behave in the space and how the building reacts to different weather conditions. These observations had a decisive effect on the design of the piece, which had to take into account high humidity levels, leakages and regular activities such as watering the plants. The visits also served to familiarise ourselves with the acoustic specificities of the site.

The Palm Room (a) acoustics are very reverberant and cathedral-like, and the space is filled with luxuriant vegetation. The sound work was designed to respond to this grandeur with full-spectrum and dynamic sonic constructions involving precisely rendered movements of sound within the glass structure. The Cactus Room (b) is very silent and intimate. Accordingly, the sound was designed to occupy the space with precise, intricate sounds which draw the visitor's attention to details and silent contemplation. The social space (c) was left as a relaxation space, convenient for discussion and for studying documentation about the work presented in printed format.

Figure 1: Helsinki Winter Garden floor plan and photos of its three rooms: a) the Palm Room, b) the Cactus Room and c) the Chatting Room.

3.1 System Description

Sonic Greenhouse implemented a sound diffusion system in two of the Winter Garden's three rooms. A 40-channel system was built in the Palm Room and another, independent system with 20 channels was built in the Cactus Room.

Two different types of audio transducers were used: 1) TEAX32C30-4/B structure-borne sound drivers attached to the building's glass walls and to custom-made suspended plexiglass panels, and 2) small cone loudspeakers enclosed in glass jars placed on the ground amidst the vegetation. Sound diffusion via structure-borne sound drivers was an aesthetic starting point for the entire piece. These drivers - or vibration speakers - make it possible to transform a rigid surface into a full-range speaker (the measured frequency range of the TEAX32C30-4/B transducer is 70 Hz - 14 kHz) with minimum visual impact. The driver and glass panel combination is acoustically ideal for creating large-scale soundscape-oriented aural impressions. The panels act as flat-panel dipole speakers, providing a larger and more dif-
fuse acoustic image than a more directive cone speaker. In this sense, diffuse sound emanating from glass panels creates an aesthetic unity: the sound source is not easily located, and perceptually it blends into the glass, its transparency and the surrounding landscape. The cone loudspeakers were designed to provide a spatial counterpoint to the wall and ceiling transducers, as well as to create a symbolic visual element - a speaker enclosed in glass replicates the greenhouse idea at miniature scale. In a sense the speaker jars work as a metaphor of the whole installation: the building sounds through its glass, and the speaker cones' sounds are contained by the glass, setting up an inner-outer sonic and visual interplay.

Figure 2: Photos of a transducer on a glass panel and a miniature speaker placed in a glass jar.

The audio transducer distribution in the space resulted from a compromise between the aim to produce a highly immersive sound field and the technical constraints of the building. Immersive diffusion of sounds from the ground, walls and ceiling was a priority in our design. However, the building's top layers of glass were not accessible from the inside with a scissor lift. Also, attaching panels amidst the vegetation was not feasible due to plant protection regulations. Within these constraints, we designed a hybrid system of glass-mounted transducers, plexiglass panel speakers and loudspeaker jars, all characterised by visual transparency and a unified aesthetic design criterion.

3.2 Audience Feedback

The installation was open to the public for two weeks in September-October 2016 and it was well visited: the estimated audience reached is 6000 persons. The Helsinki Winter Garden is a popular recreation place for locals as well as tourists, and many visitors did not come specifically for the installation, but rather for the site itself. The site also has a strong clientele of people who visit regularly, even almost every day (such as elderly people living nearby). Audience reactions were overall positive, but varied strongly with the motives of the visitors. Some Winter Garden regulars felt that their beloved garden should not be altered in any way; others were enthusiastic about the novelty and the added auditory dimension. The authors did not collect feedback in a systematic manner, but a guest book was available at the installation. A theme that arose from the guest book comments was the felt need for more explanatory material concerning the audio techniques involved, which seemed too abstract for some visitors to appreciate. An intellectual counterpart to the piece's sensorial character could have been beneficial, such as written documentation, schemas and links to web media. We also conducted a final interview with the Winter Garden staff in order to record their impressions. The staff's reactions were of particular importance for us, as they acted as project collaborators and harbor a very intimate relationship both with the building and with the plants that they take care of every day. The staff members were very enthusiastic about having a sound art premiere in the greenhouse, and they noted a revived interest in the site among some specific categories of audience such as elderly people, impaired persons and children. For the staff members, the sonic dimension created a novel dimension in a space they know very well, and they appreciated spending two full weeks discovering relationships between architecture, plants and sounds. By the end of the installation, it was agreed that two weeks was a correct duration; for a longer installation the sonic matter should be changed periodically in order not to become too redundant.

4. SOUND SPATIALISATION AND WEATHER DATA FEED

Two separate sound diffusion systems were implemented, one in the larger Palm Room and a second in the Cactus Room, owing to the will to design two very different sonic atmospheres and spatialisation control systems. A common starting point for both systems was the determination to avoid any concept or technique of a sound field centered around a sweet spot. The Winter Garden's central areas are occupied by plants and there is no single central space available for the audience. In addition, a typical visit to the Winter Garden involves walking from one room to the other, between plants, and a perceptual movement between large ensembles and tiny details. The sound spatialisation was designed from the following premises: a moving audience; an immersive soundscape allowing for perceptual zooming into details and out towards a general impression; and a detailed, moving and lively general character.

Sonic Greenhouse stems from a collaboration between two composers of electronic music, each following a specific aesthetic agenda. In this setting, we chose to experiment with a set of different and complementary approaches to sound spatialisation, ranging from the distribution of numerous multichannel sources up to the multichannel redistribution of a stereo source, creating a multichannel perceptual sensation due to the extended
size of the space. Working in collaboration created a dialogue for testing a range of ideas on sound diffusion. While Moreno had a very strong preference for avoiding panning trajectories, Lähdeoja was intrigued by the perceptual effects of large-scale sonic trajectories in such a complex multichannel acoustic space. The following sections provide a detailed account of the different spatialisation techniques involved in the piece.

4.1 The Palm Room (Main Hall)

For the Winter Garden's main hall - or Palm Room - each composer created three compositional modules, combined within a generative Max/MSP mixer diffusing the modules in random order. The spatialisation strategies were:

- Single sound source trajectories generated with distance-based amplitude panning (DBAP) [9], combined with larger masses of more stationary sonic material as spatial counterpoint.

- A polyphonic envelope-based texture in which every sound comes from only one channel. The panning sensation is produced by having similar sonic content on different channels.

- Sonic breathing in and out of the space: a sonic gesture would be initiated from a single diffusion point and then gradually spread to all channels, or the inverse gradual retreat process.

- Random hard-panning distribution: a sound is randomly switched between all channels in a temporal sequence.

- A stereo source spread across multiple channels: channel location and surfaces alter the perception of pitch and timbre.

- One sonic element in the center while the other element travels from channel to channel, exciting different areas of the site one at a time.

4.2 The Cactus Room: Weather Data-Driven Granular Engine

For the Cactus Room, our aesthetic decision was to associate sound particle clouds diffused at hearing-threshold levels with the dry acoustics and desert-like atmosphere of the space. A 19-channel instance of the [munger1~] MSP external [10] was used as the basic grain engine, completed with Moreno's design of a generative wavy process based on pre-synthesized grain streams that adds complementary mid-range frequencies to the overall spectrum.

Both granular engines and their source audio material were controlled by real-time weather data retrieved from the openweather internet service (https://openweathermap.org/api). Wind velocity, temperature, pressure, and humidity data were mapped both to macro-scale events such as sound selection and mixing and to microsound-level operations like grain density and spectrum.

The motivation for connecting the Cactus Room's internal sound life to the external weather conditions stemmed from the poetics of transparency and frontier of a greenhouse. Within the glass frame one can visually observe outside events, but has very limited perceptual access to the elements. By connecting the outside conditions to the sound engine, we envisioned finding a way around the greenhouse's boundary condition. As a result, we obtained a perpetually varying generative sound structure tightly related to the atmosphere inside as well as outside the room.

4.3 Audience Agency: First-Person Spatialisation

While considerable attention was invested in the prior design of the sound spatialisation, once the installation was open to the public it became clear that the foremost spatialisation strategy was not computational, but rather the individual human trajectories within the piece. Writing on his piece We Have Never Been Disembodied at the Mirrored Gardens, Guangzhou, China, Olafur Eliasson echoes precisely our findings on how Sonic Greenhouse came alive through the individual agencies of the visitors: "For We have never been disembodied, I was inspired by the idea of using this humble context, intended for plants and agriculture, as a platform where full responsibility is handed over to the visitor, enabling him or her to become an agent in the space that is Mirrored Gardens. It is a platform of potentials, taking the intimacy of the village to its extreme, allowing for micro-sequences when visitors move through the building, and making explicit the temporal dimension of life." [11]

Eliasson's work has been analysed by Adam Basanta and extended to a conceptualisation of form in audiovisual installations, where "the three Euclidian dimensions (length, width and height) are not only modulated by the fourth, topological dimension (time), but also by the fifth dimension of perceiver subjectivity. [...]" [12] Following Eliasson, we can contemplate form as "the particular temporal experience of the first-person subject as they navigate in, through and out of the work's frame. That is, form as the particular first-person narrativisation of experience in a given installation." [13]

Basanta's analysis corresponds to the experiential outcome of Sonic Greenhouse. In the design phase, we did not predict the importance that audience agency would take in the final work. It turned out that not only the spatialisation, but also the form and narrative of the piece were largely defined by the first-person trajectory within the space and between the rooms. In large spaces with numerous non-symmetrically placed sound sources, the audience becomes the primary spatialisation agent. This finding has important consequences for our future designs of large-scale audio works, as it challenges estab-
lished ways of designing sonic spatialisation as well as form in audio art.

5. COMPOSITIONAL STRATEGIES

5.1 Considerations on Time and Spatiality - Composing Sonic Weather

"I was to move from structure to process, from music as an object having parts, to music without beginning, middle, or end, music as weather" - John Cage [14]

Music is often considered a time-based art. Max Neuhaus proposed that musical creation is not about placing sounds in time but rather in space, the visitor being the agent who enacts the time dimension [8]. Audio artists who are known for an alternative use of time in their sound works (La Monte Young, Phil Niblock, among others) usually rely heavily on either long sustained sounds or loops. In this work we propose the creation of a more complex sonic environment that surpasses the perception of sonic discourse as a narrative in time, getting closer to a sonic weather echoing the words of John Cage. When listening to this sonic weather, time becomes a secondary dimension.

When articulating a complex public space sonically, one possible strategy is to create different areas of sonic colour in which the sonic material is grouped and spread, creating a unique experience for every visitor and seeking to foster a bodily dialogue between the visitor and the space. This approach requires a non-linear narrative. In Sonic Greenhouse, an algorithm controlled the continuous crossfading between predefined sonic situations, achieving a greater level of sonic complexity while keeping a natural blend of elements. This continuously evolving sonic complexity was not overwhelming to the visitor's perception, but rather contributed to the feeling of being immersed in a sonic weather, an intense experience of the space surpassing the perception of temporal discourse.

For Sonic Greenhouse we built an array of active and passive sonic objects in order to work as aural architects [15], creating a sonic weather and focusing more on placing sounds in space than in time. In that sense, our aim as composers was to approach aural architectural composition in an atmospheric manner, favoring peripheral perception and light gestalt [16]. The piece illuminated the architecture sonically with a combination of active and passive sonic objects. The active sonic objects were the sound-emitting glasses of the windows, the plexiglass panels, and the speaker cones inside the glass pots; the passive objects were the glass pots, which filtered the sounds coming from the speaker cones, the plexiglass panels, which created a more complex network of acoustic reflections and spatio-morphologies [17], and the reverberation of the room.

In order to blend our sonic layer with the preexisting sonic structures and regular daily activities of the space, we took a combined approach using a combination of recorded instruments (metallophones, organ, piano, and banjo), field recordings, and synthetic sounds, carefully tuned for the space and tested through regular visits. By tuning the sounds, by imitating the behaviour of preexisting sonic elements - like the wind in the surroundings or the water inside, which were used as "ghost electronics" [18] - and by echoing the local weather conditions, we created an organic soundscape producing a sensation of continuity and avoiding any kind of sonic shock when entering or leaving the space.

5.2 A Note on Sonic Acupuncture

A possible definition of acupuncture is: a local action, by means of a pressure point on a key spot, that has the power to change the situation globally, beyond the local area in which the pressure point is applied. Sonic acupuncture thus relies on applying sonic pressure points at key spots, affecting the global sonic situation. These sonic pressure points can consist of active acoustic objects, passive acoustic objects, or a combination of both. Urban sonic acupuncture parallels the practice of urban and public space acupuncture [19] in the aural architecture field. Aural architecture deals with spatial and cultural acoustics; it also assigns four basic functions to sound in space: social, navigational, aesthetic and musical spatiality [15]. Artistic sonic interventions are placed along this axis by starting a negotiation between artistic intentions and the local knowledge and practices. Sonic Greenhouse proposed an early attempt at sonic acupuncture by means of glass speaker-pots placed in the central area of the Palm Room. In some sections, these devices emitted slowly sweeping sine waves generated algorithmically (cf. Alvin Lucier's "Music for Piano with Slow Sweep Pure Wave Oscillators", 1992; PIANOSCULPTURE for Mario Prisuelos, 2016, is a piano and tape piece by Moreno based on the same principle), altering the perception of the sonic weather at that given moment and also drawing attention to the lower layers of the space and to specific plants in the area.

6. CONCLUSIONS

In this article we have discussed a recent large-scale audio-architectural installation from three points of view, namely system design, sound spatialisation and external data feed, and compositional strategies. The work was conceived as a research process for a multichannel structure-borne sound system mounted directly on a building's structure, combined with specific aesthetic aims related to the Helsinki Winter Garden greenhouse. The results confirm a proof of concept: structure-borne sound drivers are an effective and meaningful way to sonify a large architectural space, presenting perceptual advantages in comparison to traditional cone speakers, such as their radiation patterns and non-invasive appearance. Moreover, the results point towards a reformulation of concepts in sound spatialisation and form composition for
large-scale works where the audience becomes an active agent in the space. Temporal and spatial elements unfold according to an individual exploration, which counters narrative designs conceived for stationary audiences. Alternative compositional strategies were implemented, such as composing a sonic weather evolving in space rather than in narrative time, as well as sonic acupuncture, meaning small-scale punctual sonic interventions at precise locations.

Figure 3: General view of the Palm Room, where 40 channels of sound diffusion were implemented.

Acknowledgments

Work on the Sonic Greenhouse project was supported by the University of the Arts Helsinki Sibelius Academy Music Technology Department, the Kone Foundation, the Arts Promotion Centre Finland and the City of Helsinki. The prior research work in structure-borne sound and in composition was supported by the Academy of Finland and the Niilo Helander Foundation. The authors would also like to thank the Helsinki Winter Garden staff for their support.

7. REFERENCES

[1] D. Byrne, Playing the Building, http://journal.davidbyrne.com/explore/playing-the-building, 2005.

[2] N. Maigret and N. Montgermont, The Resonant Architecture Project, http://resonantarchitecture.com, 2009-2013.

[3] B. Fontana, Harmonic Bridge, http://resoundings.org/Pages/Architectural_Sound_Sculptures.html

[4] R. Alexis, Mille et une Orchidées, http://www.mnhn.fr/fr/visitez/agenda/exposition/mille-orchidees-2017

[5] O. Lähdeoja, "Structure-Borne Sound and Aurally Active Spaces," in Proceedings of New Interfaces for Musical Expression, NIME 2014, pp. 319-322.

[6] J. Cardiff, The Forty Part Motet, 2001, http://www.cardiffmiller.com/artworks/inst/motet.html

[7] B. Leitner, P.U.L.S.E. Ostfildern: Hatje Cantz Verlag, 2008.

[8] M. Neuhaus, Max Neuhaus: Inscription, Sound Works Vol. 1, p. 34. Cantz Verlag, 1994.

[9] T. Lossius, P. Baltazar and T. de la Hogue, "DBAP - distance-based amplitude panning," in Proceedings of the International Computer Music Conference, ICMC 2009.

[10] I. I. Bukvic et al., "Munger1~: towards a cross-platform Swiss-army knife of real-time granular synthesis," in Proceedings of the International Computer Music Conference, ICMC 2007.

[11] O. Eliasson, We Have Never Been Disembodied, http://www.vitamincreativespace.com/en/?work=5607

[12] O. Eliasson, "Vibrations," in O. Eliasson, I. Calvino, I. Blom, D. Birnbaum and M. Wigley (authors), Your Engagement has Consequences on the Relativity of Your Reality. Baden, Switzerland: Lars Müller Publishers, 2006.

[13] A. Basanta, "Extending Musical Form Outwards in Space and Time: Compositional strategies in sound art and audiovisual installations," in Organised Sound, 2015, pp. 171-181.

[14] J. Cage, "An Autobiographical Statement," in Southwest Review, 1991.

[15] B. Blesser and L. Salter, Spaces Speak, Are You Listening?: Experiencing Aural Architecture. The MIT Press, 2007.

[16] J. Pallasmaa, "Atmosphere as place," in The Intelligence of Place - Topographies and Poetics, edited by Jeff Malpas. Bloomsbury Publishing, 2014.

[17] D. Smalley, "Spectro-morphology and structuring processes," in The Language of Electroacoustic Music, 1986.

[18] H. Whipple, "Beasts and Butterflies: Morton Subotnick's Ghost Scores," in The Musical Quarterly, 69(3), 1983, pp. 425-441.

[19] H. Casanova and J. Hernandez, Public Space Acupuncture: Strategies and Interventions for Activating City Life. Actar, 2015.
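As a technical aside to Sec. 4.1, the DBAP gain computation cited there [9] can be sketched as follows. This is a minimal illustration following the commonly described formulation (inverse-distance rolloff with a spatial blur term, gains normalized to constant power); the speaker layout and parameter values below are invented for the example and are not those of the Palm Room system.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Distance-based amplitude panning (DBAP): each loudspeaker i receives
// gain g_i proportional to 1 / d_i^a, where d_i is the distance from the
// virtual source to speaker i (with a small spatial blur r to avoid a
// singularity when the source sits on a speaker), a = R / (20 * log10(2))
// for a rolloff of R dB per doubling of distance, and the gains are
// normalized so that sum(g_i^2) = 1 (constant total power).
std::vector<double> dbapGains(const Point& src,
                              const std::vector<Point>& speakers,
                              double rolloffDb = 6.0, double blur = 0.1) {
    const double a = rolloffDb / (20.0 * std::log10(2.0));
    std::vector<double> g(speakers.size());
    double sumSq = 0.0;
    for (std::size_t i = 0; i < speakers.size(); ++i) {
        const double dx = src.x - speakers[i].x;
        const double dy = src.y - speakers[i].y;
        const double d = std::sqrt(dx * dx + dy * dy + blur * blur);
        g[i] = 1.0 / std::pow(d, a);
        sumSq += g[i] * g[i];
    }
    const double k = 1.0 / std::sqrt(sumSq);
    for (double& gi : g) gi *= k;  // normalize to constant power
    return g;
}
```

For two speakers at (0,0) and (2,0), a source midway between them receives equal gains, while a source nearer one speaker receives a proportionally larger gain there; the squared gains always sum to one.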
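The slowly sweeping sine waves described in Sec. 5.2 can be produced with a phase accumulator whose frequency glides slowly over time. The following is a generic sketch of that technique, not the installation's actual generator; the frequencies and durations are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Generates a sine tone whose frequency glides linearly from f0 to f1
// over the whole buffer. A running phase accumulator is advanced by the
// instantaneous frequency each sample, which keeps the waveform free of
// discontinuities while the frequency moves.
std::vector<double> slowSweep(double f0, double f1,
                              double sampleRate, std::size_t numSamples) {
    std::vector<double> out(numSamples);
    const double twoPi = 8.0 * std::atan(1.0);
    double phase = 0.0;
    for (std::size_t n = 0; n < numSamples; ++n) {
        const double t = static_cast<double>(n) / numSamples;
        const double freq = f0 + (f1 - f0) * t;  // linear frequency glide
        out[n] = std::sin(phase);
        phase += twoPi * freq / sampleRate;
        if (phase >= twoPi) phase -= twoPi;  // keep phase bounded
    }
    return out;
}
```

A very slow sweep, as in the installation, would simply use close frequencies and a long buffer; the structure of the generator is unchanged.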
Victor Lazzarini
Maynooth University,
Ireland
victor.lazzarini@nuim.ie
Csound. Each one of them is minimal, consisting just of a call to the instantiated class processing function. For instance, one of these, used at instrument initialisation time, is defined simply as

template <typename T>
int init(CSOUND *csound, T *p) {
  p->csound = (Csound *)csound;
  return p->init();
}

In this case T is the derived type passed on registration (which is defined at compile time). When running, Csound calls this function, which in its turn delegates directly to the plugin class in question. Note that this is all hidden from the framework user, who only needs to derive her classes and register them. As we will see in the following sections, this scheme enables significant economy, re-use and reduction in code verbosity (one of the issues with CRTP).

2. THE FRAMEWORK

CPOF is based on two class templates, which plugin classes derive from and instantiate. Opcode classes can then be registered by instantiating and calling one of the two overloaded registration function templates. All CPOF code is declared under the namespace csnd. In this section, we discuss the base class templates, three simple cases of derived plugins, and their registration.

2.1 The Base Classes

The framework base classes are actually templates which need to be derived and instantiated by the user code. The most general of these is Plugin 1 . To use it, we program our own class by subclassing it and passing the number of outputs and inputs our opcode needs as its template arguments:

#include <plugin.h>
struct MyPlug : csnd::Plugin<1,1> { };

The above lines will create a plugin opcode with one output (first template argument) and one input (second template argument). This class defines a complete opcode, but since it is only using the base class stubs, it is also fully non-op. To make it do something, we will need to reimplement one, two or three of its methods. As outlined above, this base class is derived from the Csound structure OPDS and has the following members:

outargs: a Params object holding output arguments.

inargs: input arguments (Params).

csound: a pointer to the Csound engine object.

offset: the starting position of an audio vector (for audio opcodes only).

nsmps: the size of an audio vector (also for audio opcodes only).

init(), kperf() and aperf(): non-op methods, to be reimplemented as needed (see sec. 2.2).

The other base class in CPOF is FPlugin, derived from Plugin, which adds an extra facility for fsig (streaming frequency-domain) plugins:

framecount: a member to hold a running count of fsig frames.

Input and output arguments, which in C are just raw pointers, are conveniently wrapped by the Params class, which provides a number of methods to access them according to their expected Csound variable types.

2.2 Deriving opcode classes

Csound has two basic action times for opcodes: init and perf-time. The former runs a processing routine once per instrument instantiation (and/or once again if a re-init is called for). Code for this is placed in the Plugin class init() function. Perf-time code runs in a loop and is called once every control (k-)cycle. The other class methods kperf() and aperf() are called in this loop, for control (scalar) and audio (vectorial) processing. The following examples demonstrate the derivation of plugin classes for each one of these opcode types (i, k or a). Note that k and a opcodes can also use i-time functions if they require some sort of initialisation.

2.2.1 Init-time opcodes

For init-time opcodes, all we need to do is provide an implementation of the init() method:

struct Simplei : csnd::Plugin<1,1> {
  int init() {
    outargs[0] = inargs[0];
    return OK;
  }
};

In this simple example, we just copy the input arguments to the output once, at init-time. Each scalar input type can be accessed using array indexing. All numeric argument data is real, declared as MYFLT, the internal floating-point type used by Csound.

2.2.2 K-rate opcodes

For opcodes running only at k-rate (no init-time operation), all we need to do is provide an implementation of the kperf() method:

struct Simplek : csnd::Plugin<1,1> {
  int kperf() {
    outargs[0] = inargs[0];
    return OK;
  }
};

Similarly, in this simple example, we just copy the input arguments to the output at each k-period.

1 All CPOF code is declared in plugin.h
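The static-dispatch scheme described in section 2 can be sketched outside of Csound. This is an illustration only, with mock stand-ins of our own (CSOUND, Csound, Plugin, init_callback, MyPlug are all assumed names, not the actual CPOF source): a function template is instantiated on the derived type at registration, so the call to init() resolves at compile time to the user's reimplementation, with no virtual functions involved.

```cpp
#include <cassert>

// Mock stand-ins for the Csound engine struct and the framework
// wrapper, just to show the delegation scheme.
struct CSOUND { };        // engine struct (mock)
struct Csound { };        // framework engine wrapper (mock)

struct Plugin {           // base class with non-op stubs, as in CPOF
    Csound *csound = nullptr;
    int init()  { return 0; }
    int kperf() { return 0; }
    int aperf() { return 0; }
};

// The registered callback: T is the derived type, fixed at compile
// time when the opcode is registered. The call p->init() dispatches
// statically to T's own init(), hiding the base stub.
template <typename T>
int init_callback(CSOUND *engine, T *p) {
    p->csound = reinterpret_cast<Csound *>(engine);
    return p->init();
}

// A user plugin: reimplementing init() shadows the base stub.
struct MyPlug : Plugin {
    bool ran = false;
    int init() { ran = true; return 0; }
};
```

Because dispatch is static, a plugin that leaves a method unimplemented simply falls back to the base-class non-op, which is why the empty MyPlug in the text is a complete (if inert) opcode.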
template <typename T>
int plugin(Csound *csound,
           const char *name,
           const char *oargs,
           const char *iargs,
           uint32_t thrd,
           uint32_t flags = 0)

iargs: a string containing the opcode input types, one identifier per argument.

thrd: a code to tell Csound when the opcode should be active.

flags: multithread flags (generally 0 unless the opcode accesses global resources).

For opcode type identifiers, the most common types are: a (audio), k (control), i (i-time), S (string) and f (fsig). For the thread argument, we have the following options, which depend on the processing methods implemented in our plugin class:

thread::i: indicates init().

thread::k: indicates kperf().

thread::ik: indicates init() and kperf().

thread::a: indicates aperf().

will register the simple polymorphic opcode, which can be used with i-, k- and a-rate variables. In each instantiation of the plugin registration template, the class name is passed as an argument to it, followed by the function call. If the class defines two specific static members, otypes and itypes, to hold the types for output and input arguments, the second overloaded form of the registration function can be used, declared as

template <typename T>
int plugin(Csound *csound,
           const char *name,
           uint32_t thread,
           uint32_t flags = 0)

For some classes, this might be a very convenient way to define the argument types. For other cases, where opcode polymorphism might be involved, we might re-use the same class for different argument types, in which case it is not desirable to define these statically in a class.

3. SUPPORT CLASSES

Plugins developed with CPOF can avail of a number of helper classes for accessing resources in the Csound engine. These include a class encapsulating the engine itself,

2 The header file modload.h, where on_load() is declared, contains three boilerplate calls to Csound module C functions, required for Csound to load plugins properly. For this reason, each plugin library should also include this header only once, otherwise duplicate symbols will cause linking errors.
and classes for resource allocation, input/output access and manipulation, threads, and support for constructing objects allocated in Csound's heap.

3.1 The Csound Engine Object

Plugins are run by an engine which is encapsulated by the Csound class. They all hold a pointer to this, called csound, which is needed for some of the operations involving parameters, or for some utility methods (such as console messaging, MIDI data access, FFT operations). The following are the public methods of the Csound class:

init_error(): takes a string message and signals an initialisation error.

perf_error(): takes a string message, an instrument instance, and signals a performance error.

_A4(): returns the A4 pitch reference.

nchnls(): returns the number of output channels for the engine.

nchnls_i(): same, for input channel numbers.

current_time_samples(): current engine time in samples.

current_time_seconds(): current engine time in seconds.

midi_channel(): midi channel assigned to this instrument.

midi_note_num(): midi note number (if the instrument was instantiated with a MIDI NOTE ON).

midi_note_vel(): same, for velocity.

This is only needed if the Plugin has allocated extra resources using mechanisms that require de-allocation. It is not employed in most cases, as we will see below. To use it, the plugin needs to implement a deinit() method and then call the plugin_deinit() method, passing itself (through a this pointer) in its own init() function:

csound->plugin_deinit(this);

3.2 Audio Signals

Audio signal variables are vectors of nsmps samples and we can access them through raw MYFLT pointers from input and output parameters. To facilitate a modern C++ approach, the AudioSig class wraps audio signal vectors conveniently, providing iterators and subscript access. Objects are constructed by passing the current plugin pointer (this) and the raw parameter pointer.

For efficiency and to prevent leaks and undefined behaviour, we need to leave all memory allocation to Csound and refrain from using C++ dynamic memory allocators or standard library containers that use dynamic allocation behind the scenes (e.g. std::vector). This requires us to use the AuxAlloc mechanism implemented by Csound for opcodes. To allow for ease of use, CPOF provides a wrapper template class (which is not too dissimilar to std::vector) for us to allocate and use as much memory as we need. This functionality is given by the AuxMem class:

allocate(): allocates memory (if required).

operator[]: array-subscript access to the allocated memory.

data(): returns a pointer to the data.

len(): returns the length of the vector.
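The host-owned allocation idea behind AuxMem can be sketched in a few lines. This is our own rough illustration, not the actual CPOF source (Arena, AuxVec and grab() are assumed names): the wrapper never touches the C++ heap itself, it asks the host for memory and then exposes a vector-like interface over it.

```cpp
#include <cassert>
#include <cstddef>

// 'Arena' stands in for Csound's AuxAlloc mechanism: a pool owned
// and managed by the host, not by the plugin.
struct Arena {
    static const std::size_t N = 1024;
    double pool[N];
    std::size_t used = 0;
    double *grab(std::size_t n) {          // host-owned allocation
        if (used + n > N) return 0;
        double *p = pool + used;
        used += n;
        return p;
    }
};

class AuxVec {                             // AuxMem-like interface
    double *ptr;
    std::size_t n;
public:
    AuxVec() : ptr(0), n(0) { }
    bool allocate(Arena &host, std::size_t count) {
        ptr = host.grab(count);            // delegate to the host
        n = ptr ? count : 0;
        return ptr != 0;
    }
    double &operator[](std::size_t i) { return ptr[i]; }
    double *data() { return ptr; }
    std::size_t len() const { return n; }
};
```

The point of the design is lifetime: because the host owns the storage, the plugin cannot leak it, and no destructor ordering issues arise when instruments are torn down.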
3.4 Function Table Access

Access to function tables has also been facilitated by a thin wrapper class that allows us to treat it as a vector object. This is provided by the Table class:

init(): initialises a table object from an opcode argument pointer.

operator[]: array-subscript access to the function table.

data(): returns a pointer to the function table data.

len(): returns the length of the table (excluding guard point).

begin(), cbegin() and end(), cend(): return iterators to the beginning and end of the function table.

iterator and const_iterator: iterator types for this class.

An example of table access is given by an oscillator opcode, which is implemented in the following class:

struct Oscillator : csnd::Plugin<1,3> {
  csnd::Table tab;
  double scl;
  double x;

csnd::plugin<Oscillator>(csound,
  "oscillator", "a", "kki",
  csnd::thread::ia);

3.5 String Types

String variables in Csound are held in a STRINGDAT data structure, containing a data member that holds the actual string and a size member with the allocated memory size. While CPOF does not wrap strings, it provides a translated access to string arguments through the argument object's str_data() function. This takes an argument index (similarly to data()) and returns a reference to the string variable, as demonstrated in this example:

struct Tprint : csnd::Plugin<0,1> {
  int init() {
    char *s = inargs.str_data(0).data;
    csound->message(s);
    return OK;
  }
};

This opcode will print the string to the console. Note that we have no output arguments, so we set the first template parameter to 0. We register it using

csnd::plugin<Tprint>(csound, "tprint",
  "", "S", csnd::thread::i);
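The table-lookup technique such an oscillator opcode relies on can be shown standalone. This is a hedged sketch in plain C++, with the CPOF Table and AudioSig classes replaced by std::vector and a raw buffer; all names (make_sine_table, oscil) are ours, not the paper's: the phase is advanced by freq * tablelen / samplerate each sample, and a truncating index reads the table.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Build a single-cycle sine table (stands in for a Csound f-table).
std::vector<double> make_sine_table(std::size_t len) {
    const double pi = std::acos(-1.0);
    std::vector<double> t(len);
    for (std::size_t i = 0; i < len; ++i)
        t[i] = std::sin(2.0 * pi * i / len);
    return t;
}

// One processing block: truncating table lookup with a phase kept
// between calls, incremented by freq * tablelen / samplerate.
void oscil(double *out, std::size_t nsmps, double &phase,
           const std::vector<double> &table,
           double amp, double freq, double sr) {
    const double incr = freq * table.size() / sr;
    for (std::size_t n = 0; n < nsmps; ++n) {
        out[n] = amp * table[static_cast<std::size_t>(phase)];
        phase += incr;
        while (phase >= table.size()) phase -= table.size();
    }
}
```

Keeping the phase as state across calls is exactly why the real opcode stores members such as scl and x in the class rather than in locals.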
3.6 Streaming Spectral Types

For streaming spectral processing opcodes, we have a different base class with extra facilities needed for their operation (FPlugin). In Csound, fsig variables, which carry spectral data streams, are held in a PVSDAT data structure. To facilitate their manipulation, CPOF provides the Fsig class, derived from PVSDAT. To access phase vocoder bins, a container interface is provided by pv_frame (spv_frame for the sliding mode) 3 . This holds a series of pv_bin (spv_bin for sliding) 4 objects, which have the following methods:

amp(): returns the bin amplitude.

freq(): returns the bin frequency.

amp(float a): sets the bin amplitude to a.

freq(float f): sets the bin frequency to f.

operator*(pv_bin f): multiplies the amp of a pvs bin by f.amp.

operator*(MYFLT f): multiplies the bin amp by f.

operator*=(): unary versions of the above.

The pv_bin class can also be translated into a std::complex<float> object if needed. This class is also fully compatible with the C complex type, and an object obj can be cast into a float array consisting of two items (or a float pointer), using reinterpret_cast<float(&)[2]>(obj) or reinterpret_cast<float*>(&obj). The Fsig class has the following methods:

init(): initialisation from individual parameters or from an existing fsig. Also allocates frame memory as needed.

dft_size(), hop_size(), win_size(), win_type() and nbins(): return the PV data parameters.

count(): gets and sets the fsig framecount.

isSliding(): checks for sliding mode.

fsig_format(): returns the fsig data format (fsig_format::pvs, ::polar, ::complex, or ::tracks).

The pv_frame (or spv_frame) class contains the following methods:

operator[]: array-subscript access to the spectral frame.

data(): returns a pointer to the spectral frame data.

len(): returns the length of the frame.

begin(), cbegin() and end(), cend(): return iterators to the beginning and end of the data frame.

iterator and const_iterator: iterator types for this class.

Fsig opcodes run at k-rate but will internally use an update rate based on the analysis hopsize. For this to work, a frame count is kept and checked to make sure we only process the input when new data is available. The following example class implements a simple gain scaler for fsigs:

struct PVGain : csnd::FPlugin<1, 2> {
  static constexpr char const *otypes = "f";
  static constexpr char const *itypes = "fk";

  int init() {
    if (inargs.fsig_data(0).isSliding()) {
      char *s = "sliding not supported";
      return csound->init_error(s);
    }
    if (inargs.fsig_data(0).fsig_format()
          != csnd::fsig_format::pvs &&
        inargs.fsig_data(0).fsig_format()
          != csnd::fsig_format::polar) {
      char *s = "format not supported";
      return csound->init_error(s);
    }
    csnd::Fsig &fout = outargs.fsig_data(0);
    fout.init(csound, inargs.fsig_data(0));
    framecount = 0;
    return OK;
  }

  int kperf() {
    csnd::pv_frame &fin = inargs.fsig_data(0);
    csnd::pv_frame &fout = outargs.fsig_data(0);
    if (framecount < fin.count()) {
      std::transform(fin.begin(), fin.end(), fout.begin(),
                     [this](csnd::pv_bin f){
                       return f *= inargs[1]; });
      framecount = fout.count(fin.count());
    }
    return OK;
  }
};

Note that, as with strings, there is a dedicated method in the arguments object that returns a ref to an Fsig class (which can also be assigned to a pv_frame ref).

3 pv_frame is a convenience typedef for Pvframe<pv_bin>, whereas spv_frame is Pvframe<spv_bin>.
4 pv_bin is Pvbin<float> and spv_bin is Pvbin<MYFLT>.
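Both ideas above can be demonstrated in plain C++. In this sketch (our own names and layout assumption: a bin is modelled as std::complex<float> with amplitude in the first float and frequency in the second, as in Csound's pvs frame convention), the standard's guaranteed array layout of std::complex makes the reinterpret_cast legal, and a whole-frame gain scaling is done with std::transform, mirroring the PVGain kperf() method.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <vector>

using Bin = std::complex<float>;            // stands in for pv_bin

float bin_amp(Bin &b) {
    // std::complex<float> is guaranteed to be laid out as two
    // consecutive floats, so this view, as in the text, is valid.
    float (&f)[2] = reinterpret_cast<float(&)[2]>(b);
    return f[0];                            // amplitude first (assumed)
}

// Scale the amplitude of every bin in a frame by g, leaving
// frequencies untouched: the gist of the PVGain example.
void scale_frame(std::vector<Bin> &frame, float g) {
    std::transform(frame.begin(), frame.end(), frame.begin(),
                   [g](Bin b) {
                       return Bin(b.real() * g, b.imag());
                   });
}
```

The frame-count guard in PVGain is omitted here; in the real opcode it ensures this transform only runs when a new analysis frame has arrived.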
Johnty Wang
Input Devices and Music Interaction Lab/CIRMMT
McGill University
Johnty.Wang@mcgill.ca

Joseph Malloch
Graphics and Experiential Media Lab
Dalhousie University
Joseph.Malloch@dal.ca

Stephane Huot
Mjolnir Group
Inria Lille
Stephane.Huot@inria.fr

Marcelo Wanderley
Input Devices and Music Interaction Lab/CIRMMT
McGill University
Marcelo.Wanderley@mcgill.ca

Fanny Chevalier
Mjolnir Group
Inria Lille
Fanny.Chevalier@inria.fr
This paper describes a versioning and annotation system for supporting collaborative, iterative design of mapping layers for digital musical instruments (DMIs). First we describe prior experiences and contexts of working on DMIs that have motivated such features in a tool, describe the current prototype implementation, and then discuss future work and features that are intended to improve the capabilities of tools for new musical instrument building as well as general interactive applications that involve the design of mappings with a visual interface.

1. INTRODUCTION

A crucial component of the digital musical instrument (DMI) [1] building process involves the mapping of sensor or gesture input signals from the musician to relevant synthesis parameters, and is described in detail by [2]. While there are many different approaches to the design and implementation of DMI mapping, for many designers a common part of the process involves the use of graphical user interfaces where the connection of signals can be observed and manipulated visually. In this paper we use personal experiences of building DMIs in a variety of contexts to motivate two main features for a mapping tool: a deeply integrated graphical versioning and comparison tool, and rich annotation features that go beyond the text-based tagging commonly employed by existing versioning systems.

While our focus here is to extend the capabilities of tools for creative design of interactive systems [3], we acknowledge the fact that a good tool is not the end-all solution in itself, but something that can help in the process. In this paper we first present background and related work, followed by the motivations for and details on the implementation of our system, and finally a discussion on the limitations of the current work and future steps.

Over the years we have been involved in the conception, design, use and evaluation of a large number of DMIs, including some that have been actively performed for 10 years and have toured internationally. Throughout the development of these instruments, iteration has proven to be the most important aspect. Many designers take a holistic approach to their DMI concept: designing not just the sensing system or sound synthesizer but also gesture vocabularies, notation systems, and lighting or video systems that complement their idea of performance with the instrument. This approach is exciting, but also brings with it difficult challenges, since the various aspects of the instrument are interdependent - the physical instrument, sensing system, gesture, and sound cannot be changed individually without affecting the others. While the overall concept of the instrument can inform the process, the actual implementation often requires iteration. Usually, the most important advances are due to a complex system surprising the designer, inspiring them to modify their vision of the final instrument.

It is important that our tools for supporting mapping design support iterative workflows, and also make it easy to do exploratory development which might result in serendipitous discovery of new directions for the instrument.

2.1 Collaboration

If iteration is important in solo development, in collaborative design contexts it is essential, especially in groups where the roles of hardware designer, composer, and performer are played by different collaborators. In this case it is no longer feasible for a single united vision of the instrument (and its performance practice and repertoire) to exist. The collaborators inevitably use different tools, have different approaches, and in interdisciplinary projects even different vocabularies for discussing interaction. During the course of several long-term projects [4], we started building approaches and tools to try to support interdisciplinary collaborations based on a few principles:

Copyright: © 2017 Johnty Wang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Don't enforce representation standards: Encouraging each collaborator to represent their system component in the strongest way possible helps maintain the benefits of their specific domain-knowledge to the project. It also helps with modularity!

Make things freely connectable: If we encourage strong but idiomatic representations of system components, flexibility must come from the ability to freely interconnect them. Our approach is for models to live only in devices rather than influencing connection or communication protocols; representations on the wire should be low-level, allowing basic compatibility between drastically different devices.

Don't dumb it down, but...: Musicians possess highly technical knowledge, and are usually capable of learning surprising amounts of electrical engineering, digital signal processing, and software design in short periods of time. This means that instrument designers do not need to hide low-level technical properties of the system from the performer; however, the nature of the system representation should be questioned: is there a technical representation from the performer's perspective that might serve better?

The software library libmapper [4] and the surrounding tools are the result of this design philosophy and have supported a large number of interdisciplinary research/creation projects. These tools form a discoverable, distributed network of mappable devices; they encourage (but do not enforce) modular, strong representations of system components; they glue together different programming languages and environments; and they aid the process of rapid experimentation and iteration by making every signal compatible and allowing drag-and-drop style mapping design.

3. RELATED WORK

There are a fairly large number of existing tools for supporting the task of mapping DMIs or similar interactive systems. Some, like the Mapping Library for Pure Data [5] and the Digital Orchestra Toolbox 1 , try to provide building blocks for assembling mapping structures; others, like MnM [6] and LoM [7], support specific multidimensional mapping schemes, usually learned from examples. A variety of platforms and environments are also used for designing and implementing DMI mapping, including ICon [8], junXion [9], and Jamoma [10]. Finally, some commercial DMIs have dedicated instrument-specific mapping systems, such as the Continuum Editor [11], 2PIM 2 , Karlax Bridge 3 , and EigenD 4 .

3.1 Version Control Systems

The core features of this work are related to two particular aspects of version control: first, the comparison between two different versions of a document [12] and second, the management of previous versions. While the workflows of most modern source control systems are centred around these two issues, they mostly focus on text-based data. When the data has a graphical view, or is based on a translation of text data, a text-based diff tool is no longer sufficient. Dragicevic et al. [13] describe the process of implementing a comparison system via computing the visual output and then rendering the transition between versions using animations, which preserves the mental model of the user better when a translation process is involved. The environment implemented by Rekall [14], designed to serve both as a score as well as documentation for a theatrical work, contains features that are relevant to our application since it contains a visual timeline as well as a rich annotation interface, but is intended for a very different purpose of documenting the timeline progression of a piece of work, as opposed to being able to preview/revert to prior states.

4. SUPPORTING EXPLORATION, BUILDING COMPLEXITY

4.1 Problems Identified

Through the process of working with DMIs in a variety of contexts as described earlier, we have identified the following issues when it comes to the design of mappings:

Fear of losing current state: By not keeping a constant stack of actions and having to manually manage incremental versions, the fear of moving away from a desirable configuration when experimenting inhibits exploration.

Difficulty indexing, saving, finding, restoring and talking about previous mapping designs limits the number of options that can be reasonably compared.

Support for complexity: Without effective comparison between mapping configurations, it is difficult to create complex mappings by combining mapping vignettes generated in workshop settings.

4.2 Proposed Solutions

Based on the issues identified above, we have proposed the following solutions:

Distributed Components: Using the software library libmapper [4] enables simple composition of new or existing system components, and its distributed nature supports collaboration (for example, an arbitrary number of user interfaces can be actively viewing and editing the system at the same time, with different visual representations of the system).

Version tracking: Ability to store previous states so as to encourage experimentation while ensuring that no valuable work is lost.

1 http://idmil.org/software/dot
2 http://www.pucemuse.com/
3 http://www.dafact.com/
4 http://www.eigenlabs.com/
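The version-tracking idea can be reduced to a toy data structure. This sketch is ours, not the authors' implementation (MappingConfig, VersionStore and the OSC-style address strings are all assumed names): every saved mapping configuration goes onto an append-only list, so reverting to a prior state can never destroy the current one, directly addressing the "fear of losing current state" problem above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical mapping state: source signal -> destination parameter.
struct MappingConfig {
    std::map<std::string, std::string> connections;
};

class VersionStore {
    std::vector<MappingConfig> versions;   // append-only: never deleted
public:
    std::size_t save(const MappingConfig &c) {
        versions.push_back(c);
        return versions.size() - 1;        // id for later restore
    }
    MappingConfig restore(std::size_t id) const {
        return versions.at(id);            // revert without losing anything
    }
    std::size_t count() const { return versions.size(); }
};
```

A real tool would add the comparison and annotation layers discussed in the paper on top of such a store; the point here is only that keeping every state makes exploratory edits reversible by construction.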
Figure 1. Screenshot showing the working version (left) and a comparison with a previous version (right) when it is selected from the interface. Note: the actual rendering of the preview overlay has been offset in the application to show up better on a printed page in grayscale.
ABSTRACT

The process of creating historical-critical digital music editions involves the annotation of musical measures in the source materials (e.g. autographs, manuscripts or prints). This serves to chart the sources and create concordances between them. So far, this laborious task is barely supported by software tools. We address this shortcoming with two interface approaches that follow different functional and interaction concepts. Measure Editor is a web application that complies with the WIMP paradigm and puts the focus on detailed, polygonal editing of image zones. Vertaktoid is a multi-touch and pen interface where the focus was on quick and easy measure annotation. Both tools were evaluated with music editors, giving us valuable clues to identify the best aspects of both approaches and motivate future development.

1. INTRODUCTION

Musicologists create historical-critical editions of musical works based on autographs by the composer and other source materials, e.g., letters or materials from various copyists and publishers. In order to create a musical score that is both accurate and usable by musicians, the editor has to spot and analyze differences between the different sources, decide which variant to use in the final score, and document the editorial decisions. These decisions are usually described in the critical apparatus, which may be contained in the same volume as the musical score or be published separately. A musician can consult the critical apparatus in order to understand editorial decisions, to examine differences between different editions of the same work, or to spot interesting variants that he or she might choose to play in opposition to the editor's choices.

Digital and hybrid editions, which are published both in digital and printed form, make it easier for a musician to delve into the original materials and understand their inherent ambiguities. On the one hand, digital editions are not limited by printing costs and can provide scans of the original autographs; on the other hand, suitable human-computer interaction makes it easier to navigate in the complex network of cross-references between various source materials and the editor's annotations.

The hybrid edition of Max Reger's works may serve as an example. Two screenshots are depicted in Figure 1. On the left side, the typeset final score is shown together with small icons that represent annotations by the editor. When the user opens an annotation, a window shows the textual annotation as well as the source materials that the annotation refers to (see the lower-left sub-window of the right screenshot in Figure 1). The textual annotation in the depicted example says that one of the sources lacks the fortissimo in measure 3. By clicking on the autograph icons, the corresponding autograph scans are displayed (see the other three sub-windows of the right screenshot). This way, a musician can delve into the piece and understand its heritage and its ambiguities.

Music in common music notation is structured in musical measures. Larger works can additionally be split up into several movements. Musicians usually refer to specific positions within a score by measure numbers. Measure numbers therefore provide a reasonable granularity for links between textual annotations and the original material. In practice, this makes it necessary to define processable markings for every measure in all related autographs and create concordances, i.e., one-to-one mappings, between these annotations. The XML-based MEI (Music Encoding Initiative) [2] [3] data format, the main format for digital music editions, is able to handle measure definitions in images in the form of polygons. Since musical works can consist of hundreds or even thousands of measures, the annotation of them can, however, be a tedious task. Therefore specialized and time-efficient tools are needed.

We developed two different interfaces: the Measure Editor is a web application and part of a full-featured music edition platform in development. Vertaktoid is a standalone Android application that supports pen and touch interaction. Both software prototypes were evaluated by music

Copyright: © 2017 Yevgen Mexin, Aristotelis Hadjakos, Axel Berndt, Simon Waloschek et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
editors, showing potential paths towards an efficient measure annotation tool.

2. RELATED WORK

2.1 Current tools for Measure Annotation

Edirom 1 is a project that aims to assist music editors by providing special software. Two of its tools are explicitly relevant for this overview: Edirom-Editor [4] and Edirom-Online [5].

Edirom-Editor is a platform that combines different tools for creating digital and hybrid music editions. Edirom-Editor supports the following tasks:

Cataloging of music works represented by multiple sources with musical and eventual textual content.

Definition of the music score's structure by specifying the movements and their names.

Measure annotation with sequence numbers and eventual additional meta-information. The measures are also aligned to the movements in this step.

Creation of concordances for the movements and measures that represent the relations between the corresponding elements from different sources.

Creation of textual annotations and additional references between the elements.

Export of the created digital music edition in the MEI format.

The MEI format can then be presented and published using Edirom-Online. Edirom-Online can present the music score, the corresponding measures from different sources, the annotations and references. The publishing can be performed both locally and remotely via the web.

Edirom-Editor is well known by music editors who work on digital and hybrid music editions. Among others, Edirom-Editor is being used in the following projects: Detmolder Hoftheater [6], OPERA, 2 and Beethovens Werkstatt [7].

Since measure annotation is the most laborious repetitive task, we developed specialized tools. The development of these tools addressed shortcomings of Edirom-Editor with regard to its usability. In this we see the biggest potential for accelerating the process of measure annotation.

2.2 Measure Detection in Optical Music Recognition

Annotation of measures is a relatively time-consuming process that needs a lot of user attention. In this paper we discuss two tools that facilitate this task by introducing improved interaction techniques and intelligent approaches like automatic numbering (see Section 3.2.2), but the main part of the annotation process remains manual. On the one hand, this allows the working process to be free from issues that can be caused by automatic recognition and allows non-standard, context-based user decisions. On the other hand, automatic processing of big volumes of data can be performed much faster.

Optical Music Recognition (OMR) [8] is the automatic conversion of scanned music scores into computer-readable data in variable formats, e.g., MusicXML 3 or MEI. 4 OMR processes the content of music scores, trying to recognize the notation and create a representation that provides the best correspondence to the original source, which is typically scanned. The effectiveness of optical recognition of printed music scores is greater than that of hand-written scores, which are a very common material for digital music editions. For our purposes the recognition of the music notes themselves, including pitch and duration, is not relevant since we are mainly interested in finding measure boundaries. Therefore we will present related work that tackles this task below.

Dalitz and Crawford have investigated the possibility of OMR in relation to the lute tablature. The recognition of measures is discussed in detail in the paper [9]. Although the lute tablature does not use common Western music notation, it has a similar measure concept. The staff lines are removed during the preprocessing step and then the algorithm tries to detect bar line candidates through their aspect ratio (ratio of width to height). The candidates can also be a combination of multiple closely located fragments. The properties of bar line candidates are validated through comparison with staff line and staff space parameters. The optical measure recognition for lute tablatures can be performed relatively robustly because they do not contain complex systems of multiple voices such as in other music scores.

Vigliensoni, Burllet and Fujinaga have devised an approach for optical measure recognition of music scores

system. These will be applied here, too. The definition of zones (typically bounding boxes) is the dominating task. It is essentially similar to the more general task of selecting image regions, which is well-researched in the field of human-computer interaction. In Edirom Editor [4] zones are defined by the traditional rectangle spanning gesture that is known, e.g., from graphics editors and file explorers (selecting multiple files by spanning a box around them).

1 http://edirom.de, last access: Feb. 2017
2 http://opera.adwmainz.de, last access: Feb. 2017
3 https://musicxml.com, last access: Feb. 2017
4 http://music-encoding.org, last access: Feb. 2017
in common Western music notation [10]. Vigliensoni et Edirom Editor relies on mouse input. Users describe this
al. follow the task model from Bainbridge and Bell [11] interaction technique as slow and demanding, requiring a
that contains the following steps: image preprocessing and lot of corrections as the initially drawn bounding boxes
normalization, staff line identification and removal, musical rarely embrace their content perfectly. The original devel-
object location and musical reasoning. opers of Edirom Editor obtained this feedback from their
The approach by Vigliensoni et al. [10] needs additional own experiences with music edition projects, active com-
information from the user about the structure of staves in munication with other editors and during the yearly Edirom
music score. This information consists of the count of staves Summer School. This was the situation that motivated the
for each page, the count of systems, the relation between developments which we report here.
staves and systems and the kind of bar lines. During the As an alternative to mouse input, pen (and touch) seems
process of bar line detection, the algorithm has to remove the most promising modality as it corresponds more closely
all horizontal lines and filter the resulting set to include only to the way we naturally work with sheet music. However,
thin, vertical elements. The remaining bar candidates are there is a variety of selection gestures for pen input, each
further filtered by their aspect ratio using a user-defined with its advantages and disadvantages depending on the
parameter. The algorithm can handle the broken candidates application context. For some editions it may suffice when
by combining the vertical lines with the nearest horizontal zones roughly embrace measures, others may need very
positions. As a final step, the height and position of each precise bounding volumes.
bar candidate is compared with the related properties of the Typical selection gestures in the literature are spanning
corresponding system. The candidates that do not match a rectangle (as implemented in Edirom Editor), lassoing
the system will be also removed. The second user-defined (well-known from graphics editors) and tapping (discrete
parameter, vertical tolerance, controls the sensitivity of the object selection) [13]. These are sometimes combined with
last step. This measure recognition technique was evaluated a mode button that is operated by the secondary hand to
trough comparison with manually annotated music scores. clarify conflicting gestures, phrase multiple strokes to one
Vigliensoni et al. have chosen one hundred music score gesture, and separate the selection process from an oper-
pages from the International Music Score Library Project ation that is applied to it [14]. Zeleznik & Miller [15]
(IMSLP) 5 and had these pages marked by experienced an- use terminal punctuation to define selection-action phrases.
notators. The authors compared the these bounding boxes Several phrasing techniques (timeout, button, pigtail) where
with those recognized by the machine. The results of ma- investigated by Hinckley et al. in [16,17]. They also summa-
chine recognition was considered correct if the divergence rize that lassoing is the favored selection gesture because
of the bounding boxes was less than 0.64 cm. The evalua- it is well suited to selecting handwritten notes or diagrams,
tion results yield to an average f-score of 0.91 by the best which typically contain many small ink strokes that would
chosen aspect ratio and vertical tolerance parameters. be tedious to tap on individually [16, 17].
Although a lot of further research papers in OMR thematic Interesting insight into the correlation of precision and
field exist, we focus on these two articles because they quickness of selection gestures is given by Lank et al. [18].
have contributed largely to optical measure recognition. Based on the motion dynamics of a gesture they made an
The approach from Padilla et al. [12] with the clever idea analysis of deliberateness versus sloppiness and inferred
to compare and combine the OMR outputs for different a linear relationship between tunnel width and speed, for
sources of the same music score can be also mentioned. The paths of varying curvature [18] which helps to automati-
majority voting approach is used to decide which elements cally refine the selection.
are correctly recognized.
3. TWO APPROACHES
2.3 Interaction
The measure annotation task is a time-consuming process
The process of measure annotation comprises three inter-
within the work on music editions. In practice, it includes a
action tasks: navigation through autographs, definition of
large number of interaction steps which have to be executed
zones (measures, respectively) in the autographs, and edit-
with high precision, putting a heavy load on the user. Hence,
ing of zone numbering and movement association. The
in our project we wanted to optimize the user interaction
latter is conveniently solved via text input. Regarding navi-
necessary for this task. We decided to follow two parallel
gation between and within images (autograph pages), stan-
paths: we included a measure annotation tool (Measure
dard gestures for panning and zooming are well established
Editor) in the full featured music edition platform as a web
and implemented in every multitouch enabled operating
application on one hand, and on the other hand implemented
5 http://imslp.org, last access: Feb. 2017 a native tablet tool (Vertaktoid) on an Android system for
user can change the ordinal number of a measure or the corresponding movement. The other parameters that can be adjusted are:

- Suffix - additional textual information that can be added to the measure name after the ordinal number.
- Upbeat - describes whether a measure is an upbeat of the parent movement.
- Pause - the number that represents the repeated rest in a measure. It may be considered when calculating the sequence number of the next measure and can have a meaning when defining concordances.
- Locked - makes the measure uneditable.
- Excluded - marks the measure as excluded (not to be considered).

Measure Editor supports the music editors by providing numerous features for comfortable and fast annotation of music scores. Inasmuch as measure annotation is a part of the whole process of music edition creation, Measure Editor will also be developed as part of the complex system that was designed to provide different tools for music editors. Measure Editor is an open source application and is available 6 under the LGPL 3.0 license.

3.2 Vertaktoid

Created as an Android application for tablets with a pen, Vertaktoid aims to cover the need for mobile, user-friendly instruments for comfortable and quick annotation of measures in handwritten and printed music scores. As mentioned above, measure annotation is an important and relatively time-consuming part of the whole creation process for digital music editions. Therefore, the main task of Vertaktoid is to minimize the man-hour costs and, on the other hand, to provide a comfortable interface for a precise user-defined measure annotation process. Vertaktoid natively supports the MEI format and can be used in combination with the Edirom Editor and Edirom Online programs. Vertaktoid is an open source application and was published 7 under the LGPL 3.0 license.

3.2.1 Marking measures

To mark measures, the user selects the pen tool and draws an outline of the measure with the interactive stylus. The start of the user's stroke is marked by a small circle. When the user closes the outline by reaching that circle again, the interaction is completed and the measure is defined by the rectangular bounding box of the outline. This interaction is very easy to understand and therefore well suited for users who are just starting to work with Vertaktoid. However, when marking several hundred measures in a large music piece, or even thousands of measures spread over several autographs, the efficiency of the interaction becomes much more important than its learnability. Therefore, Vertaktoid supports two further interaction methods:

- Click, don't draw: If, while drawing the outline, the user lifts the stylus and puts it down at another position, the line from the last to the new position is automatically drawn. Now the user can continue drawing the outline or click again at another position. When the user reaches the initial position (marked by the small circle) by clicking or drawing, the outline is completed.
- Cut, don't repeat: Oftentimes the user marks several measures that are aligned as they are part of the same grand staff (see, e.g., measures 1-4 in Figure 4).

6 https://bitbucket.org/zenmem/zenmem-frontend, last access: May 2017
7 https://github.com/cemfi/vertaktoid, last access: May 2017
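The outline-closing interaction described in Section 3.2.1 can be sketched in a few lines. This is an illustrative sketch only: the point representation, the closing tolerance `CLOSE_RADIUS`, and all names are assumptions for illustration, not Vertaktoid's actual code.

```python
import math

CLOSE_RADIUS = 12.0  # px; hypothetical tolerance for "reaching the small circle"

def outline_closed(stroke):
    """An outline is complete when the pen returns near its starting point."""
    if len(stroke) < 3:
        return False
    (x0, y0), (xn, yn) = stroke[0], stroke[-1]
    return math.hypot(xn - x0, yn - y0) <= CLOSE_RADIUS

def bounding_box(stroke):
    """The measure is defined by the rectangular bounding box of the outline."""
    xs = [p[0] for p in stroke]
    ys = [p[1] for p in stroke]
    return (min(xs), min(ys)), (max(xs), max(ys))  # (ul, lr) corners

# "Click, don't draw" needs no extra geometry in this sketch: lifting the
# stylus and tapping elsewhere simply appends a point, and the connecting
# line is implied by the point list.
stroke = [(100, 50), (300, 52), (302, 210), (98, 208), (101, 47)]
if outline_closed(stroke):
    ul, lr = bounding_box(stroke)
```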
[Figure 6: (a) The meaning of the SUS score depicted on adjective ratings and acceptability ranges. (b) Average spreading of participants' votes over the SUS statements s1-s10 for Vertaktoid and Measure Editor.]
For the last two cases, a calculation of the vertical overlap is needed. Let M be the rectangle that represents the existing measure and M' the rectangle that represents the new measure. The rectangles can be defined by two points: ul, the upper left vertex, and lr, the lower right vertex. Each vertex consists of x and y coordinates. Then the vertical overlap Over_y is calculated as follows:

$$\mathrm{Over}_y = \frac{\min(lr_y^{M}, lr_y^{M'}) - \max(ul_y^{M}, ul_y^{M'})}{\min(lr_y^{M} - ul_y^{M},\; lr_y^{M'} - ul_y^{M'})} \qquad (1)$$

Over_y is the vertical intersection of both rectangles relative to the height of the smaller shape. Rectangles which do not have a horizontal intersection can also be processed. Once the vertical overlap factor is calculated, the final order of the two examined measures can be decided. The algorithm takes the horizontal order of the measures if Over_y >= 0.5 and the reverse horizontal order otherwise. This approach is applied to each measure of the corresponding movement until the measures are sorted by their positions.

Sometimes such a strict numbering scheme is not appropriate for the musical content, e.g., for upbeat measures, multiple-measure rests, or measures that are split between two successive pages or staves. In those cases the user can manually assign an arbitrary name to the measure. Such names are marked with a frame (see measure 8A in Figure 4). The alignment of measures to the movements can also be changed. Both user-performed operations cause further automatic renumbering of the following measures inside the corresponding movements.

4. EVALUATION & DISCUSSION

To evaluate the usability of both measure annotation tools, a quantitative study was carried out. The study was based on the System Usability Scale (SUS) [20], a simple and quick survey that can be used to measure the usability of a product or service based on the subjective assessments of participants. The SUS survey contains ten statements, which must be voted on with a value in the range from 1 (strongly disagree) to 5 (strongly agree). The statements alternate between positive and negative expressions about the examined software. A simple formula calculates the SUS score from the individual votes. Figure 6(a) shows the meaning of a SUS score translated into adjective ratings [21].

Our study was conducted as part of the Edirom Summer School event 8, which was created to teach musicologists the creation of digital music editions using the Edirom Editor tool and to inform experienced music editors about new techniques in this field. We recruited sixteen music editors for our study. The familiarity of the participants with the music annotation software was heterogeneous: some of them had learned the Edirom Editor during the Summer School, while the others already had good experience with it. They used prototype versions of Measure Editor and Vertaktoid and annotated autographs with them. Figure 6 shows the average votes for both applications. The resulting SUS scores were 70 for Measure Editor and 87 for Vertaktoid. This means that both tools were found acceptable, and Vertaktoid even excellent (see Figure 6(a)).

To decide about future development, we interviewed the study participants. The combination of the survey and these consultations was taken into consideration during the development of the next versions of both applications. Their feedback contains a lot of clever suggestions. For example, the cut function could be performed in both dimensions, which can make the annotation process more effective. Some users found the often-repeated selection of suitable control elements uncomfortable; in some cases the next control element could be chosen automatically, following the common workflow. The whole annotation process could also be performed in a predefined sequence of steps. Thus, the application could offer the user the option to annotate the measures with draft bounding boxes that are refined in a next step.

Furthermore, Vertaktoid is currently used in the Detmolder Hoftheater project, which provides continuous helpful feedback for further evaluation and troubleshooting.

A number of features could be added in the future. Optical measure recognition and subsequent automatic annotation would be a helpful and time-saving function, at least for the annotation of printed music sheets. It would also be productive if several users could simultaneously work on one music score. This would mean that users could access data from a shared database to continue the work of another

8 http://ess.uni-paderborn.de, last access: May 2017
users, make corrections, or discuss the further workflow. It would also be useful if users could make additional textual (and not only textual) annotations for individual measures, music notes, or even any other positions on the facsimile. These annotations could be considered in later steps during the creation of a music edition.

Both tools can also contribute to each other and, using the experience of the sibling tool, inherit its best techniques. Thus, the possibility to annotate an area via a polygon can be very helpful in Vertaktoid, as well as the editing functions. On the other hand, Measure Editor can be improved by additional interaction techniques, movement colorizing, or automatic numbering.

5. REFERENCES

[1] M. Reger, Reger-Werkausgabe, Bd. I/1: Choralphantasien für Orgel, A. Becker, S. König, C. Grafschmidt, and S. Steiner-Grage, Eds. Leinfelden-Echterdingen, Germany: Carus-Verlag, 2008.

[2] P. Roland, "The Music Encoding Initiative (MEI)," in Proceedings of the International Conference on Musical Applications Using XML, 2002, pp. 55-59.

[3] A. Hankinson, P. Roland, and I. Fujinaga, "The Music Encoding Initiative as a document-encoding framework," in Proceedings of the 12th International Society for Music Information Retrieval Conference, Miami, Florida, USA, October 24-28, 2011, pp. 293-298.

[4] Virtueller Forschungsverbund Edirom (ViFE), "Edirom Editor 1.1.26," https://github.com/Edirom/Edirom-Editor, June 2016, accessed: 2017-02-20.

[5] Virtueller Forschungsverbund Edirom (ViFE), "Edirom Online," https://github.com/Edirom/Edirom-Online, accessed: 2017-02-20.

[6] I. Capelle and K. Richts, "Cataloguing of musical holdings with MEI and TEI. The Detmold Court Theatre Project," in Music Encoding Conference 2015. Florence, Italy: Music Encoding Initiative, May 2015.

[7] S. Cox, M. Hartwig, and R. Sänger, "Beethovens Werkstatt: Genetische Textkritik und Digitale Musikedition," Forum Musikbibliothek, vol. 36, no. 2, pp. 13-20, July 2015.

[8] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. S. Marcal, C. Guedes, and J. S. Cardoso, "Optical music recognition: state-of-the-art and open issues," International Journal of Multimedia Information Retrieval, vol. 1, no. 3, pp. 173-190, 2012.

[9] C. Dalitz and T. Crawford, "From Facsimile to Content Based Retrieval: the Electronic Corpus of Lute Music," Phoibos - Zeitschrift für Zupfmusik, pp. 167-185, February 2013.

[10] G. Vigliensoni, G. Burlet, and I. Fujinaga, "Optical Measure Recognition in Common Music Notation," in Proceedings of the 14th International Society for Music Information Retrieval Conference, ISMIR 2013, Curitiba, Brazil, November 4-8, 2013, pp. 125-130.

[11] D. Bainbridge and T. Bell, "The challenge of optical music recognition," Computers and the Humanities, vol. 35, no. 2, pp. 95-121, 2001.

[12] V. Padilla, A. McLean, A. Marsden, and K. Ng, "Improving optical music recognition by combining outputs from multiple sources," in Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR 2015, Malaga, Spain, October 26-30, 2015, pp. 517-523.

[13] E. Artinger, T. Coskun, M. Schanzenbach, F. Echtler, S. Nestler, and G. Klinker, "Exploring Multi-touch Gestures for Map Interaction in Mass Casualty Incidents," in INFORMATIK 2011 - Informatik schafft Communities (Proc. of the 41st GI-Jahrestagung), H.-U. Heiß, P. Pepper, H. Schlingloff, and J. Schneider, Eds., Gesellschaft für Informatik (GI). Berlin: Bonner Köllen Verlag, Oct. 2011.

[14] K. Hinckley, F. Guimbretière, M. Agrawala, G. Apitz, and N. Chen, "Phrasing Techniques for Multi-Stroke Selection Gestures," in Proc. of Graphics Interface 2006. Canadian Human-Computer Communications Society, June 2006, pp. 147-154.

[15] R. Zeleznik and T. Miller, "Fluid Inking: Augmenting the Medium of Free-Form Inking with Gestures," in Proc. of Graphics Interface 2006. Canadian Human-Computer Communications Society, June 2006, pp. 155-162.

[16] K. Hinckley, P. Baudisch, G. Ramos, and F. Guimbretière, "Design and Analysis of Delimiters for Selection-Action Pen Gesture Phrases in Scriboli," in Proc. of the SIGCHI Conf. on Human Factors in Computing Systems (CHI 2005). Portland, Oregon, USA: ACM SIGCHI, Apr. 2005, pp. 451-460.

[17] K. Hinckley, P. Baudisch, G. Ramos, and F. Guimbretière, "Delimiters for Selection-Action Pen Gesture Phrases," United States Patent No. US 7,454,717 B2, USA, Nov. 2008, filed Oct. 2004.

[18] E. Lank, E. Saund, and L. May, "Sloppy Selection: Providing an Accurate Interpretation of Imprecise Selection Gestures," Computers & Graphics, vol. 29, no. 4, pp. 490-500, Aug. 2005.

[19] B. Tillett, "What is FRBR? A conceptual model for the bibliographic universe," Library of Congress, Tech. Rep., February 2004.

[20] J. Brooke, "SUS: a quick and dirty usability scale," in Usability Evaluation in Industry, P. W. Jordan, I. L. McClelland, B. Thomas, and B. A. Weerdmeester, Eds. London: Taylor and Francis, 1996, pp. 189-194.

[21] A. Bangor, P. Kortum, and J. Miller, "Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale," J. Usability Studies, vol. 4, no. 3, pp. 114-123, May 2009.
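As a worked illustration of the measure-ordering rule described earlier (Eqn. (1) and the Over_y >= 0.5 decision), the computation can be sketched as follows. The rectangle representation as (ul, lr) corner pairs follows the text; the function names are illustrative assumptions, not the tools' actual code.

```python
def vertical_overlap(m, m2):
    """Over_y of Eqn. (1): the vertical intersection of two rectangles
    relative to the height of the smaller one. Each rectangle is given
    as ((ul_x, ul_y), (lr_x, lr_y))."""
    (_, uly1), (_, lry1) = m
    (_, uly2), (_, lry2) = m2
    intersection = min(lry1, lry2) - max(uly1, uly2)  # may be negative
    return intersection / min(lry1 - uly1, lry2 - uly2)

def precedes(m, m2):
    """Pairwise order: horizontal order when the measures share a staff
    line (Over_y >= 0.5), reverse horizontal order otherwise, as stated
    in the text for measures on different staves."""
    if vertical_overlap(m, m2) >= 0.5:
        return m[0][0] <= m2[0][0]  # compare left edges
    return m[0][0] >= m2[0][0]
```

For two side-by-side measures on the same staff the overlap is close to 1 and the left one precedes; for a measure on a lower staff the negative overlap falls below 0.5, so the upper-right measure precedes the lower-left one, matching reading order.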
ABSTRACT and make efficient use of space. Often the engraver must
choose between many possible equivalent representations
We present the problem of music renotation, in which the of the music, while considering the placing of symbols,
results of optical music recognition are rendered in image avoiding unnecessary overlap, preserving important align-
format, while changing various parameters of the notation, ments between symbols, and leading to a pleasing docu-
such as the size of the display rectangle or transposition. ment overall.
We cast the problem as one of quadratic programming. We Our essential idea is to leverage the choices made in a
construct parameterizations of each composite symbol ex- previous engraving to guide our renotated rendition. This
pressing the degrees of freedom in its rendering, and re- includes the many hard or categorical decisions made in
late all the symbols through a connected graph. Some of the original document, such as what notes should be beamed,
the edges in this graph become terms in the quadratic cost what stem directions to use, as well as the many other
function expressing a desire for spacing similar to that in choices that lead to equivalent notation in terms of pitch
the original document. Some of the edges express hard lin- and rhythm. However, we also seek to leverage the specific
ear constraints between symbols expressing relations, such soft choices of layout and spacing that give the music the
as alignments, that must be preserved in the renotated ver- desired density of information while leading to high read-
sion. The remaining edges represent linear inequality con- ability. In essence, our renotated music seeks to copy as
straints, used to resolve overlapping symbols. The opti- much of the original layout as possible, while simultane-
mization is solved through generic techniques. We demon- ously satisfying the constraints imposed by our renotated
strate renotation on several examples of piano music. format, such as the dimensions of the display rectangle.
Our formulation of the renotation problems ties in natu-
1. INTRODUCTION rally to our work in optical music recognition (OMR). Our
Ceres system [68] is a longstanding research interest that
A symbolic music encoding, such as MEI [1], MusicXML [2],
seeks to recognize the contents of a music document. From
and many others, represents music in a hierarchical for-
the standpoint of recognition, it is almost impossible to
mat, naturally suited for algorithmic manipulation. One of
identify a musical symbol in an image without also know-
the most important applications of symbolic music encod-
ing the precise location and parametrization of the symbol,
ings will likely be what we call music renotation [3], which
thus accurately registering the recognized symbol with the
renders an image of the music notation, perhaps modified
image data. Therefore, a welcome consequence of OMR
in various ways. For instance, music renotation may pro-
is a precise understanding of the music layout, including
duce a collection of parts from a score, show the music in
both the hard and soft choices, leaving our current effort
transposed form, display notation expressing performance
poised for high-quality renotation. This is the goal we ad-
timing, or simply display the music in an arbitrarily-sized
dress here.
rectangle. Music renotation may someday be the basis for
Our essential approach is to cast the renotation problem
digital music stands, which will offer a range of possibili-
as one of optimization, expressed in terms of a graph that
ties for musicians that paper scores do not, such as instant
connects interdependent symbols. This view is much like
and universal access to music, page turning, search, and
the spring embedding approach to graph layout, so popular
performance feedback. While these ideas could apply to
in recent years [912]. Spring embeddings seek to depict
any type of music notation, our focus here is on common
the nodes and edges of a graph in a plane, with minimal
Western notation.
conflict between symbols, while clustering nearby nodes.
As long as there has been interest in symbolic music rep-
Unlike the graph layout problem, there is no single cor-
resentations, there has been a parallel interest in rendering
rect graph for music renotation. Rather, the construction
these encodings as readable music, e.g. [4], [5]. There is a
of an appropriate graph lies at the heart of the modelling
well-developed craft surrounding this engraving problem,
challenge. Our work bears some resemblance to the work
predating algorithmic efforts by centuries. The essential
of Renz [13], who takes a spring embedding view of one-
task is to produce music documents that are easy to read
dimensional music layout.
Copyright:
c 2017 Liang Chen et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution 3.0 Unported License, which
2. MODELING THE NOTATION
permits unrestricted use, distribution, and reproduction in any medium, provided At present, we do not model the entire collection of possi-
the original author and source are credited. ble musical symbols, limiting ourselves to the most com-
SMC2017-287
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
mon ones, though providing generic mechanisms to extend differ from those in the original notation we simply try
to larger collections. This restriction of attention paral- to copy the original notation as much as possible. To cap-
lels the state of our OMR system, which recognizes only ture hard constraints, such as the essential symbol align-
a subset of possible notation, at present. It seems that any ments between rhythmically coincident notes, we include a
music notation effort must eventually come into contact collection of linear equality constraints {atm x = bm }M
m=1 ,
with the heavy tail problem there are a great many which we will summarize as Ax = b. To capture the non-
unusual symbols, special cases, and exceptions to familiar overlapping constraints, we include inequality constraints
0
0 0
rules which, collectively, pose a formidable modeling chal- of the form {atm0 x bm0 }Mm0 =1 summarized as A x b .
lenge [14]. To avoid getting buried in the many details of Thus we define our desired solution as
real-life music notation we focus here on the core of mu-
sical symbols, by which we mean those involved in con- x = arg min Q(x) (1)
x:Ax=b,A0 xb0
veying pitch and rhythm (beams, flags, stems, note heads,
ledger lines, accidentals, clefs, augmentation dots, time QP gives us a formulaic way of solving an optimization
signatures, key signatures, rests, etc.), with a few additions problem, once posed in this manner.
to this core. To be more specific, we define our objective function
We divide the world of music symbols into composite and and constraints in terms of a graph, G = (E, V ). In our
isolated symbols. The isolated symbols, (clefs, acciden- graph, the parametrized symbols become the vertex set,
tals, etc.), are stand-alone symbols, represented by single V = {v1 , . . . , vN }. We define an edge set composed of
characters in a music font. The composite symbols, includ- three types of edges between these vertices, soft, hard and
ing beams and chords, are formed by constrained configu- conflict, denoted by E = Es Eh Ec . Our graph will be
rations of primitives, such as beams, stems, heads, ledger connected by the Es Eh edges, thus ensuring that there
lines, flags, etc. For purposes of renotation we view the are no notational islands that function independently of
satellites, (accidentals, articulations, augmentation dots, etc.) the rest. For every edge, e = (vi , vj ) Es we define a
as isolated symbols, rather than as parts of a composite quadratic term, qe (xi , xj ), so that our objective function,
symbol. Q, decomposes as
When drawing a composite symbol one must obey ba- X
sic constraints between the constituent primitives. For in- Q(x) = qe (xi , xj )
stance, the vertical positions of the note heads are fixed e=(vi ,vj )Es
in relation to the staff, while their horizontal positions are
fixed in relation to the note stem. These relations hold Similarly, for each edge e = (vi , vj ) Eh we define a
with both regular and wrong-side note heads. After ac- constraint, ate x = be , where the vector, ae , is non-zero
counting for these constraints, the only degrees of free- only for the components corresponding to xi and xj . Anal-
dom in drawing a chord (not including its satellites) are the horizontal and vertical locations of the stem endpoint. The locations of all other primitives follow deterministically once the stem endpoint is fixed. Beamed groups have more degrees of freedom in their rendering. After fixing the corners of the beamed group one must also decide upon the horizontal positions of the stems, leading to n + 2 degrees of freedom for an n-note beamed group.

With these parametrizations in mind, we can think of all symbols, composite and isolated, as parametrized objects, where an isolated symbol is parametrized simply by its two-dimensional location. In some cases, one coordinate of an isolated symbol may be fixed, as with the vertical coordinate of an accidental or clef. In such cases the symbol has a one-dimensional parameter.

2.1 Overview of Approach

We cast the renotation problem as constrained optimization, in particular, quadratic programming (QP) [15], in which one minimizes a quadratic function subject to linear equality and inequality constraints. To this end we let v_1, ..., v_N be the musical symbols, isolated and composite, and let x = (x_1, ..., x_N) be the parameters for these symbols. Thus the notation is completely determined by x. We define a quadratic objective function, Q(x), measuring the desirability of the parameter x. Q penalizes the degree to which the renotated symbols' relative positions

...ogously we have a constraint a_e^t x >= b_e for each edge e in E_c. Thus A and b from Eqn. 1 become

        [ a_{e1}^t ]         [ b_{e1} ]
    A = [    ...   ],    b = [   ...  ]
        [ a_{eM}^t ]         [ b_{eM} ]

where e_1, ..., e_M is an arbitrary ordering of the edges of E_h, with a similar definition for A' and b'.

2.2 A More Detailed Look: Beamed Groups

We examine here the case of a beamed group in more detail, so as to flesh out our ideas. As already described, the beamed group itself (without its satellites) is parametrized by an (n + 2)-dimensional parameter, x_b. Each note on each chord of the beamed group may have an associated accidental, while the heights of these accidentals are fixed at the appropriate staff position. Thus we can write {x_i^acc} for the horizontal positions of these accidentals, indexed by i. Absent other considerations, there is an ideal horizontal distance between a note head and its accidental. However, in some cases, particularly with a densely-voiced chord, the accidentals may be placed far to the left of this ideal to avoid overlap. For each accidental we introduce a quadratic term, q_i^acc(x_b, x_i^acc), that penalizes the degree that the accidental displacement differs from that in
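To make the QP formulation concrete, the following toy sketch (ours, not the authors' implementation; all positions and spacings are invented) minimizes a quadratic spacing objective for two symbols subject to a single inequality edge, using SciPy's SLSQP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of the layout QP (all numbers invented): x = (x0, x1) are
# horizontal symbol positions, and Q(x) penalizes deviation from the
# spacing observed in the original notation (symbols ideally at 0 and 10).
target = np.array([0.0, 10.0])

def Q(x):
    d = x - target
    return d @ d

# One inequality edge: keep at least 12 units between the two symbols.
cons = [{"type": "ineq", "fun": lambda x: x[1] - x[0] - 12.0}]

res = minimize(Q, x0=target, constraints=cons, method="SLSQP")
print(np.round(res.x, 3))  # both symbols shift symmetrically to satisfy the edge
```

Because the objective is quadratic and the constraint linear, the solver lands on the constraint boundary, splitting the required displacement between the two symbols.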
SMC2017-288
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
the right of v_i in the original notation, we could choose the constraint a_i^h <= a_j^l. This constraint would be expressed in terms of the parameters x_i, x_j, and included into the set of inequality constraints. This approach guarantees that our QP problem has a non-trivial feasible solution, since the layout of the original image is a feasible point. If v_i and v_j overlap in the original image, we cannot resolve this situation, though such situations are uncommon and usually occur in layout situations without obvious solutions.

We iterate the process of solving a QP problem and imposing new constraints to resolve any newly-discovered overlapping symbols. As all of the newly-added inequality constraints are consistent with the original notation, the QP problem retains a feasible region and thus remains well-posed. Our final result is a solution to the QP problem with equality constraints, plus the additional constraints imposed through the iterative process. In this way we approximate the notion of non-overlapping constraints.

Our approach constrains us to solve layout problems in a way that is similar to what has been done in the original notation, rather than seeking novel ways of resolving conflicts. In essence, this is consistent with the heart of our proposed approach, which leverages the existing notation by deferring to the original notation. The problem of constructing good notation from a symbolic music representation containing no positional information is considerably more challenging, and, perhaps, unnecessary.

...terpretation of the symbols' meaning, for instance rhythm recognition to determine simultaneous notes, though we approximate this goal, at present, through thresholding. While our analysis is presently limited to these basic symbols, we can extend to a greater variety of symbols, including slurs, text, dynamics, etc., by anchoring to the basic symbols. In the most common case, such a symbol is owned by or related to a basic symbol. For instance, a dynamic symbol or text may be associated with a certain note, requiring its horizontal placement to align with the note. A great variety of symbols can be included into the present context by aligning them to the basic symbols with hard edges that force the appropriate horizontal alignment, while using a quadratic term to encourage the vertical spacing in the original image.

Figure 2 shows a graph that is constructed from the notation, expressing the optimization problem we solve. Different colored edges in this figure correspond to different types of quadratic penalty, alignment, and overlap constraint. As outlined above, we proceed by solving the optimization problem defined by the hard and soft edges first. In doing so, conflicts will arise between symbols that overlap in the optimal result. We then iterate the process of imposing conflict edges (inequality constraints) between the overlapped symbols until none remain. These conflict edges, as well as the other types of edge, are indicated in the figure.
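The solve/detect/constrain loop described above can be sketched as follows (an illustrative toy, not the paper's code; the positions, overlap width, and use of SciPy's SLSQP solver are our assumptions):

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative toy (positions and widths invented): each symbol i has an
# ideal position t[i] taken from the original layout; symbols whose
# positions are closer than w overlap.
t = np.array([0.0, 1.0, 2.0, 10.0])
w = 3.0

def Q(x):                              # quadratic preference for the original layout
    return np.sum((x - t) ** 2)

constraints = []                       # conflict edges, added as discovered
x = t.copy()                           # start from the "original" layout
for _ in range(10):                    # solve / detect / constrain loop
    x = minimize(Q, x, constraints=constraints, method="SLSQP").x
    conflicts = [(i, i + 1) for i in range(len(x) - 1)
                 if x[i + 1] - x[i] < w - 1e-4]
    if not conflicts:                  # no overlaps remain: done
        break
    for i, j in conflicts:             # impose x[j] - x[i] >= w
        constraints.append({"type": "ineq",
                            "fun": lambda x, i=i, j=j: x[j] - x[i] - w})
print(np.round(x, 3))                  # crowded symbols spread apart; isolated one stays
```

Each added conflict edge is consistent with the original left-to-right ordering, so the feasible region never becomes empty, mirroring the well-posedness argument above.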
Figure 2. Graph showing the types of edges between the vertices. Green: Soft horizontal edges (quadratic term); Blue: Soft
vertical edges (quadratic term); Magenta: Hard Edges (equality constraint); Red: Conflict edges (inequality constraint).
Figure 3. A section from the first movement of Mozart's Piano Sonata, K. 333 (Breitkopf and Härtel 1878 edition)
1878 Breitkopf and Härtel edition 1. As with this example, many of the most useful public domain editions are quite old, though these editions are still excellent sources for renotation, with a significant degree of expertise and care supporting the engravings. The second shows a page of the Rigaudon from Ravel's Le Tombeau de Couperin, using the Durand 1918 edition 2.

Obviously the notation for these examples appears spare without the additional symbols that describe style, dynamics, articulations, etc., though one can still appreciate the value of the proposed ideas based on these examples. While some spacing issues are not resolved in the most readable manner, the overall notation manages to preserve the appropriate rhythmic alignment and produce somewhat natural-looking notation in a fully automatic way.

The second pair of experiments shows the same two pieces, now notated in page mode, in Figures 3 and 4. The page sizes were chosen to be considerably wider than the original, thus forcing our algorithm to adjust the spacing to achieve the customary alignment of the rightmost bar line of each system. In addition to removing the original clef and key signatures at the start of each original staff line, we added the appropriate clef and key signatures at the beginnings of our renotated staff lines.

4. CONCLUSIONS AND FUTURE WORK

Our renotation examples demonstrate work in progress; thus there are still some issues that need to be resolved in the definition of our objective function before the notation has a professional, easy-to-read look. For instance, smaller symbols, such as accidentals and articulations, have different requirements for spacing, and are not ideally handled by the more generic approach to spacing we take at present.

There are some aspects of this approach that are novel and worth emphasizing. First of all, two-dimensional layout of nearly anything can be a challenging problem, though it is especially difficult in a domain such as music engraving, where there is a well-developed, if not clearly formulated, standard practice. Rather than try to explicitly formulate the many aspects of good music notation, we seek to leverage existing examples, using these as a guide that will pull our results toward reasonable layout. This includes both the inter-symbol spacing as well as the categorical or discrete decisions of music notation (stem direction, beaming, etc.).

We have formulated the problem as a constrained optimization, specifically as quadratic programming, thus allowing us to incorporate many different objectives into our final result. Furthermore, the approach is fully automatic. It is worth noting, however, that no fully automatic approach will likely lead to the high quality notation we wish to produce. The optimization approach we present can easily be extended to handle explicit human guidance in the form of hand-specified constraints or additional terms for the quadratic objective function. When these are added, we still compute a global optimum that takes into account all other aspects of the objective function, as well as the other constraints. Thus our approach provides a natural framework for including human-supplied guidance.

There are additional applications of the proposed techniques that we have not presented here but are still pursuing currently, which we hope will further support the overall framework we present. For instance, our symbolic representation of the music data easily allows for small transpositions in which notes are shifted a few staff positions up or down, while the transposed accidentals are handled in a formulaic (and correct) way. It is worth noting that this recipe for transposition runs into difficulty with larger transpositions that call for changes in stem directions and, perhaps, other notational choices.

Finally, we hope to soon pursue uses of notation designed to convey performance characteristics, rather than readability. For instance, we can present the horizontal position of each note as proportional to its actual onset time, thus allowing one to see the expressive use of timing embedded in the notation itself. Additional variations would allow the visualization of fine-grained pitch, of interest for both tuning and vibrato, as well as dynamics. Our work with score alignment allows for estimation of the appropriate performance parameters, while the renotation ideas may prove useful for conveying this information in a way that is easy for the practicing musician to assimilate.

1 http://music.informatics.indiana.edu/papers/smc17/Mozart_panorama.pdf
2 http://music.informatics.indiana.edu/papers/smc17/Ravel_panorama.pdf

5. REFERENCES

[1] A. Hankinson, P. Roland, and I. Fujinaga, "The music encoding initiative as a document-encoding framework," in ISMIR, 2011, pp. 293-298.

[2] M. Good, "MusicXML for notation and analysis," The Virtual Score: Representation, Retrieval, Restoration, vol. 12, pp. 113-124, 2001.
Wallace (see Figure 1) is a software application to foster music compositions based on variable reverberation. Its implementation design is aimed at: 1) offering an easy way to make sound travel across independent reverberations automatically; 2) exploring the relationship and implications between composition and performance of music in contexts in which reverberation is constantly changing.

Figure 1. Wallace

Wallace was developed in MaxMSP [8]. By default, the total number of possible sound sources is five. The sound sources can be sound files or live sound input (i.e. a microphone). The convolution process is performed using the HISSTools Impulse Response Toolbox [9], particularly the MSP object multiconvolve~. This object performs real-time convolution.

The default IRs are included in an external folder named IRs and were retrieved from The Open Acoustic Impulse Response Library (Open AIR) [10]. The user, however, can add more IRs to the database and use them. Wallace, by default, is ready to output to a quadraphonic system, yet the user is able to choose a stereo output.

2.2.2 Transitions

The GUI (see Figure 3) defines a square area to allow the user to visualize the transitions across the IRs. It features a small black circle that, according to its position, signals which IR is being used (i.e. heard) to process the audio signal. The closer the circle is to a speaker within the defined square area (i.e. a specific number as displayed in the GUI), the louder that specific convolution process (i.e. sound result) is heard. If the circle is in the middle of the square area, however, the sound result will be a mixture of all the sound convolutions.
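The distance-dependent mixing described above can be sketched as follows. The paper does not state the exact gain law Wallace uses, so the inverse-distance weighting, the corner layout, and all names here are our assumptions:

```python
import numpy as np
from scipy.signal import fftconvolve

# Sketch of distance-dependent IR mixing (the gain law and corner layout
# are assumptions, not taken from the paper): one IR per corner of the
# unit square in which the black circle moves.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def gains(pos, eps=1e-6):
    """Amplitude per IR from the black circle's position in the square."""
    d = np.linalg.norm(corners - np.asarray(pos), axis=1)
    g = 1.0 / (d + eps)                # closer corner -> louder convolution
    return g / g.sum()                 # normalize so the gains sum to 1

def render(dry, irs, pos):
    """Weighted sum of the four convolutions (equal-length IRs assumed)."""
    return np.sum([g * fftconvolve(dry, ir)
                   for g, ir in zip(gains(pos), irs)], axis=0)

# With the circle in the middle, all four convolutions mix equally.
print(np.round(gains([0.5, 0.5]), 3))  # -> [0.25 0.25 0.25 0.25]
```

Moving the circle toward a corner smoothly crossfades the output toward that corner's convolution, matching the behaviour described for the GUI.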
2.3 Workflow

The first step is to choose one of two possible sound sources: file or mic. The first case refers to a sound file while the second refers to live sound input (see Figure 4).

The first piece is for soprano saxophone and live electronics, using only Wallace. For the saxophone part I deliberately used musical phrases containing contrasting elements (e.g. high-pitched sounds vs low-pitched sounds, forte vs piano, sound vs silence). Although the formal structure of the saxophone score is linear (i.e. not open form), it is composed of many sequenced small musical gestures to enhance the contrasting elements (see Figure 7).
sessions, it seemed that using sharp and contrasting sound gestures helped to perform and hear the reverberation nuances. With the inclusion of that kind of phrasing texture in the final score, I wanted to 1) listen to its sound results using Wallace and 2) assess whether I could experience the saxophone sound and the reverberation(s) in dialogue (e.g. overlapping sounds and pitches).

During one of the rehearsals, by chance, I discovered that when using the sound of the human voice, in comparison with the sound of the saxophone for the same reverberation scheme, I perceived the different reverberations much more clearly. I believe this might be related to two factors: 1) the spoken human voice is noisier (i.e. has an irregular spectrum) than the saxophone sound (i.e. a regular spectrum), thus more radical spectral changes occur when the audio is being processed; and 2) the human voice is hardwired to our daily experience of the world, thus humans instantly recognize the smallest nuances.

This piece was performed in May 2016 and can be heard here [11].

3.3 Variações sobre Espaço #2

This piece is for quintet (flute, clarinet, piano, violin and violoncello), tape and live electronics (Wallace). It comprises 5 movements, and the tape is comprised of distinctive soundscapes (e.g. a forest, inside a church, in the countryside, late night at a small village).

This instrumental setup allowed me to explore different facets of mixing/blending reverberations (i.e. IRs) compared with the strategies I used in Variações sobre Espaço #1. Such facets include 1) the articulation of the ensemble (i.e. several sound sources) with Wallace and its consequence on the overall sound result; 2) repeating pitched notes at a given pulse (e.g. repeated quarter notes) to hear their sonic nuances changing across the different reverberations; and 3) using contrasting instrumental textures (e.g. tutti vs solo). Some of the questions I asked myself were: Will I hear many spaces blending with each other? Will I hear different instruments in different spaces? Will I hear just a single reverberation composed of several reverberations?

The instrumental texture/notation used in the beginning of the piece is very common sounding; however, with each movement, some musical aspects are frozen or simplified (e.g. note duration, the rate at which harmony changes, textures, dynamics, rhythm). In the last movement there is only a continuous and spacey melody played by the piano, punctuated only by slight gestures played by the remaining instruments. I composed it like that to be able to experiment with and listen to many sonic textures (e.g. complex vs simple) between the sound produced by the instruments and Wallace.

During rehearsals, I discovered that I could listen to and perceive many simultaneous different reverberations when there were soundscape sounds permanently playing in the background. Consequently, I recorded several distinct soundscapes in order to build a database of background soundscapes. Most importantly, these soundscapes are not meant to stand out but instead to install a quiet background space.

This piece was performed twice. Each performance occurred at a different concert hall, and the two halls had contrasting natural reverberation qualities. The first concert hall had a pronounced natural reverberation (~2 seconds) and the second one had little natural reverberation. The experience of listening to the piece, particularly experiencing the transitions between different virtual spaces (i.e. IRs), felt quite different in each concert hall. The first two movements (i.e. more complex rhythmical textures) seemed more interesting when performed in the concert hall with little reverberation, whereas the last movements (i.e. simpler rhythmical textures) felt more appropriate for the concert hall with noticeable natural reverberation.

This piece was performed in November 2016 and can be heard here [12].

4. CONCLUSIONS

The main purpose of this research is to devise ideas about composing music for variable reverberation. Wallace is a resource to pursue such compositional intents. It is a digital software application developed to make a given audio signal flow through different reverberations. In addition, Wallace offers possibilities to make automatic transitions between the different reverberations.

The practical compositional work has led me to some conclusions, specifically: the spoken human voice is the sound that, from the standpoint of sound perception, best illustrates different reverberations; and the natural reverberation of each space plays a decisive role in the perceived sound output produced by Wallace, thus each performance will not sound the same in concert halls with different reverberations. This might mean that there is no perfect concert hall in which to use Wallace; instead, each composition will be (and should be composed to be) in dialogue with the real and virtual space. Lastly, the use of continuous quiet soundscape sounds imprints a sense of background space which helps Wallace's sound output to stand out.

During the course of composing the aforementioned pieces, as well as implementing Wallace, I established a generic compositional approach (see Figure 8). It defines three main ideas to be addressed while composing music for variable reverberation, particularly using Wallace. The first idea suggests thinking about the balance between dry sound/open space/soundscape vs wet sound/closed space/reverberation; the second suggests thinking about the balance between static virtual reverberation vs variable virtual reverberation; the third, in the case of using variable reverberation schemes, suggests thinking about how to employ transitions between distinct reverberations. In addition, the aforementioned balances don't (and should not) have to be constant during the course of a piece. Instead, it seems interesting to me to shift the balances during the performance of the piece.
5. FUTURE WORK
Future work includes the composition of new pieces, the inclusion of more IRs in the default database, the elaboration of documentation (e.g. video tutorials and performance videos), and the design of new models to make IR transitions within each reverberation scheme (e.g. more automatic transition movements, or analysing the input sound of a sound source and mapping a specific audio feature to move the black circle). Furthermore, the GUI is going to be redesigned, and the Wallace application and source code are going to be released very soon at filipelopes.net.
Acknowledgments
Many thanks to CIPEM/INET-md, Henrique Portovedo, Miguel Azguime, Miso Music Portugal, Sond'Ar-te Ensemble, João Vidinha and Rui Penha.
6. REFERENCES
[1] M. Forsyth, Buildings for Music. Cambridge, MA: MIT Press, 1985.

[2] L. L. Henrique, Acústica Musical, Edição 5. Lisboa: Fundação Calouste Gulbenkian.

[3] A. P. de O. Carvalho, Acústica Ambiental, Ed. 84. Faculdade de Engenharia da Universidade do Porto, 2013.

[4] B. Blesser, "An interdisciplinary synthesis of reverberation viewpoints," Journal of the Audio Engineering Society, vol. 49, no. 10, pp. 867-903, 2001.

[5] B. Pentcheva, "Hagia Sophia and multisensory aesthetics," Gesta, vol. 50, no. 2, pp. 93-111, 2011.

[6] https://www.wired.com/2017/01/happens-algorithms-design-concert-hall-stunning-elbphilharmonie/

[7] J. Chowning, "Turenas: the realization of a dream," presented at the Journées d'Informatique Musicale 2011, Saint-Étienne, 2011.

[8] http://www.cycling74.com

[9] A. Harker and P. Alexandre Tremblay, "The HISSTools impulse response toolbox: Convolution for the
Rod Selfridge
Media and Arts Technology Doctoral College,
Centre for Digital Music, EECS Dept.,
Queen Mary University of London,
Mile End Road, UK E1 4NS
r.selfridge@qmul.ac.uk

David Moffat
Centre for Digital Music
d.j.moffat@qmul.ac.uk

Joshua D Reiss
Centre for Digital Music
joshua.reiss@qmul.ac.uk
ABSTRACT
sound effects, including sword whoosh, waves and wind sounds, was undertaken in [1]. Analysis and synthesis occur in the frequency domain using a sub-band method to produce coloured noise. In [4] a rapier sword sound was replicated, but this focused on the impact rather than the swoosh when swung through the air.

A physical model of sword sounds was explored in [5]. Here offline sound textures are generated based on the physical dimensions of the sword. The sound textures are then played back with speed proportional to the movement. The sound textures are generated using computational fluid dynamics (CFD) software, solving the Navier-Stokes equations and using Lighthill's acoustic analogy [6] extended by Curle's method [7]. In this model [5], the sword is split into a number of compact sources (discussed in Section 2), spaced along the length of the sword. As a sword is swept through the air, each source moves at a different speed and the sound texture for that source is adjusted accordingly. The sound from each source is summed and output to the listener.

A Japanese katana sword was analysed in [8] from wind tunnel experiments. A number of harmonics from vortex shedding were observed, along with additional harmonics from a cavity tone due to the shinogi, or blood grooves, in the profile of the sword.

This paper presents the design, implementation and analysis of a real-time physical model that can be used to produce sounds similar to those of a sword swooshing through the air. It builds on previous work by the authors to create a real-time physical model of an Aeolian tone [9].

Our model offers the user control over parameters such as the arc length of the swing, sweep angles, top speed of the swing, and dimensions of the blade, as well as calculating what a listener will hear from a given observation point. The parameters available give the user the ability to model a wide variety of sword profiles, as well as other objects that produce similar sounds.

In addition, a version of the model has been implemented with control mapped to the movement of a Wii controller. It is envisaged that many of the parameters available to the user could be set by a game engine and the sounds therefore generated directly. This would allow a correlation of the sound generated to the movement and weapon being used by a character (Fig. 1).

2. THEORY

The fundamental aeroacoustic sound created when a cylinder is swung through the air is the Aeolian tone. This sound is generated when fluid passes around the cylinder and vortices are shed from opposite sides. This causes oscillating lift and drag forces which in turn produce tones of different strength and propagation directions. A brief overview of the Aeolian tone is given here, including some fundamental equations. For greater depth the reader is directed to [9].

    Harmonic                        Frequency    Gain
    Drag dipole fundamental (fd)    2 fl(t)      0.1 Il(t)
    Lift dipole 1st harmonic        3 fl(t)      0.6 Il(t)
    Drag dipole 1st harmonic        4 fl(t)      0.0125 Il(t)
    Lift dipole 2nd harmonic        5 fl(t)      0.1 Il(t)

Table 1. Additional frequencies and gains as a function of the lift dipole fundamental frequency fl(t).

2.1 Tone Frequency

Strouhal (1878) defined a useful relationship between the tone frequency fl, air speed u, and cylinder diameter d (Eqn. (1)). The variable St is known as the Strouhal number.

    St(t) = fl(t) d / u(t)    (1)

As air flows around a cylinder, vortices are shed, causing a fluctuating lift force normal to the flow, dominated by the fundamental frequency, fl. Simultaneously a fluctuating drag force is present with frequency, fd, twice that of the lift frequency. The drag acts in line with the air flow, and it was noted in [10] that "the amplitude of the fluctuating lift is approximately ten times greater than that of the fluctuating drag."

It was shown in [7] and confirmed in [11] that aeroacoustic sounds, in low flow speed situations, can be modelled by the summation of compact sources, namely monopoles, dipoles and quadrupoles. Aeolian tones can be represented by dipole sources, one for the lift frequency and one for the drag; each source includes a number of harmonics.

The turbulence around the cylinder affects the frequency of the tones produced. A measure of this turbulence is given by a dimensionless variable, the Reynolds number, Re, given by the relationship in Eqn. (2):

    Re(t) = ρ_air d u(t) / μ_air    (2)

where ρ_air and μ_air are the density and viscosity of air, respectively. The value of the Strouhal number has been found to be related to the Reynolds number. An experimental study of this relationship was performed in [12], giving the following equation:

    St(t) = St* + m / √Re(t)    (3)

where St* and m are constants given in Table 1 of [12] (additional values are calculated in [9]). The different values represent the turbulence regions of the flow, starting at laminar up to sub-critical.

With the Strouhal number obtained, and the diameter and air speed known, we can apply them to Eqn. (1) and obtain the fundamental frequency, fl(t), of the Aeolian tone, generated by the lift force. Once fl(t) has been calculated, we can calculate the drag dipole frequency and harmonics as shown in Table 1.
    Il(t) ≈ [√2 π² St(t)² ρ κ l b u(t)⁶ sin²θ cos²φ] / [32 c³ r² (1 − M(t) cos θ)⁴]    (4)

where M(t) is the Mach number, u(t)/c, where c is the speed of sound. The elevation angle, azimuth angle and distance between listener and source are given by θ, φ, and r respectively. The constant κ is set to 1. The correlation length, l, given as a multiple of diameters, indicates the span-wise length over which the vortex shedding is in phase; after this the vortices become decorrelated. Il(t) is applied to the fundamental frequency and scaled for the harmonics as shown in Table 1. The gain for the drag dipole is obtained from [10] and the lift dipole harmonic values from [15]. The drag dipole 1st harmonic was witnessed in computational simulations and added at the appropriate value.

2.3 Wake Noise

As the Reynolds number increases, the vortices diffuse rapidly and merge into a turbulent wake. The wake produces wide-band noise, modelled by lateral quadrupole sources whose intensities vary with u(t)⁸ [16].

It is noted in [16] that there is very little noise content below the lift dipole fundamental frequency. Further, it has been shown in [17] that the roll-off of the amplitude of the turbulent noise above the fundamental frequency is 1/fl(t)². An equation for calculating the wake noise is given in [9].

3. IMPLEMENTATION

Implementation of each individual compact source is given in [9]. The basic concept of the sword model is to line up a number of the compact sources to recreate the sound created as a blade swings through the air.

Like the original compact source model presented in [9], this model was developed using the Pure Data programming platform. The intensity given in Eqn. (4) is time-averaged, which caused an issue for this model due to the swing time being shorter than the averaging process. In the case of this model the intensity was implemented as an instantaneous value.

Our model has been given a simple interface to allow the user to manipulate physical properties such as sword thickness, length, top speed at the tip and sweep angles. Some or all of these parameters can be directly mapped to graphics and animation within a game environment.

Figure 2. Position of 8 compact sources and coordinates used in sword model.

For ease of design, the thickness down the blade of the sword was taken as a linear interpolation between the thickness set at the hilt and that set at the tip. The diameter of an individual source is then set as the thickness corresponding to its position on the blade. Equation (4) shows that the gain is proportional to u(t)⁶. As a sword arcs, the blade moves faster through the air at the tip than at the hilt. Hence the sources nearer the tip will have a greater influence over the sound generated than those nearer the hilt.

Eight compact sources were used to model the sword in our model. Two of the sources were fixed at the tip and the hilt. Five sources were then positioned back from the tip at a spacing of 7 times the diameter, d. The final source was placed equidistant between the last one from the group at the tip and the hilt. This is illustrated in Fig. 2.

This positioning of the six sources at the tip was equivalent to each source having a set correlation length of 7d (see Section 2.2). A range of correlation values from 17d to 3d is given in [18], depending on Reynolds number. A plot showing similar values is shown in [19]. 1

In this model, the position of the sources has to be chosen a priori, before the calculation of the Reynolds number; the value 7d was chosen as a compromise, covering a reasonable length of the sword for a wide range of speeds (and hence Reynolds numbers). For illustration, consider a sword 1.117 metres long with diameter ranging from 0.013 metres at the hilt to 0.008 metres at the tip. The 8 sources positioned as described equate to 43% of the blade length being represented by the sources. As shown in Fig. 2, the positioning of the sources captures the area which contributes most to the sound.

The coordinate system used for the model is shown in Fig. 2. The centre of the sword arc is at the origin and each sweep is set as circular. The distance travelled by each source during an arc is a great circle on a sphere with a radius equal to that of the source.

The user specifies the sweep of the sword by setting a start and finish for azimuth and elevation angles, as well as

1 In [5] a correlation length of 3d was used; the number of sources was set to match the length of the sword.
Table 3. Results of a Tukey post hoc test. **** = p<0.0001, *** = p<0.001, ** = p<0.01, * = p<0.05, - = p> 0.05
The swoosh sound generated from a sword as it is swung through the air is successfully replicated by an implementation of a number of compact source models given in [9]. Vortex shedding frequencies are generated in real time while calculating intensity, bandwidths, and harmonics. Listening tests show that the model with reduced physics is perceived as more authentic than the model with higher physics.⁴

⁴ A copy of all the sound files used in the perceptual evaluation test, as well as a copy of the PureData sword demo model, can be obtained from https://code.soundsoftware.ac.uk/projects/physical-model-of-a-sword-sound.

[5] Y. Dobashi, T. Yamamoto, and T. Nishita, "Real-time rendering of aerodynamic sound using sound textures based on computational fluid dynamics," ACM Transactions on Graphics, 2003.

[6] M. J. Lighthill, "On sound generated aerodynamically. I. General theory," Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences, 1952.

[7] N. Curle, "The influence of solid boundaries upon aerodynamic sound," Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 1955.

[8] M. Roger, "Coupled oscillations in the aeroacoustics of a katana blade," Journal of the Acoustical Society of America, 2008.

[9] R. Selfridge et al., "Physically derived synthesis model of Aeolian tones," Best Student Paper Award, 141st Audio Engineering Society Convention, USA, 2016.

[10] C. Cheong et al., "Computation of Aeolian tone from a circular cylinder using source models," Applied Acoustics, 2008.

[11] J. Gerrard, "Measurements of the sound from circular cylinders in an air stream," Proceedings of the Physical Society, Section B, 1955.

[12] U. Fey, M. König, and H. Eckelmann, "A new Strouhal–Reynolds-number relationship for the circular cylinder in the range 47 < Re < 2·10^5," Physics of Fluids, 1998.

[13] C. Norberg, "Effects of Reynolds number and a low-intensity freestream turbulence on the flow around a circular cylinder," Chalmers University, Göteborg, Sweden, Technological Publications, 1987.

[14] M. E. Goldstein, Aeroacoustics. New York: McGraw-Hill International Book Co., 1976.

[15] J. C. Hardin and S. L. Lamkin, "Aeroacoustic computation of cylinder wake flow," AIAA Journal, 1984.

[16] B. Etkin, G. K. Korbacher, and R. T. Keefe, "Acoustic radiation from a stationary cylinder in a fluid stream (Aeolian tones)," Journal of the Acoustical Society of America, 1957.

[17] A. Powell, "Similarity and turbulent jet noise," Journal of the Acoustical Society of America, 1959.

[18] O. M. Phillips, "The intensity of Aeolian tones," Journal of Fluid Mechanics, 1956.

[19] C. Norberg, "Flow around a circular cylinder: aspects of fluctuating lift," Journal of Fluids and Structures, 2001.

[20] M. J. Morrell and J. D. Reiss, "Inherent Doppler properties of spatial audio," in Audio Engineering Society Convention 129, 2010.

[21] N. Jillings et al., "Web Audio Evaluation Tool: A browser-based listening test environment," Sound and Music Computing, 2015.
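The Aeolian-tone models discussed above rest on the vortex-shedding relation f = St·U/d, where St is the Strouhal number, U the flow speed, and d the cylinder (blade) diameter. A minimal sketch of this relation; the Strouhal value, tip speed, and blade thickness below are illustrative, not taken from the paper:

```python
# Vortex-shedding (Aeolian tone) frequency from the Strouhal relation:
#   f = St * U / d
# St ~ 0.2 holds for a circular cylinder over a wide Reynolds-number range
# (cf. the Strouhal-Reynolds relationships discussed in [12, 19]).

def shedding_frequency(speed_mps: float, diameter_m: float,
                       strouhal: float = 0.2) -> float:
    """Fundamental vortex-shedding frequency in Hz."""
    return strouhal * speed_mps / diameter_m

# Illustrative values: a 10 mm thick blade section moving at 20 m/s.
f = shedding_frequency(20.0, 0.01)
print(round(f))  # 400 Hz
```

In a real-time implementation the speed (and hence the frequency) would be updated every frame from the sweep parameters of the swing.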
Figure 1: The occlusion effect measured as the change in sound pressure when blocking the ear canal with a shallowly inserted foam earplug [1]. The excitation signal was produced by a bone-conduction transducer.

…problems: 20% of the musicians with hearing problems used hearing protectors, while 6% of the musicians without hearing problems used hearing protectors. Among the musicians that answered the questionnaire, 47% used custom-molded earplugs, while 25% used disposable earplugs. According to the survey, the main reason why many classical musicians do not use hearing protectors is that it hinders their own performance (n=155). The second largest reason is that earplugs make it difficult to hear other musicians in an ensemble (n=88). Further reasons are that the sensation of hearing protectors is unpleasant (n=15) and that they are difficult to insert (n=12).

One specific problem when using hearing protectors, both hindering the musician's own performance and making it difficult to hear others play, is the occlusion effect. In a normal situation the ear canal is open, and the bone-conducted sound energy transmitted from the musician's own instrument into the ear canal through vibration of the ear canal walls exits to a large degree through the ear canal opening. When the ear canal is blocked with an earplug, much of the bone-conducted sound energy is directed to the eardrum. This causes the amplification of low-frequency sound, which is referred to as the occlusion effect. Another way to describe the occlusion effect is that the open ear canal acts as a high-pass filter; occluding the ear canal removes this high-pass filter, which causes low frequencies to become louder [1].

Figure 1 shows the occlusion effect measured by the increase in sound pressure when occluding the ear canal with a shallowly inserted foam earplug. As shown, the sound pressure produced by a bone-conduction transducer is amplified approximately 30 dB at frequencies between 100 and 200 Hz. Above these frequencies, the occlusion effect gradually decreases: at 1 kHz, the amplification is approximately 10 dB, while no amplification is observed at 2 kHz and above.

The low-frequency boost caused by the occlusion effect causes two problems for musicians using earplugs. Firstly, it colours the sound, making it darker and boomy. This makes it difficult for the musicians to hear the nuances of the sound of their own instruments. Secondly, the boost

There are several different approaches that can be utilized in order to mitigate the occlusion effect. Inserting earplugs deeper into the ear canals reduces the area of the ear canal walls that radiates bone-conducted sound and thereby also reduces the occlusion effect [1]. Deep earplugs are, however, often perceived as uncomfortable. On the other hand, using large earmuffs instead of earplugs also reduces the occlusion effect [1]. These are usually not used by performing musicians due to aesthetic reasons. Finally, vents can be used to allow the low-frequency energy in the ear canal to escape. This might be a useful approach in hearing aids, but in hearing protectors it will not only reduce the occlusion effect but also reduce the effectiveness of the hearing protection at low frequencies.

A further approach is to utilize noise-cancellation techniques in order to cancel out the bone-conducted sound radiating into the ear canal. In this case, the bone-conducted sound is recorded inside the ear canal and simultaneously played back from the earplug in opposite phase. The resulting destructive interference reduces the sound pressure inside the ear canal and thereby also reduces the occlusion effect.

Several previous studies have tried and succeeded in reducing the occlusion effect caused by an earplug using miniature microphones and loudspeakers inside the ear canal. The majority of this research has been done in the field of hearing aids. Mejia et al. [5] combined analog negative feedback with an acoustic vent, and were able to reduce the occlusion effect in hearing aids by 15 dB. Test subjects reported that their own voice was more natural when using this occlusion reduction.

Borges et al. [6] developed an adaptive occlusion canceller for hearing aids, obtaining an average attenuation of 6.3 dB. The test subjects reported that the occlusion canceller created a sensation of their ears opening. Sunohara et al. [7] also proposed an adaptive system for reducing the occlusion effect. Computer simulations of the proposed algorithm showed a maximum occlusion reduction of 30 dB. However, due to the adaptive nature of the algorithm, the effectiveness of the reduction varied over time.

Bernier and Voix [8] proposed a hearing protection solution for musicians similar to the one presented in this paper, but without monitoring of the musician's own instrument. Analog negative feedback was used to reduce the occlusion effect, while signals from microphones on the outside of the hearing protectors were processed with a digital signal processor and reproduced through the miniature loudspeakers inside the ear canals in order to obtain adjustable attenuation with natural timbre. The solution was able to achieve a reduction of the occlusion effect of approximately 10 dB. Bernier and Voix also proposed to compensate for the non-linearity of loudness perception in order to make the timbre independent of the current attenuation of the hearing protectors.
[Figure: system block diagram — instrument microphones and hear-through microphones are processed by the DSP (HRTF processing, reverb, and headphone and ear canal equalization) and reproduced through the in-ear headphones; the in-ear microphone signals are low-pass filtered and subtracted for occlusion reduction.]
Figure 5: Magnitude response of the headphones, measured in the ear canal simulator with the eardrum microphone. The solid blue line represents the unprocessed response, while the dashed red line represents the response with the occlusion reduction circuit active.

Figure 6: The effect of the occlusion reduction circuit on the magnitude response of the headphones, i.e., the difference between the solid blue line and the dashed red line in Fig. 5.
considered not to be loud enough to be disturbing, and the spectrum of the noise was also considered not to be disturbing.

4.2.3 Monitoring

The placement of the instrument microphones was chosen based on experimentation to provide a balanced and natural sound of the instrument. Although the prototype hearing protection solution supports connecting up to four instrument microphones, only two microphones were found to be necessary during the evaluation. These two microphones were in the end placed close to each other, since this provided a balanced sound of all the instruments tested. On the viola, the microphones were placed at the bridge. On the bassoon, trumpet, saxophone, oboe, and clarinet, the microphones were placed at the side of the bell. On the flute and piccolo, the microphones were placed approximately one-third of the instrument's length from the end of the instrument closest to the mouthpiece.

Amplifying the sound of the instrument microphones, processed with HRTFs and with reverberation added, clearly improved the sound of the instrument. Most musicians commented that the sound was not completely natural, but quite pleasant. In many cases, slightly amplifying the equalized signal from the hear-through microphones further improved the sound.

Some of the musicians thought that the reverberation added to the sound was pleasant, while others felt that it was unnatural, or that there was either too little or too much reverberation. Clearly, the level and quality of the reverberation must be easily adjustable depending on the surrounding space, the instrument, and the preferences of the musician.

4.2.4 Playing alone

Especially the trumpet player, the saxophonist, and the piccolo player thought that the electronic hearing protectors would be good for practicing alone. When practicing in a small room, the sound of the instrument is often too loud due to strong reflections. With the electronic hearing protectors, however, the sound of the instrument can be attenuated, while still remaining quite clear and pleasant, unlike when using normal earplugs. The piccolo player summarized the experience: "It sounds good and natural, but it doesn't hurt, like it usually does."

4.2.5 Playing with an orchestra

The musicians tried the electronic hearing protectors while listening to a symphony orchestra recording and playing their instruments. The musicians commented that they were able to hear both their own instrument and the rest of the orchestra well. The flutist also commented that the sound of her own instrument was richer and more inspirational when playing with the electronic hearing protectors compared with regular earplugs. The saxophonist said that he could hear his instrument well and play with the orchestra, but wondered how he would be able to adjust his dynamics with respect to the rest of the orchestra, since
Figure 7: Equalization of the headphone response when reproducing the signals from the hear-through and instrument microphones. The solid blue line represents the unequalized magnitude response with the occlusion reduction circuit active. The dashed red line represents the equalized response. The measurements were performed in the ear canal simulator with the eardrum microphone.
the electronic hearing protectors alter the level of the own instrument and the orchestra independently.

4.2.6 Technical implementation

The rubber tips of the headphones did not fit well in the ears of all the musicians, and thus different sized tips would naturally be needed for different ear canal sizes. The need for custom ear molds also came up.

Based on the musicians' comments, the sound of the hearing protectors (timbre, balance between the musician's own instrument and the orchestra, and reverberation) should be easily adjustable, but preferably minimal adjustments should be needed to get the sound right. A few musicians also pointed out that it might be beneficial to adjust the hear-through level separately for each ear, since sometimes there are loud instruments that should be attenuated only on one side. Naturally, this would affect localization cues and the spatial perception of the auditory scene, and experiments should be performed to determine if such an effect is acceptable.

Based on the evaluation, the instrument microphones could be mounted on so-called goosenecks, allowing easy adjustment of the microphone position. A single microphone might be enough in many cases, and seemed to provide a pleasant sound with the instruments in this evaluation (we actually used two microphones in all cases, but they were positioned very close to each other).

4.3 Discussion

The subjective evaluation of the prototype hearing protection system showed that the occlusion reduction in many cases improves the experience for the musician, by attenuating amplified bass frequencies and thus providing a timbre that is more pleasant and natural. With the ears plugged, the sound of the instrument can often appear as if it originates from inside the head, but in some cases the occlusion reduction can alleviate this problem.

The source of the noise that is heard when the occlusion reduction circuit is enabled is currently unclear. Unless the source is the self-noise of the microphones, it could probably be reduced with a more advanced circuit design. Another problem with the current occlusion reduction circuit design is the low-frequency boost that can be seen in Fig. 6. Although few musical instruments have energy at such low frequencies, this boost can, e.g., emphasize sound caused by the movement of the hearing protector cables, and should thus be addressed in future designs, if possible.

Adding the instrument microphone signals with HRTFs and reverberation makes the timbre brighter and more natural, and also makes the sound of the instrument seem as if it originates from the instrument and not from inside the head. For most wind instruments, a bone-conducted buzzing sound from the reed or lips is normally prominent when playing with the ears occluded, making playing with earplugs uncomfortable for many musicians. Adding the instrument microphone signals alleviates this problem, by altering the balance between air-conducted and bone-conducted sound to be more favourable and closer to natural.

Amplifying the instrument microphone signals will, however, affect the balance between the musician's own instrument and the other musicians' instruments when playing in an ensemble. The musician's instrument will thus sound louder than normal when compared with the sounds of other instruments. This balance will be, among other things, instrument specific, since all instruments have a different natural balance between air-conducted and bone-conducted sound. The balance between the musician's own instrument and the other instruments can of course be improved by amplifying the signals from the hear-through microphones. However, heavy amplification of these signals will counteract the function of the hearing protectors.

Adding reverberation to the instrument signals may be useful not only to aid in externalization and in integrating the sound of the instrument into the surrounding acoustic environment. It can also be used for personal practice, making the instrument sound as if it is being played in, e.g., a much larger hall. The possibilities to adjust the reverberation algorithm available on the DSP are, however, limited, and the available memory as well as the graphical development tool for programming the DSP limit the possibilities to develop other algorithms. Thus, a more versatile DSP would be needed in the future. Even with more sophisticated reverberation algorithms, the task of combining real and artificial reverberation is far from trivial. To minimize
the amount of manual adjustment needed, impulse sounds could be isolated from the hear-through microphone signals, and the characteristics of the real reverberation extracted from these.

The delay introduced by the DSP in the hear-through and instrument microphone signals can cause two different problems. A short delay will result in a comb-filtering effect, as the microphone signals reproduced through the headphones get summed with the sound leaking past the earplugs into the ear canals. The delay itself will become noticeable as it grows larger, affecting the performance of the musician. The allowable delay depends greatly on, e.g., the instrument type, but delays greater than 1 ms may already produce audible comb-filtering effects, and delays greater than 6.5 ms can be perceived as audible delays with some instruments [15]. The situation gets more complicated as the sound of the other musicians' instruments is also delayed, especially if several musicians are using similar hearing protectors. The delays of the prototype hearing protection solution were not measured. However, none of the musicians reported that the delay negatively affected their performance.

[4] R. Albrecht, T. Lokki, and L. Savioja, "A mobile augmented reality audio system with binaural microphones," in Proceedings of the Interacting with Sound Workshop, Stockholm, Sweden, 2011, pp. 7–11.

[5] J. Mejia, H. Dillon, and M. Fisher, "Active cancellation of occlusion: An electronic vent for hearing aids and hearing protectors," The Journal of the Acoustical Society of America, vol. 124, no. 1, pp. 235–240, 2008.

[6] R. Borges, M. Costa, J. Cordioli, and L. Assuiti, "An adaptive occlusion canceller for hearing aids," in Proceedings of the 21st European Signal Processing Conference, Marrakech, Morocco, 2013.

[7] M. Sunohara, K. Watanuki, and M. Tateno, "Occlusion reduction system for hearing aids using active noise control technique," Acoustical Science and Technology, vol. 35, no. 6, pp. 318–320, 2014.

[8] A. Bernier and J. Voix, "Active musicians' hearing protection device for enhanced perceptual comfort," in EuroNoise 2015, Maastricht, the Netherlands, 2015, pp. 1773–1778.
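The comb-filtering mechanism described above (the delayed hear-through signal summing with sound leaking past the earplug) can be sketched as a frequency response; the gains and the 1 ms delay below are illustrative, not measured values from the prototype:

```python
import math

def comb_response_db(freq_hz: float, delay_s: float, leak_gain: float = 0.5,
                     processed_gain: float = 1.0) -> float:
    """Magnitude (dB) of leakage plus delayed hear-through path at one
    frequency: H(f) = leak_gain + processed_gain * exp(-j*2*pi*f*delay)."""
    phase = -2 * math.pi * freq_hz * delay_s
    h = complex(leak_gain + processed_gain * math.cos(phase),
                processed_gain * math.sin(phase))
    return 20 * math.log10(abs(h))

# With a 1 ms delay the first notch falls at 500 Hz (odd multiples follow),
# well inside the playing range of most instruments:
delay = 0.001
print(round(comb_response_db(500, delay), 1))   # minimum (notch)
print(round(comb_response_db(1000, delay), 1))  # maximum (peak)
```

Shortening the delay pushes the first notch upward in frequency, which is one reason low-latency DSP paths matter in hear-through designs.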
where the constants A, B and C can be described in the following way:

A = \frac{1}{k^2} + \frac{c_{mn}}{\rho h k}, \quad B = \frac{2}{k^2} - \omega_{mn}^2, \quad C = \frac{c_{mn}}{\rho h k} - \frac{1}{k^2}.

Moreover, P^t is the input signal at time instance t at the specified input location (x_p, y_p), k is the chosen timestep 1/f_s, and c_mn is a loss coefficient that can be set per eigenfrequency, which gives a lot of control over the frequency content of the output sound.

3.1 Eigenfrequencies

The (angular) eigenfrequencies ω_mn can be calculated using the following equation [14]:

\omega_{mn} = \kappa \left( \frac{m^2 \pi^2}{L_x^2} + \frac{n^2 \pi^2}{L_y^2} \right). \quad (10)

…described in Section 2.2 have been implemented to ultimately create a loss coefficient for each individual eigenfrequency. If we set the porous medium to the furthest distance possible, the damping induced by this can be ignored. In Figure 1 the results can be seen.

The loss coefficients are then calculated in the following way [11, 13]:

c_{mn} = \frac{12 \ln(10)}{T_{60}}, \quad \text{where} \quad T_{60} = \frac{3 \ln(10)}{\sigma_{tot}}, \quad (13)

so that c_{mn} = 4\sigma_{tot}. Here, σ_tot is simply the sum of all the damping factors.

3.3 Dynamic outputs

The output can be retrieved at a specified point by, in a certain sense, inverting (7):

u_{out} = \sum_{m=1}^{M} \sum_{n=1}^{N} q_{mn} \, \Phi_{mn}(x_{out}, y_{out}). \quad (14)
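A minimal sketch of the modal readout in (14) for a simply supported rectangular plate; the modal shape Φ_mn(x, y) = sin(mπx/Lx)·sin(nπy/Ly) is assumed here, and the paper's normalization may differ:

```python
import math

def phi(m: int, n: int, x: float, y: float, Lx: float, Ly: float) -> float:
    """Simply supported plate modal shape (normalization assumed)."""
    return math.sin(m * math.pi * x / Lx) * math.sin(n * math.pi * y / Ly)

def output(q: dict, x_out: float, y_out: float, Lx: float, Ly: float) -> float:
    """Eq. (14): u_out = sum over (m, n) of q_mn * phi_mn(x_out, y_out)."""
    return sum(q_mn * phi(m, n, x_out, y_out, Lx, Ly)
               for (m, n), q_mn in q.items())

# Two active modes on a 2 m x 1 m plate, read out at the centre; the (2, 1)
# mode has a nodal line there and contributes nothing:
q = {(1, 1): 0.5, (2, 1): 0.25}  # illustrative modal states q_mn
u = output(q, 1.0, 0.5, 2.0, 1.0)
```

Moving the output position (x_out, y_out) over time only changes the Φ_mn weights, which is what makes the dynamic-output flanging effect cheap to compute.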
when these are decreased. This can be explained by the fact that the eigenfrequencies are stretched/shortened as the plate dimensions increase/decrease. The exact opposite happens when the plate thickness is changed: if the plate gets thicker, the pitch goes up, and the other way around. This can be explained by the increase in the stiffness factor in (2) when the thickness increases.

An interesting effect occurs when input and output are combined (not a 100% wet signal), as the pitch-bend effect only applies to the reverb and does not influence the dry input signal.

4.4 Dynamic loss coefficients

To the best of our knowledge, the current state of the art does not make use of loss coefficients dependent on physical parameters. We explored the possibility of making some of these parameters dynamic. The most interesting parameter we explored is the air density (ρ_a) in Equation (6). When increasing this, the sound (especially the high-frequency content) dies out sooner and becomes muffled, as can also be derived from the equation. Ultimately, we decided against implementing this feature, as it did not add much to the sound; it only decreased the naturalness of the plate reverb.

4.5 Computational speed

As stated before, the computational speed depends on the total number of eigenfrequencies (M(n) and N(m)) being accounted for in the algorithm. We found a correlation of 0.996 between this number and program speed, making decreasing the total number of eigenfrequencies our main focus. What the authors explain in [13] is the possibility of removing the eigenfrequencies that are not perceptually important. In our algorithm we propose:

d = \left( 2^{C/1200} - 1 \right) f_c, \quad (17)

where C is an arbitrary amount in cents and f_c is the current eigenfrequency (in Hz), starting with the lowest eigenfrequency accounted for in the algorithm, which can be calculated using f = ω/2π. The algorithm will then discard the eigenfrequencies if

f_i - f_c < d,

otherwise

f_c = f_i.

When C is set to 0.1 cents, the total number of eigenfrequencies accounted for (using the EMT140 properties) decreases from 18,218 to 7,932, which makes a big difference in computational time. Now, our implementation only needs roughly 7 seconds instead of 12 seconds to process 9.2 seconds of audio, showing that a real-time implementation would indeed be feasible. The algorithm has been tested using a MacBook Pro containing a 2.2 GHz Intel Core i7 processor. The authors indeed report no significant audible difference after reducing the number of modes this way.

In order to improve the computational time when the outputs are moving, Φ_mn(x_out, y_out) is evaluated for different values of t: t ∈ [1, f_s] with steps of 4/max([L_x, L_y]), before the update equation (9), and is then selected (instead of evaluated) at the appropriate times in the update equation. We found this step-size to be a good balance between speed and having minimal artefacts in the sound if the speed is not set too high.

Changing the plate dimensions greatly influences the computational time, as the eigenfrequencies need to be recalculated many times. To improve the computational time, the eigenfrequencies and the modal shapes are reevaluated every 100th time-step, according to the current (new) dimensions of the plate. As with the previously chosen time-step, we found this to be a good balance between speed and having minimal artefacts.

5. CONCLUSION AND FUTURE WORK

In this paper we presented a VA simulation of plate reverberation. The output of the implementation sounds natural, and parameters like in- and output positions and plate dimensions have been made dynamic, resulting in a unique and interesting flanging and pitch-bend effect (respectively), something which has not yet been achieved by the current state of the art. Another novelty is that we included thermoelastic and radiation damping that influence every eigenfrequency independently.

In the future we would like to create a real-time plugin containing all that is presented in this paper. In order to do so, the algorithm needs to be optimised, especially computationally heavy processes like moving the pickups and changing the dimensions of the plate. When this is achieved, the plugin will be tested with musicians in order to test usability and whether the output sound is satisfactory. Furthermore, we would like to add damping induced by a porous medium to our implementation. Even though it has been ignored in this work, it is an important feature in the EMT140. Lastly, we would like to explore the possibilities of changing other parameters of the plate using the presented model as a basis. The shape and the structure of the plate, for example, would be very interesting to make dynamic. Moreover, different materials, such as gold and aluminium, or even non-metallic materials like glass, could be explored and result in some interesting timbres.

6. REFERENCES

[1] V. Välimäki, F. Fontana, J. O. Smith, and U. Zölzer, "Introduction to the special issue on virtual analog audio effects and musical instruments," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 4, New York, USA, 2010.

[2] U. Audio. (2010) EMT140 plate reverb powered plug-in for UAD-2 (video). [Online]. Available: https://www.youtube.com/watch?v= KnMG OW5c0

[3] J. Menhorn. (2012) EMT140 plate reverb. [Online]. Available: http://designingsound.org/2012/12/emt-140-plate-reverb/
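The perceptual mode-reduction rule in (17) can be sketched as follows: a mode is kept only if it lies more than d, a C-cent band around the last kept eigenfrequency, above that frequency; the input frequencies below are illustrative:

```python
def reduce_modes(freqs_hz: list, cents: float = 0.1) -> list:
    """Keep an eigenfrequency only if it lies at least
    d = (2**(cents/1200) - 1) * f_c above the last kept one f_c (Eq. (17)).
    `freqs_hz` is assumed sorted ascending."""
    kept = []
    for f in freqs_hz:
        if not kept:
            kept.append(f)  # always keep the lowest eigenfrequency
            continue
        fc = kept[-1]
        d = (2 ** (cents / 1200) - 1) * fc
        if f - fc >= d:     # otherwise the mode is perceptually redundant
            kept.append(f)
    return kept

# Clusters of near-identical eigenfrequencies collapse to one representative:
freqs = [100.0, 100.001, 100.002, 200.0, 200.0001]
print(reduce_modes(freqs))  # [100.0, 200.0]
```

Because d grows with f_c, the rule thins the dense high-frequency end of the spectrum far more aggressively than the sparse low end, which matches where modes are perceptually indistinguishable.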
shrinking size [34]. This suggests that the congruence strength can be affected and even reversed depending on whether the stimuli are dynamic or static.

Furthermore, many descriptors deriving from various types of audio analysis are available for use in a potential AV mapping, and likewise for the descriptors of an image. This raises the question of which descriptors will be most effective in AV mappings. Finally, the auditory stimuli in most of the studies commonly undertaken tend to be limited to simple synthetic tones. Considering that most natural and musical sounds are more complex than pure sine tones, whether these results still apply to complex real-world sounds is uncertain. These are important questions to answer for the development of Morpheme, as it uses a multidimensional mapping, and concatenative sound synthesis uses pre-recorded audio material to synthesise sound that can have complex timbre characteristics.

2. MORPHEME

Morpheme is an application that allows sound synthesis to be controlled by sketching. The underlying synthesis technique used is concatenative synthesis. For more information about the graphical user interface of the system, please see [35]. Morpheme extracts visual features from the analysis of the sketch drawn and uses these features to query audio units from CataRT's database [10]. During playback, windowed analysis is performed on the pixels of the sketch. A window scans the sketch from left to right, one pixel at every clock cycle, the rate of which is determined by the user. Only the areas of the canvas that are within the boundaries of the window area are analysed. The window dimensions can be determined by the user; however, the default size of the analysis window is 9 pixels wide by 240 pixels high. For more detailed information about the implementation, see [36].

2.1 The AV Mapping

Some of the most effective empirically validated audio-visual associations discussed in the introduction were selected to develop two mappings that enable the selection of audio units from the corpus. One of the main distinctions between the mappings is that one is achromatic while the other is chromatic. The first mapping is considered achromatic because the visual descriptors extracted from Morpheme's sketch do not use any colour information, whereas the second mapping does. Table 1 and Table 2 show the details of the two mappings.

Table 1. Achromatic mapping associations between audio and visual descriptors for the retrieval of audio.

Visual Features         Audio Features
Texture granularity     Spectral flatness
Vertical position       Pitch
Texture repetitiveness  Periodicity
Size                    Loudness
Horizontal length       Duration

Table 2. Chromatic mapping associations between audio and visual descriptors for the retrieval of audio.

Visual Features      Audio Features
Color variance       Spectral flatness
Color brightness     Spectral centroid
Brightness variance  Periodicity
Size                 Loudness
Horizontal length    Duration

3. METHOD

An experiment was conducted to answer the following questions:

- Which overall mapping (chromatic/achromatic) is the most effective?
- Do the sounds used as the source audio (corpus) by the concatenative synthesiser have an effect on the effectiveness of the mappings?
- Is the effectiveness of the mappings affected by the experience of the respondents with sound/music?

3.1 Subjects

A total of 110 volunteers took part in the study. The study was conducted in two conditions, in what will be referred to as the supervised and unsupervised conditions. The supervised study took place in Edinburgh Napier University's Auralization suite. Participants used Beyerdynamic DT 770 Pro monitoring headphones to listen to the audio stimuli. A 17.3 inch laptop screen was used for viewing the visual stimuli. SurveyGismo was used for viewing the stimuli and for recording the participants' responses. The unsupervised study ran as an online survey. In the supervised study a total of 44 participants volunteered to take part. Sixteen were sound practitioners (referred to as experts), and twenty-eight were non-sound practitioners (referred to as non-experts). All of the expert participants except two played a musical instrument, and the self-reported levels of expertise were 4 basic, 6 intermediate and 4 advanced. In the unsupervised study, participants were similarly divided into two groups. There were a total of 66 participants, recruited via a post on audio and music related mailing lists. The expert group consisted of 39 participants and the non-expert group of 27. All of the expert participants except one played a musical instrument, and the self-reported levels of expertise were 11 basic, 14 intermediate and 14 advanced. Twenty-eight of the expert participants had received formal music theory training for at least one year. All of the participants reported using analogue and digital equipment for sound synthesis, signal processing, sequencing and audio programming tools.
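The left-to-right windowed analysis described above can be sketched as follows. This is a simplified illustration, not Morpheme's actual implementation: the feature extractors are reduced to two of the mapped descriptors (size and brightness), and all names are hypothetical:

```python
# Scans a grayscale sketch (2D list, rows x cols, values in 0..1) with a
# window 9 px wide spanning the canvas height, advancing one pixel per
# clock cycle, and derives per-step visual features for the AV mapping.

def analyse_sketch(canvas, window_w=9):
    rows, cols = len(canvas), len(canvas[0])
    features = []
    for x0 in range(0, cols - window_w + 1):
        window = [row[x0:x0 + window_w] for row in canvas]
        ink = [v for r in window for v in r if v > 0]
        size = len(ink)                              # "Size" -> loudness
        brightness = sum(ink) / size if size else 0  # -> spectral centroid
        features.append({"size": size, "brightness": brightness})
    return features

# A tiny 4 x 12 canvas with two bright pixels on the left:
canvas = [[0.0] * 12 for _ in range(4)]
canvas[1][2] = canvas[2][3] = 0.8
steps = analyse_sketch(canvas)
print(len(steps), steps[0])
```

In the real system, the per-step feature vector would then be used to select the nearest audio unit in the corpus according to Table 1 or Table 2.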
3.2 Stimuli

Four audio stimuli were tested for each mapping, i.e. four sounds per mapping, resulting in a total of eight audio stimuli. Each sound was synthesized using one of the four corpora (i.e. String, Wind, Impacts, Birds). The resulting sounds used as stimuli had a duration of 8 seconds each. The string and the birds corpora consisted of mainly harmonic and periodic sounds, while the wind and the impacts corpora consisted of mainly unpitched, non-harmonic sounds. The string and the wind corpora consisted of a small number of audio units that were very homogeneous (i.e. all audio units had very similar characteristics in terms of loudness, periodicity, and pitch), while the impacts and the bird corpora consisted of audio units that were non-homogeneous (i.e. more variation in terms of loudness, periodicity, and frequency content).
served. No significant relationship was noted between correct selection rate and the mapping, χ²(1, N = 224) = 1.143, p = .28.

For the expert group in the supervised condition, data analysis shows a strong relationship between the correct selection rate and the audio corpora, χ²(3, N = 128) = 9.86, p = .02. No significant relationship was noted between correct selection rate and the mapping, χ²(1, N = 128) = 3.33, p = .068.

For the non-expert group in the unsupervised condition, data analysis shows that there is a strong relationship between correct selection rate and the audio corpora, χ²(3, N = 216) = 11.37, p = .010. No significant relationship was revealed between correct selection rate and the mappings, χ²(1, N = 216) = .70, p = .40.

For the expert group in the unsupervised condition, data analysis shows that there is a strong relationship between the ability of the subjects to identify the correct image and the audio corpora, χ²(3, N = 312) = 23.15, p < .001. Again, no significant relationship was observed between the correct selection rate and the mappings, χ²(1, N = 312) = 3.07, p = .079.

Post-hoc testing of each group/condition chi-square result was performed for each level of the factor corpus in order to determine which factor levels contributed to the statistical significance that was observed. The results of the post-hoc test are presented in Table 3. For more information about the approach followed to perform post-hoc testing refer to [37] and [38].

4.2 Comparison between expert & non-expert groups

This section presents the results obtained from the comparison between the expert and the non-expert groups in the supervised and unsupervised conditions. A Pearson's chi-square test of independence was performed on three between-subject variables to test the null hypothesis of no association between: (i) the subjects' skills and correct image selection rate, (ii) the subjects' skills and the difference in correct image selection rates between the two mappings, and (iii) the subjects' skills and the difference in correct image selection rates between the four corpora.

The comparison between the expert and the non-expert groups in the supervised condition shows that there is an association between the variables subjects' skills and correct selection rate, χ²(1, N = 352) = 4.43, p = .035. Further testing of the relationship between the variables correct detection and skills was performed by layering the participant responses based on the two levels of the mapping factor. The results of the test show that the association between the subjects' skills and correct selection rate was stronger when the chromatic mapping was tested, χ²(1, N = 176) = 4.27, p = .039, and not strong when the achromatic mapping was tested, χ²(1, N = 176) = .88, p = .347.

The comparison between the expert and the non-expert groups in the unsupervised condition shows no association between the variables subjects' skills and correct selection rate, χ²(1, N = 528) = 0.10, p = .747. Further testing of the relationship between the variables correct detection and skills was performed by layering the participant responses based on the two levels of the mapping factor. The results of the test show that there was no association between the subjects' skills and correct selection rate for either of the mappings: achromatic, χ²(1, N = 264) = .011, p = .917; chromatic, χ²(1, N = 264) = .332, p = .564. Post-hoc testing showed that the relationship between correct detection and subjects' skills in the different experiment conditions was not significant, see Table 4.

4.3 Confidence Ratings

This section presents the results obtained from the second question of the experiment, where expert and non-expert subjects in the supervised and unsupervised conditions reported on their confidence levels regarding their decision (i.e. the corresponding image for each audio stimulus). A repeated-measures ANOVA using a general linear model was computed with the confidence ratings as within-subjects factor and the corpora and the mapping as between-subjects factors. The analysis aimed to test the null hypothesis of no association between the subjects' confidence ratings and the mappings, or the audio corpora.

Figure 2. Overall effect of the corpus on the subject's ability to identify the correct image.

Figure 3. Overall effect of the mapping on the subject's ability to detect the correct stimuli.
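The χ²(df, N) values reported above follow the standard Pearson test of independence on observed counts. A minimal sketch (illustrative Python, not the authors' analysis scripts):

```python
def chi_square(table):
    """Pearson chi-square test of independence for a two-way table of
    observed counts; returns the statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # under independence
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df
```

The statistic is then compared against the χ² distribution with `df` degrees of freedom to obtain the p values quoted in the text.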
SMC2017-323
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
Table 3. Post-hoc testing of each group/condition chi-square result, estimating p values from the chi-square residuals for each level of a factor and comparing these to a Bonferroni-corrected p value. An asterisk (*) indicates statistical significance.

          Experts (s)    Experts (u)    Non-experts (s)  Non-experts (u)
Corpus    χ²     p       χ²     p       χ²      p        χ²     p
Birds     0.17   .67     0.11   .73     0.2     .64      0.93   .33
Impacts   0.17   .67     0.04   .83     1.1     .28      0.93   .33
String    6.3    .01     14.8*  <.006   6.8     .008     6.6    .009
Wind      6.3    .01     15.8*  <.006   17.3*   <.006    6.6    .009
N         32             78             56               54

Total: Experts (s) χ²(3, N = 128) = 9.8, p < .05; Experts (u) χ²(3, N = 312) = 23.1, p < .01; Non-experts (s) χ²(3, N = 224) = 19.2, p < .01; Non-experts (u) χ²(3, N = 226) = 11.3, p < .05.
s = supervised / u = unsupervised; Bonferroni-adjusted p value: 0.0062
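The cellwise post-hoc procedure of [37] and [38] compares adjusted standardized residuals against a Bonferroni-corrected criterion. A sketch of the residual computation (illustrative Python, not the authors' scripts):

```python
def adjusted_residuals(table):
    """Adjusted standardized residuals for a two-way contingency table of
    counts (cf. Beasley & Schumacker [37]; Garcia-Perez & Nunez-Anton [38]).
    Each residual is approximately standard normal under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    out = []
    for i, row in enumerate(table):
        out.append([])
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            # adjustment removes the dependence on the marginal proportions
            denom = (expected * (1 - row_totals[i] / n)
                     * (1 - col_totals[j] / n)) ** 0.5
            out[-1].append((obs - expected) / denom)
    return out
```

Squaring an adjusted residual yields a 1-df chi-square value whose p value can be compared to the Bonferroni-adjusted alpha, as done for each cell of Table 3.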
Table 4.

Skills           χ²     p      N
Expert (s)       .6     .41    128
Expert (u)       2.1    .14    312
Non-expert (s)   8.5    .033   224
Non-expert (u)   .43    .51    216
Total: χ²(3, N = 880) = 8.711, p = .033
s = supervised / u = unsupervised; Bonferroni-adjusted p value: 0.00625

Analysis of the subjects' confidence ratings revealed no significant differences due to the corpus, with an exception for the experts in the unsupervised condition, see Table 5. Significant differences were noted between each mapping for the means of the expert group. No interactions between the factors mapping and audio corpus were revealed. The results of the analyses are shown in Figure 4. An additional ANOVA model was computed to examine the relationship between reported confidence and correct image identification, which was found to be significant for all the groups with the only exception of the expert group in the supervised condition. The results of this analysis are shown in Table 6.

The answer to the first question of this study is that the choice of mapping did not affect participants' image selection success. A trend that can be observed in the data indicates that participants selected the correct images well above chance levels, regardless of the mapping.

The answer to the second question of the study is that the audio corpus used to synthesise the audio stimuli from the images had a significant effect on the subjects' ability to identify the correct image. The image selection success rate was higher when the string corpus was used, followed by the birds and the impact corpus, while the lowest success rate was observed for the wind corpus. The analysis showed that the corpora that had the strongest effect on the subjects' successful selection rate were the string and the wind.

In response to the third question, there was no statistically significant difference between the results of the expert and non-expert groups.
Table 5. General ANOVA model computed to investigate the effect of the AV associations, corpus and mapping on the perceived similarity reported by the subjects. An asterisk (*) indicates statistical significance.

Table 6. General ANOVA model computed to investigate the effect of correct detection on the reported confidence level. An asterisk (*) indicates statistical significance.

                 Correct selection
Group            F        p        Total
Non-experts (s)  7.53*    <.005    223
Non-experts (u)  6.15*    <.05     215
Experts (s)      .34      .5       127
Experts (u)      16.73*   <.001    311
df = 3; s = supervised / u = unsupervised
Figure 4. Data means and confidence intervals of participants' confidence ratings for the mapping and corpus factors for all of the groups. If intervals do not overlap, the corresponding means differ significantly.

5. DISCUSSION

It is worth noting that overall there are consistencies in the subjects' responses across the groups and conditions. The fact that there are consistencies between the two subject groups (i.e. experts, non-experts) suggests that musical/sound training was not important. Furthermore, consistencies in the subjects' responses between the two experimental conditions (controlled environment and online) suggest that both approaches to gathering data are equally well suited to this type of experiment.

The first result of the study is that the choice of audio corpus had an effect on participants' image selection success rate and therefore affects the effectiveness of the mappings. A first interpretation is that when the sound corpus is not harmonic and continuous, the resulting sounds can be noisy and lack clarity. This could affect the effectiveness of the mapping by causing a reduction in the salience of the audio-visual associations. This in turn weakens the ability of the participants to pay attention to the causal relationship between the image and the sound.

As stated in Section 4.4, there were no statistically significant differences between expert and non-expert subjects. The fact that musical/sound training was not a very significant factor suggests that the correspondences might be underpinned by either similarities between auditory and visual perception and/or by cultural conventions, but not specifically by the acquisition of advanced musical and listening skills. These findings are in agreement with [18] and oppose the findings of [17], [20], [39].

Finally, the participants' confidence ratings revealed no correlation between confidence levels and correct image selection. Although a trend can be observed between correct image selection and confidence levels reported by subjects (i.e. confidence levels on average are higher when participants have responded correctly rather than incorrectly), the confidence levels reported on incorrect responses are very high (i.e. in most cases well above 50%), which does not indicate that participants were aware that their responses were incorrect. Overall, participants felt more confident when the chromatic mapping and the string corpus were used. Furthermore, the confidence ratings show that when the wind corpus was used, participants felt least confident; however, for the other three corpora the results did not follow a strong correlation pattern.

6. CONCLUSIONS

We reported on the evaluation of Morpheme's audio-visual mappings, used to control a corpus-based concatenative synthesizer through painting. One of the main motivations of the present investigation was to explore how best to utilise empirically validated audio-visual associations in the design of sound synthesis user interfaces. Many studies have shown that people often map stimulus properties from different sensory modalities onto each other in a manner that is surprisingly consistent. Furthermore, previous research suggests that correspondences of stimulus properties may be innate or learned at very early stages of a person's life [40]–[42]. This research project is motivated by the idea that mappings which are consistent with users' prior perceptual knowledge and the subtleties of multisensory aspects of perception can better accommodate their sensorimotor skills and support the comprehension of information and interaction with computer systems.

The present experiment tested the effects of four concatenative synthesiser corpora on the effectiveness of two AV mappings, measured by the ability of subjects to identify the image used to synthesise a sound they were presented with. The results showed that when multiple parameters vary simultaneously, the nature of the sounds present in the audio corpus has a moderating effect on the mapping effectiveness. For example, when the corpus consisted of harmonic sounds, participants' detection rates significantly increased in comparison to when using inharmonic sounds. Finally, results indicated that sound/musical training had no significant effect on the perceived similarity between AV features or the discrimination ability of the subjects.

Further work includes making improvements to the mapping of Morpheme, for instance to minimise interactions between the audio parameters used (e.g. spectral flatness and periodicity). A set of less correlated audio features is likely to lead to more effective mappings. Another direction for
future work involves testing a wider range of audio-visual features.

Acknowledgments

The authors would like to acknowledge the Institute for Informatics & Digital Innovation for their financial support and the IMTR research centre for sharing the CataRT system.

7. REFERENCES

[1] B. Caramiaux, J. Françoise, F. Bevilacqua, and N. Schnell, "Mapping Through Listening," Computer Music Journal, vol. 38, 2014, pp. 1–30.
[2] B. Caramiaux, F. Bevilacqua, and N. Schnell, "Towards a gesture-sound cross-modal analysis," in Gesture in Embodied Communication and Human-Computer Interaction, S. Kopp and I. Wachsmuth, Eds. Springer, Heidelberg, 2010, pp. 158–170.
[3] M. Leman, Embodied Music Cognition and Mediation Technology. MIT Press, 2008.
[4] R. I. Godøy, "Images of Sonic Objects," Organised Sound, vol. 15, no. 1, p. 54, Mar. 2010.
[5] D. Smalley, "Spectromorphology: explaining sound-shapes," Organised Sound, vol. 2, pp. 107–126, 1997.
[6] T. Wishart, On Sonic Art. Harwood Academic Publishers, 1996.
[7] M. Blackburn, "The Visual Sound-Shapes of Spectromorphology: an illustrative guide to composition," Organised Sound, vol. 16, no. 1, pp. 5–13, Feb. 2011.
[8] A. Hunt and R. Kirk, "Radical user interfaces for real-time control," in Proc. 25th EUROMICRO Conf. Informatics Theory Pract. New Millenn., 1999, pp. 6–12 vol. 2.
[9] A. Hunt, M. Wanderley, and R. Kirk, "Towards a Model for Instrumental Mapping in Expert Musical Interaction," in International Computer Music Conference, 2000.
[10] D. Schwarz, G. Beller, B. Verbrugghe, and S. Britton, "Real-Time Corpus-Based Concatenative Synthesis with CataRT," in Digital Audio FX (DAFx), 2006, pp. 279–282.
[11] L. E. Marks, "On associations of light and sound: The mediation of brightness," J. Psychol., vol. 87, pp. 173–188, 1983.
[12] I. H. Bernstein and B. A. Edelstein, "Effects of some variations in auditory input upon visual choice reaction time," J. Exp. Psychol., vol. 87, pp. 241–247, 1971.
[13] T. L. Hubbard, "Synesthesia-like mappings of lightness, pitch, and melodic interval," Am. J. Psychol., vol. 109, no. 2, pp. 219–238, 1996.
[14] R. D. Melara and T. P. O'Brien, "Interaction between synesthetically corresponding dimensions," J. Exp. Psychol., vol. 116, pp. 323–336, 1987.
[15] R. D. Melara, "Dimensional interaction between color and pitch," J. Exp. Psychol. Hum. Percept. Perform., vol. 15, pp. 69–79, 1989.
[16] G. R. Patching and P. T. Quinlan, "Garner and congruence effects in the speeded classification of bimodal signals," J. Exp. Psychol. Hum. Percept. Perform., vol. 28, pp. 755–775, 2002.
[17] M. B. Küssner, "Shape, drawing and gesture: cross-modal mappings of sound and music," King's College London, 2014.
[18] S. Lipscomb and E. Kim, "Perceived match between visual parameters and auditory correlates: An experimental multimedia investigation," in International Conference on Music Perception and Cognition, 2004, pp. 72–75.
[19] F. Berthaut, M. Desainte-Catherine, and M. Hachet, "Combining audiovisual mappings for 3D musical interaction," in International Computer Music Conference, 2010.
[20] R. Walker, "The effects of culture, environment, age, and musical training on choices of visual," vol. 42, no. 5, pp. 491–502, 1987.
[21] L. B. Smith and M. D. Sera, "A Developmental Analysis of the Polar Structure of Dimensions," Cogn. Psychol., vol. 24, pp. 99–142, 1992.
[22] C. Spence, "Crossmodal correspondences: a tutorial review," Atten. Percept. Psychophys., vol. 73, no. 4, pp. 971–995, May 2011.
[23] E. Ben-Artzi and L. E. Marks, "Visual-auditory interaction in speeded classification: Role of stimulus difference," Percept. Psychophys., vol. 57, no. 8, pp. 1151–1162, Nov. 1995.
[24] Z. Eitan and R. Timmers, "Beethoven's last piano sonata and those who follow crocodiles: cross-domain mappings of auditory pitch in a musical context," Cognition, vol. 114, no. 3, pp. 405–422, 2010.
[25] K. Evans and A. Treisman, "Natural cross-modal mappings between visual and auditory features," J. Vis., vol. 10, pp. 1–12, 2010.
[26] E. Rusconi, B. Kwan, B. L. Giordano, C. Umiltà, and B. Butterworth, "Spatial representation of pitch height: the SMARC effect," Cognition, vol. 99, no. 2, pp. 113–129, 2006.
[27] S. Hidaka, W. Teramoto, M. Keetels, and J. Vroomen, "Effect of pitch-space correspondence on sound-induced visual motion perception," Exp. Brain Res., vol. 231, no. 1, pp. 117–126, Nov. 2013.
[28] L. E. Marks and Z. Eitan, "Garner's paradigm and audiovisual correspondence in dynamic stimuli: Pitch and vertical direction," Seeing Perceiving, vol. 25, p. 70, Jan. 2012.
[29] L. E. Marks, "On Cross-Modal Similarity: The Perceptual Structure of Pitch, Loudness, and Brightness," J. Exp. Psychol. Hum. Percept. Perform., vol. 15, no. 3, pp. 586–602, 1989.
[30] R. Chiou and A. N. Rich, "Cross-modality correspondence between pitch and spatial location modulates attentional orienting," Perception, vol. 41, pp. 339–353, 2012.
[31] K. Giannakis, "Sound Mosaics," Middlesex University, 2001.
[32] K. Giannakis, "A comparative evaluation of auditory-visual mappings for sound visualisation," Organised Sound, vol. 11, no. 3, pp. 297–307, 2006.
[33] Z. Eitan, "How pitch and loudness shape musical space and motion: new findings and persisting questions," in The Psychology of Music in Multimedia, S.-L. Tan, A. J. Cohen, S. D. Lipscomb, and R. A. Kendall, Eds. Oxford University Press, 2013, pp. 161–187.
[34] Z. Eitan, A. Schupak, A. Gotler, and L. E. Marks, "Lower pitch is larger, yet falling pitches shrink: Interaction of pitch change and size change in speeded discrimination," in Proc. Fechner Day, 2011.
[35] A. Tsiros and G. Leplâtre, "Evaluation of a Sketching Interface to Control a Concatenative Synthesiser," in 42nd International Computer Music Conference, 2016, pp. 1–5.
[36] A. Tsiros, "A Multidimensional Sketching Interface for Visual Interaction with Corpus-Based Concatenative Sound Synthesis," Edinburgh Napier University, 2016.
[37] T. M. Beasley and R. E. Schumacker, "Multiple Regression Approach to Analyzing Contingency Tables: Post Hoc and Planned Comparison Procedures," Journal of Experimental Education, vol. 64, no. 1, pp. 79–93, 1995.
[38] M. A. García-Pérez and V. Núñez-Antón, "Cellwise Residual Analysis in Two-Way Contingency Tables," Educ. Psychol. Meas., vol. 63, no. 5, pp. 825–839, 2003.
[39] M. B. Küssner and D. Leech-Wilkinson, "Investigating the influence of musical training on cross-modal correspondences and sensorimotor skills in a real-time drawing paradigm," Psychol. Music, vol. 42, no. 3, pp. 448–469, Jul. 2013.
[40] S. Jeschonek, S. Pauen, and L. Babocsai, "Cross-modal mapping of visual and acoustic displays in infants: The effect of dynamic and static components," J. Dev. Psychol., vol. 10, pp. 337–358, 2012.
[41] S. Wagner, E. Winner, D. Cicchetti, and H. Gardner, "Metaphorical mapping in human infants," Child Dev., vol. 52, pp. 728–731, 1981.
[42] P. Walker, J. G. Bremner, U. Mason, J. Spring, K. Mattock, A. Slater, and S. P. Johnson, "Preverbal infants' sensitivity to synaesthetic cross-modality correspondences," Psychol. Sci., vol. 21, no. 1, pp. 21–25, Jan. 2010.
ABSTRACT

Copyright: © 2017 Jonas Holfelt et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. Analysis and feature extraction.
2. Mapping between features and effect parameters.
3. Transformation and re-synthesis.
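The three stages listed above can be sketched as a single processing loop (illustrative Python; the callables are hypothetical stand-ins, not the authors' implementation):

```python
def adaptive_effect(frames, extract, mapping, transform):
    """Auto-adaptive DAFx chain following the three stages above:
    analysis, feature-to-parameter mapping, transformation/re-synthesis.
    All three callables are hypothetical stand-ins."""
    out = []
    for frame in frames:
        features = extract(frame)              # 1. analysis / feature extraction
        params = mapping(features)             # 2. mapping to effect parameters
        out.append(transform(frame, params))   # 3. transformation / re-synthesis
    return out
```

For example, a peak-level extractor could drive a gain parameter, making the effect react frame by frame to the input signal.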
2. RELATED WORK

Research on digital audio effects is highly interdisciplinary in the sense that a wide variety of topics and technical principles have to be understood for their design and implementation. This section takes a methodical approach in considering the various research paths explored during the process of this project.

2.1 Adaptive Digital Audio Effects

Verfaille et al. proposed a new type of audio transformation: adaptive digital audio effects (A-DAFx). This type of effect is also referred to as dynamic processing, intelligent effects, or content-based transformations, since the effect parameters are altered by the input source over time [4]. The idea of altering the output sound based on the input already exists in commonly used effects such as compressors and auto-tune. However, adaptiveness introduces a new kind of control, since specific sound features (or sound descriptors) can be mapped to alter control parameters of the output. The implemented echo effect uses the auto-adaptive approach, which is defined as an effect where acoustic features are extracted from the same source as that to which the effect is applied.

The mapping of the sound features is an important aspect of A-DAFx. Several different mapping strategies have been suggested for adaptive effects, covering both sound and gesture inputs [3, 4, 11, 12]. However, only sound features were considered for the purpose of this project. An important consideration for mapping of sound features is the mapping strategy between extraction and manipulation parameters. This choice can be difficult to argue for, since it depends on musical intention [4]. If the mapping gets too complex, the user's ability to manipulate the mapping depends on their prior knowledge about DAFx and the techniques used.

2.2 Examples of A-DAFx

At this time, only three commercial adaptive products are known to be aimed at guitarists: the TE-2 Tera Echo, MO-2 Multi Overtone and DA-2 Adaptive Distortion by BOSS [13–15]. However, no technical descriptions of these effects are known to be available.

2.3 Analysis and Extraction of Expressive Acoustic Features

…intentions of the performer. Friberg et al. stated that the emotional expression of a music performance can be accurately predicted using a limited set of acoustic tone parameters [5].

Studies have found that variations of timing and amplitude are essential features of auditory communication and music, as they carry information about expressive intentions and provide a higher resolution of auditory information [6]. In addition to the variety of existing software for extracting tone parameters (e.g. audio-to-MIDI converters), Friberg et al. developed the algorithm CUEX [5] to identify expressive variations in music performances, rather than providing accurate musical notations. The CUEX algorithm is able to extract a wide variety of parameters, but places significant emphasis on onset and offset detection, as it contributes to many temporal feature extractions, such as tone rate, articulation and onset velocity [5]. Hence, we chose tone rate to provide a rough tempo estimation, and fluctuations in the tone rate are used to identify temporal expressive intentions for our A-DAFx.

Timbral deviations also play an important role in enhancing the expressive value of a performance, and have a significant effect on the interpretation of music. The brightness attribute, for example, has a direct correlation to the input force or energy of the sound. As a result, musicians are able to utilise this parameter for control and expression, which demonstrates a relation between acoustical timbre and expressive intention [7].

Loudness is another expressive perceptual attribute related to the energy of a signal. Although amplitude variations do not influence expressivity to the same extent as temporal variations, they still have a significant effect, and as the energy of a signal can be intentionally controlled, it serves as a powerful tool for musical expression. Contrary to loudness, energy is not influenced by frequency, but this low-level feature is able to provide sufficient information about the dynamics of the performance [6].

A guitar, as a string instrument, produces nearly harmonic sounds, and its pitch can be determined as a result [1]. Harmonies and melodies carry important information on musical structure [3]. Yet musical structure is often established by a composer; therefore, variations in pitch are not necessarily related to expressive performance. A focus on improvisational performance, however, would allow the investigation of temporal changes in pitch as an expressive detail. The physical counterpart of pitch is the fundamental frequency, which can serve as an adequate indicator.
how to extract information and expressive features from the electric guitar and how to map these features to the A-DAFx parameters. A set of design criteria were formed to frame and guide the project, with a special emphasis on implementing A-DAFx that would be beneficial for electric guitar players.

Design requirements:
- It should function in most Digital Audio Workstations (DAWs) as a VST format.
- The plugin should work in real-time without notable latency.
- The extracted features should be generalised and mapped to work with several different effects.

For the purpose of creating an A-DAFx, a digital implementation of the spatial echo effect was chosen. To create a more interesting echo effect, a variety of modulation and timbral effects were implemented in the echo engine. These included highpass and lowpass filters, vibrato, reverse delay, and saturation (tube). The choice of effects was influenced by commercial delay plugins that utilise similar effects, for example Strymon's Timeline, ChaseBliss Audio's Tonal Recall, and Red Panda's Particle Granular Delay/Pitch Shifter [17]. During the design phase, the DAFx was tested and tweaked using MATLAB by MathWorks and its Audio System Toolbox library. The white paper by MathWorks' DeVane and Bunkhelia served as inspiration on how to work with the library [18].

…replicates the nonlinearities introduced in valve amplifiers by asymmetrical clipping of the input audio signal x:

f(x) = \begin{cases} \dfrac{x - Q}{1 - e^{-\mathrm{dist}\,(x - Q)}} + \dfrac{Q}{1 - e^{\mathrm{dist}\,Q}}, & Q \neq 0,\; x \neq Q, \\ \dfrac{1}{\mathrm{dist}} + \dfrac{Q}{1 - e^{\mathrm{dist}\,Q}}, & x = Q \end{cases} \quad (1)

At low input levels there should be no distortion in the signal. Clipping and limiting should occur at large negative input values and be approximately linear for positive values [3]. The work point Q controls how much nonlinear clipping is introduced in the signal. Large negative values for Q will provide a more linear output, and dist increases the amount of distortion in the signal.

4.2.2 Vibrato

The implementation of the vibrato was inspired by Zölzer et al.'s vibrato algorithm [3]. This algorithm produces vibrato in an audio signal by creating pitch variations in a way that is similar to the Doppler effect. Varying the distance between the sound source and the listener is the same as varying the delay in the signal. If the variation in the delay is periodic, then the pitch variation is also periodic. To implement this, a circular buffer is used with a read and a write index. The write index is used for writing the audio input into the buffer. The read index reads the delayed audio input from the buffer in an oscillating manner, creating periodic variations in the outputted delay.
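Both effect stages can be sketched in a few lines of Python (our illustrative code, not the authors' MATLAB implementation; parameter defaults are assumptions): the asymmetrical clipping of Eq. (1), and a circular-buffer vibrato whose read index trails the write index by an oscillating delay.

```python
import math

def tube(x, Q=-0.2, dist=8.0):
    """Asymmetrical valve-style clipping, Eq. (1): Q is the work point,
    dist scales the amount of distortion (Q must be non-zero).
    Default values are illustrative assumptions."""
    c = Q / (1.0 - math.exp(dist * Q))            # constant offset term
    if x == Q:                                    # limit of the first term is 1/dist
        return 1.0 / dist + c
    return (x - Q) / (1.0 - math.exp(-dist * (x - Q))) + c

def vibrato(x, sr=44100, rate=5.0, base_ms=5.0, depth_ms=1.0):
    """Circular-buffer vibrato: the write index stores the input while the
    read index lags behind it by a periodically varying delay (Doppler-like
    pitch variation). Fractional delays use linear interpolation."""
    buf = [0.0] * int(0.05 * sr)                  # 50 ms circular buffer
    L = len(buf)
    out, w = [], 0
    for n, s in enumerate(x):
        buf[w] = s                                # write incoming sample
        d = (base_ms + depth_ms * math.sin(2 * math.pi * rate * n / sr)) * sr / 1000.0
        r = (w - d) % L                           # fractional read position
        i0 = int(r)
        frac = r - i0
        out.append((1.0 - frac) * buf[i0] + frac * buf[(i0 + 1) % L])
        w = (w + 1) % L
    return out
```

Note that the first term of Eq. (1) tends to 1/dist as x approaches Q, so the x = Q branch simply makes the curve continuous at the work point.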
for choosing this method was due to its ability to run in real-time and to perform well under a wide range of conditions [22]. An audio signal spectrum will usually consist of a series of peaks, corresponding to the fundamental frequency with harmonic components at integer multiples of the fundamental [23]. The HPS method utilises this by down-sampling the spectrum a number of times, multiplying the compressed versions together, and comparing the result to the original spectrum. The resulting spectrum will display an alignment of the strongest harmonic peaks and, consequently, the fundamental frequency. The HPS technique measures the maximum coincidence of harmonics for each spectral frame according to

Y(\omega) = \prod_{r=1}^{R} |X(\omega r)| \quad (2)

where R is the number of harmonics being considered. The resulting periodic correlation array is then searched for a maximum over the range of possible fundamental frequencies to obtain the fundamental frequency estimate [22]:

\hat{Y} = \arg\max_{i} Y(\omega_i) \quad (3)

4.3.2 Spectral Centroid

The spectral centroid provides a strong impression of the timbral feature brightness. It is calculated by taking a spectrum's centre of gravity according to the following formula [24, 25]:

SC_{hz} = \frac{\sum_{k=1}^{N-1} k\, X_d[k]}{\sum_{k=1}^{N-1} X_d[k]} \quad (4)

The calculation was implemented by applying a Hamming window to each buffer, to attenuate any unwanted frequencies that result from taking the Fourier transform of a segment of the audio signal. For music signals, the brightness generally increases with the presence of higher-frequency components in the signal, which is often introduced through inharmonic sounds [25, 26].

4.3.3 Energy

Energy is a basic but powerful acoustic feature as it describes the signal energy [27]. Energy is calculated by squaring and summing the amplitudes in a windowed signal, after which the sum is normalised by the length of the window:

E_n = \frac{1}{N} \sum_{n=0}^{N-1} x^2(n) \quad (5)

The frame size of the input signal was used to decide the window length. The value generated by the energy was used to estimate the amount of force in the guitar strumming.

4.3.4 Tone Rate

Tone rate was implemented to gain information on the performer's speed of playing. It is roughly proportional to tempo, as it refers to the number of successive tones over a given period, but this range is not large enough to serve as a detailed indication of tempo. In order to gain a better resolution, we derive the average inter-onset interval (IOI) to indicate the speed at which the musician is playing. The onset marks a single point in time within the signal where the transient region or attack starts, highlighting a sudden burst of energy [28].

Onset detection based on spectral features includes various robust methods, which generally require less pre-processing and potentially allow the bandwise analysis of polyphonic signals. The reduction function transforms the set of spectral windows into a so-called novelty curve, which indicates the location of prominent changes, and therefore the possible onsets. The novelty curve is acquired from the rectified spectral difference at each window, treating the onset as a broadband event and capturing only positive changes in the magnitude spectrum [28, 29]:

SD = \sum_{k=0}^{K} \left( |X[k]| - |X[k-1]| \right) \quad (6)

The localisation of an onset relies on two threshold parameters. The first is an absolute threshold on the amount of positive change in the magnitude spectrum, to filter out spurious peaks. The second is a temporal threshold, which makes sure that only one onset is detected in a given range. In order to calculate the average inter-onset interval, the onset intervals are accumulated over time in a vector and divided by the total number of onsets in the same period.

Figure 2. Effect engine and parameter mapping structure

4.4 Parameter Mapping

In order to demonstrate the capabilities of the effect we created a specific preset. The parameter mappings for the pre-
SMC2017-331
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
set called "Dreamy", were determined by an attempt to construct meaningful relationships between feature extractions and effect controls, and also to create an accurate representation of the preset name. The notion of dreamy music is often associated with rich ambient and reverberant sounds, which can be digitally replicated by utilising effects such as delay, reverb and vibrato. An attempt to achieve this type of sound identity was made through the parameter relationships described. We found adaptive mappings such as Pitch Height → Delay Mix to make sense musically, as overly delayed signals in low frequencies can create an undesired muddy low end. The mappings were decided from what we found aesthetically pleasing and fitting to the chosen notion of dreamy. Further, to inform the choice of adaptive mapping, the dreamy preset was subjected to initial tests. The non-adaptive version's static parameters were centred in relation to the range of the adaptive mapping. Figure 2 displays the signal path of the whole effect, with the coloured lines outlining the parameter mappings for "Dreamy". The adaptive mappings for the "Dreamy" preset are as follows:

Energy Level → Vibrato Depth + Filter Fc
Spectral Centroid → Delay Feedback
Tone Rate → Vibrato Rate
Fundamental Frequency → Delay Mix + Filter Q

5. EVALUATION

Having designed and implemented an adaptive digital audio effect that uses expressive audio features as input, we now turn to evaluating its performance and use for a performer. In the following we describe the evaluation, which was done to answer the research question: are users able to distinguish between the adaptive and non-adaptive version, and do they understand what the difference is?

5.1 Participants

In order to take part in the evaluation, we required that the participant be able to perform basic guitar techniques. Additionally, we deemed knowledge about effects and various musical terms necessary for the participant to understand the questions asked during the experiment. Hence the target group was defined as intermediate guitar players well acquainted with audio effects.

The participants were recruited through purposive sampling [30]. The researchers chose associated participants who fit the target group. A total of seven people participated in the evaluation. The participants were between 22 and 27 years old, and the average guitar playing experience was 9.8 years.

5.2 Procedure

We chose a within-participant design where players compared two versions of our effect, one adaptive and one non-adaptive. In order not to affect the behaviour of the participants in a systematic way [31], the presentation order of the two versions was counterbalanced, so that Participant 1 tested with A = non-adaptive, B = adaptive, while Participant 2 tested with A = adaptive, B = non-adaptive, etc.

Initially a pilot test was performed, and the experimental procedure was adjusted accordingly. The final procedure was as follows. Each participant was given a short scripted introduction: they were told they would be trying two versions of an effect, and were introduced to the plugin interface. Thereafter the participant was asked to sign a terms-of-agreement form, allowing the experiment to be recorded. A small demographic questionnaire was also conducted to obtain information on age, experience, and preferred genre. This was done to make sure that participants fit the chosen target group.

The remaining test was structured as follows: 1. basic tasks; 2. short interview; 3. expressive improvisational tasks; and 4. follow-up interview.

In the first set of tasks, the participant was asked to perform a variety of basic playing styles on the guitar. The goal of these tasks was to make sure that the participant explored every aspect of the effect, and was able to gather some thoughts about the difference between the two versions. After performing the basic tasks, a short interview was conducted, where the participant was asked to explain the difference between the two versions. There were questions specifically asking about the effect of certain features (such as strum power or pitch), which helped clarify whether the participant understood changes in the sound.

The next set of tasks focused on the participant's own interpretation of expressiveness. They were asked to try versions A and B when playing deadpan and expressively. The prediction was that the difference between A and B would be more noticeable when playing expressively. Deadpan was described as mechanical playing, adding no expressive variation to the playing. Expressive playing was defined as performing as though trying to convey some musical expression and emotion to an audience.

Finally, another short interview was conducted. The participant was asked to explain whether the difference between versions A and B was more noticeable when playing expressively. Afterwards the difference was revealed by the experimenter, and the participant was asked to guess which of the versions was the adaptive one. The complete test lasted between 25 and 35 minutes per participant.

5.3 Data Collection

1) Qualitative Interview: The main method for gathering data was qualitative structured interviews: all of the questions required the participants to provide a specific answer (yes, no, A or B). This allowed for a structured comparison between all their answers. Additionally, the participants were encouraged to elaborate on each question, with the test experimenter taking notes of their answers throughout the interviews.

2) Observation: The tests were all video recorded for post-observation. The guitar was recorded with the effects applied. The recordings served several purposes: 1. review the interviews, in case the test experimenter missed any important notes; 2. check for user errors (did they understand the tasks correctly?); 3. check for systematic errors (glitches and other types of unpredictable behaviour); 4. triangulate with the data from the interviews.
Table 1. Summary of the participants' choices for the simple questions during the evaluation. The questions have been condensed here, but can be found in their actual format in the appendix. The data in bold outlines the successful choices in the interview. Ad = the adaptive version. Non-Ad = the non-adaptive version. U = unclear, "don't know", or no perceived difference.

5.4 Analysis Methods

As a first step, the participants' choices for the simple questions were summarised (see Table 1), after which a more thorough analysis of the interviews was performed using meaning condensation and content analysis. The full interview answers were reduced by performing meaning condensation, after which the content was analysed by searching for keywords, patterns, and outliers to gain a broad perspective of the answers. To strengthen the validity of the study, the data from the interviews was compared with the observation data. If the two sources were not congruent, we considered the data to be invalid.

6. RESULTS

6.1 Difference Noticed Between Versions

versions. A change in the adaptive version was identified, but the description of the effect was inconsistent between participants. They could hear that something happened but could not define exactly what it was.

6.2 Comparing Deadpan to Expressive Playing

6.2.1 Expressive Motivation:

The version that made the participants more motivated to play expressively appeared to be random, and the version that participants preferred in general was fairly random. There was no agreed preference. The non-adaptive version was deemed simpler and predictable, while the adaptive version was deemed playful, stronger, and interesting for experimenting with sounds.
7. DISCUSSION

The results showed that the participants were able to perceive a notable difference between the two versions of the effect, and they were able to tell which one was adaptive. However, they had trouble understanding exactly what features actually affected the parameters of the effect. To gain an understanding of why some of these issues occurred, triangulation was performed.

7.1 General Comparison Between Versions

The participants described the non-adaptive version as flat, dry and simpler compared to the adaptive one. The adaptive version was described as having a thicker, fuller effect, where there was more happening with the effect. The influence of the energy was the most noticeable, and a possible reason for this could be the direct relation between the power of the strum and the perceived changes in the vibrato depth and the cutoff frequency. The influence of playing tempo and pitch on the effect was less clear to the participants. The players could detect the influence of pitch, but were not able to identify what parameter it was controlling. The least successful mapping was between the tempo and the vibrato rate, which the participants perceived as the same in both cases. The noticed difference can also be biased by the imbalanced influence of the different mappings, and therefore by the masking of one effect by another.

In the freer task involving expressive playing, players thought both the non-adaptive and the adaptive versions were adaptive. But the players described the non-adaptive version as simpler and predictable, while the adaptive version was deemed playful, stronger, and interesting for experimenting with sounds.

7.2 Further Comments

The responses to the questions regarding general usability, which participants answered at the end of the test, can be helpful in setting design requirements for further development. All the participants thought that an adaptive effect would be useful, but mostly for experimental purposes. They generally wished that they had more control, or at least knew what kind of parameters were active in the effect, which highlights a clear need for transparency in the further design of such a tool. The participants especially liked the idea of being able to control the intensity of adaptive expression applied to the effect. They also had some ideas for mapping the effect differently.

8. CONCLUSIONS

An adaptive digital audio effect was implemented to run in real-time and was tested on intermediate guitar players. The effect was a multi-dimensional delay engine with added vibrato, reverse delay, saturation, and filters. Several sound features (pitch, brightness, tone rate and energy) were extracted from the input signal and transformed the output signal by varying the effect parameters over time. The design aimed to exploit the musical intention of the players in order to manipulate effect parameters. Through testing it was found that the participants generally did not find that the adaptive version made them play more expressively than the non-adaptive one, and that users did not prefer the adaptive effect. However, when using the adaptive version, the participants felt they had more control and noticed a bigger difference in the sound produced by the effect when playing expressively compared to deadpan. It was also found that the mappings of tempo and pitch were not apparent to the users, whereas the energy mapping was easily recognised. To investigate A-DAFx further, it would be worthwhile to test the effect in various environments (such as in a recording or live setting). In conjunction with this, it may be beneficial to apply a longitudinal testing approach so that participants could integrate the effect into their own creative and compositional work process.

Acknowledgments

The authors gratefully thank the guitarists involved in the evaluation. Authors Holfelt, Csapo, Andersson, Zabetian, and Castanieto were responsible for the main work, which was performed as a student project during the first semester of Sound and Music Computing, Aalborg University Copenhagen. Authors Erkut and Dahl supervised the project and participated in the writing together with author Overholt.

9. REFERENCES

[1] D. M. Campbell, C. A. Greated, A. Myers, and M. Campbell, Musical Instruments: History, Technology, and Performance of Instruments of Western Music. Oxford: Oxford University Press, 2004.

[2] J. Bruck, A. Grundy, and I. Joel, "An Audio Timeline," 2016, accessed: Dec. 15, 2016. [Online]. Available: http://www.aes.org/aeshc/docs/audio.history.timeline.html

[3] U. Zölzer, DAFX: Digital Audio Effects, 2nd ed. Oxford, England: John Wiley & Sons, 2011.

[4] V. Verfaille and D. Arfib, "A-DAFX: Adaptive Digital Audio Effects," Digital Audio Effects (DAFX-01), 2001.

[5] A. Friberg, E. Schoonderwaldt, and P. Juslin, "CUEX: An Algorithm for Automatic Extraction of Expressive Tone Parameters in Music Performance from Acoustic Signals," Acta Acustica United With Acustica, vol. 93, no. 3, 2007.

[6] A. Bhatara, A. K. Tirovolas, L. M. Duan, B. Levy, and D. J. Levitin, "Perception of Emotional Expression in Musical Performance," Journal of Experimental Psychology: Human Perception and Performance, vol. 37, no. 3, 2011.

[7] M. Barthet, R. Kronland-Martinet, and S. Ystad, "Improving Musical Expressiveness by Time-Varying Brightness Shaping," Computer Music Modelling and Retrieval, vol. 4969, 2008.
[8] R. A. Kendall and E. C. Carterette, "The Communication of Musical Expression," Music Perception: An Interdisciplinary Journal, vol. 8, no. 2, 1990.

[9] L. Mion and G. D'Inca, "Analysis of Expression in Simple Musical Gestures to Enhance Audio in Interfaces," Virtual Reality, vol. 10, no. 1, 2006.

[10] C. Oshima, K. Nishimoto, Y. Miyagawa, and T. Shirosaki, "A Concept to Facilitate Musical Expression," in Proceedings of the 4th Conference on Creativity & Cognition, 2002.

[11] V. Verfaille, U. Zölzer, and D. Arfib, "Adaptive Digital Audio Effects (A-DAFx): A New Class of Sound Transformations," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, 2006.

[12] V. Verfaille, M. M. Wanderley, and P. Depalle, "Mapping Strategies for Gestural and Adaptive Control of Digital Audio Effects," Journal of New Music Research, vol. 35, no. 1, 2006.

[13] Boss Corporation, "TE-2 Tera Echo," 2016, accessed: Dec. 13, 2016. [Online]. Available: https://www.boss.info/us/products/te-2/

[14] Boss Corporation, "DA-2 Adaptive Distortion," 2016, accessed: Dec. 13, 2016. [Online]. Available: https://www.boss.info/us/products/da-2/features/

[15] Boss Corporation, "MO-2 Multi Overtone," 2016, accessed: Dec. 13, 2016. [Online]. Available: https://www.boss.info/us/products/mo-2/

[16] M. Müller, Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Switzerland: Springer International Publishing AG, 2015.

[17] G. Tanaka, "Reviews of the Best Guitar Pedals & Gear," 2013, accessed: Dec. 12, 2016. [Online]. Available: http://www.bestguitareffects.com/top-best-delay-guitar-effects-pedals-buyers-guide/

[22] T. Smyth, "Music 270a: Signal Analysis," 2015, accessed: Dec. 12, 2016. [Online]. Available: http://musicweb.ucsd.edu/trsmyth/analysis/analysis.pdf

[23] G. Middleton, "OpenStax CNX," 2003, accessed: Dec. 12, 2016. [Online]. Available: http://cnx.org/contents/i5AAkZCP@2/Pitch-Detection-Algorithms

[24] T. H. Park, Introduction to Digital Signal Processing, 1st ed. Singapore: World Scientific Publishing Co., 2010.

[25] E. Schubert, "Does Timbral Brightness Scale with Frequency and Spectral Centroid?" Acta Acustica United With Acustica, vol. 92, 2006.

[26] H. Järveläinen and M. Karjalainen, "Importance of Inharmonicity in the Acoustic Guitar," Forum Acusticum, 2005.

[27] F. Eyben, Real-time Speech and Music Classification by Large Audio Feature Space Extraction. Springer, 2016.

[28] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler, "A Tutorial on Onset Detection in Music Signals," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, 2005.

[29] M. Müller, D. Ellis, A. Klapuri, and G. Richard, "Signal Processing for Music Analysis," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, 2011.

[30] T. Bjørner, Qualitative Methods for Consumer Research: The Value of the Qualitative Approach in Theory & Practice. Denmark: Hans Reitzel, 2015.

[31] A. Field and G. J. Hole, How to Design and Report Experiments. London: SAGE Publications, 2003.
Figure 1. Schematic of the Lockhart wavefolder circuit.

Figure 2. Ebers-Moll equivalent model of the Lockhart wavefolder circuit.

2. CIRCUIT ANALYSIS

Figure 1 shows a simplified schematic of the Lockhart wavefolder adapted from Ken Stone's design [4]. This circuit was originally designed to function as a frequency tripler by R. Lockhart Jr. [4, 35], but was later repurposed to perform wavefolding in analog synthesizers. The core of the circuit consists of an NPN and a PNP bipolar junction transistor tied at their base and collector terminals. In order to carry out the large-signal analysis of the circuit, we replace Q1 and Q2 with their corresponding Ebers-Moll injection models [10], as shown in Figure 2. Nested subscripts are used to distinguish the currents and voltages in Q1 from those in Q2. For instance, ICD1 is the current through the collector diode in Q1.

We begin our circuit analysis by assuming that the supply voltages V± will always be significantly larger than the voltage at the input, i.e. V− < Vin < V+. This assumption is valid for standard synthesizer voltage levels (V± = ±15 V, Vin ∈ [−5, 5] V), and implies that the base-emitter junctions of Q1 and Q2 will be forward-biased with approximately constant voltage drops for all expected input voltages. Applying Kirchhoff's voltage law around both input-emitter loops (cf. Fig. 1) results in

Vin = V+ − R·IE1 − VBE1,    (1)
Vin = VBE2 + R·IE2 + V−,    (2)

where IE1 and IE2 are the emitter currents, and VBE1 and VBE2 are the voltage drops across the base-emitter junctions of Q1 and Q2, respectively. Solving (1) and (2) for the emitter currents then yields:

IE1 = (V+ − VBE1 − Vin) / R,    (3)
IE2 = (Vin − VBE2 − V−) / R.    (4)

Next, we apply Kirchhoff's current law at the collector nodes:

Iout = IC1 − IC2,    (5)
IC1 = αF·IED1 − ICD1,    (6)
IC2 = αF·IED2 − ICD2.    (7)

Assuming that the contribution of the reverse currents αR·ICD1 and αR·ICD2 to the total emitter currents IED1 and IED2 is negligible, we can establish that

IED1 ≈ IE1 and IED2 ≈ IE2.

Substituting these values in (6) and (7), and setting the value αF = 1, gives a new expression for the output current:

Iout = IE1 − IE2 − ICD1 + ICD2.    (8)

Combining (3) and (4), we can derive an expression for the difference between the emitter currents. Since the voltage drops VBE1 and VBE2 across the base-emitter junctions are
approximately equal (i.e. VBE1 ≈ VBE2) for the expected range of Vin, their contribution to the difference of the emitter currents vanishes. Therefore,

IE1 − IE2 = −2Vin / R.    (9)

Substituting (9) into (8) yields an expression for the total output current Iout in terms of the input voltage and the currents through the collector diodes:

Iout = −2Vin/R − ICD1 + ICD2.    (10)

Now, the I-V relation of a diode can be modeled using Shockley's ideal diode equation, defined as

ID = IS ( e^(VD/(η·VT)) − 1 ),    (11)

where ID is the diode current, IS is the reverse-bias saturation current, VT is the thermal voltage, and η is the ideality factor of the diode. Applying Shockley's diode equation to the collector diodes using ideality factor η = 1, and substituting into (10), gives us

Iout = −2Vin/R − IS ( e^(VCD1/VT) − e^(VCD2/VT) ).    (12)

Next, we use Kirchhoff's voltage law to define expressions for VCD1 and VCD2 in terms of Vin and Vout:

VCD1 = Vout − Vin,    (13)
VCD2 = Vin − Vout,    (14)

and replace these values in (12). This gives us

Iout = −2Vin/R − IS ( e^((Vout−Vin)/VT) − e^((Vin−Vout)/VT) ).    (15)

As a final step, we multiply both sides of (15) by the load resistance RL and express the exponential functions as a hyperbolic sine to produce an expression for the output voltage of the Lockhart wavefolder:

Vout = −(2RL/R)·Vin − 2RL·IS·sinh( (Vout − Vin)/VT ).    (16)

3. EXPLICIT FORMULATION

Equation (16) describes a nonlinear implicit relationship between the input and output voltages. Its solution can be approximated using numerical methods such as Halley's method or Newton-Raphson. These methods have previously been used in the context of VA modeling, e.g. in [10, 13, 16].

In this work, we propose an explicit formulation for the output voltage of the Lockhart wavefolder, derived using the Lambert-W function. The Lambert-W function¹ W(x) is defined as the inverse function of x = y·e^y, i.e. y = W(x). Recent research has demonstrated its suitability for VA applications. Parker and D'Angelo used it to model the control circuit of the Buchla LPG [3]. Several authors have used it to solve the implicit voltage relationships of diode pairs [3, 36, 37]. As described in [36], W(x) can be used to solve problems of the form

(A + B·x)·e^(C·x) = D,    (17)

which have the solution

x = (1/C)·W( (C·D/B)·e^(A·C/B) ) − A/B.    (18)

Since the collector diodes in the model are antiparallel, only one of them can conduct at a time [38]. Going back to (10), when Vin ≥ 0 virtually no current flows through the collector diode of Q1 (i.e. ICD1 ≈ 0). The same can be said for ICD2 when Vin < 0. By combining these new assumptions with (12)-(14), we can derive a piecewise expression for the Lockhart wavefolder:

Vout = −(2RL/R)·Vin + λ·RL·IS·exp( λ·(Vin − Vout)/VT ),    (19)

where λ = sgn(Vin) and sgn(·) is the signum function. This expression is still implicit; however, it can be rearranged in the form described by (17), which gives us:

( (2RL/R)·Vin + Vout )·exp( λ·Vout/VT ) = λ·RL·IS·exp( λ·Vin/VT ).    (20)

Solving for Vout as defined in (18) yields an explicit model for the Lockhart wavefolder:

Vout = λ·VT·W( Δ·exp(λ·β·Vin) ) − α·Vin,    (21)

where

α = 2RL/R,  β = (R + 2RL)/(R·VT)  and  Δ = RL·IS/VT.    (22)

An important detail to point out is that the output of the Lockhart wavefolder is 180° out of phase with the input signal. This can be compensated with a simple inverting stage.

4. DIGITAL IMPLEMENTATION

The voltages and currents inside an electronic circuit are time-dependent. Therefore, the Lockhart wavefolder can be described as a nonlinear memoryless system of the form

Vout(t) = f(Vin(t)),    (23)

where f is the nonlinear function (21) and t is time. Equation (23) can then be discretized directly as

Vout[n] = f(Vin[n]),    (24)

where n is the discrete-time sample index.

¹ Strictly speaking, W(x) is multivalued. This work only utilizes the upper branch, often known as W0(x) in the literature.
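As a concrete illustration, the explicit model (21)-(22) and its direct discretization (24) can be prototyped in a few lines. This is an illustrative sketch, not the authors' implementation: it uses the component values listed later in the paper (R = 15 kΩ, RL = 7.5 kΩ, VT = 26 mV, IS = 10⁻¹⁷ A) and a plain Newton iteration for the Lambert-W function in place of the Halley recursion of Section 4.2.

```python
import math

# Component values and constants (from the paper's component table)
R, RL = 15e3, 7.5e3        # resistors (ohms)
VT, IS = 26e-3, 1e-17      # thermal voltage (V), saturation current (A)

ALPHA = 2.0 * RL / R                  # eq. (22)
BETA = (R + 2.0 * RL) / (R * VT)
DELTA = RL * IS / VT

def lambert_w(x):
    """Upper branch W0(x) for x >= 0, via Newton iteration on w*e^w = x."""
    w = math.log(1.0 + x)             # crude but safe initial guess for x >= 0
    for _ in range(50):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < 1e-14:
            break
    return w

def lockhart_wavefolder(v_in):
    """Memoryless wavefolder transfer function, eq. (21)."""
    lam = math.copysign(1.0, v_in) if v_in != 0.0 else 0.0  # lambda = sgn(Vin)
    return lam * VT * lambert_w(DELTA * math.exp(lam * BETA * v_in)) - ALPHA * v_in
```

For small inputs the W term is negligible and the model reduces to an inverting gain of −α, consistent with the 180° phase shift noted above; the folds appear once the collector diodes start to conduct.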
its accurate discrete-time implementation. As an arbitrary input waveform is folded, new harmonics will be introduced. Those harmonics whose frequency exceeds half the sampling rate, or the Nyquist limit, will be reflected into the baseband and will cause unpleasant artifacts. Oversampling by high factors is commonly employed to mitigate this problem, but this approach increases the computational requirements of the system by introducing redundant operations.

In this work, we propose using the first-order antiderivative method presented in [33] and [34]. This method is derived from the analytical convolution of a continuous-time representation of the processed signal with a first-order lowpass kernel. The antialiased form for the Lockhart wavefolder is given by

Vout[n] = ( F(Vin[n]) − F(Vin[n−1]) ) / ( Vin[n] − Vin[n−1] ),    (25)

where F(·) is the antiderivative of f(·), the wavefolder function (21). This antiderivative is defined as

F(Vin) = (VT/(2β)) · (1 + W(Δ·exp(λ·β·Vin)))² − (α/2)·Vin².    (26)

Since the antiderivative of W is defined in terms of W itself, this form provides a cheap alternative to oversampling. This means the value W, which constitutes the most computationally expensive part of both (21) and (26), only has to be computed once for each output sample.

Now, when Vin[n] ≈ Vin[n−1], (25) becomes ill-conditioned. This should be avoided by defining the special case

Vout[n] = f( (Vin[n] + Vin[n−1]) / 2 ),    (27)

when |Vin[n] − Vin[n−1]| is smaller than a predetermined threshold, e.g. 10⁻⁶. This special case simply bypasses the antialiased form while compensating for the half-sample delay introduced by the method.

4.2 Computing the Lambert-W Function

Several options exist to compute the value of W(x). In fact, scripting languages such as MATLAB usually contain their own native implementations of the function. Paiva et al. [38] proposed the use of a simplified iterative method which relied on a table read for its initial guess. In the interest of avoiding lookup tables, we propose approximating the value of the Lambert-W function directly using Halley's method, as suggested in [39]. To compute wm, an approximation to W(x), we iterate over

w(m+1) = wm − pm / (rm − pm·sm),    (28)

where

pm = wm·e^(wm) − x,
rm = (wm + 1)·e^(wm),
sm = (wm + 2) / (2·(wm + 1)),

and m = 0, 1, 2, ..., M−1. M is then the number of iterations required for pm to approach zero. The efficiency of the method will then depend on the choice of the initial guess w0. An optimized MATLAB implementation of this method can be found in [40].

Figure 3. Input-output relationship of the Lockhart wavefolder circuit (inverted) measured using (a) SPICE and (b) the proposed digital model, for load resistor values RL = 1 kΩ, 5 kΩ, 10 kΩ and 50 kΩ (in order of increasing steepness).

5. RESULTS

To validate the proposed model, the circuit was simulated using SPICE. Figure 3(a) shows the input-output relationship of the system measured with SPICE for values of Vin between −1.2 and 1.2 V and different load resistance values. Figure 3(b) shows the input-output relation of the proposed model (21) simulated using MATLAB. The polarity of the transfer function was inverted to compensate for the introduced phase shift. An ad hoc scaling factor of 2 was used on both measurements to compensate for the energy loss at the fundamental frequency. The results produced by the proposed model fit those produced by SPICE closely. As the value of the load resistor RL is raised, the steepness of the wavefolder function increases. This increase in steepness translates into more abrasive tones at the output.

Figures 4(a) and (b) show the output of the wavefolder model when driven by a 500-Hz sinewave with a peak amplitude of 1 V, and RL = 10 kΩ and 50 kΩ, respectively. Both curves are plotted on top of their equivalent SPICE simulations, showing a good match. The original input signal has been included to help illustrate the folding operation performed by the system.

Figure 5(b) shows the magnitude spectrum of a 1-V sinewave with fundamental frequency f0 = 2145 Hz processed by the trivial (i.e. non-antialiased) wavefolder model (21). A sample rate Fs = 88.2 kHz (i.e. twice the standard 44.1 kHz audio rate) was used in this and the rest of the examples presented in this study. This plot shows the high levels of aliasing distortion introduced by the wavefolding process, even when oversampling by factor 2 is used. Figure 5(d) shows the magnitude spectrum of the same waveform processed using the antialiased form (25). As expected, the level of aliasing has been significantly reduced, with very few aliases left above the −80 dB mark. As illustrated by the left-hand side of Fig. 5, the antialiased form preserves the time-domain behavior of the system.
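The antialiased form (25)-(27) can be sketched in the same illustrative way, again not the authors' code: below, f and F implement (21) and (26) with assumed constants derived from the component table (R = 15 kΩ, RL = 7.5 kΩ, VT = 26 mV, IS = 10⁻¹⁷ A, so α = 1), and the Lambert-W function is evaluated with a plain Newton iteration.

```python
import math

# Model constants, eq. (22), assuming R = 15 kOhm, RL = 7.5 kOhm, VT = 26 mV, IS = 1e-17 A
VT = 26e-3
ALPHA = 1.0
BETA = (15e3 + 15e3) / (15e3 * VT)
DELTA = 7.5e3 * 1e-17 / VT

def lambert_w(x):
    # Upper branch W0(x) for x >= 0, Newton iteration on w*e^w = x
    w = math.log(1.0 + x)
    for _ in range(50):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < 1e-14:
            break
    return w

def f(v):
    # Wavefolder transfer function, eq. (21)
    lam = math.copysign(1.0, v) if v else 0.0
    return lam * VT * lambert_w(DELTA * math.exp(lam * BETA * v)) - ALPHA * v

def F(v):
    # Antiderivative of f, eq. (26)
    lam = math.copysign(1.0, v) if v else 0.0
    w = lambert_w(DELTA * math.exp(lam * BETA * v))
    return VT / (2.0 * BETA) * (1.0 + w) ** 2 - 0.5 * ALPHA * v * v

def process(signal, eps=1e-6):
    # First-order antiderivative antialiasing, eqs. (25) and (27)
    out, prev = [], 0.0
    for x in signal:
        if abs(x - prev) < eps:
            out.append(f(0.5 * (x + prev)))            # ill-conditioned case, eq. (27)
        else:
            out.append((F(x) - F(prev)) / (x - prev))  # antialiased form, eq. (25)
        prev = x
    return out
```

Note that (21) and (26) share the same W value; a production implementation would compute it once per output sample as the text points out, whereas this sketch recomputes it for clarity.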
[Figures 4 and 5, panels (a)-(d): time-domain waveforms (legend: Vin, SPICE, Model; horizontal axis Time (s)) and magnitude spectra (Magnitude (dB) vs. Frequency (Hz)), as discussed in Section 5.]
Figure 7. Measured ANMR of a range of sinusoidal waveforms processed using the Lockhart wavefolder model (RL = 50 kΩ) under five different scenarios: direct audio rate (44.1 kHz), oversampling by factors 2, 4 and 8, and oversampling by 2 combined with the antialiasing form. Values below the −10 dB threshold indicate lack of perceivable aliasing.

Figure 8. Waveform for a unit-gain sine signal processed using the proposed cascaded structure with (a) G = 10 and (b) G = 10 plus a 5-V DC offset.

Component   Value     Constant   Value
R           15 kΩ     VT         26 mV
RL          7.5 kΩ    IS         10⁻¹⁷ A
[Figure 9 block diagram: Vin, with input gain G and DC offset, feeds four cascaded WF stages and a tanh saturator before Vout; scaling factors 1/4 and 4 appear in the chain.]

Figure 9. Block diagram for the proposed cascaded wavefolder structure. Input gain and DC offset are used to control the timbre of the output signal.
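For illustration only, one possible reading of the Figure 9 signal chain is sketched below. This is an assumption-laden sketch, not the authors' specification: the placement of the 1/4 and 4 scaling factors around the tanh stage is inferred from the block-diagram residue, and math.sin is used as a stand-in folding function in place of the Lockhart model.

```python
import math

def cascaded_folder(v_in, wavefolder, gain=10.0, dc_offset=0.0):
    """Illustrative cascade after Figure 9: input gain and DC offset,
    four wavefolder (WF) stages, then a tanh saturator.
    The 1/4 and 4 factors are assumed here to normalise the tanh stage."""
    x = gain * v_in + dc_offset
    for _ in range(4):                  # four cascaded WF stages
        x = wavefolder(x)
    return math.tanh(4.0 * x) * 0.25    # assumed scaling around the saturator

# Stand-in folding function, for demonstration purposes only.
fold = math.sin
```

With gain = 10 and a 5-V DC offset this mirrors the settings of Figure 8, although the resulting waveforms will differ because the stand-in folder is not the Lockhart model.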
comments during the final preparation of this work. F. Esqueda is supported by the Aalto ELEC Doctoral School.

8. REFERENCES

[1] R. A. Moog, "A voltage-controlled low-pass high-pass filter for audio signal processing," in Proc. 17th Conv. Audio Eng. Soc., New York, USA, Oct. 1965.

[2] Buchla Electronic Musical Instruments, "The history of Buchla," accessed March 3, 2017. [Online]. Available: https://buchla.com/history/

[3] J. Parker and S. D'Angelo, "A digital model of the Buchla lowpass-gate," in Proc. Int. Conf. Digital Audio Effects (DAFx-13), Maynooth, Ireland, Sept. 2013.

[4] K. Stone, "Simple wave folder for music synthesizers," accessed March 3, 2017. [Online]. Available: http://www.cgs.synth.net/modules/cgs52_folder.html

[5] D. Rossum, "Making digital filters sound analog," in Proc. Int. Computer Music Conf. (ICMC 1992), San Jose, CA, USA, Oct. 1992, pp. 30-34.

[6] V. Välimäki, S. Bilbao, J. O. Smith, J. S. Abel, J. Pakarinen, and D. Berners, "Introduction to the special issue on virtual analog audio effects and musical instruments," IEEE Trans. Audio, Speech and Lang. Process., vol. 18, no. 4, pp. 713-714, May 2010.

[7] T. Stilson and J. O. Smith, "Analyzing the Moog VCF with considerations for digital implementation," in Proc. Int. Computer Music Conf., Hong Kong, Aug. 1996.

[8] A. Huovilainen, "Non-linear digital implementation of the Moog ladder filter," in Proc. Int. Conf. Digital Audio Effects (DAFx-04), Naples, Italy, Oct. 2004, pp. 61-64.

[9] T. Hélie, "On the use of Volterra series for real-time simulations of weakly nonlinear analog audio devices: Application to the Moog ladder filter," in Proc. Int. Conf. Digital Audio Effects (DAFx-06), Montreal, Canada, Sept. 2006, pp. 7-12.

[10] F. Fontana and M. Civolani, "Modeling of the EMS VCS3 voltage-controlled filter as a nonlinear filter network," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 4, pp. 760-772, Apr. 2010.

[11] S. D'Angelo and V. Välimäki, "Generalized Moog ladder filter: Part II. Explicit nonlinear model through a novel delay-free loop implementation method," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22, no. 12, pp. 1873-1883, Dec. 2014.

[20] C. Roads, "A tutorial on non-linear distortion or waveshaping synthesis," Computer Music J., vol. 3, no. 2, pp. 29-34, 1979.

[21] M. Le Brun, "Digital waveshaping synthesis," J. Audio Eng. Soc., vol. 27, no. 4, pp. 250-266, 1979.

[22] J. Kleimola, "Audio synthesis by bitwise logical modulation," in Proc. 11th Int. Conf. Digital Audio Effects (DAFx-08), Espoo, Finland, Sept. 2008, pp. 60-70.

[23] J. Kleimola, V. Lazzarini, J. Timoney, and V. Välimäki, "Vector phaseshaping synthesis," in Proc. 14th Int. Conf. Digital Audio Effects (DAFx-11), Paris, France, Sept. 2011, pp. 233-240.

[24] T. Stilson and J. Smith, "Alias-free digital synthesis of classic analog waveforms," in Proc. Int. Computer Music Conf., Hong Kong, Aug. 1996, pp. 332-335.

[25] E. Brandt, "Hard sync without aliasing," in Proc. Int. Computer Music Conf., Havana, Cuba, Sept. 2001, pp. 365-368.

[26] V. Välimäki, "Discrete-time synthesis of the sawtooth waveform with reduced aliasing," IEEE Signal Process. Lett., vol. 12, no. 3, pp. 214-217, Mar. 2005.

[27] V. Välimäki, J. Nam, J. O. Smith, and J. S. Abel, "Alias-suppressed oscillators based on differentiated polynomial waveforms," IEEE Trans. Audio Speech Lang. Process., vol. 18, no. 4, pp. 786-798, May 2010.

[28] V. Välimäki, J. Pekonen, and J. Nam, "Perceptually informed synthesis of bandlimited classical waveforms using integrated polynomial interpolation," J. Acoust. Soc. Am., vol. 131, no. 1, pp. 974-986, Jan. 2012.

[29] E. Brandt, "Higher-order integrated wavetable synthesis," in Proc. Int. Conf. Digital Audio Effects (DAFx-12), York, UK, Sept. 2012, pp. 245-252.

[30] J. Kleimola and V. Välimäki, "Reducing aliasing from synthetic audio signals using polynomial transition regions," IEEE Signal Process. Lett., vol. 19, no. 2, pp. 67-70, Feb. 2012.

[31] F. Esqueda, S. Bilbao, and V. Välimäki, "Aliasing reduction in clipped signals," IEEE Trans. Signal Process., vol. 64, no. 20, pp. 5255-5267, Oct. 2016.

[32] F. Esqueda, V. Välimäki, and S. Bilbao, "Rounding corners with BLAMP," in Proc. Int. Conf. Digital Audio Effects (DAFx-16), Brno, Czech Republic, Sept. 2016, pp. 121-128.

[33] J. Parker, V. Zavalishin, and E. Le Bivic, "Reducing the aliasing of nonlinear waveshaping using continuous-time convolution," in Proc. Int. Conf. Digital Audio Effects (DAFx-16), Brno, Czech Republic, Sept. 2016, pp. 137-144.
[12] A. Huovilainen, Enhanced digital models for analog modulation ef- [34] S. Bilbao, F. Esqueda, J. Parker, and V. Valimaki, Antiderivative an-
fects, in Proc. Int. Conf. Digital Audio Effects (DAFx-05), Madrid, tialiasing for memoryless nonlinearities, IEEE Signal Process. Lett.,
Spain, Sept. 2005, pp. 155160. to be published.
[13] D. T. Yeh, J. Abel, and J. O. S. III, Simulation of the diode limiter in [35] R. Lockhart Jr., Non-selective frequency tripler uses transistor satu-
guitar distortion circuits by numerical solution of ordinariy differen- ration characteristics, Electronic Design, no. 17, Aug. 1973.
tial equations, in Proc. Int. Conf. Digital Audio Effects (DAFx-07), [36] K. J. Werner, V. Nangia, A. Bernardini, J. O. Smith III, and A. Sarti,
Bordeaux, France, Sept. 2007, pp. 197204. An improved and generalized diode clipper model for wave digital
[14] J. Pakarinen and D. T. Yeh, A review of digital techniques for model- filters, in Proc. 139th Conv. Audio Eng. Soc., New York, USA, Oct.
ing vacuum-tube guitar amplifiers, Computer Music J., vol. 33, no. 2, Nov. 2015.
pp. 85100, 2009. [37] A. Bernardini, K. J. Werner, A. Sarti, and J. O. S. III, Modeling
[15] J. Parker, A simple digital model of the diode-based ring modulator, nonlinear wave digital elements using the Lambert function, IEEE
in Proc. Int. Conf. Digital Audio Effects (DAFx-11), Paris, France, Transactions on Circuits and Systems I: Regular Papers, vol. 63,
Sept. 2011, pp. 163166. no. 8, pp. 12311242, Aug 2016.
[16] M. Holters and U. Zolzer, Physical modelling of a wah-wah effect [38] R. C. D. de Paiva, S. DAngelo, J. Pakarinen, and V. Valimaki, Em-
pedal as a case study for application of the nodal dk method to circuits ulation of operational amplifiers and diodes in audio distortion cir-
with variable parts, in Proc. Int. Conf. Digital Audio Effects (DAFx- cuits, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 10, pp.
11), Paris, France, Sept. 2011, pp. 3135. 688692, Oct. 2012.
[17] R. C. D. de Paiva, J. Pakarinen, V. Valimaki, and M. Tikander, [39] D. Veberic. Having fun with Lambert W(x) function. Accessed March
Real-time audio transformer emulation for virtual tube amplifiers, 3, 2017. [Online]. Available: https://arxiv.org/pdf/1003.1628.pdf
EURASIP J. Adv. Signal Process, vol. 2011, Feb. 2011. [40] C. Moler. The Lambert W function. Accessed May 19, 2017.
[18] D. Yeh, Automated physical modeling of nonlinear audio circuits [Online]. Available: http://blogs.mathworks.com/cleve/2013/09/02/
for real-time audio effectspart II: BJT and vacuum tube examples, the-lambert-w-function/
IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 4, pp. [41] H.-M. Lehtonen, J. Pekonen, and V. Valimaki, Audibility of aliasing
1207126, May 2012. distortion in sawtooth signals and its implications for oscillator algo-
[19] F. Eichas, M. Fink, M. Holters, and U. Zolzer, Physical modeling rithm design, J. Acoust. Soc. Am., vol. 132, no. 4, pp. 27212733,
of the MXR Phase 90 guitar effect pedal, in Proc. Int. Conf. Digital Oct. 2012.
Audio Effects (DAFx-14), Erlangen, Germany, Sept. 2014, pp. 153
158.
SMC2017-342
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
ABSTRACT

Highly recurrent networks that exhibit feedback and delay mechanisms offer promising applications for music composition, performance, and sound installation design. This paper provides an overview and a comparison of several pieces that have been realised by various musicians in the context of a practice-led research project.

1. INTRODUCTION

Over three years, the ICST Institute for Computer Music and Sound Technology has run a research project entitled FAUN (Feedback Audio Networks) that explored the musical potential of time-delay and feedback mechanisms within densely interconnected recurrent networks [1]. Such networks are interesting for several reasons: the changing activity of the network can be rendered audible through direct audification, which removes the need for a potentially arbitrary sonification mapping; the manipulation of delay times provides a formalism of potential use for both sound synthesis and algorithmic composition; and the networks lend themselves to generative approaches that explore the sonic potential of self-organised processes.

The FAUN project follows a practice-led approach to study how time-delayed feedback networks can be adopted for musical creation. This paper considers how musicians appropriate and integrate the conceptual, functional, and aesthetic principles of these networks into their own creative practice. Accordingly, the paper does not provide a systematic analysis of particular network algorithms but rather highlights and discusses the diversity of individual approaches that were chosen by the musicians.

2. BACKGROUND

This section presents a brief overview of the application of feedback and delay principles in sound synthesis. It also provides information about approaches that transfer computational data and processes into experienceable representations. These approaches can inspire musicians to render sound synthesis techniques perceivable not only sonically but also through the tangible, spatial, and interactive characteristics of sound installations.

2.1 Feedback and Delay in Sound Synthesis

Feedback and delay mechanisms play a prominent role in the creation and processing of digital audio signals [2]. In digital signal processing, typical applications include the design of recursive filters or the simulation of room acoustics (see e.g. [3]). In sound synthesis, several physical modelling techniques, such as digital waveguide synthesis, employ feedback and delay to simulate the propagation of acoustic waves through physical media (see e.g. [4]).

Approaches that employ feedback and delay mechanisms within highly recurrent networks are much less common. Such networks can give rise to interesting phenomena of self-organisation. The term generative audio system is used to designate generative approaches in computer music that do not operate on symbolic data but rather create the sonic output through direct audification [5].

There exist a few examples that employ neural networks for sound synthesis [6, 7]. Approaches that are not related to neural networks are equally relevant. The computer music environment resNET represents an early example [8]. It permits the realisation of networks for sound synthesis that consist of interconnected exciter and resonator units. More recently, a sound synthesis system was realised based on iterative maps whose variables are coupled via a network [9]. Recent research has been conducted on feedback networks consisting of time-varying allpass filters [5].

2.2 Exposure of Generative Systems

Due to their complex dynamics, time-delayed feedback networks can be employed as generative systems that exhibit autonomous and self-organised behaviours. This renders these networks attractive within the context of generative art [10, 11]. An ongoing debate in generative art refers to the challenge of devising an artwork in such a way that the specific characteristics of the underlying algorithms manifest themselves as principal aspects of the work. By directly exposing the algorithms in the perceivable characteristics of an artwork, the audience becomes engaged not only on an aesthetic level but can also gain, through a process of experiential reverse-engineering, an intellectual appreciation of the work [12]. The principle of rendering generative processes directly perceivable can serve as a compositional strategy in computer music [13, 14]. In this context, the method of mapping abstract algorithms

Copyright: © 2017 Daniel Bisig et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
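The direct-audification principle described in Sec. 2.1 can be made concrete with a minimal sketch. This is not the project's implementation: the single-node topology, the delay and gain values, and the tanh saturation used to keep the loop bounded are all illustrative assumptions. An impulse excites one node that has a single recurrent connection to itself, and the node's state sequence is used directly as the audio signal:

```python
import numpy as np

def audify_single_node(duration_s=2.0, sr=44100, delay_s=0.011, gain=0.95):
    """Directly audify a one-node recurrent network: an impulse excites
    the node, whose output is fed back to itself through a delay line.
    The tanh saturation keeps the feedback loop bounded."""
    n = int(duration_s * sr)
    d = int(delay_s * sr)              # recurrent delay in samples
    out = np.zeros(n)
    out[0] = 1.0                       # impulse excitation
    for i in range(d, n):
        out[i] += gain * np.tanh(out[i - d])
    return out                         # the state sequence is the signal
```

Varying delay_s retunes the resulting comb-like resonance, which hints at why the manipulation of delay times is of use both for sound synthesis and for compositional structuring; the saturating nonlinearity plays the bounding role that gain-control and gating mechanisms fulfil in the pieces discussed later.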
Figure 5. Example network topologies and audio routings for the piece Twin Primes.

3.1.4 Circular

The piece Circular is different from the other pieces in that it employs a physical feedback mechanism. The main

11 http://vboehm.net/2014/12/fourses/ (accessed 2017-02-26)
Figure 8. Diagram of the stage setup for the piece Circular (left) and a photograph of the performance (right). From left to right, the graphical elements represent a foot pedal and a hand drum.

Figure 10. Example network topologies and audio routings for the piece Thread.
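Topologies and audio routings such as those shown in the figures can be captured compactly in data form. The sketch below is a generic representation under assumed names, not the software used in the project: each directed connection stores the per-connection delay and gain that the pieces switch and randomise, and a connection with src == dst expresses a recurrent self-connection:

```python
from dataclasses import dataclass, field

@dataclass
class Connection:
    src: int
    dst: int              # dst == src encodes a recurrent self-connection
    delay_ms: float
    gain: float

@dataclass
class Network:
    connections: list = field(default_factory=list)

    def connect(self, src, dst, delay_ms, gain):
        self.connections.append(Connection(src, dst, delay_ms, gain))

    def inputs_of(self, node):
        """All connections feeding a node (summed at the node's input)."""
        return [c for c in self.connections if c.dst == node]

# Example: a single node with four recurrent connections to itself,
# as in the topology described for the installation Stripes
# (the delay and gain values here are illustrative).
net = Network()
for delay_ms in (11.0, 17.0, 23.0, 31.0):
    net.connect(0, 0, delay_ms, gain=0.9)
```

A representation of this kind makes the topological variations discussed in Sec. 4.1.2, whether switching between fixed topologies or modifying a single topology over time, a matter of editing a list of connections rather than rewiring a signal graph by hand.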
3.2 Installations
Primes. The output connection contains a delay unit whose time is constantly changed by linearly interpolating between random values. The network topology is shown on the left side of Fig. 12. The network consists of a single node and four recurrent connections that connect the node with itself. The installation setup consists of four piezoelectric speaker films, each of them 1.8 metres long, that hang from the ceiling (see right side of Fig. 12).

Figure 12. Network topology and audio routing for the installation Stripes (left) and a photograph of four piezoelectric loudspeakers (right).

3.2.2 Sheet Music

The installation Sheet Music employs a network algorithm that resembles the functioning of a switchboard. Whenever a node's signal amplitude exceeds a threshold, the node's output is connected to another node. In combination with a one-to-one correspondence between node activity and acoustic output via a corresponding loudspeaker, the switching behaviour gives rise to pointillistic sonic patterns that propagate through space. In addition, the installation establishes a relationship between delay times and spatial distances among loudspeakers. The delay time between nodes is proportional to the spatial distance between the corresponding loudspeakers. This makes it possible to exploit the spatial constraints of an exhibition situation in order to create a unique and site-specific network configuration.

The characteristics of the network nodes and connections are depicted in Fig. 13. Each node passes the summed input signals through a limiter and a routing mechanism that switches in a round-robin manner to its next setting whenever the node's signal amplitude exceeds a threshold. The network topology is shown on the left side of Fig. 14. The installation setup consists of eight piezoelectric speaker films, each of them an A4 sheet in size, sitting on top of a note stand (see right side of Fig. 14). Once the installation is set up, the distances among the loudspeakers are manually measured and stored in a configuration file to be read upon initialisation of the network. A more thorough description of the installation is available in [17].

Figure 14. Network topology and audio routing for the installation Sheet Music (left) and a photograph of the piezoelectric loudspeakers (right).

3.2.3 Studie 24.2

For the realisation of the piece Studie 24.2 the existing setup of the installation Dodecahedron was used. This setup possesses the shape of a Platonic solid and places twenty loudspeakers at equal distances from each other on the surface of a sphere. The piece appropriates the given geometric setup as a compositional constraint in that it relates the physical loudspeakers and the metallic struts to nodes and connections in a feedback and delay network. By modifying the probabilities of signal propagation across different network connections, temporary subnetworks are formed.

The node and connection characteristics are shown in Fig. 15. Here, the RMS of the signal amplitude is passed through a nonlinear gain function that boosts middle amplitudes while eliminating low and high amplitudes. In addition, a global amplitude control mechanism uses the RMS of all network signals to adjust the output gain of network connections. The gain and delay values are continuously randomised, which serves to create gradually changing timbres and helps to avoid the appearance of resonances.

Fig. 16 shows a schematic representation of the network topology on the left side and a photograph of the Dodecahedron installation on the right side. Owing to the direct correspondence between installation scaffold and network topology, each node is connected to three neighbouring nodes. Signals propagate in a branching manner, i.e.
a signal that arrives at a particular node from one of its neighbouring nodes can only be propagated to the other two neighbouring nodes. The probability and scaling of this branching propagation are randomly changed. This propagation principle leads to the appearance of local sonic movement trajectories.

Figure 16. Network topology and audio routing for the piece Studie 24.2 (left) and a photograph of an exhibition setup (right).

3.2.4 Watchers

The installation Watchers allows visitors to manually alter the horizontal orientation of loudspeakers and thereby affect the topology of the network-based synthesis system. Each loudspeaker renders the activity of a particular network node audible. Loudspeakers possess a line of sight which controls the existence of network connections. Depending on a loudspeaker's orientation, different neighbouring loudspeakers fall into its line of sight. Network connections are created between nodes that correspond to loudspeakers that can see each other. Whenever the establishment of a connection leads to the propagation of activity among network nodes, the sonic characteristics of the corresponding loudspeakers become gradually intermixed.

The characteristics of the network nodes and connections are depicted in Fig. 17. A node contains an internal sine oscillator whose output is combined with the sum of the incoming signals. This combined signal is passed through a sigmoid waveshaping function that serves as an audio distortion and gating mechanism. The shape of the waveshaping function changes based on whether the RMS of the input signal lies within or outside a lower and upper threshold.

Fig. 18 shows a schematic representation of the network topology and audio routing. Each loudspeaker is attached to a rotational joint. Also, each loudspeaker possesses a unique identifier that is communicated to other loudspeakers via an infrared sender mounted at the front of their casing. This signal is received by other loudspeakers via an infrared receiver that sits at the back of their casing. Each loudspeaker then transmits the list of received identifiers via WiFi to a computer which updates the topology of the sound synthesis network accordingly. A more thorough description of this installation is provided in [18].

Figure 18. Network topology and audio routing for the installation Watchers (left) and a photograph of the exhibition situation (right).

4. DISCUSSION

This section provides a comparison and discussion of the musical approaches for the previously described pieces.

4.1 Algorithm Design

On a basic level, the design of sound synthesis network algorithms can either be informed by techniques from digital signal processing or by the appropriation of algorithms from non-musical domains. The former approach is familiar to most musicians, whereas the latter results in non-standard techniques whose musical usefulness is more difficult to assess. In FAUN, most musicians chose a digital signal processing approach in their network implementation. Only the piece Errant it is implements a neural network. However, this implementation is complemented with more conventional signal processing algorithms.

4.1.1 Feedback Stabilisation

One of the challenges when working with feedback systems concerns the avoidance of instabilities caused by positive feedback. Most of the pieces employ an ad hoc approach for the automated stabilisation of feedback. The corresponding mechanisms either directly control local or global signal gain or employ a gate that closes whenever
the signal exceeds a threshold. Only in two pieces for live improvisation (Errant it is, Circular) is the task of feedback stabilisation delegated to manual intervention.

Unique approaches for feedback stabilisation are employed by the following pieces: Sheet Music implements a connection switching mechanism that cuts off the propagation of signals that exceed an amplitude threshold; Studie 24.2 uses a global signal attenuation mechanism and also boosts signals with average amplitude, which serves to maintain a fairly constant signal amplitude throughout the piece; Watchers uses a waveshaping function to control signal amplitude.

4.1.2 Network Topologies

While all networks described in this paper contain recurrent connections, the characteristics and topologies of the networks are quite diverse. In all pieces with the exception of Studie 24.2, many or all recurrent connections connect a node back to itself. While this is obvious for networks that consist of one node only, local recurrent connections are important in larger networks as well, since they make it possible to maintain the activity of a node for extended periods of time. In many of the networks that consist of a few nodes only (3 to 5), the recurrent connections across different nodes serve to establish circular propagation patterns. In networks containing larger numbers of nodes (6 to 20), circular recurrent patterns appear as well, but they do not span all nodes and thereby create subnetworks. The usage of subnetworks within larger networks plays an important role in that it renders the complex feedback dynamics more manageable than in networks that are fully connected. Also, subnetworks make it possible to maintain concurrent, sonically distinct network activities while allowing the occasional exchange of sonic material among the various subnetworks. None of the pieces makes use of a larger network in which the nodes are fully connected.

With the exception of Errant it is, all pieces use networks in which nodes and connections all possess the same characteristics. This is an indication that parametrical and topological variations in homogeneous networks allow for the creation of sufficiently diverse sonic material, so that the added complexity of heterogeneous networks is not considered worthwhile by the musicians. On the other hand, many musicians employ several different network topologies, either in parallel or successively throughout the piece.

Some of the pieces use multiple networks with fixed topology (Errant it is, Twin Primes, Thread). By switching between these networks or by scaling the contribution of individual networks to the overall sonic result, the pieces can progress in a predictable manner through different musical sections. Other pieces use only a single network but continuously (Watchers, Sheet Music) or occasionally (Roj) modify its topology. This approach leads to less predictable results, since the musical result depends both on the new topology and on the former activity patterns that still persist after a topological change. On the other hand, the persistence of network activity helps to create a sonic continuity throughout the development of a piece.

4.2 Compositional Approaches

The use of feedback and delay principles implies that all sonic aspects of a musical piece (pitch, timbre, rhythm, etc.) become mutually interrelated. Accordingly, a musician cannot freely decide to modify each of these aspects individually. The specification of the velocity of the gating mechanism in Twin Primes and Thread results not only in spectral but also in rhythmical effects. The changing wave shapes in Watchers and the connection switching mechanisms of Sheet Music serve a similar purpose. In Thread, the rhythmic structure is largely predefined by the specification of delay times in connections that pass an initial activation trigger to network nodes. Errant it is uses an external mechanism for controlling rhythmicity: a Fourses ramp generator occasionally controls periodic changes in the membrane potential of the Izhikevich model neurons.

The piece Roj represents a peculiar case. By manually arranging pre-recorded material, external rhythmic and formal elements are imposed on the music that are unrelated to the characteristics of a network mechanism. This approach offers the largest compositional freedom, but this comes at the cost of sacrificing some of the potential of the networks to operate as autonomous generative systems. Most musicians tried to strike a balance between compositional freedom and generative autonomy.

An interesting relationship between human performer and generative system develops in Thread and Circular. In both pieces, the musicians' behaviour consists mostly of spontaneous reactions to the complex behaviour of the network. Accordingly, both the musician and the network possess agency, and it is the mutual interplay between the two that shapes the development of the piece.

4.3 Experienceable Relationships

This section addresses the establishment of relationships between network algorithms and experienceable results that go beyond purely sonic considerations.

4.3.1 Spatialisation

In all pieces, the output of the network is spatialised by a multichannel speaker setup, but two different approaches are employed. Either the output of the network is routed through a mixing desk before being distributed across the loudspeaker setup, or the sonic output of each network node is directly connected to an individual loudspeaker. Only two pieces (Roj and Errant it is) follow the first approach. In a one-to-one routing of node output to loudspeaker, the spatialisation of the sonic output is predetermined to a large extent by the network's topology. Accordingly, this approach treats acoustic spatialisation as a property that is correlated with other musical aspects through the behaviour of the network. While this second approach constrains compositional freedom, it offers the opportunity to expose the network's dynamics as spatial trajectories of sonic material.

4.3.2 Physical Correspondences

The one-to-one routing of node output to loudspeaker can be understood as an ontological form of mapping between
algorithmic properties and physical objects. That is, the individual nodes in a network manifest both as acoustic point sources and tangible objects. Accordingly, the speaker setup conveys information about the structure of the network that might otherwise be hidden from the audience. This mutual object correspondence is emphasised in Stripes, Sheet Music, Studie 24.2, and Watchers.

Apart from establishing a correspondence between network nodes and loudspeakers, some of the pieces also experiment with the exposure of the network's topology through physical or spatial metaphors. Studie 24.2 uses the given physical structure of an installation as a guiding principle for the network topology: each metallic strut between two loudspeakers corresponds to a connection between two network nodes. Another approach is chosen for Sheet Music, where the distances among loudspeakers are mapped onto delay times. This correspondence between spatial properties of physical space and temporal properties of a sound synthesis system allows the physical space to partially imprint itself on the musical characteristics. In Watchers, relationships among loudspeakers are based on the principle of line of sight. Since visitors can affect the visibility among loudspeakers either by manually changing the loudspeakers' orientations or by blocking the infrared signals with their bodies, this principle of correspondence can be exploited as an interaction affordance.

5. CONCLUSION

This publication provides a comparison of different musical approaches for working with time-delay and feedback networks. We believe that the pieces realised in the context of this project are sufficiently diverse to inform a general evaluation of the creative potential of such networks. Nevertheless, it is clear that the chosen approaches hardly represent the full breadth of possibilities. Some of the pieces do not follow fundamentally different approaches but belong to one of several families of similar works that have emerged during the project. Furthermore, two different approaches are underrepresented in this body of works: first, network-based forms of sound synthesis that incorporate algorithms from outside the musical domain in order to experiment with non-standard forms of sound synthesis, and second, installations in which musical strategies for establishing correspondences between musical algorithms and physical properties are exploited as affordances for interactive engagement. In a future continuation of this project, we hope to be able to offer calls for work that specifically focus on these topics.

6. REFERENCES

[1] D. Bisig and P. Kocher, "Early investigations into musical applications of time-delayed recurrent networks," in Proc. of the Generative Art Conference, Milan, 2013.

[2] D. Sanfilippo and A. Valle, "Towards a typology of feedback systems," in Proc. of the Int. Computer Music Conference, 2012, pp. 30-37.

[3] D. Rocchesso and J. O. Smith, "Circulant and elliptic feedback delay networks for artificial reverberation," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 1, pp. 51-63, 1997.

[4] S. Bilbao, Wave and Scattering Methods for Numerical Simulation. Chichester: John Wiley & Sons, 2004.

[5] G. Surges, T. Smyth, and M. Puckette, "Generative feedback networks using time-varying allpass filters," in Proc. of the Int. Computer Music Conference, 2015.

[6] K. Ohya, "Sound variations by recurrent neural network synthesis," in Proc. of the Int. Computer Music Conference, Ann Arbor, 1998, pp. 280-283.

[7] A. Eldridge, "Neural oscillator synthesis: Generating adaptive signals with a continuous-time neural model," in Proc. of the Int. Computer Music Conference, Barcelona, 2005.

[8] M. Hamman, "Dynamically configurable feedback/delay networks: A virtual instrument composition model," in Proc. of the Int. Computer Music Conference, Aarhus, 1994, pp. 394-397.

[9] B. Battey, "Musical pattern generation with variable-coupled iterated map networks," Organised Sound, vol. 9, no. 2, pp. 137-150, 2004.

[10] P. Galanter, "What is generative art? Complexity theory as a context for art theory," in Proc. of the Generative Art Conference, Milan, 2003.

[11] A. Dorin, J. McCabe, J. McCormack, G. Monro, and M. Whitelaw, "A framework for understanding generative art," Digital Creativity, vol. 23, pp. 239-259, 2012.

[12] M. Whitelaw, Readme 100: Temporary Software Art Factory. Norderstedt: Books on Demand, 2005, pp. 135-154.

[13] T. Davis, "Complexity as process: Complexity-inspired approaches to composition," Organised Sound, vol. 15, no. 2, pp. 137-146, 2010.

[14] S. Kim-Cohen, In the Blink of an Ear: Toward a Non-Cochlear Sonic Art. A&C Black, 2009.

[15] A. Eldridge, "The poietic power of generative sound art for an information society," the sciences, vol. 2, no. 7, p. 18, 2012.

[16] E. M. Izhikevich, "Simple model of spiking neurons," IEEE Transactions on Neural Networks, vol. 14, no. 6, pp. 1569-1572, 2003.

[17] P. Kocher and D. Bisig, "The sound installation Sheet Music," in Proc. of the Generative Art Conference, Rome, 2013.

[18] D. Bisig, "Watchers: an installative representation of a generative system," in Proc. of the 5th Conference on Computation, Communication, Aesthetics & X, Lisboa, 2017.
A generative approach has been intensely pursued in the digital domain, i.e. where a certain generative system has an established connection between software and final output. Computer music has worked extensively on algorithmic composition techniques that have been applied since its inception both to event organization and audio generation [1]. On the other side, physical computing [2] has successfully tried in recent years to fill the gap between computation and the physical world, by linking algorithmic design to a heterogeneous set of production systems, e.g. in the extensive applications of 3D printing. Concerning electronics, and in particular electronic circuit design, it can be observed that the latter is a complex task, as it involves accurate selection of components in relation to a specific desired output. Thus, algorithmic techniques are used on the design side mostly for circuit optimization. As an example, evolutionary algorithms have been used to design circuits: the rationale is that in some cases a certain performance is expected and the algorithm is iteratively run so as to design a circuit that fits the targeted performance [3]. In this sense, generation is constrained by an expected result rather than focusing on the exploration of new behaviors (see also [4] for AI techniques based on evolutionary computation for circuit design). Moreover, hardware layout for electronic components involves many critical steps, depending on both component features (packaging) and routing constraints (e.g. non-overlapping of routes). Thus, a completely automated pipeline, as required by a truly generative approach, linking algorithmic models to
Generative approaches to art and creativity are typically intended as ways to yield complex structures by defining compact formal procedures. A plethora of models and techniques has been defined in the last decades, including, among others, formal grammars, fractals, and cellular automata [5]. As such approaches are formal, they have been applied/mapped to various aesthetic domains, including e.g. visual art and music composition. The rationale at their base is to shift the focus from the final work to the meta-level, that of a system capable of generating various legal, well-formed outputs. In turn, outputs may be evaluated, e.g. filtered in terms of acceptance/refusal in relation to some (un)desired features, and the evaluation may be used to tune the system by means of a feedback loop. In short, algorithmic techniques like the aforementioned ones are mostly used to explore the output space of a system. In the domain of sound production, analog audio synthesis, far from being a dead end, has survived digital audio and is at the moment flourishing among musicians, in particular in hybrid systems that retain analog audio generation while assigning control to digital modules. In the field of electronic circuits for music applications, modular systems that can be assembled in various ways have been a standard solution since the inception of analog electronic music, and recently various efforts have been made to create miniature, highly reconfigurable modular systems (e.g. littleBits 1). However, as the assembly is left to the user, all these systems are not suited for generative techniques. In this sense, the generative process should move to a lower hardware level. A generative approach to audio circuit design has not been investigated, as far as we know, as typical audio circuits implement well-defined schemes that are a priori constrained in relation to specific needs, e.g. noise reduction optimization or adherence to a certain already es-
1 http://littlebits.cc
Copyright: © 2017 Simone Pappalardo et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
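The generate–evaluate–filter loop described above can be sketched in a few lines. The following Python fragment is purely illustrative: the rewriting rules and the acceptance test are invented, standing in for any generative system whose outputs are filtered by desired features.

```python
import random

# Toy rewriting system (hypothetical rules, for illustration only).
RULES = {"A": ["AB", "A"], "B": ["BA", "B"]}

def generate(axiom, steps, rng):
    """Expand the axiom by randomly chosen rule applications."""
    s = axiom
    for _ in range(steps):
        s = "".join(rng.choice(RULES.get(c, [c])) for c in s)
    return s

def acceptable(s, max_len=32):
    """Evaluation step: accept or refuse an output by a desired feature."""
    return len(s) <= max_len

def explore(axiom="A", steps=5, trials=100, seed=1):
    """Generate many outputs and keep only the acceptable ones:
    exploring the output space of the system under a filter."""
    rng = random.Random(seed)
    return [s for s in (generate(axiom, steps, rng) for _ in range(trials))
            if acceptable(s)]

outputs = explore()
```

In a full feedback loop, the acceptance statistics would in turn be used to re-weight or rewrite the rules themselves.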
1. Model retrieval: as a first step, an already established audio circuit is taken into account, so as to start with a well-known and working example;

2. Functional decomposition: the circuit is theoretically decomposed into functional units that can be thought of as closed elements provided with inputs and outputs. These units can include various electronic component configurations. Granularity at the single component level (e.g. a diode) is too basic to provide an operative base for generation, as too many constraints would have to be taken into account.

3. Rule-based re-organisation: functional units are considered as symbols to be arranged in strings following a specific formalization.

While conceptually separated, the three steps are tightly integrated, as there is a constant feedback loop among searching for suitable audio circuits, investigating their fit into a modular approach, and assessing the robustness of scaling up component assembly via the generative procedure. In the following, we discuss this methodology in relation to a case study.

3. MODEL RETRIEVAL AND FUNCTIONAL DECOMPOSITION IN A TEST CASE

In this section we discuss audio analog electronics, that is, in reference to the previous process, both model (1) and decomposition (2). We will discuss a specific test case to show various emerging issues, but the former is meant as an example of the previously introduced, general methodology. As already said, while it might be desirable to uncouple electronic circuit design from generative modeling, there is a feedback loop between the domain to be generatively modeled (electronic circuits) and the modeling system (see later). In general terms, our basic design choice was targeted to define an expansion procedure of a certain basic circuit. Indeed, other circuit designs may be taken into account as candidates.

Our aim has been to look for an already established audio schematic (methodology, step 1) provided with two features. First, it should be expandable by iterating a limited set of basic units, as the goal of the main electronic circuit is to link independent objects characterized by audio input and output (methodology, step 2). Second, it should provide a spectral enrichment (a music desideratum) without a final amplitude overload (an audio constraint related to the generative procedure, as iteration results in stacking an undefined number of components).

Figure 2. Diode clamping circuit.

In relation to step 1, it can be observed that diode applications have been widely used in audio. In particular, diode clipping (Figure 1) is often used in analog audio circuits to create distortion effects or to protect a device's input from possible current overloads; as a consequence, various schematics implementing this technique are available (e.g. [8]). Diodes used in analog effects for guitar generally clip when the input signal voltage reaches the forward threshold voltage of the component. For our circuit we decided to work with Schottky diodes so as to have a very low clipping threshold (typically 0.2 V against the 0.7 V of a silicon type).

In classical analog distortion guitar effects, the amplitude of the input signal is limited in either direction by a fixed amount. Indeed, in these devices, the reference threshold voltage for the diodes is the circuit ground, so that the input signal clips at the small forward voltage of the forward-biased diode. In practical applications, the threshold voltage of the diode, and therefore the signal clipping point, can be set to any desired value by increasing or decreasing the reference voltage [9]. It is therefore theoretically possible to modulate the clipping threshold of a wave by setting an audio-controlled variable threshold as the forward biasing of the diode. In our case, the threshold of the diode is controlled by a clamping circuit [10] (Figure 2). In order both to isolate the effects of the two circuits (clamping and clipping) and to control impedances to implement a simulation of an audio-modulated diode with Spice3 [11], we improved the theoretical schematics in Figures 1 and 2 by introducing two operational amplifiers, used as voltage followers. The resulting circuit is shown schematically in Figure 3. It is still a theoretical circuit, but sufficient to obtain a simulation of a carrier sine wave (220 Hz) dynamically clipped by another sinusoidal modulator (439 Hz). Figure 4 shows a new simulation of the
Figure 3. Half wave diode-modulated clipping (dmc)
same schematic, this time featuring a 220 Hz carrier and a 1239 Hz modulator. This schematic allows the modulation of the negative half-waves of a carrier sine by changing at audio rate the voltage reference at the anode of a diode.

In order to modulate both the negative and positive parts of the carrier sine wave, Figure 5 shows the implementation of a duplication/reversal of the clamping schematic. Such a clamping circuit is intended to limit a third modulating wave in the positive range. This signal is used to induce a fluctuation in the voltage at the cathode of a further diode that modulates the positive part of the carrier half-wave. Figure 5 shows a Spice3 simulation with positive modulator = 1123 Hz, negative modulator = 439 Hz, and carrier = 220 Hz.

The outlined schematic can be expanded by systematic development, through a nesting and multiplication of modulating objects around the axis of the carrier signal, thus fulfilling the requirement of step 2 of our methodology, which prepares rule-based re-organization (step 3). In order to allow the nesting of elements, we took advantage of typical features of operational amplifiers. These components (op amps, as they are commonly referred to), thanks to their impedance features and their ease of use, make it simple to cascade several parts, adding the possibility of frequency control by means of inserting a few capacitors. In our schematic model we add both a voltage follower object, used for coupling the various circuits to each other, and an op amp input amplifier to filter, amplify and/or attenuate three input signals: carrier, negative modulator, positive modulator. The schematic model can then be decomposed into five units (see later): dc+, dc-, buffer, sig in, sig out.

Figure 6. a) dc+.

Figures 6 to 10 show all the schematics of the final single objects. Figure 11 shows the complete root circuit ready for a Spice simulation. Colored blocks represent previously discussed components, that is, a) dc+; b) dc-; c) buffer; d) sig in; e) sig out (for the other letters, see Section 4 and the companion Figure 13). Figure 12 shows a physical test implementation of the root circuit, a minimal assemblage of the components that acts as the starting state for generation.

4. RULE-BASED RE-ORGANIZATION

In general terms, many formalisms are available as generative models to govern an arrangement of audio components. In our case, we have explored L-systems [12].
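A deterministic L-system is just parallel string rewriting, and its core can be sketched in a few lines. The fragment below is Python for illustration only: the paper's implementation is in SuperCollider, and the rule set here is the textbook Fibonacci L-system, not the authors' circuit grammar.

```python
def rewrite(axiom, rules, n):
    """Apply the production rules to every symbol in parallel, n times."""
    s = axiom
    for _ in range(n):
        s = "".join(rules.get(c, c) for c in s)
    return s

# Classic Fibonacci L-system: a -> ab, b -> a.
fib_rules = {"a": "ab", "b": "a"}
strings = [rewrite("a", fib_rules, n) for n in range(5)]
# string lengths follow the Fibonacci sequence: 1, 2, 3, 5, 8
```

In the circuit application, each symbol of the rewritten string stands for one functional unit, and the string as a whole is interpreted as a component graph.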
Figure 7. b) dc-.
Figure 8. c) buffer.
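The behaviour these unit schematics implement — a carrier clipped against a threshold that is itself modulated at audio rate, as in the Figure 3 simulation — can be approximated numerically. The following Python sketch is an idealized approximation, not a Spice circuit simulation; the sample rate is an assumption, while the 0.2 V Schottky drop and the 220/439 Hz frequencies come from the text.

```python
import math

FS = 44100   # sample rate in Hz (an assumption for this sketch)
V_F = 0.2    # approximate Schottky forward voltage (V), from the text

def dmc(n, f_carrier=220.0, f_mod=439.0, amp=2.0):
    """Half-wave diode-modulated clipping: the negative half-wave of the
    carrier is clipped at -(|modulator| + V_F), i.e. at a threshold that
    moves at audio rate as the clamping circuit shifts the diode's
    reference voltage."""
    out = []
    for i in range(n):
        t = i / FS
        carrier = amp * math.sin(2 * math.pi * f_carrier * t)
        mod = amp * math.sin(2 * math.pi * f_mod * t)
        out.append(max(carrier, -(abs(mod) + V_F)))
    return out

y = dmc(4410)  # 100 ms of output
```

Because the clipping threshold moves sinusoidally, the output spectrum is enriched with sidebands, which is the spectral-enrichment desideratum stated earlier.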
Figure 11. Root circuit (axiom).

Figure 13. Axiom, with connection labels.
ization [15]. Figure 13 shows the topology of the circuit of Figure 11 in relation to its modeled decomposition into the axiom. Each component is colored in relation to its type, and labeled according to the mapped symbol, as in Figure 11. It can be seen that output ports are connected to input ports with the same identifier. Figure 14 shows the first application of the ruleset (n = 1).

Figure 14. First rewriting, with connection labels.

The definition of the formal system has been driven by semantic constraints, i.e. by taking into account viable electronic connections. While increasing n, the resulting string shows some spurious artifacts in terms of its electronic interpretation. Figure 15 shows the topological graph with n = 2 (labels have been replaced with the component types). It is apparent from Figure 15 that buffers at the graph side are misplaced in the electronic interpretation, as they are isolated. To solve this issue, a recursive pruning strategy is applied to the graph, so that no isolated buffers are present in the final graph. Figure 16 shows the graph of Figure 15 after pruning.

5. AN INTEGRATED PIPELINE

While algorithmic modeling of electronic circuit topologies can have a theoretical interest, it is indeed very far from a practical application. The aim of a generative approach is indeed to obtain circuits whose complexity cannot be reached by hand design and that are (or should be) still guaranteed to work by the system definition itself. This requires an integrated automatic pipeline, ideally from formal model seamlessly to PCB printing. The implemented solution is shown in Figure 18. Grey boxes represent used software, while white boxes are internal blocks developed to perform specific functions. Plain text labels represent textual data. The generative model has been implemented in SuperCollider [16], an audio-targeted programming language that features high-level data collection manipulation. The SuperCollider component performs two tasks: generation and mapping. L-Generator is the module implementing the L-system generation: it takes an L-system specification as textual input and outputs a string (as a result of rewriting). The string is fed into the Parser, which converts it into a data structure apt to be further manipulated by the Pruner module. The resulting data structure can then be mapped into other formats. The mapping task can be seen as a software gluing process. The DotGenerator creates a dot file to be subsequently rendered by the dot program as a visual file. As discussed, this is instrumental in rapidly understanding the behavior of the generation process. LTGenerator creates input files (in the asc ASCII format) for the LTSpice IV software [17]. LTSpice IV includes simulation of the circuit via Spice3 (including audio generation, a crucial feature in our case) and schematic drawing. In order to generate asc files, component descriptions are stored in specification templates (spec files), in which opportune textual replacements for identifiers and positions are performed. Component circuits (a–e) are algorithmically placed on the board by following a simple carriage return strategy over a fixed-width grid and using textual labeling of inputs and outputs as internal references.
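The pruning pass can be illustrated on a toy graph. This is a Python sketch of the idea only — the actual Pruner operates on SuperCollider data structures, and the graph below is invented:

```python
def prune_isolated_buffers(nodes, edges):
    """Recursively drop buffer nodes left without any connection, so that
    no isolated buffer survives in the final graph.  `nodes` maps a node
    id to its component type; `edges` is a list of (src, dst) pairs."""
    while True:
        degree = {u: 0 for u in nodes}
        for a, b in edges:
            degree[a] += 1
            degree[b] += 1
        isolated = {u for u, t in nodes.items()
                    if t == "buffer" and degree[u] == 0}
        if not isolated:
            return nodes, edges
        nodes = {u: t for u, t in nodes.items() if u not in isolated}
        edges = [(a, b) for a, b in edges
                 if a not in isolated and b not in isolated]

# A rewriting that left one buffer dangling (toy example):
nodes = {"in": "sig_in", "b1": "buffer", "b2": "buffer", "out": "sig_out"}
edges = [("in", "b1"), ("b1", "out")]
kept_nodes, kept_edges = prune_isolated_buffers(nodes, edges)
```

The loop repeats because removing one node can, in general, isolate another; it terminates when no isolated buffer remains.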
(Pipeline stages: 1 generative design, 2 electronic modelling, 3 circuit visualisation, 4 layout organisation, 5 PCB printing.)

Figure 20. LTSpice audio simulation with 4 rewritings, 439, 440, 442.5 Hz.
sible to create singleton, unique synthesizers/processors. Indeed, the most delicate point is the layout for PCB printing of large generated circuits, which is still not robust from an automation perspective. But, once fine-tuned in relation to real physical behavior, circuits can be simulated, and only the final step (5) requires handcrafted operations. As future work, we are planning to print PCB boards and test real final complex circuits, and to analyze other analog circuits in order to decompose them, so that other generative models may be taken into account.

[7] R. Ghazala, Circuit-Bending: Build Your Own Alien Instruments. Indianapolis: Wiley, 2005.

[8] R. Salminen. (2000) Design Your Own Distortion. [Online]. Available: http://www.generalguitargadgets.com/how-to-build-it/technical-help/articles/design-distortion/

[9] P. Horowitz and W. Hill, The Art of Electronics, 3rd ed. New York, NY, USA: Cambridge University Press, 1989.
[10] VV.AA. (2012) DIY Circuit Design: Waveform Clamping. [Online]. Available: https://www.engineersgarage.com/tutorials/diy-circuit-design-waveform-clamping?page=3

Figure 23. LTSpice drawing from an automatically generated asc file.

[1] C. Roads, The Computer Music Tutorial. Cambridge, Mass.: MIT Press, 1996.

[2] D. O'Sullivan and T. Igoe, Physical Computing: Sensing and Controlling the Physical World with Computers. Boston, Mass.: Course Technology, 2004.

[14] H. Meinhardt, The Algorithmic Beauty of Sea Shells. New York, NY, USA: Springer-Verlag New York, Inc., 2003.

[15] E. Gansner, E. Koutsofios, and S. North, Drawing graphs with dot, 2006.

[17] G. Brocard, The LTSpice IV Simulator: Manual, Methods and Applications. Künzelsau: Swiridoff Verlag.
ABSTRACT

Music performance databases that can be referred to as numerical values play important roles in the research of music interpretation, the analysis of expressive performances, automatic transcription, and performance rendering technology. The authors have promoted the creation and public release of the CrestMusePEDB (Performance Expression DataBase), which is a performance expression database of more than two hundred virtuoso piano performances of classical music from the Baroque period through the early twentieth century, including music by Bach, Mozart, Beethoven, and Chopin. The CrestMusePEDB has been used by more than fifty research institutions around the world. It has especially contributed to research on performance rendering systems as training data. Responding to the demand to enlarge the database, we have started a new three-year project to enhance the CrestMusePEDB with a 2nd edition that started in 2016. In the 2nd edition, phrase information that the pianists had in mind while playing is included, in addition to the quantitative performance data. This paper introduces an overview of the ongoing project.

1. INTRODUCTION

The importance of music databases has been recognized through the progress of music information retrieval technologies and benchmarks. Since the year 2000, some large-scale music databases have been created and have had a strong impact on the global research arena [1–4]. Meta-text information, such as the names of composers and musicians, has been attached to large-scale digital databases and has been used in the analysis of music styles, structures, and performance expressions, from the viewpoint of social filtering in MIR fields.

Performance expression data plays an important role in formulating impressions of music [5–8]. Providing a music performance expression database, especially one describing deviation information from a neutral expression, can be regarded as research in the sound and music computing (SMC) community. In spite of there being much music research using music performance data, few projects have dealt with the creation of a music performance database open to the public.

In musicological analysis, some researchers have constructed a database of the transition data of pitch and loudness and then used the database through statistical processing. Widmer et al. analyzed deviations in the tempi and dynamics of each beat from Horowitz's piano performances [9]. Sapp et al., working on the Mazurka Project [10], collected as many recordings of Chopin mazurka performances as possible in order to analyze deviations of tempo and dynamics for each beat in a similar manner to [9].

The authors have been promoting the creation and public release of the CrestMusePEDB (Performance Expression DataBase), which consists of more than two hundred virtuoso piano performances of classical music from the Baroque period through the early twentieth century, including music by Bach, Mozart, Beethoven, and Chopin [11]. The 1st edition of the CrestMusePEDB (ver. 1.0–3.1) has been used as musical data by more than fifty research institutions throughout the world. In particular, it has contributed to research on performance rendering systems as training data [12, 13].

The database is unique in providing a set of detailed data of expressive performances, including the local tempo for each beat and the dynamics, onset time deviation, and duration of every note. For example, the performance elements provided in the Mazurka Project's data [10] are beat-wise local tempos and dynamics, and precise note-wise performance elements cannot be extracted. In the MAPS database [14], which is widely used for polyphonic pitch analysis and piano transcription, the performance data does not include any temporal deviations and thus cannot be thought of as realistic performance data in the aspect of musical expression. Such detailed performance expression data are crucial for constructing performance rendering systems and realistic performance models for analysis and transcription.

The size of the CrestMusePEDB 1st edition is not large enough compared with other databases in computer science. Demand for the database has been increasing in recent years, particularly in studies using machine learning techniques. In addition, data that explicitly describe the relationship between a performance and the musical structure intended by the performer have been required 1. Responding to these demands, we started a three-year project in 2016 to enhance the CrestMusePEDB in a 2nd edition,

1 In many cases, the apex (the most important) note in a phrase is selected by the performer, and there may be cases in which phrase sections are analyzed based on the performer's own interpretation.

Copyright: © 2017 Mitsuyo Hashida et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
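The beat-wise deviation analyses cited above reduce to simple arithmetic on beat onset times. The following is a generic Python sketch, not the cited projects' tooling, and the onset values are invented:

```python
def local_tempi(beat_onsets):
    """Local tempo (BPM) of each inter-beat interval, from onsets in seconds."""
    return [60.0 / (b - a) for a, b in zip(beat_onsets, beat_onsets[1:])]

def tempo_deviations(beat_onsets):
    """Deviation of each local tempo from the mean tempo, in BPM."""
    tempi = local_tempi(beat_onsets)
    mean = sum(tempi) / len(tempi)
    return [t - mean for t in tempi]

# A performed bar that slows down: inter-beat intervals of 0.5, 0.5 and 0.6 s.
devs = tempo_deviations([0.0, 0.5, 1.0, 1.6])
```

The same per-beat framing extends to dynamics (e.g. deviations of loudness per beat from the mean loudness).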
4. PROCEDURE FOR MAKING THE DATABASE

4.1 Overview

The main goals of this edition are to enhance the performance data and to provide structural information paired with the performance data. The outline of database generation is shown in Fig. 2.

Musical structure differs depending on the pianist's interpretation and even on the score version. Some of the musical works have multiple versions of the musical scores, such as the Paderewski Edition, the Henle Verlag, and the Peters Edition, e.g., Mozart's famous piano sonata in A Major K. 331.

Before the recording, the pianists were asked to prepare to play their most natural interpretation of the score that they usually use. Further, they were requested to prepare additional interpretations, such as one based on a different score edition, one suggested by a professional conductor, and an over-expressed version of the pianist's own interpretation. Pianists were requested to express the difference between these multiple interpretations with regard to the phrase structure. After the recording, we interviewed each pianist on how he or she tried to express the intended structure of the piece, after listening to the recorded performances.

Through these steps, source materials for the database are obtained. Then, the materials are analyzed and converted to data, so that they can be referenced in the database. The main procedure of this stage is note-to-note matching between the score data and the performance. Some notes in the performance may be erroneously played, and the number of notes played in a trill is not constant. To handle such situations, we developed an original score-performance alignment tool based on a hidden Markov model (HMM) [17]. In the following subsections, the recording procedure and an overview of the alignment tool are described.

4.2 Recording

The key feature in the creation of the 2nd edition PEDB is that we can talk with the pianists directly about their performances. Before the recording procedure, we confirmed with each pianist that the recorded performance should clarify its phrase structure (phrase or sub-phrase) and its apex note, as the interpretation of the performance.

Pianists are asked to (1) play based on their own interpretation for all pieces and (2) play with exaggerated expression retaining the phrase structure for some pieces. In addition, for some pieces, pianists are asked to (3) play with the intention of accompanying a soloist or dancers. If there are different interpretations or score editions of one piece, they played both versions. For Mozart's piano sonata in A Major K. 331 and the 2nd movement of Beethoven's Pathetique Sonata, different versions of the scores have been provided to the pianists by the authors.

Recordings were done in several music laboratories or recording studios. As shown in Fig. 3, performances played on a YAMAHA Disklavier were recorded as both audio signals and MIDI data, including pedal controls, via Pro Tools. We also recorded video for the interview process after recording.

4.3 Alignment Tool

After the MIDI recordings are obtained, each MIDI signal is aligned to the corresponding musical score. To improve efficiency, this is done in two stages: automatic alignment and correction by an annotator. In the automatic alignment stage, a MIDI performance is analyzed with an algorithm based on an HMM [17], which is one of the most accurate and fast alignment methods for symbolic music. A post-processing step to detect performance errors, i.e., pitch errors, extra notes, and missing notes, is also included in this stage.

In the following stage of correction by an annotator, a vi-
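The note-to-note matching problem — performed notes may contain pitch errors, extra notes, or missing notes — can be illustrated with a classic dynamic-programming alignment. This Python sketch is a much-simplified stand-in for the HMM-based tool described here; the pitch sequences are invented MIDI note numbers.

```python
def align(score, perf):
    """Minimal number of edits between a score pitch sequence and a
    performed pitch sequence, counting pitch errors (substitutions),
    extra notes (insertions) and missing notes (deletions)."""
    n, m = len(score), len(perf)
    # cost[i][j]: minimal edits between score[:i] and perf[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i          # remaining score notes are "missing"
    for j in range(m + 1):
        cost[0][j] = j          # remaining performed notes are "extra"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if score[i - 1] == perf[j - 1] else 1  # pitch error
            cost[i][j] = min(cost[i - 1][j - 1] + sub,
                             cost[i - 1][j] + 1,   # missing note
                             cost[i][j - 1] + 1)   # extra note
    return cost[n][m]

# Four score notes performed with one wrong pitch and one extra note.
errors = align([60, 62, 64, 65], [60, 61, 64, 65, 67])
```

An HMM aligner generalizes this idea with probabilistic costs and also copes with timing deviations, trills, and ornaments, which fixed unit costs cannot model.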
Figure 3. Technical setup for the recording.

Figure 5. A sample of phrase structure data (W. A. Mozart's Piano Sonata K. 331, 1st Mov., Peters Edition). Square bracket and circle mark denote phrase/sub-phrase and apex, respectively.
5. PUBLISHING THE DATABASE

We released a beta version of the 2nd edition PEDB, consisting of 103 performances of 43 pieces by two professional pianists, at the end of May 2017. Table 1 shows the list of the performances included in the beta version. As shown in this table, some of the pieces are played with more than one expression or interpretation. This edition provides (1) recording files (WAV and MIDI), (2) score files (MusicXML and MIDI), (3) infor-

3 http://crestmuse.jp/pedb_edition2/

6. CONCLUDING REMARKS

In this paper, we introduced our latest attempt to enhance the PEDB and overviewed part of the database as a beta version to investigate users' requests for the database. Although the number of data currently collected in the beta version is small, we have completed the workflow for the database creation. In the coming years, we plan to increase the number of performance data to more than five hundred
Table 1. List of performances included in the beta version. self: the player's expression, over: over-expression, accomp.: played as the accompaniment for a solo instrument or chorus, waltz: focused on leading a dance, Hoshina: played along with a professional conductor's interpretation, Henle and Peters: used the score of the Henle Edition and the Peters Edition, and others: extra expressions via discussion between the authors and each player.
Performances
No. Composer Title Player 1 Player 2
# interpretation # interpretation
1 J. S. Bach Invention No. 1 2 self / over 2 self / over
2 J. S. Bach Invention No. 2 2 self / over 1 self
3 J. S. Bach Invention No. 8 - - 1 self
4 J. S. Bach Invention No. 15 2 self / over 1 self
5 J. S. Bach Wohltemperierte Klavier I-1, prelude 2 self / accomp. -
6 L. V. Beethoven Für Elise 1 self 3 self / over / note_d
8 L. V. Beethoven Piano Sonata No. 8 Mov. 1 1 self 1 self
9 L. V. Beethoven Piano Sonata No. 8 Mov. 2 2 self / Hoshina 2 self / Hoshina
10 L. V. Beethoven Piano Sonata No. 8 Mov. 3 1 self 1 self
7 L. V. Beethoven Piano Sonata No. 14 Mov. 1 2 self / over 1 self
11 F. Chopin Etude No. 3 2 self / over 1 self
12 F. Chopin Fantaisie-Impromptu, Op. 66 2 self / over 1 self
15 F. Chopin Mazurka No. 5 1 self -
13 F. Chopin Mazurka No. 13 2 self / over -
14 F. Chopin Mazurka No. 19 2 self / over -
16 F. Chopin Nocturne No. 2 2 self / over 1 self
17 F. Chopin Prelude No. 1 2 self / over -
18 F. Chopin Prelude No. 4 1 self 1 self
19 F. Chopin Prelude No. 7 1 self 1 self
20 F. Chopin Prelude No. 15 1 self 1 self
21 F. Chopin Prelude No. 20 1 self -
22 F. Chopin Waltz No. 1 2 self / waltz 2 self / waltz
23 F. Chopin Waltz No. 3 2 self / over 1 self
24 F. Chopin Waltz No. 7 2 self / waltz 2 self / waltz
25 F. Chopin Waltz No. 9 2 self / over 1 self
26 F. Chopin Waltz No. 10 1 self 1 self
27 C. Debussy La fille aux cheveux de lin 2 self / over -
28 C. Debussy Rêverie 1 self -
29 E. Elgar Salut d'amour Op. 12 2 self / accomp. -
30 G. Händel Largo / Ombra mai fù 2 self / accomp. -
31 Japanese folk song Makiba-no-asa 2 exp1 / exp2 -
32 F. Liszt Liebesträume 1 self -
33 F. Mompou Impresiones intimas No. 5 "Pájaro triste" 1 self -
34 W. A. Mozart Piano Sonata K. 331 Mov. 1 3 self / Henle / Peters 2 Henle / Peters
35 W. A. Mozart Twelve Variations on "Ah vous dirai-je, Maman" 2 self / over -
36 T. Okano Furusato 2 self / accomp. 2 self / accomp
37 T. Okano Oboro-zuki-yo 2 self / accomp. -
38 S. Rachmaninov Prelude Op. 3, No. 2 2 self / neutral -
39 M. Ravel Pavane pour une infante défunte 1 self -
40 E. Satie Gymnopédies No. 2 2 self / neutral -
41 R. Schumann Kinderszenen No. 7 "Träumerei" 2 self / neutral 1 self
42 P. I. Tchaikovsky The Seasons Op. 37b No. 6 "June: Barcarolle" 2 self / over -
Total: 70 31
by adding new data to the previous version, considering the compatibility with the data format of the PEDB 1st edition. There is no other machine-readable performance database associated with musical structure. We hope that the database can be utilized in many research fields related to music performances. As future work, we would like to develop some applications, in addition to the data-format design, so that the database can be used by researchers who are not familiar with information technology.

Acknowledgments

The authors are grateful to Dr. Y. Ogawa and Dr. S. Furuya for their helpful suggestions. We also thank Ms. E. Furuya for her advice as a professional pianist, and Professor T. Murao for his cooperation. This work was supported by JSPS KAKENHI Grant Numbers 16H02917, 15K16054, and 16J05486. E. Nakamura is supported by the JSPS research fellowship (PD).

7. REFERENCES

[1] J. S. Downie, "The music information retrieval evaluation exchange (MIREX)," D-Lib Magazine, vol. 12, no. 12, 2006.

[2] http://staff.aist.go.jp/m.goto/rwc-mdb/, 2012 (last update).
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
Silverman, 1986) has been used in recent segmentation studies (e.g. Bruderer, 2008; Hartmann, Lartillot, & Toiviainen, 2016b) to obtain a precise aggregation across boundary data. This approach consists of obtaining a probability density estimate of the boundary data using a Gaussian function; in a KDE curve, temporal regions that prompted boundary indication from many listeners are represented as peaks of perceptual boundary density. This continuous representation of perceptual segment boundary probability has been used to compare different stimuli, groups of participants, and segmentation tasks (Bruderer, 2008; Hartmann et al., 2016b).

Prediction of perceptual boundary density in the audio domain often involves the computation of novelty curves (Foote, 2000), which roughly describe the extent to which a temporal context is characterized by two continuous segments separated by a discontinuity with respect to a given musical feature. Recent studies have compared novelty curves with perceptual boundary density curves (Hartmann, Lartillot, & Toiviainen, 2016a) and compared peaks derived from both curves (Mendoza Garay, 2014), showing that novelty detection can predict segmentation probabilities derived from numerous participants.

A preliminary step in novelty detection and other segmentation frameworks is the extraction of a musical feature, which determines the type of musical contrast to be detected. Timbre, tonality, and to some extent rhythm (Jensen, 2007) have been considered important features for structural analysis. Relatively high prediction of musical structure has been found for two musical features: MFCCs (Mel-Frequency Cepstral Coefficients) for timbre description (Foote, 2000) and the Chromagram or similar features (Serrà, Müller, Grosche, & Arcos, 2014) for the description of pitch changes; combined approaches have also been proposed (Eronen, 2007). The segmentation accuracy achieved by novelty curves seems to depend strongly on the musical feature that is used and on the choice of temporal parameters for feature extraction (Peeters, 2004). In addition, different musical pieces might require different musical features for optimal prediction (Peiszer et al., 2008); as pointed out by McFee and Ellis (2014), structure in pop and rock is frequently determined by harmonic change, whereas jazz is often sectioned based on instrumentation. No single feature can optimally predict all musical examples; certain features are more appropriate than others depending on particular aspects of musical pieces (Smith & Chew, 2013).

This mechanism is, however, not well understood at present: it is unclear what characteristics of musical features contribute to the segmentation accuracy for a given musical piece. Addressing this issue would be important since it would make it possible to select optimal musical features for novelty detection, avoiding the computation of novelty curves that would not yield satisfactory results; it would also help to develop better alternatives to the novelty detection approach, with the aim of reducing computational costs.

To analyze the impact of different factors associated with novelty curves upon segmentation and to compare different segmentation algorithms, a number of performance measures have been proposed, such as precision, recall, and F-measure (Lukashevich, 2008); correlation between time series has also been applied for this purpose (Hartmann et al., 2016a). One of the factors that has been shown to contribute strongly to segmentation accuracy is the width of the novelty kernel (e.g. Hartmann et al., 2016a), which roughly corresponds to the temporal context with respect to which novelty is estimated for each time point.

However, to the best of our knowledge, no study has systematically investigated what specific aspects of novelty curves contribute to their accuracy for a given stimulus. It would be relevant to investigate what characteristics of novelty curves relate to their accuracy, as this could allow the relative suitability of a novelty curve for a given stimulus to be predicted without direct comparison against ground truth, and could allow the computation of novelty curves assumed not to deliver satisfactory performance for a particular stimulus to be bypassed. From the viewpoint of music perception, it would be useful to better understand the extent to which musical characteristics perceived by listeners are directly apparent from novelty curves, and to gain more knowledge on the types of musical changes that prompt both boundary perception and high novelty scores.

Recently, Hartmann et al. (2016a) studied segmentation accuracy achieved for concatenated musical pieces using different novelty curves. It was found that optimal prediction of perceptual segment boundary density involves the use of large kernel widths; the study also highlights the role of rhythmic and pitch-based features in segmentation prediction. This study is a follow-up to the paper by Hartmann et al. (2016a), as it focused further on prediction of perceptual segmentation density via novelty detection and examined the same musical pieces; in particular, we investigated one of the perceptual segmentation sets (an annotation segmentation task performed by musician listeners) studied by Hartmann et al. (2016a), and explored perceptual segmentation density and novelty curves for individual musical stimuli. The aim of the present study was to understand whether or not local variability of musical features and distance between novelty peaks are related to the accuracy of segmentation models. The following research questions guided our investigation:

1. What specific aspects of musical stimuli that account for segmentation accuracy can be directly described from musical features?

2. What stimulus-specific attributes of novelty curves determine optimal segmentation accuracy?

As regards the first research question, we expected to find an inverse relationship, dependent on musical stimulus, between the magnitude of local feature variation and the accuracy obtained via novelty detection. For instance, musical stimuli displaying infrequent local tonal contrast would yield optimal segmentation accuracy via tonal novelty curves. The rationale behind this hypothesis is that if there is not much local change in a feature, then the local changes that do occur in that feature should be more salient. One of the most basic
Gestalt principles, the law of Prägnanz, relates to this rationale, because it states that, under given conditions, perceptual organization will be as good as possible, i.e. percepts tend to have the simplest, most stable and most regular organization that fits the sensory pattern (Koffka, 1935).

Regarding the second question, we expected that stimuli yielding higher accuracy for a given feature would exhibit a relatively large temporal distance between novelty peaks for that feature. Although this relationship could depend on other factors such as tempo and duration, we believed that the absolute time span between peaks would serve as a rough predictor of accuracy, for the reasons mentioned above regarding the law of Prägnanz. To give an example, optimal accuracy was expected for stimuli characterized by long and uniform segments with respect to instrumentation, delimited by important timbral changes, whereas music characterized by more frequent timbral change and gradual transitions would yield lower accuracy for a timbre-based novelty measure.

Figure 1. General design of the study.

2.1.2 Stimuli

We selected 6 instrumental music pieces; two of them lasted 2 minutes and the other four were trimmed down to this length, for a total experiment duration of around one hour. The pieces (see Hartmann et al., 2016a) comprise a variety of styles and considerably differ from one another in terms of musical form; further, they emphasize aspects of musical change of varying nature and complexity.

2.1.3 Apparatus

An interface in Sonic Visualiser (Cannam, Landone, & Sandler, 2010) was prepared to obtain segmentation boundary indications from participants. Stimuli were presented in randomized order; for each stimulus, the interface showed its waveform as a visual-spatial cue over which boundaries would be positioned (subjects were asked to focus solely on the music). Participants used headphones to play back the music at a comfortable listening level, and both keyboard and mouse were required to complete the segmentation task.

2.1.4 Procedure
3. Freely play back from different parts of the musical example and make the segmentation more precise by adjusting the position of boundaries; removal of any boundaries indicated by mistake is also allowed.

First, for each time series x_i, where i corresponds to a feature dimension, the squared difference between successive time points is obtained. Next, a flux time series is obtained as the square root of the sum across dimensions:

flux(t) = sqrt( sum_i (x_i(t) - x_i(t-1))^2 )
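As a minimal sketch of the flux computation just described (the feature matrix below is an illustrative toy example, not data from the study):

```python
import numpy as np

def feature_flux(features):
    """Flux time series: for each feature dimension, take the squared
    difference between successive time points, then the square root of
    the sum across dimensions (a frame-to-frame Euclidean distance).
    features: (n_frames, n_dims) array; returns (n_frames - 1,) flux."""
    return np.sqrt((np.diff(features, axis=0) ** 2).sum(axis=1))

# Toy feature matrix: a jump between frames 2 and 3 produces a flux
# spike of sqrt(3^2 + 4^2) = 5 at that transition.
x = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0],
              [3.0, 4.0],
              [3.0, 4.0]])
print(feature_flux(x))  # [0. 0. 5. 0.]
```

Note that the flux series is one frame shorter than the feature series, since each value describes a transition between two successive frames.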
[Figure 2: z-scored segmentation accuracy for each musical stimulus, with significance levels p < .05, p < .01, and p < .001 marked.]

3.2 Finding Predictors of Segmentation Accuracy

Musical Feature        Feature Flux   MDSP
Subband Flux           .54            .03
Fluctuation Patterns   .30            .11
Chromagram             .68            .27
Key Strength           .74            .47
Tonal Centroid         .66            .50

Table 1. Correlation between z-transformed accuracy and characterizations of musical features (Feature Flux) and novelty curves (MDSP).
Figure 3. Tonal Centroid, novelty of Tonal Centroid and perceptual boundary density for stimuli Smetana and Couperin.
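A novelty curve like the one shown for these stimuli can be sketched as follows, under simplifying assumptions (cosine self-similarity, a Gaussian-tapered checkerboard kernel, and an illustrative kernel size; this is not the study's exact implementation):

```python
import numpy as np

def novelty_curve(features, kernel_size=16):
    """Foote-style novelty: slide a Gaussian-tapered checkerboard kernel
    along the main diagonal of a cosine self-similarity matrix.
    features: (n_frames, n_dims) array; returns (n_frames,) novelty."""
    n = features.shape[0]
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    ssm = unit @ unit.T
    half = kernel_size // 2
    idx = np.arange(kernel_size) - half + 0.5
    taper = np.exp(-((idx / (0.5 * kernel_size)) ** 2))
    # +1 in the two on-diagonal quadrants, -1 in the off-diagonal ones.
    kernel = np.outer(taper, taper) * np.sign(np.outer(idx, idx))
    padded = np.zeros((n + kernel_size, n + kernel_size))
    padded[half:half + n, half:half + n] = ssm
    return np.array([np.sum(kernel * padded[i:i + kernel_size, i:i + kernel_size])
                     for i in range(n)])

# Toy stimulus: two homogeneous 50-frame segments with different feature
# vectors; the novelty curve should peak near the boundary (frame 50).
rng = np.random.default_rng(0)
seg_a = rng.normal(0.0, 0.1, (50, 12)) + np.eye(12)[0]
seg_b = rng.normal(0.0, 0.1, (50, 12)) + np.eye(12)[6]
nov = novelty_curve(np.vstack([seg_a, seg_b]))
print(int(np.argmax(nov)))
```

The checkerboard kernel rewards high within-segment similarity on either side of the current frame and penalizes cross-segment similarity, so the curve is near zero inside homogeneous regions and peaks at discontinuities.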
musical feature characteristics; we believe that automatic segmentation systems could optimize their parameters for stimuli of different idioms and that such an approach would lead to an increase in accuracy.

Regarding the validity of the approach, we also highlight that our method for studying segmentation involves amassing annotations from multiple listeners for the same stimulus; this possibility has been previously explored by only a few studies on segmentation prediction (Mendoza Garay, 2014; Hartmann et al., 2016a). In contrast, MIR studies on music segmentation are often based on data coming from one or a few annotators; the computation of a KDE is usually not needed, since the set of annotated segment boundaries is directly compared against peaks picked from a novelty curve. In this sense, analyses of perceptual segmentation based on data that is probably more representative of the musician population should be useful for better understanding both perception and prediction of musical structure.

This section examines our research questions in light of the proposed analysis, and assesses the extent to which the stated hypotheses could be supported. Finally, we conclude this article with possible directions for future research.

4.1 Segmentation Accuracy

The first step to address the research questions was to investigate the accuracy in predicting perceived boundaries yielded by different musical features for different musical examples. As shown in Figure 2, accuracy seems to vary greatly according to musical piece, motivating further analyses on novelty detection that focus on each piece separately. It is also apparent that no single algorithm is robust for prediction of all examples, suggesting the importance of incorporating combinations of multiple features (Hartmann et al., 2016a; Müller, Chew, & Bello, 2016), multiple time scales (Kaiser & Peeters, 2013; Gaudefroy, Papadopoulos, & Kowalski, 2015), and other aspects of segmentation (e.g. repetition principles, see Gaudefroy et al., 2015; Paulus & Klapuri, 2006) into novelty approaches. Overall, however, the results seem to support the idea that performance in structural analysis may depend more on the musical stimuli than on the algorithm used or the choice of parameters (Peiszer et al., 2008), which could serve to justify subsequent steps of our analysis.

4.2 Feature-based Prediction of Novelty Detection Performance

Our first hypothesis was that accuracy obtained via novelty detection would increase for stimuli with low local variation of musical features. In support of this hypothesis, we overall found Feature Flux to be a good predictor of the correlation between perceptual segmentation density and novelty (Table 1), with the same pattern of results for pitch-based and rhythmic features. This suggests, for these features, that increased local feature continuity may be indicative of higher accuracy of novelty curves.
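The two predictors compared in Table 1 can be sketched as follows (the per-stimulus arrays are made-up toy values, not the study's data, and reading "z-transformed accuracy" as Fisher's z of the novelty-density correlations is our assumption):

```python
import numpy as np

def mean_peak_distance(novelty):
    """MDSP: mean distance (in frames) between subsequent local maxima
    of a novelty curve (simple strict-neighbour peak picking)."""
    peaks = [i for i in range(1, len(novelty) - 1)
             if novelty[i] > novelty[i - 1] and novelty[i] > novelty[i + 1]]
    return float(np.mean(np.diff(peaks))) if len(peaks) > 1 else float("nan")

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    x = np.asarray(x, float) - np.mean(x)
    y = np.asarray(y, float) - np.mean(y)
    return float((x * y).sum() / np.sqrt((x * x).sum() * (y * y).sum()))

# Hypothetical per-stimulus values: accuracies (novelty-density
# correlations) are Fisher z-transformed before correlating them with
# the MDSP of the corresponding novelty curves.
accuracy = [0.2, 0.35, 0.5, 0.6, 0.7]
mdsp = [4.0, 6.0, 9.0, 11.0, 15.0]
z_acc = np.arctanh(accuracy)  # Fisher z (our assumption)
print(round(pearson(z_acc, mdsp), 2))
```

A positive correlation here would mirror the Table 1 pattern for tonal features: stimuli with more widely spaced novelty peaks tend to be predicted more accurately.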
The second hypothesis of this study was that accuracy would increase for stimuli with more distant novelty peaks. Indeed, the mean distance between subsequent peaks was found to be somewhat indicative of the accuracy of novelty curves with respect to perceptual segmentation density. This suggests, particularly for tonal features, that a longer novelty peak-to-peak duration is associated with higher correlation between detected novelty and perceptual segment density.

Focusing on pitch-based and rhythmic features, the results indicate that high local variability is associated with low accuracy. A possible interpretation is that music characterized by few local changes in pitch or rhythm often involves few, highly contrasting pitch-based or rhythmic structural boundaries. For instance, rhythmically stable melodies are clearly separated by highly discernible rests and long notes in Dvořák. This could be interpreted in the light of theoretical approaches to musical expectation (Narmour, 1992), according to which similarity between successive events generates the expectation of another similar event. If this expectation is eventually not satisfied, a sense of closure may be perceived, prompting the indication of a segment boundary; this may explain why, for example, accurate tonal-based segmentation predictions exhibit low variability at a local level and often involve few, perceptually stark boundaries that delimit homogeneous groupings of events.

Related to our previous result, we found that novelty curve characteristics can be used as predictors of accuracy: a larger distance between subsequent novelty peaks was found to result in higher novelty accuracy (Table 1), especially for tonal features. As mentioned above, this relationship relates to the properties of good organizations proposed by Gestalt theorists (Koffka, 1935). In this sense, music characterized by clearly isolated novelty peaks should yield higher accuracy, as such peaks would relate to perceptually salient musical changes.

We should highlight that the features yielding the highest correlations with accuracy were tonal. It is possible that interpretations derived via perceptual organization rules and expectation violation are better applicable to the case of prediction via tonal features because these features focus unambiguously on changes in perceived tonal context (and not on e.g. loudness changes). In contrast, the other descriptions used are somewhat more vague: i) Subband Flux discontinuities encompass changes of instrumentation, register, voicing, articulation, and loudness; ii) Fluctuation Pattern changes could be attributed to rhythmic patterns, tempo, articulation, and use of repetition; iii) changes in the Chromagram are manifested in pitch steps, pitch jumps, and use of chords. In this respect, tonal features consider a single dimension of musical change, whereas the other features analyzed in this study may yield more intricate descriptions.

Accuracy seemed to increase for stimuli with little local change and more distance between peaks (Table 1); however, Subband Flux seemed to yield an opposite trend. In this regard, high local variability of changes in instrumentation, register, loudness, etc., seems to be associated with higher accuracy. It could be the case that musical pieces with high local spectral change and multiple novelty peaks are also characterized by few structural sections of long duration, and yield a relatively straightforward prediction; for instance, Genesis contains multiple short sounds and effects, yet its sections are clearly delimited by important instrumentation changes, which probably had a positive effect on accuracy. Again, stylistic information might help to disentangle these and other problems regarding segmentation accuracy.

4.3 General Discussion

One of the main aims of this study was to find methods to select musical features that would be efficient in segmentation prediction for a given stimulus; to this end, we investigated the relationship between accuracy and characterizations of musical features and novelty curves. According to the results, for most features there is an inverse relationship between local variability and accuracy, and a direct relationship between the mean distance between subsequent novelty peaks and accuracy. This suggests that stimuli whose features are characterized by low variation between successive time points, and whose novelty curves have few peaks, are likely to yield higher segmentation prediction accuracy. A possible reason for these results is that music with infrequent musical change often yields perceptually salient boundaries; according to the Gestalt rules of perceptual organization, similar events that are proximal in time are grouped together, creating a strong sense of closure whenever a dissimilar event is perceived. Following this interpretation, if a given musical dimension changes frequently, an increase of listeners' attention towards other dimensions evoking strong closure may occur during segmentation.

4.4 Considerations for Further Research

Since this study focused on the analysis of segmentations of the same musical pieces from multiple listeners, the number of segmented stimuli does not suffice to draw solid conclusions about the correlates of segmentation accuracy; this should be considered a major methodological caveat. More musical stimuli are clearly needed to assess the generalization ability of our results; future studies should in this respect increase the number of musical stimuli used for perceptual segmentation tasks while maintaining a satisfactory participant sample size. As regards the sample of participants used, this study only focused on musician listeners, mainly following MIR studies, which recruit expert annotators for the preparation of musical structure data sets. Further work should concentrate on annotation segmentation from nonmusicians in order to understand the accuracy of novelty curves with regard to the majority of the population.

Another issue to consider is that the novelty detection approach is designed to yield maximum scores for high continuity within segments and high discontinuity at segmentation points, so in this sense it is not surprising that music exhibiting clear discontinuity between large sequences of homogeneity for a given feature will yield higher segmentation accuracy for that feature. In this regard, our results should be further tested using other approaches; for instance, probabilistic methods (e.g. Pauwels, Kaiser, &
Peeters, 2013; Pearce & Wiggins, 2006) would be suitable, as they offer alternative assumptions regarding the location of actual boundaries.

Finally, as an outcome of this study it can be stated that listeners may tend to focus on musical dimensions that do not change often. This interpretation is plausible and highlights the importance of e.g. tonal and tempo stability, as well as the role of repetition and motivic similarity in musical pieces. Future work should test this possibility by conducting listening studies in which listeners would describe the most salient dimension at different time points in the music; further, as suggested by Müller et al. (2016), automatic detection of these acoustic description cues should also be a relevant task regarding structural segmentation.

5. REFERENCES

Bruderer, M. (2008). Perception and modeling of segment boundaries in popular music (Doctoral dissertation, JF Schouten School for User-System Interaction Research, Technische Universiteit Eindhoven, Eindhoven).
Cannam, C., Landone, C., & Sandler, M. (2010). Sonic Visualiser: an open source application for viewing, analysing, and annotating music audio files. In Proceedings of the ACM Multimedia International Conference (pp. 1467-1468). Firenze, Italy.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New Jersey: Lawrence Erlbaum.
Eronen, A. (2007). Chorus detection with combined use of MFCC and chroma features and image processing filters. In Proceedings of the 10th International Conference on Digital Audio Effects (pp. 229-236). Bordeaux.
Foote, J. T. (2000). Automatic audio segmentation using a measure of audio novelty. In IEEE International Conference on Multimedia and Expo (Vol. 1, pp. 452-455). New York: IEEE.
Gaudefroy, C., Papadopoulos, H., & Kowalski, M. (2015). A multi-dimensional meter-adaptive method for automatic segmentation of music. In 13th International Workshop on Content-Based Multimedia Indexing (CBMI) (pp. 1-6). IEEE.
Hartmann, M., Lartillot, O., & Toiviainen, P. (2016a). Interaction features for prediction of perceptual segmentation: effects of musicianship and experimental task. Journal of New Music Research.
Hartmann, M., Lartillot, O., & Toiviainen, P. (2016b). Multi-scale modelling of segmentation: effect of musical training and experimental task. Music Perception, 34(2), 192-217.
Jensen, K. (2007). Multiple scale music segmentation using rhythm, timbre, and harmony. EURASIP Journal on Applied Signal Processing, 2007(1), 159.
Kaiser, F. & Peeters, G. (2013). Multiple hypotheses at multiple scales for audio novelty computation within music. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. Vancouver.
Koffka, K. (1935). Principles of Gestalt psychology. New York: Harcourt, Brace and Company.
Lartillot, O. & Toiviainen, P. (2007). A Matlab toolbox for musical feature extraction from audio. In International Conference on Digital Audio Effects (pp. 237-244). Bordeaux.
Lukashevich, H. M. (2008). Towards quantitative measures of evaluating song segmentation. In Proceedings of the 9th International Conference on Music Information Retrieval (pp. 375-380).
McFee, B. & Ellis, D. P. (2014). Learning to segment songs with ordinal linear discriminant analysis. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5197-5201).
Mendoza Garay, J. (2014). Self-report measurement of segmentation, mimesis and perceived emotions in acousmatic electroacoustic music (Master's thesis, University of Jyväskylä).
Müller, M., Chew, E., & Bello, J. P. (2016). Computational Music Structure Analysis (Dagstuhl Seminar 16092). Dagstuhl Reports, 6(2), 147-190. doi:10.4230/DagRep.6.2.147
Narmour, E. (1992). The analysis and cognition of melodic complexity: the implication-realization model. University of Chicago Press.
Paulus, J. & Klapuri, A. (2006). Music structure analysis by finding repeated parts. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia (pp. 59-68).
Pauwels, J., Kaiser, F., & Peeters, G. (2013). Combining harmony-based and novelty-based approaches for structural segmentation. In ISMIR (pp. 601-606).
Pearce, M. T. & Wiggins, G. (2006). The information dynamics of melodic boundary detection. In Proceedings of the Ninth International Conference on Music Perception and Cognition (pp. 860-865).
Peeters, G. (2004). Deriving musical structures from signal analysis for music audio summary generation: sequence and state approach. Computer Music Modeling and Retrieval, 169-185.
Peiszer, E., Lidy, T., & Rauber, A. (2008). Automatic audio segmentation: segment boundary and structure detection in popular music. In Proceedings of the 2nd International Workshop on Learning the Semantics of Audio Signals (LSAS). Paris, France.
Pyper, B. J. & Peterman, R. M. (1998). Comparison of methods to account for autocorrelation in correlation analyses of fish data. Canadian Journal of Fisheries and Aquatic Sciences, 55(9), 2127-2140.
Serrà, J., Müller, M., Grosche, P., & Arcos, J. (2014). Unsupervised music structure annotation by time series structure features and segment similarity. IEEE Transactions on Multimedia, 16(5), 1229-1240.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. Boca Raton: CRC Press.
Smith, J. B. & Chew, E. (2013). A meta-analysis of the MIREX structure segmentation task. In Proc. of the 14th International Society for Music Information Retrieval Conference (Vol. 16, pp. 45-47). Curitiba.
[Rhythmic tree diagrams: Figure 3 shows the nodes S, M2,3, B3, and SB; Figure 4 shows the same tree with strengths and heads: S(1/2; 0), M2,3(1/2; 0), B3,S(1; 0), B3,W(1/3; 0), and sub-beats SBE(1; 0).]

Figure 3. An example of the rhythmic tree of a 6/8 bar (rhythm notation omitted).

Figure 4. An example of the rhythmic tree of a 6/8 bar (rhythm notation omitted), including strengths and lexicalization.
An example of the rhythmic tree of a single 6/8 bar is shown in Figure 3. Here, the first beat has been rewritten as a terminal since it is the only note present.

3.2 Lexicalization

One downside of using a PCFG to model the rhythmic structure is that PCFGs make a strong independence assumption that is not appropriate for music. Specifically, in a given rhythm, a note can only be heard as stressed or important in contrast with the notes around it, though a standard PCFG cannot model this. A PCFG may see a dotted quarter note and assume that it is a long note, even though it has no way of knowing whether the surrounding notes are shorter or longer, and thus, whether the note should indeed be considered stressed.

To solve this problem, we implement a lexicalized PCFG (LPCFG), where each terminal is assigned a head corresponding to its note with the longest duration. Strong heads (in this work, those representing longer notes) propagate upwards through the metrical tree to the non-terminals in a process called lexicalization. This allows the grammar to model rhythmic dependencies rather than assuming independence as in a standard PCFG, and the pattern of strong and weak beats and sub-beats is used to determine the underlying rhythmic stress pattern of a given piece of music.

This head is written (d; s), where d is the duration of the longest note (or the portion of that note which lies beneath the node), and s is the starting position of that note. The character t is added to the end of s if that note is tied into (i.e. if the onset of the note lies under some previous node). In the heads, d and s are normalized so that the duration of the node itself is 1. Thus, only heads which are assigned to nodes at the same depth can be compared directly. A node with no notes is assigned the empty head of (0; 0).

Once node heads have been assigned, each beat and sub-beat non-terminal is assigned a strength of either strong (S), weak (W), or even (E). These are assigned by comparing the heads of siblings in the rhythmic tree. If all siblings' heads are equal, they are assigned even strength. Otherwise, those siblings with the strongest head are assigned strong strength while all others are assigned weak strength, regardless of their relative head strengths.

Head strength is determined by a ranking system, where heads are first ranked by d such that longer notes are considered stronger. Any ties are broken by s such that an earlier starting position corresponds to greater strength. Any further ties are broken such that notes which are not tied into are considered stronger than those which are.

An example of the rhythmic tree of a single 6/8 bar, including strengths and lexicalization, is shown in Figure 4.

3.3 Performing Inference

Each of the LPCFG rule probabilities is computed as suggested by Jurafsky and Martin [19], additionally conditioning each on the meter type. For example, the replacement {M2,3(1/2; 0) → B3,S(1; 0) B3,W(1/3; 0)} is modeled by the product of Equations (1), (2), and (3). Equation (1) models the probability of a transition given the left-hand-side node's head, while Equations (2) and (3) model the probability of each child's head given its type and the parent's head.

p(M2,3 → B3,S B3,W | M2,3, (1/2; 0))    (1)

p((1; 0) | M2,3, B3,S, (1/2; 0))    (2)

p((1/3; 0) | M2,3, B3,W, (1/2; 0))    (3)

The meter type conditioning ensures that the model does not prefer one meter type over another based on uneven training data. Specifically, each initial transition S → Mb,s is assigned a probability of 1. The actual probability values are computed given a training corpus using maximum likelihood estimation with Good-Turing smoothing as described by Good [20]. If a given replacement's head, as modeled by Equations (2) and (3), is not seen in the training data, we use a backing-off technique as follows. We multiply the probability from the Good-Turing smoothing by a new probability equation, where the meter type is removed (again with Good-Turing smoothing). This allows the grammar to model, for example, the beat-level transitions of a 9/8 bar using the beat-level transitions of a 3/4 bar. Note that this does not allow any cross-level calculations where, for example, the beat level of a 9/8 bar could be modeled by the sub-beat level of a 6/8 bar, though this could be a possible avenue for future work.
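The head-ranking and strength-assignment rules described above can be sketched as follows (an illustrative re-implementation under our own head encoding, not the authors' code):

```python
from typing import List, Tuple

# A head is (d, s, tied): normalized duration of the longest note under
# the node, its normalized starting position, and whether it is tied into.
Head = Tuple[float, float, bool]

def head_rank_key(head: Head):
    """Ranking from the text: longer d is stronger; ties broken by an
    earlier s; remaining ties prefer notes that are not tied into."""
    d, s, tied = head
    return (-d, s, tied)  # smaller tuple = stronger head

def assign_strengths(sibling_heads: List[Head]) -> List[str]:
    """Assign strong (S), weak (W), or even (E) strength to sibling
    nodes by comparing their heads, as in the LPCFG strength rules."""
    if all(h == sibling_heads[0] for h in sibling_heads):
        return ["E"] * len(sibling_heads)
    strongest = min(sibling_heads, key=head_rank_key)
    return ["S" if h == strongest else "W" for h in sibling_heads]

# The two beats from the 6/8 example: heads (1; 0) and (1/3; 0); the
# beat carrying the longer note is strong, the other weak.
print(assign_strengths([(1.0, 0.0, False), (1 / 3, 0.0, False)]))  # ['S', 'W']
```

Because ties in d are broken by starting position and then by tied-into status, the comparator is a simple lexicographic key, which keeps the strength assignment deterministic.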
Table 2. The F1 of each method for each corpus using an automatically-aligned tatum.

    Method          Inventions  Fugues  Essen [22]
    Temperley [17]  0.58        0.63    0.60
    LPCFG+BT        0.55        0.72    0.74

Table 3. The F1 of the LPCFG and LPCFG+BT split by meter type. Here, # represents the number of pieces in each meter type, and meter types which occur only once in a corpus are omitted.

[...] cross-validation across corpora by training on the Inventions when testing the Fugues and vice versa; however, that led to similar but slightly worse results, as the complexities of the rhythms in the corpora are not quite similar enough to allow for such training to be successful. Specifically, there is much more syncopation in the Fugues than in the Inventions, and thus our grammar would tend to incorrectly choose meters for the Inventions which would result in some syncopation.

4.3 Results

With hand-aligned input, the LPCFG is evaluated against two baselines. First, a naive one, guessing 4/4 time with an anacrusis such that the first full bar begins at the onset of the first note (the most common time signature in each corpus). Second, the PCFG without lexicalization (as proposed in Section 3.1, with Good-Turing smoothing and rule of congruence matching).

With automatically-aligned input, we evaluate against the model proposed by Temperley [17], which performs beat tracking jointly with meter detection. For direct comparison, we use the fastest beat given by Temperley's model as the tatum, and we call this version of our grammar an LPCFG with beat tracking (LPCFG+BT). It would be better to perform beat tracking and meter detection jointly, as in Temperley's model; however, we leave such joint inference for future work. The automatic alignment presents some difficulty in computing our metric, since a metrical hypothesis generated from an incorrectly aligned input may move in and out of phase throughout the course of a piece due to beat tracking errors. Therefore, we evaluate each meter based on its phase and sub-beat length relative only to the first note of each piece, thus avoiding any misalignments caused by subsequent errors in beat tracking.

The results for hand-aligned input can be found in Table 1, where it can be seen that the LPCFG outperforms all baselines, quite substantially on the fugues. The results for automatically-aligned input are shown in Table 2. Here, the LPCFG+BT outperforms Temperley's model on the fugues and the Essen corpus, but is outperformed by it on the inventions.

For both hand-aligned and automatically-aligned input, it is surprising that the LPCFG (and LPCFG+BT) does not perform better on the inventions, which are simpler compositions than the fugues. The reason for this lack of improvement seems to be a simple lack of training data, as can be seen in Table 3, which shows that as the number of training pieces for each meter type increases, the performance of the LPCFG improves dramatically, though the LPCFG+BT does not follow this trend.

During automatic tatum alignment, beat tracking tends to quantize rare patterns (which may occur in only a single meter type) into more common ones, allowing the grammar to identify the rhythmic stress of a piece without having to parse too many previously unseen rhythms. This helps in the case of not enough training data, but can hurt when more training data is available. These rare patterns tend to be strong markers of a certain meter type, and if enough training data is available to recognize them, such quantization would remove a very salient rhythmic clue. Therefore, for both hand-aligned and automatically-aligned input, more training data in the style of the inventions should continue to improve performance on that corpus.

Fig. 6 shows the percentage of pieces in each corpus for which each method achieves 3, 2, 1, or 0 true positives, and further details exactly what is happening on the inventions. The true positive counts correspond with those in our metric, and represent the number of levels of the metrical tree (bar, sub-beat, beat) which were matched in both length and phase for each piece. Thus, more true positives corresponds with a more accurate guess.

The improvement on the fugues is clear for both types of input. On the hand-aligned inventions, however, the naive 4/4 model gets 40% of the metrical structures of the inventions exactly correct (3 TPs), while the LPCFG gets only 26.67%. The LPCFG gets significantly more of its guesses mostly or exactly correct (with 2 or 3 TPs), and eliminates totally incorrect guesses (0 TPs) completely. This shows that, even though it may not have had enough data yet to classify time signatures correctly, it does seem to be learning some sort of structural patterns from what limited data it has. Meanwhile, on the automatically-aligned inventions, the LPCFG+BT gets slightly fewer of its guesses mostly correct than Temperley's model, but significantly fewer (only one invention) totally incorrect, again showing that it has learned some structural patterns. The slightly lower F1 of the LPCFG+BT on the automatically-aligned inventions is due to a higher false positive rate than Temperley's model.

That our model's performance is more sensitive to a lack [...]
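The three-level true-positive metric described above can be sketched as follows. This is a minimal illustration; the `(length, phase)` encoding of each metrical level, measured in tatums, is an assumption of this sketch rather than the paper's data format.

```python
def true_positives(guess, truth):
    """Count how many levels of the metrical tree (bar, beat, sub-beat)
    of a guessed metrical structure match the ground truth in both
    length and phase. Each level is a (length, phase) pair in tatums."""
    return sum(
        guess[level] == truth[level]
        for level in ("bar", "beat", "sub_beat")
    )

# A 3/4 guess against a 6/8 ground truth on a shared eighth-note tatum:
# both bars span 6 tatums and both sub-beats span 1, but the beat
# levels differ (2 tatums vs. 3), giving 2 true positives.
guess = {"bar": (6, 0), "beat": (2, 0), "sub_beat": (1, 0)}
truth = {"bar": (6, 0), "beat": (3, 0), "sub_beat": (1, 0)}
```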
[Figure 6: bar charts of the percentage of pieces achieving 3, 2, 1, or 0 true positives for each method (naive 4/4, PCFG, LPCFG, LPCFG+BT, Temperley [17]) on (a) hand-aligned and (b) automatically-aligned input.]

The proposed LPCFG shows promise in meter detection even using only rhythmic data, and future work will incorporate [...] and that it can be combined with a beat tracking model to achieve good meter detection results on automatically-aligned symbolic music data. The fact that lexicalization adds definite value over a standard PCFG shows that there are complex rhythmic dependencies in music which such lexicalization is able to capture.

Our model is somewhat sensitive to a lack of training [...]
[5] N. Spiro, "Combining Grammar-Based and Memory-Based Models of Perception of Time Signature and Phase," in Music and Artificial Intelligence. Springer Berlin Heidelberg, 2002, pp. 183-194.

[6] H. Takeda, T. Nishimoto, and S. Sagayama, "Rhythm and Tempo Recognition of Music Performance from a Probabilistic Approach," in ISMIR, 2004.

[7] ——, "Rhythm and Tempo Analysis Toward Automatic Music Transcription," in ICASSP. IEEE, 2007, pp. 1317-1320.

[8] E. Nakamura, K. Yoshii, and S. Sagayama, "Rhythm Transcription of Polyphonic MIDI Performances Based on a Merged-Output HMM for Multiple Voices," in SMC, 2016.

[9] J. C. Brown, "Determination of the meter of musical scores by autocorrelation," The Journal of the Acoustical Society of America, vol. 94, no. 4, p. 1953, 1993.

[10] B. Meudic, "Automatic Meter Extraction from MIDI Files," in Journées d'informatique musicale, 2002.

[11] D. Eck and N. Casagrande, "Finding Meter in Music Using An Autocorrelation Phase Matrix and Shannon Entropy," in ISMIR, 2005, pp. 504-509.

[12] A. Volk, "The Study of Syncopation Using Inner Metric Analysis: Linking Theoretical and Experimental Analysis of Metre in Music," Journal of New Music Research, vol. 37, no. 4, pp. 259-273, Dec. 2008.

[13] W. B. de Haas and A. Volk, "Meter Detection in Symbolic Music Using Inner Metric Analysis," in ISMIR, 2016, pp. 441-447.

[14] A. Volk and W. B. de Haas, "A corpus-based study on ragtime syncopation," ISMIR, 2013.

[15] N. Whiteley, A. T. Cemgil, and S. Godsill, "Bayesian Modelling of Temporal Structure in Musical Audio," in ISMIR, 2006.

[16] D. Temperley, Music and Probability. The MIT Press, 2007.

[17] ——, "A Unified Probabilistic Model for Polyphonic Music Analysis," Journal of New Music Research, vol. 38, no. 1, pp. 3-18, Mar. 2009.

[18] P. Toiviainen and T. Eerola, "Autocorrelation in meter induction: The role of accent structure," The Journal of the Acoustical Society of America, vol. 119, no. 2, p. 1164, 2006.

[19] D. Jurafsky and J. H. Martin, Speech and Language Processing, International Edition, 2000.

[20] I. J. Good, "The population frequencies of species and the estimation of population parameters," Biometrika, pp. 237-264, 1953.

[21] C. Lee, "The perception of metrical structure: Experimental evidence and a new model," in Representing Musical Structure, P. Howell, R. West, and I. Cross, Eds. Academic Press, May 1991, ch. 3, pp. 59-128.

[22] H. Schaffrath and D. Huron, "The Essen folksong collection in the Humdrum Kern format," Menlo Park, CA: Center for Computer Assisted Research in the Humanities, 1995.

A. RULE OF CONGRUENCE

A metrical structure hypothesis begins as unmatched, and is considered to be fully matched if and only if both its beat and sub-beat levels have been matched. Thus, a metrical hypothesis can be in one of four states: fully matched, sub-beat matched, beat matched, or unmatched.

If a hypothesis is unmatched, a note which is shorter than a sub-beat and does not divide a sub-beat evenly is counted as a mismatch. A note which is exactly a sub-beat in length is either counted as a mismatch (if it is not in phase with the sub-beat), or the hypothesis is moved into the sub-beat matched state. A note which is between a sub-beat and a beat in length is counted as a mismatch. A note which is exactly a beat in length is either counted as a mismatch (if it is not in phase with the beat), or the hypothesis is moved into the beat matched state. A note which is longer than a beat, is not some whole multiple of a beat in length, and does not divide a bar evenly is counted as a mismatch.

If a hypothesis is sub-beat matched, it now interprets each incoming note based on that sub-beat length. That is, any note which is longer than a single sub-beat is divided into up to three separate notes (for meter matching purposes only): (1) the part of the note which lies before the first sub-beat boundary which it overlaps (if the note begins exactly on a sub-beat, no division occurs); (2) the part of the note which lies after the final sub-beat boundary which it overlaps (if the note ends exactly on a sub-beat, no division occurs); and (3) the rest of the note. After this processing, a note which is longer than a sub-beat and shorter than a beat is counted as a mismatch. (Due to note division, this only occurs if the note is two sub-beats in length and each beat has three sub-beats.) A note which is exactly a beat in length moves the hypothesis into the fully matched state. A note which is longer than a beat and is not some whole multiple of beats is counted as a mismatch.

If a hypothesis is beat matched, it now interprets each incoming note based on that beat length, exactly as described for the sub-beat length in the previous paragraph. After this processing, a note which is shorter than a sub-beat and does not divide a sub-beat evenly is counted as a mismatch. A note which is exactly a sub-beat in length is either counted as a mismatch (if it is not in phase with the sub-beat), or the hypothesis is moved into the fully matched state. A note which is longer than a sub-beat and shorter than a beat, and which does not align with the beginning or end of a beat, is counted as a mismatch.

Once a metrical hypothesis is fully matched, incoming notes are no longer checked for matching, and the hypothesis will never be removed.
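The unmatched-state transitions of the rule of congruence can be sketched as a small state machine. This is a simplified illustration of the unmatched case only, with all lengths and phases measured in tatums; it is not the authors' implementation, and the function and constant names are hypothetical.

```python
# The four states a metrical hypothesis can be in.
UNMATCHED = "unmatched"
SUB_BEAT = "sub-beat matched"
BEAT = "beat matched"
FULL = "fully matched"

def step_unmatched(note_len, note_phase, sub_beat, beat, bar):
    """One congruence-check step for an UNMATCHED hypothesis.
    note_len/note_phase are the note's length and onset position in
    tatums; sub_beat, beat, and bar are the hypothesis's level lengths.
    Returns (new_state, mismatch)."""
    in_phase_sub = note_phase % sub_beat == 0
    in_phase_beat = note_phase % beat == 0
    if note_len < sub_beat:
        # Mismatch unless the note divides a sub-beat evenly.
        return UNMATCHED, sub_beat % note_len != 0
    if note_len == sub_beat:
        # Exactly a sub-beat: advance if in phase, else mismatch.
        return (SUB_BEAT, False) if in_phase_sub else (UNMATCHED, True)
    if note_len < beat:
        # Strictly between a sub-beat and a beat: always a mismatch.
        return UNMATCHED, True
    if note_len == beat:
        # Exactly a beat: advance if in phase, else mismatch.
        return (BEAT, False) if in_phase_beat else (UNMATCHED, True)
    # Longer than a beat: mismatch unless a whole multiple of a beat
    # or an even divisor of the bar.
    ok = note_len % beat == 0 or bar % note_len == 0
    return UNMATCHED, not ok
```

With a sub-beat of 2 tatums, a beat of 6, and a bar of 12 (e.g. a 6/8 hypothesis on a sixteenth-note tatum), an on-phase eighth note advances the hypothesis to the sub-beat matched state, while a dotted-eighth-length note is counted as a mismatch.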
Copyright: © 2017 Taegyun Kwon et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

2. SYSTEM DESCRIPTION

The proposed framework is illustrated in Figure 1. The left-hand side presents the two independent AMT systems that [...]
3. EXPERIMENTS
3.1 Dataset
We used the MAPS dataset [14], specifically the MUS
subset that contains large pieces of piano music, for train-
ing and evaluation. Each piece consists of audio files and
a ground-truth MIDI file. The audio files were rendered from the MIDI with nine settings of different pianos and recording conditions. This helped our model avoid overfitting to a specific piano tone. The MIDI files served as the ground-truth annotation of the corresponding audio, but some of them (ENSTDkCl and ENSTDkAm) are sometimes temporally inaccurate, by more than 65 ms as described in [15].

We conducted the experiments with 4-fold cross-validation, with training and test splits from configuration I in [11].¹ For each fold, 43 pieces were detached from the training set and used for validation. As a result, each fold was composed of 173, 43 and 54 pieces for training, validation and test, respectively, as processed in [10].

3.2 Evaluation method

In order to evaluate the audio-to-score alignment task, we need another MIDI representation (typically score MIDI) apart from the performance MIDI aligned with the audio. We generated the separate MIDI by changing the intervals between successive concurrent sets of notes. Specifically, we multiplied each interval by a randomly selected value between 0.7 and 1.3. This scheme of temporal distortion prevents the alignment path from being trivial and was also employed in previous work [6, 16, 17].

After we obtained the alignment path through DTW, absolute temporal errors between estimated note onsets and the ground truth were measured. For each piece of music in the test set, the mean value of the temporal errors and the ratio of correctly aligned notes under varying thresholds were used to summarize the results.

3.3 Compared Algorithms

To make a performance comparison, we reproduced two alignment algorithms proposed by Ewert et al. [6] and one by Carabias-Orti et al. [18]. We performed the experiments with the same test set using the FastDTW algorithm but without any post-processing. Ewert's algorithms use a hand-crafted chromagram and onset features based on audio filter bank responses. Carabias-Orti's algorithm employs non-negative matrix factorization to learn a spectral basis for each note combination from the spectrogram. The latter is designed only for audio-to-audio alignment, while the former can be applied to both audio-to-audio and audio-to-MIDI alignment. Therefore, we made an audio version of the distorted MIDI using a high-quality sample-based piano synthesizer and employed it as an input. We tested Ewert's algorithms for both the audio and MIDI cases. The temporal frame rate of the features was adjusted to 100 fps for both algorithms.

For the alignment task with Ewert's algorithms, we used the same FastDTW algorithm. But since the FastDTW algorithm cannot be directly applied to Carabias-Orti's algorithm due to its own distance calculation method, we applied a classic DTW algorithm, which employs an entire frame-wise distance matrix. Because of memory limitations, when reproducing Carabias-Orti's algorithm we excluded 35 pieces that are longer than 400 seconds from the test sets.

Note that even though the dataset for evaluation is different, the results of the two reproduced algorithms were similar to the results in their original works. The mean onset error of Ewert's algorithm on piano music was 19 ms with a 26 ms standard deviation [6]. The result introduced in the original Carabias-Orti paper [18] shows a quite large difference in terms of the mean piecewise error, but we assume that the difference is due to the change of the test set. The align rates of the original result and our reproduced result were similar (50 ms: 74% vs. 69%; 100 ms: 90% vs. 92%; 200 ms: 95% vs. 96%). Hence, we assumed that our reproduction was reliable for the comparison.

4. RESULTS AND DISCUSSION

4.1 Comparison with Others

Figure 4 shows the results of the audio-to-score alignment for the compared algorithms. They represent the ratio of correctly aligned onsets (precision) as a function of the error threshold. Typically, a tolerance window of 50 ms is used for evaluation. However, because most of the notes were aligned within a 50 ms threshold, we varied the width of the tolerance window from 0 ms to 200 ms in 10 ms steps.

Figure 4. Ratio of correctly aligned onsets as a function of threshold. Each point represents the mean of the piecewise precision. Some data points with precision lower than 80% are not shown in this figure.

¹ http://www.eecs.qmul.ac.uk/sss31/TASLP/info.html
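The distortion-and-scoring protocol described in Sections 3.2 can be sketched as follows. This is a toy illustration under stated assumptions (onset times in seconds, per-note onset errors, the authors' thresholds of 10/30/50/100 ms); the helper names `distort_onsets` and `onset_error_stats` are hypothetical, not from the paper.

```python
import random

def distort_onsets(onsets, low=0.7, high=1.3, seed=0):
    """Generate 'score' onset times from performance onsets by scaling
    each inter-onset interval by a random factor in [low, high], the
    temporal-distortion scheme used to keep the alignment non-trivial."""
    rng = random.Random(seed)
    out = [onsets[0]]
    for prev, cur in zip(onsets, onsets[1:]):
        out.append(out[-1] + (cur - prev) * rng.uniform(low, high))
    return out

def onset_error_stats(estimated, truth, thresholds=(0.01, 0.03, 0.05, 0.1)):
    """Piecewise absolute onset errors: mean error (seconds) and the
    ratio of notes aligned within each threshold."""
    errors = [abs(e - t) for e, t in zip(estimated, truth)]
    mean = sum(errors) / len(errors)
    ratios = {th: sum(e <= th for e in errors) / len(errors)
              for th in thresholds}
    return mean, ratios
```

After running DTW between the distorted score and the performance, the estimated onsets would be fed to `onset_error_stats` to produce the per-piece means and threshold ratios summarized in Table 1.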
Table 1. Results of the piecewise onset errors. Mean, median, and standard deviation of the errors are in milliseconds. The right columns are the ratios of notes (%) that are aligned within onset errors of 10 ms, 30 ms, 50 ms and 100 ms, respectively.
[...] showed that chroma onset features play a crucial role in improving the accuracy. In fact, the successful alignment performance might be possible because we used the same recording condition for both the training and test sets. Considering this issue, we will investigate the generalization capacity of our model by evaluating it on various datasets in the future. We also plan to improve the AMT system by using other types of deep neural networks.

Acknowledgments

This work was supported by the Korea Advanced Institute of Science and Technology (project no. G04140049) and the Korea Creative Content Agency (project no. N04170044).

6. REFERENCES

[1] A. Arzt, G. Widmer, and S. Dixon, "Automatic Page Turning for Musicians via Real-Time Machine Listening," in Proceedings of ECAI 2008: 18th European Conference on Artificial Intelligence, vol. 1, no. 1, 2008, pp. 241-245.

[2] R. B. Dannenberg and C. Raphael, "Music score alignment and computer accompaniment," Communications of the ACM, vol. 49, no. 8, pp. 38-43, 2006.

[3] G. Widmer, S. Dixon, W. Goebl, E. Pampalk, and A. Tobudic, "In Search of the Horowitz Factor," AI Magazine, vol. 24, no. 3, pp. 111-130, 2003.

[4] A. Friberg and J. Sundberg, "Perception of just-noticeable time displacement of a tone presented in a metrical sequence at different tempos," The Journal of the Acoustical Society of America, vol. 94, no. 3, pp. 1859-1859, 1993.

[5] S. Dixon and G. Widmer, "MATCH: a music alignment tool chest," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2005, pp. 492-497.

[6] S. Ewert, M. Muller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 1869-1872.

[7] A. Arzt, S. Bock, S. Flossmann, H. Frostel, M. Gasser, C. C. S. Liem, and G. Widmer, "The piano music companion," Frontiers in Artificial Intelligence and Applications, vol. 263, no. 1, pp. 1221-1222, 2014.

[10] R. Kelz, M. Dorfer, F. Korzeniowski, S. Bock, A. Arzt, and G. Widmer, "On the Potential of Simple Framewise Approaches to Piano Transcription," in Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2016, pp. 475-481.

[11] S. Sigtia, E. Benetos, and S. Dixon, "An end-to-end neural network for polyphonic piano music transcription," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 5, pp. 927-939, 2016.

[12] S. Salvador and P. Chan, "FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space," Intelligent Data Analysis, vol. 11, pp. 561-580, 2007.

[13] M. Muller, H. Mattes, and F. Kurth, "An efficient multiscale approach to audio synchronization," in Proc. International Conference on Music Information Retrieval (ISMIR), 2006, pp. 192-197.

[14] V. Emiya, R. Badeau, and B. David, "Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 6, pp. 1643-1654, 2010.

[15] S. Ewert and M. Sandler, "Piano Transcription in the Studio Using an Extensible Alternating Directions Framework," vol. 24, no. 11, pp. 1983-1997, 2016.

[16] M. Muller, H. Mattes, and F. Kurth, "An efficient multiscale approach to audio synchronization," in Proc. International Conference on Music Information Retrieval (ISMIR). Citeseer, 2006, pp. 192-197.

[17] C. Joder, S. Essid, and G. Richard, "A conditional random field framework for robust and scalable audio-to-score matching," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2385-2397, 2011.

[18] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Canadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proc. International Conference on Music Information Retrieval (ISMIR), 2015, pp. 742-748.
ABSTRACT

A theoretical basis of existing attempts at automatic fingering decision is path optimization that minimizes the difficulty of a whole phrase, typically defined as the sum of the difficulties of the moves required for playing the phrase. However, from a practical point of view, it is more important to minimize the maximum difficulty of the moves required for playing the phrase, that is, to make the most difficult move easiest. For this reason, our previous work introduced a variant of the Viterbi algorithm termed the minimax Viterbi algorithm that finds the path of hidden states that maximizes the minimum transition probability along the path. In the present work, we introduce a parameterized family of Viterbi algorithms termed the Lp-Viterbi algorithm that continuously interpolates between the conventional Viterbi algorithm and the minimax Viterbi algorithm. (It coincides with the conventional algorithm for p = 1 and the minimax algorithm for p = ∞.) We apply these variants of the Viterbi algorithm to HMM-based guitar fingering decision and compare the resulting fingerings.

1. INTRODUCTION

The pitch ranges of the strings of string instruments usually have significant overlaps. Therefore, such string instruments have several ways to play even a single note (except the highest and lowest notes, which are covered by only a single string), and thus numerous ways to play a phrase and an astronomical number of possible ways to play a whole song. This is why fingering decision for a given song is not an easy task and automatic fingering decision has been attracting many researchers. Existing attempts at automatic fingering decision are mainly based on path optimization that minimizes the difficulty level of a whole phrase, defined as the sum or the product of the difficulty levels of the moves. However, whether a string player can play a given passage using a specific fingering depends almost only on whether the most difficult move included in the fingering is playable. In particular, for beginner players, it is most important to minimize the maximum difficulty level of the moves included in a fingering, that is, to make the most difficult move easiest.

To this end, our previous work [1] introduced a variant of the Viterbi algorithm [2] termed the minimax Viterbi algorithm for finding the sequence of hidden states that maximizes the minimum transition probability on the sequence (not the product of all the transition probabilities on the sequence). The purpose of this paper is to introduce a parameterized family of Viterbi algorithms termed the Lp-Viterbi algorithm that continuously interpolates between the conventional Viterbi algorithm and the minimax Viterbi algorithm, and then to apply it to HMM-based guitar fingering decision. The framework of HMM-based fingering decision employs a hidden Markov model (HMM) whose hidden states are left-hand forms of guitarists and whose output symbols are musical notes. We perform fingering decision by solving the decoding problem of the HMM using our proposed Lp-Viterbi algorithm. We set the transition probabilities to large values for easy moves and small values for difficult ones, so that the resulting fingerings make the most difficult move easiest, as discussed above. To distinguish the original Viterbi algorithm from our variants, we refer to the former as the conventional Viterbi algorithm throughout the paper.

There have been several attempts at automatic fingering decision and automatic arrangement for guitars. Sayegh [3] introduced the path optimization approach to fingering decision for generic string instruments. Miura et al. [4] developed a system that generates guitar fingerings for given melodies (monophonic phrases). Radicioni et al. [5] introduced cognitive aspects of fingering decision into the path optimization approach. Radisavljevic and Driessen [6] proposed a method for designing cost functions for dynamic programming (DP) for fingering decision. Tuohy and Potter [7] introduced a genetic algorithm (GA) for fingering decision. Baccherini et al. [8] introduced finite state automata to fingering decision for generic string instruments. Hori et al. [9] applied the input-output HMM introduced by Bengio and Frasconi [10] to guitar fingering decision and arrangement. McVicar et al. [11] developed a system that generates guitar tablatures for rhythm guitar and lead guitar automatically. Compared to those previous works, the novelty of the present work lies in that it introduces a new category of path optimization and corresponding variants of the Viterbi algorithm.

The rest of the paper is organized as follows. Section 2 recalls the conventional Viterbi algorithm and the minimax Viterbi algorithm and connects them by introducing the Lp-Viterbi algorithm. Section 3 recalls the framework of fingering decision based on an HMM for monophonic guitar phrases, to which we apply the Lp-Viterbi algorithm, and Section 4 evaluates the resulting fingerings. Section 5 concludes the paper.

Copyright: © 2017 Gen Hori et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Suppose that we have two finite sets of hidden states Q and output symbols S,

    Q = {q_1, q_2, ..., q_N},
    S = {s_1, s_2, ..., s_M},

and two sequences of random variables X of hidden states and Y of output symbols,

    X = (X_1, X_2, ..., X_T),  X_t ∈ Q,
    Y = (Y_1, Y_2, ..., Y_T),  Y_t ∈ S,

then a hidden Markov model M is defined by a triplet [...]

[...] from a hidden Markov model M, we are interested in the sequence of hidden states

    x = (x_1, x_2, ..., x_T)

that generated the observed sequence of output symbols y with the maximum likelihood,

    x_ML = arg max_x P(y, x | M)
         = arg max_x P(x | M) P(y | x, M)
         = arg max_x ( log P(x | M) + log P(y | x, M) )
         = arg max_x Σ_{t=1}^{T} ( log a(x_{t-1}, x_t) + log b(x_t, y_t) ).   (1)

Recursion fills out the remaining columns of δ and ψ using the following recursive formulae for j = 1, 2, ..., N and t = 1, 2, ..., T-1,

    δ_{j,t+1} = max_i ( δ_{i,t} + log a_{ij} ) + log b(q_j, y_{t+1}),
    ψ_{j,t+1} = arg max_i ( δ_{i,t} + log a_{ij} ).

Termination finds the last hidden state of the maximum likelihood sequence x_ML using the last column of δ,

    x_T = arg max_i δ_{i,T}.

Recursion for the minimax Viterbi algorithm fills out the two tables δ and ψ using the following recursive formulae for j = 1, 2, ..., N and t = 1, 2, ..., T-1,

    δ_{j,t+1} = max_i ( min( δ_{i,t}, log a_{ij} + log b(q_j, y_{t+1}) ) ),
    ψ_{j,t+1} = arg max_i ( min( δ_{i,t}, log a_{ij} + log b(q_j, y_{t+1}) ) ).

We modify only the second step and leave the other steps unchanged. We call the modified algorithm the minimax Viterbi algorithm. The modified algorithm finds the sequence of hidden states with the maximum minimum probability, that is, the minimum maximum difficulty. Although the word "maximin" is appropriate in view of probability, we choose the word "minimax" for naming the variant because it is intuitive to understand in view of difficulty and the word conveys our concept of making the most difficult move easiest. The modified second step works in the same manner as the original one, but now the element δ_{i,t} keeps the value of the maximum minimum transition probability of the subsequence of hidden states for the first t observations. The term min(δ_{i,t}, log a_{ij} + log b(q_j, y_{t+1})) updates the value of the minimum transition probability as the term δ_{i,t} + log a_{ij} + log b(q_j, y_{t+1}) does the value of the maximum log likelihood in the conventional Viterbi algorithm.

2.4 Lp-Viterbi algorithm

To implement a family of Viterbi algorithms that continuously interpolates between the conventional Viterbi algorithm and the minimax Viterbi algorithm, we consider a generalized decoding problem of HMM defined with the Lp-norm of a certain real vector. For a real number p ≥ 1, the Lp-norm of an n-dimensional real vector

    v = (v_1, v_2, ..., v_n) ∈ R^n

is defined as

    ||v||_p = ( |v_1|^p + |v_2|^p + ... + |v_n|^p )^{1/p}.   (3)

In particular, the L1-norm and the L2-norm correspond to the Manhattan distance and the Euclidean distance, and are widely used in the contexts of machine learning and compressed sensing. The regression models with the L1 and [...]

[...] Viterbi algorithm as follows. According to the change of signs in the table δ, we also modify the first and third steps as follows; we leave the last step unchanged.

Initialization for the Lp-Viterbi algorithm initializes the first columns of the two tables δ and ψ using the following formulae for i = 1, 2, ..., N,

    δ_{i,1} = | log π_i + log b(q_i, y_1) |,
    ψ_{i,1} = 0.

Recursion for the Lp-Viterbi algorithm fills out the two tables δ and ψ using the following recursive formulae for j = 1, 2, ..., N and t = 1, 2, ..., T-1,

    δ_{j,t+1} = min_i ( δ_{i,t}^p + | log a_{ij} + log b(q_j, y_{t+1}) |^p )^{1/p},
    ψ_{j,t+1} = arg min_i ( δ_{i,t}^p + | log a_{ij} + log b(q_j, y_{t+1}) |^p )^{1/p}.

Termination for the Lp-Viterbi algorithm finds the last hidden state of the maximum likelihood sequence x_ML using the last column of δ,

    x_T = arg min_i δ_{i,T}.
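The modified steps can be collected into a small decoder. The following is a toy sketch of the Lp-Viterbi algorithm under the log-domain formulation above (not the authors' code): because the log terms of an HMM are non-positive, minimizing the Lp-norm of their absolute values with p = 1 recovers the conventional maximum-likelihood Viterbi path, while a large p approaches the minimax variant.

```python
import math

def lp_viterbi(pi, A, B, y, p=1.0):
    """L^p-Viterbi decoding of an HMM.
    pi: initial probabilities, A: transition matrix, B: output
    probability matrix, y: observed symbol indices, p: norm exponent.
    Returns the decoded hidden-state sequence."""
    N, T = len(pi), len(y)
    delta = [[0.0] * T for _ in range(N)]   # accumulated L^p costs
    psi = [[0] * T for _ in range(N)]       # backpointers
    # Initialization: absolute values of the initial log terms.
    for i in range(N):
        delta[i][0] = abs(math.log(pi[i]) + math.log(B[i][y[0]]))
    # Recursion: combine costs through the p-th-power sum and root.
    for t in range(T - 1):
        for j in range(N):
            step = [abs(math.log(A[i][j]) + math.log(B[j][y[t + 1]]))
                    for i in range(N)]
            costs = [(delta[i][t] ** p + step[i] ** p) ** (1.0 / p)
                     for i in range(N)]
            delta[j][t + 1] = min(costs)
            psi[j][t + 1] = min(range(N), key=costs.__getitem__)
    # Termination and backtracking.
    x = [0] * T
    x[T - 1] = min(range(N), key=lambda i: delta[i][T - 1])
    for t in range(T - 2, -1, -1):
        x[t] = psi[x[t + 1]][t + 1]
    return x
```

Since raising a non-negative cost to the p-th power is monotone, the dynamic-programming recursion remains exact for every p ≥ 1, so the same three-step structure serves the whole family.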
3.2 Transition and output probabilities

The model parameters of an HMM, such as the transition probabilities and the output probabilities, are usually estimated from training data using the Baum-Welch algorithm [12]. However, we set those parameters manually, as explained in the following, instead of estimating them from training data, because it is difficult to prepare enough training data for our application of fingering decision, that is, guitar scores annotated with fingerings.

The difficulty levels of the moves from form to form are implemented in the probabilities of the transitions from hidden state to hidden state; a small value of the transition probability means the corresponding move is difficult, and a large value means it is easy. For simplicity, we assume that the four fingers of the left hand (the index, middle, ring and pinky fingers) are always put on consecutive frets. This lets us calculate the index finger position (the fret number the index finger is put on) of form q_i as follows:

g(q_i) = f_i - h_i + 1.

Using the index finger position, we set the transition probability as

a_ij(d_t) = P(X_t = q_j | X_{t-1} = q_i, d_t)
          ∝ (1 / (2 d_t)) exp( -|g(q_i) - g(q_j)| / d_t ) P_H(h_j),

where ∝ means "proportional to" and the left-hand side is normalized so that the summation with respect to j equals 1. The first term of the right-hand side is taken from the probability density function of the Laplace distribution, which concentrates on the center; its variance d_t is set to the time interval between the onsets of the (t-1)-th note and the t-th note. The second term P_H(h_j) corresponds to the difficulty level of the destination form, defined depending on the finger number h_j. In the simulation in the following section, we set P_H(1) = 0.4, P_H(2) = 0.3, P_H(3) = 0.2 and P_H(4) = 0.1, which means the form using the index finger is the easiest and the pinky finger is the most difficult. The difficulty levels of the forms are included in the transition probabilities (not in the output probabilities), in such a way that a transition probability is adjusted to a smaller value when the destination form of the transition is difficult.

As for the output probability, because all the hidden states have unique output symbols in our HMM for fingering decision, it is 1 if the given output symbol is the one that the hidden state outputs and 0 if it is not:

b_ik = P(Y_t = s_k | X_t = q_i)
     = 1 (if s_k = n(q_i)),
       0 (if s_k ≠ n(q_i)).

4. EVALUATION

We evaluate our proposed Lp-Viterbi algorithm by comparing the results of fingering decision for monophonic guitar phrases using the conventional Viterbi algorithm, the minimax Viterbi algorithm, and the proposed algorithm. Figure 1 shows the results for the opening part of Romance Anónimo. The numbers on the tablatures show the fret numbers, and the numbers in parentheses below the tablatures show the finger numbers, where 1, 2, 3 and 4 mean the index, middle, ring and pinky fingers. In the bottom graph of the transition probabilities, we see that the red line (the conventional Viterbi algorithm) keeps higher values (easy moves) at the cost of two very small values (very difficult moves), while the blue line (the minimax Viterbi algorithm) circumvents such very small values and makes the most difficult move easiest. The green line (the Lp-Viterbi algorithm for p = 3.0) falls into an intermediate category. The top tablature uses the pinky finger two times, while the middle and the bottom ones avoid it. Figure 2 shows the results for an excerpt from the cello part of Eine kleine Nachtmusik. In the bottom graph, we see again that the red line keeps higher values at the cost of a few very small values, while the blue line circumvents such very small values.

5. CONCLUSIONS

We have introduced a variant of the conventional Viterbi algorithm, termed the Lp-Viterbi algorithm, that continuously interpolates between the conventional Viterbi algorithm and the minimax Viterbi algorithm. The variant coincides with the conventional Viterbi algorithm for p = 1 and the minimax Viterbi algorithm for p = ∞. We have applied the variant to HMM-based guitar fingering decision. Our previous work [1] showed that the minimax Viterbi algorithm minimizes the maximum difficulty of the moves required for playing a given phrase. In the simulation of the present work, we have compared the fingerings obtained by the conventional Viterbi algorithm, the Lp-Viterbi algorithm and the minimax Viterbi algorithm, confirming that the Lp-Viterbi algorithm is capable of producing fingerings that fall into an intermediate category between the conventional Viterbi algorithm and the minimax Viterbi algorithm. We hope that these variants of the Viterbi algorithm find a wide range of applications in a variety of research fields.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers 26240025 and 17H00749.

6. REFERENCES

[1] G. Hori and S. Sagayama, "Minimax Viterbi algorithm for HMM-based guitar fingering decision," in Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016), New York City, USA, 2016, pp. 448-453.

[2] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260-269, 1967.

[3] S. I. Sayegh, "Fingering for string instruments with
Figure 1. The results of fingering decision for the opening part of Romance Anónimo (only top notes). The top, the middle and the bottom tablatures show three fingerings obtained by the conventional Viterbi algorithm, the Lp-Viterbi algorithm (p = 3.0) and the minimax Viterbi algorithm respectively. In the bottom graph, the red line shows the transition probabilities along the hidden path obtained by the conventional Viterbi algorithm, while the green line and the blue line are obtained by the Lp-Viterbi algorithm and the minimax Viterbi algorithm respectively.

Figure 2. The results of fingering decision for an excerpt from the cello part of Eine kleine Nachtmusik. The top, the middle and the bottom tablatures show three fingerings obtained by the conventional Viterbi algorithm, the Lp-Viterbi algorithm (p = 3.0) and the minimax Viterbi algorithm respectively. In the bottom graph, the red line, the green line and the blue line are obtained by the conventional Viterbi algorithm, the Lp-Viterbi algorithm and the minimax Viterbi algorithm respectively.
the optimum path paradigm," Computer Music Journal, vol. 13, no. 3, pp. 76-84, 1989.

[4] M. Miura, I. Hirota, N. Hama, and M. Yanagida, "Constructing a system for finger-position determination and tablature generation for playing melodies on guitars," Systems and Computers in Japan, vol. 35, no. 6, pp. 10-19, 2004.

[5] D. P. Radicioni, L. Anselma, and V. Lombardo, "A segmentation-based prototype to compute string instruments fingering," in Proceedings of the Conference on Interdisciplinary Musicology (CIM04), vol. 17, Graz, Austria, 2004, pp. 97-104.

[6] A. Radisavljevic and P. Driessen, "Path difference learning for guitar fingering problem," in Proceedings of the International Computer Music Conference, vol. 28, Miami, USA, 2004.

[7] D. R. Tuohy and W. D. Potter, "A genetic algorithm for the automatic generation of playable guitar tablature," in Proceedings of the International Computer Music Conference, 2005, pp. 499-502.

[8] D. Baccherini, D. Merlini, and R. Sprugnoli, "Tablatures for stringed instruments and generating functions," in Fun with Algorithms. Springer, 2007, pp. 40-52.

[9] G. Hori, H. Kameoka, and S. Sagayama, "Input-output HMM applied to automatic arrangement for guitars," Journal of Information Processing, vol. 21, no. 2, pp. 264-271, 2013.

[10] Y. Bengio and P. Frasconi, "An input output HMM architecture," Advances in Neural Information Processing Systems, vol. 7, pp. 427-434, 1995.

[11] M. McVicar, S. Fukayama, and M. Goto, "AutoLeadGuitar: Automatic generation of guitar solo phrases in the tablature space," in 12th International Conference on Signal Processing (ICSP), 2014, pp. 599-604.

[12] L. E. Baum and T. Petrie, "Statistical inference for probabilistic functions of finite state Markov chains," The Annals of Mathematical Statistics, vol. 37, no. 6, pp. 1554-1563, 1966.
1. INTRODUCTION

The cycle Systema Naturae (2013-17, [1]) includes four 20-minute pieces. Its main feature is the use of acoustic instruments together with computer-controlled physical objects placed in a specific spatial organisation.¹

Two references are at the base of the whole cycle. The first is the Medieval tradition of bestiaria, herbaria and lapidaria, intended as multi-faceted catalogues of miscellaneous beings. A second reference for the work is taxonomy, that is, the systematic description of living organisms that dates back to Linnaeus' Systema Naturae (hence the name of the whole cycle), as the rationalistic possibility of ordering the polymorphous (and many times teratomorphic) appearance of nature. Each piece of the cycle is organised as a catalogue of short pieces that receive an invented Latin name in binomial nomenclature (see two examples in Figures 11 and 14). Regnum animale (RA), featuring a string trio and 25 objects, was commissioned by the Milano Musica Festival and premiered by RepertorioZero (2013). Regnum vegetabile (RV), for sextet, includes 30 hacked hair dryers, and was commissioned by Ensemble Mosaik (Darmstadt, 2014). Regnum lapideum (RL), for septet and 25 objects, is a commission by Ensemble 2e2m and has been

Figure 1. RepertorioZero premiering Regnum animale.

Figure 2. ensemble mosaik performing Regnum vegetabile.

Figure 3. Ensemble 2e2m rehearsing Regnum lapideum.

¹ Photo/video documentation can be found at http://vimeo.com/vanderaalle and http://www.flickr.com/photos/vanderaalle/albums

Copyright: © 2017 Andrea Valle et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
2. ELECTROMECHANICAL INSTRUMENTARIUM

The main idea at the basis of the electromechanical setups is to create residual orchestras [2], that is, ensembles of instruments made up of debris and recycled objects (henceforth, sound bodies). Residual orchestras belong to two worlds, as they are both everyday objects with a mechanized control and offer an acoustic behaviour similar to music instruments. The main rationale at the basis of the composition of Systema naturae is thus to create and explore a middle ground where mechanized objects can be controlled in a standard, even if basic, musical way (by creating events, exploring their spectra, organizing their dynamics), while music instruments are treated in an object-like fashion by means of a wide usage of extended techniques. As an example, the string trio and the guitar include strings prepared with Patafix glue pads to create inharmonic spectra, while wind instruments largely use multiphonics and percussive sounds. To be successful, sound integration between objects and instruments has been based on a complete analysis of sounds, recorded from both objects and instruments, so that acoustic information could be stored and used as a sort of homogeneous ground while composing. Audio simulation of the pieces has proven fundamental in avoiding misconceptions about acoustic behaviour and in interactively exploring an unusual acoustic territory (see Section 3). Given such a technological infrastructure, it has then become possible to exploit a wide range of classic algorithmic techniques for composition (e.g. cellular automata, canonic techniques, data sonification, etc.). On the same side, a classification of objects in ethnomusicological terms (see in the following) has proven useful in creating a common conceptual ground with acoustic instruments.

Residual orchestras are meant to challenge the notion of instrument by reducing it to its minimal root. The seminal classification by Hornbostel and Sachs (H-S) [3] has prompted a vast, still ongoing, debate among ethnomusicologists about the nature of the music instrument [4]. By taking into account also electronic and digital music instruments [5, 6], it is possible to propose a minimal definition that considers a music instrument as a device capable of generating sound once a certain amount of energy is provided. Three elements are thus relevant: the physical body, an energy source, and a control interface that allows a (variably) fine tuning of the latter, so that the physical body can respond properly. In residual orchestras, physical bodies are designed and assembled following three main principles, inspired by sustainable design ([7]; [8]; in particular here [9]): refabrication, softening, flexibility.

Refabrication: many practices around the world have traditionally developed specific attitudes towards the refabrication of objects as a normal way of shaping and reshaping the semiotic status of material culture [10]. A first example of a residual orchestra by the author is the Rumentarium project [11].

Softening refers to the sound body hardware: being so simple and intrinsically costless, sound bodies can be produced while designed in an improvisation-like mood, starting from available materials. As a consequence, their hardware nature is quite soft: sound bodies, and their parts, can be replaced easily and effortlessly. Sound bodies in most cases remain open, that is, accessible for manipulation. All the orchestras typically present a no-case look, overtly showing components and connections.

Flexibility is here intended as the capability of the residual orchestras to be modified in relation to specific needs: as an example, a performance may require a certain setup in relation, e.g., to the presence of microphones for the amplification of sound bodies. Orchestras are assemblies of various objects, thus they are made of modular components that can, e.g., be easily replaced.

While the previous considerations may apply to a variety of sound instruments based on a DIY approach, both acoustic [12] and electronic [13], in residual orchestras the energy source does not involve the human body at all (an important feature in ethnomusicological classification [4]), as it is electromechanical and typically (even if not always) exploits low-voltage motors/actuators. This feature is crucial in bridging sound bodies with computational control, which covers the third element of the previous minimal definition of the music instrument. Since 2000, microcontroller boards (e.g. Arduino) have played a pivotal role in physical computing [14], not only by providing an interface between software and the physical environment, but also by prompting a new design perspective [15] that has revitalised the DIY technological community. At the moment, many options are indeed available, including the flourishing of single-board computers, like Raspberry Pi, UDOO or Bela. In the context of residual orchestras, physical computing provides the control layer for instrument behaviour. Physical computing also implements the principle of flexibility in relation to information processing and symbolic manipulation, as a computer's main strength is that it ensures not only programming but also re-programming. In short, by means of computational control, residual orchestras allow the creation of an acoustic computer music [11]. Systema naturae is entirely based on various residual orchestras that provide an apt correlate of the main aesthetic assumption at the base of the cycle: nature as a catalog of heterogeneous, and most of the time bizarre, entities that can nonetheless be used in relation to advanced algorithmic composition techniques. The instrumentarium prepared for Systema Naturae has to respect three complex constraints:

Low cost construction and maintenance: the technical budget is typically very limited and must be distributed (by design) over a large set of sound bodies, the building of each of them thus having to cope with a severely reduced average budget. Typically, the budget does not allow outsourcing instrument building: setups are thus entirely designed and built by the authors. A DIY approach is fundamental as, due to the nature of sound bodies, maintenance is required while installing setups for concerts.

Transportability: for each piece, the whole setup has to be designed for disassembly [9], as it must be assembled/disassembled as easily and quickly as possible in order to perform the piece in various locations and in the context of concerts including other complex setups. All technical materials (sound bodies, physical computing interface, powering units, cables) should be included, to avoid issues (and additional cost) on location with rented materials. Due to the hacked nature of the materials, some spare parts should be included in the transported package. Moreover, the overall volume and weight of the technical material must be reduced to be easily transportable. Since their premieres, both RA and RV have been performed more than 10 times in various European locations.

Time responsivity: the previous two constraints necessarily reduce the complexity of the sound bodies, so that their behaviour is extremely simple, in many cases resulting in the production of a single fixed sound with unvarying spectral and dynamic features. To cope with this from a composition perspective, it is crucial to be able at least to finely tune their temporal behaviour. This means that sound bodies must provide a fast attack and decay, so that complex rhythmical organization becomes possible.

In the following, we describe various designs for sound bodies. As noted by [4], the classification of music instruments can be quite complex if only considering sound production, while, on the other side, it can be extended to include various criteria depending on specific ethnographical practices. The classic taxonomic organisation by Hornbostel and Sachs [3] (H-S) can be tailored to fit generic needs by distinguishing among macrofamilies of instruments on the base of mechanical features. In fact, other criteria may be applied, resulting in a multidimensional classification (as proposed by [5]). In our case:

Control behaviour: a basic distinction opposes sound bodies with discrete behaviour (on/off, D) to those allowing a continuous one (C).

Control technology: most sound bodies are operated via microcontroller (M), but a subset requires a soundcard (S). While this is not relevant from the user perspective, thanks to software abstraction, it nevertheless implies a specific, different technical implementation. In turn, microcontrollers can be used to activate 12 V DC/AC devices or > 12 V (mostly, 230 V) AC ones, a feature that requires specific attention.

Pitch: an important feature in order to compose with sound bodies is the opposition between pitched and unpitched ones (P/U), as pitched sound bodies may be tightly integrated with the harmonic content provided by acoustic instruments.

Setup context: sound bodies have been designed in relation to each piece of the cycle and its instrumental setup. RA includes a string trio, surrounded in a circle by an amass of computer-driven, electromechanical devices built from discarded and scavenged everyday objects and appliances (electric knives, radio clocks, turntables, and so on). The RV setup includes a traditional acoustic instrument sextet (string and wind trios), placed behind a 7-meter line of 30 wind sound bodies (vs. RA's circle), all using the same model of hair dryer. In RL the instrumental ensemble is focused on percussive and plucked instruments (two strings, two woodwinds, guitar, piano and percussion), and the same happens for the electromechanical setup, which includes plucked strings and percussions scattered on the floor. While RA was mostly heterogeneous and based on house appliances, and RV focused on wind instruments, RL favours a metallic, percussive approach.

In the following, sound bodies are described following H-S basic principles (see in particular [16]), adding specific tags from the previous classification schema. The subscripts in the Control and Setup tags indicate respectively the current voltage and type (only if V = 12 and type = AC) and the number of occurrences in the specified piece. Figure 4 shows the resulting multidimensional classification, in which only the parts of the H-S tree relevant for sound bodies are represented (e.g. membranophones are not represented). Sound bodies have been named following a sort of distortion process applied to various references to instrument/object names.

I. Idiophones

The idiophone family consists of objects that directly produce sounds by their whole body.

Cono (D, S, U, RL8): coni are 8 woofer loudspeakers to be placed directly on the ground. Each cono is surmounted by an object, simply placed on its top. Loudspeakers are intended to deliver audio impulses that make the surmounting object shake or resonate. Objects include various metal boxes and two large PVC water pipes, having the diameter of the woofer and respectively 80 and 112 cm long. They act as resonators, filtering the impulses provided, with a clear kick drum sound (see the two orange pipes in Figure 3).

Lampadina (D, M230, U, RA1): a light bulb that can be turned on/off via a relay. It allows one to hear exclusively the sonorous relay click while providing at the same time a visual rhythmic cue.

Meshugghello (D, M12AC, P, RA1): an AC doorbell that provides a metallic, intermittent pitched sound.

Cimbalo (C, M12, U, RL2): cimbali are scraped instruments in which a plastic beater variably rotates over a metal plate (e.g. a pan).

Sistro (C, M12, U, RL2): a vessel rattle in which the beater rotates inside a metal box, causing the scraping of various elements (e.g. buttons, seeds, rice) on the internal surface.

Molatore (C, M12, U, RA4): literally "grinder", it is a scavenged component from tape decks (typically, cassette or VHS players) that generates low-volume, noisy, continuous sounds by rubbing against various metallic surfaces.

Spremoagrume (D, M230, U, RA1): a juicer featuring a low-pitched rumble caused by friction between its rotating element and the container.

Rasoio (D, M12, U, RA1): an electric shaver put into a metal box partially filled with buttons, which provides a buzzing, almost snare-like sound (Figure 5-4).

Segopiatto (D, M230, U, RA4): a segopiatto is built by shortly exciting a cymbal (or, in general, a metal resonating plate) by means of an electric knife. The knife's blade is leaning above the cymbal. The knife's motor is operated at 230 V and switched on/off by a 12 V relay. Figure 5-3 shows a segopiatto on a cymbal.

Tola (D, M12, P, RL6): an idiophone made up of a metal bar screwed into a metal can (Figure 7). Once the bar is struck, the metal can acts as a resonator, the resulting sound including both spectral components from the bar and the
[Figure 4. Multidimensional classification of the sound bodies ("Instrumentalia"). Only the branches of the H-S tree relevant for the sound bodies are shown: idiophones (struck directly, struck indirectly, friction, scraped, shaken), chordophones, aerophones (free aerophones, wind instruments, reeds, free reeds, ribbon reeds, non-idiophonic flutes with/without duct) and electrophones. These are crossed with the control dimensions (discrete/continuous, microcontroller/soundcard, pitched/unpitched) and the setup contexts (RA, RV, RL), and mapped onto the individual sound bodies: Cono, Lampadina, Meshugghello, Segopiatto, Tola, Cimbalo, Sistro, Rasoio, Molatore, Spremoagrume, Cetro, Girodisco, Radio, Sirenetto, Armonica, Anciolio, Ancetto, Trombo, Zampogno, Eolio, Ocarino, Fischietta and Cocacola.]
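The compact tag notation used for the sound bodies (e.g. Cono (D, S, U, RL8) or Meshugghello (D, M12AC, P, RA1)) can be captured in a small data structure. The sketch below is a hypothetical encoding for illustration: the field names are ours, and only the tag semantics (control behaviour, control technology, voltage/type, pitch, per-piece occurrences) come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Illustrative encoding of the sound-body tags described above.
@dataclass(frozen=True)
class SoundBody:
    name: str
    behaviour: str          # "D" discrete (on/off) or "C" continuous
    technology: str         # "M" microcontroller or "S" soundcard
    voltage: Optional[int]  # 12 or 230 for "M"; None for "S"
    ac: bool                # True only for the 12 V AC case
    pitched: bool           # True for "P", False for "U"
    occurrences: Dict[str, int] = field(default_factory=dict)  # piece -> count

# "Cono (D, S, U, RL8)" and "Meshugghello (D, M12AC, P, RA1)":
cono = SoundBody("Cono", "D", "S", None, False, False, {"RL": 8})
meshugghello = SoundBody("Meshugghello", "D", "M", 12, True, True, {"RA": 1})
```

Such an encoding makes the multidimensional classification directly queryable, e.g. filtering all discrete, microcontroller-operated bodies of a given piece.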
Figure 6. Some wind instruments for Regnum vegetabile.

Figure 8. The cetro.
[Figure 9. Shared composition information flow. The diagram connects three phases: Design (Data/Models 1a-1b, instrument and sound-body recordings 3a-3b, shared multimedia and audio files 2, 4), Sound organisation (Algorithmic composition environments 5a-5b, Recording 9, click tracks 12, Real-time application 8, Real-time Simulator 13, Music notation generator 14a, music and graphic notation files 15a-15b, Score assembler 18, complete music score 19) and Rehearsal/Performance.]

...piece, the choice of acoustic instruments and timbre), the design and building of sound bodies, the development of software to be used while composing (e.g. music notation generation, sound analysis, sound resynthesis for simulating the result).

2. Sound organisation

The core step concerns indeed the organisation of sound materials, which typically includes various algorithmic techniques (see Data/Models 1a-b). Shared output includes data files, algorithms and implementations, graphics, etc. (2). In order to clearly foresee composition results, sound bodies are recorded (3b) and the sound files shared (4). The same happens for instrumental samples (3a), which are mostly recorded from scratch (rather than imported from libraries), as they include special playing techniques. These sound files (4) can be used to feed various algorithms while composing, e.g. audio analysis to gather spectral data (in the Algorithmic composition environments, 5a-b, respectively in OM, Figure 10, and SC). The main output of the two Algorithmic composition environments are sb-scores (i.e. scores for sound bodies, 6). Sb-scores are ASCII files that specify events to be triggered in the sound bodies. Sb-scores are defined by the following protocol, loosely inspired by the Csound score file format [20]:

ocarino1 79.876 0.036 210 210
zampogno7 37.556 0.074 50 50

[...] in sb-scores) for events to be performed by acoustic instruments. Intermediate materials include indeed a very heterogeneous collection of working materials. Sb-scores are the final output of composition for sound bodies, but they have also to be processed (5b) to obtain the final data for [...] The Csound-based, non-real-time Simulator (10, see [21]) is able to include all the sound samples (both of sound bodies and acoustic instruments) to generate an audio file with a complete and accurate simulation of the piece (11). The Algorithmic composition environment (5a) automatically generates click tracks via OM (12, i.e. audio files). Click tracks are used to synchronise musicians with sound bodies and have to be played back during live performance: they are thus integrated into the Real-time application (8). Another component, the Real-time Simulator (13), performs a different function, as it plays back sound files of the sound body parts of the pieces, syncing them with the relative click tracks on a separate channel. Its main usage is during rehearsals with musicians, and it proved fundamental, as setups cannot be easily transported, and the simulation of the sound body parts provides accurate feedback for performers while rehearsing. The Music notation generator (14a, via OM [22]) is used to generate CPN scores for instruments (music notation files, 15a), while the PS Generator (14b) creates automatic PostScript visualizations from sb-score parsing (graphic notation files, 15b), which gives a visual overview of the sound body parts (Figure 11). In general, automatic music notation generation is a complex task, but it is instrumental in ensuring an integrated algorithmic composition (IAC) pipeline, rather than a computer-assisted one (CAC) [23]. These two graphical outputs are then assembled automatically by the Score assembler (18), which generates ConTeXt files [24], a TeX-based typesetting macropackage preferred to LaTeX for its flexibility [23]. The resulting final complete music score (in PDF, 19) can then be delivered to musicians.
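The excerpt documents only the general shape of an sb-score event line (a sound-body identifier followed by numeric parameters, as in "ocarino1 79.876 0.036 210 210"). The field meanings assumed below (onset time, duration, two control values), like all the names, are therefore illustrative guesses rather than the authors' specification. A minimal parser sketch:

```python
from typing import List, NamedTuple

class SbEvent(NamedTuple):
    body: str    # sound body identifier, e.g. "ocarino1"
    onset: float # assumed: event start time in seconds
    dur: float   # assumed: event duration in seconds
    p1: float    # assumed: body-specific control values
    p2: float

def parse_sb_score(text: str) -> List[SbEvent]:
    """Parse sb-score lines of the assumed form 'name onset dur p1 p2'."""
    events = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # keep the identifier, convert the next four fields to floats
        events.append(SbEvent(fields[0], *map(float, fields[1:5])))
    return events

score = "ocarino1 79.876 0.036 210 210\nzampogno7 37.556 0.074 50 50"
events = parse_sb_score(score)
```

Keeping the protocol line-oriented and whitespace-separated, as in Csound scores, makes sb-scores trivial to generate from any algorithmic composition environment and to post-process (e.g. for the PostScript visualization mentioned above).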
[5] R. T. A. Lysloff and J. Matson, "A new approach to the classification of sound-producing instruments," Ethnomusicology, vol. 29, no. 2, pp. 213-236, 1985.

[6] S. Jordà, "FMOL: Toward user-friendly, sophisticated new musical instruments," Computer Music Journal, vol. 26, pp. 23-39, 2002.

[7] W. McDonough and M. Braungart, Cradle to Cradle: Remaking the Way We Make Things. New York: North Point, 2002.

[22] J. Bresson, C. Agon, and G. Assayag, "OpenMusic: Visual programming environment for music composition, analysis and research," in Proceedings of the 19th ACM MM. New York, NY, USA: ACM, 2011, pp. 743-746.

[23] A. Valle, "Integrated algorithmic composition," in NIME 2008: Proceedings, 2008, pp. 253-256.

[24] H. Hagen, ConTeXt the manual. PRAGMA Advanced Document Engineering, Hasselt, NL, 2001.
ABSTRACT

This paper presents a method taking into account the form of a tune upon several levels of organisation to guide music generation processes to match this structure. We first show how a phrase structure grammar can represent a hierarchical analysis of chord progressions and be used to create multi-level progressions. We then explain how to exploit this multi-level structure of a tune for music generation and how it enriches the possibilities of guided machine improvisation. We illustrate our method on a prominent jazz chord progression called rhythm changes. After creating a phrase structure grammar for rhythm changes with a professional musician, the terminals of this grammar are automatically learnt on a corpus. Then, we generate melodic improvisations guided by multi-level progressions created by the grammar. The results show the potential of our method to ensure the consistency of the improvisation regarding the global form of the tune, and how the knowledge of a corpus of chord progressions sharing the same hierarchical organisation can extend the possibilities of music generation.

1. INTRODUCTION

In jazz music, and more generally in improvised music, the inherent form of a chord progression is often composed upon several levels of organisation. For instance, a subsequence of chords can create a tonal or modal function, and a sequence of functions can be organised as a section. This multi-scale information is used by musicians to create variations of famous chord progressions whilst keeping their original feel. In this article, we define as equivalent two chord progressions sharing the same hierarchical organisation. We call a multi-level progression an entity describing an analysis of a music progression upon several hierarchical levels (for instance chord progression, functional progression, sectional progression, etc.) that can be used to generate equivalent chord progressions. We want to design a musical grammar from which we can infer this hierarchical information and then use it to extend the pos-

Machine improvisation systems use style modelling to learn the musical style of a human improviser. Several methods of style modelling have been used for music generation, including statistical sequence modelling [1-3], extended Markov models [4], methods using deep and recurrent neural networks [5, 6], and methods based on the factor oracle, a structure from formal language theory [7, 8] adapted to a musical context in [9]. Studies around style modelling have made this structure a proficient tool for improvised performance and interaction. These studies involve the creation of specific navigation heuristics [10], its adaptation to audio features [11, 12] or its integration in a system mixing background knowledge with the local context of an improvisation [13]. However, none of these systems takes the long-term structure of the improvisation into account.

Improvising on idiomatic music such as jazz [14], where the improvisation is guided by a chord progression, has been a major interest in machine improvisation. Donzé et al. introduced the use of a control automaton to guide the improvisation [15]. Nika et al., with ImproteK [16], designed a system of guided improvisation, that is to say a system where the music generation model uses prior knowledge of a temporal scenario (e.g. a chord progression). The music generation model anticipates the future of a scenario while ensuring consistency with the past of the improvisation. It received positive feedback from professional improvisers and evolved according to musicians' expertise [17]. However, these scenarios are purely sequences of symbols. The chord progression is only considered on a local level, and it does not take into account the fact that the same harmonic progression can have different roles when played in different parts of a tune. Therefore, the improvisation does not adapt to the global structure. We would like the improvisation to take this global structure into account in a similar way as humans apply syntactic cognitive processes capable of handling the hierarchical structure of a song [18].
sibilities of music generation.
Form analysis, i.e. the problem of defining and analysing
Copyright:
c 2017 Ken Deguernel et al. This is an open-access article distributed the global structure of a piece of music, is a major topic
under the terms of the Creative Commons Attribution 3.0 Unported License, which in Music Information Retrieval and Computational Musi-
permits unrestricted use, distribution, and reproduction in any medium, provided cology and it has been studied with methods from different
the original author and source are credited. fields. Giraud et al. use methods from bioinformatics and
SMC2017-399
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
In this article, we present two contributions. First, we propose a method formalising multi-level progressions with phrase structure grammars. The structural aspect of equivalent chord progressions is analysed with a musician, but the actual contents of the multi-level progression are then automatically learnt on a corpus. The grammar enables the system to generate multiple instances of a given progression rather than a single one, and to provide its hierarchical structure. Second, we introduce a method using the information from a multi-level progression to expand the current methods of music generation. This method is based on the algorithms from [16] for music generation, which we adapt to take into account the information from a multi-level progression, thus creating a system able to improvise on a progression with changing voicing while considering its position in the structure of the tune. We apply this method to rhythm changes, the most played chord progression of jazz music after the blues [29]. We first create a phrase structure grammar for rhythm changes, and then generate melodic improvisations over multi-level progressions provided by the grammar.

In section 2, we explain how a phrase structure grammar can model a multi-level progression by creating a hierarchical structure, and we create a grammar for rhythm changes. Then, in section 3, we introduce a heuristic to generate a music sequence following a multi-level progression. In section 4, we describe our experiments and the results obtained with this new generation model. Finally, we conclude and propose some perspectives for this generation model in section 5.

2. GENERATIVE GRAMMAR AND SYNTACTIC STRUCTURE

In this section, we present how we can use a phrase structure grammar to create a multi-level progression from equivalent chord progressions. First, we present the definition of a phrase structure grammar, which consists of:

- two disjoint finite sets of symbols: N, the set of non-terminal symbols, and Σ, the set of terminal symbols;
- a singular element s ∈ N called the axiom;
- a finite set R of rewrite rules included in (N ∪ Σ)* N (N ∪ Σ)* × (N ∪ Σ)*,

where * is the Kleene star, i.e., X* is the set of finite sequences of elements of X.

Phrase structure grammars are a type of grammar presented by Noam Chomsky, based on constituent analysis, that is to say on a breakdown of linguistic functions within a hierarchical structure. In [25], Chomsky presented this example of a phrase structure grammar:

(i) Sentence → NP + VP
(ii) NP → Article + Noun
(iii) VP → Verb + NP
(iv) Article → a, the...
(v) Noun → man, ball...
(vi) Verb → hit, took...

Rule (i) should be read as "rewrite Sentence as NP + VP", i.e., a sentence consists of a noun phrase followed by a verb phrase. Rule (ii) should be read as "rewrite NP as Article + Noun", i.e., a noun phrase consists of an article followed by a noun, etc.

A derivation is the sequence of rules applied to create a specific sentence. Figure 1 shows a diagram representing the derivation of the sentence "the man hit a ball". This diagram does not convey all the information about the derivation, since we cannot see the order in which the rules were applied. Nevertheless, it clearly shows the hierarchical syntactic structure of this sentence, thus creating a visualisation of the constituent analysis.
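As a concrete illustration, the toy grammar above can be implemented in a few lines. This is an illustrative sketch only; the data structure and the random expansion strategy are our own, not taken from the paper:

```python
import random

# Chomsky's example grammar from rules (i)-(vi): each non-terminal maps to a
# list of possible expansions; symbols absent from the table are terminals.
RULES = {
    "Sentence": [["NP", "VP"]],
    "NP": [["Article", "Noun"]],
    "VP": [["Verb", "NP"]],
    "Article": [["a"], ["the"]],
    "Noun": [["man"], ["ball"]],
    "Verb": [["hit"], ["took"]],
}

def derive(symbol, rng=random):
    """Expand a symbol by recursively applying rewrite rules.

    Returns the sequence of terminal symbols, i.e. the derived sentence.
    """
    if symbol not in RULES:  # terminal symbol: emit as-is
        return [symbol]
    expansion = rng.choice(RULES[symbol])
    return [word for sym in expansion for word in derive(sym, rng)]

sentence = derive("Sentence")
print(" ".join(sentence))  # e.g. "the man hit a ball"
```

Every derivation produced this way respects the constituent structure by construction: the hierarchy of recursive calls is exactly the derivation diagram discussed above.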
Figure 2. Original chord progression of I Got Rhythm:
A: I VI- II- V7 | I VI- II- V7 | I I7 IV IV- | I VI- II- V7
A: I VI- II- V7 | I VI- II- V7 | I I7 IV IV- | I V7 I
B: III7 III7 VI7 VI7 | II7 II7 V7 V7
A: I VI- II- V7 | I VI- II- V7 | I I7 IV IV- | I V7 I

2.2 Application to rhythm changes

In order to test the formalisation of multi-level progressions with phrase structure grammars, we create such a grammar for a specific jazz chord progression: the rhythm changes. We show how we can obtain a hierarchical analysis of the chord progression and that it can be used to generate equivalent chord progressions.

Musicians mix various versions of rhythm changes as they improvise [29]; the chord progression can be different at every iteration. Using a phrase structure grammar to generate rhythm changes therefore seems appropriate. Considering chords, functions and sections as constituents, we can create a phrase structure grammar embedding the hierarchical structure of the rhythm changes, where chords are the terminal symbols.

In order to create this phrase structure grammar for rhythm changes, we analysed with a professional jazz musician all the rhythm changes from the Omnibook corpus [13, 30]. This sub-corpus consists of 26 derivations of rhythm changes and contains the melodies and annotated jazz chord symbols [31]. We present here the grammar that we created. This grammar has then been validated with jazz musicians (cf. section 4.1).

(i) Rhythm Changes → A1 + A2 + B + A
(ii) A1 → I + · + · + ·
(iii) A2 → I + · + · + ·
(iv) A → A1, A2
(v) B → III + VI + II + V

I, ·, ·, ·, III, VI, II, V are learnt on the corpus.
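The same derivation mechanism extends naturally to the rhythm-changes grammar. The sketch below is illustrative only: the section level follows rules (i) and (iv), but the function names and chord sequences are simplified stand-ins for the terminals learnt on the Omnibook corpus, not the paper's actual learnt rules:

```python
import random

# Section level follows rule (i); rule (iv) rewrites the final A as A1 here.
SECTIONS = ["A1", "A2", "B", "A"]
# Function level: hypothetical stand-ins for the symbols of rules (ii)-(iii);
# the B section follows rule (v).
FUNCTIONS = {
    "A1": ["tonic", "tonic", "subdominant", "turnaround"],
    "A2": ["tonic", "tonic", "subdominant", "cadence"],
    "B":  ["III", "VI", "II", "V"],
}
# Chord level: each function rewrites to one of several chord sequences,
# standing in for terminals learnt on a corpus.
CHORDS = {
    "tonic":       [["I", "VI-", "II-", "V7"], ["I", "#Io", "II-", "V7"]],
    "subdominant": [["I", "I7", "IV", "IV-"], ["I", "I7", "IV", "#IVo"]],
    "turnaround":  [["I", "VI-", "II-", "V7"]],
    "cadence":     [["I", "V7", "I", "I"]],
    "III": [["III7", "III7"]], "VI": [["VI7", "VI7"]],
    "II":  [["II7", "II7"]],   "V":  [["V7", "V7"]],
}

def derive_rhythm_changes(rng=random):
    """Return one multi-level label (section, function, chord) per chord,
    covering the whole AABA form."""
    labels = []
    for section in SECTIONS:
        for function in FUNCTIONS["A1" if section == "A" else section]:
            for chord in rng.choice(CHORDS[function]):
                labels.append((section, function, chord))
    return labels

progression = derive_rhythm_changes()
```

Each run produces an equivalent chord progression together with its hierarchical analysis, which is precisely what a multi-level progression records.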
Figure 3. Diagram of a rhythm changes derivation on Charlie Parker's theme Celerity (each chord lasts two beats). A → A1 denotes the application of rule (iv).

Figure 3 shows the diagram of one derivation of this grammar for one of the chord progressions on Charlie Parker's theme Celerity.

3. GENERATING MUSIC ON A MULTI-LEVEL PROGRESSION

In this section, we present a method using the information from a multi-level progression to enrich current music generation methods with the knowledge of the hierarchical structure and the equivalence between chord progressions. We first present the concepts we used as a basis for our work, and then introduce how to use the equivalent chord progressions of a multi-level progression in a music generation model and how it extends its possibilities.

3.1 Generating music on a chord progression

We propose to base our generation model on existing methods of machine improvisation guided by a scenario, to which we add the notion of multi-level progression. The basis of our generation model is a simplified version of ImproteK [16], which is inspired by the work from OMax [32]. In OMax, improvising consists in navigating a memory on which we have information on the places sharing the same context, i.e. the same musical past. The fundamental premise is that it is possible to jump from one place in the memory to another with the same context. This non-linear path on the memory results in the creation of a new musical sentence that differs from the original material, but that retains its musical style [10]. ImproteK uses a similar approach, but with the addition of a temporal scenario used to guide the improvisation. The scenario is a prior known sequence of symbols (called labels) that must be followed during the improvisation. Contrary to OMax, the memory is not based on a sequence of pitches, but on a sequence of musical contents tagged by a label.

The goal of the associated generation model is to combine, at every time in the generation, an anticipation step ensuring continuity with the future of the scenario, and a navigation step keeping consistency with the past of the memory. Therefore, a phase of generation of a music sequence follows two successive steps:

- The anticipation step consists in looking for an event in the memory sharing a common future with the current scenario. This is done by indexing the prefixes of the suffix of the current scenario in the memory using the regularities in the pattern [16]. Therefore, by finding such a prefix in the memory, we ensure continuity with the future of the scenario.

- The navigation step consists in looking for events in the memory sharing a common context with what was last played, in order to follow a non-linear path in the memory, thus creating new musical sentences.

As in OMax, the search for places in the memory sharing a common past is done thanks to the factor oracle: a structure from the field of formal language theory introduced by Crochemore et al. [7, 8]. Initially designed as an automaton recognising a language including at least all the factors
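The anticipation and navigation steps described above can be sketched as follows. This is a simplified illustrative stand-in for the ImproteK-style model, not its actual implementation: the factor oracle is replaced by naive suffix matching, and no label equivalence is used yet:

```python
import random

def generate(memory, scenario, rng=random):
    """memory: list of (label, content) events; scenario: list of labels.
    Returns contents whose label sequence follows the scenario."""
    output, t = [], 0
    while t < len(scenario):
        # Anticipation step: index every memory position whose labels match
        # a prefix of the scenario's suffix starting at t.
        runs = []
        for i in range(len(memory)):
            k = 0
            while (t + k < len(scenario) and i + k < len(memory)
                   and memory[i + k][0] == scenario[t + k]):
                k += 1
            if k:
                runs.append((k, i))
        if not runs:
            break  # label absent from memory: generation stops here
        best = max(k for k, _ in runs)
        # Navigation step (simplified): jump to any of the longest matches,
        # creating a non-linear path through the memory.
        _, i = rng.choice([r for r in runs if r[0] == best])
        output.extend(memory[i + j][1] for j in range(best))
        t += best
    return output
```

For example, with a memory of chord-labelled notes [("I", "c"), ("VI-", "a"), ("II-", "d"), ("V7", "g")] and the scenario ["II-", "V7", "I"], the model replays the "II-", "V7" run and then jumps back to the "I" event, yielding ["d", "g", "c"].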
- The navigation step is done in a similar way. Using the information given by the factor oracle, we first look for places in the memory sharing the same context with exact labels, but we also open the possibilities to equivalent labels. This way, we extend the realm of possibilities for music generation. Once again, we can favour places in the memory sharing a multi-level label similar to the one in the multi-level progression (with a weight), thus taking better consideration of the progression's hierarchical structure.

In order to compute a score of similarity between multi-level labels, a weight W_i can be attributed to each level in order to evaluate the similarity of the considered places in the memory with the scenario, such that

    Σ_{i ∈ levels} W_i = 1.

In this way, we can prioritise the choice of places in the memory with high scores, or limit the choice to the highest scores according to a threshold. For instance, a user can consider that matching the functions is more important than matching the chords.

Finally, Figure 4 sums up the whole process for generating music upon multi-level progressions.

Figure 5. Example of derivation of rhythm changes generated by the grammar:
I II- V7 I VI7 II- V7
I I7 IV #IVo I II- V7

4. EXPERIMENTATION ON RHYTHM CHANGES

4.1 Evaluation of the grammar

In order to evaluate our phrase structure grammar, we generated 30 derivations of rhythm changes with it. The generated multi-level progressions have been assessed by the musician who helped creating the grammar and by another professional jazzman, who was not involved in the initial process and therefore analysed the generated multi-level progressions strictly from a musicology point of view. Figure 5 shows an example of generated rhythm changes. Other derivations of rhythm changes can be found online at repmus.ircam.fr/dyci2/ressources.

Every generated rhythm changes has been validated by both musicians. However, these generated chord progressions have to be seen only as realisations of rhythm changes for improvisation accompaniment. For automatic composition of a chord progression for a theme, some parallelism constraints assuring some symmetries between the different A sections would be welcome. Moreover, this grammar covers the whole span of traditional bebop style rhythm changes (on which it was trained with the Omnibook corpus). No important possibilities were reported missing by any of the musicians. However, as expected, it does not create more modern versions of rhythm changes such as those of The Eternal Triangle by Sonny Stitt, or Straight Ahead by Lee Morgan, for instance. However, this is only due to the training corpus; no structural changes to the grammar would be needed for these. It would be interesting in the future to extend our rhythm changes corpus to more varied chord progressions.

4.2 Improvisation on rhythm changes

In order to test the benefits of using a multi-level progression, we generated some improvisations on Charlie Parker's music. First, using the phrase structure grammar trained on the rhythm changes from the Omnibook, we generated multi-level progressions. On these progressions, we generated improvisations using two methods:

- the base generation model introduced in section 3.1. In this case, only the chord progression level of the multi-level progression was taken into account for generation.

- the extended generation model introduced in section 3.2. In this case, all the information from the multi-level progression was taken into account, i.e., the chord progression, the functional progression, and the sectional progression. A score was attributed to each level. We considered that for rhythm changes the most important aspect was the functional aspect. Therefore, we attributed a weight of 0.5 to the functional level. We then attributed a weight of 0.3 to the chord level, and a weight of 0.2 to the sectional level.

In both cases, the content of the memory is a tune from Charlie Parker's Omnibook. Examples of generated improvisations (with both models) can be listened to online at repmus.ircam.fr/dyci2/ressources.

We conducted a listening session with two professional jazz musicians to analyse the generated improvisations. First of all, the most significant difference seems to be that when using the multi-level progression, the improvisations are indeed better at following the structure. This can be especially noticed during the B section of the improvisations, where the guide notes and 5th of each chord are more pronounced (cf. example on An Oscar for Treadwell). Figure 6 shows an excerpt from the bridge of an improvisation on An Oscar for Treadwell.

Figure 6. First four bars from the bridge of an improvisation on An Oscar for Treadwell:
Improvisation: 13th 11th 3rd 5th 13th 5th 9th 7th 5th 5th 5th
Memory: 3rd 1st 7th 9th 13th 5th 13th 11th 9th 5th 5th
The Improvisation line shows the chords and degrees played in the generated improvisation; the Memory line shows the chords and degrees on which the different parts of the melody used to generate the improvisation were actually learnt. We see that when generating improvisations, we can reach places in the memory with a different chord label. The melody still makes sense in its continuity and harmonically, because the multi-level labels from the memory and from the multi-level progressions are equivalent. The generated improvisations are therefore enriched with a new form of creativity.

Second, when encountering chords that do not appear in the memory, the freedom provided by the multi-level progression (e.g., playing on a different chord as long as it shares the same functional role) enables the improvisation system to generate musical sentences with less fragmentation, creating a better sense of consistency and fluidity (cf. example on Thriving from a Riff). Moreover, this generates a form of creativity exemplified by playing musical sentences on a different chord than the original one, thus playing new guide notes or extensions adding colour to the improvisations, whilst keeping consistency (cf. example on Anthropology).

These results are promising and show how using a multi-level progression can result in a significant improvement in how the improvisations are guided through the memory of the system.

5. CONCLUSIONS

We have shown how we can model the multi-level progression of a chord progression with a phrase structure grammar, and how to use this information for machine improvisation. First, this abstraction of a chord progression enables the system to generate several instances (or voicings) of this chord progression, injecting some creativity in the chord progression generation whilst keeping its multi-level structure under control. Second, this additional musical information is used during the generation of a melodic improvisation; the generation process is able to take into account the global form of the scenario with the multi-level aspect, to adapt to an ever-changing scenario and expand its possibilities. We applied this method on rhythm changes
and constructed a phrase structure grammar for this type of chord progression. This grammar was built and validated with musicians' expertise. We then generated improvisations on multi-level progressions generated with the grammar (including chord progression, functional progression and sectional progression), using new navigation heuristics adapted to this multi-level information. The generated improvisations give promising results according to professional jazz musicians. The global form of the chord progression is respected, and the improvisations are still consistent, even on unmet chord progressions.

The rhythm changes grammar was trained on a corpus of traditional bebop style rhythm changes. It would be interesting to extend this corpus to more modern variations of rhythm changes and to apply this method on other scenarios, such as a blues structure or more abstract scenarios such as Steve Coleman's Rhythmic Cycles based on the lunation cycle [33]. It would also be interesting to see if this hierarchical structure could be extended to a higher level, to take the organisation of an improvisation upon several successive occurrences of the chord progression into account. Finally, it would also be interesting to extend this work into a system where not only the improvisation adapts to the generated progression, but the derivations generated by the grammar also take what is being improvised into account, or into a system with an automatic generation of the hierarchical grammar based on machine learning on a corpus.

Acknowledgments

This work is made with the support of the French National Research Agency, in the framework of the project DYCI2, Creative Dynamics of Improvised Interaction (ANR-14-CE24-0002-01), and with the support of Région Lorraine. We thank the professional jazzmen Pascal Mabit and Louis Bourhis for their musical inputs.

[6] G. Bickerman, S. Bosley, P. Swire, and R. M. Keller, "Learning to create jazz melodies using deep belief nets," in Proceedings of the International Conference on Computational Creativity, 2010, pp. 228-236.

[7] C. Allauzen, M. Crochemore, and M. Raffinot, "Factor oracle: a new structure for pattern matching," in Proceedings of SOFSEM'99, Theory and Practice of Informatics, 1999, pp. 291-306.

[8] A. Lefebvre and T. Lecroq, "Computing repeated factors with a factor oracle," in Proceedings of the 11th Australasian Workshop on Combinatorial Algorithms, 2000, pp. 145-158.

[9] G. Assayag and S. Dubnov, "Using factor oracles for machine improvisation," Soft Computing, vol. 8-9, pp. 604-610, 2004.

[10] G. Assayag and G. Bloch, "Navigating the oracle: a heuristic approach," in Proceedings of the International Computer Music Conference, 2007, pp. 405-412.

[11] G. Surges and S. Dubnov, "Feature selection and composition using PyOracle," in Proceedings of the 2nd International Workshop on Musical Metacreation, 2013, pp. 114-121.

[12] A. Einbond, D. Schwarz, R. Borghesi, and N. Schnell, "Introducing CatOracle: Corpus-based concatenative improvisation with the audio oracle algorithm," in Proceedings of the 42nd International Computer Music Conference, 2016, pp. 140-147.

[13] K. Deguernel, E. Vincent, and G. Assayag, "Using multidimensional sequences for improvisation in the OMax paradigm," in Proceedings of the 13th Sound and Music Computing Conference, 2016, pp. 117-122.
Torsten Anders
University of Bedfordshire
Torsten.Anders@beds.ac.uk
The declarative approach of constraint programming [6] is more suitable for that. In this programming paradigm, users define constraint satisfaction problems (CSPs) in which constraints restrict relations between decision variables.² Variables are unknowns, and initially have a domain of multiple possible values. A solver then searches for one or more solutions that reduce the domain of each variable to one value that is consistent with all its constraints.

Several surveys, partly by this author, give an overview of how constraint programming has been used for modelling music theory and composition. Pachet and Roy [7] focus on constraint-based harmonisation, specifically four-part harmonisation of a given melody. An introduction to the field of modelling composition and music theory with constraint programming in general is the detailed review [8]. A further review [9] presents how six composers used constraint programming to realise specific compositions.

A particularly extensive constraint-based harmony system is CHORAL by Kemal Ebcioglu [10], which generates four-part harmonisations for given chorale melodies that resemble the style of Johann Sebastian Bach.

Several existing systems allow users to define their own harmonic CSPs, and each of these systems has its specific advantages. Situation [11] provides a family of mini-languages for specifying how the arguments to certain predefined harmonic constraints change across chords. Score-PMC, a subsystem of PWConstraints [12, Chap. 5], is well suited for complex polyphonic CSPs where the rhythm is fixed. PWMC [13] and its successor Cluster Engine allow for complex polyphonic CSPs where the rhythm is constrained as well.

While all these systems allow users to define their own harmonic CSPs, they restrict the constrainable harmonic information to pitches and intervals between pitches. For example, users can constrain harmonic and melodic intervals between note pitches, or represent an underlying harmony by chords consisting of concrete notes. However, the music representations of these systems are not extendable, and more abstract analytical information such as chord and scale types or roots, scale degrees, enharmonic note representations etc. is not supported. Such analytical information is important for various music theories. Some information could be deduced from a pitch-based representation (e.g., pitch classes), but other information is difficult to deduce (e.g., the chord type or root), and complex harmonic CSPs are more difficult to define and solve this way.

An expressive harmonic constraint system that supports such analytical information is the combination of the music representation MusES [14] and the constraint solver BackTalk [15]. This combination was used, e.g., for automatic harmonisation. BackTalk supports variable domains of arbitrary Smalltalk collections. The search algorithm first filters all variable domains with a general consistency-checking algorithm, and then searches for a solution of this reduced problem with a backjumping algorithm.

Many modern constraint solvers use constraint propagation for efficiency [16, Chap. 4]. Their variable domains are restricted to specific types (e.g., Boolean, integer and sets of integers), and highly optimised domain-specific propagation algorithms filter variable domains before every step in the search process, which typically greatly reduces the search space. The design of MusES cannot be ported to propagation-based constraint solvers, because it is not limited to the variable types supported by such solvers. MusES has originally not been designed with constraint programming in mind, and the variable domains of CSPs defined with MusES consist of complex Smalltalk objects.

This research proposes the first harmony framework that supports various analytical information to allow for modelling complex theories of harmony at a high level of abstraction, and whose design is at the same time suitable for propagation-based constraint solvers. Using constraint propagation allows, e.g., to quickly solve CSPs with large domain sizes, and that way allows for microtonal harmony. While the framework was implemented in Strasheela on top of Oz, it could also be implemented with any other constraint system that supports the following features found in various propagation-based constraint systems (e.g., Gecode, Choco, JaCoP, MiniZinc): the variable domains Boolean, integer, and set of integers; integer and set constraints including reified constraints (meta-constraints, which also constrain whether their stated relationship holds or not); and the element constraint.³

This framework is largely comparable in terms of its flexibility with the combination of MusES and BackTalk, although it has a different stylistic focus. MusES was designed for jazz, while this research focuses on classical tonal music and contemporary music in an extended tonality including microtonal music.

3. APPLICATIONS

Before presenting formal details of the proposed framework, some examples showing the framework in action will further motivate this research. Please remember that all rules discussed for these examples are only a demonstration of the capabilities of the proposed framework, and all these rules can of course be changed.

3.1 Automatic Melody Harmonisation

The first example creates a harmonisation for a given melody. It is comparatively simple, and is therefore discussed in more detail.

While the proposed framework was originally developed for music composition, it can also be used for analysis, because it only controls relations between concepts like notes and chords. This example demonstrates an intermediate case. It performs an automatic harmonic analysis of a given folk tune, but additional compositional rules are applied to the resulting harmonies. Voicing is irrelevant in this example; only the chord symbols are searched for.

The harmonic rhythm is slower than the melody, as is common in classical, folk and popular music. By contrast, most automatic harmonisation examples in the literature are chorale-like with one chord per note.

² In the rest of the paper we just use the term variable for brevity.
³ For details on the element constraint see https://sofdem.github.io/gccat/gccat/Celement.html, and also section 4.3.
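The CSP formulation described above (variables with finite domains, relational constraints, search) can be illustrated with a toy harmonisation problem. This is a minimal pure-Python sketch with a naive backtracking solver, not the Strasheela/Oz framework; the chord table and all names are our own, and the rules (melody notes are chord tones, consecutive chords share a pitch class, outer chords pinned to the tonic) are simplifications of the rules discussed below:

```python
# One chord variable per bar; domains are diatonic C-major triads given as
# pitch-class sets (C=0, D=2, E=4, F=5, G=7, A=9, B=11).
TRIADS = {
    "C": {0, 4, 7}, "Dm": {2, 5, 9}, "Em": {4, 7, 11},
    "F": {5, 9, 0}, "G": {7, 11, 2}, "Am": {9, 0, 4},
}

def harmonise(melody):
    """melody: one pitch class per bar. Returns all chord sequences
    satisfying the constraints, found by naive backtracking."""
    solutions = []

    def extend(partial):
        i = len(partial)
        if i == len(melody):
            solutions.append(list(partial))
            return
        for name, pcs in TRIADS.items():
            if melody[i] not in pcs:                       # melody note is a chord tone
                continue
            if i in (0, len(melody) - 1) and name != "C":  # outer chords: tonic
                continue
            if partial and not (TRIADS[partial[-1]] & pcs):  # harmonic band
                continue
            partial.append(name)
            extend(partial)
            partial.pop()

    extend([])
    return solutions

solutions = harmonise([0, 4, 7, 0])  # melody C E G C
```

A propagation-based solver would instead shrink each variable's domain before every search step rather than test constraints only on assignment, which is the efficiency argument made in the text.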
Figure 1. For the given melody this simple harmonisation model has four solutions.

For simplicity, this example defines relatively rigid basic conditions. Only major, minor, and dominant seventh chords are permitted, and all chords must be diatonic in C major. The harmonic rhythm is fixed, and all chords share the same duration (e.g., a whole bar), but chord repetitions are permitted.

The example distinguishes between harmonic and nonharmonic tones, but for simplicity only a few cases of nonharmonic tones are permitted (passing and neighbour tones). All other melody notes must be chord pitches.

The example further borrows a few harmonic rules from Schoenberg [1] in order to ensure musically reasonable solutions. The example assumes that the given melody starts and ends with the tonic: these chords are constrained to be equal. A seventh chord must be resolved by a fourth upwards in the fundament (e.g., V7 → I), the simplest resolution form for seventh chords. All chords share at least one common pitch class with their predecessor (harmonic band, a simpler form of Schoenberg's directions for producing favourable chord progressions).

Figure 1 shows all solutions for the first phrase of the German folksong "Horch was kommt von draussen rein" that fulfil the given rules. An x on top of a note denotes a nonharmonic pitch.

Because of the relative simplicity of this example, it works well only for some melodies and less well for others. The harmonic rhythm of the melody must fit the harmonic rhythm specified for the example (at least, chords can last longer, as repetitions are permitted). This is easily addressed by turning the currently fixed chord durations into variables, but doing so increases the size of the search space. Further, the nonharmonic pitches of the melody must fit the cases defined (passing and neighbour tones). An extension could define further cases, like suspensions and anticipations. Also, the melody currently cannot modulate. This can be solved by additionally modelling scales, and applying modulation constraints between chords and scales (as shown in the next example).

3.2 Modelling Schoenberg's Theory of Harmony

The next example implements large parts of Schoenberg's tonal theory of harmony [1], a particularly comprehensive theory, and that way demonstrates that the framework is capable of modelling complex conventional theories. To the knowledge of the author, Schoenberg's theory of harmony has not been computationally modelled before, neither with constraint programming nor in any other way. Also, this example implements modulation, which has rarely been done with constraint programming before. Among the many proposed systems, Ebcioglu's CHORAL [10] is the only system the author is aware of that supports modulation. As the current example implements a lot of harmonic knowledge, it can only be summarised here.

The example modulates from C major to G major, but features an extended pitch set that allows for non-diatonic tones. Figure 2 shows a solution.

Figure 2. A solution of the Schoenberg harmony model modulating from C to G major; Roman numerals added manually for clarity (altered chords are crossed out):
C: I vi7 ii vii iii
G: vi vii7 iii vi IV V7 I

The example applies Schoenberg's guidelines on chord root progressions, designed to help obtaining better progressions. Schoenberg distinguishes between three kinds of root progressions. In strong or ascending progressions, the chord root progresses by a fourth up (harmonically the same as a fifth down) or a third down. For example, in Fig. 2 chord 1 progresses to chord 2 by a third down (I → vi), while chord 2 progresses to chord 3 by a fourth up / fifth down (vi → ii). Descending root progressions form the second type. They are the inversion of strong progressions: a fourth down (fifth up) or a third up. These are not allowed in this example. Finally, in superstrong progressions the root moves a second up or down, as the chords in the penultimate bar do (IV → V). These different progressions are explained in more detail and formalised in [17].

The chords are related to underlying scales, which change during the modulation. The first five chords relate to the C major scale, and the rest to G major. However, the example also allows for non-diatonic tones. Schoenberg introduces these as accidentals from church modes, namely the raised 1st, 2nd, 4th, 5th and the flattened 7th degree. These are always resolved stepwise in the direction of their alteration (e.g., raised degrees go up). In order to support the intended direction of the modulation (C to G), only raised degrees are allowed in this example. In the solution above, e.g., chord 4 contains a raised 4th degree of C major (F#) that is chromatically introduced and resolves upwards by a step. Chord 7 contains a raised 5th degree of G major (D#).

A modulation constraint requires that chord 5 is a neutral chord, i.e., a chord shared by both keys. In the solution above this chord is iii in C and vi in G major. The modulation constraint further requires that the neutral chord is followed by the modulatory chord, which is only part of
SMC2017-409
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
the target key (vii7 in the solution). For clarity, the modulatory chord must progress by a strong root progression a fourth upwards (into iii above).

The example also applies voice leading rules. Open and hidden parallel fifths and octaves are not permitted. Also, the upper three voices are restricted to small melodic intervals and small harmonic intervals between voices (larger harmonic intervals are allowed between tenor and bass). Root positions and first inversions can occur freely. For example, chord 4 is a first inversion in Fig. 2. The number of first inversions has not been constrained, and it is rather high in the shown solution (6 chords out of 11). More generally, statistical properties such as the likelihood of these chords are difficult to control by constraint programming (e.g., their overall number can be restricted, but they may then be bunched early or late in the search process).

It should be noted that the pitch resolution of this example is actually 31 tones per octave (31-EDO). This temperament is virtually the same as quarter-comma meantone, a common tuning in the 16th and 17th century. It can be notated with standard accidentals (♯, ♭, ♭♭, …). This temperament has been used here because it simplifies the notation by distinguishing between enharmonically equivalent pitches (e.g., E♭ and D♯ are different pitch classes in this temperament). More generally, the use of this temperament demonstrates that the proposed framework supports microtonal music [18].

3.3 A Compositional Application in Extended Tonality

… composed manually (e.g., in Fig. 3 the septuplets in the Great division and the triplets in the pedal were composed manually).

Some example constraints are outlined. Chords have at least four tones, which all belong to the simultaneous scale. The first and last chord root of a section is often the root of its scale. To ensure smooth transitions between chords, the voice-leading distance between consecutive chords is low (at most 3 semitones in the excerpt). The voice-leading distance is the minimal sum of absolute intervals between the tones of two chords. For example, the voice-leading distance between the C and A♭ major triads is 2 (C → C = 0, E → E♭ = 1, G → A♭ = 1). Also, any three consecutive chords must be distinct.

The actual notes (in staves 1-3) must express the underlying harmony (stave 4). Nonharmonic tones (marked with an x in Fig. 3) are prepared and resolved by small intervals. Across the section starting in measure 8, the contrapuntal lines in the swell division rise gradually (pitch domain boundaries are rising), and melodic intervals are getting smaller (this section lasts over 10 bars, so this is not obvious from the few bars shown). The contrapuntal voices are never more than an octave apart; they don't cross; they avoid open and hidden parallels; they avoid perfect consonances between simultaneous notes (one is there in Fig. 3 after manual revisions); and voice notes sound all tones of the underlying harmony. Also, the lines are composed from motifs, and durational accents are constrained [20].
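Two of the rules above lend themselves to a compact sketch: Schoenberg's root-progression categories (strong, descending, superstrong) and the voice-leading distance between two chords. The following is an illustrative sketch, not code from the paper; the function names are made up, roots and chord tones are 12-EDO pitch classes, and the voice-leading distance is assumed to be minimised over all pairings of the chords' tones:

```python
from itertools import permutations

def classify_progression(root_a: int, root_b: int) -> str:
    """Classify the root progression root_a -> root_b (pitch classes 0-11)."""
    up = (root_b - root_a) % 12       # ascending interval in semitones
    if up in (5, 8, 9):               # a fourth up, or a third down
        return "strong"
    if up in (7, 3, 4):               # a fourth down (fifth up), or a third up
        return "descending"
    if up in (1, 2, 10, 11):          # a second up or down
        return "superstrong"
    return "other"                    # unison or tritone: not covered above

def pc_interval(a: int, b: int) -> int:
    """Smallest interval between two pitch classes, in semitones."""
    d = abs(a - b) % 12
    return min(d, 12 - d)

def voice_leading_distance(chord_a, chord_b) -> int:
    """Minimal sum of absolute intervals between the tones of two chords
    (assumes equal-size chords, given as pitch-class lists)."""
    return min(
        sum(pc_interval(a, b) for a, b in zip(chord_a, perm))
        for perm in permutations(chord_b)
    )
```

For the example in the text, `voice_leading_distance([0, 4, 7], [8, 0, 3])` (the C and A♭ major triads) yields 2, and `classify_progression(5, 7)` (IV → V) yields "superstrong".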
[Score of Figure 3 not reproduced (tempo q=70; staves: Swell, Great, Pedals, Chords (Anal.), Scales (Anal.)). Recoverable labels from the analysis staves: chords "Augmented Sixth", "Added Ninth Suspended Fourth", "Augmented Dominant Seventh", "Steely Dan Chord", "Dominant Seventh Suspended Second", "Sixth/Ninth Chord"; scales "Takemitsu Tree Line mode 2", "Messiaen mode 6".]
Figure 3. Excerpt from the composition Pfeifenspiel, composed by the author. The upper three staves show the actual composition, and the lower two an analysis of the underlying harmony. Chord and scale tones are shown like an appoggiatura, and roots as normal notes. Nonharmonic tones are marked by an x.
pitch classes of the untransposed chord (in this case the C major triad, {C, E, G}) as a set of pitch class integers. The given name is a useful annotation, but not directly used by the framework.

CT = [⟨name: major, root: 0, PCs: {0, 4, 7}⟩,
      ⟨name: minor, root: 0, PCs: {0, 3, 7}⟩,
      …]    (1)

Scale types are declared in the same way in an extra sequence of tuples ST. For example, the pitch class set of the major scale type is {0, 2, 4, 5, 7, 9, 11}, while its root is 0.

Users can also declare the global number of pitches per octave (psPerOct), and that way specify the meaning of all integers representing pitches and pitch classes in the constraint model. 4 A useful default value for psPerOct is 12, which results in the common 12-EDO, and which was used for the chord and scale type examples above.

Instead of specifying pitch classes by integers as shown, it can be more convenient to specify note names, which are then automatically mapped to the corresponding pitch class integers, depending on psPerOct. In 12-EDO, C ↦ 0, C♯ ↦ 1 and so on. Alternatively, pitch classes can be specified by frequency ratios as a useful approximation of just intonation intervals for different temperaments. Again in 12-EDO, the prime 1/1 ↦ 0, the fifth 3/2 ↦ 7, etc. 5

The format of chord and scale declarations is extendable. Users can add further chord or scale type features (e.g., a measure of the dissonance degree of each chord type), which would then result in further variables in the chord and scale representation discussed in Sec. 4.3 below. Note that chord and scale declarations are internally rearranged for the constraints discussed in Sec. 4.3. For example, root_CT is the sequence of all chord roots (in the order of the chord declarations), PCs_ST is the sequence of the pitch class sets of all scale declarations, and so on.

4 Only equal temperaments that evenly subdivide the octave are supported. However, just intonation or irregular temperaments can be closely approximated by setting psPerOct to a high value (e.g., psPerOct = 1200 results in cent resolution).

5 Remember that for the frequency ratio r, the corresponding pitch class is round(log2(r) · psPerOct).
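The declarations and mappings described in this subsection can be sketched as follows (a hypothetical data layout, not the framework's actual implementation; the dictionary-based `CT` and the `NOTE_NAMES_12EDO` table are illustrative assumptions):

```python
# Sketch of chord type declarations and pitch class mappings (hypothetical
# layout; the paper's framework uses its own tuple representation).
from math import log2

ps_per_oct = 12  # 12-EDO default, cf. psPerOct in the text

CT = [  # chord type declarations, cf. Eq. (1)
    {"name": "major", "root": 0, "PCs": {0, 4, 7}},
    {"name": "minor", "root": 0, "PCs": {0, 3, 7}},
]

# Note-name mapping for 12-EDO: C -> 0, C# -> 1, and so on.
NOTE_NAMES_12EDO = {"C": 0, "C#": 1, "D": 2, "Eb": 3, "E": 4, "F": 5,
                    "F#": 6, "G": 7, "Ab": 8, "A": 9, "Bb": 10, "B": 11}

def ratio_to_pc(r: float, ps_per_oct: int = 12) -> int:
    """Map a just-intonation frequency ratio to the nearest pitch class,
    following footnote 5: round(log2(r) * psPerOct)."""
    return round(log2(r) * ps_per_oct) % ps_per_oct

# In 12-EDO: ratio_to_pc(1) -> 0 (prime), ratio_to_pc(3/2) -> 7 (fifth)
```

With a higher `ps_per_oct` (e.g., 1200), the same `ratio_to_pc` approximates just intonation to cent resolution, as the footnote above suggests.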
4.2 Temporal Music Representation

The underlying harmony can change over time. Temporal relations are a suitable way to express dependencies: all notes simultaneous to a certain chord or scale depend on that object (i.e., those notes fall into their harmony).

The framework shows its full potential when combined with a music representation where multiple events can happen simultaneously. A chord sequence (or scale sequence or both) can run in parallel to the actual score, as shown in the score example discussed above (Fig. 3).

Score objects are organised in time by hierarchic nesting. A sequential container implicitly constrains its contained objects (e.g., notes, chords, or other containers) to follow each other in time. The objects in a simultaneous container start at the same time (by default). All temporal score objects represent temporal parameters like their start time, end time and duration by integer variables. A rest before a score object is represented by its temporal parameter offset time (another integer variable), which allows for arbitrary rests between objects in a sequential container, and before objects in a simultaneous container.

Equation (2) shows the constraints between temporal variables of a simultaneous container sim and its contained objects object_1 … object_n. Any contained object object_i starts at the start time of the container sim plus the offset time of the contained object. The end time of sim is the maximum end time of any contained object. The relations between temporal variables of a sequential container and its contained objects are constrained correspondingly, and every temporal object is constrained by the obvious relation that the sum of its start time and duration is its end time. The interested reader is referred to [21, Chap. 5] for further details.

start_object_i = start_sim + offset_object_i
end_sim = max(end_object_1, …, end_object_n)    (2)

Temporal relations can be defined with these temporal parameters. For example, we can constrain that (or whether, by using a resulting truth value) two objects o1 and o2 are simultaneous by constraining their start and end times (3). Note that for clarity this constraint is simplified here by leaving out the offset times of these objects, and remember that ∧ denotes a conjunction (logical and).

start_o1 < end_o2 ∧ start_o2 < end_o1    (3)

Remember that all these relations are constraints: relations that work either way. The temporal structure of a score can be unknown in the definition of a harmonic CSP. Users can apply constraints, e.g., to control the harmonic rhythm in their model, or the rhythm of the notes in a harmonic counterpoint. If other constraints depend on which objects are simultaneous to each other (e.g., harmonic relations between notes and chords), then the search should find temporal parameters relatively early during the search process.

… duces the representation of these objects, their variables, and the implicit constraints between these variables. The representation of chords and scales is identical, except that chords depend on the declaration of chord types CT, and scales on scale types ST (see Sec. 4.1). Therefore, the rest of this subsection only discusses the definition of chords.

A chord c is represented by a tuple of four variables (4), in addition to the temporal variables mentioned above (Sec. 4.2), which are indicated with "…". 6

c = ⟨type, transp, PCs, root, …⟩    (4)

The type (integer variable) denotes the chord type. Formally, it is the position of the respective chord in the collection of chord type declarations CT, see Eq. (1) above. The transp (integer variable) specifies how much the chord is transposed with respect to its declaration. PCs (set-of-integers variable) is the set of (transposed) pitch classes of the chord, and the root (integer variable) is the (transposed) root pitch class.

For chords where the root is 0 (C) in the declaration, transp and root are equal. In a simplified framework, the variable transp could therefore be left out. However, sometimes it is more convenient to declare a chord where the root is not C (e.g., leaving a complex chord from the literature untransposed, or stating the pitch classes of a chord by fractions where the root is not 1/1). Therefore this flexibility is retained here with the separate variables transp and root.

The constraints (5) and (6) restrict the relation between the variables of any chord object c and the collection of chord type declarations CT. The element constraint is key here. It accesses, in an ordered sequence of variables, a variable at a specific index, but the index is also a variable. 7 In (5), root_CT[type] is the untransposed root of the chord type. The (transposed) chord root is that untransposed root pitch class transposed by the constraint transp-pc. Pitch class transposition in 12-EDO with modulus 12 is well known in the literature. The definition here uses psPerOct as divisor to support arbitrary equal temperaments. A corresponding constraint for pitch class sets is expressed in (6).

root_c = transp-pc(root_CT[type_c], transp_c)    (5)
PCs_c = transp-PCs(PCs_CT[type_c], transp_c)    (6)

When chords are extended by further variables (e.g., a chord-type-specific dissonance degree) the chord declarations and chord object variables are simply linked by further element constraints (e.g., feat_c = feat_CT[type_c]).

4.4 Notes with Analytical Information

Note objects represent the actual notes in the score. A note n is represented by a tuple of variables as shown in (7). As with chords, temporal variables are left out for simplicity, and are only indicated with "…".

6 Internally, some additional auxiliary variables are used in the implementation …
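Constraints (5) and (6) can be illustrated with a naive generate-and-test sketch (not the paper's constraint solver; the helper names mirror the text's transp-pc, root_CT and PCs_CT):

```python
# Generate-and-test sketch of the element constraint linking a chord's type
# variable to its root and pitch classes, cf. constraints (5) and (6).
ps_per_oct = 12

CT = [  # chord type declarations, cf. Eq. (1)
    {"name": "major", "root": 0, "PCs": {0, 4, 7}},
    {"name": "dom7",  "root": 0, "PCs": {0, 4, 7, 10}},
]
root_CT = [ct["root"] for ct in CT]  # sequence of all chord roots
PCs_CT  = [ct["PCs"]  for ct in CT]  # sequence of all pitch class sets

def transp_pc(pc: int, t: int) -> int:
    """Pitch class transposition with psPerOct as modulus."""
    return (pc + t) % ps_per_oct

def chords_satisfying(required_pc: int):
    """Enumerate (type, transp, root, PCs) tuples whose PCs contain
    required_pc: the 'index' variable type is searched over, and root and
    PCs follow from it via (5) and (6)."""
    for type_ in range(len(CT)):
        for transp in range(ps_per_oct):
            root = transp_pc(root_CT[type_], transp)              # (5)
            pcs = {transp_pc(p, transp) for p in PCs_CT[type_]}   # (6)
            if required_pc in pcs:
                yield (type_, transp, root, pcs)
```

For instance, `list(chords_satisfying(0))` enumerates all transpositions of the two declared chord types that contain pitch class C. A real solver propagates these relations in both directions instead of enumerating.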
n = ⟨pitch, pc, oct, inChord?, inScale?, …⟩    (7)

The note's pitch (integer variable) is essential for melodic constraints. It is a MIDI note number in case of 12-EDO. The pc (integer variable) represents the pitch class (chroma) independent of the oct (octave, integer variable) component, which is useful for harmonic constraints. The relation between a note's pitch, pc and oct is described by (8); the octave above middle C is 4.

pitch = pc + (oct + 1) · psPerOct    (8)

The Boolean variable inChord? indicates a harmonic or nonharmonic note, i.e., whether the pc of a note n is an element of the PCs of its simultaneous chord c, implemented with a reified set membership constraint (9). The Boolean variable inScale? denotes equivalently whether notes are inside or outside their simultaneous scale.

inChord?_n = (pc_n ∈ PCs_c)    (9)

4.5 Degrees, Accidentals, and Enharmonic Spelling

So far, the formal presentation used two common pitch representations: the single variable pitch and the pair ⟨pc, oct⟩; further pitch-related representations can be useful to denote scale degrees (including deviations from the scale), tones in a chord (and deviations), and to express enharmonic notation. These representations are closely related. They split the pc component into further representations, depending on a given scale or chord. These representations are only briefly summarised here; formal details are left out due to space limitations.

Enharmonic spelling is represented with the pair of integer variables ⟨nominal, accidental⟩, where nominal represents one of the seven pitch nominals (C, D, E, …) as integers: 1 means C, 2 means D, …, and 7 means B. 8 The variable accidental is an integer where 0 means ♮, and 1 means raising by the smallest step of the current equal temperament. In 12-EDO, 1 means ♯, −1 means ♭, −2 means ♭♭, and so on. 9

The representation of enharmonic spelling depends on the pitch classes of the C major scale: the nominal 1 is a reference to the pitch class at the first scale degree of C major, 2 refers to the second degree, and so on. The same scheme can be used with any other scale to express scale degrees with the pair of integer variables ⟨scaleDegree, scaleAccidental⟩. The variable scaleDegree basically denotes the position in the pitch classes of a given scale. If the variable scaleAccidental is 0 (♮), then the expressed pitch class is part of that scale. Otherwise, scaleAccidental denotes how far the expressed pitch class deviates from the pitch class at scaleDegree. This representation has been used in the Schoenberg example described in Sec. 3.2 to constrain the raised scale degrees I, II, IV, and V.

This scheme can further be used for chords with the pair of integer variables ⟨chordDegree, chordAccidental⟩. The integer variable chordDegree denotes a specific tone of a chord, e.g., its root, third, fifth, etc.; it is the position of a chord tone in its chord type declaration in CT, while chordAccidental (integer variable) indicates whether and by how much the note deviates from a chord tone (for nonharmonic tones). This representation was also used in the model of Schoenberg's theory of harmony discussed above to recognise dissonances that should be resolved (e.g., any seventh in a seventh chord).

5. MODELLING WITH THE FRAMEWORK

The proposed framework consists primarily of the constrainable harmony representation presented in the previous section. Developing concrete harmony models with this foundation is relatively straightforward: the variables in the representation are constrained. This section shows formal details of the first application shown in Section 3.1.

For simplicity, this example only uses three chord types: major, minor, and the dominant seventh chord. These are declared exactly as shown previously in equation (1), where only the dominant seventh chord is added, with the pitch classes {0, 4, 7, 10} and root 0. The example declares scale types in the same way; only the major scale is needed.

The music representation of this example consists of a nested data structure. The top level is a simultaneous container, which contains three objects: a sequential container of notes; a sequential container of chords; and a scale. In the definition, the note pitches and temporal values are set to the melody (the folksong "Horch was kommt von draussen rein"), the scale is set to C major, and all chord durations are set to whole note values. In other words, only the chord types and transpositions, and hence also their pitch classes and roots, are unknown in the definition.

Space does not permit discussing all constraints of this example, but the formalisation of some of them gives an idea of the overall approach. As discussed before, the harmony of notes consists of their simultaneous chords and scale; chords also depend on their simultaneous scale. Only diatonic chords are allowed: every chord pitch class set is a subset of the scale pitch class set.

However, the example allows for nonharmonic notes. Constraining the notes' parameter inChord? allows modelling these. Nonharmonic notes are surrounded by harmonic notes, i.e., the parameter inChord? of their predecessor and successor notes must be true. Only passing tones and neighbour tones are permitted: the melodic intervals between a nonharmonic note and its predecessor and successor notes must not exceed a step (two semitones). Equation (10) shows these constraints; remember that the operator ⇒ indicates the implication constraint (logical consequence).

inChord?_{n_i} = false ⇒
    (inChord?_{n_i−1} = inChord?_{n_i+1} = true
     ∧ |pitch_{n_i} − pitch_{n_i−1}| ≤ 2    (10)
     ∧ |pitch_{n_i+1} − pitch_{n_i}| ≤ 2)

8 The choice to start with C and not A as 1 is arbitrary, and it is very easy to change that when desired. C is represented by 1 and not 0 for consistency with scale degrees, where the lowest degree is commonly notated as I.

9 This representation can also be implemented in a constraint system that only supports positive integer variables by adding a constant offset to all accidental variables.
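As an illustration, constraint (10) can be checked procedurally for a given melody (a sketch only; the CSP formulation works on unknown variables, whereas this function merely tests a finished line):

```python
# Sketch of constraint (10): a nonharmonic note must be surrounded by
# harmonic notes, approached and left by at most a step (two semitones).
def rule_10_holds(notes) -> bool:
    """notes: list of (pitch, in_chord) pairs for one melodic line."""
    for i, (pitch, in_chord) in enumerate(notes):
        if in_chord:
            continue  # harmonic notes are unconstrained by (10)
        # a nonharmonic note needs a predecessor and a successor
        if i == 0 or i == len(notes) - 1:
            return False
        prev_pitch, prev_in = notes[i - 1]
        next_pitch, next_in = notes[i + 1]
        if not (prev_in and next_in):
            return False  # neighbours must be harmonic
        if abs(pitch - prev_pitch) > 2 or abs(next_pitch - pitch) > 2:
            return False  # only stepwise motion in and out
    return True
```

A passing tone D between the chord tones C and E, `[(60, True), (62, False), (64, True)]`, satisfies the rule; a leap into a nonharmonic note does not.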
typically accepts an input data as some parameters and outputs the result, our zak-based UDOs do not have any outputs.

From a more practical view, the typical Csound opcode/UDO is applied like this:

aOut1 SomeOpcode aIn1, aP1, kP2 [, ...]

where aIn1 is an input of audio type, aP1 is an a-rate parameter, kP2 is a k-rate parameter, and aOut1 is an a-rate output. In our system we have

SomeOpcode aP1, kP2 [, ...], kIn1, kOut1

where aP1 is an a-rate parameter, kP2 is a k-rate parameter, kIn1 is the number of the k- or a-rate bus to read input data from, and kOut1 is the number of the k- or a-rate bus to send data to. After the parameter field, our UDOs have an IO field, in which the numbers of buses in zak space are listed. Using the described approach we are free to list the opcodes in any order, just like a Nord Modular user can add the modules in arbitrary order. So the first indexed module can easily be the last one in the audio chain.

Another important aspect to be described here is mapping. Nord modules have several different controllers with many ranges, i.e. audio frequency range, amplitude range, normalized values in the range from 0 to 1, delay time ranges, envelope stage durations, etc. The real values of the controllers can be seen only when using the editor. The patch file stores 7-bit MIDI values without any reference to the appropriate range. This made us manually fill the table with data types, ranges and values. Special mapping files of the pch2csd project contain the numbers of tables to be used for each parameter of each module. For example, the Module #112 (LevAdd module) table contains the following lines: …

… 1-pole LP filter and an Output module. We give this extremely primitive patch not for the purposes of timbre discussion but rather to present a clear example of how our converter works. The converter's default output file is test.csd in the program directory. Comments are given after semicolons. The algorithm detects three different types of modules and takes their UDO definitions from the library.

sr = 96000      ; audio sampling rate
ksmps = 16      ; times k-rate lower than a-rate
nchnls = 2      ; number of output channels
0dbfs = 1.0     ; relative amplitude level

zakinit 4, 3    ; Init zak-space

opcode Noise, 0, kkk            ; White Noise generator
kColor, kMute, kOut xin
if kMute != 0 goto Mute         ; mutes sound if Mute is On
aout rand 0.5, 0.1              ; seed value 0.1
aout tone aout, kColor          ; Csound simple LP filter
zaw aout, kOut
Mute:
endop

opcode FltLP, 0, kikkkkk        ; One-pole LP filter
; Keyboard Tracking has not been implemented yet
kKBT, iOrder, kMod, kCF, kIn, kModIn, kOut xin
ain zar kIn
kmod zkr kModIn
aout tonex ain, kCF + kmod*kMod, iOrder
zaw aout, kOut
endop

opcode Out2, 0, kkkkk           ; Output module
; Only stereo output has been implemented
kTarget, kMute, kPad, kL, kR xin
if kMute != 0 goto Mute
aL zar kL
aR zar kR
outs aL*kPad, aR*kPad
Mute:
endop
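The mapping step described above (raw 7-bit patch values to real controller ranges) can be sketched as follows (a hypothetical helper, assuming a simple linear mapping; the actual pch2csd tables may use other curves per parameter):

```python
# Hypothetical sketch of mapping a 7-bit MIDI value (0-127), as stored in
# a Nord Modular patch file, into a real controller range [lo, hi].
def map_midi_value(raw: int, lo: float, hi: float) -> float:
    """Linearly map a 7-bit value 0..127 into the range [lo, hi]."""
    if not 0 <= raw <= 127:
        raise ValueError("expected a 7-bit MIDI value")
    return lo + (hi - lo) * raw / 127.0

# e.g., a normalized 0..1 parameter: map_midi_value(127, 0.0, 1.0) -> 1.0
```

Non-linear controllers (audio frequencies, delay times, envelope stages) would each need their own mapping table, which is exactly what the per-module mapping files described above provide.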
Flavio Everardo
Potsdam University
flavio.everardo@cs.uni-potsdam.de
The use of Answer Set Programming (ASP) inside musical domains has been demonstrated especially in composition, but so far it hasn't overtaken engineering areas in studio-music (post) production such as multitrack mixing for stereo imaging. This article aims to demonstrate the use of this declarative approach to achieve a well-balanced mix. A knowledge base is compiled with rules and constraints extracted from the literature about what professional music producers and audio engineers suggest for creating a good mix. More specifically, this work can deliver a mixed audio file (mixdown) as well as a mixing plan in (human-readable) text format, to serve as a starting point for producers and audio engineers to apply this methodology in their productions. Finally, this article presents a decibel (dB) and a panning scale to explain how the mixes are generated.

Copyright: © 2017 Flavio Everardo et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. INTRODUCTION

The stereo mixing process, at a glance, results from placing each independently recorded instrument between two speakers or channels using a three-dimensional space in a balanced way; this means working with depth (volume), width (panorama) and height (frequency). Audio engineers and music producers with specific knowledge and skills usually do this task by using certain software called a Digital Audio Workstation (DAW), with no feedback from the computer or any intelligent system. Nevertheless, there are many rules regarding mixing. Some rules describe conventions about volume settings, placing instruments in the panorama or even equalizing instruments within certain bandwidth frequencies. This leads to the idea of gathering all these rules and letting a computer produce mixes automatically, or at least set a starting point that can be modified later in a DAW.

On the other hand, besides the DAWs available to create your mix, related work has already explored several tasks that compel the mixing procedure in an automatic way, such as levels balancing, automatic panning, dynamic range compression and equalization, among others [1-6]. From this point, there are some automatic mixing systems that integrate the tasks previously mentioned by using one of:

- Machine Learning techniques, trained on initial mixes to apply the learned features on new content.
- The Grounded Theory approach, which aims to acquire basic mixing knowledge, including psychoacoustic studies and user preferences, to eventually transfer this compilation to an intelligent system, and
- Knowledge Engineering, which feeds an intelligent system with already known rules and constraints.

All this related work uses low-level features (audio analysis), and the mixing decisions are solely based on this information. So far none has used a declarative proposal as an alternative to mix multiple tracks. As an alternative, this paper shows an innovative way to create mixes using Answer Set Programming (ASP) [8] as part of the Knowledge Engineering category. This proposal uses high-level information extracted from the literature, including audio engineering books, websites and peer-reviewed papers such as [7, 9-14], without taking into consideration (yet) low-level information.

Previous work has shown the use of a declarative language such as ASP to compose diverse types of musical styles [15, 16], including lyrics that match the composition [17]. Other works propose to build and compose chord progressions and cadences [18, 19], create score variations [20] or fill the spaces to complete a composition [21]. The reason why ASP fits perfectly for this type of problem is its declarative approach, which allows describing, in a compact way, the problem rather than the form of getting the solution. Another benefit is the possibility to add or change rules without taking care of their order of execution. Also, mixing is a highly combinatorial problem, meaning that there are many ways to do a mix [9] independently of the audio files used. Using a generate-and-test approach may deliver valid, and even not yet proposed (heard), results.

The goal of this work is to demonstrate the use of rules, the easy modelling of mixing rules and engineering knowledge representation, as well as the high performance of ASP, to create a balanced studio mix and a human-readable plan file. This audio mix is not intended to be closed to a specific genre, although only a fixed set of instruments is used. Audio input files can be given to generate the mixdown file and render a WAV file using Csound [22]. Besides, due to the lack of real-life example settings or common practices describing professionals' approaches on how to balance the audio files [7], this work proposes a decibel (dB) and a panorama scale to place each instrument in the sound field. To prove the basic modeling of a mixing
procedure, the scope of this work focuses on balance and panning only, leaving further processors such as dynamic range compression, equalization and filtering for future developments.

The result of this work is a static mixdown file, which is a rendered file with at most 16 instruments mixed. The user can adjust the number of instruments in the mix. According to Gibson [10], the most common instruments in a mix are: bass drum (also known as kick), snare, hi-hats, hi and low toms, left and right overheads, ride cymbal, a bass line, lead and rhythmic guitar, lead and back vocals, piano, violin and cello. This fixed but vast number of instruments is sufficient to illustrate the goal of this paper.

This document starts by giving a brief explanation of the concepts that are involved in the mixing process. The next section describes what ASP is and how the mix is modeled in a logic form. Finally, this paper concludes with an evaluation and results of the mix and the plan files, as well as a discussion on further work.

2. THE MIXING PROCESS

The mixing process involves many expert tasks that, accurately described, can be implemented in software or hardware [23]. All this expert knowledge extracted from the literature consists of sets of rules, constraints, conventions and guidelines to recreate a well-balanced mixdown. As stated before, the mixing process can be divided into three major areas: Dynamics (depth), Panning (width) and Frequencies (height), and it is common to find in the literature [7, 9-14] that this sequence persists as the main flow of the mix. The first step in a mix is to handle the depth of the track.

2.1 The Dynamics inside the Mix

The dynamics represent the emotion that the mix creates, shaped by adjusting the volumes of the different tracks while respecting which instruments should be louder than others. The first step that most experts state is the need to have a lead instrument, which will sound the loudest compared to the remaining instruments in the mix. The reason behind this is that every song has its own personality and is mixed based on the lead instrument. Once the lead instrument is selected, the next step will define the volumes of the remaining instruments.

David Gibson states [10] that a good practice is to set each instrument into one of six apparent volume levels, but before setting the amount of volume of each instrument, let us first introduce sound pressure level, decibels, and amplitude ratio. When you raise a fader on a mixing board, you are raising the voltage of the signal being sent to the amp, which sends more power to the speakers, which increases the Sound Pressure Level (SPL) in the air that your ears hear. Therefore, when you turn up a fader … in other words, the incoming signal remains untouched. If the fader is moved 6 dB up, it will double the incoming sound (+6 dB = 200%), and the same occurs in the opposite way: if the fader is decreased 6 dB below 0 dB, it will represent only half the power of the original sound (-6 dB = 50%). This comes from the amplitude ratio-decibel relation dB = 20 log10(N), where dB is the resultant number of decibels and N equals the amplitude ratio. This formula states that an amplitude ratio of 1 equals 0 dB, which is the maximum volume of an incoming signal without distortion. Table 1 demonstrates that 41 dB represents a full-scale domain for the incoming sound percentage (ISP) from 1 to 100%.

Table 1. Amplitude Ratio-Decibel Conversion Table

dB   | ISP (%) | Amplitude Ratio
+6   | 200     | 1.9953 (≈ 2)
0    | 100     | 1
-1   | 89      | 0.891
-2   | 79      | 0.794
-3   | 71      | 0.708
-6   | 50      | 0.501 (≈ 1/2)
-10  | 32      | 0.316
-20  | 10      | 0.1
-30  | 3       | 0.0316
-40  | 1       | 0.01

The six apparent volume levels that Gibson proposes are conventions to set the recorded instruments into the sound field. Table 2 shows the scale for these six volume levels used in this work. According to Gibson, the room contains six levels; nevertheless, the literature doesn't give a real dB scale to map this room. Most of the literature only displays analogies about how far or how close an instrument should be from another. The scale in Table 2 maps the room using the information in Table 1 as a reference. The half of the room (between levels 3 and 4) equals the half of the sound, which is -6 dB (50%), and from this point the rest of the values are divided in a uniform way to cover the six levels.

Table 2. Six Volume Levels - dB Range Table

Volume Level | dBs (From) | dBs (To) | ISPs
1            | 0          | -1.83    | 100 - 81
2            | -1.94      | -4.29    | 80 - 61
3            | -4.44      | -7.74    | 60 - 41
4            | -7.96      | -19.17   | 40 - 11
5            | -20.0      | -24.44   | 10 - 6
6            | -26.02     | -40.0    | 5 - 1

After balancing all the instruments by placing a dB or SPL value according to the previous information, the next step is panning.
fader, of course, the sound does get louder [10]. The deci-
bel (dB) is a logarithmic measurement of sound and is the
2.2 How Do We Turn The Pan Pot?
measure for audio signal or sound levels. The relationship
between dB and the percentage of the sound states that 0 Moving the instruments between the speakers from left to
dB equals a 100% of the recorded (or incoming) sound, right does the panning or the positioning of each instru-
SMC2017-423
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
ment in the sound field (normally with the use of a physi- hold unless atom A is derived to be true. Letting the ex-
+
cal or digital pan pot) with the purpose to gain more clarity pression head (r) = a0 , body(r) = {a1 , . . . , am }, and
in the mix. The panning is not the result of treating each body(r) = {am+1 , . . . , an }, r denoted by head (r)
+
channel individually, instead, it is the result of the interac- body(r) {a | a body(r) }. A logic program P
tion between those channels. consist in a set of rules of the form (3). The ground in-
The engineers job is to maintain the balance of the track stance (or grounding) of P , denoted by grd (P ), is created
across the stereo field and one recommended technique is by substituting all variables in a with some elements in-
setting the sounds apart with similar frequency ranges. An- side the domain of P . This results in a set of all ground
other strong suggestion is to avoid hard panning which is rules constructible from rules r P . A satisfiable ground
panning an instrument fully right or left. Low-frequency rule r of the form (3) contains a set X of ground atoms
+
tracks must not be panned out of the center and the rea- if body(r) X and body(r) X = refering that
son is to ensure that the power of the low-frequency tracks a0 X. If X satisfies every rule r grd (R), X is a model
are equally distributed across the speakers. Thats why the of P and X is an answer set of P if X is a subset-minimal
+
lowest frequency instruments are not panned like the bass model of {head (r) body(r) | r grd (P ), body(r)
line and the kick. X = }.
This work uses the -3 dB, equal power, sine/cosine pan- For cardinality rules and integrity constraints the expres-
ning law, which consist of assigning values from -1 (hard sions are in the form
left panning) to +1 (hard right panning) to determine the
h a1 , . . . , am , am+1 , . . . , an (4)
relative gain of the track between channels. This law is the
most common to use according to audio engineering liter- where ai , for 1 m n, an atom of the form p(t1 , . . . , tk )
ature [11]. The formulas to calculate the amount of energy with predicate symbol p, and t1 , . . . , tk are terms such as
that each speaker will generate in both, left and right chan- constants, variables, or functions. The head h can be either
nels are: for integrity constraints or of the form l {h1 , . . . , hk } u
Lef tGain = cos((p + 1)/4) (1) for cardinality constraint in which l, u are integers and
h1 , . . . , hk are atoms. Similarly a ground cardinality rule is
RightGain = sin((p + 1)/4) (2)
satisfied by a set X of ground atoms if {a1 , . . . , am } X
To get the value of p which corresponds to the panning and {am+1 , . . . , an }X = implying h = l {h1 , . . . , hk }
value of a single instrument, a scale which contains 21 pos- u and l |{h1 , . . . , hk } X| u; finally a ground
sible values is proposed, 10 assigned to the left (from -1 to integrity constraint if {a1 , . . . , am } 6 X or {am+1 , . . . ,
-0.1), 10 to the right (from 0.1 to 1) and 1 value for cen- an } X 6= . Both ground cardinality (in a rules head)
tered instruments (0 for center panning). and ground integrity constraints are satisfied under certain
conditions like only satisfied by a set of ground atoms X
if X does not satisfy its body 1 . In case of a ground cardi-
3. ANSWER SET PROGRAMMING
nality constraint if the statement before is not fulfilled, X
ASP is a logic-programming paradigm on the use of non- must contain at least l and at most u atoms of {h1 , . . . , hk }.
monotonic reasoning, oriented on solving complex prob- For example, given the following program in ASP code:
lems originally designed for the Knowledge Representa- 1 a :- b, c.
tion and Reasoning domain (KRR). ASP has become very 2 b :- not d.
popular in areas, which involve problems of combinatorial 3 c.
4 e :- f.
search using a considerable amount of information to pro-
cess like Automated Planning, Robotics, Linguistics, Biol-
ogy and even Music [24]. ASP is based on a simple yet Listing 1. ASP Basic Code
expressive rule language that allows to easily model prob- the unique answer set is {a, b, c}. The default-negated
lems in a compact form. The solutions to such problem value not d satisfies b, because there is no evidence of d
are known as answer sets or stable models [25], nowa- as a fact. Saying this, both facts c, b satisfies the first rule
days handled by high performance and efficient Answer to get a. Note that f cannot satisfy e because there is no
Set Solvers [26]. prove of f to be true.
This paper gives a small introduction to the syntax and se- The following section shows the ASP programs (encod-
mantics of logic programming and ASP including the nor- ings) that lead to possible configurations of a mix which
mal or basic rule type, cardinality rules and integrity con- may include rules with arithmetical functions i.e., {+,
straints . For further reading and a more in-depth coverage -, *, /} and also comparison predicates {=, !=,<,
of the expressions shown in this section, refer to [8,24,27]. <=,>=,>}. This work uses the state-of-the-art grounder
A normal rule expression r is in the form: gringo [28] and solver clasp [26].
SMC2017-424
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
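To make these definitions concrete, here is a small Python sketch (illustrative only; the paper itself relies on gringo and clasp) that enumerates the answer sets of the ground program in Listing 1 by building the reduct of each candidate set and checking the least-model condition:

```python
from itertools import chain, combinations

# Program from Listing 1, as (head, positive_body, negative_body) triples.
rules = [
    ("a", {"b", "c"}, set()),   # a :- b, c.
    ("b", set(), {"d"}),        # b :- not d.
    ("c", set(), set()),        # c.
    ("e", {"f"}, set()),        # e :- f.
]
atoms = {"a", "b", "c", "d", "e", "f"}

def minimal_model(positive_rules):
    """Least model of a negation-free program, by fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos, _ in positive_rules:
            if pos <= model and head not in model:
                model.add(head)
                changed = True
    return model

def is_answer_set(candidate):
    """X is an answer set iff X is the least model of the reduct P^X."""
    # Keep only rules whose negative body is disjoint from X; drop the negation.
    reduct = [(h, pos, set()) for h, pos, neg in rules if not (neg & candidate)]
    return minimal_model(reduct) == candidate

# Brute force over all subsets of the (tiny) atom set.
answer_sets = [set(x)
               for x in chain.from_iterable(
                   combinations(sorted(atoms), r) for r in range(len(atoms) + 1))
               if is_answer_set(set(x))]
```

Running it yields the single answer set {a, b, c}, matching the discussion above: d is underivable, so not d fires b, and e stays out because f is never derived.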
in the mix, the lead instrument, the room definition, setting dynamics, panning instruments, and stereo mix evenness. The instruments inside the mix can be user defined and, as stated before, there are 16 possible instruments to mix. The resultant mix can contain any number of instruments from the mentioned list. Each instrument desired in the mix must be reflected as a fact trackOn(I), where I is the instrument name, such as kick, snare, piano, or bass. Listing 2 shows part of the instruments definition.

1 trackOn(kick).
2 trackOn(snare).
3 trackOn(hihats).
4 trackOn(bass).
5 trackOn(piano).
6 trackOn(leadVocals).
7 trackOn(leadGuitar).

(line 20), contrary to the kick drum, which can only be placed at levels 2 and 3 (line 18). The integrity constraints in lines 24-31 also limit these instruments to a certain volume level. The main difference is that this constraint only applies if the instrument is not lead.

1 volumeLevel(1..6).

3 1 { instrVolumeLevel(I,VL) :
4     volumeLevel(VL) } 1 :- trackOn(I).

6 1 { instrDB(I,DB) :
7     volumeLeveldB(VL,DB) } 1 :-
8     instrVolumeLevel(I,VL).

10 :- instrVolumeLevel(LI,VL),
11    trackOn(LI),
12    leadInstr(LI), VL != 1.
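The decibel arithmetic of Table 1 and the -3 dB equal-power panning law of Section 2.2 can be sketched with a few Python helpers. These are illustrative functions, not part of the paper's toolchain, and the pi factor inside the cosine/sine is assumed as the standard form of this law:

```python
import math

def isp_to_db(isp_percent):
    """Incoming sound percentage (Table 1) to decibels: dB = 20*log10(N)."""
    return 20 * math.log10(isp_percent / 100.0)

def db_to_amplitude_ratio(db):
    """Inverse of the Amplitude Ratio-Decibel function: N = 10^(dB/20)."""
    return 10 ** (db / 20.0)

def pan_gains(p):
    """-3 dB equal-power sine/cosine panning law; p in [-1 (hard left), +1 (hard right)]."""
    theta = math.pi * (p + 1) / 4.0
    return math.cos(theta), math.sin(theta)  # (left gain, right gain)
```

For example, isp_to_db(50) gives about -6.02 dB, the half-of-the-room value placed between levels 3 and 4, and pan_gains(0) gives roughly 0.707 per channel (-3 dB each), so total power cos^2 + sin^2 stays constant for any p.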
by picking one of three possible directions of the stereo field: left, center, or right (lines 1-8). Additionally, each instrument in the mix needs to respect the panning constraints. Listing 6 shows an example of these constraints, stating that the kick, bass, lead vocals, and snare instruments must respect a centered position (lines 10-17). On the other side, there are instruments that cannot be placed in the center (lines 19-25). Listing 7 also contains another set of integrity constraints to respect sides. Examples of these are: the left overs cannot be on the right and vice versa (lines 1-4); if the two guitars, lead and rhythm, are in the mix, without evidence that the piano is in the mix, the guitars must be on different sides (lines 6-10); also, if there is no evidence that the lead guitar exists, but both rhythm guitar and piano are facts, the two are panned to different sides as well (lines 12-16).

1 :- instrPanDirection(leftOvers, PD),
2    panDirection(PD), PD != left.
3 :- instrPanDirection(rightOvers, PD),
4    panDirection(PD), PD != right.

6 :- instrPanDirection(rhythmGuitar,PD),
7    instrPanDirection(leadGuitar, PD),
8    not trackOn(piano),
9    trackOn(leadGuitar),
10   trackOn(rhythmGuitar).

12 :- instrPanDirection(rhythmGuitar,PD),
13    instrPanDirection(piano,PD),
14    not trackOn(leadGuitar),
15    trackOn(rhythmGuitar),
16    trackOn(piano).

Listing 7. Side Panning Constraints

After picking sides, the next step is to figure out how far left or right each instrument will be panned (Listing 8). The first four lines state that if the instrument is not centered, one pan level is chosen and the instrument is placed there. The literature does not force an instrument to a specific position, except for the ones in the center. This opens opportunities to play around and listen to some instruments in different positions on each mixing configuration. To avoid hard panning, lines 6 to 9 are in charge of satisfying this constraint. Similarly, lines 11 to 14 avoid that two instruments, both panned left or right, stay at the same panning level.

1 1 { instrumentPanningLevel(I,PD,PL)
2     : panLevels(PL) } 1 :-
3     instrumentPanDirection(I,PD),
4     PD != center.

6 :- instrPanningLevel(I,right,MPV),
7    maxPanValue(MPV).
8 :- instrPanningLevel(I,left ,MPV),
9    maxPanValue(MPV).

11 :- instrPanningLevel(I1,PD,PL),
12    instrPanningLevel(I2,PD,PL),
13    I1 != I2,
14    PD != center.

Listing 8. How Far Left or Right Constraints

Finally, the sixth part counts the total number of instruments and those panned center, left, and right (Listing 9). The last four lines (19-22) avoid that the mix is uneven or unbalanced between the instruments positioned on the left and right. The mix must not be loaded to one side.

1 countTracksOn(X) :-
2    X = #count{0,trackOn(I):trackOn(I)}.

4 countPanCenter(X) :-
5    X = #count{0,
6    instrPanningLevel(I,center,PL) :
7    instrPanningLevel(I,center,PL)}.

9 countPanLeft(X) :-
10   X = #count{0,
11   instrPanningLevel(I,left,PL) :
12   instrPanningLevel(I,left,PL)}.

14 countPanRight(X) :-
15   X = #count{0,
16   instrPanningLevel(I,right,PL) :
17   instrPanningLevel(I,right,PL)}.

19 :- countPanLeft(N),
20    countTracksOn(X), N >= X/2.
21 :- countPanRight(N),
22    countTracksOn(X), N >= X/2.

Listing 9. Making the Mix Even

Given this knowledge base, it is possible to get a configuration of a balanced mix using volumes and pan only. The solution output from ASP (depending on the number of instruments) can contain facts such as: countTracksOn(8) countPanCenter(3) countPanLeft(2) countPanRight(3) leadInstrument(leadGuitar) intrumentDB(hihats,-2) intrumentDB(bass,-2) instrumentPanningLevel(kick,center,0) instrumentPanningLevel(rhytmicGuitar,left,7) instrumentPanningLevel(lowTom,right,1) instrumentPanningLevel(leadGuitar,right,3).

The facts above are an extract of an answer set. This notation can be parsed into human-readable text for the mixdown plan and into Csound code to render the WAV file.

5. TESTING AND RESULTS

To test the performance towards an automated mixing tool, five songs from different genres were mixed using only the rules mentioned. Rendered mix files were generated starting from the grounding and solving processes, by gringo and clasp respectively, followed by two Perl file parsers: one to convert the answer set to a human-readable plan and the other to parse the output into Csound code. There is no audio treatment in Csound, just the indication of the volume in dB and the resultant panning values. A Csound function automatically determines if the incoming file is mono or stereo in order to use that file in the mix. No low-level features are extracted and no further processors are applied to the mixes.

The audio stems used for testing are high-quality recordings publicly available from the Free Multitrack Download Library of the Cambridge Music Technology website (http://cambridge-mt.com/ms-mtk.htm).

The mixes last between 20-60 seconds, and the number
of tracks varies depending on the genre. For each song, four different answers were requested from the solver in order to test that:

1. None of the generated mixdown files clipped (distorted), even though it is possible to tell Csound to leave a certain headroom below 0 dB (also suggested by the literature).

2. All the selected instruments were heard in the mix.

3. All the mixes in fact respect a balanced distribution of the instruments through the sound space.

4. No inconsistencies were found when running and switching between different types of instruments. In other words, there are no rules that contradict others.

5. Using the aforementioned rules, this work can deliver a vast number of different and valid mixing configurations.

Less than 100 lines of ASP code (not counting comments and white space) are enough to develop a rule-based approach for multitrack stereo mixing. Letting a non-deterministic solver propose thousands of possible answers in a short amount of time (over 1000 different results in less than 0.031 seconds) also gives the option to ask for another mixing plan without changing the input audio files. This value was obtained by running Clingo 5 on an Intel-based machine with a 3.07 GHz CPU and 15 GB of RAM.

One of the main features discovered is that, taking different stems with some files recorded nearly at 0 dBFS, the rendered file does not clip, even though, as mentioned, this work does not read audio spectral information.

The results show that the track stems worked better if they were (peak) normalized or recorded near 0 dBFS. Input tracks with a highest peak of -5 dB got lost inside the mix and, depending on their loudness, some stems could not be heard. Examples of barely audible or inaudible instruments are the kick, snare, and toms; in particular, the kick fights against the bass in the low end. Masking issues were generated because of the lack of EQ, but a certain clarity was achieved because of the panning rules. This scenario can be dealt with in two ways: the first is to use normalized tracks only; the second is to analyze data such as peak level, crest factor, loudness, and root mean square (RMS) level from each track. Adding this new knowledge would allow each instrument to be treated in a different way.

Changing the lead instrument between executions showed significant differences in the mixes. Not having the lead vocals in the (instrumental) mixes allowed the guitars and piano to play around different positions compared with the mixes that use vocals. Also, when vocals were present in the mix, the results showed clarity in instruments panned closer to fully right or left.

No inconsistencies were found during any of the executions. This knowledge base can produce mixes of up to 16 instruments and is a starting point for adding more instruments. The generated mixes with their mixing plans can be found at http://soundcloud.com/flavioeverardo-research/sets.

6. DISCUSSION AND FURTHER WORK

This work introduced a knowledge-engineered approach using ASP for proposing mixes in an automatic way from high-level information. This work can be the upper layer on top of one of the existing works that rely on low-level features, in the conviction that this integration will lead to more real-life mixes. On the other hand, to extend this work into a formal system, it is necessary to integrate low-level features such as cross-adaptive methods [2, 29] and even subgrouping mixing rules [30].

That said, another imminent goal is the development of the third axis, frequency equalization, including filters like band-pass, low-pass, high-pass, or shelf to solve inter-channel masking, as well as the addition of a processing chain that includes a compressor and time-based effects like delay and reverb.

For further code testing, several configurations can be developed to fulfill different mixing requirements for specific music styles. An example of this is the openness to define different "room" types (different volume-level dB scales) and panning values depending on the music genre. So far this work uses rules for the most common instruments, and the addition of new instruments like digital synthesizers and other electronic-music-related instruments like pads, leads, and plucks (to mention some of them) is necessary to open this tool to other music styles and mixes.

The use of ASP leaves the option to change and modify rules and constraints to describe other parts of the mix. This also allows keeping the knowledge separated from the sound engine. This knowledge base can be used independently from Csound, allowing other options for the sound treatment. In addition, one further development is that different filter types, compressors, and other audio effects can be coded in Csound; this information can afterwards be added as rules so that different instruments can be treated with different audio processing.

Last but not least, benchmarking and assessment against other automated mixing systems and professional audio engineers, together with subject rating tests (like those shown in [11]), need to be done in order to evaluate the efficiency of the rules and constraints explained in this work.

7. REFERENCES

[1] E. P. Gonzalez and J. Reiss, "Automatic mixing: live downmixing stereo panner," in Proceedings of the 7th International Conference on Digital Audio Effects (DAFx'07), 2007, pp. 63-68.

[2] E. Perez-Gonzalez and J. Reiss, "Automatic equalization of multichannel audio using cross-adaptive methods," in Audio Engineering Society Convention 127. Audio Engineering Society, 2009.

[3] E. Perez-Gonzalez and J. Reiss, "Automatic gain and fader control for live mixing," in Applications of Signal Processing to Audio and Acoustics, 2009. WASPAA'09. IEEE Workshop on. IEEE, 2009, pp. 1-4.

[4] S. Mansbridge, S. Finn, and J. D. Reiss, "An autonomous system for multitrack stereo pan positioning," in Audio Engineering Society Convention 133. Audio Engineering Society, 2012.

[5] S. Mansbridge, S. Finn, and J. D. Reiss, "Implementation and evaluation of autonomous multi-track fader control," in Audio Engineering Society Convention 132. Audio Engineering Society, 2012.

[6] J. J. Scott and Y. E. Kim, "Instrument identification informed multi-track mixing," in ISMIR. Citeseer, 2013, pp. 305-310.

[7] B. De Man and J. D. Reiss, "A semantic approach to autonomous mixing," in ARP13, 2013.

[8] C. Baral, Knowledge Representation, Reasoning and Declarative Problem Solving. Cambridge University Press, 2003.

[9] B. Owsinski, The Mixing Engineer's Handbook. Nelson Education, 2013.

[10] D. Gibson, "The art of mixing: A visual guide to recording, engineering, and production," vol. 236, 1997.

[11] B. De Man and J. D. Reiss, "A knowledge-engineered autonomous mixing system," in Audio Engineering Society Convention 135. Audio Engineering Society, 2013.

[12] A. U. Case, Mix Smart. Focal Press, 2011.

[13] M. Senior, Mixing Secrets for the Small Studio. Taylor & Francis, 2011.

[14] E. Perez-Gonzalez and J. Reiss, "A real-time semiautonomous audio panning system for music mixing," EURASIP Journal on Advances in Signal Processing, vol. 2010, no. 1, p. 436895, 2010.

[15] G. Boenn, M. Brain, M. De Vos, and J. Ffitch, "Automatic music composition using answer set programming," Theory and Practice of Logic Programming, vol. 11, no. 2-3, pp. 397-427, 2011.

[16] E. Pérez, F. Omar, and F. A. Aguilera Ramírez, "Armin: Automatic trance music composition using answer set programming," Fundamenta Informaticae, vol. 113, no. 1, pp. 79-96, 2011.

[17] J. M. Toivanen et al., "Methods and models in linguistic and musical computational creativity," 2016.

[18] S. Opolka, P. Obermeier, and T. Schaub, "Automatic genre-dependent composition using answer set programming," in Proceedings of the 21st International Symposium on Electronic Art, 2015.

[19] M. Eppe, R. Confalonieri, E. Maclean, M. Kaliakatsos, E. Cambouropoulos, M. Schorlemmer, M. Codescu, and K.-U. Kühnberger, "Computational invention of cadences and chord progressions by conceptual chord-blending," 2015.

[20] F. O. E. Pérez, "A logical approach for melodic variations," in LA-NMR, 2011, pp. 141-150.

[21] R. Martín-Prieto, "Herramienta para armonización musical mediante Answer Set Programming" [Tool for musical harmonization using Answer Set Programming], http://www.dc.fi.udc.es/~cabalar/haspie.pdf, University of Corunna, Spain, Tech. Rep., Feb. 2016.

[22] R. Boulanger et al., The Csound Book, 2000.

[23] J. D. Reiss, "Intelligent systems for mixing multichannel audio," in Digital Signal Processing (DSP), 2011 17th International Conference on. IEEE, 2011, pp. 1-6.

[24] M. Gelfond and Y. Kahl, Knowledge Representation, Reasoning, and the Design of Intelligent Agents: The Answer-Set Programming Approach. Cambridge University Press, 2014.

[25] P. Simons, I. Niemelä, and T. Soininen, "Extending and implementing the stable model semantics," Artificial Intelligence, vol. 138, no. 1-2, pp. 181-234, 2002.

[26] M. Gebser, B. Kaufmann, A. Neumann, and T. Schaub, "clasp: A conflict-driven answer set solver," in International Conference on Logic Programming and Nonmonotonic Reasoning. Springer, 2007, pp. 260-265.

[27] V. Lifschitz, "What is answer set programming?," in AAAI, vol. 8, no. 2008, 2008, pp. 1594-1597.

[28] M. Gebser, R. Kaminski, A. König, and T. Schaub, "Advances in gringo series 3," 2011, pp. 345-351.

[29] J. Scott, "Automated multi-track mixing and analysis of instrument mixtures," in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 651-654.

[30] D. M. Ronan, H. Gunes, and J. D. Reiss, "Analysis of the subgrouping practices of professional mix engineers," in Audio Engineering Society Convention 142. Audio Engineering Society, 2017.
Johnty Wang
Input Devices and Music Interaction Laboratory/CIRMMT
McGill University
johnty.wang@mail.mcgill.ca

Robert Pritchard
School of Music
University of British Columbia
bob@mail.ubc.ca

Bryn Nixon
Ryerson United Church
Vancouver, Canada
brynnixon@icloud.com

Marcelo M. Wanderley
Input Devices and Music Interaction Laboratory/CIRMMT
McGill University
marcelo.wanderley@mcgill.ca
Figure 4. The Max/MSP interface patch showing a.) the panel of bpatchers for each stop, b.) basic usage instructions, and c.) miscellaneous controls and status output
a natural feature of any interactive system involving this instrument.

In terms of software development, our long-term goal is to provide an easy-to-use mapping system that allows artists to access the sonic capabilities of the instrument. Creating and exposing a standard protocol for access, along with support for mapping libraries such as libmapper, facilitates the use of the system by composers, musicians, and digital media artists.

Since many other organs are equipped with similar control systems, it would be possible to implement the same system in more than one location as well, and to provide further possibilities such as teleperformance. In these situations, the self-learning feature of the driver patch, extended with the ability to define the number and names of the stops, would be very useful to automatically capture the different stop configurations in other Solid State Organ consoles employing the same SysEx format.

7. CONCLUSION

In this paper we have presented the use of, and initial explorations in, alternative control of a MIDI-enabled pipe organ in a church. The ways in which this additional piece of technology has extended the existing musical service have been described, along with the technical developments and demonstrations of the system using additional gestural sensing technology, showing the potential for creating new performances and interactive installations. Finally, we presented our plans for the continued use of the system in service as well as in new compositions, together with technical developments to scale up the usability of the system. From a cultural perspective, the work documented in this paper presents a method of utilizing technology and music to foster new forms of dialogue between the religious communities where these instruments have traditionally been used and wider society.

Acknowledgments

The authors would like to thank Steve Miller of Solid State Organ Systems for providing the technical knowledge regarding the MIDI interface, and the music department of Ryerson United Church for providing access to the organ. The primary author is supported by an NSERC-FRQNT Industrial Innovation Scholarship with Infusion Systems, and the Centre for Interdisciplinary Research in Music Media and Technology.

8. REFERENCES

[3] C. Jacquemin, R. Ajaj, S. Le Beux, C. d'Alessandro, M. Noisternig, B. F. Katz, and B. Planes, "Organ augmented reality: Audio-graphical augmentation of a classical instrument," Int. J. Creat. Interfaces Comput. Graph., vol. 1, no. 2, pp. 51-66, Jul. 2010.

[4] C. Vik, "Carpe zythum kinect controlled pipe organ," 2012, [Online; accessed 20-February-2017]. [Online]. Available: https://chrisvik.wordpress.com/2012/04/13/carpe-zythum-kinect-controlled-pipe-organ-full-performance/

[5] M. Wright, "Open Sound Control: an enabling technology for musical networking," Organised Sound, vol. 10, no. 03, pp. 193-200, 2005.

[6] A. Pon, J. Wang, S. Carpendale, and L. Radford, "Womba: A musical instrument for an unborn child," in Proceedings of New Interfaces for Musical Expression, 2015.

[7] J. Malloch, S. Sinclair, and M. M. Wanderley, "Distributed tools for interactive design of heterogeneous signal networks," Multimedia Tools and Applications, vol. 74, no. 15, pp. 5683-5707, 2014.

[8] K. Hamel, "Noteability, a comprehensive music notation system," in Proceedings of the International Computer Music Conference, 1998, pp. 506-509.

[9] I. Hattwick, I. Franco, and M. M. Wanderley, "The Vibropixels: A Scalable Wireless Tactile Display System," in Human Computer International Conference, Vancouver, 2017.
rhythmic quantization is defined as the number of time frames in the piano-roll per quarter note. Hence, we define the sequences of piano and orchestra states P(t) and O(t) as the sequences of the column vectors formed by the piano-roll representations, where t is a discrete time index.

[Figure 2: an orchestral score excerpt (Horns 1.2, Horns 3.4, Trumpet 1 (C), Trumpets 2.3 (C), Tuba) shown above its piano-roll representation, with pitch on the vertical axis and time on the horizontal axis]

Figure 2. From the score of an orchestral piece, a convenient piano-roll representation is extracted. A piano-roll pr is a matrix whose rows represent pitches and whose columns represent a time frame depending on the time quantization. A pitch p at time t played with an intensity i is represented by pr(p, t) = i, 0 being a note off. This definition is extended to an orchestra by simply concatenating the piano-rolls of every instrument along the pitch dimension.

3.3 Model definition

The specific implementation of the models for the projective orchestration task is detailed in this subsection.

3.3.1 RBM

In the case of projective orchestration, the visible units are defined as the concatenation of the past and present orchestral states and the present piano state

v = [P(t), O(t - N), ..., O(t)]    (13)

Generating values from a trained RBM consists in sampling from the distribution p(v) it defines. Since this is an intractable quantity, a common way of generating from an RBM is to perform K steps of Gibbs sampling and consider that the visible units obtained at the end of this process represent an adequate approximation of the true distribution p(v). The quality of this approximation increases with the number K of Gibbs sampling steps.

In the case of projective orchestration, the vectors P(t) and [O(t - N), ..., O(t - 1)] are known, and only the vector O(t) has to be generated. Thus, it is possible to infer the approximation Ô(t) by clamping O(t - N), ..., O(t - 1) and P(t) to their known values and using the CD algorithm. This technique is known as inpainting [26]. However, extensive efforts are made during the training phase to infer values that are finally clamped during the generation phase. Conditional models offer a way to separate the context (past orchestra and present piano) from the data we actually want to generate (present orchestra).

3.3.2 cRBM and FGcRBM

In our context, we define the cRBM units as

v = O(t)
x = [P(t), O(t - N), ..., O(t - 1)]

and the FGcRBM units as

v = O(t)
x = [O(t - N), ..., O(t - 1)]
z = P(t)

Generating the orchestral state O(t) can be done by sampling visible units from those two models. This is done by performing K Gibbs sampling steps, while clamping the context units (piano present and orchestral past) in the case of the cRBM (as displayed in Figure 3), and both the context (orchestral past) and latent units (piano present) in the case of the FGcRBM, to their known values.

3.3.3 Dynamics

As aforementioned, we use binary stochastic units, which solely indicate whether a note is on or off. Therefore, the velocity information is discarded. Although we believe that velocity information is of paramount importance in orchestral works (as it impacts the number and type of instruments played), this first investigation provides valuable insight to determine the architectures and mechanisms that are most adapted to this task.

3.3.4 Initializing the orchestral past

Gibbs sampling requires the values of the visible units to be initialized. During the training process, they can be set to the known visible units to reconstruct, which speeds up the convergence of the Gibbs chain. For the generation step, the first option we considered was to initialize the visible units with the previous orchestral frame O(t - 1). However, because repeated notes are very common in the training corpus, the negative sample obtained at the end of the Gibbs chain was often the initial state itself, Ô(t) = O(t - 1). Thus, we initialize the visible units by sampling a uniform distribution between 0 and 1.

4. EVALUATION FRAMEWORK

In order to assess the performance of the different models presented in the previous section, we introduce a quantitative evaluation framework for the projective orchestration task. This first requires defining an accuracy measure to compare a predicted state Ô(t) with the ground-truth state O(t) written in the orchestral score. As we experimented with accuracy measures commonly used in the music generation field [15, 17, 25], we discovered that these are heavily biased towards models that simply repeat their
SMC2017-437
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
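The piano-roll representation defined under Figure 2 can be sketched as follows. This is an illustrative numpy sketch; the event tuples, helper name, and matrix sizes are assumptions for the example, not the paper's data or code:

```python
import numpy as np

# Hypothetical note events: (pitch, onset frame, offset frame, intensity).
def piano_roll(events, n_pitches=128, n_frames=64):
    """Build pr with pr(p, t) = i while the note sounds, 0 meaning note off."""
    pr = np.zeros((n_pitches, n_frames), dtype=np.int16)
    for pitch, onset, offset, intensity in events:
        pr[pitch, onset:offset] = intensity
    return pr

# One piano-roll per instrument, concatenated along the pitch dimension
# to obtain the orchestral representation.
trombone = piano_roll([(58, 0, 8, 80)])
tuba = piano_roll([(46, 0, 16, 70)])
orchestra = np.concatenate([trombone, tuba], axis=0)  # shape (256, 64)
```

Restricting each instrument's rows to the pitches it actually plays (as the authors do to go from 3584 to 1220 dimensions) would shrink this concatenated matrix accordingly.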
[Figure 3 diagram: hidden units h(t) connected to the visible units v(t) ("orchestra present") through weights W_ij, and to the context units x(t) ("piano present" and "orchestra past") through weights B_kj; the context units are also connected to the visible units through weights A_ki.]

Figure 3. Conditional Restricted Boltzmann Machine in the context of projective orchestration. The visible units v represent the present orchestral frame O(t). The context units x represent the concatenation of the past orchestral and present piano frames [P(t), O(t − N), ..., O(t − 1)]. After training the model, the generation phase consists in clamping the context units (blue hatches) to their known values and sampling the visible units (filled in green) by running a Gibbs sampling chain.
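The clamped-Gibbs generation described in this caption can be sketched as follows for the cRBM case. This is an illustrative numpy sketch with untrained placeholder parameters and made-up layer sizes, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 1220 visible (orchestra present), 500 hidden, 5000 context.
n_v, n_h, n_x = 1220, 500, 5000

# Untrained placeholder parameters (a real system learns these, e.g. with CD).
W = rng.normal(0.0, 0.01, (n_h, n_v))   # visible-hidden weights
A = rng.normal(0.0, 0.01, (n_v, n_x))   # context-to-visible weights
B = rng.normal(0.0, 0.01, (n_h, n_x))   # context-to-hidden weights
b_v, b_h = np.zeros(n_v), np.zeros(n_h)

def sample_orchestra(x, k=20):
    """Run k Gibbs steps with the context x clamped; the visible units are
    initialized uniformly in [0, 1], as in Section 3.3.4."""
    dyn_bv = b_v + A @ x                  # context-dependent visible bias
    dyn_bh = b_h + B @ x                  # context-dependent hidden bias
    v = rng.uniform(size=n_v)
    for _ in range(k):
        h = (rng.uniform(size=n_h) < sigmoid(W @ v + dyn_bh)).astype(float)
        v = (rng.uniform(size=n_v) < sigmoid(W.T @ h + dyn_bv)).astype(float)
    return v

# Clamped context [P(t), O(t - N), ..., O(t - 1)] as a binary vector.
x = rng.integers(0, 2, n_x).astype(float)
o_t = sample_orchestra(x)   # one sampled binary orchestral frame
```

With trained weights, increasing k trades computation for a better approximation of the conditional distribution, matching the paper's observation that more than 20 Gibbs steps work best.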
available 2, along with detailed statistics and the source code to reproduce our results.

Following standard notations, all instruments of the same section are grouped under the same part in the score. For instance, the violins, which might be played by several instrumentalists, are written as a single part. Besides, in order to reduce the number of units, we systematically remove, for each instrument, any pitch which is never played in the database. Hence, the dimension of the orchestral vector is reduced from 3584 to 1220.

The dataset is split between train, validation and test sets of files that represent respectively 80%, 10% and 10% of the full set. For reproducibility, the different sets we used are provided as text files on the companion website.

2 https://qsdfo.github.io/LOP/database

5.2 Quantitative evaluation

We evaluate five different models on the projective orchestration task. The first model is a random generation of the orchestral frames from a Bernoulli distribution of parameter 0.5, in order to evaluate the complexity of the task. The second model predicts an orchestral frame at time t by repeating the frame at time t − 1, in order to evaluate the frame-level bias. These two naive models constitute a baseline against which we compare the RBM, cRBM and FGcRBM models. The complete implementation details and hyper-parameters can be found on the companion website.

Model     Frame-level accuracy (Q = 4)   Frame-level accuracy (Q = 8)   Event-level accuracy
Random    0.73                           0.73                           0.72
Repeat    61.79                          76.41                          50.70
RBM       7.67                           4.56                           1.39
cRBM      5.12                           34.25                          27.67
FGcRBM    33.86                          43.52                          25.80

Table 1. Results of the different models for the projective orchestration task, based on frame-level accuracies with a quantization of 4 and 8, and on event-level accuracies.

The results are summarized in Table 1. In the case of frame-level accuracies, the FGcRBM provides the highest accuracy amongst the probabilistic models. However, the repeat model remains superior to all models in that case. If we increase the quantization, we see that the performance of the repeat model increases considerably. This is also the case for the cRBM and FGcRBM models, as the predictive objective becomes simpler.

The RBM model obtains poor performances for all temporal granularities and seems unable to grasp the underlying structure. Hence, it appears that the conditioning is of paramount importance to this task. This could be explained by the fact that a dynamic temporal context is one of the foremost properties in musical orchestration.

The FGcRBM has slightly worse performances than the cRBM in the event-level framework. The introduction of three-way interactions might not be useful, or might be too intricate, in the case of projective orchestration. Furthermore, disentangling the interactions of the piano and the orchestra from the influence of the temporal context might not be clearly performed by the factored model because of the limited size of the database. Nevertheless, it seems that conditional probabilistic models are valid candidates to tackle the projective orchestration task.

Even with the event-level accuracy, the repeat model still obtains a very strong score, which further confirms the need to devise more subtle accuracy measures. This result directly stems from the properties of orchestral scores, where it is common that several instruments play a sustained background chord while a single solo instrument performs a melody. Hence, even in the event-level framework, most of the notes will be repeated between two successive events. This can be observed in the database section of the companion website.

5.3 Qualitative analysis

In order to allow for a qualitative analysis of the results, we provide several generated orchestrations from different models on our companion website. We also report the different accuracy scores obtained by varying the hyper-parameters. In this analysis, it appeared that the number of hidden units and the number of Gibbs sampling steps are the two crucial quantities. For both the cRBM and FGcRBM models, the best results are obtained when the number of hidden units is over 3000 and the number of Gibbs sampling steps is larger than 20.

We observed that the generated orchestrations from the FGcRBM might sound better than the ones generated by the cRBM. Indeed, it seems that the cRBM model is not able to constrain the possible set of notes to the piano input. To confirm this hypothesis, we computed the accuracy scores of the models while setting the whole piano input to zero. By doing this, we can evaluate how much a model relies on this information. The score for those corrupted inputs dropped considerably for the FGcRBM model, from 25.80% to 1.15%. This indicates that the predictions of this model rely heavily on the piano input. Instead, the corrupted score of the cRBM drops from 27.67% to 17.80%. This shows that the piano input is not considered crucial information by this model, which rather performs a purely predictive task based on the orchestral information.

5.4 Discussion

It is important to state that we do not consider this prediction task a reliable measure of the creative performance of the different models. Indeed, predicting and creating are two fairly different things. Hence, the predictive evaluation framework we have built does not assess the generative ability of a model, but is rather used as a selection criterion among different models. Furthermore, it provides an interesting auxiliary task towards orchestral generation.

6. LIVE ORCHESTRAL PIANO (LOP)

We introduce here the Live Orchestral Piano (LOP), which allows one to compose music with a full classical orchestra in
real-time by simply playing on a MIDI piano. This framework relies on the knowledge learned by the models introduced in the previous sections in order to perform the projection from a piano melody to the orchestra.

6.1 Workflow

The software is implemented on a client/server paradigm. This choice allows separating the orchestral projection task from the interface and sound rendering engine. That way, multiple interfaces can easily be implemented. Also, it should be noted that separating the computing and rendering aspects on different computers could allow the use of high-quality and CPU-intensive orchestral rendering plugins, while ensuring the real-time constraint on the overall system (preventing degradation on the computing part). The complete workflow is presented in Figure 5.

The user plays a melody (single notes or chord sequences) on a MIDI keyboard, which is retrieved inside the interface. The interface has been developed in Max/MSP, to facilitate both the score and audio rendering aspects. The interface transmits this symbolic information (as a variable-length vector of active notes) via OSC to the MATLAB server. The interface also performs a transcription of the piano score to the screen, by relying on the Bach library environment. The server uses this vector of events to produce an 88-dimensional vector of binary note activations. This vector is then processed by the orchestration models presented in the previous sections in order to obtain a projection of the specific symbolic piano melody onto the full orchestra. The resulting orchestration is then sent back to the client interface, which performs both the real-time audio rendering and the score transcription. The interface also provides a way to easily switch between different models, while controlling other hyper-parameters of the sampling.

7. CONCLUSION AND FUTURE WORKS

We have introduced a system for real-time orchestration of a MIDI piano input. First, we formalized the projective orchestration task and proposed an evaluation framework that fits the constraints of our problem. We showed that the commonly used frame-level accuracy measure is highly biased towards repeating models, and proposed an event-level measure instead. Finally, we assessed the performance of different probabilistic models and discussed the better performance of the FGcRBM model.

The conditional models have proven to be effective in the orchestral inference evaluation framework we defined. The observation of the orchestrations generated by the models tends to confirm these performance scores, and supports the overall soundness of the framework. However, we believe that the performance can be further improved by mixing conditional and recurrent models.

The most crucial improvement to our system would be to include the dynamics of the notes. Indeed, many orchestral effects are directly correlated with the intensity variations in the original piano scores. Hence, by using binary units, essential information is discarded. Furthermore, the sparse representation of the data suggests that a more compact distributed representation might be found. Lowering the dimensionality of the data would greatly improve the efficiency of the learning procedure. For instance, methods close to the word-embedding techniques used in natural language processing might be useful [30].

The general objective of building a generative model for time series is one of the most prominent current topics in the machine learning field. Orchestral inference sets a slightly more complex framework, where a generated multivariate time series is conditioned by an observed time series (the piano score). Finally, being able to grasp the long-term dependencies structuring orchestral pieces appears as a promising task.

8. ACKNOWLEDGEMENTS

This work has been supported by the NVIDIA GPU Grant program. The authors would like to thank Denys Bouliane for kindly providing a part of the orchestral database.

9. REFERENCES

[1] C. Koechlin, Traité de l'orchestration. Editions Max Eschig, 1941.

[2] N. Rimsky-Korsakov, Principles of Orchestration. Russischer Musikverlag, 1873.

[3] S. McAdams, "Timbre as a structuring force in music," in Proceedings of Meetings on Acoustics, vol. 19, no. 1. Acoustical Society of America, 2013, p. 035050.

[4] D. Tardieu and S. McAdams, "Perception of dyads of impulsive and sustained instrument sounds," Music Perception, vol. 30, no. 2, pp. 117–128, 2012.

[5] W. Piston, Orchestration. New York: Norton, 1955.

[6] P. Esling, "Multiobjective time series matching and classification," Ph.D. dissertation, IRCAM, 2012.

[7] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.

[8] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14539

[9] E. J. Humphrey, J. P. Bello, and Y. LeCun, "Moving beyond feature design: Deep architectures and automatic feature learning in music informatics," in ISMIR, 2012, pp. 403–408.

[10] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Unsupervised learning of hierarchical representations with convolutional deep belief networks," Communications of the ACM, vol. 54, no. 10, pp. 95–103, 2011.

[11] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Audio chord recognition with recurrent neural networks," in ISMIR, 2013, pp. 335–340.
[Figure 5 diagram: MIDI input → Max/MSP client (score and audio rendering) ↔ OSC ↔ server.]

Figure 5. Live Orchestral Piano (LOP) implementation workflow. The user plays on a MIDI piano, which is transcribed and sent via OSC from the Max/MSP client. The MATLAB server uses this vector of notes and processes it with the aforementioned models in order to obtain the corresponding orchestration. This information is then sent back to Max/MSP, which performs the real-time audio rendering.
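A minimal sketch of the OSC hop in this workflow, assuming a hypothetical address /piano/notes and server port 9001 (the paper does not specify these). The message is hand-encoded following OSC 1.0 framing for illustration, rather than reproducing the authors' Max/MSP or MATLAB code:

```python
import socket
import struct

def osc_pad(b: bytes) -> bytes:
    """Pad with NULs to a 4-byte boundary (at least one NUL), per OSC 1.0."""
    return b + b"\x00" * (4 - len(b) % 4)

def osc_message(address: str, ints) -> bytes:
    """Encode an OSC message whose arguments are all big-endian int32."""
    msg = osc_pad(address.encode("ascii"))                 # address pattern
    msg += osc_pad(("," + "i" * len(ints)).encode("ascii"))  # type tag string
    for v in ints:
        msg += struct.pack(">i", v)                        # int32 arguments
    return msg

# A variable-length vector of active MIDI notes (here a C major triad).
packet = osc_message("/piano/notes", [60, 64, 67])

# Fire-and-forget UDP send to the (hypothetical) server port.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(packet, ("127.0.0.1", 9001))
```

In practice an OSC library on each side would handle this framing; the sketch only shows that the client/server split reduces to sending small symbolic packets, which is what keeps the rendering real-time.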
[12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[13] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016. [Online]. Available: http://arxiv.org/abs/1609.03499

[14] D. Eck and J. Schmidhuber, "Finding temporal structure in music: Blues improvisation with LSTM recurrent networks," in Proceedings of the 2002 12th IEEE Workshop on Neural Networks for Signal Processing. IEEE, 2002, pp. 747–756.

[15] V. Lavrenko and J. Pickens, "Polyphonic music modeling with random fields," in Proceedings of the eleventh ACM international conference on Multimedia. ACM, 2003, pp. 120–129.

[16] S. Bosley, P. Swire, R. M. Keller et al., "Learning to create jazz melodies using deep belief nets," 2010.

[17] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription," arXiv preprint arXiv:1206.6392, 2012.

[18] D. Johnson, "hexahedria," http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/, 08 2015.

[19] F. Sun, "DeepHear - composing and harmonizing music with neural networks," http://web.mit.edu/felixsun/www/neural-music.html.

[20] G. Carpentier, D. Tardieu, J. Harvey, G. Assayag, and E. Saint-James, "Predicting timbre features of instrument sound combinations: Application to automatic orchestration," Journal of New Music Research, vol. 39, no. 1, pp. 47–61, 2010.

[21] P. Esling, G. Carpentier, and C. Agon, "Dynamic musical orchestration using genetic algorithms and a spectro-temporal description of musical instruments," Applications of Evolutionary Computation, pp. 371–380, 2010.

[22] F. Pachet, "A joyful ode to automatic orchestration," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 8, no. 2, p. 18, 2016.

[23] E. Handelman, A. Sigler, and D. Donna, "Automatic orchestration for automatic composition," in 1st International Workshop on Musical Metacreation (MUME 2012), 2012, pp. 43–48.

[24] G. W. Taylor, G. E. Hinton, and S. T. Roweis, "Modeling human motion using binary latent variables," in Advances in Neural Information Processing Systems, 2006, pp. 1345–1352.

[25] K. Yao, T. Cohn, K. Vylomova, K. Duh, and C. Dyer, "Depth-gated LSTM," CoRR, vol. abs/1508.03790, 2015. [Online]. Available: http://arxiv.org/abs/1508.03790

[26] A. Fischer and C. Igel, "An introduction to restricted Boltzmann machines," in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 17th Iberoamerican Congress, CIARP 2012, Buenos Aires, Argentina, September 3-6, 2012. Proceedings. Berlin, Heidelberg: Springer, 2012, pp. 14–36. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-33275-3_2

[27] G. Hinton, "A practical guide to training restricted Boltzmann machines," Momentum, vol. 9, no. 1, p. 926, 2010.

[28] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
Yuta Ojima 1, Tomoyasu Nakano 2, Satoru Fukayama 2, Jun Kato 2, Masataka Goto 2, Katsutoshi Itoyama 1 and Kazuyoshi Yoshii 1
1 Graduate School of Informatics, Kyoto University, Japan
2 National Institute of Advanced Industrial Science and Technology (AIST)
{ojima, itoyama, yoshii}@sap.ist.i.kyoto-u.ac.jp
{t.nakano, s.fukayama, jun.kato, m.goto}@aist.go.jp
ABSTRACT
[Figure 2 illustration: the original score and piano-roll for the lyric "I was born to love you", with the modified piano-roll and keyboard operation shown for each of the three modes (e.g., in the Easy Arrangement mode the pitch is decreased by three semitones when the pushed key is three semitones below the reference value).]

Figure 2. The description of the three assignment modes. Each balloon represents the lyrics assigned to a note. In the Free Arrangement mode, the notes in the original vocal part are assigned one by one from the beginning, one for each key operation. In the Time Fixed mode, only the pitches can be changed, and the onset/offset timings are not changed. In the Easy Arrangement mode, a user modifies the pitch by specifying its interval relative to the original melody.
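The pitch mappings of the three modes can be sketched as follows; new_pitch is a hypothetical helper using MIDI note numbers (C4 = 60), not part of the described system:

```python
C4 = 60  # MIDI note number of the reference key

def new_pitch(mode: str, original_pitch: int, pushed_key: int) -> int:
    """Map a pushed key to the target note's new pitch for each mode."""
    if mode in ("free", "time_fixed"):
        # The target note takes the pitch of the pushed key directly.
        return pushed_key
    if mode == "easy":
        # The pitch is shifted by the interval between the pushed key and C4
        # (e.g., pushing E4 = 64 raises the note by four semitones).
        return original_pitch + (pushed_key - C4)
    raise ValueError(f"unknown mode: {mode}")

# Pushing A3 (57) in the Easy Arrangement mode lowers the note by three
# semitones, as in the figure.
shifted = new_pitch("easy", 67, 57)  # G4 becomes E4
```

The modes differ only in how the target note is chosen and how the pushed key is interpreted; the onset/offset handling described in the text is independent of this mapping.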
sical notes from the F0 trajectory to visualize pitch information by using a Hidden Markov Model (HMM) [5]. For the second problem, we prepare three assignment modes to make it easy to reflect users' various intentions for the arrangement (Fig. 2). Since these assignment modes can be switched during the performance, a user can perform as they like. For the third problem, we use the pitch synchronous overlap and add (PSOLA) algorithm, because the algorithm is simple enough to work in real time. Since the arrangement is performed in real time using a MIDI keyboard, this system can be used for a live performance, as when a DJ plays a song.

2. RELATED WORK

In this section, we consider work related to different aspects of our vocal arrangement system. We first review various music applications and then focus on arrangement systems, because our system aims to arrange existing music by using signal processing methods. At the end of this section, we review the signal processing methods that are most relevant to our system.

2.1 Music Applications

Our system supports active music listening by visualizing the vocal part and by replacing the melody with one a user likes. Regarding active music listening, several studies have been made. Goto et al. [6] proposed a Web service called Songle that enriches the music listening experience by visualizing the music structure, beat structure, melody line, and chords. Mauch et al. [7] proposed an interface called Song Prompter that displays the structure of a song, chords, and lyrics without preparing musical scores or manually aligning the lyrics and chords with the audio. Nakra et al. [8] designed an interactive conducting system that enables a user to change the tempo and dynamics of an orchestral recording with a Wii Remote.

On the other hand, another aspect of our system is that it is a music generation system. Many music generation support systems have been proposed. Simon et al. [9] proposed one that generates a chord progression to accompany a given vocal melody. Using this system, users who are not familiar with musicography can easily try composing a musical piece. McVicar et al. [10] proposed a system that composes a guitar tablature when a chord progression is given as an input. Since this system needs tablatures and chords given in advance for training, the generated guitar tablature may vary according to the data used in training. This contributes to the generation of guitar tablature that reflects the individuality underlying the training data. Methods that generate a harmony part have also been proposed. Yi et al. [11] proposed a method for generating four-part harmony automatically when a melody is given. Dannenberg et al. [12] focused on music performance generated in cooperation with computers. They aim to enable musicians to incorporate computers into their performances by creating computer performers that play music along with them.

Our system can also be regarded as an automatic accompaniment system for a vocal part generated by a user. Several such accompaniment systems have been proposed. Originally, Dannenberg [13] proposed an algorithm for finding the best match between an input music score and a solo performance while allowing some mistakes. Using that algorithm, he designed a system that produced an accompaniment in real time. Cont [14] also proposed an automatic accompaniment system. He used hidden hybrid Markov/semi-Markov models to estimate the current score position and the tempo dynamics from input audio and a symbolic music score.

2.2 Arrangement Systems for Existing Songs

Many systems that can arrange existing songs have been proposed [1, 2, 15]. Yasuraoka et al. [2] proposed a music manipulation system that can change the timbre and phrase of a certain musical instrument in polyphonic music when the corresponding musical score is given. They define a musical-tone model to extract the target musical instrument. In this system, a user is asked to show his or her desired arrangement in advance in the form of a music score. Tsuzuki et al. [16] proposed a system that extracts multiple vocal parts from existing songs and uses them to make a mashup. That system extracts multiple vocal parts of different singers singing the same song. The extracted voices are then overlapped on one shared accompaniment.

Some systems that assist the manipulation of an original instrumental part in real time have also been proposed. Yoshii et al. [1] proposed an arrangement system for a
timbre of a drum part and to edit drum patterns in real time by using a drum sound recognition method and beat-tracking. Yamamoto et al. [15] propose a user interface for synthesizing a singing voice for a live improvisational performance. This interface enables real-time singing synthesis when the lyrics information is given in advance.

Methods for vocal-and-accompaniment source separation (VASS) have been studied intensively [4, 17, 18]. Rafii et al. [17] proposed a repeated pattern extraction technique (REPET). They focus on repetitive structures in music to extract a vocal part, since accompaniments are repeated while a vocal part is not. Huang et al. [18] realized singing voice separation by using robust principal component analysis (RPCA). They model an accompaniment part with repetitive structures as a low-rank component and a vocal part as a sparse component. Ikemiya et al. [4] combined RPCA and pitch estimation of singing voices to realize VASS. They focus on the mutual dependency of singing voice separation and F0 estimation, and they improve singing voice separation in a unified framework.

2.3.2 Musical Note Estimation from F0 Trajectories

There have been several attempts to estimate musical notes when an F0 trajectory is given. In note estimation, discrete values with the interval of semitones are estimated from a continuous F0 trajectory. In Songle [6], these discrete values are estimated by using a majority-vote method for every time unit (e.g., half beat). Laaksonen et al. [19] proposed a method of creating a symbolic melody when a chord progression and audio data are given. Ryynänen et al. [20] considered several states within one musical note (e.g., vibrato and overshooting) and represented the inter-state transitions by using a hidden Markov model (HMM). Nishikimi et al. [5] represented the generative process of an F0 trajectory from underlying musical notes by using a Bayesian HMM.

2.3.3 Pitch Modification Algorithms

Pitch modification is a major requirement in music arrangement and has been examined in many studies [21]. In the time domain, pitch modification that preserves the original duration is achieved by time stretching and compression/expansion of the waveform. Many algorithms for time stretching have been proposed, and the simplest is that of Roucos et al. [22]. In their method, the waveform is divided into overlapping segments of a fixed length, which are then shifted and added to realize time stretching. Hamon et al. [23] expanded this algorithm to make use of pitch information. In their method, the length of the divided segments is determined according to the original pitch. Some of these segments are repeated or discarded to realize time stretching or time compression.

[Figure 3 screenshot legend: notes; the note that will be played next; notes played at the original pitch; notes played after pitch modification; mode indicator; original onset timing; elapsed time; pushed key.]

Figure 3. A screenshot of the proposed system. Each note comes down from above and is colored according to its status. The current playing mode is indicated in the upper right corner. The pushed key is highlighted in green.

3. USER INTERFACE

This section gives detailed descriptions of the proposed system and its user interface (Fig. 3). Section 3.1 explains the user experience which this system provides. Section 3.2 describes the user interface in detail.

3.1 Overview of the Proposed System

We propose a vocal arrangement system that enables users to manipulate a vocal part as if they were playing an instrument by using a MIDI keyboard. We focus on the following functions:

- Replacing the melody of the original vocal part with a melody the user likes
- Adding a harmonic part
- Adding a troll part (i.e., an auxiliary part that follows a main part)

On the occasion of these arrangements, users change the features of the vocal part (i.e., the pitches, onset times, and durations) simultaneously. To treat these manipulations intuitively, we use a MIDI keyboard as the user interface. The target users are therefore persons who can play the piano.

In this system, the vocal part is divided at the points where the pitch changes and is then arranged. We henceforth refer to each divided segment as a note. When a key of the MIDI keyboard is pushed, the system first determines the note that is the target of the key operation. We henceforth refer to this note as a target note. Then the pitch and the duration of the target note are modified according to the user's key operation. We prepare three arrangement modes to determine the target note and to specify its pitch: Free Arrangement mode, Time Fixed mode and Easy Arrangement mode (Fig. 2).

Free Arrangement mode: The note which has the earliest onset time among the notes that have not yet been played by the user is estimated as the target note. The pitch of the target
note is shifted to the pitch of a key played by the user. While a key operation is needed for every musical note, a user can change the onset times and the pitches no matter how far they are from the original ones.

Time Fixed mode: The note being played in the original vocal part at the moment a user pushes the key is estimated as the target note. While the onset times and the durations are unchangeable, a user can change the pitch at the middle point of a note in this mode. The pitch of the target note is shifted to the pitch of a key played by the user, as in the Free Arrangement mode. Moreover, notes are automatically assigned to the music one by one even when a user keeps a key pushed across multiple notes. A user therefore does not have to push a key again to play multiple continuous notes at the same pitch.

Easy Arrangement mode: The target note is determined in the same way as in the Time Fixed mode. In this mode, how the system modifies the pitch of the target note is specified by the number of semitones between the key pushed by a user and C4 (e.g., if a user pushes E4, the pitch of the target note is increased by four semitones).

These three arrangement modes are prepared to cope with several different motivations for arrangement. When a user wants to manipulate the entire vocal part as they like, the Free Arrangement mode is suitable. On the other hand, when a user wants to manipulate a portion of the vocal part and leave the rest as it is, the Time Fixed mode or the Easy Arrangement mode is suitable, since the user does not have to make keyboard manipulations for notes that they want to leave unchanged. When a user wants to add a harmonic part, the Easy Arrangement mode may be helpful, because typical harmonic parts are played at a certain pitch relative to the original melody line (e.g., a third below), and the onset times and durations in harmonic parts are the same as those in the original melody. Since a user can switch these modes at the keyboard during the performance, thereby more easily accessing the functions they need to accomplish the arrangement, the system can handle the different kinds of arrangement in a unified framework.

Furthermore, since some users want to eliminate the original vocal part while others just want to add a harmonic or troll part, leaving the original vocal part unchanged, the interface lets a user toggle the original singing voice on and off. This operation is also controllable during the performance. A user can also pause or resume the song by using the MIDI keyboard. Since a user needs to keep his or her eyes on the screen and cannot spare the attention needed to operate any equipment other than the MIDI keyboard, all the operations during the performance are controlled using only the MIDI keyboard.

3.2 Design of the Screen

The requirements for the screen presented to the user are as follows:

- The pitches and durations of musical notes in the original vocal part are shown in a way that is intuitively understandable.
- The response to a user's operation is displayed immediately.
- Users can predict the effect of their operation before performing the operation.

The first requirement is imposed because users sometimes refer to the original melody during their arrangement. The second requirement is imposed to make the arrangement smooth and stress-free for users. The last requirement is imposed to let users get the modified performances they expect.

[Figure 4 diagram: input audio enters the preprocessing part, where (A) VASS yields the vocal signal and beat information and (B) note estimation yields the notes and their onset times; the user's keyboard operation drives (C) determination of the target note and pitch modification; the arranged notes go to the resynthesis part, whose outputs are the accompaniment and the screen.]

Figure 4. Overview of the proposed system. Inputs are indicated by blue letters and outputs are indicated by red letters.

We designed the screen to meet these requirements (Fig. 3). To meet the first requirement, we designed the screen so that the horizontal direction indicates the pitches of the original vocal part. Moreover, we colored each line white or gray in order to represent the white and black keys of a MIDI keyboard, so the layout resembles a MIDI keyboard. The line that indicates C4 (MIDI note number 60) is shown in red to make the octave of the keys explicit. Musical notes in the original vocal part are visualized as rectangles coming down from above. This design is similar to that of existing musical games, so users easily understand the presented screen.

To meet the second requirement, the color of the line corresponding to the key played by a user is changed to light green, and the colors of notes that have already been played by the user are changed to light blue or brown. Through these color changes, the interface lets users understand the effect of their key operation and the pushed key without looking at their hands.

To meet the last requirement, the current target note is shown in orange. This helps prevent unexpected manipulations.

Since the number of pitches that appear in the original vocal part varies according to the song, the number of lines shown on the screen changes according to the highest and lowest pitches in the song. The elapsed time is visualized at the top of the window.

4. SYSTEM IMPLEMENTATION

This section describes the signal processing techniques used in the system. As shown in Fig. 4, the system is
itively and immediately evident to the user. roughly divided into two parts: a preprocessing part and a
resynthesis part. In the preprocessing part, a singing voice
is extracted from the song. This is enabled by vocal-and- Original waveform
accompaniment source separation (VASS). Then the tim-
ings that the pitch switches are estimated and the singing
1. Division and
voice is divided at those timings. This is enabled by note 1. rearrangement
estimation from an F0 trajectory. In the resynthesis part, 1. (Time stretching) Duplicate
a vocal part is manipulated and resynthesized according to
the users operations. The pitch of the note is modified to 2. Add the
the one specified by the user, and then a modified singing 2. overlapping segments
voice is re-synthesized.
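The keyboard-to-pitch mapping just described (the shift equals the semitone interval from C4 to the pushed key, and resynthesis scales the note's F0 accordingly) can be sketched as follows. This is our own illustration, not the authors' code; the function names are invented.

```python
# Sketch: map a pushed MIDI key to the pitch shift applied to the target note.
# The semitone offset is measured from C4 (MIDI note 60), as in the system
# described above; resynthesis would scale the note's F0 by 2**(n/12).

C4 = 60  # MIDI note number of middle C

def semitone_shift(pushed_key: int) -> int:
    """Number of semitones the target note is shifted (E4 = 64 -> +4)."""
    return pushed_key - C4

def f0_scale_factor(pushed_key: int) -> float:
    """Multiplicative factor applied to the note's F0 during resynthesis."""
    return 2.0 ** (semitone_shift(pushed_key) / 12.0)

# Example: pushing E4 (MIDI 64) raises the note by four semitones,
# i.e. an F0 ratio of 2**(4/12).
```

In 12-tone equal temperament each semitone corresponds to a frequency ratio of 2^(1/12), which is why the scale factor is 2^(n/12).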
Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland
Figure 1. Layout A′S′.

Figure 2. Layout A′S.

Figure 3. Layout AS′ (the Wicki layout [3]).

(Each figure shows the note names on the button lattice and the orientation of the M2 axis, the octave axis, and the pitch axis.)
Figure 4. Layout AS.

of isomorphic layouts (adjacency and shear) impact on their learnability and playability for melodies. The four resulting layouts are illustrated in Figures 1-4. Each figure shows how pitches are positioned, and the orientation of three axes that we hypothesize will impact on the layout's usability. Each label indicates whether the layout has adjacent major seconds or not (A and A′, respectively) and whether it is sheared or not (S and S′, respectively). The three axes are the pitch axis, the octave axis, and the major second axis, as now defined (the implications of these three axes, and why they may be important, are detailed in 1.1.2).

- The pitch axis is any axis onto which the orthogonal (perpendicular) projections of all button centres are proportional to their pitch; for any given isomorphic layout, all such axes are parallel [15].
- The octave axis is here defined as any axis that passes through the closest button centres that are an octave apart.
- The major second axis (M2 axis, for short) is here defined as any axis that passes through the closest button centres that are a major second apart. 2

2 When considering tunings different to 12-TET (e.g., meantone or Pythagorean), alternative but more complex definitions for the octave and M2 axes become useful.

1.1.1 Adjacent (A) or Non-Adjacent (A′) Seconds

Scale steps (i.e., major and minor seconds) are, across cultures, the commonest intervals in melodies [16]. It makes sense for such musically privileged intervals also to be spatially privileged. An obvious way of spatially privileging intervals is to make their pitches adjacent: this makes transitioning between them physically easy, and makes them visually salient. However, when considering bass or harmony parts, scale steps may play a less important role. This suggests that differing layouts might be optimal for differing musical uses.

The focus of this experiment is on melody so, for any given layout, we tested one version where all major seconds are adjacent and an adapted version where they are nonadjacent (minor seconds were non-adjacent in both versions). Both types of layout have been used in new musical interfaces; for example, the Thummer (which used the Wicki layout (Fig. 3)) had adjacent major seconds, while the AXiS-49 (which uses a Tonnetz-like layout [17]) has non-adjacent seconds but adjacent thirds and fifths.

1.1.2 Sheared (S) or Unsheared (S′)

We conjecture that having any of the above-mentioned axes (pitch, octave, and M2) perfectly horizontal or perfectly vertical makes the layout more comprehensible: if the pitch axis is vertical or horizontal (rather than slanted), it allows the pitch of buttons to be more easily estimated by sight, thereby enhancing processing fluency. Similar advantages hold for the octave and M2 axes: scales typically repeat at the octave, while the major second is the commonest scale step in both the diatonic and pentatonic scales that form the backbone of most Western music.

However, changing the angle of one of these axes typically requires changing the angle of one or both of the
others, so their independent effects can be hard to disambiguate.

A way to gain partial independence of axis angles is to shear the layout parallel with one of the axes: the angle of the parallel-to-shear axis will not change, while the angles of the other two will. 3 As shown by comparing Figure 1 with Figure 2, or by comparing Figure 3 with Figure 4, we used a shear parallel with the octave axis to create two versions of the non-adjacent layout and two versions of the adjacent layout: each unsheared version (A′S′ or AS′) has a perfectly horizontal M2 axis but a slanted (non-vertical) pitch axis; each sheared version (A′S or AS) has a slanted (non-horizontal) M2 axis but a vertical pitch axis. In both cases the octave axis was vertical.

In this investigation, therefore, we remove any possible impact of the octave axis orientation; we cannot, however, quantitatively disambiguate between the effects of the pitch axis and the M2 axis, although qualitative information can provide some insight into their independent effects (see Sec. 3.1).

Sheared and unsheared layouts are found in new musical interfaces: the Thummer and AXiS-49 both have unsheared layouts; the crowdfunded Terpstra keyboard uses a sheared layout.

3 A shear is a spatial transformation in which points are shifted parallel to an axis by a distance proportional to their distance from that axis. For example, shearing a rectangle parallel to an axis running straight down its middle produces a parallelogram; the sides that are parallel to the shear axis remain parallel to it, while the other two sides rotate.

1.2 Motor Skill Acquisition

Learning a new musical instrument requires a number of gross and fine motor skills (in order to physically play a note), and sensory processing (of feedback from the body and of auditory features, e.g., melody, rhythm, timbre) [18]. Trained musicians adapting to a new instrument may be required to learn a new pitch mapping, which means they must adjust their previously learned motor-pitch associations [19]. In learning a motor skill there are three general stages [20]:

- a cognitive stage, encompassing the processing of information and detecting patterns. Here, various motor solutions are tried out, and the performer finds out which solutions are most effective.
- a fixation stage, when the general motor solution has been selected, and a period commences where the patterns of movement are perfected. This stage can last months, or even years.
- an autonomous stage, where the movement patterns do not require as much conscious attention on the part of the performer.

Relating this to how a performer learns to play a particular musical instrument, we define the first cognitive stage as learnability and the second fixation stage as playability. Essentially, learning the motor-pitch associations of a new instrument requires the performer to perceive and remember pitch patterns. Once these pitch patterns are learned, the performer becomes more focused on eliminating various sources of motor error. 4 To test how well the participants have learned the new layouts and perfected their motor pattern, we are interested in the transfer of learning from one task to another. For instance, a piano player will practice scales, not only so they will achieve a good performance of scales, but so they can play scale-like passages in other musical pieces well. In our study, we designed a training and testing paradigm for the different pitch layouts such that the test involved a previously un-practised, but familiar (in pitch) melody.

4 Because achieving motor autonomy is a lengthy process, one that can seldom be captured by experiments, our current study focuses on only the first two elements of motor learning.

1.3 Study Design

For this experiment, we are interested in examining how features of a pitch layout affect the ease with which musicians can adapt to it. Musically experienced participants played three different layouts successively. In each such layout, they received an equivalent training program and then performed a test melody four times. The independent variables were Adjacency ∈ {0, 1}, Shear ∈ {0, 1}, LayoutNo ∈ {0, 1, 2} (respectively the first, second, or third layout played), and PerfNo ∈ {0, 1, 2, 3} (respectively the first, second, third, or fourth performance of the melody in each layout). Participants were presented the layouts in different orders to permit analysis of ordering effects. In using musically experienced participants, we assume that they share a high level of musical information processing skill, thereby reducing its impact on the results. Participants' preferences were elicited in a semi-structured interview, and the accuracy of their performances was numerically assessed from their recorded MIDI data.

2. METHODS

2.1 Participants

24 participants were recruited (age mean = 26, range: 18-44) with at least 5 years of musical experience on any one instrument (excluding the voice).

2.2 Materials

2.2.1 Hardware and Software

The software sequencer Hex [14] was modified to function as a multitouch MIDI controller, and presented on an Asus touch screen notebook, as shown in Figure 5. Note names are not shown on the interface, but middle D is indicated with a subtly brighter button to serve as a global reference. The position of middle C was indicated to participants, this being the starting pitch of every scale, arpeggio, or melody they played.

In order to present training sequences effectively, both aurally and visually, Hex's virtual buttons were highlighted in time with a MIDI sequence. All training sequences were at 90 bpm and introduced by a two-bar metronome count.
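The geometry described in Section 1.1 and footnote 3 can be made concrete with a small sketch. This is our own illustration under stated assumptions, not the authors' code: it places Wicki-layout button centres on a hexagonal grid (one step right is a major second, one step up-right is a perfect fifth), applies a shear parallel to the vertical octave axis, and checks that the octave axis stays vertical while a suitable shear amount makes the pitch axis vertical.

```python
import math

# Sketch (ours, not the authors' code): Wicki button centres on a hex grid,
# and a shear parallel to the vertical octave axis.

H = math.sqrt(3) / 2          # vertical distance between hex rows
M2 = (1.0, 0.0)               # +2 semitone step (major second), horizontal
P5 = (0.5, H)                 # +7 semitone step (perfect fifth)

def centre(u, v, shear=0.0):
    """Centre of the button u major-seconds and v fifths from the origin,
    optionally sheared parallel to the vertical axis: (x, y) -> (x, y + s*x)."""
    x = u * M2[0] + v * P5[0]
    y = u * M2[1] + v * P5[1]
    return (x, y + shear * x)

def pitch(u, v):
    """Pitch in semitones relative to the origin button."""
    return 2 * u + 7 * v

# The octave vector (+12 semitones = -1 major second + 2 fifths) is vertical,
# and stays vertical under the shear because its x-component is zero.
octave = centre(-1, 2)
assert abs(octave[0]) < 1e-12

# A shear of s = H/3 makes the pitch axis vertical: every button's
# y-coordinate becomes proportional to its pitch, while the (previously
# horizontal) M2 axis becomes slanted.
s = H / 3
for u, v in [(1, 0), (0, 1), (3, -2), (-2, 4)]:
    x, y = centre(u, v, shear=s)
    assert abs(y - (H / 6) * pitch(u, v)) < 1e-9
```

The particular shear amount H/3 is our own worked value for this grid spacing; the paper does not state the shear it used numerically.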
2.2.2 Musical Tasks

Melodies for musical tasks were chosen to be single-line sequences to be performed solely with the right hand. The training melodies consisted of a set of C major scales and a set of arpeggios (again only using the notes of the C major scale) spanning two octaves, all starting and ending on middle C. The well-known nursery rhyme Frere Jacques was used as a test melody; it too was played in C major and began and ended on middle C.

2.3 Procedure

Participants played three out of the four layouts under consideration: all 24 participants played both AS′ and AS, with 12 participants each playing either A′S or A′S′. The layouts were presented in four different sequences, with each sequence played by 6 participants: AS′ then A′S′ then AS; or AS′ then A′S then AS; or AS then A′S′ then AS′; or AS then A′S then AS′. This means that the non-adjacent seconds layouts (A′S′ and A′S) were always presented second, and that participants who started with the unsheared adjacent layout (AS′) finished with the sheared adjacent layout (AS), and vice versa.

2.3.1 Training Paradigm

For each of their three layouts, participants were directed through a 15-minute training paradigm involving i) scales and ii) arpeggios. For each stage, this involved:

1. watching the sequence once as demonstrated by audiovisual highlighting
2. playing along with the audiovisual highlighted sequence three times
3. reproducing the sequence in the absence of audiovisual highlighting, for four consecutive performances.

All demonstration sequences and participant performances were played in time with a 90 bpm metronome, and recorded as MIDI files.

2.3.2 Testing

A final production task asked participants to play the well-known nursery rhyme Frere Jacques. Participants first heard an audio recording of the nursery rhyme to confirm their

In a semi-structured interview taking place after the training paradigm and test performances concluded, participants were asked to choose their preferred layout from the three presented. Participants were asked to detail what they liked or disliked specifically about their preferred layout (and any of the others). When required, the experimenter would prompt participants to consider how it felt playing the scales, arpeggios, or melodies from the training section.

2.4 Data Analysis

2.4.1 Qualitative User-Evaluation Statements

Preferences for any specific layout were noted only for participants who clearly indicated a distinct preference (23 out of 24). 102 statements were made by the 24 participants regarding their likes and dislikes of the layouts they were exposed to. Eight statements were categorized as preferences for a specific layout feature without any clarification. Two statements were deleted as they neither identified a preference nor discussed any particular layout feature. The remaining 92 statements were coded independently by the two authors and categorized as issues of Learnability or Playability. Mention of visual aspects of a layout, or any reference to a collection of buttons in rows or patterns that may represent cognitive processing on the part of the performer, was classified as Learnability (see Section 1.2). Mention of any physical aspect of playing that may represent the perfection of motor skills was classified as Playability. Each category was further classified into whether the statements made were positive or negative.

2.4.2 Quantitative MIDI Performance Data

Each MIDI performance was analyzed for accuracy of pitches by comparing against an ideal performance of the melody at 90 bpm. To account for participants who may not have been able to play the full melody, a score was given for the number of correct notes played. This score was adjusted by giving negative points for any wrong or extra notes. The maximum score for a performance in this case was 32.
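The scoring rule of Section 2.4.2 (one point per correct note, penalties for wrong or extra notes, maximum 32) can be sketched as below. This is our own simplified illustration: the exact penalty weighting and the note-alignment method the authors used are not stated, so the one-point penalty and the position-by-position comparison are assumptions.

```python
# Sketch of the note-accuracy score from Sec. 2.4.2: +1 for each correct
# note, -1 for each wrong or extra note (the exact penalty the authors used
# is not stated; one point, compared position by position, is our assumption).

def note_score(ideal: list[int], played: list[int]) -> int:
    """Compare MIDI pitch sequences position by position."""
    correct = sum(1 for i, p in zip(ideal, played) if i == p)
    wrong = sum(1 for i, p in zip(ideal, played) if i != p)
    extra = max(len(played) - len(ideal), 0)
    return correct - wrong - extra

ideal = [60, 62, 64, 60]  # first bar of Frere Jacques in C major (C D E C)
assert note_score(ideal, [60, 62, 64, 60]) == 4          # perfect
assert note_score(ideal, [60, 62, 65, 60]) == 2          # one wrong note
assert note_score(ideal, [60, 62, 64, 60, 60]) == 3      # one extra note
```

A full implementation would align played notes to the ideal melody (e.g., by onset time or edit distance) before scoring, but the penalty structure is the same.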
Table 1. User evaluation statements categorized by Learnability, Playability, or Learnability/Playability. These statements are separated in terms of their positive or negative meaning, and the particular layout they were referring to in terms of Adjacency (A, A′) and then Shear (S, S′).
Figure 6. Performance scores, averaged across participants, as a function of layout number (where 0 codes the first layout presented to the participant, 1 codes their second layout, 2 codes their final layout), and performance number (where 0 codes their first performance in each layout, 1 codes their second performance, etc.). The error bars show 95% confidence intervals obtained by bootstrapping.
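The bootstrapped 95% confidence intervals shown in Figure 6 (obtained with 10,000 resamples, per Section 3.2) can be computed roughly as follows. This is a generic percentile bootstrap in pure Python for illustration, not the authors' analysis code, and the example scores are made up.

```python
import random
from statistics import mean

# Generic percentile-bootstrap sketch (not the authors' code): resample the
# per-participant scores with replacement many times, then take the 2.5th
# and 97.5th percentiles of the resampled means as a 95% CI.

def bootstrap_ci(scores, n_boot=10_000, seed=0):
    rng = random.Random(seed)
    boot_means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_boot)
    )
    lo = boot_means[int(0.025 * n_boot)]
    hi = boot_means[int(0.975 * n_boot)]
    return lo, hi

# Example with made-up scores for one layout/performance cell:
scores = [28, 30, 32, 25, 31, 29, 32, 27, 30, 32]
lo, hi = bootstrap_ci(scores)
assert lo <= mean(scores) <= hi
```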
Table 2. A linear mixed effects model of Score with random effects on the intercept grouped by participant. For simplicity,
non-significant predictors are not shown, although they were included in the model.
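The Model 1 specification summarized in Table 2 excludes Adjacency×LayoutNo interactions because, as noted in Section 3.2.1, the non-adjacent layouts were always presented second, making Adjacency a deterministic function of LayoutNo. A small sketch (our own illustration, not the authors' analysis code) makes this exact linear dependence explicit:

```python
# Because the non-adjacent layout was always presented second,
# Adjacency = 0 exactly when LayoutNo = 1. Hence the Adjacency column
# equals Intercept - Dummy(LayoutNo == 1): an exact linear dependence,
# which is why Adjacency x LayoutNo interactions cannot be added.

design = [(layout_no, 0 if layout_no == 1 else 1)
          for layout_no in (0, 1, 2)          # three layouts per participant
          for _ in range(4)]                  # four performances per layout

for layout_no, adjacency in design:
    intercept = 1
    dummy_second = 1 if layout_no == 1 else 0
    assert adjacency == intercept - dummy_second
```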
sheared, pitch-adjacent layout AS commented on the angle of the rows making it easier to "visualize where the fingers would go next" and giving "an intuitive sense of moving upwards." As these statements had not specifically stated whether this related to learning/visualizing the patterns of pitches or the actual ergonomics of playing, we did not further categorize them.

There were also eight extra statements (separate to those in Table 1) that noted a preference for a specific feature of one or more of the layouts. Participants here noted a preference for the circle rather than oval buttons (an effect of the shear), and a preference for the compact layout that minimized gaps between notes (adjacency). Two statements noted the flat (unsheared) layouts were easier, and one noted that the sheared, adjacent layout (AS) would work well as a split screen when considering two-handed playing.

3.1.6 Summary

Participants clearly show a preference for adjacent M2 layouts. Their main concerns are reflected in the learnability of a layout, and being able to figure out the pitch patterns. Playability is concerned particularly in statements regarding the shear of a layout.

3.2 MIDI Data

The scores (correct notes played) for performances of Frere Jacques, averaged across participants, are summarized in Figure 6 as a function of layout type (AS, AS′, A′S, or A′S′); layout number (where 0 is the first layout presented to the participant, 1 is the second layout presented, 2 is the final layout); and performance number (where 0 is the first performance in any given layout, 1 is the second performance, etc.). The 95% confidence intervals were obtained with 10,000 bootstrap samples.

The correlation between participants' preferences and their correct note score was r = .05 and not significant (p = .441).

3.2.1 Model 1

To analyse which factors, and which of their interactions, impact on participants' scores, a linear mixed effects model was run with Score as the dependent variable, and the following fixed effects predictors: Intercept, Adjacency, Shear, PerfNo, LayoutNo, Adjacency×Shear, Adjacency×PerfNo, Shear×PerfNo, Shear×LayoutNo, PerfNo×LayoutNo, Adjacency×Shear×PerfNo, Shear×PerfNo×LayoutNo. Because all non-adjacent layouts were second, interactions between Adjacency and LayoutNo are not linearly independent of the main effects and so cannot also be included in the model. The intercept was treated as a random effect grouped by participant, which allows the model to take account of participants' differing aptitudes regarding the task as a whole.

The strongest significant effect is the interaction Adjacency×Shear, which indicates the generally superior per-
Table 3. A linear mixed effects model of the scores obtained during the first presented layout, which was either AS′ or AS.
formance of the sheared and adjacent layout AS. The significant negative conditional effect of Adjacency indicates that Adjacency reduces the score with unsheared layouts. This results from the fact that non-adjacent layouts were only presented second, by which time learning had already resulted in Score approaching the ceiling. This can be seen in Figure 6 by observing the course of the blue lines (the unsheared layouts): the initial score is relatively low, but rapidly improves due to learning and has hit the ceiling by the time the second (non-adjacent) layout is presented.

The conditional effect of LayoutNo indicates that it has a strong positive impact across the first performance of unsheared layouts; the negatively-weighted interaction between PerfNo and LayoutNo shows that the impact of LayoutNo reduces as PerfNo increases, and vice versa.

The negatively-weighted interaction between LayoutNo and Shear shows that the positive impact of Shear is essentially swamped by the time the final layout was presented (another consequence of the ceiling effect).

3.2.2 Model 2

The ceiling effect, established above, suggests that the results most indicative of the initial learning of a new layout are those that arise during the first presented layout (i.e., when LayoutNo = 0). Table 3 summarizes a regression of the scores obtained during only the first presentation on PerfNo and Shear and their interaction, with a random effect on the intercept grouped by participant.

This model shows that for the first presented layout, which always had adjacent seconds, both Shear and PerfNo have strong positive effects on Score. However, the interaction shows that the positive impact of PerfNo is much smaller when the layout is sheared, and that the positive impact of Shear is somewhat reduced as PerfNo increases. As before, these inferences can be readily identified in Figure 6.

3.3 Limitations

The current analysis considers the correct note score for performances of a well-known melody, where the maximum score is 32. We see a strong ceiling effect, particularly for layout AS, where performances are close to the maximum number of correct notes possible. The musical task used for testing is simple to ensure that performance is affected only by the layout manipulations and not by participants' ability to memorise a new melody. What may be more revealing is an analysis of the timing with which participants are able to play these melodies.

The design of the experiment may also have contributed to the preferences and disparity of statements concerning Adjacency. Each participant was exposed to three different layouts: two with adjacent major seconds (sheared and unsheared); one with non-adjacent major seconds (sheared or unsheared). The non-adjacent M2 layout always came second; a more complete set of orderings would be desirable. The experiment tests for right-handed performance of single-note melodies after a short training paradigm.

Results presented on a transfer task immediately after training show an effect of shear on learning of this single-note melody; however, as different tasks are theorised to benefit from different layouts, effects may differ for performance of chords or left-handed bass lines, not to mention more complicated tasks such as modulation or improvisation. Further testing would also be necessary to determine the long-term effects of training.

4. CONCLUSIONS

Non-standard pitch layouts have been proposed since the late nineteenth century and, over just the last ten years, a number of products utilizing novel pitch layouts have been demonstrated at conferences such as SMC and NIME [6, 7, 9, 21-23], or released commercially (as previously referenced). Purported advantages of such layouts are often theoretically plausible, but they are rarely subjected to formal experimental tests. There is a multiplicity of different factors that may play a role in the learnability and playability of pitch layouts; furthermore, these factors are often non-independent. We have used two independent spatial manipulations (Adjacency and Shear), realized with a multitouch interface, to tease apart some of the underlying factors that affect learnability and playability. These spatial manipulations allow us to gain insight into the impact of spatially privileging major seconds by adjacency or angle, and of spatially privileging the pitch axis by angle.

The quantitative MIDI analysis indicates that shear provides a significant improvement in participants' ability to learn quickly and to play accurately; indeed, even on first performance, their correct note score was typically very good. Having the non-adjacent M2 layouts presented only second makes the impact of this factor on learnability harder to assess. Fortunately the user evaluations provide insight here.

The user evaluations suggest that M2 adjacency has a strong effect on the learnability of a layout, with adjacent M2 layouts easier to grasp and process, perhaps due to familiarity with previous motor-pitch associations, and perhaps due to their importance as an interval in Western mu-
sic. In the user evaluations, shear is only considered as a secondary concern, when a motor-pitch pattern has already been acquired and needs to be perfected.

The quantitative analysis does not allow us to determine whether the key factor behind shear's impact is how it changes the direction of the pitch axis or how it changes the direction of the M2 axis. The user evaluations, however, suggest that both underlying factors play a role. Future research may aim to differentiate between these two explanations, and to collect more data to eliminate the ordering effect on adjacency.

Acknowledgments

Thanks to Anthony Prechtl, who coded the original version of Hex and who specially adapted it for this project to make it receptive to multitouch.

Dr Andrew Milne is the recipient of an Australian Research Council Discovery Early Career Award (project number DE170100353) funded by the Australian Government.

5. REFERENCES

[1] R. H. M. Bosanquet, Elementary Treatise on Musical Intervals and Temperament. London: Macmillan, 1877.

[2] H. L. F. Helmholtz, On the Sensations of Tone as a Physiological Basis for the Theory of Music. New York: Dover, 1877.

[3] K. Wicki, "Tastatur für Musikinstrumente," Neumatt (Münster, Luzern, Schweiz), Oct. 1896.

[4] D. Keislar, "History and principles of microtonal keyboards," Computer Music Journal, vol. 11, no. 1, pp. 18-28, 1987. [Online]. Available: http://ccrma.stanford.edu/STANM/stanms/stanm45

[5] Array Instruments, "The Array Mbira," 2017. [Online]. Available: http://www.arraymbira.com

[6] G. Paine, I. Stevenson, and A. Pearce, "Thummer mapping project (ThuMP) report," in Proceedings of the 2007 International Conference on New Interfaces for Musical Expression (NIME07), L. Crawford, Ed., June 2007, pp. 70-77.

[7] R. Jones, P. Driessen, A. Schloss, and G. Tzanetakis, "A force-sensitive surface for intimate control," in Proceedings of the 2009 International Conference on New Interfaces for Musical Expression (NIME09), 2009, pp. 236-241.

[8] C-Thru Music, "The AXiS-49 USB music interface," 2017. [Online]. Available: http://www.c-thru-music.com/cgi/?page=prod axis-49

[9] B. Park and D. Gerhard, "Rainboard and Musix: Building dynamic isomorphic interfaces," in Proceedings of the International Conference on New Interfaces for Musical Expression, Daejeon, Republic of Korea, 2013, pp. 319-324.

[10] Roger Linn Design, "LinnStrument: A revolutionary expressive musical performance controller," 2017. [Online]. Available: http://www.rogerlinndesign.com/linnstrument.html

[11] ROLI, "ROLI : BLOCKS," 2017. [Online]. Available: https://roli.com/products/blocks

[12] S. Terpstra and D. Horvath. (2017) Terpstra keyboard. [Online]. Available: http://terpstrakeyboard.com

[13] A. J. Milne, W. A. Sethares, and J. Plamondon, "Isomorphic controllers and Dynamic Tuning: Invariant fingering over a tuning continuum," Computer Music Journal, vol. 31, no. 4, pp. 15-32, Dec. 2007.

[14] A. Prechtl, A. J. Milne, S. Holland, R. Laney, and D. B. Sharp, "A MIDI sequencer that widens access to the compositional possibilities of novel tunings," Computer Music Journal, vol. 36, no. 1, pp. 42-54, 2012.

[15] A. J. Milne, W. A. Sethares, and J. Plamondon, "Tuning continua and keyboard layouts," Journal of Mathematics and Music, vol. 2, no. 1, pp. 1-19, 2008.

[16] P. G. Vos and J. M. Troost, "Ascending and descending melodic intervals: Statistical findings and their perceptual relevance," Music Perception, vol. 6, no. 4, pp. 383-396, 1989.

[17] L. Euler, Tentamen novae theoriae musicae ex certissimis harmoniae principiis dilucide expositae. Saint Petersburg: Saint Petersburg Academy, 1739.

[18] R. J. Zatorre, J. L. Chen, and V. B. Penhune, "When the brain plays music: auditory-motor interactions in music perception and production," Nature Reviews Neuroscience, 2007.

[19] P. Q. Pfordresher, "Musical training and the role of auditory feedback during performance," Annals of the New York Academy of Sciences, vol. 1252, pp. 171-178, 2012.

[20] R. A. Schmidt and T. D. Lee, Motor Control and Learning, 5th Ed. Champaign, IL: Human Kinetics, 2011.

[21] A. J. Milne, A. Xambó, R. Laney, D. B. Sharp, A. Prechtl, and S. Holland, "Hex player: a virtual musical controller," in Proceedings of the 2011 International Conference on New Interfaces for Musical Expression (NIME11), A. R. Jensenius and R. I. Godøy, Eds., Oslo, Norway, 2011, pp. 244-247.

[22] L. Bigo, J. Garcia, A. Spicher, and W. E. Mackay, "PaperTonnetz: Music composition with interactive paper," in Proceedings of the 9th Sound and Music Computing Conference, Copenhagen, Denmark, 2012, pp. 219-225.

[23] T. W. Hedges and A. P. McPherson, "3D gestural interaction with harmonic pitch space," in Proceedings of the Sound and Music Computing Conference 2013, Stockholm, Sweden, 2013, pp. 103-108.