Professional Documents
Culture Documents
2
LONGMAN LINGUISTICS LIBRARY
3
LONGMAN LINGUISTICS LIBRARY
General editors
R. H. ROBINS, University of London
For a complete list of books in the series, see pages vii & viii
4
5
Experimental Phonetics
Katrina Hayward
6
7
First published 2000 by Pearson Education Limited
Notices
Knowledge and best practice in this field are constantly
changing. As new research and experience broaden our
understanding, changes in research methods, professional
practices, or medical treatment may become necessary.
8
mindful of their own safety and the safety of others, including
parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the
authors, contributors, or editors, assume any liability for any
injury and/or damage to persons or property as a matter of
products liability, negligence or otherwise, or from any use or
operation of any methods, products, instructions, or ideas
contained in the material herein.
9
10
to the memory of Eugénie Henderson
11
Longman Linguistics
Library
General Editors:
R. H. ROBINS
University of London
GEOFFREY HORROCKS
University of Cambridge
DAVID DENISON
University of Manchester
Principles of Pragmatics
GEOFFREY N. LEECH
12
General Linguistics
An Introductory Survey
Fourth edition
R. H. ROBINS
Dialects of English
Studies in Grammatical Variation
PETER TRUDGILL and J. K. CHAMBERS (eds)
An Introduction to Bilingualism
CHARLOTTE HOFFMANN
Linguistic Theory
The Discourse of Fundamental Works
ROBERT DE BEAUGRANDE
13
Latin American Spanish
JOHN LIPSKI
Volume I:
The Eastern Traditions of Linguistics
Volume II:
Classical and Medieval Linguistics
Volume III:
Renaissance and Early Modern Linguistics
Volume IV:
Nineteenth Century Linguistics
ANNA MORPURGO DAVIES
To come:
Volume V:
The Twentieth Century
Modern Arabic
Structures, Functions and Varieties
CLIVE HOLES
Frontiers of Phonology
Atoms, Structures and Derivations
JACQUES DURAND and FRANCIS KATAMBA (eds)
14
An Introduction to the Celtic Languages
PAUL RUSSELL
Greek
A History of the Language and its Speakers
GEOFFREY HORROCKS
15
Chapter 4
The acoustics of speech
production
4.1 Introduction
In Chapters 2 and 3, our main concern was with the
description of speech as sound. For phoneticians and
linguists, however, this is not an end in itself. As explained in
Chapter 1, we need to relate the sound of speech to speech
production on the one hand and to speech perception on the
other. In this chapter, we shall focus on the relationship
between acoustics and production. The crucial link is
provided by the source-filter theory, also often referred to as
the acoustic theory of speech production.
166
The basic idea behind the source-filter theory is quite simple.
Sound production often involves more than one stage, and so
it is necessary to distinguish between the original source of
sound and later modifications (filtering) of that source. For
example, in the case of a violin, the original source of sound
is a vibrating string, but the vibration of the string is
transmitted to a sort of box and it is the vibration of this
second object which is responsible for the sound which is
transmitted to our ears. At the risk of oversimplification, we
may say that the string and the box make separate and
independent contributions to the output sound. For example,
the pitch and loudness of a note (among other things) depend
on what the player does to the string, but the overall 'violin'
quality comes from the box.
None of what we have said thus far goes beyond the insights
of impressionistic phonetics, which recognises the
independence of the vocal folds and the cavities above them
and incorporates this into its descriptive framework. The
source-filter theory represents an advance insofar as it
provides a means of relating both vocal fold activity and the
positions of the vocal organs to the output sound in a
principled way. In impressionistic phonetic description, the
167
sound which corresponds to a particular vocal setting is
something to be heard and retained in memory. In the
source-filter approach, the sound - which is now conceived of
not as an auditory percept but as a visual spectrum - is
predicted from the setting by universally valid procedures
which apply not only to speech but to the production of sound
in general. In the next section, I shall illustrate the basic
principles by working through a simple example.
168
possible to feel the air flow of speech, but it is necessary to
place the hand immediately in front of the mouth in order to
do so.
169
4.2.2 The vocal tract filter
The term filter may also be used in a more general way, and
applied to devices which simply alter the balance between the
various component frequencies of a sound, with no
implication that undesirables are being eliminated. For
170
example, it is common for radios, cassette players, and CD
players to have a knob which can be used to alter the balance
between bass and treble; those which come equipped with
'graphic equalisers' are more sophisticated, allowing the user
to make more elaborate adjustments as regards the relative
strengths of frequencies within various bands. All such
adjustments are examples of 'filtering' in this more general
sense.
171
is successful, the predicted vowel spectrum will match the
spectra of naturally produced vowels.
172
FIGURE 4.1 Schematic view of the vocal tract tube during
the production of a midcentral vowel ([a]). For the production
of this vowel, the shape of the vocal tract is very similar to
that of a uniform tube, with no taperings or bulges.
173
example, the first formant frequency is F, and the second
formant frequency is F2.
174
These terms are taken from the vocabulary of electrical
engineering, and reflect the crucial role played by engineers
in developing models of the vocal tract.
175
will be an increase of 6 dB per octave. The effects of radiation
can be represented by means of a frequency response curve -
the radiation function which gradually slopes upwards. The
effects of adding in radiation are shown in Figure 4.3 (d) and
(e).
176
177
FIGURE 4.3 Predicting the spectrum of a mid-central vowel.
The source spectrum (a) is modified by the vocal tract filter
(b) to give the spectrum of air flow at the lips (c). This
spectrum is, in turn, modified as part of the process of
radiation (d) to give the spectrum of the resulting sound wave
(e).
178
peaks of the vocal tract filter function (500 Hz, 1500 Hz,
2500 Hz, 3500 Hz).
179
vocal folds are doing, we must make appropriate
modifications to one or both of these numbers.
180
As regards the slope of the spectrum, this can be related to
phonation type. It is common to operate with a three-way
division of phonation types into normal or modal, breathy,
and creaky or pressed (Section 8.3.2), depending on the
spacing between the vocal folds and the amount of air passing
through them. What is not so obvious from this description is
that, in breathy voice, the flow waveform (Figure 2.16) will
be more symmetrical, while in pressed (creaky) voice, it will
be more asymmetrical with more gradual opening and sharper
closing. In terms of the spectrum, this involves a difference in
slope. For breathy voice, the fundamental component is more
dominant and the slope is steeper, whereas for creaky voice,
the fundamental component is less dominant and the slope is
less steep. Figure 4.6 shows theoretical source waveforms and
spectra and output vowel spectra for more breathy and more
creaky versions of the schwa of Figure 4.3. It is clear the
overall slope is the steepest in the breathy version and least
steep in the creaky version; the version of Figure 4.3 lies in
between.
The vowel spectra of Figures 4.3, 4.5 and 4.6 were all
constructed using exactly the same vocal tract filter, so any
differences between them are due to differences in the source
brought about by differences in our representation of vocal
181
fold activity. The spacing between the harmonics is entirely
determined
by the fundamental frequency. The slope - that is, the balance
between high and low frequencies - is determined by the
phonation type and/or vocal effort. Here, however, we must
add that the overall slope of the output spectrum will also be
affected by the vocal tract filter, as will be explained below.
182
FIGURE 4.5 Predicted spectra of the voice source (a) and the
radiated vowel (b) for a mid-central vowel [Əa] as in Figure
4.3, but with the fundamental frequency increased to 200 Hz.
The overall slope of the spectrum in (a) is the same as in the
source spectrum of Figure 4.3 (a), but the harmonics are twice
as far apart. In the spectrum in (b), the peaks are less
well-defined than in the corresponding spectrum of Figure 4.3
(e).
183
FIGURE 4.6 Theoretical source waveforms, spectra for the
voice source, and spectra for the radiated vowel for relatively
breathy and relatively creaky (or 'pressed') phonation. The
waveforms should be compared with Figure 2.16 (a). It will
be seen that the waveform for relatively breathy voice is more
symmetrical while the waveform for relatively creaky voice is
somewhat more skewed. The spectra of (b) and (c) should be
compared with the spectra of Figure 4.3 (a) and (e). The
fundamental frequency (and thus the spacing between
harmonics) is the same as in the example of Figure 4.3, but
the overall slope of the spectrum is altered. For relatively
breathy voice, the slope is more steep, whilst for relatively
creaky voice it is more shallow.
184
FIGURE 4.7 A two-tube model for [a].
For example, in Figure 4.7 (a), two tubes are joined together
to approximate the shape of the vocal tract during the
production of a low central vowel ([a]). Although this
two-tube representation is very simple and crude, it presents
some complications which do not arise in the case of the
single tube. This is because the two tubes are not independent
of one another, and the coupling between them causes the
resonant frequencies (formant frequencies) to shift away from
the values for the corresponding single, uncoupled tubes.
Accordingly, the effects of coupling must be taken into
account when determining
the formant frequencies. The calculations are too complicated
to be done by hand.
185
In fact, the crucial step in going from a representation like
that of Figure 4.7 (a) to the corresponding frequency-response
curve (transfer function) is to determine the formant
frequencies of the system. This may be done by calculation,
or by estimating from graphs known as nomograms (Section
4.4.3). As we shall see (Chapter 6), the fact that vowel spectra
can be predicted from their formant frequencies has lent
support to the view that vowel quality should be specified in
terms of formant frequencies.
Figures 4.7, 4.8 and 4.9 illustrate simple models for three
vowels: a low central unrounded vowel ([a]), a high front
186
unrounded vowel ([i]) and a high back rounded vowel ([u]).
In each case, we give (a) a configuration of tubes
approximating the vocal tract shape, (b) the calculated vocal
tract transfer function and (c) the calculated output spectrum.
For the output spectrum (c) in all three cases, we have used
the same representation of the voice source as in Figure 4.3
(F0 = 100 Hz, slope of -12 dB per octave as expected for
modal phonation).1
187
FIGURE 4.8 A three-tube model for [i].
188
in frequency, as for [u], F3 and F4 will have no reinforcement,
and will remain rather weak.
189
FIGURE 4.9 A four-tube model for [u].
190
to provide a link between realistic models of tongue and jaw
activity and the sound which eventually emerges from the
lips, and it is clear that our crude assemblies of tubes will not
do for such purposes.
191
To conclude this section, we should point out that other
factors, apart from accuracy in the representation of the shape
of the vocal tract, need to be taken into consideration in order
to achieve more accurate models. For example, the elastic
properties of the walls (soft and compliant or stiff) will also
play a role in determining the character of the output
spectrum. It has also come to be realised that the assumptions
of the source-filter theory may be too strong, and that the
source and filter are not entirely independent. As
understanding of these matters progresses, it is becoming
possible to achieve ever more realistic models which relate
the vocal tract configuration to the output sound.
192
for [a] and shortest for [u]. Finally, the cross-sectional area of
the constriction is greater for [al than for lil or ful.
193
F2, F3 and F4 - change as the constriction location varies from
back (on the left) to front (on the right). The dashed lines
show the results of similar calculations, carried out when the
fourth tube (representing lip rounding) is added. AH four
dashed lines lie below their corresponding solid lines. This
reflects the fact that, if the fourth (lip) tube is added to the
three basic tubes for any particular constriction location, the
formant frequencies will be lowered. At the same time, the
extent of the lowering will differ from one location to
another. For example, when the back cavity length is 2 cm,
the dashed line for F2 is only slightly lower than the
corresponding solid line. By contrast, when the back cavity
length is 11 cm, the gap between the two lines for F2 is much
greater.
194
FIGURE 4.10 In this general model of vowel articulation (a),
the supralaryngeal vocal tract is represented as a series of
three tubes with an optional fourth tube representing the lips.
The total length is fixed at 17 cm and the constriction length
is fixed at 2 cm. However, the lengths of the front and back
tubes may be varied. The location of the constriction can be
specified by giving the length of one of these cavities, in this
case, the length of the back cavity. The nomogram based on
the model (b) shows how formant frequencies vary with
constriction location and also the effects of adding lip
rounding. Nomograms can be used to estimate formant
frequencies. For example, if the back cavity length is 2 cm
and there is no lip rounding, F1 will be about 500 Hz and F2
will be about 1000 Hz. These numbers are arrived at by
estimating the height of the solid lines corresponding to F1
and F2 at the points corresponding to 2 on the horizontal axis.
195
must take the cross-sectional area of the constriction into
account. In fact, the value which I have chosen is rather small,
and corresponds to the situation where the relevant part of the
tongue is close to the outer wall of the vocal tract. In terms of
impressionistic phonetic classification, this includes all high
vowels (which would be placed at the top of the vowel
quadrilateral) and all fully back vowels. For the high vowels,
the front of the tongue is close to the roof of the mouth; for
fully back vowels, the back of the tongue is close to the back
wall of the pharynx.
196
vary a single articulatory parameter along a continuum, the
acoustic consequences will be more dramatic at some points
along the continuum than at others. For example, suppose that
we use the nomogram to study the effects of changing the
back cavity length (and with it the constriction location) on F2
in the absence of lip rounding. To do this, we must focus our
attention on the solid line labelled F2. For lengths between 2
and 4 cm, and lengths of 10-12 cm, F2 is relatively stable. In
contrast, for lengths between 5 and 9 cm, F2 rises rapidly. We
might say that changing the constriction location involves a
rapid transition between two regions of relative stability. This
is the meaning conveyed by the adjective quantal. Stevens
further argues that languages should prefer articulations in
regions of stability. This is because they do not demand too
much precision on the part of the speaker as regards the
positioning of the articulators.
197
length around 2-4 cm) - corresponding to a low back [a]-like
vowel - the acoustic output will also be relatively stable.
Finally, when lip rounding is added (corresponding to the
dashed curves in Figure 4.10), there is a region of stability
when the back cavity length is 7-8 cm, since F2 has a very
broad minimum, F1 is relatively stable, and F3 is high and
well-separated from both (and is bound to be quite weak).
This corresponds to an [u]-like vowel. Our examination of the
nomogram thus suggests that [i]-like vowels, [a]-like vowels
and [u]-like vowels should be particularly common speech
sounds in the languages of the world, and the experience of
phoneticians and phonologists confirms that this is, in fact,
the case. Following Stevens, these three vowels are
sometimes referred to as quantal vowels.
198
between various vowel qualities. If we wish to describe vowel
quality in acoustic terms, it is useful to be able to get at the
filter on its own, without taking the source into account.
On the face of it, encoding the signal in this way might seem
to have little to do with determining formant frequencies.
However, the equations which emerge from LPC analysis are
remarkably similar to the equations which would be used to
define the vocal tract filter. LPC can therefore be used to
estimate the formant frequencies and, more generally, to
derive a smooth spectrum in which the formant peaks (that is,
peaks of the vocal tract filter function) stand out clearly.
Figure 4.11 shows an LPC spectrum for the vowel of heard
([3]), which may be compared with the FFT spectrum of the
same vowel which was shown in Figure 4.4.2
199
FIGURE 4.11 LPC spectrum for the vowel of English heard.
The FFT spectrum for this vowel was shown in Figure 4.4.
The frequencies of the peaks are: 550 Hz, 1540 Hz, 2380 Hz,
3500 Hz and 4240 Hz.
200
4.6 Extending the source-filter
approach to other classes of speech
sounds
4.6.1 Nasals, laterals and nasalised vowels
201
FIGURE 4.12 Schematic view of the vocal tract tube during
the production of a bilabial nasal ([m]). The tube branches
into a main passage (through the nose) and a side branch (in
the mouth).
202
remain trapped, with the result that there will be gaps in the
output spectrum. This blocking - or at least marked reduction
- of energy transmission at particular frequencies is expressed
in the frequencyresponse curve for the vocal tract filter by the
introduction of negative peaks, known as antiformants or
zeros. The curve for the filter thus exhibits both extra
formants (or poles) and also antiformants (or zeros). Each
extra formant has a corresponding antiformant, so that the two
occur in pairs.
203
FIGURE 4.13 FFT (a) and LPC (b) spectra for the [m] of
English simmer. The FFT spectrum shows a broader
low-frequency peak and a zero at about 760 Hz. The LPC
analysis does not take the possibility of zeros into account.
204
prominent in the output spectrum. At the other extreme, if the
soft palate is only slightly lowered and the passage into the
nasal cavity is narrow, the formants and antiformants will be
close together and will nearly cancel each other out, so the
effects on the output spectrum will be relatively minor.
Accordingly, when acoustic phoneticians have sought to
produce nasalised vowels in artificial (synthetic) speech, they
have been able to vary degree of nasalisation by varying the
separation between the formants (poles) and antiformants
(zeros). A spectrogram of a nasalised vowel, together with its
FFT spectrum, is shown in Figure 6.9.
To sum up, nasals, laterals and nasalised vowels use the same
source as oral vowels, but the filter is more complicated. The
frequency-response curve can no longer be specified using the
formant frequencies alone; it is also necessary to take into
account the position of various zeros.
205
forcing the airstream through a narrow constriction. Since the
flow becomes turbulent as the airstream emerges from the
constriction, it seems clear that the actual source of the sound
must be located in front of the constriction itself. In fact, there
are two possibilities. Firstly, if a jet of air emerges from the
constriction and is directed at an obstacle (in practice, the
upper or lower teeth), sound will be generated at the obstacle.
This is the case for the so-called sibilant fricatives, including
[s] (as in English sip) and [ʃ] (as in English ship). Secondly, at
least for fricatives with places of articulation along the roof of
the mouth such as palatal [ç] (as in English ẖuge) and velar
[x] (as in Scottish English loch no obstacle interrupts the
flow, and the sound will be generated along a relatively rigid
surface (in this case, the hard palate) running parallel to the
direction of flow. Accordingly, Shadle (1990), whose work
has been of crucial importance in this area, distinguishes
between an obstacle source and a wall source. Other things
being equal, obstacle-source fricatives have higher intensity
than their non-obstacle counterparts. It has been suggested by
Ladefoged and Maddieson (1996) that the traditional class of
sibilant fricatives should be equated with the class of obstacle
fricatives.
206
constriction for [s] is smaller than the cavity for [ʃ]. (Other
things being equal, smaller tubes have higher resonant
frequencies.) It is also possible, however, that cavities behind
the obstruent constriction may introduce zeros (antiformants)
into the frequency-response curve. In this matter, their role
may be compared to the role of the mouth cavity in the
production of nasal consonants (Section 4.6.1). We shall not
discuss these complications here.
4.7 Summary
The main concern of this chapter has been the source-filter
theory of speech production. The source-filter theory is a
model of the final stage of speech production, when the
airstream from the lungs is converted into sound. It has two
major underlying assumptions. The first of these is that it is
possible to make a division between the original source of a
sound and subsequent modifications (filtering) of that source.
The second is that the source and the filter are independent so
that possible interactions between them do not need to be
taken into account.
207
be taken into account. These are assumed to be the same for
all speech sounds and are outside the speaker's control.
208
necessary to use two or more tubes, and this makes the
calculations more complicated because the effects of the
coupling between the tubes must be taken into account.
Surprisingly good approximations to actual vowel spectra can
be obtained using only a small number of tubes, though more
elaborate representations involving larger numbers of tubes
will obviously lead to greater accuracy. It is also possible to
construct a very general model of vowel articulation, which
can be used to study how changes in constriction location and
constriction cross-sectional area, and also the presence or
absence of lip rounding, affect the formant frequencies of the
output vowel sound.
209
Kent and Read (1992) also provides a good introduction. At
present, the most thorough account of the acoustics of speech
production is Stevens (1998), which is extremely well-written
and which promises to become a classic work in the field.
Notes
210
211