Collins
Standard MIDI Files · Perceptual Audio Coding · MPEG-1 Layers 1, 2 & 3 · MPEG-4
Audio coding has actually been around for hundreds of years. Traditionally, composers record their music by writing out the notes in a standard notation.
A piano roll can be efficiently encoded digitally by recording the time when each note begins and ends. This is what a Standard MIDI File does. The MIDI standard (Musical Instrument Digital Interface) is an internationally agreed language. Standard MIDI Files encode:

- MIDI events/messages, e.g. note-on, note-off, etc.
- The time delay between each event
- Up to 16 different instruments to be played at once
- Parameters such as key velocity, volume, modulation, etc.
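As a sketch (not a complete Standard MIDI File, which also needs header and track chunks), the raw MIDI messages for a single note can be built like this; the tick count and note values are invented for illustration:

```python
# Sketch: encoding a one-note "score" as raw MIDI-style events.
# Status bytes follow the MIDI 1.0 spec: 0x9n = note-on,
# 0x8n = note-off, where n is the channel number 0-15.

def note_on(channel: int, note: int, velocity: int) -> bytes:
    return bytes([0x90 | channel, note, velocity])

def note_off(channel: int, note: int, velocity: int = 0) -> bytes:
    return bytes([0x80 | channel, note, velocity])

# Middle C (note 60) on channel 0, struck with velocity 100:
events = [
    (0,   note_on(0, 60, 100)),   # (delta ticks, message)
    (480, note_off(0, 60)),       # released 480 ticks later
]

# Only these few bytes are stored; the synthesiser supplies the sound.
total = sum(len(msg) for _, msg in events)
print(total)  # 6 bytes for a whole note
```

This is the core of why MIDI files are so small: the audio itself is never stored, only the playing instructions.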
In a MIDI file, it is the instructions to play the notes that are stored, not the audio itself. The quality of the reproduction therefore depends on the synthesiser used for playback.
- MIDI file: only synthesised instruments can be used
- Original recording: any sounds (including speech and singing) can be recorded
Sampling
Digital audio represents the continuous analogue audio waveform by a series of discrete samples. The sample rate must be at least double the bandwidth of the audio signal (the Nyquist criterion). Typical hi-fi sample rates are 44.1 kHz (CD audio) and 48 kHz (DAT tape and DAB radio).
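A quick way to see why the factor of two matters: a tone above half the sample rate folds back (aliases) to a lower frequency. The folding arithmetic can be sketched as:

```python
# Sketch: why the sample rate must be at least twice the bandwidth.
# A tone above Fs/2 aliases: it is indistinguishable from a tone at
# f_alias = |f - Fs * round(f / Fs)|, which lies in [0, Fs/2].

def alias_frequency(f: float, fs: float) -> float:
    return abs(f - fs * round(f / fs))

fs = 44_100.0                          # CD audio sample rate
print(alias_frequency(10_000.0, fs))   # below Fs/2: unchanged -> 10000.0
print(alias_frequency(30_000.0, fs))   # above Fs/2: folds -> 14100.0
```

Anything above 22.05 kHz must therefore be filtered out before sampling CD audio, or it will reappear as an audible error.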
[Figure: signal spectrum versus frequency, marking the sample rate Fs and the Nyquist limit Fs/2]
Quantisation levels
Each sample is quantised so that it can be represented by a binary integer. The number of bits used to represent each sample sets the number of quantisation levels. The error between the quantised signal and the original audio is the quantisation noise. The peak signal-to-quantisation-noise ratio using n bits per sample can be estimated as:

SNR ≈ 6n dB

CD audio uses 16-bit resolution, giving a dynamic range of ~96 dB. To hear the quantisation noise, the signal level would have to be close to the threshold of pain!
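The ~6 dB-per-bit rule can be checked numerically by quantising a full-scale sine wave and measuring the resulting noise (for a sine wave the measured figure comes out slightly above 6n dB):

```python
import math

# Sketch: checking the ~6 dB per bit rule against a quantised sine wave.
def measured_snr_db(n_bits: int, num_samples: int = 100_000) -> float:
    levels = 2 ** n_bits
    sig_pow = noise_pow = 0.0
    for i in range(num_samples):
        x = math.sin(2 * math.pi * 440 * i / 48_000)   # full-scale 440 Hz sine
        # map [-1, 1] onto the integer levels and back:
        q = round((x + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
        sig_pow += x * x
        noise_pow += (x - q) ** 2
    return 10 * math.log10(sig_pow / noise_pow)

for n in (8, 16):
    print(n, "bits:", round(measured_snr_db(n), 1),
          "dB measured vs", 6 * n, "dB rule of thumb")
```

The 16-bit case lands near the ~96 dB dynamic range quoted above for CD audio.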
Sub-band Coding
Like the eye, the ear is more sensitive to some frequencies than others. Many audio coding algorithms exploit this using a form of sub-band coding.
[Figure: sub-band coder pipeline — digital audio in → filters → downsample → quantise → multiplex]

Bit rates through a 3-band example:

- Input: 16 × 48000 = 768 kbps
- After filtering into 3 bands: 16 × 3 × 48000 = 2304 kbps
- After downsampling each band: 16 × 3 × 16000 = 768 kbps
- After quantising to 4 bits: 4 × 3 × 16000 = 192 kbps
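The bit-rate arithmetic above can be reproduced directly (the 3-band split and 4-bit quantisation are just the figures from this example):

```python
# Sketch: bit rates at each stage of the 3-band sub-band coder example.
bits_in, fs_in, bands = 16, 48_000, 3
fs_band = fs_in // bands            # each band downsampled to 16 kHz
bits_band = 4                       # coarser quantisation per band

print(bits_in * fs_in)              # input:             768000 bps
print(bits_in * bands * fs_in)      # after filtering:  2304000 bps
print(bits_in * bands * fs_band)    # after downsample:  768000 bps
print(bits_band * bands * fs_band)  # after quantising:  192000 bps
```

Note that filtering alone triples the data rate; only downsampling (allowed because each band has reduced bandwidth) and coarser quantisation deliver the compression.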
Perceptual Coding
Remember that the quantisation process will introduce noise, and that we want the noise to be imperceptible. We want the noise to be just below the threshold of hearing (also known as the Minimum Audible Field, MAF). So, the question should be: how few bits can we use in each sub-band while keeping the quantisation noise below the threshold of hearing?
Quantisation Implications
[Figure: sound pressure level (dB-SPL) versus frequency (Hz), showing the peak signal level against the threshold of hearing. Uniform quantisation needs 12-16 bits across the whole band, whereas allocating bits per sub-band relative to the threshold of hearing needs only 9-12 bits per band.]
Psychoacoustics
Substantial improvements to our sub-band coder are possible using psychoacoustics. Psychoacoustics is the study of how sound is perceived by the ear-brain combination. Of interest to us is the fact that the threshold of hearing is not constant: it changes continually due to masking.
Masking
[Audio demo: signal alone, signal + noise (SNR = 24 dB), and noise alone]
In the presence of the signal, the noise sounds much quieter (almost undetectable). Due to the anatomy of the ear, loud sounds mask quieter sounds at nearby frequencies. Effectively, the threshold of hearing is raised to the masking threshold. The masking threshold can be estimated using a psychoacoustic model and exploited by the coder.
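One way to see how masking buys bits: with roughly 6 dB of SNR per bit, the bits needed in a band are about the signal-to-mask ratio divided by 6. A toy allocation with invented level figures:

```python
import math

# Toy bit allocation: with ~6 dB of SNR per bit, the bits needed in a
# band are roughly the signal-to-mask ratio (SMR) divided by 6.
# The dB-SPL figures below are invented for illustration.

def bits_needed(signal_db: float, mask_db: float) -> int:
    smr = signal_db - mask_db            # signal-to-mask ratio
    return max(0, math.ceil(smr / 6))    # push noise just below the mask

print(bits_needed(70, 40))   # 5 bits: strong signal, low masking threshold
print(bits_needed(50, 45))   # 1 bit:  mask nearly covers the band
print(bits_needed(30, 40))   # 0 bits: band fully masked, send nothing
```

Raising the effective threshold from the fixed threshold of hearing to the (higher) masking threshold shrinks the SMR, and with it the bit count, in every band.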
Applying Masking
[Figure: sound pressure level (dB-SPL) versus frequency for one frame of David Bowie's "Space Oddity". Quantising each sub-band relative to the masking threshold needs only 2-5 bits per band, compared with 9-12 bits relative to the fixed threshold of hearing.]
The audio signal is processed in discrete blocks of samples known as frames. Each frame of each sub-band is:

- Scaled to normalise the peak signal level
- Quantised at a level appropriate for the current signal-to-mask ratio

The receiver needs to know the scale factor and quantisation levels used, so this information must be embedded along with the samples. The resulting overhead is very small compared with the compression gains.
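A minimal sketch of this per-frame scale-and-quantise step (the frame values and the 4-bit depth are invented; a real coder picks the depth from the psychoacoustic model):

```python
# Sketch: per-frame, per-sub-band coding. Normalise the frame by a
# scale factor, then quantise with the bit depth chosen from the
# signal-to-mask ratio. Both scale and codes must reach the receiver.

def code_frame(samples, n_bits):
    scale = max(abs(s) for s in samples) or 1.0    # scale factor (side info)
    levels = 2 ** n_bits
    codes = [round((s / scale + 1) / 2 * (levels - 1)) for s in samples]
    return scale, codes

def decode_frame(scale, codes, n_bits):
    levels = 2 ** n_bits
    return [(c / (levels - 1) * 2 - 1) * scale for c in codes]

frame = [0.02, -0.05, 0.04, -0.01]       # one sub-band, one (tiny) frame
scale, codes = code_frame(frame, 4)
print(decode_frame(scale, codes, 4))     # close to the original samples
```

The scale factor is the "overhead" mentioned above: one extra value per band per frame, against a large saving in bits per sample.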
Block Diagrams
Encoder: digital audio in → sub-band filter bank → scale & quantise → multiplex → coded audio out. In parallel, an FFT of the input feeds the psychoacoustic model, which supplies the masking thresholds used to choose the quantisation levels; the side information is multiplexed into the bitstream.

Decoder: coded audio in → demultiplex → decode side info → descale & dequantise → inverse filter bank → digital audio out.
Three perceptual coders are available in the MPEG-1 specification, known as Layers 1, 2 & 3.

Layer 1 (.mp1):

- Similar to the simple coder just described
- 32 sub-bands are used
- Each frame contains 384 samples (32 × 12)
- A version of Layer 1 was used in the Digital Compact Cassette (DCC)

Layer 2 (.mp2):

- Slightly more complex, but better quality than Layer 1
- Frame length increased to 1152 samples (32 × 36)
- Data formatting of samples and side information is slightly more efficient
- Used in Digital Audio Broadcasting (DAB)

Layer 3 (.mp3):

- Significantly more complex than Layers 1 or 2
- Capable of reasonable quality even at very low data rates
- A combination of sub-band coding and transform coding is used to give up to 576 frequency bands (compared to 32 for Layers 1 & 2)
- Huffman encoding is applied to the samples
- MP3 files are now hugely popular for internet and mobile users
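Layer 3's Huffman stage assigns shorter codes to more frequent quantised values. A small sketch, using an invented symbol stream and computing only the code lengths (not the actual bit patterns):

```python
import heapq
from collections import Counter

# Sketch: Huffman coding as a tree of merges. Repeatedly combine the two
# least frequent groups; every merge adds one bit to each member's code.

def huffman_code_lengths(symbols):
    freq = Counter(symbols)
    # heap entries: (weight, unique tiebreak, {symbol: code_length})
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: length + 1 for s, length in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, i, merged))
        i += 1
    return heap[0][2]

# Quantised sample values: small magnitudes dominate, as in real audio.
samples = [0] * 50 + [1] * 25 + [-1] * 15 + [2] * 10
lengths = huffman_code_lengths(samples)
print(lengths)   # commonest value (0) gets the shortest code
```

This is the statistical-redundancy removal that distinguishes Layer 3 from Layers 1 and 2, on top of the shared perceptual machinery.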
The same principles are applied in subtly different ways in most general-purpose audio coders, for example in MPEG-4.
MPEG-4
- General audio coders: similar to MPEG-1 but including multichannel support
- Parametric coder: HILN (Harmonics, Individual Lines and Noise) for very low bit rates
- Speech coders: HVXC and CELP speech coders
- Structured Audio: similar to MIDI but including instrument models; used for synthetic audio
- Synthesised speech: allows speech to be coded as text and resynthesised at the decoder
Summary
- Standard MIDI Files: work by encoding the structure of the music
- MPEG-1 Layers 1 & 2: work by removing the perceptual redundancy from digitised audio
- MPEG-1 Layer 3: removes perceptual redundancy and statistical redundancy (by entropy coding)
- MPEG-4: the coding method can be chosen to suit the signal source, so perceptual, statistical and structural redundancy can all be exploited