
We compress files so we can download them faster and use up less storage space.


Files can be compressed using lossy or lossless compression methods. The type of compression used depends on the type of data.
Text is not suitable for lossy compression because missing information can change the meaning of the document (every letter is required).
Music, video and photos are much better suited to lossy compression.
Some methods to compress audio and visual data use techniques that rely on our brains being able to fill in missing data and completely reconstruct the original message.
Lossy compression disregards some of the original information when compressing the file, so when the file is expanded again it is not identical to the original.
Lossy methods typically create much smaller files, but some quality is lost.
Aoccdrnig to rscheearch at Cmabrigde
Uinervtisy, it deosn't mttaer in waht oredr the
ltteers in a wrod are, the olny iprmoetnt tihng
is taht the frist and lsat ltteer be at the rghit
pclae. The rset can be a toatl mses and you
can sitll raed it wouthit a porbelm. Tihs is
bcuseae the huamn mnid deos not raed ervey
lteter by istlef, but the wrod as a wlohe.
Who could hear the difference?
Why would we want to compress music?
Lossless methods maintain the integrity of the data, i.e. when the data is compressed and then expanded, it is the same as the original.
Huffman coding is a lossless compression technique used to reduce the number of bits needed to send or store a message.
Huffman coding is based on the frequency of occurrence of a data item (e.g. a pixel in an image or a character in text).
The principle is to use fewer bits to encode the data that occurs more frequently.
Consider the statement: hippos are cool
The total number of characters within the statement is 15 (including spaces).
To store this statement using ASCII (8 bits per character) would require 8 x 15 = 120 bits.
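As a quick sanity check, one line of Python (illustrative, not part of the slides) confirms the ASCII size:

```python
message = "hippos are cool"
print(8 * len(message))  # 15 characters at 8 bits each: 120 bits
```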
Let's see how Huffman coding compares for the statement: hippos are cool

Each character used needs to be ranked in order of frequency (least to most). The frequency of each character (including spaces) is shown in the table below:

Character   Frequency
a           1
c           1
e           1
h           1
i           1
l           1
r           1
s           1
space       2
p           2
o           3
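A short Python sketch (illustrative, not from the slides) produces the same counts:

```python
from collections import Counter

message = "hippos are cool"

# Count every character, including spaces, then rank least to most frequent.
for char, count in sorted(Counter(message).items(), key=lambda item: item[1]):
    print("space" if char == " " else char, count)
```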
Create a node for every different character.
Label each node with the frequency of the character as well as the character itself, shown in brackets.
Make a list of these nodes and arrange them in order of frequency from lowest to highest:

1 (a), 1 (c), 1 (e), 1 (h), 1 (i), 1 (l), 1 (r), 1 (s), 2 (space), 2 (p), 3 (o)

Note the alphabetical order of the characters of the same frequency is not necessary, but is good practice.
Use the following algorithm (sketched in code after the example below):
1. Take the top two nodes out and join them together to make a new node. The label for this new node is the combined frequency of the two nodes that were taken out.
2. Place this new node back, ensuring the list of nodes is still in order.
3. Repeat until there is only one node left.

After the first pass, 1 (a) and 1 (c) join into a new node of frequency 2. Note how the top two nodes have now moved below all the characters with a lower occurrence frequency:

1 (e), 1 (h), 1 (i), 1 (l), 1 (r), 1 (s), 2 [1 (a) + 1 (c)], 2 (space), 2 (p), 3 (o)
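A minimal Python sketch of this merge loop might look as follows. This is an illustration rather than the slides' own code; where frequencies tie, the order chosen here can differ from the slides, producing a different tree with equally short codes overall:

```python
from collections import Counter

def build_tree(message):
    # One (frequency, character) node per distinct character,
    # sorted from lowest frequency to highest.
    nodes = sorted(((count, char) for char, count in Counter(message).items()),
                   key=lambda node: node[0])
    while len(nodes) > 1:
        # 1. Take the top two nodes out and join them into a new node
        #    labelled with their combined frequency.
        first, second = nodes[0], nodes[1]
        merged = (first[0] + second[0], (first, second))
        # 2. Place the new node back, keeping the list in frequency order.
        nodes = sorted(nodes[2:] + [merged], key=lambda node: node[0])
    # 3. Only one node left: the root of the Huffman tree.
    return nodes[0]
```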
[Diagram: the merging repeats. The pairs e + h, i + l and r + s each become nodes of frequency 2. The 2-nodes then merge in pairs into nodes of frequency 4 (a/c with e/h, i/l with r/s, and space with p). Next, 3 (o) joins a 4 to make 7, the remaining two 4s make 8, and finally 7 and 8 merge into a single root node of frequency 15.]
Now the nodes are complete, the lines from each node need to be labelled so that the Huffman coding can be read. Each upper line will be labelled with a 1 and each lower line with a 0.

[Diagram: the finished tree with every line labelled, 1 on each upper line and 0 on each lower line. Reading the labels from the root node (15) down to a character gives that character's code.]
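Continuing from the build_tree sketch above, the codes can be read off with a recursive walk. This is again an illustration rather than the slides' own code; which child counts as the upper line is a convention, and swapping it only flips the 1s and 0s without changing any code lengths:

```python
def assign_codes(node, prefix="", codes=None):
    # Walk the tree, appending 1 for the upper line and 0 for the lower,
    # until a character (leaf) is reached.
    if codes is None:
        codes = {}
    frequency, payload = node
    if isinstance(payload, str):      # a leaf: payload is the character
        codes[payload] = prefix
        return codes
    upper, lower = payload            # an internal node: two child nodes
    assign_codes(upper, prefix + "1", codes)
    assign_codes(lower, prefix + "0", codes)
    return codes

codes = assign_codes(build_tree("hippos are cool"))
print(codes)  # one valid prefix code; ties may differ from the slides' table
```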
Reading from the root to each leaf gives the following codes:

Character   Huffman Coding
a           1011
c           1010
e           1001
h           1000
i           0111
l           0110
r           0101
s           0100
space       001
p           000
o           11
Through Huffman coding, the encoding for the statement hippos are cool would be as follows:
10000111000000110100001101101011001001101011110110
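As a check, the sketch below (illustrative) encodes the statement with the slides' own code table and counts the bits:

```python
# The code table from the slides; " " stands for the space character.
codes = {"a": "1011", "c": "1010", "e": "1001", "h": "1000",
         "i": "0111", "l": "0110", "r": "0101", "s": "0100",
         " ": "001", "p": "000", "o": "11"}

message = "hippos are cool"
encoded = "".join(codes[char] for char in message)
print(encoded)       # the 50-bit string shown above
print(len(encoded))  # 50, versus 8 * 15 = 120 bits in ASCII
```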
Through Huffman coding, this statement can be written in 50 bits.
Remember that storing the same statement through ASCII would require 120 bits of memory, so Huffman coding saves 70 bits.
Compression is used to make downloading data quicker and cheaper.
Compression enables us to store more data on existing storage hardware.
Compression can be lossy, which is suitable for music, videos and photos.
Lossless compression does not lose any data when it is decompressed.
We have seen how to apply Huffman compression to a simple text.
