
In partial fulfilment of the course ADVANCED DATA STRUCTURES in the Master in Information Technology program

Submitted by: FILIPINA F. ATABAY, ULYSSES MARION G. PALOAY, JEA PULMANO, CRISTY BANGUG, GREGORY BANGUG

Submitted to: MS. ZHELLA ANNE NISPEROS, Professor


HUFFMAN CODING TREES



Definition

Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file), where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

Discussion

We now introduce an algorithm that takes as input the frequencies (which are the probabilities of occurrence) of symbols in a string and produces as output a prefix code that encodes the string using the fewest possible bits, among all possible binary prefix codes for these symbols. This algorithm, known as Huffman coding, was developed by David Huffman in a term paper he wrote in 1951 while a graduate student at MIT. (Note that this algorithm assumes that we already know how many times each symbol occurs in the string, so we can compute the frequency of each symbol by dividing the number of times the symbol occurs by the length of the string.) Huffman coding is a fundamental algorithm in data compression, the subject devoted to reducing the number of bits required to represent information. Huffman coding is extensively used to compress bit strings representing text, and it also plays an important role in compressing audio and image files.

Algorithm 1 presents the Huffman coding algorithm. Given symbols and their frequencies, our goal is to construct a rooted binary tree where the symbols are the labels of the leaves. The algorithm begins with a forest of trees, each consisting of one vertex, where each vertex has a symbol as its label and the weight of the vertex equals the frequency of the symbol that is its label. At each step, we combine two trees having the least total weight into a single tree by introducing a new root and placing the tree with larger weight as its left subtree and the tree with smaller weight as its right subtree. Furthermore, we assign the sum of the weights of the two subtrees as the total weight of the new tree. (Although procedures for breaking ties between trees with equal weights can be specified, we will not specify such procedures here.) The algorithm is finished when it has constructed a tree, that is, when the forest is reduced to a single tree.

ALGORITHM 1 Huffman Coding.

procedure Huffman(C: symbols a_i with frequencies w_i, i = 1, ..., n)
F := forest of n rooted trees, each consisting of the single vertex a_i and assigned weight w_i
while F is not a tree
    replace the rooted trees T and T' of least weights from F, with w(T) ≥ w(T'), with a tree having a new root that has T as its left subtree and T' as its right subtree
    label the new edge to T with 0 and the new edge to T' with 1
    assign w(T) + w(T') as the weight of the new tree
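To make Algorithm 1 concrete, here is a minimal Python sketch of the same idea, using a min-heap to find the two trees of least weight at each step. The function name huffman_tree, the tuple-based tree representation, and the use of the heapq module are choices made for this illustration, not part of the algorithm as stated above.

import heapq
from itertools import count

def huffman_tree(freqs):
    # Build a Huffman tree from a dict {symbol: frequency}.
    # Leaves are plain symbols; internal nodes are (left, right) tuples.
    tiebreak = count()  # Algorithm 1 leaves tie-breaking unspecified; break ties by insertion order.
    # The forest of single-vertex trees, kept in a min-heap keyed by weight.
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                      # "while F is not a tree"
        w1, _, t1 = heapq.heappop(heap)       # least weight      -> becomes the right subtree
        w2, _, t2 = heapq.heappop(heap)       # next least weight -> becomes the left subtree
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t2, t1)))
    return heap[0][2]

# Example: the six symbols used in Example 1 below.
tree = huffman_tree({'A': 0.08, 'B': 0.10, 'C': 0.12, 'D': 0.15, 'E': 0.20, 'F': 0.35})

Because ties between equal weights are broken arbitrarily here, the codes read off this tree may differ from those shown in Figure 1 below, but the average code length is the same.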


Example 1 illustrates how Algorithm 1 is used to encode a set of six symbols.

Example 1: Use Huffman coding to encode the following symbols with the frequencies listed: A: 0.08, B: 0.10, C: 0.12, D: 0.15, E: 0.20, F: 0.35. What is the average number of bits used to encode a character?

Solution: Figure 1 displays the steps used to encode these symbols. The encoding produced encodes A by 111, B by 110, C by 011, D by 010, E by 10, and F by 00. The average number of bits used to encode a symbol using this encoding is

3(0.08) + 3(0.10) + 3(0.12) + 3(0.15) + 2(0.20) + 2(0.35) = 2.45.

Note that Huffman coding is a greedy algorithm. Replacing the two subtrees with the smallest weights at each step leads to an optimal code, in the sense that no binary prefix code for these symbols can encode them using fewer bits.
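The 2.45 figure is easy to verify directly; the short Python check below simply restates the frequencies and the code lengths from the encoding just given.

freqs   = {'A': 0.08, 'B': 0.10, 'C': 0.12, 'D': 0.15, 'E': 0.20, 'F': 0.35}
lengths = {'A': 3, 'B': 3, 'C': 3, 'D': 3, 'E': 2, 'F': 2}   # code length for each symbol
average = sum(freqs[s] * lengths[s] for s in freqs)
print(average)   # 2.45 (up to floating-point rounding)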



[Figure 1 Huffman Coding of Symbols in Example 1. The diagram (not reproduced here) shows the initial forest of six one-vertex trees and Steps 1 through 5: combining A and B (weight 0.18), then C and D (0.27), then E with the A-B tree (0.38), then F with the C-D tree (0.62), and finally the two remaining trees into the single tree of weight 1.00, with each new left edge labeled 0 and each new right edge labeled 1.]



Example 2: Let's say you have a set of numbers and their frequency of use, and want to create a Huffman encoding for them:

VALUE    FREQUENCY
  1          5
  2          7
  3         10
  4         15
  5         20
  6         45

Creating a Huffman tree is simple. Sort this list by frequency and make the two lowest elements into leaves, creating a parent node with a frequency that is the sum of the two lower elements' frequencies:

   12:*
   /  \
 5:1  7:2

The two elements are removed from the list and the new parent node, with frequency 12, is inserted into the list by frequency. So now the list, sorted by frequency, is:

10:3 12:* 15:4 20:5 45:6

You then repeat the loop, combining the two lowest elements. This results in:

     22:*
    /    \
  10:3   12:*
         /  \
       5:1  7:2

and the list is now:

15:4 20:5 22:* 45:6

You repeat until there is only one element left in the list.

   35:*
   /  \
 15:4 20:5



and the list is now:

22:* 35:* 45:6

Combining the two lowest elements again gives:

         57:*
      ___/  \___
     /          \
   22:*        35:*
   /  \        /  \
 10:3 12:*   15:4 20:5
      /  \
    5:1  7:2

and the list becomes:

45:6 57:*

One final combination produces the complete tree:

             102:*
          ___/   \___
         /           \
       57:*         45:6
      /    \
   22:*    35:*
   /  \    /  \
 10:3 12:* 15:4 20:5
      /  \
    5:1  7:2

Now that the list contains just the one element, 102:*, you are done. This element becomes the root of your binary Huffman tree.

To generate a Huffman code, you traverse the tree to the value you want, outputting a 0 every time you take a left-hand branch and a 1 every time you take a right-hand branch. (Normally you traverse the tree backwards from the value you want and build the binary Huffman encoding string backwards as well, since the first bit must start from the top.) Example: the encoding for the value 4 (15:4) is 010; the encoding for the value 6 (45:6) is 1.

Decoding a Huffman encoding is just as easy: as you read bits in from your input stream, you traverse the tree beginning at the root, taking the left-hand path if you read a 0 and the right-hand path if you read a 1. When you hit a leaf, you have found the value. A code sketch for both directions is given after the bit listing below.

Generally, any Huffman compression scheme also requires the Huffman tree to be written out as part of the file; otherwise the reader cannot decode the data. For a static tree, you don't have to do this, since the tree is known and fixed.

The easiest way to output the Huffman tree itself is to start at the root and dump first the left-hand side and then the right-hand side. For each node you output a 0; for each leaf you output a 1 followed by N bits representing the value. For example, the partial tree in the last example above, using 4 bits per value, can be represented as follows:



000100   a fixed 6-bit field that indicates how many bits each leaf value is stored in; in this case, 4
0        the root is a node
10011    its left-hand side is a leaf with value 3
0        its right-hand side is another node; recurse down
10001    the left-hand side is a leaf with value 1
10010    the right-hand side is a leaf with value 2
         the recursion returns

So the partial tree can be represented with 00010001001101000110010, or 23 bits. Not bad!
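To make the traversal description above concrete, here is a small Python sketch (an illustration, not code from the tutorial sources): internal nodes are (left, right) tuples and leaves hold values, matching the representation used in the earlier sketch. The names build_codes and decode are chosen for this example.

def build_codes(tree, prefix=''):
    # Walk the tree, emitting 0 for a left-hand branch and 1 for a right-hand branch.
    if not isinstance(tree, tuple):            # a leaf: record its code
        return {tree: prefix or '0'}
    left, right = tree
    codes = build_codes(left, prefix + '0')
    codes.update(build_codes(right, prefix + '1'))
    return codes

def decode(bits, tree):
    # Read bits, walking the tree from the root; emit a value at each leaf, then restart.
    out, node = [], tree
    for b in bits:
        node = node[0] if b == '0' else node[1]
        if not isinstance(node, tuple):
            out.append(node)
            node = tree
    return out

# The finished tree from Example 2 (value 4 encodes to 010, value 6 to 1).
tree = (((3, (1, 2)), (4, 5)), 6)
codes = build_codes(tree)     # codes[4] == '010', codes[6] == '1'
print(decode('0101', tree))   # [4, 6]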
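The node/leaf dump just described can be sketched in the same style; the helper name serialize and the fixed 4-bit value width are assumptions chosen to match the partial-tree example, not the only possible layout.

def serialize(tree, value_bits=4):
    # Depth-first dump: '0' for an internal node (then its left side, then its right side),
    # '1' followed by a fixed-width binary value for a leaf.
    if not isinstance(tree, tuple):
        return '1' + format(tree, '0%db' % value_bits)
    left, right = tree
    return '0' + serialize(left, value_bits) + serialize(right, value_bits)

# The partial tree from the text: a node with leaf 3 on the left and a node
# holding leaves 1 and 2 on the right.
bits = serialize((3, (1, 2)))
print(bits)        # 01001101000110010 -- the 17 tree bits walked through above
print(len(bits))   # 17; with the 6-bit width header 000100 prepended, 23 bits in total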



References:

http://en.wikipedia.org/wiki/Huffman_coding
http://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
http://michael.dipperstein.com/Huffman/
Rosen, Kenneth H. (2011). Discrete Mathematics and Its Applications (7th ed.). New York: McGraw-Hill.
