McGill University:
School of Computer Science
Winter 1997 Class Notes for 308-251B

DATA STRUCTURES AND ALGORITHMS

Topic #22: HUFFMAN TREES

Last modified: April 17, 1997


Definition of a Huffman tree
Properties of Huffman trees
Construction of Huffman trees
Application 1: Huffman coding
Application 2: Merging sorted files
Application 3: Random number generation

Definition of a Huffman tree

A Huffman tree for a set of leaves with given weights is a binary tree which minimizes the sum of f(i)D(i) over all leaves i, where f(i) is the frequency (or weight) of leaf i and D(i) is the length of the path from the root to leaf i. In each of the applications below, f(i) has a different physical meaning.

See topic 21 on coding and compression.


Properties of Huffman trees

Huffman trees have several important structural properties, which we illustrate with the following example.

EXAMPLE

Let us assume an alphabet containing the letters "a", "b", "c", "d" and "e", which will be the leaves of the tree. Their frequencies are shown in Table 1.


Character    Frequency
a            20
b            11
c            8
d            12
e            49

Table 1

According to Table 1, the leaves "c" and "b" have the smallest frequencies; they become siblings and are placed furthest from the root. The two leaves with the smallest frequencies should always be siblings at the deepest level, so that the sum of f(i)D(i) stays at its minimum value. To see why, suppose we did not make the two smallest-frequency leaves siblings: one of them could then be swapped with a deeper leaf of larger frequency, and the swap would decrease the sum of f(i)D(i). The original tree would therefore not have the minimum possible value of the sum and would fail the definition of a Huffman tree. A similar exchange argument shows that every internal node of a Huffman tree must have exactly two children. Let us consider what would happen if we removed the leaf "a" in the tree below.

Removing the leaf "a" would leave its parent with only one child, and the sum of f(i)D(i) would no longer be at its minimum value. To restore the tree, we take a leaf from the subtree of that node and make it the second child. The resulting tree, shown below, has a strictly smaller sum, so a tree containing an internal node with only one child cannot be a Huffman tree.

Once two siblings have been identified, they can be removed from the list of leaves and replaced by a single internal node whose frequency is the sum of theirs. This reduces the problem size by one, and the reduced problem is again a Huffman-tree problem. (This follows from the fact that the Huffman tree minimizes the sum of f(i)D(i), and not, say, the sum of f(i)D(i)^5.)

It should be noted that Huffman trees are not unique: different trees can be optimal for the same set of leaves and frequencies. For example, leaves with equal frequencies may be exchanged with one another without affecting the sum of f(i)D(i).
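For concreteness, the following short Python sketch compares the sum of f(i)D(i) for the tree of Table 1 (whose leaf depths, as derived in Application 1 below, are e=1, a=2, d=3, b=4, c=4) with a hypothetical variant in which "a" and "c" trade places:

    # Depths of the leaves in the Huffman tree for Table 1
    # (these match the codeword lengths derived in Application 1).
    freq = {'a': 20, 'b': 11, 'c': 8, 'd': 12, 'e': 49}
    depth_huffman = {'e': 1, 'a': 2, 'd': 3, 'b': 4, 'c': 4}
    # Hypothetical alternative: swap 'a' (frequency 20) and 'c' (frequency 8),
    # so the heavier leaf sits deeper in the tree.
    depth_swapped = {'e': 1, 'c': 2, 'd': 3, 'b': 4, 'a': 4}

    def cost(depth):
        # sum of f(i) * D(i) over all leaves i
        return sum(freq[ch] * depth[ch] for ch in freq)

    print(cost(depth_huffman))   # 201
    print(cost(depth_swapped))   # 225

The swapped tree has a strictly larger sum, so it cannot be a Huffman tree.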


Construction of Huffman trees

We now develop an efficient O(n log n) algorithm for the construction of a Huffman tree. An array of nodes, each holding a frequency and left and right pointers, is used:

Node index        f(i)           Left    Right
Leaves
  1               f1 (given)      0       0
  2               f2 (given)      0       0
  .               .               .       .
  n               fn (given)      0       0
Internal nodes
  n+1             (computed)      .       .
  n+2             (computed)      .       .
  .               .               .       .
  2n-1 (root)     (computed)      .       .

Table 2

In the top half of the table, entries 1 to n, we have the leaves of the Huffman tree. Their frequencies are given, and they have no left or right children, hence the 0's in those columns. In the bottom half of the table, entries n+1 to 2n-1, we have the internal nodes, including the root at position 2n-1. Their frequencies are computed from their left and right children as the tree is built. As we shall see in the following section, the worst-case complexity of the construction is O(n log n).

The alphabet is arranged in a heap which, by definition, has the smallest element at the top. The key used to order the heap is the frequency of each character. The heap initially contains the n pairs (i, f(i)) for the leaves i. At each iteration, the two elements of minimal frequency are deleted and a new internal node is created. The frequency of this new node is the sum of the frequencies of its children; it can be thought of as a "supercharacter" whose probability is the total probability of its children. This node is in turn inserted into the heap, and the process is repeated.


Huffman tree construction algorithm


Make heap using (1,f1),...,(n,fn)          [O(n)]
FOR i = n+1 TO 2n-1 DO                     [n-1 iterations, O(log n) each: O(n log n) total]
    (l, fl) <--- Deletemin(Heap)
    (r, fr) <--- Deletemin(Heap)
    fi <--- fl + fr
    Insert((i, fi), Heap)
    Left[i] <--- l
    Right[i] <--- r

RETURN

The FOR loop, which is executed n-1 times for an alphabet of size n, repeatedly extracts the two nodes with the lowest frequencies and replaces them with a new node whose frequency is the sum of their frequencies. This new node has the two extracted nodes as its left and right children. Since each heap operation takes O(log n) time, the worst-case running time is O(n log n).
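For concreteness, the construction can be written as a short Python sketch that follows the pseudocode above, with the standard heapq module playing the role of Deletemin and Insert; the 1-based arrays f, left and right match the layout of Table 2 (the function name build_huffman is chosen here for illustration):

    import heapq

    def build_huffman(freq):
        # freq is a list of the n leaf frequencies, e.g. [20, 11, 8, 12, 49].
        n = len(freq)
        # Arrays indexed 1..2n-1; index 0 is unused so that node indices
        # match the pseudocode and Table 2.
        f = [0] * (2 * n)
        left = [0] * (2 * n)
        right = [0] * (2 * n)
        for i in range(1, n + 1):
            f[i] = freq[i - 1]
        heap = [(f[i], i) for i in range(1, n + 1)]   # O(n)
        heapq.heapify(heap)
        for i in range(n + 1, 2 * n):                 # i = n+1, ..., 2n-1
            fl, l = heapq.heappop(heap)               # Deletemin
            fr, r = heapq.heappop(heap)               # Deletemin
            f[i] = fl + fr
            left[i], right[i] = l, r
            heapq.heappush(heap, (f[i], i))           # Insert
        return f, left, right                         # node 2n-1 is the root

    # Example with the frequencies of Table 1:
    # f, left, right = build_huffman([20, 11, 8, 12, 49])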


Application 1: Huffman coding

A language is made up of symbols or characters. The set of symbols of a language is called its alphabet. A codeword is a way of representing a symbol using other symbols. Each symbol of an alphabet has its own unique codeword. The collection of codewords is known as a code.

Codes are used in many computer applications. One such application is the compression and decompression of data. Another common application is the transmission of data. Data is sent over phone lines or data cables as electrical signals. Since these signals form a binary language (an alphabet of only "0" and "1", or "off" and "on"), the data must be encoded. To make the transmission as quick as possible, we want an efficient code, that is, one with the shortest possible average codeword length. The quantity to minimize is not the plain average of the codeword lengths, but rather the expected codeword length, in which each codeword's length is weighted by its symbol's probability p(i):

    Average codeword length = sum over all symbols i of p(i) * (length of the codeword for symbol i)

where p(i) is the probability of symbol i, that is, the probability of finding it in a text. As always, the probabilities p(i) sum to 1.

A prefix code is a code in which no codeword is a prefix of any other codeword. For example, you cannot have one codeword be "110" and another "1101", since the first is a prefix of the second.
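As a quick illustration, here is a tiny Python sketch that tests whether a set of codewords forms a prefix code (the helper name is_prefix_code is chosen here for illustration):

    def is_prefix_code(codewords):
        # A code is a prefix code if no codeword is a prefix of another.
        for a in codewords:
            for b in codewords:
                if a != b and b.startswith(a):
                    return False
        return True

    print(is_prefix_code(["10", "1100", "1101", "111", "0"]))  # True  (the code of Table 3 below)
    print(is_prefix_code(["110", "1101"]))                     # False ("110" is a prefix of "1101")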

If we take our example of the five-letter alphabet from Table 1, we can generate a Huffman tree for it. Note: the frequencies of the characters can be given beforehand from statistical tables or obtained by processing the file once. They may be expressed as probabilities, percentages, or in any other numeric form.


Character    Frequency    Codeword
a            20           10
b            11           1100
c            8            1101
d            12           111
e            49           0

Table 3

A prefix code can be visualized by a trie whose leaves correspond to the symbols. The average codeword length is then the sum, over all leaves i, of the probability of leaf i times its distance from the root. If we treat the Huffman tree as a trie, the codeword for each character of the alphabet is generated by following the path from the root to that character, writing a 0 for each left branch and a 1 for each right branch. A Huffman code is the code obtained in this way from the Huffman tree built for the given probabilities. If the probabilities are exact, then the Huffman code achieves the optimal average number of bits per input symbol among all prefix codes, but not necessarily among all compression methods (see also Lempel-Ziv compression)! This is a tree optimization problem of the same form as the one discussed in the definition of Huffman trees, but here the probability p(i) replaces the frequency f(i).

Using our example, we can calculate the average codeword length:

0.49*1 + 0.20*2 + 0.12*3 + 0.11*4 + 0.08*4 = 2.01 bits per symbol
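The whole of this application can be sketched in a few lines of Python: build a Huffman code for the Table 1 frequencies with a heap, then compute the average codeword length (the function name huffman_code is chosen here for illustration; the exact bits may differ from Table 3 depending on left/right choices, but the codeword lengths, and hence the average, are the same):

    import heapq
    from itertools import count

    def huffman_code(freq):
        # freq: dict mapping each symbol to its frequency (or probability).
        # Returns a dict mapping each symbol to its codeword.
        tiebreak = count()   # keeps heap entries comparable when frequencies tie
        heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # two smallest frequencies
            f2, _, right = heapq.heappop(heap)
            # A "supercharacter": an internal node holding its two children.
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
        root = heap[0][2]
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node
                walk(node[0], prefix + "0")      # left branch = 0
                walk(node[1], prefix + "1")      # right branch = 1
            else:                                # leaf: record its codeword
                codes[node] = prefix or "0"
        walk(root, "")
        return codes

    freq = {'a': 20, 'b': 11, 'c': 8, 'd': 12, 'e': 49}
    codes = huffman_code(freq)
    total = sum(freq.values())
    avg = sum(freq[s] * len(codes[s]) for s in freq) / total
    print(codes)   # codeword lengths 2, 4, 4, 3, 1, as in Table 3
    print(avg)     # 2.01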


Application 2: Merging sorted files


We would like to merge n sorted files of sizes s(1), ..., s(n) into one long sorted file using as few comparisons as possible. The number of comparisons required to merge two sorted files of sizes m and n is m + n in the worst case (actually m + n - 1, but we will ignore the "-1" for now). We can only merge two files at a time, so the order in which we merge them matters.

For example, take five files of size 200, 150, 810, 300 and 40. Let us try to merge them in that order. First, we merge the files of size 200 and 150 to obtain a file of size 350. Next, we merge the file of size 350 with the file of size 810 to give a file of size 1160. Next, we merge the file of size 1160 with the file of size 300 to give a file of size 1460. And finally, we merge the files of size 1460 and 40 to give a file of size 1500. We can visualize this merging process with the help of the following binary tree:

Picture of tree with files in random order

Each file is a leaf in the tree. Two siblings merge into a larger parent file.

This might seem fine, but it is in fact quite wasteful. Because the earlier files are merged again and again as part of bigger and bigger files, the number of comparisons required to merge the files in that particular order is 200 + 150 + 350 + 810 + 1160 + 300 + 1460 + 40 = 4470. Some re-merging of intermediate files is unavoidable, but there is a more efficient order that results in far fewer comparisons.
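This count is easy to check with a few lines of Python that merge the files strictly in the listed order, charging m + n comparisons for each merge of files of sizes m and n (the function name comparisons_in_order is chosen here for illustration):

    def comparisons_in_order(sizes):
        # Merge the files left to right, charging m + n comparisons per merge.
        total = 0
        merged = sizes[0]
        for s in sizes[1:]:
            total += merged + s
            merged += s
        return total

    print(comparisons_in_order([200, 150, 810, 300, 40]))  # 4470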

If we were to look at this from a tree perspective, the total number of comparisons would be the sum of the sizes of the files s(i) multiplied by their distances D(i) from the root.

Total number of comparisons = sum over all leaves i of s(i) * D(i)

Again, we have the same optimization problem: we wish to minimize the sum of s(i)D(i) over all leaves i. Note that s(i) now plays the role of f(i). The optimal method is to repeatedly merge the two smallest files currently available: merge the files of size 40 and 150 to give a file of size 190, merge the files of size 190 and 200 to give a file of size 390, merge the files of size 390 and 300 to give a file of size 690, and finally merge the files of size 690 and 810 to give the final file of size 1500. This gives a total of 40 + 150 + 190 + 200 + 390 + 300 + 690 + 810 = 2770 comparisons (compared to 4470 comparisons using the other order). This way the smallest files participate in the most merges, which is analogous to the character with the lowest frequency having the longest codeword in Huffman coding.

Picture of tree with smallest files far from root

Of all merge orders, the one given by the Huffman tree is optimal. One of the fundamental properties of Huffman trees is that the smallest frequencies are furthest from the root and that siblings are merged into their parent. Given this, we do not really need to build the tree explicitly: the only thing we need to do is merge the two smallest files first, and then repeat this step over and over.
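This strategy is only a few lines of Python with a heap of file sizes, exactly as in the Huffman construction (the function name optimal_merge_cost is chosen here for illustration):

    import heapq

    def optimal_merge_cost(sizes):
        # Repeatedly merge the two smallest files; the merged file goes back
        # into the heap, just like a Huffman "supercharacter".
        heap = list(sizes)
        heapq.heapify(heap)
        total = 0
        while len(heap) > 1:
            a = heapq.heappop(heap)
            b = heapq.heappop(heap)
            total += a + b          # cost of this merge
            heapq.heappush(heap, a + b)
        return total

    print(optimal_merge_cost([200, 150, 810, 300, 40]))  # 2770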


Application 3: Random number generation

PROBLEM

To generate a random integer with given probabilities.

As a case in point, take an imperfect (loaded) die, whose six faces come up with given probabilities. The method is based upon a binary tree in which each leaf corresponds to an integer. The idea is to take a uniform [0,1] random number U, available in most software libraries and environments, and to travel from the root to a leaf so that each leaf is reached with the correct probability.

Our example assumes a probability p(i) for each face i, as shown below. For example, the probability of rolling a 1 is 0.15, whereas the probability of rolling a 2 is 0.20. We want to construct a tree which reaches each of the six leaves with the correct probability.

Method to traverse the tree

The algorithm for traversing the tree associates with each internal node i a threshold value t(i): the total probability of all leaves lying to the left of i's right subtree (that is, all leaves already passed, in left-to-right order, plus the leaves of i's own left subtree). The t(i) values can be computed with a single traversal that visits the leaves from left to right while keeping a running sum S of their probabilities:


S <--- 0
Traverse(i)
if i is not NIL
    then
      Traverse(left[i])
      if i is a leaf
        then S <--- S + p(i)      (p(i) is the probability of leaf i)
        else t(i) <--- S          (the running sum after i's left subtree)
      Traverse(right[i])

Once this is set up, each internal node has a threshold value which allows us to move down the tree and reach each leaf with the correct probability. The tree below shows the threshold values at each internal node; these are the values we compare against the random number U.

The algorithm below allows us to move down the tree and reach one of the six leaves with the correct probability.


Generate U uniformly on [0,1]
i <--- root of the tree
WHILE i is not a leaf DO
    if U < t(i)
      then i <--- left[i]
      else i <--- right[i]
RETURN index[i]     (index[i] is the integer associated with leaf i)

Why does this work? Each of the six outcomes corresponds to a range of numbers between 0 and 1, and the size of each range equals the probability of that outcome. So if the random number falls between 0 and 0.10, you reach the leaf "3"; the size of this range, 0.10, is precisely the probability of leaf "3". In the same way, to reach leaf "6" the random number must fall between 0.72 and 0.85; the size of this range, 0.13, is the probability of rolling a 6.
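For concreteness, here is a small self-contained Python sketch of the whole procedure. The tree shape and the probabilities of faces 4 and 5 are hypothetical (only the values for faces 1, 2, 3 and 6 appear above; 0.27 and 0.15 are chosen so that the ranges quoted for faces 3 and 6 come out right). Any tree shape reaches each leaf with the correct probability; the Huffman tree is simply the shape that minimizes the expected number of comparisons, as explained next.

    import random

    class Node:
        # A leaf has a value (the integer to return) and a probability p;
        # an internal node has left/right children and a threshold t.
        def __init__(self, value=None, p=0.0, left=None, right=None):
            self.value, self.p = value, p
            self.left, self.right = left, right
            self.t = None

    def set_thresholds(node, running=0.0):
        # Left-to-right traversal keeping a running sum of leaf probabilities;
        # an internal node's threshold is the running sum after its left subtree.
        if node.left is None and node.right is None:      # leaf
            return running + node.p
        running = set_thresholds(node.left, running)
        node.t = running
        return set_thresholds(node.right, running)

    def roll(root):
        u = random.random()                # uniform [0,1) random number U
        node = root
        while node.left is not None:       # every internal node has two children
            node = node.left if u < node.t else node.right
        return node.value

    # Hypothetical die tree (faces 1, 2, 3, 6 use the probabilities from the
    # text; faces 4 and 5 are made up so that the probabilities sum to 1).
    tree = Node(left=Node(left=Node(value=3, p=0.10),
                          right=Node(value=1, p=0.15)),
                right=Node(left=Node(left=Node(value=2, p=0.20),
                                     right=Node(value=4, p=0.27)),
                           right=Node(left=Node(value=6, p=0.13),
                                      right=Node(value=5, p=0.15))))
    set_thresholds(tree)
    print([roll(tree) for _ in range(10)])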

1 Comparison = 1 Time unit

Now, according to the above, we would need 3 time units to reach the leaf "6" and 2 time units to reach the leaf "2". The expected time is then the sum, over all leaves i, of p(i)D(i), where the p(i) are the given probabilities. Once again this has the form seen earlier for Huffman trees. The tree that minimizes the expected time is the Huffman tree, so one should construct the Huffman tree (as shown above).


Links

Christopher A. Thorpe and Christian Sepulveda, Harvard University. Course notes on Huffman encoding
Matthew Dillon, Obvious Implementations Corporation Sandel-Avery Engineering. A Quick Tutorial on Generating a Huffman Tree
Patrick Brigger, Ecole Polytechnique Federale de Lausanne Lecture notes on Huffman coding
Newsgroup comp.compression, University of Missouri Newsgroup on Huffman coding
Peter G. Ford, Center for Space Research (M.I.T.) Huffman Coding of ACIS Pixel Data
Dan S. Hirschberg and Debra A. Lelewer, University of California Adaptive Huffman coding
Dane Dwyer and Michael Swafford, Yertle's Multimedia Page Huffman coding
Rudolph Van Graan, Potchefstroom University Huffman encoding
Brad Jowens, Wichita State University A Huffman Code Assignment
Karen Van Houten, University of Idaho Course notes on Huffman codes
Jon Oldevik, University of Oslo Lecture notes on Huffman coding
Advanced RISC Machines Limited Huffman Compressing
Edu Metz, University of Kansas Huffman Encoding Program
Burak Bayramli, Stevens Institute of Technology Huffman Coding
S. Kwong, K.F. Man, S.H. Tong, City University of Hong Kong. Using Huffman Trees for Compressing Chinese Character Fonts
Andre Skupin, University of Salzburg Huffman Trees to Encode Verbal Street Directions
Rhys Price Jones, Fritz Ruehr, Richard Salter, Oberlin College Computer Science Course notes on Huffman coding




Web page creators

  • Carlton Davis (boxer@cs.mcgill.ca) (Java applet)
  • Natali Grimm (natali@cs.mcgill.ca) (HTML and graphics)
  • Nicholas Trottier (trotwood@cs.mcgill.ca) (HTML and graphics)
  • Pascal Chamma (pchamm@po-box.mcgill.ca) (Java applet)


    Last updated April 17, 1997 by Natali Grimm.

    Copyright © 1997, Carlton Davis, Natali Grimm, Nicholas Trottier, Pascal Chamma. All rights reserved. Reproduction of all or part of this work is permitted for educational or research use provided that this copyright notice is included in any copy. Disclaimer: this collection of notes is experimental, and does not serve as-is as a substitute for attendance in the actual class.