This set contains practice questions on the choice of an algorithm or data structure, and on Huffman trees.
Question 1 | Coding. What is a prefix code? Can you state a condition on a prefix code under which any sequence of bits can be decoded? Can you give a simple 2-input-symbol prefix code such that not all bit sequences can be decoded? |
Answer | Codewords are finite bit sequences. A prefix code is a collection of codewords, one per symbol, such that no codeword appears as the prefix (leftmost part) of any other codeword. Any sequence of bits can be decoded if the prefix tree of the code has two children for each internal node, so that decoding can never get stuck partway down the tree. Finally, the counterexample: if the prefix code for input symbols A and B has codewords 00 and 01, then the sequence 1111 cannot be decoded. |
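A minimal decoding sketch in Python (the function name decode and the dictionary representation of the code are illustrative choices, not from the text): it reads bits left to right and emits a symbol each time the accumulated bits match a codeword; the prefix property guarantees at most one match.

    # A minimal sketch of greedy prefix-code decoding.
    def decode(bits, code):
        inverse = {w: s for s, w in code.items()}   # codeword -> symbol
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in inverse:            # prefix property: at most one match
                out.append(inverse[buf])
                buf = ""
        if buf:                           # leftover bits match no codeword
            raise ValueError("undecodable suffix: " + buf)
        return out

    print(decode("0100", {"A": "00", "B": "01"}))   # ['B', 'A']
    decode("1111", {"A": "00", "B": "01"})          # raises: cannot be decoded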
Question 2 | Huffman trees. We have symbols occurring with probabilities 1/2, 1/4, 1/8, ..., 1/2^n, 1/2^n. (Note that this sequence decreases geometrically, and that only the last two entries are identical.) Give a possible Huffman code for the given collection of symbols (for general n). |
Answer | Construct the tree and note that one possible code is 1, 01, 001, 0001, and so on; for n = 7, the last two codewords are 0000001 and 0000000. The codeword lengths are thus 1, 2, 3, 4, ..., n, and n. |
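A two-line sketch in Python that generates these codewords for a given n (the function name is illustrative):

    # Codewords 1, 01, 001, ..., plus 00...0 for the second weight-1/2^n symbol.
    def geometric_code(n):
        words = ["0" * (i - 1) + "1" for i in range(1, n + 1)]
        words.append("0" * n)
        return words

    print(geometric_code(3))    # ['1', '01', '001', '000']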
Question 3 | Huffman trees (continued). Derive the average code length in the previous example when n is very very large. This involves the summation of a series. In particular, is the answer O(1), O(log n) but not O(1), O(n) but not O(log n), or Omega(n) but not O(n)? |
Answer | From the previous exercise, the average code length is 1/2 + 2/4 + 3/8 + ... + n/2^n + n/2^n. This is at most what we get by continuing the series without limit: S = 1/2 + 2/4 + 3/8 + ... = 2. (To see this, subtract S/2 from S: what remains is the geometric series 1/2 + 1/4 + 1/8 + ... = 1, so S/2 = 1 and S = 2.) The answer is O(1)! No matter how rich the alphabet, the average number of bits per symbol stays below 2. For geometrically decreasing frequencies of symbols, the Huffman code is wonderful. |
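A quick numerical check of this claim (a throwaway Python sketch, not part of the original answer):

    # Average code length for lengths 1, 2, ..., n, n with weights
    # 1/2, 1/4, ..., 1/2^n, 1/2^n; it approaches 2 as n grows.
    def average_length(n):
        return sum(k / 2**k for k in range(1, n + 1)) + n / 2**n

    for n in (5, 10, 20, 50):
        print(n, average_length(n))    # 1.9375, then rapidly -> 2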
Question 4 |
Efficient algorithms.
How would you merge n (n is large) sorted linked lists of
length n each into one gigantic sorted linked list of n^2
elements? Choose one of the following strategies:
Method A: merge the lists two at a time, then merge the resulting lists two at a time, and so on, until a single list remains.
Method B: put all n^2 elements in one big heap (buildheap), and perform deletemin n^2 times.
Method C: repeatedly remove the smallest of the n top elements (the current front elements of the lists), and replace it by its successor in its list.
|
Answer |
Look at the complexity of each method.
We will count comparisons between elements in
a worst case scenario.
Method A.
In each round of merging, all elements are involved.
Thus, one round costs about n^2 comparisons. The number of rounds is
about log_2 n. The total cost is thus about n^2 log_2 n.
Method B.
Buildheap takes about 2n^2 comparisons in
the worst case. Each deletemin on a heap of n^2 elements
costs about 2 log_2(n^2) = 4 log_2 n comparisons (why the 2: check the previous set
of practice questions!). The total is about 4 n^2 log_2 n.
Method C.
All depends on how you find the minimum of the n top
elements. If there is no special structure, then
clearly, each element removed comes at a cost
of about n comparisons. The total is about n^3 in the worst case.
If a heap of the n top elements is used, then the total cost is about 2 n^2 log_2 n. All three methods are thus about equally good, with a slight preference for method A, which is the simplest to implement and has the smallest constant in front of n^2 log_2 n. |
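A minimal sketch of method A in Python (plain Python lists stand in for linked lists; the names are illustrative, and the round structure is the same):

    # Method A: repeated pairwise merging, about log_2 n rounds.
    def merge(a, b):
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        out.extend(a[i:]); out.extend(b[j:])
        return out

    def merge_all(lists):
        while len(lists) > 1:
            # one round: merge the lists two at a time
            lists = [merge(lists[k], lists[k + 1]) if k + 1 < len(lists)
                     else lists[k] for k in range(0, len(lists), 2)]
        return lists[0] if lists else []

    print(merge_all([[1, 4], [2, 5], [0, 3], [6, 7]]))   # [0, 1, ..., 7]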
Question 5 |
Clever implementations.
We are looking for a data structure to efficiently deal with
biological data that consist of a large number (n) of
species, and names of animals within species.
We would like to perform the following operations:
search for a species;
delete an entire species, together with all the animal names stored for it;
search for or insert an animal name within a given species.
|
Answer | To delete an entire species efficiently forces us to keep each species in a separate structure, so a separate search tree per species is indicated. If you think this through, the data structure consists of a red-black tree of species. The operation "delete a species" is easily carried out in O(log n) time: we simply remove the species' node from the species tree. Searching for a species takes O(log n) time as well. Each node in this tree is augmented with a pointer to the root of a tree of names within that species; each of these name trees is in turn a red-black tree. Type definition: each cell in the species tree has fields speciesname (the key), left, right, parent, color (it is a red-black tree), and rootpointer (for the name tree). The cells in the name trees are as for ordinary red-black trees. All operations mentioned above take O(log n) time in the worst case. |
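A sketch of the two cell types in Python (field names follow the answer; the red-black machinery itself is assumed and not shown):

    # Cell of a name tree (one tree per species).
    class NameCell:
        def __init__(self, name):
            self.name = name                          # key
            self.left = self.right = self.parent = None
            self.color = "red"

    # Cell of the top-level species tree.
    class SpeciesCell:
        def __init__(self, speciesname):
            self.speciesname = speciesname            # key
            self.left = self.right = self.parent = None
            self.color = "red"
            self.rootpointer = None    # root of this species' name tree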
Question 6 | Sending numbers. I have to send my friend a sequence of numbers 0, 1, or 2, and let us say that each number occurs about equally often. Propose a simple and space-efficient method for this. |
Answer | As we are sending bits, and we have no special information on the structure of the sequence, we might as well use Huffman coding. One Huffman code codes 0 as 0, 1 as 10, and 2 as 11. So, we translate symbols individually, and obtain an average of (1 + 2 + 2)/3 = 5/3, about 1.667 bits per symbol. Additional remark: had we taken the input in pairs, and translated 00, 01, 02, 10, 11, 12, 20, 21 and 22 by 000, 001, 010, 011, 100, 101, 110, 1110 and 1111 (a Huffman code for pairs of symbols), then the average number of bits per pair would have been (7 x 3 + 2 x 4)/9 = 29/9, about 3.222, and per symbol about 1.611, for a slight improvement. |
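A throwaway Python check of these two averages (the pairing scheme is the one given in the remark above):

    # Verify the averages for the single-symbol and the pair code.
    single = {"0": "0", "1": "10", "2": "11"}
    pairs = dict(zip(
        ["00", "01", "02", "10", "11", "12", "20", "21", "22"],
        ["000", "001", "010", "011", "100", "101", "110", "1110", "1111"]))

    print(sum(len(w) for w in single.values()) / 3)      # 1.666... bits/symbol
    print(sum(len(w) for w in pairs.values()) / 9 / 2)   # 1.611... bits/symbol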
Question 7 | Huffman tree construction algorithm. I claim that if the frequencies of n symbols are given in a sorted list, then the Huffman tree can be constructed in linear time (O(n)) using queues. Derive this algorithm. |
Answer |
We keep two queues: Q1 for the given sorted list of
pairs (i, p_i), where i is the i-th symbol/leaf
and p_i is its weight; Q2 is an initially empty
queue for pairs (j, q_j), where j stands for
an internal node and q_j is its weight.
The algorithm, repeated n-1 times (creating a fresh internal node j each round):
    delete the two smallest-weight items from the fronts of Q1 and Q2;
    let (u, p_u) and (v, p_v) be these symbol/weight pairs;
    enqueue((j, p_u + p_v), Q2);
    left[j] <- u; right[j] <- v.
Why is this linear? Q1 is sorted by assumption, and the weights appended to Q2 are nondecreasing, since each is the sum of the two smallest weights still present. Both queues therefore stay sorted, so the two smallest items overall are found at the queue fronts in O(1) time, and the n-1 rounds cost O(n) in total. |
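A runnable sketch of this two-queue construction in Python (the deque and the tuple representation of nodes are my choices, not from the original answer):

    # Linear-time Huffman tree from weights sorted in increasing order.
    # Leaves are tuples (weight, symbol); internal nodes are tuples
    # (weight, left, right). Two deques play the roles of Q1 and Q2.
    from collections import deque

    def huffman(sorted_weights):
        q1 = deque((w, i) for i, w in enumerate(sorted_weights))   # leaves
        q2 = deque()                                               # internal
        def pop_min():
            # the overall minimum sits at the front of q1 or q2
            if q1 and (not q2 or q1[0][0] <= q2[0][0]):
                return q1.popleft()
            return q2.popleft()
        while len(q1) + len(q2) > 1:
            u, v = pop_min(), pop_min()
            q2.append((u[0] + v[0], u, v))    # new internal node
        return pop_min()                      # the root

    # Example: the weights of Question 2 with n = 3 (scaled by 8): the
    # resulting depths are 3, 3, 2, 1, i.e., codeword lengths 1, 2, 3, 3.
    root = huffman([1, 1, 2, 4])
    print(root[0])                            # total weight 8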
Copyright © 1999 Luc Devroye. Email: luc@cs.mcgill.ca.