Answers for the third set of practice questions--CS251--Winter 1999


This set contains practice questions on the choice of an algorithm or data structure, and on Huffman trees.

Question 1 Coding. What is a prefix code? Can you state conditions for a prefix code so that any sequence of bits can get decoded to something? Can you give a simple 2-input-symbol prefix code such that not all bit sequences can get decoded?
Answer Codewords are finite bit sequences. A prefix code is a collection of codewords, one per symbol, such that no codeword is a prefix (leftmost part) of any other codeword. Any sequence of bits can be decoded if the code is complete, that is, if every internal node of the prefix tree has two children. Finally, the counterexample: if the prefix code for input symbols A and B has codewords 00 and 01, then the sequence 1111 cannot be decoded.
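As an illustration (not part of the original answer), here is a short Python sketch of prefix decoding. The code {0, 10, 11} is a hypothetical complete code; the code {00, 01} is the counterexample above.

```python
# Sketch: decoding with a prefix code stored as a dict from symbol to codeword.
def decode(bits, codebook):
    """Decode a bit string with a prefix code; return None if bits are left over."""
    inverse = {cw: sym for sym, cw in codebook.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:          # prefix property: the first match is the only match
            out.append(inverse[buf])
            buf = ""
    return out if buf == "" else None   # leftover bits: sequence not decodable

complete = {"A": "0", "B": "10", "C": "11"}   # complete tree: every internal node has 2 children
partial  = {"A": "00", "B": "01"}             # the counterexample from the answer

print(decode("01110", complete))   # ['A', 'C', 'B']
print(decode("1111", partial))     # None: no codeword ever matches
```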
Question 2 Huffman trees. We have symbols occurring with probabilities 1/2, 1/4, 1/8, ..., 1/2^n, 1/2^n. (Note that this sequence decreases geometrically, and that only the last two entries are identical.) Give a possible Huffman code for the given collection of symbols (for general n).
Answer Construct the tree and note that a possible code would be 1, 01, 001, 0001, ..., 0000001 and 0000000 in case n = 7. The codeword lengths are thus 1, 2, 3, 4, ... , n, and n.
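As a sanity check (not asked for in the question), a standard heap-based Huffman construction in Python reproduces the codeword lengths 1, 2, ..., n, n for these probabilities:

```python
import heapq

def huffman_lengths(probs):
    """Return the Huffman codeword length of each symbol, by repeated merging."""
    # Each heap entry: (weight, tiebreak, list of leaf indices in that subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    depth = [0] * len(probs)
    tiebreak = len(probs)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for leaf in s1 + s2:        # every merge pushes these leaves one level down
            depth[leaf] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return depth

n = 7
probs = [2.0 ** -k for k in range(1, n + 1)] + [2.0 ** -n]
print(huffman_lengths(probs))   # [1, 2, 3, 4, 5, 6, 7, 7]
```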
Question 3 Huffman trees (continued). Derive the average code length in the previous example when n is very very large. This involves the summation of a series. In particular, is the answer O(1), O(log n) but not O(1), O(n) but not O(log n), or Omega(n) but not O(n)?
Answer From the previous exercise, the average code length is 1/2 + 2/4 + 3/8 + ... + n/2^n + n/2^n. This is no more than what we get by continuing the series without limit: 1/2 + 2/4 + 3/8 + ... = 2. The answer is O(1)! No matter how rich the alphabet, the average number of bits per symbol stays below 2. For geometrically decreasing frequencies of symbols, the Huffman code is wonderful.
Question 4 Efficient algorithms. How would you merge n (n is large) sorted linked lists of length n each into one gigantic sorted linked list of n^2 elements? Choose one of the following strategies:
  • Merge all linked lists pairwise (they are all of length 2n now), and repeat this procedure until all lists are joined into one.
  • Put all linked lists together into an array of size n^2. Apply buildheap to the array. Empty the heap into a new linked list.
  • Define n traveling pointers, one for each linked list. Find the minimum top element by looking at all top elements. Transfer it to the new linked list. Update the traveling pointer in question, and repeat all of this until all linked lists are exhausted.
Answer Look at the complexity of each method. We will count comparisons between elements in a worst-case scenario.
Method A. In each round of merging, all n^2 elements are involved. Thus, one round costs about n^2 comparisons. The number of rounds is about log_2 n. The total time cost is thus about n^2 log_2 n.
Method B. Buildheap takes about 2n^2 comparisons in the worst case. Each deletemin on a heap of n^2 elements costs about 2 log_2(n^2) = 4 log_2 n comparisons (why the 2: check the previous set of practice questions!). The total is about 4 n^2 log_2 n.
Method C. All depends on how you find the minimum of the n top elements. If there is no special structure, then clearly, each element removed comes at a time cost of n, and the total is about n^3 in the worst case. If a heap of the n top elements is used, then the total time cost is about 2 n^2 log_2 n.
Except for the naive version of method C, all methods are thus about equally good, with a slight preference for method A, which is the simplest to implement and has the smallest constant in front of n^2 log_2 n.
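A short Python sketch of method C with a heap of traveling pointers (an illustration, not part of the original answer; plain Python lists stand in for the linked lists):

```python
import heapq

def merge_sorted_lists(lists):
    """Merge sorted lists via a heap holding one (value, list, position) per list."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, pos = heapq.heappop(heap)   # minimum of the current top elements
        out.append(value)
        if pos + 1 < len(lists[i]):           # advance list i's traveling pointer
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    return out

print(merge_sorted_lists([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each of the n^2 elements triggers one deletemin (and at most one insert) on a heap of size at most n, which gives the 2 n^2 log_2 n comparison count estimated above.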
Question 5 Clever implementations. We are looking for a data structure to efficiently deal with biological data that consist of a large number (n) of species, and names of animals within species. We would like to perform the following operations:
  • search (species, name)
  • insert (species, name)
  • delete (species, name)
  • delete (species): this deletes an entire species.
Assume that there are n names within each species, where n is large and unknown. How would you implement this? Please give a drawing and a type definition with very concise explanations. No long-winded text or explicit algorithms please! Give the worst-case complexities of your operations.
Answer To delete an entire species efficiently will force us to keep each species in a separate structure. A search tree for each species therefore seems indicated. If you think this through, the data structure consists of a red-black tree of species. The operation "delete a species" is easily carried out in O(log n) time. Searching for a species takes O(log n) time as well. Each node in this tree is augmented by a pointer to the root of a tree of names within that species. Each of these trees is in turn a red-black tree. Type definition: each cell in the species tree has fields speciesname (the key), left, right, parent, color (it is a red-black tree), and rootpointer (for the name tree). The cells in the name tree are as for ordinary red-black trees. All operations mentioned above take O(log n) time in the worst case.
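A minimal Python sketch of the two cell types (field names follow the answer; the red-black rebalancing code itself is omitted):

```python
class NameCell:
    """Ordinary red-black tree cell for a name within one species."""
    def __init__(self, name):
        self.name = name                       # key
        self.left = self.right = self.parent = None
        self.color = "red"

class SpeciesCell:
    """Cell in the top-level species tree, augmented with rootpointer."""
    def __init__(self, speciesname):
        self.speciesname = speciesname         # key
        self.left = self.right = self.parent = None
        self.color = "red"
        self.rootpointer = None                # root of this species' name tree
```

A search (species, name) first descends the species tree, then follows rootpointer into the name tree, two O(log n) descents in total.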
Question 6 Sending numbers. I have to send my friend a sequence of numbers 0, 1, or 2, and let us say that each number occurs about equally often. Propose a simple and space-efficient method for this.
Answer As we are sending bits, and we have no special information on the structure of the sequence, we might as well use Huffman coding. One Huffman code codes 0 as 0, 1 as 10 and 2 as 11. So, we translate symbols individually, and obtain an average of 1.6666 bits per symbol. Additional remark: had we taken the input in pairs, and translated 00, 01, 02, 11, 10, 12, 20, 21 and 22 by 000, 001, 010, 011, 100, 101, 110, 1111 and 1110 (the Huffman code for pairs of symbols), then the average number of bits per pair would have been 3.2222, and per symbol 1.61111, for a slight improvement.
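The averages quoted above can be checked exactly in Python (an illustration, using the two codes from the answer):

```python
from fractions import Fraction   # exact arithmetic for the averages

single = {"0": "0", "1": "10", "2": "11"}
pair = {"00": "000", "01": "001", "02": "010", "11": "011", "10": "100",
        "12": "101", "20": "110", "21": "1111", "22": "1110"}

# All symbols (and hence all pairs) are equally likely.
avg_single = Fraction(sum(len(cw) for cw in single.values()), 3)
avg_pair = Fraction(sum(len(cw) for cw in pair.values()), 9)

print(avg_single, float(avg_single))       # 5/3 = 1.666... bits per symbol
print(avg_pair / 2, float(avg_pair / 2))   # 29/18 = 1.6111... bits per symbol
```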
Question 7 Huffman tree construction algorithm. I claim that if the frequencies of n symbols are given in a sorted list, then the Huffman tree can be constructed in linear time (O(n)) using a queue. Derive this algorithm.
Answer We keep two queues: Q1 for the given sorted list of pairs (i, p_i) where i is the i-th symbol/leaf, and p_i is the weight; Q2 is an initially empty queue for pairs (j, q_j), where j stands for an internal node and q_j is its weight. The algorithm:
    for j = n+1 to 2n-1 do:
     Delete the 2 smallest weight items from (Q1, Q2)
     Let (u,p_u), (v,p_v) be these node/weight pairs
     enqueue ( (j,p_u+p_v) , Q2)
     left[j] <- u, right[j] <- v
Running this algorithm creates an array (as in the Hu-Tucker algorithm) with the root at position 2n-1, and the leaves at positions 1 through n. The fields of the array are left[.] and right[.]. Each step in each iteration takes O(1) time, and thus, the total time is O(n). Note that the nodes entering Q2 are sorted, because the sums p_u+p_v cannot decrease with time. Therefore, the delete step in the algorithm looks at worst at two elements from Q1 and two elements from Q2.
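The two-queue algorithm can be sketched in Python as follows (an illustration, assuming the weights arrive sorted in increasing order; 1-based positions, leaves at 1..n, internal nodes at n+1..2n-1):

```python
from collections import deque

def huffman_from_sorted(weights):
    """Build a Huffman tree from sorted weights in O(n) using two queues."""
    n = len(weights)
    q1 = deque((i, w) for i, w in enumerate(weights, start=1))  # leaves, sorted
    q2 = deque()                                                # internal nodes
    left = [0] * (2 * n)
    right = [0] * (2 * n)

    def pop_smallest():
        # Both queues are sorted, so the overall minimum sits at one of the fronts.
        if q2 and (not q1 or q2[0][1] < q1[0][1]):
            return q2.popleft()
        return q1.popleft()

    for j in range(n + 1, 2 * n):   # j = n+1, ..., 2n-1
        u, pu = pop_smallest()
        v, pv = pop_smallest()
        q2.append((j, pu + pv))     # sums never decrease, so Q2 stays sorted
        left[j], right[j] = u, v
    return left, right

left, right = huffman_from_sorted([1, 1, 2, 4])   # n = 4, root at node 7 = 2n-1
print(left[5:], right[5:])   # [1, 3, 4] [2, 5, 6]
```

Each iteration does a constant number of queue operations, so the n-1 iterations take O(n) time in total.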

Copyright © 1999 Luc Devroye. Email: luc@cs.mcgill.ca.