A 90 minute closed book examination.
Question 1 | Prefix coding. We are given a prefix coding tree in which arrays left[.] and right[.] are used to denote left and right children. Input symbols are numbered $1$ through $n$. Nodes in the prefix coding tree are numbered from 1 through 2n-1, with input symbol i corresponding to node i. Thus, node i has left child left[i], right child right[i] and if it is a leaf, i itself is an input symbol. An input string t_1,...,t_N is compressed/transformed into a binary output string b_1, b_2,..., b_K according to the coding suggested by the prefix coding tree. Your question is about the data structure and algorithm for this process. You are to design both such that, excluding space for the input and output sequences, O(n) space is used, while the overall time is O(n+K). Notes: the prefix coding tree takes O(n) space, but you may want to use a different data structure, which also takes O(n) space and may be computed in O(n) time. Note also that N does not figure explicitly in the overall time, but it is implicit because K is at least N. |
Solution |
A prefix coding tree with n leaves has
height at least log_2 n, so that writing down
all the codewords, one per symbol, would require
worst case time and space at least n log_2 n.
In other words, it is excluded to keep an
array of codewords.
However, it suffices to keep parent pointers
in the tree by adding an array p[.].
The code is as follows:
for i=1 to 2n-1 do if left[i] not nil: p[left[i]] = i if right[i] not nil: p[right[i]] = iClearly, this takes O(n) space and time. Now, compress the text as follows: for i=1 to N do s = t_i (s is for "symbol") MAKENULL(S) (S is a stack for later use) while p[s] not nil do: if left[p[s]]=s then PUSH(0,S) else PUSH(1,S) s = p[s] (go up to the root) while not empty(S) do: s = POP(S) output(s)In other words, the parent pointers allow us to obtain the codewords backwards. The stack S permits us to reverse them. The time taken by the code is not more than N (for the loop) plus the number of PUSH and POP operations. Each PUSH operation causes a POP later, and the number of POP operations is K. Thus, the total time for the coding is O(N+K) = O(K). |
Question 2 |
Heaps.
Consider a standard binary min-heap holding n nodes.
|
Solution |
Call the binary min-heap H.
Part A requires a preorder traversal cut short
in places:
preorder (i,x,H) (takes care of all nodes in subtree at i) if key[i] <= x then output (i) if 2i <= heapsize[H] then preorder (2i,x,H) if 2i+1 <= heapsize[H] then preorder (2i+1,x,H)We call with preorder (1,x,H). The time is bounded by a constant plus the number of nodes output times 3 (for each node output, we visit at most two children that may not be placed in the output). The time is O(N+1). Part B is similar in spirit, except that we need to visit possible candidate nodes in order. This is achieved by placing these candidate nodes in a priority queue Q implemented as a binary heap. In the priority queue Q, we will place indices j, but the heap order is with respect to key[j]. MAKENULL (Q) INSERT (1,Q) for i = 1 to N do: j = DELETEMIN (Q) output (key[j]) if 2j <= heapsize[H] then INSERT (2j,Q) if 2j+1 <= heapsize[H] then INSERT (2j+1,Q)Note that in every iteration, Q grows by at most one, so the size of Q is at most N. Therefore, each operation INSERT and DELETEMIN on Q takes time O(log N). Thus the total time is O(N log N). |
Question 3 |
Suffix trees.
|
Solution |
In the suffix tree, edges are compacted.
Let $ denotes the terminal symbol.
There should be n leaves, one per
input symbol.
For example, the edges emanating from the
root have labels bS (b is a blank space),
S, IM, M, PLEbSIMONbSAYS$,
LEbSIMONbSAYS$, EbSIMONbSAYS$, bS,
ONbSAYS$, NbSAYS$, AYS$, YS$.
To prove part B, let n_i denote the
number of nodes with i children, and let
N be the total number of nodes. Then,
n_0 = n (one leaf per input symbol) n_1 = 0 (suffix tree has no 1-child nodes) n_0 + n_1 + ... = N (count nodes) n_1 + 2n_2 + 3n_3 + ... = N-1 (count edges)So, subtracting the last two equations, n_0 - n_2 -2n_3 - 3n_4 ... = 1and thus, n = 1 + n_2 + 2n_3 + ... >= 1 + n_2 + n_3 + ... = 1 + N-nso that N <= 2n-1. |
Copyright © 2000 Luc Devroye. Email: luc@cs.mcgill.ca.