## Midterm 2--CS251--Winter 2000

A 90 minute closed book examination.

 Question 1 Prefix coding. We are given a prefix coding tree in which arrays left[.] and right[.] are used to denote left and right children. Input symbols are numbered \$1\$ through \$n\$. Nodes in the prefix coding tree are numbered from 1 through 2n-1, with input symbol i corresponding to node i. Thus, node i has left child left[i], right child right[i] and if it is a leaf, i itself is an input symbol. An input string t_1,...,t_N is compressed/transformed into a binary output string b_1, b_2,..., b_K according to the coding suggested by the prefix coding tree. Your question is about the data structure and algorithm for this process. You are to design both such that, excluding space for the input and output sequences, O(n) space is used, while the overall time is O(n+K). Notes: the prefix coding tree takes O(n) space, but you may want to use a different data structure, which also takes O(n) space and may be computed in O(n) time. Note also that N does not figure explicitly in the overall time, but it is implicit because K is at least N. Solution A prefix coding tree with n leaves has height at least log_2 n, so that writing down all the codewords, one per symbol, would require worst case time and space at least n log_2 n. In other words, it is excluded to keep an array of codewords. However, it suffices to keep parent pointers in the tree by adding an array p[.]. The code is as follows: ``` for i=1 to 2n-1 do if left[i] not nil: p[left[i]] = i if right[i] not nil: p[right[i]] = i ``` Clearly, this takes O(n) space and time. Now, compress the text as follows: ``` for i=1 to N do s = t_i (s is for "symbol") MAKENULL(S) (S is a stack for later use) while p[s] not nil do: if left[p[s]]=s then PUSH(0,S) else PUSH(1,S) s = p[s] (go up to the root) while not empty(S) do: s = POP(S) output(s) ``` In other words, the parent pointers allow us to obtain the codewords backwards. The stack S permits us to reverse them. The time taken by the code is not more than N (for the loop) plus the number of PUSH and POP operations. Each PUSH operation causes a POP later, and the number of POP operations is K. Thus, the total time for the coding is O(N+K) = O(K). Question 2 Heaps. Consider a standard binary min-heap holding n nodes. Part A. Write an algorithm that returns all elements in the heap of value less than or equal to x, in time bounded by a constant times N+1, where N is the (unknown) number of points returned. Note: the time should be independent of n. Part B. Write an algorithm that returns the N smallest elements (without necessarily fixing the heap) in time O(N log N). Note that once again, n does not figure in the complexity! Hint: design a special kind of traversal. No partial credit for part B unless the complexity is as requested. Solution Call the binary min-heap H. Part A requires a preorder traversal cut short in places: ``` preorder (i,x,H) (takes care of all nodes in subtree at i) if key[i] <= x then output (i) if 2i <= heapsize[H] then preorder (2i,x,H) if 2i+1 <= heapsize[H] then preorder (2i+1,x,H) ``` We call with preorder (1,x,H). The time is bounded by a constant plus the number of nodes output times 3 (for each node output, we visit at most two children that may not be placed in the output). The time is O(N+1). Part B is similar in spirit, except that we need to visit possible candidate nodes in order. This is achieved by placing these candidate nodes in a priority queue Q implemented as a binary heap. In the priority queue Q, we will place indices j, but the heap order is with respect to key[j]. ``` MAKENULL (Q) INSERT (1,Q) for i = 1 to N do: j = DELETEMIN (Q) output (key[j]) if 2j <= heapsize[H] then INSERT (2j,Q) if 2j+1 <= heapsize[H] then INSERT (2j+1,Q) ``` Note that in every iteration, Q grows by at most one, so the size of Q is at most N. Therefore, each operation INSERT and DELETEMIN on Q takes time O(log N). Thus the total time is O(N log N). Question 3 Suffix trees. Part A. Draw the suffix tree for the 17-character text ``SIMPLE SIMON SAYS''. Spaces are characters. Part B. Prove that the number of nodes for any suffix tree on any alphabet is not more than 2n-1, where n (assumed to be at least two) is the length of the input string (i.e., length of the given text). Note: statements such as ``the situation is clearly worst for a binary tree'' must be proven. Rather than going that route, you might try defining N as the (unknown) number of nodes and getting some mathematical equations linking N to n using various types of nodes. Alternatively, induction could be attempted. Solution In the suffix tree, edges are compacted. Let \$ denotes the terminal symbol. There should be n leaves, one per input symbol. For example, the edges emanating from the root have labels bS (b is a blank space), S, IM, M, PLEbSIMONbSAYS\$, LEbSIMONbSAYS\$, EbSIMONbSAYS\$, bS, ONbSAYS\$, NbSAYS\$, AYS\$, YS\$. To prove part B, let n_i denote the number of nodes with i children, and let N be the total number of nodes. Then, ``` n_0 = n (one leaf per input symbol) n_1 = 0 (suffix tree has no 1-child nodes) n_0 + n_1 + ... = N (count nodes) n_1 + 2n_2 + 3n_3 + ... = N-1 (count edges) ``` So, subtracting the last two equations, ``` n_0 - n_2 -2n_3 - 3n_4 ... = 1 ``` and thus, ``` n = 1 + n_2 + 2n_3 + ... >= 1 + n_2 + n_3 + ... = 1 + N-n ``` so that N <= 2n-1.