Answers for fifth set of practice questions--CS251--Winter 1999


This set contains practice questions on hashing, disjoint set structures and graphs.

Question 1 Disjoint sets. Consider an ADT "disjoint set" in which the elements are the integers from 1 through n, and the operations are makeset, findset, union, and report, where "report" returns all members in a set. If we wish the operations to take O(1), O(log n), O(log n) and O(k+1) worst-case time respectively, where k is the number of elements reported, what data structure would you suggest?
Answer Let us keep the elements both in a parentpointer tree and in a circular single linked list. There are as many of these lists as there are trees in the forest. For reach element i, we thus keep pointers p[i] (parent in the parentpointer tree), and next[i] (next element in the circular list), and a size[i], the size of the tree, which needs only be correct for the root. Consider the operations in a nutshell, with unions by size (small is a subtree of large).
    makeset (i)
     p[i]=i; next[i]=i
    findset (i)
     while p[i]!=i do i=p[i]; return (i)
    link (i,j) (links sets with roots i,j)
      if i=j then return;
      if size[i]>size[j] then p[j]=i else p[i]=j
      k=next[i]; next[i]=next[j]; next[j]=k;
    union (i,j)
      link (findset(i), findset(j))
    report (i)
      j=i
      repeat { output (j) ; j=next[j] } until j=i
Question 2 Disjoint sets, graphs. We are given an array with n entries that are supposed to be links to parents in a forest of parentpointer trees. Write an algorithm that checks in O(n) worst-case time whether these pointers indeed define a true forest of parentpointer trees. Hint: look at this as a directed graph problem.
Answer So, we were given n pairs (x,p[x]), where p[x] denotes the parentpointer at node x. It is possible that x=p[x] for some x (as in parentpointer trees). Let (x,p[x]) define a directed edge in a directed graph. Run DFS and check is the grapg is acyclic. As DFS takes time O(n) in this case (there are n nodes and n edges), it is clear that we can solve this problem in O(n) time. As each node has outdegree 1, we do not even need to set up the dreaded adjacency lists that we normally use for storing graphs. Caveat: we must alter DFS slightly and instruct it to ignore edges (x,p[x]) with x=p[x] (which normally would be considered cycles), but this is easy to do.
Question 3 Hashing. Explain how you could implement a data structure based on hash tables for dealing with queries in a text file of the nature: find the first occurrence of a given 4-letter word, or find all occurrences of a given 4-letter word. The text file has 10 million characters. Each character has 8 bits and thus has a numerical value between 0 and 255. Please provide specifics, including the size of a hash table, and the kind of hash function used. Note: we are not interested in k-letter words with k not equal to 4.
Answer As instructed, we will preprocess this text file, and compute hash values for all consecutive 4-tuples (there are n-3 such 4-tuples for a text with n characters). We will place these in a hash table with chaining. The cells in the chains have two fields: next[.] indicates the next element in the chain, and textpointer[.] points to the position in the text where the 4-tuple starts (an integer between 1 and n-3). Now, for good performance, we should have chains of average length one, so the table size m should be about n = 10 million. Let us take m = 2^{23} (which is just over 8 million), and for a 4-tuple (x,y,w,z), consisting of 32 bits, let us cook up a hash function h(x,y,w,z) with values consisting of 23 bits (so that they are in the range 0...m-1, as needed). As an example, take all 8 bits of y, followed by the middle 6 bits of w, followed by the last 4 bits of z and then the last 5 bits of x. Interpret this 23-bit sequence as an integer. The chain for (x,y,w,z) contains (pointers to) all 4-tuples in the text that are exactly (x,y,w,z). It may contain others as well, but on average, given a randomly drawn 32-bit sequence (x,y,w,z), the search time is O(1).
Question 4 Hashing. Tough theoretical question: suppose I have n elements and a very good hash function. I have a hash table with m=n^(1.5) double slots, that is, each hash table slot can hold two data points. We perform insert and search in the obvious manner, checking both slots if necessary, but we do not implement any collision resolution as in open addressing. Instead, we have an overflow linked list for those elements that do not fit. This list accepts all overflow elements, regardless of where they come from. Clearly, one must search the overflow list after having inspected a slot. Show that the expected unsuccessful search time is O(1). In other words, show that the expected number of elements in the overflow list is O(1). (Don't feel too bad if you can't get this one.)
Answer If an element ends up in the overflow list, it should be clear that this indicates the existence of a triple (i,j,k) with h(i)=h(j)=h(k), where h(.) is the hash function. For a FIXED triple (i,j,k), the chance of this happening is 1/m^2 = 1/n^3 (as the hash functions act as random numbers, and h(j) comes up h(i) with probability 1/m, and similarly for h(k)). Now, the number of triples is clearly not more than n^3 (clearly, we have i,j,k unequal, but ignoring this gives an upper bound). So, the expected number of triples with h(i)=h(j)=h(k) is not more than n^3 / n^3 = 1. The expected size of the overflow list is not more than 1!
Question 5 Graph traversals. A star graph is a connected undirected graph with a core node and any number of arms of arbitrary length emanating from the core node. A graph G is given to you in adjacency list format, with nodes numbered 1 through n. Give an algorithm that determines in O(n) time whether the given graph is a star graph (note: you do not know which node is the core, and the number of edges, e, may be much larger than n).
Answer
    If n=1, return success (we have a star graph).

    Perform a preliminary count of the degrees (number of cells in the adjacency lists), stopping as soon as the running total exceeds 2(n-1), as the number of edges in any tree cannot exceed n-1.

    If the sum of the degrees is not 2(n-1), return a failure. (We do not even have a tree)

    If there is more than one node of degree > 2, return a failure.

    Let x be any node of maximal degree (note that it is unique if this maximal degree is > 2). It should be the core. Run a DFS started at x, and stop with failure as soon as a gray node is detected among the neighbors of a node (a cycle!), OR as soon as the number of white neighbors of a node (the core excepted) is more than one (not a star graph!).

    If we survived all this testing, return a success. Verify that each step takes time O(n).


Copyright © 1999 Luc Devroye. Email: luc@cs.mcgill.ca.