McGill University:
School of Computer Science
Winter 1997 Class Notes for 308-251B

DATA STRUCTURES AND ALGORITHMS

Topic #24: DISJOINT SET STRUCTURES


Table of contents


Definition

Disjoint sets structures are yet another new abstract data type (A.D.T.) for which we must now define a mathematical model and a series of related operations. Disjoint sets, sometimes called partitions, are the model of this A.D.T. A partition is a collection of disjoint sets such that the union of the sets is the entire space. Thus every element belongs to one and only one set as shown below.





Only the following operations are associated with this A.D.T. :

MakeSet (x)
MakeSet (x) creates a new set containing only the element x.


Union (x,y)
Union (x,y) joins the sets containing x and y into one set.


FindSet (x)
FindSet (x) returns a pointer to the set that contains the element x.


Observe that the standard dictionary operations are missing. For example, search (x) and delete (x) are not in the list of operations. Also, sets can never be split or shrink in size.

To simplify things, we make two convenient assumptions :

  1. The elements are numbered 1, 2, 3, ..., n. (Thus, accessing element i is immediate. Note that if the elements had names, then some amount of searching would be necessary).


  2. The name of a set is the index of (or pointer to) the representative element of that set. Representatives are marked in any fashion, but each set should have only one. (This assumption avoids the messy problem of having to assign names to the sets).


Applications of disjoint set structures

Here are three situations in which disjoint set structures are used :

1) EQUIVALENCE RELATIONS AND EQUIVALENCE CLASSES

Recall that a relation on a set A is called an equivalence relation if it is reflexive, symmetric, and transitive. The operation congruence modulo n is an equivalence relation. The statement a ~ b indicates that elements a and b are in the same set. It follows that to create the equivalence classes (disjoint sets) of a given collection of equivalence relations, we must proceed as such :



A good example of this application would be the equivalence relation congruence modulo 5. In this situation, elements {0,5,10,...} are in the same equivalence class. The usual representative of the class is 0. Element 7 is not in the set whose representative is 0.

The following animated graphic is a second equivalence relation example that uses pixels as elements.

In the preceding image, A, B and C are three equivalence classes. A pixel is an element of a class if it is within that class' boundaries. Furthermore, two pixels are equivalent if they are both in the same set.

2) THE FORTRAN PROGRAMMING LANGUAGE

In this particular programming language, it is possible to reserve the same address for more than one variable. The variables can even be of different types. This gives the user the ability to manipulate for instance, integer variables in character format. The Fortran compiler intrinsically has to determine the partition of the elements (in this case the variables) from a given set of equivalence relations.

EQUIVALENCE (A, B, C) , (D, E) , (E, F)

This allows int A, char B and float C, for example, to hold the same value at the same memory address. If the contents of variable B are changed at some point in the execution of the program, then so will be the contents of variables A and C.

3) COLORING OLD MOVIES

Every frame of an old black and white movie is full of pixels. In order to color this movie, we would need to assign a color to each pixel in such a fashion that equivalent pixels have the same color. Equivalent pixels are those that logically belong together as, for example, the ones representing an actor`s sweater. On a frame, adjacent pixels with either identical or slightly different gray levels may very well be defined as equivalent. Therefore, a gray level frame of pixels defines possibly thousands if not millions of equivalence relations between adjacent pixels. In order to separate the equivalence classes of these relations, a 4-pixel square scan technique is used to analyse pixel x's neighbors. This scanning and thus entering of the equivalence relations between pixels is done from top left to bottom right, row by row.

The outer region of the U shape is colored in yellow. The leftmost upper edge is colored in a different color, here green. Once the rightmost edge is reached by the scan, it is filled in yet another different color, here blue. Finally, once the lower rightmost corner is reached, it is first colored blue but then changed to green as the scan only then "realises" that there is one figure and not two distinct elements. Therefore, all previously blue colored pixels are changed to green as every pixel forming the U shape is in the same distinct set. This coloring problem expands to a 3-dimensional one once more frames are considered and colored.


Data structures

TAGS OF SETS

A) In Array State

We can setup a table of n rows and 2 columns. The first column will hold the n elements whereas the second column will contain the representative of the set in which the current element is found. The following is an example of this structure.
1 3
2 2
3 3
... ...
n 3


Obviously, this setup is considered to be static as it can only hold a maximum number of elements.

B) Linked List

To avoid the static problem that arises in the array form, we can also structure the data in a linked list format. In this implementation, every node points to its representative. This is true even for the representative node, it points to itself.



These two setups allow the FindSet operation to be performed in constant time. However, in the linked list structure, the Union (x,y) operation takes time Omega (n). The operation goes as follows:

Step 1) The nil pointer of the last node of the set with the most elements is changed to point to the representative of the other distinct set.

Step 2) All the representative pointers of the smaller set are changed to point to the representative of the first set.

The following animation illustrates the initial and final state of the Union (2,1) operation.

The above heuristic is called the weighted union heuristic.

BENCHMARK TEST

Before we go on to the second possible implementation of distinct set structures, we can now define a benchmark test for the operations on this A.D.T. Suppose we have n singleton sets. Obviously, to obtain one complete set, we must apply n-1 Union (x,y) operations.

We must consider each step of the transformation from n singleton sets to one final set to understand the time requirements.

  1. FindSet (x) takes O(1) time.


  2. The time to change the nil pointer of the last node of the larger set to point to the representative of the other set.


  3. The time to change k representative pointers.


Union can thus be considered as taking [ O(n) + k representative pointer changes] time units.

Take element x to be in a singleton set. The size of the set is of 1. When it is subject to a Union operation where its representative pointer changes, the size of the new set in which x now lives in is at least twice as large. Let us now assume that the representative pointer of x changes k times then following the previous observation, the size of the new set containing x would have to be greater then or equal to at least 2 to the power k but always smaller than or equal to n.

Therefore, even if the worst case time of one union is O(n), we see that the total time of this benchmark test is :

As for the second data structure implementation, it is based on the philosophy that the Union operation should be fast.


Disjoint set forests

A parent pointer tree represents a distinct set in a partition and is composed of a root, internal nodes and leaves much like most trees. However, the nodes of these trees only have pointers to their parents. The root of one of these trees, the representative of the set, also has a pointer to itself. A disjoint set forest consists of a collection of parent pointer trees taken together representing a collection of sets, a partition.



In this structure, where p[x] is the parent of x, the FindSet operation`s algorithm looks like this.

The time of this operation within this setup is given by the distance of the node x to the root. The worst case time is thus given by the height of the tree.

Within the parent pointer tree implementation, we must define a new operation that will however remain private. This new operation, Link (x,y), also assumes that x and y are distinct set representatives and makes y`s representative pointer point to x as such

This simple operation takes time O(1).



With the preceding two functions as defined, Union (x,y) is simply implemented as :

and also the operation MakeSet is as simple as :

The obvious and unfortunate problem that arises from this parent pointer tree structure is the formation of linear trees which leads to terrible worst case times for the operations. To counter this, two heuristics have been developed.

1) Weighted Heuristic

Small sets are merged into larger sets. This makes the height of any tree smaller than or equal to log to the base 2 of n.

Proof :

Upon creation, all elements are their own representatives. Let us now follow one such element x involved in Union operations. Every time the distance from x to its representative grows by one, the size of the set containing x doubles but can become no greater than log to the base 2 of n, the height of the tree. This is sometimes called the doubling argument.

2) Path Compression

During the FindSet operation, we make each node on the find path point directly to the root.

The code for the FindSet (x) operations looks like this :

This FindSet operation is based on a two-pass method : it first makes one pass up the find path to locate the root and a second pass up the find path to update each node so that it points directly to the root. The time of the clean-up is exactly equal to the time of the activity. For instance, on the given graphic above, the Findset operation takes 6 time units and so, the path compression heuristic also only costs 6 time units.


Running time analysis

We can now end this discussion of distinct set structures by reviewing the time cost of each implementation :
LINK FINDSET BENCHMARK
Linked-List

+ heuristic

O(1)
Parent Pointer Tree O(1)
Parent Pointer Tree

+ heuristics

O(1)




* Actually, the running time of the benchmark for a parent pointer tree with the 2 heuristics is O(n log*n). Log* is defined in Topic 3.


Java applet

This applet illustrates disjoint set structures using the equivalence relation of congruence modulo n. It builds a simili Zn by generating 15 random numbers, in addition to the numbers 0...n - 1, and scanning them for the chosen equivalence. The data structure used is disjoint set forests with contribution from the path compression heuristic. Using direct address hashing, the operations of disjoint set structures are all implemented in time O(1). However, this does not use the pure union algorithm of disjoint set forests as described above, because it uses a special algorithm to process the randomly generated numbers. We recommend that you maximize your browser for best viewing. Start by pressing MakeSet, then press Next Union until all elements are associated. Also, I have not learned threads yet, and time is very scarce, so scrolling too far away will erase the drawings. Enjoy!


Your browser does not support Java, hence the Disjoint Set Structure applet is not currently available to you. Try again using a Java-enabled browser. If you are using Netscape, try enabling Java by accessing Options->Network Preferences->Languages. If you are using MS Internet Explorer, try View->Options->Security and check the box labeled "Enable Java programs".


Here are the source files:


References

  1. Weiss, Mark Allen. Data Structures and Algorithms Analysis in C. Chapter 8. The Benjamin/Cummings Publishing Company, Inc., 1992.


  2. Cormen, Thomas; Leiserson, Charles; Rivest, Ronald. Introduction to Algorithms. Chapter 22. The MIT Press, 1990.


Web links

Disjoint set structures class note New Jersey Institute of Technology Computer Science 435 Course : Topic 24.
Disjoint set structures operation algorithms Mark Allen Weiss' algorithms of operations on disjoint set structures.
Disjoint set structures algorithm Mark Allen Weiss' algorithm on disjoint set structures.







Web page creators


Last updated April 9th , 1997 by Philippe Warda.

Copyright © 1997, Lip Hooi Tan, Philippe Laporte and Philippe Warda. All rights reserved. Reproduction of all or part of this work is permitted for educational or research use provided that this copyright notice is included in any copy. Disclaimer: this collection of notes is experimental, and does not serve as-is as a substitute for attendance to the actual class.