Topic #12: HASHING BY CHAINING

There are three main implementations of the abstract data type dictionary:
1. Binary Search Trees (Pointer-Based)
2. Tries and Suffix trees.
3. Hashing

Hashing with Chaining

Hashing with chaining is an application of linked lists and provides an approach to collision resolution. In hashing with chaining, the hash table contains linked lists of elements, or pointers to elements (Figure 12.1). The lists are referred to as chains, and the technique is called chaining. It is a common technique where a straightforward implementation is desired and maximum efficiency isn't required. Each linked list contains all the elements whose keys hash to the same index. Using chains speeds up search by dividing the set to be searched into many smaller pieces. There is nothing inherently wrong with linear search on small sequences; it just gets slower as the sequences to be searched get longer. With this approach to resolving collisions, each sequence of elements whose keys hash to the same value stays relatively short, so linear search remains adequate.

The hash table is an array [0 .. m-1] of linked lists.
The table entries are called buckets or slots, and the linked lists are called chains.

An element x is placed in linked list number h[x], where h is the hash function.
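The structure above can be sketched as follows. This is a minimal illustration, with invented class and method names, not a production implementation:

```python
# Minimal sketch of hashing with chaining; all names are illustrative.
# The table is an array of m buckets, and each bucket holds the chain
# of elements whose keys hash to that index.

class ChainedHashTable:
    def __init__(self, m):
        self.m = m                            # table size
        self.table = [[] for _ in range(m)]   # m empty chains

    def h(self, x):
        return x % self.m                     # the mod hash function

    def insert(self, x):
        self.table[self.h(x)].append(x)       # x joins chain number h[x]

    def search(self, x):
        return x in self.table[self.h(x)]     # linear search of one chain

t = ChainedHashTable(10)
for x in (15, 25, 7):
    t.insert(x)
# 15 and 25 collide (both hash to bucket 5) and share a single chain.
```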

Choice of Hash Function

Choice of h: h[x]

• must be simple
• must distribute the data evenly

Choice of m: m should approximate n (≈ 1 item per linked list), where n = input size.

e.g.

1. MOD FUNCTION

x is an integer value.
h[x] = x mod m

For a school phone number such as 398-3738, choosing the last three digits (738) is more appropriate than the first three digits (398), as the last digits distribute the data more evenly. To extract them, use the mod function:
h[x] = x mod 10^k gives the last k digits
h[x] = x mod 2^k gives the last k bits
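A quick numeric check of these two rules, using the phone number above:

```python
# Hedged demo: x mod 10^k keeps the last k decimal digits, and
# x mod 2^k keeps the last k bits.

x = 3983738                     # 398-3738 written as an integer

last_three_digits = x % 10**3   # keeps 738
last_eight_bits = x % 2**8      # keeps the low 8 bits of x
```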

2. MIDDLE DIGITS OF AN INTEGER

This often yields unpredictable (and thus good) distributions of the data.
Assume that you wish to take the two digits three positions from the right of x.

If x = 539 872 178
then h[x] = 72

This is obtained by
h[x] = (x div 1000) mod 100
where dividing by 1000 drops the last three digits and taking mod 100 keeps the next two digits.
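The same computation as a small sketch (the function name is ours):

```python
# Middle-digits hash: integer-divide by 1000 to drop the last three
# digits, then take mod 100 to keep the next two.

def middle_digits(x):
    return (x // 1000) % 100

value = middle_digits(539872178)   # drops 178, keeps 72
```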

3. ORDER-PRESERVING HASH FUNCTION

• x < y implies h[x] <= h[y] (order preserved by the hash)
• (Application: sorting)

4. PERFECT HASH FUNCTION

• A perfect hash function is one that causes no collisions.
• Perfect hash functions can be found only under certain conditions.
• One application of the perfect hash function is the static dictionary.
• h[x] is designed after having peeked at the data.
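One naive way to "design h after peeking at the data" is to search for a modulus under which the known keys collide nowhere. This brute-force sketch is only illustrative; practical perfect-hashing constructions are far more sophisticated:

```python
# Brute-force search for a perfect mod hash on a known, static key set.
# Illustrative only: real perfect-hashing schemes are much cleverer.

def find_perfect_modulus(keys):
    m = len(keys)                  # start from the smallest possible table
    while len({k % m for k in keys}) < len(keys):
        m += 1                     # collisions: try a bigger table
    return m

keys = [10, 23, 47, 81]
m = find_perfect_modulus(keys)     # x mod m is collision-free on keys
```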

Analysis

Here we analyse the number of probes in the table for doing "search". A probe is a check of a table entry. All other components of the algorithm are ignored; since they are collectively bounded by the number of probes, it suffices to consider probes alone.

n = input size, m = table size, and a = LOAD FACTOR = n/m

- WORST CASE: (number of probes for search/insert)

T = Theta(n) (all elements are in one linked list)

- AVERAGE CASE: Assume h[key[x]] spreads the input independently and randomly over the m buckets.

"UNSUCCESSFUL SEARCH"
When the element searched for is not in the table.

Number of probes = length of the linked list h[k] + 1.
Expected time = n/m + 1 = a + 1,
where a is n/m, the '+1' accounts for the check of NIL at the end of the chain, and k is uniformly and randomly distributed over all values in the table.

SPACE
Space is measured relative to the input size.

SPACE / input size = (m+n)/n = 1 + 1/a

"SUCCESSFUL SEARCH"
When the element searched for is in the table.

x1, x2, ..., xn: input

ti = expected number of probes to find xi
   = 1 + (i-1)/m

where (i-1)/m comes from the time needed to insert the i-th element into a table already holding i-1 elements; by the unsuccessful-search analysis, this expected time is (i-1)/m.

On average, the time to find an element in the table is
(1/n) * sum over i of (1 + (i-1)/m) = 1 + n(n-1)/(2nm) = 1 + (n-1)/(2m) <= 1 + a/2.
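A quick numeric check of this average, with parameters chosen arbitrarily:

```python
# The average over i of 1 + (i-1)/m equals 1 + (n-1)/(2m), which is
# bounded by 1 + a/2 where a = n/m.

n, m = 1000, 500
a = n / m
average = sum(1 + (i - 1) / m for i in range(1, n + 1)) / n
closed_form = 1 + (n - 1) / (2 * m)
```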

Remark

Provided that 1/2 < a < 2, O(1) is the average time required for search/insert/delete operations.

You may need to periodically rehash when a leaves the desired range during the life of the hash table. Rehashing means junking the old table and creating a whole new table of, say, double or half the size.

• Why is this OK?

Because a rehash takes Theta(n) probes, but between two consecutive rehashes Theta(n) insertions or deletions must occur, each costing O(1) expected probes. Charging the cost of the rehash to those operations, the total number of probes stays proportional to the number of operations, so rehashing does not change the overall O(1) average.
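A sketch of this rehashing policy; the thresholds and function names are ours, following the remark above:

```python
# When the load factor n/m exceeds 2, junk the old table and rebuild
# into one of double the size; the symmetric shrinking case is analogous.

def rehash(table, new_m):
    new_table = [[] for _ in range(new_m)]
    for chain in table:
        for x in chain:
            new_table[x % new_m].append(x)   # re-insert under the new h
    return new_table

def insert(table, x):
    table[x % len(table)].append(x)
    n = sum(len(chain) for chain in table)
    if n / len(table) > 2:                   # load factor a left the range
        table = rehash(table, 2 * len(table))
    return table

table = [[] for _ in range(2)]
for x in range(10):
    table = insert(table, x)                 # table doubles as it fills
```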

Bucket Sort

The hashing is done by an order-preserving hash function, so between linked lists the elements are in sorted order.

The table contains buckets: small arrays of entries that offer a simple way to resolve collisions. Each bucket contains the objects whose keys hash to that location's index in the table. A hash value selects one of these buckets, which is then searched sequentially until the target object is located or an empty slot is encountered. If the selected bucket is empty, or none of its objects have keys matching the target key, the object is simply added to the end of the bucket.

• h[k] = floor((k / max_value) * m)
• 0 <= k/max_value < 1, so h[k] belongs to {0, 1, 2, ..., m-1}
• And k < l implies h[k] <= h[l], so the hash is order-preserving
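Putting the pieces together, a small bucket-sort sketch; the function name is ours, and keys are assumed to lie in [0, max_value):

```python
# Bucket sort with the order-preserving hash h[k] = floor(k/max_value * m):
# buckets are in key order, so sorting each bucket naively and then
# concatenating the buckets yields the fully sorted sequence.

def bucket_sort(keys, max_value):
    m = len(keys)                    # about one element per bucket on average
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[int(k / max_value * m)].append(k)   # order-preserving hash
    result = []
    for b in buckets:
        result.extend(sorted(b))     # naive sort of each short list
    return result

data = [29, 3, 77, 50, 12, 91]
ordered = bucket_sort(data, 100)
```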

Analysis

If the keys are uniformly distributed over the table, then bucket sort takes expected time O(n), under a computational model in which the computation of a hash function takes O(1) worst-case time.

Steps

1. Construct the hash table (O(n)).
2. Sort each linked list by a naive method: this is bad in the worst case, but good on average, giving expected O(n) overall.

Comparison of "chaining" with "open addressing"

In open addressing, we assume that insert and search are done by randomly jumping from location to location, as in a random permutation of 0, 1, ..., m-1.

The expected number of probes for an unsuccessful search is at most 1/(1-a), where 0 < a < 1.

In conclusion, in terms of the expected number of probes, for a fixed a, chaining is always better than open addressing.
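The claim is easy to verify numerically, since (1 + a)(1 - a) = 1 - a^2 <= 1 implies 1 + a <= 1/(1 - a):

```python
# Expected unsuccessful-search probes: 1 + a for chaining versus
# 1/(1 - a) for open addressing; chaining never loses for 0 < a < 1.

load_factors = [0.1, 0.25, 0.5, 0.75, 0.9]
chaining = [1 + a for a in load_factors]
open_addressing = [1 / (1 - a) for a in load_factors]
```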


For additional information on Hashing by Chaining, visit these selected web sites:

• Chaining for Collision Resolution: Implementation of hashing, direct addressing, chaining for collision resolution, and open addressing. From Oberlin College Computer Science. Rhys Price Jones, Fritz Ruehr, and Richard Salter.
• Direct Chaining: Algorithms on direct chaining, search, and insert. From the Department of Computer Science, University of Chile. Gaston H. Gonnet, Ricardo Baeza-Yates.
• Separate Chaining: Algorithms on separate chaining, search, and insert. From the Department of Computer Science, University of Chile. Gaston H. Gonnet, Ricardo Baeza-Yates.
• Object Hashing: A mechanism for generating a unique hash number for an arbitrary object; from the MIT Scheme object-hashing facility. Provides a small pseudocode of a hash table.
• Hashing Function: An overview of hashing functions. From the MIT LCS Clinical Decision Making Group.


Web page creators

Work: Web page design.

Ho-Sheng Cheng
Work: Java Applet.
Contact: snowman@cs.mcgill.ca

Hyung Sunt Kim
Work: Graphics & References.
Contact: hkim@cs.mcgill.ca

Last updated March 4, 1997 by Kiadi Pati