School of Computer Science

Winter 1997 Class Notes for 308-251B

Last modified: March 17, 1997

The table is T[0..m-1], where m is the table size, and the input is x_1, ..., x_n. The hash function h is given.

We will insert x_i in slot h(key[x_i]) of T. However, if that slot is occupied, we try the next slot, and so on until an empty slot is found.

**Example**: Say we are given a table with 10 slots. Let our hash function be the last digit of the key. Now we insert the values (in order) 19, 33, 43, 53, 21.

**INSERT:**

We apply our hash function to the first value: the last digit is 9, so 19 goes in slot 9. Next is 33: this goes in slot 3. Now we come to a problem: the next number, 43, should also go into slot 3. We have no linked lists, so where does it go? In open addressing, a value that is slated for an occupied slot goes into the next empty slot. 43 therefore goes into slot 4. For 53: slot 3 is filled, so go to slot 4. Slot 4 is filled, so go to slot 5. Slot 5 is empty, so 53 goes into slot 5.

The table is usually made circular, so that a value meant to be inserted into the last slot (which is occupied) is sent around to the front of the table. So if we were to insert 99 into the table, we see that slot 9 is occupied, and 99 is sent to slot 0, which is empty.
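The insertions above can be sketched in a few lines of Python (the function and variable names are ours, not from the notes):

```python
# Open-addressing insertion with circular linear probing,
# reproducing the worked example: m = 10, h(k) = last digit of k.
m = 10
T = [None] * m

def h(k):
    return k % 10  # last digit of the key

def insert(T, k):
    for i in range(m):
        j = (h(k) + i) % m      # mod m makes the table circular
        if T[j] is None:
            T[j] = k
            return j
    raise RuntimeError("table full")  # load factor cannot exceed 1.0

for x in [19, 33, 43, 53, 21, 99]:
    insert(T, x)

print(T)  # 99 finds slot 9 occupied and wraps around into slot 0
```

Running this leaves 19 in slot 9, 33 in slot 3, 43 in slot 4, 53 in slot 5, 21 in slot 1, and 99 wrapped around into slot 0, exactly as in the walkthrough.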

Note that if more values are entered than there are slots, we run out of room in the table and can make no more insertions. For this reason, the load factor *a* in open addressing can never exceed 1.0.

**DELETE:**

Say we want to delete 43. First we go to slot 3: it's occupied, but not by 43, so we go on to the next slot. We continue until we either find the number we're looking for (success), reach an empty slot, or arrive back where we started (failure). In this case, the very next slot is occupied by 43, so we remove it and mark the slot as deleted rather than empty. (We'll see why in a moment.)

**SEARCH:**

Now we will search for 53. We check the first slot indicated, slot 3. It's occupied, but not by 53, so we move on. Next is slot 4, which we just marked "deleted". Should we stop? If slot 4 were truly empty, we would know 53 cannot lie beyond it, and the search would end. This is exactly why we marked slot 4 "deleted" instead of empty: we know there was something here, so 53 may still be beyond this point. We move on to the next slot, slot 5, and there we locate 53.
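The delete-and-search behaviour can be sketched with a "deleted" sentinel (a tombstone; the sentinel and function names are ours, not from the notes):

```python
# Search and delete in an open-addressed table, with a DELETED marker
# so that searches probe past removed entries instead of stopping.
m = 10
DELETED = object()  # assumed sentinel distinct from every key and from None

def h(k):
    return k % 10

def search(T, k):
    for i in range(m):
        j = (h(k) + i) % m
        if T[j] is None:                      # truly empty: k cannot be further on
            return None
        if T[j] is not DELETED and T[j] == k:
            return j
        # slot holds another key or a tombstone: keep probing
    return None

def delete(T, k):
    j = search(T, k)
    if j is not None:
        T[j] = DELETED   # mark, don't empty, so later searches continue past

# Table state after inserting 19, 33, 43, 53, 21:
T = [None, 21, None, 33, 43, 53, None, None, None, 19]
delete(T, 43)            # slot 4 becomes a tombstone
print(search(T, 53))     # still found in slot 5, past the tombstone
```

If `delete` had set slot 4 back to `None`, the search for 53 would have stopped at slot 4 and wrongly reported failure.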

The probe sequence is the permutation of 0, 1, ..., m-1 used for searching for an available slot when inserting an element. For a key k the sequence is denoted by

h(k,0), h(k,1), ..., h(k,m-1)

Thus, h(k,0) is what we used to call h(k). The probe sequence used in the simple introductory example is simply

h(k), h(k) + 1, ..., h(k) + m - 1

all mod m (to force the circular inspection of the array). Clearly, this is a permutation, but not a very good one (we'll see why in a moment).

**Linear Probing:**

This is more or less self-explanatory. In linear probing, the slots in the hash table are probed in a linear fashion, in order (i.e., check slot 4, then slot 5, then 6, then 7, etc.). We express linear probing as

h(k,i) = [h(k) + c*i] mod m

where c is an integer. While the slots are probed in a linear fashion, they are not necessarily probed in exact order: when c > 1 the probe jumps c slots at a time, and for the sequence to visit every slot, c and m must be relatively prime. Note that we use mod m to account for the (usually) circular nature of the table. One of the problems with linear probing is called primary clustering: "blocks" of occupied slots tend to accumulate, and the larger a block is, the faster it grows, by the principles of probability. This increases the average search time in the table.
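A quick check of the requirement that c and m be relatively prime (a sketch; the helper name is ours):

```python
# Probe sequences h(k,i) = [h(k) + c*i] mod m for different step sizes c.
# The sequence visits every slot only when gcd(c, m) = 1.
m = 10

def probe_sequence(hk, c):
    return [(hk + c * i) % m for i in range(m)]

print(probe_sequence(3, 1))  # c = 1: all 10 slots, in circular order
print(probe_sequence(3, 2))  # c = 2, gcd(2, 10) = 2: only 5 distinct slots
print(probe_sequence(3, 3))  # c = 3, gcd(3, 10) = 1: all 10 slots, permuted
```

With c = 2 and m = 10 the probe only ever visits slots of one parity, so half the table is unreachable from h(k) = 3.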

**Random Probing:**

Random probing makes use of a pseudo-random function to "jump around" in the hash table. The random probe sequence is

h(k,i) = [h(k) + d(i)] mod m

where d(0), ..., d(m-1) is a permutation of 0, ..., m-1. Each d(i) is an integer, generated for each i by a recursive function. The recursive definition of d is

d(0) = 0

d(i+1) = [a*d(i) + 1] mod m

where a is a carefully picked integer. The way a is chosen is to ensure that each number 0, ..., m-1 eventually appears as some d(i). It is known that such a recursively defined sequence is a permutation of 0, 1, ..., m-1 if and only if a - 1 is a multiple of every prime divisor of m, where 4 is considered as prime.

**Example 1: m = 10, a = 3**. We get the sequence d = 0 1 4 3 0 1 4 3 0 1 4 3 0 ... This probe sequence is not a permutation. The prime divisors of 10 are 2 and 5, so a - 1 must be a multiple of both, i.e., of 2 * 5 = 10. If we choose a = 11, we get the sequence 0 1 2 3 4 5 6 7 8 9 0 1 2 ..., so a = 11 is a reasonable choice.
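The recurrence is easy to run by hand or by machine; this sketch reproduces Example 1 (the function name is ours):

```python
# Generate the random-probing offsets d(0), d(1), ... from
# d(0) = 0, d(i+1) = [a*d(i) + 1] mod m, to check whether
# they form a permutation of 0..m-1.
def offsets(a, m):
    d, seq = 0, []
    for _ in range(m):
        seq.append(d)
        d = (a * d + 1) % m
    return seq

print(offsets(3, 10))   # 0 1 4 3 0 1 4 3 0 1  -- not a permutation
print(offsets(11, 10))  # 0 1 2 3 4 5 6 7 8 9  -- a = 11 works
```

Since 11 - 1 = 10 is a multiple of both prime divisors of 10, the second sequence covers every slot.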

**Example 2: m = 100**. The prime divisors of m are 2 and 5, and since 4 divides 100 we count 4 as prime as well. To have a permutation, we must pick a - 1 as a multiple of 4 * 5 = 20. Thus, a could be 1, 21, 41, 61, 81, 101, and so forth.

Note: The name "random probing" is a bit of a misnomer, since the sequence is a deterministic one. But to an untrained eye the sequence appears random, so the name stuck.

**Double Hashing:**

Double hashing makes use of two hash functions to probe the table. This is usually the best method, since it has many of the characteristics of a random permutation. We express double hashing as

h(k,i) = [h(k) + i*h2(k)] mod m

where h2 is an auxiliary hash function. Here we must make sure that h2(k) and m are relatively prime (i.e., gcd = 1), or the probe will only examine part of the table (for the same reasons as in random probing, above).
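One common way to guarantee gcd(h2(k), m) = 1 is to make m prime; this sketch assumes that choice, and the particular h2 below is an illustration of ours, not a function from the notes:

```python
# Double hashing with a prime table size m, so that any nonzero h2(k)
# is relatively prime to m and every probe sequence is a permutation.
m = 11  # prime table size (our choice for this sketch)

def h1(k):
    return k % m

def h2(k):
    return 1 + (k % (m - 1))  # in 1..m-1, never 0, coprime with prime m

def probes(k):
    return [(h1(k) + i * h2(k)) % m for i in range(m)]

print(probes(25))  # 11 distinct slots: a permutation of 0..10
```

For k = 25 we get h1 = 3 and h2 = 6, and the probe sequence steps through all eleven slots before repeating.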

**Implementation**: an algorithm for searching a hash table (from Cormen, Leiserson and Rivest's *Introduction to Algorithms*).

```
HASH-SEARCH(T, k)
    i <- 0
    repeat
        j <- h(k, i)
        if T[j] = k
            then return j
        i <- i + 1
    until T[j] = NIL or i = m
    return NIL
```
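A direct Python rendering of HASH-SEARCH, with linear probing standing in for h(k, i) (any probe sequence would do; the concrete h is our assumption):

```python
# HASH-SEARCH translated to Python, using h(k, i) = (k + i) mod m,
# i.e. h(k) = k mod m with linear probing.
def hash_search(T, k):
    m = len(T)
    i = 0
    while True:                 # the repeat ... until loop
        j = (k + i) % m
        if T[j] == k:
            return j
        i += 1
        if T[j] is None or i == m:   # until T[j] = NIL or i = m
            return None              # NIL: k is not in the table

T = [None, 21, None, 33, 43, 53, None, None, None, 19]
print(hash_search(T, 53))  # 5
print(hash_search(T, 77))  # None
```

Note that, as in the pseudocode, the search for 53 probes slots 3 and 4 before succeeding at slot 5, while the search for 77 stops as soon as it hits the empty slot 7.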

We assume that:

- The probe sequence is a truly random permutation for every data item placed in the table. (Note that this does NOT mean that we use "random probing"; it is only a theoretical assumption.)
- The computation of a hash function takes 1 time unit, independent of the value of the key, m, or n.
- a = n/m is the load factor. (Thus, we assume that the table size grows with the input size.)

**Unsuccessful search**: The time taken for an unsuccessful search
is 1 + the number of occupied slots probed until we find an empty slot.

Probability(>= 1 occupied slot) = n/m = *a*

Probability(>= 2 occupied slots) = (n/m) * ([n-1]/[m-1]) <= *a*^2

and, in general, Probability(>= i occupied slots) <= *a*^i. Then

Expected Time = 1 + 1*P(exactly 1 slot) + 2*P(exactly 2) + 3*P(exactly 3) + ...

which can be rewritten as the sum

= 1 + [P(1) + P(2) + P(3) + ...]     <- this is P(>= 1 slot) <= a
    + [P(2) + P(3) + ...]            <- this is P(>= 2 slots) <= a^2
    + [P(3) + ...]                   <- this is P(>= 3 slots) <= a^3
    + ...
<= 1 + a + a^2 + a^3 + ... = 1 / (1 - a)

However, as 1 / (1 - a) > 1 + a (the corresponding expected time for an unsuccessful search with chaining), open addressing is slightly less efficient than hashing with chaining. For small a, the difference is negligible. Open addressing is more efficient storage-wise.
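The gap between the two bounds is easy to see numerically (a small illustration of ours):

```python
# Expected probes for an unsuccessful search at various load factors a:
# at most 1/(1-a) for open addressing versus 1 + a for chaining.
for a in [0.1, 0.5, 0.9]:
    open_addr = 1 / (1 - a)
    chaining = 1 + a
    print(f"a = {a}: open addressing <= {open_addr:.2f}, chaining = {chaining:.2f}")
```

At a = 0.5 the bounds are 2 versus 1.5 probes; at a = 0.9 they are 10 versus 1.9, which is why open addressing is normally kept well below full.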

The linear hash function is: h(k,i) = [h(k) + i] mod 10

The random hash function is: h(k,i) = [h(k) + d(i)] mod 10, where d(i+1) = [7*d(i) + 1] mod 10

- The hash table of the HomeBrew C++ library
- The hash function core Ai-SHA-1-1
- The TCL family of Hash functions man pages
- The Hashtbl module using hash tables and hash functions
- A hash table C++ header file
- A scheme hash table
- Hashing in Macaulay 2
- Cryptography: the PGP signature
- Hashing Algorithms
- Implementation of Hashing
- A brief rundown on Hashing
- Tables and Hashing

- Cormen, T. H., Leiserson, C. E., and Rivest, R. L. *Introduction to Algorithms*. McGraw-Hill, 1990.
- Weiss, M. A. *Data Structures and Algorithm Analysis in C*.
- Pieprzyk, J. and Sadeghiyan, B. *Design of Hashing Algorithms*. Springer-Verlag, 1993.
- Alagar, V. S. *Fundamentals of Computing Theory and Practice*. Prentice-Hall, 1989.
- Aho, A. V. *Fundamentals of Computer Science*. Science Press, 1992.
- Litwin, W. "Linear Hashing: A New Tool for File and Table Addressing." IEEE, 1980.
- Larson, P.-A. "Linear Hashing with Separators - A Dynamic Hashing Scheme Achieving One-Access Retrieval." ACM Transactions on Database Systems, Vol. 13, No. 3, September 1988.
- Ramakrishnan, R. *Relational Databases: Design and Implementation*. Course Notes of CS564, Chapter 4, Nov. 1994.
- Mohan, C. "ARIES/LHS: A Concurrency Control and Recovery Method Using Write-Ahead Logging for Linear Hashing with Separators." IBM Research Report, IBM Almaden Research Center, Nov. 1990.

- Jaime Kahn (jaime@cs.mcgill.ca - Graphics)
- Jean-Luc Manigat (milosz@cs.mcgill.ca - References and paging)
- Shawn Bonkowski (sbonk@cs.mcgill.ca - Java Applet)
- Matthew Findlay (shadows@bcs.ca - references and paging)

Last updated March 17, 1997 by Jean-Luc Manigat

Copyright © 1997, Jaime Kahn, Jean-Luc Manigat, Shawn Bonkowski, and Matthew Findlay. All rights reserved. Reproduction of all or part of this work is permitted for educational or research use provided that this copyright notice is included in any copy. Violators will be beaten with a pillowcase full of tape dispensers. Disclaimer: this collection of notes is experimental, and does not serve as-is as a substitute for attendance in the actual class. Don't say we didn't warn you.