## Topic #13: Hashing by Open Addressing

### Introduction

Open addressing works on the same principle as other forms of hashing: a hash function h and the key of a datum x map x to slot h(key(x)), and the values are stored in a table. The difference is in what the table holds. Instead of maintaining a pointer to a linked list, each slot contains an actual value of key(x).

### Implementation

The table is T[0..m-1], where m is the table size, and the input is x_1, ..., x_n. The hash function h is given.

We insert x_i in slot h(key(x_i)) of T. However, if that slot is occupied, we try the next slot, and the next, until an empty slot is found.

Example: Say we are given a table with 10 slots, and let our hash function be the last digit of the key. Now we input the values (in order) 19, 33, 43, 53, 21.

INSERT:

We apply our hash function to the first value: the last digit is 9, so 19 goes in slot 9. Next is 33: this goes in slot 3. Now we come to a problem: the next number, 43, should also go into slot 3. We have no linked lists, so where does it go? In open addressing, a value that is slated for an occupied slot goes into the next empty slot. 43 therefore goes into slot 4. For 53: slot 3 is filled, so we go to slot 4; slot 4 is filled, so we go to slot 5; slot 5 is empty, so 53 goes into slot 5.

The table is usually made circular, so that a value meant for the last slot (when it is occupied) wraps around to the front of the table. So if we were to insert 99, we would find slot 9 occupied, and 99 would be sent to slot 0, which is empty.

Note that if more values are entered than there are slots, the table runs out of room and no further insertions are possible. For this reason, the load factor a in open addressing can never exceed 1.0.
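The INSERT walkthrough above can be sketched as a short Python routine (a minimal illustration, not a full data structure; the table size and last-digit hash function follow the example):

```python
EMPTY = None
m = 10
table = [EMPTY] * m

def insert(table, key):
    """Place key in the first empty slot of its linear probe sequence."""
    start = key % 10              # h(k): the last digit of the key
    for i in range(m):
        j = (start + i) % m       # circular: wrap past the last slot
        if table[j] is EMPTY:
            table[j] = key
            return j
    raise RuntimeError("table is full")   # load factor has reached 1.0

for k in (19, 33, 43, 53, 21):
    insert(table, k)
# 19 -> slot 9, 33 -> slot 3, 43 -> slot 4, 53 -> slot 5, 21 -> slot 1
insert(table, 99)                 # slot 9 is occupied: wraps to slot 0
```

Running this leaves the table exactly as described in the text, with 99 in slot 0 after wrapping around.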

DELETE:

Say we want to delete 43. First we go to slot 3: it's occupied, but not by 43, so we go on to the next slot. We continue until we find either the number we're looking for (success), an empty slot, or we arrive back where we started (failure). In this case, the very next slot holds 43, so we remove it and mark the slot as "deleted" rather than empty. (We'll see why in a moment.)

SEARCH:

Now we will search for 53. We check the first slot indicated, slot 3. It's occupied, but not by 53, so we move on. Next comes slot 4, which holds no value. Should we stop? If slot 4 were truly empty, we would know 53 cannot lie beyond it. This is why we marked slot 4 "deleted" instead: we know there was something here, so 53 may still be beyond this point. We move on to the next slot, and we have located 53.
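The DELETE and SEARCH walkthroughs can be sketched the same way. This Python fragment (an illustration under the same assumptions as the insertion example) uses a distinct DELETED marker so that a search for 53 does not stop early at the slot freed by deleting 43:

```python
EMPTY, DELETED = object(), object()   # two distinct sentinel markers
m = 10
table = [EMPTY] * m

def probe(key):                       # linear probe sequence for key
    return ((key % 10 + i) % m for i in range(m))

def insert(table, key):
    for j in probe(key):
        if table[j] in (EMPTY, DELETED):  # deleted slots are reusable
            table[j] = key
            return j
    raise RuntimeError("table is full")

def delete(table, key):
    for j in probe(key):
        if table[j] is EMPTY:         # truly empty: key cannot be further on
            return False
        if table[j] == key:
            table[j] = DELETED        # tombstone, NOT empty
            return True
    return False

def search(table, key):
    for j in probe(key):
        if table[j] is EMPTY:         # stop only at truly empty slots
            return None
        if table[j] == key:           # DELETED slots are skipped over
            return j
    return None

for k in (19, 33, 43, 53, 21):
    insert(table, k)
delete(table, 43)                     # slot 4 becomes DELETED
search(table, 53)                     # probes 3, 4 (deleted), 5: found
```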

### Probing and the Probe Sequence

Probing is simply the name for the slot-by-slot inspection used in our search and insert routines: we "probe" the table until we find an available slot. Due to the (usually) circular nature of the table, we have to limit the number of slots the algorithm will examine. Since there are only m slots in the table, we limit the algorithm to m probes.

The probe sequence is the permutation of 0,1,...,m-1 used for searching for an available slot when inserting an element. For a key k the sequence is denoted by:

h(k,0), h(k,1), ..., h(k,m-1)
Thus, h(k,0) is what we used to call h(k). The probe sequence used in the simple introductory example is simply
h(k), h(k) + 1, ..., h(k) + m - 1
all mod m (to force the circular inspection of the array). Clearly, this is a permutation, but not a very good one (we'll see why in a moment).

Linear Probing:

This is more or less self-explanatory. In linear probing, the slots in the hash table are probed in a linear fashion, in order (i.e. check slot 4, then slot 5, then 6, then 7, etc.). We express linear probing as

h(k,i)=[h(k) + c*i] mod m
where c is an integer constant. While the slots are probed in a linear fashion, they are not necessarily probed in consecutive order: for c > 1 the probe jumps by c slots each time, and c must share no common factor with m, or some slots are never examined. Note that we take the result mod m to account for the (usually) circular nature of the table. One of the problems with linear probing is called primary clustering: "blocks" of occupied slots tend to accumulate, and the larger a block is, the faster it grows, by the principles of probability. This increases the average search time in the table.
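As a quick illustration (assuming nothing beyond the formula above), here is the probe order for a few values of c with m = 10:

```python
def probe_sequence(hk, c, m):
    """Slots examined by linear probing h(k,i) = (h(k) + c*i) mod m."""
    return [(hk + c * i) % m for i in range(m)]

probe_sequence(3, 1, 10)  # [3, 4, 5, 6, 7, 8, 9, 0, 1, 2] -- strict order
probe_sequence(3, 3, 10)  # [3, 6, 9, 2, 5, 8, 1, 4, 7, 0] -- every slot, new order
probe_sequence(3, 2, 10)  # [3, 5, 7, 9, 1, 3, 5, 7, 9, 1] -- gcd(2,10) > 1:
                          # only 5 distinct slots, not a permutation
```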

Random Probing:

Random probing makes use of a pseudo-random function to "jump around" in the hash table. The random probe sequence is

h(k,i) = (h(k) + d(i)) mod m
where d(0), ..., d(m-1) is a permutation of 0, ..., m-1. Each d(i) is an integer, generated for each i by a recursive function. The recursive definition of d is
d(0) = 0

d(i+1) = [a*d(i) + 1] mod m
where a is a carefully chosen integer. We choose a so as to ensure that every number in 0, ..., m-1 eventually appears as some d(i). It is known that such a recursively defined sequence is a permutation of 0, 1, ..., m-1 if and only if a - 1 is a multiple of every prime divisor of m, where 4 is also treated as "prime" when it divides m.

Example 1: m = 10, a = 3. We get the sequence d = 0 1 4 3 0 1 4 3 0 ..., which is not a permutation. The prime divisors of 10 are 2 and 5, so a - 1 must be a multiple of both, i.e. of 2 * 5 = 10. If we choose a = 11, we get the sequence 0 1 2 3 4 5 6 7 8 9 0 1 2 ..., so a = 11 is a reasonable choice.

Example 2: m = 100. Following the rule above, m has "prime" divisors 2, 4 and 5 (4 counts because it divides 100). To have a permutation, we must pick a - 1 as a multiple of 4 * 5 = 20. Thus, a could be 1, 21, 41, 61, 81, 101, and so forth.
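The permutation condition is easy to check mechanically; this Python sketch reproduces both examples above:

```python
def d_sequence(a, m):
    """First m values of d(0) = 0, d(i+1) = (a*d(i) + 1) mod m."""
    seq, d = [], 0
    for _ in range(m):
        seq.append(d)
        d = (a * d + 1) % m
    return seq

def is_permutation(a, m):
    return sorted(d_sequence(a, m)) == list(range(m))

d_sequence(3, 10)        # [0, 1, 4, 3, 0, 1, 4, 3, 0, 1] -- cycles, bad
is_permutation(11, 10)   # True: a - 1 = 10, a multiple of 2 and 5
is_permutation(21, 100)  # True: a - 1 = 20, a multiple of 4 and 5
```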

Note: The name "random probing" is a bit of a misnomer, since the sequence is deterministic. But to the untrained eye the sequence appears random, so the name stuck.

Double Hashing:

Double hashing makes use of two functions to probe the table. This is usually the best method, since it has many of the characteristics of a random permutation. We express double hashing as

h(k,i) = [h(k) + i*h2(k)] mod m

where h2 is an auxiliary hash function. Here we must make sure that h2(k) and m are relatively prime (i.e. gcd(h2(k), m) = 1), or the probe will only examine part of the table (for the same reasons as in random probing, above).
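A short Python sketch of double hashing. The auxiliary function h2(k) = 1 + (k mod (m - 1)) is one common illustrative choice (an assumption here, not the only option): with m prime, h2(k) lies in 1..m-1 and is therefore relatively prime to m, so every probe sequence covers the whole table.

```python
m = 11                                 # a prime table size

def h(k):  return k % m                # primary hash
def h2(k): return 1 + (k % (m - 1))    # never 0, always coprime to prime m

def probe_sequence(k):
    """Slots examined by double hashing h(k,i) = (h(k) + i*h2(k)) mod m."""
    return [(h(k) + i * h2(k)) % m for i in range(m)]

probe_sequence(14)  # starts at h(14) = 3 and steps by h2(14) = 5:
                    # [3, 8, 2, 7, 1, 6, 0, 5, 10, 4, 9] -- a full permutation
```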

Implementation: an algorithm for searching a hash table (from Cormen, Leiserson and Rivest's Introduction to Algorithms).

```
HASH-SEARCH(T, k)
  i <- 0
  repeat
      j <- h(k, i)
      if T[j] = k
          then return j
      i <- i + 1
  until T[j] = NIL or i = m
  return NIL
```
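The pseudocode translates almost line for line into Python. This sketch assumes linear probing for h(k, i) and None for NIL, and runs against the table built in the introductory example (including the wrapped-around 99):

```python
def hash_search(T, k):
    """Return the slot holding k, or None if k is absent."""
    m = len(T)
    i = 0
    while True:
        j = (k % m + i) % m          # h(k, i): linear probing
        if T[j] == k:
            return j
        i += 1
        if T[j] is None or i == m:   # empty slot or whole table examined
            return None

table = [99, 21, None, 33, 43, 53, None, None, None, 19]
hash_search(table, 53)  # 5: probes slots 3, 4, 5
hash_search(table, 99)  # 0: probes slot 9, wraps to slot 0
hash_search(table, 7)   # None: slot 7 is empty, so 7 cannot be present
```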

### Comparison with Chaining

We assume that:

1. The probe sequence is a truly random permutation for every data item placed in the table. (Note that this does NOT mean that we use "random probing"; it is only a theoretical assumption.)
2. The computation of a hash function takes 1 time unit, independent of the value of the key, m, or n.
3. n / m = a is the load factor. (Thus, we assume that the table size grows with the input size.)

Unsuccessful search: The time taken for an unsuccessful search is 1 + the number of occupied slots probed until we find an empty slot.

Probability(>= 1 occupied slot) = n/m = a

Probability(>= 2 occupied slots) = (n/m) * ((n-1)/(m-1)) <= a^2

Then,

Expected Time = 1 + 1*P(exactly 1 occupied) + 2*P(exactly 2) + 3*P(exactly 3) + ...

which can be rewritten as a sum of tail probabilities:

 = 1 + P(>= 1 slot) + P(>= 2 slots) + P(>= 3 slots) + ...
<= 1 + a + a^2 + a^3 + ...

and this geometric sum is at most 1/(1-a). Just as with chaining, unsuccessful search in open addressing takes O(1) expected time under the same computational model (assumption 2), provided a is bounded away from 1.
However, as
1 / (1 - a) > 1 + a
open addressing is slightly less efficient than hashing with chaining. For small a, the difference is negligible. Open addressing is more efficient storage-wise.
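Plugging in a few load factors makes the comparison concrete (this simply evaluates the two formulas above):

```python
for a in (0.1, 0.5, 0.9):
    open_addressing = 1 / (1 - a)   # expected probes, open addressing
    chaining = 1 + a                # expected probes, chaining
    print(f"a = {a}: open addressing {open_addressing:.2f}, "
          f"chaining {chaining:.2f}")
# a = 0.1: 1.11 vs 1.10 -- negligible difference
# a = 0.5: 2.00 vs 1.50
# a = 0.9: 10.00 vs 1.90 -- open addressing degrades sharply as a -> 1
```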

### Java applet

This applet simply displays a comparative insertion sequence using both linear and random probing. The example ignores the fact that open-address hash tables should only be filled to around 20% of capacity.

The linear probe sequence is: h(k,i) = (h(k) + i) mod 10
The random probe sequence is: h(k,i) = (h(k) + d(i)) mod 10, where d(i+1) = (7*d(i) + 1) mod 10

#### References

• Cormen, T. H., Leiserson, C. E., and Rivest, R. L. Introduction to Algorithms. McGraw-Hill, 1990.
• Weiss, M. A. Data Structures and Algorithm Analysis in C.
• Pieprzyk, J. and Sadeghiyan, B. Design of Hashing Algorithms. Springer-Verlag, 1993.
• Alagar, V. S. Fundamentals of Computing: Theory and Practice. Prentice-Hall, 1989.
• Aho, A. V. Fundamentals of Computer Science. Science Press, 1992.
• Litwin, W. Linear Hashing: A New Tool for File and Table Addressing. IEEE, 1980.
• Larson, P.-A. Linear Hashing with Separators: A Dynamic Hashing Scheme Achieving One-Access Retrieval. ACM Transactions on Database Systems, Vol. 13, No. 3, September 1988.
• Ramakrishnan, R. Relational Databases: Design and Implementation. Course Notes of CS564, Chapter 4. Nov. 1994.
• Mohan, C. ARIES/LHS: A Concurrency Control and Recovery Method Using Write-Ahead Logging for Linear Hashing with Separators. IBM Research Report, IBM Almaden Research Center. Nov. 1990.

### Web page creators

Last updated March 17, 1997 by Jean-Luc Manigat

Copyright © 1997, Jaime Kahn, Jean-Luc Manigat, Shawn Bonkowski, and Matthew Findlay. All rights reserved. Reproduction of all or part of this work is permitted for educational or research use provided that this copyright notice is included in any copy. Violators will be beaten with a pillowcase full of tape dispensers. Disclaimer: this collection of notes is experimental, and does not serve as-is as a substitute for attendance in the actual class. Don't say we didn't warn you.