McGill University:
School of Computer Science
Winter 1997 Class Notes for 308-251B

DATA STRUCTURES AND ALGORITHMS

Topic #14: Hashing - Overview and Applications


A.D.T dictionary

The possible implementations for A.D.T. dictionary are shown below.

1. Binary Search Trees Pointer based
2. Tries/suffix Trees RAM model
3. Hashing

For tries and for hashing you have to look at the data itself. Not so for Binary search trees. All operations on B.S.Ts can be implemented by using key comparisons. For hashing and tries this is no longer sufficient. The actual values of the keys play a central role in determining where elements are inserted. We shouldn't really compare the complexities that we obtain with B.S.T. with those of hashing. It's like comparing apples and oranges.


Some uses of hashing

1 CD databases

For CD's, what we would like to do, is have a world-wide CD database so that when users put there disk in the CD player, they get a full table of contents on there own computers. These tables are not on the disks themselves. No information about the songs is stored on the cd. It must be downloaded from the database. The problem is that CD's have no ID numbers stored on them. So how can the computer know which CD is in the player? Well, it can use the only information on the CD, the track length, and every cd is different. Basically, a big number is cooked up from the track lengths, a "signature" that identifies our CD. This signature is a value obtained by hashing. For example, a number of length of 8 or 10 hex. digits is cooked up; the number is then sent to the data base, and that data base just looks for the closest match. The reason being, that track length may not be measured exactly.

EX: ae 35 0b ff 35 --> looks in the library for a match Signature

ae 35 0b ff 35 --> Searches database for a matching signature.
signature

The principle in hashing, is that x is put in a table in position h[x]. For the cd, you have a big table, some entries possibly nill. And then placed x in the table.

2 Drivers licences/insurance cards

Exactly the same as the cd example. If you look at drivers licence numbers or insurance card numbers, they are cooked up also from things that never change: date of birth, name, etc.. Again, some digits are taken and placed in a certain place. The numbers are the way they are , so that inside the computers they can be put into a table.

3 Sparse arrays

A sparse array is an array which is mostly 0, and in which a few are non-zero entries. Of course one could store the array, but that would not make a lot of sense. The storage would be proportional to n squared. Another possibility, is to cut it up into huge blocks, so that we would have 1 element per block on the average. The number of blocks is about equal to the number of non-zero elements. Then you store the elements in a table. We store a two dimensional array in a one dimensional array. There is a one to one correspondence between the elements in the array an the elements in the table.

# of blocks ~= # non-zero elements

Let the array be nxn, and let there be N nonzero entries. From the coordinates (i,j) of an array element, we determine an index k in an array by a simple calculation. Thus, we have k=h(i,j) for some function h, called a hash function. All elements with the same hashed value k are put in the k-th slot of a simple table. The storage is proportional to N. Had we stored the array, we would have required storage proportional to nxn.

4 File signatures

File signatures provide a compact means of identifying files. The files may be large, such as audio files, picture files, or text files. We extract a function, h[x], the file signature, which is a property of the file.

As an example, if we have a binary file, the signature could consist of bits# { 17, 21, 1003, 905, 904, and 3}, a six bit signature. This signature is very compact. Another possibility for a file signature is the name. Storing files by name is customary, but you can also store by signature, because the signature identifies the file.

If changes are made to the file, then the signature will generally change. In this way, the signature of a file can be used as a quick verification to see if anyone has altered the file, or if it has lost a bit during transmission. A frequently used application is for files storing student's marks.

5 Game boards

Consider a game board for tic-tac-toe or chess. A position in a game may be stored using a hash function. A tic-tac-toe board consists of nine cells, each of which may contain an 'X', an 'O', or nothing. The standard notation for "nine digits, each chosen from the set {0, 1, 2}" is "{0,1,2}^9". We adopt this notation with (0,1,2) corresponding to (X,O,empty). Reading left to right and top to bottom yields (010021000) in base three, or 1*3^4 +2*3^5 +1*3^8. That is the numerical index for this game position, a hash function: From our object, we are cooking up a number.

Game Board Hashing Example:

h[x] --> {0,1,2}9

(010021000)3 = 1*38 + 2*35 + 1*34 = h[x]


The nine squares in a TIC-TAC-TOE board can be mapped to digits in a ternary number. In this example, blank space is mapped to 0, X to 1 and O to 2.


6 A.D.T. dictionary

(SEARCHING, SORTING)

The main application of hashing is for the ADT dictionary (insert, search, sort). It is a serious competitor for binary search trees. However, to appreciate the difference with binary search trees, it helps to understand the difference between various models of computation.

The model which we will use in this example is the RAM model. Recall that the RAM model allows us to perform any opperation on a real number in one time unit, regardless of its length.

Teaser: You are given n different integers from{0, 1, 2, .... n^17}. Using the RAM model, these numbers can be sorted in linear time. (If you discover the answer, you will have rediscovered a famous method of sorting.)

This may appear surprising, because we all have, stored at the back of our minds, that n*log(n) is about the best that you can do for sorting. The trick is that we are talking about a different model. When someone says that n*log(n) is the best that can be done for sorting in the worst case, they are talking about a model which only allows comparisons between two elements at a time. (In that model, one comparison requires one time unit, while all other operations take zero time units.) That is very restrictive. As an analogy, imagine that you are given a scale. You are asked to sort everything using this balance alone. With a RAM model, you have more tricks at your disposal. You can take a number and look at the bits and bytes. The RAM model is more powerful. You can do more, and you can do it faster. The teaser above is a demonstration of the fact that the two models are very different. Comparing estimates of algorithms obtained from different computational is not meaningful.

7 Graphics

In graphics, a central problem is the storage of objects in a scene or view. We have already seen a structure for separating objects in a scene. For the B.S.P. tree, Binary search trees, the strategy was to partition everything by light. You may also organize your data by hashing. The hashing approach to this is to make a grid of appropriate size, an ordinary vertical-horizontal grid. A grid is basically a double array, and there is a one to one correspondence when you go from a double array to a single array. As such, we will store our grid in very much the same way as we did the sparse array. All points that fall in one cell will be stored in the same place. If a cell contains three points, then these three points will be stored in the same entry. The mapping from the grid cell to the memory location is our hash function. An advantage of this method of storage is that certain operations are very fast, such as the nearest neighbour search. If you store your data like this, and you are asked to find the nearest neighbour of a point, the first place to look is in the surrounding cell, then two layers out, then three layers out, and so on, until another point is found.

One method for organizing objects that lie in the plain is to create a grid of appropriate size, and to store the objects that lie in the same grid square together.


One advantage to this is that it facilitates "Nearest Neighbor" Searches.


Implementations

There are three main strategies for implementing hash tables.

  1. DIRECT ADDRESSING
  2. HASHING WITH CHAINING, Hashing by chaning.
  3. OPEN ADDRESSING, Hashing by open addressing.

Direct addressing is one of the simpler and quicker methods of look up used. We can search for an element in O(1) time. One position for each element in the array for every possible value of keys. In direct addressing, x is stored at position key[x] in a table T. The table T has indices from 0 to m-1, where m-1 is the maximal key value. When the key doesn't exist, a nil is assigned to that position in the table.
The standard operations insert, delete and search take one time unit (in a RAM model with uniformly bounded computation times), and can be implemented in one line of code, as shown below:

Dictionary operations for example:

Sometimes we don't need to store pointers to external data in the array, it can be enough to store the key in the position of array itself.


If all operations are so efficient, why is direct addressing not used all the time?
There are fundamentally two problems:
1. In some applications, key[x] = key[y] even though x != y. This is called a collision, and warrants special treatment.
2. Sometimes, n, the number of data items, is much less than m. A case in point is a database of phone numbers with 100 entries, where the key is the phone number. Clearly, there are no collisions, yet m = 10,000,000 >> 100 = n, so most of remians unused.( A real waste)

In class we we solved exercise 12.1.4 of Cormen, Leiserson and Rivest:

We wish to implement a dictionary by using direct addressing on huge array. At the start, the array entries may contain garbage, and initializing the entire array is impractical because of its size. Describe a schem for implementing a direct-address dictionary on a huge array. Each stored object should use O(1) space; the operations SEARCH, INSERT, and DELETE should take O(1) time each; and the initialization of the data structure should take O(1) time. (Hint: Use an additional stack, whose size is the number of keys actually stored in the dictionary, to help determine whether a given entry in the huge array is valid or not.)


Solution:
Using a stacklike array, S, whose size is the number of keys in the dictionary, we can achieve the desired O(1) storage space, and O(1) time for operations search, insert, delete as well as the initilization of the whole structure.
An integer called "top" represents the position of the last item entered into our structure. When top = 0 , the structure is empty.
When keys from the dictionary are read into the table, the number corresponding to key[x]'s position in S is stored in the direct addressing table, T. Then key[x] is stored in S and "top" is incremented by 1. The keys are stored in S in the order they are read. To initialize the structure, simply do top <-- 0. To verify if an item is a valid one or just 'garbage', we check if S[T[key[x]]] = key[x] and if T[key[x]] <= top. This, in effect checks the 'pointers' as shown in the image below:

In the image above, the numbers read into the structure, in order are: 3,8,1,4. Number 6 hasn't been read in yet and when it will be read in, "n"(top) will be incremented. Another important point to make is that if we search for an element,x in the table T, we can tell immediately if it actually exists or if it is garbage when we check if T[x] > n, if true , we know it is garbage since T[x] 'points' to a position in S which is "above the top".



Hashing applet

This applet demonstrates sparse array compression with collision resolution by chaining. It allows the user to specify his/her own hashing function, to quickly generate some random data and to see the results of their hash function. Collisions are shown by a color scale from red to yellow representing from 1 to 5 collisions. More than 5 collisions are ignored for the purposes of this demo.

Input and Output Formats

You are given two palettes, one large and one small. The large palette is a grid of 100X100 for a total of 10,000 positions. Each position is labelled by a 4 digit number (0-9999). The small palette has only 10X10 for a total of 100 positions and so each position here is represented by a 2 digit number (0-99). The hash function takes an input from the 4 digit large palette, performs its magic, and spits the result into the 2 digit small palette.

Hash function syntax

The hash function is an expression in postfix form with operators and operands separated by spaces.

For example: 1 4 substr 100 %

Operands

Each operand may be one of a pointer, a number, or another expression.

Operators

operator example meaning
+ 2 3 + 2 + 3
- ^3 2 - third digit minus 2
* 4 ^1 * 4 times the first digit
/ ^1 3 / first digit divided by 3
% 123 6 % 123 mod 6
substr 2 4 substr the number made of the second, third and forth digits (note you don't use pointers here)

Usage

To generate data simply click and/or drag on the large palette and the resulting hash will appear on the small palette. Both the original 4 digit number and the new 2 digit hash are displayed on the appropriate tabs for each palette. If your hash function produces a value that doesn't fit in the range of the small palette, then the 2 digit number is replaced by two red asterisks **. The reset button clears your current data set and if you have put a new hash function in the input box, will update the current hash function to it.


applet

The source.


Links of interest

Link Description
Notes from Dave Simpson's CMPT 201 Class at Simon Fraser University A nice explanation of the Hash Table data structure, a brief implementation and how to design a good hash function. Also check out the links to the next 2 or 3 lecture note pages.
Lecture Notes from Data Structures course at University of Missouri-Rolla A General Intro. to Hashing with examples of some simple hash functions as well as applications
Oberlin College, CS 275, Lab 10 An introduction to hashing from an on-line lab manual.
Oberlin College, CS 275, Lab 13 Direct addressing implementation
Brigham Young University,

CS 453, Lecture 17

A higher level course overview of hashing, with implementations
Bibliography on Hashing A rather complete collection of papers of all types from ranging manuals to techreports on hashing. It is presented in bibtex format and is assembled by Nelson H. F. Beebe <beebe@math.utah.edu>
Infoseek : hashing If you want more links on hashing, infoseek seems to have the best listing, though you will have to wade through various other meaning of "hashing" ...

Web page creators

This web page was created by:


Last updated March 24, 1997 by Joel Gagnon. Oren Kedem Jeff Ritchie
Copyright © 1997, .All rights reserved. Reproduction of all or part of this work is permitted for educational or research use provided that this copyright notice is included in any copy.
Disclaimer: this collection of notes is experimental, and does not serve as-is as a substitute for attendance in the actual class.


WARNING: Although some may enjoy it, hash is definitely hazardous to your health.
Proceed with caution!