McGill University:
School of Computer Science
Winter 1997 Class Notes for 308-251B

DATA STRUCTURES AND ALGORITHMS

Topic #5: LINEAR TIME SELECTION


The Selection Problem: definition

Given a set of "n" unordered numbers we want to find the "kth" smallest number. (k is an integer between 1 and n)

In the case of problems involving sorting, searching, and selection, a good indicator of time is the number of comparisons. Therefore, we will set

1 comparison=1 time unit
everything else=0 time units


A naive algorithm based on sorting

Algorithm Step Description of the algorithm step Running Time
1 Sort the n elements using merge-sort or heapsort O(n log2n)
2 Return the kthsmallest element 0
Total Running Time of the algorithm= O(n log2n)

Selection can be performed in O(1) time in a sorted array. However, data are not always sorted with respect to the key that is used for selection. Example: we may sort data by name, and ask for the kth smallest phone number

In this lecture, we will study two algorithms with linear worst-case and average-case complexities. The naive algorithm above is simply not comptetitive.


The linear time selection algorithm

This algorithm will find the kthsmallest in O(n) worst case time.
It is also called the median-finding algorithm.

The method uses a divide & conquer strategy to solve the problem. Like the problem with the good and bad chips where we "threw away" 50% of the chips, if we can throw away 30% or even 10% without too much complication, we're cutting down the running time substantially. Therefore, we must efficiently eliminate as much as we can from the competition.


Source code for this applet.
Image1 from the Java applet
Image2 from the Java applet
Image3 from the Java applet
Image4 frmo the Java applet

Algorithm Step Description of the Algorithm step Bound on the running time
0 If n is small, for example n<6, just sort and return the k thsmallest number. 7
1 If n>5, then partition the numbers into groups of 5. n/5
2 Sort the numbers within each group. Select the middle elements (the medians) 7n/5
3 Call your "Selection" routine recursively to find the median of n/5 medians and call it m. Tn/5
4(a) Compare all n-1 elements with the median of medians m and determine the sets L and R, where L contains all elements <m, and R contains all elements >m. Clearly, the rank of m is r=|L|+1 (|L| is the size or cardinality of L) n
4(b) If k=r, then return m
4(c) If k<r, then return kth smallest of the set L T7n/10
4(d) If k>r, then return k-rth smallest of the set R


In the worst case, recursion will occur on 70% of that data set and so we get T7n/10.

Remark 1

If n is small, it obviously doesn't make any sense to "juggle" the values. Let us set the small threshold at 5 elements. To find the "kth" item from n elements, here are the least number of comparisons needed for values where n is less than 6.

2 elements -> 1 comparison
3 elements -> 3 comparisons
4 elements -> 5 comparisons
5 elements -> 7 comparisons*

Exercise

Show how five elements can be sorted using at most seven comparisons.

Remark 2

The Oxford Dictionary defines Median: -adj situated in the middle. 2.Middle value of a series.

Remark 3

When n is not an exact mutiple of 5, we divide the n elements into groups of 5 elements each, and at most one group with the remaining n mod 5 elements. Also, when there are an even number of medians, select the larger of the two median of medians.

Analysis

T(n) is an upper bound on the worst-case time



We will prove that for all n and for some (yet unknown) constant C.

Induction Proof

Case (1): For any small value of n (n<=5), C must be at least 7.
Case(2): Assume that the hypothesis is true for values of n up to and not including n.
Case(3): We must now verify for the general value of n.
We are given

By our induction hypothesis,

Thus, provided that

Luckily for us, it turns out that the third step , is true provided

This concludes the proof. We have actually shown that


There is a slight error in the derivation as 7n/10 and n/5 may not always be integer-valued! However, it still remains true if we take truncations into consideration, that Tn=On.


Why did we break the group down into sets of 5?

Even numbers don't work since there won't be a median. On the other hand 5, 7, 11, 13, etc. work but why won't 3, the smallest odd number? When we try to use the value of 3 in the algorithm then we obtain the inequality

, from which we are not able to conclude that Tn=On.

Exercise

Justify each term in the inequality

Exercise

What is the best we can conclude from


Hoare's algorithm

The algorithm uses one concept from quicksort and returns the kth smallest of n.

If n=1 then return the element
If n>1 then grab a random element from your set and call it x. Cut the group into two sections. The left side (L) will contain all elements smaller than x, the right side(R) will contain all elements greater than x. Let r=|L|+1 be the rank of x.
If k=r then return the element.
If k<r then return the kth smallest from the set L.
If k>r then return the k-rth smallest from the set R.

The worst case of this algorithm happens when x is either the smallest or largest element in the set. There will be no splitting and the time will become


Analysis

T(n) is for the same computational model but now shows the expected or average time

Case 1: x(recall: the random element) is in the middle half. Then you recurse on the set of size at most 3n/4
Case 2: x is not in the middle half. Taking a pessimistic view, you recurse on a set of size at most n. (that means, we assume that no progress was made)

As each case occurs with probability 50%, we get:

Rewrite this inequality as:

The master theorem then says


Exercise

The algorithm of the previous section had in the worst case. Using the inequality above, show that for the average case, .
Hint: draw the recursion tree and sum over all levels.


Links to related material

Many data structures and algorithms courses are ``on the web''. Some useful pointers to get you started in your search are given below.
Algorithms Information and Course Materials on the Net

Courses on algorithms.

Princeton Data Structures

Bob Sedgewick's course on data structures and algorithms at Princeton.

Sedgewick's transparencies

Bob Sedgewick's lecture notes (copies of transparencies) on data structures and algorithms at Princeton. For our course, of particular interest are sections 10 (tries), 11 (strings), 12 (more strings), and 13 (compression), as this material is not covered in detail by Cormen, Leiserson and Rivest.

Sugihara's lecture notes on strings

Sugihara's lecture notes on strings (storage, coding, compression, search) are a nice supplement to Cormen as well. Sugihara teaches at the University of Hawaii.

Index of Sugihara's lecture notes.

Index of Sugihara's lecture notes.

Java applets for five sorting algorithms

Animation for sorting algorithms.

Algorithms at UMass

C++, Algorithms, and Data Structures (CS202) Notes

Rivest's bibliography on algorithms

Ron Rivest's bibliography on algorithms.

MergeSort demo

Mergesort demo from University of Toronto.

Linear Time Selection

Lecture notes

Hoare's homepage

C.A.R. Hoare's homepage


Bibliographic Remarks

The problem of selection (finding the kth largest of n elements) goes back to Rev. C. L. Dodgson's (aka Lewis Carroll) essay (Carroll 1883, p.5-6) on how prizes were awarded unfairly in tennis tournaments. Knuth (1973, p.209)

A linear method to solving this problem was always sought after and only after Manuel Blum discovered a method with O(nlog logn) steps in 1971 were R.Rivest and R.Tarjan (Rivest, 1973, p.448-461) able to create the alorithm which we now call "Linear Time Selection". (Knuth 216)

Hoare, Charles Antony Richard published a method called quicksort in 1962 which is used in the algorithm of finding the kth smallest of n elements.


References

Cormen, Thomas H, Charles E. Leirson, Ronald L. Rivest

"Introduction to Algorithms". Montreal: McGraw-Hill Book Company, 1991.
Section 10.3

Dodgson, Rev. Charles Lutwidge
St.James Gazette. 1 August, 1883

Hoare, Charles A.R. "Find (algorithm 65)"
"Communications of the ACM" 1961.
Vol.4, pp.321-322

Knuth, Donald E. "The Art of Computer Programming".
"Volume 3: Sorting and Searching" Reading, Massachusetts: Addison-Wesley Publishing Company, 1973.
Section 5.3.3. Minimum-Comparison Selection

Morrison, Michael, et al. "Java unleashed".
Sams.net. (1997, p.365-469)
Rivest, Ronald L., and Robert E. Tarjan
Journal of Computers & Sytem Sciences 7: 1973 Vol.7, p.448-461


Web page creators

This web page was created by

Daichi Saeki worked on the HTML coding and creating the equations using LATEX2HTML, Chien Liao worked on the JAVA applet and Silva Baeva worked on research.


"My eyes are BURNING" -Ralph Wiggum jumping into the Simpsons' pool after Homer decides to use chlorine.
Last updated February 4, 1997 by Daichi Saeki.

Copyright © 1997, Daichi Saeki, Chien Heng Liao, Silva Baeva. All rights reserved. Reproduction of all or part of this work is permitted for educational or research use provided that this copyright notice is included in any copy. Disclaimer: this collection of notes is experimental, and does not serve as-is as a substitute for attendance in the actual class.