DATA STRUCTURES AND ALGORITHMS
Topic #21: CODING AND COMPRESSION
Last modified: April 10, 1997
Introduction
Coding is a translation from one format to another. We need coding and
compression to save space and transmission time. The compression ratio is
calculated by dividing the size of the file before compression by the size
of the file after compression:

    compression ratio = (size before compression) / (size after compression)
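For example, a file of 100 kilobytes that compresses to 40 kilobytes has a
compression ratio of 100/40 = 2.5 (the sizes here are only illustrative).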
The UNIX compress command uses the
Lempel-Ziv method, which is described in detail in
topic 23.
We start with a finite set of symbols or characters. A collection of
characters is called an alphabet. The object of coding is to translate a
file over this alphabet into a compact file of bits.
Fixed Width Encoding
In fixed width coding, each input symbol is translated into one output string
consisting of a fixed number of bits. Example:
US-ASCII text is normally stored using 8 bits per character. This allows for
2^8 = 256 different symbols.
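As a small illustration, here is a minimal Java sketch (the class name and
sample string are ours, not from the course materials) showing that fixed
width coding spends exactly 8 bits on every symbol:

    import java.nio.charset.StandardCharsets;

    /** Fixed width coding: every symbol costs exactly 8 bits. */
    public class FixedWidth {
        public static void main(String[] args) {
            String text = "DATA";
            byte[] encoded = text.getBytes(StandardCharsets.US_ASCII); // 8 bits per symbol
            for (byte b : encoded) {
                // print each codeword as an 8-bit string, e.g. 'D' -> 01000100
                System.out.println(String.format("%8s",
                        Integer.toBinaryString(b & 0xFF)).replace(' ', '0'));
            }
            System.out.println(text.length() * 8 + " bits total");
        }
    }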
Entropy
Entropy is a quantitative measure of the number of bits needed per symbol
on the average, when the symbols in the input file occur independently
according to a given probability distribution.
For instance, assume there is a text containing independent symbols, where
each symbol has a certain probability of appearance (all greater than 0),
which we denote P_1, P_2, ..., P_n (with P_1 + P_2 + ... + P_n = 1).
We define the entropy as

    H = P_1 log_2(1/P_1) + P_2 log_2(1/P_2) + ... + P_n log_2(1/P_n),

that is, the sum over all symbols of P_i log_2(1/P_i).
For the independent symbol model, the entropy provides a universal lower
bound for the expected number of bits per symbol no matter which compression
method is used.
For example, English text requires about 4 bits/symbol under this model, and
for languages with an even simpler structure, the entropy is smaller.
By the above remarks, a US-ASCII-coded English text file (stored at 8
bits/symbol) cannot be compressed by much more than about 50%.
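The following is a minimal Java sketch of how this entropy can be estimated
for a given text under the independent-symbol model (the class name, method
name and sample string are only illustrative):

    import java.util.HashMap;
    import java.util.Map;

    /** Estimates the entropy (bits per symbol) of a text. */
    public class Entropy {
        public static double entropy(String text) {
            Map<Character, Integer> counts = new HashMap<>();
            for (char c : text.toCharArray()) {
                counts.merge(c, 1, Integer::sum);   // count occurrences of each symbol
            }
            double n = text.length();
            double h = 0.0;
            for (int count : counts.values()) {
                double p = count / n;                          // P_i
                h += p * (Math.log(1.0 / p) / Math.log(2.0));  // P_i * log_2(1/P_i)
            }
            return h;
        }

        public static void main(String[] args) {
            System.out.println(entropy("abracadabra"));  // about 2.04 bits per symbol
        }
    }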
Variable Width Coding
Variable width coding is a method in which each input symbol is translated
into one string of output bits called a codeword. The codewords may be of
variable length. The complete collection of codewords is called a code.
Example 1:
Morse Code

    a   ._        j   ._ _ _     s   ...
    b   _...      k   _._        t   _
    c   _._.      l   ._..       u   .._
    d   _..       m   _ _        v   ..._
    e   .         n   _.         w   ._ _
    f   .._.      o   _ _ _      x   _.._
    g   _ _.      p   ._ _.      y   _._ _
    h   ....      q   _ _._      z   _ _..
    i   ..        r   ._.

Morse code requires 1 to 4 bits for letters, and 5 bits for numerals.
Example 2:
Braille Code
Braille code uses 6 or 12 bits most of the time. Six bits alone give only
2^6 = 64 possibilities, which is not enough. To extend the encoding, 62 of
the possible patterns represent the most common characters and symbols, and
the remaining 2 patterns announce that another 6-bit group follows, which
allows for many more possibilities.
Figure #1. Braille Code Sample
Example 3:
MacWrite
MacWrite is another variable length encoding. It uses 4 or 12 bits per
symbol, which yields an average of 6.4 bits per symbol for English text.
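One way to see how such an average can arise (the 70/30 split below is only
an illustration, not MacWrite's actual statistics): if a fraction f of the
symbols receive the short 4-bit form and the rest the 12-bit form, the
average is 4f + 12(1 - f) bits per symbol; f = 0.7 gives
4(0.7) + 12(0.3) = 2.8 + 3.6 = 6.4 bits per symbol.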
Repetition Coding
Example: 0 0 0 0 1 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1

    Run                    Pair     Count   Binary Rep.   Bit
    There are 4 zeros      (4,0)    4       100           0
    followed by 1 one      (1,1)    1       1             1
    followed by 2 zeros    (2,0)    2       10            0
    followed by 5 ones     (5,1)    5       101           1
    followed by 2 zeros    (2,0)    2       10            0
    followed by 6 ones     (6,1)    6       110           1

Resulting binary code: 1 0 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 1 0 1
How do we know how to cut the output string into its component pieces?
The binary representation of each run length always starts with a 1. We
announce how many bits that representation needs by putting the same number
of 0's in front of it, and leave the second number of the pair as is. For
example, for (4,0), where 100 is the binary representation of the run length
4, we put three 0's in front of it to get 000 100. The receiver then knows
that the three bits following the 0's encode the 4. The resulting string for
(4,0) is 000 100 0, where the last bit, here '0', indicates whether the run
consists of 0's or 1's. The resulting binary string for the above example is
shown in the following figure:
Figure #2. Repetition Coding Sample
Repetition coding is good for files with many long strings of zeros
or long strings of ones. This might be good for lines in a FAX sheet,
or picture files with lots of white space.
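Below is a minimal Java sketch of the repetition coding scheme just
described (the class and method names are only illustrative); it turns the
example bit string into the self-delimiting form with the leading 0's:

    /** Repetition (run length) coding: each run becomes k zeros, the
     *  k-bit binary run length, and one bit for the repeated value,
     *  e.g. (4,0) -> 000 100 0. Input and output are strings of '0'/'1'. */
    public class RepetitionCoder {
        public static String encode(String bits) {
            StringBuilder out = new StringBuilder();
            int i = 0;
            while (i < bits.length()) {
                char value = bits.charAt(i);
                int run = 0;
                while (i < bits.length() && bits.charAt(i) == value) { run++; i++; }
                String lengthBits = Integer.toBinaryString(run);  // always starts with a 1
                for (int k = 0; k < lengthBits.length(); k++) {
                    out.append('0');                              // announce the length
                }
                out.append(lengthBits).append(value);             // run length, then the bit
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // the runs from the example: 4 zeros, 1 one, 2 zeros, 5 ones, 2 zeros, 6 ones
            System.out.println(encode("00001001111100111111"));
            // prints 0001000 011 00100 0001011 00100 0001101 (without the spaces)
        }
    }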
Differential Coding
If there is a strong dependence between consecutive items in an input file,
then differential coding might be beneficial. Suppose you have lines of data
represented by U_1, U_2, U_3, U_4, ...
Instead of sending/storing these directly, one codes U_1,
U_2 - U_1, U_3 - U_2, U_4 - U_3, ..., where '-' may denote subtraction or
bitwise exclusive or. The idea is that if U_i is almost equal to
U_{i+1}, then U_{i+1} - U_i consists of nearly
all zeros, and should thus be efficiently encodable by, say, repetition
coding. Example:
    U_1        :  11110011010101000011010101010101
    U_2        :  11110011011101000010010101010101
    U_2 - U_1  :  00000000001000000001000000000000
One situation where this occurs is FAX transmission. A line of a FAX sheet
consists of 1728 bits. Consecutive FAX lines are clearly closely related
and should thus be easy to compress via differential coding. The same
principle applies to scanned picture files.
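A minimal Java sketch of differential coding with bitwise exclusive or,
using the two 32-bit lines from the example above (the class name is only
illustrative; real FAX lines are 1728 bits):

    /** Stores a line as its xor-difference from the previous line. */
    public class DifferentialCoding {
        public static void main(String[] args) {
            long u1 = Long.parseLong("11110011010101000011010101010101", 2);
            long u2 = Long.parseLong("11110011011101000010010101010101", 2);

            long diff = u1 ^ u2;          // transmit u1 and this mostly-zero difference
            System.out.println(toBits(diff));       // 00000000001000000001000000000000
            System.out.println(toBits(u1 ^ diff));  // the receiver reconstructs u2 exactly
        }

        private static String toBits(long x) {
            return String.format("%32s", Long.toBinaryString(x)).replace(' ', '0');
        }
    }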
Prefix Codes
In prefix coding each symbol in the input string has a codeword of variable
length. It is a type of variable length code such that
no codeword is a prefix of any other codeword.
A codeword is the bit string assigned to a symbol;
for example, a might be encoded as 0110.
Each (binary) codeword in a prefix code is represented by a path in a trie.
The leaf at the end of this path is associated with the symbol for that
codeword. Both the sender and the receiver must have the same
trie. The prefix property is what allows the receiver
to cut the received bit stream back into codewords and
recover the message.
Sample Prefix Coding:
Assume we have the following coding for a, b and c:
    Symbol   Codeword
    a        0
    b        10
    c        11
We can also represent the code table visually as a trie:
Figure #3. Prefix Code Trie
The receiver needs this trie. Every prefix code has such a trie, with the
symbols at its leaves. To decode, the receiver starts at the root and
follows one branch per received bit; whenever a leaf is reached, that is
where the codeword is cut, the leaf's symbol is output, and decoding
restarts at the root.
Suppose we want to send the following message:
0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1
Using the trie above, the receiver decodes this bit by bit, cutting the
stream each time a leaf is reached.
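Below is a minimal Java sketch of such a trie-based prefix decoder for the
code a = 0, b = 10, c = 11 (the class and method names are ours, not the
course's example.java). For instance, the first nine bits of the message
above, 000111111, decode to aaaccc:

    /** Prefix decoding with a binary trie: follow one branch per bit,
     *  output the symbol and cut the message whenever a leaf is reached. */
    public class PrefixDecoder {
        static class Node {
            Node zero, one;     // children for bits 0 and 1
            Character symbol;   // non-null only at a leaf
        }

        static Node root = new Node();

        static void insert(char symbol, String codeword) {
            Node node = root;
            for (char bit : codeword.toCharArray()) {
                if (bit == '0') node = (node.zero != null) ? node.zero : (node.zero = new Node());
                else            node = (node.one  != null) ? node.one  : (node.one  = new Node());
            }
            node.symbol = symbol;   // the leaf carries the symbol
        }

        static String decode(String bits) {
            StringBuilder out = new StringBuilder();
            Node node = root;
            for (char bit : bits.toCharArray()) {
                node = (bit == '0') ? node.zero : node.one;
                if (node.symbol != null) {   // reached a leaf: cut here
                    out.append(node.symbol);
                    node = root;
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            insert('a', "0");
            insert('b', "10");
            insert('c', "11");
            System.out.println(decode("000111111"));   // prints aaaccc
        }
    }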
Huffman Codes, Adaptive Huffman Codes, and Lempel-Ziv Codes
Huffman codes are optimal prefix codes in the sense that the average
codeword length is smallest. They require knowledge of the probabilities
of occurrence of the symbols. Therefore, the Huffman code for an English
text differs from that of a gif file or a C program. This is discussed in
topic 22.
Adaptive Huffman codes build and update the code while the file is being
processed, so the symbol probabilities need not be known in advance.
Lempel-Ziv codes also adjust to changing situations and are adaptive in a
very strong sense. These codes are not of the prefix kind, as entire phrases
get translated into pointers to other phrases seen earlier. This is discussed
in
topic 23.
Java applet: Prefix Decoding
Source file: example.java
Links to related material
A quick tutorial on generating a Huffman tree
 An example of how to construct a
Huffman tree by Luigi Filippini, CRS4.

Huffman Codes
 Karen Van Houten's example of how to build a Huffman tree.
University of Idaho.

Huffman Coding
 Definition and example of Huffman coding by Jon Oldevik
at UiO Institute of Informatics.

Huffman Coding
 Huffman Coding of ACIS Pixel Data. M.I.T. Center for Space Research.

Dynamic Huffman coding
 Definition of dynamic Huffman coding, structure, encoding/decoding
algorithm and updating the tree by Jon Oldevik at UiO Institute of
Informatics.

Static Huffman Coding
 Huffman's algorithm by Debra A. Lelewer and Daniel S.
Hirschberg at the University of California, Irvine.

Adaptive Huffman Coding
 Adaptive Huffman coding by Debra A. Lelewer and Daniel S.
Hirschberg at the University of California, Irvine.

LempelZiv Codes
 Another adaptive method definition by Debra A. Lelewer and
Daniel S. Hirschberg at the University of California,
Irvine.

Java Huffman Coding
 John Scott's explanation and a Java example of Huffman coding.
Imperial College.

Use of Huffman Trees to Encode Verbal Street Directions
 Using Huffman trees to encode directions.
by Andre Skupin at Institut für Geographie der Universität Salzburg.

Chacking ... LZW Compression
 Chacking, Issue #6, Sept. 5, 1993. Bill Lucier at Carnegie
Mellon University.

Compression
 CS 280 Lab Manual: Coding by Alexei Barchenkov, Don Hutchings
and Micheal Klingbeil at Oberlin College.

Web page creators
This web page was created by Mike Chen, Ming Ming Hao, Vickie Kwan, and
Nancy Zaramian.
Last updated: April 10, 1997
Copyright © 1997, Mike Chen, Ming Ming Hao, Vickie
Kwan, Nancy Zaramian. All rights reserved. Reproduction of all
or part of this work is permitted for educational or research use
provided that this copyright notice is included in any copy.
Disclaimer: this collection of notes is experimental, and does not
serve as-is as a substitute for attendance in the actual class.