McGill University:
School of Computer Science
Winter 1997 Class Notes for 308-251B

DATA STRUCTURES AND ALGORITHMS

Topic #21: CODING AND COMPRESSION

Last modified: April 10, 1997


TABLE OF CONTENTS

Introduction
Fixed Width Coding
Entropy
Variable Width Coding
Repetition Coding
Differential Coding
Prefix Codes
Huffman Codes, Adaptive Huffman Codes,
and Lempel-Ziv Codes
Java applet: Prefix Decoding
Links to related material
References
Web page creators

List of Figures

(Figure #1) Braille Code Sample
(Figure #2) Repetition Coding Sample
(Figure #3) Prefix Code Trie

Introduction

Coding is a translation from one format to another. We need coding and compression to save storage space and transmission time. The compression ratio is obtained by dividing the size of the file before compression by the size of the file after compression:

    compression ratio = (size before compression) / (size after compression)

For example, a 100 KB file that compresses to 40 KB has a compression ratio of 2.5. The UNIX compress command uses the Lempel-Ziv method, which is described in detail in topic 23. We start with a finite set of symbols or characters; a collection of characters is called an alphabet. The object of coding is to compress a file over this alphabet into a file of bits.



Fixed Width Coding

In fixed width coding, each input symbol is translated into one output string consisting of a fixed number of bits. Example: US ASCII uses 8 bits per character. This allows for 2^8 = 256 different symbols.
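
As a small illustration (not part of the original notes), the sketch below encodes text over a made-up 8-symbol alphabet with fixed 3-bit codewords; the class name, method name and alphabet are our own choices.

class FixedWidthSketch {
    // Each symbol is numbered by its position in the alphabet and written
    // out as exactly width bits.
    static String encode(String alphabet, String text, int width) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            int index = alphabet.indexOf(c);                      // symbol number 0 .. alphabet.length()-1
            StringBuilder bits = new StringBuilder(Integer.toBinaryString(index));
            while (bits.length() < width) bits.insert(0, '0');    // pad to the fixed width
            out.append(bits);
        }
        return out.toString();
    }
}

For example, FixedWidthSketch.encode("abcdefgh", "cab", 3) returns 010000001, i.e. the codewords 010, 000 and 001 for c, a and b.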



Entropy

Entropy is a quantitative measure of the average number of bits needed per symbol when the symbols in the input file occur independently according to a given probability distribution. For instance, assume a text contains independent symbols, where symbol i appears with probability p_i (each p_i >= 0, and Σ p_i = 1). We define the entropy as

    H = Σ p_i log2(1 / p_i).

For the independent symbol model, the entropy provides a universal lower bound on the expected number of bits per symbol, no matter which compression method is used. For example, English text requires about 4 bits per symbol under this model, and for languages with an even simpler structure the entropy is smaller. By the above remarks, a US ASCII-coded English text file (at 8 bits per character) cannot be compressed to much less than about 50% of its original size.
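
As a small illustration (our own sketch, not part of the original notes), the entropy of a distribution can be computed directly from the definition above:

class EntropySketch {
    // Entropy in bits per symbol of a probability distribution p,
    // assuming the entries are nonnegative and sum to 1.
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {                                        // a zero probability contributes nothing
                h += pi * (Math.log(1.0 / pi) / Math.log(2.0));    // pi * log2(1/pi)
            }
        }
        return h;
    }
}

For example, EntropySketch.entropy(new double[]{0.5, 0.25, 0.25}) is 1.5, so a source with these three symbol probabilities needs at least 1.5 bits per symbol on average.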



Variable Width Coding

Variable width coding is a method in which each input symbol is translated into one string of output bits called a codeword. The codewords may be of variable length. The complete collection of codewords is called a code.

Example 1:

Morse Code
a ._    b _...  c _._.  d _..   e .     f .._.  g __.
h ....  i ..    j .___  k _._   l ._..  m __    n _.
o ___   p .__.  q __._  r ._.   s ...   t _     u .._
v ..._  w .__   x _.._  y _.__  z __..
Morse code requires 1 to 4 symbols (dots or dashes) for letters, and 5 for numerals.

Example 2:

Braille Code
Braille code uses 6 or 12 bits most of the time. With 6 bits there are only 2^6 = 64 possibilities, which is not enough. To extend the encoding, Braille uses 62 of the possibilities to represent the most common characters and symbols, and reserves the remaining 2 to announce that another symbol follows, which allows for many more possibilities.

Figure #1. Braille Code Sample

Example 3:

MacWrite
MacWrite is another variable length encoding. It uses 4 or 12 bits per symbol, which yields an average of 6.4 bits per symbol for English text.



Repetition Coding

Example: 0 0 0 0 1 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1
                          (count, bit)    count in binary    bit
There are 4 zeros         (4,0)           100                0
followed by 1 one         (1,1)           1                  1
followed by 2 zeros       (2,0)           10                 0
followed by 5 ones        (5,1)           101                1
followed by 2 zeros       (2,0)           10                 0
followed by 6 ones        (6,1)           110                1
Resulting binary code: 1 0 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 1 0 1

How do we know how to cut the output string into its component pieces? We know that each binary codeword starts with a 1. We indicate the number of bits in this codeword by putting the same number of 0's in front of it, and leave the bit value as is. For example, for (4,0), where 100 is the binary codeword for the number 4, we put three 0's in front of it to get 000 100. The receiver then knows that three bits follow the 0's to encode the 4. The resulting string for (4,0) is 000 100 0; the last bit, here '0', indicates whether the run consists of 1's or 0's. The resulting binary string for the above example is shown in the following figure:

Figure #2. Repetition Coding Sample
Repetition coding works well for files with many long runs of zeros or ones, such as the lines of a FAX sheet or picture files with lots of white space.
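
A minimal sketch of the scheme described above (the class and method names are ours): for every run it writes as many 0's as the run length has binary digits, then the run length in binary, then the repeated bit itself.

class RepetitionSketch {
    // Repetition (run-length) coding of a string of '0'/'1' characters.
    static String encodeRuns(String bits) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < bits.length()) {
            char bit = bits.charAt(i);
            int length = 0;
            while (i < bits.length() && bits.charAt(i) == bit) { length++; i++; }
            String count = Integer.toBinaryString(length);                // always starts with a 1
            for (int k = 0; k < count.length(); k++) out.append('0');     // length marker
            out.append(count).append(bit);                                // run length, then the bit value
        }
        return out.toString();
    }
}

For the example above, RepetitionSketch.encodeRuns("00001001111100111111") produces 0001000 011 00100 0001011 00100 0001101 (spaces added here for readability), one group per run.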



Differential Coding

If there is a strong dependence between consecutive items in an input file, then differential coding may be beneficial. Suppose you have lines of data represented by U1, U2, U3, U4, ... Instead of sending or storing these directly, one codes U1, U2-U1, U3-U2, U4-U3, ..., where "-" may denote subtraction or bitwise exclusive or. The idea is that if Ui is almost equal to U{i+1}, then U{i+1}-Ui consists of nearly all zeros, and should thus be efficiently encodable by, say, repetition coding. Example:
U1 : 11110011010101000011010101010101
U2 : 11110011011101000010010101010101
U2-U1 : 00000000001000000001000000000000
FAX transmission is one situation where this occurs. A line of a FAX sheet consists of 1728 bits, and consecutive FAX lines are clearly closely related, so they should be easy to compress via differential coding. The same principle applies to scanned picture files.
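
A minimal sketch of differential coding with bitwise exclusive or (the class and method names are ours), assuming all lines have the same length:

class DifferentialSketch {
    // The first line is stored as is; every later line is stored as its
    // bitwise exclusive or with the previous line, which is mostly zeros
    // when consecutive lines agree.
    static String[] encode(String[] lines) {
        String[] out = new String[lines.length];
        out[0] = lines[0];                                      // U1 is kept unchanged
        for (int i = 1; i < lines.length; i++) {
            StringBuilder diff = new StringBuilder();
            for (int j = 0; j < lines[i].length(); j++) {
                // '0' where the bits agree, '1' where they differ
                diff.append(lines[i].charAt(j) == lines[i - 1].charAt(j) ? '0' : '1');
            }
            out[i] = diff.toString();
        }
        return out;
    }
}

Decoding reverses the process: each line is recovered as the exclusive or of its stored difference with the previously recovered line. For U1 and U2 above, encode returns U1 unchanged followed by the difference line shown as U2-U1.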



Prefix Codes

In prefix coding, each symbol in the input string has a codeword of variable length. A prefix code is a variable length code in which no codeword is a prefix of any other codeword. A codeword is the string of bits assigned to a symbol; for example, a might be assigned the codeword 0110. Each (binary) codeword in a prefix code is represented by a path in a trie, and the leaf at the end of this path is associated with the symbol for that codeword. Both the sender and the receiver must have the same trie; this is what allows the receiver to cut the received bit string back into codewords and reassemble the message. Sample prefix coding: assume we have the following code for a, b and c:
symbol   codeword
a        0
b        10
c        11

We can also represent the code table visually as a trie:


Figure #3. Prefix Code Trie

The receiver needs this trie. Every prefix code has such a trie, in which every leaf is a symbol. To decode, the receiver starts at the root and follows the received bits down the trie; each time a leaf is reached, the message is cut there, the leaf's symbol is output, and decoding resumes at the root. Suppose we want to send the following message:
0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1
Using the trie above, the receiver traces the bits from the root and cuts at each leaf, recovering the sequence of symbols.
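
A minimal sketch of trie-based prefix decoding (the class and method names are ours; the code table a = 0, b = 10, c = 11 is the one shown above):

class PrefixDecoderSketch {
    static class Node {
        Node zero, one;       // children for bits 0 and 1
        Character symbol;     // non-null only at a leaf
    }

    // Insert one codeword into the trie; the leaf at the end of the path
    // carries the symbol.
    static void insert(Node root, String codeword, char symbol) {
        Node n = root;
        for (char bit : codeword.toCharArray()) {
            if (bit == '0') { if (n.zero == null) n.zero = new Node(); n = n.zero; }
            else            { if (n.one  == null) n.one  = new Node(); n = n.one;  }
        }
        n.symbol = symbol;
    }

    // Walk the trie bit by bit; every time a leaf is reached, cut the
    // message there, output the symbol, and resume at the root.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node n = root;
        for (char bit : bits.toCharArray()) {
            n = (bit == '0') ? n.zero : n.one;
            if (n.symbol != null) {
                out.append(n.symbol);
                n = root;
            }
        }
        return out.toString();
    }
}

For example, after inserting the three codewords with insert(root, "0", 'a'), insert(root, "10", 'b') and insert(root, "11", 'c'), the call decode(root, "01011") returns "abc".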



Huffman Codes, Adaptive Huffman Codes, and Lempel-Ziv Codes

Huffman codes are optimal prefix codes in the sense that the average codeword length is smallest. They require knowledge of the probabilities of occurrence of the symbols; therefore, the Huffman code for an English text differs from that of a gif file or a C program. This is discussed in topic 22. Adaptive Huffman codes update the code as the observed symbol frequencies change, so the probabilities do not need to be known in advance. Lempel-Ziv codes also adjust to changing input and are adaptive in a very strong sense. These codes are not of the prefix kind, since entire phrases are translated into pointers to phrases seen earlier. This is discussed in topic 23.
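
Topic 22 covers the construction in detail; the following is only a rough sketch of the classical greedy algorithm, which repeatedly merges the two least likely subtrees (all names here are ours):

import java.util.PriorityQueue;

class HuffmanSketch {
    // A leaf carries a symbol; an internal node carries only the combined
    // weight of its two subtrees.
    static class Tree {
        double weight; Character symbol; Tree left, right;
        Tree(double w, Character s, Tree l, Tree r) { weight = w; symbol = s; left = l; right = r; }
    }

    static Tree build(char[] symbols, double[] prob) {
        PriorityQueue<Tree> q = new PriorityQueue<>((a, b) -> Double.compare(a.weight, b.weight));
        for (int i = 0; i < symbols.length; i++)
            q.add(new Tree(prob[i], symbols[i], null, null));   // one leaf per symbol
        while (q.size() > 1) {
            Tree a = q.poll(), b = q.poll();                    // the two smallest weights
            q.add(new Tree(a.weight + b.weight, null, a, b));   // merge them under a new node
        }
        return q.poll();                                        // root of the Huffman code trie
    }
}

Reading each symbol's codeword off this trie (0 for one branch, 1 for the other) yields a prefix code whose average codeword length is smallest for the given probabilities.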



Java applet: Prefix Decoding

Source file: example.java



Links to related material

A quick tutorial on generating a Huffman tree An example of how to construct a Huffman tree by Luigi Filippini, CRS4.
Huffman Codes Karen Van Houten's example of how to build a Huffman tree. University of Idaho.
Huffman Coding Definition and example of Huffman coding by Jon Oldevik at UiO Institute of Informatics.
Huffman Coding Huffman Coding of ACIS Pixel Data. M.I.T. Center for Space Research.
Dynamic Huffman coding Definition of dynamic Huffman coding, structure, encoding/decoding algorithm and updating the tree by Jon Oldevik at UiO Institute of Informatics.
Static Huffman Coding Huffman's algorithm by Debra A. Lelewer and Daniel S. Hirschberg at the University of California, Irvine.
Adaptive Huffman Coding Adaptive Huffman coding by Debra A. Lelewer and Daniel S. Hirschberg at the University of California, Irvine.
Lempel-Ziv Codes Another adaptive method definition by Debra A. Lelewer and Daniel S. Hirschberg at the University of California, Irvine.
Java Huffman Coding John Scott's explanation and a Java example of Huffman coding. Imperial College.
Use of Huffman Trees to Encode Verbal Street Directions Using Huffman trees to encode directions, by Andre Skupin at Institut für Geographie der Universität Salzburg.
Chacking ... LZW Compression Chacking, Issue #6, Sept. 5, 1993. Bill Lucier at Carnegie Mellon University.
Compression CS 280 Lab Manual: Coding by Alexei Barchenkov, Don Hutchings and Micheal Klingbeil at Oberlin College.



References

[1] T.C. Bell, J.G. Cleary, I.H. Witten. Text Compression. Prentice Hall, 1990.
[2] Thomas H. Cormen, Charles E. Leiserson and Ronald L. Rivest. Introduction to Algorithms. The MIT Press and McGraw-Hill Book Company, 1990.
[3] Gilbert Held. Personal Computer File Compression. Van Nostrand Reinhold, 1994.
[4] M. Nelson. The Data Compression Book. M.I.T. Publication Inc., 1992.



Web page creators

This web page was created by Mike Chen, Ming Ming Hao, Vickie Kwan, and Nancy Zaramian.

Last updated: April 10, 1997

Copyright © 1997, Mike Chen, Ming Ming Hao, Vickie Kwan, Nancy Zaramian. All rights reserved. Reproduction of all or part of this work is permitted for educational or research use provided that this copyright notice is included in any copy. Disclaimer: this collection of notes is experimental, and does not serve as-is as a substitute for attendance in the actual class.
