School of Computer Science

Winter 1997 Class Notes for 308-251

This lecture is about tries and suffix trees. Here is some general information on tries:

- The term trie comes from the word "re
**trie**val". - Tries were introduced in the 1960's by Fredkin.
- A suffix tree is a member of the trie family.

- The trie is a data structure that can be used to do a fast search in a large
text. For example, we can think of the Oxford English dictionary which
contains several gigabytes of text. The Oxford dictionary is a static
structure because we do not want to add or delete any items. However,
searching for an item in the dictionary is very important. Also, searching for
a string should be efficient because overlapping of strings can occur.

** FIGURE 1 **

- Tries are used to implement the dictionary abstract data type (ADT) where
basic operations like makenull, search, insert, and delete can be performed.

- They can be used for encoding and compression. (To be done in a
later lecture.)

- They can be used in regular expression search and approximate string
matching.

A regular expression is a way to define a pattern of characters. One
example of a regular expression is **[aB]*[cd][efg] **
where * is an indicator of repetition.
The above expression represent any string with the following properties:

- The string starts with any (possibly zero) number of a

- followed by exactly one of c

- followed by exactly one of e

Another example of a regular expression is **^.*$** where
^ denotes the beginning of a line, the dot is a wildcard that represents
any symbol and $ denotes the end of a line. Therefore, the expression
**^.*$** represents a line containing any string repeated any number
of time. That is, a whole line.

In UNIX, there is a command called grep which is used to report
all matching substrings in a file.

We can type the command: **grep '[aB]*[cd][efg]' <filename>** and
all substrings defined by this regular expression contained in
<filename> will be listed.

-A trie is a k-ary position tree.

-It is constructed from input strings, i.e. the input is a set of n
strings called S_{1},S_{2},...,S_{n},
where each S_{i} consists of symbols from a finite alphabet
and has a unique terminal symbol which we call $.

Some examples of alphabets are:

- {0,l} for binary files.
- {all 256 ASCII characters}
- {a,b,c,d,...,x,y,z}

- Non compact tries.
- Compact tries.
- Tries called "PATRICIA" which are even more compact.
- Suffix tries.
- Suffix trees.

A non compact trie is one in which every edge of the underlying tree
represents a symbol of the alphabet. (In the following example and for
the rest of the this page we will always assume that the symbols in our
alphabet are the CAPITAL letters A..Z with terminal symbol $)

Let's construct the trie from the following 5 strings: BIG, BIGGER, BILL,
GOOD, GOSH.

** FIGURE 2 **

When we look for the string GOOD, we start at the root and we follow the G
edge, followed by the O edge, another O edge and finally the D edge.

If we want to look for the string BAD, we start from the root, follow the B
edge and find out that there is no A edge after. Thus BAD is not in the text.

The above structure is rather wasteful because each edge represents a single
symbol. For huge texts this is an enormous waste of space. Instead, we must
find a way to represent the trie in a more compact form.

Before building a compact trie, let's look at 2 types of implementation for
tries.

**The first implementation uses an array of pointers.****The second implementation uses linked lists and is due to "de la Briandais".**

Each array cell has an index. If the alphabet is from A to Z, the array indices will range from A to Z plus the terminal symbol $, so the size of the array is equal to the alphabet size plus 1.

Here is an array representation for the strings BIG and GOO (in figure 3). A string start with either a B or a G. This is why in the first array, cell B and G have children to point to. The other cells have NIL pointers. For example, no word starts with the letter A, so cell A has a NIL pointer.

This implementation is wasteful when we have few words because most of the pointers are NIL.

Each node of the linked list represents a single symbol and has pointers to its next sibling and to its first child. Consider the following text:

The first child of s

The next sibling of s

This type of trie resembles the one in figure 2 except that chains which lead
to leaves are trimmed. This is illustrated in figure 5.

** FIGURE 5 **

The compact form of the trie is in figure 6:

** FIGURE 6 **

The number of leaves is n+1, where n is the number of input strings.
Furthermore, in the leaves, we may store either the strings themselves or
pointers to the strings (that is, integers).

The compact trie can be even more compacted. This will be discussed under
the topic Tries called "PATRICIA".

"PATRICIA" stands for "practical algorithm to retrieve information coded in alphanumeric". This will be different from the previous trie in that an edge can be labeled with more than one character. Hence, all the unary nodes will be collapsed. The following figure illustrates the collapsing process:

The very compact trie will look as follows:

Note:

For binary PATRICIA tries (where there are only 2 symbols), the number of internal nodes is equal to the number of leaves minus 1. The height of the PATRICIA trie is bounded by n (the number of leaves). The heights of the ordinary trie or the compact trie are not necessarily bounded by n.

Principles:

The idea behind suffix TRIE is to assign to each symbol in a text an index
corresponding to its position in the text.(ie: First symbol has index 1,
last symbol has indice n= #of symbols in text). To build the suffix TRIE
we use these indices instead of the actual
object.

- It requires less storage space.
- We do not have to worry how the text is represented. (bin, ASCII, etc)
- We do not have to store the same object twice. (no duplicate)

A suffix of a text [t

To demonstrate the structure of the resulting tree we will build the suffix trie corresponding to the following text:

TEXT: G O O G O L $

POSITION: 1 2 3 4 5 6 7

We begin by giving a position to every suffix in the text. We can now build a SUFFIX TRIE for all n suffixes of the text.

Note: The resulting tree has n leaves and height n.

The suffix tree is created by TRIMMING (compacting + collapsing every unary node) of the suffix TRIE.

The following is a picture of a suffix trie. The corresponding suffix tree is drawn below.We store pointers rather than words in the leaves. And we replaced every string by a pair of indices, (a,b), where a is the index of the beginning of the string and b the index of the end of the string. i.e: We write

- (3,7) for OGOL$
- (1,2) for GO
- (7,7) for $

Searching for all instances of a substring S in a suffix tree is
easy since the symbols in S define a path down the suffix tree. Following
this path, if we encountered a NIL pointer before reaching the end, then S
is not in the tree. If we end up at a node x then S occurs at least
once. Moreover the places where S can be found are given by the pointers
in all the leaves in the subtree rooted at x.

Pseudo-code for searching in suffix tree:

SEARCH

- Start at root
- Go down the tree by taking each time the corresponding bifurcation
- If S correspond to a node then return all leaves in subtree
- If S encountered a NIL pointer then S is not in the tree

If S = "GO" we take the GO bifurcation and return: GOOGOL$,GOL$.

If S = "OR" we take the O bifurcation and then we hit a NIL pointer so "OR" is not in the tree.

- If we have
- TRIE
- Alphabet = {0,1}
- Each string consists of infinitely many independent coin-flips
- s
_{1}= 01101111010001.... - s
_{2}= 11010110110101101... - ...
- s
_{n}= 010111011101... - Compact the resulting tree by eliminating all chains leading to leaves
- Each coin-flip has a 50% chance of being 0 or 1. So starting at any node, we have 50% chance of going left, 50% chance of going right.
- The process ends with the last digit of S. At that point we have:

Putting a period in front, each string can be thought of as the binary expansion of a real number between 0 and 1.

If a new string S of length k is to be added in the TRIE, we can expect that
the distance between the leaf representing S and the root is approximately
log_{ 2} n.
Intuitive proof: