Ukkonen's suffix tree algorithm in plain English
I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I've found is Fast String Searching With Suffix Trees, but he glosses over various points and some aspects of the algorithm remain unclear.
A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I'm sure.
For reference, here's Ukkonen's paper on the algorithm: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
My basic understanding, so far:
The basic algorithm appears to be O(n2), as is pointed out in most explanations, as we need to step through all of the prefixes, then we need to step through each of the suffixes for each prefix. Ukkonen's algorithm is apparently unique because of the suffix pointer technique he uses, though I think that is what I'm having trouble understanding.
I'm also having trouble understanding:
Here is the completed C# source code. It not only works correctly, but supports automatic canonization and renders a nicer looking text graph of the output. Source code and sample output is at:
https://gist.github.com/2373868
Update 2017-11-04
After many years I've found a new use for suffix trees, and have implemented the algorithm in JavaScript . Gist is below. It should be bug-free. Dump it into a js file, npm install chalk
from the same location, and then run with node.js to see some colourful output. There's a stripped down version in the same Gist, without any of the debugging code.
https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6
The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (ie does not contain any repeated characters), and then extending it to the full algorithm.
First, a few preliminary statements.
What we are building, is basically like a search trie. So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth
But : Unlike in a search trie, the edge labels are not single characters. Instead, each edge is labeled using a pair of integers [from,to]
. These are pointers into the text. In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).
Basic principle
I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:
abc
The algorithm works in steps, from left to right . There is one step for every character of the string . Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).
So, we start from the left, and first insert only the single character a
by creating an edge from the root node (on the left) to a leaf, and labeling it as [0,#]
, which means the edge represents the substring starting at position 0 and ending at the current end. I use the symbol #
to mean the current end, which is at position 1 (right after a
).
So we have an initial tree, which looks like this:
And what it means is this:
Now we progress to position 2 (right after b
). Our goal at each step is to insert all suffixes up to the current position . We do this by
a
-edge to ab
b
In our representation this looks like
And what it means is:
We observe two things:
ab
is the same as it used to be in the initial tree: [0,#]
. Its meaning has automatically changed because we updated the current position #
from 1 to 2. Next we increment the position again and update the tree by appending a c
to every existing edge and inserting one new edge for the new suffix c
.
In our representation this looks like
And what it means is:
We observe:
#
, and inserting the one new edge for the final character can be done in O(1) time. Hence for a string of length n, only O(n) time is required. First extension: Simple repetitions
Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:
abcabxabcd
It starts with abc
as in the previous example, then ab
is repeated and followed by x
, and then abc
is repeated followed by d
.
Steps 1 through 3: After the first 3 steps we have the tree from the previous example:
Step 4: We move #
to position 4. This implicitly updates all existing edges to this:
and we need to insert the final suffix of the current step, a
, at the root.
Before we do this, we introduce two more variables (in addition to #
), which of course have been there all the time but we haven't used them so far:
(active_node,active_edge,active_length)
remainder
, which is an integer indicating how many new suffixes we need to insert The exact meaning of these two will become clear soon, but for now let's just say:
abc
example, the active point was always (root,'