Ukkonen's algorithm for Generalized Suffix Trees
I am currently working on my own Suffix Tree implementation (using C++, but the question remains language agnostic). I studied the original paper from Ukkonen. The article is very clear so I got to work on my implementation and tried to tackle the problem for Generalized Suffix Trees.
In the tree, each substring leading from a node to another is represented using a pair of integer. While this is straightforward for a regular suffix tree, a problem arises when multiple strings coexist in the same tree (which becomes a Generalized Suffix Tree). Indeed, now such a pair is not sufficient, we need another variable to state which reference string we are using.
A quick example. Consider the string coconut
:
nut
would be (4,6)
. troublemaker
in the tree, (4,6)
can now be: nut
if we refer to the first string. ble
if we refer to the second string. To handle this, I thought adding an id representing the string:
// A pair of int is a substring (regular tree)
typedef std::pair<int,int> substring;
// We need to bind a substring to its reference string:
typedef std::pair<int, substring> mapped_substring;
The problem I currently face is the following:
I get a query to add a string in the tree. During the algorithm, I may have to inspect existing transitions related to other registered strings, represented as a triplet (reference string id, k, p). Some updating operations are based on the substrings indexes, how can I perform them in such conditions ?
Note: The question is language-agnostic so I did not include the c++-tag, though a little snippet is shown.
TL;DR
The original algorithm doesn't really need to be modified in order to build a Generalized Suffix Tree.
Here is the github for my current implementation (in C++). It still needs some reviewing and refactoring (and some extensive testing...) but it's a start!
Note: I am currently working on this implementation, I'll update this answer with stuff when I'll have it available! (Coliru lives and the like...)
Detailed analysis
The hunch I got was the right way. In order to keep up with the tricks used in the original algorithm, we indeed need to add a reference to the original string. Moreover, the algorithm is online, which means you can add strings on-the-fly to the tree.
Suppose we have a Generalized Suffix Tree GST(N) for strings (S1, ..., SN). The problem at hand here is how to handle the building process of GST(N+1), using GST(N).
Tweaking the data model
In the simple case (single suffix tree), each transition is a couple (substring, end vertex). The trick in Ukkonen's algorithm is to model a substring using a pair of pointers to the appropriate positions in the original string. Here, we need to also link such a substring to its "parent" string. To do so:
We call this a mapped substring. The C++ typedefs I use are the ones found in my original question:
// This is a very basic draft of the Node class used
template <typename C>
class Node {
typedef std::pair<int, int> substring;
typedef std::pair<int, substring> mapped_substring;
typedef std::pair<mapped_substring, Node*> transition;
// C is the character type (basically `char`)
std::unordered_map<C, transition> g; // Called g just like in the article :)
Node *suffix_link;
};
As you will see, we will keep the reference pair concept as well. This time, the reference pair, just like the transition, will hold a mapped substring.
Note: As in C++, strings indexes will start at 0.
Inserting the new string
We want to insert SN+1 into GST(N).
GST(N) may have already a whole lot of nodes and transitions. In a simple tree, we would only have the root and the special sink node. Here, we may have transitions for SN+1 that have already been added through the insertion of some previous strings. The first thing to do is to walk down the tree through the transitions as long as it matches SN+1.
By doing so, we end in a state r. This state may be explicit (ie we ended right on a vertex) or implicit (a mismatch occurred in the middle of a transition). We use the same concept as in the original algorithm to model such a state: a reference pair. Quick example:
banana
ba
explicitly exists in GST(N) nal
When we walk down the tree, we end up in the transition t at the character l
which is a mismatch. Thus, the implicit state r we get is represented by the reference pair (s, m) where m is the mapped substring (N+1, (1,3)).
Here, r is the active point for the 5-th iteration of the algorithm in the building of banana
's suffix tree. The fact we got to that state means precisely that the tree for bana
is already built in GST(N).
In this example, we resume the algorithm at the 5-th iteration, to build the suffix tree for banan
using the tree for bana
. Not to lose the generality, we will state that r = (s, (N+1, (k, i-1)), i being the index of the first mismatch. We have indeed k ≤ i (the egality is synonym of r being an explicit state).
Property: We can resume Ukkonen's algorithm to build GST(N) at iteration i (insertion of the character at index i in SN+1). The active point for this iteration is the state r we got by walking down the tree . The only tweaking needed is some fetching operations to resolve the substrings.
Proof for the property
First, the presence of such a state r implies that the whole states for the intermediate tree T(N+1)i-1 are also there. So all is set up and we resume the algorithm.
We need to prove that each procedure in the algorithm remains valid. There are 3 such subroutines:
test_and_split
: given the character to insert at current iteration, test wether we need to split a transition into two separate transitions and if the current point is the end point. canonize
: Given a reference pair (n, m) where n is a vertex and ma mapped substring, returns the couple (n', m') representing the same state such as m' is the shortest possible substring. update
: Update GST(N) so that it has all the states for intermediate tree T(N+1)i at the end of the run. test_and_split
Input: A vertex s, a mapped substring m = (l, (k, p)) and a character t.
Output: A boolean that tells if the state (s, m) is the end point for the current iteration and a node r representing explicitly (s, m) if it is not the end point.
The simplest case goes first. If the substring is empty (k > p), the state is already represented explicitly. We just have to test if we reached the end point or not. In a GST, just like in a common suffix tree, there is ALWAYS at most one transition per node starting with a given character. Thus, if there is a transition starting with t, we return true (we reached the end point), otherwise false.
Now the hard part, when k ≤ p. We first need to fetch the string Sl lying at index l (*) in the original strings table.
Let (l', (k', p')) (resp. s') be the substring (resp. the node) related to the transition TR of s starting with the character Sl(k) (*) . Such a transition exists because (s, (l,(k,p)) represents an (existing) implicit state on the border path of the intermediate tree T(N+1)i-1. Furthermore, we are sure that the p - k first characters on this transition matches.
Do we need to split this transition? That depends on the Δ = p - k + 1-th character on this transition (**) . To test this character, we need to fetch the string lying at index l' on the hash table and get the character at index k' + Δ. This character is guaranteed to exist because the state we are examining is implicit and thus ends in the middle of the transition TR (Δ ≤ p' - k').
If the equality holds, we have nothing to do and return true (the end point is here) and do nothing else. If not, then we must split the transition and create a new state r. TR now becomes (l', (k', k' + Δ - 1)) → r. Another transition is created for r: (l', (k' + Δ, p') → s'. We now return false and r.
(*) : l is not necessarily equal to N+1. Likewise, l and l' may be different (or equal).
(**) : Notice that the number Δ = p - k + 1 does not depend at all on the string chosen as a reference for the mapped substring. It only depends on the implicit state fed to the routine.
Canonize
Input: A node _s_and a mapped substring (l,(k,p)) representing an existing state e in the tree.
Output: A node s' and a mapped substring (l',(k',p')) representing the canonical reference for the state e
Using the same fetching tweaks, we just have to walk down the tree until we have exhausted the character "pool". Here, just like for test_and_split
the unicity of each transition and the fact we have an existing state as input provides us with a valid procedure.
Update
Input: The active point and the index for the current iteration.
Output: The active point for the next iteration.
update
uses both canonize
and test_and_split
which are GST-friendly. The suffix link editing is exactly the same as the one for a common tree. The only thing to notice is that we will create the open transitions (ie the transitions leading to nodes) using SN+1 as the reference string. Thus, at iteration i, the transition will always be linked to the mapped substring (N+1,(i,∞))
Final step
We need to "close" the open transitions. To do so, we just traverse them and edit the ∞ away, replacing it with L-1, where L is the length of SN+1. We also need a sentinel character to mark the end of the string. A character we are sure to never meet in any string. This way, the leaves will remain leaves forever.
Conclusion
The extra fetching work adds a few O(1) operations, increasing a little bit the constant factor of the complexity. Though, the asymptotic complexity remains obviously linear with the length of the inserted strings. Thus, building GST(N) from strings (S1,..., SN) with length n1,...,nN is:
c(GST(N)) = Σi=1..N ni
If your generalized suffix tree is going to have only few strings in it, then you can concatenate them together in a single string using unique terminal symbols (these terminal symbols should not be used in input strings) between each strings.
For example, lets say you have 5 strings: str1, str2, str3, str4 and str5, then you can concatenate these 5 strings as str1$str2#str3@str4%str5 and then make a suffix tree of this concatenated string.
Since we MUST used unique terminal symbols so there would be a limitation on how many maximum string that could be added in generalized suffix tree. Any character which will NEVER be used in input strings, can be taken as terminal symbols.
So based on a predefined set of terminal symbols, we can write the code.
Following article may be helpful.
Generalized Suffix Tree
链接地址: http://www.djcxy.com/p/40098.html上一篇: 用简单的英语,有什么区别?
下一篇: 用于广义后缀树的Ukkonen算法