how to get longest repeating string in substring from suffix tree
I need to find the longest repeating string in substring. Let's say I have string "bannana"
Wikipedia says following:
In computer science, the longest repeated substring problem is the problem of finding the longest substring of a string that occurs at least twice. In the figure with the string "ATCGATCGA$", the longest repeated substring is "ATCGA"
So I assume that for string "bannana"
there are two equally long substrings (if not correct me please): "an"
and "na"
.
Wikipedia also says that for this purpose suffix trees are used. To be more specific here is quotation how to do it (this seems to me more understable than definition on wikipedia):
build a Suffix tree, then find the highest node with at least 2 descendants.
I've found several implementations of suffix trees. Following code is taken from here:
use strict;
use warnings;
use Data::Dumper;
sub classify {
my ($f, $h) = (shift, {});
for (@_) { push @{$h->{$f->($_)}}, $_ }
return $h;
}
sub suffixes {
my $str = shift;
map { substr $str, $_ } 0 .. length($str) - 1;
}
sub suffix_tree {
return +{} if @_ == 0;
return +{ $_[0] => +{} } if @_ == 1;
my $h = {};
my $classif = classify sub { substr shift, 0, 1 }, @_;
for my $key (sort keys %$classif) {
my $subtree = suffix_tree(
grep "$_", map { substr $_, 1 } @{$classif->{$key}}
);
my @subkeys = keys %$subtree;
if (@subkeys == 1) {
my $subkey = shift @subkeys;
$h->{"$key$subkey"} = $subtree->{$subkey};
} else { $h->{$key} = $subtree }
}
return $h;
}
print +Dumper suffix_tree suffixes 'bannana$';
for string "bannana"
it returns following tree:
$VAR1 = {
'$' => {},
'n' => {
'a' => {
'na$' => {},
'$' => {}
},
'nana$' => {}
},
'a' => {
'$' => {},
'n' => {
'a$' => {},
'nana$' => {}
}
},
'bannana$' => {}
};
Another implementation is online from here, for string "bannana"
it returns following tree:
7: a
5: ana
2: annana
1: bannana
6: na
4: nana
3: nnana
|(1:bannana)|leaf
tree:|
| |(4:nana)|leaf
|(2:an)|
| |(7:a)|leaf
|
| |(4:nana)|leaf
|(3:n)|
| |(5:ana)|leaf
3 branching nodes
Questions:
"an"
and "na"
strings? 1. How can I get from those graphs "an" and "na" strings?
build a Suffix tree, then find the highest node with at least 2 descendants.
string-node
is concatenate strings for every node from root to this node. highest node
is node with maximum length string-node
.
See tree in my answer for second question. (3:n)
have 2 descendants and path to node is (2:a)->(3:n)
, concatenate is an
. And also for (5:a)
get na
.
2. As you can see trees are different, are they equivalent or not, if yes why they are different, if not which algorithm is correct?
These trees are different. Rebuild second tree for string "bannana$"
( as in the first tree):
8: $
7: a$
5: ana$
2: annana$
1: bannana$
6: na$
4: nana$
3: nnana$
|(1:bannana$)|leaf
tree:|
| | |(4:nana$)|leaf
| |(3:n)|
| | |(7:a$)|leaf
|(2:a)|
| |(8:$)|leaf
|
| |(4:nana$)|leaf
|(3:n)|
| | |(6:na$)|leaf
| |(5:a)|
| | |(8:$)|leaf
|
|(8:$)|leaf
5 branching nodes
3. If perl implementation is wrong is there any working implementation for perl/python?
I don't know Perl, but the tree is built correctly.
4. I've read about Ukkonen's algorithm which is also mentioned on page with 2nd example (I did not catch if the online version is using this algorithm or not), does any of the mentioned examples using this algorithm? If not, is used algorithm slower or has any drawbacks compared to Ukkonen?
I said earlier that I don't know Perl, but it's a line in first algorthim means that it works at least O(n^2)
( n
it is length string):
map { substr $str, $_ } 0 .. length($str) - 1;
Ukkonen's algorithm works linear time O(n)
.
First algorithm also recursive which may affect used memory.
链接地址: http://www.djcxy.com/p/40112.html上一篇: 字符串与后缀树的隐式表示匹配
下一篇: 如何从后缀树中得到子串中最长的重复字符串