Generating suffixes from a Suffix Tree
I've built a suffix tree in Java based on the site here http://marknelson.us/1996/08/01/suffix-trees/ but I've run into a problem. I can build a suffix tree fine but I can trying to build a set of all suffixes from the tree. I basically find all the 'end nodes' and return the string which is represented by that 'end node'
That algorithm works for a word like "bookkeeper"
├── (1) bookkeeper
├── (9) e
│ ├── (8) eper
│ ├── (10) per
│ └── (12) r
├── (6) k
│ ├── (7) eeper
│ └── (5) keeper
├── (3) o
│ ├── (4) kkeeper
│ └── (2) okkeeper
├── (11) per
└── (13) r
Suffixes:
[bookkeeper, ookkeeper, okkeeper, kkeeper, keeper, eeper, eper, er, per, r]
But it fails when I use something like "ATATATATATA"
├── (1) ATATATATATA
└── (2) TATATATATA
Suffixes:
[ATATATATATA, TATATATATA]
But the proper suffixes should be:
[A, ATA, ATATA, ATATATA, ATATATATA, ATATATATATA, TA, TATA, TATATA, TATATATA, TATATATATA]
I can find the answer by finding all suffixes of each 'end node' string but that doesn't seem like that right approach. Any other suggestions?
EDIT: Thanks izomorphius! Adding an END_CHAR to the original string helped a bunch.
├── (21) #
├── (19) A
│ ├── (20) #
│ └── (15) TA
│ ├── (16) #
│ └── (11) TA
│ ├── (12) #
│ └── (7) TA
│ ├── (8) #
│ └── (3) TA
│ ├── (4) #
│ └── (1) TA#
└── (17) TA
├── (18) #
└── (13) TA
├── (14) #
└── (9) TA
├── (10) #
└── (5) TA
├── (6) #
└── (2) TA#
Suffixes:
[A, ATA, ATATA, ATATATA, ATATATATA, ATATATATATA, TA, TATA, TATATA, TATATATA, TATATATATA]
The typical tip on how to build a suffix tree is to add one more artificial character that you know is not in the alphabet. I usually add '#' and then I build the suffix tree for ATATATATATA# that way you will not have this problem any more.
You get the problem you describe because the missing suffixes are actually met as prefixes of another suffix. Adding an artificial character at the end ensures this will never happen.
链接地址: http://www.djcxy.com/p/40116.html上一篇: 如何反转字符串的后缀树(findind它表示的字符串)
下一篇: 从后缀树生成后缀