Longest common substring in two sequences of strings

2018-06-05 02:26:00

Having just learned the longest common substring algorithm, I was curious about a particular variant of the problem. It is described as follows -:

Given two non-empty sequences of strings, X = (x1, x2, x3,....,x(n)) and Y = (y1, y2, y3,..., y(m)), where x(i) and y(i) are strings of characters, find the longest string in X which is a substring of all the strings of Y.

I have a function substring(x, y) which returns booleans depicting whether x is a substring in y or not. Evidently, I have to concatenate all the strings in Y to form one big string, say, denoted by B. I thought of the following approaches -:

Naive : Start by concatenating all strings in X to form a string A(n). Apply substring(A(n), B) - this includes iterating backward in string A(n). If true, the algorithm ends here and returns A(n) - or whatever portion of it is included in said substring. If not, proceed to apply (A(n - 1), B) and so on. If such a string does not exist in X, I return the empty string.

Obviously this approach would take up quite some running time depending on the implementation. Assuming I use an iterative approach, on each iteration I would have to iterate backward through the String at that level/index, and subsequently apply substring(). It would take atleast two loops, and O(size(B) * maxlength(x1, x2,...)) worst case time, or more depending on substring() (correct me if wrong).

I thought of a second approach based on suffix trees/arrays.

Generalized Suffix Tree : I build a GST of sequence Y using Ukkonen's algorithm in O(maxlength(y1, y2,...) (?). My lack of knowledge of suffix trees bites. I believe a suffix tree approach would substantially reduce running time (at the cost of space) for finding the substring, but I have no idea how to implement the operation.

If there is a better approach, I'd love to know.

EDIT: Apologies if it seemed like I abandoned the topic.

What if I were to use not a GST, but some standard data structure such as a stack, queue, set, heap, priority queue, etc.? The sequence X would have to be sorted, largest string first, naturally. If I store it in a string array, I will have to use a sorting algorithm such as mergesort/quicksort. The goal is to get the most efficient running time as possible.

Can I not store X in a structure that automatically sorts its elements as it builds itself? What about a max-heap?

It would seem like the suffix tree is the best way to find substrings in this fashion. Is there any other data structure I could use?

First, order the array X for the longest string to shorter. This way, the first string in X that be a substring of all Y strings is the solution.

A multiprocessor algorithm would be the best way to solve the problem of test each X string with all Y strings quickly.

Here is my idea about a solution of your problem; I am not sure about everything so comments are welcome to improve it if you think it worths the effort.

Begin with computing all common substrings of all strings in Y. First take two strings, and build a tree of all common substrings. Then, for each other string in Y, remove from the map every substring that does not appear in this string. The complexity is linear with the number of strings in Y, but I can't figure out how many elements might be in the tree so I cannot draw an estimation of the final complexity.

Then find the longest string in X which is a substring of one in the tree.

There must be some improvements to do to keep the tree as small as possible, such as keeping only substrings that are not substrings of others.

Writing |Y| for the number of strings in the set Y, and len(Y) for their total length:

Process the strings in Y into a generalized suffix tree (for example, using Ukkonen's algorithm). Takes time O(len(Y)), assuming a constant-size alphabet.

Mark each node in the suffix tree according to whether the string identified by that node belongs to all the strings in Y. Takes time O(|Y| len(Y)).

For each string in X, look it up in the suffix tree and see if the node is marked as belonging to all the strings in Y. Output the longest such marked string. Takes time O(len(X)).

Total time: O(|Y| len(Y)) + O(len(X)).

链接地址: http://www.djcxy.com/p/16306.html

上一篇: UISearchBar AutoLayout ios7错误？

下一篇: 两个字符串序列中最长的公共子字符串