Data Structure for Subsequence Queries
In a program I need to efficiently answer queries of the following form: given a set of strings A and a query string q, return all s ∈ A such that q is a subsequence of s.
For example, given A = {"abcdef", "aaaaaa", "ddca"} and q = "acd", exactly "abcdef" should be returned.
The following is what I have considered so far:
For each possible character, make a sorted list of all strings/locations where it appears. For querying, interleave the lists of the involved characters and scan through them, looking for matches within string boundaries.
This would probably be more efficient with words instead of characters, since the limited number of distinct characters makes the returned lists very dense.
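As a minimal sketch of this idea (a per-string variant with illustrative names: one sorted position list per character per string, walked with binary search; the globally interleaved list would be assembled from the same data as (string, position) pairs):

from bisect import bisect_left
from collections import defaultdict

def build_position_index(A):
    # For every string, map each character to the sorted list of the
    # positions where it occurs.
    index = []
    for s in A:
        positions = defaultdict(list)
        for j, ch in enumerate(s):
            positions[ch].append(j)
        index.append((s, positions))
    return index

def find_matches(index, q):
    out = []
    for s, positions in index:
        j = 0                            # smallest admissible position in s
        for ch in q:
            lst = positions.get(ch)
            if lst is None:
                break                    # ch never occurs in s
            k = bisect_left(lst, j)      # first occurrence of ch at or after j
            if k == len(lst):
                break
            j = lst[k] + 1               # consume it and continue rightwards
        else:
            out.append(s)
    return out

With the example above, find_matches(build_position_index(["abcdef", "aaaaaa", "ddca"]), "acd") returns ["abcdef"].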
For each n-prefix a query q might have, store the list of all matching strings. n might realistically be close to 3; for query strings longer than that we brute force the initial list. This might speed things up a bit, but one could easily imagine some n-subsequences being present in close to all strings of A, which means the worst case is the same as brute forcing the entire set.
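A minimal sketch of this heuristic with n = 3 (names are illustrative; is_sub is the same greedy scan as the isSub function in the update further down, and A is assumed to be a list):

from itertools import combinations

def is_sub(q, s):
    it = iter(s)                         # greedy scan, same as isSub below
    return all(ch in it for ch in q)

def build_prefix_index(A, n=3):
    # idx[p] -> indices of strings containing the length-n string p as a
    # subsequence. Preprocessing enumerates all n-subsequences of every
    # string, i.e. C(len(s), n) tuples each, so it only suits small n.
    idx = {}
    for i, s in enumerate(A):
        for c in set(combinations(s, n)):
            idx.setdefault(''.join(c), []).append(i)
    return idx

def search(A, idx, q, n=3):
    if len(q) < n:
        return [s for s in A if is_sub(q, s)]       # brute force short queries
    candidates = idx.get(q[:n], [])                 # strings containing q[:n]
    return [A[i] for i in candidates if is_sub(q, A[i])]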
Do you know of any data structures, algorithms or preprocessing tricks which might be helpful for performing the above task efficiently for large A? (My strings s will be around 100 characters long.)
Update: Some people have suggested using LCS to check if q is a subsequence of s. I just want to remind that this can be done using a simple function such as:
def isSub(q, s):
    # Greedy two-pointer scan: walk through s, consuming a character of q
    # on every match; q is a subsequence of s iff all of q gets consumed.
    i, j = 0, 0
    while i != len(q) and j != len(s):
        if q[i] == s[j]:
            i += 1
            j += 1
        else:
            j += 1
    return i == len(q)
Update 2: I've been asked to give more details on the nature of q, A and its elements. While I'd prefer something that works as generally as possible, I assume A will have length around 10^6 and will need to support insertion. The elements s will be shorter, with an average length of 64. The queries q will only be 1 to 20 characters long and will be used for a live search, so the query "ab" will be sent just before the query "abc". Again, I'd much prefer the solution to rely on these assumptions as little as possible.
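Given the live-search pattern, one sketch (illustrative; it reuses the isSub function from the update above and ignores cache invalidation when new elements are inserted into A) is to cache each prefix's result and re-check only those hits on the next keystroke, since every match for "abc" is necessarily a match for "ab":

class LiveSearch:
    def __init__(self, A):
        self.A = list(A)
        self.cache = {'': list(range(len(self.A)))}   # '' matches everything

    def query(self, q):
        if q not in self.cache:
            # Longest cached prefix of q; the empty prefix always exists.
            base = next(q[:i] for i in range(len(q) - 1, -1, -1)
                        if q[:i] in self.cache)
            self.cache[q] = [i for i in self.cache[base]
                             if isSub(q, self.A[i])]
        return [self.A[i] for i in self.cache[q]]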
Update 3: It has occurred to me that a data structure with O(n^{1-epsilon}) lookups would allow you to solve OVP / disprove the SETH conjecture. That is probably the reason for our suffering. The only options are then to disprove the conjecture, use approximation, or take advantage of the dataset. I imagine quadlets and tries would do the last in different settings.
It could be done by building an automaton. You can start with an NFA (nondeterministic finite automaton, which is like a nondeterministic directed graph) that allows edges labeled with an epsilon character, meaning that during processing you can jump from one node to another without consuming any character. I'll use a reduced version of your A. Let's say your A is:
A = {'ab', 'bc'}
If you build the NFA for the string ab, you should get something like this:
+--(1)--+
e | a| |e
(S)--+--(2)--+--(F)
| b| |
+--(3)--+
The above drawing is not the best-looking automaton, but there are a few points to consider:
The S state is the starting state and F is the final state; reaching the F state means your string qualifies as a subsequence. You can consume an e (epsilon) to jump forward without reading any character, therefore you can be at more than one state at each point in time; the set of states reachable this way is called the e-closure. Now, given b, starting at state S I can jump one epsilon, reach 2, consume b and reach 3. Given the end of the string I consume an epsilon and reach F, thus b qualifies as a subsequence of ab. So do a and ab, as you can verify yourself using the above automaton.
The good thing about NFAs is that they have one start state and one final state, so two NFAs can easily be connected using epsilons. There are various algorithms that can convert an NFA to a DFA. A DFA is a directed graph which follows a precise path given a character; in particular, it is always in exactly one state at any point in time. (For any NFA, there is a corresponding DFA whose states correspond to sets of states in the NFA.)
So, for A = {'ab', 'bc'}, we would need to build the NFA for ab, then the NFA for bc, then join the two NFAs and build the DFA of the entire big NFA.
EDIT
The NFA accepting exactly the subsequences of abc would be a?b?c?, so you can build your NFA from that regular expression. Now, consider the second input acd. To query if ab is a subsequence of {'abc', 'acd'}, you can use this NFA: (a?b?c?)|(a?c?d?). Once you have the NFA you can convert it to a DFA, where each state will record whether the input is a subsequence of abc or acd, or maybe both.
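A minimal Python sketch of this (illustrative names; instead of materializing the full DFA up front, it determinizes lazily: a DFA state is a tuple with one greedy pointer per string of A, and transitions are cached as they are first taken):

def make_matcher(A):
    # State entry i is the end of the shortest prefix of A[i] containing
    # the characters read so far as a subsequence (greedy matching is safe
    # for subsequences), or None once A[i] is ruled out.
    transitions = {}                     # (state, ch) -> next state

    def step(state, ch):
        key = (state, ch)
        if key not in transitions:
            nxt = []
            for ptr, s in zip(state, A):
                j = s.find(ch, ptr) if ptr is not None else -1
                nxt.append(j + 1 if j != -1 else None)
            transitions[key] = tuple(nxt)
        return transitions[key]

    start = tuple(0 for _ in A)

    def query(q):
        state = start
        for ch in q:
            state = step(state, ch)
        return [s for ptr, s in zip(state, A) if ptr is not None]

    return query

For example, make_matcher(['abc', 'acd'])('ab') returns ['abc']. Caching the transitions means the incremental queries of a live search reuse earlier work, while only the reachable part of the (potentially huge) DFA ever gets built.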
I used the link below to make the NFA graphic from a regular expression:
http://hackingoff.com/images/re2nfa/2013-08-04_21-56-03_-0700-nfa.svg
EDIT 2
You're right! If you have 10,000 unique characters in A (by unique I mean A is something like {'abc', 'def'}, i.e. the elements of A are pairwise disjoint), then in the worst case your DFA would have on the order of 2^10000 states. But I'm not sure when that would be possible, given that there can never be 10,000 distinct characters. Even if you have 10,000 characters in A there will still be repetitions, and that might reduce the number of states a lot, since e-closures might eventually merge. I cannot really estimate how much it would reduce them. But even with 10 million states you would only consume less than 10 MB worth of space to construct the DFA. You can even use the NFA and compute e-closures at run time, but that would add to the run-time complexity. You can search for papers on how large regexes are converted to DFAs.
EDIT 3
For the regex (a?b?c?)|(e?d?a?)|(a?b?m?), converting the NFA to a DFA yields an automaton with a lot fewer states than the NFA.
Reference: http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
EDIT 4
After fiddling with that website some more, I found that the worst case would be something like A = {'aaaa', 'bbbbb', 'cccc', ...}. But even in this case the DFA has fewer states than the NFA.
Tests
There have been four main proposals in this thread:
Shivam Kalra suggested creating an automaton based on all the strings in A. This approach has been tried a bit in the literature, normally under the name "Directed Acyclic Subsequence Graph" (DASG).
J Random Hacker suggested extending my 'prefix list' idea to all 'n choose 3' triplets in the query string, and merging them all using a heap.
In the note "Efficient Subsequence Search in Databases" Rohit Jain, Mukesh K. Mohania and Sunil Prabhakar suggest using a Trie structure with some optimizations and recursively search the tree for the query. They also have a suggestion similar to the triplet idea.
Finally there is the 'naive' approach, which wanghq suggested optimizing by storing an index for each element of A.
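For reference, here is the minimal sketch of the recursive trie search mentioned above (an illustrative simplification without the paper's optimizations): on a matching edge the next query character is consumed greedily, which is safe for subsequence matching, so every node is visited at most once per query and strings sharing prefixes share work.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.ids = []                     # ids of strings ending at this node

def build_trie(A):
    root = TrieNode()
    for i, s in enumerate(A):
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(i)
    return root

def trie_search(node, q, out):
    # Collect ids of all strings in this subtree containing q as a subsequence.
    if not q:
        collect(node, out)                # everything below here matches
        return
    for ch, child in node.children.items():
        # Consume q[0] on a matching edge, otherwise skip this character of s.
        trie_search(child, q[1:] if ch == q[0] else q, out)

def collect(node, out):
    out.extend(node.ids)
    for child in node.children.values():
        collect(child, out)

Calling out = []; trie_search(build_trie(A), 'acd', out) leaves the ids of all matching strings in out.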
To get a better idea of what's worth putting continued effort into, I have implemented the above four approaches in Python and benchmarked them on two sets of data. The implementations could all be made a couple of orders of magnitude faster with a well-done implementation in C or Java; and I haven't included the optimizations suggested for the 'trie' and 'naive' versions.
Test 1
A consists of random paths from my filesystem. q are 100 random [a-z] strings of average length 7. As the alphabet is large (and Python is slow) I was only able to use duplets for method 3.
[Plot: construction times in seconds as a function of the size of A]
[Plot: query times in seconds as a function of the size of A]
Test 2
A consists of randomly sampled [ab] strings of length 20. q are 100 random [ab] strings of average length 7. As the alphabet is small we can use quadlets for method 3.
[Plot: construction times in seconds as a function of the size of A]
[Plot: query times in seconds as a function of the size of A]
Conclusions
The double logarithmic plot is a bit hard to read, but from the data we can draw the following conclusions:
Automatons are very fast at querying (constant time), however they are impossible to create and store for |A| >= 256. It might be possible that a closer analysis could yield a better time/memory balance, or some tricks applicable for the remaining methods.
The dup-/trip-/quadlet method is about twice as fast as my trie implementation and four times as fast as the 'naive' implementation. I used only a linear number of lists for the merge, instead of the n^3 suggested by j_random_hacker. It might be possible to tune the method better, but in general it was disappointing.
My trie implementation consistently does better than the naive approach by around a factor of two. By incorporating more preprocessing (like "where are the next 'c's in this subtree") or perhaps merging it with the triplet method, it seems like today's winner.
If you can do with an order of magnitude less performance, the naive method does comparatively fine for very little cost.
As you point out, it might be that all strings in A contain q as a subsequence, in which case you can't hope to do better than O(|A|). (That said, you might still be able to do better than the time taken to run LCS on (q, A[i]) for each string i in A, but I won't focus on that here.)
TTBOMK there are no magic, fast ways to answer this question (in the way that suffix trees are the magic, fast way to answer the corresponding question involving substrings instead of subsequences). Nevertheless if you expect the set of answers for most queries to be small on average then it's worth looking at ways to speed up these queries (the ones yielding small-size answers).
I suggest filtering based on a generalisation of your heuristic (2): if some database sequence A[i] contains q as a subsequence, then it must also contain every subsequence of q. (The reverse direction is not true, unfortunately!) So for some small k, e.g. 3 as you suggest, you can preprocess by building an array of lists telling you, for every length-k string s, the list of database sequences containing s as a subsequence. I.e. c[s] will contain a list of the ID numbers of the database sequences containing s as a subsequence. Keep each list in numeric order to enable fast intersections later.
Now the basic idea for each query q is: find all k-sized subsequences of q, look up each in the array of lists c[], and intersect these lists to find the set of sequences in A that might possibly contain q as a subsequence. Then, for each sequence A[i] in this (hopefully small) intersection, perform an O(n^2) LCS calculation with q to see whether it really does contain q.
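A minimal sketch of this scheme (illustrative names; it reuses the question's isSub scan for the final verification instead of a quadratic LCS, and plain set intersection rather than merging the ordered lists; the index is built like the n-prefix index sketched in the question, but the query intersects the lists of all k-subsequences of q instead of looking up just one):

from itertools import combinations

def preprocess(A, k=3):
    # c[s] -> sorted list of ids of database strings containing the
    # length-k string s as a subsequence.
    c = {}
    for i, t in enumerate(A):
        for comb in set(combinations(t, k)):
            c.setdefault(''.join(comb), []).append(i)
    return c

def filtered_query(A, c, q, k=3):
    if len(q) < k:
        return [s for s in A if isSub(q, s)]        # too short to filter
    cand = None
    for comb in set(combinations(q, k)):            # every k-subsequence of q
        ids = set(c.get(''.join(comb), ()))
        cand = ids if cand is None else cand & ids
        if not cand:
            return []                               # one empty list rules out all
    return [A[i] for i in sorted(cand) if isSub(q, A[i])]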