Longest common subsequence with fixed length substrings
I have been using Longest Common Subsequence (LCS) to find the similarly between sequences. The following dynamic programming code computes the answer.
def lcs(a, b):
lengths = [[0 for j in range(len(b)+1)] for i in range(len(a)+1)]
# row 0 and column 0 are initialized to 0 already
for i, x in enumerate(a):
for j, y in enumerate(b):
if x == y:
lengths[i+1][j+1] = lengths[i][j] + 1
else:
lengths[i+1][j+1] =
max(lengths[i+1][j], lengths[i][j+1])
# read the substring out from the matrix
result = ""
x, y = len(a), len(b)
while x != 0 and y != 0:
if lengths[x][y] == lengths[x-1][y]:
x -= 1
elif lengths[x][y] == lengths[x][y-1]:
y -= 1
else:
assert a[x-1] == b[y-1]
result = a[x-1] + result
x -= 1
y -= 1
return result
However I have realised what I really want to solve is a little different. Given a fixed k I need to make sure that the common subsequence only involves substrings of length exactly k. For example, set k = 2 and let the two strings be
A = "abcbabab"
B = "baababcc"
The subsequence that I need would be " ba"+"ab" = baab
.
Is it possible to modify the dynamic programming solution to solve this problem?
The original longest common subsequence problem would just be the k=1 case.
A method that doesn't work.
If we perform the LCS algorithm above, we can get the alignment from the dynamic programming table and check whether those symbols appear in non-overlapping substrings of length k in both input sequences, and delete them if not. The problem is that this doesn't give an optimal solution.
The correction is basically when the two strings in the relevant index have a matching substring (instead of a matching letter, as it is now).
The idea is instead of simply checking for a substring of size 1 in the original solution, check for a substring of length k
, and add 1 to the solution (and 'jump' by k
in the string).
The formula for the recursive approach that should be translated to the DP solution is:
f(i,0) = 0
f(0,j) = 0
f(i,j) = max{
f(i-k,j-k) + aux(s1,s2,i,j,k)
f(i-1,j)
f(i,j-1)
}
where aux(s1,s2,i,j,k)
is a function that is aimed to check if the two substrings are a match, and is defined as:
aux(s1,s2,i,j,k) = 1 | if s1.substring(i-k,i) equals s2.substring(j-k, j)
-infinity | otherwise
You can reconstruct the alignment later using an auxilary matrix that marks the choices of max{}
, and go from last to first when the matrix is complete.
Example:
bcbac
and cbcba
. k=2
Matrix generated by f
:
c b c b a
0 0 0 0 0 0
b 0 0 0 0 0 0
c 0 0 0 1 1 1
b 0 0 1 1 1 1
a 0 0 1 1 1 2
c 0 0 1 1 1 2
And for reproducing the alignment you generate 'choices' matrix:
1 - chose f(ik,jk) + aux(s1,s2,i,j,k)
2 - chose f(i-1,j)
3 - chose f(i,j-1)
d - don't care, (all choices are fine)
x/y -means one of x or y.
c b c b a
b 2/3 2/3 2/3 2/3 2/3
c 2/3 2/3 1 2 2
b 2/3 3 2/3 d 2/3
a 2/3 3 2/3 2/3 1
c 2/3 3 2/3 2/3 3
Now, reconstructing the alignment - start from the last (bottom right) cell:
The order of visiting while reconstructing is: (0 means - not visited, number in many cells means any of these is OK).
c b c b a
0 4 0 0 0 0
b 4 3 0 0 0 0
c 0 0 0 0 0 0
b 0 0 0 2 0 0
a 0 0 0 0 0 0
c 0 0 0 0 0 1
链接地址: http://www.djcxy.com/p/61510.html
上一篇: 不同子序列的数量
下一篇: 具有固定长度子串的最长公共子序列