Longest common subsequence with fixed length substrings

2018-06-21 21:54:08

I have been using Longest Common Subsequence (LCS) to find the similarly between sequences. The following dynamic programming code computes the answer.

def lcs(a, b):
    lengths = [[0 for j in range(len(b)+1)] for i in range(len(a)+1)]
    # row 0 and column 0 are initialized to 0 already
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                lengths[i+1][j+1] = lengths[i][j] + 1
            else:
                lengths[i+1][j+1] = 
                    max(lengths[i+1][j], lengths[i][j+1])
    # read the substring out from the matrix
    result = ""
    x, y = len(a), len(b)
    while x != 0 and y != 0:
        if lengths[x][y] == lengths[x-1][y]:
            x -= 1
        elif lengths[x][y] == lengths[x][y-1]:
            y -= 1
        else:
            assert a[x-1] == b[y-1]
            result = a[x-1] + result
            x -= 1
            y -= 1
    return result

However I have realised what I really want to solve is a little different. Given a fixed k I need to make sure that the common subsequence only involves substrings of length exactly k. For example, set k = 2 and let the two strings be

A = "abcbabab"
B = "baababcc"

The subsequence that I need would be " ba"+"ab" = baab .

Is it possible to modify the dynamic programming solution to solve this problem?

The original longest common subsequence problem would just be the k=1 case.

A method that doesn't work.

If we perform the LCS algorithm above, we can get the alignment from the dynamic programming table and check whether those symbols appear in non-overlapping substrings of length k in both input sequences, and delete them if not. The problem is that this doesn't give an optimal solution.

The correction is basically when the two strings in the relevant index have a matching substring (instead of a matching letter, as it is now).

The idea is instead of simply checking for a substring of size 1 in the original solution, check for a substring of length k , and add 1 to the solution (and 'jump' by k in the string).

The formula for the recursive approach that should be translated to the DP solution is:

f(i,0) = 0
f(0,j) = 0
f(i,j) = max{
          f(i-k,j-k) + aux(s1,s2,i,j,k)
          f(i-1,j)
          f(i,j-1)
             }

where aux(s1,s2,i,j,k) is a function that is aimed to check if the two substrings are a match, and is defined as:

aux(s1,s2,i,j,k) = 1         | if s1.substring(i-k,i) equals s2.substring(j-k, j)
                   -infinity | otherwise

You can reconstruct the alignment later using an auxilary matrix that marks the choices of max{} , and go from last to first when the matrix is complete.

Example:

bcbac and cbcba . k=2

Matrix generated by f :

      c   b   c  b   a
   0  0   0   0  0   0
b  0  0   0   0  0   0
c  0  0   0   1  1   1   
b  0  0   1   1  1   1
a  0  0   1   1  1   2
c  0  0   1   1  1   2

And for reproducing the alignment you generate 'choices' matrix:

1 - chose f(ik,jk) + aux(s1,s2,i,j,k)
2 - chose f(i-1,j)
3 - chose f(i,j-1)
d - don't care, (all choices are fine)
x/y -means one of x or y.

c      b      c     b      a

b   2/3    2/3    2/3   2/3    2/3
c   2/3    2/3     1     2      2   
b   2/3     3     2/3    d     2/3
a   2/3     3     2/3   2/3     1
c   2/3     3     2/3   2/3     3

Now, reconstructing the alignment - start from the last (bottom right) cell:

It's '3' - move up, don't add anything to the alignment.

It's 1 - we need to add 'ba' to the alignment. (alignment='ba' currently). move k up and left.

It's 1, ad 'bc' to the alignment. current alignment: 'bcba'. move k up and left.

It's 2/3 - move left OR up.

The order of visiting while reconstructing is: (0 means - not visited, number in many cells means any of these is OK).

      c   b   c  b   a
   0  4   0   0  0   0
b  4  3   0   0  0   0
c  0  0   0   0  0   0   
b  0  0   0   2  0   0
a  0  0   0   0  0   0
c  0  0   0   0  0   1

链接地址: http://www.djcxy.com/p/61510.html

上一篇: 不同子序列的数量

下一篇: 具有固定长度子串的最长公共子序列