Find the dominant cyclic substring of a given string

2018-06-25 05:42:46

I am looking for an algorithm to find the dominant cyclic substring of a given string.

A cyclic substring:

a substring which is repeated two or more times adjacently.

A dominant cyclic substring:

The substring which has the most adjacent repetitions is dominant

(adjacent repetitions occur an equal number of times)

the longest length substring is dominant

(on ties of length and adjacent repetitions)

the substring that appears first is dominant

Example 1:

prefixgarbagecyclecyclecyclecyclesufixgarbage

returns cycle :=> cycle is the most repeated adjacent substring

Example 2:

prefixgarbagecyclepadinggarbagecyclesufixgarbage

returns g :=> occurrences of cycle are not repeated adjacently, g repeats twice adjacently

Example 3:

prefixgarbagecyclecyclepadinggarbageprefixgarbage

returns cycle :=> cycle & g repeat twice adjacently but cycle is longer then g

Example 4:

prefixgarbagecyclecyclecycleroundroundroundprefixgarbage

returns cycle :=> cycle & round repeat thrice adjacently & same length but cycle appeared first

Exampe 5:

abcdefghijklmnopqrstuvqxyz

returns <empty string> because there is no repeated adjacent substring

What is the best approach for implementing this algorithm?

Cannot find anything better than this quadratic-time algorithm (implemented in Python):

IREP, IPER, IPOS = 0, 1, 2

def find_dominant(src):
  best = (0, 0, 0) # repetitions-1, period, position

  period = 0
  while period < len(src) // max(2, 1 + best[IREP]):
    period += 1
    length = 0

    for pos in range(len(src) - 1 - period, -1, -1):
      if src[pos] == src[pos + period]:
        length += 1
        repetitions = length // period
        if repetitions >= best[IREP]:
          best = (repetitions, period, pos)
      else:
        length = 0

  return best


s = "prefixgarbagecyclecyclecyclecyclesufixgarbage"
res = find_dominant(s)
if res[0] == 0:
  print("nothing found")
else:
  print(res[IREP] + 1, '*', s[res[IPOS]: res[IPOS] + res[IPER]])

For each possible period scan the string and remember the longest periodic subsequence. Scan it backwards to check less conditions. Stop increasing period when no further improvement could be found.

Time complexity is O(N2 / R), where R is the number of repetitions of dominant substring. Space complexity is O(1).

Scan from left to right.

You need some key/value pairs. The key is a letter. The value includes the index of the last instance of the letter found to that point in the scan AND info about any cycle the letter belongs to with two or more strings (the last one beginning with that letter in that column).

You need a place to store info about any cycles found. Call that the "cycle store."

As you scan do this at each index:

Use the letter there. See if its in the table of keys.

If it is found, look ahead to see if the following letters match the letters between the one found in the table (the previous occurrence) and this letter (at the current scan index).

If they match, we have a cycle, see if the previous occurrence has info showing this is a continuation of the stored cycle info. (Note: Adjacent letters may be a special case.)

if it is a continuation,

add to the stored cycle info to include these chars

update the index of the last found occurrence of this letter

update info about this cycle in the cycle store

if it is not a continuation

create (or replace) the stored cycle info to show this new cycle (count=2)

update the index of the last found occurrence of this letter

add info about this cycle in the cycle store

if the letters after this occurrence don't match the letters after that occurrence,

remove any stored cycle info for this letter

update the index of the last found occurrence of this letter

if the letter isn't in the table, add a key/value pair for this letter and this index.

When you get done scanning, look through the cycle store to see which one is dominant.

Note that you probably don't have to store all the cycles until the end but its not obvious to me right off how to decide which ones you can throw away. Probably something about keeping the ones that are still possibly continuing based on the content of the key/value pair table plus the dominant one so far.

Here is a possible approach (made simpler by the fact that your cycles have to be adjacent). Pick a string. See if it repeats. Keep track of the most repeated one.

EDIT actual tested Python code:

testStrings =[ "prefixgarbagecyclecyclecyclecyclesufixgarbage",
               "prefixgarbagecyclepadinggarbagecyclesufixgarbage",
               "prefixgarbagecyclecyclepadinggarbageprefixgarbage",
               "prefixgarbagecyclecyclecycleroundroundroundprefixgarbage",
               "abcdefghijklmnopqrstuvqxyz"];

for input in testStrings:

 repCountMax = 0
 longestCycle = ""
 repCount = 0
 for i in range (1, len(input)):
  for j in range( i+1, len(input)):
    substring = input[i:j]
    #print substring
    ls = len(substring)
    repCount = 1
    k = j
    while(substring == input[k:k+ls]):
      k = k + ls
      repCount = repCount +1
      #print "repetition ", repCount, " of ", substring, "n"
    if (repCount > repCountMax) or ((repCount == repCountMax) and len(substring) > len(bestCycle)):
      repCountMax = repCount
      bestCycle = substring

 if repCountMax > 1:
  print "best cycle in '", input, "' is '", bestCycle,"' which is repeated ", repCountMax, " times."
 else:
  print "no repeated cycles found in string ", input

Resulting output:

best cycle in ' prefixgarbagecyclecyclecyclecyclesufixgarbage ' is ' ecycl ' which is repeated 4 times.

best cycle in ' prefixgarbagecyclepadinggarbagecyclesufixgarbage ' is ' g ' which is repeated 2 times.

best cycle in ' prefixgarbagecyclecyclepadinggarbageprefixgarbage ' is ' ecycl ' which is repeated 2 times.

best cycle in ' prefixgarbagecyclecyclecycleroundroundroundprefixgarbage ' is 'ecycl ' which is repeated 3 times.

no repeated cycles found in string abcdefghijklmnopqrstuvqxyz

Note - the cycle found was ecycl , not cycle . ecycl occurred first...

Second note - you could make things marginally more efficient by stopping when you can no longer "beat" the current best estimate - for example, if you have found five repeats already, and given the size of the string you are searching for there isn't space for six repeats. This will give a speed improvement when there is a significant number of repeats. See Evgeny's solution for a way to implement that.

链接地址: http://www.djcxy.com/p/70660.html

上一篇: 找到最大的重复子字符串

下一篇: 查找给定字符串的主要循环子字符串