Find the dominant cyclic substring of a given string
I am looking for an algorithm to find the dominant cyclic substring of a given string.
A cyclic substring:
A dominant cyclic substring:
Example 1:
prefixgarbagecyclecyclecyclecyclesufixgarbage
cycle
:=> cycle
is the most repeated adjacent substring Example 2:
prefixgarbagecyclepadinggarbagecyclesufixgarbage
g
:=> occurrences of cycle
are not repeated adjacently, g
repeats twice adjacently Example 3:
prefixgarbagecyclecyclepadinggarbageprefixgarbage
cycle
:=> cycle
& g
repeat twice adjacently but cycle
is longer then g
Example 4:
prefixgarbagecyclecyclecycleroundroundroundprefixgarbage
cycle
:=> cycle
& round
repeat thrice adjacently & same length but cycle
appeared first Exampe 5:
abcdefghijklmnopqrstuvqxyz
<empty string>
because there is no repeated adjacent substring What is the best approach for implementing this algorithm?
Cannot find anything better than this quadratic-time algorithm (implemented in Python):
IREP, IPER, IPOS = 0, 1, 2
def find_dominant(src):
best = (0, 0, 0) # repetitions-1, period, position
period = 0
while period < len(src) // max(2, 1 + best[IREP]):
period += 1
length = 0
for pos in range(len(src) - 1 - period, -1, -1):
if src[pos] == src[pos + period]:
length += 1
repetitions = length // period
if repetitions >= best[IREP]:
best = (repetitions, period, pos)
else:
length = 0
return best
s = "prefixgarbagecyclecyclecyclecyclesufixgarbage"
res = find_dominant(s)
if res[0] == 0:
print("nothing found")
else:
print(res[IREP] + 1, '*', s[res[IPOS]: res[IPOS] + res[IPER]])
For each possible period scan the string and remember the longest periodic subsequence. Scan it backwards to check less conditions. Stop increasing period when no further improvement could be found.
Time complexity is O(N2 / R), where R is the number of repetitions of dominant substring. Space complexity is O(1).
Scan from left to right.
You need some key/value pairs. The key is a letter. The value includes the index of the last instance of the letter found to that point in the scan AND info about any cycle the letter belongs to with two or more strings (the last one beginning with that letter in that column).
You need a place to store info about any cycles found. Call that the "cycle store."
As you scan do this at each index:
When you get done scanning, look through the cycle store to see which one is dominant.
Note that you probably don't have to store all the cycles until the end but its not obvious to me right off how to decide which ones you can throw away. Probably something about keeping the ones that are still possibly continuing based on the content of the key/value pair table plus the dominant one so far.
Here is a possible approach (made simpler by the fact that your cycles have to be adjacent). Pick a string. See if it repeats. Keep track of the most repeated one.
EDIT actual tested Python code:
testStrings =[ "prefixgarbagecyclecyclecyclecyclesufixgarbage",
"prefixgarbagecyclepadinggarbagecyclesufixgarbage",
"prefixgarbagecyclecyclepadinggarbageprefixgarbage",
"prefixgarbagecyclecyclecycleroundroundroundprefixgarbage",
"abcdefghijklmnopqrstuvqxyz"];
for input in testStrings:
repCountMax = 0
longestCycle = ""
repCount = 0
for i in range (1, len(input)):
for j in range( i+1, len(input)):
substring = input[i:j]
#print substring
ls = len(substring)
repCount = 1
k = j
while(substring == input[k:k+ls]):
k = k + ls
repCount = repCount +1
#print "repetition ", repCount, " of ", substring, "n"
if (repCount > repCountMax) or ((repCount == repCountMax) and len(substring) > len(bestCycle)):
repCountMax = repCount
bestCycle = substring
if repCountMax > 1:
print "best cycle in '", input, "' is '", bestCycle,"' which is repeated ", repCountMax, " times."
else:
print "no repeated cycles found in string ", input
Resulting output:
best cycle in ' prefixgarbagecyclecyclecyclecyclesufixgarbage ' is ' ecycl ' which is repeated 4 times.
best cycle in ' prefixgarbagecyclepadinggarbagecyclesufixgarbage ' is ' g ' which is repeated 2 times.
best cycle in ' prefixgarbagecyclecyclepadinggarbageprefixgarbage ' is ' ecycl ' which is repeated 2 times.
best cycle in ' prefixgarbagecyclecyclecycleroundroundroundprefixgarbage ' is 'ecycl ' which is repeated 3 times.
no repeated cycles found in string abcdefghijklmnopqrstuvqxyz
Note - the cycle found was ecycl
, not cycle
. ecycl
occurred first...
Second note - you could make things marginally more efficient by stopping when you can no longer "beat" the current best estimate - for example, if you have found five repeats already, and given the size of the string you are searching for there isn't space for six repeats. This will give a speed improvement when there is a significant number of repeats. See Evgeny's solution for a way to implement that.
链接地址: http://www.djcxy.com/p/70660.html上一篇: 找到最大的重复子字符串
下一篇: 查找给定字符串的主要循环子字符串