Finding the largest repeating substring
Here is a function I wrote that will take a very long text file. Such as a text file containing an entire textbook. It will find any repeating substrings and output the largest string. Right now it doesn't work however, it just outputs the string I put in
For example, if there was a typo in which an entire sentence was repeated. It would output that sentence; given it is the largest in the whole file. If there was a typo where an entire paragraph was typed twice, it would output the paragraph.
This algorithm takes the first character, finds any matches, and if it does and if the length is the largest, store the substring. Then it takes the first 2 characters and repeats. Then the first 3 characters. etc.. Then it will start over except starting at the 2nd character instead of the 1st. Then all the way through and back, starting at the 3rd character.
def largest_substring(string):
length = 0
x,y=0,0
for y in range(len(string)): #start at string[0, ]
for x in range(len(string)): #start at string[ ,0]
substring = string[y:x] #substring is [0,0] first, then [0,1], then [0.2]... then [1,1] then [1,2] then [1,3]... then [2,2] then [2,3]... etc.
if substring in string: #if substring found and length is longest so far, save the substring and proceed.
if len(substring) > length:
match = substring
length = len(substring)
I think your logic is flawed here because it will always return the entire string as it checks whether a substring is in whole string which is always true so the statement if substring in string
will be always true. Instead you need to find if the substring occurs more than once in the entire string and then update the count.
Here is example of brute force algorithm that solves it :-
import re
def largest_substring(string):
length = 0
x=0
y=0
for y in range(len(string)):
for x in range(len(string)):
substring = string[y:x]
if len(list(re.finditer(substring,string))) > 1 and len(substring) > length:
match = substring
length = len(substring)
return match
print largest_substring("this is repeated is repeated is repeated")
链接地址: http://www.djcxy.com/p/70662.html
上一篇: 如何检查两个顶点之间的图形连通性
下一篇: 找到最大的重复子字符串