Simple string compression: removing consecutive repeating substrings
I was asked this question recently in an Amazon interview .
Given a string, remove consecutive repeating substrings from it. If there are multiple consecutive intersecting substrings, remove the largest of those.
To make it clear, here are some examples:
aabcccddeaaa
output:
abcdea
(Compressing the consecutive repeated characters) abababcdeee
output:
abcde
(Compressing consecutive repeated substrings) ababcdabcd
output:
ababcd
(You can compress ' ab
' or ' abcd
', but as length of ' abcd
' is larger you prefer compressing the larger one.)
I could not come up with an efficient implementation, anyone knows a good approach to this?
As this was an interview question, please refrain from using complicated library functions.
没有正则表达式...这种递归方法的工作原理:
var cases = ['aabcccddeaaa', 'abababcdeee', 'ababcdabcd'];
function compress(str) {
var len, sub, i, n;
// if str is shorter than 2, can't be any repeating substrings
if(str.length < 2)
return str;
// max meaningful length is str.length/2 (truncated to integer)
for(len = (str.length / 2) | 0; len > 0; len--) {
// search for a repeating substring of "len" length starting at index i
for(i = 0; i + (len * 2) <= str.length; i++) {
sub = str.substr(i, len);
// if such a substring exists...
if(str.indexOf(sub, i + len) == i + len) {
// "count" how many occurences (advance index till beyond repeats)
for(n = i + len * 2; str.indexOf(sub, n) == n; n += len);
// return a string composed of the compressed part before the match +
// the substring + the compressed part beyond the match
return compress(str.substr(0, i)) + sub + compress(str.substr(n));
}
}
}
// if nothing found, return original string
return str;
}
alert(JSON.stringify(cases.map(compress)));
For a string X
, we can find the largest consecutive repeating substring in O(n^2) using Z-algorithm which Given a string S of length n, the Z Algorithm produces an array Z where Z[i] is the length of the longest substring starting from pat[i] which is also a prefix of pat
(Source)
For each suffix of X
start at i
, applying Z-algorithm for this substring:
int result = 0;
for(int i = 0; i < X.length(); i++)
int[]z = Z-algo(X.substring(i)); //this take O(n)
for(int j = i + result + 1; j < X.length(); j++)
if(z[j] >= j - i + 1)
result = (j - i + 1);
Repeating the above procedure until we cannot find any repeating substring, we can obtain a O(n^3) algorithm.
Note : after reread the question, especially the last example, I figure out that valid repeating substrings are only limited from the original's substrings. So, the time complexity can be reduced to O(n^2 log n) by using a max heap.
pos=[]
dstr={}
final=[]
x="ababcdabcdcde"
for k in re.finditer(r"(?=(.+?)1+)",x): #Find start of all overlapping strings
pos.append(k.start())
i=0
for k in pos: #Find end of overlapping strings
s=re.findall(r"^((.*)2+)",x[k:])
dstr[i]=(k,len(s[0][0]))
i=i+1
#print dstr.values()
k=0
while k< len(dstr.values())-1: #remove smaller length overlapping result
if dstr.values()[k+1][0]<dstr.values()[k][1]<dstr.values()[k+1][1]:
pass
else:
final.append(dstr.values()[k][0])
k=k+1
if dstr.values()[k-1][0] in final:
pass
else:
final.append(dstr.values()[k][0])
#print final
for k in final: #remove strings
s=re.sub(r"(.*)1+",r"1",x[k:])
x=x[:k]+s
print x
这是在python.Works与给定的输入罚款。
链接地址: http://www.djcxy.com/p/86564.html上一篇: 码头备份和恢复mongodb
下一篇: 简单的字符串压缩:删除连续的重复子字符串