InterviewStreet Find Strings

You are given 'n' strings w1, w2, ......, wn. Let Si denote the set of strings formed by considering all unique substrings of the string wi. A substring is defined as a contiguous sequence of one or more characters in the string. More information on substrings can be found here. Let 'S' = {S1 U S2 U .... Sn} .ie 'S' is a set of strings formed by considering all the unique strings in all sets S1, S2, ..... Sn. You will be given many queries and for each query, you will be given an integer 'k'. Your task is to output the lexicographically kth smallest string from the set 'S'.

Input:

The first line of input contains a single integer 'n', denoting the number of strings. Each of the next 'n' lines consists of a string. The string on the ith line (1<=i<=n) is denoted by wi and has a length mi. The next line consists of a single integer 'q', denoting the number of queries. Each of the next 'q' lines consists of a single integer 'k'.

Note: The input strings consist only of lowercase english alphabets 'a' - 'z'.

Output:

Output 'q' lines, where the ith line consists of a string which is the answer to the ith query. If the input is invalid ('k' > |S|), output "INVALID" (quotes for clarity) for that case.

Constraints:

1<=n<=50
1<=mi<=2000
1<=q<=500
1<=k<=1000000000

https://www.interviewstreet.com/challenges/dashboard/#problem/4efa210eb70ac

My approach

For each input string, generate its substrings and add them to a set, which will automatically eliminate duplicates and keep them sorted.Return element at index i in set.

I have written a code on above approach here:

http://justprogrammng.blogspot.com/2012/06/interviewstreet-find-strings-solution-c.html

But Problem I am facing is

terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

This error is appearing in few test cases. Can someone please tell me why I am getting this error and how should I rectify this error?


@enjay's answer is correct. I'll elaborate for anyone who is new to this kind of string processing algorithm problems and want to learn more. My answer will depict the large picture and give pointers to any detail mentioned.

The problem @sachin pointed out in interviewstreet.com belongs to a large group of problem where substring, palindrome, etc is involved. All such problem can be solved by one dedicated data structure: suffix array(en.wikipedia.org/wiki/Suffix_array). The complete learning path is as below. But if you are only intreated in solving the problem, you can go directly to 3.

  • Trie(http://en.wikipedia.org/wiki/Trie). The foundation for suffix tree.

  • Suffix tree(http://en.wikipedia.org/wiki/Suffix_tree). Put all suffix of one string into a Trie. Observe that any substring of a given string is some prefix of one of the given string's suffix. The idea of suffix tree is inspiring, but since the construction of suffix tree is either too complex to implement or too slow, in practice, I doubt any one use it. However, for anyone want to challenge the unnecessary difficult, here the best illustration of construction algorithm for suffix tree:http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english

  • Suffix array(http://en.wikipedia.org/wiki/Suffix_array). Suffix array contains any information suffix tree have(thus can do whatever suffix tree can do) and is way easier to implement. That being said, if you want to accomplish anything non-trivial with it, you need to construct at least three array and one index for RMQ. The three arrays are:

  • a. The suffix array itself.

    b. The rank array.

    c. The height array.

    Since one common task when using suffix array is to find the longest common prefix of two suffix, one need to do RMQ query to the height array. RMQ is described here: http://en.wikipedia.org/wiki/Range_Minimum_Query


    You get bad_alloc error because there are too many unique substrings to fit in memory. To rectify it you may use any approach where there is no need to store every unique substring.

    My solution is pretty complicated, so I provide just a sketch of an algorithm.

    It is possible to store only those substrings, that are beginning at every possible position in w1, w2, ......, wn, and ending at the end of w1, w2, ......, wn. And instead of the whole substring, you can store only pointer to its starting character.

    To answer the queries, you can sort all the queries, sort all the substrings. Then iterate the substrings in the following manner: take all the substrings starting from the same character, then take all the substrings with the same second character and so on. In other words, you implicitly construct a trie with all inner nodes having weight 1 and all leaf nodes having weight, equal to the length of the remaining unique substring, corresponding to this node. And you iterate through this trie, computing cumulative sum of each node weight, and comparing it with the next not-yet-processed query. As soon as you find the match, you print the substring and continue with trie traversal.

    All this requires not much memory but is very hungry for computation resources. But it may be optimized. You may use an explicit trie to store all short substrings (probably with lengths 1 .. 2 or 1 .. 3). Also you may use bucket sort algorithm to sort longer substrings.


    使用一个后缀数组...它会更快,更容易在内存上genarating所有子串...如果你排序后缀数组,然后尝试搜索使用一些扩展的数据结构,如lcp数组和累积子串数count数组及时解决并在内存限制内......

    链接地址: http://www.djcxy.com/p/40102.html

    上一篇: 后缀链接和失败链接有什么区别?

    下一篇: InterviewStreet查找字符串