What is the fastest substring search algorithm?

OK, so that I don't sound like an idiot, I'm going to state the problem/requirements more explicitly:

  • Needle (pattern) and haystack (text to search) are both C-style null-terminated strings. No length information is provided; if needed, it must be computed.
  • Function should return a pointer to the first match, or NULL if no match is found.
  • Failure cases are not allowed. This means any algorithm with non-constant (or large constant) storage requirements will need to have a fallback case for allocation failure (and performance in the fallback case thereby contributes to worst-case performance).
  • Implementation is to be in C, although a good description of the algorithm (or link to such) without code is fine too.
  • ...as well as what I mean by "fastest":

  • Deterministic O(n) where n = haystack length. (But it may be possible to use ideas from algorithms which are normally O(nm) (for example a rolling hash; a rough sketch follows these requirements) if they're combined with a more robust algorithm to give deterministic O(n) results.)
  • Never performs (measurably; a couple clocks for if (!needle[1]) etc. are okay) worse than the naive brute force algorithm, especially on very short needles which are likely the most common case. (Unconditional heavy preprocessing overhead is bad, as is trying to improve the linear coefficient for pathological needles at the expense of likely needles.)
  • Given an arbitrary needle and haystack, comparable or better performance (no worse than 50% longer search time) versus any other widely-implemented algorithm.
  • Aside from these conditions, I'm leaving the definition of "fastest" open-ended. A good answer should explain why you consider the approach you're suggesting "fastest".
  • My current implementation runs anywhere from roughly 10% slower to 8 times faster (depending on the input) than glibc's implementation of Two-Way.
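    For anyone unfamiliar with the rolling-hash idea mentioned above, here is a minimal Rabin-Karp-style sketch (my own illustration, using explicit lengths rather than the null-terminated interface required above; the base and the 32-bit wraparound hashing are arbitrary choices). The hash only filters candidate positions and every hit is verified, which is why the naive form can degrade to O(nm):

        #include <stdint.h>
        #include <string.h>

        /* Illustrative Rabin-Karp rolling hash; not a drop-in strstr replacement.
         * Hashes are taken mod 2^32 via unsigned wraparound; B is an arbitrary base. */
        static const char *rk_search(const char *hay, size_t hlen,
                                     const char *nee, size_t nlen)
        {
            const uint32_t B = 257;
            uint32_t hn = 0, hw = 0, Bk = 1;  /* needle hash, window hash, B^(nlen-1) */
            size_t i;

            if (nlen == 0)
                return hay;
            if (nlen > hlen)
                return NULL;

            for (i = 0; i < nlen; i++) {
                hn = hn * B + (unsigned char)nee[i];
                hw = hw * B + (unsigned char)hay[i];
                if (i + 1 < nlen)
                    Bk *= B;                  /* Bk ends up as B^(nlen-1) */
            }

            for (i = 0; ; i++) {
                /* Hash equality is only a hint; verify before reporting a match. */
                if (hw == hn && memcmp(hay + i, nee, nlen) == 0)
                    return hay + i;
                if (i + nlen >= hlen)
                    return NULL;
                /* Roll the window: drop hay[i] from the high end, append hay[i + nlen]. */
                hw = (hw - (unsigned char)hay[i] * Bk) * B + (unsigned char)hay[i + nlen];
            }
        }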

    Update: My current optimal algorithm is as follows:

  • For needles of length 1, use strchr.
  • For needles of length 2-4, use machine words to compare 2-4 bytes at once, as follows: preload the needle into a 16- or 32-bit integer with bitshifts, and at each iteration cycle the oldest byte out and the next haystack byte in (a sketch of the 2-byte case appears at the end of this update). Every byte of the haystack is read exactly once and incurs a check against 0 (end of string) plus one 16- or 32-bit comparison.
  • For needles of length >4, use Two-Way algorithm with a bad shift table (like Boyer-Moore) which is applied only to the last byte of the window. To avoid the overhead of initializing a 1kb table, which would be a net loss for many moderate-length needles, I keep a bit array (32 bytes) marking which entries in the shift table are initialized. Bits that are unset correspond to byte values which never appear in the needle, for which a full-needle-length shift is possible.
  • The big questions left in my mind are:

  • Is there a way to make better use of the bad shift table? Boyer-Moore makes best use of it by scanning backwards (right-to-left) but Two-Way requires a left-to-right scan.
  • The only two viable candidate algorithms I've found for the general case (no out-of-memory or quadratic-performance conditions) are Two-Way and String Matching on Ordered Alphabets. But are there easily-detectable cases where different algorithms would be optimal? Certainly many of the algorithms that use O(m) space (where m is needle length) could be used for m<100 or so. It would also be possible to use algorithms which are worst-case quadratic if there's an easy test for needles which provably require only linear time.
  • Bonus points for:

  • Can you improve performance by assuming the needle and haystack are both well-formed UTF-8? (With characters of varying byte lengths, well-formed-ness imposes some string alignment requirements between the needle and haystack and allows automatic 2-4 byte shifts when a mismatching head byte is encountered. But do these constraints buy you much/anything beyond what maximal suffix computations, good suffix shifts, etc. already give you with various algorithms?)
  • Note: I'm well aware of most of the algorithms out there, just not how well they perform in practice. Here's a good reference so people don't keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
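    To make the length 2-4 trick above concrete, here is a rough sketch of the 2-byte case (simplified; the 3- and 4-byte cases use a 32-bit window the same way). It assumes the needle is exactly two non-NUL bytes:

        #include <stdint.h>
        #include <string.h>

        /* Sketch of the 2-byte case: keep the needle and a sliding 2-byte window of
         * the haystack in 16-bit integers; each step shifts one new haystack byte in.
         * Every haystack byte is read once and costs one zero test plus one 16-bit
         * compare. Assumes the needle is exactly two non-NUL bytes. */
        static char *twobyte_strstr(const char *h, const char *n)
        {
            uint16_t nw, hw;

            if (!h[0] || !h[1])
                return NULL;                      /* haystack shorter than needle */

            nw = (uint8_t)n[0] << 8 | (uint8_t)n[1];
            hw = (uint8_t)h[0] << 8 | (uint8_t)h[1];

            for (h += 2; hw != nw; hw = hw << 8 | (uint8_t)*h++)
                if (!h[-1])                       /* window ran into the terminator */
                    return NULL;
            return (char *)(h - 2);
        }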


    Build up a test library of likely needles and haystacks. Profile the tests on several search algorithms, including brute force. Pick the one that performs best with your data.

    Boyer-Moore uses a bad character table with a good suffix table.

    Boyer-Moore-Horspool uses a bad character table.

    Knuth-Morris-Pratt uses a partial match table.

    Rabin-Karp uses running hashes.

    They all trade preprocessing overhead for fewer comparisons, to different degrees, so real-world performance will depend on the average lengths of both the needle and the haystack. The more up-front overhead, the better the algorithm does with longer inputs. With very short needles, brute force may win.
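    To make that overhead concrete, here is a rough Boyer-Moore-Horspool sketch (illustrative only; it takes explicit lengths rather than the null-terminated interface the question requires). The 256-entry table fill is exactly the kind of fixed preprocessing cost that brute force avoids:

        #include <string.h>

        /* Illustrative Horspool search using only the bad-character table.
         * Preprocessing is O(256 + nlen); the search skips by the table entry
         * for the haystack byte aligned with the last needle position. */
        static const char *horspool(const char *hay, size_t hlen,
                                    const char *nee, size_t nlen)
        {
            size_t shift[256];
            size_t i;

            if (nlen == 0)
                return hay;
            if (nlen > hlen)
                return NULL;

            for (i = 0; i < 256; i++)            /* default: skip a whole needle length */
                shift[i] = nlen;
            for (i = 0; i + 1 < nlen; i++)       /* bytes in the needle shift less */
                shift[(unsigned char)nee[i]] = nlen - 1 - i;

            for (i = 0; i + nlen <= hlen; i += shift[(unsigned char)hay[i + nlen - 1]]) {
                if (memcmp(hay + i, nee, nlen) == 0)
                    return hay + i;
            }
            return NULL;
        }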

    Edit:

    A different algorithm might be best for finding base pairs, English phrases, or single words. If there were one best algorithm for all inputs, it would have been publicized.

    Think about the following little table. Each question mark might have a different best search algorithm.

                     short needle     long needle
    short haystack         ?               ?
    long haystack          ?               ?
    

    This should really be a graph, with a range of shorter to longer inputs on each axis. If you plotted each algorithm on such a graph, each would have a different signature. Some algorithms suffer with a lot of repetition in the pattern, which might affect uses like searching for genes. Some other factors that affect overall performance are searching for the same pattern more than once and searching for different patterns at the same time.

    If I needed a sample set, I think I would scrape a site like Google or Wikipedia, then strip the HTML from all the result pages. For a search site, type in a word, then use one of the suggested search phrases. Choose a few different languages, if applicable. Using web pages, all the texts would be short to medium, so merge enough pages to get longer texts. You can also find public-domain books, legal records, and other large bodies of text. Or just generate random content by picking words from a dictionary. But the point of profiling is to test against the type of content you will be searching, so use real-world samples if possible.

    I left short and long vague. For the needle, I think of short as under 8 characters, medium as under 64 characters, and long as under 1k. For the haystack, I think of short as under 2^10, medium as under 2^20, and long as up to 2^30 characters.


    The http://www-igm.univ-mlv.fr/~lecroq/string/index.html link you point to is an excellent source and summary of some of the best known and researched string matching algorithms.

    Solutions to most search problems involve trade-offs with respect to pre-processing overhead and time and space requirements. No single algorithm will be optimal or practical in all cases.

    If your objective is to design a specific algorithm for string searching, then ignore the rest of what I have to say. If you want to develop a generalized string-searching service routine, then try the following:

    Spend some time reviewing the specific strengths and weaknesses of the algorithms you have already referenced. Conduct the review with the objective of finding a set of algorithms that covers the range and scope of string searches you are interested in. Then build a front-end search selector based on a classifier function to target the best algorithm for the given inputs. This way you can employ the most efficient algorithm to do the job. It is particularly effective when an algorithm is very good for certain searches but degrades badly outside them. For example, brute force is probably the best for needles of length 1, but it quickly degrades as needle length increases; the Sustik-Moore algorithm may then become more efficient (over small alphabets), and for longer needles and larger alphabets the KMP or Boyer-Moore algorithms may be better. These are just examples to illustrate a possible strategy.
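    As a rough sketch of what such a front-end selector might look like (the thresholds are illustrative assumptions, not measured recommendations, and the specialized branches are stubbed out with the library strstr so the example stays self-contained):

        #include <string.h>

        /* Hypothetical classifier/dispatcher: pick a strategy from cheap-to-compute
         * features of the inputs, here just the needle length. */
        char *select_strstr(const char *haystack, const char *needle)
        {
            size_t nlen = strlen(needle);

            if (nlen == 0)
                return (char *)haystack;           /* empty needle matches at offset 0 */
            if (nlen == 1)
                return strchr(haystack, *needle);  /* single byte: strchr/brute force wins */

            /* In a real selector these branches would dispatch to e.g. a
             * word-at-a-time routine for very short needles and a Two-Way or
             * Sustik-Moore routine for longer ones; strstr stands in here. */
            return strstr(haystack, needle);
        }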

    The multiple-algorithm approach is not a new idea. I believe it has been employed by a few commercial sort/search packages (e.g., SYNCSORT, commonly used on mainframes, implements several sort algorithms and uses heuristics to choose the "best" one for the given inputs).

    Each search algorithm comes in several variations that can make significant differences to its performance, as, for example, this paper illustrates.

    Benchmark your service to identify the areas where additional search strategies are needed, or to tune your selector function more effectively. This approach is not quick or easy, but done well it can produce very good results.


    I believe it may very well be the "Simple Real-Time Constant-Space String Matching" algorithm by Dany Breslauer, Roberto Grossi, and Filippo Mignosi, published in 2011.

    Update:

    In 2014 the authors published this improvement: Towards optimal packed string matching.
