Why are greedy quantifiers less expensive than lazy quantifiers
http://www.rexegg.com/regex-quantifiers.html#tempered_greed
Greedy quantifiers (default) - work by swallowing as many characters allowed before slowly reducing the amount of characters matched one-by-one in order to make way for the rest of the pattern.
For example, the regular expression .*world
against the string hello world
will first attempt to swallow the whole string and put it into the .*
. But it cant, because then world
cant match, so .*
starts giving up characters one by one until it has given up the world
in the original string - in which case the whole regex can match.
Lazy quantifiers - work in much the similiar way, except in reverse. They want to match as less characters as possible, and they do the same thing, adding more characters one-by-one until the rest of the pattern has a chance of matching
But according to this article I read, these seemingly identical processes for greedy and lazy quantifiers are different - in that only lazy quantifiers have the need to "backtrack". But wouldn't greedy quantifiers also need to "backtrack" when spitting out previously swallowed elements?
Why is this the case? It just seems so intuitive
Greedy and lazy quantifiers are equally (in)expensive when applied correctly. However, lazy quantifiers got a bad reputation of being slow because they can be, and often are, applied to compensate for imprecision in a pattern.
Consider a simple example: <.*?>
vs. <.*>
.
When both expressions are applied to the same input
<abcdefghijklmnopqrstuvwxyz0123456789>
they match exactly the same text. The difference is only in how they arrive at the match: the "lazy" expression tries increasingly longer strings, arriving at the match after 40 steps (demo 1). The greedy expression, on the other hand, goes all the way to the end, and backtracks once to arrive at the match after only 5 steps (demo 2).
Note that the situation can be reversed if you add more characters after the >
:
<abcdefghijklmnopqrstuvwxyz0123456789>abcdefghijklmnopqrstuvwxyz0123456789abcdefghijklmnopqrstuvwxyz0123456789abcdefghijklmnopqrstuvwxyz0123456789
Now greedy expression becomes the "slow one", taking 149 steps (demo 3) while the lazy expression continues to take the same 40 steps as it did before (demo 4).
Sometimes greedy sucks.
Sample [asdfsikbfqew0787072143k*(*&^(**&]
Bench
Regex1: [[^]]*]
Options: < none >
Completed iterations: 200 / 200 ( x 1000 )
Matches found per iteration: 1
Elapsed Time: 0.48 s, 476.21 ms, 476211 µs
Regex2: [.*?]
Options: < none >
Completed iterations: 200 / 200 ( x 1000 )
Matches found per iteration: 1
Elapsed Time: 0.32 s, 322.73 ms, 322730 µs
Never, but never, use greedy with the dot unless its to claim the remnants at the end of a line.
Someone once told me that this (?=.*d)
is a lookahead that starts checking from the next character. This is absolutely false.
This immediately sets pointer to the end of the string, and starts checking
backwards (backtracking). Watch out for this because it skips over everything.
It is slower than watching paint dry.
I use to use greedy all the time, now I only use it when forced to, and
I don't recommend it to anyone.
上一篇: 不情愿量词贪婪
下一篇: 为什么贪婪的量词比懒惰的量词便宜