Why are greedy quantifiers less expensive than lazy quantifiers

2018-06-04 01:27:18

http://www.rexegg.com/regex-quantifiers.html#tempered_greed

在这里输入图像描述

Greedy quantifiers (default) - work by swallowing as many characters allowed before slowly reducing the amount of characters matched one-by-one in order to make way for the rest of the pattern.

For example, the regular expression .*world against the string hello world will first attempt to swallow the whole string and put it into the .* . But it cant, because then world cant match, so .* starts giving up characters one by one until it has given up the world in the original string - in which case the whole regex can match.

Lazy quantifiers - work in much the similiar way, except in reverse. They want to match as less characters as possible, and they do the same thing, adding more characters one-by-one until the rest of the pattern has a chance of matching

But according to this article I read, these seemingly identical processes for greedy and lazy quantifiers are different - in that only lazy quantifiers have the need to "backtrack". But wouldn't greedy quantifiers also need to "backtrack" when spitting out previously swallowed elements?

Why is this the case? It just seems so intuitive

Greedy and lazy quantifiers are equally (in)expensive when applied correctly. However, lazy quantifiers got a bad reputation of being slow because they can be, and often are, applied to compensate for imprecision in a pattern.

Consider a simple example: <.*?> vs. <.*> .

When both expressions are applied to the same input

<abcdefghijklmnopqrstuvwxyz0123456789>

they match exactly the same text. The difference is only in how they arrive at the match: the "lazy" expression tries increasingly longer strings, arriving at the match after 40 steps (demo 1). The greedy expression, on the other hand, goes all the way to the end, and backtracks once to arrive at the match after only 5 steps (demo 2).

Note that the situation can be reversed if you add more characters after the > :

<abcdefghijklmnopqrstuvwxyz0123456789>abcdefghijklmnopqrstuvwxyz0123456789abcdefghijklmnopqrstuvwxyz0123456789abcdefghijklmnopqrstuvwxyz0123456789

Now greedy expression becomes the "slow one", taking 149 steps (demo 3) while the lazy expression continues to take the same 40 steps as it did before (demo 4).

Sometimes greedy sucks.

Sample [asdfsikbfqew0787072143k*(*&^(**&]

Bench

Regex1:   [[^]]*]
Options:  < none >
Completed iterations:   200  /  200     ( x 1000 )
Matches found per iteration:   1
Elapsed Time:    0.48 s,   476.21 ms,   476211 µs


Regex2:   [.*?]
Options:  < none >
Completed iterations:   200  /  200     ( x 1000 )
Matches found per iteration:   1
Elapsed Time:    0.32 s,   322.73 ms,   322730 µs

Never, but never, use greedy with the dot unless its to claim the remnants at the end of a line.

Someone once told me that this (?=.*d) is a lookahead that starts checking from the next character. This is absolutely false.
This immediately sets pointer to the end of the string, and starts checking
backwards (backtracking). Watch out for this because it skips over everything.
It is slower than watching paint dry.

I use to use greedy all the time, now I only use it when forced to, and
I don't recommend it to anyone.

链接地址: http://www.djcxy.com/p/13430.html

上一篇: 不情愿量词贪婪

下一篇: 为什么贪婪的量词比懒惰的量词便宜