Greedy vs. Reluctant vs. Possessive Quantifiers
I found this excellent tutorial on regular expressions and while I intuitively understand what "greedy", "reluctant" and "possessive" quantifiers do, there seems to be a serious hole in my understanding.
Specifically, in the following example:
Enter your regex: .*foo // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.
Enter your regex: .*?foo // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.
The explanation mentions eating the entire input string, letters been consumed , matcher backing off , rightmost occurrence of "foo" has been regurgitated , etc.
Unfortunately, despite the nice metaphors, I still don't understand what is eaten by whom... Do you know of another tutorial that explains (concisely) how regular expressions engines work?
Alternatively, if someone can explain in somewhat different phrasing the following paragraph, that would be much appreciated:
The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f" "o" "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f" "o" "o") have already been consumed ( by whom? ). So the matcher slowly backs off ( from right-to-left? ) one letter at a time until the rightmost occurrence of "foo" has been regurgitated ( what does this mean? ), at which point the match succeeds and the search ends.
The second example, however, is reluctant, so it starts by first consuming ( by whom? ) "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow ( who swallows?) the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13.
The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+, ( how? ) leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off ( what does back off mean? ); it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.
I'll give it a shot.
A greedy quantifier first matches as much as possible. So the .*
matches the entire string. Then the matcher tries to match the f
following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less thing (leaving the "o" at the end of the string unmatched). That still doesn't match the f
in the regex, so it "backtracks" one more step, making the greedy quantifier match one less thing again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f
in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f
in the regex, and the o
and the next o
are matched too. Success!
A reluctant or "non-greedy" quantifier first matches as little as possible. So the .*
matches nothing at first, leaving the entire string unmatched. Then the matcher tries to match the f
following, but the unmatched portion of the string starts with "x" so that doesn't work. So the matcher backtracks, making the non-greedy quantifier match one more thing (now it matches the "x", leaving "fooxxxxxxfoo" unmatched). Then it tries to match the f
, which succeeds, and the o
and the next o
in the regex match too. Success!
In your example, it then starts the process over with the remaining unmatched portion of the string, following the same process.
A possessive quantifier is just like the greedy quantifier, but it doesn't backtrack. So it starts out with .*
matching the entire string, leaving nothing unmatched. Then there is nothing left for it to match with the f
in the regex. Since the possessive quantifier doesn't backtrack, the match fails there.
I haven't heard the exact terms 'regurgitate' or 'backing off' before; the phrase that would replace these is "backtracking", but 'regurgitate' seems like as good a phrase as any for "the content that had been tentatively accepted before backtracking threw it away again".
The important thing to realize about most regex engines is that they are backtracking: they will tentatively accept a potential, partial match, while trying to match the entire contents of the regex. If the regex cannot be completely matched at the first attempt, then the regex engine will backtrack on one of its matches. It will try matching *
, +
, ?
, alternation, or {n,m}
repetition differently, and try again. (And yes, this process can take a long time.)
The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f" "o" "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f" "o" "o") have already been consumed ( by whom? ).
The last three letters, f
, o
, and o
were already consumed by the initial .*
portion of the rule. However, the next element in the regex, f
, has nothing left in the input string. The engine will be forced to backtrack on its initial .*
match, and try matching all-but-the-last character. (It might be smart and backtrack to all-but-the-last-three, because it has three literal terms, but I'm unaware of implementation details at this level.)
So the matcher slowly backs off ( from right-to-left? ) one letter at a time until the rightmost occurrence of "foo" has been regurgitated ( what does this mean? ), at which
This means the foo
had tentatively been including when matching .*
. Because that attempt failed, the regex engine tries accepting one fewer character in .*
. If there had been a successful match before the .*
in this example, then the engine would probably try shortening the .*
match (from right-to-left, as you pointed out, because it is a greedy qualifier), and if it was unable to match the entire inputs, then it might be forced to re-evaluate what it had matched before the .*
in my hypothetical example.
point the match succeeds and the search ends.
The second example, however, is reluctant, so it starts by first consuming ( by whom? ) "nothing". Because "foo"
The initial nothing is consumed by .?*
, which will consume the shortest possible amount of anything that allows the rest of the regex to match.
doesn't appear at the beginning of the string, it's forced to swallow ( who swallows?) the
Again the .?*
consumes the first character, after backtracking on the initial failure to match the entire regex with the shortest possible match. (In this case, the regex engine is extending the match for .*?
from left-to-right, because .*?
is reluctant.)
first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13.
The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+, ( how? )
A .*+
will consume as much as possible, and will not backtrack to find new matches when the regex as a whole fails to find a match. Because the possessive form does not perform backtracking, you probably won't see many uses with .*+
, but rather with character classes or similar restrictions: account: [[:digit:]]*+ phone: [[:digit:]]*+
.
This can drastically speed up regex matching, because you're telling the regex engine that it should never backtrack over potential matches if an input doesn't match. (If you had to write all the matching code by hand, this would be similar to never using putc(3)
to 'push back' an input character. It would be very similar to the naive code one might write on a first try. Except regex engines are way better than a single character of push-back, they can rewind all the back to zero and try again. :)
But more than potential speed ups, this also can let you write regexs that match exactly what you need to match. I'm having trouble coming up with an easy example :) but writing a regex using possessive vs greedy quantifiers can give you different matches, and one or the other may be more appropriate.
leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off ( what does back off mean? ); it will outperform
"Backing off" in this context means "backtracking" -- throwing away a tentative partial match to try another partial match, which may or may not succeed.
the equivalent greedy quantifier in cases where the match is not immediately found.
It is just my practice output to visualise the scene-
链接地址: http://www.djcxy.com/p/2156.html上一篇: 没有特定单词的线匹配的正则表达式
下一篇: 贪婪与不愿意与拥有量词