How exactly do the lazy quantifiers work in PCRE?

2018-06-26 17:52:22

A bit of background: I'm implementing a regex matching engine (NFA) and it should support PCRE compatibility mode (I mean it should capture subexpressions with the same offsets as PCRE would do).

There's a test in PCRE's testinput1 which I can't fully understand. It tests lazy quantifiers.

So, the regex is

/<a[s]+href[s]*=[s]*          # find <a href=
 (["'])?                       # find single or double quote
 (?(1) (.*?)1 | ([^s]+))       # if quote found, match up to next matching
                                 # quote, otherwise match up to next space
/isx

And the string is

<a href="abcd xyz pqr" cats

PCRE's match is:

<a href="abcd xyz pqr"

and it is obviously using the lazy quantifier.

As far as I understand, lazy quantifiers should not be used until another "greedy" ways are impossible at all. Now here's a possible greedy match:

<a href="abcd

which uses the negative branch of the conditional subpattern, no lazy quantifiers.

So I'm looking for an explanation of this PCRE's behaviour or any details/suggestions why the lazy quantifier matches in this test. Thanks!

EDIT : I also checked out how the TRE library works. It's a POSIX-compatible NFA engine. I modified the original regex a little bit to suit TRE's syntax:

#include <stdlib.h>
#include <stdio.h>
#include <tre/tre.h>

int main()
{
    regex_t preg;
    const char * regex = "<a[ ]+href[ ]*=[ ]*(?:(')(.*?)'|[^ ]+)";
    const char * string = "<a href='abcd xyz pqr' cats";
    int cflags = REG_EXTENDED;
    int eflags = 0;
    size_t nmatch = 3;
    regmatch_t pmatch[100];

    tre_regcomp(&preg, regex, cflags);
    tre_regexec(&preg, string, nmatch, pmatch, eflags);

    for (int i = 0; i < nmatch; i++) {
        printf("%d: (%d, %d)n", i, pmatch[i].rm_so, pmatch[i].rm_eo - pmatch[i].rm_so);
    }

    return 0;
}

and the output (using lengths instead of end offsets) is:

0: (0, 22)
1: (8, 1)
2: (9, 12)

So the suggestion about PCRE's backtracking-specific behaviour is most likely wrong...

First of all I am just a beginner in the REGEX world. So, I am sorry if this answer is wrong or if I misunderstood the question.

Reading this definition taken from this book Regular Expressions Cookbook:

(?(1)then|else) is a conditional that checks whether the first capturing group has already matched something. If it has, the regex engine attempts to match then. If the capturing group has not participated in the match attempt thus far, the else part is attempted.

With this subject: <a href="abcd xyz pqr" cats

The first capturing group has matched the first " character. So, the expected behaviour is to attempt to match the then part. The second capturing group in the then part manages to match the string abcd xyz pqr with (.*?) and finally the then part manages to match abcd xyz pqr" with (.*?)1 . The REGEX may finish with success.

So, the else part with its greddy quantifier is not required, indeed it is not used. It is as if the greddy quantifier never existed.

With this subject: <a href="abcd

The first capturing group has matched the " character. Now the then part manages to match the string abcd with (.*?) but it will never match the last " character because there is not such character at the end of the subject. The conditional fails.

The REGEX engine does not stop here, you have used (["'])? so, the engine may try again because the " character is optional and it keeps going on as if the first capturing group had not matched (indeed there is backtrack). So, now the engine reachs the conditional with no match for the first capturing group, the else part is attempted and it manages to match the string "abcd (the " character is not matched by the first capturing group because of the backtrack and now it is matched by the third capturing group in the else part) The REGEX may finish with success.

PS: I am learning for fun stuff about regular expressions, so probably this answer is totally wrong. Wait for a better answer.

I don't fully understand your question here, but the non-greedy quantifier allows to search to first occurrence of the pattern. With pcretest, you try the greedy and non-greedy forms as well on the same input.

Non-greedy form:

  re> /<a[s]+href[s]*=[s]*(["'])?(?(1)(.*?)1|([^s]+))/i
  data> <a href="ab"cd"
    0: <a href="ab"
    1: "
    2: ab

Greedy form:

 re> /<a[s]+href[s]*=[s]*(["'])?(?(1)(.*)1|([^s]+))/i
 data> <a href="ab"cd"
    0: <a href="ab"cd"
    1: "
    2: ab"cd

链接地址: http://www.djcxy.com/p/74798.html

上一篇: 正则表达式的范围和组以DFA的形式实现

下一篇: 惰性量词在PCRE中的工作原理如何？