Atomic groups clarity

2018-06-26 17:40:53

Consider this regex.

a*b

This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac

This takes 67 steps in debugger to fail.

Now consider this regex.

(?>a*)b

This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac

This takes 133 steps in debugger to fail.

And lastly this regex:

a*+b  (a variant of atomic group)

This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac

This takes 67 steps in debugger to fail.

When I check the benchmark atomic group (?>a*)b performs 179% faster.

Now atomic groups disable backtracking. So performance in match is good.

But why are the number of steps more? Can somebody explain on this?

Why is there a diff. in steps between two atomic groups (?>a*)b and a*+b .

Do they work differently?

Author note:

This answer targets question 1 as delivered by the bounty text "I am looking forward to the exact reason why more steps are being needed by the debugger.I dont need answers explaining how atomic groups work.";
Jerry's answer addresses the other concerns very well, while my other answer takes a ride through the mentioned constructs, how they work, and why they are important. For full knowledge, simply reading this post is not enough!

Every group in a regular expression takes a step to step into and out of the group.

WHAT?!
Yeah, I'm serious, read on...

Firstly, I would like to present you with quantified non-capturing groups, over without the group:

Pattern 1: (?:c)at
Pattern 2: cat

So what exactly happens here? We'll match the patterns with the test string "concat" on a regex engine with optimizations disabled:

(?:c)at cat

While we're at it, I present you some more groups:

groups

Oh no! I'm going to avoid using groups!

But wait! Please note that the number of steps taken to match has no correlation with the performance of the match. pcre engines optimizes away most of the "unnecessary steps" as I've mentioned. Atomic groups are still the most efficient, despite more steps taken on an engine with optimizations disabled.

Perhaps relevant:

Why is a character class faster than alternation?

So what's backtracking?

The engine comes to quantifiers that are greedy by default. Greedy modifiers matches all possible and backtracks by demand , allowing efficient matches,

as referenced by Greedy vs. Reluctant vs. Possessive Quantifiers :

A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less thing (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it "backtracks" one more step, making the greedy quantifier match one less thing again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success! [...]

What does this have to do with `a*+b` ?

In /a*+b/ :

a The literal character "a" .

*+ Zero or more, possessive .

b The literal character "b" .

As referenced by Greedy vs. Reluctant vs. Possessive Quantifiers :

A possessive quantifier is just like the greedy quantifier, but it doesn't backtrack. So it starts out with .* matching the entire string, leaving nothing unmatched. Then there is nothing left for it to match with the f in the regex. Since the possessive quantifier doesn't backtrack, the match fails there.

Why does it matter?

The machine won't realize if it's doing an (in)efficient match on its own. See here for a decent example: Program run forever when matching regex . In many scenarios, regexes written quickly may not be efficient and may well easily be problematic in deployment.

So what's an atomic group?

After the pattern within the atomic group finishes matching, it will not let go, ever. Study this example:

Pattern: (?>dw{2}|12)c
Matching string: 12c

Looks perfectly legitimate, but this match fails . The steps are simple: The first alternation of the atomic group matches perfectly - dw{2} consumes 12c . The group then completes its match - now here is our pointer location:

Pattern: (?>dw{2}|12)c
                   ^
Matching string: 12c
                    ^

The pattern advances. Now we try to match c , but there is no c . Instead of trying to backtrack (releasing dw{2} and consuming 12 ), the match fails.

Well that's a bad idea then! Why would we prevent backtracking, Unihedron?

Now imagine we're manipulating with a JSON object. This file is not small. Backtracking from the end is going to be a bad idea.

       "2597401":[{"jobID":"2597401",
                 "account":"TG-CCR120014",
                 "user":"charngda",
                 "pkgT":{"pgi/7.2-  5":{"libA":["libpgc.so"],
                 "flavor":["default"]}},          
                 "startEpoch":"1338497979",
                 "runTime":"1022",
                 "execType":"user:binary",              
                 "exec":"ft.D.64",
                 "numNodes":"4",
                 "sha1":"5a79879235aa31b6a46e73b43879428e2a175db5",
                 "execEpoch":1336766742,
                 "execModify":"Fri May 11 15:05:42 2012",
                 "startTime":"Thu May 31 15:59:39 2012",
                 "numCores":"64",
                 "sizeT":{"bss":"1881400168","text":"239574","data":"22504"}},  
                 {"jobID":"2597401",
                 "account":"TG-CCR120014",
                 "user":"charngda",
                 "pkgT":{"pgi/7.2-5":{"libA":["libpgc.so"],
                 "flavor":["default"]}},
                 "startEpoch":"1338497946",
                 "runTime":"33"  "execType":"user:binary",
                 "exec":"cg.C.64",
                 "numNodes":"4",
                 "sha1":"caf415e011e28b7e4e5b050fb61cbf71a62a9789",
                 "execEpoch":1336766735,
                "execModify":"Fri May 11 15:05:35 2012",
                "startTime":"Thu May 31 15:59:06 2012",
                "numCores":"64",
                "sizeT":{"bss":"29630984","text":"225749","data":"20360"}},
                {"jobID":"2597401",
                "account":"TG-CCR120014",
                "user":"charngda",
                "pkgT":{"pgi/7.2-5":  {"libA":["libpgc.so"],
                "flavor":["default"]}},
                "startEpoch":"1338500447",
                "runTime":"145",
                "execType":"user:binary",
                "exec":"mg.D.64",
                "numNodes":"4",
                "sha1":"173de32e1514ad097b1c051ec49c4eb240f2001f",
                "execEpoch":1336766756,
                "execModify":"Fri May 11 15:05:56 2012",
                "startTime":"Thu May 31 16:40:47 2012",
                "numCores":"64",
                "sizeT":{"bss":"456954120","text":"426186","data":"22184"}},{"jobID":"2597401",
                "account":"TG-CCR120014",
                "user":"charngda",
                "pkgT":{"pgi/7.2-5":{"libA":["libpgc.so"],
                "flavor":["default"]}},
                "startEpoch":"1338499002",
                "runTime":"1444",
                "execType":"user:binary",
                "exec":"lu.D.64",
                "numNodes":"4",
                "sha1":"c6dc16d25c2f23d2a3321d4feed16ab7e10c2cc1",
                "execEpoch":1336766748,
                "execModify":"Fri May 11 15:05:48 2012",
                "startTime":"Thu May 31 16:16:42 2012",
                "numCores":"64",
                "sizeT":{"bss":"199850984","text":"474218","data":"27064"}}],

Uh oh...

Do you get what I mean now? :P

I'll leave you to figure out the rest, and try to find out more about possessive quantifiers and atomic groups; I'm not writing anything else into this post. Here is where the JSON came from, I saw the answer a few days ago, very inspiring: REGEX reformatting .

Read also

The Stack Overflow Regex Reference

ReDoS - Wikipedia

I think something is wrong...

I don't know how you did your benchmarking but both a*+b and (?>a*)b should be the same. To quote regular-expressions.info (emphasis mine):

Basically, instead of X*+ , write (?>X*) . It is important to notice that both the quantified token X and the quantifier are inside the atomic group. Even if X is a group, you still need to put an extra atomic group around it to achieve the same effect. (?:a|b)*+ is equivalent to (?>(?:a|b)*) but not to (?>a|b)* . The latter is a valid regular expression, but it won't have the same effect when used as part of a larger regular expression.

And just to confirm the above, I ran the following on ideone:

$tests = 1000000;

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/a*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /a*b/    : %1.15f per iteration for %s iterationsn", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/(?>a*)b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /(?>a*)b/: %1.15f per iteration for %s iterationsn", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/a*+b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /a*+b/   : %1.15f per iteration for %s iterationsn", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/(?>a)*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /(?>a)*b/: %1.15f per iteration for %s iterationsn", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

Getting this as the output:

For /a*b/    : 0.000000879034996 per iteration for 1000000 iterations
For /(?>a*)b/: 0.000000876362085 per iteration for 1000000 iterations
For /a*+b/   : 0.000000880002022 per iteration for 1000000 iterations
For /(?>a)*b/: 0.000000883045912 per iteration for 1000000 iterations

Now, I am by no means a PHP expert so I don't know if this is the right way to benchmark stuff in there, but that's they all have about the same performance, which is kind of expected given the simplicity of the task.

Still, couple of things I note from the above:

Neither (?>a)*b nor (?>a*)b are 179% faster than another regex; the above all are within 7% from each other.

Back to the actual question

But why are the number of steps more? Can somebody explain on this?

It is to be noted that the number of steps is not a direct representation of the performance of a regex. It is a factor, but not the ultimate determinant. There are more steps because the breakdown has steps before entering the group, and after entering the group...

1   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
     ^
2   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
      ^^^^^^^^
3   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
          ^^
4   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
            ^
5   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaaa...
               ^

That's 2 more steps because of the group than the 3 steps...

1   / a*+ b /x    aaaaaaaaaaaaaaaaaaaa...
     ^
2   / a*+ b /x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
      ^^^
3   / a*+ b /x    aaaaaaaaaaaaaaaaaaaaa...
          ^

Where you can say that a*b is the same as (?:a*)b but the latter having more steps:

1   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
     ^
2   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
      ^^^^^^^^
3   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
          ^^
4   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
            ^
5   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaaa...

Note: Even there, you see that regex101 has the steps optimised a bit for the number of steps in a*b .

Conclusion

Possessive quantifiers and atomic groups work differently depending on how you use them. If I take regular-expression's example and tweaking it a little:

(?>a|b)*ac match aabaac

But

(?:a|b)*+ac and (?>(?:a|b)*)ac don't match aabaac .

链接地址: http://www.djcxy.com/p/74776.html

上一篇: 原子分组交替是没用的？

下一篇: 原子团体的清晰度

Atomic groups clarity

Every group in a regular expression takes a step to step into and out of the group.

So what's backtracking?

What does this have to do with a*+b ?

Why does it matter?

So what's an atomic group?

Well that's a bad idea then! Why would we prevent backtracking, Unihedron?

Read also

I think something is wrong...

Back to the actual question

Conclusion

What does this have to do with `a*+b` ?