Can DFA regex engines handle atomic groups?

2018-06-03 21:31:55

According to this page (and some others), DFA regex engines can deal with capturing groups rather well. I'm curious about atomic groups (or possessive quantifiers), as I recently used them a lot and can't imagine how this could be done.

I disagree with the fist part of the answer:

A DFA does not need to deal with constructs like atomic grouping.... Atomic Grouping is a way to help the engine finish a match, that would otherwise cause endless backtracking

Atomic groups are important not only for speed of NFA engines, but they also allow to write simpler and less error-prone regexes. Let's say I needed to find all C-style multiline comments in a program. The exact regex would be something like:

start with the literal /*

eat anything of the following

any char except *

a * followed by anything but /

repeat this as much as possible

end with the literal */

This sounds a bit complicated, the regex

/* ( [^*] | *[^/] )+ */

is complicated and wrong (it doesn't handle /* foo **/ correctly). Using a reluctant (lazy) quantifier is better

/* .*? */

but also wrong as it can eat the whole line

/* foo */ @#$!!**@#$ /* bar */

when backtracking due to a later sub-expression failing on the garbage occurs. Putting the above in an atomic group solves the problem nicely:

(?> /* .*? */ )

This works always (I hope) and is as fast as possible (for NFA). So I wonder if a DFA engine could somehow handle it.

A DFA does not need to deal with constructs like atomic grouping. A DFA is "text directed", unlike the NFA, which is "regex directed", in other words: Atomic Grouping is a way to help the engine finish a match, that would otherwise cause endless backtracking, as the (NFA) engine tries every permutation possible to find a match at a position, no match is even possible.

Atomic grouping, simply said, throws away backtracking positions. Since a DFA does not backtrack (the text to be matched is checked against the regex, not the regex against the text like a NFA - the DFA opens a branch for each decision), throwing away something that is not there is pointless.

I suggest JFFriedl's Mastering Regular Expressions (Google Books), he explains the general idea of a DFA:

DFA Engine: Text-Directed

Contrast the regex-directed NFA engine with an engine that, while scanning the string, keeps track of all matches “currently in the works.” In the tonight example, the moment the engine hits t, it adds a potential match to its list of those currently in progress:

[...]

Each subsequent character scanned updates the list of possible matches. After a few more characters are matched, the situation becomes

[...]

with two possible matches in the works (and one alternative, knight, ruled out). With the g that follows, only the third alternative remains viable. Once the h and t are scanned as well, the engine realizes it has a complete match and can return success.

I call this “text-directed” matching because each character scanned from the text controls the engine. As in the example, a partial match might be the start of any number of different, yet possible, matches. Matches that are no longer viable are pruned as subsequent characters are scanned. There are even situations where a “partial match in progress” is also a full match. If the regex were ⌈to(…)?⌋, for example, the parenthesized expression becomes optional, but it's still greedy, so it's always attempted. All the time that a partial match is in progress inside those parentheses, a full match (of 'to') is already confirmed and in reserve in case the longer matches don't pan out.

(Source: http://my.safaribooksonline.com/book/programming/regular-expressions/0596528124/regex-directed-versus-text-directed/i87)

Concerning capturing groups and DFAs: as far as I was able to understand from your link, these approaches are not pure DFA engines but hybrids of DFA and NFA.

链接地址: http://www.djcxy.com/p/12976.html

上一篇: .Net正则表达式

下一篇: DFA正则表达式引擎可以处理原子组吗？