Possessive generic quantifier {m,n}+ not implemented in Ruby 1.9.3?
Possessive quantifiers are greedy and refuse backtrack. A regex /.{1,3}+b/
should mean: Match any character except line breaks, 1 to 3 times, as many as possible and don't backtrack. Tthen match the character b
.
In this example:
'ab'.sub /.{1,3}+b/, 'c' #=> "c"
no substitution should take place, contrary to fact.
The result in these two examples differs:
'aab'.sub /.{0,1}+b/, 'c' #=> "c"
'aab'.sub /.?+b/, 'c' #=> "ac"
Compare this with Scala, where they give the same answer:
scala> ".{0,1}+b".r.replaceAllIn("aab", "c")
res1: String = ac
scala> ".?+b".r.replaceAllIn("aab", "c")
res2: String = ac
Is this a Ruby bug, or is it possible to motivate this behavior? Perhaps, Oniguruma for some reason implemented possessive with all quantifiers ?
, *
, +
except the generic quantifier {m,n}
? If that's the case, why?
What really happens
It seems that +
followed by range quantifier doesn't offer the possessive property to the range quantifier. Rather, it is treated as whatever in front repeated once or more. Using .{1,3}+b
as an example, it will be equivalent to (?:.{1,3})+b
.
Work-around
You can work-around this with the more general construct non-backtracking group (or atomic grouping) (?>pattern)
. Let us use the general case pattern{n,m}+
as an example to construct the equivalent regex with non-backtracking group (equivalent to Java's behavior when matching with pattern{n,m}+
):
(?>(?>pattern){n,m})
Why 2 levels of non-backtracking groups? 2 are necessary because:
pattern
(one instance of repetition), backtracking within pattern
is disallowed. (Note that as long as an instance has not been found, backtracking within pattern
is allowed). This is emulated with the inner non-backtrakcing group. pattern
can be found, backtracking to remove any of the instances is disallowed. This is emulated with the outer non-backtracking group. I am not sure if there is any caveat here. Please ping me with comment if you found any case not emulated with this method.
Testing
Test 1
At first, I tested this regex:
(.{1,3}+)b
Initially, I tested without the capturing group, but the result was so surprising that I need the capturing group to confirm what is happening.
On this input:
2343333ab
The result is that the whole string matches , and the capturing group captured 2343333a
(without the b
at the end). This shows that the upper limit has somehow been broken.
DEMO at rubular
Test 2
This second test reveals how range quantifiers {n}
's behavior cannot be modified to be possessive, and it is likely that this also applies for other range quantifiers {n,}
and {n,m}
. Instead, the following +
will only exhibit repetition of 1 or more time behavior.
(My initial conclusion is that +
overwrites the upper limit, but it turns out to be wrong).
Testing regex:
(.{3}+)b
Input string:
23d4344333ab
234344333ab
23434433ab
The matches captured in capturing group 1 are all multiples of 3. From top to bottom, the regex skips 2, 1, 0 characters respectively for the input strings.
Input string with annotation ( []
denotes the match for the whole regex, ()
denotes the text captured by capturing group 1):
23[(d4344333a)b]
2[(34344333a)b]
[(23434433a)b]
DEMO at rubular
Testing code for work around
This is the testing code in Java to show that both outer and inner non-backtracking groups are necessary. ideone
class TestPossessive {
public static void main(String args[]) {
String inputText = "123456789012";
System.out.println("Input string: " + inputText);
System.out.println("Expected: " + inputText.replaceFirst("(?:d{3,4}(?![89])){2,}+", ">$0<"));
System.out.println("Outer possessive group: " + inputText.replaceFirst("(?>(?:d{3,4}(?![89])){2,})", ">$0<"));
System.out.println("Inner possessive group: " + inputText.replaceFirst("(?>d{3,4}(?![89])){2,}", ">$0<"));
System.out.println("Both: " + inputText.replaceFirst("(?>(?>d{3,4}(?![89])){2,})", ">$0<"));
System.out.println();
inputText = "aab";
System.out.println("Input string: " + inputText);
System.out.println("Expected: " + inputText.replaceFirst(".{1,3}+b", ">$0<"));
System.out.println("Outer possessive group: " + inputText.replaceFirst("(?>.{1,3})b", ">$0<"));
System.out.println("Inner possessive group: " + inputText.replaceFirst("(?>.){1,3}b", ">$0<"));
System.out.println("Both: " + inputText.replaceFirst("(?>(?>.){1,3})b", ">$0<"));
}
}
It seems like this is intended in Oniguruma. Documentation says {n,m}+, {n,}+, {n}+ are possessive op. in ONIG_SYNTAX_JAVA only
{n,m}+, {n,}+, {n}+ are possessive op. in ONIG_SYNTAX_JAVA only
. I guess this is because of backward compatibility reasons, or?