POSIX regex capture group matching sequence
I have the following text string and regex pattern in a C program:
char text[] = " identification division. ";
char pattern[] = "^(.*)(identification *division)(.*)$";
Using the regexec() library function, I got the following results:
String: identification division.
Pattern: ^(.*)(identification *division)(.*)$
Total number of subexpressions: 3
OK, pattern has matched ...
begin: 0, end: 37, match: identification division.
subexpression 1 begin: 0, end: 8, match:
subexpression 2 begin: 8, end: 35, match: identification division
subexpression 3 begin: 35, end: 37, match: .
The regex engine matches greedily, and the first capture group (.*) matches any number of characters (except newline characters). So why doesn't it match all the way to the end of the text string (up to the '.'), as opposed to matching only the first 8 spaces?
Does each capture group have to be matched?
Are there any rules on how the capture group matches the text string?
Thanks.
Just as you said, the group (.*) is greedy. But if it had consumed the whole string, the rest of the regex would have had nothing left to match, and the regex as a whole would not have matched the string. So the first group takes only as much as it can while still letting the remaining parts match. And yes, each capture group (and every other part of the pattern) needs to be matched. This is exactly what you specified in your regex.
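To see this for yourself, here is a minimal sketch that prints the offsets of each group. The helper names match_groups and print_groups are made up for illustration and are not part of the POSIX API; only regcomp(), regexec() and regfree() are. It assumes the pattern is compiled with REG_EXTENDED, since the unescaped parentheses in your pattern only act as groups in extended syntax.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compile `pattern` as a POSIX extended regex, match it
 * against `text`, and store the offsets of the whole match (index 0) and of
 * each subexpression (index 1..) in `groups`.
 * Returns 0 on a match, nonzero on no match or on a compile error. */
int match_groups(const char *pattern, const char *text,
                 regmatch_t *groups, size_t ngroups)
{
    regex_t re;
    int rc;

    assert(groups != NULL && ngroups > 0);

    /* REG_EXTENDED makes unescaped parentheses act as capture groups. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, text, ngroups, groups, 0);
    regfree(&re);
    return rc;
}

/* Print every group that took part in the match as: index: [begin, end) 'text' */
void print_groups(const char *text, const regmatch_t *groups, size_t ngroups)
{
    size_t i;

    for (i = 0; i < ngroups; i++) {
        if (groups[i].rm_so == -1)
            continue; /* this group did not participate in the match */
        printf("%zu: [%d, %d) '%.*s'\n", i,
               (int)groups[i].rm_so, (int)groups[i].rm_eo,
               (int)(groups[i].rm_eo - groups[i].rm_so),
               text + groups[i].rm_so);
    }
}
```

Call it with an array of four regmatch_t (one for the whole match plus three groups), for example match_groups(pattern, text, groups, 4) followed by print_groups(text, groups, 4). With your pattern, group 1 stops right where group 2 has to begin, which is the behaviour you observed.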
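Try the following string instead. Run the code once with a greedy first group and once with a reluctant one (if your regex flavor supports reluctant quantifiers such as .*?, which POSIX extended regular expressions do not define), and you will see the difference.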
char text[] = " identification division identification division. ";
Regexes are as greedy as possible, without being too greedy. Had the left group been as greedy as you expect, the group that matches "identification division" would have had nothing left to match, and the regex would have erroneously rejected a text that is clearly in the language.