Regex: not arbitrary non capturing group
I'm trying to write a regex that covers all my cases. I have to parse XML and capture some properties. Example:
<item p2="2"/>
<item p1="1" p2="2"/>
<item p1="1" p2="2" p3="3"/>
<item p1="1" p2="2" p3="3" p4="4"/>
<item p1="1" p2="2" p3="3" p4="4" p5="5"/>
I have to capture the value of the "p2" property, and I know that "p2" will always be present in the line. I also want to capture the value of the "p4" property, which will not always be present.
At first I tried to satisfy the first four cases (the first 4 lines in the example) and wrote a regular expression like this:
<item.+?p2="(?<val1>\d+)".*?(?:p4="(?<val2>\d+)")?/>
It works fine. The "val1" group always returns a value, and the "val2" group returns a value whenever the "p4" property is present.
But to cover my last case:
<item p1="1" p2="2" p3="3" p4="4" p5="5"/>
I have modified my regular expression like this:
<item.+?p2="(?<val1>\d+)".*?(?:p4="(?<val2>\d+)")?.*?/>
__________________________________________________^^^
I found that the "val1" group still returns values, but the "val2" group no longer returns a value in any of the cases.
Could you tell me what I'm missing, and help me write a regular expression that covers all my cases?
XML is not a regular language, therefore regular expressions are not the way to go. You need a parser.
There are many ways to do this, but personally I would load the XML document into an XmlDocument class and use the SelectNodes method with an XPath query to find your list of items. Once you have that you can use a foreach on each found XmlNode and use the Attributes collection to get the data you want.
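As a rough illustration of that approach, here is a minimal sketch. It assumes the items are wrapped in a root element (called <items> here purely for the example, since the question only shows individual lines), and the ItemReader class and its Read method are hypothetical names, not part of any library:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class ItemReader
{
    // Returns (p2, p4) for every <item> element; a missing attribute yields "".
    public static List<(string P2, string P4)> Read(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var result = new List<(string P2, string P4)>();
        // "//item" selects every <item> element anywhere in the document.
        foreach (XmlNode node in doc.SelectNodes("//item"))
        {
            // The Attributes indexer returns null when the attribute is absent.
            string p2 = node.Attributes["p2"]?.Value ?? string.Empty;
            string p4 = node.Attributes["p4"]?.Value ?? string.Empty;
            result.Add((p2, p4));
        }
        return result;
    }

    static void Main()
    {
        // The <items> root element is an assumption made for this example.
        string xml = @"<items>
  <item p2=""2""/>
  <item p1=""1"" p2=""2"" p3=""3"" p4=""4"" p5=""5""/>
</items>";

        foreach (var (p2, p4) in Read(xml))
            Console.WriteLine("p2: '{0}', p4: '{1}'", p2, p4);
    }
}
```

Because the parser handles attribute order, quoting and whitespace, the position of p4 relative to the other attributes no longer matters.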
If you must do this using regular expressions, what you need to do is put the last .*? inside the non-capturing group. What you have done is give the regex permission to skip the optional p4 group and let the trailing .*? match the rest of the line instead; since that is the first alternative that succeeds, val2 is never filled. Putting the .*? inside the group removes this possibility. This is likely to be slow (it may even suffer from catastrophic backtracking) and it will not handle all the complexities of XML. Here is a program that demonstrates it:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        var regex = new Regex(@"
            <item           # Match <item
            .+?             # Match one or more characters, as few as possible
            p2=             # Match p2=
            ""              # Match opening quote
            (?<val1>\d+)    # Capture one or more decimal digits into val1
            ""              # Match closing quote
            .*?             # Match zero or more characters, as few as possible
            (?:             # Begin a non-capturing group
                p4=             # Match p4=
                ""              # Match opening quote
                (?<val2>\d+)    # Capture one or more decimal digits into val2
                ""              # Match closing quote
                .*?             # Match zero or more characters, as few as possible
            )?              # The p4 group may occur zero or one times
            />              # Match />
            ", RegexOptions.IgnorePatternWhitespace);

        Test(regex, @"<item p2=""2""/>", "2", string.Empty);
        Test(regex, @"<item p1=""1"" p2=""2""/>", "2", string.Empty);
        Test(regex, @"<item p1=""1"" p2=""2"" p3=""3""/>", "2", string.Empty);
        Test(regex, @"<item p1=""1"" p2=""2"" p3=""3"" p4=""4""/>", "2", "4");
        Test(regex, @"<item p1=""1"" p2=""2"" p3=""3"" p4=""4"" p5=""5""/>", "2", "4");
    }

    // Runs the regex against one input and reports whether both groups hold the expected values.
    static void Test(Regex regex, string test, string p2, string p4)
    {
        Match m = regex.Match(test);
        string p2Group = m.Groups["val1"].Value;
        string p4Group = m.Groups["val2"].Value;
        Console.WriteLine("Test: '{0}'", test);
        Console.WriteLine("p2: '{0}' - {1}", p2Group, p2Group == p2 ? "Success" : "Fail");
        Console.WriteLine("p4: '{0}' - {1}", p4Group, p4Group == p4 ? "Success" : "Fail");
    }
}
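To see the difference between the two placements side by side, the following sketch runs the question's second pattern and the corrected one against the five-attribute line. The PatternComparison class and its member names are made up for this illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternComparison
{
    // The question's second attempt: the trailing .*? sits outside the optional group.
    public const string Outside =
        @"<item.+?p2=""(?<val1>\d+)"".*?(?:p4=""(?<val2>\d+)"")?.*?/>";

    // The corrected pattern: the trailing .*? sits inside the optional group.
    public const string Inside =
        @"<item.+?p2=""(?<val1>\d+)"".*?(?:p4=""(?<val2>\d+)"".*?)?/>";

    // Returns the text captured by val2, or "" when the group did not participate.
    public static string P4(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups["val2"].Value;
    }

    static void Main()
    {
        string input = @"<item p1=""1"" p2=""2"" p3=""3"" p4=""4"" p5=""5""/>";
        Console.WriteLine("outside: '{0}'", P4(Outside, input)); // prints outside: ''
        Console.WriteLine("inside:  '{0}'", P4(Inside, input));  // prints inside:  '4'
    }
}
```

With the .*? outside the group, the engine skips the optional group at the first position it tries and the lazy .*? then walks to />, so the match succeeds without ever reaching p4. With the .*? inside, the only way to get past p3 and reach /> is through the group, so p4 is captured whenever it exists.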