.Net Regex

I have a .Net app using regular expressions to extract information out of some html. The html is not XML compliant, so I can't parse it using XDoc. Here is a small piece of the html that I'm having problems with:

<td class="program">
    <div>
        <h2>
            The O'Reilly Factor
        </h2>
    </div>
</td>
<td class="program">
    <div>
        <span class="font-icon-new">New</span>
        <h2>
            The Kelly File
        </h2>
    </div>
</td>

The regular expression I'm using is:

(<td class="program">.*?(?<isnew>font-icon-new)?.*</td>)+

What I'm expecting in this scenario is two captured groups. The first group's "isnew" group would be blank (a non-hit), but the second group's "isnew" group would be populated. However, the "isnew" group is always blank, and I've tried multiple variations and simplified it down as much as possible to no avail. I'm also using the RegexOptions.Singleline option to ensure the "." also matches newline characters. Any ideas on what I'm missing?

Thanks in advance.


I think you are misusing (if not abusing) the regex engine. Since you only need to check whether a known sequence of characters occurs inside the string, can't you use a simple String.Contains()?
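A minimal sketch of that idea, written in Python only to keep it short and runnable (in .NET the check itself would be String.Contains). Splitting on the literal opening tag is my own assumption about how to isolate each cell, and HTML, cells and is_new are just illustrative names:

```python
HTML = """<td class="program">
    <div>
        <h2>
            The O'Reilly Factor
        </h2>
    </div>
</td>
<td class="program">
    <div>
        <span class="font-icon-new">New</span>
        <h2>
            The Kelly File
        </h2>
    </div>
</td>"""

# Assumption: every cell starts with exactly this opening tag.
# The first piece is the (empty) text before the first cell, so drop it.
cells = HTML.split('<td class="program">')[1:]

# Plain substring test per cell, no regex involved.
is_new = ["font-icon-new" in cell for cell in cells]
print(is_new)  # [False, True]
```

This only works while the marker text cannot appear elsewhere in a cell, for example inside a programme title.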

Now, here is why this regex does not capture the attribute value: ? and .* are greedy quantifiers, while .*? is lazy. Let's add capturing groups around those subpatterns to see exactly what we are capturing:

(<td class="program">(.*?)(?<isnew>font-icon-new)?(.*)</td>)+

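Group 2, (.*?), is empty! The lazy .*? starts by matching nothing, so the optional isnew group is tried right after <td class="program">, fails there, and is simply skipped. Everything after <td class="program">, up to the last </td>, is then captured into Group 3, (.*). Have a look at this excerpt: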

In situations where the decision is between “make an attempt” and “skip an attempt,” as with items governed by quantifiers, the engine always chooses to first make the attempt for greedy quantifiers, and to first skip the attempt for lazy (non-greedy) ones. - Mastering Regular Expressions, p.159
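To see that behaviour outside Expresso, here is a rough reproduction with Python's re module, on the assumption that its backtracking rules agree with .NET's for this pattern. Python writes named groups as (?P<name>...), re.DOTALL stands in for RegexOptions.Singleline, and I named the two extra groups lazy and greedy (my own labels) rather than relying on group numbers, since engines number named groups differently:

```python
import re

HTML = """<td class="program">
    <div>
        <h2>
            The O'Reilly Factor
        </h2>
    </div>
</td>
<td class="program">
    <div>
        <span class="font-icon-new">New</span>
        <h2>
            The Kelly File
        </h2>
    </div>
</td>"""

# The instrumented pattern from above, in Python syntax.
ORIGINAL = (
    r'(<td class="program">(?P<lazy>.*?)'
    r'(?P<isnew>font-icon-new)?(?P<greedy>.*)</td>)+'
)

matches = list(re.finditer(ORIGINAL, HTML, re.DOTALL))
m = matches[0]

print(len(matches))                         # 1: both cells end up in one match
print(repr(m.group("lazy")))                # '': the lazy part matched nothing
print(m.group("isnew"))                     # None: the optional group was skipped
print("font-icon-new" in m.group("greedy")) # True: the greedy part swallowed it
```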

The best regex fix I can think of is to wrap the optional word and the wildcard that follows it, now lazy (.*?), in an optional (greedy) non-capturing group, (?:(?<isnew>font-icon-new).*?)?, so the full pattern becomes:

(<td class="program">.*?(?:(?<isnew>font-icon-new).*?)?</td>)+

I checked the results in Expresso (note: Singleline mode is ON).
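For a quick check without Expresso, the same pattern can be sketched in Python, again assuming its re module behaves like the .NET engine here. The named group is spelled (?P<isnew>...) and re.DOTALL replaces Singleline; HTML, FIXED and flags are illustrative names:

```python
import re

HTML = """<td class="program">
    <div>
        <h2>
            The O'Reilly Factor
        </h2>
    </div>
</td>
<td class="program">
    <div>
        <span class="font-icon-new">New</span>
        <h2>
            The Kelly File
        </h2>
    </div>
</td>"""

# The fixed pattern in Python syntax: the optional group is tried at every
# position the lazy .*? expands to, so it can reach font-icon-new before </td>.
FIXED = r'(<td class="program">.*?(?:(?P<isnew>font-icon-new).*?)?</td>)+'

matches = list(re.finditer(FIXED, HTML, re.DOTALL))

# One match per cell; isnew is None when the marker is absent.
flags = [m.group("isnew") is not None for m in matches]
print(len(matches))  # 2
print(flags)         # [False, True]
```

Each cell comes back as its own match here because the sample has a newline between </td> and the next <td class="program">, so the outer + never repeats. In .NET the equivalent would be iterating Regex.Matches and checking Groups["isnew"].Success on each match.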
