Python, regex and html: match final tag on line
I'm confused about Python's greedy/non-greedy quantifiers.
"Given multi-line html, return the final tag on each line."
I would think this would be correct:
re.findall('<.*?>$', html, re.MULTILINE)
I'm irked because I expected a list of single tags like:
"</html>", "<ul>", "</td>".
My O'Reilly's Pocket Reference says that *?
will "match 0 or more times, but as few times as possible."
So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?
Your problem stems from the end-of-line anchor ('$'). The regex engine tries starting positions from left to right, so the match begins at the first '<' on the line. From there, the non-greedy '.*?' tries the shortest expansion first, but the rest of the pattern ('>' followed by the end of the line) still has to match, so the engine keeps extending '.*?' until it reaches a '>' that sits at the end of the line. Non-greedy matching only affects where a match ends, never where it starts, and here the end is already pinned by '$'. So a non-greedy * is no different from a greedy * in this situation.
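To see this concretely, here is a small sketch. The sample html string is made up for illustration and is not the asker's data. It runs the lazy and the greedy pattern side by side:

```python
import re

# Made-up sample input: three lines, each ending in a tag.
html = "<html><body>\n<ul><li>item</li></ul>\n<td>cell</td>"

# Lazy and greedy versions of the same anchored pattern.
lazy = re.findall(r'<.*?>$', html, re.MULTILINE)
greedy = re.findall(r'<.*>$', html, re.MULTILINE)

print(lazy)    # ['<html><body>', '<ul><li>item</li></ul>', '<td>cell</td>']
print(greedy)  # same list: '$' forces both to run to the end of the line
```

Both calls return whole lines, because each match starts at the first '<' and can only end at a '>' that is immediately followed by the end of the line.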
Since you cannot remove the '$' from your RE (you are looking for the final tag on a line), you will need to take a different tack...see @Mark's answer. '<[^><]*>$' will work: the character class [^><] cannot match '<' or '>', so a match can never span more than one tag.
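A minimal sketch of that pattern in use, again on a made-up sample string (sample and last_tags are illustrative names, not from the question):

```python
import re

# Made-up sample input: three lines, each ending in a tag.
sample = "<html><body>\n<ul><li>item</li></ul>\n<td>cell</td>"

# [^><]* cannot cross a '<' or '>', so each match is exactly one tag,
# and '$' (with re.MULTILINE) requires that tag to end the line.
last_tags = re.findall(r'<[^><]*>$', sample, re.MULTILINE)

print(last_tags)  # ['<body>', '</ul>', '</td>']
```

This assumes the last tag is the very last thing on the line (no trailing whitespace) and that no attribute value contains a literal '<' or '>'. For anything beyond simple input like this, an HTML parser is likely a safer choice than a regex.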