Python, regex and html: match final tag on line

2018-06-04 01:15:56

I'm confused about Python's greedy and non-greedy quantifiers.

"Given multi-line html, return the final tag on each line."

I would think this would be correct:

re.findall('<.*?>$', html, re.MULTILINE)

I'm irked because I expected a list of single tags like:

"</html>", "<ul>", "</td>".

My O'Reilly Pocket Reference says that *? will "match 0 or more times, but as few times as possible."

So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?

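Your problem stems from the end-of-line anchor ('$'). Non-greedy matching does not change where a match starts: the engine scans left to right and begins at the first '<' on the line. From there, '.*?' consumes as few characters as possible, but the pattern as a whole still has to succeed, and it can only succeed at a '>' that is immediately followed by the end of the line. So the engine keeps extending the match past every intermediate '>' until it reaches the last one. With the '$' anchor in place, a non-greedy * is not any different from a greedy * in this situation. Lines that contain only one tag still match as you expected, which is why some (but not all) of your results look right.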
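To see this concretely, here is a minimal sketch that runs the lazy and the greedy pattern side by side. The sample html string is made up for illustration and is not the input from the question.

```python
import re

# Hypothetical sample input (an assumption, not the asker's data).
html = "<html><body>\n<ul><li>item</li></ul>\n<td>cell</td>"

# Lazy quantifier, as in the question.
lazy = re.findall(r'<.*?>$', html, re.MULTILINE)

# Greedy quantifier, for comparison.
greedy = re.findall(r'<.*>$', html, re.MULTILINE)

print(lazy)    # ['<html><body>', '<ul><li>item</li></ul>', '<td>cell</td>']
print(greedy)  # same list: the '$' anchor forces both to run to the end of the line
```

Both calls return whole lines here, because every match starts at the first '<' on its line and must end at the '>' just before the line break.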

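Since you cannot remove the '$' from your regex (you are looking for the final tag on a line), you will need to take a different tack; see @Mark's answer. '<[^><]*>$' will work: the character class cannot cross a '<' or a '>', so the match has to start at the last '<' on the line.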
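A short sketch of the suggested pattern in use. The sample html string is again a made-up assumption, chosen so that each line ends with a tag.

```python
import re

# Hypothetical sample input (an assumption, not the asker's data).
html = "<html><body>\n<ul><li>item</li></ul>\n<td>cell</td>"

# [^><]* cannot step over another '<' or '>', so only the last tag
# on each line can satisfy the '$' anchor.
last_tags = re.findall(r'<[^><]*>$', html, re.MULTILINE)

print(last_tags)  # ['<body>', '</ul>', '</td>']
```

Note that this only matches lines that end exactly with '>'. A line with trailing whitespace or text after its last tag produces no match, so you may need to strip lines first or adjust the pattern for your data.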

Link: http://www.djcxy.com/p/13408.html
