Building Regular Expression (RegEx) to extract text of HTML tag

2018-06-27 11:52:43

This question already has an answer here:

RegEx match open tags except XHTML self-contained tags

<a href="javascript:ProcessQuery\('report_drilldown',[0-9]+\)">([^<]*)</a>

This won't really solve the problem, but it may just barely scrape by. In particular, it's very brittle: the slightest change to the markup and it won't match. If report_drilldown isn't meant to be a fixed value, replace it with [^']*, and capture it and/or the number if you need them.
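As a rough illustration of how that pattern behaves, here is a minimal Python sketch. The sample markup, the number 145817 and the helper name link_texts are made up for the example, and the parentheses around the ProcessQuery arguments are backslash-escaped so that they match literal characters.

```python
import re

# Hypothetical sample markup; the real page may look different.
html = """<a href="javascript:ProcessQuery('report_drilldown',145817)">Total Sales</a>"""

# The parentheses of the ProcessQuery call are escaped so they match literally.
# The only capture group is the link text.
LINK_RE = re.compile(
    r"""<a href="javascript:ProcessQuery\('report_drilldown',[0-9]+\)">([^<]*)</a>"""
)

def link_texts(markup):
    """Return the text of every anchor that matches the exact pattern."""
    return LINK_RE.findall(markup)

print(link_texts(html))
```

The brittleness is easy to reproduce: an extra space or attribute inside the opening tag is enough to make findall return an empty list.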

If you need something that parses HTML, then it's a bit of a nightmare if you have to deal with tag soup. If you were using Python, I'd suggest BeautifulSoup, but I don't know of anything similar for C#. (Does anyone know of a similar tag soup parsing library for C#?)
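For comparison, a parser-based version might look like the sketch below. It uses Python's standard-library html.parser rather than BeautifulSoup so that it runs without extra packages. The class name LinkTextExtractor, the helper extract_link_texts and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser

class LinkTextExtractor(HTMLParser):
    """Collect the text of <a> tags whose href calls ProcessQuery."""

    def __init__(self):
        super().__init__()
        self.texts = []        # finished link texts
        self._buffer = None    # text pieces while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from the parser.
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("javascript:ProcessQuery("):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.texts.append("".join(self._buffer).strip())
            self._buffer = None

def extract_link_texts(markup):
    """Feed the markup to a fresh parser and return the collected texts."""
    parser = LinkTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.texts

# Hypothetical messy markup: odd casing, extra attribute, nested tag.
sample = """<p><A  class="x" HREF="javascript:ProcessQuery('report_drilldown',42)">
  Quarterly <b>report</b></A></p>"""
print(extract_link_texts(sample))
```

Unlike the fixed pattern, this tolerates changes in case, attribute order, whitespace and nested inline tags, which is the main argument for a parser here.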

The answer is... DON'T!

Use a library, such as this one

I agree regex might not be the best way to parse this, but using a backreference it's easily done:

<(?<tag>\w*)(?:.*)>(?<text>.*)</\k<tag>>

Where tag and text are named capture groups.
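The pattern above uses .NET syntax. A hedged Python translation is sketched here, where named groups are written (?P<name>...) and the backreference is (?P=name). The helper name first_element and the sample strings are invented for the example.

```python
import re

# Python spelling of the same idea: (?P<tag>...) names the group and
# (?P=tag) requires the closing tag to repeat whatever the opening tag matched.
TAG_RE = re.compile(r"<(?P<tag>\w*)(?:.*)>(?P<text>.*)</(?P=tag)>")

def first_element(markup):
    """Return (tag, text) for the first matching element, or None."""
    m = TAG_RE.search(markup)
    return (m.group("tag"), m.group("text")) if m else None

print(first_element('<a href="x">Report</a>'))
```

Because both .* parts are greedy and do not cross newlines, this is only dependable for a single element on one line. Several elements on the same line, or nested tags with the same name, can produce surprising captures.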

hat-tip: expresso library
