Building a regular expression (RegEx) to extract the text of an HTML tag

This question already has an answer here:

  • RegEx match open tags except XHTML self-contained tags

  • <a href="javascript:ProcessQuery\('report_drilldown',[0-9]+\)">([^<]*)</a>
    

    This won't really solve the problem, but it may just barely scrape by. In particular, it's very brittle: the slightest change to the markup and it won't match. If report_drilldown isn't meant to be a fixed value, replace it with [^']*, and capture it, the number, or both if you need them.
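A minimal sketch of how such a pattern could be applied, written in Python for illustration (the thread is about C#, but the pattern uses only constructs common to most regex engines). The sample markup is made up, the parentheses of the JavaScript call are escaped so they match literally, and the number is captured as well as the link text:

```python
import re

# Escaped parentheses match the literal "(" and ")" of the JavaScript call.
# Group 1 captures the numeric argument, group 2 the link text.
LINK_RE = re.compile(
    r"""<a href="javascript:ProcessQuery\('report_drilldown',([0-9]+)\)">([^<]*)</a>"""
)

# Hypothetical markup, invented for this illustration.
sample = (
    '<td><a href="javascript:ProcessQuery(\'report_drilldown\',42)">Q3 totals</a></td>'
    '<td><a href="javascript:ProcessQuery(\'report_drilldown\',7)">Q4 totals</a></td>'
)

matches = LINK_RE.findall(sample)
for number, text in matches:
    print(number, text)

# Brittleness in action: one extra attribute and the pattern no longer matches.
changed = '<a class="x" href="javascript:ProcessQuery(\'report_drilldown\',42)">Q3 totals</a>'
still_matches = LINK_RE.search(changed) is not None
print(still_matches)
```

The last check shows why this only scrapes by: adding a single attribute before href is enough to make the match fail.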

    If you need something that actually parses HTML, it's a bit of a nightmare if you have to deal with tag soup. If you were using Python, I'd suggest BeautifulSoup, but I don't know of anything similar for C#. (Does anyone know of a similar tag soup parsing library for C#?)


    The answer is... DON'T!

    Use a library, such as this one
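To give a rough idea of what the parser-based route looks like, here is a sketch using Python's standard-library html.parser. This is an illustration only: it is not the library the answer linked to, it is not C#, and the class name LinkTextExtractor and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._href = None      # href of the <a> currently open, if any
        self._chunks = []      # text pieces seen inside that <a>
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> also arrives here.
        if self._in_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._chunks).strip()))
            self._in_link = False


# Hypothetical markup with an extra attribute and nested formatting.
sample = """<td><a class="drill" href="javascript:ProcessQuery('report_drilldown',42)">Q3 <b>totals</b></a></td>"""

parser = LinkTextExtractor()
parser.feed(sample)
parser.close()
links = parser.links
print(links)
```

Unlike a hand-written pattern, this keeps working when attributes are added or reordered, or when the link text contains nested tags.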


    I agree regex might not be the best way to parse this, but with a backreference it's easily done:

    <(?<tag>\w*)(?:.*)>(?<text>.*)</\k<tag>>
    

    Where tag and text are named capture groups.
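A small sketch of the same backreference idea in Python, for illustration. Python spells named groups and named backreferences differently from .NET, so the pattern is translated rather than copied; the sample strings are made up.

```python
import re

# Python uses (?P<name>...) for named groups and (?P=name) for named
# backreferences; .NET spells these (?<name>...) and \k<name>.
ELEMENT_RE = re.compile(r"<(?P<tag>\w*)(?:.*)>(?P<text>.*)</(?P=tag)>")

# Hypothetical input: one element with an attribute.
match = ELEMENT_RE.search('<a href="report.html">Quarterly report</a>')
tag = match.group("tag")
text = match.group("text")
print(tag, text)

# The backreference requires the closing tag to repeat the opening tag name.
mismatched = ELEMENT_RE.search("<b>bold text</i>")
print(mismatched)

# Both .* parts are greedy, so with two elements on one line the match
# runs from the first opening tag to the last closing tag.
greedy = ELEMENT_RE.search("<b>one</b> and <b>two</b>")
greedy_text = greedy.group("text")
print(greedy_text)
```

The last case is worth noting before relying on this: because the quantifiers are greedy, the text group picks up only the last element on the line, so lazy quantifiers or a parser are probably a better fit when several tags can share a line.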

    Hat tip: the Expresso library
