Java regex to remove an HTML tag and its content

2018-06-27 11:55:52

Possible duplicates:
How to remove HTML tags in Java
RegEx match open tags except XHTML self-contained tags

I want to remove specific HTML tags together with their content.

For example, if the HTML is:

<span style='font-family:Verdana;mso-bidi-font-family:
"Times New Roman";display:none;mso-hide:all'>contents</span>

If a tag contains "mso-*", the whole element must be removed: the opening tag, the closing tag, and the content between them.

As Dave Newton pointed out in his comment, an HTML parser is the way to go here. If you really want to do it the hard way, here is a regex that works:

    String html = "FOO<span style='font-family:Verdana;mso-bidi-font-family:"
        + "\"Times New Roman\";display:none;mso-hide:all'>contents</span>BAR";
    // regex matches every opening tag that contains 'mso-' in an attribute name
    // or value, the contents and the corresponding closing tag;
    // group 1 captures the tag name and \\1 requires the same name in the closing tag
    String regex = "<(\\S+)[^>]+?mso-[^>]*>.*?</\\1>";
    String replacement = "";
    System.out.println(html.replaceAll(regex, replacement)); // prints FOOBAR
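The HTML in the question breaks across lines, and by default `.` in a Java regex does not match line terminators, so the pattern above would miss an element whose content spans several lines. Below is a minimal sketch of one way to handle that, assuming the same pattern is acceptable otherwise. The class name `MsoTagStripper` and the method `strip` are hypothetical names chosen for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

class MsoTagStripper {
    // Same pattern as in the answer, precompiled with two flags:
    // DOTALL lets '.' cross line breaks inside the element content;
    // CASE_INSENSITIVE lets <SPAN ...> pair with </span> via the backreference.
    private static final Pattern MSO_ELEMENT = Pattern.compile(
            "<(\\S+)[^>]+?mso-[^>]*>.*?</\\1>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: removes every element whose opening tag mentions "mso-".
    public static String strip(String html) {
        return MSO_ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "FOO<span style='font-family:Verdana;mso-bidi-font-family:\n"
                + "\"Times New Roman\";display:none;mso-hide:all'>line one\n"
                + "line two</span>BAR";
        System.out.println(strip(html));
    }
}
```

This is still a regex and not a parser: if an "mso-" element contains a nested element with the same tag name, the lazy `.*?` stops at the first matching closing tag and leaves the remainder behind. That limitation is the reason a real HTML parser is the more robust choice.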
