Java regex to remove an HTML tag and its content
Possible Duplicate:
How to remove HTML tag in Java
RegEx match open tags except XHTML self-contained tags
I want to remove a specific HTML tag together with its content.
For example, given this HTML:
<span style='font-family:Verdana;mso-bidi-font-family:
"Times New Roman";display:none;mso-hide:all'>contents</span>
If the tag contains "mso-*", the whole element must be removed (opening tag, closing tag and content).
As Dave Newton pointed out in his comment, an HTML parser is the way to go here. If you really want to do it the hard way, here's a regex that works for the example:
String html = "FOO<span style='font-family:Verdana;mso-bidi-font-family:"
        + "\"Times New Roman\";display:none;mso-hide:all'>contents</span>BAR";
// regex matches every opening tag that contains 'mso-' in an attribute name
// or value, the contents and the corresponding closing tag;
// (\\S+) captures the tag name and \\1 refers back to it in the closing tag
String regex = "<(\\S+)[^>]+?mso-[^>]*>.*?</\\1>";
String replacement = "";
System.out.println(html.replaceAll(regex, replacement)); // prints FOOBAR
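If the markup spans several lines, as the example in the question does, a variation along the following lines might help. Treat it as a sketch rather than a definitive solution: the class and method names (MsoTagStripper, strip) are made up for illustration, and the DOTALL and CASE_INSENSITIVE flags are additions that the regex above does not use.

```java
import java.util.regex.Pattern;

public class MsoTagStripper {
    // Same expression as above, precompiled with two extra flags (an assumption,
    // not part of the original answer):
    // DOTALL lets '.' match line breaks inside the element's content,
    // CASE_INSENSITIVE lets </SPAN> close <span> and matches 'MSO-' as well.
    private static final Pattern MSO_ELEMENT = Pattern.compile(
            "<(\\S+)[^>]+?mso-[^>]*>.*?</\\1>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: removes every element whose opening tag contains "mso-".
    public static String strip(String html) {
        return MSO_ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "FOO<span style='font-family:Verdana;mso-bidi-font-family:\n"
                + "\"Times New Roman\";display:none;mso-hide:all'>line one\n"
                + "line two</span>BAR<b>kept</b>";
        System.out.println(strip(html)); // prints FOOBAR<b>kept</b>
    }
}
```

The match still ends at the first closing tag with the same name, so an element that contains a nested element of the same name (a span inside a span, say) will be cut off too early. That is one of the reasons a parser is the safer choice.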