Java Unescaping XML/HTML before JAXB parsing doesn't work
Can anyone help me?
In HTML/XML:
A numeric character reference refers to a character by its Universal Character Set/Unicode code point, and uses the format:
&#nnnn; or &#xhhhh;
I have to unescape these references (convert them to Unicode characters) before I use the JAXB parser.
When I use Apache StringEscapeUtils.unescapeXml(), &amp;, &gt; and &lt; are unescaped as well, and that is not what I want because parsing will then fail.
Is there a library that converts only the &#nnnn; references to Unicode and leaves the rest escaped?
Example:
begin-tag Adam &lt;&gt; Sl.meer 4 &amp; 5 &# 55357;&# 56900; end-tag
I have added spaces after &# otherwise you do not see the notation.
For now I fixed it like this, but I want to use a better solution.
// Unescape everything, then re-escape the markup characters so the XML stays parseable.
String unescapedString = StringEscapeUtils.unescapeXml(xmlData).replaceAll("&", "&amp;")
.replaceAll("<>", "&lt;&gt;");
// Drop code points that are not legal in XML before handing the text to JAXB.
StringReader reader = new StringReader(unescapedString.codePoints().filter(c -> isValidXMLChar(c))
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append).toString());
return (Xxxx) createUnmarshaller().unmarshal(reader);
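The isValidXMLChar helper used above is not shown in the post. A minimal sketch of what it might look like, assuming it follows the Char production of XML 1.0 (the class name XmlChars and the stripInvalid wrapper are made up here for illustration):

```java
public class XmlChars {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isValidXMLChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical wrapper: keeps only the code points that XML 1.0 allows.
    // String.codePoints() joins a valid surrogate pair into one code point,
    // so a lone surrogate is dropped while a proper pair survives.
    public static String stripInvalid(String s) {
        return s.codePoints()
                .filter(XmlChars::isValidXMLChar)
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }
}
```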
I looked in the Apache Commons Text library and finally found the solution:
NumericEntityUnescaper numericEntityUnescaper = new NumericEntityUnescaper(
NumericEntityUnescaper.OPTION.semiColonRequired);
xmlData = numericEntityUnescaper.translate(xmlData);
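If adding Commons Text is not an option, something similar can probably be done with the JDK alone. The class below is a hypothetical sketch (NumericReferenceDecoder and decode are names invented here, not part of any library). It only approximates the behaviour of NumericEntityUnescaper with semiColonRequired: it decodes &#nnnn; and &#xhhhh; and leaves everything else, including named entities, untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericReferenceDecoder {

    // Matches &#xhhhh; (group 1, hexadecimal) or &#nnnn; (group 2, decimal).
    // The trailing semicolon is required.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    public static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(input, last, m.start());
            int value;
            try {
                value = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                value = -1; // too large to be a code point
            }
            if (Character.isValidCodePoint(value)) {
                // Surrogate halves written as two references end up adjacent
                // in the output and so form a valid UTF-16 pair.
                out.appendCodePoint(value);
            } else {
                out.append(m.group()); // leave unparseable references as they are
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

With the example from the question, &#55357;&#56900; becomes the surrogate pair for U+1F644, while &lt;, &gt; and &amp; stay escaped, so the result can still be passed to the unmarshaller. Characters that are illegal in XML (for instance from &#0;) are decoded too, so a validity filter may still be needed afterwards.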