Stripping Invalid XML characters in Java
I have an XML file that's the output from a database. I'm using the Java SAX parser to parse the XML and output it in a different format. The XML contains some invalid characters, and the parser is throwing errors like 'Invalid Unicode character (0x5)'.
Is there a good way to strip all these characters out besides pre-processing the file line by line and replacing them? So far I've run into three different invalid characters (0x5, 0x6 and 0x7). It's a ~4 GB database dump and we're going to be processing it a bunch of times, so having to wait an extra 30 minutes to run a pre-processor each time we get a new dump is going to be a pain, and this isn't the first time I've run into this issue.
I haven't used this personally, but Atlassian made a command-line XML cleaner that may suit your needs (it was made mainly for JIRA, but XML is XML):
1. Download atlassian-xml-cleaner-0.1.jar
2. Open a DOS console or shell, and locate the XML or ZIP backup file on your computer, here assumed to be called data.xml
3. Run: java -jar atlassian-xml-cleaner-0.1.jar data.xml > data-clean.xml
This will write a copy of data.xml to data-clean.xml, with invalid characters removed.
I use the Xalan org.apache.xml.utils.XMLChar class:
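import org.apache.xml.utils.XMLChar;

public static String stripInvalidXmlCharacters(String input) {
    StringBuilder sb = new StringBuilder(input.length());
    // Walk the string by code point rather than by char, so that valid
    // characters above U+FFFF (stored as surrogate pairs) are not dropped.
    for (int i = 0; i < input.length(); ) {
        int cp = input.codePointAt(i);
        if (XMLChar.isValid(cp)) {
            sb.appendCodePoint(cp);
        }
        i += Character.charCount(cp);
    }
    return sb.toString();
}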
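Since the question is about avoiding a separate pre-processing pass, the same kind of per-character check could also be applied while the file is being read. Below is a minimal sketch of that idea, not something I've run against a 4 GB dump. The class name InvalidXmlCharFilterReader and the helper isAllowed are made up for illustration, and the check is hand-written from the XML 1.0 Char production instead of calling XMLChar so the example has no dependencies. As a simplification, surrogate chars are passed through without checking that they are properly paired.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper: drops characters XML 1.0 forbids as the stream is read. */
class InvalidXmlCharFilterReader extends FilterReader {

    InvalidXmlCharFilterReader(Reader in) {
        super(in);
    }

    // XML 1.0 Char production for the BMP. Surrogates are let through
    // unchecked so that well-formed pairs (U+10000..U+10FFFF) survive;
    // this sketch does not detect unpaired surrogates.
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || Character.isSurrogate(c)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip forbidden characters until a valid one or end of stream.
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the chunk in place, keeping only allowed characters.
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[kept++] = buf[i];
                }
            }
            if (kept > off) {
                return kept - off;
            }
            // The whole chunk was invalid: read again instead of returning 0.
        }
    }
}
```

To use it with SAX, wrap the file's Reader in this class and hand it to the parser through new InputSource(reader). You have to pick the file's encoding yourself when you create the underlying InputStreamReader, because the parser no longer sees the raw bytes.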
I use the following regexp, which seems to work as expected on JDK 6:
Pattern INVALID_XML_CHARS = Pattern.compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]");
...
INVALID_XML_CHARS.matcher(stringToCleanup).replaceAll("");
In JDK 7 it might be possible to use the notation \x{10000}-\x{10FFFF} for the last range, which lies outside the BMP, instead of the \uD800\uDC00-\uDBFF\uDFFF notation, which is not as simple to understand.
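For what it's worth, here is a minimal sketch of the pattern written with that code point notation, assuming Java 7 or later. The class and method names (XmlCleaner, stripInvalid) are made up for illustration; the character ranges are the ones from the pattern above.

```java
import java.util.regex.Pattern;

class XmlCleaner {
    // Matches everything outside the XML 1.0 Char production. The last range
    // uses the code point notation that java.util.regex accepts from Java 7 on.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Returns a copy of s with every character matched by the pattern removed.
    static String stripInvalid(String s) {
        return INVALID_XML_CHARS.matcher(s).replaceAll("");
    }
}
```

This works on whole strings, so for a multi-gigabyte dump it would have to be applied chunk by chunk (for example per line or per buffer) and not to the file as a single string.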