Stripping Invalid XML characters in Java

2018-06-12 04:42:36

I have an XML file that's the output from a database. I'm using the Java SAX parser to parse the XML and output it in a different format. The XML contains some invalid characters, and the parser is throwing errors like 'Invalid Unicode character (0x5)'.

Is there a good way to strip all these characters out besides pre-processing the file line-by-line and replacing them? So far I've run into 3 different invalid characters (0x5, 0x6 and 0x7). It's a ~4 GB database dump and we're going to be processing it a bunch of times, so having to wait an extra 30 minutes each time we get a new dump to run a pre-processor on it is going to be a pain, and this isn't the first time I've run into this issue.

I haven't used this personally but Atlassian made a command line XML cleaner that may suit your needs (it was made mainly for JIRA but XML is XML):

Download atlassian-xml-cleaner-0.1.jar

Open a DOS console or shell, and locate the XML or ZIP backup file on your computer, here assumed to be called data.xml

Run: java -jar atlassian-xml-cleaner-0.1.jar data.xml > data-clean.xml

This will write a copy of data.xml to data-clean.xml, with invalid characters removed.

I use the Xalan org.apache.xml.utils.XMLChar class:

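import org.apache.xml.utils.XMLChar;

// Returns a copy of input with every character that XML 1.0 forbids removed.
public static String stripInvalidXmlCharacters(String input) {
    StringBuilder sb = new StringBuilder(input.length());
    // Walk by code point so characters outside the BMP (surrogate pairs) are
    // tested as a whole rather than dropped as two lone surrogates.
    for (int i = 0; i < input.length(); ) {
        int cp = input.codePointAt(i);
        if (XMLChar.isValid(cp)) {
            sb.appendCodePoint(cp);
        }
        i += Character.charCount(cp);
    }

    return sb.toString();
}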
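
For a dump too large to hold in a String, the same kind of check could be applied while the parser reads, which avoids a separate pre-processing pass. The following is only a minimal sketch under stated assumptions: XmlCleaningReader, isAllowed and clean are hypothetical names, the XML 1.0 character ranges are hard-coded instead of using XMLChar so the class has no dependencies, and surrogates are passed through on the assumption that they are correctly paired in the input.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: drops characters XML 1.0 does not allow while reading. */
class XmlCleaningReader extends FilterReader {

    XmlCleaningReader(Reader in) {
        super(in);
    }

    /** XML 1.0 Char ranges, checked per UTF-16 unit. */
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || Character.isSurrogate(c); // assumption: surrogates are correctly paired
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n;
        // Keep reading so a chunk made up only of invalid characters
        // is not reported to the caller as "0 chars read".
        while ((n = super.read(buf, off, len)) > 0) {
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[kept++] = buf[i]; // compact the valid chars in place
                }
            }
            if (kept > off) {
                return kept - off;
            }
        }
        return n; // -1 at end of stream, or 0 if len was 0
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    /** Convenience for small inputs and for trying the filter out. */
    static String clean(String s) {
        try (Reader r = new XmlCleaningReader(new StringReader(s))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4]; // deliberately tiny to exercise chunking
            for (int n; (n = r.read(buf, 0, buf.length)) != -1; ) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With SAX it would presumably be wired in as parser.parse(new InputSource(new XmlCleaningReader(reader)), handler), where reader decodes the file using its actual encoding. Filtering at the Reader level means the parser never sees the offending characters.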

I use the following regexp that seems to work as expected for the JDK6:

Pattern INVALID_XML_CHARS = Pattern.compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]");
...
INVALID_XML_CHARS.matcher(stringToCleanup).replaceAll("");

In JDK7 and later, the last range, which lies outside of the BMP, can be written as \x{10000}-\x{10FFFF} instead of the \uD800\uDC00-\uDBFF\uDFFF surrogate-pair notation, which is not as simple to understand.
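
As a rough sketch of that JDK7+ variant (the class name XmlCharRegex and the method strip are hypothetical, chosen only for illustration):

```java
import java.util.regex.Pattern;

class XmlCharRegex {
    // Negated class of everything XML 1.0 allows.
    // The \x{...} escape for supplementary code points requires Java 7 or later.
    static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Removes every character that falls outside the allowed ranges. */
    static String strip(String s) {
        return INVALID_XML_CHARS.matcher(s).replaceAll("");
    }
}
```

Because java.util.regex matches by code point, a properly paired surrogate is treated as one supplementary character and kept, while an unpaired surrogate falls outside every allowed range and is removed.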
