Python SAX parser says XML file is not well-formed
I stripped some tags that I thought were unnecessary from an XML file. Now when I try to parse it, my SAX parser throws an error and says my file is not well-formed. However, I know every start tag has an end tag. The file's opening tag has a link to an XML schema. Could this be causing the trouble? If so, then how do I fix it?
Edit: I think I've found the problem. My character data contains "&lt;" and "&gt;" entities, presumably from HTML tags. After being parsed, these are converted to "<" and ">" characters, which seems to bother the SAX parser. Is there any way to prevent this from happening?
Does the SAX parser not give you details about where it thinks the file is not well-formed?
Have you tried loading the file into an XML editor and checking it there? Do other XML parsers accept it?
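As a rough sketch of how to get that location out of Python's built-in SAX parser: the exception it raises carries the line and column of the first fatal error. The helper name `find_wellformedness_error` is made up for this example, and the sample documents are not from the question.

```python
import xml.sax


def find_wellformedness_error(xml_bytes):
    """Return (line, column, message) for the first fatal error, or None."""
    try:
        # A bare ContentHandler ignores all events; we only care about errors.
        xml.sax.parseString(xml_bytes, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getLineNumber(), exc.getColumnNumber(), exc.getMessage()
    return None


if __name__ == "__main__":
    # <b> is never closed before </a>, so this is not well-formed.
    print(find_wellformedness_error(b"<a><b></a>"))
    # Well-formed document: prints None.
    print(find_wellformedness_error(b"<a><b/></a>"))
```

Running the same file through a second parser (for example `xml.dom.minidom.parse`) and comparing the reported positions is a cheap way to tell a document problem from a parser problem.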
The schema shouldn't change whether the XML is well-formed; it may well change whether it's valid. See the Wikipedia entry on XML well-formedness for a little more, or the XML specs for a lot more detail :)
EDIT: To represent "&" in text, you should escape it as &amp;
So:
&lt;
should be
&amp;lt;
(assuming you really want ampersand, l, t).
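A minimal sketch of that escaping rule in Python, assuming you build the XML text yourself and can run the character data through `xml.sax.saxutils.escape` first. `TextCollector` and `roundtrip` are names invented for this example, and the `<note>` element is just a placeholder.

```python
import xml.sax
from xml.sax.saxutils import escape


class TextCollector(xml.sax.ContentHandler):
    """Collects all character data the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # SAX may deliver text in several pieces, so accumulate them.
        self.chunks.append(content)


def roundtrip(text):
    """Escape text, wrap it in an element, parse it, return what SAX saw."""
    doc = "<note>" + escape(text) + "</note>"
    handler = TextCollector()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return "".join(handler.chunks)


if __name__ == "__main__":
    print(escape("&lt;"))                    # the ampersand itself gets escaped
    print(roundtrip("<b>bold</b> & more"))   # parser hands back the original text
```

The parser un-escapes once on the way in, so text that should still read as ampersand, l, t, semicolon after parsing has to be escaped one level deeper in the file.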
I would suggest putting those tags back in and making sure it still works. Then, if you want to take them out, do it one at a time until it breaks.
However, I question the wisdom of taking them out. If it's your XML file, you should understand it better. If it's a third-party XML file, you really shouldn't be fiddling with it (until you understand it better :-).
I would second the recommendation to try parsing it with another XML parser. That should give an indication as to whether it's the document that's wrong, or the parser.
Also, the actual error message might be useful. One fairly common problem, for example, is that the XML declaration (which is optional) must be the very first thing in the file; not even whitespace is allowed before it.
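To check for that particular problem, here is a small sketch, assuming the file fits in memory as bytes. `is_well_formed` is a hypothetical helper and the sample documents are invented; stripping leading whitespace is only one possible fix for files damaged by hand editing.

```python
import xml.sax


def is_well_formed(xml_bytes):
    """Return True if the SAX parser accepts the document without a fatal error."""
    try:
        xml.sax.parseString(xml_bytes, xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


if __name__ == "__main__":
    good = b'<?xml version="1.0"?><root/>'
    bad = b'\n' + good  # a stray newline before the declaration

    print(is_well_formed(good))
    print(is_well_formed(bad))
    # Dropping the leading whitespace makes the declaration come first again.
    print(is_well_formed(bad.lstrip()))
```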