SAX character buffer size
I'm trying to use SAX to parse very large XML files (hundreds of megabytes). The problem is that the parser reads in exactly 2048 characters at a time, and each chunk ends there. As a result, I get a lot of tag values split into two parts in the callback "public void characters(...)". For example, the first part, "2013", is in the character array at position 2044 with length 4, and the second part, "-09-30", is at position 0 with length 6. It should be the single date value "2013-09-30" if received in one piece. How can I avoid this splitting? Can anyone help me?
public void characters(char[] ch, int start, int length) throws SAXException {
    if (Main.errorProceso==0){
        for(int i=0;i < strlista.size();i++){
            if(strlista.get(i).equals(sEtiqueta_actual)){
                if (sEtiqueta_actual.equals("Root.Header.Body.")){
                    String FileNm= String.valueOf(ch, start, length);
                    if (!FileNm.substring(0,2).equalsIgnoreCase("XX")){
                        logger.info("El identificador no es XX");
                        Main.errorProceso=1;
                        i=strlista.size()+1;
                        sEtiqueta_actual="";
                    }
                    else{
                        sCod_Fichero=FileNm.substring(0,2)+XXteFormat.format(XXte);
                    }
                }
                else if (sEtiqueta_actual.equals("Root.Header.Date.")){
                    String aux = String.valueOf(ch, start, length).split("T")[0];
                    try {
                        sFec=newFormat.format(oldFormat.parse(aux));
                    } catch (ParseException e) {
                        logger.error(e.getLocalizedMessage());
                        Main.errorProceso=1;
                    }
                }
                else if (sEtiqueta_actual.equals("Root.Header2.Body2.")){
                    sNum_Total=String.valueOf(ch, start, length);
                }
                else if (sEtiqueta_actual.equals("Root.Header3.Body3.Spcf.Inst.")){
                    sImp =String.valueOf(ch, start, length);
                }
                .
                .
                .
                else if (sEtiqueta_actual.equals("Root.Header3.Body3.Spcf.Req.")){
                    try {
                        sFec2=newFormat.format(oldFormat.parse(String.valueOf(ch, start, length)));
                    } catch (ParseException e) {
                        logger.error(e.getLocalizedMessage());
                        Main.errorProceso=1;
                    }
                }
            }
        }
    }
}
This is just the way SAX parsers work. If you could increase the buffer size (and I don't know how to do that), it wouldn't help; it would only reduce the number of times you get values broken into pieces.
The SAX parser is free to split character data wherever it needs to (see the documentation for ContentHandler.characters). It may do this for efficiency, to avoid using memory, for simplicity of implementation, or for whatever other reason the library developer came up with.
So if you want to get your strings in one piece, you'll need to assemble them yourself. A simple solution, assuming that you never need to accumulate string values with sub-elements:
Add a StringBuffer accumulator to your implementation class, as well as an isAccumulating flag.
In startElement, if the element is of interest, set the isAccumulating flag.
In characters, if the isAccumulating flag is set, append the characters to accumulator.
In endElement, if the isAccumulating flag is set, do whatever you need to do with the accumulated string, and then clear the flag and empty the buffer.
If you might need to collect values with sub-elements, you could change isAccumulating from a flag to an integer depth counter:
startElement increments the counter if it is greater than 0, or sets it to 1 if the element needs to have its value collected.
characters appends the characters if the counter is greater than 0.
endElement decrements the counter if it is greater than 0, and if the result is 0, handles and then clears the accumulator.
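The flag-based approach described above can be sketched roughly as follows. This is only an illustration, not the asker's actual handler: the class name AccumulatingHandler, the wanted set of element names, the values list, and the parse helper are all made up for the example, and StringBuilder is swapped in for StringBuffer since no synchronization is needed here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of selected elements,
// no matter how many characters() calls the parser splits it into.
class AccumulatingHandler extends DefaultHandler {
    private final Set<String> wanted;                 // element names of interest (assumption)
    private final StringBuilder accumulator = new StringBuilder();
    private boolean isAccumulating = false;
    final List<String> values = new ArrayList<>();    // results as "name=text"

    AccumulatingHandler(Set<String> wanted) {
        this.wanted = wanted;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (wanted.contains(qName)) {
            isAccumulating = true;
            accumulator.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text value; just append.
        if (isAccumulating) {
            accumulator.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isAccumulating) {
            // The value is complete only here, so this is where it gets processed.
            values.add(qName + "=" + accumulator);
            isAccumulating = false;
            accumulator.setLength(0);
        }
    }

    // Convenience helper for trying the handler on an in-memory document.
    static List<String> parse(String xml, Set<String> wanted) throws Exception {
        AccumulatingHandler handler = new AccumulatingHandler(wanted);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }
}
```

With this shape, date parsing such as oldFormat.parse(...) would move from characters into endElement, where the whole "2013-09-30" string is available regardless of where the parser's buffer boundary fell.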
Use String.trim() and check String.length() > 0 before proceeding further in the characters() function. Also use a stack to keep track of which tag the character data belongs to. You can then append each chunk to the value for that tag.
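A minimal sketch of the stack idea, under some assumptions: the names PathTrackingHandler and valuesByPath are invented for the example, paths are joined with "." without the trailing dot used in the question, and trimming is applied to the accumulated value in endElement rather than to each chunk, so that whitespace falling on a chunk boundary is not lost. Text that appears before a child element is discarded in this sketch, so it only suits elements that hold either text or children.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: a stack of open element names identifies which
// tag the buffered character data belongs to.
class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();   // open elements, root first
    private final StringBuilder text = new StringBuilder();
    final Map<String, String> valuesByPath = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.addLast(qName);
        text.setLength(0);   // start a fresh buffer for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Chunks of the same value arrive in order; append them all.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (value.length() > 0) {
            // e.g. "Root.Header.Date" -> "2013-09-30"
            valuesByPath.put(String.join(".", path), value);
        }
        path.removeLast();
        text.setLength(0);
    }
}
```

Whitespace-only text between elements (indentation, newlines) ends up as an empty string after trimming and is skipped by the length check.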