Non-Latin characters in XML output

2018-06-12 04:34:21

Edit: I hardcoded the character and used the response writer to write it; it still comes out as K nigsberger

    response.setCharacterEncoding("UTF-8");
    response.setContentType(contentType);
    //if(contentType!=null)response.setHeader("Content-Type",contentType);
    Writer writer = response.getWriter();//new OutputStreamWriter(response.getOutputStream(),"UTF-8");
    System.err.println("character encoding is "+response.getCharacterEncoding());

    writer.write("Königsberger ");
    writer.flush();

Edit: I tried calling setContentType and setCharacterEncoding prior to calling getWriter(); still no difference in output:

        if(res.length()>0){
            //pw.write(res);
            response.setCharacterEncoding("UTF-8");
            response.setContentType(contentType);
            //if(contentType!=null)response.setHeader("Content-Type",contentType);
            Writer writer = response.getWriter();//new OutputStreamWriter(response.getOutputStream(),"UTF-8");
            System.err.println("character encoding is "+response.getCharacterEncoding());


            writer.write(res);
            writer.flush();
        }

I am reading some German characters and then outputting them as XML from a Java servlet. Here's how I read them as UTF-8:

int len=0;
        byte[]buffer=new byte[1024];
        OutputStream os = sock.getOutputStream();
        InputStream is = sock.getInputStream();
        query += "\r\n";
        os.write(query.getBytes("UTF8"));//iso8859_1"));

            do{
                len = is.read(buffer);
             if (len>0) { 
                 if(outstring==null)outstring=new StringBuffer();
                 outstring.append(new String(buffer,0,len, "UTF8"));
             }
           }while(len>0);
System.out.println(outstring);

System.out outputs the string correctly: Königsberger

However, when I pipe this string out through my servlet response, also using charset=UTF-8, it becomes garbled: K nigsberger

private void outputResponse(String res, HttpServletRequest request,
            HttpServletResponse response) throws IOException {
        String outputFormat = getOutputFormat(request);
        String contentType=null;
        PrintWriter pw = response.getWriter();
        //response.setCharacterEncoding("UTF-8");
        System.err.println("output "+res);

        contentType= "text/xml; charset=UTF-8";
        res="<?xml version=\"1.0\" encoding=\"utf-8\"?>" + res;

        if(contentType!=null)response.setHeader("Content-Type",contentType);
        if(res.length()>0){
            pw.write(res);
        }
        pw.flush();

    }

do{
  len = is.read(buffer);
  if (len>0) { 
    if(outstring==null) outstring=new StringBuffer();
    outstring.append(new String(buffer,0,len, "UTF8"));
  }
}while(len>0);

This is not a good way to decode UTF-8, because a multi-byte character can be split across a buffer boundary and corrupted. UTF-8 is a variable-width encoding, so a character takes between one and four bytes to store. If it is working, you are just getting lucky. It is better to encode and decode using the Reader/Writer classes.
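As a rough sketch of the Reader-based approach, the loop could be written along these lines. The class name Utf8ReadSketch and the helper readAll are made up for illustration, and the in-memory stream in main only stands in for sock.getInputStream(); it hands out one byte per read so that the two bytes of ö arrive in separate reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf8ReadSketch {

  // Hypothetical helper: decodes the whole stream as UTF-8.
  // The InputStreamReader holds back an incomplete byte sequence
  // until the rest of the character has been read.
  static String readAll(InputStream in) throws IOException {
    Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
    StringBuilder out = new StringBuilder();
    char[] chars = new char[1024];
    int len;
    while ((len = reader.read(chars)) != -1) {
      out.append(chars, 0, len);
    }
    return out.toString();
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "K\u00F6nigsberger".getBytes(StandardCharsets.UTF_8);

    // Stand-in for the socket stream: returns at most one byte per read,
    // so the bytes c3 and b6 are delivered in two separate reads.
    InputStream oneByteAtATime = new ByteArrayInputStream(data) {
      @Override
      public synchronized int read(byte[] b, int off, int len) {
        return super.read(b, off, Math.min(len, 1));
      }
    };

    String decoded = readAll(oneByteAtATime);
    System.out.println(decoded.equals("K\u00F6nigsberger"));
  }
}
```

With new String(buffer, 0, len, "UTF8") on each chunk, the same one-byte reads would decode c3 and b6 separately and lose the character.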

I believe you need to call either setContentType or setCharacterEncoding prior to calling getWriter, since setCharacterEncoding has no effect once getWriter has been called. I don't think it is enough to call setHeader directly.

This servlet code will correctly encode and transmit the sample string as UTF-8 data:

  @Override
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    response.setContentType("text/xml; charset=UTF-8");
    PrintWriter pw = response.getWriter();
    pw.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
    pw.write("<data>K\u00F6nigsberger</data>");
    pw.flush();
    pw.close();
  }

Note that I am using the escape sequence \u00F6 to emit the character U+00F6 (ö) to ensure that I do not corrupt the character in my text editor or during the compilation process.

Is it possible that the data is being misinterpreted on the client? Check the output with a hex editor.

Encoded as UTF-8, "K\u00F6nigsberger" should become the byte sequence:

4b c3 b6 6e 69 67 73 62 65 72 67 65 72

...where the character U+00F6 (ö) becomes c3 b6. You can use code like this to check your values:

  public static void main(String[] args) throws IOException {
    String konigsberger = "K\u00F6nigsberger";
    dumpHex(System.out, konigsberger.getBytes("UTF-8"));
  }

  private static void dumpHex(PrintStream out, byte[] data) {
    for (byte b : data) {
      out.format("%02x ", b);
    }
    out.println();
  }

You should follow that example and tell the servlet response which encoding the endpoint should use:

response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
ServletOutputStream out =response.getOutputStream();
out.write(output.getBytes("UTF-8"));

You can always use entities like this:

<test>
&#228;
&#252;
&#229;
</test>

to get:

<test>
ä
ü
å
</test>

Maybe not exactly what you want, but a nice workaround. You can use sites like utf8-chartable.de to look up the needed value.
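If you go the entity route, the replacement does not have to be done by hand. Here is a minimal sketch, assuming it is acceptable to turn every character above U+007F into a decimal numeric character reference; the class name XmlEntitySketch and the method escapeNonAscii are invented for this example, and the method does not escape &, < or >.

```java
public class XmlEntitySketch {

  // Hypothetical helper: replaces each code point above U+007F with a
  // decimal numeric character reference and leaves ASCII untouched.
  // It does NOT escape XML markup characters such as & or <.
  static String escapeNonAscii(String text) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (cp > 0x7F) {
        out.append("&#").append(cp).append(';');
      } else {
        out.appendCodePoint(cp);
      }
      // Advance by one or two chars, depending on the code point.
      i += Character.charCount(cp);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // U+00F6 is 246 in decimal.
    System.out.println(escapeNonAscii("K\u00F6nigsberger")); // prints K&#246;nigsberger
  }
}
```

The output is pure ASCII, so it survives even if the response encoding is wrong somewhere along the way.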
