Java app: unable to read an ISO-8859-1 encoded file

I have a file which is encoded as ISO-8859-1 and contains characters such as ô.

I am reading this file with Java code, something like:

File in = new File("myfile.csv");
InputStream fr = new FileInputStream(in);
byte[] buffer = new byte[4096];
while (true) {
    int byteCount = fr.read(buffer, 0, buffer.length);
    if (byteCount <= 0) {
        break;
    }

    String s = new String(buffer, 0, byteCount,"ISO-8859-1");
    System.out.println(s);
}

However the ô character is always garbled, usually printing as a ? .

I have read around the subject (and learnt a little on the way), e.g.:

  • http://www.joelonsoftware.com/articles/Unicode.html
  • http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
  • http://www.ingrid.org/java/i18n/utf-16/

but I still cannot get this working.

    Interestingly, this works on my local PC (Windows XP) but not on my Linux box.

    I have checked that my JDK supports the required charsets (they are standard, so this is no surprise) using:

    System.out.println(java.nio.charset.Charset.availableCharsets());
    

    I suspect that either your file isn't actually encoded as ISO-8859-1, or System.out doesn't know how to print the character.

    I recommend that to check for the first, you examine the relevant byte in the file. To check for the second, examine the relevant character in the string, printing it out with

     System.out.println((int) s.charAt(index));
    

    In both cases the result should be 244 decimal; 0xf4 hex.
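The string-side check can be sketched as a small helper that prints every char of a string as a numeric code unit. The names CodePointDump and dump, and the sample bytes, are made up for illustration. The output is pure ASCII, so it stays readable even if the console cannot display ô.

```java
import java.nio.charset.StandardCharsets;

public class CodePointDump {
    // Returns each UTF-16 code unit of s in U+XXXX form, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: the bytes of "côte" as they would appear in an ISO-8859-1 file.
        byte[] raw = { 'c', (byte) 0xF4, 't', 'e' };
        String s = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(dump(s)); // prints U+0063 U+00F4 U+0074 U+0065
    }
}
```

If the value printed for the ô position is U+00F4, decoding worked and the remaining suspect is the output side.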

    See my article on Unicode debugging for general advice (the code presented is in C#, but it's easy to convert to Java, and the principles are the same).

    In general, by the way, I'd wrap the stream with an InputStreamReader with the right encoding - it's easier than creating new strings "by hand". I realise this may just be demo code though.

    EDIT: Here's a really easy way to prove whether or not the console will work:

     System.out.println("Here's the character: \u00f4");
    

    Parsing the file as fixed-size blocks of bytes is fragile: what if a character's byte representation straddles two blocks? (That cannot happen with single-byte ISO-8859-1, but it can with multi-byte encodings such as UTF-8.) Use an InputStreamReader with the appropriate character encoding instead:

     BufferedReader br = new BufferedReader(
             new InputStreamReader(
             new FileInputStream("myfile.csv"), "ISO-8859-1"));
    
     char[] buffer = new char[4096]; // character (not byte) buffer 
    
     while (true)
     {
          int charCount = br.read(buffer, 0, buffer.length);
    
          if (charCount == -1) break; // reached end-of-stream 
    
          String s = String.valueOf(buffer, 0, charCount);
          // alternatively, we can append to a StringBuilder
    
          System.out.println(s);
     }
    

    Btw, remember to check that the Unicode character can actually be displayed by your console. You could also redirect the program output to a file and then compare it with the original file.

    As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf("%s%n", s) to see if there is a difference (System.console() returns null when no interactive console is attached, so check for that first).
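If the console is the culprit, one option is to stop relying on the platform default and encode the output explicitly. The following is only a sketch: the class name ExplicitOutputEncoding, the helper printWith, and the choice of "ISO-8859-1" as the terminal's encoding are assumptions, so substitute whatever encoding your terminal really uses.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitOutputEncoding {
    // Prints text through a PrintStream that encodes with the named charset
    // and returns the bytes that were produced.
    static byte[] printWith(String text, String charsetName) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charsetName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: the terminal expects ISO-8859-1. Wrapping System.out makes
        // the encoding explicit instead of inheriting the platform default.
        PrintStream out = new PrintStream(System.out, true, "ISO-8859-1");
        out.println("Here's the character: \u00f4");

        // The same character occupies a different number of bytes per encoding.
        System.out.println(printWith("\u00f4", "ISO-8859-1").length); // prints 1
        System.out.println(printWith("\u00f4", "UTF-8").length);      // prints 2
    }
}
```

Redirecting the program's output to a file and inspecting it with a hex viewer then shows exactly which bytes were written, independent of how the terminal renders them.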


    @Joel - your own answer confirms that the problem is a difference between the default encoding on your operating system (UTF-8, the one Java has picked up) and the encoding your terminal is using (ISO-8859-1).

    Consider this code:

    public static void main(String[] args) throws IOException {
        byte[] data = { (byte) 0xF4 };
        String decoded = new String(data, "ISO-8859-1");
        if (!"\u00f4".equals(decoded)) {
            throw new IllegalStateException();
        }
    
        // write default charset
        System.out.println(Charset.defaultCharset());
    
        // dump bytes to stdout
        System.out.write(data);
    
        // will encode to default charset when converting to bytes
        System.out.println(decoded);
    }
    

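    By default, my Ubuntu (8.04) terminal uses the UTF-8 encoding. With this encoding, this is printed (a lone 0xF4 byte is not valid UTF-8, so the terminal shows a placeholder such as ? in its place):

    UTF-8
    ?ô

    If I switch the terminal's encoding to ISO 8859-1, this is printed (the raw byte now displays correctly, while the two UTF-8 bytes show up as two separate characters):

    UTF-8
    ôÃ´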

    In both cases, the same bytes are being emitted by the Java program:

    5554 462d 380a f4c3 b40a
    

    The only difference is in how the terminal interprets the bytes it receives. In ISO 8859-1, ô is encoded as the single byte 0xF4. In UTF-8, ô is encoded as the two bytes 0xC3 0xB4. The other characters are encoded identically in both.
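The two encodings of ô can be checked directly from Java without involving the terminal at all. This is a minimal sketch; the class name EncodedBytes and the helper hex are invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class EncodedBytes {
    // Formats bytes as two-digit lowercase hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00f4"; // the character under discussion
        System.out.println("ISO-8859-1: " + hex(s.getBytes(StandardCharsets.ISO_8859_1))); // prints ISO-8859-1: f4
        System.out.println("UTF-8: " + hex(s.getBytes(StandardCharsets.UTF_8)));           // prints UTF-8: c3 b4
    }
}
```

Because the printed text is plain ASCII hex, it reads the same whichever encoding the terminal is set to.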
