Specific XML data garbled

2018-06-21 15:38:11

I'm using the RSS feed from kat.cr for a personal project. I've tried to read the feed using the Rome framework and have run into a significant problem.

Every other feed I tried worked perfectly fine with Rome (and with other, more basic ways of reading a feed); however, the following feed kept throwing character-encoding-related exceptions:

https://kat.cr/usearch/Arrow%20S04E21/?field=seeders&sorder=desc&rss=1

I then created the following method to see what the received data looked like:

public static void saveXML(String url) throws IOException {
    Client client = ClientBuilder.newClient();
    Response r = client.target(url).request(MediaType.TEXT_PLAIN_TYPE).get();

    PrintWriter out = new PrintWriter("XML.txt");
    String sXML = r.readEntity(String.class);
    out.print(sXML);
    out.close();
}

The above mentioned feed results in garbled data while all other feeds show up perfectly. Why is it that it shows up perfectly in any browser even when the charset is forced to UTF-8?

I've looked at the 'XML.txt' file in Hexplorer and noticed UTF-8 encoding byte sequences throughout the file.

I'm thoroughly lost; any help would be GREATLY appreciated.

The content you are receiving is compressed using the GZip format.

I was going to write a better answer with a way to solve your problem, but your method returns a String, and by that point the raw bytes from the server have likely already been altered, so a later conversion will not work. I know nothing about the Rome framework or how to make it return bytes or decompress the response for you. But assuming you do have the compressed gzip bytes, you can do:

public static String decompress(byte [] data) throws IOException {
    try (
        GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ) {

        int read;
        byte [] buff = new byte[1024];
        while((read = gis.read(buff)) != -1) {
            out.write(buff, 0, read);
        }

        return out.toString("UTF-8");
    }
}

You could try this with

String sXML = r.readEntity(String.class);
return decompress(sXML.getBytes());

However I would be surprised if it worked. Maybe you can do

byte[] raw = r.readEntity(byte[].class);
return decompress(raw);

But again I have no idea how the Rome framework does things.
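
To illustrate why the String-based attempt is unlikely to work, here is a minimal sketch using only the JDK. The class and method names (`StringRoundTripDemo`, `gzip`, `survivesStringRoundTrip`) are made up for this example, and the sample XML simply stands in for the server's response. It compresses some text and checks whether the gzip bytes come back unchanged after being decoded to a String and re-encoded as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPOutputStream;

final class StringRoundTripDemo {

    // Compress text to gzip bytes; this stands in for a gzip-encoded response body.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    // True if the bytes are unchanged after bytes -> String -> bytes in UTF-8.
    static boolean survivesStringRoundTrip(byte[] data) {
        String asText = new String(data, StandardCharsets.UTF_8);
        byte[] back = asText.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("<rss><title>Arrow</title></rss>");
        System.out.println("gzip bytes intact after String round trip: "
                + survivesStringRoundTrip(compressed));
    }
}
```

The second byte of every gzip stream is 0x8B, which is not a valid start of a UTF-8 sequence, so the decoder substitutes a replacement character and the original byte is lost. That is why the response has to be read as `byte[]` (or as a stream) before decompressing.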

Edit:

You could also look for the GZIP file signature. I took the file signature from this website - http://www.garykessler.net/library/file_sigs.html - but you can look it up in many places. Assuming you have the bytes from the response, you could do something like:

byte[] raw = r.readEntity(byte[].class);
// check for gzip encoding using signature
if(raw.length > 3 &&
   raw[0] == (byte)0x1F &&
   raw[1] == (byte)0x8B &&
   raw[2] == (byte)0x08) {
    // Is gzip encoded, decode it. decompress() already returns a UTF-8 decoded String.
    return decompress(raw);
} else {
    // Not gzip, treat the bytes as plain UTF-8 text.
    return new String(raw, "UTF-8");
}

I would advocate trying to make the Rome library take care of this itself, but if all else fails this would be one way to do it.
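
If the response is available as an InputStream rather than a byte array, the same signature check can be done on the stream. The following is a rough sketch using only the JDK, not anything Rome provides; `GzipSniffer`, `unwrapIfGzip`, `readAll` and `gzip` are hypothetical names chosen for this example. It peeks at the first two bytes, pushes them back, and wraps the stream in a GZIPInputStream only when the gzip magic number is present.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSniffer {

    // Returns a decompressing stream if the input starts with 0x1F 0x8B,
    // otherwise returns a stream over the original, unmodified bytes.
    static InputStream unwrapIfGzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] head = new byte[2];
        int n = 0;
        while (n < 2) {                       // read up to two bytes
            int r = in.read(head, n, 2 - n);
            if (r < 0) break;                 // stream shorter than two bytes
            n += r;
        }
        if (n > 0) in.unread(head, 0, n);     // put the peeked bytes back
        boolean isGzip = n == 2
                && (head[0] & 0xFF) == 0x1F
                && (head[1] & 0xFF) == 0x8B;
        return isGzip ? new GZIPInputStream(in) : in;
    }

    // Drains a stream and decodes it as UTF-8 text.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buff = new byte[1024];
        int read;
        while ((read = in.read(buff)) != -1) {
            out.write(buff, 0, read);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Test helper: gzip-compress a string, standing in for the server response.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String xml = "<rss><title>Arrow</title></rss>";
        InputStream compressed = new ByteArrayInputStream(gzip(xml));
        InputStream plain = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(unwrapIfGzip(compressed)));
        System.out.println(readAll(unwrapIfGzip(plain)));
    }
}
```

The stream returned by `unwrapIfGzip` could then be handed to whatever parser accepts an InputStream, so the XML never has to pass through an intermediate String.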
