How to properly encode this URL

I am trying to fetch this URL using Jsoup:

http://betatruebaonline.com/img/parte/330/CIGUEÑAL.JPG

Even after encoding the file name, I get an exception, and I don't understand why the encoding is wrong. It returns

http://betatruebaonline.com/img/parte/330/CIGUE%C3%91AL.JPG

instead of the correct

http://betatruebaonline.com/img/parte/330/CIGUEN%CC%83AL.JPG

How can I fix this? Thanks.

private static void GetUrl()
{
    try
    {
        String url = "http://betatruebaonline.com/img/parte/330/";
        String encoded = URLEncoder.encode("CIGUEÑAL.JPG","UTF-8");
        Response img = Jsoup
                            .connect(url + encoded)
                            .ignoreContentType(true)
                            .execute();

        System.out.println(url);
        System.out.println("PASSED");
    }
    catch(Exception e)
    {
        System.out.println("Error getting url");
        System.out.println(e.getMessage());
    }
}

The encoding is not wrong. The issue is that the character "Ñ" can be represented in Unicode in two ways, precomposed and composite (decomposed). They look the same when displayed, but they are different code point sequences:

precomposed unicode: Ñ (U+00D1)                        -> %C3%91
composite unicode: N and combining ~ (U+004E U+0303)   -> N%CC%83

I emphasize that BOTH ARE CORRECT; it depends on which Unicode form you want:

String normalize = Normalizer.normalize("Ñ", Normalizer.Form.NFD);
System.out.println(URLEncoder.encode("Ñ", "UTF-8")); //%C3%91
System.out.println(URLEncoder.encode(normalize, "UTF-8")); //N%CC%83
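To see that the two forms really are different strings, a small sketch like the following can help. The class and method names (`UnicodeForms`, `codePoints`) are made up for illustration; only `java.text.Normalizer` is the actual JDK API.

```java
import java.text.Normalizer;

// Illustrative helper (hypothetical name): inspects the two Unicode forms of "Ñ".
class UnicodeForms {

    // Formats each code point of s as U+XXXX, separated by single spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u00D1"; // Ñ as a single code point
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);

        System.out.println(codePoints(precomposed)); // U+00D1
        System.out.println(codePoints(decomposed));  // U+004E U+0303

        // The strings render identically but are not equal until normalized.
        System.out.println(precomposed.equals(decomposed)); // false
        System.out.println(precomposed.equals(
                Normalizer.normalize(decomposed, Normalizer.Form.NFC))); // true
    }
}
```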

What happens here?

As stated by @yelliver, the web server seems to use NFD-normalized Unicode in its path names. So the solution is to use the same normalization form when building the request URL.
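Applied to the file name from the question, that could look like the sketch below: normalize to NFD first, then percent-encode. `NfdUrlEncoding` and `encodeNfd` are hypothetical names, and the assumption that the server stores this file name in NFD is taken from the working URL shown in the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.text.Normalizer;

// Sketch only: class and method names are illustrative, not from the question.
class NfdUrlEncoding {

    // Decomposes the text to NFD, then percent-encodes it as UTF-8.
    static String encodeNfd(String segment) {
        String decomposed = Normalizer.normalize(segment, Normalizer.Form.NFD);
        try {
            return URLEncoder.encode(decomposed, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://betatruebaonline.com/img/parte/330/";
        String file = "CIGUE\u00D1AL.JPG"; // precomposed Ñ, as typed in source code

        System.out.println(base + encodeNfd(file));
        // http://betatruebaonline.com/img/parte/330/CIGUEN%CC%83AL.JPG
    }
}
```

The resulting string can be passed to `Jsoup.connect(...)` in place of `url + encoded` in the question's code. Note that `URLEncoder` performs form encoding, so a space would become `+`; this sketch is only suitable for a single path segment without spaces, such as this file name.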

Is the web server doing the right thing?

1. For those who are curious (like me), this article on Multilingual Web Addresses sheds some light on the subject. In the section on IRI paths (the part that is actually handled by the web server), it states:

Whereas the domain registration authorities can all agree to accept domain names in a particular form and encoding (ASCII-based punycode), multi-script path names identify resources located on many kinds of platforms, whose file systems do and will continue to use many different encodings. This makes the path much more difficult to handle than the domain name.

2. More on how to encode paths can be found in Section 5.3.2.2 of the IETF Proposed Standard on Internationalized Resource Identifiers (IRIs), RFC 3987. It says:

Equivalence of IRIs MUST rely on the assumption that IRIs are appropriately pre-character-normalized rather than apply character normalization when comparing two IRIs. The exceptions are conversion from a non-digital form, and conversion from a non-UCS-based character encoding to a UCS-based character encoding. In these cases, NFC or a normalizing transcoder using NFC MUST be used for interoperability. To avoid false negatives and problems with transcoding, IRIs SHOULD be created by using NFC. Using NFKC may avoid even more problems; for example, by choosing half-width Latin letters instead of full-width ones, and full-width instead of half-width Katakana.

3. The Unicode Consortium states:

NFKC is the preferred form for identifiers, especially where there are security concerns (see UTR #36). NFD and NFKD are most useful for internal processing.
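The difference between NFC and NFKC mentioned in these quotes can be illustrated with the full-width and half-width characters the RFC refers to. This is a minimal sketch; `CompatibilityForms`, `nfc` and `nfkc` are illustrative names, and the two sample characters are my own choice.

```java
import java.text.Normalizer;

// Sketch only: shows that NFKC folds compatibility variants, while NFC does not.
class CompatibilityForms {

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String fullWidthA = "\uFF21";  // FULLWIDTH LATIN CAPITAL LETTER A
        String halfWidthKa = "\uFF76"; // HALFWIDTH KATAKANA LETTER KA

        // NFC leaves the full-width letter untouched.
        System.out.println(nfc(fullWidthA).equals(fullWidthA)); // true

        // NFKC maps it to the ordinary Latin letter.
        System.out.println(nfkc(fullWidthA).equals("A")); // true

        // NFKC maps half-width Katakana to the regular (full-width) form.
        System.out.println(nfkc(halfWidthKa).equals("\u30AB")); // true
    }
}
```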

Conclusion

The web server mentioned in the question does not follow the recommendations of the IRI standard or the Unicode Consortium: it uses NFD instead of NFC or NFKC. One way to correctly encode a URL string is as follows:

URI uri = new URI(url.getProtocol(), url.getUserInfo(), IDN.toASCII(url.getHost()), url.getPort(), url.getPath(), url.getQuery(), url.getRef());

Then convert that URI to an ASCII string:

String correctEncodedURL=uri.toASCIIString(); 

toASCIIString() calls encode(), which normalizes the string to NFC before percent-encoding it. IDN.toASCII() converts the host name to Punycode.
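Put together, the two steps might look like the following sketch. `AsciiUrl` and `toAsciiUrl` are hypothetical names, and the input is assumed to be a URL string that `java.net.URL` can parse. Because of the NFC normalization, the result contains `%C3%91` even when the input is decomposed, so this produces the standard-recommended form rather than the NFD form the server in the question expects.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Sketch only: wraps the URL -> URI -> ASCII conversion in one helper.
class AsciiUrl {

    static String toAsciiUrl(String spec) {
        try {
            URL url = new URL(spec);
            // The multi-argument constructor quotes illegal characters;
            // IDN.toASCII converts an internationalized host name to Punycode.
            URI uri = new URI(url.getProtocol(), url.getUserInfo(),
                    IDN.toASCII(url.getHost()), url.getPort(),
                    url.getPath(), url.getQuery(), url.getRef());
            // Percent-encodes the remaining non-ASCII characters as UTF-8.
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        String base = "http://betatruebaonline.com/img/parte/330/";

        System.out.println(toAsciiUrl(base + "CIGUE\u00D1AL.JPG"));
        // http://betatruebaonline.com/img/parte/330/CIGUE%C3%91AL.JPG

        // A decomposed input (N + combining tilde) gives the same NFC result.
        System.out.println(toAsciiUrl(base + "CIGUEN\u0303AL.JPG"));
        // http://betatruebaonline.com/img/parte/330/CIGUE%C3%91AL.JPG
    }
}
```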


A very simple solution: the encoding the system provides differs from the one you need, so the following workaround will help you.

private static void GetUrl(String url)
{
    try
    {

        String encodedurl = url.replace("Ñ","N%CC%83");
        Response img = Jsoup
                            .connect(encodedurl)
                            .ignoreContentType(true)
                            .execute();

        System.out.println(url);
        System.out.println("PASSED");
    }
    catch(Exception e)
    {
        System.out.println("Error getting url");
        System.out.println(e.getMessage());
    }
}