Why does modern Perl avoid UTF-8 by default?

2018-06-21 02:01:44

I wonder why most modern solutions built using Perl don't enable UTF-8 by default.

I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with a big perspective) should make their software UTF-8 proof from scratch. Still I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but no UTF-8 handling.

Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?

My comment to @tchrist got too long, so I'm adding it here.

It seems that I did not make myself clear. Let me try to add some things.

tchrist and I see the situation pretty similarly, but our conclusions are at completely opposite ends. I agree, the situation with Unicode is complicated, but this is why we (Perl users and coders) need some layer (or pragma) which makes UTF-8 handling as easy as it must be nowadays.

tchrist pointed to many aspects to cover; I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is not one single way "to enable UTF-8". I don't have enough knowledge to argue with that. So, I stick to live examples.

I played around with Rakudo and UTF-8 was just there as I needed. I didn't have any problems, it just worked. Maybe there are some limitations somewhere deeper, but at the start, everything I tested worked as I expected.

Shouldn't that be a goal in modern Perl 5 too? I stress it more: I'm not suggesting UTF-8 as the default character set for core Perl, I suggest the possibility to trigger it with a snap for those who develop new projects.

Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I did not find how and where to hook in Unicode support. It was so time-consuming that I found it easier to go the old way. Now I saw here that there was a bounty to deal with the same problem in Mason 2: "How to make Mason2 UTF-8 clean?". So, it is a pretty new framework, but using it with UTF-8 needs deep knowledge of its internals. It is like a big red sign: STOP, don't use me!

I really like Perl. But dealing with Unicode is painful. I still find myself running against walls. In some way tchrist is right and answers my question: new projects don't adopt UTF-8 because it is too complicated in Perl 5.

There are two stages to processing Unicode text. The first is "how can I input it and output it without losing information". The second is "how do I treat text according to local language conventions".

tchrist's post covers both, but the second part is where 99% of the text in his post comes from. Most programs don't even handle I/O correctly, so it's important to understand that before you even begin to worry about normalization and collation.

This post aims to solve that first problem.

When you read data into Perl, it doesn't care what encoding it is. It allocates some memory and stashes the bytes away there. If you say print $str, it just blits those bytes out to your terminal, which is probably set to assume everything that is written to it is UTF-8, and your text shows up.

Marvelous.

Except, it's not. If you try to treat the data as text, you'll see that Something Bad is happening. You need go no further than length to see that what Perl thinks about your string and what you think about your string disagree. Write a one-liner like: perl -E 'while(<>){ chomp; say length }' and type in 文字化け and you get 12... not the correct answer, 4.

That's because Perl assumes your string is not text. You have to tell it that it's text before it will give you the right answer.

That's easy enough; the Encode module has the functions to do that. The generic entry point is Encode::decode (or use Encode qw(decode), of course). That function takes some string from the outside world (what we'll call "octets", a fancy way of saying "8-bit bytes"), and turns it into some text that Perl will understand. The first argument is a character encoding name, like "UTF-8" or "ASCII" or "EUC-JP". The second argument is the string. The return value is the Perl scalar containing the text.

(There is also Encode::decode_utf8, which assumes UTF-8 for the encoding.)

If we rewrite our one-liner:

perl -MEncode=decode -E 'while(<>){ chomp; say length decode("UTF-8", $_) }'

We type in 文字化け and get "4" as the result. Success.

That, right there, is the solution to 99% of Unicode problems in Perl.
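The same comparison can be written as a small script instead of a one-liner. This is only a sketch: the octets for 文字化け are spelled out as hex escapes (my own choice, so the example does not depend on your terminal or on how the file is saved), and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for 文字化け, as they would arrive from a file or terminal.
my $octets = "\xE6\x96\x87\xE5\xAD\x97\xE5\x8C\x96\xE3\x81\x91";

# Tell Perl that these octets are UTF-8 encoded text.
my $text = decode('UTF-8', $octets);

my $octet_count = length $octets;   # length of the undecoded string
my $char_count  = length $text;     # length of the decoded string

print "octets: $octet_count, characters: $char_count\n";
```

This should print "octets: 12, characters: 4", matching the two one-liners above.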

The key is, whenever any text comes into your program, you must decode it. The Internet cannot transmit characters. Files cannot store characters. There are no characters in your database. There are only octets, and you can't treat octets as characters in Perl. You must decode the encoded octets into Perl characters with the Encode module.

The other half of the problem is getting data out of your program. That's easy too; you just say use Encode qw(encode), decide what encoding your data will be in (UTF-8 to terminals that understand UTF-8, UTF-16 for files on Windows, etc.), and then output the result of encode($encoding, $data) instead of just outputting $data.

This operation converts Perl's characters, which is what your program operates on, to octets that can be used by the outside world. It would be a lot easier if we could just send characters over the Internet or to our terminals, but we can't: octets only. So we have to convert characters to octets, otherwise the results are undefined.

To summarize: encode all outputs and decode all inputs.
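As a rough illustration of that rule, here is a sketch of a function that decodes at the input edge, works on characters in the middle, and encodes at the output edge. The helper name reverse_line and the sample input are hypothetical, and UTF-8 on both sides is an assumption; substitute whatever encodings your data really uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 octets, returns UTF-8 octets.
sub reverse_line {
    my ($octets_in) = @_;

    # Input boundary: octets -> Perl characters.
    my $text = decode('UTF-8', $octets_in);

    # Work on characters. Scalar assignment makes reverse() reverse
    # the string rather than a list.
    my $reversed = reverse $text;

    # Output boundary: Perl characters -> octets.
    return encode('UTF-8', $reversed);
}

# "naïve" as UTF-8 octets; the ï takes two octets (C3 AF).
my $result = reverse_line("na\xC3\xAFve");
print $result, "\n";
```

If you skipped the decode and reversed the octets directly, the two octets of the ï would swap places and the output would no longer be valid UTF-8.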

Now we'll talk about three issues that make this a little challenging. The first is libraries. Do they handle text correctly? The answer is... they try. If you download a web page, LWP will give you your result back as text, provided you call the right method on the result (that happens to be decoded_content, not content, which is just the octet stream that it got from the server). Database drivers can be flaky; if you use DBD::SQLite with just Perl, it will work out, but if some other tool has put text stored as some encoding other than UTF-8 in your database... well... it's not going to be handled correctly until you write code to handle it correctly.

Outputting data is usually easier, but if you see "wide character in print", then you know you're messing up the encoding somewhere. That warning means "hey, you're trying to leak Perl characters to the outside world and that doesn't make any sense". Your program appears to work (because the other end usually handles the raw Perl characters correctly), but it is very broken and could stop working at any moment. Fix it with an explicit Encode::encode!
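To see the warning and the fix side by side, here is a small sketch. It prints to in-memory filehandles and collects warnings with a __WARN__ handler; both are just my way of keeping the example self-contained, and the variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string containing U+263A, which does not fit in one octet.
my $text = "caf\x{E9} \x{263A}";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Wrong: characters leak to a handle that expects octets.
open my $raw, '>', \my $leaked or die "open failed: $!";
print {$raw} $text;
close $raw;

# Right: encode explicitly before printing.
open my $ok, '>', \my $encoded or die "open failed: $!";
print {$ok} encode('UTF-8', $text);
close $ok;

print "warnings seen: ", scalar(@warnings), "\n";
print "first warning: $warnings[0]" if @warnings;
```

Only the first print should trigger "Wide character in print"; the encoded one goes out quietly as plain octets.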

The second problem is UTF-8 encoded source code. Unless you say use utf8 at the top of each file, Perl will not assume that your source code is UTF-8. This means that each time you say something like my $var = 'ほげ', you're injecting garbage into your program that will totally break everything horribly. You don't have to "use utf8", but if you don't, you must not use any non-ASCII characters in your program.
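A quick way to see the difference is to measure the same literal with the pragma on and off. This sketch assumes the script itself is saved as UTF-8; the inner block uses no utf8, which switches the pragma off for that lexical scope only.

```perl
use strict;
use warnings;
use utf8;   # string literals in this file are decoded as UTF-8

# With the pragma, the literal is two characters.
my $with_pragma = length 'ほげ';

my $without_pragma;
{
    no utf8;   # lexically scoped: the literal below is now raw octets
    $without_pragma = length 'ほげ';
}

print "with use utf8: $with_pragma, without: $without_pragma\n";
```

Each of the two kana takes three octets in UTF-8, so this should report 2 with the pragma and 6 without it.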

The third problem is how Perl handles The Past. A long time ago, there was no such thing as Unicode, and Perl assumed that everything was Latin-1 text or binary. So when data comes into your program and you start treating it as text, Perl treats each octet as a Latin-1 character. That's why, when we asked for the length of "文字化け", we got 12. Perl assumed that we were operating on the Latin-1 string "æååã" (which is 12 characters, some of which are non-printing).

This is called an "implicit upgrade", and it's a perfectly reasonable thing to do, but it's not what you want if your text is not Latin-1. That's why it's critical to explicitly decode input: if you don't do it, Perl will, and it might do it wrong.
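You can check the Latin-1 assumption directly. In this sketch (variable names are mine), the undecoded octets compare equal to the result of decoding them as ISO-8859-1, which is another way of saying that Perl already treats each octet as the Latin-1 character with the same number.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for 文字化け, never decoded.
my $octets = "\xE6\x96\x87\xE5\xAD\x97\xE5\x8C\x96\xE3\x81\x91";

# The same octets, explicitly (and wrongly) decoded as Latin-1.
my $as_latin1 = decode('ISO-8859-1', $octets);

my $same  = ($octets eq $as_latin1) ? 1 : 0;
my $first = sprintf 'U+%04X', ord $octets;   # code point of the first "character"

printf "%d characters, first is %s, same as Latin-1 decoding: %s\n",
    length($as_latin1), $first, $same ? 'yes' : 'no';
```

The first "character" comes out as U+00E6 (æ), the start of the 12-character Latin-1 string mentioned above.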

People run into trouble where half their data is a proper character string, and some is still binary. Perl will interpret the part that's still binary as though it's Latin-1 text and then combine it with the correct character data. This will make it look like handling your characters correctly broke your program, but in reality, you just haven't fixed it enough.
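Here is a sketch of that half-fixed state, with made-up variable names. One piece of data has been decoded, the other is still raw UTF-8 octets, and the two are joined before being encoded for output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $text   = decode('UTF-8', "\xE2\x98\xBA");  # properly decoded: U+263A
my $octets = "caf\xC3\xA9";                    # UTF-8 octets for "café", never decoded

# Joining them makes Perl treat each octet of $octets as a Latin-1 character.
my $mixed  = "$text $octets";
my $broken = encode('UTF-8', $mixed);

# Decoding the remaining input first gives the intended result.
my $fixed = encode('UTF-8', $text . ' ' . decode('UTF-8', $octets));

# Show both results as hex so the difference is visible on any terminal.
printf "broken: %s\n", join ' ', map { sprintf '%02X', ord } split //, $broken;
printf "fixed:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $fixed;
```

In the broken output, the é shows up as C3 83 C2 A9 instead of C3 A9: the classic double-encoded "Ã©". The decoded half is fine; the half you forgot to decode is what got mangled.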

Here's an example: you have a program that reads a UTF-8-encoded text file, you tack on a Unicode PILE OF POO to each line, and you print it out. You write it like:

while(<>){
    chomp;
    say "$_ 💩";
}