How to avoid inadvertent re-encoding of UTF-8 files as ANSI
In the process of editing a file encoded as UTF-8 without BOM, the content may end up containing no characters outside the ASCII/ANSI range. At the next reopening of the file, some text editors (e.g. Notepad++) will interpret it as ASCII/ANSI encoded and open it as such. Unaware of the change, the user continues editing and adds non-ANSI Unicode characters, which are rendered useless because the file is saved as ANSI. A menu option may exist (Notepad++) to open ANSI files as UTF-8 without BOM, but that leads to the reverse issue of inadvertently saving genuine ANSI files with a Unicode encoding.
One workaround is to add a character outside the ANSI range to a comment in the file. Depending on the decoding algorithm, it might force the editor (Notepad++) to recognize the file as encoded in UTF-8 w/o BOM.
In an HTML document, for example, you could follow the charset declaration in the header with such a Unicode comment, here the U+05D0 HEBREW LETTER ALEF: <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <!-- א -->
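To illustrate why this trick works, here is a minimal sketch in Python of the kind of heuristic an encoding-guessing editor might apply; the `detect` helper is hypothetical and not taken from any editor's actual source. A file with no bytes at or above 0x80 is identical in ASCII, ANSI and UTF-8, so it gets reported as ASCII/ANSI, while a single multi-byte UTF-8 sequence (such as the Alef in the comment above) tips the guess towards UTF-8 without BOM.

```python
def detect(raw: bytes) -> str:
    """Rough encoding guess, mimicking a byte-sniffing editor (sketch only)."""
    if raw.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 with BOM"
    if all(b < 0x80 for b in raw):
        # Pure 7-bit content: ASCII, ANSI and UTF-8 are indistinguishable here.
        return "ASCII/ANSI (indistinguishable from UTF-8)"
    try:
        raw.decode("utf-8")
        return "UTF-8 without BOM"
    except UnicodeDecodeError:
        return "ANSI (some single-byte code page)"

print(detect("plain ASCII only".encode("utf-8")))  # ASCII/ANSI (indistinguishable ...)
print(detect("<!-- א -->".encode("utf-8")))        # UTF-8 without BOM
```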
How would you suggest that an editor tell the difference between ASCII/ANSI and UTF-8 w/o BOM, when the files look the same?
If you want guaranteed recognition of UTF-8 as UTF-8, either add the BOM, or make sure the file contains at least one character outside the ASCII range.
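For the BOM route, a short sketch using Python's built-in `utf-8-sig` codec (the file name is just an example): it prepends the three-byte BOM EF BB BF, so any BOM-aware editor identifies the file as UTF-8 even when the text itself is pure ASCII.

```python
# Write a file with a UTF-8 BOM, then verify the first three bytes.
with open("readme.txt", "w", encoding="utf-8-sig") as f:
    f.write("only ASCII characters here\n")

with open("readme.txt", "rb") as f:
    print(f.read()[:3].hex())  # prints 'efbbbf'
```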
Configure your editor to always use UTF-8 if possible; if it can't, complain to the creators of your editor. Charsets not targeting Unicode are, IMO, deprecated and should be treated as such.
Files using only characters in the 7-bit ASCII range are byte-for-byte identical in UTF-8 anyway, so if you HAVE to deliver something in ASCII encoding, just don't type any non-ASCII Unicode characters.
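That identity is easy to confirm (a quick check, not specific to any editor):

```python
# ASCII-only text encodes to the same bytes under ASCII and UTF-8,
# which is exactly why editors cannot tell the two apart.
text = "Hello, world!"
assert text.encode("ascii") == text.encode("utf-8")
```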