How do Unicode characters get encoded in ASCII?
I'm trying to figure out how non-ASCII characters get saved in ASCII files. For example, if I open Notepad++, set the encoding to UTF-8, and then write שלום, it saves 11 bytes: 3 for the BOM and 2 for each character. (I added | before and after each byte.)
|239||187||191||215||169||215||156||215||149||215||157|
I can look up these values and figure out which letter they refer to, e.g. at http://utf8-chartable.de/unicode-utf8-table.pl?start=1408&number=128&utf8=dec
If I open a new file, set the encoding to ASCII, and write the same word, it saves 4 bytes:
|249||236||229||237|
If I open the ASCII file, it correctly shows me the Hebrew word that I typed. How does it know? Is there a reference table similar to the one for Unicode?
The Hebrew characters you have shown are the Unicode code points U+05E9, U+05DC, U+05D5, and U+05DD. Those code points cannot fit in ASCII; their values are too large. The only way they could be saved to the file as the bytes 0xF9 0xEC 0xE5 0xED (respectively) is if they are being encoded using ISO-8859-8 (Windows code page 28598) or the closely related Windows-1255 code page, which puts the Hebrew letters at the same byte values. And the only way such a file would be displayed correctly is if it is interpreted using that same charset. If you are not doing anything special to tell the OS to use that specific charset for that file, then your OS must be set to use Hebrew as its default language, and that charset is its default for handling ANSI (not ASCII) data.
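To check this against the byte values in the question, here is a small Python sketch. It assumes Python's built-in codecs `utf-8-sig`, `cp1255`, and `iso8859_8` are reasonable stand-ins for what Notepad++ writes; the variable names are only illustrative.

```python
# The word from the question, written with escapes so the source stays ASCII.
word = "\u05e9\u05dc\u05d5\u05dd"  # shin, lamed, vav, final mem

# UTF-8 with a byte order mark: 3 BOM bytes + 2 bytes per Hebrew letter.
utf8_with_bom = word.encode("utf-8-sig")

# The single-byte Hebrew charsets: 1 byte per letter.
cp1255_bytes = word.encode("cp1255")
iso_bytes = word.encode("iso8859_8")

print(list(utf8_with_bom))  # [239, 187, 191, 215, 169, 215, 156, 215, 149, 215, 157]
print(list(cp1255_bytes))   # [249, 236, 229, 237]
print(cp1255_bytes == iso_bytes)  # True: both charsets agree on these letters
```

Both outputs match the byte dumps in the question, which supports the idea that the "ASCII" file is really stored in a Hebrew single-byte code page.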
Only the Unicode characters U+0000 to U+007F can be encoded in ASCII, and the mapping is trivial: each code point becomes the byte with the same value.
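A quick way to see that limit in Python (a sketch; `fits_in_ascii` is a hypothetical helper name, not a library function):

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every code point is in the ASCII range U+0000..U+007F."""
    return all(ord(ch) <= 0x7F for ch in text)

hebrew = "\u05e9\u05dc\u05d5\u05dd"

print(fits_in_ascii("shalom"))  # True
print(fits_in_ascii(hebrew))    # False

# A strict ASCII encoder refuses the Hebrew word outright.
try:
    hebrew.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(ascii_failed)  # True
```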
Notepad++ does not have ASCII as an encoding. Instead, it has “ANSI”, which is a misnomer for a collection of encodings, typically 8-bit ones; which one you get depends on the system settings. Simply do not use them. Use UTF-8 instead.
What probably happens in your case is that in your environment “ANSI” means an 8-bit Latin/Hebrew encoding, in which the byte values outside the ASCII range denote Hebrew letters. This works up to a point, but not across systems and programs.
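The "not across systems" part can be illustrated by decoding the same four bytes under different charsets. This is only a sketch: `cp1252` and `cp1251` are chosen here as examples of what "ANSI" could mean on a Western European or a Cyrillic Windows installation.

```python
raw = bytes([249, 236, 229, 237])  # the 4 bytes from the "ASCII" file

as_hebrew = raw.decode("cp1255")    # the Hebrew word, as intended
as_western = raw.decode("cp1252")   # accented Latin letters instead
as_cyrillic = raw.decode("cp1251")  # Cyrillic letters instead

print(as_hebrew, as_western, as_cyrillic)

# The bytes are not valid UTF-8 either, so a UTF-8 reader rejects them.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(utf8_ok)  # False
```

The file itself carries no marker saying which code page was used, so the reader has to guess or be told, whereas the UTF-8 bytes mean the same thing everywhere.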