Character set and its encoding?

Quite a few concepts related to character set are mentioned in the standard: basic source character set, basic execution character set, basic execution wide-character set, execution character set, and execution wide-character set:

  • Basic source character set: 91 graphical characters, plus the space character, HT, VT, FF, and LF (just borrowing the name abbreviations from ASCII).
  • Basic execution (wide-)character set: all members of basic source character set, plus BEL, BS, CR, (wide-)NUL.
  • The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.
  • I don't have many questions about the basic source character set, the basic execution character set, or the basic execution wide-character set.

    As for the execution character set, the standard says it's implementation-defined and locale-specific, so I tried to get a real sense of it by observing the byte contents of a string-literal-initialized char array, whose element values should equal the numerical values of the encoding of the characters in the execution character set (a universal-character-name may map to more than one char element due to multibyte encoding):

    char str[] = "Greek lowercase alpha is: \u03B1.";
    

    It seems that it's almost always UTF-8 on Linux (CE B1 is stored in the array for that Greek letter). On Windows, it's Windows-1252 if the system locale is English (the substitute value 3F, i.e. '?', is stored, since Greek is not available in Windows-1252), and some other encoding for other locales (e.g. A6 C1 in cp936 for a Chinese locale, E1 in Windows-1253 for a Greek locale, representing Greek lowercase alpha in those two encodings respectively). For all those cases where the Greek letter is available in the locale (and thus in the execution character set), cout << str; prints the Greek letter appropriately. All seems alright.

    But for execution wide-character set, I don't understand very well. What is its exact encoding on major platforms? It seems that the ISO-10646 value 0x3B1 of the Greek lowercase alpha always gets stored in the wchar_t for a declaration like the one below on all the platforms that I tried:

    wchar_t wstr[] = L"Greek lowercase alpha is: \u03B1.";
    

    So I guess the execution wide-character set may well be UCS-2/UTF-16 or UTF-32 (different environments have different sizes for wchar_t : 4 bytes on Linux and 2 on Windows, mostly)? However, wcout << wstr; doesn't print the Greek letter correctly on either Linux or Windows. Surely the members and encoding of the execution wide-character set are implementation-defined, but that shouldn't prevent the implementation-provided iostream facility from recognizing and handling them appropriately, right? (The execution character set is also implementation-defined, yet the iostream facility handles it fine.) What is the default interpretation of a wchar_t array when handled by iostream facilities? (Anyway, just to clarify, I'm more interested in the nature of the execution wide-character set than in finding a correct way to print a wide-character string on particular platforms.)

    PS: I'm a total novice with wchar_t, so my apologies if I said something very wrong.


    The execution wide-character set is simply the set of characters used to encode wchar_t values at runtime. See N3337 §2.3.

    The encoding is implementation-defined. On all modern systems and platforms it would be Unicode (ISO-10646), but nothing in the standard requires that. On older platforms such as IBM mainframes it might be a DBCS or something else entirely. You're unlikely to encounter that, but it's what the standard allows.

    The EWCS is required to have some specific members and conversions. It is required to work correctly with the library functions. These are not tough restrictions.

    The wide characters could actually be 16-bit (as on Windows) or 32-bit (as on Unix) and still represent the same character set (Unicode).


    Basically, char uses 1 byte to encode a symbol and is used for ASCII text. It is fine to use if your application deals with Latin-based text only. If you want to support other languages, Russian for example, you must use a multi-byte or Unicode encoding. This is where wchar_t is useful. If you write sizeof(wchar_t) you will see that 2 bytes are used to encode a symbol on Windows (on most Unix-like systems it is 4 bytes).

    When you decide to use wchar_t (wide characters), you must use the functions that support this type. You will find that many string facilities (fopen_s, string) have wide analogs: _wfopen_s, wstring.
