How to Output Unicode Strings on the Windows Console

2018-06-10 05:17:38

There are already a few questions relating to this problem. I think my question is a bit different because I don't have an actual problem; I'm only asking out of academic interest. I know that Windows's implementation of UTF-16 sometimes contradicts the Unicode standard (e.g. collation) or is closer to the old UCS-2 than to UTF-16, but I'll keep the “UTF-16” terminology here for simplicity.

Background: In Windows, everything is UTF-16. Regardless of whether you're dealing with the kernel, the graphics subsystem, the filesystem or whatever, you're passing UTF-16 strings. There are no locales or charsets in the Unix sense. For compatibility with medieval versions of Windows, there is a thing called “codepages” that is obsolete but nonetheless supported. AFAIK, there is only one correct and non-obsolete function to write strings to the console, namely WriteConsoleW, which takes a UTF-16 string. A similar discussion applies to input streams, which I'll ignore here.

However, I think this represents a design flaw in the Windows API: there is a generic function called WriteFile that can be used to write to all stream objects (files, pipes, consoles…), but this function is byte-oriented and doesn't accept UTF-16 strings. The documentation suggests using WriteConsoleW, which is text-oriented, for console output, and WriteFile, which is byte-oriented, for everything else. Since both console streams and file objects are represented by kernel object handles and console streams can be redirected, every write to a standard output stream has to go through a function that checks whether the handle represents a console stream or a file, which breaks polymorphism. OTOH, I do think that Windows's separation between text strings and raw bytes (which is mirrored in many other systems like Java or Python) is conceptually superior to Unix's char* approach that ignores encodings and doesn't distinguish between strings and byte arrays.

So my questions are: What to do in this situation? And why isn't this problem solved even in Microsoft's own libraries? Both the .NET Framework and the C and C++ libraries seem to adhere to the obsolete codepage model. How would you design the Windows API or an application framework to circumvent this issue?

I think that the general problem (which is not easy to solve) is that all libraries assume that all streams are byte-oriented, and implement text-oriented streams on top of that. However, we see that Windows does have special text-oriented streams on the OS level, and the libraries are unable to deal with this. So in any case we must introduce significant changes to all standard libraries. A quick and dirty way would be to treat the console as a special byte-oriented stream that accepts only one encoding. This still requires circumventing the C and C++ standard libraries, because they don't implement the WriteFile / WriteConsoleW switch. Is that correct?
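A minimal sketch of that quick-and-dirty switch, assuming UTF-8 is the single encoding the "console stream" accepts. The function name write_stdout_utf8 is made up for illustration; the Win32 calls (GetStdHandle, GetConsoleMode, MultiByteToWideChar, WriteConsoleW, WriteFile) are the real API. On a real console the bytes are converted to UTF-16 and sent through WriteConsoleW; on a redirected handle they are passed through WriteFile unchanged. The non-Windows branch is only there so the sketch builds elsewhere.

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#endif

// Hypothetical helper: writes UTF-8 text to standard output.
// Returns true if the whole string was handed to the OS.
bool write_stdout_utf8(const std::string& utf8) {
    if (utf8.empty()) return true;
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    if (out == nullptr || out == INVALID_HANDLE_VALUE) return false;

    DWORD mode = 0;
    if (GetConsoleMode(out, &mode)) {
        // GetConsoleMode succeeds only for a real console handle:
        // convert to UTF-16 and use the text-oriented call.
        const int bytes = static_cast<int>(utf8.size());
        const int units = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), bytes,
                                              nullptr, 0);
        if (units <= 0) return false;
        std::wstring wide(static_cast<std::size_t>(units), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), bytes, &wide[0], units);
        DWORD written = 0;
        return WriteConsoleW(out, wide.data(), static_cast<DWORD>(units),
                             &written, nullptr) != 0;
    }

    // Redirected to a file or pipe: write the UTF-8 bytes as they are.
    DWORD written = 0;
    return WriteFile(out, utf8.data(), static_cast<DWORD>(utf8.size()),
                     &written, nullptr) != 0 &&
           written == utf8.size();
#else
    // Other platforms: stdout is a plain byte stream.
    return std::fwrite(utf8.data(), 1, utf8.size(), stdout) == utf8.size();
#endif
}
```

Because this bypasses the C and C++ library buffers, mixing it with printf or cout on the same stream would need explicit flushing, which is exactly the circumvention described above.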

The general strategy I/we use in most (cross-platform) applications/projects is: we just use UTF-8 (I mean the real standard) everywhere. We use std::string as the container and interpret everything as UTF-8. We also handle all file I/O this way, i.e. we expect UTF-8 and save UTF-8. When we get a string from somewhere and know that it is not UTF-8, we convert it to UTF-8.

The most common case where we stumble upon WinUTF16 is filenames. So whenever we handle a filename, we convert the UTF-8 string to WinUTF16, and we convert the other way when we search a directory for files.
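To illustrate the UTF-8 to UTF-16 step, here is a portable sketch; it is not the code from the project described above. The name utf8_to_utf16 is hypothetical, and replacing malformed input with U+FFFD is an assumption made for this example. On Windows itself, MultiByteToWideChar with CP_UTF8 does the same job.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-16 code units.
// Malformed bytes are replaced by U+FFFD, one replacement per bad byte.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point allowed for each sequence length (rejects overlongs).
    static const char32_t kMin[5] = {0, 0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        // Every continuation byte must look like 10xxxxxx.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            ok = (c & 0xC0) == 0x80;
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        ok = ok && cp >= kMin[len] && cp <= 0x10FFFF &&
             !(cp >= 0xD800 && cp <= 0xDFFF);
        if (!ok) {
            out.push_back(0xFFFD);
            ++i;
            continue;
        }

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

The result could then be passed to the wide-character file APIs (after a cast or copy to wchar_t, which is 16 bits wide on Windows).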

The console isn't really used in our Windows build (there, all console output is redirected into a file). As we have UTF-8 everywhere, our console output is also UTF-8, which is fine for most modern systems. The Windows console log file likewise has its content in UTF-8, and most text editors on Windows can read that without problems.

If we used the WinConsole more and cared a lot that all special characters are displayed correctly, we might write an automatic pipe handler, installed between fileno=1 and the real stdout, that uses WriteConsoleW as you have suggested (if there is really no easier way).

If you wonder how to implement such an automatic pipe handler: we have already implemented one for all POSIX-like systems. The code probably doesn't work on Windows as it is, but I think it should be possible to port it. Our current pipe handler is similar to what tee does, i.e. if you do cout << "Hello" << endl, it will be printed both on stdout and in some log file. Look at the code if you are interested in how this is done.

Several points:

One important difference between Windows' WriteConsoleW and printf is that WriteConsoleW treats the console as a GUI rather than a text stream. For example, if you use it and pipe the program's output, the output is not captured.

I would never say that code pages are obsolete. Maybe Windows developers would like them to be, but they never will be. The whole world, except the Windows API, uses byte-oriented streams to represent data: XML, HTML, HTTP, Unix, etc. use encodings, and the most popular and most powerful one is UTF-8. So you may use wide strings internally, but in the external world you'll need something else.

Even when you print wcout << L"Hello World" << endl, it is converted under the hood to a byte-oriented stream, on most systems (except Windows) to UTF-8.

In my personal opinion, Microsoft made a mistake when it changed its API everywhere to wide characters instead of supporting UTF-8 everywhere. Of course you may argue about it. But in fact you have to separate text-oriented and byte-oriented streams and convert between them.

To answer your first question, you can output Unicode strings to the Windows console using _setmode. Specific details regarding this can be found on Michael Kaplan's blog. By default, the console is not Unicode (UCS-2/UTF-16). It works in an Ansi (locale/code page) manner and must specifically be configured to use Unicode.

Also, you have to change the console font, as the default font only supports Ansi characters. There are some minor exceptions here, such as zero-extended ASCII characters, but printing actual Unicode characters requires the use of _setmode.
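A small sketch of the _setmode approach, assuming the Microsoft C runtime (or a compatible one such as MinGW's). _setmode, _fileno and _O_U16TEXT are the real CRT names; enable_utf16_stdout, print_greeting and kIsWindows are hypothetical names used only here. Once stdout is in this mode, only wide output functions (wprintf, std::wcout) should be used on it.

```cpp
#include <cstdio>
#include <iostream>

#ifdef _WIN32
#include <fcntl.h>  // _O_U16TEXT
#include <io.h>     // _setmode
constexpr bool kIsWindows = true;
#else
constexpr bool kIsWindows = false;
#endif

// Hypothetical helper: switches stdout to UTF-16 text mode.
// Returns true if the mode was changed, false if that is not possible.
bool enable_utf16_stdout() {
#ifdef _WIN32
    std::fflush(stdout);  // flush pending narrow output before switching
    return _setmode(_fileno(stdout), _O_U16TEXT) != -1;
#else
    return false;  // this mode does not exist outside the Microsoft CRT
#endif
}

// Prints a non-ASCII greeting if the wide mode could be enabled.
void print_greeting() {
    if (enable_utf16_stdout()) {
        std::wcout << L"Gr\u00FC\u00DFe, \u4E16\u754C\n";
    } else {
        std::cout << "stdout left in its default mode\n";
    }
}
```

Whether the characters actually show up still depends on the console font, as noted above.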

"In Windows, everything is UTF-16. Regardless of whether you're dealing with the kernel, the graphics subsystem, the filesystem or whatever, you're passing UTF-16 strings. There are no locales or charsets in the Unix sense."

This is not completely true. While the underlying core of Windows does use Unicode, there is a huge amount of interoperability that comes into play that lets Windows interact with a wide variety of software.

Consider Notepad (yes, Notepad is far from a core component, but it gets my point across). Notepad has the ability to read files that contain Ansi (your current code page), Unicode or UTF-8. You might consider Notepad to be a Unicode application, but that is not entirely accurate.

A better example is drivers. Drivers can be written in either Unicode or Ansi. It really depends on the nature of the interface. To further this point, Microsoft provides the StrSafe library, which was specifically written with Kernel-mode drivers in mind, and it includes both Unicode and Ansi versions. While the drivers are either Ansi or Unicode, the Windows kernel must interact with them - correctly - regardless of whatever form they take.

The further away you get from the core of Windows, the more interoperability comes into play. This includes code pages and locales. You have to remember that not all software is written with Unicode in mind. Visual C++ 2010 still has the ability to build using Ansi, Multi-Byte or Unicode. This includes the use of code pages and locales, which are part of the C/C++ standard.

"However, I think this represents a design flaw in the Windows API"

The following two articles discuss this fairly well.

Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT?

Header files are not retarded, aka What the @#%&* is _O_WTEXT?

On this point, I think you are looking at Windows in hindsight. Unicode did not come first, ASCII did. After ASCII, came code pages. After code pages, came DBCS. After DBCS came MBCS (and eventually UTF-8). After UTF-8, came Unicode (UTF-16/UCS-2).

Each of these technologies was incorporated into the Windows OS over the years. Each building on the last, but without breaking each other. Software was written with each of these in mind. While it may not seem like it sometimes, Microsoft puts a huge amount of effort into not breaking software it didn't write. Even now, you can write new software that takes advantage of any of these technologies and it will work.

The real answer here is "compatibility". Microsoft still uses these technologies and so do many other companies. There are an untold number of programs, components and libraries which have not been updated to use Unicode (and never will be). Even as newer technologies arise, like .NET, the older technologies must stick around, at the very least for interoperability.

For example, say you have a DLL that you need to interact with from .NET, but this DLL was written using Ansi (single byte code page localized). To make it worse, you don't have the source for the DLL. The only answer here is to use those obsolete features.

Link: http://www.djcxy.com/p/30298.html
