Cross-platform Unicode in C/C++: Which encoding to use?
I'm currently working on a hobby project (C/C++) that is supposed to work on both Windows and Linux, with full Unicode support. Sadly, Windows and Linux use different encodings, which makes our lives more difficult.
In my code I'm trying to keep the data as universal as possible, so that it is easy to handle on both Windows and Linux. On Windows, wchar_t is 2 bytes and holds UTF-16 by default; on Linux it is 4 bytes and holds UCS-4/UTF-32 (correct me if I'm wrong).
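The size difference is easy to check at run time. The following is only a rough sketch: `wchar_encoding_guess` and `units_for` are hypothetical helper names, and guessing the encoding from `sizeof(wchar_t)` is a heuristic that matches common toolchains (MSVC on Windows, glibc on Linux) rather than something the C++ standard guarantees.

```cpp
#include <string>

// Heuristic guess at what wchar_t holds, based only on its size.
// Assumption: 2 bytes means UTF-16 (typical on Windows),
// 4 bytes means UTF-32 (typical on Linux).
inline std::string wchar_encoding_guess() {
    if (sizeof(wchar_t) == 2) return "UTF-16";
    if (sizeof(wchar_t) == 4) return "UTF-32";
    return "unknown";
}

// Number of wchar_t units needed to store one Unicode code point.
// With 2-byte wchar_t, code points above U+FFFF need a surrogate pair.
inline int units_for(char32_t cp) {
    return (sizeof(wchar_t) == 2 && cp > 0xFFFF) ? 2 : 1;
}
```

This is why code that assumes "one wchar_t is one character" can behave differently on the two platforms for characters outside the Basic Multilingual Plane.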
My software opens files ({_wfopen, UTF-16, Windows}, {fopen, UTF-8, Linux}) and writes data to them in UTF-8. So far it was all doable, until I decided to use SQLite.
SQLite's C/C++ interface accepts strings in a one-byte or two-byte encoding, i.e. UTF-8 or UTF-16 (click). Of course the UTF-16 interface does not work with wchar_t on Linux, as wchar_t on Linux is 4 bytes by default. Therefore, writing to and reading from SQLite requires a conversion on Linux.
Currently the code is getting cluttered with special cases for Windows/Linux. I was hoping to stick to the standard idea of storing data in wchar_t:
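For illustration, the conversion mentioned above could look roughly like the sketch below. `wide_to_utf8` and `append_utf8` are hypothetical helpers, not part of SQLite or the standard library. The sketch assumes wchar_t holds UTF-16 when it is 2 bytes and UTF-32 when it is 4 bytes, and it replaces invalid input with U+FFFD. The result could then be handed to SQLite's UTF-8 functions such as `sqlite3_bind_text`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
static void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts a wstring to UTF-8 for both
// 2-byte (UTF-16) and 4-byte (UTF-32) wchar_t.
std::string wide_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        if (sizeof(wchar_t) == 2) cp &= 0xFFFF;  // guard against sign extension
        // Combine a UTF-16 surrogate pair into one code point. This branch
        // is leniently applied to 4-byte wchar_t input as well.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t lo = static_cast<std::uint32_t>(in[i + 1]);
            if (sizeof(wchar_t) == 2) lo &= 0xFFFF;
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // Lone surrogates and out-of-range values become U+FFFD.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) cp = 0xFFFD;
        append_utf8(out, cp);
    }
    return out;
}
```

Because the helper checks `sizeof(wchar_t)` instead of the platform, the same source compiles on both systems without an `#ifdef`.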
After reading (here), I was convinced I should stick to wchar_t on Windows. But after getting all that to work, the trouble began when porting to Linux.
Currently I'm thinking of redoing it all to stick with plain char (UTF-8), because it works on both Windows and Linux, keeping in mind that on Windows I need to run every string through WideCharToMultiByte to get UTF-8. Using plain char*-based strings will greatly reduce the number of special cases for Linux/Windows.
Do you have any experience with Unicode in cross-platform code? Any thoughts on the idea of simply storing data in UTF-8 instead of using wchar_t?
UTF-8 on all platforms, with just-in-time conversion to UTF-16 on Windows, is a common strategy for cross-platform Unicode.
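As a rough sketch of that strategy applied to the file-opening case from the question: the program keeps every path as UTF-8 and converts only at the Windows API boundary. `open_utf8` and `utf8_to_utf16` are hypothetical names. The Windows branch relies on `MultiByteToWideChar` and `_wfopen`, and the non-Windows branch assumes the file system accepts UTF-8 byte strings, as is usual on Linux.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Converts UTF-8 to UTF-16 with the Win32 API.
// Returns an empty string on failure or for empty input.
static std::wstring utf8_to_utf16(const std::string& s) {
    if (s.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                s.data(), static_cast<int>(s.size()),
                                nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        s.data(), static_cast<int>(s.size()), &w[0], n);
    return w;
}
#endif

// Hypothetical wrapper: takes a UTF-8 path on every platform and
// converts to UTF-16 only where the OS API needs it.
std::FILE* open_utf8(const std::string& path, const char* mode) {
#ifdef _WIN32
    std::wstring wpath = utf8_to_utf16(path);
    std::wstring wmode = utf8_to_utf16(mode);
    if (wpath.empty() || wmode.empty()) return nullptr;
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    return std::fopen(path.c_str(), mode);  // path bytes passed through as-is
#endif
}
```

With a wrapper like this, the platform-specific code stays in one place and the rest of the program only ever sees `std::string` holding UTF-8.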
Our software is cross-platform as well, and we faced similar problems. We decided that our goal is to have the fewest conversions possible. This means that we use wchar_t on Windows and char on Unix/Mac.
We do this by supporting _T, LPCTSTR and similar on Unix, and by having generic functions that easily convert between std::string and std::wstring. We also have a generic std::basic_string<TCHAR> (tstring) which we use in most cases.
So far this works quite well. Basically, most functions take a tstring or an LPCTSTR, and those which don't get their parameters converted from a tstring. That means that most of the time we don't convert our strings at all and simply pass most parameters through.
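A minimal sketch of what such a setup might look like follows. The Unix definitions of TCHAR, LPCTSTR and _T are assumptions for illustration, not the actual code described in this answer, and `unit_count` is a hypothetical example function. On Windows, TCHAR is only wchar_t when the project is built with UNICODE/_UNICODE defined.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>  // LPCTSTR
#include <tchar.h>    // TCHAR, _T
#else
// Assumed stand-ins for the Windows names on Unix/Mac: plain narrow chars.
typedef char TCHAR;
typedef const TCHAR* LPCTSTR;
#define _T(x) x
#endif

// Platform-native string: std::wstring on Unicode Windows builds,
// std::string elsewhere.
typedef std::basic_string<TCHAR> tstring;

// Example of a function written once against the generic types.
// It counts code units (wchar_t or char), not user-visible characters.
std::size_t unit_count(LPCTSTR s) {
    return tstring(s).size();
}
```

Callers write literals as `_T("hello")` and pass them straight through, so no conversion happens unless a function needs a specific encoding.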