
On 01/28/2012 08:48 PM, Yakov Galka wrote:
> The user can just write
>     cout << u8"您好世界";
> Even better is:
>     cout << "您好世界";
> which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ) and needs some trickery on others (MSVC: save as UTF-8 without BOM).
No, that's just wrong; that's not the model that C++ uses. By not storing the file with a BOM, you're essentially tricking MSVC into believing it is ANSI (windows-1252 on Western systems), and thus skipping the conversion from the source character set to the execution character set, since those happen to be the same.

The way a C++ compiler is supposed to work is that all of your source is in the source character set, regardless of the type of string literal you use. The compiler then converts from the source character set: to the execution character set for narrow string literals, to the wide execution character set for wide string literals, to UTF-8 for u8 literals, and so on.

The correct way to portably use Unicode characters in C++ source is to write it as UTF-8 and ensure that all compilers will consider the source character set to be UTF-8, then use the appropriate literal type depending on what encoding you want each string literal to end up in. Of course, in the real world this causes two practical problems:

- MSVC requires a BOM to be present, but GCC will choke if there is one.
- Lacking u8 string literals, you're stuck with wide string literals if you want something resembling Unicode, unless you use narrow string literals with just ASCII and escape sequences (\xYY only; \u and \U will not work, since the compiler will convert them to the execution character set).

What probably should be done is that compilers should be compelled to support UTF-8 as the source character set in a unified way. I once asked volodya if it were feasible to work around this in the build system (add a BOM for MSVC), but he didn't seem to think it was worth it.