
On 01/28/2012 08:48 PM, Yakov Galka wrote:
> The user can just write
>     cout << u8"您好世界";
> Even better is:
>     cout << "您好世界";
> which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ) and needs some trickery on others (MSVC: save as UTF-8 without BOM).
No, that's just wrong; that's not the model that C++ uses. By not storing the file with a BOM, you're essentially tricking MSVC into believing it is ANSI (windows-1252 on Western systems), and thus skipping the conversion from the source character set to the execution character set, since those happen to be the same.

The way a C++ compiler is supposed to work is that all of your source is in the source character set, regardless of the type of string literal you use. The compiler then converts from the source character set: to the execution character set for narrow string literals, to the wide execution character set for wide string literals, to UTF-8 for u8 literals, and so on.

The correct way to portably use Unicode characters in C++ source is to write it as UTF-8 and ensure that all compilers will consider the source character set to be UTF-8, then use the appropriate literal type depending on what encoding you want each string literal to end up in. Of course, in the real world this causes two practical problems:

- MSVC requires a BOM to be present, but GCC will choke if there is one.
- Lacking u8 string literals, you're stuck with wide string literals if you want something resembling Unicode, unless you use narrow string literals with just ASCII and escape sequences (\xYY only; \u and \U will not work, since the compiler will convert them to the execution character set).

What probably should be done is that compilers should be compelled to support UTF-8 as the source character set in a unified way. I once asked volodya if it were feasible to work around this in the build system (add a BOM for MSVC), but he didn't seem to think it was worth it.