Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World

29 Jan 2012

On Sun, Jan 29, 2012 at 16:28, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 01/29/2012 02:53 PM, Artyom Beilis wrote:
No, MSVC does not allow creating both "שלום" and L"שלום" literals
as Unicode (UTF-8, UTF-16); for all other compilers it is the default
behavior.
And it shouldn't.
String literals are in the execution character set. On Windows the
execution character set is what it calls ANSI. That much is not going to
change.
The execution character set is defined by the implementation, that is, the
compiler and the runtime library. It has nothing to do with the system
underneath. That is, the implementation is free to decide that the execution
character set is UTF-8, even though Windows narrow strings are some 'ANSI'.
Standard library interfaces would then accept UTF-8 (fopen, fstream, etc.).
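
To make the point concrete, here is a minimal sketch of how one could inspect what a given implementation actually puts into a plain narrow literal. The names hex_bytes and narrow_shalom are made up for this illustration. The literal is spelled with universal-character-names so that the source file encoding plays no role, and the resulting bytes are whatever the compiler's execution character set dictates.

```cpp
#include <cstddef>
#include <string>

// Renders each byte of s as two lowercase hex digits, separated by spaces.
std::string hex_bytes(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out.push_back(' ');
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}

// A plain narrow literal: its bytes are chosen by the implementation's
// execution character set, so no particular result is assumed here.
std::string narrow_shalom() {
    return "\u05E9\u05DC\u05D5\u05DD";
}
```

Printing hex_bytes(narrow_shalom()) should give "d7 a9 d7 9c d7 95 d7 9d" where the execution character set is UTF-8 (the GCC default, which -fexec-charset can change). With a Windows-1255 execution character set it would presumably be "f9 ec e5 ed" instead.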
[...]
2. Setting a UTF-8 BOM makes narrow literals be encoded in the ANSI encoding,
which makes the BOM even more useless (crap... sorry) with MSVC.
That's the correct behaviour.
No, it is unspecified behavior according to the standard.
It isn't.
As said above, you can't deduce from the standard what the "execution
character set for Windows" is. MSVC defines it to be 'ANSI', which is the
source of all the problems. But it is unspecified behavior according to the
standard.

The standard does not specify what narrow encoding should be used; that
is why u8"" was created.
The standard specifies that it is the execution character set. MSVC
specifies that for its implementation, the execution character set is ANSI.
Yes, and we would like to at least have a flag that overrides the execution
character set to UTF-8.
[...]
Use u8 string literals if you want UTF-8.
Why on earth should I do this?
Because it makes perfect sense and it's the way it's supposed to work.
As per C++11, it doesn't make sense to use any narrow string literal other
than u8"". Why would you use plain "" on Windows?
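
For illustration, a small sketch of what the u8 prefix guarantees: the literal's bytes are UTF-8 whatever the execution character set is. The helper names literal_bytes, shalom_utf8 and check_shalom_utf8 are invented for this example. The helper is templated on the element type because u8"" yields char in C++11 but char8_t in C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Copies the code units of a string literal, minus the terminating NUL,
// into a std::string. Works whether the element type is char or char8_t.
template <typename CharT, std::size_t N>
std::string literal_bytes(const CharT (&lit)[N]) {
    std::string out;
    for (std::size_t i = 0; i + 1 < N; ++i)
        out.push_back(static_cast<char>(lit[i]));
    return out;
}

// U+05E9 U+05DC U+05D5 U+05DD, written as universal-character-names.
std::string shalom_utf8() {
    return literal_bytes(u8"\u05E9\u05DC\u05D5\u05DD");
}

// Each of the four Hebrew letters takes two bytes in UTF-8.
void check_shalom_utf8() {
    const std::string expected = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";
    assert(shalom_utf8() == expected);
}
```

Calling check_shalom_utf8() should pass on any conforming C++11 compiler, including one whose plain "" literals are in an 'ANSI' code page.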

[...]
All we need is some flag for MSVC that says the string literal
encoding is UTF-8.
That "flag" is using the u8 prefix on those string literals.
Remember: the encoding used for the data in a string literal is
independent from the encoding used to write the source.
Yes, it will remain independent even with "" meaning u8"". Even if the
source character set was UTF-32 it would mean UTF-8.
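
A short sketch of that independence, under the assumption that char16_t and char32_t literals are UTF-16 and UTF-32 (true on the common implementations, and guaranteed from C++20 on). The same code point is spelled identically in the source each time, and only the prefix decides how it is encoded in the literal. The names code_units and check_prefix_selects_encoding are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminating NUL.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

void check_prefix_selects_encoding() {
    // U+05E9 is written as a universal-character-name, so the encoding
    // of the source file does not enter into it.
    assert(code_units(u8"\u05E9") == 2);  // UTF-8: two code units
    assert(code_units(u"\u05E9") == 1);   // UTF-16: one code unit
    assert(code_units(U"\u05E9") == 1);   // UTF-32: one code unit
    assert(u"\u05E9"[0] == 0x05E9);
    assert(U"\u05E9"[0] == 0x05E9);
}
```

A flag that made plain "" behave like u8"" would fit the same picture: the prefix (or its default) selects the literal encoding, and the source encoding only affects how the compiler reads the file.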

Sincerely,
-- 
Yakov

Yakov Galka