
On Sun, Jan 29, 2012 at 16:28, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 01/29/2012 02:53 PM, Artyom Beilis wrote:
No, MSVC does not allow creating both "שלום" and L"שלום" literals
as Unicode (UTF-8, UTF-16); for all other compilers it is the default behavior.
And it shouldn't. String literals are in the execution character set. On Windows the execution character set is what it calls ANSI. That much is not going to change.
The execution character set is defined by the implementation, that is, the compiler and the runtime library. It has nothing to do with the system underneath. That is, the implementation is free to decide that the execution character set is UTF-8, even though Windows narrow strings are some 'ANSI'. Standard library interfaces would then accept UTF-8 (fopen, fstream, etc.).
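One way to see which execution character set a given implementation has picked is to dump the bytes it stores for a narrow literal. A minimal sketch; `hex_bytes` is a hypothetical helper name, not part of any library:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: render the bytes of a NUL-terminated narrow string
// as lowercase hex pairs separated by single spaces.
std::string hex_bytes(const char* text)
{
    assert(text != nullptr);
    static const char digits[] = "0123456789abcdef";
    std::string out;
    const std::size_t n = std::strlen(text);
    for (std::size_t i = 0; i < n; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned char byte = static_cast<unsigned char>(text[i]);
        if (i != 0) out += ' ';
        out += digits[byte >> 4];
        out += digits[byte & 0x0F];
    }
    return out;
}
```

If the implementation's execution character set is UTF-8, hex_bytes("\u05E9\u05DC\u05D5\u05DD") should give d7 a9 d7 9c d7 95 d7 9d; an implementation that uses some other narrow encoding will store different bytes.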
[...]
2. Setting a UTF-8 BOM makes narrow literals be encoded in the ANSI encoding,
which makes the BOM even more useless (crap... sorry) with MSVC.
That's the correct behaviour.
No, it is unspecified behavior according to the standard.
It isn't.
As said above, you can't deduce from the standard what the "execution character set for Windows" is. MSVC defines it to be 'ANSI', which is the source of all the problems. But it is unspecified behavior according to the standard. The standard does not specify what narrow encoding should be used; that is why u8"" was created.
The standard specifies that it is the execution character set. MSVC specifies that for its implementation, the execution character set is ANSI.
Yes, and we would like to at least have a flag that overrides the execution character set to UTF-8.
[...]
Use u8 string literals if you want UTF-8.
Why on earth should I do this?
Because it makes perfect sense and it's the way it's supposed to work.
As per C++11 it doesn't make sense to use any other narrow string literal but u8"". Why would you use plain "" on Windows? [...]
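To illustrate the point about u8"": a u8 literal written with universal character names should produce the same UTF-8 bytes no matter how the source file is saved. A sketch, assuming a compiler that supports the C++11 u8 prefix; `utf8_shalom` is a hypothetical name used only for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The Hebrew word from this thread, spelled with universal character names
// so that the source file itself stays pure ASCII.
// sizeof counts the UTF-8 code units plus the terminating NUL: 4 * 2 + 1.
static_assert(sizeof(u8"\u05E9\u05DC\u05D5\u05DD") == 9,
              "a u8 literal is encoded as UTF-8");

// Hypothetical helper: copy the literal's bytes into a std::string.
// The cast keeps this compiling both where u8"" is an array of char
// (C++11 to C++17) and where it is an array of char8_t (C++20).
std::string utf8_shalom()
{
    const auto& lit = u8"\u05E9\u05DC\u05D5\u05DD";
    const char* bytes = reinterpret_cast<const char*>(&lit[0]);
    const std::size_t length = sizeof(lit) - 1;  // drop the trailing NUL
    assert(bytes[length] == '\0');
    return std::string(bytes, length);
}
```

The returned bytes are d7 a9 d7 9c d7 95 d7 9d, which is the guarantee that plain "" does not give.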
All we need is some flag for MSVC that says the string literal encoding is UTF-8.
That "flag" is using the u8 prefix on those string literals. Remember: the encoding used for the data in a string literal is independent from the encoding used to write the source.
Yes, it will remain independent even with "" meaning u8"". Even if the source character set was UTF-32 it would mean UTF-8.

Sincerely,
--
Yakov