Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World

29 Jan 2012

      ----- Original Message -----
...
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
On 01/29/2012 09:13 AM, Artyom Beilis wrote:
...
In fact the ONLY modern compiler that does not support them is Visual
Studio; all others I have ever used (gcc, clang, intel, sunstudio) work
fine with UTF-8.
They all support it; the problem is that they require different things to
use it.
No, MSVC does not allow creating both "שלום" and L"שלום" literals
as Unicode (UTF-8, UTF-16); for all other compilers it is the default
behavior.
...
...
1. BOM should not be used in source code; no compiler except MSVC uses it
     and most do not support it.
According to Yakov, GCC supports it now.
It would be nice if it could work without any BOM though.
GCC's default input and literal encoding is UTF-8. BOM is not needed.
...
...
2. Setting a UTF-8 BOM makes narrow literals be encoded in the ANSI encoding,
     which makes the BOM even more useless (crap... sorry) with MSVC.
That's the correct behaviour.
No, it is unspecified behavior according to the standard.

The standard does not specify what narrow encoding should be used; that
is why u8"" was created.

All compilers except MSVC take UTF-8 input and create UTF-8 literals,
and this is the default.
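
To see what a particular compiler actually does with a narrow literal, one
can compare its bytes against the known UTF-8 encoding of the same text.
A minimal sketch follows; the names kShalomUtf8, is_utf8_shalom and
narrow_literal_is_utf8 are made up for illustration, and it assumes the
source file itself is saved as UTF-8.

```cpp
#include <cstddef>
#include <cstring>

// UTF-8 encoding of U+05E9 U+05DC U+05D5 U+05DD, spelled out byte by byte
// so that it does not depend on how the compiler reads the source file.
static const unsigned char kShalomUtf8[] = {
    0xD7, 0xA9, 0xD7, 0x9C, 0xD7, 0x95, 0xD7, 0x9D
};

// Returns true if the n bytes at s are exactly the UTF-8 form of the word.
bool is_utf8_shalom(const char* s, std::size_t n) {
    return n == sizeof(kShalomUtf8) && std::memcmp(s, kShalomUtf8, n) == 0;
}

// Checks the bytes the compiler produced for a plain narrow literal.
// Assumption: this file is stored as UTF-8.
bool narrow_literal_is_utf8() {
    static const char lit[] = "שלום";
    return is_utf8_shalom(lit, sizeof(lit) - 1);  // drop the trailing '\0'
}
```

With a UTF-8 source file and the GCC or clang defaults I would expect
narrow_literal_is_utf8() to return true; with MSVC the result depends on
the source encoding and the ANSI code page in use.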
...
Use u8 string literals if you want UTF-8.
Why on earth should I do this?

The whole world uses UTF-8. Why should I specify u8"" if it is
something that can easily be defined at the compiler level?

All we need is some flag for MSVC that tells it that the string
literal encoding is UTF-8.

I think the standard should require a way to specify the input
encoding and the literal encoding, and should require support for
UTF-8 as both, whether by adding some flag or by providing some
pragma.
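
Until something like that exists, about the best a portable code base can
do is refuse to build when the assumption does not hold. A rough sketch,
assuming a C++11 compiler (constexpr, static_assert); the function name is
hypothetical. It uses a universal character name, so the check depends
only on the literal (execution) encoding and not on the encoding of the
source file. On GCC the two encodings can already be chosen with
-finput-charset and -fexec-charset.

```cpp
// True when a narrow literal holding U+05E9 is stored as the two UTF-8
// bytes 0xD7 0xA9 (plus the terminating '\0', hence sizeof == 3).
constexpr bool narrow_exec_charset_is_utf8() {
    return sizeof("\u05E9") == 3
        && static_cast<unsigned char>("\u05E9"[0]) == 0xD7
        && static_cast<unsigned char>("\u05E9"[1]) == 0xA9;
}

// Stop the build early if narrow literals are not UTF-8.
static_assert(narrow_exec_charset_is_utf8(),
              "narrow string literals are not encoded as UTF-8");
```

This only detects the problem; it does not give MSVC users a way to fix
it, which is why a standard flag or pragma would still be needed.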
...
The problem is only present if the compiler doesn't have those string
literals.
AFAIR, neither gcc4.6 nor msvc10 supports u8"".

Artyom