
----- Original Message -----
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
On 01/29/2012 09:13 AM, Artyom Beilis wrote:
In fact the ONLY modern compiler that does not support them is Visual Studio; all the others I have ever used (gcc, clang, intel, sunstudio) work fine with UTF-8.
They all support it, the problem is that they require different things to use it.
No, MSVC does not allow creating both "שלום" and L"שלום" literals as Unicode (UTF-8, UTF-16); for all other compilers this is the default behavior.
1. A BOM should not be used in source code; no compiler except MSVC uses it and most do not support it.
According to Yakov, GCC supports it now. It would be nice if it could work without any BOM though.
GCC's default input and literal encoding is UTF-8. BOM is not needed.
2. Setting a UTF-8 BOM makes narrow literals be encoded in the ANSI encoding, which makes the BOM even more useless (crap... sorry) with MSVC.
That's the correct behaviour.
No, it is unspecified behavior according to the standard. The standard does not specify what narrow encoding should be used; that is why u8"" was created. All compilers but MSVC create UTF-8 literals and use UTF-8 input, and this is the default.
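To see which narrow encoding a given compiler actually picked, one can compare the bytes of a literal against the known UTF-8 bytes of the same text. The following is only a sketch: the names kShalomUtf8 and is_utf8_shalom are made up for illustration, and the reference bytes are written with hex escapes so that they do not depend on the source or literal encoding.

```cpp
#include <cassert>
#include <cstring>

// UTF-8 bytes of the Hebrew word from this thread
// (U+05E9 U+05DC U+05D5 U+05DD), spelled with hex escapes so the
// compiler's source and literal encodings cannot change them.
// kShalomUtf8 is a hypothetical name used only for this sketch.
static const char kShalomUtf8[] = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";

// Returns true if `literal` holds exactly the UTF-8 bytes of the word.
// Hypothetical helper, not part of any library.
bool is_utf8_shalom(const char* literal) {
    const std::size_t expected_len = sizeof(kShalomUtf8) - 1;  // drop the NUL
    return std::strlen(literal) == expected_len &&
           std::memcmp(literal, kShalomUtf8, expected_len) == 0;
}
```

Calling is_utf8_shalom("שלום") from a UTF-8 source file should return true on a compiler whose narrow literal encoding is UTF-8; on a compiler that converts narrow literals to a legacy code page it would be expected to return false.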
Use u8 string literals if you want UTF-8.
Why on earth should I do this? The whole world uses UTF-8. Why should I specify u8"" if it is something that can easily be defined at the compiler level? All we need is some flag for MSVC that says the string literal encoding is UTF-8. I think the standard should require a method for specifying the input encoding and the literal encoding, and require support for UTF-8 input and literal encoding, whether by adding some flag or by providing some pragma.
The problem is only present if the compiler doesn't have those string literals.
AFAIR, neither gcc4.6 nor msvc10 supports u8"".

Artyom
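
Where u8"" is not available, one possible way to avoid depending on the narrow literal encoding is to build the UTF-8 string at run time from code points. This is a minimal sketch under that assumption, not something proposed in the thread: append_utf8 and utf8_from_code_points are hypothetical helpers, and they assume every input is a valid Unicode scalar value (no surrogates, nothing above U+10FFFF).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
// Assumes `cp` is a valid scalar value; no validation is done here.
void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes: 11110xxx followed by 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encodes an array of `n` code points as one UTF-8 string.
std::string utf8_from_code_points(const unsigned long* cps, std::size_t n) {
    std::string out;
    for (std::size_t i = 0; i < n; ++i) {
        append_utf8(out, cps[i]);
    }
    return out;
}
```

For example, passing the four code points 0x05E9, 0x05DC, 0x05D5, 0x05DD yields the 8-byte UTF-8 string for "שלום" regardless of how the compiler encodes narrow literals. The cost is that the text is no longer readable in the source, which is exactly the inconvenience a compiler-level UTF-8 setting would remove.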