Re: [boost] [locale] Review results for Boost.Locale library

2 May 2011

      ...
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
On 01/05/2011 20:44, Artyom wrote:
...
But the bigger question is: what exactly do you want to do with the BOM,
and how would it help you write "cross-platform" software?
The goal is to allow all compilers to recognize that the source is encoded
in UTF-8.
This is what you need to write cross-platform source that contains non-ASCII
characters.
It is not enough.

You can't do this properly in a cross-platform way, because
you can't currently get a UTF-8, UTF-16 or UTF-32 string
literal in cross-platform code until all compilers
support the C++0x u/U/u8 literals,
and at this point NONE of the existing popular compilers
supports them (checked MSVC, GCC, Intel, SunCC)
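
To make the point about u8 literals concrete, here is a minimal sketch. It assumes a compiler that does implement the u8 prefix, which none of the compilers listed above did at the time of writing. The names expected_utf8, shalom_u8 and u8_literal_is_utf8 are made up for the illustration. The text is spelled with universal character names, so the source file itself stays pure ASCII and needs no BOM.

```cpp
#include <cassert>
#include <cstring>

// Hand-written UTF-8 bytes for U+05E9 U+05DC U+05D5 U+05DD (the Hebrew word
// used in the example later in this message).
static const char expected_utf8[] = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";

// The same text as a u8 literal. The element type is char before C++20 and
// char8_t from C++20 on, so the comparison below goes through memcmp.
static const auto& shalom_u8 = u8"\u05E9\u05DC\u05D5\u05DD";

// Returns true if the u8 literal holds exactly the bytes written by hand.
bool u8_literal_is_utf8()
{
    return sizeof(shalom_u8) == sizeof(expected_utf8)
        && std::memcmp(shalom_u8, expected_utf8, sizeof(expected_utf8)) == 0;
}

// Hypothetical helper: aborts if the compiler did not produce UTF-8.
void self_check()
{
    assert(u8_literal_is_utf8());
}
```

With a conforming compiler the check should hold whatever the user's locale is, which is exactly the property a plain narrow literal does not give you.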
...
...
the only
real Unicode strings with MSVC would be L"", and they
would actually be encoded in UTF-16,
while the whole non-Windows world uses UTF-32 as the wide character
encoding.
How is that a problem at all?
And using narrow string literals with UTF-8 content
masquerading as ANSI is a hack, sorry.
That's not the C++-endorsed solution.
First of all, the ANSI code page exists only on Windows
and has nothing to do with cross-platform software.

The C++ standard does not know what an "ANSI" encoding is.
...
...
So basically I can say that until the Microsoft Visual Studio
team takes UTF-8 seriously and either supports code page 65001
as expected or provides GCC-like options
for the input and exec encodings, I don't see how
this BOM would be useful.
I don't really care about what the execution character set is.
I definitely do not want to change it; it should be the user locale.
No, you never want it to be the user's locale, because that makes
compilation locale-dependent! Consider:

   source.cpp / With UTF-8 BOM
   --------------------------------
   std::string test="שלום-سلام-Мир";

In Israel it would be "שלום-????-???" in CP1255
In Egypt  it would be "????-سلام-???" in CP1256
In Russia it would be "????-????-Мир" in CP1251
In France it would be "????-????-???" in CP1252

So no, you always want the execution character set
to be well defined, unless all your sources are
written in US-ASCII, which is a subset of
all of these character sets.
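
The effect can be imitated with a small sketch. This is only a model, not what any compiler actually runs: each code page is approximated as "ASCII plus one Unicode block", which is far cruder than the real CP125x tables, and the names CodePageModel and narrow_like_compiler are invented for the illustration. It also assumes C++11 char32_t support.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a legacy code page: ASCII plus one Unicode block.
struct CodePageModel {
    char32_t first;  // first code point of the extra block
    char32_t last;   // last code point of the extra block
};

// Rough models only; the real code pages cover other characters as well.
const CodePageModel cp1255_model = {0x0590, 0x05FF};  // Hebrew
const CodePageModel cp1256_model = {0x0600, 0x06FF};  // Arabic
const CodePageModel cp1251_model = {0x0400, 0x04FF};  // Cyrillic
const CodePageModel cp1252_model = {0x00A0, 0x00FF};  // Latin-1 supplement

// Replaces every code point the modelled code page cannot represent with '?',
// imitating a narrow literal converted to a locale-dependent execution charset.
std::u32string narrow_like_compiler(const std::u32string& text,
                                    const CodePageModel& cp)
{
    std::u32string out;
    for (char32_t c : text) {
        const bool representable = c < 0x80 || (c >= cp.first && c <= cp.last);
        out += representable ? c : U'?';
    }
    return out;
}
```

Feeding the same three-word string through the four models gives four different results from one source file, which is the whole problem: the output of the build depends on where it was built.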

Artyom Beilis.