
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
On 01/05/2011 20:44, Artyom wrote:
But the bigger question is what exactly do you want to do with the BOM, and how would it help you write "cross-platform" software?
The goal is to allow all compilers to recognize that the source is encoded in UTF-8. This is what you need to write cross-platform source that contains non-ASCII characters.
It is not enough. You can't currently do this properly in a cross-platform way, because you can't get UTF-8, UTF-16 or UTF-32 string literals portably until all compilers support the C++0x u/U/u8 literals, and at this point NONE of the existing popular compilers support them (I checked MSVC, GCC, Intel, SunCC).
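For illustration, here is a minimal sketch of what the C++0x u8/u/U literals mentioned above look like once a compiler does support them. The helper name code_units and the chosen code points are assumptions made for this example only; the characters are written as universal character names so the source file itself stays pure ASCII.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+05E9 HEBREW LETTER SHIN lies inside the Basic Multilingual Plane.
constexpr std::size_t shin_utf8  = code_units(u8"\u05e9");  // UTF-8 code units
constexpr std::size_t shin_utf16 = code_units(u"\u05e9");   // UTF-16 code units
constexpr std::size_t shin_utf32 = code_units(U"\u05e9");   // UTF-32 code units

// U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair.
constexpr std::size_t face_utf8  = code_units(u8"\U0001F600");
constexpr std::size_t face_utf16 = code_units(u"\U0001F600");
constexpr std::size_t face_utf32 = code_units(U"\U0001F600");

// Run-time check of the sizes the three encodings are expected to produce.
void check_literal_sizes() {
    assert(shin_utf8 == 2 && shin_utf16 == 1 && shin_utf32 == 1);
    assert(face_utf8 == 4 && face_utf16 == 2 && face_utf32 == 1);
}
```

With these prefixes the encoding of each literal is fixed by the language rather than by the compiler's locale or code page, which is the property the narrow and wide literals lack.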
The only real Unicode strings with MSVC would be L"", and they would actually be encoded as UTF-16, while the entire non-Windows world uses UTF-32 as its wide character encoding.
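The UTF-16 versus UTF-32 difference can be made visible with a character outside the Basic Multilingual Plane. This is only a sketch: the names wide_units, wide_face_units and expected_face_units are invented for the example, and it assumes the wide execution character set is UTF-16 when wchar_t is 2 bytes and UTF-32 otherwise.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of wchar_t units in a wide literal,
// not counting the terminating null.
template <std::size_t N>
constexpr std::size_t wide_units(const wchar_t (&)[N]) { return N - 1; }

// U+1F600 does not fit in 16 bits. With a 2-byte wchar_t (UTF-16, as on
// Windows) it becomes a surrogate pair; with a 4-byte wchar_t (UTF-32,
// as on most other platforms) it is a single unit.
constexpr std::size_t wide_face_units = wide_units(L"\U0001F600");

// Assumption: 2-byte wchar_t means UTF-16, anything else means UTF-32.
constexpr std::size_t expected_face_units = (sizeof(wchar_t) == 2) ? 2 : 1;

void check_wide_literal() {
    assert(wide_face_units == expected_face_units);
}
```

The same L"" literal therefore has a different length depending on the platform, so code that indexes or counts wchar_t elements cannot assume one element per character.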
How is that a problem at all?
And using narrow string literals with UTF-8 content masquerading as ANSI is a hack, sorry. That's not the C++-endorsed solution.
First of all, the ANSI code page exists only on Windows and has nothing to do with cross-platform software. The C++ standard does not know what an "ANSI" encoding is.
So basically I can say that until the Microsoft Visual Studio team takes UTF-8 seriously and either supports the 65001 code page as expected or provides GCC-like options for the input and execution encodings, I don't see how this BOM would be useful.
I don't really care about what the execution character set is. I definitely do not want to change it, it should be the user locale.
No, you never want it to be the user's locale, because that makes compilation locale dependent! Consider:

source.cpp / With UTF-8 BOM
--------------------------------
std::string test="שלום-سلام-Мир"

In Israel it would be "שלום-???-????" in CP1255
In Egypt it would be "????-سلام-???" in CP1256
In Russia it would be "????-????-Мир" in CP1251
In France it would be "????-???-???" in CP1252

So no, you always want the execution character set to be well defined, unless all your sources are written using US-ASCII, which is a subset of all these character sets.

Artyom Beilis.
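To make the point about the execution character set concrete, here is a small sketch that spells out the bytes of the Hebrew word from the example under two different execution character sets. The variable and function names are hypothetical, and the byte values are written as hex escapes so the result does not depend on how the compiler reads or converts the source file.

```cpp
#include <cassert>
#include <string>

// The Hebrew word from the example (U+05E9 U+05DC U+05D5 U+05DD),
// as it would be stored with a UTF-8 execution character set.
const std::string shalom_utf8 = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";

// The same word as it would be stored with a CP1255 execution
// character set: one byte per letter.
const std::string shalom_cp1255 = "\xF9\xEC\xE5\xED";

// Hypothetical check: did the compiler store a literal as UTF-8?
bool stored_as_utf8(const std::string& literal) {
    return literal == shalom_utf8;
}

void check_execution_charset_example() {
    assert(shalom_utf8.size() == 8);    // two bytes per letter
    assert(shalom_cp1255.size() == 4);  // one byte per letter
    assert(stored_as_utf8(shalom_utf8));
    assert(!stored_as_utf8(shalom_cp1255));
}
```

Passing an ordinary narrow literal containing the same word to stored_as_utf8 would then show which execution character set a given build used: the answer is true only when the compiler emitted UTF-8 bytes, so a build that follows the user's locale can give different binaries on different machines.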