
On 25.04.2011 22:31, Jeremy Maitin-Shepard wrote:
> We can assume that the compiler knows the correct character set of the source code file, as trying to fool it would seem to be inherently error prone. This seems to rule out the possibility of char * literals containing UTF-8 encoded text on MSVC, until C++1x Unicode literals are supported.
>
> The biggest nuisance is that we need to know the compile-time character set/encoding (so that we know how to interpret "narrow" string literals), and there does not appear to be any standard way in which this is recorded (maybe I'm mistaken though).

The source character set is pretty much irrelevant. It's the execution character set that is problematic. A compiler will translate string literals from the source character set to the execution character set for storage in the binary. GCC has options to control both the source character set (-finput-charset) and the execution character set (-fexec-charset); both default to UTF-8. MSVC is more complicated: it will try to auto-detect the source character set, but while it can detect UTF-16, it treats everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worse problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding. So let's assume that further down, it's the execution character set that's known.
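To make that concrete, here is a minimal sketch; the byte values in the comments assume GCC's default UTF-8 execution charset versus an MSVC build whose execution charset is Windows-1252:

    #include <cstdio>

    int main() {
        // \u00FC is 'ü' and \u00DF is 'ß'; using escapes keeps the source file
        // pure ASCII, so only the *execution* character set determines the
        // bytes that end up stored in the binary for this literal.
        const char* s = "gr\u00FC\u00DFe";
        for (const char* p = s; *p; ++p)
            std::printf("%02X ", static_cast<unsigned>(static_cast<unsigned char>(*p)));
        std::printf("\n");
        // GCC, -fexec-charset=UTF-8 (default):   67 72 C3 BC C3 9F 65
        // MSVC, Windows-1252 execution charset:  67 72 FC DF 65
    }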
> By knowing the compile-time character set, all ambiguity is removed. The translation database can be assumed to be keyed based on UTF-8, so to translate a message, it needs to be converted to UTF-8. There should presumably be versions of the translation functions that take narrow strings, wide strings, and additional versions for the C++1x Unicode literals once they are supported by compilers (I expect that to be very soon, at least for some compilers). If a wide string is specified, it will be assumed to be in UTF-16 or UTF-32 depending on sizeof(wchar_t), and converted to UTF-8. UTF-32 is generally undesirable, I imagine, but in practice it should nonetheless work, and using wide strings might be the best approach for code that needs to compile on both Windows and Linux. For the narrow version, if the compile-time narrow encoding is UTF-8, the conversion is a no-op. Otherwise, the conversion will have to be done. (The C++1x u8 literal version would naturally require no conversion either.)
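Roughly, the overload set described above might look like the following sketch. The catalog and the translate() names are illustrative, not an existing API; std::wstring_convert and std::codecvt_utf8 are C++11 facilities (later deprecated in C++17):

    #include <codecvt>
    #include <locale>
    #include <map>
    #include <string>

    // Hypothetical UTF-8-keyed translation catalog.
    static std::map<std::string, std::string> catalog = { {"Hello", "Hallo"} };

    // Narrow overload: assumes the execution charset is already UTF-8, in
    // which case the "conversion" to the UTF-8 key is a no-op. With a
    // non-UTF-8 execution charset, a real transcoding step would go here.
    std::string translate(const char* msg) {
        auto it = catalog.find(msg);
        return it != catalog.end() ? it->second : std::string(msg);
    }

    // Wide overload: wchar_t text is treated as UTF-16 or UTF-32 depending
    // on sizeof(wchar_t) and converted to UTF-8 before the lookup.
    // (codecvt_utf8 handles UCS-2/UCS-4; full UTF-16 with surrogate pairs
    // would need codecvt_utf8_utf16 instead.)
    std::string translate(const wchar_t* msg) {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
        return translate(conv.to_bytes(msg).c_str());
    }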
The issue with making the narrow version automatically transcode the input from the narrow encoding to UTF-8 is compatibility with C++11 u8 literals. For some reason, there is no way in the type system to distinguish between normal narrow literals and u8 literals. In other words, if the translate() functions ever assume a narrow literal to be in the locale character set, you can't use u8 literals with them anymore.

Sebastian
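A small sketch of that last point (the translate() overload is hypothetical): both calls below end up in the same function, so an overload that assumes the locale/execution charset cannot safely be handed u8 literals unless that charset happens to be UTF-8.

    #include <cstdio>
    #include <cstring>

    // Hypothetical narrow overload that has to assume *some* encoding for
    // its argument -- say, the execution/locale character set.
    void translate(const char* msg) {
        std::printf("%zu bytes\n", std::strlen(msg));
    }

    int main() {
        translate("gr\u00FC\u00DFe");    // bytes in the execution character set
        translate(u8"gr\u00FC\u00DFe");  // UTF-8 bytes, yet the same function runs
        // In C++11/14/17 both literals have type const char[N], so overload
        // resolution cannot separate them; only since C++20 does u8"..."
        // have the distinct type const char8_t[N].
    }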