
On 25.04.2011 22:31, Jeremy Maitin-Shepard wrote:
> We can assume that the compiler knows the correct character set of the source code file, as trying to fool it would seem to be inherently error prone. This seems to rule out the possibility of char * literals containing UTF-8 encoded text on MSVC, until C++1x Unicode literals are supported.
>
> The biggest nuisance is that we need to know the compile-time character set/encoding (so that we know how to interpret "narrow" string literals), and there does not appear to be any standard way in which this is recorded (maybe I'm mistaken though).

The source character set is pretty much irrelevant. It's the execution character set that is problematic. A compiler will translate string literals from the source character set to the execution character set for storage in the binary. GCC has options to control both the source character set (-finput-charset) and the execution character set (-fexec-charset); both default to UTF-8. MSVC is more complicated: it will try to auto-detect the source character set, but while it can detect UTF-16, it treats everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worse problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding. So let's assume that further down, it's the execution character set that's known.
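To make that concrete, here is a minimal sketch; the byte values in the comments assume GCC's default UTF-8 execution charset versus an MSVC build whose execution charset is Windows-1252:

    #include <cstdio>

    int main() {
        // \u00FC is 'ü' and \u00DF is 'ß'; using escapes keeps the source file
        // pure ASCII, so only the *execution* character set determines the
        // bytes that end up stored in the binary for this literal.
        const char* s = "gr\u00FC\u00DFe";
        for (const char* p = s; *p; ++p)
            std::printf("%02X ", static_cast<unsigned>(static_cast<unsigned char>(*p)));
        std::printf("\n");
        // GCC, -fexec-charset=UTF-8 (default):   67 72 C3 BC C3 9F 65
        // MSVC, Windows-1252 execution charset:  67 72 FC DF 65
    }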
> By knowing the compile-time character set, all ambiguity is removed. The translation database can be assumed to be keyed based on UTF-8, so to translate a message, it needs to be converted to UTF-8. There should presumably be versions of the translation functions that take narrow strings, wide strings, and additional versions for the C++1x Unicode literals once they are supported by compilers (I expect that to be very soon, at least for some compilers). If a wide string is specified, it will be assumed to be in UTF-16 or UTF-32 depending on sizeof(wchar_t), and converted to UTF-8. UTF-32 is generally undesirable, I imagine, but in practice it should nonetheless work, and using wide strings might be the best approach for code that needs to compile on both Windows and Linux. For the narrow version, if the compile-time narrow encoding is UTF-8, the conversion is a no-op. Otherwise, the conversion will have to be done. (The C++1x u8 literal version would naturally require no conversion either.)
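Roughly, the overload set described above might look like the following sketch. The catalog and the translate() names are illustrative, not an existing API; std::wstring_convert and std::codecvt_utf8 are C++11 facilities (later deprecated in C++17):

    #include <codecvt>
    #include <locale>
    #include <map>
    #include <string>

    // Hypothetical UTF-8-keyed translation catalog.
    static std::map<std::string, std::string> catalog = { {"Hello", "Hallo"} };

    // Narrow overload: assumes the execution charset is already UTF-8, in
    // which case the "conversion" to the UTF-8 key is a no-op. With a
    // non-UTF-8 execution charset, a real transcoding step would go here.
    std::string translate(const char* msg) {
        auto it = catalog.find(msg);
        return it != catalog.end() ? it->second : std::string(msg);
    }

    // Wide overload: wchar_t text is treated as UTF-16 or UTF-32 depending
    // on sizeof(wchar_t) and converted to UTF-8 before the lookup.
    // (codecvt_utf8 handles UCS-2/UCS-4; full UTF-16 with surrogate pairs
    // would need codecvt_utf8_utf16 instead.)
    std::string translate(const wchar_t* msg) {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
        return translate(conv.to_bytes(msg).c_str());
    }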
The issue with making the narrow version automatically transcode the input from the narrow encoding to UTF-8 is compatibility with C++11 u8 literals. For some reason, there is no way in the type system to distinguish between normal narrow literals and u8 literals. In other words, if the translate() functions ever assume a narrow literal to be in the locale character set, you can't use u8 literals with them anymore.

Sebastian
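A small sketch of that last point (the translate() overload is hypothetical): both calls below end up in the same function, so an overload that assumes the locale/execution charset cannot safely be handed u8 literals unless that charset happens to be UTF-8.

    #include <cstdio>
    #include <cstring>

    // Hypothetical narrow overload that has to assume *some* encoding for
    // its argument -- say, the execution/locale character set.
    void translate(const char* msg) {
        std::printf("%zu bytes\n", std::strlen(msg));
    }

    int main() {
        translate("gr\u00FC\u00DFe");    // bytes in the execution character set
        translate(u8"gr\u00FC\u00DFe");  // UTF-8 bytes, yet the same function runs
        // In C++11/14/17 both literals have type const char[N], so overload
        // resolution cannot separate them; only since C++20 does u8"..."
        // have the distinct type const char8_t[N].
    }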