
The most significant complaint seems to be that the translation interface is limited to ASCII (or perhaps UTF-8 is also supported; it isn't entirely clear). Even though various arguments have been made for using only ASCII text literals in the program, it seems that it would be relatively easy to support other languages. As someone else has mentioned, even if the text really is in English, ASCII may not be sufficient, since it may be desirable to include a special symbol (the copyright symbol, for instance), and having to deal with this by creating a translation from "ASCII English to appease the translation system" to "real English to display to users" would seem to be an unjustifiable additional burden. However, I don't think anyone is as familiar with the limitations of gettext-related tools as Artyom, so he is the best person to discuss exactly how this might be supported. He previously described a makeshift approach that required the use of a macro, which didn't seem like a satisfactory solution.

It seems that xgettext (at least version 0.18.1, which I tested on my machine) supports non-ASCII program source provided that the --from-code option is given, so the user could keep the source code in any character set/encoding and it would still work (xgettext simply converts the extracted strings to UTF-8). It also appears to extract strings specified with an L prefix, so that should not be a problem either. There is some question as to how well existing tools for translating the messages deal with non-ASCII, but since those tools can be improved fairly easily if necessary, I don't think this is a significant concern.

We can assume that the compiler knows the correct character set of the source code file, as trying to fool it would seem to be inherently error-prone. This seems to rule out char * literals containing UTF-8 encoded text on MSVC, until C++1x Unicode literals are supported. The biggest nuisance is that we need to know the compile-time character set/encoding (so that we know how to interpret "narrow" string literals), and there does not appear to be any standard way in which this is recorded (maybe I'm mistaken, though). However, it is easy enough for the user to specify this as a preprocessor define (the build system could add it to the compile flags, and it needs to be known anyway in order to invoke xgettext --- presumably it would just be based on the active locale at the time the compiler is invoked). If none is specified, it could default to UTF-8 (claiming UTF-8 is also a useful optimization when the compile-time encoding is not UTF-8 but the source happens to contain only ASCII messages, since ASCII is a subset of UTF-8). Once the compile-time character set is known, all ambiguity is removed.

The translation database can be assumed to be keyed on UTF-8, so to translate a message it needs to be converted to UTF-8. There should presumably be versions of the translation functions that take narrow strings, wide strings, and additional versions for the C++1x Unicode literals once compilers support them (which I expect to happen very soon, at least for some compilers). If a wide string is given, it will be assumed to be UTF-16 or UTF-32 depending on sizeof(wchar_t), and converted to UTF-8. UTF-32 is generally undesirable, I imagine, but in practice it should nonetheless work, and wide strings might be the best approach for code that needs to compile on both Windows and Linux.
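To make this concrete, here is a rough sketch of what such overloads might look like. All of the names (translate, to_utf8, lookup_utf8, SOURCE_NARROW_ENCODING) are invented for illustration and are not the library's actual interface; the conversion and lookup functions are only declared, since the point is the interface shape rather than the conversion itself.

    #include <string>

    // The build system would pass the compile-time narrow encoding, e.g.
    //   -DSOURCE_NARROW_ENCODING="\"ISO-8859-1\""
    // and we fall back to UTF-8 when nothing is specified.
    #ifndef SOURCE_NARROW_ENCODING
    #  define SOURCE_NARROW_ENCODING "UTF-8"
    #endif

    // Convert text from the given encoding to UTF-8 (via iconv, ICU, or similar).
    std::string to_utf8(const char *text, const char *encoding);

    // Convert a wide string (UTF-16 or UTF-32, depending on sizeof(wchar_t)) to UTF-8.
    std::string to_utf8(const wchar_t *text);

    // Look up a message by its UTF-8 key in the translation database.
    std::string lookup_utf8(const std::string &utf8_key);

    // Narrow overload: the literal is interpreted in the compile-time encoding.
    inline std::string translate(const char *msg)
    {
        return lookup_utf8(to_utf8(msg, SOURCE_NARROW_ENCODING));
    }

    // Wide overload: always converted, whether wchar_t is 16 or 32 bits.
    inline std::string translate(const wchar_t *msg)
    {
        return lookup_utf8(to_utf8(msg));
    }

Overloads for the C++1x u8, u and U literals could presumably be added alongside these in exactly the same way once compilers support them.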
For the narrow version, if the compile-time narrow encoding is UTF-8, the conversion is a no-op; otherwise the conversion has to be done. (The C++1x u8 literal version would naturally require no conversion either.) Note that in the common case of UTF-8 narrow literals, which is the only case currently supported, there would be no performance penalty at all. The documentation could explicitly warn that there is a performance penalty for not using UTF-8, but I think this penalty is likely to be acceptable in many cases. If normalization proves to be an issue, then the conversion to UTF-8 could also perform normalization (perhaps controlled by another preprocessor definition, as in the sketch below), and the output of xgettext could be normalized to match. Relative to the work required for the whole library, I imagine these changes would be quite trivial, and they might very well transform the library from completely unacceptable to acceptable for a number of the objectors on the list, while having essentially no impact on those who are happy to use the library as is.
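For what it's worth, the no-op fast path and the optional normalization step could look something like the following. Again, SOURCE_IS_UTF8, NORMALIZE_KEYS, convert_to_utf8 and nfc are names I have made up for illustration; in practice the build system, which already has to know the compile-time encoding in order to run xgettext, would be the one setting these defines.

    #include <string>

    #ifndef SOURCE_NARROW_ENCODING
    #  define SOURCE_NARROW_ENCODING "UTF-8"
    #endif

    std::string convert_to_utf8(const char *text, const char *encoding); // iconv/ICU, as before
    std::string nfc(const std::string &utf8);                            // e.g. ICU NFC normalization

    // Build the UTF-8 key used to look up a message in the catalogue.
    inline std::string utf8_key(const char *msg)
    {
    #ifdef SOURCE_IS_UTF8
        std::string key = msg;   // literal is already UTF-8: no conversion, no penalty
    #else
        std::string key = convert_to_utf8(msg, SOURCE_NARROW_ENCODING);
    #endif
    #ifdef NORMALIZE_KEYS
        key = nfc(key);          // only needed if the xgettext output is normalized too
    #endif
        return key;
    }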