
On 04/25/2011 11:56 PM, Artyom wrote:
From: Jeremy Maitin-Shepard <jeremy@jeremyms.com>
The most significant complaint seems to be the fact that the translation interface is limited to ASCII (or maybe UTF-8 is also supported, it isn't entirely clear).
[snip]
I imagine that, relative to the work required for the whole library, these changes would be quite trivial, and they might well transform the library from completely unacceptable to acceptable for a number of objectors on the list, while having essentially no impact on those who are happy to use the library as is.
I can say a few words about what can be done and what will never be done.
I will never support wide, char16_t or char32_t strings as keys.
It seems that it is mostly possible to get the desired results using only char * strings as keys, but there is one limitation: it is not possible to represent strings containing characters that don't all fit in a single non-Unicode character set; for example, a char * string literal could not contain both Japanese and Hebrew text. As this is unlikely to be needed, it might be a reasonable limitation, though. However, I don't see why you are so opposed to providing additional overloads. With MSVC, currently only wide strings can represent the full range of Unicode. You could provide the definitions in a static/dynamic library separate from the char * overloads, so that there would not even be any substantial space overhead.
The current interface provides a facet of this form:

    template<typename CharType>
    class messages_facet {
        ...
        virtual CharType const *get(int domain_id, char const *msg) const = 0;
        ...
    };

Two or four instantiations of it are installed: messages_facet<char>, messages_facet<wchar_t>, messages_facet<char16_t> and messages_facet<char32_t>.
Supporting

    virtual CharType const *get(int domain_id, char const *msg) const = 0;
    virtual CharType const *get(int domain_id, wchar_t const *msg) const = 0;
    virtual CharType const *get(int domain_id, char16_t const *msg) const = 0;
    virtual CharType const *get(int domain_id, char32_t const *msg) const = 0;

is just a waste of memory: for the fastest comparison, every source string would either have to be stored in four converted variants or converted at runtime... Wasteful.
Thus I would only consider supporting "char const *" literals.
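
To make the cost concrete, here is a minimal sketch (the catalog class is hypothetical, not Boost.Locale's actual implementation; utf_to_utf is the library's real UTF-to-UTF converter) of why a wide-key lookup against a catalog keyed by char * strings forces either a second key table or a per-call conversion:

    #include <boost/locale/encoding_utf.hpp>
    #include <string>
    #include <unordered_map>

    // Hypothetical catalog keyed by the narrow source strings.
    class catalog {
        std::unordered_map<std::string, std::string> messages_;
    public:
        // Fast path: char const * keys compare directly against stored keys.
        std::string const *get(char const *msg) const {
            auto it = messages_.find(msg);
            return it == messages_.end() ? nullptr : &it->second;
        }
        // A wide-key overload needs either a duplicate key table (memory
        // overhead) or a conversion on every lookup (runtime overhead):
        std::string const *get(wchar_t const *msg) const {
            std::string narrow = boost::locale::conv::utf_to_utf<char>(msg);
            return get(narrow.c_str());
        }
    };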
One possibility is to provide, on a per-domain basis, a key in the po file, "X-Boost-Locale-Source-Encoding", so the user would be able to specify in the special record (which exists in all message catalogs) something like:

"X-Boost-Locale-Source-Encoding: windows-936" or "X-Boost-Locale-Source-Encoding: UTF-8"

Then, when the catalog is loaded, its keys would be converted to the X-Boost-Locale-Source-Encoding.
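
As a sketch of what that load-time key conversion could look like (the function and the map are hypothetical; boost::locale::conv::between is the library's real general-purpose charset converter):

    #include <boost/locale/encoding.hpp>
    #include <map>
    #include <string>

    // Hypothetical: re-encode every catalog key from the catalog's charset
    // into the declared source encoding, so that keys compare byte-for-byte
    // against the program's string literals. Values are left untouched.
    void convert_keys(std::map<std::string, std::string> &messages,
                      std::string const &catalog_charset, // e.g. "UTF-8"
                      std::string const &source_encoding) // e.g. "windows-936"
    {
        std::map<std::string, std::string> converted;
        for (auto const &kv : messages) {
            std::string key = boost::locale::conv::between(
                kv.first, source_encoding, catalog_charset);
            converted.emplace(std::move(key), kv.second);
        }
        messages.swap(converted);
    }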
This isn't a property of the message catalog but rather a property of the program itself, and therefore, it would seem, it should be specified in the program, not in the message catalog. Something like the preprocessor define I mentioned would be a way to do this.
So if you are an MSVC user and you really want to have localized keys, you have the following options:
Option A:
---------

source.cpp: // without BOM, windows-936 encoded

    #pragma setlocale("Japanese_Japan.936")

    translate("平和"); // L"平和" works well

    wcout << translate("「平和」"); // converted at runtime from cp936 to UTF-16
    cout << translate("「平和」"); // converted at runtime from cp936 to UTF-8

[snip]
When you say the strings are "converted at runtime", it seems you actually mean that the keys will be converted from UTF-8 to cp936 when the messages are loaded, but the values will remain UTF-8. Untranslated strings would have to be converted, I suppose.
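
Presumably the untranslated fallback would look something like this sketch (the function and the map are hypothetical; to_utf is Boost.Locale's real converter): when the lookup fails, the cp936 key itself becomes the output and must be re-encoded for a UTF-8 stream:

    #include <boost/locale/encoding.hpp>
    #include <map>
    #include <string>

    // Hypothetical fallback path: translated values are stored as UTF-8,
    // but an untranslated key is still cp936-encoded and must be
    // converted before it can be written to a UTF-8 stream.
    std::string translate_or_fallback(
        std::map<std::string, std::string> const &messages,
        char const *msg) // cp936-encoded source literal
    {
        auto it = messages.find(msg);
        if (it != messages.end())
            return it->second; // already UTF-8
        return boost::locale::conv::to_utf<char>(msg, "windows-936");
    }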
Option B:
---------

source.cpp: // with BOM, UTF-8 encoded, still windows-936 locale

    #pragma setlocale("Japanese_Japan.936")

    translate("平和"); // MSVC would actually treat it as cp936; L"平和" works well

    wcout << translate("「平和」"); // converted at runtime from cp936 to UTF-16
    cout << translate("「平和」"); // converted at runtime from cp936 to UTF-8

[snip]
Okay, same as Option A, except that it is possible to specify wide literals using the full range of Unicode characters, rather than being limited to the local charset.
Option C (in future C++11):
---------

source.cpp: // with BOM, UTF-8 encoded

    translate(u8"平和"); // would be UTF-8; L"平和" works well

    wcout << translate(u8"「平和」"); // converted at runtime from UTF-8 to UTF-16
    cout << translate(u8"「平和」"); // just copied to the stream as-is
Clearly this is a good solution, if only it were supported.
Option D (works now):
---------

source.cpp: // without BOM, UTF-8 encoded

    translate("平和"); // MSVC would just use it as UTF-8; L"平和" does not work!!

[snip]
I think it is obvious this isn't a feasible solution, as this breaks wide string literals, which are likely to be needed by anyone using MSVC.
    wcout << translate("「平和」"); // converted at runtime from UTF-8 to UTF-16
    cout << translate("「平和」"); // just copied to the stream as-is
myprogram.po:

    msgid ""
    msgstr ""
    "Content-Type: text/plain; charset=UTF-8\n"
    # it would assume UTF-8 sources

    msgid "平和"
    msgstr "שלום"

    # not translated
    msgid "「平和」"
    msgstr ""
This can be done and I can implement it. But do not expect anything beyond this.
Also note that converting a message from cp936 to, for example, windows-1255 (the narrow Windows encoding for Hebrew) would strip out all the non-ASCII characters...
I'm not exactly sure why a conversion like this might happen, and it is also not clear that it is a serious problem. (Likely the Hebrew speaker would not be able to read Japanese anyway.)
But that is the problem of the developer who chose to use non-ASCII keys.
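
For what it's worth, the data loss is easy to demonstrate with Boost.Locale's real from_utf converter (the example goes via a UTF-8 literal purely so it can be written portably; with the library's default "skip" error method, characters that have no representation in windows-1255 are silently dropped):

    #include <boost/locale/encoding.hpp>
    #include <iostream>
    #include <string>

    int main()
    {
        // UTF-8 bytes of "平和 (peace)"
        std::string utf8 = "\xE5\xB9\xB3\xE5\x92\x8C (peace)";
        // Japanese has no representation in windows-1255, so the default
        // (skip) conversion method simply drops the kanji.
        std::string narrow =
            boost::locale::conv::from_utf(utf8, "windows-1255");
        std::cout << narrow << "\n"; // prints " (peace)"
        return 0;
    }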