
On 04/27/2011 12:07 AM, Artyom wrote:
How the catalog works: it searches for the key in a hash table and, as a last stage, compares the strings byte-wise.
It is fast and efficient.
In order to support "", L"", u"" and U"" I would need to create four variants of the same string to make sure lookups stay fast (a waste of memory), or I would need to convert the string from UTF-16/32 to UTF-8, which means run-time memory allocation and conversion.
So no, I'm not going to do this, especially since it is not portable enough.
Why not simply provide a compile-time or run-time option that lets the user specify the following:

- the encoding of narrow keys given as char * arguments, or that narrow keys are not supported at all (in which case they cannot be used), the default being UTF-8;
- whether wchar_t * arguments are supported (the encoding will be assumed to be UTF-16 or UTF-32 depending on sizeof(wchar_t)) [not supported by default];
- whether char16_t arguments are supported [not supported by default];
- whether char32_t arguments are supported [not supported by default].

The library would simply convert the UTF-8 encoded keys in the message catalogs to each of the supported key argument encodings. In most cases there would be only a single supported encoding. Because the narrow version could be disabled, with Japanese text and UTF-16 wchar_t this would actually _save_ space, since UTF-16 encodes Japanese text more compactly than UTF-8.

More to the point, you as the library author can offer this functionality (since it shouldn't be too much of an implementation burden) even if you as a user of your own library wouldn't want to use it (because you are happy to provide English string literals).

I agree that it is very unfortunate that wchar_t can mean either UTF-16 or UTF-32 depending on the platform, but in practice the same source code containing L"" string literals can be used on both Windows and Linux to reliably specify Unicode string literals (provided that care is taken to ensure the compiler knows the source code encoding). The fact that UTF-32 (which Linux tends to use for wchar_t) is space-inefficient does in some ways render Linux a second-class citizen if a solution based on wide string literals is used for portability, but using UTF-8 on MSVC is basically just impossible, rather than merely less efficient, so there doesn't seem to be another option. (Assuming you are unwilling to rely on the Windows "ANSI" narrow encodings.)
One possibility is to provide, on a per-domain basis, a key in the po file, "X-Boost-Locale-Source-Encoding", so the user would be able to specify, in the special record that exists in all message catalogs, something like:
"X-Boost-Locale-Source-Encoding: windows-936" or "X-Boost-Locale-Source-Encoding: UTF-8"
Then, when the catalog is loaded, its keys would be converted to the encoding named by X-Boost-Locale-Source-Encoding.
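For illustration (a hypothetical example, not an existing convention), a catalog carrying the proposed header would put the key in the empty-msgid record that every PO file begins with:

```
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"X-Boost-Locale-Source-Encoding: windows-936\n"
```

Here the catalog itself is stored in UTF-8, while the proposed key declares that the program's source string literals, i.e. the lookup keys, are encoded in windows-936.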
This isn't a property of the message catalog but rather a property of the program itself, and therefore it should be specified in the program, not in the message catalog, it would seem. Something like the preprocessor define I mentioned would be a way to do this.
There are two problems with a define: I want
translate("foo") to work automatically, and I don't want it to be a define.
So I either need to provide an encoding in the catalog itself or when I provide the domain name (the reason it is done per domain is that one part of the project may use UTF-8, another cp936, and yet another plain US-ASCII).
So I can either specify it when I load a catalog, or in the catalog itself.
Okay.