
On 04/27/2011 12:07 AM, Artyom wrote:
How the catalog works: it searches for the key in a hash table and, as a last stage, compares the strings byte-wise.
It is fast and efficient.
In order to support "", L"", u"" and U"" I would need to create four variants of the same string to make sure lookups stay fast (a waste of memory), or I would need to convert the string from UTF-16/32 to UTF-8, which means run-time memory allocation and conversion.
So no, I'm not going to do this, especially since it is not portable enough.
Why not simply provide a compile-time or run-time option that lets the user specify the following:

- the encoding of narrow keys given as char * arguments, or that narrow keys are not supported at all (in which case they cannot be used), the default being UTF-8;
- whether wchar_t * arguments are supported (the encoding will be assumed to be UTF-16 or UTF-32 depending on sizeof(wchar_t)) [not supported by default];
- whether char16_t arguments are supported [not supported by default];
- whether char32_t arguments are supported [not supported by default].

The library would simply convert the UTF-8 encoded keys in the message catalogs to each of the supported key argument encodings. In most cases there would be only a single supported encoding. Because the narrow version could be disabled, with Japanese text and UTF-16 wchar_t this would actually _save_ space, since UTF-16 encodes Japanese text more compactly than UTF-8.

More to the point, you as the library author can offer this functionality (since it shouldn't be too much of an implementation burden) even if you as a user of your own library wouldn't want to use it (because you are happy to provide English string literals).

I agree that it is very unfortunate that wchar_t can mean either UTF-16 or UTF-32 depending on the platform, but in practice the same source code containing L"" string literals can be used on both Windows and Linux to reliably specify Unicode string literals (provided that care is taken to ensure the compiler knows the source code encoding). The fact that UTF-32 (which Linux tends to use for wchar_t) is space-inefficient does in some ways render Linux a second-class citizen if a solution based on wide string literals is used for portability, but using UTF-8 on MSVC is basically just impossible, rather than merely less efficient, so there doesn't seem to be another option. (Assuming you are unwilling to rely on the Windows "ANSI" narrow encodings.)
One possibility is to provide, on a per-domain basis, a key in the po file, "X-Boost-Locale-Source-Encoding", so the user would be able to specify, in the special record that exists in all message catalogs, something like:
"X-Boost-Locale-Source-Encoding: windows-936" or "X-Boost-Locale-Source-Encoding: UTF-8"
Then, when the catalog is loaded, its keys would be converted to the encoding named by X-Boost-Locale-Source-Encoding.
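For illustration (a hypothetical example, not an existing convention), a catalog carrying the proposed header would put the key in the empty-msgid record that every PO file begins with:

```
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"X-Boost-Locale-Source-Encoding: windows-936\n"
```

Here the catalog itself is stored in UTF-8, while the proposed key declares that the program's source string literals, i.e. the lookup keys, are encoded in windows-936.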
This isn't a property of the message catalog but rather a property of the program itself, and therefore it should be specified in the program, not in the message catalog, it would seem. Something like the preprocessor define I mentioned would be a way to do this.
There are two problems with a define: I want
translate("foo") to work automatically, and I don't want it to be a define.
So I either need to provide an encoding in the catalog itself or when I provide the domain name (the reason it is done per domain is that one part of the project may use UTF-8, another cp936, and yet another plain US-ASCII).
So I can either specify it when I load a catalog, or in the catalog itself.
Okay.