
On 04/25/2011 11:56 PM, Artyom wrote:
From: Jeremy Maitin-Shepard <jeremy@jeremyms.com>
The most significant complaint seems to be the fact that the translation interface is limited to ASCII (or maybe UTF-8 is also supported, it isn't entirely clear).
[snip]
I imagine that, relative to the work required for the whole library, these changes would be quite trivial, and they might well transform the library from completely unacceptable to acceptable for a number of objectors on the list, while having essentially no impact on those who are happy to use the library as is.
I can say a few words about what can be done and what will never be done.
I will never support wide, char16_t or char32_t strings as keys.
It seems that it is mostly possible to get the desired results using only char * strings as keys, but there is one limitation: it is not possible to represent strings containing characters that don't all fit in a single non-Unicode character set; for example, a char * string literal could not contain both Japanese and Hebrew text. As this is unlikely to be needed, it might be a reasonable limitation, though. However, I don't see why you are so opposed to providing additional overloads. With MSVC, currently only wide strings can represent the full range of Unicode. You could provide the definitions in a static/dynamic library separate from the char * overloads, so that there would not even be any substantial space overhead.
The current interface provides a facet of this form:

    template<typename CharType>
    class messages_facet {
        ...
        virtual CharType const *get(int domain_id, char const *msg) const = 0;
        ...
    };

Two or four instantiations of it are installed: messages_facet<char>, messages_facet<wchar_t>, messages_facet<char16_t> and messages_facet<char32_t>.
Supporting

    virtual CharType const *get(int domain_id, char const *msg) const = 0;
    virtual CharType const *get(int domain_id, wchar_t const *msg) const = 0;
    virtual CharType const *get(int domain_id, char16_t const *msg) const = 0;
    virtual CharType const *get(int domain_id, char32_t const *msg) const = 0;

is just a waste of memory: for the fastest comparison, every source string would either have to be stored in four converted variants or converted at runtime... Wasteful.
Thus I would only consider supporting "char const *" literals.
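
To make the cost concrete, here is a minimal sketch (the catalog class is hypothetical, not Boost.Locale's actual implementation; utf_to_utf is the library's real UTF-to-UTF converter) of why a wide-key lookup against a catalog keyed by char * strings forces either a second key table or a per-call conversion:

    #include <boost/locale/encoding_utf.hpp>
    #include <string>
    #include <unordered_map>

    // Hypothetical catalog keyed by the narrow source strings.
    class catalog {
        std::unordered_map<std::string, std::string> messages_;
    public:
        // Fast path: char const * keys compare directly against stored keys.
        std::string const *get(char const *msg) const {
            auto it = messages_.find(msg);
            return it == messages_.end() ? nullptr : &it->second;
        }
        // A wide-key overload needs either a duplicate key table (memory
        // overhead) or a conversion on every lookup (runtime overhead):
        std::string const *get(wchar_t const *msg) const {
            std::string narrow = boost::locale::conv::utf_to_utf<char>(msg);
            return get(narrow.c_str());
        }
    };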
One possibility is to provide, on a per-domain basis, a key in the po file, "X-Boost-Locale-Source-Encoding", so the user would be able to specify in the special record (which exists in all message catalogs) something like:

"X-Boost-Locale-Source-Encoding: windows-936" or "X-Boost-Locale-Source-Encoding: UTF-8"

Then, when the catalog is loaded, its keys would be converted to the X-Boost-Locale-Source-Encoding.
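
As a sketch of what that load-time key conversion could look like (the function and the map are hypothetical; boost::locale::conv::between is the library's real general-purpose charset converter):

    #include <boost/locale/encoding.hpp>
    #include <map>
    #include <string>

    // Hypothetical: re-encode every catalog key from the catalog's charset
    // into the declared source encoding, so that keys compare byte-for-byte
    // against the program's string literals. Values are left untouched.
    void convert_keys(std::map<std::string, std::string> &messages,
                      std::string const &catalog_charset, // e.g. "UTF-8"
                      std::string const &source_encoding) // e.g. "windows-936"
    {
        std::map<std::string, std::string> converted;
        for (auto const &kv : messages) {
            std::string key = boost::locale::conv::between(
                kv.first, source_encoding, catalog_charset);
            converted.emplace(std::move(key), kv.second);
        }
        messages.swap(converted);
    }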
This isn't a property of the message catalog but rather a property of the program itself, and therefore, it would seem, it should be specified in the program, not in the message catalog. Something like the preprocessor define I mentioned would be a way to do this.
So if you are an MSVC user and you really want to have localized keys, you have the following options:
Option A:
---------

source.cpp: // without BOM, windows-936 encoded

    #pragma setlocale("Japanese_Japan.936")

    translate("平和"); // L"平和" works well

    wcout << translate("「平和」"); // converted at runtime from cp936 to UTF-16
    cout << translate("「平和」"); // converted at runtime from cp936 to UTF-8

[snip]
When you say the strings are "converted at runtime", it seems you actually mean that the keys will be converted from UTF-8 to cp936 when the messages are loaded, but the values will remain UTF-8. Untranslated strings would have to be converted, I suppose.
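
Presumably the untranslated fallback would look something like this sketch (the function and the map are hypothetical; to_utf is Boost.Locale's real converter): when the lookup fails, the cp936 key itself becomes the output and must be re-encoded for a UTF-8 stream:

    #include <boost/locale/encoding.hpp>
    #include <map>
    #include <string>

    // Hypothetical fallback path: translated values are stored as UTF-8,
    // but an untranslated key is still cp936-encoded and must be
    // converted before it can be written to a UTF-8 stream.
    std::string translate_or_fallback(
        std::map<std::string, std::string> const &messages,
        char const *msg) // cp936-encoded source literal
    {
        auto it = messages.find(msg);
        if (it != messages.end())
            return it->second; // already UTF-8
        return boost::locale::conv::to_utf<char>(msg, "windows-936");
    }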
Option B:
---------

source.cpp: // with BOM, UTF-8 encoded, still windows-936 locale

    #pragma setlocale("Japanese_Japan.936")

    translate("平和"); // MSVC would actually treat it as cp936; L"平和" works well

    wcout << translate("「平和」"); // converted at runtime from cp936 to UTF-16
    cout << translate("「平和」"); // converted at runtime from cp936 to UTF-8

[snip]
Okay, same as Option A, except that it is possible to specify wide literals using the full range of Unicode characters, rather than being limited to the local charset.
Option C (in future C++11):
---------

source.cpp: // with BOM, UTF-8 encoded

    translate(u8"平和"); // would be UTF-8; L"平和" works well

    wcout << translate(u8"「平和」"); // converted at runtime from UTF-8 to UTF-16
    cout << translate(u8"「平和」"); // just copied to the stream as-is
Clearly this is a good solution, if only it were supported.
Option D (works now):
---------

source.cpp: // without BOM, UTF-8 encoded

    translate("平和"); // MSVC would just use it as UTF-8; L"平和" does not work!!

[snip]
I think it is obvious this isn't a feasible solution, as this breaks wide string literals, which are likely to be needed by anyone using MSVC.
    wcout << translate("「平和」"); // converted at runtime from UTF-8 to UTF-16
    cout << translate("「平和」"); // just copied to the stream as-is
myprogram.po:

    msgid ""
    msgstr ""
    "Content-Type: text/plain; charset=UTF-8\n"
    # it would assume UTF-8 sources

    msgid "平和"
    msgstr "שלום"

    # not translated
    msgid "「平和」"
    msgstr ""
This can be done and I can implement it. But do not expect anything beyond this.
Also note that converting a message from cp936 to, for example, windows-1255 (the narrow Windows encoding for Hebrew) would strip out all the non-ASCII characters...
I'm not exactly sure why a conversion like this might happen, and it is also not clear that it is a serious problem. (Likely the Hebrew speaker would not be able to read Japanese anyway.)
But that is the problem of the developer who chose to use non-ASCII keys.
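
For what it's worth, the data loss is easy to demonstrate with Boost.Locale's real from_utf converter (the example goes via a UTF-8 literal purely so it can be written portably; with the library's default "skip" error method, characters that have no representation in windows-1255 are silently dropped):

    #include <boost/locale/encoding.hpp>
    #include <iostream>
    #include <string>

    int main()
    {
        // UTF-8 bytes of "平和 (peace)"
        std::string utf8 = "\xE5\xB9\xB3\xE5\x92\x8C (peace)";
        // Japanese has no representation in windows-1255, so the default
        // (skip) conversion method simply drops the kanji.
        std::string narrow =
            boost::locale::conv::from_utf(utf8, "windows-1255");
        std::cout << narrow << "\n"; // prints " (peace)"
        return 0;
    }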