[locale] Review results for Boost.Locale library

The formal review of Artyom Beilis' Boost.Locale library, originally scheduled to run from April 7th through the 16th and extended through the 22nd, is now finished.

Fifteen people cast votes on it, ten for acceptance and five against. Though not overwhelming, the two-to-one majority clearly indicates the consensus of the list. As such:

The Boost.Locale library IS ACCEPTED into Boost.

The details follow, starting with the voters in favor, in the order their reviews were received:

* John Bytheway
* Steven Watanabe
* Sebastian Redl
* Fabio Fracassi ("possibly conditional")
* Noah Roberts
* Steve Bush
* Volker Lukas
* Paul A. Bristow
* Matus Chochlik
* Gevorg Voskanyan

Those opposed:

* Ryou Ezoe
* Phil Endecott (but "borderline")
* Edward Diener
* Mathias Gaunard
* Vincente BOTET ("count my vote as 1/2 or 1/4 vote")

There was also an early off-list review from Darren Cook which did not include a vote, only listing issues.

There was a great deal of discussion around the library, and a number of issues detailed. The major ones (and some less major ones that were repeated by several reviewers) are summarized below, along with the initials of the reviewer(s) who brought them up and Artyom's responses. Note that although I followed the discussions closely, these are only the issues brought up in the formal reviews themselves.

* Issue: The date_time interface uses enums for periods, which is error-prone and inconsistent with other date/time libraries. (JB)
  Response: Will be addressed.

* Issue: The reference documentation needs more detail, or some things are not clear; some terms used in the documentation are not defined or are defined too briefly; the headers that items are defined in can't be found from the documentation. (JB, SW, FF, PE, ED, MC)
  Response: Will be addressed.

* Issue: There are no prev/next links in the documentation, or the tutorial can't easily be navigated. (DC, JB, SR, FF, MC)
  Response: Looking into it.

* Issue: Few examples include output. (JB)
  Response: Will be addressed.

* Issue: No examples in Asian languages; the library may have design flaws that are not apparent without them. (DC)
  Response: [Artyom has indicated to me (privately) that he's adding such examples. -- CN]

* Issue: The translation system requires narrow-character tags, making it English-centric. (RE, ED)
  Response: The library implements the most popular and most widely used message-catalog format. It is not perfect, but it is the best system currently available. There may be a work-around if you really must use wide-character languages under Windows.

* Issue: Support for wchar_t/UTF-16 is unclear. (RE, ED)
  Response: Wide characters are fully supported.

* Issue: Doesn't support the Win32 API as a backend. (RE)
  Response: A misunderstanding; the Win32 API backend is fully supported.

* Issue: boost::locale::format is not compatible with boost::format. (FF, NR)
  Response: boost::format is too limited for use in localization, and throws on any error, which means that a translator error could crash the program.

* Issue: boost::locale::date_time is not compatible with boost::date_time. (FF)
  Response: An unavoidable consequence of the differences between them, which are necessary due to support for locale independence and non-Gregorian calendars.

* Issue: Boundary analysis is only available when using ICU. (FF)
  Response: At present, only the ICU backend supports proper boundary analysis.

* Issue: Little documentation on the toolchain needed to extract strings and translate them, or the versions required. (FF, NR)
  Response: Will be addressed.

* Issue: Concerns about relying on GPL/LGPL-licensed tools, or their availability on all platforms, or recommendations to write a Boost version of these tools. (NR)
  Response: Reimplementing these is non-trivial and unnecessary; the licensing for these tools does not affect the programs developed with them. All are available for all platforms; explicit instructions for getting the latest versions for Windows will be added.

* Issue: The use of strings instead of symbols for language and encoding makes run-time errors out of what could be compile-time errors. The most common ones should be symbols. (PE)
  Response: There are dozens of character encodings, and even more locales, and no way to determine which ones are the most common. Not all encodings are supported by all backends or OS configurations. Names ignore case and non-alphanumeric characters, which should minimize errors that could be generated from them. utf_to_utf transcoding will be added.

* Issue: Error handling (in conversions) is very basic. (PE)
  Response: An unavoidable limitation of the backends.

* Issue: The code could use more commenting. (PE, VL)
  Response: Noted; will be addressed in the future.

* Issue: Some documentation phrasing is confusing, or could use a native English speaker's input. (SR, VL)
  Response: Will be addressed as discovered.

* Issue: There are no lists of valid language, country, encoding, or variant strings. (ED)
  Response: They are listed in the ISO-639 and ISO-3166 standards, which are referenced in the library's documentation. These standards are updated occasionally, and should be referred to directly for the latest information.

* Issue: Only works on contiguous, entirely-in-memory strings. (MG)
  Response: All current backends require this, and it satisfies the vast majority of use-cases.

* Issue: Boundary analysis goes through the entire string and returns a vector of positions. (MG)
  Response: Not perfect, but given the limitations of the existing backends, it is reasonable.

* Issue: The library's interface is not generic enough, or independent enough of the libraries that it wraps. (VB)
  Response: The interface is similar to that of every other i18n library, and should make as few assumptions as possible. It should not be changed.

* Issue: The date-time code should be merged into Boost.DateTime. (VB)
  Response: Date-time code is locale-dependent by its nature, and is more natural in Boost.Locale. Updating Boost.DateTime to do everything that Boost.Locale's does would require a lot of work, and in all the time it has existed, only the Gregorian calendar has been implemented. There are Boost libraries that overlap others, so this is not a novelty.

--
Chad Nelson
Oak Circle Software, Inc.

* * *

From: Chad Nelson <chad.thecomfychair@gmail.com>
The formal review of Artyom Beilis' Boost.Locale library, originally scheduled to run from April 7th through the 16th and extended through the 22nd, is now finished.
Fifteen people cast votes on it, ten for acceptance and five against. Though not overwhelming, the two-to-one majority clearly indicates the consensus of the list. As such:
The Boost.Locale library IS ACCEPTED into Boost.
I want to thank the Boosters who participated in the review and gave me good insights and recommendations on how to make the library better. I'm glad that the Boost community has accepted this library. I especially want to thank Chad Nelson for serving as review manager, solving many problems and issues on time, and for delivering the review result so quickly.

Thank you all,
Artyom

The most significant complaint seems to be the fact that the translation interface is limited to ASCII (or maybe UTF-8 is also supported; it isn't entirely clear). Even though various arguments have been made for using only ASCII text literals in the program, it seems that it would be relatively easy to support other languages. As someone else has mentioned, even if the text really is in English, ASCII may not be sufficient, as it may be desirable to include some special symbol (the copyright symbol, for instance), and having to deal with this by creating a translation from "ASCII English to appease the translation system" to "real English to display to users" would seem to be an unjustifiable additional burden. However, I don't think anyone is as familiar with the limitations of gettext-related tools as Artyom, so he is the best person to discuss exactly how this might be supported. Previously he briefly described a makeshift approach that required the use of a macro, which didn't seem like a legitimate solution.

It seems that xgettext (at least version 0.18.1, which I tested on my machine) supports non-ASCII program source provided that the --from-code option is given, so it seems that the user could keep the source code in any arbitrary character set/encoding and it would still work (xgettext simply converts the strings to UTF-8). It also appears to successfully extract strings that are specified with an L prefix, so that should not be a problem either. I suppose there is some question as to how well existing tools for translating the messages deal with non-ASCII, but as the tools can be improved fairly easily if necessary, I don't think this is a significant concern.

We can assume that the compiler knows the correct character set of the source code file, as trying to fool it would seem to be inherently error-prone.
This seems to rule out the possibility of char * literals containing UTF-8 encoded text on MSVC, until C++1x Unicode literals are supported. The biggest nuisance is that we need to know the compile-time character set/encoding (so that we know how to interpret "narrow" string literals), and there does not appear to be any standard way in which this is recorded (maybe I'm mistaken, though). However, it is easy enough for the user to simply specify this as a preprocessor define (the build system could add it to the compile flags, and it needs to be known anyway in order to invoke xgettext; presumably it would just be based on the active locale at the time the compiler is invoked). If none is specified, it could default to UTF-8 (this can also be used for greater efficiency in the case that the compile-time encoding is not UTF-8 but the source code happens to contain only ASCII messages). By knowing the compile-time character set, all ambiguity is removed.

The translation database can be assumed to be keyed on UTF-8, so to translate a message, it needs to be converted to UTF-8. There should presumably be versions of the translation functions that take narrow strings, wide strings, and additional versions for the C++1x Unicode literals once they are supported by compilers (I expect that to be very soon, at least for some compilers). If a wide string is specified, it will be assumed to be in UTF-16 or UTF-32 depending on sizeof(wchar_t), and converted to UTF-8. UTF-32 is generally undesirable, I imagine, but in practice should nonetheless work, and using wide strings might be the best approach for code that needs to compile on both Windows and Linux. For the narrow version, if the compile-time narrow encoding is UTF-8, the conversion is a no-op; otherwise, the conversion will have to be done. (The C++1x u8 literal version would naturally require no conversion either.)
Note that in the common case of UTF-8 narrow literals, which is the only case currently supported, there would be no performance penalty. The documentation could explicitly warn that there is a performance penalty for not using UTF-8, but I think this penalty is likely to be acceptable in many cases. If normalization proves to be an issue, then the conversion to UTF-8 could include normalization (perhaps controlled by another preprocessor definition) and the output of xgettext could also be normalized.

I imagine that relative to the work required for the whole library, these changes would be quite trivial, and might very well transform the library from completely unacceptable to acceptable for a number of objectors on the list, while having essentially no impact on those who are happy to use the library as is.

From: Jeremy Maitin-Shepard <jeremy@jeremyms.com>
The most significant complaint seems to be the fact that the translation interface is limited to ASCII (or maybe UTF-8 is also supported, it isn't entirely clear).
[snip]
I imagine relative to the work required for the whole library, these changes would be quite trivial, and might very well transform the library from completely unacceptable to acceptable for a number of objectors on the list, while having essentially no impact on those that are happy to use the library as is.
I can say a few words on what can be done and what will never be done. I will never support wide, char16_t, or char32_t strings as keys. The current interface provides a facet that has:

    template<typename CharType>
    class messages_facet {
        ...
        CharType const *get(int domain_id,char const *msg) const = 0;
        ...
    };

And 2 or 4 instantiations of it installed: messages_facet<char>, messages_facet<wchar_t>, messages_facet<char16_t> and messages_facet<char32_t>.

Supporting

    CharType const *get(int domain_id,char const *msg) const = 0;
    CharType const *get(int domain_id,wchar_t const *msg) const = 0;
    CharType const *get(int domain_id,char16_t const *msg) const = 0;
    CharType const *get(int domain_id,char32_t const *msg) const = 0;

is just a waste of memory, as each source string would have to be converted to 4 variants for the fastest comparison, or converted at runtime... Wasteful. Thus I would only consider supporting "char const *" literals.

One possibility is to provide, on a per-domain basis, a key in the po file, "X-Boost-Locale-Source-Encoding", so the user would be able to specify in the special record (which exists in all message catalogs) something like:

    "X-Boost-Locale-Source-Encoding: windows-936"

or

    "X-Boost-Locale-Source-Encoding: UTF-8"

Then when the catalog is loaded, its keys would be converted to the X-Boost-Locale-Source-Encoding.
So if you are an MSVC user and you really want to have localized keys, you have the following options:

Option A:
---------

source.cpp: // without BOM, windows-936 encoded

    #pragma setlocale("Japanese_Japan.936")
    translate("平和");
    // L"平和" works well
    wcout << translate("「平和」"); // converted at runtime from cp936 to UTF-16
    cout << translate("「平和」");  // converted at runtime from cp936 to UTF-8

myprogram.po:

    msgid ""
    msgstr ""
    "X-Boost-Locale-Source-Encoding: windows-936\n"
    "Content-Type: charset=UTF-8\n"

    msgid "平和"
    msgstr "שלום"

    # not translated
    msgid "「平和」"
    msgstr ""

Option B:
---------

source.cpp: // with BOM, UTF-8 encoded, still windows-936 locale

    #pragma setlocale("Japanese_Japan.936")
    translate("平和"); // MSVC would actually make this cp936
    // L"平和" works well
    wcout << translate("「平和」"); // converted at runtime from cp936 to UTF-16
    cout << translate("「平和」");  // converted at runtime from cp936 to UTF-8

myprogram.po:

    msgid ""
    msgstr ""
    "X-Boost-Locale-Source-Encoding: windows-936\n"
    "Content-Type: charset=UTF-8\n"

    msgid "平和"
    msgstr "שלום"

    # not translated
    msgid "「平和」"
    msgstr ""

Option C (in future C++11):
---------

source.cpp: // with BOM, UTF-8 encoded

    translate(u8"平和"); // would be UTF-8
    // L"平和" works well
    wcout << translate(u8"「平和」"); // converted at runtime from UTF-8 to UTF-16
    cout << translate(u8"「平和」");  // just copied to the stream as is

myprogram.po:

    msgid ""
    msgstr ""
    "Content-Type: charset=UTF-8\n" # it would assume UTF-8 sources

    msgid "平和"
    msgstr "שלום"

    # not translated
    msgid "「平和」"
    msgstr ""

Option D (works now):
---------

source.cpp: // without BOM, UTF-8 encoded

    translate("平和"); // MSVC would use it as UTF-8
    // L"平和" does not work!!
    wcout << translate("「平和」"); // converted at runtime from UTF-8 to UTF-16
    cout << translate("「平和」");  // just copied to the stream as is

myprogram.po:

    msgid ""
    msgstr ""
    "Content-Type: charset=UTF-8\n" # it would assume UTF-8 sources

    msgid "平和"
    msgstr "שלום"

    # not translated
    msgid "「平和」"
    msgstr ""

This can be done and I can implement it. But do not expect anything beyond this.

Also note that converting a message from cp936 to, for example, windows-1255 (the Hebrew narrow Windows encoding) would swap out all non-ASCII characters... But that is the problem of the developer who chose to use non-ASCII keys.

Artyom

On 04/25/2011 11:56 PM, Artyom wrote:
From: Jeremy Maitin-Shepard<jeremy@jeremyms.com>
The most significant complaint seems to be the fact that the translation interface is limited to ASCII (or maybe UTF-8 is also supported, it isn't entirely clear).
[snip]
I imagine relative to the work required for the whole library, these changes would be quite trivial, and might very well transform the library from completely unacceptable to acceptable for a number of objectors on the list, while having essentially no impact on those that are happy to use the library as is.
I can say a few words on what can be done and what will never be done.
I will never support wide, char16_t or char32_t strings as keys.
It seems that it is mostly possible to get the desired results using only char * strings as keys, but there is one limitation: it is not possible to represent strings containing characters that don't fit in a single non-Unicode character set, e.g. it seems it would not be possible to have a char * string literal with both Japanese and Hebrew text. As this is unlikely to be needed, it might be a reasonable limitation, though. However, I don't see why you are so opposed to providing additional overloads. With MSVC currently, only wide strings can represent the full range of Unicode. You could provide the definitions in an alternate static/dynamic library from the char * overloads, so that there would not even be any substantial space overhead.
The current interface provides a facet that has:

    template<typename CharType>
    class messages_facet {
        ...
        CharType const *get(int domain_id,char const *msg) const = 0;
        ...
    };

And 2 or 4 instantiations of it installed: messages_facet<char>, messages_facet<wchar_t>, messages_facet<char16_t> and messages_facet<char32_t>.
Supporting

    CharType const *get(int domain_id,char const *msg) const = 0;
    CharType const *get(int domain_id,wchar_t const *msg) const = 0;
    CharType const *get(int domain_id,char16_t const *msg) const = 0;
    CharType const *get(int domain_id,char32_t const *msg) const = 0;

is just a waste of memory, as each source string would have to be converted to 4 variants for the fastest comparison, or converted at runtime... Wasteful.

Thus I would only consider supporting "char const *" literals.
One possibility is to provide, on a per-domain basis, a key in the po file, "X-Boost-Locale-Source-Encoding", so the user would be able to specify in the special record (which exists in all message catalogs) something like:

    "X-Boost-Locale-Source-Encoding: windows-936"

or

    "X-Boost-Locale-Source-Encoding: UTF-8"

Then when the catalog is loaded, its keys would be converted to the X-Boost-Locale-Source-Encoding.
This isn't a property of the message catalog, but rather a property of the program itself, and therefore, it would seem, it should be specified in the program, not in the message catalog. Something like the preprocessor define I mentioned would be a way to do this.
So if you are an MSVC user and you really want to have localized keys, you have the following options:
Option A:
---------

source.cpp: // without BOM, windows-936 encoded

    #pragma setlocale("Japanese_Japan.936")
    translate("平和");
    // L"平和" works well
    wcout << translate("「平和」"); // converted at runtime from cp936 to UTF-16
    cout << translate("「平和」");  // converted at runtime from cp936 to UTF-8

[snip]
When you say "convert in runtime", it seems you actually mean the keys will be converted from UTF-8 to cp936 when the messages are loaded, but the values will remain UTF-8. Untranslated strings would have to be converted, I suppose.
Option B:
---------

source.cpp: // with BOM, UTF-8 encoded, still windows-936 locale

    #pragma setlocale("Japanese_Japan.936")
    translate("平和"); // MSVC would actually make this cp936
    // L"平和" works well
    wcout << translate("「平和」"); // converted at runtime from cp936 to UTF-16
    cout << translate("「平和」");  // converted at runtime from cp936 to UTF-8

[snip]
Okay, same as Option A, except that it is possible to specify wide literals using the full range of Unicode characters, rather than being limited to the local charset.
Option C (in future C++11):
---------

source.cpp: // with BOM, UTF-8 encoded

    translate(u8"平和"); // would be UTF-8
    // L"平和" works well
    wcout << translate(u8"「平和」"); // converted at runtime from UTF-8 to UTF-16
    cout << translate(u8"「平和」");  // just copied to the stream as is
Clearly this is a good solution, if only it were supported.
Option D (works now):
---------

source.cpp: // without BOM, UTF-8 encoded

    translate("平和"); // MSVC would use it as UTF-8
    // L"平和" does not work!!

[snip]
I think it is obvious this isn't a feasible solution, as this breaks wide string literals, which are likely to be needed by anyone using MSVC.
    wcout << translate("「平和」"); // converted at runtime from UTF-8 to UTF-16
    cout << translate("「平和」");  // just copied to the stream as is

myprogram.po:

    msgid ""
    msgstr ""
    "Content-Type: charset=UTF-8\n" # it would assume UTF-8 sources

    msgid "平和"
    msgstr "שלום"

    # not translated
    msgid "「平和」"
    msgstr ""
This can be done and I can implement it. But do not expect anything beyond this.
Also note that converting a message from cp936 to, for example, windows-1255 (the Hebrew narrow Windows encoding) would swap out all non-ASCII characters...
I'm not exactly sure why a conversion like this might happen, and it is also not clear that this is a serious problem. (Likely the Hebrew speaker would not be able to read Japanese anyway.)
But that is the problem of the developer who chose to use non-ASCII keys.

----- Original Message ----
From: Jeremy Maitin-Shepard <jeremy@jeremyms.com>

On 04/25/2011 11:56 PM, Artyom wrote:
From: Jeremy Maitin-Shepard<jeremy@jeremyms.com>
The most significant complaint seems to be the fact that the translation interface is limited to ASCII (or maybe UTF-8 is also supported, it isn't entirely clear).
[snip]
I imagine relative to the work required for the whole library, these changes would be quite trivial, and might very well transform the library from completely unacceptable to acceptable for a number of objectors on the list, while having essentially no impact on those that are happy to use the library as is.
I can say a few words on what can be done and what will never be done.
I will never support wide, char16_t or char32_t strings as keys.
It seems that it is mostly possible to get the desired results using only char * strings as keys [snip]
However, I don't see why you are so opposed to providing additional overloads. With MSVC currently, only wide strings can represent the full range of Unicode. You could provide the definitions in an alternate static/dynamic library from the char * overloads, so that there would not even be any substantial space overhead.
Here is how the catalog works: it searches for the key in a hash table and, as the last stage, compares the strings bytewise. It is fast and efficient. In order to support L"", "", u"", and U"" keys, I would need to create 4 variants of the same string to make sure lookup stays fast (a waste of memory), or I would need to convert the string from UTF-16/32 to UTF-8, which means run-time memory allocation and conversion. So no, I'm not going to do this, especially since it is not portable enough.
One possibility is to provide, on a per-domain basis, a key in the po file, "X-Boost-Locale-Source-Encoding", so the user would be able to specify in the special record (which exists in all message catalogs) something like:

    "X-Boost-Locale-Source-Encoding: windows-936"

or

    "X-Boost-Locale-Source-Encoding: UTF-8"

Then when the catalog is loaded, its keys would be converted to the X-Boost-Locale-Source-Encoding.
This isn't a property of the message catalog, but rather a property of the program itself, and therefore, it would seem, it should be specified in the program, not in the message catalog. Something like the preprocessor define I mentioned would be a way to do this.
The problem with a define is that I want translate("foo") to work automatically, not to be a define. So I either need to provide the encoding in the catalog itself, or when I provide the domain name (the reason it is done per domain is that one part of a project may use UTF-8, another cp936, and another plain US-ASCII). So I can specify it either when I load a catalog or in the catalog itself.
    wcout << translate("「平和」"); // converted at runtime from cp936 to UTF-16
    cout << translate("「平和」");  // converted at runtime from cp936 to UTF-8

[snip]
When you say "convert in runtime", it seems you actually mean the keys will be converted from UTF-8 to cp936 when the messages are loaded, but the values will remain UTF-8. Untranslated strings would have to be converted, I suppose.
Yes: when the catalog loads, the UTF-8 keys will be converted to cp936 for best performance, but at runtime the original untranslated keys have to be converted to the target locale.

Artyom

On 04/27/2011 12:07 AM, Artyom wrote:
Here is how the catalog works: it searches for the key in a hash table and, as the last stage, compares the strings bytewise.

It is fast and efficient.

In order to support L"", "", u"", and U"" keys, I would need to create 4 variants of the same string to make sure lookup stays fast (a waste of memory), or I would need to convert the string from UTF-16/32 to UTF-8, which means run-time memory allocation and conversion.

So no, I'm not going to do this, especially since it is not portable enough.
Why not simply provide a compile-time or run-time option to allow the user to specify the following:

- encoding of narrow keys to be given as char * arguments, or specify that none is to be supported (in which case narrow keys cannot be used at all), the default being UTF-8;
- whether wchar_t * arguments are supported (the encoding will be assumed to be UTF-16 or UTF-32 depending on sizeof(wchar_t)) [by default, not supported]
- whether char16_t arguments are supported [by default, not supported]
- whether char32_t arguments are supported [by default, not supported]

The library would simply convert the UTF-8 encoded keys in the message catalogs to each of the supported key argument encodings. In most cases, there would only be a single supported encoding. Because the narrow version could be disabled, with Japanese text and UTF-16 wchar_t this would actually _save_ space, since UTF-16 is more efficient than UTF-8 for encoding Japanese text.

More to the point, you as a library author can offer this functionality (since it shouldn't be too much of an implementation burden) even if you as a user of your library wouldn't want to use it (because you are happy to provide English string literals).

I agree that it is very unfortunate that wchar_t can mean either UTF-16 or UTF-32 depending on the platform, but in practice the same source code containing L"" string literals can be used on both Windows and Linux to reliably specify Unicode string literals (provided that care is taken to ensure the compiler knows the source code encoding). The fact that UTF-32 (which Linux tends to use for wchar_t) is space-inefficient does in some ways render Linux a second-class citizen if a solution based on wide string literals is used for portability, but using UTF-8 on MSVC is basically just impossible, rather than merely less efficient, so there doesn't seem to be another option. (Assuming you are unwilling to rely on the Windows "ANSI" narrow encodings.)
One possibility is to provide per-domain basis a key in po file "X-Boost-Locale-Source-Encoding" so user would be able to specify in special record (which exists in all message catalogs) something like:
"X-Boost-Locale-Source-Encoding: windows-936" or "X-Boost-Locale-Source-Encoding: UTF-8"
Then when the catalog would be loaded its keys would be converted to the X-Boost-Locale-Source-Encoding.
This isn't a property of the message catalog, but rather a property of the program itself, and therefore should be specified in the program, and not in the message catalog, it would seem. Something like the preprocessor define I mentioned would be a way to do this.
Two problem with define that I want
translate("foo") to work automatically and not being a define.
So I either need to provide an encoding in catalog itself or when I provide domain name (the reason it is done per domain name as one part of the project may use UTF-8 and other cp936 and other may use US-ASCII at all)
So I can either specify it when I load a catalog or in catalog itself.
Okay.

On 27/04/2011 21:42, Jeremy Maitin-Shepard wrote:
Why not simply provide a compile-time or run-time option to allow the user to specify the following:
- encoding of narrow keys to be given as char * arguments, or specify that none is to be supported (in which case narrow keys cannot be used at all), the default being UTF-8;
- whether wchar_t * arguments are supported (the encoding will be assumed to be UTF-16 or UTF-32 depending on sizeof(wchar_t)) [by default, not supported]
- whether char16_t arguments are supported [by default, not supported]
- whether char32_t arguments are supported [by default, not supported]
The library would simply convert the UTF-8 encoded keys in the message catalogs to each of the supported key argument encodings. In most cases, there would only be a single supported encoding. Because the narrow version could be disabled, with Japanese text and UTF-16 wchar_t, this would actually _save_ space since UTF-16 is more efficient than UTF-8 for encoding Japanese text.
Why is it so complicated? The user gives a string and says what encoding it is in; the library converts it to the catalog encoding and looks it up, then returns the localized string, converting again if needed.

Unlike what Artyom said earlier, converting a string does not necessarily require dynamic memory allocation, and localization is not particularly performance-critical anyway. If that runtime conversion is a concern, it's also possible to do it at compile time, at least with C++0x (the syntax is ugly in C++03).

Actually, I fail to understand what the problem is. Is it just the MSVC BOM problem? I think that should be handled by the build system.
I agree that it is very unfortunate that wchar_t can mean either UTF-16 or UTF-32 depending on the platform
How is that unfortunate? You can tell which one depending on the size of wchar_t.
but in practice the same source code containing L"" string literals can be used on both Windows and Linux to reliably specify Unicode string literals (provided that care is taken to ensure the compiler knows the source code encoding). The fact that UTF-32 (which Linux tends to use for wchar_t) is space-inefficient does in some ways render Linux a second-class citizen if a solution based on wide string literals is used for portability, but using UTF-8 on MSVC is basically just impossible, rather than merely less efficient, so there doesn't seem to be another option. (Assuming you are unwilling to rely on the Windows "ANSI" narrow encodings.)
You can always use a macro USTRING("foo") that expands to u8"foo" or u"foo" on systems with unicode string literals and L"foo" elsewhere.

On 04/27/2011 03:11 PM, Mathias Gaunard wrote:
On 27/04/2011 21:42, Jeremy Maitin-Shepard wrote:
Why not simply provide a compile-time or run-time option to allow the user to specify the following:
- encoding of narrow keys to be given as char * arguments, or specify that none is to be supported (in which case narrow keys cannot be used at all), the default being UTF-8;
- whether wchar_t * arguments are supported (the encoding will be assumed to be UTF-16 or UTF-32 depending on sizeof(wchar_t)) [by default, not supported]
- whether char16_t arguments are supported [by default, not supported]
- whether char32_t arguments are supported [by default, not supported]
The library would simply convert the UTF-8 encoded keys in the message catalogs to each of the supported key argument encodings. In most cases, there would only be a single supported encoding. Because the narrow version could be disabled, with Japanese text and UTF-16 wchar_t, this would actually _save_ space since UTF-16 is more efficient than UTF-8 for encoding Japanese text.
Why is it so complicated?
User gives string and says what encoding it is in, the library converts to the catalog encoding and looks it up, then returns the localized string, converting again if needed.
Unlike what Artyom said earlier, converting a string does not necessarily require dynamic memory allocation, and localization is not particularly performance critical anyway.
It may often not be performance critical. In some cases, it might be though. Consider the case of a web server, where the work done by the web server machines themselves may essentially just consist of pasting together strings from various sources. (There is possibly a separate database server, etc.) This is also precisely the use case for which Artyom designed the library, I think. In this setting it is fairly clear why converting the messages once when loaded is better than doing it when needed.
If that runtime conversion is a concern, it's also possible to do that at compile time, at least with C++0x (syntax is ugly in C++03).
Maybe it can be done, but I don't think it is a viable possibility.
Actually, I fail to understand what the problem is. Is it just the MSVC BOM problem? I think it should be handled by the build system.
I agree that it is very unfortunate that wchar_t can mean either UTF-16 or UTF-32 depending on the platform
How is that unfortunate? You can tell which one depending on the size of wchar_t.
It is unfortunate simply because it is not uniform, even though it is possible to work around that, and furthermore, it is unfortunate because UTF-32 is generally not wanted.
but in practice the same source code containing L"" string literals can be used on both Windows and Linux to reliably specify Unicode string literals (provided that care is taken to ensure the compiler knows the source code encoding). The fact that UTF-32 (which Linux tends to use for wchar_t) is space-inefficient does in some ways make render Linux a second-class citizen if a solution based on wide string literals is used for portability, but using UTF-8 on MSVC is basically just impossible, rather than merely less efficient, so there doesn't seem to be another option. (Assuming you are unwilling to rely on the Windows "ANSI" narrow encodings.)
You can always use a macro USTRING("foo") that expands to u8"foo" or u"foo" on systems with unicode string literals and L"foo" elsewhere.
You can, but it adds complexity, etc...

On 28/04/2011 00:32, Jeremy Maitin-Shepard wrote:
User gives string and says what encoding it is in, the library converts to the catalog encoding and looks it up, then returns the localized string, converting again if needed.
Unlike what Artyom said earlier, converting a string does not necessarily require dynamic memory allocation, and localization is not particularly performance critical anyway.
It may often not be performance critical. In some cases, it might be though. Consider the case of a web server, where the work done by the web server machines themselves may essentially just consist of pasting together strings from various sources. (There is possibly a separate database server, etc.) This is also precisely the use case for which Artyom designed the library, I think. In this setting it is fairly clear why converting the messages once when loaded is better than doing it when needed.
Converting between encodings without memory allocation could be even cheaper than concatenating strings.
If that runtime conversion is a concern, it's also possible to do that at compile time, at least with C++0x (syntax is ugly in C++03).
Maybe it can be done, but I don't think it is a viable possibility.
It could work if you only need it for short strings and you can spend time at compile time to do that conversion.
It is unfortunate simply because it is not uniform, even though it is possible to work around that, and furthermore, it is unfortunate because UTF-32 is generally not wanted.
It is uniform since it's always Unicode (except on some platforms that very few people care about).
but in practice the same source code containing L"" string literals can be used on both Windows and Linux to reliably specify Unicode string literals (provided that care is taken to ensure the compiler knows the source code encoding). The fact that UTF-32 (which Linux tends to use for wchar_t) is space-inefficient does in some ways make render Linux a second-class citizen if a solution based on wide string literals is used for portability, but using UTF-8 on MSVC is basically just impossible, rather than merely less efficient, so there doesn't seem to be another option. (Assuming you are unwilling to rely on the Windows "ANSI" narrow encodings.)
You can always use a macro USTRING("foo") that expands to u8"foo" or u"foo" on systems with unicode string literals and L"foo" elsewhere.
You can, but it adds complexity, etc...
How so? It solves exactly the problem you explained, i.e. avoiding wasting memory on UTF-32 when you can. If USTRING is too long, you can just use _U or something like that.

On 27/04/11 23:11, Mathias Gaunard wrote: <snip>
If that runtime conversion is a concern, it's also possible to do that at compile time, at least with C++0x (syntax is ugly in C++03).
Do you imagine that user-defined literals allow this? Sorry, but they don't, according to my reading of n3290 [lex.ext]. Only user-defined integer and floating-point literals allow compile-time access to the characters. Unless you're willing to write all strings in [0-9a-fA-F] (in which case you'd certainly be happy with ASCII!) this doesn't help much. If there's some other way to get compile-time access to characters more neatly than in C++03, then please share it; I would like to know! John Bytheway

On 29/04/2011 20:37, John Bytheway wrote:
On 27/04/11 23:11, Mathias Gaunard wrote: <snip>
If that runtime conversion is a concern, it's also possible to do that at compile time, at least with C++0x (syntax is ugly in C++03).
Do you imagine that user-defined literals allow this?
No.
If there's some other way to get compile-time access to characters more neatly than in C++03, then please share it; I would like to know!
In C++0x, "some_literal"[some_constant_integer_expression] is a constant expression.

On 25.04.2011 22:31, Jeremy Maitin-Shepard wrote:
We can assume that the compiler knows the correct character set of the source code file, as trying to fool it would seem to be inherently error prone. This seems to rule out the possibility of char * literals containing UTF-8 encoded text on MSVC, until C++1x Unicode literals are supported.
The biggest nuisance is that we need to know the compile-time character set/encoding (so that we know how to interpret "narrow" string literals), and there does not appear to be any standard way in which this is recorded (maybe I'm mistaken though).
The source character set is pretty much irrelevant. It's the execution character set that is problematic. A compiler will translate string literals in the source from the source character set to the execution character set for storage in the binary. GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worse problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding. So let's assume that further down, it's the execution set that's known.
By knowing the compile-time character set, all ambiguity is removed. The translation database can be assumed to be keyed based on UTF-8, so to translate a message, it needs to be converted to UTF-8. There should presumably be versions of the translation functions that take narrow strings, wide strings, and additional versions for the C++1x unicode literals once they are supported by compilers (I expect that to be very soon, at least for some compilers). If a wide string is specified, it will be assumed to be in UTF-16 or UTF-32 depending on sizeof(wchar_t), and converted to UTF-8. UTF-32 is generally undesirable, I imagine, but in practice should nonetheless work and using wide strings might be the best approach for code that needs to compile on both Windows and Linux. For the narrow version, if the compile-time narrow encoding is UTF-8, the conversion is a no-op. Otherwise, the conversion will have to be done. (The C++1x u8 literal version would naturally require no conversion also.)
The issue with making the narrow version automatically transcode the input from the narrow encoding to UTF-8 is that it is a compatibility issue with C++11 u8 literals. For some reason, there is no way in the type system to distinguish between normal narrow and u8 literals. In other words, if you ever make the translate() functions assume a narrow literal to be in the locale character set, you can't use u8 literals there anymore. Sebastian

On 26/04/2011 11:17, Sebastian Redl wrote:
GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worse problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding.
A long time ago, I asked Vladimir Prus to help me add an option to Boost.Build that would automatically prepend the BOM to source files when using MSVC, but unfortunately he was never able to help me do this. I think that with a Unicode library getting into Boost, this feature is becoming even more important.

From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
On 26/04/2011 11:17, Sebastian Redl wrote:
GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worse problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding.
A long time ago, I asked Vladimir Prus to help me add an option to Boost.Build that would automatically prepend the BOM to source files when using MSVC, but unfortunately he was never able to help me do this.
The problem is that even if the source is UTF-8 with a BOM, "שלום" would be encoded according to the locale's 8-bit codepage (such as 1255 or 936) and not as a UTF-8 string (codepage 65001). It is rather stupid, but this is how MSVC works, or how it understands the place of UTF-8 in this world. Unicode and Visual Studio is just broken... Artyom

On Tue, Apr 26, 2011 at 9:27 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
On 26/04/2011 11:17, Sebastian Redl wrote:
GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worse problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding.
A long time ago, I asked Vladimir Prus to help me add an option to Boost.Build that would automatically prepend the BOM to source files when using MSVC, but unfortunately he was never able to help me do this.
The problem is that even if the source is UTF-8 with a BOM, "שלום" would be encoded according to the locale's 8-bit codepage (such as 1255 or 936) and not as a UTF-8 string (codepage 65001).
It is rather stupid, but this is how MSVC works, or how it understands the place of UTF-8 in this world.
It's not stupid. It's because the ANSI versions of the Win32 API expect these encodings. To me, having the encoding of ordinary string literals follow the source file's encoding is the stupid idea.
Unicode and Visual Studio is just broken...
Artyom _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- Ryou Ezoe

On 26/04/2011 14:27, Artyom wrote:
The problem is that even if the source is UTF-8 with a BOM, "שלום" would be encoded according to the locale's 8-bit codepage (such as 1255 or 936) and not as a UTF-8 string (codepage 65001).
It is rather stupid, but this is how MSVC works, or how it understands the place of UTF-8 in this world.
Unicode and Visual Studio is just broken...
That's not broken; this is the expected behaviour. The execution character set is necessarily ANSI with that compiler, and the compiler performs the source character set to execution character set conversion as expected. To be able to put UTF-8 in string literals, you should use Unicode string literals (C++0x only) or wide string literals (but then you end up with UTF-16).

On Tue, Apr 26, 2011 at 9:27 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
On 26/04/2011 11:17, Sebastian Redl wrote:
GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worse problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding.
A long time ago, I asked Vladimir Prus to help me add an option to Boost.Build that would automatically prepend the BOM to source files when using MSVC, but unfortunately he was never able to help me do this.
The problem is that even if the source is UTF-8 with a BOM, "שלום" would be encoded according to the locale's 8-bit codepage (such as 1255 or 936) and not as a UTF-8 string (codepage 65001).
It is rather stupid, but this is how MSVC works, or how it understands the place of UTF-8 in this world.
Unicode and Visual Studio is just broken...
I seriously question the author's ability to understand the real-world situation. This library is not only useless, but also harmful for localization: it encourages people to use ASCII. The reason there are so many ASCII-compatible encodings is, I think, partly quick workarounds. A lot of existing code expected ASCII, and Unicode was not a viable solution at the time, so in order to handle their language people created encodings that were compatible with ASCII. It worked most of the time. No matter how firmly you say "this library expects ASCII input and it's the programmer's responsibility to pass ASCII; anything else deserves to be broken", people will use these ASCII-compatible encodings with existing code, because it works most of the time. They want to use their language, and they want an encoding which can express their language, so they use ASCII-compatible encodings wherever ASCII is expected. We have to get rid of ASCII. What a shame that a localization library expects ASCII input.
Artyom
-- Ryou Ezoe

On Tue, Apr 26, 2011 at 9:27 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
On 26/04/2011 11:17, Sebastian Redl wrote:
GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worse problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding.
A long time ago, I asked Vladimir Prus to help me add an option to Boost.Build that would automatically prepend the BOM to source files when using MSVC, but unfortunately he was never able to help me do this.
The problem is that even if the source is UTF-8 with a BOM, "שלום" would be encoded according to the locale's 8-bit codepage (such as 1255 or 936) and not as a UTF-8 string (codepage 65001).
It is rather stupid, but this is how MSVC works, or how it understands the place of UTF-8 in this world.
Unicode and Visual Studio is just broken...
The real obstacle for localization is not software that wasn't programmed with localization in mind. We can replace hard-coded text (whether the source code is provided or not); it's a tedious but straightforward task. Most programs don't need a runtime language switch anyway, so hard-coded text is all right. The real obstacle is ASCII. If ASCII is used instead of UTF-8, UTF-16, or UTF-32, we have to use an ASCII-compatible encoding; on Windows, for Japanese, that's CP932 (the Microsoft variant of Shift-JIS). In that case we can't translate a program by simply replacing the text. Because Windows can't tell which encoding it is (it can be anything; we can't detect it heuristically), we have to specify it explicitly. For example, the fdwCharSet argument of every CreateFont API call must be changed to SHIFTJIS_CHARSET. On Windows, software should use UTF-16. If a locale library expects ASCII input, even though it supports wchar_t output, I wonder how many people will actually use it. By accepting only ASCII input, this library encourages the use of ASCII: another obstacle for real-world localization.
Artyom
-- Ryou Ezoe

The problem is that even if the source is UTF-8 with a BOM, "שלום" would be encoded according to the locale's 8-bit codepage (such as 1255 or 936) and not as a UTF-8 string (codepage 65001).
It is rather stupid, but this is how MSVC works, or how it understands the place of UTF-8 in this world.
Unicode and Visual Studio is just broken...
The real obstacle for localization is not software that wasn't programmed with localization in mind. We can replace hard-coded text (whether the source code is provided or not); it's a tedious but straightforward task. Most programs don't need a runtime language switch anyway, so hard-coded text is all right.
Really? If so, you probably haven't read the manuals or done any real localization task beyond simply converting several strings from one language to another. And no, that isn't called localization; there are way too many things beyond it. This remark, and other remarks, for example the fact that you said "plural forms should not be used" at some point in the review, make me doubt whether you have any real experience with localizing software at all.
The real obstacle is ASCII. If ASCII is used instead of UTF-8, UTF-16, or UTF-32, we have to use an ASCII-compatible encoding; on Windows, for Japanese, that's CP932 (the Microsoft variant of Shift-JIS).
You keep mixing up US-ASCII and the Windows "ANSI encoding" - they are different things.
In that case we can't translate a program by simply replacing the text. Because Windows can't tell which encoding it is (it can be anything; we can't detect it heuristically), we have to specify it explicitly.
For example, the fdwCharSet argument of every CreateFont API call must be changed to SHIFTJIS_CHARSET.
On Windows, software should use UTF-16. If a locale library expects ASCII input, even though it supports wchar_t output, I wonder how many people will actually use it.
I don't understand how you still don't get what I have written many, many times: the input is a narrow string, but the output can be either a narrow or a wide string, according to your needs.
By accepting only ASCII input, this library encourages the use of ASCII: another obstacle for real-world localization.
US-ASCII is used as the baseline of all encodings, so the software is independent of the various ways different compilers decide what encoding should be used. Ryou, before you continue this "Crusade" against Boost.Locale, please read the other post I added on how I can implement a way to handle natural-language-specific keys. You hadn't even responded to it. So please, until you really read things deeply and try to understand the rationale (even if you disagree), don't bother to ask again, because I can't talk to someone who does not respond to my posts on the list. In the other mail I provided a solution for how to make
std::wstring something = wgettext("平和");
or
std::wstring something = wgettext(u8"平和"); // when MSVC will support it
work correctly, without any problems, in a transparent way. And I said that I will implement this in Boost.Locale even though I think it is wrong from a localization point of view. But you are still talking about "bad-ascii-boost-locale". So enough is enough. I'm willing to have constructive discussions, but I'm not going to listen any more until you change your attitude. I hope I haven't offended you in any way, but I just can't continue the discussion the way it is. Best Regards, Artyom

Before you continue this "Crusade" against Boost.Locale,
My view of this affair is as follows: Artyom advocates universal use of UTF-8 - see: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf... I believe this will be deeply unpopular with Asian programmers who, for whatever reasons, hate UTF-8 and love UTF-16. Artyom has also said that MSVC is "broken" and (as I understand it) that all programmers should use ASCII for locale's gettext translate. However, just because boost is a portable library and boost locale is used for localisation doesn't mean that programmers are writing *portable applications*. In other words, people may well wish to USE boost locale in a UTF-16 environment, e.g. Windows, and have nothing to do with UTF-8 or ASCII. As I understand it, those people are not well supported by the gettext bit of locale. I don't think that can be resolved at this stage, but one should not try to present a necessity as a virtue - which will be resented.

From: Steve Bush <sb2@neosys.com>
Before you continue this "Crusade" against Boost.Locale,
My view of this affair is as follows:
Artyom advocates universal use of UTF-8 - see:
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...
I believe this will be deeply unpopular with Asian programmers who, for whatever reasons, hate UTF-8 and love UTF-16.
That does not mean that the library does not fully support wide/UTF-16 strings. It makes a lot of effort to support them.
Artyom has also said that MSVC is "broken" and (as I understand it) that all programmers should use ASCII for locale's gettext translate.
I've mentioned more than once that ASCII keys are not for a technical reason but for a much deeper, localization-related reason. I could technically add support for L"" and whatever-you-want as gettext keys, but that would be wrong, the same way it is wrong today to use "gets()". The real reason is not technical, and not even the fact that MSVC does not know how to create a proper UTF-8 literal; the reason is much deeper - a linguistic reason, a usability reason, even for pure Windows developers. I have written up this reason more than once and I will not repeat it over and over again.
However, just because boost is a portable library and boost locale is used for localisation doesn't mean that programmers are writing *portable applications*. In other words, people may well wish to USE boost locale in a UTF-16 environment, e.g. Windows, and have nothing to do with UTF-8 or ASCII.
If I were developing this library for "portable applications" only, I would not provide wide character support at all, as it is very unportable. But I made a huge effort (that, by the way, I didn't originally need) to support the Windows development style.
As I understand it, those people are not well supported by the gettext bit of locale.
gettext catalogs have nothing to do with the actual string encodings that can be used (for both ids and outputs), as I mentioned in the other mail I sent. Bottom line: sometimes design decisions should be made to make software better, even if not all potential users like them or understand the rationale right now. Best, Artyom

Artyom, I am simply pointing out that dropping the "UTF-16 is harmful/MSVC is broken" rhetoric might avoid *unnecessarily* prejudicing Asian programmers against Boost.Locale. I emphasise *unnecessarily* because, with the possible exception of gettext, there does not actually seem to be any problem with the library for Asian programmers! Many thanks for the wonderful library.

Artyom wrote:
I've mentioned more than once that ASCII keys are not for a technical reason but for a much deeper, localization-related reason.
I could technically add support for L"" and whatever-you-want as gettext keys, but that would be wrong, the same way it is wrong today to use "gets()". The real reason is not technical, and not even the fact that MSVC does not know how to create a proper UTF-8 literal; the reason is much deeper - a linguistic reason, a usability reason, even for pure Windows developers.
I have written up this reason more than once and I will not repeat it over and over again.
Do you think you could add that reason to the documentation? Maybe it's there, but I did not notice it on a quick look. Thanks, -- Vladimir Prus Mentor Graphics +7 (812) 677-68-40

On 4/27/2011 12:57 AM, Artyom wrote:
So enough is enough. I'm willing to do constructive discussions but I'm not going to listen any more till you'll change the attitude.
I hope I hadn't offended you in any way, but I just can't continue the discussion the way it is.
Best Regards, Artyom
I actually applaud you for being more patient than I would have been, given Ryou's persistently hostile tone. Although you may be reaping a bit of the karmic hostility you sowed with some of your posts on the boost-build list. ;) Claims like "non-Western programmers will never use this library" are out of place on this list. I am a total novice at localization, but there are a whole lot of "non-Western" programmers who know English/ASCII well enough to use it for keys (I speculate the majority do). Indians in particular have considerable English proficiency, and there are quite a lot of them. I chatted with a Chinese grad student here, and his impression is that China and South Korea probably follow Japan's pattern of having many programmers without English proficiency, so Ryou does have a point, but could have made it in a much nicer way. Specifically, as Steve Bush said earlier, the idea of a program localized only for the east Asian markets is plausible. But at the same time I think it's perfectly acceptable for boost.locale to NOT target that particular use case and instead go for (I assume) the much larger use case of programs that are intended for ALL markets. It would be very informative to see statistics about programmers: their primary development language and secondary language proficiencies. This is a hard thing to google because it thinks I'm referring to programming languages. :( -Matt

On 04/27/2011 10:49 AM, Matthew Chambers wrote:
Claims like "non-Western programmers will never use this library" are out of place on this list.
I disagree. It's a relevant data point, even if you don't agree with the reasoning behind it.
I am a total novice at localization but there are a whole lot of "non-Western" programmers who know English/ASCII well enough to use it as keys (I speculate the majority do). Indians in particular have considerable English proficiency and there are quite a lot of them. I chatted with a Chinese grad student here and his impression is that China and South Korea probably follow Japan's pattern of having many programmers without English proficiency, so Ryou does have a point but could have made it in a much nicer way.
I think you're missing his point. Ryou spells it out for you here:
The real obstacle is ASCII. If ASCII is used instead of UTF-8, UTF-16, or UTF-32, we have to use ASCII compatible encoding. In Windows and for Japanese, it's CP932(Microsoft variant of Shift-JIS).
In that case, we can't translate a program simply replacing the text. Because Windows can't tell which encoding it is(it can be anything, we can't detect it heuristically), we have to explicitly specify it.
The pre-Unicode encoding schemes for program text sucked and he's not going back.
On 04/27/2011 10:49 AM, Matthew Chambers wrote:
Specifically, as Steve Bush said earlier, the idea of a program localized only for the east Asian markets is plausible. But at the same time I think it's perfectly acceptable for boost.locale to NOT target that particular use case and instead go for (I assume) the much larger use case of programs that are intended for ALL markets.
As long as everyone recognizes at the outset that it's not going to make everyone happy, possibly to the point of it not being used by those with particular requirements (i.e. Asia).
It would be very informative to see statistics about programmers: their primary development language and secondary language proficiencies.
Of course they know how to spell things in English. We have, what, 26 or 52 characters and they learn 2000 or so for professional work. In the context of program text, Japanese identifiers may be easier to spell in 7-bit ASCII than they are in Japanese. But oversimplification seems to carry connotations, too. Asian text appears to very quickly diverge from what we think of as "plain text" into what we would consider "desktop publishing". It's probably not something that can be done halfway, yet Ryou is (understandably) unwilling to go back to pre-Unicode. I see no reason to think that Ryou is not representative of Asian developers. If we want this library to be useful to them, we probably need to make Ryou happy. - Marsh

Marsh Ray wrote:
On 04/27/2011 10:49 AM, Matthew Chambers wrote:
Claims like "non-Western programmers will never use this library" are out of place on this list.
I disagree.
It's a relevant data point, even if you don't agree with the reasoning behind it.
It's a relevant piece of information, but I think the point was that there are many non-Western programmers (for some definition of "Western") that would use the library. Therefore, a modified form of the assertion would have been better: "No Asian programmers I know of will ever use this library." That might not be as strong, though, unless Ryou also gave scope to the number of Asian programmers he knows that would think similarly. Anyway, I think the concern was for the statement's tone and hyperbole. _____ Rob Stewart robert.stewart@sig.com Software Engineer using std::disclaimer; Dev Tools & Components Susquehanna International Group, LLP http://www.sig.com

On 4/27/2011 11:58 AM, Stewart, Robert wrote:
Marsh Ray wrote:
On 04/27/2011 10:49 AM, Matthew Chambers wrote:
Claims like "non-Western programmers will never use this library" are out of place on this list.
I disagree.
It's a relevant data point, even if you don't agree with the reasoning behind it.
It's a relevant piece of information, but I think the point was that there are many non-Western programmers (for some definition of "Western") that would use the library. Therefore, a modified form of the assertion would have been better: "No Asian programmers I know of will ever use this library." That might not be as strong, though, unless Ryou also gave scope to the number of Asian programmers he knows that would think similarly.
Anyway, I think the concern was for the statement's tone and hyperbole.
Yes, the hostile tone and hyperbole are what is out of place, not the assertion. The assertion itself is factually wrong for the general (non-strict) geographical definition of "Western" since India has a large English proficient population and isn't geographically "Western." If we go beyond geographical definitions, Japan is more Western than India (http://www.henryjacksonsociety.org/stories.asp?id=1171). It would have been much simpler and more accurate to write "east Asian" - presumably he didn't do that in the first place to amplify the hyperbole. -Matt

On 04/27/2011 11:58 AM, Stewart, Robert wrote:
Marsh Ray wrote:
On 04/27/2011 10:49 AM, Matthew Chambers wrote:
Claims like "non-Western programmers will never use this library" are out of place on this list.
I disagree.
It's a relevant data point, even if you don't agree with the reasoning behind it.
It's a relevant piece of information, but I think the point was that there are many non-Western programmers (for some definition of "Western") that would use the library. Therefore, a modified form of the assertion would have been better: "No Asian programmers I know of will ever use this library." That might not be as strong, though, unless Ryou also gave scope to the number of Asian programmers he knows that would think similarly.
My guess is there's probably not a lot of diversity of opinion on this point, at least in Japan.
Anyway, I think the concern was for the statement's tone and hyperbole.
You're talking to a guy whose native language has multiple independent dimensions in the grammar itself for politeness, respect, and formality. Just imagine what kind of linguistic jackhammer we sound like to him. You should probably consider it an honor to be receiving direct criticism under the circumstances. Or at least you should look harder for a way to interpret his statements at face value and give him any possible benefit of the doubt.

I see him saying things like:

On 04/26/2011 08:47 PM, Ryou Ezoe wrote:
I seriously concerns the author's ability to understand the real world situation. This library is not only useless, but also harmful for localization. It encourage people to use ASCII.
To someone from a Western alphabetic language... not knowing how to print characters... sure, this might sound like he was accusing you of not having graduated first grade. But in Japan they are studying new characters up through high school in a regimented, standardized way. The character set you use has a very well-defined relationship with your level of education, more so even than your choice of word usage in English. I think you really don't understand the reality of his requirements.

Within recent memory, systems in Asia used odd mixtures of multi-byte regional encodings like Shift-JIS. These guys have had a long history of trying to fit professional-looking text into encodings that were designed to work better elsewhere. Unicode (in the form of UTF-16) came along and surely made everyone's life easier. But still, if the compiler doesn't handle UTF-8, then the compiler doesn't handle UTF-8. It's that simple. Yes, it's possible to represent Japanese text in alternate character sets, but it "looks wrong", and in this case the appearance carries semantic value. This stuff is very difficult for an outsider to get right - or even appreciate how far wrong they are. In English text we really have very little to compare with to understand how this looks. Here's how I imagine it:

* Imagine receiving a résumé from a job applicant printed in block letters. In crayon.
* Imagine applying to work for a Japanese company, but because you don't know the proper Japanese characters, you simply format your text like this: http://www.dafont.com/theme.php?cat=201
* Imagine you won a contract to build a sign for a Mexican restaurant, but you order the wrong letters and use German gothic script lettering: http://en.wikipedia.org/wiki/Blackletter
* http://i.imgur.com/a3jQY.png
* http://bit.ly/hVNGux

It seems very reasonable to me that Ryou would be skeptical that such a library would end up being useful to him.
Furthermore, if he uses it rather than something more established, and his program outputs something unprofessional, who is that going to reflect on? Don't get me wrong, I think it would be awesome if you guys were able to create something that was well-regarded everywhere. But if you make something to your own requirements, don't act so surprised when not everyone rushes to use it. - Marsh

On 04/27/2011 11:46 AM, Marsh Ray wrote:
[comments about ASCII requirement]
A suggestion was made at one point that an ASCII transliteration of Japanese would be a possible solution, and I think most of us can agree that it should be rejected immediately as a non-viable solution. However, forcing MSVC users to use one of the Windows "ANSI" encodings, such as windows-936, wouldn't be nearly such an impediment.

On 04/27/2011 02:23 PM, Jeremy Maitin-Shepard wrote:
On 04/27/2011 11:46 AM, Marsh Ray wrote:
[comments about ASCII requirement]
A suggestion was made at one point that an ASCII transliteration of Japanese would be a possible solution, and I think most of us can agree that it should be rejected immediately as a non-viable solution.
However, forcing MSVC users to use one of the Windows "ANSI" encodings, such as windows-936, wouldn't be nearly such an impediment.
I've worked at places where I couldn't use certain Boost libraries because I couldn't change the necessary compiler setting to work around a well-documented bug in MSVC. At this place they also questioned whitespace changes which were not even affecting visible formatting. Thankfully I don't work there any more.

I don't know how much of a change it is for a software developer in Japan to convince his team to change their source file encoding so they can use some library. It's probably non-trivial. What I do know, however, is that the set of users who are candidates for ending up as happy users of a library is always much greater when the library developers don't think they can "force" users to adopt specific compiler settings or agree to other nonstandard constraints.

- Marsh

On 04/27/2011 03:05 PM, Marsh Ray wrote:
On 04/27/2011 02:23 PM, Jeremy Maitin-Shepard wrote:
On 04/27/2011 11:46 AM, Marsh Ray wrote:
[comments about ASCII requirement]
A suggestion was made at one point that an ASCII transliteration of Japanese would be a possible solution, and I think most of us can agree that it should be rejected immediately as a non-viable solution.
However, forcing MSVC users to use one of the Windows "ANSI" encodings, such as windows-936, wouldn't be nearly such an impediment.
I've worked at places where I couldn't use certain Boost libraries because I couldn't change the necessary compiler setting to work around a well-documented bug in MSVC. At this place they also questioned whitespace changes which were not even affecting visible formatting. Thankfully I don't work there any more.
I don't know how much of a change it is for a software developer in Japan to convince his team to change their source file encoding so they can use some library. It's probably non-trivial.
What I do know, however, is that the set of users who are candidates for ending up as happy users of a library is always much greater when the library developers don't think they can "force" users to adopt specific compiler settings or agree to other nonstandard constraints.
I'm not talking about the encoding of the source, but about whether L"" literals are used or not. (If "narrow" literals are used in MSVC, they will be in the execution character set, which is never UTF-8 and in the case of Japanese users might likely be windows-936.) Regardless, though, it seems that Artyom has proposed to support wchar_t and (once supported by the compiler) char16_t and char32_t literals, with the only caveat that the literal type must match the output type (which is kind of an unnecessary limitation but simplifies the interface and in practice is unlikely to be a problem).

On 4/27/2011 1:46 PM, Marsh Ray wrote:
Anyway, I think the concern was for the statement's tone and hyperbole.
You're talking to a guy whose native language has multiple independent dimensions in the grammar itself for politeness, respect, and formality. Just imagine what kind of linguistic jackhammer we sound like to him. You should probably consider it an honor to be receiving direct criticism under the circumstances. Or at least you should look harder for a way to interpret his statements at face value and give him any possible benefit of the doubt.

No matter what I imagine, it does not tell me if we actually sound impolite to him or not. If he does misunderstand us as being impolite, are you suggesting that it's appropriate (on this list) for him to respond in kind (as opposed to sending a message like "Everybody seems pretty rude on this list!")?
I see him saying things like:
On 04/26/2011 08:47 PM, Ryou Ezoe wrote:
I seriously concerns the author's ability to understand the real world situation. This library is not only useless, but also harmful for localization. It encourage people to use ASCII.
To someone from a Western alphabetic language... not knowing how to print characters... sure, this might sound like he was accusing you of not having graduated first grade.

More hyperbole? :) It doesn't sound like he's accusing Artyom of not graduating first grade. He is accusing him of not understanding localization in general, when in fact CJK is a specific use case for localization, not the general case. And Artyom clearly has a very thorough understanding of localization, at least for C++ and possibly outside the CJK use cases. Suggesting otherwise is outright hostile in any language, even Klingon.
Within recent memory, systems in Asia used odd mixtures of multi-byte regional encodings like Shift-JIS. These guys have had a long history of trying to fit professional-looking text into encodings that were designed to work better elsewhere. Unicode (in the form of UTF-16) comes along and surely made everyone's life easier. But still, if the compiler doesn't handle UTF-8, then the compiler doesn't handle UTF-8. It's that simple.
Yes, it's possible to represent Japanese text in alternate character sets but it "looks wrong", and in this case the appearance carries semantic value. This stuff is very difficult for an outsider to get right - or even appreciate how far wrong they are.

I appreciate the potential for frustration, but it's not Artyom's fault. In fact the results of the review suggest (to me at least) that he's already gone above and beyond the call of duty for the general localization case. The English language is the best thing the British did for India.
In English text we really have very little to compare with to understand how this looks. Here's how I imagine it:

I am amused by these examples but they clearly confound the user experience with the developer experience. It seems obvious to me that it is entirely possible to use boost.locale to create an identical user experience. Ryou was arguing that it would be (much) harder for many Japanese programmers to do so.
Furthermore, if he uses it rather than something more established, and his program outputs something unprofessional, who is that going to reflect on?

The developer. The only thing that could possibly be blamed on the library author is the difficulty of making the output look professional, not the fact that the developer didn't expend the effort to use it properly.
To be clear, I value the subset of Ryou's criticism that is constructive. This whole library review has been educational to me and makes me glad that all my users know English whether it's their primary language or not. ;) -Matt

On 4/26/2011 6:41 AM, Mathias Gaunard wrote:
On 26/04/2011 11:17, Sebastian Redl wrote:
GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worst problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding.
A long time ago, I asked Vladimir Prus to help me add an option to Boost.Build that would allow automatically prepending the BOM to source files when using MSVC, but unfortunately he was never able to help me do this.
I think that with a Unicode library getting in Boost, this feature is becoming even more important.
Do you mean editing the source file in-place? That would be pure crazy. On the other hand, copying the source file to the build directory while prepending the BOM and then building from that would be pretty cool! Which one of these were you wanting? -Matt

On 26/04/2011 21:56, Matthew Chambers wrote:
Do you mean editing the source file in-place?
No.
On the other hand, copying the source file to the build directory while prepending the BOM and then building from that would be pretty cool! Which one of these were you wanting?
It should work like any other preprocessing step. It's not any different than running the Qt MOC tool for example.

Mathias Gaunard wrote:
On 26/04/2011 11:17, Sebastian Redl wrote:
GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worst problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding.
A long time ago, I asked Vladimir Prus to help me add an option to Boost.Build that would allow automatically prepending the BOM to source files when using MSVC, but unfortunately he was never able to help me do this.
Well, if you have a command that can prepend BOM to a file, you can easily modify 'actions compile-c-c++' in msvc.jam to run that command.

- Volodya
--
Vladimir Prus
Mentor Graphics
+7 (812) 677-68-40

On 30/04/2011 18:45, Vladimir Prus wrote:
Mathias Gaunard wrote:
On 26/04/2011 11:17, Sebastian Redl wrote:
GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worst problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding.
A long time ago, I asked Vladimir Prus to help me add an option to Boost.Build that would allow automatically prepending the BOM to source files when using MSVC, but unfortunately he was never able to help me do this.
Well, if you have a command that can prepend BOM to a file, you can easily modify 'actions compile-c-c++' in msvc.jam to run that command.
It would be nice if I could only do this when the source files have been tagged as utf-8 or something like that.

Mathias Gaunard wrote:
On 30/04/2011 18:45, Vladimir Prus wrote:
Mathias Gaunard wrote:
On 26/04/2011 11:17, Sebastian Redl wrote:
GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worst problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding.
A long time ago, I asked Vladimir Prus to help me add an option to Boost.Build that would allow automatically prepending the BOM to source files when using MSVC, but unfortunately he was never able to help me do this.
Well, if you have a command that can prepend BOM to a file, you can easily modify 'actions compile-c-c++' in msvc.jam to run that command.
It would be nice if I could only do this when the source files have been tagged as utf-8 or something like that.
Well, it would be trivial to implement syntax like:

utf8cpp file : file.cpp ;
exe whatever : whatever.cpp file ;

If that's what you're asking for. In fact, here's a complete example that should almost work, in any Jamfile:

import type ;
type.register UTF8CPP : : CPP ;

import generators ;
generators.register-standard $(__name__).add-bom : CPP : UTF8CPP ;

actions add-bom
{
    add-bom $(>) -o $(<)
}

utf8cpp file : file.cpp ;
exe whatever : whatever.cpp file ;

I say "almost" because somebody who is actually interested in all this should write the 'add-bom' utility, test everything, and do various other boring things, like sending a patch and/or checking in. I think the ball is now in your court.

- Volodya
--
Vladimir Prus
Mentor Graphics
+7 (812) 677-68-40

On 01/05/2011 09:32, Vladimir Prus wrote:
Well, it would be trivial to implement syntax like:
utf8cpp file : file.cpp ; exe whatever : whatever.cpp file ;
If that's what you're asking for. In fact, here's a complete example that should almost work, in any Jamfile:
import type ; type.register UTF8CPP : : CPP ;
import generators ; generators.register-standard $(__name__).add-bom : CPP : UTF8CPP ;
actions add-bom { add-bom $(>) -o $(<) }
utf8cpp file : file.cpp ; exe whatever : whatever.cpp file ;
I say "almost" because somebody who is actually interested in all this should write the 'add-bom' utility, test everything, and do various other boring things, like sending a patch and/or checking in.
I think the ball is now in your court.
I'm not sure that's necessary, but what if I wanted to do that with header files as well, without having to list them all?

Mathias Gaunard wrote:
On 01/05/2011 09:32, Vladimir Prus wrote:
Well, it would be trivial to implement syntax like:
utf8cpp file : file.cpp ; exe whatever : whatever.cpp file ;
If that's what you're asking for. In fact, here's a complete example that should almost work, in any Jamfile:
import type ; type.register UTF8CPP : : CPP ;
import generators ; generators.register-standard $(__name__).add-bom : CPP : UTF8CPP ;
actions add-bom { add-bom $(>) -o $(<) }
utf8cpp file : file.cpp ; exe whatever : whatever.cpp file ;
I say "almost" because somebody who is actually interested in all this should write the 'add-bom' utility, test everything, and do various other boring things, like sending a patch and/or checking in.
I think the ball is now in your court.
I'm not sure that's necessary, but what if I wanted to do that with header files as well, without having to list them all?
It would require:

- Using globbing to specify the headers
- Using slightly more contrived code to define 'utf8hpp' that can process multiple sources.
- Instead of adding 'file' to sources as above, referring to it using the 'implicit-dependency' build property, so that the include path is set up to pick up the processed headers.

--
Vladimir Prus
Mentor Graphics
+7 (812) 677-68-40

On 01/05/2011 13:38, Vladimir Prus wrote:
I'm not sure that's necessary, but what if I wanted to do that with header files as well, without having to list them all?
It would require:
- Using globbing to specify the headers
Wouldn't it be possible to directly find all the headers included from a specific source file? Doesn't Boost.Build do this already to deal with dependencies and recompilation?

Mathias Gaunard wrote:
On 01/05/2011 13:38, Vladimir Prus wrote:
I'm not sure that's necessary, but what if I wanted to do that with header files as well, without having to list them all?
It would require:
- Using globbing to specify the headers
Wouldn't it be possible to directly find all the headers included from a specific source file?
And bravely assume they are all UTF8?
Doesn't Boost.Build do this already to deal with dependencies and recompilation?
It does, but it's done at the build engine level, when it's a bit late to go back and create new targets.

You seem to be moving the goals, though. What problem are you actually trying to solve? What libraries in Boost want to use UTF8 sources, and for what purposes? Will those libraries work on compilers that have no way of parsing UTF8 files?

- Volodya
--
Vladimir Prus
Mentor Graphics
+7 (812) 677-68-40

On 01/05/2011 16:26, Vladimir Prus wrote:
You seem to be moving the goals, though. What problem are you actually trying to solve? What libraries in Boost want to use UTF8 sources, and for what purposes? Will those libraries work on compilers that have no way of parsing UTF8 files?
I'm trying to provide an alternative to the "-finput-charset=utf-8" GCC option setting for MSVC. This option affects all files read by GCC for that particular invocation, and it would be nice if it could do the same thing for MSVC, which unfortunately involves prepending a UTF-8 BOM to all files the compiler will read.

AMDG

On 05/01/2011 08:46 AM, Mathias Gaunard wrote:
On 01/05/2011 16:26, Vladimir Prus wrote:
You seem to be moving the goals, though. What problem are you actually trying to solve? What libraries in Boost want to use UTF8 sources, and for what purposes? Will those libraries work on compilers that have no way of parsing UTF8 files?
I'm trying to provide an alternative to the "-finput-charset=utf-8" GCC option setting for MSVC.
This option affects all files read by GCC for that particular invocation, and it would be nice if it could do the same thing for MSVC, which unfortunately involves prepending a UTF-8 BOM to all files the compiler will read.
What about:

a) preprocess the source
b) add a BOM to the preprocessor output
c) compile the result

In Christ,
Steven Watanabe

From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>

On 30/04/2011 18:45, Vladimir Prus wrote:
On 26/04/2011 11:17, Sebastian Redl wrote:
GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worst problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding.
A long time ago, I asked Vladimir Prus to help me add an option to Boost.Build that would allow automatically prepending the BOM to source files when using MSVC, but unfortunately he was never able to help me do this.
Well, if you have a command that can prepend BOM to a file, you can easily modify 'actions compile-c-c++' in msvc.jam to run that command.
It would be nice if I could only do this when the source files have been tagged as utf-8 or something like that.
A few points:

1. -fexec-charset in MSVC can be simulated with #pragma setlocale(".XXXX") where XXXX is the codepage. However, 65001 (UTF-8) can't be used!
2. -finput-charset can likewise be defined by the same setlocale pragma (and can't be 65001 (UTF-8) either), or the input can be UTF-8 if you add a BOM. In fact the BOM is only needed for files that contain non-ASCII characters.

But the bigger question is what exactly you want to do with the BOM and how it would help you make "cross-platform" software. If you write for MSVC, add the BOM in the first place; if you write cross-platform/cross-compiler software, MSVC's incompatibility with the rest of the world actually makes it impossible to use UTF-8 in a cross-platform way, because the only real Unicode strings with MSVC would be L"" literals, and those are encoded as UTF-16 while all the non-Windows world uses UTF-32 as its wide character encoding.

So basically I can say that until the Microsoft Visual Studio team takes UTF-8 seriously and either supports the 65001 codepage as expected or provides GCC-like options for the input and exec encodings, I don't see how this BOM would be useful.

Does anybody know how to open a bug or feature request for MSVC, such that MSVC11 /201[^0] would support it properly?

My $0.02

Artyom

On 01/05/2011 20:44, Artyom wrote:
But the bigger question is what exactly do you want to do with BOM and how it would help you to make the "cross-platform" software?
The goal is to allow all compilers to recognize that the source is encoded in UTF-8. This is what you need to write cross-platform source that contains non-ASCII characters.
the only real Unicode strings with MSVC would be L"" and they are actually would be encoded with UTF-16 encoding while all non-Windows world uses UTF-32 as wide character encodings.
How is that a problem at all? And using narrow string literals with UTF-8 content masquerading as ANSI is a hack, sorry. That's not the C++-endorsed solution.
So basically I can say that untill Microsoft Visual Studio team would take UTF-8 seriously and either support 65001 codepage as expected or provide GCC's like options for input and exec encodings I don't see how this BOM would be useful.
I don't really care about what the execution character set is. I definitely do not want to change it, it should be the user locale.

From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
On 01/05/2011 20:44, Artyom wrote:
But the bigger question is what exactly do you want to do with BOM and how it would help you to make the "cross-platform" software?
The goal is to allow all compilers to recognize that the source is encoded in UTF-8. This is what you need to write cross-platform source that contains non-ASCII characters.
It is not enough. You can't do it properly in a cross-platform way, as you can't currently get UTF-8, UTF-16, or UTF-32 string literals for cross-platform code until all compilers support the C++0x u/U/u8 literals, and at this point NONE of the existing popular compilers support them (checked: MSVC, GCC, Intel, SunCC).
the only real Unicode strings with MSVC would be L"" and they are actually would be encoded with UTF-16 encoding while all non-Windows world uses UTF-32 as wide character encodings.
How is that a problem at all?
And using narrow string literals with UTF-8 content masquerading as ANSI is a hack, sorry. That's not the C++-endorsed solution.
First of all, the "ANSI" codepage exists only on Windows and has nothing to do with cross-platform software. The C++ standard does not know what "ANSI" encodings are.
So basically I can say that untill Microsoft Visual Studio team would take UTF-8 seriously and either support 65001 codepage as expected or provide GCC's like options for input and exec encodings I don't see how this BOM would be useful.
I don't really care about what the execution character set is. I definitely do not want to change it, it should be the user locale.
No, you never want it to be the user's locale, because that makes compilation locale-dependent:

source.cpp / with UTF-8 BOM
--------------------------------
std::string test="שלום-سلام-Мир"

In Israel it would be "שלום-???-????" in CP1255
In Egypt it would be "????-سلام-???" in CP1256
In Russia it would be "????-????-Мир" in CP1251
In France it would be "????-???-???" in CP1252

So no, you always want the execution character set to be well defined, unless all your sources are written using US-ASCII, which is a subset of all character sets.

Artyom Beilis.

On 02/05/2011 08:47, Artyom wrote:
It is not enough.
You can't do it properly in a cross-platform way, as you can't currently get UTF-8, UTF-16, or UTF-32 string literals for cross-platform code until all compilers support the C++0x u/U/u8 literals, and at this point NONE of the existing popular compilers support them (checked: MSVC, GCC, Intel, SunCC)
Wide string literals can perfectly well be assumed to be UTF-16 or UTF-32, and that's portable. Also, as I suggested, you can use a macro that allows the usage of Unicode string literals where they're available.

On 04/26/2011 02:17 AM, Sebastian Redl wrote:
We can assume that the compiler knows the correct character set of the source code file, as trying to fool it would seem to be inherently error prone. This seems to rule out the possibility of char * literals containing UTF-8 encoded text on MSVC, until C++1x Unicode literals are supported.
On 25.04.2011 22:31, Jeremy Maitin-Shepard wrote:
The biggest nuisance is that we need to know the compile-time character set/encoding (so that we know how to interpret "narrow" string literals), and there does not appear to be any standard way in which this is recorded (maybe I'm mistaken though).
The source character set is pretty much irrelevant. It's the execution character set that is problematic. A compiler will translate string literals in the source from the source character set to the execution character set for storage in the binary. GCC has options to control both the source (-finput-charset) and the execution character set (-fexec-charset). They both default to UTF-8. However, MSVC is more complicated. It will try to auto-detect the source character set, but while it can detect UTF-16, it will treat everything else as the system narrow encoding (usually a Windows-xxxx codepage) unless the file starts with a UTF-8-encoded BOM. The worst problem is that, except for a very new, poorly documented, and probably experimental pragma, there is *no way* to change MSVC's execution character set away from the system narrow encoding.
So let's assume that further down, it's the execution set that's known.
Yes, it is the execution character set that I meant (I assumed that, as is the case for MSVC, the execution character set is the same as the character set given by the current locale at compile time.)
By knowing the compile-time character set, all ambiguity is removed. The translation database can be assumed to be keyed based on UTF-8, so to translate a message, it needs to be converted to UTF-8. There should presumably be versions of the translation functions that take narrow strings, wide strings, and additional versions for the C++1x unicode literals once they are supported by compilers (I expect that to be very soon, at least for some compilers). If a wide string is specified, it will be assumed to be in UTF-16 or UTF-32 depending on sizeof(wchar_t), and converted to UTF-8. UTF-32 is generally undesirable, I imagine, but in practice should nonetheless work and using wide strings might be the best approach for code that needs to compile on both Windows and Linux. For the narrow version, if the compile-time narrow encoding is UTF-8, the conversion is a no-op. Otherwise, the conversion will have to be done. (The C++1x u8 literal version would naturally require no conversion also.)
The issue with making the narrow version automatically transcode the input from the narrow encoding to UTF-8 is that it is a compatibility issue with C++11 u8 literals. For some reason, there is no way in the type system to distinguish between normal narrow and u8 literals. In other words, if you ever make the translate() functions assume a narrow literal to be in the locale character set, you can't use u8 literals there anymore.
This is a problem with every interface that takes char * arguments, and isn't specific to this particular case. One solution is to use a different name, since it isn't possible to overload on type. Another solution specific to this case is for the user to specify an execution charset via preprocessor define of UTF-8, in which case no conversion will be done regardless, and the user should just make sure not to use it with non-UTF-8 narrow strings. [Aside: It seems that only MSVC users (or users that want to write code that is portable to MSVC) will bother to use the u8 prefix at all, since on GCC it by default has no effect, given the default execution charset of UTF-8.]
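[The narrow-literal ambiguity discussed above can be sketched outside C++. Below is a minimal Python illustration, not Boost.Locale code: cp1252 stands in for MSVC's system codepage, and the byte strings model what a compiler would store for a plain narrow literal versus a u8 literal. -- Ed.]

```python
# A message containing a non-ASCII character, used as a translation key.
msg = "café"

# On an MSVC-style system, a plain narrow literal is stored in the
# system codepage (here Windows-1252), not UTF-8.
narrow_cp1252 = msg.encode("cp1252")   # b'caf\xe9'

# A u8 literal (or GCC's default execution charset) stores UTF-8 bytes.
narrow_utf8 = msg.encode("utf-8")      # b'caf\xc3\xa9'

# C++ gives both byte strings the same type (const char*), so a
# translate() that unconditionally transcodes "system codepage -> UTF-8"
# round-trips the cp1252 input correctly...
assert narrow_cp1252.decode("cp1252") == msg

# ...but mangles input that was already UTF-8 (classic mojibake):
mojibake = narrow_utf8.decode("cp1252")
assert mojibake == "cafÃ©"
assert mojibake.encode("utf-8") != narrow_utf8
```

Because both byte strings arrive through the same pointer type, a transcoding translate() cannot detect and skip already-UTF-8 input, which is why the thread suggests either a differently named function or a user-declared execution charset.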

On Mon, Apr 25, 2011 at 4:31 PM, Jeremy Maitin-Shepard <jeremy@jeremyms.com> wrote:
having to deal with this by creating a translation from "ASCII English to appease translation system" to "Real English to display to users" would seem to be an unjustifiable additional burden.
I always recommend that the text in the code be written in "programmer speak" and then translated to English by the UI team. Programmers typically aren't good at usability, and they often end up changing the text in their code once QA or UX sees the wording of the messages presented to users. A request to change the wording of a message shouldn't be a code-file change; it should be a translation-file change. Now, whether that "programmer speak" is in English or some other language is another story - the main story of the other emails, I guess. But I just wanted to take this opportunity to encourage developers to get user-presented strings out of their code and into some other file that a separate UI team can maintain for you (even if that UI team is you today, it might not be in the future). Tony

On Sat, Apr 23, 2011 at 10:46 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
The formal review of Artyom Beilis' Boost.Locale library, originally scheduled to run from April 7th through the 16th and extended through the 22nd, is now finished.
Fifteen people cast votes on it, ten for acceptance and five against. Though not overwhelming, the two-to-one majority clearly indicates the consensus of the list. As such:
The Boost.Locale library IS ACCEPTED into Boost.
The details follow, starting with the voters in favor, in the order their reviews were received:

* John Bytheway
* Steven Watanabe
* Sebastian Redl
* Fabio Fracassi ("possibly conditional")
* Noah Roberts
* Steve Bush
* Volker Lukas
* Paul A. Bristow
* Matus Chochlik
* Gevorg Voskanyan
Those opposed:

* Ryou Ezoe
* Phil Endecott (but "borderline")
* Edward Diener
* Mathias Gaunard
* Vincente BOTET ("count my vote as 1/2 or 1/4 vote")
There was also an early off-list review from Darren Cook which did not include a vote, only listing issues.
There was a great deal of discussion around the library, and a number of issues detailed. The major ones (and some less major ones that were repeated by several reviewers) are summarized below, along with the initials of the reviewer(s) who brought them up and Artyom's responses. Note that although I followed the discussions closely, these are only the issues brought up in the formal reviews themselves.
* Issue: date_time interface uses enums for periods, which is error prone and inconsistent with other date/time libraries. (JB)
* Response: Will be addressed.

* Issue: Reference documentation needs more detail, or some things are not clear; some terms used in the documentation are not defined or are defined too briefly; can't find the headers that items are defined in from the documentation. (JB, SW, FF, PE, ED, MC)
* Response: Will be addressed.

* Issue: There are no prev/next links in the documentation, or the tutorial can't easily be navigated. (DC, JB, SR, FF, MC)
* Response: Looking into it.

* Issue: Few examples include output. (JB)
* Response: Will be addressed.

* Issue: No examples in Asian languages; the library may have design flaws that are not apparent without them. (DC)
* Response: [Artyom has indicated to me (privately) that he's adding such examples. -- CN]

* Issue: The translation system requires narrow-character tags, making it English-centric. (RE, ED)
* Response: The library implements the most popular and most widely used message-catalog format. It is not perfect, but it is the best system currently available. There may be a work-around if you really must use wide-character languages under Windows.

* Issue: Support for wchar_t/UTF-16 is unclear. (RE, ED)
* Response: Wide characters are fully supported.

* Issue: Doesn't support the Win32 API as a backend. (RE)
* Response: A misunderstanding; the Win32 API backend is fully supported.

* Issue: boost::locale::format is not compatible with boost::format. (FF, NR)
* Response: boost::format is too limited for use in localization, and throws on any error, which means that a translator error could crash the program.

* Issue: boost::locale::date_time is not compatible with boost::date_time. (FF)
* Response: An unavoidable consequence of the differences between them, which are necessary due to support for locale independence and non-Gregorian calendars.

* Issue: Boundary analysis is only available when using ICU. (FF)
* Response: At present, only the ICU backend supports proper boundary analysis.

* Issue: Little documentation on the toolchain needed to extract strings and translate them, or the versions required. (FF, NR)
* Response: Will be addressed.

* Issue: Concerns about relying on GPL/LGPL-licensed tools, or their availability on all platforms, or recommendations to write a Boost version of these tools. (NR)
* Response: Reimplementing these is non-trivial and unnecessary; the licensing for these tools does not affect the programs developed with them. All are available for all platforms; will add explicit instructions for getting the latest versions for Windows.

* Issue: Use of strings instead of symbols for language and encoding makes run-time errors out of what could be compile-time errors. The most common ones should be symbols. (PE)
* Response: There are dozens of character encodings, and even more locales, and no way to determine which ones are the most common. Not all encodings are supported by all backends or OS configurations. Names ignore case and non-alphanumeric characters, which should minimize errors that could be generated from them. utf_to_utf transcoding will be added.

* Issue: Error handling (in conversions) is very basic. (PE)
* Response: An unavoidable limitation of the backends.

* Issue: Code could use more commenting. (PE, VL)
* Response: Noted; will be addressed in the future.

* Issue: Some documentation phrasing is confusing, or could use a native English speaker's input. (SR, VL)
* Response: Will be addressed as discovered.

* Issue: There are no lists of valid language, country, encoding, or variant strings. (ED)
* Response: Listed in the ISO-639 and ISO-3166 standards, which are referenced in the library's documentation. These standards are updated occasionally, and should be referred to directly for the latest information.

* Issue: Only works on contiguous, entirely-in-memory strings. (MG)
* Response: All current backends require this, and it satisfies the vast majority of use-cases.

* Issue: Boundary analysis goes through the entire string and returns a vector of positions. (MG)
* Response: Not perfect, but given the limitations of the existing backends, it is reasonable.

* Issue: The library's interface is not generic enough, or independent enough of the libraries that it wraps. (VB)
* Response: The interface is similar to that of every other i18n library, and should make as few assumptions as possible. It should not be changed.

* Issue: The date-time code should be merged into Boost.DateTime. (VB)
* Response: Date-time code is locale-dependent by its nature, and is more natural in Boost.Locale. Updating Boost.DateTime to do everything that Boost.Locale's library does would require a lot of work, and in all the time it has existed, only the Gregorian calendar has been implemented. There are Boost libraries that overlap others, so this is not a novelty.

-- Chad Nelson
Oak Circle Software, Inc.
* * *
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
I can't believe it is accepted. Only fifteen people explicitly stated their opinion, and there was only one native non-western language speaker: me.

I really hate to complain this much, but this library is totally useless for non-English speakers. The library author said it is important to use English, because it allows us to easily translate to other languages. Does it really matter if non-English people don't bother to use this library? Believe me, there are so many Boost users who don't know English at all. They are able to use the Boost libraries because there are enough translations. This English-dependent library cannot be used by these users. They CAN use English identifiers in the source file, like names of functions, classes, variables, etc. (badly, if you ask me). But text is written in their language. There is no way they use English text. The text in software is not written by programmers alone; non-programmers also write text. So this library requires English from everybody. If everybody can use English, there is no need for localization in the first place. That means, if everybody can use this English-dependent library, the purpose of this library, localization, is unnecessary.

I think I found another issue. Even English text cannot be expressed by ASCII alone; there are many characters ASCII doesn't have, such as ®. So this library needs to maintain two almost identical English texts: one for the real English text which contains non-ASCII characters, and one for the unique ASCII identifier that is an argument of translate().

But enough argument. I just want to assure you all, this library will never be used by non-English speakers. Blaming people for not using English doesn't work. They don't care. Just like you don't care about their language. -- Ryou Ezoe

Ryou Ezoe wrote:
I can't believe it is accepted. Only fifteen people explicitly stated their opinion, and there was only one native non-western language speaker: me.
What is your definition of a western and a non-western language?
I really hate to complain this much, but this library is totally useless for non-English speakers.
I understand the message translation module of Boost.Locale is not currently well-suited for use with non-ASCII keys. But what about other modules: number formatting, collation, boundary analysis, etc.? Do you consider them useless for non-English speakers too?
The library author said it is important to use English, because it allows us to easily translate to other languages. Does it really matter if non-English people don't bother to use this library?
Believe me, there are so many Boost users who don't know English at all. They are able to use the Boost libraries because there are enough translations.
This English-dependent library cannot be used by these users. They CAN use English identifiers in the source file, like names of functions, classes, variables, etc. (badly, if you ask me). But text is written in their language. There is no way they use English text. The text in software is not written by programmers alone; non-programmers also write text. So this library requires English from everybody.
If everybody can use English, there is no need for localization in the first place.
I can't follow the logic here. Even if all programmers knew English, there would still be a need for localization, because not all *users* necessarily know English. Unless all users of the software are programmers themselves.
That means, if everybody can use this English-dependent library, the purpose of this library, localization, is unnecessary.
Not true. Software localization is primarily aimed at users, not developers, as I already stressed above. Boost.Locale is not a localized library; it is a library which can help create localized software. And even when people know English, that doesn't mean they don't want software to speak their language. Moreover, localization is not all about the language; it's more about the local culture, and the conventions used within that culture. For example, I can read and understand English well. That means I can use software with English as the user interface language just fine, without any need for translation (though I'd prefer an Armenian translation if available, English will do for me otherwise). The same cannot be said about other cultural aspects. For example, if some software shows me distances in miles and masses in pounds, those values will not make any sense to me, and I'd be forced to convert them into kilometers and kilograms to understand how much they are. And of course I'll be angry and dissatisfied with that software. Localization is a much larger domain than just message text translation, so Boost.Locale needs to be evaluated in that whole context, rather than in just one specific (even if very important) aspect.
I think I found another issue. Even English text cannot be expressed by ASCII alone; there are many characters ASCII doesn't have, such as ®.
So this library needs to maintain two almost identical English texts: one for the real English text which contains non-ASCII characters, and one for the unique ASCII identifier that is an argument of translate().
That's a working approach. Do you have a proposal for another approach which could be implemented in Boost.Locale to make it better in that regard?
But enough of argument.
The arguments you made earlier have convinced me that using English keys is not always appropriate. Not only because programmers might not be able to write English, but also because the product being developed might not target an English-speaking audience at all, e.g. a product targeting the Chinese, Korean and Japanese markets only. In that case making an English translation would be just a waste of time and money. However, your arguments didn't convince me that Boost.Locale should be rejected on that ground, for the following reasons:

1. Using the C++11 u8 encoding prefix will resolve the problem.
2. The current message translation system is based on widely accepted existing practice, and works well in the majority of use cases.
3. I have not yet seen a truly working proposal from you which can address the problem properly in Boost.Locale using C++03 features only.
4. Boost.Locale is much more than just message translation. I can imagine many legitimate usages of Boost.Locale not involving message translation at all.
I just want to assure you all, this library will never be used by non-English speakers.
Not even when they have u8 prefix? Not even when they don't need message translation but other features of Boost.Locale?
Blaming people for not using English doesn't work. They don't care.
I don't think anyone was making such accusations in these discussions. There were only suggestions to use ASCII-based keys for now, as they are more portable.
Just like you don't care about their language.
If Artyom didn't care what language is being used by different users, he wouldn't have developed a localization library in the first place. He would rather declare "In order to use my software, you should talk to it in my language". And note that English is not Artyom's native language either.
Ryou Ezoe
Regards, Gevorg

On Mon, Apr 25, 2011 at 1:18 AM, Gevorg Voskanyan <v_gevorg@yahoo.com> wrote:
I understand the message translation module of Boost.Locale is not currently well-suited for use with non-ASCII keys. But what about other modules: number formatting, collation, boundary analysis, etc.? Do you consider them useless for non-English speakers too?
For Japanese, these features are all useless. Although I don't know Chinese and Korean, I think the following is also true for them.

Number and date formatting: There are so many possible ways to express numbers. Some people want comma separation every 3 digits, others want every 4 digits. Some want 1000000 to be 100万 (万 means 10000), some want 百万 (百 means 100). Formatting based on locale doesn't work because there is no uniform format.

Collation and conversions: Japanese doesn't have the concepts of case and accent. Since we don't have these concepts, we never need them.

Boundary analysis: What is the definition of a boundary, and how does it analyze? It sounds too smart for the small thing it actually does; I'd rather call it strtok with hard-coded delimiters. Japanese doesn't separate words with spaces, so unless we perform really complicated natural language processing (which can never be perfect, since there is no complete Japanese dictionary), we can't split Japanese text by words.

Also, Japanese doesn't have a concept of word wrap, so "find appropriate places for line breaks" is unnecessary. Actually, there are some rules for line breaks in Japanese, but these rules are too complicated and require more than text processing. The same goes for Chinese and Korean.

Of course, strtok is still a handy tool and I appreciate yet another design. But I think it's better handled by a more generic library, like Boost String Algorithms.

-- Ryou Ezoe

From: Ryou Ezoe <boostcpp@gmail.com>
Number and date formatting: There are so many possible ways to express numbers. Some people want comma separation every 3 digits, others want every 4 digits. Some want 1000000 to be 100万 (万 means 10000), some want 百万 (百 means 100). Formatting based on locale doesn't work because there is no uniform format.
Have you actually read the manuals? This is the output of:

    std::cout << bl::format("{1}\n{1,num}\n{1,spell}\n") % 1000000;

in the ja_JP.UTF-8 locale:

    1000000
    1,000,000
    百万

Not so bad, isn't it?
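[For readers without an ICU build at hand, the grouping step shown by "{1,num}" can be approximated in plain Python. This is only a rough sketch of the idea, not Boost.Locale's API; note that Python's "," format spec hard-codes Western 3-digit groups, whereas Boost.Locale takes the grouping from the locale, and the "{1,spell}" step (which produced 百万 above) relies on ICU's rule-based number formatter and has no stdlib equivalent. -- Ed.]

```python
n = 1000000

# "{1}" - plain, locale-independent output of the number.
assert f"{n}" == "1000000"

# "{1,num}" - locale-aware digit grouping; the ',' format spec shows
# the same idea with fixed 3-digit Western groups.
assert f"{n:,}" == "1,000,000"

# "{1,spell}" (spelled-out form such as 百万) needs ICU's rule-based
# number formatting; there is no standard-library counterpart here.
print(f"{n} -> {n:,}")
```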
Collation and conversions: Japanese doesn't have the concepts of case and accent. Since we don't have these concepts, we never need them.
Irrelevant; even when this feature is not required for CJK, it is still required, just like many other things (spaces, plural forms for other languages).
Boundary analysis: What is the definition of a boundary, and how does it analyze? It sounds too smart for the small thing it actually does; I'd rather call it strtok with hard-coded delimiters. Japanese doesn't separate words with spaces, so unless we perform really complicated natural language processing (which can never be perfect, since there is no complete Japanese dictionary), we can't split Japanese text by words.
OK, this is word splitting:

|私|は|日本|の|東京都|に|住|んでいます|。|私|は|大|きな|家|に|住|んでいます|。

of the text:

私は日本の東京都に住んでいます。私は大きな家に住んでいます。

I assume it is not perfect, and I don't know Japanese well enough to say, but I can at least see words like:

私 - I
日本 - Japan
東京都 - City of Tokyo

But this is clearly not just space-based separation. Also, for some languages like Thai, ICU uses dictionaries. So it is not a naive algorithm that separates text by spaces.
Also, Japanese doesn't have a concept of word wrap, so "find appropriate places for line breaks" is unnecessary. Actually, there are some rules for line breaks in Japanese, but these rules are too complicated and require more than text processing. The same goes for Chinese and Korean.
This is a possible line-break separation of the same sentences above:

|私|は|日|本|の|東|京|都|に|住|ん|で|い|ま|す。|私|は|大|き|な|家|に|住|ん|で|い|ま|す。|

At least I can see that it does not allow a line to start with "。".
Of course, strtok is still a handy tool and I appreciate yet another design. But I think it's better handled by a more generic library, like Boost String Algorithms.
It is far more complicated than strtok. Bottom line: I see that you hadn't really tried to use this library or understand how it works. I'm sorry, but it makes me doubt the review you sent. Artyom

On Mon, Apr 25, 2011 at 6:04 AM, Artyom <artyomtnk@yahoo.com> wrote:
From: Ryou Ezoe <boostcpp@gmail.com>
Number and date formatting: There are so many possible ways to express numbers. Some people want comma separation every 3 digits, others want every 4 digits. Some want 1000000 to be 100万 (万 means 10000), some want 百万 (百 means 100). Formatting based on locale doesn't work because there is no uniform format.
Have you actually read the manuals?
This is the output of :
std::cout << bl::format("{1}\n{1,num}\n{1,spell}\n") % 1000000 ;
in ja_JP.UTF-8 locale
1000000
1,000,000
百万
Not so bad, isn't it?

Not bad. Still, I doubt anybody wants to use Boost.Locale just for that.
Collation and conversions: Japanese doesn't have the concepts of case and accent. Since we don't have these concepts, we never need them.
Irrelevant; even when this feature is not required for CJK, it is still required, just like many other things (spaces, plural forms for other languages).
Boundary analysis: What is the definition of a boundary, and how does it analyze? It sounds too smart for the small thing it actually does; I'd rather call it strtok with hard-coded delimiters. Japanese doesn't separate words with spaces, so unless we perform really complicated natural language processing (which can never be perfect, since there is no complete Japanese dictionary), we can't split Japanese text by words.
Ok this is word splitting
|私|は|日本|の|東京都|に|住|んでいます|。|私|は|大|きな|家|に|住|んでいます|。
of the text:
私は日本の東京都に住んでいます。私は大きな家に住んでいます。
To me, it looks like splitting by contiguous kanas and kanjis. I don't think I'd ever need that kind of splitting.
I assume it is not perfect, and I don't know Japanese well enough to say, but I can at least see words like:
私 - I 日本 - Japan 東京都 - City of Tokyo
But this is clearly not just space-based separation. Also, for some languages like Thai, ICU uses dictionaries.
So it is not a naive algorithm that separates text by spaces.
Also, Japanese doesn't have a concept of word wrap, so "find appropriate places for line breaks" is unnecessary. Actually, there are some rules for line breaks in Japanese, but these rules are too complicated and require more than text processing. The same goes for Chinese and Korean.
This is a possible line-break separation of the same sentences above:
|私|は|日|本|の|東|京|都|に|住|ん|で|い|ま|す。|私|は|大|き|な|家|に|住|ん|で|い|ま|す。|
At least I can see that it does not allow a line to start with "。".
We have a lot of characters that should not be the initial character of a line, but there is no uniform rule. And it must work together with font rendering; simple text processing doesn't suffice.
Of course, strtok is still a handy tool and I appreciate yet another design. But I think it's better handled by a more generic library, like Boost String Algorithms.
It is far more complicated than strtok.
Bottom line: I see that you hadn't really tried to use this library or understand how it works.
I'm sorry, but it makes me doubt the review you sent.
Artyom
-- Ryou Ezoe

Hi, FYI, the document "Requirements for Japanese Text Layout" is published by the W3C[1]. The "CJKV book" by O'Reilly[2] is also a great book when considering i18n, covering the CJKV languages. [1] http://www.w3.org/TR/jlreq/ [2] http://oreilly.com/catalog/9780596514471 Best regards, -- Ryo IGARASHI, Ph.D. rigarash@gmail.com

Hello,

Boost.Locale uses ICU's implementation of the standard Unicode segmentation algorithm. The rationale behind it is that it is a well-known, debugged, high-quality implementation. Currently ICU is the best-known cross-platform generic library that implements the Unicode algorithms and localization facilities.

If it is not good enough and does not do the required thing, it is much easier to talk to the ICU people. There are already custom segmentation rules for Japanese. If they are not good enough, it is possible to talk to them, provide patches, or even talk to the Unicode consortium if the existing basic algorithm (not the locale-specific rules) does not lead to good results.

Also, this library does not deal directly with text layout, as that is out of the scope of this library.

Artyom

----- Original Message ----
From: Ryo IGARASHI <rigarash@gmail.com> To: boost@lists.boost.org Sent: Mon, April 25, 2011 9:02:44 AM Subject: Re: [boost] [locale] Review results for Boost.Locale library
Hi,
FYI, the document "Requirements for Japanese Text Layout" is published by W3C[1]. "CJKV book" by O'Reilly[2] is also a great book when you consider i18n which covers CJKV language.
[1] http://www.w3.org/TR/jlreq/ [2] http://oreilly.com/catalog/9780596514471
Best regards, -- Ryo IGARASHI, Ph.D. rigarash@gmail.com

Hi, Artyom, On Mon, Apr 25, 2011 at 8:49 PM, Artyom <artyomtnk@yahoo.com> wrote:
Boost.Locale uses ICU's implementation of the standard Unicode segmentation algorithm. The rationale behind it is that it is a well-known, debugged, high-quality implementation. Currently ICU is the best-known cross-platform generic library that implements the Unicode algorithms and localization facilities.
I agree with you that ICU is the best known library for supporting Unicode.
If it is not good enough and does not do the required thing, it is much easier to talk to the ICU people. There are already custom segmentation rules for Japanese. If they are not good enough, it is possible to talk to them, provide patches, or even talk to the Unicode consortium if the existing basic algorithm (not the locale-specific rules) does not lead to good results.
Also, this library does not deal directly with text layout, as that is out of the scope of this library.
Of course I know that text layout is out of the scope of your library. I just want to stress the fact that line breaking in some languages (Japanese in this case) is highly dependent upon text layout. For example, some 'line breaking forbidden' rules are relaxed in newspaper-like text layout situations, as shown in "Requirements for Japanese Text Layout"[1]. [1] http://www.w3.org/TR/jlreq/#en-subheading2_1_7 Best regards, -- Ryo IGARASHI, Ph.D. rigarash@gmail.com

On 04/25/2011 10:13 AM, Ryo IGARASHI wrote:
I just want to stress the fact that line breaking in some languages (Japanese in this case) is highly dependent upon text layout. For example, some 'line breaking forbidden' rules are relaxed in newspaper-like text layout situations, as shown in "Requirements for Japanese Text Layout"[1].
There is, of course, a long history of outsiders under-appreciating the subtleties of Asian writing systems. I see an alternate reading of many of these comments about CJK text processing: "it's too complicated for non-native speakers to understand, we won't use a general-purpose library for this anyway, you should probably just give up". But sentiments like this just encourage Boost people to dig into it deeper. If you guys keep talking about how impossible it is, we'll end up with a library for doing CJK processing fully at compile time using template metaprogramming. :-) - Marsh

25.04.2011 0:01, Ryou Ezoe wrote:
Collation and conversions: Japanese doesn't have the concepts of case and accent. Since we don't have these concepts, we never need them.

You need them if you want to sell your software on foreign markets. Otherwise, you don't need a localization library at all.
Also, Japanese doesn't have a concept of word wrap. So "find appropriate places for line breaks" is unnecessary.

It is unnecessa
ry for you but m
ay be important for the foreign users of your so
ftware. :-)
-- Sergey Cheban

Ryou Ezoe wrote:
Collation and Conversions: Japanese doesn't have concepts of case and accent. Since we don't have these concepts, we never need it.
OK, that tells case conversions and normalization don't apply to Japanese. But what about collation? Isn't there any "dictionary order" defined for Japanese words? Just curious. Thanks, Gevorg

Hi, Gevorg, On Mon, Apr 25, 2011 at 11:06 PM, Gevorg Voskanyan <v_gevorg@yahoo.com> wrote:
OK, that tells case conversions and normalization don't apply to Japanese. But what about collation? Isn't there any "dictionary order" defined for Japanese words? Just curious.
"Dictionary order" depends on what kind of information in the dictionary. For example, we use complex sorting algorithm for 'Kanji' letter dictionary. However, for language dictionary (Japanese-Japanese dictionary), we use pronunciation order. But this is impossible to decide by program since each 'Kanji' letter have usually 3-4 (sometimes more) completely different pronunciation only to be decided by the context in principle. Just FYI. Best regards, -- Ryo IGARASHI, Ph.D. rigarash@gmail.com

From: Ryo IGARASHI <rigarash@gmail.com> On Mon, Apr 25, 2011 at 11:06 PM, Gevorg Voskanyan <v_gevorg@yahoo.com> wrote:
OK, that tells case conversions and normalization don't apply to Japanese. But what about collation? Isn't there any "dictionary order" defined for Japanese words? Just curious.
"Dictionary order" depends on what kind of information in the dictionary. For example, we use complex sorting algorithm for 'Kanji' letter dictionary.
However, for language dictionaries (Japanese-Japanese dictionaries), we use pronunciation order. But this is impossible to decide by program, since each 'Kanji' letter usually has 3-4 (sometimes more) completely different pronunciations, which in principle can only be decided from context.
Just FYI.
These are the sizes of the collation rules for different languages in ICU 4.4, by size (top 5):

630641 2010-04-28 18:28 zh.txt
439431 2010-04-28 18:28 ko.txt
438456 2010-04-28 18:28 ja.txt
23851 2010-04-28 18:28 kn.txt
23594 2010-04-28 18:28 bn.txt

I've looked into the ja.txt file, and it includes a huge dictionary of Kanji letters sorted by their order. I can't check it on my own, but I assume that the collation rules for Japanese are not that simple.

Also, there are customization parameters for collation in locale names, like ja_JP.UTF-8@collation=unihan. These keywords are taken from http://www.unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers :

"big5han" - Pinyin ordering for Latin, big5 charset ordering for CJK characters (used in Chinese)
"dict" (dictionary) - For a dictionary-style ordering (such as in Sinhala)
"direct" - Hindi variant
"gb2312" (gb2312han) - Pinyin ordering for Latin, gb2312han charset ordering for CJK characters (used in Chinese)
"phonebk" (phonebook) - For a phonebook-style ordering (such as in German)
"phonetic" - Requests a phonetic variant if available, where text is sorted based on pronunciation. It may interleave different scripts, if multiple scripts are in common use.
"pinyin" - Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into pinyin (used in Chinese)
"reformed" - Reformed collation (such as in Swedish)
"search" - A special collation type dedicated to string search
"stroke" - Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese)
"trad" (traditional) - For a traditional-style ordering (such as in Spanish)
"unihan" - Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters (used in Chinese)

So I can't check, but I can assume it does something right...

Artyom

On Mon, Apr 25, 2011 at 11:06 PM, Gevorg Voskanyan <v_gevorg@yahoo.com> wrote:
Ryou Ezoe wrote:
Collation and Conversions: Japanese doesn't have concepts of case and accent. Since we don't have these concepts, we never need it.
OK, that tells case conversions and normalization don't apply to Japanese. But what about collation? Isn't there any "dictionary order" defined for Japanese words? Just curious.
As Ryo IGARASHI said, we have multiple pronunciations for each kanji, so sorting by pronunciation is just not possible. In the end, we just sort by internal code point. Sorting by code point is not the best solution, but at least it's consistent if we use one encoding.
Thanks, Gevorg
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- Ryou Ezoe

From: Ryou Ezoe <boostcpp@gmail.com>
Sort by code point is not the best solution. But at least, it's consistent if we use one encoding.
No, it is not: UCS encodings have different orders in different representations. UTF-8 and UTF-32 order is consistent, i.e. for all a, b: utf8(a) < utf8(b) iff utf32(a) < utf32(b). However, this does not hold for UTF-16, where code points outside of the BMP have a different ordering, i.e. it may be that codepoint(a) > codepoint(b) but UTF-16(a) sorts before UTF-16(b).

Artyom
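The UTF-16 anomaly Artyom describes is easy to check directly. The sketch below (Python, illustrative only, not from the thread) compares a BMP character against a supplementary-plane character under the three encodings:

```python
# U+FF5E (a BMP character) vs. U+10000 (outside the BMP)
a, b = "\uff5e", "\U00010000"

# Code point order: U+FF5E < U+10000
assert ord(a) < ord(b)

# UTF-8 (and UTF-32) byte order agrees with code point order
assert a.encode("utf-8") < b.encode("utf-8")

# UTF-16 code unit order disagrees: U+10000 is encoded as the
# surrogate pair D800 DC00, whose lead unit sorts before the
# single code unit FF5E
assert b.encode("utf-16-be") < a.encode("utf-16-be")
```

So a sort that compares raw UTF-16 code units places supplementary characters before the BMP range U+E000..U+FFFF, unlike a sort over decoded code points.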

On Monday, April 25, 2011, Artyom wrote:
From: Ryou Ezoe <boostcpp@gmail.com>
Sort by code point is not the best solution. But at least, it's consistent if we use one encoding.
No it is not, UCS encoding has different order in different representations:
UTF-8 and UTF-32 order is consistent i.e.
Sorry if I'm adding to the pedantry, but is UCS an "encoding"? My completely non-expert impression was that UCS was a character set, and UTF-8/16/32, etc. were different encodings of that character set?

On Tue, Apr 26, 2011 at 4:12 AM, Frank Mori Hess <frank.hess@nist.gov> wrote:
On Monday, April 25, 2011, Artyom wrote:
From: Ryou Ezoe <boostcpp@gmail.com>
Sort by code point is not the best solution. But at least, it's consistent if we use one encoding.
No it is not, UCS encoding has different order in different representations:
UTF-8 and UTF-32 order is consistent i.e.
Sorry if I'm adding to the pedantry, but Is UCS an "encoding"? My completely non-expert impression was that UCS was a character set, and UTF-8/16/32, etc. were different encodings of that character set?
Sorry. I should say encodings of UCS. -- Ryou Ezoe

On Tue, Apr 26, 2011 at 3:55 AM, Artyom <artyomtnk@yahoo.com> wrote:
From: Ryou Ezoe <boostcpp@gmail.com>
Sort by code point is not the best solution. But at least, it's consistent if we use one encoding.
No it is not, UCS encoding has different order in different representations:
UTF-8 and UTF-32 order is consistent i.e.
for each a,b in utf8(a) < utf8(b) iff utf32(a) < utf32(b)
However, this does not hold for UTF-16, where code points outside of the BMP have a different ordering, i.e.
It may be that codepoint (a) > codepoint(b) but UTF-16(a) sorted before UTF-16(b)
What do you mean? No matter which UTF you use, the code point is the same. You can't compare UTF-8 strings by comparing each octet.
Artyom
-- Ryou Ezoe

What do you mean? No matter which UTF you use, the code point is the same. You can't compare UTF-8 strings by comparing each octet.
Since the discussion is about sorting on a pure code point basis, it is worth noting that sorting UTF-8 strings on an octet basis *is* actually identical to sorting on their decoded code points.
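Gevorg's observation reflects a deliberate design property of UTF-8 (lexicographic byte order preserves code point order), and it can be verified empirically. A small Python sketch, illustrative only and not from the thread:

```python
# A mix of ASCII, Latin-1, CJK, BMP, and supplementary-plane characters
chars = ["\u0041", "\u00e9", "\u4e2d", "\ufffd", "\U00010000", "\U0001f600"]

# Sort once by raw UTF-8 octets, once by decoded code point
by_octets = sorted(chars, key=lambda s: s.encode("utf-8"))
by_codepoints = sorted(chars, key=ord)

# The two orders coincide
assert by_octets == by_codepoints
```

This is why, as noted later in the thread, comparing UTF-8 strings at the octet level is both correct (for code point order) and more efficient than decoding first.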

On 25/04/2011 21:50, Ryou Ezoe wrote:
On Tue, Apr 26, 2011 at 3:55 AM, Artyom<artyomtnk@yahoo.com> wrote:
From: Ryou Ezoe<boostcpp@gmail.com>
Sort by code point is not the best solution. But at least, it's consistent if we use one encoding.
No it is not, UCS encoding has different order in different representations:
UTF-8 and UTF-32 order is consistent i.e.
for each a,b in utf8(a)< utf8(b) iff utf32(a)< utf32(b)
However, this does not hold for UTF-16, where code points outside of the BMP have a different ordering, i.e.
It may be that codepoint (a)> codepoint(b) but UTF-16(a) sorted before UTF-16(b)
What do you mean? No matter which UTF you use, the code point is the same. You can't compare UTF-8 strings by comparing each octet.
Actually, you can. And you should actually do it at the octet level for efficiency.

On 24/04/2011 22:01, Ryou Ezoe wrote:
Collation and Conversions: Japanese doesn't have the concepts of case and accent. Since we don't have these concepts, we never need them.
I believe all CJK characters can be decomposed into radicals, which are equivalent, so you might want to do normalization. Also, converting between halfwidth and fullwidth katakana could have some uses.
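The halfwidth/fullwidth conversion Mathias mentions is part of Unicode compatibility normalization (NFKC folds halfwidth katakana into their fullwidth equivalents). A quick illustration using Python's stdlib unicodedata (chosen here only for demonstration; the thread's context is ICU, which implements the same normalization forms):

```python
import unicodedata

# Halfwidth katakana ｶﾀｶﾅ (U+FF76 U+FF80 U+FF76 U+FF85)
half = "\uff76\uff80\uff76\uff85"

# NFKC maps each halfwidth form to its fullwidth equivalent
full = unicodedata.normalize("NFKC", half)

# Fullwidth katakana カタカナ (U+30AB U+30BF U+30AB U+30CA)
assert full == "\u30ab\u30bf\u30ab\u30ca"
```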
Boundary analysis: What is the definition of a boundary, and how does it analyse? It sounds too smart for the small things it actually does.
It uses the boundary analysis algorithms defined by the Unicode standard, which don't use heuristics or anything like that. Remember, Boost.Locale is just a wrapper around ICU, which is the really smart library.
I'd rather call it strtok with hard-coded delimiters. Japanese doesn't separate words with spaces, so unless we perform really complicated natural language processing (which can never be perfect, since we will never have a complete Japanese dictionary), we can't split Japanese text into words. Also, Japanese doesn't have a concept of word wrap, so "find appropriate places for line breaks" is unnecessary. Actually, there are some rules for line breaks in Japanese.
You can still break at punctuation marks, and there are places where you should definitely not break. Thai, Lao, Chinese and Japanese do require the use of dictionaries or heuristics to correctly distinguish words. However, the default algorithm provided by Unicode still gives a best-effort implementation without those things.
participants (18)

- Artyom
- Chad Nelson
- Frank Mori Hess
- Gevorg Voskanyan
- Gottlob Frege
- Jeremy Maitin-Shepard
- John Bytheway
- Marsh Ray
- Mathias Gaunard
- Matthew Chambers
- Ryo IGARASHI
- Ryou Ezoe
- Sebastian Redl
- Sergey Cheban
- Steve Bush
- Steven Watanabe
- Stewart, Robert
- Vladimir Prus