[Boost.Locale] For Preliminary Review of Boost Community (with Full documentation)

Hello All,

I have written a Boost.Locale library that provides high-quality localization facilities in a C++ way. The library uses ICU internally to provide all of its facilities:

- Correct case conversion, case folding and normalization
- Collation, including support for 4 Unicode collation levels
- Date and time formatting and parsing
- Number formatting, spelling and parsing
- Monetary formatting and parsing
- Powerful message formatting, including support for plural forms, using GNU catalogs
- Character, word, sentence and line-break boundary analysis
- Codepage conversion
- Support for 8-bit character sets like Latin1 and for UTF-8 encoded text
- Support for char, wchar_t and C++0x char16_t, char32_t strings and streams

I've tested it with:

- Linux GCC 4.1, 4.3 with ICU 3.6 and 3.8
- Windows MSVC-9 (VC 2008) with ICU 4.2
- Windows MinGW with ICU 4.2
- Windows Cygwin with ICU 3.8

Documentation:
==============

Full tutorials: http://cppcms.sourceforge.net/boost_locale/docs/
Doxygen reference documentation: http://cppcms.sourceforge.net/boost_locale/docs/doxy/html/

Source Code:
============

https://cppcms.svn.sourceforge.net/svnroot/cppcms/boost_locale/trunk

Building:
=========

At this point I haven't written any Jam files yet, and I use CMake. So check out the code and run

    cmake /path/to/boost_locale/libs/locale

You need to have ICU 3.6 or above installed on your system; 4.2 and above is recommended. It is a mandatory dependency.

Notes for MSVC Users
====================

In many examples I write:

    wstring tmp=L"שלום עולם!"; // Unicode string

For these examples to work correctly, the source file should include a BOM to let MSVC know that the text is UTF-8 and not ASCII... So before you compile the examples, convert the files accordingly.

Todo List
=========

- Create a good set of unit tests
- More boostification (Jam, etc.)

I'm waiting for your input.

Artyom Beilis

Hello,
I had written a Boost.Locale library that provides high quality localization facilities in C++ way. This library uses internally ICU library to provide all facilities:
Any notes, input, interest or criticism so far? They are more than welcome.

Artyom

-----Original Message-----
From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Artyom
Sent: Thursday, November 12, 2009 7:53 AM
To: boost@lists.boost.org
Subject: Re: [boost] [Boost.Locale] For Preliminary Review of Boost Community (with Full documentation)
I had written a Boost.Locale library that provides high quality localization facilities in C++ way. This library uses internally ICU library to provide all facilities:
Any notes, input, interest or criticism so far? They are more than welcome.
If you need to deal with locales, this looks *very* useful.

Personally, I am having more than enough trouble with my own locale ;-)

So silence does not indicate lack of approval.

Paul

---
Paul A. Bristow
Prizet Farmhouse
Kendal, UK LA8 8AB
+44 1539 561830, mobile +44 7714330204
pbristow@hetp.u-net.com

Hello List! Artyom wrote:
I'm waiting for your inputs
First I have to congratulate the author for trying to fill a dark hole in current C++. I am afraid it is impossible to make such a library optimal for everyone. But there are many parts which are really great:

- there are already many features
- case conversion, case folding and normalization are done nicely
- as::spellout and so on are convenient
- message formatting is, as stated, very powerful
- std::string is supported (no new string type introduced)
- but others are supported too
- very good integration with iostreams and facets

The overall design fits very well into C++ and Boost.

But for boundary analysis I have other expectations: I would rather expect an iterator similar to boost::tokenizer
http://www.boost.org/doc/libs/1_40_0/libs/tokenizer/index.html
which lets me iterate over characters, words and sentences. The design of your boundary analysis seems to be very complicated and, because of the manual use of offsets, very error prone.

There are parts where I see an intersection with existing Boost libraries:

- date formatting, especially the ftime functionality, is already supported by boost::date_time, see: http://www.boost.org/doc/libs/1_40_0/doc/html/date_time/date_time_io.html#da...
- time zones are also already in boost::date_time; however, I find your way better in that you use std::set<std::string> and not std::vector<std::string> as in the Time Zone Database of boost::date_time: http://www.boost.org/doc/libs/1_40_0/doc/html/date_time/local_time.html#date...

On the other hand, there are parts inside Boost which should use boost::locale:

- boost::regex (seems like it is currently using libicu directly?)
- boost::format

Some remarks about the doxygen documentation: http://cppcms.sourceforge.net/boost_locale/docs/doxy/html/

- It should read #include <boost/locale/generator.hpp> instead of #include <generator.hpp> (and all the others).
- impl() should be private and should not be documented. "Do not use this" is not very useful.
- Documentation missing for operator== in timezone: does it compare object identity or id?

Now I will discuss the problems in the order of the tutorial: http://cppcms.sourceforge.net/boost_locale/docs/

- I did not find a list of all available facets. How can I get back to the defaults (all facets activated) after some are turned off?

== Collation ==

- There should be some explanation of std::use_facet, or of facets in general, because it is a very seldom used feature of the C++ standard. That's because without something like Boost.Locale it is barely usable.
- loc was used in
      use_facet<collator<wchar_t> >(loc).compare(collator_base::secondary,a,b);
  but was not defined before (inside the Collation chapter). Better to use locale(). The same problem exists with some_locale and some_level. Give working examples whenever it is easily possible.
- You wrote the comment:
      // Now strings uses default system locale for string comparison
  Does it mean that it uses the default system locale ("C"), the locale you get when you call gen(""), or the default which was set before with std::locale::global()?

I would suggest some glossary, like in boost::filesystem: http://www.boost.org/doc/libs/1_40_0/libs/filesystem/doc/reference.html#Defi...

== Number, Time and Currency ==

- What does "as" mean? Is this just meant as the English word "as"?
- Negative numbers as currency do not work as expected. The output of the program with a de_AT locale is:
      -€ 14,00
  but I would expect
      € -14,00
- Parsing has the same problem, but even worse, the variable contains a completely wrong number afterwards! After parsing € -14,00 the variable contains 11027! The library really should throw an exception if there are parsing errors.
- You did not mention at the beginning of the section that cout.imbue() or cin.imbue() must be called first. Just setting the global locale (or default locale?) is not enough.
- as::ordinal did not work. In the locale de_AT I would expect
      << boost::locale::as::ordinal << 1
  to output "1.". But it just outputs "1". The rule is simple: just add a "." for every number.
- as::spellout (as stated above) is really convenient, but I am missing a spellout for ordinal numbers (first, second, third, fourth, or in German erste, zweite, dritte, vierte).
- I missed the place where it is stated that e.g. boost::locale::as::currency_iso needs a boost::locale::as::currency before it.
- boost::locale::as::currency_iso does not work:
      std::cout << boost::locale::as::currency << boost::locale::as::currency_iso << 1523.45 << std::endl;
      € 1.523,45
      std::cout << boost::locale::as::currency_iso << 1523.45 << std::endl;
      1523.45
  But it should read:
      1.523,45 EUR
- Why is the type double used for time? You should use time_t or TimeType instead. double is used in the example, but also in offset_from_gmt.
- Also for time it should be clearer that e.g. time_long is an additional modifier to time.
- gmt seems to modify both. time_zone also modifies both, even though it has the prefix "time_". All others with that prefix only modify time formatting. Maybe you should avoid prefixing and use a namespace instead?

== Message Formatting ==

- typing mistakes:
      signle -> single
      boost::locale::message boost::locale::translate(const char*, const char*, int) <- last int missing
- as::domain did not work. When using
      std::cout << boost::locale::as::domain("foo") << boost::locale::translate("foo") << std::endl;
  it results in the compiler error:
      /usr/src/locale/libs/locale/../../boost/locale/message.hpp: In function 'std::basic_ostream<CharType, std::char_traits<_CharT> >& boost::locale::as::details::operator<<(std::basic_ostream<CharType, std::char_traits<_CharT> >&, const boost::locale::as::details::set_domain&) [with CharType = char]':
      /usr/src/locale/libs/locale/examples/h.cpp:25: instantiated from here
      /usr/src/locale/libs/locale/../../boost/locale/message.hpp:391: error: no matching function for call to 'use_facet(std::locale)'
- Why is wchar_t used even though the result is assigned to a string?
      std::string msg = translate("Do you want to open the file?").str<wchar_t>(some_locale)
- The datatype "message" in send_to_all is not explained. What is "ms"?
- A " is missing and translate is misspelled in:
      cout << format(tranlsate("You have 1 file in the directory",You have {1} files in the directory",files)) % files << endl;
- gettext and ngettext are exposed in the global namespace when <boost/locale.hpp> is included!

== Code-page conversions ==

- Please give an example for the code converter.

== Boundary analysis ==

- Spelling: indx should be index?
- The first test example does not output any character (only line breaks).
- As said above, shouldn't it be an iterator interface?

== Info ==

- misspelling: tranlsate
- std::locale::name (or better std::locale().name()) just returns *, even when German locales are set. I would expect that your library sets a name?
- How is info meant to be used? ~info is protected, so it seems that I should not instantiate the object myself?

== Multiple Locales ==

- Do you mean gen.get() in:
      std::locale ar=get("ar_EG");
  Why is it not cached in that case?

== general ==

- What is your evaluation of the potential usefulness of the library?
  indispensable
- Did you try to use the library? With what compiler?
  gcc (Debian 4.3.2-1.1) 4.3.2, ICU 3.6-2etch3
- How much effort did you put into your evaluation?
  about 6 hours
- Are you knowledgeable about the problem domain?
  yes

You have done good work, and I am sure that it has the opportunity to become the standard for localization efforts. Keep at it!

best regards
Markus Raab

--
http://www.markus-raab.org | Without a good library, most interesting
 -o)                       | tasks are hard to do in C++; but given a
Kernel 2.6.24-1-a      /\  | good library, almost any task can be made
on a x86_64           _\_v | easy. -- Bjarne Stroustrup

Hello Markus,
The overall design fits very well into C++ and Boost. But for boundary analysis I have other expectations: I would rather expect an iterator similar to boost::tokenizer http://www.boost.org/doc/libs/1_40_0/libs/tokenizer/index.html which lets me iterate over characters, words and sentences. The design of your boundary analysis seems to be very complicated and, because of the manual use of offsets, very error prone.
Thanks, that is a very good point. I wasn't aware of the `boost::tokenizer` functionality. I think it would be quite easy to provide several TokenizerFunction models based on the created mapping, so the interface would be more convenient. I am thinking of something like `boost::locale::boundary_separator` that would split the text according to the index; it would also be useful to add flags selecting which types of breaks or words should be taken. Something like this:

    boost::locale::boundary_separator<char>(...)
    boost::tokenizer<boost::locale::boundary ... >
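As a rough illustration of splitting text according to a precomputed index rather than juggling offsets by hand, here is a minimal sketch in standard C++ only. The name `split_at_breaks` and the representation of the index as a sorted vector of offsets are assumptions made for this example; this is not the Boost.Locale interface, and a real `boundary_separator` would also need the break-type flags mentioned above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: cut `text` into segments at precomputed break offsets.
// `breaks` is assumed to be sorted and to contain both 0 and text.size().
std::vector<std::string> split_at_breaks(const std::string& text,
                                         const std::vector<std::size_t>& breaks) {
    std::vector<std::string> segments;
    for (std::size_t i = 0; i + 1 < breaks.size(); ++i) {
        const std::size_t begin = breaks[i];
        const std::size_t end = breaks[i + 1];
        if (begin > end || end > text.size()) {
            break;  // stop at the first malformed pair of offsets
        }
        segments.push_back(text.substr(begin, end - begin));
    }
    return segments;
}
```

With the offsets 0, 5, 6 and 11 for the text "Hello world", the caller receives the three segments "Hello", " " and "world" and never touches an offset directly.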
There is a part where I see intersection to existing boost libraries: - date formatting, especially the ftime functionality are already supported by boost::date_time, see: http://www.boost.org/doc/libs/1_40_0/doc/html/date_time/date_time_io.html#da...
First of all, boost::date_time supports only the Gregorian calendar; Boost.Locale (ICU) supports many others. For example:

    std::cout << as::date << time(0) << std::endl;

In the en_US locale this would print something like

    November 11, 2009

whereas in the he_IL@calendar=hebrew locale it would print something like (translated to English)

    21 Cheshvan 5770

The number of months in a Hebrew year is not even constant. So date_time and its formatting are not appropriate for these purposes.
- time zones are also already in boost::date_time; however, I find your way better in that you use std::set<std::string> and not std::vector<std::string> as in the Time Zone Database of boost::date_time: http://www.boost.org/doc/libs/1_40_0/doc/html/date_time/local_time.html#date...
1. ICU has quite complete and powerful support for time zones. It is built into the library itself and does not rely on an external file that has to be downloaded and then loaded explicitly by the program.
2. It is quite hard to merge ICU time zones and boost::date_time time zones at the API level, but it is very simple to assign a time zone by its id.
3. Time zone support is used mostly for formatting.

So, for now, I think that it is quite hard to merge the two time zone implementations because of:
a) Different date-time representations
b) Different sources of time zone information.
On the other hand there are parts inside boost which should use boost::locale: - boost::regex (seems like it is currently using libicu directly?)
Yes, `boost::regex` uses ICU directly for Unicode support, but regular expression matching is not locale dependent (at least according to the ICU API). So there is no reason to use `boost::locale`.
- boost::format
Take a look at the boost::format notes here: http://cppcms.sourceforge.net/boost_locale/docs/index.html#localized-text-fo...

In fact there are some issues with boost::format. For example:

    std::locale::global(gen("ar_EG"));
    cout.imbue(gen("en_US"));
    cout << boost::format("%1$d") % 103 << endl;

may print "١٠٣" instead of "103", because it uses the global locale instead of the iostream locale. This can't be fixed easily; in fact it would cause quite big changes in boost::format semantics. So boost::locale::format was created: it is more appropriate for localization than boost::format, which is much more appropriate for actual data formatting.
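The general pitfall described here can be reproduced with the standard library alone. The sketch below is not boost::format's implementation. It only shows how a formatter that renders through its own temporary stream picks up the global locale unless it copies the locale of the target stream. The names `comma_point`, `write_via_global` and `write_via_stream_locale` are made up for this example, and the global locale is assumed to be the default "C" locale.

```cpp
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

// Illustrative facet: use ',' as the decimal point.
struct comma_point : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Renders through a temporary stream, which is constructed with the
// global locale and therefore ignores the locale imbued into `out`.
void write_via_global(std::ostream& out, double value) {
    std::ostringstream tmp;
    tmp << value;
    out << tmp.str();
}

// Same, but the temporary stream first copies the locale of `out`.
void write_via_stream_locale(std::ostream& out, double value) {
    std::ostringstream tmp;
    tmp.imbue(out.getloc());
    tmp << value;
    out << tmp.str();
}
```

When `out` has been imbued with a locale containing `comma_point`, only the second function honours it. That is the difference between a formatter tied to the global locale and one tied to the stream locale.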
Some remarks about the doxygen documentation: http://cppcms.sourceforge.net/boost_locale/docs/doxy/html/
- It should read #include <boost/locale/generator.hpp> instead of #include <generator.hpp> (and all the others).
- impl() should be private and should not be documented. "Do not use this" is not very useful.
I'll try to remove it from the doxygen output, but making it public allows a much easier implementation of the many functions that need access to impl. In fact impl holds the icu::Locale class. In any case it can't be used by the user, because the class is not defined for him; only a forward declaration exists.
- Documentation missing for operator== in timezone: Does it compare object identity or id?
Thanks. It compares ids. I'll add this to documentation.
Now I will discuss the problems in the order of the tutorial: http://cppcms.sourceforge.net/boost_locale/docs/
- I did not find a list of all available facets. How can I get back to the defaults (all facets activated) after some are turned off?
Thanks. You can pass the flag "all_categories" or "all_characters". I'll add this remark and the list of all facets.
== Collation ==
- There should be some explanation of std::use_facet, or of facets in general, because it is a very seldom used feature of the C++ standard. That's because without something like Boost.Locale it is barely usable.
Ok, good point, I'll add.
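For readers unfamiliar with facets, a short sketch of std::use_facet using only the standard std::collate facet may help. It does not use Boost.Locale's collator or its collation levels. The helper name `collate_compare` is an assumption for this example.

```cpp
#include <locale>
#include <string>

// Compare two strings with the collate facet stored in `loc`.
// The result is negative, zero or positive, like strcmp.
int collate_compare(const std::locale& loc,
                    const std::string& a,
                    const std::string& b) {
    // use_facet returns a reference to the facet owned by the locale;
    // it throws std::bad_cast if the locale does not contain that facet.
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

A std::locale object can also be passed directly as a comparison predicate, for example as the third argument of std::sort over a vector of strings, because its operator() compares two strings through the same collate facet.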
- loc was used in use_facet<collator<wchar_t> >(loc).compare(collator_base::secondary,a,b); but not defined before (inside Collation chapter). Better you use locale(). The same problem with some_locale and some_level. Give working examples whenever it is easily possible.
Thanks, I'll try to add remarks or fix.
- You wrote the comment: // Now strings uses default system locale for string comparison Does it mean that it uses the default system locale ("C"), the locale you get when you call gen(""), or the default which was set before with std::locale::global()?
Thanks, you are right, this is not clear. In this case I mean the locale set with std::locale::global(). I'll fix it.
I would suggest some glossary, like in boost::filesystem: http://www.boost.org/doc/libs/1_40_0/libs/filesystem/doc/reference.html#Defi...
Very good point. Thanks! I'll add.
== Number, Time and Currency ==
- What does "as" mean? Is this just meant as the english word as?
English word "as"
- Negative Numbers as currency do not work as expected. The output of the program with a de_AT locale is: -€ 14,00 but I would expect € -14,00
Unfortunately, the locale database thinks differently. ICU formats money as -€ 14,00, and so does POSIX strfmon in the de_AT locale. So I assume this is the standard for monetary formatting in this locale.
- Parsing has the same problem, but even worse, the variable contains a completely wrong number afterwards! After parsing € -14,00 the variable contains 11027! The library really should throw an exception if there are parsing errors.
1. You should provide -€ 14,00 instead of € -14,00.
2. You should check failbit, as is done when parsing regular numbers; iostreams do not throw an exception unless a specific flag is set.

    cin >> as::currency >> how_much;
    if(cin.fail()) {
        // error occurred
    }

Or:

    cin.exceptions(istream::failbit);
    try {
        cin >> as::currency >> how_much;
        // do something
    }
    catch ....

So the behavior is fully standard compliant.
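Both styles of error handling can be tried with plain number parsing, without Boost.Locale or as::currency. The sketch below uses a std::istringstream in the classic locale. The function names `parse_checked` and `parse_throws` are assumptions for illustration only.

```cpp
#include <ios>
#include <locale>
#include <sstream>
#include <string>

// Style 1: inspect the stream state after extraction.
// On failure `value` is left untouched and false is returned.
bool parse_checked(const std::string& text, double& value) {
    std::istringstream in(text);
    in.imbue(std::locale::classic());
    double parsed = 0.0;
    in >> parsed;
    if (in.fail()) {
        return false;
    }
    value = parsed;
    return true;
}

// Style 2: ask the stream to throw when failbit gets set.
// Returns true if parsing `text` raised an exception.
bool parse_throws(const std::string& text) {
    std::istringstream in(text);
    in.imbue(std::locale::classic());
    in.exceptions(std::ios_base::failbit);
    try {
        double parsed = 0.0;
        in >> parsed;
    } catch (const std::ios_base::failure&) {
        return true;
    }
    return false;
}
```

Parsing into a temporary and copying it out only on success keeps the caller's variable from holding a meaningless number after a failed extraction, which is the situation described in the review.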
- You did not mention that cout.imbue() or cin.imbue() must be called first in the beginning of the section. Just setting the global locale (or default locale?) is not enough.
Yes, you're right, this is worth mentioning. Many do not know this.
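The point that setting the global locale does not affect streams that already exist can be checked with the standard library alone. In this sketch the facet `comma_decimal` and the function `format_after_global_change` are hypothetical names, and the global locale is assumed to be the default "C" locale when the function is called.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Illustrative facet: use ',' as the decimal point.
struct comma_decimal : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Creates a stream, then changes the global locale, then writes 1.5.
// The stream only sees the new locale if imbue() is called on it.
std::string format_after_global_change(bool call_imbue) {
    std::ostringstream out;  // copies the global locale at construction
    const std::locale custom(std::locale::classic(), new comma_decimal);
    const std::locale previous = std::locale::global(custom);
    if (call_imbue) {
        out.imbue(std::locale());  // std::locale() copies the current global locale
    }
    out << 1.5;
    std::locale::global(previous);  // restore the previous global locale
    return out.str();
}
```

The same reasoning applies to std::cout and std::cin, which are constructed before main() starts and therefore keep their original locale until imbue() is called on them.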
- as::ordinal did not work. In the locale de_AT
As I noted in the documentation, not all locales support as::spellout or as::ordinal, because some locales are missing the required rules.
The rule is simple: just add a "." for every number.
In any case, the best practice is to write

    cout << format(translate("The {1,ordinal} document")) % no << endl;

so that a smart translator can simply substitute a translation string with more correct formatting; in your case:

    The {1,number}. document

I would also suggest using the latest ICU version. Many things in the locale database have improved since 3.6.
- boost::locale::as::currency_iso does not work: std::cout << boost::locale::as::currency << boost::locale::as::currency_iso << 1523.45 << std::endl; € 1.523,45 std::cout << boost::locale::as::currency_iso << 1523.45 << std::endl; 1523.45 But it should read: 1.523,45 EUR
as::currency_iso requires ICU 4.2 and above. See: http://cppcms.sourceforge.net/boost_locale/docs/index.html#currency-formatti...
- Why is the type double used for time? You should use time_t or TimeType instead. double is used in the example, but also in offset_from_gmt.
Any numeric type can be used to represent time: time_t, double, int64_t, etc.
- Also for time it should be clearer that e.g. time_long is an additional modifier to time.
I think I mentioned this, but I'll clarify it more.
- gmt seems to modify both. time_zone also modifies both, even though it has the prefix "time_". All others with that prefix only modify time formatting. Maybe you should avoid prefixing and use a namespace instead?
gmt and time_zone should not modify both; if they do, this is a bug and I'll fix it.
- typing mistakes: signle -> single
Thanks, I'll fix it.
- as::domain did not work, when using: std::cout << boost::locale::as::domain("foo") << boost::locale::translate("foo") << std::endl;
Thanks, I'll check and fix this (for some weird reason I remember having already fixed it).
- Why is wchar_t used even though it is assigned to a string? std::string msg = translate("Do you want to open the file?").str<wchar_t>(some_locale)
Oops, thanks, I'll fix this.
- The datatype "message" in send_to_all is not explained?
OK, I'll clarify this. message is a special container for a translated message that is rendered according to the locale when it is written.
What is "ms"?
Typo.. Thanks.
- " missing and translate misspelled in: cout << format(tranlsate("You have 1 file in the directory",You have {1} files in the directory",files)) % files << endl;
- gettext, ngettext are exposed into global namespace when <boost/locale.hpp> is included!
Thanks, I'll fix this
== Code-page conversions ==
- Please give an example for code converter.
Good point, it definitely needs some clarification.
- Spelling: indx should be index?
No, just a shorter name.
- first test example does not output any character (only line breaks)
You are right, I'll fix this.
- misspelling: tranlsate
Thanks,
- std::locale::name (or better std::locale().name()) just returns *, even when German locales are set. I would expect that your library sets a name?
No. Boost.Locale can't change the base locale class, so it can't change how the name is defined. Also, the name is not portable. For example, under Linux en_US.UTF-8 is a fine name, but Windows would not accept it and would throw an exception. An exception is also thrown if the locale has not been generated. So I can't change the behavior of the underlying std::locale class, and that is why boost::locale::info is provided.
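The "*" that Markus observed is what the C++ standard specifies for any locale without a name, and a locale built by adding a facet to another locale has no name. A minimal check using only the standard library (the function name `name_after_adding_facet` is an assumption for this example):

```cpp
#include <locale>
#include <string>

// Builds a new locale from `base` plus one extra facet and returns its name.
// Locales constructed this way are unnamed, so name() returns "*".
std::string name_after_adding_facet(const std::locale& base) {
    const std::locale combined(base, new std::numpunct<char>());
    return combined.name();
}
```

Because every locale produced by installing custom facets is unnamed in this sense, locale identity has to be exposed through a separate facet rather than through std::locale::name().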
- How is info meant to be used? ~info is protected, so it seems that I should not instantiate the object myself?
Like all other facets, it has a protected destructor and is managed and destroyed by the std::locale class. So you create it with new and the locale class manages it:

    std::locale some_new_locale(some_old_locale,new info("en_US"));

Also, you never use info directly; rather, you use it via the use_facet function:

    cout << std::use_facet<info>(someloc).language() << endl;

If some of my facet destructors are not protected, that is a bug.
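The facet lifetime rules described here are plain standard C++ and can be shown with a small stand-in facet. `language_info` below is a hypothetical class written for this example. It is not boost::locale::info and only mimics the pattern of a protected destructor, a static id member, and access through std::use_facet.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical facet carrying a language code.
class language_info : public std::locale::facet {
public:
    static std::locale::id id;  // required for every facet type

    explicit language_info(const std::string& language, std::size_t refs = 0)
        : std::locale::facet(refs), language_(language) {}

    const std::string& language() const { return language_; }

protected:
    // Protected: only the owning std::locale destroys the facet.
    ~language_info() override {}

private:
    std::string language_;
};

std::locale::id language_info::id;

// Installs the facet; the returned locale takes ownership of the object.
std::locale with_language(const std::locale& base, const std::string& language) {
    return std::locale(base, new language_info(language));
}

// Reads the facet back, or returns an empty string if it is not installed.
std::string language_of(const std::locale& loc) {
    if (!std::has_facet<language_info>(loc)) {
        return std::string();
    }
    return std::use_facet<language_info>(loc).language();
}
```

The object created with new is never deleted by user code; the reference count kept by std::locale::facet releases it when the last locale holding it goes away.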
== Multiple Locales ==
- Do you mean gen.get() in: std::locale ar=get("ar_EG");
Yes, thanks.
Why is it not cached in that case?
Because the `get` member function is a const member: it tries to find the locale in the existing set and fetches it; if nothing is found, it generates one. So first you populate the cache with all frequently used locales and then fetch them; if something unusual is requested, it is generated in place. I could use a true "caching" mechanism, but it would require locking on each locale request.
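The design described above, where a cache is populated up front and then read through a const lookup that generates on a miss without storing, can be sketched as follows. `locale_cache`, `preload`, `is_cached` and the tag facet `name_tag` are hypothetical names for this illustration. This is not the Boost.Locale generator, and "generating" a locale here only attaches a name tag to the classic locale.

```cpp
#include <cstddef>
#include <locale>
#include <map>
#include <string>
#include <utility>

// Hypothetical tag facet so that a generated locale remembers its name.
class name_tag : public std::locale::facet {
public:
    static std::locale::id id;
    explicit name_tag(const std::string& name, std::size_t refs = 0)
        : std::locale::facet(refs), name_(name) {}
    const std::string& name() const { return name_; }
protected:
    ~name_tag() override {}
private:
    std::string name_;
};
std::locale::id name_tag::id;

class locale_cache {
public:
    // Non-const: generates and stores a locale. Intended for start-up.
    void preload(const std::string& name) {
        cache_.insert(std::make_pair(name, generate(name)));
    }

    // Const: returns the cached locale, or generates one in place on a miss.
    // The cache itself is never modified here.
    std::locale get(const std::string& name) const {
        std::map<std::string, std::locale>::const_iterator it = cache_.find(name);
        return it != cache_.end() ? it->second : generate(name);
    }

    bool is_cached(const std::string& name) const {
        return cache_.count(name) != 0;
    }

private:
    // Stand-in for real locale generation.
    static std::locale generate(const std::string& name) {
        return std::locale(std::locale::classic(), new name_tag(name));
    }

    std::map<std::string, std::locale> cache_;
};
```

Since `get` only reads the map, several threads can call it at once after preloading has finished without a lock. The price is that an uncommon locale is regenerated on every request, which is the trade-off mentioned above.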
- How much effort did you put into your evaluation? about 6 hours
Markus, thanks a lot for this review; it is extremely valuable and useful. It will help me a lot. I think I'll be able to do all the fixes in a few days, and I'll send a notice.

Best,
Artyom
participants (3)
- Artyom
- Markus Raab
- Paul A. Bristow