[locale] Formal review of Boost.Locale library starts tomorrow

Chad Nelson

6 Apr 2011 6 Apr '11

9:29 p.m.

The formal review of the Boost.Locale library by Artyom Beilis starts tomorrow, April 7th, and is scheduled to last through April 16th. *************** * Its Purpose * *************** Boost.Locale provides thorough localization features to C++ programs by way of std::locale facets. Excerpted from the introduction: C++ offers a very good base for localization via the existing C++ locale facets [...] But these are very limited and sometimes buggy by design. Support for localization varies [...], and there are frequently incompatibilities between them. On the other hand, there is a great, well debugged, high quality, widely used ICU library that gives all of the goodies. But it has a very dated API that mimics Java behavior, completely ignores the STL, and provides a useful API only for UTF-16 encoded text, ignoring other popular Unicode encodings like UTF-8 and UTF-32 and limited but still popular national character sets like Latin1. Boost.Locale provides the natural glue between the C++ locales framework, iostreams, and the powerful ICU library. Although it can use the ICU library, it supports several other processing options as well. ******************* * Where to get it * ******************* You can find the "boost_locale_for_review" version here: https://sourceforge.net/projects/cppcms/files/boost_locale/ The HTML documentation can also be seen at: http://cppcms.sourceforge.net/boost_locale/html/index.html ******************** * Writing a review * ******************** The reviews and all comments should be submitted to the developers list, and the email should have "[locale]" at the beginning of the subject line to make sure it's not missed. Please explicitly state in your review whether the library should be accepted. The general review checklist: - What is your evaluation of the design? - What is your evaluation of the implementation? - What is your evaluation of the documentation? - What is your evaluation of the potential usefulness of the library? - Did you try to use the library? With what compiler? Did you have any problems? - How much effort did you put into your evaluation? A glance? A quick reading? In-depth study? - Are you knowledgeable about the problem domain? -- Chad Nelson Oak Circle Software, Inc. * * *

Attachments:

signature.asc (application/pgp-signature — 198 bytes)

Show replies by date

John Bytheway

8 Apr 8 Apr

8:14 a.m.

On 06/04/11 22:29, Chad Nelson wrote:

...

The general review checklist:

- What is your evaluation of the design?

Overall the design seems well thought out. The following stand out as good points: - The availability of various backends reassures me that the design is appropriately general and flexible. - The option to treat std::string as UTF-8 is welcome; I am fed up with e.g. Glib which insists on assuming std::string is in the default locale. - I get the impression that Artyom has been using the library extensively and it is better for that experience. Only one worry presented itself to me as I read the docs. Consider the following example: date_time some_point = period::year * 1995 + period::january + period::day*1; It looks like a period can be implicitly cast to a date_time. This seems dangerous. I feel the start of 1995 should be obtained more explicitly through the calendar.

...

- What is your evaluation of the implementation?

Did not look.

...

- What is your evaluation of the documentation?

The prose of the documentation is very good, and the tutorials do a good job of introducing the various aspects. I did not look at the reference in detail, but I am more concerned that it does not offer the necessary level of detail. Consider for example the documentation for normalize (<http://cppcms.sourceforge.net/boost_locale/html/group__convert.html#ga11672e3cc3ed7eecf1dd07060265aab2>). I went to the reference for this because the tutorial (<http://cppcms.sourceforge.net/boost_locale/html/conversions.html>) did not explain. I see that it performs Unicode normalization, but there is no hint as to what value I may pass as the 'n' parameter. Where are these documented? This is a rhetorical question; I know the answer is "at the top of the page", but a link would be useful. There are many terms used in the documentation which may not be familiar to all readers. e.g. what it means to "fold case" and what the definitions of the various normalization forms, collation levels, etc. are. Some terms are covered in the glossary, but not all and only briefly. There are some links to Wikipedia articles, but not many. I suggest that links to other resources such as more Wikipedia articles, Unicode Consortium documents, etc., especially in the glossary or in a new "references" appendix but also wherever else appropriate (perhaps linking via a list in the appendix). I second the request for previous/next links. It's hard to browse without them. There are many code examples, most of which involve writing something to stdout, but few of them include examples of possible output. It would be extremely helpful to add example output (preferably more than one example where the output can vary by locale) to more of the examples. Assorted minor issues: In "Locale Generation": - You don't mention what character delimits multiple "variant" options in a locale identifier. Is this even possible? This suggests so, but later the boost::local::info::variant function implies that it is not possible. In "Conversions" - You should mention at least briefly what fold_case and normalize do. In "Messages formatting": - There are four "The source text is not copied." with different emphasis. I think the second and fourth should read "The source text is copied.". In that same paragraph the formatting is wrong near "Plural form translation: Translate a message"; it's missing a newline and wrongly indented. - The example of multiple message domains defines some hypothetical domains, but doesn't use them in the example; it would be clearer to connect the two. - In the full list of xgettext parameters most but not all of the function names given are links to their respective docs. Could they all be? In "Localized Text Formatting": - You do not explain how to include a literal '{' character in the string. - The syntax grammar seems wrong. 'key' seems to be being defined to be just one character, I guess there should be a '*'? - "date, time , datetime" has a stray space after "time". - It says "See as::ftime manipulator.", but the "as::ftime" is not a link. Could it be? In "Working with dates, times, timezones and calendars.": - Just how general is the calendar format here? In "Using Localization Backends": - You discuss on what platforms the standard library backend is useful. Where you say "on Linux with GCC or Intel compilers", would it be better to say "when using the GCC standard library, for example with GCC, the Intel compiler" since I guess e.g. clang (if used with the gcc std lib) would work too? - A comment in an example has stray Doxygen markup: "select \c std backend as default", and again later: "since it is not supported by \c std". Misc: - There was some earlier discussion of whether Boost.Locale should provide the ability to load gettext catalogs from other sources than the filesystem. Was this resolved? It might warrant a mention in the rationale.

...

- What is your evaluation of the potential usefulness of the library?

Very useful. I have in the past written programs for which I would have liked to include localization support, but could find no existing library which presented an interface nice enough that I felt I could use it. Boost.Locale's interface looks good enough that I might.

...

- Did you try to use the library? With what compiler? Did you have any problems?

I did not try.

...

- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

A reading of the tutorials only, not the reference.

...

- Are you knowledgeable about the problem domain?

No. As a native English speaker I live in blissful ignorance of most of the difficulties :). I hope some users of diverse languages and cultures get the chance to try the library during the review.

...

Please explicitly state in your review whether the library should be accepted.

Yes it should. I send this review now since I suspect I will not get around to looking at the library in more detail or try it myself. If I do manage that I will be sure to report back further. For now, thanks very much to Artyom and good luck with the review. John Bytheway

Artyom

10:57 a.m.

...

- I get the impression that Artyom has been using the library extensively and it is better for that experience.

It is actually integrated part of CppCMS project. For example this wiki (written in C++/CppCMS) http://art-blog.no-ip.info/wikipp/en/page/main Is localized using Boost.Locale and it uses collation, date/time formatting message translation and multiple locales in same process. So yes, I definitely use it all the time :-).

...

Only one worry presented itself to me as I read the docs. Consider the following example:

date_time some_point = period::year * 1995 + period::january + period::day*1;

It looks like a period can be implicitly cast to a date_time. This seems dangerous. I feel the start of 1995 should be obtained more explicitly through the calendar.

Actually date_time by default initialized to current time. So if you set a year 1995 it would be today's date and time back into 1995. I prefer not to use explicit cast as it makes life much easier for users and also period::year * 1995 + period::january + period::day*1 Is a periods set has its own type that are set later together. Also small notices period::january is alias for period::month*0 and has different type then period::month.

...

The prose of the documentation is very good, and the tutorials do a good job of introducing the various aspects.

I did not look at the reference in detail, but I am more concerned that it does not offer the necessary level of detail. Consider for example the documentation for normalize (<http://cppcms.sourceforge.net/boost_locale/html/group__convert.html#ga11672e3cc3ed7eecf1dd07060265aab2>). . I went to the reference for this because the tutorial (<http://cppcms.sourceforge.net/boost_locale/html/conversions.html>) did not explain. I see that it performs Unicode normalization, but there is no hint as to what value I may pass as the 'n' parameter. Where are these documented? This is a rhetorical question; I know the answer is "at the top of the page", but a link would be useful.

Strange that Doxygen didn't inserted like to norm_type automatically. I'll add it and try to look over similar places.

...

There are many terms used in the documentation which may not be familiar to all readers. e.g. what it means to "fold case" and what the definitions of the various normalization forms, collation levels, etc. are. Some terms are covered in the glossary, but not all and only briefly. There are some links to Wikipedia articles, but not many. I suggest that links to other resources such as more Wikipedia articles, Unicode Consortium documents, etc., especially in the glossary or in a new "references" appendix but also wherever else appropriate (perhaps linking via a list in the appendix).

Good point, I'll address it.

...

I second the request for previous/next links. It's hard to browse without them.

Ok, I'll see how to do this, unfortunately Doxygen does not know to do them automatically, but at the top you always have a link upper and you can go forward.

...

There are many code examples, most of which involve writing something to stdout, but few of them include examples of possible output. It would be extremely helpful to add example output (preferably more than one example where the output can vary by locale) to more of the examples.

Ok, I'll try to address this where I can.

...

Assorted minor issues:

In "Locale Generation":

- You don't mention what character delimits multiple "variant" options in a locale identifier. Is this even possible? This suggests so, but later the boost::local::info::variant function implies that it is not possible.

Actually variant support vary by backend. Unix supports predefined set like @latin or @euro and ICU handles them well. Windows backends does not recognize them and ICU provides its own syntax. Like calendar=hebrew;collation=phonetic I'll add these examples.

...

In "Conversions"

- You should mention at least briefly what fold_case and normalize do.

Good point.

...

In "Messages formatting":

- There are four "The source text is not copied." with different emphasis. I think the second and fourth should read "The source text is copied.". In that same paragraph the formatting is wrong near "Plural form translation: Translate a message"; it's missing a newline and wrongly indented.

Ok, missed that you are right.

...

- The example of multiple message domains defines some hypothetical domains, but doesn't use them in the example; it would be clearer to connect the two.

Good point. I'll address it.

...

- In the full list of xgettext parameters most but not all of the function names given are links to their respective docs. Could they all be?

Actually it is something that Doxygen did for me implicitly, the point is that u16* and u32* are hidden as they mostly not in use. I'll see if this is possible to fix this.

...

In "Localized Text Formatting":

- You do not explain how to include a literal '{' character in the string.

Good point, forgot this, it is '{{' and '}}'

...

- The syntax grammar seems wrong. 'key' seems to be being defined to be just one character, I guess there should be a '*'?

Yes should be +

...

- "date, time , datetime" has a stray space after "time".

Ok I'll fix this.

...

- It says "See as::ftime manipulator.", but the "as::ftime" is not a link. Could it be?

I'll add it.

...

In "Working with dates, times, timezones and calendars.":

- Just how general is the calendar format here?

What do you mean? ICU supports several calendars (Hebrew, Islamic, Japanese, Chinese and some more) and their support vary by ICU version.

...

In "Using Localization Backends":

- You discuss on what platforms the standard library backend is useful. Where you say "on Linux with GCC or Intel compilers", would it be better to say "when using the GCC standard library, for example with GCC, the Intel compiler" since I guess e.g. clang (if used with the gcc std lib) would work too?

I can't tell, not tested with clang. But I must say that toooo many standard libraries has broken locale support. For example GCC's libstd++ supports locales only on Linux based on libc's (now POSIX but in earlier days non-standard functions like newlocale/duplocale). So I would not be surprised that clang's version of libstd++ compiled with only C/POSIX locales in. For my experience there are very few standard C++ libraries that do this well. Even stlport's locales quite broken (see faults of SunStudio with stlport in tested platforms/compilers list)

...

- A comment in an example has stray Doxygen markup: "select \c std backend as default", and again later: "since it is not supported by \c std".

Noted.

...

Misc:

- There was some earlier discussion of whether Boost.Locale should provide the ability to load gettext catalogs from other sources than the filesystem. Was this resolved? It might warrant a mention in the rationale.

Ok this is excellent question. Few days ago I passed once again on older reviews and found out that I forgot the request. It is implemented and documented now in trunk but it was fully tested only in last nightly build (as part of CppCMS) http://art-blog.no-ip.info/files/nightly-build-report.html So I decided not to take a chance of breaking current release and not included it in review version. Basically there is an additional callback of type: boost::function<std::vector<char>(std::string const &file_name,std::string const &encoding)> In messages_info structure. See: http://cppcms.sourceforge.net/boost_locale/html/gnu__gettext_8hpp_source.htm... So a custom file system support can be installed in few lines of code (example exists in trunk as well)

...

...
Please explicitly state in your review whether the library should be accepted.

Yes it should.

I send this review now since I suspect I will not get around to looking at the library in more detail or try it myself. If I do manage that I will be sure to report back further.

For now, thanks very much to Artyom and good luck with the review.

John Bytheway

Thank you very much for the review and many important inputs. Artyom

Steven Watanabe

2:17 p.m.

New subject: [locale] Formal review of Boost.Locale library starts tomorrow

AMDG On 04/08/2011 03:57 AM, Artyom wrote:

...

...
In "Using Localization Backends":

- You discuss on what platforms the standard library backend is useful. Where you say "on Linux with GCC or Intel compilers", would it be better to say "when using the GCC standard library, for example with GCC, the Intel compiler" since I guess e.g. clang (if used with the gcc std lib) would work too?

I can't tell, not tested with clang.

But I must say that toooo many standard libraries has broken locale support. For example GCC's libstd++ supports locales only on Linux based on libc's (now POSIX but in earlier days non-standard functions like newlocale/duplocale).

So I would not be surprised that clang's version of libstd++ compiled with only C/POSIX locales in.

clang uses GCC's libstdc++ directly. It doesn't have its own version. In Christ, Steven Watanabe

Dave Abrahams

2:20 p.m.

At Fri, 08 Apr 2011 07:17:43 -0700, Steven Watanabe wrote:

...

AMDG

On 04/08/2011 03:57 AM, Artyom wrote:

...
So I would not be surprised that clang's version of libstd++ compiled with only C/POSIX locales in.

clang uses GCC's libstdc++ directly. It doesn't have its own version.

I think Artyom meant http://libcxx.llvm.org/ -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Marshall Clow

2:54 p.m.

On Apr 8, 2011, at 7:17 AM, Steven Watanabe wrote:

...

On 04/08/2011 03:57 AM, Artyom wrote:

...
...
In "Using Localization Backends":

- You discuss on what platforms the standard library backend is useful. Where you say "on Linux with GCC or Intel compilers", would it be better to say "when using the GCC standard library, for example with GCC, the Intel compiler" since I guess e.g. clang (if used with the gcc std lib) would work too?

I can't tell, not tested with clang.

But I must say that toooo many standard libraries has broken locale support. For example GCC's libstd++ supports locales only on Linux based on libc's (now POSIX but in earlier days non-standard functions like newlocale/duplocale).

So I would not be surprised that clang's version of libstd++ compiled with only C/POSIX locales in.

clang uses GCC's libstdc++ directly. It doesn't have its own version.

Actually, it does: http://libcxx.llvm.org/ But it's in the process of being rolled out (i.e, most people are still using libstdc++) -- Marshall Marshall Clow Idio Software <mailto:mclow.lists@gmail.com> A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait). -- Yu Suzuki

Sebastian Redl

3:11 p.m.

On 08.04.2011, at 16:54, Marshall Clow wrote:

...

On Apr 8, 2011, at 7:17 AM, Steven Watanabe wrote:

...
clang uses GCC's libstdc++ directly. It doesn't have its own version.

Actually, it does: http://libcxx.llvm.org/

That's its own standard library, but on Linux, you will generally use libstdc++, not libc++, for compatibility. And there, Clang simply uses the installed one. Sebastian

Artyom

6:24 p.m.

...

AMDG

On 04/08/2011 03:57 AM, Artyom wrote:

...
...
In "Using Localization Backends":

- You discuss on what platforms the standard library backend is useful. Where you say "on Linux with GCC or Intel compilers", would it be better to say "when using the GCC standard library, for example with GCC, the Intel compiler" since I guess e.g. clang (if used with the gcc std lib) would work too?

I can't tell, not tested with clang.

But I must say that toooo many standard libraries has broken locale support. For example GCC's libstd++ supports locales only on Linux based on libc's (now POSIX but in earlier days non-standard functions like newlocale/duplocale).

So I would not be surprised that clang's version of libstd++ compiled with only C/POSIX locales in.

clang uses GCC's libstdc++ directly. It doesn't have its own version.

If so then it would likely work on Linux, (on all other patforms libstd++ support only "C" and "POSIX" locales). In any case the simplest way to check if locales support is useful is to run this small test: #include <iostream> #include <locale> int main() { try { std::locale loc("en_US.UTF-8"); // if installed on system (check locale -a) std::cout << "Supported" << std::endl; }catch(std::exception const &e) { std::cerr << "Not supported" << std::endl; } } In any case for Linux it is not critical as posix backend is supported. Artyom

John Bytheway

6:06 p.m.

On 08/04/11 11:57, Artyom wrote:

...

...
Only one worry presented itself to me as I read the docs. Consider the following example:

date_time some_point = period::year * 1995 + period::january + period::day*1;

It looks like a period can be implicitly cast to a date_time. This seems dangerous. I feel the start of 1995 should be obtained more explicitly through the calendar.

Actually date_time by default initialized to current time. So if you set a year 1995 it would be today's date and time back into 1995.

I prefer not to use explicit cast as it makes life much easier for users and also

period::year * 1995 + period::january + period::day*1

Is a periods set has its own type that are set later together.

An explicit cast is not the alternative I was considering. I feel something like date_time some_point = gregorian_calendar.year(period::ad, 1995) + period::january + period::day*1; would make more sense. I think I was confused about what a date_time actually is. I was thinking about it like a POSIX time, which needs an associated calendar to give it a human-readable description, but on closer inspection I see that this is not the case. I think you should clarify the description under "Handling Dates and Time"; the current "represents a time point. It is constructed from a calendar and allows manipulation of various time periods." doesn't really give enough clue as to how the programmer should be thinking of its semantics.

...

...
- Just how general is the calendar format here?

What do you mean?

ICU supports several calendars (Hebrew, Islamic, Japanese, Chinese and some more) and their support vary by ICU version.

Sorry, I never finished writing that paragraph. I meant to say: Will it support for example the French Revolutionary Calendar, or the Mayan Calender? http://en.wikipedia.org/wiki/French_Republican_Calendar http://en.wikipedia.org/wiki/Mayan_calendar I'm guessing the former would work in theory if ICU supported it, but the latter is too different to be meaningful. My point is: it would be helpful to give an impression of what properties a calendar must have for this abstraction to be meaningful. For example, must it have years, months, weeks, and days? Perhaps the ICU docs would explain it in more detail, in which case a reference to that would suffice. At present you simply have a warning "Be very careful with assumptions about calendars.", but no indication as to what assumptions *are* safe to make. John

Artyom

7:44 p.m.

...

...
...
Only one worry presented itself to me as I read the docs. Consider the

...

...
...
following example:

date_time some_point = period::year * 1995 + period::january + period::day*1;

It looks like a period can be implicitly cast to a date_time. This seems dangerous. I feel the start of 1995 should be obtained more explicitly through the calendar.

Actually date_time by default initialized to current time. So if you set a year 1995 it would be today's date and time back into 1995.

I prefer not to use explicit cast as it makes life much easier for users and also

period::year * 1995 + period::january + period::day*1

Is a periods set has its own type that are set later together.

An explicit cast is not the alternative I was considering. I feel something like

date_time some_point = gregorian_calendar.year(period::ad, 1995) + period::january + period::day*1;

would make more sense.

The point is you generally want to use global locale by default that defines the calendar. So basically you can do: date_time gregorian_time(some_locale); gregorian_time = period::year * 1999 + ...

...

I think I was confused about what a date_time actually is. I was thinking about it like a POSIX time, which needs an associated calendar to give it a human-readable description, but on closer inspection I see that this is not the case. I think you should clarify the description under "Handling Dates and Time"; the current "represents a time point. It is constructed from a calendar and allows manipulation of various time periods." doesn't really give enough clue as to how the programmer should be thinking of its semantics.

I'll probably try to clarify. date_time object is the object that has following properties: - Time - independent representation of time point (posix time) - Calendar - how to operate on the dates and times in currect calendar rules. Meaning what should I do when I want to change a date 5 days later (what changes months etc) and what happens to local time (Summer time period) and so on. So given time point it defines an operations on it - how to handle time in human periods (days weeks years months, hours etc) The textual representation (output) is different story and it is done on other level.

...

...
...
- Just how general is the calendar format here?

What do you mean?

ICU supports several calendars (Hebrew, Islamic, Japanese, Chinese and some

...

...
more) and their support vary by ICU version.

Sorry, I never finished writing that paragraph. I meant to say:

Will it support for example the French Revolutionary Calendar, or the Mayan Calender?

http://en.wikipedia.org/wiki/French_Republican_Calendar http://en.wikipedia.org/wiki/Mayan_calendar

I'm guessing the former would work in theory if ICU supported it, but the latter is too different to be meaningful.

I'm not so sure about the two above. Currently icu 4.6 supports following calendars: "gregorian", "japanese", "buddhist", "roc", // republic of China "persian", "islamic-civil", "islamic", "hebrew", "chinese", "indian", "coptic", "ethiopic", "ethiopic-amete-alem", As far as I know all have months, years and weeks, however even supported calendars have their weakness. For example, Hebrew calendar (lunar based one) that is active in Israel (my country) is not entirely correct in ICU. Why? Because dates are changed at sunset and not in 12AM. However in order to handle this information correctly you need to know exact geographical location of the user (latitude and longitude) so this isn't something known to the computer (unless it has GPS) So some assumptions are already made and has effect on its correctness. I can say safely: there are no safe assumptions... As you can see in the example I had shown. But some has to be made because otherwise programming wouldn't be possible, so you can assume that there are years, months, days and weeks and so on...

...

My point is: it would be helpful to give an impression of what properties a calendar must have for this abstraction to be meaningful. For example, must it have years, months, weeks, and days? Perhaps the ICU docs would explain it in more detail, in which case a reference to that would suffice. At present you simply have a warning "Be very careful with assumptions about calendars.", but no indication as to what assumptions *are* safe to make.

Almost nothing, number of months can vary, first day of months may be not 1 and so on. Best is to use minimal and maximal ranges for current time in calendar. The problem is that there are no easy way to define because there almost certainly would be something that you forget. The question is whether it is important or not. For example almost all software I had seen does not handle the change of date on sunset correctly for Hebrew calendar - and it is fine for most purposes. But software that was especially written for Religious purposes has additional input of geographical location and does handle Hebrew date change correctly (and it does not use ICU). Artyom P.S.: Basic thing to understand is when you deal with localization is that it is limitless and with very high confidence you can say that even most generic software misses something for some culture and it can be discovered only by real users.

Steven Watanabe

6:43 p.m.

New subject: [locale] Formal review of Boost.Locale library starts tomorrow

AMDG On 04/08/2011 03:57 AM, Artyom wrote:

...

...
I went to the reference for this because the tutorial (<http://cppcms.sourceforge.net/boost_locale/html/conversions.html>) did not explain. I see that it performs Unicode normalization, but there is no hint as to what value I may pass as the 'n' parameter. Where are these documented? This is a rhetorical question; I know the answer is "at the top of the page", but a link would be useful.

Strange that Doxygen didn't inserted like to norm_type automatically. I'll add it and try to look over similar places.

Doxygen doesn't auto-link for all lower-case class names. You need to use \ref. In Christ, Steven Watanabe

Steven Watanabe

2:48 p.m.

New subject: [locale] Formal review of Boost.Locale library starts tomorrow

AMDG I've started reading through the documentation and just on the main page, there are several expressions that I find slightly awkward. I've also tried to come up with a better phrasing for them: Paragraph 1: ...to provide localization in transparent... add an article to get ...to provide localization in a transparent... Paragraph 3: On the other hand, there is a great, ... ICU library... change a to the On the other hand, there is the great, ... ICU library... Paragraph 5: It is based on the operating system native API or on the standard C++ library support. to It is based on the operating system's native API or on the C++ standard library. Should I just create a patch for all such things? In Christ, Steven Watanabe

Artyom

6:18 p.m.

...

Should I just create a patch for all such things?

...

This would be great, Thanks, Artyom

Steven Watanabe

6:01 p.m.

New subject: [locale] Formal review of Boost.Locale library starts tomorrow

AMDG Some minor nits when reading the documentation: Minor point: The canonical documentation link is libs/locale/index.html. This should redirect to wherever the documentation actually resides. I see a lot of places where you're using "may" where "can" would be more appropriate. Message formatting: s/is change to/in changed to/ direct translation: the declarations of ngettext and friends are missing the int n parameter, which is rather important. Character Set Conversions: Would it make sense to have an additional alternative of replacing characters that couldn't be converted with some default text? I'm used to seeing ??? and it seems as though if the text can't be converted it's better to see that there's something there than to have it silently disappear. Boundary analysis: The last example can't possibly be correct. It uses map.begin() and map.end() to initialize map. Building: Building with CMake: s/relay on auto-linking/rely on auto-linking/ Appendix: Default Encoding under Microsoft Windows: s/However such behavior may broke/However such behavior may break/ In Christ, Steven Watanabe

Artyom

6:36 p.m.

...

AMDG

Some minor nits when reading the documentation:

Minor point: The canonical documentation link is libs/locale/index.html. This should redirect to wherever the documentation actually resides.

I'll do it when it goes to trunk.

...

[snip] direct translation: the declarations of ngettext and friends are missing the int n parameter, which is rather important.

Yes, good point there is indeed should be "int n" there. Noted, will be fixed.

...

Character Set Conversions:

Would it make sense to have an additional alternative of replacing characters that couldn't be converted with some default text? I'm used to seeing ??? and it seems as though if the text can't be converted it's better to see that there's something there than to have it silently disappear.

Ok, there are several problems with this. Boost.Locale supports 3 character set conversion backends: a) ICU b) iconv c) Win32API (MultiByteTo... Only ICU can handle such substitution, and there is no simple way to do this with iconv and Win32 API especially in Win32 API (in one of the directions) where you don't really know where was the problem. So you have an option to ignore invalid code points or abort the conversion.

...

Boundary analysis:

The last example can't possibly be correct. It uses map.begin() and map.end() to initialize map.

Yes you are right there is a mistake, it should be: mapping<iterator> map(sentence,text.begin(),text.end(),sentence_term); And also for(int i=0;i<2 && *p!=text.end();i++)

...

[snip] some lingual comments

Thanks, I'll fix all these language mistakes. Artyom

Steven Watanabe

10:25 p.m.

New subject: [locale] Review part 1: headers

AMDG This is getting rather long already, so will be a multi-part review. I've attached a diff of various fixes in comments that I made while going through the headers. locale.hpp: boundary.hpp: line 22: you already #included <iterator> line 47: You don't need to use typedef enum { ... } name; enum name { ... }; is more normal in C++. Is there a reason that you use four bits for each flag? Is this to allow the flags to be split up without breaking binary compatibility? line 102: You assume that unsigned is at least 20 bits. It's usually 32, but I don't think the standard guarantees as much. line 131: Please be consistent about using unsigned vs. uint32_t. line 140: shouldn't offset be std::size_t? Why should the size of the string be limited to 4GB? std::size_t is guaranteed to be large enough to handle anything that can be stored in memory. Why do you have explicit specializations of boundary_indexing? They're all the same. The compiler is perfectly capable of instantiating them all from a single definition. line 435: it's somewhat surprising to see a templated swap. You should comment that it requires that the base_iterators must be the same type. line 447: assignment is strongly exception safe. you should make this guarantee explititly. line 538-539: I don't understand this sentence. What does "tide" mean? line 532: entry text? I don't think this is right, but I don't know what you meant to say. line 600: token_iterator(mapping_type const &map,bool begin,unsigned mask) Just make this constructor an implementation detail. It's too messy to be part of the public interface. line 628: if(this!=&other) I doubt that this actually helps performance, and the compiler's version of the copy constructor and assignment operator should work fine. line 646: Normally it's undefined behavior to dereference an iterator like this. I would prefer an assert to throwing an exception. break_iterator: most of the comments from token_iterator also apply to break_iterator. collator.hpp line 74: Please describe the meaning of the return value, even if you think it's obvious. line 115: You should spell out the post-conditions of transform more clearly. I presume that the following holds, but I'd like to see it explicitly: Given strings s1 and s2 and collation level level, let t1 = transform(level, s1) and t2 = transform(level, s2). Then, t1 < t2 iff compare(level, s1, s2) < 0. config.hpp: conversion.hpp: Again, why the separate specializations? date_time.hpp: line 75: this is dangerous. These constants will be defined separately in every translation unit using them. This can cause violations of the one definition rule. (The standard makes a special exception for integral constants, but these are of class type.) operator*: - You missed signed char and long long. - This is one place where a little template metaprogramming would help: template<class T> typename enable_if<is_arithmetic<T>, date_time_period>::type operator*(period::period_type, T); - I think trying to override the built-in enum to int conversion like this is rather fragile. What happens when someone tries to use an enum constant: enum { this_year = 2011 }; this_year * period::year; - the operators need to be defined whenever period_type is defined to avoid accidentally calling the built-in operator if the user forgets to #include a header. line 491: The argument to operator[] should be std::size_t to match std::vector. date_time_facet.hpp: encoding.hpp: lines 103,112,145,154,218: technically data() would be more appropriate than c_str(), since you don't need the null terminator. line 218: Please qualify this call to avoid ADL in namespace std. (Other places in this file too.) format.hpp: line 65: Use static_cast instead of reinterpret_cast. Actually, you don't need a cast at all. Use static_cast in write as well. line 104: shouldn't this be get_position()? Is there precedent for this format string syntax? What are the requirements on the encoding of the format string? What happens when the format string's syntax is invalid? Undefined behavior? An exception? What exception guarantee does streaming provide? If an exception is thrown, can the stream's locale and format flags have changed or are they reset correctly to their original values. lines 259-263: I'm sure this conversion from char works in practice, but does the standard guarantee it? lines 341,345: You defined named constants earlier. Why not use them? line 292-293: I'm not sure why you have separate svalue and value. AFAICT, value is used if the value begins with a quote and svalue is used otherwise. formatting.hpp: line 227: string_set::set is not exception safe. if new fails, then you'll be left with the new Char type, the old size, and a null pointer. generator.hpp: line 62: This assumes that unsigned is 32 bits, which implies that you should be using uint32_t. line 90: Use locale_category_type? gnu_gettext.hpp: line 79: Why is the argument non-const? hold_ptr.hpp line 29: The constructor should be explicit. line 36: no need to test for NULL. The compiler already checks for NULL for you. info.hpp: #include <map> and <memory> are unnecessary. localization_backend.hpp: Missing #include <memory> for auto_ptr. line 129: What about different back ends for different char types? Does that even make sense? message.hpp: The compiler generated assignment operator of message only provides basic exception safety. If it fails, the message object may not be meaningful. Please document this or change it. line 776: you could use aggregate initialization. util.hpp line 48: Can you link to the documentation of the format of name? What happens if name is not in the correct format? line 76-77: Please rewrite this sentence. I can't figure out what it's trying to be say. line 108: assert(typeid(*this) == typeid(base_converter))? In Christ, Steven Watanabe

Dave Abrahams

11:27 p.m.

New subject: [locale] Review part 1: headers

On Fri, Apr 8, 2011 at 6:25 PM, Steven Watanabe <watanabesj@gmail.com> wrote:

...

line 102: You assume that unsigned is at least 20 bits. It's usually 32, but I don't think the standard guarantees as much.

The standard guarantees only the same as it does for int: 16 bits. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Olaf van der Spek

9 Apr 9 Apr

10:19 a.m.

New subject: [locale] Review part 1: headers

On Sat, Apr 9, 2011 at 1:27 AM, Dave Abrahams <dave@boostpro.com> wrote:

...

On Fri, Apr 8, 2011 at 6:25 PM, Steven Watanabe <watanabesj@gmail.com> wrote:

...
line 102: You assume that unsigned is at least 20 bits. It's usually 32, but I don't think the standard guarantees as much.

The standard guarantees only the same as it does for int: 16 bits.

Is that really the case? I thought it was changed to 32 bits at some point. -- Olaf

Mathias Gaunard

10:37 a.m.

New subject: [locale] Review part 1: headers

On 09/04/2011 12:19, Olaf van der Spek wrote:

...

On Sat, Apr 9, 2011 at 1:27 AM, Dave Abrahams<dave@boostpro.com> wrote:

...
On Fri, Apr 8, 2011 at 6:25 PM, Steven Watanabe<watanabesj@gmail.com> wrote:

...
line 102: You assume that unsigned is at least 20 bits. It's usually 32, but I don't think the standard guarantees as much.

The standard guarantees only the same as it does for int: 16 bits.

Is that really the case? I thought it was changed to 32 bits at some point.

No it was never changed to 32 bits, and there was never any plan of doing so AFAIK.

Artyom

9:50 a.m.

New subject: [locale] Review part 1: headers

...

This is getting rather long already, so will be a multi-part review. I've attached a diff of various fixes in comments that I made while going through the headers.

Thanks, I'll add them to trunk.

...

locale.hpp:

boundary.hpp:

line 22: you already #included <iterator>

Noted

...

line 47: You don't need to use typedef enum { ... } name; enum name { ... }; is more normal in C++.

Good points, thanks.

...

Is there a reason that you use four bits for each flag? Is this to allow the flags to be split up without breaking binary compatibility?

Exactly.

...

line 102: You assume that unsigned is at least 20 bits. It's usually 32, but I don't think the standard guarantees as much.

I don't think neither ICU nor Boost supports non-32 bit platform. I understand that strictly speaking it is non-standard but de-facto there are no problems with this and this assumption is quite safe. I can change this to uint32_t

...

line 131: Please be consistent about using unsigned vs. uint32_t.

Good point

...

line 140: shouldn't offset be std::size_t? Why should the size of the string be limited to 4GB? std::size_t is guaranteed to be large enough to handle anything that can be stored in memory.

You are right, I'll fix this. However, two points: 1. I already know compilers were size_t != sizeof(void *) so actually it must be uintptr_t... but it is not in C++03 but rather in C99. (it is in C++0x) But generally you right. 2. There is no reason on planet earth you need to create boundary mapping for text >= 4G! Even entire War & Peace book takes about 3MB. So if you do this you are probably doing something very-very wrong. But this is different story.

...

Why do you have explicit specializations of boundary_indexing? They're all the same. The compiler is perfectly capable of instantiating them all from a single definition.

Ok this is something that you'll see all over the Boost.Locale code when it comes to deal with facets generations. It is in order to support (brain damaged) DLL platform. I explain. Each specific type of facet should have its own unique static std::locale::id as "id" member. So you must specialize it, such that its definition would be explicit in DLL, library otherwise if you create a facet in DLL and pass it to user level program or the other way around there would be 2 (or more) instances of std::locale::id. So when you would call std::use_facet<facet_type>(locale) it would throw std::bad_cast because there would be actually two instances of same id. I literally passed 9 circles of hell in order to make everything work for both MSVC and GCC. So finally I had found that specializing each instance of specific facet is most reliable and working solution.

...

line 435: it's somewhat surprising to see a templated swap. You should comment that it requires that the base_iterators must be the same type.

Good point. I'll add this.

...

line 447: assignment is strongly exception safe. you should make this guarantee explititly.

What do you mean?

...

line 538-539: I don't understand this sentence. What does "tide" mean?

Doesn't example makes it more clear. Probably I use the word "tide" incorrectly, basically it means whether I want to select everything or I need to select strict subsets. For example if you want to select every word or only thous who actually have letters? So basically it is strict token selection should be.

...

line 532: entry text? I don't think this is right, but I don't know what you meant to say.

Spelling mistake entire text.

...

line 600: token_iterator(mapping_type const &map,bool begin,unsigned mask) Just make this constructor an implementation detail. It's too messy to be part of the public interface.

Noted, I'll define it as internal interface.

...

line 628: if(this!=&other) I doubt that this actually helps performance, and the compiler's version of the copy constructor and assignment operator should work fine.

Yes you are right

...

line 646: Normally it's undefined behavior to dereference an iterator like this. I would prefer an assert to throwing an exception.

The point is that these locations are valid positions just you can't call operator*. I think it is better behavior because it is quite easy to make a mistake with this. (From my experience with this) So such behavior gives a better way to debug the software.

...

break_iterator: most of the comments from token_iterator also apply to break_iterator.

Ok, noted.

...

collator.hpp

line 74: Please describe the meaning of the return value, even if you think it's obvious.

Ok, good point.

...

line 115: You should spell out the post-conditions of transform more clearly. I presume that the following holds, but I'd like to see it explicitly: Given strings s1 and s2 and collation level level, let t1 = transform(level, s1) and t2 = transform(level, s2). Then, t1 < t2 iff compare(level, s1, s2) < 0.

Good point. I'll add this to docs.

...

conversion.hpp:

Again, why the separate specializations?

Same as above, DLL platform support...

...

date_time.hpp:

line 75: this is dangerous. These constants will be defined separately in every translation unit using them. This can cause violations of the one definition rule. (The standard makes a special exception for integral constants, but these are of class type.)

So how would you suggest to fix it? Remove static and move their constructors to cpp? In any case they just aliases so you always can write period::month * 0 that is equal to period::january

...

operator*: - You missed signed char and long long.

Actually for unsigned char as well unsigned cchar!=char!=signed char I don't know I missed this.

...

- This is one place where a little template metaprogramming would help: template<class T> typename enable_if<is_arithmetic<T>, date_time_period>::type operator*(period::period_type, T);

Ok, I'll take a look on this. Good point.

...

- I think trying to override the built-in enum to int conversion like this is rather fragile. What happens when someone tries to use an enum constant: enum { this_year = 2011 }; this_year * period::year;

I understand this, on-the other hand it seems to be very convenient. Because otherwise I'll have to define for every period class every arithmetic operator and it would be O(n^2) complexity. I'm not sure if it would be good either. How would you suggest to do this?

...

- the operators need to be defined whenever period_type is defined to avoid accidentally calling the built-in operator if the user forgets to #include a header.

Good point.

...

line 491: The argument to operator[] should be std::size_t to match std::vector.

Noted.

...

date_time_facet.hpp:

encoding.hpp:

lines 103,112,145,154,218: technically data() would be more appropriate than c_str(), since you don't need the null terminator.

No, it should be c_str(), because otherwise if the the string is empty the data() is undefined, so it may crash (and it is crashes with MSVC) So c_str() guarantees me that it is valid value.

...

line 218: Please qualify this call to avoid ADL in namespace std. (Other places in this file too.)

Can you explain better?

...

format.hpp:

line 65: Use static_cast instead of reinterpret_cast. Actually, you don't need a cast at all. Use static_cast in write as well.

Noted.

...

line 104: shouldn't this be get_position()?

Yes... My bad.

...

Is there precedent for this format string syntax?

Actually it is partially based on ICU format, and some others formats like Java and so on. The most important part that this format should be rich and extend-able.

...

What are the requirements on the encoding of the format string?

ASCII compatible... At least for '{' and '\'' symbols. This covers all supported and frequently used encodings in locale context.

...

What happens when the format string's syntax is invalid? Undefined behavior? An exception?

If the format is invalid it is ignored. Rationale: you don't want the translator that actually provides translated format strings to screw the program.

...

What exception guarantee does streaming provide? If an exception is thrown, can the stream's locale and format flags have changed or are they reset correctly to their original values.

Currently they are not restored. I thought it is better to to try to restore flags in case of exception as restoring they may actually trigger other one.

...

lines 259-263: I'm sure this conversion from char works in practice, but does the standard guarantee it?

Character types can be casted as any other integer types so I don't see problems there. The "correct" way is to use narrow() or wide of ctype facet but they are frequently broken so I prefer to use such cast as most reliable (and covered by the unit-tests)

...

lines 341,345: You defined named constants earlier. Why not use them?

Good point, I'll take a look on this.

...

line 292-293: I'm not sure why you have separate svalue and value. AFAICT, value is used if the value begins with a quote and svalue is used otherwise.

Because some flags has localized parameter (may be wide parameter) like {1,strftime='תאריך %c'} and others are actually are ASCII flags like {1,dt=full} They are treated differently.

...

formatting.hpp:

line 227: string_set::set is not exception safe. if new fails, then you'll be left with the new Char type, the old size, and a null pointer.

size is needed only when ptr!=0 in order to copy its content correctly, so strictly speaking it is exception safe. However it is good to reset size to 0. I'll add this.

...

generator.hpp:

line 62: This assumes that unsigned is 32 bits, which implies that you should be using uint32_t.

Yes... However all supported platforms are at least 32bits. There is one small points. As far as I remember I always used native types over strict types in the interfaces (function calls) because if it is changed between small releases (like stdint.h found) it would not break binary compatibility so native types (as long as they good enough) are better alternative. (We are talking about C++ compilers that does not handle C99 stdint well) So strictly speaking uint32_t is better but does not really important or unrealistic requirement.

...

line 90: Use locale_category_type?

Yes you are right. Should be fixed.

...

gnu_gettext.hpp:

line 79: Why is the argument non-const?

Yes, you are right... I don't know why had I missed that. It should be const (and in fact it goes to const constructor under the hood)

...

hold_ptr.hpp

line 29: The constructor should be explicit.

Good point.

...

line 36: no need to test for NULL. The compiler already checks for NULL for you.

Yes, right. but does not hurts.

...

info.hpp:

#include <map> and <memory> are unnecessary.

Right... I'll remove them.

...

localization_backend.hpp:

Missing #include <memory> for auto_ptr.

Inherited for generator, but good point.

...

line 129: What about different back ends for different char types? Does that even make sense?

Not really. Generally you may want to have different backends because something is not supported. For example std backed may be suitable for formatting and parsing and for date_time handling (for example parsing numbers in std backend is much faster) But you still may want better ICU based collation and ICU based boundary analysis which is not supported by std::backend. As all backends support char/wchar_t and actually nobody at this point does not really support char16_t/char32_t (huge compiler/std library issues everywhere) So I think it is more then justified.

...

message.hpp:

The compiler generated assignment operator of message only provides basic exception safety. If it fails, the message object may not be meaningful. Please document this or change it.

Ok, interesting point, I'll deal with it.

...

line 776: you could use aggregate initialization.

What do you mean, message.hpp:776 points to #pragma warning(pop)

...

util.hpp

line 48: Can you link to the documentation of the format of name? What happens if name is not in the correct format?

It provides substitutions or ignores it. It is mostly function for backends that generates some shared facets and may be useful for backend developers.

...

line 76-77: Please rewrite this sentence. I can't figure out what it's trying to be say.

Yes you right. It should be /// /// This value is returned in case of incomplete input sequence (from_unicode member) or /// too small buffer (to_unicode member functions) ///

...

line 108: assert(typeid(*this) == typeid(base_converter))?

Why? To make sure it is overloaded? The point that clone() is used only in case is_thread_safe() == false. I can add this but how really important is that? Thanks for great comments. Artyom

Mathias Gaunard

10:20 a.m.

New subject: [locale] Review part 1: headers

On 09/04/2011 11:50, Artyom wrote:

...

I don't think neither ICU nor Boost supports non-32 bit platform.

I understand that strictly speaking it is non-standard but de-facto there are no problems with this and this assumption is quite safe.

Since this assumption arbitrarily limits the library to particular platforms and it is trivial to avoid relying it, I see no point in doing so.

...

I can change this to uint32_t

...
line 131: Please be consistent about using unsigned vs. uint32_t.

Good point

...
line 140: shouldn't offset be std::size_t? Why should the size of the string be limited to 4GB? std::size_t is guaranteed to be large enough to handle anything that can be stored in memory.

You are right, I'll fix this.

However, two points:

1. I already know compilers were size_t != sizeof(void *) so actually it must be uintptr_t... but it is not in C++03 but rather in C99. (it is in C++0x)

1) You work with contiguous memory, so size_t is the correct one, not uintptr_t 2) uintptr_t should ideally be available from boost/cstdint.hpp (it doesn't seem to be ATM, that's a bug) 3) I don't know of compilers where those are different, mind saying which ones exhibit this?

...

2. There is no reason on planet earth you need to create boundary mapping for text>= 4G!

Even entire War& Peace book takes about 3MB.

So if you do this you are probably doing something very-very wrong.

But this is different story.

Another unnecessary arbitrary limit. Why would you want to prevent people from treating very large data with your library? On the Internet, there is text much bigger than War & Peace, even if not necessarily in the form of books.

...

...
Why do you have explicit specializations of boundary_indexing? They're all the same. The compiler is perfectly capable of instantiating them all from a single definition.

Ok this is something that you'll see all over the Boost.Locale code when it comes to deal with facets generations.

It is in order to support (brain damaged) DLL platform.

Both MSVC and GCC support extern template, which is also a C++0x feature.

...

Yes... However all supported platforms are at least 32bits.

There is one small points. As far as I remember I always used native types over strict types in the interfaces (function calls) because if it is changed between small releases (like stdint.h found) it would not break binary compatibility so native types (as long as they good enough) are better alternative.

(We are talking about C++ compilers that does not handle C99 stdint well)

Boost has a portable stdint equivalent. In any case, you shouldn't even use uint32_t, but uint_least32_t, or better yet, char32_t if it is available. boost/cuchar.cpp in my Unicode library shows the right way to define the types Unicode needs to work with.

Artyom

11:38 a.m.

New subject: [locale] Review part 1: headers

...

...
I don't think neither ICU nor Boost supports non-32 bit platform.

I understand that strictly speaking it is non-standard but de-facto there are no problems with this and this assumption is quite safe.

Since this assumption arbitrarily limits the library to particular platforms and it is trivial to avoid relying it, I see no point in doing so.

As I told you I can fix it, but practically it does not give any advantage. Give me a name of modern compiler that supports C++03 and has sizeof(unsigned) < 4?

...

...
I can change this to uint32_t

...
line 131: Please be consistent about using unsigned vs. uint32_t.

Good point

...
line 140: shouldn't offset be std::size_t? Why should the size of the string be limited to 4GB? std::size_t is guaranteed to be large enough to handle anything that can be stored in memory.

You are right, I'll fix this.

However, two points:

1. I already know compilers were size_t != sizeof(void *) so actually it must be uintptr_t... but it is not in C++03 but rather in C99. (it is in C++0x)

1) You work with contiguous memory, so size_t is the correct one, not uintptr_t

size_t is used in boost as it is closest way to use something in C++03 standard that is similar to uintptr_t. That is way C99 and C++03 created uintptr_t. In any case as I told I'll change this for better consistency/

...

2) uintptr_t should ideally be available from boost/cstdint.hpp (it doesn't seem to be ATM, that's a bug)

3) I don't know of compilers where those are different, mind saying which ones exhibit this?

As I said before I prefer to relay on native types rather then boost's types especially if it is feasible. As I told above I'll change this were necessary but it is just for "code-beauty".

...

...
2. There is no reason on planet earth you need to create boundary mapping for text>= 4G!

Even entire War& Peace book takes about 3MB.

So if you do this you are probably doing something very-very wrong.

But this is different story.

Another unnecessary arbitrary limit.

Why would you want to prevent people from treating very large data with your library?

On the Internet, there is text much bigger than War & Peace, even if not necessarily in the form of books.

As I told above I'll change it to size_t but... You don't want to run boundary analysis on such chunks of text in any case - and if you do it, you are doing something wrong.

...

...
...
Why do you have explicit specializations of boundary_indexing? They're all the same. The compiler is perfectly capable of instantiating them all from a single definition.

Ok this is something that you'll see all over the Boost.Locale code when it comes to deal with facets generations.

It is in order to support (brain damaged) DLL platform.

Both MSVC and GCC support extern template, which is also a C++0x feature.

Believe me, it was only feasible solution I've reached after spending many man hours fighting DLL's, std::locale facets and both compilers. (AFAIR MSVC uses specialization as well for its own facets) So after fighting both compilers and spending days on solving and tuning this problem I can say this is what should be done. It requires some good experience with std::locale facets, DLLs exporting, and having specific tests. So what I did is unfortunately required and most reliable solution.

...

...
Yes... However all supported platforms are at least 32bits.

There is one small points. As far as I remember I always used native types over strict types in the interfaces (function calls) because if it is changed between small releases (like stdint.h found) it would not break binary compatibility so native types (as long as they good enough) are better alternative.

(We are talking about C++ compilers that does not handle C99 stdint well)

Boost has a portable stdint equivalent.

I know and use it, see the description above.

...

In any case, you shouldn't even use uint32_t, but uint_least32_t, or better yet, char32_t if it is available.

No, see: http://cppcms.sourceforge.net/boost_locale/html/appendix.html#why_no_special... In any case I use uint32_t or uint16_t when I do relate to handling of UTF-16 or UTF-32

...

boost/cuchar.cpp in my Unicode library shows the right way to define the types Unicode needs to work with.

Same link as above, typedef of uint16_t or uint32_t is unsuitable for Boost.Locale character representation. See your Unicode library is orthogonal to Boost.Locale, it handles things very different and has different purpose. Artyom

Mathias Gaunard

9:34 p.m.

New subject: [locale] Review part 1: headers

On 09/04/2011 13:38, Artyom wrote:

...

Give me a name of modern compiler that supports C++03 and has sizeof(unsigned)< 4?

On DSPs it is usually 1 byte and 16 bits, but then it could be argued they have bad compilers. I wonder if anyone ever tried to run Boost with a TI compiler (they certainly claim support for C++ is "excellent" on their wiki).

...

size_t is used in boost as it is closest way to use something in C++03 standard that is similar to uintptr_t.

That is way C99 and C++03 created uintptr_t.

In any case as I told I'll change this for better consistency/

size_t is a type big enough to hold the size of any object. Your data is necessary a single object, since it's a contiguous memory chunk, therefore size_t is the suitable type to use. You would only use uintptr_t if you needed to hold the position of any object or subobject anywhere in memory (which could be necessary if using a more linked-list-like data structure), which does not appear to be your case. In any case, if you want to make your code generic to make it work with iterators for arbitrary data structures, you would use iterator_traits<T>::difference_type.

...

...
3) I don't know of compilers where those are different, mind saying which ones exhibit this?

As I said before I prefer to relay on native types rather then boost's types especially if it is feasible.

As I told above I'll change this were necessary but it is just for "code-beauty".

Erroneous code is hardly beautiful, and relying on integer promotion (which is an ugly part of C and C++) is very error-prone. I would have liked a response to question 3) also.

...

As I told above I'll change it to size_t but...

You don't want to run boundary analysis on such chunks of text in any case - and if you do it, you are doing something wrong.

And why is that? Because the algorithm doesn't scale? It should be linear, which is good enough. Documenting its complexity would be useful otherwise.

...

No, see:

http://cppcms.sourceforge.net/boost_locale/html/appendix.html#why_no_special...

Hmm, good point. You could use a class instead of a typedef to avoid these issues.

Artyom

10 Apr 10 Apr

4:27 a.m.

New subject: [locale] Review part 1: headers

...

...
Give me a name of modern compiler that supports C++03 and has sizeof(unsigned)< 4?

On DSPs it is usually 1 byte and 16 bits, but then it could be argued they have bad compilers. I wonder if anyone ever tried to run Boost with a TI compiler (they certainly claim support for C++ is "excellent" on their wiki).

- Does ICU compilable for DSP? - Does standard C++ library of them supports something beyond C and POSIX locales? See... Where I need 32 bits I use them, however I need this enum to have 20 bits and this is practical not theoretical issue. So 16 bit platform would be not supported for now. I'm really try to solve practical issues.

...

...
size_t is used in boost as it is closest way to use something in C++03 standard that is similar to uintptr_t.

That is way C99 and C++03 created uintptr_t.

In any case as I told I'll change this for better consistency/

size_t is a type big enough to hold the size of any object. [snip]

...
...
3) I don't know of compilers where those are different, mind saying which ones exhibit this?

[snip] I would have liked a response to question 3) also.

OpenVMS's C++ compiler from HP, sizeof(size_t)=4 but sizeof(void *)=8 in 64 mode... And this is real platform, not DSP.

...

...
As I told above I'll change it to size_t but...

You don't want to run boundary analysis on such chunks of text in any case - and if you do it, you are doing something wrong.

And why is that? Because the algorithm doesn't scale? It should be linear, which is good enough. Documenting its complexity would be useful otherwise.

No, because when you deal with text there are real world assumptions you can made - there is no single chunk of text of size written by man such that >= 4GB... In any case I suggest to drop it because I'll change it to size_t

...

...
No, see:

http://cppcms.sourceforge.net/boost_locale/html/appendix.html#why_no_special...

Hmm, good point. You could use a class instead of a typedef to avoid these issues.

No as well. Read the second bullet in the link above. You can't create std::locale facets for arbitrary types. In fact, because GCC's libstd++ does not specializes them for char16_t/char32_t (library bug) I can't create such facets as I get undefined references. So no. Artyom

Michael Caisse

5:45 a.m.

New subject: [locale] Review part 1: headers

On 04/09/2011 09:27 PM, Artyom wrote:

...

...
...
Give me a name of modern compiler that supports C++03 and has sizeof(unsigned)< 4?

On DSPs it is usually 1 byte and 16 bits, but then it could be argued they have bad compilers. I wonder if anyone ever tried to run Boost with a TI compiler (they certainly claim support for C++ is "excellent" on their wiki).

- Does ICU compilable for DSP? - Does standard C++ library of them supports something beyond C and POSIX locales?

See... Where I need 32 bits I use them, however I need this enum to have 20 bits and this is practical not theoretical issue.

So 16 bit platform would be not supported for now. I'm really try to solve practical issues.

Hi Artyom - I happen to use GCC on a few targets that have unsigned as a 16-bit value but that isn't the point. The real point is the standard guarantees that unsigned is at least 16-bits and you need 20. You should be using uint32_t or equivalent. Using unsigned int when you need 20-bits is an error. michael -- Michael Caisse Object Modeling Designs www.objectmodelingdesigns.com

Mathias Gaunard

11:40 a.m.

New subject: [locale] Review part 1: headers

On 10/04/2011 06:27, Artyom wrote:

...

...
On DSPs it is usually 1 byte and 16 bits, but then it could be argued they have bad compilers. I wonder if anyone ever tried to run Boost with a TI compiler (they certainly claim support for C++ is "excellent" on their wiki).

- Does ICU compilable for DSP? - Does standard C++ library of them supports something beyond C and POSIX locales?

You wouldn't want to do text processing on DSPs anyway.

...

See... Where I need 32 bits I use them, however I need this enum to have 20 bits and this is practical not theoretical issue.

So 16 bit platform would be not supported for now. I'm really try to solve practical issues.

Your library extends the standard locale system, it would be great if it was of good enough quality that it could work with any standard-conforming compiler, especially if you wanted someday to integrate this within the C++ standard itself.

...

OpenVMS's C++ compiler from HP, sizeof(size_t)=4 but sizeof(void *)=8 in 64 mode...

Good to know; quite surprising. I would consider OpenVMS somewhat of an obsolete platform though, but that's just me.

...

No, because when you deal with text there are real world assumptions you can made - there is no single chunk of text of size written by man such that>= 4GB...

I've got IRC and IM logs that easily go beyond 4GB for a single channel or discussion. Are you suggesting those aren't text and I shouldn't use your library when I want to perform analysis of that text?

...

In any case I suggest to drop it because I'll change it to size_t

Read the second bullet in the link above. You can't create std::locale facets for arbitrary types.

In fact, because GCC's libstd++ does not specializes them for char16_t/char32_t (library bug) I can't create such facets as I get undefined references.

Yes, locale facets only work with char or wchar_t, but I don't see how using "unsigned int" (or whatever you meant by native types) is going to help you there...

Artyom

12:04 p.m.

New subject: [locale] Review part 1: headers

...

...
- Does ICU compilable for DSP? - Does standard C++ library of them supports something beyond C and POSIX locales?

You wouldn't want to do text processing on DSPs anyway.

Indeed.

...

...
See... Where I need 32 bits I use them, however I need this enum to have 20 bits and this is practical not theoretical issue.

So 16 bit platform would be not supported for now. I'm really try to solve practical issues.

Your library extends the standard locale system, it would be great if it was of good enough quality that it could work with any standard-conforming compiler, especially if you wanted someday to integrate this within the C++ standard itself.

I'll explain more briefly. I defenselessly do not have problems using uint32_t where it should be. And I'll fix it in places I missed. However I have a enum that is a sort of bitmask that is designed to be able to be split. For example if someday ICU's boundary analysis would support marking hiragana and katagana I would like to be able to split the mask into two portions without affecting ABI compatibility. This is very important for other purposes of Boost.Locale I developed it for (CppCMS) - without it Boost.Locale would not exist at all. Of course if it becomes standardized it would not have exactly the same values or exactly the same mask, you can always make it much smaller and simpler. So I don't see a real life reason why shouldn't I limit myself to 16 bit where 90% of boost libraries would likely fail in 16 bit platform. I totally understand what you are saying and I agree that if you should not insert artificial limitations to the code. However in this particular case IMHO such assumption is fine and justified.

...

...
OpenVMS's C++ compiler from HP, sizeof(size_t)=4 but sizeof(void *)=8 in 64 mode...

Good to know; quite surprising.

I would consider OpenVMS somewhat of an obsolete platform though, but that's just me.

I rather consider it is as a bug that wouldn't be likely fixed.

...

...
No, because when you deal with text there are real world assumptions you can made - there is no single chunk of text of size written by man such that>= 4GB...

I've got IRC and IM logs that easily go beyond 4GB for a single channel or discussion. Are you suggesting those aren't text and I shouldn't use your library when I want to perform analysis of that text?

But if it is a IRC log you likely have natural (IM specific) markers to split the log in the first place and only then apply boundary analysis on each message same as you would not apply boundary analysis on XML or HTML file but rather on tag-less content of block level items. Each tool should fit its use, you can use Cray Supercomputer to play Soliter or use Andriod Cellphone to analyse whether... The question should you really do this? Also I'm not so sure that ICU would be actually able to handle something beyond this.

...

...
In any case I suggest to drop it because I'll change it to size_t

Read the second bullet in the link above. You can't create std::locale facets for arbitrary types.

In fact, because GCC's libstd++ does not specializes them for char16_t/char32_t (library bug) I can't create such facets as I get undefined references.

Yes, locale facets only work with char or wchar_t, but I don't see how using "unsigned int" (or whatever you meant by native types) is going to help you there...

I probably don't understand you. All I tell the Boost.Locale does not use special characters types for characters as it uses only char and wchar_t for this purpose and only when compilers would be ready for char16_t and char32_t it would be possible to use them as well. I was talking about characters not integer values. In any case... I think UTF-16 and UTF-32 should fade away but this is other story. Best, Artyom

Steven Watanabe

9 Apr 9 Apr

3:31 p.m.

New subject: [locale] Review part 1: headers

AMDG On 04/09/2011 02:50 AM, Artyom wrote:

...

...
Why do you have explicit specializations of boundary_indexing? They're all the same. The compiler is perfectly capable of instantiating them all from a single definition.

Ok this is something that you'll see all over the Boost.Locale code when it comes to deal with facets generations.

It is in order to support (brain damaged) DLL platform.

<snip>

Okay. You might add a comment to this effect in the code, so no one tries to "fix" it.

...

...
line 435: it's somewhat surprising to see a templated swap. You should comment that it requires that the base_iterators must be the same type.

Good point. I'll add this.

...
line 447: assignment is strongly exception safe. you should make this guarantee explititly.

What do you mean?

The strong exception guarantee: Commit or rollback. If the function throws an exception then it has no other effect.

...

...
line 538-539: I don't understand this sentence. What does "tide" mean?

Doesn't example makes it more clear.

Yeah. It's fairly clear what the class does. It was just the use of "tide" that confused me. I don't know any meaning of "tide" which makes any sense in this context, so I wanted to make sure that it wasn't just a Unicode specific term that I was unfamiliar with.

...

Probably I use the word "tide" incorrectly, basically it means whether I want to select everything or I need to select strict subsets.

For example if you want to select every word or only thous who actually have letters?

So basically it is strict token selection should be.

...
line 646: Normally it's undefined behavior to dereference an iterator like this. I would prefer an assert to throwing an exception.

The point is that these locations are valid positions just you can't call operator*.

I think it is better behavior because it is quite easy to make a mistake with this.

(From my experience with this)

Of course, it's a good thing to detect the error. The thing is that an exception means that you unwind the stack, possibly handling the exception at a higher level. That's normally not the correct response for a programming error.

...

...
date_time.hpp:

line 75: this is dangerous. These constants will be defined separately in every translation unit using them. This can cause violations of the one definition rule. (The standard makes a special exception for integral constants, but these are of class type.)

So how would you suggest to fix it? Remove static and move their constructors to cpp?

That's the safest solution. It's possible to define constants of class type in a header correctly, but it's really a pain to do right. IIRC, it can also interfere with precompiled headers on some compilers.

...

...
- I think trying to override the built-in enum to int conversion like this is rather fragile. What happens when someone tries to use an enum constant: enum { this_year = 2011 }; this_year * period::year;

I understand this, on-the other hand it seems to be very convenient. Because otherwise I'll have to define for every period class every arithmetic operator and it would be O(n^2) complexity.

I'm not sure if it would be good either.

Yeah, there's really no good solution. I'd be inclined to drop the operators for period_type altogether, and just have people use the constructor of date_time_period. I know operator* is more convenient and nicer to look at, but I'd favor more verbose code that can't possibly go wrong.

...

...
line 218: Please qualify this call to avoid ADL in namespace std. (Other places in this file too.)

Can you explain better?

Argument dependent lookup is one of the most annoying features of C++. When you make a function call of the form f(x, y, z), the compiler will search for f in the current scope and also in a set of extra namespaces determined by the types of x, y, and z. It's usually a good idea to suppress ADL except when you explicitly intend to use it.

...

...
format.hpp:

line 65: Use static_cast instead of reinterpret_cast. Actually, you don't need a cast at all. Use static_cast in write as well.

Noted.

...
line 104: shouldn't this be get_position()?

Yes... My bad.

...
Is there precedent for this format string syntax?

Actually it is partially based on ICU format, and some others formats like Java and so on.

The most important part that this format should be rich and extend-able.

...
What are the requirements on the encoding of the format string?

ASCII compatible... At least for '{' and '\'' symbols.

This covers all supported and frequently used encodings in locale context.

Okay. One other thing: How can I put a literal '{' in the format string?

...

...
What happens when the format string's syntax is invalid? Undefined behavior? An exception?

If the format is invalid it is ignored.

Rationale: you don't want the translator that actually provides translated format strings to screw the program.

In that case, I hope you have tests with various broken format strings.

...

...
What exception guarantee does streaming provide? If an exception is thrown, can the stream's locale and format flags have changed or are they reset correctly to their original values.

Currently they are not restored.

I thought it is better to to try to restore flags in case of exception as restoring they may actually trigger other one.

Okay. Please document this behavior.

...

...
line 776: you could use aggregate initialization.

What do you mean, message.hpp:776 points to

#pragma warning(pop)

Oops. That should be line 766: details::set_domain tmp; tmp.domain_id = id; I was just suggesting: details::set_domain = { id };

...

...
util.hpp

line 108: assert(typeid(*this) == typeid(base_converter))?

Why? To make sure it is overloaded?

Exactly.

...

The point that clone() is used only in case is_thread_safe() == false.

I can add this but how really important is that?

It isn't very important, but if clone is called and it is not overridden, then it is always an error, right? Thus, an assertion is appropriate. In Christ, Steven Watanabe

Artyom

6:29 p.m.

New subject: [locale] Review part 1: headers

...

On 04/09/2011 02:50 AM, Artyom wrote:

...
...
Why do you have explicit specializations of boundary_indexing? They're all the same. The compiler is perfectly capable of instantiating them all from a single definition.

Ok this is something that you'll see all over the Boost.Locale code when it comes to deal with facets generations.

It is in order to support (brain damaged) DLL platform.

<snip>

Okay. You might add a comment to this effect in the code, so no one tries to "fix" it.

Good point. I'll add the rationale. I think I actually commented it in several places but not in all ones.

...

...
...
line 447: assignment is strongly exception safe. you should make this guarantee explititly.

What do you mean?

The strong exception guarantee: Commit or rollback. If the function throws an exception then it has no other effect.

I see, noted.

...

...
...
line 646: Normally it's undefined behavior to dereference an iterator like this. I would prefer an assert to throwing an exception.

The point is that these locations are valid positions just you can't call operator*.

I think it is better behavior because it is quite easy to make a mistake with this.

(From my experience with this)

Of course, it's a good thing to detect the error. The thing is that an exception means that you unwind the stack, possibly handling the exception at a higher level. That's normally not the correct response for a programming error.

It is similar to std::vector's at()... Maybe, in any case I had found exception helpful, especially in situation when you don't a error in some part to shutdown entire process. Defensive but forgiving programming.

...

...
...
line 75: this is dangerous. These constants will be defined separately in every translation unit using them. This can cause violations of the one definition rule. (The standard makes a special exception for integral constants, but these are of class type.)

So how would you suggest to fix it? Remove static and move their constructors to cpp?

That's the safest solution. It's possible to define constants of class type in a header correctly, but it's really a pain to do right. IIRC, it can also interfere with precompiled headers on some compilers.

Noted...

...

...
...
- I think trying to override the built-in enum to int conversion like this is rather fragile. What happens when someone tries to use an enum constant: enum { this_year = 2011 }; this_year * period::year;

I understand this, on-the other hand it seems to be very convenient. Because otherwise I'll have to define for every period class every arithmetic operator and it would be O(n^2) complexity.

I'm not sure if it would be good either.

Yeah, there's really no good solution. I'd be inclined to drop the operators for period_type altogether, and just have people use the constructor of date_time_period. I know operator* is more convenient and nicer to look at, but I'd favor more verbose code that can't possibly go wrong.

I'll think about it, what can be done.

...

...
...
line 218: Please qualify this call to avoid ADL in namespace std. (Other places in this file too.)

Can you explain better?

Argument dependent lookup is one of the most annoying features of C++. When you make a function call of the form f(x, y, z), the compiler will search for f in the current scope and also in a set of extra namespaces determined by the types of x, y, and z. It's usually a good idea to suppress ADL except when you explicitly intend to use it.

I see, noted.

...

...
ASCII compatible... At least for '{' and '\'' symbols.

This covers all supported and frequently used encodings in locale context.

Okay. One other thing: How can I put a literal '{' in the format string?

It is something I forget to add to documents. You should use {{ and }}.

...

...
...
What happens when the format string's syntax is invalid? Undefined behavior? An exception?

If the format is invalid it is ignored.

Rationale: you don't want the translator that actually provides translated format strings to screw the program.

In that case, I hope you have tests with various broken format strings.

AFAIR I have such tests in code, but I'll check this again. Good point.

...

...
...
What exception guarantee does streaming provide? If an exception is thrown, can the stream's locale and format flags have changed or are they reset correctly to their original values.

Currently they are not restored.

I thought it is better to to try to restore flags in case of exception as restoring they may actually trigger other one.

Okay. Please document this behavior.

Noted. I'll add this.

...

...
...
line 776: you could use aggregate initialization.

What do you mean, message.hpp:776 points to

#pragma warning(pop)

Oops. That should be line 766: details::set_domain tmp; tmp.domain_id = id; I was just suggesting: details::set_domain = { id };

Ok.

...

...
...
util.hpp

line 108: assert(typeid(*this) == typeid(base_converter))?

Why? To make sure it is overloaded?

Exactly.

...
The point that clone() is used only in case is_thread_safe() == false.

I can add this but how really important is that?

It isn't very important, but if clone is called and it is not overridden, then it is always an error, right? Thus, an assertion is appropriate.

Noted... Never used this technique, interesting point. Thanks, Artyom

Mathias Gaunard

8:47 p.m.

New subject: [locale] Review part 1: headers

On 09/04/2011 20:29, Artyom wrote:

...

...
...
...
line 646: Normally it's undefined behavior to dereference an iterator like this. I would prefer an assert to throwing an exception.

The point is that these locations are valid positions just you can't call operator*.

I think it is better behavior because it is quite easy to make a mistake with this.

(From my experience with this)

Of course, it's a good thing to detect the error. The thing is that an exception means that you unwind the stack, possibly handling the exception at a higher level. That's normally not the correct response for a programming error.

It is similar to std::vector's at()...

Maybe, in any case I had found exception helpful, especially in situation when you don't a error in some part to shutdown entire process.

Defensive but forgiving programming.

Definitely a no-no. Dereferencing an invalid iterator is a programming error, not an exceptional situation. The thing to use is an assertion. When your code is statically wrong, the compiler refuses to generate a binary. Likewise, when your code is dynamically wrong, it just terminates. Fix your code such that such situations may *never* happen. An exception is something that *could* happen, even in bug-free software, like a failure to perform an operation due to lack of the necessary resources.

Artyom

10 Apr 10 Apr

4:18 a.m.

New subject: [locale] Review part 1: headers

...

...
...
...
...
line 646: Normally it's undefined behavior to dereference an iterator like this. I would prefer an assert to throwing an exception.

The point is that these locations are valid positions just you can't call operator*.

I think it is better behavior because it is quite easy to make a mistake with this.

(From my experience with this)

Of course, it's a good thing to detect the error. The thing is that an exception means that you unwind the stack, possibly handling the exception at a higher level. That's normally not the correct response for a programming error.

It is similar to std::vector's at()...

Maybe, in any case I had found exception helpful, especially in situation when you don't a error in some part to shutdown entire process.

Defensive but forgiving programming.

Definitely a no-no. Dereferencing an invalid iterator is a programming error, not an exceptional situation.

The thing to use is an assertion.

When your code is statically wrong, the compiler refuses to generate a binary. Likewise, when your code is dynamically wrong, it just terminates.

Fix your code such that such situations may *never* happen. An exception is something that *could* happen, even in bug-free software, like a failure to perform an operation due to lack of the necessary resources.

I'm sorry but as far as I remember nobody deprecated std::logic_error (and std::out_of_range derived from it). Small point, when you program some mission critical systems and services you don't want the process to go down for something that maybe only a single response can go down that way. This is very important for some mission critical code (and I had written such code not once). In any case, this concept is part of C++03 and C++0x and can be seen via std::logical_error that allows to catch programmers errors. Artyom _______________________________________________

...

Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Gevorg Voskanyan

8:58 a.m.

New subject: [locale] Review part 1: headers

Hi Artyom, Artyom wrote:

...

Mathias Gaunard wrote:

...
Definitely a no-no. Dereferencing an invalid iterator is a programming error, not an exceptional situation.

The thing to use is an assertion. I'm sorry but as far as I remember nobody deprecated std::logic_error (and std::out_of_range derived from it).

Small point, when you program some mission critical systems and services you don't want the process to go down for something that maybe only a single response can go down that way. This is very important for some mission critical code (and I had written such code not once).

For that reason one can specify a different behavior for BOOST_ASSERT to invoke when it fails - throw std::logic_error, log somewhere, whatever. For ordinary programs, I definitely want a programmer error to result in an immediate termination of the program, which is what I get from BOOST_ASSERT by default. However when writing a mission critical system one can easily make BOOST_ASSERT to do something else, as appropriate for the project. So my point is that I see no reason for BOOST_ASSERT not to be used here. Best Regards, Gevorg

Artyom

9:16 a.m.

New subject: [locale] Review part 1: headers

...

From: Gevorg Voskanyan <v_gevorg@yahoo.com> Artyom wrote:

...
Mathias Gaunard wrote:

...
Definitely a no-no. Dereferencing an invalid iterator is a programming error, not an exceptional situation.

The thing to use is an assertion. I'm sorry but as far as I remember nobody deprecated std::logic_error (and std::out_of_range derived from it).

Small point, when you program some mission critical systems and services you don't want the process to go down for something that maybe only a single response can go down that way. This is very important for some mission critical code (and I had written such code not once).

For that reason one can specify a different behavior for BOOST_ASSERT to invoke

when it fails - throw std::logic_error, log somewhere, whatever. [snip]

Didn't know this, if so this is very good. If so I'll fix it this way BOOST_ASSERT. Artyom

Steven Watanabe

3:18 a.m.

New subject: [locale] review part 2.1: source

AMDG Okay, I managed to get through the encoding, icu, and posix subdirectories today. encoding/codepage.cpp: line 28: In C++ you should use <cstring> (and std::strlen) encoding/conv.hpp: encoding/iconv_codepage.hpp I think the convention for headers like this which are really source files in disguise is to give them the extension .ipp. line 125: You don't actually care how the vector's contents are initialized, do you? You can just specify the size. line 150: I think sresult.append(out_start, out_ptr); would be slightly clearer. line 168: What are the possible errno values besides EILSEQ, EINVAL, and E2BIG? If you get an error you don't expect, is silently stopping the correct behavior? iconv_between: Inheritance from iconv_from_utf is a little odd. iconv_to_utf: The comments from iconv_from_utf apply. Also, the implementation is almost the same as iconv_from_utf. I think you could define a single function template to implement both. encoding/uconv_codepage.hpp: line 52: Do you really intend to translate any std::exception? Including std::bad_alloc? That isn't consistant with what iconv_codepage does. encoding/wconv_codepage.hpp: Missing #include <cstring>, <string> lines 86,101: The STL is your friend: size_t remove_substitutions(std::vector<wchar_t> &v) { v.erase(std::remove(v.begin(), v.end(), wchar_t(0xFFFD)), v.end()); return v.size(); } icu/all_generator.hpp: Why <vector>? You're using locale and string, but not vector. icu/boundary.cpp: icu/cdata.hpp: Missing #include <string> icu/uconv.hpp: Needs #include <string> and #include <memory>. line 8: This #include guard is excessively generic. icu/codecvt.cpp: lines 44, 48: If this throws, cvt_ needs to be closed. icu/codecvt.hpp: Missing #include <memory> The #include guard is wrong. icu/collator.cpp: line 115: Does do_basic_transform guarantee that the string contains no embedded null bytes? It seems that this is required for hash to work correctly. icu/conversion.cpp: icu/date_time.cpp: In calendar_impl: I can't figure out why you're locking in some functions and not others. For the most part, you seem to lock in non-mutating functions and not in mutating functions, so I was guessing that you are trying to guarantee that a calender can be read concurrently, but writes must be exclusive and that ICU does not guarantee that concurrent reads are okay. However, get_timezone and is_same don't lock. Does ICU handle these methods differently from the others W.R.T. thread-safety? line 236: Do you need a lock here? You're only using self which is a different object. icu/formatter.cpp: line 69: This seems like a really bad idea. It means that integers larger than 0x7FFFFFFFFFFFFFFF will be formatted incorrectly. If this is an unavoidable limitation of ICU, I think you shoud probably throw an exception and document the behavior. And don't think that people won't try to print unsigned numbers this big. I know I have. line 263: This comparison is subtly wrong for int64_t. Consider what happens when date is exactly 2^63. limits_type::max() will be 0x7FFFFFFFFFFFFFFF. According to [expr] this will be converted to double. For most rounding modes, the result will be 2^63, which is equal to date. Then, when we get to line 268 and cast back to ValueType, the result will be 0x8000000000000000, and the behavior is undefined according to [conv.fpint]. (Note: It's obviously not easy to trigger this, but I'm guessing that a specially crafted test case could) icu/formatter.hpp: line 44: What about float and long double. long double may have more precision than double. Also, even for float, the double rounding caused by working with double can produce slightly different results from when using float directly. Does ICU only allow you to use double and [u]int[32|64]_t? line 93: I find this comment a bit misleading. It wasn't clear until I looked at the implementation that what was being cached was the ICU formatter. The boost::locale::impl_icu::formatter object is not cached. icu/icu_backend.cpp: line 69: Ideally you should not mark the object as valid until you know that this function will not fail. line 91: I'd kind of like install to be thread-safe... icu/icu_backend.hpp: icu/icu_util.hpp: icu/numeric.cpp: In num_format: You don't handle bool? line 290: Alas, the standard doesn't guarantee that you can put_back an arbitrary number of characters. This isn't going to work if you're reading from a non-seekable stream and the line is sufficiently long. icu/predefined_formatters.hpp: icu/time_zone.hpp: icu/time_zone.cpp: icu/uconv.hpp: posix/all_generater.hpp: posix/codecvt.cpp: mb2_iconv_converter seems to be doing a lot of the same work as encoding/iconv_codepage. Is there any way to avoid the duplication? posix/codecvt.hpp: Missing #include <memory> and <string>. posix/collate.cpp: posix/converter.cpp: I noticed that you convert UTF8 to wchar_t. For the posix backend, are there other multibyte encodings that would need this? Or would converting everything to UTF32 cause other problems? posix/numeric.cpp: line 54: sizeof(buf) is 64, but you pass 256? line 57: is 1024 absolutely guaranteed to be enough? Otherwise you should use a loop with exponential reallocation. Line 97 too. lines 68, 75: std::copy? posix/posix_backend.cpp: install: The same comment W.R.T. thread-safety applies as for icu. line 85: If this fails, don't you still need to call freelocale? I think this would be simplified if you created a simple RAII wrapper that holds a locale_t by value. posix/posix_backend.hpp: (To be continued) In Christ, Steven Watanabe

Artyom

9:12 a.m.

New subject: [locale] review part 2.1: source

...

Okay, I managed to get through the encoding, icu, and posix subdirectories today.

encoding/codepage.cpp:

line 28: In C++ you should use <cstring> (and std::strlen)

Noted

...

encoding/conv.hpp:

encoding/iconv_codepage.hpp

I think the convention for headers like this which are really source files in disguise is to give them the extension .ipp.

Noted. I'll update

...

line 125: You don't actually care how the vector's contents are initialized, do you? You can just specify the size.

I'm not remember exactly but I think I do. I'll review this.

...

line 150: I think sresult.append(out_start, out_ptr); would be slightly clearer.

I should look on it carefully whether I can change this. iconv API is very gentle in how it handles its pointers and states so I need to do everything carefully. Also I assume that append(Begin,End) may be less efficient then append(pointer,size) as it may be naively implemented as while(begin!=end) s += *begin++;

...

line 168: What are the possible errno values besides EILSEQ, EINVAL, and E2BIG? If you get an error you don't expect, is silently stopping the correct behavior?

According to opengroup's specifications, Linux man pages and others these are the error states that can occur. Maybe I should throw conversion_error if the method is stop().

...

iconv_between: Inheritance from iconv_from_utf is a little odd.

But it makes a code much simpler.

...

iconv_to_utf: The comments from iconv_from_utf apply. Also, the implementation is almost the same as iconv_from_utf. I think you could define a single function template to implement both.

Not exactly. I had found that it is much safer and readable to have two implementations then to create quite complex meta programming to make sure it works. It is much easier to assume that at least one side is "char"

...

encoding/uconv_codepage.hpp:

line 52: Do you really intend to translate any std::exception? Including std::bad_alloc? That isn't consistant with what iconv_codepage does.

The problem is that ICU does not throw any exceptions (yes... even new) so any exceptions that should be thrown there are caused by the handling of conversions. So I thought it is more then reasonable.

...

encoding/wconv_codepage.hpp:

Missing #include <cstring>, <string>

Noted

...

lines 86,101: The STL is your friend: size_t remove_substitutions(std::vector<wchar_t> &v)

{ v.erase(std::remove(v.begin(), v.end(), wchar_t(0xFFFD)), v.end()); return v.size(); }

After reading it carefully I can see how it works but generally I see how it works but it isn't so readable. In any case interesting and it may save some allocations. Noted

...

icu/all_generator.hpp:

Why <vector>? You're using locale and string, but not vector.

Leftovers of old code. Noted.

...

icu/boundary.cpp:

icu/cdata.hpp:

Missing #include <string>

Noted

...

icu/uconv.hpp:

Needs #include <string> and #include <memory>. line 8: This #include guard is excessively generic.

Yes... you are right. Missed this.

...

icu/codecvt.cpp:

lines 44, 48: If this throws, cvt_ needs to be closed.

You are right... I don't understand how did I miss this.

...

icu/codecvt.hpp:

Missing #include <memory> The #include guard is wrong.

Ohh... Yes you right thanks.

...

icu/collator.cpp:

line 115: Does do_basic_transform guarantee that the string contains no embedded null bytes? It seems that this is required for hash to work correctly.

Yes, the string returned by ICU transform explicitly states this. It can be compared using standard strcmp.

...

icu/conversion.cpp:

icu/date_time.cpp:

In calendar_impl: I can't figure out why you're locking in some functions and not others. For the most part, you seem to lock in non-mutating functions and not in mutating functions, so I was guessing that you are trying to guarantee that a calender can be read concurrently, but writes must be exclusive and that ICU does not guarantee that concurrent reads are okay. However, get_timezone and is_same don't lock. Does ICU handle these methods differently from the others W.R.T. thread-safety?

Ok I'll explain. Some of the icu::Calendar member functions are only "sematically" const but actually recalculate dates on request. This is not really nice feature of ICU calendar, and after rising such issue to ICU team they included in their work plan making some of classes frizable. So they would be explicitly safe for cuncurrent read only access. (Not sure it would happen for calendar but it will for icu::Collator) Fortunatelly the "semtantic" constness is explicitly documented and for member function that access to time or some ICU fields - that actually perform recalculation I do perform locking so it would be thread safe to access to same const instance from different threads.

...

line 236: Do you need a lock here? You're only using self which is a different object.

Yes you are right... It was accidentially remained from old buggy code (back then I missed that fieldDifference acutally updates calendar and now I anyway clone so no need for lock)

...

icu/formatter.cpp:

line 69: This seems like a really bad idea. It means that integers larger than 0x7FFFFFFFFFFFFFFF will be formatted incorrectly. If this is an unavoidable limitation of ICU, I think you shoud probably throw an exception and document the behavior. And don't think that people won't try to print unsigned numbers this big. I know I have.

Yes it is limitation of ICU. ICU supports int32_t, int64_t and double (much like Java) So I don't know whether it is better to abort the application or maybe display incorrect number. I'll explain. The translator or programmer can use: format(translate("The distance to the Sun is {1,number} millimeters")); However if it does not seems right translator or programmer can always use format(translate("The distance to the Sun is {1} millimeters")); // POSIX format Which handled correctly but does not provides nice output as "1,445,452,345,353,435" but rather "1445452345353435" Because it is finally the output for human I prefer that the output would be strange but program does not crash. It is always better that user would report a bug when he sees negative number and the programmer would change {1,num} to {1,posix} rather then program accidentially exists in some unknown situation. I think I'll document this behavior (of the range). Also I can check if it is possible to fallback to POSIX format if the range is not supported by ICU.

...

line 263: This comparison is subtly wrong for int64_t. Consider what happens when date is exactly 2^63. limits_type::max() will be 0x7FFFFFFFFFFFFFFF. According to [expr] this will be converted to double. For most rounding modes, the result will be 2^63, which is equal to date. Then, when we get to line 268 and cast back to ValueType, the result will be 0x8000000000000000, and the behavior is undefined according to [conv.fpint]. (Note: It's obviously not easy to trigger this, but I'm guessing that a specially crafted test case could)

I'll review it.

...

icu/formatter.hpp:

line 44: What about float and long double. long double may have more precision than double. Also, even for float, the double rounding caused by working with double can produce slightly different results from when using float directly. Does ICU only allow you to use double and [u]int[32|64]_t?

It only allows double, int32_t, and int64_t so it is best what I can do. Also I select most suitable integer type on the level of the facet (numeric.cpp)

...

line 93: I find this comment a bit misleading. It wasn't clear until I looked at the implementation that what was being cached was the ICU formatter. The boost::locale::impl_icu::formatter object is not cached.

Yes. You are right... Also the comment seems to be little bit old as all formatters ara cached at once using TLS.

...

icu/icu_backend.cpp:

line 69: Ideally you should not mark the object as valid until you know that this function will not fail.

Good point.

...

line 91: I'd kind of like install to be thread-safe...

No localization backend is not expected to be thread safe. It is used in this way: 1. It is cloned from the original one 2. All options are set 3. The facets are installed. So its thread safety implied by its clonability Also it allows me to install facets faster if underlying implementation is not thread safe. In any case localization_backend is usually not used directly by the user but rather by boost::locale::generator

...

icu/icu_backend.hpp:

icu/icu_util.hpp:

icu/numeric.cpp:

In num_format: You don't handle bool?

ICU does not handle it... So I can't do something better for this - I pass it to std::locale::classic formatting.

...

line 290: Alas, the standard doesn't guarantee that you can put_back an arbitrary number of characters. This isn't going to work if you're reading from a non-seekable stream and the line is sufficiently long.

The problem is that it is the best I can do from-inside std::locale's num_fmt facet. It has many major limitations. For exaple if you would try to parse 12345YX as std::cin >> number >> character then character would be X. This is specially critical when I parse spellout format to be able to putback correct part. Maybe I should try to see what is the size of the buffer and/or document that the stream should be able to handle putback well. It is actually fairly simple as streambuf has a buffer so it can actually receive putback characters without problem. Unfortunatelly it is the best solution I could reach so far with the limiteation of std::num_get

...

icu/predefined_formatters.hpp:

icu/time_zone.hpp:

icu/time_zone.cpp:

icu/uconv.hpp:

posix/all_generater.hpp:

posix/codecvt.cpp:

mb2_iconv_converter seems to be doing a lot of the same work as encoding/iconv_codepage. Is there any way to avoid the duplication?

No it does very different things. 1. iconv is used for double byte encodings only like Shift-JIS or GBK as for single byte and UTF-8 better solutions implemented. 2. It does not perform stream conversion but rather conversion of single byte or two bytes of sources. So finally it has more differences then common stuff.

...

posix/codecvt.hpp:

Missing #include <memory> and <string>.

posix/collate.cpp:

posix/converter.cpp:

I noticed that you convert UTF8 to wchar_t. For the posix backend, are there other multibyte encodings that would need this? Or would converting everything to UTF32 cause other problems?

- For most encodings (single byte) toupper/tolower does better job (faster simpler smaller) - For UTF-8 is is very suitable but still it is major approximation is towupper/towlower should not work on single code points but rather on entire text. So it is poor-man's approximation that works almost correctly for most western languages. - Now we remained with East-Asian double width encodings like Shift-JIS. This is quite different story: 1. The East-Asian scripts do not have upper and lower case thus actually to upper and lower do nothing for almost all but latin part. 2. In any case it remains approximation. So I think using normal toupper_l to tolower_l would be good enough as it is still approximation and because East-Asian scripts does not have case I think it is ok. I can change it for MB encodings but not sure if it would have added value.

...

posix/numeric.cpp:

line 54: sizeof(buf) is 64, but you pass 256?

Oppps... I'll fix this

...

line 57: is 1024 absolutely guaranteed to be enough? Otherwise you should use a loop with exponential reallocation. Line 97 too.

Ok... I actually has other even wother bug. I don't handle errors correctly. I should definitelly fix this. And it seems that I can add reallocation (I assumed that 0 returns on error and I have no way to distinguish between error and allocation. I assume I had written this code in too late hours and misread the docs)

...

lines 68, 75: std::copy?

I have limit by size not by iterator, so it is simpler.

...

posix/posix_backend.cpp:

install: The same comment W.R.T. thread-safety applies as for icu.

Same as above.

...

line 85: If this fails, don't you still need to call freelocale? I think this would be simplified if you created a simple RAII wrapper that holds a locale_t by value.

Ok... I see missed this. I prefer to have it shared as every facet has its own copy (2 characters * 4-5 facets) I'll fix this.

...

posix/posix_backend.hpp:

(To be continued)

In Christ, Steven Watanabe

Thank you very much... This is very-good code review, it is very helpful! Artyom

Mathias Gaunard

11:55 a.m.

New subject: [locale] review part 2.1: source

On 10/04/2011 11:12, Artyom wrote:

...

...
line 150: I think sresult.append(out_start, out_ptr); would be slightly clearer.

I should look on it carefully whether I can change this.

iconv API is very gentle in how it handles its pointers and states so I need to do everything carefully.

Also I assume that append(Begin,End) may be less efficient then append(pointer,size) as it may be naively implemented as

while(begin!=end) s += *begin++;

Why make such assumptions that could very well be unfounded when it only makes your code harder to understand?

...

...
iconv_between: Inheritance from iconv_from_utf is a little odd.

But it makes a code much simpler.

Inheritance should not be used for code sharing unless used with the CRTP idiom.

...

...
iconv_to_utf: The comments from iconv_from_utf apply. Also, the implementation is almost the same as iconv_from_utf. I think you could define a single function template to implement both.

Not exactly. I had found that it is much safer and readable to have two implementations then to create quite complex meta programming to make sure it works.

It is much easier to assume that at least one side is "char"

Code duplication is one of the problems with your library in my opinion. More on that in my upcoming review. So it is intentional that you duplicate code because you find writing it once generically makes it hard to validate?

Artyom

12:25 p.m.

New subject: [locale] review part 2.1: source

...

...
...
line 150: I think sresult.append(out_start, out_ptr); would be slightly clearer.

I should look on it carefully whether I can change this.

iconv API is very gentle in how it handles its pointers and states so I need to do everything carefully.

Also I assume that append(Begin,End) may be less efficient then append(pointer,size) as it may be naively implemented as

while(begin!=end) s += *begin++;

Why make such assumptions that could very well be unfounded when it only makes your code harder to understand?

I don't think so. I don't think that append(str,size) is less readable then append(str_begin,str_end) But this is the question of programming style.

...

...
...
iconv_between: Inheritance from iconv_from_utf is a little odd.

But it makes a code much simpler.

Inheritance should not be used for code sharing unless used with the CRTP idiom.

I'm sorry but I disagree. 1. Template metaprograming should be used in places where it is justified. 2. If you have a class figure I don't see why shouldn't it have shared code for circle and square. This makes sense for me. I understand that in this particular case the sharing looks odd, but in fact it makes things only simpler as conv_between is same as conv from char to to char because iconv supports it unlike Win32API or ICU API that requires using temporary UTF-16 storage.

...

...
...
iconv_to_utf: The comments from iconv_from_utf apply. Also, the implementation is almost the same as iconv_from_utf. I think you could define a single function template to implement both.

Not exactly. I had found that it is much safer and readable to have two implementations then to create quite complex meta programming to make sure it works.

It is much easier to assume that at least one side is "char"

Code duplication is one of the problems with your library in my opinion. More on that in my upcoming review.

Generally I try to avoid it, but if you see, you are welcome to point to such things. I probably missed some parts or duplicated accidentally some code during a year and the half of instant development and evolution of this library. In any case I would like to ask you to read carefully rationale behind this library in the tutorial. I know your Unicode library and I know that you have very different point of view on how things should be done. In any case I really looking forward for your review as the author of Boost.Unicode library.

...

So it is intentional that you duplicate code because you find writing it once generically makes it hard to validate?

You know... On the second glance it seems to be possible to merge two versions. In any case... It should be done very-very carefully to make sure nothing goes wrong. iconv API is tricky and differs by implementation so I would have to test it on several platforms afterward. I think I may fix this. Artyom

Mathias Gaunard

4:19 p.m.

New subject: [locale] review part 2.1: source

On 10/04/2011 14:25, Artyom wrote:

...

...
...
...
line 150: I think sresult.append(out_start, out_ptr); would be slightly clearer.

I should look on it carefully whether I can change this.

iconv API is very gentle in how it handles its pointers and states so I need to do everything carefully.

Also I assume that append(Begin,End) may be less efficient then append(pointer,size) as it may be naively implemented as

while(begin!=end) s += *begin++;

Why make such assumptions that could very well be unfounded when it only makes your code harder to understand?

I don't think so. I don't think that

append(str,size)

is less readable then

append(str_begin,str_end)

I didn't read the incriminated piece of code, I was just inferring from Steven's comment that it could be clearer. If it's just this indeed there is no difference in clarity.

Steven Watanabe

5:14 p.m.

New subject: [locale] review part 2.1: source

AMDG On 04/10/2011 02:12 AM, Artyom wrote:

...

...
encoding/iconv_codepage.hpp

line 168: What are the possible errno values besides EILSEQ, EINVAL, and E2BIG? If you get an error you don't expect, is silently stopping the correct behavior?

According to opengroup's specifications, Linux man pages and others these are the error states that can occur.

Maybe I should throw conversion_error if the method is stop().

That seems reasonable. This situation shouldn't ever happen, but it's probably a good idea to detect it.

...

...
iconv_to_utf: The comments from iconv_from_utf apply. Also, the implementation is almost the same as iconv_from_utf. I think you could define a single function template to implement both.

Not exactly. I had found that it is much safer and readable to have two implementations then to create quite complex meta programming to make sure it works.

That's sometimes true. I know around here we tend to err on the side of too much metaprogramming, but in this case I don't think you even need much.

...

It is much easier to assume that at least one side is "char"

Really? In this case the code looks almost identical.

...

...
lines 86,101: The STL is your friend: size_t remove_substitutions(std::vector<wchar_t> &v)

{ v.erase(std::remove(v.begin(), v.end(), wchar_t(0xFFFD)), v.end()); return v.size(); }

After reading it carefully I can see how it works but generally I see how it works but it isn't so readable.

In any case interesting and it may save some allocations.

Well, remove/erase is idiomatic, so it should be clearer to those familiar the STL.

...

...
icu/numeric.cpp:

In num_format: You don't handle bool?

ICU does not handle it... So I can't do something better for this - I pass it to std::locale::classic formatting.

Is there any way that you could use the message translation mechanism for "true" and "false" if std::boolalpha is set? Otherwise it should be formatted the same as an integer.

...

...
line 290: Alas, the standard doesn't guarantee that you can put_back an arbitrary number of characters. This isn't going to work if you're reading from a non-seekable stream and the line is sufficiently long.

The problem is that it is the best I can do from-inside std::locale's num_fmt facet.

It has many major limitations. For exaple if you would try to parse 12345YX as

std::cin>> number>> character

then character would be X.

This is specially critical when I parse spellout format to be able to putback correct part.

Maybe I should try to see what is the size of the buffer and/or document that the stream should be able to handle putback well.

It is actually fairly simple as streambuf has a buffer so it can actually receive putback characters without problem.

For a limited number of characters. If you're reading from a file or from a stringstream, you'll be okay because pbackfail can seek to an earlier position in the stream. If you're reading from a terminal, it should be okay because it's usually line buffered, and the lines aren't usually very long. I could see running into a problem if you're reading from a pipe, however.

...

Unfortunately it is the best solution I could reach so far with the limitation of std::num_get

Yeah. I can't think of anything better either. This is really more of a limitation of std::steambuf, though.

...

...
posix/converter.cpp:

I noticed that you convert UTF8 to wchar_t. For the posix backend, are there other multibyte encodings that would need this? Or would converting everything to UTF32 cause other problems?

- For most encodings (single byte) toupper/tolower does better job (faster simpler smaller)

<snip>

I can change it for MB encodings but not sure if it would have added value.

Ok. I basically know nothing about the domain. Feel free to take any comments like this with a grain of salt.

...

...
line 85: If this fails, don't you still need to call freelocale? I think this would be simplified if you created a simple RAII wrapper that holds a locale_t by value.

Ok... I see missed this.

I prefer to have it shared as every facet has its own copy (2 characters * 4-5 facets)

I'll fix this.

Oh, I wasn't suggesting that you go away from shared_ptr: struct posix_locale : boost::noncopyable { posix_locale(); ~posix_locale(); locale_t impl_; }; shared_ptr<posix_locale> p(new posix_locale()); It's harder to get this wrong. In Christ, Steven Watanabe

Artyom

6:21 p.m.

New subject: [locale] review part 2.1: source

...

...
...
iconv_to_utf: The comments from iconv_from_utf apply. Also, the implementation is almost the same as iconv_from_utf. I think you could define a single function template to implement both.

Not exactly. I had found that it is much safer and readable to have two implementations then to create quite complex meta programming to make sure it works.

That's sometimes true. I know around here we tend to err on the side of too much metaprogramming, but in this case I don't think you even need much.

After reviewing the code once again it seems that you are right. I'll merge it.

...

...
...
lines 86,101: The STL is your friend: size_t remove_substitutions(std::vector<wchar_t> &v)

{ v.erase(std::remove(v.begin(), v.end(), wchar_t(0xFFFD)), v.end()); return v.size(); }

After reading it carefully I can see how it works but generally I see how it works but it isn't so readable.

In any case interesting and it may save some allocations.

Well, remove/erase is idiomatic, so it should be clearer to those familiar the STL.

I'm familiar but still it is not trivial to read and understand complexity, even thou it so better solution that I'll likely going to use.

...

...
...
icu/numeric.cpp:

In num_format: You don't handle bool?

ICU does not handle it... So I can't do something better for this - I pass it to std::locale::classic formatting.

Is there any way that you could use the message translation mechanism for "true" and "false" if std::boolalpha is set? Otherwise it should be formatted the same as an integer.

In such case I'll have to grab translation dictionaries with the library and maintain them together with Boost. It is trivial on Linux like platforms where all dictionaries usually go to /usr/share/locale/$(LANG)/LC_MESSAGES/boost_locale.mo but it is less trivial on Windows. It would be better to give this to translator that handles all the messages in the software. So translation of true or false would not add much :-) In any case it would likely go to some sentence like "The result of test is true" or something like that so it would anyway be better to translate the string in full context as it may be "true" in Enlish but something context dependent on other languages.

...

...
It is actually fairly simple as streambuf has a buffer so it can actually receive putback characters without problem.

For a limited number of characters. If you're reading from a file or from a stringstream, you'll be okay because pbackfail can seek to an earlier position in the stream. If you're reading from a terminal, it should be okay because it's usually line buffered, and the lines aren't usually very long. I could see running into a problem if you're reading from a pipe, however.

I think I'll document this limitation.

...

...
Unfortunately it is the best solution I could reach so far with the limitation of std::num_get

Yeah. I can't think of anything better either. This is really more of a limitation of std::steambuf, though.

Yes, indeed.

...

...
...
line 85: If this fails, don't you still need to call freelocale? I think this would be simplified if you created a simple RAII wrapper that holds a locale_t by value.

Ok... I see missed this.

I prefer to have it shared as every facet has its own copy (2 characters * 4-5 facets)

I'll fix this.

Oh, I wasn't suggesting that you go away from shared_ptr:

struct posix_locale : boost::noncopyable { posix_locale(); ~posix_locale(); locale_t impl_; };

shared_ptr<posix_locale> p(new posix_locale());

It's harder to get this wrong.

Oh... year, you are right. Thanks, Artyom

Artyom

12 Apr 12 Apr

5:11 p.m.

New subject: [locale] review part 2.1: source

Hello Steven, What about the rest of the review? Artyom

Steven Watanabe

6:06 p.m.

New subject: [locale] review part 2.1: source

AMDG On 04/12/2011 10:11 AM, Artyom wrote:

...

What about the rest of the review?

I'll get there... I hope. I've already put about 15 hours into this, and I only have so much time. In Christ, Steven Watanabe

Artyom

7:58 p.m.

New subject: [locale] review part 2.1: source

...

AMDG

...

On 04/12/2011 10:11 AM, Artyom wrote:

...
What about the rest of the review?

I'll get there... I hope. I've already put about 15 hours into this, and I only have so much time.

In Christ, Steven Watanabe

Thank You Very Much, Artyom

Steven Watanabe

15 Apr 15 Apr

3:24 a.m.

New subject: [locale] review part 2.2: source

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 AMDG shared/date_time.cpp: line 79: If it's an invariant that tz_ is the same as the timezone of impl, then this is not exception safe if clone throws. You'll have to use copy and swap idiom. line 161: This is a bit odd, don't you think? Normally if assignment is strongly exception safe without doing any extra work, there's no reason to go out of your way to clear the old state on failure. line 350: should you check that v is in range? Regardless, the valid range should be documented. shared/format.cpp: line 53: it seems that this doesn't restore the float field, base field etc. shared/formatting.cpp: line 42: normally copy and swap is the preferred way to implement assignment. The main advantage is that it guarantees that if the operation fails, the state of the lhs is unchanged. shared/generator.hpp: line 41: From the way domains is used, would a std::set be more appropriate? shared/ids.cpp: missing copyright/license. (Issues like this will be caught by the inspect tool, so not a big deal now.) shared/ios_prop.hpp: line 30 and others,: casting to and from void* should use static_cast. line 56: can you use invalid instead of hard-coding this constant? line 96: I don't think this is guaranteed to be distinct from any valid pointer. Can you make invalid the address of a unique global object instead? shared/localization_backend.cpp: lines 200, 206: This is not guaranteed to be thread-safe in C++03. You must use call_once for thread-safe static inititalization. Note however that this is probably safe if you assume that no additional threads are started before main. shared/message.cpp: line 123: I take it that the format guarantees that hash_size_ > 2? Please validate this requirement when loading. line 132: Use null_pair? line 152: You require that off is in range, that off <= off + len, and that off + len is in range. Please check this. line 204: You really should check that you actually read 4 bytes. Otherwise if the file is too short, you get undefined behavior from reading uninitialized memory. line 311: How do you know that there's a null-terminator? line 326: suppose that form is 0 and the null-terminator is missing... line 407: This isn't guaranteed to work. The standard doesn't guarantee that the letters are contiguous. (Does anything use EBCDIC these days?) lines 440, 442: You're assuming that the string is null terminated. What if the file is incorrectly formatted? It may be possible to read past the end of the allocated block. I haven't actually worked through the exact code path to trigger this condition, but it seems suspicious. line 475: You're assuming that the separator actually appears. line 500: I think this can cause alignment problems. Does the file format make any alignment guarantees? Even if it does, you'll need to handle invalid files gracefully. Of course, it'll only be inefficient on platforms that allow unaligned loads, but still, the standard doesn't guarantee that it'll work. shared/mo_hash.hpp: shared/mo_lambda.cpp: line 239: There's no 'c' in tokenizer. Everything except compile can be in the unnamed namespace. shared/mo_lambda.hpp: std/all_generater.hpp: std/codecvt.cpp: std/collate.cpp: std/converter.cpp: std/numeric.cpp: std/std_backend.cpp: std/std_backend.hpp: util/codecvt_converter.cpp: util/default_locale.cpp: util/iconv.hpp: Is this header intended to be used in multiple translation units? If so you probably should make the functions inline and not use an unnamed namespace. util/info.hpp: util/locale_data.hpp: util/locale_data.cpp: util/numeric.cpp: util/timezone.hpp: win32/all_generator.hpp: win32/api.hpp: line 49: Don't most Windows functions use DWORD, rather than int? win32/collate.hpp: line 48: I think I just saw code looking just like this in another file. win32/converter.cpp: win32/lcid.cpp: line 51: Check whether this succeeds and uses the entire string? line 89: I thought double checked locking was considered unsafe? Anyway you should use call_once for initialization like this. win32/lcid.hpp: win32/numeric.cpp: line 32: I think I saw another copy of these functions. Get rid of the duplication? win32/win_backend.cpp: win32/win_backend.hpp: build/Jamfile.v2: line 30: I don't think you need a separate object file. It isn't used anywhere else. line 126: A simpler way to reference another library is /boost//thread line 211: I would tend to use icu-sources = [ glob icu/*.cpp ] ; result += <source>$(icu-sources) I see you're using something like this later on. line 278: Boost.Thread ought to set this flag automatically. I'm assuming from the fact that you have it, that it doesn't. Looks like a bug in the thread Jamfile. (Similarly, you should add <define>BOOST_LOCALE_NO_LIB in the usage requirements of boost_locale) test/Jamfile.v2: You don't actually need test-suite. test-suite is exactly the same as alias. In Christ, Steven Watanabe -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJNp7p5AAoJEDTBQuhymLHSiv4H/0HCgXhXiOAsMmMFY6BYn9W7 aSKX3n9qGCDTG7WI+tHoZwz46e6BHrf7PlfcA+6Z1v4wA1GBonxxXrmTN8AhNIIm ozn7Qlw7xH13rO7r3BfPPmib0sSOWCTEUcIfCI2ZW4huKuGMOnmI+tLaHr6qwVP3 +xAiWjAzNkShzn8Qy2rWYo9X+mYaXBrBjViejlyn7M0ArW2Rgcsu8FPWh8Kkhzqk 6TcltpHmUI4dQ9uAkJhP9zcjqyKIUA9vTApcKCDJMqGnAwSDoXA+ZWZGuoquhsia 4BAkL32+E9EnpEf982R+3cQ7NqYQy452rB9TC02qsFrYbrPv6FKDmQMb0uWvBGM= =bzjH -----END PGP SIGNATURE-----

Artyom

2:04 p.m.

New subject: [locale] review part 2.2: source

...

shared/date_time.cpp:

line 79: If it's an invariant that tz_ is the same as the timezone of impl, then this is not exception safe if clone throws. You'll have to use copy and swap idiom.

Noted

...

line 161: This is a bit odd, don't you think? Normally if assignment is strongly exception safe without doing any extra work, there's no reason to go out of your way to clear the old state on failure.

Yes, you are right.

...

line 350: should you check that v is in range? Regardless, the valid range should be documented.

The valid range is actually depends on specific calendar and its implementation. Sometimes it depends on actual calendar or for example in Hebrew calendar it can't be year less then ~5800 now as it is when the world was born (according to the calendar) So if the range is not valid the underlying implementation should throw.

...

shared/format.cpp:

line 53: it seems that this doesn't restore the float field, base field etc.

Ok... I'll check this.

...

shared/formatting.cpp:

line 42: normally copy and swap is the preferred way to implement assignment. The main advantage is that it guarantees that if the operation fails, the state of the lhs is unchanged.

Noted... This is something across all the code.

...

shared/generator.hpp:

line 41: From the way domains is used, would a std::set be more appropriate?

There should be at lease one that is default domain. It is the first one so in any case I would need to mark something as default.

...

shared/ids.cpp:

missing copyright/license. (Issues like this will be caught by the inspect tool, so not a big deal now.)

Yes you are right.

...

shared/ios_prop.hpp:

line 30 and others,: casting to and from void* should use static_cast.

Noted.

...

line 56: can you use invalid instead of hard-coding this constant?

Opps, you are right.

...

line 96: I don't think this is guaranteed to be distinct from any valid pointer. Can you make invalid the address of a unique global object instead?

This method is used quite widely, for example mmap returns (void *)(-1) in case of failure, and it can't be anything invalid. On the other hand there is enough problems with unique global objects there.

...

shared/localization_backend.cpp:

lines 200, 206: This is not guaranteed to be thread-safe in C++03. You must use call_once for thread-safe static inititalization. Note however that this is probably safe if you assume that no additional threads are started before main.

This mutexes are explicitly globally initialized later. So all such things should be solved on library start. So I think that is would be good enough and call_once would just inject additional dependency on Boost.Thread (currently Boost.Thread used in headers only variant when ICU is not used)

...

shared/message.cpp:

line 123: I take it that the format guarantees that hash_size_ > 2? Please validate this requirement when loading.

Good point

...

line 132: Use null_pair?

Noted

...

line 152: You require that off is in range, that off <= off + len, and that off + len is in range. Please check this.

Good point.

...

line 204: You really should check that you actually read 4 bytes. Otherwise if the file is too short, you get undefined behavior from reading uninitialized memory.

Actually variable magic is initialized (0) and if it was not read correctly its value wouldn't be valid. That is why I don't check the returned value.

...

line 311: How do you know that there's a null-terminator?

Promised by the file format.

...

line 326: suppose that form is 0 and the null-terminator is missing...

The different plural forms are separated by 0.

...

line 407: This isn't guaranteed to work. The standard doesn't guarantee that the letters are contiguous. (Does anything use EBCDIC these days?)

I'm not sure that something better and simple enough can be done. I assume we are ASCII at least.

...

lines 440, 442: You're assuming that the string is null terminated. What if the file is incorrectly formatted? It may be possible to read past the end of the allocated block. I haven't actually worked through the exact code path to trigger this condition, but it seems suspicious.

It is the requirement of the format, it is designed to be loaded as is to memory and be useful. So I don't check every possible string as it allows to load files much faster. In any case when I load file I put 1 extra 0 at the end so at some point it will be terminated

...

line 475: You're assuming that the separator actually appears.

std::string::npos is maximal value of size_t so substr just get bigger sub string. May be not 100% clear but this code does what it should.

...

line 500: I think this can cause alignment problems. Does the file format make any alignment guarantees? Even if it does, you'll need to handle invalid files gracefully. Of course, it'll only be inefficient on platforms that allow unaligned loads, but still, the standard doesn't guarantee that it'll work.

Actually there two types of catalogs: 1. Directly loaded - when the locale's 8 bit charset and the catalog's one is the same like UTF-8. 2. In memory when the base character is not char (wchar_t or charXX_t) or the catalog had to be converted to other character set. So yes, these pointers are going to be aligned.

...

shared/mo_hash.hpp:

shared/mo_lambda.cpp:

line 239: There's no 'c' in tokenizer.

Ooops. I'll fix this.

...

Everything except compile can be in the unnamed namespace.

...

shared/mo_lambda.hpp:

std/all_generater.hpp:

std/codecvt.cpp:

std/collate.cpp:

std/converter.cpp:

std/numeric.cpp:

std/std_backend.cpp:

std/std_backend.hpp:

util/codecvt_converter.cpp:

util/default_locale.cpp:

util/iconv.hpp:

Is this header intended to be used in multiple translation units? If so you probably should make the functions inline and not use an unnamed namespace. Yes you are right. I don't know why had I done it

Yes you are right. this way.

...

util/info.hpp:

util/locale_data.hpp:

util/locale_data.cpp:

util/numeric.cpp:

util/timezone.hpp:

win32/all_generator.hpp:

win32/api.hpp:

line 49: Don't most Windows functions use DWORD, rather than int?

Yes. You are right.

...

win32/collate.hpp:

line 48: I think I just saw code looking just like this in another file.

There is no win32/collate.hpp?

...

win32/converter.cpp:

win32/lcid.cpp:

line 51: Check whether this succeeds and uses the entire string?

...

line 89: I thought double checked locking was considered unsafe?

Not really I had a discussion with a professor an expert in this area, no problems when mutexes are full memory barriers and they are.

...

Anyway you should use call_once for initialization like this.

win32/lcid.hpp:

win32/numeric.cpp:

line 32: I think I saw another copy of these functions. Get rid of the duplication?

AFAIR, they had different semantics.

...

win32/win_backend.cpp:

win32/win_backend.hpp:

build/Jamfile.v2:

line 30: I don't think you need a separate object file. It isn't used anywhere else.

It does because it is used in different targets AFAIR is solved this problem.

...

line 126: A simpler way to reference another library is /boost//thread

...

line 211: I would tend to use icu-sources = [ glob icu/*.cpp ] ; result += <source>$(icu-sources) I see you're using something like this later on.

The problem comes when you want to put something temporary there out of project. Never liked glob

...

line 278: Boost.Thread ought to set this flag automatically. I'm assuming from the fact that you have it, that it doesn't. Looks like a bug in the thread Jamfile. (Similarly, you should add <define>BOOST_LOCALE_NO_LIB in the usage requirements of boost_locale)

I explain, I use boost.thread's mutexes but I don't want to link with this library. So in many cases I use header-only boost.thread so I prefer specify explicitly that I don't need Boost.Thread (Boost.Thread is linked only for ICU backend where TLS is needed)

...

test/Jamfile.v2: You don't actually need test-suite. test-suite is exactly the same as alias.

What do you mean?

...

In Christ, Steven Watanabe

Thank you very much, it was really very good code review. Artyom

Gevorg Voskanyan

2:46 p.m.

New subject: [locale] review part 2.2: source

Artyom wrote:

...

...
line 211: I would tend to use icu-sources = [ glob icu/*.cpp ] ; result += <source>$(icu-sources) I see you're using something like this later on.

The problem comes when you want to put something temporary there out of project.

Never liked glob

Liking or not glob may be a matter of taste, but here Steven was also suggesting to drop the 'for' loop and take advantage of variable expansion semantics instead. The code can be simplified to: local s = boundary codecvt ... ; result += <source>$(s).cpp ;

...

...
line 278: Boost.Thread ought to set this flag automatically. I'm assuming from the fact that you have it, that it doesn't. Looks like a bug in the thread Jamfile.

I'd say rather a bug in boost Jamroot, should have <define>BOOST_ALL_NO_LIB=1 in project's usage requirements.

...

...
(Similarly, you should add <define>BOOST_LOCALE_NO_LIB in the usage requirements of boost_locale)

I explain, I use boost.thread's mutexes but I don't want to link with this library. So in many cases I use header-only boost.thread so I prefer specify explicitly that I don't need Boost.Thread

(Boost.Thread is linked only for ICU backend where TLS is needed)

Provided Boost.Locale lives in the boost tree, there is no need for <define>BOOST_THREAD_NO_LIB=1 as boost's Jamroot has <define>BOOST_ALL_NO_LIB=1 in project requirements.

...

...
test/Jamfile.v2: You don't actually need test-suite. test-suite is exactly the same as alias.

What do you mean?

alias boost_locale_test : ... ; Gevorg

Gevorg Voskanyan

3:04 p.m.

New subject: [locale] review part 2.2: source

Gevorg Voskanyan wrote:

...

Liking or not glob may be a matter of taste, but here Steven was also suggesting

to drop the 'for' loop and take advantage of variable expansion semantics instead. The code can be simplified to: local s = boundary codecvt ... ; result += <source>$(s).cpp ;

Sorry I meant to say <source>icu/$(s).cpp Gevorg

Artyom

6:54 p.m.

New subject: [locale] review part 2.2: source

...

I'd say rather a bug in boost Jamroot, should have <define>BOOST_ALL_NO_LIB=1 in

project's usage requirements.

Is it? If so I'll remove it. Because this flag was added when I used CMake and then it moved to bjam. If so I can remove it. Artyom

Henrik Sundberg

7:10 p.m.

New subject: [locale] review part 2.2: source

On Fri, Apr 15, 2011 at 8:54 PM, Artyom <artyomtnk@yahoo.com> wrote:

...

...
I'd say rather a bug in boost Jamroot, should have <define>BOOST_ALL_NO_LIB=1 in project's usage requirements.

Is it? If so I'll remove it. Because this flag was added when I used CMake and then it moved to bjam.

If so I can remove it.

Off topic: Artyom, can you please keep the names of those you answer in these threads? /$

Gevorg Voskanyan

8:58 p.m.

New subject: [locale] review part 2.2: source

Artyom wrote:

...

Gevorg Voskanyan wrote:

...
I'd say rather a bug in boost Jamroot, should have <define>BOOST_ALL_NO_LIB=1

...
in

project's usage requirements.

Is it? If so I'll remove it. Because this flag was added when I used CMake and then it moved to bjam.

If so I can remove it.

Artyom

I'd say keep them for now, they don't hurt. Could be removed later when the Jamroot gets the needed changes. Gevorg

Steven Watanabe

4:58 p.m.

New subject: [locale] review part 2.2: source

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 AMDG On 04/15/2011 07:04 AM, Artyom wrote:

...

...
line 311: How do you know that there's a null-terminator?

Promised by the file format.

Well, all input to a program needs to be validated. It looks like you've already handled most of these issues, (at least to the point where the message system won't read outside the memory it owns)

...

...
lines 440, 442: You're assuming that the string is null terminated. What if the file is incorrectly formatted? It may be possible to read past the end of the allocated block. I haven't actually worked through the exact code path to trigger this condition, but it seems suspicious.

It is the requirement of the format, it is designed to be loaded as is to memory and be useful.

So I don't check every possible string as it allows to load files much faster.

In any case when I load file I put 1 extra 0 at the end so at some point it will be terminated

Okay. I see that now.

...

...
line 475: You're assuming that the separator actually appears.

std::string::npos is maximal value of size_t so substr just get bigger sub string. May be not 100% clear but this code does what it should.

Okay. I jut checked the standard, and it's okay. In Christ, Steven Watanabe -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQEcBAEBAgAGBQJNqHk9AAoJEDTBQuhymLHSnYUH/jClxMci4tAOO40pGVbxRmBn FO9eokCHOxTk1D95+sCLEYVaihN3QV80o4Ry1dOB5EXevil8w3RYoE0Lnp9xZ+Za MshJisavzcgwy0OZ/LnRKcbVVfIrXrYKavXqsKM2dkCylHQCNp0vH/2uV+TZDtlJ IHeUHsra/iQQnwJn3+Wd/9SkjOoh2y7oXj0nIApm2/Ov/mIUzEeM7x7eQpjzmLW3 GiinENUv4UQSa5QSm/rXp0t7iaJ/T3fIvHCgJkqjmhKNg0WG9zqd03VK7m7FuiAP 7ztjSJW+h7zYivglUxhuQVWZiC30cAxB4R7KIyL6COICBg++aRVqHGs9H4oBS8I= =6Ap/ -----END PGP SIGNATURE-----

Ilya Sokolov

18 Apr 18 Apr

7:26 a.m.

New subject: [locale] review part 2.2: source

On Fri, 15 Apr 2011 20:04:31 +0600, Artyom <artyomtnk@yahoo.com> wrote:

...

Steven Watanabe wrote:

...
win32/lcid.cpp: line 89: I thought double checked locking was considered unsafe?

Not really

I had a discussion with a professor an expert in this area, no problems when mutexes are full memory barriers and they are.

I'm not an expert, but there is no ReadRead/acquire barrier in case if table_is_ready is true. IIUC, it would be a problem if you are targeting IA-64 using non-msvc or msvc before 8.0 (2005). Could you detect and #error on such arch/compiler combinations?

...

...
shared/localization_backend.cpp:

lines 200, 206: This is not guaranteed to be thread-safe in C++03. You must use call_once for thread-safe static inititalization. Note however that this is probably safe if you assume that no additional threads are started before main.

This mutexes are explicitly globally initialized later.

So all such things should be solved on library start.

Some compilers might not execute init::init() ctor as the do_init variable is not used. Also, could you comment your assumptions in the code?

...

So I think that is would be good enough and call_once would just inject additional dependency on Boost.Thread (currently Boost.Thread used in headers only variant when ICU is not used)

Artyom

12:25 p.m.

New subject: [locale] review part 2.2: source

...

...
Not really

I had a discussion with a professor an expert in this area, no problems when mutexes are full memory barriers and they are.

I'm not an expert, but there is no ReadRead/acquire barrier in case if table_is_ready is true. IIUC, it would be a problem if you are targeting IA-64 using non-msvc or msvc before 8.0 (2005). Could you detect and #error on such arch/compiler combinations?

Mutex acquisition and release is full memory barrier.

...

...
This mutexes are explicitly globally initialized later.

So all such things should be solved on library start.

Some compilers might not execute init::init() ctor as the do_init variable is not used. Also, could you comment your assumptions in the code?

I'm not aware of such compilers and if they exist they are probably buggy. Artyom

Peter Dimov

12:42 p.m.

New subject: [locale] review part 2.2: source

Artyom wrote:

...

Mutex acquisition and release is full memory barrier.

No, they are only guaranteed to have acquire and release semantics, respectively. Double-checked locking is broken even if they're full barriers though. That said, MSVC from some version onwards, I forget which, guarantees that volatile reads have acquire semantics and that volatile writes have release semantics, and under this assumption, the code is correct.

Ilya Sokolov

19 Apr 19 Apr

4:07 a.m.

New subject: [locale] review part 2.2: source

On Mon, 18 Apr 2011 18:42:22 +0600, Peter Dimov <pdimov@pdimov.com> wrote:

...

Artyom wrote:

...
Mutex acquisition and release is full memory barrier.

No, they are only guaranteed to have acquire and release semantics, respectively.

IIRC, acquire and full due to existence of try_lock.

...

Double-checked locking is broken even if they're full barriers though. That said, MSVC from some version onwards, I forget which,

From 8.0 (2005).

...

guarantees that volatile reads have acquire semantics and that volatile writes have release semantics, and under this assumption, the code is correct.

Ilya Sokolov

4:01 a.m.

New subject: [locale] review part 2.2: source

On Mon, 18 Apr 2011 18:25:57 +0600, Artyom <artyomtnk@yahoo.com> wrote:

...

...
...
Not really

I had a discussion with a professor an expert in this area, no problems when mutexes are full memory barriers and they are.

I'm not an expert, but there is no ReadRead/acquire barrier in case if table_is_ready is true. IIUC, it would be a problem if you are targeting IA-64 using non-msvc or msvc before 8.0 (2005). Could you detect and #error on such arch/compiler combinations?

Mutex acquisition and release is full memory barrier.

There is no mutex acquisition/release on the code path if table_is_ready is true.

...

...
...
This mutexes are explicitly globally initialized later.

So all such things should be solved on library start.

Some compilers might not execute init::init() ctor as the do_init variable is not used. Also, could you comment your assumptions in the code?

I'm not aware of such compilers and if they exist they are probably buggy.

I was referring to ISO/IEC-14882:2003, 3.6.2/3: "It is implementation-defined whether or not the dynamic initialization (8.5, 9.4, 12.1, 12.6.1) of an object of namespace scope is done before the first statement of main. If the initialization is deferred to some point in time after the first statement of main, it shall occur before the first use of any function or object defined in the same translation unit as the object to be initialized. 31) ..." Just checked it and found that footnote 31 prohibits such behavior: "31) An object defined in namespace scope having initialization with side-effects must be initialized even if it is not used (3.7.1)."

Steven Watanabe

15 Apr 15 Apr

3:26 a.m.

New subject: [locale] One more comment patch

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 AMDG See attached. In Christ, Steven Watanabe -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJNp7rZAAoJEDTBQuhymLHS3LYH+wYZWDPzFltmFRlExoNdVXK0 h4OsfND3JALf4TCLXl2Axft2njPAgaJahTQc2oSMbz283s+Fu21968fHuzSfKS4p qDXzemlcNDjqwGU6g4x2wqswcJcHeJQN7a/xk5Z+LBgnIW06NlDTfaxEhTHWDtjJ k8nxFkKJ4C5Qe5zm0RIwSVH7XkuJxU2MyQTICWONIrogkmzYjm0puFF64ilZMAiE kZYKYUYpb+ZI/CrNgoM67x/JIlrJqBq+C4+58Z1VHCowdJqfqfarP0SE1fle0zmr hH3w2ONPzfZda90a3uOhIhTYsqCrda9yzztcn7rUJWrGLFy89LcXhTu1+4bx3qI= =tnNZ -----END PGP SIGNATURE-----

Artyom

2:21 p.m.

New subject: [locale] One more comment patch

Thanks I'll apply it. ----- Original Message ----

...

From: Steven Watanabe <watanabesj@gmail.com> To: boost@lists.boost.org Sent: Fri, April 15, 2011 6:26:18 AM Subject: Re: [boost] [locale] One more comment patch

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

AMDG

See attached.

In Christ, Steven Watanabe

Steven Watanabe

3:36 a.m.

New subject: [locale] Formal review of Boost.Locale library starts tomorrow

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 AMDG On 04/06/2011 02:29 PM, Chad Nelson wrote:

...

Please explicitly state in your review whether the library should be accepted.

*Yes, the Locale library should be accepted into Boost.*

...

- What is your evaluation of the design? - What is your evaluation of the implementation?

The design appears well thought out and the implementation is solid.

...

- What is your evaluation of the documentation?

The tutorial was sufficient to understand what the library could do. The reference could use more attention to details like preconditions, postconditions, exception guarantees, limitations, and so on.

...

- What is your evaluation of the potential usefulness of the library?

Who needs anything other than en-US anyway?

...

- Did you try to use the library? With what compiler? Did you have any problems?

No. I didn't try it.

...

- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

I spent about 20 hours reading the entire code base carefully.

...

- Are you knowledgeable about the problem domain?

No. In Christ, Steven Watanabe -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJNp709AAoJEDTBQuhymLHSEYEH/231DU5CFAVZ2XJ4GnY2JZXG DfL7Xmv6SeOOEKNoJcsh3fu/uDAiqdH5tWw9aJPNiBBPZveUYbHq4UZ5/hCCZtMM STUZnW+LPPsqNiWchRy3PTiAFFFJMqZPjMOsal76fQMqUY6Pm4OGbKbgVSrwiKoG 1R5UeVFXJYo5fO5x5LOkfxFQdI8j3l/OVesd84sR0F6Pm4YgeH1LZ3xLgVl2nO2k JnSeXKe4xAnebdSDzA7oCPpa2CDqJnKBphGz+H1cWRbIbUSOzDwmXZ21Ec973Aq7 dlr8s4CP9hcYOIAzzDWDd8RXKxBg1X5sDDxYQe4UMKZdEkiv//tyStHlG+QugGY= =PEw0 -----END PGP SIGNATURE-----

Artyom

2:25 p.m.

...

...
Please explicitly state in your review whether the library should be

...

...
accepted.

*Yes, the Locale library should be accepted into Boost.*

Thanks, I hope users would enjoy it.

...

...
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

I spent about 20 hours reading the entire code base carefully.

I want to thank you very much. It was great code review. I've learned a lot from it.

...

In Christ, Steven Watanabe

Artyom

Fabio Fracassi

5:56 p.m.

On 6/4/2011 23:29, Chad Nelson wrote:

...

The formal review of the Boost.Locale library by Artyom Beilis starts tomorrow, April 7th, and is scheduled to last through April 16th.

...

Please explicitly state in your review whether the library should be accepted.

Yes. (But I'm not quite sure if the format issue below might warrant a conditional Yes)

...

- What is your evaluation of the design?

I generally like it. I particularly like that it integrates nicely with the std facilities (esp. the use of std::string), and it integrates nicely with de-facto standard tools for localization and translation. (gettext) As far as I recall from earlier discussions it would be possible to add a backend to be able to use qt's translation files, too. This would be great, because it would then be possible to use the same toolchain (String extraction and Translation tools) for Application backend code when one uses qt for the Application GUI. (Even if not the QT Linguist tool is highly sophisticated, so being able to use it would be a great boon) I wouldn't require such a backend for inclusion, but if/when such a backend gets written I would like it to be included. Some criticisms: * it would be nice if boost::locale::format would also understand normal boost::format messages, or a compatible wrapper like boost::locale::compat::format that did the right thing. I think this is important! It would be very unfortunate to have two slightly incompatible facilities. * it would be nice if boost::locale::date_time and boost::date_time interacted nicely, or ideally could be merged * No boundary analysis in non ICU backends. it would be nice if at least simple cases (only single codepoint chars) would work correctly.

...

- What is your evaluation of the implementation?

I have only had a brief look, it looks clean enough.

...

- What is your evaluation of the documentation?

First of all I like the layout, it looks nice fresh and modern, and individual functions are easy to spot, something which I often miss in normal boost docu. But as others have noticed it is a bit hard to navigate, especially in the tutorial section. (This might be a problem in doxygen) One thing I miss is a more throughout documentation of the toolchain needed to extract strings, translate them, ... At least an example for each major platform with all the needed programms. Bonus Points for build system (bjam and CMake) integration. Some things are not quite clear * is there a native program to extract translation strings for each major platform? (Win-Mac-Linux)? * how do locale::format and locale::translate interact? will a string using format get translated? can I use all of formats niceness in translate? * some examples on what exactly the limitations of non ICU backends are. Something along the lines of if you do this on string x the result is y instead of z as it would be with ICU. As always when commenting on documentation I would like more How-to and walk-through examples, especially in the reference section. (somewhat like the qt docu which has a nice minimal code sniplett for each function, which shows how the function is used to achive something)

...

- What is your evaluation of the potential usefulness of the library?

Very useful. I consider a library that treats the localization as a prime candidate for boost and ultimately for standardization. I consider easy and central localization handling to be important because mistakes and oversights in this area lead to subtle and annoying user-visible bugs, which are nearly impossible to test automatically. It is also a topic which is almost never a central concern of any application (especially during initial development), and still needs to be dealt with in almost all code layers, so a simple standardized API which hides the almost insurmountable subtleties and complexities from the programmer is IMO essential.

...

- Did you try to use the library? With what compiler? Did you have any problems?

Not yet.

...

- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

Throughout reading of the documentation, quick look at the code. If I have a bit of time later I will do a test run, and amend that info.

...

- Are you knowledgeable about the problem domain?

Moderately so. I used Qt's localization an internationalization services for a moderately sized application and a companion framework, and am currently designing a application framework where I do not want to use Qt, at least not for the lower code layers and inter module communication, and need something suitable. Regards, and keep up the good work Fabio

Artyom

7:33 p.m.

...

Fabio Wrote:

...

...
Please explicitly state in your review whether the library should be accepted.

Yes. (But I'm not quite sure if the format issue below might warrant a conditional Yes)

Thanks I'm glad.

...

...
- What is your evaluation of the design?

I generally like it.

I particularly like that it integrates nicely with the std facilities (esp. the use of std::string), and it integrates nicely with de-facto standard tools for localization and translation. (gettext)

As far as I recall from earlier discussions it would be possible to add a backend to be able to use qt's translation files, too. This would be great, because

...

would then be possible to use the same toolchain (String extraction and Translation tools) for Application backend code when one uses qt for the Application GUI. (Even if not the QT Linguist tool is highly sophisticated, so being able to use it would be a great boon)

There is no technical limitations to add Qt message catalogs (as they very similar to gettext in their concepts), however in order to do this I need a documentation of the file format. (That it what I told last time we discussed it) It should not be too hard but requires non-negligible amount of work.

...

I wouldn't require such a backend for inclusion, but if/when such a backend gets written I would like it to be included.

Some criticisms:

* it would be nice if boost::locale::format would also understand normal boost::format messages, or a compatible wrapper like boost::locale::compat::format that did the right thing. I think this is important! It would be very unfortunate to have two slightly incompatible facilities.

The major problem is that boost::format has too limited functionality for localization. So I can duplicate its functionality but should I? It is better to make users to understand the difference between two. Another point is that boost::format throws on almost any error while boost::locale::format ignores error (as you don't want a translator to crash the program)

...

* it would be nice if boost::locale::date_time and boost::date_time interacted nicely, or ideally could be merged

The problem is that they too different. - boost::locale::date_time is: - locale dependent - provides non-Gregorian calendars - represents the time as single time point - boost::date_time:: date and ptime has several problems. See: http://cppcms.sourceforge.net/boost_locale/html/appendix.html#why_plain_numb... So they can't be really merged.

...

* No boundary analysis in non ICU backends. it would be nice if at least simple cases (only single codepoint chars) would work correctly.

Boundary analysis (or text segmentation) is an algorithm defined by the Unicode standard. Its implementation (even simple) requires Unicode character database and many other things. So I don't really want to include broken or partial algorithms especially when they quite complex and standard API does not provide them. I hope once Boost.Unicode will be there it would be simpler.

...

[snip]

One thing I miss is a more throughout documentation of the toolchain needed to extract strings, translate them, ... At least an example for each major platform with all the needed programms. Bonus Points for build system (bjam and CMake) integration.

Good point. I'll add this to documentation.

...

Some things are not quite clear

* is there a native program to extract translation strings for each major platform? (Win-Mac-Linux)?

See: http://cppcms.sourceforge.net/boost_locale/html/messages_formatting.html#msg... GNU Gettext tools are cross platform and available on all platforms. Basically you will always need: - GNU Gettext tools like xgettext msgfmt msgmerge - Translation program like Lokalize or poedit

...

* how do locale::format and locale::translate interact? will a string using format get translated? can I use all of formats niceness in translate?

See: http://cppcms.sourceforge.net/boost_locale/html/localized_text_formatting.ht... It works this way: out << format(translate("Something {1}")) % param; - translate creates message object that holds "Something {1}" - format created and receives this message object - a "param" binded to the format and it keeps a reference on it - format is written to stream upon this it: converts "Something {1}" to localized message like "Some-Thing-in-my-language {1}" format parses the localized message and renders the binded param. Note translator can change string so if the original is "Today {1,date=short}" translator can change it to "Today {1,date=full}" or "Today {1,ftime='%A'} {1,ftime='%d/%m'}" according to his needs. I'll add documentation of this.

...

* some examples on what exactly the limitations of non ICU backends are. Something along the lines of if you do this on string x the result is y instead of z as it would be with ICU.

Good point. I'll try to make to be more specific.

...

As always when commenting on documentation I would like more How-to and walk-through examples, especially in the reference section. (somewhat like the qt docu which has a nice minimal code sniplett for each function, which shows how the function is used to achive something)

Noted.

...

Regards, and keep up the good work

Thank you for the review, Artyom

Jeffrey Lee Hellrung, Jr.

7:41 p.m.

On Fri, Apr 15, 2011 at 12:33 PM, Artyom <artyomtnk@yahoo.com> wrote: [...]

...

Another point is that boost::format throws on almost any error while boost::locale::format ignores error (as you don't want a translator to crash the program)

Hmmm...is there at least some convenient mechanism for the user to detect errors encountered by boost::locale::format after boost::locale::format has returned? Could the error behavior be specified in the call to boost::locale::format? - Jeff

Artyom

16 Apr 16 Apr

8:07 a.m.

...

...
Another point is that boost::format throws on almost any error while boost::locale::format ignores error (as you don't want a translator to crash the program)

Hmmm...is there at least some convenient mechanism for the user to detect errors encountered by boost::locale::format after boost::locale::format has returned?

Could the error behavior be specified in the call to boost::locale::format?

I explain, the error of for example missing parameter is almost never a error, consider following: format(translate("Passed {1} day", "Passed {1} days",n)) % n; Now in Hebrew it would have 3 plural forms: single, dual and plural עבר יום {1}‏ /passed {1} day עברו יומיים /passed two-days when "two-days" is a single world - no parameter used עברו {1} ימים / passed {1} days So missing parameter is ok in localization context, while boost::format throws unless you do something specific. Artyom

Jeffrey Lee Hellrung, Jr.

3:47 p.m.

On Sat, Apr 16, 2011 at 1:07 AM, Artyom <artyomtnk@yahoo.com> wrote:

...

...
...
Another point is that boost::format throws on almost any error while boost::locale::format ignores error (as you don't want a translator to crash the program)

Hmmm...is there at least some convenient mechanism for the user to detect errors encountered by boost::locale::format after boost::locale::format has returned?

Could the error behavior be specified in the call to boost::locale::format?

I explain, the error of for example missing parameter is almost never a error, consider following:

format(translate("Passed {1} day", "Passed {1} days",n)) % n;

Now in Hebrew it would have 3 plural forms: single, dual and plural

עבר יום {1}‏ /passed {1} day עברו יומיים /passed two-days when "two-days" is a single world - no parameter used עברו {1} ימים / passed {1} days

So missing parameter is ok in localization context, while boost::format throws unless you do something specific.

So that's the specific case of a missing parameter error, but is this the only kind of error one can make with boost::locale::format? If not, it's not really a convincing argument that all users all the time would like boost::locale::format to ignore all errors. And you do qualify that this particular error is "almost never a error", leading me to believe that sometimes it really is an error you care about. - Jeff

Vicente BOTET

4:33 p.m.

New subject: [locale] Formal review of Boost.Locale library starts tomorrow

...

Message du 16/04/11 17:48 De : "Jeffrey Lee Hellrung, Jr." A : boost@lists.boost.org Copie à : Objet : Re: [boost] [locale] Formal review of Boost.Locale library starts tomorrow

On Sat, Apr 16, 2011 at 1:07 AM, Artyom wrote:

...
...
...
Another point is that boost::format throws on almost any error while boost::locale::format ignores error (as you don't want a translator to crash the program) Hmmm...is there at least some convenient mechanism for the user to detect errors encountered by boost::locale::format after boost::locale::format has returned? Could the error behavior be specified in the call to boost::locale::format?

I explain, the error of for example missing parameter is almost never a error, consider following:

format(translate("Passed {1} day", "Passed {1} days",n)) % n;

Now in Hebrew it would have 3 plural forms: single, dual and plural

עבר יום {1}‏ /passed {1} day עברו יומיים /passed two-days when "two-days" is a single world - no parameter used עברו {1} ימים / passed {1} days

So missing parameter is ok in localization context, while boost::format throws unless you do something specific.

So that's the specific case of a missing parameter error, but is this the only kind of error one can make with boost::locale::format? If not, it's not really a convincing argument that all users all the time would like boost::locale::format to ignore all errors. And you do qualify that this particular error is "almost never a error", leading me to believe that sometimes it really is an error you care about.

+1 Vicente

Artyom

8:53 p.m.

...

...
...
I explain, the error of for example missing parameter is almost never a error, consider following:

format(translate("Passed {1} day", "Passed {1} days",n)) % n;

Now in Hebrew it would have 3 plural forms: single, dual and plural

עבר יום {1}‏ /passed {1} day עברו יומיים /passed two-days when "two-days" is a single world - no parameter used עברו {1} ימים / passed {1} days

So missing parameter is ok in localization context, while boost::format throws unless you do something specific.

So that's the specific case of a missing parameter error, but is this the only kind of error one can make with boost::locale::format? If not, it's not really a convincing argument that all users all the time would like boost::locale::format to ignore all errors. And you do qualify that this particular error is "almost never a error", leading me to believe that sometimes it really is an error you care about.

+1

Vicente

See: http://article.gmane.org/gmane.comp.lib.boost.devel/218075 Artyom

Artyom

7:53 p.m.

...

From: "Jeffrey Lee Hellrung, Jr." <jeffrey.hellrung@gmail.com>

...
...
Could the error behavior be specified in the call to boost::locale::format?

I explain, the error of for example missing parameter is almost never a error, consider following:

format(translate("Passed {1} day", "Passed {1} days",n)) % n;

Now in Hebrew it would have 3 plural forms: single, dual and plural

עבר יום {1}‏ /passed {1} day עברו יומיים /passed two-days when "two-days" is a single world - no parameter used עברו {1} ימים / passed {1} days

So missing parameter is ok in localization context, while boost::format throws unless you do something specific.

So that's the specific case of a missing parameter error, but is this the only kind of error one can make with boost::locale::format? If not, it's not really a convincing argument that all users all the time would like boost::locale::format to ignore all errors. And you do qualify that this particular error is "almost never a error", leading me to believe that sometimes it really is an error you care about.

I'll explain, Boost.Locale format may throw if for example io-operations fail or streaming object to file causes an exception or bad_alloc is thrown. However it never throws in case of syntax error, missing parameter or extra parameter. This point was also told before and I'll document it more explicitly. I explain the rationale: Usually boost.format is going to be used as: format(translate("something")) % a % b % c Now, how translation process goes. Translator takes a dictionary, defines the locale and loads it. Translator is not a programmer, but he looks on output. So I can expect that it would accidentally screw the format by writing: "{1,numb}" instead of {1,num} or {1,number} If boost.format throws it may accidentally stop the program and both translator and programmer would give very hard times to understand what is going. So by no means the formatting string should cause an exception, it may cause invalid output that the translator would see and fix but not abort the entire program. Best, Artyom

Jeffrey Lee Hellrung, Jr.

17 Apr 17 Apr

1:52 a.m.

On Sat, Apr 16, 2011 at 12:53 PM, Artyom <artyomtnk@yahoo.com> wrote:

...

...
From: "Jeffrey Lee Hellrung, Jr." <jeffrey.hellrung@gmail.com>

...
...
Could the error behavior be specified in the call to boost::locale::format?

I explain, the error of for example missing parameter is almost never a error, consider following:

format(translate("Passed {1} day", "Passed {1} days",n)) % n;

Now in Hebrew it would have 3 plural forms: single, dual and plural

עבר יום {1}‏ /passed {1} day עברו יומיים /passed two-days when "two-days" is a single world - no parameter used עברו {1} ימים / passed {1} days

So missing parameter is ok in localization context, while boost::format throws unless you do something specific.

So that's the specific case of a missing parameter error, but is this the only kind of error one can make with boost::locale::format? If not, it's not really a convincing argument that all users all the time would like boost::locale::format to ignore all errors. And you do qualify that this particular error is "almost never a error", leading me to believe that sometimes it really is an error you care about.

I'll explain, Boost.Locale format may throw if for example io-operations fail or streaming object to file causes an exception or bad_alloc is thrown.

However it never throws in case of syntax error, missing parameter or extra parameter.

This point was also told before and I'll document it more explicitly.

I explain the rationale:

Usually boost.format is going to be used as:

format(translate("something")) % a % b % c

Now, how translation process goes. Translator takes a dictionary, defines the locale and loads it.

Translator is not a programmer, but he looks on output.

So I can expect that it would accidentally screw the format by writing:

"{1,numb}" instead of {1,num} or {1,number}

If boost.format throws it may accidentally stop the program and both translator and programmer would give very hard times to understand what is going.

Okay, I can buy that in some circumstances, you want syntax errors to be silently ignored. This appears to be mostly because the allowable syntax can vary with the runtime locale; is that correct? (I've only skimmed the first few pages of the documentation so I should probably go through it more thoroughly. I was completely unfamiliar with localization until this review came up.) So by no means the formatting string should cause

...

an exception, it may cause invalid output that the translator would see and fix but not abort the entire program.

Based on your knowledge and experience in this domain, I take it at your word that swallowing syntax errors is the typical use case. But I'm skeptical that this use case is the only relevant one for all users all the time. You haven't given any argument against simply opening up the interface to allow the user to specify the error behavior, with the default being to swallow the error. To me, this wouldn't change the typical use case, and, on the implementation side, I wouldn't expect it to be technically challenging. I do appreciate you taking the time to bear with me and to explain to me in detail your rationale. - Jeff

Artyom

9:07 a.m.

...

...
"{1,numb}" instead of {1,num} or {1,number}

If boost.format throws it may accidentally stop the program and both translator and programmer would give very hard times to understand what is going.

Okay, I can buy that in some circumstances, you want syntax errors to be silently ignored. This appears to be mostly because the allowable syntax can vary with the runtime locale; is that correct? (I've only skimmed the first few pages of the documentation so I should probably go through it more thoroughly. I was completely unfamiliar with localization until this review came up.)

Not, it is mostly because we don't want that translator would screw the program accidentally.

...

So by no means the formatting string should cause

...
an exception, it may cause invalid output that the translator would see and fix but not abort the entire program.

Based on your knowledge and experience in this domain, I take it at your word that swallowing syntax errors is the typical use case. But I'm skeptical that this use case is the only relevant one for all users all the time. You haven't given any argument against simply opening up the interface to allow the user to specify the error behavior, with the default being to swallow the error. To me, this wouldn't change the typical use case, and, on the implementation side, I wouldn't expect it to be technically challenging.

I'll think about it, may be it is good idea to add throw on syntax error behavior even thou it is really not desired.

...

I do appreciate you taking the time to bear with me and to explain to me in detail your rationale.

- Jeff

Best, Artyom

Peter Dimov

1:31 p.m.

New subject: [locale] Formal review of Boost.Locale library startstomorrow

Jeff wrote:

...

...
Okay, I can buy that in some circumstances, you want syntax errors to be silently ignored.

No. You are looking at the problem from a programmer perspective, but when these errors are introduced, there is no programmer present. As a matter of fact, the programmers may already have been fired. The person who is translating the text is introducing the errors, and he is the one who needs to be able to see them and fix them. The best way to do that is for the translate/format functions to return a string from which one can deduce that a syntax error has occurred, so that the translator (or the user) can see the error appearing instead of the properly translated text - without this interfering with the user's ability to continue using the software as would happen if an exception aborted the entire, potentially time consuming, operation. This does not "silently ignore" the errors, it reports them in an appropriate way, to the appropriate person. Now, you could make the case that you, as a programmer, should still have the error reported to you by the library so that you can log it (if the library doesn't have built-in logging support), which of course would also allow you to promptly proceed to ignore everything that's been said and make the error handlers throw exceptions. :-)

Artyom

2:05 p.m.

New subject: [locale] Formal review of Boost.Locale library startstomorrow

----- Original Message ----

...

From: Peter Dimov <pdimov@pdimov.com> To: boost@lists.boost.org Sent: Sun, April 17, 2011 4:31:16 PM Subject: Re: [boost] [locale] Formal review of Boost.Locale library startstomorrow

Jeff wrote:

...
...
Okay, I can buy that in some circumstances, you want syntax errors to be silently ignored.

No. You are looking at the problem from a programmer perspective, but when these errors are introduced, there is no programmer present. [snip] This does not "silently ignore" the errors, it reports them in an appropriate way, to the appropriate person.

Now, you could make the case that you, as a programmer, should still have the error reported to you by the library so that you can log it (if the library doesn't have built-in logging support), which of course would also allow you to promptly proceed to ignore everything that's been said and make the error handlers throw exceptions. :-)

Actually it may be it is better to make boost::locale::format to add something to formatted message like cout << format(translate("Hello World {1}")) % param; The is translate to "bonjour le monde {1" (missed "}" Would print: "[syntax error] bonjour le monde {1" Instead of "bonjour le monde" However I'm not sure if it is desired default behavior. But interesting point to think about. Artyom

...

Peter Dimov

2:32 p.m.

New subject: [locale] Formal review of Boost.Locale librarystartstomorrow

Artyom wrote:

...

Actually it may be it is better to make boost::locale::format to add something to formatted message like

cout << format(translate("Hello World {1}")) % param;

The is translate to "bonjour le monde {1" (missed "}"

Would print:

"[syntax error] bonjour le monde {1"

This is what I do, yes: bonjour le monde [error: parse error at '{1'] or, for "bonjour le monde {2}": bonjour le monde [error: argument '2' missing]

Fabio Fracassi

2:47 p.m.

New subject: [locale] Formal review of Boost.Locale librarystartstomorrow

On 17/4/2011 16:32, Peter Dimov wrote:

...

Artyom wrote:

...
Actually it may be it is better to make boost::locale::format to add something to formatted message like

cout << format(translate("Hello World {1}")) % param;

The is translate to "bonjour le monde {1" (missed "}"

Would print:

"[syntax error] bonjour le monde {1"

This is what I do, yes:

bonjour le monde [error: parse error at '{1']

or, for "bonjour le monde {2}":

bonjour le monde [error: argument '2' missing]

I'd consider this quite a sensible option, because it is easy to spot for translators and testers, and if one slips through it doesn't inconvenience the user too much and he can easily send a conclusive bug report. regards Fabio

Jeffrey Lee Hellrung, Jr.

3:40 p.m.

New subject: [locale] Formal review of Boost.Locale librarystartstomorrow

On Sun, Apr 17, 2011 at 7:32 AM, Peter Dimov <pdimov@pdimov.com> wrote:

...

Artyom wrote:

...
Actually it may be it is better to make boost::locale::format to add something to formatted message like

cout << format(translate("Hello World {1}")) % param;

The is translate to "bonjour le monde {1" (missed "}"

Would print:

"[syntax error] bonjour le monde {1"

This is what I do, yes:

bonjour le monde [error: parse error at '{1']

or, for "bonjour le monde {2}":

bonjour le monde [error: argument '2' missing]

I would think making the error more obvious to the translator would be better. - Jeff

Jeffrey Lee Hellrung, Jr.

3:50 p.m.

New subject: [locale] Formal review of Boost.Locale library startstomorrow

On Sun, Apr 17, 2011 at 6:31 AM, Peter Dimov <pdimov@pdimov.com> wrote:

...

Jeff wrote:

...
Okay, I can buy that in some circumstances, you want syntax errors to be

...
silently ignored.

No. You are looking at the problem from a programmer perspective,

Only natural, right? ;)

...

but when these errors are introduced, there is no programmer present. As a matter of fact, the programmers may already have been fired. The person who is translating the text is introducing the errors, and he is the one who needs to be able to see them and fix them. The best way to do that is for the translate/format functions to return a string from which one can deduce that a syntax error has occurred, so that the translator (or the user) can see the error appearing instead of the properly translated text - without this interfering with the user's ability to continue using the software as would happen if an exception aborted the entire, potentially time consuming, operation. This does not "silently ignore" the errors, it reports them in an appropriate way, to the appropriate person.

Artyom seemed to imply that the result string would give little or no indication of an error, although it was never really specified what "silently ignore" meant beyond not throwing. It would certainly make the translator's job easier if they had some obvious indication there was a translation error. Now, you could make the case that you, as a programmer, should still have

...

the error reported to you by the library so that you can log it (if the library doesn't have built-in logging support), which of course would also allow you to promptly proceed to ignore everything that's been said and make the error handlers throw exceptions. :-)

I'm having trouble parsing this (which error handlers? those in boost::locale::format?), but it sounds like you accept that there may be scenarios under which a syntax error should throw an exception or perform some other task, e.g., log the error. - Jeff

Peter Dimov

5:17 p.m.

New subject: [locale] Formal review of Boost.Locale librarystartstomorrow

Jeffrey Lee Hellrung, Jr. wrote:

...

I'm having trouble parsing this (which error handlers? those in boost::locale::format?), but it sounds like you accept that there may be scenarios under which a syntax error should throw an exception...

No, I don't. As a practical matter, such scenarios occur approximately never. This can be used as an interview question if the candidate lists localization experience - it's a good trap. Consider for example a dialog box with 12 localizable texts. It makes no sense to abort its creation if the formatting or localization of the second text fails, and neither does it make sense to leave the rest of the texts blank or untranslated/unformatted. In practice, if these functions throw, you'll have to wrap each and every call with a try block, which is an obvious indication that these functions shouldn't throw.

Jeffrey Lee Hellrung, Jr.

9:50 p.m.

New subject: [locale] Formal review of Boost.Locale librarystartstomorrow

On Sun, Apr 17, 2011 at 10:17 AM, Peter Dimov <pdimov@pdimov.com> wrote:

...

Jeffrey Lee Hellrung, Jr. wrote:

I'm having trouble parsing this (which error handlers? those in

...
boost::locale::format?), but it sounds like you accept that there may be scenarios under which a syntax error should throw an exception...

No, I don't. As a practical matter, such scenarios occur approximately never. This can be used as an interview question if the candidate lists localization experience - it's a good trap.

Consider for example a dialog box with 12 localizable texts. It makes no sense to abort its creation if the formatting or localization of the second text fails, and neither does it make sense to leave the rest of the texts blank or untranslated/unformatted. In practice, if these functions throw, you'll have to wrap each and every call with a try block, which is an obvious indication that these functions shouldn't throw.

Okay; note that I'm not necessarily advocating that throwing on errors should be the default behavior. I'd believe that there are infinitely many examples where you *wouldn't* want to throw, and that's fine, but somewhat beside the point. I'll accept those with more experience on their word that throwing behavior is "approximately never" desirable. What about the part of what I said that you stripped off ("...or perform some other task")? And which error handlers were we talking about here? - Jeff

Peter Dimov

10:08 p.m.

New subject: [locale] Formal review of Boost.Localelibrarystartstomorrow

Jeffrey Lee Hellrung, Jr. wrote:

...

What about the part of what I said that you stripped off ("...or perform some other task")?

I can think of a legitimate scenario in which you'd want such errors to be logged - automated (non-interactive) testing. Other tasks don't spring to mind. I don't consider such support essential though, or absolutely required for such a library.

...

And which error handlers were we talking about here?

The hypothetical error handlers that the library would invoke were it extended to do so.

Fabio Fracassi

16 Apr 16 Apr

9:32 a.m.

On 15/4/2011 21:33, Artyom wrote:

...

...
Fabio Wrote:

...

There is no technical limitations to add Qt message catalogs (as they very similar to gettext in their concepts), however in order to do this I need a documentation of the file format.

(That it what I told last time we discussed it)

It should not be too hard but requires non-negligible amount of work.

Yes I understand that, thats why:

...

...
I wouldn't require such a backend for inclusion, but if/when such a backend gets written I would like it to be included.

So I do not expect you to write such a thing, but if someone did I'd expect you to integrate it. By the way, can you "guesstimate" how much work it would take?

...

...
* it would be nice if boost::locale::format would also understand normal boost::format messages, or a compatible wrapper like boost::locale::compat::format that did the right thing. I think this is important! It would be very unfortunate to have two slightly incompatible facilities.

The major problem is that boost::format has too limited functionality for localization. So I can duplicate its functionality but should I?

It is better to make users to understand the difference between two.

Another point is that boost::format throws on almost any error while boost::locale::format ignores error (as you don't want a translator to crash the program)

Maybe. From what I understand from the doku boost::format is a subset of locale::format, so I'd expect something like this to work: locale::format("Hello %1%!") % name Your docs do not say what happens when I do this: locale::format("Today is {1}.") % date(d) I'd expect it to work and give the date in some sensible default formatting. I'd hate if we had to do something like this: boost::format("Sensor Value (at %1%): %2$.2%") % (locale::format("{1,date_time}") % date) % value

...

...
* it would be nice if boost::locale::date_time and boost::date_time interacted nicely, or ideally could be merged

The problem is that they too different.

- boost::locale::date_time is: - locale dependent - provides non-Gregorian calendars - represents the time as single time point

- boost::date_time:: date and ptime has several problems.

See: http://cppcms.sourceforge.net/boost_locale/html/appendix.html#why_plain_numb...

So they can't be really merged.

I understand that, thats why I said interact nicely OR merge. Minimally I'd expect boost::date_time dt = ...; locale::format("Today is {1,date}.") % dt to work.

...

...
* No boundary analysis in non ICU backends. it would be nice if at least simple cases (only single codepoint chars) would work correctly.

Boundary analysis (or text segmentation) is an algorithm defined by the Unicode standard. Its implementation (even simple) requires Unicode character database and many other things. So I don't really want to include broken or partial algorithms especially when they quite complex and standard API does not provide them.

I hope once Boost.Unicode will be there it would be simpler.

I understand that. And I am also not particularly fussed about this point, but it would be nice if e.g. the word iterator worked correctly, if the input contains only single byte characters. Regards Fabio

Artyom

1:29 p.m.

...

...
It should not be too hard but requires non-negligible amount of work.

Yes I understand that, thats why:

...
...
I wouldn't require such a backend for inclusion, but if/when such a backend gets written I would like it to be included.

So I do not expect you to write such a thing, but if someone did I'd expect you to integrate it. By the way, can you "guesstimate" how much work it would take?

I have no idea, for example to handle XLIFF format it is required to have... an XML parser so I don't know what would it take.

...

Maybe. From what I understand from the doku boost::format is a subset of locale::format, so I'd expect something like this to work: locale::format("Hello %1%!") % name

Your docs do not say what happens when I do this: locale::format("Today is {1}.") % date(d) I'd expect it to work and give the date in some sensible default formatting.

I'd hate if we had to do something like this:

boost::format("Sensor Value (at %1%): %2$.2%") % (locale::format("{1,date_time}") % date) % value

You can just do boost::locale::format("Sensor Value (at {1,dt}): {2,p=2}") The point is that date_time is actually a number of seconds since Jan 1, 1970 GMT and that is why it is number. Its representation is actually a matter of formatting. Why because if user is happy with time_t he should be able to use it. Finally boost::locale::format does not care about what he writes to the stream.

...

...
...
* it would be nice if boost::locale::date_time and boost::date_time interacted nicely, or ideally could be merged

[snip]

So they can't be really merged.

I understand that, thats why I said interact nicely OR merge.

Minimally I'd expect

boost::date_time dt = ...; locale::format("Today is {1,date}.") % dt to work.

There are several problems: 1. boost::date_time has its own formatter facets and they can't be overridden 2. The use of locale facets in Boost.DateTime is quite incorrect. It is not that simple. I think it can be worked out together with cooperation of Boost.DateTime author/maintainer.

...

Regards

Fabio

Thanks, Artyom

Fabio Fracassi

17 Apr 17 Apr

3 p.m.

On 16/4/2011 15:29, Artyom wrote:

...

...
So I do not expect you to write such a thing, but if someone did I'd expect you to integrate it. By the way, can you "guesstimate" how much work it would take?

I have no idea, for example to handle XLIFF format it is required to have... an XML parser so I don't know what would it take.

I understand that it is hard, so let me ask another question, if the gettext loading support was missing, how much time would someone need to implement it. Ballpark figure is enough (a day, a week, a month?).

...

...
Your docs do not say what happens when I do this: locale::format("Today is {1}.") % date(d) I'd expect it to work and give the date in some sensible default formatting.

Can you answer the question above?

...

...
I'd hate if we had to do something like this:

boost::format("Sensor Value (at %1%): %2$.2%") % (locale::format("{1,date_time}") % date) % value

You can just do boost::locale::format("Sensor Value (at {1,dt}): {2,p=2}")

Thats great, so it should be easy to integrate locale::format and boost::format, shouldn't it?

...

...
Minimally I'd expect

boost::date_time dt = ...; locale::format("Today is {1,date}.") % dt to work.

There are several problems:

1. boost::date_time has its own formatter facets and they can't be overridden 2. The use of locale facets in Boost.DateTime is quite incorrect.

Ok I'd settle for this: boost::date_time dt = ...; boost::locale::date_time ldt = dt; locale::format("Today is {1,date}.") % ldt or locale::format("Today is {1,date}.") % locale::date_time(dt)

...

It is not that simple.

I think it can be worked out together with cooperation of Boost.DateTime author/maintainer.

Yes, that was kind of the Idea. Regards Fabio

Artyom

3:39 p.m.

----- Original Message ----

...

From: Fabio Fracassi <f.fracassi@gmx.net> To: boost@lists.boost.org Sent: Sun, April 17, 2011 6:00:17 PM Subject: Re: [boost] [locale] Formal review of Boost.Locale library starts tomorrow

On 16/4/2011 15:29, Artyom wrote:

...
...
So I do not expect you to write such a thing, but if someone did I'd expect you to integrate it. By the way, can you "guesstimate" how much work it would take?

I have no idea, for example to handle XLIFF format it is required to have... an XML parser so I don't know what would it take.

I understand that it is hard, so let me ask another question, if the gettext loading support was missing, how much time would someone need to implement it. Ballpark figure is enough (a day, a week, a month?).

It took me quite a long time due to fact that I needed to implement C equation parsing (took a few days) and to learn this format. At the beginning I was using some MIT code but then I had rewritten everything from the scratch. So it took maybe about a week or so.

...

...
...
Your docs do not say what happens when I do this: locale::format("Today is {1}.") % date(d) I'd expect it to work and give the date in some sensible default formatting.

Can you answer the question above?

Actually, one of the conclusions from the review that it is better for boost::locale::Date_time to have default output as date time formatting not as a number so boost::locale::date_time dt = ..; boost::locale::format("Today is {1}") % dt would display both date and time. And if dt is boost::date_time::date object it would display date as it was written to stream. If you want to display only date or time you would have to put: boost::locale::format("Today is {1,date}") % dt As boost::locale does not have separate date and time class but date_time - which combines both concepts of calendar and POSIX time in single object. Also I strongly encourage users to put as much information to format as possible where std::cout << format(translate("Today is {1,date}")) % dt; Over std::cout << translate("Today is:") << as::date << dt; As translator can provide : "Today {1,ftime='%Y-%M-%d'}" or something like or even more complex according to his needs like "Today {1,date} ({1,locale=he_IL.UTF-8@calendar=hebrew,date})" And display "Today Jan 1, 1970, (5 in Nissan 5770)" So basically such information should be in formatting string.

...

...
...
I'd hate if we had to do something like this:

boost::format("Sensor Value (at %1%): %2$.2%") % (locale::format("{1,date_time}") % date) % value

You can just do boost::locale::format("Sensor Value (at {1,dt}): {2,p=2}")

Thats great, so it should be easy to integrate locale::format and boost::format, shouldn't it?

Integrate yes, but use same format no as it requires reimplementing boost::format functionality from the scratch as the approaches are too different.

...

...
...
Minimally I'd expect

boost::date_time dt = ...; locale::format("Today is {1,date}.") % dt to work.

There are several problems:

1. boost::date_time has its own formatter facets and they can't be overridden 2. The use of locale facets in Boost.DateTime is quite incorrect.

Ok I'd settle for this:

boost::date_time dt = ...; boost::locale::date_time ldt = dt; locale::format("Today is {1,date}.") % ldt

or locale::format("Today is {1,date}.") % locale::date_time(dt)

There is no boost::date_time there is only boost::date_time::date (without time at all) and boost::date_time::ptime which has unclear semantics whether it is UTC or local time (which is quite wrong) as ptime can be created from locale_time clock and universal_time clock (How POSIX time can be local time? It is same as PI=6.28 but this is different story...) So you need to convert it explicitly and the reasonable way is up to you.

...

...
It is not that simple.

I think it can be worked out together with cooperation of Boost.DateTime author/maintainer.

Yes, that was kind of the Idea.

Of course it can be resolved in future, but with current implementation of boost::Date_time it is not that simple.

...

Regards

Fabio

Artyom

...

5215

Age (days ago)

5228

Last active (days ago)

List overview

Download

82 comments

17 participants

participants (17)

Artyom
Chad Nelson
Dave Abrahams
Fabio Fracassi
Gevorg Voskanyan
Henrik Sundberg
Ilya Sokolov
Jeffrey Lee Hellrung, Jr.
John Bytheway
Marshall Clow
Mathias Gaunard
Michael Caisse
Olaf van der Spek
Peter Dimov
Sebastian Redl
Steven Watanabe
Vicente BOTET