
Dear All,

Here is a brief review of the proposed Boost.Locale library. I know little about the internationalisation issues that this library addresses, so I will focus just on the few areas where I do have some experience.

The documentation is generally of good quality but could be expanded in places as noted below. I appreciate the introduction of std::locale at the beginning; having avoided std::locale for precisely the "common critical problems" mentioned, I found this introduction useful. As others have said, a similar introduction to icu, perhaps even deep links to the appropriate parts of its documentation, and perhaps also to the Unicode docs, would be useful. I also found it too hard to discover which header each function is in: on the charset_handling.html page, for example, the name of the header is not mentioned, and if I click on e.g. from_utf I reach a reference page that also doesn't name the header.

Use of strings rather than symbols: there are a few places where the library uses strings such as "UTF-8" or "de_DE.UTF8" rather than symbols like utf_8 or de_DE_UTF8. This turns errors that could be caught at compile time into run-time errors: mis-typing e.g. "UTF_8" or "de-DE" is too easy. Perhaps in some cases the domain of the parameter is not known at compile time, but even in those cases it should be possible to provide symbols for the most common choices.

Collation: I would be interested to know how fast these comparisons are compared to regular std::string::operator<. If I correctly understand what transform does (more docs needed), it would be interesting to compare the performance of std::map<...,comparator> with a std::map<> storing transformed strings; or similarly, to compare std::sort using the comparator with std::sort of a pair<transformed string, original string>. Ideally these collation functions would take character ranges of arbitrary types; however, I guess that the icu API takes char*, making this unreasonable.
Apparently there is no option for a NULL-terminated char*, unlike the case conversion functions - is that right?

Case conversions: I don't know what "title case" is or what "fold case" means; the docs should define these terms. At the least, linking to the definitions of the Unicode normalisation forms would be useful; preferably, some explanation of how they should be used would be included. I can think of two recent case-conversion problems that I have had, both related to geographic names:

1. I had a list of North American names in "UPPER CASE". I decided to convert them to "Initial Capitals" for better appearance, but later realised that this was broken for accented names in Quebec. I could imagine some sort of "case conversion is lossy" property of a locale that would detect this; does anything like that exist? (I guess not.)

2. In a similar case I found that " 's " was being wrongly capitalised, e.g. "St. John'S Street". (This was with PostgreSQL's case conversion functions.) Again it would be useful to know (a) whether the library tries to handle this sort of thing correctly, and (b) whether it can tell me if the operation I want to do is lossy or not. For example, say I have a set of place names that are a mixture of en_US, en_CA and fr_CA but I don't know which; can I perform a conversion that works because those locales all have the same rules for e.g. ' and -? (I guess not.)

Character set conversion: I think it's great that Boost might finally get character set conversion - it's overdue! However, I'm not enthusiastic about this:

    std::string utf8_string = to_utf<char>(latin1_string,"Latin1");

1. It should be possible for the source character set to be a compile-time constant symbol and a template parameter, e.g. conv<latin1,utf8>(str). As it is, if I want to convert just a few bytes, it will surely spend more time decoding the charset name string than actually converting.

2.
As above, ideally it should take a range of arbitrary type and output to a back_insert_iterator; presumably this is an icu limitation.

3. The error handling seems to be basic - it's "all or nothing". Considering from_utf, since this is the direction in which more things can be unrepresentable, I might want to:

- skip the character;
- replace it with '?';
- replace it with an unaccented version, or e.g. convert curly-quotes to plain-quotes;
- throw an exception that aborts the whole operation;
- do the conversion up to the point of difficulty and then indicate the problem.

4. There's also the case where the input is chunked, and the chunk-breaks might fall within multi-byte characters; individual chunks are then potentially malformed while the overall stream is OK. I think the best way to handle this is with a stateful functor to do the conversion; I wrote one using iconv, like this:

    Iconver<latin1,utf8> latin1_to_utf8;
    while (!end_of_input()) {
      output.append( latin1_to_utf8(input.get_chunk()) );
    }
    assert(latin1_to_utf8.empty()); // detect unconverted partial chars at end of input

Also: how is the performance of this conversion? If it is being done by icu, we should have a hook (i.e. a specialisation, hence my suggestion of template parameters for the character sets) so that we can provide optimised versions of important conversions.

Here are my answers to the "normal questions":

- What is your evaluation of the design? Generally it seems like a reasonable design for an icu wrapper. If it were not an icu wrapper, no doubt some things could be done better.

- What is your evaluation of the implementation? I've looked at very little of the source. I decided to look at iconv_codepage.hpp since I have some experience with iconv.
I note that a union is being used in what I believe is an undefined way:

    union {
      char first;
      uint16_t u16;
      uint32_t u32;
    } v;
    v.u16 = 1;
    if(v.first == 1) {
      return "UTF-16LE";
    } else {
      return "UTF-16BE";
    }

In my own iconv wrapper code I need a const_cast because the POSIX spec misses a const on the second argument to iconv(); some platforms correct this, but not all:

    #if defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__NetBSD__) || defined(__sun__)
    // Previously __APPLE__ was included in this list; presumably they have
    // changed their headers. If you have an older system you may need to put
    // it back.
    #define ICONV_ARG2_IS_CONST
    #endif

This doesn't seem to be done here.

There is also some use of reinterpret_cast which I think should be a static_cast via void*:

    virtual std::string convert(char_type const *ubegin,char_type const *uend) {
      ....
      char const *begin = reinterpret_cast<char const *>(ubegin);

"int err" should probably be "errno_t err". There is also a "magic constant" of 10 to allow for the difference in size between the input and the output. Some comments would help; for example, to describe the meaning of the return values and error codes from iconv().

Based on that limited look, my view is that the code probably works, but it could be improved in many ways.

- What is your evaluation of the documentation? Good enough.

- What is your evaluation of the potential usefulness of the library? Useful and overdue.

- Did you try to use the library? With what compiler? Did you have any problems? No, I've not compiled anything.

- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study? 3 hours reading the docs and looking at a tiny bit of the implementation.

- Are you knowledgeable about the problem domain? Not beyond the areas that I have discussed above.
Please explicitly state in your review whether the library should be accepted.
It's borderline. I was going to say "yes, but please consider updating at least the charset conversion functions in the way that I proposed". But then I looked at the source and noted what I think are well-known issues, like the abuse of union and reinterpret_cast, which have lowered my opinion a bit. And I've since seen other reviews that have looked at the areas I've not considered (i.e. message translation), and they have been quite critical. Based on all of that, my view has shifted to "no, but this is worthwhile functionality, so please come back when you've resolved some of these concerns".

Regards, Phil.