
From: Phil Endecott <spam_from_boost_dev@chezphil.org>
As others have said, a similar introduction to ICU, perhaps even deep links to appropriate parts of its documentation, and perhaps also to Unicode docs, would be useful.
The point is that Boost.Locale uses ICU as one of its back-ends (even though it is the primary and the best one), so I actually prefer to decouple Boost.Locale from ICU at some level, so that users see that Boost.Locale is not just a convenience wrapper around ICU but more than that. Also, Unicode is a very, very large topic, and a Google search for ICU and Unicode will basically give you the best links...
I found that it was not sufficiently easy to find which header each function is in, e.g. on the charset_handling.html page the name of the header is not mentioned; if I click on e.g. from_utf I get to a reference page that also doesn't name the header.
Good point. I'll add it. I'm a little surprised about the missing header names in the reference docs, because that is something Doxygen usually provides.
Use of strings rather than symbols: there are a few places where the library uses strings such as "UTF-8" or "de_DE.UTF8" rather than symbols like utf_8 or de_DE_UTF8. This use of strings turns what could be compile-time errors into run-time errors. Mis-typing e.g. "UTF_8" or "de-DE" is too easy. Perhaps in some cases the domain of the parameter is not known at compile time, but even in those cases it should be possible to provide symbols for the most common choices.
What most common choices? There are a few dozen different character encodings and even more locales - and what counts as common? Also, not all encodings are supported by all backends: for example, iconv would not handle some Windows codepages, and MultiByteTo.. would not handle some other encodings. As for locales: is de_DE.UTF-8 common? Is he_IL.UTF-8 common? Is zh_CN.UTF-8 common? The level of support by different backends may also depend on the actual OS configuration - if some locale is not configured on Windows or Linux, the non-ICU backends fall back to the C/POSIX locale. So should there be hard-coded constants for locales and encodings? I doubt it.

However, I do think a conversion like the following may be useful (a fuller sketch follows below):

std::string utf8_str = utf_to_utf<char>(L"Hello"); // UTF-16/UTF-32 to UTF-8

This was raised by other reviewers as well, and I am actually considering adding something like that to the code.

A small note about mistyping encoding names: they are compared by letters and digits only, and all other symbols are ignored, so from the Boost.Locale and ICU point of view UTF-8, UTF8, Utf_8 and UTF#$%^#$8 are all the same.
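For illustration, a complete program using the proposed function might look like this (a sketch only; utf_to_utf and its header location are the proposed addition, not a current part of the library):

#include <boost/locale/encoding_utf.hpp> // assumed location of the proposed function
#include <string>

int main() {
    using boost::locale::conv::utf_to_utf;
    // Convert between UTF encodings based on the character width alone,
    // with no locale or charset-name lookup involved:
    std::string utf8 = utf_to_utf<char>(L"Hello");    // UTF-16/UTF-32 to UTF-8
    std::wstring wide = utf_to_utf<wchar_t>("Hello"); // UTF-8 to UTF-16/UTF-32
    return (utf8 == "Hello" && wide == L"Hello") ? 0 : 1;
}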
Collation: I would be interested to know how fast these comparisons are compared to regular std::string::operator<. If I correctly understand what transform does (more docs needed), it would be interesting to then compare the performance of std::map<...,comparator> with a std::map<> storing transformed strings. Or similarly, to compare std::sort using the comparator with std::sort of a pair<transformed string, original string>.
Transform is something very common for collation purposes; every collation API provides it (ICU, POSIX, C and Win32 API). I haven't done any specific benchmarks, but it is much faster, as collation normally involves Unicode normalization and then lookups in a character database for the specific locale. I doubt that std::sort of transformed strings would be much faster, as the transform costs as well. But if you frequently search for some string in a database, it has a significant performance impact: you transform the string only once and then just run strcmp, which is much faster.
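To make the transform idea concrete, here is a minimal sketch using the library's collator facet (names as in the reference docs; the locale and strings are arbitrary examples):

#include <boost/locale.hpp>
#include <map>
#include <string>

int main() {
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");
    boost::locale::collator<char> const &col =
        std::use_facet<boost::locale::collator<char> >(loc);

    // Transform each key once; the resulting sort keys order correctly
    // under plain byte-wise comparison (strcmp / std::string::operator<),
    // so repeated lookups skip the full collation algorithm.
    std::map<std::string, std::string> index; // sort key -> original string
    std::string name = "na\xC3\xAFve"; // "naïve" in UTF-8
    index[col.transform(boost::locale::collator_base::primary, name)] = name;
    return 0;
}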
Ideally these collation functions would take character ranges of arbitrary types, however I guess that the icu API takes char* making this unreasonable. Apparently there is no option for a NULL-terminated char*, unlike the case conversion functions, right?
Good point. I'll take a look at this.
Case conversions: I don't know what "title case" is or what "fold case" means; the docs should define this. At least linking to the definitions of the Unicode normalisation forms would also be useful; preferably, some explanation of how they should be used would be included.
Noted. Good point.
I can think of two recent case-conversion problems that I have had, both related to geographic names:
1. I had a list of North American names, in "UPPER CASE". I decided to convert them to "Initial Capitals" for better appearance, but later realised that this was broken for accented names in Quebec. I could imagine some sort of "case conversion is lossy" property of a locale that would detect this; does anything like that exist? (I guess not.)
Actually, any case conversion is "lossy" because it does not know the original case. The good news, however, is that most code points in the Unicode database have no case at all. Roughly speaking, only European languages actually have case; Semitic, East Asian and many other scripts have no case at all. And BTW, "Initial Capitals" is actually Title Case.
2. In a similar case I found that " 's " was being wrongly capitalised, e.g. "St. John'S Street". (This was with PostgreSQL's case conversion functions.) Again it would be useful to know whether (a) does the library try to handle this sort of thing correctly, and (b) can it tell me whether the operation that I want to do is lossy or not? For example, say I have a set of place names that are a mixture of en_US and en_CA and fr_CA but I don't know which; can I perform a conversion that works because those locales all have the same rules for e.g. ' and -? (I guess not.)
Case handling is actually a Unicode algorithm, and it behaves the same for almost all locales but a few: Azeri and Turkish, where "i" converts to "İ" (I with a dot) and "I" converts to "ı" (i without a dot), and AFAIR there are one or two other locale-dependent corner cases. In any case, case handling is always lossy, though it depends on how you define it. To answer (a): yes, it handles this sort of thing correctly using the ICU backend. If you ask whether it does everything right, then no: different languages may have different capitalization rules, but as far as I can see the Unicode algorithm and the current ICU library ignore them. Still, it is the best we have.
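The Turkish behaviour is easy to demonstrate with the case-conversion functions (a sketch; it assumes the tr_TR locale is available to the selected backend):

#include <boost/locale.hpp>
#include <iostream>

int main() {
    boost::locale::generator gen;
    std::locale tr = gen("tr_TR.UTF-8");
    // Under a Turkish locale "i" upper-cases to U+0130 (dotted capital I),
    // unlike English where it becomes a plain "I".
    std::cout << boost::locale::to_upper("istanbul", tr) << "\n";
    // Title case, i.e. "Initial Capitals":
    std::cout << boost::locale::to_title("st. john's street", tr) << "\n";
    return 0;
}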
Character set conversion: I think it's great that Boost might finally get character set conversion - it's overdue! However, I'm not enthusiastic about this:
std::string utf8_string = to_utf<char>(latin1_string,"Latin1");
1. It should be possible for the source character set to be a compile-time constant symbol, and a template parameter e.g. conv<latin1,utf8>(str). As it is, if I want to convert just a few bytes, it will surely spend more time decoding the charset name string than actually converting.
See the notes above about constants. Also, Boost.Locale does not use its own conversion tables; it relies on 3rd-party APIs like ICU, iconv or the Win32 API, so one way or another you will spend some time calling ucnv_open, iconv_open or something else. If you want to make stream conversions, or convert a few characters at a time, you can use the codecvt API; there is actually an example in the docs about using codecvt.
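The codecvt route looks roughly like this (a sketch based on the docs' approach; the generated locale installs the conversion facets once, so there is no per-call charset lookup):

#include <boost/locale.hpp>
#include <fstream>

int main() {
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");
    // The std::codecvt facet installed by the generator converts between
    // wide characters and the locale's narrow encoding on the stream.
    std::wofstream out("hello.txt");
    out.imbue(loc);
    out << L"Hello"; // written as UTF-8 on disk
    return 0;
}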
2. As above, ideally it should take a range of arbitrary type and output to a back_insert_iterator; presumably this is an icu limitation.
Actually, an iterator is a bad idiom for charset conversion; it is more an operation on a stream, and this is what codecvt is usable for. Note that not all APIs support stream operations, most noticeably the Win32 MultiByte... API and iconv. I had a very hard time supporting stream conversions, as the APIs weren't too friendly in reality. Codepage conversion is generally much more stateful than most of us think; for example, iconv may compose several characters into a single code point.
3. The error handling seems to be basic - it's "all or nothing". Considering from_utf, since this is the direction where there are more things that can be unrepresentable, I might want to:
- Skip the character;
- Replace with '?';
- Replace with an unaccented version, or e.g. curly-quotes to plain-quotes;
- Throw an exception that aborts the whole operation;
- Do the conversion up to the point of difficulty and then indicate the problem.
A few points: Skip and Throw are supported (see the sketch after these replies).
- Replace with an unaccented version, or e.g. curly-quotes to plain-quotes;
I'm not aware of any charset API that actually supports such things.
- Replace with '?';
I think I noted this before: it is problematic with the non-ICU backends, as not all of them support such operations, and more than that, some do not even provide a way to know where the operation failed (Win32 API).
- Do the conversion up to the point of difficulty and then indicate the problem.
This is a problem as well, as not all APIs provide this.
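As mentioned, Skip and Throw map onto the conversion functions' method argument; a minimal sketch using the documented skip and stop flags (the input string is an arbitrary example):

#include <boost/locale/encoding.hpp>
#include <string>

int main() {
    using namespace boost::locale::conv;
    std::string utf8 = "\xE2\x80\x9Cquoted\xE2\x80\x9D"; // curly quotes U+201C/U+201D
    // skip: silently drop characters that Latin1 cannot represent
    std::string a = from_utf(utf8, "Latin1", skip);
    // stop: throw conversion_error at the first unrepresentable character
    try {
        std::string b = from_utf(utf8, "Latin1", stop);
    } catch (conversion_error const &) {
        // conversion aborted
    }
    return 0;
}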
4. There's also the case where the input is chunked, and the chunk-breaks might fall within multi-byte characters; so individual chunks are potentially malformed while the overall stream is OK. I think the best way to handle this is with a stateful functor to do the conversion; I wrote one using iconv, like this:
Iconver<latin1,utf8> latin1_to_utf8;
while (!end_of_input()) {
    output.append( latin1_to_utf8(input.get_chunk()) );
}
assert(latin1_to_utf8.empty()); // detect unconverted partial chars at end of input
There is a codecvt for this, as I mentioned above. Also, codepage conversion is not the primary goal of Boost.Locale but rather a utility task.
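For the chunked case specifically, the codecvt facet carries its state across calls, so a chunk boundary inside a multi-byte character is not an error (a sketch; buffer management omitted):

#include <boost/locale.hpp>
#include <locale>

int main() {
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");
    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;
    cvt_type const &cvt = std::use_facet<cvt_type>(loc);

    std::mbstate_t state = std::mbstate_t();
    char chunk[] = "gr\xC3"; // ends part-way through a UTF-8 sequence
    wchar_t wide[8];
    char const *from_next = 0;
    wchar_t *to_next = 0;
    // Returns std::codecvt_base::partial here; keep `state` and the bytes
    // from `from_next` onward and prepend them to the next chunk.
    cvt.in(state, chunk, chunk + sizeof(chunk) - 1, from_next,
           wide, wide + 8, to_next);
    return 0;
}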
Also: how is the performance of this conversion? If it is being done by icu, we should have a hook (i.e. specialisation, hence my suggestion of template parameters for the character sets) so that we can provide optimised versions of important conversions.
I have found that iconv and ICU perform similarly. The performance of codecvt for UTF-8 and single-byte character sets is very good; for multi-byte encodings like Shift-JIS it is not so good, but that is rather a limitation of the std::codecvt design.
I note that a union is being used in what I believe is an undefined way:
union {
    char first;
    uint16_t u16;
    uint32_t u32;
} v;
v.u16 = 1;
if(v.first == 1) {
    return "UTF-16LE";
} else {
    return "UTF-16BE";
}
No problems there (see Steven's response).
In my own iconv wrapper code I need a const_cast because the POSIX spec misses a const on the second argument to iconv(); some platforms correct this, but not all:
Actually, POSIX specifies restrict, which is a little bit stronger than const.
#if defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__NetBSD__) || defined(__sun__)
// Previously __APPLE__ was included in this list; presumably they have
// changed their headers. If you have an older system you may need to put
// it back.
#define ICONV_ARG2_IS_CONST
#endif
It is wrong, as it may depend on the version of iconv... Boost.Locale does it in a smarter way; see Steven's response.
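As far as I recall, the trick is an overload on the function-pointer type, so the correct signature is picked at compile time without a version-specific #ifdef (an illustrative sketch, not the library's exact code):

#include <iconv.h>
#include <cstddef>

// GNU flavour: second argument is char const **
inline size_t call_iconv(size_t (*ptr)(iconv_t, char const **, size_t *,
                                       char **, size_t *),
                         iconv_t d, char **in, size_t *insize,
                         char **out, size_t *outsize)
{
    return ptr(d, const_cast<char const **>(in), insize, out, outsize);
}

// POSIX flavour: second argument is char **
inline size_t call_iconv(size_t (*ptr)(iconv_t, char **, size_t *,
                                       char **, size_t *),
                         iconv_t d, char **in, size_t *insize,
                         char **out, size_t *outsize)
{
    return ptr(d, in, insize, out, outsize);
}

// Usage: call_iconv(iconv, cd, &inbuf, &inleft, &outbuf, &outleft);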
This doesn't seem to be done here.
There seems to be some use of reinterpret_cast which I think should be a static_cast via void*:
virtual std::string convert(char_type const *ubegin, char_type const *uend)
{
    ....
    char const *begin = reinterpret_cast<char const *>(ubegin);
No problems there as well; see Steven's response.
"int err" should probably be "errno_t err".
No, see: http://pubs.opengroup.org/onlinepubs/009695399/functions/errno.html - errno is defined as int by the POSIX specification.
There seems to be a "magic constant" of 10 to allow for the difference in size between the input and the output.
So? It is just a guess at how much to extend the buffer; it has no other meaning.
Some comments would help. For example, to describe the meaning of the return values and error codes from iconv().
man iconv? Also, EINVAL (invalid value), E2BIG (too big) and EILSEQ (illegal sequence) are quite self-explanatory, at least for those who are familiar with Unix programming.
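For completeness, the usual interpretation of those codes in an iconv() loop (a sketch of the protocol, not library code; note that the constness of the second argument varies by platform, as discussed above):

#include <iconv.h>
#include <cerrno>
#include <cstddef>

// Returns true when the chunk was fully consumed, or when it merely ends
// inside a multi-byte character (EINVAL); false on a real error.
bool convert_chunk(iconv_t cd, char **in, size_t *inleft,
                   char **out, size_t *outleft)
{
    if (iconv(cd, in, inleft, out, outleft) != (size_t)-1)
        return true;
    switch (errno) {
    case E2BIG:  return false; // output buffer full: grow it and retry
    case EILSEQ: return false; // illegal byte sequence in the input
    case EINVAL: return true;  // incomplete character at the end of the
                               // chunk: keep the tail for the next call
    default:     return false;
    }
}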
- What is your evaluation of the documentation?
Good enough.
- What is your evaluation of the potential usefulness of the library?
Useful and overdue.
- Did you try to use the library? With what compiler? Did you have any problems?
No, I've not compiled anything.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
3 hours reading the docs and looking at a tiny bit of the implementation.
- Are you knowledgeable about the problem domain?
Not beyond the areas that I have discussed above.
Please explicitly state in your review whether the library should be accepted.
It's borderline. I was going to say "yes, but please consider updating at least the charset conversion functions in the way that I proposed".
Some things, like the utf-to-utf conversion, will be added; however, I have doubts about the substitution options for the reasons I explained above.
But then I looked at the source and noted what I think are well-known issues like abuse of union and reinterpret_cast, which have lowered my opinion a bit.
There was nothing wrong; Steven has answered most of the points.
Regards, Phil.
Thanks for the review. Artyom