
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
Let us start with my decision: I vote against inclusion of Boost.Locale into Boost in its current state, but strongly encourage the author to reconsider resubmitting it after some design issues have been reconsidered. In which case, I would be in favour of putting it on the fast-track for a mini-review.
Can you please state explicitly which issues are conditions for the library to reach a mini-review? There are many points below, but it is not always clear whether you consider them critical. I assume not all of them can or will be solved, but I would rather see more specific notes like (a) missing XYZ interface for function ABC. It would help both me and the review manager.
Since I am the author of another Unicode library that I'd like to see included in Boost, I hope that my review will not be considered biased. I believe Boost.Locale is a valuable and useful library, and I believe both libraries can coexist.
I can only say that I'm looking forward to Boost.Unicode, and I know that the way you look at interfaces is very different. Boost.Locale is built from a very practical point of view, with usability first. I don't try to solve every possible corner case, especially when it is not a central part of the localization task. I'd rather spend more time adding specific missing features useful for localization than provide things like iterator-based interfaces for operations that are usually executed on small chunks of text. The localization topic is almost endless, and I have different priorities that are linguistic and localization centric.
The main reason for my decision is the interface that the library provides: although it aims at abstracting the backend used to implement the facilities, it exposes too much of how those facilities are implemented, which prevents benefiting from future backends that may not have the same limitations as the current ones.
Indeed, the Boost.Locale rationale clearly says that it only takes std::basic_string and const char* as input because that's what the backend libraries expect anyway.
Not only that; there are more reasons for this: http://cppcms.sourceforge.net/boost_locale/html/appendix.html#why_linear_chu...
Yet there are various scenarios where this is useful:
- I might want to apply some operation to data in a file without loading it all into memory first; Boost.Locale could copy that data in chunks to apply the operation for backends that have no native support for this. Copying the whole file into memory is a good enough approximation for a first version.
- The input I have may not be directly available in the form of a contiguous buffer, but I wouldn't mind if Boost.Locale copied it.
The point is that, apart from Boost.Unicode, I don't see any candidate backend now or in the near future that would require such things. Except for charset conversion (which already supports streaming via codecvt; see the example in the tutorial), all Unicode/localization APIs known to me work on a single chunk of memory. I don't see myself implementing it any other way. When Boost.Unicode is ready, I'm sure its users will be glad to use its own native interface where it is needed. So do we need a more generic API that would never be used? I really doubt it.
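For reference, a minimal sketch of the codecvt-based streaming path mentioned above, assuming the boost::locale::generator interface described in the tutorial (the file name and locale name are just placeholders):

    #include <boost/locale.hpp>
    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        // Build a locale whose codecvt facet converts the narrow UTF-8 byte
        // stream of the file into the wide characters used by the stream.
        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");

        std::wifstream file;
        file.imbue(loc);          // imbue before opening so conversion starts at byte 0
        file.open("input.txt");   // placeholder file name

        std::wstring line;
        while (std::getline(file, line))   // conversion happens on the fly, buffer by buffer
            std::wcout << line << L"\n";
    }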
If we consider the functions from_utf, to_utf and between, they can take either a const char*, two const char*, or a std::basic_string. Clearly it would be better if those functions took a range, which covers all three of these cases. It's not just a matter of being generic; it makes the library easier to use, since the range is defined by a single argument which can be anything I want, and I do not need to convert to what the library expects. If the library needs to do a conversion behind the scenes, it can do it just like that, behind the scenes.
Actually, it is a convenience API.
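For context, this is what the current convenience overloads look like in use (a sketch based on the documented boost::locale::conv functions; the strings and charset names are just examples):

    #include <boost/locale.hpp>
    #include <string>

    int main()
    {
        using namespace boost::locale::conv;

        std::string latin1 = "Gr\xFC\xDF Gott";                 // "Grüß Gott" in ISO-8859-1

        std::string  utf8 = to_utf<char>(latin1, "Latin1");     // narrow charset -> UTF-8
        std::wstring wide = to_utf<wchar_t>(latin1, "Latin1");  // narrow charset -> UTF-16/32
        std::string  back = from_utf(utf8, "Latin1");           // UTF-8 -> narrow charset
        std::string  koi8 = between(utf8, "KOI8-R", "UTF-8");   // charset -> charset
    }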
Taking a range, however, raises the question of what the return type should be: I think there should be two variants, one which returns a string, and one which appends to a container (like Boost.StringAlgo) or writes to an output iterator (like STL algorithms). Obviously, the former would be defined in terms of the latter.
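A rough sketch of the kind of interface being proposed here; these signatures are hypothetical and are not part of Boost.Locale:

    #include <iterator>
    #include <string>

    namespace proposed {

    // Hypothetical: convert any character range to UTF, writing to an output
    // iterator. Declaration only; shown here just to illustrate the shape.
    template <typename CharOut, typename Range, typename OutputIterator>
    OutputIterator to_utf(Range const& input, std::string const& charset, OutputIterator out);

    // Hypothetical convenience wrapper, defined in terms of the output-iterator variant.
    template <typename CharOut, typename Range>
    std::basic_string<CharOut> to_utf(Range const& input, std::string const& charset)
    {
        std::basic_string<CharOut> result;
        to_utf<CharOut>(input, charset, std::back_inserter(result));
        return result;
    }

    } // namespace proposed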
This, however, causes a big problem: the backends would no longer be interchangeable through a single stable ABI. Maybe some template specializations could be injected at link time, but they would have to be full concrete specializations.
To be honest, I don't think that iterators, especially over non-UTF charsets, are a good idea. I would rather adopt the concept of a stream that you read from and write to:
1. It is buffered.
2. It is fast and can work with big chunks.
3. It can be passed transparently, and in an ABI-stable way, to any backend.
So I would rather see generic algorithms implemented over I/O streams than over iterators.
Maybe the idea of backends being interchangeable at the ABI level needs to be rethought.
After all, isn't being interchangeable at the API level enough, since you can easily mangle the configuration into a symbol to cause an error when linking against a non-matching backend?
It is actually a central part of the library design, and I think it has had a very good effect on it. I work with localization, actually use the library for localization purposes, and know how valuable it is to have a generic, stable, backend-independent API with different backends, especially when ICU may be a heavy dependency. Without it I wouldn't have developed the library in the first place.
Still talking about conversion between encodings, I would like it if it were possible to statically choose the encodings we're converting between, as some backends could directly expose a specialized and optimized some_encoding_to_some_other_encoding function.
Actually, almost all backends are optimized for UTF-8 where possible. So without bringing charset tables into Boost.Locale, I can see only very few specializations that could be used:
1. UTF-XYZ to/from UTF-ABC
2. Latin1/ISO-8859-1 to/from UTF-XYZ
But I think Boost.Unicode would do this better, as it specializes in handling UTF-XYZ. Note that you will almost never see any notion of code points in the Boost.Locale API, because a code point is meaningless in the context of localization. It is too low-level an abstraction, one that I actually prefer developers not even think in terms of. I don't try to implement tasks that a UTF-8/16/32 library should do.
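For what it's worth, the first of these cases already exists in a compile-time-typed form; a sketch assuming the utf_to_utf helper as documented for boost::locale::conv:

    #include <boost/locale.hpp>
    #include <string>

    int main()
    {
        using namespace boost::locale::conv;

        // Both the source and target UTF encodings are fixed at compile time
        // by the character types, so no runtime charset name is needed.
        std::string  utf8  = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82"; // "Привет" in UTF-8
        std::wstring wide  = utf_to_utf<wchar_t>(utf8);   // UTF-8  -> UTF-16/32
        std::string  again = utf_to_utf<char>(wide);      // UTF-16/32 -> UTF-8
    }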
Note that the remarks about conversion between encodings also apply to case folding, collation, etc. Being able to statically specify the normalization form you're normalizing to, and things like that, could be useful just as it is for character encodings.
I know this (we talked about it before). However, I don't approach it that way, and neither do almost all Unicode APIs out there. Such low-level optimizations are good, but IMHO a corner case; they belong in a lower-level library.
Perhaps it is not made clear in the docs, but Boost.Locale embeds a second mechanism, besides the functions in the conv namespace, to convert between character encodings: the classes derived from boost::locale::util::base_converter. This is the only interface that can be passed incomplete data, as obtained through buffering. That interface was designed to implement codecvt facets,
Exactly, and this is its purpose.
and unfortunately, it's not much better. The library implements its own UTF-8/UTF-16/UTF-32 conversion logic there regardless of which backend is used; it is a shame it cannot reuse the to_utf or from_utf functions in all cases. This implementation is also quite tricky since the state management prevents the code from being simple and self-contained.
Actually, the code does reuse it in many places, but there is a small point: unlike normal to_utf/from_utf calls, the facets know their encoding up front and can do things faster; for example, the ICU backend uses this knowledge to perform some tasks better.
That interface does not allow optimized conversion functions to be used. Compatibility with codecvt facets is a nice feature, but it shouldn't impair the interface of the component that could have other usages.
It is not just a nice feature; it is the feature that allows Boost.Filesystem and Boost.ProgramOptions to be used correctly with UTF-8 or other encodings on Windows, a platform with a total encoding mess. Using Boost.Locale and a UTF-8 codecvt facet opens the door for all other libraries to adopt it. What is important is that std::codecvt is part of the C++ standard, so any library that uses this standard facility can enjoy the full power of Boost.Locale without even depending on it.
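A minimal sketch of that use case, assuming Boost.Filesystem v3's path::imbue and the boost::locale::generator (the file name is just an example):

    #include <boost/locale.hpp>
    #include <boost/filesystem.hpp>
    #include <iostream>

    int main()
    {
        // Install a locale whose codecvt facet converts between narrow UTF-8
        // strings and the native wide encoding used by the OS APIs.
        boost::locale::generator gen;
        boost::filesystem::path::imbue(gen("en_US.UTF-8"));

        // From here on, narrow strings passed to Boost.Filesystem are treated
        // as UTF-8, so non-ASCII file names also work on Windows.
        boost::filesystem::path p = "caf\xC3\xA9.txt";   // "café.txt" encoded as UTF-8
        std::cout << boost::filesystem::absolute(p).string() << std::endl;
    }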
Furthermore, I do not find the boundary analysis interface satisfying at all. It is not made clear in the docs that doing boundary analysis actually goes through the whole string and returns a vector of positions corresponding to the boundaries. You can use that vector to adapt any range into a range of strings. Each dereferencing copies the subrange defined by the vector into a new temporary string. This is pretty terrible. I want to be able to find the boundaries as I advance through a string, or be able to find the closest boundary from an arbitrary position without having to process the whole string. I do not want to have to copy my whole input range into little strings for no reason.
I may agree that it is not perfect, but given the limitations of the current state-of-the-art implementation in ICU, it is still a reasonable one. AFAIK Boost.Unicode implements its own interface for text segmentation, but without ICU's custom locale rules and dictionaries, which are important for text segmentation in some languages.
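For orientation, this is roughly how the boundary analysis is driven (a sketch based on the documented ssegment_index/word names; the exact names in the reviewed snapshot may differ):

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    int main()
    {
        namespace ba = boost::locale::boundary;

        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");

        std::string const text = "Hello, World! How are you?";

        // Build an index of word boundaries over the whole string, then
        // iterate over the resulting segments; each dereference yields a
        // copy of the corresponding piece of text.
        ba::ssegment_index words(ba::word, text.begin(), text.end(), loc);
        for (ba::ssegment_index::iterator it = words.begin(); it != words.end(); ++it)
            std::cout << "\"" << *it << "\"\n";
    }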
Usage of this feature is also unnecessarily verbose. It should not take more than a function call to turn a range of characters into a range of tagged ranges of characters. This part of the library is the one that needs the most work. Ironically enough, it's also the only one that deals a bit with iterators and ranges, and I just asked for more iterators and ranges at the beginning of this review ;).
Finally, I will not comment on the implementation too much, since I think other Boosters already did that (though I did not read all of it). There is some unnecessary or non-optimal code
Can you point it out? There was some duplication, but non-optimal code?
some code more complicated than it should be
Can you be more specific?
some compile-time values tested at runtime, some things that could have reused existing utilities from Boost,
More specific points?
and some inappropriate use of reinterpret_cast.
In two places, and it is going to be fixed.
While most of the backends are separated, it is not the case for encoding/codepage.cpp, which is shared by all backends and contains ifdefs. This is not a very nice extension point.
Actually, it is not only codepage handling that is shared:
- Formatting (locale::format)
- Locale information
- Message formatting (gettext)
- codecvt facets (not the details)
And more... So there are lots of shared parts. Now, unlike most locale-related operations, codepage handling is a basic facility that is used all over the code. So I don't really understand how you think it should actually look.
I'm also a bit concerned about the generalized usage of virtual functions everywhere; it seems unnecessary in quite a few places. Please prefer templates when dynamic binding is not required.
More specific points?
There is a lot of new and vectors too; ideally, I'd prefer the library never allocated anything.
I'm sorry, but that is just something that can't and won't happen. This request has no reasonable basis, especially for such a complex topic.
I hope my review will be well received and that you will consider all of my "feature requests". Your library is practical and can get things done for sure, but maybe it's not quite as good as it could be in terms of interface yet.
The interface is part of the design, and it was designed to be simple and convenient rather than fully generic. But I understand that we have very different points of view.
Also, if I want Boost.Unicode to ever be a possible backend, Boost.Locale needs to be able to make the most of it ;).
Actually, I was thinking about it. Boost.Locale and Boost.Unicode are very different:
- Boost.Locale is locale centric.
- Boost.Unicode is locale agnostic.
A few points where Boost.Unicode could actually serve Boost.Locale:
- Normalization. But I assume it would be better to use Boost.Unicode directly, as it is a truly locale-independent procedure; it is part of Boost.Locale for convenience.
- Collation, if it is parametrized so it can be changed for each locale separately.
- Boundary analysis, for a subset of languages: not Japanese, not Thai, and some others, at least for how Boost.Unicode looks at this point.
I rather prefer to look at Boost.Unicode as a library orthogonal to Boost.Locale, as it does very different things. It can't be Boost.Locale's backend because, at least in its current state, it is locale agnostic and misses the most important thing localization actually requires: CLDR, the Common Locale Data Repository.

Thank you for the review.

Artyom