
Let us start with my decision: I vote against inclusion of Boost.Locale into Boost in its current state, but I strongly encourage the author to resubmit it once the design issues below have been reconsidered, in which case I would be in favour of putting it on the fast track for a mini-review.

Since I am the author of another Unicode library that I'd like to see included in Boost, I hope that my review will not be considered biased. I believe Boost.Locale is a valuable and useful library, and I believe both libraries can coexist.

The main reason for my decision is the interface the library provides: although it aims at abstracting the backend used to implement its facilities, it exposes too much of how those facilities are implemented, which prevents benefiting from future backends that may not have the same limitations as the current ones. Indeed, the Boost.Locale rationale plainly states that it only takes std::basic_string and const char* as input because that is what the backend libraries expect anyway. Yet there are various scenarios where more flexibility would be useful:

- I might want to apply some operation to data in a file without loading it all into memory first; for backends with no native support for this, Boost.Locale could copy that data in chunks and apply the operation chunk by chunk (copying the whole file to memory would be a good enough approximation for a first version).
- The input I have may not be directly available as a contiguous buffer, but I wouldn't mind if Boost.Locale copied it.

Consider the functions from_utf, to_utf and between: each can take either a const char*, a pair of const char*, or a std::basic_string. Clearly, these functions would be better off taking a range, which covers all three cases. It's not just a matter of being generic; it makes the library easier to use, since the range is defined by a single argument that can be anything I want, and I do not need to convert my data to whatever the library expects.
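To make the suggestion concrete, here is a minimal sketch of how a range-based front end could be layered over the pointer-based primitives the current backends expose. Everything here is hypothetical: convert_impl stands in for a backend entry point, and the "conversion" is a placeholder rather than real transcoding.

```cpp
#include <string>
#include <vector>
#include <iterator>
#include <cassert>

// Hypothetical pointer-based primitive, standing in for the kind of
// interface the current backends expose. A real backend would
// transcode here; this placeholder just copies.
std::string convert_impl(const char* begin, const char* end)
{
    return std::string(begin, end);
}

// Range-based wrapper: accepts any range of char. If the input is not
// a contiguous buffer, it is copied into one first, behind the
// scenes, so the caller never has to care what the backend expects.
template<typename Range>
std::string convert(const Range& input)
{
    std::vector<char> buffer(std::begin(input), std::end(input));
    return buffer.empty() ? std::string()
                          : convert_impl(buffer.data(),
                                         buffer.data() + buffer.size());
}
```

With such a wrapper, a std::string, a std::vector<char>, a std::list<char> or any other char range works with the same single-argument call.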
If the library needs to do a conversion behind the scenes, it can do just that: behind the scenes.

Taking a range does, however, raise the question of what the return type should be. I think there should be two variants: one that returns a string, and one that appends to a container (like Boost.StringAlgo) or writes to an output iterator (like the STL algorithms). Obviously, the former would be defined in terms of the latter. This, however, causes a big problem: the backends would no longer be interchangeable behind a stable, unique ABI. Maybe some template specializations could be injected at link time, but they would have to be full concrete specializations. Maybe the idea of backends being interchangeable at the ABI level needs to be rethought; after all, isn't being interchangeable at the API level enough, since you can easily mangle the configuration into a symbol so as to cause an error when linking with a non-matching backend?

Still on the subject of conversion between encodings, I would like it to be possible to choose statically which encodings are being converted between, as some backends could directly expose a specialized and optimized some_encoding_to_some_other_encoding function. Note that these remarks about conversion between encodings also apply to case folding, collation, etc. Being able to statically specify the normalization form you're normalizing to, and the like, could be just as useful as it is for character encodings.

Perhaps this is not made clear by the docs, but Boost.Locale embeds a second mechanism, distinct from the functions in the conv namespace, to convert between character encodings: the classes derived from boost::locale::util::base_converter. This is the only interface that can be passed incomplete data, as obtained through buffering. That interface was designed to implement codecvt facets, and unfortunately it is not much better.
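The two-variant pattern described above can be sketched as follows. The name normalize and the pass-through "transformation" are illustrative only; the point is purely the shape of the interface, with the string-returning variant defined in terms of the output-iterator one.

```cpp
#include <string>
#include <iterator>
#include <algorithm>
#include <cassert>

// Output-iterator variant: writes the transformed data to any output
// iterator. A real implementation would apply a Unicode operation;
// this stub simply copies.
template<typename Range, typename OutputIterator>
OutputIterator normalize(const Range& input, OutputIterator out)
{
    return std::copy(std::begin(input), std::end(input), out);
}

// String-returning convenience variant, defined in terms of the
// output-iterator variant, as suggested in the review.
template<typename Range>
std::string normalize(const Range& input)
{
    std::string result;
    normalize(input, std::back_inserter(result));
    return result;
}
```

Appending to an existing container then falls out for free via std::back_inserter, in the style of Boost.StringAlgo and the STL algorithms.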
The library implements its own UTF-8/UTF-16/UTF-32 conversion logic there, regardless of which backend is used; it is a shame it cannot reuse the to_utf or from_utf functions in all cases. This implementation is also quite tricky, since the state management prevents the code from being simple and self-contained, and the interface does not allow optimized conversion functions to be used. Compatibility with codecvt facets is a nice feature, but it shouldn't impair the interface of a component that could have other uses.

Furthermore, I do not find the boundary analysis interface satisfying at all. The docs do not make it clear that performing boundary analysis actually walks the whole string and returns a vector of positions corresponding to the boundaries. You can use that vector to adapt any range into a range of strings, where each dereference copies the subrange defined by the vector into a new temporary string. This is pretty terrible. I want to be able to find boundaries as I advance through a string, or to find the boundary closest to an arbitrary position, without having to process the whole string. I do not want to have to copy my whole input range into little strings for no reason. Usage of this feature is also unnecessarily verbose: turning a range of characters into a range of tagged ranges of characters should require no more than a function call. This part of the library is the one that needs the most work. Ironically enough, it's also the only one that deals a bit with iterators and ranges, and I just asked for more iterators and ranges at the beginning of this review ;).

Finally, I will not comment on the implementation too much, since I think other Boosters have already done that (though I did not read all of it).
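The kind of lazy interface I have in mind could look like the sketch below: find the next boundary at or after a given position, touching only the characters up to that boundary. The boundary rule here (splitting after an ASCII space) is a deliberately trivial placeholder for real Unicode segmentation rules; only the shape of the interface matters.

```cpp
#include <string>
#include <cstddef>
#include <cassert>

// Lazy boundary search: returns the position just past the next
// boundary at or after `pos`, or text.size() if there is none.
// No vector of all boundaries, no copies of subranges: the caller
// advances through the string one boundary at a time.
std::size_t next_boundary(const std::string& text, std::size_t pos)
{
    while (pos < text.size() && text[pos] != ' ')
        ++pos;
    return pos < text.size() ? pos + 1 : text.size();
}
```

From such a primitive, an iterator or range adaptor yielding subranges (not copied strings) between consecutive boundaries is straightforward to build, and finding the boundary closest to an arbitrary position no longer requires analysing the entire input.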
There is some unnecessary or non-optimal code, some code more complicated than it should be, some compile-time values tested at runtime, some things that could have reused existing utilities from Boost, and some inappropriate uses of reinterpret_cast. While most of the backends are kept separate, that is not the case for encoding/codepage.cpp, which is shared by all backends and full of ifdefs; this is not a very nice extension point. I am also a bit concerned by the generalized use of virtual functions everywhere: it seems unnecessary in quite a few places, so please prefer templates when dynamic binding is not required. There is also a lot of new and of vectors; ideally, I'd prefer it if the library never allocated anything at all.

I hope my review will be well received and that you will consider all of my "feature requests". Your library is practical and can certainly get things done, but its interface is perhaps not yet as good as it could be. Also, if I want Boost.Unicode to ever be a possible backend, Boost.Locale needs to be able to make the most of it ;).
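As a footnote on the templates-versus-virtuals point: when the backend is known at compile time, a template parameter gives static binding, with no virtual calls and no heap allocation, while still allowing a virtual wrapper for the cases that genuinely need runtime selection. FakeBackend and its to_upper are, of course, invented for the illustration.

```cpp
#include <string>
#include <cassert>

// A hypothetical backend type, invented for this sketch. It performs
// a naive ASCII-only upcase, standing in for a real locale operation.
struct FakeBackend
{
    std::string to_upper(const std::string& s) const
    {
        std::string r = s;
        for (char& c : r)
            if (c >= 'a' && c <= 'z')
                c = static_cast<char>(c - 'a' + 'A');
        return r;
    }
};

// Static binding: the backend is a template parameter, so the call
// is resolved at compile time and can be inlined, with no virtual
// dispatch involved.
template<typename Backend>
std::string to_upper(const Backend& backend, const std::string& s)
{
    return backend.to_upper(s);
}
```

Dynamic binding can then be layered on top only where a backend really must be chosen at runtime, rather than being the default cost everywhere.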