
Dear All,

Here is a brief review of the proposed Boost.Locale library. I know little about the internationalisation issues that this library addresses, so I will focus just on the few areas where I do have some experience.

The documentation is generally of good quality but could be expanded in places as noted below. I appreciate the introduction of std::locale at the beginning; having avoided std::locale for precisely the "common critical problems" mentioned, I found this introduction useful. As others have said, a similar introduction to icu, perhaps even deep links to the appropriate parts of its documentation, and perhaps also to the Unicode docs, would be useful. I also found it too hard to discover which header each function is in: on the charset_handling.html page, for example, the name of the header is not mentioned, and if I click on e.g. from_utf I reach a reference page that also doesn't name the header.

Use of strings rather than symbols: there are a few places where the library uses strings such as "UTF-8" or "de_DE.UTF8" rather than symbols like utf_8 or de_DE_UTF8. This turns errors that could be caught at compile time into run-time errors: mis-typing e.g. "UTF_8" or "de-DE" is too easy. Perhaps in some cases the domain of the parameter is not known at compile time, but even in those cases it should be possible to provide symbols for the most common choices.

Collation: I would be interested to know how fast these comparisons are compared to regular std::string::operator<. If I correctly understand what transform does (more docs needed), it would be interesting to compare the performance of std::map<...,comparator> with a std::map<> storing transformed strings; or similarly, to compare std::sort using the comparator with std::sort of a pair<transformed string, original string>. Ideally these collation functions would take character ranges of arbitrary types; however, I guess that the icu API takes char*, making this unreasonable.
Apparently there is no option for a NULL-terminated char*, unlike the case conversion functions - is that right?

Case conversions: I don't know what "title case" is or what "fold case" means; the docs should define these terms. At the least, linking to the definitions of the Unicode normalisation forms would be useful; preferably, some explanation of how they should be used would be included. I can think of two recent case-conversion problems that I have had, both related to geographic names:

1. I had a list of North American names in "UPPER CASE". I decided to convert them to "Initial Capitals" for better appearance, but later realised that this was broken for accented names in Quebec. I could imagine some sort of "case conversion is lossy" property of a locale that would detect this; does anything like that exist? (I guess not.)

2. In a similar case I found that " 's " was being wrongly capitalised, e.g. "St. John'S Street". (This was with PostgreSQL's case conversion functions.) Again it would be useful to know (a) whether the library tries to handle this sort of thing correctly, and (b) whether it can tell me if the operation I want to do is lossy or not. For example, say I have a set of place names that are a mixture of en_US, en_CA and fr_CA but I don't know which; can I perform a conversion that works because those locales all have the same rules for e.g. ' and -? (I guess not.)

Character set conversion: I think it's great that Boost might finally get character set conversion - it's overdue! However, I'm not enthusiastic about this:

    std::string utf8_string = to_utf<char>(latin1_string,"Latin1");

1. It should be possible for the source character set to be a compile-time constant symbol and a template parameter, e.g. conv<latin1,utf8>(str). As it is, if I want to convert just a few bytes, it will surely spend more time decoding the charset name string than actually converting.

2.
As above, ideally it should take a range of arbitrary type and output to a back_insert_iterator; presumably this is an icu limitation.

3. The error handling seems to be basic - it's "all or nothing". Considering from_utf, since this is the direction in which more things can be unrepresentable, I might want to:

- skip the character;
- replace it with '?';
- replace it with an unaccented version, or e.g. convert curly-quotes to plain-quotes;
- throw an exception that aborts the whole operation;
- do the conversion up to the point of difficulty and then indicate the problem.

4. There's also the case where the input is chunked, and the chunk-breaks might fall within multi-byte characters; individual chunks are then potentially malformed while the overall stream is OK. I think the best way to handle this is with a stateful functor to do the conversion; I wrote one using iconv, like this:

    Iconver<latin1,utf8> latin1_to_utf8;
    while (!end_of_input()) {
      output.append( latin1_to_utf8(input.get_chunk()) );
    }
    assert(latin1_to_utf8.empty()); // detect unconverted partial chars at end of input

Also: how is the performance of this conversion? If it is being done by icu, we should have a hook (i.e. a specialisation, hence my suggestion of template parameters for the character sets) so that we can provide optimised versions of important conversions.

Here are my answers to the "normal questions":

- What is your evaluation of the design? Generally it seems like a reasonable design for an icu wrapper. If it were not an icu wrapper, no doubt some things could be done better.

- What is your evaluation of the implementation? I've looked at very little of the source. I decided to look at iconv_codepage.hpp since I have some experience with iconv.
I note that a union is being used in what I believe is an undefined way:

    union {
      char first;
      uint16_t u16;
      uint32_t u32;
    } v;
    v.u16 = 1;
    if(v.first == 1) {
      return "UTF-16LE";
    } else {
      return "UTF-16BE";
    }

In my own iconv wrapper code I need a const_cast because the POSIX spec misses a const on the second argument to iconv(); some platforms correct this, but not all:

    #if defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__NetBSD__) || defined(__sun__)
    // Previously __APPLE__ was included in this list; presumably they have
    // changed their headers. If you have an older system you may need to put
    // it back.
    #define ICONV_ARG2_IS_CONST
    #endif

This doesn't seem to be done here.

There is also some use of reinterpret_cast which I think should be a static_cast via void*:

    virtual std::string convert(char_type const *ubegin,char_type const *uend) {
      ....
      char const *begin = reinterpret_cast<char const *>(ubegin);

"int err" should probably be "errno_t err". There is also a "magic constant" of 10 to allow for the difference in size between the input and the output. Some comments would help; for example, to describe the meaning of the return values and error codes from iconv().

Based on that limited look, my view is that the code probably works, but it could be improved in many ways.

- What is your evaluation of the documentation? Good enough.

- What is your evaluation of the potential usefulness of the library? Useful and overdue.

- Did you try to use the library? With what compiler? Did you have any problems? No, I've not compiled anything.

- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study? 3 hours reading the docs and looking at a tiny bit of the implementation.

- Are you knowledgeable about the problem domain? Not beyond the areas that I have discussed above.
Please explicitly state in your review whether the library should be accepted.
It's borderline. I was going to say "yes, but please consider updating at least the charset conversion functions in the way that I proposed". But then I looked at the source and noted what I think are well-known issues, like the abuse of union and reinterpret_cast, which have lowered my opinion a bit. And I've since seen other reviews that have looked at the areas I've not considered (i.e. message translation), and they have been quite critical. Based on all of that, my view has shifted to "no, but this is worthwhile functionality, so please come back when you've resolved some of these concerns".

Regards, Phil.