
From: Phil Endecott <spam_from_boost_dev@chezphil.org>
As others have said, a similar introduction to ICU, perhaps even deep links to appropriate parts of its documentation, and perhaps also to Unicode docs, would be useful.
The point is that Boost.Locale uses ICU as one of its back-ends (even though it is the primary and the best one), so I actually prefer to decouple Boost.Locale from ICU at some level, so that users see that Boost.Locale is not just a convenience wrapper around ICU but more than that. Also, Unicode is a very, very large topic, and a Google search for ICU and Unicode will basically give you the best links...
I found that it was not sufficiently easy to find which header each function is in, e.g. on the charset_handling.html page the name of the header is not mentioned; if I click on e.g. from_utf I get to a reference page that also doesn't name the header.
Good point. I'll add it. I'm a little surprised about the missing header names in the reference docs, because that is something Doxygen usually provides.
Use of strings rather than symbols: there are a few places where the library uses strings such as "UTF-8" or "de_DE.UTF8" rather than symbols like utf_8 or de_DE_UTF8. This use of strings turns what could be compile-time errors into run-time errors. Mis-typing e.g. "UTF_8" or "de-DE" is too easy. Perhaps in some cases the domain of the parameter is not known at compile time, but even in those cases it should be possible to provide symbols for the most common choices.
What most common choices? There are a few dozen different character encodings and even more locales - and what counts as common? Also, not all encodings are supported by all backends: for example, iconv would not handle some Windows codepages, and MultiByteTo.. would not handle some other encodings. As for locales: is de_DE.UTF-8 common? Is he_IL.UTF-8 common? Is zh_CN.UTF-8 common? The level of support by different backends may also depend on the actual OS configuration - if some locale is not configured on Windows or Linux, the non-ICU backends fall back to the C/POSIX locale. So should there be hard-coded constants for locales and encodings? I doubt it.

However, I do think a conversion like the following may be useful (a fuller sketch follows below):

std::string utf8_str = utf_to_utf<char>(L"Hello"); // UTF-16/UTF-32 to UTF-8

This was raised by other reviewers as well, and I am actually considering adding something like that to the code.

A small note about mistyping encoding names: they are compared by letters and digits only, and all other symbols are ignored, so from the Boost.Locale and ICU point of view UTF-8, UTF8, Utf_8 and UTF#$%^#$8 are all the same.
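For illustration, a complete program using the proposed function might look like this (a sketch only; utf_to_utf and its header location are the proposed addition, not a current part of the library):

#include <boost/locale/encoding_utf.hpp> // assumed location of the proposed function
#include <string>

int main() {
    using boost::locale::conv::utf_to_utf;
    // Convert between UTF encodings based on the character width alone,
    // with no locale or charset-name lookup involved:
    std::string utf8 = utf_to_utf<char>(L"Hello");    // UTF-16/UTF-32 to UTF-8
    std::wstring wide = utf_to_utf<wchar_t>("Hello"); // UTF-8 to UTF-16/UTF-32
    return (utf8 == "Hello" && wide == L"Hello") ? 0 : 1;
}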
Collation: I would be interested to know how fast these comparisons are compared to regular std::string::operator<. If I correctly understand what transform does (more docs needed), it would be interesting to then compare the performance of std::map<...,comparator> with a std::map<> storing transformed strings. Or similarly, to compare std::sort using the comparator with std::sort of a pair<transformed string, original string>.
Transform is something very common for collation purposes; every collation API provides it (ICU, POSIX, C and Win32 API). I haven't done any specific benchmarks, but it is much faster, as collation normally involves Unicode normalization and then lookups in a character database for the specific locale. I doubt that std::sort of transformed strings would be much faster, as the transform costs as well. But if you frequently search for some string in a database, it has a significant performance impact: you transform the string only once and then just run strcmp, which is much faster.
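To make the transform idea concrete, here is a minimal sketch using the library's collator facet (names as in the reference docs; the locale and strings are arbitrary examples):

#include <boost/locale.hpp>
#include <map>
#include <string>

int main() {
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");
    boost::locale::collator<char> const &col =
        std::use_facet<boost::locale::collator<char> >(loc);

    // Transform each key once; the resulting sort keys order correctly
    // under plain byte-wise comparison (strcmp / std::string::operator<),
    // so repeated lookups skip the full collation algorithm.
    std::map<std::string, std::string> index; // sort key -> original string
    std::string name = "na\xC3\xAFve"; // "naïve" in UTF-8
    index[col.transform(boost::locale::collator_base::primary, name)] = name;
    return 0;
}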
Ideally these collation functions would take character ranges of arbitrary types, however I guess that the icu API takes char* making this unreasonable. Apparently there is no option for a NULL-terminated char*, unlike the case conversion functions, right?
Good point. I'll take a look at this.
Case conversions: I don't know what "title case" is or what "fold case" means; the docs should define this. At least linking to the definitions of the Unicode normalisation forms would also be useful; preferably, some explanation of how they should be used would be included.
Noted. Good point.
I can think of two recent case-conversion problems that I have had, both related to geographic names:
1. I had a list of North American names, in "UPPER CASE". I decided to convert them to "Initial Capitals" for better appearance, but later realised that this was broken for accented names in Quebec. I could imagine some sort of "case conversion is lossy" property of a locale that would detect this; does anything like that exist? (I guess not.)
Actually, any case conversion is "lossy" because it does not know the original case. The good news, however, is that most code points in the Unicode database have no case at all. Roughly speaking, only European languages actually have case; Semitic, East Asian and many other scripts have no case at all. And BTW, "Initial Capitals" is actually Title Case.
2. In a similar case I found that " 's " was being wrongly capitalised, e.g. "St. John'S Street". (This was with PostgreSQL's case conversion functions.) Again it would be useful to know whether (a) does the library try to handle this sort of thing correctly, and (b) can it tell me whether the operation that I want to do is lossy or not? For example, say I have a set of place names that are a mixture of en_US and en_CA and fr_CA but I don't know which; can I perform a conversion that works because those locales all have the same rules for e.g. ' and -? (I guess not.)
Case handling is actually a Unicode algorithm, and it behaves the same for almost all locales but a few: Azeri and Turkish, where "i" converts to "İ" (I with a dot) and "I" converts to "ı" (i without a dot), and AFAIR there are one or two other locale-dependent corner cases. In any case, case handling is always lossy, though it depends on how you define it. To answer (a): yes, it handles this sort of thing correctly using the ICU backend. If you ask whether it does everything right, then no: different languages may have different capitalization rules, but as far as I can see the Unicode algorithm and the current ICU library ignore them. Still, it is the best we have.
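The Turkish behaviour is easy to demonstrate with the case-conversion functions (a sketch; it assumes the tr_TR locale is available to the selected backend):

#include <boost/locale.hpp>
#include <iostream>

int main() {
    boost::locale::generator gen;
    std::locale tr = gen("tr_TR.UTF-8");
    // Under a Turkish locale "i" upper-cases to U+0130 (dotted capital I),
    // unlike English where it becomes a plain "I".
    std::cout << boost::locale::to_upper("istanbul", tr) << "\n";
    // Title case, i.e. "Initial Capitals":
    std::cout << boost::locale::to_title("st. john's street", tr) << "\n";
    return 0;
}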
Character set conversion: I think it's great that Boost might finally get character set conversion - it's overdue! However, I'm not enthusiastic about this:
std::string utf8_string = to_utf<char>(latin1_string,"Latin1");
1. It should be possible for the source character set to be a compile-time constant symbol, and a template parameter e.g. conv<latin1,utf8>(str). As it is, if I want to convert just a few bytes, it will surely spend more time decoding the charset name string than actually converting.
See the notes above about constants. Also, Boost.Locale does not use its own conversion tables; it relies on 3rd-party APIs like ICU, iconv or the Win32 API, so one way or another you will spend some time calling ucnv_open, iconv_open or something else. If you want to make stream conversions, or convert a few characters at a time, you can use the codecvt API; there is actually an example in the docs about using codecvt.
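The codecvt route looks roughly like this (a sketch based on the docs' approach; the generated locale installs the conversion facets once, so there is no per-call charset lookup):

#include <boost/locale.hpp>
#include <fstream>

int main() {
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");
    // The std::codecvt facet installed by the generator converts between
    // wide characters and the locale's narrow encoding on the stream.
    std::wofstream out("hello.txt");
    out.imbue(loc);
    out << L"Hello"; // written as UTF-8 on disk
    return 0;
}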
2. As above, ideally it should take a range of arbitrary type and output to a back_insert_iterator; presumably this is an icu limitation.
Actually, an iterator is a bad idiom for charset conversion; it is more an operation on a stream, and this is what codecvt is usable for. Note that not all APIs support stream operations, most noticeably the Win32 MultiByte... API and iconv. I had a very hard time supporting stream conversions, as the APIs weren't too friendly in reality. Codepage conversion is generally much more stateful than most of us think; for example, iconv may compose several characters into a single code point.
3. The error handling seems to be basic - it's "all or nothing". Considering from_utf, since this is the direction where there are more things that can be unrepresentable, I might want to:
- Skip the character;
- Replace with '?';
- Replace with an unaccented version, or e.g. curly-quotes to plain-quotes;
- Throw an exception that aborts the whole operation;
- Do the conversion up to the point of difficulty and then indicate the problem.
A few points: Skip and Throw are supported (see the sketch after these replies).
- Replace with an unaccented version, or e.g. curly-quotes to plain-quotes;
I'm not aware of any charset API that actually supports such things.
- Replace with '?';
I think I noted this before: it is problematic with the non-ICU backends, as not all of them support such operations, and more than that, some do not even provide a way to know where the operation failed (Win32 API).
- Do the conversion up to the point of difficulty and then indicate the problem.
This is a problem as well, as not all APIs provide this.
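As mentioned, Skip and Throw map onto the conversion functions' method argument; a minimal sketch using the documented skip and stop flags (the input string is an arbitrary example):

#include <boost/locale/encoding.hpp>
#include <string>

int main() {
    using namespace boost::locale::conv;
    std::string utf8 = "\xE2\x80\x9Cquoted\xE2\x80\x9D"; // curly quotes U+201C/U+201D
    // skip: silently drop characters that Latin1 cannot represent
    std::string a = from_utf(utf8, "Latin1", skip);
    // stop: throw conversion_error at the first unrepresentable character
    try {
        std::string b = from_utf(utf8, "Latin1", stop);
    } catch (conversion_error const &) {
        // conversion aborted
    }
    return 0;
}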
4. There's also the case where the input is chunked, and the chunk-breaks might fall within multi-byte characters; so individual chunks are potentially malformed while the overall stream is OK. I think the best way to handle this is with a stateful functor to do the conversion; I wrote one using iconv, like this:
Iconver<latin1,utf8> latin1_to_utf8;
while (!end_of_input()) {
    output.append( latin1_to_utf8(input.get_chunk()) );
}
assert(latin1_to_utf8.empty()); // detect unconverted partial chars at end of input
There is a codecvt for this, as I mentioned above. Also, codepage conversion is not the primary goal of Boost.Locale but rather a utility task.
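For the chunked case specifically, the codecvt facet carries its state across calls, so a chunk boundary inside a multi-byte character is not an error (a sketch; buffer management omitted):

#include <boost/locale.hpp>
#include <locale>

int main() {
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");
    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;
    cvt_type const &cvt = std::use_facet<cvt_type>(loc);

    std::mbstate_t state = std::mbstate_t();
    char chunk[] = "gr\xC3"; // ends part-way through a UTF-8 sequence
    wchar_t wide[8];
    char const *from_next = 0;
    wchar_t *to_next = 0;
    // Returns std::codecvt_base::partial here; keep `state` and the bytes
    // from `from_next` onward and prepend them to the next chunk.
    cvt.in(state, chunk, chunk + sizeof(chunk) - 1, from_next,
           wide, wide + 8, to_next);
    return 0;
}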
Also: how is the performance of this conversion? If it is being done by icu, we should have a hook (i.e. specialisation, hence my suggestion of template parameters for the character sets) so that we can provide optimised versions of important conversions.
I have found that iconv and ICU perform similarly. The performance of codecvt for UTF-8 and single-byte character sets is very good; for multi-byte encodings like Shift-JIS it is not so good, but that is rather a limitation of the std::codecvt design.
I note that a union is being used in what I believe is an undefined way:
union {
    char first;
    uint16_t u16;
    uint32_t u32;
} v;
v.u16 = 1;
if(v.first == 1) {
    return "UTF-16LE";
} else {
    return "UTF-16BE";
}
No problems there (see Steven's response).
In my own iconv wrapper code I need a const_cast because the POSIX spec misses a const on the second argument to iconv(); some platforms correct this, but not all:
Actually, POSIX specifies restrict, which is a little bit stronger than const.
#if defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__NetBSD__) || defined(__sun__)
// Previously __APPLE__ was included in this list; presumably they have
// changed their headers. If you have an older system you may need to put
// it back.
#define ICONV_ARG2_IS_CONST
#endif
It is wrong, as it may depend on the version of iconv... Boost.Locale does it in a smarter way; see Steven's response.
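As far as I recall, the trick is an overload on the function-pointer type, so the correct signature is picked at compile time without a version-specific #ifdef (an illustrative sketch, not the library's exact code):

#include <iconv.h>
#include <cstddef>

// GNU flavour: second argument is char const **
inline size_t call_iconv(size_t (*ptr)(iconv_t, char const **, size_t *,
                                       char **, size_t *),
                         iconv_t d, char **in, size_t *insize,
                         char **out, size_t *outsize)
{
    return ptr(d, const_cast<char const **>(in), insize, out, outsize);
}

// POSIX flavour: second argument is char **
inline size_t call_iconv(size_t (*ptr)(iconv_t, char **, size_t *,
                                       char **, size_t *),
                         iconv_t d, char **in, size_t *insize,
                         char **out, size_t *outsize)
{
    return ptr(d, in, insize, out, outsize);
}

// Usage: call_iconv(iconv, cd, &inbuf, &inleft, &outbuf, &outleft);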
This doesn't seem to be done here.
There seems to be some use of reinterpret_cast which I think should be a static_cast via void*:
virtual std::string convert(char_type const *ubegin, char_type const *uend)
{
    ....
    char const *begin = reinterpret_cast<char const *>(ubegin);
No problems there as well; see Steven's response.
"int err" should probably be "errno_t err".
No, see: http://pubs.opengroup.org/onlinepubs/009695399/functions/errno.html - errno is defined as int by the POSIX specification.
There seems to be a "magic constant" of 10 to allow for the difference in size between the input and the output.
So? It is just a guess at how much to extend the buffer; it has no other meaning.
Some comments would help. For example, to describe the meaning of the return values and error codes from iconv().
man iconv? Also, EINVAL (invalid value), E2BIG (too big) and EILSEQ (illegal sequence) are quite self-explanatory, at least for those who are familiar with Unix programming.
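For completeness, the usual interpretation of those codes in an iconv() loop (a sketch of the protocol, not library code; note that the constness of the second argument varies by platform, as discussed above):

#include <iconv.h>
#include <cerrno>
#include <cstddef>

// Returns true when the chunk was fully consumed, or when it merely ends
// inside a multi-byte character (EINVAL); false on a real error.
bool convert_chunk(iconv_t cd, char **in, size_t *inleft,
                   char **out, size_t *outleft)
{
    if (iconv(cd, in, inleft, out, outleft) != (size_t)-1)
        return true;
    switch (errno) {
    case E2BIG:  return false; // output buffer full: grow it and retry
    case EILSEQ: return false; // illegal byte sequence in the input
    case EINVAL: return true;  // incomplete character at the end of the
                               // chunk: keep the tail for the next call
    default:     return false;
    }
}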
- What is your evaluation of the documentation?
Good enough.
- What is your evaluation of the potential usefulness of the library?
Useful and overdue.
- Did you try to use the library? With what compiler? Did you have any problems?
No, I've not compiled anything.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
3 hours reading the docs and looking at a tiny bit of the implementation.
- Are you knowledgeable about the problem domain?
Not beyond the areas that I have discussed above.
Please explicitly state in your review whether the library should be accepted.
It's borderline. I was going to say "yes, but please consider updating at least the charset conversion functions in the way that I proposed".
Some things, like the utf-to-utf conversion, will be added; however, I have doubts about the substitution options for the reasons I explained above.
But then I looked at the source and noted what I think are well-known issues like abuse of union and reinterpret_cast, which have lowered my opinion a bit.
There was nothing wrong; Steven has answered most of the points.
Regards, Phil.
Thanks for the review. Artyom