UTF8 library - second call for informal review

Hello everybody. This is the second call for the informal review of the UTF8 library. It is based on verson 1.02 of UTF8-CPP: http://utfcpp.sourceforge.net/ and you can find it at http://boost-consulting.com/vault/index.php?PHPSESSID=8j6irqkpv3reg5s1gv0lge6um5&direction=0&order=&directory=Strings%20-%20Text%20Processing The library is stable and is being used for several months now. Thanks in advance Nemanja Trifunovic ____________________________________________________________________________________ Any questions? Get answers on any topic at www.Answers.yahoo.com. Try it now.

Nemanja Trifunovic wrote:
Hello everybody.
This is the second call for the informal review of the UTF8 library. It is based on verson 1.02 of UTF8-CPP: http://utfcpp.sourceforge.net/ and you can find it at http://boost-consulting.com/vault/index.php?PHPSESSID=8j6irqkpv3reg5s1gv0lge6um5&direction=0&order=&directory=Strings%20-%20Text%20Processing
The library is stable and is being used for several months now.
Hi, The project looks very impressive. Just a question, can it convert UTF-16 in UCS-2BE formt to utf8? Best Regards cg

Nemanja: I don't use UTF8 myself, but I've had to program is_valid_utf8 and utf8_to_xml10 recently. You're lucky that I just gave is_valid as an assignment to a student! I haven't done a review in a long time, not even for formal reviews, and this is only an informal review. Given that unicode is a standard that *should* be supported by C++, really, this would be a prime candidate for inclusion in boost. I strongly support your efforts. I started writing a few comments, and it comes almost as a complete review :-) So here goes my mini-review. Keep in mind I want to be as constructive and positive as possible. I'm definitely hoping something will come out of this. DESIGN About the documentation: I wish there was a short glossary at the beginning (it's not provided by unicode.org - the glossary there is wayyyy too big!). Just to explain "code point", "byte sequence", "surrogate", etc. Some pointers to the unicode.org FAQ (e.g. for BOM). Also, before the reference, it would be nice to have a synopsis to refer to just to remember the interface. And as things will evolve (interface, design, etc.) you'll need to start a section Rationale at the end (before Conclusions). About the interface: The name "bom", even though it is inside namespace utf8, could be spelled out byte_order_mark (to help users know what to google for if they don't know the context). Generally, I don't think exceptions are appropriate here. A return status code would serve equally well and be more efficient. In fact, for append, it is enough to check that "result == append(cp, result)" to know you have an invalid code point. A similar design works with the other functions. next() and prior() should not return the code point. If the only goal is to advance the iterator, why impose additional calculations? If the code point is needed, have other functions called extract_next() and extract_prior(). Unicode has the advantage that you know where the sequences begin, so next(it,end) could even work if 'it' points to the middle of a byte sequence, but it would require to backtrack first to see the whole sequence if you need to return a value, even though this value may be discarded by the caller. Finally, what if end is not an issue (e.g. you know the input is valid and don't want to test for end of sequence)? You may choose not to support this if you want to promote safe usage, but for efficiency some users may require it. It would be better imho to have: template <typename octet_iterator> std::pair<octet_iterator,bool> next(octet_iterator it); template <typename octet_iterator> std::pair<octet_iterator,bool> next(octet_iterator it, octet_iterator end); // Return the beginning of the next byte sequence after the specified 'it'. // Note that this next beginning can never be equal to 'it'. If 'end' is specified // and is encountered before the next beginning is found, then return 'std::make_pair(it,true)'. // In case of an invalid UTF8 sequence, return 'std::make_pair(it2,false)' where 'it2' is the first // position that can be deduced to be invalid starting at 'it'. In this way, all outcomes can be decoded (easily and efficiently) from the return value. Similarly: template <typename octet_iterator> std::pair<octet_iterator,bool> extract_next(octet_iterator it, uint32_t *cp); template <typename octet_iterator> std::pair<octet_iterator,bool> extract_next(octet_iterator it, octet_iterator end, uint32_t* cp); // Return the beginning of the next byte sequence after the specified 'it', and // load the code point in the range between 'it' and the return value into '*cp' if 'cp' is not null. // Note that this next beginning can never be equal to 'it', and that 'it' should point to // the beginning of a byte sequence. If 'end' is specified and is encountered // before the next beginning is found, then return 'std::make_pair(it,true)'. // In case of an invalid UTF8 sequence, return 'std::make_pair(it2,false)' where 'it2' is the first // position that can be deduced to be invalid starting at 'it'. This is only a suggestion, I'm not sure it is the best interface. It nevertheless shows that exceptions can be completely avoided for next () and prior as well. Why is the parameter n in advance required to be positive? Likewise, advance could be reformulated as: template <typename octet_iterator, typename distance_type> std::pair<distance_type,bool> advance (octet_iterator& it, distance_type n); template <typename octet_iterator, typename distance_type> std::pair<distance_type,bool> advance (octet_iterator& it, distance_type n, octet_iterator end); // Advances an iterator 'it' by the specified number of code points within an UTF-8 sequence. // Note that that the return , and that 'it' should point to the beginning of a byte sequence. // If 'end' is specified and is encountered during the advance, then return 'std::make_pair(i,true)' // In case of an invalid UTF8 sequence, return 'std::make_pair(i,false)' and let 'it' point to the first // position that can be deduced to be invalid starting at 'it'. Note that if n>0, ++it will be used, // and if n<0, --it will be used and the iterator type will be required to be bidirectional. Both the interface and the implementation can be improved by specifying the requirements on iterators: it appears you want to support bidirectional iterators, but you do require random access (when you write 'end - it < length' for instance). I'm pretty sure it's possible to implement most functions with a ForwardIterator, and perhaps even an InputIterator. Advance might require bidirectional if n<0, that can be asserted from the iterator traits. Also the expected value type of the 'octet_iterator' could be explicited (needs to be convertible to an integral type, right?) IMPLEMENTATION: I have some doubts about the correctness of the implementation, since is_trail() always matches the range 80..BF which is good only for the third and fourth bytes, but not the special cases for the second byte. And the testing is certainly not sufficient to catch all the nooks and crannies of the definition. Recall that a byte sequence is valid if it conforms to the table 3.6 in the standard (below, sorry for the formatting): Table 3-6. Well-Formed UTF-8 Byte Sequences Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte U+0000..U+007F 00..7F U+0080..U+07FF C2..DF 80..BF U+0800..U+0FFF E0 A0..BF 80..BF U+1000..U+CFFF E1..EC 80..BF 80..BF U+D000..U+D7FF ED 80..9F 80..BF U+E000..U+FFFF EE..EF 80..BF 80..BF U+10000..U+3FFFF F0 90..BF 80..BF 80..BF U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF U+100000..U+10FFFF F4 80..8F 80..BF 80..BF Does your code check for exceptional 1st bytes such as C2, E0, ED, F0 or F4? I couldn't see how. Surely is_code_point_valid can be optimized. Perhaps the compiler optimizes things a bit, but I'm pretty sure it can be improved. More efficiency can be gained by tabulating the sequence_length function (you need a table of 256 entries): I'm taking this from the unicode.org sample code: /* * Index into the table below with the first byte of a UTF-8 sequence to * get the number of trailing bytes that are supposed to follow it. * Note that *legal* UTF-8 values can't have 4 or 5-bytes. */ static const char sequence_length_for_UTF8[256] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5 }; Even the validate_next could be made a bit more efficient by tabulating the first byte: instead of the sequence length, the table would contain various control bytes, which would be used in a switch statement. This would eliminate the tests for special cases in the switch(length) statement. Also by testing if you have enough memory inside each case, you can put back the shortcut into the switch statement and not have to pay the extra test 'if(1 == length)'. The for loops for backtracking the iterator can be totally avoided. E.g., in case 2 of the switch statement, you can write instead: if (!is_trail(*(it))) { return INCOMPLETE_SEQUENCE; } cp = ((cp << 6) & 0x7ff) + ((*++it) & 0x3f); Lastly, the testing should really test *all* the boundary conditions (that's all the cases in the table 3.6 of the unicode standard) *and* also test invalid strings. Without a thorough test driver, it's really hard to have any confidence into code such as the above statement "((cp << 6) & 0x7ff) + ((*++it) & 0x3f)".... CONCLUSION Given that the Unicode standard is so precisely defined, I would expect boost to provide only the highest-quality implementation. For sure, any implementation is better than none, but by working together (you and all interested boost members) I am sure that an excellent library is feasible. You already provide an excellent design and a good base. Keep it up, and let me know how I can help besides the remarks above, -- Hervé Brönnimann hervebronnimann@mac.com On Dec 5, 2006, at 3:56 PM, Nemanja Trifunovic wrote:
Hello everybody.
This is the second call for the informal review of the UTF8 library. It is based on verson 1.02 of UTF8-CPP: http:// utfcpp.sourceforge.net/ and you can find it at http://boost- consulting.com/vault/index.php? PHPSESSID=8j6irqkpv3reg5s1gv0lge6um5&direction=0&order=&directory=Stri ngs%20-%20Text%20Processing
The library is stable and is being used for several months now.
Thanks in advance
Nemanja Trifunovic

Dear Nemanja, On 12/5/06, Nemanja Trifunovic <nemanja_trifunovic@yahoo.com> wrote:
This is the second call for the informal review of the UTF8 library. It is based on verson 1.02 of UTF8-CPP: http://utfcpp.sourceforge.net/ and you can find it at
I like the functions you provide, and the "unchecked" namespace. Unlike Hervé, I do think exceptions are the way to go. I seem to miss a couple of things though. In a recent discussion on this list there seemed to be a preference for using iterators, which can be composed, for example to perform UTF-8->UTF-16 conversion, or conversions to other codepages. Iterators can be much more flexible than these free functions. Is there any particular reason why you do not include similar functions for UTF-16? One of the most important uses for UTF must be IO. Shouldn't a utf_codecvt be part of the library? Hervé is right: reading UTF-8 can be optimised a lot using tables with data. I've got an implementation lying around that I'd be happy to share. It took 30% less time than the straightforward implementation and it did all the necessary checks. The final thing is, your functions try to maintain strings with of valid UTF-8. Why not provide a string type that maintains this variant? Conclusion: in my opinion a lot of things are missing from the library at the moment. Regards, Rogier

Nemanja Trifunovic <nemanja_trifunovic@yahoo.com> writes:
This is the second call for the informal review of the UTF8 library. It is based on verson 1.02 of UTF8-CPP: http://utfcpp.sourceforge.net/ and you can find it at http://boost-consulting.com/vault/index.php?PHPSESSID=8j6irqkpv3reg5s1gv0lge6um5&direction=0&order=&directory=Strings%20-%20Text%20Processing
First, a typo: there is a missing 'n' in octet_differece_type defined in validate_next, on line 122. Secondly, a real issue: validate_next currently requires bidirectional iterators, in order to backtrack on error. Forward iterators are sufficient for this --- just store a copy of the original, and copy it back if there's an error. This would increase the number of use cases of the library. I agree that a converting iterator would be beneficial, as would a codecvt facet, but this is a good start. Anthony -- Anthony Williams Software Developer Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
participants (5)
-
Anthony Williams
-
gchen
-
Hervé Brönnimann
-
Nemanja Trifunovic
-
Rogier van Dalen