Re: [boost] UTF8 library - second call for informal review

Hello Rogier,

Thanks for your comments.

1) Iterators, or rather iterator adapters. I believe the iterators should be built on top of these functions. In fact, I am already developing them in the version 2 of the library (see here for the latest snapshot: http://utfcpp.svn.sourceforge.net/viewvc/utfcpp/v2_0/source/ ). However, I see some other iterator implementations, and would rather start with these free functions until we decide the best design for the iterators.

2) IO - currently it is out of the scope of this library. If enough people agree with you, this may change, but currently I have no plans for IO. Honestly, I dislike C++ standard IO and would love to avoid it if possible :)

3) Tables for data: As I replied to Hervé, my test cases showed a version with the table to run slower (with two different compilers). I will investigate it further, though, since I agree it is not very logical.

4) A string type. There are way too many C++ string types out there already, and I wanted to provide a tool for making them work with UTF-8 encoding, rather than introducing yet another string class. Probably the same philosophy as Boost String Algorithms http://www.boost.org/doc/html/string_algo.html

Best,
Nemanja Trifunovic

----- Original Message ----
From: Rogier van Dalen <rogiervd@gmail.com>
To: boost@lists.boost.org
Sent: Wednesday, December 6, 2006 5:11:17 AM
Subject: Re: [boost] UTF8 library - second call for informal review

Dear Nemanja,

On 12/5/06, Nemanja Trifunovic <nemanja_trifunovic@yahoo.com> wrote:
This is the second call for the informal review of the UTF8 library. It is based on version 1.02 of UTF8-CPP: http://utfcpp.sourceforge.net/ and you can find it at
I like the functions you provide, and the "unchecked" namespace. Unlike Hervé, I do think exceptions are the way to go.

I seem to miss a couple of things though. In a recent discussion on this list there seemed to be a preference for using iterators, which can be composed, for example to perform UTF-8->UTF-16 conversion, or conversions to other codepages. Iterators can be much more flexible than these free functions.

Is there any particular reason why you do not include similar functions for UTF-16?

One of the most important uses for UTF must be IO. Shouldn't a utf_codecvt be part of the library?

Hervé is right: reading UTF-8 can be optimised a lot using tables with data. I've got an implementation lying around that I'd be happy to share. It took 30% less time than the straightforward implementation and it did all the necessary checks.

The final thing is, your functions try to maintain strings of valid UTF-8. Why not provide a string type that maintains this invariant?

Conclusion: in my opinion a lot of things are missing from the library at the moment.

Regards,
Rogier
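For readers wondering what a UTF-16 counterpart to the UTF-8 free functions might involve, here is a minimal sketch of the encoding step. The name `append_utf16` is hypothetical and is not taken from UTF8-CPP; the code also assumes a C++11 compiler for `std::u16string`. It throws on invalid input, in line with the preference for exceptions expressed above.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of UTF8-CPP): append one code point to a
// UTF-16 string. Throws if cp is not a Unicode scalar value.
inline void append_utf16(std::uint32_t cp, std::u16string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}
```

Combined with a UTF-8 decoding function, a routine like this is the second half of the UTF-8->UTF-16 conversion mentioned above; an iterator adapter would perform the same arithmetic lazily.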

Hi Nemanja, Thanks for your reply. On 12/6/06, Nemanja Trifunovic <nemanja_trifunovic@yahoo.com> wrote:
1) In fact, I am already developing [iterator adapters] in the version 2 of the library
I suppose version 2 is the one you'd like to become part of Boost, then? I agree that the free functions are a good stepping stone for implementing iterators.
2) IO - currently it is out of the scope of this library.
Why? You can do nice things such as automatically setting the right codecvt based on the byte order mark. Works like a bliss. Again, I do have some code lying around for this. codecvt facets also provide more opportunity for optimisation.
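As an illustration of the byte order mark idea, the following is a minimal sketch of the detection step only. It is not the code referred to above; `bom_kind` and `detect_bom` are hypothetical names, and selecting and imbuing an actual codecvt facet based on the result is left out.

```cpp
#include <cstddef>

// Hypothetical result type: which byte order mark, if any, was found.
enum class bom_kind { none, utf8, utf16_be, utf16_le, utf32_be, utf32_le };

// Hypothetical helper: inspect the first n bytes of a buffer for a BOM.
inline bom_kind detect_bom(const unsigned char* p, std::size_t n) {
    // UTF-32 is tested before UTF-16 because FF FE 00 00 also begins with
    // the UTF-16 LE mark FF FE. (Strictly, those bytes could also be UTF-16 LE
    // text starting with U+0000; this sketch assumes that does not occur.)
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return bom_kind::utf32_le;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return bom_kind::utf32_be;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return bom_kind::utf8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return bom_kind::utf16_be;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return bom_kind::utf16_le;
    return bom_kind::none;
}
```

A stream wrapper could read the first four bytes, call such a function, and then install the matching conversion facet before any text is read.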
3) Tables for data: As I replied to Hervé, my test cases showed a version with the table to run slower (with two different compilers).
I tested it on my implementation of the codecvt facet; I imagine the effect is greater than the effect on iterators. Also, I didn't use one table but a couple. Then you don't need an "if" statement at all, only a switch. You might want to try whether that makes much of a difference.
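To make the table-plus-switch idea concrete, here is a minimal sketch using a single 256-entry table indexed by the lead byte. It is not the implementation described above (which used several tables and performed all checks), and it makes no performance claim. The names `make_length_table`, `sequence_length` and `decode_unchecked` are hypothetical, and the decoder assumes well-formed input, much like an "unchecked" code path.

```cpp
#include <array>
#include <cstdint>

// Build a 256-entry table: lead byte -> sequence length (0 = invalid lead).
inline std::array<std::uint8_t, 256> make_length_table() {
    std::array<std::uint8_t, 256> t{};
    for (int b = 0; b < 256; ++b) {
        if (b <= 0x7F) t[b] = 1;
        else if (b >= 0xC2 && b <= 0xDF) t[b] = 2;
        else if (b >= 0xE0 && b <= 0xEF) t[b] = 3;
        else if (b >= 0xF0 && b <= 0xF4) t[b] = 4;
        else t[b] = 0;  // continuation bytes, overlong leads C0/C1, F5..FF
    }
    return t;
}

// The branches above run once, when the table is built; lookups are a
// single indexed load.
inline std::uint8_t sequence_length(unsigned char lead) {
    static const std::array<std::uint8_t, 256> table = make_length_table();
    return table[lead];
}

// Decode one code point from well-formed UTF-8. Continuation bytes are
// NOT validated here; an invalid lead byte yields U+FFFD.
inline std::uint32_t decode_unchecked(const unsigned char* p) {
    switch (sequence_length(p[0])) {
        case 1: return p[0];
        case 2: return ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
        case 3: return ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6)
                       | (p[2] & 0x3Fu);
        case 4: return ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
                       | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
        default: return 0xFFFD;
    }
}
```

Whether this beats a chain of comparisons depends on the compiler and on cache behaviour, which may explain why measurements differ between implementations.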
4) A string type. There are way too many C++ string types out there already, and I wanted to provide a tool for making them work with UTF-8 encoding, rather than introducing yet another string class. Probably the same philosophy as Boost String Algorithms http://www.boost.org/doc/html/string_algo.html
I'm sorry, I fail to see how that argument works exactly. Can you elaborate? Algorithms that work on different kinds of containers: use free functions. I follow that. The alternative is to implement them as methods of a string class. Is that the alternative you have in mind? That was not what I meant. A UTF-encoded string is a container with codepoints (21-bit numbers) in an encoded form. The encoding puts some well-formedness constraints on the underlying bitstring. This is not unlike std::stack. Why not put the constraints in the type? Regards, Rogier
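A minimal sketch of what "putting the constraints in the type" could look like follows. The names `utf8_string` and `is_valid_utf8` are hypothetical and appear in neither UTF8-CPP nor Boost; the sketch wraps `std::string` for brevity, although an adapter in the style of `std::stack` would presumably take the container as a template parameter.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>

// True if s is well-formed UTF-8: rejects stray or missing continuation
// bytes, overlong forms, surrogates, and values above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0, min = 0;
        if (b <= 0x7F) { ++i; continue; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;               // continuation byte or invalid lead
        if (n - i < len) return false;   // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// Hypothetical adapter: only operations that preserve well-formedness
// are exposed, so every utf8_string holds valid UTF-8.
class utf8_string {
public:
    utf8_string() = default;
    explicit utf8_string(std::string bytes) : bytes_(std::move(bytes)) {
        if (!is_valid_utf8(bytes_))
            throw std::invalid_argument("ill-formed UTF-8");
    }
    // Concatenating two well-formed strings stays well-formed: no recheck.
    utf8_string& operator+=(const utf8_string& rhs) {
        bytes_ += rhs.bytes_;
        return *this;
    }
    // Read-only view of the underlying bytes; mutation is not exposed.
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};
```

The design choice mirrors `std::stack`: the wrapper adds no storage of its own, it only narrows the interface so the invariant cannot be broken from outside.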

It seems the answer I submitted earlier did not make it to the list. I apologize if this creates a second message on the same topic.
I suppose version 2 is the one you'd like to become part of Boost, then? I agree that the free functions are a good stepping stone for implementing iterators.
I would prefer to start with version 1, and add iterator adapters only after they are properly designed.
2) IO - currently it is out of the scope of this library.
Why? You can do nice things such as automatically setting the right codecvt based on the byte order mark. Works like a bliss. Again, I do have some code lying around for this. codecvt facets also provide more opportunity for optimisation.
Is your code publicly available? Maybe we can join forces and submit the library together.
4) A string type. There are way too many C++ string types out there already, and I wanted to provide a tool for making them work with UTF-8 encoding, rather than introducing yet another string class. Probably the same philosophy as Boost String Algorithms http://www.boost.org/doc/html/string_algo.html
I'm sorry, I fail to see how that argument works exactly. Can you elaborate? Algorithms that work on different kinds of containers: use free functions. I follow that. The alternative is to implement them as methods of a string class. Is that the alternative you have in mind? That was not what I meant. A UTF-encoded string is a container with codepoints (21-bit numbers) in an encoded form. The encoding puts some well-formedness constraints on the underlying bitstring. This is not unlike std::stack. Why not put the constraints in the type?
I indeed thought you meant another std::string-like class. Could you give an example of what it would look like from a user's perspective? Thanks. Nemanja Trifunovic
Participants (2): Nemanja Trifunovic, Rogier van Dalen