
Hi.

I hope you guys still remember this, despite my lack of activity on this list: I was here a while back about developing a Unicode library for Boost, and well... we're off! :)

We are now at a point in the development where we would really appreciate feedback from you Boosters. We have developed a *very* early prototype based on some of the ideas put forward in the earlier discussion here, and we would like you guys to comment on it. The design is by no means locked, and represents only one possible way to implement a Unicode string class. If you have an alternative solution, please let us know. We are using an evolutionary development model (prototyping), so we are open to design changes should the need arise.

You can download the source code (VC++/DevCpp projects only for now, sorry) from our site here:

http://hovedprosjekter.hig.no/v2005/data/Gr5_unicode (the files section).

On the same site we have also set up a forum where you can discuss different aspects of the project if you want to keep it off list. No registration is required, so it should be hassle (password) free.

Current design:

The current design is based around the concept of «encoding_traits». These are templated on the different encodings used in Unicode (UTF-8, UTF-16 and UTF-32, the latter two in both byte orders), and provide functions and typedefs for working on code units (8, 16 and 32 bit integers respectively) in any encoding. These traits are then used for implementing different interfaces that externally use 32-bit code points, thereby abstracting away the underlying encoding.

The string class itself is designed with encoding transparency in mind, also at the class level. This means that the encoding used in the string is not a template parameter of the string class itself (which would make each instantiation of the string its own type), but rather a parameter of an implementation class that is used internally to hold the string.
Something like this (highly simplified):

    class impl_base {
    public:
        // Virtual destructor so an impl<encoding> can be deleted via impl_base*.
        virtual ~impl_base() {}
        // A lot of pure virtual functions for manipulating a string.
    };

    template<typename encoding>
    class impl : public impl_base {
        // Implement the functions... obviously.
    };

    class encoded_string {
        impl_base* m_impl;
    public:
        template<typename encoding>
        void set_encoding(encoding enc_tag) {
            m_impl = new impl<encoding>();
        }
    };

The reason for doing this is that it allows functions that take encoded_string parameters to be blissfully unaware of what encoding they are working on, without having to templatize (is that a word?) the function itself. (Something I understood was a bit of a worry for some in the last discussion.)

An alternative way of doing this (something we also tested when developing the current version) is to simply template the string class itself on the encoding, but then you lose the above advantage of being able to have non-template functions working on multiple encodings. You do however gain speed (I would assume), since you wouldn't have the overhead of virtual function calls, as well as a less complex implementation.

There's also an implementation of the Unicode Character Database in the prototype, along with an implementation of the normalization algorithms, but I won't go into the details of them here (to keep this from becoming a novel). They should be easy enough to understand if you want to look.

Anyway... Comments are as always welcome, either here or in the forum at the site.

Regards
- Erik

To Eric Niebler: Did you receive the mail I sent you a while back about the whole contact-person debacle? (On the Boost Consulting address.) I never got a reply, so I'm not sure if it went through.
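
For anyone who wants something compilable to poke at, here is a minimal sketch of the virtual-dispatch design described above. It is not the prototype's code. The names encoding_traits, impl_base, impl, encoded_string and set_encoding come from the description in this post; everything else (utf8_tag, utf32_tag, encode, push_back, code_unit_count, append_euro) is a hypothetical name made up for illustration. It assumes a C++11 compiler and only covers UTF-8 and native-endian UTF-32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

typedef std::uint32_t code_point;

// Hypothetical encoding tags (illustrative names, not the prototype's).
struct utf8_tag {};
struct utf32_tag {};

template<typename Encoding> struct encoding_traits;

template<> struct encoding_traits<utf8_tag> {
    typedef std::uint8_t code_unit;
    // Append the UTF-8 form of cp to out. Surrogates are not rejected here.
    static void encode(code_point cp, std::vector<code_unit>& out) {
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            out.push_back(static_cast<code_unit>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<code_unit>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<code_unit>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<code_unit>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<code_unit>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<code_unit>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<code_unit>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<code_unit>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<code_unit>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<code_unit>(0x80 | (cp & 0x3F)));
        }
    }
};

template<> struct encoding_traits<utf32_tag> {
    typedef std::uint32_t code_unit;
    // UTF-32 stores each code point as a single code unit.
    static void encode(code_point cp, std::vector<code_unit>& out) {
        assert(cp <= 0x10FFFF);
        out.push_back(cp);
    }
};

// Encoding-independent interface, expressed in 32-bit code points.
class impl_base {
public:
    virtual ~impl_base() {}
    virtual void push_back(code_point cp) = 0;
    virtual std::size_t code_unit_count() const = 0;
};

// One implementation per encoding; storage is in that encoding's code units.
template<typename Encoding>
class impl : public impl_base {
    typedef encoding_traits<Encoding> traits;
    std::vector<typename traits::code_unit> m_units;
public:
    void push_back(code_point cp) override { traits::encode(cp, m_units); }
    std::size_t code_unit_count() const override { return m_units.size(); }
};

// The string type itself is not a template; the encoding is chosen at run time.
class encoded_string {
    std::unique_ptr<impl_base> m_impl;
public:
    // Note: in this sketch, switching encoding discards the current content.
    template<typename Encoding>
    void set_encoding(Encoding) { m_impl.reset(new impl<Encoding>()); }

    void push_back(code_point cp) {
        assert(m_impl.get() != nullptr);  // set_encoding must be called first
        m_impl->push_back(cp);
    }
    std::size_t code_unit_count() const {
        assert(m_impl.get() != nullptr);
        return m_impl->code_unit_count();
    }
};

// A non-template function that works on any encoding.
void append_euro(encoded_string& s) { s.push_back(0x20AC); }
```

With this, appending U+0041 and then calling append_euro on a UTF-8 string gives 4 code units, while the same two calls on a UTF-32 string give 2, and append_euro never needs to know which one it was handed. The price is one virtual call per operation, which is the trade-off against the templated-string alternative mentioned above.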