[boost] [Unicode strings] We're off

16 Mar 2005

Hi. I hope you guys still remember this, despite my lack of activity on 
this list: I was here a while back about developing a Unicode library 
for Boost, and well.. We're off! :) We are now at a point in the 
development where we would really appreciate feedback from you Boosters.

We have developed a *very* early prototype based on some of the ideas 
put forward in the earlier discussion here, and we would like you guys 
to comment on it. The design is by no means locked, and represents only 
one possible way to implement a Unicode string class. If you have an 
alternate solution, please let us know. We are using an evolutionary 
development model (prototyping), so we are open to design changes 
should there be a need for that. You can download the source code 
(VC++/DevCpp projects only for now, sorry) from our site here: 
http://hovedprosjekter.hig.no/v2005/data/Gr5_unicode (the files 
section). On the same site we have also set up a forum where you can 
discuss different aspects of the project if you want to keep it off 
list. No registration is required, so it should be hassle (password) free.

Current design:
The current design is based around the concept of «encoding_traits». 
These are templated on the different encodings used in Unicode (UTF-8, 
UTF-16 and UTF-32, the latter two in both byte orders), and provide 
functions and typedefs for working on code units (8, 16 and 32-bit 
integers respectively) in any encoding. These traits are then used for 
implementing different interfaces that externally use 32-bit code 
points, thereby abstracting away the underlying encoding.
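
To make the traits idea a bit more concrete, here is a minimal sketch of what such a traits class could look like. Note that the names (utf8_tag, utf16_tag, code_point, encode, encode_all) are made up for illustration and are not taken from the prototype. Byte order handling and validation of the input are left out, and only encoding is shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encoding tags (not the prototype's names).
struct utf8_tag {};
struct utf16_tag {};   // native byte order only, for brevity

typedef std::uint32_t code_point;   // external interface uses 32-bit code points

template<typename encoding> struct encoding_traits;

template<> struct encoding_traits<utf8_tag>
    {
    typedef std::uint8_t code_unit;

    // Append the UTF-8 code units for cp. Assumes cp is a valid scalar value.
    static void encode(code_point cp, std::vector<code_unit>& out)
        {
        if (cp < 0x80)
            out.push_back(static_cast<code_unit>(cp));
        else if (cp < 0x800)
            {
            out.push_back(static_cast<code_unit>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<code_unit>(0x80 | (cp & 0x3F)));
            }
        else if (cp < 0x10000)
            {
            out.push_back(static_cast<code_unit>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<code_unit>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<code_unit>(0x80 | (cp & 0x3F)));
            }
        else
            {
            out.push_back(static_cast<code_unit>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<code_unit>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<code_unit>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<code_unit>(0x80 | (cp & 0x3F)));
            }
        }
    };

template<> struct encoding_traits<utf16_tag>
    {
    typedef std::uint16_t code_unit;

    // Append the UTF-16 code units for cp (a surrogate pair above U+FFFF).
    static void encode(code_point cp, std::vector<code_unit>& out)
        {
        if (cp < 0x10000)
            out.push_back(static_cast<code_unit>(cp));
        else
            {
            cp -= 0x10000;
            out.push_back(static_cast<code_unit>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<code_unit>(0xDC00 | (cp & 0x3FF)));
            }
        }
    };

// Generic code sees only code points; the traits hide the code units.
template<typename encoding>
std::vector<typename encoding_traits<encoding>::code_unit>
encode_all(const std::vector<code_point>& text)
    {
    std::vector<typename encoding_traits<encoding>::code_unit> out;
    for (std::size_t i = 0; i < text.size(); ++i)
        encoding_traits<encoding>::encode(text[i], out);
    return out;
    }
```

The point is that encode_all never mentions bytes or surrogates; everything encoding-specific lives in the traits specialization.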

The string class itself is designed with encoding transparency in mind, 
also at the class level. This means that the encoding used in the string 
is not a template parameter of the string class itself (making each 
instantiation of the string its own type), but rather a parameter of an 
implementation class that is used internally to hold the string. 
Something like this (highly simplified):

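class impl_base
     {
public:
     virtual ~impl_base() {}
     // A lot of pure virtual functions for manipulating a string.
     };

template<typename encoding>
class impl : public impl_base   // must derive from impl_base to be stored in m_impl
     {
     // Implement the functions...obviously.
     };

class encoded_string
     {
     impl_base* m_impl;

public:
     encoded_string() : m_impl(0) {}
     ~encoded_string() { delete m_impl; }   // copying is left out of this sketch

     template<typename encoding>
     void set_encoding(encoding enc_tag)
         {
         delete m_impl;   // release the previous implementation, if any
         m_impl = new impl<encoding>();
         }
     };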

The reason for doing this is that it allows functions that take 
encoded_string parameters to be blissfully unaware of what encoding they 
are working on, without having to templatize (is that a word?) the 
function itself. (Something I understood was a bit of a worry for some 
in the last discussion.) An alternate way of doing this (something we 
also tested when developing the current version), is to simply template 
the string class itself on the encoding, but then you lose the above 
advantage of being able to have non-template functions working on 
multiple encodings. You do however gain speed (I would assume), since you 
wouldn't have the overhead of virtual function calls, as well as a less 
complex implementation.
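
For comparison, a rough sketch of that alternate design follows. Again, every name in it (basic_encoded_string, code_unit_of, push_unit, storage_bytes and the tags) is invented for this example and does not come from the prototype. It only stores raw code units, which is enough to show the trade-off.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encoding tags (not the prototype's names).
struct utf8_tag {};
struct utf32_tag {};

// Maps an encoding tag to its code unit type.
template<typename encoding> struct code_unit_of;
template<> struct code_unit_of<utf8_tag>  { typedef std::uint8_t  type; };
template<> struct code_unit_of<utf32_tag> { typedef std::uint32_t type; };

// The encoding is part of the string's type: no virtual calls needed,
// but every encoding gives a distinct, unrelated class.
template<typename encoding>
class basic_encoded_string
    {
public:
    typedef typename code_unit_of<encoding>::type code_unit;

    void push_unit(code_unit u) { m_units.push_back(u); }
    std::size_t unit_count() const { return m_units.size(); }

private:
    std::vector<code_unit> m_units;
    };

// A function that should accept any encoding now has to be a template.
template<typename encoding>
std::size_t storage_bytes(const basic_encoded_string<encoding>& s)
    {
    typedef typename basic_encoded_string<encoding>::code_unit unit;
    return s.unit_count() * sizeof(unit);
    }
```

With the encoded_string/impl_base design, storage_bytes could instead be an ordinary function taking a const encoded_string&, at the cost of a virtual call inside it.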

There's also an implementation of the Unicode Character Database in the 
prototype, along with an implementation of the normalization algorithms, 
but I won't go into the details of them here (to keep this from becoming 
a novel). They should be easy enough to follow from the source if you 
want to dig in.

Anyway.. Comments are, as always, welcome, either here or in the forum 
at the site.

Regards
- Erik

To Eric Niebler: Did you receive the mail I sent you a while back about 
the whole contact-person debacle? (on the Boost Consulting address) 
I never got a reply, so I'm not sure if it went through.

Erik Wien