[boost] Re: [Unicode strings] We're off

16 Mar 2005

      "Erik Wien" <wien@start.no> wrote in message 
news:d19pdf$jhu$1@sea.gmane.org...
Thorsten Ottosen wrote:
...
Hi Erik,
...
Is it entirely improper to make unicode strings a typedef
for std::basic_string<...> ?
|Not entirely, but certainly less than optimal. basic_string (and the
|iostreams) make assumptions that don't necessarily apply to Unicode text.
|One of them is that strings can be represented as a sequence of equally
|sized characters. Unicode can be represented that way, but that would
|mean you'd have to use 32 bits per character to be able to represent all
|the code points assigned in the Unicode standard. In most cases, that is
|way too much overhead for a string, and usually also a waste, since
|Unicode code points rarely require more than 16 bits to be encoded. You
|could of course implement Unicode for 16 bit characters in basic_string,
|but that would require that the user know about things like surrogate
|pairs, and also know how to correctly handle them. An unlikely scenario.
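
To make the surrogate-pair point concrete, here is a minimal sketch of what
a user of a 16-bit basic_string would have to get right by hand. The names
encode_utf16 and count_code_points are hypothetical, chosen only for this
illustration; they are not part of the code under discussion, and the sketch
assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not itself a surrogate).
inline std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Needs a surrogate pair: 20 payload bits split 10/10.
        cp -= 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return units;
}

// Hypothetical helper: count code points, not code units.
inline std::size_t count_code_points(const std::vector<std::uint16_t>& units)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < units.size(); ++i) {
        const bool high = units[i] >= 0xD800 && units[i] <= 0xDBFF;
        const bool low_follows = i + 1 < units.size()
            && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
        if (high && low_follows) {
            ++i; // skip the low surrogate; the pair is one code point
        }
        ++count;
    }
    return count;
}
```

For example, U+1D11E encodes to the two units 0xD834 0xDD1E, so size() on a
16-bit basic_string would report 2 for what is a single code point.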

I'm not sure I get this, probably because I just don't know enough about
this subject.

Ok, so basic_string< char, char_traits<char>, allocator<char> >
makes assumptions. So what? I was implying that you should
write a specialization

basic_string< char, utf_traits<char>, allocator<char> >:

    template< class T, class UTF >
    class basic_string<T,utf_traits<UTF>,std::allocator<T> >
    {
    public:
        basic_string()
        {
        }
        ...
    };

    typedef basic_string< char, utf_traits<utf8> > utf8_string;

What is it you wouldn't be able to do with this interface?

|Normally I would not think so, and my first implementation did not work
|this way. That one was implemented with the entire string class being
|templated on encoding, thereby eliminating the whole implementation
|inheritance tree found in this implementation.
|
|There was however (as far as I could tell at least) some concern about
|this approach in the other thread. (Mostly related to code size and

hm...the function is only going to be used by 3 different classes, right?
If so, the code is at most 3 times the size of a virtual function solution;
v-tables take up space too; and virtual functions in a class template
can have a *large* code size impact if not all of the virtual functions
are used. (So are they all used?)

|being locked into an encoding at compile time.)

sometimes strong typesafety is good; sometimes it's not

| Some thought that could
|be a problem for XML parsers and related technology that needs to
|establish encoding at run-time. (When reading files for example)

ok, that seems to motivate having some form of dynamic type.
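
As a rough illustration of what "establishing encoding at run-time" can look
like when reading a file, here is a sketch that sniffs a byte order mark. The
enum and the function name detect_bom are hypothetical, and a real XML parser
would also have to look at the encoding declaration; this only covers the
BOM case.

```cpp
#include <cstddef>

// Hypothetical run-time encoding tag.
enum class encoding { utf8, utf16_le, utf16_be, utf32_le, utf32_be, unknown };

// Hypothetical helper: guess the encoding from a leading byte order mark.
inline encoding detect_bom(const unsigned char* p, std::size_t n)
{
    // The UTF-32 checks come first: the UTF-32 LE mark (FF FE 00 00)
    // begins with the UTF-16 LE mark (FF FE).
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return encoding::utf32_le;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return encoding::utf32_be;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return encoding::utf8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return encoding::utf16_le;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return encoding::utf16_be;
    return encoding::unknown; // no BOM; the caller must decide some other way
}
```

The result is only known once the bytes have been read, which is why a string
type fixed to one encoding at compile time is awkward here.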

| This
|new implementation was simply a test to see if an alternate solution
|could be found, without those drawbacks. (It has a plethora of new ones
|though.)
|I am more than willing to change this if the current design is no good.
|Starting a discussion on this is one of my main reasons for posting the
|code in the first place.

It seems to me that we then need four classes:

utf8_string
utf16_string
utf32_string
utf_string  // the dynamic one
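
A minimal sketch of how those four could be laid out, purely as an
assumption for discussion: the three static ones share one class template
parameterized on an encoding tag, and the dynamic one carries its encoding
as a run-time value. Everything here other than the four class names
(encoded_string, encoding_id, the tag structs, the member names) is
hypothetical, and the sketch uses current C++ (char16_t, char32_t) for
brevity.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical encoding tags, each naming its code unit type.
struct utf8  { using unit = char; };
struct utf16 { using unit = char16_t; };
struct utf32 { using unit = char32_t; };

// One template for the three statically typed strings.
template <class Encoding>
class encoded_string
{
public:
    using unit_type = typename Encoding::unit;

    encoded_string() = default;
    explicit encoded_string(std::basic_string<unit_type> units)
        : units_(std::move(units)) {}

    // Number of code units, which is not the number of code points.
    std::size_t unit_count() const { return units_.size(); }

private:
    std::basic_string<unit_type> units_;
};

using utf8_string  = encoded_string<utf8>;
using utf16_string = encoded_string<utf16>;
using utf32_string = encoded_string<utf32>;

// The dynamic one: the encoding is chosen at run-time.
enum class encoding_id { utf8, utf16, utf32 };

class utf_string
{
public:
    utf_string(encoding_id enc, std::vector<unsigned char> bytes)
        : enc_(enc), bytes_(std::move(bytes)) {}

    encoding_id encoding() const { return enc_; }
    std::size_t byte_count() const { return bytes_.size(); }

private:
    encoding_id enc_;
    std::vector<unsigned char> bytes_;
};
```

With this layout, mixing utf8_string and utf16_string is a compile-time
error, while utf_string can hold whatever a parser finds in a file.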