Re: [boost] Re: [Unicode strings] We're off

18 Mar 2005

      On Wed, 16 Mar 2005 18:13:36 +0100, Erik Wien <wien@start.no> wrote:
...
Not entirely, but certainly less than optimal. basic_string (and the
iostreams) make assumptions that don't necessarily apply to Unicode text.
One of them is that strings can be represented as a sequence of equally
sized characters. Unicode can be represented that way, but that would
mean you'd have to use 32 bits per character to be able to represent all
the code points assigned in the Unicode standard. In most cases, that is
way too much overhead for a string, and usually also a waste, since
Unicode code points rarely require more than 16 bits to be encoded. You
could of course implement Unicode for 16-bit characters in basic_string,
but that would require that the user know about things like surrogate
pairs, and also know how to correctly handle them. An unlikely scenario.

Looking at the code, it seems to duplicate a lot of what basic_string
does. AFAIK, though I haven't looked that closely at Unicode, you have
two ways of viewing the string: as a string of UTF-* elements(?), and
as a string of characters. The former has the same properties as
basic_string; the latter doesn't.
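
To make the two views concrete, here is a minimal sketch, not taken from
the library under discussion. It assumes UTF-16 stored in a
std::u16string (C++11 or later) and well-formed input; the names
append_code_point and count_code_points are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-16 string.
// Code points above U+FFFF need two 16-bit elements (a surrogate pair).
inline void append_code_point(std::u16string& out, char32_t cp) {
    // Precondition: a valid code point that is not itself a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
}

// Hypothetical helper: count code points, assuming well-formed UTF-16.
// Every element except a low surrogate starts a new code point.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    }
    return n;
}
```

For a string built from U+0041 and U+1D11E, size() reports three
elements while count_code_points reports two characters, which is the
gap between the two views.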

It seems to me, then, that a possible design would be to make it a
basic_string and provide special iterators etc. that view the string
as characters. This would require the iterator to have a reference to
the basic_string to be able to support assignment. Maybe it would
require a whole wrapper class around basic_string to provide the
required functionality.
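
A rough sketch of what such an iterator might look like, read-only to
keep it short. It assumes UTF-16 in a std::u16string and well-formed
input (no unpaired surrogates); code_point_iterator is a hypothetical
name, not something from the proposed library.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical read-only iterator that walks a UTF-16 string one code
// point at a time. Assumes well-formed UTF-16.
class code_point_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;  // decoded on the fly, so returned by value

    code_point_iterator(const std::u16string& s, std::size_t pos)
        : str_(&s), pos_(pos) {
        assert(pos <= s.size());
    }

    // Decode the code point starting at the current position.
    char32_t operator*() const {
        char16_t hi = (*str_)[pos_];
        if (is_high_surrogate(hi)) {
            char16_t lo = (*str_)[pos_ + 1];
            return 0x10000 + ((char32_t(hi) - 0xD800) << 10)
                           + (char32_t(lo) - 0xDC00);
        }
        return hi;
    }

    // Step over one or two 16-bit elements, depending on the character.
    code_point_iterator& operator++() {
        pos_ += is_high_surrogate((*str_)[pos_]) ? 2 : 1;
        return *this;
    }
    code_point_iterator operator++(int) {
        code_point_iterator old = *this;
        ++*this;
        return old;
    }

    friend bool operator==(const code_point_iterator& a,
                           const code_point_iterator& b) {
        return a.str_ == b.str_ && a.pos_ == b.pos_;
    }
    friend bool operator!=(const code_point_iterator& a,
                           const code_point_iterator& b) {
        return !(a == b);
    }

private:
    static bool is_high_surrogate(char16_t u) {
        return u >= 0xD800 && u <= 0xDBFF;
    }

    const std::u16string* str_;  // the underlying string being viewed
    std::size_t pos_;            // index of the current 16-bit element
};
```

Constructing one iterator at position 0 and one at size() gives a
character-wise range over the string. Assignment through the iterator
is left out: replacing a one-element character with a two-element one
changes the string's length, so a writable version would need a
non-const reference to the string and would invalidate other iterators.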

Rakshasa


Sundell Software