
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work", that was the whole point of templating them in the first place.
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
basic_string is a sequence of code points, no more, no less; all performance guarantees for basic_string can be met as such.
However, I think we're getting ahead of ourselves here: a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
The fact that you believe this is a reasonable first step leads me to believe that you have not given much thought to the fact that even if you use a 32-bit Unicode encoding, a character can take up more than 32 bits (and likewise for 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any encoding.
Well, it is the same first step that ICU takes: there is also a proposal before the C language committee to introduce such data types (they're called char16_t and char32_t), and C++ is likely to follow suit (see http://std.dkuug.dk/jtc1/sc22/wg14/www/docs/n1040.pdf). I'm talking about code points (and sequences thereof), not characters or glyphs, which, as you say, can consist of multiple code points. I would handle "characters" and "glyphs" as iterator adapters sitting on top of sequences of code points. For code points, basic_string is as good a container as any (as are vector and deque and anything else you care to define).
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is also not as easy as you seem to believe, because even within one encoding many strings can have multiple representations.
I'm not talking about normalisation / composition here: just conversion between encodings. ICU does this already, as do many other libraries. Iterator adapters for normalisation / composition / compression would also be useful additions, as would adapters for iterating "characters" and "glyphs".
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will produce a basic_string which can violate well-formedness of Unicode strings when you use any mutation algorithm other than concatenation, or you will violate performance guarantees of basic_string.
Working on sequences of code points always requires care: clearly one could erase a low surrogate and leave a high surrogate "orphaned" behind, for example. One would need to make it clear in the documentation that potential problems like this can occur.
7) Anything I've forgotten :-)
I think you have forgotten to read and understand the complexity of Unicode (or any of the books that discuss the spec less tersely, such as Unicode Demystified), because I think that some of the suggestions you made here are incompatible with how Unicode actually works. Please correct me if I am wrong -- I would love to be wrong :-)
Well, sometimes I'm wrong, and sometimes I'm right ;-) Unicode is such a large and complex issue that it's actually pretty hard to keep even a small fraction of the issues in one's mind at a time, hence my suggestion to split the work up into a series of steps.
The main goal would be to define a good clean interface, the implementation could be:
We can't define a good clean interface until we understand the problems.
Obviously. John.