[boost] Re: Boost Unicode support ideas

13 Apr 2004


In article <00c101c42167$8d0a7f40$1b440352@fuji>,
"John Maddock" <john@johnmaddock.co.uk> wrote:
...
...
- The standard facets (and the locale class itself, in that it is a
   functor for comparing basic_strings) are tied to facilities such as
   std::basic_string and std::ios_base which are not suitable for
   Unicode support.
Why not?  Once the locale facets are provided, the std iostreams will "just
work", that was the whole point of templating them in the first place.
I have already gone over this in other posts, but, in short, std::basic_string 
makes performance guarantees that are at odds with Unicode strings.
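
One way to read this point (an assumption, since the message does not spell it out): basic_string indexes code units in constant time, but locating the n-th code point of a UTF-8 string needs a linear scan. A minimal C++11 sketch, with the hypothetical helper name code_point_offset:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset at which the n-th code point
// (0-based) starts in a UTF-8 string, or s.size() if the string holds fewer
// than n + 1 code points. Assumes s is well-formed UTF-8.
std::size_t code_point_offset(const std::string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // Continuation bytes have the bit pattern 10xxxxxx; every other
        // byte starts a new code point.
        bool starts_code_point = (b & 0xC0) != 0x80;
        if (starts_code_point) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return s.size();
}
```

The loop visits every byte before the requested code point, so the cost grows with the position, unlike operator[] on the underlying code units.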
...
However I think we're getting ahead of ourselves here:  I think a Unicode
library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
That you consider this a reasonable first step leads me to believe that you
have not given much thought to the fact that, even in a 32-bit Unicode
encoding, a single character can take up more than 32 bits (and likewise for
the 16-bit and 8-bit encodings). Unicode characters are not fixed-width data
in any encoding.
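
A minimal sketch of the width problem, assuming a C++11 compiler with char32_t and std::u32string (neither existed when this was written). The helper names are hypothetical. An "e" followed by U+0301 COMBINING ACUTE ACCENT is one user-perceived character but two UTF-32 code units:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Storage taken by a UTF-32 sequence, in bits (32 per code unit).
std::size_t bits_used(const std::u32string& s) { return s.size() * 32; }

// One user-perceived character, stored as a single code point (U+00E9).
std::u32string e_acute_precomposed() { return U"\u00E9"; }

// The same user-perceived character, stored as a base letter plus a
// combining mark (U+0065 U+0301).
std::u32string e_acute_decomposed() { return U"e\u0301"; }
```

Here bits_used(e_acute_decomposed()) is 64, so a fixed 32-bit "character" type cannot hold every character as a reader would see it.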
...
2) define iterator adapters to convert a sequence of one Unicode character
type to another.
This is also not as easy as you seem to believe, because even within one
encoding many strings have multiple representations: e-acute, for example, can
be the single code point U+00E9 or the pair U+0065 U+0301.
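
A sketch of why a plain code-point-by-code-point converter does not settle this, assuming C++11 and the hypothetical helper names append_utf8 and to_utf8. Two canonically equivalent UTF-32 strings (U+00E9, and U+0065 followed by U+0301) convert to different UTF-8 byte sequences, so equality on the converted strings still fails unless something normalizes them first:

```cpp
#include <cassert>
#include <string>

// Appends the UTF-8 encoding of one code point.
// Assumes cp is a Unicode scalar value (not a surrogate, at most U+10FFFF).
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts UTF-32 to UTF-8 one code point at a time, with no normalization.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) append_utf8(out, cp);
    return out;
}
```

The conversion itself is mechanical; the hard part is deciding which of the equivalent forms the adapter should accept, preserve, or produce.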
...
3) define char_traits specialisations (as necessary) in order to get
basic_string working with Unicode character sequences, typedef the
appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will either produce a
basic_string that can violate the well-formedness of Unicode strings whenever
you use any mutation algorithm other than concatenation, or you will violate
the performance guarantees of basic_string.
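
A minimal sketch of the well-formedness half of this, assuming UTF-8 held in a plain std::string and a hypothetical checker named is_well_formed_utf8. Any code-unit-level mutation, such as erasing the last element, can cut a multi-byte sequence in half:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8. Rejects stray continuation bytes,
// truncated sequences, overlong forms, surrogates, and values above U+10FFFF.
bool is_well_formed_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        char32_t lowest;  // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        lowest = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; lowest = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; lowest = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; lowest = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (i + len > s.size()) return false;  // sequence cut short
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < lowest || cp > 0x10FFFF) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        i += len;
    }
    return true;
}
```

For example, the five-byte string "caf\xC3\xA9" passes the check, but after s.erase(s.size() - 1) it no longer does, even though the call is a perfectly legal basic_string operation.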
...
7) Anything I've forgotten :-)
I think you have forgotten to read up on the complexity of Unicode (in the
spec, or in any of the books that discuss it less tersely, such as Unicode
Demystified), because some of the suggestions you made here are incompatible
with how Unicode actually works. Please correct me if I am wrong --
I would love to be wrong :-)
...
The main goal would be to define a good clean interface, the implementation
could be:
We can't define a good clean interface until we understand the problems.

meeroh


Miro Jurisic