
Miro Jurisic <macdev@meeroh.org> writes:
In article <00c101c42167$8d0a7f40$1b440352@fuji>, "John Maddock" <john@johnmaddock.co.uk> wrote:
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work", that was the whole point of templating them in the first place.
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
Only if you use an encoding other than UTF-32/UCS-4. This has to be a (POD) UDT rather than a typedef, so that one may specialize std::char_traits. Of course, if this gets standardized, then it can be a built-in, since the standard can specialize its own templates.
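To make the point concrete, here is a minimal sketch of what such a POD UDT plus char_traits specialization might look like --- utf32_t is just a placeholder name, not a proposal, and a real library would flesh out the details:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>
#include <ios>
#include <string>

// Placeholder POD code-unit type for UTF-32/UCS-4; one object = one code point.
struct utf32_t { std::uint32_t value; };

namespace std {
// Legal because utf32_t is a user-defined type: specialize char_traits
// so that basic_string<utf32_t> can be instantiated.
template<> struct char_traits<utf32_t> {
    typedef utf32_t        char_type;
    typedef std::uint32_t  int_type;
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& c1, const char_type& c2) { c1 = c2; }
    static bool eq(char_type a, char_type b) { return a.value == b.value; }
    static bool lt(char_type a, char_type b) { return a.value < b.value; }
    static int compare(const char_type* s1, const char_type* s2, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(s1[i], s2[i])) return -1;
            if (lt(s2[i], s1[i])) return 1;
        }
        return 0;
    }
    static std::size_t length(const char_type* s) {
        std::size_t n = 0;
        while (s[n].value != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n, const char_type& a) {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], a)) return s + i;
        return 0;
    }
    static char_type* move(char_type* dst, const char_type* src, std::size_t n) {
        return static_cast<char_type*>(std::memmove(dst, src, n * sizeof(char_type)));
    }
    static char_type* copy(char_type* dst, const char_type* src, std::size_t n) {
        return static_cast<char_type*>(std::memcpy(dst, src, n * sizeof(char_type)));
    }
    static char_type* assign(char_type* s, std::size_t n, char_type a) {
        for (std::size_t i = 0; i < n; ++i) s[i] = a;
        return s;
    }
    static int_type to_int_type(char_type c) { return c.value; }
    static char_type to_char_type(int_type i) { char_type c = { i }; return c; }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type eof() { return 0xFFFFFFFFu; }
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
};
}

typedef std::basic_string<utf32_t> ucs4_string;
```

Because each utf32_t is a whole code point, every basic_string operation (indexing, substring, mutation) stays well-formed, which is exactly what the variable-width encodings cannot promise.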
However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
The fact that you believe this is a reasonable first step leads me to believe that you have not given much thought to the fact that even if you use a 32-bit Unicode encoding, a character can take up more than 32 bits (and likewise for 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any encoding.
Yes, but a code point is 32 bits, and code points can come in any sequence. A given sequence of code points may or may not have a valid semantic meaning as a "character", but that is like debating whether or not "fjkp" is a valid word --- beyond the scope of basic string handling facilities.
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is also not as easy as you seem to believe, because even within one encoding many strings can have multiple representations.
That is why there are various canonical forms defined. We should provide a means of converting to the canonical forms. However, this is independent of Unicode encoding --- the same sequence of code points can be represented in each Unicode encoding in precisely one way.
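To illustrate the "precisely one way" point: encoding a single code point is fully deterministic. A rough sketch (the function name is mine, and this is the code-point-to-bytes step only, not the iterator adapter interface itself):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode one Unicode code point as UTF-8. Each valid code point has
// exactly one UTF-8 representation, so this mapping has no choices in it.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: ASCII range
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF)  // surrogates are not code points
            throw std::range_error("surrogate values are not encodable");
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x110000) {            // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::range_error("code point out of range");
    }
    return out;
}
```

An iterator adapter over a UTF-32 sequence would just apply this per code point; note that it says nothing about canonical equivalence --- normalization operates on the code point sequence before any of this happens.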
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will produce a basic_string which can violate well-formedness of Unicode strings when you use any mutation algorithm other than concatenation, or you will violate performance guarantees of basic_string.
Yes. basic_string<CharType> relies on each CharType being a valid entity in its own right --- for Unicode this means it must be a single Unicode code point, so using basic_string for UTF-8 is out.

You are right that Unicode does not play fair with most standard locale facilities, especially case conversions (1-1, 1-many, 1-0, context sensitivity (which could be seen as many-many), locale specifics).

Collation is one area where the standard library facilities should be OK, since the standard library collation support deals with whole strings. When you install the collation facet in your locale, you choose the Unicode collation options that are relevant to you.

Anthony
-- 
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.
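[A postscript on the collation point above: since the collate facet compares whole strings, client code never sees the character-level complications. A minimal sketch --- it uses the classic "C" locale so it is portable, where a real program would imbue a locale whose collate facet implements the Unicode collation algorithm with the chosen options:]

```cpp
#include <locale>
#include <string>

// Whole-string comparison via the standard collate facet.
// Swapping in a Unicode-aware locale changes the ordering,
// not this calling code.
bool collate_less(const std::string& a, const std::string& b,
                  const std::locale& loc = std::locale::classic()) {
    const std::collate<char>& coll =
        std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size()) < 0;
}
```

This is also what makes the locale itself usable as a comparison functor for basic_strings: std::locale::operator() forwards to the same facet.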