Re: [boost] Re: Boost Unicode support ideas

14 Apr 2004


Anthony Williams <anthony_w.geo@yahoo.com> writes:
...
Miro Jurisic <macdev@meeroh.org> writes:
[snip]
I have already gone over this in other posts, but, in short,
std::basic_string makes performance guarantees that are at odds with Unicode
strings.
...
Only if you use an encoding other than UTF-32/UCS-4. This has to be a (POD) UDT
rather than a typedef, so that one may specialize std::char_traits. Of course,
if this gets standardized, then it can be a built-in, since the standard can
specialize its own templates.
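
A minimal sketch of the POD-UDT approach described above, assuming a
C++11 or later compiler. The names utf32_char, make_utf32, utf32_string
and demo_utf32_string are hypothetical and come from no Boost proposal;
the traits members are simply the set that std::char_traits is required
to provide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>   // std::mbstate_t
#include <ios>      // std::streamoff, std::streampos
#include <string>

// Hypothetical code-point type: a distinct POD struct, not a typedef,
// so that std::char_traits may legally be specialized for it.
struct utf32_char {
    std::uint32_t value;
};

inline utf32_char make_utf32(std::uint32_t v) {
    utf32_char c;
    c.value = v;
    return c;
}

namespace std {
template <>
struct char_traits<utf32_char> {
    typedef utf32_char char_type;
    typedef std::uint32_t int_type;
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& dst, const char_type& src) { dst = src; }
    static bool eq(const char_type& a, const char_type& b) { return a.value == b.value; }
    static bool lt(const char_type& a, const char_type& b) { return a.value < b.value; }

    static int compare(const char_type* a, const char_type* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    static std::size_t length(const char_type* s) {
        std::size_t n = 0;
        while (s[n].value != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n, const char_type& c) {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return 0;
    }
    static char_type* move(char_type* dst, const char_type* src, std::size_t n) {
        if (n != 0) std::memmove(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* copy(char_type* dst, const char_type* src, std::size_t n) {
        if (n != 0) std::memcpy(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* assign(char_type* s, std::size_t n, char_type c) {
        for (std::size_t i = 0; i < n; ++i) s[i] = c;
        return s;
    }

    // 0xFFFFFFFF is not a Unicode code point, so it can serve as eof.
    static int_type eof() { return 0xFFFFFFFFu; }
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
    static char_type to_char_type(int_type i) { return make_utf32(i); }
    static int_type to_int_type(char_type c) { return c.value; }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
};
}  // namespace std

typedef std::basic_string<utf32_char> utf32_string;

// Each element is one code point, so indexing and erasing stay well-formed.
void demo_utf32_string() {
    utf32_string s;
    s.push_back(make_utf32(0x48));     // U+0048 'H'
    s.push_back(make_utf32(0x1D11E));  // U+1D11E, outside the BMP
    s.push_back(make_utf32(0xE9));     // U+00E9
    assert(s.size() == 3);
    s.erase(1, 1);                     // removes exactly one code point
    assert(s.size() == 2);
}
```

Note that this only addresses code points; it says nothing about
combining sequences or the memory cost of four bytes per element.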
Performance guarantees aside, standardizing around UTF-32 is not, IMO,
practical.
[snip]
3) define char_traits specialisations (as necessary) in order to get
basic_string working with Unicode character sequences, typedef the
appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will produce a basic_string
which can violate well-formedness of Unicode strings when you use any
mutation algorithm other than concatenation, or you will violate performance
guarantees of basic_string.
...
Yes. basic_string<CharType> relies on each CharType being a valid entity in
its own right --- for Unicode this means it must be a single Unicode code
point, so using basic_string for UTF-8 is out.
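
To make the well-formedness point concrete, here is a minimal sketch,
assuming UTF-8 code units stored in a plain std::string. The helper
is_well_formed_utf8 and the function demo_utf8_truncation are
hypothetical names, and the check is structural only (lead and
continuation byte patterns; it does not reject overlong forms or
surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true when every lead byte is followed by the
// right number of continuation bytes. Not a full UTF-8 validator.
bool is_well_formed_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        if (lead < 0x80) len = 1;
        else if ((lead & 0xE0) == 0xC0) len = 2;
        else if ((lead & 0xF0) == 0xE0) len = 3;
        else if ((lead & 0xF8) == 0xF0) len = 4;
        else return false;                     // stray continuation or invalid lead
        if (i + len > s.size()) return false;  // sequence cut short
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}

void demo_utf8_truncation() {
    // U+00E9 is encoded in UTF-8 as the two code units 0xC3 0xA9.
    std::string s = "caf\xC3\xA9";
    assert(is_well_formed_utf8(s));

    // Concatenating well-formed strings keeps the result well-formed.
    assert(is_well_formed_utf8(s + s));

    // Erasing one element is cheap for basic_string, but it removes a
    // code unit, not a code point, and leaves a dangling lead byte.
    s.erase(s.size() - 1);
    assert(!is_well_formed_utf8(s));
}
```

Making erase() respect code-point boundaries instead would require
scanning the data, which is where the performance guarantees come in.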
basic_string can still be used as a low-level storage facility, although
it was certainly not designed to be used as such, and in treating it as
such, many of the "compatibility" advantages are lost anyway.  If you
are advocating internal representation in UTF-32, however, I would say
that performance measurements generally show that UTF-16 is
significantly faster for processing, so the convenience of fitting
UTF-32 neatly into the existing interfaces does not justify its cost.
...
You are right that Unicode does not play fair with most standard locale
facilities, especially case conversions: mappings can be 1-1, 1-many, or 1-0,
can be context-sensitive (which could be seen as many-many), and can depend on
the locale.
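
As an illustration of the 1-many case, here is a rough sketch, assuming
C++11 and UTF-32 input. The function to_upper_sketch and its tiny
hard-coded table are hypothetical; a real implementation would use the
full Unicode case-mapping data. It shows why a character-to-character
interface such as std::ctype::toupper cannot express the mapping: the
output can be longer than the input.

```cpp
#include <cassert>
#include <string>

// Hypothetical string-level uppercase. Only ASCII letters and two
// special cases are handled: U+00DF (sharp s) uppercases to "SS" and
// U+FB01 (fi ligature) uppercases to "FI".
std::u32string to_upper_sketch(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == 0x00DF) {
            out += U"SS";                            // 1-many
        } else if (c == 0xFB01) {
            out += U"FI";                            // 1-many
        } else if (c >= U'a' && c <= U'z') {
            out += static_cast<char32_t>(c - 0x20);  // 1-1
        } else {
            out += c;                                // unchanged
        }
    }
    return out;
}

void demo_upper() {
    std::u32string in = U"stra\u00DFe";
    std::u32string up = to_upper_sketch(in);
    assert(up == U"STRASSE");
    assert(up.size() == in.size() + 1);  // the string grew by one code point
}
```

Context-sensitive and locale-specific mappings are not covered by this
sketch; they need more than a per-code-point table.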
...
Collation is one area where the standard library facilities should be OK,
since the standard library collation support deals with whole strings. When
you install the collation facet in your locale, you choose the Unicode
collation options that are relevant to you.
Perhaps, except for the other issues which I have described in other
messages.

-- 
Jeremy Maitin-Shepard