Re: [boost] [unicode] Interest Check / Proof of Concept

20 Nov 2008

Eric Niebler wrote:
...
Zach Laine wrote:
...
...
Over the past few months, I've been tinkering with a Unicode string library.
It's still *far* from finished, but it's far enough along that the overall
structure is visible. I've seen a bunch of Unicode proposals for Boost come
and go, so hopefully this one will address the most common needs people
have.

I would love to see a Unicode support library added to Boost.
However, I question the usefulness of another string class, or in this
case another hierarchy of string classes.  Interoperability with
std::string (and QString, and CString, and a thousand other
API-specific string classes) is always thorny.  I'd much rather see an
iterators- and algorithms-based approach
<snip>

Agree. Thanks Zach. I'm discouraged that every time the issue of a
Unicode library comes up, the discussion immediately descends into a
debate about how to design yet another string class. Such a high level
wrapper *might* be useful (strong emphasis on "might"), but the core
must be the Unicode algorithms, and the design for a Unicode library
must start there.

I mostly agree.  If people want UTF-8 and UTF-16 iterator adaptors that
will efficiently convert byte-sequence iterators into Unicode character
iterators, then I probably already have exactly that.  Should I package
it up for review?
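
For readers unfamiliar with the idea, here is a minimal sketch of what a UTF-8 decoding iterator adaptor might look like. It is an illustration only, not the code offered above: the names utf8_decode_iterator and decode_utf8 are hypothetical, and the sketch assumes well-formed UTF-8 input and performs no validation.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical adaptor: wraps an iterator over UTF-8 bytes and yields
// one code point per dereference. Assumes well-formed input.
template <typename ByteIter>
class utf8_decode_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    utf8_decode_iterator() = default;
    explicit utf8_decode_iterator(ByteIter pos) : pos_(pos) {}

    // Decode the code point that starts at the current byte.
    char32_t operator*() const {
        ByteIter p = pos_;
        const unsigned char lead = static_cast<unsigned char>(*p);
        if (lead < 0x80) return lead;              // ASCII: one byte
        const int extra = trailing_bytes(lead);
        char32_t cp = lead & (0x3F >> extra);      // payload bits of the lead byte
        for (int i = 0; i < extra; ++i) {
            ++p;
            cp = (cp << 6) | (static_cast<unsigned char>(*p) & 0x3F);
        }
        return cp;
    }

    // Step over the whole encoded sequence, not just one byte.
    utf8_decode_iterator& operator++() {
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        std::advance(pos_, 1 + trailing_bytes(lead));
        return *this;
    }
    utf8_decode_iterator operator++(int) {
        utf8_decode_iterator old = *this;
        ++*this;
        return old;
    }

    ByteIter base() const { return pos_; }

    friend bool operator==(const utf8_decode_iterator& a,
                           const utf8_decode_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_decode_iterator& a,
                           const utf8_decode_iterator& b) {
        return !(a == b);
    }

private:
    // Number of continuation bytes implied by a lead byte.
    static int trailing_bytes(unsigned char lead) {
        if (lead >= 0xF0) return 3;
        if (lead >= 0xE0) return 2;
        if (lead >= 0xC0) return 1;
        return 0;
    }

    ByteIter pos_{};
};

// Convenience helper: decode a whole std::string into code points.
inline std::vector<char32_t> decode_utf8(const std::string& bytes) {
    using It = utf8_decode_iterator<std::string::const_iterator>;
    return std::vector<char32_t>(It(bytes.begin()), It(bytes.end()));
}
```

Because the adaptor only needs a pair of byte iterators, it works over std::string, a vector of bytes, or any other container, which is the interoperability argument made earlier in the thread.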

There are, however, a few points to consider.  Most importantly, if you 
operate on a UTF-8 string only using an iterator-adaptor then you'll 
miss out on most of the clever features of the encoding.  Specifically:

- If you need to search for an ASCII character in a UTF-8 string then 
you can do so just by scanning the bytes.
- Similarly, searching for substrings (including substrings with 
non-ASCII characters) can be done just by scanning for a bytewise match.
- Sorting (into code-point order) can be done using strcmp()-like
comparisons on the byte sequences.

An implementation that doesn't somehow exploit these optimisations will 
perform sub-optimally, and I don't think that would be acceptable.
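
The three points above can be sketched directly on std::string, with no decoding step at all. This is a rough illustration under stated assumptions: the function names are hypothetical, the input is assumed to be well-formed UTF-8, and substring matching assumes both strings use the same normalization form.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Bytes below 0x80 never occur inside a multi-byte UTF-8 sequence,
// so a plain byte scan finds ASCII characters correctly.
inline std::size_t find_ascii(const std::string& utf8, char ascii) {
    return utf8.find(ascii);
}

// Lead bytes and continuation bytes use disjoint value ranges, so a
// bytewise match of a well-formed needle always starts on a character
// boundary of the haystack.
inline std::size_t find_substring(const std::string& haystack,
                                  const std::string& needle) {
    return haystack.find(needle);
}

// Comparing the bytes as unsigned values gives the same ordering as
// comparing the decoded code points.
inline bool code_point_less(const std::string& a, const std::string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](char x, char y) {
            return static_cast<unsigned char>(x) <
                   static_cast<unsigned char>(y);
        });
}
```

Note that the ordering produced by code_point_less is code-point order only; it is not a locale-aware collation.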

I don't really have a complete solution to offer.  What I do have is 
the beginnings of a character-set traits class with booleans indicating 
things like "is an ASCII superset", "is variable-length" etc.  The idea 
is that algorithms could be specialised based on these traits.  I'm not 
sure how it all joins together yet though.
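
To make the traits idea concrete, here is one possible shape it could take. Everything in this sketch is an assumption made for illustration (the tag types, the charset_traits member names, the is_lead_unit helper and the count_chars algorithm); it is not the class described above. It shows an algorithm choosing a fast path from a boolean trait by tag dispatch.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical tag types naming character sets.
struct utf8_tag {};
struct latin1_tag {};   // ISO-8859-1

// Hypothetical traits class: compile-time facts about each charset.
template <typename Charset> struct charset_traits;

template <> struct charset_traits<utf8_tag> {
    static constexpr bool is_ascii_superset  = true;
    static constexpr bool is_variable_length = true;
    // A byte starts a character unless it is a continuation byte (10xxxxxx).
    static bool is_lead_unit(unsigned char u) { return (u & 0xC0) != 0x80; }
};

template <> struct charset_traits<latin1_tag> {
    static constexpr bool is_ascii_superset  = true;
    static constexpr bool is_variable_length = false;
};

namespace detail {
// Fixed-length charset: one unit per character, so the size is the count.
template <typename Charset>
std::size_t count_chars(const std::string& s, std::false_type) {
    return s.size();
}

// Variable-length charset: count only the units that start a character.
template <typename Charset>
std::size_t count_chars(const std::string& s, std::true_type) {
    return static_cast<std::size_t>(std::count_if(
        s.begin(), s.end(), [](char c) {
            return charset_traits<Charset>::is_lead_unit(
                static_cast<unsigned char>(c));
        }));
}
}  // namespace detail

// Public algorithm: picks an implementation from the trait at compile time.
template <typename Charset>
std::size_t count_chars(const std::string& s) {
    return detail::count_chars<Charset>(
        s, std::integral_constant<
               bool, charset_traits<Charset>::is_variable_length>());
}
```

The same dispatch pattern could be applied to is_ascii_superset, for example to select a raw byte scan when searching for an ASCII character.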

Cheers,  Phil.

Phil Endecott