
Eric Niebler wrote:
Zach Laine wrote:
Over the past few months, I've been tinkering with a Unicode string library. It's still *far* from finished, but it's far enough along that the overall structure is visible. I've seen a bunch of Unicode proposals for Boost come and go, so hopefully this one will address the most common needs people have.
I would love to see a Unicode support library added to Boost. However, I question the usefulness of another string class, or in this case another hierarchy of string classes. Interoperability with std::string (and QString, and CString, and a thousand other API-specific string classes) is always thorny. I'd much rather see an iterators- and algorithms-based approach <snip>
Agree. Thanks Zach. I'm discouraged that every time the issue of a Unicode library comes up, the discussion immediately descends into a debate about how to design yet another string class. Such a high level wrapper *might* be useful (strong emphasis on "might"), but the core must be the Unicode algorithms, and the design for a Unicode library must start there.
I mostly agree. If people want UTF-8 and UTF-16 iterator-adaptors that will efficiently convert byte-sequence iterators into unicode character iterators, then I probably already have exactly that. Should I package it up for review?

There are, however, a few points to consider. Most importantly, if you operate on a UTF-8 string only using an iterator-adaptor then you'll miss out on most of the clever features of the encoding. Specifically:

- If you need to search for an ASCII character in a UTF-8 string then you can do so just by scanning the bytes.
- Similarly, searching for substrings (including substrings with non-ASCII characters) can be done just by scanning for a bytewise match.
- Sorting can be done using strcmp()-like comparisons on the byte sequences.

An implementation that doesn't somehow exploit these optimisations will perform sub-optimally, and I don't think that would be acceptable.

I don't really have a complete solution to offer. What I do have is the beginnings of a character-set traits class with booleans indicating things like "is an ASCII superset", "is variable-length" etc. The idea is that algorithms could be specialised based on these traits. I'm not sure how it all joins together yet though.

Cheers,
Phil.
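
For illustration only, here is a rough sketch of how the three byte-level shortcuts and the traits idea might fit together. None of these names (utf8_tag, utf16le_tag, charset_traits, find_ascii, find_utf8, utf8_less) come from Phil's code or from any existing library; they are made up for this example. It assumes the text is held as raw bytes in a std::string and that the UTF-8 input is well-formed. The non-ASCII-superset fallback only understands UTF-16LE, where a real library would presumably go through the encoding's own decoding iterator.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical encoding tags; text is stored as raw bytes in std::string.
struct utf8_tag {};
struct utf16le_tag {};

// Hypothetical traits class carrying the booleans mentioned in the post.
template <class Encoding> struct charset_traits;

template <> struct charset_traits<utf8_tag> {
    // Every byte of a multi-byte UTF-8 sequence is >= 0x80, so a byte
    // below 0x80 can only be the ASCII character it looks like.
    static constexpr bool is_ascii_superset = true;
    static constexpr bool is_variable_length = true;
};

template <> struct charset_traits<utf16le_tag> {
    // At the byte level, 0x3C may be half of a non-ASCII code unit.
    static constexpr bool is_ascii_superset = false;
    static constexpr bool is_variable_length = true;  // surrogate pairs
};

namespace detail {

// Fast path: a plain byte scan is correct for ASCII supersets.
inline std::size_t find_ascii_impl(const std::string& bytes, char c,
                                   std::true_type) {
    return bytes.find(c);
}

// Slow path (UTF-16LE only in this sketch): compare whole 16-bit units.
inline std::size_t find_ascii_impl(const std::string& bytes, char c,
                                   std::false_type) {
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        unsigned unit = static_cast<unsigned char>(bytes[i]) |
                        (static_cast<unsigned char>(bytes[i + 1]) << 8);
        if (unit == static_cast<unsigned char>(c)) return i;
    }
    return std::string::npos;
}

}  // namespace detail

// Returns the byte offset of the first occurrence of the ASCII character c,
// picking the implementation from the encoding's traits at compile time.
template <class Encoding>
std::size_t find_ascii(const std::string& bytes, char c) {
    return detail::find_ascii_impl(
        bytes, c,
        std::integral_constant<
            bool, charset_traits<Encoding>::is_ascii_superset>());
}

// UTF-8 is self-synchronising: a bytewise match of a well-formed needle
// always begins on a character boundary.
inline std::size_t find_utf8(const std::string& haystack,
                             const std::string& needle) {
    return haystack.find(needle);
}

// Bytewise comparison of UTF-8 (bytes compared as unsigned values) orders
// strings by code point. This is not locale-aware collation.
inline bool utf8_less(const std::string& a, const std::string& b) {
    return a.compare(b) < 0;
}
```

With this, find_ascii<utf8_tag>(s, 'o') compiles down to a plain byte scan, while find_ascii<utf16le_tag>(s, '<') steps over 16-bit units and so is not fooled by the byte 0x3C inside U+043C. The ordering produced by utf8_less is code point order, which may or may not be what a caller means by "sorting".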