[boost] status of Boost Unicode library/enhancements ?

28 Mar 2006

      Hello,

I just scanned about 300 boost-devel messages with the word "Unicode"
and am very excited about the occasional mentions I see of a Boost
Unicode library.

Is that project still alive?  Is there a prototype or beta of any
sort, or even a simple statement of goals I can look at for the
proposed boost project?

I am about to embark on a large text processing (but _not_ display)
project and could make use of such a library.  (A digression: part of
it will even involve processing Thai text, which seems to be the #1
cited example of a weird language as far as i18n is concerned.
Having myself typeset a 283-page bilingual Thai-English book, I have
to agree :) )

The last references I found were from late 2005, where Graham Barnett
mentioned that a Unicode library was under development:

  http://thread.gmane.org/gmane.comp.lib.boost.devel/128403
  http://thread.gmane.org/gmane.comp.lib.boost.devel/129807

I tried searching the vault for 'unicode' but no dice.

I have examined (and would use by default) ICU from IBM:

  http://icu.sourceforge.net/userguide/intro.html

I would use its C++ UnicodeString, CharacterIterator, Locale-based
codepage converters, Normalization support, Collation support, and
regex matching (in particular with regexes that match character
classes like "nonspacing mark").

How do the proposed Boost library's capabilities differ from those
offered by ICU?

I've seen that there is ICU integration in Boost.Regex:

  http://www.boost.org/libs/regex/doc/unicode.html

And of course it is possible today to store UTF-16 data in a
std::wstring (on platforms where wchar_t is 16 bits wide) and convert
between UTF-8, UTF-16, and UTF-32 using various easily available
routines.  But as you can see above, I need more capability than
just that.
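
For concreteness, here is a minimal sketch of the kind of conversion
routine meant above, turning UTF-16 code units into UTF-8.  The names
append_utf8 and utf16_to_utf8 are hypothetical (they come from neither
ICU nor Boost), and the sketch assumes a C++11 compiler so that
std::u16string is available:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of ICU or Boost): append one Unicode
// scalar value to a UTF-8 encoded std::string.
inline void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert UTF-16 code units to UTF-8, combining
// surrogate pairs and rejecting unpaired surrogates.
inline std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            if (i + 1 >= in.size())
                throw std::invalid_argument("truncated surrogate pair");
            std::uint32_t lo = in[++i];
            if (lo < 0xDC00 || lo > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {     // stray low surrogate
            throw std::invalid_argument("unpaired low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

A routine like this only moves code points between encodings.  It
knows nothing about normalization, collation, or character classes,
which is exactly the gap described above.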

ICU is probably sufficient, but I thought it might be nice to use
something that fits in with the rest of boost and STL more nicely.
Something that used/extended existing string mechanisms, iteration
mechanisms, and conversion mechanisms (e.g. those "code conversion
facets" which I do not yet understand :).  Consistent naming, error
reporting, and coding conventions would be a superficial but nice
added bonus.

I would hope that any such library would make some attempt at
performance enhancements such as ICU's UnicodeString's ability to
alias other strings to avoid copies, or store very small strings
inline.  Since ICU has now disabled some of those enhancements:

  http://icu.sourceforge.net/userguide/strings.html#unistr_performance

perhaps that would provide the Boost library an opportunity
to beat ICU's performance!

Thanks for all updates,

     - Chris Pirazzi