
I think that there should be at most two _external_ Unicode string types:

  1. a vector of Unicode code points, where each element externally acts
     like an int32_t
  2. a vector of abstract Unicode characters, where each element is
     externally a group of Unicode code points (a primary or starting
     code point, followed by combining code points)

Option [2] is what we ultimately want, so [1] should be included only if
it is really needed.

Internally, such strings could use UTF-8, UTF-16, etc., but most users
won't care about that from the outside. Some users do care about the
external appearance (I think someone here wanted UTF-8 XML). In those
cases we would have a specific input or output routine that uses an
appropriate encoding object, hiding whether or not the Unicode string
internally uses the same encoding as the final source or sink.

The internals of the Unicode string could use the cluster concept, like
in Mac OS X's Cocoa. Here we would make concrete classes for UTF-8,
UTF-16, UTF-32, etc. strings. (We could fold normalization and other
factors into the combinations too.) The external class would keep a
union (or something like one) holding one of the concrete classes.

Iterators should probably be provided for code points and/or abstract
characters. Bidirectional traversal would be the easiest to support.
Such iterators should be configurable (at compile time and/or run time)
for the various normalization schemes.

Input needs special handling, since we shouldn't let invalid byte or
code-point combinations get into a Unicode string. We need something
that can enumerate over a byte stream in a particular encoding and emit
whole code points (or queue the code points and emit abstract
characters).

I've appended a few rough C++ sketches of these ideas after my sig.

-- 
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
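
Here is a minimal sketch of the "external class keeps a union of
concrete classes" idea. All the names are made up, and I'm using
std::variant as the union mechanism purely for brevity; a hand-rolled
discriminated union (or boost::variant) would serve the same role:

#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical concrete storage classes; each one knows its own encoding.
struct utf8_storage   { std::string            bytes;  };
struct utf16_storage  { std::u16string         units;  };
struct utf32_storage  { std::vector<char32_t>  points; };

namespace detail
{
    // Code-point counts for each representation (assumes well-formed data).
    inline std::size_t count_points( const utf8_storage &s )
    {
        std::size_t  n = 0;
        for ( unsigned char b : s.bytes )
            if ( (b & 0xC0) != 0x80 )        // skip UTF-8 continuation bytes
                ++n;
        return n;
    }

    inline std::size_t count_points( const utf16_storage &s )
    {
        std::size_t  n = 0;
        for ( char16_t u : s.units )
            if ( u < 0xDC00 || u > 0xDFFF )  // skip low (trailing) surrogates
                ++n;
        return n;
    }

    inline std::size_t count_points( const utf32_storage &s )
    { return s.points.size(); }
}

// External class: holds one of the concrete classes and exposes only
// encoding-independent operations.
class unicode_string
{
public:
    explicit unicode_string( utf8_storage s )   : impl_( std::move(s) )  {}
    explicit unicode_string( utf16_storage s )  : impl_( std::move(s) )  {}
    explicit unicode_string( utf32_storage s )  : impl_( std::move(s) )  {}

    std::size_t  code_point_count() const
    {
        return std::visit( []( const auto &s ){ return detail::count_points( s ); },
         impl_ );
    }

private:
    std::variant<utf8_storage, utf16_storage, utf32_storage>  impl_;
};

The point is that users of unicode_string never see which alternative is
inside; only construction and the input/output routines would have to
care about the actual encoding.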
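
For the abstract-character view, the grouping rule is "a starting code
point plus any combining code points that follow it." A real iterator
would walk these clusters lazily (and bidirectionally), and the
combining test would come from the Unicode character database; this toy
version only recognizes the basic combining diacritical marks, just to
show the shape of the operation:

#include <vector>

// Toy combining-mark test: only U+0300..U+036F. A real implementation
// needs the full Unicode data for this.
inline bool is_combining( char32_t cp )
{ return cp >= 0x0300 && cp <= 0x036F; }

// Group a code-point sequence into abstract characters: each cluster is
// a starting code point followed by its combining code points.
std::vector< std::vector<char32_t> >
clusters( const std::vector<char32_t> &points )
{
    std::vector< std::vector<char32_t> >  result;

    for ( char32_t cp : points )
    {
        if ( result.empty() || !is_combining( cp ) )
            result.emplace_back();           // start a new abstract character
        result.back().push_back( cp );
    }
    return result;
}

For example, { U+0065, U+0301 } ("e" plus combining acute) comes out as
one abstract character holding two code points. The normalization
policy would decide whether that cluster and the precomposed U+00E9
compare equal.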
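
Finally, the input side. This is the sort of enumeration routine I have
in mind for UTF-8 (each encoding would get its own version behind a
common interface); it refuses to hand back a code point unless the
bytes are a valid, shortest-form encoding of a Unicode scalar value, so
bad input never ends up inside a string. Again, the names are invented:

#include <optional>
#include <vector>

// Decode one UTF-8 sequence starting at 'it', advancing 'it' past the
// bytes consumed. Returns the code point, or nothing if the sequence is
// malformed, overlong, a surrogate, or out of range.
std::optional<char32_t>
next_code_point( std::vector<unsigned char>::const_iterator &it,
                 std::vector<unsigned char>::const_iterator  end )
{
    if ( it == end )
        return std::nullopt;

    unsigned char  lead = *it++;
    int            extra = 0;    // continuation bytes still expected
    char32_t       cp = 0;
    char32_t       minimum = 0;  // smallest value legal for this length

    if      ( lead < 0x80 )            { return lead; }
    else if ( (lead & 0xE0) == 0xC0 )  { extra = 1;  cp = lead & 0x1F;  minimum = 0x80; }
    else if ( (lead & 0xF0) == 0xE0 )  { extra = 2;  cp = lead & 0x0F;  minimum = 0x800; }
    else if ( (lead & 0xF8) == 0xF0 )  { extra = 3;  cp = lead & 0x07;  minimum = 0x10000; }
    else    return std::nullopt;       // stray continuation byte or bad lead

    for ( int i = 0 ; i < extra ; ++i )
    {
        if ( it == end || (*it & 0xC0) != 0x80 )
            return std::nullopt;       // truncated or malformed sequence
        cp = (cp << 6) | (*it++ & 0x3F);
    }

    if ( cp < minimum )                  return std::nullopt;  // overlong form
    if ( cp >= 0xD800 && cp <= 0xDFFF )  return std::nullopt;  // surrogate value
    if ( cp > 0x10FFFF )                 return std::nullopt;  // beyond Unicode
    return cp;
}

A real routine would also have to decide what happens on failure: how
far to skip, and whether to throw, substitute U+FFFD, or just stop. A
layer on top of this could queue the code points and emit abstract
characters instead, as mentioned above.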