
I think that there should be at most two _external_ Unicode string types:

  1. a vector of Unicode code points, where each element externally acts
     like an int32_t
  2. a vector of abstract Unicode characters, where each element is
     externally a group of Unicode code points (a primary or starting
     code point, followed by combining code points)

Option [2] is what we ultimately want, so [1] should be included only if
it is really needed.

Internally, such strings could use UTF-8, UTF-16, etc., but most users
won't care about that from the outside. Some users do care about the
external appearance (I think someone here wanted UTF-8 XML). In those
cases we would have a specific input or output routine that uses an
appropriate encoding object, hiding whether or not the Unicode string
internally uses the same encoding as the final source or sink.

The internals of the Unicode string could use the cluster concept, like
in Mac OS X's Cocoa. Here we would make concrete classes for UTF-8,
UTF-16, UTF-32, etc. strings. (We could fold normalization and other
factors into the combinations too.) The external class would keep a
union (or something like one) holding one of the concrete classes.

Iterators should probably be provided for code points and/or abstract
characters. Bidirectional traversal would be the easiest to support.
Such iterators should be configurable (at compile time and/or run time)
for the various normalization schemes.

Input needs special handling, since we shouldn't let invalid byte or
code-point combinations get into a Unicode string. We need something
that can enumerate over a byte stream in a particular encoding and emit
whole code points (or queue the code points and emit abstract
characters).

I've appended a few rough C++ sketches of these ideas after my sig.

-- 
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
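
Here is a minimal sketch of the "external class keeps a union of
concrete classes" idea. All the names are made up, and I'm using
std::variant as the union mechanism purely for brevity; a hand-rolled
discriminated union (or boost::variant) would serve the same role:

#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical concrete storage classes; each one knows its own encoding.
struct utf8_storage   { std::string            bytes;  };
struct utf16_storage  { std::u16string         units;  };
struct utf32_storage  { std::vector<char32_t>  points; };

namespace detail
{
    // Code-point counts for each representation (assumes well-formed data).
    inline std::size_t count_points( const utf8_storage &s )
    {
        std::size_t  n = 0;
        for ( unsigned char b : s.bytes )
            if ( (b & 0xC0) != 0x80 )        // skip UTF-8 continuation bytes
                ++n;
        return n;
    }

    inline std::size_t count_points( const utf16_storage &s )
    {
        std::size_t  n = 0;
        for ( char16_t u : s.units )
            if ( u < 0xDC00 || u > 0xDFFF )  // skip low (trailing) surrogates
                ++n;
        return n;
    }

    inline std::size_t count_points( const utf32_storage &s )
    { return s.points.size(); }
}

// External class: holds one of the concrete classes and exposes only
// encoding-independent operations.
class unicode_string
{
public:
    explicit unicode_string( utf8_storage s )   : impl_( std::move(s) )  {}
    explicit unicode_string( utf16_storage s )  : impl_( std::move(s) )  {}
    explicit unicode_string( utf32_storage s )  : impl_( std::move(s) )  {}

    std::size_t  code_point_count() const
    {
        return std::visit( []( const auto &s ){ return detail::count_points( s ); },
         impl_ );
    }

private:
    std::variant<utf8_storage, utf16_storage, utf32_storage>  impl_;
};

The point is that users of unicode_string never see which alternative is
inside; only construction and the input/output routines would have to
care about the actual encoding.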
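
For the abstract-character view, the grouping rule is "a starting code
point plus any combining code points that follow it." A real iterator
would walk these clusters lazily (and bidirectionally), and the
combining test would come from the Unicode character database; this toy
version only recognizes the basic combining diacritical marks, just to
show the shape of the operation:

#include <vector>

// Toy combining-mark test: only U+0300..U+036F. A real implementation
// needs the full Unicode data for this.
inline bool is_combining( char32_t cp )
{ return cp >= 0x0300 && cp <= 0x036F; }

// Group a code-point sequence into abstract characters: each cluster is
// a starting code point followed by its combining code points.
std::vector< std::vector<char32_t> >
clusters( const std::vector<char32_t> &points )
{
    std::vector< std::vector<char32_t> >  result;

    for ( char32_t cp : points )
    {
        if ( result.empty() || !is_combining( cp ) )
            result.emplace_back();           // start a new abstract character
        result.back().push_back( cp );
    }
    return result;
}

For example, { U+0065, U+0301 } ("e" plus combining acute) comes out as
one abstract character holding two code points. The normalization
policy would decide whether that cluster and the precomposed U+00E9
compare equal.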
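
Finally, the input side. This is the sort of enumeration routine I have
in mind for UTF-8 (each encoding would get its own version behind a
common interface); it refuses to hand back a code point unless the
bytes are a valid, shortest-form encoding of a Unicode scalar value, so
bad input never ends up inside a string. Again, the names are invented:

#include <optional>
#include <vector>

// Decode one UTF-8 sequence starting at 'it', advancing 'it' past the
// bytes consumed. Returns the code point, or nothing if the sequence is
// malformed, overlong, a surrogate, or out of range.
std::optional<char32_t>
next_code_point( std::vector<unsigned char>::const_iterator &it,
                 std::vector<unsigned char>::const_iterator  end )
{
    if ( it == end )
        return std::nullopt;

    unsigned char  lead = *it++;
    int            extra = 0;    // continuation bytes still expected
    char32_t       cp = 0;
    char32_t       minimum = 0;  // smallest value legal for this length

    if      ( lead < 0x80 )            { return lead; }
    else if ( (lead & 0xE0) == 0xC0 )  { extra = 1;  cp = lead & 0x1F;  minimum = 0x80; }
    else if ( (lead & 0xF0) == 0xE0 )  { extra = 2;  cp = lead & 0x0F;  minimum = 0x800; }
    else if ( (lead & 0xF8) == 0xF0 )  { extra = 3;  cp = lead & 0x07;  minimum = 0x10000; }
    else    return std::nullopt;       // stray continuation byte or bad lead

    for ( int i = 0 ; i < extra ; ++i )
    {
        if ( it == end || (*it & 0xC0) != 0x80 )
            return std::nullopt;       // truncated or malformed sequence
        cp = (cp << 6) | (*it++ & 0x3F);
    }

    if ( cp < minimum )                  return std::nullopt;  // overlong form
    if ( cp >= 0xD800 && cp <= 0xDFFF )  return std::nullopt;  // surrogate value
    if ( cp > 0x10FFFF )                 return std::nullopt;  // beyond Unicode
    return cp;
}

A real routine would also have to decide what happens on failure: how
far to skip, and whether to throw, substitute U+FFFD, or just stop. A
layer on top of this could queue the code points and emit abstract
characters instead, as mentioned above.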