
I think that there should be at most two _external_ Unicode string types:

1. vector of Unicode code-points; externally, each element is like an int32_t
2. vector of abstract Unicode characters; externally, each element is a group of Unicode code-points (a primary or starting code point, followed by combiner codes)

Option [2] is what we ultimately want, so [1] should be included only if really needed.

Internally, such strings could use UTF-8, UTF-16, etc., but most users don't care about that from an outside perspective.

Some users do care about the outside appearance. (I think some guy here wanted UTF-8 XML.) In those cases we would have a specific input or output routine that uses an appropriate encoding object, to hide whether or not the Unicode string internally uses the same encoding as the final source/sink.

The internals of the Unicode string could use the class cluster concept, like in Mac OS X's Cocoa. Here we would make concrete classes for UTF-8, UTF-16, UTF-32, etc. strings. (We could include normalization and other factors in combinations too.) The external class would keep a union (or something) that uses one of the concrete classes.

Iterators probably should be made for code points and/or abstract characters. Bidirectional travel would be easiest. Such iterators should be configurable (at compile- and/or run-time) for various normalization schemes.

Input needs special handling, since we shouldn't allow ultimately invalid byte/code-point combinations into Unicode strings. We need something that can enumerate over a byte stream for a particular encoding and spit out whole code-points (or queue the code-points and spit out abstract characters).

--
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
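
To make the difference between the two external types concrete, here is a minimal sketch in C++. All names (`CodePointString`, `AbstractString`, `is_combiner`, `group_characters`) are hypothetical and do not come from the post. The combiner test is a loud simplification: it only recognizes the Combining Diacritical Marks block (U+0300 to U+036F), whereas a real implementation would consult the Unicode character database.

```cpp
#include <cstdint>
#include <vector>

// Type [1]: a plain sequence of code points.
using CodePointString = std::vector<std::int32_t>;

// Type [2]: each element is one starting code point plus its trailing combiners.
using AbstractCharacter = std::vector<std::int32_t>;
using AbstractString = std::vector<AbstractCharacter>;

// ASSUMPTION: only U+0300..U+036F counts as a combiner in this sketch.
// Real code would look the property up in the Unicode character database.
bool is_combiner(std::int32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Group a code-point string into abstract characters.
// A combiner with nothing before it starts its own group.
AbstractString group_characters(const CodePointString& cps) {
    AbstractString out;
    for (std::int32_t cp : cps) {
        if (out.empty() || !is_combiner(cp)) {
            out.push_back(AbstractCharacter{cp});  // new starting code point
        } else {
            out.back().push_back(cp);              // attach combiner to current group
        }
    }
    return out;
}
```

With this sketch, the three code points `e`, U+0301, `x` become two abstract characters, the first holding two code points.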

Daryle Walker <darylew@hotmail.com> writes:
> I think that there should be at most two _external_ Unicode string types:
>
> 1. vector of Unicode code-points; externally, each element is like an int32_t
> 2. vector of abstract Unicode characters; externally, each element is a group of Unicode code-points (a primary or starting code point, followed by combiner codes)
>
> Some users do care about the outside appearance. (I think some guy here wanted UTF-8 XML.) In those cases we would have a specific input or output routine that uses an appropriate encoding object, to hide whether or not the Unicode string internally uses the same encoding as the final source/sink.
>
> Iterators probably should be made for code points and/or abstract characters. Bidirectional travel would be easiest. Such iterators should be configurable (at compile- and/or run-time) for various normalization schemes.
>
> Input needs special handling, since we shouldn't allow ultimately invalid byte/code-point combinations into Unicode strings. We need something that can enumerate over a byte stream for a particular encoding and spit out whole code-points (or queue the code-points and spit out abstract characters).
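
The input-handling requirement in the last quoted paragraph can be pictured with a small validating decoder. This is only a sketch under stated assumptions: it handles UTF-8 alone, reads from an in-memory `std::string` rather than a stream, and the name `decode_utf8` is hypothetical. It rejects bad lead bytes, truncated sequences, overlong forms, surrogates, and values above U+10FFFF, so nothing invalid reaches the resulting code-point vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into whole code points,
// throwing std::invalid_argument on any ill-formed sequence.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;     // continuation bytes expected after the lead
        std::uint32_t min_cp = 0;  // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead;        extra = 0; min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min_cp = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (bytes.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp)
            throw std::invalid_argument("overlong UTF-8 encoding");
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point beyond U+10FFFF");
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate code point in UTF-8");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

A decoder for another encoding would have the same shape, and its output could be queued and grouped further if abstract characters are wanted instead of single code points.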
The XML parser I have under development on SourceForge (http://www.sf.net/projects/axemill) includes string handling facilities that support the above. I haven't yet found a need for dealing with "abstract characters", since the closest thing in XML (name matching) requires that the names use the same sequence of code points (including combining characters) in all places.

The "vector of Unicode code-points" I use is a std::basic_string<UnicodeCharacter>, where UnicodeCharacter is a POD struct with a 32-bit int member to represent the Unicode code point. I have to do it that way to allow customization of std::char_traits, since you cannot specialize std::char_traits for built-in types.

Anthony
--
Anthony Williams
Software Developer
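
For readers unfamiliar with the technique, the following is a rough sketch of what a `std::basic_string` over a POD code-point struct can look like. It is not the code from the axemill project: the member name `value`, the choice of `0xFFFFFFFF` as the end-of-file marker, and the alias `UnicodeString` are all assumptions made for illustration. The key point is that `UnicodeCharacter` is a user-defined type, so specializing `std::char_traits` for it is permitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>
#include <ios>
#include <string>

// Hypothetical stand-in for the POD struct described in the message.
struct UnicodeCharacter {
    std::uint32_t value;
};

namespace std {
template <>
struct char_traits<UnicodeCharacter> {
    using char_type  = UnicodeCharacter;
    using int_type   = std::uint32_t;
    using off_type   = std::streamoff;
    using pos_type   = std::streampos;
    using state_type = std::mbstate_t;

    static void assign(char_type& dst, const char_type& src) { dst = src; }
    static bool eq(const char_type& a, const char_type& b) { return a.value == b.value; }
    static bool lt(const char_type& a, const char_type& b) { return a.value < b.value; }

    static int compare(const char_type* a, const char_type* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    // Length of a sequence terminated by a zero-valued element.
    static std::size_t length(const char_type* s) {
        std::size_t n = 0;
        while (s[n].value != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n, const char_type& c) {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return nullptr;
    }
    // memmove/memcpy are fine because char_type is trivially copyable.
    static char_type* move(char_type* dst, const char_type* src, std::size_t n) {
        if (n != 0) std::memmove(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* copy(char_type* dst, const char_type* src, std::size_t n) {
        if (n != 0) std::memcpy(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* assign(char_type* dst, std::size_t n, char_type c) {
        for (std::size_t i = 0; i < n; ++i) dst[i] = c;
        return dst;
    }
    static int_type to_int_type(const char_type& c) { return c.value; }
    static char_type to_char_type(const int_type& i) { return char_type{i}; }
    static bool eq_int_type(const int_type& a, const int_type& b) { return a == b; }
    // ASSUMPTION: 0xFFFFFFFF is used as EOF since it is not a valid code point.
    static int_type eof() { return 0xFFFFFFFFu; }
    static int_type not_eof(const int_type& i) { return i == eof() ? 0 : i; }
};
}  // namespace std

using UnicodeString = std::basic_string<UnicodeCharacter>;
```

A string of this type then supports the usual `std::basic_string` operations, such as `push_back`, `find`, and comparison, with all character-level behaviour routed through the traits class.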
participants (2)
- Anthony Williams
- Daryle Walker