
At 9:48 PM +0100 4/16/04, Reece Dunn wrote:
Jeremy Maitin-Shepard wrote:
"Reece Dunn" writes:
[ big snip ]
[2] Basic Type And Iteration
The basic representation is more complex, because now we are dealing with character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode string). At this stage we should not be concerned with combining characters and marks, only with complete characters.
The Unicode string should provide at least 3 types of iterator, regardless of the internal representation (NOTE: as such, their implementation will depend on how the string is represented):
  * UTF-8 -- provides access to the UTF-8 representation of the string;
  * UTF-16 -- provides access to the UTF-16 representation of the string;
  * UTF-32 -- provides access to the Unicode character type.
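As a rough illustration of the three views listed above, here is a minimal sketch that takes a sequence of code points (the UTF-32 view) and produces the corresponding UTF-8 and UTF-16 code units. The names to_utf8 and to_utf16 are hypothetical and not part of any proposed interface, and the sketch assumes the input holds only valid Unicode scalar values.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode a sequence of code points as UTF-8.
// Assumes every element is a valid Unicode scalar value.
std::string to_utf8(const std::u32string& code_points) {
    std::string out;
    for (char32_t cp : code_points) {
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80) {                       // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: encode a sequence of code points as UTF-16.
// Code points above U+FFFF become a surrogate pair.
std::u16string to_utf16(const std::u32string& code_points) {
    std::u16string out;
    for (char32_t cp : code_points) {
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            const char32_t v = cp - 0x10000;
            out += static_cast<char16_t>(0xD800 | (v >> 10));    // high surrogate
            out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));  // low surrogate
        }
    }
    return out;
}
```

A string class along the lines discussed here could expose each of these sequences through its own iterator type instead of materialising copies; the sketch only shows that the same code points yield three different element sequences.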
This seems reasonable, although in practice the UTF-32/code-point iterator would be the most likely to be used.
Agreed, but the others would be useful, for example when writing the string to a file. This is why I suggest that the UTF-32 iterator is the default iterator (i.e. unicode_string::iterator is a UTF-32 iterator).
[ more snipped ]
RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are multi-character encodings of UTF-32 (not considering combining marks at this stage), whereas UTF-32 is a single character encoding.
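To make the rationale concrete, the following sketch counts how many code units a single code point occupies in each encoding. The function names are hypothetical, and the sketch deliberately ignores combining marks, as the rationale does.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of UTF-8 bytes needed for one code point.
std::size_t utf8_units(char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Hypothetical helper: number of UTF-16 code units for one code point.
std::size_t utf16_units(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;
}

// UTF-32 always uses exactly one code unit per code point.
std::size_t utf32_units(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return 1;
}
```

For U+20AC (the euro sign) this gives 3, 1 and 1 units respectively; for U+1F600 it gives 4, 2 and 1. Only the UTF-32 count is constant, which is what makes it a convenient default view at the code-point level.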
I'm pretty sure that this is a bad assumption. You can't just ignore combining characters. I believe that Miro posted an example of how (even using UTF-32), you may not have a single character <<-->> single "entry" mapping.

-- 
-- Marshall

Marshall Clow     Idio Software   <mailto:marshall@idio.com>

I want a machine that thinks I'm more important than it is, and acts like it. -- Eric Herrmann
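A small sketch of the objection above, not Miro's original example: the letter "é" can be stored as the single code point U+00E9 or as U+0065 followed by the combining acute accent U+0301. Both display as one character, yet the UTF-32 lengths differ. The helper names are hypothetical, and the combining test is a deliberate simplification that only covers the Combining Diacritical Marks block (U+0300 to U+036F); real grapheme segmentation needs the Unicode character database.

```cpp
#include <cstddef>
#include <string>

// Simplification: treats only U+0300..U+036F as combining marks.
bool is_combining_diacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: counts code points that start a new
// user-perceived character under the simplification above.
std::size_t count_perceived_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) {
        if (!is_combining_diacritic(cp)) {
            ++n;
        }
    }
    return n;
}
```

With precomposed = U"\u00E9" and decomposed = U"e\u0301", the two strings have sizes 1 and 2, while count_perceived_characters returns 1 for both, so a UTF-32 element is not always a whole character.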