Re: [boost] Thoughts on Unicode and Strings

16 Apr 2004

      At 9:48 PM +0100 4/16/04, Reece Dunn wrote:
...
Jeremy Maitin-Shepard wrote:
...
"Reece Dunn" writes:
[ big snip ]
...
...
...
[2] Basic Type And Iteration
...
...
The basic representation is more complex, because now we are dealing with
character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode
string). At this stage, we need not be concerned with combining characters
and marks, only with complete characters.
...
...
The Unicode string should provide at least 3 types of iterator, regardless
of the internal representation (NOTE: as such, their implementation will
depend on how the string is represented):
*  UTF-8  -- provides access to the UTF-8 representation of the string;
*  UTF-16 -- provides access to the UTF-16 representation of the string;
*  UTF-32 -- provides access to the Unicode character type.
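
To make the three views concrete, here is a minimal sketch that derives the
UTF-8 and UTF-16 code unit sequences from a sequence of code points. It is
not the proposed unicode_string interface: the helper names (append_utf8,
append_utf16, to_utf8, to_utf16) are hypothetical, std::u32string stands in
for the UTF-32 view, and the input is assumed to contain only valid Unicode
scalar values.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append the UTF-8 code units for one code point.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: append the UTF-16 code units for one code point.
void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {                    // one code unit
        out += static_cast<char16_t>(cp);
    } else {                               // surrogate pair
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
}

// UTF-8 view of a code point sequence.
std::string to_utf8(const std::u32string& s) {
    std::string out;
    for (char32_t cp : s) append_utf8(out, cp);
    return out;
}

// UTF-16 view of a code point sequence.
std::u16string to_utf16(const std::u32string& s) {
    std::u16string out;
    for (char32_t cp : s) append_utf16(out, cp);
    return out;
}
```

For a string holding the four code points U+0061, U+00E9, U+20AC and
U+1D11E, the UTF-32 view has 4 elements, the UTF-16 view 5, and the UTF-8
view 10, which is why each view needs its own iterator type.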
...
This seems reasonable, although in practice the UTF-32/code-point
iterator would be the most likely to be used.
Agreed, but the others would be useful, for example when writing the
string to a file. This is why I suggest that the UTF-32 iterator be the
default iterator (i.e. unicode_string::iterator is a UTF-32 iterator).
[ more snipped ]
...
...
...
RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are
multi-character encodings of UTF-32 (not considering combining marks at this
stage), whereas UTF-32 is a single character encoding.
I'm pretty sure that this is a bad assumption.
You can't just ignore combining characters.

I believe that Miro posted an example of how, even using UTF-32, you
may not have a one-to-one mapping between a single character and a
single "entry".
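
As an illustration of the point (this is not Miro's example, which is not
reproduced here), the sketch below compares the number of UTF-32 entries
with a rough count of characters. It is deliberately simplified: it treats
only the Combining Diacritical Marks block (U+0300 to U+036F) as combining,
whereas a real implementation would need the Unicode character database and
the segmentation rules of UAX #29. Both function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F counts as "combining".
// Many other combining marks exist outside this block.
bool is_combining_diacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Counts the code points that are not combining diacritics, i.e. the base
// characters that any following marks attach to. A mark with no preceding
// base character is simply not counted.
std::size_t count_base_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) {
        if (!is_combining_diacritic(cp)) ++n;
    }
    return n;
}
```

With this, the decomposed sequence U+0065 U+0301 ("e" followed by a
combining acute accent) occupies two UTF-32 entries but counts as one
character, while the precomposed U+00E9 occupies one entry for the same
visible result.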

-- 
-- Marshall

Marshall Clow     Idio Software   <mailto:marshall@idio.com>

I want a machine that thinks I'm more important than it is, and acts like it.
-- Eric Herrmann