Re: [boost] Thoughts on Unicode and Strings

Marshall Clow wrote:
> "Reece Dunn" writes:
>> RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are multi-character encodings of UTF-32 (not considering combining marks at this stage), whereas UTF-32 is a single character encoding.
> I'm pretty sure that this is a bad assumption.
Why is this a bad assumption? At the unicode_string level, we are talking about individual Unicode characters (code points) as specified by unicode.org. As an example, U+0020 (space) takes a single code unit in every encoding; U+2192 (rightwards arrow) requires 3 bytes in UTF-8; and the U+1D5xx characters (I think these are the Fraktur characters) require 4 UTF-8 bytes, 2 UTF-16 code units and 1 UTF-32 code unit. Treating a Unicode string as a virtual UTF-32 string (no matter what the underlying encoding is) makes it easier to use at a higher level, because you are dealing with the characters as they appear in the Unicode tables. This also helps when there are mixed-width characters in the string: U+300A hello U+300B ==> [<<] hello [>>]
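To make the code-unit counts above concrete, here is a minimal sketch of how many units one code point needs in each encoding form. The names utf8_length and utf16_length are hypothetical helpers for illustration only; they are not part of any proposed unicode_string interface, and they assume the input is already a valid Unicode scalar value.

```cpp
#include <cstddef>

// Hypothetical helper: number of UTF-8 bytes for one code point.
// Assumes cp is a valid scalar value (no surrogate or range checking).
constexpr std::size_t utf8_length(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Hypothetical helper: number of UTF-16 code units for one code point.
// Code points above the BMP need a surrogate pair.
constexpr std::size_t utf16_length(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// UTF-32 always uses exactly one code unit per code point.
constexpr std::size_t utf32_length(char32_t) {
    return 1;
}

// The three examples from the text, checked at compile time.
static_assert(utf8_length(0x0020) == 1, "space: 1 UTF-8 byte");
static_assert(utf8_length(0x2192) == 3, "rightwards arrow: 3 UTF-8 bytes");
static_assert(utf8_length(0x1D504) == 4, "Fraktur capital A: 4 UTF-8 bytes");
static_assert(utf16_length(0x1D504) == 2, "Fraktur capital A: surrogate pair");
static_assert(utf32_length(0x1D504) == 1, "always 1 UTF-32 unit");
```

Only the UTF-32 count is constant, which is the property the UTF-32 view relies on.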
> You can't just ignore combining characters.
I am not ignoring combining characters. All I'm saying is that dealing with grapheme clusters at this stage makes processing Unicode strings too complex. They should be treated as a view *on top of the underlying unicode_string representation*.
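As a rough illustration of what "a view on top" could mean, the sketch below groups a UTF-32 sequence into base-plus-combining-mark ranges without touching the underlying storage. It is heavily simplified: is_combining, cluster_range and clusters are hypothetical names, std::u32string stands in for the UTF-32 view, and only U+0300..U+036F is assumed to be combining. Real grapheme segmentation needs the Unicode character data and the rules in UAX #29.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// ASSUMPTION: only the Combining Diacritical Marks block counts as
// combining. A real implementation would consult Unicode property tables.
inline bool is_combining(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Half-open index range [first, second) into the UTF-32 sequence.
using cluster_range = std::pair<std::size_t, std::size_t>;

// Split a UTF-32 string into clusters: one base code point followed by
// any number of combining marks. The string itself is left unchanged.
std::vector<cluster_range> clusters(const std::u32string& s) {
    std::vector<cluster_range> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t start = i++;  // base (or a stray leading mark)
        while (i < s.size() && is_combining(s[i])) {
            ++i;                        // absorb trailing combining marks
        }
        out.emplace_back(start, i);
    }
    return out;
}
```

For the three code points 'e', U+0301, 'a' this yields two ranges, [0, 2) and [2, 3), while the code-point level still sees three entries.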
> I believe that Miro posted an example of how (even using UTF-32), you may not have a single character <<-->> single "entry" mapping.
I understand that now (see my other post), but dealing with it all at one level would make the interface too complex and too difficult to manage. You could have something like:

    struct grapheme_cluster:
       public std::pair< unicode_string::utf32_iterator, unicode_string::utf32_iterator >
    {
       inline grapheme_cluster( unicode_string & us ):
          std::pair< unicode_string::utf32_iterator, unicode_string::utf32_iterator >
             ( us.utf32_begin(), us.utf32_end())
       {
       }
       ...
       inline bool is_single() const
       {
          return( first == second );
       }
       inline unicode_string::utf32_t get_base() const
       {
          return( *first );
       }
       bool advance(); // implementation defined; false iff end of string
       ...
    };

NOTE: if is_single() is true, then get_base() will be the value of the Unicode character; otherwise it is the primary character with the combining characters removed.

Regards,
Reece