Re: [boost] Thoughts on Unicode and Strings

17 Apr 2004

      Marshall Clow wrote:
...
...
...
"Reece Dunn" writes:
...
RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are
multi-character encodings of UTF-32 (not considering combining marks at this
stage), whereas UTF-32 is a single character encoding.
...
I'm pretty sure that this is a bad assumption.
Why is this a bad assumption?

At the unicode_string level, we are talking about individual Unicode 
characters (code points) as specified by unicode.org. As an example, U+0020 
(space) takes a single code unit in every encoding; U+2192 (rightwards arrow) 
requires 3 bytes in UTF-8 but only one unit in UTF-16 and UTF-32; U+1D504 
(mathematical Fraktur capital A, as one of the Fraktur characters) requires 
4 UTF-8 bytes, 2 UTF-16 units and 1 UTF-32 unit.
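
To make those counts concrete, here is a minimal sketch that computes how many 
code units one code point needs in each encoding. The function names are mine 
(hypothetical, not part of any proposed unicode_string interface), and the 
code assumes its argument is a valid code point no greater than U+10FFFF and 
not a surrogate.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of UTF-8 bytes for one code point.
// Assumes cp is a valid Unicode scalar value (<= 0x10FFFF, not a surrogate).
inline std::size_t utf8_length(std::uint32_t cp)
{
   if (cp < 0x80)    return 1;  // U+0000..U+007F
   if (cp < 0x800)   return 2;  // U+0080..U+07FF
   if (cp < 0x10000) return 3;  // U+0800..U+FFFF
   return 4;                    // U+10000..U+10FFFF
}

// Hypothetical helper: number of 16-bit units in UTF-16.
// Code points above U+FFFF need a surrogate pair.
inline std::size_t utf16_length(std::uint32_t cp)
{
   return cp < 0x10000 ? 1 : 2;
}

// UTF-32 always uses exactly one 32-bit unit per code point.
inline std::size_t utf32_length(std::uint32_t)
{
   return 1;
}
```

With these, utf8_length(0x2192) is 3 and utf8_length(0x1D504) is 4, while 
utf32_length() is 1 for every code point, which is the property the UTF-32 
view relies on.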

Treating a Unicode string as a virtual UTF-32 string (no matter what the 
underlying encoding is) makes it easier to use at a higher level, because 
you are dealing with the characters as they appear in the Unicode tables. 
This helps when there are mixed-width characters in the string:
   U+300A hello U+300B ==> [<<] hello [>>]
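
As a rough illustration of what a UTF-32 view over UTF-8 storage has to do, 
the sketch below decodes a UTF-8 byte string into a sequence of code points. 
decode_utf8 is a hypothetical free function, not the proposed utf32_iterator; 
it assumes well-formed input and does no validation (no checks for overlong 
forms, stray continuation bytes or surrogates).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode well-formed UTF-8 into code points.
// Malformed input is NOT detected; this is a sketch, not a validator.
inline std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
   std::vector<std::uint32_t> out;
   std::size_t i = 0;
   while (i < bytes.size())
   {
      const unsigned char lead = static_cast<unsigned char>(bytes[i]);
      std::uint32_t cp;
      std::size_t extra;  // number of continuation bytes that follow
      if (lead < 0x80)      { cp = lead;        extra = 0; }
      else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
      else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
      else                  { cp = lead & 0x07; extra = 3; }
      ++i;
      // Each continuation byte contributes its low 6 bits.
      for (; extra > 0 && i < bytes.size(); --extra, ++i)
         cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
      out.push_back(cp);
   }
   return out;
}
```

For the string above, stored as the 11 bytes E3 80 8A "hello" E3 80 8B, this 
yields 7 code points, so "hello" starts at index 1 regardless of how wide the 
brackets are in the underlying encoding.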
...
You can't just ignore combining characters.
I am not ignoring combining characters. All I'm saying is that dealing with 
grapheme clusters at this stage makes processing Unicode strings too 
complex. They should be treated as a view *on top of the underlying 
unicode_string representation*.
...
I believe that Miro posted an example of how (even using UTF-32), you
may not have a single character <<-->> single "entry" mapping.
I understand that now (see my other post), but dealing with it all at one 
level would make the interface too complex and would become too difficult to 
manage. You could have something like:

struct grapheme_cluster:
   public std::pair< unicode_string::utf32_iterator, unicode_string::utf32_iterator >
{
   inline grapheme_cluster( unicode_string & us ):
      std::pair< unicode_string::utf32_iterator, unicode_string::utf32_iterator >
      ( us.utf32_begin(), us.utf32_end())
   {
   }

   ...

   inline bool is_single() const
   {
      return( first == second );
   }

   inline unicode_string::utf32_t get_base() const
   {
      return( *first );
   }

   bool advance(); // implementation defined; false iff end of string
   ...
};

NOTE: if is_single() is true, then get_base() will be the value of the 
Unicode character; otherwise it is the primary (base) character with the 
combining characters removed.
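
To show the layering idea in something that compiles, here is a deliberately 
simplified sketch that groups a code point sequence into clusters of one base 
character plus any combining marks that follow it. Everything here is an 
assumption made for illustration: is_combining and split_clusters are 
hypothetical names, and only the Combining Diacritical Marks block 
(U+0300..U+036F) is treated as combining. Real grapheme cluster boundaries 
need the Unicode character property data and are considerably more involved.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Simplifying assumption: only U+0300..U+036F count as combining marks.
inline bool is_combining(std::uint32_t cp)
{
   return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: returns half-open [begin, end) index ranges into cps,
// one per cluster. The code point at 'begin' plays the role of get_base().
inline std::vector<std::pair<std::size_t, std::size_t>>
split_clusters(const std::vector<std::uint32_t>& cps)
{
   std::vector<std::pair<std::size_t, std::size_t>> clusters;
   std::size_t begin = 0;
   while (begin < cps.size())
   {
      std::size_t end = begin + 1;
      // Absorb every combining mark that directly follows the base.
      while (end < cps.size() && is_combining(cps[end]))
         ++end;
      clusters.emplace_back(begin, end);
      begin = end;
   }
   return clusters;
}
```

Given the code points U+0065 U+0301 U+0078 ("e" + combining acute, then "x"), 
this produces two clusters, [0, 2) and [2, 3). The point is that the cluster 
logic sits entirely on top of the code point sequence and never looks at the 
underlying encoding.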

Regards,
Reece


Reece Dunn
