
On 4/13/04 3:27 PM, "Miro Jurisic" <macdev@meeroh.org> wrote:
In article <00c101c42167$8d0a7f40$1b440352@fuji>, "John Maddock" <john@johnmaddock.co.uk> wrote: [SNIP]
However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
The fact that you believe this is a reasonable first step leads me to believe that you have not given much thought to the fact that even if you use a 32-bit Unicode encoding, a character can take up more than 32 bits (and likewise for 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any encoding. [TRUNCATE]
Unicode code points fit in 31-bit values; the 8- and 16-bit standards are just encodings of those code points. We could base a Unicode string only around the code points, though it may be better to use abstract Unicode characters instead. However, each abstract character can be made up of a variable number of code points. Worse, there can be several ways of expressing the same abstract character (that's why there are normalization standards).

Maybe we can have:

    struct unicode_code_point
    {
        int_least32_t  c;
    };

    struct unicode_code_point_traits
    {
        /* like char_traits */
    };

    struct unicode_abstract_character
    {
        int_least32_t   main_char;     // can there be co-main characters?
        std::size_t     helper_count;  // length of following array
        int_least32_t  *helper_chars;  // dynamic array of combiners
    };

    struct unicode_abstract_character_traits
    {
        /* like char_traits, but much more complicated */
    };

Recall that character types must be POD, so all the smarts have to go into the traits class.

-- 
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com