
On 4/13/04 3:27 PM, "Miro Jurisic" <macdev@meeroh.org> wrote:
In article <00c101c42167$8d0a7f40$1b440352@fuji>, "John Maddock" <john@johnmaddock.co.uk> wrote: [SNIP]
However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
The fact that you believe this is a reasonable first step leads me to believe that you have not given much thought to the fact that even if you use a 32-bit Unicode encoding, a character can take up more than 32 bits (and likewise for 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any encoding. [TRUNCATE]
Unicode code points fit in 31-bit values; the 8- and 16-bit standards are just encodings of those code points. We could base a Unicode string only around the code points, though it may be better to use abstract Unicode characters instead. However, each abstract character can be made up of a variable number of code points. Worse, there can be several ways of expressing the same abstract character (that's why there are normalization standards).

Maybe we can have:

    struct unicode_code_point
    {
        int_least32_t  c;
    };

    struct unicode_code_point_traits
    {
        /* like char_traits */
    };

    struct unicode_abstract_character
    {
        int_least32_t   main_char;     // can there be co-main characters?
        std::size_t     helper_count;  // length of following array
        int_least32_t  *helper_chars;  // dynamic array of combiners
    };

    struct unicode_abstract_character_traits
    {
        /* like char_traits, but much more complicated */
    };

Recall that character types must be POD, so all the smarts have to go into the traits class.

-- 
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com