[boost] Re: Thoughts on Unicode and Strings

19 Apr 2004

Anthony Williams wrote:
[..]
...
Yes, but I was referring to more than just font differences. IIRC, there are
examples in Arabic, where there are alternative representations of whole
words, so the rendering engine does more than just translate characters to
images, it may rearrange the characters, or treat groups of characters as a
single item.
But yes, in general it is beyond simple text handling, which is partly what I
meant by "At another level".
Agreed. AFAIU, rearranging is done for text rendering only, which means
it is not relevant to text handling at all, not even complex text
handling. The same goes for right-to-left issues in mixed Latin/Arabic
text. Please correct me if I'm wrong.

[...]
...
...
I would propose a class unicode_char, containing one or more codepoints
(e.g., in a vector<utf32_t>). operator==(unicode_char, unicode_char)
should return true for canonically equivalent sequences. A Unicode string
would be a basic_string-like container of unicode_chars. find_first_of and
similar functions would then have the expected behaviour.
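
A minimal sketch of the unicode_char idea described above, not a proposed
implementation. Assumptions: utf32_t is taken to be a 32-bit unsigned
integer, decompose() is a hypothetical helper that knows only one
precomposed letter (a real library would use the full Unicode
decomposition data and reorder combining marks), and unicode_string is a
plain std::vector stand-in for the basic_string-like container.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

using utf32_t = std::uint32_t;  // assumed: the post names the type but does not define it

// Hypothetical toy decomposition covering a single precomposed letter.
// A real library would consult the complete Unicode decomposition tables.
inline std::vector<utf32_t> decompose(const std::vector<utf32_t>& in) {
    std::vector<utf32_t> out;
    for (utf32_t cp : in) {
        if (cp == 0x00F4) {         // U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX
            out.push_back(0x006F);  // U+006F LATIN SMALL LETTER O
            out.push_back(0x0302);  // U+0302 COMBINING CIRCUMFLEX ACCENT
        } else {
            out.push_back(cp);
        }
    }
    return out;
}

// One user-perceived character: a base code point plus any combining marks.
class unicode_char {
public:
    unicode_char(std::initializer_list<utf32_t> cps) : codepoints_(cps) {}

    // Raw code points remain available, e.g. for writing to a file.
    const std::vector<utf32_t>& codepoints() const { return codepoints_; }

    // Equality compares decomposed forms, so equivalent spellings match.
    friend bool operator==(const unicode_char& a, const unicode_char& b) {
        return decompose(a.codepoints_) == decompose(b.codepoints_);
    }
    friend bool operator!=(const unicode_char& a, const unicode_char& b) {
        return !(a == b);
    }

private:
    std::vector<utf32_t> codepoints_;
};

// Stand-in for the basic_string-like container of unicode_char.
using unicode_string = std::vector<unicode_char>;

// Index of the first character equal to c, or s.size() if there is none.
inline std::size_t find_char(const unicode_string& s, const unicode_char& c) {
    auto it = std::find(s.begin(), s.end(), c);
    return static_cast<std::size_t>(it - s.begin());
}

// Usage: search a decomposed "rôle" for the precomposed ô.
inline void demo_find() {
    unicode_string word = { {0x72}, {0x6F, 0x0302}, {0x6C}, {0x65} };
    unicode_char o_circumflex = {0x00F4};
    assert(find_char(word, o_circumflex) == 1);
}
```

Storing a vector per character is the allocation cost objected to in this
thread; the sketch only illustrates the interface, not a storage layout.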
I think that actually storing character strings like that would be too slow,
but you would certainly want an interface that dealt with such constructs,
which is what I meant by '"character" chunks'.
...
The implementation should probably be more optimised than requiring an
allocation for every character, but IMO a good Unicode library should
*transparently* deal with such things as canonical equivalence for all
operations, like searching, deleting characters, etcetera. unicode_string
should be as easy to use as basic_string.
Yes, ideally.
I would like to make my point slightly clearer than I did before. I 
don't think it would do for a Unicode string library to concentrate on 
code points. Yes, the raw Unicode data should be available somewhere, so 
it can be written to file or sent to the OS's display routines. However, 
IMO it should use characters as its *only* interface for manipulation. 
The library should discourage using codepoints directly, because that
leads to all kinds of errors that rarely appear in English text
manipulation but do appear in other languages. Think of such simple
examples as the equivalence of rôle (precomposed ô, U+00F4) and rôle (o
followed by a combining circumflex, U+0302) in different normalisations.
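
To illustrate the kind of error meant here, a small sketch that handles
the two spellings of "rôle" at the code point level with std::u32string.
The function names are made up for this example; the code point values
are the precomposed U+00F4 and the sequence U+006F U+0302.

```cpp
#include <cassert>
#include <string>

// "rôle" with a precomposed U+00F4 (the NFC spelling).
inline std::u32string role_precomposed() { return U"r\u00F4le"; }

// "rôle" as 'o' followed by U+0302 COMBINING CIRCUMFLEX ACCENT (the NFD spelling).
inline std::u32string role_decomposed() { return U"ro\u0302le"; }

inline void demo_code_point_pitfalls() {
    const std::u32string a = role_precomposed();
    const std::u32string b = role_decomposed();

    // The same word compares unequal and has two different lengths.
    assert(a != b);
    assert(a.size() == 4 && b.size() == 5);

    // Searching the decomposed spelling for the precomposed ô finds nothing.
    assert(b.find(U'\u00F4') == std::u32string::npos);

    // "Deleting the third element" strips only the accent, leaving "role".
    std::u32string c = b;
    c.erase(2, 1);
    assert(c == U"role");
}
```

None of these surprises show up with plain ASCII text, which is why an
interface built on code points tends to look correct until it meets
accented input.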

Regards,
Rogier

Rogier van Dalen