
Anthony Williams wrote: [..]
Yes, but I was referring to more than just font differences. IIRC, there are examples in Arabic, where there are alternative representations of whole words, so the rendering engine does more than just translate characters to images, it may rearrange the characters, or treat groups of characters as a single item.
But yes, in general it is beyond simple text handling, which is partly what I meant by "At another level".
Agreed. AFAIU, rearranging is done for text rendering only, so it would not be relevant to text handling at all, not even complex text handling. The same goes for right-to-left issues in mixed Latin/Arabic text. Please correct me if I'm wrong. [...]
I would propose a class unicode_char, containing one or more codepoints (e.g., in a vector<utf32_t>). operator==(unicode_char, unicode_char) should return true for equivalent sequences. A Unicode string would be a basic_string-like container of unicode_chars. The find_first_of and such functions would then have the expected behaviour.
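A rough sketch of how such a unicode_char might look, to make the proposal concrete. Everything here is an assumption for illustration only: utf32_t is taken to be char32_t, toy_decompose is a hypothetical stand-in that knows a single precomposed letter (a real library would need the full Unicode normalisation data, including reordering of combining marks), and unicode_string is simply a std::vector rather than a true basic_string-like container.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>
#include <vector>

using utf32_t = char32_t;  // assumption: the post does not define utf32_t

constexpr utf32_t o_circumflex = 0x00F4;          // precomposed ô
constexpr utf32_t combining_circumflex = 0x0302;  // combining circumflex accent

// Toy stand-in for canonical decomposition: it knows one precomposed letter.
// A real library would consult the full Unicode normalisation data instead.
inline std::vector<utf32_t> toy_decompose(const std::vector<utf32_t>& in) {
    std::vector<utf32_t> out;
    for (utf32_t cp : in) {
        if (cp == o_circumflex) {
            out.push_back(U'o');
            out.push_back(combining_circumflex);
        } else {
            out.push_back(cp);
        }
    }
    return out;
}

// One user-perceived character, stored as one or more code points.
class unicode_char {
public:
    unicode_char(std::initializer_list<utf32_t> cps) : codepoints_(cps) {}
    const std::vector<utf32_t>& codepoints() const { return codepoints_; }
private:
    std::vector<utf32_t> codepoints_;
};

// Equal when the (toy) decompositions match, not when the raw code points do.
inline bool operator==(const unicode_char& a, const unicode_char& b) {
    return toy_decompose(a.codepoints()) == toy_decompose(b.codepoints());
}
inline bool operator!=(const unicode_char& a, const unicode_char& b) {
    return !(a == b);
}

// Hypothetical stand-in for the "basic_string-like container".
using unicode_string = std::vector<unicode_char>;

inline void example() {
    const unicode_char precomposed{o_circumflex};
    const unicode_char decomposed{U'o', combining_circumflex};
    assert(precomposed == decomposed);           // equivalent sequences
    assert(precomposed != unicode_char{U'o'});   // plain o is a different character

    // "rôle" spelled with the decomposed form of ô.
    const unicode_string role{{U'r'}, {U'o', combining_circumflex}, {U'l'}, {U'e'}};
    const unicode_string wanted{precomposed};
    // The standard algorithm picks up operator== and finds the ô at index 1.
    const auto it = std::find_first_of(role.begin(), role.end(),
                                       wanted.begin(), wanted.end());
    assert(it - role.begin() == 1);
}
```

Because the equivalence lives in operator==, generic algorithms such as std::find_first_of need no Unicode knowledge of their own. The price, in this naive form, is a vector allocation per character and a decomposition on every comparison.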
I think that actually storing character strings like that would be too slow, but you would certainly want an interface that dealt with such constructs, which is what I meant by '"character" chunks'.
The implementation should probably be more optimised than requiring an allocation for every character, but IMO a good Unicode library should *transparently* deal with such things as canonical equivalence for all operations, like searching, deleting characters, etcetera. unicode_string should be as easy to use as basic_string.
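One way this could be made cheaper, sketched under heavy assumptions: keep a single flat buffer of code points in a normalised form and only expose operations that work on whole characters. The class below is hypothetical; toy_decompose handles one precomposed letter, and is_combining only recognises the Combining Diacritical Marks block (U+0300 to U+036F), whereas a real library would use the Unicode normalisation and segmentation data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy check: only the Combining Diacritical Marks block is recognised.
inline bool is_combining(char32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// Toy stand-in for canonical decomposition: knows only U+00F4 (ô).
inline std::u32string toy_decompose(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (cp == 0x00F4) out += U"o\u0302";
        else out += cp;
    }
    return out;
}

class unicode_string {
public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    // Normalise once on construction; one flat buffer, no per-character allocation.
    explicit unicode_string(const std::u32string& raw) : data_(toy_decompose(raw)) {}

    // Raw code points, e.g. for writing to a file or passing to the OS.
    const std::u32string& raw() const { return data_; }

    // Length in characters: each non-combining code point starts a character.
    std::size_t length() const {
        std::size_t n = 0;
        for (char32_t cp : data_) if (!is_combining(cp)) ++n;
        return n;
    }

    // Character index of the first match, or npos. A match must start and
    // end on a character boundary, so "o" is not found inside "ô".
    std::size_t find(const unicode_string& needle) const {
        const std::u32string& n = needle.data_;
        std::size_t char_index = 0;
        for (std::size_t pos = 0; pos < data_.size(); ++pos) {
            if (is_combining(data_[pos])) continue;  // not a character start
            const std::size_t end = pos + n.size();
            const bool ends_on_boundary =
                end >= data_.size() || !is_combining(data_[end]);
            if (end <= data_.size() && ends_on_boundary &&
                data_.compare(pos, n.size(), n) == 0)
                return char_index;
            ++char_index;
        }
        return npos;
    }

    // Delete one character together with all of its combining marks.
    void erase(std::size_t index) {
        std::size_t char_index = 0;
        for (std::size_t pos = 0; pos < data_.size(); ++pos) {
            if (is_combining(data_[pos])) continue;
            if (char_index++ == index) {
                std::size_t end = pos + 1;
                while (end < data_.size() && is_combining(data_[end])) ++end;
                data_.erase(pos, end - pos);
                return;
            }
        }
    }

private:
    std::u32string data_;  // always held in the toy decomposed form
};

inline void example() {
    const unicode_string precomposed(U"r\u00F4le");  // ô as one code point
    unicode_string decomposed(U"ro\u0302le");        // o + combining circumflex
    assert(precomposed.raw() == decomposed.raw());
    assert(decomposed.length() == 4);
    assert(decomposed.find(unicode_string(U"\u00F4")) == 1);
    assert(decomposed.find(unicode_string(U"o")) == unicode_string::npos);
    decomposed.erase(1);                             // removes the whole ô
    assert(decomposed.raw() == U"rle");
}
```

The caller never sees whether the input was precomposed or decomposed: searching and deleting behave the same either way, while raw() still hands out the code points when they are needed.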
Yes, ideally.
I would like to make my point slightly clearer than I did before. I don't think it would do for a Unicode string library to concentrate on code points. Yes, the raw Unicode data should be available somewhere, so it can be written to file or sent to the OS's display routines. However, IMO it should use characters as its *only* interface for manipulation. The library should discourage using codepoints directly, because it will lead to all kinds of errors that do not often appear in English text manipulation but will for other languages. Think of such simple examples as the equivalence of rôle and rôle in different normalisations (ô as the single codepoint U+00F4, versus o followed by the combining circumflex U+0302).

Regards,
Rogier