
Peter Bindels wrote:
Indexing in UTF-32 is trivial. Indexing in UTF-16 is fairly trivial too: by the definition of the boundary between the Basic Multilingual Plane and the supplementary planes, you can treat all characters above 0xFFFF (encoded with two code units) as very irregular. You could then keep an array of the indexes where these characters appear in your string (adding a slight bit of overhead), making access constant-time except around occurrences of those characters.
I had already thought of this. It would allow random access in O(log n), n being the number of surrogate pairs in the string, and it also allows mutable iterators not to invalidate each other when the string is modified. However, for working at the grapheme cluster level, I don't think random access into the underlying encoding brings anything useful. Some searching algorithms do need random access, but these could work on the bytes with a few checks, since all UTF encodings (except UTF-7) guarantee that a sequence cannot occur within a longer sequence or across the boundary of two other sequences. While combining characters may be rare in some scripts, they are numerous in others, so indexing them doesn't seem like such a good idea. It could still be interesting to provide random access for people who want to work at a lower level, but that's not a priority.
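The index-array idea can be sketched roughly as follows; `surrogate_index` and `unit_index` are hypothetical names for illustration, and error handling (unpaired surrogates, out-of-range indices) is omitted:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collect the UTF-16 positions where a surrogate pair starts
// (lead surrogates lie in the range 0xD800..0xDBFF).
std::vector<std::size_t> surrogate_index(const std::u16string& s)
{
    std::vector<std::size_t> pairs;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
            pairs.push_back(i);
    return pairs;
}

// Map a code-point index to a code-unit index by binary search:
// pairs[k] - k is the code-point index of the k-th surrogate pair,
// so we count how many pairs lie strictly before the target.
std::size_t unit_index(const std::vector<std::size_t>& pairs, std::size_t cp)
{
    std::size_t lo = 0, hi = pairs.size();
    while (lo < hi) {
        std::size_t mid = (lo + hi) / 2;
        if (pairs[mid] - mid < cp) lo = mid + 1;
        else hi = mid;
    }
    return cp + lo; // each earlier pair shifts the position by one unit
}
```

For a string like u"a\U00010000b" (four UTF-16 units, three code points), `unit_index` maps code point 2 to unit 3, skipping over the pair.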
You cannot apply this technique to UTF-8 text, because multi-byte (non-ASCII) characters are a lot more common there.
Another problem with UTF-8 is that reverse iteration is a bit more expensive than forward iteration, because you don't know how many bytes to step back until you reach the lead byte of the multi-byte sequence. UTF-8 usually isn't a good choice unless you need it to feed GTK+ or other libraries. Size-wise, another interesting encoding is GB18030, especially when working with a mix of Chinese and English; however, it has no direct mapping to Unicode code points, so UTF-16 remains the best choice by default.
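The backward step is still bounded (at most three continuation bytes), but it needs a scan that the forward direction doesn't. A minimal sketch, assuming valid UTF-8 and pos > 0:

```cpp
#include <cstddef>
#include <string>

// Step back to the start of the previous code point in a UTF-8 string:
// continuation bytes all match the bit pattern 10xxxxxx, so keep
// backing up until we hit a byte that is not a continuation byte.
std::size_t prior(const std::string& s, std::size_t pos)
{
    do {
        --pos;
    } while (pos > 0 &&
             (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80);
    return pos;
}
```

In "a\xc3\xa9z" ("aéz"), stepping back from the end lands on 'z' at 3, then on the two-byte 'é' at 1, then on 'a' at 0.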
That's a point I hadn't thought of.
That's what the whole "grapheme cluster" thing is about. <e acute> and <e><acute> should be considered equal, each being a single grapheme cluster, which is what an end-user thinks of as a single character. Likewise, searching for foo in foo<acute> shouldn't return a match, since the last character is o<acute>, not o. All algorithms must ensure that the elements of a grapheme cluster don't get separated. This can be achieved simply by working with iterators over them or, in a more optimized way, by working carefully with code units or code points. That is why, alongside the default grapheme cluster interface, access will be given to the lower-level layers for power users.
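As an illustration only, here is a naive cluster-respecting search; the full combining-mark check is reduced to one hardcoded mark (U+0301, combining acute), which real code would replace with the Unicode property tables, and `cluster_find` is a hypothetical name:

```cpp
#include <cstddef>
#include <string>

// Simplification: U+0301 stands in for the whole set of combining marks.
bool is_combining(char32_t c) { return c == U'\u0301'; }

// Find 'needle' in 'hay', rejecting matches that would split a grapheme
// cluster: the match must not start on a trailing combining mark, and
// must not be immediately followed by one.
std::size_t cluster_find(const std::u32string& hay,
                         const std::u32string& needle)
{
    for (std::size_t pos = hay.find(needle); pos != std::u32string::npos;
         pos = hay.find(needle, pos + 1))
    {
        std::size_t end = pos + needle.size();
        bool starts_cluster = pos == 0 || !is_combining(hay[pos]);
        bool ends_cluster = end == hay.size() || !is_combining(hay[end]);
        if (starts_cluster && ends_cluster)
            return pos;
    }
    return std::u32string::npos;
}
```

With this check, searching for "foo" in "foo" followed by a combining acute finds nothing, since the final cluster is o<acute>, not o.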
In that case, what advantages does UTF-32 hold over any of the other two?
First, some people might want to work directly with code points rather than grapheme clusters, just like most (if not all) other Unicode string classes do. With UTF-32 the code units are code points, so working with them is very lightweight. It can also be useful when interfacing with something that uses UTF-32, since no conversion is needed for input and output. That is also why there will be a UTF-8 backend too.
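To make the "code units are code points" point concrete, here is a bare-bones UTF-16 to UTF-32 decode (a hypothetical `to_utf32`, with no validation of unpaired surrogates); once decoded, s[i] really is the i-th code point, with no index mapping at all:

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 to UTF-32, combining each lead/trail surrogate pair
// into one code point. In the result, indexing is direct.
std::u32string to_utf32(const std::u16string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()) {
            char32_t trail = s[++i];
            c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
        out.push_back(c);
    }
    return out;
}
```

The four UTF-16 units of u"a\U00010000b" become three UTF-32 units, one per code point.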