
Peter Bindels wrote:
Indexing in UTF-32 is trivial. Indexing in UTF-16 is fairly trivial too: by the definition of the boundary between the Basic Multilingual Plane and the supplementary planes, you can treat all characters above 0xFFFF (encoded with two code units) as very irregular. You could then keep an array of the indexes where these characters appear in your string (adding a slight bit of overhead), making access constant-time except around occurrences of those characters.
I had already thought of this. It would allow random access in O(log n), n being the number of surrogate pairs in the string, and it also allows mutable iterators not to invalidate each other when the string is modified. However, for working at the grapheme cluster level, I don't think random access into the underlying encoding brings anything useful. Some searching algorithms do need random access, but these could work on the bytes with a few checks, since all UTF encodings (except UTF-7) guarantee that a sequence cannot occur within a longer sequence or across the boundary of two other sequences. While combining characters may be rare in some scripts, they are numerous in others, so indexing them doesn't seem like such a good idea. It could still be interesting to provide random access for people who want to work at a lower level, but that's not a priority.
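The index-array idea can be sketched roughly as follows; `surrogate_index` and `unit_index` are hypothetical names for illustration, and error handling (unpaired surrogates, out-of-range indices) is omitted:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collect the UTF-16 positions where a surrogate pair starts
// (lead surrogates lie in the range 0xD800..0xDBFF).
std::vector<std::size_t> surrogate_index(const std::u16string& s)
{
    std::vector<std::size_t> pairs;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
            pairs.push_back(i);
    return pairs;
}

// Map a code-point index to a code-unit index by binary search:
// pairs[k] - k is the code-point index of the k-th surrogate pair,
// so we count how many pairs lie strictly before the target.
std::size_t unit_index(const std::vector<std::size_t>& pairs, std::size_t cp)
{
    std::size_t lo = 0, hi = pairs.size();
    while (lo < hi) {
        std::size_t mid = (lo + hi) / 2;
        if (pairs[mid] - mid < cp) lo = mid + 1;
        else hi = mid;
    }
    return cp + lo; // each earlier pair shifts the position by one unit
}
```

For a string like u"a\U00010000b" (four UTF-16 units, three code points), `unit_index` maps code point 2 to unit 3, skipping over the pair.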
You cannot apply this technique to UTF-8 text, because multi-byte (non-ASCII) characters are a lot more common there.
Another problem with UTF-8 is that reverse iteration is a bit more expensive than forward iteration, because you don't know how many bytes to step back until you reach the lead byte of the multi-byte sequence. UTF-8 usually isn't a good choice unless you need it to feed GTK+ or other libraries. Size-wise, another interesting encoding is GB18030, especially when working with a mix of Chinese and English; however, it has no direct mapping to Unicode code points, so UTF-16 remains the best choice by default.
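The backward step is still bounded (at most three continuation bytes), but it needs a scan that the forward direction doesn't. A minimal sketch, assuming valid UTF-8 and pos > 0:

```cpp
#include <cstddef>
#include <string>

// Step back to the start of the previous code point in a UTF-8 string:
// continuation bytes all match the bit pattern 10xxxxxx, so keep
// backing up until we hit a byte that is not a continuation byte.
std::size_t prior(const std::string& s, std::size_t pos)
{
    do {
        --pos;
    } while (pos > 0 &&
             (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80);
    return pos;
}
```

In "a\xc3\xa9z" ("aéz"), stepping back from the end lands on 'z' at 3, then on the two-byte 'é' at 1, then on 'a' at 0.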
That's a point I hadn't thought of.
That's what the whole "grapheme cluster" thing is about. <e acute> and <e><acute> should be considered equal, each being a single grapheme cluster, which is what an end-user thinks of as a single character. Likewise, searching for foo in foo<acute> shouldn't return a match, since the last character is o<acute>, not o. All algorithms must ensure that the elements of a grapheme cluster don't get separated. This can be achieved simply by working with iterators over them or, in a more optimized way, by working carefully with code units or code points. That is why, alongside the default grapheme cluster interface, access will be given to the lower-level layers for power users.
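As an illustration only, here is a naive cluster-respecting search; the full combining-mark check is reduced to one hardcoded mark (U+0301, combining acute), which real code would replace with the Unicode property tables, and `cluster_find` is a hypothetical name:

```cpp
#include <cstddef>
#include <string>

// Simplification: U+0301 stands in for the whole set of combining marks.
bool is_combining(char32_t c) { return c == U'\u0301'; }

// Find 'needle' in 'hay', rejecting matches that would split a grapheme
// cluster: the match must not start on a trailing combining mark, and
// must not be immediately followed by one.
std::size_t cluster_find(const std::u32string& hay,
                         const std::u32string& needle)
{
    for (std::size_t pos = hay.find(needle); pos != std::u32string::npos;
         pos = hay.find(needle, pos + 1))
    {
        std::size_t end = pos + needle.size();
        bool starts_cluster = pos == 0 || !is_combining(hay[pos]);
        bool ends_cluster = end == hay.size() || !is_combining(hay[end]);
        if (starts_cluster && ends_cluster)
            return pos;
    }
    return std::u32string::npos;
}
```

With this check, searching for "foo" in "foo" followed by a combining acute finds nothing, since the final cluster is o<acute>, not o.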
In that case, what advantages does UTF-32 hold over any of the other two?
First, some people might want to work directly with code points rather than grapheme clusters, just like most (if not all) other Unicode string classes do. With UTF-32 the code units are code points, so working with them is very lightweight. It can also be useful when interfacing with something that uses UTF-32, since no conversion is needed for input and output. That is also why there will be a UTF-8 backend too.
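To make the "code units are code points" point concrete, here is a bare-bones UTF-16 to UTF-32 decode (a hypothetical `to_utf32`, with no validation of unpaired surrogates); once decoded, s[i] really is the i-th code point, with no index mapping at all:

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 to UTF-32, combining each lead/trail surrogate pair
// into one code point. In the result, indexing is direct.
std::u32string to_utf32(const std::u16string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()) {
            char32_t trail = s[++i];
            c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
        out.push_back(c);
    }
    return out;
}
```

The four UTF-16 units of u"a\U00010000b" become three UTF-32 units, one per code point.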