Re: [boost] Work that has been done on Unicode

17 Sep 2006

On 17/09/06, loufoque <mathias.gaunard@etu.u-bordeaux1.fr> wrote:
...
Peter Bindels wrote :
...
That's not entirely accurate. UTF-8 is Latin-centric, so that all
Latin texts can be processed in linear time, taking longer for the
rest.
Huh?
Not really.
All non-ASCII characters, including Latin ones, require more than one
byte per character.
OK, I take back what I said about Latin; I meant the part of the Latin
script that is represented in 7-bit ASCII.
...
...
UTF-16 is common-centric, in that it works efficiently for all
common texts in all common scripts, except for a few. Choosing
UTF-8 over UTF-16 would make the implementation (and accompanying
software) slow in all parts of the world that aren't solely using
Latin characters.
I doubt the overhead is really noticeable.
UTF-16 just makes validation and iteration a little simpler.
Indexing in UTF-32 is trivial. Indexing in UTF-16 is fairly trivial, and
given where the boundary lies between the base plane and the higher
planes, you can treat all characters above 0xFFFF (encoded as two
16-bit units) as very irregular. You could then keep an array of the
indexes where these characters appear in your string (adding slightly
to the overhead), which makes indexing constant-time except where
those characters occur. You cannot apply this technique to UTF-8 text
because non-7-bit characters are a lot more common.
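For what it's worth, here is a rough sketch of the side-table idea from the paragraph above. The class name IndexedUtf16 and its members are made up for illustration and are not part of any Boost library. It assumes a C++11 compiler (for char16_t/char32_t) and does no bounds checking. I used a binary search over the table, so a lookup costs O(log k) for k surrogate pairs and is effectively constant when the table is empty; that search is my choice, not something stated above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: a UTF-16 string plus a sorted table of the
// code point indexes at which surrogate pairs occur.
class IndexedUtf16 {
public:
    explicit IndexedUtf16(std::u16string units) : units_(std::move(units)) {
        std::size_t cp = 0;  // running code point index
        for (std::size_t i = 0; i < units_.size(); ++i, ++cp) {
            if (is_high(units_[i]) && i + 1 < units_.size() &&
                is_low(units_[i + 1])) {
                pair_cp_index_.push_back(cp);  // remember the irregular spot
                ++i;                           // skip the low surrogate
            }
        }
        length_ = cp;
    }

    // Number of code points (not 16-bit units).
    std::size_t length() const { return length_; }

    // Code point at index cp. No bounds checking in this sketch.
    char32_t at(std::size_t cp) const {
        // Every surrogate pair before cp shifts the unit offset by one.
        auto it = std::lower_bound(pair_cp_index_.begin(),
                                   pair_cp_index_.end(), cp);
        std::size_t pairs_before =
            static_cast<std::size_t>(it - pair_cp_index_.begin());
        std::size_t u = cp + pairs_before;
        char16_t hi = units_[u];
        if (it != pair_cp_index_.end() && *it == cp) {
            char16_t lo = units_[u + 1];
            return static_cast<char32_t>(
                0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10) +
                (static_cast<char32_t>(lo) - 0xDC00));
        }
        return hi;  // BMP character (or an unpaired surrogate, passed through)
    }

private:
    static bool is_high(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
    static bool is_low(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

    std::u16string units_;
    std::vector<std::size_t> pair_cp_index_;
    std::size_t length_ = 0;
};
```

For a string consisting of 'a', U+1D11E and 'b', the table holds the single entry 1, length() is 3, and at(2) finds 'b' at unit offset 3.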

Add to that that the UTF-8 2-byte encoding only holds 11 bits.
That means that all characters from 0x0800...0xD7FF and
0xE000...0xFFFC use a byte more than they would in UTF-16. I checked
this; it includes nearly all of Asia, in particular all
common Japanese and Chinese characters, as well as a number of Latin
extended characters. You can see the ranges of Unicode characters in
the filenames of the links at:
http://www.unicode.org/charts/
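To make the size comparison concrete, here is a small sketch that returns the encoded size of a single code point in each encoding, using the range boundaries from the UTF-8 and UTF-16 definitions (0x80, 0x800 and 0x10000). The function names are mine, and the code assumes the input is a valid code point; it does not reject surrogates or values above 0x10FFFF.

```cpp
#include <cstddef>

// Bytes needed to encode one code point in UTF-8.
constexpr std::size_t utf8_bytes(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed to encode one code point in UTF-16
// (one 16-bit unit in the base plane, a surrogate pair above it).
constexpr std::size_t utf16_bytes(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}
```

So U+65E5 (a common CJK ideograph) takes 3 bytes in UTF-8 and 2 in UTF-16, while U+00E9 takes 2 bytes in both and plain ASCII takes 1 byte in UTF-8 against 2 in UTF-16.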
...
UTF-32 allows random access but that's rather useless since you need to
iterate over the string anyway to handle combining characters.
That's a point I hadn't thought of. In that case, what advantages does
UTF-32 hold over either of the other two?
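
To illustrate the combining-character point quoted above, here is a deliberately crude sketch. It treats only the Combining Diacritical Marks block (U+0300...U+036F) as "combining", which is a big simplification on my part; proper segmentation follows the Unicode text segmentation rules and covers far more than this. The function names are made up.

```cpp
#include <cstddef>
#include <string>

// Crude assumption: only U+0300..U+036F counts as a combining mark.
inline bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Counts user-perceived characters in a UTF-32 string by walking it
// from the start: a code point opens a new character unless it is a
// combining mark following something (a leading mark stands alone).
inline std::size_t count_user_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i == 0 || !is_combining_mark(s[i])) {
            ++n;
        }
    }
    return n;
}
```

With the three code points 'e', U+0301, 'a', size() is 3 but the count is 2, so s[1] is not "the second character" even in UTF-32; you still have to scan to find where characters begin.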