
On 17/09/06, loufoque <mathias.gaunard@etu.u-bordeaux1.fr> wrote:
Peter Bindels wrote:
That's not entirely accurate. UTF-8 is Latin-centric, so that all Latin texts can be processed in linear time, with the rest taking longer.
Huh? Not really. All non-ASCII characters, including Latin ones, require more than one byte per character.
OK, I'll take that back about Latin. What I intended to say was the Latin subset represented in 7-bit ASCII.
UTF-16 is common-centric, in that it works efficiently for all common texts in all common scripts, except for a few. Choosing UTF-8 over UTF-16 would make the implementation (and accompanying software) slow in all parts of the world that aren't solely using Latin characters.
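To make the byte counts concrete, here is a minimal Python sketch. The sample characters and the helper name `encoded_sizes` are my own choices for illustration, not something from the thread. It reports how many bytes one code point takes in UTF-8 and in UTF-16:

```python
# Sample characters chosen for illustration only:
# ASCII letter, Latin-1 letter, euro sign, a CJK ideograph, a non-BMP symbol.
samples = ["A", "\u00e9", "\u20ac", "\u65e5", "\U0001D11E"]

def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one code point."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    u8, u16 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

Only the ASCII letter fits in one UTF-8 byte. The accented Latin letter already needs two, and everything from U+0800 up to the end of the BMP needs three, against two in UTF-16.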
I doubt the overhead is really noticeable. UTF-16 just makes validation and iteration a little simpler.
Indexing in UTF-32 is trivial. Indexing in UTF-16 is fairly trivial, and given where the boundary lies between the Basic Multilingual Plane and the higher planes, you should treat all characters >0xFFFF (encoded with two code units) as very irregular. You could then keep an array of the indexes where these characters appear in your string (adding slightly to the overhead), making indexing constant-time except for the occurrences of those characters. You cannot apply this technique to UTF-8 texts because non-7-bit characters are a lot more common. Add to that that the UTF-8 2-byte encoding only supports 11-bit values. That means that all characters from 0x0800...0xD7FF and 0xE000...0xFFFF use a byte more than they would in UTF-16. I checked this, and it includes about all of Asia, in particular all common Japanese and Chinese characters, as well as a number of Latin extended characters. You can see the ranges of Unicode characters in the filenames of the links at: http://www.unicode.org/charts/
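A rough sketch of the side-index idea described above, in Python. The class name `IndexedUtf16` and its methods are hypothetical, made up for this illustration. It assumes the string is built once and not modified afterwards, and it uses a binary search over the side array, so a lookup costs O(log k) for k supplementary characters rather than strictly constant time:

```python
import bisect

class IndexedUtf16:
    """Hypothetical helper: UTF-16 code units plus a sorted list of the
    code point indexes that hold supplementary (> 0xFFFF) characters."""

    def __init__(self, text):
        self.units = []    # UTF-16 code units
        self.astral = []   # code point indexes of surrogate pairs, ascending
        for i, ch in enumerate(text):
            cp = ord(ch)
            if cp > 0xFFFF:
                cp -= 0x10000
                self.units.append(0xD800 + (cp >> 10))    # high surrogate
                self.units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
                self.astral.append(i)
            else:
                self.units.append(cp)

    def code_point_at(self, index):
        # Every surrogate pair before `index` shifts the unit offset by one.
        offset = index + bisect.bisect_left(self.astral, index)
        unit = self.units[offset]
        if 0xD800 <= unit <= 0xDBFF:
            low = self.units[offset + 1]
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        return unit

s = IndexedUtf16("a\U0001D11Eb\u65e5\U0001F600c")
print(hex(s.code_point_at(1)), hex(s.code_point_at(5)))
```

When the text has no characters above 0xFFFF, the side array is empty and the lookup reduces to a plain array access.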
UTF-32 allows random access but that's rather useless since you need to iterate over the string anyway to handle combining characters.
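To illustrate the combining-character point, here is a small Python sketch. The function name `base_character_count` is invented for this example, and counting non-combining code points is only a crude approximation of real grapheme segmentation (UAX #29 has more rules). It shows that even with one code point per element, an index does not necessarily correspond to a user-perceived character:

```python
import unicodedata

def base_character_count(text):
    """Count code points that are not combining marks.
    This is a simplification, not full grapheme cluster segmentation."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

print(len(decomposed), base_character_count(decomposed), len(composed))
```

The decomposed form has two code points but reads as one character, so finding the n-th visible character still means walking the string from the start.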
That's a point I hadn't thought of. In that case, what advantages does UTF-32 hold over either of the other two?