
Peter Bindels wrote:
That's not entirely accurate. UTF-8 is Latin-centric, in that all Latin texts can be processed in linear time, taking longer for the rest.
Huh? Not really. All non-ASCII characters, including accented Latin ones, require more than one byte per character in UTF-8. It can still be processed in linear time, though; the variable length just means you can't have O(1) random access by character index.
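A minimal sketch in C of the kind of linear scan meant here, counting code points by skipping UTF-8 continuation bytes; the function name is illustrative, not from any particular library:

```c
#include <stddef.h>

/* Count code points in a valid UTF-8 string: every byte that is
 * NOT a continuation byte (10xxxxxx) starts a new code point.
 * One pass, linear time -- but finding the Nth code point still
 * requires a scan, so there is no O(1) random access. */
size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```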
UTF-16 is common-centric, in that it works efficiently for all common texts in all common scripts, except for a few. Choosing UTF-8 over UTF-16 would make the implementation (and accompanying software) slow in all parts of the world that aren't solely using Latin characters.
I doubt the overhead is really noticeable. UTF-16 just makes validation and iteration a little simpler.
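For comparison, a hedged sketch of why UTF-16 iteration is a little simpler: the only variable-length case is a surrogate pair, versus UTF-8's four byte-length classes. The function name and signature here are illustrative only:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from a UTF-16 string of known length,
 * returning the index of the next code unit. The only multi-unit
 * case is a lead surrogate (0xD800..0xDBFF) followed by a trail
 * surrogate (0xDC00..0xDFFF). */
size_t utf16_next(const uint16_t *s, size_t len, size_t i, uint32_t *cp)
{
    uint16_t u = s[i];
    if (u >= 0xD800 && u <= 0xDBFF && i + 1 < len &&
        s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
        /* Combine the pair into a code point above U+FFFF. */
        *cp = 0x10000 +
              (((uint32_t)(u - 0xD800) << 10) | (uint32_t)(s[i + 1] - 0xDC00));
        return i + 2;
    }
    *cp = u;  /* BMP code point (unpaired surrogates passed through here) */
    return i + 1;
}
```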
That would be most of Europe, Asia, Africa, and South America, plus a good number of people in North America and Australia. Forcing them to UTF-32 instead makes for considerably worse memory use than could reasonably be expected.
UTF-32 allows random access by code point, but that's of limited use, since you need to iterate over the string anyway to handle combining characters.
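A small illustration of that point, assuming the string happens to be stored in decomposed form: even with fixed-width UTF-32 code units, indexing gives you code points, not user-perceived characters.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* "é" stored decomposed: 'e' (U+0065) + combining acute (U+0301).
     * Two UTF-32 code units, but one character as the user sees it,
     * so O(1) indexing by code point is not O(1) access to
     * user-perceived characters. */
    uint32_t s[] = { 0x0065, 0x0301 };
    printf("code units: %zu\n", sizeof s / sizeof s[0]); /* prints 2 */
    return 0;
}
```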