
Peter Bindels wrote:
That's not entirely accurate. UTF-8 is Latin-centric, in that all Latin texts can be processed in linear time, taking longer for the rest.
Huh? Not really. All non-ASCII characters, including accented Latin ones, require more than one byte per character in UTF-8. It can still be processed in linear time, though; the variable length just means you can't have O(1) random access by character index.
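A minimal sketch in C of the kind of linear scan meant here, counting code points by skipping UTF-8 continuation bytes; the function name is illustrative, not from any particular library:

```c
#include <stddef.h>

/* Count code points in a valid UTF-8 string: every byte that is
 * NOT a continuation byte (10xxxxxx) starts a new code point.
 * One pass, linear time -- but finding the Nth code point still
 * requires a scan, so there is no O(1) random access. */
size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```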
UTF-16 is common-centric, in that it works efficiently for all common texts in all common scripts, except for a few. Choosing UTF-8 over UTF-16 would make the implementation (and accompanying software) slow in all parts of the world that aren't solely using Latin characters.
I doubt the overhead is really noticeable. UTF-16 just makes validation and iteration a little simpler.
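For comparison, a hedged sketch of why UTF-16 iteration is a little simpler: the only variable-length case is a surrogate pair, versus UTF-8's four byte-length classes. The function name and signature here are illustrative only:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from a UTF-16 string of known length,
 * returning the index of the next code unit. The only multi-unit
 * case is a lead surrogate (0xD800..0xDBFF) followed by a trail
 * surrogate (0xDC00..0xDFFF). */
size_t utf16_next(const uint16_t *s, size_t len, size_t i, uint32_t *cp)
{
    uint16_t u = s[i];
    if (u >= 0xD800 && u <= 0xDBFF && i + 1 < len &&
        s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
        /* Combine the pair into a code point above U+FFFF. */
        *cp = 0x10000 +
              (((uint32_t)(u - 0xD800) << 10) | (uint32_t)(s[i + 1] - 0xDC00));
        return i + 2;
    }
    *cp = u;  /* BMP code point (unpaired surrogates passed through here) */
    return i + 1;
}
```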
That would be most of Europe, Asia, Africa, and South America, plus a good number of people in North America and Australia. Forcing them to UTF-32 instead makes for considerably worse memory use than could reasonably be expected.
UTF-32 allows random access by code point, but that's of limited use, since you need to iterate over the string anyway to handle combining characters.
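A small illustration of that point, assuming the string happens to be stored in decomposed form: even with fixed-width UTF-32 code units, indexing gives you code points, not user-perceived characters.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* "é" stored decomposed: 'e' (U+0065) + combining acute (U+0301).
     * Two UTF-32 code units, but one character as the user sees it,
     * so O(1) indexing by code point is not O(1) access to
     * user-perceived characters. */
    uint32_t s[] = { 0x0065, 0x0301 };
    printf("code units: %zu\n", sizeof s / sizeof s[0]); /* prints 2 */
    return 0;
}
```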