
On Wed, May 13, 2009 at 18:35, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
> Phil Endecott wrote:
>> Some feedback based on that document:
>>
>>> UTF-16 .... This is the recommended encoding for dealing with Unicode.
>>
>> Recommended by who? It's not the encoding that I would normally recommend.
>
> The Unicode standard, in some technical notes:
> http://www.unicode.org/notes/tn12/
> It recommends the use of UTF-16 for general purpose text processing. It
> also states that UTF-8 is good for compatibility and data exchange, and
> UTF-32 uses just too much memory and is thus quite a waste.

I really think UTF-8 should be the recommended one, since it forces people to remember that it's no longer one unit, one "character". Even in Beman Dawes's talk (http://www.boostcon.com/site-media/var/sphene/sphwiki/attachment/2009/05/07/...), where slide 11 brings up UTF-32 and notes that UTF-16 can still take two encoding units per codepoint, slide 13 says that UTF-16 is "desired" where "random access critical".

What kind of real-world use do people have for random access, anyway? Even UTF-32 isn't random access for anything I can think of that people would actually care about, what with combining codepoints and ligatures and other such things; there's a short demonstration at the end of this mail.

As an aside, I'd like to see comparisons between compressed UTF-8 and compressed UTF-16, since neither one is random-access anyway, and caring about the size of text before compression strikes me as about as meaningful as caring about the performance of a program with the optimizer turned off.
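
To make the code-unit point concrete, here's a quick sketch (it assumes a compiler with the C++0x char16_t/char32_t string literals): U+1D11E MUSICAL SYMBOL G CLEF needs a surrogate pair in UTF-16, and "e" plus U+0301 COMBINING ACUTE ACCENT is one visible character but two codepoints even in UTF-32.

#include <iostream>

int main()
{
    // U+1D11E lies outside the BMP, so UTF-16 needs two code units for it.
    const char     utf8[]  = "\xF0\x9D\x84\x9E"; // UTF-8 bytes of U+1D11E
    const char16_t utf16[] = u"\U0001D11E";      // surrogate pair
    // One user-perceived character, two codepoints even in UTF-32.
    const char32_t utf32[] = U"e\u0301";

    std::cout << "UTF-8 bytes for U+1D11E:       "
              << sizeof utf8 - 1 << '\n'                        // 4
              << "UTF-16 code units for U+1D11E: "
              << sizeof utf16 / sizeof(char16_t) - 1 << '\n'    // 2
              << "UTF-32 codepoints for one visible 'e': "
              << sizeof utf32 / sizeof(char32_t) - 1 << '\n';   // 2
}

So counting units gives you neither characters in UTF-16 nor "characters" in UTF-32; indexing by unit is only random access to something nobody asked for.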
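
And for the compression comparison, a rough first cut with zlib might look like the following (the sample string is made up for illustration; a fair test would need a large, representative corpus in each encoding, and you'd link with -lz):

#include <zlib.h>
#include <cstdio>
#include <string>
#include <vector>

// Return the zlib-compressed size of a raw byte buffer.
static unsigned long deflated_size(const unsigned char* data, uLong len)
{
    uLongf out_len = compressBound(len);
    std::vector<unsigned char> out(out_len);
    if (compress(out.data(), &out_len, data, len) != Z_OK)
        return 0; // shouldn't happen with a compressBound-sized buffer
    return out_len;
}

int main()
{
    // Same mixed ASCII/Latin-1 text in both encodings, repeated to give
    // the compressor something to work with.
    std::string    s8;
    std::u16string s16;
    for (int i = 0; i != 1000; ++i) {
        s8  += "na\xC3\xAFve caf\xC3\xA9 r\xC3\xA9sum\xC3\xA9 ";  // UTF-8 bytes
        s16 += u"na\u00EFve caf\u00E9 r\u00E9sum\u00E9 ";         // UTF-16
    }

    std::printf("UTF-8:  %lu bytes raw, %lu compressed\n",
                (unsigned long)s8.size(),
                deflated_size(
                    reinterpret_cast<const unsigned char*>(s8.data()),
                    s8.size()));
    std::printf("UTF-16: %lu bytes raw, %lu compressed\n",
                (unsigned long)(s16.size() * sizeof(char16_t)),
                deflated_size(
                    reinterpret_cast<const unsigned char*>(s16.data()),
                    s16.size() * sizeof(char16_t)));
}

I won't guess at the numbers, but once a general-purpose compressor has eaten the redundancy, I'd expect the raw-size argument for one encoding over the other to carry a lot less weight.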