
On Fri, May 15, 2009 at 2:31 AM, Scott McMurray <me22.ca+boost@gmail.com> wrote:
On Wed, May 13, 2009 at 18:35, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Phil Endecott wrote:
Some feedback based on that document:
UTF-16 .... This is the recommended encoding for dealing with Unicode.
Recommended by who? It's not the encoding that I would normally recommend.
The Unicode standard, in some technical notes: http://www.unicode.org/notes/tn12/ It recommends the use of UTF-16 for general purpose text processing.
It also states that UTF-8 is good for compatibility and data exchange, and UTF-32 uses just too much memory and is thus quite a waste.
I really think UTF-8 should be the recommended one, since it forces people to remember that it's no longer one unit, one "character".
Even in Beman Dawes's talk (http://www.boostcon.com/site-media/var/sphene/sphwiki/attachment/2009/05/07/...), where slide 11 mentions UTF-32 and points out that UTF-16 can still take 2 code units per codepoint, slide 13 says that UTF-16 is "desired" where "random access critical".
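To make the "2 code units per codepoint" point concrete, here is a minimal sketch of UTF-16 encoding for a single codepoint. The function name encode_utf16 is a hypothetical helper made up for illustration; it is not taken from the talk or from any proposed library interface.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (illustration only): encode one Unicode codepoint
// as UTF-16 code units.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);              // must be inside the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogates are not valid codepoints to encode

    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: exactly one code unit.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary planes: a high surrogate followed by a low surrogate.
        std::uint32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
    }
    return units;
}
```

With this sketch, U+0041 encodes to one unit, while U+1D11E (MUSICAL SYMBOL G CLEF) encodes to the pair 0xD834 0xDD1E, so indexing a UTF-16 buffer by code unit is only codepoint-accurate when no such pairs are present.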
It is really important to recognize that there isn't a single recommended Unicode encoding. The most appropriate encoding can only be chosen in relationship to a particular application and/or algorithm. UTF-8 and UTF-16 are both in heavy use because they serve somewhat different needs. UTF-32 isn't used as often as those other two in strings, but I've found it very useful for passing around single codepoints. And then some needs change at runtime, so at least for strings an adaptive encoding is needed.
What kind of real-world use do people have for random access, anyways? Even UTF-32 isn't random access for the things I can think of that people would care about, what with combining codepoints and ligatures and other such things.
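A small sketch of the combining-codepoint problem, under a loud simplifying assumption: only the Combining Diacritical Marks block (U+0300..U+036F) is treated as combining. Real segmentation into user-perceived characters follows the rules in Unicode Standard Annex #29 and is considerably more involved. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMPTION (illustration only): a codepoint is "combining" iff it lies in
// the Combining Diacritical Marks block. Real text needs full UAX #29 rules.
bool is_combining_mark(std::uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Count user-perceived characters in a UTF-32 sequence by starting a new
// character at every codepoint that is not a combining mark.
std::size_t count_user_characters(const std::vector<std::uint32_t>& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // A leading combining mark has no base, so it counts on its own.
        if (i == 0 || !is_combining_mark(s[i]))
            ++n;
    }
    return n;
}
```

For the sequence U+0065 U+0301 ("e" followed by COMBINING ACUTE ACCENT), the UTF-32 length is 2 but this count is 1, so s[1] is not "the second character" even though every codepoint occupies exactly one unit.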
There are several related issues, assuming we are talking about strings. Some operations are doable but uncommon, so the cost of doing them should only be incurred if they are actually needed. Some operations are unsafe without prior knowledge of the string contents, but are perfectly safe with knowledge of the contents. Some operations may be quite a bit cheaper in C++0x than in C++03, etc. It is hard to talk in the abstract; we need to see the actual algorithms first. --Beman
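One possible illustration of "safe with knowledge of the contents", offered only as a sketch and not as any of the actual algorithms under discussion: if a UTF-8 string is known to be pure ASCII, the byte offset of the n-th codepoint is simply n, whereas without that knowledge a linear scan is needed. The names is_ascii and offset_of_code_point are hypothetical, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if every byte is below 0x80, i.e. the UTF-8 string is pure ASCII.
bool is_ascii(const std::string& utf8)
{
    for (std::size_t i = 0; i < utf8.size(); ++i)
        if (static_cast<unsigned char>(utf8[i]) >= 0x80)
            return false;
    return true;
}

// Byte offset of the n-th codepoint (0-based) in well-formed UTF-8.
// Returns utf8.size() if the string has n or fewer codepoints.
std::size_t offset_of_code_point(const std::string& utf8, std::size_t n,
                                 bool known_ascii)
{
    if (known_ascii) {
        // Prior knowledge of the contents: one byte per codepoint, O(1).
        assert(n <= utf8.size());
        return n;
    }
    // No prior knowledge: count lead bytes (anything not of form 10xxxxxx).
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) != 0x80) {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return utf8.size();
}
```

The design point is that the result of is_ascii can be computed once and cached alongside the string, so the cost of the scan is only incurred when random access by codepoint is actually requested.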