
On Fri, Jan 14, 2011 at 04:54:05PM +0200, Peter Dimov wrote:
> John B. Turpish wrote:
> - UTF-8 has the nice property that you can do things with a string
>   without even decoding the characters; for example, you can sort
>   UTF-8 strings as-is, or split them on a specific (7 bit) character,
>   such as '.' or '/'.

Please excuse me if I'm stating the obvious, but I feel I should
mention that binary sorting is not collation.
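To make that concrete, here's a minimal, untested sketch of the
difference. The "de_DE.UTF-8" locale name is an assumption (the
std::locale constructor throws if the system doesn't provide it), and
whether the char-based collate facet copes with multibyte UTF-8 is
platform-dependent; glibc's does.

  #include <algorithm>
  #include <iostream>
  #include <locale>
  #include <string>
  #include <vector>

  // Locale-aware "less than", using the std::collate facet.
  struct collate_less
  {
      std::locale loc;
      explicit collate_less(std::locale const & l) : loc(l) {}

      bool operator()(std::string const & a, std::string const & b) const
      {
          std::collate<char> const & coll =
              std::use_facet<std::collate<char> >(loc);
          return coll.compare(a.data(), a.data() + a.size(),
                              b.data(), b.data() + b.size()) < 0;
      }
  };

  int main()
  {
      std::vector<std::string> v;
      v.push_back("Apfel");
      v.push_back("Zebra");
      v.push_back("\xC3\x84pfel"); // "Äpfel", UTF-8 encoded

      // Byte-wise (binary) sort: "Äpfel" lands *after* "Zebra",
      // because its first byte, 0xC3, compares greater than 'Z'.
      std::sort(v.begin(), v.end());

      // Collation: with a German locale, "Äpfel" sorts with "Apfel".
      std::sort(v.begin(), v.end(),
                collate_less(std::locale("de_DE.UTF-8")));

      for (std::size_t i = 0; i < v.size(); ++i)
          std::cout << v[i] << '\n';
  }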
"The basic principle to remember is: The position of characters in the Unicode code charts does not specify their sorting weight." -- http://unicode.org/reports/tr10/#Introduction Any application that requires you to present a sorted list of strings to a user pretty much requires a collation algorithm; in that sense, the usefulness of the above mentioned property of UTF-8 is limited. Again, sorry if I'm stating the obvious here. I've had to bring up that argument in character encoding related discussions more than once, and it's become a bit of a knee-jerk response by now ;) For the application discussed, i.e. for passing strings to OS APIs, this really doesn't matter, though. Where it does matter slightly is when deciding whether or not to use UTF-8 internally in your application. The UCA maps code points to collation elements, or strings into lists of collation elements, and then binary sorts those collation element lists instead of the original strings. My guess would be that using UCS/UTF-32 for that is likely to be cheaper, though I haven't actually ran any comparisons here. If anyone has, I'd love to know. All of this is mostly an aside, I guess :) Jens -- 1.21 Jiggabytes of memory should be enough for anybody.