
On 23/06/07, Johan Råde <rade@maths.lth.se> wrote:
Peter Bindels wrote:
Neither combining accents, nor in fact any accented characters, were in ASCII last time I checked.
Exactly what question is being discussed here?
As far as I was concerned, the question was how much switching from ASCII to UTF-8 would impact performance if only 7-bit characters were being used.
I thought the question was, how fast is text search with UTF-8 strings that happen to contain ASCII only, compared with text search with ASCII strings.
Exactly.
Even if the UTF-8 strings happen to contain ASCII, the search algorithm may still have to check for combining characters.
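To make that concrete, here is a minimal sketch of my own (the function names is_pure_ascii and find_utf8 are just illustrations, not an existing API): even when the data really is plain ASCII, the code still has to pay for a linear scan, or an equivalent per-byte check, before it can take the cheap byte-wise path.

    #include <cstddef>
    #include <string>

    // Returns true when every byte is 7-bit; only then can a search skip
    // all combining-character and normalization concerns.
    bool is_pure_ascii(const std::string& s)
    {
        for (unsigned char c : s)
            if (c & 0x80)   // any byte >= 0x80 starts or continues a multibyte sequence
                return false;
        return true;
    }

    // Hypothetical search wrapper: the fast path is the same byte search an
    // ASCII-only program would do; the slow path would have to normalize and
    // handle combining marks, which this sketch simply omits and falls back
    // to the (incorrect for combining characters) byte search.
    std::size_t find_utf8(const std::string& haystack, const std::string& needle)
    {
        if (is_pure_ascii(haystack) && is_pure_ascii(needle))
            return haystack.find(needle);   // plain byte search, same cost as ASCII
        return haystack.find(needle);       // placeholder for the Unicode-aware path
    }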
The wider question is, should people who currently use ASCII and care a lot about performance and don't care about i18n switch to UTF-8?
I think it's best to switch to UTF-8, for the simple reason that it's equally fast, or slightly slower but more correct. At the very least, from a corporate perspective, UTF-8 would strongly reduce development time by automatically (up to a certain limit) coping with behaviour that differs from plain ASCII, whilst barely impacting performance when only ASCII features are used.

Sorting specifically is an odd case. Having read only the introduction of the collation algorithm, I'm not entirely certain about the complexity, but I think it should come out to about O(n), similar to comparing two ASCII strings. It will have a higher constant factor, but that won't dominate performance in anything except the special case of high-performance computing centred on string processing.
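To illustrate the complexity point, a small sketch of my own, using std::collate as a stand-in for a real Unicode collation: the locale-aware comparison is still a single pass over both strings, just with a larger constant than a raw byte-wise compare.

    #include <locale>
    #include <string>

    // Locale-aware comparison: one linear pass over both strings, returning
    // <0, 0 or >0 like strcmp, but with collation rules applied.
    int compare_collated(const std::string& a, const std::string& b,
                         const std::locale& loc = std::locale())
    {
        const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
        return coll.compare(a.data(), a.data() + a.size(),
                            b.data(), b.data() + b.size());
    }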
From a purely technical perspective, ASCII is essentially a base case for UTF-8. If you properly wrap UTF-8, you can keep its entire technical complexity under the covers (putting the collation under operator< and operator==, and the multibyte character handling under operator<< and operator>>). That makes it as attractive as any string class is now over a (const) char *.
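As a rough sketch of what "under the covers" could look like (the class name and internals are only an illustration, and std::collate again stands in for a proper Unicode collation):

    #include <locale>
    #include <ostream>
    #include <string>

    class utf8_string {
    public:
        utf8_string(std::string bytes = {}) : bytes_(std::move(bytes)) {}

        // Collation hidden under operator< and operator==.
        friend bool operator<(const utf8_string& a, const utf8_string& b)
        {
            static const std::locale loc;
            const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
            return coll.compare(a.bytes_.data(), a.bytes_.data() + a.bytes_.size(),
                                b.bytes_.data(), b.bytes_.data() + b.bytes_.size()) < 0;
        }
        friend bool operator==(const utf8_string& a, const utf8_string& b)
        {
            return !(a < b) && !(b < a);   // equality in terms of the same collation
        }

        // Multibyte handling hidden under operator<<: UTF-8 is byte-oriented,
        // so streaming the raw bytes is already correct for a UTF-8 stream.
        friend std::ostream& operator<<(std::ostream& os, const utf8_string& s)
        {
            return os << s.bytes_;
        }

    private:
        std::string bytes_;   // stored as UTF-8 code units
    };

Client code then uses the class exactly as it would use any string class today, without ever seeing the collation or multibyte machinery.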
Regards, Peter