
On 23/06/07, Johan Råde <rade@maths.lth.se> wrote:
Peter Bindels wrote:
Neither combining accents, nor in fact any accented characters, were in ASCII last time I checked.
Exactly what question is being discussed here?
As far as I was concerned, the question was how much switching from ASCII to UTF-8 would impact performance if only 7-bit characters were being used.
I thought the question was, how fast is text search with UTF-8 strings that happen to contain ASCII only, compared with text search with ASCII strings.
Exactly.
Even if the UTF-8 strings happen to contain ASCII, the search algorithm may still have to check for combining characters.
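To make that concrete, here is a minimal sketch of my own (the function names is_pure_ascii and find_utf8 are just illustrations, not an existing API): even when the data really is plain ASCII, the code still has to pay for a linear scan, or an equivalent per-byte check, before it can take the cheap byte-wise path.

    #include <cstddef>
    #include <string>

    // Returns true when every byte is 7-bit; only then can a search skip
    // all combining-character and normalization concerns.
    bool is_pure_ascii(const std::string& s)
    {
        for (unsigned char c : s)
            if (c & 0x80)   // any byte >= 0x80 starts or continues a multibyte sequence
                return false;
        return true;
    }

    // Hypothetical search wrapper: the fast path is the same byte search an
    // ASCII-only program would do; the slow path would have to normalize and
    // handle combining marks, which this sketch simply omits and falls back
    // to the (incorrect for combining characters) byte search.
    std::size_t find_utf8(const std::string& haystack, const std::string& needle)
    {
        if (is_pure_ascii(haystack) && is_pure_ascii(needle))
            return haystack.find(needle);   // plain byte search, same cost as ASCII
        return haystack.find(needle);       // placeholder for the Unicode-aware path
    }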
The wider question is, should people who currently use ASCII and care a lot about performance and don't care about i18n switch to UTF-8?
I think it's best to switch to UTF-8, for the simple reason that it's equally fast, or slightly slower but more correct. At the very least, from a corporate perspective, UTF-8 would strongly reduce development time by automatically (up to a certain limit) coping with behaviour that differs from plain ASCII, whilst barely impacting performance when only ASCII features are used.

Sorting specifically is an odd case. Having read only the introduction of the collation algorithm, I'm not entirely certain about the complexity, but I think it should come out to about O(n), similar to comparing two ASCII strings. It will have a higher constant factor, but that won't dominate performance in anything except the special case of high-performance computing centred on string processing.
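To illustrate the complexity point, a small sketch of my own, using std::collate as a stand-in for a real Unicode collation: the locale-aware comparison is still a single pass over both strings, just with a larger constant than a raw byte-wise compare.

    #include <locale>
    #include <string>

    // Locale-aware comparison: one linear pass over both strings, returning
    // <0, 0 or >0 like strcmp, but with collation rules applied.
    int compare_collated(const std::string& a, const std::string& b,
                         const std::locale& loc = std::locale())
    {
        const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
        return coll.compare(a.data(), a.data() + a.size(),
                            b.data(), b.data() + b.size());
    }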
From a purely technical perspective, ASCII is essentially a base case for UTF-8. If you properly wrap UTF-8, you can keep its entire technical complexity under the covers (putting the collation under operator< and operator==, and the multibyte character handling under operator<< and operator>>). That makes it as attractive as any string class is now over a (const) char *.
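As a rough sketch of what "under the covers" could look like (the class name and internals are only an illustration, and std::collate again stands in for a proper Unicode collation):

    #include <locale>
    #include <ostream>
    #include <string>

    class utf8_string {
    public:
        utf8_string(std::string bytes = {}) : bytes_(std::move(bytes)) {}

        // Collation hidden under operator< and operator==.
        friend bool operator<(const utf8_string& a, const utf8_string& b)
        {
            static const std::locale loc;
            const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
            return coll.compare(a.bytes_.data(), a.bytes_.data() + a.bytes_.size(),
                                b.bytes_.data(), b.bytes_.data() + b.bytes_.size()) < 0;
        }
        friend bool operator==(const utf8_string& a, const utf8_string& b)
        {
            return !(a < b) && !(b < a);   // equality in terms of the same collation
        }

        // Multibyte handling hidden under operator<<: UTF-8 is byte-oriented,
        // so streaming the raw bytes is already correct for a UTF-8 stream.
        friend std::ostream& operator<<(std::ostream& os, const utf8_string& s)
        {
            return os << s.bytes_;
        }

    private:
        std::string bytes_;   // stored as UTF-8 code units
    };

Client code then uses the class exactly as it would use any string class today, without ever seeing the collation or multibyte machinery.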
Regards, Peter