Re: [boost] Re: Any interest in adding unicode support to boost?

21 Oct 2004

On Wed, 20 Oct 2004 12:48:31 -0700, Eric Niebler
<eric@boost-consulting.com> wrote:
...
> I think the default should be UTF-16 encoding, and that the iterator
> should use a scheme like this to be random access. Rationale: there are
> string algorithms that benefit from random access (Boyer-Moore comes to
> mind).

Correct me if I'm wrong. From what I gather from a Google search,
Boyer-Moore is a fast string search algorithm. Why not run the
algorithm on code units rather than code points? Neither UTF-8 nor
UTF-16 is stateful, precisely to allow optimisations such as this (as
well as error recovery).
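
As a rough illustration of searching on code units, here is a minimal
sketch in C++. It assumes a C++17 standard library
(std::boyer_moore_searcher, which did not exist when this thread was
written) and valid UTF-8 input; find_utf8 and demo_find_utf8 are
hypothetical names chosen for this example, not part of any proposed
Boost interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Returns the code-unit (byte) offset of the first match of `needle` in
// `haystack`, or std::string::npos if there is none.
// Assumption: both strings hold valid UTF-8.
std::size_t find_utf8(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;  // an empty pattern matches at the start

    // Boyer-Moore runs directly over the bytes; nothing is decoded to
    // code points. In valid UTF-8 a lead byte never equals a continuation
    // byte, so a match of a valid needle starts on a code point boundary.
    auto it = std::search(
        haystack.begin(), haystack.end(),
        std::boyer_moore_searcher(needle.begin(), needle.end()));

    return it == haystack.end()
               ? std::string::npos
               : static_cast<std::size_t>(it - haystack.begin());
}

// Small usage example with the expected offsets.
void demo_find_utf8() {
    const std::string text = "na\xC3\xAFve caf\xC3\xA9";  // "naïve café"
    assert(find_utf8(text, "caf\xC3\xA9") == 7);          // "café"
    assert(find_utf8(text, "\xC3\xAF") == 2);             // "ï"
    assert(find_utf8(text, "x") == std::string::npos);
}
```

The offsets are in code units, which is all a caller needs to slice the
original buffer; no random access to code points is involved.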

As was pointed out earlier in this thread, searching for Unicode
characters requires taking combining characters into account as well.
I think the same holds for many, if not all, algorithms that you can
think of: either they can be made to work on code units, or they must
work on abstract characters, which means dealing with a variable-width
representation anyway.
(See the Unicode Standard 4, Section 2.5 for a similar argument for
UTF-16 over UTF-32, even though the latter is fixed-width.)
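
To make the combining-character point concrete, the following C++
sketch compares two spellings of "é" as sequences of code points. It is
an illustration under a loud simplification: only the Combining
Diacritical Marks block (U+0300..U+036F) is treated as combining,
whereas a real implementation would consult the Unicode Character
Database. The names is_combining_mark, count_combining_sequences and
demo_combining are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F counts as a combining mark.
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Counts a base code point plus any combining marks that follow it as one
// unit. A mark at the very start, with no base, is counted as its own unit.
std::size_t count_combining_sequences(const std::u32string& s) {
    std::size_t count = 0;
    bool have_base = false;
    for (char32_t c : s) {
        if (!is_combining_mark(c) || !have_base) {
            ++count;
        }
        have_base = true;
    }
    return count;
}

// Usage example: two spellings of the same abstract character.
void demo_combining() {
    const std::u32string precomposed = U"\u00E9";   // é as one code point
    const std::u32string decomposed  = U"e\u0301";  // e + combining acute

    assert(precomposed.size() == 1);
    assert(decomposed.size() == 2);
    assert(precomposed != decomposed);  // differ as code point sequences

    assert(count_combining_sequences(precomposed) == 1);
    assert(count_combining_sequences(decomposed) == 1);
}
```

Even with fixed-width code points, element i of the sequence is not
character i: the decomposed spelling takes two elements for one
character, so the width is variable again at the level that matters.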

I'm ready to be proven wrong; for now, at least, I believe that any
effort to make UTF-16 randomly accessible is not worthwhile.

Regards,
Rogier

Rogier van Dalen