
On Wed, 20 Oct 2004 12:48:31 -0700, Eric Niebler <eric@boost-consulting.com> wrote:
> I think the default should be UTF-16 encoding, and that the iterator should use a scheme like this to be random access. Rationale: there are string algorithms that benefit from random access (Boyer-Moore comes to mind).
Correct me if I'm wrong. From what I gather from a Google search, Boyer-Moore is a fast string search algorithm. Why not run the algorithm on the code units rather than on the code points? Neither UTF-8 nor UTF-16 is stateful, specifically to allow optimisations such as this (as well as error recovery).

As was pointed out earlier in this thread, searching for Unicode characters requires looking at combining characters as well. I think this goes for many, if not all, algorithms that you can think of: either they can be made to work with code units, or they must work on abstract characters, which means a variable-width encoding anyway. (See the Unicode Standard 4.0, Section 2.5 for a similar argument for UTF-16 over UTF-32, even though the latter is fixed-width.)

I'm ready to be proven wrong; however, at this moment at least I believe that any effort to make UTF-16 randomly accessible is not useful.

Regards,
Rogier
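
To illustrate the code-unit approach argued for above, here is a minimal sketch that runs Boyer-Moore directly over UTF-16 code units. It assumes a C++17 standard library (std::boyer_moore_searcher postdates this thread), and find_code_units is a hypothetical helper name, not part of any proposed Boost interface. It also assumes both strings are well-formed UTF-16; because lead and trail surrogates occupy disjoint ranges, a match for a well-formed needle then always starts on a code point boundary. It does not address combining characters.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: searches for `needle` in `haystack`, treating both
// as plain sequences of UTF-16 code units. Returns the code-unit offset of
// the first match, or std::u16string::npos if there is none.
std::size_t find_code_units(const std::u16string& haystack,
                            const std::u16string& needle)
{
    // Boyer-Moore needs random access, which code units already provide;
    // no code point decoding is involved.
    auto it = std::search(
        haystack.begin(), haystack.end(),
        std::boyer_moore_searcher(needle.begin(), needle.end()));

    return it == haystack.end()
               ? std::u16string::npos
               : static_cast<std::size_t>(it - haystack.begin());
}
```

The returned offset counts code units, not code points, so a caller that needs a code point index would still have to walk the string up to that offset.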