
Rogier van Dalen wrote:
On Wed, 20 Oct 2004 12:48:31 -0700, Eric Niebler <eric@boost-consulting.com> wrote:
I think the default should be UTF-16 encoding, and that the iterator should use a scheme like this to be random access. Rationale: there are string algorithms that benefit from random access (Boyer-Moore comes to mind).
Correct me if I'm wrong. From what I gather from a Google search, Boyer-Moore is a fast string search algorithm. Why not run the algorithm on the code units rather than the code points? Neither UTF-8 nor UTF-16 is stateful, specifically to allow optimisations such as this (as well as error recovery).
Searching a Unicode string for a particular bit pattern is not particularly meaningful because the same string can be represented with different bit patterns. Have I misinterpreted what you are suggesting?

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
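
To make the two positions concrete, here is a minimal C++ sketch; it is not code from the thread. It runs a plain code-unit search (using `std::search` as a stand-in for Boyer-Moore) over two UTF-8 spellings of the same word: one with precomposed U+00E9, one with `e` followed by U+0301 COMBINING ACUTE ACCENT. The helper `contains_code_units`, the sample strings and `check_examples` are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Byte-wise (code-unit) substring search over UTF-8 data.
// No decoding and no normalization is performed.
bool contains_code_units(const std::string& haystack, const std::string& needle) {
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end()) != haystack.end();
}

// "cafe" with an acute accent, spelled with precomposed U+00E9
// (UTF-8 bytes C3 A9).
const std::string precomposed = "caf\xC3\xA9";

// The same word spelled as 'e' + U+0301 COMBINING ACUTE ACCENT
// (UTF-8 bytes 65 CC 81). Canonically equivalent, different bytes.
const std::string decomposed = "cafe\xCC\x81";

// Needle: the precomposed character U+00E9 on its own.
const std::string needle = "\xC3\xA9";

void check_examples() {
    // The code-unit search finds the needle in the precomposed spelling...
    assert(contains_code_units(precomposed, needle));
    // ...but misses it in the canonically equivalent decomposed spelling.
    assert(!contains_code_units(decomposed, needle));

    // Conversely, a bare "e" matches the base letter of the decomposed
    // spelling, although a reader would see an accented character there.
    assert(contains_code_units(decomposed, "e"));
    assert(!contains_code_units(precomposed, "e"));
}
```

Under these assumptions the code-unit search itself behaves well, since UTF-8 lets a byte-wise match start only on a character boundary. The mismatch comes from the two inputs not being in the same normalization form. Normalizing both haystack and needle first (for example to NFC) would be one way to reconcile the two views, at the cost of an extra pass over the text.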