
Erik Wien wrote:
The iterators used are bidirectional, not random access (impossible on UTF-8 and UTF-16)
No. Andrei Alexandrescu explained a scheme to me whereby a UTF-16 encoded string can have a random-access iterator, and I think it should.

The basic idea is you keep a plain array of 16-bit integers which are the 16-bit characters and the first 16 bits of surrogate pairs. Then you have a data structure which maps from string offsets to the second 16 bits of surrogate pairs. Random access involves a simple index and a map look-up. Sequential access requires no map look-up. And since surrogate pairs are very rare, the map will almost always be empty and the look-up is skipped.

I think the default should be UTF-16 encoding, and that the iterator should use a scheme like this to be random access. Rationale: there are string algorithms that benefit from random access (Boyer-Moore comes to mind).

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
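
As a rough illustration of the scheme described above, here is a minimal C++ sketch. The class name sparse_utf16_string and its members are hypothetical, made up for this example; they are not part of any existing or proposed library. It assumes every appended code point is valid (at most 0x10FFFF and not itself a surrogate), and it uses std::unordered_map as the side table, which is only one possible choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: one array slot per code point, with the low
// (second) half of each surrogate pair kept in a side table.
class sparse_utf16_string {
public:
    // Append one code point. Assumes cp <= 0x10FFFF and cp is not a surrogate.
    void push_back(char32_t cp) {
        if (cp < 0x10000) {
            units_.push_back(static_cast<std::uint16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;
            const auto high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
            const auto low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
            low_.emplace(units_.size(), low);  // keyed by string offset
            units_.push_back(high);            // array holds the first 16 bits
        }
    }

    // Number of code points, which equals the number of array slots.
    std::size_t size() const { return units_.size(); }

    // Random access: a plain index, plus a map look-up only when the slot
    // holds a high surrogate. An empty map skips the look-up entirely.
    char32_t operator[](std::size_t i) const {
        const std::uint16_t u = units_[i];
        if (low_.empty() || u < 0xD800 || u > 0xDBFF) {
            return u;
        }
        const std::uint16_t low = low_.at(i);
        return 0x10000 + ((char32_t(u) - 0xD800) << 10) + (char32_t(low) - 0xDC00);
    }

private:
    std::vector<std::uint16_t> units_;                    // BMP units and high surrogates
    std::unordered_map<std::size_t, std::uint16_t> low_;  // offset -> low surrogate
};
```

This sketch does a map look-up for every surrogate pair it reads, including during sequential traversal. To get the "no look-up on sequential access" property, one option would be to keep the side table sorted by offset and have the iterator walk it in step with the array; that part is left out here.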