Re: [boost] RFC: interest in Unicode codecs?

20 Jul 2009


Rogier van Dalen wrote:
...
Non-checking iterator adaptors can be faster. That would be useful
when you know that a string is safe, for example, in a UTF string type
that has a validity invariant.
I suppose that kind of string should use optimized iterators that
exploit the fact that it is stored in contiguous, properly aligned
memory anyway, so it will need special code.
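
To make the trade-off concrete, here is a minimal sketch of the two flavours of single-code-point decoding. The names decode_unchecked, decode_checked and sequence_length are hypothetical and not part of any proposed interface; the checked version throws on malformed input, which is only one possible error policy.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Length of the sequence introduced by lead byte b, or 0 if b cannot start one.
inline int sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

// Non-checking (hypothetical): assumes pos starts a well-formed sequence,
// so it needs neither an end iterator nor any validation branches.
template <typename It>
char32_t decode_unchecked(It& pos) {
    unsigned char b0 = static_cast<unsigned char>(*pos++);
    int len = sequence_length(b0);
    assert(len != 0);  // debug-only reminder of the precondition
    if (len == 1) return b0;
    char32_t cp = b0 & (0xFF >> (len + 1));  // payload bits of the lead byte
    for (int i = 1; i < len; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(*pos++) & 0x3F);
    return cp;
}

// Checking (hypothetical): validates everything and leaves pos untouched on error.
template <typename It>
char32_t decode_checked(It& pos, It end) {
    if (pos == end) throw std::out_of_range("decode past end");
    It cur = pos;
    unsigned char b0 = static_cast<unsigned char>(*cur++);
    int len = sequence_length(b0);
    if (len == 0) throw std::invalid_argument("invalid lead byte");
    char32_t cp = (len == 1) ? b0 : (b0 & (0xFF >> (len + 1)));
    for (int i = 1; i < len; ++i) {
        if (cur == end) throw std::invalid_argument("truncated sequence");
        unsigned char b = static_cast<unsigned char>(*cur++);
        if ((b & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
        cp = (cp << 6) | (b & 0x3F);
    }
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_for_len[len]) throw std::invalid_argument("overlong encoding");
    if (cp >= 0xD800 && cp <= 0xDFFF) throw std::invalid_argument("surrogate");
    if (cp > 0x10FFFF) throw std::invalid_argument("beyond U+10FFFF");
    pos = cur;
    return cp;
}
```

The unchecked version has one branch on the lead byte and nothing else, which is where the speed-up for a string with a validity invariant would come from.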
...
I think this means that all iterator adaptors can be constructed from
3 iterators (begin, position, end) and the ones that don't check the
input can also be constructed from 1 iterator. For a checking forward
iterator, only two iterators are necessary (position, end). This is
how I implemented this, at any rate.
Indeed, that makes 3 cases per encoding and I'm only handling the 
broadest case for now.
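
As an illustration of why the number of stored iterators differs, here is a rough sketch that only steps between code point boundaries (no decoding). checked_cursor and unchecked_cursor are hypothetical names made up for this example, not the adaptors under discussion. A checking forward-only variant would be checked_cursor without the begin iterator and without prev().

```cpp
#include <cassert>
#include <iterator>
#include <stdexcept>
#include <string>

inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Length declared by a lead byte, or 0 if the byte cannot start a sequence.
inline int lead_length(unsigned char b) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

// Non-checking (hypothetical): trusts the input, so one iterator is enough.
template <typename It>
struct unchecked_cursor {
    It pos;
    explicit unchecked_cursor(It p) : pos(p) {}
    void next() { std::advance(pos, lead_length(static_cast<unsigned char>(*pos))); }
    void prev() { do { --pos; } while (is_continuation(*pos)); }
};

// Checking, bidirectional (hypothetical): needs begin to bound prev()
// and end to bound next().
template <typename It>
struct checked_cursor {
    It begin_, pos, end_;
    checked_cursor(It b, It p, It e) : begin_(b), pos(p), end_(e) {}

    void next() {
        if (pos == end_) throw std::out_of_range("step past end");
        int len = lead_length(static_cast<unsigned char>(*pos));
        if (len == 0) throw std::invalid_argument("invalid lead byte");
        It cur = pos;
        ++cur;
        for (int i = 1; i < len; ++i, ++cur)
            if (cur == end_ || !is_continuation(*cur))
                throw std::invalid_argument("malformed sequence");
        pos = cur;
    }

    void prev() {
        It cur = pos;
        int n = 0;
        do {  // walk back over continuation bytes, never past begin_
            if (cur == begin_) throw std::out_of_range("step before begin");
            --cur;
            ++n;
        } while (is_continuation(*cur));
        if (lead_length(static_cast<unsigned char>(*cur)) != n)
            throw std::invalid_argument("malformed sequence");
        pos = cur;
    }
};
```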
...
It makes sense to design for correctness. It's probably worth keeping
in mind, though, whether conceivable extensions and optimisations are
possible in your design.
I suppose you could attach traits to select more efficient iteration methods.
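
One way this could look, purely as a sketch: a trait marks types whose contents are known to be valid, and tag dispatch picks the cheap or the validating code path. Everything here (is_known_valid, valid_utf8_string, count_code_points) is invented for illustration, and the checked path performs structural checks only.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical string type whose invariant is "bytes are well-formed UTF-8".
struct valid_utf8_string { std::string bytes; };

// Hypothetical trait: false unless a type promises validity.
template <typename T> struct is_known_valid : std::false_type {};
template <> struct is_known_valid<valid_utf8_string> : std::true_type {};

inline const std::string& bytes_of(const std::string& s) { return s; }
inline const std::string& bytes_of(const valid_utf8_string& s) { return s.bytes; }

// Fast path: in valid UTF-8 every non-continuation byte starts a code point.
inline std::size_t count_impl(const std::string& b, std::true_type) {
    std::size_t n = 0;
    for (unsigned char c : b) n += (c & 0xC0) != 0x80;
    return n;
}

// Checked path: verifies lead bytes and continuation bytes
// (overlong and surrogate checks are omitted for brevity).
inline std::size_t count_impl(const std::string& b, std::false_type) {
    std::size_t n = 0, i = 0;
    while (i < b.size()) {
        unsigned char c = static_cast<unsigned char>(b[i]);
        std::size_t len = c < 0x80 ? 1
                        : (c >= 0xC2 && c <= 0xDF) ? 2
                        : (c >= 0xE0 && c <= 0xEF) ? 3
                        : (c >= 0xF0 && c <= 0xF4) ? 4 : 0;
        if (len == 0 || len > b.size() - i)
            throw std::invalid_argument("malformed UTF-8");
        for (std::size_t k = 1; k < len; ++k)
            if ((static_cast<unsigned char>(b[i + k]) & 0xC0) != 0x80)
                throw std::invalid_argument("malformed UTF-8");
        i += len;
        ++n;
    }
    return n;
}

// Entry point: the trait selects the implementation at compile time.
template <typename S>
std::size_t count_code_points(const S& s) {
    return count_impl(bytes_of(s), is_known_valid<S>());
}
```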
...
I like the idea of the Pipe and related concepts. I am wondering,
however, whether the UTF-8 decoding iterator can be fast enough given
the current specification. I think Pipe (or another concept) might
have to support decoding of exactly one output element. Correct me if
I'm wrong.
I don't really understand what you mean.
Calling Pipe::ltr or Pipe::rtl decodes only one "element": for UTF-8
decoding, one multibyte sequence is read and one code point is written;
for UTF-8 encoding, one code point is read and one multibyte sequence is
written.
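
To illustrate the one-element-per-call idea, here is a rough sketch. The signatures (an input range plus an output iterator, returning the updated pair) are an assumption made for this example and may not match the actual Pipe concept; utf8_decoder_sketch and utf8_encoder_sketch are hypothetical names, and the decoder performs structural checks only.

```cpp
#include <cassert>
#include <iterator>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct utf8_decoder_sketch {
    // Reads one multibyte sequence at the front of [begin, end), writes one code point.
    template <typename In, typename Out>
    static std::pair<In, Out> ltr(In begin, In end, Out out) {
        if (begin == end) throw std::out_of_range("empty input");
        unsigned char b0 = static_cast<unsigned char>(*begin++);
        int len = b0 < 0x80 ? 1
                : (b0 >= 0xC2 && b0 <= 0xDF) ? 2
                : (b0 >= 0xE0 && b0 <= 0xEF) ? 3
                : (b0 >= 0xF0 && b0 <= 0xF4) ? 4 : 0;
        if (len == 0) throw std::invalid_argument("invalid lead byte");
        char32_t cp = (len == 1) ? b0 : (b0 & (0xFF >> (len + 1)));
        for (int i = 1; i < len; ++i) {
            if (begin == end) throw std::invalid_argument("truncated sequence");
            unsigned char b = static_cast<unsigned char>(*begin++);
            if ((b & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        *out++ = cp;
        return std::make_pair(begin, out);
    }

    // Reads the one sequence at the back of [begin, end), writes one code point,
    // and returns the new end of the unread input.
    template <typename In, typename Out>
    static std::pair<In, Out> rtl(In begin, In end, Out out) {
        In start = end;
        do {  // back up to the nearest lead byte
            if (start == begin) throw std::invalid_argument("no lead byte found");
            --start;
        } while ((static_cast<unsigned char>(*start) & 0xC0) == 0x80);
        std::pair<In, Out> r = ltr(start, end, out);
        if (r.first != end) throw std::invalid_argument("stray continuation bytes");
        return std::make_pair(start, r.second);
    }
};

struct utf8_encoder_sketch {
    // Reads one code point, writes one multibyte sequence.
    template <typename In, typename Out>
    static std::pair<In, Out> ltr(In begin, In end, Out out) {
        if (begin == end) throw std::out_of_range("empty input");
        char32_t cp = *begin++;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {
            *out++ = static_cast<char>(cp);
        } else if (cp < 0x800) {
            *out++ = static_cast<char>(0xC0 | (cp >> 6));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out++ = static_cast<char>(0xE0 | (cp >> 12));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            *out++ = static_cast<char>(0xF0 | (cp >> 18));
            *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
        return std::make_pair(begin, out);
    }
};
```

An iterator adaptor built on top of this would call ltr on increment and rtl on decrement, each call consuming exactly one element of the underlying sequence.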
...
The actual implementation of extensions and optimisations can be
delayed until the need appears. I'd be happy to contribute checking
policies.
Unfortunately, the mechanism for doing so has yet to be defined ;).

Mathias Gaunard