Re: [boost] [regex] How robust are the <boost/regex/pending/unicode_iterator.hpp> adapters?

20 Jul 2011


      On Tue, Jul 19, 2011 at 4:24 PM, John Maddock <boost.regex@virgin.net> wrote:
...
...
...
Yes, they can read past the end of your input range if it contains
invalid data at the end.
Interesting. Would a fix be difficult?
I was about to say there aren't any known issues, but yes that is a problem
- and a fix would mean changing the interface - the problem comes because
the iterators only store the current position in the underlying sequence and
assume that they can increment or decrement over a complete multibyte
sequence.  So if your underlying sequence contains a *truncated* multibyte
sequence at the start or end of the string then they can read past-the-end
or even past-the-start :-(
The only real fix is to redesign them to be range-based, so we can add the
additional checks necessary, but of course this also makes them more
heavyweight than they are at present.  I guess I was hoping we would have
had a proper Unicode library for this by now (in Boost that is, not the
sandbox ;)
Oh well, maybe I should just bite the bullet and change/fix this hole.
In my GSoC project I am currently developing a Unicode string adapter
library that wraps and adds Unicode awareness to conventional string
types such as std::string. Not sure if that helps but if you are
developing new library APIs I think this might be useful. I still have
not completed the documentation but you can look at the draft at
http://crf.scriptmatrix.net/ustr/. The code repository is available at
GitHub: https://github.com/crf00/boost.ustr.

(Sorry, I don't mean to hijack the thread, but I hope that helps.)


cheers,

Soares Chen

Soares Chen Ruo Fei