
On Tue, Jul 19, 2011 at 4:24 PM, John Maddock <boost.regex@virgin.net> wrote:
Yes, they can read past the end of your input range if it contains invalid data at the end.
Interesting. Would a fix be difficult?
I was about to say there aren't any known issues, but yes, that is a problem, and a fix would mean changing the interface. The problem arises because the iterators store only the current position in the underlying sequence and assume that they can increment or decrement over a complete multi-byte sequence. So if your underlying sequence contains a *truncated* multi-byte sequence at the start or end of the string, then they can read past-the-end or even past-the-start :-(
The only real fix is to redesign them to be range-based, so we can add the additional checks necessary, but of course this also makes them more heavyweight than they are at present. I guess I was hoping we would have had a proper Unicode library for this by now (in Boost that is, not the sandbox ;)
Oh well, maybe I should just bite the bullet and change/fix this hole.
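To make the range-based idea above concrete, here is a minimal sketch of a forward UTF-8 decode step that is given the end of the range as well as the current position, so it can refuse a truncated sequence instead of reading past it. The names utf8_sequence_length, decode_next_checked and count_valid_code_points are hypothetical and are not part of Boost.Regex or any existing Boost interface. The sketch only covers the forward direction and does not reject overlong encodings or surrogates; a real fix would need the matching check against the start of the range when decrementing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Boost API): number of bytes a UTF-8 sequence
// occupies according to its lead byte, or 0 if the byte cannot start one.
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation byte or invalid lead
}

// Hypothetical range-checked decode step. Decodes one code point from
// [pos, last). On success it stores the value in `out`, advances `pos`
// past the sequence and returns true. On an empty range, a truncated
// sequence or a bad continuation byte it returns false and leaves `pos`
// unchanged. It never dereferences `last` or anything beyond it.
inline bool decode_next_checked(const char*& pos, const char* last,
                                std::uint32_t& out)
{
    assert(pos <= last);
    if (pos == last) return false;

    const unsigned char lead = static_cast<unsigned char>(*pos);
    const std::size_t len = utf8_sequence_length(lead);
    if (len == 0) return false;

    // The check a position-only iterator cannot make: is the whole
    // sequence actually inside the range?
    if (static_cast<std::size_t>(last - pos) < len) return false;

    // Payload bits carried by the lead byte, indexed by sequence length.
    static const unsigned char lead_mask[5] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};
    std::uint32_t value = lead & lead_mask[len];

    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(pos[i]);
        if ((c & 0xC0) != 0x80) return false;  // not 10xxxxxx
        value = (value << 6) | (c & 0x3F);
    }

    out = value;
    pos += len;
    return true;
}

// Counts the code points decoded from the front of `s`, stopping at the
// first truncated or malformed sequence.
inline std::size_t count_valid_code_points(const std::string& s)
{
    const char* pos = s.data();
    const char* last = pos + s.size();
    std::uint32_t cp = 0;
    std::size_t n = 0;
    while (decode_next_checked(pos, last, cp)) ++n;
    return n;
}
```

The cost mentioned above shows up directly in the signature: each step needs the end of the range in addition to the current position, so an iterator built on this would have to carry at least one extra pointer (two, once decrementing is range-checked as well).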
In my GSoC project I am currently developing a Unicode string adapter library that wraps conventional string types such as std::string and adds Unicode awareness to them. I am not sure whether it helps, but if you are developing new library APIs I think it might be useful. I have not completed the documentation yet, but you can look at the draft at http://crf.scriptmatrix.net/ustr/. The code repository is available at GitHub: https://github.com/crf00/boost.ustr. (Sorry, I don't mean to hijack the thread, but I hope that helps.)

cheers,
Soares Chen