[regex] How robust are the <boost/regex/pending/unicode_iterator.hpp> adapters?

Boost.Filesystem needs the UTF-32 to UTF-16 and UTF-16 to UTF-32 adapters to implement char16_t and char32_t support. Do they have any known bugs or other outstanding problems? TIA, --Beman

----- Original Message ----
From: Beman Dawes <bdawes@acm.org> To: Boost Developers List <boost@lists.boost.org> Sent: Mon, July 18, 2011 7:17:19 PM Subject: [boost] [regex] How robust are the <boost/regex/pending/unicode_iterator.hpp> adapters?
Boost.Filesystem needs the UTF-32 to UTF-16 and UTF-16 to UTF-32 adapters to implement char16_t and char32_t support.
Converting UTF-16 to UTF-32 and back is quite trivial and can be done in 10-20 lines of code. You can borrow some code from Boost.Locale; it is really trivial. If you need, I'll give you samples. Really, don't add a dependency for this...
Do they have any known bugs or other outstanding problems?
I can tell you char16_t and char32_t are far-far-far-far from being useful, especially if you want to use codecvt or something like that. http://cppcms.sourceforge.net/boost_locale/html/status_of_cpp0x_characters_s... So don't be in a hurry to implement anything useful for them; currently, support for these character types is totally broken in the major compilers.
TIA,
--Beman
Best, Artyom Beilis -------------- CppCMS - C++ Web Framework: http://cppcms.sf.net/ CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/
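
For readers following the thread, here is a minimal sketch of the kind of 10-20 line conversion Artyom is referring to. It is not taken from Boost.Locale or Boost.Regex; the function names and the utf16_unit/utf32_unit typedefs (stand-ins for char16_t and char32_t, in the spirit of the uint16_t/uint32_t workaround) are assumptions made for illustration. It uses the standard surrogate-pair arithmetic and checks the bounds before reading a low surrogate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical names; stand-ins for char16_t / char32_t.
typedef std::uint16_t utf16_unit;
typedef std::uint32_t utf32_unit;

// Encode UTF-32 code points as UTF-16.
// Rejects surrogate values and anything above U+10FFFF.
std::vector<utf16_unit> utf32_to_utf16(const std::vector<utf32_unit>& in)
{
    std::vector<utf16_unit> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        utf32_unit cp = in[i];
        if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
            throw std::invalid_argument("invalid code point");
        if (cp < 0x10000u) {
            out.push_back(static_cast<utf16_unit>(cp));
        } else {
            cp -= 0x10000u;
            assert(cp <= 0xFFFFFu);  // 20 bits remain once the offset is removed
            out.push_back(static_cast<utf16_unit>(0xD800u + (cp >> 10)));
            out.push_back(static_cast<utf16_unit>(0xDC00u + (cp & 0x3FFu)));
        }
    }
    return out;
}

// Decode UTF-16 to UTF-32.
// The bounds test comes before the read of the second unit of a pair.
std::vector<utf32_unit> utf16_to_utf32(const std::vector<utf16_unit>& in)
{
    std::vector<utf32_unit> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        utf32_unit u = in[i];
        if (u >= 0xD800u && u <= 0xDBFFu) {
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00u || in[i + 1] > 0xDFFFu)
                throw std::invalid_argument("unpaired high surrogate");
            u = 0x10000u + ((u - 0xD800u) << 10) + (in[i + 1] - 0xDC00u);
            ++i;  // consume the low surrogate as well
        } else if (u >= 0xDC00u && u <= 0xDFFFu) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        out.push_back(u);
    }
    return out;
}
```

Throwing on malformed input is only one possible policy; substituting U+FFFD would be another. Either way the code still has to be tested and maintained, which is the cost Beman raises in his reply.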

On Mon, Jul 18, 2011 at 12:28 PM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
----- Original Message ----
From: Beman Dawes <bdawes@acm.org> To: Boost Developers List <boost@lists.boost.org> Sent: Mon, July 18, 2011 7:17:19 PM Subject: [boost] [regex] How robust are the <boost/regex/pending/unicode_iterator.hpp> adapters?
Boost.Filesystem needs the UTF-32 to UTF-16 and UTF-16 to UTF-32 adapters to implement char16_t and char32_t support.
Converting UTF-16 to UTF-32 and back is quite trivial and can be done in 10-20 lines of code. You can borrow some code from Boost.Locale; it is really trivial.
If you need, I'll give you samples.
Really, don't add a dependency for this...
Adding a dependency on header-only code that has been stable for years seems very low risk. Even if it is only a few lines of code, I'd have to test it and maintain it, and I'd prefer not to have to do either.
Do they have any known bugs or other outstanding problems?
I can tell you char16_t and char32_t are far-far-far-far from being useful especially if you want to use codecvt or something like that.
People have used uint16_t and uint32_t typedefs as workarounds for years, and found that useful. So it doesn't matter all that much if compilers or standard libraries don't yet support char16_t and char32_t directly. --Beman

On 18/07/2011 18:17, Beman Dawes wrote:
Boost.Filesystem needs the UTF-32 to UTF-16 and UTF-16 to UTF-32 adapters to implement char16_t and char32_t support. Do they have any known bugs or other outstanding problems?
Yes, they can read past the end of your input range if it contains invalid data at the end.

On Mon, Jul 18, 2011 at 1:17 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 18/07/2011 18:17, Beman Dawes wrote:
Boost.Filesystem needs the UTF-32 to UTF-16 and UTF-16 to UTF-32 adapters to implement char16_t and char32_t support. Do they have any known bugs or other outstanding problems?
Yes, they can read past the end of your input range if it contains invalid data at the end.
Interesting. Would a fix be difficult? Alternately, does your Unicode library in the sandbox have equivalent functionality, but without reading past the end on bad data? --Beman

Boost.Filesystem needs the UTF-32 to UTF-16 and UTF-16 to UTF-32 adapters to implement char16_t and char32_t support. Do they have any known bugs or other outstanding problems?
Yes, they can read past the end of your input range if it contains invalid data at the end.
Interesting. Would a fix be difficult?
I was about to say there aren't any known issues, but yes that is a problem - and a fix would mean changing the interface - the problem comes because the iterators only store the current position in the underlying sequence and assume that they can increment or decrement over a complete multi-byte sequence. So if your underlying sequence contains a *truncated* multibyte sequence at the start or end of the string then they can read past-the-end or even past-the-start :-( The only real fix is to redesign them to be range-based, so we can add the additional checks necessary, but of course this also makes them more heavyweight than they are at present. I guess I was hoping we would have had a proper Unicode library for this by now (in Boost that is, not the sandbox ;) Oh well, maybe I should just bite the bullet and change/fix this hole. John.
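
To make the hazard concrete, here is a minimal sketch using raw UTF-16 pointers rather than the actual Boost.Regex adapters. The helper names (next_unchecked, next_checked, count_code_points) are hypothetical. The first stepping function knows only the current position and trusts that a high surrogate is followed by its partner; the second is given the end of the range and so cannot move beyond it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

inline bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800u && u <= 0xDBFFu; }
inline bool is_low_surrogate(std::uint16_t u)  { return u >= 0xDC00u && u <= 0xDFFFu; }

// Position-only stepping: assumes a complete pair follows a high surrogate.
// If the last unit of the range is a lone high surrogate, the result lies
// beyond the end of the range, and a later dereference would be out of bounds.
inline const std::uint16_t* next_unchecked(const std::uint16_t* p)
{
    return p + (is_high_surrogate(*p) ? 2 : 1);
}

// Range-aware stepping: takes `end`, so it never moves beyond it.
inline const std::uint16_t* next_checked(const std::uint16_t* p,
                                         const std::uint16_t* end)
{
    assert(p != end);  // precondition: there is a unit to step over
    if (is_high_surrogate(*p) && end - p >= 2 && is_low_surrogate(p[1]))
        return p + 2;
    return p + 1;      // a lone surrogate is stepped over as a single unit
}

// Count code points with the range-aware step (lone surrogates count as one).
inline std::size_t count_code_points(const std::uint16_t* first,
                                     const std::uint16_t* last)
{
    std::size_t n = 0;
    while (first != last) {
        first = next_checked(first, last);
        ++n;
    }
    return n;
}
```

The price of the second form is the one John mentions: every step needs access to the range bounds, so an iterator built this way has to carry them around.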

On Tue, Jul 19, 2011 at 4:24 AM, John Maddock <boost.regex@virgin.net> wrote:
Boost.Filesystem needs the UTF-32 to UTF-16 and UTF-16 to UTF-32 adapters to implement char16_t and char32_t support. Do they have any known bugs or other outstanding problems?
Yes, they can read past the end of your input range if it contains invalid data at the end.
Interesting. Would a fix be difficult?
I was about to say there aren't any known issues, but yes that is a problem - and a fix would mean changing the interface - the problem comes because the iterators only store the current position in the underlying sequence and assume that they can increment or decrement over a complete multi-byte sequence. So if your underlying sequence contains a *truncated* multibyte sequence at the start or end of the string then they can read past-the-end or even past-the-start :-(
Ouch!
The only real fix is to redesign them to be range-based, so we can add the additional checks necessary, but of course this also makes them more heavyweight than they are at present. I guess I was hoping we would have had a proper Unicode library for this by now (in Boost that is, not the sandbox ;)
Oh well, maybe I should just bite the bullet and change/fix this hole.
What about moving portions of Mathias Gaunard's Unicode library into detail? Have you looked at his code in the sandbox? I'll take a look at that too. --Beman

The only real fix is to redesign them to be range-based, so we can add the additional checks necessary, but of course this also makes them more heavyweight than they are at present. I guess I was hoping we would have had a proper Unicode library for this by now (in Boost that is, not the sandbox ;)
Oh well, maybe I should just bite the bullet and change/fix this hole.
What about moving portions of Mathias Gaunard's Unicode library into detail? Have you looked at his code in the sandbox?
No, which directory is it under? Actually, I'm thinking that the fix may be easier than I thought after all - if I add a 2-arg "range-checked" constructor as an overload, then the iterator's constructor can validate the end-points of the underlying sequence during construction, and there's no need to otherwise change the implementation or add overhead by checking every increment/decrement for movement out-of-range because we'll know that it can't happen. John.
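
The idea of validating only the endpoints can be sketched as follows. This is an illustration under stated assumptions, not the code that went into Boost.Regex: checked_u16_cursor and count_code_points are hypothetical names, and the sketch works on raw UTF-16 pointers. The cursor stores nothing but a position; its constructor inspects the first and last code units once, and after that the increment and dereference need no bounds tests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

inline bool is_high(std::uint16_t u) { return u >= 0xD800u && u <= 0xDBFFu; }
inline bool is_low(std::uint16_t u)  { return u >= 0xDC00u && u <= 0xDFFFu; }

// Hypothetical cursor: keeps only a position, and validates the two ends
// of [first, last) once, at construction (constant time).
class checked_u16_cursor {
public:
    checked_u16_cursor(const std::uint16_t* first, const std::uint16_t* last)
        : pos_(first)
    {
        if (first == last) return;
        if (is_low(*first))        // a decrement could step before `first`
            throw std::out_of_range("range starts with a low surrogate");
        if (is_high(*(last - 1)))  // an increment or dereference could pass `last`
            throw std::out_of_range("range ends with a high surrogate");
    }

    std::uint32_t operator*() const
    {
        std::uint32_t u = *pos_;
        // pos_[1] is readable: the last unit of the range is never a high surrogate.
        if (is_high(*pos_) && is_low(pos_[1]))
            return 0x10000u + ((u - 0xD800u) << 10) + (pos_[1] - 0xDC00u);
        return u;  // unpaired surrogates are passed through unchanged in this sketch
    }

    checked_u16_cursor& operator++()
    {
        pos_ += (is_high(*pos_) && is_low(pos_[1])) ? 2 : 1;
        return *this;
    }

    const std::uint16_t* base() const { return pos_; }

private:
    const std::uint16_t* pos_;
};

// Usage: one endpoint check, then plain stepping to the end of the range.
inline std::size_t count_code_points(const std::uint16_t* first,
                                     const std::uint16_t* last)
{
    checked_u16_cursor c(first, last);
    std::size_t n = 0;
    while (c.base() != last) {
        ++c;
        assert(c.base() <= last);  // never overshoots once the endpoints are valid
        ++n;
    }
    return n;
}
```

The guarantee holds only while the underlying code units are left unmodified after construction, since nothing is rechecked later.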

On Tue, Jul 19, 2011 at 12:08 PM, John Maddock <boost.regex@virgin.net> wrote:
The only real fix is to redesign them to be range-based, so we can add the additional checks necessary, but of course this also makes them more heavyweight than they are at present. I guess I was hoping we would have had a proper Unicode library for this by now (in Boost that is, not the sandbox ;)
Oh well, maybe I should just bite the bullet and change/fix this hole.
What about moving portions of Mathias Gaunard's Unicode library into detail? Have you looked at his code in the sandbox?
No, which directory is it under?
Actually, the best link is probably http://mathias.gaunard.com/unicode/doc/html/index.html
Actually, I'm thinking that the fix may be easier than I thought after all - if I add a 2-arg "range-checked" constructor as an overload, then the iterator's constructor can validate the end-points of the underlying sequence during construction, and there's no need to otherwise change the implementation or add overhead by checking every increment/decrement for movement out-of-range because we'll know that it can't happen.
Ah! Nice solution! Go for it! --Beman

Actually, I'm thinking that the fix may be easier than I thought after all - if I add a 2-arg "range-checked" constructor as an overload, then the iterator's constructor can validate the end-points of the underlying sequence during construction, and there's no need to otherwise change the implementation or add overhead by checking every increment/decrement for movement out-of-range because we'll know that it can't happen.
Ah! Nice solution! Go for it!
This is now fixed in Trunk, also updated tests (and code somewhat) for Unicode v6, and added minimalist docs: http://svn.boost.org/svn/boost/trunk/libs/regex/doc/html/boost_regex/ref/int... None of which precludes other (better?) versions appearing as libraries in their own right in the future.... it's just that as these have been stable since 2004 it seemed sensible to give them a quick make-over rather than roll out all new code. HTH, John.

On Tue, Jul 19, 2011 at 9:08 AM, John Maddock <boost.regex@virgin.net> wrote: [...]
Actually, I'm thinking that the fix may be easier than I thought after all - if I add a 2-arg "range-checked" constructor as an overload, then the iterator's constructor can validate the end-points of the underlying sequence during construction, and there's no need to otherwise change the implementation or add overhead by checking every increment/decrement for movement out-of-range because we'll know that it can't happen.
...unless the characters are modified in between traversal operations. I don't know if that's a legitimate concern or not, just thought I'd bring it up :/ *If* it's remotely a concern, probably sufficient to just document this limitation...? - Jeff

Actually, I'm thinking that the fix may be easier than I thought after all - if I add a 2-arg "range-checked" constructor as an overload, then the iterator's constructor can validate the end-points of the underlying sequence during construction, and there's no need to otherwise change the implementation or add overhead by checking every increment/decrement for movement out-of-range because we'll know that it can't happen.
...unless the characters are modified in between traversal operations. I don't know if that's a legitimate concern or not, just thought I'd bring it up :/ *If* it's remotely a concern, probably sufficient to just document this limitation...?
IMO changing the underlying sequence while you're iterating over it with an adapter is a recipe for disaster anyway, so no I don't think it's an issue in practice. John.

on Tue Jul 19 2011, John Maddock <boost.regex-AT-virgin.net> wrote:
Actually, I'm thinking that the fix may be easier than I thought after all - if I add a 2-arg "range-checked" constructor as an overload, then the iterator's constructor can validate the end-points of the underlying sequence during construction,
Doesn't that make construction of an iterator over N bytes an O(N) operation?
and there's no need to otherwise change the implementation or add overhead by checking every increment/decrement for movement out-of-range because we'll know that it can't happen.
To be precise—I think—it can't happen unless the bytes in the underlying buffer are changed after construction. It's not *quite* the same guarantee, but it's probably good enough, and maybe even preferable. I would suggest that anyone needing the other kind of check adapt an underlying iterator that contains the check. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Actually, I'm thinking that the fix may be easier than I thought after all - if I add a 2-arg "range-checked" constructor as an overload, then the iterator's constructor can validate the end-points of the underlying sequence during construction,
Doesn't that make construction of an iterator over N bytes an O(N) operation?
No, it only checks that each *end* of the sequence contains a valid multibyte sequence - effectively these can then act as sentinels - if there are invalid sequences within the range (not at the endpoints) then we can catch these anyway already.
and there's no need to otherwise change the implementation or add overhead by checking every increment/decrement for movement out-of-range because we'll know that it can't happen.
To be precise—I think—it can't happen unless the bytes in the underlying buffer are changed after construction. It's not *quite* the same guarantee, but it's probably good enough, and maybe even preferable.
I would suggest that anyone needing the other kind of check adapt an underlying iterator that contains the check.
Changing the underlying bytes after construction of the adapters is a big no-no anyway. It's an invariant that the adapter must always point between two multibyte sequences, and never be left stranded in the middle of one, that could be broken if we allowed the underlying sequence to change at arbitrary moments in time. John.

On 23/07/2011 19:44, John Maddock wrote:
Actually, I'm thinking that the fix may be easier than I thought after all - if I add a 2-arg "range-checked" constructor as an overload, then the iterator's constructor can validate the end-points of the underlying sequence during construction,
Doesn't that make construction of an iterator over N bytes an O(N) operation?
No, it only checks that each *end* of the sequence contains a valid multibyte sequence - effectively these can then act as sentinels - if there are invalid sequences within the range (not at the endpoints) then we can catch these anyway already.
That requires the underlying iterator to be bidirectional, right?

Actually, I'm thinking that the fix may be easier than I thought after all - if I add a 2-arg "range-checked" constructor as an overload, then the iterator's constructor can validate the end-points of the underlying sequence during construction,
Doesn't that make construction of an iterator over N bytes an O(N) operation?
No, it only checks that each *end* of the sequence contains a valid multibyte sequence - effectively these can then act as sentinels - if there are invalid sequences within the range (not at the endpoints) then we can catch these anyway already.
That requires the underlying iterator to be bidirectional, right?
Correct - but that's what regex requires anyway (which is all these are used for, for now). I guess a more sophisticated version could adapt to the traversal category of the underlying iterator - using relatively lightweight endpoint checking when it's possible and "check every increment" for forward/input iterators? John.
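
A rough sketch of that "adapt to the traversal category" idea, using ordinary tag dispatch on std::iterator_traits. All names here (count_impl, count_code_points) are hypothetical, and the sketch only counts UTF-16 code points rather than implementing a full adapter. With bidirectional or random-access iterators the last code unit can be inspected once up front; with forward iterators the end of the range is tested on each step instead. Single-pass input iterators are deliberately left out of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <forward_list>
#include <iterator>
#include <stdexcept>
#include <vector>

inline bool is_high(std::uint16_t u) { return u >= 0xD800u && u <= 0xDBFFu; }
inline bool is_low(std::uint16_t u)  { return u >= 0xDC00u && u <= 0xDFFFu; }

// Bidirectional or better: check the last unit once, then step without
// comparing against `last` inside a pair.
template <class It>
std::size_t count_impl(It first, It last, std::bidirectional_iterator_tag)
{
    if (first != last) {
        It back = last;
        --back;  // this is the step that needs a bidirectional iterator
        if (is_high(*back))
            throw std::out_of_range("truncated surrogate at the end");
    }
    std::size_t n = 0;
    while (first != last) {
        bool high = is_high(*first);
        ++first;
        assert(!high || first != last);  // guaranteed by the endpoint check
        if (high && is_low(*first)) ++first;
        ++n;
    }
    return n;
}

// Forward only: `last` cannot be stepped back from, so test on every step.
template <class It>
std::size_t count_impl(It first, It last, std::forward_iterator_tag)
{
    std::size_t n = 0;
    while (first != last) {
        bool high = is_high(*first);
        ++first;
        if (high) {
            if (first == last)
                throw std::out_of_range("truncated surrogate at the end");
            if (is_low(*first)) ++first;
        }
        ++n;
    }
    return n;
}

// Pick the overload from the iterator's category.
template <class It>
std::size_t count_code_points(It first, It last)
{
    typedef typename std::iterator_traits<It>::iterator_category category;
    return count_impl(first, last, category());
}
```

Both overloads report the same error for a truncated trailing surrogate; they differ only in where the check is paid for. A bidirectional adapter that also supports decrement would additionally need to check the first unit of the range.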

on Sat Jul 23 2011, John Maddock <boost.regex-AT-virgin.net> wrote:
Changing the underlying bytes after construction of the adapters is a big no-no anyway. It's an invariant that the adapter must always point between two multibyte sequences, and never be left stranded in the middle of one, that could be broken if we allowed the underlying sequence to change at arbitrary moments in time.
I agree that it's unlikely to be needed, but a careful user could make such changes without ever breaking that invariant. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Changing the underlying bytes after construction of the adapters is a big no-no anyway. It's an invariant that the adapter must always point between two multibyte sequences, and never be left stranded in the middle of one, that could be broken if we allowed the underlying sequence to change at arbitrary moments in time.
I agree that it's unlikely to be needed, but a careful user could make such changes without ever breaking that invariant.
In which case there's no issue with the iterators. John.

on Tue Jul 19 2011, John Maddock <boost.regex-AT-virgin.net> wrote:
Boost.Filesystem needs the UTF-32 to UTF-16 and UTF-16 to UTF-32 adapters to implement char16_t and char32_t support. Do they have any known bugs or other outstanding problems?
Yes, they can read past the end of your input range if it contains invalid data at the end.
Interesting. Would a fix be difficult?
I was about to say there aren't any known issues, but yes that is a problem - and a fix would mean changing the interface - the problem comes because the iterators only store the current position in the underlying sequence and assume that they can increment or decrement over a complete multi-byte sequence. So if your underlying sequence contains a *truncated* multibyte sequence at the start or end of the string then they can read past-the-end or even past-the-start :-(
The only real fix is to redesign them to be range-based, so we can add the additional checks necessary, but of course this also makes them more heavyweight than they are at present. I guess I was hoping we would have had a proper Unicode library for this by now (in Boost that is, not the sandbox ;)
What about just asking people who aren't sure if they're processing invalid unicode to add some sentinel bytes? Wouldn't that work? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Tue, Jul 19, 2011 at 4:24 PM, John Maddock <boost.regex@virgin.net> wrote:
Yes, they can read past the end of your input range if it contains invalid data at the end.
Interesting. Would a fix be difficult?
I was about to say there aren't any known issues, but yes that is a problem - and a fix would mean changing the interface - the problem comes because the iterators only store the current position in the underlying sequence and assume that they can increment or decrement over a complete multi-byte sequence. So if your underlying sequence contains a *truncated* multibyte sequence at the start or end of the string then they can read past-the-end or even past-the-start :-(
The only real fix is to redesign them to be range-based, so we can add the additional checks necessary, but of course this also makes them more heavyweight than they are at present. I guess I was hoping we would have had a proper Unicode library for this by now (in Boost that is, not the sandbox ;)
Oh well, maybe I should just bite the bullet and change/fix this hole.
In my GSoC project I am currently developing a Unicode string adapter library that wraps conventional string types such as std::string and adds Unicode awareness to them. Not sure if that helps, but if you are developing new library APIs I think this might be useful. I have not yet completed the documentation, but you can look at the draft at http://crf.scriptmatrix.net/ustr/. The code repository is available at GitHub: https://github.com/crf00/boost.ustr. (Sorry, I don't mean to hijack the thread, but I hope that helps.) cheers, Soares Chen

On 18/07/2011 19:25, Beman Dawes wrote:
Alternately, does your Unicode library in the sandbox have equivalent functionality, but without reading past the end on bad data?
My library does provide that functionality, but the iterators are more heavyweight as John Maddock highlighted. I do not do the checking on construction but as the data is being iterated.

participants (7)
- Artyom Beilis
- Beman Dawes
- Dave Abrahams
- Jeffrey Lee Hellrung, Jr.
- John Maddock
- Mathias Gaunard
- Soares Chen Ruo Fei