[regex] couple o' bugs

newer
Re: [boost] Performance comparison...

Eric Niebler

2 Jan 2006 2 Jan '06

4:12 a.m.

I've been porting some test cases from Boost.Regex to Boost.Xpressive and tracking down the discrepencies (very few, thankfully). I've turned up what appear to be a couple of bugs in Boost.Regex. The regex "a(b)?c\\1d" successfully matches the string "acd". It shouldn't. A back-reference to a sub-matche that didn't participate in the match should not match. Perl, python and xpressive all agree on this point. As discussed previously, Boost.Regex treats [a-Z] as a legal regex, but it isn't. 'a' is 97 and 'Z' is 90, which makes this character range ill-formed, even when icase is specified. When matching "a(b+|((c)*))+d" against "abcd", Boost.Regex says the third sub-match should be "c", but perl says it should not participate in the match. I think perl is right here. The logic is: the quantified group 1 will match at least 3 times: first, it eats the b, next it eats the c, and finally, it matches an empty string. On this last iteration, the quantified group 3 will match zero times; hence, it has not participated in the match. (FWIW, xpressive has a bug in this area too, which I'm working on fixing.) This last bug plagues several of the test cases in test_tricky_cases.cpp. -- Eric Niebler Boost Consulting www.boost-consulting.com

Show replies by date

John Maddock

3 Jan 3 Jan

1:54 p.m.

...

The regex "a(b)?c\\1d" successfully matches the string "acd". It shouldn't. A back-reference to a sub-matche that didn't participate in the match should not match. Perl, python and xpressive all agree on this point.

Hmmm, I never did know what to do in that case, and I suspect Boost.Regex has changed several times in this area. Eric, you're right about Perl, and PCRE, but can you take a look at the ECMAScript spec and double check my reading of it: it looks to me as if section 15.10.2.9 AtomEscape Item 8.3 requires backreferences to non-matched sub-expressions to be treated as if they were simply skipped? I'm sure I deliberately made it behave that way, but it's all lost in the mists of time as to why: must remember to add to more comments, a New Years Resolution!!!

...

As discussed previously, Boost.Regex treats [a-Z] as a legal regex, but it isn't. 'a' is 97 and 'Z' is 90, which makes this character range ill-formed, even when icase is specified.

Regex rejects [a-Z] correctly without icase, but accepts it incorrectly when icase is on. The problem here is the desire to support Unicode, we could either: 1) Do what ICU does and enumerate every character in the range [x-y] and convert it to it's case-folded equivalent. The trouble is it's pathologically slow for very large ranges. Of course if you use a very large range you get exactly what you deserve! 2) Do what I presume xpressive does (haven't looked, sorry in a rush today!) and match a character c if c has an equivalent case in the range x-y. This is what ECMAScript specifies I think? The problem is case folding in Unicode is *very* complicated, and using toupper/tolower to map case equivalents doesn't even come close (there are at least three cases for a start). As far as I know you more or less *have* to do case folding to get the right answer for Unicode characters. 3) Do what Boost.Regex does and do case-folding on the fly. It works correctly for 98% of cases, but accepts a few invalid expressions incorrectly. At present I'm inclined to suggest that anyone who uses [a-Z] in a regex gets what they deserve anyway :-) So I'm not going to fix this one for now: having it as a test case is questionable though ;-)

...

When matching "a(b+|((c)*))+d" against "abcd", Boost.Regex says the third sub-match should be "c", but perl says it should not participate in the match. I think perl is right here. The logic is: the quantified group 1 will match at least 3 times: first, it eats the b, next it eats the c, and finally, it matches an empty string. On this last iteration, the quantified group 3 will match zero times; hence, it has not participated in the match. (FWIW, xpressive has a bug in this area too, which I'm working on fixing.)

Actually it's only one case where this is relevent I believe. First note that all the cases that use POSIX extended syntax are matched using the "Leftmost-Longest" rule, so for example "a(b+|((c)*))+d" against "abcd" then the last match of $1, $2, $3 id against the "c" as this is "better" (The same in $0 as your case but both longer and more to the left in $1/$2/$3 than the alternative you give). In the Perl cases, the first one a(b+|((c)*))+d against "abd" looks correct to me: $1 and $2 match the empty string after the "b", and $3 is unmatched. In the second case a(b+|((c)*))+d against "abcd" things get tricky: Boost.Regex does repeat $1 and $2 so that they match the null string after the "c", however $3 doesn't get invoked at all and is left pointing to the last thing it did match (the "c"). I can't see anything in ECMAScript to indicate what the "right" behaviour is here. I don't know what Perl does, but PCRE (and hence python) does the same as Boost.Regex. There's a whole series of possible expressions like this: (a(((x)))b|a(((y)))b)+ The question being whether each time we take an alternative or repeat all the nested sub-expressions get unset, or whether they continue to remember what they matched last time through. Unsetting them looks tricky IMO. Looking into this some more: PCRE appears to never unset them, Perl appears to unset them, but only in this specific case: $string="abcd"; $string =~ s/a(b+|(((c))*))+d/{$1}{$2}{$3}{$4}/; print($string); prints {}{}{c}{c} which is the same as Boost.Regex and PCRE $string="abcd"; $string =~ s/a(b+|((c)*))+d/{$1}{$2}{$3}/; print($string); prints {}{}{} which is just plain weird - inconsistent certainly - and the extra parenthesis shouldn't have effected $3 one jot. I'm verging on suggesting that this is a Perl bug. John.

Eric Niebler

5:22 p.m.

John Maddock wrote:

...

...
The regex "a(b)?c\\1d" successfully matches the string "acd". It shouldn't. A back-reference to a sub-matche that didn't participate in the match should not match. Perl, python and xpressive all agree on this point.

Hmmm, I never did know what to do in that case, and I suspect Boost.Regex has changed several times in this area. Eric, you're right about Perl, and PCRE, but can you take a look at the ECMAScript spec and double check my reading of it: it looks to me as if section 15.10.2.9 AtomEscape Item 8.3 requires backreferences to non-matched sub-expressions to be treated as if they were simply skipped?

I'm sure I deliberately made it behave that way, but it's all lost in the mists of time as to why: must remember to add to more comments, a New Years Resolution!!!

Oh, my! You're right. "If the regular expression has n or more capturing parentheses but the nth one is undefined because it hasn't captured anything, then the backreference always succeeds." 15.10.2.9 Yuk! This makes ECMA (and hence C++) regular expressions differ from perl/python/PCRE in a fundamental way, and disallows at least one useful and idiomatic perl regex construct. How unfortunate.

...

...
As discussed previously, Boost.Regex treats [a-Z] as a legal regex, but it isn't. 'a' is 97 and 'Z' is 90, which makes this character range ill-formed, even when icase is specified.

Regex rejects [a-Z] correctly without icase, but accepts it incorrectly when icase is on. The problem here is the desire to support Unicode, we could either:

1) Do what ICU does and enumerate every character in the range [x-y] and convert it to it's case-folded equivalent. The trouble is it's pathologically slow for very large ranges. Of course if you use a very large range you get exactly what you deserve!

Please, no.

...

2) Do what I presume xpressive does (haven't looked, sorry in a rush today!) and match a character c if c has an equivalent case in the range x-y. This is what ECMAScript specifies I think? The problem is case folding in Unicode is *very* complicated, and using toupper/tolower to map case equivalents doesn't even come close (there are at least three cases for a start). As far as I know you more or less *have* to do case folding to get the right answer for Unicode characters. 3) Do what Boost.Regex does and do case-folding on the fly. It works correctly for 98% of cases, but accepts a few invalid expressions incorrectly. At present I'm inclined to suggest that anyone who uses [a-Z] in a regex gets what they deserve anyway :-) So I'm not going to fix this one for now: having it as a test case is questionable though ;-)

4) Punt the decision to the traits type :-) For xpressive, I added a in_range_nocase(Char a, Char b) member to the traits concept. By default the traits provided by xpressive do *not* do proper case folding. They just use toupper and tolower, and are documented as such. An ambitious person can write their own trait to do proper Unicode case folding and get the right behavior. This interface works, but it's sub-optimal since it causes more calls to toupper/tolower than are necessary. I have a mental sticky note to find a better interface.

...

...
When matching "a(b+|((c)*))+d" against "abcd", Boost.Regex says the third sub-match should be "c", but perl says it should not participate in the match. I think perl is right here. The logic is: the quantified group 1 will match at least 3 times: first, it eats the b, next it eats the c, and finally, it matches an empty string. On this last iteration, the quantified group 3 will match zero times; hence, it has not participated in the match. (FWIW, xpressive has a bug in this area too, which I'm working on fixing.)

Actually it's only one case where this is relevent I believe. First note that all the cases that use POSIX extended syntax are matched using the "Leftmost-Longest" rule, so for example "a(b+|((c)*))+d" against "abcd" then the last match of $1, $2, $3 id against the "c" as this is "better" (The same in $0 as your case but both longer and more to the left in $1/$2/$3 than the alternative you give).

In the Perl cases, the first one a(b+|((c)*))+d against "abd" looks correct to me: $1 and $2 match the empty string after the "b", and $3 is unmatched.

In the second case a(b+|((c)*))+d against "abcd" things get tricky: Boost.Regex does repeat $1 and $2 so that they match the null string after the "c", however $3 doesn't get invoked at all and is left pointing to the last thing it did match (the "c").

But $3 does get involved. It is matched 0 times, and should be undefined. I can't see anything in ECMAScript to

...

indicate what the "right" behaviour is here. I don't know what Perl does, but PCRE (and hence python) does the same as Boost.Regex.

There's a whole series of possible expressions like this:

(a(((x)))b|a(((y)))b)+

The question being whether each time we take an alternative or repeat all the nested sub-expressions get unset, or whether they continue to remember what they matched last time through. Unsetting them looks tricky IMO. Looking into this some more: PCRE appears to never unset them, Perl appears to unset them, but only in this specific case:

$string="abcd"; $string =~ s/a(b+|(((c))*))+d/{$1}{$2}{$3}{$4}/; print($string);

prints {}{}{c}{c} which is the same as Boost.Regex and PCRE

Yes, I believe this is a Perl bug, and I filed it yesterday.

...

$string="abcd"; $string =~ s/a(b+|((c)*))+d/{$1}{$2}{$3}/; print($string);

prints {}{}{} which is just plain weird - inconsistent certainly - and the extra parenthesis shouldn't have effected $3 one jot. I'm verging on suggesting that this is a Perl bug.

Inconsistent. Yes. Weird? No. The simple rule is: when you're about to start repeating a group, that capture and all captures within are first set to undefined. (ECMA-262 15.10.2.5.) Now that I look more closely, I see that ECMA is stricter about setting captures to undefined in these situations than Perl is, and xpressive is non-compliant in this area, too. <sigh> ECMA-262 gives this example: /(z)((a+)?(b+)?(c))*/.exec("zaacbbbcac") which returns the array ["zaacbbbcac", "z", "ac", "a", undefined, "c"] -- Eric Niebler Boost Consulting www.boost-consulting.com

John Maddock

4 Jan 4 Jan

10:28 a.m.

...

...
1) Do what ICU does and enumerate every character in the range [x-y] and convert it to it's case-folded equivalent. The trouble is it's pathologically slow for very large ranges. Of course if you use a very large range you get exactly what you deserve!

Please, no.

:-)

...

4) Punt the decision to the traits type :-) For xpressive, I added a in_range_nocase(Char a, Char b) member to the traits concept. By default the traits provided by xpressive do *not* do proper case folding. They just use toupper and tolower, and are documented as such. An ambitious person can write their own trait to do proper Unicode case folding and get the right behavior.

Right, but the question is: is it actually *possible* to do proper Unicode case folding with this interface?

...

Inconsistent. Yes. Weird? No. The simple rule is: when you're about to start repeating a group, that capture and all captures within are first set to undefined. (ECMA-262 15.10.2.5.) Now that I look more closely, I see that ECMA is stricter about setting captures to undefined in these situations than Perl is, and xpressive is non-compliant in this area, too. <sigh>

Oh shucks I see it now: incidently the behaviour described is consistent, doing what they do will work for all alternatives and repeats just as well (since an alternative must be within a repeat if it's captures are going to need clearing). It's going to be a pain to implement though :-/ Thanks for the pointer, John.

Eric Niebler

5:38 p.m.

John Maddock wrote:

...

...
4) Punt the decision to the traits type :-) For xpressive, I added a in_range_nocase(Char a, Char b) member to the traits concept. By default the traits provided by xpressive do *not* do proper case folding. They just use toupper and tolower, and are documented as such. An ambitious person can write their own trait to do proper Unicode case folding and get the right behavior.

Right, but the question is: is it actually *possible* to do proper Unicode case folding with this interface?

Trivially, yes. (The actual interface is in_range_nocase(Char from, Char to, Char ch) -- there's a typo above.) The algorithm is: 1) Build a table such that for every Unicode character, you can get a list of its case-folded equivalents. I wrote a script that does this, using http://www.unicode.org/Public/UNIDATA/CaseFolding.txt as input. 2) In in_range_nocase(), look up the list of case-folded equivalent chars for ch, the char to test. 3) For each char in the list, see if it's in the range specified. -- Eric Niebler Boost Consulting www.boost-consulting.com

7164

Age (days ago)

7166

Last active (days ago)

List overview

Download

4 comments

2 participants

participants (2)

Eric Niebler
John Maddock