Re: [boost] [regex] couple o' bugs

3 Jan 2006

      ...
The regex "a(b)?c\\1d" successfully matches the string "acd". It
shouldn't. A back-reference to a sub-matche that didn't participate in
the match should not match. Perl, python and xpressive all agree on
this point.
Hmmm, I never did know what to do in that case, and I suspect Boost.Regex 
has changed several times in this area.  Eric, you're right about Perl, and 
PCRE, but can you take a look at the ECMAScript spec and double check my 
reading of it: it looks to me as if section 15.10.2.9 AtomEscape Item 8.3 
requires backreferences to non-matched sub-expressions to be treated as if 
they were simply skipped?

I'm sure I deliberately made it behave that way, but it's all lost in the 
mists of time as to why: must remember to add to more comments, a New Years 
Resolution!!!
...
As discussed previously, Boost.Regex treats [a-Z] as a legal regex,
but it isn't. 'a' is 97 and 'Z' is 90, which makes this character
range ill-formed, even when icase is specified.
Regex rejects [a-Z] correctly without icase, but accepts it incorrectly when 
icase is on.  The problem here is the desire to support Unicode, we could 
either:

1) Do what ICU does and enumerate every character in the range [x-y] and 
convert it to it's case-folded equivalent.  The trouble is it's 
pathologically slow for very large ranges.  Of course if you use a very 
large range you get exactly what you deserve!
2) Do what I presume xpressive does (haven't looked, sorry in a rush today!) 
and match a character c if c has an equivalent case in the range x-y.  This 
is what ECMAScript specifies I think?  The problem is case folding in 
Unicode is *very* complicated, and using toupper/tolower to map case 
equivalents doesn't even come close (there are at least three cases for a 
start).  As far as I know you more or less *have* to do case folding to get 
the right answer for Unicode characters.
3) Do what Boost.Regex does and do case-folding on the fly.  It works 
correctly for 98% of cases, but accepts a few invalid expressions 
incorrectly.  At present I'm inclined to suggest that anyone who uses [a-Z] 
in a regex gets what they deserve anyway :-)  So I'm not going to fix this 
one for now: having it as a test case is questionable though ;-)
...
When matching "a(b+|((c)*))+d" against "abcd", Boost.Regex says the
third sub-match should be "c", but perl says it should not participate
in the match. I think perl is right here. The logic is: the quantified
group 1 will match at least 3 times: first, it eats the b, next it
eats the c, and finally, it matches an empty string. On this last
iteration, the quantified group 3 will match zero times; hence, it
has not participated in the match. (FWIW, xpressive has a bug in this
area too, which I'm working on fixing.)
Actually it's only one case where this is relevent I believe.  First note 
that all the cases that use POSIX extended syntax are matched using the 
"Leftmost-Longest" rule, so for example "a(b+|((c)*))+d" against "abcd" then 
the last match of $1, $2, $3 id against the "c" as this is "better" (The 
same in $0 as your case but both longer and more to the left in $1/$2/$3 
than the alternative you give).

In the Perl cases, the first one a(b+|((c)*))+d against "abd" looks correct 
to me: $1 and $2 match the empty string after the "b", and $3 is unmatched.

In the second case a(b+|((c)*))+d against "abcd" things get tricky: 
Boost.Regex does repeat $1 and $2 so that they match the null string after 
the "c", however $3 doesn't get invoked at all and is left pointing to the 
last thing it did match (the "c").  I can't see anything in ECMAScript to 
indicate what the "right" behaviour is here.  I don't know what Perl does, 
but PCRE (and hence python) does the same as Boost.Regex.

There's a whole series of possible expressions like this:

(a(((x)))b|a(((y)))b)+

The question being whether each time we take an alternative or repeat all 
the nested sub-expressions get unset, or whether they continue to remember 
what they matched last time through.  Unsetting them looks tricky IMO. 
Looking into this some more: PCRE appears to never unset them, Perl appears 
to unset them, but only in this specific case:

$string="abcd";
$string =~ s/a(b+|(((c))*))+d/{$1}{$2}{$3}{$4}/;
print($string);

prints {}{}{c}{c} which is the same as Boost.Regex and PCRE

$string="abcd";
$string =~ s/a(b+|((c)*))+d/{$1}{$2}{$3}/;
print($string);

prints {}{}{} which is just plain weird - inconsistent certainly - and the 
extra parenthesis shouldn't have effected $3 one jot.  I'm verging on 
suggesting that this is a Perl bug.

John.

Re: [boost] [regex] couple o' bugs

John Maddock