
The regex "a(b)?c\\1d" successfully matches the string "acd". It shouldn't. A back-reference to a sub-matche that didn't participate in the match should not match. Perl, python and xpressive all agree on this point.
Hmmm, I never did know what to do in that case, and I suspect Boost.Regex has changed several times in this area. Eric, you're right about Perl, and PCRE, but can you take a look at the ECMAScript spec and double check my reading of it: it looks to me as if section 15.10.2.9 AtomEscape Item 8.3 requires backreferences to non-matched sub-expressions to be treated as if they were simply skipped? I'm sure I deliberately made it behave that way, but it's all lost in the mists of time as to why: must remember to add to more comments, a New Years Resolution!!!
As discussed previously, Boost.Regex treats [a-Z] as a legal regex, but it isn't. 'a' is 97 and 'Z' is 90, which makes this character range ill-formed, even when icase is specified.
Regex rejects [a-Z] correctly without icase, but accepts it incorrectly when icase is on. The problem here is the desire to support Unicode, we could either: 1) Do what ICU does and enumerate every character in the range [x-y] and convert it to it's case-folded equivalent. The trouble is it's pathologically slow for very large ranges. Of course if you use a very large range you get exactly what you deserve! 2) Do what I presume xpressive does (haven't looked, sorry in a rush today!) and match a character c if c has an equivalent case in the range x-y. This is what ECMAScript specifies I think? The problem is case folding in Unicode is *very* complicated, and using toupper/tolower to map case equivalents doesn't even come close (there are at least three cases for a start). As far as I know you more or less *have* to do case folding to get the right answer for Unicode characters. 3) Do what Boost.Regex does and do case-folding on the fly. It works correctly for 98% of cases, but accepts a few invalid expressions incorrectly. At present I'm inclined to suggest that anyone who uses [a-Z] in a regex gets what they deserve anyway :-) So I'm not going to fix this one for now: having it as a test case is questionable though ;-)
When matching "a(b+|((c)*))+d" against "abcd", Boost.Regex says the third sub-match should be "c", but perl says it should not participate in the match. I think perl is right here. The logic is: the quantified group 1 will match at least 3 times: first, it eats the b, next it eats the c, and finally, it matches an empty string. On this last iteration, the quantified group 3 will match zero times; hence, it has not participated in the match. (FWIW, xpressive has a bug in this area too, which I'm working on fixing.)
Actually it's only one case where this is relevent I believe. First note that all the cases that use POSIX extended syntax are matched using the "Leftmost-Longest" rule, so for example "a(b+|((c)*))+d" against "abcd" then the last match of $1, $2, $3 id against the "c" as this is "better" (The same in $0 as your case but both longer and more to the left in $1/$2/$3 than the alternative you give). In the Perl cases, the first one a(b+|((c)*))+d against "abd" looks correct to me: $1 and $2 match the empty string after the "b", and $3 is unmatched. In the second case a(b+|((c)*))+d against "abcd" things get tricky: Boost.Regex does repeat $1 and $2 so that they match the null string after the "c", however $3 doesn't get invoked at all and is left pointing to the last thing it did match (the "c"). I can't see anything in ECMAScript to indicate what the "right" behaviour is here. I don't know what Perl does, but PCRE (and hence python) does the same as Boost.Regex. There's a whole series of possible expressions like this: (a(((x)))b|a(((y)))b)+ The question being whether each time we take an alternative or repeat all the nested sub-expressions get unset, or whether they continue to remember what they matched last time through. Unsetting them looks tricky IMO. Looking into this some more: PCRE appears to never unset them, Perl appears to unset them, but only in this specific case: $string="abcd"; $string =~ s/a(b+|(((c))*))+d/{$1}{$2}{$3}{$4}/; print($string); prints {}{}{c}{c} which is the same as Boost.Regex and PCRE $string="abcd"; $string =~ s/a(b+|((c)*))+d/{$1}{$2}{$3}/; print($string); prints {}{}{} which is just plain weird - inconsistent certainly - and the extra parenthesis shouldn't have effected $3 one jot. I'm verging on suggesting that this is a Perl bug. John.