Boost-regex: Weird behaviour with non-greedy matching operator in regex_replace in boost 1.40?
Hi,
the non-greedy matching seems to have a weird behaviour (maybe
non-deterministic?) if the pattern is preceded by something.
E.g. when I want to get all characters in a string except the last one,
if it's an 'o', I would use
regex matchExpr("(.*?)o?");
So if I write
string text("hallo");
regex matchExpr("(.*?)o?");
string valueExpr("$1");
string result;
regex_replace(back_inserter(result), text.begin(), text.end(),
matchExpr, valueExpr);
cout << "Match \"" << result << "\"" << endl;
it will print the expected "hall". If I now use instead
string text(" hallo");
regex matchExpr(" (.*?)o?");
it will print "hallo". And for
string text("hhallo");
regex matchExpr("h(.*?)o?");
string valueExpr("$1");
regex_replace(back_inserter(result), text.begin(), text.end(),
matchExpr, valueExpr);
cout << "Match \"" << result << "\"" << endl;
it will print "allo" which seems even more strange to me.
I'm using Boost 1.40 on Ubuntu with g++ 4.2.4.
To give you a complete example:
#include
AMDG

Florian Schwarz wrote:
> the non-greedy matching seems to have a weird behaviour (maybe
> non-deterministic?) if the pattern is preceded by something.

Boost.Regex doesn't guarantee that it finds the longest possible match.
Perl does exactly the same thing:

$ perl -e '$x = "hhallo"; $x =~ s/h(.*?)o?/$1/g; print $x;'
allo
> And for
> string text("hhallo");
> regex matchExpr("h(.*?)o?");
> string valueExpr("$1");
> regex_replace(back_inserter(result), text.begin(), text.end(),
> matchExpr, valueExpr);
> cout << "Match \"" << result << "\"" << endl;
> it will print "allo" which seems even more strange to me.

In this case, the regex matches each 'h'.

In Christ,
Steven Watanabe
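
For anyone who wants to reproduce the three results from the original post, here is a minimal sketch. It assumes std::regex (C++11, default ECMAScript grammar) as a stand-in for boost::regex, on the assumption that both handle these particular patterns the same way. The helper names replace_with_group1 and check_thread_cases are made up for illustration and do not appear in the thread.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>

// Replace every match of `pattern` in `text` with the contents of $1,
// mirroring the regex_replace call from the original post.
// Assumption: std::regex stands in for boost::regex here.
std::string replace_with_group1(const std::string& text,
                                const std::string& pattern) {
    const std::regex matchExpr(pattern);
    const std::string valueExpr("$1");
    std::string result;
    std::regex_replace(std::back_inserter(result), text.begin(), text.end(),
                       matchExpr, valueExpr);
    return result;
}

// The three cases from the thread, with the outputs reported there.
void check_thread_cases() {
    // Every position matches (often with an empty match), so only the
    // captured text survives.
    assert(replace_with_group1("hallo", "(.*?)o?") == "hall");
    // Only the leading space matches; the rest is copied unchanged.
    assert(replace_with_group1(" hallo", " (.*?)o?") == "hallo");
    // Each 'h' matches on its own and is replaced by an empty $1.
    assert(replace_with_group1("hhallo", "h(.*?)o?") == "allo");
}
```

Calling check_thread_cases() from main() should pass silently if the stand-in behaves like the Boost version described in the thread.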
> I have the following questions:
> - why does test 1 match the expected "hall" while test 2 matches "hallo"
> - why does test 1 match the whole string while test 4 matches only a part of it.

Because that's the way that Perl regexes work, if you have the expression
(.*?)o? then for preference the .*? part will match *no characters at all*,
so basically your expression either matches no characters, or one character
if the next character is an "o". So since you're doing a search and replace,
the effect is:

* If the next character is not an "o", match a zero length string and output a null string (the contents of $1).
* Since the last match was against a zero length string, then skip to the next character.
* Otherwise if the next character is an "o", match it and output $1 - again this is an empty string.
* Move to the end of the string matched.
* Find the next match and output all unmatched text (everything from the end of the last match to the start of this one).
* Repeat.

So in effect we end up deleting all the letter "o"'s.

Or at least I think that's what's going on here after a very brief look ;-)

HTH, John.
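
The steps above can be sketched as a hand-written loop over all matches. This only illustrates the idea and is not Boost's actual implementation: std::regex stands in for boost::regex (an assumption), and replace_by_hand and check_replace_by_hand are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hand-written search and replace with "$1" as the replacement text.
// Assumption: std::regex stands in for boost::regex here.
std::string replace_by_hand(const std::string& text,
                            const std::string& pattern) {
    const std::regex re(pattern);
    std::string out;
    auto tail = text.begin();  // end of the most recent match
    for (std::sregex_iterator it(text.begin(), text.end(), re), none;
         it != none; ++it) {
        out.append(tail, (*it)[0].first);  // unmatched text before this match
        out += it->str(1);                 // the replacement, i.e. $1
        tail = (*it)[0].second;            // move to the end of the match
    }
    out.append(tail, text.end());  // unmatched text after the last match
    return out;
}

// The loop should give the same results as the cases in the thread.
void check_replace_by_hand() {
    assert(replace_by_hand("hallo", "(.*?)o?") == "hall");
    assert(replace_by_hand(" hallo", " (.*?)o?") == "hallo");
    assert(replace_by_hand("hhallo", "h(.*?)o?") == "allo");
}
```

The iterator takes care of the "skip past a zero length match" step, so the loop body only has to copy the unmatched text and append $1.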

Somehow I just don't get it. When I match "hallo" with "(.*?)o?" and
"xhallo" with "x(.*?)o?", I expect that $1 will in both cases be the same.
But this is not the case. In the former the result is "hall" while in the
latter it's "hallo", which seems weird to me...

Best regards
Florian
_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users

> Somehow I just don't get it. When I match "hallo" with "(.*?)o?" and
> "xhallo" with "x(.*?)o?", I expect that $1 will in both cases be the same.
> But this is not the case. In the former the result is "hall" while in the
> latter it's "hallo", which seems weird to me...

No, that's not what's happening. Remember the .*? part is non-greedy and will
match as few characters as possible (zero if possible) that still results in
an overall match. Consider the program below that enumerates all the possible
matches in the string - this is what regex_replace basically does internally -
but in this case you get to see all the individual matches. Output is as
follows:

Enumerating all the matches of "(.*?)o?" in the text "Hallo"
$0 = "" $1 = "" Position = 0
$0 = "H" $1 = "H" Position = 0
$0 = "" $1 = "" Position = 1
$0 = "a" $1 = "a" Position = 1
$0 = "" $1 = "" Position = 2
$0 = "l" $1 = "l" Position = 2
$0 = "" $1 = "" Position = 3
$0 = "lo" $1 = "l" Position = 3
$0 = "" $1 = "" Position = 5

Enumerating all the matches of "x(.*?)o?" in the text "xHallo"
$0 = "x" $1 = "" Position = 0

So in this latter case there is only one match found, and in the case of
regex_replace the unmatched part (all of "Hallo") gets output unchanged.

Here's the example program:

#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main ( int argc, char** argv )
{
   std::string input = "xHallo";
   boost::regex test ( "x(.*?)o?" );
   boost::sregex_iterator it ( input.begin(), input.end (), test);
   boost::sregex_iterator none;

   std::cout << "Enumerating all the matches of \"" << test.str() << "\" in the text \"" << input << "\"" << std::endl;

   // Print the whole match, the first capture and the match position.
   while ( it != none )
   {
      std::cout << "$0 = \"" << it->str(0) << "\" $1 = \"" << it->str(1) << "\" Position = " << it->position() << std::endl;
      ++it;
   }
   return 0;
}

HTH, John.
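
The listing shows an empty match and a non-empty match at the same position. A small sketch of where the second one can come from: after an empty match, std::regex_iterator (as specified by the C++ standard) first retries at the same position with match_not_null | match_continuous before moving on. That Boost does something equivalent is an assumption here. It is consistent with the output above but has not been checked against the Boost source. The helper name nonempty_match_at is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// After an empty match at `pos`, look for a non-empty match that starts
// exactly at `pos`. Returns $0 of that match, or "" if there is none.
// Assumption: std::regex stands in for boost::regex here.
std::string nonempty_match_at(const std::string& text, const std::regex& re,
                              std::size_t pos) {
    assert(pos <= text.size());
    std::smatch m;
    // match_not_null rejects empty matches; match_continuous forces the
    // match to begin at the first character of the searched range.
    const auto flags = std::regex_constants::match_not_null |
                       std::regex_constants::match_continuous;
    const auto start =
        text.begin() + static_cast<std::string::difference_type>(pos);
    if (std::regex_search(start, text.end(), m, re, flags)) {
        return m.str(0);
    }
    return "";
}
```

With the pattern "(.*?)o?" and the text "Hallo", this should yield "H" at position 0 and "lo" at position 3, matching the second line of each pair in the listing.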

Thanks a lot for the detailed response. I was so sure that regex_replace would
by default replace only the first occurrence and not all of them that I didn't
look into the documentation, and therefore I didn't understand your first
explanation :-(

So now I see my mistake...

Best regards
Florian
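
If the goal is simply to get $1 for the whole input, one possible alternative (not proposed in the thread) is to require the pattern to match the entire string with regex_match, so the non-greedy part has to grow until the end is reached. The sketch below assumes std::regex as a stand-in for boost::regex, and the function names are hypothetical. The second function shows that replacing only the first occurrence would not have produced "hall" either, because the first match is empty.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns $1 if the whole of `text` matches `pattern`, otherwise "".
// Assumption: std::regex stands in for boost::regex here.
std::string group1_of_full_match(const std::string& text,
                                 const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_match(text, m, re)) {
        return m.str(1);
    }
    return "";
}

// Replace only the first match with $1 and copy the rest unchanged.
std::string replace_first_only(const std::string& text,
                               const std::string& pattern) {
    const std::regex re(pattern);
    return std::regex_replace(text, re, "$1",
                              std::regex_constants::format_first_only);
}

void check_full_match() {
    // With a full match, $1 is the same with or without a prefix.
    assert(group1_of_full_match("hallo", "(.*?)o?") == "hall");
    assert(group1_of_full_match("xhallo", "x(.*?)o?") == "hall");
    // The first match is empty, so nothing is removed.
    assert(replace_first_only("hallo", "(.*?)o?") == "hallo");
}
```

Calling check_full_match() from main() should pass silently under the stated assumption.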

participants (3)
- Florian Schwarz
- John Maddock
- Steven Watanabe