I sent this originally to James Maddock, but realized this is probably a better place to post it.

At work we've been testing out regex (nice work BTW) in some of our code, and appear to have found a bug. We ran into it parsing HTML, and I've written a test C++ app to reproduce it.

In the program below, the output should be the same for both searches as far as I can tell, but it's not. I don't know if it's some interaction with the quote character or something like that. We attempted to use other quantifiers (after '?', we tried '*', '{0,1}', ["]?) to no avail. I'm confident this is not user error. The extra grouping is annoying (in "goodPatternStr"), but is an acceptable workaround. The strange thing is that a non-capturing group doesn't fix it.

Ideas?

--Mark Ping

--------------------------------------------------------------------------------------
output: input: ]*name=\"?([^> \"]*)[^>]*value=\"?([^> \"]*)";
const char* goodPatternStr = "]*name=(\"?)([^> \"]*)[^>]*value=\"?([^> \"]*)";
// ^^^^^
//note that the only difference between bad and good is that "good"
//has grouping around the optional " after 'name='
//
//In both versions of the matches, the second group is matched.
//Only the first group has this problem.
boost::match_results<std::string::const_iterator> what;
std::string in = "
> I sent this originally to James Maddock, but realized this is probably a better place to post it.

Sorry it took me a while to get around to it.

> At work we've been testing out regex (nice work BTW) in some of our code, and appear to have found a bug. We ran into it parsing HTML, and I've written a test C++ app to reproduce it.
> In the program below, the output should be the same for both searches as far as I can tell, but it's not. I don't know if it's some interaction with the quote character or something like that. We attempted to use other quantifiers (after '?', we tried '*', '{0,1}', ["]?) to no avail. I'm confident this is not user error. The extra grouping is annoying (in "goodPatternStr"), but is an acceptable workaround. The strange thing is that a non-capturing group doesn't fix it.
> Ideas?
I have an answer for you, but I don't think you're going to like it. It comes down to how the "leftmost longest" rules are applied. What's happening here is that $1 is being matched, but it's matching the null string just before the \" (at character 26, I think it was). The alternative (the one you expected) would have matched starting at character 27 (just one to the right of the \"). So the match found is in some sense "better" (further to the left) than the one you expected. I think I'm going to have to switch to Perl matching rules so I can stop explaining this... :-)

A simpler solution to your problem is to use a + quantifier rather than a *, so that it can't match the null string:

const char* badPatternStr = "]*name=\"?([^> \"]+)[^>]*value=\"?([^> \"]+)";

Hope this helps,

John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm
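
[Editor's sketch] The tie-break John describes can be illustrated with a toy model. This is not how Boost.Regex is implemented; `Candidate` and `pick_leftmost_longest` are hypothetical names, and the offsets 26 and 27 are simply the ones John recalls above (the length 3 is an arbitrary stand-in for the attribute text). The sketch only shows why an empty capture that starts one character earlier beats the longer capture the poster expected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One way a marked sub-expression could take part in an overall match:
// where its capture starts and how many characters it covers.
struct Candidate {
    std::size_t start;
    std::size_t length;
};

// Hypothetical helper (not a Boost.Regex function): prefer the candidate
// that starts furthest to the left; among equal starts, prefer the longest.
inline Candidate pick_leftmost_longest(const std::vector<Candidate>& cs) {
    assert(!cs.empty());
    Candidate best = cs.front();
    for (const Candidate& c : cs) {
        if (c.start < best.start ||
            (c.start == best.start && c.length > best.length)) {
            best = c;
        }
    }
    return best;
}
```

Feeding it the two options from the explanation, `{27, 3}` (the text after the quote) and `{26, 0}` (the null string before the quote), selects the empty capture at 26, which is the result the original poster observed for $1.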
--- In Boost-Users@y..., "John Maddock"
> > I sent this originally to James Maddock, but realized this is probably a better place to post it.
> Sorry it took me a while to get around to it.

No need to apologize. I'm grateful for the library you provided to the community!

> I have an answer for you, but I don't think you're going to like it: it comes down to how the "leftmost longest" rules are applied:
> what's happening here is that $1 is being matched, but it's matching the null string just before the \" (at character 26 I think it was), the alternative (the one you expected), would have matched starting at character 27 (just one to the right of the \"). So the match found is in some sense "better" (further to the left) than the one you expected. I think I'm going to have to switch to perl matching rules so I can stop explaining this... :-)

Drat. I thought that might be it after reading the docs, but I figured that the expression "? would also match leftmost longest (that is, a single " is better than the null string). Also, I thought that by putting a non-capturing group (?:"?) around the expression, this would leftmost-longest match the ". And yet it didn't. (I also tried "* and "{0,1} etc.) Sigh. I *really* did read the docs before emailing. :)

> A simpler solution to your problem is to use a + quantifier rather than a *, so that it can't match the null string:

Unfortunately, the match can be empty, hence we couldn't use +. But thanks for the reply, and for the library!
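
[Editor's sketch] For comparison, here is what the original `*` pattern does under Perl-style (greedy, backtracking) rules, the alternative John mentions. It uses `std::regex` with its default ECMAScript grammar as a stand-in for Perl rules, not Boost.Regex. The `<input` prefix and the sample tags are assumptions, since the start of the original pattern and the test input did not survive in this archive; `name_and_value` is a hypothetical helper name.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Returns {name, value} captured from a tag, or two empty strings when the
// pattern does not match. The "<input" prefix is assumed, not from the post.
inline std::pair<std::string, std::string>
name_and_value(const std::string& tag) {
    // Same shape as badPatternStr: optional quote, then a '*' group that is
    // allowed to be empty.
    static const std::regex re(
        "<input[^>]*name=\"?([^> \"]*)[^>]*value=\"?([^> \"]*)");
    std::smatch what;
    if (!std::regex_search(tag, what, re)) {
        return {};
    }
    assert(what.size() == 3);  // whole match plus two marked groups
    return {what[1].str(), what[2].str()};
}
```

Under these rules the greedy `"?` consumes the quote before the group is tried, so a tag such as `<input type="text" name="foo" value="bar">` yields `foo` and `bar`, while `name=""` still yields an empty first capture, which is the case that rules out the `+` quantifier.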
> Drat. I thought that might be it after reading the docs, but I figured that the expression "? would also match leftmost longest (that is, a single " is better than the null string). Also, I thought that by putting a non-capturing group (?:"?) around the expression, this would leftmost-longest match the ". And yet it didn't.

Leftmost longest applies only to marked sub-expressions - there is no concept of greedy or non-greedy repeats as such in the POSIX standard. (?:\"?) isn't a marked subexpression BTW :-)

> (I also tried "* and "{0,1} etc.)
> Sigh. I *really* did read the docs before emailing. :)
> > A simpler solution to your problem is to use a + quantifier rather than a *, so that it can't match the null string:
> Unfortunately, the match can be empty, hence we couldn't use +.
> But thanks for the reply, and for the library!
--- In Boost-Users@y..., "John Maddock"
> > Drat. I thought that might be it after reading the docs, but I figured that the expression "? would also match leftmost longest (that is, a single " is better than the null string). Also, I thought that by putting a non-capturing group (?:"?) around the expression, this would leftmost-longest match the ". And yet it didn't.
> Leftmost longest applies only to marked sub-expressions - there is no concept of greedy or non-greedy repeats as such in the POSIX standard. (?:\"?) isn't a marked subexpression BTW :-)

Hrm. Makes sense. However, I just thought of something. There were two expressions like that, and while the first one had trouble, the second one didn't. That is, I didn't have to group the " on the second expression.
> Hrm. Makes sense. However, I just thought of something. There were two expressions like that, and while the first one had trouble, the second one didn't. That is, I didn't have to group the " on the second expression.

The second one was at the end of the expression: if it had matched the zero-width alternative, then the overall match would have been shorter. Here's another alternative expression that I think does what you want:

const char* badPatternStr = "]*name=\"?([^> \"]*)\"?[[:space:]][^>]*value=\"?([^> \"]+)";

Note the extra \"?[[:space:]] after the matched text. I think that will work; there does have to be one or more spaces between attributes, doesn't there?

John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm
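
[Editor's sketch] A quick check of the shape of John's alternative pattern, again with `std::regex` (ECMAScript grammar) rather than Boost.Regex, so it does not reproduce the POSIX leftmost-longest behavior discussed in this thread. It only confirms which inputs the pattern accepts: a quoted name, an empty quoted name, and the requirement for whitespace after the name attribute. The `<input` prefix, the sample tags, and the names `NameValue` and `parse_input_tag` are assumptions for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Result of trying the pattern against one tag.
struct NameValue {
    bool matched;
    std::string name;
    std::string value;
};

inline NameValue parse_input_tag(const std::string& tag) {
    // The closing quote and one whitespace character are consumed right
    // after the name group; the value group uses '+' as in the post.
    static const std::regex re(
        "<input[^>]*name=\"?([^> \"]*)\"?[[:space:]][^>]*value=\"?([^> \"]+)");
    std::smatch what;
    if (!std::regex_search(tag, what, re)) {
        return NameValue{false, "", ""};
    }
    assert(what.size() == 3);  // whole match plus two marked groups
    return NameValue{true, what[1].str(), what[2].str()};
}
```

With this shape, `<input name="" value="bar">` still matches with an empty name, whereas a tag with nothing after the name attribute, such as `<input name=foo>`, does not match at all.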
participants (2): emarkp, John Maddock