[Boost-bugs] [ boost-Bugs-1685103 ] regex_match throws exception

Bugs item #1685103, was opened at 2007-03-21 06:39 Message generated for change (Tracker Item Submitted) made by Item Submitter You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=107586&aid=1685103&group_id=7586 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: regex Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: krankurs (krankurs) Assigned to: John Maddock (johnmaddock) Summary: regex_match throws exception Initial Comment: Running: MS Windows XP with pack 2, boost_1_33_1, MS VS 2005, VC++ console app, BOOST_REGEX_DYN_LINK; Multi-threaded Debug DLL (/MDd) boost::regex site_url_regex("http(s?)://((\b(?:\d{1,3}\.){3}\d{1,3}\b)|(www(.[a-z0-9]+)+.mydomain.com))"); boost::regex_match("http://www.appwebsite1.mydomain.com", site_url_regex); returns true; (as expected) boost::regex_match("http://www.appwebsite1.mydomain.com/1234", site_url_regex); returns false; (as expected) boost::regex_match("http://www.appwebsite1.mydomain.com/12345678/12345678/", site_url_regex); throws exception, - expected false Thank you, Leonid ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=107586&aid=1685103&group_id=7586 ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ Boost-bugs mailing list Boost-bugs@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/boost-bugs

"SourceForge.net" <noreply@sourceforge.net> wrote in message news:E1HTz9p-0006qo-ID@sc8-sf-web21.sourceforge.net...
Bugs item #1685103, was opened at 2007-03-21 06:39 Message generated for change (Tracker Item Submitted) made by Item Submitter You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=107586&aid=1685103&group_id=7586
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: regex Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: krankurs (krankurs) Assigned to: John Maddock (johnmaddock) Summary: regex_match throws exception
Initial Comment: Running: MS Windows XP with pack 2, boost_1_33_1, MS VS 2005, VC++ console app, BOOST_REGEX_DYN_LINK; Multi-threaded Debug DLL (/MDd)
boost::regex site_url_regex("http(s?)://((\b(?:\d{1,3}\.){3}\d{1,3}\b)|(www(.[a-z0-9]+)+.mydomain.com))");
boost::regex_match("http://www.appwebsite1.mydomain.com", site_url_regex);
returns true; (as expected)
boost::regex_match("http://www.appwebsite1.mydomain.com/1234", site_url_regex);
returns false; (as expected)
boost::regex_match("http://www.appwebsite1.mydomain.com/12345678/12345678/", site_url_regex);
throws exception, - expected false
Thank you, Leonid
Other than the escape characters not being escaped in the example above, correct code follows: --- boost::regex site_url_regex("http(s?)://((\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b)|(www(.[a-z0-9]+)+.mydomain.com))"); boost::regex_match("http://www.appwebsite1.mydomain.com/12345678/12345678/", site_url_regex); --- ... this actually does highlight an apparent bug/deficiency. A much simpler example is given below: --- boost::regex test_regex("(?:.[a-z0-9]+)+"); boost::regex_match("www.appwebsite1.mydomain.com/1/1/", test_regex); --- This throws the same exception, at least in a VC 8.0 build using the DLL version of the reg-ex lib (debug), in perl_matcher_non_recursive.hpp at line 164 (raise_error(traits_inst, regex_constants::error_space);), because the state_count is 107429 and the max_state_count is 107425. Michael Goldshteyn

Michael Goldshteyn wrote:
... this actually does highlight an apparent bug/deficiency. A much simpler example is given below:
--- boost::regex test_regex("(?:.[a-z0-9]+)+");
boost::regex_match("www.appwebsite1.mydomain.com/1/1/", test_regex); ---
This throws the same exception, at least in a VC 8.0 build using the DLL version of the reg-ex lib (debug), in perl_matcher_non_recursive.hpp at line 164 (raise_error(traits_inst, regex_constants::error_space);), because the state_count is 107429 and the max_state_count is 107425.
Right, the behaviour is deliberate and by design: it tries to protect you from pathological regexes that may take "forever" to try and match. Obviously the logic is heuristic based and therefore imperfect, but in this case there is still a bug in your regex: I assume you meant to use "(?:\\.[a-z0-9]+)+" to match a literal "." rather than the regex meaning of "any character". This would have been OK. The reason your example is pathological is because given a long string of characters, then for each character the state machine can either take the inner repeat or the outer repeat and the number of states visited grows exponentially with the number of characters in the string: the expression is effectively the same as (.+)+ which is the classic patholgical test case for backtracking regexes. HTH, John.

"John Maddock" <john@johnmaddock.co.uk> wrote in message news:014401c76c69$ef3f56e0$877c1f56@fuji...
Michael Goldshteyn wrote:
... this actually does highlight an apparent bug/deficiency. A much simpler example is given below:
--- boost::regex test_regex("(?:.[a-z0-9]+)+");
boost::regex_match("www.appwebsite1.mydomain.com/1/1/", test_regex); ---
This throws the same exception, at least in a VC 8.0 build using the DLL version of the reg-ex lib (debug), in perl_matcher_non_recursive.hpp at line 164 (raise_error(traits_inst, regex_constants::error_space);), because the state_count is 107429 and the max_state_count is 107425.
Right, the behaviour is deliberate and by design: it tries to protect you from pathological regexes that may take "forever" to try and match. Obviously the logic is heuristic based and therefore imperfect, but in this case there is still a bug in your regex: I assume you meant to use "(?:\\.[a-z0-9]+)+" to match a literal "." rather than the regex meaning of "any character". This would have been OK.
The reason your example is pathological is because given a long string of characters, then for each character the state machine can either take the inner repeat or the outer repeat and the number of states visited grows exponentially with the number of characters in the string: the expression is effectively the same as (.+)+ which is the classic patholgical test case for backtracking regexes.
HTH, John.
I understand the issues, now, but am wondering whether control over the "maximum number of states" should be given to the caller or if the issue is one of running out of memory? Michael Goldshteyn

Michael Goldshteyn wrote:
I understand the issues, now, but am wondering whether control over the "maximum number of states" should be given to the caller or if the issue is one of running out of memory?
No it's not running out of memory, just giving up 'cos there's no end in sight. I've tweaked the algorithm slightly for 1.34 to give up less easily when the string being matched is short, but basically if you get an exception it usually means that there is something wrong with the expression: even if a match (or not) could be found for a short string, sooner or later it's going to bite you. John.

"John Maddock" <john@johnmaddock.co.uk> wrote in message news:009501c76c9e$1e795850$995b1b56@fuji...
Michael Goldshteyn wrote:
I understand the issues, now, but am wondering whether control over the "maximum number of states" should be given to the caller or if the issue is one of running out of memory?
No it's not running out of memory, just giving up 'cos there's no end in sight. I've tweaked the algorithm slightly for 1.34 to give up less easily when the string being matched is short, but basically if you get an exception it usually means that there is something wrong with the expression: even if a match (or not) could be found for a short string, sooner or later it's going to bite you.
John.
John, will the "still going to bite you" part still hold true if the example uses non-greedy versions of '+' (i.e. "+?") Michael Goldshteyn
participants (3)
-
John Maddock
-
Michael Goldshteyn
-
SourceForge.net