[Regex] - Is this a limitation of sregex_iterator ?
Hi,

I am using sregex_iterator to parse an HTML file like below.

string htmlFile; // populate htmlFile with the file's contents
regex regExpr("Resurfacing(.|\\n)*Home", boost::regex::icase);
sregex_iterator itr(htmlFile.begin(), htmlFile.end(), regExpr);

But that throws a std::runtime_error exception with the message "Regular expression too big". What can I do to avoid this? I went through http://www.boost.org/libs/regex/doc/configuration.html and changed the variables like below:

#define BOOST_REGEX_NON_RECURSIVE
#define BOOST_REGEX_BLOCKSIZE (4096 * 10)
#define BOOST_REGEX_MAX_BLOCKS 1024

Even after that I am getting the same error message. I found that when I changed regExpr to a simpler regular expression, the exception (std::runtime_error) was not thrown. I need to parse big HTML files with some complex regular expressions. I don't mind if sregex_iterator takes a lot of memory or time. How can I solve this error? Or is this a limitation of the boost::regex library?

Thanks in advance,
Kiran.
The error message is somewhat misleading, for which I apologise. It's not really a limitation of the library: it's a limitation of Perl-style regexes. The problem is that many Perl-style regexes are sufficiently ambiguous that they can take effectively "forever" to match. Boost.Regex tries to shield you from this possibility by keeping track of how many states the state machine has visited, and throwing an exception if the number of states visited looks to be growing too fast compared to the text searched.

I'm a little surprised that this expression should be giving you problems, but basically:

The .|\\n part is superfluous, as . matches newlines by default in Boost.Regex anyway. There are also some optimisations that get applied to .* that don't apply otherwise :-)

If there are large numbers of newlines in the text being searched, then (.|\\n)* creates a large number of possible branches through the state machine: basically the number of possible paths doubles for each newline, which is what eventually leads to the exception being thrown.

You might also want to question whether you want a greedy repeat here, and whether "Resurfacing.*?Home" wouldn't be more to the point.

HTH, John.
Hi,

Thanks for answering the question and spending your valuable time to help novices like me. I have never spoken to a person of such high calibre in my whole life. To be frank, I never expected a reply from you.

Coming to the topic, I never knew that in boost::regex "dot" matches newline by default. Is there any way to avoid this and make it behave like a Perl regular expression? Will the "boost::regex::perl" flag do it?

Another thing: why is regex expr("Resurfacing.*?Home", boost::regex::icase) getting aborted?

Thanks,
Kiran.
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
Participants (2): John Maddock, kiran