Re: [Boost-users] Regex problem

12 Mar 2008

Dave DeLong wrote:
...
Hi everyone,
I'm trying to parse an HTML page using the Regex library and am
running into errors.
In the following snippets, "pageSource" is a string pointer to the
contents of an HTML file.
This code causes my app to crash:
void Page::removeScriptTags() {
    boost::regex tagRegex("<[sS][cC][rR][iI][pP][tT][\\w\\W]*?>[.]*?"
                          "</\\s*?[sS][cC][rR][iI][pP][tT]\\s*?>");
    string replaced = boost::regex_replace(*pageSource, pageSource,
                                           tagRegex, " ", boost::match_default);
    delete pageSource;
    pageSource = new string(replaced);
}
That shouldn't even compile - there are too many arguments to 
regex_replace - it should just be,

*pageSource = boost::regex_replace(*pageSource, tagRegex, " ", 
boost::match_default);

The expression is also needlessly inefficient: you could just make the whole 
expression case insensitive by prefixing it with "(?i)".  Note too that 
"[\\w\\W]" will match *either* something that is a word character, *or* 
something that is *not* a word character, in other words any character at 
all (including ">"), which is probably not what you meant :-)  Likewise 
"[.]" will match a literal ".", which again is probably not what you meant.  
So maybe try something like:

"(?i)<script[^>]*>.*?</script[^>]*>"
...
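To make the suggested replacement concrete, here is a minimal sketch that keeps the page text in a string held by value instead of a raw pointer. It is not code from the thread, and it rests on a few assumptions: it uses std::regex rather than Boost.Regex so that it builds without extra libraries, and the function signature (string in, string out) is hypothetical. The default std::regex grammar has no "(?i)" prefix, so the icase flag stands in for it, and "[\s\S]" stands in for "." because "." in that grammar does not match newlines.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (signature is an assumption, not from the post):
// returns a copy of `html` with each <script>...</script> block replaced
// by a single space.
std::string removeScriptTags(const std::string& html) {
    // icase replaces the "(?i)" prefix, which std::regex does not support.
    // [\s\S]*? lazily matches any characters, including newlines, up to the
    // first closing tag.
    static const std::regex tagRegex(
        R"(<script[^>]*>[\s\S]*?</script[^>]*>)",
        std::regex::icase);
    return std::regex_replace(html, tagRegex, " ");
}
```

With Boost.Regex the same shape should carry over by swapping in boost::regex and boost::regex_replace, where the "(?i)" prefix is available. Returning the result by value also removes the need for the delete/new pair on pageSource.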
and this code crashes when attempting to destruct "matches":
void Page::findTitleSummary() {
    boost::cmatch matches;
    boost::regex bodyRegex("<[tT][iI][tT][lL][eE][\\w\\W]*?>([^<]*)"
                           "</\\s*?[tT][iI][tT][lL][eE]\\s*?>");
    if (boost::regex_search(pageSource->c_str(), matches, bodyRegex)) {
        pageSummary = new string(matches[1]);
        hasFoundSummary = true;
    }
}
What am I missing?
Without seeing a compilable code sample to play with I don't know, but it 
looks like you're accessing memory that's already gone out of scope 
somewhere.
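
One way to rule out lifetime problems of that kind is to keep everything in strings held by value and to copy the captured text out of the match before the searched string can change. The sketch below is not a diagnosis of the crash and is not code from the thread; it assumes std::regex instead of Boost.Regex so that it is self-contained, and the helper name and signature are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (name and signature are assumptions, not from the post):
// writes the text of the first <title> element into `title` and returns true,
// or returns false and leaves `title` untouched if no title is found.
bool findTitle(const std::string& html, std::string& title) {
    static const std::regex titleRegex(
        R"(<title[^>]*>([^<]*)</title\s*>)", std::regex::icase);
    // smatch stores iterators into `html`, so `html` must stay alive and
    // unmodified for as long as `matches` is read.
    std::smatch matches;
    if (!std::regex_search(html, matches, titleRegex)) {
        return false;
    }
    // Copy the captured text so nothing refers back into `html` afterwards.
    title = matches[1].str();
    return true;
}
```

A caller would own both strings directly, for example as plain std::string members of Page, so no new or delete is involved.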

HTH, John.

John Maddock