Dave DeLong wrote:
Hi everyone,
I'm trying to parse an HTML page using the Regex library and am running into errors.
In the following snippets, "pageSource" is a pointer to a string holding the contents of an HTML file.
This code causes my app to crash:
    void Page::removeScriptTags() {
        boost::regex tagRegex("<[sS][cC][rR][iI][pP][tT][\\w\\W]*?>[.]*?\ \s*?[sS][cC][rR][iI][pP][tT]\\s*?>");
        string replaced = boost::regex_replace(*pageSource, pageSource, tagRegex, " ", boost::match_default);
        delete pageSource;
        pageSource = new string(replaced);
    }
That shouldn't even compile: there are too many arguments to regex_replace. It should just be:

    *pageSource = boost::regex_replace(*pageSource, tagRegex, " ", boost::match_default);

The expression is also needlessly inefficient. You could make the whole expression case insensitive by prefixing it with "(?i)". Also, "[\\w\\W]" will match *either* something that is a word character, *or* something that is *not* a word character (in other words, any character at all), which is probably not what you meant :-) Likewise, [.] will match a literal ".", which again is probably not what you meant. So maybe try something like: "(?i)]*>"
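To make the advice above concrete, here is a minimal sketch of a script-stripping routine. It rests on several assumptions: it uses std::regex from the standard library as a stand-in for boost::regex so that it builds without Boost (with Boost, the same three-argument regex_replace call applies, and case insensitivity can be requested with the boost::regex::icase flag or a "(?i)" prefix); the pattern is only a guess at what was intended; and the free function removeScriptTags that returns a new string is hypothetical, chosen to avoid the new/delete pair in the original.

```cpp
#include <regex>
#include <string>

// Hypothetical free-function variant of Page::removeScriptTags.
// It returns the cleaned text by value, so no new/delete is needed.
std::string removeScriptTags(const std::string& pageSource) {
    // Assumed pattern (not taken from the original post):
    //   <script ...>   opening tag with optional attributes
    //   [\s\S]*?       any characters, including newlines, non-greedy
    //   </script >     closing tag, optional whitespace before '>'
    // std::regex::icase makes the whole expression case insensitive.
    static const std::regex tagRegex(
        R"(<script\b[^>]*>[\s\S]*?</script\s*>)",
        std::regex::icase);

    // Three arguments: input text, expression, replacement text.
    return std::regex_replace(pageSource, tagRegex, " ");
}
```

A caller holding the page as a std::string member could then write something like pageSource = removeScriptTags(pageSource); and let the string manage its own memory.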
and this code crashes when attempting to destruct "matches":
    void Page::findTitleSummary() {
        boost::cmatch matches;
        boost::regex bodyRegex("<[tT][iI][tT][lL][eE][\\w\\W]*?>([^<]*)\\s*? [tT][iI][tT][lL][eE]\\s*?>");
        if (boost::regex_search(pageSource->c_str(), matches, bodyRegex)) {
            pageSummary = new string(matches[1]);
            hasFoundSummary = true;
        }
    }
What am I missing?
Without seeing a compilable code sample to play with, I don't know, but it looks like you're accessing memory that has already gone out of scope somewhere.

HTH,
John.
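One way to rule out lifetime problems of this kind is to keep the searched text and the extracted result in ordinary std::string objects and copy the capture out immediately. The sketch below is not a diagnosis of the crash, and it makes the same assumptions as any stand-in: it uses std::regex rather than boost::regex so that it builds without Boost, the pattern is a guess, and the helper findTitle with its out-parameter is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true and fills `title` when a <title>
// element is found in pageSource; leaves `title` untouched otherwise.
bool findTitle(const std::string& pageSource, std::string& title) {
    // Assumed pattern: opening tag, text up to the next '<', closing tag.
    static const std::regex titleRegex(
        R"(<title\b[^>]*>([^<]*)</title\s*>)",
        std::regex::icase);

    // smatch stores iterators into pageSource, so the results are only
    // meaningful while pageSource is alive and unmodified.
    std::smatch matches;
    if (!std::regex_search(pageSource, matches, titleRegex)) {
        return false;
    }

    // Copy the capture into an independent string before returning.
    title = matches[1].str();
    return true;
}
```

Because the capture is copied before the function returns, nothing refers back to the page buffer afterwards, and no raw pointer has to be deleted or reassigned.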