Ah, you're right.  That was one of my attempts to fix it (which, as you can guess, didn't work).

As for the inefficiency, this is my first stab at regex.  =)

Here's the complete function as it stands (or doesn't, since it still crashes):

void Page::removeScriptTags() {
    // (?i) makes the whole expression case-insensitive.
    boost::regex tagRegex("(?i)<script[^>]*>.*?</script[^>]*>");
    string source(*pageSource);
    string replaced = boost::regex_replace(source, tagRegex, " ", boost::match_default);
    delete pageSource;
    pageSource = new string(replaced);
}

pageSource, as I mentioned earlier, is a heap-allocated string that stores the contents of the webpage.  I thought the problem might be that pageSource is on the heap, so I've been trying to move it to the stack to see if that makes a difference.  It doesn't seem to.  It still crashes at this line:

string replaced = boost::regex_replace(source, tagRegex, " ", boost::match_default);
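
For comparison, here is a minimal sketch of the same replacement with the page source held by value, so there is no new/delete pair to get wrong.  It is not the code from my project: PageSketch is a hypothetical stand-in for Page, and it uses std::regex in place of boost::regex so that it builds without Boost.  Because of that swap, case-insensitivity comes from the icase flag rather than an inline modifier, and [\s\S] replaces the dot, since the default std::regex grammar does not let "." match line breaks.

```cpp
#include <iostream>
#include <regex>
#include <string>

// Hypothetical stand-in for Page: the source is held by value, so there is
// no manual new/delete involved.
struct PageSketch {
    std::string pageSource;

    // Replaces every <script ...>...</script> element with a single space.
    // Returns false if the regex engine throws, instead of letting the
    // exception escape.
    bool removeScriptTags() {
        try {
            // std::regex stand-in: icase takes the place of an inline
            // modifier, and [\s\S] is used because '.' stops at line breaks
            // in the default ECMAScript grammar.
            static const std::regex tagRegex(
                R"(<script[^>]*>[\s\S]*?</script[^>]*>)",
                std::regex::ECMAScript | std::regex::icase);
            pageSource = std::regex_replace(pageSource, tagRegex, " ");
            return true;
        } catch (const std::regex_error& e) {
            std::cerr << "regex failed: " << e.what() << '\n';
            return false;
        }
    }
};
```

Catching the exception at least turns a failure inside the regex engine into a message rather than an abort, which might help narrow down where the crash comes from.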

Thanks,

Dave

On 12 Mar, 2008, at 6:39 AM, John Maddock wrote:

That shouldn't even compile - there are too many arguments to
regex_replace - it should just be,

*pageSource = boost::regex_replace(*pageSource, tagRegex, " ",
boost::match_default);

The expression is also needlessly inefficient: you could make the whole
expression case insensitive by prefixing it with "(?i)".  Also, "[\\w\\W]" will
match *either* something that is a word character, *or* something that is
*not* a word character, which is probably not what you meant :-)  Likewise
"[.]" will match a literal ".", which again is probably not what you meant.  So
maybe try something like:

"(?i)<script[^>]*>.*?</script[^>]*>"

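One thing worth double-checking with inline modifiers: "(?i)" switches on case-insensitive matching, whereas the similar-looking "(?:i)" is an ordinary non-capturing group that matches a literal letter i.  The sketch below illustrates the second half of that.  It uses std::regex as a stand-in so it builds without Boost; since the default std::regex grammar has no inline "(?i)", the case-insensitive variant is expressed with the icase flag instead.  Both helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// "(?:i)" is a non-capturing group around the letter i, so this pattern
// only matches when a literal 'i' sits directly before "<script".
inline bool matchesLiteralIPrefix(const std::string& text) {
    static const std::regex withGroup("(?:i)<script");
    return std::regex_search(text, withGroup);
}

// Case-insensitive search for "<script"; std::regex::icase stands in for
// an inline (?i) modifier here.
inline bool matchesCaseInsensitive(const std::string& text) {
    static const std::regex insensitive("<script", std::regex::icase);
    return std::regex_search(text, insensitive);
}
```

With these, matchesLiteralIPrefix("<SCRIPT>") is false while matchesCaseInsensitive("<SCRIPT>") is true.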
and this code crashes when attempting to destruct "matches":

void Page::findTitleSummary() {
    boost::cmatch matches;
    boost::regex bodyRegex("<[tT][iI][tT][lL][eE][\\w\\W]*?>([^<]*)</\\s*?[tT][iI][tT][lL][eE]\\s*?>");
    if (boost::regex_search(pageSource->c_str(), matches, bodyRegex)) {
        pageSummary = new string(matches[1]);
        hasFoundSummary = true;
    }
}

What am I missing?

Without seeing a compilable code sample to play with I don't know, but it
looks like you're accessing memory that's already gone out of scope
somewhere.
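
As a rough illustration of the lifetime concern (not a diagnosis of the crash above): match results do not own the text they matched, they only refer back into the buffer that was searched, so the safe pattern is to copy what you need while that buffer is still alive.  The sketch below uses std::regex as a stand-in so it builds without Boost, and extractTitle is a hypothetical helper, not part of the Page class.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text of the first <title> element,
// or an empty string when there is none.
inline std::string extractTitle(const std::string& html) {
    static const std::regex titleRegex(
        R"(<title[^>]*>([^<]*)</\s*title\s*>)", std::regex::icase);
    std::smatch matches;  // holds iterators into html, not copies of it
    if (std::regex_search(html, matches, titleRegex)) {
        return matches[1].str();  // copy out while html is still alive
    }
    return std::string();
}
```

Returning the title by value means the caller never holds anything that points into the searched string.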

HTH, John.

_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users