Hi everyone,

I'm trying to parse an HTML page using the Regex library and am running into errors. In the following snippets, "pageSource" is a string pointer to the contents of an HTML file.

This code causes my app to crash:

void Page::removeScriptTags() {
    boost::regex tagRegex("<[sS][cC][rR][iI][pP][tT][\\w\\W]*?>[.]*?\ \s*?[sS][cC][rR][iI][pP][tT]\\s*?>");
    string replaced = boost::regex_replace(*pageSource, pageSource, tagRegex, " ", boost::match_default);
    delete pageSource;
    pageSource = new string(replaced);
}

and this code crashes when attempting to destruct "matches":

void Page::findTitleSummary() {
    boost::cmatch matches;
    boost::regex bodyRegex("<[tT][iI][tT][lL][eE][\\w\\W]*?>([^<]*)\\s*? [tT][iI][tT][lL][eE]\\s*?>");
    if (boost::regex_search(pageSource->c_str(), matches, bodyRegex)) {
        pageSummary = new string(matches[1]);
        hasFoundSummary = true;
    }
}

What am I missing?

Thanks,
Dave
Dave DeLong wrote:
This code causes my app to crash:
void Page::removeScriptTags() {
    boost::regex tagRegex("<[sS][cC][rR][iI][pP][tT][\\w\\W]*?>[.]*?\ \s*?[sS][cC][rR][iI][pP][tT]\\s*?>");
    string replaced = boost::regex_replace(*pageSource, pageSource, tagRegex, " ", boost::match_default);
    delete pageSource;
    pageSource = new string(replaced);
}
That shouldn't even compile - there are too many arguments to regex_replace - it should just be:

*pageSource = boost::regex_replace(*pageSource, tagRegex, " ", boost::match_default);

The expression is also needlessly inefficient: you could just make the whole expression case insensitive by prefixing with "(?:i)". Then "[\\w\\W]" will match *either* something that is a word character, *or* something that is *not* a word character, which is probably not what you meant :-) Likewise [.] will match a literal "." which again is probably not what you meant. So maybe try something like:

"(?:i)]*>"
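As a rough illustration of the two points above (a three-argument regex_replace call, and one case-insensitivity switch instead of [sS][cC]... classes), here is a minimal sketch. It uses std::regex from C++11 rather than boost::regex so that it builds without Boost; boost::regex_replace and boost::regex::icase have the same shape. The free function removeScriptTags and the pattern itself are assumptions made for illustration, because the pattern posted in the thread did not survive archiving intact. The default ECMAScript grammar of std::regex has no inline (?i) modifier, so the icase flag is passed to the constructor instead.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not the poster's code): returns a copy of `page`
// in which every <script>...</script> element is replaced by one space.
std::string removeScriptTags(const std::string& page) {
    // ASSUMED pattern. icase makes the whole expression case insensitive;
    // [\s\S]*? lazily matches any characters, including newlines.
    static const std::regex tagRegex(
        R"(<script[^>]*>[\s\S]*?</script\s*>)",
        std::regex::ECMAScript | std::regex::icase);

    // Three arguments: input, expression, replacement text.
    return std::regex_replace(page, tagRegex, " ");
}
```

Because the function returns a new string, the caller can assign the result straight back to the original, as in the one-line fix above.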
and this code crashes when attempting to destruct "matches":
void Page::findTitleSummary() {
    boost::cmatch matches;
    boost::regex bodyRegex("<[tT][iI][tT][lL][eE][\\w\\W]*?>([^<]*)\\s*? [tT][iI][tT][lL][eE]\\s*?>");
    if (boost::regex_search(pageSource->c_str(), matches, bodyRegex)) {
        pageSummary = new string(matches[1]);
        hasFoundSummary = true;
    }
}
What am I missing?
Without seeing a compilable code sample to play with I don't know, but it looks like you're accessing memory that's already gone out of scope somewhere.

HTH, John.
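One way the "memory that's already gone out of scope" problem can arise with match results: cmatch and smatch hold pointers or iterators into the searched buffer, not copies of the text, so they are only meaningful while that buffer is alive and unmodified. The thread does not establish that this is what went wrong here. The sketch below, which again assumes std::regex as a stand-in for boost::regex and uses a hypothetical findTitle helper with an assumed pattern, copies the captured text out before the match object or the source can go away.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not the poster's code): returns the text of the
// first <title> element in `page`, or an empty string if there is none.
std::string findTitle(const std::string& page) {
    // ASSUMED pattern; capture group 1 is the title text.
    static const std::regex titleRegex(
        R"(<title[^>]*>([^<]*)</title\s*>)", std::regex::icase);

    std::smatch matches;  // stores iterators into `page`, not copies
    if (std::regex_search(page, matches, titleRegex)) {
        // Copy the sub-match into an independent std::string while
        // `page` is still alive and unchanged.
        return matches[1].str();
    }
    return std::string();
}
```

Returning a std::string by value also removes the need for the new string(...) allocation in the original function.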
Ah, you're right. That was one of my attempts to fix it (which you can guess didn't work).

As for the inefficiency, this is my first stab at regex. =)

Here's the complete function as it stands (or doesn't, since it still crashes):

void Page::removeScriptTags() {
    boost::regex tagRegex("(?:i)]*>");
    string source(*pageSource);
    string replaced = boost::regex_replace(source, tagRegex, " ", boost::match_default);
    delete pageSource;
    pageSource = new string(replaced);
}

pageSource, as I said earlier, is an allocated string that stores the contents of the webpage. I thought that the problem might be that pageSource is on the heap, so I've been trying to move it to the stack to see if that makes a difference. It doesn't seem to. I still crash at this line:

string replaced = boost::regex_replace(source, tagRegex, " ", boost::match_default);

Thanks,
Dave

On 12 Mar, 2008, at 6:39 AM, John Maddock wrote:
That shouldn't even compile - there are too many arguments to regex_replace - it should just be,
*pageSource = boost::regex_replace(*pageSource, tagRegex, " ", boost::match_default);
The expression is also needlessly inefficient, you could just make the whole expression case insensitive by prefixing with "(?:i)", then "[\\w\\W]" will match *either* something that is a word character, *or* something that is *not* a word character, which is probably not what you meant :-) Likewise [.] will match a literal "." which again is probably not what you meant. So maybe try something like:
"(?:i)]*>"
and this code crashes when attempting to destruct "matches":
void Page::findTitleSummary() { boost::cmatch matches; boost::regex bodyRegex("<[tT][iI][tT][lL][eE][\\w\\W]*?>([^<]*)\\s*? [tT][iI][tT][lL][eE]\\s*?>"); if (boost::regex_search(pageSource->c_str(), matches, bodyRegex)) { pageSummary = new string(matches[1]); hasFoundSummary = true; } }
What am I missing?
Without seeing a compilable code sample to play with I don't know, but it looks like you're accessing memory that's already gone out of scope somewhere.
HTH, John.
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
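On the heap-versus-stack question raised in the message above: regex_replace returns a new string either way, so where the source string lives should not matter to the call itself. If the delete/new pair is a concern, one option is to hold the text by value. The sketch below is an illustration under assumptions, not the poster's class: the Page struct, its replaceAll member and the use of std::regex in place of boost::regex are all hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for the poster's Page class.
struct Page {
    std::string pageSource;  // owned by value: no new/delete to get wrong

    // Replaces every match of `pattern` in pageSource with `replacement`.
    void replaceAll(const std::regex& pattern, const std::string& replacement) {
        // regex_replace builds a fresh string, which is then assigned
        // back over the old contents.
        pageSource = std::regex_replace(pageSource, pattern, replacement);
    }
};
```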
Dave DeLong wrote:
Ah, you're right. That was one of my attempts to fix it (which you can guess didn't work).
As for the inefficiency, this is my first stab at regex. =)
Here's the complete function as it stands (or doesn't, since it still crashes):
void Page::removeScriptTags() {
    boost::regex tagRegex("(?:i)]*>");
    string source(*pageSource);
    string replaced = boost::regex_replace(source, tagRegex, " ", boost::match_default);
    delete pageSource;
    pageSource = new string(replaced);
}
That looks fine as it stands, but unless you can reduce it to a complete test case that I can compile and run here it still doesn't help much. Also, what compiler, platform and Boost version are you using?

Also please check that there isn't some binary-compatibility issue going on: building your app with different options than Boost was built with, or linking to a library file that's from a different Boost version to the headers you're #including etc...

John.
I think the problem I had before was with a specific version of gcc on cygwin. Wrapping the call in try/catch should give you some idea of the problem. You should be able to catch (...) or something more diagnostic. In my case, failure to catch (...) finally convinced me of a different diagnosis.

Mike Marchywka
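A sketch of the try/catch diagnosis suggested above, assuming std::regex in place of boost::regex and a hypothetical tryCompile helper. It separates a regex-specific error from other exceptions. Note that a catch block only sees C++ exceptions: a crash caused by an invalid memory access is delivered by the operating system as a signal, so no handler, including catch (...), runs for it. That is consistent with the observation that failing to catch anything points to a different kind of problem.

```cpp
#include <exception>
#include <regex>
#include <string>

// Hypothetical helper: tries to compile `pattern` and reports the outcome.
std::string tryCompile(const std::string& pattern) {
    try {
        std::regex re(pattern);  // throws std::regex_error on a bad pattern
        return "ok";
    } catch (const std::regex_error& e) {
        // Most specific handler first.
        return std::string("regex_error: ") + e.what();
    } catch (const std::exception& e) {
        return std::string("exception: ") + e.what();
    } catch (...) {
        // Anything not derived from std::exception.
        return "unknown exception";
    }
}
```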
From: john@johnmaddock.co.uk To: boost-users@lists.boost.org Date: Wed, 12 Mar 2008 13:50:41 +0000 Subject: Re: [Boost-users] Regex problem
Dave DeLong wrote:
Ah, you're right. That was one of my attempts to fix it (which you can guess didn't work).
As for the inefficiency, this is my first stab at regex. =)
Here's the complete function as it stands (or doesn't, since it still crashes):
void Page::removeScriptTags() { boost::regex tagRegex("(?:i)]*>.*?]*>"); string source(*pageSource); string replaced = boost::regex_replace(source, tagRegex, " ", boost::match_default); delete pageSource; pageSource = new string(replaced); }
That looks fine as it stands, but unless you can reduce it to a complete test case that I can compile and run here it still doesn't help much. Also what compiler, platform and Boost version are you using? Also please check that there isn't some binary-compatibility issue going on: building your app with different options than Boost was built with, or linking to a library file that's from a different Boost version to the headers you're #including etc...
John.
My platform is Mac OS X 10.5.2. I'm compiling and running via Xcode 3.0. Boost is version 1.34.1 (I installed it in the last two weeks). It was compiled on this machine by running "sudo make install", which installed it in /usr/local/include/boost. The library files are in /usr/local/lib. In my project specification, I specify those paths in my Library and Header search paths, and also add the linker option "-lboost_regex".

The following code gets an EXC_BAD_ACCESS error and never executes the catch blocks. (I realize that the pageSource doesn't actually have a <script> tag; that's because I don't know if the pages I'll be parsing will have one. However, the code still fails even if there is a <script>TESTSCRIPT</script> tag in there.)
#include <iostream>
#include
Dave, I am using 10.5.2 and Xcode 3.0 and it compiles and runs fine for me. I know that isn't much help, but it does add weight to the idea that the test case is okay from a stability standpoint. However... I changed the HTML source, adding a simple script:

"<html><head>]*>");
        string replaced = boost::regex_replace(*pageSource, tagRegex, " ", boost::match_default);
        delete pageSource;
        pageSource = new string(replaced);
    } catch (exception &e) {
        cout << e.what() << endl;
    } catch ( ... ) {
        cout << "Unknown exception" << endl;
    }
    cout << *pageSource << endl;
    delete pageSource;
    return 0;
}
Any help would be greatly appreciated, as this project is due tomorrow. =)
Thanks,
Dave
Dave DeLong wrote:
My platform is Mac OS X 10.5.2. I'm compiling and running via Xcode 3.0. Boost is version 1.34.1 (I installed it in the last two weeks). It was compiled on this machine by running "sudo make install", which installed it in /usr/local/include/boost. The library files are in /usr/local/lib. In my project specification, I specify those paths in my Library and Header search paths, and also add the linker option "-lboost_regex".
The following code gets an EXC_BAD_ACCESS error and never executes the catch blocks. (I realize that the pageSource doesn't actually have a <script> tag; that's because I don't know if the pages I'll be parsing will have one. However, the code still fails even if there is a <script>TESTSCRIPT</script> tag in there.)
Works for me on Win32 and VC++; I don't have a Mac so I can't try it there.

I mis-wrote the regex BTW, it should be "(?i)]*>".

This still looks like you're linking to something that isn't binary compatible with your application, but I don't know enough (or indeed anything) about MacOS to be able to help with that.

John.
participants (4)
- Daniel Lord
- Dave DeLong
- John Maddock
- Mike Marchywka