[regex] - Bug in boost::regex ??
Hi, I am trying to extract a pattern from a file. There are actually 2 occurrences of the pattern in the file, one in lines 2-3 and the other in line 4, but the program only reports the occurrence in line 4. The reason I am using (.|\n) coupled with 'match_not_dot_newline' is that I want the regex to be Perl compatible. Why is the program not reporting the first occurrence? Is this a bug? I am attaching the code and file.
Thanks,
Kiran.
kiran wrote:
Hi, I am trying to extract a pattern from a file. There are actually 2 occurrences of the pattern in the file, one in lines 2-3 and the other in line 4, but the program only reports the occurrence in line 4.
The reason I am using (.|\n) coupled with 'match_not_dot_newline' is that I want the regex to be Perl compatible. Why is the program not reporting the first occurrence? Is this a bug? I am attaching the code and file.
It's doing exactly what the regex asks it to do: it matches everything from the first occurrence of "Resurfacing" to the last occurrence of "home". There is only one such match in the document.
John.
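To make the point above concrete, here is a minimal sketch of the greedy behaviour. It is not the program from the thread: it assumes std::regex (ECMAScript grammar, case-insensitive) as a stand-in for boost::regex so that it builds without Boost, and both the helper name find_all and the three-line sample text are hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every non-overlapping match of `pattern` in `text`.
// Assumption: std::regex with icase stands in for boost::regex with the
// perl | icase flags used in the thread.
std::vector<std::string> find_all(const std::string& text,
                                  const std::string& pattern) {
    const std::regex expr(pattern, std::regex::ECMAScript | std::regex::icase);
    std::vector<std::string> matches;
    for (std::sregex_iterator it(text.begin(), text.end(), expr), end;
         it != end; ++it) {
        matches.push_back(it->str());
    }
    return matches;
}

// Hypothetical input, loosely modelled on the file described in the thread:
// "Resurfacing" appears twice and "Home" appears twice.
const std::string sample =
    "<title>Pool Resurfacing for Swimming Pools</title>\n"
    "<meta name=\"robots\">Home\n"
    "<meta name=\"keywords\" content=\"pool Resurfacing\">Home";
```

With this input, find_all(sample, "Resurfacing(.|\n)*Home") should return a single match that starts at the first "Resurfacing" and runs to the final "Home". The second "Resurfacing" is swallowed by the greedy (.|\n)*, so it is never reported as a match of its own.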
Hi
It is DEFINITELY not doing what it is asked to do.
The EXPECTED OUTPUT was ::
Resurfacing for Swimming Pools</title>
<meta name="robots" content="index,follow">Home
<meta name="keywords" content="pool
Resurfacing,uglassit,fibre-shelkote,Uglassit,Fibre-shelkote,swimming pool
resurfacing">Home
But the RESULTANT OUTPUT we got was ::
Resurfacing,uglassit,fibre-shelkote,Uglassit,Fibre-shelkote,swimming
pool resurfacing">Home
This means that it is not picking the "Resurfacing" in the SECOND line of
the file, but rather picking the "Resurfacing" in the FOURTH line of the
file.
Why is the second one not picked ? This was my question.
Regards
Kiran.
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
kiran wrote:
Why is the second one not picked ? This was my question.
It is picked for me: I modified your sample program (see below) so that it
actually compiled and didn't rely on external files, and I see exactly the
output expected: everything from the first "Resurfacing" to the last "home".
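#include "boost/regex.hpp"
#include <iostream>
#include <string>
using namespace std;

int main()
{
   char buf[10000];   // only needed by the commented-out file-reading code
   //int fd = open("glass.htm", O_RDONLY);
   //int size = read(fd, buf, 10000);
   string line = "<!-- saved from url=(0022)http://internet.e-mail -->\n"
                 "<html><head>\n"
                 "<title>UGlassIt Fibre-Shelkote Pool Resurfacing for Swimming Pools</title>\n"
                 "Home\n"
                 "Home";
   //close(fd);
   // boost:: is spelled out so the names cannot clash with std::regex.
   boost::regex expr("Resurfacing(.|\n)*Home", boost::regex::icase | boost::regex::perl);
   try
   {
      boost::sregex_iterator itr(line.begin(), line.end(), expr, boost::match_not_dot_newline);
      boost::sregex_iterator i;
      while(itr != i)
      {
         // The archived message is cut off after "cout<"; everything from
         // here to the end of the program is a minimal reconstruction.
         cout << itr->str() << endl;
         ++itr;
      }
   }
   catch(const std::exception& e)
   {
      cerr << e.what() << endl;
   }
   return 0;
}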
Taking a quick look at the docs, the regex you want is:
"Resurfacing(.*?)Home"
Just a thought. Seems like quite the thread for a regex pattern.
And like John says, it should match from the first Resurfacing to the second
Home. If it didn't, I'd be concerned.
The * operator by itself is greedy: it wants to make matches as long as
possible. Using the *? notation instead makes the repeat non-greedy, i.e.
it makes the match as short as possible.
http://www.boost.org/libs/regex/doc/syntax_perl.html
The section under the heading 'Non greedy repeats' pretty much explains things.
(Note: this applies to the Perl-style regex syntax; I'm not entirely sure
about the other syntaxes.)
Cheers,
Paul
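The difference between the greedy and non-greedy repeat can be sketched as follows. This is an illustration under stated assumptions, not code from the thread: std::regex stands in for boost::regex, the helper count_matches and the sample text are hypothetical, and (.|\n)*? is used rather than .*? so that the repeat can cross line breaks.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Count the non-overlapping matches of `pattern` in `text`.
// Assumption: std::regex (ECMAScript, icase) replaces boost::regex here.
std::size_t count_matches(const std::string& text, const std::string& pattern) {
    const std::regex expr(pattern, std::regex::ECMAScript | std::regex::icase);
    std::size_t count = 0;
    for (std::sregex_iterator it(text.begin(), text.end(), expr), end;
         it != end; ++it) {
        ++count;
    }
    return count;
}

// Hypothetical input with two "Resurfacing ... Home" stretches.
const std::string sample =
    "<title>Pool Resurfacing for Swimming Pools</title>\n"
    "<meta name=\"robots\">Home\n"
    "<meta name=\"keywords\" content=\"pool Resurfacing\">Home";

// Greedy repeat: runs from the first "Resurfacing" to the last "Home".
const char* const greedy_pattern = "Resurfacing(.|\n)*Home";
// Non-greedy repeat: stops at the nearest "Home" after each "Resurfacing".
const char* const lazy_pattern = "Resurfacing(.|\n)*?Home";
```

On this input count_matches(sample, greedy_pattern) should give 1 and count_matches(sample, lazy_pattern) should give 2, which is the two-match result the original poster was expecting.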
On 8/30/06, John Maddock wrote:
kiran wrote:
Why is the second one not picked ? This was my question.
It is picked for me: I modified your sample program (see below) so that it actually compiled and didn't rely on external files, and I see exactly the output expected: everything from the first "Resurfacing" to the last "home".
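#include "boost/regex.hpp"
#include <iostream>
#include <string>
using namespace std;

int main()
{
   char buf[10000];   // only needed by the commented-out file-reading code
   //int fd = open("glass.htm", O_RDONLY);
   //int size = read(fd, buf, 10000);
   string line = "<!-- saved from url=(0022)http://internet.e-mail -->\n"
                 "<html><head>\n"
                 "<title>UGlassIt Fibre-Shelkote Pool Resurfacing for Swimming Pools</title>\n"
                 "Home\n"
                 "Home";
   //close(fd);
   // boost:: is spelled out so the names cannot clash with std::regex.
   boost::regex expr("Resurfacing(.|\n)*Home", boost::regex::icase | boost::regex::perl);
   try
   {
      boost::sregex_iterator itr(line.begin(), line.end(), expr, boost::match_not_dot_newline);
      boost::sregex_iterator i;
      while(itr != i)
      {
         // The archived message is cut off after "cout<"; everything from
         // here to the end of the program is a minimal reconstruction.
         cout << itr->str() << endl;
         ++itr;
      }
   }
   catch(const std::exception& e)
   {
      cerr << e.what() << endl;
   }
   return 0;
}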
Hi,
Thanks for answering the question. That shows that this is not a bug in
boost::regex. But I have one more thing to ask. Of course, when the
dependency on the external file is removed, boost::regex works fine. When
you ran the code with the string loaded directly from the file, did your
regex pick the second one? You may well think that this is not a question to
be answered by a busy person like you, and you might choose to ignore it.
But if you can, please answer this. Also, please tell me which version of
the Boost library you are using. I am running the code on a Linux machine,
and the code I already sent is not picking the second one.
Thanks
Kiran.
kiran wrote:
Hi, Thanks for answering the question. That shows that this is not a bug in boost::regex. But I have one more thing to ask. Of course, when the dependency on the external file is removed, boost::regex works fine. When you ran the code with the string loaded directly from the file, did your regex pick the second one? You may well think that this is not a question to be answered by a busy person like you, and you might choose to ignore it. But if you can, please answer this. Also, please tell me which version of the Boost library you are using. I am running the code on a Linux machine, and the code I already sent is not picking the second one.
I didn't run it loading from the file: not enough time for that, sorry. In any case a quick check in the debugger, or even a cout << the_string; would quickly tell you what's getting loaded. I'm using what will become Boost-1.34. There shouldn't be any differences from previous versions, although I would recommend using Boost-1.33.1 if you can.
John.
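The "print what got loaded" check suggested above might look like the sketch below. The helper names load_file and dump are hypothetical. Reading through std::ifstream into a std::string, instead of the fixed 10000-byte buffer in the original program, is an assumption about what would be convenient here. It is not a confirmed diagnosis of the problem reported in this thread.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Read a whole file into a std::string. Returns an empty string if the file
// cannot be opened. (Hypothetical helper, not part of the original program.)
std::string load_file(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}

// Write the size and the exact contents of the loaded text to `out`, so the
// input that the regex will see can be checked by eye.
void dump(const std::string& text, std::ostream& out) {
    out << "loaded " << text.size() << " bytes:\n" << text << '\n';
}
```

A call such as dump(load_file("glass.htm"), std::cout); just before constructing the sregex_iterator would show whether the string really contains both occurrences, or whether it was cut short or padded with stale buffer contents.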
participants (3)
- John Maddock
- kiran
- Paul Davis