Regex - how to match everything beginning at '<a href=' and ending at '</a>'
On 4/8/07, Jeff
I am new to regex and see how to match the URL string using the expression below:
Expression:
To match: <a href="http://art.com">Art Page</a>
Unfortunately, the URL strings vary and I am basically wanting to get everything beginning at '
<a href="http://art.com" target="_blank">Art Page<br>And More</a>
or
<a href="http://art.com/"><img src="http://art.com/flower.jpg" alt="Art"></a>
Have you tried a simpler pattern: (Infinite wild match before the href in case they decide to put the target attribute first or something). Richard
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
Richard Dingwall
Richard, thank you for responding. I tried your pattern above and it continues beyond '</a>'. The pattern that I am now using below accurately matches everything from here '
////////////////////////////////
char exp[] = "";
boost::regex e(exp, boost::regex::normal | boost::regbase::icase);
boost::sregex_token_iterator i(sFileCont.begin(), sFileCont.end(), e, 0);
boost::sregex_token_iterator j;
while(i != j)
    cout << *i++ << "\n";
////////////////////////////////
Please note that sregex_token_iterator's 4th parameter is set to submatch = 0 in my code above. This leaves me with 2 questions:
1. Although I have specified submatch = 0, I am creating a marked sub-expression, (.*?), and I don't understand why the sub-expression is required or if there is a better way that I don't know about.
2. Why is the ? required in the sub-expression above?
Thanks again
Jeff wrote:
Richard Dingwall writes:
Richard, thank you for responding. I tried your pattern above and it continues beyond '</a>'.
The pattern that I am now using below accurately matches everything from here '
////////////////////////////////
char exp[] = "";
boost::regex e(exp, boost::regex::normal | boost::regbase::icase);
boost::sregex_token_iterator i(sFileCont.begin(), sFileCont.end(), e, 0);
boost::sregex_token_iterator j;
while(i != j)
    cout << *i++ << "\n";
////////////////////////////////
Please note that sregex_token_iterator's 4th parameter is set to submatch = 0 in my code above. This leaves me with 2 questions:
1. Although I have specified submatch = 0, I am creating a marked sub-expression, (.*?), and I don't understand why the sub-expression is required or if there is a better way that I don't know about.
It's not required unless you want it.
2. Why is the ? required in the sub-expression above?
Without the ? the .* is greedy: it will match as many characters as it can before the closing </a> tag, hence you get everything from the first <a> to the last </a>, which is clearly not what you wanted :-)
You could also try something like:
"]+href=\"([^\"]*)\"[^"]*>(.*?)</a>"
which should give you the URL in $1 and the link text in $2, but only if the URL is correctly quoted XML. Badly written HTML may scrape through; to handle that you have to get more complex still.
There are some more examples like this here: http://regexlib.com/Search.aspx?k=link
HTH, John.
John, Thank you very much for your response. It was very helpful. Best Regards, Jeff Dunlap
participants (3)
- Jeff
- John Maddock
- Richard Dingwall