Re: [Boost-users] extract url with boost::regex

26 Nov 2007

      hallouina-ml@yahoo.fr wrote:
...
Hello,
I am trying to extract a URL from a web page. It is almost done, but
completely unoptimised.
Before this I tried a regex iterator, but I don't understand the
documentation.
:-(

Did you see this example:
http://www.boost.org/libs/regex/example/snippets/regex_token_iterator_eg_2.c...

It does exactly what you want - it extracts all the URLs from an HTML file.
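
For readers who cannot open the link, here is a minimal sketch of the token-iterator approach. It is not the code from the Boost example: it uses std::regex from the standard library as a stand-in for boost::regex so that it builds without Boost, and the function name extract_hrefs and the href pattern are assumptions made for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the value of every href attribute in `html`.
std::vector<std::string> extract_hrefs(const std::string& html) {
    // Assumed pattern: href = "..." or '...', case-insensitive.
    // Sub-expression 1 is the text between the quotes.
    static const std::regex href_re(
        "href\\s*=\\s*[\"']([^\"']*)[\"']", std::regex::icase);

    std::vector<std::string> urls;
    // The final argument (1) asks the iterator to yield sub-expression 1
    // of each match instead of the whole match.
    std::sregex_token_iterator it(html.begin(), html.end(), href_re, 1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        urls.push_back(it->str());
    }
    return urls;
}
```

Because the iterator is told to return only sub-expression 1, the surrounding quotes never appear in the results.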
...
boost::regex rexp(".*(http:\\/\\/.+)\"*.*");
and I get this result:
http://www.nolife-tv.com/"
http://www.nolife-tv.com">
http://www.nolife-tv.com/images/stories/noiz/1.jpg"
http://www.nolife-tv.com/component/option,com_poll/task,results/id,16/Itemid...';"
http://www.joomla.org"
http://www.google-analytics.com/urchin.js"
http://www.omniture.com
and so on...
I would like to cut this down and get only the URL, without the " or '.
Why does this regex capture the " with it? I put the closing bracket before
the ", so why? I already tried \\" rather than \".
Because the .* on the end of the expression will match whatever text follows
the ". The grouping construct (...) yields a *sub-expression*, which you
can access via match_results::operator[] or match_results::str(i).
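
A minimal sketch of reading the sub-expression rather than the whole match. It uses std::smatch from the standard library as a stand-in for boost::smatch. The helper first_group and the bounded_pattern variant are assumptions for illustration, not something taken from the original message.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns sub-expression 1 of `pattern` matched against
// the whole `line`, or an empty string when the line does not match.
std::string first_group(const std::string& line, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_match(line, m, re)) {
        return "";
    }
    // m[0] (or m.str(0)) is the entire match, here the whole line;
    // m[1] (or m.str(1)) is only what the parentheses captured.
    return m.str(1);
}

// Greedy ".+" inside the group, in the style of the expression quoted above.
const std::string greedy_pattern = ".*(http://.+)\"*.*";
// Assumed alternative: the group stops at the first quote character.
const std::string bounded_pattern = ".*(http://[^\"']+).*";
```

With the greedy pattern the group can run past the closing quote, because .+ takes as much as it can and \"* is allowed to match nothing. Excluding quote characters inside the group is one way to keep them out of the capture.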

HTH, John.

John Maddock