extract url with boost::regex

26 Nov 2007

Hello,

I am trying to extract URLs from a web page. It almost works, but my approach is completely unoptimised:

At first I tried a regex iterator, but I don't understand the documentation. I spent too much time on that approach, so I tried another one:

I fetch the web page with libcurl,
then I replace every " " with "\n",

like this:

                string::size_type i = 0;
                while (( i = page_a_analyser.find(' ', i )  ) != (string::npos))
                {
                        page_a_analyser.replace(i++, 1, "\n" );
                }

Then I apply the regex:

   boost::regex rexp(".*(http:\\/\\/.+)\"*.*");

and I get this result :

http://www.nolife-tv.com/"
http://www.nolife-tv.com">
http://www.nolife-tv.com/images/stories/noiz/1.jpg"
http://www.nolife-tv.com/component/option,com_poll/task,results/id,16/Itemid...';"
http://www.joomla.org"
http://www.google-analytics.com/urchin.js"
http://www.omniture.com

and so on...

I want to keep only the URL, without the trailing " or '.
Why does this regex capture the " as well? I put the closing parenthesis before the ", so why?  I already tried \\" instead of \".

I also tried (\"|')" to mean " or ', but that doesn't work either...
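A likely explanation, offered as a hedged note: `.+` is greedy, so it consumes everything up to the end of the text it is matched against, and since `\"*` is allowed to match zero quotes, nothing forces the group to stop before the ". One way around this is to exclude the delimiters from the captured characters. The sketch below is a minimal illustration, not a definitive answer: it uses std::regex so that it is self-contained, on the assumption that the same pattern and the equivalent boost::regex, boost::smatch and boost::regex_search names behave the same way, and `first_url` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first URL found in `line`,
// or an empty string if there is none.
std::string first_url(const std::string& line)
{
    // The character class stops the match at whitespace, quotes and
    // angle brackets, so the delimiter is never part of the match.
    static const std::regex url_re("http://[^\\s\"'<>]+");
    std::smatch m;
    if (std::regex_search(line, m, url_re))
        return m[0].str();
    return std::string();
}
```

Because the delimiters are excluded by the character class itself, no trailing `\"*` or `(\"|')` is needed after the URL.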

So I did it another way:

I fetch the web page with libcurl,

then I replace every " " with "\n",
then I replace every " with \n,
then I replace every ' with \n,

then I apply the regex.

I had to do the replacement with three while loops rather than one, because putting the three conditions in a single while did not work:

  while ( (( i = page_a_analyser.find(' ', i )  ) != (string::npos))  or  ( i = page_a_analyser.find('"', i )  ) != (string::npos) or  ( i = page_a_analyser.find('\'', i )  ) != (string::npos) )
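A probable reason the combined condition fails: `or` short-circuits, so the second and third `find` calls only run once the first one has returned `string::npos`, and at that point `i` is already `string::npos`, so they find nothing either. A minimal sketch of a single-loop alternative uses `std::string::find_first_of`, which searches for any character of a set in one call. The function name `split_on_delimiters` is hypothetical.

```cpp
#include <string>

// Hypothetical helper: replaces every space, double quote and single
// quote in `page` with '\n', using a single loop.
void split_on_delimiters(std::string& page)
{
    std::string::size_type i = 0;
    // find_first_of returns the position of the next character that is
    // any one of the three delimiters, or npos when none is left.
    while ((i = page.find_first_of(" \"'", i)) != std::string::npos)
    {
        page[i++] = '\n';
    }
}
```

Calling `split_on_delimiters(page_a_analyser);` would then stand in for the three separate loops.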

So how can I improve the regex so that it extracts just the URL? I would like to do only something like:

replace " " with "\n"
then apply the right regex.
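For illustration only, here is a sketch of how every URL could be collected without a regex iterator and without the replacement step, by calling the search function repeatedly and restarting after each match. It assumes the URLs are delimited by whitespace, quotes or angle brackets. It uses std::regex to stay self-contained, on the assumption that boost::regex_search accepts the same iterator-pair arguments. `extract_urls` is a hypothetical name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every http:// URL found in `page`.
std::vector<std::string> extract_urls(const std::string& page)
{
    static const std::regex url_re("http://[^\\s\"'<>]+");
    std::vector<std::string> urls;
    std::string::const_iterator begin = page.begin();
    const std::string::const_iterator end = page.end();
    std::smatch m;
    // Each successful search resumes just after the previous match.
    while (std::regex_search(begin, end, m, url_re))
    {
        urls.push_back(m[0].str());
        begin = m[0].second;
    }
    return urls;
}
```

The page can be passed in as fetched, since the pattern itself stops at the delimiters.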

I don't want to use a regex iterator again. The regex iterator has exhausted my patience... three days on it is enough for me.

Thanks for your attention


hallouina-ml＠yahoo.fr

John Maddock


participants (2)