Boost-like URL handling/parsing library?

I am in need of a library to parse/encode/decode URLs. A quick look through the boost tree doesn't yield any obvious candidates. Before I dive into writing my own, does anyone have any pointers/recommendations? Thanks!
-- Marshall
Marshall Clow, Idio Software <mailto:marshall@idio.com>
A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait). -- Yu Suzuki

Marshall Clow wrote:
I am in need of a library to parse/encode/decode URLs. A quick look through the boost tree doesn't yield any obvious candidates.
Before I dive into writing my own, does anyone have any pointers/recommendations?
URLs are too complex for normal regular expressions to take apart in one step. So you basically have a choice: use Spirit or Xpressive to build a complex parser, or go over the thing in multiple passes using Regex, String_Algo and perhaps even Tokenizer. Boost does not have a URL parsing library in its own right, although perhaps the networking protocol sandbox project has something.
Sebastian Redl

Sebastian Redl wrote:
Marshall Clow wrote:
I am in need of a library to parse/encode/decode URLs.
URLs are too complex for normal regular expressions to take apart in one step.
Actually, this isn't true. Though I make no claims on efficiency, I built a 1-step Boost.Regex-based solution which you can take a look at here:
https://vs.psc.edu/repositories/psctools/Tools/trunk/systools/systools/uri.h
https://vs.psc.edu/repositories/psctools/Tools/trunk/systools/src/uri.cpp
It follows the RFC at:
http://www.apps.ietf.org/rfc/rfc3986.htm
And there are a number of test cases here:
https://vs.psc.edu/repositories/psctools/Tools/trunk/systools/tests/uri/
Hope this helps. Definitely let me know if you find the code useful, and if you find any bugs. Note that there are some soft dependencies to other code which should be easy to spot and remove.
Cheers, Demian

Whoops, I should have said "isn't true depending upon your assumptions". My URI code does not convert %-encoded characters, and doesn't handle Unicode. Also, the URL to the RFC is missing a trailing "l" (ell).
Cheers, Demian
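For readers wondering what converting %-encoded characters involves, here is a minimal sketch of a decoder. It is not taken from the code linked in this thread; the names `percent_decode` and `hex_value` are hypothetical, and the choice to copy malformed escapes through unchanged is an assumption rather than something the RFC mandates. It works on raw bytes only and does nothing about Unicode.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: numeric value of one hex digit.
// Assumes the caller has already checked std::isxdigit.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return c - 'A' + 10;
}

// Replaces each well-formed %XX escape with the byte it encodes.
// Assumption: a '%' not followed by two hex digits is copied as-is.
std::string percent_decode(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const bool has_escape =
            in[i] == '%' && i + 2 < in.size() &&
            std::isxdigit(static_cast<unsigned char>(in[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(in[i + 2]));
        if (has_escape) {
            out.push_back(static_cast<char>(
                hex_value(in[i + 1]) * 16 + hex_value(in[i + 2])));
            i += 2;  // skip the two hex digits just consumed
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

Decoding is normally applied to each component after the URL has been split, since a decoded "/" or "?" would otherwise change how the URL parses.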

Marshall Clow ha scritto:
I am in need of a library to parse/encode/decode URLs. A quick look through the boost tree doesn't yield any obvious candidates.
That would be a most welcome addition. It doesn't have to be anything as complex as Boost.Filesystem. Something like Python urlparse might be enough for most use cases.
Before I dive into writing my own, does anyone have any pointers/recommendations?
In the meantime, you can use Boost.Regex or Boost.Xpressive. According to http://www.faqs.org/rfcs/rfc2396.html a suitable regex is the following:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
If $n are the groups produced by the regex, the parts of the URL are as follows:
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
There is also an example in Boost.Regex using token iterators, see http://boost.org/libs/regex/doc/examples.html (search for "url").
HTH, Ganesh
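The expression and group numbers above can be tried out with a short sketch like the following. It uses `std::regex` from the C++11 standard library in place of Boost.Regex, purely so that it compiles without Boost; the `UrlParts` struct and the `split_url` function are hypothetical names introduced for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical holder for the five components listed above.
struct UrlParts {
    std::string scheme, authority, path, query, fragment;
};

// Splits a URI reference using the RFC 2396 regex quoted in the post.
// Returns false only if the expression fails to match the whole input.
bool split_url(const std::string& url, UrlParts& out) {
    static const std::regex re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    std::smatch m;
    if (!std::regex_match(url, m, re)) return false;
    out.scheme    = m[2].str();  // $2
    out.authority = m[4].str();  // $4
    out.path      = m[5].str();  // $5
    out.query     = m[7].str();  // $7
    out.fragment  = m[9].str();  // $9
    return true;
}
```

For the RFC's own example, `http://www.ics.uci.edu/pub/ietf/uri/#Related`, this yields scheme `http`, authority `www.ics.uci.edu`, path `/pub/ietf/uri/`, an empty query and fragment `Related`. Note that the expression only splits the string; it does not validate the components, and an unmatched group simply comes back as an empty string.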

Marshall Clow wrote:
I am in need of a library to parse/encode/decode URLs. A quick look through the boost tree doesn't yield any obvious candidates.
Before I dive into writing my own, does anyone have any pointers/recommendations?
My effort at an HTTP request parser using Spirit is here:
http://svn.chezphil.org/libpbe/trunk/src/Request.cc
The bit for URL parsing starts with the rule for absolute_uri, I think. I built this from EBNF in RFCs 2616 and 2396. Unfortunately, these have bugs which remain unfixed after 8 years! Presumably most HTTP implementers are not using the EBNF as the basis of their parsers....
Phil.

Marshall Clow wrote:
I am in need of a library to parse/encode/decode URLs. A quick look through the boost tree doesn't yield any obvious candidates.
Before I dive into writing my own, does anyone have any pointers/recommendations?
Thanks!
You could get libfetch and just use fetch.c/h for the make and parse URL routines. It takes a little hacking, but works well.
- Rush
participants (6): Alberto Ganesh Barbati, Demian Nave, Marshall Clow, Phil Endecott, Rush Manbert, Sebastian Redl