[ann] Urdl - a library for downloading web content

Hi all,

I have just released a new Boost.Asio-based library, Urdl, which can be used for accessing and downloading web content.

<http://think-async.com/Urdl> <http://sourceforge.net/projects/urdl>

It currently has limited support for the protocols "http", "https" and "file". It provides an easy-to-use extension to standard C++ iostreams and an asynchronous interface for use with Boost.Asio.

In addition to (hopefully) being useful, it is intended as an example of using the Boost.Asio and Boost.System libraries. If you're writing protocol implementations using Asio, you might also be interested in how it uses coroutines to simplify the expression of asynchronous control flow. (Note: here coroutines refers to a macro-based system similar to <http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html>, not Giovanni Deretta's GSoC project.)

Cheers, Chris

On Jun 16, 2009, at 5:24 PM, Christopher Kohlhoff wrote:
Hi all,
I have just released a new Boost.Asio-based library, Urdl, which can be used for accessing and downloading web content.
<http://think-async.com/Urdl> <http://sourceforge.net/projects/urdl>
It currently has limited support for the protocols "http", "https" and "file". It provides an easy-to-use extension to standard C++ iostreams and an asynchronous interface for use with Boost.Asio.
Looks nice (from reading the docs). I think the URL parsing class will turn out to be very useful, independently of URDL. -- Marshall

Marshall Clow <mclow.lists <at> gmail.com> writes:
On Jun 16, 2009, at 5:24 PM, Christopher Kohlhoff wrote:
Hi all,
I have just released a new Boost.Asio-based library, Urdl,
[snip]
Looks nice (from reading the docs). I think the URL parsing class will turn out to be very useful, independently of URDL.
I agree on both counts. After a brief look, one thing that wasn't obvious to me was why exactly the url constructor throws boost::system::system_error (other than performance of open). I can understand why you'd want the open functions to do so and are taking advantage of implicit conversion from std::string/char*. However, either letting them throw a std::runtime_error derived url_parsing_error exception or translating this exception to a system_error with an appropriate error_code makes more sense to me. Also, what the specific error code would be for a url parsing error wasn't clear -- the constructor of url didn't document this from what I saw. Anyhow, looks useful, and I encourage you to continue development on this. I can see possibly using this in one of our projects. Thanks, -Ryan

Ryan Gallagher wrote:
After a brief look, one thing that wasn't obvious to me was why exactly the url constructor throws boost::system::system_error (other than performance of open). I can understand why you'd want the open functions to do so and are taking advantage of implicit conversion from std::string/char*. However, either letting them throw a std::runtime_error derived url_parsing_error exception or translating this exception to a system_error with an appropriate error_code makes more sense to me.
On reflection, adding new open() overloads that take std::string/char* does sound better. That way the behaviour will be just as if you open a std::ifstream with a bogus path.
Also, what the specific error code would be for a url parsing error wasn't clear -- the constructor of url didn't document this from what I saw.
So far I was just lazy and used invalid_argument for all errors. Later I intend to add more specific errors to give more information about what was wrong with the URL. Cheers, Chris

On Wed, Jun 17, 2009 at 1:04 AM, Marshall Clow<mclow.lists@gmail.com> wrote:
On Jun 16, 2009, at 5:24 PM, Christopher Kohlhoff wrote:
[snip]
Looks nice (from reading the docs). I think the URL parsing class will turn out to be very useful, independently of URDL.
I think the url class should allow for relative urls too, with concatenation via operator/ like boost::filesystem::path.
-- Marshall
-- Felipe Magno de Almeida

Hi Chris, I've taken a quick look at the library documentation. It is really nice and clear. I will probably be using it. Thanks for it. Take care, emre On Wed, Jun 17, 2009 at 10:24:41AM +1000, Christopher Kohlhoff wrote:
Hi all,
I have just released a new Boost.Asio-based library, Urdl, which can be used for accessing and downloading web content.
<http://think-async.com/Urdl> <http://sourceforge.net/projects/urdl>
It currently has limited support for the protocols "http", "https" and "file". It provides an easy-to-use extension to standard C++ iostreams and an asynchronous interface for use with Boost.Asio.
In addition to (hopefully) being useful, it is intended as an example of using the Boost.Asio and Boost.System libraries. If you're writing protocol implementations using Asio, you might also be interested in how it uses coroutines to simplify the expression of asynchronous control flow. (Note: here coroutines refers to a macro-based system similar to <http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html>, not Giovanni Deretta's GSoC project).
Cheers, Chris _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Hi Chris, Christopher Kohlhoff wrote:
Hi all,
I have just released a new Boost.Asio-based library, Urdl, which can be used for accessing and downloading web content.
<http://think-async.com/Urdl> <http://sourceforge.net/projects/urdl>
It currently has limited support for the protocols "http", "https" and "file". It provides an easy-to-use extension to standard C++ iostreams and an asynchronous interface for use with Boost.Asio.
Looks really nice.

1. ------ May I suggest that you add a bunch of constants instead of using strings? You currently have:

// We're doing an HTTP POST ...
is.set_option(urdl::http::request_method("POST"));
// ... where the MIME type indicates plain text ...
is.set_option(urdl::http::request_content_type("text/plain"));

I don't like the hard-coding of "POST" and "text/plain". Also, for type-safety reasons, strings are usually irritating. Maybe it would be possible to use enumeration values instead:

// We're doing an HTTP POST ...
is.set_option(urdl::http::request_method::post);
// ... where the MIME type indicates plain text ...
is.set_option(urdl::http::request_content_type::text_plain);

2. ------ The url class has operators like ==, != and <. May I suggest you add hash_value( const url& ) too.

3. ------ I'm not a fan of short names. Especially http::errc::errc_t would be clearer IMO as http::error_codes::error_code or something like that.

-Thorsten

Thorsten Ottosen wrote:
3. ------
I'm not a fan of short names. Especially
http::errc::errc_t
would be clearer IMO as
http::error_codes::error_code
or something like that.
-Thorsten
+1 on changing the short names. Other than that, this looks like something that might be useful to me in the near future. -Kenny Riddile

Thorsten Ottosen wrote:
May I suggest that you add a bunch of constants instead of using strings. You currently have
// We're doing an HTTP POST ... is.set_option(urdl::http::request_method("POST"));
// ... where the MIME type indicates plain text ... is.set_option(urdl::http::request_content_type("text/plain"));
I don't like the hard-coding of "POST" and "text/plain". Also for type-safety reasons, strings are usually irritating. Maybe it would be possible to use enumeration values instead:
// We're doing an HTTP POST ... is.set_option(urdl::http::request_method::post);
// ... where the MIME type indicates plain text ... is.set_option(urdl::http::request_content_type::text_plain);
Good idea, however rather than an enum I think I will add some static member functions for these "constants", as the set is unbounded and users may still need to supply a custom string. E.g.: is.set_option(urdl::http::request_method::post());
The url class has operators like ==, != and <. May I suggest you add hash_value( const url& ) too.
Will do.
I'm not a fan of short names. Especially
http::errc::errc_t
would be clearer IMO as
http::error_codes::error_code
or something like that.
This is chosen for consistency with the C++0x standard library, which has the error constants in std::errc::* (and obviously Boost.System uses boost::system::errc). It seems reasonable to me to use "errc" as an idiomatic name for scoping error constants in any library that uses std::error_code and friends. What do you think? Cheers, Chris

Christopher Kohlhoff wrote:
Thorsten Ottosen wrote:
The url class has operators like ==, != and <. May I suggest you add hash_value( const url& ) too.
Will do.
I'm not a fan of short names. Especially
http::errc::errc_t
would be clearer IMO as
http::error_codes::error_code
or something like that.
This is chosen for consistency with the C++0x standard library, which has the error constants in std::errc::* (and obviously Boost.System uses boost::system::errc). It seems reasonable to me to use "errc" as an idiomatic name for scoping error constants in any library that uses std::error_code and friends. What do you think?
I guess if C++0x will use it, then it is better to stick with it. I'm not a fan of it though :-) I mean, where does it end? -Thorsten

Hi Chris, Glad to see you're back! On Wed, Jun 17, 2009 at 2:24 AM, Christopher Kohlhoff<chris@kohlhoff.com> wrote:
Hi all,
I have just released a new Boost.Asio-based library, Urdl, which can be used for accessing and downloading web content.
<http://think-async.com/Urdl> <http://sourceforge.net/projects/urdl>
It currently has limited support for the protocols "http", "https" and "file". It provides an easy-to-use extension to standard C++ iostreams and an asynchronous interface for use with Boost.Asio.
My feedback:
* The #1 feature should be supporting HTTP 1.1 well (also HTTPS). Many libraries provide an HTTP 1.0 implementation but fall short of supporting the many options HTTP provides (I know this is a huge undertaking!). At this point I don't see the value of supporting file (or planning to support ftp). In this case, I don't understand the value of future support for runtime polymorphism (can you explain how user-supplied backends would work?).
* It would be great to clarify why you based the design on a buffered stream. Below are my perceived pros/cons. Pros: easy to add support for new protocols with read_until; easier header parsing. Cons: increased implementation complexity with istreambuf; maybe a small performance penalty.
* In detail/http_read_stream.hpp, open_coro chains the whole sequence from connecting to request/response. I think it would be better to split it in two coroutines (one for opening the connection and another for sending a request and getting a reply). This would make it easier to later implement keep-alive.
* A newbie question: in open_coro you don't use Stream as a template parameter. Is this because Stream is a reference and you don't need to enforce any concept? (I am trying to learn from your great coding practices :) )
Thank you for sharing your library! regards jose

Jose wrote:
* The #1 feature should be supporting HTTP 1.1 well (also HTTPS). Many libraries provide an HTTP 1.0 implementation but fall short of supporting the many options HTTP provides (I know this is a huge undertaking!). At this point I don't see the value of supporting file (or planning to support ftp).
Sometimes I need to use them :) Seriously though, the power of working with URLs is their polymorphism, so the more protocols the better IMHO.
In this case, I don't understand the value of future support for runtime polymorphism (can you explain how user-supplied backends would work ?)
I'm thinking an option to register a factory function, e.g. something like: read_stream_impl_base* make_impl(const url& u) { ... } ... my_read_stream.set_option(implementation_factory(make_impl));
* It would be great to clarify why you based the design on a buffered stream. Below are my perceived pros/cons. Pros: easy to add support for new protocols with read_until; easier header parsing. Cons: increased implementation complexity with istreambuf; maybe a small performance penalty.
I'm not sure which buffered stream you're referring to here. Can you clarify?
* In detail/http_read_stream.hpp
open_coro chains the whole sequence from connecting to request/response. I think it would be better to split it in two coroutines (one for opening the connection and another for sending a request and getting a reply). This would make it easier to later implement keep-alive.
That's what the connect/async_connect functions are for. I will revisit that division when adding connection pooling.
In open_coro you don't use Stream as a template parameter. Is this because Stream is a reference and you don't need to enforce any concept ? (I am trying to learn from your great coding practices :) )
It's already a template parameter on the enclosing class, so isn't needed on the nested open_coro class. Cheers, Chris

* The #1 feature should be supporting HTTP 1.1 well (also HTTPS). Many libraries provide an HTTP 1.0 implementation but fall short of supporting the many options HTTP provides (I know this is a huge undertaking!). At this point I don't see the value of supporting file (or planning to support ftp).
Sometimes I need to use them :) Seriously though, the power of working with URLs is their polymorphism, so the more protocols the better IMHO.
While I totally agree with that, I just want to point out that the discussion of whether this library should support more protocols than HTTP raises a fundamental design issue: if you start to make it support more protocols, then you'll end up with something like libcurl (which is a very nice library), and if that's the case it might be simpler to write a Boost libcurl wrapper (the current C++ wrappers of libcurl aren't very nicely designed). Philippe

On Thu, Jun 18, 2009 at 2:11 PM, Philippe Vaucher<philippe.vaucher@gmail.com> wrote:
* The #1 feature should be supporting HTTP 1.1 well (also HTTPS). Many libraries provide an HTTP 1.0 implementation but fall short of supporting the many options HTTP provides (I know this is a huge undertaking!). At this point I don't see the value of supporting file (or planning to support ftp).
Sometimes I need to use them :) Seriously though, the power of working with URLs is their polymorphism, so the more protocols the better IMHO.
While I totally agree with that, I just want to point out that the discussion of whether this library should support more protocols than HTTP raises a fundamental design issue: if you start to make it support more protocols, then you'll end up with something like libcurl (which is a very nice library), and if that's the case it might be simpler to write a Boost libcurl wrapper (the current C++ wrappers of libcurl aren't very nicely designed).
thanks, this is exactly the point I was trying to make but much better explained!

On Thu, 18 Jun 2009 14:11 +0200, "Philippe Vaucher" <philippe.vaucher@gmail.com> wrote:
While I totally agree with that, I just want to point out that the discussion of whether this library should support more protocols than HTTP raises a fundamental design issue: if you start to make it support more protocols, then you'll end up with something like libcurl (which is a very nice library), and if that's the case it might be simpler to write a Boost libcurl wrapper (the current C++ wrappers of libcurl aren't very nicely designed).
That doesn't exactly meet the design goal of being an example for (and integrating with) Boost.Asio :) Cheers, Chris

On Thu, Jun 18, 2009 at 2:05 PM, Christopher Kohlhoff<chris@kohlhoff.com> wrote:
Jose wrote:
* The #1 feature should be supporting HTTP 1.1 well (also HTTPS). Many libraries provide an HTTP 1.0 implementation but fall short of supporting the many options HTTP provides (I know this is a huge undertaking!). At this point I don't see the value of supporting file (or planning to support ftp).
Sometimes I need to use them :) Seriously though, the power of working with URLs is their polymorphism, so the more protocols the better IMHO.
Yes, it's a matter of priorities. I think feature-rich HTTP support is better than tons of half-supported protocols.
* It would be great to clarify why you based the design on a buffered stream. Pros: easy to add support for new protocols with read_until; easier header parsing. Cons: increased implementation complexity with istreambuf; maybe a small performance penalty.
I'm not sure which buffered stream you're referring to here. Can you clarify?
I'm referring to the overall design of keeping a streambuf, vs. a state machine that just parses the header in one go. Thanks.

Jose wrote:
* It would be great to clarify why you based the design on a buffered stream. Pros: easy to add support for new protocols with read_until; easier header parsing. Cons: increased implementation complexity with istreambuf; maybe a small performance penalty.
I'm not sure which buffered stream you're referring to here. Can you clarify?
I'm referring to the overall design of keeping a streambuf, vs. a state machine that just parses the header in one go.
I'm still not 100% sure which bit you're referring to, but if you mean the asio::streambuf usage in http_read_stream, that's just to keep the implementation simple. Any overhead there only applies to receiving the headers, and isn't incurred once you start downloading the content. Cheers, Chris

I'm still not 100% sure which bit you're referring to, but if you mean the asio::streambuf usage in http_read_stream, that's just to keep the implementation simple. Any overhead there only applies to receiving the headers, and isn't incurred once you start downloading the content.
In this design you use a buffered stream so you can check for the status line, headers and body using read_until. A different approach, which I have used, is to try to decode the data as it arrives (without a buffered stream).

Jarrad Waterloo wrote:
Brilliant! I know your library is primarily client side, but is there any chance you could provide a complete, server- and client-side tutorial of using the https protocol with your library? It doesn't have to be pretty, just an ordered list of the commands showing how you created the certificates and applied them on the server and client. I really anticipate the arrival of the following new features on your feature list: status function callback; runtime polymorphism and user-supplied backends; support for FTP.
On Thu, Jun 18, 2009, Jose wrote:
Hi all,
I have just released a new Boost.Asio-based library, Urdl, which can be used for accessing and downloading web content.
<http://think-async.com/Urdl> <http://sourceforge.net/projects/urdl>
It currently has limited support for the protocols "http", "https" and "file". It provides an easy-to-use extension to standard C++ iostreams and an asynchronous interface for use with Boost.Asio.
My feedback:
* The #1 feature should be supporting HTTP 1.1 well (also HTTPS). Many libraries provide an HTTP 1.0 implementation but fall short of supporting the many options HTTP provides (I know this is a huge undertaking!). At this point I don't see the value of supporting file (or planning to support ftp). In this case, I don't understand the value of future support for runtime polymorphism (can you explain how user-supplied backends would work?).
* It would be great to clarify why you based the design on a buffered stream. Below are my perceived pros/cons. Pros: easy to add support for new protocols with read_until; easier header parsing. Cons: increased implementation complexity with istreambuf; maybe a small performance penalty.
* In detail/http_read_stream.hpp, open_coro chains the whole sequence from connecting to request/response. I think it would be better to split it in two coroutines (one for opening the connection and another for sending a request and getting a reply). This would make it easier to later implement keep-alive.
* A newbie question: in open_coro you don't use Stream as a template parameter. Is this because Stream is a reference and you don't need to enforce any concept? (I am trying to learn from your great coding practices :) )
Thank you for sharing your library! regards jose
participants (11)
- Christopher Kohlhoff
- Emre Turkay
- Felipe Magno de Almeida
- Ilya Sokolov
- Jarrad Waterloo
- Jose
- Kenny Riddile
- Marshall Clow
- Philippe Vaucher
- Ryan Gallagher
- Thorsten Ottosen