Re: [boost] [RFC] Preferred API for a CGI library proposal

Darren wrote:
I think the library should really be separated into (for example) a cgi::service - which handles the protocol specifics - and cgi::request's.
I think I agree, except that 'cgi' is the wrong name; it's an http request, which could be a CGI request or something else.
I have high hopes that a good cgi::service template would allow the library to be extended to handle arbitrary cgi-based protocols, including 'standalone HTTP'
Yes, except again you need to swap that around; "standard HTTP" is not a "CGI-based protocol", but the converse.
Of particular interest: *should GET/POST variables be parsed by default?
So the issue is can you be more efficient in the case when the variable is not used by not parsing it? Well, if you're concerned about efficiency in that case then you would be better off not sending the thing in the first place. So I suggest parsing everything immediately, or at least on the first use of any variable.
I'd agree in theory, but automatic parsing would make it easy for a malicious user to cripple the server by just POSTing huge files wouldn't it?
A DOS attack of X million uploads of a file of size S is in most ways equivalent to 10*X million uploads of a file of size S/10, or 100*X million uploads of a file of size S/100. Where do you draw the line? The place to avoid this sort of concern is with bandwidth throttling in the front-end of the web server.
There are also situations where a cgi program accepts large files and possibly parses them on the fly, or encrypts them or sends them in chunks to a database. As a real-world example, if you attach an exe to a gmail message, you have to wait for the whole file to be sent first before the server returns the reply that it's an invalid file type.
I think it's hard to avoid parsing the whole stream in order to know which variables are present and that it's syntactically correct before continuing. And I don't think you can control the order in which the browser sends the variables. But if you can devise a scheme that allows lazy parsing of the data, great! As long as it doesn't add any syntactic complexity in the common case of a few small variables.
*should cookie variables be accessible just like GET/POST vars, or separately?
Separately
Ok. Although I think direct access is important, I'm tempted to include a helper function like:
cgi::param( /*name*/ ) // returns 'value'
That would iterate over the GET/POST vars _as well as_ the cookie vars. I'll keep my eye open for objections to the idea.
I think that the recent fuss about "Javascript Hijacking" has emphasised the fact that programmers need to be aware of whether they are dealing with cookies, GET (URL) variables, or POST data. Cookies set by example.com are returned to example.com even when the request comes from a script element on a page served by bad.com. In contrast, the bad.com page's script cannot see the GET or POST data that example.com's page is sending. Phil.

On 06/04/07, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Darren wrote:
I think the library should really be separated into (for example) a cgi::service - which handles the protocol specifics - and cgi::request's.
I think I agree, except that 'cgi' is the wrong name; it's an http request, which could be a CGI request or something else.
Well, I intended to have a template class like cgi::basic_cgi_service<>, for instance. Then cgi::service would be a typedef for the basic cgi service. I suppose a sensible typedef for an http-service would be cgi::http_service. Would you find that misleading?
I have high hopes that a good cgi::service template would allow the library to be extended to handle arbitrary cgi-based protocols, including 'standalone HTTP'
Yes, except again you need to swap that around; "standard HTTP" is not a "CGI-based protocol", but the converse.
I probably should have said something like '... extended to handle arbitrary protocols as long as they can be mapped to a cgi request'. The aim of the library as I see it (feel free to disagree), is only to enable writing CGI programs. The idea I mentioned of adding an 'http service' to basically give you a standalone server would be dependent on the 'service' handling everything by itself except for what it determines to be CGI requests. Only these would be passed on to the program itself. I'd imagine the service in this case would be housed in an external library, that sort of thing.
Of particular interest:
*should GET/POST variables be parsed by default?
So the issue is can you be more efficient in the case when the variable is not used by not parsing it? Well, if you're concerned about efficiency in that case then you would be better off not sending the thing in the first place. So I suggest parsing everything immediately, or at least on the first use of any variable.
I'd agree in theory, but automatic parsing would make it easy for a malicious user to cripple the server by just POSTing huge files wouldn't it?
A DOS attack of X million uploads of a file of size S is in most ways equivalent to 10*X million uploads of a file of size S/10, or 100*X million uploads of a file of size S/100. Where do you draw the line? The place to avoid this sort of concern is with bandwidth throttling in the front-end of the web server.
This is something I honestly don't know about. I would have thought the fewer connections an attacker needs to cause a DOS, the easier it is to do, but I've frequently found my natural thoughts to be the polar opposite to reality with this. I see what you're getting at Phil, maybe the concern is misplaced, but I'm wary of washing over it just yet.
There are also situations where a cgi program accepts large files and possibly parses them on the fly, or encrypts them or sends them in chunks to a database. As a real-world example, if you attach an exe to a gmail message, you have to wait for the whole file to be sent first before the server returns the reply that it's an invalid file type.
I think it's hard to avoid parsing the whole stream in order to know which variables are present and that it's syntactically correct before continuing. And I don't think you can control the order in which the browser sends the variables. But if you can devise a scheme that allows lazy parsing of the data, great! As long as it doesn't add any syntactic complexity in the common case of a few small variables.
You can't control variable order, no. It's by no means trivial, but I think it's quite doable. Keeping the common case simple and intuitive is the main concern, but I'd like to try at incorporating this sort of delayed parser very much.
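A minimal sketch of the simplest variant discussed here, "parse on the first use of any variable". This is not the proposed library's API: lazy_request, get() and parsed() are hypothetical names, and the sketch skips URL-decoding and size limits entirely. A truly incremental parser, as Darren describes, would need more than this.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical request type: the raw query string is stored untouched and
// only split into name/value pairs the first time a variable is requested.
class lazy_request {
public:
    explicit lazy_request(std::string raw) : raw_(std::move(raw)) {}

    // Returns the value for `name`, or an empty string if it is absent.
    std::string get(const std::string& name) {
        parse_if_needed();
        auto it = vars_.find(name);
        return it == vars_.end() ? std::string() : it->second;
    }

    // True once the raw data has been split into variables.
    bool parsed() const { return parsed_; }

private:
    void parse_if_needed() {
        if (parsed_) return;
        parsed_ = true;
        std::size_t pos = 0;
        while (pos < raw_.size()) {
            std::size_t end = raw_.find('&', pos);
            if (end == std::string::npos) end = raw_.size();
            std::string pair = raw_.substr(pos, end - pos);
            if (!pair.empty()) {
                std::size_t eq = pair.find('=');
                if (eq == std::string::npos)
                    vars_[pair] = "";  // bare name with no value
                else
                    vars_[pair.substr(0, eq)] = pair.substr(eq + 1);
            }
            pos = end + 1;
        }
    }

    std::string raw_;
    std::map<std::string, std::string> vars_;
    bool parsed_ = false;
};

// Usage: nothing is parsed until the first lookup.
inline void lazy_request_example() {
    lazy_request req("name=phil&lang=cpp");
    assert(!req.parsed());
    assert(req.get("lang") == "cpp");
    assert(req.parsed());
}
```

The common case stays a single get() call, which is the constraint Phil asked for; the cost is that the whole input is still buffered before parsing.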
*should cookie variables be accessible just like GET/POST vars, or separately?
Separately
Ok. Although I think direct access is important, I'm tempted to include a helper function like:
cgi::param( /*name*/ ) // returns 'value'
That would iterate over the GET/POST vars _as well as_ the cookie vars. I'll keep my eye open for objections to the idea.
I think that the recent fuss about "Javascript Hijacking" has emphasised the fact that programmers need to be aware of whether they are dealing with cookies, GET (URL) variables, or POST data. Cookies set by example.com are returned to example.com even when the request comes from a script element on a page served by bad.com. In contrast, the bad.com page's script cannot see the GET or POST data that example.com's page is sending.
That's a good point. Security is obviously a big concern with CGI programs. I've no intention of doing anything like tainting variables (a la Perl - I tried incorporating this in the past, but I found it painful to work with), but perhaps this sort of 'forced awareness' is worth the extra typing? Regards, Darren

On 06/04/07, Darren Garvey <lists.drrngrvy@googlemail.com> wrote:
*should cookie variables be accessible just like GET/POST vars, or separately?
Just to comment on my own point (sorry...), separation of GET/POST/cookie variables would make incremental parsing of them (i.e. to save on memory consumption) almost trivial to implement.
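To make the two access styles concrete, here is a rough sketch that keeps GET, POST and cookie variables in separate stores and layers a param()-style convenience lookup on top. Everything here is an assumption for illustration: request_vars, find_in and the free function param are hypothetical names, and the GET, then POST, then cookie search order is arbitrary, not something settled in this thread.

```cpp
#include <cassert>
#include <initializer_list>
#include <map>
#include <string>

using var_map = std::map<std::string, std::string>;

// Hypothetical container: one store per source, so the caller always
// knows where a value came from when using direct access.
struct request_vars {
    var_map get;
    var_map post;
    var_map cookie;
};

// Direct access: looks in exactly one source. Returns nullptr if absent.
inline const std::string* find_in(const var_map& m, const std::string& name) {
    auto it = m.find(name);
    return it == m.end() ? nullptr : &it->second;
}

// Convenience lookup across all three sources (order is an assumption).
// Returns an empty string if the name is not found anywhere.
inline std::string param(const request_vars& r, const std::string& name) {
    for (const var_map* m : {&r.get, &r.post, &r.cookie}) {
        if (const std::string* v = find_in(*m, name)) return *v;
    }
    return std::string();
}

// Usage: param() hides the source, find_in() makes it explicit.
inline void param_example() {
    request_vars r;
    r.get["id"] = "7";
    r.cookie["session"] = "abc";
    assert(param(r, "session") == "abc");
    assert(find_in(r.get, "session") == nullptr);
}
```

The explicit find_in() calls are the "forced awareness" style; param() is the shortcut whose risk Phil points out, since a cookie value becomes indistinguishable from a GET or POST value.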

Darren Garvey wrote:
This is an initial probe for ideas about an API for [any] upcoming CGI library submission.
I've been working on a library for writing web applications (via any sort of gateway: cgi, fcgi, scgi, mod_proxy_http, etc.) that might be of interest here. This is all very preliminary as it's something I hack on in my spare time more-or-less as an exercise (I don't even have much of a use for such a library!).
Its design is inspired by Python's WSGI in that you build up a context based on a stack of middleware classes which act as mixins to provide abstractions and services on top of a base request/response provided by the gateway.
The mixin system I've created allows mixins to make arbitrary additions/changes to the context structure. At the core is a stack class which handles combining multiple mixins into a single (potential) context. Here's what it looks like right now:
http://magila.googlepages.com/context.hpp
Its actual method of operation is somewhat involved. Basically each mixin consists of three types: a constructor object, a context definition structure, and a context instance structure. The constructor object is the mixin class itself; it's responsible for carrying any initialization parameters to the final instance.
Inside the mixin class a context definition structure (ctx_type) and a context instance structure (type) are defined. The idea is to first build up a type which defines the final structure of the context, then pass that as a template parameter to the instance type which actually instantiates the various structures contained therein. The two-step process is necessary because otherwise you get a loop where a derived class's type depends on its base, whose type depends on the derived class's type.
Here's an example of a basic http context and a simple mixin which parses the Content-Length header out as an int and provides an int in the response which is assigned to the response's Content-Length header.
http://magila.googlepages.com/http_context.hpp http://magila.googlepages.com/http_mixins.hpp
These can be combined using the stack class as such:
stack<http::context, http::content_length>
The result is a type which is itself a valid mixin. This allows a lot of flexibility in how you build up a context. Once you've built up your context with all the middleware you want, you pass it as a template parameter to a gateway class which is responsible for adding its own base request/response part of the context, then calling process_request() to kick off the upper layers in the stack.
To hide all this complexity from the end user we use a resource mixin which takes a class as a template parameter and calls an appropriate member function based on the http method in the request and passes it a reference to itself. See:
http://magila.googlepages.com/http_resource.hpp
And finally a trivial example of how it all looks to the end user:
http://magila.googlepages.com/swgl_test.cpp
You'll note I've left out the gateway class. What I'm using now is just a trivial standalone server using asio that I wrote just for debugging that's not in any shape that I'd want to show :).
As an example, to create an instance of the final context you would do something like this:
stack<base_context, upper_context>::type<> ctx(base_context(req, res), upper_context(foo, bar, bas));
And thus each mixin's ::type's constructor would be passed an object inheriting from its containing class and which has been constructed from the instance of that class passed to the stack's constructor.
Needless to say the code here is all very much a WIP and only provided for reference. Sorry if this has all been a waste of (your) time, just thought I'd throw it out there. Steven Siloti

Hi Steven, sorry about the late reply (I've been travelling).
On 07/04/07, Steven Siloti < ssiloti@gmail.com > wrote:
I've been working on a library for writing web applications (via any sort of gateway: cgi, fcgi, scgi, mod_proxy_http, etc.) that might be of interest here. This is all very preliminary as it's something I hack on in my spare time more-or-less as an exercise (I don't even have much of a use for such a library!).
Its design is inspired by Python's WSGI in that you build up a context based on a stack of middleware classes which act as mixins to provide abstractions and services on top of a base request/response provided by the gateway.
I've looked at WSGI in the past for ideas, so I'm vaguely familiar with it. I'm taking a proper look at it now. If you'd be willing to discuss your approach more (by private email if you prefer), I'd love to share ideas.
The mixin system I've created allows mixins to make arbitrary additions/changes to the context structure. At the core is a stack class which handles combining multiple mixins into a single (potential) context. Here's what it looks like right now:
http://magila.googlepages.com/context.hpp
Its actual method of operation is somewhat involved. Basically each mixin consists of three types: a constructor object, a context definition structure, and a context instance structure. The constructor object is the mixin class itself; it's responsible for carrying any initialization parameters to the final instance.
Inside the mixin class a context definition structure (ctx_type) and a context instance structure (type) are defined. The idea is to first build up a type which defines the final structure of the context, then pass that as a template parameter to the instance type which actually instantiates the various structures contained therein. The two-step process is necessary because otherwise you get a loop where a derived class's type depends on its base, whose type depends on the derived class's type.
Here's an example of a basic http context and a simple mixin which parses the Content-Length header out as an int and provides an int in the response which is assigned to the response's Content-Length header.
http://magila.googlepages.com/http_context.hpp http://magila.googlepages.com/http_mixins.hpp
These can be combined using the stack class as such:
stack<http::context, http::content_length>
The result is a type which is itself a valid mixin. This allows a lot of flexibility in how you build up a context. Once you've built up your context with all the middleware you want, you pass it as a template parameter to a gateway class which is responsible for adding its own base request/response part of the context, then calling process_request() to kick off the upper layers in the stack.
To hide all this complexity from the end user we use a resource mixin which takes a class as a template parameter and calls an appropriate member function based on the http method in the request and passes it a reference to itself. See:
http://magila.googlepages.com/http_resource.hpp
And finally a trivial example of how it all looks to the end user:
The library sounds quite industrial (which is possibly a good thing, IMO), but this example _looks_ industrial. Is this really the simplest case? Also, if I'd written a large program using this framework, is there a simple path to make the same program work with a different protocol?
You'll note I've left out the gateway class. What I'm using now is just a trivial standalone server using asio that I wrote just for debugging that's not in any shape that I'd want to show :).
I've had a look at the gateway class (at least the one on the server, sorry). I'm surprised you didn't show it too, as I think it makes the rest of your code much more understandable. :)
As an example, to create an instance of the final context you would do something like this:
stack<base_context, upper_context>::type<> ctx(base_context(req, res), upper_context(foo, bar, bas));
And thus each mixin's ::type's constructor would be passed an object inheriting from its containing class and which has been constructed from the instance of that class passed to the stack's constructor.
It's an interesting idea and the syntax is somewhat appealing too. I'm not sure how necessary it is for a 'from scratch' approach though... How close is this to Python's WSGI (if that's a reasonable question)?
Needless to say the code here is all very much a WIP and only provided for reference. Sorry if this has all been a waste of (your) time, just thought I'd throw it out there.
Not at all. I'm glad you did. Cheers, Darren

Darren Garvey wrote:
Hi Steven, sorry about the late reply (I've been travelling).
On 07/04/07, Steven Siloti < ssiloti@gmail.com > wrote:
I've been working on a library for writing web applications (via any sort of gateway: cgi, fcgi, scgi, mod_proxy_http, etc.) that might be of interest here. This is all very preliminary as it's something I hack on in my spare time more-or-less as an exercise (I don't even have much of a use for such a library!).
Its design is inspired by Python's WSGI in that you build up a context based on a stack of middleware classes which act as mixins to provide abstractions and services on top of a base request/response provided by the gateway.
I've looked at WSGI in the past for ideas, so I'm vaguely familiar with it. I'm taking a proper look at it now. If you'd be willing to discuss your approach more (by private email if you prefer), I'd love to share ideas.
Certainly.
The mixin system I've created allows mixins to make arbitrary additions/changes to the context structure. At the core is a stack class which handles combining multiple mixins into a single (potential) context. Here's what it looks like right now:
http://magila.googlepages.com/context.hpp
Its actual method of operation is somewhat involved. Basically each mixin consists of three types: a constructor object, a context definition structure, and a context instance structure. The constructor object is the mixin class itself; it's responsible for carrying any initialization parameters to the final instance.
Inside the mixin class a context definition structure (ctx_type) and a context instance structure (type) are defined. The idea is to first build up a type which defines the final structure of the context, then pass that as a template parameter to the instance type which actually instantiates the various structures contained therein. The two-step process is necessary because otherwise you get a loop where a derived class's type depends on its base, whose type depends on the derived class's type.
Here's an example of a basic http context and a simple mixin which parses the Content-Length header out as an int and provides an int in the response which is assigned to the response's Content-Length header.
http://magila.googlepages.com/http_context.hpp http://magila.googlepages.com/http_mixins.hpp
These can be combined using the stack class as such:
stack<http::context, http::content_length>
The result is a type which is itself a valid mixin. This allows a lot of flexibility in how you build up a context. Once you've built up your context with all the middleware you want, you pass it as a template parameter to a gateway class which is responsible for adding its own base request/response part of the context, then calling process_request() to kick off the upper layers in the stack.
To hide all this complexity from the end user we use a resource mixin which takes a class as a template parameter and calls an appropriate member function based on the http method in the request and passes it a reference to itself. See:
http://magila.googlepages.com/http_resource.hpp
And finally a trivial example of how it all looks to the end user:
The library sounds quite industrial (which is possibly a good thing, IMO), but this example _looks_ industrial. Is this really the simplest case? Also, if I'd written a large program using this framework, is there a simple path to make the same program work with a different protocol?
The test code doesn't really represent a realistic use case. In a full-blown application you'd have most or all resources inheriting from a common base class which defines a standard set of mixins. You'd also probably not be directly filling in the http response but rather interacting with a set of higher-level mixins for handling things like cookies, authentication, and document templates. With a little work I could certainly enable a shorter "Hello World" case, chiefly by adding some default behavior that makes the gateway act similarly to the cgi::request class you present on cgi.sf.net when not given any template parameter. The gateway handles all the protocol specifics in a way analogous to the service concept presented on cgi.sf.net. Changing protocols is just a matter of changing the gateway type; everything above doesn't care as long as it gets a standard http context structure at the base of the stack.
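The "change the gateway type" point can be illustrated with a deliberately tiny sketch. This is not Steven's code (his gateway is not shown in the thread): http_context, hello_app, cgi_gateway and standalone_gateway are hypothetical stand-ins, and the two gateways only mimic the difference in the response status line rather than speaking either protocol properly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical "standard http context" that every gateway must provide.
struct http_context {
    std::string method;
    std::string path;
    std::string response_body;
};

// Application code is written only against http_context.
struct hello_app {
    void process_request(http_context& ctx) const {
        ctx.response_body = "Hello from " + ctx.path;
    }
};

// Stand-in CGI gateway: a real one would read the environment and stdin.
template <class App>
struct cgi_gateway {
    App app;
    std::string handle(const std::string& path_info) {
        http_context ctx{"GET", path_info, ""};
        app.process_request(ctx);
        return "Status: 200 OK\r\n\r\n" + ctx.response_body;
    }
};

// Stand-in standalone HTTP gateway: takes a raw request line instead.
template <class App>
struct standalone_gateway {
    App app;
    std::string handle(const std::string& request_line) {
        // Expects "METHOD PATH VERSION"; no validation in this sketch.
        std::size_t a = request_line.find(' ');
        std::size_t b = request_line.find(' ', a + 1);
        http_context ctx{request_line.substr(0, a),
                         request_line.substr(a + 1, b - a - 1), ""};
        app.process_request(ctx);
        return "HTTP/1.1 200 OK\r\n\r\n" + ctx.response_body;
    }
};

// Usage: the same hello_app runs unchanged behind either gateway.
inline void gateway_example() {
    cgi_gateway<hello_app> cgi{};
    standalone_gateway<hello_app> http{};
    assert(cgi.handle("/a").find("Hello from /a") != std::string::npos);
    assert(http.handle("GET /a HTTP/1.1").find("Hello from /a") != std::string::npos);
}
```

Only the gateway template argument site changes between the two deployments; hello_app never sees which protocol delivered the request.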
You'll note I've left out the gateway class. What I'm using now is just a trivial standalone server using asio that I wrote just for debugging that's not in any shape that I'd want to show :).
I've had a look at the gateway class (at least the one on the server, sorry). I'm surprised you didn't show it too, as I think it makes the rest of your code much more understandable. :)
As an example, to create an instance of the final context you would do something like this:
stack<base_context, upper_context>::type<> ctx(base_context(req, res), upper_context(foo, bar, bas));
And thus each mixin's ::type's constructor would be passed an object inheriting from its containing class and which has been constructed from the instance of that class passed to the stack's constructor.
It's an interesting idea and the syntax is somewhat appealing too. I'm not sure how necessary it is for a 'from scratch' approach though... How close is this to Python's WSGI (if that's a reasonable question)?
The relation to WSGI is rather loose. I chiefly took the concept of a gateway feeding requests through a stack of middleware applications to the user. WSGI is a very pythonic API and thus does a lot of things which don't map very gracefully to C++. I'm actually not convinced myself that allowing arbitrary type modification of the context by middleware is worth the trouble it causes. A simpler approach would be to restrict middleware to adding a single uniquely named member to the root of the context, essentially making the context structure analogous to the environment dictionary in WSGI. You'd lose the ability to add things directly to the request, so ctx.request.content_length would change to ctx.content_length.request. But it would eliminate the need for the two-stage buildup of the context type, along with a slew of potential gotchas involving middleware mucking around with the context type in inappropriate ways. Steven Siloti
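One way the simpler approach could look, as a sketch only: each middleware contributes exactly one named member to the root of the context, and the context is assembled by plain multiple inheritance, so no two-stage type buildup is needed. None of these names come from the linked headers; message, base_context, content_length_part, context and the two helper functions are all hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical base part supplied by the gateway.
struct message {
    std::map<std::string, std::string> headers;
};
struct base_context {
    message request;
    message response;
};

// Middleware part: adds a single member, content_length, to the root.
struct content_length_part {
    struct slot {
        long request = 0;   // parsed from the request header
        long response = 0;  // written to the response header
    };
    slot content_length;
};

// The context is the base plus every middleware part, side by side.
template <class... Parts>
struct context : base_context, Parts... {};

// Fills ctx.content_length.request from the request's Content-Length.
template <class Ctx>
void read_content_length(Ctx& ctx) {
    auto it = ctx.request.headers.find("Content-Length");
    if (it != ctx.request.headers.end())
        ctx.content_length.request = std::strtol(it->second.c_str(), nullptr, 10);
}

// Copies ctx.content_length.response into the response headers.
template <class Ctx>
void write_content_length(Ctx& ctx) {
    ctx.response.headers["Content-Length"] =
        std::to_string(ctx.content_length.response);
}

// Usage: access reads ctx.content_length.request, as described above.
inline void context_example() {
    context<content_length_part> ctx;
    ctx.request.headers["Content-Length"] = "42";
    read_content_length(ctx);
    assert(ctx.content_length.request == 42);
}
```

Because each part owns one uniquely named member, two middleware classes that pick the same name fail at compile time with an ambiguity error on first use, rather than silently altering each other's view of the context.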