[Ann] socketstream library 0.7

To spice up the networking discussion, let me announce a new version of my networking experiment, the socketstream library. The home-page is here: http://socketstream.sourceforge.net/ It's not up to date yet, but the doxygen documentation is nearly current. The SourceForge page is here: http://sourceforge.net/projects/socketstream/ The Subversion repository is here: https://mndfck.org/svn/sockestream/ In particular, here are some example client projects of the library: https://mndfck.org/svn/socketstream/trunk/example/ The bug pointed out in this list in the time server was corrected. ;) Two new examples were added last time I mentioned this library here, an extremely simple http client, and a very simple irc client library with it's own example application. The irc client library makes use of the boost::signals library, and soon I'll have a reworked version of the irc::message class that uses boost::spirit to parse messages from istreambuf_iterator's. There are some weird bugs yet to solve, but it works enough to play around. I'm planning to eventually make a "boost" version of the library as an excuse to study the Boost.Build system. -- Pedro Lamarão

Hi, pedro.lamarao@mndfck.org schrieb:
To spice up the networking discussion, let me announce a new version of my networking experiment, the socketstream library.
I have a similar thing on its way, where I do things a little differently, but the general direction is the same.

The biggest difference in my implementation is that listeners and connected sockets are constructed with a string (of arbitrary character type) rather than an address-family-specific type, i.e. the application need not see any types representing addresses. The string is searched for a whitespace character, with the portion before it being the address family and the remainder the (AF-specific) address.

I have tried to get away from the C socket functions a bit, because I felt they were not really needed. For example, bind() can be avoided by a two-argument form of connect() that binds before connecting (and believe me, you do not want to know what #include <sys/socket.h> on Solaris does to code that contains the word "bind").

The other difference in API is that accepted sockets are extracted from the listener, not returned as an auto_ptr. This leads to slight ugliness in the implementation (copyfmt() and a set of callbacks) but looks really nice in the application source code. I also wonder whether I should provide a way to get a "one-shot" listener that will accept a single connection and give it as a stream to the application.

The thing I am missing in your implementation (but that's also still missing in mine :-P) is a wrapper around select() / WaitForMultipleObjects() / Wait() / ... along with the ability to run a resolver in the background and get a notification on connection establishment/failure.

Simon
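For illustration, a minimal sketch of the string-splitting scheme Simon describes; the helper name and the "inet" family tag are invented for the example, not part of either library:

    #include <string>
    #include <utility>

    // Split "family address" at the first whitespace: the portion
    // before it names the address family, the remainder is the
    // family-specific address.
    std::pair<std::string, std::string>
    split_address(const std::string& spec)
    {
        std::string::size_type ws = spec.find_first_of(" \t");
        if (ws == std::string::npos)
            return std::make_pair(spec, std::string());
        return std::make_pair(spec.substr(0, ws), spec.substr(ws + 1));
    }

    // Usage: split_address("inet 192.0.2.1:80") yields
    // ("inet", "192.0.2.1:80").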

Simon Richter wrote:
The biggest difference in my implementation is that listeners and connected sockets are constructed with a string (of arbitrary character type) rather than an address family specific type, i.e. the application need not see any types representing addresses. The string is searched for a whitespace character, with the portion before it being the address family and the remainder the (af specific) address.
How do you interpret the results of name resolution and pass that to the "connect" and "bind" equivalents? Do the "client" and "listener" classes automatically call the resolver?
I have tried to get a bit away from the C socket functions, because I felt they were not really needed. For example, bind() can be avoided by a two-argument form of connect() that binds before connecting (and believe me, you do not want to know what #include <sys/socket.h> on Solaris does to code that contains the word "bind").
That I didn't know. I'm on Linux most of the time; I've been waiting eagerly for Fedora Core 4 with its Xen kernels, to install NetBSD and try the portability of my code. But the need for the existence of a "socket" class is questionable; the IO primitive in the C++ standard library is the streambuf; ugly or not, it provides a complete "client" for a networking library. I'll probably make that class disappear if nothing else stops me.
The other difference in API is that accepted sockets are extracted from the listener, not returned as an auto_ptr. This leads to slight ugliness in the implementation (copyfmt() and a set of callbacks) but looks really nice in the application source code. I also wonder whether I should provide a way to get a "one-shot" listener that will accept a single connection and give it as a stream to the application.
That auto_ptr is there because stream objects are not copyable, and the purpose of that "listener" is to internalize the "accept" primitive and the stream constructor. I'd be happier to return an "rvalue reference" there, if we already had such a thing. I experimented elsewhere with a "socket" class whose copy constructor and assignment operator "moved" the internal descriptor (leaving the copied object in an invalid state) to allow such "moving" around of the object; but it doesn't help the stream classes.
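A minimal sketch of the accept-returning-auto_ptr idiom Pedro describes, with a std::stringstream standing in for the real socket-backed stream (all names invented; this predates rvalue references, hence auto_ptr):

    #include <iostream>
    #include <memory>
    #include <sstream>

    class listener {
    public:
        // In the real library this would wrap ::accept() and
        // construct a socket-backed iostream; ownership of the
        // non-copyable stream is handed out via auto_ptr.
        std::auto_ptr<std::iostream> accept()
        {
            return std::auto_ptr<std::iostream>(new std::stringstream);
        }
    };

    int main()
    {
        listener l;
        std::auto_ptr<std::iostream> client = l.accept();
        *client << "hello\r\n" << std::flush;
    }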
The thing I am missing in your implementation (but that's also still missing in mine :-P) is a wrapper around select() / WaitForMultipleObjects() / Wait() / ... along with the ability to run a resolver in the background and get a notification on connection establishment/failure.
In the example/ directory of the socketstream project, check out the asynch_resolver.h file; there's an asynchronous version of the resolver class there, implemented using boost::thread.

A "select" class would be nice, true, but most of the experiments I've been doing with this code are multi-threaded, so I haven't actually needed such a class yet. But I confess I never truly learned to work with such interfaces. I suspect there's little hope of doing anything other than keeping a std::vector or std::list of whatever networking object we're holding, and creating the proper structure for select() or poll() when calling the blocker method. That might make the blocker method O(n) on the number of networking objects...

But as I've stated elsewhere, this library is mostly about IOStreams, and I've mostly been studying the nice things we can do with operator>>, operator<<, and streambuf_iterators.

-- Pedro Lamarão

Hi, Pedro Lamarão wrote:
How do you interpret the results of name resolution and pass that to the "connect" and "bind" equivalents? Do the "client" and "listener" classes automatically call the resolver?
Yes, that is the plan (name resolution is not implemented yet). A socket stream can be constructed with an optional reference to a "manager" class (which is the select() wrapper, basically), which holds a set of resolvers (similar to locale facets). If a manager object is given, it is asked for a resolver, which is given the query and a method to call on completion; then control is transferred back to the caller. If no manager object is given, a resolver is instantiated, the name resolved and the resolver destroyed (=> you need a manager for nonblocking I/O). The resolver is usually called by the manager when the application enters its idle loop, so applications need not block waiting for connections to be established when they use a manager object.
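One hypothetical shape for that resolver handoff (names invented, since this part is not implemented yet): the manager owns resolvers much as a locale owns facets, and a query is submitted together with a completion callback before control returns to the caller:

    #include <string>

    class resolver {
    public:
        // Invoked by the manager when the query completes.
        typedef void (*completion)(void* ctx, const std::string& address);

        virtual void resolve(const std::string& query,
                             completion done, void* ctx) = 0;
        virtual ~resolver() {}
    };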
But the need for the existence of a "socket" class is questionable; the IO primitive in the C++ standard library is the streambuf; ugly or not, it provides a complete "client" for a networking library.
My streambufs hold a pointer to an abstract "impl" class which represents the actual socket. Concrete subclasses of that exist for different socket types, so new address families can be added at runtime (currently, only through my plugin interface, which I wrote about to the list a few days ago).
That auto_ptr is there because stream objects are not copyable, and the purpose of that "listener" is to internalize the "accept" primitive and the stream constructor.
They are not copy-constructible; that is different from not copyable :-). A new streambuf can be attached to an existing stream, as long as care is taken to properly destroy the streambuf on stream close.
I'd be happier to return an "rvalue reference" there, if we already had such a thing.
Hrm, they could be made copy-constructible, I think, the benefit of that would however be questionable.
In the example/ directory of the socketstream project, check out the asynch_resolver.h file; there's an asynchronous version of the resolver class there implemented using boost::thread.
I avoid spawning new threads from library functions like the plague, as it is difficult to properly collect those threads in case of an exception etc.
I suspect there's little hope of doing anything different than keeping a std::vector or std::list of whatever networking object we're holding, and creating the proper structure for select() or poll() when calling the blocker method. That might make the blocker method O(n) on the number of networking objects...
I think the manager object should have a list of pointers to the streams it monitors and have a callback registered with them, so the stream is scheduled for removal on stream destruction (we cannot unregister the stream right away, as there might be iterators in the list of pointers that we'd invalidate).

I think we should attempt to coordinate our efforts here.

Simon
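A minimal sketch of the deferred-unregistration scheme Simon outlines (all names invented): streams notify the manager on destruction, but removal is applied only between poll passes, so iterators over the list are never invalidated mid-iteration:

    #include <list>

    class stream;  // the monitored type, only used through pointers here

    class manager {
        std::list<stream*> monitored_;
        std::list<stream*> doomed_;
    public:
        void add(stream* s) { monitored_.push_back(s); }

        // Target of the callback registered with each stream.
        void schedule_removal(stream* s) { doomed_.push_back(s); }

        void poll()
        {
            // ... select() / WaitForMultipleObjects() over monitored_ ...
            // Removals are applied only here, outside any iteration.
            for (std::list<stream*>::iterator i = doomed_.begin();
                 i != doomed_.end(); ++i)
                monitored_.remove(*i);
            doomed_.clear();
        }
    };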

On 6/9/05, Simon Richter <Simon.Richter@hogyros.de> wrote:
I think we should attempt to coordinate our efforts here.
Please do! Does either of the implementations offer an interface to the sockets at a lower level than iostreams, though? I personally prefer to use lower-level read/write semantics with sockets and handle would-block conditions etc. in my own code.

Alternatively, I would be happy with a mechanism whereby I could pass a block of bytes to be written-from (or read-to) and a callback that will be notified when the operation is complete. Perhaps this is one of the functions of the "manager" class you mention.

-- Caleb Epstein caleb dot epstein at gmail dot com

Hi, Caleb Epstein wrote:
Please do! Does either of the implementations offer an interface to the sockets at a lower level than iostreams though?
Mine doesn't, so far, as I haven't seen a need for it.
I personally prefer to use lower-level read/write semantics with sockets and handle would-block conditions etc in my own code.
This is the main reason why the manager code is not implemented yet: I am still trying to think of a portable way to reset the read position if an extractor fails due to no more data being available presently.
Alternatively I would be happy with a mechanism whereby I could pass a block of bytes to be written-from (or read-to) and a callback that will be notified when the operation is complete. Perhaps this is one of the functions of the "manager" class you mention.
The manager's task is to check sockets that have pending write data for writability, and sockets which have buffer space available for readability, and move data around accordingly; the application can register callbacks or ask for iterators over sockets that have data available. A special "write stalled" callback is invoked when a socket does not accept any more data; the application is then supposed to take countermeasures before the streambuf is also full, or the socket will go bad().

Reading binary blocks with read() and write() on the streams should be possible, provided you use narrow streams and a do-nothing codecvt<>.

Simon
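A do-nothing codecvt like the one Simon mentions can be spelled out explicitly; note as a caveat that the default std::codecvt<char, char, std::mbstate_t> already reports noconv, so this sketch mostly makes the intent visible:

    #include <locale>

    class binary_codecvt
        : public std::codecvt<char, char, std::mbstate_t> {
    protected:
        // Bytes pass through untranslated.
        virtual bool do_always_noconv() const throw() { return true; }
    };

    // Usage: stream.imbue(std::locale(stream.getloc(), new binary_codecvt));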

On 6/9/05, Simon Richter <Simon.Richter@hogyros.de> wrote:
Caleb Epstein wrote:
Please do! Does either of the implementations offer an interface to the sockets at a lower level than iostreams though?
Mine doesn't, so far, as I haven't seen a need for it.
I believe it is an absolute requirement for a C++ Sockets library.
I personally prefer to use lower-level read/write semantics with sockets and handle would-block conditions etc in my own code.
This is the main reason why the manager code is not implemented yet: I am still trying to think of a portable way to reset the read position if an extractor fails due to no more data being available presently.
Well, I don't think it makes sense to implement iostreams on top of a non-blocking socket interface. If a user wants to use "socketstreams" they can reasonably be forced to use a blocking I/O model. Although the Boost.Iostreams library may make non-blocking doable.

This should not preclude a different user, or even another part of the same application, from using a non-blocking socket interface at "layer 1". IMHO of course.

Proposed Socket Library Layers: http://thread.gmane.org/gmane.comp.lib.boost.devel/122484

-- Caleb Epstein caleb dot epstein at gmail dot com

Caleb Epstein wrote:
I believe it is an absolute requirement for a C++ Sockets library.
I believe the only absolute requirement for a C++ networking library is an implementation of the standard IOStream interface, and whatever support stuff is needed to make it work (like addressing and resolving names). The streambuf interface provides everything a typical "client" object requires for sending and receiving data. I suspect most people would have their needs met if a "network_archive" were implemented for boost::serialization. The most network-intensive application we maintain at work certainly would.
Well, I don't think it makes sense to implement iostreams on top of a non-blocking socket interface. If a user wants to use "socketstreams" they can reasonably be forced to use a blocking I/O model. Although the Boost.Iostreams library may make non-blocking doable.
This should not preclude a different user, or even another part of the same application, from using a non-blocking socket interface at "layer 1". IMHO of course.
Proposed Socket Library Layers: http://thread.gmane.org/gmane.comp.lib.boost.devel/122484
IOStreams are merely a formatting layer over the stream represented by a streambuf, and streambufs are merely a buffering strategy over some form of stream of data. In the end, we have the same kinds of operations we always had with read() and write().

The only difficulty in implementing a non-blocking streambuf is deciding how to return such state information to the client code; this decision would then allow or deny various programming idioms. This information is sometimes returned by the socket primitives as an EWOULDBLOCK error, sometimes as a zero amount of bytes processed. Errors returned by streambuf are captured by the formatting object and translated into eofbit, failbit and/or badbit being set, which might cause an exception to be thrown.

At the streambuf layer, it would be possible to reserve some value of int_type that's not a possible value for char_type and call that traits_type::would_block(), so to speak. But the formatting operators return references to themselves to allow for such nifty idioms as:

    iostream io;
    string s;
    while (io >> s)
        ; // Work, work, work

But generally non-blocking operations aren't used standalone like that; they're used together with a poller service, to avoid blocking on idle streams when there are too many streams lying around. This might make the problem moot, as we may just contain the non-blocking behaviour inside such a poller service. This is the approach I'll try, but I will probably not do it before thinking of some kind of application that would exercise it.

-- Pedro Lamarão
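A sketch of that reserved-value idea, entirely hypothetical (std::char_traits has no such member): alongside eof(), one more int_type value that no char_type maps to reports "no data right now":

    #include <string>

    struct nb_char_traits : std::char_traits<char> {
        // eof() is already reserved; take the adjacent value for
        // "would block". Neither is a valid to_int_type(char) result.
        static int_type would_block() { return eof() - 1; }

        static bool is_would_block(int_type c)
        {
            return eq_int_type(c, would_block());
        }
    };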

Hi, Caleb Epstein wrote:
Please do! Does either of the implementations offer an interface to the sockets at a lower level than iostreams though? Mine doesn't, so far, as I haven't seen a need for it.
I believe it is an absolute requirement for a C++ Sockets library.
Hrm, I have never missed being able to access files from a lower level than iostreams so far. :-) I never do binary I/O directly in my applications, but always implement an inserter/extractor pair that uses streambuf iterators, though. iostreams' read()/write() should be enough for stream-based I/O, and for datagrams I'd propose going through another step anyway (i.e. have a separate stream class that does not derive from the standard iostreams but rather allows inserting and extracting packets only).

What would be really needed in iostreams would be some sort of transaction interface that would allow me to abort insertion and extraction mid-way. It may be possible to emulate that using putback, and I think this will be the way to go here.
Well, I don't think it makes sense to implement iostreams on top of a non-blocking socket interface. If a user wants to use "socketstreams" they can reasonably be forced to use a blocking I/O model.
This would be acceptable for the average client, but it would inhibit writing server code without resorting to C function calls, parsing messages into stringstreams and going from there, at which point you already have two parsers: one to determine whether the message is complete and can be extracted, and one to actually extract.
Although the Boost.Iostreams library may make non-blocking doable.
With a little care, it can be done with the current iostreams library -- however, the {i,o}stream_iterator classes will have to be replaced with transaction-capable ones, and there needs to be a way to distinguish between end-of-stream and end-of-available-data on a stream. While iterators would go past-the-end in either case, an application needs to know whether to restart afterwards. Fortunately, this can be added as a stream-specific function. What I currently cannot think of is how to make nonblocking streams go bad() if an extraction fails because no more data is available and no one took care to put back the already extracted characters.
This should not preclude a different user, or even another part of the same application, from using a non-blocking socket interface at "layer 1". IMHO of course.
Whether the sockets you have are blocking or nonblocking is determined by whether there is a manager attached to them. If you use a manager, you are expected to handle end-of-file conditions that aren't, by asking the stream whether this is really EOF and resetting the stream state accordingly. I can see no problem here.
Proposed Socket Library Layers: http://thread.gmane.org/gmane.comp.lib.boost.devel/122484
This is more about the big picture, stacking more complex interfaces on top of it. I think we should implement iostreams for sockets first, then we can go on to implement the mighty httpwistream that will give you "wchar_t"s, whatever the document encoding was. :-) Simon

Simon Richter wrote:
Hi,
Caleb Epstein wrote:
Please do! Does either of the implementations offer an interface to the sockets at a lower level than iostreams though?
Mine doesn't, so far, as I haven't seen a need for it.
I believe it is an absolute requirement for a C++ Sockets library.
Hrm, I have never missed being able to access files from a lower level than iostreams so far. :-) I never do binary I/O directly in my applications but always implement an inserter/extractor pair that uses streambuf iterators, though.
See? As soon as you don't care about formatting, you do want access to the underlying streambuf object, which, in this case, would be a much more suitable place to put the extra API to manipulate socket-specific options.
iostreams' read()/write() should be enough for stream-based I/O, and for datagrams I'd propose going through another step anyway (i.e. have a separate stream class that does not derive from the standard iostreams but rather allows inserting and extracting packets only.
Why use iostream's read() if you can do that on the streambuf object directly? And why use the concept of a 'stream' at all when dealing with datagrams? That doesn't sound like a good fit to me.

Regards, Stefan

Hi, Stefan Seefeld wrote:
See ? As soon as you don't care for formatting you do want access to the underlaying streambuf object, which, in this case, would be a much more suitable place to put the extra API to manipulate socket-specific options.
Hrm, that would basically mean making the "impl" API in my model a public API for application programmers as well as the socket_streambuf. The stream classes need to pass the socket-specific API calls down to the actual implementation anyway, so there is no big harm being done here.
iostreams' read()/write() should be enough for stream-based I/O, and for datagrams I'd propose going through another step anyway (i.e. have a separate stream class that does not derive from the standard iostreams but rather allows inserting and extracting packets only.
Why use iostream's read() if you can do that on the streambuf object directly?
Easier interface.
And, why use the concept of a 'stream' at all when dealing with datagrams ? That doesn't sound like a good fit to me.
It is a stream of packets, not an iostream. There is no real need to call it a "stream" as well, but I think it should provide operator<< and operator>> for packets. Simon

Simon Richter wrote:
And, why use the concept of a 'stream' at all when dealing with datagrams ? That doesn't sound like a good fit to me.
It is a stream of packets, not an iostream. There is no real need to call it a "stream" as well, but I think it should provide operator<< and operator>> for packets.
The '>>' notation suggests an ordering of the packets, which doesn't exist. Stefan

Stefan Seefeld wrote:
The '>>' notation suggests an ordering of the packets, which doesn't exist.
The ordering of operator>> is as counter-intuitive as the ordering of calls to send(). If people can get used to the idea that the first send() may arrive after the second, they will get used to the same idea with operator syntax. -- Pedro Lamarão

Stefan Seefeld wrote:
Why use iostream's read() if you can do that on the streambuf object directly? And why use the concept of a 'stream' at all when dealing with datagrams? That doesn't sound like a good fit to me.
    class datagram;

    datagram d;
    iostream ios;

    // Fill in the datagram...

    ios << d;

-- Pedro Lamarão

On 6/9/05, Simon Richter <Simon.Richter@hogyros.de> wrote:
Caleb Epstein wrote:
Please do! Does either of the implementations offer an interface to the sockets at a lower level than iostreams though? Mine doesn't, so far, as I haven't seen a need for it.
I believe it is an absolute requirement for a C++ Sockets library.
Hrm, I have never missed being able to access files from a lower level than iostreams so far. :-) I never do binary I/O directly in my applications but always implement an inserter/extractor pair that uses streambuf iterators, though.
Sockets are not files, and I think treating them identically, and building a library on the assumption that iostreams is the lowest-level interface anyone could want to use, is folly. There are large, measurable performance trade-offs associated with iostreams compared to C stdio operations on every platform I have encountered, and similarly when compared to the low-level system read and write (or send/recv) calls.

I write high-speed, network-centric, message-driven applications for a living. I would not be able to write applications that scale properly using a purely iostreams-based interface to the network. The high-level abstractions are nice for simpler applications, but they simply don't work well when you need to scale to managing many hundreds of connections and guaranteeing a certain quality of service to each. A blocking I/O model is not acceptable for my uses.

That said, I think a socket-iostreams library is a GREAT idea. I would use it to write simpler applications that don't require the type of scalability or complexity I mentioned above. I just don't think it should be the ONLY interface.
iostreams' read()/write() should be enough for stream-based I/O, and for datagrams I'd propose going through another step anyway (i.e. have a separate stream class that does not derive from the standard iostreams but rather allows inserting and extracting packets only.
But why not just build the iostreams interface on top of a lower-level interface? That's all I'm looking for: the lower-level ("layer 1" and "layer 2" in Iain's terms) interfaces that are necessary to build highly scalable, complex network services.
What would be really needed in iostreams would be some sort of transaction interface that would allow me to abort insertion and extraction mid-way. It may be possible to emulate that using putback, and I think this will be the way to go here.
Now the iostreams approach is starting to sound pretty complex, isn't it?
Well, I don't think it makes sense to implement iostreams on top of a non-blocking socket interface. If a user wants to use "socketstreams" they can reasonably be forced to use a blocking I/O model.
This would be acceptable for the average client but inhibit writing server code without resorting to C function calls, parsing messages into stringstreams and going from there, at which point you already have two parsers - one to determine whether the message is complete and can be extracted, one to actually extract.
I'm not sure I understand your point here. Are you saying you can implement non-blocking IO with a C++ iostreams interface? Perhaps it's doable, but your code would end up not really looking like "normal" iostreams any more. You'd have to insert checks between each << or >> operation, and figure out where you left off if you got a short read/write. This isn't terribly developer-friendly.
Although the Boost.Iostreams library may make non-blocking doable.
With a little care, it can be done with the current iostreams library -- however, the {i,o}stream_iterator classes will have to be replaced with transaction-capable ones, and there needs to be a way to distinguish between end-of-stream and end-of-available-data on a stream. While iterators would go past-the-end in either case, an application needs to know whether to restart afterwards. Fortunately, this can be added as a stream-specific function.
Again, this sounds quite complex to me. Why not live with an iostreams interface that is blocking-only?
What I currently cannot think of is how to make nonblocking streams go bad() if an extraction fails because no more data is available and no one took care to put back the already extracted characters.
This should not preclude a different user, or even another part of the same application, from using a non-blocking socket interface at "layer 1". IMHO of course.
Whether the sockets you have are blocking or nonblocking is determined by whether there is a manager attached to them. If you use a manager, you are expected to handle end-of-file conditions that aren't, by asking the stream whether this is really EOF and resetting the stream state accordingly. I can see no problem here.
Why should EWOULDBLOCK be treated as an EOF? I really think this is trying to fit a square peg - non-blocking sockets - into a round hole - iostreams.
Proposed Socket Library Layers: http://thread.gmane.org/gmane.comp.lib.boost.devel/122484
This is more about the big picture, stacking more complex interfaces on top of it. I think we should implement iostreams for sockets first, then we can go on to implement the mighty httpwistream that will give you "wchar_t"s, whatever the document encoding was. :-)
I think we should implement C++ sockets first, and then iostreams on top of those. I know there have been others that have agreed with this approach before. Are any of them following this thread? -- Caleb Epstein caleb dot epstein at gmail dot com

Hi, Caleb Epstein wrote:
Sockets are not files, and I think treating them identically, and building a library on the assumption that iostreams is the lowest-level interface anyone could want to use, is folly. There are large, measurable performance trade-offs associated with iostreams compared to C stdio operations on every platform I have encountered, and similarly when compared to the low-level system read and write (or send/recv) calls.
[...] I can see your point now. Probably something like this would work (still using the class names I've used so far):

There is a base class, impl, that provides an abstract interface with lots of socket API calls, but all of them take strings as arguments. This class is subclassed for each address family, implementing the string -> AF-specific argument conversion and providing AF-specific calls. This class also owns the file handle. A wrapper class around a pointer to impl can allow construction of local impl variables with cleanup.

A socket_streambuf wraps around an impl*, providing buffering. The socketstream class uses socket_streambuf. That gives three application layers. I think the manager can be taught to deal with any of them, so you get a select()/WFMO() wrapper as well.

I wonder whether it would make sense to provide generic (non-socket) nonblocking I/O features here as well, for example terminal or GUI I/O.
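A rough outline of that layering in code; the class shapes are assumptions based on the description above, not a settled design:

    #include <string>

    // Abstract base: owns the handle, takes strings only.
    class impl {
    public:
        virtual void connect(const std::string& address) = 0;
        virtual ~impl() {}
    };

    // One subclass per address family does the string -> AF-specific
    // argument conversion.
    class inet_impl : public impl {
    public:
        virtual void connect(const std::string& address)
        {
            // parse "host:port", resolve, then ::connect() ...
        }
    };

    // socket_streambuf would wrap an impl*, and socketstream would
    // wrap socket_streambuf, giving the three application layers.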
What would be really needed in iostreams would be some sort of transaction interface that would allow me to abort insertion and extraction mid-way. It may be possible to emulate that using putback, and I think this will be the way to go here.
Now the iostreams approach is starting to sound pretty complex, isn't it?
Yes, but still needed IMO. I expect that the majority of applications will require formatting/parsing and non-blocking I/O for two or three streams.
I'm not sure I understand your point here. Are you saying you can implement non-blocking IO with a C++ iostreams interface? Perhaps its do-able, but your code would end up not really looking like "normal" iostreams any more. You'd have to insert checks between each << or >> operation, and figure out where you left off if you got a short read/write. This isn't terribly developer-friendly.
For <<, it is pretty easy: if the write buffer is filled to a certain extent, a callback will be invoked whose task it is to throttle the application, for example by telling the socket layer that the next write will fail() before the first byte is written, so you won't lose sync. Then you are back to normal iostreams error handling, with the added twist that your error handler can find out that someone has pulled the emergency brake and attempt to recover. This is not mandatory; the app may just give up the stream, which is the default behaviour.

For >>, it is nearly the same, except that you are supposed to instantiate a "txn" object before you start to extract, and when your parser changes state and has consumed the chars and will not go back, you call the txn object's commit function to tell the streambuf that it can discard the characters. When the txn object is destroyed without either an explicit commit() or rollback(), the stream goes bad(). A special istream_iterator could wrap this.
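A hedged sketch of such a txn object (names invented; tellg()/seekg() stand in here for the character retention the real streambuf would provide, since an actual socket cannot seek):

    #include <istream>

    class txn {
        std::istream& is_;
        std::streampos mark_;
        bool done_;
    public:
        explicit txn(std::istream& is)
            : is_(is), mark_(is.tellg()), done_(false) {}

        void commit()   { done_ = true; }                    // discard kept chars
        void rollback() { is_.seekg(mark_); done_ = true; }  // restore them

        ~txn()
        {
            if (!done_)                          // neither committed nor rolled
                is_.setstate(std::ios::badbit);  // back: data is unrecoverable
        }
    };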
Again, this sounds quite complex to me. Why not live with an iostreams interface that is blocking-only?
Because that is severely limited. It will not even allow me to write a simple interactive application that uses the network.
Why should EWOULDBLOCK be treated as an EOF? I really think this is trying to fit a square peg - non-blocking sockets - into a round hole - iostreams.
From an iterator standpoint, it is the end of the sequence. That there is a possibility that you can pick up again later does not matter to the algorithms that take iterators.
It might be interesting to see whether coroutines could be of any use here later on.
I think we should implement C++ sockets first, and then iostreams on top of those.
Yes, I see what you mean. They are basically implemented, all that needs to be done is to make the API public. Simon

----- Original Message ----- From: "Caleb Epstein" <caleb.epstein@gmail.com> To: <boost@lists.boost.org> Sent: Friday, June 10, 2005 11:10 AM Subject: Re: [boost] [Ann] socketstream library 0.7 [snip]
Why should EWOULDBLOCK be treated as an EOF? I really think this is trying to fit a square peg - non-blocking sockets - into a round hole - iostreams.
Proposed Socket Library Layers: http://thread.gmane.org/gmane.comp.lib.boost.devel/122484
This is more about the big picture, stacking more complex interfaces on top of it. I think we should implement iostreams for sockets first, then we can go on to implement the mighty httpwistream that will give you "wchar_t"s, whatever the document encoding was. :-)
I think we should implement C++ sockets first, and then iostreams on top of those.
I know there have been others that have agreed with this approach before. Are any of them following this thread?
Don't know if I'm in that group, but I certainly agree. This subject was tackled in a recent thread that I contributed to; I don't imagine that that was the first time, or that this is the last.

Having also been involved in "high speed, network-centric, message-driven applications for a living", I pale at the thought of dumping the burden of iostreams on my already busy CPUs. But that may be a response originating from or near the spleen.

The fact that the programming model underlying iostreams is synchronous seems to be acknowledged, and that flaw is (in my opinion) untenable in a networked world. By the time all the necessary changes were made to iostreams to allow it to operate in an asynchronous world, it would be seriously ugly. But that too is just my opinion, as I have never implemented such a thing. Ain't about to put my hand up either.

I like the two-layer proposal, but I wonder at the class of applications that would be achievable within the sync iostream model; it might have a very small population.

Cheers.

Caleb Epstein wrote: Hi Caleb,
[...]
This is more about the big picture, stacking more complex interfaces on top of it. I think we should implement iostreams for sockets first, then we can go on to implement the mighty httpwistream that will give you "wchar_t"s, whatever the document encoding was. :-)
I think we should implement C++ sockets first, and then iostreams on top of those.
I know there have been others that have agreed with this approach before. Are any of them following this thread?
basically I agree with you - we have been through all these discussions before. However, as there is so much to do for a socket library, I appreciate it if from time to time something is proposed at all (like this socketstream library now :-).

There are still all the requirements in the Wiki, which I tried to put in the UML diagram at http://www.highscore.de/boost/net/packages.png (which also includes socket streams, of course). Personally I am also more interested in the low-level interfaces. We still have to come up with a proposal for how the asynchronicity pattern should look. Then we could start building up classes for the "Berkeley" package.

Unfortunately I haven't had much time lately but hope to go on working on the asynchronicity pattern next. As socket streams won't support asynchronous I/O (I think these were the latest news?), someone could go on building socket streams (the implementation could be based on the "Berkeley" package later). At the moment the low-level "Berkeley" package is waiting for the asynchronicity pattern.

Boris

On Fri, 10 Jun 2005 03:10:26 +0400, Caleb Epstein <caleb.epstein@gmail.com> wrote: []
Sockets are not files, and I think treating them identically, and building a library on the assumption that iostreams is the lowest-level interface anyone could want to use, is folly. There are large, measurable performance trade-offs associated with iostreams compared to C stdio operations on every platform I have encountered, and similarly when compared to the low-level system read and write (or send/recv) calls.
I write high-speed, network-centric, message-driven applications for a living. I would not be able to write applications that scale properly using a purely iostreams-based interface to the network. The high-level abstractions are nice for simpler applications, but they simply don't work well when you need to scale to managing many hundreds of connections and guaranteeing a certain quality of service to each. A blocking I/O model is not acceptable for my uses.
Totally agree. IMO, sockets are a border area for Boost, whose "emphasis is on libraries which work well with the C++ Standard Library". The C++ Standard Library itself does not work well with sockets or, more precisely, does not offer low-fat async input/output abstractions. -- Maxim Yegorushkin

Simon Richter wrote:
Hrm, I have never missed being able to access files from a lower level than iostreams so far. :-) I never do binary I/O directly in my applications but always implement an inserter/extractor pair that uses streambuf iterators, though.
iostreams' read()/write() should be enough for stream-based I/O, and for datagrams I'd propose going through another step anyway (i.e. have a separate stream class that does not derive from the standard iostreams but rather allows inserting and extracting packets only.
What write()? And as far as read() goes, can someone explain to me how it's possible to use this interface for anything other than files? As near as I can tell, it has no way of expressing how much data it actually read, which is unacceptable, and useless, and really next door to defect-land in my opinion. readsome() returns a streamsize, but this function isn't that useful either, as it will never trigger a call to underflow().

Now, what exactly do iostreams offer to general socket work? Most socket protocols are defined in terms that have no correlation with C++ locales and whitespace-skipping rules. In other words, the iostreams concept of >> and << really isn't appropriate for sockets, even for 'text protocols', because generally additional control will be needed. Besides this, iostreams has many annoying characteristics that are unacceptable in many robust sockets situations, such as reading an indeterminate number of characters and then just throwing them away if something goes wrong.

In other words, I'd suggest we forget about iostreams. Design a good socket library based on past experience that handles diverse networking situations elegantly, and let some other library like Boost.Iostreams map that to a streambuf, should <iostream> support be needed.

Aaron W. LaFramboise

Aaron W. LaFramboise wrote: Hi Aaron,
Simon Richter wrote:
Hrm, I have never missed being able to access files from a lower level than iostreams so far. :-) I never do binary I/O directly in my applications but always implement an inserter/extractor pair that uses streambuf iterators, though.
iostreams' read()/write() should be enough for stream-based I/O, and for datagrams I'd propose going through another step anyway (i.e. have a separate stream class that does not derive from the standard iostreams but rather allows inserting and extracting packets only.
What write()?
And as far as read() goes, can someone explain to me how it's possible to use this interface for anything other than files? As near as I can tell, it has no way of expressing how much data it actually read, which is unacceptable, and useless, and really next door to defect-land in my opinion. readsome() returns a streamsize, but this function isn't that useful either, as it will never trigger a call to underflow().
there was at least one long discussion about socket streams some time ago. If I recall correctly the discussion ended with 1) socket streams are desirable for a socket library as they provide an interface many developers are familiar with (these developers can then create simple network application quickly) and 2) socket streams can only support blocking I/O (even this could be limited though). Boris
[...]

Aaron W. LaFramboise wrote:
Simon Richter wrote:
Hrm, I have never missed being able to access files from a lower level than iostreams so far. :-) I never do binary I/O directly in my applications but always implement an inserter/extractor pair that uses streambuf iterators, though.
iostreams' read()/write() should be enough for stream-based I/O, and for datagrams I'd propose going through another step anyway (i.e. have a separate stream class that does not derive from the standard iostreams but rather allows inserting and extracting packets only.
What write()?
basic_ostream::write(char const*, streamsize)?
And as far as read() goes, can someone explain to me how it's possible to use this interface for anything other than files? As near as I can tell, it has no way of expressing how much data it actually read, which is unacceptable, and useless, and really next door to defect-land in my opinion.
basic_istream::gcount()?
Now, what exactly do iostreams offer to general socket work? Most socket protocols are defined in terms that have no correlation with C++ locales and whitespace-skipping rules. In other words, the iostreams concept of >> and << really isn't appropriate for sockets, even for 'text protocols', because generally additional control will be needed.
Most Internet protocols I've worked with or seen, such as HTTP or the Jabber protocol, or mail messages extended with MIME, contain text encoded in some character set indicated in a header field. Try reading my UTF-8 mail with your locale set to ISO8859-1 (or some Windows variation). Most of these, and others such as FTP, are "text oriented", so to speak.

Most of these have a very well specified message format, one that can be interpreted as the "serialized to text" form of a message object. Define such a class with proper operator<< and operator>> and check out what kinds of semantics you can offer to user code. *This* is the great benefit in my opinion, and the greatest usefulness of the iostreams library.
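To make the claim concrete, a small sketch of such a message class for an RFC-822-style header block (illustrative only, not the socketstream library's API):

    #include <istream>
    #include <map>
    #include <ostream>
    #include <string>

    struct header_block {
        std::map<std::string, std::string> fields;
    };

    std::istream& operator>>(std::istream& is, header_block& h)
    {
        std::string line;
        while (std::getline(is, line) && !line.empty() && line != "\r") {
            std::string::size_type colon = line.find(':');
            if (colon == std::string::npos) {
                is.setstate(std::ios::failbit);  // malformed field
                break;
            }
            h.fields[line.substr(0, colon)] = line.substr(colon + 1);
        }
        return is;
    }

    std::ostream& operator<<(std::ostream& os, const header_block& h)
    {
        for (std::map<std::string, std::string>::const_iterator i =
                 h.fields.begin(); i != h.fields.end(); ++i)
            os << i->first << ":" << i->second << "\r\n";
        return os << "\r\n";  // blank line terminates the block
    }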
Besides this, iostreams has many annoying characteristics that are unacceptable in many robust sockets situations, such as reading an indeterminate number of characters and then just throwing them away if something goes wrong.
I don't understand what you mean here.
In other words, I'd suggest we forget about iostreams. Design a good socket library based on past experience that handles diverse networking situations elegantly, and let some other library like Boost.Iostreams map that to a streambuf, should <iostream> support be needed.
I sincerely have no idea what trouble everyone has with iostreams. It's just a formatting layer over a buffer layer.

Every "socket library" or socket application without such library support has reinvented a buffer layer; and none of them have considered the usefulness of a formatting layer, leaving "sends" and "receives" as high up as main(). None of these, be it iostreams, be it personalized buffering, attack the complicated problems of IO efficiency, which come down to not blocking, and not making a system call, when you don't absolutely need to.

-- Pedro Lamarão

pedro.lamarao@mndfck.org wrote:
Most Internet protocols I've worked with or seen, such as HTTP or the Jabber protocol, or mail messages extended with MIME, contain text encoded in some charater set indicated in a header field. Try reading my UTF-8 mail with your locale set to ISO8859-1 (or some Windows variation).
While most modern protocols support diverse character sets, this does not generally apply to the encoding of the protocol's primitives. These tend to be defined in terms of fixed numerical values that correspond to ASCII. As a simple example, >> cannot distinguish between different forms of whitespace. As near as I can tell, your best bet for line-ending-sensitive code is to read each line into an ostringstream buffer.

I suspect that in practice, implementations aiming for high-quality parsing behavior for standard Internet protocols won't even try to use the formatting facilities of iostreams, as they offer almost no value in the overall scheme of building a parser. << and >> are great for things related to human-readable formatting, where a human's eyes are the primary discriminator, but I am unconvinced they are useful for reading and writing text to be manipulated by machines. I do not think << and >> are even workable, in the general case, for a protocol whose whitespace and formatting rules do not exactly match those of C and C++.
Besides this, iostreams has many annoying characteristics that are unacceptable for use in many robust sockets situations, such as reading an indeterminant amount of characters, and then just throwing them away if something goes wrong.
I don't understand what you mean here.
When implementing the >> operator for a custom class, how do you handle the case where you need to read two primitive types, but the second read fails, leaving the operator holding on to data that it has no way of 'putting back'? As near as I can tell, this ends up leaving the stream in a consistent but indeterminate state; something that might be OK for files, but is entirely not OK for a medium that is not rewindable, such as sockets. An improved streambuf could help cope with this, but this is tangential to the unsuitability of iostreams.
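A small illustration of the hazard Aaron raises (types invented for the example):

    #include <istream>

    struct point { int x, y; };

    std::istream& operator>>(std::istream& is, point& p)
    {
        int x, y;
        if (is >> x >> y) {
            p.x = x;
            p.y = y;
        }
        // If the second extraction fails, the characters consumed by
        // the first are already gone; the stream is failed, and on a
        // non-rewindable medium the data is simply lost.
        return is;
    }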
In other words, I'd suggest we forget about iostreams. Design a good socket library based on past experience that handles diverse networking situations elegantly, and let some other library like Boost.Iostreams map that to a streambuf, should <iostream> support be needed.
I sincerely have no idea what trouble everyone has with iostreams. It's just a formatting layer over a buffer layer.
I think it needs to be demonstrated, in real code, that it is easier to write a conformant, high-quality parser for an ASCII-based standard Internet protocol using the iostreams primitives than without. For the reasons above, I think they will tend only to make things more difficult.
Every "socket library" or socket application without such library support has reinvented a buffer layer; and none of them have considered the usefulness of a formatting layer, leaving "sends" and "receives" as high up as main().
I am not saying "<iostream> is never useful for sockets." I am only saying that it is not a good primitive for general work, done in real programs with real protocols, and hence is somewhat tangential to the path of seeking a general-purpose sockets library.

With the Boost Iostreams library, it is extremely easy to form a streambuf from any particular data source. With this, I completely disagree with your former statement, and I'd say: an implementation of a socket streambuf for iostreams is the only thing that a socket library *doesn't* need to provide.

By the way, take this in no way as criticism of your library, which I have not formed an opinion on yet. I am only stating my belief that iostream implementations are tangential to the primary work of creating a Boost socket stream library.

Aaron W. LaFramboise

Aaron W. LaFramboise wrote:
By the way, take this in no way as criticism of your library, which I have not formed an opinion on yet. I am only stating my belief that iostream implementations are tangential to the primary work of creating a Boost socket stream library.
If I didn't want criticism of my library, I wouldn't show it to other people. ;-) Criticism is welcome.

I have an alternative version of that code, in the form of a "Boost.Network" library, suitable for the Sandbox. As soon as I've won my battle against the documentation building system, I'll upload it. Then I'll concentrate on trying to show what kinds of coding patterns are made much easier with IOStreams syntax. That's my game.

-- Pedro Lamarão

On Sun, 12 Jun 2005 01:29:29 -0500, Aaron W. LaFramboise wrote: Let me throw a couple of wrenches into the discussion and then I'll go back to lurking...
As a simple example, >> cannot distinguish between different forms of whitespace.
Actually, I think the whitespace can be defined in facets, but if not, then you would need a new stream type for the built-in types. For custom types, well, they can do whatever they want.
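For what it's worth, the facet approach can be sketched with the classic ctype trick; here only '\n' remains classified as whitespace, so `is >> word` consumes everything up to the newline. This is plain standard C++, not tied to either proposed library:

    #include <locale>
    #include <vector>

    class newline_only_ws : public std::ctype<char> {
    public:
        newline_only_ws() : std::ctype<char>(make_table()) {}
    private:
        static const mask* make_table()
        {
            static std::vector<mask> t(classic_table(),
                                       classic_table() + table_size);
            t[' ']  &= ~space;   // ordinary space no longer "space"
            t['\t'] &= ~space;
            return &t[0];
        }
    };

    // Usage: is.imbue(std::locale(is.getloc(), new newline_only_ws));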
...
<< and >> are great for things related to human-readable formatting, where a human's eyes are the primary discriminator, but I am unconvinced they are useful for reading and writing text to be manipulated by machines. I do not think << and >> are even workable, in the general case, for a protocol who's whitespace and formatting rules do not exactly match those of C and C++.
When implementing the >> operator for a custom class, how do you handle the case where you need to read two primitive types, but the second read fails, leaving the operator holding on to data that it has no way of 'putting back'? As near as I can tell, this ends up leaving the stream in a consistent but indeterminate state; something that might be OK for files, but is entirely not OK for a medium that is not rewindable, such as sockets.
Implementing parsers using operator>> is tough because you only have an input iterator - that makes backtracking tough. That said, I don't think this issue is even remotely related to your statement that << and >> are only good for 'human' output. Plenty of computer-only I/O goes through these operators.
An improved streambuf could help cope with this, but this is tangential to the unsuitability of iostreams.
Yes, but actually I don't think it is tangential overall. I believe a socket library that doesn't work with standard I/O is unacceptable for Boost. Anyone who wants to get a socket library into the standard will have to clearly demonstrate why the current standard I/O model doesn't work. I've seen nothing so far that convinces me that standard streambufs can't be used in the core of a socket library for managing the opaque or 'char'-level data. If you accept this, well, then the iostream level is almost an incidental benefit...
Every "socket library" or socket application without such library support has reinvented a buffer layer; and none of them have considered the usefulness of a formatting layer, leaving "sends" and "receives" as high up as main().
I am not saying, "<iostream> is never useful for sockets." I am only saying that it is not a good primative for general work, done in real programs with real protocols, and hence is somewhat tangental to the path of seeking a general purpose sockets library.
Yes, I've seen it work quite well in real programs. It worked something like this:

1) The protocol header had the message type/size at the top of the packet
2) The socket core ensured a full read of the message into a std::streambuf
3) The application layer received the streambuf with a message type callback
4) The application would create a 'message object' based on the type
5) I/O streaming/serialization was used to read the message object from the streambuf

Simple and clean. The socket core doesn't really care about message content -- as it should be. The application layer does that -- it has the option of using iostreams or parsing from the buffer directly. BTW, some of the message formats are binary, using a different serialization format adapter against the streambuf.
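A condensed sketch of steps 3-5 of the flow Jeff describes (names invented; the socket core is assumed to have already completed the full read):

    #include <istream>
    #include <sstream>
    #include <string>

    // Step 3: callback from the socket core with type and payload.
    void on_message(unsigned type, const std::string& payload)
    {
        std::stringbuf buf(payload);  // full message, already buffered (step 2)
        std::istream is(&buf);        // application-side view of the message
        if (type == 1 /* e.g. a hypothetical text message type */) {
            std::string body;
            std::getline(is, body);   // step 5: stream the object out of the buf
            // handle(body) ...
        }
    }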
With the Boost Iostreams library, it is extremely easy to form a streambuf from any particular data source. With this, I completely disagree with your former statement, and I'd say: An implementation of a socket streambuf for iostreams is the only thing that a socket library *doesn't* need to provide.
If it's easy, then just provide it now.
By the way, take this in no way as criticism of your library, which I have not formed an opinion on yet. I am only stating my belief that iostream implementations are tangental to the primary work of creating a Boost socket stream library.
Ah, obviously I totally disagree. Think about where it fits in now -- before you get called out in the review. Jeff

On Sun, 12 Jun 2005 19:25:30 +0400, Jeff Garland <jeff@crystalclearsoftware.com> wrote: []
Yes, I've seen it work quite well in real programs. It worked something like this:

1) The protocol header had the message type/size at the top of the packet
2) The socket core ensured a full read of the message into a std::streambuf
3) The application layer received the streambuf with a message type callback
4) The application would create a 'message object' based on the type
5) I/O streaming/serialization was used to read the message object from the streambuf
Simple and clean. Socket core doesn't really care about message content -- as it should be. Application layer does that -- has the option of using iostreams or parsing from the buffer directly. BTW, some of the message formats are binary using a different serialization format adapter against the streambuf.
... And too slow. You have one data copy from the kernel into the streambuf, and another one from the streambuf to the message object. The same goes for output: message -> streambuf -> socket. This is unacceptable, at least for the kind of software I build.

For example, I had a router which used a TCP binary protocol. Each message is prefixed with a 4-byte length, so I read the 4 bytes, resized a std::vector<char> to the message length, and then read the message into the vector. After profiling the router under heavy load, it turned out that 30% of user time was spent in -- guess where? In zeroing out memory in std::vector<char>::resize(). And you are talking about data copying here...

-- Maxim Yegorushkin
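A sketch of the pattern with the cost Maxim measured, plus one common mitigation (names invented): resize() value-initializes bytes that recv() is about to overwrite anyway, so a buffer that only ever grows amortizes the zero-fill across messages:

    #include <cstddef>
    #include <vector>

    class msg_buffer {
        std::vector<char> buf_;
    public:
        // Returns space for n bytes; the zero-fill is paid only when
        // the buffer actually grows, not on every message.
        char* prepare(std::size_t n)
        {
            if (buf_.size() < n)
                buf_.resize(n);
            return n ? &buf_[0] : 0;
        }
    };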

Maxim Yegorushkin writes:
For example, I had a router which used a TCP binary protocol. Each message is prefixed with a 4-byte length, so I read the 4 bytes, resized a std::vector<char> to the message length, and then read the message into the vector. After profiling the router under heavy load, it turned out that 30% of user time was spent in -- guess where? In zeroing out memory in std::vector<char>::resize(). And you are talking about data copying here...
This reminds me of recent post by Dave Harris: http://lists.boost.org/boost/2005/06/27934.php -- Alexander Nasonov

Maxim Yegorushkin wrote:
... And too slow. You have one data copy from the kernel into the streambuf, and another one from the streambuf to the message object. The same goes for output: message -> streambuf -> socket. This is unacceptable, at least for the kind of software I build.
For example, I had a router which used a TCP binary protocol. Each message is prefixed with a 4-byte length, so I read the 4 bytes, resized a std::vector<char> to the message length, and then read the message into the vector. After profiling the router under heavy load, it turned out that 30% of user time was spent in -- guess where? In zeroing out memory in std::vector<char>::resize(). And you are talking about data copying here...
Considering that the protocol of your application has built-in methods for announcing the length of the payload, your requirement is met by the streambuf::sgetn(char_type*, streamsize) method, for a properly specialized implementation of the virtual protected xsgetn method. A streambuf manipulated solely by sgetn and sputn won't ever fill internal buffers.

By encapsulating your protocol message inside a protocol_message object with proper operator<< and operator>>, you can receive and send such objects by directly manipulating the streambuf associated with a stream through stream::rdbuf(). A proper implementation of xsgetn would directly call the networking primitive with the proper parameters, after sanity checking, taking into account the possibility of there being characters in the internal buffer, for complete correctness. We'll get a couple of branches for every call to recv(), so, yeah, we're probably not 100% on par with C here; but then, we'll only call recv() twice for each object received anyway.

Also, an implementation of a networking streambuf can implement "no buffering" semantics set up by a streambuf.pubsetbuf(0, 0) call (like your typical fstream provides) to absolutely ensure there is no buffering going on. So you get operator semantics for free. :-) And perhaps even a putback area, if there's one provided by this particular streambuf implementation.

I'll take a note of this use case, though. As soon as documentation is finished for my last proposal, I'll try and do some benchmarking. How large are your payloads, usually?

-- Pedro Lamarão
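A minimal sketch of such an xsgetn specialization (hypothetical class, POSIX recv() assumed; a complete version would first drain any characters still held in the get area, as noted above):

    #include <cstddef>
    #include <streambuf>
    #include <sys/socket.h>   // ::recv, POSIX

    class socketbuf : public std::streambuf {
        int fd_;
    public:
        explicit socketbuf(int fd) : fd_(fd) {}
    protected:
        // With "no buffering" semantics (pubsetbuf(0, 0)), sgetn()
        // maps straight onto recv(): one call for the length prefix,
        // one for the payload.
        virtual std::streamsize xsgetn(char* s, std::streamsize n)
        {
            ssize_t r = ::recv(fd_, s, static_cast<std::size_t>(n), 0);
            return r > 0 ? static_cast<std::streamsize>(r) : 0;
        }
    };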

On Mon, 13 Jun 2005 15:13:19 +0400, <pedro.lamarao@mndfck.org> wrote:
Maxim Yegorushkin wrote:
... And too slow. You have one data copy from kernel into the streambuf, and another one from the streambuf to the message object. The same is for output: message -> streambuf -> socket. This is unacceptable, at least for the kind of software I build.
For example, I had a router which used a TCP binary protocol. Each message is prefixed with 4 byte length, so I read the 4 bytes, resized a std::vector<char> for the message length, and then read the message into the vector. After profiling the router under heavy load it turned out that 30% of user time was spent in guess where? In zeroing out memory in std::vector<char>::resize(). And you are talking about data copying here...
Considering that the protocol of your application has built-in methods for announcing the length of the payload, your requirement is met by the streambuf::sgetn(char_type*, streamsize) method, for a properly specialized implementation of the virtual protected xsgetn method.
A streambuf manipulated solely by sgetn and sputn won't ever fill its internal buffers.
By encapsulating your protocol message inside a protocol_message object with proper operator<< and operator>>, you can receive and send such objects by directly manipulating the streambuf associated with a stream through stream::rdbuf().
[...]
So you get operator semantics for free. :-) And perhaps even a putback area, if there's one provided by this particular streambuf implementation.
Sounds interesting, but I don't see how this can work well with nonblocking sockets. You have to store how many bytes have already been read/sent somewhere.
I'll take note of this use case, though. As soon as the documentation is finished for my latest proposal, I'll try and do some benchmarking. How large are your payloads, usually?
~8k. -- Maxim Yegorushkin

----- Original Message ----- From: "Maxim Yegorushkin" <e-maxim@yandex.ru> To: <boost@lists.boost.org> Sent: Monday, June 13, 2005 11:59 PM Subject: Re: [boost] [Ann] socketstream library 0.7 [snip]
... And too slow. You have one data copy from kernel into the streambuf, and another one from the streambuf to the message object. The same goes for output: message -> streambuf -> socket. This is unacceptable, at least for [snip] 30% of user time was spent, guess where? In zeroing out memory in std::vector<char>::resize(). And you are talking about data copying here...
Considering the protocol of your application has built-in methods for announcing the length of the payload, your requirement is met by the streambuf::sgetn(char_type*, streamsize) method, given a properly specialized implementation of the virtual protected xsgetn method. [snip]
So you get operator semantics for free. :-) And perhaps even a putback area, if there's one provided by this particular streambuf implementation.
Sounds interesting, but I don't see how this can work well with nonblocking sockets. You have to store how many bytes have already been read/sent somewhere. [snip]
There is really interesting material here. There is also other stuff that I feel obliged to comment on :-)

1. Some of the contortions suggested to effectively read messages off an iostream socket are not specific to the fundamental network/socket goals, i.e. they are just difficulties associated with iostream-ing.

2. Some of those same contortions are in response to the different media (not sure if that's the best term) that the stream is running over, i.e. a network transport. This is more ammunition for anyone trying to shoot the sync-iostream-model-over-sockets down. Or at least suggest that the model is a constraint on those writing iostream-based network apps.

3. Lastly, some of the observations (while excellent) seem a bit "macro" when a wider view might lead to different thinking. What I am trying to suggest here is that the time spent in vector<>::resize is truly surprising, but it's also very low-level. Having been through 4 recent refactorings of a network framework, I have been surprised at the gains made in other areas by conceding, say, byte-level processing in one area.

To make more of a case around the last point, consider the packetizing, parsing and copying that's recently been discussed. This has been related to the successful recognition of a message on the stream. Is it acknowledged that a message is an arbitrarily complex data object? By the time an application is making use of the "fields" within a message, that's probably a reasonable assumption. So at some point these fields must be "broken out" of the message. Or parsed. Or marshalled. Or serialized. Is the low-level packet (with the length header and body) being scanned again? What copying is being done? This seems like multi-pass to me.

To get to the point: I am currently reading blocks off network connections and presenting them to byte-by-byte lexer/parser routines. These form the structured network messages directly, i.e. fields are already plucked out. So which is better? Direct byte-by-byte conversion to structured network message or multi-pass? Cheers.
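A toy sketch of that block-in, message-out shape (the CRLF framing is illustrative; Scott's actual lexer/parser grammar is not shown in the thread): raw blocks go in as they arrive, complete messages come out through a callback, and partial input simply waits in the scanner's state until the next block.

//----------
#include <cstddef>
#include <string>

// Push-style scanner: feed() accepts raw blocks of any size and invokes
// on_message once per complete CRLF- or LF-terminated message.
class line_scanner {
public:
    template <typename Handler>
    void feed(char const* block, std::size_t len, Handler& on_message)
    {
        for (std::size_t i = 0; i != len; ++i) {
            if (block[i] == '\n') {
                if (!_M_partial.empty() && _M_partial[_M_partial.size() - 1] == '\r')
                    _M_partial.erase(_M_partial.size() - 1);
                on_message(_M_partial); // "accept": message is complete
                _M_partial.clear();
            } else {
                _M_partial += block[i];
            }
        }
    }
private:
    std::string _M_partial; // partial message retained between blocks
};
//----------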

On Tue, 14 Jun 2005 08:00:26 +0400, Scott Woods <scottw@qbik.com> wrote: []
3. Lastly, some of the observations (while excellent) seem a bit "macro" when a wider view might lead to different thinking. What I am trying to suggest here is that the time spent in vector<>::resize is truly surprising
So was it for me.
but it's also very low-level. Having been through 4 recent refactorings of a network framework, I have been surprised at the gains made in other areas by conceding, say, byte-level processing in one area.
I'm currently working on a network framework. Three major performance improvements over several iterations were: a) dropping textuality; b) dropping a C++ glue layer which was built over libevent, so I'm now using libevent directly - this _is_ the framework for me; c) dropping std::vector as the message buffer.
To make more of a case around the last point, consider the packetizing, parsing and copying that's recently been discussed. This has been related to the successful recognition of a message on the stream.
Is it acknowledged that a message is an arbitrarily complex data object?
It is.
By the time an application is making use of the "fields" within a message, that's probably a reasonable assumption. So at some point these fields must be "broken out" of the message.
A point to note here is that there may be checkpoints on a message path where a message must be read in order to be forwarded. At such points one wants to avoid whole-message parsing.
Or parsed. Or marshalled. Or serialized. Is the low-level packet (with the length header and body) being scanned again? What copying is being done? This seems like multi-pass to me.
To get to the point; I am currently reading blocks off network connections and presenting them to byte-by-byte lexer/parser routines. These form the structured network messages directly, i.e. fields are already plucked out.
So which is better? Direct byte-by-byte conversion to structured network message or multi-pass?
I'm not sure I understand "byte-by-byte conversion" and "multi-pass". What I did was break a message into two parts: header and body. The header contains the message type and an asynchronous completion token stack. The body contains application protocol specific data. A message is read into a chunk of memory (which was that vector<char>) and only the header part is parsed. When a message is forwarded, only the header part is rebuilt; the body gets forwarded without any user-space copying. Only at the final destination does an application parse the message body. -- Maxim Yegorushkin
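A sketch of that checkpoint pattern, with a hypothetical two-field header (the real header layout is not given above): parse and rebuild only the fixed-size header in place, then send the body out of the same buffer, so its bytes are never copied in user space.

//----------
#include <cstddef>
#include <cstring>
#include <stdint.h>
#include <netinet/in.h> // htonl, ntohl
#include <sys/socket.h> // send

struct header {
    uint32_t type; // message type, network byte order on the wire
    uint32_t hops; // illustrative field rewritten at each checkpoint
};

void forward(int out_fd, char* buf, std::size_t len)
{
    header h;
    std::memcpy(&h, buf, sizeof h);    // parse the header part only
    h.hops = htonl(ntohl(h.hops) + 1); // "rebuild" the header
    std::memcpy(buf, &h, sizeof h);
    ::send(out_fd, buf, len, 0);       // body bytes are never touched
}
//----------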

Scott Woods wrote:
3. Lastly, some of the observations (while excellent) seem a bit "macro" when a wider view might lead to different thinking. What I am trying to suggest here is that the time spent in vector<>::resize is truly surprising, but it's also very low-level. Having been through 4 recent refactorings of a network framework, I have been surprised at the gains made in other areas by conceding, say, byte-level processing in one area.
I understand that these difficulties are orthogonal to any IOStreams issues, in the sense that everyone obtaining data from an unknown source must deal with them. This "buffering problem" is the problem that leads people to design protocols with fixed sizes everywhere.
To make more of a case around the last point, consider the packetizing, parsing and copying thats recently been discussed. This has been related to the successful recognition of a message on the stream.
Is it acknowledged that a message is an arbitrarily complex data object? By the time an application is making use of the "fields" within a message, that's probably a reasonable assumption. So at some point these fields must be "broken out" of the message. Or parsed. Or marshalled. Or serialized. Is the low-level packet (with the length header and body) being scanned again? What copying is being done? This seems like multi-pass to me.
To get to the point; I am currently reading blocks off network connections and presenting them to byte-by-byte lexer/parser routines. These form the structured network messages directly, i.e. fields are already plucked out.
So which is better? Direct byte-by-byte conversion to structured network message or multi-pass?
If I understood you correctly, I might rephrase that to myself like: do we read the whole message before parsing, or are we parsing directly from the data source?

If we parse directly from the data source, we must analyze byte by byte, and so obtain byte by byte. If we want this, we will want a buffering layer to keep the number of system calls to a reasonable level. streambufs provide such a buffering layer, with IO operations proper for lexical analysis at that level: sgetc, snextc, sbumpc. If you remember streambuf_iterators exist, and imagine a multi_pass iterator (hint, hint), many other interesting things come to mind.

If we read the message completely beforehand, we must know how much we have to read, or we must inspect the data source in some way to watch for "end of message". If we have control over the protocol design, we can make it "fixed size". Making it "fixed size" would mean prefixing a "payload" with size information. If that size information is also fixed, then we're set. If not, we'll have to parse at least this prefix on the fly. I've seen at least one protocol naive enough to throw an int as the "prefix" directly into the data sink. Luckily, every machine involved was an Intel x86 running some Windows server. If we don't have control over the protocol design, we can apply another layer, encapsulating the protocol message inside a "control message" providing the fixed sizes we need. To this control message the previous considerations would apply. This has been suggested here many times. So, after reading every byte we need, we'd start parsing over the sequence in memory, instead of the "sequence" from the data source. streambufs provide for this, with the sgetn operation, and even the possibility of shutting the buffering down.

At this point, we have read the same number of bytes from the data source, in whatever order. But the number of calls made to the IO system service is not the same, and the fixed size approach is more efficient in this regard. Also, the fixed size approach solves the "buffering problem" since we make no resizings along the way. C++ people, blessed with std::vector, already have a mechanismo to do away with such weirdness; think about how you do it in C.

But such a design suffers elsewhere. Let me argue a little against it.

First. We, unfortunately, can't pass a std::vector to the operating system, so, at some point, we are allocating fixed sized buffers and passing them to our IO primitives. There is no escape. If you are initializing std::vector with the correct size and giving &*begin() to these primitives, well... Why not allocate with new? If you are allocating it with whatever default size and resizing it later, you are losing part of the proposed benefit.

When we're about to throw a message to the network, how do we know what size it is? If our message is composed of, say, a string, another string and an int, are we going to call string::size() twice for every message dumped? Is the int representation fixed in size? Is this size enough for MAX_INT?

HTTP is like that; the size of the resource being sent to the client is present in a header field. If you think that is easy because HTTP servers can get that from the filesystem, get acquainted with server side scripting, and think again. HTTP headers, on the other hand, must be parsed at least a little, to look for "end of header". SMTP clients try to hint at the size of the data being sent, but that is not authoritative. There are also "end of data" marks in SMTP.
Take a more complicated protocol like the Jabber protocol, whose messages are effectively XML nodes. Are we going to traverse the tree to find out the payload size? If we have an XML processor, why not apply it directly to the network? Check out that protocol to see how powerful the message format is before complaining about "weird protocols that only bring more trouble". We don't need to go that far, of course. Mail messages today are already trees of MIME parts. SMTP makes no guarantee that the SIZE hint must be respected. SIZE hints may not even be present. What will the server do? I've seen an SMTP proxy service, whose job was to transform a mail message on the fly before it hit the SMTP server, suffer badly from this. That proxy won't be sending any SIZE hints.

My point is, writing a generic networking library for generic client code dealing with generic protocols must provide for every kind of model. Impose a new layer of headers, and you're not generic anymore. Force everyone over a buffer, and you're not generic anymore. Put everything over asynchronicities and you're not generic anymore. (Please, think of us regular, blocking, thread-loving people, I beg you!)

And this is all about getting and putting stuff to the network, without considering whatever character coding conversions must be done from one place to another, which could perfectly well increase or decrease the size in bytes of a certain representation. Also, think of a system with an ISO-8859-1 default locale; what do you do when you want to get a web page from a server providing UTF-8 pages? Good luck with that.

Those dealing exclusively with in-house designed protocols live in heaven, safe from this kind of thing. If you are on the Internet, you have very few guarantees. It's hell out here, sir. -- Pedro Lamarão
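On the naive int prefix: the usual fix is a fixed-width count in network byte order, so peers of different endianness agree on it. A minimal sketch (the frame helper is illustrative):

//----------
#include <stdint.h>
#include <string>
#include <netinet/in.h> // htonl

// Prefix the payload with its size as a 4-byte big-endian integer.
std::string frame(std::string const& payload)
{
    uint32_t n = htonl(static_cast<uint32_t>(payload.size()));
    std::string message(reinterpret_cast<char const*>(&n), sizeof n);
    message += payload;
    return message;
}
//----------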

Hi Pedro, Apologies for any sloppy formatting; mail client woes. ----- Original Message ----- From: <pedro.lamarao@mndfck.org> To: <boost@lists.boost.org> Sent: Wednesday, June 15, 2005 12:08 AM Subject: Re: [boost] [Ann] socketstream library 0.7
This "buffering problem" is the problem that leads people to design protocols with fixed sizes everywhere.
Yes - exactly. A header (incl. length) and body is a perfectly functional response to a need. Is it the best we can do?
To get to the point; I am currently reading blocks off network connections and presenting them to byte-by-byte lexer/parser routines. These form the structured network messages directly, i.e. fields are already plucked out.
So which is better? Direct byte-by-byte conversion to structured network message or multi-pass?
If I understood you correctly, I might rephrase that to myself like: do we read the whole message before parsing, or are we parsing directly from the data source?
Yes. That's a reasonable paraphrasing.
If we parse directly from the data source, we must analyze byte by byte, and so obtain byte by byte. If we want this, we will want a buffering layer to keep the number of system calls to a reasonable level.
streambufs provide such a buffering layer, with IO operations proper for lexical analysis at that level: sgetc, snextc, sbumpc.
Yes, that's true. As well as many points you make about streambufs (didn't realize they were quite that flexible).
If you remember streambuf_iterators exist, and imagine a multi_pass iterator (hint, hint), many other interesting things come to mind.
If we read the message completely beforehand, we must know how much we have to read, or we must inspect the data source in some way to watch for "end of message".
[snip]
At this point, we have read the same number of bytes from the data source, in whatever order. But the number of calls made to the IO system service is not the same, and the fixed size approach is more efficient in this regard.
Also, the fixed size approach solves the "buffering problem" since we make no resizings along the way. C++ people, blessed with std::vector, already have a mechanismo to do away with such weirdness; think about how you do it in C.
Sorry, but there is such a gulf between our approaches that I'm not sure I can say anything to help clarify. As a last response, the best I can do is say that:

1. The difference (in terms of CPU time) between maintaining a counter and inspecting a "current byte" and testing it for "end of message" seems minimal. This is stated relatively, i.e. it is far more significant that the bytes sent across the network are being scanned at the receiver more than once. Even maintaining the body counter is a (very low cost...) scan.

2. An approach using lex+parse techniques accepts raw byte blocks as input (convenient) and notifies the user, through some kind of accept/reduce return code, that the message is complete and already "broken apart", i.e. no further scanning is required by higher layers.

3. Lex+parse techniques do not care about block lengths. An accept state or parser reduction can occur anywhere. All the "unget" contortions recently mentioned are not needed. Partial messages are retained in the parser stack and only finally assembled on accept/reduce. This property is something much easier to live with than any kind of "fixed-size" approach that I have dealt with so far.
First. We, unfortunately, can't pass a std::vector to the operating system, so, at some point, we are allocating fixed sized buffers and passing them to our IO primitives. There is no escape.
Errrrr. Not quite following that. Are you saying that send( socket_descriptor, &vector_buffer[ 0 ], vector_buffer.size() ) is bad?
If you are initializing std::vector with the correct size and giving &*begin() to these primitives, well... Why not allocate with new? If you are allocating it with whatever default size and resizing it later, you are losing part of the proposed benefit.
Hmmm. If you are saying this to strengthen your case for streambufs then I understand.
When we're about to throw a message to the network, how do we know what size it is? If our message is composed of, say, a string, another string and an int, are we going to call string::size() twice for every message dumped? Is the int representation fixed in size? Is this size enough for MAX_INT?
[snip large section]
If you are on the Internet, you have very few guarantees. It's hell out here, sir.
Yes, you make some very good points. The product I am currently working on is a vipers' nest of the protocols you talk about, and more. There have been some unpleasant suggested uses for protocols such as IMAP4. Trying to build a generic network messaging library that facilitates clear, concise application protocols *and* can cope with the likes of IMAP4 is, I believe, unrealistic. I didn't know I had a mechanismo until today. Feels great! :-) Cheers, Scott

Scott Woods wrote:
1. The difference (in terms of CPU time) between maintaining a counter and inspecting a "current byte" and testing it for "end of message" seems minimal. This is stated relatively, i.e. it is far more significant that the bytes sent across the network are being scanned at the receiver more than once. Even maintaining the body counter is a (very low cost...) scan. 2. An approach using lex+parse techniques accepts raw byte blocks as input (convenient) and notifies the user, through some kind of accept/reduce return code, that the message is complete and already "broken apart", i.e. no further scanning is required by higher layers. 3. Lex+parse techniques do not care about block lengths. An accept state or parser reduction can occur anywhere. All the "unget" contortions recently mentioned are not needed. Partial messages are retained in the parser stack and only finally assembled on accept/reduce. This property is something much easier to live with than any kind of "fixed-size" approach that I have dealt with so far.
This is the kind of application of a network library I'm most intrigued by. I've experimented with an approximation of this approach by modifying a sinister buffering scheme in a C# application into apparently inefficient calls to the equivalents of send and receive, getting only one byte at a time, to implement a simple lexer; I expected terrible losses but experienced very little loss. Later, reapplying a buffering layer at only two particular points made the difference very difficult to measure.
First. We, unfortunately, can't pass std::vector to the operating system, so, at some point, we are allocating fixed sized buffers, and passing it to our IO primitives. There is no escape.
Errrrr. Not quite following that. Are you saying that
send( socket_descriptor, &vector_buffer[ 0 ], vector_buffer.size() )
is bad?
No. What I meant was, the operating system won't resize a std::vector for you. It expects a fixed-size chunk of memory. Because of this, every "dynamically safe" buffering must be a layer over a "fixed size, error-prone" buffering done somewhere. That is a constraint of our primitives. The intention of a streambuf implementation is precisely to conceal such fixed-size buffering, offering the most generic interface to what now becomes a concealed "sequence" (as the documentation I have at hand would call it).
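A minimal sketch of that concealment (the class name and the 4k size are illustrative): the fixed char array is the "error-prone" buffer handed to recv(), while the get-area pointers expose it as a safe character sequence.

//----------
#include <streambuf>
#include <sys/socket.h>

class fixed_netbuf : public std::streambuf {
public:
    explicit fixed_netbuf(int fd) : _M_fd(fd)
    {
        setg(_M_buf, _M_buf, _M_buf); // start with an empty get area
    }
protected:
    virtual int_type underflow()
    {
        if (gptr() < egptr()) // characters still buffered
            return traits_type::to_int_type(*gptr());
        ssize_t r = ::recv(_M_fd, _M_buf, sizeof _M_buf, 0);
        if (r <= 0)
            return traits_type::eof();
        setg(_M_buf, _M_buf, _M_buf + r); // refill the concealed sequence
        return traits_type::to_int_type(*gptr());
    }
private:
    int _M_fd;
    char _M_buf[4096]; // the fixed-size buffer under the abstraction
};
//----------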
Yes, you make some very good points. The product I am currently working on is a vipers' nest of the protocols you talk about, and more. There have been some unpleasant suggested uses for protocols such as IMAP4. Trying to build a generic network messaging library that facilitates clear, concise application protocols *and* can cope with the likes of IMAP4 is, I believe, unrealistic.
The skeleton of a "protocol message" as I've been working on it is more or less:

//----------
class protocol_message {
public:
    void clear () { /* Clear data members. */ }

    template <typename IteratorT>
    boost::spirit::parse_info<IteratorT> parse (IteratorT begin, IteratorT end); // Defined later.
};

// However a message is represented on the net...
template <typename CharT, typename TraitsT>
std::basic_ostream<CharT, TraitsT>&
operator<< (std::basic_ostream<CharT, TraitsT>& o, protocol_message const& m);

template <typename CharT, typename TraitsT>
std::basic_istream<CharT, TraitsT>&
operator>> (std::basic_istream<CharT, TraitsT>& i, protocol_message& m)
{
    using namespace boost::spirit; // Here we use the Magic Glue
    typedef multi_pass<std::istreambuf_iterator<char> > iterator_t;
    iterator_t begin = make_multi_pass(std::istreambuf_iterator<char>(i));
    iterator_t end = make_multi_pass(std::istreambuf_iterator<char>());
    parse_info<iterator_t> info = m.parse(begin, end);
    if (!info.hit)
        i.setstate(std::ios_base::failbit);
    return i;
}

namespace detail {

class grammar : public boost::spirit::grammar<grammar> {
public:
    grammar (protocol_message& m) : _M_m(m) {}

    template <typename ScannerT> class definition;

    // We'll write into _M_m even though Spirit hands the grammar around
    // by const reference; a reference member still lets us modify the referent.
private:
    protocol_message& _M_m;
};

}

template <typename IteratorT>
boost::spirit::parse_info<IteratorT>
protocol_message::parse (IteratorT begin, IteratorT end)
{
    using namespace boost::spirit;
    this->clear();
    detail::grammar g(*this);
    return boost::spirit::parse(begin, end, g);
}
//----------

Note how operator>> sets failbit in case of an unsuccessful parse: it allows us to write:

iostream stream;
protocol_message message;
while (stream >> message) {
    // Work.
}
// Parsing failed or other error; try to recover?

No exception is thrown. But an exception could be thrown; an iostream can be configured to do that, and throw an ios_base::failure.

The current implementation of the irc_client example distributed in the package I uploaded to the Sandbox is at this URI: https://mndfck.org/svn/socketstream/branches/boost/libs/network/example/irc_...

This version has a Spirit grammar for a (modified) version of the IRC grammar as defined in RFC 2812. It's still rough around the edges, but much better than it used to be. IRC is a very uninteresting application, but it's an interesting protocol to experiment with, as there is no guarantee of when a message is coming, or from where. "Synchronized" protocols like SMTP are much easier; the client sends, the server responds, and that's pretty much it.

I'm very interested in these kinds of applications of a "netbuf" and the implementation of reusable "protocol message" classes for common protocols; I'm probably going after HTTP next, and will try to write a simplified wget. There was also a concern earlier in this thread about excessive buffering in streambufs with "fixed-sized message" protocols that I'd like to address with an example.

-- Pedro Lamarão
Desenvolvimento Intersix Technologies S.A.
SP: (55 11 3803-9300) RJ: (55 21 3852-3240)
www.intersix.com.br
Your Security is our Business

Hi Pedro, Still trying to get my Outlook to indent (>) and failing. I have all the proper options set but they are ignored. Go figure. Re-install time. I've inserted my comments with * ----- Original Message ----- From: "Pedro Lamarão" <pedro.lamarao@intersix.com.br>
3. Lex+parse techniques do not care about block lengths. An accept state or parser reduction can occur anywhere. All the "unget" contortions recently mentioned are not needed. Partial messages are retained in the parser stack and only finally assembled on accept/reduce. This property is something much easier to live with than any kind of "fixed-size" approach that I have dealt with so far.
This is the kind of application of a network library I'm most intrigued by. I've experimented with an approximation of this approach by modifying a sinister buffering scheme in a C# application into apparently inefficient calls to the equivalents of send and receive, getting only one byte at a time, to implement a simple lexer; I expected terrible losses but experienced very little loss. Later, reapplying a buffering layer at only two particular points made the difference very difficult to measure.

* Ah yes. Don't know about sinister buffering or C# but think
* I follow enough from context. And your observations are consistent
* with what I have seen.

[snip code]

iostream stream;
protocol_message message;
while (stream >> message) {
    // Work.
}

* Very nice.

No exception is thrown. But an exception could be thrown; an iostream can be configured to do that, and throw an ios_base::failure. The current implementation of the irc_client example distributed in the package I uploaded to the Sandbox is at this URI: https://mndfck.org/svn/socketstream/branches/boost/libs/network/example/irc_client/message.hpp

* I did try to decompress your package with Windows utilities. These failed
* with messages about "not bzip2"; can you indicate a specific utility?

This version has a Spirit grammar for a (modified) version of the IRC grammar as defined in RFC 2812. It's still rough around the edges, but much better than it used to be. IRC is a very uninteresting application, but it's an interesting protocol to experiment with, as there is no guarantee of when a message is coming, or from where. "Synchronized" protocols like SMTP are much easier; the client sends, the server responds, and that's pretty much it. I'm very interested in these kinds of applications of a "netbuf" and the implementation of reusable "protocol message" classes for common protocols; I'm probably going after HTTP next, and will try to write a simplified wget. There was also a concern earlier in this thread about excessive buffering in streambufs with "fixed-sized message" protocols that I'd like to address with an example.

**************************************

Nice use of boost. Did you mention this in the "who's using boost" thread? From your code snippets I can see the layering of activity and how ultimately it is flexible enough to cope with the likes of IRC and (possibly :-) IMAP4. My concern about multi-pass is probably superseded by that exact ability to cope with ugly protocols (in some cases the ugliness is more correctly described as part of the encoding). In previous threads addressing similar issues the suggestion was to use an "envelope" approach; that delivered the same benefits as your low-level header+body. It is a little bit tragic to concede this point, as I have invested quite heavily in a technology that parses straight from the network block to a variant. The variant is capable of holding a vector of variants as a "value" (yes, a recursive definition). Operator>> is overloaded in such a way that you can code in this manner:

struct routable_message {
    unsigned long to_address;
    unsigned long from_address;
    net_variant payload;
};

routable_message & operator>>( net_variant &v, routable_message &m )
{
    vector<net_variant> &a = net_array<3>( v ); // Access the expected tuple
    a[ 0 ] >> m.to_address;
    a[ 1 ] >> m.from_address;
    a[ 2 ] >> m.payload;
    return m;
}

At the point where a variant is completed (e.g. part way through a network block), it is presented to a receiver, e.g.
void message_router::operator()( net_variant &v )
{
    routable_message m;
    operator()( v >> m );
}

void message_router::operator()( routable_message &m )
{
    iterator f = find( m.to_address );
    if ( f == end() )
        return;
    ( *f->second )( m.payload, m.from_address );
}

Hopefully this is enough to show how elegant the code becomes even when dealing with multiple layers of software, i.e. the message_router has no idea what type conversions are performed by the receiver of the payload. All operator>> implementations are required to use "move" semantics, so any data "new'd" by the variant parser is exactly the data that is finally moved into the application type.

To summarize: I have been resisting the header+body (or "envelope") technique, but it would appear to be more extensible. The separation of "message completion" and "content parsing" allows for more protocol-specific handling than I can do, as my "parser" runs over the entire message. Again, the protocol-specifics that I allude to are often better described as encoding-specific, as most of the TCP application suite binds an encoding inextricably to each protocol. Dealing with continuations and embedded objects (different encoder states) may still exhaust the extensibility of the envelope approach. There is nothing in the IMAP4 protocol that cannot be represented within something such as my net_variant, i.e. it does not need a protocol-specific encoding. The same goes for SMTP, HTTP, .... How much simpler it could have been! gracias, Scott

Scott Woods wrote:
* I did try to decompress your package with Windows utilities. These failed * with messages about "not bzip2"; can you indicate a specific utility?
I compressed that with bzip2 from a Linux console. I'll upload an updated version in .tar.bzip2 and .zip formats. I'm working on the documentation to try and make my proposals clearer. -- Pedro Lamarão

On Sat, 11 Jun 2005 02:10:49 +0400, Aaron W. LaFramboise <aaronrabiddog51@aaronwl.com> wrote:
Simon Richter wrote:
[]
In other words, I'd suggest we forget about iostreams. Design a good socket based on past experience that handles diverse networking situations elegantly, and let some other library like Boost.Iostreams map that to a streambuf, should <iostream> support be needed.
I agree whole-heartedly. Sockets are naturally binary, and confining them in a clumsy text-oriented IOStreams interface is IMHO rather a bad idea. Also, one cannot just wrap a TCP socket in a std::iostream, feed it to code writing into a std::ostream, and expect good performance. The reason is TCP mechanics: you pretty much always need corking (TCP_CORK), so you'll end up either having to output an intermediate IO manipulator before the actual output, so that it corks up the socket upon construction and uncorks it upon destruction, or having to use a corked socket by default and uncork it in ostream::flush. Say goodbye to genericity... -- Maxim Yegorushkin
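For reference, the corking idea reduces, on Linux, to a pair of setsockopt() calls; a scope-guard sketch (Linux-specific, and the class name is illustrative):

//----------
#include <netinet/in.h>  // IPPROTO_TCP
#include <netinet/tcp.h> // TCP_CORK (Linux-specific)
#include <sys/socket.h>  // setsockopt

// Cork the socket for the lifetime of the guard; uncorking on
// destruction flushes any queued partial frames.
class cork_guard {
public:
    explicit cork_guard(int fd) : _M_fd(fd)
    {
        int on = 1;
        ::setsockopt(_M_fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
    }
    ~cork_guard()
    {
        int off = 0;
        ::setsockopt(_M_fd, IPPROTO_TCP, TCP_CORK, &off, sizeof off);
    }
private:
    int _M_fd;
};
//----------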

Simon Richter wrote:
Yes, that is the plan (name resolution is not implemented yet). A socket stream can be constructed with an optional reference to a "manager" class (which is the select() wrapper, basically), which holds a set of resolvers (similar to locale facets). If a manager object is given, it is asked for a resolver, which is given the query and a method to call on completion, then control is transferred back to the caller. If no manager object is given, a resolver is instantiated, the name resolved and the resolver destroyed (=> you need a manager for nonblocking I/O).
The resolver is usually called by the manager when the application enters its idle loop, so applications need not wait for connections to get established when they use a manager object.
But how do you actually implement asynchronous name resolution? Are you using asynchronous primitives from the operating system, or are you holding a worker thread?
But the need for the existence of a "socket" class is questionable; the IO primitive in the C++ standard library is the streambuf; ugly or not, it provides a complete "client" for a networking library.
My streambufs hold a pointer to an abstract "impl" class which represents the actual socket. Concrete subclasses of that exist for different socket types, so new address families can be added at runtime (currently, only through my plugin interface, which I wrote about to the list a few days ago).
So each "implementation type" is strongly connected to a value for the family parameter? Something like this? template <typename FamilyT> class socket; Such a definition would make a "socketbuf" subclass of streambuf have one extra template parameter; and we would still have to specify at least the value for "style" to select a transport protocol.
That auto_ptr is there because stream objects are not copyable, and the purpose of that "listener" is to internalize the "accept" primitive and the stream constructor.
They are not copy-constructible, which is different from not copyable :-). A new streambuf can be attached to an existing stream, as long as care is taken to properly destroy the streambuf on stream close.
streambufs are neither copy-constructible nor assignable, so they are effectively non-copyable. You don't copy a streambuf while "personalizing" an iostream class; you pass a pointer to a streambuf. That's not copying. :)
I'd be happier to return an "rvalue reference" there, if we already had such a thing.
Hrm, they could be made copy-constructible, I think, the benefit of that would however be questionable.
If you're referring to the stream stuff, you'd have to change the standard to make anything copyable. If you're referring to the "implementation type" for the socket, that would be unwise, as it would break RAII.
In the example/ directory of the socketstream project, check out the asynch_resolver.h file; there's an asynchronous version of the resolver class there implemented using boost::thread.
I avoid spawning new threads from library functions like the plague, as it is difficult to properly collect those threads in case of an exception etc.
Well, then you would require asynchronous primitives from the operating system... There *are* such primitives, of course.
I suspect there's little hope of doing anything other than keeping a std::vector or std::list of whatever networking object we're holding, and creating the proper structure for select() or poll() when calling the blocker method. That might make the blocker method O(n) in the number of networking objects...
I think the manager object should have a list of pointers to the streams it monitors and have a callback registered with them so the stream is scheduled for removal on stream destruction (we cannot unregister the stream right away as there might be iterators in the list of pointers that we'd invalidate).
Hum... I was referring to access to the poller interface. Take select, the simplest, for instance. The argument for select is a fd_set of descriptors, managed by macros that take descriptors as arguments. But we won't be holding a fd_set of descriptors, but a standard container of references to stream objects. So this might be a selector class:

class selector : public std::vector<iostream*> {
public:
    void wait (timeval* timeout)
    {
        fd_set fds;
        FD_ZERO(&fds); // the set must be cleared before use
        int maxfd = -1;
        for (iterator i = this->begin(); i != this->end(); ++i) {
            FD_SET((*i)->handle(), &fds); // handle(): hypothetical descriptor accessor
            if ((*i)->handle() > maxfd)
                maxfd = (*i)->handle();
        }
        ::select(maxfd + 1, &fds, 0, 0, timeout);
    }
};

Of course, the prototype for select is much more complicated; it would be wiser to use poll, but then I'm unsure if poll is available everywhere.

-- Pedro Lamarão
Desenvolvimento Intersix Technologies S.A.
SP: (55 11 3803-9300) RJ: (55 21 3852-3240)
www.intersix.com.br
Your Security is our Business
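For comparison, the poll() variant avoids both the FD_SETSIZE ceiling and the need to rebuild a fd_set from macros; a sketch along the same lines (handle() is the same assumed accessor as above):

//----------
#include <vector>
#include <poll.h>

template <typename StreamSequence>
int wait_readable(StreamSequence const& streams, int timeout_ms)
{
    std::vector<pollfd> fds;
    for (typename StreamSequence::const_iterator i = streams.begin();
         i != streams.end(); ++i) {
        pollfd p;
        p.fd = (*i)->handle(); // assumed accessor on the stream type
        p.events = POLLIN;
        p.revents = 0;
        fds.push_back(p);
    }
    if (fds.empty())
        return 0;
    return ::poll(&fds[0], fds.size(), timeout_ms);
}
//----------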

Hi, Pedro Lamarão wrote:
But how do you actually implement asynchronous name resolution? Are you using asynchronous primitives from the operating system, or are you holding a worker thread?
By hand. When the resolver is called by the manager, it will take a quick glance at the hosts file, then assemble a DNS packet with the questions it could not answer and send that off. When the answer returns, all the people that asked are given their answers, which they will probably use to call connect().
My streambufs hold a pointer to an abstract "impl" class which represents the actual socket. Concrete subclasses of that exist for different socket types, so new address families can be added at runtime (currently, only through my plugin interface, which I wrote about to the list a few days ago).
So each "implementation type" is strongly connected to a value for the family parameter?
Yes.
template <typename FamilyT> class socket;
Nope. The family argument is not a template argument, as this allows addition of new families without the need to recompile applications.
Such a definition would make a "socketbuf" subclass of streambuf have one extra template parameter; and we would still have to specify at least the value for "style" to select a transport protocol.
stream vs datagram is a choice of different classes in the application, as there is no representation for "boundaries" in iostreams.
streambufs are neither copy-constructible nor assignable, so they are effectively non-copyable.
You can copy the members by hand and register a callback that will handle further copies and destruction.
I avoid spawning new threads from library functions like the plague, as it is difficult to properly collect those threads in case of an exception etc.
Well, then you would require asynchronous primitives from the operating system... There *are* such primitives, of course.
Yes, or doing it by hand and circumventing the system administrator's choice of resolution order. I think that is a big issue that should be dealt with on a OS-by-OS basis, with doing it by hand being the fallback and a separate thread, possibly monitored by the manager object, being the choice for OSes that support threads.
But we won't be holding a fd_set of descriptors, but a standard container of references to stream objects.
"fd_set"s are cacheable. We can update our cached objects when the state of a stream changes, therefore reducing the complexity to O(1). :-)
Of course, the prototype for select is much more complicated; it would be wiser to use poll, but then I'm unsure if poll is available everywhere.
No problem to support both. :-) Simon
participants (12)
- Aaron W. LaFramboise
- Alexander Nasonov
- Boris
- Caleb Epstein
- Jeff Garland
- Maxim Yegorushkin
- Pedro Lamarão
- Pedro Lamarão
- pedro.lamarao@mndfck.org
- Scott Woods
- Simon Richter
- Stefan Seefeld