
Hi Pedro,

Apologies for any sloppy formatting; mail client woes.

----- Original Message -----
From: <pedro.lamarao@mndfck.org>
To: <boost@lists.boost.org>
Sent: Wednesday, June 15, 2005 12:08 AM
Subject: Re: [boost] [Ann] socketstream library 0.7
This "buffering problem" is the problem that leads people to design protocols with fixed sizes everywhere.
Yes - exactly. Header (incl. length) plus body is a perfectly functional response to a need. But is it the best we can do?
To get to the point: I am currently reading blocks off network connections and presenting them to byte-by-byte lexer/parser routines. These form the structured network messages directly, i.e. the fields are already plucked out.
So which is better? Direct byte-by-byte network-to-message conversion, or multi-pass?
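(In case it helps to see the shape of it, here is a minimal sketch of the block-fed, byte-by-byte approach. The message_parser class is hypothetical, with a toy grammar; a real one would be a state machine plucking fields out as the bytes go by.)

    #include <sys/types.h>
    #include <sys/socket.h>

    class message_parser
    {
    public:
        // Toy rule for the sketch: a message ends at '\n'. A real
        // lexer/parser would recognize fields along the way.
        bool consume(unsigned char byte) { return byte == '\n'; }
    };

    void pump(int socket_descriptor, message_parser& parser)
    {
        unsigned char block[4096];
        for (;;)
        {
            ssize_t n = recv(socket_descriptor, block, sizeof block, 0);
            if (n <= 0)
                break;                  // error or connection closed
            for (ssize_t i = 0; i < n; ++i)
                if (parser.consume(block[i]))
                {
                    // complete, structured message available here;
                    // fields were already plucked out along the way
                }
        }
    }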
> If I understood you correctly, I might rephrase that to myself as: do we read the whole message before parsing, or are we parsing directly from the data source?
Yes. That's a reasonable paraphrasing.
> If we parse directly from the data source, we must analyze byte by byte, and so obtain byte by byte. If we want this, we will want a buffering layer to keep the number of system calls at a reasonable level.
> streambufs provide such a buffering layer, with IO operations suited to lexical analysis at that level: sgetc, snextc, sbumpc.
Yes, that's true. As are many of the points you make about streambufs (I didn't realize they were quite that flexible).
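For example, a CRLF-terminated line (the classic text-protocol token) can be scanned straight off any streambuf with just those members. A sketch, assuming nothing beyond std::streambuf itself:

    #include <streambuf>
    #include <string>

    std::string get_line(std::streambuf& sb)
    {
        std::string line;
        int c = sb.sbumpc();                        // read and advance
        while (c != std::char_traits<char>::eof())
        {
            if (c == '\r' && sb.sgetc() == '\n')    // peek, don't consume
            {
                sb.sbumpc();                        // swallow the '\n'
                break;
            }
            line += static_cast<char>(c);
            c = sb.sbumpc();
        }
        return line;
    }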
> If you remember that streambuf_iterators exist, and imagine a multi_pass iterator (hint, hint), many other interesting things come to mind.
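Indeed; the standard already hands us a single-pass iterator over a streambuf, and (if I take the hint correctly) Spirit's multi_pass wraps exactly this kind of iterator when a backtracking parser needs to re-read. A trivial single-pass example:

    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <streambuf>

    // Count the ':' separators in whatever the streambuf yields.
    std::ptrdiff_t count_colons(std::streambuf& sb)
    {
        std::istreambuf_iterator<char> it(&sb);
        std::istreambuf_iterator<char> end;
        return std::count(it, end, ':');
    }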
> If we read the message completely beforehand, we must know how much we have to read, or we must inspect the data source in some way to watch for "end of message".
[snip]
> At this point, we have read the same number of bytes from the data source, in whatever order. But the number of calls made to the IO system service is not the same, and the fixed-size approach is more efficient in this regard.
> Also, the fixed-size approach solves the "buffering problem", since we do no resizing along the way. C++ people, blessed with std::vector, already have a mechanismo to do away with such weirdness; think about how you do it in C.
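(For concreteness, I read the fixed-size scheme you describe as roughly the following. A sketch only, assuming a 4-byte big-endian length prefix.)

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <cstddef>
    #include <vector>

    // recv() may return short counts, so loop until len bytes arrive.
    bool recv_exact(int fd, unsigned char* p, std::size_t len)
    {
        while (len != 0)
        {
            ssize_t n = recv(fd, p, len, 0);
            if (n <= 0)
                return false;
            p   += n;
            len -= n;
        }
        return true;
    }

    bool read_message(int fd, std::vector<unsigned char>& body)
    {
        unsigned char header[4];            // fixed-size length prefix
        if (!recv_exact(fd, header, sizeof header))
            return false;
        std::size_t length =
              (std::size_t(header[0]) << 24) | (std::size_t(header[1]) << 16)
            | (std::size_t(header[2]) << 8)  |  std::size_t(header[3]);
        body.resize(length);                // sized once, no regrowth
        return length == 0 || recv_exact(fd, &body[0], length);
    }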
Sorry, but there is such a gulf between our approaches that I'm not sure I can say anything to help clarify. As a last response, the best I can do is say this:

1. The difference (in terms of CPU time) between maintaining a counter and inspecting a "current byte", testing it for "end of message", seems minimal. This is stated relatively, i.e. it is far more significant that the bytes sent across the network are being scanned at the receiver more than once. Even maintaining the body counter is a (very low cost...) scan.

2. An approach using lex+parse techniques accepts raw byte blocks as input (convenient) and notifies the user, through some kind of accept/reduce return code, that the message is complete and already "broken apart", i.e. no further scanning is required by higher layers (see the sketch after this list).

3. Lex+parse techniques do not care about block lengths. An accept state or parser reduction can occur anywhere. All the "unget" contortions recently mentioned are not needed. Partial messages are retained on the parser stack and only finally assembled on accept/reduce. This property is much easier to live with than any kind of "fixed-size" approach that I have dealt with so far.
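Here is a sketch of the interface shape point 2 describes; every name is made up for illustration, and the "grammar" is a toy (accept at '\n'):

    #include <cstddef>

    struct parse_result
    {
        enum status { need_more, accepted };
        status      state;
        std::size_t consumed;       // bytes of the block actually used
    };

    class push_parser
    {
    public:
        // Feed an arbitrary block; block boundaries never matter, and
        // partial messages simply stay in the parser's state until an
        // accept/reduce finally assembles them.
        parse_result feed(const unsigned char* block, std::size_t length)
        {
            parse_result r;
            for (std::size_t i = 0; i < length; ++i)
                if (block[i] == '\n')           // accept/reduce point
                {
                    r.state = parse_result::accepted;
                    r.consumed = i + 1;
                    return r;
                }
            r.state = parse_result::need_more;
            r.consumed = length;
            return r;
        }
    };

    // Caller's side: hand over raw blocks as they arrive; higher
    // layers never rescan the bytes.
    void on_block(push_parser& p, const unsigned char* data, std::size_t len)
    {
        while (len != 0)
        {
            parse_result r = p.feed(data, len);
            if (r.state == parse_result::accepted)
            {
                // structured message ready here, already "broken apart"
            }
            data += r.consumed;
            len  -= r.consumed;
        }
    }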
> First: we, unfortunately, can't pass a std::vector to the operating system, so at some point we are allocating fixed-size buffers and passing them to our IO primitives. There is no escape.
Errrrr. Not quite following that. Are you saying that send(socket_descriptor, &vector_buffer[0], vector_buffer.size(), 0) is bad?
> If you are initializing the std::vector with the correct size and giving &*begin() to these primitives, well... why not allocate with new? If you are allocating it with whatever default size and resizing it later, you are losing part of the proposed benefit.
Hmmm. If you are saying this to strengthen your case for streambufs then I understand.
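For the record, the call I had in mind is legal: a vector's storage is contiguous, so &v[0] is a valid buffer pointer whenever the vector is non-empty. A sketch (POSIX send(); partial sends ignored for brevity):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <vector>

    void send_buffer(int socket_descriptor, const std::vector<char>& v)
    {
        if (!v.empty())
            send(socket_descriptor, &v[0], v.size(), 0);
    }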
> When we're about to throw a message to the network, how do we know what size it is? If our message is composed of, say, a string, another string and an int, are we going to call string::size() twice for every message dumped? Is the int representation fixed in size? Is this size enough for MAX_INT?
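(Just to make that bookkeeping concrete, the sender's side under length-prefix framing is roughly the following. The layout is invented for illustration: 4-byte big-endian length, NUL-separated strings, int rendered as text.)

    #include <cstddef>
    #include <cstdio>
    #include <string>
    #include <vector>

    void frame(std::vector<char>& out,
               const std::string& a, const std::string& b, long n)
    {
        char digits[32];
        int ilen = std::sprintf(digits, "%ld", n);  // int text: variable size
        std::size_t body = a.size() + 1 + b.size() + 1 + ilen;  // the size() calls
        out.push_back(char((body >> 24) & 0xFF));
        out.push_back(char((body >> 16) & 0xFF));
        out.push_back(char((body >>  8) & 0xFF));
        out.push_back(char( body        & 0xFF));
        out.insert(out.end(), a.begin(), a.end());
        out.push_back('\0');
        out.insert(out.end(), b.begin(), b.end());
        out.push_back('\0');
        out.insert(out.end(), digits, digits + ilen);
    }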
[snip large section]
> If you are on the Internet, you have very few guarantees. It's hell out here, sir.
Yes, you make some very good points. The product I am currently working on is a vipers' nest of the protocols you talk about, and more. There have been some unpleasant suggested uses for protocols such as IMAP4. Trying to build a generic network messaging library that facilitates clear, concise application protocols *and* can cope with the likes of IMAP4 is, I believe, unrealistic.

I didn't know I had a mechanismo until today. Feels great! :-)

Cheers,
Scott