
Scott Woods wrote:
3. Lastly, some of the observations (while excellent) seem a bit "macro" when a wider view might lead to different thinking. What I am trying to suggest here is that the time spent in vector<>::resize is truly surprising, but it's also very low-level. Having been through 4 recent refactorings of a network framework, I have been surprised at the gains made in other areas by conceding, say, byte-level processing in one area.
I understand that these difficulties are orthogonal to any IOStreams issues, in the sense that everyone obtaining data from an unknown source must deal with them. This "buffering problem" is the problem that leads people to design protocols with fixed sizes everywhere.
To make more of a case around the last point, consider the packetizing, parsing and copying that's recently been discussed. This has been related to the successful recognition of a message on the stream.
Is it acknowledged that a message is an arbitrarily complex data object? By the time an application is making use of the "fields" within a message, that's probably a reasonable assumption. So at some point these fields must be "broken out" of the message. Or parsed. Or marshalled. Or serialized. Is the low-level packet (with the length header and body) being scanned again? What copying is being done? This seems like multi-pass to me.
To get to the point; I am currently reading blocks off network connections and presenting them to byte-by-byte lexer/parser routines. These form the structured network messages directly, i.e. fields are already plucked out.
So which is better? Direct byte-by-byte conversion to structured network message or multi-pass?
If I understood you correctly, I might rephrase that to myself as: do we read the whole message before parsing, or do we parse directly from the data source?

If we parse directly from the data source, we must analyze byte by byte, and so obtain bytes one by one. If we want this, we will want a buffering layer to keep the number of system calls at a reasonable level. streambufs provide such a buffering layer, with IO operations suited to lexical analysis at that level: sgetc, snextc, sbumpc. If you remember that streambuf_iterators exist, and imagine a multi_pass iterator (hint, hint), many other interesting things come to mind.

If we read the message completely beforehand, we must know how much we have to read, or we must inspect the data source in some way to watch for "end of message". If we have control over the protocol design, we can make it "fixed size". Making it "fixed size" would mean prefixing a "payload" with size information. If that size information is itself of fixed size, then we're set. If not, we'll have to parse at least this prefix on the fly. I've seen at least one protocol naive enough to throw an int as the "prefix" directly into the data sink. Luckily, every machine involved was an Intel x86 running some Windows server. If we don't have control over the protocol design, we can apply another layer, encapsulating the protocol message inside a "control message" providing the fixed sizes we need. The previous considerations would then apply to this control message. This has been suggested here many times. So, after reading every byte we need, we'd start parsing over the sequence in memory, instead of over the "sequence" from the data source. streambufs provide for this too, with the sgetn operation, and even the possibility of shutting buffering off altogether.

At this point we have read the same number of bytes from the data source, in whatever order. But the number of calls made to the IO system service is not the same, and the fixed size approach is more efficient in this regard. The fixed size approach also solves the "buffering problem", since we do no resizing along the way. C++ people, blessed with std::vector, already have a mechanism to do away with such weirdness; think about how you would do it in C.

But such a design suffers elsewhere. Let me argue a little against it.

First: we unfortunately can't pass a std::vector to the operating system, so at some point we are allocating fixed-size buffers and passing them to our IO primitives. There is no escape. If you are initializing the std::vector with the correct size and giving &*begin() to these primitives, well... why not allocate with new? If you are allocating it with some default size and resizing it later, you are losing part of the proposed benefit.

Second: when we're about to throw a message at the network, how do we know what size it is? If our message is composed of, say, a string, another string and an int, are we going to call string::size() twice for every message dumped? Is the int representation fixed in size? Is this size enough for MAX_INT? HTTP is like that: the size of the resource being sent to the client is present in a header field. If you think that is easy because HTTP servers can get it from the filesystem, get acquainted with server-side scripting and think again. HTTP headers, on the other hand, must be parsed at least a little, to look for "end of header". SMTP clients try to hint at the size of the data being sent, but that hint is not authoritative. There are also "end of data" marks in SMTP.
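To make the two reading strategies concrete, here is a minimal sketch over a std::streambuf. It assumes a hypothetical message format: either a '\n'-terminated line parsed byte by byte, or a 4-byte big-endian length prefix followed by the payload. The names read_line and read_prefixed_message are illustrative only, not taken from any library.

#include <ios>
#include <streambuf>
#include <string>
#include <vector>

// Parsing directly from the data source: byte by byte with sbumpc(), the
// streambuf's own buffering keeping the number of system calls reasonable.
// "End of message" is found by inspecting the bytes as they arrive.
std::string read_line(std::streambuf& sb)
{
    std::string line;
    int c;
    while ((c = sb.sbumpc()) != std::streambuf::traits_type::eof() && c != '\n')
        line.push_back(static_cast<char>(c));
    return line;
}

// Reading the whole message beforehand: the fixed-size prefix is parsed
// first, then the payload arrives in one bulk sgetn() into a buffer sized
// once. The prefix is decoded byte by byte instead of being copied into an
// int, so the result does not depend on the machine's byte order.
std::vector<char> read_prefixed_message(std::streambuf& sb)
{
    unsigned char prefix[4];
    if (sb.sgetn(reinterpret_cast<char*>(prefix), 4) != 4)
        return std::vector<char>();                 // short read, no message
    unsigned long n = (static_cast<unsigned long>(prefix[0]) << 24)
                    | (static_cast<unsigned long>(prefix[1]) << 16)
                    | (static_cast<unsigned long>(prefix[2]) << 8)
                    |  static_cast<unsigned long>(prefix[3]);
    std::vector<char> payload(n);                   // sized once, never resized
                                                    // (a real implementation would
                                                    // sanity-check n before allocating)
    if (n != 0 && sb.sgetn(&payload[0], std::streamsize(n)) != std::streamsize(n))
        return std::vector<char>();                 // short read, no message
    return payload;
}

Both functions talk only to the streambuf, so they work the same over a socket-backed streambuf or over an in-memory one.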
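The sending side shows the bookkeeping the size prefix costs. Again a sketch, not code from any existing library: the 4-byte big-endian prefix and the fixed 4-byte encoding of the int are assumptions, not part of any protocol discussed here.

#include <ostream>
#include <string>

// Write an unsigned value as 4 bytes, most significant first, so the
// representation does not depend on the sending machine's byte order.
void write_uint32(std::ostream& os, unsigned long n)
{
    char p[4];
    p[0] = static_cast<char>((n >> 24) & 0xFF);
    p[1] = static_cast<char>((n >> 16) & 0xFF);
    p[2] = static_cast<char>((n >> 8) & 0xFF);
    p[3] = static_cast<char>(n & 0xFF);
    os.write(p, 4);
}

// A message of two strings and an int: to emit the prefix we must know the
// payload size before the first payload byte goes out, which means two
// string::size() calls per message plus a fixed-width (4-byte) form of the
// int, whatever its value. (A real protocol would also need per-field
// lengths or delimiters so the receiver can split the payload again;
// omitted here.)
void send_message(std::ostream& os, const std::string& a, const std::string& b, int i)
{
    unsigned long payload = a.size() + b.size() + 4;

    write_uint32(os, payload);                                  // the "fixed size" prefix
    os.write(a.data(), static_cast<std::streamsize>(a.size()));
    os.write(b.data(), static_cast<std::streamsize>(b.size()));
    write_uint32(os, static_cast<unsigned long>(i));            // the int, in fixed form
}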
Take a more complicated protocol like the Jabber protocol, whose messages are effectively XML nodes. Are we going to traverse the tree to find out the payload size? If we have an XML processor, why not apply it directly to the network? Check out that protocol to see how powerful the message format is before complaining about "weird protocols that only bring more trouble".

We don't need to go that far, of course. Mail messages today are already trees of MIME parts. SMTP makes no guarantee that the SIZE hint will be respected. SIZE hints may not even be present. What will the server do? I've seen an SMTP proxy service, whose job was to transform a mail message on the fly before it hit the SMTP server, suffer badly from this. Such a proxy won't be sending any SIZE hints.

My point is: writing a generic networking library for generic client code dealing with generic protocols must provide for every kind of model. Impose a new layer of headers, and you're not generic anymore. Force everyone over a buffer, and you're not generic anymore. Put everything over asynchronicities, and you're not generic anymore. (Please, think of us regular, blocking, thread-loving people, I beg you!)

And this is all about getting and putting stuff to the network, without considering whatever character coding conversions must be done from one place to another, which could perfectly well increase or decrease the size in bytes of a given representation. Also, think of a system with an ISO-8859-1 default locale: what do you do when you want to get a web page from a server providing UTF-8 pages? Good luck with that.

Those dealing exclusively with in-house designed protocols live in heaven, safe from this kind of thing. If you are on the Internet, you have very few guarantees. It's hell out here, sir.

--
Pedro Lamarão