[asio] socket::read_some() splits data into two parts.

Hello. I sent 8004 bytes to a server over a TCP connection and successfully read it like this:
    int readBytes;
    const int BUFFER_SIZE = 128;
    char charBuf[BUFFER_SIZE];
    do {
        readBytes = socket.read_some(boost::asio::buffer(charBuf, BUFFER_SIZE));
    } while (readBytes >= BUFFER_SIZE);
But sometimes this code reads just 3752 (precisely) bytes and returns. After that it handles another async_read and reads 4525 bytes (which in sum gives 8004 bytes). Am I using the "read_some()" function in the right way?

On May 11, 2011, at 2:28 PM, Slav wrote:
I sent 8004 bytes to server through TCP connection and successfully read it like this:
But sometimes this code reads just 3752 (precisely) bytes and returns. After that it handles another async_read and reads 4525 bytes (which in sum gives 8004 bytes).
Well, since your buffer is only 128 bytes long, I hope that you are getting that amount or less from each read_some call. But in general, what you are seeing is normal TCP behavior.
Brad
--
Brad Howes
Calling Team - Skype Prague
Skype: br.howes

I meant that I read data multiple times (accumulating the incoming data with std::string::append(charBuffer)) while readBytes >= BUFFER_SIZE, and that loop sometimes ends early: read_some() returns a value less than BUFFER_SIZE before all 8004 bytes have been read. Now everything is clear to me. Thanks, everyone!

On Wednesday, May 11, 2011 8:29 AM, Slav wrote:
Hello. I sent 8004 bytes to server through TCP connection and successfully read it like this:
    int readBytes;
    const int BUFFER_SIZE = 128;
    char charBuf[BUFFER_SIZE];
    do {
        readBytes = socket.read_some(boost::asio::buffer(charBuf, BUFFER_SIZE));
    } while (readBytes >= BUFFER_SIZE);
But sometimes this code reads just 3752 (precisely) bytes and returns. After that it handles another async_read and reads 4525 bytes (which in sum gives 8004 bytes). Do I use "read_some()" function in the right way?
Try changing the last line to: while(readBytes != 0);

Then messages whose length is not a multiple of 128 (BUFFER_SIZE) will not be read. I tested it: after the last socket::read_some() (with readBytes > 0 && < BUFFER_SIZE), the next socket::read_some() never returns.

On Wednesday, May 11, 2011 9:57 AM, Slav wrote:
Then messages whose length is not a multiple of 128 (BUFFER_SIZE) will not be read. I tested it: after the last socket::read_some() (with readBytes > 0 && < BUFFER_SIZE), the next socket::read_some() never returns.
It would probably help to understand that TCP has no concept of a "message". Anything you write to a socket is appended to a stream of *bytes*. Several subsystems on both the sending and receiving computers will have the option of splitting and combining adjacent buffers with no consideration to how big each individual write was.
read_some has the option of reading any size up to and including the buffer size, but a read of less than that size does not mean "end of message". It could also mean "network congestion", "cable unplugged", or "Windows just felt lazy".
On further thought, I think I see the problem (and I apologize for the bad recommendation in my last email). Your sender somehow needs to communicate the message size or flag the end of the message. A partial list of options includes:
* Begin the message with a field specifying its total length. The receiving loop must read this length and then count bytes until it has the whole message, keeping in mind that each read_some can read any number of bytes.
* Begin each message with a message id, where each message id has a known length. Once you calculate the length, count bytes as I described above.
* End the message with a terminator. You could set up a line-oriented protocol where a newline terminates the read. With some thought, you might think of some other terminating byte or string appropriate to your protocol.
In all cases, unless your final read_some has a carefully controlled size, remember that your buffer may contain the beginning of the next message, or even multiple complete messages. If so, it is your responsibility to retain this until you are ready to process the next message.
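As a rough illustration of the terminator option above, here is a minimal sketch of a receiver-side buffer that accepts chunks of any size and returns complete newline-terminated messages, keeping any unfinished tail for the next call. `LineFramer`, `feed`, and `pending` are hypothetical names, not part of Boost.Asio, and the newline terminator is an assumption about the protocol. Each chunk stands for whatever a single read_some call happened to return.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost.Asio): accumulates raw bytes and
// hands back complete newline-terminated messages.
class LineFramer {
public:
    // Append one chunk exactly as a read_some-style call returned it and
    // return every message completed by it, without the terminator.
    std::vector<std::string> feed(const char* data, std::size_t size) {
        assert(data != nullptr || size == 0);
        pending_.append(data, size);

        std::vector<std::string> messages;
        std::size_t start = 0;
        for (;;) {
            const std::size_t end = pending_.find('\n', start);
            if (end == std::string::npos) break;  // no complete message left
            messages.push_back(pending_.substr(start, end - start));
            start = end + 1;  // skip the terminator itself
        }
        pending_.erase(0, start);  // keep the unfinished tail for next time
        return messages;
    }

    // Bytes received so far that do not yet form a complete message.
    const std::string& pending() const { return pending_; }

private:
    std::string pending_;
};
```

Feeding it "hel" and then "lo\nwor" yields the single message "hello" and leaves "wor" pending, so the result does not depend on where the chunk boundaries fall.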

On 05/11/2011 09:36 AM, Andrew Holden wrote:
On Wednesday, May 11, 2011 9:57 AM, Slav wrote:
Then messages whose length is not a multiple of 128 (BUFFER_SIZE) will not be read. I tested it: after the last socket::read_some() (with readBytes > 0 && < BUFFER_SIZE), the next socket::read_some() never returns.
It would probably help to understand that TCP has no concept of a "message". Anything you write to a socket is appended to a stream of *bytes*.
Alternatively, we could say that TCP, in fact, does have a well-defined concept of messages: they are all exactly one byte long.
Several subsystems on both the sending and receiving computers
... and sometimes boxes in the middle ...
will have the option of splitting and combining adjacent buffers with no consideration to how big each individual write was.
I've done a little protocol stuff with ASIO now and I must say it's a lot of fun and I can't go back to doing it any other way.
On further thought, I think I see the problem (and I apologize for the bad recommendation in my last email). Your sender somehow needs to communicate the message size or flag the end of the message. A partial list of options includes:
The pattern I encounter over and over again (often at multiple levels in a protocol) is:
    class protocol_layer_context
    {
        vector<uint8> buffer;
        void on_received_data(vector<uint8> & rx_bytes)
        {
            buffer.append(rx_bytes);
            // perhaps a virtual override.
            size_t msg_len = this->parse_len_from_start_of_buffer();
            if (buffer.size() >= msg_len)
            {
                vector<uint8> msg_buf = consume_data_from_front_of_buffer(buffer, msg_len);
                // perhaps a virtual override
                this->process_complete_message(msg_buf);
            }
            // post another ASIO read request
            this->request_more_data();
        }
        ...
But there are some important issues with this naive pseudocode:
1. It can result in recopying the data a bunch of times for every protocol layer, killing performance.
2. It's susceptible to a denial-of-service (DoS). A bad guy can trick you into allocating all your memory.
3. Sometimes the length of a message is stated at the beginning of the message, sometimes it isn't known until the end.
4. No processing happens on the message until it's completely read, but some protocols really need the receiving endpoint to process it incrementally.
5. Error handling
6. Optimal threading
7. Etc.
We find bugs in exactly this logic all the darn time. Often the data being received is untrusted and possibly malicious. Real-world protocol implementations will commonly crash under fragmentation fuzzing, sometimes resulting in exploitable security holes.
In a sense, this is the general refactoring problem of 'incrementalizing' a parsing function by moving all its state from stack variables into a longer-lived context object. We've seen it done successfully with coroutines, but that's not a commonly accepted solution because, frankly, the native C/C++ runtimes have not yet given coroutines the love (i.e., portability and performance guarantees) they really deserve.
If someone figured out how to leverage generic techniques to handle just the unidirectional message delimiting problem in a bulletproof way I think it would make a really great boost library. - Marsh
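One possible shape for the buffering pattern above that also addresses issue 2 (unbounded allocation) is sketched below. It assumes a 4-byte big-endian length prefix and a caller-chosen cap on payload size. `FrameContext`, `Status`, `kHeaderSize`, and `max_payload` are hypothetical names chosen for illustration. The sketch makes no attempt to solve the copying, incremental-processing, or threading concerns from the list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kHeaderSize = 4;  // assumed 4-byte big-endian length prefix

// Hypothetical receive context: buffers incoming bytes and extracts complete
// length-prefixed messages, refusing any declared length above a fixed cap.
class FrameContext {
public:
    enum class Status { NeedMore, Ready, TooLarge };

    explicit FrameContext(std::size_t max_payload) : max_payload_(max_payload) {}

    // Append whatever the last read returned; chunk size is arbitrary.
    void on_received_data(const std::uint8_t* data, std::size_t size) {
        assert(data != nullptr || size == 0);
        buffer_.insert(buffer_.end(), data, data + size);
    }

    // Try to move one complete message into `out`.
    Status next_message(std::vector<std::uint8_t>& out) {
        if (buffer_.size() < kHeaderSize) return Status::NeedMore;
        const std::uint32_t len = (std::uint32_t(buffer_[0]) << 24) |
                                  (std::uint32_t(buffer_[1]) << 16) |
                                  (std::uint32_t(buffer_[2]) << 8) |
                                  std::uint32_t(buffer_[3]);
        // Check the declared length before waiting for the payload, so a
        // hostile peer cannot make us buffer an arbitrarily large message.
        if (len > max_payload_) return Status::TooLarge;
        if (buffer_.size() - kHeaderSize < len) return Status::NeedMore;

        const auto body = buffer_.begin() + static_cast<std::ptrdiff_t>(kHeaderSize);
        const auto frame_end = body + static_cast<std::ptrdiff_t>(len);
        out.assign(body, frame_end);
        buffer_.erase(buffer_.begin(), frame_end);  // keep bytes of later messages
        return Status::Ready;
    }

private:
    std::size_t max_payload_;
    std::vector<std::uint8_t> buffer_;
};
```

A caller would typically call `next_message` in a loop after each read until it stops returning `Ready`, and drop the connection on `TooLarge`.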

-----Original Message-----
From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of Marsh Ray
Sent: May-12-11 1:11 PM
To: boost@lists.boost.org; boost-users@lists.boost.org
Subject: [Boost-users] Delimiting protocol messages (was [asio] read_some() splits data)
On 05/11/2011 09:36 AM, Andrew Holden wrote:
On Wednesday, May 11, 2011 9:57 AM, Slav wrote:
Then messages whose length is not a multiple of 128 (BUFFER_SIZE) will not be read. I tested it: after the last socket::read_some() (with readBytes > 0 && < BUFFER_SIZE), the next socket::read_some() never returns.
It would probably help to understand that TCP has no concept of a "message". Anything you write to a socket is appended to a stream of *bytes*.
Alternatively, we could say that TCP, in fact, does have a well-defined concept of messages: they are all exactly one byte long.
Related to this, I wonder if there are any class libraries that facilitate processing these byte streams. I read the following in the rationale part of the documentation for ASIO: "Basis for further abstraction. The library should permit the development of other libraries that provide higher levels of abstraction. For example, implementations of commonly used protocols such as HTTP."
It seems like such an obvious thing to do: to write a class library whose classes use the TCP capabilities of boost::asio to automagically take data read from the socket and do whatever is needed. For example, one might want to construct a series of HTTP requests from the data coming in on port 443, relate the addressing data in the application layer to that in the TCP layer, and use that comparison to determine whether to forward the request to server A or server B. One reason for doing so would be my own edification (and that of anyone else interested in learning) about how the different OSI layers work. Another would be security (e.g. to know whether or not an authorized user's session has been hijacked).
It seems to me to be an obvious thing to do, but my question to you is "Do you know of anyone who has done it?" (in some kind of open source project). If not, do you know of resources available online where I could learn how to do it? I am finding it hard to find useful resources: I have well-developed C++ skills (e.g. I can write custom IO stream classes), but I need some guidance on how to proceed with the 'further abstraction' the docs mention, and on the recommended best practices specific to (high performance, secure) networking program development.
You said, "I've done a little protocol stuff with ASIO now and I must say it's a lot of fun and I can't go back to doing it any other way." How did you get started on it? Did you use any documentation other than the asio docs? Do you know of any documents (ideally online) that show how you could use this stuff to thwart the major kinds of attacks that can be made on a web server?
Thanks
Ted

When I raised this question I thought that TCP itself was able to determine the message size, but it is not (I previously made wide use of ENet (secure data transmission based on UDP), which may be why I was misled). An important "application protocol" built upon a "message size prefix" protocol (if such a small thing can be called a "protocol") could be the serialization of C++ objects using boost::serialization, so the whole picture of the protocol stack would look like this: Ethernet -> IP -> TCP -> "message size prefix" -> "boost::serialization"

Ted Byers
It seems to me to be an obvious thing to do, but my question to you is "Do you know of anyone who has done it?" (in some kind of open source project)
What about: http://cpp-netlib.github.com/ Apparently it will be submitted for review someday: http://comments.gmane.org/gmane.comp.lib.boost.user/67431
Thanks
Ted
Jerry

On 05/12/2011 04:28 PM, Jerry wrote:
Ted Byers
writes: It seems to me to be an obvious thing to do, but my question to you is "Do you know of anyone who has done it?" (in some kind of open source project)
What about: http://cpp-netlib.github.com/
Looks interesting. This page in particular looks like it's getting close to what I was talking about: http://cpp-netlib.github.com/0.9.0/message.html
I realize the project is new and the docs may not be complete, but every other page seems to be about its HTTP implementation. Even the generic basic_message class presumes a headers/body structure. HTTP is often thought of as a half-duplex message/response protocol because it is (mostly) stateless and originally closed the connection after every response. I was interested more in a general facility for a common low-level protocol buffering pattern.
Apparently it will be submitted for review someday: http://comments.gmane.org/gmane.comp.lib.boost.user/67431
Cool. - Marsh

-----Original Message-----
From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of Marsh Ray
Sent: May-12-11 5:45 PM
To: boost-users@lists.boost.org
Cc: Jerry
Subject: Re: [Boost-users] Delimiting protocol messages (was [asio] read_some() splits data)
On 05/12/2011 04:28 PM, Jerry wrote:
Ted Byers
writes: It seems to me to be an obvious thing to do, but my question to you is "Do you know of anyone who has done it?" (in some kind of open source project)
What about: http://cpp-netlib.github.com/
Looks interesting. This page in particular looks like it's getting close to what I was talking about:
Yes, It is interesting
http://cpp-netlib.github.com/0.9.0/message.html
I realize the project is new and the docs may not be complete, but every other page seems to be about its HTTP implementation. Even the generic basic_message class presumes a headers/body structure.
What I haven't found yet is a way to compare the IP info in the TCP packets with the IP info in the HTTP headers. That is in particular. More generally, I am looking for an online resource for learning network programming in general and security-related network programming in particular.
Cheers
Ted

On 05/12/2011 04:59 PM, Ted Byers wrote:
What I haven't found yet is a way to compare the IP info in the TCP packets with the IP info in the HTTP headers.
Sometimes a proxy will add something, but usually there aren't any IP addresses in HTTP headers.
That is in particular. More generally, I am looking for an online resource for learning network programming in general and security-related network programming in particular.
That's interesting. There are resources about secure programming, and securing networks, but I don't see much new stuff about basic network programming. They are probably casualties of the trend to make all communications run over HTTP(s). I don't recall ever seeing a book or online resource saying "here's how to accept data from the network and process it in the most scalable and secure way using C or C++". On the crypto side of things I recommend:
http://www.amazon.com/Cryptography-Engineering-Principles-Practical-Applicat...
I tweeted your question: https://twitter.com/marshray/status/68810041234432000 Got this recommendation, doesn't seem to be too related to network programming though. Perhaps we'll get more.
http://www.amazon.com/Memory-Programming-Concept-Frantisek-Franek/dp/0521520...
- Marsh

On May 12, 2011, at 8:17 PM, Ted Byers wrote:
Related to this, I wonder if there are any class libraries that facilitate processing these byte streams.
Have you looked at boost::serialization? There is an example in boost::asio on how to use them together.
Brad
--
Brad Howes
Calling Team - Skype Prague
Skype: br.howes

I sent 8004 bytes to server through TCP connection and successfully read it like this:
    int readBytes;
    const int BUFFER_SIZE = 128;
    char charBuf[BUFFER_SIZE];
    do {
        readBytes = socket.read_some(boost::asio::buffer(charBuf, BUFFER_SIZE));
    } while (readBytes >= BUFFER_SIZE);
But sometimes this code reads just 3752 (precisely) bytes and returns. After that it handles another async_read and reads 4525 bytes (which in sum gives 8004 bytes). Do I use "read_some()" function in the right way?
This is exactly what "read_some" does -- it reads SOME data. It may read even 1 byte. If you know exactly how many bytes you expect to get, use read() free function with the appropriate completion condition: http://www.boost.org/doc/libs/1_46_1/doc/html/boost_asio/reference/read.html
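To illustrate what "read until N bytes have arrived" means in terms of read_some, here is a sketch of the loop involved. It is not Boost.Asio's implementation of read(); it only shows the idea. `ChunkedSource` is a hypothetical stand-in for a socket that returns at most a fixed number of bytes per call, and `read_exactly` is a hypothetical helper. The stand-in signals end of stream by returning 0, which is a simplification: the real read_some reports end of stream and other failures through an exception or an error_code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Stand-in for a TCP socket (hypothetical, for illustration only): it hands
// out at most `max_chunk` bytes per call, the way read_some may return fewer
// bytes than were requested.
class ChunkedSource {
public:
    ChunkedSource(std::string data, std::size_t max_chunk)
        : data_(std::move(data)), max_chunk_(max_chunk) {
        assert(max_chunk > 0);
    }

    std::size_t read_some(char* out, std::size_t capacity) {
        const std::size_t n =
            std::min({capacity, max_chunk_, data_.size() - pos_});
        std::memcpy(out, data_.data() + pos_, n);
        pos_ += n;
        return n;  // 0 means the stream has ended (simplified error model)
    }

private:
    std::string data_;
    std::size_t max_chunk_;
    std::size_t pos_ = 0;
};

// Keep calling read_some until exactly `wanted` bytes have arrived or the
// source reports end of stream. Returns the number of bytes actually read.
template <typename Source>
std::size_t read_exactly(Source& source, char* out, std::size_t wanted) {
    std::size_t total = 0;
    while (total < wanted) {
        const std::size_t n = source.read_some(out + total, wanted - total);
        if (n == 0) break;  // peer closed before `wanted` bytes arrived
        total += n;
    }
    return total;
}
```

With a source holding 8004 bytes and a 3752-byte chunk limit, a single read_some call returns 3752 bytes, while `read_exactly` for the remaining 4252 bytes keeps looping until all of them have been delivered.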

Now it's clear to me. Previously I thought that the socket::async_read() handler (inside which I call read_some() multiple times) would be called only once the whole data had been received, and that read_some() would return "0" (or a result less than the buffer size) between separate bunches of already received data. That is not how it works. Thank you.

It must be reflected in the documentation, isn't it?

It must be reflected in the documentation, isn't it?
"The function call will block until one or more bytes of data has been read successfully, or until an error occurs."
"Remarks: The read_some operation may not read all of the requested number of bytes. Consider using the read function if you need to ensure that the requested amount of data is read before the blocking operation completes."
http://www.boost.org/doc/libs/1_46_1/doc/html/boost_asio/reference/basic_str...

"The read_some operation may not read all of the requested number of bytes." is pretty clear, because an incoming message isn't bound to BUFFER_SIZE, which is just the place to store intermediate data. "Consider using the read function if you need to ensure that the requested amount of data is read before the blocking operation completes." sounds like "read until my buffer is full regardless of the incoming message length", not like "read until the logical message has been fully read, using my buffer as the place where the data is put, regardless of its size". And, by the way, there could come empty messages - just without any data. TCP contains info about message length - it is duplication to prefix all messages with its length.

On 5/11/2011 10:11 AM, Slav wrote:
TCP contains info about message length - it is duplication to prefix all messages with it's length.
No, it really doesn't. All you get is a stream of bytes, reliably delivered, in sequence. So if you write 3,732 bytes onto a socket, there is *NO WAY* for the reader to tell that you did that. The reader might get 1 read of 3732 bytes, or 3732 reads of 1 byte, or anything in between. The reader could even read *more* than 3732 bytes in one read, if you wrote more than once. If you think you're going to get repeatable, identical matching pairs of reads and writes out of just a TCP socket (that is, without imposing your own protocol on top of TCP), you are in for endless hours/days/months of frustration.

On Wednesday, May 11, 2011 11:12 AM, Slav wrote:
And, by the way, there could come empty messages - just without any data.
TCP does not have a concept of an empty write.
TCP contains info about message length - it is duplication to prefix all messages with it's length.
Where did you read that? TCP has NO concept of "messages". As such, it has no concept of "message length". It is only a stream of bytes.
The sending computer can easily combine the buffers from two consecutive write calls into a single packet, or split the buffer from a single write call into multiple packets, or both. In either case, ALL information about the size of the original write call(s), the number of write calls, and anything else that you hope will provide a clue about "messages" will be lost. Likewise, the receiving computer can and will freely combine and split packets into whatever buffers it sees fit, with similar effects on any "message boundaries". The only thing that will remain is the sequence of bytes.
Do not try to search for TCP options to change this behavior. The closest you can come is options that will *usually* keep the message boundaries. This means that your program will *usually* not crash.
If you wish to preserve message boundaries, then you MUST provide your own message framing, just as you would if writing to a file. If you prefix each message with its length, then you can use the read function to ensure you get the whole message in one call, as you will know the message length. This will also be effective at ensuring you don't have the beginning of the next message at the end of your buffer.
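The length-prefix framing described above might look like the following on the wire. This is a sketch under assumptions, not a standard format: the prefix is taken to be 4 bytes in big-endian order, and `encode_frame`, `decode_length`, and `kPrefixSize` are hypothetical names. The receiver would first read exactly `kPrefixSize` bytes, decode the length, and then read exactly that many payload bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr std::size_t kPrefixSize = 4;  // assumed width of the length field

// Sender side: build one frame, a 4-byte big-endian payload length followed
// by the payload bytes themselves.
std::string encode_frame(const std::string& payload) {
    assert(payload.size() <= 0xFFFFFFFFu);  // length must fit in 32 bits
    const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    std::string frame;
    frame.reserve(kPrefixSize + payload.size());
    frame.push_back(static_cast<char>((len >> 24) & 0xFF));
    frame.push_back(static_cast<char>((len >> 16) & 0xFF));
    frame.push_back(static_cast<char>((len >> 8) & 0xFF));
    frame.push_back(static_cast<char>(len & 0xFF));
    frame += payload;
    return frame;
}

// Receiver side: decode the payload length from the first kPrefixSize bytes.
// The caller must have read at least kPrefixSize bytes into `prefix`.
std::uint32_t decode_length(const char* prefix) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(prefix);
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}
```

With this layout an empty message is simply a frame whose length field is zero, so empty messages remain representable even though TCP itself has no notion of an empty write.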

Yeah - I was really mistaken. Thanks for correcting me. I reimplemented the reading using a message length prefix and now everything works fine. Just one question is left: the socket has a "receive_buffer_size" option, which equals 8192 by default. Does that mean that a message longer than that (if it arrives all at once) will be truncated? Or can I still read it with multiple "async_read" calls (collecting it into a buffer using the data length prefix)?

On Thursday, May 12, 2011 10:36 AM, Slav wrote:
Yeah - I was really mistaken. Thanks for correcting me.
I reimplemented the reading using message length prefix and now everything works fine.
Just one question is left: the socket has a "receive_buffer_size" option, which equals 8192 by default. Does that mean that a message longer than that (if it arrives all at once) will be truncated? Or can I still read it with multiple "async_read" calls (collecting it into a buffer using the data length prefix)?
TCP will not alter the byte stream, also meaning it will not drop bytes. If the receive buffer fills, then it will tell the other machine it is sending data too fast and needs to slow down. It will also have the other machine resend the data that couldn't fit in the receive buffer. Your programs (on both ends) will not need to address this issue; the operating system will handle it for you. That said, experimenting with the receive buffer size may improve performance, but will have no effect on correctness. Don't assume more is better. If you make the buffers too big, you'll just increase overhead and latency.
participants (8)
- Andrew Holden
- Brad Howes
- Eric J. Holtman
- Igor R
- Jerry
- Marsh Ray
- Slav
- Ted Byers