Serialization and async messaging

Scott Woods

29 Dec 2004

3:46 a.m.

Hi,

I have been doing a lot of async messaging over TCP and have looked through the serialization library in the hope that it could be applied to the task.

If the serialized representation of an object has an unknown length then it's reasonable to say that we don't know how many network blocks will be consumed, i.e. the receiver may read 1 or more blocks before completion. We also can't guarantee that the flurry of recv's that might result from an FD_READ (sorry to those non-Windows folks) will neatly terminate on an object boundary.

IMHO the input function (operator>>) needs to be "re-entrant" if serialization is to be used in an async environment. Repeated calls should be perfectly acceptable. Ultimately it would return an indication that a complete object has indeed been loaded. Some "state" needs to be held somewhere. My best guess is that this is not the case but I can't find anything conclusive (without wading deeper into code).

Can anyone comment?

ps: It's a thoroughly amazing library.

Robert Ramey

29 Dec

5:02 a.m.

Scott Woods wrote:

...

Hi,

I have been doing a lot of async messaging over TCP and have looked through the serialization library in the hope that it could be applied to the task.

If the serialized representation of an object has an unknown length then it's reasonable to say that we don't know how many network blocks will be consumed, i.e. the receiver may read 1 or more blocks before completion. We also can't guarantee that the flurry of recv's that might result from an FD_READ (sorry to those non-Windows folks) will neatly terminate on an object boundary.

IMHO the input function (operator>>) needs to be "re-entrant" if serialization is to be used in an async environment. Repeated calls should be perfectly acceptable. Ultimately it would return an indication that a complete object has indeed been loaded. Some "state" needs to be held somewhere. My best guess is that this is not the case but I can't find anything conclusive (without wading deeper into code).

Can anyone comment?

I'm not sure I'm understanding exactly what you want to do - but that deters me not at all from making a comment.

I envisioned the library would be useful for marshalling data across space (transmitting/receiving between programs) as well as across time (persistence - the most common application). So I would be pleased to see someone apply it in this way.

However, I don't think the concept of transmission protocol should be mixed into the library - which is already very complex.

I believe you could easily achieve what you want to accomplish by serialization to a memory buffer (e.g. string_stream) and transmitting this. On the other end the inverse of this process would occur.

Robert Ramey

...

ps: It's a thoroughly amazing library.

Have you concluded this from actually using the library or from just reading the documentation? Robert Ramey

Jody Hagins

5:27 a.m.

On Tue, 28 Dec 2004 21:02:37 -0800 "Robert Ramey" <ramey@rrsd.com> wrote:

...

However, I don't think the concept of transmission protocol should be mixed into the library - which is already very complex.

I do not think Scott is talking about transmission protocol specifically.

...

I believe you could easily achieve what you want to accomplish by serialization to a memory buffer (e.g. string_stream) and transmitting this. On the other end the inverse of this process would occur.

If I understand Scott correctly, the problem still exists if you want to use the lib in that way.

Assume an object whose serialization is something like 4K. If reading from a file, or even a TCP stream with the socket in blocking mode, then you just keep reading until you get all the data. However, for a socket in non-blocking mode, you will typically use select or poll or some other notification mechanism to be told when data is available. You will then go read as much as is currently available, and then return to other tasks until more data is ready. Let's say that data is slow, and to read the entire 4K of data, it takes 10 different "notifications" and 10 different "read" operations.

I think Scott is saying that operator>> is insufficient because it cannot do a partial read of what is there... it wants to snarf all 4K.

I could be missing the boat, but this is the usual problem with serialization methods when using them with sockets. For this to work, the operator>>() has to know that there is no more data (i.e., correctly interpret the return code of read when the fd is in non-blocking mode), and keep its current state so that the next call to operator>>() will continue where the last call left off.

I do not see this as a protocol issue but as supporting non-blocking reads where you can get the data in many small chunks.

Then again, it is possible that the serialization library already does support this in some way...
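To make the "keep its current state" requirement concrete, here is a minimal sketch of a reader that can be called once per notification and reports when the object is complete. It assumes the total size is known in advance (the 4K in the example above); `ResumableReader`, `feed` and `complete` are hypothetical names for illustration and are not part of the serialization library.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: collects exactly `expected` bytes across many calls.
// The state that survives between calls is the partially filled buffer.
class ResumableReader {
public:
    explicit ResumableReader(std::size_t expected) : expected_(expected) {}

    // Consume up to `len` bytes from `data` and return how many were taken.
    // Bytes past the object boundary are not consumed; they belong to the
    // next object and stay with the caller.
    std::size_t feed(const char* data, std::size_t len) {
        const std::size_t want = expected_ - buffer_.size();
        const std::size_t take = std::min(want, len);
        buffer_.append(data, take);
        return take;
    }

    // True once every expected byte has arrived.
    bool complete() const { return buffer_.size() == expected_; }

    const std::string& bytes() const { return buffer_; }

private:
    std::size_t expected_;
    std::string buffer_;
};
```

A non-blocking read handler would call `feed` with whatever `recv` returned and only hand `bytes()` to a deserializer once `complete()` is true. The open question in this thread is how to get the same behaviour when the size is not known up front.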

Robert Ramey

6:16 a.m.

Jody Hagins wrote:

...

...
I believe you could easily achieve what you want to accomplish by serialization to a memory buffer (e.g. string_stream) and transmitting this. On the other end the inverse of this process would occur.

If I understand Scott correctly, the problem still exists, if you want to use the lib in that way.

Assume an object whose serialization is something like 4K. If reading from a file, or even a TCP stream with the socket in blocking mode, then you just keep reading until you get all the data. However, for a socket in non-blocking mode, you will typically use select or poll or some other notification mechanism to be told when data is available. You will then go read as much as is currently available, and then return to other tasks until more data is ready. Let's say that data is slow, and to read the entire 4K of data, it takes 10 different "notifications" and 10 different "read" operations.

I think Scott is saying that operator>> is insufficient because it can not do a partial read of what is there... it wants to snarf all 4K.

I could be missing the boat, but this is the usual problem with serialization methods, when using them with sockets. For this to work, the operator>>() has to know that there is no more data (i.e., correctly interpret return code of read when the fd is in non-blocking mode), and keep its current state so that the next call to operator>>() will continue where the last call left off.

I do not see this as a protocol issue but as supporting non blocking reads where you can get the data in many small chunks.

Then again, it is possible that the serialization library already does support this in some way...

Not as far as I can see. I would say that one should serialize the data in the chunk size you want and not attempt to break up the chunks. Robert Ramey

Scott Woods

7:21 p.m.

----- Original Message ----- From: "Robert Ramey" <ramey@rrsd.com> To: <boost@lists.boost.org> Sent: Wednesday, December 29, 2004 7:16 PM Subject: [boost] Re: Re: Serialization and async messaging <snip>

...

...
I could be missing the boat, but this is the usual problem with serialization methods, when using them with sockets. For this to work, the operator>>() has to know that there is no more data (i.e., correctly interpret return code of read when the fd is in non-blocking mode), and keep its current state so that the next call to operator>>() will continue where the last call left off.

I do not see this as a protocol issue but as supporting non blocking reads where you can get the data in many small chunks.

Then again, it is possible that the serialization library already does support this in some way...

Not as far as I can see. I would say that one should serialize the data in the chunk size you want and not attempt to break up the chunks.

Hmmmm. My background in network messaging gives me a certain POV and now I can see that serialization (a la persistence) gives quite another.

Any attempts to arrange "chunks" at the level we are talking about will be subverted by the network (some router somewhere will decide to fragment/coalesce/...). So we can never give significance to a block received, i.e. the first byte may not be the first byte of a serialized object and the last byte may not be the last byte of a serialized object.

Thanks for the feedback. My original question was fairly "open" (not deeply researched ;-) I suspected that operator>> was not really going to be appropriate in an async environment but held out some hope that experts would have found a completely different way to apply your library.

Scott.

Gennadiy Rozental

6:40 a.m.

...

If I understand Scott correctly, the problem still exists, if you want to use the lib in that way.

Assume an object whose serialization is something like 4K. If reading from a file, or even a TCP stream with the socket in blocking mode, then you just keep reading until you get all the data. However, for a socket in non-blocking mode, you will typically use select or poll or some other notification mechanism to be told when data is available. You will then go read as much as is currently available, and then return to other tasks until more data is ready. Let's say that data is slow, and to read the entire 4K of data, it takes 10 different "notifications" and 10 different "read" operations.

I don't think this is that big of a problem. It is the usual situation in marshaling implementations that the parser/builder/marshaler class needs to have a complete data block. And the usual solution is to use some kind of protocol-level envelope that allows detecting the end. So what you do is collect your pieces in a buffer and then process them all together. This is the case whether you are sending object by object or (most probably) a whole, I don't know ... message. In all practical cases it does not incur any significant time/space overhead (IOW buffer size, envelope and message units can be configured so it's efficient).

Gennadiy
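One way to picture the envelope approach is a length prefix in front of each serialized payload, with the receiver buffering until a whole envelope is present. The sketch below assumes a 4-byte big-endian length header; that header layout and the names `wrap` and `EnvelopeDecoder` are illustrative assumptions, not anything defined in this thread or in the library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical envelope: 4-byte big-endian payload length, then the payload.
std::string wrap(const std::string& payload) {
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::string out;
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<char>((n >> shift) & 0xFF));
    out += payload;
    return out;
}

class EnvelopeDecoder {
public:
    // Append whatever the network delivered and return every payload that is
    // now complete. Chunks may split or merge envelopes arbitrarily.
    std::vector<std::string> feed(const std::string& chunk) {
        pending_ += chunk;
        std::vector<std::string> done;
        std::size_t pos = 0;
        while (pending_.size() - pos >= 4) {
            const std::uint32_t n = read_length(pos);
            if (pending_.size() - pos - 4 < n) break;  // payload still incomplete
            done.push_back(pending_.substr(pos + 4, n));
            pos += 4 + n;
        }
        pending_.erase(0, pos);  // keep only the unfinished tail
        return done;
    }

private:
    std::uint32_t read_length(std::size_t pos) const {
        std::uint32_t n = 0;
        for (int i = 0; i < 4; ++i)
            n = (n << 8) | static_cast<unsigned char>(pending_[pos + i]);
        return n;
    }

    std::string pending_;
};
```

Each string returned by `feed` is a complete block that can be handed to a deserializer in one go, which is the "collect your pieces in a buffer and then process them all together" step.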

Scott Woods

7:55 p.m.

----- Original Message ----- From: "Gennadiy Rozental" <gennadiy.rozental@thomson.com> To: <boost@lists.boost.org> Sent: Wednesday, December 29, 2004 7:40 PM Subject: [boost] Re: Re: Serialization and async messaging <snip>

...

Assume an object whose serialization is something like 4K. If reading from a file, or even a TCP stream with the socket in blocking mode, then you just keep reading until you get all the data. However, for a socket in non-blocking mode, you will typically use select or poll or some other notification mechanism to be told when data is available. You will then go read as much as is currently available, and then return to other tasks until more data is ready. Let's say that data is slow, and to read the entire 4K of data, it takes 10 different "notifications" and 10 different "read" operations.

I don't think this is that big of a problem. It is the usual situation in marshaling implementations that the parser/builder/marshaler class needs to have a complete data block. And the usual solution is to use some kind of protocol-level envelope that allows detecting the end. So what you do is collect your pieces in a buffer and then process them all together. This is the case whether you are sending object by object or (most probably) a whole, I don't know ... message. In all practical cases it does not incur any significant time/space overhead (IOW buffer size, envelope and message units can be configured so it's efficient).

Understand this and can envision a successful implementation. It would seem to involve some duplication though, i.e. the generation and detection of the envelopes is similar to what occurs in serialization already. It was this duplication of (apparently equivalent) code that I was hoping to avoid in the first place.

If the serialization format (e.g. XML) already satisfies your requirement for "some kind of protocol level envelope that allows detect end" then it seems a little tragic to ignore that potential?

OTOH the envelope technique can be applied to all sorts of things and that might be enough to satisfy the aesthete/engineer.

Cheers.
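The alternative hinted at here, using the format's own end-of-object marker instead of a separate envelope, could look roughly like the sketch below. It assumes every serialized document ends with a fixed closing marker that cannot occur earlier in the document; the marker is passed in as a parameter, and `MarkerSplitter` is a hypothetical name. This is a plain substring scan, not an incremental XML parser, so it only illustrates the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Assumption: each document ends with `marker`, and the marker never appears
// inside a document's body.
class MarkerSplitter {
public:
    explicit MarkerSplitter(std::string marker) : marker_(std::move(marker)) {
        assert(!marker_.empty());  // an empty marker would never make progress
    }

    // Append a chunk and return every document completed so far.
    std::vector<std::string> feed(const std::string& chunk) {
        pending_ += chunk;
        std::vector<std::string> docs;
        for (;;) {
            const std::size_t hit = pending_.find(marker_, scanned_);
            if (hit == std::string::npos) {
                // Remember how far we scanned, but keep the last
                // marker_.size() - 1 bytes in play: a marker may be split
                // across two chunks.
                scanned_ = pending_.size() >= marker_.size()
                               ? pending_.size() - marker_.size() + 1
                               : 0;
                return docs;
            }
            const std::size_t end = hit + marker_.size();
            docs.push_back(pending_.substr(0, end));
            pending_.erase(0, end);
            scanned_ = 0;
        }
    }

private:
    std::string marker_;
    std::string pending_;
    std::size_t scanned_ = 0;  // start offset for the next search
};
```

This avoids a second framing layer but ties the transport code to one archive format, and the stored `scanned_` offset only saves rescanning bytes; the buffering and the final parse pass are still there.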

Caleb Epstein

3:10 p.m.

On Wed, 29 Dec 2004 00:27:23 -0500, Jody Hagins <jody-boost-011304@atdesk.com> wrote:

...

I think Scott is saying that operator>> is insufficient because it can not do a partial read of what is there... it wants to snarf all 4K.

I think at least one person here is mixing pickles and milk, as my old Latin teacher used to say.

Firstly, the serialization library doesn't use operator>> or operator<< at all. If you're using these operators on some sort of socket iostream and you don't want blocking to occur, that's your first mistake.

When reading/writing data with sockets, it's up to you to use an appropriate protocol that can detect "message" boundaries so you know when you have a complete block of data to work with. See for example the BEEP protocol (http://beepcore.org/) for a nice specification and implementation.

Don't tightly couple this communications layer to the application layer. The two layers should just pass "complete" blocks of data (e.g. serialized objects) between each other.

--
Caleb Epstein
caleb dot epstein at gmail dot com

Scott Woods

7:37 p.m.

----- Original Message ----- From: "Caleb Epstein" <caleb.epstein@gmail.com> To: <boost@lists.boost.org> Sent: Thursday, December 30, 2004 4:10 AM Subject: Re: [boost] Re: Serialization and async messaging

...

...
I think Scott is saying that operator>> is insufficient because it can not do a partial read of what is there... it wants to snarf all 4K.

I think at least one person here is mixing pickles and milk, as my old Latin teacher used to say.

Firstly, the serialization library doesn't use operator>> or operator<< at all. If you're using these operators on some sort of socket iostream and you don't want blocking to occur, that's your first mistake.

When reading/writing data with sockets, it's up to you to use an appropriate protocol that can detect "message" boundaries so you know when you have a complete block of data to work with. See for example the BEEP protocol (http://beepcore.org/) for a nice specification and implementation.

Don't tightly couple this communications layer to the application layer. The two layers should just pass "complete" blocks of data (e.g. serialized objects) between each other.

At the core of the comments I have read, I believe that a common problem has been identified. The separation of application-level communications from the send+recv of blocks over a socket is the abstract, while "operator>>... wants to snarf all 4K" is the dirty guts of it? Thanks for the reference. First review has me resigned to another round of research. Cheers.

Scott Woods

7:02 p.m.

----- Original Message ----- From: "Jody Hagins" <jody-boost-011304@atdesk.com> To: <boost@lists.boost.org> Sent: Wednesday, December 29, 2004 6:27 PM Subject: Re: [boost] Re: Serialization and async messaging <snip>

...

I could be missing the boat, but this is the usual problem with serialization methods, when using them with sockets. For this to work, the operator>>() has to know that there is no more data (i.e., correctly interpret return code of read when the fd is in non-blocking mode), and keep its current state so that the next call to operator>>() will continue where the last call left off.

That is indeed what I have been trying to say :-)

Scott Woods

6:59 p.m.

From: "Robert Ramey" <ramey@rrsd.com> To: <boost@lists.boost.org> Sent: Wednesday, December 29, 2004 6:02 PM Subject: [boost] Re: Serialization and async messaging <snip>

...

...
IMHO the input function (operator>>) needs to be "re-entrant" if serialization is to be used in an async environment. Repeated calls should be perfectly acceptable. Ultimately it would return an indication that a complete object has indeed been loaded. Some "state" needs to be held somewhere. My best guess is that this is not the case but I can't find anything conclusive (without wading deeper into code).

Can anyone comment?

I'm not sure I'm understanding exactly what you want to do - but that deters me not at all from making a comment.

I envisioned the library would be useful for marshalling data across space (transmitting/receiving between programs) as well as across time (persistence - the most common application). So I would be pleased to see someone apply it in this way.

However, I don't think the concept of transmission protocol should be mixed into the library - which is already very complex.

If you mean here that it would be unwise to couple the serialization to TCP then yes, I understand that.

...

I believe you could easily achieve what you want to accomplish by serialization to a memory buffer (e.g. string_stream) and transmitting this.

...

On the other end the inverse of this process would occur.

On the transmit side your suggestion seems completely workable. On the receive side you have the problem (i.e. in an async environment) of not knowing whether the data available at any point will be enough to complete the call to operator>>. You can't make a call that is going to "block" in some way.

...

...
ps: It's a thoroughly amazing library.

Have you concluded this from actually using the library or from just reading the documentation?

The documentation, the functionality that has been achieved and my struggles to achieve less. I hadn't even considered trying to recover objects including pointers to objects that...

I have done some analogous work in the area of async messaging, i.e. lots of marshalling inside async frameworks. I have recently tried to extend this to persistence (or serialization) as supporting two distinct bodies of code seems ill-directed. The option that I was exploring here was to apply your serialization back the other way; to async messaging.

Cheers.

Robert Ramey

7:34 p.m.

Scott Woods wrote:

...

...
I believe you could easily achieve what you want to accomplish by serialization to a memory buffer (e.g. string_stream) and transmitting this. On the other end the inverse of this process would occur.

On the transmit side your suggestion seems completely workable. On the receive side you have the problem (i.e. in an async environment) of not knowing whether the data available at any point will be enough to complete the call to operator>>. You can't make a call that is going to "block" in some way.

I have to say I just can't understand this. I envision such a process as functioning in the following way.

transmitting program
==============
serialize to a string. we now have its total length.
transmit a string using whatever method, synchronous/asynchronous, whatever.

receiving program
============
retrieve a string - using whatever method, async, sync or whatever.
when a complete string is retrieved/reassembled or whatever, de-serialize to the original structure.

Honestly, I can't see anything about this that is less than optimal.

Robert Ramey
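As a rough illustration of the two halves described above, the sketch below serializes to an in-memory string on the sending side and de-serializes only from a complete string on the receiving side. To stay self-contained it uses plain `std::ostringstream`/`std::istringstream` formatting as a stand-in for an archive class; the `Reading` type and the `serialize`/`deserialize` functions are hypothetical and are not the library's API.

```cpp
#include <sstream>
#include <string>

// Hypothetical message type used only for this illustration.
struct Reading {
    int id;
    double value;
};

// Transmit side: write the object into a memory buffer. Once this returns,
// the total length is known (result.size()) and can be sent along with it.
std::string serialize(const Reading& r) {
    std::ostringstream os;
    os << r.id << ' ' << r.value;
    return os.str();
}

// Receive side: call this only after the complete string has been
// reassembled by the transport, however that transport works.
bool deserialize(const std::string& bytes, Reading& out) {
    std::istringstream is(bytes);
    return static_cast<bool>(is >> out.id >> out.value);
}
```

Note that handing `deserialize` a truncated buffer is not reliably detected: with this stand-in format, a buffer cut off mid-number can still parse, just to the wrong value. That is why the receiver needs the length (or some other end marker) from the transport rather than relying on the parse to fail.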

Scott Woods

9:08 p.m.

----- Original Message ----- From: "Robert Ramey" <ramey@rrsd.com> To: <boost@lists.boost.org> Sent: Thursday, December 30, 2004 8:34 AM Subject: [boost] Re: Re: Serialization and async messaging

...

...
On the transmit side your suggestion seems completely workable. On the receive side you have the problem (i.e. in an async environment) of not knowing whether the data available at any point will be enough to complete the call to operator>>. You can't make a call that is going to "block" in some way.

I have to say I just can't understand this. I envision such a process as functioning in the following way.

transmitting program
==============
serialize to a string. we now have its total length.
transmit a string using whatever method, synchronous/asynchronous, whatever.

receiving program
============
retrieve a string - using whatever method, async, sync or whatever.
when a complete string is retrieved/reassembled or whatever, de-serialize to the original structure.

Honestly, I can't see any thing about this that is less than optimal.

Aha. After Gennadiy's version of this I can now see it for what it is.

Yes, the simple response is that this will work. A pedantic response is that it feels like duplication; why implement another layer (the envelopes) just to detect completion of objects, ignoring the potential in the existing serialization format (e.g. XML). A more useful response would point out the higher memory requirements, two scanning phases and increased incidence of copying.

A comparative analysis is probably impossible. Given the opportunity I would probably explore the non-envelope version, but that's curiosity and the sex appeal of "doing tricky stuff that results in less lines". Commercially the envelope version seems like a winner.

Thanks.

Robert Ramey

30 Dec

3:02 a.m.

Scott Woods wrote:

...

...
transmitting program
==============
serialize to a string. we now have its total length.
transmit a string using whatever method, synchronous/asynchronous, whatever.

receiving program
============
retrieve a string - using whatever method, async, sync or whatever.
when a complete string is retrieved/reassembled or whatever, de-serialize to the original structure.

Honestly, I can't see any thing about this that is less than optimal.

Aha. After Gennadiy's version of this I can now see it for what it is.

Yes, the simple response is that this will work. A pedantic response is that it feels like duplication; why implement another layer (the envelopes) just to detect completion of objects, ignoring the potential in the existing serialization format (e.g. XML). A more useful response would point out the higher memory requirements, two scanning phases and increased incidence of copying.

If you're really a glutton for punishment consider the following:

make your sockets/tcp/whatever implementation with an i/ostream interface. E.g.

    class socket_istream ...

Transmit
======
    {
        // open a socket_ostream
        socket_ostream os(???);
        // all current archive classes use a basic stream interface, so
        {
            xml_oarchive oa(os);
            // start serializing
            oa << ...
            ...
            // stop serializing!
        }
    }

Receive
=====
    {
        // open an input socket stream
        socket_istream is(???);
        // all current archive classes use a basic stream interface, so
        {
            xml_iarchive ia(is);
            ia >> ...
            ...
            // done
        }
    }

Let us know when you've got this working.

Good Luck

Robert Ramey
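For readers wondering what an "i/ostream interface" over a socket involves, the core is a custom `std::streambuf` whose `underflow()` fetches the next batch of bytes. The sketch below is an assumption-laden stand-in: instead of a real socket it pulls from a queue of pre-recorded chunks, and `ChunkSourceBuf` is a hypothetical name. It only shows where the blocking would have to happen.

```cpp
#include <deque>
#include <istream>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical stand-in for a socket-backed stream buffer: hands the stream
// one "received" chunk at a time.
class ChunkSourceBuf : public std::streambuf {
public:
    explicit ChunkSourceBuf(std::deque<std::string> chunks)
        : chunks_(std::move(chunks)) {}

protected:
    // Called by the stream when its get area is exhausted. A real socket
    // version would have to wait in recv() here until data arrives, which is
    // exactly the call an async handler cannot afford to make.
    int_type underflow() override {
        while (!chunks_.empty() && chunks_.front().empty())
            chunks_.pop_front();  // skip empty chunks
        if (chunks_.empty())
            return traits_type::eof();
        current_ = std::move(chunks_.front());
        chunks_.pop_front();
        setg(&current_[0], &current_[0], &current_[0] + current_.size());
        return traits_type::to_int_type(*gptr());
    }

private:
    std::deque<std::string> chunks_;
    std::string current_;  // backing storage for the current get area
};
```

Wrapping it as `std::istream is(&buf);` gives an ordinary input stream, so values that straddle chunk boundaries are reassembled transparently. The price is that `underflow()` has only two answers, "here is more data" or "end of stream", with no way to say "nothing yet, call me again later".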

Jeff Flinn

2:29 p.m.

Robert Ramey wrote:

...

Scott Woods wrote:

If you're really a glutton for punishment consider the following:

make your sockets/tcp/whatever implementation with an i/ostream interface.

The punishment level may be reduced somewhat by using Jonathan's IOStream library. Jeff

Scott Woods

9:29 p.m.

From: "Robert Ramey" <ramey@rrsd.com> To: <boost@lists.boost.org> Sent: Thursday, December 30, 2004 4:02 PM Subject: [boost] Re: Re: Re: Serialization and async messaging

...

Receive
=====
    {
        // open an input socket stream
        socket_istream is(???);
        // all current archive classes use a basic stream interface, so
        {
            xml_iarchive ia(is);
            ia >> ...
            ...
            // done
        }
    }

Let us know when you've got this working.

Good Luck

Thanks :-) You will be the first to know.

ps: While I don't think it's significant anymore, the code sketch you give for receive still reflects file-based thinking. Here's another sketch. It's not intended to keep things bouncing, just a "view from the other side". This is the type of code I have running with my own "serialization" except no envelope layer (and basic serialization ;-).

    // Called by async framework
    void poll_or_FD_READ_handler( envelope_istream &is )
    {
        application_message am;
        envelope_istream::iterator i, e;

        // Following loop may run 0 or more times
        e = is.end();
        for( i = is.begin(); i != e; ++i )
        {
            // We have a completed envelope
            xml_iarchive ia( i->payload );
            ia >> am;   // Requires complete input data
            am();       // Application stuff
        }
        // Fall through here because there are no more complete
        // envelopes available on the connection *at this moment*.
        // The envelope_istream may at this point be holding an incomplete
        // parse, the assumption being that more data will arrive over the
        // network and this routine will be called again.
    }

Happy New Year!
