Using serialization for replication

Hey,

I want to use serialization for some kind of active replication. The biggest barrier to this is the fact that serialization does not allow me to put the same object into the stream twice. To be more precise, I want to be able to serialize both to a network and to disk, and I very much like the elegance of the serialization approach (plus the fact that it can re-create pointer references, arrays, and such). So my questions are:

a) How difficult would it be to allow an object to be serialized twice, where on deserialization the second copy would more or less do an operator= on the first copy (instead of creating a new object)?

b) How difficult would it be to have more or less an 'appending' serialization stream - i.e. I deserialize what I have previously stored on disk, and then continue to append to the serialization stream (which appends to the file from then on), making my serialization more or less unbounded?

c) Is there currently an unbounded serialization stream? Obviously the XML stream will not be 'complete' until you close the serialization stream and it can append any close tags it needs to. But if I wanted to continually write to a stream throughout the time my program is running, and, if it crashes, be able to use that stream to get back to the state I was in, is it possible?

I realize that in some respects this is kind of hammering a square peg into a round hole, since serialization seems to have been designed for 'one-shot writes', but serialization and replication are very similar in nature. -- PreZ :) Death is life's way of telling you you've been fired. -- R. Geis

I'm not really sure what you want to do but I'll attempt to answer anyhow. prez@neuromancy.net wrote:
Hey,
I want to use serialization for some kind of active replication,
It would seem to me that the easiest way to do this would be to make a TEE type streambuf using the stream buffer library. This would duplicate each write to an additional stream.
however the biggest barrier to this is the fact that serialization does not allow me to put the same object into the stream twice. To be more precise, I want to be able to serialize both to a network and to disk, and I very much like the elegance of the serialization approach (plus the fact it can re-create pointer references, arrays, and such).
I believe the above would cover this.
b) How difficult would it be to have more or less an 'appending' serialization stream - ie. I deserialize what I have previously stored on disk, and then continue to append to the serialization stream (which appends to the file from then on), making my serialization more or less unbounded.
I think something like that can be done now and is in fact already being done. I know that some people are "embedding" serialization data inside of other data by just passing the stream buffer around without closing it.
c) Is there currently an unbounded serialization stream? I mean, obviously the XML stream will not be 'complete' until you close the serialization stream and it can append any close tags it needs to, however if I wanted to continually write to a stream throughout the time my program is running, and if it crashes, be able to use that to get back to the state I was in, is it possible?
I believe that using "no_header" on opening might get you what you want. Robert Ramey

Robert Ramey <ramey@rrsd.com> wrote:
I'm not really sure what you want to do but I'll attempt to answer anyhow.
What I'm trying to do is be able to take live objects, and send them to disk and/or another application every time they change (including when they are created). The idea being that another application is kept in sync with the first, and that if the application goes down, I can replay the disk version, and when the replay is done, the application will be at the same state it was in before it went down.
I want to use serialization for some kind of active replication,
It would seem to me that the easiest way to do this would be to make a TEE type streambuf using the stream buffer library. This would duplicate each write to an additional stream. Do you mean with boost::iostream? or as a part of boost::serialization?
however the biggest barrier to this is the fact that serialization does not allow me to put the same object into the stream twice. To be more precise, I want to be able to serialize both to a network and to disk, and I very much like the elegance of the serialization approach (plus the fact it can re-create pointer references, arrays, and such).
I believe the above would cover this. TEE would handle going to both network and disk, but it would not obviate the 'single object only once' problem. According to the serialization documentation (Reference -> Special Considerations -> Object Tracking (http://www.boost.org/libs/serialization/doc/special.html#objecttracking)), an object may only be put on the stream once; I cannot put an object that has been changed on the stream again to be re-serialized (either by replacing the previously serialized entry, or by adding it to be serialized again without allocating a new object).
This is why I was asking in the first place how difficult the modifications would be to allow an object to be serialized twice, and serialization to understand this and not create a separate instance, but just update the existing instance. Thanks for your help :) -- PreZ :) Death is life's way of telling you you've been fired. -- R. Geis

Preston A. Elder wrote:
It would seem to me that the easiest way to do this would be to make a TEE type streambuf using the stream buffer library. This would duplicate each write to an additional stream. Do you mean with boost::iostream? or as a part of boost::serialization?
Remember that serialization uses streambuf for doing the actual i/o. Hence anything that streambuf (boost streams) implements, such as compression, duplication, etc., is "inherited" by boost serialization.
however the biggest barrier to this is the fact that serialization does not allow me to put the same object into the stream twice. To be more precise, I want to be able to serialize both to a network and to disk, and I very much like the elegance of the serialization approach (plus the fact it can re-create pointer references, arrays, and such).
I believe the above would cover this. TEE would handle going to both network and disk, but it would not obviate the 'single object only once' problem. According to the serialization documentation (Reference -> Special Considerations -> Object Tracking
This would be done by setting the serialization trait "tracking" to track_never. This would inhibit the checking for duplicates. This would occur before it gets to the streambuf implementation.
(http://www.boost.org/libs/serialization/doc/special.html#objecttracking)), an object may only be put on the stream once; I cannot put an object that has been changed on the stream again to be re-serialized (either by replacing the previously serialized entry, or by adding it to be serialized again without allocating a new object).
that's what "track_never" is for.
This is why I was asking in the first place how difficult the modifications would be to allow an object to be serialized twice, and serialization to understand this and not create a separate instance, but just update the existing instance.
So track_never permits the object to be written multiple times. Using a custom streambuf would place the serialized output into multiple streams. Robert Ramey

Robert Ramey <ramey@rrsd.com> wrote:
This would be done by setting the serialization trait "tracking" to track_never. This would inhibit the checking for duplicates. This would occur before it gets to the streambuf implementation.
This has one other side-effect though. It also means I cannot have pointer references. Consider, say, a tree - where nodes have pointers to the next entry, parent entry, and first child entry. I want to be able to pass a node and have those pointers re-established, just as they would be with tracking on. However, I also want to be able to re-pass a node if it gets updated (e.g. if it's a 'node with data', or even if I now have a new first child or a new next sibling). Assume for this case that a node is serialized on creation, and its pointers will either be NULL or refer to previously serialized objects. Turning off tracking means serialization does not know how to re-establish those links, and I'd end up with duplicate copies of the same node.

As previously mentioned, being able to serialize the same object over and over is only part of replication - the real core is to be able to UPDATE a node that has been previously serialized, without having to a) de-serialize a new object, look up the original object and then operator=, and b) forego being able to serialize a pointer to that class type and have serialization realize it has already seen that object and thus just make the pointer point to the same thing.

Right now, I'm going to hack my way around it by having tracking turned on, so my pointers get re-established, and creating a derived class that exists just to create a new type that I can disable tracking for (since using a typedef will not work). The idea is that if the object has already been serialized once, I'll follow up the object with a derived version of the object. The deserializing procedure will similarly check whether the object has previously been deserialized and, if so, expect a follow-up object and then just operator= the original object (a pointer to which I will have thanks to deserialization's tracking) from the follow-up object. In pseudocode:

Serialize:

    Object *myobj = ...;
    /* ... */
    ar & myobj;
    if (myobj->previously_serialized())
        ar & (NonTrackingObject *) myobj;

Deserialize:

    Object *myobj;
    ar & myobj;
    if (myobj->previously_deserialized())
        ar & *myobj;

If my understanding is correct, if a tracked object has previously been serialized (by pointer), any further attempt to deserialize a pointer to that object will merely set the pointer to the previously deserialized version. Thus, when a previous deserialization has happened, the first ar & myobj will only set myobj's pointer, and the second, because the object in this case is non-tracking, will carry the actual data I need; since I deserialize into the dereferenced pointer, it will deserialize into that object. Of course, at any other place I deserialize the same pointer, I would not do this check, since I want only a reference. -- PreZ :) Death is life's way of telling you you've been fired. -- R. Geis

I also have another follow-up question regarding the same thing. I'm pretty sure this is possible from reading the docs, but it's not a documented feature (I'm not surprised, really). How do I do the equivalent of ia.reset_object_address(v, u) for an object that has NOT been serialized with that archive?

The situation is this: I will obviously have to have an iarchive and an oarchive instance - to accommodate failing over between instances of my application (remember, my [io]archive instances will remain active throughout the application). Therefore, when I send something to an oarchive (i.e. the primary replicating 'out'), I also need to add that reference to the pointer tree maintained by the iarchive. Similarly, when I receive the item via an iarchive (i.e. the secondary replicating 'in'), I need to add that reference to the pointer tree maintained by the oarchive. If I could make an iarchive and oarchive use a common pointer reference tree, that would be ideal (I'm not very worried about thread safety, since only one would use it at a time); if not, I need to keep the complementary object references up to date in case of fail-over.

Why? Because if my primary goes down and I fail over to my secondary, the secondary now becomes the 'master' and starts replicating out itself. Because of this, it needs to be able to pick up where the old primary left off (both in case of any tertiary instance listening, and because if the primary comes back up it will want to be replicated to again, becoming the secondary). The more I think about this, the more it seems I'm trying to hammer a round peg into a square hole - especially considering I know that each tracked object is assigned an ID, and that ID would have to be maintained between the iarchive and oarchive instances to be able to change from a consumer into a producer like that and have third-party consumers not notice the difference. Any ideas, etc. would be welcome.
Also, might I make a request to the maintainer of serialization to hopefully turn boost::serialization into something more suitable for replication purposes in a future release of boost? The interface of boost::serialization is fantastic, but I think the implementation needs a few more knobs and switches to enable a wider variety of purposes. -- PreZ :) Death is life's way of telling you you've been fired. -- R. Geis

Preston A. Elder wrote:
Any ideas, etc. would be welcome.
I'm not exactly sure what you're trying to do - but no matter, here's my idea anyway. Create a TEE type streambuf. This would model the std::streambuf that the standard library uses. It would most likely be built with the i/o streams library. All data written to the streambuf would in fact be written to multiple stream implementations. This would get you replication for free.

In fact, what would be more useful would be an i/o stream adapter which would take any number of streambufs and compose them into one TEE type streambuf. This would permit one to leverage all the streambufs already created. It would mean that the streambufs would not all have to be the same type: some could be binary, others could be file based, others could be network connections, etc. This is something that could/should be added to the i/ostreams library - if it isn't already there.

The counterpart of this - reading back one of the archives in the same application - would read one of the streams in the TEE. Remember that all information concerning the state of the archive - addresses of created pointers, class ids, etc. - is local to the archive. So there would be no conflict.
Also, might I make a request to the maintainer of serialization to hopefully turn boost::serialization into something that can be more suitable for replication purposes in a future release of boost?
So I don't see serialization as the right place to implement such functionality.
The interface of boost::serialization is fantastic, but the implementation I think needs a few more knobs and switches to enable a wider variety of purpose.
LOL - the reason the interface is "fantastic" is mainly due to my single-minded dedication to keeping it that way. The way I've done this is to keep everything out of it that can possibly be put somewhere else. I realize that this sometimes might seem limiting - but in fact it's liberating. It has kept serialization from turning into the C++ equivalent of Microsoft Word - where it would do everything everyone wanted, if anyone could ever figure out how to make it do what it is they want. In spite of this, the serialization library implementation is still quite complicated.

I have toyed with experiments to make the serialization library more useful for things like logging, rollback, and recovery. But the experiments have been unsuccessful so far, in that they end up either making the library harder to use or less efficient. If I had nothing else to do (or someone was paying me to do this) I might spend more time at it. But for the near term I don't see any new functionality being added to the serialization library. I spend the time I have on incremental efficiency improvements and on keeping it buildable in a changing infrastructure (bjam v2, new test library, new compilers - borland), etc. I'm pleased you seem to like the library and have found it useful. Robert Ramey

Sorry if you see this twice, but I don't think the original reply was sent. Robert Ramey <ramey@rrsd.com> wrote:
create a TEE type streambuf. This would model the std::streambuf that the standard library uses. It would most likely be built with the i/o streams library. All data written to the streambuf would in fact be written to multiple stream implementations. This would get you replication for free. In fact, what would be more useful would be an i/o stream adapter which would take any number of streambufs and compose them into one TEE type streambuf. This would permit one to leverage all the streambufs already created. It would mean that the streambufs would not all have to be the same type: some could be binary, others could be file based, others could be network connections, etc. This is something that could/should be added to the i/ostreams library - if it isn't already there.
The counterpart of this - reading back one of the archives in the same application - would read one of the streams in the TEE. Remember that all information concerning the state of the archive - addresses of created pointers, class ids, etc. - is local to the archive. So there would be no conflict.
If I used this method I would end up with objects being duplicated!
If I had a TEE style object and had:
- 1 endpoint going to a local input stream
- 1 endpoint going to disk
- 1 endpoint going to a remote system (via X transport method)
I would end up with multiple objects because of the first endpoint! Every time the first endpoint saw a new object, it would allocate that object and deserialize it, just like the remote one would (and should) do. This would mean every object would be there twice! If, however, I could share the tracking map (e.g. create a tracking map, then pass it to the constructor of both the input and output serializer, or alternatively set it later), then this would not be an issue.
Also, might I make a request to the maintainer of serialization to hopefully turn boost::serialization into something that can be more suitable for replication purposes in a future release of boost? So I don't see serialization as the right place to implement such functionality. Perhaps you're correct, perhaps CORBA is more appropriate.
However, AFAIK CORBA doesn't work so well when loading from disk. Plus CORBA doesn't solve one of my requirements. My requirements are simple:
- Be able to restore the application to the same state it was in when it died, from a file on disk (persistence).
- Be able to keep another instance of the application up to date in real time, and be able to fail over to that instance if necessary.
- Be able to fail BACK to the original instance if necessary (e.g. it is re-started and ready to once again be 'primary').
Serialization can handle all of these for me with some trickery to make it handle 'updating' objects instead of just creating them (previously mentioned in this thread). However, there is one thing that is dangerous, and that is the fact that each instance of the application will have to have an iarchive and an oarchive at all times. And they will need to have their object tracking in sync.
LOL - the reason the interface is "fantastic" is mainly due to my single-minded dedication to keeping it that way. The way I've done this is to keep everything out of it that can possibly be put somewhere else. I realize that this sometimes might seem limiting - but in fact it's liberating. It has kept serialization from turning into the C++ equivalent of Microsoft Word - where it would do everything everyone wanted, if anyone could ever figure out how to make it do what it is they want. In spite of this, the serialization library implementation is still quite complicated.
I know, I've looked at the code :)
However, I'm not actually asking you to change much - just to provide the ability for two archives to share the same object tracking backend. A simple ability to do CreateObjectTracker() and then pass the result to the constructor of any archive I create after that would be sufficient, even if the return value is completely opaque. And of course, if implemented as an optional constructor argument, the default action could be to call that same function anyway.

I could still use boost's serialization without this functionality, however I would lose the biggest advantage (and strength) of serialization, namely the automatic restoration of pointers. In other words, I could easily enough just create a new archive instance each time I want to serialize an object (or simply turn off tracking for all objects), however this would mean all the book-keeping serialization does for me with previously seen objects and restoring pointers would now have to be done by me, and more importantly, done manually - increasing the possibility of missing something.

The serialization library is VERY close to the functionality required for replication (which is more or less a specialized form of serialization anyway); it just has a few specific requirements that I don't believe would change the way serialization works or complicate it much more than it is now.
I'm pleased you seem to like the library and have found it useful.
I just like the interface, it's very clean.
-- PreZ :) Death is life's way of telling you you've been fired. -- R. Geis

Preston A. Elder wrote:
Sorry if you see this twice, but I don't think the original reply was sent.
If I used this method I would end up with objects being duplicated!
If I had a TEE style object and had:
- 1 endpoint going to a local input stream
- 1 endpoint going to disk
- 1 endpoint going to a remote system (via X transport method)
I would end up with multiple objects because of the first endpoint! Every time the first endpoint saw a new object, it would allocate that object and deserialize it, just like the remote one would (and should) do. This would mean every object would be there twice!
If, however, I could share the tracking map (eg. create a tracking map, then pass it to the constructor of both the input and output serializer, or alternatively, set it later or whatever), then this would not be an issue.
The serialization library is VERY close to the functionality required for replication (which is more or less a specialized form of serialization anyway), it just has a few specific requirements that I don't believe would change the way serialization works or complicate it much more than it is now.
Perhaps you might want to experiment with this idea by fiddling with the serialization source. Note that the "tracking map" is part of the implementation of basic_iarchive. This is not exposed as you would like, but there's no reason you can't tweak the source to make it visible. Then you could implement what it seems you want. Maybe that's a good solution for you. Note that lots can be done by deriving from the existing archives or by making an "Archive Adaptor" in a vein similar to the polymorphic_iarchive. Robert Ramey
participants (3)
- Preston A. Elder
- prez@neuromancy.net
- Robert Ramey