Re: [boost] Re: Re: Using Serialization for binary marshalling

In-Reply-To: <95C853D0-9EAF-11D8-BFD9-000A95DC1C98@itp.phys.ethz.ch> troyer@itp.phys.ethz.ch (Matthias Troyer) wrote (abridged):
I would use variable-length integers: use as many bytes as you need for the integer's actual value. That way the archive format is independent of whether short, int, long or some other type was used by the outputting program. It can also give you byte-order independence for free.

Specifically, I'd use a byte-oriented scheme where the low 7 bits of each byte contribute to the current number, and the high bit says whether there are more bytes to come.

    void load( uintmax_t &result ) {
        result = 0;
        while (true) {
            unsigned char byte;
            *this >> byte;
            result = (result << 7) | (byte & 0x7f);
            if ((byte & 0x80) == 0)
                return;
        }
    }

Thus integers less than 128 take 1 byte, integers less than 16,384 take 2 bytes, and so on. This gives a compact representation while still supporting 64-bit ints and beyond. You can use boost::numeric_cast<> or similar to bring the uintmax_t down to a smaller size.

-- Dave Harris, Nottingham, UK
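The matching save() is not shown in the post; a minimal sketch of what the encoding side might look like, assuming the same MSB-first 7-bit grouping that the load() above expects (the function and buffer names are illustrative, not from the original):

    #include <cstdint>
    #include <vector>

    // Emit the 7-bit groups most-significant first, setting the high bit
    // on every byte except the last, so load() can reassemble the value.
    void save_varint( std::vector<unsigned char> &out, std::uintmax_t value )
    {
        unsigned char buf[(sizeof(std::uintmax_t) * 8 + 6) / 7];
        int n = 0;
        do {
            buf[n++] = static_cast<unsigned char>(value & 0x7f);
            value >>= 7;
        } while (value != 0);
        // buf holds the groups least-significant first; write them in
        // reverse so the most significant group goes out first.
        for (int i = n - 1; i > 0; --i)
            out.push_back(buf[i] | 0x80);   // more bytes follow
        out.push_back(buf[0]);              // last byte: high bit clear
    }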

Dave Harris wrote:
...
I would use something similar, but I would store all integers as two's-complement signed values. An unsigned value with n significant bits would be stored as a signed value of no less than n + 1 bits. This means that if the type of a variable is changed from a signed type to an unsigned type, or vice versa, the program can still read all old archives, so long as all of the values in the old archives fit into the new variables.

-- Rainer Deyke - rainerd@eldwood.com - http://eldwood.com
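One way to read such a signed, two's-complement variable-length value, assuming the writer emits the value MSB-first in 7-bit groups, sign-extended to a whole number of groups so that bit 6 of the first byte carries the sign. This is only an illustrative sketch, not code from the thread:

    #include <cstdint>

    // Decode a signed variable-length integer from a byte buffer and
    // return a pointer just past the bytes consumed.
    const unsigned char *load_signed( const unsigned char *p,
                                      std::intmax_t &result )
    {
        unsigned char byte = *p++;
        // Start from all-ones for negative values so the shifts below
        // effectively sign-extend the result.
        std::uintmax_t acc = (byte & 0x40) ? ~std::uintmax_t(0) : 0;
        acc = (acc << 7) | (byte & 0x7f);
        while (byte & 0x80) {
            byte = *p++;
            acc = (acc << 7) | (byte & 0x7f);
        }
        // Two's-complement wrap-around on the conversion is assumed here.
        result = static_cast<std::intmax_t>(acc);
        return p;
    }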

Dave Harris wrote:
Nice idea. I was thinking along the lines of how to achieve high performance, with shared memory and fast network message passing in mind. I think now, though, that this is probably too specialized an application for boost::serialization.

Cheers, Ian

On May 5, 2004, at 4:11 PM, Ian McCulloch wrote:
I disagree. The above ideas would be for archive formats geared towards persistence, where speed is not the main issue. All we need to achieve high performance is an archive designed for this purpose, with specialized serialization functions for the standard containers. Portability issues can be ignored, since the ultra-fast networks (such as InfiniBand, Cray Red Storm or similar) connect homogeneous nodes with the same hardware. The same serialization code could then be used both for fast network communication and for serialization to a portable archive for persistence purposes, just by using different archive types.

Matthias
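As an illustration of the kind of archive Matthias describes, here is a minimal sketch (the class and member names are illustrative, not the Boost.Serialization API) in which vectors of trivially copyable element types are written as one raw memory block rather than element by element:

    #include <cstddef>
    #include <vector>

    // A non-portable, homogeneous-node archive: everything goes into a
    // flat byte buffer that can be handed to the network layer or
    // written to disk as-is.
    class fast_oarchive
    {
    public:
        explicit fast_oarchive( std::vector<char> &buf ) : buf_(buf) {}

        void save_binary( const void *data, std::size_t size )
        {
            const char *p = static_cast<const char *>(data);
            buf_.insert(buf_.end(), p, p + size);
        }

        // Specialized overload for std::vector of POD-like types:
        // a length prefix followed by the raw element bytes.
        template <class T>
        fast_oarchive &operator<<( const std::vector<T> &v )
        {
            std::size_t n = v.size();
            save_binary(&n, sizeof(n));
            if (n != 0)
                save_binary(&v[0], n * sizeof(T));
            return *this;
        }

    private:
        std::vector<char> &buf_;
    };

A portable text or XML archive could reuse the same serialize() functions for persistence, while an archive type like this handles the fast path.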

Matthias Troyer wrote:
Sure. I realized shortly after I wrote that, that I had failed to mention an important feature that I need - namely, the message-passing format needs to be the same as (or at least readable by) the on-disk format. This is because I use a pair of smart pointers (pref_handle<T> and pref_ptr<T>), the first of which doesn't care whether the object is in memory or on disk (or conceivably, on some other node). The second pointer loads the object into memory (if it wasn't already) and gives access to it. But in general an object that is not in memory has no type identification with it (polymorphic types do, but value types don't - that would be really excessive overhead).

Sending a pref_handle<T> that exists only on disk (cache) across MPI then just requires sending the byte stream itself, with no extra step of deserializing from the disk format and re-serializing into the network format. Conversely, it is possible that on the receiving end an object will be loaded into a pref_handle<T> and not deserialized immediately, but ending up with an object serialized in the network format on the disk would be a disaster.

Perhaps a better design would be to keep close track of streams that are serialized in the network format and just make sure that they never hit the disk. I'm not sure whether this is even possible with my current scheme, but I'll look into it. But this would solve all(?) portability issues with respect to serializing typedefs etc. OTOH, with the current scheme it is possible to be somewhat lazy about whether the objects are in memory or not, since the serialization is so fast (and there is of course a rather large cache before anything actually hits the disk, not to mention the OS buffer cache).

Cheers, Ian
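To make the point about forwarding the cached bytes concrete, here is a purely illustrative sketch; pref_handle<T> below is a stand-in for Ian's class, reduced to "a handle whose payload may exist only as a serialized byte stream", and only the MPI_Send calls are real API:

    #include <mpi.h>
    #include <vector>

    // Stand-in for the real pref_handle<T>: the payload is the serialized
    // byte stream as it sits in the disk cache; no type information
    // travels with it for value types.
    template <class T>
    class pref_handle
    {
    public:
        const std::vector<char> &serialized_bytes() const { return bytes_; }
    private:
        std::vector<char> bytes_;
    };

    // Forward the handle to another node by sending the cached bytes
    // verbatim: no deserialize/re-serialize round trip is needed, because
    // the on-disk and message-passing formats are the same.
    template <class T>
    void send_handle( const pref_handle<T> &h, int dest, int tag, MPI_Comm comm )
    {
        const std::vector<char> &b = h.serialized_bytes();
        int n = static_cast<int>(b.size());
        MPI_Send(&n, 1, MPI_INT, dest, tag, comm);
        if (n != 0)
            MPI_Send(const_cast<char *>(&b[0]), n, MPI_BYTE, dest, tag, comm);
    }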
participants (4)
- brangdon@cix.compulink.co.uk
- Ian McCulloch
- Matthias Troyer
- Rainer Deyke