Re: RE: [boost] Re: Using Serialization for binary marshalling

Brian Braatz wrote:
Question- I was just looking for clarification. Does the boost serialization library allow me to have objects on different platforms stream to each other?
Yes. However, the archive must be designed to be platform independent. This is a little more subtle than it first appears. For example, consider the text archive. It stores all numbers as text delimited by spaces. That's portable. But suppose a machine which uses an 80-bit IEEE double stores a high-precision value to an archive. Another machine that reads the archive might only have a 64-bit double, so some of the original precision would be lost. Of course, the original value would have been preserved to the extent possible on the new machine. Even then, moving to an EBCDIC machine would require a custom codecvt facet which is not in the library. Is this "portable"? You be the judge. Within the limitations of the above, text and xml archives can be considered portable.

The binary archive is really "native binary" in that it doesn't make conversions for integer size or endianness. (It does check that the endianness and integer size are the same on the saving and loading machines, so it might be portable between some pairs of machines.) In general, the binary archive should be considered non-portable and not suitable for archives meant to be transferred between non-identical architectures.

There has been lots of interest in making XDR and CDR archives. I really haven't considered the issues in using XDR or CDR. I don't know whether these are just standards for encoding low-level types or describe a whole protocol. If it's the former, it would be pretty simple to implement. Otherwise, we have the problem of matching the serial representation of some arbitrary C++ structure with some externally defined protocol. Though it might be possible in certain cases, this is not something that the serialization system has been designed to do.

http://lists.boost.org/MailArchives/boost/msg63695.php

The package includes a "portable" archive as a demo. It doesn't include code for floats and longs.
The portable archive includes a lead byte for each integer indicating the number of bytes required to hold the value. If the integer doesn't "fit" in the loaded type, an exception is thrown. The usage of boost::uint_16 etc. doesn't have to be explicitly considered. I didn't have a good method for handling floats and doubles across machine architectures. Now that I have suitable code for handling portable floats/doubles (http://lists.boost.org/MailArchives/boost/msg63672.php), I might think of promoting portable_binary_?archive to a first-class supported archive. It would be complete and general, but depend on the serialization library being implemented on the "other" machine. Not an unreasonable restriction, in my view. For those who must have CDR/XDR, I have no final answer.

Robert Ramey
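[Editorial note: the lead-byte scheme described above can be sketched as follows. This is an illustrative reconstruction, not the actual demo code; the function names (save_integer, load_integer) and the little-endian byte order are assumptions.]

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Each (unsigned) integer is written as a count byte followed by that
// many value bytes. Signed values and floats are not handled here.
void save_integer(std::vector<unsigned char>& out, std::uint64_t v) {
    unsigned char bytes[8];
    std::size_t n = 0;
    do {
        bytes[n++] = static_cast<unsigned char>(v & 0xFF);
        v >>= 8;
    } while (v != 0);
    out.push_back(static_cast<unsigned char>(n));  // lead byte: byte count
    for (std::size_t i = 0; i < n; ++i)
        out.push_back(bytes[i]);
}

// On load, throw if the stored value does not "fit" in the loaded type,
// mirroring the behaviour described above.
template <typename T>
T load_integer(const std::vector<unsigned char>& in, std::size_t& pos) {
    std::size_t n = in.at(pos++);
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i)
        v |= static_cast<std::uint64_t>(in.at(pos++)) << (8 * i);
    if (v > static_cast<std::uint64_t>(std::numeric_limits<T>::max()))
        throw std::overflow_error("integer too large for the loaded type");
    return static_cast<T>(v);
}
```

Note that with this scheme the saving side never needs to name boost::uint_16 and friends explicitly: the width on the wire is determined by the value, and the width check happens only on load.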

Brian Braatz wrote:
Question- I was just looking for clarification. Does the boost serialization library allow me to have objects on different platforms stream to each other?
You will need two ingredients:

1.) a binary portable archive, e.g. the CDR or XDR archives discussed
2.) replacements for the serialization of the standard containers, since the ones provided by Robert store the sizes as int (instead of a portable integer type).

One possibility would be to have a traits type in the serialization for these containers, which defaults to int (or better yet std::size_t) but which could be specialized for the portable archives.

Matthias
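[Editorial note: the traits idea above might look like the following sketch. All names here (container_size_type, portable_binary_oarchive, save_vector) are illustrative assumptions, not Boost.Serialization API.]

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Trait selecting the type used to serialize a container's size.
// Defaults to std::size_t, as Matthias suggests.
template <class Archive>
struct container_size_type {
    typedef std::size_t type;
};

struct portable_binary_oarchive;  // hypothetical portable archive

// A portable archive pins the on-the-wire size type explicitly.
template <>
struct container_size_type<portable_binary_oarchive> {
    typedef std::uint64_t type;
};

// A replacement container save routine would then write the size
// through the trait instead of as a plain int:
template <class Archive, class T>
void save_vector(Archive& ar, const std::vector<T>& v) {
    typename container_size_type<Archive>::type n = v.size();
    ar << n;  // width chosen per archive, not hard-coded as int
    for (const T& x : v)
        ar << x;
}
```

The point of the trait is that only the archive author decides the wire width; user code serializing a std::vector never mentions a fixed-size type.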

Matthias Troyer wrote:
Brian Braatz wrote:
Question- I was just looking for clarification. Does the boost serialization library allow me to have objects on different platforms stream to each other?
You will need two ingredients:
1.) a binary portable archive, e.g. the CDR or XDR archives discussed
2.) replacements for the serialization of the standard containers, since the ones provided by Robert store the sizes as int (instead of a portable integer type).

One possibility would be to have a traits type in the serialization for these containers, which defaults to int (or better yet std::size_t) but which could be specialized for the portable archives.
Why is it not possible to store an 'int' in a portable binary archive? If I understand XDR properly, RFC 1014 defines an implicit mapping between C datatypes and XDR datatypes such that int maps onto 'Signed Integer', unsigned maps onto 'Unsigned Integer', and so on. All of the XDR stream functions in the RPC toolkits use this mapping too. If the target platform has an int type that does not correspond to an XDR 'Signed Integer' then too bad, only the common subset of values will be available to the program.

If the boost serialization library chooses a different mapping -- AFAICT the suggestions so far have been to use int_32 -> Signed Integer, int_64 -> Hyper Integer, and so on -- then this violates the usual XDR mapping and there is no chance that the resulting program would be wire-compatible with anything other than itself (and possibly not even then, if it is compiled on a platform that doesn't have int_64, for example).

Requiring that the only types that can be portably serialized are the fixed-size typedefs is far too high a burden IMHO. There is no way to do a compile-time (or even a runtime) check that some user hasn't accidentally serialized an 'int' - except that it will mysteriously break on some platform. And what if you already have a std::vector<int> that you want to serialize? Should the user really be required to copy it into a std::vector<int_32> first?

It seems to me there are two mutually exclusive (incompatible) choices for a binary archive. Either specify mappings for the builtin types, int -> X, long -> Y, etc., and use only those types (no serializing a size_t or int_32), *OR* specify mappings between fixed-size types, int_32 -> X, int_64 -> Y, etc., and use only those types (no serializing a size_t or a plain builtin). Better to make a choice now and be consistent, because changing it later will be a nightmare. My clear preference is the first option.
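[Editorial note: for concreteness, the RFC 1014 mapping referred to above encodes a native int as a 4-byte big-endian two's-complement value. A minimal sketch, with illustrative function names:]

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// XDR Signed Integer: 4 bytes, big-endian, two's complement.
std::array<unsigned char, 4> xdr_encode_int(std::int32_t v) {
    std::uint32_t u = static_cast<std::uint32_t>(v);
    return { static_cast<unsigned char>(u >> 24),
             static_cast<unsigned char>(u >> 16),
             static_cast<unsigned char>(u >> 8),
             static_cast<unsigned char>(u) };
}

std::int32_t xdr_decode_int(const std::array<unsigned char, 4>& b) {
    std::uint32_t u = (std::uint32_t(b[0]) << 24)
                    | (std::uint32_t(b[1]) << 16)
                    | (std::uint32_t(b[2]) << 8)
                    |  std::uint32_t(b[3]);
    return static_cast<std::int32_t>(u);
}
```

Under the "usual XDR mapping" Ian describes, a plain int is fed through exactly this encoding regardless of its native width; values outside the 32-bit range are simply not representable on the wire.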
In my own serialization code (which I hope one day to layer on top of boost serialization), I define several formats [currently 3: XDR, LE_LP32 (little-endian 32-bit pointers & long, as on x86) and LE_LP64 (little-endian 64-bit pointers & long, as on alpha)], as in the following pseudo-code:

   struct XDR_tag {};
   struct LE_LP32_tag {};
   struct LE_LP64_tag {};

   template <typename FormatTag> struct Mapping;

   // XDR mapping
   template <>
   struct Mapping<XDR_tag> {
      typedef uint_32 size_type;

      template <typename T> struct BuiltinTypeToStreamType;

      template <>
      struct BuiltinTypeToStreamType<int> {
         // int maps onto signed 32 bits in XDR
         typedef int_32 type;
      };

      template <>
      struct BuiltinTypeToStreamType<long> {
         // long maps onto signed 32 bits in XDR
         typedef int_32 type;
      };
      // ...
   };

   // LE_LP64 mapping
   template <>
   struct Mapping<LE_LP64_tag> {
      typedef uint_64 size_type;

      template <typename T> struct BuiltinTypeToStreamType;

      template <>
      struct BuiltinTypeToStreamType<int> {
         typedef int_32 type;
      };

      template <>
      struct BuiltinTypeToStreamType<long> {
         typedef int_64 type;
      };
      // ...
   };

The serialization functions map type T to Mapping<Format>::BuiltinTypeToStreamType<T>::type, which is then converted to the appropriate endianness and sent to the stream. Low-level stuff that cares about size_t can use Mapping<Format>::size_type (the serialization of std containers does this). The only place where the fixed-size types appear is in the intermediate phase before hitting the stream. They don't need to be typedefs for builtin types; the only requirement is that the buffer knows how to stream them.

I also have some types (not typedefs!) to represent fixed-size integers if that is required - they just get mapped into the corresponding size type of the binary format. I've found that I haven't used them much, though. It would be possible to do something similar for size_t (which appears a lot, of course).
This would be simpler than looking into the Mapping<Format> traits, but I haven't implemented that yet. Does something like this structure make sense for the boost serialization library?

Cheers,
Ian McCulloch

On May 5, 2004, at 4:03 AM, Ian McCulloch wrote:
Why is it not possible to store an 'int' in a portable binary archive?
[snip]
It seems to me there are two mutually exclusive (incompatible) choices for a binary archive. Either specify mappings for the builtin types, int -> X, long -> Y, etc., and use only those types (no serializing a size_t or int_32), *OR* specify mappings between fixed-size types, int_32 -> X, int_64 -> Y, etc., and use only those types (no serializing a size_t or a plain builtin). Better to make a choice now and be consistent, because changing it later will be a nightmare.
My clear preference is the first option. [snip]
As I see it, the current serialization library allows both options, depending on your preferences. Any archive may choose which types it views as fundamental, but both have their disadvantages:

* serializing int and long always as 32 bits and long long as 64 bits has the following problems:
  - on 64-bit architectures a long can be 64 bits, and the non-standard long long might not be supported by the compiler
  - serializing the size of a container as a 32-bit signed integer will prohibit you from serializing containers with more than 2^31 entries. Note that we already encounter vectors with larger sizes in some of our codes.
  - serializing std::size_t with values larger than 2^32 might not be possible at all in a portable way.

* serializing int32_t and int64_t as the basic types causes other problems, as you stated:
  - the serialization of int, short and long becomes non-portable, since they might be int16_t, int32_t or int64_t depending on the platform.

Whatever choice we pick, there will thus be issues that one has to be aware of, and one has to be careful in the choice of fundamental types one serializes. This is no problem as long as the application programmer has full control over the types. In serializing the standard containers this is however NOT the case, since there the size is serialized as an int, which will not work for containers with more than 2^31 entries. Thus one will either have to reimplement these serialization functions, or be able to specify, e.g. by traits, which type should be used to serialize the size of a container.

Matthias

Matthias Troyer wrote: [...]
As I see it, the current serialization library allows both options, depending on your preferences. Any archive may choose which types it views as fundamental, but both have their disadvantages:
* serializing int and long always as 32 bits and long long as 64 bits has the following problems:
  - on 64-bit architectures a long can be 64 bits, and the non-standard long long might not be supported by the compiler
  - serializing the size of a container as a 32-bit signed integer will prohibit you from serializing containers with more than 2^31 entries. Note that we already encounter vectors with larger sizes in some of our codes.
  - serializing std::size_t with values larger than 2^32 might not be possible at all in a portable way.
Yes. One possibility would be to serialize everything as the largest plausible width (say 64 bits), and fail at runtime if we try to read something from the stream that would overflow. This is about as portable as we can get, I think - most/all platforms of interest have a 64-bit type, and a vector with more entries than will fit in a size_t won't run no matter what binary format we use. This would be unacceptably slow in some important circumstances though (say, MPI on an SMP machine or a fast network), and result in a pessimistically big archive. OTOH, I'm not sure that high-performance MPI is on the radar for boost::serialization...

In my own codes, the format of a serialized object defaults to whatever is closest to the 'native' format; LE_LP32 on x86 and LE_LP64 on alpha (and presumably x86-64). It would be straightforward to add formats for other common platforms. In principle, if a calculation running on an x86 machine fails because it tries to expand a container beyond 2^32 (or more likely, 2^31) entries, then it would be possible to take the last checkpoint file and continue the calculation on a 64-bit machine, where there would be no such limitation. The later checkpoint files would contain objects serialized in LE_LP64 format. Trying to restart those checkpoints on a 32-bit machine would cause a boost::numeric_cast<> to fail, but only if there are any (64-bit) size_t records in the stream that are larger than 2^32. I don't regard this situation as substantially different from, say, trying to read an archive onto a machine that doesn't have enough memory. If there are no such overflows, then it would run with no problems.
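[Editorial note: the "widest plausible width" approach above amounts to the following check on load, which is essentially what boost::numeric_cast<> does. A minimal sketch; load_size is an illustrative name.]

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Sizes always travel as 64 bits on the wire; loading narrows to the
// local std::size_t with a runtime range check, so the archive only
// fails if the stored value genuinely cannot be represented locally.
std::size_t load_size(std::uint64_t stored) {
    if (stored > static_cast<std::uint64_t>(
                     std::numeric_limits<std::size_t>::max()))
        throw std::range_error(
            "archived container too large for this platform");
    return static_cast<std::size_t>(stored);
}
```

On a 64-bit platform the check can never fire; on a 32-bit platform it fires exactly for the >2^32 records Ian describes, matching his point that the failure is no worse than running out of memory.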
* serializing int32_t and int64_t as the basic types causes other problems, as you stated:
  - the serialization of int, short and long becomes non-portable, since they might be int16_t, int32_t or int64_t depending on the platform.

Whatever choice we pick, there will thus be issues that one has to be aware of, and one has to be careful in the choice of fundamental types one serializes. This is no problem as long as the application programmer has full control over the types. In serializing the standard containers this is however NOT the case, since there the size is serialized as an int, which will not work for containers with more than 2^31 entries. Thus one will either have to reimplement these serialization functions, or be able to specify, e.g. by traits, which type should be used to serialize the size of a container.
It's not so much the application programmer who has control over the types; rather, the archive designer has to dictate to the application programmer what types can be serialized. For sure, the serialization library should allow some choice for the representation of size_t, since using an int rules out serializing large containers on a 64-bit machine.

Cheers,
Ian
participants (3)
- Ian McCulloch
- Matthias Troyer
- Robert Ramey