Re: [Boost-users] cross-platform binary serialization?
IMO the pragmatic approach would be to settle with one fp format and
implement (more or less trivial) transformation functions to support the
platforms that don't implement it natively. Picking the IEEE 754 standard would
seem fine unless somebody can point out gaps that prevent it from conveying
features other formats can.
Going further, to accommodate the somewhat rare cases where IEEE754 would
not be the best choice the library could support multiple representations
and platform-specific transformations between them leaving the choice of
the FP format used by the archive (which could be modified at any time
using manipulators) to the programmer.
Michal Strzelczyk
Daryle Walker
I have an alternate suggestion: what about continued fractions? Turn the floating point value into a list of integers. This works no matter what f.p. systems the source and destination use. You just need a portable integer serialization format.
Then why can't we use the IEEE 754 format for float serialization on all systems?
Because I don't think that the IEEE-754 internal format is as stable as you think it is. (Looking at Wikipedia's entries on IEEE-754 and 754r, there is a standard conceptual format, but the problem is how various implementations carry out the internal bit-wise format. Room for interpretation will doom your plan.)
1. If a system is IEEE 754 compatible, serialize float numbers directly. This will work 99.9% of the time, and will be very efficient. 2. If a system is not IEEE 754 compatible, to serialize a float number we write the number as 32 or 64 contiguous bits in IEEE 754 format; to deserialize a float number, we read the number in IEEE 754 format and write it in the native float format. An intermediate string representation can be used for the translation.
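The first case above can be sketched in a few lines. This is my own illustration, not code from any proposed archive: assuming the host uses a 32-bit IEEE 754 `float`, the bit pattern is copied into a fixed-width integer, which any portable integer serialization can then carry.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copy a float's bits into a fixed-width integer, bit for bit.
// Valid only on platforms where float is a 32-bit IEEE 754 value.
uint32_t float_to_bits(float f) {
    static_assert(sizeof(float) == sizeof(uint32_t), "expects 32-bit float");
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // no conversion, just a copy
    return bits;
}

// The inverse: reinterpret the stored bits as a float.
float bits_to_float(uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

On an IEEE 754 host this round-trips exactly, which is the efficiency argument made above.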
The string or "direct" conversions may introduce rounding errors.
I mean, these contiguous IEEE 754 bits are equivalent to your "portable integer serialization format", but will be much more efficient on IEEE 754 compatible systems.
Is it worth potentially screwing 0.01% of your customers when you may not have to?
-- Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
michal.x.strzelczyk@jpmorgan.com wrote:
IMO the pragmatic approach would be to settle with one fp format and implement (more or less trivial) transformation functions to support the platforms that don't implement it natively. Picking the IEEE 754 standard would seem fine unless somebody can point out gaps that prevent it from conveying features other formats can.
I agree. Almost all platforms use the IEEE 754 format for float and double, so these can, and probably should, be used for portable binary archives. If anyone needs to support some platform that uses another format, then he will have to add code that converts to/from the IEEE 754 formats. Several different formats are used for long double; I think it is reasonable not to support long double, at least not initially.
I see three options for dealing with the endianness issue:
1. make all archives big-endian
2. make all archives little-endian
3. use the native format when saving, and put an endianness flag in the archive.
1 is inefficient when moving data between little-endian platforms, 2 is inefficient when moving data between big-endian platforms, so 3 should be most efficient. Is there an easy way of storing an endianness flag in an archive?
--Johan Råde
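The flag in option 3 only needs one byte. A minimal sketch of how it might be detected and written (my own illustration; the names `host_is_little_endian` and `write_endian_flag` are hypothetical, not Boost API):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Detect host endianness at runtime by inspecting which byte of a
// known 16-bit value is stored first in memory.
bool host_is_little_endian() {
    const uint16_t probe = 0x0102;
    uint8_t first;
    std::memcpy(&first, &probe, 1);
    return first == 0x02;  // low-order byte first => little-endian
}

// Record the flag as the first byte of the archive when saving;
// the loader reads it and decides whether to swap.
void write_endian_flag(std::vector<char>& archive) {
    archive.push_back(host_is_little_endian() ? 1 : 0);
}
```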
Johan Råde wrote:
1 is inefficient when moving data between little-endian platforms, 2 is inefficient when moving data between big-endian platforms, so 3 should be most efficient. Is there an easy way of storing an endianness flag in an archive?
The version currently in the package does that now for integer types. Note, you could store floating points as a pair of integers using the facilities already in portable binary archive. This would have the side benefit of making the archives significantly smaller. Robert Ramey
Robert Ramey wrote:
Johan Råde wrote:
1 is inefficient when moving data between little-endian platforms, 2 is inefficient when moving data between big-endian platforms, so 3 should be most efficient. Is there an easy way of storing an endianness flag in an archive?
The version currently in the package does that now for integer types.
Note, you could store floating points as a pair of integers using the facilities already in portable binary archive. This would have the side benefit of making the archives significantly smaller.
Robert Ramey
Great. Then you can handle float and double by just saving and loading the bytes, and deal with endianness the same way as for integers. I don't think there is any other scheme that will make the archive significantly smaller without losing precision. I believe that in most applications a float/double contains almost 32/64 bits of entropy. --Johan
Johan Råde wrote:
I see three options for dealing with the endianness issue:
1. make all archives big-endian
2. make all archives little-endian
3. use the native format when saving, and put an endianness flag in the archive.
1 is inefficient when moving data between little-endian platforms, 2 is inefficient when moving data between big-endian platforms, so 3 should be most efficient. Is there an easy way of storing an endianness flag in an archive?
I don't think performance should be the overriding concern, especially since byte-shuffling is very fast. The problem with option 3 is that it introduces a potential source of bugs that only manifests when moving between platforms with different endianness. I'd prefer option 1, precisely because it requires shuffling on the most common platforms so any bugs in the shuffling code are sure to be caught early. -- Rainer Deyke - rainerd@eldwood.com
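The byte shuffle that option 1 forces on little-endian hosts is a handful of shifts per word. A sketch of the classic 32-bit swap (my own illustration, not Boost code):

```cpp
#include <cassert>
#include <cstdint>

// Reverse the byte order of a 32-bit word: converts between
// big-endian archive order and little-endian host order.
uint32_t byteswap32(uint32_t v) {
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}
```

The swap is its own inverse, which is part of why the code path is easy to test on a single platform.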
On Friday 29 August 2008 01:42 am, Rainer Deyke wrote:
Johan Råde wrote:
I see three options for dealing with the endianness issue:
1. make all archives big-endian
2. make all archives little-endian
3. use the native format when saving, and put an endianness flag in the archive.
1 is inefficient when moving data between little-endian platforms, 2 is inefficient when moving data between big-endian platforms, so 3 should be most efficient. Is there an easy way of storing an endianness flag in an archive?
I don't think performance should be the overriding concern, especially since byte-shuffling is very fast. The problem with option 3 is that it introduces a potential source of bugs that only manifests when moving between platforms with different endianness. I'd prefer option 1, precisely because it requires shuffling on the most common platforms so any bugs in the shuffling code are sure to be caught early.
Why are you guys not just using XDR for your portable binary format?
Rainer Deyke wrote:
Johan Råde wrote:
I see three options for dealing with the endianness issue:
1. make all archives big-endian
2. make all archives little-endian
3. use the native format when saving, and put an endianness flag in the archive.
1 is inefficient when moving data between little-endian platforms, 2 is inefficient when moving data between big-endian platforms, so 3 should be most efficient. Is there an easy way of storing an endianness flag in an archive?
I don't think performance should be the overriding concern, especially since byte-shuffling is very fast.
But it isn't fast. If the necessity of bitshuffling makes it impossible to serialize, say, a vector<double> via the optimized array handling, you could easily be talking about a factor of 10 in speed. Showstopper for us, at least. I suppose you could copy the entire buffer, flip all the bits at once, then serialize the flipped buffer, but this also has significant cost, too much for scientific applications where another option exists.
The problem with option 3 is that it introduces a potential source of bugs that only manifests when moving between platforms with different endianness. I'd prefer option 1, precisely because it requires shuffling on the most common platforms so any bugs in the shuffling code are sure to be caught early.
Actually this is very easy to test for, even if you don't have machines of the other endianness available. (The md5sum of the generated archive must match for all platforms, and these sums can be checked in to svn.) -t
On Aug 29, 2008, at 12:41 PM, troy d. straszheim wrote:
Rainer Deyke wrote:
Johan Råde wrote:
I see three options for dealing with the endianness issue:
1. make all archives big-endian
2. make all archives little-endian
3. use the native format when saving, and put an endianness flag in the archive.
1 is inefficient when moving data between little-endian platforms, 2 is inefficient when moving data between big-endian platforms, so 3 should be most efficient. Is there an easy way of storing an endianness flag in an archive?
I don't think performance should be the overriding concern, especially since byte-shuffling is very fast.
But it isn't fast. If the necessity of bitshuffling makes it impossible to serialize, say a vector<double> via the optimized array handling, you could easily be talking about a factor of 10 in speed. Showstopper for us, at least. I suppose you could copy the entire buffer, flip all the bits at once, then serialize the flipped buffer, but this also has significant cost, too much for scientific applications where another option exists.
Have you looked at HDF5? It supports metadata, parallel IO, arbitrary data structures, etc. and is designed for HPC applications. It also has native binary format support and will automatically provide cross-platform binary compatibility. It will likely require a bit more code instrumentation than boost serialization, but may be worth it if performance is key...
James Sutherland wrote:
On Aug 29, 2008, at 12:41 PM, troy d. straszheim wrote:
But it isn't fast. If the necessity of bitshuffling makes it impossible to serialize, say a vector<double> via the optimized array handling, you could easily be talking about a factor of 10 in speed. Showstopper for us, at least. I suppose you could copy the entire buffer, flip all the bits at once, then serialize the flipped buffer, but this also has significant cost, too much for scientific applications where another option exists.
Have you looked at HDF5? It supports metadata, parallel IO, arbitrary data structures, etc. and is designed for HPC applications. It also has native binary format support and will automatically provide cross-platform binary compatibility.
We do use HDF5... With boost.python and pytables plus boost::serialization backed C++ data structures one can quite flexibly provide converter/extractor/reducer utilities that get you from a boost::serialization portable binary format to a more 'analysis friendly' hdf5 format. Works great. -t
Just to keep the pot boiling - here's my two cents.
a) Basically, floats can be represented as two integers: one for the exponent and one for a normalized fraction of one.
b) The C standard library provides functions which generate these two integers from any double (frexp) and retrieve the original double from the pair of integers (ldexp). I would guess that these functions are pretty efficient as they only return a subset of some existing bits.
c) The portable binary archive currently in the libraries handles integers in a portable manner. It has been tested on various platforms and already addresses issues such as what to do when one attempts to load an integer > 2^32 on a 32-bit machine. It also strips leading bits which don't add anything, making the archives smaller.
It would seem that, using the standard functions - supported by any standard C library - and the functionality already in portable_binary_archive, one could add floating point functionality relatively easily - and it would be no less portable than the C library is.
Robert Ramey
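A sketch of the frexp/ldexp idea (my own illustration, not proposed library code). One wrinkle, noted later in this thread: frexp returns the normalized fraction as a double in [0.5, 1), not as an integer, so the fraction must be scaled by 2^53 to capture all the mantissa bits of a double:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Split a double into (mantissa, exponent) integers: d == mantissa * 2^(exponent-53).
// Scaling by 2^53 turns the [0.5, 1) fraction into an exact 53-bit integer.
void double_to_ints(double d, int64_t& mantissa, int& exponent) {
    double frac = std::frexp(d, &exponent);               // d == frac * 2^exponent
    mantissa = static_cast<int64_t>(std::ldexp(frac, 53)); // frac * 2^53, exact
}

// Reassemble the double from the integer pair; lossless for ordinary doubles.
double ints_to_double(int64_t mantissa, int exponent) {
    return std::ldexp(static_cast<double>(mantissa), exponent - 53);
}
```

Because the scaled fraction is an exact 53-bit integer, the round trip loses no precision for an IEEE 754 double; the two integers can then go through the existing portable integer serialization.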
on Fri Aug 29 2008, "Robert Ramey"
Just to keep the pot boiling - here's my two cents.
a) Basically, floats can be represented as two integers: one for the exponent and one for a normalized fraction of one.
b) The C standard library provides functions which generate these two integers from any double (frexp) and retrieve the original double from the pair of integers (ldexp). I would guess that these functions are pretty efficient as they only return a subset of some existing bits.
c) The portable binary archive currently in the libraries handles integers in a portable manner. It has been tested on various platforms and already addresses issues such as what to do when one attempts to load an integer > 2^32 on a 32-bit machine. It also strips leading bits which don't add anything, making the archives smaller.
It would seem using the standard functions - supported by any standard C library - and using the functionality already in portable_binary_archive, one could add floating point functionality relatively easily - and it would be no less portable than the C library is.
I wonder if it really works so well when the word size of the machines differs, or even when the word size is 32 bits on both ends. It's likely they're both using IEEE 754, so if long double has more than 32 bits of mantissa, your method will be needlessly lossy. I think long double commonly has 96 or 128 bits total, so you'd lose significant precision. The HPC community has had to solve this problem numerous times. These are people that care about the accuracy of their floating point numbers. Why one would begin anywhere other than with the formats the HPC people have already developed is beyond me. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
David Abrahams wrote:
I wonder if it really works so well when the word size of the machines differs, or even when the word size is 32 bits on both ends. It's likely they're both using IEEE 754, so if long double has more than 32 bits of mantissa, your method will be needlessly lossy. I think long double commonly has 96 or 128 bits total, so you'd lose significant precision. The HPC community has had to solve this problem numerous times. These are people that care about the accuracy of their floating point numbers. Why one would begin anywhere other than with the formats the HPC people have already developed is beyond me.
The current implementation implements a variable-length format where only the significant bits are stored. If it turns out that a number stored in the archive cannot be represented on the machine reading the archive, an exception is thrown. This would occur where a 64-bit machine stored a value > 2^32 and a 32-bit machine tried to load it. This method has one great advantage: it automatically converts between integer (or long or whatever) when the size of integer varies between machines. It also eliminates redundant data (leading 0's) and never loses precision. If it can't do something the user wants to do - it punts. It's up to the library user to decide how to handle these special situations. I believe leveraging this by converting floats to a pair of integers and serializing them would simplify the job and result in a truly portable (as opposed to 99%) archive.
BTW - I was wrong about the two library functions mentioned above. They do return the exponent of the normalized value - but they return the mantissa as a float rather than an integer - damn!
Robert Ramey
troy d. straszheim wrote:
Rainer Deyke wrote:
I don't think performance should be the overriding concern, especially since byte-shuffling is very fast.
But it isn't fast.
It is when compared to the overhead of IO (disk or socket, possibly even memory).
If the necessity of bitshuffling makes it impossible to serialize, say a vector<double> via the optimized array handling, you could easily be talking about a factor of 10 in speed.
I think here you are talking about the overhead of a single write operation versus multiple write operations on the underlying stream, correct? It's true that the standard stream operations can be slow, but that is a separate problem from the actual byte shuffling and should be solved separately. Maybe this problem could be avoided by using a std::vector<char> instead of a stream object for the actual serialization and then dumping it all at once. (It is not reasonable to just dump in-memory objects to a stream in any portable format, binary or text.)
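The buffer-then-dump idea above can be sketched briefly. This is my own illustration under the stated assumption (serialize into a `std::vector<char>`, then hand the whole buffer to the stream in one write); the name `append_u32` is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Append one 32-bit value to an in-memory archive buffer.
// Byte order is the host's here; a swap would go just before the memcpy.
void append_u32(std::vector<char>& buf, uint32_t v) {
    char bytes[4];
    std::memcpy(bytes, &v, sizeof v);
    buf.insert(buf.end(), bytes, bytes + 4);
}

// Later, a single stream operation flushes everything:
//   stream.write(buf.data(), buf.size());
```

This separates the cost of many small stream writes from the cost of the byte shuffling itself, which was the point being argued.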
The problem with option 3 is that it introduces a potential source of bugs that only manifests when moving between platforms with different endianness. I'd prefer option 1, precisely because it requires shuffling on the most common platforms so any bugs in the shuffling code are sure to be caught early.
Actually this is very easy to test for, even if you don't have machines of the other endianness available. (the md5sum of the generated archive must match for all platforms, and these sums can be checked in to svn)
I thought option 3 was to write little-endian archives on little-endian machines and big-endian archives on big-endian machines? If so, the generated archives would /not/ be the same. Hence the potential source of bugs. -- Rainer Deyke - rainerd@eldwood.com
participants (8)
-
David Abrahams
-
Frank Mori Hess
-
James Sutherland
-
Johan Råde
-
michal.x.strzelczyk@jpmorgan.com
-
Rainer Deyke
-
Robert Ramey
-
troy d. straszheim