Re: [boost] [serialization] fast array serialization (10x speedup)

I've been perusing the files you checked in, your example, and this list.

Summary
=======

First of all, a little more complete narrative description as to what the submission was intended to accomplish and how it would change the way the user uses the library would have been helpful. I'm going to summarize here what I think I understand about this. Please correct me if I get something wrong.

a) a new trait is created:

template <class Archive, class Type>
struct has_fast_array_serialization : public mpl::bool_<false> {};

b) new functions save_array and load_array are implemented in those archives which have the above trait set to true. In this case the following is added to the binary_iarchive.hpp file. The effect is that this trait will return true when a fundamental type is to be saved/loaded to a binary_iarchive.

// specialize has_fast_array_serialization
// the binary archive provides fast array serialization for all fundamental types
template <class Type>
struct has_fast_array_serialization<binary_iarchive, Type>
    : public is_fundamental<Type> {};

c) A user with the following:

class my_class {
    valarray<int> m_a;
    template<class Archive>
    void save(Archive & ar, const unsigned int){
        ar << m_a;
    }
    ...
};

d) In order for this to work, data types which want to exploit bitwise serialization have to look like the following, taken from the valarray serialization. This treatment would have to be applied to the serialization of each such type:

// without fast array serialization
template<class Archive, class U>
inline typename boost::disable_if<
    boost::archive::has_fast_array_serialization<Archive, U> >::type
save(Archive & ar, const std::valarray<U> &t, const unsigned int /* file_version */){
    const boost::archive::container_size_type count(t.size());
    ar << BOOST_SERIALIZATION_NVP(count);
    for (std::size_t i = 0; i < t.size(); ++i)
        ar << t[i];
}

// with fast array serialization
template<class Archive, class U>
inline typename boost::enable_if<
    boost::archive::has_fast_array_serialization<Archive, U> >::type
save(Archive & ar, const std::valarray<U> &t, const unsigned int /* file_version */){
    const boost::archive::container_size_type count(t.size());
    ar << BOOST_SERIALIZATION_NVP(count);
    if (count)
        ar.save_array(boost::detail::get_data(t), t.size());
}
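For the plain binary archive, the save_array mentioned in (b) can be little more than a forwarding function; a minimal member-function sketch (illustrative only, not code from the submission) would be:

// Sketch only: inside a binary archive, save_array can simply delegate to
// the existing save_binary. Archives such as XDR or MPI would instead do
// per-element format conversion here.
template<class ValueType>
void save_array(const ValueType * address, std::size_t count)
{
    this->save_binary(address, count * sizeof(ValueType));
}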
Presumably this treatment would be applied to std::vector, ublas::vector, ublas::matrix, mtl::vector, blitz::array, custom_lib::fast_matrix, ... as well as others. For built-in arrays, the core library is tweaked to include similar code.

Some Observations
=================

Immediately, the following come to mind.

a) I'm not sure about the portability of enable_if. Would this not break the whole serialization system for those compilers which don't support it?

b) what is the point of save_array? why not just invoke save_binary directly?

c) The same could be said for built-in arrays - just invoke save_binary.

d) There is no provision for NVP in the non-binary version above while in the binary version there is NVP around count. Presumably, these are oversights.

e) The whole thing isn't obvious and it's hard to follow. It couples the implementation code in i/o serializer.hpp to a specific kind of archive, adding another dimension to be considered while understanding this thing.

f) What about bitwise serializable types which aren't fundamental? That is, structures which don't have things like pointers in them. They have the same opportunity but aren't addressed. If this is a good idea for fundamental types, someone is going to want to do them as well - which would open up some new problems.

g) I don't see endian-ness addressed anywhere. I believe that protocols such as XDR and MPI are designed to transmit binary data between heterogeneous machines. Suppose I save an array of ints as a sequence of raw bits on an Intel-type machine. Then I use load_binary to reload the same sequence of bits on a SPARC-based machine. I won't get back the same data values. So either the method will have to be limited to collections of bytes, or some extra machinery would have to be added to do the endian translation conditionally, depending on the source/target machine match/mismatch.

f) Similar issues confront bitwise serialization of floats and doubles. I believe the "canonical" format for floats/doubles is IEEE 80 bit. (I think that's what XDR uses - I could be wrong.) I believe that many machines store floats as 32 bit words and doubles as 64 bit words. I doubt they all are guaranteed to have the same format as far as exponent, sign and representation of value. So that's something else to be addressed. Of course endian-ness plays into this as well.

g) I looked at the "benchmark" results. I notice that they are run with -O2 on the gcc compiler. Documentation for the gcc compiler command line specifies that this optimization level does not enable automatic inlining for small functions. This is a crucial optimization for the serialization library to be effective. The library is written with the view that compilers will collapse inline code when possible. But this happens with the gcc compiler only when the -O3 optimization switch is used. Furthermore, with this compiler, it might be necessary to also specify the max-inline-insns-recursive-auto switch to gain maximum performance on boost-type code. This latter is still under investigation.

h) my own rudimentary benchmark (which was posted on this list) used 1000 instances of a structure which contained all C++ primitive data types plus an std::string made up of random characters. It was compiled as a boost test and built with bjam so it used the standard boost options for release mode. It compared timings against using raw stream i/o. Timings for binary_archive and standard stream i/o were comparable. I'm still working on this test.
The problem is that standard stream i/o uses text output/input. Of course no one for whom performance is an issue would do this, so I have to alter my timing test to use binary i/o to the standard stream as a comparison. But for now, I'm comfortable in asserting that there is not a large performance penalty in using serialization as opposed to "rolling your own". As an aside, the test executable doing the same test for 3 different types of archives and all primitive data types only came to 238K. So there isn't a significant code bloat issue either.

i) somehow I doubt that this archive type has been tested with all the serialization test suite. Instructions for doing so are in the documentation and the serialization/test directory includes batch files for doing this with one's own archives. Was this done? What were the results? With which compiler? It costs nothing to do this.

end of observations
===================

Admittedly, this is only a cursory examination. But it's more than enough to make me skeptical of the whole idea. If you want, I could expand upon my reasons for this view, but I think they should be obvious.

Now if someone feels differently and wants to implement such a thing, they have my full support. There is no need to modify the core library and no benefit - performance or otherwise. The following shows how to go about this. For purposes of this exposition, I am going to limit myself to how one would go about crafting a system similar to the one submitted. That is, I will not address concerns such as binary portability as they are not addressed in the submission as I see it. I'm assuming that the only issue is how best to arrange things so that save_binary/load_binary are invoked for contiguous collections of fundamental types.

Suggestions
===========

I do see the utility and merit in what you're trying to do here - finally. Honestly it just wasn't obvious from the initial package. So here is how I would have gone about it. I previously posted a demo "fast_oarchive.hpp"; I will expand upon it here. The archive class tree would look like the following (setting aside polymorphic archives); I would envision:

basic_archive
  basic_oarchive
    common_oarchive
      basic_binary_oarchive
        binary_oarchive
          fast_oarchive_impl
            MPI_oarchive
            XDR_oarchive
            ...

fast_oarchive_impl is an adaptor which can be applied to any legal archive. The current submission derives from binary_?archive. If this is all that is required then it doesn't have to be a template; it could just as well be derived directly from binary_oarchive. It includes overloads for the following:

template<class T>
void save_array(T & t, collection_size_t count){
    // this version not certified for more complex types !!!
    BOOST_STATIC_ASSERT(boost::is_primitive<T>::value);
    // or pointers either !!!
    BOOST_STATIC_ASSERT(! boost::is_pointer<T>::value);
    ...
    // if T is NOT a fundamental type - some mpl magic required here
    // (one way to fill this in is sketched below)
    // forward call to base class
    this->Base::save_override(t, 0);
    // else -
    *(this->This()) << make_nvp("count", t.size() * sizeof(T));
    *(this->This()) << make_nvp(make_binary_object(t.size() * sizeof(T), & t));
    // note - the nvp wrappers are probably not necessary if we're
    // only going to apply this to binary archives.
}

// here's a way to do it for all vectors in one shot
template<class T>
void save_override(const std::vector<T> & t, int){
    save_array(t, t.size());
}

... boost::valarray? ...

// for C++ built-in arrays
template<typename T, int N>
void save_override(const T (& t)[N], int){
    save_array(t, sizeof(t));
}

It might or might not contain similar implementations for std::vector, ublas::vector, ublas::matrix, mtl::vector, blitz::array, custom_lib::fast_matrix, etc. For now assume that it does - later we'll relax that assumption.

Derived from fast_oarchive_impl are your archive classes for the specific archive types - MPI_oarchive, .... These would handle the specific features of each particular archive. If it turns out that MPI can only apply the save_binary optimization to types that are no more than a character wide (perhaps for endian issues) then it would include something like:

template<class T>
void save_override(const std::vector<T> & t, int){
    // suppose endian issues preclude save_binary
    if(sizeof(T) > 1){
        // skip over the save_binary optimization
        this->binary_oarchive::save_override(t, 0);
    }
    else{
        // just forward to the base class
        this->fast_oarchive_impl::save_override(t, 0);
    }
}
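(Editorial note: the "some mpl magic required here" placeholder in save_array above could be filled in with tag dispatching, the same technique suggested elsewhere in this thread. The following is only a sketch in free-function form; the NVP labels, the helper name, and the choice to key on is_fundamental are illustrative assumptions, not part of either proposal.)

#include <cstddef>
#include <boost/mpl/bool.hpp>
#include <boost/type_traits/is_fundamental.hpp>
#include <boost/serialization/nvp.hpp>
#include <boost/serialization/binary_object.hpp>

// fundamental element type: write the whole block with one raw write
template<class Archive, class T>
void save_array_impl(Archive & ar, const T * t, std::size_t count, boost::mpl::true_){
    ar << boost::serialization::make_nvp("count", count);
    ar << boost::serialization::make_binary_object(
        const_cast<T *>(t), count * sizeof(T));
}

// anything else: element by element, just as the base class would do it
template<class Archive, class T>
void save_array_impl(Archive & ar, const T * t, std::size_t count, boost::mpl::false_){
    ar << boost::serialization::make_nvp("count", count);
    for(std::size_t i = 0; i < count; ++i)
        ar << boost::serialization::make_nvp("item", t[i]);
}

// select one of the two implementations at compile time
template<class Archive, class T>
void save_array(Archive & ar, const T * t, std::size_t count){
    save_array_impl(ar, t, count,
        boost::mpl::bool_<boost::is_fundamental<T>::value>());
}

A derived archive such as MPI_oarchive could still shadow such a function with its own version where the blanket treatment is not appropriate.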
So net result is:

a) save_binary optimizations are invoked from the fast_oarchive_impl class. They only have to be specified once even though they are used in more than one variation of the binary archive. That is, if there are N types to be subjected to the treatment by M archives - there are only N overrides - regardless of the size of M.

b) save_binary optimizations can be overridden for any particular archive types. (It's not clear to me how the current submission would address such a situation.)

c) There is no need to alter the current library.

d) It doesn't require that anything in the current library be conditioned on what kind of archive is being used. Insertion of such a coupling would be extremely unfortunate and create a lot of future maintenance work. This would be extremely unfortunate for such a special purpose library. This is especially true since it's absolutely unnecessary.

e) The implementation above could easily be improved to be resolved totally at compile time. Built with a high quality compiler (with the appropriate optimization switches set), this would result in the fastest possible code.

f) all code for save_binary is in one place - within fast_oarchive_impl. If fast_oarchive_impl is implemented as a template, it could be applied to any existing archive class - even text and xml. I don't know if there would be any interest in doing that - but it's not inconceivable. Note also that including all the save_binary optimizations for all of std::vector, ublas::vector, ublas::matrix, mtl::vector, blitz::array, custom_lib::fast_matrix doesn't require that the headers for these libraries be included. The code in the header isn't required until the template is instantiated. So there wouldn't be any "header bloat"(tm).

g) Now, f above could also be seen as a disadvantage. That is, it might seem better to let each one involved in serialization of a particular collection keep his stuff separate. There are a couple of options here I will sketch out.

i) one could make a new trait is_bitwise_serializable whose default value is false. For each collection type one would specialize this like:

template<class T>
struct is_bitwise_serializable<vector<T> > {
    ... is_fundamental<T> ...
    get_size(){
        // override the default, which is sizeof(T)
        ...
    }
};

Now fast_oarchive_impl would contain something like:

// here's a way to do it for all vectors in one shot
template<class T>
void save_override(const T & t, int){
    // if T is NOT bitwise serializable - insert mpl magic required here
    // forward call to base class
    this->Base::save_override(t, 0);
    // else -
    *(this->This()) << make_nvp("count", t.size() * sizeof(T));
    *(this->This()) << make_nvp(make_binary_object(...get_size(), & t));
    // note - the nvp wrappers are probably not necessary if we're
    // only going to apply this to binary archives.
}

Which would implement the save_binary optimization for all types with the is_bitwise_serializable trait set. Of course any class derived from fast_oarchive_impl could override this as before. Note that this would address the situation whereby one has something like

struct RGB {
    unsigned char red;
    unsigned char green;
    unsigned char blue;
};

which certainly would be a candidate for the save_binary optimization - it would even be portable across otherwise incompatible machines - but then there might be alignment issues. Also, if someone tracks an object of type RGB somewhere, then this would work differently on different archives - not a good thing. So this would require some careful documentation on how to use such a trait. This would move the information about the save_binary optimization out of the fast_oarchive_impl.hpp header file and into the header file for each particular type - arguably a better choice.

ii) another option would be to implement differing serializations depending upon the archive type. So that we might have

template<class T>
void save(fast_oarchive_impl &ar, const std::vector<T> & t, const unsigned int){
    // if T is a fundamental type or ....
    ar << t.size();
    ar.save_binary(&t[0], t.size() * sizeof(T));
}

This would basically be a much simpler substitute for the "fast_archive_trait" proposed by the submission.

Wrap up
=======

I guess it should be pretty obvious that I believe that

a) The applicability and utility of a save/load binary optimization are narrower than claimed.
b) The claims of performance improvement are suspect.
c) This implementation doesn't address all the issues that need to be addressed for something like this.
d) Such an idea could be implemented in a much simpler, more transparent, robust, and efficient manner.
e) Such "special" implementation features are easily accommodated by the current library by extension. There is no need to change any of the core library implementation or interface.

Finally, I have to comment on the way this has been proposed. Checking a diverse group of files into the CVS system on a separate branch is not convenient for most people. This would better be a zip file uploaded to the vault. It should also include some sort of document outlining what it's intended to do and how it does it. Had this been done, it's likely that many more people would have had a look at it and been able to comment. I'm sure the issues I've noted above would be apparent to a lot of people and I wouldn't have had to spend a few hours preparing this critique. If the above weren't bad enough, I've pretty much been forced to do what really amounts to a private consulting job for free for a specific application. I really can't do this anymore. It's one thing to point a user in the right direction (which often results in a tweak to the manual) or incorporate a bug fix - which some user has tracked down, fixed and submitted.
But to have to spend the time to critique something like this - whose problems should be obvious to most of us - is really unfair. I've been through it a couple of times - I'm not doing it again. If you feel that you need to split off the serialization library to do what you want, I've a better idea: how about you and Matthias taking it over yourselves? After all, I've achieved all I expect to get from it. Then you would be free to make any changes you want without wasting time on these discussions. Of course, if the problems in the above are only obvious to me, then it probably would indicate that I'm out of sync with the boost development community - in which case it really should be taken over by someone else.

Robert Ramey

"Robert Ramey" wrote:
a) I'm not sure about the portability of enable_if. Would this not break the whole serialization system for those compilers which don't support it?
Yes: BCB, VC6, VC7, for example. /Pavel

Pavel Vozenilek wrote:
"Robert Ramey" wrote:
a) I'm not sure about the portability of enable_if. Would this not break the whole serialization system for those compilers which don't support it?
Yes: BCB, VC6, VC7, for example.
FWIW, you don't need enable_if<> for this. Just use tag-dispatching:

template<class Archive, class U>
inline void save_aux(Archive & ar, const std::valarray<U> &t,
    const unsigned int file_version, mpl::true_)
{
    ...
}

template<class Archive, class U>
inline void save_aux(Archive & ar, const std::valarray<U> &t,
    const unsigned int file_version, mpl::false_)
{
    ...
}

template<class Archive, class U>
inline void save(Archive & ar, const std::valarray<U> &t,
    const unsigned int file_version)
{
    save_aux(ar, t, file_version,
        boost::archive::has_fast_array_serialization<Archive,U>());
}

-- Daniel Wallin

On Nov 12, 2005, at 9:33 PM, Robert Ramey wrote:
I've been perusing the files you checked, your example, and this list.
Summary ======= First of all, a little more complete narrative description as to what the submission was intended to accomplish and how it would change the way the user uses the library would have been helpful. I'm going to summarize here what I think I understand about this. Please correct me if I get something wrong.
a) a new trait is created.
template <class Archive, class Type> struct has_fast_array_serialization : public mpl::bool_<false> {};
Yes, I wrote that in my original e-mail
b) new functions save_array and load_array are implemented in those archives which have the above trait set to true. In this case the following is added to the binary_iarchive.hpp file. The effect is that this trait will return true when a fundamental type is to be saved/loaded to a binary_iarchive.
// specialize has_fast_array_serialization
// the binary archive provides fast array serialization for all fundamental types
template <class Type>
struct has_fast_array_serialization<binary_iarchive, Type>
    : public is_fundamental<Type> {};
This is just the example for binary archives. The set of types for which direct serialization of arrays is possible is different from archive to archive. E.g. MPI archives support array serialization for all PODs that are not pointers and do not contain pointer members.
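Concretely, such an archive would provide its own specialization of the trait, keyed on a different predicate than is_fundamental; a sketch in which mpi_oarchive and is_mpi_datatype are placeholder names rather than the submission's actual identifiers, assuming the primary template quoted in (a):

class mpi_oarchive;   // placeholder for an MPI archive class

// true for pointer-free PODs; specialized per type (sketch only)
template <class Type>
struct is_mpi_datatype : public boost::mpl::bool_<false> {};

template <class Type>
struct has_fast_array_serialization<mpi_oarchive, Type>
    : public is_mpi_datatype<Type> {};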
Some Observations ================= Immediately, the following come to mind.
a) I'm not sure about the portability of enable_if. Would this not break the whole serialization system for those compilers which don't support it?
I mentioned this issue in my initial e-mail, and if there are compilers that are supported by the serialization library but do not support enable_if, we can replace it by tag dispatching.
b) what is the point of save_array? why not just invoke save_binary directly?
Because we might want to do different things than save_binary. Look back at the thread. I gave four different examples.
c) The same could be said for built-in arrays - just invoke save_binary.
same as above.
d) There is no provision for NVP in the non-binary version above while in the binary version there is NVP around count. Presumably, these are oversights.
The count is not saved by save_array, but separately, and there the same code as in your version is used. Hence, the count is also stored as an NVP.
e) The whole thing isn't obvious and it's hard to follow. It couples the implementation code in i/o serializer.hpp to a specific kind of archive, adding another dimension to be considered while understanding this thing.
The real problem is that you implement the serialization of arrays in i/o serializer.hpp now. That's why I patched it there. The best solution would be to move array serialization to a separate header.
f) What about bitwise serializable types which aren't fundamental? That is, structures which don't have things like pointers in them. They have the same opportunity but aren't addressed. If this is a good idea for fundamental types, someone is going to want to do them as well - which would open up some new problems.
I mentioned above that this is just what we do for MPI archives now. This mechanism can easily be extended to binary archives: first you introduce a new traits class

template <class Type>
struct is_bitwise_serializable : public is_fundamental<Type> {};

and then use this trait in the definition of

template <class Type>
struct has_fast_array_serialization<binary_iarchive, Type>
    : public is_bitwise_serializable<Type> {};
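With such a trait in place, a pointer-free user-defined struct could opt in with a one-line specialization; a sketch only (the RGB struct is just the example that comes up elsewhere in this thread, and the specialization assumes the primary template above):

// Sketch: opting a pointer-free POD into bitwise/array serialization
struct RGB {
    unsigned char red;
    unsigned char green;
    unsigned char blue;
};

template <>
struct is_bitwise_serializable<RGB> : public boost::mpl::bool_<true> {};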
g) I don't see endian-ness addressed anywhere. I believe that protocols such as XDR and MPI are designed to transmit binary data between heterogeneous machines. Suppose I save an array of ints as a sequence of raw bits on an Intel-type machine. Then I use load_binary to reload the same sequence of bits on a SPARC-based machine. I won't get back the same data values. So either the method will have to be limited to collections of bytes, or some extra machinery would have to be added to do the endian translation conditionally, depending on the source/target machine match/mismatch.
That is EXACTLY the reason why I propose to call save_array instead of save_binary. In a portable binary archive, save_array and load_array will take care of the endianness issue. XDR, CDR, MPI, PVM, HDF and other libraries do it just like that.
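To make the distinction concrete, here is a sketch of what a portable binary archive's save_array might do for 32-bit integers - something a raw save_binary by definition cannot. write_bytes is a stand-in for whatever raw write primitive the archive has, and htonl assumes a POSIX-style platform; none of this is code from the submission.

#include <cstddef>
#include <arpa/inet.h>    // htonl - assumes a POSIX-ish platform
#include <boost/cstdint.hpp>

// Sketch only: save_array gets the chance to convert each element to a
// fixed wire format (big-endian here) before the raw write; save_binary
// would just copy whatever the host byte order happens to be.
template<class Archive>
void save_array(Archive & ar, const boost::uint32_t * p, std::size_t count)
{
    for(std::size_t i = 0; i < count; ++i){
        const boost::uint32_t wire = htonl(p[i]);  // host -> big-endian
        ar.write_bytes(&wire, sizeof(wire));       // hypothetical raw-write primitive
    }
}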
f) Similar issues confront bitwise serialization of floats and doubles. I believe the "canonical" format for floats/doubles is IEEE 80 bit. (I think that's what XDR uses - I could be wrong.) I believe that many machines store floats as 32 bit words and doubles as 64 bit words. I doubt they all are guaranteed to have the same format as far as exponent, sign and representation of value. So that's something else to be addressed. Of course endian-ness plays into this as well.
Same answer as above. IEEE has 32 and 64 bit floating point types, and they are used also by XDR and CDR. As far as I know the 80 bit type is an Intel extension. Again you see that save_binary and load_binary will not do the trick. That's why we need save_array and load_array.
g) I looked at the "benchmark" results. I notice that they are run with -O2 on the gcc compiler. Documentation for the gcc compiler command line specifies that this optimization level does not enable automatic inlining for small functions. This is a crucial optimization for the serialization library to be effective. The library is written with the view that compilers will collapse inline code when possible. But this happens with the gcc compiler only when the -O3 optimization switch is used. Furthermore, with this compiler, it might be necessary to also specify the max-inline-insns-recursive-auto switch to gain maximum performance on boost-type code. This latter is still under investigation.
You can drop the double quotes around the "benchmark". I have been involved in benchmarking of high performance computers for 15 years, and know what I'm doing. I have also run the codes under -O3, with the same results. Regarding the inlining: -O2 inlines all the functions that are declared as inline. -O3 in addition attempts to inline small functions that are not declared inline. I surely hope that all such small functions in the library are declared inline, and the fact that there is no significant difference in performance between -O2 and -O3 supports that.
h) my own rudimentary benchmark (which was posted on this list) used 1000 instances of a structure which contained all C++ primitive data types plus an std::string made up of random characters. It was compiled as a boost test and built with bjam so it used the standard boost options for release mode. It compared timings against using raw stream i/o. Timings for binary_archive and standard stream i/o were comparable. I'm still working on this test. The problem is that standard stream i/o uses text output/input. Of course no one for whom performance is an issue would do this, so I have to alter my timing test to use binary i/o to the standard stream as a comparison. But for now, I'm comfortable in asserting that there is not a large performance penalty in using serialization as opposed to "rolling your own". As an aside, the test executable doing the same test for 3 different types of archives and all primitive data types only came to 238K. So there isn't a significant code bloat issue either.
Nobody who cares for performance would use text-based I/O. All your benchmark shows is that the overhead of the serialization library is comparable to that of text-based I/O onto a hard disk. For this purpose you are right, the overhead can be ignored. On the other hand, my benchmark used binary I/O into files and into memory buffers, and that's where the overhead of the serialization library really hurts. A 10x slowdown is horrible and makes the library unusable for high performance applications.
i) somehow I doubt that this archive type has been tested with all the serialization test suite. Instructions for doing so are in the documentation and the serialization/test directory includes batch files for doing this with one's own archives. Was this done? What were the results? With which compiler? It costs nothing to do this.
Just ask if you had a doubt. The short answer is "I have done this". After adding the fast array serialization to the binary and polymorphic archives, I ran all your regression tests, without any problem (using gcc 4 under MacOS X).
end of observations ===================
Admittedly, this is only a cursory examination. But it's more than enough to make me skeptical of the whole idea. If you want, I could expand upon my reasons for this view, but I think they should be obvious.
I will stop this e-mail here since, as you can see, there is nothing to be skeptical about. Actually I had already replied to all these issues before. I would appreciate it if you read my replies instead of making the same statements over and over again without considering my arguments. The endianness issue you raise above is, as you can see from my reply, not a problem in my approach, but instead a killer argument against your proposal to use save_binary instead. I will reply to your alternative proposal in a second e-mail. Matthias

Matthias Troyer wrote:
Regarding the inlining: -O2 inlines all the functions that are declared as inline. -O3 in addition attempts to inline small functions that are not declared inline. I surely hope that all such small functions in the library are declared inline,
Hmm, actually I don't know that for a fact. I think I did use inline for all these functions - but I might have overlooked some. Also, the serialization library relies heavily on mpl and other boost stuff. I haven't verified that all that imported code uses inline either. But now that I think about it, it's not clear to me that it's not better to leave it off and let the compiler decide. That might better permit one to optimise for size in one compilation while optimising for speed in another. This is an open question.
Nobody who cares for performance would use text-based I/O. All your benchmark shows is that the overhead of the serialization library is comparable to that of text-based I/O onto a hard disk. For this purpose you are right, the overhead can be ignored. On the other hand, my benchmark used binary I/O into files and into memory buffers, and that's where the overhead of the serialization library really hurts. A 10x slowdown is horrible and makes the library unusable for high performance applications.
I'll update my "benchmark" to use stream read/write for raw i/o comparison. Robert Ramey

| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of Matthias Troyer
| Sent: 13 November 2005 08:03
| To: boost@lists.boost.org
| Cc: Robert Ramey
| Subject: Re: [boost] [serialization] fast array serialization (10x speedup)
|
| Same answer as above. IEEE has 32 and 64 bit floating point types,
| and they are used also by XDR and CDR. As far as I know the 80 bit
| type is an Intel extension.

No, it is a proper IEEE 754 standard - as are 128-bit, and above, too! See, for example:

http://babbage.cs.qc.edu/IEEE-754/IEEE-754references.html
ftp://download.intel.com/technology/itj/q41999/pdf/ia64fpbf.pdf

Paul

-- Paul A Bristow Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB Phone and SMS text +44 1539 561830, Mobile and SMS text +44 7714 330204 mailto: pbristow@hetp.u-net.com www.hetp.u-net.com

On Nov 12, 2005, at 9:33 PM, Robert Ramey wrote:
Now if someone feels differently and wants to implement such a thing, they have my full support. There is no need to modify the core library and no benefit - performance or otherwise. The following shows how to go about this.
For purposes of this exposition, I am going to limit myself to how one would go about crafting a system similar to the one submitted. That is, I will not address concerns such as binary portability as they are not addressed in the submission as I see it. I'm assuming that the only issue is how best to arrange things so that save_binary/load_binary are invoked for contiguous collections of fundamental types.
Suggestions =========== I do see the utility and merit in what you're trying to do here - finally. Honestly it just wasn't obvious from the initial package. So here is how I would have gone about it.
[snip - look at original mail for the full proposal]
So net result is:
a) save_binary optimizations are invoked from the fast_oarchive_impl class. They only have to be specified once even though they are used in more than one variation of the binary archive. That is, if there are N types to be subjected to the treatment by M archives - there are only N overrides - regardless of the size of M.
Indeed this reduces an NxM problem into a 2*N problem: serialization of all classes that can profit from this mechanism needs to be written twice. Better than M times, but still worse than doing it once. There is a more fundamental problem, though, that I will come to later.
b) save_binary optimizations can be overridden for any particular archive types. (It's not clear to me how the current submission would address such a situation.)
Actually the problem is the reverse. In my proposal, the save_array function of the archive can decide how to treat each type, while your proposal dispatches everything to save_binary.
c) There is no need to alter the current library.
d) It doesn't require that anything in the current library be conditioned on what kind of archive is being used. Insertion of such a coupling would be extremely unfortunate and create a lot of future maintenance work. This would be extremely unfortunate for such a special purpose library. This is especially true since it's absolutely unnecessary.
This coupling can be removed in my proposal just by moving the serialization of arrays out of i/o serializer.hpp and into a separate header. A coupling between archive types and serialization of arrays will be necessary at some point, and encapsulating this in a single small header file is probably the best.
e) The implementation above could easily be improved to be resolved totally at compile time. Built with a high quality compiler (with the appropriate optimization switches set), this would result in the fastest possible code.
Same as my proposal.
f) all code for save_binary is in one place - within fast_oarchive_impl. If fast_oarchive_impl is implemented as a template, it could be applied to any existing archive class - even text and xml. I don't know if there would be any interest in doing that - but it's not inconceivable. Note also that including all the save_binary optimizations for all of std::vector, ublas::vector, ublas::matrix, mtl::vector, blitz::array, custom_lib::fast_matrix doesn't require that the headers for these libraries be included. The code in the header isn't required until the template is instantiated. So there wouldn't be any "header bloat"(tm).
This is where the real problem is hidden, and I will explain it below when you explain the alternatives.
g) Now, f above could also be seen as a disadvantage. That is, it might seem better to let each one involved in serialization of a particular collection keep his stuff separate. There are a couple of options here I will sketch out.
i) one could make a new trait is_bitwise_serializable whose default value is false. For each collection type one would specialize this like:
template<class T>
struct is_bitwise_serializable<vector<T> > {
    ... is_fundamental<T> ...
    get_size(){
        // override the default, which is sizeof(T)
        ...
    }
};
Now fast_oarchive_impl would contain something like:
// here's a way to do it for all vectors in one shot
template<class T>
void save_override(const T & t, int){
    // if T is NOT bitwise serializable - insert mpl magic required here
    // forward call to base class
    this->Base::save_override(t, 0);
    // else -
    *(this->This()) << make_nvp("count", t.size() * sizeof(T));
    *(this->This()) << make_nvp(make_binary_object(...get_size(), & t));
    // note - the nvp wrappers are probably not necessary if we're
    // only going to apply this to binary archives.
}
Which would implement the save_binary optimization for all types with the is_bitwise_serializable trait set. Of course any class derived from fast_oarchive_impl could override this as before.
There is one serious and fundamental flaw here: whether or not a certain type can be serialized more efficiently as an array depends not only on the type, but also on the archive. Hence we need a trait taking BOTH the archive and the type, one like the has_fast_array_serialization that I proposed.
ii) another option would be to implement differing serializations depending upon the archive type. So that we might have
template<class T>
void save(fast_oarchive_impl &ar, const std::vector<T> & t, const unsigned int){
    // if T is a fundamental type or ....
    ar << t.size();
    ar.save_binary(&t[0], t.size() * sizeof(T));
}
This would basically be a much simpler substitute for the "fast_archive_trait" proposed by the submission.
Now we are back to an NxM problem. But the real issue is that for many array, vector or matrix types this approach is not feasible, since serialization there needs to be intrusive. Thus, I cannot just reimplement it inside the archive, but the library author of these classes needs to implement serialization. Hence, your approach will not work for MTL matrices, Blitz arrays and other data types. Matthias

Matthias Troyer wrote:
a) save_binary optimizations are invoked from the fast_oarchive_impl class. They only have to be specified once even though they are used in more than one variation of the binary archive. That is, if there are N types to be subjected to the treatment by M archives - there are only N overrides - regardless of the size of M.
Indeed this reduces an NxM problem into a 2*N problem: serialization of all classes that can profit from this mechanism needs to be written twice. Better than M times, but still worse than doing it once. There is a more fundamental problem, though, that I will come to later.
In your versions of serialization/valarray.hpp and serialization/vector.hpp there are two functions - one for fast archives and one for other archives. That is, for all collections which might benefit from this optimization, there are two implementations. This is the key feature of your implementation - I have preserved that in my suggestion for an alternative. Put another way, exactly the same number of functions need to be written in both implementations - there is no difference on this point.
b) save_binary optimizations can be overridden for any particular archive types. (It's not clear to me how the current submission would address such a situation.)
Actually the problem is the reverse. In my proposal, the save_array function of the archive can decide how to treat each type, while your proposal dispatches everything to save_binary.
currently any archive can decide how to treat any type. There is no need or benefit to making a separate function to do this.
d) It doesn't require that anything in the current library be conditioned on what kind of archive is being used. Insertion of such a coupling would be extremely unfortunate and create a lot of future maintenance work. This would be extremely unfortunate for such a special purpose library. This is especially true since it's absolutely unnecessary.
This coupling can be removed in my proposal just by moving the serialization of arrays out of i/o serializer.hpp and into a separate header. A coupling between archive types and serialization of arrays will be necessary at some point, and encapsulating this in a single small header file is probably the best.
You can hide the code in i/o serializer.hpp just by inserting the following into your own archive:

// for C++ built-in arrays
template<typename T, int N>
void save_override(const T (& t)[N], int){
    // your own code here - whatever.
    save_array(t, sizeof(t));
}

Once you do this, the code in i/o serializer for built-in arrays is hidden and never invoked. It is effectively invisible to your code. This technique is shown in demo_fast_archive to achieve exactly this end. This is the basis of my view that the core library doesn't have to be modified to achieve your ends.
e) The implementation above could easily be improved to be resolved totally at compile time. Built with a high quality compiler (with the appropriate optimization switches set), this would result in the fastest possible code.
Same as my proposal.
No disagreement here. It is my intention in this section to show how you can implement your proposal in a less intrusive way.
g) Now, f above could also be seen as a disadvantage. That is, it might seem better to let each one involved in serialization of a particular collection keep his stuff separate. There are a couple of options here I will sketch out.
i) one could make a new trait is_bitwise_serializable whose default value is false. For each collection type one would specialize this like:
template<class T>
struct is_bitwise_serializable<vector<T> > {
    ... is_fundamental<T> ...
    get_size(){
        // override the default, which is sizeof(T)
        ...
    }
};
Now fast_oarchive_impl would contain something like:
// here's a way to do it for all vectors in one shot
template<class T>
void save_override(const T & t, int){
    // if T is NOT bitwise serializable - insert mpl magic required here
    // forward call to base class
    this->Base::save_override(t, 0);
    // else -
    *(this->This()) << make_nvp("count", t.size() * sizeof(T));
    *(this->This()) << make_nvp(make_binary_object(...get_size(), & t));
    // note - the nvp wrappers are probably not necessary if we're
    // only going to apply this to binary archives.
}
Which would implement the save_binary optimization for all types with the is_bitwise_serializable trait set. Of course any class derived from fast_oarchive_impl could override this as before.
There is one serious and fundamental flaw here: whether or not a certain type can be serialized more efficiently as an array depends not only on the type, but also on the archive. Hence we need a trait taking BOTH the archive and the type, one like the has_fast_array_serialization that I proposed.
You don't need a trait because the code is implemented inside of fast_oarchive_impl so it won't get invoked by any other archive class.
ii) another option would be to implement differing serializations depending upon the archive type. So that we might have
template<class T>
void save(fast_oarchive_impl &ar, const std::vector<T> & t, const unsigned int){
    // if T is a fundamental type or ....
    ar << t.size();
    ar.save_binary(&t[0], t.size() * sizeof(T));
}
This would basically be a much simpler substitute for the "fast_archive_trait" proposed by the submission.
Now we are back to an NxM problem.
No we're not. fast_oarchive_impl is a base class from which all your other "fast" archives are derived:
But the real issue is that for many array, vector or matrix types this approach is not feasible, since serialization there needs to be intrusive. Thus, I cannot just reimplement it inside the archive, but the library author of these classes needs to implement serialization. Hence, your approach will not work for MTL matrices, Blitz arrays and other data types.
Matthias

ii) another option would be to implement differing serializations depending upon the archive type. So that we might have
template<class T>
void save(fast_oarchive_impl &ar, const std::vector<T> & t, const unsigned int){
    // if T is a fundamental type or ....
    ar << t.size();
    ar.save_binary(&t[0], t.size() * sizeof(T));
}
This would basically be a much simpler substitute for the "fast_archive_trait" proposed by the submission.
Now we are back to an NxM problem.
Nope. Remember the class hierarchy:

basic_archive
  basic_oarchive
    common_oarchive
      basic_binary_oarchive
        binary_oarchive
          fast_oarchive_impl
            MPI_oarchive
            XDR_oarchive

Since the above uses fast_oarchive_impl, it will be invoked for all classes derived from it (subject to the C++ lookup rules). So it will have to be done only once. Also, it can be hidden by another function which uses an archive farther down the tree. None of the alternatives proposed require any more functions to be written than the original proposal does.
But the real issue is that for many array, vector or matrix types this approach is not feasible, since serialization there needs to be intrusive. Thus, I cannot just reimplement it inside the archive, but the library author of these classes needs to implement serialization.
It may be a real issue - some data types just don't expose enough information to permit themselves to be saved and restored. But this is not at all related to implementation of a save/load binary optimization. Robert Ramey

On Nov 12, 2005, at 9:33 PM, Robert Ramey wrote:
Wrap up ======= I guess it should be pretty obvious that I believe that
a) The applicability and utility of a save/load binary optimization are narrower than claimed.
I gave four concrete examples of which three have already been implemented. Can you please substantiate your claim that it is narrower than claimed?
b) The claims of performance improvement are suspect.
I gave a benchmark code and results. What is suspect about it? Do you get different results on other machines? Please substantiate the claim.
c) This implementation doesn't address all the issues that need to be addressed for something like this.
Actually it does, as I showed in my reply to your observations.
d) Such an idea could be implemented in a much simpler, more transparent, robust, and efficient manner.
As I also argued in my reply to your proposal, this will not work because of intrusiveness. Also, why should your approach (besides the intrusiveness problems) be more efficient???
e) Such "special" implementation features are easily accommodated by the current library by extension. There is no need to change any of the core library implementation or interface.
The one problem with the core library is that it implements the serialization of C-style arrays by a for-loop over the elements. If you separated this from the core library, then only a small change in that file would be needed.
Finally, I have to comment on the way this has been proposed. Checking a diverse group of files into the CVS system on a separate branch is not convenient for most people. This would better be a zip file uploaded to the vault. It should also include some sort of document outlining what it's intended to do and how it does it. Had this been done, it's likely that many more people would have had a look at it and been able to comment. I'm sure the issues I've noted above would be apparent to a lot of people and I wouldn't have had to spend a few hours preparing this critique.
I actually submitted the diffs to the list. I did this instead of a tarball of the entire archive since it is smaller and the changes are easier to see. In the submission I also outlined what was changed and why.
[snip]
I refuse to reply to personal and unfounded polemic attacks. Matthias

Matthias Troyer wrote:

I'm sorry if that's how it came off. It certainly wasn't directed at you. I crafted the email in response to Dave's request to take another look at it. I resented being pressured into that, and it showed in my response. I had originally looked at the submission enough to conclude to my satisfaction that implementing your idea didn't require any modification of the core library. At that point, I didn't have a whole heck of a lot to say about it. I realize that you had the opposite view on this point (and I guess you still do), and after a little bit of back and forth, I concluded we would just have to agree to disagree. I could live with this. Unfortunately, that wasn't good enough for some people. So I was forced to invest a lot of effort to demonstrate what to me is an obvious point. Now I know you don't think it's obvious or even correct, but we're not going to convince each other, so there's no practical alternative to just letting things simmer a while until someone with a fresh perspective can make a case that can convince one of us to change his viewpoint. Veiled threats to fork the library and make my life even more difficult are really way out of line, and that's what I was responding to. So, please accept my apology if I was a little too harsh. Robert Ramey

On Nov 14, 2005, at 11:50 PM, Robert Ramey wrote:
Matthias Troyer wrote:
I'm sorry if that's how it came off. It certainly wasn't directed at you. I crafted the email in response to Dave's request to take another look at it. I resented being pressured into that, and it showed in my response.
I had originally looked at the submission enough to conclude to my satisfaction that implementing your idea didn't require any modification of the core library. At that point, I didn't have a whole heck of a lot to say about it. I realize that you had the opposite view on this point (and I guess you still do), and after a little bit of back and forth, I concluded we would just have to agree to disagree. I could live with this. Unfortunately, that wasn't good enough for some people. So I was forced to invest a lot of effort to demonstrate what to me is an obvious point. Now I know you don't think it's obvious or even correct, but we're not going to convince each other, so there's no practical alternative to just letting things simmer a while until someone with a fresh perspective can make a case that can convince one of us to change his viewpoint. Veiled threats to fork the library and make my life even more difficult are really way out of line, and that's what I was responding to. So, please accept my apology if I was a little too harsh.
Thanks Robert. I appreciate it and also want to apologize if I sounded too harsh. I value the effort you put into the serialization library and just want to make it usable for high-performance applications as well. Matthias

Matthias Troyer wrote:
Thanks Robert. I appreciate it and also want to apologize if I sounded too harsh. I value the effort you put into the serialization library and just want to make it usable for high-performance applications as well.
Just to add another use case: Back in 2001 (or so), I had to implement serialization for IPC involving message passing between C++ programs on an RTOS (QNX) - the message passing was built on shared memory and QNX pulses (v. fast), QNX message passing (fast) and posix message queues (medium). Since I was using C++, I had a look at an early version of the serialization library (in the yahoo boost files section). It was way too slow, since I needed to pass serialized message objects that contained arrays of hundreds of POD elements between processes, and I was doing this hundreds of times a second. I ended up implementing (from scratch) something quite similar to the early serialization library, but which had a traits hook for optimizing array serialization (and lots of other differences - I forget). I achieved the necessary speed increase; it made the difference between a system that could handle thousands of digital IO changes per second, and one that could handle hundreds. Tom

Oh, and congratulations for restraining the otherwise natural impulse to respond with some nasty remark. Anyway, I'm hoping that with a little bit of time the issues will become clearer all around. Sometimes I think that time is all that is required and there is no short cut. I do believe your idea has merit in certain cases. I am skeptical of its applicability to portable binary archives. As I've said many times - for me, it's OK to agree to disagree. I've taken great pains to factor the library into small enough pieces with the hope of permitting ideas such as yours to be added on (or removed) in a convenient way. And this benefits you as well. Now you don't have to sell your idea to the majority. If someone doesn't like it, you can just say - OK, don't use it. All you need is a couple of people to share your point of view to make it acceptable. Note that much of the success of the library is due to this aspect of its design. Some people want XML, others want binary, etc, etc. The way things are structured, those that don't want XML don't even see it in their code. Can you imagine what things would be like if I had to get everyone to agree? As it is I'm pretty overwhelmed. Anyway, thanks for your interest in extending the library. I'm sorry it's been a rougher and more painful road than it first appeared. Take my word for it - it'll get worse before it gets better. But you may well end up with something much better than you expected. And BTW, I think that making a portable binary archive (including XDR, CDR, etc variants) is MUCH harder than it first appears. And that's even BEFORE one thinks of adding in a bitwise collection optimization. So that's why I left portable_?archive as an example. Some people have corrected its handling of endian-ness for some compilers so I guess someone is using it, though I have no idea whom. Also ralf-k (I forget his whole name) made a very nice suggestion about how to do floating point numbers in a portable binary way. I don't know if any of the above is interesting to you - but there it is.

Robert Ramey

Matthias Troyer wrote:
Thanks Robert. I appreciate it and also want to apologize if I sounded too harsh. I value the effort you put into the serialization library and just want to make it usable for high-performance applications as well.
Matthias

On Nov 15, 2005, at 7:20 PM, Robert Ramey wrote:
And BTW, I think that making a portable binary archive (including XDR, CDR, etc variants) is MUCH harder than it first appears. And that's even BEFORE one thinks of adding in a bitwise collection optimization. So that's why I left portable_?archive as an example. Some people have corrected its handling of endian-ness for some compilers so I guess someone is using it, though I have no idea whom. Also ralf-k (I forget his whole name) made a very nice suggestion about how to do floating point numbers in a portable binary way.
I actually have a portable binary archive based on XDR format, but it works only on UNIX operating systems, and that's why I have not submitted it to Boost. Matthias

in "fast array serialization (10x speedup)" On Tue, Nov 15, 2005 at 08:07:03PM +0100, Matthias Troyer wrote:
I actually have a portable binary archive based on XDR format, but it works only on UNIX operating systems, and that's why I have not submitted it to Boost.
We need a couple of tweaks yet, no? I have a very plain portable binary archive around that I'd like to submit; it depends on the following. Matthias Troyer wrote way back in that thread:
3. I had to introduce a new strong typedef in basic_archive.hpp:
BOOST_STRONG_TYPEDEF(std::size_t, container_size_type)
BOOST_CLASS_IMPLEMENTATION(boost::archive::container_size_type, primitive_type)
I remember that you suggested in the past that this should be done anyway. One reason is that using unsigned int for the size of a container, as you do now, will not work on platforms with 32 bit int and 64 bit std::size_t: the size of a container can be more than 2^32. I don't always want to serialize std::size_t as the integer chosen by the specific implementation either, since that would again not be portable. By introducing a strong typedef, the archive implementation can decide how to serialize the size of a container.
The further modifications to the library in
boost/serialization/collections_load_imp.hpp boost/serialization/collections_save_imp.hpp boost/serialization/vector.hpp
were to change the collection serialization to use the container_size_type.
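For illustration, once container sizes have their own type an archive can give them a fixed on-disk width; a member-function sketch only (the 64-bit choice and the forwarding target are assumptions, not what any particular archive actually does):

// Sketch: a portable archive member that always writes container sizes
// as 64 bits, independent of the local sizeof(std::size_t).
void save(const boost::archive::container_size_type & t)
{
    const boost::uint64_t size = t;   // the strong typedef converts via std::size_t
    this->save(size);                 // forward to the archive's 64-bit integer handling
}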
[snip]
4. boost/archive/basic_binary_[io]archive.hpp serialize container_size_type as an unsigned int as done till now. It might be better to bump the file version and serialize them as std::size_t.
Can we go ahead with these bits? -t

troy d. straszheim wrote:
4. boost/archive/basic_binary_[io]archive.hpp serialize container_size_type as an unsigned int as done till now. It might be better to bump the file version and serialize them as std::size_t.
Can we go ahead with these bits?
Is this holding something up right now? Robert Ramey

On Tue, Nov 15, 2005 at 02:28:20PM -0800, Robert Ramey wrote:
troy d. straszheim wrote:
4. boost/archive/basic_binary_[io]archive.hpp serialize container_size_type as an unsigned int as done till now. It might be better to bump the file version and serialize them as std::size_t.
Can we go ahead with these bits?
Is this holding something up right now?
Oh. Oops. No, not #4 there, just #3. Sorry. -t

Matthias Troyer wrote:
On Nov 15, 2005, at 7:20 PM, Robert Ramey wrote:
And BTW, I think that making a portable binary archive (including XDR, CDR, etc variants) is MUCH harder than it first appears. And that's even BEFORE one thinks of adding in a bitwise collection optimization. So that's why I left portable_?archive as an example. Some people have corrected its handling of endian-ness for some compilers so I guess someone is using it, though I have no idea whom. Also ralf-k (I forget his whole name) made a very nice suggestion about how to do floating point numbers in a portable binary way.
Ralf W. Grosse-Kunstleve's post describing this method is here: http://lists.boost.org/Archives/boost/2004/04/64419.php
I actually have a portable binary archive based on XDR format, but it works only on UNIX operating systems, and that's why I have not submitted it to Boost.
Just out of interest, why is the method used platform-specific? Matt

On 11/16/05, Matthew Vogt <mattvogt@warpmail.net> wrote:
Matthias Troyer wrote:
I actually have a portable binary archive based on XDR format, but it
works only on UNIX operating systems, and that's why I have not submitted it to Boost.
Just out of interest, why is the method used platform-specific?
I'm guessing that Matthias' implementation uses the standard xdr_* functions (see man 3 xdr) to do the dirty work. I doubt these are available on Windows. Matthias? -- Caleb Epstein caleb dot epstein at gmail dot com

On Nov 16, 2005, at 11:36 PM, Caleb Epstein wrote:
On 11/16/05, Matthew Vogt <mattvogt@warpmail.net> wrote:
Matthias Troyer wrote:
I actually have a portable binary archive based on XDR format, but it
works only on UNIX operating systems, and that's why I have not submitted it to Boost.
Just out of interest, why is the method used platform-specific?
I'm guessing that Matthias' implementation uses the standard xdr_* functions (see man 3 xdr) to do the dirty work. I doubt these are available on Windows. Matthias?
Indeed, that's what I did, but we already have an idea for a nice portable native C++ implementation. It is just a matter of finding time now. Matthias
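For reference, the SunRPC routines being discussed look roughly like this in use; a sketch only (header availability and exact prototypes vary by platform, which is exactly the portability problem mentioned):

#include <rpc/xdr.h>   // SunRPC XDR routines (man 3 xdr); not available everywhere

// Sketch: encode an array of ints into a caller-supplied buffer in XDR
// (big-endian) format using the xdr_* primitives mentioned above.
bool xdr_encode_ints(char * buffer, unsigned int buffer_size,
                     int * values, unsigned int count)
{
    XDR xdrs;
    xdrmem_create(&xdrs, buffer, buffer_size, XDR_ENCODE);
    const bool ok = xdr_vector(&xdrs,
                               reinterpret_cast<char *>(values),
                               count, sizeof(int),
                               reinterpret_cast<xdrproc_t>(xdr_int)) != 0;
    xdr_destroy(&xdrs);
    return ok;
}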

(brought over from another thread) On Tue, Nov 15, 2005 at 10:20:59AM -0800, Robert Ramey wrote:
And BTW, I think that making a portable binary archive (including XDR, CDR, etc variants) is MUCH harder than it first appears. And that's even BEFORE one thinks of adding in a bitwise collection optimization. So that's why I left portable_?archive as an example. Some people have corrected its handling of endian-ness for some compilers so I guess someone is using it though I have no idea whom.
The example portable_binary_*archive probably works cross-platform in an *endianness* sense, but not in terms of type sizes. As size_t varies from platform to platform so does the size of, for instance, the "size"s in your vectors in the archive. It took a lot of staring at hexdumps to figure out what nonportable stuff wasn't getting passed up to the portable archive for correct handling. But floats and doubles are no problem. They have the same size and layout whereever you are, and since you're just flipping bytes and writing/loading binary, you don't have to worry about NaN or inf or any of that, the archive remains blissfully ignorant. Very different than the text/XML case. Another shocker was that bools are 4 bytes on ppc. That took a while to track down. The implementation I've got looks like this: void save(bool b) { // catch ppc's 4 byte bools unsigned char byte = b; base_t::save(byte); } void save(short t) { #ifdef BOOST_PORTABLE_BINARY_ARCHIVE_BIG_ENDIAN t = swap16(t); #endif base_t::save(t); } // etc, etc for all PODs void save(long t) { // bumping up or shrinking funamental type sizes // should be factored out into a policy class, maybe // along with an overflow-checking policy (clip or throw) int64_t i = t; #ifdef BOOST_PORTABLE_BINARY_ARCHIVE_BIG_ENDIAN i = swap64(i); #endif base_t::save(i); } // etc void save(double d) { #ifdef BOOST_PORTABLE_BINARY_ARCHIVE_BIG_ENDIAN d = swap64(d); #endif base_t::save(d); } such stuff for every fundamental. So we need that container_size_type handling in the library. (That's what I was squawking about before, Robert.) Specified is that users of the archive either have to use portable typedefs (int16_t) for fundamental types, or they'll have to know for themselves that the fundamentals are the same size when writing/loading cross-platform. Like that comment says, another approach to the type-size business would be to factor out the sizes into a policy/trait, for which there would be some reasonable default... the sizes of each POD would be written to the archive header, and then loading archives could decided how they wanted to react if type sizes were different, e.g. refuse to load, throw on overflow, or ignore overflow. But I wanted to wait on that until a basic portable version was working against the serialization library trunk. -t

troy d. straszheim wrote:
The example portable_binary_*archive probably works cross-platform in an *endianness* sense, but not in terms of type sizes. As size_t varies from platform to platform so does the size of, for instance, the "size"s in your vectors in the archive. It took a lot of staring at hexdumps to figure out what nonportable stuff wasn't getting passed up to the portable archive for correct handling.
Another shocker was that bools are 4 bytes on ppc. That took a while to track down.
The implementation I've got looks like this:
void save(bool b)
{
    // catch ppc's 4 byte bools
    unsigned char byte = b;
    base_t::save(byte);
}
Hmmm - that's not the implementation I've got in the portable_binary_?archive.hpp in the serialization/examples directory. My implementation handles all type sizes and endian-ness (excluding floats/doubles, which are not addressed). The only restriction is that the values inserted by the "sending" machine are in fact representable by the "receiving" machine. For example, one can't send an integer value that doesn't fit in 32 bits to a machine which only supports 32 bit integers.
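To illustrate that restriction, here is a sketch only - not the format the example archive actually uses: an archive can write each integer as a one-byte count of significant bytes followed by those bytes, and the loading side can then detect a value that doesn't fit in the local type.

#include <istream>
#include <limits>
#include <ostream>
#include <stdexcept>

// Illustrative format: length byte plus the significant bytes in
// little-endian order.
inline void save_portable_uint(std::ostream& os, unsigned long long value)
{
    unsigned char bytes[sizeof value];
    unsigned char n = 0;
    do {
        bytes[n++] = static_cast<unsigned char>(value & 0xff);
        value >>= 8;
    } while (value != 0);
    os.put(static_cast<char>(n));
    os.write(reinterpret_cast<const char*>(bytes), n);
}

// Reads the value back and throws if it isn't representable in UInt,
// mirroring the "representable on the receiving machine" restriction.
template <class UInt>
UInt load_portable_uint(std::istream& is)
{
    const int n = is.get();
    unsigned long long value = 0;
    for (int i = 0; i < n; ++i) {
        const unsigned long long byte = static_cast<unsigned char>(is.get());
        value |= byte << (8 * i);
    }
    if (value > (std::numeric_limits<UInt>::max)())
        throw std::overflow_error("value not representable on this platform");
    return static_cast<UInt>(value);
}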
So we need that container_size_type handling in the library. (That's what I was squawking about before, Robert.)
maybe - but not for this.
The upshot is that users of the archive either have to use portable typedefs (int16_t) for fundamental types, or they'll have to know for themselves that the fundamentals are the same size when writing/loading cross-platform.
it's a good idea for users to use portable typedefs (int16_t) in any case, in my opinion. The portable_binary_?archives are totally compatible with this. Using these WILL guarantee that the receiving machine will be able to represent the value - if it can be done at all. That is, it still won't be possible to send an int64_t to a machine that doesn't support that type. In this case you'll get a compilation error when you compile the code for the receiving machine - which seems fine by me. Robert Ramey
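A minimal sketch of what using portable typedefs looks like in practice (the class and its members are hypothetical, purely for illustration):

#include <boost/cstdint.hpp>
#include <boost/serialization/nvp.hpp>

// Hypothetical struct, just to show fixed-width members in a
// serialize function.
struct packet
{
    boost::int16_t id;      // same size on every platform that has it
    boost::int64_t offset;  // compilation fails where int64_t doesn't exist

    template <class Archive>
    void serialize(Archive & ar, const unsigned int /* version */)
    {
        ar & BOOST_SERIALIZATION_NVP(id);
        ar & BOOST_SERIALIZATION_NVP(offset);
    }
};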

"Robert Ramey" <ramey@rrsd.com> writes:
Matthias Troyer wrote:
I'm sorry if that's how it came off. It certainly wasn't directed at you. I crafted the email in response to Dave's request to take another look at it. I resented being pressured into that, and that was my response.
...
Now I know you don't think it's obvious or even correct, but we're not going to convince each other, so there's no practical alternative to just letting things simmer a while until someone with a fresh perspective can make a case that convinces one of us to change his viewpoint. Veiled threats to fork the library and make my life even more difficult are really way out of line, and that's what I was responding to.
Robert,

This should have come sooner, but it's been part of a longer communication that I've been working on all week and, I'm afraid, isn't going to get finished all in one message. So I'm sending the first part now because it should have gone out long ago.

First of all, please accept my apology. I didn't mean it as a threat, but on re-reading it's obvious to me why it sounded that way. In fact I can't see how anyone could have interpreted it differently without a lot more context. I hope by the time you finish this series of messages that will be possible, but regardless, again please accept my apologies. The pressure was, as you say, out of line.

Also, I'd like to acknowledge that without your tenacity, dedication, expertise, and, above all, your desire to meet the needs of the user community, this library never would have made it through _two_ formal reviews and been accepted into Boost. From reading feedback on the mailing list it has quickly become one of the most-used libraries in Boost, so it is obviously a major contribution to the C++ community at large. Your continued stewardship is crucial, and appreciated.

I've been traveling this week and thinking carefully about Matthias' design and your concerns with it, and -- believe it or not -- I think we can offer a fresh perspective. It took me some time to understand your concerns about intrusiveness and coupling, but I think I do have a handle on it. In my opinion, Matthias' design was a bit more complicated than necessary, which can't have helped you to feel receptive. After some work, we think we have a basis for discussion that _begins_ with a design that meets all your criteria for approval.

Introduction
============

In an upcoming message I'm planning to start by describing the least intrusive design that could possibly work -- one that makes no changes at all to the existing serialization library. It's not a bad design, but it has a few drawbacks that I'd like to discuss. At that point it should become clear what I meant about "hijacking" the serialization library. Finally, I'll describe the smallest set of changes to the existing serialization library that would be needed to address those drawbacks. Just to state the obvious, I hope to convince you that those few changes are worth making. Of course, as the maintainer of Boost.Serialization, it's completely up to you whether to do so.

Before getting into design details, let me highlight the reason for this proposal in a way you may not have seen it stated before:

,----
| For many archive formats and common datatypes there exist APIs that
| can quickly read or write contiguous sequences of those types all at
| once (**). Reading or writing such a sequence by separately reading
| or writing each element (as the serialization library currently
| does) can be an order of magnitude more expensive.
`----

We want to be able to capitalize on the existence of those APIs, and to do that we need a "hook" that will be used whenever a contiguous sequence is going to be (de)serialized. No such hook exists in Boost.Serialization.

(**) Note that this capability is not necessarily tied to bitwise serialization or the use of a binary representation.

The Design
==========

We've attempted to use programming idioms and terminology found in the existing serialization library wherever possible, so that it's easy for you to read and understand, and you won't be distracted by minor stylistic differences.
In the messages to follow, the word "array" will normally mean a contiguous sequence of instances of a single datatype, and not to a C++ builtin array type of the form T[N]. I'll try to be explicit when I intend to describe builtin arrays. <more to come> -- Dave Abrahams Boost Consulting www.boost-consulting.com
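As a concrete (illustrative) instance of the kind of API the boxed statement above refers to: a binary stream can write a whole contiguous array of doubles in one call instead of once per element. The buffer size and file name below are arbitrary.

#include <fstream>
#include <vector>

int main()
{
    std::vector<double> v(1000000, 3.14);
    std::ofstream out("data.bin", std::ios::binary);

    // One call for the whole contiguous block ...
    out.write(reinterpret_cast<const char*>(&v[0]),
              static_cast<std::streamsize>(v.size() * sizeof(double)));

    // ... instead of one call per element:
    // for (std::size_t i = 0; i < v.size(); ++i)
    //     out.write(reinterpret_cast<const char*>(&v[i]), sizeof(double));
}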

David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
,----
| For many archive formats and common datatypes there exist APIs that
| can quickly read or write contiguous sequences of those types all at
| once (**). Reading or writing such a sequence by separately reading
| or writing each element (as the serialization library currently
| does) can be an order of magnitude more expensive.
`----
I have no problem with the above.
We want to be able to capitalize on the existence of those APIs, and to do that we need a "hook" that will be used whenever a contiguous sequence is going to be (de)serialized. No such hook exists in Boost.Serialization.
Whether or not such a hook is necessary is the crux of the issue. I consider the submission a use case for archive creation and/or extension. As far as I could tell, that particular one didn't require any new hooks in the library. Maybe the next iteration will be different - but that's how I see it now.
(**) Note that this capability is not necessarily tied to bitwise serialization or the use of a binary representation.
The Design ========== We've attempted to use programming idioms and terminology found in the existing serialization library wherever possible, so that it's easy for you to read and understand, and you won't be distracted by minor stylistic differences.
Thanks for your consideration. I realize it's an extra burden to make it easier for me to read and understand, and I appreciate it.
In the messages to follow, the word "array" will normally mean a contiguous sequence of instances of a single datatype, and not to a C++ builtin array type of the form T[N]. I'll try to be explicit when I intend to describe builtin arrays.
Let me explain one place where our difference lies.

The serialization library is basically three pieces:

a) serialization specifications for each data type to be serialized (serialize functions), which are independent of the archive. That is, these specifications depend only upon the requirements of the Saving Archive or Loading Archive concepts.

b) archive classes which implement the Archive concept for different file formats. These archive classes have common implementation features factored out into common modules. Due to "practical" considerations - whether something should be pre-compiled in the library, whether it is dependent on a user's application type, minimization of code bloat, etc. - this common implementation code might be included in one of the base classes or in the file i/oserializer.hpp. (The code in i/oserializer.hpp would normally be in one of the base classes, but I believe template metaprogramming considerations related to less-conforming compilers prevented that.) These "common code" modules are designed to hold code applicable to all archives.

c) Finally, the escape hatch: those serialization implementations which have to be dependent on the combination of archive type and datatype. The most obvious case is name-value pairs - nvp. nvp has its own default serialization which just serializes the value part. Within xml archives this is overridden with a special version for that archive type. This is the model by which I have always envisioned the library being extended. It is only in this way that the library can be extended without becoming geometrically more complicated as time goes on.

I realize that this design and, more importantly, its motivation might not be all that apparent from the documentation on archive implementation. Sorry about that. As time goes on I would hope that this can be improved. But maybe this explains my reluctance to maintain parts of the library beyond the reach of those making other archives. This forms my main objection to the proposal.

Of course I have/had lots of other objections to it and probably would have a lot more if I spent more time looking into it. I suspect that the job of making a portable binary archive is much harder than it first appears. Making it so that it can exploit opportunities to be much faster while still being "monkey-proof" is even harder still. I didn't pursue this as I really don't want to discourage these kinds of efforts and they are (or should be) orthogonal to the library as it is currently implemented. If they can be implemented without altering the core - then I have no problem. If someone believes that modifying the core is unavoidable, then either he or I have made some sort of mistake and it will have to be resolved. If they don't really have to alter the core, but the archive author thinks it would make his job easier - then we have a problem.

I get a suggestion about once a month to modify the core of the library for this or that reason. Aside from bugs, it usually boils down to the suggestor looking at the code and seeing - "Oh I could fix this right there!" without considering all the repercussions and without considering the alternatives. (As you might guess, this is what I believe happened in this case). Another common occurrence is the attempt to use the serialization system to accomplish some end for which it is not suited. A typical idea is to use it to implement some externally defined file format.
I know I drag my feet, I know it drives people crazy, but I truly believe that the success of the library is due in no small part to my reluctance to add in any more than is absolutely necessary.

So, I look forward to seeing progress on the following:

a) better handling of special optimization opportunities which obtain for certain combinations of data-types and archives. Hopefully, an elegant implementation will serve as a model for other people's pet additions.

b) A portable binary implementation suitable for such things as MPI messages.

I also expect these to take some time and hope they can be subjected to the boost "process" of public criticism and refinement. This will take more time but result in a better product. Hopefully, it will be less stressful as well - though I doubt it.

I really am trying to wind down my involvement in the serialization library. I do want to spend some more time on execution profiling and performance tweaks. I would like to see the documentation improved on how to do things like you and Matthias are attempting to do. The current documentation does have a section titled "case studies" which seems to me a handy place to put examples of this nature and at the same time show users how to exploit any "add-in" functionality.

Good luck on this

Robert Ramey

"Robert Ramey" <ramey@rrsd.com> writes:
David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
,----
| For many archive formats and common datatypes there exist APIs that
| can quickly read or write contiguous sequences of those types all at
| once (**). Reading or writing such a sequence by separately reading
| or writing each element (as the serialization library currently
| does) can be an order of magnitude more expensive.
`----
I have no problem with the above.
We want to be able to capitalize on the existence of those APIs, and to do that we need a "hook" that will be used whenever a contiguous sequence is going to be (de)serialized. No such hook exists in Boost.Serialization.
Whether or not such a hook is necessary is the crux of the issue.
Yes. Or more precisely, whether the consequences of not having the hook in the serialization library itself are bad enough to warrant creating it there. I will discuss those consequences after I present our new design, which adds the hook, but only in our own extensions -- essentially a library built on top of the current serialization library without modifying it.
I consider the submission a use case for archive creation and/or extension.
I don't understand what you're trying to say. I presume by "the submission" you mean Matthias' proposed changes to your library. But I don't understand what you mean about it being a "use case."
As far as I could tell, that particular one didn't require any new hooks in the library.
Functionally speaking, that is correct. You /can/ do fast serialization of contiguous arrays without changing the library. You don't even have to write a whole new serialization library.
Maybe the next iteration will be different - but that's how I see it now.
There are some negative consequences of creating the hooks outside Boost.Serialization. Once you understand them, I'm pretty sure you will think they are significant. Whether they will be significant enough to induce you to make changes in Boost.Serialization is of course an open question.
Let me explain one place where our difference lies.
Having read everything that follows, I don't see any explanation of a "place where our difference lies." The parts I understand (most of it) sound like "motherhood and apple pie" -- good, common sense that's hard to disagree with. Is it a thought that was never finished? Would you care to try to put it more succinctly?
The serialization library is basically three pieces
a) serialization specifications for each data type to be serialized (serialize functions), which are independent of the archive. That is, these specifications depend only upon the requirements of the Saving Archive or Loading Archive concepts.
b) archive classes which implement the Archive concept for different file formats. These archive classes have common implementation features factored out into common modules. Due to "practical" considerations - whether something should be pre-compiled in the library, whether it is dependent on a user's application type, minimization of code bloat, etc. - this common implementation code might be included in one of the base classes or in the file i/oserializer.hpp. (The code in i/oserializer.hpp would normally be in one of the base classes, but I believe template metaprogramming considerations related to less-conforming compilers prevented that.) These "common code" modules are designed to hold code applicable to all archives.
c) Finally, the escape hatch: those serialization implementations which have to be dependent on the combination of archive type and datatype. The most obvious case is name-value pairs - nvp. nvp has its own default serialization which just serializes the value part. Within xml archives this is overridden with a special version for that archive type. This is the model by which I have always envisioned the library being extended. It is only in this way that the library can be extended without becoming geometrically more complicated as time goes on.
I realize that this design and, more importantly, its motivation might not be all that apparent from the documentation on archive implementation. Sorry about that.
No, it's perfectly clear what you're trying to do once you study the library implementation. Your design philosophy makes good sense AFAICT. I am a bit surprised to hear you state flatly that there is only one way to extend the library that can ever work. How can you possibly know you've considered every possibility? I don't have the same confidence, even about problems I've studied for years.
As time goes on I would hope that this can be improved. But maybe this explains my reluctance to maintain parts of the library beyond the reach of those making other archives.
Other archives? Beyond reach? I don't understand what you're saying here.
This forms my main objection to the proposal.
Sorry, I don't have any clue what you are referring to. Regardless, we are going to start from new code that doesn't change any part of Boost.Serialization, so if possible, it might be better to try to forget about what you've seen before.
Of course I have/had lots of other objections to it and probably would have a lot more if I spent more time looking into it.
Fortunately, you won't have to. We're going to present new code.
I suspect that the job of making a portable binary archive is much harder than it first appears.
Actually it's almost trivial (I did it over 10 years ago), but I don't know what that has to do with what we're trying to accomplish.
Making it so that it can exploit opportunities to be much faster while still being "monkey-proof" is even harder still.
The speedups we're proposing don't have anything in particular to do with portable binary archives.
I didn't pursue this as I really don't want to discourage these kinds of efforts and they are (or should be) orthogonal to the library as it is currently implemented. If they can be implemented without altering the core - then I have no problem. If someone believes that modifying the core is unavoidable, then either he or I have made some sort of mistake and it will have to be resolved.
It's not unavoidable; as I've said before, it just has consequences that we don't like, and we think you probably won't like either. If you can hang on until we've presented what we think is the best design that avoids altering the core, then we can look at the consequences. Once you understand them, if you still don't want to make any changes and you're willing to accept the consequences, we're not going to press the issue any further.
If they don't really have to alter the core, but the archive author thinks it would make his job easier - then we have a problem.
Let me be very clear about this, at least:

,----
| Ease of archive implementation is unrelated to the motivation for
| requesting core changes.
`----

I hope that allays at least one of your concerns.
I get a suggestion about once a month to modify the core of the library for this or that reason. Aside from bugs, it usually boils down to the suggestor looking at the code and seeing - "Oh I could fix this right there!" without considering all the repercussions and without considering the alternatives. (As you might guess, this is what I believe happened in this case).
Actually Matthias' considerations went much deeper than you give him credit for. In my opinion, he just failed to communicate his rationale properly, and since the details of his code seemed to you to violate basic principles of your design, I'm sure it was all the more difficult for you to understand the problems he is trying to avoid. Working from new code that (I hope!) won't cause you any alarm, it might be easier to understand the rationale.
Another common occurrence is the attempt to use the serialization system to accomplish some end for which it is not suited. A typical idea is to use it to implement some externally defined file format. I know I drag my feet, I know it drives people crazy, but I truly believe that the success of the library is due in no small part to my reluctance to add in any more than is absolutely necessary.
Understood. It might be a good idea for you to clearly define the intended scope of the library. What criteria distinguish an appropriate application from an inappropriate one? I'm interested in hearing your intention as the library author, rather than something like "an appropriate application is one that works well with the library as it is currently specified and/or implemented." Depending on your answer, we might indeed be barking up the wrong tree.
So, I look forward to seeing progress on the following:
a) better handling of special optimization opportunities which obtain for certain combinations of data-types and archives. Hopefully, an elegant implementation will serve as a model for other people's pet additions.
I hope we'll be able to show you something elegant very soon.
b) A portable binary implementation suitable for such things as MPI messages.
Portable binary archives and MPI have little relationship to one another. You don't flatten your data into a portable format, ship it in an MPI message that is just a sequence of bytes, and then deserialize. MPI handles portability internally.
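For illustration (assuming an MPI implementation is available; the helper function name is made up): because a send specifies a typed element count rather than raw bytes, it is the MPI layer that handles any representation differences between heterogeneous machines.

#include <mpi.h>

// Hypothetical helper. The MPI_DOUBLE datatype argument is what lets
// the MPI implementation convert representations when the sender and
// receiver differ.
void send_block(double* data, int count, int dest)
{
    MPI_Send(data, count, MPI_DOUBLE, dest, /* tag */ 0, MPI_COMM_WORLD);
}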
I also expect these to take some time and hope they can be subjected to the boost "process" of public criticism and refinement. This will take more time but result in a better product. Hopefully, it will be less stressful as well - though I doubt it.
I really am trying to wind down my involvement in the serialization library.
That's a bit alarming, actually. Have you got someone else lined up to maintain it? It's important to us and to many others that the library has a future. Without the involvement of the original author, that would be in doubt.
I do want to spend some more time on execution profiling and performance tweaks.
I would like to see the documentation improved on how to do things like you and Matthias are attempting to do. The current documentation does have a section titled "case studies" which seems to me a handy place to put examples of this nature and at the same time show users how to exploit any "add-in" functionality.
Good luck on this
Thanks. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
I consider the submission a use case for archive creation and/or extension.
But I don't understand what you mean about it being a "use case."
I mean an example of how the library can be extended to achieve some specified requirement. In this case, an improved method of saving/loading certain types of data in certain types of archives.
There are some negative consequences of creating the hooks outside Boost.Serialization. Once you understand them, I'm pretty sure you will think they are significant.
I'm all ears. I can really only comment on the proposal submitted and that's what I did.
Let me explain one place where our difference lies.
Having read everything that follows, I don't see any explanation of a "place where our difference lies." The parts I understand (most of it) sound like "motherhood and apple pie" -- good, common sense that's hard to disagree with. Is it a thought that was never finished? Would you care to try to put it more succinctly?
It seemed to me that that submission didn't take these aspects of the design into account. I had presumed that this was because the separation was nowhere really made explicit. I was trying to make up for that. I was concerned that it might not be obvious that the distribution of implementation in the hierarchy of classes was very deliberate and not arbitrary. I can see how someone might look at the way something was done and say, "wow - that's not necessary - we can just collapse out that layer" etc. In fact I would expect a lot of people to react that way when they first see it.

(What follows is a diversion from the question at hand for those who have some extra time or interest. Feel free to skip.)

An interesting thing is how the "implementation organization" comes about. If one is an avid reader of the boost mail archives he will notice a huge amount of discussion about the design of things. How things should be separated, what implementation techniques should be used, etc., etc. - the discussion addresses things at a finer and finer level of detail as time goes on. Most of the discussion is speculative - if one does things this way then you'll be able to do x - but who needs to do x when you can do y, etc. From looking at these discussions, one might get the impression that this is the way something like a largish body of code such as the serialization library is designed. The truth is - it doesn't happen this way - at least not with me. The discussions can be interesting and helpful - up to a point. But once it arrives at a certain level of detail - it's truly beyond the human brain's capacity to imagine all the consequences of these design decisions.

So when I started out, I had:

a) positive experience with Microsoft's MFC serialization
b) a list of things about it that I wanted to "fix"
c) a list of other systems which attempted to address the same issues I did. Although none of these systems included all the things I wanted to fix - many had interesting ideas.
d) a fixed idea that the description of how something is serialized must be orthogonal to the archive implementation.
e) a concise half page description of how it would be used (your Archive Concept)

I made the first tutorial demo and developed that in parallel with the first version of the library. It started out very simple. As time went on, more "requirements" were added. Many of these "requirements" were formulated during the first review. Lots of boost type discussion (good and bad) consumed lots of effort. All this discussion was pretty much summarized in G. Rosenthal's definitive review of the library. It was very complete and very well written. This resulted in much refactoring. After acceptance I realized we needed a polymorphic interface. Dynamic DLL loading resulted in more refactoring. Through all this the original demo tutorial application hardly ever changed.

The final design is the triumph of evolution over intelligent design. There's a very deep lesson here, I'm sure. I see things such as extreme programming vs waterfall design, evolution vs creationism, market capitalism vs socialist central planning, as all related.

(End of diversion)

So, from the above, it's obvious to me that how to implement the serialization system is not at all obvious. (If it were, I would have needed only one iteration!) Like lots of things it might be obvious in retrospect. Or worse, it might LOOK obvious when it's really not. I hope that clarifies things.
It is only in this way that the library can be extended without becoming geometrically more complicated as time goes on.
I am a bit surprised to hear you state flatly that there is only one way to extend the library that can ever work. How can you possibly know you've considered every possibility? I don't have the same confidence, even about problems I've studied for years.
Hmmm - what I meant to say is illustrated by the following: Suppose one has some library L. If it's successful, there will be demand to enhance it as time goes on. This is a "good thing" (tm). Now suppose that the introduction of enhancement E results in L', which presents an API which is a superset of L. Of course it's internally more complex, with "global" modes and object traits etc. It does take more effort to debug than originally anticipated, but it does work, and it's backward compatible and now has the new functionality, and everyone's happy. For a while. Almost everybody. Now it's a little harder to learn to use for beginners. But it's OK.

The success of enhancement E stokes demand for enhancement F. Each additional enhancement is harder to implement, the resulting library can be understood by fewer and fewer people, and it's harder and harder to learn to use. This is a typical cycle which many software products suffer from. (BTW - other products suffer from this as well. It almost seems there is a thermodynamic principle at work - the conceptual integrity of all ideas decreases over time as attempts are made to apply them ever more broadly.)

Now suppose that when demand for enhancement E comes up someone says - wait a minute - you have to implement E as some sort of add-on module. It seems like it's more work. But since the work doesn't make the original code more intricate, the effort to design, code, debug, test and document E is strictly proportional to the size of E.

So while there are lots of ways to extend a library, by choosing an inconvenient method the original utility of the library will suffer - even as the library gains functionality !!! So maybe instead of saying there's only one way to extend the library, I really meant to say there are lots of ways NOT to extend a library. What if the enhancement can't be done as an add-on? Then you've got to refactor the library. This should happen less and less frequently as time goes on.
As time goes on I would hope that this can be improved. But maybe this explains my reluctance to maintain parts of the library beyond the reach of those making other archives.
Other archives? Beyond reach? I don't understand what you're saying here.
I don't remember what I meant to say here. I probably meant to say that I would hope the library grows by adding on more and more functionality through extension and accretion rather than by making the stuff that's already in there more elaborate.
we are going to start from new code that doesn't change any part of Boost.Serialization, so if possible, it might be better to try to forget about what you've seen before.
no problem - I can't remember that far back anyway.
I suspect that the job of making a portable binary archive is much harder than it first appears.
Actually it's almost trivial (I did it over 10 years ago), but I don't know what that has to do with what we're trying to accomplish.
The speedups we're proposing don't have anything in particular to do with portable binary archives.
I presumed too much then. From the thread discussion, it seemed that this was just the initial effort to adapt the serialization library to the needs of High Performance Computing. XDR compatibility (http://www.faqs.org/rfcs/rfc1014.html) was mentioned at some point, as was MPI (http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-1.1/node39.htm#Node3... I think). Both of these entail a portable binary format - with attendant endian issues. Maybe mentioning these in the context of a discussion of the submission - which didn't really mention them - confused things in my own mind. So just to keep the pot boiling - it seems to me that gaining the 10x speedup associated with "bitwise collection" serialization in the context of portable binary archives such as XDR is going to be a tall order.
I didn't pursue this as I really don't want to discourage these kinds of efforts and they are (or should be) orthogonal to the library as it is currently implemented. If they can be implemented without altering the core - then I have no problem. If someone believes that modifying the core is unavoidable, then either he or I have made some sort of mistake and it will have to be resolved.
It's not unavoidable; as I've said before, it just has consequences that we don't like, and we think you probably won't like either. If you can hang on until we've presented what we think is the best design that avoids altering the core, then we can look at the consequences. Once you understand them, if you still don't want to make any changes and you're willing to accept the consequences, we're not going to press the issue any further.
Fine, I was asked to comment on what was submitted. We'll start the next round with a clean slate.
If they don't really have to alter the core, but the archive author thinks it would make his job easier - then we have a problem.
Let me be very clear about this, at least:
,----
| Ease of archive implementation is unrelated to the motivation for
| requesting core changes.
`----
I hope that allays at least one of your concerns.
It does. And I'm sure you probably deal with this on a regular basis with your own libraries.
I get a suggestion about once a month to modify the core of the library for this or that reason. Aside from bugs, it usually boils down to the suggestor looking at the code and seeing - "Oh I could fix this right there!" without considering all the repercussions and without considering the alternatives. (As you might guess, this is what I believe happened in this case).
Note that this isn't a personal criticism - it's a natural occurrence that happens all the time.
Actually Matthias' considerations went much deeper than you give him credit for. In my opinion, he just failed to communicate his rationale properly, and since the details of his code seemed to you to violate basic principles of your design, I'm sure it was all the more difficult for you to understand the problems he is trying to avoid.
LOL - I think I understood the code submitted and what it was intended to achieve. As far as I could fathom the rationale, I presented an alternative designed to achieve the same results without sprinkling bits of code throughout lots of other modules.
Working from new code that (I hope!) won't cause you any alarm, it might be easier to understand the rationale.
I guess you and Matthias were somewhat taken aback by my response. Sorry about that. Anyway, it seems you do have an understanding and even appreciation of my concerns so I'm optimistic that the next iteration will be better. The crux of my argument is that I believe that the kinds of extensions you want to implement can best be done without altering the current library. I'm willing to be proved wrong with a counter example - but the last didn't qualify in my opinion. Also it seems that lots of people are using the library in ways I haven't totally foreseen, so there have been lots of opportunities for such counter examples to be presented. (The only one that really stuck was shared_ptr serialization - and I'm still not sure about that!!)
Another common occurrence is the attempt to use the serialization system to accomplish some end for which it is not suited. A typical idea is to use it to implement some externally defined file format. I know I drag my feet, I know it drives people crazy, but I truly believe that the success of the library is due in no small part to my reluctance to add in any more than is absolutely necessary.
Understood. It might be a good idea for you to clearly define the intended scope of the library. What criteria distinguish an appropriate application from an inappropriate one? I'm interested in hearing your intention as the library author, rather than something like "an appropriate application is one that works well with the library as it is currently specified and/or implemented." Depending on your answer, we might indeed be barking up the wrong tree.
The very first sentence of the Overview of the Documentation states: "Here, we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. Such a system can be used to reconstitute an equivalent structure in another program context. Depending on the context, this might be used to implement object persistence, remote parameter passing or other facility. In this system we use the term "archive" to refer to a specific rendering of this stream of bytes. This could be a file of binary data, text data, XML, or some other created by the user of this library." I'm not sure I can make a better statement than that regarding what I expected the library to be used for.
So, I look forward to seeing progress on the following:
a) better handling of special optimization opportunities which obtain for certain combinations of data-types and archives. Hopefully, an elegant implementation will serve as a model for other people's pet additions.
I hope we'll be able to show you something elegant very soon.
No need to hurry on my account.
b) A portable binary implementation suitable for such things as MPI messages.
Portable binary archives and MPI have little relationship to one another. You don't flatten your data into a portable format, ship it in an MPI message that is just a sequence of bytes, and then deserialize. MPI handles portability internally.
I've taken only the most cursory look at MPI. (turns out this may change due to some other project). So I won't dispute this. I don't see how one could pass information between heterogeneous machines without addressing all the issues related to making a portable binary archive. Perhaps MPI leaves that part undefined - but still it will have to be dealt with somewhere.
I also expect these to take some time and hope they can be subjected to the boost "process" of public criticism and refinement. This will take more time but result in a better product. Hopefully, it will be less stressful as well - though I doubt it.
I really am trying to wind down my involvement in the serialization library.
That's a bit alarming, actually. Have you got someone else lined up to maintain it?
I was thinking of Matthias though I've never brought it up
It's important to us and to many others that the library has a future.
As long as people continue to use it I'm sure it will have a future ...
Without the involvement of the original author, that would be in doubt.
...regardless of whether the original author is involved. Does this mean I can't die until I get a replacement?

Personally, I see the idea that the viability of any piece of code is tied to the continuing involvement of the original author as a sign that the code is lacking in some dimension. It should be easy for someone to see what is going on and fix it. If it's not - it's really a failing of the original author. So I've been personally gratified to have people send me fixes to very obscure and arcane bugs. I don't always incorporate the fix, due to design considerations, but I often do. Some of these things are devilishly hard - what happens when code implementing serialization is dynamically unloaded? - things like that. Others are obscure corners of other standards - e.g. how does one encode a string with an embedded \0 into an html string? Or what is a portable way to create an sNaN when loading a portable archive? There are probably lots of little corners with things that need fixing, and the truth is I'm already relying on people with more specialized knowledge to help with these things. So already things are moving to other people on a case by case basis. I would hope to see the library grow and prosper by seeing things layered on top of it. Thus my personal involvement should taper off, as it seems to have in other successful boost libraries - and as it should in any successful programming project.

There is one kind of change that I would like to see in the core library as time goes on. I would like to see certain things migrate out of the library and become boostified. Examples are things like strong typedef, extended typeinfo, and dataflow iterators (my personal favorite). I recognize that that is a little unrealistic, and I never mess with these things, so it's not a big issue - it's just that I would like to see the library smaller. Also it would be interesting to see if the boost class factory can be used to replace similar functionality implemented in the serialization library - there may be other such cases. Robert Ramey

"Robert Ramey" <ramey@rrsd.com> writes:
I suspect that the job of making a portable binary archive is much harder than it first appears.
Actually it's almost trivial (I did it over 10 years ago), but I don't know what that has to do with what we're trying to accomplish.
The speedups we're proposing don't have anything in particular to do with portable binary archives.
I presumed too much then. From the thread discussion, it seemed that this was just the initial effort to adapt the serialization library to the needs of High Performance Computing.
That is the immediate application and the initial motivation for the proposal. However, there is a general class of problems not specific to what is usually thought of as HPC (tautological definitions of HPC aside) for which the proposed enhancements can provide dramatic speedups.
XDR compatibility (http://www.faqs.org/rfcs/rfc1014.html) was mentioned at some point, as was MPI (http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-1.1/node39.htm#Node3... I think). Both of these entail a portable binary format - with attendant endian issues.
Those are indeed important applications for the proposed enhancements.
Maybe mentioning these in the context of a discussion of the submission - which didn't really mention them - confused things in my own mind.
Perhaps.
I didn't pursue this as I really don't want to discourage these kinds of efforts and they are (or should be) orthogonal to the library as it is currently implemented. If they can be implemented without altering the core - then I have no problem. If someone believes that modifying the core is unavoidable, then either he or I have made some sort of mistake and it will have to be resolved.
It's not unavoidable; as I've said before, it just has consequences that we don't like, and we think you probably won't like either. If you can hang on until we've presented what we think is the best design that avoids altering the core, then we can look at the consequences. Once you understand them, if you still don't want to make any changes and you're willing to accept the consequences, we're not going to press the issue any further.
Fine, I was asked to comment on what was submitted. We'll start the next round with a clean slate.
Great.
If they don't really have to alter the core, but the archive author thinks it would make his job easier - then we have a problem.
Let me be very clear about this, at least:
,----
| Ease of archive implementation is unrelated to the motivation for
| requesting core changes.
`----
I hope that allays at least one of your concerns.
It does. And I'm sure you probably deal with this on a regular basis with your own libraries.
Actually, I can't remember a time when the convenience of my users or extenders came into conflict with the conceptual integrity or maintainability of my library. Maybe it's just a mindset thing, or maybe it's something about the problem space you're addressing. Regardless, I understand and sympathize with your position.
I get a suggestion about once a month to modify the core of the library for this or that reason. Aside from bugs, it usually boils down to the suggestor looking at the code and seeing - "Oh I could fix this right there!" without considering all the repercussions and without considering the alternatives. (As you might guess, this is what I believe happened in this case).
Note that this isn't a personal criticism - it's a natural occurrence that happens all the time.
It's not insulting; it just happens to be mistaken. Matthias tried hard to avoid modifying the library, but these consequences I keep referring to can't be avoided any other way. Regardless, I hope you'll be able to suspend judgement until we get to that part of the discussion.
Actually Matthias' considerations went much deeper than you give him credit for. In my opinion, he just failed to communicate his rationale properly, and since the details of his code seemed to you to violate basic principles of your design, I'm sure it was all the more difficult for you to understand the problems he is trying to avoid.
LOL - I think I understood the code submitted and what it was intended to achieve.
Of course you do. However, there's no sign that you understand the reasons for core changes. When Matthias asked the question that was aimed at highlighting those reasons, you stopped replying, and even though you've restarted the thread, you still haven't answered it. That said, at this point I think you should hold off until after I've methodically built the context for the question.
As far as I could fathom the rationale, I presented an alternative designed to achieve the same results without sprinkling bits of code throughout lots of other modules.
Yes, as I've been saying, without core changes, fast array serialization is achievable, but not without significant costs that I don't think you've considered. At least, I haven't seen any evidence that you have.
Working from new code that (I hope!) won't cause you any alarm, it might be easier to understand the rationale.
I guess you and Matthias were somewhat taken aback by my response. Sorry about that. Anyway, it seems you do have an understanding and even appreciation of my concerns so I'm optimistic that the next iteration will be better.
I'm pretty sure you'll be more comfortable. It's hard to imagine what you could object to in code that only builds upon Boost.Serialization. If we're not going to modify the existing library we could even keep out of your directories and namespaces.
The crux of my argument is that I believe that the kinds of extensions you want to implement can best be done without altering the current library.
Yes, that's very clear.
I'm willing to be proved wrong with a counter example - but the last didn't qualify in my opinion.
There's no proof, and never will be. When we've come to an understanding about the consequences of not altering the core, you'll either decide they're bad enough to warrant an alteration, or you won't. It's a judgement call.
It might be a good idea for you to clearly define the intended scope of the library... Depending on your answer, we might indeed be barking up the wrong tree.
The very first sentence of the Overview of the Documentation states:
"Here, we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. Such a system can be used to reconstitute an equivalent structure in another program context. Depending on the context, this might used implement object persistence, remote parameter passing or other facility. In this system we use the term "archive" to refer to a specific rendering of this stream of bytes. This could be a file of binary data, text data, XML, or some other created by the user of this library. "
I'm not sure I can make a better statement than that regarding what I expected the library to be used for.
You're right, that's pretty specific, and it shows me that our application is well within the bounds of your intention.
So, I look forward to seeing progress on the following:
a) better handling of special optimization opportunities which obtain for certain combinations of data-types and archives. Hopefully, an elegant implementation will serve as a model for other people's pet additions.
I hope we'll be able to show you something elegant very soon.
No need to hurry on my account.
The "soonness" is for our benefit, not yours. :)
b) A portable binary implementation suitable for such things as MPI messages.
Portable binary archives and MPI have little relationship to one another. You don't flatten your data into a portable format, ship it in an MPI message that is just a sequence of bytes, and then deserialize. MPI handles portability internally.
I've taken only the most cursory look at MPI. (turns out this may change due to some other project). So I won't dispute this. I don't see how one could pass information between heterogeneous machines without addressing all the issues related to making a portable binary archive. Perhaps MPI leaves that part undefined -
No, I'm telling you the opposite. MPI addresses those issues directly.
but still it will have to be dealt with somewhere.
Right, MPI deals with it.
I also expect these to take some time and hope they can be subjected to the boost "process" of public criticism and refinement. This will take more time but result in a better product. Hopefully, it will be less stressful as well - though I doubt it.
I really am trying to wind down my involvement in the serialization library.
That's a bit alarming, actually. Have you got someone else lined up to maintain it?
I was thinking of Matthias though I've never brought it up
I seriously doubt that would be possible. Matthias has far too many jobs already. -- Dave Abrahams Boost Consulting www.boost-consulting.com

"Robert Ramey" <ramey@rrsd.com> writes:
Anyway, it seems you do have an understanding and even appreciation of my concerns so I'm optimistic that the next iteration will be better.
Here's the next iteration.

The Design
==========

In the text and code that follows, the word "array" usually refers to a contiguous sequence of instances of a single datatype, and not to a C++ builtin array type of the form T[N]. I'll try to be explicit when I intend to describe the latter.

Organization
------------

We have no attachment to the organization proposed below and if you don't like it we'd be happy to move all the proposed code into a completely separate area of Boost. Please accept it just for the purposes of this discussion.

We propose to add the following new files and directories:

  boost/serialization/
      load_array.hpp   - hooks for deserializing into arrays
      save_array.hpp   - hooks for serializing from arrays
      array.hpp        - dispatching tools

  boost/archive/array/
      iarchive.hpp     - base class templates for authors of
      oarchive.hpp       array-optimized archives.

      binary_iarchive.hpp              - archives that use the hooks
      binary_oarchive.hpp                for std::vector and T[N]
      polymorphic_binary_iarchive.hpp
      polymorphic_binary_oarchive.hpp

      ...other array-optimized archives...

Details
-------

In this section I'll show the important details of some of the files above, to give a clear sense of the mechanisms in use. The snippets below are synopses, leaving out details like #includes and, usually, namespaces. We're also only showing the "load" half of the code, since the "save" half is almost identical with s/load/save/ and s/>>/<</. Finally, while all the mechanisms have been tested, the code shown here is not a direct copy of tested code and may contain errors.

serialization/array.hpp
.......................

// When passed an archive pointer and a data pointer, returns a tag
// indicating whether optimization should be applied.
mpl::false_ optimize_array(...)
{
    return mpl::false_();
}

serialization/load_array.hpp
............................

// Hooks for loading arrays

// optimized_load_array
//
// Used to select either the standard array loading procedure or an
// optimized one depending on properties of the array's element type.
// Will usually be called with an MPL predicate as a fifth argument
// saying whether optimization should be applied, e.g.:
//
//     optimized_load_array(ar, p, n, v, is_fundamental<element_type>())
//
// Most array-optimized archives won't need to call it directly,
// since they will be derived from archive::array::iarchive,
// which provides the call.
template <class Archive, class ValueType>
void optimized_load_array(
    Archive& ar, ValueType * p, std::size_t n, unsigned int version,
    boost::mpl::false_)
{
    // Optimization not appropriate; use the standard procedure
    while (n--)
        ar >> serialization::make_nvp("item", *p++);
}

template <class Archive, class ValueType>
void optimized_load_array(
    Archive& ar, ValueType * p, std::size_t n, unsigned int version,
    boost::mpl::true_)
{
    // dispatch to the archive-format-specific optimization for
    // types that meet the optimization criteria
    ar.load_array(p, n, version);
}

// load_array
//
// Authors of serialization for types containing arrays will call
// this function to ensure that optimizations will be applied when
// possible.
template <class Archive, class ValueType>
inline void load_array(
    Archive& ar, ValueType * p, std::size_t n, unsigned int version)
{
    serialization::optimized_load_array(
        ar, p, n, version, optimize_array(&ar, p)
    );
}

archive/array/iarchive.hpp
..........................

Based in part on a suggestion of yours, for handling vectors and builtin arrays.
// To conveniently array-optimize an input archive X:
//
// * Derive it from iarchive<X, Impl>, where Impl is an
//   archive implementation base class from
//   Boost.Serialization
//
// * add a member function template that implements the
//   procedure for serializing arrays of T (for appropriate T)
//
//       template <class T>
//       load_array(T* p, size_t nelems, unsigned version)
//
// * add a unary MPL lambda expression member called
//   use_array_optimization whose result is convertible to
//   mpl::true_ iff array elements of type T can be serialized
//   with the load_array member function, and to mpl::false_ if
//   the unoptimized procedure must be used.

namespace archive { namespace array {

template <class Derived, class Base>
class iarchive : public Base
{
public:
    template <class S>
    iarchive(S& s, unsigned int flags) : Base(s, flags) {}

    // Load std::vector<T> using load_array
    template<class T>
    void load_override(std::vector<T> &x, unsigned int version);

    // Load T[N] using load_array
    template<class T, std::size_t N>
    void load_override(T(&x)[N], unsigned int version);

    // Load everything else in the usual way, forwarding on to the
    // Base class
    template<class T>
    void load_override(T& x, unsigned BOOST_PFTO int version);

protected:
    typedef iarchive iarchive_base;  // convenience for derivers
};

}} // namespace archive::array

namespace serialization {

// Overload optimize_array for array-optimized iarchives. This
// version evaluates an MPL lambda expression in the archive to
// say whether its load_array member should be used.
//
// If not for the lack of ADL in vc6/7, this could go
// in archive::array
template <class Archive, class Base, class ValueType>
typename mpl::apply1<
    typename Archive::use_array_optimization
  , ValueType
>::type
optimize_array(array::iarchive<Archive, Base>*, ValueType*)
{
    typedef typename mpl::apply1<
        BOOST_DEDUCED_TYPENAME Archive::use_array_optimization
      , ValueType
    >::type result;
    return result();
}

} // end namespace serialization

archive/array/binary_iarchive.hpp
.................................

class binary_iarchive
  : public array::iarchive<
        array::binary_iarchive
      , archive::binary_iarchive_impl<binary_iarchive>
    >
{
public:
    template <class S>
    binary_iarchive(S& s, unsigned int flags)
      : binary_iarchive::iarchive_base(s, flags)
    {}

    // use the optimized load procedure for all fundamental types.
    typedef boost::is_fundamental<mpl::_> use_array_optimization;

    // This is how we load an array when optimization is appropriate.
    template <class ValueType>
    void load_array(ValueType * p, std::size_t n, unsigned int version)
    {
        this->load_binary(p, n * sizeof(ValueType));
    }
};

This completes the design presentation. After you've digested it and we've answered any questions you might have, we can move on to evaluating its strengths and weaknesses. -- Dave Abrahams Boost Consulting www.boost-consulting.com
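To make the intended usage concrete, here is a sketch (not part of the proposal's files) of how the author of serialization for a contiguous container such as std::valarray might call the proposed load_array hook. The nvp name "count" and the plain std::size_t count are simplifications made here, leaving the thread's container_size_type discussion aside.

#include <cstddef>
#include <valarray>
#include <boost/serialization/nvp.hpp>
// plus the proposed header, assumed here:
// #include <boost/serialization/load_array.hpp>

namespace boost { namespace serialization {

template <class Archive, class T>
void load(Archive & ar, std::valarray<T> & t, const unsigned int version)
{
    std::size_t count;
    ar >> make_nvp("count", count);
    t.resize(count);
    if (count)
        // load_array dispatches to ar.load_array() when the archive
        // declares the optimization applicable for T, and falls back
        // to element-by-element loading otherwise.
        load_array(ar, &t[0], count, version);
}

}} // namespace boost::serialization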

David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
Anyway, it seems you do have an understanding and even appreciation of my concerns so I'm optimistic that the next iteration will be better.
Here's the next iteration.
I've got it. I gave it a short look over and it seems a step in the right direction. However, to do it proper justice will require my spending some more time with it which will take a few days to get to. I hope that's all right. Robert Ramey

"Robert Ramey" <ramey@rrsd.com> writes:
David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
Anyway, it seems you do have an understanding and even appreciation of my concerns so I'm optimistic that the next iteration will be better.
Here's the next iteration.
I've got it. I gave it a short look over and it seems a step in the right direction.
Thanks.
However, to do it proper justice will require my spending some more time with it which will take a few days to get to. I hope that's all right.
perfectly. -- Dave Abrahams Boost Consulting www.boost-consulting.com

I have one question about this. What is the ultimate purpose? That is, is it just to optimize serialization of certain types of collections of bit streamable objects, or does it have some more ambitious goal? Robert Ramey

"Robert Ramey" <ramey@rrsd.com> writes:
I have one question about this.
What is the ultimate purpose? That is, is it just to optimize serialization of certain types of collections of bit streamable objects, or does it have some more ambitious goal?
I thought I highlighted the ultimate purpose quite clearly already:

,----
| For many archive formats and common datatypes there exist APIs
| that can quickly read or write contiguous sequences of those types
| all at once (**). Reading or writing such a sequence by
| separately reading or writing each element (as the serialization
| library currently does) can be an order of magnitude more
| expensive.
`----

We want to be able to capitalize on the existence of those APIs, and to do that we need a "hook" that will be used whenever a contiguous sequence is going to be (de)serialized. No such hook exists in Boost.Serialization.

(**) Note that this capability is not necessarily tied to bitwise serialization or the use of a binary representation.

In particular, I took special pains to clarify above (**) that this is *not* merely about "serialization of certain types of collections of bit streamable objects."

If that's unclear, maybe you could ask some specific questions so that I know what needs to be clarified. -- Dave Abrahams Boost Consulting www.boost-consulting.com

On Nov 22, 2005, at 5:33 PM, David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
I have one question about this.
What is the ultimate purpose? That is, is it just to optimize serialization of certain types of collections of bit streamable objects, or does it have some more ambitious goal?
I thought I highlighted the ultimate purpose quite clearly already:
,---- | For many archive formats and common datatypes there exist APIs | that can quickly read or write contiguous sequences of those types | all at once (**). Reading or writing such a sequence by | separately reading or writing each element (as the serialization | library currently does) can be an order of magnitude more | expensive. `----
We want to be able to capitalize on the existence of those APIs, and to do that we need a "hook" that will be used whenever a contiguous sequence is going to be (de)serialized. No such hook exists in Boost.Serialization.
(**) Note that this capability is not necessarily tied to bitwise serialization or the use of a binary representation.
In particular, I took special pains to clarify above (**) that this is *not* merely about "serialization of certain types of collections of bit streamable objects."
To be a bit more specific, and give one example, an MPI message passing library will convert the binary representation of data when a message is sent between machines with incompatible binary formats. This is then more than just streaming the bits. This is all hidden by the API though, and we do not worry about it. Matthias

David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
I have one question about this.
What is the ultimate purpose? That is, is it just to optimize serialization of certain types of collections of bit streamable objects, or does it have some more ambitious goal?
I thought I highlighted the ultimate purpose quite clearly already:
,---- | For many archive formats and common datatypes there exist APIs | that can quickly read or write contiguous sequences of those types | all at once (**). Reading or writing such a sequence by | separately reading or writing each element (as the serialization | library currently does) can be an order of magnitude more | expensive. `----
We want to be able to capitalize on the existence of those APIs, and to do that we need a "hook" that will be used whenever a contiguous sequence is going to be (de)serialized. No such hook exists in Boost.Serialization.
(**) Note that this capability is not necessarily tied to bitwise serialization or the use of a binary representation.
In particular, I took special pains to clarify above (**) that this is *not* merely about "serialization of certain types of collections of bit streamable objects."
If that's unclear, maybe you could ask some specific questions so that I know what needs to be clarified.
Could you give some other examples? Other than bit serializable types which can benefit from using binary read/write - none other have occurred to me.

Another thing I'm wondering about is whether any work has been done to determine the source of the "10x speed up". For arrays of primitives, it would seem that the replacement of a loop of binary reads with one binary read of a larger block might explain it. If that were the case it might be most fruitful to invest efforts in a different kind of i/o stream which only supports read/write but doesn't deal with all the operators, codecvt facets, etc. In my personal work, I've found that i/o stream is very convenient - but it is a performance killer for binary i/o. Another possibility is a binary archive which doesn't depend upon i/o stream at all but rather fopen, fwrite, etc. In fact, in my own work, I've even found that too slow, so I had to replace it with my own version one step closer to the OS which also exploited asio.h. This in turn entailed writing an asio implementation which wraps Windows async i/o API calls. My guess is that if I wanted to speed up serialization this would be a more effective direction.

Another thing that I'm curious about is how much compilers can really collapse inline code when it's theoretically possible. In the case of an array of primitives, things should collapse to a loop of stream read calls without even calling anything inside the compiled library. I don't have any real knowledge as to which compilers - if any - actually are doing that. I guess I could display the disassembly and maybe it will come to that. But for now I think I don't have all the information I need to understand this. Robert Ramey

"Robert Ramey" <ramey@rrsd.com> writes:
David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
I have one question about this.
What is the ultimate purpose? That is, is it just to optimize serialization of certain types of collections of bit streamable objects, or does it have some more ambitious goal?
I thought I highlighted the ultimate purpose quite clearly already:
,---- | For many archive formats and common datatypes there exist APIs | that can quickly read or write contiguous sequences of those types | all at once (**). Reading or writing such a sequence by | separately reading or writing each element (as the serialization | library currently does) can be an order of magnitude more | expensive. `----
We want to be able to capitalize on the existence of those APIs, and to do that we need a "hook" that will be used whenever a contiguous sequence is going to be (de)serialized. No such hook exists in Boost.Serialization.
(**) Note that this capability is not necessarily tied to bitwise serialization or the use of a binary representation.
In particular, I took special pains to clarify above (**) that this is *not* merely about "serialization of certain types of collections of bit streamable objects."
If that's unclear, maybe you could ask some specific questions so that I know what needs to be clarified.
Could you give some other examples? Other than bit serializable types which can benefit from using binary read/write - none other have occurred to me.
Well, it's not clear what exactly you mean by "bit serializable," but I assume you're referring to types other than PODs. Here's just one very trivial example: imagine a Unicode library includes code to serialize and deserialize arrays of such strings. If the library is separately compiled, merely crossing the boundary between the Unicode library and serialization code in a loop will incur the cost of a function call for each element. If you can call a function in the Unicode library to serialize all the strings in an array at once, it's a performance win. A data structure containing an array of many short strings (short strings are very common, thus the effectiveness of the short string optimization) would benefit from avoiding that overhead. Furthermore, if you can serialize more elements within a single function call, you can apply loop unrolling for dramatic speedups: better than 2x in my tests. Your STL implementation (if it's any good) does loop unrolling internally to get this optimization.
Another thing I'm wondering about is whether any work has been done to determine the source of the "10x speed up". For arrays of primitives, it would seem that the replacement of a loop of binary reads with one binary read of a larger block might explain it.
As mentioned above, that's part of the explanation in some cases. But that's not the whole story. For example:

- In the case of binary serialization you can also save the cost of repeated per-element comparisons in the stream buffer implementation to make sure you're not overrunning the buffers.

- In the case of MPI the result of the optimizations is that MPI can transfer complex data structures directly into the hardware's communication buffers without making an additional copy in memory. That's blazingly fast, as it can almost all happen in hardware. Furthermore, doing anything else is actually not possible in our application, because there's not enough memory for an in-memory copy of the data structure.
If that were the case it might be most fruitful to invest efforts in a different kind of i/o stream which only supports read/write but doesn't deal with all the operators, codecvt facets, etc.
Why do you think that would be most fruitful?
In my personal work, I've found that i/o stream is very convenient - but it is a performance killer for binary i/o.
The use of iostreams for binary I/O is IMO a design mistake -- according to the experts, binary I/O should be done directly to streambufs. But that's really irrelevant, as it isn't a performance-limiting factor for us.
Another possibility is a binary archive which doesn't depend upon i/o stream at all but rather fopen, fwrite, etc. In fact, in my own work, I've even found that too slow, so I had to replace it with my own version one step closer to the OS which also exploited asio.h. This in turn entailed writing an asio implementation which wraps Windows async i/o API calls.
My guess is that if I wanted to speed up serialization this would be a more effective direction.
Why do you think that would be _more_ effective? Did you achieve a 10x speedup by that approach?
Another thing that I'm curious about is how much compilers can really collapse inline code when it's theoretically possible. In the case of an array of primitives, things should collapse to a loop of stream read calls without even calling anything inside the compiled library.
Right.
I don't have any real knowledge as to which compilers - if any - actually are doing that.
They can all do that; it's just one level of inlining. If inlining didn't do that it would be almost pointless. That's one reason, for example, that the STL can compete with or beat hand-written code. If inlining couldn't collapse loops, Blitz++ wouldn't have stood a chance at beating hand-written FORTRAN (http://www.oonumerics.org/blitz/benchmarks/).
I guess I could display the disassembly and maybe it will come to that. But for now I think I don't have all the information I need to understand this.
We'll be happy to try and help you to understand it. Just keep asking questions. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
We want to be able to capitalize on the existence of those APIs, and to do that we need a "hook" that will be used whenever a contiguous sequence is going to be (de)serialized. No such hook exists in Boost.Serialization.
FWIW, this is what I do in my library:

  template<class W, class T, class A>
  void write(W & w, std::vector<T, A> const & v)
  {
      int m = v.size();
      begin_sequence(w, io::type_of(v), io::type<T>(), m);
      if(m > 0) write_sequence(w, &v[0], m);
      end_sequence(w, io::type_of(v), io::type<T>(), m);
  }

The default implementation of write_sequence just does:

  template<class W, class It>
  inline void write_sequence(W & w, It first, std::size_t m)
  {
      for(; m > 0; ++first, --m)
          write(w, *first);
  }

but writers that support contiguous fast writes overload write_sequence( my_writer&, T*, size_t ).

Looking at collections_save_imp.hpp:

  template<class Archive, class Container>
  inline void save_collection(Archive & ar, const Container &s)
  {
      // record number of elements
      unsigned int count = s.size();
      ar << make_nvp("count", const_cast<const unsigned int &>(count));
      BOOST_DEDUCED_TYPENAME Container::const_iterator it = s.begin();
      while(count-- > 0){
          //if(0 == (ar.get_flags() & boost::archive::no_object_creation))
          // note borland emits a no-op without the explicit namespace
          boost::serialization::save_construct_data_adl(ar, &(*it), 0U);
          ar << boost::serialization::make_nvp("item", *it++);
      }
  }

all that needs to be done is:

  template<class Archive, class It>
  inline void save_sequence(Archive & ar, It it, unsigned count)
  {
      while(count-- > 0){
          //if(0 == (ar.get_flags() & boost::archive::no_object_creation))
          // note borland emits a no-op without the explicit namespace
          boost::serialization::save_construct_data_adl(ar, &(*it), 0U);
          ar << boost::serialization::make_nvp("item", *it++);
      }
  }

  template<class Archive, class Container>
  inline void save_collection(Archive & ar, const Container &s)
  {
      // record number of elements
      unsigned int count = s.size();
      ar << make_nvp("count", const_cast<const unsigned int &>(count));
      save_sequence( ar, s.begin(), count );
  }

unless I'm missing something fundamental. So what's all the fuss about?
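For concreteness, a tiny compilable sketch of the kind of overload Peter mentions for writers that support contiguous fast writes; the writer type and the file handling are invented for illustration and are not his actual library.

  #include <cstddef>
  #include <cstdio>
  #include <vector>

  struct fast_writer { std::FILE* file; }; // invented writer wrapping a FILE*

  // contiguous fast path: the whole block goes out in one fwrite call
  // instead of one write() per element
  inline void write_sequence(fast_writer& w, double const* first, std::size_t m)
  {
      std::fwrite(first, sizeof(double), m, w.file);
  }

  int main()
  {
      std::vector<double> v(1000, 3.14);
      fast_writer w = { std::fopen("data.bin", "wb") };
      if (w.file)
      {
          write_sequence(w, &v[0], v.size());
          std::fclose(w.file);
      }
  }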

Peter Dimov wrote:
David Abrahams wrote:
We want to be able to capitalize on the existence of those APIs, and to do that we need a "hook" that will be used whenever a contiguous sequence is going to be (de)serialized. No such hook exists in Boost.Serialization.
[snip]
Looking at collections_save_imp.hpp:
template<class Archive, class Container> inline void save_collection(Archive & ar, const Container &s)
[...]
all that needs to be done is:
template<class Archive, class It> inline void save_sequence(Archive & ar, It it, unsigned count) { while(count-- > 0){ //if(0 == (ar.get_flags() & boost::archive::no_object_creation)) // note borland emits a no-op without the explicit namespace boost::serialization::save_construct_data_adl(ar, &(*it), 0U); ar << boost::serialization::make_nvp("item", *it++); } }
template<class Archive, class Container> inline void save_collection(Archive & ar, const Container &s) { // record number of elements unsigned int count = s.size(); ar << make_nvp("count", const_cast<const unsigned int &>(count)); save_sequence( ar, s.begin(), count ); }
unless I'm missing something fundamental.
So what's all the fuss about?
That isn't quite all that needs to be done.

(1) minor nit: an interface that uses (iterator, size) would be better than a container-based algorithm because that would make it easier to do optimizations based on the iterator type (eg, memcpy, or MPI operations in the case of a pointer, or maybe some kind of distributed iterator in combination with a parallel IO library?). Also, the collection isn't necessarily in the form of a container (although a proxy container would probably suffice for that case, and come to think of it, to handle resizing the container on load it might actually be preferable).

(2) another minor nit: it is probably more convenient to handle the details of save_sequence() inside the archive (similarly to other primitive types), rather than as a free function.

(3) : save_collection() [or some functional equivalent] isn't part of the public interface of the serialization library. For whatever reason this seems to be the sticking point. Making it an optional add-on is OK, but you really want people to use it _by default_, otherwise you need to go and rewrite all their serialization functions to make use of whatever additional functionality the archive provides.

Cheers, Ian

Ian McCulloch wrote:
Peter Dimov wrote:
template<class Archive, class It> inline void save_sequence(Archive & ar, It it, unsigned count) { while(count-- > 0){ //if(0 == (ar.get_flags() & boost::archive::no_object_creation)) // note borland emits a no-op without the explicit namespace boost::serialization::save_construct_data_adl(ar, &(*it), 0U); ar << boost::serialization::make_nvp("item", *it++); } }
template<class Archive, class Container> inline void save_collection(Archive & ar, const Container &s) { // record number of elements unsigned int count = s.size(); ar << make_nvp("count", const_cast<const unsigned int &>(count)); save_sequence( ar, s.begin(), count ); }
unless I'm missing something fundamental.
So what's all the fuss about?
That isn't quite all that needs to be done.
You are right, a std::vector needs to be special-cased to use a pointer.
(1) minor nit: an interface that uses (iterator, size) would be better than a container-based algorithm because that would make it easier to do optimizations based on the iterator type (eg, memcpy, or MPI operations in the case of a pointer, or maybe some kind of distributed iterator in combination with a parallel IO library?).
I don't understand.
template<class Archive, class It> inline void save_sequence(Archive & ar, It it, unsigned count)
looks decidedly (iterator, size) based to me.
Also, the collection isn't necessarily in the form of a container (although a proxy container would probably suffice for that case, and come to think of it, to handle resizing the container on load it might actually be preferable).
I don't understand this either.
(2) another minor nit: it is probably more convenient to handle the details of save_sequence() inside the archive (similarly to other primitive types), rather than as a free function.
The point is that you can overload the free function inside your archive's namespace.
(3) : save_collection() [or some functional equivalent] isn't part of the public interface of the serialization library. For whatever reason this seems to be the sticking point.
save_collection isn't - and probably shouldn't - but save_sequence would be.

Peter Dimov wrote:
Ian McCulloch wrote:
Peter Dimov wrote:
[snip]
So what's all the fuss about?
That isn't quite all that needs to be done.
You are right, a std::vector needs to be special-cased to use a pointer.
(1) minor nit: an interface that uses (iterator, size) would be better than a container-based algorithm because that would make it easier to do optimizations based on the iterator type (eg, memcpy, or MPI operations in the case of a pointer, or maybe some kind of distributed iterator in combination with a parallel IO library?).
I don't understand.
If you are going to overload something, it is easier to overload on a type that actually appears in the function signature rather than doing a dispatch based on a nested ::iterator type. Anyway this is moot because I misunderstood you to mean that save_collection would be the customization point, but you apparently intended save_array instead. Sorry.
template<class Archive, class It> inline void save_sequence(Archive & ar, It it, unsigned count)
looks decidedly (iterator, size) based to me.
Also, the collection isn't necessarily in the form of a container (although a proxy container would probably suffice for that case, and come to think of it, to handle resizing the container on load it might actually be preferable).
I don't understand this either.
For example, if you had a bare array that you managed yourself with array new/delete, you would need to make some kind of proxy container to pass it to save_collection. But on loading the array you need to do a resize somewhere, and a proxy container would be one way to handle that (the resize() member of the proxy would handle the memory reallocation).
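A rough sketch of the kind of proxy container Ian describes, with invented names: it gives a bare, manually managed array a container-like interface, and its resize() member handles the reallocation needed on load.

  #include <cstddef>

  template <class T>
  class array_proxy
  {
  public:
      array_proxy(T*& data, std::size_t& size) : data_(data), size_(size) {}

      T*          begin() const { return data_; }
      T*          end()   const { return data_ + size_; }
      std::size_t size()  const { return size_; }

      // on load: reallocate the underlying array to the requested size
      void resize(std::size_t n)
      {
          delete [] data_;
          data_ = new T[n];
          size_ = n;
      }

  private:
      T*&          data_;
      std::size_t& size_;
  };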
(2) another minor nit: it is probably more convenient to handle the details of save_sequence() inside the archive (similarly to other primitive types), rather than as a free function.
The point is that you can overload the free function inside your archive's namespace.
Ok, but as Dave pointed out the lookup rules complicate this approach.
(3) : save_collection() [or some functional equivalent] isn't part of the public interface of the serialization library. For whatever reason this seems to be the sticking point.
save_collection isn't - and probably shouldn't - but save_sequence would be.
Fine. Anything that allows customization of arrays/sequences. The actual interface is a detail at this point. Cheers, Ian

"Peter Dimov" <pdimov@mmltd.net> writes:
all that needs to be done is:
template<class Archive, class It> inline void save_sequence(Archive & ar, It it, unsigned count) { while(count-- > 0){ //if(0 == (ar.get_flags() & boost::archive::no_object_creation)) // note borland emits a no-op without the explicit namespace boost::serialization::save_construct_data_adl(ar, &(*it), 0U); ar << boost::serialization::make_nvp("item", *it++); } }
template<class Archive, class Container> inline void save_collection(Archive & ar, const Container &s) { // record number of elements unsigned int count = s.size(); ar << make_nvp("count", const_cast<const unsigned int &>(count)); save_sequence( ar, s.begin(), count ); }
unless I'm missing something fundamental.
So what's all the fuss about?
1. Robert has expressed deep reluctance to change any part of the existing library, which is why we're now presenting a design that avoids touching it.

2. This wouldn't work well for std::vector, since we know its elements are contiguous but we don't know that its iterators are pointers. Yes, I know there are some nasty hacks that will usually work for getting back to pointers, but they're nasty and don't always work.

3. An archive can normally only apply an array optimization to a particular subset of types. This subset varies by archive type and can usually be captured by a type trait such as is_POD or is_fundamental. We'd like to encapsulate that choice in a base class template that allows us to avoid writing complex dispatching logic in each array-optimized archive. Partial ordering rules make that impossible with the above design, because the save_sequence above will be a better match than one that operates on some_base<Archive>.

4. The fuss is really about what happens when you have a design that doesn't insert the equivalent of a save_sequence hook in the serialization library. It's a social/library-interoperability phenomenon that I haven't even had a chance to discuss yet -- and I really don't want to until Robert has had a chance to digest our design and understand where the speedups can come from. -- Dave Abrahams Boost Consulting www.boost-consulting.com
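A small self-contained illustration of point 3 above, with made-up names rather than code from either proposal: the element-by-element overload binds the archive argument exactly, while an overload written once against a common base class template needs a derived-to-base conversion, so the latter is never selected.

  #include <cstddef>
  #include <iostream>

  template <class Derived> struct array_oarchive_base {}; // hypothetical common base
  struct fast_oarchive : array_oarchive_base<fast_oarchive> {};

  // (1) fully generic, element-by-element overload
  template <class Archive, class It>
  void save_sequence(Archive&, It, std::size_t)
  {
      std::cout << "element-by-element overload\n";
  }

  // (2) the overload the base class template would like to supply once
  //     for all array-optimized archives
  template <class Derived, class T>
  void save_sequence(array_oarchive_base<Derived>&, T*, std::size_t)
  {
      std::cout << "array-optimized overload\n";
  }

  int main()
  {
      fast_oarchive ar;
      double data[10] = {};
      save_sequence(ar, data, 10); // prints "element-by-element overload"
  }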

David Abrahams wrote:
3. An archive can normally only apply an array optimization to a particular subset of types. This subset varies by archive type and can usually be captured by a type trait such as is_POD or is_fundamental.
In the situations that I've needed it, the subset of types that could be written in a single operation could not be described by a type trait. A type trait can't tell you whether the in-memory representation and the on-disk representation of a type are the same. I just enumerated the types, guarding the overloads with an appropriate #ifdef on endianness. I don't understand how an archive could handle an array of arbitrary PODs.
We'd like to encapsulate that choice in a base class template that allows us to avoid writing complex dispatching logic in each array-optimized archive.
I usually just add the equivalent of a save_sequence( A&, X*, unsigned ) overload for every X that is supported by A.
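As a sketch of what that looks like in practice (invented archive, not Peter's real code): the raw overloads are enumerated per element type and only compiled when the host byte order matches the archive's byte order; otherwise the generic element-by-element path is used. BOOST_LITTLE_ENDIAN comes from <boost/detail/endian.hpp> (or an equivalent platform check).

  #include <boost/detail/endian.hpp>
  #include <cstddef>
  #include <vector>

  struct raw_oarchive // invented archive writing into a memory buffer
  {
      std::vector<unsigned char> buffer;

      void save_binary(void const* p, std::size_t n)
      {
          unsigned char const* b = static_cast<unsigned char const*>(p);
          buffer.insert(buffer.end(), b, b + n);
      }
  };

  #if defined(BOOST_LITTLE_ENDIAN) // the archive format is little-endian

  // enumerate the element types that may be written raw on this platform
  inline void save_sequence(raw_oarchive& ar, double const* p, std::size_t n)
  {
      ar.save_binary(p, n * sizeof(double));
  }

  inline void save_sequence(raw_oarchive& ar, int const* p, std::size_t n)
  {
      ar.save_binary(p, n * sizeof(int));
  }

  #endif // otherwise fall back to the element-by-element overload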

Peter Dimov wrote:
David Abrahams wrote:
3. An archive can normally only apply an array optimization to a particular subset of types. This subset varies by archive type and can usually be captured by a type trait such as is_POD or is_fundamental.
In the situations that I've needed it, the subset of types that could be written in a single operation could not be described by a type trait. A type trait can't tell you whether the in-memory representation and the on-disk representation of a type are the same. I just enumerated the types, guarding the overloads with an appropriate #ifdef on endianness.
I use a mapping that specifies how fundamental types are to be represented on disk (which is also a function of the archive format), and a function to do the conversion from the in-memory representation to the on-disk representation. Part of this is a trait to indicate if the representations are identical, which enables memcpy optimizations. The logic to determine what the actual in-memory representations are for the host platform is handled by autoconf macros, but the rest is done by metafunctions.
I don't understand how an archive could handle an array of arbitrary PODs.
It can't, obviously. I read that as being an (oversimplified) example of the basic approach of a trait to control optimizations. Indeed, an obvious candidate for (non-portable) optimizations is std::complex<double>, which isn't POD anyway.
We'd like to encapsulate that choice in a base class template that allows us to avoid writing complex dispatching logic in each array-optimized archive.
I usually just add the equivalent of a save_sequence( A&, X*, unsigned ) overload for every X that is supported by A.
I gather the idea is to make possible some kind of portable binary archive that does a dispatch based on the desired on-disk format. And allow an easy method for users to specify a type is memcpy()-able (or more accurately, some way of telling the archive what the in-memory format is so the archive can then decide if it is memcpy-able to the on-disk format) on *their particular hardware*, without adding an overload and doing that logic manually. Cheers, Ian

Ian McCulloch <ianmcc@physik.rwth-aachen.de> writes:
Peter Dimov wrote:
David Abrahams wrote:
3. An archive can normally only apply an array optimization to a particular subset of types. This subset varies by archive type and can usually be captured by a type trait such as is_POD or is_fundamental.
In the situations that I've needed it, the subset of types that could be written in a single operation could not be described by a type trait. A type trait can't tell you whether the in-memory representation and the on-disk representation of a type are the same.
As soon as you talk about "on-disk" you're already limiting your thinking too much, since we're not necessarily serializing to disk, and the idea of having the "same representation" somewhere is somewhat limiting too. An MPI archive builds up a skeletal representation of what needs to be serialized and then MPI reads it out of memory into the hardware. For example, MPI archives support array serialization for all PODs that are not pointers and do not contain pointer members. Other archives have different requirements (I'm sorry that I don't remember what they are but Matthias can give some examples). You can detect a subset of the array-serializable types for any given archive with a type trait, and allow specializations to get optimization for other types.
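A sketch of that last point, with illustrative names only (this is not code from either proposal): a per-archive trait defaults to is_fundamental, and a specialization opts a non-fundamental but contiguously laid-out type in for one particular archive.

  #include <boost/type_traits/is_fundamental.hpp>
  #include <boost/mpl/bool.hpp>
  #include <complex>

  class fast_binary_oarchive; // some array-optimized archive

  template <class Archive, class T>
  struct supports_array_optimization : boost::is_fundamental<T> {};

  // std::complex<double> is contiguous and trivially copyable on the
  // platforms this archive targets, so it is opted in explicitly
  template <>
  struct supports_array_optimization<fast_binary_oarchive, std::complex<double> >
      : boost::mpl::true_ {};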
I don't understand how an archive could handle an array of arbitrary PODs.
It can't, obviously.
Well, technically it depends how pointers need to be treated. It may be that nested pointers are serialized in a separate pass.
I read that as being an (oversimplified) example of the basic approach of a trait to control optimizations.
Right.
Indeed, an obvious candidate for (non-portable) optimizations is std::complex<double>, which isn't POD anyway.
Yes that's true. Actually POD-ness may be less important than whether the type has a trivial destructor.
We'd like to encapsulate that choice in a base class template that allows us to avoid writing complex dispatching logic in each array-optimized archive.
I usually just add the equivalent of a save_sequence( A&, X*, unsigned ) overload for every X that is supported by A.
I gather the idea is to make possible some kind of portable binary archive that does a dispatch based on the desired on-disk format.
No, we keep telling everyone: this is not about portable binary archives (although they will probably benefit), and it's not about on-disk anything. The idea is to take advantage of APIs that can quickly serialize/deserialize contiguous sequences of T, for any T. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
Peter Dimov wrote:
In the situations that I've needed it, the subset of types that could be written in a single operation could not be described by a type trait. A type trait can't tell you whether the in-memory representation and the on-disk representation of a type are the same.
As soon as you talk about "on-disk" you're already limiting your thinking too much, since we're not necessarily serializing to disk, and the idea of having the "same representation" somewhere is somewhat limiting too. An MPI archive builds up a skeletal representation of what needs to be serialized and then MPI reads it out of memory into the hardware.
For example, MPI archives support array serialization for all PODs that are not pointers and do not contain pointer members.
I'm not sure I follow. How does an MPI archive serialize an array of X, where X is an arbitrary POD (without pointer members)?

Peter Dimov wrote:
David Abrahams wrote:
Peter Dimov wrote:
In the situations that I've needed it, the subset of types that could be written in a single operation could not be described by a type trait. A type trait can't tell you whether the in-memory representation and the on-disk representation of a type are the same.
As soon as you talk about "on-disk" you're already limiting your thinking too much, since we're not necessarily serializing to disk, and the idea of having the "same representation" somewhere is somewhat limiting too. An MPI archive builds up a skeletal representation of what needs to be serialized and then MPI reads it out of memory into the hardware.
For example, MPI archives support array serialization for all PODs that are not pointers and do not contain pointer members.
I'm not sure I follow. How does an MPI archive serialize an array of X, where X is an arbitrary POD (without pointer members)?
You need to construct an MPI 'datatype', which is conceptually a record of how the type is laid out as a sequence of (offset, nested_datatype) pairs. There are predefined datatypes for various types, including arithmetic builtins. MPI provides some functions to assemble datatypes corresponding to structures and arrays, including strided arrays. When the message is sent somewhere, the receiver needs to provide a compatible datatype. This requires only that it has the same sequence of basic types; the offsets can be different. So you could, for example, build a datatype that would read directly from a std::list, and read it back somewhere else as a vector, or even a struct. And vice versa.

The 'usual' way (if there is such a thing; datatypes seem to be not used much in MPI) of constructing datatypes is using offsetof(), or just knowing what the layout is for compiler X, and constructing the datatype by hand. I don't know what mechanism Dave is thinking of to construct the datatype, it sounds unlikely that it could be done via a usual serialization function (but maybe if you could do member pointer arithmetic to replace offsetof() ?).

I don't understand the restriction to PODs with no pointers though, as pointers are no problem - at least in principle - you just recursively follow the pointer when constructing the typemap. There is a difference though, as a datatype for an object without nested pointers has fixed offsets, whereas for a type containing pointers a new datatype needs to be constructed for each object that is serialized. Cheers, Ian
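For readers who have not used the mechanism, a small sketch of the by-hand construction Ian describes, using offsetof() and standard MPI calls; the struct is made up for illustration.

  #include <mpi.h>
  #include <cstddef> // offsetof

  struct sample
  {
      int    id;
      double values[3];
  };

  MPI_Datatype make_sample_datatype()
  {
      int          blocklengths[2]  = { 1, 3 };
      MPI_Aint     displacements[2] = { offsetof(sample, id),
                                        offsetof(sample, values) };
      MPI_Datatype types[2]         = { MPI_INT, MPI_DOUBLE };

      MPI_Datatype raw, result;
      MPI_Type_create_struct(2, blocklengths, displacements, types, &raw);
      // resize to the extent of the struct (including trailing padding)
      // so the datatype can also describe contiguous arrays of sample
      MPI_Type_create_resized(raw, 0, sizeof(sample), &result);
      MPI_Type_free(&raw);
      MPI_Type_commit(&result);
      return result;
  }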

Ian McCulloch <ianmcc@physik.rwth-aachen.de> writes:
The 'usual' way (if there is such a thing; datatypes seem to be not used much in MPI) of constructing datatypes is using offsetof(), or just knowing what the layout is for compiler X, and constructing the datatype by hand. I don't know what mechanism Dave is thinking of to construct the datatype, it sounds unlikely that it could be done via a usual serialization function
Actually yes, it can. It's based on an innovation by Michael Gauckler and it's both fiendishly clever and blindingly obvious once you see it. He's writing a paper on it.
(but maybe if you could do member pointer arithmetic to replace offsetof() ?). I don't understand the restriction to PODs with no pointers though, as pointers are no problem - at least in principle - you just recursively follow the pointer when constructing the typemap.
IIUC, you just can't get the same acceleration for arrays of such types. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Ian McCulloch wrote:
You need to construct an MPI 'datatype', which is conceptually a record of how the type is laid out as a sequence of (offset, nested_datatype) pairs.
...
The 'usual' way (if there is such a thing; datatypes seem to be not used much in MPI) of constructing datatypes is using offsetof(), or just knowing what the layout is for compiler X, and constructing the datatype by hand.
...
I don't know what mechanism Dave is thinking of to construct the datatype, it sounds unlikely that it could be done via a usual serialization function (but maybe if you could do member pointer arithmetic to replace offsetof() ?). I don't understand the restriction to PODs with no pointers though, as pointers are no problem - at least in principle - you just recursively follow the pointer when constructing the typemap. There is a difference though, as a datatype for an object without nested pointers has fixed offsets, whereas for a type containing pointers a new datatype needs to be constructed for each object that is serialized.
Sounds like a lot of work. But if one wants to use all this stuff - then what's the point of using the serialization library at all? I mean I can see using one or the other - but what is the point of both? Robert Ramey

On Nov 23, 2005, at 6:00 AM, Robert Ramey wrote:
Ian McCulloch wrote:
You need to construct an MPI 'datatype', which is conceptually a record of how the type is laid out as a sequence of (offset, nested_datatype) pairs.
...
The 'usual' way (if there is such a thing; datatypes seem to be not used much in MPI) of constructing datatypes is using offsetof(), or just knowing what the layout is for compiler X, and constructing the datatype by hand.
...
I don't know what mechanism Dave is thinking of to construct the datatype, it sounds unlikely that it could be done via a usual serialization function (but maybe if you could do member pointer arithmetic to replace offsetof() ?).
Sounds like a lot of work. But if one wants to use all this stuff - then what's the point of using the serialization library at all? I mean I can see using one or the other - but what is the point of both?
Indeed this sounds like a lot of work and that's why this mechanism for message passing was rarely used in the past. The hard part is to manually build up the custom MPI datatype, i.e. to inform MPI about what the offsets and types of the various data members in a struct are. This is where the serialization library fits in and makes the task extraordinarily easy. Saving a data member with such an MPI archive will register its address, type (as well as the number of identical consecutive elements in an array) with the MPI library. Thus the serialization library does all the hard work already! As Dave mentioned earlier, this information can then be used by the MPI library (and network hardware) to directly serialize the data into the I/O buffers of the network interconnect, without ever creating a copy in memory, and automatically taking care of potential endianness and format issues on heterogeneous networks. To answer your question: one wants to use both since the serialization library is used to create the information which the MPI library needs to efficiently send the data. Matthias
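To make that a little more concrete, here is a very rough sketch of the bookkeeping idea only, not the actual code Matthias refers to: the archive's array hook records an (address, MPI datatype, count) triple for every contiguous block it is handed, and a single MPI datatype describing the whole object can be assembled from those records afterwards. All names are invented.

  #include <mpi.h>
  #include <cstddef>
  #include <vector>

  class mpi_datatype_oarchive // invented name, sketch only
  {
  public:
      // called by the serialization layer for each contiguous block
      template <class T>
      void save_array(T const* p, std::size_t n)
      {
          MPI_Aint address;
          MPI_Get_address(const_cast<T*>(p), &address);
          displacements_.push_back(address);
          types_.push_back(mpi_type(static_cast<T const*>(0)));
          counts_.push_back(static_cast<int>(n));
      }

      // afterwards a single datatype can be built with
      // MPI_Type_create_struct(counts_.size(), &counts_[0],
      //     &displacements_[0], &types_[0], &result)
      // and the object sent with MPI_Send(MPI_BOTTOM, 1, result, ...)

  private:
      static MPI_Datatype mpi_type(double const*) { return MPI_DOUBLE; }
      static MPI_Datatype mpi_type(int const*)    { return MPI_INT; }
      // ...further overloads for the other supported element types...

      std::vector<MPI_Aint>     displacements_;
      std::vector<MPI_Datatype> types_;
      std::vector<int>          counts_;
  };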

Matthias Troyer wrote:
Indeed this sounds like a lot of work and that's why this mechanism for message passing was rarely used in the past. The hard part is to manually build up the custom MPI datatype, i.e. to inform MPI about what the offsets and types of the various data members in a struct are.
This is where the serialization library fits in and makes the task extraordinarily easy. Saving a data member with such an MPI archive will register its address, type (as well as the number of identical consecutive elements in an array) with the MPI library. Thus the serialization library does all the hard work already!
I still don't see the 10x speedup in the subject. For a X[], the two approaches are:

1. for each x in X[], "serialize" into an MPI descriptor

2. serialize X[0] into an MPI descriptor, construct an array descriptor from it

Conceptual issues with (2) aside (the external format of X is determined by X itself and you have no idea whether the structure of X[0] also describes X[1]), I'm not sure that there will be such a major speedup compared to the naive (1).

Robert's point also deserves attention; a portable binary archive that writes directly into a socket eliminates the MPI middleman and will probably achieve a similar performance as your two-pass MPI approach. It also supports versioned non-PODs and other nontrivial types. As an example, I have a type which is saved as

  save( x ):
      save( x.anim.name() ); // std::string

and loaded as

  load( x ):
      string tmp;
      load( tmp );
      x.set_animation( tmp );

Not everything is field-based.

"Peter Dimov" <pdimov@mmltd.net> writes:
Matthias Troyer wrote:
Indeed this sounds like a lot of work and that's why this mechanism for message passing was rarely used in the past. The hard part is to manually build up the custom MPI datatype, i.e. to inform MPI about what the offsets and types of the various data members in a struct are.
This is where the serialization library fits in and makes the task extraordinarily easy. Saving a data member with such an MPI archive will register its address, type (as well as the number of identical consecutive elements in an array) with the MPI library. Thus the serialization library does all the hard work already!
I still don't see the 10x speedup in the subject. For a X[], the two approaches are:
1. for each x in X[], "serialize" into an MPI descriptor
Do you mean "datatype?"
2. serialize X[0] into an MPI descriptor, construct an array descriptor from it
Conceptual issues with (2) aside (the external format of X is determined by X itself and you have no idea whether the structure of X[0] also describes X[1]),
I don't know what you mean.
I'm not sure that there will be such a major speedup compared to the naive (1).
Why the doubt? We've measured. Furthermore, as we stated earlier, in our application we can't afford to make an in-memory copy because it would take more than the available memory.
Robert's point also deserves attention; a portable binary archive that writes directly into a socket eliminates the MPI middleman and will probably achieve a similar performance as your two-pass MPI approach.
I'll let the experts say more, but AFAICT MPI provides essential services that aren't provided by plain sockets. See http://www.open-mpi.org/papers/euro-pvmmpi-2004-overview/euro-pvmmpi-2004-ov...
It also supports versioned non-PODs and other nontrivial types.
What does? Such an archive?
As an example, I have a type which is saved as
save( x ):
save( x.anim.name() ); // std::string
and loaded as
load( x ):
string tmp; load( tmp ); x.set_animation( tmp );
Not everything is field-based.
Yes of course. Our approach supports those types too; but you can only take advantage of a subset of the optimizations for that part of your communication. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
Conceptual issues with (2) aside (the external format of X is determined by X itself and you have no idea whether the structure of X[0] also describes X[1]),
I don't know what you mean.
When your mpi_archive is given an arbitrary X, there is no way to know whether serializing an array of X can be transformed into serializing an X and then building an array datatype out of the result. That's because X's serialize functions can do _anything_. Your archive is simply not allowed to alter them. (*)

That is why the optimization must be explicitly enabled by the author of X. My approach deals with that by allowing him to write a save_sequence overload for X:

  void save_sequence( mpi_archive & a, X const * x, size_t n )
  {
      a.save_pod_array( x, n );
  }

You are trying to generalize this by allowing X's author to define a trait that isn't archive specific. I claim that this doesn't work, because whether an array of X supports optimized writes into an archive A is (in my experience) determined by the combination of X and A, and nothing is gained by defining separate traits with the hope that another archive might reuse them (YAGNI principle). In short, you'll end up defining an is_mpi_pod trait, giving it another, non-MPI name, and pretending that it's generally useful. It's not. It's MPI specific.

(*) Consider an X that is essentially a discriminated union as an example.

  struct X
  {
      int valid_field_; // 0..1
      int field1_;
      double field2_;
  };

that serializes valid_field_ and then either field1_ or field2_ depending on valid_field_.
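One way such an X might be serialized, as an illustration only: the field written after valid_field_ is chosen at run time, so nothing an archive can evaluate at compile time proves that X[1] will be laid out in the archive like X[0]; the author of X has to opt in explicitly.

  struct X
  {
      int    valid_field_; // 0..1
      int    field1_;
      double field2_;
  };

  template <class Archive>
  void serialize(Archive& ar, X& x, const unsigned int /*version*/)
  {
      ar & x.valid_field_;
      if (x.valid_field_ == 0)
          ar & x.field1_;
      else
          ar & x.field2_;
  }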

On Nov 23, 2005, at 2:31 PM, Peter Dimov wrote:
David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
Conceptual issues with (2) aside (the external format of X is determined by X itself and you have no idea whether the structure of X[0] also describes X[1]),
I don't know what you mean.
When your mpi_archive is given an arbitrary X, there is no way to know whether serializing an array of X can be transformed into serializing an X and then building an array datatype out of the result. That's because X's serialize functions can do _anything_. Your archive is simply not allowed to alter them. (*)
That's why this will not work for all types. It will however work for those types which you are most likely to serialize in huge quantities in a numerical simulation: real, complex and integer numbers, as well as simple structs built from them.
That is why the optimization must be explicitly enabled by the author of X. My approach deals with that by allowing him to write a save_sequence overload for X:
void save_sequence( mpi_archive & a, X const * x, size_t n ) { a.save_pod_array( x, n ); }
You are trying to generalize this by allowing X's author to define a trait that isn't archive specific.
I don't think that Dave has ever claimed that. In my proposal I had defined a traits class has_fast_array_serialization<Archive,X> which depends on both and can be fully or partially specialized by the author of X. However, partial specialization is not supported by all compilers. Dave's approach is to instead use a meta function inside the archive, which avoids the partial specialization problem. In the case of a non-portable binary archive that meta function might use a traits class like Robert's is_bitwise_serializable<X> which the user of X can specialize. In the case of the MPI archive, there could be a trait like is_mpi_datatype<X> which tells the library whether the optimization can be used. In other archives, there might only be a fixed set of types just as in your proposal.

Also, Dave's proposal has a free function such as

  template <class Archive, class X>
  void save_array(Archive& ar, X const * p, std::size_t n)

which you could overload as well, just like in your proposal. Matthias
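As an illustration of that kind of trait (the trait name is Matthias's example; the namespace and the struct are invented), the author of a simple struct might opt in like this:

  #include <boost/mpl/bool.hpp>

  struct particle
  {
      double position[3];
      double velocity[3];

      template <class Archive>
      void serialize(Archive& ar, const unsigned int /*version*/)
      {
          ar & position;
          ar & velocity;
      }
  };

  namespace parallel { // invented namespace for the sketch

  // default: no MPI datatype support
  template <class T> struct is_mpi_datatype : boost::mpl::false_ {};

  // the author of particle asserts it can be described by an MPI datatype
  template <> struct is_mpi_datatype<particle> : boost::mpl::true_ {};

  } // namespace parallel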

David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
I'm not sure that there will be such a major speedup compared to the naive (1).
Why the doubt? We've measured.
Out of curiosity, what did you measure?

On Nov 23, 2005, at 2:42 PM, Peter Dimov wrote:
David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
I'm not sure that there will be such a major speedup compared to the naive (1).
Why the doubt? We've measured.
Out of curiosity, what did you measure?
The 10x speedup that was measured was for using the serialization library, with and without save_array hooks to (de)serialize a std::vector<double> and a double[] into a binary archive, both into memory buffers and onto disk. The codes were posted with the message that started this thread. To quote just one number, the CPU times required to write 10 million doubles into a file were 4.12 and 0.37 seconds, respectively. Matthias

Matthias Troyer wrote:
On Nov 23, 2005, at 2:42 PM, Peter Dimov wrote:
David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
I'm not sure that there will be such a major speedup compared to the naive (1).
Why the doubt? We've measured.
Out of curiosity, what did you measure?
The 10x speedup that was measured was for using the serialization library, with and without save_array hooks to (de)serialize a std::vector<double> and a double[] into a binary archive, both into memory buffers and onto disk. The codes were posted with the message that started this thread. To quote just one number, the CPU times required to write 10 million doubles into a file were 4.12 and 0.37 seconds, respectively.
Oh, I'm not disputing that at all. I've observed a similar slowdown when std::vector<unsigned char> iterators changed into non-pointers, bypassing my optimized overload. :-)

On Nov 23, 2005, at 10:53 AM, Peter Dimov wrote:
Matthias Troyer wrote:
Indeed this sounds like a lot of work and that's why this mechanism for message passing was rarely used in the past. The hard part is to manually build up the custom MPI datatype, i.e. to inform MPI about what the offsets and types of the various data members in a struct are.
This is where the serialization library fits in and makes the task extraordinarily easy. Saving a data member with such an MPI archive will register its address, type (as well as the number of identical consecutive elements in an array) with the MPI library. Thus the serialization library does all the hard work already!
I still don't see the 10x speedup in the subject.
The 10x speedup was reported at the start of the thread several weeks ago for benchmarks comparing the writing of large arrays and vectors through the serialization library with writing them directly into a stream and memory buffer. The MPI case was brought up to show that the serialization library can also be used more efficiently there if we have save_array and load_array functionality. Please keep this in mind when reading my replies. We are discussing MPI here as another example, next to binary archives, where save_array and load_array optimizations will be important.
For a X[], the two approaches are:
1. for each x in X[], "serialize" into an MPI descriptor 2. serialize X[0] into an MPI descriptor, construct an array descriptor from it
Correct.
Conceptual issues with (2) aside (the external format of X is determined by X itself and you have no idea whether the structure of X[0] also describes X[1]),
Of course you can use (2) only for contiguous arrays of the same type, and not for any pointer member or polymorphic members. It will work for any type that is layout-compatible with a POD and contains no pointers or unions. Examples are std::complex<T>, tuples of fundamental types, or any struct having only fundamental types as members. For these types the memory layout of X[0] is the same as that of X[1]. The only case where this might not apply would be union or pointer members, and in that case the optimization can, of course, not be applied.
I'm not sure that there will be such a major speedup compared to the naive (1).
Oh yes, there can be a huge difference. Let me just give a few reasons:

1) in the applications we talk about we have to regularly send huge contiguous arrays of numbers (stored e.g. in a matrix, vector, valarray or multi_array) over the network. The typical size is 100 million numbers upwards. I'll stick to 100 million as a typical number in the following. Storing these 100 million numbers already takes up 800 MByte, and nearly fills the memory of the machine, and this causes problems:

a) copying these numbers into a buffer using the serialization library needs another 800 MB of memory that might not be available

b) creating MPI data types for each member separately means storing at least 12 bytes (4 bytes each for the address, type and count), for a total of 1200 MBytes, instead of just 12 bytes. Again we will have a memory problem.

But the main issue is speed. Serializing 100 million numbers one by one requires 100 million accesses to the network interface, while serializing the whole block at once just causes a single call, and the rest will be done by the hardware. The reason why we cannot afford this overhead is that actually on modern high performance networks

  ** the network bandwidth is the same as the memory bandwidth **

and that, even if all things could be perfectly inlined and optimized, the time to read the MPI datatype for each element when using (1) will completely overwhelm the time actually required to send the message using (2). To substantiate my claim (**) above, I want to mention a few numbers:

* the "Black Widow" network of the Cray X1 series has a network bandwidth of 55 GByte/second!

* the "Red Storm" network of the Cray XT3 Opteron clusters uses one hypertransport channel for the network access, and another one for memory access, and thus the bandwidth here is the same as the memory bandwidth

* the IBM Blue Gene/L has a similarly fast network with 4.2 GByte/second network bandwidth per node

* even going to cheaper commodity hardware, like Quadrics interconnects, 1 GByte/second is common nowadays.

I am sure you will understand that to keep up with these network data transfer rates we cannot afford to perform additional operations, such as accessing the network interface once per double to read the address, even aside from the memory issue raised above. I hope this clarifies why approach (2) should be taken whenever possible.
Robert's point also deserves attention; a portable binary archive that writes directly into a socket eliminates the MPI middleman and will probably achieve a similar performance as your two-pass MPI approach.
This is indeed a nice idea and would remove the need for MPI in some applications on standard-speed TCP/IP based networks, such as Ethernet, but it is not a general solution for a number of reasons:

1. Sockets do not even exist on most dedicated network hardware, but MPI is still available since it is the standard API for message passing on parallel computers. Even if sockets are still available, they just add additional layers of (expensive) function calls between the network hardware and the serialization library, while the vendor-provided MPI implementations usually access the network hardware directly.

2. MPI is much more than a point-to-point communication protocol built on top of sockets. It is actually a standardized API for all high performance network hardware. In addition to point-to-point communication (using synchronous, asynchronous, buffered and one-way communication) it also provides a large number of global operations, such as broadcasts, reductions, gather and scatter. These work with log(N) complexity on N nodes and often use special network hardware dedicated to the task (such as on an IBM Blue Gene/L machine). All these operations can take advantage of the MPI datatype mechanism.

3. The MPI implementations can determine at runtime whether the transformation to a portable binary archive is actually needed or whether just the bits can be streamed, and it will do this transparently, hiding it from the user.
It also supports versioned non-PODs and other nontrivial types. As an example, I have a type which is saved as
save( x ):
save( x.anim.name() ); // std::string
and loaded as
load( x ):
string tmp; load( tmp ); x.set_animation( tmp );
Not everything is field-based.
Indeed, and for such types the optimization would not apply.

Matthias Troyer wrote:
Oh yes, there can be a huge difference. Let me just give a few reasons:
1) in the applications we talk about we have to regularly send huge contiguous arrays of numbers (stored e.g. in a matrix, vector, valarray or multi_array) over the network. The typical size is 100 million numbers upwards. I'll stick to 100 million as a typical number in the following. Storing these 100 million numbers already takes up 800 MByte, and nearly fills the memory of the machine, and this causes problems:
a) copying these numbers into a buffer using the serialization library needs another 800 MB of memory that might not be available
b) creating MPI data types for each member separately means storing at least 12 bytes (4 bytes each for the address, type and count), for a total of 1200 MBytes, instead of just 12 bytes. Again we will have a memory problem.
But the main issue is speed. Serializing 100 million numbers one by one requires 100 million accesses to the network interface, while serializing the whole block at once just causes a single call, and the rest will be done by the hardware. The reason why we cannot afford this overhead is that actually on modern high performance networks
** the network bandwidth is the same as the memory bandwidth **
This makes sense, thank you. I just want to note that contiguous arrays of double are handled equally well by either approach under discussion; an mpi_archive will obviously include an overload for double[]. I was interested in the POD case. A large array of 3x3 matrices wrapped in matrix3x3 structs would probably be a good example that illustrates your point (c) above. (a) and (b) can be avoided by issuing multiple MPI_Send calls for non-optimized sequence writes.

On Nov 23, 2005, at 3:11 PM, Peter Dimov wrote:
Matthias Troyer wrote:
Oh yes, there can be a huge difference. Let me just give a few reasons:
1) in the applications we talk about we have to regularly send huge contiguous arrays of numbers (stored e.g. in a matrix, vector, valarray or multi_array) over the network. The typical size is 100 million numbers upwards. I'll stick to 100 million as a typical number in the following. Storing these 100 million numbers already takes up 800 MByte, and nearly fills the memory of the machine, and this causes problems:
a) copying these numbers into a buffer using the serialization library needs another 800 MB of memory that might not be available
b) creating MPI data types for each member separately means storing at least 12 bytes (4 bytes each for the address, type and count), for a total of 1200 MBytes, instead of just 12 bytes. Again we will have a memory problem.
But the main issue is speed. Serializing 100 million numbers one by one requires 100 million accesses to the network interface, while serializing the whole block at once just causes a single call, and the rest will be done by the hardware. The reason why we cannot afford this overhead is that actually on modern high performance networks
** the network bandwidth is the same as the memory bandwidth **
This makes sense, thank you. I just want to note that contiguous arrays of double are handled equally well by either approach under discussion; an mpi_archive will obviously include an overload for double[].
Yes, but only if you have some save_array or save_sequence hook, or alternatively the archive specifically provides an overload for double[], std::vector<double>, std::valarray<double>, boost::multi_array<double,N>, ...
I was interested in the POD case. A large array of 3x3 matrices wrapped in matrix3x3 structs would probably be a good example that illustrates your point (c) above.
Indeed, the 3x3 matrix struct is a good example of why we want to use this mechanism for more than just a fixed number of fundamental types.
(a) and (b) can be avoided by issuing multiple MPI_Send calls for non-optimized sequence writes.
Yes, but that will hurt performance. The latency for a single MPI_Send is still typically of the order of 0.5-5 microseconds even on the fastest machines. You are right, though, in that if we cannot use the fast mechanism and run into memory problems then indeed we will need to split the message. In the case of non-optimized sequence writes I might not use the MPI data type mechanism though, but instead pack the message into a buffer and send that buffer. For that one could either use the MPI_Pack functions of MPI, or prepare a (portable) binary archive and send that. Matthias
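For illustration only - this is not code from any of the proposals, and it assumes an already-initialized MPI environment with a valid receiving rank - the difference described above is essentially the difference between these two functions:

#include <mpi.h>
#include <cstddef>

void send_block(double * data, std::size_t n, int dest, int tag)
{
    // one call: the MPI layer (and the network hardware) can stream the
    // whole 8*n-byte block without touching each element individually
    MPI_Send(data, static_cast<int>(n), MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
}

void send_elementwise(double * data, std::size_t n, int dest, int tag)
{
    // roughly 0.5-5 microseconds of per-call latency, 100 million times over
    for (std::size_t i = 0; i != n; ++i)
        MPI_Send(&data[i], 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
}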

"Peter Dimov" <pdimov@mmltd.net> writes:
I just want to note that contiguous arrays of double are handled equally well by either approach under discussion;
There are at least four approaches that could be considered "under discussion" right now: Matthias' original proposal, Robert's counter-proposal, our new proposal, and what you just proposed. If your proposal handles some cases just as well as Matthias' proposal, it's no coincidence: those proposals have crucial elements in common. However, the elements those two proposals have in common are the same ones that raised the strongest objections from Robert, so we're trying to examine a less intrusive design for now.
an mpi_archive will obviously include an overload for double[]. I was interested in the POD case. A large array of 3x3 matrices wrapped in matrix3x3 structs would probably be a good example that illustrates your point (c) above. (a) and (b) can be avoided by issuing multiple MPI_Send calls for non-optimized sequence writes.
As Matthias has pointed out, that has unacceptable performance costs. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
If your proposal handles some cases just as well as Matthias' proposal, it's no coincidence: those proposals have crucial elements in common. However, the elements those two proposals have in common are the same ones that raised the strongest objections from Robert, so we're trying to examine a less intrusive design for now.
Perhaps Robert can state for himself whether he considers http://lists.boost.org/Archives/boost/2005/11/97058.php unacceptable, and why. It is a very localized change. (Contiguous sequences with non-pointer iterators such as std::vector and std::string would also need to be touched slightly by manually inlining save_collection in their respective save overloads, but this is also minor; or alternatively, we can define boost::pbegin and use that in save_collection.)

Perhaps Robert can state for himself whether he considers
http://lists.boost.org/Archives/boost/2005/11/97058.php
unacceptable, and why.
This link doesn't work for me. Robert Ramey

Robert Ramey wrote:
Perhaps Robert can state for himself whether he considers
http://lists.boost.org/Archives/boost/2005/11/97058.php
unacceptable, and why.
This link doesn't work for me.
""" Looking at collections_save_imp.hpp: template<class Archive, class Container> inline void save_collection(Archive & ar, const Container &s) { // record number of elements unsigned int count = s.size(); ar << make_nvp("count", const_cast<const unsigned int &>(count)); BOOST_DEDUCED_TYPENAME Container::const_iterator it = s.begin(); while(count-- > 0){ //if(0 == (ar.get_flags() & boost::archive::no_object_creation)) // note borland emits a no-op without the explicit namespace boost::serialization::save_construct_data_adl(ar, &(*it), 0U); ar << boost::serialization::make_nvp("item", *it++); } } all that needs to be done is: template<class Archive, class It> inline void save_sequence(Archive & ar, It it, unsigned count) { while(count-- > 0){ //if(0 == (ar.get_flags() & boost::archive::no_object_creation)) // note borland emits a no-op without the explicit namespace boost::serialization::save_construct_data_adl(ar, &(*it), 0U); ar << boost::serialization::make_nvp("item", *it++); } } template<class Archive, class Container> inline void save_collection(Archive & ar, const Container &s) { // record number of elements unsigned int count = s.size(); ar << make_nvp("count", const_cast<const unsigned int &>(count)); save_sequence( ar, s.begin(), count ); } unless I'm missing something fundamental. """

"Peter Dimov" <pdimov@mmltd.net> writes:
David Abrahams wrote:
If your proposal handles some cases just as well as Matthias' proposal, it's no coincidence: those proposals have crucial elements in common. However, the elements those two proposals have in common are the same ones that raised the strongest objections from Robert, so we're trying to examine a less intrusive design for now.
Perhaps Robert can state for himself whether he considers
http://lists.boost.org/Archives/boost/2005/11/97058.php
unacceptable, and why. It is a very localized change. (Contiguous sequences with non-pointer iterators such as std::vector and std::string would also need to be touched slightly by manually inlining save_collection in their respective save overloads, but this is also minor; or alternatively, we can define boost::pbegin and use that in save_collection.)
Yes, it is a localized change. So were the changes in Matthias' original proposal, believe it or not. The particular set of changes you're proposing IMO makes it unnecessarily difficult for archive authors, and I'd prefer to see something that makes it a bit easier by factoring out common dispatching for optimizable archive/element combinations, but Robert has already indicated that making it easy for archive authors is not a high priority for him. To be fair, I haven't done the analysis: are you sure your approach doesn't lead to an MxN problem (for M archives and N types that need to be serialized)? -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
To be fair, I haven't done the analysis: are you sure your approach doesn't lead to an MxN problem (for M archives and N types that need to be serialized)?
Yes, it does, in theory. Reality isn't that bad. For every M, the archive author has already added the necessary overloads for every "fundamental" type that supports optimized array operations. This leaves a number of user-defined types n (because the number is smaller than N), times M. In addition, even if the author of a UDT hasn't provided an overload for a particular archive A, the user can add it himself. The m*n number for a particular codebase is bounded, and the overloads are typically one-liners. Looking at http://lists.boost.org/Archives/boost/2005/11/97002.php the difference is that you have a "please call ar.load_array" specializable predicate instead of a "please overload this function" customization point. Is this the latest version of the proposed design?

On Nov 24, 2005, at 5:24 PM, Peter Dimov wrote:
David Abrahams wrote:
To be fair, I haven't done the analysis: are you sure your approach doesn't lead to an MxN problem (for M archives and N types that need to be serialized)?
Yes, it does, in theory. Reality isn't that bad. For every M, the archive author has already added the necessary overloads for every "fundamental" type that supports optimized array operations. This leaves a number of user-defined types n (because the number is smaller than N), times M.
In addition, even if the author of a UDT hasn't provided an overload for a particular archive A, the user can add it himself. The m*n number for a particular codebase is bounded, and the overloads are typically one-liners.
What if the number n is infinite (e.g. all possible structs consisting only of fundamental types), which is what Robert calls "bitwise serializable"? Matthias

Matthias Troyer wrote:
On Nov 24, 2005, at 5:24 PM, Peter Dimov wrote:
David Abrahams wrote:
To be fair, I haven't done the analysis: are you sure your approach doesn't lead to an MxN problem (for M archives and N types that need to be serialized)?
Yes, it does, in theory. Reality isn't that bad. For every M, the archive author has already added the necessary overloads for every "fundamental" type that supports optimized array operations. This leaves a number of user-defined types n (because the number is smaller than N), times M.
In addition, even if the author of a UDT hasn't provided an overload for a particular archive A, the user can add it himself. The m*n number for a particular codebase is bounded, and the overloads are typically one-liners.
What if the number n is infinite (e.g. all possible structs consisting only of fundamental types), which is what Robert calls "bitwise serializable"?
Structs aren't bitwise serializable in general because of padding/packing/alignment. Archives that do not have a documented external format and just fwrite whatever happens to be in memory at the time aren't really archives, they are a very specific subset with limited uses (interprocess communication on the same machine, the same compiler and the same version) that should not shape the design. ("Archive" implies persistency, and relying on a specific memory layout is not a way to achieve it.) If you have such an archive, you can add an overload SFINAE'd on is_bitwise_serializable instead of separate overloads for every type. This allows you to turn this specific 4*inf problem into a 4+inf problem (don't forget that you need +inf specializations of is_bitwise_serializable).
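A minimal sketch of that SFINAE'd overload, assuming purely for illustration a user-visible is_bitwise_serializable trait that defaults to is_fundamental, and a made-up native_oarchive that appends raw bytes to a buffer; none of these names are taken from the library:

#include <boost/utility/enable_if.hpp>
#include <boost/type_traits/is_fundamental.hpp>
#include <boost/type_traits/integral_constant.hpp>
#include <vector>
#include <cstddef>

template<class T>
struct is_bitwise_serializable : boost::is_fundamental<T> {};

// hypothetical "native" archive: appends the raw bytes of a block to a buffer
struct native_oarchive
{
    std::vector<char> buffer;
    void save_array(const void * p, std::size_t nbytes)
    {
        const char * c = static_cast<const char *>(p);
        buffer.insert(buffer.end(), c, c + nbytes);
    }
};

// a single overload covers every T the trait marks as bitwise serializable
template<class T>
typename boost::enable_if< is_bitwise_serializable<T> >::type
save_sequence(native_oarchive & ar, const T * p, unsigned int count)
{
    ar.save_array(p, count * sizeof(T));   // one bulk write for the block
}

// the "+inf" part: each POD is marked with a one-line specialization
struct point3d { double x, y, z; };
template<>
struct is_bitwise_serializable<point3d> : boost::true_type {};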

"Peter Dimov" <pdimov@mmltd.net> writes:
What if the number n is infinite (e.g. all possible structs consisting only of fundamental types), which is what Robert calls "bitwise serializable"?
Structs aren't bitwise serializable in general because of padding/packing/alignment. Archives that do not have a documented external format and just fwrite whatever happens to be in memory at the time aren't really archives, they are a very specific subset with limited uses (interprocess communication on the same machine, the same compiler and the same version) that should not shape the design.
If Stepanov had used that philosophy we wouldn't have algorithms specialized for random access iterators. Containers with random access are "a very specific subset" with, arguably, "limited uses" (don't forget that the original Lisp guys thought it would be better if everything were made up of cons cells). The ability to specialize generic algorithms to take advantage of special properties of "specific" datatypes is fundamental to Generic Programming. In fact, every example we can think of so far where optimized array serialization is useful is just such a "specific" archive.
("Archive" implies persistency, and relying on a specific memory layout is not a way to achieve it.)
I think Robert's statement "Here, we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. Such a system can be used to reconstitute an equivalent structure in another program context. Depending on the context, this might be used to implement object persistence, remote parameter passing or other facility. In this system we use the term "archive" to refer to a specific rendering of this stream of bytes. This could be a file of binary data, text data, XML, or some other created by the user of this library. " defines the intention of the library and if "archive" implies persistency to you it was perhaps an unfortunate naming choice, but I don't think that should be used to make arguments about the design that contradict his intention.
If you have such an archive, you can add an overload SFINAE'd on is_bitwise_serializable
Robert wants portability to vc6, which doesn't support SFINAE. I doubt he'd want to accept a change that, to be practically taken advantage of, would require users to apply SFINAE. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
What if the number n is infinite (e.g. all possible structs consisting only of fundamental types), which is what Robert calls "bitwise serializable"?
Structs aren't bitwise serializable in general because of padding/packing/alignment. Archives that do not have a documented external format and just fwrite whatever happens to be in memory at the time aren't really archives, they are a very specific subset with limited uses (interprocess communication on the same machine, the same compiler and the same version) that should not shape the design.
If Stepanov had used that philosophy we wouldn't have algorithms specialized for random access iterators. Containers with random access are "a very specific subset" with, arguably, "limited uses" (don't forget that the original Lisp guys thought it would be better if everything were made up of cons cells). The ability to specialize generic algorithms to take advantage of special properties of "specific" datatypes is fundamental to Generic Programming.
In fact, every example we can think of so far where optimized array serialization is useful is just such a "specific" archive.
An archive turns a C++ data structure into an untyped stream of bytes in a reversible way. The specific way to map C++ into bytes (the external format) is what distinguishes one archive from another. There is _one_ archive for which the external format is the same as the memory layout. It's possible to play with the serialization of non-contiguous data structures and create several such archives for the sake of NIH, but all these archives are isomorphic, they conceptually represent a single point in the design space. Whereas there are a number of distinct (non-isomorphic) random access containers.
("Archive" implies persistency, and relying on a specific memory layout is not a way to achieve it.)
I think Robert's statement
"Here, we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. [...]
Robert's statement is not at odds with what I wrote above.
If you have such an archive, you can add an overload SFINAE'd on is_bitwise_serializable
Robert wants portability to vc6, which doesn't support SFINAE. I doubt he'd want to accept a change that, to be practically taken advantage of, would require users to apply SFINAE.
On VC6 you can use separate overloads. I have the feeling I must be missing something fundamental. :-) What do you perceive as the important difference between providing an optimize_array overload that returns mpl::true_ and providing a save_sequence overload that calls .save_array? (Except that the latter is obviously more flexible.)

"Peter Dimov" <pdimov@mmltd.net> writes:
David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
("Archive" implies persistency, and relying on a specific memory layout is not a way to achieve it.)
I think Robert's statement
"Here, we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. [...]
Robert's statement is not at odds with what I wrote above.
What you wrote above implies additional constraints not present in Robert's statement. He doesn't say anything about persistency. I can think of many useful Archives that don't "persist" in any meaningful way. Whether those correspond to your notion of the word "archive" is another matter.
If you have such an archive, you can add an overload SFINAE'd on is_bitwise_serializable
Robert wants portability to vc6, which doesn't support SFINAE. I doubt he'd want to accept a change that, to be practically taken advantage of, would require users to apply SFINAE.
On VC6 you can use separate overloads.
I have the feeling I must be missing something fundamental. :-)
Me too. It certainly seems as though you -- most uncharacteristically I might add -- jumped into this conversation and made many statements without due consideration.
What do you perceive as the important difference between providing an optimize_array overload that returns mpl::true_ and providing a save_sequence overload that calls .save_array? (Except that the latter is obviously more flexible.)
It's not a major difference; I have merely been giving rationale for our choices and preferences. In short:

a. Less code to write and fewer opportunities to make mistakes. Not a big difference, but a difference nonetheless.

b. The ability to use a base class to implement common functionality. That's useful when accommodating Robert's desire not to modify his existing library code in any way, if you want to avoid duplicating support for std::vector and builtin arrays.

In short, your simple proposal would, I think, be a major improvement. However, it is incomplete, it doesn't address Robert's constraints, and it imposes a bit more work than necessary on archive authors. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams <dave@boost-consulting.com> writes:
"Peter Dimov" <pdimov@mmltd.net> writes:
David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
("Archive" implies persistency, and relying on a specific memory layout is not a way to achieve it.)
I think Robert's statement
"Here, we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. [...]
Robert's statement is not at odds with what I wrote above.
What you wrote above implies additional constraints not present in Robert's statement. He doesn't say anything about persistency.
Sorry, correction: he mentions persistency as *one possible* use of the library among others. Nothing he says implies persistency is essential. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
("Archive" implies persistency, and relying on a specific memory layout is not a way to achieve it.)
I think Robert's statement
"Here, we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. [...]
Robert's statement is not at odds with what I wrote above.
What you wrote above implies additional constraints not present in Robert's statement. He doesn't say anything about persistency. I can think of many useful Archives that don't "persist" in any meaningful way. Whether those correspond to your notion of the word "archive" is another matter.
In fact, even Robert's statement doesn't describe all archives; we've been simplifying a bit. But this aside... OK, let's assume that you are right and I am wrong about the meaning of "archive". How does this advance your argument?
What do you perceive as the important difference between providing an optimize_array overload that returns mpl::true_ and providing a save_sequence overload that calls .save_array? (Except that the latter is obviously more flexible.)
It's not a major difference; I have merely been giving rationale for our choices and preferences. In short:
a. Less code to write and fewer opportunities to make mistakes. Not a big difference, but a difference nonetheless.
b. The ability to use a base class to implement common functionality. That's useful when accommodating Robert's desire not to modify his existing library code in any way, if you want to avoid duplicating support for std::vector and builtin arrays.
In short, your simple proposal would, I think, be a major improvement. However, it is incomplete, it doesn't address Robert's constraints, and it imposes a bit more work than necessary on archive authors.
Let's start from here. Why is it incomplete? How could it address Robert's constraints less than your proposal, which is more invasive? What additional work does it impose on archive authors? Can you illustrate the answers to these questions with examples? They can help me understand (a) and (b), too.

"Peter Dimov" <pdimov@mmltd.net> writes:
In short, your simple proposal would, I think, be a major improvement. However, it is incomplete, it doesn't address Robert's constraints, and it imposes a bit more work than necessary on archive authors.
Let's start from here.
Why is it incomplete?
It doesn't handle std::vector; you yourself admitted that would require additional code.
How could it address Robert's constraints less than your proposal, which is more invasive?
How can you possibly say that my proposal is more invasive? How many times do I have to remind everyone that I'm not proposing to make any changes to the serialization library? I'm really shocked to hear this coming from you, especially after my posts to this list earlier today. Are you referring to some hidden invasion I haven't considered?
What additional work does it impose on archive authors?
Many overloads required for all optimizable types on vc6. No possibility of factoring common functionality (vector/builtin-array support) into a common base class. For a start. -- Dave Abrahams Boost Consulting www.boost-consulting.com

"Peter Dimov" <pdimov@mmltd.net> writes:
What you wrote above implies additional constraints not present in Robert's statement. He doesn't say anything about persistency. I can think of many useful Archives that don't "persist" in any meaningful way. Whether those correspond to your notion of the word "archive" is another matter.
In fact, even Robert's statement doesn't describe all archives; we've been simplifying a bit. But this aside...
OK, let's assume that you are right and I am wrong about the meaning of "archive". How does this advance your argument?
It defends against your argument. You seemed to be saying that the kind of archives we're interested in are really illegitimate (not archives at all), apparently to buttress your argument that they're too much of a special case to warrant consideration in shaping the design. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
I have the feeling I must be missing something fundamental. :-)
Me too. It certainly seems as though you -- most uncharacteristically I might add -- jumped into this conversation and made many statements without due consideration.
This may be so; but please understand that I've been using a save_sequence-based design for years. It works.

"Peter Dimov" <pdimov@mmltd.net> writes:
David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
I have the feeling I must be missing something fundamental. :-)
Me too. It certainly seems as though you -- most uncharacteristically I might add -- jumped into this conversation and made many statements without due consideration.
This may be so; but please understand that I've been using a save_sequence-based design for years. It works.
I'm sure it does. I don't doubt your expertise in that particular area. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams <dave@boost-consulting.com> writes:
"Peter Dimov" <pdimov@mmltd.net> writes:
David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
I have the feeling I must be missing something fundamental. :-)
Me too. It certainly seems as though you -- most uncharacteristically I might add -- jumped into this conversation and made many statements without due consideration.
This may be so; but please understand that I've been using a save_sequence-based design for years. It works.
I'm sure it does. I don't doubt your expertise in that particular area.
Let me also make clear that I would never have disputed that it works in principle. The essential idea that makes your save_sequence design work is present in both Matthias' original proposal and in the one that's currently on the table. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Peter Dimov wrote:
Matthias Troyer wrote:
What if the number n is infinite (e.g. all possible structs consisting only of fundamental types), which is what Robert calls "bitwise serializable"?
Structs aren't bitwise serializable in general because of padding/packing/alignment. Archives that do not have a documented external format and just fwrite whatever happens to be in memory at the time aren't really archives, they are a very specific subset with limited uses (interprocess communication on the same machine, the same compiler and the same version) that should not shape the design. ("Archive" implies persistency, and relying on a specific memory layout is not a way to achieve it.)
However, that is a common use case of any serialization library, and it is one that benefits greatly from fast array serialization. If the serialization library does not support it, those doing IPC may discard boost serialization entirely, to avoid implementing two different mechanisms in their programs. Some operating systems rely extremely heavily on message passing between processes (QNX is an example), and it would be nice if boost.serialization would be useful there. Tom

Tom Widmer wrote:
Peter Dimov wrote:
Matthias Troyer wrote:
What if the number n is infinite (e.g. all possible structs consisting only of fundamental types), which is what Robert calls "bitwise serializable"?
Structs aren't bitwise serializable in general because of padding/packing/alignment. Archives that do not have a documented external format and just fwrite whatever happens to be in memory at the time aren't really archives, they are a very specific subset with limited uses (interprocess communication on the same machine, the same compiler and the same version) that should not shape the design. ("Archive" implies persistency, and relying on a specific memory layout is not a way to achieve it.)
However, that is a common use case of any serialization library, and it is one that benefits greatly from fast array serialization. If the serialization library does not support it, those doing IPC may discard boost serialization entirely, to avoid implementing two different mechanisms in their programs. Some operating systems rely extremely heavily on message passing between processes (QNX is an example), and it would be nice if boost.serialization would be useful there.
Note the current binary_?archive saves and restores all the bits of all primitives without regard to alignment and endian issues. For this reason it is referred to as "native binary" and considered suitable only for creating archives to be loaded back into the same environment. Robert Ramey

Matthias Troyer <troyer@itp.phys.ethz.ch> writes:
On Nov 24, 2005, at 5:24 PM, Peter Dimov wrote:
David Abrahams wrote:
To be fair, I haven't done the analysis: are you sure your approach doesn't lead to an MxN problem (for M archives and N types that need to be serialized)?
Yes, it does, in theory. Reality isn't that bad. For every M, the archive author has already added the necessary overloads for every "fundamental" type that supports optimized array operations.
In most cases this shouldn't require a separate overload for every fundamental type, since the actual array serialization procedure is often the same for _every_ type that can use the optimization. But again, that's a question of how much work the archive author is required to do, so Robert may not consider it a valid argument.
This leaves a number of user-defined types n (because the number is smaller than N), times M.
In addition, even if the author of a UDT hasn't provided an overload for a particular archive A, the user can add it himself. The m*n number for a particular codebase is bounded, and the overloads are typically one-liners.
What if the number n is infinite (e.g. all possible structs consisting only of fundamental types), which is what Robert calls "bitwise serializable"?
No, we can't detect that category automatically today. However, our approach is designed to make it very easy and more foolproof to provide the necessary information for such a type: in Matthias' original design, a trait specialization; in the more recent design, an overload of optimize_array that returns mpl::true_. In your proposal you need a separate overload of save_sequence. Then you likely either have to duplicate the normal fast serialization procedure in your overload or you have to dispatch to something written by the archive author that may be named differently for every archive. This work is not (necessarily) done by the archive author -- it may be done by the author of a type that needs to be serialized, or a 3rd party user. We prefer to add a small amount of additional framework to avoid that problem. Furthermore, Doug Gregor has designed and implemented (in GCC) a core language extension idea we had in Mont Tremblant that allows us to enumerate all the members of a class. We plan to propose that for standardization. We'd like to see a design that can immediately take advantage of such a feature when/if it becomes available. -- Dave Abrahams Boost Consulting www.boost-consulting.com
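The thread gives only the name and return type of that customization point, so the following is merely a guess at its shape - not the actual proposed interface - of an overloadable predicate whose return type tells the library whether to take the bulk path for a given archive/element pair:

#include <boost/mpl/bool.hpp>

// default: no array optimization for an arbitrary archive/element pair
template<class Archive, class T>
inline boost::mpl::false_ optimize_array(Archive *, T *)
{ return boost::mpl::false_(); }

// hypothetical archive and user-defined POD
class fast_oarchive;
struct point3d { double x, y, z; };

// one-line opt-in: "please use the array path for point3d with fast_oarchive"
inline boost::mpl::true_ optimize_array(fast_oarchive *, point3d *)
{ return boost::mpl::true_(); }

// the library side could then dispatch on the returned type, e.g.
// save_dispatch(ar, p, count, optimize_array((Archive *)0, (T *)0));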

David Abrahams wrote:
Yes, it does, in theory. Reality isn't that bad. For every M, the archive author has already added the necessary overloads for every "fundamental" type that supports optimized array operations.
In most cases this shouldn't require a separate overload for every fundamental type, since the actual array serialization procedure is often the same for _every_ type that can use the optimization. But again, that's a question of how much work the archive author is required to do, so Robert may not consider it a valid argument.
I prefer enumerating the overloads explicitly, but you can use a single SFINAE'd overload.
However, our approach is designed to make it very easy and more foolproof to provide the necessary information for such a type: in Matthias' original design, a trait specialization; in the more recent design, an overload of optimize_array that returns mpl::true_. In your proposal you need a separate overload of save_sequence.
Providing a save_sequence overload is approximately the same amount of work as adding a partial specialization or an optimize_array overload.
Then you likely either have to duplicate the normal fast serialization procedure in your overload...
... which is not a problem for archives supporting your .load_array or .save_array protocol.
or you have to dispatch to something written by the archive author that may be named differently for every archive.
Quite right. save_sequence for char[] can cover "bitwise serializability", though.

David Abrahams wrote:
Furthermore, Doug Gregor has designed and implemented (in GCC) a core language extension idea we had in Mont Tremblant that allows us to enumerate all the members of a class. We plan to propose that for standardization. We'd like to see a design that can immediately take advantage of such a feature when/if it becomes available.
This is interesting; can you give a link to any description of this extension? Is it for accessing data members only, or for introspection more generally? I think the fact that the serialization library defines a standardised method of processing an object by memberwise decomposition has led to some confusion of purposes. General access to introspection facilities would facilitate other uses of serialization and permit the serialization library to focus more directly on its goals. Matt

...
As Dave mentioned earlier, this information can then be used by the MPI library (and network hardware) to directly serialize the data into the I/O buffers of the network interconnect, without ever creating a copy in memory, and automatically taking care of potential endianness and format issues on heterogeneous networks.
Do you have any recommendations for links discussing MPI? I just found http://www-unix.mcs.anl.gov/mpi/ via Google. Is this the best starting point? Thanks, Jeff

On Nov 23, 2005, at 2:57 PM, Jeff Flinn wrote:
...
As Dave mentioned earlier, this information can then be used by the MPI library (and network hardware) to directly serialize the data into the I/O buffers of the network interconnect, without ever creating a copy in memory, and automatically taking care of potential endianness and format issues on heterogeneous networks.
Do you have any recommendations for links discussing MPI? I just found http://www-unix.mcs.anl.gov/mpi/ via Google. Is this the best starting point?
Besides your link I would recommend:
http://www.mpi-forum.org/ is the official standard web page
http://www.lam-mpi.org/ for one implementation (done by the Indiana group that also contributed many Boost libraries)
http://www.open-mpi.org/ for the latest efforts for an open-source high-performance MPI library
as well as any of the many books on MPI. Alternatively you could attend the short MPI course that I will teach on Friday in Zurich, Switzerland. Matthias

"Matthias Troyer" <troyer@itp.phys.ethz.ch> wrote in message news:B6E95659-ED86-4591-BCE3-B42F96E7F2A8@itp.phys.ethz.ch...
On Nov 23, 2005, at 2:57 PM, Jeff Flinn wrote:
...
Do you have any recommendations for links discussing MPI? I just found http://www-unix.mcs.anl.gov/mpi/ via Google. Is this the best starting point?
Besides your link I would recommend:
http://www.mpi-forum.org/ is the official standard web page
http://www.lam-mpi.org/ for one implementation (done by the Indiana group that also contributed many Boost libraries)
http://www.open-mpi.org/ for the latest efforts for an open-source high-performance MPI library
as well as any of the many books on MPI. Alternatively you could attend the short MPI course that I will teach on Friday in Zurich, Switzerland.
Thanks for the links. And the course offer. I'll hop on the ole timeshare Gulfstream G5 and pop right over... Oops, it's in the shop having its landing gear repaired. :) Thanks, Jeff

To summarize how we arrived here.
=================================

a) Matthias augmented binary_?archive to replace element-by-element serialization of primitive types with save/load binary for C++ arrays, std::vector and std::valarray. This resulted in a 10x speed up of the serialization process.

b) From this it has been concluded that binary archives should be enhanced to provide this facility automatically and transparently to the user.

c) The structure of the library and the documentation suggest that the convenient way to do this is to specify an overload for each combination of archive/type which can benefit from special treatment.

d) The above (c) is deemed inconvenient because it has been supposed that many archive classes will share a common implementation of load/save array. This would suggest that using (c) above, though simple and straightforward, will result in code repetition.

e) So it has been proposed that binary_iarchive be re-implemented in the following way:

iarchive - containing a default implementation of load_array
binary_iarchive - presumably contains an implementation of load_array in terms of the currently defined load_binary

It's not clear whether all archives would be modified in this way or just binary_iarchive. The idea is that each type which can benefit from load_array can call it, and the version of load_array corresponding to that particular archive will be invoked. This will require that

i) the serialization function for types which can benefit from some load_array function would call this.
ii) only a small number of load_array functions would have to be written for each archive.

So the number of special functions to be written would be one for each type which might use load_array and "one" for each archive.

Problems with the Design
========================

a) It doesn't address the root cause of "slow" performance of binary archives.

The main problem is that it doesn't address the cause of the 10x speed up. It's a classic case of premature optimization. The 10x speed up was based on a test program. For a C++ array, the test boils down to replacing 10,000 invocations of stream write(..) with one invocation of stream write 10,000 times longer. Which is of course faster. Unfortunately, the investigation stopped here with the conclusion that the best way to improve performance is to reduce the number of stream write calls in a few specific cases. As far as I know, the test was never profiled so I can't know for sure, but past experience and common sense suggest that stream write is a costly operation for binary i/o. This design proposal (as well as the previous one) fails to address this, so it's hard to take it as a serious proposal to speed up native binary serialization.

The current binary archives are implemented in terms of stream i/o. This was convenient to do and has worked well. But basing the implementation on streams results in a slow implementation. The documentation explicitly states that archives do not have to be implemented in terms of streams. The binary archives don't use any of the stream interface other than read(..) and write(..), so it would be quite easy to make another binary archive which isn't based on stream i/o. It could be based on fread/fwrite. Given that the concern of the proposal's authors is to make the library faster for machine-to-machine communication, and the desired protocols (MPI) don't use file i/o, the fastest would be just a buffer, say a buffer_archive, which doesn't do any i/o at all.
It would just fill up a user-specified buffer whose address was handed in at buffer_archive construction time. This would totally eliminate stream i/o from the equation.

Note that this would be easy to do. Just clone binary_archive and modify it so it doesn't use a stream (probably don't want to derive from basic_binary_archive). I would guess that would take about a couple of hours at most.

I would be surprised if the 10x speed up still existed with this "buffered_archive". Note that for the intended application - MPI communication - some archive which doesn't use stream i/o has to be created anyway.

b) Re-implementation of binary_archive in such a way as not to break existing archives would be an error-prone process. The switching between the new and old methods "should" result in exactly the same byte sequence. But it could easily occur that a small subtle change might render archives created under the previous binary_archive unreadable.

c) The premise that one will save a lot of coding (see d) above) compared to the current method of overloading based on the pair of archive/type is overly optimistic. This is explained in Peter Dimov's post here:

http://lists.boost.org/Archives/boost/2005/11/97089.php

I'm aware this is speculative. I haven't investigated MPI, XDR and others enough to know how much code sharing is possible. It does seem that there will be no sharing with the "fast binary archive" of the previous submission. From the short descriptions of MPI I've seen on this list, along with my cursory investigation of XDR, I'm doubtful that there is any sharing there either.

Conclusions
===========

a) The proposal suffers from "premature optimization". A large amount of design effort has been expended on areas which are likely not the source of the observed performance bottlenecks.

b) The proposal suffers from "over generalization". The attempt to generalize results in a much more complex system. Such a system will result in a net loss of conceptual integrity and implementation transparency. The claim that this generalization will actually result in a reduction of code is not convincing.

c) By re-implementing a currently existing and used archive, it risks creating a maintenance headache for no real benefit.

Suggestions
===========

a) Do more work in finding the speed bottlenecks. Run a profiler. Make a buffer-based, non-stream-based archive and re-run your tests.

b) Make your MPI, XDR and whatever archives. Determine how much opportunity for code sharing is really available.

c) If you still believe your proposal has merit, make your own "optimized binary archive". Don't derive from binary_archive but rather from common_?archive or perhaps basic_binary_archive. In this way you will have a totally free hand and won't have to achieve consensus with the rest of us, which will save us all a huge amount of time.

Robert Ramey
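As a rough sketch of the kind of non-stream "buffer archive" described above - this is an assumption about what such a thing might look like, not existing library code - the primitive-writing layer could be as small as:

#include <cstring>
#include <cstddef>

class buffer_oprimitive
{
public:
    // the caller hands in the buffer address (and capacity) at construction
    buffer_oprimitive(char * buffer, std::size_t capacity)
        : m_buffer(buffer), m_capacity(capacity), m_size(0) {}

    void save_binary(const void * address, std::size_t count)
    {
        // a real implementation would check m_size + count <= m_capacity
        std::memcpy(m_buffer + m_size, address, count);
        m_size += count;
    }

    template<class T>
    void save(const T & t)                 // each primitive is one memcpy
    { save_binary(&t, sizeof(T)); }

    std::size_t size() const { return m_size; }

private:
    char *      m_buffer;
    std::size_t m_capacity;
    std::size_t m_size;
};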

On 11/24/05, Robert Ramey <ramey@rrsd.com> wrote: [snip]
The current binary archives are implemented in terms of stream i/o. This was convenient to do and has worked well. But basing the implementation on streams results in a slow implementation. The documentation explicitly states that archives do not have to be implemented in terms of streams. The binary archives don't use any of the stream interface other than read(..) and write(..), so it would be quite easy to make another binary archive which isn't based on stream i/o. It could be based on fread/fwrite. Given that the concern of the proposal's authors is to make the library faster for machine-to-machine communication, and the desired protocols (MPI) don't use file i/o, the fastest would be just a buffer, say a buffer_archive, which doesn't do any i/o at all. It would just fill up a user-specified buffer whose address was handed in at buffer_archive construction time. This would totally eliminate stream i/o from the equation.
Note that this would be easy to do. Just clone binary_archive and modify it so it doesn't use a stream (probably don't want to derive from basic_binary_archive). I would guess that would take about a couple of hours at most.
I would be surprised if the 10x speed up still existed with this "buffered_archive".
Creating a "buffered_archive" wouldnt require copying? And as Matthias have already put, it is an unacceptable overhead. I'm having some ideas about a 'fake_buffered'_archive. best regards, -- Felipe Magno de Almeida Developer from synergy and Computer Science student from State University of Campinas(UNICAMP). Unicamp: http://www.ic.unicamp.br Synergy: http://www.synergy.com.br "There is no dark side of the moon really. Matter of fact it's all dark."

Felipe Magno de Almeida wrote:
Creating a "buffered_archive" wouldnt require copying? And as Matthias have already put, it is an unacceptable overhead. I'm having some ideas about a 'fake_buffered'_archive.
LOL - then have your archive send the data wherever you want it to. The point is that the 10x speed up demonstration on which this whole thread rests is based on using the current binary_?archive which (apparently) won't be the mechanism which will eventually be used. Robert Ramey

On Nov 24, 2005, at 9:27 PM, Robert Ramey wrote:
Felipe Magno de Almeida wrote:
Creating a "buffered_archive" wouldnt require copying? And as Matthias have already put, it is an unacceptable overhead. I'm having some ideas about a 'fake_buffered'_archive.
LOL - then have your archive send the data wherever you want it to. The point is that the 10x speed up demonstration on which this whole thread rests is based on using the current binary_?archive which (apparently) won't be the mechanism which will eventually be used.
Just to clarify: there are a number of usage cases where an array optimization will be useful. One is a binary archive (into files or buffers), another usage case is MPI message passing without copying into a buffer. Thus the proposed test is relevant, and we're working on it. Matthias

Robert Ramey wrote:
Felipe Magno de Almeida wrote:
Creating a "buffered_archive" wouldnt require copying? And as Matthias have already put, it is an unacceptable overhead. I'm having some ideas about a 'fake_buffered'_archive.
LOL - then have your archive send the data wherever you want it to. The point is that the 10x speed up demonstration on which this whole thread rests is based on using the current binary_?archive which (apparently) won't be the mechanism which will eventually be used.
? It doesn't matter what back-end buffer is used; there will always be a substantial difference between buffering a bulk array copy and buffering element by element in a loop. If that buffer is going to be written to disk, the difference doesn't matter so much because the disk IO will be the bottleneck. But if it is going to a fast network interface, the buffering is critical. Besides, is the boost iostreams library really much slower than a hand-coded buffer? Anyway, this is a side issue. The main point is: David Abrahams wrote:
,---- | For many archive formats and common datatypes there exist APIs | that can quickly read or write contiguous sequences of those types | all at once (**). Reading or writing such a sequence by | separately reading or writing each element (as the serialization | library currently does) can be an order of magnitude more | expensive. `----
The operative phrase here is "archive formats". To pick a random example, from the netCDF users guide http://www.unidata.ucar.edu/software/netcdf/guide.txn_toc.html

"The Network Common Data Form, or netCDF, is an interface to a library of data access functions for storing and retrieving data in the form of arrays. An array is an n-dimensional (where n is 0, 1, 2, ...) rectangular structure containing items which all have the same data type (e.g. 8-bit character, 32-bit integer). A scalar (simple single value) is a 0-dimensional array."

If there is to be any possibility of targetting an archive to this format, then array support is crucial.

Similarly, the basic message passing interface in MPI is

int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )

The 'count' argument there is the array length. Again, without array support it is not possible to take full advantage of MPI.

Maybe you don't care about these applications, but if that is the case then you should substantially narrow your description of the library, which misleadingly suggests that such applications would fall within its scope:

"Here, we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. Such a system can be used to reconstitute an equivalent structure in another program context. Depending on the context, this might be used to implement object persistence, remote parameter passing or other facility. In this system we use the term "archive" to refer to a specific rendering of this stream of bytes. This could be a file of binary data, text data, XML, or some other created by the user of this library."

Regards, Ian McCulloch
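To make the array-orientation of that interface concrete, here is a hedged sketch of writing one array through the netCDF C API (error handling is mostly omitted, and the variable and dimension names are arbitrary):

#include <netcdf.h>
#include <cstddef>

// write a one-dimensional array of doubles as a single netCDF variable
int write_array(const char * path, const double * data, std::size_t n)
{
    int ncid, dimid, varid;
    int status = nc_create(path, NC_CLOBBER, &ncid);
    if (status != NC_NOERR) return status;
    nc_def_dim(ncid, "n", n, &dimid);
    nc_def_var(ncid, "data", NC_DOUBLE, 1, &dimid, &varid);
    nc_enddef(ncid);
    nc_put_var_double(ncid, varid, data);   // the whole array in one call
    return nc_close(ncid);
}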

Ian McCulloch wrote:
Besides, is the boost iostreams library really much slower than a hand-coded buffer?
I'm pretty confident that it's much, much slower, but this will remain in dispute until someone runs the code with a profiler.
Anyway, this is a side issue. The main point is:
David Abrahams wrote:
,---- | For many archive formats and common datatypes there exist APIs | that can quickly read or write contiguous sequences of those types | all at once (**). Reading or writing such a sequence by | separately reading or writing each element (as the serialization | library currently does) can be an order of magnitude more | expensive. `----
Sorry - that's NOT the main point. The main point is - do enhancements for special cases have to be incorporated into the core code so that everybody else is obligated to use them? What are the advantages and disadvantages of doing so? No one is disputing that it is desirable to be able to extend the library for these special circumstances.
If there is to be any possibility of targetting an archive to this format, then array support is crucial.
Then just make an archive which does it- what's stopping you? Robert Ramey

Robert Ramey wrote:
Ian McCulloch wrote:
Besides, is the boost iostreams library really much slower than a hand-coded buffer?
I'm pretty confident that it's much, much slower, but this will remain in dispute until someone runs the code with a profiler.
Or do a benchmark of writing N items into an iostreams buffer versus N items into a trivial buffer. I actually just tried this, and the difference was a factor of 2 for both in-cache and out-of-cache. g++ 3.4.4, AMD64. For out-of-cache, the iostreams was a factor 7 slower than a memcpy(), and the trivial buffer was a factor 3.4 slower. For in-cache, the iostreams was a factor 30 slower than a memcpy() and the trivial buffer was a factor 15 slower. Benchmark code attached. This was the first time I used boost::iostreams so I am not sure if I did it correctly. I used a stream<back_insert_device<vector<unsigned char> > >, and the write() member function.
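The attached benchmark itself is not reproduced here; a rough reconstruction of the kind of comparison described (using std::vector<char> rather than unsigned char, std::clock for timing, and the same stream-over-back_insert_device usage, which may or may not match the attachment exactly) might look like:

#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/iostreams/stream.hpp>
#include <vector>
#include <cstring>
#include <cstddef>
#include <cstdio>
#include <ctime>

int main()
{
    namespace io = boost::iostreams;
    const std::size_t N = 10000000;               // 10 million doubles
    std::vector<double> src(N, 1.0);

    // 1) boost.iostreams stream over a back_insert_device, element by element
    std::vector<char> sink1;
    sink1.reserve(N * sizeof(double));
    io::stream< io::back_insert_device< std::vector<char> > > out(sink1);
    std::clock_t t0 = std::clock();
    for (std::size_t i = 0; i < N; ++i)
        out.write(reinterpret_cast<const char *>(&src[i]), sizeof(double));
    out.flush();
    std::clock_t t1 = std::clock();

    // 2) trivial hand-rolled buffer, element by element
    std::vector<char> sink2(N * sizeof(double));
    char * p = &sink2[0];
    for (std::size_t i = 0; i < N; ++i) {
        std::memcpy(p, &src[i], sizeof(double));
        p += sizeof(double);
    }
    std::clock_t t2 = std::clock();

    // 3) one bulk memcpy of the whole block
    std::vector<char> sink3(N * sizeof(double));
    std::memcpy(&sink3[0], &src[0], N * sizeof(double));
    std::clock_t t3 = std::clock();

    std::printf("iostreams %.2fs  trivial %.2fs  memcpy %.2fs\n",
                double(t1 - t0) / CLOCKS_PER_SEC,
                double(t2 - t1) / CLOCKS_PER_SEC,
                double(t3 - t2) / CLOCKS_PER_SEC);
    return 0;
}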
Anyway, this is a side issue. The main point is:
David Abrahams wrote:
,---- | For many archive formats and common datatypes there exist APIs | that can quickly read or write contiguous sequences of those types | all at once (**). Reading or writing such a sequence by | separately reading or writing each element (as the serialization | library currently does) can be an order of magnitude more | expensive. `----
Sorry - that's NOT the main point.
The main point is - do enhancements for special cases have to be incorporated into the core code so that everybody else is obligated to use them? What are the advantages and disadvantages of doing so?
'Recommended' I can believe, but 'Obligated'? Why?
No one is disputing that it is desirable to be able to extend the library for these special circumstances.
If there is to be any possibility of targetting an archive to this format, then array support is crucial.
Then just make an archive which does it- what's stopping you?
None of the existing container serialization functions would make use of it, and I have no desire to rewrite and maintain specialized versions of them. This isn't just a specialized case applying to just a few archive types and just a few user-defined types. It is wide-ranging and applies to a potentially large number of archive types and a very large number of serializable objects, including standard library components and existing boost components. Anyway, it is clear to me that my arguments are not helping in the slightest, and the probability of convincing you that what we are trying to do is both worthwhile and problematic without some form of array support in the serialization lib appears to be zero. So I will not waste your time and mine any further by continuing to post to this thread. Regards, Ian

Ian McCulloch wrote:
Anyway, it is clear to me that my arguments are not helping in the slightest, and the probability of convincing you that what we are trying to do is both worthwhile and problematic without some form of array support in the serialization lib appears to be zero. So I will not waste your time and mine any further by continuing to post to this thread.
Sorry, this didn't come out quite how I intended: Dave, Matthias etc. can probably argue the issues more articulately than I can, so I will leave further debate to them. So far it seems to be going around in circles and I probably haven't helped. Cheers, Ian

Robert Ramey wrote:
Ian McCulloch wrote:
Besides, is the boost iostreams library really much slower than a hand-coded buffer?
I'm pretty confident that it's much, much slower, but this will remain in dispute until someone runs the code with a profiler.
Anyway, this is a side issue. The main point is:
David Abrahams wrote:
,---- | For many archive formats and common datatypes there exist APIs | that can quickly read or write contiguous sequences of those types | all at once (**). Reading or writing such a sequence by | separately reading or writing each element (as the serialization | library currently does) can be an order of magnitude more | expensive. `----
Sorry - that's NOT the main point.
The main point is - do enhancements for special cases have to be incorporated into the core code so that everybody else is obligated to use them? What are the advantages and disadvantages of doing so?
Umm, I don't follow - how are others going to be obligated to use those special cases that don't necessarily apply to them?
No one is disputing that it is desirable to be able to extend the library for these special circumstances.
Actually it's *very* desirable.
If there is to be any possibility of targetting an archive to this format, then array support is crucial.
Then just make an archive which does it- what's stopping you?
True - but the suggestion is that it's common enough to want as part of the core library. Here we use netCDF (the format brought up by Ian) and HDF-5 data files, and run with MPI on a mosix cluster ... the point being that these data types are integral to industries like ours - where large volumes of data are the norm - and array access is indeed crucial - we see code here that runs in the order of seconds to minutes, or hours, depending on how data is read and written - but as a concession, that is also dependent on the underlying libraries we inherit from 3rd parties (aka netCDF, HDF-5, et al). I've been loosely following this thread and I understand your reluctance to this - but it's a matter of perception as to how needful and widespread such a need is. ... back to lurker mode Cheers, -- Manfred Doudar MetOcean Engineers www.metoceanengineers.com

Manfred Doudar wrote:
The main point is - do enhancements for special cases have to be incorporated into the core code so that everybody else is obligated to use them? What are the advantages and disadvantages of doing so?
Umm, I don't follow - how are others going to be obligated to use those special cases that don't necessarily apply to them?
The design of the original submission changed binary_archive so that everyone would benefit from the enhancement whether they asked for it or not. The library has been designed to avoid exactly that situation.
No one is disputing that it desireable to be able to extend the library for these special circumstances.
Actually it's *very* desirable.
The library has been designed to permit this. I don't see how anyone who looks at the documentation could conclude otherwise.
If there is to be any possibility of targetting an archive to this format, then array support is crucial.
Then just make an archive which does it- what's stopping you?
True - but the suggestion is that it's common enough to want as part of the core library.
Ahhh - perhaps this is the source of the confusion. When I use the word "core library" I'm referring to the common set of facilities that ALL archives use. For example, xml is not part of the core library - though it's widely used. xml_archives have been built on top of the "core" and are in fact a use case for the library. Of course, users who don't build archives themselves may not be aware of this and see xml_archives as part of the "core library" because they are included in the package and they just use them. The library is really an archive construction kit. As part of the package it includes 5 pre-made archives which

a) can be used as is
b) can serve as examples for making one's own archives
c) can be used as base classes for making variations and extensions
d) can be composed with archive adaptors to create variations of existing archives - e.g. polymorphic archives.

The current situation is where one user feels he needs to do something that none of the other archive creators have had to do - alter the construction kit itself. This violates the factoring which has permitted the library to be used to make all these different archives from the same core layer. This same factoring has permitted the implementation of polymorphic versions of all the above 5 archives and will permit any new archive to be made polymorphic by composition with existing code.

******************************************************
* So for the Nth time. This is not about whether or not archive
* XDR, CDR, MPI, or XYZ should be made. Anyone is free to make it.
* It's about how to do it so that it doesn't impact anyone else's efforts.
******************************************************

<snip> about how important other formats are </snip>
I've been loosely following this thread and I understand your reluctance to this - but it's a matter of perception as to how needful and widespread such a need is.
If there is a need, I trust someone will make the corresponding archive. If it's too hard to make an archive - well, that's another problem and I'll see how that can be made more transparent. Robert Ramey

Robert Ramey wrote:
Manfred Doudar wrote:
The main point is - do enhancement for special cases have to be incorporated into the the core code so that everybody else is obligated to use it? What are the advantages and disadvantages of doing so?
Umm, I don't follow - how are others going to be obligated to use those special cases that don't necessarily apply to them?
The design of the original submission changed binary_archive so that everyone would benefit from the enhancement whether they asked for it or not. The library has been designed to avoid exactly that situation.
I'm with you now.
[snip]
If there is to be any possibility of targeting an archive to this format, then array support is crucial.
Then just make an archive which does it - what's stopping you?
True - but the suggestion is that it's common enough to want as part of the core library.
Ahhh - perhaps this is the source of the confusion. When I use the word "core library" I'm referring to the common set of facilities that ALL archives use. For example, xml is not part of the core library - though it's widely used. xml_archives have been built on top of the "core" and are in fact a use case for the library. Of course, users who don't build archives themselves may not be aware of this and see xml_archives as part of the "core library" because they are included in the package and they just use them. The library is really an archive construction kit. As part of the package it includes 5 pre-made archives which
a) can be used as is
b) can serve as examples for making one's own archives
c) can be used as base classes for making variations and extensions
d) can be composed with archive adaptors to create variations of existing archives - e.g. polymorphic archives.
The current situation is one where one user feels he needs to do something that none of the other archive creators have had to do - alter the construction kit itself. This violates the factoring which has permitted the library to be used to make all these different archives from the same core layer. This same factoring has permitted the implementation of polymorphic versions of all the above 5 archives and will permit any new archive to be made polymorphic by composition with existing code.
******************************************************
* So for the Nth time: this is not about whether or not an archive for
* XDR, CDR, MPI, or XYZ should be made. Anyone is free to make it.
* It's about how to do it so that it doesn't impact anyone else's efforts.
******************************************************
Excellent! -Now that's an answer I think a lot of us have been looking for - hopefully it should clear things up for others too. Thanks, -- Manfred Doudar MetOcean Engineers www.metoceanengineers.com

Hi Robert, I'll let Dave comment on the parts where you review his proposal, and will focus on the performance. On Nov 24, 2005, at 6:59 PM, Robert Ramey wrote:
a) It doesn't address the root cause of "slow" performance of binary archives.
I have done the benchmarks you desired last night (see below), and they indeed show that the root cause of slow performance is the individual writing of many small elements instead of "block-writing" of the array in a call to something like save_array.
b) re-implementation of binary_archive in such a way so as not to break existing archives would be an error-prone process. The switching between the new and old methods "should" result in exactly the same byte sequence. But it could easily occur that a small subtle change might render archives created under the previous binary_archive unreadable.
Dave's design does not change anything in your archives or serialization functions, but only adds an additional binary archive using save_array and load_array.
c) The premise that one will save a lot of coding (see d) above) compared to the current method of overloading based on the pair of archive/type is overly optimistic.
Actually I have implemented two new archive classes (MPI and XDR) which can profit from it, and it does save lots of code duplication. All of the serialization functions for types that can make use of such an optimization can be shared between all these archive types. In addition formats such as HDF5 and netCDF have been mentioned, which can reuse the *same* serialization function to achieve optimal performance. There is nothing "optimistic" here since we have the actual implementations, which show that code duplication can be avoided.
Conclusions =========== a) The proposal suffers from "premature optimization". A large amount of design effort has been expended on areas which are likely not the source of observed performance bottlenecks.
As Dave pointed out, one main reason for a save_array/load_array or save_sequence/load_sequence hook is to utilize existing APIs for serialization (including message passing) that provide optimized functions for arrays of contiguous data. Examples include MPI, PVM, XDR, HDF5. There is a well-established reason why all these libraries have special functions for arrays of contiguous data: they all observed the same bottlenecks. These bottlenecks have been well known for decades in high performance computing, and have caused all these APIs to include special support for contiguous arrays of data.
b) The proposal suffers from "over-generalization". The attempt to generalize results in a much more complex system. Such a system will result in a net loss of conceptual integrity and implementation transparency. The claim that this generalization will actually result in a reduction of code is not convincing.
I'm confused by your statement. Actually the implementations of fast binary archives, MPI archives and XDR archives do share common serialization functions, and this does indeed result in code reduction and avoids code duplication.
c) by re-implementing a currently existing and used archive, it risks creating a maintenance headache for no real benefit.
To avoid any such potential problems Dave proposed to add a new archive in an array sub namespace. I guess that alleviates your concerns? Also, a 10x speedup might not be a benefit for you and your applications but as you can see from postings here, it is a concern for many others.
Suggestions ===========
a) Do more work in finding the speed bottlenecks. Run a profiler. Make a buffer based non-stream based archive and re-run your tests.
I have attached a benchmark for such an archive class and ran benchmarks for std::vector<char> serialization. Here are the numbers (using gcc-4 on a Powerbook G4):

Time using serialization library: 13.37
Time using direct calls to save in a loop: 13.12
Time using direct call to save_array: 0.4

In this case the buffer had size 0 at first and needed to be resized during the insertions. Here are the numbers for the case where enough memory has been reserved with reserve():

Time using serialization library: 12.61
Time using direct calls to save in a loop: 12.31
Time using direct call to save_array: 0.35

And here are the numbers for std::vector<double>, using a vector of 1/8th the size:

Time using serialization library: 1.95
Time using direct calls to save in a loop: 1.93
Time using direct call to save_array: 0.37

Since there are fewer calls for these larger types it looks slightly better, but even now there is a more than 5x difference in this benchmark. As you can see the overhead of the serialization library (less than 2%) is insignificant compared to the cost of doing lots of individual insertion operations into the buffer instead of one big one. The bottleneck is thus clearly the many calls to save() instead of a single call to save_array().
b) Make your MPI, XDR and whatever archives. Determine how much opportunity for code sharing is really available.
This has been done and is the reason for the proposal to introduce something like the save_array/load_array functions. I have coded an XDR archive and two different types of MPI archives (one using a buffer, another not using a buffer). A single serialization function for std::valarray, using the load_array hook, suffices to use the optimized APIs in MPI and XDR, as well as a faster binary archive, and the same is true for other types.
c) If you still believe your proposal has merit, make your own "optimized binary archive". Don't derive from binary_archive but rather from common_?archive or perhaps basic_binary_archive. In this way you will have a totally free hand and won't have to achieve consensus with the rest of us which will save us all a huge amount of time.
I'm confused. I realize that one should not derive from binary_iarchive, but why should one not derive from binary_iarchive_impl? Also, following Dave's proposal none of your archives is touched, but instead additional faster ones are provided. Matthias

Matthias Troyer wrote:
Hi Robert,
I'll let Dave comment on the parts where you review his proposal, and will focus on the performance.
On Nov 24, 2005, at 6:59 PM, Robert Ramey wrote:
a) It doesn't address the root cause of "slow" performance of binary archives.
I have done the benchmarks you desired last night (see below), and they indeed show that the root cause of slow performance is the individual writing of many small elements instead of "block-writing" of the array in a call to something like save_array.
b) re-implementation of binary_archive in such a way so as not to break existing archives would be an error-prone process. The switching between the new and old methods "should" result in exactly the same byte sequence. But it could easily occur that a small subtle change might render archives created under the previous binary_archive unreadable.
Dave's design does not change anything in your archives or serialization functions, but only adds an additional binary archive using save_array and load_array.
Hmm - that's not the way I read it. I've touched on this in another post.
c) The premise that one will save a lot of coding (see d) above) compared to the current method of overloading based on the pair of archive/type is overly optimistic.
Actually I have implemented two new archive classes (MPI and XDR) which can profit from it, and it does save lots of code duplication. All of the serialization functions for types that can make use of such an optimization can be shared between all these archive types. In addition formats such as HDF5 and netCDF have been mentioned, which can reuse the *same* serialization function to achieve optimal performance.
There is nothing "optimistic" here since we have the actual implementations, which show that code duplication can be avoided.
OK - I can really only comment on that which I've seen.
Conclusions =========== a) The proposal suffers from "premature optimization". A large amount of design effort has been expended on areas which are likely not the source of observed performance bottlenecks.
As Dave pointed out, one main reason for a save_array/load_array or save_sequence/load_sequence hook is to utilize existing APIs for serialization (including message passing) that provide optimized functions for arrays of contiguous data. Examples include MPI, PVM, XDR, HDF5. There is a well-established reason why all these libraries have special functions for arrays of contiguous data: they all observed the same bottlenecks. These bottlenecks have been well known for decades in high performance computing, and have caused all these APIs to include special support for contiguous arrays of data.
I admit I'm skeptical of the benefits, but I've not disputed that someone should be able to do this without a problem. The difference lies in where the implementation should be placed.
b) The proposal suffers from "over-generalization". The attempt to generalize results in a much more complex system. Such a system will result in a net loss of conceptual integrity and implementation transparency. The claim that this generalization will actually result in a reduction of code is not convincing.
I'm confused by your statement. Actually the implementations of fast binary archives, MPI archives and XDR archives do share common serialization functions, and this does indeed result in code reduction and avoids code duplication.
Upon reflection - I think I would prefer the term "premature generalization". I concede that's speculation on my part. It seems a lot of effort has been invested to avoid the MxN problem. My own experiments with bitwise_array_archive_adaptor have failed to convince me that the library needs more API to deal with this problem. Shortly, I will be uploading some code which perhaps will make my reasons for this belief more obvious.
c) by re-implementing a currently existing and used archive, it risks creating a maintenance headache for no real benefit.
To avoid any such potential problems Dave proposed to add a new archive in an array sub namespace.
As I said - that's not how I understood it.
I guess that alleviates your concerns? Also, a 10x speedup might not be a benefit for you and your applications but as you can see from postings here, it is a concern for many others.
LOL - No one has ever disputed the utility of a 10x speed up. The question is how best to achieve it without creating a ripple of side effects.
Suggestions ===========
a) Do more work in finding the speed bottlenecks. Run a profiler. Make a buffer based non-stream based archive and re-run your tests.
I have attached a benchmark for such an archive class and ran benchmarks for std::vector<char> serialization. Here are the numbers (using gcc-4 on a Powerbook G4):
Time using serialization library: 13.37
Time using direct calls to save in a loop: 13.12
Time using direct call to save_array: 0.4
In this case the buffer had size 0 at first and needed to be resized during the insertions. Here are the numbers for the case where enough memory has been reserved with reserve():
Time using serialization library: 12.61
Time using direct calls to save in a loop: 12.31
Time using direct call to save_array: 0.35
And here are the numbers for std::vector<double>, using a vector of 1/8th the size:
Time using serialization library: 1.95
Time using direct calls to save in a loop: 1.93
Time using direct call to save_array: 0.37
Since there are fewer calls for these larger types it looks slightly better, but even now there is a more than 5x difference in this benchmark.
As you can see the overhead of the serialization library (less than 2%) is insignificant compared to the cost of doing lots of individual insertion operations into the buffer instead of one big one. The bottleneck is thus clearly the many calls to save() instead of a single call to save_array().
Well, this is interesting data. The call to save() resolves inline to a call to get the std::vector element and stuff the value into the buffer. I wonder how much of this is in std::vector and how much is in the save to the buffer. And it does diminish my skepticism about how much benefit the array serialization would be in at least these specific cases. So, I'll concede that this will be a useful facility for a significant group of users. Now we can focus on how to implement it with the minimal collateral damage.
b) Make your MPI, XDR and whatever archives. Determine how much opportunity for code sharing is really available.
This has been done and is the reason for the proposal to introduce something like the save_array/load_array functions. I have coded an XDR archive and two different types of MPI archives (one using a buffer, another not using a buffer). A single serialization function for std::valarray, using the load_array hook, suffices to use the optimized APIs in MPI and XDR, as well as a faster binary archive, and the same is true for other types.
c) If you still believe your proposal has merit, make your own "optimized binary archive". Don't derive from binary_archive but rather from common_?archive or perhaps basic_binary_archive. In this way you will have a totally free hand and won't have to achieve consensus with the rest of us which will save us all a huge amount of time.
I'm confused. I realize that one should not derive from binary_iarchive, but why should one not derive from binary_iarchive_impl?
What I meant is if you don't change the current binary_i/oarchive implementation you won't have to worry about backward compatibility with any existing archives. I (mis?)understood the proposal to include adjustments to the current implementation so that it could be derived from.
Also, following Dave's proposal none of your archives is touched, but instead additional faster ones are provided.
This wasn't clear to me from my reading of the proposal. Robert Ramey

Robert Ramey wrote:
Matthias Troyer wrote:
Dave's design does not change anything in your archives or serialization functions, but only adds an additional binary archive using save_array and load_array.
Hmm - that's not the way I read it. I've touched on this in another post.
As already explained elsewhere, you mis-read it. The archive was in a sub-namespace. Perhaps it would have been clearer if Dave had used a completely different namespace name, boost::array_serialization_extensions, or boost::not_the_serialization_namespace?
c) The premise that one will save a lot of coding (see d) above) compared to the current method of overloading based on the pair of archive/type is overly optimistic.
Actually I have implemented two new archive classes (MPI and XDR) which can profit from it, and it does save lots of code duplication. All of the serialization functions for types that can make use of such an optimization can be shared between all these archive types. In addition formats such as HDF5 and netCDF have been mentioned, which can reuse the *same* serialization function to achieve optimal performance.
There is nothing "optimistic" here since we have the actual implementations, which show that code duplication can be avoided.
OK - I can really only comment on that which I've seen.
Are we talking at cross-purposes here? Matthias is talking about sharing *serialization* functions. That is, for each data type, there is only *one* serialization function that calls load/save_array (or whatever the array hook function is called...). You seem to be disputing the code duplication issue by saying that different *archives* will not typically(*) be able to share implementations of array processing. This I completely agree with. But that is completely separate from the number of *serialization* functions that need to be written.

Matthias, it might help if you show an example of a serialization function for some vector type, and the implementation of the array processing for the MPI and XDR archives, to demonstrate the orthogonality of the serialization vs archive ideas.

(*) of course there are some counter-examples. That is the idea for deriving one archive from another, is it not?

[snip]
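Roughly the kind of thing I have in mind - toy code only, with invented names (membuf_oprimitive, file_oprimitive) standing in for the real MPI/XDR back-ends, and with the element count and the fallback for archives without the hook left out:

#include <cstddef>
#include <cstdio>
#include <vector>

// Toy archive primitive #1: block-copies the array into an in-memory buffer.
class membuf_oprimitive
{
public:
    template<class T>
    void save_array(const T * p, std::size_t n)
    {
        const char * c = reinterpret_cast<const char *>(p);
        buffer.insert(buffer.end(), c, c + n * sizeof(T));
    }
    std::size_t size() const { return buffer.size(); }
private:
    std::vector<char> buffer;
};

// Toy archive primitive #2: block-writes the array to a FILE*.
class file_oprimitive
{
public:
    explicit file_oprimitive(std::FILE * f) : file(f) {}
    template<class T>
    void save_array(const T * p, std::size_t n)
    {
        std::fwrite(p, sizeof(T), n, file);
    }
private:
    std::FILE * file;
};

// The single, shared serialization function: it knows only about the
// save_array hook, nothing about any particular archive back-end.
template<class Archive>
void save(Archive & ar, const std::vector<double> & v)
{
    if (!v.empty())
        ar.save_array(&v[0], v.size());
}

int main()
{
    std::vector<double> v(1000, 3.14);

    membuf_oprimitive m;
    save(m, v);                         // same function, buffer back-end

    std::FILE * f = std::fopen("out.bin", "wb");
    if (f) {
        file_oprimitive fo(f);
        save(fo, v);                    // same function, file back-end
        std::fclose(f);
    }
    return 0;
}

The point being: save() for the container is written once against the array hook; each archive only supplies its own save_array.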
As you can see the overhead of the serialization library (less than 2%) is insignificant compared to the cost of doing lots of individual insertion operations into the buffer instead of one big one. The bottleneck is thus clearly the many calls to save() instead of a single call to save_array().
Well, this is interesting data. The call to save() resolves inline to a call to get the std::vector element and stuff the value into the buffer. I wonder how much of this is in std::vector and how much is in the save to the buffer.
As described here http://lists.boost.org/Archives/boost/2005/11/97156.php the effect of using a custom buffer versus a buffer based around vector::push_back is exactly a factor 2, irrespective of cache effects. Matthias' benchmark showed that the time taken to serialize an array into a vector buffer is almost the same as the time taken to push_back the array in a loop (i.e. the serialization library itself introduces negligible overhead in this case). Thus, a serialization archive based on the same buffer I used in my benchmark should achieve the same factor 2 speedup. Note that the speedup using save_array was of the order of 30, so that, even with a factor 2 speedup from using an optimized buffer, save_array would still be 15 times faster! (This is using the first set of data. Using the set for small arrays would only give a modest factor 3x improvement for save_array versus a custom buffer archive.) Cheers, Ian

"Robert Ramey" <ramey@rrsd.com> writes:
Also, following Dave's proposal none of your archives is touched, but instead additional faster ones are provided.
This wasn't clear to me from my reading of the proposal.
Regardless of whether it _was_ clear, can you now accept that there is **no proposal to modify the serialization library** ?

As stated several times, we presented the simplest thing that we think can address the problem **without modifying the existing library**. Even thinking of that code as a proposal is a little bit wrong. We'll need that code (or something very much like it) in order to provide fast array serialization. We're _going_ to provide what's in "the proposal" (or something very much like it) one way or another, either within Boost or elsewhere. We could put that code in our own library, which we could submit for a separate Boost review, or we could publish it separately.

We took special pains to conform as closely as possible to your expectations and requirements for code that could be part of the serialization library, but only to make it as easy as possible for you to understand what we're doing. After going to such great lengths to be understood it's very disappointing to have failed so miserably. I hope you can help rescue our efforts by making a commensurate effort to receive our postings as they are intended, rather than as... something else.

-- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
Also, following Dave's proposal none of your archives is touched, but instead additional faster ones are provided.
This wasn't clear to me from my reading of the proposal.
Regardless of whether it _was_ clear, can you now accept that there is **no proposal to modify the serialization library** ?
As has been surmised, I did overlook the fact that although certain things had the same names they were in different namespaces. So that will resolve some confusion. OK that's fine. Now my question is why do you need anything from me?
As stated several times, we presented the simplest thing that we think can address the problem **without modifying the existing library**.
Ahh - I think it can be simpler. That's the rub.
Even thinking of that code as a proposal is a little bit wrong. We'll need that code (or something very much like it) in order to provide fast array serialization. We're _going_ to provide what's in "the proposal" (or something very much like it) one way or another, either within Boost or elsewhere. We could put that code in our own library, which we could submit for a separate Boost review, or we could publish it separately.
I have absolutely no problem with this. In fact, I look forward to seeing people come up with more and more archives. As I said in another post, I don't see the archives currently included with the package as really part of the library - but rather as examples of how the common code can be used to build an archive class suitable to the purpose at hand. So I see no conflict here at all. In fact I see this as complementary to my goal of narrowing my area of responsibility as regards the library. Of course, some people will use the facilities to undertake efforts which I consider misguided. I'm not hugely happy about that but I just have to live with it - and who knows, I might be wrong. But since those people are investing their own effort it's their call and I'm fine with it.
We took special pains to conform as closely as possible to your expectations and requirements for code that could be part of the serialization library, but only to make it as easy as possible for you to understand what we're doing. After going to such great lengths to be understood it's very disappointing to have failed so miserably. I hope you can help rescue our efforts by making a commensurate effort to receive our postings as they are intended, rather than as... something else.
Well, I concede I've misunderstood some of the things you're doing. Shortly, I'll post some code that I believe addresses all your design goals in a much simpler and effective way. That may be helpful in resolving this misunderstanding. Robert Ramey

"Robert Ramey" <ramey@rrsd.com> writes:
David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
Also, following Dave's proposal none of your archives is touched, but instead additional faster ones are provided.
This wasn't clear to me from my reading of the proposal.
Regardless of whether it _was_ clear, can you now accept that there is **no proposal to modify the serialization library** ?
As has been surmised, I did overlook the fact that although certain things had the same names they were in different namespaces. So that will resolve some confusion.
OK that's fine.
Now my question is why do you need anything from me?
As stated several times, we presented the simplest thing that we think can address the problem **without modifying the existing library**.
Ahh - I think it can be simpler. That's the rub.
If it were simpler it would require a little more work from archive authors, and we're not going to waste any breath on this list trying to convince you of that. But if it turns out I'm wrong about that, your simplification of our design would certainly be welcome. <snip>
So I see no conflict here at all. In fact I see this as complementary to my goal of narrowing my area of responsibility as regards the library.
Of course, some people will use the facilities to undertake efforts which I consider misguided. I'm not hugely happy about that but I just have to live with it - and who knows, I might be wrong. But since those people are investing their own effort it's their call and I'm fine with it.
Good, then maybe tomorrow we'll be able to talk about the effects of Matthias and me going our own way with this. With or without your simplification it will be the same story.
We took special pains to conform as closely as possible to your expectations and requirements for code that could be part of the serialization library, but only to make it as easy as possible for you to understand what we're doing. After going to such great lengths to be understood it's very disappointing to have failed so miserably. I hope you can help rescue our efforts by making a commensurate effort to receive our postings as they are intended, rather than as... something else.
Well, I concede I've misunderstood some of the things you're doing. Shortly, I'll post some code that I believe addresses all your design goals in a much simpler and effective way. That may be helpful in resolving this misunderstanding.
I doubt it. Whether or not you can simplify the design has no real effect on the core issue. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Matthias Troyer wrote:
Suggestions ===========
a) Do more work in finding the speed bottlenecks. Run a profiler. Make a buffer based non-stream based archive and re-run your tests.
I have attached a benchmark for such an archive class and ran benchmarks for std::vector<char> serialization. Here are the numbers (using gcc-4 on a Powerbook G4):
I've taken a look at your benchmark.cpp. First of all it's very nice and simple and shows an understanding of how the primitive i/o is isolated from the archives that use it. It's a step in the right direction. But I see some problems. The usage of std::vector<char> isn't what I would expect for an output buffer. You aren't using this in your own archives, are you?

Here are my timing results on my Windows XP system with a 2.4 GHz Pentium. With your original program I get

for value_type set to char

Time using serialization library: 9.454 Size is 100000004
Time using direct calls to save in a loop: 8.844 Size is 100000000
Time using direct call to save_array: 0.266 Size is 100000000

for value type set to double

Time using serialization library: 1.281 Size is 100000004
Time using direct calls to save in a loop: 1.218 Size is 100000000
Time using direct call to save_array: 0.266 Size is 100000000

I modified it to use a simple buffer output closer to what I would expect to use if I were going to make a primitive buffer output. BTW - that would be a very nice addition. This would be much faster than using strstream as is being used now. Here are the results with the program modified in this way.

For value type set to char I get

Time using serialization library: 0.797 Size is 100000004
Time using direct calls to save in a loop: 0.297 (1) Size is 100000000
Time using direct call to save_array: 0.203 Size is 100000000

and for value_type set to double I get

Time using serialization library: 0.109 (3) Size is 100000004
Time using direct calls to save in a loop: 0.078 (2) Size is 100000000
Time using direct call to save_array: 0.25 Size is 100000000

a) the usage of save_array does not have a huge effect on performance. It IS measurable. It seems that it saves about 1/3 the time over using a loop of saves in the best case. (1)

b) In the worst case, it's even slower than a loop of saves!!! (2) and even slower than the raw serialization system (3)

c) the overhead of the serialization library isn't too bad. It does show up when doing 100M characters one by one, but generally it doesn't seem to be a big issue.

In my view, it does support my contention that implementing save_array - regardless of how it is in fact implemented - represents a premature optimization. I suspect that the net benefit in the kind of scenario you envision using it will be very small. Obviously, this test raises more questions than it answers and I think it should be pursued further. Another thing I would like to see is a version of the test applied to C++ arrays. My interest is to isolate bottlenecks in the serialization library from those in the stl libraries.
Robert Ramey

[uuencoded attachment: test_zmisc.cpp - the modified benchmark source]

Hi Robert,

I think you should check your benchmark code again. I think it is not doing what you think it is doing.

class oprimitive
{
public:
    // default saving of primitives.
    template<class T>
    void save(const T & t)
    {
        save_binary(&t, sizeof(T));
    }

    // default saving of arrays.
    template<class T>
    void save_array(const T * p, std::size_t n)
    {
        save_binary(p, n*sizeof(T));
    }

    void save(const std::string &s) { abort(); }

    void save_binary(const void *address, std::size_t count)
    {
        std::memcpy(buffer, address, count);
        p += count;
    }

    std::size_t size() { return s; }
    void reserve(std::size_t n)
    {
        s = n;
        p = 0;
        buffer = new char[n];
    }
    ~oprimitive()
    {
        delete buffer;
    }
private:
    std::size_t s;
    std::size_t p;
    char * buffer;
};

There is a bug here: the oprimitive::save_binary() function always writes to the *start* of the buffer. Incrementing 'p' here has no effect. It is not too surprising that you see that a lot of repeated calls to save_binary() with a small sized object is much faster than a single call to save_binary() with a large object, because in the first case a single memory address is being overwritten repeatedly (with lots of scope for misleading compiler optimizations!), whereas the second case is limited by the memory bandwidth.

Secondly, the buffer in the oprimitive class has much less functionality than the vector<char> buffer, as well as the buffer I used previously (http://lists.boost.org/Archives/boost/2005/11/97156.php). In particular, it does not check for buffer overflow when writing. Thus it has no capability for automatic resizing/flushing, and is only useful if you know in advance what the maximum size of the serialized data is. This kind of buffer is of rather limited use, so I think that this is not a fair comparison.

FWIW, I include the benchmark I just ran. Amd64 g++ 3.4.4 on linux 2.6.10, and cheap (slow!) memory ;)

vector<char> buffer:

Time using serialization library: 3.79 Size is 100000004
Time using direct calls to save in a loop: 3.42 Size is 100000000
Time using direct call to save_array: 0.16 Size is 100000000

primitive buffer (with the save_binary() function modified to do "buffer += count"):

Time using serialization library: 1.57 Size is 100000004
Time using direct calls to save in a loop: 1.35 Size is 100000000
Time using direct call to save_array: 0.16 Size is 100000000

Interestingly, on this platform/compiler combination, without the bug fix in save_binary() it still takes 1.11 seconds ;) I would guess your Windows compiler is doing some optimization that gcc is not, in that case.

Regards, Ian

Ian McCulloch wrote:
Hi Robert,
I think you should check your benchmark code again. I think it is not doing what you think it is doing.
whoops - of course you're correct. Here are the corrected numbers.

for value_type set to char

Time using serialization library: 1.922 Size is 100000004
Time using direct calls to save in a loop: 1.031 Size is 100000000
Time using direct call to save_array: 0.25 Size is 100000000

for value type set to double

Time using serialization library: 0.86 Size is 100000004
Time using direct calls to save in a loop: 0.36 Size is 100000000
Time using direct call to save_array: 0.265 Size is 100000000
Secondly, the buffer in the oprimitive class has much less functionality than the vector<char> buffer, as well as the buffer I used previously (http://lists.boost.org/Archives/boost/2005/11/97156.php). In particular, it does not check for buffer overflow when writing. Thus it has no capability for automatic resizing/flushing, and is only useful if you know in advance what the maximum size of the serialized data is. This kind of buffer is of rather limited use, so I think that this is not a fair comparison.
I think it's much closer to the binary archive implementation than the current binary_oarchive is. I also think it's fairly close to what an archive class would look like for a message passing application. The real difference here is that save_binary would be implemented in such a way that the overhead per call is pretty small. Maybe not quite as small as here, but much smaller than the overhead associated with ostream::write. So I believe that the above results give a much more accurate picture than the previous ones do of the effect of application of the proposed enhancement.
FWIW, I include the benchmark I just ran. Amd64 g++ 3.4.4 on linux 2.6.10, and cheap (slow!) memory ;)
vector<char> buffer:
Time using serialization library: 3.79 Size is 100000004
Time using direct calls to save in a loop: 3.42 Size is 100000000
Time using direct call to save_array: 0.16 Size is 100000000
primitive buffer (with the save_binary() function modified to do "buffer += count"):
Time using serialization library: 1.57 Size is 100000004
Time using direct calls to save in a loop: 1.35 Size is 100000000
Time using direct call to save_array: 0.16 Size is 100000000
Interestingly, on this platform/compiler combination, without the bug fix in save_binary() it still takes 1.11 seconds ;) I would guess your Windows compiler is doing some optimization that gcc is not, in that case.
Thanks for doing this - it is very helpful. Are you sure you're compiling at maximum optimization (-O3)? In any case, this is not untypical of my personal experience with benchmarks. They vary a lot depending on extraneous variables. Our results seem pretty comparable though. Robert Ramey

Robert Ramey wrote:
Ian McCulloch wrote:
[...]
Secondly, the buffer in the oprimitive class has much less functionality than the vector<char> buffer, as well as the buffer I used previously (http://lists.boost.org/Archives/boost/2005/11/97156.php). In particular, it does not check for buffer overflow when writing. Thus it has no capability for automatic resizing/flushing, and is only useful if you know in advance what the maximum size of the serialized data is. This kind of buffer is of rather limited use, so I think that this is not a fair comparison.
I think it's much closer to the binary archive implementation than the current binary_oarchive is.
I don't understand that sentence, sorry. Which binary archive implementation?
I also think it's fairly close to what an archive class would look like for a message passing application.
Surely it depends on the usage pattern? If you are sending fixed size messages, then sure a fixed size buffer with no overflow checks will be fastest. If you are sending variable size messages with no particular upper bound on the message size then it is a tradeoff whether you use a resizeable buffer or count the number of items you need to serialize beforehand. I wouldn't like to guess what is the more 'typical' use. Both are important cases.
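To make the trade-off concrete, a rough sketch (invented names, not code from any of the archives under discussion) of the two buffer styles:

#include <cstddef>
#include <cstring>
#include <vector>

// Growable buffer: pays a capacity check (and possible reallocation)
// on every single write.
class growable_obuffer
{
public:
    void save_binary(const void * address, std::size_t count)
    {
        const char * c = static_cast<const char *>(address);
        storage.insert(storage.end(), c, c + count);   // may reallocate
    }
private:
    std::vector<char> storage;
};

// Fixed buffer: a bare memcpy plus pointer bump, but the caller must
// know the total size up front (e.g. via a prior counting pass).
class fixed_obuffer
{
public:
    explicit fixed_obuffer(std::size_t n) : storage(n), pos(0) {}
    void save_binary(const void * address, std::size_t count)
    {
        std::memcpy(&storage[0] + pos, address, count);  // no overflow check
        pos += count;
    }
private:
    std::vector<char> storage;
    std::size_t pos;
};

int main()
{
    double d[3] = { 1.0, 2.0, 3.0 };
    growable_obuffer g;
    fixed_obuffer f(sizeof(d));
    g.save_binary(d, sizeof(d));   // checked, growing path
    f.save_binary(d, sizeof(d));   // unchecked, preallocated path
    return 0;
}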
The real difference here is that save_binary would be implemented in such a way that the overhead per call is pretty small. Maybe not quite as small as here, but much smaller than the overhead associated with ostream::write.
Ok, but even with the ideal fixed-size buffer, the difference between the serialization lib and save_array for out-of-cache arrays of char, measured by you, is:
Time using serialization library: 1.922
Time using direct call to save_array: 0.25

- almost a factor of 8. For a buffer that has more overhead, no matter how small, it will directly translate into an increase in that factor.
In my view, it does support my contention that implementing save_array - regardless of how it is in fact implemented - represents a premature optimization. I suspect that the net benefit in the kind of scenario you envision using it will be very small.
Note however that in this case, save_array() is purely memory-bandwidth limited. It would be interesting for you to repeat the benchmark with a much smaller array size. You should see several jumps in performance corresponding to various caches, L1, L2, TLB, perhaps others. In any particular benchmark, some of these thresholds might be hard to see. You will need to put the serialization into a loop to get the CPU time to a sensible number, and do a loop or two before starting the timer so that the data is already in the cache. In the fixed-size buffer scenario this is actually not too far from a realistic benchmark. I know (roughly) what the result will be. If you still stand by your previous comment: then obviously you do not.
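Something along these lines - a sketch only, where a plain assignment loop stands in for repeated save() calls, a memcpy stands in for save_array(), and the sizes and repeat counts are arbitrary:

#include <cstddef>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <vector>

int main()
{
    const std::size_t sizes[] = { 1u << 10, 1u << 14, 1u << 18, 1u << 22 };
    double sink = 0;   // consumed at the end so nothing is optimized away

    for (std::size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); ++s) {
        const std::size_t n = sizes[s];
        std::vector<double> v(n, 1.0);
        std::vector<double> buffer(n);
        const std::size_t repeats = (64u << 20) / (n * sizeof(double)) + 1;

        // warm-up pass so source and destination are already in cache
        std::memcpy(&buffer[0], &v[0], n * sizeof(double));

        std::clock_t t0 = std::clock();
        for (std::size_t r = 0; r < repeats; ++r)
            for (std::size_t i = 0; i < n; ++i)
                buffer[i] = v[i];                       // element-by-element
        double loop_s = double(std::clock() - t0) / CLOCKS_PER_SEC;
        sink += buffer[n - 1];

        std::clock_t t1 = std::clock();
        for (std::size_t r = 0; r < repeats; ++r)
            std::memcpy(&buffer[0], &v[0], n * sizeof(double));  // one block
        double block_s = double(std::clock() - t1) / CLOCKS_PER_SEC;
        sink += buffer[n - 1];

        std::printf("%8lu doubles, %6lu repeats: loop %.3f s, block %.3f s\n",
                    (unsigned long)n, (unsigned long)repeats, loop_s, block_s);
    }
    std::printf("(checksum %g)\n", sink);
    return 0;
}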
So I believe that the above results give a much more accurate picture than the previous ones do of the effect of application of the proposed enhancement.
Fine. I am glad you finally agree with the 10x slowdown figure (well, if you want to be picky 7.688x slowdown on your Windows XP box, 9.8512x on my linux-opteron box). [...]
Interestingly, on this platform/compiler combination, without the bug fix in save_binary() it still takes 1.11 seconds ;) I would guess your Windows compiler is doing some optimization that gcc is not, in that case.
Thanks for doing this - it is very helpful.
Are you sure you're compiling at maximum optimization (-O3)?
Of course. -O3 gives no difference from -O2, small difference from -O1, huge difference from -O0. When there is a bug in the benchmark, any result is possible ;) Quite possibly your compiler was simply noticing that the same memory location was being overwritten repeatedly and chose to instead store it in a register? Anyway, since you took no special effort to ensure that the compiler didn't optimize away code it would have been quite legitimate for your benchmark to report 0 time for all tests. In the absence of such effort, you at least need to check carefully the assembly output to make sure the benchmark is really testing what you think it is testing.
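For example (again just a sketch, not something from the posted benchmark): fold the buffer into a checksum and print it after the timed region, so the compiler has to keep the writes.

#include <cstddef>
#include <cstdio>
#include <vector>

// Consume the benchmark's output so the stores cannot be removed as dead code.
unsigned checksum(const std::vector<char> & buffer)
{
    unsigned sum = 0;
    for (std::size_t i = 0; i < buffer.size(); ++i)
        sum = 31 * sum + static_cast<unsigned char>(buffer[i]);
    return sum;
}

int main()
{
    std::vector<char> buffer(100, 'x');   // stands in for the archive's buffer
    // ... the timed serialization into 'buffer' would go here ...
    std::printf("checksum: %u\n", checksum(buffer));
    return 0;
}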
In any case, this is not untypical of my personal experience with benchmarks. They vary a lot depending on extraneous variables. Our results seem pretty comparable though.
Robert Ramey
Cheers, Ian

"Robert Ramey" <ramey@rrsd.com> writes:
I've taken a look at your benchmark.cpp.
First of all it's very nice and simple and shows an understanding of how the primitive i/o is isolated from the archives that use it.
It's a step in the right direction. But I see some problems. The usage of std::vector<char> isn't what I would expect for an output buffer. You aren't using this in your own archives, are you?
I modified it to use a simple buffer output closer to what I would expect to use if I were going to make a primitive buffer output. BTW - that would be a very nice addition. This would be much faster than using strstream as is being used now.
Here are the results with the program modified in this way.
For value type set to char I get
Time using serialization library: 0.797 Size is 100000004
Time using direct calls to save in a loop: 0.297 (1) Size is 100000000
Time using direct call to save_array: 0.203 Size is 100000000
and for value_type set to double I get
Time using serialization library: 0.109 (3) Size is 100000004
Time using direct calls to save in a loop: 0.078 (2) Size is 100000000
Time using direct call to save_array: 0.25 Size is 100000000
First of all, I can't reproduce those results or anything like them with your code. Did you run the program three times and throw out the first result (to make sure the caches were full and you weren't seeing the freak effect of some other process)?

On the code you posted, using vc-7.1, I get:

Run #2
Time using serialization library: 1.015 Size is 100000004
Time using direct calls to save in a loop: 0.36 Size is 100000000
Time using direct call to save_array: 0.25 Size is 100000000

Run #3
Time using serialization library: 1.078 Size is 100000004
Time using direct calls to save in a loop: 0.359 Size is 100000000
Time using direct call to save_array: 0.234 Size is 100000000

Using MSVC-8.0 I get:

Run #2
Time using serialization library: 1.5 Size is 100000004
Time using direct calls to save in a loop: 0.484 Size is 100000000
Time using direct call to save_array: 0.235 Size is 100000000

Run #3
Time using serialization library: 1.594 Size is 100000004
Time using direct calls to save in a loop: 0.516 Size is 100000000
Time using direct call to save_array: 0.234 Size is 100000000

Using gcc-4.0.2 I get:

Run #2
Time using serialization library: 0.547 Size is 100000004
Time using direct calls to save in a loop: 0.344 Size is 100000000
Time using direct call to save_array: 0.25 Size is 100000000

Run #3
Time using serialization library: 0.547 Size is 100000004
Time using direct calls to save in a loop: 0.343 Size is 100000000
Time using direct call to save_array: 0.251 Size is 100000000

This is on a 2.00GHz Pentium M on Windows XP.

Furthermore, it's not a fair comparison unless you first measure the number of bytes you have to save so you can preallocate the buffer. In general the only way to do that is with a special counting archive, so you have to account for the time taken up by the counting. Of course we did that test too. The code and test results are attached.

In case it isn't obvious to you by now, Matthias Troyer is a world-class expert in high performance computing. You don't get to be a recognized authority in that area without developing the ability to create tests that accurately measure performance. You also develop some generalized knowledge about what things will lead to slowdowns and speedups. It's really astounding that you manage to challenge every assertion he makes in a domain where he is an expert and you are not, especially in a domain with so many subtle pitfalls waiting for the naive tester.
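For the record, the counting archive idea is nothing exotic. A minimal sketch of its primitive side (invented name; a real one also has to account for tracking, versioning, NVP wrappers and so on) just adds up bytes so a buffer can be preallocated before the real pass:

#include <cstddef>
#include <cstdio>
#include <string>

// Illustrative only: a "counting" primitive with the same save surface,
// which records how many bytes a real archive would have written.
class counting_oprimitive
{
public:
    counting_oprimitive() : total(0) {}

    template<class T>
    void save(const T &) { total += sizeof(T); }

    template<class T>
    void save_array(const T *, std::size_t n) { total += n * sizeof(T); }

    void save(const std::string & s) { total += s.size(); }
    void save_binary(const void *, std::size_t count) { total += count; }

    std::size_t size() const { return total; }
private:
    std::size_t total;
};

int main()
{
    counting_oprimitive count;
    double data[1000] = { 0 };
    count.save_array(data, 1000);   // 8000 bytes on a typical platform
    count.save(42);                 // plus sizeof(int)
    std::printf("bytes needed: %lu\n", (unsigned long)count.size());
    return 0;
}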
a) the usage of save_array does not have a huge effect on performance. It IS measurable. It seems that it saves about 1/3 the time over using a loop of saves in the best case. (1)
In the best case, even with your flawed test, it's a factor of 2 as shown above.
b) In the worst case, it's even slower than a loop of saves!!! (2) and even slower than the raw serialization system (3)
That result is completely implausible. If you can get someone else to reproduce it using a proper test protocol I'll be *quite* impressed.
c) the overhead of the serialization library isn't too bad. It does show up when doing 100M characters one by one, but generally it doesn't seem to be a big issue.
In my view, it does support my contention that implementing save_array - regardless of how it is in fact implemented - represents a premature optimization. I suspect that the net benefit in the kind of scenario you envision using it will be very small.
Obviously, this test raises more questions than it answers
Like what questions?

David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
Furthermore, it's not a fair comparison unless you first measure the number of bytes you have to save so you can preallocate the buffer. In general the only way to do that is with a special counting archive, so you have to account for the time taken up by the counting. Of course we did that test too. The code and test results are attached.
Without seeing the implementation of binary_oprimitive you plan to use I can only speculate what would be the closest test. Assuming that performance is an issue, I wouldn't expect you to use the current binary_oarchive, which is based on stream i/o. So if that's an important factor then it shouldn't be used for benchmarking. I presume that is why Matthias chose not to use it. On the other hand it's not clear why one would choose to use a buffer based on std::vector<char> for this purpose either. I chose an implementation which I thought would be closest to the one that would actually end up being used for a network protocol. The question is what is the time difference between one invocation of save_binary with data N bytes long vs N invocations of save_binary 1 byte long. That is really all that is being measured here. So using an implementation of save_binary based on stream write isn't really very interesting unless one is really going to use that implementation. Of course I don't really know if you are going to do that - I just presumed you weren't.
In case it isn't obvious to you by now, Matthias Troyer is a world-class expert in high performance computing. You don't get to be a recognized authority in that area without developing the ability to create tests that accurately measure performance. You also develop some generalized knowledge about what things will lead to slowdowns and speedups. It's really astounding that you manage to challenge every assertion he makes in a domain where he is an expert and you are not, especially in a domain with so many subtle pitfalls waiting for the naive tester.
wow - well the benchmark was posted and I took that as an indication that it was ok to check it out. Sorry about that - just go back to the std::vector<char> implementation of the buffer and we'll let it go at that.
a) the usage of save_array does not have a huge effect on performance. It IS measurable. It seems that it saves about 1/3 the time over using a loop of saves in the best case. (1)
In the best case, even with your flawed test, it's a factor of 2 as shown above.
which is a heck of a lot less than 10x
b) In the worst case, it's even slower than a loop of saves!!! (2) and even slower than the raw serialization system (3)
That result is completely implausible. If you can get someone else to reproduce it using a proper test protocol I'll be *quite* impressed.
Well, at least we can agree on that. We've corrected the benchmark and made a few more runs. The anomaly above disappears and things still vary, but they don't change all that much. BTW, the program has a value type which can be set to either char or double, which tests different primitives. If the results the rest of us are showing are way different from yours, that might be an explanation.
c) the overhead of the serialization library isn't too bad. It does show up when doing 100M characters one by one, but generally it doesn't seem to be a big issue.
In my view, it does support my contention that implementing save_array - regardless of how it is in fact implemented - represents a premature optimization. I suspect that the net benefit in the kind of scenario you envision using it will be very small.
Obviously, this test raises more questions than it answers
Like what questions?
a) Like the anomaly above - which I don't think is an issue anymore. b) Will the current stream-based implementation of binary_oarchive be used? Or would it be substituted for a different one? c) What would the results be for the actual archive you plan to use? Robert Ramey

"Robert Ramey" <ramey@rrsd.com> writes:
David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
Furthermore, it's not a fair comparison unless you first measure the number of bytes you have to save so you can preallocate the buffer. In general the only way to do that is with a special counting archive, so you have to account for the time taken up by the counting. Of course we did that test too. The code and test results are attached.
Without seeing the implementation of binary_oprimitive you plan to use I can only speculate what would be the closest test.
?? We're not hiding any code. The code posted compiles as is.
Assuming that performance is an issue, I wouldn't expect you to use the current binary_oarchive, which is based on stream i/o. So if that's an important factor then it shouldn't be used for benchmarking.
Did you look at the attached test code at all?
I presume that is why Matthias chose not to use it. On the other hand it's not clear why one would choose to use a buffer based on std::vector<char> for this purpose either. I chose an implementation which I thought would be closest to the one that would actually end up being used for a network protocol.
The question is what is the time difference between one invocation of save_binary with data N bytes long vs N invocations of save_binary 1 byte long. That is really all that is being measured here. So using an implementation of save_binary based on stream write isn't really very interesting unless one is really going to use that implementation. Of course I don't really know if you are going to do that - I just presumed you weren't.
No we are not. But as we have said many times we are not planning to copy the bytes anywhere. We are going to point MPI at the bytes and let the network hardware send them directly over the wire. We are supplying benchmark figures for binary_archive because it's presumably a case that you care about and understand. The MPI archives will do something completely different, but with similar performance characteristics.
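To make "point MPI at the bytes" concrete, here is a rough sketch (invented class name, doubles only, error handling omitted - not the archives Matthias has actually written): the array hook hands the caller's own storage straight to MPI_Send, so there is no intermediate buffer and no copy at all.

#include <mpi.h>
#include <cstddef>
#include <vector>

// Illustrative MPI-backed primitive: save_array forwards the user's
// storage directly to MPI_Send.
class mpi_oprimitive
{
public:
    mpi_oprimitive(int dest, int tag, MPI_Comm comm)
        : dest_(dest), tag_(tag), comm_(comm) {}

    void save_array(const double * p, std::size_t n)
    {
        // MPI-1 takes a non-const buffer pointer, hence the const_cast
        MPI_Send(const_cast<double *>(p), static_cast<int>(n),
                 MPI_DOUBLE, dest_, tag_, comm_);
    }
private:
    int dest_;
    int tag_;
    MPI_Comm comm_;
};

int main(int argc, char ** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<double> v(1000000, 1.0);
    if (size >= 2) {
        if (rank == 0) {
            mpi_oprimitive ar(1, 0, MPI_COMM_WORLD);
            ar.save_array(&v[0], v.size());   // one send for the whole array
        } else if (rank == 1) {
            MPI_Status status;
            MPI_Recv(&v[0], static_cast<int>(v.size()), MPI_DOUBLE,
                     0, 0, MPI_COMM_WORLD, &status);
        }
    }
    MPI_Finalize();
    return 0;
}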
In case it isn't obvious to you by now, Matthias Troyer is a world-class expert in high performance computing. You don't get to be a recognized authority in that area without developing the ability to create tests that accurately measure performance. You also develop some generalized knowledge about what things will lead to slowdowns and speedups. It's really astounding that you manage to challenge every assertion he makes in a domain where he is an expert and you are not, especially in a domain with so many subtle pitfalls waiting for the naive tester.
wow - well the benchmark was posted and I took that as an indication that it was ok to check it out.
Absolutely it's ok to check it out. Please ask questions if you don't understand something. Please let us help you. That said, your willingness to casually label Matthias' work "a classic case of premature optimization" is really appalling. That's something you might do with a greenhorn novice who has a lot to learn about optimization, but to someone with Matthias' distinction it is inappropriate. Matthias says he doesn't care about whether he is perceived as credible but I have a hard time not being offended on his behalf. As a practical matter, it seems as though you are making it unreasonably difficult to demonstrate anything to your satisfaction. Matthias and I (as far as I am able as a non-expert) are happy to explain the basic facts of performance and large data sets and to help you understand how these things work, but Matthias' credentials ought to at least exempt us from having to argue with you about the validity of his tests, and earn him the right to be treated with respect.
Sorry about that - just go back to the std::vector<char> implementation of the buffer and we'll let it go at that.
I don't understand what you mean.
a) the usage of save_array does not have a huge effect on performance. It IS measurable. It seems that it saves about 1/3 the time over using a loop of saves in the best case. (1)
In the best case, even with your flawed test, it's a factor of 2 as shown above.
which is a heck of a lot less than 10x
First of all, as demonstrated by Ian, your test is fatally flawed, so it means nothing that it's a factor of 2 rather than a factor of 10. Secondly, to anyone who cares about performance, even a factor of two would be a cause for orgiastic and debauched celebration. A factor of two performance improvement is rarely available as low-hanging fruit.
b) In the worst case, it's even slower than a loop of saves!!! (2) and even slower than the raw serialization system (3)
That result is completely implausible. If you can get someone else to reproduce it using a proper test protocol I'll be *quite* impressed.
Well, at least we can agree on that. We've corrected the benchmark and made a few more runs. The anomaly above disappears; results still vary, but not by all that much.
?? With the bug corrected, using msvc-8.0, I get

Run #2:
Time using serialization library: 4.297            Size is 100000004
Time using direct calls to save in a loop: 1.766   Size is 100000000
Time using direct call to save_array: 0.296        Size is 100000000

Run #3:
Time using serialization library: 4.328            Size is 100000004
Time using direct calls to save in a loop: 1.781   Size is 100000000
Time using direct call to save_array: 0.281        Size is 100000000

These show 15x speedups.
BTW, the program has a value_type which can be set to either char or double, which tests different primitives. If the results the rest of us are showing are way different from yours
I can't understand what you're trying to say. The code you posted has the value_type as char, so that's what I tested. Of course doubles are faster to write individually as they are forced into a more favorable alignment. However, we're still talking about a factor of 2x.
that might be an explanation.
c) the overhead of the serialization library isn't too bad. It does show up when doing 100M characters one by one, but generally it doesn't seem to be a big issue.
In my view, it does support my contention that implementing save_array - regardless of how it is in fact implemented - represents a premature optimization. I suspect that the net benefit in the kind of scenario you envision using it will be very small.
Obviously, this test raises more questions than it answers
Like what questions?
a) Like the anomaly above - which I don't think is an issue anymore b) Will the current stream-based implementation of binary_oarchive be used?
Used where? As we have stated many times, we don't plan to do *any* copying in memory for MPI serialization, so we wouldn't be writing to a stream, which has a stream buffer and thus necessitates copying.
or would a different one be substituted for it?
c) What would the results be for the actual archive you plan to use?
If we serialized every double through MPI individually rather than using a single MPI array send call, it would be a factor of at least 1000x. While network bandwidth is the same as memory bandwidth in fast parallel systems, network latency is much higher than memory latency (about 3K CPU cycles), and you pay that price for each individual send.

If we copied into a buffer first -- and remember, we can't copy into a preallocated buffer for the entire batch of data that needs to be sent because there isn't enough memory per CPU to copy the data -- we'd pay the cost of overflow checks on each individual write into the buffer (similar to what happens with streams) plus an additional 2x speed penalty just for copying all the data to memory before sending it over the wire (memory bandwidth being equal to net bandwidth).

For MPI serialization, in our application, there really is no alternative to sending large data sets as single batches.

Based on past experience, I would expect you to challenge the claims in the three paragraphs above and demand benchmarks that prove their validity. Then, I would expect you to challenge the validity of the tests. I really hope you will violate my expectations this time, since it would be a waste of your time as well as ours. -- Dave Abrahams Boost Consulting www.boost-consulting.com
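For readers unfamiliar with MPI, the difference being described comes down to something like the sketch below; the element count, tag and value type are arbitrary assumptions for illustration, not part of any proposed archive:

// Illustrative only: sending N doubles from rank 0 to rank 1.
#include <mpi.h>
#include <vector>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1000000;
    std::vector<double> data(N, 3.14);

    if (rank == 0) {
        // One array send: the per-message latency is paid once.
        MPI_Send(&data[0], N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

        // Element-by-element alternative: the same latency is paid N times
        // (the receiver would need N matching receives as well).
        // for (int i = 0; i < N; ++i)
        //     MPI_Send(&data[i], 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        MPI_Recv(&data[0], N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}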

"Robert Ramey" <ramey@rrsd.com> writes:
To summarize how we arrived here. =================================
<snip>
e) So it has been proposed binary_iarchive be re-implemented in the following way
iarchive - containing default implementation of load_array
binary_iarchive - ? presumably contains implementation of load_array in terms of the currently defined load_binary
It's not clear whether all archives would be modified in this way or just binary_iarchive.
This is extremely discouraging. After I stated many times that our design had been changed so as NOT to modify any code in the serialization library, after we put the array-optimized archives in a separate sub-namespace so that they could live alongside the existing ones in the library, after I offered to put all of the code in some remote part of Boost not associated with the serialization library, you state that we are proposing to change the serialization library code. It might be possible to attribute most of the other misapprehensions, misstatements, and gratuitous and insulting peremptory dismissals in your post to cluelessness or lack of attention, but it's really hard to understand how a claim that we propose to change the library could be made in good faith. It appears to be the sort of "when did you stop beating your wife?" response that injects a false presumption into the conversation and puts the other party at an unfair disadvantage. You stated on 19 Nov. we would start with a clean slate. If you've changed your mind, please let us know now; it would certainly be a waste of time to carry on any further discussion if it's going to go this way. If we've misunderstood your posting, we'd very much appreciate an explanation of what you do mean. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
To summarize how we arrived here. =================================
<snip>
e) So it has been proposed binary_iarchive be re-implemented in the following way
iarchive - containing default implementation of load_array
binary_iarchive - ? presumably contains implementation of load_array in terms of the currently defined load_binary
It's not clear whether all archives would be modified in this way or just binary_iarchive.
This is extremely discouraging. After I stated many times that our design had been changed so as NOT to modify any code in the serialization library, after we put the array-optimized archives in a separate sub-namespace so that they could live alongside the existing ones in the library, after I offered to put all of the code in some remote part of Boost not associated with the serialization library, you state that we are proposing to change the serialization library code.
I was referring to:
archive/array/binary_iarchive.hpp
..................................
class binary_iarchive
  : public array::iarchive<
        array::binary_iarchive
      , archive::binary_iarchive_impl<binary_iarchive>
    >
{
    template <class S>
    binary_iarchive(S& s, unsigned int flags)
      : binary_iarchive::iarchive_base(s, flags)
    {}

    // use the optimized load procedure for all fundamental types.
    typedef boost::is_fundamental<mpl::_> use_array_optimization;

    // This is how we load an array when optimization is appropriate.
    template <class ValueType>
    void load_array(ValueType* p, std::size_t n, unsigned int version)
    {
        this->load_binary(p, n * sizeof(ValueType));
    }
};
I'm also presuming - maybe incorrectly - that the serialization for something like a C++ array would contain code to invoke load_array. So if I re-compile an existing application and use it to load a binary_archive created under the previous version of the library, the function load_array will be invoked. The same data will have been previously serialized with a loop serializing each data member. Without knowing when load_array is invoked, one can't know for sure whether the old archives will in fact be readable. Verifying that previously existing archives of type binary_?archive will be readable will be a non-trivial task. Robert Ramey

On Nov 25, 2005, at 5:42 PM, Robert Ramey wrote:
David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
To summarize how we arrived here. =================================
<snip>
e) So it has been proposed binary_iarchive be re-implemented in the following way
iarchive - containing default implementation of load_array
binary_iarchive - ? presumably contains implementation of load_array in terms of the currently defined load_binary
It's not clear whether all archives would be modified in this way or just binary_iarchive.
This is extremely discouraging. After I stated many times that our design had been changed so as NOT to modify any code in the serialization library, after we put the array-optimized archives in a separate sub-namespace so that they could live alongside the existing ones in the library, after I offered to put all of the code in some remote part of Boost not associated with the serialization library, you state that we are proposing to change the serialization library code.
I was referring to:
archive/array/binary_iarchive.hpp
Look, Dave proposed it in a sub-directory and a corresponding sub-namespace 'array'. Matthias

"Robert Ramey" <ramey@rrsd.com> writes:
David Abrahams wrote:
"Robert Ramey" <ramey@rrsd.com> writes:
To summarize how we arrived here. =================================
<snip>
e) So it has been proposed binary_iarchive be re-implemented in the following way
iarchive - containing default implementation of load_array
binary_iarchive - ? presumably contains implementation of load_array in terms of the currently defined load_binary
It's not clear whether all archives would be modified in this way or just binary_iarchive.
This is extremely discouraging. After I stated many times that our design had been changed so as NOT to modify any code in the serialization library, after we put the array-optimized archives in a separate sub-namespace so that they could live alongside the existing ones in the library, after I offered to put all of the code in some remote part of Boost not associated with the serialization library, you state that we are proposing to change the serialization library code.
I was referring to:
archive/array/binary_iarchive.hpp
        ^^^^^
Note the directory element. This is a completely different file from any in the serialization library. Also note that I said up front that I was leaving out namespaces in my synopsis. Can I infer from this that you really had no idea that the code in this file was to go in namespace boost::archive::array? If it really looked to you as though we were proposing to change the library, in the face of all our statements to the contrary shouldn't you have at least asked us for an explanation?
..................................
class binary_iarchive
  : public array::iarchive<
        array::binary_iarchive
      , archive::binary_iarchive_impl<binary_iarchive>
    >
{
    template <class S>
    binary_iarchive(S& s, unsigned int flags)
      : binary_iarchive::iarchive_base(s, flags)
    {}

    // use the optimized load procedure for all fundamental types.
    typedef boost::is_fundamental<mpl::_> use_array_optimization;

    // This is how we load an array when optimization is appropriate.
    template <class ValueType>
    void load_array(ValueType* p, std::size_t n, unsigned int version)
    {
        this->load_binary(p, n * sizeof(ValueType));
    }
};
I'm also presuming - maybe incorrectly - that the serialization for something like a C++ array would contain code to invoke load_array.
That code is in array::iarchive. That should have been clear from the comment on its load_override member function:

    // Load T[N] using load_array
    template<class T, std::size_t N>
    void load_override(T(&x)[N], unsigned int version);
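A minimal sketch of how such an overload can dispatch at compile time follows; the is_fundamental test is an illustrative assumption here -- the actual proposal lets each archive supply its own use_array_optimization predicate instead:

// Illustrative sketch only, not the proposal's metaprogram: route an array
// load either to the archive's bulk load_array or to an element loop.
#include <boost/mpl/bool.hpp>
#include <boost/type_traits/is_fundamental.hpp>
#include <cstddef>

template<class Archive, class T>
void load_array_impl(Archive& ar, T* p, std::size_t n,
                     unsigned int version, boost::mpl::true_)
{
    ar.load_array(p, n, version);        // one bulk read
}

template<class Archive, class T>
void load_array_impl(Archive& ar, T* p, std::size_t n,
                     unsigned int /*version*/, boost::mpl::false_)
{
    for (std::size_t i = 0; i < n; ++i)  // fall back to per-element loads
        ar >> p[i];
}

template<class Archive, class T, std::size_t N>
void load_override_sketch(Archive& ar, T (&x)[N], unsigned int version)
{
    // Hypothetical selection criterion; only the chosen overload is
    // instantiated, so archives without load_array still compile.
    typedef boost::mpl::bool_<boost::is_fundamental<T>::value> use_opt;
    load_array_impl(ar, &x[0], N, version, use_opt());
}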
So if I re-compile an existing application and use it to load a binary_archive created under the previous version of the library, the function load_array will be invoked. The same data will have been previously serialized with a loop serializing each data member. Without knowing when load_array is invoked, one can't know for sure whether the old archives will in fact be readable. Verifying that previously existing archives of type binary_?archive will be readable will be a non-trivial task.
This can't really be a serious objection, can it? A program would have to be changed to #include different files and use different namespaces in order for any difference to be observed. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Despite saying I would no longer participate in this thread, I re-read Robert's recent list of queries and was struck by the thought that most, if not all, of it is based on accidental misreadings of some previous posts, combined with some completely understandable misunderstandings about performance issues that are well-known in the HPC domain (since clarified elsewhere by some benchmarks). So, in the spirit of perhaps rescuing the situation, I reply again: Robert Ramey wrote:
To summarize how we arrived here. =================================
a) Matthias augmented binary_?archive to replace element-by-element serialization of primitive types with save/load_binary for C++ arrays, std::vector and std::valarray. This resulted in a 10x speedup of the serialization process.
Right. If you see some function in the profile that is called 10,000 times and is a bottleneck, which is better? Optimize that function a bit and get a factor 2 speedup (see http://lists.boost.org/Archives/boost/2005/11/97156.php), or change the calling sequence so that that function is only called once, and get a factor 10 speedup?
b) From this it has been concluded that binary archives should be enhanced to provide this facility automatically and transparently to the user.
Right. From the point of view of a user, this is completely analogous to, say, a standard library implementor optimizing std::copy(container.begin(), container.end(), ostream_iterator<T>(stream)), for example. If he/she did this, would the user be interested in *disabling* that optimization? What would be the point? But note, 'should' here is really 'could'. The current proposal from David Abrahams explicitly does *not* modify any of the existing archives.
c) The structure of the library and the documentation suggest that the convenient way to do this is to specify an overload for each combination of archive/type which can benefit from special treatment.
As I understand it (someone please correct me if I got this wrong), the core idea of the new proposal (from http://lists.boost.org/Archives/boost/2005/11/96923.php and followups) is to provide a single point of customization, that *serialization function authors* can utilize to serialize an array in one call. Of course making use of this hook is optional, but since it is also a good convenience function (it saves a couple of lines by avoiding having to manually code a loop), there isn't really any point to not use it. Note that a significant point of the proposal http://lists.boost.org/Archives/boost/2005/11/96923.php is that the possibility exists that this 'hook' is not part of the serialization library proper, but the point is *it must be globally accessible* and not specific to a particular archive.
From http://lists.boost.org/Archives/boost/2005/11/96923.php :
David Abrahams wrote:
| In an upcoming message I'm planning to start by describing the least
| intrusive design that could possibly work -- one that makes no changes
| at all to the existing serialization library. It's not a bad design,
| but it has a few drawbacks that I'd like to discuss. At that point it
| should become clear what I meant about "hijacking" the serialization
| library. Finally, I'll describe the smallest set of changes to the
| existing serialization library that would be needed to address those
| drawbacks.
|
| Just to state the obvious, I hope to convince you that those few
| changes are worth making. Of course, as the maintainer of
| Boost.Serialization, it's completely up to you whether to do so.

Moving on to your next point:
d) The above (c) is deemed inconvenient because it has been supposed that many archive classes will share a common implementation of load/save array. This would suggest that using (c) above, though simple and straightforward, will result in code repetition.
I think this must have resulted from a misunderstanding.

1. As far as I can tell, the proposal is completely consistent with (c) above. The question of how the save/load_array is actually dispatched to the archive is a detail that has relevance *only to archive implementors*. Obviously different dispatch mechanisms have different tradeoffs, as discussed elsewhere on this thread, but none of this should be visible to people who are not archive authors.

2. I don't recall seeing anyone suggest that different archive types would be able to share a common implementation of load/save array (except for the trivial case where an archive has no array support and instead uses a default implementation that serializes each element in a loop!). Can you cite where you saw this, so that someone can clarify it?

3. Without a single 'hook' to use when writing serialization functions, the only alternative is to specialize each serialization function that can make use of array optimizations *separately for each archive*. For example, an MPI archive might provide a helper function Save_MPI_Datatype(void* buffer, size_t count, MPI_Datatype type) and require serialization function authors to call this member. An XDR archive might provide a function SaveArray(T* Begin, T* End). With no cooperation between the authors of the archives on a common function for array serialization, the poor user will have to write two sets of serialization functions: one that calls Save_MPI_Datatype() and one that calls SaveArray(). I hope it is obvious to you that this situation is completely untenable!
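To make point 3 concrete, here is a rough sketch of what a single, archive-independent serialization function could look like; the name save_array_hook is an invented placeholder for whatever the globally accessible hook ends up being called, not an existing Boost function:

#include <cstddef>

// Default hook: element-by-element. An MPI or XDR archive author would
// overload or customize this to perform one bulk operation instead.
template<class Archive, class T>
void save_array_hook(Archive& ar, const T* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        ar << p[i];
}

class my_matrix_type
{
public:
    template<class Archive>
    void save(Archive& ar, const unsigned int /*version*/) const
    {
        ar << m_rows << m_cols;
        // One call; the archive decides whether this becomes an MPI send,
        // an XDR array, a raw save_binary, or the loop above.
        save_array_hook(ar, m_data, m_rows * m_cols);
    }

private:
    double*     m_data;   // contiguous storage, m_rows * m_cols elements
    std::size_t m_rows, m_cols;
};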
e) So it has been proposed binary_iarchive be re-implemented in the following way
iarchive - containing default implementation of load_array
binary_iarchive - ? presumably contains implementation of load_array in terms of the currently defined load_binary
This was part of the original (pre Nov 19) proposal and is no longer relevant.
It's not clear whether all archives would be modified in this way or just binary_iarchive.
? What specific implementation of load_array would you suggest for the other existing archive types?
The idea is that each type which can benefit from load_array can call it and the version of load_array corresponding to that particular archive will be invoked. This will require
i) the serialization function for types which can benefit from some load_array function would call it.
Right. The set of types that can benefit is very dependent on the archive, but in general it is anything that looks like a container.
ii) Only a small number of load_array functions would have to be written for each archive. So the number of special functions to be written would be one for each type which might use load_array and "one" for each archive.
Right.
Problems with the Design
========================
a) It doesn't address the root cause of "slow" performance of binary archives.
The main problem is that it doesn't address the cause of the 10x speedup. It's a classic case of premature optimization. The 10x speedup was based on a test program. For a C++ array, the test boils down to replacing 10,000 invocations of stream write(..) with one invocation of a stream write 10,000 times longer.
I think this issue has been covered elsewhere, including benchmarks. If you can see any remaining problems with the claim that the root cause of the poor performance is the multiple calls to the buffer write(), and even using a specialized buffer *does not significantly help*, then please say so. [snip]
I would be surprised if the 10x speedup still exists with this "buffered_archive".
As shown elsewhere, you are surprised.
note that for the intended application - MPI communication - some archive which doesn't use stream i/o has to be created anyway.
Right. In this case, the write() to the stream is replaced either by an immediate call to MPI_Send(data, size, ...), or by a mechanism to construct a derived MPI_Datatype that will essentially record a map of which memory locations need to be sent and pass this directly to MPI. In both cases, the cost of not utilizing array operations would not be 10x, but more like 1000x or more.
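For readers who have not seen that second mechanism, a bare-bones sketch of the MPI calls involved follows; this is illustration only, not code from any proposed archive, and a real archive would gather the blocks while walking the object being serialized:

// Describe two non-contiguous blocks of doubles by their absolute addresses
// and send them with a single call, without copying them into a buffer.
#include <mpi.h>

void send_two_blocks(double* a, int na, double* b, int nb, int dest)
{
    int      blocklens[2] = { na, nb };
    MPI_Aint displs[2];
    MPI_Get_address(a, &displs[0]);   // record where each block lives
    MPI_Get_address(b, &displs[1]);

    MPI_Datatype map;
    MPI_Type_create_hindexed(2, blocklens, displs, MPI_DOUBLE, &map);
    MPI_Type_commit(&map);

    // MPI_BOTTOM plus absolute displacements: the whole "map" goes in one send.
    MPI_Send(MPI_BOTTOM, 1, map, dest, 0, MPI_COMM_WORLD);

    MPI_Type_free(&map);
}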
b) re-implementation of binary_archive in such a way so as not to break existing archives would be an error-prone process. The switching between the new and old methods "should" result in exactly the same byte sequence. But it could easily occur that a small, subtle change might render archives created under the previous binary_archive unreadable.
Again, this is not part of the revised proposal. But is the binary serialization really *that* fragile, that this is a significant concern? It suggests that binary archives using the boost serialization lib would be really hard to write correctly (much harder than my experience suggests)!
c) The premise that one will save a lot of coding (see d) above) compared to the current method of overloading based on the pair of archive/type is overly optimistic. This is explained in Peter Dimov's post here:
I think this is based on the same misunderstanding as the first point (d) above? Anyway, what do you mean by "current method"? Can you describe how I should write my single serialization function (hopefully just one of them!) for my_matrix_type using the "current method"?
I'm aware this is speculative. I haven't investigated MPI, XDR and others enough to know how much code sharing is possible. It does seem that there will be no sharing with the "fast binary archive" of the previous submission. From the short descriptions of MPI I've seen on this list along with my cursory investigation of XDR, I'm doubtful that there is any sharing there either.
Again, I think this is based on the same misunderstanding. Archives will typically not share implementations of array processing.
Conclusions
===========
a) The proposal suffers from "premature optimization". A large amount of design effort has been expended on areas which are likely not the source of observed performance bottlenecks.
Not true, as already explained here and elsewhere.
b) The proposal suffers from "over-generalization". The attempt to generalize results in a much more complex system. Such a system will result in a net loss of conceptual integrity and implementation transparency. The claim that this generalization will actually result in a reduction of code is not convincing.
I think this is based on the same misunderstanding as the first point (d) above?
c) By re-implementing a currently existing and used archive, it risks creating a maintenance headache for no real benefit.
Again, this is not part of the current proposal. This issue is not very important because most of the interesting uses for array optimizations are not based on archives currently in the serialization lib. Whether existing archives make use of array optimizations is a side-issue that can be discussed if/when the necessary array hooks actually exist.
Suggestions
===========
a) Do more work in finding the speed bottlenecks. Run a profiler. Make a buffer based non-stream based archive and re-run your tests.
Done, see
http://lists.boost.org/Archives/boost/2005/11/97166.php
http://lists.boost.org/Archives/boost/2005/11/97156.php
For the people here who have experience in this problem domain, the results are completely obvious. In hindsight it is equally obvious that someone from a different background would not be aware of this! Sorry.
b) Make your MPI, XDR and whatever archives. Determine how much opportunity for code sharing is really available.
I think Matthias already has a prototype MPI as well as XDR archive? But this point is again based on the misunderstanding of point (d) above. I doubt there is any possibility of code-sharing between the MPI and XDR load/save_array functions *at all*; the point is that I, as a user, want to be able to write just a single serialization function for my_matrix_type that will make use of whatever array optimizations the archive I am using can provide.
c) If you still believe your proposal has merit, make your own "optimized binary archive". Don't derive from binary_archive but rather from common_?archive or perhaps basic_binary_archive. In this way you will have a totally free hand and won't have to achieve consensus with the rest of us, which will save us all a huge amount of time.
I don't understand what this means. Regards, Ian

Ian McCulloch <ianmcc@physik.rwth-aachen.de> writes:
As I understand it (someone please correct me if I got this wrong), the core idea of the new proposal (from http://lists.boost.org/Archives/boost/2005/11/96923.php and followups) is to provide a single point of customization, that *serialization function authors* can utilize to serialize an array in one call. Of course making use of this hook is optional, but since it is also a good convenience function (it saves a couple of lines by avoiding having to manually code a loop), there isn't really any point to not use it.
Ian, I know you're trying to help, but please don't jump ahead to that conclusion. I would prefer that either a) Robert comes to that conclusion on his own or b) he understands and accepts the consequences (which I have not yet described) of not drawing that conclusion. I am trying to very carefully build understanding of those consequences, and making the assertion yourself is not going to convince anyone.
Note that a significant point of the proposal http://lists.boost.org/Archives/boost/2005/11/96923.php is that the possibility exists that this 'hook' is not part of the serialization library proper, but the point is *it must be globally accessible* and not specific to a particular archive.
There's really no need to make that point. Robert just needs to understand what he's getting into if he decides not to accept that, and the *technical* reasons for those consequences. So please, let's not push that conclusion on him. He's free to agree or disagree, as he pleases. -- Dave Abrahams Boost Consulting www.boost-consulting.com
participants (15):
- Caleb Epstein
- Daniel Wallin
- David Abrahams
- Felipe Magno de Almeida
- Ian McCulloch
- Jeff Flinn
- Manfred Doudar
- Matthew Vogt
- Matthias Troyer
- Paul A Bristow
- Pavel Vozenilek
- Peter Dimov
- Robert Ramey
- Tom Widmer
- troy d. straszheim