
I've been perusing the files you checked in, your example, and this list.

Summary
=======

First of all, a somewhat more complete narrative description of what the submission was intended to accomplish, and how it would change the way the user uses the library, would have been helpful. I'm going to summarize here what I think I understand about this. Please correct me if I get something wrong.

a) A new trait is created:

    template <class Archive, class Type>
    struct has_fast_array_serialization : public mpl::bool_<false> {};

b) New functions save_array and load_array are implemented in those archives which have the above trait set to true. In this case the following is added to the binary_iarchive.hpp file. The effect is that this trait will return true when a fundamental type is to be saved/loaded to a binary_iarchive.

    // specialize has_fast_array_serialization
    // the binary archive provides fast array serialization for all fundamental types
    template <class Type>
    struct has_fast_array_serialization<binary_iarchive, Type> :
        public is_fundamental<Type>
    {};

c) A user would have a class such as:

    class my_class {
        valarray<int> m_a;
        template<class Archive>
        void save(Archive & ar, const unsigned int){
            ar << m_a;
        }
        ...
    };

d) In order for this to work, the serialization of data types which want to exploit bitwise serialization has to look like the following (taken from the valarray serialization). This treatment would have to be applied to the serialization of every such type.

    template<class Archive, class U>
    inline typename boost::disable_if<
        boost::archive::has_fast_array_serialization<Archive, U> >::type
    save(Archive & ar, const std::valarray<U> & t, const unsigned int /* file_version */){
        const boost::archive::container_size_type count(t.size());
        ar << BOOST_SERIALIZATION_NVP(count);
        for(std::size_t i = 0; i < t.size(); ++i)
            ar << t[i];
    }

    // with fast array serialization
    template<class Archive, class U>
    inline typename boost::enable_if<
        boost::archive::has_fast_array_serialization<Archive, U> >::type
    save(Archive & ar, const std::valarray<U> & t, const unsigned int /* file_version */){
        const boost::archive::container_size_type count(t.size());
        ar << BOOST_SERIALIZATION_NVP(count);
        if(count)
            ar.save_array(boost::detail::get_data(t), t.size());
    }
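To make the dispatch in d) concrete, here is a minimal, self-contained reduction of it - my own toy stand-ins, not code from the submission - showing how the trait steers overload selection between the element-wise and the block version:

    #include <boost/utility/enable_if.hpp>
    #include <boost/type_traits/is_fundamental.hpp>
    #include <boost/mpl/bool.hpp>
    #include <iostream>
    #include <valarray>

    struct toy_binary_oarchive {};   // stand-in for binary_oarchive
    struct toy_text_oarchive {};     // stand-in for a text archive

    // default: no fast array serialization
    template<class Archive, class Type>
    struct has_fast_array_serialization : boost::mpl::false_ {};

    // the "binary" archive claims fast array serialization for fundamental types
    template<class Type>
    struct has_fast_array_serialization<toy_binary_oarchive, Type>
        : boost::is_fundamental<Type> {};

    // element-by-element version, chosen when the trait is false
    template<class Archive, class U>
    typename boost::disable_if<has_fast_array_serialization<Archive, U> >::type
    save(Archive &, const std::valarray<U> & t){
        std::cout << "element-wise save of " << t.size() << " items\n";
    }

    // block version, chosen when the trait is true
    template<class Archive, class U>
    typename boost::enable_if<has_fast_array_serialization<Archive, U> >::type
    save(Archive &, const std::valarray<U> & t){
        std::cout << "single save_array call for " << t.size() << " items\n";
    }

    int main(){
        std::valarray<int> a(10);
        toy_binary_oarchive ba;
        toy_text_oarchive ta;
        save(ba, a);   // picks the block version
        save(ta, a);   // picks the element-wise version
        return 0;
    }

The point is simply that for each (archive, element type) pair exactly one of the two overloads has a valid return type, so the choice is made silently at compile time.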
Presumably this treatment would be applied to:

    std::vector
    ublas::vector
    ublas::matrix
    mtl::vector
    blitz::array
    custom_lib::fast_matrix
    ...

as well as others. For built-in arrays, the core library is tweaked to include code similar to the above.

Some Observations
=================

Immediately, the following come to mind.

a) I'm not sure about the portability of enable_if. Would this not break the whole serialization system for those compilers which don't support it?

b) What is the point of save_array? Why not just invoke save_binary directly?

c) The same could be said for built-in arrays - just invoke save_binary.

d) There is no provision for NVP in the non-binary version above, while in the binary version there is NVP around count. Presumably these are oversights.

e) The whole thing isn't obvious and is hard to follow. It couples the implementation code in i/oserializer.hpp to a specific kind of archive, adding another dimension to be considered while understanding this thing.

f) What about bitwise-serializable types which aren't fundamental? That is, structures which don't have things like pointers in them. They present the same opportunity but aren't addressed. If this is a good idea for fundamental types, someone is going to want to do it for them as well - which would open up some new problems.

g) I don't see endian-ness addressed anywhere. I believe that protocols such as XDR and MPI are designed to transmit binary data between heterogeneous machines. Suppose I save an array of ints as a sequence of raw bits on an Intel-type machine. Then I use load_binary to reload the same sequence of bits on a Sparc-based machine. I won't get back the same data values. So either the method will have to be limited to collections of bytes, or some extra machinery would have to be added to apply the endian translation conditionally, depending on whether the source and target machines match. (A small illustration of this appears at the end of these observations.)

h) Similar issues confront bitwise serialization of floats and doubles. I believe the "canonical" format for floats/doubles is ieee 80 bit. (I think that's what XDR uses - I could be wrong.) I believe that many machines store floats as 32-bit words and doubles as 64-bit words. I doubt they are all guaranteed to have the same format as far as exponent, sign and representation of the value. So that's something else to be addressed. Of course endian-ness plays into this as well.

i) I looked at the "benchmark" results. I notice that they were run with -O2 on the gcc compiler. Documentation for the gcc compiler command line specifies that this optimization level does not enable automatic inlining of small functions. That is a crucial optimization for the serialization library to be effective. The library is written with the view that compilers will collapse inline code when possible, but with the gcc compiler this happens only when the -O3 optimization switch is used. Furthermore, with this compiler it might be necessary to also specify the max-inline-insns-recursive-auto switch to gain maximum performance on boost-type code. This latter point is still under investigation.

j) My own rudimentary benchmark (which was posted on this list) used 1000 instances of a structure which contained all C++ primitive data types plus a std::string made up of random characters. It was compiled as a boost test and built with bjam, so it used the standard boost options for release mode. It compared timings against raw stream i/o. Timings for binary_archive and standard stream i/o were comparable. I'm still working on this test. The problem is that standard stream i/o uses text output/input. Of course no one for whom performance is an issue would do this, so I have to alter my timing test to use binary i/o to the standard stream as a comparison. But for now, I'm comfortable in asserting that there is not a large performance penalty in using serialization as opposed to "rolling your own". As an aside, the test executable doing the same test for 3 different types of archives and all primitive data types only came to 238K, so there isn't a significant code bloat issue either.

k) Somehow I doubt that this archive type has been tested with the full serialization test suite. Instructions for doing so are in the documentation, and the serialization/test directory includes batch files for doing this with one's own archives. Was this done? What were the results? With which compiler? It costs nothing to do this.
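To illustrate the point in g), here is a small stand-alone program - purely illustrative, not taken from the submission - showing that the raw byte image of an int differs between little-endian and big-endian machines:

    #include <cstdio>
    #include <cstring>

    int main(){
        unsigned int x = 0x01020304;
        unsigned char bytes[sizeof(x)];
        std::memcpy(bytes, &x, sizeof(x));
        // On a little-endian (Intel) machine this prints: 04 03 02 01
        // On a big-endian (Sparc) machine it prints:      01 02 03 04
        // so a byte-for-byte copy between the two reloads different values.
        for(unsigned int i = 0; i < sizeof(x); ++i)
            std::printf("%02x ", bytes[i]);
        std::printf("\n");
        return 0;
    }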
end of observations
===================

Admittedly, this is only a cursory examination, but it's more than enough to make me skeptical of the whole idea. If you want, I could expand upon my reasons for this view, but I think they should be obvious.

Now, if someone feels differently and wants to implement such a thing, they have my full support. There is no need to modify the core library, and no benefit - performance or otherwise - in doing so. The following shows how to go about it. For purposes of this exposition, I am going to limit myself to how one would go about crafting a system similar to the one submitted. That is, I will not address concerns such as binary portability, as they are not addressed in the submission as I see it. I'm assuming that the only issue is how best to arrange things so that save_binary/load_binary are invoked for contiguous collections of fundamental types.

Suggestions
===========

I do see the utility and merit in what you're trying to do here - finally. Honestly, it just wasn't obvious from the initial package. So here is how I would have gone about it. I previously posted a demo "fast_oarchive.hpp"; I will expand upon it here.

The archive class tree would look like the following (setting aside polymorphic archives). I would envision:

    basic_archive
      basic_oarchive
        common_oarchive
          basic_binary_oarchive
            binary_oarchive
              fast_oarchive_impl
                MPI_oarchive
                XDR_oarchive
                ...

fast_oarchive_impl is an adaptor which can be applied to any legal archive. The current submission derives from binary_?archive. If that is all that is required, then it doesn't have to be a template; it could just as well be derived directly from binary_oarchive. It includes overloads for the following:

    template<class T>
    void save_array(T & t, collection_size_t count){
        // this version not certified for more complex types !!!
        BOOST_STATIC_ASSERT(boost::is_primitive<T>::value);
        // or pointers either !!!
        BOOST_STATIC_ASSERT(! boost::is_pointer<T>::value);
        ...
        // if T is NOT a fundamental type - some mpl magic required here
        // forward call to base class
        this->Base::save_override(t, 0);
        // else -
        *(this->This()) << make_nvp("count", t.size() * sizeof(T));
        *(this->This()) << make_nvp(make_binary_object(t.size() * sizeof(T), & t));
        // note - the nvp wrappers are probably not necessary if we're
        // only going to apply this to binary archives.
    }

    // here's a way to do it for all vectors in one shot
    template<class T>
    void save_override(const std::vector<T> & t, int){
        save_array(t, t.size());
    }

    ... boost::valarray? ...

    // for C++ built-in arrays
    template<typename T, int N>
    void save_override(const T (& t)[N], int){
        save_array(t, sizeof(t));
    }
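The "mpl magic" referred to in save_array above need be nothing more than compile-time tag dispatch. Here is one self-contained way to do it - a sketch only, using hypothetical toy archive classes rather than the real library interface:

    #include <boost/mpl/bool.hpp>
    #include <boost/type_traits/is_fundamental.hpp>
    #include <cstddef>
    #include <iostream>

    struct toy_base_oarchive {
        template<class T>
        void save_override(const T &, int){ std::cout << "element\n"; }
        void save_binary(const void *, std::size_t n){
            std::cout << "block of " << n << " bytes\n";
        }
    };

    struct toy_fast_oarchive : toy_base_oarchive {
        // fundamental element type: one save_binary call for the whole block
        template<class T>
        void save_array(const T * t, std::size_t count, boost::mpl::true_){
            save_binary(t, count * sizeof(T));
        }
        // anything else: fall back to element-by-element serialization
        template<class T>
        void save_array(const T * t, std::size_t count, boost::mpl::false_){
            while(count--)
                toy_base_oarchive::save_override(*t++, 0);
        }
        template<class T>
        void save_array(const T * t, std::size_t count){
            // the branch is resolved at compile time by the tag argument
            save_array(t, count, boost::mpl::bool_<boost::is_fundamental<T>::value>());
        }
    };

    int main(){
        int a[4] = {1, 2, 3, 4};
        toy_fast_oarchive ar;
        ar.save_array(a, 4);   // dispatches to the save_binary version
        return 0;
    }

Only the overload selected by the tag is instantiated for a given element type, so the branch costs nothing at runtime.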
It might or might not contain similar implementations for std::vector, ublas::vector, ublas::matrix, mtl::vector, blitz::array, custom_lib::fast_matrix, etc. For now assume that it does - later we'll relax that assumption.

Derived from fast_oarchive_impl are your archive classes for specific archive types - MPI_oarchive, etc. These would handle the specific features of each particular archive. If it turns out that MPI can only apply the save_binary optimization to types that are no more than a character wide (perhaps for endian issues), then it would include something like:

    template<class T>
    void save_override(const std::vector<T> & t, int){
        // suppose endian issues preclude save_binary
        if(sizeof(T) > 1){
            // skip over the save_binary optimization
            this->binary_oarchive::save_override(t, 0);
        }
        else{
            // just forward to the base class
            this->fast_oarchive_impl::save_override(t, 0);
        }
    }

So the net result is:

a) save_binary optimizations are invoked from the fast_oarchive_impl class. They only have to be specified once, even though they are used in more than one variation of the binary archive. That is, if there are N types to be subjected to the treatment by M archives, there are only N overrides - regardless of the size of M.

b) save_binary optimizations can be overridden for any particular archive type. (It's not clear to me how the current submission would address such a situation.)

c) There is no need to alter the current library.

d) It doesn't require that anything in the current library be conditioned on what kind of archive is being used. Insertion of such a coupling would be extremely unfortunate and would create a lot of future maintenance work, all for a special-purpose feature - especially since it is absolutely unnecessary.

e) The implementation above could easily be improved to be resolved totally at compile time. Built with a high-quality compiler (with the appropriate optimization switches set), this would result in the fastest possible code.

f) All code for save_binary is in one place - within fast_oarchive_impl. If fast_oarchive_impl is implemented as a template, it could be applied to any existing archive class - even text and xml. I don't know if there would be any interest in doing that, but it's not inconceivable. Note also that including the save_binary optimizations for all of std::vector, ublas::vector, ublas::matrix, mtl::vector, blitz::array, custom_lib::fast_matrix, etc. doesn't require that the headers for those libraries be included: the code in each override isn't required until the template is instantiated. So there wouldn't be any "header bloat"(tm).

g) Now, f above could also be seen as a disadvantage. That is, it might seem better to let everyone involved in the serialization of a particular collection keep his stuff separate. There are a couple of options here I will sketch out.

i) One could make a new trait is_bitwise_serializable whose default value is false. For each collection type one would specialize this like:

    template<class T>
    struct is_bitwise_serializable<vector<T> > {
        ... is_fundamental<T> ...
        get_size(){
            // override the default, which is sizeof(T)
            ..
        }
    };
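Filled in, such a trait might look something like the following - again just a sketch, and the get_size/get_data helper names are hypothetical stand-ins for whatever a real version would provide:

    #include <boost/mpl/bool.hpp>
    #include <boost/type_traits/is_fundamental.hpp>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // by default, nothing is bitwise serializable
    template<class T>
    struct is_bitwise_serializable : boost::mpl::false_ {};

    // a contiguous vector is bitwise serializable exactly when its element type
    // is fundamental; it also knows its raw size and starting address
    template<class T>
    struct is_bitwise_serializable<std::vector<T> > : boost::is_fundamental<T> {
        static std::size_t get_size(const std::vector<T> & t){
            return t.size() * sizeof(T);
        }
        static const void * get_data(const std::vector<T> & t){
            return t.empty() ? 0 : &t[0];
        }
    };

    int main(){
        std::vector<int> v(3, 7);
        std::cout << is_bitwise_serializable<std::vector<int> >::value << " "
                  << is_bitwise_serializable<std::vector<int> >::get_size(v) << "\n";
        // prints "1 12" on a machine with 4-byte ints
        return 0;
    }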
Now fast_oarchive_impl would contain something like:

    // here's a way to do it for all such types in one shot
    template<class T>
    void save_override(const T & t, int){
        // if T is NOT bitwise serializable - insert mpl magic required here
        // forward call to base class
        this->Base::save_override(t, 0);
        // else -
        *(this->This()) << make_nvp("count", t.size() * sizeof(T));
        *(this->This()) << make_nvp(make_binary_object(...get_size(), & t));
        // note - the nvp wrappers are probably not necessary if we're
        // only going to apply this to binary archives.
    }

which would implement the save_binary optimization for all types with the is_bitwise_serializable trait set. Of course, any class derived from fast_oarchive_impl could override this as before.

Note that this would address the situation whereby one has something like

    struct RGB {
        unsigned char red;
        unsigned char green;
        unsigned char blue;
    };

which certainly would be a candidate for the save_binary optimization - it would even be portable across otherwise binary-incompatible machines - but then there might be alignment issues. Also, if someone tracks an object of type RGB somewhere, then this would work differently on different archives - not a good thing. So this would require some careful documentation on how to use such a trait. This would move the information about the save_binary optimization out of the fast_oarchive_impl.hpp header file and into the header file for each particular type - arguably a better choice.

ii) Another option would be to implement differing serializations depending upon the archive type, so that we might have:

    template<class T>
    void save(fast_oarchive_impl & ar, const std::vector<T> & t, const unsigned int){
        // if T is a fundamental type or ....
        ar << t.size();
        ar.save_binary(t.size() * sizeof(T), t.data?());
    }

This would basically be a much simpler substitute for the "fast_archive_trait" proposed by the submission.

Wrap up
=======

I guess it should be pretty obvious that I believe that:

a) The applicability and utility of a save/load binary optimization are narrower than claimed.

b) The claims of performance improvement are suspect.

c) This implementation doesn't address all the issues that need to be addressed for something like this.

d) Such an idea could be implemented in a much simpler, more transparent, robust, and efficient manner.

e) Such "special" implementation features are easily accommodated by the current library by extension. There is no need to change any of the core library implementation or interface.

Finally, I have to comment on the way this has been proposed. Checking a diverse group of files into the CVS system on a separate branch is not convenient for most people. This would have been better done as a zip file uploaded to the vault. It should also have included some sort of document outlining what it's intended to do and how it does it. Had this been done, it's likely that many more people would have had a look at it and been able to comment. I'm sure the issues I've noted above would be apparent to a lot of people, and I wouldn't have had to spend a few hours preparing this critique.

If the above weren't bad enough, I've pretty much been forced to do what really amounts to a private consulting job, for free, for a specific application. I really can't do this anymore. It's one thing to point a user in the right direction (which often results in a tweak to the manual) or to incorporate a bug fix which some user has tracked down, fixed and submitted.
But to have to spend the time to critique something like this - whose problems should be obvious to most of us - is really unfair. I've been through it a couple of times; I'm not doing it again. If you feel that you need to split off the serialization library to do what you want, I've a better idea: how about you and Matthias taking it over yourselves? After all, I've achieved all I expect to get from it. Then you would be free to make any changes you want without wasting time on these discussions. Of course, if the problems noted above are only obvious to me, then that would probably indicate that I'm out of sync with the boost development community - in which case it really should be taken over by someone else.

Robert Ramey