Re: [boost] [serialization] fast array serialization (10x speedup)

18 Nov 2005

      David Abrahams wrote:
...
"Robert Ramey" <ramey@rrsd.com> writes:
...
,----
...
For many archive formats and common datatypes there exist APIs that
can quickly read or write contiguous sequences of those types all at
once (**).  Reading or writing such a sequence by separately reading
or writing each element (as the serialization library currently
does) can be an order of magnitude more expensive.
`----
I have no problem with the above.
...
We want to be able to capitalize on the existence of those APIs, and
to do that we need a "hook" that will be used whenever a contiguous
sequence is going to be (de)serialized.  No such hook exists in
Boost.Serialization.
Whether or not such a hook is necessary is the crux of the issue.
I consider the submission a use case for archive creation and/or
extension.  As far as I could tell, that particular one didn't require
any new hooks in the library.  Maybe the next iteration will be
different - but that's how I see it now.
...
(**) Note that this capability is not necessarily tied to bitwise
    serialization or the use of a binary representation.
...
The Design
==========
We've attempted to use programming idioms and terminology found
in the existing serialization library wherever possible, so that it's
easy for you to read and understand, and you won't be distracted by
minor stylistic differences.
Thanks for your consideration.  I realize it's an extra burden to make
it easier for me to read and understand, and I appreciate the effort.
...
In the messages to follow, the word "array" will normally mean a
contiguous sequence of instances of a single datatype, and not a
C++ builtin array type of the form T[N].  I'll try to be explicit when
I intend to describe builtin arrays.
Let me explain one place where our difference lies.

The serialization library is basically three pieces:

a) Serialization specifications (serialize functions) for each data
type to be serialized.  These are independent of the archive.  That
is, these specifications depend only upon the requirements of
the Saving Archive or Loading Archive concepts.

b) Archive classes which implement the Archive concept for
different file formats.  These archive classes have common
implementation features factored out into common modules.
Due to "practical" considerations (whether something should be
pre-compiled in the library, whether it depends on a user's
application type, minimization of code bloat, etc.), this common
implementation code might be included in one of the base classes or
in the file i/oserializer.hpp.  (The code in i/oserializer.hpp
would normally be in one of the base classes, but I believe
it was kept separate because of template metaprogramming
considerations related to less-conforming compilers.)  These
"common code" modules are designed to hold code applicable to all
archives.

c) Finally, the escape hatch: those serialization implementations
which have to depend on the combination of archive
type and datatype.  The most obvious case is name-value
pairs - nvp.  nvp has its own default serialization which
just serializes the value part.  Within xml archives this is overridden
with a special version for that archive type.  This is the model
by which I have always envisioned the library being extended.
It is only in this way that the library can be extended without
becoming geometrically more complicated as time goes on.
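
As an illustration of the three pieces above, here is a minimal sketch
that does not use Boost.Serialization at all.  The names
text_oarchive_sketch, xml_oarchive_sketch, nvp_sketch and point are
hypothetical, and plain function overloading stands in for whatever
dispatch the library really uses; treat it as one possible reading of
the design, not as the library's implementation.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for a name-value pair wrapper.
template <class T>
struct nvp_sketch {
    const char* name;
    T& value;
};

template <class T>
nvp_sketch<T> make_nvp_sketch(const char* name, T& value) {
    return nvp_sketch<T>{name, value};
}

// (b) Two toy saving archives, one per output format.
struct text_oarchive_sketch {
    std::ostringstream os;
    void save(int x) { os << x << ' '; }
};

struct xml_oarchive_sketch {
    std::ostringstream os;
    void save(int x) { os << x; }
};

// Primitive values are handed straight to the archive.
template <class Archive>
void serialize(Archive& ar, int& x) { ar.save(x); }

// Default nvp serialization: ignore the name, serialize the value.
template <class Archive, class T>
void serialize(Archive& ar, nvp_sketch<T>& p) { serialize(ar, p.value); }

// (c) Escape hatch: this archive/datatype combination is overridden.
// Partial ordering picks this overload for xml_oarchive_sketch only.
template <class T>
void serialize(xml_oarchive_sketch& ar, nvp_sketch<T>& p) {
    ar.os << '<' << p.name << '>';
    serialize(ar, p.value);
    ar.os << "</" << p.name << '>';
}

// (a) A user type whose serialize function knows nothing about
// any particular archive.
struct point { int x; int y; };

template <class Archive>
void serialize(Archive& ar, point& p) {
    auto nx = make_nvp_sketch("x", p.x);
    auto ny = make_nvp_sketch("y", p.y);
    serialize(ar, nx);
    serialize(ar, ny);
}

std::string save_as_text(point p) {
    text_oarchive_sketch ar;
    serialize(ar, p);
    return ar.os.str();
}

std::string save_as_xml(point p) {
    xml_oarchive_sketch ar;
    serialize(ar, p);
    return ar.os.str();
}

// Usage: the same serialize(Archive&, point&) drives both formats.
void demo_archives() {
    assert(save_as_text(point{3, 4}) == "3 4 ");
    assert(save_as_xml(point{3, 4}) == "<x>3</x><y>4</y>");
}
```

Note that neither point nor the text archive had to change to give
the xml archive its special treatment of name-value pairs; the
override lives entirely with the archive that needs it.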

I realize that this design and, more importantly, its motivation
might not be all that apparent from the documentation
on archive implementation.  Sorry about that.  As time goes
on I would hope that this can be improved.  But maybe this
explains my reluctance to maintain parts of the library
beyond the reach of those making other archives.  This
forms my main objection to the proposal.

Of course I have/had lots of other objections to it and
probably would have a lot more if I spent more time
looking into it.  I suspect that the job of making
a portable binary archive is much harder than it
first appears.  Making it so that it can exploit
opportunities to be much faster while still being
as "monkey-proof" is even harder still.  I didn't pursue
this as I really don't want to discourage these
kinds of efforts, and they are (or should be) orthogonal
to the library as it is currently implemented.  If they
can be implemented without altering the core, then
I have no problem.  If someone believes that
modifying the core is unavoidable, then either
he or I have made some sort of mistake and it will
have to be resolved.  If they don't really have to alter
the core, but the archive author thinks it would make his job easier,
then we have a problem.

I get a suggestion about once a month to modify the core
of the library for this or that reason.  Aside from bugs,
it usually boils down to the suggestor looking at the code
and seeing - "Oh I could fix this right there!" without
considering all the repercussions and without considering
the alternatives.  (As you might guess, this is what I believe
happened in this case).  Another common occurrence
is the attempt to use the serialization system to
accomplish some end for which it is not suited.  A typical
idea is to use it to implement some externally defined
file format.  I know I drag my feet, I know it drives people crazy,
but I truly believe that the success of the library is due in no small
part to my reluctance to add in any more than is
absolutely necessary.

So, I look forward to seeing progress on the following:

a) Better handling of special optimization opportunities
which obtain for certain combinations of data-types and archives.
Hopefully, an elegant implementation will serve as a model
for other people's pet additions.

b) A portable binary implementation suitable for
such things as MPI messages.
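
For item (a), one possible shape is sketched below, again without
Boost.  A hypothetical wrapper, array_ref_sketch, marks a contiguous
sequence.  It gets a default element-by-element save that works with
any archive, and a single archive type that has a block-write call
overrides it.  All names are invented for illustration; this is
neither the submitted proposal nor an existing library interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical wrapper marking a contiguous sequence of T.
template <class T>
struct array_ref_sketch {
    static_assert(std::is_trivially_copyable<T>::value,
                  "raw byte copies are only valid for trivially copyable T");
    const T* data;
    std::size_t count;
};

// Toy archive that can only write one value per call.
struct simple_oarchive_sketch {
    std::string bytes;
    std::size_t calls = 0;
    void save_one(const void* p, std::size_t n) {
        bytes.append(static_cast<const char*>(p), n);
        ++calls;
    }
};

// Toy archive that also offers a block-write call.
struct block_oarchive_sketch {
    std::string bytes;
    std::size_t calls = 0;
    void save_one(const void* p, std::size_t n) {
        bytes.append(static_cast<const char*>(p), n);
        ++calls;
    }
    void save_block(const void* p, std::size_t n) {
        bytes.append(static_cast<const char*>(p), n);
        ++calls;
    }
};

// Default for every archive: one call per element.
template <class Archive, class T>
void save(Archive& ar, const array_ref_sketch<T>& a) {
    for (std::size_t i = 0; i < a.count; ++i)
        ar.save_one(a.data + i, sizeof(T));
}

// Override for the one archive/datatype combination that can do better:
// the whole sequence goes out in a single call.
template <class T>
void save(block_oarchive_sketch& ar, const array_ref_sketch<T>& a) {
    ar.save_block(a.data, a.count * sizeof(T));
}

// Usage: same bytes either way, but far fewer calls on the fast path.
void demo_array() {
    std::vector<int> v{1, 2, 3, 4};
    array_ref_sketch<int> a{v.data(), v.size()};
    simple_oarchive_sketch slow;
    block_oarchive_sketch fast;
    save(slow, a);
    save(fast, a);
    assert(slow.bytes == fast.bytes);
    assert(slow.calls == 4);
    assert(fast.calls == 1);
}
```

The override sits next to the archive that benefits from it, so
archives without a block-write call keep working through the default.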

I also expect these to take some time and hope they
can be subjected to the boost "process" of public
criticism and refinement.  This will take more time
but result in a better product.  Hopefully, it will
be less stressful as well - though I doubt it.

I really am trying to wind down my involvement in the
serialization library.  I do want to spend some more time
on execution profiling and performance tweaks.

I would like to see the documentation improved on how
to do things like you and Matthias are attempting to do.
The current documentation does have a section
titled "case studies" which seems to me a handy place
to put examples of this nature and at the same time
show users how to exploit any "add-in" functionality.

Good luck on this

Robert Ramey