Re: [boost] [serialization] fast array serialization (10x speedup)

19 Nov 2005

      "Robert Ramey" <ramey@rrsd.com> writes:
...
David Abrahams wrote:
...
"Robert Ramey" <ramey@rrsd.com> writes:
...
,----
...
For many archive formats and common datatypes there exist APIs that
can quickly read or write contiguous sequences of those types all at
once (**).  Reading or writing such a sequence by separately reading
or writing each element (as the serialization library currently
does) can be an order of magnitude more expensive.
`----
I have no problem with the above.
...
We want to be able to capitalize on the existence of those APIs, and
to do that we need a "hook" that will be used whenever a contiguous
sequence is going to be (de)serialized.  No such hook exists in
Boost.Serialization.
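
To make the cost difference concrete, here is a minimal sketch.  It is
not code from Boost.Serialization or from the proposed extension; the
names toy_binary_oarchive, save_elementwise and save_array are
hypothetical, invented purely for illustration.  The toy archive only
counts write calls, to show what a hook for contiguous sequences would
let an archive collapse into one bulk operation.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical toy archive: appends raw bytes to a buffer and counts
// how many write calls were issued.  Not a Boost.Serialization class.
struct toy_binary_oarchive {
    std::vector<unsigned char> buffer;
    std::size_t write_calls = 0;

    void write(const void* p, std::size_t n) {
        const unsigned char* b = static_cast<const unsigned char*>(p);
        buffer.insert(buffer.end(), b, b + n);
        ++write_calls;
    }
};

// Element-by-element saving: one write call per element.
template <class T>
void save_elementwise(toy_binary_oarchive& ar, const std::vector<T>& v) {
    for (const T& x : v) {
        ar.write(&x, sizeof(T));
    }
}

// The "hook" (assumed shape): the archive is handed the whole contiguous
// sequence at once and can forward it to a single bulk write.
template <class T>
void save_array(toy_binary_oarchive& ar, const T* p, std::size_t n) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "bulk byte copy is only valid for trivially copyable types");
    ar.write(p, n * sizeof(T));
}
```

Both paths produce the same bytes; only the number of calls into the
archive differs.  How much time that saves depends on the archive and
the underlying API, which this sketch does not measure.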
Whether or not such a hook is necessary is the crux of the issue.
Yes.  Or more precisely, whether the consequences of not having the
hook in the serialization library itself are bad enough to warrant
creating it there.  I will discuss those consequences after I present
our new design, which adds the hook, but only in our own extensions --
essentially a library built on top of the current serialization
library without modifying it.
...
I consider the submission a use case for archive creation and/or
extension.
I don't understand what you're trying to say.  I presume by "the
submission" you mean Matthias' proposed changes to your library.  But
I don't understand what you mean about it being a "use case."
...
As far as I could tell, that particular one didn't require
any new hooks in the library.
Functionally speaking, that is correct.  You /can/ do fast
serialization of contiguous arrays without changing the library.  You
don't even have to write a whole new serialization library.
...
Maybe the next iteration will be different - but that's how I see it
now.
There are some negative consequences of creating the hooks outside
Boost.Serialization.  Once you understand them, I'm pretty sure you
will think they are significant.  Whether they will be significant
enough to induce you to make changes in Boost.Serialization is of
course an open question.
...
Let me explain one place where our difference lies.
Having read everything that follows, I don't see any explanation of a
"place where our difference lies."  The parts I understand (most of
it) sound like "motherhood and apple pie" -- good, common sense that's
hard to disagree with.  Is it a thought that was never finished?
Would you care to try to put it more succinctly?
...
The serialization library is basically three pieces
a) serialization specifications for each data type to be serialized
(serialize functions), which are independent of the archive.  That
is, these specifications depend only upon the requirements of
the Saving Archive or Loading Archive concepts.
b) archive classes which implement the Archive concept for
different file formats.  These archive classes have common
implementation features factored out into common modules.
Due to "practical" considerations like whether something should be
pre-compiled in the library, whether it is dependent on a user's
application type, minimization of code bloat, etc., this common
implementation code might be included in one of the base classes or
in the file i/oserializer.hpp.  (The code in i/oserializer.hpp
would normally be in one of the base classes, but I believe
that template metaprogramming considerations related
to less-conforming compilers prevented that.)  These "common code"
modules are designed to hold code applicable to all
archives.
c) Finally, the escape hatch.  Those serialization implementations
which have to be dependent on the combination of archive
type and datatype.  The most obvious case is name-value
pairs - nvp.  nvp has its own default serialization which
just serializes the value part.  Within xml archives this is overridden
with a special version for that archive type.  This is the model
by which I have always envisioned that the library be extended.
It is only in this way that the library can be extended without
being complicated geometrically as time goes on.
I realize that this design and, more importantly, its motivation
might not be all that apparent from the documentation
on archive implementation.  Sorry about that.
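
The three pieces described above can be sketched roughly as follows.
This is a toy model under stated assumptions, not the library's actual
code: toy_nvp, toy_text_oarchive, toy_xml_oarchive and point are
hypothetical names, and the real library dispatches differently.  The
sketch only shows the shape of the idea: an archive-independent
serialize function (a), two archive classes (b), and a per-archive
override for the name-value pair (c).

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for a name-value pair wrapper.
template <class T>
struct toy_nvp {
    const char* name;
    const T& value;
};

// (b) A toy text archive.  Its nvp handling is the "default":
// it ignores the name and serializes just the value part.
struct toy_text_oarchive {
    std::ostringstream os;
    template <class T> void save(const T& t) { os << t << ' '; }
    template <class T> void save(const toy_nvp<T>& p) { save(p.value); }
};

// (b) + (c) A toy XML archive.  The nvp overload is the escape hatch:
// behaviour that depends on the combination of archive and datatype.
struct toy_xml_oarchive {
    std::ostringstream os;
    template <class T> void save(const T& t) { os << t; }
    template <class T> void save(const toy_nvp<T>& p) {
        os << '<' << p.name << '>';
        save(p.value);
        os << "</" << p.name << '>';
    }
};

// (a) A serialization specification for a user type.  It is written
// once and relies only on the archive providing save().
struct point { int x; int y; };

template <class Archive>
void serialize(Archive& ar, const point& p) {
    ar.save(toy_nvp<int>{"x", p.x});
    ar.save(toy_nvp<int>{"y", p.y});
}
```

The same serialize function yields "1 2 " through the text archive and
"<x>1</x><y>2</y>" through the XML archive for point{1, 2}; the type's
author never mentions either format.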
No, it's perfectly clear what you're trying to do once you study the
library implementation.  Your design philosophy makes good sense
AFAICT.

I am a bit surprised to hear you state flatly that there is only one
way to extend the library that can ever work.  How can you possibly
know you've considered every possibility?  I don't have the same
confidence, even about problems I've studied for years.
...
As time goes on I would hope that this can be improved.  But maybe
this explains my reluctance to maintain parts of the library beyond
the reach of those making other archives.
Other archives?  Beyond reach?  I don't understand what you're saying
here.
...
This forms my main objection to the proposal.
Sorry, I don't have any clue what you are referring to.  Regardless,
we are going to start from new code that doesn't change any part of
Boost.Serialization, so if possible, it might be better to try to
forget about what you've seen before.
...
Of course I have/had lots of other objections to it and
probably would have a lot more if I spent more time
looking into it.
Fortunately, you won't have to.  We're going to present new code.
...
I suspect that the job of making a portable binary archive is much
harder than it first appears.
Actually it's almost trivial (I did it over 10 years ago), but I don't
know what that has to do with what we're trying to accomplish.
...
Making it so that it can exploit opportunities to be much faster
while still being as "monkey-proof" is even harder still.
The speedups we're proposing don't have anything in particular to do
with portable binary archives.
...
I didn't pursue this as I really don't want to discourage these
kinds of efforts and they are (or should be) orthogonal to the
library as it is currently implemented.  If they can be implemented
without altering the core - then I have no problem.  If someone
believes that modifying the core is unavoidable, then either he or I
have made some sort of mistake and it will have to be resolved.
It's not unavoidable; as I've said before, it just has consequences
that we don't like, and we think you probably won't like either.  If
you can hang on until we've presented what we think is the best design
that avoids altering the core, then we can look at the consequences.
Once you understand them, if you still don't want to make any changes
and you're willing to accept the consequences, we're not going to
press the issue any further.
...
If they don't really have to alter the core, but the archive author
thinks it would make his job easier - then we have a problem.
Let me be very clear about this, at least:

  ,----
  | Ease of archive implementation is unrelated to the motivation for
  | requesting core changes.
  `----

I hope that allays at least one of your concerns.
...
I get a suggestion about once a month to modify the core of the
library for this or that reason.  Aside from bugs, it usually boils
down to the suggester looking at the code and seeing - "Oh I could
fix this right there!" without considering all the repercussions and
without considering the alternatives.  (As you might guess, this is
what I believe happened in this case).
Actually Matthias' considerations went much deeper than you give him
credit for.  In my opinion, he just failed to communicate his
rationale properly, and since the details of his code seemed to you to
violate basic principles of your design, I'm sure it was all the more
difficult for you to understand the problems he is trying to avoid.
Working from new code that (I hope!) won't cause you any alarm, it
might be easier to understand the rationale.
...
Another common occurrence is the attempt to use the serialization
system to accomplish some end for which it is not suited.  A typical
idea is to use it to implement some externally defined file format.
I know I drag my feet, I know it drives people crazy, but I truly
believe that the success of the library is due in no small part to
my reluctance to add in any more than is absolutely necessary.
Understood.  It might be a good idea for you to clearly define the
intended scope of the library.  What criteria distinguish an
appropriate application from an inappropriate one?  I'm interested in
hearing your intention as the library author, rather than something
like "an appropriate application is one that works well with the
library as it is currently specified and/or implemented."  Depending
on your answer, we might indeed be barking up the wrong tree.
...
So, I look forward to seeing progress on the following:
a) better handling of special optimization opportunities
which obtain for certain combinations of data-types and archives.
Hopefully, an elegant implementation will serve as a model
for other people's pet additions.
I hope we'll be able to show you something elegant very soon.
...
b)  A portable binary implementation suitable for
such things as MPI messages.
Portable binary archives and MPI have little relationship to one
another.  You don't flatten your data into a portable format, ship it
in an MPI message that is just a sequence of bytes, and then
deserialize.  MPI handles portability internally.
...
I also expect these to take some time and hope they
can be subjected to the boost "process" of public
criticism and refinement.  This will take more time
but result in a better product.  Hopefully, it will
be less stressful as well - though I doubt it.
I really am trying to wind down my involvement in the
serialization library.
That's a bit alarming, actually.  Have you got someone else lined up
to maintain it?  It's important to us and to many others that the
library has a future.  Without the involvement of the original author,
that would be in doubt.
...
I do want to spend some more time
on execution profiling and performance tweaks.
I would like to see the documentation improved on how
to do things like you and Matthias are attempting to do.
The current documentation does have a section
titled "case studies" which seems to me a handy place
to put examples of this nature and at the same time
show users how to exploit any "add-in" functionality.
Good luck on this
Thanks.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com