Re: [boost] [serialization] fast array serialization (10x speedup)

19 Nov 2005

      David Abrahams wrote:
...
...
I consider the submission a use case for archive creation and/or
extension.
But I don't understand what you mean about it being a "use case."
I mean an example of how the library can be extended to
achieve some specified requirement.  In this case, improved
method of saving/loading certain types of data in certain types of archives.
...
There are some negative consequences of creating the hooks outside
Boost.Serialization.  Once you understand them, I'm pretty sure you
will think they are significant.
I'm all ears.  I can really only comment on the proposal submitted
and that's what I did.
...
...
Let me explain one place where our difference lies.
...
Having read everything that follows, I don't see any explanation of a
"place where our difference lies."  The parts I understand (most of
it) sound like "motherhood and apple pie" -- good, common sense that's
hard to disagree with.  Is it a thought that was never finished?
Would you care to try to put it more succinctly?
...
It seemed to me that the submission didn't take these aspects of
the design into account.  I had presumed that this was because
the separation was nowhere really made explicit.  I was trying to
make up for that.  I was concerned that it might not be obvious that the
distribution of implementation in the hierarchy of classes was very
deliberate and not arbitrary.  I can see how someone might
look at the way something was done and say, "wow - that's
not necessary - we can just collapse out that layer" etc.  In
fact I would expect a lot of people to react that way when
they first see it.

(What follows is a diversion from the question at hand for those
who have some extra time or interest. Feel free to skip)

An interesting thing is how the "implementation organization" comes about.
If one is an avid reader of boost mail archives he will notice a huge
amount of discussion about the design of things.  How things
should be separated, what implementation techniques should
be used, etc., etc - the discussion addresses things at a finer and
finer level of detail as time goes on.  Most of the discussion
is speculative - If one does things this way then you'll be
able to do x - but who needs to do x when you can do y, etc.
From looking at these discussions, one might get the impression
that this is the way something like a largish body of code such
as the serialization library is designed.  The truth is - it doesn't
happen this way - at least not with me.  The discussions can
be interesting and helpful - up to a point.  But once it arrives at
a certain level of detail - it's truly beyond the human brain's
capacity to imagine all the consequences of these design decisions.

So when I started out, I had

a) positive experience with Microsoft's MFC serialization
b) a list of things about it that I wanted to "fix"
c) a list of other systems which attempted to address the same
issues I did.  Although none of these systems included all the
things I wanted to fix - many had interesting ideas.
d) a fixed idea that the description of how something is serialized
must be orthogonal to the archive implementation.
e) a concise half page description of how it would be used
(your Archive Concept)
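
To make the orthogonality point concrete, here is a minimal sketch. It does not use Boost.Serialization itself: text_out, size_counter and gps_position are hypothetical stand-ins invented for illustration, and only the general shape (a serialize() member template that is handed some archive) is meant to echo the idea described above. The type describes its members once, and two unrelated "archives" consume that same description.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical archive 1: renders each value as space-terminated text.
class text_out {
public:
    template <class T>
    text_out& operator&(const T& value) {
        os_ << value << ' ';
        return *this;
    }
    std::string str() const { return os_.str(); }

private:
    std::ostringstream os_;
};

// Hypothetical archive 2: only counts the bytes a raw dump would need.
class size_counter {
public:
    template <class T>
    size_counter& operator&(const T& value) {
        bytes_ += sizeof(value);
        return *this;
    }
    std::size_t bytes() const { return bytes_; }

private:
    std::size_t bytes_ = 0;
};

// A user type. serialize() names the members once and knows nothing
// about the archive it is handed.
struct gps_position {
    int degrees;
    int minutes;
    float seconds;

    template <class Archive>
    void serialize(Archive& ar, unsigned int /*version*/) {
        ar & degrees & minutes & seconds;
    }
};

// Helpers that run the same description through each archive.
inline std::string as_text(gps_position p) {
    text_out ar;
    p.serialize(ar, 0);
    return ar.str();
}

inline std::size_t raw_size(gps_position p) {
    size_counter ar;
    p.serialize(ar, 0);
    return ar.bytes();
}

// Usage: one description, two renderings.
inline void orthogonality_example() {
    gps_position p{35, 59, 24.5f};
    assert(as_text(p) == "35 59 24.5 ");
    assert(raw_size(p) == 2 * sizeof(int) + sizeof(float));
}
```

Adding a third archive in this sketch would not require touching gps_position, and adding a new serializable type would not require touching either archive.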

I made the first tutorial demo and developed that
in parallel with the first version of the library.  It started
out very simple.

As time went on, more "requirements" were added.  Many
of these "requirements" were formulated during
the first review.  Lots of boost type discussion (good and bad)
consumed lots of effort.  All this discussion was pretty
much summarized in G. Rosenthal's definitive review
of the library.  It was very complete and very well
written. This resulted in much refactoring.  After acceptance
I realized we needed a polymorphic interface.  Dynamic
DLL loading resulted in more refactoring.  Through
all this the original demo tutorial application hardly
ever changed.

The final design is the triumph of evolution over intelligent design.

There's a very deep lesson here I'm sure.  I see things such
as extreme programming vs waterfall design, evolution vs
creationism, market capitalism vs socialist central planning,
as all related.

(End of diversion)

So, from the above, it's obvious to me that how to implement
the serialization system is not at all obvious. (If it were,
I would have needed only one iteration !)  Like lots
of things it might be obvious in retrospect.  Or worse
it might LOOK obvious when it's really not.

I hope that clarifies things.
...
...
It is only in this way that the library can be extended without
being complicated geometrically as time goes on.
...
I am a bit surprised to hear you state flatly that there is only one
way to extend the library that can ever work.  How can you possibly
know you've considered every possibility?  I don't have the same
confidence, even about problems I've studied for years.
Hmmm what I meant to say is illustrated by the following:

Suppose one has some library L.  If it's successful, there will be demand
to enhance it as time goes on.  This is a "good thing" (tm).  Now
suppose that the introduction of enhancement E results in L' which
presents an API which is a superset of L.  Of course it's internally
more complex with "global" modes and object traits etc.  It does
take more effort to debug than originally anticipated but
it does work and it's backward compatible and now has the new
functionality and everyone's happy.  For a while.  Almost
everybody.  Now it's a little harder to learn to use for beginners.
But it's OK.  The success of enhancement E stokes demand for
enhancement F.  Each additional enhancement is harder to
implement, and the resulting library can be understood by
fewer and fewer people, and it's harder and harder to learn to use.

This is a typical cycle which many software products suffer from.
(BTW - other products suffer from this as well.  It almost seems
there is a thermodynamic principle at work - the conceptual integrity
of all ideas decreases over time as attempts are made to apply
them ever more broadly)

Now suppose when demand for enhancement E comes up someone
says - wait a minute - You have to implement E as some sort of add-on
module.  It seems like it's more work.  But since the work doesn't
make the original code more intricate, the effort to design, code,
debug, test and document E is strictly proportional to the size of E.

So while there are lots of ways to extend a library, by
choosing an inconvenient method the original utility of the library
will suffer - even as the library gains functionality !!!  So maybe
instead of saying there's only one way to extend the library,
I really meant to say there are lots of ways NOT to extend a
library.
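
As a rough illustration of the "add-on module" idea, the sketch below layers an array fast path on top of a core archive without editing the core class. Everything here is hypothetical (core_oarchive, fast_array_oarchive, save_elementwise and save_array are invented names, and this is not how Boost.Serialization or the submitted code is structured); it only shows that an enhancement can sit entirely on the public interface of the thing it extends.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// "Library L": a hypothetical core archive. Nothing after this class edits it.
class core_oarchive {
public:
    // Save one trivially copyable value as its raw bytes.
    template <class T>
    void save(const T& value) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "sketch only handles trivially copyable types");
        write_bytes(&value, sizeof(T));
    }

    // Public low-level hook that extensions may build on.
    void write_bytes(const void* p, std::size_t n) {
        const unsigned char* b = static_cast<const unsigned char*>(p);
        buffer_.insert(buffer_.end(), b, b + n);
        ++writes_;
    }

    const std::vector<unsigned char>& buffer() const { return buffer_; }
    std::size_t writes() const { return writes_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t writes_ = 0;
};

// What a user gets from the core alone: one write per element.
template <class T>
void save_elementwise(core_oarchive& ar, const std::vector<T>& v) {
    ar.save(v.size());
    for (const T& x : v) ar.save(x);
}

// "Enhancement E" as an add-on: derived from the core and using only its
// public interface.
class fast_array_oarchive : public core_oarchive {
public:
    // One bulk write for the whole contiguous array.
    template <class T>
    void save_array(const std::vector<T>& v) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "sketch only handles trivially copyable types");
        save(v.size());
        if (!v.empty()) write_bytes(v.data(), v.size() * sizeof(T));
    }
};

// Usage: same bytes come out, with far fewer write calls.
inline void add_on_example() {
    std::vector<int> v(1000, 7);

    core_oarchive slow;
    save_elementwise(slow, v);

    fast_array_oarchive fast;
    fast.save_array(v);

    assert(slow.buffer() == fast.buffer());  // identical output
    assert(slow.writes() == 1001);           // size + 1000 elements
    assert(fast.writes() == 2);              // size + one bulk write
}
```

In this toy version the cost of the enhancement is confined to fast_array_oarchive. Whether the real optimization can be packaged this cleanly is exactly what the thread is debating, so the sketch should not be read as settling that question.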

What if the enhancement can't be done as an add-on?

Then you've got to refactor the library.  This should happen less and
less frequently as time goes on.
...
...
As time goes on I would hope that this can be improved.  But maybe
this explains my reluctance to maintain parts of the library beyond
the reach of those making other archives.
Other archives?  Beyond reach?  I don't understand what you're saying
here.
I don't remember what I meant to say here.  I probably meant to say
that I would hope that the library extends by adding on more and
more functionality through extension and accretion rather than making
the stuff that's already in there more elaborate.
...
we are going to start from new code that doesn't change any part of
Boost.Serialization, so if possible, it might be better to try to
forget about what you've seen before.
no problem - I can't remember that far back anyway.
...
...
I suspect that the job of making a portable binary archive is much
harder than it first appears.
Actually it's almost trivial (I did it over 10 years ago), but I don't
know what that has to do with what we're trying to accomplish.
...
The speedups we're proposing don't have anything in particular to do
with portable binary archives.
I presumed too much then.  From the thread discussion, it seemed
that this was just the initial effort to adapt the serialization library to
the needs of High Performance Computing. XDR compatibility. 
(http://www.faqs.org/rfcs/rfc1014.html)
was mentioned at some point as was MPI 
(http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-1.1/node39.htm#Node3... 
think)  Both of
these entail a portable binary format - with attendant endian issues.
Maybe the mentioning of this in the context of discussion of
the submission which didn't really mention this confused things
in my own mind.

So just to keep the pot boiling - it seems to me that gaining
the 10x speed up associated with "bitwise collection"
serialization in the context of portable binary archives
such as XDR is going to be a tall order.
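
A small sketch of why the two goals pull against each other. XDR (RFC 1014) stores integers in 4-byte units, most significant byte first, so on a host with a different byte order each element has to be converted individually, whereas a native "bitwise" save can be a single memcpy. The function names below are hypothetical, the code handles only std::uint32_t, and it makes no measurement; how much the per-element work actually costs depends on the compiler and hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using bytes = std::vector<unsigned char>;

// Native "bitwise" save: a single memcpy. Fast, but the byte order in the
// output is whatever the host machine happens to use.
inline bytes save_native(const std::vector<std::uint32_t>& v) {
    bytes out(v.size() * sizeof(std::uint32_t));
    if (!v.empty()) std::memcpy(out.data(), v.data(), out.size());
    return out;
}

// Portable save, XDR style: every 32-bit value is emitted most significant
// byte first, so each element is touched individually.
inline bytes save_big_endian(const std::vector<std::uint32_t>& v) {
    bytes out;
    out.reserve(v.size() * 4);
    for (std::uint32_t x : v) {
        out.push_back(static_cast<unsigned char>((x >> 24) & 0xFF));
        out.push_back(static_cast<unsigned char>((x >> 16) & 0xFF));
        out.push_back(static_cast<unsigned char>((x >> 8) & 0xFF));
        out.push_back(static_cast<unsigned char>(x & 0xFF));
    }
    return out;
}

// Matching portable load; expects a whole number of 4-byte units.
inline std::vector<std::uint32_t> load_big_endian(const bytes& in) {
    assert(in.size() % 4 == 0);
    std::vector<std::uint32_t> v;
    v.reserve(in.size() / 4);
    for (std::size_t i = 0; i < in.size(); i += 4) {
        v.push_back((std::uint32_t(in[i]) << 24) |
                    (std::uint32_t(in[i + 1]) << 16) |
                    (std::uint32_t(in[i + 2]) << 8) |
                    std::uint32_t(in[i + 3]));
    }
    return v;
}
```

On a big-endian host the two save functions produce the same bytes and the portable path could in principle fall back to the bulk copy; on a little-endian host it cannot.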
...
...
I didn't pursue this as I really don't want to discourage these
kinds of efforts and they are (or should be) orthogonal to the
library as it is currently implemented.  If they can be implemented
without altering the core - then I have no problem.  If someone
believes that modifying the core is unavoidable, then either he or I
have made some sort of mistake and it will have to be resolved.
...
It's not unavoidable; as I've said before, it just has consequences
that we don't like, and we think you probably won't like either.  If
you can hang on until we've presented what we think is the best design
that avoids altering the core, then we can look at the consequences.
Once you understand them, if you still don't want to make any changes
and you're willing to accept the consequences, we're not going to
press the issue any further.
Fine, I was asked to comment on what was submitted.  We'll start the
next round with a clean slate.
...
...
If they don't really have to alter the core, but the archive author
thinks it would make his job easier - then we have a problem.
Let me be very clear about this, at least:
,----
 | Ease of archive implementation is unrelated to the motivation for
 | requesting core changes.
 `----
I hope that allays at least one of your concerns.
It does.  And I'm sure you probably deal with this on a regular basis
with your own libraries.
...
...
I get a suggestion about once a month to modify the core of the
library for this or that reason.  Aside from bugs, it usually boils
down to the suggestor looking at the code and seeing - "Oh I could
fix this right there!" without considering all the repercussions and
without considering the alternatives.  (As you might guess, this is
what I believe happened in this case).
Note that this isn't a personal criticism - it's a natural occurrence that
happens all the time.
...
Actually Matthias' considerations went much deeper than you give him
credit for.  In my opinion, he just failed to communicate his
rationale properly, and since the details of his code seemed to you to
violate basic principles of your design, I'm sure it was all the more
difficult for you to understand the problems he is trying to avoid.
LOL - I think I understood the code submitted and what it was
intended to achieve.  As far as I could fathom the rationale, I presented
an alternative designed to achieve the same results without sprinkling
bits of code throughout lots of other modules.
...
Working from new code that (I hope!)  won't cause you any alarm, it
might be easier to understand the rationale.
I guess you and Matthias were somewhat taken aback by my
response.  Sorry about that.  Anyway, it seems you do have
an understanding and even appreciation of my concerns so
I'm optimistic that the next iteration will be better.

The crux of my argument is that I believe that the kinds of extensions
you want to implement can best be done without altering the current
library.  I'm willing to be proved wrong with a counter example -
but the last didn't qualify in my opinion.  Also it seems that lots
of people are using the library in ways I haven't totally foreseen,
so there have been lots of opportunities for such counter
examples to be presented. (The only one that really stuck
was shared_ptr serialization - and I'm still not sure about that!!)
...
...
Another common occurrence is the attempt to use the serialization
system to accomplish some end for which it is not suited.  A typical
idea is to use it to implement some externally defined file format.
I know I drag my feet, I know it drives people crazy, but I truly
believe that the success of the library is due in no small part to
my reluctance to add in any more than is absolutely necessary.
...
Understood.  It might be a good idea for you to clearly define the
intended scope of the library.  What criteria distinguish an
appropriate application from an inappropriate one?  I'm interested in
hearing your intention as the library author, rather than something
like "an appropriate application is one that works well with the
library as it is currently specified and/or implemented."  Depending
on your answer, we might indeed be barking up the wrong tree.
The very first sentence of the Overview of the Documentation states:

"Here, we use the term "serialization" to mean the reversible deconstruction 
of an arbitrary set of C++ data structures to a sequence of bytes. Such a 
system can be used to reconstitute an equivalent structure in another 
program context. Depending on the context, this might used implement object 
persistence, remote parameter passing or other facility. In this system we 
use the term "archive" to refer to a specific rendering of this stream of 
bytes. This could be a file of binary data, text data, XML, or some other 
created by the user of this library. "

I'm not sure I can make a better statement than that regarding what
I expected the library to be used for.
...
...
So, I look forward to seeing progress on the following:
a) better handling of special optimization opportunities
which obtain for certain combinations of data-types and archives.
Hopefully, an elegant implementation will serve as a model
for other people's pet additions.
...
I hope we'll be able to show you something elegant very soon.
No need to hurry on my account.
...
...
b)  A portable binary implementation suitable for
such things as MPI messages.
...
Portable binary archives and MPI have little relationship to one
another.  You don't flatten your data into a portable format, ship it
in an MPI message that is just a sequence of bytes, and then
deserialize.  MPI handles portability internally.
I've taken only the most cursory look at MPI.  (turns out this
may change due to some other project).  So I won't
dispute this.  I don't see how one could pass information
between heterogeneous machines without addressing
all the issues related to making a portable binary archive.
Perhaps MPI leaves that part undefined - but still it will
have to be dealt with somewhere.
...
...
I also expect these to take some time and hope they
can be subjected to the boost "process" of public
criticism and refinement.  This will take more time
but result in a better product.  Hopefully, it will
be less stressful as well - though I doubt it.
I really am trying to wind down my involvement in the
serialization library.
That's a bit alarming, actually.  Have you got someone else lined up
to maintain it?
I was thinking of Matthias though I've never brought it up
...
It's important to us and to many others that the library has a future.
As long as people continue to use it I'm sure it will have a future ...
...
Without the involvement of the original author, that would be in doubt.
...regardless of whether the original author is involved.

Does this mean I can't die until I get a replacement?

Personally, I see the idea that the viability of any piece of code is tied
to the continuing involvement of the original author as a sign that the
code is lacking in some dimension.  It should be easy for someone
to see what is going on and fix it.  If it's not - it's really a failing of the
original author.  So I've been personally gratified to have people
send me fixes to very obscure and arcane bugs.  I don't always
incorporate the fix due to design considerations but I often do.  Some
of these things are devilishly hard - what happens when code
implementing serialization is dynamically unloaded? - things like that.
Others are obscure corners of other standards - e.g. how does
one encode a string with an embedded \0 into an html string. Or
what is a portable way to create a sNaN when loading a
portable archive.  There are probably lots of little corners
with things that need fixing and the truth is I'm already relying on
people with more specialized knowledge to help with these things.
So already things are moving to other people on a case by case
basis.

I would hope to see the library grow and prosper by seeing things layered
on top of it.  Thus my personal involvement should taper off as it
seems to have in other successful boost libraries - and as it should
in any successful programming project.

There is one kind of change that I would like to see in the core library
as time goes on.  I would like to see certain things migrate out
of the library and become boostified.  Examples are things like
strong typedef, extended typeinfo, dataflow iterators (my personal
favorite).  I recognize that that is a little unrealistic and I never
mess with these things so it's not a big issue - it's just I would like
to see the library smaller.  Also it would be interesting to see if
the boost class factory can be used to replace similar functionality
implemented in the serialization library - there may be other
such cases.

Robert Ramey