
Robert Ramey wrote:
Ian McCulloch wrote:
[...]
Secondly, the buffer in the oprimitive class has much less functionality than either the vector<char> buffer or the buffer I used previously (http://lists.boost.org/Archives/boost/2005/11/97156.php). In particular, it does not check for buffer overflow when writing, so it has no capability for automatic resizing or flushing and is only useful if you know in advance the maximum size of the serialized data. This kind of buffer is of rather limited use, so I don't think this is a fair comparison.
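[To make the distinction concrete, here is a minimal sketch of the two kinds of buffer being compared; the class names fixed_buffer and growing_buffer are invented for illustration and this is not the actual oprimitive or Boost code.]

    #include <cstddef>
    #include <cstring>
    #include <vector>

    // Fixed-size buffer in the style of the benchmark: the caller must
    // guarantee that the serialized data fits, since save() never checks.
    class fixed_buffer {
        char* data_;
        std::size_t pos_;
    public:
        explicit fixed_buffer(char* storage) : data_(storage), pos_(0) {}
        void save(const void* p, std::size_t n) {
            std::memcpy(data_ + pos_, p, n);   // no overflow check at all
            pos_ += n;
        }
    };

    // vector<char>-backed buffer: slightly more work per call, but it can
    // grow (or could flush) when the data does not fit.
    class growing_buffer {
        std::vector<char> data_;
    public:
        void save(const void* p, std::size_t n) {
            const char* c = static_cast<const char*>(p);
            data_.insert(data_.end(), c, c + n); // resizes as needed
        }
    };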
I think it's much closer to the binary archive implementation than the current binary_oarchive is.
I don't understand that sentence, sorry. Which binary archive implementation?
I also think it's fairly close to what an archive class would look like for a message-passing application.
Surely it depends on the usage pattern? If you are sending fixed-size messages, then sure, a fixed-size buffer with no overflow checks will be fastest. If you are sending variable-size messages with no particular upper bound on the message size, then it is a trade-off whether you use a resizable buffer or count the number of items you need to serialize beforehand (see the sketch below). I wouldn't like to guess which is the more 'typical' use; both are important cases.
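[For the variable-size case, the "count beforehand" alternative might look roughly like the following two-pass sketch; the message type and serialize() helper are made up for illustration and are not taken from either benchmark.]

    #include <cstddef>
    #include <cstring>
    #include <string>
    #include <vector>

    struct message {                 // toy variable-size message
        std::string name;
        std::vector<double> samples;
    };

    // Generic serialization routine; a Sink only needs save(ptr, n).
    template <class Sink>
    void serialize(const message& m, Sink& sink) {
        std::size_t len = m.name.size();
        sink.save(&len, sizeof(len));
        sink.save(m.name.data(), len);
        std::size_t count = m.samples.size();
        sink.save(&count, sizeof(count));
        sink.save(m.samples.data(), count * sizeof(double));
    }

    struct counting_sink {           // pass 1: just add up the sizes
        std::size_t size = 0;
        void save(const void*, std::size_t n) { size += n; }
    };

    struct raw_sink {                // pass 2: unchecked writes into preallocated memory
        char* p;
        void save(const void* src, std::size_t n) { std::memcpy(p, src, n); p += n; }
    };

    std::vector<char> pack(const message& m) {
        counting_sink counter;
        serialize(m, counter);                // measure
        std::vector<char> buf(counter.size);  // allocate exactly once
        raw_sink writer{buf.data()};
        serialize(m, writer);                 // write, no overflow checks needed
        return buf;
    }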
The real difference here is that save_binary would be implemented in such a way that the overhead per call is pretty small. Maybe not quite as small as here, but much smaller than the overhead associated with ostream::write.
OK, but even with the ideal fixed-size buffer, the difference you measured between the serialization library and save_array for out-of-cache arrays of char is:

Time using serialization library: 1.922
Time using direct call to save_array: 0.25

That is almost a factor of 8 (1.922 / 0.25 ≈ 7.7). For a buffer that has more overhead, no matter how small, that extra overhead will translate directly into an increase in that factor.
In my view, it does support my contention that implementing save_array - regardless of how it is in fact implemented - represents a premature optimization. I suspect that the net benefit in the kind of scenario you envision using it will be very small.
Note, however, that in this case save_array() is purely memory-bandwidth limited. It would be interesting if you repeated the benchmark with a much smaller array size: you should see several jumps in performance corresponding to the various caches (L1, L2, TLB, perhaps others), although in any particular benchmark some of these thresholds might be hard to see. You will need to put the serialization into a loop to get the CPU time up to a sensible number, and do a loop or two before starting the timer so that the data is already in the cache (a sketch follows below). In the fixed-size-buffer scenario this is actually not too far from a realistic benchmark. I know (roughly) what the result will be; if you still stand by your previous comment, then obviously you do not.
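[A minimal sketch of that measurement loop, assuming the code under test is reachable through a single call; here a placeholder memcpy stands in for save_array / the archive under test.]

    #include <chrono>
    #include <cstring>
    #include <iostream>
    #include <vector>

    void save_array(char* dst, const char* src, std::size_t n) {
        std::memcpy(dst, src, n);            // placeholder for the real call
    }

    int main() {
        for (std::size_t n : {4u * 1024u, 64u * 1024u, 1024u * 1024u, 16u * 1024u * 1024u}) {
            std::vector<char> src(n, 'x'), dst(n);
            const int reps = static_cast<int>((64u * 1024u * 1024u) / n) + 1;

            // Warm-up: bring src/dst into the caches (to the extent they fit).
            for (int i = 0; i < 2; ++i) save_array(dst.data(), src.data(), n);

            auto t0 = std::chrono::steady_clock::now();
            for (int i = 0; i < reps; ++i) save_array(dst.data(), src.data(), n);
            auto t1 = std::chrono::steady_clock::now();

            std::chrono::duration<double> dt = t1 - t0;
            std::cout << n << " bytes: "
                      << (double(n) * reps / dt.count()) / 1e9 << " GB/s\n";
        }
    }

[Varying the array size across the values above is what should expose the cache thresholds as jumps in the reported bandwidth.]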
So I believe that the above results give a much more accurate picture than the previous ones of the effect of applying the proposed enhancement.
Fine. I am glad you finally agree with the 10x slowdown figure (well, if you want to be picky, a 7.688x slowdown on your Windows XP box and 9.8512x on my Linux/Opteron box). [...]
Interestingly, on this platform/compiler combination it still takes 1.11 seconds without the bug fix in save_binary() ;) In that case, I would guess your Windows compiler is doing some optimization that gcc is not.
Thanks for doing this - it is very helpful.
Sure you're compiling at maximum optimization, -O3?
Of course. -O3 gives no difference from -O2, a small difference from -O1, and a huge difference from -O0. But when there is a bug in the benchmark, any result is possible ;) Quite possibly your compiler simply noticed that the same memory location was being overwritten repeatedly and chose to store it in a register instead? Anyway, since you took no special care to ensure that the compiler didn't optimize away code, it would have been quite legitimate for your benchmark to report zero time for all tests. In the absence of such care, you at least need to check the assembly output carefully to make sure the benchmark is really testing what you think it is testing (one common safeguard is sketched below).
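[One common safeguard, sketched here with a made-up workload rather than the actual benchmark code: route each iteration's result through a volatile object and print a value derived from it, so the compiler cannot legally discard the work that produced it.]

    #include <chrono>
    #include <iostream>
    #include <vector>

    volatile char sink;   // writes to a volatile object cannot be optimized away

    int main() {
        std::vector<char> data(1 << 20, 42);
        auto t0 = std::chrono::steady_clock::now();

        unsigned long total = 0;
        for (int rep = 0; rep < 100; ++rep) {
            for (char c : data) total += static_cast<unsigned char>(c);
            sink = static_cast<char>(total);     // force each iteration's result to be used
            data[rep % data.size()] ^= 1;        // keep the inner loop from being hoisted
        }

        auto t1 = std::chrono::steady_clock::now();
        // Printing the accumulated result also makes the work observable.
        std::cout << "total = " << total << ", time = "
                  << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    }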
In any case, this is not untypical of my personal experience with benchmarks: they vary a lot depending on extraneous variables. Our results seem pretty comparable, though.
Robert Ramey
Cheers, Ian