[serialization] binary_iarchive performance degradation with Intel 9.0 compiler

It could be that I'm being foolish, or that there is something very
strange going on in how the Boost serialization library interacts with
the Intel C++ compiler that I don't understand. I'll post a short
question first, eliding underlying motivations, to see if this
warrants further discussion.
I am running on 64-bit Linux, using the Intel C++ compiler, version
9.0, and gcc 3.4.5.
I have a large, complex data structure that I am serializing. All
serialize methods use the BOOST_SERIALIZATION_NVP macro. The data
structure serializes and deserializes accurately for each of text,
xml, and binary archives. The size on disk is about 250 megabytes,
using the binary archives.
The load method I use is roughly this:
#include
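A minimal sketch of a load path along these lines (the Data type and its member are placeholders, not the actual quarter-gigabyte structure; only the header arrangement and the binary_iarchive usage follow the description in this thread):

    #include <fstream>
    #include <string>

    #include <boost/archive/binary_iarchive.hpp>
    // Uncommenting these extra archive headers is what nearly doubles the
    // load time under icpc 9.0 in the scenario described below:
    // #include <boost/archive/text_iarchive.hpp>
    // #include <boost/archive/xml_iarchive.hpp>
    #include <boost/serialization/nvp.hpp>

    // Placeholder for the real, large data structure.
    struct Data {
        int value;
        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/) {
            ar & BOOST_SERIALIZATION_NVP(value);
        }
    };

    // Read the structure back from a binary archive on disk.
    Data load(const std::string& filename) {
        std::ifstream ifs(filename.c_str(), std::ios::binary);
        boost::archive::binary_iarchive ia(ifs);
        Data d;
        ia >> BOOST_SERIALIZATION_NVP(d);
        return d;
    }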

First off - thanks for a very simple example illustrating the problem. This has allowed me to say right off the bat that I can't imagine what might be going on here - thereby saving us all lots of time. Bill Lear wrote:
Here is the strange part: when using the Intel compiler, with the text and xml archive headers commented out, the load method takes about 9.6 seconds. When I uncomment the text and xml headers, the load time almost doubles, to 18.8 seconds or so. I do not see this with gcc.
So the question is: have I done something foolish here, or is this wacky?
It certainly looks wacky to me. It would seem that a lot of code is getting instantiated and invoked even though those archives are never used.
I am using the following command to compile the code:
% icpc -O2 -Ob2 -ip <...>
Any help appreciated.
By me as well. This would require pretty determined sleuthing. Here are some suggestions, in no particular order:
- How does the size of the executable vary? It shouldn't - but if it does, that would indicate extra code being instantiated.
- Does the linker produce a "map" of some sort? This might indicate differences in code instantiation.
- I'm assuming it was compiled at the "appropriate" optimization level. Lower optimization levels leave in lots of code useful only for debugging. This is especially true with templates.
- Is there an execution-time profiler facility available which shows which functions are consuming how much time? Intel prides itself on its tools in this area. It might be very helpful to check these out.
Let us know what you find out. Robert Ramey

On Monday, June 5, 2006 at 10:07:03 (-0700) Robert Ramey writes:
...
This would require pretty determined sleuthing. Here are some suggestions - in no particular order.
How does the size of the executable vary? It shouldn't - but if it does, that would indicate extra code being instantiated.
Hmm, first bit of data: with all headers included, the executable size goes up considerably: from 6.1 megabytes to 14 megabytes.
[other ideas] ... Is there an execution-time profiler facility available which shows which functions are consuming how much time? Intel prides itself on its tools in this area. It might be very helpful to check these out.
I may try this out. I tried to reproduce this on a smaller scale, with something reasonably complicated, though still relatively simple compared to what I'm working with --- to no avail. I got the same times, and the executables were exactly the same size (though they differ in a byte-by-byte comparison). Well, hopefully over time I'll be able to build up my example to something more complex that starts to exhibit this behavior, or perhaps one of the Intel tools can shed some light. Perhaps, though, it's time to send a bug report to Intel and ask them what's wrong with their compiler! :-) Bill

"Robert Ramey"
writes: How does the size of the executable vary? It shouldn't - but if it does, that would indicate extra code being instantiated.
Of course it should vary if any classes are exported. The whole point of BOOST_CLASS_EXPORT is to instantiate a piece of code for each element in the cross-product of exported classes and visible archives. If you make more archives visible in the translation unit, more code will necessarily be instantiated (and run at startup). -- Dave Abrahams Boost Consulting www.boost-consulting.com
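As a hedged illustration of that cross-product (class names here are made up, not taken from Bill's code): a translation unit that exports a single class while six archive headers are visible instantiates, and registers at startup, serialization code for that class against all six archive types, whether or not those archives are ever constructed.

    #include <boost/archive/binary_iarchive.hpp>
    #include <boost/archive/binary_oarchive.hpp>
    // Making these archive types visible as well multiplies the code that is
    // instantiated for every exported class below, even if only the binary
    // archives are ever used:
    #include <boost/archive/text_iarchive.hpp>
    #include <boost/archive/text_oarchive.hpp>
    #include <boost/archive/xml_iarchive.hpp>
    #include <boost/archive/xml_oarchive.hpp>

    #include <boost/serialization/base_object.hpp>
    #include <boost/serialization/export.hpp>
    #include <boost/serialization/nvp.hpp>

    // Illustrative polymorphic hierarchy.
    struct Base {
        virtual ~Base() {}
        template <class Archive>
        void serialize(Archive& /*ar*/, const unsigned int /*version*/) {}
    };

    struct Derived : public Base {
        int x;
        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/) {
            ar & boost::serialization::make_nvp(
                "Base", boost::serialization::base_object<Base>(*this));
            ar & BOOST_SERIALIZATION_NVP(x);
        }
    };

    // Instantiates serialization code for Derived against every archive type
    // visible in this translation unit and registers it at program startup.
    BOOST_CLASS_EXPORT(Derived)

With a dozen or so exported classes, trimming the visible archives down to just the binary headers cuts that cross-product, and the corresponding startup registration work, by a factor of three.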

I was out whacking weeds today and was thinking about this. I'm going to guess that the unseen code in Bill's example has a number of BOOST_CLASS_EXPORT macros. The current version will instantiate code for every combination of exported type and included archive class. At startup there will be three times as much code invoked to "register" these combinations. During the course of serialization, there are lookups into the tables built at init time. These are STL sets, maps, and the like. It's possible that this combination of compiler and library implementation consumes a disproportionate amount of time under these circumstances. I would guess that one would need to use a profiler to get more information on this. (A simplified sketch of this registration pattern follows after the quoted message below.) Currently, code is instantiated for each combination of exported type and included archive. Question for Dave: is this the same in the new, improved version of export (1.35), or does the new one only instantiate code for each archive actually used with an << or & operator? - Just asking. Robert Ramey
David Abrahams wrote:
"Robert Ramey"
writes: How does the size of the executable vary? It shouldn't - but if it does, that would indicate extra code being instantiated.
Of course it should vary if any classes are exported. The whole point of BOOST_CLASS_EXPORT is to instantiate a piece of code for each element in the cross-product of exported classes and visible archives. If you make more archives visible in the translation unit, more code will necessarily be instantiated (and run at startup).
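A much-simplified model of the registration pattern Robert describes above - this is not the actual Boost.Serialization implementation, just an illustration of per-(exported type, archive type) static initializers feeding lookup tables that are consulted during serialization:

    #include <iostream>
    #include <map>
    #include <string>
    #include <typeinfo>
    #include <utility>

    // Global table keyed by (archive type, exported class GUID).
    typedef std::map<std::pair<std::string, std::string>, void (*)()> registry_t;

    registry_t& registry() {
        static registry_t r;  // constructed on first use
        return r;
    }

    // One static object per (type, archive) combination; its constructor runs
    // before main() and inserts an entry into the table.
    struct registrar {
        registrar(const std::type_info& archive, const char* guid, void (*fn)()) {
            registry()[std::make_pair(std::string(archive.name()),
                                      std::string(guid))] = fn;
        }
    };

    // Stand-ins for the three archive families visible in the translation unit.
    struct binary_archive {};
    struct text_archive {};
    struct xml_archive {};

    void load_binary() { /* would deserialize the exported class from binary */ }
    void load_text()   { /* ditto for text */ }
    void load_xml()    { /* ditto for xml */ }

    // An export-style macro would emit one of these per visible archive, so a
    // dozen exported classes and three archive families mean dozens of extra
    // static initializers and table entries.
    static registrar r1(typeid(binary_archive), "my_class", &load_binary);
    static registrar r2(typeid(text_archive),   "my_class", &load_text);
    static registrar r3(typeid(xml_archive),    "my_class", &load_xml);

    int main() {
        std::cout << "combinations registered before main(): "
                  << registry().size() << '\n';
    }

Each additional visible archive family adds another registrar per exported class, so both the startup work and the size of the lookup tables grow with the cross-product.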

On Monday, June 5, 2006 at 17:50:27 (-0700) Robert Ramey writes:
I was out whacking weeds today and was thinking about this.
I'm going to guess that the unseen code in Bill's example has a number of BOOST_CLASS_EXPORT macros. ...
Indeed you are correct. About a dozen or so. Bill

"Robert Ramey"
writes: I was out whacking weeds today and was thinking about this.
I'm going to guess that the unseen code in Bill's example has a number of BOOST_CLASS_EXPORT macros. The current version will instantiate code for every combination of exported type and included archive class. At startup there will be three times as much code invoked to "register" these combinations.
During the course of serialization, there are lookups into the tables built at init time. These are STL sets, maps, and the like. It's possible that this combination of compiler and library implementation consumes a disproportionate amount of time under these circumstances. I would guess that one would need to use a profiler to get more information on this.
Currently, code is instantiated for each combination of exported type and included archive. Question for Dave: is this the same in the new, improved version of export (1.35), or does the new one only instantiate code for each archive actually used with an << or & operator? - Just asking.
It does exactly what your old code did. Having it just instantiate the code for the type/archive combinations used would be _so_ much easier to implement. Personally, I don't understand why you'd want it to work the way it does, but I figured it was hard enough to do that you must've been sure that's what you wanted, and eliminating the ordering requirement looked like a big enough battle. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Hi Bill [snip code]
Here is the strange part: when using the Intel compiler, with the text and xml archive headers commented out, the load method takes about 9.6 seconds. When I uncomment the text and xml headers, the load time almost doubles, to 18.8 seconds or so. I do not see this with gcc.
FWIW, I've had a similar problem with Intel 9.0 on Windows. For a Boost.Statechart performance test where I can vary the number of states in powers of 2, I saw two effects:
1. Size-wise, Intel 9.0 executables were in about the same league as MSVC and GCC executables, as long as the number of states was below 8. With mounting complexity, however, Intel 9.0 produced much larger binaries than the competition.
2. Performance-wise, the Intel 9.0 executables were on top for small executables but quickly fell behind both MSVC and GCC for larger binaries.
What I found extremely strange was the fact that the performance degradation seemed to depend only on the size of the executable and not on how much of the code is actually executed during a test. Specifically, a state machine executes considerably faster when compiled into a separate executable than exactly the same state machine compiled into an executable that also contains other state machines!!! I partly "solved" the problem by compiling the Intel performance tests into as many executables as possible. Since Intel is not a platform I use for real work, I've never reported this problem.
BTW, the offending code is publicly available. If you want to use it to report the problem, let me know and I'll send you some pointers.
Regards,
-- Andreas Huber
participants (4):
- Andreas Huber
- Bill Lear
- David Abrahams
- Robert Ramey