Re: [boost] serialization performance

bwood <brass@mailvault.com> writes:
Hi, Dave,
Hi Brian
I work on www.webEbenezer.net. I noticed today that Robert Ramey said "So far so good" regarding some performance tests he is doing. You might want to encourage him to consider some of the competition as far as performance.
You should probably be doing that, actually ;-) I'm helpfully cross-posting this reply to the Boost developers' list, but next time, please do it yourself so that you get "in the mix" there. Having these kinds of reports in the public record, and having people like you involved in the ensuing discussion, are crucial for the Boost process. Thanks again for writing.
Some recent tests I've done that compare Ebenezer Enterprises and Boost.serialization's performance show the Boost.serialization approach to be 7 to 9 times slower than the Ebenezer Enterprises approach. I'm using Linux 2.6.12, gcc 4.0.2 (with -O3) and a Boost.serialization library from the release tree. A test that compared the times to serialize/send a list<int> took 7.4 times longer with Boost than Ebenezer. I timed the following:
  Boost:    oArch & lst;             // oArch is a binary_oarchive
  Ebenezer: msgs.Send(buffer, lst);  // In order to compare apples to apples,
                                     // I removed a section of code at the end
                                     // of the Send function that flushes the buffer.
In a second test I added a deque of int...
  Boost:    oArch & lst; oArch & dq; // std::deque<int> dq;
  Ebenezer: msgs.Send(buffer, lst, dq);
In this case Boost took 9.1 times longer than the Ebenezer Enterprises approach. If you want to run these tests with other compilers I think that would be helpful. I've been warned not to put too much emphasis on numbers from gcc.
Regards, Brian Wood
-- Dave Abrahams Boost Consulting www.boost-consulting.com

I've written a test program to compare the time required by Boost serialization with the time required by stream I/O for a similar operation. I've only used VC 7.1 (release mode), and my focus has been to determine how much overhead the serialization system adds compared to the alternative of not using it. The test consists of saving/loading 1000 instances of a class which includes all primitive types (about 20 members, including strings). Times are calculated by dividing the total time by 1000 to give ms/operation. These preliminary results don't raise any major red flags. I'm still working on this in my spare time. A couple of problems are:
a) Timings are too crude.
b) I need to make a standard library test which writes/reads binary data. This is interesting for comparison purposes, though I doubt many people actually use output/input streams in this way.
c) I'm aware that there are a couple of opportunities for improved performance in the archive implementation.
d) I really want to get profile data from this (improved) test. I'm having some problems because not all platforms support profiling, and I'm having trouble figuring out how to work it into bjam in the way I want.
e) If I've got nothing else to do, I might want to enhance this with tests of some STL collections.
Note that the compiled version of the attached program, using the static library of the serialization library and the DLL version of the VC C++ runtime library, is 268K. Given that we're serializing to three different archives, including xml, this seems like a very reasonable number to me. Below are a couple of test runs.
Robert Ramey
standard library
  write to file 0.047ms
  read from file 0.016ms
binary_archives
  save archive 0ms
  save archive through pointer 0.015ms
  load archive 0.016ms
  load archive through pointer 0.016ms
text_archives
  save archive 0.062ms
  save archive through pointer 0.047ms
  load archive 0.046ms
  load archive through pointer 0.079ms
xml_archives
  save archive 0.11ms
  save archive through pointer 0.156ms
  load archive 0.187ms
  load archive through pointer 0.203ms
*** No errors detected
Running 1 test case...
standard library
  write to file 0.062ms
  read from file 0.016ms
binary_archives
  save archive 0.016ms
  save archive through pointer 0.015ms
  load archive 0.016ms
  load archive through pointer 0.031ms
text_archives
  save archive 0.094ms
  save archive through pointer 0.078ms
  load archive 0.078ms
  load archive through pointer 0.093ms
xml_archives
  save archive 0.125ms
  save archive through pointer 0.141ms
  load archive 0.172ms
  load archive through pointer 0.219ms
*** No errors detected
Press any key to continue
Running 1 test case...
standard library
  write to file 0.047ms
  read from file 0ms
binary_archives
  save archive 0.016ms
  save archive through pointer 0.015ms
  load archive 0ms
  load archive through pointer 0.016ms
text_archives
  save archive 0.063ms
  save archive through pointer 0.062ms
  load archive 0.063ms
  load archive through pointer 0.093ms
xml_archives
  save archive 0.125ms
  save archive through pointer 0.157ms
  load archive 0.187ms
  load archive through pointer 0.219ms
*** No errors detected
Press any key to continue
[attachments: test_overhead.cpp, A.hpp]
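The per-operation timing described above (total time divided by the number of iterations) can be sketched roughly as follows. This is a minimal sketch and not the attached test_overhead.cpp: the name ms_per_operation is hypothetical, and std::chrono::steady_clock is a C++11 facility assumed here for illustration, newer than the compilers discussed in this thread. A clock with sub-millisecond resolution is one way to avoid readings that round down to 0ms.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: average wall-clock cost of one call to `op`,
// in milliseconds. The total elapsed time is divided by the number
// of iterations, as in the "total time / 1000" method described above.
template <class Op>
double ms_per_operation(Op op, std::size_t iterations)
{
    if (iterations == 0)
        return 0.0; // nothing was run, so avoid dividing by zero

    using clock = std::chrono::steady_clock;
    const clock::time_point start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        op();
    const std::chrono::duration<double, std::milli> total =
        clock::now() - start;

    return total.count() / static_cast<double>(iterations);
}
```

A caller would pass the save or load step as a callable, for example ms_per_operation(save_one_instance, 1000), where save_one_instance stands for whatever operation is being measured.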

I've written a test program to compare the time required by Boost serialization with the time required by stream I/O for a similar operation. I've only used VC 7.1 (release mode), and my focus has been to determine how much overhead the serialization system adds compared to the alternative of not using it. The test consists of saving/loading 1000 instances of a class which includes all primitive types (about 20 members, including strings). Times are calculated by dividing the total time by 1000 to give ms/operation.
These preliminary results don't raise any major red flags.
Just to reiterate my experience: profiling boost::serialization loading under vtune (I cannot recommend this enough for serious performance analysis), using one of our real data files for testing (it weighs in at > 100mb, with thousands of individual instances of objects), showed that the major bottleneck was strcmp, caused by the type_id compare looking up the per-type information used for tracking and the like. This outweighed everything else by a wide margin. I can dig up the details if you're interested, and at some point I will need to look at this again and try to optimise it. This was using 1.32.0, so take the above with a grain of salt *if* this area has changed significantly.
Martin
--
No virus found in this outgoing message. Checked by AVG Free Edition. Version: 7.1.362 / Virus Database: 267.12.8/165 - Release Date: 9/11/2005

Wow - that is very interesting information for me. It turns out that 1.33.x does not use strcmp for this purpose, so maybe it will be faster. This kind of result is very much in line with my experience with profilers. It almost always happens that the bottlenecks turn out to be in the last place I would have looked! So I am anxious to get at least the gcc profiler working for some serialization performance tests. Thanks for this very useful information.
Robert Ramey

Robert Ramey wrote:
Wow - that is very interesting information for me. It turns out that 1.33.x does not use strcmp for this purpose so maybe it will be faster. This kind of result
This is good news. We haven't been able to upgrade to 1.33 yet; hopefully when we upgrade to vc2005 the boost upgrade can follow shortly after.
is very much in line with my experience with profilers. It almost always happens that the bottlenecks turn out to be in the last place I would have looked !!!
Absolutely, profiling is the one true way to find the performance bottlenecks; anything else is an exercise in frustration.
Martin

Martin Slater wrote:
Just to reiterate my experience of profiling boost::serialization loading under vtune (I cannot recommend this enough for serious performance analysis) using one of our real data files for testing (weighs in at > 100mb and thousands of individual instances of objects) that the major bottleneck was strcmp caused by the type_id compare looking up the per type information used for tracking and the such. This outweighed everything else by a wide margin. I can dig up the details if you're interested and at some point I will need to look at this again and try and optimise it. This was using 1.32.0 so take the above with a grain of salt *if* this area has changed significantly.
A couple of observations. I believe the intel machine has hardware instructions which implement strcmp and that compilers support them. So even if strcmp is the bottleneck, I wouldn't expect it to show up on the profiler unless some sort of inlining were turned off. Or maybe the vtune profiler has special provision for these cases somewhere. I did check to verify that the strcmp in the type-id lookup has been removed. Instead we just make sure there is only one instance of a particular extended_type_info record so that we can just compare the addresses. There are still some optimizations to be implemented - but I can't predict how much they will speed up anything.
Robert Ramey
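The idea of keeping a single record per type, so that equality becomes an address comparison rather than a strcmp, can be illustrated with a small sketch. This is not the library's extended_type_info implementation: type_record, record_for, same_type_by_name and same_type_by_address are hypothetical names made up for this example, and the uniqueness shown here is only assumed to hold within a single program image (shared libraries can complicate it).

```cpp
#include <cstring>

// Hypothetical stand-in for a per-type record; not the library's class.
struct type_record {
    const char* name;
};

// One function-local static per instantiation gives one record per T.
// The name argument is only used the first time a given T is requested.
template <class T>
const type_record& record_for(const char* name)
{
    static const type_record instance = { name };
    return instance;
}

// Name-based equality: cost grows with the length of the names.
inline bool same_type_by_name(const type_record& a, const type_record& b)
{
    return std::strcmp(a.name, b.name) == 0;
}

// Identity-based equality: valid only because records are unique.
inline bool same_type_by_address(const type_record& a, const type_record& b)
{
    return &a == &b;
}
```

With unique records both functions agree, but the address comparison is a single pointer compare regardless of how long the type names are.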

On 11/12/05, Robert Ramey <ramey@rrsd.com> wrote: [snip - Martin Slater wrote]
A couple of observations.
I believe the intel machine has hardware instructions which implement strcmp and that compilers support them. So even if strcmp is the bottleneck, I wouldn't expect it to show up on the profiler unless some sort of inlining were turned off. Or maybe the vtune profiler has special provision for these cases somewhere.
If you check for extended_type_info equality using strcmp very often, I can't see why it wouldn't be a bottleneck. I don't know the details of the serialization library, but my guess is that it must find the extended_type_info which matches the given type. Depending on the complexity of your find algorithm, it may perform unnecessary comparisons, which would point to strcmp as the bottleneck.
I did check to verify that the strcmp in the type-id lookup has been removed. Instead we just make sure there is only one instance of a particular extended_type_info record so that we can just compare the addresses. There are still some optimizations to be implemented - but I can't predict how much they will speed up anything.
IMHO, profilers are essential before any work on optimization. Probably lots of optimizations won't even be needed, while others will make the real difference in speed. In my experience, reusing containers (as someone already posted on this mailing list as a possible solution to some bottlenecks in the serialization library) has a huge impact on performance.
Robert Ramey
best regards,
--
Felipe Magno de Almeida
Developer from synergy and Computer Science student from State University of Campinas (UNICAMP).
Unicamp: http://www.ic.unicamp.br
Synergy: http://www.synergy.com.br
"There is no dark side of the moon really. Matter of fact it's all dark."
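To make the point about the find algorithm concrete, here is a minimal sketch that counts string comparisons for two lookup strategies. It says nothing about how the serialization library actually stores its type records: linear_find and sorted_find are hypothetical helpers, and the sorted variant assumes the names are already ordered by strcmp.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Linear scan: one strcmp per element until a match is found.
// `calls` is incremented once per comparison performed.
inline bool linear_find(const std::vector<const char*>& names,
                        const char* key, std::size_t& calls)
{
    for (const char* n : names) {
        ++calls;
        if (std::strcmp(n, key) == 0)
            return true;
    }
    return false;
}

// Binary search over names sorted by strcmp order: the number of
// comparisons grows with the logarithm of the container size.
inline bool sorted_find(const std::vector<const char*>& sorted,
                        const char* key, std::size_t& calls)
{
    return std::binary_search(
        sorted.begin(), sorted.end(), key,
        [&calls](const char* a, const char* b) {
            ++calls;
            return std::strcmp(a, b) < 0;
        });
}
```

Looking up the last of N sorted names costs N comparisons with the linear scan and far fewer with the binary search, which is the kind of difference a profiler would attribute to strcmp.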

I believe the intel machine has hardware instructions which implement strcmp and that compilers support them. So even if strcmp is the bottleneck, I wouldn't expect it to show up on the profiler unless some sort of inlining were turned off. Or maybe the vtune profiler has special provision for these cases somewhere.
VTune doesn't have any special provision for this; if a function is inlined, it will just show up in the function it was inlined into. Under VC, strcmp is by default just a regular function call, but you can enable it as a compiler intrinsic (#pragma intrinsic(strcmp)) and the compiler may well then generate much better code (I know enabling memcpy this way can reduce memcpy(&a, &b, sizeof(int)) to a simple register mov in places).
I did check to verify that the strcmp in the type-id lookup has been removed. Instead we just make sure there is only one instance of a particular extended_type_info record so that we can just compare the addresses. There are still some optimizations
This is very good, I was looking at how to do this myself so am very happy I now don't have to;)
to be implemented - but I can't predict how much they will speed up anything.
Prediction in optimisation I have found to be nigh on impossible. Without a profiler, or at the very least extremely heavy instrumentation within your code, you will always get a shock as to where the time is spent. VC6 was a nightmare in this regard; for example, it would not inline some trivial functions without being given __forceinline for that function, causing some potentially extremely fast code to run pathetically slowly.
If you're interested in vtune, they do an evaluation version at https://registrationcenter.intel.com/EvalCenter/EvalForm.aspx?ProductID=319 It is simply the best profiling tool I have ever used.
I'd be more than happy to help out with any profiling and optimisation I can.
Martin.
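The fixed-size memcpy case mentioned above can be sketched like this. It is only an illustration: float_bits is a hypothetical name, the sketch assumes a 32-bit IEEE 754 float (checked by the static_asserts), and whether a particular compiler really reduces the call to a single register move depends on the compiler and its flags, which is not verified here.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");
static_assert(std::numeric_limits<float>::is_iec559,
              "sketch assumes IEEE 754 floats");

// Copies the object representation of a float into an integer.
// The size is a compile-time constant, so a compiler that treats
// memcpy as an intrinsic is free to emit a plain move instead of a call.
inline std::uint32_t float_bits(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

Inspecting the generated assembly, or the profiler's view of the calling function, is the way to confirm what a given build actually does.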


Martin Slater wrote:
I believe the intel machine has hardware instructions which implement strcmp and that compilers support them. So even if strcmp is the bottleneck, I wouldn't expect it to show up on the profiler unless some sort of inlining were turned off. Or maybe the vtune profiler has special provision for these cases somewhere.
VTune doesn't have any special provision for this, if a function is inlined it will just show up in the function it inlined it to. Under VC by default strcmp is just a regular function call but you can enable it as a compiler intrinsic (#pragma intrinsic(strcmp) ) and the compiler may well then generate much better code (I know enabling memcpy this way can reduce memcpy(&a, &b, sizeof(int)) to a simple register mov in places).
Hmmm - then the fact it showed up on the profiler suggests that the program wasn't compiled with full optimisation? You might want to expand upon this.
If you're interested in vtune they do an evaluation version at https://registrationcenter.intel.com/EvalCenter/EvalForm.aspx?ProductID=319 It is simply the best profiling tool I have ever used.
At one time I had the intel eval compiler installed and it was very good. My license expired and I just didn't have the incentive to actually pay for it. Too bad; I would have liked to have it in my test suite.
I'd be more than happy to help out with any profiling and optimisation I can.
Well, you'll get your chance pretty soon. Soon I'll be checking my test_overhead program into the development tree. I think I can pass the compiler switches to get it to generate a profile - at least for gcc - but I'm struggling to figure out how to get bjam to invoke gprof to display the profile in the output and to make sure I can see it in the test matrix. So you may get your chance to make this work for vtune. I'm surprised that profiling/benchmarking isn't commonly part of the test suites of boost libraries.
Robert Ramey

Hmmm - then the fact it showed up on the profiler suggests that the program wasn't compiled with full optimisation? You might want to expand upon this.
This may well be out of our control. IIRC it was coming from the crt library provided with VC; in user code you can control it via the necessary pragma, but asking users to rebuild the crt library is a bit unreasonable ;) Anyway, if this test has been reduced to a pointer/pointer comparison then all should be good.
If you're interested in vtune they do an evaluation version at https://registrationcenter.intel.com/EvalCenter/EvalForm.aspx?ProductID=319 It is simply the best profiling tool I have ever used.
At one time I had the intel eval compiler installed and it was very good. My license expired and I just didn't have the incentive to actually pay for it. Too bad I would have liked to have it my test suite.
This is just an eval for vtune, still not cheap though.
I'd be more than happy to help out with any profiling and optimisation I can.
Well, you'll get your chance pretty soon. Soon I'll be checking in my test_overhead program into the development tree. I think I can pass the compiler switches to get it to generate a profile - at least for gcc - but I'm struggling to figure out how to get bjam to invoke gprof to display the profile in the output and to make sure I can see it in the test matrix. So you may get your chance to make this work for vtune. I'm surprised that profiling / benchmarking isn't commonly part of the test suites of boost libraries.
Cool, I'll jump in as soon as you have it uploaded.
cheers
Martin

Some recent tests I've done that compare Ebenezer Enterprises and Boost.serialization's performance show the Boost.serialization approach to be 7 to 9 times slower than the Ebenezer Enterprises approach. I'm using Linux 2.6.12, gcc 4.0.2 (with -O3) and a Boost.serialization library from the release tree. A test that compared the times to serialize/send a list<int> took 7.4 times longer with Boost than Ebenezer. I timed the following:
  Boost:    oArch & lst;             // oArch is a binary_oarchive
  Ebenezer: msgs.Send(buffer, lst);  // In order to compare apples to apples,
                                     // I removed a section of code at the end
                                     // of the Send function that flushes the buffer.
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This seems wrong to me; comparing library to library as written, and as the user sees it, would be much more interesting, and to me this makes the test pointless. If you can really just remove the flush, then it probably shouldn't be there in the first place, or it should be configurable by the user on an archive-by-archive basis.
In a second test I added a deque of int...
  Boost:    oArch & lst; oArch & dq; // std::deque<int> dq;
  Ebenezer: msgs.Send(buffer, lst, dq);
In this case Boost took 9.1 times longer than Ebenezer Enterprises approach. If you want to run these tests with other compilers I think that would be helpful. I've been warned not to put too much emphasis on numbers from gcc.
It would be interesting to see what the results are with a vanilla version of your library, as well as a feature-by-feature comparison, to see if you really are comparing apples to apples.
Martin
participants (4)
- David Abrahams
- Felipe Magno de Almeida
- Martin Slater
- Robert Ramey