
Matthias Troyer wrote:
Hi Robert,
I'll let Dave comment on the parts where you review his proposal, and will focus on the performance.
On Nov 24, 2005, at 6:59 PM, Robert Ramey wrote:
a) It doesn't address the root cause of "slow" performance of binary archives.
I ran the benchmarks you asked for last night (see below), and they indeed show that the root cause of the slow performance is the individual writing of many small elements instead of "block-writing" the array in a single call to something like save_array.
b) re-implementation of binary_archive in such a way as not to break existing archives would be an error-prone process. Switching between the new and old methods "should" result in exactly the same byte sequence, but a small, subtle change could easily render archives created under the previous binary_archive unreadable.
Dave's design does not change anything in your archives or serialization functions, but only adds an additional binary archive using save_array and load_array.
Hmm - that's not the way I read it. I've touched on this in another post.
c) The premise that one will save a lot of coding (see d) above) compared to the current method of overloading based on the archive/type pair is overly optimistic.
Actually I have implemented two new archive classes (MPI and XDR) which can profit from it, and it does avoid a lot of code duplication. All of the serialization functions for types that can make use of such an optimization can be shared among all these archive types. In addition, formats such as HDF5 and netCDF have been mentioned, which can reuse the *same* serialization function to achieve optimal performance.
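[Editor's sketch, for readers following the thread: one shape such sharing could take. The trait has_save_array and the hook save_array are placeholder names for illustration, not existing Boost.Serialization interfaces or the exact names of the proposal.]

    #include <cstddef>
    #include <vector>

    // Hypothetical trait: which archive types provide a block save_array() hook.
    // A real design would specialize this per archive (binary, MPI, XDR, ...).
    template <class Archive>
    struct has_save_array { static const bool value = false; };

    template <bool HasHook> struct save_vector_impl;

    // Block path: one call covering the whole contiguous array.
    template <> struct save_vector_impl<true> {
        template <class Archive, class T>
        static void apply(Archive& ar, const std::vector<T>& v) {
            if (!v.empty()) ar.save_array(&v[0], v.size());
        }
    };

    // Fallback path: element by element, works with every archive.
    template <> struct save_vector_impl<false> {
        template <class Archive, class T>
        static void apply(Archive& ar, const std::vector<T>& v) {
            for (std::size_t i = 0; i < v.size(); ++i) ar << v[i];
        }
    };

    // The single, shared serialization function: every archive type that
    // specializes has_save_array<> picks up the fast path automatically.
    // (Writing the element count is omitted here for brevity.)
    template <class Archive, class T>
    void save(Archive& ar, const std::vector<T>& v) {
        save_vector_impl<has_save_array<Archive>::value>::apply(ar, v);
    }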
There is nothing "optimistic" here since we have the actual implementations, which show that code duplication can be avoided.
OK - I can really only comment on that which I've seen.
Conclusions
===========

a) The proposal suffers from "premature optimization". A large amount of design effort has been expended on areas which are likely not the source of observed performance bottlenecks.
As Dave pointed out, one main reason for a save_array/load_array or save_sequence/load_sequence hook is to utilize existing APIs for serialization (including message passing) that provide optimized functions for arrays of contiguous data. Examples include MPI, PVM, XDR, and HDF5. All of these libraries provide special functions for contiguous arrays for a well-established reason: they all observed the same bottlenecks. These bottlenecks have been well known in high-performance computing for decades, and they are what caused all these APIs to include special support for contiguous arrays of data.
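[Editor's sketch of why those bulk APIs matter here: an array-aware archive could forward an entire contiguous block to a single MPI_Send instead of issuing one send per element. The archive class below is purely hypothetical; only MPI_Send itself is a real MPI call.]

    #include <mpi.h>
    #include <cstddef>

    // Hypothetical MPI output "archive": sends serialized data to a peer rank.
    class mpi_toy_oarchive {
    public:
        mpi_toy_oarchive(MPI_Comm comm, int dest) : comm_(comm), dest_(dest) {}

        // Per-element path: one MPI_Send per double, i.e. one message and
        // one latency cost per element.
        void save(double d) {
            MPI_Send(&d, 1, MPI_DOUBLE, dest_, 0, comm_);
        }

        // Block path: the whole contiguous array in a single MPI_Send,
        // letting MPI transfer it as one message.
        void save_array(const double* p, std::size_t n) {
            MPI_Send(const_cast<double*>(p), static_cast<int>(n),
                     MPI_DOUBLE, dest_, 0, comm_);
        }

    private:
        MPI_Comm comm_;
        int dest_;
    };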
I admit I'm skeptical of the benefits, but I've not disputed that someone should be able to do this without a problem. The difference lies in where the implementation should be placed.
b) The proposal suffers from "over-generalization". The attempt to generalize results in a much more complex system. Such a system will result in a net loss of conceptual integrity and implementation transparency. The claim that this generalization will actually result in a reduction of code is not convincing.
I'm confused by your statement. Actually the implementations of fast binary archives, MPI archives and XDR archives do share common serialization functions, and this does indeed result in code reduction and avoids code duplication.
Upon reflection - I think I would prefer the term "premature generalization". I concede that's speculation on my part. It seems a lot of effort has been invested to avoid the MxN problem. My own experiments with bitwise_array_archive_adaptor have failed to convince me that the library needs more API to deal with this problem. Shortly, I will be uploading some code which perhaps will make my reasons for this belief more obvious.
c) By re-implementing a currently existing and used archive, it risks creating a maintenance headache for no real benefit.
To avoid any such potential problems Dave proposed to add a new archive in an array sub namespace.
As I said - that's not how I understood it.
I guess that alleviates your concerns? Also, a 10x speedup might not be a benefit for you and your applications but as you can see from postings here, it is a concern for many others.
LOL - No one has ever disputed the utility of a 10x speed up. The question is how best to achieve it without creating a ripple of side effects.
Suggestions
===========
a) Do more work in finding the speed bottlenecks. Run a profiler. Make a buffer-based, non-stream-based archive and re-run your tests.
I have attached a benchmark for such an archive class and ran benchmarks for std::vector<char> serialization. Here are the numbers (using gcc-4 on a Powerbook G4):
Time using serialization library:          13.37
Time using direct calls to save in a loop: 13.12
Time using direct call to save_array:       0.4
In this case the buffer initially had size 0 and needed to be resized during the insertions. Here are the numbers for the case where enough memory has been set aside up front with reserve():
Time using serialization library:          12.61
Time using direct calls to save in a loop: 12.31
Time using direct call to save_array:       0.35
And here are the numbers for std::vector<double>, using a vector of 1/8th the size:
Time using serialization library:          1.95
Time using direct calls to save in a loop: 1.93
Time using direct call to save_array:      0.37
Since there are fewer calls for these larger types it looks slightly better, but even now there is a more than 5x difference in this benchmark.
As you can see the overhead of the serialization library (less than 2%) is insignificant compared to the cost of doing lots of individual insertion operations into the buffer instead of one big one. The bottleneck is thus clearly the many calls to save() instead of a single call to save_array().
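[Editor's sketch of the two measured paths, for readers without the attachment: this is not the benchmark code itself, just an illustration of "many small inserts" versus "one big insert" into an in-memory buffer.]

    #include <cstddef>
    #include <vector>

    // In-memory output buffer, as in a non-stream-based archive.
    struct buffer_oarchive {
        std::vector<char> buffer;

        // Path timed as "save in a loop": one small append per element.
        template <class T>
        void save(const T& t) {
            const char* p = reinterpret_cast<const char*>(&t);
            buffer.insert(buffer.end(), p, p + sizeof(T));
        }

        // Path timed as "save_array": one append for the whole block.
        template <class T>
        void save_array(const T* t, std::size_t n) {
            const char* p = reinterpret_cast<const char*>(t);
            buffer.insert(buffer.end(), p, p + n * sizeof(T));
        }
    };

    void demo(const std::vector<double>& v) {
        buffer_oarchive a1, a2;
        for (std::size_t i = 0; i < v.size(); ++i)   // many small inserts
            a1.save(v[i]);
        if (!v.empty())
            a2.save_array(&v[0], v.size());          // one big insert
    }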
Well, this is interesting data. The call to save() resolves inline to fetching an element from std::vector and stuffing the value into the buffer. I wonder how much of this cost is in std::vector and how much is in the save to the buffer. It does diminish my skepticism about how much benefit array serialization would provide, at least in these specific cases. So I'll concede that this will be a useful facility for a significant group of users. Now we can focus on how to implement it with minimal collateral damage.
b) Make your MPI, XDR and whatever archives. Determine how much opportunity for code sharing is really available.
This has been done, and it is the reason for the proposal to introduce something like the save_array/load_array functions. I have coded an XDR archive and two different types of MPI archives (one using a buffer, the other not). A single serialization function for std::valarray, using the load_array hook, suffices to exploit the optimized APIs in MPI and XDR as well as a faster binary archive, and the same is true for other types.
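[Editor's sketch of what such a single, shared function could look like on the loading side. load_array is a placeholder name for the proposed hook, and the count-prefix convention is assumed purely for illustration.]

    #include <cstddef>
    #include <valarray>

    // One shared load function: any archive type that provides a block
    // load_array(ptr, count) hook -- an MPI archive, an XDR archive, a fast
    // binary archive, ... -- can use it, so the function is written once
    // rather than once per archive type.
    template <class Archive, class T>
    void load(Archive& ar, std::valarray<T>& v) {
        std::size_t n = 0;
        ar >> n;                       // element count, stored by the matching save()
        v.resize(n);
        if (n != 0)
            ar.load_array(&v[0], n);   // one bulk read / receive / decode
    }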
c) If you still believe your proposal has merit, make your own "optimized binary archive". Don't derive from binary_archive but rather from common_?archive or perhaps basic_binary_archive. In this way you will have a totally free hand and won't have to achieve consensus with the rest of us, which will save us all a huge amount of time.
I'm confused. I realize that one should not derive from binary_iarchive, but why should one not derive from binary_iarchive_impl?
What I meant is if you don't change the current binary_i/oarchive implementation you won't have to worry about backward compatibility with any existing archives. I (mis?)understood the proposal to include adjustments to the current implementation so that it could be derived from.
Also, following Dave's proposal, none of your archives is touched; instead, additional faster ones are provided.
This wasn't clear to me from my reading of the proposal. Robert Ramey