
Despite saying I would no longer participate in this thread, I re-read Robert's recent list of queries and was struck by the thought that most, if not all, of it is based on accidental misreadings of some previous posts, combined with some completely understandable misunderstandings about performance issues that are well known in the HPC domain (since clarified elsewhere by some benchmarks). So, in the spirit of perhaps rescuing the situation, I reply again:

Robert Ramey wrote:
To summarize how we arrived here
================================
a) Matthias augmented binary_?archive to replace element-by-element serialization of primitive types with save/load_binary for C++ arrays, std::vector and std::valarray. This resulted in a 10x speedup of the serialization process.
Right. If you see some function in the profile that is called 10,000 times and is a bottleneck, which is better? Optimize that function a bit and get a factor 2 speedup (see http://lists.boost.org/Archives/boost/2005/11/97156.php), or change the calling sequence so that that function is only called once, and get a factor 10 speedup?
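To make the tradeoff concrete, here is a schematic sketch (my own illustration, not the library's actual code) of the two calling sequences:

#include <cstddef>
#include <ostream>

// Element-by-element: enters the stream's write() once per element,
// paying the per-call overhead 10,000 times for a 10,000-element array.
void save_elementwise(std::ostream& os, const double* data, std::size_t n)
{
    for (std::size_t i = 0; i != n; ++i)
        os.write(reinterpret_cast<const char*>(&data[i]), sizeof(double));
}

// Whole-array: one write() call that is 10,000 times longer; the
// per-call overhead is paid exactly once.
void save_array(std::ostream& os, const double* data, std::size_t n)
{
    os.write(reinterpret_cast<const char*>(data), n * sizeof(double));
}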
b) From this it has been concluded that binary archives should be enhanced to provide this facility automatically and transparently to the user.
Right. From the point of view of a user, this is completely analogous to, say, a standard library implementor optimizing std::copy(container.begin(), container.end(), ostream_iterator<T>(stream)). If he/she did this, would the user be interested in *disabling* that optimization? What would be the point? But note, 'should' here is really 'could': the current proposal from David Abrahams explicitly does *not* modify any of the existing archives.
c) The structure of the library and the documentation suggest that the convenient way to do this is to specify an overload for each combination of archive/type which can benefit from special treatment.
As I understand it (someone please correct me if I got this wrong), the core idea of the new proposal (from http://lists.boost.org/Archives/boost/2005/11/96923.php and followups) is to provide a single point of customization that *serialization function authors* can utilize to serialize an array in one call. Of course making use of this hook is optional, but since it is also a good convenience function (it saves a couple of lines by avoiding having to manually code a loop), there isn't really any point in not using it. Note that under the proposal in http://lists.boost.org/Archives/boost/2005/11/96923.php this 'hook' may not even be part of the serialization library proper; the point is that *it must be globally accessible* and not specific to a particular archive.
From http://lists.boost.org/Archives/boost/2005/11/96923.php :
David Abrahams wrote:
| In an upcoming message I'm planning to start by describing the least
| intrusive design that could possibly work -- one that makes no changes
| at all to the existing serialization library. It's not a bad design,
| but it has a few drawbacks that I'd like to discuss. At that point it
| should become clear what I meant about "hijacking" the serialization
| library. Finally, I'll describe the smallest set of changes to the
| existing serialization library that would be needed to address those
| drawbacks.
|
| Just to state the obvious, I hope to convince you that those few
| changes are worth making. Of course, as the maintainer of
| Boost.Serialization, it's completely up to you whether to do so.

Moving on to your next point:
d) The above (c) is deemed inconvenient because it has been supposed that many archive classes will share a common implementation of load/save array. This would suggest that using (c) above, though simple and straightforward, will result in code repetition.
I think this must have resulted from a misunderstanding.

1. As far as I can tell, the proposal is completely consistent with (c) above. The question of how the save/load_array is actually dispatched to the archive is a detail that has relevance *only to archive implementors*. Obviously different dispatch mechanisms have different tradeoffs, as discussed elsewhere on this thread, but none of this should be visible to people who are not archive authors.

2. I don't recall seeing anyone suggest that different archive types would be able to share a common implementation of load/save array (except for the trivial case where an archive has no array support and instead uses a default implementation that serializes each element in a loop!). Can you cite where you saw this, so that someone can clarify it?

3. Without a single 'hook' to use when writing serialization functions, the only alternative is to specialize each serialization function that can make use of array optimizations *separately for each archive*. For example, an MPI archive might provide a helper function Save_MPI_Datatype(void* buffer, size_t count, MPI_Datatype type) and require serialization function authors to call this member. An XDR archive might provide a function SaveArray(T* Begin, T* End). With no cooperation between the authors of the archives on a common function for array serialization, the poor user will have to write two sets of serialization functions, one that calls Save_MPI_Datatype() and one that calls SaveArray(). I hope it is obvious to you that this situation is completely untenable! A sketch of the resulting duplication follows below.
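(All the archive and container types below are hypothetical, following the examples just given; this is only an illustration of the duplication, not anyone's actual code.)

#include <cstddef>

// Two hypothetical archives, each exposing its own array primitive;
// neither knows anything about the other's interface.
struct mpi_oarchive
{
    // int stands in for MPI_Datatype to keep the sketch self-contained
    void Save_MPI_Datatype(const void* buffer, std::size_t count, int datatype);
};

struct xdr_oarchive
{
    template <class T> void SaveArray(const T* begin, const T* end);
};

// A user's container type...
template <class T>
struct my_vector
{
    const T* data;
    std::size_t size;
};

// ...now needs one serialization function per archive:
template <class T>
void save(mpi_oarchive& ar, const my_vector<T>& v)
{
    ar.Save_MPI_Datatype(v.data, v.size, /* MPI datatype for T */ 0);
}

template <class T>
void save(xdr_oarchive& ar, const my_vector<T>& v)
{
    ar.SaveArray(v.data, v.data + v.size);
}
// ...and yet another overload for every new archive that appears.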
e) So it has been proposed binary_iarchive be re-implemented in the following way
iarchive - containing default implementation of load_array
binary_iarchive - ? presumably contains implementation of load_array in terms of the currently defined load_binary
This was part of the original (pre Nov 19) proposal and is no longer relevant.
It's not clear whether all archives would be modified in this way or just binary_iarchive.
What specific implementation of load_array would you suggest for the other existing archive types?
The idea is that each type which can benefit from load_array can call it and the version of load_array corresponding to that particular archive will be invoked. This will require
i) the serialization function for types which can benefit from some load_array function would call it.
Right. The set of types that can benefit is very dependent on the archive, but in general it is anything that looks like a container.
ii) Only a small number of load_array functions would have to be written for each archive. So the number of special functions to be written would be one for each type which might use load_array and "one" for each archive.
Right.
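For the archive side, here is a minimal sketch of what the "one per archive" piece might look like. Free functions stand in for whatever dispatch mechanism is finally chosen; only load_binary is an existing member of the binary archives, the rest is my assumption:

#include <cstddef>
#include <boost/archive/binary_iarchive.hpp>

// A binary archive can implement its load_array in one line, in terms
// of the load_binary it already has:
template <class T>
void load_array(boost::archive::binary_iarchive& ar, T* begin, std::size_t count)
{
    ar.load_binary(begin, count * sizeof(T));  // one bulk read
}

// Archives without array support keep the default: an element loop.
template <class Archive, class T>
void load_array(Archive& ar, T* begin, std::size_t count)
{
    for (std::size_t i = 0; i != count; ++i)
        ar >> begin[i];
}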
Problems with the Design
========================
a) It doesn't address the root cause of "slow" performance of binary archives.
The main problem is that it doesn't address the cause of the 10x speedup. It's a classic case of premature optimization. The 10x speedup was based on a test program. For a C++ array, the test boils down to replacing 10,000 invocations of the stream's write(...) with one invocation of write that is 10,000 times longer.
I think this issue has been covered elsewhere, including benchmarks. If you can see any remaining problems with the claim that the root cause of the poor performance is the multiple calls to the buffer's write(), and that even using a specialized buffer *does not significantly help*, then please say so. [snip]
I would be surprised to see if the 10x speed up still exists with this "buffered_archive".
As shown elsewhere, you are surprised.
note that for the intended application - MPI communication - some archive which doesn't use stream I/O has to be created anyway.
Right. In this case, the write() to the stream is replaced either by an immediate call to MPI_Send(data, size, ...), or by a mechanism that constructs a derived MPI_Datatype, which essentially records a map of which memory locations need to be sent and passes this directly to MPI. In both cases, the cost of not utilizing array operations would not be 10x, but more like 1000x or more.
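For illustration, a sketch of the first alternative using the standard MPI C API (error handling omitted; MPI_Init assumed to have been called):

#include <mpi.h>

// One call transfers the whole array; MPI handles any packing internally.
void send_array(double* data, int count, int dest, int tag)
{
    MPI_Send(data, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
}

// The element-wise alternative pays the full per-message latency for
// every element - this is where a 1000x-or-worse slowdown comes from.
void send_elementwise(double* data, int count, int dest, int tag)
{
    for (int i = 0; i != count; ++i)
        MPI_Send(&data[i], 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
}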
b) re-implementation of binary_archive in such a way as not to break existing archives would be an error-prone process. Switching between the new and old methods "should" result in exactly the same byte sequence. But it could easily occur that a small, subtle change might render archives created under the previous binary_archive unreadable.
Again, this is not part of the revised proposal. But is the binary serialization really *that* fragile, that this is a significant concern? It suggests that binary archives using the boost serialization lib would be really hard to write correctly (much harder than my experience suggests)!
c) The premise that one will save a lot of coding (see d) above) compared to the current method of overloading based on the pair of archive/type is overly optimistic. This is explained in Peter Dimov's post here:
I think this is based on the same misunderstanding as point (d) above? Anyway, what do you mean by "current method"? Can you describe how I should write my single serialization function (hopefully just one of them!) for my_matrix_type using the "current method"?
I'm aware this is speculative. I haven't investigated MPI, XDR and others enough to know how much code sharing is possible. It does seem that there will be no sharing with the "fast binary archive" of the previous submission. From the short descriptions of MPI I've seen on this list, along with my cursory investigation of XDR, I'm doubtful that there is any sharing there either.
Again, I think this is based on the same misunderstanding. Archives will typically not share implementations of array processing.
Conclusions
===========

a) The proposal suffers from "premature optimization". A large amount of design effort has been expended on areas which are likely not the source of the observed performance bottlenecks.
Not true, as already explained here and elsewhere.
b) The proposal suffers from "over-generalization". The attempt to generalize results in a much more complex system. Such a system will result in a net loss of conceptual integrity and implementation transparency. The claim that this generalization will actually result in a reduction of code is not convincing.
I think this is based on the same misunderstanding as point (d) above?
c) by re-implementing a currently existing and used archive, it risks creating a maintenance headache for no real benefit.
Again, this is not part of the current proposal. This issue is not very important because most of the interesting uses for array optimizations are not based on archives currently in the serialization lib. Whether existing archives make use of array optimizations is a side-issue that can be discussed if/when the necessary array hooks actually exist.
Suggestions
===========
a) Do more work in finding the speed bottlenecks. Run a profiler. Make a buffer-based, non-stream-based archive and re-run your tests.
Done, see
http://lists.boost.org/Archives/boost/2005/11/97166.php
http://lists.boost.org/Archives/boost/2005/11/97156.php
For the people here who have experience in this problem domain, the results are completely obvious. In hindsight it is equally obvious that someone from a different background would not be aware of this! Sorry.
b) Make your MPI, XDR and whatever archives. Determine how much opportunity for code sharing is really available.
I think Matthias already has a prototype MPI archive as well as an XDR archive? But this point is again based on the misunderstanding of point (d) above. I doubt there is any possibility of code-sharing between the load/save_array functions of MPI and XDR *at all*; the point is that I, as a user, want to be able to write just a single serialization function for my_matrix_type that will make use of whatever array optimizations the archive I am using can provide. Something like the sketch below:
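(Here save_array is a stand-in for whatever name the common hook finally gets, and my_matrix_type's interface is invented for the example; this is a sketch, not the proposal's actual code.)

#include <cstddef>

// Declaration of the hypothetical common hook (provided elsewhere).
template <class Archive, class T>
void save_array(Archive& ar, const T* begin, std::size_t count);

// Invented interface, for illustration only.
template <class T>
class my_matrix_type
{
public:
    std::size_t rows() const;
    std::size_t cols() const;
    const T* data() const;   // contiguous storage
};

// The one and only serialization function the user writes.  save_array
// is the common hook: an MPI archive maps it onto an MPI_Datatype, an
// XDR archive onto its array primitive, and any other archive falls
// back to a default element-by-element loop.
template <class Archive, class T>
void save(Archive& ar, const my_matrix_type<T>& m, const unsigned int /*version*/)
{
    std::size_t rows = m.rows();
    std::size_t cols = m.cols();
    ar << rows << cols;
    save_array(ar, m.data(), rows * cols);
}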
c) If you still believe your proposal has merit, make your own "optimized binary archive". Don't derive from binary_archive but rather from common_?archive or perhaps basic_binary_archive. In this way you will have a totally free hand and won't have to achieve consensus with the rest of us, which will save us all a huge amount of time.
I don't understand what this means.

Regards,
Ian