
Hi all, This message is a detailed reply to the analysis and comments by Joaquín M López Muñoz. The most challenging request, which required quite significant work, is concerning memory consumption and performance of augmented B+ trees. This aspect is quite broad. It is covered in two new sections of the updated documentations: 1) http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/bp_trees.html , the section “Space requirement“ ; 2) http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/performance.html , the section “Effect of augmenting on performance of B+ trees” ; The last document with performance measurements has been modified quite considerably. It now includes tests of all variants of proposed B+ trees and boost::multi_index_container as multiset. Please have a look at the updated documentation and my replies in the text below and let me know if I missed anything significant. On Sat, Dec 3, 2011 at 11:25 PM, Joaquin M Lopez Munoz <joaquin@tid.es>wrote:
Vadim Stadnik <vadimstdk <at> gmail.com> writes:
Hi all,
Advanced data structures (ADS) offer useful STL extensions that are not possible or not efficient with basic data structures, such as arrays, linked lists and red-black trees.
The project
https://github.com/vstadnik/stl_ext_adv
http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/index.html
focuses mainly on a generalized and unified computer representation of sequences using the augmented B+ trees. It provides the following STL extensions [...]
Hi,
This is a rough study of the functionality provided by your lib in general and as compared with Boost.MultiIndex.
* Container_bptree's internal data structure relies on B+ trees with two variants based on the "augmented data structure" idea explained in Cormen, Leiserson, Rivest and Stein's "Algorithms" (an online explanation can be found at http://tinyurl.com/d8lze7v , for instance.) Briefly put, such augmenting scheme boils down to adding additional information to the nodes updated on insertion time and erasure time which is based on the additional info of the sibling nodes. The variants are: - bp_tree_idx: augmented to allow for efficient indexing - bp_tree_idx_acc: doubly augmented to support efficient indexing plus a generalized "accumulate" notion usable for calculating summatories, mean values, etc.
* Question #1: why augmenting on B+ trees? The augmenting idea can be as easily built on top of RB-trees, which are the weapon of choice for implementing associative containers. Unless proved otherwise, RB-trees are assumed to be faster than B+ trees for in-memory structures (which your own tests hint at, cf. "insert N random elements" table.) You say:
"The red black trees are yet out of the scope of this project for the reasons explained in the documentation."
I've failed to find such explanation. Also, I suspect memory consumption is higher with B+ trees than RB trees. - Request #1: explain why B+ trees rather than RB trees.
That’s my fault, I should have added to the original message reference to this document: http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/bp_trees.html The section “Advantages of B+ trees“ briefly covers the topic. - Request #2: provide code for performance tests so that
Boost.MultiIndex and oher data structures can be compared.
The newly added file “perform_tests.hpp” contains the code of test algorithms. My tests and timer are Windows specific, I am planning to replace the timer with one of Boost timers and update this code. I have added performance measurements for Boost multi_index_container as mutiset to the document mentioned at the top of this message.
- Request #3: augment the performance tests so as to yield memory consumption figures.
This request is covered in two new sections of the updated documentation as explained at the beginning of this message.
* Usage scenario 1: Plain B+ trees as a replacement for RB tress for associative containers. My hunch is that RB-trees would perform better, as explained above.
I agree absolutely, Replacing red black trees with the augmented B+ trees is only justified if it is beneficial for a user algorithm, such as using a new extended functionality or a new efficient algorithm.
* Usage scenario 2: sequence-like containers with efficient insertion/erasure. Directly based on bp_tree_idx, with a do-nothing sorting critera. These containers are not truly random-access, but rather provide O(log n) access, which can be acceptable (or not) for some applications.
This is correct, one operation supporting random access iterators has logarithmic cost in the worst case. The new sequences are a good choice when a user algorithm requires efficient insert, erase operations, which outperform std::vector as shown in performance tests. They can also replace inefficient bidirectional iterators of std::list.
* Usage scenario 3: associative container with random access iterators. This roughly compares with a multi_index_container equipped with two indices, one ordered, the other random-access. Basic differences between both: - The multi_index_container provides truly O(1) random access whereas Container_bptree is O(logN) (which is not random access strictly speaking.) - With the multi_index_container, sorting order and insertion order are separate --some sort of synchronization between both indices should be needed to mimic the Continer_bptree case. - Insertion/erasure with the multi_index_container is linear (though much faster than with std::vector), logarithmic with Container_bptree. - The memory overhead introduced by the multi_index_container is two pointers vs. only one word for Container_bptree.
The Boost Multi-index library offers a very rich functionality. My knowledge of this library is not yet sufficient to make specific suggestions. Nevertheless, I think that this library can benefit from the proposed B+ trees and containers as well as from the method of augmenting applied to other data structures. I will be happy to help in this area.
* Usage scenario 4: containers with fast accumulate, a direct application of the augmenting idea explained above. There is nothing in Boost that provides this functionality.
Thank you for confirming this, since in one of previous messages there was information that logarithmic efficiency of the standard function accumulate() is already supported by Boost library.
My conclusions:
1. Augmented data structures (of which efficient indexing is just a particular application) are definitely useful and worth having in Boost. 2. Unless proved otherwise, we should be using RB trees (or maybe AVL) rather than B+ trees. 3. Depending on the particular needs, some usage scenarios can be better served with Boost.MultiIndex (note I'm saying "depending on needs", since differences between approaches are substantial.) 4 To me, rather than providing out of the box containers with augmenting we should provide this via two libraries: - As an additional capability of Boost.Intrusive associative containers (and AVL-tree based containers as well, if technically possible.) - As an additional capability of Boost.MultiIndex ordered indices.
Hope this helps,
Joaquín M López Muñoz Telefónica Digital
Thank you for your analysis and conclusions. I am happy for the Boost experts to decide how to use the proposed data structures and containers. To start work I need a list of tasks and/or requests with priorities. Regards, Vadim