Re: [boost] [stl_ext_adv] ] Interest in advanced data structures

8 Dec 2011

      Hi all,

This message is a detailed reply to the analysis and comments by Joaquín M
López Muñoz.

The most challenging request, which required quite significant work, is
concerning memory consumption and performance of augmented B+ trees. This
aspect is quite broad. It is covered in two new sections of the updated
documentations:

1) http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/bp_trees.html , the
section “Space requirement“ ;

2) http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/performance.html , the
section “Effect of augmenting on performance of B+ trees” ;

The last document with performance measurements has been modified quite
considerably. It now includes tests of all variants of proposed B+ trees
and boost::multi_index_container as multiset.

Please have a look at the updated documentation and my replies in the text
below and let me know if I missed anything significant.

On Sat, Dec 3, 2011 at 11:25 PM, Joaquin M Lopez Munoz <joaquin@tid.es>wrote:
...
Vadim Stadnik <vadimstdk <at> gmail.com> writes:
...
Hi all,
Advanced data structures (ADS) offer useful STL extensions that are not
possible or not efficient with basic data structures, such as arrays,
linked lists and red-black trees.
The project
https://github.com/vstadnik/stl_ext_adv
http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/index.html
focuses mainly on a generalized and unified computer representation of
sequences using the augmented B+ trees. It provides the following STL
extensions [...]
Hi,
This is a rough study of the functionality provided by your lib in
general and as compared with Boost.MultiIndex.
* Container_bptree's internal data structure relies on B+
trees with two variants based on the "augmented data structure"
idea explained in Cormen, Leiserson, Rivest and Stein's "Algorithms"
(an online explanation can be found at http://tinyurl.com/d8lze7v ,
for instance.) Briefly put, such augmenting scheme boils down
to adding additional information to the nodes updated on insertion
time and erasure time which is based on the additional info of the
sibling nodes. The variants are:
 - bp_tree_idx: augmented to allow for efficient indexing
 - bp_tree_idx_acc: doubly augmented to support efficient
 indexing plus a generalized "accumulate" notion usable for
 calculating summatories, mean values, etc.
* Question #1: why augmenting on B+ trees? The augmenting idea
can be as easily built on top of RB-trees, which are
the weapon of choice for implementing associative containers.
Unless proved otherwise, RB-trees are assumed to be faster than B+
trees for in-memory structures (which your own tests hint at, cf.
"insert N random elements" table.) You say:
"The red black trees are yet out of the scope of this project
 for the reasons explained in the documentation."
I've failed to find such explanation. Also, I suspect memory
consumption is higher with B+ trees than RB trees.
 - Request #1: explain why B+ trees rather than RB trees.
That’s my fault, I should have added to the original message reference to
this document:

http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/bp_trees.html
The section “Advantages of B+ trees“ briefly covers the topic.

 - Request #2: provide code for performance tests so that
...
Boost.MultiIndex and oher data structures can be compared.
The newly added file “perform_tests.hpp” contains the code of test
algorithms. My tests and timer are Windows specific, I am planning to
replace the timer with one of Boost timers and update this code. I have
added performance measurements for Boost multi_index_container as mutiset
to the document mentioned at the top of this message.
...
- Request #3: augment the performance tests so as to yield
 memory consumption figures.
This request is covered in two new sections of the updated documentation as
explained at the beginning of this message.
...
* Usage scenario 1: Plain B+ trees as a replacement for RB tress
for associative containers. My hunch is that RB-trees would
perform better, as explained above.
I agree absolutely,
Replacing red black trees with the augmented B+ trees is only justified if
it is beneficial for a user algorithm, such as using a new extended
functionality or a new efficient algorithm.
...
* Usage scenario 2: sequence-like containers with efficient
insertion/erasure. Directly based on bp_tree_idx, with a
do-nothing sorting critera. These containers are not truly
random-access, but rather provide O(log n) access, which can
be acceptable (or not) for some applications.
This is correct, one operation supporting random access iterators has
logarithmic cost in the worst case. The new sequences are a good choice
when a user algorithm requires efficient insert, erase operations, which
outperform std::vector as shown in performance tests. They can also replace
inefficient bidirectional iterators of std::list.
...
* Usage scenario 3: associative container with random access
iterators. This roughly compares with a multi_index_container
equipped with two indices, one ordered, the other random-access.
Basic differences between both:
 - The multi_index_container provides truly O(1) random access
 whereas Container_bptree is O(logN) (which is not random access
 strictly speaking.)
 - With the multi_index_container, sorting order and insertion
 order are separate --some sort of synchronization between both
 indices should be needed to mimic the Continer_bptree case.
 - Insertion/erasure with the multi_index_container is linear
 (though much faster than with std::vector), logarithmic with
 Container_bptree.
 - The memory overhead introduced by the multi_index_container
 is two pointers vs. only one word for Container_bptree.
The Boost Multi-index library offers a very rich functionality. My
knowledge of this library is not yet sufficient to make specific
suggestions. Nevertheless, I think that this library can benefit from the
proposed B+ trees and containers as well as from the method of augmenting
applied to other data structures. I will be happy to help in this area.
...
* Usage scenario 4: containers with fast accumulate, a direct
application of the augmenting idea explained above. There is
nothing in Boost that provides this functionality.
Thank you for confirming this, since in one of previous messages there was
information that logarithmic efficiency of the standard function
accumulate() is already supported by Boost library.
...
My conclusions:
1. Augmented data structures (of which efficient indexing
is just a particular application) are definitely useful and
worth having in Boost.
2. Unless proved otherwise, we should be using RB trees
(or maybe AVL) rather than B+ trees.
3. Depending on the particular needs, some usage scenarios
can be better served with Boost.MultiIndex (note I'm
saying "depending on needs", since differences between
approaches are substantial.)
4 To me, rather than providing out of the box containers with
augmenting we should provide this via two libraries:
 - As an additional capability of Boost.Intrusive associative
 containers (and AVL-tree based containers as well, if
 technically possible.)
 - As an additional capability of Boost.MultiIndex ordered
 indices.
Hope this helps,
Joaquín M López Muñoz
Telefónica Digital
Thank you for your analysis and conclusions.

I am happy for the Boost experts to decide how to use the proposed data
structures and containers. To start work I need a list of tasks and/or
requests with priorities.

Regards,

Vadim