new proposal: order statistics tree

Hi, I would like to propose to add new data structure - order statistics tree. Briefly, it is container which allows logarithmic inserting, searching and erasing as in set, but additionally it permits to access elements via numerical value - key order. I have some ready implementation with all appropriate functions (without their all overloaded forms) with gtest regression list, but the code isn't boostified. The regression passes on gcc 4.6.3 and vs2010. Is there anybody interested in? Best regards, Szymon Wojciechowski

Hi, @Joel: If one has order_statistics_tree<string, less<string> > with string values and less comparator, and the container has been inserted: cat, dog, bird, fish, dinosaur then the elements can by accessed by ranks of the keys in that container. In example tree[0] returns bird while bird is the least element, tree[1] = cat, tree[2] = dinosaur, tree[3] = dog, tree[4] = fish, which is the greatest. Best regards, Szymon Wojciechowski "Klaim - Joël Lamotte" <mjklaim@gmail.com> pisze:

On Wed, Jan 2, 2013 at 3:46 PM, Szymon Wojciechowski <sw10@interia.pl>wrote:
How does it compare to < http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2101.html#tr.hierarchy.augment> and the half dozen other augmented tree implementations that have appeared in the Boost list? (one as recent as last year). -- -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com -- 102708583/icq - grafikrobot/aim - grafikrobot/yahoo

Will it be like this? http://stackoverflow.com/questions/11230734/c-stl-order-statistic-tree http://www.opensource.apple.com/source/llvmgcc42/llvmgcc42-2336.9/libstdc++-... Regards, Rahul On Thu, Jan 3, 2013 at 8:59 PM, Rene Rivera <grafikrobot@gmail.com> wrote:

On Fri, Jan 4, 2013 at 3:39 PM, Szymon Wojciechowski <sw10@interia.pl>wrote:
The current state of that TR is at < http://svn.boost.org/svn/boost/sandbox/tree/>. Although a few notes about it.. * The code is old and since Bernhard wrote it I don't know at the moment the state of closely it follows the TR. Since most of it was written before the TR. * I've been updating the docs rather slowly (too many other projects going on). But I'm hoping to get an update of the TR done by in the Spring. And work again on the implementation after that. -- -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com -- 102708583/icq - grafikrobot/aim - grafikrobot/yahoo

On Thu, Jan 3, 2013 at 7:46 AM, Szymon Wojciechowski <sw10@interia.pl>wrote:
Yes, I am interested. What kind of tree did you implement? Did you measured performance of basic container operations and compared with the standard set and/or multiset? Regards, Vadim Stadnik

Hi Szymon Please, could you explain me more details of your idea, and mode examples ? The case propose in your message can be done with a countertree::set,. This is a sorted structure with additional access by position like in a vector. The insertion, deletion and access to the elements are O(logN). You can find the code and documentation in http://dl.dropbox.com/u/8437476/works/countertree/doc/index.html As commented in previous messages about the augmented trees. “The countertrees are augmented data structures. I redesign the insertion deletion and balanced algorithms. The actual version is around 10% - 15% slower than the GCC version of red-black trees. I have designed on paper a new algorithms for to insert, delete and balance the countertree, in order to improve the speed. In the new version you have a catalog of small operations with the nodes ( insertion, deletion, rotations, swap nodes... ) All the operations of the tree can be done combining operations of this catalog. The node operations have two parts, the pointers and node color, and the augmented information. In the countertree the augmented information are the node counters. If you design the process to do with your augmented information for all the operations of the catalog, you can create your augmented balanced tree mixing the pointer and color operations with your code for each operation. You can have augmented trees with several concepts, like the position, accumulators and statistical operations as are defined in the Policy Based Data Structures and in the Boost Accumulator library. In the node operation you mix the pointers and color operations with the process of the catalog for all the augmented information added. I am near to finish the concurrent version of the countertree data structures. My intention is to begin with the augmented data trees when finish this. The solution exposed can be combined with the concurrent process, which can improve a lot the speed of these accumulators, statistical operations or any other idea.” I don't know if we are talking about the same idea or if there are differences. Sincerely yours Francisco Tapia

"Francisco José Tapia" <fjtapia@gmail.com> pisze:
My solution solves the same problem as your - insert, erase and find in O(log n) and additionally access to element by it's order. It's newly implemented red-black tree. Each node has pointers to parent and sons, colour and value which represents subtree size. Unlike your, this solution still hasn't allocators. As in previous messages it seems to be faster than VS's set. Generally I will try to publish my sources at the end of current week. Best regards, Szymon Wojciechowski

"Vadim Stadnik" <vadimstdk@gmail.com> pisze:
I implemented augmented red-black tree which doesn't check the uniqueness of values. I have conducted a simple survey: I prepared 5 input files with unique 2,7*10e7 integers. The first data set has sorted values as other has randomized. I repeated 7 times on every input. A simple program reads from input integers and puts it to set (maybe I should use multiset) from VisualStudio or my order statistics tree. After that I measured time. Then program erases values from data structure in ordered way (from 0, 1, 2...). After that I also measured how much time it takes. I uploaded the results here: http://wyslijto.pl/plik/q5om39aoua - it's a little bit strange. What kind of performance test would you think about? Best regards, Szymon Wojciechowski

On Tue, Jan 8, 2013 at 7:47 AM, Szymon Wojciechowski <sw10@interia.pl>wrote:
Did you implement or is it possible to implement for this augmented red black tree split, join and splice operations with logarithmic computational complexity? As for tests, there are a number of performance tests of basic container operations in the following projects: 1. For augmented dynamically allocated B+ trees: https://github.com/vstadnik/stl_ext_adv I have added this link, since this project has results of performance tests for Boost.MultiIndex. 2. For augmented array based B+ trees: https://github.com/vstadnik/stl_ext_adv_review This project is prepared for Boost review. It compares performance of new containers with standard containers and Boost.Container. The file "boost_performance.cpp" uses Boost.Chrono timer. If your containers are STL compliant you should not have any problem with this test code. Regards, Vadim Stadnik

"Vadim Stadnik" <vadimstdk@gmail.com> pisze:
I have performed such test for 6 data containers (including boost::multi_index described by Tony and Vadim): the comparison is uploaded here: http://wyslijto.pl/plik/mxxpjbpmfn I prepared 5 datasets (first one was sorted). I repeated 5 times whole test for each container and dataset. The test inserts and accumulates as in Vadim's tests. However I added the erasure performance. I haven't thought about possiblility to implement such operations like splice, join, split. However, it seems to me to be impossible. When adding treeA + treeB, the elements of treeB should be considered independently where to place them in result. For example if thereA contains 1, 3, 5, 7, 9 and treeB contains 2, 4, 6, 8, 10 all treeB nodes will be independently located and inserted in O(n). Best regards, Szymon Wojciechowski

On Sun, Jan 13, 2013 at 7:10 PM, Szymon Wojciechowski <sw10@interia.pl>wrote:
The performance of your trees looks quite promising. What are the sizes of the data sets used in the tests?
The support of these operations with logarithmic running time would be quite beneficial for your variant of augmented trees. The augmented trees can support not only STL associative containers, but also STL sequences, such as std::vector and std::list. String algorithms can also take advantage of these efficient operations. The split operation is general. The join and splice operations are general for STL sequences. For associative containers they can be used only in special cases. You can find more details in documentation to augmented B+ trees. The CGAL library http://www.cgal.org/Manual/latest/doc_html/cgal_manual/STL_Extension_ref/Cla... implements split and join (as function catenate()) operations for Multiset. However, these trees are not augmented, this is why container sizes must be re-counted in linear time, which creates a performance problem for STL variants of containers. Note also that this is a problem of all dynamically allocated data structures that can support bidirectional iterators only, including std::list. With augmented red-black trees the cost of these operations and corresponding member functions of STL containers should be logarithmic. Regards, Vadim Stadnik

"Vadim Stadnik" <vadimstdk@gmail.com> pisze:
The performance of your trees looks quite promising. What are the sizes of the data sets used in the tests?
I forgot to add this. The data sets had 30 000 elements.
If it is possible as you said, it is worth to consider and do. However, now I would rather publish the code, so I am preparing it.

Hi Everyone, I have just uploaded my source code here: https://sourceforge.net/projects/orderstatistics/ Of course, the project isn't finished, so I encourage Everyone to comment and critizize. I hope that it will help to improve it. I am still adding new overloading forms of some functions. Best regards, Szymon Wojciechowski

Hi All, As the discussion fall silent, I would like to ask if is still anyone interested in this project or there is any reason for continuing it? Best regards, Szymon Wojciechowski

On Mon, Jan 7, 2013 at 3:57 PM, Szymon Wojciechowski <sw10@interia.pl>wrote:
Basically both an order statistics tree and boost multi-index are examples of containers with multiple indices. I suspect there may be interesting results from a compare/contrast type of study. Boost MultiIndex allows indexing of a container via any number/type of indices. So you could have a container indexed via key, and also indexed via integer. Now the difference, I think, is that MultiIndex uses independent indexes - changing a entry's key from "alpha" to "beta" doesn't change its integer index from 1 to 2. So that is a significant difference. Might be worth mentioning in docs or somewhere, just to avoid confusion. But furthermore, I could probably implement the same behaviour as a order statistics tree by using Boost.MultiIndex internally (and maintaining the interrelationships of the indices as needed). Would that be drastically less efficient? And also, how do the interfaces compare? Should there be any alignmen/consistency t in the interfaces, so that some template code could use either? Not sure how much template code exists that requires "any container with 2 indices, one of which is an int". Probably consistency of interface is more important for learning/understanding/remembering than for template code. Tony

On Tue, Jan 8, 2013 at 7:31 AM, Gottlob Frege <gottlobfrege@gmail.com>wrote: Basically both an order statistics tree and boost multi-index are examples
+1. This is a very good comment and I think there is no simple answer. I was asked similar questions during Boost discussion of augmented dynamically allocated B+ trees, this is why I would like to make a few comments based on my analysis. The fact that creates an overlap with Boost.MultIndex and can cause confusion is that augmented data structures support multiple augmenting. For example, basic balanced tree provides efficient search of elements by keys. This tree requires a comparison object. Augmenting of such tree by counter of elements adds efficient access to elements by rank, which is equivalent to index value in the subscript operator of an array or std::vector. This variant of tree has extended functionality and can work both with and without a comparison object. In other words, it offers equally efficient operations for both ordered and unordered containers. Yet another augmenting by the sum of values in children nodes improves computational complexity of calculation of sum of any number of consecutive elements from linear to logarithmic. As a result of the double augmenting, the calculation of the mean value and standard deviation can be done in O(log N) time only. Such variants of augmented data structures support STL sequences and associative containers with improved efficiency of specific member functions. In applications that require multiple views of a data set these augmented data structures and containers cannot replace Boost.MultiIndex functionality. On the other hand, replacing functionality of augmented data structures with Boost.MultiIndex should be generally possible. However, this solution will be less efficient in the case when an augmented data structure offers a set of more efficient algorithms compared to data structures of Boost.MultiIndex. The disadvantage of this approach is that one less efficient operation can cause a performance bottleneck. Normally, augmented data structures offer the improvement of the running time from O(N) down to O(log N) for a single operation or an algorithm. This performance benefit of augmented data structures certainly should not be ignored by Boost.MultiIndex. Unfortunately, at this stage it is not possible for a user algorithm to take advantage of Boost.MultiIndex with advanced data structures. This example shows declaration of a multiset: typedef multi_index_container<int, indexed_by<ordered_non_unique<identity<int>>>> multi_set_int ; The obvious idea of a parameterization of containers used by views of Boost.MultiIndex is potentially useful, but deeper analysis shows that it is quite difficult and costly to implement. Boost.MultiIndex has special non-standard requirements to the view containers. For example, std::vector cannot be used in place of random access indices, since it does not remain stable after insertion or erasure operations. Hope that the discussion helps better understand the complexity of these issues. Regards, Vadim Stadnik
participants (7)
-
Francisco José Tapia
-
Gottlob Frege
-
Klaim - Joël Lamotte
-
Rahul Sr
-
Rene Rivera
-
Szymon Wojciechowski
-
Vadim Stadnik