Re: [boost] Countertree

On Mon, Nov 19, 2012 at 9:02 PM, Leo Goodstadt <leo.goodstadt@llew.org.uk>wrote: 1) For counter trees, the median is just the N/2-th value in the counter tree, where K is the size of the tree. The Nth-percentile can be calculated similarly. A rolling median with a fixed size window requires there to be an insertion and deletion for each (most!) new value, so I am guessing it should run in O(N log K) time, where N = size of data, K = size of window
IIUC, it is better to avoid modifying original data (size N), the rolling window of size K can be represented by a multiset based on an augmented
On Tue, 20 Nov 2012, Vadim Stadnik <vadimstdk@gmail.com> wrote: tree. This method guarantees complexity of obtaining 1st median value for O(K log K) operations, which is the cost of construction of the rolling window, every subsequent median value can be obtained for O(log K) operations, when the window rolls. This approach should be more efficient.
Oops. Careless editing made me mix up my Ks and Ns, hopeless confusing everyone. So given a large data set of size N and we want to calculate medians across a rolling window of size K, we only need to create a counter tree of size K (the window size). The median is the K/2 value in the tree. As you point out, the entire operation runs in O(N log K) time. All sensible approaches to calculating medians I can think of take O(N log K) time. The question comes down to what the constants are. The great advantage of a counter tree is that we are not limited to medians (the 50%ile value). We can collect for example, the values at 25%, 50%, and, 75%, and so on, at only marginally extra cost. I presume that most of the cost lies in balancing the tree after insertions/deletions and updating the counters. The only shortcut is the trivial optimisation that if the value entering the window is the same as the one being removed at the other end, then the median obviously stays the same. Leo

Hi Francisco, First of all I think that your library will be a great addition to Boost. To make it even better I will suggest to make CounterTree more generic: 1) Expand CounterTree implementation to RangeTree (the name might be different). RangeTree is a data structure that supports any operation on its nodes that is associative ((a*b)*c = a*(b*c)) and commutative (a*b = b*a). The examples of such an operation are: addition, multiplication, gcd, xor, bit count, etc. RangeTree allows to execute the following operations in logarithmic time: a) insertion / removal / lookup of the new element by key; b) operation evaluation on the tree nodes within the given key range. 2) Considering above, expand RangeTree to support any type of operation argument (I call it property). E.g. for the CounterTree this is an integer representing number of elements in a subtree, for the ColourTree this can be node colour and operation defined as mixing all the colours in the subtree. 3) Considering 1) and 2), it will be possible to implement CounterTree as specialization of RangeTree. Augmented tree, recently requested by Phil Endecott for dynamic text width querying, can be implemented as specialization of the RangeTree data structure as well. 4) Summarizing all of the above the RangeTree declaration will look like: template <class Key, class Compare, class Property, class Operation, class Allocator> class RangeTree; Property RangeTree::query(const Key& left, const Key& right); // basic query interface The CounterTree specialization: template <class Key, class Compare, class Allocator> class CounterTree : RangeTree<Key, Compare, int, addition<int>, Allocator>; And even more, the CounterRangeTree specialization: template <class Key, class Compare, class Property, class Operation, class Allocator> class CounterRangeTree : RangeTree<Key, Compare, pair<int, Property>, operation< addition<int>, Operation >, Allocator>; Property CounterRangeTree::query(int start_index, int end_index); // counter query interface Such a design will widely expand application area of the library and will allow to use it for the large number of live updating/querying systems. Regards, Andrii _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

I have read the Policy- Based Data Structures of GCC and the Boost Accumulator library. I think the problem is the same for the two. The statistical operations are done over current data structures as arrays, vectors, map , hash... Each data structure have some advantages and several problems. The problem is the number of operations over the data structure. If you need obtain a statistical value over a collection of data, and need to obtain only a few times, many data structure can satisfy you requirements. The problem is when you need to do a lot of times, by example, you have intervals, and you need to detect the intersection between them in order to calculate a collision, in a particle simulation. With that problem, the solution to the problem pass for to redesign the data structure used. The augmented trees are a possible solution of many of these problems. But it is complex. The countertrees are augmented data structures. I redesign the insertion deletion and balanced algorithms. The actual version is around 10% - 15% slower than the GCC version of red-black trees. I have designed on paper a new algorithms for to insert, delete and balance the countertree, in order to improve the speed. In the new version you have a catalog of operations with the nodes. All the insertion, deletion and balancing can be done combining operations of this catalog. The node operations have two parts, the pointers and the node color, and the augmented information.In the countertree the augmented information are the node counters. If you design the process to do with your augmented information for all the operations of the catalog, you can create your augmented balanced tree mixing the pointer and color operations with your code for each operation. You can have augmented trees with several concepts, like the position, accumulators and statistical operations as are defined in the Policy Based Data Structures and in the Boost Accumulator library. In the node operation you mix the pointers and color operations with the process of the catalog for all the augmented informations added. I am working now on the concurrent version of the countertree. When finish with this , I will begin with the augmented data trees. The solution exposed can be combined with the concurrent process, which can improve a lot the speed of these accumulators, statistical operations or any other idea. If you are interested, when I have something useful , I will comment and offer to you. Sincerely yours Francisco 2012/11/22 Andrii Sydorchuk <sydorchuk.andriy@gmail.com>
Hi Francisco,
First of all I think that your library will be a great addition to Boost. To make it even better I will suggest to make CounterTree more generic:
1) Expand CounterTree implementation to RangeTree (the name might be different). RangeTree is a data structure that supports any operation on its nodes that is associative ((a*b)*c = a*(b*c)) and commutative (a*b = b*a). The examples of such an operation are: addition, multiplication, gcd, xor, bit count, etc. RangeTree allows to execute the following operations in logarithmic time: a) insertion / removal / lookup of the new element by key; b) operation evaluation on the tree nodes within the given key range.
2) Considering above, expand RangeTree to support any type of operation argument (I call it property). E.g. for the CounterTree this is an integer representing number of elements in a subtree, for the ColourTree this can be node colour and operation defined as mixing all the colours in the subtree.
3) Considering 1) and 2), it will be possible to implement CounterTree as specialization of RangeTree. Augmented tree, recently requested by Phil Endecott for dynamic text width querying, can be implemented as specialization of the RangeTree data structure as well.
4) Summarizing all of the above the RangeTree declaration will look like: template <class Key, class Compare, class Property, class Operation, class Allocator> class RangeTree; Property RangeTree::query(const Key& left, const Key& right); // basic query interface
The CounterTree specialization: template <class Key, class Compare, class Allocator> class CounterTree : RangeTree<Key, Compare, int, addition<int>, Allocator>;
And even more, the CounterRangeTree specialization: template <class Key, class Compare, class Property, class Operation, class Allocator> class CounterRangeTree : RangeTree<Key, Compare, pair<int, Property>, operation< addition<int>, Operation >, Allocator>; Property CounterRangeTree::query(int start_index, int end_index); // counter query interface
Such a design will widely expand application area of the library and will allow to use it for the large number of live updating/querying systems.
Regards, Andrii
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Francisco, Thanks for your reply. Please let me know if I misunderstood one of your comments. On Thu, Nov 22, 2012 at 2:11 PM, Francisco José Tapia <fjtapia@gmail.com>wrote:
The countertrees are augmented data structures. I redesign the insertion deletion and balanced algorithms. The actual version is around 10% - 15% slower than the GCC version of red-black trees.
I don't quite understand what you mean by redesign. Ideally those operations should be implemented completely in the same way as for red-black tree implementation (std::map, std::set), except that additional processing of augmented data is required. But that sounds more like extension, as all of the core logic that manipulates with tree nodes and colors should be completely the same. At least I don't see the reason it shouldn't. I have designed on paper a new algorithms for to insert, delete and
balance the countertree, in order to improve the speed. In the new version you have a catalog of operations with the nodes. All the insertion, deletion and balancing can be done combining operations of this catalog.
It sounds unreasonable for me to limit the set of allowed operations to some catalog. Basically any function that satisfies commutative and associative properties should work. Thus it makes sense to pass such a functor as a template argument to the Tree data structure, to be not limited to any catalog of predefined operations.
The node operations have two parts, the pointers and the node color, and the augmented information.In the countertree the augmented information are the node counters.
Sure, and the logic that deals with pointers and node colors should be completely the same as for the std::map or std::set.
I am working now on the concurrent version of the countertree. When finish with this , I will begin with the augmented data trees.
I am not sure what exactly "concurrent countertree" means? I don't see how insertion/deletion or lookup can be parallelized even for BST, probably because lack of my knowledge in the area. Thus simple example of speedup using any of the above operations with two threads will be perfect. Thanks, Andrii

When you design a tree, you must develop a balanced algorithm in order to prevent a degenerate tree ( Tree like a linked list). You have several types of balanced trees ( mainly Red-Black and AVL ). The balancing of a tree can be done in several ways. That decisions have influence on the quality of the balance and the speed of the tree. You have an excellent description with code in “Introduction to Algorithms”. When I decided add the counter to the nodes for the design of the CounterTree, I tried to use that code, but was very complex with the counters, and decide design my own balanced algorithm based on the 234Tree. The result is the actual version of CounterTree. It is 15% more or less slower than the GCC implementation, but is logic , because must manage the pointers and the counters. Looking for a way for to improve the speed , I examined others implementations of Red-Black trees, several books an mainly with the experience obtained with the Suballocator. I decided to change several things of the algorithm and the implementation, pursuing to improve the speed. I have this design done on paper. At that time , appear on Boost comments about the problem of the augmented trees, in order to do many others things with the trees. I study my design on paper, and decompose all the internal process of the tree in a “catalog” of small operations like: - Insert/delete a node in the right/left pointer of a node. - All the rotations with nodes - Swap of nodes directly linked and not directly linked With the new design, if you want build your own augmented tree, you need : - Design the information to insert in the node - Design the process to do with your node information in all the operations of the “catalog” With this you can compile your own augmented tree, obtaining the maximum speed. The concurrent countertree is a cover over the tree, which permit many process work concurrently with the tree in a safe mode. If you design your own augmented tree, and use this cover you can have your own concurrent augmented tree. In the same way than using the suballocator , you can improve the speed, and the memory management of your tree. For many statistical function , you can implement directly over the CounterTree. In the same way you can implement an efficient PriorityQueueusing it. But the best solution is to include the information and the process inside the tree. With this, you can customize your tree and obtain the maximum speed. I know that in the design process they will appear a lot of problems , but I expect resolve them in a satisfactory way. Actually I am working in the concurrent version. When finish I will begin with the augmented trees. Sorry by the delay, but I don't have more free time for to work on it. Thanks by your interest, your ideas and suggestions. You help me to improve it and to be useful to the Boost community. Sincerely yours Francisco Tapia 2012/11/22 Andrii Sydorchuk <sydorchuk.andriy@gmail.com>
Francisco,
Thanks for your reply. Please let me know if I misunderstood one of your comments.
On Thu, Nov 22, 2012 at 2:11 PM, Francisco José Tapia <fjtapia@gmail.com
wrote:
The countertrees are augmented data structures. I redesign the insertion deletion and balanced algorithms. The actual version is around 10% - 15% slower than the GCC version of red-black trees.
I don't quite understand what you mean by redesign. Ideally those operations should be implemented completely in the same way as for red-black tree implementation (std::map, std::set), except that additional processing of augmented data is required. But that sounds more like extension, as all of the core logic that manipulates with tree nodes and colors should be completely the same. At least I don't see the reason it shouldn't.
I have designed on paper a new algorithms for to insert, delete and
balance the countertree, in order to improve the speed. In the new version you have a catalog of operations with the nodes. All the insertion, deletion and balancing can be done combining operations of this catalog.
It sounds unreasonable for me to limit the set of allowed operations to some catalog. Basically any function that satisfies commutative and associative properties should work. Thus it makes sense to pass such a functor as a template argument to the Tree data structure, to be not limited to any catalog of predefined operations.
The node operations have two parts, the pointers and the node color, and the augmented information.In the countertree the augmented information are the node counters.
Sure, and the logic that deals with pointers and node colors should be completely the same as for the std::map or std::set.
I am working now on the concurrent version of the countertree. When finish with this , I will begin with the augmented data trees.
I am not sure what exactly "concurrent countertree" means? I don't see how insertion/deletion or lookup can be parallelized even for BST, probably because lack of my knowledge in the area. Thus simple example of speedup using any of the above operations with two threads will be perfect.
Thanks, Andrii
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
participants (3)
-
Andrii Sydorchuk
-
Francisco José Tapia
-
Leo Goodstadt