Review managers wanted for augmented data structures

Hi all, At present, neither Boost library nor STL provides containers using augmented data structures. I suggest that the Boost community consider solving the challenging problem – how to extend STL by integrating advanced data structures. Is there an alternative of not integrating advanced data structures into STL? IMO, no, since it is equivalent to “STL must go”. C++ cannot ignore performance benefits provided by these data structures. On the other hand, advanced data structures can be implemented in many programming languages and this will make C++ less competitive. Augmented red-black trees could have been included in the very first version of STL. Why is it difficult and has not been done yet? The major obstacle for the integration is the C++ standard specification of computational complexities of single operations for containers and iterators. Augmented trees are non-standard due to different performance guarantees. At the same these data structures are beneficial for complex algorithms, since they offer a wider set of more efficient operations compared to standard containers. The typical performance improvement for one operation is O(log N) against O(N). For less efficient operations these trees normally provide running time O(log N) against O(1), which is good enough for many applications. The C++ STL will be under pressure as long as there are new data structures that offer better running time for user algorithms. Asymptotically, the main parameter that determines the running time of an algorithm is the total computational cost (to simplify the discussion we omit the effect of locality of reference). By definition, the total computational cost is the sum of costs of single operations multiplied by their frequencies. For a general purpose library, such as STL, it is impossible at the design stage to know frequencies of all container and iterator operations required in all possible user algorithms. A more reasonable and practical approach to STL would be to provide a wide set of interchangeable containers and data structures with different performance guarantees and allow programmers to make decisions about the most suitable containers for specific applications. Relatively small frequencies of inefficient operations do not have significant effect on the total computational cost of an algorithm. Thus, in principle, containers and iterators can provide even inefficient operation in order to support more unified interfaces and to improve the generality. The replacement of interchangeable containers offers the advantage of low development cost solutions to performance bottlenecks that arise due to enhancement requests and/or an increase in the number of elements processed by a user algorithm. Also, I’d like to note that the representation of sequences using augmented trees is a frequent point of confusion and probably the fact that was overlooked in the development of previous versions of STL. The general description of the method of augmenting data structures is given in the section 14.2 of this book T.H. Cormen, C.E. Leiserson, R.L. Rivest and C. Stein, Introduction to Algorithms, 2nd Edition, MIT Press and McGraw-Hill, 2001. The specific application of this method is provided for order statistic trees and interval trees. Unfortunately, there is no example of a sequence that stores unordered elements. Probably for this reason augmented red-black trees are often considered to be an improvement of search trees only. The efficient representation of sequences by the described red-black trees follows from the fact that augmenting by the counter of children nodes supports efficient mapping from non-negative integers into container elements. It works both for ordered and unordered containers with unique and multiple elements or keys. There are two libraries in the Boost review schedule: http://www.boost.org/community/review_schedule.html "STL Extensions" and "Countertree". None of them has as yet a review manager. The order of the reviews may not be significant. It is more important that at least one library pass the review. This would help integrating other advanced data structures. Are there Boost experts who could volunteer to take responsibility of review manager for any of these two libraries? Regards, Vadim Stadnik

On Wed, Feb 27, 2013 at 1:08 AM, Vadim Stadnik <vadimstdk@gmail.com> wrote:
Hi all,
At present, neither Boost library nor STL provides containers using augmented data structures. I suggest that the Boost community consider solving the challenging problem – how to extend STL by integrating advanced data structures.
[...] Augmented red-black trees could have been included in the very first version of STL.
Why is it difficult and has not been done yet?
The major obstacle for the integration is the C++ standard specification of computational complexities of single operations for containers and iterators.
I totally agree.
[...] There are two libraries in the Boost review schedule:
http://www.boost.org/community/review_schedule.html
"STL Extensions" and "Countertree". None of them has as yet a review manager. The order of the reviews may not be significant. It is more important that at least one library pass the review. This would help integrating other advanced data structures.
Are there Boost experts who could volunteer to take responsibility of review manager for any of these two libraries?
I really hope that some expert jumps in. In the meantime, I would like to point out some previous proposals of similar containers (including mine): 2004 (The oldest mention I could find. I don't know if it got implemented): http://lists.boost.org/Archives/boost/2004/03/62823.php 2006: "Hierarchical Data Structures" http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2101.html#tr.hierar... 2006: "AVL Array" (terrible name, I know) http://sourceforge.net/projects/avl-array Renamed as "Rank List" after some debate when proposed for Boost: http://boost.cvs.sourceforge.net/viewvc/boost-sandbox/boost-sandbox/libs/ran... This last version is probably outdated with respect to "AVL Array". And finally... 2012: Countertree http://dl.dropbox.com/u/8437476/works/countertree/doc/index.html I think that there are very interesting ideas in some of them. The "Hierarchical Data Structures" proposed "cursors" in addition to the ordinary iterators, to let the user traverse the tree at will ---not necessarily in-order---. In my proposal (AVL Array / Rank List), I decided to hide the internal structure of the tree, offering only iterators. As a result, I had to offer some additional methods, like a binary search (provided that the sequence is sorted). Later I found myself hacking into the tree structure to make some other experiments! I proposed augmenting the tree with an integer counting nodes in each subtree ---like all others--- but I also proposed augmenting the tree with a user-defined type. In the normal view of a sequence, every element occupies a unit of space. In what I called "Non-Proportional Sequence View" the user can decide the "virtual space" occupied by every element, and then jump to any position in that "virtual scale" in O(log N) time. I made some design decisions favouring functionality and low computational complexity over space. Every node contained pointers to the next and previous nodes in-order. Between this and the NPSV, they used a lot of space. At some point I increased their size even further, and the shocking result was that my experiments run faster. This leads me to think that a specific allocator ---like the one proposed in Countertree--- might help a lot. I also played with this concept of sequence ---with O(log N) random access and O(log N) insertion/removal--- in permanent storage. See "Shiftable Files" in the project "AVL Array". It is a toy implementation, but it works. I will love to discuss these ideas if anybody is interested. I would have said all this before, but I was disconnected from the Boost list for a long while and I didn't know that augmented trees had been proposed again. Sorry. Best regards, Martín Knoblauch Revuelta

On Sun, Mar 3, 2013 at 1:21 AM, Martin Knoblauch Revuelta < mkrevuelta@gmail.com> wrote:
[...] Augmented red-black trees could have been included in the very first version of STL.
Why is it difficult and has not been done yet?
The major obstacle for the integration is the C++ standard specification of computational complexities of single operations for containers and iterators.
I totally agree.
It looks like you already gained quite significant experience in this area. Do you know any other reasons and/or concerns why augmented data structures should NOT be added to STL?
In the meantime, I would like to point out some previous proposals of similar containers (including mine):
2004 (The oldest mention I could find. I don't know if it got implemented): http://lists.boost.org/Archives/boost/2004/03/62823.php 2006: "Hierarchical Data Structures"
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2101.html#tr.hierar... 2006: "AVL Array" (terrible name, I know) http://sourceforge.net/projects/avl-array Renamed as "Rank List" after some debate when proposed for Boost:
http://boost.cvs.sourceforge.net/viewvc/boost-sandbox/boost-sandbox/libs/ran... This last version is probably outdated with respect to "AVL Array". And finally... 2012: Countertree http://dl.dropbox.com/u/8437476/works/countertree/doc/index.html
I think that there are very interesting ideas in some of them.
Thank you for these links, The discussion of augmented trees by Boost community is particularly interesting. Do you know if there was a formal Boost review of any variant of augmented data structures and/or containers? ... I made some design decisions favouring functionality and low
computational complexity over space. Every node contained pointers to the next and previous nodes in-order. Between this and the NPSV, they used a lot of space. At some point I increased their size even further, and the shocking result was that my experiments run faster. This leads me to think that a specific allocator ---like the one proposed in Countertree--- might help a lot.
I am interested in any ideas and methods that can improve performance of augmented data structures. The problem with dynamically allocated augmented trees is poor locality of reference and, thus, worse performance compared to array based data structures. Custom allocators are helpful. However, array based variants of augmented trees are better, since they improve locality of reference when elements are stored contiguously. B+ trees have the important advantage over red-black trees. B+ trees can implement various combinations of two data structures: one linear and the other one is a B-tree. In particular, they can support a segmented array that offers high efficiency of sequential processing close to that of a single array or std::vector. This is one of the main reasons why I proposed for the Boost review array based augmented B+ trees and decided to postpone the review of dynamically allocated augmented B+ trees.
I will love to discuss these ideas if anybody is interested. I would have said all this before, but I was disconnected from the Boost list for a long while and I didn't know that augmented trees had been proposed again. Sorry.
It is quite symbolic that the current Boost review schedule has 2 proposals with different types of augmented trees. Even if both of these libraries fail there will be new proposals, because augmented data structures are beneficial for many user algorithms. Thank you for very useful feedback, Vadim Stadnik

On Sun, Mar 3, 2013 at 11:57 AM, Vadim Stadnik <vadimstdk@gmail.com> wrote:
On Sun, Mar 3, 2013 at 1:21 AM, Martin Knoblauch Revuelta < mkrevuelta@gmail.com> wrote:
[...]
It looks like you already gained quite significant experience in this area. Do you know any other reasons and/or concerns why augmented data structures should NOT be added to STL?
The main concern is the balance between the value they add and their cost. On one hand, which are the real-life applications where these containers would really help? In my opinion, the programming community has learned to solve zillions of problems with arrays, lists, search trees, heaps and hash tables. Now it's not easy to find a real-life problem where augmented trees provide a significantly faster solution. Though, they may allow amazingly simple, non-tuning, easier-to-code, solutions. On the other hand, in addition to the usual bloat, there is a deep concern regarding the possible incorrect use of these sequence containers. Lazy unexperienced programmers might misuse them as an all-purpose replacement of vector and list. If the standard library adopted them, it would be somehow advocating their use. If there were no real-life applications, then it would be advocating inefficiency.
[..] The discussion of augmented trees by Boost community is particularly interesting. Do you know if there was a formal Boost review of any variant of augmented data structures and/or containers?
There was an interesting discussion when I proposed it: http://boost.2283326.n4.nabble.com/proposal-Sequence-container-with-O-log-n-... http://boost.2283326.n4.nabble.com/proposal-Rank-List-old-name-AVL-Array-in-...
[..] I proposed for the Boost review array based augmented B+ trees and decided to postpone the review of dynamically allocated augmented B+ trees.
If you only plan to augment the tree with an integer, then this is perfectly reasonable. Though, if you plan to augment the trees with a data type that doesn't support subtraction ---like the interval trees of "Introduction to Algorithms", or something more complex--- it might be better to stick to a few children per node. Otherwise, deleting an element might require a lot of [potentially expensive] additions of a user-defined type. Best regards, Martín Knoblauch Revuelta

On Sun, Mar 3, 2013 at 9:00 PM, Martin Knoblauch Revuelta < mkrevuelta@gmail.com> wrote:
On Sun, Mar 3, 2013 at 11:57 AM, Vadim Stadnik <vadimstdk@gmail.com> wrote:
...
Do you know any other reasons and/or concerns why augmented data structures should NOT be added to STL?
The main concern is the balance between the value they add and their cost.
On one hand, which are the real-life applications where these containers would really help? In my opinion, the programming community has learned to solve zillions of problems with arrays, lists, search trees, heaps and hash tables. Now it's not easy to find a real-life problem where augmented trees provide a significantly faster solution. Though, they may allow amazingly simple, non-tuning, easier-to-code, solutions.
On the other hand, in addition to the usual bloat, there is a deep concern regarding the possible incorrect use of these sequence containers. Lazy unexperienced programmers might misuse them as an all-purpose replacement of vector and list.
If the standard library adopted them, it would be somehow advocating their use. If there were no real-life applications, then it would be advocating inefficiency.
I agree with your concerns. In practice, on Windows systems an algorithm using N<1000 elements of small size normally does not have to use advanced data structures at all. It performs quite well with the default STL container std::vector. The performance problems arise when the number of elements increases and the algorithms uses inefficient linear cost operations of std::vector. To some extent the risk of misuse is reduced by the array based augmented B+ trees. As I mentioned before they have good locality of reference and guarantee at least quite efficient sequential processing. The representation of strings and string algorithms is one of the most practically important application areas for augmented trees. The B+ trees proposed for the Boost review support not only O(log N) insertion and erasure for a single element, but also O(log N) time for splice and split operations. This advantage is not achievable with basic data structures and STL containers.
[..] The discussion of augmented trees by Boost community is particularly interesting. Do you know if there was a formal Boost review of any variant of augmented data structures and/or containers?
There was an interesting discussion when I proposed it:
http://boost.2283326.n4.nabble.com/proposal-Sequence-container-with-O-log-n-...
http://boost.2283326.n4.nabble.com/proposal-Rank-List-old-name-AVL-Array-in-...
[..] I proposed for the Boost review array based augmented B+ trees and decided to postpone the review of dynamically allocated augmented B+ trees.
If you only plan to augment the tree with an integer, then this is perfectly reasonable.
Thank you for the links, I will analyze the discussions. I have already implemented 5 variants of basic and augmented B+ trees and STL containers using these trees. Three variants of dynamically allocated B+ trees: https://github.com/vstadnik/stl_ext_adv and two variants of array based B+ trees: https://github.com/vstadnik/stl_ext_adv_review In combination with augmented red-black trees they should help find the solution to the problem how to use augmented and advanced data structures inside or alongside STL.
Though, if you plan to augment the trees with a data type that doesn't support subtraction ---like the interval trees of "Introduction to Algorithms", or something more complex--- it might be better to stick to a few children per node. Otherwise, deleting an element might require a lot of [potentially expensive] additions of a user-defined type.
The development of augmented interval B+ trees is not impossible, but at this stage my main priority is data structures that support STL interfaces. Regards, Vadim Stadnik
participants (2)
-
Martin Knoblauch Revuelta
-
Vadim Stadnik