[stl_ext_adv] Interest in advanced data structures

Hi all,

Advanced data structures (ADS) offer useful STL extensions that are not possible or not efficient with basic data structures, such as arrays, linked lists and red-black trees.

The project
https://github.com/vstadnik/stl_ext_adv
http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/index.html
focuses mainly on a generalized and unified computer representation of sequences using the augmented B+ trees. It provides the following STL extensions:
- The sequence container supports random access iterators and a union of the interfaces of the STL containers vector, list and deque. This sequence offers the advantage of efficient update operations with logarithmic running time, compared with the linear running time of the same operations of std::vector.
- The associative containers (set, multiset, map and multimap) with random access iterators.
- The enhanced random access iterators effectively address the fundamental problem of iterator invalidation.
- The function accumulate() with logarithmic computational complexity can significantly improve the performance of many practically important algorithms and applications, such as numerical integration, statistics, data analysis etc.

The examples demonstrate how to improve the computational complexity from O(N) to O(log N) in the following applications and problems:
- numerical integration;
- calculation of mean value and standard deviation;
- calculation of weighted mean value;
- testing if any number of consecutive elements are equal;
- counting intersections of one interval with a sequence of intervals.

Is there an interest in this project and the implemented STL extensions?

Is there an interest in potentially useful, but not yet implemented algorithms and specialized containers:
1. The function accumulate() with constant computational cost. With this function some user algorithms, such as numerical integration, can come close to the absolute theoretical limit of the computational cost. The function can be implemented through iterators enhanced with the functionality of accumulators. For more details, see the section "Random access iterators" in the documentation.
2. The functions sequence::splice() with logarithmic running time in the worst case. Note that the functions std::list::splice() in many cases provide constant computational cost; however, in the worst case the cost is linear.
3. Any other potentially useful specialized containers, iterators or algorithms?

Augmented B+ trees are versatile and powerful data structures. They can be designed to maximize the efficiency of particular algorithms and programming solutions. The limitation of specialized containers is that the computational cost of some other operations may increase.

About augmented red-black trees. The augmented red-black trees can also support various STL extensions. The red-black trees are yet out of the scope of this project for the reasons explained in the documentation. However, historically, augmented red-black trees are probably the first ADS that could extend STL, since the method of augmenting red-black trees was described as early as 1990 [T.H. Cormen, C.E. Leiserson and R.L. Rivest, "Introduction to Algorithms" (1st ed.), MIT Press and McGraw-Hill, 1990]. This means, in particular, that the very first version of STL could have provided associative containers with random access iterators. Also, as I understand it, the previously submitted project https://github.com/boost-vault/Containers/blob/master/countertree.zip implements containers based on the augmented red-black trees.
It is hardly possible to make an exact prediction, but without a doubt future versions of STL will benefit from advanced data structures. Regards, Vadim Stadnik
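A minimal sketch of the counter-based augmenting idea behind these extensions (not the project's code; shown on a plain binary search tree for brevity, whereas the library uses augmented B+ trees): every node caches the size of its subtree, which makes positional access, and hence random access iterators over an ordered container, a logarithmic-time operation.

#include <cstddef>

struct node {
    node*       child[2];   // left / right subtree, 0 when absent
    std::size_t count;      // number of elements stored in this subtree
    int         value;
};

inline std::size_t subtree_count(const node* n) { return n ? n->count : 0; }

// Locates the i-th element (0-based) in O(height) = O(log N) steps for a balanced tree.
const node* select(const node* root, std::size_t i) {
    while (root) {
        const std::size_t left = subtree_count(root->child[0]);
        if (i < left) {
            root = root->child[0];
        } else if (i == left) {
            return root;
        } else {
            i -= left + 1;
            root = root->child[1];
        }
    }
    return 0;   // position out of range
}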

Hi all,
Advanced data structures (ADS) offer useful STL extensions that are not possible or not efficient with basic data structures, such as arrays, linked lists and red-black trees.
The project
https://github.com/vstadnik/stl_ext_adv
http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/index.html
focuses mainly on a generalized and unified computer representation of sequences using the augmented B+ trees. It provides the following STL extensions ..
The examples demonstrate how to improve the computational complexity from O(N) to O(log N) in the following applications and problems:
- numerical integration;
- calculation of mean value and standard deviation;
- calculation of weighted mean value;
I think O(log N) is misleading. Problem: I want to estimate the mean of N numbers in a file on my disk. Clearly I should at least present all of the numbers in that file at least once (O(N)) to a black box algorithm that computes the mean?

On Wed, Nov 30, 2011 at 9:06 PM, Thijs (M.A.) van den Berg <thijs@sitmo.com> wrote:
The examples demonstrate how to improve the computational complexity from O(N) to O(log N) in the following applications and problems:
- numerical integration;
- calculation of mean value and standard deviation;
- calculation of weighted mean value;
I think O(log N) is misleading. Problem: I want to estimate the mean of N numbers in a file on my disk. Clearly I should at least present all of the numbers in that file at least once (O(N)) to a black box algorithm that computes the mean?
Thank you for this comment. You are right: if you mean the cost of construction of an advanced data structure, it is O(N log N). As soon as the required data have been written, the sum and mean value of any consecutive elements can be found in O(log N) time. Please have a look at the performance tests of the fast algorithm accumulate() in the section:
http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/performance.html
Also, you can use the example or test code to see the difference in performance between the standard and fast accumulate() algorithms.
Regards, Vadim Stadnik
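To make the trade-off concrete, here is a minimal sketch of how an augmented tree answers such range-sum queries (illustrative only; the library's bp_tree_idx_acc works on an augmented B+ tree rather than the binary tree shown here). Besides the element counter, every node also caches the sum of the values in its subtree, and both fields are cheap to maintain during insert/erase.

#include <cstddef>

struct node {
    node*       child[2];
    std::size_t count;   // elements in this subtree
    double      sum;     // sum of values in this subtree
    double      value;
};

inline std::size_t cnt(const node* n)     { return n ? n->count : 0; }
inline double      sub_sum(const node* n) { return n ? n->sum : 0.0; }

// Sum of the first i elements in O(log N); the sum over positions [first, last) is then
// prefix_sum(root, last) - prefix_sum(root, first), and the mean follows by division.
double prefix_sum(const node* root, std::size_t i) {
    double acc = 0.0;
    while (root && i > 0) {
        const std::size_t left = cnt(root->child[0]);
        if (i <= left) {
            root = root->child[0];
        } else {
            acc += sub_sum(root->child[0]) + root->value;
            i -= left + 1;
            root = root->child[1];
        }
    }
    return acc;
}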

On Wed, Nov 30, 2011 at 9:06 PM, Thijs (M.A.) van den Berg <thijs@sitmo.com> wrote:
The examples demonstrate how to improve the computational complexity from O(N) to O(log N) in the following applications and problems:
- numerical integration;
- calculation of mean value and standard deviation;
- calculation of weighted mean value;
I think O(log N) is misleading. Problem: I want to estimate the mean of N numbers in a file on my disk. Clearly I should at least present all of the numbers in that file at least once (O(N)) to a black box algorithm that computes the mean?
Thank you for this comment. You are right: if you mean the cost of construction of an advanced data structure, it is O(N log N). As soon as the required data have been written, the sum and mean value of any consecutive elements can be found in O(log N) time. Please have a look at the performance tests of the fast algorithm accumulate() in the section:
http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/performance.html
Also, you can use the example or test code to see the difference in performance between the standard and fast accumulate() algorithms.
Thanks Vadim. For certain applications your container clearly outperforms others. The hierarchical accumulation is particularly interesting if you need to calculate many more accumulations than you do inserts. To get a better view of both the pros and the potential cons, can you also add a comparison for *push_back inserts* (like you did for the multiset) in a std::vector besides the worst case 'before the middle'? An interesting project!

On Wed, Nov 30, 2011 at 10:17 PM, Thijs (M.A.) van den Berg <thijs@sitmo.com> wrote:
On Wed, Nov 30, 2011 at 9:06 PM, Thijs (M.A.) van den Berg <thijs@sitmo.com>wrote:
Thanks Vadim. For certain applications your container clearly outperforms others. The hierarchical accumulation is particularly interesting if you need to calculate many more accumulations than you do inserts. To get a better view of both the pros and the potential cons, can you also add a comparison for *push_back inserts* (like you did for the multiset) in a std::vector besides the worst case 'before the middle'?
An interesting project!
Thank you again for the useful comments. I will add a comparison of push_back() performance for the standard sequences (std::list and std::vector) and the new sequences supported by the two types of augmented B+ trees in this project. Regards, Vadim Stadnik

On 30/11/11 10:58, Vadim Stadnik wrote:
Hi all,
Advanced data structures (ADS) offer useful STL extensions that are not possible or not efficient with basic data structures, such as arrays, linked lists and red-black trees.
The project
https://github.com/vstadnik/stl_ext_adv
http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/index.html
focuses mainly on a generalized and unified computer representation of sequences using the augmented B+ trees. It provides the following STL extensions:
- The sequence container supports random access iterators and a union of interfaces of STL containers: vector, list and deque. This sequence offers the advantage of efficient update operations with logarithmic running time compared with linear running time of the same operations of std::vector.
- The associative containers (set, multiset, map and multimap) with random access iterators.
- The enhanced random access iterators effectively address the fundamental problem of the iterator invalidation.
- The function accumulate() with logarithmic computational complexity can significantly improve performance of many practically important algorithms and applications, such as the numerical integration, the statistics, the data analysis etc.
Hi, how is your library related/compared to Boost.MultiIndex? Best, Vicente

On Wed, Nov 30, 2011 at 11:29 PM, Vicente J. Botet Escriba < vicente.botet@wanadoo.fr> wrote:
how is your library related/compared to Boost.MultiIndex?
As far as I understand, the multi-index library provides multiple views of a collection of data. The submitted containers and data structures can be quite useful at the implementation level for this type of application, since one advanced data structure efficiently supports many types of STL sequences and associative containers. As for applications to other Boost libraries, I think that in their current shape the new data structures and containers can increase the performance of Accumulators and Interval. The project includes examples demonstrating how these containers can solve typical problems in these areas. With some modifications, the augmented data structures may be used in other Boost libraries. Regards, Vadim

Vadim Stadnik wrote
On Wed, Nov 30, 2011 at 11:29 PM, Vicente J. Botet Escriba < vicente.botet@> wrote:
how is your library related/compared to Boost.MultiIndex?
As far as I understand library multi-index provides multiple views of a collection of data. The submitted containers and data structures can be quite useful at implementation level for this type of applications, since one advanced data structure efficiently supports many types of STL sequences and associative containers.
I don't see how your library could improve the efficiency of Boost.MultiIndex. Maybe you could show some figures comparing both libraries.
As for applications to other Boost libraries, I think that in the current shape the new data structures and containers can increase the performance of Accumulators and Interval.
How can your library improve the performance of these libraries? Note that I'm not against your library, which I don't know well; I just want to see what the added value is. Best, Vicente

On Thu, Dec 1, 2011 at 3:01 AM, Vicente Botet <vicente.botet@wanadoo.fr>wrote:
how is your library related/compared to Boost.MultiIndex?
As far as I understand library multi-index provides multiple views of a collection of data. The submitted containers and data structures can be quite useful at implementation level for this type of applications, since one advanced data structure efficiently supports many types of STL sequences and associative containers.
I don't see how your library could improve the efficiency of Boost.MultiIndex. Maybe you could show some figures comparing both libraries.
As for applications to other Boost libraries, I think that in the current shape the new data structures and containers can increase the performance of Accumulators and Interval.
How your library can improve the performance of these libraries? Note that I'm not against your library, that I don't know well, but just I want to see what is the added value.
I am not familiar with the implementation details of the multi-index library. My reply only concerns the general use of the new containers in multi-view applications.

The demonstration is actually available in the example of fast counting of intersections of one interval with a sequence of intervals. This solution uses one container, which stores all interval objects, plus two containers, which represent ordered views of the lower and upper endpoints of these intervals. The ordered views support efficient search operations and provide random access iterators. For this reason the algorithm of intersection counting has logarithmic running time, against linear time if the solution were based on standard containers. In principle, the approach can be extended and applied to range queries using multiple views.

The new library can support efficient accumulators through the fast member function accumulate() with logarithmic running time in the worst case. Efficient accumulators are not yet implemented in this project. The section "Iterators and accumulators" of this document http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/iterators.html discusses the pros and cons of this option. I am not sure whether to add this functionality or not; this is why I am asking the Boost community if there is an interest in it.

Regards, Vadim Stadnik
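A minimal sketch of the counting logic described above (not the project's code; sorted std::vectors stand in for the ordered endpoint views, which gives the logarithmic query but only linear-time updates, whereas the proposed containers would keep updates logarithmic as well):

#include <algorithm>
#include <cstddef>
#include <vector>

// Two ordered "views" of the endpoints of the stored closed intervals [lo, hi].
struct interval_index {
    std::vector<double> lows;    // all lower endpoints, kept sorted
    std::vector<double> highs;   // all upper endpoints, kept sorted

    // Number of stored intervals intersecting the query [a, b]:
    // total minus those that end before a minus those that start after b.  O(log N).
    std::size_t count_intersections(double a, double b) const {
        const std::size_t n = lows.size();
        const std::size_t end_before =
            std::lower_bound(highs.begin(), highs.end(), a) - highs.begin();
        const std::size_t start_after =
            lows.end() - std::upper_bound(lows.begin(), lows.end(), b);
        return n - end_before - start_after;
    }
};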

On 01/12/11 00:41, Vadim Stadnik wrote:
On Thu, Dec 1, 2011 at 3:01 AM, Vicente Botet<vicente.botet@wanadoo.fr>wrote:
how is your library related/compared to Boost.MultiIndex?
As far as I understand, the multi-index library provides multiple views of a collection of data. The submitted containers and data structures can be quite useful at the implementation level for this type of application, since one advanced data structure efficiently supports many types of STL sequences and associative containers.
I don't see how your library could improve the efficiency of Boost.MultiIndex. Maybe you could show some figures comparing both libraries.
As for applications to other Boost libraries, I think that in the current shape the new data structures and containers can increase the performance of Accumulators and Interval.
How your library can improve the performance of these libraries? Note that I'm not against your library, that I don't know well, but just I want to see what is the added value.
I am not familiar with implementation details of multi-index library. My reply only concerns the general use of new containers in multi-view applications.
What do you mean by the last sentence?
The demonstration is actually available in the example of fast counting of intersections of one interval with a sequence of intervals. This solution uses one container, which stores all interval objects, plus two containers, which represent ordered views of the lower and upper endpoints of these intervals. The ordered views support efficient search operations and provide random access iterators. For this reason the algorithm of intersection counting has logarithmic running time, against linear time if the solution were based on standard containers. In principle, the approach can be extended and applied to range queries using multiple views.
What I mean is that it seems to me that Boost.MultiIndex already provides several indices for a container, including a random access index. So, what exactly is the added value? If your use case can already be implemented using Boost.MultiIndex, I will consider proposing a specific abstraction on top of Boost.Container, as is done in Boost.Bimap. If it is not possible, because a specific kind of container must be added to Boost.MultiIndex, I would prefer that Boost.MultiIndex integrate this new container if the use case is of general purpose.
I guess that you could study Boost.MultiIndex to see if your library provides something that is not doable with multi-index, or that can be done better, or that improves performance. I assure you that it will not be a waste of your time.
The new library can support efficient accumulators through the fast member function accumulate() with logarithmic running time in the worst case.
Doesn't accumulate have O(log N) complexity for containers that have random access? If not, maybe Boost could include a boost::accumulate function that provides this complexity (if it doesn't already exist) in a non-intrusive way.
Efficient accumulators are not yet implemented in this project. The section "Iterators and accumulators" of this document
http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/iterators.html
discusses the pros and cons of this option. I am not sure whether to add this functionality or not; this is why I am asking the Boost community if there is an interest in it.
IIUC you are proposing to store the accumulation of all the lower nodes in each node of the container, and to maintain this information up to date while inserting and removing elements, isn't it?
I have not found a use case for that, but several times I have needed to have each node store the accumulation of each one of its children. I will see this as a possible addition to Boost.MultiIndex as a special container/index that stores the accumulation of the preceding members, with its pros and cons. Whether it is worth including this in Boost.MultiIndex is an open question that Joaquín could answer if needed. Best, Vicente

On Thu, Dec 1, 2011 at 11:26 AM, Vicente J. Botet Escriba < vicente.botet@wanadoo.fr> wrote:
how is your library related/compared to Boost.MultiIndex?
As far as I understand library multi-index provides multiple views of a
collection of data. The submitted containers and data structures can be quite useful at implementation level for this type of applications, since one advanced data structure efficiently supports many types of STL sequences and associative containers.
I don't see how your library could improve the efficiency of Boost.MultiIndex. Maybe you could show some figures comparing both libraries.
As for applications to other Boost libraries, I think that in the
current shape the new data structures and containers can increase the performance of Accumulators and Interval.
How your library can improve the performance of these libraries? Note that I'm not against your library, that I don't know well, but just I want to see what is the added value.
I am not familiar with implementation details of multi-index library. My reply only concerns the general use of new containers in multi-view applications.
What do you mean by the last sentence?
The demonstration is actually available in the example of
fast counting of intersections of one interval with sequence of intervals. This solution uses one container, which stores all objects of intervals, plus two containers, which represent ordered views of lower and upper endpoints of these intervals. The ordered views support efficient search operations and provide random access iterators. For this reason the algorithm of intersection counting has logarithmic running time against linear time if the solution was based on standard containers. In principle, the approach can be extended and applied to range queries using multiple views.
What I mean is that it seems to me that Boost.MultiIndex provides already several index for a container, including random access index. So, what is exactly the added value. If your use case can be implemented already using Boost.MultiIndex I will consider to propose a specific abstraction on top of Boost.Container as it is done in Boost.Bimap. If it is not possible, because a specific kind of container must be added to Boost.MultiIndex, I will prefer Boost.Multiindex integrates this new container if the use case is of general purpose.
The last sentence is very close to what I mean. The proposed library is a generalized and quite efficient extension of the standard containers, as described in the documentation. The containers and algorithms of this library are designed to become modules and provide support for other, more advanced libraries and solutions.
I guess that you could study Boost.MultiIndex to see if your library provides something that is not doable with multi-index, or can be done better or that improves the performances. I ensure you will not lost your time.
I will have a deeper look at the multi-index library. Also, I would like to ask the following question, which can clarify the usefulness of the new containers for multi-index: can the multi-index library solve the problem of counting intersections of one interval with an arbitrary sequence of intervals in O(log N) time and provide the same running time for update operations? The proposed library can support this efficiency, since its containers have logarithmic cost of find, insert and erase operations and random access iterators for both ordered and un-ordered data.
New library can support efficient accumulators through the fast member
function accumulate() with logarithmic running time in the worst case.
Doesn't accumulate have O(log N) complexity for containers that have random access? If not, maybe Boost could include a boost::accumulate function that provides this complexity (if it doesn't already exist) in a non-intrusive way.
Standard containers, such as std::vector, std::list, std::set etc., provide linear computational cost of std::accumulate() through their iterators. If a Boost library has already implemented an accumulate() algorithm with logarithmic complexity, please send me a reference or a link; this is very interesting.
Efficient accumulators are not yet implemented in this project. The
section "Iterators and accumulators" of this document
discusses pros and cons of this option. I am not sure whether to add this functional or not, this is why I ask Boost community if there is an interest in this functionality.
IIUC you are proposing to store the accumulation of all the lower nodes in each node of the container, and maintain this information up to date while inserting and removing elements, isn't it?
Just a minor clarification: the accumulated value of all elements in the range [0, position) is to be stored as a data member of an iterator.
I have not found a use case for that, but several times I have needed to have each node store the accumulation of each one of its children.
The augmented B+ tree (class bp_tree_idx_acc) constructs these data for every internal node, in other words for every level of the tree except the lowest, which stores the container elements.
I will see this as a possible addition to Boost.MultiIndex as a special container/index that stores the accumulation of the preceding members, with its pros and cons. Whether it is worth including this in Boost.MultiIndex is an open question that Joaquín could answer if needed.
Thank you, that would be very helpful. Regards, Vadim Stadnik
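A minimal sketch of the accumulator-carrying iterator idea discussed above (hypothetical names, not the project's API): each iterator stores the accumulated value of the elements before its position, so that accumulating over a range reduces to one subtraction.

#include <vector>

// Iterator enhanced with an accumulator: prefix_sum is the sum of all elements in
// [begin, pos); the container would keep it up to date as the iterator moves.
struct acc_iterator {
    std::vector<double>::const_iterator pos;
    double prefix_sum;
};

// Constant-cost accumulate over [first, last), independent of the distance.
inline double accumulate_range(const acc_iterator& first, const acc_iterator& last) {
    return last.prefix_sum - first.prefix_sum;
}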

Vadim Stadnik wrote:
As for applications to other Boost libraries, I think that in the current shape the new data structures and containers can increase the performance of Accumulators and Interval. ... The demonstration is actually available in the example of fast counting of intersections of one interval with a sequence of intervals. This solution uses one container, which stores all interval objects, plus two containers, which represent ordered views of the lower and upper endpoints of these intervals. The ordered views support efficient search operations and provide random access iterators. For this reason the algorithm of intersection counting has logarithmic running time, against linear time if the solution were based on standard containers. In principle, the approach can be extended and applied to range queries using multiple views. ...
Can the multi-index library solve the problem of counting intersections of one interval with an arbitrary sequence of intervals in O(log N) time and provide the same running time for update operations? The proposed library can support this efficiency, since its containers have logarithmic cost of find, insert and erase operations and random access iterators for both ordered and un-ordered data.
Your example at least looks like it has some overlap with Boost.ICL, the Interval Container Library. Or is that what you were referring to above when you mentioned the Boost Interval library? Jeff

On Fri, Dec 2, 2011 at 1:12 AM, Jeff Flinn <Jeffrey.Flinn@gmail.com> wrote:
...
Can multi-index library solve the problem of counting of intersections of
one interval with an arbitrary sequence of intervals in O(log N) time and provide the same running time for update operations? The proposed library can support this efficiency, since its containers have logarithmic cost of find, insert and erase operations and random access iterator for both ordered and un-ordered data.
Your example at least looks like some overlap with Boost ICL, the Interval Container Library. Or is that what you were referring to above when you mentioned the Boost Interval library?
I have included this example, since in my opinion it is an interesting demonstration that new containers can efficiently solve problems in various application areas. The example shows the advantage of the implemented STL extensions. The proposed library can certainly support interval algorithms. However, at some level the generality of STL interfaces becomes a limitation in this application area. In my opinion the benefit of augmented data structures for ICL can be increased through specialized variants. Please let me know if there is any special interest in interval algorithms. Regards, Vadim Stadnik

On 01/12/2011, at 01:28, "Vicente J. Botet Escriba" <vicente.botet@wanadoo.fr> wrote:
I will see this as a possible addition to Boost.MultiIndex as a special container/index that stores the accumulation of the preceding members, with its pros and cons. Whether it is worth including this in Boost.MultiIndex is an open question that Joaquín could answer if needed.
Hi, If time permits, I'll take a close look at the lib this weekend and offer my views on how it potentially interacts with Boost.MultiIndex. Joaquín M López Muñoz Telefónica Digital

On Thu, Dec 1, 2011 at 11:49 PM, JOAQUIN M. LOPEZ MUÑOZ <joaquin@tid.es>wrote:
On 01/12/2011, at 01:28, "Vicente J. Botet Escriba" <vicente.botet@wanadoo.fr> wrote:
I will see this as a possible addition to Boost.MultiIndex as a special container/index that stores the accumulation of the preceding members, with its pros and cons. Whether it is worth including this in Boost.MultiIndex is an open question that Joaquín could answer if
needed.
Hi,
If time permits, I'll take a close look at the lib this weekend and offer my views on how it potentially interacts with Boost.Multiindex.
Thank you, this should be very helpful. Also, I would like to ask you to have a look at the example of fast counting of intersections of one interval with a sequence of intervals. It has already been discussed in one of my previous replies to Vicente. I guess multi-index can provide a solution to the problem, but what would be the best efficiency of such a solution with the current multi-index? And what would be the efficiency of update operations? Regards, Vadim Stadnik

On 30-11-2011 10:58, Vadim Stadnik wrote:
Hi all,
Advanced data structures (ADS) offer useful STL extensions that are not possible or not efficient with basic data structures, such as arrays, linked lists and red-black trees.
I'm interested. In particular, if the library can be integrated seamlessly with Boost.Accumulators and other libs, it seems very interesting. Are there variations of the trees that allow for other types of fast queries than just accumulate? If so, then there might be a generic way to incorporate them. -Thorsten

You are right, if you mean the cost of construction of an advanced data structure, it is O(N log N). As soon as the required data have been written the sum and mean value of any consecutive elements can be found in O(log N) time. Please, have a look at performance tests of the fast algorithm accumulate() in the section:
The next solution works in O(1) and requires only O(N) for the construction step:

#include <cstddef>
#include <vector>

template <typename T>
class accumulator {
public:
    void init(std::vector<T> &elems) {
        sum.reserve(elems.size() + 1);
        sum.push_back(0);
        for (typename std::vector<T>::iterator it = elems.begin(); it != elems.end(); ++it) {
            sum.push_back(sum.back() + *it);
        }
    }
    // sum of elems[first..last], computed as the difference of two prefix sums
    T operator()(std::size_t first, std::size_t last) const {
        return sum[last + 1] - sum[first];
    }
private:
    std::vector<T> sum;   // sum[i] == elems[0] + ... + elems[i-1]
};

I guess this solution is slightly better for those math targets you mentioned above. It also works with any associative operation.

Andrii

On Thu, Dec 1, 2011 at 11:42 AM, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
On 30-11-2011 10:58, Vadim Stadnik wrote:
Hi all,
Advanced data structures (ADS) offer useful STL extensions that are not possible or not efficient with basic data structures, such as arrays, linked lists and red-black trees.
I'm interested. In particular, if the library can be integrated seamlessly with Boost.Accumulators and other libs, it seems very interesting.
Are there variations of the trees that allow for other types of fast queries than just accumulate? If so, then there might be a generic way to incorporate them.
-Thorsten

On Thu, Dec 1, 2011 at 10:45 PM, Andrii Sydorchuk < sydorchuk.andriy@gmail.com> wrote:
You are right, if you mean the cost of construction of an advanced data structure, it is O(N log N). As soon as the required data have been
written
the sum and mean value of any consecutive elements can be found in O(log N) time. Please, have a look at performance tests of the fast algorithm accumulate() in the section:
The next solution works in O(1) and requires only O(N) for the construction step:

#include <cstddef>
#include <vector>

template <typename T>
class accumulator {
public:
    void init(std::vector<T> &elems) {
        sum.reserve(elems.size() + 1);
        sum.push_back(0);
        for (typename std::vector<T>::iterator it = elems.begin(); it != elems.end(); ++it) {
            sum.push_back(sum.back() + *it);
        }
    }
    // sum of elems[first..last], computed as the difference of two prefix sums
    T operator()(std::size_t first, std::size_t last) const {
        return sum[last + 1] - sum[first];
    }
private:
    std::vector<T> sum;   // sum[i] == elems[0] + ... + elems[i-1]
};

I guess this solution is slightly better for those math targets you mentioned above. It also works with any associative operation.
Andrii
This is the best possible approach provided that the data are not modified or updated. Unfortunately, in practical data analysis, data fitting and approximation, this is a relatively rare case. For large and huge data sets the cost of the vector update operations will be quite significant and often prohibitive. It is also interesting to note that if the same solution were based on the container sequence<bp_tree_idx> from the proposed library, then it would efficiently support update operations for container elements; however, it would have linear cost of updating the sum data. The best possible performance for a user algorithm which requires a wide set of operations can be achieved with sequence<bp_tree_idx_acc>. A constant cost of summation similar to this example is possible, but not implemented yet. This is discussed in the section "Iterators and accumulators". Thank you, Vadim

On Thu, Dec 1, 2011 at 9:42 PM, Thorsten Ottosen < thorsten.ottosen@dezide.com> wrote:
On 30-11-2011 10:58, Vadim Stadnik wrote:
Hi all,
Advanced data structures (ADS) offer useful STL extensions that are not possible or not efficient with basic data structures, such as arrays, linked lists and red-black trees.
I'm interested. In particular, if the library can be integrated seamlessly with Boost.Accumulators and other libs, it seems very interesting.
Thank you for the interest. At the moment I think that the best option is to provide access to the proposed data structures through STL interfaces, since they maximize the usefulness of these data structures. The other options are the development of specialized data structures or the addition of specialized interface functions to the existing classes. Let me know what kind of functionality is not supported yet, so that I have a better understanding of the specific problems.
Are there variations of the trees that allow for other types of fast
queries than just accumulate? If so, then there might be a generic way to incorporate them.
Yes, you are right. The augmented B+ trees are widely used in solutions to various problems of computational geometry. However, I think that this area is not general enough for Boost libraries, which is why it is out of the scope of this project. In principle, there are many other solutions which can benefit from augmented data structures. I have a number of ideas for future development, but I am not sure if they are really useful for Boost users and developers. This is why I asked the Boost community about the most interesting and useful applications for the augmented data structures. A generalized implementation of augmented data structures is possible. It is also possible that such an implementation can be based on a simpler topology and data scheme than those used in the proposed library. However, as usual, a generalization involves some risk of not-optimized support for specific algorithms. For example, the proposed data structures support not only the high efficiency of the algorithm accumulate(), but also a relatively high accuracy of this algorithm. Regards, Vadim Stadnik

On 01/12/2011 13:30, Vadim Stadnik wrote:
In principle there are many other solutions which can benefit from augmented data structures. I have a number of ideas as for future development, but I am not sure if they are really useful for Boost users and developers. This is why I asked Boost community about most interesting and useful applications for the augmented data structures.
A generalized implementation of augmented data structures is possible. It is also possible that such implementation can be based on a simpler topology and data scheme than those used in the proposed library. However, as usual a generalization involves some risk of not optimized support for specific algorithms. For example, the proposed data structures support not only the high efficiency of algorithm accumulate(), but also a relatively high accuracy of this algorithm.
It would be very interesting if augmented data structures could be added to the Boost.Intrusive algorithms. This way we could build intrusive structures based on augmented data structures and, on top of that, non-intrusive, STL-like containers. If you are interested, have a look at the Boost.Intrusive implementation to see if we could introduce augmented data structures there and benefit more applications. Ion

2011/12/1 Ion Gaztañaga <igaztanaga@gmail.com>
It would be very interesting if augmented data structures could be added to Boost.Intrusive algorithms. This way could build intrusive structures built on augmented data structures, and on top of that, non-intrusive, STL-like containers.
If you are interested, have a look at Boost.Intrusive implementation to see if we could introduce augmented data structures there and benefit more applications.
Thank you for the interest and suggestions.
I am planning to analyze some Boost libraries from the user point of view and compare them with the functionality of the proposed library. If Boost developers find this new library interesting and useful, I will be happy to discuss implementation details and problems to be solved. Regards, Vadim Stadnik

Vadim Stadnik <vadimstdk <at> gmail.com> writes:
Hi all,
Advanced data structures (ADS) offer useful STL extensions that are not possible or not efficient with basic data structures, such as arrays, linked lists and red-black trees.
The project
https://github.com/vstadnik/stl_ext_adv
http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/index.html
focuses mainly on a generalized and unified computer representation of sequences using the augmented B+ trees. It provides the following STL extensions [...]
Hi,

This is a rough study of the functionality provided by your lib in general and as compared with Boost.MultiIndex.

* Container_bptree's internal data structure relies on B+ trees, with two variants based on the "augmented data structure" idea explained in Cormen, Leiserson, Rivest and Stein's "Introduction to Algorithms" (an online explanation can be found at http://tinyurl.com/d8lze7v , for instance.) Briefly put, such an augmenting scheme boils down to adding additional information to the nodes, updated at insertion and erasure time, which is based on the additional info of the sibling nodes. The variants are:
- bp_tree_idx: augmented to allow for efficient indexing
- bp_tree_idx_acc: doubly augmented to support efficient indexing plus a generalized "accumulate" notion usable for calculating summations, mean values, etc.

* Question #1: why augmenting on B+ trees? The augmenting idea can be as easily built on top of RB-trees, which are the weapon of choice for implementing associative containers. Unless proved otherwise, RB-trees are assumed to be faster than B+ trees for in-memory structures (which your own tests hint at, cf. the "insert N random elements" table.) You say: "The red black trees are yet out of the scope of this project for the reasons explained in the documentation." I've failed to find such an explanation. Also, I suspect memory consumption is higher with B+ trees than RB trees.
- Request #1: explain why B+ trees rather than RB trees.
- Request #2: provide code for performance tests so that Boost.MultiIndex and other data structures can be compared.
- Request #3: augment the performance tests so as to yield memory consumption figures.

* Usage scenario 1: plain B+ trees as a replacement for RB trees for associative containers. My hunch is that RB-trees would perform better, as explained above.

* Usage scenario 2: sequence-like containers with efficient insertion/erasure, directly based on bp_tree_idx with a do-nothing sorting criterion. These containers are not truly random-access, but rather provide O(log N) access, which can be acceptable (or not) for some applications.

* Usage scenario 3: associative containers with random access iterators. This roughly compares with a multi_index_container equipped with two indices, one ordered, the other random-access. Basic differences between both:
- The multi_index_container provides truly O(1) random access whereas Container_bptree is O(log N) (which is not random access strictly speaking.)
- With the multi_index_container, sorting order and insertion order are separate -- some sort of synchronization between both indices would be needed to mimic the Container_bptree case.
- Insertion/erasure with the multi_index_container is linear (though much faster than with std::vector), logarithmic with Container_bptree.
- The memory overhead introduced by the multi_index_container is two pointers vs. only one word for Container_bptree.

* Usage scenario 4: containers with fast accumulate, a direct application of the augmenting idea explained above. There is nothing in Boost that provides this functionality.

My conclusions:
1. Augmented data structures (of which efficient indexing is just a particular application) are definitely useful and worth having in Boost.
2. Unless proved otherwise, we should be using RB trees (or maybe AVL) rather than B+ trees.
3. Depending on the particular needs, some usage scenarios can be better served with Boost.MultiIndex (note I'm saying "depending on needs", since differences between approaches are substantial.)
4. To me, rather than providing out-of-the-box containers with augmenting, we should provide this via two libraries:
- As an additional capability of Boost.Intrusive associative containers (and AVL-tree based containers as well, if technically possible.)
- As an additional capability of Boost.MultiIndex ordered indices.

Hope this helps,

Joaquín M López Muñoz
Telefónica Digital
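For reference, a minimal sketch (illustrative only) of the two-index multi_index_container configuration that usage scenario 3 compares against: one ordered index plus one random-access index over the same elements.

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/random_access_index.hpp>
#include <boost/multi_index/identity.hpp>

namespace bmi = boost::multi_index;

// One collection, two views: index #0 keeps the elements sorted,
// index #1 keeps insertion order and gives O(1) positional access.
typedef bmi::multi_index_container<
    int,
    bmi::indexed_by<
        bmi::ordered_non_unique< bmi::identity<int> >,
        bmi::random_access<>
    >
> values_t;

int main() {
    values_t v;
    v.insert(42);   // the container's default interface is that of the ordered index
    v.insert(7);
    const values_t::nth_index<1>::type& ra = v.get<1>();
    int first_inserted = ra[0];   // positional view follows insertion order, not sort order
    (void)first_inserted;
    return 0;
}

As noted above, the random-access index reflects insertion order, so extra synchronization would be needed to mimic sorted order with positional access as the Container_bptree containers provide it.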

* Question #1: why augmenting on B+ trees? The augmenting idea can be as easily built on top of RB-trees, which are the weapon of choice for implementing associative containers. Unless proved otherwise, RB-trees are assumed to be faster than B+ trees for in-memory structures (which your own tests hint at, cf. "insert N random elements" table.)
Could a B+ tree library be a basis for implementing disk-backed B+ trees? I think that would be useful. In a clever implementation the memory-backed and disk-backed B+ tree might even be the same class, with the difference factored out into a policy template parameter. Regards, Nate
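A rough sketch of the policy-parameter idea Nate mentions (purely illustrative names, not part of the proposed library or of any existing one): if all node access goes through a storage policy, the same tree algorithms could in principle be memory-backed or disk-backed.

#include <cstddef>

// In-memory policy: nodes are ordinary heap objects, nothing to page in or out.
template <typename Node>
struct in_memory_storage {
    Node* load(Node* n) const { return n; }   // node is already resident in RAM
    void  release(Node*) const {}             // nothing to evict or write back
};

// A hypothetical disk_backed_storage policy would map node ids to pages, cache hot
// nodes and write dirty ones back; the tree algorithms themselves would not change.
template <typename T, template <typename> class Storage = in_memory_storage>
class bp_tree {
    struct node { T value; node* next; std::size_t count; };
    Storage<node> storage_;   // every access to a child or sibling goes through this
public:
    // search/insert/erase would call storage_.load(...) before touching a node
};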

On Sun, Dec 4, 2011 at 8:34 AM, Nathan Ridge <zeratul976@hotmail.com> wrote:
* Question #1: why augmenting on B+ trees? The augmenting idea can be as easily built on top of RB-trees, which are the weapon of choice for implementing associative containers. Unless proved otherwise, RB-trees are assumed to be faster than B+ trees for in-memory structures (which your own tests hint at, cf. "insert N random elements" table.)
Could a B+ tree library be a basis for implementing disk-backed B+ trees? I think that would be useful. In a clever implementation the memory-backed and disk-backed B+ tree might even be the same class, with the difference factored out into a policy template parameter.
Regards, Nate
I understood this comment to be about adding persistence or serialization support to the proposed B+ trees. Please correct me if this is not the case. There are two levels of complexity in this area. A trivial one is just saving and reading the data as a list, since these B+ trees are built on top of doubly linked lists. This might be sufficient for many practical applications, but certainly not for all. Hence there is the second and less trivial problem of saving and reading operations which preserve the topology of these B+ trees, which are designed to work in RAM. (Note: "topology" is used here to mean the state of connectivity of the nodes with data stored in the nodes.) In principle, the problem can be solved for the submitted B+ trees, but I will need some Boost-specific requirements to implement this feature. Regards, Vadim

* Question #1: why augmenting on B+ trees? The augmenting idea can be as easily built on top of RB-trees, which are the weapon of choice for implementing associative containers. Unless proved otherwise, RB-trees are assumed to be faster than B+ trees for in-memory structures (which your own tests hint at, cf. "insert N random elements" table.)
Could a B+ tree library be a basis for implementing disk-backed B+ trees? I think that would be useful. In a clever implementation the memory-backed and disk-backed B+ tree might even be the same class, with the difference factored out into a policy template parameter.
Regards, Nate
I understood this comment is about adding persistence or serialization support to the proposed B+ trees. Please correct if this is not the case. There are two levels of complexity in this area. A trivial one is just saving and reading data as a list, since these B+ trees are build on top of doubly linked lists. This might be sufficient for many practical applications, but certainly not for all. Hence there is the second and less trivial problem of saving and reading operations which preserve topology of these B+ trees that are designed to work in RAM. (Note: "topology" has been used here as a state of connectivity of nodes with data stored in the nodes). In principle, the problem can be solved for the submitted B+ trees, but I will need some Boost specific requirements to implement this feature.
What I mean by a disk-backed B+ tree is a B+ tree where only the nodes that are needed at a given time are loaded into memory from disk - the entire tree on disk might be too large to even fit into memory. Regards, Nate

On Sun, Dec 4, 2011 at 9:28 AM, Nathan Ridge <zeratul976@hotmail.com> wrote:
* Question #1: why augmenting on B+ trees? The augmenting idea can be as easily built on top of RB-trees, which are the weapon of choice for implementing associative containers. Unless proved otherwise, RB-trees are assumed to be faster than B+ trees for in-memory structures (which your own tests hint at, cf. "insert N random elements" table.)
Could a B+ tree library be a basis for implementing disk-backed B+ trees? I think that would be useful. In a clever implementation the memory-backed and disk-backed B+ tree might even be the same class, with the difference factored out into a policy template parameter.
Regards, Nate
I understood this comment is about adding persistence or serialization support to the proposed B+ trees. Please correct if this is not the case. There are two levels of complexity in this area. A trivial one is just saving and reading data as a list, since these B+ trees are build on top of doubly linked lists. This might be sufficient for many practical applications, but certainly not for all. Hence there is the second and less trivial problem of saving and reading operations which preserve topology of these B+ trees that are designed to work in RAM. (Note: "topology" has been used here as a state of connectivity of nodes with data stored in the nodes). In principle, the problem can be solved for the submitted B+ trees, but I will need some Boost specific requirements to implement this feature.
What I mean by a disk-backed B+ tree is a B+ tree where only the nodes that are needed at a given time are loaded into memory from disk - the entire tree on disk might be too large to even fit into memory.
Thank you for this clarification.
This is certainly the third level of complexity, which is why I offer below only ideas about possible solutions. Assuming that problem #2 has already been solved, the reading operation can be implemented by reading a sub-tree from the highest internal node, which represents the root node of the sub-tree. Directly writing back the same sub-tree, which has been modified while working in RAM, is not allowed, since it can destroy invariants of the whole structure saved on disk. It seems to me the most reasonable option is to implement writing as efficient splice operations (in theory they have logarithmic cost), which are applied to the structure stored on disk. I know that efficient splice operations are possible for B+ trees in RAM, but I am not sure if these algorithms are also suitable for disk systems. Hope this quick consideration makes some useful sense. Regards, Vadim

* Question #1: why augmenting on B+ trees? The augmenting idea can be as easily built on top of RB-trees, which are the weapon of choice for implementing associative containers. Unless proved otherwise, RB-trees are assumed to be faster than B+ trees for in-memory structures (which your own tests hint at, cf. "insert N random elements" table.)
Could a B+ tree library be a basis for implementing disk-backed B+ trees? I think that would be useful. In a clever implementation the memory-backed and disk-backed B+ tree might even be the same class, with the difference factored out into a policy template parameter.
Regards, Nate
I understood this comment is about adding persistence or serialization support to the proposed B+ trees. Please correct if this is not the case. There are two levels of complexity in this area. A trivial one is just saving and reading data as a list, since these B+ trees are build on top of doubly linked lists. This might be sufficient for many practical applications, but certainly not for all. Hence there is the second and less trivial problem of saving and reading operations which preserve topology of these B+ trees that are designed to work in RAM. (Note: "topology" has been used here as a state of connectivity of nodes with data stored in the nodes). In principle, the problem can be solved for the submitted B+ trees, but I will need some Boost specific requirements to implement this feature.
What I mean by a disk-backed B+ tree is a B+ tree where only the nodes that are needed at a given time are loaded into memory from disk - the entire tree on disk might be too large to even fit into memory.
Thank you for this clarification. This is certainly the third level of complexity, this is why below only ideas about possible solutions. Assuming that problem #2 has been already solved the reading operation can be implemented through reading a sub-tree from the highest internal node which represents the root node of the sub-tree. Direct writing back the same sub-tree, which has been modified while working in RAM, is not allowed, since it can destroy invariants of the whole structure saved on disk. It seems to me the most reasonable option is to implement writing as efficient splice operations (in theory they have logarithmic cost), which are applied to the structure stored on disk. I know the efficient splice operations are possible for B+ trees in RAM, but I am not sure if these algorithms are also suitable for disk systems.
Hope this quick consideration makes some useful sense.
How is it done by relational database indexes? Aren't those commonly implemented as disk-backed B+ trees? Regards, Nate

On Sun, Dec 4, 2011 at 10:43 AM, Nathan Ridge <zeratul976@hotmail.com>wrote:
* Question #1: why augmenting on B+ trees? The augmenting idea can be as easily built on top of RB-trees, which are the weapon of choice for implementing associative containers. Unless proved otherwise, RB-trees are assumed to be faster than B+ trees for in-memory structures (which your own tests hint at, cf. "insert N random elements" table.)
Could a B+ tree library be a basis for implementing disk-backed B+ trees? I think that would be useful. In a clever implementation the memory-backed and disk-backed B+ tree might even be the same class, with the difference factored out into a policy template parameter.
Regards, Nate
I understood this comment is about adding persistence or serialization support to the proposed B+ trees. Please correct if this is not the case. There are two levels of complexity in this area. A trivial one is just saving and reading data as a list, since these B+ trees are build on top of doubly linked lists. This might be sufficient for many practical applications, but certainly not for all. Hence there is the second and less trivial problem of saving and reading operations which preserve topology of these B+ trees that are designed to work in RAM. (Note: "topology" has been used here as a state of connectivity of nodes with data stored in the nodes). In principle, the problem can be solved for the submitted B+ trees, but I will need some Boost specific requirements to implement this feature.
What I mean by a disk-backed B+ tree is a B+ tree where only the nodes that are needed at a given time are loaded into memory from disk - the entire tree on disk might be too large to even fit into memory.
Thank you for this clarification. This is certainly the third level of complexity, this is why below only ideas about possible solutions. Assuming that problem #2 has been already solved the reading operation can be implemented through reading a sub-tree from the highest internal node which represents the root node of the sub-tree. Direct writing back the same sub-tree, which has been modified while working in RAM, is not allowed, since it can destroy invariants of the whole structure saved on disk. It seems to me the most reasonable option is to implement writing as efficient splice operations (in theory they have logarithmic cost), which are applied to the structure stored on disk. I know the efficient splice operations are possible for B+ trees in RAM, but I am not sure if these algorithms are also suitable for disk systems.
Hope this quick consideration makes some useful sense.
How is it done by relational database indexes? Aren't those commonly implemented as disk-backed B+ trees?
The proposed B+ trees have not been designed for these applications. They are implemented as dynamically allocated data structures, like linked lists and red-black trees: one node stores only one element. Thus, they do not store data contiguously in blocks (also named nodes!) as B+ trees for external storage do. For more details please have a look at the section: http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/bp_trees.html Please send me a reference if you know of methods of disk backing for dynamically allocated B+ trees. This could be useful in future development. Regards, Vadim

On Sat, Dec 3, 2011 at 11:25 PM, Joaquin M Lopez Munoz <joaquin@tid.es>wrote: ...
My conclusions:
1. Augmented data structures (of which efficient indexing is just a particular application) are definitely useful and worth having in Boost.
2. Unless proved otherwise, we should be using RB trees (or maybe AVL) rather than B+ trees.
3. Depending on the particular needs, some usage scenarios can be better served with Boost.MultiIndex (note I'm saying "depending on needs", since differences between approaches are substantial.)
4. To me, rather than providing out-of-the-box containers with augmenting, we should provide this via two libraries:
- As an additional capability of Boost.Intrusive associative containers (and AVL-tree based containers as well, if technically possible.)
- As an additional capability of Boost.MultiIndex ordered indices.
Hope this helps,
Joaquín M López Muñoz Telefónica Digital
Thank you very much, I really appreciate your deep analysis. Since you asked me to add more performance measurements, my detailed reply to all your comments will be a bit delayed. Regards, Vadim

Hi all,
This message is a detailed reply to the analysis and comments by Joaquín M López Muñoz. The most challenging request, which required quite significant work, concerns the memory consumption and performance of the augmented B+ trees. This aspect is quite broad. It is covered in two new sections of the updated documentation:
1) http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/bp_trees.html , the section "Space requirement";
2) http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/performance.html , the section "Effect of augmenting on performance of B+ trees".
The last document, with the performance measurements, has been modified quite considerably. It now includes tests of all variants of the proposed B+ trees and of boost::multi_index_container as a multiset. Please have a look at the updated documentation and my replies in the text below, and let me know if I missed anything significant.
On Sat, Dec 3, 2011 at 11:25 PM, Joaquin M Lopez Munoz <joaquin@tid.es> wrote:
Vadim Stadnik <vadimstdk <at> gmail.com> writes:
Hi all,
Advanced data structures (ADS) offer useful STL extensions that are not possible or not efficient with basic data structures, such as arrays, linked lists and red-black trees.
The project
https://github.com/vstadnik/stl_ext_adv
http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/index.html
focuses mainly on a generalized and unified computer representation of sequences using the augmented B+ trees. It provides the following STL extensions [...]
Hi,
This is a rough study of the functionality provided by your lib in general and as compared with Boost.MultiIndex.
* Container_bptree's internal data structure relies on B+ trees with two variants based on the "augmented data structure" idea explained in Cormen, Leiserson, Rivest and Stein's "Algorithms" (an online explanation can be found at http://tinyurl.com/d8lze7v , for instance.) Briefly put, such augmenting scheme boils down to adding additional information to the nodes updated on insertion time and erasure time which is based on the additional info of the sibling nodes. The variants are: - bp_tree_idx: augmented to allow for efficient indexing - bp_tree_idx_acc: doubly augmented to support efficient indexing plus a generalized "accumulate" notion usable for calculating summatories, mean values, etc.
* Question #1: why augmenting on B+ trees? The augmenting idea can be as easily built on top of RB-trees, which are the weapon of choice for implementing associative containers. Unless proved otherwise, RB-trees are assumed to be faster than B+ trees for in-memory structures (which your own tests hint at, cf. "insert N random elements" table.) You say:
"The red black trees are yet out of the scope of this project for the reasons explained in the documentation."
I've failed to find such explanation. Also, I suspect memory consumption is higher with B+ trees than RB trees. - Request #1: explain why B+ trees rather than RB trees.
That's my fault, I should have added to the original message a reference to this document: http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/bp_trees.html The section "Advantages of B+ trees" briefly covers the topic.
- Request #2: provide code for performance tests so that Boost.MultiIndex and other data structures can be compared.
The newly added file "perform_tests.hpp" contains the code of the test algorithms. My tests and timer are Windows-specific; I am planning to replace the timer with one of the Boost timers and update this code. I have added performance measurements for the Boost multi_index_container as a multiset to the document mentioned at the top of this message.
- Request #3: augment the performance tests so as to yield memory consumption figures.
This request is covered in two new sections of the updated documentation as explained at the beginning of this message.
* Usage scenario 1: Plain B+ trees as a replacement for RB tress for associative containers. My hunch is that RB-trees would perform better, as explained above.
I agree absolutely. Replacing red-black trees with the augmented B+ trees is only justified if it is beneficial for a user algorithm, for example when it needs the new extended functionality or a new efficient algorithm.
* Usage scenario 2: sequence-like containers with efficient insertion/erasure. Directly based on bp_tree_idx, with a do-nothing sorting critera. These containers are not truly random-access, but rather provide O(log n) access, which can be acceptable (or not) for some applications.
This is correct: one operation supporting random access iterators has logarithmic cost in the worst case. The new sequences are a good choice when a user algorithm requires efficient insert and erase operations, which outperform those of std::vector as shown in the performance tests. They can also replace the inefficient bidirectional iterators of std::list.
* Usage scenario 3: associative container with random access iterators. This roughly compares with a multi_index_container equipped with two indices, one ordered, the other random-access. Basic differences between both: - The multi_index_container provides truly O(1) random access whereas Container_bptree is O(logN) (which is not random access strictly speaking.) - With the multi_index_container, sorting order and insertion order are separate --some sort of synchronization between both indices should be needed to mimic the Continer_bptree case. - Insertion/erasure with the multi_index_container is linear (though much faster than with std::vector), logarithmic with Container_bptree. - The memory overhead introduced by the multi_index_container is two pointers vs. only one word for Container_bptree.
The Boost Multi-index library offers a very rich functionality. My knowledge of this library is not yet sufficient to make specific suggestions. Nevertheless, I think that this library can benefit from the proposed B+ trees and containers as well as from the method of augmenting applied to other data structures. I will be happy to help in this area.
* Usage scenario 4: containers with fast accumulate, a direct application of the augmenting idea explained above. There is nothing in Boost that provides this functionality.
Thank you for confirming this, since in one of the previous messages it was suggested that logarithmic efficiency of the standard function accumulate() is already supported by a Boost library.
My conclusions:
1. Augmented data structures (of which efficient indexing is just a particular application) are definitely useful and worth having in Boost. 2. Unless proved otherwise, we should be using RB trees (or maybe AVL) rather than B+ trees. 3. Depending on the particular needs, some usage scenarios can be better served with Boost.MultiIndex (note I'm saying "depending on needs", since differences between approaches are substantial.) 4 To me, rather than providing out of the box containers with augmenting we should provide this via two libraries: - As an additional capability of Boost.Intrusive associative containers (and AVL-tree based containers as well, if technically possible.) - As an additional capability of Boost.MultiIndex ordered indices.
Hope this helps,
Joaquín M López Muñoz Telefónica Digital
Thank you for your analysis and conclusions. I am happy for the Boost experts to decide how to use the proposed data structures and containers. To start work I need a list of tasks and/or requests with priorities. Regards, Vadim
participants (11)
- Andrii Sydorchuk
- Ion Gaztañaga
- Jeff Flinn
- Joaquin M Lopez Munoz
- JOAQUIN M. LOPEZ MUÑOZ
- Nathan Ridge
- Thijs (M.A.) van den Berg
- Thorsten Ottosen
- Vadim Stadnik
- Vicente Botet
- Vicente J. Botet Escriba