
2010/3/7 Phil Endecott <spam_from_boost_dev@chezphil.org>:
Joachim Faulhaber wrote:
Hi Phil,
thank you again for your review.
And thank you for finally addressing this issue. I do not have much time to reply, unfortunately.
There are no data structures that are superior as such. Each has its pros and its cons, and the user has to decide which one is appropriate, depending on the characteristics of her specific problem domain and use case.
Selection of appropriate data structures is vital. What I believe you have done is to take one data structure and build an overly-generic
... too much honour for me, generic will do
library on top of it. It seems to me that it would be better to either restrict the scope of the library to those things that it can do efficiently,
I think that's overly restrictive. There are a lot of generic data structures in std and boost that can be used in inefficient ways, and there are no specific restrictions that guard against that. Examples:

(1) You can call insert and erase on std::vector. So you can happily use some large vectors in applications that do a lot of dynamic insertions and erasures on them, preferably near their front, where each operation costs O(n). Who restricts you?

(2) You can use std::map<T, SomeLargeCollection<SomeMoreStuff...> >. Sometimes people deliberately want to do this for quick solutions where efficiency does not matter. No restrictions.

(3) You can use, say, boost::dynamic_bitset to work with sparse bitsets that contain, say, bit 0 and bit 2^N - 1. It works, as long as your workstation has enough memory. Restrictions?

BTW my library has more than one very space efficient solution for working with sparse bitsets, and I cannot see how an interval tree implementation would make these any better. Moreover, restricting say split_interval_map<D, Collection<T> > from being used because of a potential worst case scenario would amount to patronizing the skilled user, who is well aware of the complexity problem involved but decides that the characteristics of his problem domain will not be problematic (e.g. small collections and moderate overlaps).
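To make the sparse-bitset case concrete, here is a minimal sketch using an ICL-style interval_set. Header, namespace and function names follow the library's documentation and should be treated as assumptions; the version under review may spell them differently:

    #include <boost/icl/interval_set.hpp>
    #include <cassert>

    int main()
    {
        namespace icl = boost::icl;
        // A sparse "bitset" that contains only bit 0 and bit 2^40:
        // two singleton intervals, hence two set nodes, regardless of
        // how far apart the two bits are.
        icl::interval_set<long long> bits;
        bits += 0LL;              // element insertion: bit 0
        bits += (1LL << 40);      // bit 2^40
        assert(icl::interval_count(bits) == 2);
        assert(icl::cardinality(bits) == 2);
        return 0;
    }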
or to extend the implementation of the library to efficiently support all of the advertised capabilities.
which I have already agreed to.
the segmentation of the interval_map (its separation into the elementary intervals that the aggregation results are associated with) is only implicit in the [interval tree]. Moreover, the interval tree, in contrast to the "eager implementation", does not "recognize" that neighbouring intervals can be joined when the aggregated values associated with the elementary segments are equal.
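As a concrete example of this eager joining, a minimal sketch assuming the ICL-style interval_map interface (names are assumptions, as above):

    #include <boost/icl/interval_map.hpp>
    #include <cassert>
    #include <utility>

    int main()
    {
        namespace icl = boost::icl;
        icl::interval_map<int, int> m;  // joining interval_map
        m += std::make_pair(icl::interval<int>::right_open(0, 5), 1);
        m += std::make_pair(icl::interval<int>::right_open(5, 9), 1);
        // Both segments carry the aggregate 1, so the eager
        // implementation joins them into a single node [0,9)->1.
        assert(icl::interval_count(m) == 1);
        return 0;
    }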
Most obvious is the case of an interval_set<T>. The previously discussed "worst case" (large intervals that are all overlapping) is the best case for the eagerly evaluating implementation here: there will always be only a single interval node in the implementing std::set, so the space and time of all operations are constant. An interval tree would store all N intervals.
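A minimal sketch of this best case (again assuming ICL-style names):

    #include <boost/icl/interval_set.hpp>
    #include <cassert>

    int main()
    {
        namespace icl = boost::icl;
        icl::interval_set<int> s;
        // 1000 large, mutually overlapping intervals ...
        for (int i = 0; i < 1000; ++i)
            s += icl::interval<int>::right_open(i, i + 100000);
        // ... collapse eagerly into one node in the implementing
        // std::set; an interval tree would keep all 1000 of them.
        assert(icl::interval_count(s) == 1);
        return 0;
    }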
I think that you could quite easily update an interval tree in such a way that it would store an "eager" implementation. In the case of an interval_set, to insert an interval into an interval tree that is representing an interval_set you would:

1. Erase all intervals that are contained within the new interval.
2. If an interval overlaps the bottom end of the new interval, erase it and extend the bottom end of the new interval.
3. If an interval overlaps the top end of the new interval, erase it and extend the top end of the new interval.
4. Insert the modified new interval.
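For concreteness, these four steps might look like this over a sorted container (a hypothetical sketch using std::set as a stand-in for the tree, not the library's actual code):

    #include <algorithm>
    #include <iterator>
    #include <set>

    struct Interval { int lo, hi; };  // closed interval [lo, hi]

    struct ByLower {
        bool operator()(const Interval& a, const Interval& b) const
        { return a.lo < b.lo; }
    };

    // Steps 1-4: absorb all intervals overlapping iv, then insert the
    // widened result, keeping the set free of overlaps.
    void insert_joining(std::set<Interval, ByLower>& s, Interval iv)
    {
        // Start at the first stored interval whose lower bound is
        // >= iv.lo; its predecessor may also overlap iv, so step back
        // once if it does.
        auto it = s.lower_bound(iv);
        if (it != s.begin() && std::prev(it)->hi >= iv.lo)
            --it;
        // Steps 1-3: erase contained and overlapping intervals,
        // extending the ends of the new interval as needed.
        while (it != s.end() && it->lo <= iv.hi) {
            iv.lo = std::min(iv.lo, it->lo);
            iv.hi = std::max(iv.hi, it->hi);
            it = s.erase(it);
        }
        s.insert(iv);  // step 4: insert the modified new interval
    }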
This is pretty much what the current implementation is doing. But in addition you get worse space performance, due to the extra information stored in each node of an interval tree: the maximum upper endpoint of all intervals in the node's subtree.
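For comparison, a textbook (CLRS-style) augmented interval-tree node carries that extra field; a hypothetical sketch:

    // One node of an augmented interval tree: besides the interval
    // itself (and the usual tree links), each node caches the maximum
    // upper endpoint of all intervals in its subtree, the extra
    // per-node data that the eager std::set representation does not need.
    struct ITreeNode {
        int lo, hi;          // the interval [lo, hi] stored in this node
        int max_hi;          // max upper endpoint over the whole subtree
        ITreeNode* left  = nullptr;
        ITreeNode* right = nullptr;
    };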
I guess that the other cases would be similar,
In order to join neighbouring segments of an interval_map in the case of equal aggregates, you have to either store up to 2*N-1 aggregation results in your N-node interval tree, or recompute up to 2*N-1 aggregates on every insertion (N intervals have at most 2*N distinct endpoints, which bound at most 2*N-1 elementary segments) ... As I already said, there might be nifty solutions, but you would send me back to square one, as Paul put it. And I don't think this would be appropriate for a library that has already proved useful and obviously sufficiently efficient in pretty demanding contexts; see e.g. the review of Eric Jonas.
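To see where the 2*N-1 comes from, a minimal sketch (ICL-style names assumed, as before): two overlapping insertions already produce 2*2-1 = 3 elementary segments, each carrying its own aggregate:

    #include <boost/icl/split_interval_map.hpp>
    #include <cassert>
    #include <utility>

    int main()
    {
        namespace icl = boost::icl;
        icl::split_interval_map<int, int> m;  // keeps elementary segments
        m += std::make_pair(icl::interval<int>::right_open(0, 4), 1);
        m += std::make_pair(icl::interval<int>::right_open(2, 6), 1);
        // Elementary segments: [0,2)->1, [2,4)->2, [4,6)->1
        assert(icl::interval_count(m) == 3);
        return 0;
    }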
and they would have the same time and space complexity as your current std::set implementation.
and maybe not. Don't get me wrong, I am not refusing to work on your proposed optimized implementation. In fact I am pretty keen to improve my library this way, but, as for everyone else, my time and energy are limited. And after putting a lot of work into a library that seems to be pretty useful and has found many use cases in different fields, that comes with a formal spec and an associated tool that allows for repeatable validation, I think it is not appropriate to devalue the whole project on the grounds of an optimization that is IMO important but also partial.
Sorry for my short reply, but we are in the last few hours of the review period.
Thank you again for contributing on this important issue. Regards, Joachim.