
Hi Phil,

thank you again for your review.

2010/2/23 Phil Endecott <spam_from_boost_dev@chezphil.org>: [...]
> Joachim has clearly put a lot of work into what is a substantial and well-documented library. Unfortunately, it seems to me that it is built on a weak foundation.
After taking some more time to think about an interval_tree
(http://en.wikipedia.org/wiki/Interval_tree) implementation for my library, I have arrived at a preliminary assessment:
(1) An interval_tree implementation is not suitable for most instantiations of the interval containers provided by the ITL.

(2) An interval_tree implementation is an important optimization for one specific instantiation: split_interval_map<D, Collection<T> >.

(3) For most other instantiations of interval containers, an interval_tree implementation leads to problems that are not encountered with the current implementation, and it will even show worse performance in certain cases.

(4) So an interval_tree implementation is a desirable optimization for a particular case, split interval maps of collections, but it is probably not suitable as a general "foundation" of the library.

(5) The foundation of the ITL is its formal specification on sets of axioms, which clarifies semantics; an abstract interface; and a generic design that allows for a variety of implementations and optimizations.

(6) Since an interval_tree implementation would be a partial optimization rather than the correction of a fundamental flaw, I don't think it is appropriate to bring the ITL library to a halt over this.

There is no data structure that is superior as such. Each has its pros and cons, and the user has to choose the appropriate one depending on the characteristics of her specific problem domain and use case.

The current implementation, which inserts interval-value pairs into interval_maps and computes aggregates on overlap, can be viewed as performing an "eager evaluation". After the addition of interval-value pairs, two results are always available:

(1) the new interval borders that result in the interval_map from the addition of the interval, and

(2) the aggregated associated values for the overlapping segments.

Using an interval_tree as the implementation, both results are unknown after insertions and only implicitly available; they can be computed on demand.
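To make the eager strategy concrete, here is a minimal sketch of my own (a toy model over int keys, not the ITL's actual implementation) that aggregates by + on overlap and eagerly joins neighbouring segments carrying equal values:

```cpp
#include <cassert>
#include <iterator>
#include <limits>
#include <map>

// Toy model of the eager strategy (not the ITL's implementation):
// seg maps each segment's lower bound to the aggregated value of the
// segment extending to the next key; 0 is the neutral element.
struct EagerIntervalMap {
    std::map<int, int> seg{{std::numeric_limits<int>::min(), 0}};

    // add value v on [a, b): split the borders eagerly, aggregate by
    // +=, then join neighbouring segments whose values became equal.
    void add(int a, int b, int v) {
        split(a);
        split(b);
        for (auto it = seg.lower_bound(a); it != seg.end() && it->first < b; ++it)
            it->second += v;
        join();
    }

    // value associated with point x (0 where nothing was added)
    int at(int x) const { return std::prev(seg.upper_bound(x))->second; }

private:
    // ensure a segment border exists at x, keeping the value unchanged
    void split(int x) { seg.emplace(x, std::prev(seg.upper_bound(x))->second); }

    // erase borders between segments with equal aggregated values
    void join() {
        for (auto it = std::next(seg.begin()); it != seg.end();)
            it = (it->second == std::prev(it)->second) ? seg.erase(it)
                                                       : std::next(it);
    }
};
```

Both the resulting segment borders and the aggregated values are available immediately after every add; an interval_tree would leave both implicit until queried.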
Computing results on demand in this way is a "lazy evaluation" strategy, and it is the basis of the better space performance when collections of objects are involved that cannot be aggregated in a compact way.

One problem here is that the segmentation of the interval_map (its separation into the elementary intervals that the aggregation results are associated with) is only implicit in the data structure. Moreover, the interval_tree, in contrast to the "eager" implementation, does not "recognize" that neighbouring intervals can be joined because the aggregated values associated with the elementary segments are equal. In the worst case, the interval_tree stores N interval-value pairs for an interval_map that could be compressed to a single interval-value pair, because all of the aggregations on overlap led to identical associated values.

(1) This can happen if you work with an interval_map<D, set<T> > and T has only a small finite number of values. If you have an application with large intervals and heavy overlaps, the O(N^2) space problem occurs. But if, due to the limited base set (the values of T), a lot of neighbouring sets turn out to be identical, the eagerly evaluating interval_map can stay pretty small while the lazy one grows on and on.

(2) Another use case: your interval_map<D, int> tracks the maximum of values that are a property of some objects changing over time. Let's say you have a party and you track the height of the tallest guest, similar to the documented example:
http://www.joachim-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/ex...

// I've heard Obama wants to introduce the metric system ;)
([19:00-23:10), 1.65m) // First guest, arriving in time at 19:00,
                       // is 1.65 meters tall.
([19:00-22:00), 1.70m) // Second guest ...
// May be some more guests under 2.13m at [19:00-..)
([19:00-24:00), 2.13m) // Now comes Dirk Nowitzki ...
// N more guests, all smaller than Dirk,
// coming later and leaving earlier.

The eagerly evaluating interval_map grows until the point when Dirk arrives.
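A toy max-combining sketch (again my own illustration, using minutes since midnight and heights in centimetres, not the ITL's actual code) shows what happens at this point:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <limits>
#include <map>

// Toy eager interval map combining by max: keys are minutes since
// midnight, values are heights in cm. Same split/aggregate/join
// scheme as the eager strategy, but not the ITL's actual code.
struct MaxIntervalMap {
    std::map<int, int> seg{{std::numeric_limits<int>::min(), 0}};

    void add(int a, int b, int v) {
        split(a);
        split(b);
        for (auto it = seg.lower_bound(a); it != seg.end() && it->first < b; ++it)
            it->second = std::max(it->second, v);
        join();  // equal neighbouring maxima are merged eagerly
    }

    void split(int x) { seg.emplace(x, std::prev(seg.upper_bound(x))->second); }

    void join() {
        for (auto it = std::next(seg.begin()); it != seg.end();)
            it = (it->second == std::prev(it)->second) ? seg.erase(it)
                                                       : std::next(it);
    }
};
```

Once Dirk's interval dominates the whole evening, every later, shorter guest aggregates to the same maximum on every overlapped segment, so the eager map collapses to, and stays at, a single joined segment.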
Then it shrinks to size 1 and stays at size 1 as smaller guests enter the party later. A lazily evaluating interval_map founded on an interval_tree would, correct me if I'm wrong, grow on and on, and would compute queries in O(log(N)) time instead of the eager interval_map's constant time.

Most obvious is the case of an interval_set<T>. The previously discussed "worst case" (large intervals that are all overlapping) is the best case for the eagerly evaluating implementation here: there will always be only a single interval node in the implementing std::set, so the space and time of all operations are constant. An interval_tree would store all N intervals. Maybe there are nifty optimizations of the lazy data structure for that; yet I have the impression that interval_trees are surely not the foundation for *every* use case within the scope of this library.

Considering this, I think it would not be appropriate to use lazily evaluating interval_trees as a *general* implementation for interval maps, and they would not be a strong foundation for *every* use case either. Rather, they would be a more optimal implementation for an important class of instantiations of split_interval_maps.

Personally, I think questions of implementation, although very important, are not the foundation of a library. More important are concepts, design, and correctness. Those are the fields I have invested a lot in, and I am fairly sure that, based on this foundation, a lot of useful and exciting optimizations can be and will be done in the future evolution of the library.

I have invested particular effort in the investigation of the semantics of interval sets and interval maps, creating sets of axioms that formally specify *what* the objects of my library will do, not *how* they will do it. This is the part that I feel is the foundation of my library, and I think it is a strong foundation.
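Returning to the interval_set case above: a minimal joining interval_set sketch (a toy model of mine, not the ITL's implementation) shows why heavily overlapping additions collapse into a single node:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <limits>
#include <set>
#include <utility>

// Toy joining interval_set: stores disjoint, maximal right-open
// intervals [first, second) ordered by lower bound; add() absorbs
// every stored interval that overlaps or touches the new one.
struct JoiningIntervalSet {
    std::set<std::pair<int, int>> ivs;

    void add(int a, int b) {
        auto it = ivs.lower_bound({a, std::numeric_limits<int>::min()});
        if (it != ivs.begin() && std::prev(it)->second >= a)
            --it;  // the predecessor touches [a, b) as well
        while (it != ivs.end() && it->first <= b) {
            a = std::min(a, it->first);
            b = std::max(b, it->second);
            it = ivs.erase(it);  // absorb and continue
        }
        ivs.insert({a, b});
    }
};
```

However many mutually overlapping intervals are added, a single node remains, so space and per-operation cost stay constant, while an interval_tree would retain all N of them.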
The specification provides not only a formal description of the library's behavior but also, in the extended part, an automated test tool to validate any given implementation against it.

Thank you again for focusing on the important issues of complexity and implementation.

Best regards,
Joachim