
My review of the proposed Interval Template Library follows. This is based only on studying the documentation and previous discussion of the library on the Boost mailing list (e.g. http://lists.boost.org/Archives/boost/2009/03/149396.php ). Joachim has clearly put a lot of work into what is a substantial and well-documented library. Unfortunately, it seems to me that it is built on a weak foundation.

One of the best-known problems involving intervals is to determine which of a given set of intervals contains a particular point. How to store a set of intervals in such a way as to make this query efficient has been extensively studied. If an "interval tree" (see e.g. Wikipedia: http://en.wikipedia.org/wiki/Interval_tree) is used, then the problem can be answered in logarithmic time and linear space. (More precisely: if there are N intervals in the container and M overlap the point of interest, finding them should take O(M + log N) time and the data structure should take O(N) space.)

I was hoping to find that this library was an implementation of an interval tree. But it is not: it is built on top of normal standard-library containers. The implementation used can find the intervals that overlap a point in O(M + log N) time, but to do so it needs worst-case O(N^2) space and I think it needs O(N^2 log N) time to build that structure.

Presumably this complexity is not an issue in Joachim's application. Unfortunately, library code will be used in many different applications, including those where optimal complexity is required. For example, consider trying to store the lifespans of people in a genealogy program: recording a few thousand people living over a few hundred years would approach the worst-case space complexity and need megabytes of storage. Perhaps other reviewers will feel that there are sufficient applications where this library is suitable to justify its inclusion in Boost. 
My feeling is that since the interval tree data structure is so well known and its performance is so much better, we should not accept this library. Ideally it would be possible for Joachim to replace ITL's current standard containers with an interval tree implementation. Could an interval tree be implemented with an interface sufficiently similar to a std::set/map to make that possible, without losing the complexity benefits?

Once again, I believe that apart from this weak foundation this looks like an excellent submission that is comprehensive and well-documented. Regards, Phil.

Phil Endecott wrote:
I was hoping to find that this library was an implementation of an interval tree. But it is not: it is built on top of normal standard-library containers. The implementation used can find the intervals that overlap a point in O(M + log N) time, but to do so it needs worst-case O(N^2) space and I think it needs O(N^2 log N) time to build that structure.
Oddly enough, I had exactly that data structure (extended to 2D for rectangle queries) in an old version of my library prior to Boost submission. I took it out because I didn't want to go through the work of making its interface sufficiently generic to meet Boost standards. Included in the work I didn't want to do was factoring out the 1D portion and exposing an interval-based interface for it. I don't use the interval tree for my sweepline data structure because a std::map actually works better for sweepline when you store the end points of intervals as the key, as opposed to the intervals themselves.

We had a failed Boost Summer of Code project that didn't produce a good 2D version of the same data structure, and the Boost.Geometry library has promised to include this same data structure, but by working with the Summer of Code author, apparently. I would be happy to work with Joachim to add interval set operations and an interval tree, plus the extension to 2D for rectangle queries, as an extension of my library. Since my library already provides an interval concept it is, perhaps, a better foundation for this extension than Boost.Geometry. I use "extension" here in a different sense of the word than Boost.GIL and Boost.Geometry extensions: interval set operations would not be second-class features, but rather part of the core library.

Many of the utility functions on intervals are already implemented for my interval concept. I also understand both the interval tree data structures and the algorithms required for optimally computing interval set operations. I have extensive experience with the optimal interval set operation algorithm in the internal implementation of my sweepline algorithms for polygon set operations. I can help with the design of interfaces, provide some code for the interval tree and 2D extension, and give guidance on which algorithms are needed for interval set operations and where to provide abstractions to simplify their implementation. 
Rather than as one large block of work, this extension could be implemented piecemeal over time and would not require a full Boost review, only my own review, in the same manner that I can add features to the library myself as its maintainer without the need for a full review of each one. I am on good terms with Joachim and I think we could work productively together. I hope he will not take it amiss that I make this offer on the list and not privately.

There are a few sticky issues related to interval sets with a domain type that wouldn't normally be a legal coordinate type for my library. Please remember that an integer coordinate type is a requirement of my general polygon set operations algorithm, not of the library in general or of the way its interfaces are defined. I believe that these issues with domain types that don't map well to my coordinate concept can be overcome. (Lack of std::numeric_limits<std::string>::max() { return std::string(char(127)); } and so forth.) Obviously a polygon set operation can never work with a coordinate of type string (actually Manhattan polygon set operations could, but not 45-degree or general ones), but that doesn't mean we can't use string as a coordinate type when instantiating 1D data structures and algorithms, or even 2D data structures and algorithms that don't require arithmetic. I think that would improve my library and make it more generic. It sounds like an interesting challenge at least. Regards, Luke

Hi Phil, 2010/2/23 Phil Endecott <spam_from_boost_dev@chezphil.org>:
My review of the proposed Interval Template Library follows. This is based only on studying the documentation, and previous discussion of the library on the Boost mailing list (e.g. http://lists.boost.org/Archives/boost/2009/03/149396.php ).
Joachim has clearly put a lot of work into what is a substantial and well-documented library. Unfortunately, it seems to me that it is built on a weak foundation.
One of the most well-known problems involving intervals is to determine which of a given set of intervals contains a particular point. How to store a set of intervals in such a way as to make this problem efficient has been extensively studied. If an "interval tree" (See e.g. Wikipedia: http://en.wikipedia.org/wiki/Interval_tree) is used, then the problem can be answered in logarithmic time and linear space. (More precisely: if there are N intervals in the container and M overlap the point of interest, finding them should take O(M + log N) time and the data structure should take O(N) space.)
I was hoping to find that this library was an implementation of an interval tree. But it is not: it is built on top of normal standard-library containers. The implementation used can find the intervals that overlap a point in O(M + log N) time, but to do so it needs worst-case O(N^2) space and I think it needs O(N^2 log N) time to build that structure.
This is a very limited view on the library. I never claimed it has its strengths in performing stabbing queries. It has so many more aspects and applications where it is useful and efficient. If you insert N intervals into an interval_map, the number of intervals (nodes) in the map is limited to 2*N-1, so space is O(N). A single addition of an interval into an interval_map has worst-case time performance of O(N), so the worst-case time for the addition of N interval value pairs is O(N^2). Which is not as bad as you write.

The most frequent usage of interval_maps is aggregation, where you need all of the aggregated values. If, in the worst case, N^2 aggregations are involved, you have to perform them anyway. If you used an interval tree to hold the data, you would not be able to store the aggregation results for N+k segments in your N nodes. For what interval containers, and specifically interval_maps, are used for most of the time, interval trees are not an adequate data structure as far as I can see. Regards Joachim

Joachim Faulhaber wrote:
Hi Phil,
2010/2/23 Phil Endecott <spam_from_boost_dev@chezphil.org>:
My review of the proposed Interval Template Library follows. This is based only on studying the documentation, and previous discussion of the library on the Boost mailing list (e.g. http://lists.boost.org/Archives/boost/2009/03/149396.php ).
Joachim has clearly put a lot of work into what is a substantial and well-documented library. Unfortunately, it seems to me that it is built on a weak foundation.
One of the most well-known problems involving intervals is to determine which of a given set of intervals contains a particular point. How to store a set of intervals in such a way as to make this problem efficient has been extensively studied. If an "interval tree" (See e.g. Wikipedia: http://en.wikipedia.org/wiki/Interval_tree) is used, then the problem can be answered in logarithmic time and linear space. (More precisely: if there are N intervals in the container and M overlap the point of interest, finding them should take O(M + log N) time and the data structure should take O(N) space.)
I was hoping to find that this library was an implementation of an interval tree. But it is not: it is built on top of normal standard-library containers. The implementation used can find the intervals that overlap a point in O(M + log N) time, but to do so it needs worst-case O(N^2) space and I think it needs O(N^2 log N) time to build that structure.
If you insert N intervals in an interval_map the number of intervals (nodes) in the map is limited to 2*N-1. So space is O(N). A single addition of an interval into an interval_map has worst case time performance of O(N). So the worst case time for the addition of N interval value pairs is O(N^2). Which is not as bad as you write.
Let us say that the interval map is a map from interval to the set of ids of the original input intervals, perhaps their index in an array. In the worst case all intervals overlap all other intervals and have unique end points. The number of ids stored in the set for each of the O(n) elements in the map is O(n), and the data structure is in total O(n^2) in size. Because each set of ids is O(n), the insertion of each id is O(log n), and the total runtime complexity is O(n^2 log n) to build the data structure. The interval map could then be used to look up the set of intervals that intersect a specific value in O(m + log n) time. I think this is the specific usage of the interval map that Phil had in mind.

You can perform set operations on them, but even then I wouldn't want to use the tree-insertion-based algorithm if the two maps were of similar size, since the scanning algorithm is O(n). I also wouldn't store interval maps as trees if I wanted to do interval set operations on them, since vector storage will be much more compact and the O(n) algorithm will run much faster. There is a similar problem with polygon sets as with interval sets, in that people generally want to construct them one polygon at a time, but want optimal runtime complexity for producing the final merged set. I defer the merge until it is requested by the user, implicitly, by calling an API that requires merged data. You could do the same thing with an interval set by storing a vector of intervals and then sorting and scanning it. You get a faster O(log n) lookup in a sorted vector than in a tree.

The only use case for the tree-based interval map or set is when the user wants to interleave queries with insertion and removal. That use case has its place, but it shouldn't come at the expense of the efficiency of other use cases. The most common use case is to use the interval map as a dictionary that is rarely if ever changed after it is initially constructed. 
The user needs to understand their use case and have the option of choosing the appropriate data structures and algorithms for their needs. Now I hope we are all on the same page about these issues and understand that they are solvable in the long term. Regards, Luke

Joachim Faulhaber wrote:
2010/2/23 Phil Endecott <spam_from_boost_dev@chezphil.org>:
I was hoping to find that this library was an implementation of an interval tree. But it is not: it is built on top of normal standard-library containers. The implementation used can find the intervals that overlap a point in O(M + log N) time, but to do so it needs worst-case O(N^2) space and I think it needs O(N^2 log N) time to build that structure.
This is a very limited view on the library. I never claimed it has its strengths in performing stabbing queries. It has so many more aspects and applications where it is useful and efficient.
It might help the discussion to describe some of these applications in depth. I can only think of examples where the O(N^2) space happens. Example: consider a building access-control system where users check in and out with cards; the system records a set of (in_time,out_time)->name records. If N people work in the building, it will take O(N^2) space to store each day's data.
If you insert N intervals in an interval_map the number of intervals (nodes) in the map is limited to 2*N-1. So space is O(N). A single addition of an interval into an interval_map has worst case time performance of O(N). So the worst case time for the addition of N interval value pairs is O(N^2). Which is not as bad as you write.
No, I believe my assessment is correct. See Luke's explanation.
The most frequent usage of interval_maps are aggregations, where you need all of the aggregated values. If, in the worst case N^2 aggregations are involved, you have to perform them anyway. If you'd use an interval tree to hold the data, you would not be able to store the aggregation results for N+k segments in your N nodes. For what interval containers and specifically interval_maps are used most of the time, interval_trees are not an adequate data structure as far as I can see.
I don't entirely understand that paragraph. In an interval tree, I don't store the "aggregation results"; I just store the key-intervals and data. Aggregation occurs when I perform a lookup; I find all of the intervals that overlap my point or interval of interest, and merge their data in some appropriate way at that time. Regards, Phil.

2010/2/24 Phil Endecott <spam_from_boost_dev@chezphil.org>
Joachim Faulhaber wrote:
2010/2/23 Phil Endecott <spam_from_boost_dev@chezphil.org>:
I was hoping to find that this library was an implementation of an interval
tree. But it is not: it is built on top of normal standard-library containers. The implementation used can find the intervals that overlap a point in O(M + log N) time, but to do so it needs worst-case O(N^2) space and I think it needs O(N^2 log N) time to build that structure.
This is a very limited view on the library. I never claimed it has its strengths in performing stabbing queries. It has so many more aspects and applications where it is useful and efficient.
It might help the discussion to describe some of these applications in depth. I can only think of examples where the O(N^2) space happens. Example: consider a building access-control system where users check in and out with cards; the system records a set of (in_time,out_time)->name records. If N people work in the building, it will take O(N^2) space to store each day's data.
The party example that I chose, simply because it is so easy to understand, seems to have turned out to be unfortunate, because people tend to think of it as THE use case of interval_maps. We had similar use cases with relatively small sets and moderate overlaps where it works nicely. But for big sets and heavy overlap this is inadvisable, and the worst-case space is O(N^2) indeed.

But the much more important applications are those where the associated values are numerical types or lightweight objects that can be aggregated. If, in your access-control system, you only want to count the number of people in the building, space is only O(N), because the associated values, the aggregation results, stay small. Another non-trivial use of interval_map, which I think is space-efficient compared to other methods, is the example of large bitsets, which is implemented by means of itl::interval_maps: http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/pro...
If you insert N intervals in an interval_map the number of intervals
(nodes) in the map is limited to 2*N-1. So space is O(N). A single addition of an interval into an interval_map has worst case time performance of O(N). So the worst case time for the addition of N interval value pairs is O(N^2). Which is not as bad as you write.
No, I believe my assessment is correct. See Luke's explanation.
Yes, your big Os are correct. And they don't care if that's unpleasant for me :-/ I am aware that using an itl::interval_map of sets or other collections can be inefficient, and I have noted this in the documentation. I agree that, for interval_maps of collections, a more efficient implementation is highly desirable and that an interval tree is a good candidate to achieve this.

For this version of the ITL, I wanted to concentrate on the interface and the semantic properties that I have checked, to get the "big picture" right. This is in line with the Boost guidelines from http://www.boost.org/development/requirements.html: "Aim first for clarity and correctness; optimization should be only a secondary concern in most Boost libraries." Also, I developed a very thorough suite of law-based tests that will support the correct development of any optimization within my library design. A more efficient implementation of interval_maps of sets is one of the things I wanted to do next. The most frequent usage of interval_maps are aggregations, where
you need all of the aggregated values. If, in the worst case N^2 aggregations are involved, you have to perform them anyway. If you'd use an interval tree to hold the data, you would not be able to store the aggregation results for N+k segments in your N nodes. For what interval containers and specifically interval_maps are used most of the time, interval_trees are not an adequate data structure as far as I can see.
I don't entirely understand that paragraph. In an interval tree, I don't
store the "aggregation results"; I just store the key-intervals and data. Aggregation occurs when I perform a lookup; I find all of the intervals that overlap my point or interval of interest, and merge their data in some appropriate way at that time.
This is one of the differences between an interval tree and an itl::interval_map. An addition of an interval value pair to an interval map always performs an aggregation of associated values where an inserted interval value pair overlaps. The aggregation results are always updated, and they are stored together with the segments they belong to. Using an interval tree, you need to iterate the implicit segments (I don't know if this can be done as easily), perform the aggregations on lookups, and then you need another container to collect the results. I admit this kind of lazy evaluation can be beneficial. If the associated values are small, e.g. numeric, the current itl::interval_map implementation is efficient as well. Regards, Joachim

Hi Phil, in my first reply to your review I was pretty frustrated and angry. Meanwhile I regret having answered in that mood because, at least in my case, those negative feelings disturb clear thinking :( and I typed out some false and some half-baked stuff. 2010/2/23 Phil Endecott <spam_from_boost_dev@chezphil.org>
My review of the proposed Interval Template Library follows. This is based only on studying the documentation, and previous discussion of the library on the Boost mailing list (e.g. http://lists.boost.org/Archives/boost/2009/03/149396.php ).
Joachim has clearly put a lot of work into what is a substantial and well-documented library. Unfortunately, it seems to me that it is built on a weak foundation.
[...]
I was hoping to find that this library was an implementation of an interval tree. But it is not: it is built on top of normal standard-library containers. The implementation used can find the intervals that overlap a point in O(M + log N) time, but to do so it needs worst-case O(N^2) space and I think it needs O(N^2 log N) time to build that structure.
[...]
My feeling is that since the interval tree data structure is so well known and its performance is so much better, we should not accept this library.
If I look at your arguments in a relaxed way, I have to acknowledge that they are substantial. I knew that this point (particularly space efficiency) for the interval_maps of collections was a weak point. And I had *hoped* all the positive aspects of my work would compensate. I also remember my tendency to deny that issue when you first addressed it last year. And denial never works ;-/ So, yes, your review as an answer to this was justified. Sticking to that point shows your standing up for the quality of Boost libraries. This is of value, independent of the fact that this kind of standing can be pretty unpleasant for contributors. Another thing that I wasn't able to see in my frustrated mood were the positive remarks about my work in your posting.
Ideally it would be possible for Joachim to replace ITL's current standard containers with an interval tree implementation. Could an interval tree be implemented with an interface sufficiently similar to a std::set/map to make that possible, without losing the complexity benefits?
Once again, I believe that apart from this weak foundation this looks like an excellent submission that is comprehensive and well-documented.
Even if your overall judgement is "reject", I have the impression that you find the library in its substance valuable, and that you also expressed that it is a good candidate for Boost if the efficiency issue is addressed. So I'd like to thank you for the positive feedback as well. Hoping that you stay supportive, both for Boost quality and also for my project ... Best regards, Joachim

Hi Phil, thank you again for your review. 2010/2/23 Phil Endecott <spam_from_boost_dev@chezphil.org>: [...]
Joachim has clearly put a lot of work into what is a substantial and well-documented library. Unfortunately, it seems to me that it is built on a weak foundation.
After I have taken some more time to think about an interval_tree
http://en.wikipedia.org/wiki/Interval_tree implementation for my library, I arrived at a preliminary assessment:
(1) An interval_tree implementation is not suitable for most instantiations of interval containers provided by the ITL.

(2) An interval_tree implementation is an important optimization for a specific instantiation: split_interval_map<D, Collection<T> >.

(3) For most other instantiations of interval containers, an interval_tree implementation will lead to problems that are not encountered using the current implementation, and will show even worse performance in certain cases.

(4) So an interval_tree implementation is a desirable optimization for a particular case, split interval maps of collections, but probably not suitable as a general "foundation" of the library.

(5) The foundation of the ITL is its formal specification on sets of axioms that clarifies semantics, an abstract interface, and a generic design that allows for a variety of implementations and optimizations.

(6) Since an interval_tree implementation would rather be a partial optimization than the correction of a fundamental flaw, I don't think it is appropriate to bring the ITL library to a halt for this.

There are no data structures that are superior as such. Each has its pros and its cons, and the user has to take her decision of choosing the appropriate one, dependent on the characteristics of her specific problem domain and use case. The current implementation of inserting interval value pairs into interval_maps, with computation of aggregates on overlap, can be viewed as performing an "eager evaluation". After addition of interval value pairs, we always have two results available: (1) the new interval borders that result in the interval_map from the addition of the interval, and (2) the aggregated associated values for the overlapping segments. Using an interval_tree as implementation, both results are unknown after insertions and only implicitly available. They can be computed on demand. 
This is a "lazy evaluation" strategy, and it is the basis of the better space performance when collections of objects are involved that cannot be aggregated in a compact way. One problem here is that the segmentation of the interval_map (its separation into elementary intervals that the aggregation results are associated to) is only implicit in the data structure. Moreover the interval_tree, in contrast to the "eager implementation", does not "recognize" that neighbouring intervals can be joined because of equality of the aggregated values associated to the elementary segments. In the worst case it is possible that the interval_tree stores N interval value pairs for an interval_map that could be compressed to only one interval value pair, since all of the aggregates on overlap led to identical associated values.

(1) This can happen if you work with an interval_map<D, set<T> > and T has only a small finite number of values. If you have an application with large intervals and heavy overlaps, the O(N^2) space problem occurs. But if, due to the limited base set (values of T), a lot of neighbouring sets may be identical, the eager evaluating interval_map can get pretty small while the lazy one grows on and on.

(2) Another use case: your interval_map<D, int> tracks the maximum of values that are a property of some objects and change over time. Let's say you have a party and you track the height of the tallest guest, similar to the documented example: http://www.joachim-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/ex...

  // I've heard Obama wants to introduce the metric system ;)
  ([19:00-23:10], 1.65m)  // First guest, arriving at 19:00, is 1.65 meters tall.
  ([19:00-22:00], 1.70m)  // Second guest ...
                          // Maybe some more under 2.13m at [19:00-..)
  ([19:00-24:00), 2.13m)  // Now comes Dirk Nowitzki ...
                          // N more guests who are smaller than Dirk,
                          // coming later and leaving earlier.

The eager evaluating interval_map grows to the point when Dirk comes. 
Then it shrinks to size 1, and stays at size 1 when smaller guests enter the party later. A lazy evaluating interval_map founded on an interval_tree would, correct me if I'm wrong, grow on and on, and would compute queries in O(log N) time instead of the O(c) time of the eager interval_map.

Most obvious is the case of an interval_set<T>. The previously discussed "worst case" (large intervals that are all overlapping) is the best case for the eager evaluating implementation here. There will always be only a single interval node in the implementing std::set, so space and time of all operations are constant. An interval_tree would store all N intervals. Maybe there are nifty optimisations of the lazy data structure for that. Yet, I have the impression that interval_trees are surely not the foundation for *every* use case within the scope of this library. Considering this, I think it would not be appropriate to use the lazy evaluating interval_trees as a *general* implementation for interval maps. And also it would not be a strong foundation for *every* use case. Rather it would be a more optimal implementation for an important class of instantiations of split_interval_maps.

Personally I think questions of implementation, although very important, are not the foundation of a library. More important are concepts, design and correctness. Those are the fields that I have invested a lot in, and I am fairly sure that, based on this foundation, a lot of useful and exciting optimizations can be and will be done in the future evolution of the library. I have particularly invested great effort in the investigation of the semantics of interval sets and interval maps, creating sets of axioms that formally specify *what* the objects of my library will do and not *how* they will do it. This is the part that I feel is the foundation of my library, and I think it is a strong foundation. 
It provides not only a formal specification of its behavior but, in the extended part, an automated test tool to validate any given implementation against this specification. Thank you again for focusing on the important issues of complexity and implementation. Best regards, Joachim

Joachim Faulhaber wrote:
Hi Phil,
thank you again for your review.
And thank you for finally addressing this issue. I do not have much time to reply, unfortunately.
There are no data structures that are superior as such. Each has its pros and its cons and the user has to take her decision of choosing the appropriate one, dependent on the characteristics of her specific problem domain and use case.
Selection of appropriate data structures is vital. What I believe you have done is to take one data structure and build an overly-generic library on top of it. It seems to me that it would be better to either restrict the scope of the library to those things that it can do efficiently, or to extend the implementation of the library to efficiently support all of the advertised capabilities.
the segmentation of the interval_map, (its separation into elementary intervals that the aggregation results are associated to), is only implicit in the [interval tree]. Moreover the interval_tree, in contrast to the "eager implementation" does not "recognize", that neighbouring intervals can be joined, because of equality of the aggregated values associated to the elementary segments.
Most obvious is the case of an interval_set<T>. The previously discussed "worst case" (large intervals, that are all overlapping) is the best case for the eager evaluating implementation here. There will always be only a single interval node in the implementing std::set. So space and time of all operations are constant. An interval_tree would store all N intervals.
I think that you could quite easily update an interval tree in such a way that it would store an "eager" implementation. In the case of an interval_set, to insert an interval into an interval tree that is representing an interval_set you would:

1. Erase all intervals that are contained within the new interval.
2. If an interval overlaps the bottom end of the new interval, erase it and extend the bottom end of the new interval.
3. If an interval overlaps the top end of the new interval, erase it and extend the top end of the new interval.
4. Insert the modified new interval.

I guess that the other cases would be similar, and they would have the same time and space complexity as your current std::set implementation. Sorry for my short reply, but we are in the last few hours of the review period. Regards, Phil.

2010/3/7 Phil Endecott <spam_from_boost_dev@chezphil.org>:
Joachim Faulhaber wrote:
Hi Phil,
thank you again for your review.
And thank you for finally addressing this issue. I do not have much time to reply, unfortunately.
There are no data structures that are superior as such. Each has its pros and its cons and the user has to take her decision of choosing the appropriate one, dependent on the characteristics of her specific problem domain and use case.
Selection of appropriate data structures is vital. What I believe you have done is to take one data structure and build an overly-generic
... too much honour for me, generic will do
library on top of it. It seems to me that it would be better to either restrict the scope of the library to those things that it can do efficiently,
I think that's overly restrictive. There are a lot of generic data structures in std and Boost that can be used in inefficient ways, and there are no specific restrictions that guard against that. Examples:

(1) You can call insert and erase on a std::vector. So you can happily use some large vectors in an application that does a lot of dynamic insertions and erasures on them, preferably near their front. Who restricts you?

(2) You can use std::map<T, SomeLargeCollection<SomeMoreStuff...> >. No restrictions. Sometimes people deliberately do this for quick solutions where efficiency does not matter.

(3) You can use, say, boost::dynamic_bitset to work with sparse bitsets that contain, say, bit 0 and bit 2^N-1. That works, as long as your workstation has enough memory. Restrictions?

BTW, my library has more than one very space-efficient way to work with sparse bitsets, and I cannot see how an interval tree implementation would make these any better. Moreover, restricting, say, split_interval_map<D, Collection<T> > from being used because of a potential worst-case scenario would mean patronizing the skilled user who is well aware of the complexity problem involved but decides that the characteristics of his problem domain will not be problematic (e.g. small collections and moderate overlaps).
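The sparse-bitset point can be made concrete. The following is a hypothetical illustration (not ITL code): representing the set of set bits as closed-open intervals [lo, hi) of bit positions, a bitset holding only bit 0 and bit 2^40-1 costs two small tree nodes, while a dense bit-array representation of the same set would need 2^40 bits, i.e. about 128 GiB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical interval representation of a sparse bitset: each element is a
// run of set bits [lo, hi). The storage cost depends on the number of runs,
// not on the magnitude of the bit positions.
using BitInterval = std::pair<std::uint64_t, std::uint64_t>;

std::size_t nodes_for_two_extreme_bits() {
    const std::uint64_t top = (std::uint64_t(1) << 40) - 1;
    std::set<BitInterval> bits;
    bits.insert({0, 1});          // bit 0
    bits.insert({top, top + 1});  // bit 2^40 - 1
    return bits.size();           // 2 nodes, independent of the gap between the bits
}
```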
or to extend the implementation of the library to efficiently support all of the advertised capabilities.
which I have already agreed upon.
the segmentation of the interval_map (its separation into the elementary intervals that the aggregation results are associated with) is only implicit in the [interval tree]. Moreover, in contrast to the "eager implementation", the interval tree does not "recognize" that neighbouring intervals can be joined because the aggregated values associated with their elementary segments are equal.
Most obvious is the case of an interval_set<T>. The previously discussed "worst case" (large intervals that all overlap) is the best case for the eagerly evaluating implementation here: there will always be only a single interval node in the implementing std::set, so the space and time of all operations are constant. An interval tree would store all N intervals.
I think that you could quite easily update an interval tree in such a way that it would store an "eager" implementation. In the case of an interval_set, to insert an interval into an interval tree that is representing an interval_set you would
1. Erase all intervals that are contained within the new interval.
2. If an interval overlaps the bottom end of the new interval, erase it and extend the bottom end of the new interval.
3. If an interval overlaps the top end of the new interval, erase it and extend the top end of the new interval.
4. Insert the modified new interval.
This is pretty much what the current implementation is doing. But in addition you have worse space performance, due to the extra information stored in each node of an interval tree: the maximum upper endpoint of the intervals in the subtree rooted at that node.
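For concreteness, this is the standard textbook augmentation an interval tree carries (a hypothetical sketch, not code from ITL or a concrete proposal from the thread): each node stores, besides its own interval, the maximum upper endpoint found anywhere in its subtree. Keeping this field up to date on every insertion, erasure and rebalancing rotation is the per-node overhead relative to a plain std::set of intervals.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Hypothetical augmented interval-tree node (textbook form).
struct ITreeNode {
    int lo, hi;          // the node's own interval [lo, hi)
    int max_hi;          // max upper endpoint in this subtree (the augmentation)
    ITreeNode* left;
    ITreeNode* right;
};

int subtree_max(const ITreeNode* n) { return n ? n->max_hi : INT_MIN; }

// Recompute the augmented field bottom-up after a structural change.
void update_max(ITreeNode* n) {
    n->max_hi = std::max(n->hi,
                std::max(subtree_max(n->left), subtree_max(n->right)));
}
```

The max_hi field is what lets a stabbing query prune whole subtrees: if subtree_max(left) is below the query point, nothing in the left subtree can contain it.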
I guess that the other cases would be similar,
In order to join neighbouring segments of an interval_map when their aggregates are equal, you have to either store up to 2*N-1 aggregation results in your N-node interval tree, or recompute up to 2*N-1 aggregates on every insertion. As I already said, there might be nifty solutions, but you would send me back to square one, as Paul put it. And I don't think this would be appropriate for a library that has already proved useful and obviously sufficiently efficient in pretty demanding contexts; see e.g. the review of Eric Jonas.
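The eager aggregate-on-overlap semantics under discussion can be sketched with a step function. This is a hypothetical simplification, not ITL's implementation: a std::map from boundary point to the counter value that holds from that point up to the next boundary. add() splits segments only where needed, and its final pass joins neighbouring segments whose aggregated values have become equal, which is the "recognition" step a plain interval tree lacks.

```cpp
#include <cassert>
#include <iterator>
#include <map>

// Hypothetical eager interval_map over int counters, represented as a step
// function: boundary point -> value holding on [boundary, next boundary).
using StepMap = std::map<int, int>;

int value_at(const StepMap& m, int x) {
    auto it = m.upper_bound(x);
    return it == m.begin() ? 0 : std::prev(it)->second;
}

void add(StepMap& m, int lo, int hi, int v) {
    if (lo >= hi) return;
    // Ensure boundaries exist at lo and hi, preserving the old values there
    // (insert is a no-op if the key is already present).
    m.insert({lo, value_at(m, lo)});
    m.insert({hi, value_at(m, hi)});
    // Aggregate v onto every segment inside [lo, hi).
    for (auto it = m.find(lo); it != m.find(hi); ++it) it->second += v;
    // Join neighbouring segments whose aggregates became equal
    // (the value before the first boundary is implicitly 0).
    int prev = 0;
    for (auto it = m.begin(); it != m.end();) {
        if (it->second == prev) it = m.erase(it);
        else { prev = it->second; ++it; }
    }
}
```

Adding [1,3)->1 and then [3,5)->1 leaves a single joined segment [1,5)->1, while adding overlapping intervals stacks their values on the overlap.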
and they would have the same time and space complexity as your current std::set implementation.
and maybe not. Don't get me wrong: I am not refusing to work on your proposed optimized implementation. In fact I am pretty keen to improve my library this way, but, as for everyone else, my time and energy are limited. And after putting a lot of work into a library that seems to be pretty useful and that has found many use cases in different fields, that comes with a formal spec and an associated tool that allows for repeatable validation, I think it is not adequate to devalue the whole project on the grounds of an optimization that is IMO important but also partial.
Sorry for my short reply, but we are in the last few hours of the review period.
Thank you again for contributing to this important issue. Regards, Joachim.
participants (3)
- Joachim Faulhaber
- Phil Endecott
- Simonson, Lucanus J