Re: [boost] proposal: interval containers

13 May 2008

      2008/5/12, Simonson, Lucanus J <lucanus.j.simonson@intel.com>:
...
Joachim Faulhaber wrote:
...
...
interval_set:
itl::interval_set implements all functionality of std::set as a set
of intervals and in addition +=, -= and *= for set union, difference
and intersection. On insertion of neighbouring or overlapping
intervals inserted intervals are joined.
split_interval_set:
split_interval_set works like interval_set but borders of the
inserted intervals are preserved, so it can be used as time grid.
Using itl::split_interval_set you can partition other interval
containers using the *= operator to perform intersections.
split_interval_map:
A split_interval_map implements a map as a map of intervals
and associated values. For dealing with the insertion
of overlapping intervals the principle of "aggregation on
overlap" has proved to be very useful: It is shown by example:
typedef set<string> guests;
split_interval_map<time, guests> party;
guests mary; mary.insert("Mary");
guests harry; harry.insert("Harry");
party.insert(make_pair(rightopen_interval(20:00, 22:00), mary));
party.insert(make_pair(rightopen_interval(21:00, 23:00), harry));
// party now contains
[20:00, 21:00)->{"Mary"}
[21:00, 22:00)->{"Harry","Mary"} //guest sets aggregated on overlap
[22:00, 23:00)->{"Harry"}
As can be seen from the example a split_interval_map has both
a decompositional behavior (on the time dimension) as well as
a accumulative one (on the associated values).
Quality, validation, portability:
The library code of the itl has been validated with a law based
test automaton (LaBatea;) which is included in
the current release. It includes a testcase generator and law tester
that checks laws like e.g. commutativity<interval_set<double>, +=>:
the commutativity of interval_set<double> w.r.t. the operation +=.
The code has been compiled and tested with three compilers
(gcc 3.4.4, conceptgcc-boostcon, msvc 8.0 sp1) and with three
implementations of the stl (linux/gnu, stlport, msvc8) under
suse linux 10.1. and ms-windows xp.
I'd be happy if you found interval conainers as presented in this
post an interesting contribution to look at.
If so, I am looking forward to an inspiring discussion.
I found this in your implementation:
   /// Container type for the implementation
   typedef typename itl::set<interval_type,exclusive_less,Alloc>
ImplSetT;
as the underlying data structure for the base type of your interval maps
and sets.
This is a coincidence, because I happen to have implemented the same
underlying split interval map data structure for use as the scanline
data structure for my algorithm that computes the connectivity graph of
overlapping planar geometries.  This data structure results in O(n) map
entries for O(n) overlapping intervals, where each entry has O(n)
elements in a set (in the worst case.)  That turns out to be pretty
suboptimal data structure because it has a storage space complexity
O(n^2).  A superior data structure for the same would be akin to a 1D
R*Tree, center-cutline implementation of a quad-tree or similar.  Such a
data structure which would store only O(n) elements of constant size and
provide log(n) + m (where m is n in the worst case) amortized lookup
time instead of worst case log(n) * m1 * m2 lookup time (m1 and m2 are n
in the worst case.)  Imagine everyone at your example party has
overlapping intervals and different arrival and departure times.
Looking up who was at the party while a given person was present would
require looking at each set of names for each sub-interval where the set
of people was different and results in reading the same name many times
because it appears in many adjacent sub intervals.
For my application the target use case is two layer connectivity
extraction, where the maximum number of polygons (people) in the
scanline (party) is bounded at two (pretty dead party) and the
complexity of the algorithm based on the split interval map type data
structure is optimal.  For n overlapping polygons I know the algorithm
to be suboptimal, and I know what data structure to apply to fix it.  I
would be more interested in the optimal data structure than a generic
wrapper around this usage of std containers.  I'm not sure how having
your interval set code would have helped me write my connectivity
extraction algorithm using this data structure vs. what I was able to do
quite readily with the stl alone.
By the way, a map of sets has poor performance when rebalancing itself.
Getting rid of this data structure in my code is on my todo list.  Once
I have a good replacement (and I have several quad-tree and R*Tree
implementations internally and in the open source community to choose
from) I would make it generic and provide it as part of my library (in
addition to using it where appropriate in my algorithms.)  There is also
a generic RTree implementation that I have added to the library since
submitting it to the vault, but that hasn't been cleared for external
publication yet.
I suggest you research R*Trees and related data structure for spatial
query and think about how your problem resembles 1D spatial query.
Thanks,
Luke
Dear Luke,

When I returned yesterday from a very mild Berlin spring
evening with some extra grappas consumed I was so happy
to find a long answer to my first boost posting that I
joyfully conceded my interval_container classes to be
suboptimal.

Well this was pretty suboptimal with respect to my
intention to demonstrate their benefits. Next day I
realized that this was an exaggeration. So please let
me correct a little;)

Seen from the perspective of a special application
interval_containers appear to be inferior to classes
and algorithms that are tailored for that field. Yet
interval_containers are not designed to be optimized
to a special sort of application.

Interval_sets and interval_maps are intended to be much
more general in scope: As stated in my original posting
the basic idea is that an interval_set implements
a set as a set of intervals. An interval_map implements
a map as a map of intervals with associated values.

So the character of interval_sets/maps is twofold. You
can use them with the same interface as stl::sets/maps.
And you can use its interval structure by iterating
over them. Or by performing intersection operations.

The additional task in implementing interval_sets/maps,
compared to common sets/maps, is the handling of overlaps:

1. interval_set: overlapping/bordering intervals are joined.
2. split_interval_set: overlapping/bordering intervals are
   preserved.
   simple so far!
3. split_interval_map: overlapping/bordering intervals are
   also preserved. Then we must decide what to do with the
   associated values on the overlapping interval ranges.

It unfolds quite naturally that the associated values are
'added' on the overlap parts. So their type has to conform
concept Addable, more specifically I chose InplaceAddable
(+= has to exist) for efficiency reasons.
...
From this design follows, that a split_interval_map is a
map that that does not overwrite values on additional
insertion but it accumulates them. All kinds of data_types
are accumulated according to their 'InplaceAddability'.
(a) So in the most simple case where data_type is int and
   we insert only value_pair([a,b),1). The resulting object
   is an overlap counter.
(b) For numeric data_types and arbitrary associated values
   you are getting a quantity counter. E. g. number of
   grappas consumed in time intervals by party guests.

In the cases (a) and (b) the problem of inefficient worst case
storage complexity is not present.

(c) Only if data_type is set<T> or some other 'InplaceAddable'
container<T> the worst case O(n^2) storage problem can arise.
In this case one may resort to a more efficient datastructure
or accept that behavior, if e.g. the overlap structure is not
excessive.

Finally: The problem of inefficient search on the associated
values: Since a split_interval_map is basically a map, it has
the downside of it's stl brother. Lookup on the data_values
is inefficient. So, in the current design efficient search
on associated values is not provided.

If this is crucial for an application a more specialized
class has to be chosen instead of split_interval_map. In
other words, like with stl::maps not every type instance
of a split_interval_map has optimal efficiency properties for
a given context of usage, but others have. The developer
has to decide then whether to use a general class or a
specialized one.

The design of my interval_containers is more oriented on
structure and orthogonality than on optimizing for specific
applications.

Nevertheless your points were very valuable and I will
consider them to improve the code.

Thank you again
Joachim