
I am pleased to be able to review this important library. Boost would definitely benefit from the addition of a geometry library of the right sort. I am certainly not a geometry expert, but I have written a number of things in the last few years that use geometry in GUIs and for geographic data. I have written this review as a potential user, by evaluating how well the library would work in a couple of example applications.

My overall experience has been a little disappointing. I think this is mainly because I had very high hopes for the library, and the reality has not quite met my expectations. As I will explain in detail below, my complaints are mainly things like excessive warnings and odd misfeatures in the interface. These are all things that could be fixed, and they perhaps only indicate that the library has arrived a little prematurely. Based only on these issues, my verdict would be that the library could be accepted after some revisions. But we must also look at the bigger picture, i.e. the existence of other competing libraries "in the wings"; I will discuss this after describing my practical evaluation.

Points starting ** in the following are issues that I think should be fixed for the library to be acceptable.

Application 1
-------------

I have an application that can display placename search result markers on a world map (see http://chezphil.org/tmp/six_thousand_lakes.jpeg). When no markers are shown the screen can be updated at about 20 frames per second; when many markers are shown the frame rate drops to the point where usability is impaired. In this case the vast majority of the markers overlap each other, and this could be exploited to reduce the amount of work that needs to be done by the graphics hardware and hence boost the frame rate.

The first step was to add the "magic" required by the library to support my rectangle type; a sketch of roughly what this involved is given below. The examples don't seem to include a custom rectangle, and the decomposition of a rectangle into a pair of intervals, rather than the more common pair of points or point-and-size, made this a bit tricky. More serious were the following problems:

** The library does not seem to use any sort of explicit concept checking; if types do not satisfy their concepts, random errors from deep inside the library are emitted. The verbosity and incomprehensibility of errors from complex C++ code is, in my opinion, the greatest single failing of the language today. In the face of these ridiculous messages, some users will simply abandon trying to use the library (or the whole language, and go back to C), or limit themselves to only what can be done by copying examples. Fixing this needs effort from the language designers, compiler writers, and libraries. I am not certain that concept checking can totally cure this problem, but it can only help.

** The library spews out enormous quantities of warnings (gcc 4.1.2), I think mainly related to unused parameters. This has to be fixed (and I don't believe fixing it is hard).

** The library failed to work with std::pair being used as the interval type, despite this being mentioned as a possibility in the documentation. The problem seems to be that the std namespace is brought in by ADL and std::set is found rather than the library's own set function. I have not investigated how hard this would be to fix; instead I wrote my own pair template.
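For reference, the following is a minimal sketch of the kind of traits "magic" I mean, pieced together from the custom point and polygon examples since there is no custom rectangle example to copy. my_rect and its members are placeholders for my own type, and the exact trait signatures here are only my best guess at what the library expects:

    #include <boost/polygon/polygon.hpp>

    struct my_rect { int x0, y0, x1, y1; };

    namespace boost { namespace polygon {

      // Tell the library that my_rect models the rectangle concept.
      template <>
      struct geometry_concept<my_rect> { typedef rectangle_concept type; };

      // Read access: the library views a rectangle as one interval per axis.
      template <>
      struct rectangle_traits<my_rect> {
        typedef int coordinate_type;
        typedef interval_data<int> interval_type;
        static interval_type get(const my_rect& r, orientation_2d orient) {
          return orient == HORIZONTAL ? interval_type(r.x0, r.x1)
                                      : interval_type(r.y0, r.y1);
        }
      };

      // Write access: set one axis, or construct from two intervals.
      template <>
      struct rectangle_mutable_traits<my_rect> {
        static void set(my_rect& r, orientation_2d orient,
                        const interval_data<int>& iv) {
          if (orient == HORIZONTAL) { r.x0 = iv.low(); r.x1 = iv.high(); }
          else                      { r.y0 = iv.low(); r.y1 = iv.high(); }
        }
        static my_rect construct(const interval_data<int>& ix,
                                 const interval_data<int>& iy) {
          my_rect r = { ix.low(), iy.low(), ix.high(), iy.high() };
          return r;
        }
      };

    }}

Whether those are exactly the signatures the library wants is part of what better concept checking, or a custom rectangle example in the documentation, would make obvious.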
Having completed the "magic", I started to adapt my marker drawing code. My first attempt was a minimally-invasive version that would check whether each marker covered any new area and not plot it otherwise:

    if (!contains(markers, b)) { markers |= b; plot(b); }

Unfortunately this failed, since contains() does not seem to be defined for a polygon_90_set and a rectangle. This surprised me somewhat. A valuable addition to the documentation would be a matrix indicating which operations are supported between which types. Somebody will need every combination of operations and types, and it would be useful to see quickly which are provided, which are planned for a future version of the library, and which cannot be implemented due to some restriction of the library design.

I could have tried to implement contains(polygon_90_set, rectangle) in terms of e.g. intersection, but I was concerned that it would have much worse complexity than an optimised contains() algorithm could have. Which raises another issue:

** Algorithms should indicate their complexity guarantees, and (where different) their expected complexity for common inputs.

So I moved on to a more invasive modification, where I accumulate all of the rectangles and use get_rectangles() at the end to extract a minimal set of areas to plot (pseudocode; a fuller version is sketched at the end of this section):

    for (...) { markers |= b; }
    get_rectangles(rects, markers);
    foreach (rect in rects) { plot(rect); }

This works, and reduces the number of rectangles that are plotted from about 6,000 to about 900. Unfortunately get_rectangles() takes about 100 ms to run, which is slightly longer than plotting the redundant markers would have taken, so overall the frame rate is not improved.

In that code, I was surprised to see that get_rectangles() takes an output container by reference. In this case I have no need to allocate memory and store these rectangles at all; I would have preferred some way to get the rectangles as they are generated and pass them directly to my plot function, e.g.

    for_each_rectangle(markers, &plot);

An output iterator would be a more conventional (but in this case more verbose) way to achieve this.

** Unless there is some good reason for using output reference parameters, algorithms should use output iterators.

I was also a little confused by the difference - if any - between "markers |= b" and "markers.insert(b)". I believe that I know what the former does, but I worry that unioning-in each rectangle in turn is inefficient, and that it would be better to construct a set of rectangles and self-union them in one go. Perhaps this is what insert() does - yet there is no self_union function, and in practice I seem to get the same results. Maybe get_rectangles() has done the self-union. Some clarification of this in the documentation would be useful; the sketch below shows both variants.
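For concreteness, here is a fuller version of the accumulate-and-extract approach with both variants shown. This is only a sketch: plot() and marker_rects stand in for my application's drawing function and raw marker boxes, it uses the library's own rectangle_data rather than my adapted rectangle type for simplicity, the typedefs reflect my understanding of the library's data types, and whether insert() really defers the merging is exactly the question I am asking above.

    #include <vector>
    #include <boost/polygon/polygon.hpp>

    typedef boost::polygon::rectangle_data<int>      rect_t;
    typedef boost::polygon::polygon_90_set_data<int> marker_set_t;

    void plot(const rect_t& r);   // defined elsewhere: draws one rectangle

    void plot_markers(const std::vector<rect_t>& marker_rects)
    {
      using namespace boost::polygon::operators;   // enables |= etc.
      marker_set_t markers;

      // Variant A: union-in each marker as it arrives.
      //   for (const rect_t& b : marker_rects) markers |= b;

      // Variant B: just insert, hoping the merge is deferred until needed.
      for (const rect_t& b : marker_rects) markers.insert(b);

      // Extract a minimal set of non-overlapping rectangles and plot those.
      std::vector<rect_t> rects;
      get_rectangles(rects, markers);        // ~100 ms for ~6,000 markers
      for (const rect_t& r : rects)          // ~900 rectangles remain
        plot(r);
    }

In my tests both variants produced the same ~900 rectangles, which is what prompted the question about what insert() actually does.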
Application 2
-------------

I have data for the borders of N. American states and provinces. My original plan was to try to reduce these state outline polygons to a set of polylines; my viewer does this as a pre-processing step to halve the work it does when plotting these borders. But the library really is limited to polygons (hence the name), with no support for even representing polylines, let alone doing anything with them. It would be useful for the documentation to include a "roadmap" section indicating whether this is being considered for a future version. It might be worthwhile to define polyline and other concepts now, if they are relatively uncontroversial, to avoid multiple implementations later.

So I decided instead to use this data to implement a "what state is this point in?" function. This boils down to reading in the state border polygons and testing points using contains(polygon, point).

One issue that I noted while reading in the polygons is that the provided polygon_data type is immutable: to read from a file, I need either an exotic iterator that can be passed to its constructor, or a temporary. I'm unsure why polygon_data can't be more like a mutable std::container, so that I can simply read points from a file and push_back() them in a loop. Trying to use a std::vector<point_data> instead required more traits specialisation than I expected; can't this be pre-defined?

** The polygon_data type should be mutable (i.e. support things like push_back()), or a mutable version should be provided, or std::vector/list<point_data> should work with minimal traits specialisation.

I also found that polygon_data is specialised by the coordinate type, not the point type, so anyone using their own point type will have to provide their own polygon type. Is there some reason why I can't have polygon_data<my_point>? Again, making it easier to use std::container<my_point> would help.

I also found little mention of winding direction in the documentation. I haven't checked whether the data I have winds consistently; what does unknown_winding actually do?

As I guessed, contains(polygon, point) is very slow if done naively: the library's polygon concept (and the provided default implementation) have no way to determine whether a point lies within a polygon in better than O(N) time. Testing first against the bounding box makes it O(1) in the common case. So I tried to write a polygon type that would do this, something like this (but not exactly this!):

    struct my_polygon: polygon_data {
      rectangle bbox;
      template <typename ITER>
      my_polygon(ITER begin, ITER end):
        polygon_data(begin, end), bbox(extents(*this)) {}
    };

    bool contains(const my_polygon& poly, point_t pt) {
      return contains(poly.bbox, pt) && contains(poly, pt);
    }

Hmm, well that looks OK until the very last "contains"... no doubt someone will immediately spot the "right" way to do this. I ended up making contains() a member of my_polygon, and this worked reasonably well; a rough sketch of what I ended up with is shown below. Note that the extents() function does not work as written above: it takes a reference to a rectangle as an out parameter. center() does something similar. This seems wrong to me; returning a point or rectangle would be more conventional, I think.

** extents() and center() should return their results, rather than using an output reference.
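For what it's worth, this is roughly the shape of what I ended up with. It is only a sketch under my assumptions about the interface (polygon_data's iterator constructor, extents() with its output reference, and contains() for rectangles and polygons); the casts to the base class are there because the library knows about polygon_data but not about my derived type.

    #include <boost/polygon/polygon.hpp>

    typedef boost::polygon::point_data<int>     point_t;
    typedef boost::polygon::rectangle_data<int> rect_t;
    typedef boost::polygon::polygon_data<int>   base_polygon_t;

    struct my_polygon : base_polygon_t {
      rect_t bbox;

      template <typename Iter>
      my_polygon(Iter begin, Iter end) : base_polygon_t(begin, end) {
        // extents() fills in its first argument rather than returning it.
        boost::polygon::extents(bbox, static_cast<const base_polygon_t&>(*this));
      }

      // A member function avoids the clash with the library's free contains().
      bool contains(const point_t& pt) const {
        return boost::polygon::contains(bbox, pt)     // O(1) bounding-box rejection
            && boost::polygon::contains(              // full O(N) test only if needed
                 static_cast<const base_polygon_t&>(*this), pt);
      }
    };

The "which state?" function then just loops over the state polygons calling p.contains(pt) and returns the first hit, which is where the boundary question discussed next comes in.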
An important question is what this function should do when given a coordinate that lies on the boundary between two (or more) states. One can argue for various different behaviours. My preference would be to never return "no state" and to be deterministic, but otherwise I don't care. The contains() function has a parameter "consider_touch" that makes it possible to implement different policies; this seems sufficient in most cases, though I'm unsure how you would consider the left and bottom edges included and the right and top edges excluded in a 90-degree shape (which might be needed for a GUI).

I have a fixed-point type that I use for GPS coordinates (32 bits is good for lat/lng GPS positions, but 24 bits isn't), and using it with this library worked well. The Coordinate Concept page doesn't say much about the type requirements ("expected to be integral"), and spelling out in more detail what is needed would be useful. In practice I must have already implemented everything that it needed.

I would have liked to pursue this example further and see just how fast the "which state?" function can be made. For example, simplified versions of the state borders could be stored for fast lookup of points not near the borders. But I ran out of time. If anyone would like to experiment with this, the data can be found at http://chezphil.org/tmp/states.tgz .

Other general notes
-------------------

Operator Overloading:

I am not a huge fan of "gratuitous" operator overloading. In this case, I find |= for union acceptable, particularly since I think I can disable it by not using the boost::polygon::operators namespace. I would prefer to avoid the duplicate operators (i.e. + == |). However, I can't see any way to perform these boolean operations other than using the operators, and I think this is an omission. If I were writing code that I expected someone else unfamiliar with the library to understand, I would like the important calls to be conspicuous.

** The library should provide functions to perform the boolean operations, as well as operators, for the benefit of those who want to make their code more verbose.

I dislike the use of operators with integers for the grow/shrink operations. My concern is that this meaning is non-obvious, and since these operators are also defined for boolean operations, there is potential for error:

    polygon_set a, b, i;
    ...
    for (int i = 0; i < 10; ++i) {
      ...
      a -= b + i;   // oops, forgot i is now an int
    }

Isotropic Style
---------------

I recall Luke describing his isotropic style on the list a long time ago, and the documentation would benefit from an explanation along those lines. One "FAQ" that it would need to address is the question of the performance of a run-time selection of X vs. Y, compared to a compile-time selection in more conventional code; a sketch of the sort of thing I mean follows at the end of this section.

I'm unsure whether the use of caps for e.g. HORIZONTAL and VERTICAL enum members is Boost style; they look like macros to me. I'm not very happy with the use of compass directions as direction labels, since I have code in which these directions are definitely not what the axes actually mean. Other labels, including LEFT and RIGHT, would have the same problem in other contexts. GUI APIs are inconsistent about whether y increases up or down the screen, so terms like "top left" are also problematic. I think about the only safe terms are "increasing X axis" etc. I tend to use x0y0 ... x1y1 for corners.
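To illustrate the question, using a throwaway type of my own rather than the library's actual classes: the isotropic style passes the axis as a run-time value, where conventional code picks it at compile time.

    // Generic illustration only, not the library's types.
    enum axis { HORIZ = 0, VERT = 1 };

    struct pt { int coord[2]; };   // coord[HORIZ] = x, coord[VERT] = y

    // Isotropic: one function, axis chosen at run time by indexing.
    inline int get(const pt& p, axis a) { return p.coord[a]; }

    // Conventional: two functions, axis fixed at compile time.
    inline int get_x(const pt& p) { return p.coord[HORIZ]; }
    inline int get_y(const pt& p) { return p.coord[VERT]; }

The FAQ is whether a call like get(p, HORIZ), where the axis is in fact a compile-time constant, optimises down to the same code as get_x(p) once inlined. I would expect so with current compilers, but a documented answer (or a measurement) would be reassuring.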
Documentation
-------------

I find the documentation satisfactory, though it could benefit from some style improvements (e.g. the font used for the example code). I suggest that an introductory tutorial should be added. This should fully explain everything that a new user who has no legacy code needs to know (i.e. it ignores the traits and uses the provided _data types). Then a second tutorial (more like the current material) could explain how users with their own types can use them with the library. Answers to basic questions like "is it header-only?" and "what other Boost libraries does it depend on?" should also be added somewhere prominent in the introduction.

That concludes my practical examination of the library. We then have the question of rival geometry libraries to consider.

Barend has presented a few previews of his library and I have looked at these, though not in much detail. My feeling was that Barend had a long way to go before he would have something useful, and that even then he might never produce something that I would actually be able to use. For example, he has decided to adopt the conventions of the OGC; this is not entirely unreasonable, but it means that the semantics of things like whether border points are inside or outside a polygon are non-negotiable. Barend has also recently reminded us of the spatial indexes GSOC project and says that it is now included in his library, despite it being the worst code I have ever seen presented to Boost. Despite all that, Barend's recent comments suggest that he has made significant progress and that he may have something useful to show us in a few weeks.

Barend has also raised questions about the worst-case performance of Luke's line intersection algorithm, and it seems that Luke accepts that it is not optimal. Achieving the best possible algorithmic complexity is something that I would consider very important, if not essential, for a Boost library. At the very least, Luke needs to document the complexity of his current algorithm and explain why he cannot do better. The views of experts should be sought if this cannot be resolved by the review manager.

In view of all this, I suggest that this library should be rejected for now. This will tell Barend that he still has an opportunity to present his library for review, and that it will be considered on a level playing field. If Barend's library is reviewed and found to be more complete, more performant, and at least as usable as this library, then it should be accepted. On the other hand, if Barend's library is found to be deficient in some way (or is not submitted for review), then Luke will have an opportunity to resubmit an updated version of this library, which I anticipate should be accepted.

Regards,

Phil.