
On 8 Aug 2007, at 17:01, David Abrahams wrote:
This part of my analysis focuses on the tools available for getting feedback from the system about what's broken. Once again, because there's been substantial effort invested in dart/cmake/ctest and interest expressed by Kitware in supporting our use thereof, I'm including that along with our current mechanisms. Although not strictly a reporting system, I'll also discuss BuildBot a bit because Rene has been doing some research on it and it has some feedback features.
I've struggled to create a coherent organization to this post, but it still rambles a little, for which I apologize in advance.
Feedback Systems
================
Boost's feedback system has evolved some unique and valuable features.
Unique Boost Features
---------------------
* Automatic distinction of regressions from new failures.
* A markup system that allows us to distinguish library bugs from compiler bugs and add useful, detailed descriptions of severity and consequences. This feature will continue to be important at *least* as long as widely-used compilers are substantially nonconforming.
* Automatic detection of tests that had been failing due to toolset limitations but begin passing without a known explanation.
* A summary page that shows only unresolved issues.
* A separate view encoding failure information in a way most appropriate for users rather than library developers.
While I acknowledge that Boost's feedback system has substantial weaknesses, no other feedback system I've seen accommodates most of these features in any way.
I agree. I've had numerous experiences with large projects that have not done it as well as Boost. Personally I find the status information held by meta-comm useful and informative. The opening page isn't very helpful, but digging in always leads to the information that is most useful.
Dart
----
It seems like Dart is a long, long way from being able to handle our display needs -- it is really oriented towards providing binary "is everything OK?" reports about the health of a project. It would actually be really useful for Boost to have such a binary view; it would probably keep us much closer to the "no failures on the trunk (or integration branch, if you prefer)" state that we hope to maintain continuously. However, I'm convinced our finer distinctions remain extremely valuable as well.
Other problems with Dart's dashboards (see http://public.kitware.com/dashboard.php?name=public):
* It is cryptic, rife with unexplained links and icons. Even some of the Kitware guys didn't know what a few of them meant when asked.
* Just like most of Boost's regression pages, it doesn't deal well with large amounts of data. One look at Kitware's main dashboard above will show you a large amount of information, much of which is useless for at-a-glance assessment, and the continuous and experimental build results are all at the bottom of the page.
Dart's major strength is that it maintains a database of past build results, so anyone can review the entire testing history.
BuildBot
--------
Buildbot is not really a feedback system; it's more a centralized system for driving testing. I will deal with that aspect of our system in a separate message.
Buildbot's display result (see http://twistedmatrix.com/buildbot/ for example) is no better suited to Boost's specific needs than Dart's, but it does provide one useful feature not seen in either of the other two systems: one can see, at any moment, what any of the test machines are doing. I know that's something Dart users want, and I certainly want it. In fact, as Rene has pointed out to me privately, the more responsive we can make the system, the more useful it will be to developers. His fantasy, and now mine, is that we can show developers the results of individual tests in real time.
Another great feature BuildBot has is an IRC plugin that insults the developer who breaks the build (http://buildbot.net/repos/release/docs/buildbot.html#IRC-Bot). Apparently the person who fixes the build gets to choose the next insult ;-)
Most importantly, BuildBot has a plugin architecture that would allow us to (easily?) customize feedback actions (http://buildbot.net/repos/release/docs/buildbot.html#Writing-New-Status-Plugins).
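For concreteness, a status plugin is just a Python class hooked into the build master. The sketch below is written from memory against the 0.7-era API in the docs linked above, so the class and method names should be checked there before anyone relies on them; everything Boost-specific is left as a comment.

    # Rough sketch only: names are from memory of the 0.7-era status API
    # and should be verified against the "Writing New Status Plugins" docs.
    from twisted.python import log
    from buildbot.status import base
    from buildbot.status.builder import FAILURE

    class BoostResultsNotifier(base.StatusReceiverMultiService):
        """Hypothetical plugin that reacts to every finished build."""

        def setServiceParent(self, parent):
            base.StatusReceiverMultiService.setServiceParent(self, parent)
            self.status = parent.getStatus()
            self.status.subscribe(self)

        def builderAdded(self, name, builder):
            # Returning self subscribes us to this builder's build events.
            return self

        def buildFinished(self, builderName, build, results):
            # This is where Boost-specific feedback would go: regression
            # vs. new-failure classification, markup lookup, notification.
            if results == FAILURE:
                log.msg("%s: build FAILED" % builderName)
            else:
                log.msg("%s: build ok" % builderName)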
Boost's Systems
---------------
The major problems with our current feedback systems, AFAICT, are fragility and poor user interface.
I probably don't need to make the case about fragility, but in case there are any doubts, visit http://engineering.meta-comm.com/boost-regression/CVS-HEAD/developer/index.build-index.html. For the past several days, it has shown a Python backtrace:
Traceback (most recent call last):
  File "D:\inetpub\wwwroots\engineering.meta-comm.com\boost-regression\handle_http.py", line 324, in ?
  ...
  File "C:\Python24\lib\zipfile.py", line 262, in _RealGetContents
    raise BadZipfile, "Bad magic number for central directory"
BadZipfile: Bad magic number for central directory
This is a typical problem, and the system breaks for one reason or another <subjective>on a seemingly weekly basis</subjective>.
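For illustration only: assuming the results really do arrive as zip archives, as the traceback suggests, the report generator could degrade gracefully on a corrupt upload instead of dying. The helper below is hypothetical; I haven't looked at how handle_http.py is actually structured.

    import sys
    import zipfile

    def read_results_archive(path):
        # Return {member name: contents} for one uploaded archive, or None
        # if the archive is corrupt, so one bad upload only costs a warning
        # instead of a traceback on every page view.
        try:
            archive = zipfile.ZipFile(path)
        except zipfile.BadZipfile:
            sys.stderr.write("warning: skipping corrupt upload %s\n" % path)
            return None
        try:
            return dict((name, archive.read(name))
                        for name in archive.namelist())
        finally:
            archive.close()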
With respect to the UI, although substantial effort has been invested (for which we are all very grateful), managing that amount of information is really hard, and we need to do better. Some of the current problems were described in this thread <http://tinyurl.com/2w7xch> and <http://tinyurl.com/2n4usf>; here are some others:
* The front page is essentially empty, showing little or no useful information <http://engineering.meta-comm.com/boost-regression/boost_1_34_1/developer/index.html>
* Summary tables have a redundant list of libraries at left (it also appears in a frame immediately adjacent)
* Summaries and individual library charts present way too much information to be called "summaries", overwhelming any reasonably-sized browser pane. We usually don't need a square for every test/platform combination
* It's hard to answer simple questions like "what is the status of Boost.Python under gcc-3.4?" or "how well does MPL work on Windows with STLPort?", or what is the list of
* A few links are cryptic (Full view/Release view) and could be better explained.
The email system that notifies developers when their libraries are broken seems to be fairly reliable. Its major weakness is that it reports all failures (even those that aren't regressions) as regressions, but that's a simple wording change. Its second weakness is that it has no way to harass the person who actually made the code-breaking checkin, and harasses the maintainer of every broken library just as aggressively, even if the breakage is due to one of the library's dependencies.
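To make the second point concrete: finding whom to notify first is mostly a matter of asking Subversion who committed between the last clean run and the first broken one. Only the "svn log --xml" command line below is a real interface; the surrounding glue is hypothetical.

    import subprocess
    from xml.dom import minidom

    def committers_between(repo_url, last_good, first_bad):
        # Return the set of svn authors for revisions (last_good, first_bad].
        # These are the people the notifier should nag first, before falling
        # back to the maintainers of the broken libraries.
        cmd = ["svn", "log", "--xml",
               "-r", "%d:%d" % (last_good + 1, first_bad), repo_url]
        xml_text = subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0]
        doc = minidom.parseString(xml_text)
        return set(node.firstChild.data
                   for node in doc.getElementsByTagName("author"))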
Recommendations
---------------
Our web-based regression display system needs to be redesigned and rewritten. It evolved from a state where we had far fewer libraries, platforms, and testers, and is burdened with UI ideas that only work in that smaller context. I suggest we start with as minimal a display as we think we can get away with: the front status reporting page should be both useful and easily grasped.
IMO the logical approach is to do this rewrite as a Trac plugin, because of the obvious opportunities to integrate test reports with other Trac functions (e.g. linking error messages to the source browser, changeset views, etc.), because the Trac database can be used to maintain the kind of history of test results that Dart manages, and because Trac contains a nice built-in mechanism for generating/displaying reports of all kinds. In my conversations with the Kitware guys, when we've discussed how Dart could accommodate Boost's needs, I've repeatedly pushed them in the direction of rebuilding Dart as a Trac plugin, but I don't think they "get it" yet.
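For a sense of scale, the skeleton of such a plugin is small. The sketch below assumes the 0.11-style component conventions; the URL, the template name, and everything Boost-specific are placeholders.

    from trac.core import Component, implements
    from trac.web.api import IRequestHandler

    class BoostTestReport(Component):
        """Hypothetical plugin serving a test-results page inside Trac."""
        implements(IRequestHandler)

        # IRequestHandler methods
        def match_request(self, req):
            # Claim a URL of our own under the Trac site.
            return req.path_info == '/testreport'

        def process_request(self, req):
            # A real plugin would pull results out of the Trac database and
            # hand them to a template; both names here are made up.
            data = {'results': []}
            return 'boost_testreport.html', data, None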
I have some experience writing Trac plugins and would be willing to contribute expertise and labor in this area. However, I know that we also need some serious web-UI design, and many other people are much more skilled in that area than I am. I don't want to waste my own time doing badly what others could do well and more quickly, so I'll need help.
Yes, I realize this raises questions about how test results will actually be collected from testers; I'll try to deal with those in a separate posting.
Generally I agree with all the recommendations. However, I am a big fan of incremental delivery and I would advocate that Boost approach this systemically. You don't want to get into the tool business (avoid the anecdotal "why fix things in 5 minutes when I can take a year writing a tool to automate it!" :-{). For what it is worth, my advice would be to do the following:

1. Choose two or three representative tool-chain/platform combinations as Boost 'reference models': (msvc-M.N, Win XP X.Y, ...), (gcc-N.M, Debian, ...), (gcc-N.M, Mac OS X, ...). The choices should be based on what's right 'for the masses' and what the de facto platform for mainstream development is on each. Before anyone screams: I am seriously NOT advocating dropping the builds on the other platforms (read on). Whatever the choices end up being, I believe Boost needs to make a clear policy decision.

2. These 'reference models' become the basis of top-level summary reports against the 'stable' released libraries. That can go on a single page, and it should take a minor amount of time to generate incrementally from the existing system.

3. As for tracking individual test results, I don't personally see what's wrong with putting these under Subversion. Given the likelihood of high commonality between the output text of successive runs, I think it is a much better 'implementation choice' than strictly a database. Certainly XML output from the test framework would aid other post-processing, but that can be a secondary step/enhancement to Boost.Test. Also, there is a strong correlation between the versioning of test results and the changes since the last run that changed the results. Some relatively trivial automation of the source dependency tree changes between successive runs of individual tests could be a significant aid for the authors/maintainers. I'm not an expert on bjam, but I presume that for an individual target it would not be difficult to run a diff between the sources used in successive invocations of each test. (A rough sketch of the idea follows at the end of this message.)

4. Given the reference models above, it would then be sensible to show the status of successive tiers of the Boost project, i.e. stable, development, sandbox, ... Again, an indirection at the top level will make this accessible.

5. Beyond this, I would split the summaries out into platform variants on individual pages: 'boost on windows', 'boost on linux', etc. In this way no information is lost and the community of developers is taken care of.

Hope this helps. As things scale there is a stronger need for 'standardization'; it's unavoidable. Tool-chains are rarely the silver bullet. What Boost has shouldn't be neglected... it is already good for reporting status, and its failings can be worked on incrementally.

Andy
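(A rough sketch for point 3 above, to make the idea concrete: each test's output lands in a Subversion working copy and is committed after every run, so successive runs are stored as cheap deltas and "svn diff" shows exactly what changed since the last run. Paths and helper names are hypothetical; only the svn command lines are real.)

    import os
    import subprocess

    RESULTS_WC = "/var/boost-results"   # a Subversion working copy

    def record_run(test_name, output_text):
        # Write one test's output into the working copy and commit it.
        path = os.path.join(RESULTS_WC, test_name + ".txt")
        is_new = not os.path.exists(path)
        f = open(path, "w")
        f.write(output_text)
        f.close()
        if is_new:
            subprocess.call(["svn", "add", path])
        subprocess.call(["svn", "commit", "-m",
                         "result for %s" % test_name, path])

    def what_changed(test_name):
        # Show how this test's output differs from the previous run.
        path = os.path.join(RESULTS_WC, test_name + ".txt")
        subprocess.call(["svn", "diff", "-r", "PREV:COMMITTED", path])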
--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com