
This is the first part of what will probably end up a multi-part review. First the headline: I do think that the library should be included in Boost, although I do have some reservations, as noted below. This part-review covers the documentation and a few simple tests I've conducted. Later I'll try writing some accumulators so that I can comment better on the design.

Documentation review
~~~~~~~~~~~~~~~~~~~~

Tutorial:
~~~~~~~~~

I'm fairly sure that the whole accumulator_set_wrapper business could be completely avoided with a TR1-conforming reference_wrapper - which unfortunately we don't have in Boost just yet - nonetheless this may be worth mentioning.

The "push all the data in this sequence" idiom seems like a very common use case to me; is it not worthwhile supporting this directly via a couple of member functions?

template <class It>
accumulator_set& accumulator_set::push_range(It begin, It end);

template <class Seq>
accumulator_set& accumulator_set::push_sequence(Seq const& s);

Typo in "This will calcualte the left".

Statistic accumulators:
~~~~~~~~~~~~~~~~~~~~~~~

Kurtosis
~~~~~~~~

This is going to confuse folks - I know it's got Paul and me into trouble already! You're returning the "Kurtosis Excess" rather than the "Kurtosis Proper". In the Math Toolkit stats functions we called these functions "kurtosis" and "kurtosis_excess", and I'd favour similar naming here, or at least a strong warning to the effect that there are multiple definitions in the literature - otherwise unwary users will get into all kinds of trouble.
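To make the distinction concrete, the two conventions differ by exactly 3 (a normal distribution has kurtosis proper 3, kurtosis excess 0). These are my own little helper functions using the plain population-moment formulas, not anything from the library under review:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// "Kurtosis proper": fourth central moment over the squared second
// central moment, m4 / m2^2.  Naive one-shot formula, for illustration.
double sample_kurtosis(const std::vector<double>& v)
{
    double n = static_cast<double>(v.size());
    double mean = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        mean += v[i];
    mean /= n;
    double m2 = 0, m4 = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        double d = v[i] - mean;
        m2 += d * d;
        m4 += d * d * d * d;
    }
    m2 /= n;
    m4 /= n;
    return m4 / (m2 * m2);
}

// "Kurtosis excess" - what the library currently returns - is the
// same quantity shifted down by 3.
double sample_kurtosis_excess(const std::vector<double>& v)
{
    return sample_kurtosis(v) - 3.0;
}
```

So a user who expects one definition and gets the other is silently out by 3 on every result - hence my plea for unambiguous names.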
Differing implementations?
~~~~~~~~~~~~~~~~~~~~~~~~~~

feature_of and as_feature get mentioned in the tutorial, along with a sentence to say that there can be different implementations of the same feature, but there is no explanation of how this works, and the links to the reference pages for as_feature and feature_of are distinctly detail-free :-(

This all leads on to the algorithms used for the mean and variance etc.: I assume that these are of the "naive" variety? Perfectly satisfactory for simple cases, but there are well-known pathological situations where these fail.

I'm also missing accessors for the standard deviation, the "unbiased" variance, and the unbiased (N-1) standard deviation. I realise I could write these myself, but they would seem to be common enough cases to warrant inclusion in the library. As it stands, I had to jump through hoops to get data I could compare with "known good" values.

BTW, I assume that there are no accumulators for autocorrelation coefficient calculation?

Which leads on to a quick comparison I ran against the "known good" data here:

http://www.itl.nist.gov/div898/strd/univ/homepage.html

The test program is attached, and outputs the relative error in the statistics calculated.
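(As an aside, the numerically stabler alternative I have in mind for the variance is Welford's running update. The sketch below is my own code, not what the library does internally; it avoids the catastrophic cancellation you get from the textbook E[x^2] - mean^2 formula when the mean is large relative to the spread:)

```cpp
#include <cmath>
#include <cstddef>

// Welford's single-pass algorithm: numerically stable running mean
// and variance.  Each new value updates the mean, and the sum of
// squared deviations (m2_) is accumulated about the *current* mean,
// so no large intermediate quantities are ever subtracted.
class running_stats
{
public:
    running_stats() : n_(0), mean_(0), m2_(0) {}

    void operator()(double x)
    {
        ++n_;
        double delta = x - mean_;
        mean_ += delta / n_;
        m2_   += delta * (x - mean_);   // note: uses the updated mean
    }

    std::size_t count() const { return n_; }
    double mean() const       { return mean_; }

    // Population (N) variance.
    double variance() const          { return m2_ / n_; }
    // Unbiased (N-1) variance; requires count() > 1.
    double unbiased_variance() const { return m2_ / (n_ - 1); }

    double standard_deviation() const { return std::sqrt(variance()); }

private:
    std::size_t n_;
    double mean_, m2_;
};
```

A class along these lines would also provide the standard-deviation and unbiased-variance accessors I mention above, more or less for free.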
Each test is progressively harder than the previous one; the output is:

PI data:
Error in mean is: 0
Error in SD is: 3.09757e-016

Lottery data:
Error in mean is: 6.57202e-016
Error in SD is: 0

Accumulator 2 data:
Error in mean is: 9.25186e-015
Error in SD is: 2.71685e-012

Accumulator 3 data:
Error in mean is: 5.82076e-016
Error in SD is: 0.0717315

Accumulator 4 data:
Error in mean is: 9.87202e-015
Error in SD is: -1.#IND

As you can see, the calculated standard deviation gets progressively worse; in the final case the computed variance is actually -2, which is clearly nonsensical (and hence the NaN when using it to compute the standard deviation) :-(

I haven't tried to debug these cases: stepping through the accumulator code is not something I would recommend to anyone, frankly (having tried it previously).

No doubt other torture tests could be constructed for the mean - exponentially distributed data where you see one or two large values, followed by a large number of very tiny values, springs to mind - this is the sort of situation that the Kahan summation algorithm gets employed for.

I'm not sure what one should do about this. These cases certainly illustrate the kinds of trap that the unwary can fall into. It's particularly a problem for the "scientist in a rush" who may not be either a C++ or statistical expert (whether (s)he thinks they are or not!) but needs a library to quickly and reliably calculate some stats. Better documentation would certainly help, but guidance on - or provision of, perhaps by default - better algorithms might be welcome as well.

That's all for now.

Regards, John.