Review Request: Accumulators Framework and Statistical Accumulators Library

My framework for incremental statistical accumulation is ready for review. Here is the overview:
Boost.Accumulators is both a library for incremental statistical computation as well as an extensible framework for incremental calculation in general. The library deals primarily with the concept of an accumulator, which is a primitive computational entity that accepts data one sample at a time and maintains some internal state. These accumulators may offload some of their computations on other accumulators, on which they depend. Accumulators are grouped within an accumulator set. Boost.Accumulators resolves the inter-dependencies between accumulators in a set and ensures that accumulators are processed in the proper order.
The documentation is available online at:
http://boost-sandbox.sourceforge.net/libs/accumulators/doc/html/index.html
The download is in the Boost Vault in accumulators.zip at:
http://boost-consulting.com/vault/index.php?directory=Math%20-%20Numerics
This library is very much a complement to John Maddock's work on his statistical functions, which could be adapted to fit into this framework.
If anybody wants the docs in PDF format, please send me a request. It's a 10 Mb file, and the Vault won't take it.
-- Eric Niebler Boost Consulting www.boost-consulting.com

This library looks cool. I hope you don't mind some feedback, but I would suggest not calling the histogram "density". Why not call it "histogram"?
Chris
On 11/16/06, Eric Niebler <eric@boost-consulting.com> wrote:
My framework for incremental statistical accumulation is ready for review. Here is the overview:
<snip>

Chris Weed wrote:
On 11/16/06, Eric Niebler <eric@boost-consulting.com> wrote:
My framework for incremental statistical accumulation is ready for review. Here is the overview:
<snip>
This library looks cool. I hope you don't mind some feedback, but I would suggest not calling histogram, "density". Why not call it "histogram"? Chris
I don't have an answer for you right now. This library represents the work of several people, some of whom have a deep statistical background. I am not one of them -- this statistic was written by someone else. I'll make a note to make sure this question is answered during the review, if not before. Thanks for your interest. -- Eric Niebler Boost Consulting www.boost-consulting.com

On 11/16/06, Eric Niebler <eric@boost-consulting.com> wrote:
Chris Weed wrote:
<snip>
I don't have an answer for you right now. This library represents the work of several people, some of whom have a deep statistical background. I am not one of them -- this statistic was written by someone else. I'll make a note to make sure this question is answered during the review, if not before.
Thanks for your interest.
Just to follow up, I probably should have mentioned the reason for my comment. There are several ways to estimate a pdf, of which a histogram is one; another example is kernel density estimation. I think it would be nice to differentiate your estimator as a histogram by name.
Chris

Eric Niebler wrote:
My framework for incremental statistical accumulation is ready for review. Here is the overview:
:-) I'm very pleased to see this come up for review: I'll look forward to reviewing it!
If anybody wants the docs in PDF format, please send me a request. It's a 10 Mb file, and the Vault won't take it.
Glad to see I'm not the only one struggling with PDF generation! BTW, FOP-0.2 produces *much* smaller PDF files than FOP-0.9 (600K vs 10 Mb in my case), just in case you hadn't figured this out already :-)
John.

Thank you for a very interesting library, but I cannot compile it with Visual Studio 2003.

But I can not compile this with VisualStudio2003
I applied SP1 to Visual Studio 2003, and then successfully compiled the example code accumulators\libs\accumulators\example\example.vcproj. Thank you for a useful library.

I am very interested in this framework, so I have started to take a look. General accumulators are something I could make use of myself. I really like the conceptual design of the framework, and how it allows accumulators to be inter-dependent.

After a quick browse through the documentation I decided to take a look at the code. In particular I was interested in the numerics. I think the current implementation has some serious numerical weaknesses. I looked at two algorithms, 'sum' and 'variance':

In 'sum' I expected to see a compensated summation; this is numerically a lot better than just adding the numbers together.

The 'variance' accumulator has a lazy calculation of variance using the formula \sigma_n^2 = M_n^{(2)} - \mu_n^2. This formula is specifically cited for its poor performance in the presence of rounding error. Indeed, it may even return negative results.

Any chance of getting your statistics guys to take a look at the numerics of the solutions? If people were to use the library as is, they would be in for a nasty surprise!

All the best,
Michael
--
Michael Stevens Systems Engineering 34128 Kassel, Germany Phone/Fax: +49 561 5218038 Navigation Systems, Estimation and Bayesian Filtering http://bayesclasses.sf.net

Michael Stevens wrote:
I am very interested in this framework, so I have started to take a look.
General accumulators are something I could make use of myself. I really like the conceptual design of the framework, and how it allows accumulators to be inter-dependent.
Great! Glad you like it.
After a quick browse through the documentation I decided to take a look at the code. In particular I was interested in the numerics.
I think the current implementation has some serious numerical weaknesses. I looked at two algorithms, 'sum' and 'variance':
In 'sum' I expected to see a compensated summation; this is numerically a lot better than just adding the numbers together.
'sum' is one I implemented. I'm not surprised to learn there are better approaches. The framework allows for different implementation strategies for the statistics, though. Using the extensibility features, you can define your own "compensated_sum" accumulator and declare that it satisfies the "sum" feature (so that "compensated_sum" and "sum" are indistinguishable from the POV of dependency resolution), and even come up with clever syntax for it, like:

accumulator_set< double, features< sum(compensated) > > acc;

You might even try writing "compensated_sum" yourself and submitting it, just to see what happens. :-) The questions of what the default "sum" should do, and what alternate implementations should be provided, are open.
The 'variance' accumulator has a lazy calculation of variance using the formula \sigma_n^2 = M_n^{(2)} - \mu_n^2. This formula is specifically cited for its poor performance in the presence of rounding error. Indeed, it may even return negative results.
Any chance of getting your statistics guys to take a look at the numerics of the solutions? If people were to use the library as is, they would be in for a nasty surprise!
I'll forward this message off to the stats guys. This would certainly be a good issue to re-raise once the review starts. -- Eric Niebler Boost Consulting www.boost-consulting.com

On Monday, 20. November 2006 17:35, Eric Niebler wrote:
Michael Stevens wrote:
<snip>
In 'sum' I expected to see a compensated summation; this is numerically a lot better than just adding the numbers together.
'sum' is one I implemented. I'm not surprised to learn there are better approaches.
The framework allows for different implementation strategies for the statistics, though. Using the extensibility features, you can define your own "compensated_sum" accumulator and declare that it satisfies the "sum" feature (so that "compensated_sum" and "sum" are indistinguishable from the POV of dependency resolution), and even come up with clever syntax for it, like:
accumulator_set< double, features< sum(compensated) > > acc;
Cool! I will have to give it a try.
You might even try writing "compensated_sum" yourself and submitting it, just to see what happens. :-)
The questions of what the default "sum" should do, and what alternate implementations should be provided, are open.
I guess what people expect depends on their technical background!
All the best with the library,
Michael

On 20 Nov 2006, at 00:44, Michael Stevens wrote:
The 'variance' accumulator has a lazy calculation of variance using the formula \sigma_n^2 = M_n^{(2)} - \mu_n^2. This formula is specifically cited for its poor performance in the presence of rounding error. Indeed, it may even return negative results.
That's exactly why there is not only the lazy version but also an accurate one.

Matthias Troyer wrote:
<snip>
That's exactly why there is not only the lazy version but also an accurate one.
I wonder if there could also be an 'outlier rejecting' or robust version of these things like mean, variance, etc. Since one of its strengths is that it can take data as some sensor (or human) produces it, it would be useful if it could shout or discard if duff data arrives. Does this sound feasible/desirable?
Paul
---
Paul A Bristow Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB +44 1539561830 & SMS, Mobile +44 7714 330204 & SMS pbristow@hetp.u-net.com

I wonder if there could also be an 'outlier rejecting' or robust version of these things like mean, variance etc.
Let us consider the case where Accumulators is used in an annealing procedure. In this case, rounding error may cause serious problems, and 'outlier rejecting' would resolve these problems.

On Tuesday, 21. November 2006 18:49, Matthias Troyer wrote:
On 20 Nov 2006, at 00:44, Michael Stevens wrote:
The 'variance' accumulator has a lazy calculation of variance using the formula \sigma_n^2 = M_n^{(2)} - \mu_n^2. This formula is specifically cited for its poor performance in the presence of rounding error. Indeed, it may even return negative results.
That's exactly why there is not only the lazy version but also an accurate one.
Looking at variance.hpp, there is a 'lazy' version and an 'iterative' version. The naming confused me: I was expecting 'lazy' to have access to all the data and so be the most accurate. I guess my comments with regard to the lazy version are simply a matter of documenting the poor numerics.

The iterative formula is (n-1)/n * variance[n-1] + 1/(n-1) * (difference)^2. This is still numerically different from accumulating the squared differences iteratively and applying the 1/n to the result. I guess there is always a good reason to build one's own implementation to meet specific numerical requirements.

Thanks,
Michael

Hi Eric, I've received your request and will add your Accumulators library to the review queue.
Cheers,
ron
On Nov 16, 2006, at 5:46 PM, Eric Niebler wrote:
My framework for incremental statistical accumulation is ready for review. Here is the overview:
<snip>
participants (10)
- Chris Weed
- Eric Niebler
- Eric Niebler
- John Maddock
- Matthias Troyer
- Michael Stevens
- N H
- Niitsuma Hirotaka
- Paul A Bristow
- Ronald Garcia