Re: [boost] [Review] Review of the Accumulators library begins today Jan 29

AMDG John Phillips wrote:
The formal review of the Boost.Accumulators library, submitted by Eric Neibler begins today and ends on Wednesday, February 7. The library is available from
http://boost-consulting.com/vault/index.php?directory=Math%20-%20Numerics
From the documentation:
Boost.Accumulators is both a library for incremental statistical computation as well as an extensible framework for incremental calculation in general. The library deals primarily with the concept of an accumulator, which is a primitive computational entity that accepts data one sample at a time and maintains some internal state. These accumulators may offload some of their computations on other accumulators, on which they depend. Accumulators are grouped within an accumulator set. Boost.Accumulators resolves the inter-dependencies between accumulators in a set and ensures that accumulators are processed in the proper order.
Your comments may be brief or lengthy. If you identify problems along the way, please note if they are minor, serious, or showstoppers.
Here are some questions you might want to answer in your review: ? What is your evaluation of the design? ? What is your evaluation of the implementation? ? What is your evaluation of the documentation? ? What is your evaluation of the potential usefulness of the library? ? Did you try to use the library? With what compiler? Did you have any problems? ? How much effort did you put into your evaluation? A glance? A quick reading? In-depth study? ? Are you knowledgeable about the problem domain?
And finally, every review should answer this question: ? Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
All interested parties are encouraged to submit a review. These can be sent to the developer list, the user list, or if you don't want to share your review with the general public it can be sent directly to me.
Thanks to all for the work you will do on this review.
John Phillips Review Manager user_s_guide.html#id450311 line 363 accumulators.qbk line 189 for some reason external<> is not a link in the html
accum_set; accum_set(1.0); accum_set(2.0); accum_set.drop<tag::sum>(); accum_set(3.0); std::cout << mean(accum_set) << std::endl; //prints 1 } This behavior can easily cause surprises. It must be fixed to print 2. I'm not quite sure how to implement it correctly. I think that probably you need to store the result at the point that drop is called for external lookup and continue to collect data for internal usage. Of course,
accumulators.qbk lines 35-37 [def _mpl_ [@../../libs/mpl MPL]] [def _mpl_lambda_expression_ [@../../libs/mpl/doc/refmanual/lambda-expression.html MPL Lambda Expression]] [def _parameters_ [@../../libs/parameters Boost.Parameters]] these links are bad post_construct() Unless for some obscure reason that I am missing it is just too difficult, I would strongly prefer to make sure the constructors are called in the correct order. The whole dance with fusion::vector is unnecessary. It would be simpler to use struct accumulator_tuple_base { template<class Args> accumulator_tuple(const Args&) {} } template<class T, class Base> struct accumulator_tuple : T, Base { template<class Args> accumulator_tuple(const Args& args) : T(args), Base(args) {} }; Since every accumulator is put before all the accumulators that depend on it in make_accumulator_tuple you can write typedef typename mpl::reverse_fold< sorted_accumulator_vector , accumulator_tuple_base , accumulator_tuple<mpl::_2, mpl::_1> >::type type; Then every accumulator is constructed after any others that it depends on. As a bonus the only limit to the number of accumulators is whatever the compiler can handle. What happens if you freeze a feature that another feature depends on? #include <iostream> #include <boost/accumulators/accumulators.hpp> #include <boost/accumulators/statistics/mean.hpp> #include <boost/accumulators/statistics/sum.hpp> using namespace boost::accumulators; int main() { accumulator_set<double, features<tag::mean, droppable<tag::sum> this only applies in the face of dependencies. The user guide needs to specify what header each component is in. BOOST_PARAMETER_NESTED_KEYWORD.html There is no description of what this macro is for. accumulator_set.hpp line 149 corresponds to -- What? accumulators_fwd.hpp #define's FUSION_MAX_VECTOR_SIZE/FUSION_MAX_TUPLE_SIZE This is a very bad idea as it might make make the value smaller than it would otherwise have been. Please just check FUSION_MAX_VECTOR_SIZE and issue a #error directive if it is too small. For simple cases at least the library seems to cause little to no overhead with msvc compared to hard coding everything. The docs need to state clearly what happens when more than one accumulator maps to the same feature. Overall the users guide is quite good. I had no difficulty understanding it. The reference manual is pretty sketchy though. For the most part the interface is clean and easy to use. I definitely think Accumulators should be included in boost. In Christ, Steven Watanabe

Hi, Steven, Steven Watanabe wrote:
AMDG
Ad Majorem Dei Gloriam?
user_s_guide.html#id450311 line 363 accumulators.qbk line 189 for some reason external<> is not a link in the html
Ah, the link is broken. Thanks.
accumulators.qbk lines 35-37 [def _mpl_ [@../../libs/mpl MPL]] [def _mpl_lambda_expression_ [@../../libs/mpl/doc/refmanual/lambda-expression.html MPL Lambda Expression]] [def _parameters_ [@../../libs/parameters Boost.Parameters]] these links are bad
Thanks. I think I was overly-optimistic when I wrote that. The links would work if Accumulators were in it's rightful (a-hem) place in the Boost tree. ;-)
post_construct() Unless for some obscure reason that I am missing it is just too difficult, I would strongly prefer to make sure the constructors are called in the correct order.
The whole dance with fusion::vector is unnecessary. It would be simpler to use struct accumulator_tuple_base { template<class Args> accumulator_tuple(const Args&) {} } template<class T, class Base> struct accumulator_tuple : T, Base { template<class Args> accumulator_tuple(const Args& args) : T(args), Base(args) {} }; Since every accumulator is put before all the accumulators that depend on it in make_accumulator_tuple you can write typedef typename mpl::reverse_fold< sorted_accumulator_vector , accumulator_tuple_base , accumulator_tuple<mpl::_2, mpl::_1> >::type type; Then every accumulator is constructed after any others that it depends on. As a bonus the only limit to the number of accumulators is whatever the compiler can handle.
This would be a valid design choice, but it would involve reinventing a lot of Fusion. I use Fusion both to generate the accumulator set, to process the accumulators, and to find accumulators that match Fusion predicates. Code reuse is good. Regarding post_process(), I decided at the time, after discussing this with Joel de Guzman (of Fusion) and Dave Abrahams, that relying on the construction order of accumulators within a set would be a mistake. It would rule out size optimizations like compressed_pair, where empty accumulators are optimized away via the Empty Base Optimization. Fusion doesn't do this optimization today, but it may in the future.
What happens if you freeze a feature that another feature depends on? #include <iostream> #include <boost/accumulators/accumulators.hpp> #include <boost/accumulators/statistics/mean.hpp> #include <boost/accumulators/statistics/sum.hpp>
using namespace boost::accumulators;
accum_set; accum_set(1.0); accum_set(2.0); accum_set.drop<tag::sum>(); accum_set(3.0); std::cout << mean(accum_set) << std::endl; //prints 1 } This behavior can easily cause surprises. It must be fixed to print 2. I'm not quite sure how to implement it correctly. I think that probably you need to store the result at the point that drop is called for external lookup and continue to collect data for internal usage. Of course,
int main() { accumulator_set<double, features<tag::mean, droppable<tag::sum> this only applies in the face of dependencies.
The code is doing something a little unusual, but the result looks correct to me. What do you think the result should be, and why?
The user guide needs to specify what header each component is in.
Yes.
BOOST_PARAMETER_NESTED_KEYWORD.html There is no description of what this macro is for.
Thanks.
accumulator_set.hpp line 149 corresponds to -- What?
... a feature. Fixed.
accumulators_fwd.hpp #define's FUSION_MAX_VECTOR_SIZE/FUSION_MAX_TUPLE_SIZE This is a very bad idea as it might make make the value smaller than it would otherwise have been. Please just check FUSION_MAX_VECTOR_SIZE and issue a #error directive if it is too small.
I had it that way at first, but changed it due to user feedback. The idea is to make it so that you can just #define BOOST_ACCUMULATORS_MAX_FEATURES and not have to worry about configuring Fusion.
For simple cases at least the library seems to cause little to no overhead with msvc compared to hard coding everything.
Thanks for the report!
The docs need to state clearly what happens when more than one accumulator maps to the same feature.
Right. The first accumulator with a particular feature is the one that gets used. Any others with the same feature are ignored.
Overall the users guide is quite good. I had no difficulty understanding it. The reference manual is pretty sketchy though.
Good docs are always the hardest part.
For the most part the interface is clean and easy to use.
I definitely think Accumulators should be included in boost.
Thanks. -- Eric Niebler Boost Consulting www.boost-consulting.com

Hi Eric, On 30 Jan 2007, at 18:47, Eric Niebler wrote:
Steven Watanabe wrote:
accum_set; accum_set(1.0); accum_set(2.0); accum_set.drop<tag::sum>(); accum_set(3.0); std::cout << mean(accum_set) << std::endl; //prints 1 } This behavior can easily cause surprises. It must be fixed to print 2. I'm not quite sure how to implement it correctly. I think that probably you need to store the result at the point that drop is called for external lookup and continue to collect data for internal usage. Of course,
int main() { accumulator_set<double, features<tag::mean, droppable<tag::sum> this only applies in the face of dependencies.
The code is doing something a little unusual, but the result looks correct to me. What do you think the result should be, and why?
It should be 2. Note that if mean depends on sum, you cannot drop the accumulation of sum. If I remember our design discussions correctly then the "drop" would only drop the feature if it is not required for another non-dropped feature. Matthias

This library looks very interesting. One feature I would like to see but couldn't find in the documentation is the ability to combine accumulators. For example, accumulator_set<double, stats<tag::mean> > acc1, acc2; acc1(1.0); acc2(2.0); acc1(acc2); cout << "Mean of 1.0 and 2.0 is " << mean(acc1) << endl; // Prints 1.5. This would be particularly useful for parallel computing applications. Is this possible? Would it be difficult to add? —Ben

On Jan 31, 2007, at 6:21 AM, Ben FrantzDale wrote:
This library looks very interesting. One feature I would like to see but couldn't find in the documentation is the ability to combine accumulators. For example,
accumulator_set<double, stats<tag::mean> > acc1, acc2; acc1(1.0); acc2(2.0); acc1(acc2); cout << "Mean of 1.0 and 2.0 is " << mean(acc1) << endl; // Prints 1.5.
This would be particularly useful for parallel computing applications.
Is this possible? Would it be difficult to add?
Yes, this is definitely needed but I think it will be lots of work to add it, based on my own experience in implementing something like this in our ALPS library for parallel Monte Carlo simulations. I am planning to think about this sometime this year, but believe it is a substantial extension to the library and will not be easy to design right. The problem I see is that preparing an estimate from a sequence of values, and preparing an estimate from a number of estimates of independent accumulators requires different algorithms., e.g.: for count, adding a value increases the count by 1 but adding an accumulator by the value of the accumulator for sum, adding a value increases the sum by the value but adding an accumulator by the sum of the accumulator for max, adding a value compares the current max by the value but adding an accumulator compares current ,ax with the max of the accumulator These were easy, trickier are robust estimators of variance and moments, even trickier will be median estimators where I currently even do not see how this should be done without storing most of the time series. I would thus propose to develop something like accumulator_set<double, stats<tag::mean> > acc1, acc2; acc1(1.0); acc2(2.0); parallel_accumulator_set<double, stats<tag::mean> > acc; // any better name??? acc(acc1); acc(acc2); cout << "Mean of 1.0 and 2.0 is " << mean(acc) << endl; // Prints 1.5. I will need this within the next year and will work on it myself -- I hope that Eric can help me with design and implementation questions but I don't expect him to provide it for me (unless he wants to spend more of his free time on this :-) ?). Matthias

Matthias Troyer wrote:
This would be particularly useful for parallel computing applications.
Is this possible? Would it be difficult to add?
I agree that this would be highly desirable.
Yes, this is definitely needed but I think it will be lots of work to add it, based on my own experience in implementing something like this in our ALPS library for parallel Monte Carlo simulations. I am planning to think about this sometime this year, but believe it is a substantial extension to the library and will not be easy to design right. The problem I see is that preparing an estimate from a sequence of values, and preparing an estimate from a number of estimates of independent accumulators requires different algorithms., e.g.:
Right: and for some accumulators the addition of two accumulators may not be possible at all. But can't we make this a conceptual requirement that if you try and combine two accumulator sets, then each accumulator must have a "combine" method (otherwise you get a compile time error). The "combine" methods for min/max/sum/count etc are all trivial aren't they - and could be added fairly easily ? But Eric would know the details :-) John.

John Maddock wrote:
Matthias Troyer wrote:
This would be particularly useful for parallel computing applications.
Is this possible? Would it be difficult to add?
I agree that this would be highly desirable.
Yes, this is definitely needed but I think it will be lots of work to add it, based on my own experience in implementing something like this in our ALPS library for parallel Monte Carlo simulations. I am planning to think about this sometime this year, but believe it is a substantial extension to the library and will not be easy to design right. The problem I see is that preparing an estimate from a sequence of values, and preparing an estimate from a number of estimates of independent accumulators requires different algorithms., e.g.:
Right: and for some accumulators the addition of two accumulators may not be possible at all. But can't we make this a conceptual requirement that if you try and combine two accumulator sets, then each accumulator must have a "combine" method (otherwise you get a compile time error). The "combine" methods for min/max/sum/count etc are all trivial aren't they - and could be added fairly easily ? But Eric would know the details :-)
This seems like a workable approach to me. But I've learned to believe Matthias when he says something is hard. It usually means I've overlooked something. :-) -- Eric Niebler Boost Consulting www.boost-consulting.com
participants (5)
-
Ben FrantzDale
-
Eric Niebler
-
John Maddock
-
Matthias Troyer
-
Steven Watanabe