
This is a partial review of the Math Toolkit.
What is your evaluation of the design?
Very good: intuitive, easy to use, efficient and extensible. There is just one small detail of the design that might be worth discussing, though it seems to be primarily a matter of taste: the naming of the "estimate_xxx" functions and their placement as static methods in the distribution classes.
- When statisticians talk about estimating the parameters of a distribution, they usually mean fitting the distribution to a particular sample, for instance by maximum likelihood estimation (at least, that's my impression). Hence I find the naming of the static "estimate_" methods in the distribution classes a little unfortunate. Instead of estimate_degrees_of_freedom I would prefer find_minimum_sample_size; instead of estimate_lower_bound_on_p, maybe something like confidence_interval_lower_bound_for_p (which I also find more descriptive); and so on.
- Apart from estimate_alpha and estimate_beta for the beta distribution, the "estimate_" functions generally serve a purpose in the context of a statistical model, not just a particular distribution. Take for example the "estimate_degrees_of_freedom" method of the Student's t distribution. The statistical context in which one would need this function is an i.i.d. normally distributed sample whose mean and variance are estimated with the standard sample estimators. In principle one could come up with arbitrarily many statistical models and applicable "estimate" methods, and their placement in one of the classes might well be ambiguous.
- It might be preferable to keep these functions separate from the distribution classes, for example by making the "estimate_degrees_of_freedom" method of the t distribution a free function "find_minimum_sample_size_for_standard_t_test". Naming would obviously not be easy, but that is partly because the current names mainly reflect the parameter and the distribution involved in the computation, not the statistical task.
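To make the free-function suggestion concrete, here is a minimal sketch of what such an interface could look like. It is not the toolkit's code: the parameter list is my assumption, the helpers normal_cdf and normal_quantile are hypothetical, and the computation uses the large-sample normal approximation n = ((z_{1-alpha/2} + z_{1-beta}) * sd / difference)^2 in place of the t-based calculation, so it only illustrates naming and placement.

```cpp
#include <cassert>
#include <cmath>

// Standard normal CDF, written via the complementary error function.
double normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Standard normal quantile by plain bisection on normal_cdf.
// Hypothetical helper for this sketch only; slow but simple.
double normal_quantile(double p) {
    assert(p > 0.0 && p < 1.0);
    double lo = -40.0;
    double hi = 40.0;
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (normal_cdf(mid) < p) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return 0.5 * (lo + hi);
}

// Free function named after the statistical task, not after a distribution
// parameter. ASSUMPTION: two-sided test, i.i.d. normal sample, and the
// normal approximation to the t-based sample size.
//   difference : smallest shift of the mean that should be detected
//   alpha      : probability of a false rejection (type I error)
//   beta       : probability of missing the shift (type II error)
//   sd         : standard deviation of the observations
long find_minimum_sample_size_for_standard_t_test(double difference,
                                                  double alpha,
                                                  double beta,
                                                  double sd) {
    assert(difference > 0.0 && sd > 0.0);
    assert(alpha > 0.0 && alpha < 1.0 && beta > 0.0 && beta < 1.0);
    const double z = normal_quantile(1.0 - alpha / 2.0)
                   + normal_quantile(1.0 - beta);
    const double n = std::pow(z * sd / difference, 2.0);
    return static_cast<long>(std::ceil(n));
}
```

A call such as find_minimum_sample_size_for_standard_t_test(0.5, 0.05, 0.2, 1.0) then reads as the statistical question being asked, and the function does not have to be attached to any one distribution class.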
What is your evaluation of the implementation?
I did not evaluate the implementation, though the documentation suggests a very competent one. The documentation also makes clear that the authors' primary focus was on numerical accuracy, which is comforting to know.
What is your evaluation of the documentation?
Excellent: well written and very comprehensive. The discussion of the algorithms used (including references) and of the accuracy bounds from empirical tests is one of the best features. A nice complement to the statistical examples would be a table or listing that maps statistical tests and practices to their implementation with the toolkit. It could summarize the examples and go beyond them by serving as a reference for further tests. At the end of this email I list some minuscule issues in the introduction to the statistical part of the library.
Do you think the library should be accepted as a Boost library?
Yes. This is a library that fills an important gap and might be the beginning of an even more comprehensive math library within Boost.
Best regards,
Stephan
Some minor issues in the documentation:
- I find the second paragraph in the tip "Random numbers that approximate Quantiles of Distributions" in the "Statistical Distributions Overview" misleading. In general, the difference in purpose between Boost.Random and the Math Toolkit has little to do with accuracy.
- Could the tip "Random variates and distribution parameters" in the "Statistical Distributions Overview" be moved after the first example "f(k; n, p)"? [There's also a separate redundant section "Random Variate and Distribution Parameters". "Discrete Probability Distributions" is a duplicate, too.]
- Nitpicking: For which distribution may "Mathematically, the random variate [may] take an (+ or -) infinite value" (still in "Statistical Distributions Overview")? Certain functions like the CDF may still make sense at infinity, but usually variates (realizations of the variable) can't be infinite.
- At the beginning of the example "Calculating confidence intervals on the mean with the Students-t distribution", the i.i.d. normal assumption should be stated before the confidence interval is introduced. Similar comments apply to the other examples.
- In the example "Testing a sample mean for difference from "true" mean", the wording is a little lax from a statistical point of view. One doesn't "accept" the null hypothesis, one can only "not reject" it, just as one can't "reject" the alternative hypothesis. The output should be worded in terms of rejection or non-rejection of the null hypothesis in the two-sided or one-sided t-tests. Similar comments apply to the other examples.
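As an illustration of the wording I have in mind for the last point, here is a minimal sketch of a two-sided one-sample t-test whose output speaks only of rejecting or not rejecting the null hypothesis. The function names are hypothetical and not taken from the documentation's example. The critical value is passed in by the caller (for instance, about 2.776 for 4 degrees of freedom at the two-sided 5% level), the sample is assumed to be i.i.d. normal, and it must not be constant.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// One-sample t statistic: (mean - mu0) / (s / sqrt(n)), where s is the
// sample standard deviation with divisor n - 1.
// ASSUMPTION: i.i.d. normal sample with at least two distinct values.
double one_sample_t_statistic(const std::vector<double>& sample, double mu0) {
    assert(sample.size() >= 2);
    const double n = static_cast<double>(sample.size());
    double sum = 0.0;
    for (double x : sample) {
        sum += x;
    }
    const double mean = sum / n;
    double squared_deviations = 0.0;
    for (double x : sample) {
        squared_deviations += (x - mean) * (x - mean);
    }
    const double sd = std::sqrt(squared_deviations / (n - 1.0));
    return (mean - mu0) / (sd / std::sqrt(n));
}

// Conclusion of a two-sided test. The null hypothesis is either rejected
// or not rejected; it is never reported as "accepted".
std::string two_sided_conclusion(double t, double critical_value) {
    assert(critical_value > 0.0);
    if (std::fabs(t) > critical_value) {
        return "Reject the null hypothesis";
    }
    return "Do not reject the null hypothesis";
}
```

For the sample {1, 2, 3, 4, 5} and a hypothesized mean of 3, the statistic is 0 and the conclusion is "Do not reject the null hypothesis", which makes no claim that the hypothesized mean is the true one.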