
At 05:11 AM 7/12/2006, you wrote:
T distribution(T x) const; // Probability Density Function or pdf or p
T cumulative_probability(T x) const; // Cumulative Distribution Function. P
cumulative_probability is too long :-(
Do we REALLY need the cumulative here?
T probability(T x) const; // Cumulative Distribution Function or cdf or P
Sorry, as attractive as it seems at first blush, I think just "probability" is a very poor choice. A very common confusion in statistics is that people think of the value of the PDF as a probability -- even though it is not (hence the "D" for density). Even sophisticated people slip into thinking of it that way (after all, it *does* represent the probability of an event for discrete distributions). I think that people are much too likely to get confused and think that "probability" means the PDF.

Even without that confusion, there is a legitimate ambiguity in the term: which probability? Note for example that in traditional statistical hypothesis testing, the "p-value" (very roughly speaking, the probability of falsely rejecting the null hypothesis given the assumption that the null hypothesis is true) is the complementary CDF for a 1-tailed test and twice the complementary CDF for most 2-tailed tests.

I don't have as much objection to using "distribution" for the PDF, but the nit-picker in me is a bit uncomfortable with it. A distribution is not a function, but to the extent that it can be identified with a particular function, that function is the CDF, not the PDF (or the MGF -- the Moment Generating Function -- but let's not even go there). This is because the CDF is always defined for a distribution and the PDF (technically defined as the derivative of the CDF) may not be. Being slightly less pedantic, the *object* is the distribution, not the value of the function.

I realize these are all pretty fine distinctions, but I would be much more comfortable if the naming didn't actively mislead about the technical fine points.
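To make the two naming hazards concrete, here is a minimal sketch for a normal distribution. All of the function names are made up for illustration and are not a proposal for the interface. It shows a density value that exceeds 1 (so it cannot be a probability), and the p-value computed from the complementary CDF.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical free functions for illustration only; not the proposed interface.
const double kPi = 3.14159265358979323846;

// Density of a normal distribution with mean mu and standard deviation sigma.
// The result is a density, not a probability, and may be larger than 1.
double normal_pdf(double x, double mu, double sigma) {
    assert(sigma > 0.0);
    const double z = (x - mu) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * kPi));
}

// Complementary CDF, P(X > x), computed directly with erfc rather than as 1 - CDF.
double normal_ccdf(double x, double mu, double sigma) {
    assert(sigma > 0.0);
    return 0.5 * std::erfc((x - mu) / (sigma * std::sqrt(2.0)));
}

// p-value of a standardized test statistic z: the complementary CDF
// for a 1-tailed test, twice the complementary CDF for a 2-tailed test.
double p_value_one_tailed(double z) { return normal_ccdf(z, 0.0, 1.0); }
double p_value_two_tailed(double z) {
    return 2.0 * normal_ccdf(std::fabs(z), 0.0, 1.0);
}
```

With sigma = 0.1 the density at the mean is about 3.99, which is one way to see why calling that value a "probability" invites confusion.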
John Maddock has been muttering about using Boost.Interval with these functions. It's on his TODO list allegedly ;-)
Would this help with the "CDF(x[ub]) - CDF(x[lb])"?
An interesting suggestion. Passing a single value to the function would give the CDF from -Infinity. Passing an interval would integrate over that interval. The problem is that, as I understand it, Boost.Interval objects represent Interval Arithmetic intervals -- i.e., computational error bounds around an unknown correct value. Using them to represent a more general range of reals violates their semantics. I would expect the result of passing an interval parameter to a CDF function to be an interval (easily implemented for the CDF since it's a non-decreasing function, but potentially trickier for the PDF), not a single value. Using a pair of T or something similar makes more sense, but it seems to me that the constructor verbiage is a bit top heavy.
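A rough sketch of the interval-in, interval-out behaviour I would expect. The Bounds struct is a made-up stand-in, not Boost.Interval, and this ignores the outward rounding that a real interval arithmetic type would apply.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for an interval-arithmetic value:
// error bounds around a single unknown number.
struct Bounds {
    double lo;
    double hi;
};

// Standard normal CDF for an exactly known argument.
double normal_cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// The CDF is non-decreasing, so the image of [lo, hi] is [CDF(lo), CDF(hi)].
// The result is again a pair of bounds, not the probability of landing in [lo, hi].
// Assumption: rounding of the endpoints is ignored in this sketch.
Bounds normal_cdf(Bounds x) {
    assert(x.lo <= x.hi);
    return Bounds{normal_cdf(x.lo), normal_cdf(x.hi)};
}
```

The PDF is not monotone, so the same endpoint trick would not work there; the bounds would have to account for the mode falling inside the interval.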
And/or allow one to produce "PDF((x[ub]+x[lb])/2)*(x[ub]-x[lb])" using the density/mass/distribution?
I would say using a range (but not an Interval) with the PDF does feel a bit cleaner than with the CDF. Then a single value would produce the PDF, a range from -Infinity would produce the same value as the CDF, and a range to Infinity would produce the same value as the complementary CDF. Having to construct the range would still seem unnecessary cruft. Just allow either one-argument or two-argument forms (despite the "defaulted" parameter being the wrong one). I'd almost give up my objections to calling that function "distribution".

Of course I would not suggest blindly using that little approximation I threw out. I just included it to make it clear that the value could be distinctly different from 0 even when computing the difference explicitly would lead to severe round-off problems. That formula can be seen either as a zero-order numerical integration or as the first term of the difference of the CDF's Taylor series expanded about the midpoint. Except for very small intervals you would want to add more terms either way. The Taylor series improves rapidly -- specifically quadratically (the next term is the second derivative of the PDF times the cube of the interval width divided by 24).

You might run into some grey areas, though: regions where using the difference would produce unacceptable round-off loss but the width is too large for effective use of small-interval approximations. As I said, for the first release, I'd just implement it using the difference of the CDFs and worry about improving it later.

Topher
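For concreteness, a minimal sketch of the three ways of getting the probability of a range, written for the standard normal only. The function names are invented for illustration, and the correction term relies on the fact that for the standard normal the second derivative of the PDF is (x*x - 1) * pdf(x).

```cpp
#include <cassert>
#include <cmath>

// All names here are hypothetical, for illustration only.
const double kSqrt2Pi = 2.5066282746310002;  // sqrt(2 * pi)

// Standard normal density.
double pdf(double x) { return std::exp(-0.5 * x * x) / kSqrt2Pi; }

// Standard normal CDF, P(X <= x).
double cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Complementary CDF, P(X > x); stays accurate in the upper tail.
double ccdf(double x) { return 0.5 * std::erfc(x / std::sqrt(2.0)); }

// The "first release" approach: plain difference of CDFs.
double prob_by_cdf_difference(double lb, double ub) {
    assert(lb <= ub);
    return cdf(ub) - cdf(lb);
}

// Zero-order approximation: PDF at the midpoint times the width.
double prob_by_midpoint(double lb, double ub) {
    assert(lb <= ub);
    return pdf(0.5 * (lb + ub)) * (ub - lb);
}

// Midpoint value plus the next Taylor term, pdf''(m) * h^3 / 24.
double prob_by_midpoint_corrected(double lb, double ub) {
    assert(lb <= ub);
    const double m = 0.5 * (lb + ub);
    const double h = ub - lb;
    const double second_derivative = (m * m - 1.0) * pdf(m);
    return pdf(m) * h + second_derivative * h * h * h / 24.0;
}
```

For a narrow range far out in the upper tail, such as 8.0 to 8.001, both CDF values round to nearly 1 and their difference carries no correct digits, while the midpoint forms stay close to the difference of the complementary CDFs.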