Re: [boost] [math/staticstics/design] How best to name statistical functions?

12 Jul 2006

      At 05:11 AM 7/12/2006, you wrote:
...
T distribution(T x) const; // Probability Density Function or pdf or p
T cumulative_probability(T x) const; // Cumulative Distribution Function.  P
cumulative_probability is too long :-(
Do we REALLY need the cumulative here?
T probability(T x) const; // Cumulative Distribution Function or cdf or P
Sorry, as attractive as it seems at first blush, I think just 
"probability" is a very poor choice.  A very common confusion in 
statistics is that people think of the value of the PDF as a 
probability -- even though it is not (hence the "D" for 
density).  Even sophisticated people slip into thinking of it that 
way (after all, it *does* represent the probability of an event for 
discrete distributions).  I think that people are much too likely to 
get confused and think that probability means the PDF.  Even without 
that confusion, there is a legitimate ambiguity for the term: Which 
probability?  Note for example that in traditional statistical 
hypothesis testing, the "p-value" (very roughly speaking, the 
probability of falsely rejecting the null hypothesis given the 
assumption that the null hypothesis is true) is the complementary CDF 
for a 1-tailed test and twice the complementary CDF for most 2-tailed tests.
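
To make the p-value point concrete, here is a minimal sketch, assuming a test statistic that follows a standard normal distribution under the null hypothesis. The function names are hypothetical and are not part of any interface proposed in this thread; the only library call used is std::erfc from <cmath>.

```cpp
#include <cassert>
#include <cmath>

// CDF of the standard normal distribution, written via the complementary
// error function.
double normal_cdf(double z) {
    return 0.5 * std::erfc(-z / std::sqrt(2.0));
}

// Complementary CDF, 1 - CDF(z), computed directly rather than by
// subtraction so that small upper-tail values keep their precision.
double normal_ccdf(double z) {
    return 0.5 * std::erfc(z / std::sqrt(2.0));
}

// Upper 1-tailed p-value for an observed statistic z (hypothetical name).
double p_value_one_tailed(double z) {
    return normal_ccdf(z);
}

// 2-tailed p-value for a symmetric null distribution (hypothetical name):
// twice the complementary CDF at |z|.
double p_value_two_tailed(double z) {
    return 2.0 * normal_ccdf(std::fabs(z));
}
```

Neither of these is "the" probability associated with the distribution, which is part of why a bare "probability" member is ambiguous.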

I don't have as much objection to using "distribution" for the PDF, 
but the nit-picker in me is a bit uncomfortable with it.  A 
distribution is not a function, but to the extent that it can be 
identified with a particular function it's the CDF not the PDF (or 
the MGF -- the Moment Generating Function -- but let's not even go 
there).  This is because the CDF is always defined for a distribution 
and the PDF (technically defined as the derivative of the CDF) may 
not be.  Being slightly less pedantic, the *object* is the 
distribution, not the value of the function.  I realize these are all 
pretty fine distinctions, but I would be much more comfortable if the 
naming didn't actively mislead about the technical fine points.
...
John Maddock has been muttering about using Boost.Interval with these
functions.
It's on his TODO list allegedly ;-)
Would this help with the "CDF(x[ub]) - CDF(x[lb])"?
An interesting suggestion.  Passing a single value to the function 
would give the CDF from -Infinity.  Passing an interval would 
integrate over that interval.  The problem is that, as I understand 
it, Boost.Interval objects represent Interval Arithmetic intervals -- 
i.e., computational error bounds around an unknown correct 
value.  Using them to represent a more general range of reals 
violates their semantics.  I would expect the result of passing an 
interval parameter to a CDF function to be an interval (easily 
implemented for the CDF since it's a non-decreasing function, but 
potentially trickier for the PDF) not a single value.  Using a pair 
of T or something similar makes more sense, but it seems to me that 
the constructor verbiage is a bit top heavy.
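
The interval-in, interval-out behaviour described above can be sketched as follows. This deliberately does not use Boost.Interval; ErrorBounds is a hypothetical stand-in for an interval-arithmetic type, and the standard normal CDF is just an example. A real interval library would also round the endpoints outward, which this sketch ignores.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for an interval-arithmetic value: the true input is
// only known to lie somewhere in [lo, hi].
struct ErrorBounds {
    double lo;
    double hi;
};

// Standard normal CDF for an exactly known argument.
double normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// For an uncertain argument the result is itself a pair of bounds. Because
// the CDF is non-decreasing, mapping the two endpoints is enough.
ErrorBounds normal_cdf(ErrorBounds x) {
    return ErrorBounds{normal_cdf(x.lo), normal_cdf(x.hi)};
}
```

The result bounds the CDF of one unknown point; it is not the probability mass between lo and hi, which is the semantic mismatch noted above.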
...
And/or allow one to produce "PDF((x[ub]+x[lb])/2)*(x[ub]-x[lb])"
using the density/mass/distribution?
I would say using a range (but not an Interval) with the PDF does 
feel a bit cleaner than with the CDF.  Then a single value would 
produce the PDF, a range from -Infinity would produce the same value 
as the CDF, a range to Infinity would produce the same value as the 
complementary CDF. Having to construct the range still would seem 
unnecessary cruft.  Just allow either one argument or two argument 
forms (despite the "defaulted" parameter being the wrong one).  I'd 
almost give up my objections to calling that function "distribution".
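
A minimal sketch of the one-argument and two-argument forms, assuming a standard normal distribution and using overloads rather than a defaulted parameter. The class name and layout are hypothetical; only the member name "distribution" comes from the discussion above. The two-argument form uses the plain difference of CDFs.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical distribution object, for illustration only.
class standard_normal {
public:
    // One argument: the density at x.
    double distribution(double x) const {
        const double inv_sqrt_2pi = 0.3989422804014327;
        return inv_sqrt_2pi * std::exp(-0.5 * x * x);
    }

    // Two arguments: the probability of the range [lb, ub], computed here
    // as a simple difference of CDF values.
    double distribution(double lb, double ub) const {
        return cdf(ub) - cdf(lb);
    }

private:
    double cdf(double x) const {
        return 0.5 * std::erfc(-x / std::sqrt(2.0));
    }
};
```

With lb set to -infinity the two-argument form returns the CDF at ub, and with ub set to +infinity it returns the complementary CDF at lb, since std::erfc evaluates to 0 and 2 at +infinity and -infinity respectively.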

Of course I would not suggest blindly using that little approximation 
I threw out.  I just included it to make it clear that the value 
could be distinctly different from 0 even when computing the 
difference explicitly would lead to severe round-off problems.
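
The round-off problem can be illustrated with a short sketch, assuming a standard normal distribution and double precision. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Standard normal density.
double normal_pdf(double x) {
    return 0.3989422804014327 * std::exp(-0.5 * x * x);
}

// Standard normal CDF.
double normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Probability of [lb, ub] as an explicit difference of CDF values.
double range_by_difference(double lb, double ub) {
    return normal_cdf(ub) - normal_cdf(lb);
}

// Probability of [lb, ub] approximated as density at the midpoint times
// the width of the range.
double range_by_midpoint(double lb, double ub) {
    return normal_pdf(0.5 * (lb + ub)) * (ub - lb);
}
```

For lb = 10 and ub = 10.001 the midpoint form gives roughly 7.7e-26. Both CDF values differ from 1 by less than 1e-23, far below what a double can resolve near 1, so on a typical implementation the explicit difference comes out as 0.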

That formula can be seen as either a zero-order numerical integration 
or the first term of the difference of the Taylor series expansions 
about the midpoint.  Except for very small intervals you would 
want to add more terms either way.  The Taylor series improves 
rapidly -- specifically quadratically (the next term is the second 
derivative of the PDF times the cube of the interval width divided by 24).
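
For reference, the expansion behind that statement can be written out as follows. The symbols are assumptions introduced here: F is the CDF, f = F' is the PDF, a and b are the lower and upper bounds, m = (a+b)/2 is the midpoint and h = b - a is the width.

```latex
\begin{align*}
F\!\left(m \pm \tfrac{h}{2}\right)
  &= F(m) \pm f(m)\,\frac{h}{2} + f'(m)\,\frac{h^{2}}{8}
     \pm f''(m)\,\frac{h^{3}}{48} + O(h^{4}) \\
F(b) - F(a)
  &= f(m)\,h + f''(m)\,\frac{h^{3}}{24} + O(h^{5})
\end{align*}
```

The even-order terms cancel in the subtraction, which is why the first correction to the midpoint formula is cubic in the width.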

You might run into some grey areas, though: regions where using the 
difference would produce unacceptable roundoff loss but the width is 
too large for effective use of small interval approximations.

As I said, for the first release, I'd just implement it using the 
difference of the CDFs then worry about improving it later.

Topher


Topher Cooper