
At 11:02 AM 7/11/2006, Paul A Bristow wrote:
| So let's use the Students T distribution as an example. The Students T
| distribution is a *family* of 1-dimensional distributions that depend
| on a single parameter, called "degrees of freedom".
Does the word *family* imply integral degrees of freedom? Numerically, and perhaps conceptually, it isn't - it's a continuous real. So could one also regard it as a two-parameter function f(t, v)? However, I don't think this matters here.
No, a "family of distributions" does not imply that the parameters are integral. What is frequently referred to as *the* normal distribution is also a family, parameterized by the mean and standard deviation. Transformation between members of the family is so easy that we generally transform everything into and from one member of the family, the "standard normal" distribution.

Keep in mind that a distribution is not a function, although it is associated with several functions or function-like entities. Standard usage is to consider the distributions in the family to be indexed by parameters, and therefore the associated functions to be indexed, single-parameter functions. There isn't much difference mathematically, though, between p[mu, sigma](x) and p(mu, sigma, x) (even when the indexes *are* integral), and sometimes it is useful to reframe them in that way. The point is that that is a reframing, and the standard (no, I am not imagining that it is standard) usage is to treat single-dimensional distributions as being single-dimensional.
| Given a value, say, D, for the degrees of freedom, you get a density
| function p_D and integrating it gives you the cumulative density
| function P_D.
What about the Qs? (complements)
| As I mentioned before, these should be member functions, which could
| be called "density" (also called 'mass') and "cumulative".
OTOH, many books don't mention either of these words!
But I would be very, very surprised to find many serious statistics books written in English that don't.
The whole nomenclature seems a massive muddle, with mathematicians, statisticians, and users of all sorts using different terms, and everyone thinking theirs is the 'Standard' :-(
Some variation exists due to the interdisciplinary origin and continuing interdisciplinary nature of the field, but most of the terminology is pretty standard, with some enclaves of specialized usage.
And the highest priority in my book is the END USERS, not the professionals.
Exactly -- the professionals are aware of the non-standard usage. Let's give the end users a chance of being able to use what they learned in their high school stat class.
| The cumulative density function is a strictly increasing function and
| therefore can be inverted. The inverse function could be called
| "inverse_cumulative", which is a completely unambiguous name.
But excessively long :-(
| I would say that these three member functions should be common to all
| implemented distributions. Other common member functions might include
| "mean", "variance", and possibly others.
Median, mode, variance, skewness, and kurtosis are commonly given, for example:
Skewness and kurtosis are generally defined but rarely used for distributions. Their computation on small or even moderate samples tends to be rather unstable, so comparison to the ideal distributions isn't terribly useful. I wouldn't bother with them. Mode is not uniquely defined for many distributions, nor is it that commonly used in practice for unimodal distributions (even if the references give a formula). Except for some specialized uses, these are more useful for theory than for computation -- more algebraic than numerical.

There are a lot of other possible associated functions, such as general quantiles or various confidence intervals, but I don't think many of them have general enough use to bother with for all distributions. People who need them could use the distribution as a template parameter. The only exception I would suggest would be to include the convenience of the standard deviation as well as the variance. One might stick an RNG in here, but that is redundant at this point.

As to naming of the probability functions: my personal preference would be to use what are probably the most common abbreviations for the basic functions. They are simple, compact, and standard. Maybe a little obscure for those who only took statistics in high school or some who only know cookbook statistics -- but that is what documentation is for. The ignorant are, after all, ignorant whatever choice is made, but you can do something about it by using the standard terms:

dist.pdf(x)   -- Probability Density Function; this is what looks like
                 a "bell shaped curve" for a normal distribution, for
                 example. A.k.a. "p".
dist.cdf(x)   -- Cumulative Distribution Function. P.
dist.ccdf(x)  -- Complementary Cumulative Distribution Function;
                 ccdf(x) = 1 - cdf(x).
dist.icdf(p)  -- Inverse Cumulative Distribution Function: P';
                 icdf(cdf(x)) = x and vice versa.
dist.iccdf(p) -- Inverse Complementary Cumulative Distribution Function;
                 iccdf(p) = icdf(1 - p); iccdf(ccdf(x)) = x.

Topher