Re: [boost] [math/staticstics/design] How best to name statistical functions?

12 Jul 2006

|  -----Original Message-----
|  From: boost-bounces@lists.boost.org 
|  [mailto:boost-bounces@lists.boost.org] On Behalf Of Topher Cooper
|  Sent: 11 July 2006 17:32
|  To: boost@lists.boost.org
|  Subject: Re: [boost] [math/staticstics/design] How best to 
|  name statistical functions?
|  
|  At 11:02 AM 7/11/2006, Paul A Bristow wrote:
|  
|  
|  >|  So let's use the Students T distribution as an example. The
|  >|  Students T
|  >|  distribution is a *family* of 1-dimensional distributions
|  >|  that depend on a single parameter, called "degrees of freedom".
|  >
|  >Does the word *family* imply integral degrees of freedom?
|  
|  No, a "family of distributions" does not imply that the parameters 
|  are integral.  What is frequently referred to as *the* normal 
|  distribution is also a family parameterized by the mean and standard 
|  deviation.  Transformation between members of the family is so easy 
|  that we generally transform everything into and from one member of 
|  the family, the "standard normal" distribution.
|  
|  Keep in mind that a distribution is not a function, although it is 
|  associated with several functions or function-like entities.
|  
|  Standard usage is to consider the distributions in the family to be 
|  indexed by parameters and therefore the associated functions to be 
|  indexed, single parameter functions.  There isn't much difference 
|  mathematically, though, between p[mu, sigma](x) and p(mu, sigma, x) 
|  (even when the indexes *are* integral), and sometimes it is useful 
|  to reframe them in that way.  The point is, that is a reframing, 
|  and the standard (no, I am not imagining that it is standard) usage 
|  is to treat single-dimensional distributions as being 
|  single-dimensional.
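
The two framings in the quoted paragraph, p[mu, sigma](x) and
p(mu, sigma, x), can be put side by side in a minimal C++ sketch. The
names normal_density and normal_member are hypothetical and chosen only
for illustration; they are not taken from any existing or proposed
library. The normal family is used because the quoted text mentions it.

```cpp
#include <cassert>
#include <cmath>

// Three-argument framing, p(mu, sigma, x): the parameters are ordinary
// arguments. Hypothetical name, for illustration only.
double normal_density(double mu, double sigma, double x) {
    assert(sigma > 0.0);
    const double pi = std::acos(-1.0);
    // Transform x to the "standard normal" member of the family.
    const double z = (x - mu) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
}

// Indexed framing, p[mu, sigma](x): the parameters select one member of
// the family, and that member exposes a single-argument function.
struct normal_member {
    double mu;
    double sigma;
    double density(double x) const { return normal_density(mu, sigma, x); }
};
```

Both framings return the same number; the difference is only whether
the parameters are fixed once, when the object is made, or passed on
every call.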

Thanks, I think I understand better now.

|  >And the highest priority in my book is the END USERS,
|  >not the professionals.
|  
|  Exactly -- the professionals are aware of the non-standard 
|  usage.  Let's give the end users a chance of being able to use what 
|  they learned in their high school stat class.

My main objective :-))

|  >|  ... Other common member functions might include
|  >|  "mean", "variance", and possibly others.
|  >
|  >Median, mode, variance, skewness, kurtosis are commonly given, 
|  >for example:
|  >
|  >http://en.wikipedia.org/wiki/Student%27s_t
|  
|  Skewness and kurtosis are generally defined but rarely used for 
|  distributions.  Their computation on small or even moderate samples 
|  tends to be rather unstable, so comparison to the ideal distributions 
|  isn't terribly useful. I wouldn't bother with them.  Mode is not 
|  uniquely defined for many distributions, nor is it that commonly used 
|  (even if the references give a formula) in practice for unimodal 
|  distributions.  Except for some specialized uses, these are more 
|  useful for theory than for computation -- more algebraic than 
|  numerical.
|  
|  There are a lot of other possible associated functions, such as 
|  general quantiles or various confidence intervals, but I don't think 
|  many of them have general enough use to bother with for all 
|  distributions.  People who need it could use the distribution as a 
|  template parameter.  The only exception I would suggest would be to 
|  include the convenience of the standard deviation as well as the 
|  variance.  One might stick in RNG here but that is redundant at 
|  this point.

|  As to naming of the probability functions:
|  
|  My personal preference would be to use what is probably the most 
|  common abbreviations for the basic functions.  They are simple, 
|  compact and standard.  Maybe a little obscure for those who only took 
|  statistics in high school or some who only know cookbook statistics 
|  -- but that is what documentation is for.  The ignorant are after all 
|  ignorant whatever choice is made, but you can do something about it 
|  by using the standard terms:
|  
|  dist.pdf(x)   -- Probability Density Function; this is what looks 
|                   like a "bell shaped curve" for a normal 
|                   distribution, for example.  A.k.a. "p"
|  dist.cdf(x)   -- Cumulative Distribution Function.  P
|  dist.ccdf(x)  -- Complementary Cumulative Distribution Function; 
|                   ccdf(x) = 1 - cdf(x)
|  dist.icdf(p)  -- Inverse Cumulative Distribution Function: P'; 
|                   icdf(cdf(x)) = x and vice versa
|  dist.iccdf(p) -- Inverse Complementary Cumulative Distribution 
|                   Function; iccdf(p) = icdf(1-p); iccdf(ccdf(x)) = x
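
The identities in the quoted list can be checked with a minimal C++
sketch. Everything in it is an assumption made for illustration: the
struct name normal_sketch is hypothetical, a normal distribution stands
in for "dist", and the inverse is found by plain bisection rather than
by whatever method a real library would use.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical illustration only; not the interface of any Boost library.
struct normal_sketch {
    double mu;
    double sigma;

    // Probability Density Function ("p").
    double pdf(double x) const {
        const double pi = std::acos(-1.0);
        const double z = (x - mu) / sigma;
        return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
    }
    // Cumulative Distribution Function ("P").
    double cdf(double x) const {
        return 0.5 * std::erfc(-(x - mu) / (sigma * std::sqrt(2.0)));
    }
    // Complementary CDF, computed directly rather than as 1 - cdf(x)
    // so that small upper-tail values keep their precision.
    double ccdf(double x) const {
        return 0.5 * std::erfc((x - mu) / (sigma * std::sqrt(2.0)));
    }
    // Inverse CDF by bisection on the monotonic cdf (assumed method).
    double icdf(double p) const {
        assert(p > 0.0 && p < 1.0);
        double lo = mu - 40.0 * sigma;
        double hi = mu + 40.0 * sigma;
        for (int i = 0; i < 200; ++i) {
            const double mid = 0.5 * (lo + hi);
            if (cdf(mid) < p) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }
    // Inverse complementary CDF: iccdf(p) = icdf(1 - p).
    double iccdf(double p) const { return icdf(1.0 - p); }
};
```

With normal_sketch d{0.0, 1.0}, d.cdf(x) + d.ccdf(x) comes out as 1 and
d.icdf(d.cdf(x)) returns x, both up to rounding.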

My instinct is that these are too abbreviated, despite their logicalness.

But this is the key problem - being clear, not curt, and yet concise.

students_t.inverse_complement_cumulative_probability certainly fails! ;-))

So we are getting to:

template <class T> // T an integral or real or floating-point type.

     T distribution(T x) const; // Probability Density Function or pdf or p
     T cumulative_probability(T x) const; // Cumulative Distribution Function.  P

cumulative_probability is too long :-(

Do we REALLY need the cumulative here?  

     T probability(T x) const; // Cumulative Distribution Function or cdf or P

     T quantile(T probability) const; // Also known as Inverse Cumulative Distribution Function

what do we call

     T complementary_cumulative_probability(T x) const; // Complementary Cumulative Distribution Function.  Q

??? :-((

and, worse, what about the Inverse Complementary Cumulative Distribution
Function?

complementary_quantile??? :-((

and the ad hoc 'extras':

     static T degrees_of_freedom(T quantile, T probability);

So I feel we haven't QUITE got there yet.

But many thanks for your help so far.

Paul

---
Paul A Bristow
Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB
+44 1539561830 & SMS, Mobile +44 7714 330204 & SMS
pbristow@hetp.u-net.com

PS Since everybody obviously knows far more about stats than I do, can you
also suggest fully worked examples that can be used to demonstrate usage in
a tutorial?  I'm especially keen to show how superior using this would be to
the traditional tables and fixed 95% confidence limits.
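
One possible shape for such a worked example, as a minimal sketch only:
it computes a two-sided critical value for any confidence level instead
of looking up a fixed 95% entry in a table. To stay self-contained it
assumes the standard normal distribution rather than Student's t, and
it inverts the cumulative probability by bisection. The function names
probability and quantile follow the names floated above;
two_sided_critical_value is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Cumulative probability (cdf) of the standard normal distribution.
double probability(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Quantile (inverse cdf), found by bisection on probability().
double quantile(double p) {
    assert(p > 0.0 && p < 1.0);
    double lo = -40.0;
    double hi = 40.0;
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (probability(mid) < p) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

// Two-sided critical value for any confidence level in (0, 1):
// the point that leaves alpha / 2 in each tail.
double two_sided_critical_value(double confidence) {
    assert(confidence > 0.0 && confidence < 1.0);
    const double alpha = 1.0 - confidence;
    return quantile(1.0 - alpha / 2.0);
}
```

two_sided_critical_value(0.95) gives about 1.96, the familiar table
value, and two_sided_critical_value(0.99) gives about 2.58; any other
level is obtained the same way, with no table involved.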