[math/statistics/design] How best to name statistical functions?

Paul Bristow has been toiling away producing some statistical functions on top of some of my Math special functions, and we've encountered a bit of a naming dilemma that I hope the ever resourceful Boosters can solve for us :-)

For a given cumulative distribution function (I'm going to use the Student's t function as an example below) there are two (or maybe three) variations:

P: this is the regular cumulative distribution function, and is a rising function in its argument (rises from 0 to 1).
Q: this is 1-P and is also known as the complement of the cumulative distribution function. It falls from 1 to 0 over the range of its argument.
A: this is less well used and is P-Q or 1-2Q depending upon your point of view.

Naming scheme 1:
~~~~~~~~~~~~~~~~

We have the reasonably obvious:

students_t(df,x) : calculates P
students_t_c(df,x) : calculates Q

However that varies slightly from the existing practice of erf/erfc, which if followed here would lead to:

students_t(df,x) : calculates P
students_tc(df,x) : calculates Q

but the lack of the underscore doesn't look right to me.

Naming scheme 2:
~~~~~~~~~~~~~~~~

How about we call a spade a spade and use:

students_t_P(df,x) : calculates P
students_t_Q(df,x) : calculates Q

Not pretty, but the P and Q notations are universally used in the literature, and of course we could handle the A case as well if that was felt to be needed.
It doesn't follow normal Boost all_lower_case_names either, but since lower case "p" and "q" have slightly different meanings in the literature (they're used for values of P and Q) I'm less keen on:

students_t_p(df,x) : calculates P
students_t_q(df,x) : calculates Q

Wacky scheme 3:
~~~~~~~~~~~~~~~

Both of the above suffer from a rather spectacular explosion of function prototypes once you include every variant for each distribution. An alternative using named parameters might be:

P(dist=students_t, df=4, x=5.2); // P for 4 degrees of freedom and x=5.2
Q(dist=students_t, df=5, x=20.0); // Q for 5 degrees of freedom and x=20.0

But of course internally this would have to forward to something like (1) or (2), so it doesn't actually save you any implementation effort, just reduces the number of names.

Inverses:
~~~~~~~~~

And if that's not enough, we also have inverses:

* Calculate x given degrees of freedom and P.
* Calculate x given degrees of freedom and Q.
* Calculate degrees of freedom given x and P.
* Calculate degrees of freedom given x and Q.

At present we're looking at something like:

students_t_inv(df,p); // Calculate x given degrees of freedom and P.

But the other variants don't have obvious names under this scheme.

So I'm hoping some Boosters can work their usual naming magic :-)

Many thanks, John.

John Maddock wrote:
Paul Bristow has been toiling away producing some statistical functions on top of some of my Math special functions, and we've encountered a bit of a naming dilemma that I hope the ever resourceful Boosters can solve for us :-)
Possibly better, save him from writing them, possibly? Has he looked at Eric Niebler's statistical accumulators?

http://www.boost-consulting.com/vault/index.php?&direction=0&order=&directory=Math%20-%20Numerics

Jeff

Jeff Garland wrote:
John Maddock wrote:
Paul Bristow has been toiling away producing some statistical functions on top of some of my Math special functions, and we've encountered a bit of a naming dilemma that I hope the ever resourceful Boosters can solve for us :-)
Possibly better, save him from writing them, possibly? Has he looked at Eric Niebler's statistical accumulators?
http://www.boost-consulting.com/vault/index.php?&direction=0&order=&directory=Math%20-%20Numerics
Different functionality entirely, the two are completely complementary, and in fact we have been investigating using Eric's code in some of our examples.

John.

John Maddock wrote:
Jeff Garland wrote:
Possibly better, save him from writing them, possibly? Has he looked at Eric Niebler's statistical accumulators?
http://www.boost-consulting.com/vault/index.php?&direction=0&order=&directory=Math%20-%20Numerics
Different functionality entirely, the two are completely complementary, and in fact we have been investigating using Eric's code in some of our examples.
Ok, I guess I'll have to read the original question now ;-) Jeff

| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of Jeff Garland
| Sent: 08 July 2006 17:49
| To: boost@lists.boost.org
| Subject: Re: [boost] [math/statistics/design] How best to name statistical functions?
|
| John Maddock wrote:
| > Paul Bristow has been toiling away producing some statistical functions
| > on top of some of my Math special functions, and we've encountered a bit
| > of a naming dilemma that I hope the ever resourceful Boosters can solve
| > for us :-)
|
| Possibly better, save him from writing them, possibly? Has
| he looked at Eric Niebler's statistical accumulators?

Indeed - on my TODO list.

Some further background, before you all leap in with your favourite names ;-)

This is to support my proposal:

A Proposal to add Mathematical Functions for Statistics to the C++ Standard Library
Document number: JTC 1/SC22/WG14/N1069, WG21/N1668
Date: 11 Aug 2004

A recent WG21 paper http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2003.html includes this response to my proposal http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1668.pdf (to be reissued revised as N2048, but it missed this mailing):

"N1668 A Proposal to add Mathematical Functions for Statistics to the C++ Standard Library
Date: 2004-08-11
Status: Open.
Lillehammer [2005-04]: The main argument against this proposal is that a high-quality implementation would be extremely hard; this is about 150 functions, most of which have several parameters. Issue: are we willing to standardize something with the expectation that most implementations will be low quality? Are these functions ones where poor accuracy is acceptable? (If so, we could do this for float only, and drop the double and long double versions.) Mixed interest. No consensus for bringing this forward at this meeting. What might change people's mind:
1. Reasoning for why to include these functions and exclude others.
2. A smaller set of functions.
3. If this is intended to support an easy-to-use statistical package, then show the interface for that statistical package first."

But I think that after John's stunning work on the incomplete beta & gamma - the guts of the functions that you all need to get information from your data using statistics - we are close to meeting the WG21 'requirements' to accept this proposal. His work in the sandbox is functionally complete. I am just doing some 'grunt' work on cosmetics and the wrappers to provide the statistics functions in a format that is best for the end users.

Before you jump to judgement on this issue, I invite (beg!) you to consider the end users' needs. They are NOT mathematicians, they are probably NOT professional statisticians, but are ordinary physicists, chemists, surgeons, social 'scientists', bee keepers, farmers ... Bear in mind too that these groups all have different customary names/jargon for many of these functions. So IMO the names have to be as helpful as possible TO THE USERS - clarity before curtness.

There is also the complication that the distributions have 'mass' values (so-called by some) and 'cumulative' values for others, and these two are confusing and confused, especially if they have the same name! Ideally we would have wrappers which provide BOTH of these variants.

For each function there are variants - complements, and inverses (more than one inverse if there is more than one argument - something I have NOT tackled in the list before, and I only realised the need when doing the wrappers!). The inverse functions have been tackled by John mainly using root-finding methods - the incomplete beta inverse is, as usual, MUCH more difficult, and John has a state-of-the-art solution by Professor Temme.
[Example: the 'forward' functions are useful to tell you the probability of a hypothesis; the 'inverse' is useful to tell you what would be needed to achieve a certain probability, for example a number of measurements or samples, OR the variance (or accuracy of measurement).]

To complicate things further, there are also annoying C99 precedents in erf and erfc, which by the Boost convention of using _ should be erf_c.

These are some of the reasons why I came up with the list of names below. But as John has explained FOR ONE FUNCTION, Student's t, it is not really enough.

Your suggestions are most welcome.

Paul

---
Paul A Bristow
Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB
+44 1539561830 & SMS, Mobile +44 7714 330204 & SMS
pbristow@hetp.u-net.com

Mathematical 'special' functions (only double versions are shown; overloads for float and long double will also be provided):

double beta_distribution(double a, double b, double x); // Beta distribution function.
double beta_incomplete(double a, double b, double x); // Incomplete beta integral.
double beta_incomplete_inv(double a, double b, double y); // Inverse of incomplete beta integral.
double binomial(unsigned int k, unsigned int n, double p); // Binomial distribution function.
double binomial_c(unsigned int k, unsigned int n, double p); // Binomial distribution function complemented.
double binomial_distribution_inv(unsigned int k, unsigned int n, double y); // Binomial distribution function inverse.
double binomial_neg_distribution(unsigned int k, unsigned int n, double p); // Negative binomial distribution.
double binomial_neg_distribution_c(unsigned int k, unsigned int n, double p); // Negative binomial distribution complement.
double binomial_neg_distribution_inv(unsigned int k, unsigned int n, double p); // Inverse of negative binomial distribution.
double chi_sqr_distribution(double df, double x); // Chi-squared distribution function.
double chi_sqr_distribution_c(double df, double x); // Chi-squared distribution function complemented.
double chi_sqr_distribution_c_inv(double df, double p); // Inverse of complemented chi-squared distribution function.
double digamma(double x); // psi or digamma function.
double fisher_distribution(unsigned int ia, unsigned int ib, double c); // Fisher F distribution.
double fisher_distribution_c(unsigned int ia, unsigned int ib, double c); // Fisher F distribution complemented.
double fisher_distribution_c_inv(double dfn, double dfd, double y); // Inverse of complemented Fisher F distribution.
double gamma_distribution(double a, double b, double x); // Gamma probability distribution function.
double gamma_distribution_c(double a, double b, double x); // Gamma probability distribution function complemented.
double gamma_incomplete(double a, double x); // Incomplete gamma function.
double gamma_incomplete_c(double a, double x); // Incomplete gamma function complemented.
double gamma_incomplete_inv(double a, double y0); // Inverse of incomplete gamma integral.
double gamma_incomplete_c_inv(double a, double y0); // Inverse of complemented incomplete gamma integral.
double gamma(double x); // gamma function (or tgamma as in C99 math.h?)
double lgamma(double x); // log gamma function, name as C99.
double normal_distribution(double a); // Normal distribution function.
double normal_distribution_inv(double a); // Inverse of normal distribution function.
double poisson_distribution(unsigned int k, double m); // Poisson distribution.
double poisson_distribution_c(unsigned int k, double m); // Complemented Poisson distribution.
double poisson_distribution_inv(unsigned int k, double y); // Inverse Poisson distribution.
double students_t(double df, double t); // Student's t.
double students_t_inv(double df, double p); // Inverse of Student's t.
double students_t(unsigned int df, double t); // Student's t.
double students_t_inv(unsigned int df, double p); // Inverse of Student's t.

Distribution function probabilities and quantiles:

double normal_probability(double z); // Probability of quantile z.
double normal_quantile(double p); // Quantile of probability p.
double students_t_probability(double t, double df, double ncp); // Probability of quantile.
double students_t_quantile(double p, double df, double ncp); // Quantile of probability p.
double chi_sqr_probability(double x, double df, double ncp); // Probability of quantile.
double chi_sqr_quantile(double p, double df, double ncp); // Quantile of probability p.
double beta_probability(double x, double a, double b); // Probability of x, a, b.
double beta_quantile(double p, double a, double b); // Quantile of probability p.
double fisher_probability(double f, double dfn, double dfd, double ncp); // Probability of quantile.
double fisher_quantile(double p, double dfn, double dfd, double ncp); // Quantile of probability p.
double binomial_probability(double x, double n, double pr); // Probability of x.
unsigned int binomial_first(double p, unsigned int n, double r); // 1st k for probability >= p.
double neg_binomial_probability(double x, double n, double pr); // Probability of quantile.
double poisson_probability(double x, double lambda); // Probability of quantile.
double poisson_quantile(double p, double lambda); // Quantile of probability p.
double gamma_probability(double x, double shape, double scale); // Probability of x.
double gamma_quantile(double p, double shape, double scale); // Quantile of probability p.
double smirnov(int n, double p); // Exact Smirnov statistic.
double smirnov_inv(int n, double x); // Exact Smirnov statistic inverse.
double kolmogorov(double x); // Kolmogorov statistic.
double kolmogorov_inv(double p); // Kolmogorov statistic inverse.

John Maddock wrote:
Paul Bristow has been toiling away producing some statistical functions on top of some of my Math special functions, and we've encountered a bit of a naming dilemma that I hope the ever resourceful Boosters can solve for us :-)
<snip>
So I'm hoping some Boosters can work their usual naming magic :-)
Why not hide the functions behind a class interface? After all, the various functions are "properties" of the distributions. Hence:

class students_t {
    students_t(double mu);
    double P(double x);
    double Q(double x);
    double invP(double p);  // (or perhaps inverseP or Pinv or something)
    .....
};

class normal {
    normal(double mu, double sigma);
    double P(double x);
    double Q(double x);
    double invP(double x);
    ......
};

This interface has a few major benefits over raw functions:

1) Since Paul is using your C++ special functions library in the implementation, there's no argument on the implementation side for C compatibility. Without C compatibility as a driving force, you don't need to stick with free functions and the corresponding combinatorial explosion of hard to remember names.

2) A class interface also lets you carry around data specific to the current "in use" distribution in one place, rather than needing to stuff it into every call (the mean in the case of Student's t, the mean and deviation for the Normal, etc).

3) This "normalizes" the interface for the calls to the distribution functions - every call for "P" has exactly one argument, and not two or three or four depending on the distribution in use.

4) The consistent interface is of course easier to document, teach and learn, and easier to use.

Every well behaved (1D) distribution has to fit this interface, doesn't it? In other words, all well behaved (1D cumulative) distributions have to be single valued and invertible.

You might also want to provide a function to obtain the non-cumulative distribution value (perhaps operator() or dist() or something).

Of course, you would probably templatize, and you might want to inherit from 1D or 2D abstract base classes if you plan to provide multidimensional distributions (or maybe not ...) and functions that operate on distributions.

In any case, I look forward to the results....
--
Kevin Lynch                     voice: (617) 353-6025
Physics Department              Fax: (617) 353-9393
Boston University               office: PRB-361
590 Commonwealth Ave.           e-mail: krlynch@bu.edu
Boston, MA 02215 USA            http://budoe.bu.edu/~krlynch

| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of Kevin Lynch
| Sent: 09 July 2006 11:49
| To: boost@lists.boost.org
| Subject: Re: [boost] [math/statistics/design] How best to name statistical functions?
|
| Why not hide the functions behind a class interface? After all, the
| various functions are "properties" of the distributions. Hence:
|
| class students_t {
|     students_t(double mu);
|     double P(double x);
|     double Q(double x);
|     double invP(double p);  // (or perhaps inverseP or Pinv or something)
|     .....
| };
|
| class normal {
|     normal(double mu, double sigma);
|     double P(double x);
|     double Q(double x);
|     double invP(double x);
|     ......
| };

Rather interesting idea.

| This interface has a few major benefits over raw functions:
|
| 1) Since Paul is using your C++ special functions library in the
| implementation, there's no argument on the implementation side for C
| compatibility. Without C compatibility as a driving force, you don't
| need to stick with free functions and the corresponding combinatorial
| explosion of hard to remember names.

Agreed.

| 2) A class interface also lets you carry around data specific to the
| current "in use" distribution in one place, rather than needing to stuff
| it into every call (the mean in the case of Student's t, the mean and
| deviation for the Normal, etc).
|
| 3) This "normalizes" the interface for the calls to the distribution
| functions - every call for "P" has exactly one argument, and
| not two or three or four depending on the distribution in use.

How would you envisage this working with Fisher, for example, which has degrees of freedom 1 and 2, and a variance ratio?

Is this a 1D or 2D or 3D?

Its inversion will return df1 (given df2, F and probability), or df2 (given df1, F and probability), or F (given df1, df2 and probability).

Would you like to flesh out how you suggest handling all these?

| 4) The consistent interface is of course easier to document,
| teach and learn, and easier to use.

Yes, usability is a major requirement to allow all and sundry to USE this.

| You might also want to provide a
| function to obtain the non-cumulative distribution value (perhaps
| operator() or dist() or something).

Yes - most desirable - but this project is getting bigger, day by day ;-)

(As an aside, John has devised a way to avoid bloat caused by the expectation that one can provide degrees of freedom as an integer OR a floating-point. Without his meta-magic, a serious downside of a fully templated version would be instantiation of many variants of functions.)

| Of course, you would probably templatize and you might want
| to inherit from 1D or 2D abstract base classes if you plan to provide
| multidimensional distributions (or maybe not ...) and functions that
| operate on distributions.
|
| In any case, I look forward to the results....

Watch this space...

Paul

Paul A Bristow wrote:
| 3) This "normalizes" the interface for the calls to the distribution
| functions - every call for "P" has exactly one argument, and
| not two or three or four depending on the distribution in use.
How would you envisage this working with Fisher, for example, which has degrees of freedom 1 and 2, and a variance ratio?
Is this a 1D or 2D or 3D?
Its inversion will return df1 (given df2, F and probability), or df2 (given df1, F and probability), or F (given df1, df2 and probability).
Would you like to flesh out how you suggest handling all these?
I can't say that I would :-) This is going somewhat beyond my area of expertise ... it may well be the case that my model is just wrong, but happens to work within the subset of probability and statistics that I've had cause to use. I'd be perfectly willing to hear that criticism.

Some questions from the clueless to help me think through this: Is it really the case that df1 in the Fisher distribution is any different than, say, the mean or deviation in the Normal? For example, I wouldn't talk about inverting the normal and extracting the mean. Or would I? That's not something I've ever thought about.

In my work, I wouldn't try to extract the number of degrees of freedom for my Chi-squared distribution, but then again, I only do fits with same, in which case I always know the dof a priori.

I guess I should really read through that Probability and Statistics text I bought but haven't gotten around to reading :-)

Kevin

Paul A Bristow wrote:
| -----Original Message-----
| From: Kevin Lynch
| Why not hide the functions behind a class interface? After all, the
| various functions are "properties" of the distributions. Hence:
|
| class students_t {
|     students_t(double mu);
|     double P(double x);
|     double Q(double x);
|     double invP(double p);  // (or perhaps inverseP or Pinv or something)
|     .....
| };
|
| class normal {
|     normal(double mu, double sigma);
|     double P(double x);
|     double Q(double x);
|     double invP(double x);
|     ......
| };
Rather interesting idea.
I support Kevin's proposal rather strongly, for exactly the reasons he states. But I'm not sure what P, Q, invP mean. I would prefer:

double density(double x);
double cumulative(double x);
double inverse_cumulative(double y);
How would you envisage this working with Fisher, for example, which has degrees of freedom 1 and 2, and a variance ratio?
Is this a 1D or 2D or 3D?
Its inversion will return df1 (given df2, F and probability), or df2 (given df1, F and probability), or F (given df1, df2 and probability).
Would you like to flesh out how you suggest handling all these?
Could you clarify your question? Isn't the F distribution still the probability distribution of a single real random variable? The cumulative and inverse cumulative density functions have a consistent mathematical meaning for any 1-dimensional probability distribution, do they not?

| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of Deane Yang
| Sent: 10 July 2006 21:41
| To: boost@lists.boost.org
| Subject: Re: [boost] [math/statistics/design] How best to name statistical functions?
|
| <snip>
|
| > How would you envisage this working with Fisher, for example, which has
| > degrees of freedom 1 and 2, and a variance ratio?
| >
| > Is this a 1D or 2D or 3D?
| >
| > Its inversion will return df1 (given df2, F and probability),
| > or df2 (given df1, F and probability),
| > or F (given df1, df2 and probability).
| >
| > Would you like to flesh out how you suggest handling all these?
|
| Could you clarify your question? Isn't the F distribution still the
| probability distribution of a single real random variable? The
| cumulative and inverse cumulative density functions have a
| consistent mathematical meaning for any 1-dimensional probability
| distribution, do they not?
Well, if you regard the degrees of freedom as fixed, or the probability as fixed (often 95%), then yes; but I would say that they are 2D (and others 3D) distributions.

To keep it simpler, let's go back to the Student's t, which I have implemented (actually templates, but ignore that for now) as:

double students_t(double degrees_of_freedom, double t)

t is roughly a measure of difference between two things (means, for example); this returns the probability that the things are different.

If degrees_of_freedom is small (you only measured 3 times, say), then t can be big, but it still doesn't mean much. But if you made 100 measurements, it probably does.

When you do the inverse, you may want to say: I want to be 95% confident, and I have already fixed the degrees_of_freedom, so what is the corresponding value for t? This is what the ubiquitous Student's t tables do.

On the other hand, sometimes you may decide you want 95% confidence, and you have already made some measurements of t, but you want to know how many (more, probably) measurements (degrees_of_freedom) you would have to make to get this 95%. This is a common problem - and often reveals in drug trials, for example, that there are not enough potential patients available to carry out a trial and achieve a 95% probability.

If you accept this, then the problem is how to name the two, or three, 'inverses' (and complements).

students_t_inv_t and students_t_inv_df ???

Paul

PS I also worry about the risk of code bloat. At present, I think that you don't pay for what you don't use. We certainly don't want all the possible functions discussed above instantiated, even for one floating-point type, if only one function is actually used.

Paul A Bristow wrote:
Well, if you regard the degrees of freedom as fixed, or the probability as fixed (often 95%), then yes; but I would say that they are 2D (and others 3D) distributions.
To keep it simpler, let's go back to the Student's t, which I have implemented (actually templates, but ignore that for now) as
double students_t(double degrees_of_freedom, double t)
t is roughly a measure of difference between two things (means for example)
this returns the probability that the things are different.
If degrees_of_freedom are small (you only measured 3 times, say),
then t can be big, but it still doesn't mean much.
But if you made 100 measurements, it probably does.
When you do the inverse, you may want to say: I want to be 95% confident, and I have already fixed the degrees_of_freedom, so what is the corresponding value for t? This is what the ubiquitous Student's t tables do.
On the other hand, sometimes you may decide you want 95% confidence, and you have already made some measurements of t, but you want to know how many (more probably) measurements (degrees_of_freedom) you would have to make to get this 95%.
This is a common problem - and often reveals in drug trials, for example, that there are not enough potential patients available to carry out a trial and achieve a 95% probability.
If you accept this, then the problem is how to name the two, or three 'inverses' (and complements).
students_t_inv_t and students_t_inv_df ???
I think you're confusing *the* inverse cumulative distribution function with other possible inverse functions that can be defined for each specific distribution. This is why I really dislike a name like "students_t_inv_t", which tells me very little about what it is.

So let's use the Student's t distribution as an example. The Student's t distribution is a *family* of 1-dimensional distributions that depend on a single parameter, called "degrees of freedom". Given a value, say, D, for the degrees of freedom, you get a density function p_D, and integrating it gives you the cumulative density function P_D. As I mentioned before, these should be member functions, which could be called "density" and "cumulative".

The cumulative density function is a strictly increasing function and therefore can be inverted. The inverse function could be called "inverse_cumulative", which is a completely unambiguous name.

I would say that these three member functions should be common to all implemented distributions. Other common member functions might include "mean", "variance", and possibly others.

Finally, you observe that it is often useful to specify the cumulative probability for a given value of the random variable and solve for the parameter (the "degrees of freedom" for a Student's t distribution) that determines the distribution. Since each family of distributions depends on a different set of parameters (for example, normal distributions depend on two parameters, the mean and variance), the interface for this is trickier to define. I can think of two possibilities (I prefer the first):

1) Define ad hoc inverse functions for each specific distribution. So for the Student's t distribution, you would define a member function of the form:

double degrees_of_freedom(double cumulative_probability, double random_variable) const;

2) Always specify distribution parameters (other than the random variable itself) in the constructor using a tuple (a 1-tuple for the Student's t and a 2-tuple for the normal). You could then define templated inverse functions:

template <unsigned int index>
double inverse(double cumulative_probability, double random_variable) const;

Each function would hold all other parameters fixed (as set by the constructor) and solve for the parameter specified by the index. (I don't like using tuples as an input type, because it means I always have to be very careful about the order of the parameters.)

Deane

| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of Deane Yang
| Sent: 11 July 2006 15:11
| To: boost@lists.boost.org
| Subject: Re: [boost] [math/statistics/design] How best to name statistical functions?
|
| So let's use the Student's t distribution as an example. The Student's t
| distribution is a *family* of 1-dimensional distributions
| that depend on a single parameter, called "degrees of freedom".

Does the word *family* imply integral degrees of freedom? Numerically, and perhaps conceptually, it isn't - it's a continuous real. So could one also regard it as a two-parameter function f(t, v)? However, I don't think this matters here.

| Given a value, say, D,
| for the degrees of freedom, you get a density function p_D and
| integrating it gives you the cumulative density function P_D.

What about the Qs? (complements)

| As I mentioned before, these should be member functions,
| which could be called "density"

(also called 'mass')

| and "cumulative".

OTOH many books don't mention either of these words! The whole nomenclature seems a massive muddle, with mathematicians, statisticians, and users of all sorts using different terms, and everyone thinks theirs is the 'Standard' :-(

And the highest priority in my book is the END USERS, not the professionals.

| The cumulative density function is a strictly increasing function and
| therefore can be inverted. The inverse function could be called
| "inverse_cumulative", which is a completely unambiguous name.

But excessively long :-(

| I would say that these three member functions should be common to all
| implemented distributions. Other common member functions might include
| "mean", "variance", and possibly others.

Median, mode, variance, skewness, kurtosis are commonly given, for example: http://en.wikipedia.org/wiki/Student%27s_t

| Finally, you observe that it is often useful to specify the cumulative
| probability for a given value of the random variable and solve for the
| parameter (the "degrees of freedom" for a Student's t distribution) that
| determines the distribution. Since each family of distributions depends
| on a different set of parameters (for example, normal distributions
| depend on two parameters, the mean and variance), the interface for this
| is trickier to define.
| I can think of two possibilities (I prefer the first):
|
| 1) Define ad hoc inverse functions for each specific distribution. So
| for the Student's t distribution, you would define a member function of
| the form:
|
| double degrees_of_freedom(double cumulative_probability, double random_variable) const;

I don't like 2 either, so I have snipped it ;-) This seems OK to me.

I'd be grateful if you could sketch out how you see the whole Student's t class would look (just for double, and omit the equations of course). (This will avoid any confusion about what we are talking about.)

However, I am still worried that the whole scheme will lead to much bigger code compared to a set of names of (template) functions (because code that isn't in fact used will be generated). Can anyone advise on this?

It also would seem that the names will be much longer - perhaps overshadowing the gain in clarity?

Paul

Paul A Bristow wrote:
| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of Deane Yang
| Sent: 11 July 2006 15:11
| To: boost@lists.boost.org
| Subject: Re: [boost] [math/staticstics/design] How best to
| namestatisticalfunctions?
|
| So let's use the Students T distribution as an example. The Students T
| distribution is a *family* of 1-dimensional distributions that depend
| on a single parameter, called "degrees of freedom".
Does the word *family* imply integral degrees of freedom?
No. It's a continuous family of distributions, depending on 1 real parameter.
Numerically, and perhaps conceptually, it isn't - it's a continuous real. So could one also regard it as a two-parameter function f(t, v)?
Yes.
However I don't think this matters here.
No, it doesn't.
| Given a value, say, D, for the degrees of freedom, you get a density
| function p_D and integrating it gives you the cumulative density
| function P_D.
What about the Qs? (complements)
In other words, 1 - P. Right? One response is why do you need to define it, given how easy it is to get from the cumulative density function? If not, use a common name for it. Unfortunately, I don't have a good suggestion.
| As I mentioned before, these should be member functions, which could
| be called "density" (also called 'mass') and "cumulative".
OTOH many books don't mention either of these words!
No? Surely they give the function P a name? I've always seen it referred to as the cumulative density function (CDF for short).
The whole nomenclature seems a massive muddle, with mathematicians, statisticians, and users of all sorts using different terms, and everyone thinking theirs is the 'Standard' :-(
And the highest priority in my book is the END USERS, not the professionals.
And that justifies using one-letter names? My highest priority is designing an interface that facilitates good programming practice by good programmers. I am in no way suggesting that the names I've proposed are "standard". The point is to use names that *are* widely used and do at least suggest the correct meaning of the functions. If you don't like my suggestions, please suggest others. But please don't use cryptic abbreviations like "P", "Q", and "inv_t"; they are no more standard than my suggestions, and they convey a lot less information.
| The cumulative density function is a strictly increasing function and
| therefore can be inverted. The inverse function could be called
| "inverse_cumulative", which is a completely unambiguous name.
But excessively long :-(
Compared to what? I personally am very grateful that I no longer see short, cryptic function and variable names in code.
I'd be grateful if you could sketch out how the whole Student's t class would look (just for double, and omit the equations of course). (This will avoid any confusion about what we are talking about.)
Here's a first stab (I'm sure it can be improved):

class StudentsT
{
public:
    explicit StudentsT(double degrees_of_freedom);

    double density(double x) const;
    double cumulative_probability(double x) const;
    double quantile(double probability) const; // Also known as inverse cumulative.

    double degrees_of_freedom(double quantile, double probability) const;

    // Functions below may return NaN, if undefined.
    double mean() const;
    double variance() const;
    double skewness() const;
    double kurtosis() const;
};
However:
But I'm still worried that the whole scheme will lead to much bigger code compared to a set of named (template) functions (because code that isn't in fact used will be generated). Can anyone advise on this?
Why exactly do you worry about this? We're just repackaging the same set of functions. Also, note that by using classes, you can improve the computational speed, because the class can cache intermediate results common to the different functions, whereas separate functions need to recompute everything from scratch each time. To be honest, you sound like you're more comfortable programming in C than C++.
It also would seem that the names will be much longer - perhaps overshadowing the gain in clarity?
Definitely not for me. Deane

| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of Deane Yang
| Sent: 11 July 2006 17:15
| To: boost@lists.boost.org
| Subject: Re: [boost] [math/staticstics/design] How best
| tonamestatisticalfunctions?
|
| > What about the Qs? (complements)
|
| In other words, 1 - P. Right? One response is why do you need to
| define it, given how easy it is to get from the cumulative density
| function?
|
| If not, use a common name for it. Unfortunately, I don't have a good
| suggestion.

Perhaps not really needed? Is there an accuracy reason for both?

| I've always seen it referred to as the cumulative density function
| (CDF for short).

Agree it's common - but not universal.

| > But excessively long :-(
|
| Compared to what? I personally am very grateful that I no longer see
| short, cryptic function and variable names in code.

OK, I'm persuaded, but what do others think?

| Here's a first stab (I'm sure it can be improved):
|
| class StudentsT
| {
| public:
|     explicit StudentsT(double degrees_of_freedom);
|
|     double density(double x) const;
|     double cumulative_probability(double x) const;
|     double quantile(double probability) const;
|     // Also known as inverse cumulative.
|
|     double degrees_of_freedom(double quantile, double probability) const;
|
|     // Functions below may return NaN, if undefined.
|     double mean() const;
|     double variance() const;
|     double skewness() const;
|     double kurtosis() const;
| };

Thanks - I think I like it. Others' views?

Paul

PS Never written a line of C in my life ;-)

---
Paul A Bristow
Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB
+44 1539561830 & SMS, Mobile +44 7714 330204 & SMS
pbristow@hetp.u-net.com

Paul A Bristow wrote:
-----Original Message-----
From: boost-bounces@lists.boost.org
[mailto:boost-bounces@lists.boost.org] On Behalf Of Deane Yang
Sent: 11 July 2006 17:15
To: boost@lists.boost.org
Subject: Re: [boost] [math/staticstics/design] How best tonamestatisticalfunctions?
What about the Qs? (complements)
In other words, 1 - P. Right? One response is why do you need to define it, given how easy it is to get from the cumulative density function?
If not, use a common name for it. Unfortunately, I don't have a good suggestion.
Perhaps not really needed? Is there an accuracy reason for both?
It depends how accurate you want to be: calculating 1-P incurs cancellation error if P is very near 1, whereas for most (all?) distributions we can calculate Q directly without the subtraction from unity.
Here's a first stab (I'm sure it can be improved):
class StudentsT {
I think the "Boostified" name would be in all lower case: students_t or whatever. John.

Paul A Bristow wrote:
What about the Qs? (complements)
As I mentioned before, these should be member functions, which could be called "density" (also called 'mass')
Or distribution :-)
and "cumulative".
OTOH many books don't mention either of these words!
The whole nomenclature seems a massive muddle, with mathematicians, statisticians, and users of all sorts using different terms, and everyone thinking theirs is the 'Standard' :-(
And the highest priority in my book is the END USERS, not the professionals.
The cumulative density function is a strictly increasing function and therefore can be inverted. The inverse function could be called "inverse_cumulative", which is a completely unambiguous name.
But excessively long :-(
True, how about "persentile", or is that too ambiguous?
Finally, you observe that it is often useful to specify the cumulative probability for a given value of the random variable and solve for the parameter (the "degrees of freedom" for a Students T distribution) that determines the distribution. Since each family of distributions depends on a different set of parameters (for example, normal distributions depend on two parameters, the mean and variance), the interface for this is trickier to define.
I can think of two possibilities (I prefer the first):
1) Define ad hoc inverse functions for each specific distribution. So for the Students T distribution, you would define a member function of the form:
double degrees_of_freedom(double cumulative_probability, double random_variable) const;
I don't like 2 either, so I have snipped it ;-)
This seems OK to me.
That could be a static member function, since we're solving for the degrees of freedom parameter. It would also be more natural to me for the cumulative_probability parameter to come last in the list.
I'd be grateful if you could sketch out how the whole Student's t class would look (just for double, and omit the equations of course). (This will avoid any confusion about what we are talking about.)
However:
But I'm still worried that the whole scheme will lead to much bigger code compared to a set of named (template) functions (because code that isn't in fact used will be generated). Can anyone advise on this?
For template classes member functions are only instantiated when used, so if you only use one member, then that's the only one instantiated. John.

| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of John Maddock
| Sent: 11 July 2006 17:26
| To: boost@lists.boost.org
| Subject: Re: [boost] [math/staticstics/design] How
| besttonamestatisticalfunctions?
|
| >> As I mentioned before, these should be member functions,
| >> which could be called "density" (also called 'mass')
|
| Or distribution :-)

This seems quite clear to me - both density and mass sound too physical to me, though they are in common use. What is important is that the documentation gives ALL the other possible names.

| >> The inverse function could be called "inverse_cumulative"
| > But excessively long :-(
|
| True, how about "persentile", or is that too ambiguous?

Percentile might be better - it is in the dictionary ;-))

But quantile is a more modern term and doesn't raise any questions about multiplying/dividing by 100, a source of unnecessary confusion - as we have found with Boost.Test.

So I'm strongly in favour of quantile.

But I also wonder if 'fraction' is a possible name?

| >> 1) Define ad hoc inverse functions for each specific
| >> distribution. So for the Students T distribution, you would define
| >> a member function of the form:
| >>
| >> double degrees_of_freedom(double cumulative_probability, double
| >> random_variable) const;
|
| That could be a static member function, since we're solving for the
| degrees of freedom parameter.

OK

| It would also be more natural to me for the cumulative_probability
| parameter to come last in the list.

Why? Quantile is also cumulative?

| > But I still worried that the whole scheme will lead to much bigger
| > code compared to a set of named (template) functions (because code
| > that isn't in fact used will be generated).
|
| For template classes member functions are only instantiated when
| used, so if you only use one member, then that's the only one
| instantiated.
That's what I thought - but I wanted expert reassurance before driving into a dead-end ;-)

So my worry turns into a killer feature - keeping the cost of calling a single student's t down to reasonable levels is crucially important. Compared to linking to an "All_the_stats_functions_you_could_ever_want.dll" it should be easily 'affordable', as they say.

Which also means that the cost of a Q or complement function is nothing unless you use it (and you probably won't use the P version as well).
In other words, 1 - P. Right? One response is why do you need to define it, given how easy it is to get from the cumulative density function?

Perhaps not really needed? Is there an accuracy reason for both?
| It depends how accurate you want to be: calculating 1-P incurs
| cancellation error if P is very near 1, whereas for most (all?)
| distributions we can calculate Q directly without the subtraction
| from unity.
|
| I think the "Boostified" name would be in all lower case: students_t
| or whatever.

Agree with this.

Paul

---
Paul A Bristow
Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB
+44 1539561830 & SMS, Mobile +44 7714 330204 & SMS
pbristow@hetp.u-net.com

At 05:11 AM 7/12/2006, Paul A Bristow wrote:
In other words, 1 - P. Right? One response is why do you need to define it, given how easy it is to get from the cumulative density function?

Perhaps not really needed? Is there an accuracy reason for both?
| It depends how accurate you want to be: calculating 1-P incurs
| cancellation error if P is very near 1, whereas for most (all?)
| distributions we can calculate Q directly without the subtraction
| from unity.
| I think the "Boostified" name would be in all lower case: students_t or whatever.
Agree with this.
This brings to mind another function that, though easily derived, would be good to have to allow internal computations less subject to round-off error. This is a two parameter function that is the cumulative probability between a lower and an upper bound. Mathematically this can always be computed as "CDF(x[ub]) - CDF(x[lb])" (read the square brackets as mathematical subscript notation) but numerically, with very small intervals, you can easily end up with 0 when you want something close to "PDF((x[ub]+x[lb])/2)*(x[ub]-x[lb])".

You don't need to make any general guarantee about precision, and so could do initial implementations as the difference of the cumulative functions, but then go back and do better for individual distributions.

I don't know any standard term for this off the top of my head. I would suggest just using a two argument version of whatever is decided on for the cumulative distribution. So, using my suggested function name:

standard_normal.cdf(-1.0, 1.0)

would return the probability that a random variate with a normal distribution is within one standard deviation of the mean.

The only problem I have with this is that if we look at the one parameter version as being the two parameter version with one parameter defaulted, it's the *first* parameter that is defaulted, since:

dist.cdf(x) = dist.cdf(-INFINITY, x)

That would suggest using the complementary cdf instead, but that seems a lot less natural.

Topher

| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of Topher Cooper
| Sent: 12 July 2006 14:11
| To: boost@lists.boost.org
| Subject: Re: [boost] [math/staticstics/design] How
| besttonamestatisticalfunctions?
|
| This brings to mind another function that, though easily derived,
| would be good to have to allow internal computations less subject to
| round-off error. This is a two parameter function that is the
| cumulative probability between a lower and an upper bound.
| Mathematically this can always be computed as "CDF(x[ub]) -
| CDF(x[lb])" (read the square brackets as mathematical subscript
| notation) but numerically with very small intervals, you can easily
| end up with 0 when you want something close to
| "PDF((x[ub]+x[lb])/2)*(x[ub]-x[lb])".

An interesting suggestion.

John Maddock has been muttering about using Boost.Interval with these functions. It's on his TODO list allegedly ;-)

Would this help with the "CDF(x[ub]) - CDF(x[lb])"?

And/or allow one to produce "PDF((x[ub]+x[lb])/2)*(x[ub]-x[lb])" using the density/mass/distribution?

| You don't need to make any general guarantee about precision and so
| could do initial implementations as the difference of the cumulative
| functions, but then go back and do better for individual
| distributions.
|
| I don't know any standard term for this off the top of my head. I
| would suggest just using a two argument version of whatever is
| decided on for the cumulative distribution. So, using my suggested
| function name:
|
| standard_normal.cdf(-1.0, 1.0)
|
| would return the probability that a random variate with a normal
| distribution is within one standard deviation of the mean.
| The only problem I have with this is that if we look at the one
| parameter version as being the two parameter version with one
| parameter defaulted, it's the *first* parameter that is defaulted,
| since:
|
| dist.cdf(x) = dist.cdf(-INFINITY, x)
|
| That would suggest using the complementary cdf instead, but that
| seems a lot less natural.

The language doesn't make this natural :-(

Paul

---
Paul A Bristow
Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB
+44 1539561830 & SMS, Mobile +44 7714 330204 & SMS
pbristow@hetp.u-net.com

Paul A Bristow wrote:
This brings to mind another function that, though easily derived, would be good to have to allow internal computations less subject to round-off error. This is a two parameter function that is the cumulative probability between a lower and an upper bound. Mathematically this can always be computed as "CDF(x[ub]) - CDF(x[lb])" (read the square brackets as mathematical subscript notation) but numerically, with very small intervals, you can easily end up with 0 when you want something close to "PDF((x[ub]+x[lb])/2)*(x[ub]-x[lb])".
An interesting suggestion.
John Maddock has been muttering about using Boost.Interval with these functions. It's on his TODO list allegedly ;-)
Yeh, but it's a long list ;-)
Would this help with the "CDF(x[ub]) - CDF(x[lb])"?
And/or allow one to produce "PDF((x[ub]+x[lb])/2)*(x[ub]-x[lb])" using the density/mass/distribution?
No, it's entirely different functionality. Returning an interval guards against rounding error, or function sensitivity, leading you towards erroneous conclusions.

Calculating a probability over an interval is the same as integrating the distribution function from x to y rather than from -INF to x. Doing it properly requires, for example, a four argument incomplete beta:

ibeta(a, b, x, y); // incomplete beta integral from x to y.

and a three argument incomplete gamma:

gamma_Q(a, x, y); // incomplete gamma integral from x to y.

However, I don't know how to implement those: I did have a very quick look into this when I did the incomplete gamma and didn't find any useful literature, so if anyone has any leads I'm all ears.

My inclination is to leave stuff like this for version 2 (or 3!) though :-)

John.

Paul A Bristow wrote:
As I mentioned before, these should be member functions, which could be called "density" (also called 'mass')
Or distribution :-)
This seems quite clear to me - both density and mass sound too physical to me, though they are in common use.
What is important is that the documentation gives ALL the other possible names.
Yep, we'll need a glossary for sure.
The inverse function could be called "inverse_cumulative"

But excessively long :-(

True, how about "persentile", or is that too ambiguous?
Percentile might be better - it is in the dictionary ;-))
But quantile is a more modern term and doesn't raise any questions about multiplying /dividing by/with 100, a source of unnecessary confusion - as we have found with Boost.Test.
So I'm strongly in favour of quantile.
Agreed.
But I also wonder if 'fraction' is a possible name?
Oh god, another one :-) No, let's stick with quantile IMO, and add the others to the glossary.
It would also be more natural to me for the cumulative_probability parameter to come last in the list.
Why? Quantile is also cumulative?
Actually I'm not sure it matters after all which order they come in :-) I've just got used to your free functions, where the form is:

something_inv(random-or-shape-param, P-or-Q-param);
Which also means that the cost of a Q or complement function is nothing unless you use it. (and you probably won't use the P version as well).
Right and I think in most cases they're trivial to provide? If that turns out not to be the case drop 'em and see if anyone complains :-)
dist.pdf(x) -- Probability Density Function, this is what looks like a "bell shaped curve" for a normal distribution, for example. A.k.a. "p"
dist.cdf(x) -- Cumulative Distribution Function. P
dist.ccdf(x) -- Complementary Cumulative Distribution Function; ccdf(x) = 1 - cdf(x)
dist.icdf(p) -- Inverse Cumulative Distribution Function: P'; icdf(cdf(x)) = x and vice versa
dist.iccdf(p) -- Inverse Complementary Cumulative Distribution Function; iccdf(p) = icdf(1-p); iccdf(ccdf(x)) = x
My instinct is that these are too abbreviated, despite their logicalness.
Agreed.
But this is the key problem - being clear, not curt, and yet concise.
students_t.inverse_complement_cumulative_probability certainly fails! ;-))
so we are getting to:
template <class T> // T an integral or floating-point type.
T distribution(T x) const; // Probability Density Function or pdf or p
T cumulative_probability(T x) const; // Cumulative Distribution Function. P
cumulative_probability is too long :-(
Do we REALLY need the cumulative here?
T probability(T x) const; // Cumulative Distribution Function or cdf or P
I like probability as a name.
T quantile(T probability) const; // Also known as Inverse cumulative Distribution Function
what do we call
T complementary_cumulative_probability(T x) const; // Complementary Cumulative Distribution Function. Q
??? :-((
How about complementary_probability? Or still too long? Or probability_c?
and worse, what about the Inverse Complementary Cumulative Distribution?
complementary_quantile??? :-((
quantile_c ?
and the ad hoc 'extras':
static T degrees_of_freedom(T quantile, T probability); // static, so not const
So I feel we haven't QUITE got there yet.
Closer though. John.

At 11:02 AM 7/11/2006, Paul A Bristow wrote:
| So let's use the Students T distribution as an example. The Students T
| distribution is a *family* of 1-dimensional distributions that depend
| on a single parameter, called "degrees of freedom".
Does the word *family* imply integral degrees of freedom? Numerically, and perhaps conceptually, it isn't - it's a continuous real. So could one also regard it as a two-parameter function f(t, v)? However, I don't think this matters here.
No, a "family of distributions" does not imply that the parameters are integral. What is frequently referred to as *the* normal distribution is also a family, parameterized by the mean and standard deviation. Transformation between members of the family is so easy that we generally transform everything into and from one member of the family, the "standard normal" distribution.

Keep in mind that a distribution is not a function, although it is associated with several functions or function-like entities.

Standard usage is to consider the distributions in the family to be indexed by parameters, and therefore the associated functions to be indexed, single parameter functions. There isn't much difference mathematically, though, between p[mu, sigma](x) and p(mu, sigma, x) (even when the indexes *are* integral), and sometimes it is useful to reframe them in that way. The point is, that is a reframing, and the standard (no, I am not imagining that it is standard) usage is to treat single-dimensional distributions as being single-dimensional.
| Given a value, say, D, for the degrees of freedom, you get a density
| function p_D and integrating it gives you the cumulative density
| function P_D.
What about the Qs? (complements)
| As I mentioned before, these should be member functions, which could
| be called "density" (also called 'mass') and "cumulative".
OTOH many books don't mention either of these words!
But I would be very, very surprised to find many serious statistics books written in English that don't.
The whole nomenclature seems a massive muddle, with mathematicians, statisticians, and users of all sorts using different terms, and everyone thinking theirs is the 'Standard' :-(
Some variation exists due to the interdisciplinary origin and continued nature of the field, but most of the terminology is pretty standard with some enclaves of specialized usage.
And the highest priority in my book is the END USERS, not the professionals.
Exactly -- the professionals are aware of the non-standard usage. Let's give the end users a chance of being able to use what they learned in their high school stat class.
| The cumulative density function is a strictly increasing function and
| therefore can be inverted. The inverse function could be called
| "inverse_cumulative", which is a completely unambiguous name.
But excessively long :-(
| I would say that these three member functions should be common to all
| implemented distributions. Other common member functions might
| include "mean", "variance", and possibly others.
Median, mode, variance, skewness, kurtosis are common given, for example:
Skewness and kurtosis are generally defined but rarely used for distributions. Their computation on small or even moderate samples tends to be rather unstable, so comparison to the ideal distributions isn't terribly useful. I wouldn't bother with them. Mode is not uniquely defined for many distributions, nor is it that commonly used in practice for unimodal distributions (even if the references give a formula). Except for some specialized uses, these are more useful for theory than for computation -- more algebraic than numerical.

There are a lot of other possible associated functions, such as general quantiles or various confidence intervals, but I don't think many of them have general enough use to bother with for all distributions. People who need them could use the distribution as a template parameter. The only exception I would suggest would be to include the convenience of the standard deviation as well as the variance. One might stick in RNG here but that is redundant at this point.

As to naming of the probability functions:

My personal preference would be to use what are probably the most common abbreviations for the basic functions. They are simple, compact and standard. Maybe a little obscure for those who only took statistics in high school or some who only know cookbook statistics -- but that is what documentation is for. The ignorant are after all ignorant whatever choice is made, but you can do something about it by using the standard terms:

dist.pdf(x) -- Probability Density Function, this is what looks like a "bell shaped curve" for a normal distribution, for example. A.k.a. "p"
dist.cdf(x) -- Cumulative Distribution Function. P
dist.ccdf(x) -- Complementary Cumulative Distribution Function; ccdf(x) = 1 - cdf(x)
dist.icdf(p) -- Inverse Cumulative Distribution Function: P'; icdf(cdf(x)) = x and vice versa
dist.iccdf(p) -- Inverse Complementary Cumulative Distribution Function; iccdf(p) = icdf(1-p); iccdf(ccdf(x)) = x

Topher

Topher Cooper wrote:
My personal preference would be to use what is probably the most common abbreviations for the basic functions. They are simple, compact and standard. Maybe a little obscure for those who only took statistics in high school or some who only know cookbook statistics -- but that is what documentation is for. The ignorant are after all ignorant whatever choice is made, but you can do something about it by using the standard terms:
dist.pdf(x) -- Probability Density Function, this is what looks like a "bell shaped curve" for a normal distribution, for example. A.k.a. "p"
dist.cdf(x) -- Cumulative Distribution Function. P
dist.ccdf(x) -- Complementary Cumulative Distribution Function; ccdf(x) = 1 - cdf(x)
dist.icdf(p) -- Inverse Cumulative Distribution Function: P'; icdf(cdf(x)) = x and vice versa
dist.iccdf(p) -- Inverse Complementary Cumulative Distribution Function; iccdf(p) = icdf(1-p); iccdf(ccdf(x)) = x
These would not be my first choice, but since they are relatively standard abbreviations and much shorter to type, I think they are reasonable suggestions.

| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of Topher Cooper
| Sent: 11 July 2006 17:32
| To: boost@lists.boost.org
| Subject: Re: [boost] [math/staticstics/design] How best to
| namestatisticalfunctions?
|
| At 11:02 AM 7/11/2006, Paul A Bristow wrote:
|
| >| So let's use the Students T distribution as an example. The
| >| Students T distribution is a *family* of 1-dimensional
| >| distributions that depend on a single parameter, called
| >| "degrees of freedom".
| >
| >Does the word *family* imply integral degrees of freedom?
|
| No, a "family of distributions" does not imply that the parameters
| are integral. What is frequently referred to as *the* normal
| distribution is also a family, parameterized by the mean and standard
| deviation. Transformation between members of the family is so easy
| that we generally transform everything into and from one member of
| the family, the "standard normal" distribution.
|
| Keep in mind that a distribution is not a function, although it is
| associated with several functions or function-like entities.
|
| Standard usage is to consider the distributions in the family to be
| indexed by parameters, and therefore the associated functions to be
| indexed, single parameter functions. There isn't much difference
| mathematically, though, between p[mu, sigma](x) and p(mu, sigma, x)
| (even when the indexes *are* integral), and sometimes it is useful to
| reframe them in that way. The point is, that is a reframing, and the
| standard (no, I am not imagining that it is standard) usage is to
| treat single-dimensional distributions as being single-dimensional.

Thanks, I think I understand better now.

| >And the highest priority in my book is the END USERS,
| >not the professionals.
|
| Exactly -- the professionals are aware of the non-standard usage.
| Let's give the end users a chance of being able to use what they
| learned in their high school stat class.

My main objective :-))

| >| Other common member functions might include "mean", "variance",
| >| and possibly others.
| >
| >Median, mode, variance, skewness, kurtosis are commonly given, for
| >example:
| >
| >http://en.wikipedia.org/wiki/Student%27s_t
|
| Skewness and kurtosis are generally defined but rarely used for
| distributions. Their computation on small or even moderate samples
| tends to be rather unstable, so comparison to the ideal distributions
| isn't terribly useful. I wouldn't bother with them. Mode is not
| uniquely defined for many distributions, nor is it that commonly used
| in practice for unimodal distributions (even if the references give a
| formula). Except for some specialized uses, these are more useful for
| theory than for computation -- more algebraic than numerical.
|
| There are a lot of other possible associated functions, such as
| general quantiles or various confidence intervals, but I don't think
| many of them have general enough use to bother with for all
| distributions. People who need them could use the distribution as a
| template parameter. The only exception I would suggest would be to
| include the convenience of the standard deviation as well as the
| variance. One might stick in RNG here but that is redundant at this
| point.
|
| As to naming of the probability functions:
|
| My personal preference would be to use what are probably the most
| common abbreviations for the basic functions. They are simple,
| compact and standard. Maybe a little obscure for those who only took
| statistics in high school or some who only know cookbook statistics
| -- but that is what documentation is for.
| The ignorant are after all ignorant whatever choice is made, but you
| can do something about it by using the standard terms:
|
| dist.pdf(x) -- Probability Density Function; this is what looks like a
| "bell shaped curve" for a normal distribution, for example. A.k.a. "p"
| dist.cdf(x) -- Cumulative Distribution Function. P
| dist.ccdf(x) -- Complementary Cumulative Distribution Function;
| ccdf(x) = 1 - cdf(x)
| dist.icdf(p) -- Inverse Cumulative Distribution Function: P';
| icdf(cdf(x)) = x and vice versa
| dist.iccdf(p) -- Inverse Complementary Cumulative Distribution
| Function; iccdf(p) = icdf(1-p); iccdf(ccdf(x)) = x

My instinct is that these are too abbreviated, despite their logic. But this is the key problem - being clear, not curt, and yet concise. students_t.inverse_complement_cumulative_probability certainly fails! ;-))

So we are getting to:

template <class T> // T an integral or floating-point type.
T distribution(T x) const; // Probability Density Function or pdf or p
T cumulative_probability(T x) const; // Cumulative Distribution Function. P

cumulative_probability is too long :-( Do we REALLY need the cumulative here?

T probability(T x) const; // Cumulative Distribution Function or cdf or P
T quantile(T probability) const; // Also known as Inverse Cumulative Distribution Function

What do we call

T complementary_cumulative_probability(T x) const; // Complementary Cumulative Distribution Function. Q

??? :-((

And worse, what about the Inverse Complementary Cumulative Distribution Function? complementary_quantile??? :-((

And the ad hoc 'extras':

static T degrees_of_freedom(T quantile, T probability) const;

So I feel we haven't QUITE got there yet. But many thanks for your help so far.
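As a concreteness check on the abbreviated scheme above, here is a minimal sketch of what `pdf`/`cdf`/`ccdf`/`icdf` members might look like, using the normal distribution because its maths fits in a few lines. All names are illustrative only, not a proposed Boost interface, and the bisection `icdf` is a placeholder for a proper inverse.

```cpp
#include <cmath>

// Illustrative sketch only: Topher's abbreviated naming applied to the
// normal distribution. The class and member names are assumptions for
// this example, not the library's API.
class normal_distribution {
public:
    explicit normal_distribution(double mean = 0.0, double sd = 1.0)
        : mean_(mean), sd_(sd) {}

    // Probability Density Function (lower-case "p" in the literature).
    double pdf(double x) const {
        const double pi = 3.14159265358979323846;
        double z = (x - mean_) / sd_;
        return std::exp(-0.5 * z * z) / (sd_ * std::sqrt(2.0 * pi));
    }

    // Cumulative Distribution Function (P), via erfc so the lower tail
    // does not lose accuracy.
    double cdf(double x) const {
        return 0.5 * std::erfc(-(x - mean_) / (sd_ * std::sqrt(2.0)));
    }

    // Complementary CDF (Q = 1 - P), computed directly rather than as
    // 1 - cdf(x) to avoid cancellation in the upper tail.
    double ccdf(double x) const {
        return 0.5 * std::erfc((x - mean_) / (sd_ * std::sqrt(2.0)));
    }

    // Inverse CDF (quantile), here by simple bisection; a real
    // implementation would use a dedicated rational approximation.
    double icdf(double p) const {
        double lo = mean_ - 20.0 * sd_, hi = mean_ + 20.0 * sd_;
        for (int i = 0; i < 200; ++i) {
            double mid = 0.5 * (lo + hi);
            if (cdf(mid) < p) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }

private:
    double mean_, sd_;
};
```

Seeing the four names next to each other makes the icdf/iccdf symmetry obvious, which is one argument for the curt scheme.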
Paul

---
Paul A Bristow
Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB
+44 1539561830 & SMS, Mobile +44 7714 330204 & SMS
pbristow@hetp.u-net.com

PS Since everybody obviously knows far more about stats than I do, can you also suggest fully worked examples that can be used to demonstrate usage in a tutorial? I'm especially keen to show how superior using this would be to the traditional tables and fixed 95% confidence limits.

Paul A Bristow wrote:
cumulative_probability is too long :-(
Do we REALLY need the cumulative here?
T probability(T x) const; // Cumulative Distribution Function or cdf or P
T quantile(T probability) const; // Also known as Inverse cumulative Distribution Function
I like these suggestions. "quantile" is certainly a widely understood term. But isn't using "probability" for the cdf---which I like a lot!---straying a bit from standard terminology?
what do we call
T complementary_cumulative_probability(T x) const; // Complementary Cumulative Distribution Function. Q
and worse what about Inverse Complementary Cumulative Distribution
complementary_quantile??? :-((
I think here you may need to revert back to using those "cryptic" suffixes. Since the possibilities are pretty limited, they won't be quite so cryptic anymore. So maybe something like "probability_c" for complementary cdf?
and the ad hoc 'extra's
static T degrees_of_freedom(T quantile, T probability) const;
So I feel we haven't QUITE got there yet.
I agree. Nomenclature is so difficult.

| I think here you may need to revert back to using those "cryptic"
| suffixes. Since the possibilities are pretty limited, they won't be
| quite so cryptic anymore. So maybe something like "probability_c" for
| complementary cdf?

I've just lost some hair discovering a down-side of the neat _c - it is easy to forget/not notice it. Doh! So there is some benefit in a longer name.

Paul

At 05:11 AM 7/12/2006, you wrote:
T distribution(T x) const; // Probability Density Function or pdf or p
T cumulative_probability(T x) const; // Cumulative Distribution Function. P
cumulative_probability is too long :-(
Do we REALLY need the cumulative here?
T probability(T x) const; // Cumulative Distribution Function or cdf or P
Sorry, as attractive as it seems at first blush, I think just "probability" is a very poor choice.

A very common confusion in statistics is that people think of the value of the PDF as a probability -- even though it is not (hence the "D" for density). Even sophisticated people slip into thinking of it that way (after all, it *does* represent the probability of an event for discrete distributions). I think that people are much too likely to get confused and think that "probability" means the PDF.

Even without that confusion, there is a legitimate ambiguity for the term: which probability? Note for example that in traditional statistical hypothesis testing, the "p-value" (very roughly speaking, the probability of falsely rejecting the null hypothesis given the assumption that the null hypothesis is true) is the complementary CDF for a 1-tailed test and twice the complementary CDF for most 2-tailed tests.

I don't have as much objection to using "distribution" for the PDF, but the nit-picker in me is a bit uncomfortable with it. A distribution is not a function, but to the extent that it can be identified with a particular function it's the CDF, not the PDF (or the MGF -- the Moment Generating Function -- but let's not even go there). This is because the CDF is always defined for a distribution and the PDF (technically defined as the derivative of the CDF) may not be. Being slightly less pedantic, the *object* is the distribution, not the value of the function. I realize these are all pretty fine distinctions, but I would be much more comfortable if the naming doesn't actively mislead about the technical fine points.
John Maddock has been muttering about using Boost.Interval with these functions. It's on his TODO list allegedly ;-)
Would this help with the "CDF(x[ub]) - CDF(x[lb])"?
An interesting suggestion. Passing a single value to the function would give the CDF from -Infinity. Passing an interval would integrate over that interval.

The problem is that, as I understand it, Boost.Interval objects represent Interval Arithmetic intervals -- i.e., computational error bounds around an unknown correct value. Using them to represent a more general range of reals violates their semantics. I would expect the result of passing an interval parameter to a CDF function to be an interval (easily implemented for the CDF since it's a non-decreasing function, but potentially trickier for the PDF), not a single value. Using a pair of T or something similar makes more sense, but it seems to me that the constructor verbiage is a bit top heavy.
And/or allow one to produce "PDF((x[ub]+x[lb])/2)*(x[ub]-x[lb])" using the density/mass/distribution?
I would say using a range (but not an Interval) with the PDF does feel a bit cleaner than with the CDF. Then a single value would produce the PDF, a range from -Infinity would produce the same value as the CDF, a range to Infinity would produce the same value as the complementary CDF. Having to construct the range still would seem unnecessary cruft. Just allow either one argument or two argument forms (despite the "defaulted" parameter being the wrong one). I'd almost give up my objections to calling that function "distribution". Of course I would not suggest blindly using that little approximation I threw out. I just included it to make it clear that the value could be distinctly different from 0 even when computing the difference explicitly would lead to severe round-off problems. That formula can be seen as either a zero-order numerical integration or the first term of the differences in the differences of the Taylor series off the midpoint. Except for very small intervals you would want to add more terms either way. The Taylor series improves rapidly -- specifically quadratically (the next term is the second derivative of the PDF times the cube of the interval width divided by 24). You might run into some grey areas, though: regions where using the difference would produce unacceptable roundoff loss but the width is too large for effective use of small interval approximations. As I said, for the first release, I'd just implement it using the difference of the CDFs then worry about improving it later. Topher
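Topher's round-off point is easy to demonstrate numerically. The sketch below (standard normal; the function names are mine) compares the difference of two CDF values against the zero-order midpoint approximation for a very narrow interval: the two agree, but the difference form has already lost several significant digits to cancellation while the midpoint form keeps full relative accuracy.

```cpp
#include <cmath>

// Standard normal density and cumulative, for illustration only.
double std_normal_pdf(double x) {
    const double pi = 3.14159265358979323846;
    return std::exp(-0.5 * x * x) / std::sqrt(2.0 * pi);
}
double std_normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Exact in principle, but for a narrow interval the subtraction cancels
// most of the significant digits of the two nearly equal CDF values.
double prob_by_difference(double lb, double ub) {
    return std_normal_cdf(ub) - std_normal_cdf(lb);
}

// Topher's zero-order approximation: pdf(midpoint) * width. Its relative
// error shrinks quadratically with the interval width.
double prob_by_midpoint(double lb, double ub) {
    return std_normal_pdf(0.5 * (lb + ub)) * (ub - lb);
}
```

For a first release, as suggested above, one could ship `prob_by_difference` and swap in the series-based form later for narrow intervals.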

Topher Cooper wrote:
At 05:11 AM 7/12/2006, you wrote:
T distribution(T x) const; // Probability Density Function or pdf or p T cumulative_probability(T x) const; // Cumulative Distribution Function. P
cumulative_probability is too long :-(
Do we REALLY need the cumulative here?
T probability(T x) const; // Cumulative Distribution Function or cdf or P
Sorry, as attractive as it seems at first blush, I think just "probability" is a very poor choice. ...
<explanation about why and discussion about using intervals snipped>

I definitely do not want to use the same function name for both the density function and the cumulative probability. Your point about people confusing the meaning of the density function is on the mark, and I think using the same function name will only exacerbate the confusion. So I would still vote for:

double density(double x) const;

(Despite the origin of the word "density" from physics, it is definitely used by mathematicians, statisticians, and engineers to mean exactly this. And I agree that the word "distribution" is not a synonym for "density".)

On the other hand, I like the idea of using an interval type for the "probability" function and requiring an explicit interval constructor when calling the function, like

student_t dist(2.0);
double p = dist.probability(interval(-1.0, 2.0));
double q = dist.probability(interval(infinity, -1.0));

To me, syntax like this just makes it easier for me to understand what's going on. And I agree that we shouldn't just use the Boost Interval library. I think we should define an interval class specific to the statistics library, where the left endpoint is allowed to be -infinity and the right endpoint +infinity. Then we get a syntax that is easy to read and understand, and we don't need to come up with a good name for the cumulative or complementary cumulative probability functions.
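The interval idea above can be sketched in a few lines. A normal distribution is used here instead of students-t purely so the numbers are easy to check; `interval`, `probability`, and `normal_distribution` are illustrative names for this sketch, not an actual API.

```cpp
#include <cmath>
#include <limits>

// A statistics-specific interval whose endpoints may be +/-infinity,
// as proposed above. Not Boost.Interval: no arithmetic semantics.
struct interval {
    double lower, upper;
    interval(double lo, double hi) : lower(lo), upper(hi) {}
};

class normal_distribution {
public:
    explicit normal_distribution(double mean = 0.0, double sd = 1.0)
        : mean_(mean), sd_(sd) {}

    double cdf(double x) const {
        // erfc handles infinite arguments, but being explicit is clearer.
        if (x == std::numeric_limits<double>::infinity()) return 1.0;
        if (x == -std::numeric_limits<double>::infinity()) return 0.0;
        return 0.5 * std::erfc(-(x - mean_) / (sd_ * std::sqrt(2.0)));
    }

    // One name covers P, Q and central ranges: the probability that the
    // random variable falls inside [r.lower, r.upper].
    double probability(const interval& r) const {
        return cdf(r.upper) - cdf(r.lower);
    }

private:
    double mean_, sd_;
};
```

With this shape, `probability(interval(-inf, x))` is the CDF and `probability(interval(x, inf))` the complementary CDF, so neither needs its own name.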

| -----Original Message----- | From: boost-bounces@lists.boost.org | [mailto:boost-bounces@lists.boost.org] On Behalf Of Deane Yang | Sent: 12 July 2006 23:14 | To: boost@lists.boost.org | Subject: Re: [boost] [math/staticstics/design] How best | tonamestatisticalfunctions? | | Topher Cooper wrote: | > At 05:11 AM 7/12/2006, you wrote: | >> T distribution(T x) const; // Probability Density | Function or pdf or p | >> T cumulative_probability(T x) const; // Cumulative | Distribution | >> Function. P | >> | >> cumulative_probability is too long :-( | >> | >> Do we REALLY need the cumulative here? | >> | >> T probability(T x) const; // Cumulative Distribution | Function or cdf or | >> P | > | > Sorry, as attractive as it seems at first blush, I think just | > "probability" is a very poor choice. ... | | <explanation about why and discussion about using intervals snipped> | | I definitely do not want to use the same function name for both the | density function and the cumulative probability. Your point | about people | confusing the meaning of the density function is on the mark, and I | think using the same function name will only exacerbate the | confusion. | | Do I would still vote for: | | double density(double x) const; | | (Despite the origin of the word "density" from physics, it | is definitely | used by mathematicians, statisticans, and engineers to mean exactly | this. And I agree that the word "distribution" is not a synonym for | "density".) | | On the other hand, I like the idea of using an interval type for the | "probability" function and requiring an explicit interval | constructor | when calling the function, like | | student_t dist(2.0); | double p = dist.probability(interval(-1.0, 2.0)); | double q = dist.probability(interval(infinity, -1.0)); | | To me, syntax like this just makes it easier for me to | understand what's | going on. | | And I agree that we shouldn't just use the Boost Interval library. 
| I think we should define an interval class specific to the statistics
| library, where the left endpoint is allowed to be -infinity and the
| right endpoint +infinity.
|
| Then we get a syntax that is easy to read and understand, and we don't
| need to come up with a good name for the cumulative or complementary
| cumulative probability functions.

I've quickly knocked up a very rough sketch of how it might look, like this (attached a zip of a .cpp run on MSVC 8.0). I'm sure you can suggest improvements to this.

Seeing it used makes me still quite like a single function name 'probability' (with 1 parameter for pdf and two for cdf(s)) but I am willing to be out-voted. Neat but riskier.

I also attached a response from Daniel Egloff making a similar, but more advanced, proposal. (As John notes, the downside with a class is difficulty of extension.)

However, I am just about to go on holiday for two weeks, so I will leave you all to discuss further, and hope you've got everything sorted out and example code written by the time I get back ;-))

Thanks

Paul

| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of Paul A Bristow
| Sent: 13 July 2006 17:40
| To: boost@lists.boost.org
| Subject: Re: [boost] [math/staticstics/design] How
| besttonamestatisticalfunctions?

Sorry I forgot the attachment.

Paul

THE inverse? Another quick question - I'm still in partial disambiguation mode.

With the negative binomial distribution function (or are there more than one, but one is THE standard one?), which is **THE** inverse? The one that tells you the number of failures (MathCAD qnbinom & DCDFLIB), or the one that tells you the success probability? (Cephes, Wikipedia & DCDFLIB)

John's response to this question was faintly blasphemous ;-)

Same question with F and chisqr, of course... Both/all of course are potentially useful :-) (and I feel all should be provided).

Paul

Paul A Bristow wrote:
With the negative binomial distribution function (or are there more than one but one is THE Standard one?), which is **THE** inverse?
the one that tells you the number of failures (MathCAD qnbinom & DCDFLIB)
or the one that tells you the success probability? (Cephes, Wikipedia & DCDFLIB)
I was wrong about Wikipedia: they agree with MathCAD and Mathematica, I think; I just read it wrong the first time :-(
John's response to this question was faintly blasphemous ;-)
:-)
Same question with F and chisqr of course...
Both/all of course are potentially useful :-)
(and I feel all should be provided).
If you look at Mathematica's documentation here

http://documents.wolfram.com/mathematica/Add-onsLinks/StandardPackages/Stati...

and here

http://documents.wolfram.com/mathematica/Add-onsLinks/StandardPackages/Stati...

they're reasonably precise on which parameters are the "parameterisation" and which is the random variable. They use "quantile" to always invert the random variable. However, quite often that may not be the most useful one to invert IMO. For example, in the binomial distribution we have:

parameters:
N: number of trials.
p: probability of success in one trial.

Random variable:
n: number of successes.

So the quantile gives you the number of successes expected at a given probability, but for many scientists, they'll measure the number of successes and want to invert to get the probability of one success (parameter p).

Hopefully, I've actually got this right this time; I'm sure someone will jump in if not....?

All of which means that in addition to a "generic" interface - however it turns out - we will still need distribution-specific ad hoc functions to invert for the parameterisation values, as well as the random variable.

Also still in learning mode yours, John.
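The distinction between inverting the random variable and inverting a parameter can be shown directly. The sketch below (all names are mine, for illustration) computes the binomial CDF by summation and then bisects on the parameter p given N and n, which is legitimate because for fixed N and n the CDF is a decreasing function of p; the standard quantile would instead bisect on n.

```cpp
#include <cmath>

// P(X <= n) for X ~ Binomial(N, p), by direct summation of
// C(N,k) p^k (1-p)^(N-k). Log-gamma is used to avoid overflow in the
// binomial coefficient for larger N.
double binomial_cdf(int N, double p, int n) {
    double sum = 0.0;
    for (int k = 0; k <= n; ++k) {
        double log_term = std::lgamma(N + 1.0) - std::lgamma(k + 1.0)
                        - std::lgamma(N - k + 1.0)
                        + k * std::log(p) + (N - k) * std::log(1.0 - p);
        sum += std::exp(log_term);
    }
    return sum;
}

// The "ad hoc" inverse John describes: solve binomial_cdf(N, p, n) == target
// for the parameter p, by bisection (the CDF is decreasing in p).
double invert_for_p(int N, int n, double target) {
    double lo = 1e-12, hi = 1.0 - 1e-12;
    for (int i = 0; i < 100; ++i) {
        double mid = 0.5 * (lo + hi);
        if (binomial_cdf(N, mid, n) > target) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

A production version would use the incomplete beta function rather than bisection, but the point here is only that this inverse is a different operation from the quantile.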

At 09:01 AM 7/14/2006, John Maddock wrote:
So the quantile gives you number of successes expected at a given probablity, but for many scientists, they'll measure the number of successes and want to invert to get the probability of one success (parameter p).
Hopefully, I've actually got this right this time, I'm sure someone will jump in if not.... ?
Jumping in. That isn't a functional inversion at all.

Given a particular set of observations presumed to be a sampling from an unknown member of a family of distributions, one can define an estimator -- a computation on the observed values -- for the distribution parameter. Generally multiple estimators are available. We are interested in the difference between the estimator and the unknown "true" value. Through some indirect thinking we can treat the true value as a random variable (sort of -- statisticians will cringe here) and the difference becomes a random variable as well -- with its own distribution. Essentially the point estimator is the mean or a similar value for the distribution. Current practice prefers using a confidence interval rather than a point estimate.

Here is a common (and commonly confused) example of multiple estimators. You have a sample of values and you want an estimator for the variance and you have no theoretical knowledge of the mean -- there are two common choices:

S1 = sum((x[i] - mean(x))^2) / N

and

S2 = sum((x[i] - mean(x))^2) / (N-1)

Which should you use? The distribution of the error in the first has a slightly smaller variance and so, in a sense, is a more accurate estimator. The usual advice though is to go with the second. The reason is that the first has a bias to it, leading to the possibility of accumulating large errors, while the second is unbiased. It doesn't make much difference for large samples, but you can choose whichever you want for small samples.

Note:

1) Estimators can be for any population statistic, not just ones that happen to be used as parameters for the distribution family.

2) As I said, there can be more than one estimator for a given statistic. For example, the sample median may be used as an estimator for the population mean when symmetry can be assumed, since it is less sensitive to "outliers" than the sample mean.
3) Estimators are based on arbitrary computations on a sample of values which may not be directly related to a distribution parameter like the "hit count" is in your example. They are not, in general, a matter of plugging in a simple set of known scalar values. 4) You are also interested in auxiliary information for an estimator -- basically information about its error distribution about the true population statistic. For example, when you use the sample mean to estimate the distribution parameter mu (or equivalently, the population mean) of a presumed normal distribution you are interested in the "standard error" the estimated standard deviation of the estimator around the true mean. I don't think that this is really the kettle of worms you want to open up.
All of which means that in addition to a "generic" interface - however it turns out - we will still need distribution-specific ad hoc functions to invert for the parameterisation values, as well as the random variable.
Now there, I agree with you. Putting some commonly used computations in (e.g., standard error given sample size and sample standard deviation) would be nice. But don't kid yourself that you are going to build in all of, say, Regress into this library in any reasonable amount of time. Hit the high points and don't even try for completeness. Topher

I'm not sure what you are quoting with your first line, but, of course, there isn't a single inverse for any distribution.

One task in statistics is hypothesis testing. In traditional statistics, to do this you require the inverse cumulative distribution. It's well and consistently defined for every single-dimensional distribution. It is also used in setting confidence limits and many other purposes. A numerical statistical package that doesn't include it is worthless.

We also use the inverse cumulative in setting confidence bounds. Roughly speaking -- given that the underlying process is controlled by the following distribution, what x[lb] and x[ub] can I be 95% certain that any specific single sample will lie between? (A "single sample" actually could be a particular, single statistic for a set of samples; you just need the right distribution.) The inverse of the complementary CDF is also useful but trivially derived from the other. Nice to have both, but not strictly necessary.

*Some* distributions, such as the negative binomial distribution (note, NOT a function) and the binomial distribution, are discrete distributions. It then is also meaningful to define an inverse for the PDF, especially a "fuzzy" inverse (how long a string of failures has a probability of 0.05 of occurring). Of course, in that case the inverse is not generally a function. Once in a great while, this might be useful.

We can also take the distributional parameters and stop treating them like indexes for the family of CDFs and treat them like function arguments. We can then speak meaningfully about inverses for each of them.
So, given the CDF for the normal distribution we have, let's say (this is math, not any proposal for C++ naming):

CDFz[mu, sigma](x) -> P

becomes

CDFz(x, mu, sigma) -> P

The "standard" inverse CDF is then

CDF'z(p, mu, sigma) -> x

And one of the others is:

CDF'z(x, mu, p) -> sigma

I.e., given that I know a sample was generated from the normal distribution with mean mu and that the probability that the sample was greater than a particular precise value, x, is a particular precise probability, p, then what is the standard deviation, sigma, for that distribution?

This is an important question algebraically. It allows us to derive distributions for parameter estimation that we can then use the inverse cumulative distribution function on to give us confidence bounds for parameters. For example, given a particular sample drawn from, say, a chi-square distribution, what is the distribution of possible values for the number of degrees of freedom?

There may be situations where a particular distribution applies where a numerical inversion around a parameter is called for, but I can't think of any. Can you give me a reasonable scenario where these inverses around the parameters would be widely used? Let's have a use case.

I certainly think that after the common structure of the distribution classes has been put in place it is reasonable to ask what additional, distribution-specific, methods should be added. If you want to put every formula in the handbooks in, go ahead -- little of it will ever be used in practice, but it will be there if some unanticipated need comes up and the user will be able to avoid the bother of looking up the formula themselves. Some kind of naming convention for some of this distribution-specific stuff seems reasonable. Having read accessors for each distribution parameter seems like a good idea, for example ("(x - aNormDist.mu)/aNormDist.sigma" where, in this case, aNormDist.mu = aNormDist.mean and aNormDist.sigma = aNormDist.standardDeviation).
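The "inverse around a parameter" for the normal case is simple enough to make concrete. A sketch under stated assumptions: I use the non-complemented convention p = P(X <= x) rather than the "greater than" form in the text, sigma then follows algebraically as (x - mu) divided by the standard-normal quantile of p, and all function names here are mine.

```cpp
#include <cmath>

// Standard normal CDF via erfc.
double std_normal_cdf(double z) {
    return 0.5 * std::erfc(-z / std::sqrt(2.0));
}

// Standard normal quantile by bisection; a placeholder for a proper
// inverse-error-function implementation.
double std_normal_quantile(double p) {
    double lo = -40.0, hi = 40.0;
    for (int i = 0; i < 200; ++i) {
        double mid = 0.5 * (lo + hi);
        if (std_normal_cdf(mid) < p) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

// CDF'z(x, mu, p) -> sigma: given mu, x and p = P(X <= x) for
// X ~ Normal(mu, sigma), recover sigma = (x - mu) / quantile(p).
// Only meaningful when x - mu and p - 1/2 share a sign, so that the
// resulting sigma is positive.
double sigma_given_x_mu_p(double x, double mu, double p) {
    return (x - mu) / std_normal_quantile(p);
}
```

This one happens to be closed-form because the normal family is a location-scale family; for something like chi-square degrees of freedom the inversion would have to be numerical.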
Topher At 05:05 AM 7/14/2006, you wrote:
THE inverse?
Another quick question - I'm still in partial disambiguation mode.
With the negative binomial distribution function (or are there more than one but one is THE Standard one?), which is **THE** inverse?
the one that tells you the number of failures (MathCAD qnbinom & DCDFLIB)
or the one that tells you the success probability? (Cephes, Wikipedia & DCDFLIB)
John's response to this question was faintly blasphemous ;-)
Same question with F and chisqr of course...
Both/all of course are potentially useful :-)
(and I feel all should be provided).
Paul

Thanks for this further explanation, which has crossed with my and John Maddock's postings.

| -----Original Message-----
| From: boost-bounces@lists.boost.org
| [mailto:boost-bounces@lists.boost.org] On Behalf Of Topher Cooper
| Sent: 14 July 2006 14:46
| To: boost@lists.boost.org
| Subject: Re: [boost] [math/staticstics/design] How best
| tonamestatisticalfunctions?
|
| I'm not sure what you are quoting with your first line, but, of
| course, there isn't a single inverse for any distribution.
| So, given the CDF for the normal distribution we have, let's
| say (this is math, not any proposal for C++ naming):
|
| CDFz[mu, sigma](x) -> P
|
| becomes
|
| CDFz(x, mu, sigma) -> P
|
| The "standard" inverse CDF is then
|
| CDF'z(p, mu, sigma) -> x

So how do we find out what is considered "standard" - ask you? Consult Mathematica's documentation? Textbooks? Is there agreement on a standard? I suspect so, but if this is to be part of the C++ Standard, there needs to be a clear statisticians' standard.

| And one of the others is:
|
| CDF'z(x, mu, p) -> sigma

What John called 'ad hoc'?

| I.e., given that I know a sample was generated from the normal
| distribution with mean mu and that the probability that the sample
| was greater than a particular precise value, x, is a particular
| precise probability, p, then what is the standard deviation, sigma,
| for that distribution?
|
| This is an important question algebraically. It allows us to derive
| distributions for parameter estimation that we can then use the
| inverse cumulative distribution function to give us confidence bounds
| for parameters. For example, given a particular sample drawn from
| say, a chi-square distribution, what is the distribution of possible
| values for the number of degrees of freedom?
|
| There may be situations where a particular distribution applies where
| a numerical inversion around a parameter is called for, but I can't
| think of any.
| Can you give me a reasonable scenario where these inverses around the
| parameters would be widely used? Let's have a use case.

Well, unless I still don't understand, John produced one? And I've mentioned the 'how many degrees of freedom would be needed for a chosen probability' example? Knowing whether more measurements (and/or more precise measurements) are needed is a very common need (not easily met at present, as far as I can see). Or are you talking about something different?

Paul

Paul A Bristow wrote:
| | CDFz[mu, sigma](x) -> P | | becomes | | CDFz(x, mu, sigma) -> P | | The "standard" inverse CDF is then | | CDF'z(p, mu, sigma) -> x
So how do we find out what is considered "standard" - ask you? Consult Mathematica's documentation? Textbooks? Is there agreement on a standard? I suspect so, but
The standard inverse is just the quantile function, period. In other words, if CDF[parameters..., x] = P, then inverseCDF[parameters..., P] = x. All other possible inverse functions (where you are solving for one of the parameters that specify the distribution) are ad hoc.
If this is to be part of the C++ Standard, there needs to be a clear statisticians' standard.
The above is, as far as I know, the standard definition of the inverseCDF or quantile function.
| And one of the others is: | | CDF'z(x, mu, p) -> sigma
What John called 'ad hoc'?
Yes. If you are specifying *both* the quantile level *and* the probability, and solving for some other parameter that specifies the distribution, then you are in the realm of ad hoc inverse functions, because different families of distributions (normal versus students t versus exponential....) have different parameterizations.

At 10:45 AM 7/14/2006, Paul Bristow wrote:
So how do we find out what is considered "standard" - ask you? Consult Mathematica's documentation? Textbooks? Is there agreement on a standard? I suspect so, but
You'll find a lot of variation in how the distribution parameters are expressed for some distributions, but all single-dimensional distribution families are pretty unambiguous on this point. There are some number of parameters that index a specific distribution from a family of distributions. Random variables are associated with that distribution. There is a quantity, "x", representing possible values for such a random variable. The integral of the PDF of x (or sum for a discrete variate/distribution) from -infinity to t is the CDF for that distribution at t. It is the probability that a random variable will have a value less than or equal to t.

The inverse CDF, sometimes called the "quantile" in statistical packages (a usage taken from statistics in the social sciences), is the functional inverse of the CDF function. Its value for a particular "p" is the value for t with a probability p that a random variable will be less than it. I don't think you'll find any real disagreement in any source about this.

I've finally figured out that you guys are not really talking about functional inverses at all. You're saying "inverse" when you mean a parameter estimator. As I posted a little while ago, that's a much more elaborate issue than you think it is.

Topher

Topher Cooper wrote:
I've finally figured out that you guys are not really talking about functional inverses at all. You're saying "inverse" when you mean a parameter estimator. As I posted a little while ago, that's a much more elaborate issue than you think it is.
Correct: the initial confusion was over which were the parameters, and which the random variable. As Paul will testify, when he initially asked about this I didn't take enough care in answering, which managed to tie us both in knots :-( John.
participants (6)
-
Deane Yang
-
Jeff Garland
-
John Maddock
-
Kevin Lynch
-
Paul A Bristow
-
Topher Cooper