
At 09:01 AM 7/14/2006, John Maddock wrote:
So the quantile gives you the number of successes expected at a given probability, but for many scientists, they'll measure the number of successes and want to invert to get the probability of one success (the parameter p).
Hopefully I've actually got this right this time; I'm sure someone will jump in if not....
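
Concretely, the inversion described here amounts to: given n trials and an observed success count k, solve the binomial CDF for p numerically. A self-contained sketch, using bisection and direct summation of the pmf (the function names are mine, not the library's actual interface):

    #include <cmath>
    #include <iostream>

    // P(X <= k) for X ~ binomial(n, p), by direct summation of the pmf.
    double binomial_cdf(int n, int k, double p)
    {
        double term = std::pow(1.0 - p, n); // P(X = 0)
        double sum  = term;
        for (int i = 1; i <= k; ++i) {
            // pmf recurrence: P(X = i) = P(X = i-1) * (n-i+1)/i * p/(1-p)
            term *= (n - i + 1) / double(i) * p / (1.0 - p);
            sum  += term;
        }
        return sum;
    }

    // Solve P(X <= k; n, p) == q for p by bisection; the CDF is
    // strictly decreasing in p, so the root is unique.
    double invert_for_p(int n, int k, double q)
    {
        double lo = 0.0, hi = 1.0;
        for (int iter = 0; iter < 60; ++iter) {
            double mid = 0.5 * (lo + hi);
            if (binomial_cdf(n, k, mid) > q)
                lo = mid; // CDF still too high: p must be larger
            else
                hi = mid;
        }
        return 0.5 * (lo + hi);
    }

    int main()
    {
        // e.g. 3 successes in 10 trials: the p at which seeing 3 or
        // fewer successes has cumulative probability 0.05.
        std::cout << invert_for_p(10, 3, 0.05) << '\n';
    }
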
Jumping in. That isn't a functional inversion at all. Given a particular set of observations, presumed to be a sample from an unknown member of a family of distributions, one can define an estimator -- a computation on the observed values -- for the distribution parameter. Generally multiple estimators are available. We are interested in the difference between the estimator and the unknown "true" value. Through some indirect thinking we can treat the true value as a random variable (sort of -- statisticians will cringe here) and the difference becomes a random variable as well, with its own distribution. Essentially the point estimator is the mean or a similar value of that distribution. Current practice prefers a confidence interval to a point estimate.

Here is a common (and commonly confused) example of multiple estimators. You have a sample of values, you want an estimator for the variance, and you have no theoretical knowledge of the mean. There are two common choices:

    S1 = sum((x[i] - mean(x))^2) / N
    S2 = sum((x[i] - mean(x))^2) / (N-1)

Which should you use? The distribution of the error in the first has a slightly smaller variance and so, in a sense, it is the more accurate estimator. The usual advice, though, is to go with the second: the first has a bias to it, leading to the possibility of accumulating large errors, while the second is unbiased. It doesn't make much difference for large samples, but you can choose whichever you want for small samples. (Both are sketched in code below.)

Note:

1) Estimators can be for any population statistic, not just ones that happen to be used as parameters of the distribution family.

2) As I said, there can be more than one estimator for a given statistic. For example, the sample median may be used as an estimator for the population mean when symmetry can be assumed, since it is less sensitive to outliers than the sample mean.

3) Estimators are arbitrary computations on a sample of values and may not be directly related to a distribution parameter the way the "hit count" is in your example. They are not, in general, a matter of plugging in a simple set of known scalar values.

4) You are also interested in auxiliary information for an estimator -- basically, information about its error distribution around the true population statistic. For example, when you use the sample mean to estimate the parameter mu (or, equivalently, the population mean) of a presumed normal distribution, you are interested in the "standard error": the estimated standard deviation of the estimator around the true mean.

I don't think this is really the kettle of worms you want to open up.
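
For concreteness, a minimal C++ sketch of the two variance estimators (the function names here are mine, not anything in the proposed library):

    #include <iostream>
    #include <numeric>
    #include <vector>

    double sample_mean(const std::vector<double>& x)
    {
        return std::accumulate(x.begin(), x.end(), 0.0) / x.size();
    }

    double sum_squared_deviations(const std::vector<double>& x)
    {
        double m = sample_mean(x), ss = 0.0;
        for (double xi : x)
            ss += (xi - m) * (xi - m);
        return ss;
    }

    // S1: divide by N. Slightly smaller error variance, but biased
    // (it systematically underestimates the population variance).
    double variance_s1(const std::vector<double>& x)
    {
        return sum_squared_deviations(x) / x.size();
    }

    // S2: divide by N-1. Unbiased; the usual recommendation.
    double variance_s2(const std::vector<double>& x)
    {
        return sum_squared_deviations(x) / (x.size() - 1);
    }

    int main()
    {
        std::vector<double> x = {2.1, 1.9, 2.4, 2.0, 2.2};
        std::cout << "S1 = " << variance_s1(x)
                  << ", S2 = " << variance_s2(x) << '\n';
    }
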
All of which means that in addition to a "generic" interface - however it turns out - we will still need distribution-specific ad hoc functions to invert for the parameterisation values, as well as the random variable.
Now there, I agree with you. Putting some commonly used computations in (e.g., the standard error given the sample size and sample standard deviation) would be nice. But don't kid yourself that you are going to build all of, say, Regress into this library in any reasonable amount of time. Hit the high points and don't even try for completeness.

Topher
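
As an illustration of the sort of convenience routine meant here -- a sketch only, with a hypothetical name, assuming the usual formula SE = s / sqrt(N):

    #include <cmath>
    #include <cstddef>

    // Standard error of the sample mean, from the sample size and the
    // sample standard deviation (the sqrt of the N-1 variance estimate).
    double standard_error(std::size_t n, double sample_sd)
    {
        return sample_sd / std::sqrt(double(n));
    }

With the estimators sketched earlier, this would be called as standard_error(x.size(), std::sqrt(variance_s2(x))).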