
At 09:01 AM 7/14/2006, John Maddock wrote:
So the quantile gives you the number of successes expected at a given probability, but for many scientists, they'll measure the number of successes and want to invert to get the probability of one success (the parameter p).
Hopefully I've actually got this right this time; I'm sure someone will jump in if not....
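
Concretely, the inversion described here amounts to: given n trials and an observed success count k, solve the binomial CDF for p numerically. A self-contained sketch, using bisection and direct summation of the pmf (the function names are mine, not the library's actual interface):

    #include <cmath>
    #include <iostream>

    // P(X <= k) for X ~ binomial(n, p), by direct summation of the pmf.
    double binomial_cdf(int n, int k, double p)
    {
        double term = std::pow(1.0 - p, n); // P(X = 0)
        double sum  = term;
        for (int i = 1; i <= k; ++i) {
            // pmf recurrence: P(X = i) = P(X = i-1) * (n-i+1)/i * p/(1-p)
            term *= (n - i + 1) / double(i) * p / (1.0 - p);
            sum  += term;
        }
        return sum;
    }

    // Solve P(X <= k; n, p) == q for p by bisection; the CDF is
    // strictly decreasing in p, so the root is unique.
    double invert_for_p(int n, int k, double q)
    {
        double lo = 0.0, hi = 1.0;
        for (int iter = 0; iter < 60; ++iter) {
            double mid = 0.5 * (lo + hi);
            if (binomial_cdf(n, k, mid) > q)
                lo = mid; // CDF still too high: p must be larger
            else
                hi = mid;
        }
        return 0.5 * (lo + hi);
    }

    int main()
    {
        // e.g. 3 successes in 10 trials: the p at which seeing 3 or
        // fewer successes has cumulative probability 0.05.
        std::cout << invert_for_p(10, 3, 0.05) << '\n';
    }
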
Jumping in. That isn't a functional inversion at all. Given a particular set of observations, presumed to be a sample from an unknown member of a family of distributions, one can define an estimator -- a computation on the observed values -- for the distribution parameter. Generally multiple estimators are available. We are interested in the difference between the estimator and the unknown "true" value. Through some indirect thinking we can treat the true value as a random variable (sort of -- statisticians will cringe here) and the difference becomes a random variable as well, with its own distribution. Essentially the point estimator is the mean or a similar value of that distribution. Current practice prefers a confidence interval to a point estimate.

Here is a common (and commonly confused) example of multiple estimators. You have a sample of values, you want an estimator for the variance, and you have no theoretical knowledge of the mean. There are two common choices:

    S1 = sum((x[i] - mean(x))^2) / N
    S2 = sum((x[i] - mean(x))^2) / (N-1)

Which should you use? The distribution of the error in the first has a slightly smaller variance and so, in a sense, it is the more accurate estimator. The usual advice, though, is to go with the second: the first has a bias to it, leading to the possibility of accumulating large errors, while the second is unbiased. It doesn't make much difference for large samples, but you can choose whichever you want for small samples. (Both are sketched in code below.)

Note:

1) Estimators can be for any population statistic, not just ones that happen to be used as parameters of the distribution family.

2) As I said, there can be more than one estimator for a given statistic. For example, the sample median may be used as an estimator for the population mean when symmetry can be assumed, since it is less sensitive to outliers than the sample mean.

3) Estimators are arbitrary computations on a sample of values and may not be directly related to a distribution parameter the way the "hit count" is in your example. They are not, in general, a matter of plugging in a simple set of known scalar values.

4) You are also interested in auxiliary information for an estimator -- basically, information about its error distribution around the true population statistic. For example, when you use the sample mean to estimate the parameter mu (or, equivalently, the population mean) of a presumed normal distribution, you are interested in the "standard error": the estimated standard deviation of the estimator around the true mean.

I don't think this is really the kettle of worms you want to open up.
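
For concreteness, a minimal C++ sketch of the two variance estimators (the function names here are mine, not anything in the proposed library):

    #include <iostream>
    #include <numeric>
    #include <vector>

    double sample_mean(const std::vector<double>& x)
    {
        return std::accumulate(x.begin(), x.end(), 0.0) / x.size();
    }

    double sum_squared_deviations(const std::vector<double>& x)
    {
        double m = sample_mean(x), ss = 0.0;
        for (double xi : x)
            ss += (xi - m) * (xi - m);
        return ss;
    }

    // S1: divide by N. Slightly smaller error variance, but biased
    // (it systematically underestimates the population variance).
    double variance_s1(const std::vector<double>& x)
    {
        return sum_squared_deviations(x) / x.size();
    }

    // S2: divide by N-1. Unbiased; the usual recommendation.
    double variance_s2(const std::vector<double>& x)
    {
        return sum_squared_deviations(x) / (x.size() - 1);
    }

    int main()
    {
        std::vector<double> x = {2.1, 1.9, 2.4, 2.0, 2.2};
        std::cout << "S1 = " << variance_s1(x)
                  << ", S2 = " << variance_s2(x) << '\n';
    }
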
All of which means that in addition to a "generic" interface - however it turns out - we will still need distribution-specific ad hoc functions to invert for the parameterisation values, as well as the random variable.
Now there, I agree with you. Putting some commonly used computations in (e.g., the standard error given the sample size and sample standard deviation) would be nice. But don't kid yourself that you are going to build all of, say, Regress into this library in any reasonable amount of time. Hit the high points and don't even try for completeness.

Topher
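
As an illustration of the sort of convenience routine meant here -- a sketch only, with a hypothetical name, assuming the usual formula SE = s / sqrt(N):

    #include <cmath>
    #include <cstddef>

    // Standard error of the sample mean, from the sample size and the
    // sample standard deviation (the sqrt of the N-1 variance estimate).
    double standard_error(std::size_t n, double sample_sd)
    {
        return sample_sd / std::sqrt(double(n));
    }

With the estimators sketched earlier, this would be called as standard_error(x.size(), std::sqrt(variance_s2(x))).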