[Math Statistical Distributions] Hypergeometric distribution

Hi John, Do you have any plans to add support for the hypergeometric distribution? That distribution is important for the Fisher exact test for independence in a 2x2 table. --Johan

Johan Råde wrote:
Hi John,
Do you have any plans to add support for the hypergeometric distribution? That distribution is important for the Fisher exact test for independence in a 2x2 table.
Yes, in the sense that it's pretty much at the top of the TODO list: http://svn.boost.org/svn/boost/trunk/libs/math/doc/sf_and_dist/html/math_too... I haven't looked into the implementation yet though, or found any test values to sanity check against. If you're in need of it, fancy writing it? :-> John.

John Maddock wrote:
Johan Råde wrote:
Hi John,
Do you have any plans to add support for the hypergeometric distribution? That distribution is important for the Fisher exact test for independence in a 2x2 table.
Yes, in the sense that it's pretty much at the top of the TODO list: http://svn.boost.org/svn/boost/trunk/libs/math/doc/sf_and_dist/html/math_too...
I haven't looked into the implementation yet though, or found any test values to sanity check against. If you're in need of it, fancy writing it? :->
First I need to get the next release of our product out of the door, and then finish up the floating point classification and facets stuff. We'll see. I can't promise anything. --Johan

-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Johan Råde Sent: 05 May 2008 07:43 To: boost@lists.boost.org Subject: [boost] [Math Statistical Distributions] Hypergeometric distribution
Hi John,
Do you have any plans to add support for the hypergeometric distribution? That distribution is important for the Fisher exact test for independence in a 2x2 table.
A brief glance shows that Wikipedia lacks a formula for the cdf, but http://portal.acm.org/citation.cfm?id=151271.151274 looks promising:
Trong Wu (Southern Illinois Univ., Edwardsville), "An accurate computation of the hypergeometric distribution function", ACM Transactions on Mathematical Software (TOMS), Vol. 19, Issue 1 (March 1993), pp. 33-43, ISSN 0098-3500.
However, I don't have immediate access to the full article pdf. Suggestions?
MathCAD appears to offer the density, cumulative and inverse, so this could provide some independent test values.
So I might be persuaded to tackle this (rather tedious - the formulae look a bit messy) task. Paul --- Paul A Bristow Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB +44 1539561830 & SMS, Mobile +44 7714 330204 & SMS pbristow@hetp.u-net.com

On Mon, May 5, 2008 at 3:19 AM, Paul A Bristow <pbristow@hetp.u-net.com> wrote:
[...] A brief glance shows that Wikipedia lacks a formula for the cdf
but http://portal.acm.org/citation.cfm?id=151271.151274 looks promising
I can get that for you, subject to the ACM terms of usage (can't post it up to the list, can't put it on a server, etc): http://www.acm.org/publications/policies/usage

-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Paul A Bristow Sent: 05 May 2008 11:19 To: boost@lists.boost.org Subject: Re: [boost] [Math Statistical Distributions] Hypergeometric distribution
Do you have any plans to add support for the hypergeometric distribution? That distribution is important for the Fisher exact test for independence in a 2x2 table.
A brief glance shows that Wikipedia lacks a formula for the cdf
but http://portal.acm.org/citation.cfm?id=151271.151274 looks promising
Trong Wu (Southern Illinois Univ., Edwardsville), "An accurate computation of the hypergeometric distribution function", ACM Transactions on Mathematical Software (TOMS), Vol. 19, Issue 1 (March 1993), pp. 33-43, ISSN 0098-3500.
A brief study of this (kindly provided by Oliver Seiler) shows it to be 'interesting' - but outstandingly accurate.
MathCAD appears to offer the density, cumulative and inverse, so this could provide some independent test values.
Chapter 4 of K. Krishnamoorthy's Handbook of Statistical Distributions with Applications deals with this distribution and also gives FORTRAN algorithms, but Wu (see above ref) shows that these approximations can be spectacularly inaccurate at times. So this (and other distributions) might be a good GSoC? Paul --- Paul A Bristow Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB +44 1539561830 & SMS, Mobile +44 7714 330204 & SMS pbristow@hetp.u-net.com

Paul A Bristow wrote:
A brief study of this (kindly provided by Oliver Seiler) shows it to be 'interesting' - but outstandingly accurate.
MathCAD appears to offer the density, cumulative and inverse, so this could provide some independent test values.
Chapter 4 of K. Krishnamoorthy's Handbook of Statistical Distributions with Applications deals with this distribution and also gives FORTRAN algorithms, but Wu (see above ref) shows that these approximations can be spectacularly inaccurate at times.
So this (and other distributions) might be a good GSoC?
I don't think it's big enough for a SOC on its own - shouldn't be more than a couple of days' work - a week at worst? The tricky bit, as Wu notes, is calculating the h(0) term - after that it's a reasonably straightforward series evaluation. But... I'm not completely convinced by the practicality of Wu's method - it's really very cunning, no doubt about that - but it requires a table of all the prime numbers smaller than the sample size; the first 1000 primes would take you up to 8K, but then you need another table of the same size to keep track of all the common factors (unless I'm missing a trick somewhere). We could get the first term from two calls to tgamma_delta_ratio, but I haven't completely convinced myself that it won't unnecessarily under/overflow. Otherwise you're into using logs and lgamma, which is rather prone to cancellation errors in calculating the result :-( John.
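For illustration, here is a minimal sketch of the two routes mentioned above for the leading term, using the usual parameterisation (k defectives in a sample of size n drawn without replacement from a population of N containing r defectives); the function names are invented for the sketch, and nothing here is the eventual library implementation:

    #include <cmath>
    #include <boost/math/special_functions/gamma.hpp>

    // Hypergeometric pdf, P(X = k) = C(r, k) * C(N - r, n - k) / C(N, n),
    // for k within the support max(0, n + r - N) <= k <= min(n, r).
    //
    // Route 1: logs and lgamma.  Simple, but the final exp() of a sum of large
    // logs is exactly where the cancellation mentioned above can bite.
    double hypergeometric_pdf_lgamma(int k, int r, int n, int N)
    {
        double log_p = std::lgamma(r + 1.0) - std::lgamma(k + 1.0) - std::lgamma(r - k + 1.0)
                     + std::lgamma(N - r + 1.0) - std::lgamma(n - k + 1.0) - std::lgamma(N - r - n + k + 1.0)
                     - (std::lgamma(N + 1.0) - std::lgamma(n + 1.0) - std::lgamma(N - n + 1.0));
        return std::exp(log_p);
    }

    // Route 2: the leading term from two tgamma_delta_ratio calls, for the case
    // where the lowest attainable k is 0 (i.e. n + r <= N):
    //   P(X = 0) = C(N - r, n) / C(N, n)
    //            = [G(N-r+1)/G(N+1)] / [G(N-r-n+1)/G(N-n+1)],
    // and each bracket is one tgamma_delta_ratio(a, r) = G(a) / G(a + r).
    double hypergeometric_first_term(int r, int n, int N)
    {
        using boost::math::tgamma_delta_ratio;
        return tgamma_delta_ratio(N - r + 1.0, static_cast<double>(r))
             / tgamma_delta_ratio(N - r - n + 1.0, static_cast<double>(r));
    }

As a sanity check, with N = 10, r = 3, n = 4 both routes give P(X = 0) = C(7,4)/C(10,4) = 35/210 = 1/6.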

-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of John Maddock Sent: 06 May 2008 13:16 To: boost@lists.boost.org Subject: Re: [boost] [Math Statistical Distributions] Hypergeometric distribution
Paul A Bristow wrote:
A brief study of this (kindly provided by Oliver Seiler) shows it to be 'interesting' - but outstandingly accurate.
MathCAD appears to offer the density, cumulative and inverse, so this could provide some independent test values.
Chapter 4 of K. Krishnamoorthy's Handbook of Statistical Distributions with Applications deals with this distribution and also gives FORTRAN algorithms, but Wu (see above ref) shows that these approximations can be spectacularly inaccurate at times.
So this (and other distributions) might be a good GSoC?
I don't think it's big enough for a SOC on its own - shouldn't be more than a couple of days' work - a week at worst?
The tricky bit, as Wu notes, is calculating the h(0) term - after that it's a reasonably straightforward series evaluation. But... I'm not completely convinced by the practicality of Wu's method - it's really very cunning, no doubt about that - but it requires a table of all the prime numbers smaller than the sample size; the first 1000 primes would take you up to 8K, but then you need another table of the same size to keep track of all the common factors (unless I'm missing a trick somewhere).
Those sound like fairly serious requirements.
We could get the first term from two calls to tgamma_delta_ratio, but I haven't completely convinced myself that it won't unnecessarily under/overflow. Otherwise you're into using logs and lgamma, which is rather prone to cancellation errors in calculating the result :-(
Well, option B is to use some quick, and sometimes dirty, approximation like everyone else does. I'm sure this won't appeal to your philosophy, but nothing stops us trying Wu's method later. What accuracy does Johan (and others) *need* for his(their) application? Paul --- Paul A Bristow Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB +44 1539561830 & SMS, Mobile +44 7714 330204 & SMS pbristow@hetp.u-net.com

Paul A Bristow wrote:
What accuracy does Johan (and others) *need* for his(their) application?
I'm working with biological data. Almost all the input data have errors around 10 - 30% (or more) :-) Two correct decimals is more than my customers need. But other users do of course have other needs.

Johan Råde wrote:
Paul A Bristow wrote:
What accuracy does Johan (and others) *need* for his(their) application?
I'm working with biological data. Almost all the input data have errors around 10 - 30% (or more) :-) Two correct decimals is more than my customers need. But other users do of course have other needs.
Can't remember where I saw it now, but I believe folks recommend at least 6 decimal digits of precision in the stat software's calculations. I believe the problem with using lgamma is that you can end up with no digits correct in some cases :-( John.

-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Johan Råde Sent: 06 May 2008 17:38 To: boost@lists.boost.org Subject: Re: [boost] [Math Statistical Distributions] Hypergeometric distribution
Paul A Bristow wrote:
What accuracy does Johan (and others) *need* for his(their) application?
I'm working with biological data. Almost all the input data have errors around 10 - 30% (or more) :-) Two correct decimals is more than my customers need. But other users do of course have other needs.
Scan of the code used in Krishnamoorthy's book, FWIW. His Statcalc program presumably uses something like this (in C), so this could be a quick'n'dirty solution - and probably simpler to do than Wu's magic version (which looks a *lot* cleverer than I am ;-) Paul --- Paul A Bristow Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB +44 1539561830 & SMS, Mobile +44 7714 330204 & SMS pbristow@hetp.u-net.com
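For what it's worth, here is a sketch of the quick'n'dirty route under discussion (this is not the algorithm in the scanned code, just the same general idea): compute the lowest-support term once, then build the cdf by accumulating the recurrence P(k+1) = P(k) * (r - k)(n - k) / ((k + 1)(N - r - n + k + 1)). The function name and parameter order are invented for the example:

    #include <algorithm>
    #include <cmath>

    // Naive hypergeometric cdf: first term via lgamma, then accumulate the
    // term-to-term recurrence
    //   P(k+1) = P(k) * (r - k)(n - k) / ((k + 1)(N - r - n + k + 1)).
    // Only a sketch of the general idea; accuracy degrades for extreme parameters.
    double hypergeometric_cdf_naive(int x, int r, int n, int N)
    {
        const int k_min = std::max(0, n + r - N);
        const int k_max = std::min(n, r);
        x = std::min(x, k_max);

        // P(X = k_min) from log-binomial coefficients.
        double log_p = std::lgamma(r + 1.0) - std::lgamma(k_min + 1.0) - std::lgamma(r - k_min + 1.0)
                     + std::lgamma(N - r + 1.0) - std::lgamma(n - k_min + 1.0) - std::lgamma(N - r - n + k_min + 1.0)
                     - (std::lgamma(N + 1.0) - std::lgamma(n + 1.0) - std::lgamma(N - n + 1.0));
        double p = std::exp(log_p);

        double sum = p;
        for (int k = k_min; k < x; ++k)
        {
            p *= double(r - k) * double(n - k)
               / (double(k + 1) * double(N - r - n + k + 1));
            sum += p;
        }
        return sum;
    }

The running product keeps everything in ordinary floating point, so accuracy will drift for extreme parameters - which is presumably where Wu's more careful method earns its keep.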

There is one difficulty with the two-sided Fisher exact test. To calculate a p-value for the left-sided test, you take the cdf of the hypergeometric distribution at the observed value of the test statistic. If you want to calculate a p-value for the right-sided test, then you just take the cdf complement at the observed value of the test statistic. But for a two-sided test, well, say that the observed value of the test statistic is n. Then you should sum the pdf over all k such that pdf(k) <= pdf(n). This means summing over both tails. And since the distribution is not symmetric, you cannot just sum over one tail and multiply by 2, as you do with the 2-sided t-test. I don't see how to do that in a clean way using the current statistical distributions API. (Am I missing something?) Maybe some extension to the statistical distributions API is needed. Something like cdf(symmetric(dist,x)) for the sum/integral of pdf(dist,y) over all y such that pdf(dist,y) <= pdf(dist,x). --Johan Råde

Johan Råde wrote:
There is one difficulty with the two-sided Fisher exact test.
To calculate a p-value for the left-sided test, you take the cdf of the hypergeometric distribution at the observed value of the test statistic. If you want to calculate a p-value for the right-sided test, then you just take the cdf complement at the observed value of the test statistic.
But for a two-sided test, well, say that the observed value of the test statistic is n. Then you should sum the pdf over all k such that pdf(k) <= pdf(n). This means summing over both tails. And since the distribution is not symmetric, you cannot just sum over one tail and multiply by 2, as you do with the 2-sided t-test.
I don't see how to do that in a clean way using the current statistical distributions API. (Am I missing something?)
This is true of all asymmetric distributions of course, you need to add the two tails calculated separately: cdf(hypergeometric(), n) + cdf(hypergeometric(), total - n) Ah... wait, because it's discrete, that misses out one value from the right tail? So should be: cdf(hypergeometric(), n) + cdf(hypergeometric(), total - n - 1) ???
Maybe some extension to the statistical distributions API is needed. Something like cdf(symmetric(dist,x)) for the sum/integral of pdf(dist,y) over all y such that pdf(dist,y) <= pdf(dist,x).
Hmm, that's a slightly different quantity: I'm not especially familiar with Fisher's exact test, but from what I've seen there appear to be differences of opinion on how two-sided tests are calculated? For the second "side" you want to sum the probabilities of all the contingency tables that are "at least as extreme" as the one you observed, but in the other direction. One way, as I've suggested above, is to sum all the tables that are as *asymmetric* as the one you observe; yours is to sum all the tables with lower or equal *probability*. Are you certain you require the latter? I'm sure you are... just double checking :-) I don't see any easy way of doing this, except by brute force - or maybe doing a numeric inversion on the PDF to find the correct right-tail test statistic value, and then using the CDFs as above? John.
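A brute-force sketch of the probability-ordering rule Johan describes (note that later in the thread he questions whether this is the right rule): sum P(X = k) over the whole support, keeping only the values no more probable than the observed one. The names are invented for the example, and the small slack factor is just a guard against ties being lost to rounding:

    #include <algorithm>
    #include <cmath>

    // pdf helper, as in the lgamma sketch earlier in the thread.
    double hg_pdf(int k, int r, int n, int N)
    {
        double log_p = std::lgamma(r + 1.0) - std::lgamma(k + 1.0) - std::lgamma(r - k + 1.0)
                     + std::lgamma(N - r + 1.0) - std::lgamma(n - k + 1.0) - std::lgamma(N - r - n + k + 1.0)
                     - (std::lgamma(N + 1.0) - std::lgamma(n + 1.0) - std::lgamma(N - n + 1.0));
        return std::exp(log_p);
    }

    // Two-sided p-value by brute force: sum P(X = k) over every k in the support
    // whose probability does not exceed that of the observed value.
    double fisher_two_sided_p(int observed, int r, int n, int N)
    {
        const int k_min = std::max(0, n + r - N);
        const int k_max = std::min(n, r);
        const double p_obs = hg_pdf(observed, r, n, N);
        const double slack = 1.0 + 1e-7;  // guard against ties lost to rounding
        double p_value = 0.0;
        for (int k = k_min; k <= k_max; ++k)
        {
            const double p = hg_pdf(k, r, n, N);
            if (p <= p_obs * slack)
                p_value += p;
        }
        return p_value;
    }

For the table sizes typical of a 2x2 Fisher test the support is small, so the loop over the whole support is cheap.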

John Maddock wrote:
This is true of all asymmetric distributions of course, you need to add the two tails calculated separately:
cdf(hypergeometric(), n) + cdf(hypergeometric(), total - n)
Ah... wait, because it's discrete, that misses out one value from the right tail? So should be:
cdf(hypergeometric(), n) + cdf(hypergeometric(), total - n - 1) ???
If the distribution is asymmetric you may want neither total - n nor total - n - 1. What you want is sum/integral of pdf(dist,y) over all y such that pdf(dist,y) <= pdf(dist,x). You actually have to do a bit of work to find out what the right cut-off is for the other tail. --Johan

Johan Råde wrote:
John Maddock wrote:
This is true of all asymmetric distributions of course, you need to add the two tails calculated separately:
cdf(hypergeometric(), n) + cdf(hypergeometric(), total - n)
Ah... wait, because it's discrete, that misses out one value from the right tail? So should be:
cdf(hypergeometric(), n) + cdf(hypergeometric(), total - n - 1) ???
If the distribution is asymmetric you may want neither total - n nor total - n - 1. What you want is
sum/integral of pdf(dist,y) over all y such that pdf(dist,y) <= pdf(dist,x).
I consulted some statistics books, and realized that the above statement is complete nonsense. That is not how you do it; it is in fact more complex than that. Finding the right cut-off for the other tail in a two-sided test can be tricky. (In many situations there is an exact answer, given by a so-called UMP (uniformly most powerful) test. But the formulas that give these tests can be of a rather implicit type.) For symmetric distributions (for instance the t-test), you just calculate one tail and multiply by two. For the F-test that does not give the exact answer, but is known to give a good approximation. For the Fisher exact test (hypergeometric distribution) that is often not even a good approximation, and some other rule should be used to find the correct cut-off for the other tail. I'll look some more into this. --Johan
participants (4)
- Johan Råde
- John Maddock
- Oliver Seiler
- Paul A Bristow