[Probability] interest in probability/likelihood library

This note is to announce the release of a library for handling probability and likelihood quantities, together with their logarithms, in a consistent and (when desired) transparent manner. Ultimately, I think this could be useful to the Boost community (e.g., it may find application within the recently reviewed math toolkit or be useful together with the recently accepted units library, from which it draws certain ideas). I would, however, appreciate feedback on that point, as well as on ways to improve its design, implementation, or documentation. What follows is an overview and an indication of a couple of areas I know I need help with.

More information is also available at

    http://biology.nmsu.edu/software/probability

The code may be downloaded at

    ftp://biology.nmsu.edu/pub/software/probability/probability-0.1.tar.gz

OVERVIEW

As you know, both probabilities and likelihoods recur throughout statistical models. Consequently, those quantities must be represented by the type system in computational statistical models. Commonly, this is done by using a suitable native floating point type (e.g., double). One specific drawback of this approach is that the compiler cannot enforce the correct usage of types, so confusion may arise between probabilities/likelihoods and other floating point quantities.

More insidious, however, is the confusion likely to arise when the computations involve both these quantities and their logarithms. Indeed, the solution to many statistical models involves logarithms of probabilities and likelihoods more than the natural quantities. However, models that require both are not uncommon. If a native floating point type is used for probabilities and likelihoods, the programmer must not only distinguish those from other floating point quantities (e.g., parameters), but must also distinguish them from their logarithms. Subtle mistakes easily result from forgetting which domain (linear or logarithmic) a quantity refers to.

Much clearer code would result if the following syntax were available to express the intents that (i) the likelihood should be calculated as a product across independent observations, and (ii) it should be accumulated using logarithms to avoid underflow.

    probability poisson (unsigned int i);    // probability model
    log_likelihood l;
    // product of likelihoods across a series of independent observations
    for (observations::const_iterator i = obs.begin(); i != obs.end(); ++i)
      l *= poisson(*i);

The purpose of this library is to encapsulate both probability and likelihood quantities within an appropriate set of types, while simultaneously achieving the following design goals.

- Provide both convenient default and more flexible advanced types for representing all probability and likelihood quantities.

- Maintain type safety between probabilities and likelihoods, and between their corresponding native quantities and logarithms.

- Incur no runtime cost. That is, all type manipulations should be performed at compile time.

- Impose no limitations on the type used to represent the value of a probability or likelihood, beyond the obvious requirement that it models a real number within a suitable domain.

- Ensure that the layout of arrays and structures of probability and likelihood quantities is identical to the corresponding layout for the underlying value type.

- Verify the validity of probability values within the closed domain [0,1] and likelihood values within the closed domain [0,infinity], both as native quantities and as logarithms.

- Support the validator concept to allow either complete removal of the validation checks, thereby eliminating their run-time cost, or replacement with an alternative.

- Provide all appropriate arithmetic operations, while limiting implicit type conversions to those absolutely necessary and naturally expected.

- Provide consistent semantics for the arithmetic operations in terms of their definitions for native quantities in the linear domain. One consequence of this is that in many contexts the quantities can be regarded simply as probabilities or likelihoods without respect to their representational domain. Another important consequence is that generic algorithms may be constructed based on a common set of operators that retain their semantics across the representational domains.

Given these design goals, when the types provided by the library are used the compiler can enforce type safety, provide conversions as required, and guarantee that the intent is directly expressed in the source code. Furthermore, any type that models a real number may be used as the value type for probabilities and likelihoods. If a double is sufficient, however, the simple default types make the task of instantiating these quantities easier.

AREAS NEEDING ASSISTANCE

The following are a few items I can identify that I need assistance with.

- Compiler specifics: This is thoroughly tested with g++ (v3 and v4), but I need information on its portability to other compilers.

- The test cases seem to require BOOST_TEST_DONT_PRINT, even though stream operators are present. Although the current code works, it seems incorrect to require that macro.

- Some tests cannot use BOOST_CHECK_EQUAL when they otherwise seem like they should. The problem appears to be an interaction with the printing of test cases and may be related to the previous item.

- Being new to the Boost community, I am unfamiliar with how to incorporate the tests and the documentation into the Boost framework. Guidance here is welcome.

- Anything else that I may have overlooked.

Thanks for your input. I hope this library is close to Boost standards and can be improved to the point of being a worthwhile inclusion within the set of libraries.

--
Brook Milligan                                      Internet: brook@nmsu.edu
Department of Biology New Mexico State University   Telephone: (505) 646-7980
Las Cruces, New Mexico 88003 U.S.A.                 FAX: (505) 646-5665
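To make the syntax in the announcement concrete, here is a minimal sketch of how a linear-domain probability and a log-domain likelihood accumulator could interact. It is not the library's implementation: the class bodies, the `value()` and `log_value()` accessors, the `accumulate` helper, and the fixed Poisson mean of 3.0 are all assumptions made for illustration, and validation is done with a plain `assert` rather than the validator concept mentioned in the design goals. It assumes a C++11 compiler for `std::lgamma`.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for a linear-domain probability type.
class probability {
public:
    explicit probability(double p) : value_(p) {
        assert(p >= 0.0 && p <= 1.0);  // closed domain [0,1]
    }
    double value() const { return value_; }
private:
    double value_;
};

// Hypothetical stand-in for a likelihood stored as its logarithm.
class log_likelihood {
public:
    log_likelihood() : log_value_(0.0) {}  // log(1) = 0, the multiplicative identity

    // Multiplication keeps its linear-domain meaning, but is carried out
    // by adding logarithms so that long products do not underflow.
    log_likelihood& operator*=(const probability& p) {
        log_value_ += std::log(p.value());
        return *this;
    }
    double log_value() const { return log_value_; }
private:
    double log_value_;
};

// Poisson probability of observing k events; the mean is an assumed
// illustrative parameter and must be positive.
probability poisson(unsigned int k, double mean = 3.0) {
    // exp(k*log(mean) - mean - log(k!))
    return probability(std::exp(k * std::log(mean) - mean - std::lgamma(k + 1.0)));
}

// Product of likelihoods across a series of independent observations.
log_likelihood accumulate(const std::vector<unsigned int>& obs) {
    log_likelihood l;
    for (std::vector<unsigned int>::const_iterator i = obs.begin(); i != obs.end(); ++i)
        l *= poisson(*i);
    return l;
}
```

With a few thousand observations the linear-domain product underflows to zero in a double, while the accumulated logarithm stays finite, which is the motivation given for intent (ii) above.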

Brook Milligan wrote:
> This note is to announce the release of a library for handling probability and likelihood quantities, together with their logarithms, in a consistent and (when desired) transparent manner.

> If a native floating point type is used for probabilities and likelihoods,

You should probably define those terms in the documentation.

> The purpose of this library is to encapsulate both probability and likelihood quantities within an appropriate set of types, while simultaneously achieving the following design goals.

> - Incur no runtime cost. That is, all type manipulations should be performed at compile time.

That does not strike me as quite the same.

> - Ensure that the layout of arrays and structures of probability and likelihood quantities is identical to the corresponding layout for the underlying value type.

Good. But it still needs to be tested whether the performance equals that of normal arrays or matrices of floats.

> Thanks for your input. I hope this library is close to Boost standards and can be improved to the point of being a worthwhile inclusion within the set of libraries.

It seems like a good start. I can only say that if the runtime performance cost is zero, then I would find the library very interesting, and I would not hesitate to use it in my work on Bayesian networks.

best regards

-Thorsten
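The layout goal discussed above can at least be checked at compile time, separately from any timing benchmark. The following is a sketch under stated assumptions: `probability_sketch` is a hypothetical single-member wrapper, not the library's type, and the checks use C++11 `static_assert` and `<type_traits>`, which postdate the compilers mentioned in the thread. Identical layout does not by itself prove identical run-time performance, so it complements rather than replaces the measurement requested here.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical wrapper holding exactly one value of the underlying type.
struct probability_sketch {
    double value;
};

// Compile-time checks that the wrapper adds no storage or alignment overhead,
// so an array of wrappers occupies the same bytes as an array of doubles.
static_assert(sizeof(probability_sketch) == sizeof(double),
              "wrapper must not add storage");
static_assert(alignof(probability_sketch) == alignof(double),
              "wrapper must not change alignment");
static_assert(std::is_standard_layout<probability_sketch>::value,
              "wrapper must be standard-layout");
static_assert(std::is_trivially_copyable<probability_sketch>::value,
              "wrapper must be trivially copyable");

// Walks an array of wrappers exactly as one would walk an array of doubles.
double sum_values(const probability_sketch* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        total += p[i].value;
    return total;
}
```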

Several people have commented, both publicly and privately, on the Probability library I mentioned last week. There is now a new version at

    http://biology.nmsu.edu/software/probability/

that addresses most of the concerns. In what follows I will address the salient points as I see them.

- The main documentation page now begins with a brief definition of probability and likelihood.

- Runtime costs have now been quantified in a fairly simple manner. The results are summarized on the main page, and indicate that there is less than a 0.5% effect in a test in which a large fraction of the operations involve these quantities. I suspect this is well within the noise, but input from those with greater benchmarking experience is welcome.

- Additive operators are now provided within the log domain. This completes the full set of arithmetic operators.

- A suggestion was made to combine this with the math toolkit (and possibly the units) library. I hesitate to do this immediately, until it is clear that the Probability library is indeed acceptable. It seems that the process would occur in stages: handle this one on its own, then work on integration if that is generally a desirable direction. This should stand on its own merits, at least initially.

- Another suggestion focused on the potential for a numerical value type for the log domain, independent of probabilities. Clearly, that is contained within this library, and such a type could be extracted for independent use. Had such a type existed, a portion of this library would have been simpler. However, such a type will not address the interconversions between probabilities and likelihoods that form a natural part of much statistical modeling. Thus, the higher level types incorporated here remain important, with or without a general log domain type. For now it seems that this is an implementation detail from the perspective of the Probability library. If there is a strong interest in such a type, perhaps this library could be refactored into two. Again, I would opt for waiting to assess the acceptability of this library and the general level of interest in these different facets.

I appreciate the comments and welcome other ideas. I hope that more people will look over the new version of the library and provide feedback.

Thanks for your interest.

Cheers,
Brook

--
Brook Milligan                                      Internet: brook@nmsu.edu
Department of Biology New Mexico State University   Telephone: (505) 646-7980
Las Cruces, New Mexico 88003 U.S.A.                 FAX: (505) 646-5665
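For readers wondering what addition within the log domain involves, one common technique is the "log-sum-exp" identity, log(a + b) = hi + log1p(exp(lo - hi)), where hi and lo are the larger and smaller of log(a) and log(b). The sketch below illustrates that technique only; the message above does not say how the library implements its additive operators, and the free function `log_add` is a hypothetical name. It assumes C++11 for `std::log1p`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Returns log(exp(log_a) + exp(log_b)) without converting either operand
// back to the linear domain, where it might underflow to zero.
double log_add(double log_a, double log_b) {
    assert(!std::isnan(log_a) && !std::isnan(log_b));
    const double inf = std::numeric_limits<double>::infinity();
    if (log_a == -inf) return log_b;  // adding zero in the linear domain
    if (log_b == -inf) return log_a;
    const double hi = std::max(log_a, log_b);
    const double lo = std::min(log_a, log_b);
    if (hi == inf) return inf;        // an infinite likelihood dominates
    // lo - hi <= 0, so exp() cannot overflow and log1p() stays accurate.
    return hi + std::log1p(std::exp(lo - hi));
}
```

For example, `log_add(-1000.0, -1000.0)` gives approximately -1000 + log(2), whereas exponentiating either operand first would yield zero in a double.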

-----Original Message-----
From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Brook Milligan
Sent: 02 May 2007 19:10
To: boost@lists.boost.org
Subject: [boost] [Probability] interest in probability/likelihood library

> This note is to announce the release of a library for handling probability and likelihood quantities, together with their logarithms, in a consistent and (when desired) transparent manner. Ultimately, I think this could be useful to the Boost community (e.g., it may find application within the recently reviewed math toolkit or be useful together with the recently accepted units library, from which it draws certain ideas). I would however appreciate feedback on that point, as well as on ways to improve its design, implementation, or documentation. What follows is an overview, and an indication of a couple of areas I know I need help with. More information is also available at
>
> http://biology.nmsu.edu/software/probability
>
> The code may be downloaded at
>
> ftp://biology.nmsu.edu/pub/software/probability/probability-0.1.tar.gz

At a very quick glance this has obvious benefits (though quite how big those benefits are is less clear to me at this glance).

I have put on my TODO list seeing how this code plays with the recently accepted math toolkit and statistical distributions.

But it might be more useful if *you* tried to combine these two - if not the units library as well ;-)

(The final toolkit version will not appear until the 1.35 release, of course, but a version is available in the sandbox - but it has recently 'escaped' to SVN and I haven't regained access to it yet. You will also need the Boost 1.34 release.)

Paul

---
Paul A Bristow
Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB
+44 1539561830 & SMS, Mobile +44 7714 330204 & SMS
pbristow@hetp.u-net.com

It seems like what is really needed is a numerical type that stores the value and does calculations in log-space (i.e. operator* does addition, operator/ does subtraction, pow does multiplication). Presumably it would be defined as a template over the underlying type used for storage. I don't see why this needs to be or should be coupled to probabilities in any way.

--
Jeremy Maitin-Shepard
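The type suggested in this message can be sketched in a few lines. Everything here is hypothetical: the name `log_value`, its `from_log`, `log_repr`, and `linear` members, and the choice to validate with `assert` are assumptions for illustration and do not correspond to any existing Boost component. The sketch covers only the three operations named above and only non-negative linear-domain values.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical numeric type that stores log(x) and maps multiplicative
// operations on x to additive operations on the stored logarithm.
template <typename T>
class log_value {
public:
    // Construct from a non-negative linear-domain value.
    explicit log_value(T x) {
        assert(x >= T(0));
        log_ = std::log(x);
    }
    // Construct directly from an already-computed logarithm.
    static log_value from_log(T lx) {
        log_value v;
        v.log_ = lx;
        return v;
    }
    T log_repr() const { return log_; }         // stored logarithm
    T linear() const { return std::exp(log_); } // may underflow to zero

    // operator* does addition, operator/ does subtraction,
    // and pow does multiplication, all on the stored logarithm.
    friend log_value operator*(log_value a, log_value b) {
        return from_log(a.log_ + b.log_);
    }
    friend log_value operator/(log_value a, log_value b) {
        return from_log(a.log_ - b.log_);
    }
    friend log_value pow(log_value a, T exponent) {
        return from_log(a.log_ * exponent);
    }
private:
    log_value() : log_(T(0)) {}
    T log_;
};
```

A value such as `log_value<double>::from_log(-2000.0)` can be squared with `operator*` to give a stored logarithm of -4000, even though neither operand is representable as a nonzero double in the linear domain.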
Participants (4):
- Brook Milligan
- Jeremy Maitin-Shepard
- Paul A Bristow
- Thorsten Ottosen