[Probability] version 0.2.3 released

To better motivate the need for the Boost Probability library, I have updated the documentation, which is accessible at:

http://biology.nmsu.edu/software/probability/

Although this constitutes a new release, the only difference is in the documentation. As a result, the contents of v0.2.2 in the Boost Vault still reflect exactly the most recent release, and I haven't uploaded a new copy.

The new motivational example is taken from the problem of ascertaining the long-term trend of global climate. One database used to assess this is available from the NOAA National Climatic Data Center (http://www.ncdc.noaa.gov/oa/climate/ghcn-monthly/index.php). It contains monthly data for thousands of stations worldwide, in many cases spanning decades. Today's version, for example, contains 590,543 records of mean temperature. A typical likelihood calculation evaluating a model of climate would involve a product of likelihoods across all of these records, almost certainly yielding a result on the order of 10^{-600,000} or less. Such numbers cannot be represented by typical floating-point types (an IEEE 754 double underflows to zero below roughly 10^{-324}), so specialized solutions of some form are required. The natural method is to accumulate the sum of the logarithms of the likelihoods, rather than the product of the likelihoods, across the dataset (see the first sketch below). This keeps the values within suitable bounds, but it requires keeping track of the fact that several different kinds of values (probabilities, likelihoods, and log likelihoods) are in use throughout a typical program. If these are all represented using native types, such as double, it is easy to lose track of the fact that they have different semantics.

A real solution to this problem would include modules that calculate the probability of each individual data record and modules that accumulate that information across the records. The problem is complex enough that each of these responsibilities would realistically be divided across many units, and it would not be unreasonable to expect development to be divided among many programmers. In such situations it is all too easy to lose track of which semantics apply to a specific value when the only information available in the code is the data type (e.g., double), which provides little help, plus some (perhaps untrustworthy) comments that may or may not be read and in any case cannot affect the compiler.

Using the Probability library, one can encode the exact semantics in the type system in a way that lends itself to generic programming (see the second sketch below). The resulting clarity, safety, and maintainability are retained regardless of how large the code base becomes or how the operations are distributed across modules and/or programmers.

As a result of these features, I feel that this library makes a significant contribution to solving a well-defined set of problems that occur in certain kinds of scientific programming and modeling. I hope you will take a serious look at its capabilities and provide me with further feedback. I am especially interested in improving the portability of the code and need testers with access to compilers other than g++.

I look forward to your comments, suggestions, and general discussion. Thank you.

Cheers, Brook
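P.S. Here is a minimal sketch of the log-sum technique, independent of the library itself. Multiplying roughly 600,000 small likelihoods underflows a double almost immediately, while the sum of their logarithms remains an ordinary, well-scaled value:

    #include <cmath>
    #include <vector>

    // Accumulate log(p1 * p2 * ... * pn) as log(p1) + ... + log(pn).
    // The product of ~600,000 likelihoods would underflow to zero, but
    // the running sum of logarithms stays well within range of a double.
    double log_likelihood(const std::vector<double>& record_likelihoods)
    {
        double sum = 0.0;
        for (std::vector<double>::const_iterator i = record_likelihoods.begin();
             i != record_likelihoods.end(); ++i)
            sum += std::log(*i);    // log(a*b) == log(a) + log(b)
        return sum;
    }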
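P.P.S. To show the kind of type-level encoding I have in mind, here is a deliberately simplified sketch. The types below are hypothetical and are not the library's actual interface (see the documentation linked above for that); the point is only that once probabilities and log likelihoods are distinct types, the compiler itself rejects code that confuses them:

    #include <cmath>

    // Hypothetical wrapper types, for illustration only.
    struct probability {
        explicit probability(double v) : value(v) {}
        double value;       // a probability in [0, 1]
    };

    struct log_likelihood {
        explicit log_likelihood(double v) : value(v) {}
        double value;       // the logarithm of a likelihood
    };

    // The one sanctioned way to fold a record's probability into a total.
    inline log_likelihood operator+(log_likelihood total, probability p)
    {
        return log_likelihood(total.value + std::log(p.value));
    }

    // log_likelihood total(0.0);
    // total = total + probability(0.3);   // ok: semantics are explicit
    // total = total + 0.3;                // compile error: 0.3 has no semantics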

-----Original Message-----
From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Brook Milligan
Sent: 15 August 2007 22:48
To: boost@lists.boost.org
Subject: [boost] [Probability] version 0.2.3 released
Thanks for this further motivational example: it had previously seemed a bit of a sledgehammer to crack a nut, but I now see situations in which it could provide much more safety than I had imagined.

I am busy trying to help John Maddock get the Math Toolkit 'out of the door' and fully into Boost (1.35?), but I will return to look at this in more detail.

Paul

---
Paul A Bristow
Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB
+44 1539561830 & SMS, Mobile +44 7714 330204 & SMS
pbristow@hetp.u-net.com

Paul A Bristow writes:
Thanks for this further motivational example: it had previously seemed a bit of a sledgehammer to crack a nut, but I now see situations in which it could provide much more safety than I had imagined.
The difficulty with constructing a simple motivational example is that the true value may not become apparent until one faces a real-world problem that is much more complex than an example warrants. Thus, perhaps some imagination is required to extrapolate from the structure of the example to what one might normally encounter. It seems that you are beginning to make that leap. I would appreciate suggestions on how to make that leap easier so that people can appreciate the power of this abstraction.

Thanks for your comments. I look forward to more detailed reviews and tests.

Cheers, Brook