[Probability] interest in probability/likelihood library

2 May 2007

      This note is to announce the release of a library for handling
probability and likelihood quantities, together with their logarithms,
in a consistent and (when desired) transparent manner.  Ultimately, I
think this could be useful to the Boost community (e.g., it may find
application within the recently reviewed math toolkit or be useful
together with the recently accepted units library, from which it draws
certain ideas).  I would however appreciate feedback on that point, as
well as on ways to improve its design, implementation, or
documentation.  What follows is an overview, and an indication of a
couple of areas I know I need help with.  More information is also
available at

    http://biology.nmsu.edu/software/probability

The code may be downloaded at 

    ftp://biology.nmsu.edu/pub/software/probability/probability-0.1.tar.gz

OVERVIEW

As you know, both probabilities and likelihoods recur throughout
statistical models.  Consequently, those quantities must be
represented by the type system in computational statistical models.
Commonly, this is done by using a suitable native floating point type
(e.g., double).  One specific drawback with this approach is that the
compiler cannot enforce the correct usage of types and confusion may
arise between probabilities/likelihoods and other floating point
types.

More insidious, however, is the confusion likely to arise when the
computations involve both these quantities and their logarithms.
Indeed, the solution to many statistical models involves logarithms of
probabilities and likelihoods more than the natural quantities.
However, models that require both are not uncommon.

If a native floating point type is used for probabilities and
likelihoods, the programmer must not only distinguish those from other
floating point quantities (e.g., parameters), but also must
distinguish them from their logarithms.  Subtle mistakes easily result
from forgetting which domain (linear or logarithm) a quantity refers
to.

Much clearer code would result if the following syntax were available
to express the intents that (i) the likelihood should be calculated as
a product across independent observations, and (ii) it should be
accumulated using logarithms to avoid underflow.

     probability poisson (unsigned int i); // probability model
     log_likelihood l;
     // product of likelihoods across a series of independent observations
     for (observations::const_iterator i = obs.begin(); i != obs.end(); ++i)
       l *= poisson(*i);
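
To make the underflow concern concrete, here is a minimal sketch that
uses plain doubles rather than the library's types.  The function
names (poisson_pmf, linear_likelihood, log_likelihood_sum) are
hypothetical and chosen only for illustration; std::lgamma assumes a
C++11 standard library.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Poisson probability mass function with mean lambda, evaluated at k.
// Computed through logarithms so that a large k cannot overflow k!.
double poisson_pmf(unsigned int k, double lambda)
{
    return std::exp(k * std::log(lambda) - lambda - std::lgamma(k + 1.0));
}

// Likelihood as a plain product in the linear domain.
// For a long series of observations this underflows to zero.
double linear_likelihood(const std::vector<unsigned int>& obs, double lambda)
{
    double l = 1.0;
    for (std::size_t i = 0; i < obs.size(); ++i)
        l *= poisson_pmf(obs[i], lambda);
    return l;
}

// The same product accumulated as a sum of logarithms.
// The result stays finite where the linear product does not.
double log_likelihood_sum(const std::vector<unsigned int>& obs, double lambda)
{
    double ll = 0.0;
    for (std::size_t i = 0; i < obs.size(); ++i)
        ll += std::log(poisson_pmf(obs[i], lambda));
    return ll;
}
```

With a few thousand observations the linear product becomes exactly
zero while the log-domain sum remains an ordinary finite number.  Note
that the programmer has to remember that "+=" here means "multiply",
which is precisely the bookkeeping the library aims to remove.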

The purpose of this library is to encapsulate both probability and
likelihood quantities within an appropriate set of types, while
simultaneously achieving the following design goals.

- Provide both convenient default and more flexible advanced types for
  representing all probability and likelihood quantities.

- Maintain type safety between probabilities and likelihoods, and
  between their corresponding native quantities and logarithms.

- Incur no runtime cost.  That is, all type manipulations should be
  performed at compile time.

- Impose no limitations on the type used to represent the value of a
  probability or likelihood, beyond the obvious requirement that it
  models a real number within a suitable domain.

- Ensure that the layout of arrays and structures of probability and
  likelihood quantities is identical to the corresponding layout for
  the underlying value type.

- Verify the validity of probability values within the closed domain
  [0,1] and likelihood values within the closed domain [0,infinity],
  both as native quantities and as logarithms.

- Support the validator concept to allow either complete removal of
  the validation checks, thereby eliminating their run-time cost, or
  replacement with an alternative.

- Provide all appropriate arithmetic operations, while limiting
  implicit type conversions to those absolutely necessary and
  naturally expected.

- Provide consistent semantics for the arithmetic operations in terms
  of their definitions for native quantities in the linear domain.
  One consequence of this is that in many contexts the quantities can
  be regarded simply as probabilities or likelihoods without respect
  to their representational domain.  Another important consequence is
  that generic algorithms may be constructed based on a common set of
  operators that retain their semantics across the representational
  domains.
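
As a rough illustration of how the validation, validator, and layout
goals can coexist, the following sketch wraps a single value and
delegates range checking to a policy class.  It is not the library's
actual interface: basic_probability, checked_unit_interval, and
unchecked are names invented for this example, and static_assert
assumes a C++11 compiler.

```cpp
#include <stdexcept>

// Hypothetical validator that enforces the closed domain [0,1].
struct checked_unit_interval
{
    template <typename Value>
    static void validate(const Value& v)
    {
        // The negated form also rejects NaN, which fails every comparison.
        if (!(v >= Value(0) && v <= Value(1)))
            throw std::domain_error("probability outside [0,1]");
    }
};

// Hypothetical validator that performs no check at all.
struct unchecked
{
    // Empty inline function; an optimizing compiler can remove the call.
    template <typename Value>
    static void validate(const Value&) {}
};

// Hypothetical probability type holding exactly one Value.
template <typename Value = double, typename Validator = checked_unit_interval>
class basic_probability
{
public:
    // Explicit, so a bare floating point number never converts silently.
    explicit basic_probability(const Value& v) : value_(v)
    {
        Validator::validate(value_);
    }

    const Value& value() const { return value_; }

    // The product of two values in [0,1] is again in [0,1].
    basic_probability& operator*=(const basic_probability& rhs)
    {
        value_ *= rhs.value_;
        return *this;
    }

private:
    Value value_;  // the only data member
};

// With one data member and no virtual functions, the wrapper is expected
// to have the same size as its value type on mainstream compilers.
static_assert(sizeof(basic_probability<double>) == sizeof(double),
              "wrapper should not add storage");
static_assert(sizeof(basic_probability<float, unchecked>) == sizeof(float),
              "wrapper should not add storage");
```

Selecting the unchecked policy removes the range test entirely, while
the default policy reports an invalid value as soon as the object is
constructed.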

Given these design goals, when the types provided by the library are
used the compiler can enforce type safety, provide conversions as
required, and guarantee that the intent is directly expressed in the
source code.  Furthermore, any type that models a real number may be
used as the value type for probabilities and likelihoods.  If a double
is sufficient, however, the simple default types make the task of
instantiating these quantities easier.
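
One way to picture the compile-time enforcement is to tag each
quantity with its representational domain and give operator*= the same
meaning (a product of likelihoods) in both.  The sketch below is only
an assumption about how such a design might look; likelihood_sketch,
linear_domain, and log_domain are invented names and are not taken
from the library.

```cpp
#include <cmath>

// Hypothetical tags naming the representational domain.
struct linear_domain {};
struct log_domain {};

template <typename Domain> class likelihood_sketch;

// Linear domain: the stored number is the likelihood itself.
template <>
class likelihood_sketch<linear_domain>
{
public:
    explicit likelihood_sketch(double v) : stored_(v) {}
    double linear_value() const { return stored_; }

    likelihood_sketch& operator*=(const likelihood_sketch& rhs)
    {
        stored_ *= rhs.stored_;
        return *this;
    }

private:
    double stored_;
};

// Logarithmic domain: the stored number is the log of the likelihood.
template <>
class likelihood_sketch<log_domain>
{
public:
    // Multiplicative identity: a likelihood of 1, stored as log(1) = 0.
    likelihood_sketch() : stored_(0.0) {}

    // Conversion from the linear domain must be requested explicitly.
    explicit likelihood_sketch(const likelihood_sketch<linear_domain>& l)
        : stored_(std::log(l.linear_value())) {}

    double log_value() const { return stored_; }

    // Still a product of likelihoods, implemented as a sum of logs.
    likelihood_sketch& operator*=(const likelihood_sketch& rhs)
    {
        stored_ += rhs.stored_;
        return *this;
    }

    // Mixed-domain product: the linear operand is converted on the fly.
    likelihood_sketch& operator*=(const likelihood_sketch<linear_domain>& rhs)
    {
        stored_ += std::log(rhs.linear_value());
        return *this;
    }

private:
    double stored_;
};
```

Because every constructor taking a value is explicit, an expression
such as "l *= 0.5" with a bare double does not compile; the caller has
to state which kind of quantity the number represents.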

AREAS NEEDING ASSISTANCE

I would particularly appreciate help with the following items.

- Compiler specifics: The library has been thoroughly tested with g++
  (v3 and v4), but I need information on its portability to other
  compilers.

- The test cases seem to require BOOST_TEST_DONT_PRINT_LOG_VALUE, even though
  stream operators are present.  Although the current code works, it
  seems incorrect to require that macro.

- Some tests cannot use BOOST_CHECK_EQUAL when they otherwise seem
  like they should.  The problem appears to be an interaction with the
  printing of test cases and may be related to the previous item.

- Being new to the Boost community, I am unfamiliar with how to
  incorporate the tests and the documentation into the Boost
  framework.  Guidance here is welcome.

- Anything else that I may have overlooked.

Thanks for your input.  I hope this library is close to Boost
standards and can be improved to the point of being a worthwhile
inclusion within the set of libraries.

-- 
Brook Milligan                         Internet:  brook@nmsu.edu
Department of Biology
New Mexico State University            Telephone:  (505) 646-7980
Las Cruces, New Mexico  88003  U.S.A.  FAX:        (505) 646-5665
