[Math/Statistical Distributions] Rethinking of distribution template parameters.

Dear Boost developers,

This is a feature request for the next version of the Math/Statistical Distributions library.

Currently, due to the lack of input type information, discrete distributions can only be "emulated" by using the discrete_quantile policy. However, even then the effective quantile type is still a real type. In my opinion, this has at least two disadvantages:

1. Operations are slow since the underlying quantile type is still real, whereas operations on genuinely integral types are generally faster.
2. Quantile comparison might be inaccurate since we are comparing real types.

So, to improve the support for discrete distributions, I think it would be nice if all probability distributions gained a third template parameter named, for instance, InputType. This can be used as follows.

For discrete distributions, like the "discrete uniform distribution":

--- [discrete_snip] ---
template <
    typename InputType = int,
    typename ValueType = double,
    typename Policy = policies::policy<>
>
class discrete_uniform_distribution
{
public:
    typedef InputType input_type;
    typedef ValueType value_type;
    typedef Policy policy_type;

    discrete_uniform_distribution(input_type lower = 0, input_type upper = 9)
    : lower_(lower), upper_(upper)
    {
        // Empty
    }

    inline input_type lower() const { return lower_; }
    inline input_type upper() const { return upper_; }

private:
    input_type lower_;
    input_type upper_;
};
--- [/discrete_snip] ---

For continuous distributions, like the "continuous uniform distribution":

--- [continuous_snip] ---
template <
    typename InputType = double,
    typename ValueType = InputType,
    typename Policy = policies::policy<>
>
class continuous_uniform_distribution
{
public:
    typedef InputType input_type;
    typedef ValueType value_type;
    typedef Policy policy_type;

    continuous_uniform_distribution(input_type lower = 0, input_type upper = 9)
    : lower_(lower), upper_(upper)
    {
        // Empty
    }

    inline input_type lower() const { return lower_; }
    inline input_type upper() const { return upper_; }

private:
    input_type lower_;
    input_type upper_;
};
--- [/continuous_snip] ---

NOTE: as you know, discrete and continuous distributions behave differently, so two different classes (or two different instantiations of the same templated class) must be provided.

What do you think?

Thank you for your attention!!

-- Marco

On Tue, May 19, 2009 at 2:49 PM, Marco Guazzone <marco.guazzone@gmail.com> wrote:
Dear Boost developers,
This is a feature request for the next version of the Math/Statistical Distributions library.
Currently, due to the lack of input type information, discrete distributions can only be "emulated" by using the discrete_quantile policy. However, even then the effective quantile type is still a real type.
In my opinion, this has at least two disadvantages:
1. Operations are slow since the underlying quantile type is still real, whereas operations on genuinely integral types are generally faster.
2. Quantile comparison might be inaccurate since we are comparing real types.
So, to improve the support for discrete distributions, I think it would be nice if all probability distributions gained a third template parameter named, for instance, InputType.
Just for completeness, the different helper functions would become:

--- [code_snip] ---
// Discrete Uniform Distribution

template <typename InputType, typename ValueType, typename Policy>
inline ValueType pdf(const discrete_uniform_distribution<InputType, ValueType, Policy>& dist, const InputType& x)
{
    return ValueType(1.0)/(dist.upper()-dist.lower()+1);
}

template <typename InputType, typename ValueType, typename Policy>
inline ValueType cdf(const discrete_uniform_distribution<InputType, ValueType, Policy>& dist, const InputType& q)
{
    // Strict comparison: the point lower() itself carries probability mass.
    if (q < dist.lower()) return 0;
    if (q >= dist.upper()) return 1;
    return (q-dist.lower()+1)/ValueType(dist.upper()-dist.lower()+1);
}

template <typename InputType, typename ValueType, typename Policy>
inline InputType quantile(const discrete_uniform_distribution<InputType, ValueType, Policy>& dist, const ValueType& p)
{
    // Smallest x with cdf(x) >= p; rounding up needs std::ceil from <cmath>.
    const ValueType n = ValueType(dist.upper()-dist.lower()+1);
    const InputType x = static_cast<InputType>(std::ceil(p*n))+dist.lower()-1;
    return x < dist.lower() ? dist.lower() : x; // p == 0 maps to lower()
}

// Continuous Uniform Distribution

template <typename InputType, typename ValueType, typename Policy>
inline ValueType pdf(const continuous_uniform_distribution<InputType, ValueType, Policy>& dist, const InputType& x)
{
    return ValueType(1.0)/(dist.upper()-dist.lower());
}

template <typename InputType, typename ValueType, typename Policy>
inline ValueType cdf(const continuous_uniform_distribution<InputType, ValueType, Policy>& dist, const InputType& q)
{
    if (q <= dist.lower()) return 0;
    if (q >= dist.upper()) return 1;
    return (q-dist.lower())/ValueType(dist.upper()-dist.lower());
}

template <typename InputType, typename ValueType, typename Policy>
inline InputType quantile(const continuous_uniform_distribution<InputType, ValueType, Policy>& dist, const ValueType& p)
{
    return p*(dist.upper()-dist.lower())+dist.lower();
}

// and so on...
--- [/code_snip] ---

Cheers,

-- Marco

This is a feature request for the next version of the Math/Statistical Distributions library.
Currently, due to the lack of input type information, discrete distributions can only be "emulated" by using the discrete_quantile policy. However, even then the effective quantile type is still a real type.
In my opinion, this has at least two disadvantages:
I believe your disadvantages are more imagined than real.
1. Operations are slow since the underlying quantile type is still real, whereas operations on genuinely integral types are generally faster.
Unfortunately there is no way the quantile of discrete distributions can be calculated internally using all-integer arithmetic (at least I can't think of a case other than maybe the trivial Bernoulli distribution). Normally the result of the quantile is calculated as a real number and then appropriately rounded according to the policy in effect. In a few cases the result is calculated directly as an integer by summing CDF values (hypergeometric, for example), but the internal calculations still have to be done using reals.
There's also no overhead from returning a real type (since it's usually returned in a register just like an integer type would be). There might be a tiny overhead if the user then casts to an integer, but if we internalised that cast by returning an integer type then everyone would pay that cost no matter what the use case :-(
BTW there are a few genuine use cases for returning a real-valued result from the quantile of a discrete distribution.
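To illustrate the point about summing CDF values, here is a minimal sketch for a binomial distribution (chosen only because its probabilities have a simple recurrence). The function name and signature are hypothetical and this is not Boost.Math's implementation. The result is an integer, yet every intermediate quantity is a real number.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not Boost.Math code: returns the smallest k in [0, n]
// with P(X <= k) >= p for X ~ Binomial(n, q), found by summing PMF values.
inline int binomial_quantile_by_summation(int n, double q, double p)
{
    assert(n >= 0 && q > 0.0 && q < 1.0 && p >= 0.0 && p <= 1.0);

    double pmf = std::pow(1.0 - q, n);  // P(X = 0)
    double cdf = pmf;                   // running sum, kept as a real
    int k = 0;
    while (cdf < p && k < n) {
        // Recurrence: P(X = k+1) = P(X = k) * (n-k)/(k+1) * q/(1-q)
        pmf *= static_cast<double>(n - k) / (k + 1) * (q / (1.0 - q));
        cdf += pmf;
        ++k;
    }
    return k;  // integer result, real-valued internals
}
```

For example, with n = 10 and q = 0.5 the CDF is about 0.377 at k = 4 and about 0.623 at k = 5, so a request for p = 0.5 returns 5. The comparison `cdf < p` is a floating-point comparison in any case, whatever type the function returns.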
2. Quantile comparison might be inaccurate since we are comparing real types
Nope, not if you've requested an integer result (which is the default policy), as integers are represented exactly in floating point types: unless the integer is so large as to exceed the number of mantissa bits - but then the result would likely overflow an integer type anyway. In fact this is an important use case - the ability to return values larger than INT_MAX etc. as a real-valued type.
There is one genuine concern here, but it can't be solved by your interface: that is, if the result of the quantile function is calculated to be very close to an integer value, but due to the usual rounding errors in calculation we can't be sure which side of the integer the true value lies on. Unfortunately there is simply no way around this - we have to use real-valued types in the internal calculation, and all the stats packages I'm aware of have the same potential issue.
Cheers, John.
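The exactness claim can be checked with a short sketch. It assumes `double` is an IEEE 754 binary64 type with a 53-bit significand, which is typical but not guaranteed by the C++ standard; both helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: 2^digits, the bound up to which every integer is
// exactly representable in a double (2^53 when digits == 53).
inline std::int64_t double_exact_integer_limit()
{
    return INT64_C(1) << std::numeric_limits<double>::digits;
}

// Hypothetical helper: does n survive a round trip through double unchanged?
inline bool survives_double_round_trip(std::int64_t n)
{
    // Stay well inside the int64 range so the cast back is always defined.
    assert(n > -(INT64_C(1) << 62) && n < (INT64_C(1) << 62));
    const double d = static_cast<double>(n);
    return static_cast<std::int64_t>(d) == n;
}
```

Under that assumption, `INT_MAX` and 2^53 both survive the round trip, while 2^53 + 1 does not, since it has no exact double representation. A real-valued return type can therefore carry integer results far beyond `INT_MAX` without loss.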

On Thu, May 21, 2009 at 11:47 AM, John Maddock <john@johnmaddock.co.uk> wrote:
2. Quantile comparison might be inaccurate since we are comparing real types
Nope, not if you've requested an integer result (which is the default policy), as integers are represented exactly in floating point types: unless
I had missed that, given two floating-point numbers x and y and the related floating-point machine numbers fl(x) and fl(y):
if x == y then fl(y) - eps < fl(x) < fl(y) + eps; and
if x == y then round(fl(x)) == round(fl(y));
where eps is the unit roundoff error. I had only considered the first relation.
Thank you!!
Cheers,
-- Marco
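The two relations, and the caveat about values that land very close to an integer, can be seen in a small sketch. It assumes IEEE 754 double arithmetic without extended intermediate precision; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: adds `step` to an accumulator `count` times, so the
// rounding error of each addition accumulates in the result.
inline double repeated_sum(double step, int count)
{
    assert(count >= 0);
    double total = 0.0;
    for (int i = 0; i < count; ++i)
        total += step;
    return total;
}

// Two results agree "as integers" when they round to the same whole number.
inline bool same_after_rounding(double a, double b)
{
    return std::round(a) == std::round(b);
}
```

On a typical platform `repeated_sum(0.1, 10)` comes out just below 1.0, so it compares unequal to 1.0 as a real number but equal after rounding to nearest. The same value also shows the remaining concern: `std::floor` of it is 0, not 1, so a policy that rounds downwards would land on the wrong side of the integer.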
Participants (2): John Maddock, Marco Guazzone