Bugs in boost random number distributions

Bugs in boost random number distributions: Configuration: Boost 1.32.0 Linux 2.6.16, Fedora Core 4 gcc 4.0.2 Preamble #include <boost/random.hpp> using namespace boost; typedef boost::mt19937 random_generator; random_generator r(1); (1) uniform_int<char>(low, high) produces values outside [low, high]. E.g., variate_generator<random_generator&, uniform_int<char> > unichar(r, uniform_int<char>(-100, 100)); (2) uniform_int<short>(low, high) produces values outside [low, high]. E.g., variate_generator<random_generator&, uniform_int<short int> > unishort(r, uniform_int<short int>(-20000, 20000)); (3) uniform_int<long long>(low, high) produces a distribution with the wrong mean. E.g., variate_generator<random_generator&, uniform_int<long long> > unill(r, uniform_int<long long>(0LL, 0xffffffffffLL)); has a mean of about 2.0e9 instead of 5.5e11. (4) uniform_int<long long>(low, high) sometimes takes a LONG time. E.g., variate_generator<random_generator&, uniform_int<long long> > unillint(r, uniform_int<long long>(0LL, 0x10000000000LL)); unillint(); takes about 40 seconds on a 800 MHz Intel PC. (5) uniform_01<random_generator, float> uni01(r) sometimes returns a result outside [0,1). Specifically it can return 1.0f because (2^32-1)/2^32 rounds to 1. I haven't delved into the code to see where all these problems arise. However, it appears that getting some of these examples "right" is complicated because the underlying random generators that boost uses can have arbitrary min() and max() values. In the case of mt19937 (and some of the other random generators) which provides 32 bits of random data, the implementation can be made considerably simpler and faster. See for example http://charles.karney.info/random/ This is specialized to using the Mersenne-Twister generator. However, uniform integer, uniform real, and Bernoulli distributions are then efficient and exact (under the assumption that mt19937 is a perfect source of random bits). Thus the "right" number of random bits are delivered for float, double, and long double distributions; the uniform integer distribution is exact and faster than boost's inexact uniform_smallint; and so on. Perhaps the boost libraries should include specializations for those generators which return a whole number of random bits. Here is the code illustrating the bugs above... ================================================================ /* g++ -O2 -funroll-loops -o boost-testb boost-testb.cpp */ #include <iostream> #include <ctime> #include <boost/random.hpp> #ifdef BOOST_NO_STDC_NAMESPACE namespace std { using ::time; } #endif using namespace std; using namespace boost; typedef mt19937 random_generator; template<typename T> void testint(random_generator& r, size_t n, T x, T y) { double d = 0; variate_generator<random_generator&, uniform_int<T> > uniint(r, uniform_int<T>(x, y)); for (size_t i = 0; i < n; ++i) { T z = uniint(); d += z; if (z < x || z > y) { cout << "Out of bounds " << int(z) << endl; return; } } d /= double(n); if ( abs(d - (double(x)+double(y))/2.0) / ((double(y)-double(x))/sqrt(12.0*n)) > 10.0) { cout << "Mean wrong " << d << endl; } } template<typename T> void testreal(random_generator& r, size_t n) { T x = 0, y = 1; double d = 0; uniform_01<random_generator, T> unireal(r); for (size_t i = 0; i < n; ++i) { volatile T z = unireal(); d += z; if (z < x || z >= y) { cout << "Out of bounds " << z << endl; return; } } d /= double(n); if ( abs(d - (double(x)+double(y))/2.0) / ((double(y)-double(x))/sqrt(12.0*n)) > 10.0) { cout << "Mean wrong " << d << endl; } } int main() { random_generator r(1); testint<char>(r, 100, -100, 100); testint<short int>(r, 100, -20000, 20000); testint<long long>(r, 1000, 0, 0xffffffffffLL); testreal<float>(r, 100000000); variate_generator<random_generator&, uniform_int<long long> > unillint(r, uniform_int<long long>(0LL, 0x10000000000LL)); int t0 = time(0); unillint(); cout << " " << time(0)-t0 << "s per unillint()" << endl; return 0; } -- Charles Karney <ckarney@sarnoff.com> Sarnoff Corporation, Princeton, NJ 08543-5300 URL: http://charles.karney.info Tel: +1 609 734 2312 Fax: +1 609 734 2662

Charles Karney wrote:
I haven't delved into the code to see where all these problems arise.
interestingly, just two days ago I received similar report regarding "bad" distribution of uniform_int . Here is test program (comparing number of random results in top and bottom 10% of range): #include <iostream> #include <boost/random/mersenne_twister.hpp> #include <boost/random/uniform_int.hpp> #include <boost/random/variate_generator.hpp> int main() { boost::mt19937 rng; boost::variate_generator<boost::mt19937&, boost::uniform_int<> > gen(rng, boost::uniform_int<>(0,100000000)); for (int i = 0; i < 20; ++i) { int small = 0, large = 0; for (int j = 0; j < 100000; ++j) { int x = gen(); if (x < 10000000) ++small; else if (x >= 100000000 - 10000000) ++large; } std::cout << small << " " << large << std::endl; } } and output: 10320 9631 10200 9862 10128 9904 10204 9924 10247 9839 10225 9839 10062 9890 10159 9635 10160 9748 10256 9703 10157 9883 10352 9830 10258 9791 10152 9866 10251 9564 10197 9887 10444 9743 10223 10036 10455 10010 10270 9882 author of this test (and original report) successfully used Mersenne Twister generator in his own project - with much better peformance and distribution. B.

I haven't delved into the code to see where all these problems arise.
interestingly, just two days ago I received similar report regarding "bad" distribution of uniform_int . Here is test program (comparing number of random results in top and bottom 10% of range):
This bug has a different cause from the ones I reported. Here I identify the underlying causes for the bugs:
(1) uniform_int<char>(low, high) produces values outside [low, high]. E.g.,
variate_generator<random_generator&, uniform_int<char> > unichar(r, uniform_int<char>(-100, 100));
(2) uniform_int<short>(low, high) produces values outside [low, high]. E.g.,
variate_generator<random_generator&, uniform_int<short int> > unishort(r, uniform_int<short int>(-20000, 20000));
These are both caused by overflow in the statement _range = _max - _min; in init() in random/uniform_int.hpp. For signed types this can overflow, so that _range is negative even if _max >= _min. The wreaks havoc with the rest of the code.
(3) uniform_int<long long>(low, high) produces a distribution with the wrong mean. E.g.,
variate_generator<random_generator&, uniform_int<long long> > unill(r, uniform_int<long long>(0LL, 0xffffffffffLL));
has a mean of about 2.0e9 instead of 5.5e11.
(4) uniform_int<long long>(low, high) sometimes takes a LONG time. E.g.,
variate_generator<random_generator&, uniform_int<long long> > unillint(r, uniform_int<long long>(0LL, 0x10000000000LL)); unillint();
takes about 40 seconds on a 800 MHz Intel PC.
These are both caused by problems in do_compare<false, true> in random/detail/signed_unsigned_compare.hpp. In particular the statement static bool lessthan(T1 x, T2 y) { return y >= 0 && x < static_cast<T1>(y); } returns nonsensical results with T1 = unsigned and T2 = long long. A possible band-aid would be static bool lessthan(T1 x, T2 y) { return y >= 0 && (std::numeric_limits<T1>::max() >= std::numeric_limits<T2>::max() ? x < static_cast<T1>(y) : static_cast<T2>(x) < y); } However I fear that there are other places where overflow can occur in uniform_int. The following wasn't reported in my earlier E-mail, but illustrates the bug forwarded by Bronek Kozicki: (3a) uniform_int<int>(low, high) produces a distribution with the wrong mean. E.g., variate_generator<random_generator&, uniform_int<int> > uniint(r, uniform_int<int>(0, 800000000-1)); has a mean of about 3.7e8 instead of 4.0e8. This is caused by uniform_int invoking uniform_smallint which typically yields approximate results. The threshold for this if(brange / _range > 4) { return boost::uniform_smallint<result_type>(_min, _max)(eng); } is absurdly generous. In fact, I would argue that uniform_int should NEVER call uniform_smallint but should ALWAYS return accurate results (assuming that the underlying generator is perfect). Unfortunately if the test (brange / _range > 4) is changed to (false), a very inefficient rejection method takes over (see the 40 sec computation of a uniform_int<long long>(low, high) in (4) above). For an efficient rejection method that returns uniform integers in an arbitrary range, see the definition of IntegerC<T>(T m, T n) in http://charles.karney.info/random/
(5) uniform_01<random_generator, float> uni01(r) sometimes returns a result outside [0,1). Specifically it can return 1.0f because (2^32-1)/2^32 rounds to 1.
uniform_01 returns result_type(_rng() - _rng.min()) / (result_type(_rng.max() - _rng.min()) + 1) This has two problems. If result_type is float, the result can overflow to 1 (for mt19937). If result_type is double or long double, the result only contains 32 bits of randomness (instead of 53 bits or 64 bits). Fixing these problems appears to be a very messy undertaking given the present structure of the random number library. Thus dealing with a random generator returning long ints in [-2e9, 2e9] is going to be a pain since 4e9 isn't representable as a long int. One possible solution is to require more of the generators. Thus integer generators might be required to return unsigned results in [0,max]. Similarly real generators might be required to return results representable as eps*[0,max] for some (queriable) eps. I might even suggest going further and requiring that max = 2^b - 1, so that a whole number of bits of randomness are returned. This greatly simplifies the computation of the basic integer and real distributions. In particular, my random number library (based on mt129937, see web site given above) provides implementations of uniform real distributions which as equivalent to sampling uniformly in [0,1] and EXACTLY rounding the results down either to a multiple of the machine precision, e.g., Fixed<T>(), or to the nearest representable floating point number, e.g., Float<T>(). Perhaps this might slow some generators down (but notably not mt19937). But that is far preferable to the current situation where incorrect results are being obtained. -- Charles Karney <ckarney@sarnoff.com> Sarnoff Corporation, Princeton, NJ 08543-5300 URL: http://charles.karney.info Tel: +1 609 734 2312 Fax: +1 609 734 2662

FWIW, I've tried using boost::random *many* times in serious projects, and each time when it gets to testing I have to rewrite all my random code to move away from it, due to what is generally very bizarre (and non-random) behavior. The bugs you've listed are pretty indicative of my experience, although I haven't been quite as professional (that is, I never bothered submitting any reports ;) ). It would be nice if boost::random were usable, but is it even actively maintained anymore? I've gone back to it several times but always run into a problem almost instantly. On 8/8/06, Charles Karney <ckarney@sarnoff.com> wrote:
I haven't delved into the code to see where all these problems arise.
interestingly, just two days ago I received similar report regarding "bad" distribution of uniform_int . Here is test program (comparing number of random results in top and bottom 10% of range):
This bug has a different cause from the ones I reported. Here I identify the underlying causes for the bugs:
(1) uniform_int<char>(low, high) produces values outside [low, high]. E.g.,
variate_generator<random_generator&, uniform_int<char> > unichar(r, uniform_int<char>(-100, 100));
(2) uniform_int<short>(low, high) produces values outside [low, high]. E.g.,
variate_generator<random_generator&, uniform_int<short int> > unishort(r, uniform_int<short int>(-20000, 20000));
These are both caused by overflow in the statement
_range = _max - _min;
in init() in random/uniform_int.hpp. For signed types this can overflow, so that _range is negative even if _max >= _min. The wreaks havoc with the rest of the code.
(3) uniform_int<long long>(low, high) produces a distribution with the wrong mean. E.g.,
variate_generator<random_generator&, uniform_int<long long> > unill(r, uniform_int<long long>(0LL, 0xffffffffffLL));
has a mean of about 2.0e9 instead of 5.5e11.
(4) uniform_int<long long>(low, high) sometimes takes a LONG time. E.g.,
variate_generator<random_generator&, uniform_int<long long> > unillint(r, uniform_int<long long>(0LL, 0x10000000000LL)); unillint();
takes about 40 seconds on a 800 MHz Intel PC.
These are both caused by problems in do_compare<false, true> in random/detail/signed_unsigned_compare.hpp. In particular the statement
static bool lessthan(T1 x, T2 y) { return y >= 0 && x < static_cast<T1>(y); }
returns nonsensical results with T1 = unsigned and T2 = long long. A possible band-aid would be
static bool lessthan(T1 x, T2 y) { return y >= 0 && (std::numeric_limits<T1>::max() >= std::numeric_limits<T2>::max() ? x < static_cast<T1>(y) : static_cast<T2>(x) < y); }
However I fear that there are other places where overflow can occur in uniform_int.
The following wasn't reported in my earlier E-mail, but illustrates the bug forwarded by Bronek Kozicki:
(3a) uniform_int<int>(low, high) produces a distribution with the wrong mean. E.g.,
variate_generator<random_generator&, uniform_int<int> > uniint(r, uniform_int<int>(0, 800000000-1));
has a mean of about 3.7e8 instead of 4.0e8. This is caused by uniform_int invoking uniform_smallint which typically yields approximate results. The threshold for this
if(brange / _range > 4) { return boost::uniform_smallint<result_type>(_min, _max)(eng); }
is absurdly generous. In fact, I would argue that uniform_int should NEVER call uniform_smallint but should ALWAYS return accurate results (assuming that the underlying generator is perfect).
Unfortunately if the test (brange / _range > 4) is changed to (false), a very inefficient rejection method takes over (see the 40 sec computation of a uniform_int<long long>(low, high) in (4) above). For an efficient rejection method that returns uniform integers in an arbitrary range, see the definition of IntegerC<T>(T m, T n) in
http://charles.karney.info/random/
(5) uniform_01<random_generator, float> uni01(r) sometimes returns a result outside [0,1). Specifically it can return 1.0f because (2^32-1)/2^32 rounds to 1.
uniform_01 returns
result_type(_rng() - _rng.min()) / (result_type(_rng.max() - _rng.min()) + 1)
This has two problems. If result_type is float, the result can overflow to 1 (for mt19937). If result_type is double or long double, the result only contains 32 bits of randomness (instead of 53 bits or 64 bits).
Fixing these problems appears to be a very messy undertaking given the present structure of the random number library. Thus dealing with a random generator returning long ints in [-2e9, 2e9] is going to be a pain since 4e9 isn't representable as a long int.
One possible solution is to require more of the generators. Thus integer generators might be required to return unsigned results in [0,max]. Similarly real generators might be required to return results representable as eps*[0,max] for some (queriable) eps.
I might even suggest going further and requiring that max = 2^b - 1, so that a whole number of bits of randomness are returned. This greatly simplifies the computation of the basic integer and real distributions.
In particular, my random number library (based on mt129937, see web site given above) provides implementations of uniform real distributions which as equivalent to sampling uniformly in [0,1] and EXACTLY rounding the results down either to a multiple of the machine precision, e.g., Fixed<T>(), or to the nearest representable floating point number, e.g., Float<T>().
Perhaps this might slow some generators down (but notably not mt19937). But that is far preferable to the current situation where incorrect results are being obtained.
-- Charles Karney <ckarney@sarnoff.com> Sarnoff Corporation, Princeton, NJ 08543-5300
URL: http://charles.karney.info Tel: +1 609 734 2312 Fax: +1 609 734 2662 _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Damien Fisher wrote:
FWIW, I've tried using boost::random *many* times in serious projects, and each time when it gets to testing I have to rewrite all my random code to move away from it, due to what is generally very bizarre (and non-random) behavior. The bugs you've listed are pretty indicative of my experience, although I haven't been quite as professional (that is, I never bothered submitting any reports ;) ).
It would be nice if boost::random were usable, but is it even actively maintained anymore? I've gone back to it several times but always run into a problem almost instantly.
I don't know what the current maintenance situation is with random. It's surprising that the bugs I reported have persisted so long (since 2000?). Here's another misfeature: mt19937 is perhaps the most robust generator supplied with boost::random. Unfortunately the seeding methods provided by boost::random are pretty much guaranteed to result in correlated sequences; these were abandoned by the MT authors long ago (Jan 2002). I'm torn in knowing what to do. I'd love it if C++ had a random number library that was well tailored for scientific applications. At a MINIMUM, this would mean * One (or at most a few) generators with no known defects. This would mean passing the birthday spacings test, all bits in the results equally random, etc. mt19937 is a candidate here. * Seeding options that reliably open up "seed space" so that many (~ 2^30) uncorrelated random number sequences can be generated by manipulating the seed in a systematic way. (This is necessary when programming parallel applications, where, conceivably, each iteration in a do loop needs an independent random number sequence.) The 2002/01 seeding method for mt19937 provides this. In my random number library, I've modified this method slightly to enable all of state space to be accessible. In addition, in order to provide a consistent interface, I specified the seed to be a vector of unsigned longs (possibly of zero length). * EXACT implementations of uniform_int<T>(a,b) uniform_01<T>() bernoulli_distribution<T>(p) Exact here presumes that underlying generator is perfect. Thus a Bernoulli distribution with T = float and p = 1/pow(2.0f, 100) should yield true 1/2^100 of the time on average. I suggest at two variants for uniform_01<T>() (presumably with different names) defined as follows: select a real u uniformly in [0,1] and * round it down to the nearest multiple of 1/2^numeric_limits<T>::digits: uniform_fixed01<T>(), or * round it down to the nearest representable T: uniform_float01<T>(). Thus for T = float, 0.75f would be returned with probability 1/2^24, 0.375f would be returned with probability 1/2^25, etc. * Other numerical distributions (e.g., normal) would compute their results using these three basic distributions rather than directly accessing the raw data from the underlying generator. Thus: random generator produces raw random data (assumed perfect) | v exact uniform distributions of integers and reals and Bernoulli distribution | v high quality numerical distributions (normal, exponential, etc) The "exactness" requirement can be met which reasonable efficiency (faster that boost's implementation typically). For a sample implementation (built on top of mt19937) http://charles.karney.info/random (This also provides several additional exact uniform distributions, e.g., rounding the random u in different directions, supplying results in [-1/2,1/2] with guaranteed symmetry, varying the precision of the real results, etc.) I wish there were some way of "fixing" the boost implementation before it becomes a C++ standard. Unfortunately some of the problems stem not from the implementation but from the overall architecture: * too loose requirements on the random generators; * restrictive seeding facilities; * a clumsy (and inexact) interface between the random generators and the random distributions. Are these architectural issues up for discussion? I rather suspect that we've missed the boat on this. -- Charles Karney <ckarney@sarnoff.com> Sarnoff Corporation, Princeton, NJ 08543-5300 URL: http://charles.karney.info Tel: +1 609 734 2312 Fax: +1 609 734 2662

Charles Karney wrote:
* a clumsy (and inexact) interface between the random generators and the random distributions.
I wonder what you refer to here? I have been thinking that the use of a member template for template<class Engine> result_type operator()(Engine& eng) is sometimes awkward. It doesn't allow compile-time inquiry of the engine properties by the distribution class. For example, I have a distribution that generates an integer of random bits of a fixed width. It is awkward that the distribution class can't get the engine bit width.

On 8/8/06, Neal Becker <ndbecker2@gmail.com> wrote:
template<class Engine> result_type operator()(Engine& eng)
is sometimes awkward. It doesn't allow compile-time inquiry of the engine properties by the distribution class. For example, I have a distribution that generates an integer of random bits of a fixed width. It is awkward that the distribution class can't get the engine bit width.
I'm certainly no expert on generic programming or random numbers, but couldn't you use something like this: struct my_distribution { enum { bit_width 4 }; //or int bit_width(void) { return(4); } //Other distribution code. }; Then, in your templated operator(), you could just do Engine::bit_width or eng.bit_width(). Does that make any sense? Jeremy

Neal Becker wrote:
Charles Karney wrote:
* a clumsy (and inexact) interface between the random generators and the random distributions.
I wonder what you refer to here?
Here's one example. In bernoulli_distribution we have if(_p == RealType(0)) return false; else return RealType(eng() - (eng.min)()) <= _p * RealType((eng.max)()-(eng.min)()); where eng() returns integers between eng.min() and eng.max() incl. The problems are: * If the integers returned by eng() are signed (e.g., linear_congruential) then eng() - (eng.min)() and (eng.max)() - (eng.min)() can overflow and the results are bogus. (Perhaps this can't happen because eng.min() is always zero. Nevertheless the problem seems to be present in the overall design of the library.) * The results depend on the granularity of the underlying engine. Why not return an exact result? The answer lies in the overall structure of the library where the bernoulli is expected to use the raw numbers coming out of the generator. If there were a uniform_float01<RealType>(eng); available which was equivalent to rounding u down to the nearest representable RealType, then an EXACT implementation of bernoulli would be return p < uniform_float01<RealType>(eng); But typically we can write a faster implementation because it's usually unnecessary to generate all the bits of uniform_float01<RealType>(eng) before being able to do the comparison.
I have been thinking that the use of a member template for
template<class Engine> result_type operator()(Engine& eng)
is sometimes awkward. It doesn't allow compile-time inquiry of the engine properties by the distribution class. For example, I have a distribution that generates an integer of random bits of a fixed width. It is awkward that the distribution class can't get the engine bit width.
With boost::random there's no such thing as "engine bit width", e.g., ecuyer1988 returns results in [1,2^31-87]. So rather than make the code implementing the distributions pay the penalty for extracting a whole number of random bits, it would make sense to push the responsibility off onto the generator (or another intermediate layer). Then, you could ask for an unsigned "word" of bits and you would need to be able to query the generator for the width of this word (as a compile-time constant). (In the case of ecuyer1998 this width would be 30 and you'd be paying a factor of 2 penalty in sampling until you got a result in range. Presumably for this application you would want to devise a L'Ecuyer-style generator with a max value of 2^30+n. Or else you could stick with mt19937 where you get 32 bits of randomness right out of the box.) -- Charles Karney <ckarney@sarnoff.com> Sarnoff Corporation, Princeton, NJ 08543-5300 URL: http://charles.karney.info Tel: +1 609 734 2312 Fax: +1 609 734 2662

* a clumsy (and inexact) interface between the random generators and the random distributions. I wonder what you refer to here?
Here's one example. In bernoulli_distribution we have
if(_p == RealType(0)) return false; else return RealType(eng() - (eng.min)()) <= _p * RealType((eng.max)()-(eng.min)());
where eng() returns integers between eng.min() and eng.max() incl. The problems are:
* If the integers returned by eng() are signed (e.g., linear_congruential) then eng() - (eng.min)() and (eng.max)() - (eng.min)() can overflow and the results are bogus. (...)
RealType(eng() - (eng.min)()) could just be replaced with RealType(eng()) - RealType((eng.min)()), or alternatively template metaprogramming techniques be applied (result_type, min and max are compile time constants in the standard proposal).
* The results depend on the granularity of the underlying engine. Why not return an exact result? The answer lies in the overall structure of the library where the bernoulli is expected to use the raw numbers coming out of the generator. If there were a uniform_float01<RealType>(eng);
See the generate_canonical template function in the standard proposal. (...)
template<class Engine> result_type operator()(Engine& eng)
is sometimes awkward. It doesn't allow compile-time inquiry of the engine properties by the distribution class.
min and max are static constants in the standard proposal, so there's no general problem with compile time inspection. Stephan

Charles Karney wrote:
Are these architectural issues up for discussion? I rather suspect that we've missed the boat on this.
Actually the random number part of the library is very actively under review. You might be interested in the following papers: Rationale: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1933.pdf Current proposal: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2032.pdf I am not sure if you are worring about the same issues, but now is a REALLY good time to raise any additional concerns, or else it really will be too late! -- AlisdairM

AlisdairM wrote:
Actually the random number part of the library is very actively under review. You might be interested in the following papers:
Rationale: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1933.pdf
Current proposal: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2032.pdf
I am not sure if you are worring about the same issues, but now is a REALLY good time to raise any additional concerns, or else it really will be too late!
Thanks, this is good to know. I took a quick look at these papers and I'd like to provide comments. Is boost@lists.boost.org the appropriate mailing list? -- Charles Karney <ckarney@sarnoff.com> Sarnoff Corporation, Princeton, NJ 08543-5300 URL: http://charles.karney.info Tel: +1 609 734 2312 Fax: +1 609 734 2662

Charles Karney wrote:
Thanks, this is good to know. I took a quick look at these papers and I'd like to provide comments. Is boost@lists.boost.org the appropriate mailing list?
I'm not sure the authors of those papers read the Boost lists, so you might want to mail them directly. I believe their email addresses are in the papers. Of course, there may be value in discussion on the Boost lists as well, especially if we want a Boost implementation of the final proposal. But I think that is a different discussion than you are looking for right now. -- AlisdairM

I wish there were some way of "fixing" the boost implementation before it becomes a C++ standard. Unfortunately some of the problems stem not from the implementation but from the overall architecture:
If you haven't seen this before, you can find the latest public version of the current standards proposal at: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2032.pdf Brown et al. from Fermi Lab are aware of the problems in the current boost implementation and are preparing a reference implementation of the new standard. By the way, if I'm not mistaken, most of the problems you reported do not happen for the usual result_type's of double or uint32_t, so the situation is not as bad as it may seem (though it is not good either).
* too loose requirements on the random generators;
I tend to agree.
* restrictive seeding facilities;
I don't see the restriction. The template seeding method in Boost's MersenneTwister (and the current standard proposal) take a function object as the argument and seed the state with the return values from successive calls to this function object. This seeding method can be used to implement any conceivable seeding, including the one you favour.
* a clumsy (and inexact) interface between the random generators and the random distributions.
Could you elaborate on the perceived shortcomings?
Are these architectural issues up for discussion? I rather suspect that we've missed the boat on this.
You could contact the authors of the proposal or forward your comments to the standard working group (via the comp.std.c++ newsgroup, for example). Regards, Stephan

(Looks like I forgot to send this answer to the list, instead I sent to Charles only. Sorry.) Charles Karney wrote:
(1) uniform_int<char>(low, high) produces values outside [low, high]. E.g.,
variate_generator<random_generator&, uniform_int<char> > unichar(r, uniform_int<char>(-100, 100));
As a nit-pick, you should never ever perform arithmetic with plain "char"s, because it's your compiler's (nonportable) choice whether they're signed or unsigned. The example is valid with "signed char", though.
(2) uniform_int<short>(low, high) produces values outside [low, high]. E.g.,
variate_generator<random_generator&, uniform_int<short int> > unishort(r, uniform_int<short int>(-20000, 20000));
These are both caused by overflow in the statement
_range = _max - _min;
Yes, it's a bug. Fixing it requires somewhat extended surgery in this area of the code. I've checked that into CVS HEAD, but I haven't tested it extensively, yet. With the fix, the "signed char" example delivers reasonable output.
(3) uniform_int<long long>(low, high) produces a distribution with the wrong mean. E.g.,
"long long" is not (yet) C++, thus I don't consider fixing this high priority.
These are both caused by problems in do_compare<false, true> in random/detail/signed_unsigned_compare.hpp. In particular the statement
Since signed_unsigned_compare.hpp was inadequate considering your negative number / overflow issues above, I'm now using something else. Probably this fixes that issue as well? (I won't test that tonight.) As for using uniform_smallint<>, you're probably right, and the algorithm should be fixed to use rejection as a last resort only.
(5) uniform_01<random_generator, float> uni01(r) sometimes returns a result outside [0,1). Specifically it can return 1.0f because (232-1)/232 rounds to 1.
uniform_01 returns
result_type(_rng() - _rng.min()) / (result_type(_rng.max() - _rng.min()) + 1)
This has two problems. If result_type is float, the result can overflow to 1 (for mt19937).
This is a bug. I believe Walter Brown's paper contains a "round down" solution that might be workable. For the time being, I've implemented a rejection loop for the rare case of 1.
If result_type is double or long double, the result only contains 32 bits of randomness (instead of 53 bits or 64 bits).
This is, from a standardese point of view, a "quality of implementation" issue in my opinion. Yes, the code could be improved. No, I haven't done that yet.
I might even suggest going further and requiring that max = 2^b - 1, so that a whole number of bits of randomness are returned. This greatly simplifies the computation of the basic integer and real distributions.
Yes, but it excludes whole classes of engines. The (valid) question is whether those engines (linear congruential, inversive congruential, etc.) are still "interesting", useful and desirable giving the existence of mt19937. All fixes checked in to CVS HEAD, please have a look. Thanks for your debugging. Jens Maurer

On 8/14/06, Jens Maurer <Jens.Maurer@gmx.net> wrote:
(3) uniform_int<long long>(low, high) produces a distribution with the wrong mean. E.g.,
"long long" is not (yet) C++, thus I don't consider fixing this high priority.
Every modern C++ compiler supports long long (or better yet, int_least64_t), and long long is non-controversial and has been voted into the WP for the next standard. So people use long long and expect to get reasonable results. Thus I'd be inclined to treat any long long bug just as seriously as a plain long bug. --Beman

Jens, I'm trying to figure out what to do about this test failure here: http://tinyurl.com/rs8me the test doesn't actually use random_device, but does include the header via the TR1 version of <random>, however as you can see the linker complains about an unresolved external boost::random_device::default_token. Am I correct in thinking that this class is only actually usable on Linux? Thanks, John.
participants (10)
-
AlisdairM
-
Beman Dawes
-
Bronek Kozicki
-
Charles Karney
-
Damien Fisher
-
Jens Maurer
-
Jeremy Day
-
John Maddock
-
Neal Becker
-
Stephan Tolksdorf