Re: [boost] [random] new threefry random engine

29 Apr 2014

      On Sun, Apr 27, 2014 at 10:24 AM, Mathias Gaunard <
mathias.gaunard@ens-lyon.org> wrote:
...
On 27/04/14 14:58, Thijs (M.A.) van den Berg wrote:
You would still need to split the range [0,n) into non-overlapping
...
subsets per thread.
That's what the parallelization of the algorithm does.
(In the case in the previous mail, it's done automatically by OpenMP)
You will need some unique id per thread -either a thread id, or some
...
static counter that threads use when they are initialized- to do that. That
unique id can also be used to seed regular random engines, so that is to me
not a big addition.
That would defeat the point IMO.
I want to be able to write parallel code involving random number
generation where the result does not depend at all of how the code is
parallelized.
Counter-based random number generators can compute the i-th random number
in O(1) instead of the classical O(n).
That means that for a given generator with a given seed, I can compute the
i-th random number and the j-th random number in parallel. That's what I
want to do. I don't want per-thread seeds.
I've dusted off the code I wrote a couple of years ago and republished it,
now on github:
https://github.com/DEShawResearch/Random123-Boost

I've tried to respond to the discussion in this thread.  In particular, the
README, and the example now emphasize CBRNG's small footprint and fast
initialization and use with Distributions.  The primary example
demonstrates how to use Distributions in a thread-agnostic way.

I also removed the functions related to AES because I felt they added a lot
of clutter and introduced some technical issues (endian questions,
hardware-specific instructions and intrinsics) that would be distractions
at this point.  We can certainly revisit them in the future, if and when we
agree on a framework.

Here's the new README:

<README.md>
Random123-Boost
===============

Proposed Random123 functions for Boost.Random

The purpose of this source tree is to develop a new family of
"Counter Based Uniform Random Number Generators" (CBURNGs) for the
Boost.Random library.  CBURNGs were introduced in the paper, "Parallel
Random Numbers -- As Easy as 1, 2, 3", by Salmon, Moraes, Dror & Shaw,
which won the Best Paper award at the SC'11  conference:

http://dl.acm.org/citation.cfm?doid=2063405
also available at
http://www.thesalmons.org/john/random123/papers/random123sc11.pdf

Conventional generators (such as those in Boost.Random or the C++11
<random> library) are large and difficult to initialize.  Boost's
documentation explicitly advises against frequent initialization, so
common practice is to serialize access through a single global
generator, or, in a parallel program, to instantiate one generator
per-thread.

Unlike conventional generators CBURNGs require very little storage and
are designed to be created and destroyed frequently.  If created and
destroyed in an "inner loop", a good compiler can often keep their
state entirely in registers and can generate random values with a few
dozen inlined instructions.  In addition, they have extremely fast,
constant-time implementations of 'discard()', which allows callers to
easily "leapfrog" through a logical stream of random numbers.
These features make them ideal for parallel computation.

Consider the following program fragment, using a conventional
RandomNumberEngine, mt19937:

    using namespace boost::random;
    mt_19937 rng(seed); // seed may come from the command line
    normal_distribution nd;
    for(size_t i=0; i<atoms.size(); ++i){
        float boltzmannfactor = sqrt(kT/atoms[i].mass);
        atoms[i].vx = boltzmannfactor*nd(rng);
        atoms[i].vy = boltzmannfactor*nd(rng);
        atoms[i].vz = boltzmannfactor*nd(rng);
    }

Now imagine parallelizing this loop over a number of threads or cores.
The conventional approach is to create an independent generator in
each thread, perhaps folding a thread-id into the seed, or using a
'discard' function so that threads can "leapfrog" over one another.
But both these solutions result in output that depends on the number
of threads and how atoms are assigned to threads.  Furthermore,
improperly seeding millions of generators (not an unreasonable number
on a modern supercomputer) can lead to unintended correlations among
streams, so conventional wisdom is that one must very carefully choose
the engine and the initialization method.  The problem is even harder
if the overall program structure (i.e., not just this loop) dictates a
parallelization strategy that might assign the same atom to multiple
threads.

CBRNGs overcome all these problems.  With CBRNGs, the code looks like
this (see libs/random/examples/counter_based_example.cpp for a fully
worked example):

    using namespace boost::random;
    typedef threefry<4, uint32_t> Prf;
    normal_distribution nd;
    Prf::key_type key = {seed};
    Prf prf(key);  // seed may come from command line
    for(size_t i=0; i<atoms.size(); ++i){
        float boltzmannfactor = sqrt(kT/atoms[i].mass);
        Prf::domain_type d = {atoms[i].id, timestep, THERMALIZE_CTXT};
        counter_based_urng<Prf> cbrng(prf, d);
        nd.reset();
        atoms[i].vx = boltzmannfactor*nd(cbrng);
        atoms[i].vy = boltzmannfactor*nd(cbrng);
        atoms[i].vz = boltzmannfactor*nd(cbrng);
    }

Let's consider the code changes between the two fragments:

- counter_based_urng is a templated adapter class that models a bona
fide Boost UniformRandomNumberGenerator.  Its template parameter is a
Pseudo-Random Function (PRF).  An instance of the Pseudo-Random
Function and a value from the Pseudo-Random Function's domain are
required to construct a counter_based_urng.  E.g., these lines in the
example:

        Prf::domain_type d = {atoms[i].id, timestep, THERMALIZE_CTXT};
        counter_based_urng<Prf> cbrng(prf, d);

    In C++11, with initializer-lists, this might be shortened to:

         counter_based_urng<Prf> cbrng(prf, {atoms[i].id, timestep,
THERMALIZE_CTXT});

    Creation and destruction of the counter_based_urng is much faster than
    actually generating random values or processing them through a
    distribution, so it's reasonable to create and destroy the cbrng every
    time through the loop.

    Counter_based_urngs constructed from the same domain value and the
    same PRF are identical, i.e., they will generate exactly the same
    sequence.  On the other hand, counter_based_urngs constructed from
    domain values that differ in even a single bit generate independent,
    non-overlapping sequences of random values.  Thus, by choosing a value
    in the domain that encodes some program-specific state (e.g.,
    atoms[i].id and timestep), we produce a unique
    stream for each atom at each timestep that is statistically
    independent of all other streams.  Notice that the random values
    generated for a particular atom at a particular timestep are
    independent of the number of threads or the assignment of atoms to
    threads.  The additional constant THERMALIZE_CTXT is used to
    distinguish this loop from any other loop or context in the program,
    eliminating the possibility that the same sequence will be generated
    elsewhere in the program.

- Since it models a URNG, cbrng can be passed as an argument to the
normal_distribution, nd.  In order to make each atom independent,

        nd.reset()

    is called each time through the loop.

- PRFs are keyed.  I.e., the Prf constructor takes a key_type as an
argument.  Two Prfs of the same type, initialized with the same key
are indistinguishable.  On the other hand, two Prfs constructed from
keys that differ in any way, even by a single bit, will give rise to
statistically independent counter_based_urngs and output streams.

    In the example, we initialize the PRF outside the loop with:

        Prf::key_type key = {seed};
        Prf prf(key);  // seed may come from command line

Pseudo-random functions:  Philox and Threefry
---------------------------------------------

Two Pseudo-Random functions are implemented in this source tree: threefry
and philox.  Both are templated over an unsigned width, an unsigned
integer value type, and a round-count (which takes a reasonable and safe
default value).  For example:

   threefry<4, uint32_t>
   philox<2, uint64_t>

All PRFs have a key_type, a domain_type, and a range_type, all
of which are boost::arrays of the underlying value type.  I.e.,

   threefry<N, U>::key_type    = boost::array<U, N>
                   domain_type = boost::array<U, N>
                   range_type  = boost::array<U, N>

     philox<N, U>::key_type    = boost::array<U, N/2>
                   domain_type = boost::array<U, N>
                   range_type  = boost::array<U, N>

For threefry and philox, initialization is extremely fast.  For other
PRFs, e.g., the cryptographic AES function (not implemented), there
may be non-trivial computation associated with initialization, so
initializing PRFs is discouraged in *inner* loops, but is perfectly
reasonable at thread-scope or anywhere else that a few dozen
invocations of the generator will amortize the initialization cost.

Threefry and Philox are "pseudo-random", meaning that the outputs from
any set of inputs are practically indistiguishable from random.  In
particular, one can obtain apparently random output simply by
"counting" through inputs in an entirely regular way.  Strong
empirical evidence is presented in the SC11 paper for the statistical
quality of the threefry and philox functions.  Furthermore, the
pseudo-random functions obtained with different keys are shown to be
statistically independent, even if the keys differ by only a single
bit or follow regular patterns.  Among other things, threefry and
philox pass the entire BigCrush suite of tests.

counter_based_urng
------------------

The counter_based_urng class uses the pseudo-random property of PRFs
to perform random number generation in accordance with the
requirements of a UniformRandomNumberGenerator.  It reserves some
number (by default, all) of most-significant bits of the highest-index
member of the domain_type array for its own internal use as a
"counter".  It is an error for the domain_type constructor argument to
have non-zero bits in this range.  Whenever new random values are
required, the "counter bits" are incremented and the PRF is called,
generating a new set of random values that will be returned by
counter_based_urng.  It is an error to request random values after the
counter is exhausted.  Thus, counter_based_urngs will typically have
fairly "short" sequence lengths (anything from 4 up to 2^64).  This is
usually more than sufficient to provide input to one or more
Distributions, which generally call their engine a non-deterministic,
but usually small number of times.  On the other hand, it is cheap and
efficient to create huge numbers (2^64 or more) of independent
counter_based_urngs on demand.

counter_based_engine
--------------------

The counter_based_engine class is a templated adapter class that
models a bona fide RandomNumberEngine.  In particular, it is Seedable
in the same way as other Engines, so it can be used in any program
that expects a RandomNumberEngine.  The counter_based_engine offers
two very useful properties:

1 - it has a very small size, and a very fast constructor, so it is
practical to instantiate millions of them in a large parallel
application.

2 - the discard() function is very fast and runs in constant time.
Parallel programs can use this to "leapfrog" multiple sequences over
one another in different threads, or it can be used to initialize
generators in different threads with starting points that are
separated by enough to avoid overlap.
</README.md>

Re: [boost] [random] new threefry random engine

John Salmon