[boost] [Random] Proposal: Faster RANLUX engines

10 Sep 2019

      Boost.Random and the C++11 standard library ship with a 24- and a 48-bit
RANLUX pseudo-random number generator (PRNG). I propose to add 32- and
64-bit RANLUX PRNGs because they are at least 40% faster. Additionally,
the 32-bit RANLUX PRNG can be implemented as follows:

  #include <boost/cstdint.hpp>
  #include <boost/random.hpp>

  namespace br = boost::random;

  typedef
    br::subtract_with_carry_engine<boost::uint32_t,32,3,17>
    ranlux32_base;
  typedef
    br::discard_block_engine<ranlux32_base, 389, 16>
    ranlux32;

There is C++98 benchmark code at the end of this e-mail requiring only
Boost and a C++ compiler.

Below I will briefly discuss the theory behind RANLUX PRNGs, where the
new generators are coming from, and why the 64-bit RANLUX PRNG requires
(a few lines of) new code.

Drawing from RANLUX PRNGs is a two-step process. In the first step a
value is drawn from a "subtract-with-borrow" (SWB) generator
implementing b-bit modular arithmetic [1]. Then, values are generously
discarded yielding uniformly distributed values (uniform in a dynamical
systems model) [2]. The RANLUX PRNGs using 24- and 48-bit arithmetic
always gave me an itch so I looked into the motivation behind this. It
turns out [3] that this restriction arose likely from a) computational
limitations in the 1990s and b) Lüscher's interest in the generation of
floating-point numbers (single-precision floating-point numbers have
24-bit significand).

SWB PRNGs come in two flavors (Type I, Type II) and they make use of
three parameters called base b, short lag s, and long lag r. Using
modern prime factorization methods, one can quickly come up with the
32-bit PRNG above, a Type I SWB generator with b=2^32, r=17, and s=3.
This one can be implemented as shown above.

To generate 64-bit values, one should use a Type II SWB with b=2^64,
r=62, and s=5 (all other 64-bit generators with b=2^64 and r < 100 will
be awfully slow). What distinguishes the different types of SWBs is the
recurrence for the computation of variates. For Type I it is

x_n := (x_{n-s} - x_{n-r} + c_n) mod b,

where c_n is a carry bit. For Type II it is

x_n := (x_{n-r} - x_{n-s} + c_n) mod b.

(Observe the order of long lang and short lag in these formulas.) Type
II SWBs are not difficult to implement but it requires either a
modification of `subtract_with_carry_engine` or a new class.

Let me know what you think about this proposal.

Christoph Conrads

[1] G. Marsaglia, A. Zaman: "A New Class of Random Number Generators".
    1991. URL: https://www.jstor.org/stable/2959748
[2] Martin Lüscher: "A Portable High-Quality Random Number Generator for
    Lattice Field Theory Simulations". 1994. URL:
    https://arxiv.org/abs/hep-lat/9309020
[3] Christoph Conrads: "Faster RANLUX Pseudo-Random Number Generators".
    2019. URL:
    https://christoph-conrads.name/faster-ranlux-pseudo-random-number-generators

// Build with:
// c++ -Wextra -Wall -std=c++98 -pedantic benchmark.cpp -O3 \
// -march=native -DNDEBUG -lboost_chrono -lboost_system
#include <boost/chrono.hpp>
#include <boost/cstdint.hpp>
#include <boost/random.hpp>
#include <cassert>
#include <cstdio>
#include <limits>
#include <string>

namespace chrono = boost::chrono;

typedef chrono::milliseconds milliseconds;
typedef chrono::process_cpu_clock my_clock;

namespace br = boost::random;

typedef
    br::subtract_with_carry_engine<boost::uint32_t,32,3,17>
    ranlux32_base;
typedef br::discard_block_engine<ranlux32_base, 389, 16> ranlux32;

struct dummy_engine
{
  typedef boost::uint32_t result_type;

  static result_type max()
  { return std::numeric_limits<result_type>::max(); }
  static result_type min()
  { return std::numeric_limits<result_type>::min(); }

  result_type operator() ();

  result_type state_;
};

dummy_engine::result_type dummy_engine::operator() ()
{
    return state_++;
}

std::string get_name(const dummy_engine&) { return "dummy-prng"; }
std::string get_name(const br::ranlux24&) { return "ranlux24"; }
std::string get_name(const br::ranlux48&) { return "ranlux48"; }
std::string get_name(const ranlux32&) { return "ranlux32";}

template<typename Generator>
typename Generator::result_type
draw(Generator& gen) __attribute__((noinline));

template<typename Generator>
typename Generator::result_type draw(Generator& gen)
{
  return gen();
}

template<typename Generator>
void run(boost::uintmax_t num_draws)
{
  assert(Generator::min() == 0);

  unsigned w =
    Generator::max() == std::numeric_limits<boost::uint64_t>::max()?8u:
    Generator::max() == (UINTMAX_C(1)<<48) - 1u                    ?6u:
    Generator::max() == std::numeric_limits<boost::uint32_t>::max()?4u:
    Generator::max() == (UINTMAX_C(1)<<24) - 1u                    ?3u:
    Generator::max() == std::numeric_limits<boost::uint16_t>::max()?2u:
    Generator::max() == std::numeric_limits<boost::uint8_t>::max() ?1u:
    0u
  ;

  Generator gen;
  my_clock::time_point t_0 = my_clock::now();

  for(uintmax_t i = 0; i < num_draws; ++i)
    draw(gen);

  my_clock::time_point t_1 = my_clock::now();
  milliseconds t_ms = chrono::duration_cast<milliseconds>(t_1 - t_0);
  double bytes_per_sec = w * num_draws * 1000u/double(t_ms.count());
  std::string name = get_name(gen);
  boost::uint16_t dummy = boost::uint16_t(gen());

  std::printf(
    "%-26s | %10lu | %20.2e | %hu\n",
    name.c_str(), t_ms.count(), bytes_per_sec, dummy
  );
}

int main()
{
  boost::uintmax_t num_draws = UINTMAX_C(100000000);

  run<dummy_engine>(num_draws);
  run<br::ranlux24>(num_draws);
  run<ranlux32>(num_draws);
  run<br::ranlux48>(num_draws);
}