
I have compared the following types used for i64 and i128 in the code above:
(1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for comparison)
(2) int64_t and my own "quick and dirty" int128_t
(3) cpp_int
(4) int64_t and cpp_int
(5) cpp_int without expression templates
(6) int64_t and cpp_int without expression templates
(7) int64_t and mp_int128_t
(8) mpz_int
(9) int64_t and mpz_int
Mean times per function call are as follows:
(1)  39.3 ns
(2)  61.9 ns
(3)  513 ns
(4)  103 ns
(5)  446 ns
(6)  102 ns
(7)  102 ns
(8) 2500 ns
(9)  680 ns
(AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)
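(For what it's worth, the figures are mean times over a large number of calls, measured with an ordinary timing loop along the lines of the sketch below; the work() function here is just a hypothetical stand-in for the real benchmark body, which isn't reproduced.)

#include <chrono>
#include <cstdint>
#include <iostream>

// Hypothetical stand-in for the real benchmark body; in the actual test this
// is the function doing the i64/i128 arithmetic being compared.
static int64_t work(int64_t x)
{
    return x * x + 1;
}

int main()
{
    const int iterations = 10000000;
    int64_t sink = 0;   // accumulate results so the calls can't be optimised away
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        sink += work(i);
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::cout << "mean: " << ns / iterations << " ns/call (sink=" << sink << ")\n";
}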
Note that:
- GMP is slowest, despite oft-repeated claims about its good performance.
GMP's performance only kicks in for large values - for tests like this memory allocation dominates.
- Expression templates make the code slower.
For sure - expression templates are about eliminating temporaries. If the temporaries are cheap (a cpp_int holding small values, with no dynamic memory allocation) then there's no benefit, and probably a hit from bringing in all that machinery. That's why expression templates are disabled for the fixed size cpp_int typedefs. On the other hand, if you tried mpz_int without expression templates, that should be even slower than the code with them - in that case eliminating temporaries does bring a benefit, from fewer memory allocations.
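(As a toy illustration of the temporaries point - a sketch only, ignoring carries and nothing like the real cpp_int internals: with lightweight expression nodes, x = a + b + c is evaluated limb-by-limb in a single pass at assignment time, whereas a plain operator+ returning Big would materialise an allocated intermediate for a + b. The extra template machinery only pays for itself when the temporaries it removes are expensive.)

#include <cstddef>
#include <iostream>
#include <vector>

// Toy "big value" type: every temporary allocates, so x = a + b + c with an
// ordinary operator+ would cost an intermediate Big for (a + b).
struct Big
{
    std::vector<unsigned> limbs;
    explicit Big(unsigned v = 0) : limbs(4, v) {}
};

inline unsigned limb(const Big& b, std::size_t i) { return b.limbs[i]; }

// Expression node representing a sum; nothing is computed until assignment.
template <class L, class R>
struct Sum
{
    const L& l;
    const R& r;
};

template <class L, class R>
unsigned limb(const Sum<L, R>& s, std::size_t i) { return limb(s.l, i) + limb(s.r, i); }

// Building the expression does no arithmetic and creates no Big objects.
inline Sum<Big, Big> operator+(const Big& a, const Big& b) { return Sum<Big, Big>{a, b}; }

template <class L, class R>
Sum<Sum<L, R>, Big> operator+(const Sum<L, R>& a, const Big& b)
{
    return Sum<Sum<L, R>, Big>{a, b};
}

// Assignment walks the expression once, limb by limb: no temporaries.
template <class E>
void assign(Big& dest, const E& expr)
{
    for (std::size_t i = 0; i < dest.limbs.size(); ++i)
        dest.limbs[i] = limb(expr, i);
}

int main()
{
    Big a(1), b(2), c(3), x;
    assign(x, a + b + c);              // one pass over the limbs, no intermediate Big
    std::cout << x.limbs[0] << "\n";   // 6
}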
- Even the fastest Boost.Multiprecision type, mp_int128_t, is significantly slower than my own quick hack.
This is all a bit disappointing. Perhaps the problem is that I'm dealing only with 128 bits and the intended application is for much larger values?
No, the problem is that it's a general-purpose solution; in almost any special-purpose case it will be slower. Here's the thing - once the compiler's optimiser gets involved, your multiplication routine is unlikely to ever be beaten by more than a trivial amount (a few percent at best), and it's sufficiently simple that even the cost of a few "if" statements would likely account for the extra time. So believe it or not, I'm actually quite pleased it's no more than 2x slower!
I am reminded of my experience with xint last year. See e.g. this thread:
http://thread.gmane.org/gmane.comp.lib.boost.devel/215565
The current proposal seems much better - xint was hundreds of times slower than my ad-hoc code - but I still find it discouraging. I suspect that the performance difference may be attributable to the use of sign-magnitude representation; as a result, the implementation is full of sign-testing like this:
template <unsigned MinBits, bool Signed, class Allocator>
inline void eval_subtract(cpp_int_backend<MinBits, Signed, Allocator>& result, const signed_limb_type& o)
{
   if(o)
   {
      if(o < 0)
         eval_add(result, static_cast<limb_type>(-o));
      else
         eval_subtract(result, static_cast<limb_type>(o));
   }
}
I think I understand the motivation for sign-magnitude for variable-length representation, but I don't see any benefit in the case of e.g. mp_int128_t.
The sign testing above is because you're adding a signed int; it has nothing to do with sign-magnitude representation. If your code deals only with positive values and you had used unsigned integers throughout, then those tests would disappear. BTW, the code originally used a 2's complement representation for fixed-precision ints - that code is (very slightly) faster if you really want to do an NxN bit operation, but slower if most of the integers aren't using all the internal limbs.
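(A toy single-limb illustration of that point - not the cpp_int backend: the signed overload needs the branch purely to re-route negatives to addition, while the unsigned overload has no branch at all.)

#include <cstdint>
#include <iostream>

// Toy single-limb type, purely for illustration.
struct toy_int { uint64_t limb; };

inline void toy_add(toy_int& r, uint64_t o)      { r.limb += o; }
inline void toy_subtract(toy_int& r, uint64_t o) { r.limb -= o; }   // no sign handling

inline void toy_subtract(toy_int& r, int64_t o)
{
    if (o < 0)
        toy_add(r, static_cast<uint64_t>(-o));   // same re-routing as eval_subtract above
    else
        toy_subtract(r, static_cast<uint64_t>(o));
}

int main()
{
    toy_int r{100};
    toy_subtract(r, int64_t(-40));   // signed path: branch, routed to toy_add
    toy_subtract(r, uint64_t(30));   // unsigned path: straight subtraction
    std::cout << r.limb << "\n";     // 110
}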
I also wonder about the work involved in e.g. a 64 x 64 -> 128-bit multiplication. Here is my version, based on 32x32->64 multiplies:
inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
{
    // Make life simple by dealing only with positive numbers:
    bool neg = false;
    if (a < 0) { neg = !neg; a = -a; }
    if (b < 0) { neg = !neg; b = -b; }

    // Divide input into 32-bit halves:
    uint32_t ah = a >> 32;
    uint32_t al = a & 0xffffffff;
    uint32_t bh = b >> 32;
    uint32_t bl = b & 0xffffffff;

    // Long multiplication, with 64-bit temporaries:
    //
    //               ah al
    //             * bh bl
    //      ----------------
    //               al*bl    (t1)
    //           + ah*bl      (t2)
    //           + al*bh      (t3)
    //         + ah*bh        (t4)
    //      ----------------
    uint64_t t1 = static_cast<uint64_t>(al)*bl;
    uint64_t t2 = static_cast<uint64_t>(ah)*bl;
    uint64_t t3 = static_cast<uint64_t>(al)*bh;
    uint64_t t4 = static_cast<uint64_t>(ah)*bh;

    int128_t r(t1);
    r.high = t4;
    r += int128_t(t2) << 32;
    r += int128_t(t3) << 32;

    if (neg) r = -r;

    return r;
}
So I need 4 32x32->64 multiplies to perform 1 64x64->128 multiply.
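(As a quick sanity check of that decomposition - separate from the timing test - the same four-multiply scheme can be compared against the compiler's own 64x64->128 multiply where GCC/Clang's unsigned __int128 extension is available:)

#include <cassert>
#include <cstdint>

// Same schoolbook decomposition as above, unsigned case only, returning the
// result in the built-in 128-bit type so it can be checked directly.
unsigned __int128 mult_u64x64_to_128(uint64_t a, uint64_t b)
{
    uint64_t ah = a >> 32, al = a & 0xffffffff;
    uint64_t bh = b >> 32, bl = b & 0xffffffff;

    uint64_t t1 = al * bl;
    uint64_t t2 = ah * bl;
    uint64_t t3 = al * bh;
    uint64_t t4 = ah * bh;

    unsigned __int128 r = t1;
    r += (unsigned __int128)t4 << 64;
    r += (unsigned __int128)t2 << 32;
    r += (unsigned __int128)t3 << 32;
    return r;
}

int main()
{
    uint64_t a = 0x123456789abcdef0ULL, b = 0xfedcba9876543210ULL;
    assert(mult_u64x64_to_128(a, b) == (unsigned __int128)a * b);
}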
But we can't write this:
int64_t a, b ...;
int128_t r = a * b;
because here a*b is a 64x64->64 multiply; instead we have to write something like
r = static_cast<int128_t>(a)*b;
so that our operator* overload is found; but this operator* doesn't know that its int128_t argument is actually a promoted int64_t, so it has to do a 128x64->128 or 128x128->128 multiply, requiring 8 or 16 32x32->64 multiplies. The only chance of avoiding the extra work is if the compiler can do constant propagation, but I have my doubts about that.
No: the fixed-precision cpp_ints know how many limbs are actually used, so the same number of multiplies will still be performed as in your code. The problem is partly the "abstraction penalty", and partly all the special-case handling inside cpp_int. I will look at your test case, but I'd be very pleasantly surprised if there's any more low-hanging fruit performance-wise. Thanks for the comments, John.
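(A toy illustration of that limb-tracking point - emphatically not the real cpp_int code: a fixed-capacity type that records how many 32-bit limbs are in use only pays for the partial products those limbs require, so multiplying two promoted 64-bit values still costs 2x2 = 4 32x32->64 multiplies rather than 4x4.)

#include <cstddef>
#include <cstdint>
#include <iostream>

// Toy fixed-capacity value that tracks how many 32-bit limbs are significant.
struct fixed128
{
    uint32_t    limbs[4] = {0, 0, 0, 0};
    std::size_t used     = 0;                 // number of significant limbs

    fixed128() = default;
    fixed128(uint64_t v)
    {
        limbs[0] = uint32_t(v);
        limbs[1] = uint32_t(v >> 32);
        used = limbs[1] ? 2 : (limbs[0] ? 1 : 0);
    }
};

inline fixed128 operator*(const fixed128& a, const fixed128& b)
{
    fixed128 r;
    for (std::size_t i = 0; i < a.used; ++i)        // only the limbs actually in use
    {
        uint64_t carry = 0;
        for (std::size_t j = 0; j < b.used && i + j < 4; ++j)
        {
            uint64_t t = uint64_t(a.limbs[i]) * b.limbs[j] + r.limbs[i + j] + carry;
            r.limbs[i + j] = uint32_t(t);
            carry = t >> 32;
        }
        if (i + b.used < 4)
            r.limbs[i + b.used] = uint32_t(carry);  // position not yet written; truncate past 128 bits
    }
    r.used = 4;
    while (r.used && !r.limbs[r.used - 1]) --r.used;
    return r;
}

int main()
{
    fixed128 a(0x123456789abcdef0ULL), b(0xfedcba9876543210ULL);
    fixed128 r = a * b;   // 2x2 = 4 partial products, as in the hand-rolled version
    std::cout << std::hex << r.limbs[3] << ' ' << r.limbs[2] << ' '
              << r.limbs[1] << ' ' << r.limbs[0] << "\n";
}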