
On Thu, Jun 14, 2012 at 8:40 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Dear All,
I have attempted to evaluate the proposed multiprecision library by applying it to the "Delaunay Flip" function for 32-bit integer coordinates, as this is one of the few things that I've ever implemented that needed more than 64-bit integers:
inline bool delaunay_test(int32_t ax, int32_t ay, int32_t bx, int32_t by,
                          int32_t cx, int32_t cy, int32_t dx, int32_t dy)
{
  // Test whether the quadrilateral ABCD's diagonal AC should be flipped to BD.
  // This is the Cline & Renka method.
  // Flip if the sum of the angles ABC and CDA is greater than 180 degrees.
  // Equivalently, flip if sin(ABC + CDA) < 0.
  // Trig identity: cos(ABC) * sin(CDA) + sin(ABC) * cos(CDA) < 0
  // We can use scalar and vector products to find sin and cos, and simplify
  // to the following code.
  // Numerical robustness is important.  This code addresses it by performing
  // exact calculations with large integer types.

  i64 ax64 = ax, ay64 = ay, bx64 = bx, by64 = by,
      cx64 = cx, cy64 = cy, dx64 = dx, dy64 = dy;

  i64 cos_abc = (ax64-bx64)*(cx64-bx64) + (ay64-by64)*(cy64-by64);
  i64 cos_cda = (cx64-dx64)*(ax64-dx64) + (cy64-dy64)*(ay64-dy64);
  if (cos_abc >= 0 && cos_cda >= 0) return false;
  if (cos_abc < 0 && cos_cda < 0) return true;

  i64 sin_abc = (ax64-bx64)*(cy64-by64) - (cx64-bx64)*(ay64-by64);
  i64 sin_cda = (cx64-dx64)*(ay64-dy64) - (ax64-dx64)*(cy64-dy64);

  i128 sin_sum = static_cast<i128>(sin_abc)*cos_cda
               + static_cast<i128>(cos_abc)*sin_cda;

  return sin_sum < 0;
}
As you can see, with 32-bit inputs the first set of computations yields 64-bit results, and about half of the calls terminate at this stage. Otherwise a calculation yielding a 128-bit result is carried out. In practice, the results of the subtractions are typically small and the 128-bit value rarely has more than 64 significant bits.
I have compared the following types used for i64 and i128 in the code above:
(1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for comparison).
(2) int64_t and my own "quick and dirty" int128_t.
(3) cpp_int.
(4) int64_t and cpp_int.
(5) cpp_int without expression templates.
(6) int64_t and cpp_int without expression templates.
(7) int64_t and mp_int128_t.
(8) mpz_int.
(9) int64_t and mpz_int.
Mean times per function call are as follows:
(1)   39.3 ns
(2)   61.9 ns
(3)  513 ns
(4)  103 ns
(5)  446 ns
(6)  102 ns
(7)  102 ns
(8) 2500 ns
(9)  680 ns
(AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)
Note that:

- GMP is slowest, despite oft-repeated claims about its good performance.
I think Luke made a good point about attempting to minimize allocations, but nonetheless I suppose this might not be too surprising; I've been under the impression that GMP is for *big* ints.
- Expression templates make the code slower.
Hmmm...John: comments?
- Even the fastest Boost.Multiprecision type, mp_int128_t, is significantly slower than my own quick hack.
This is all a bit disappointing. Perhaps the problem is that I'm dealing only with 128 bits and the intended application is for much larger values?
I think so, and you probably identified a weak point of the library below.
I am reminded of my experience with xint last year. See e.g. this thread:
The current proposal seems much better - xint was hundreds of times slower than my ad-hoc code - but I still find it discouraging. I suspect that the performance difference may be attributable to the use of sign-magnitude representation; as a result, the implementation is full of sign-testing like this:
template <unsigned MinBits, bool Signed, class Allocator>
inline void eval_subtract(cpp_int_backend<MinBits, Signed, Allocator>& result, const signed_limb_type& o)
{
   if(o)
   {
      if(o < 0)
         eval_add(result, static_cast<limb_type>(-o));
      else
         eval_subtract(result, static_cast<limb_type>(o));
   }
}
I think I understand the motivation for sign-magnitude for variable-length representation, but I don't see any benefit in the case of e.g. mp_int128_t.
John, perhaps this is something that can be configurable?
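For illustration, here is a minimal sketch of the alternative (a hypothetical fixed_int128 type and function name of my own, not the library's actual backend): with a fixed-width two's-complement limb layout, subtracting a signed word is a single borrow-propagation loop, with no dispatch on the operand's sign.

#include <cstdint>

// Hypothetical fixed-width 128-bit integer: four 32-bit limbs, least
// significant first, stored in two's-complement form (illustration only).
struct fixed_int128
{
    uint32_t limbs[4];
};

// Subtract a signed 32-bit word.  The operand is sign-extended across the
// limbs and one subtract-with-borrow loop handles positive and negative
// values alike; there is no "if (o < 0) eval_add(...)" branch.
inline void subtract_signed_word(fixed_int128& result, int32_t o)
{
    uint64_t wide = static_cast<uint64_t>(static_cast<int64_t>(o)); // sign-extend to 64 bits
    uint32_t ext  = static_cast<uint32_t>(wide >> 32);              // 0 or 0xffffffff
    uint64_t borrow = 0;
    for (int i = 0; i < 4; ++i)
    {
        uint32_t sub = (i == 0) ? static_cast<uint32_t>(wide) : ext;
        uint64_t t = static_cast<uint64_t>(result.limbs[i]) - sub - borrow;
        result.limbs[i] = static_cast<uint32_t>(t);
        borrow = t >> 63;   // 1 if the subtraction wrapped below zero
    }
}

Whether this would actually be faster is something the benchmark would have to confirm; the sketch only shows that the sign test disappears for a fixed-width type.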
I also wonder about the work involved in e.g. a 64 x 64 -> 128-bit multiplication. Here is my version, based on 32x32->64 multiplies:
inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
{
  // Make life simple by dealing only with positive numbers:
  bool neg = false;
  if (a<0) { neg = !neg; a = -a; }
  if (b<0) { neg = !neg; b = -b; }

  // Divide input into 32-bit halves:
  uint32_t ah = a >> 32;
  uint32_t al = a & 0xffffffff;
  uint32_t bh = b >> 32;
  uint32_t bl = b & 0xffffffff;

  // Long multiplication, with 64-bit temporaries:
  //
  //          ah al
  //   *      bh bl
  //   -------------
  //          al*bl   (t1)
  //   +    ah*bl     (t2)
  //   +    al*bh     (t3)
  //   +  ah*bh       (t4)
  //   -------------

  uint64_t t1 = static_cast<uint64_t>(al)*bl;
  uint64_t t2 = static_cast<uint64_t>(ah)*bl;
  uint64_t t3 = static_cast<uint64_t>(al)*bh;
  uint64_t t4 = static_cast<uint64_t>(ah)*bh;

  int128_t r(t1);
  r.high = t4;
  r += int128_t(t2) << 32;
  r += int128_t(t3) << 32;

  if (neg) r = -r;

  return r;
}
So I need 4 32x32->64 multiplies to perform 1 64x64->128 multiply.
But we can't write this:
int64_t a, b ...;
int128_t r = a * b;
because here a*b is a 64x64->64 multiply; instead we have to write something like
r = static_cast<int128_t>(a)*b;
so that our operator* overload is found; but this operator* doesn't know that its int128_t argument is actually a promoted int64_t, and it has to do a 128x64->128 or 128x128->128 multiply, requiring 8 or 16 32x32->64 multiplies. The only chance of avoiding the extra work is if the compiler can do constant-propagation, but I have my doubts about that.
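One way to make the cost difference concrete (a hypothetical helper with an invented name, not anything in the proposal): if the narrow operands are passed directly, each product can use the 4-multiply routine above.

// Hypothetical helper (name invented for illustration; not part of the
// proposed library): compute a*b + c*d exactly for 64-bit inputs.  Because
// the operand widths are visible here, each product needs only the four
// 32x32->64 multiplies of mult_64x64_to_128, plus one 128-bit addition.
inline int128_t mul_add_64(int64_t a, int64_t b, int64_t c, int64_t d)
{
    int128_t r = mult_64x64_to_128(a, b);
    r += mult_64x64_to_128(c, d);
    return r;
}

// The final step of delaunay_test could then be written as
//   i128 sin_sum = mul_add_64(sin_abc, cos_cda, cos_abc, sin_cda);
// instead of
//   i128 sin_sum = static_cast<i128>(sin_abc)*cos_cda
//                + static_cast<i128>(cos_abc)*sin_cda;
// which promotes first and pays for the wider multiplies.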
The above situation seems to be unavoidable given the present inability to mix types, which, as John has discussed previously, is an issue he's (understandably) ducking completely for now. I also wonder whether there is a non-negligible penalty in abstracting all backend operations as, essentially, op-assign operations: for "x = y + z" it seems one must loop twice, first over y and x to effect "x = y", then over x and z to effect "x += z".
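To make that concern concrete, here is a limb-level sketch (hypothetical types and names of my own, purely illustrative): the op-assign form copies y into x and then adds z, while a three-operand add does the same work in a single pass. Whether the backend interface actually forces the first form is a question for John; the sketch only shows what the difference would look like.

#include <cstdint>
#include <cstddef>

// Hypothetical fixed-size unsigned big integer: N 32-bit limbs, least
// significant first (illustration only).
template <std::size_t N>
struct limbs
{
    uint32_t v[N];
};

// What "x = y; x += z" amounts to: two passes over the limbs.
template <std::size_t N>
void add_via_op_assign(limbs<N>& x, const limbs<N>& y, const limbs<N>& z)
{
    for (std::size_t i = 0; i < N; ++i)   // pass 1: x = y
        x.v[i] = y.v[i];
    uint64_t carry = 0;
    for (std::size_t i = 0; i < N; ++i)   // pass 2: x += z
    {
        uint64_t t = static_cast<uint64_t>(x.v[i]) + z.v[i] + carry;
        x.v[i] = static_cast<uint32_t>(t);
        carry = t >> 32;
    }
}

// A fused three-operand add: one pass, reading y and z and writing x.
template <std::size_t N>
void add_fused(limbs<N>& x, const limbs<N>& y, const limbs<N>& z)
{
    uint64_t carry = 0;
    for (std::size_t i = 0; i < N; ++i)
    {
        uint64_t t = static_cast<uint64_t>(y.v[i]) + z.v[i] + carry;
        x.v[i] = static_cast<uint32_t>(t);
        carry = t >> 32;
    }
}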
Any thoughts?
My test code is here: http://chezphil.org/tmp/boost_mp_review.tbz2
$ make
$ cat nn100000.dat | benchmark
(Note that this needs a fix to the resize() issue reported earlier in order to get accurate answers, but that doesn't much alter the execution time.)
Regards, Phil.
Thanks, Phil, for investigating the viability of using Multiprecision for your application! - Jeff