
I have compared the following types used for i64 and i128 in the code above:
(1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for comparison)
(2) int64_t and my own "quick and dirty" int128_t
(3) cpp_int
(4) int64_t and cpp_int
(5) cpp_int without expression templates
(6) int64_t and cpp_int without expression templates
(7) int64_t and mp_int128_t
(8) mpz_int
(9) int64_t and mpz_int
Mean times per function call are as follows:
(1)  39.3 ns
(2)  61.9 ns
(3)  513 ns
(4)  103 ns
(5)  446 ns
(6)  102 ns
(7)  102 ns
(8) 2500 ns
(9)  680 ns
(AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)
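(For what it's worth, the figures are mean times over a large number of calls, measured with an ordinary timing loop along the lines of the sketch below; the work() function here is just a hypothetical stand-in for the real benchmark body, which isn't reproduced.)

#include <chrono>
#include <cstdint>
#include <iostream>

// Hypothetical stand-in for the real benchmark body; in the actual test this
// is the function doing the i64/i128 arithmetic being compared.
static int64_t work(int64_t x)
{
    return x * x + 1;
}

int main()
{
    const int iterations = 10000000;
    int64_t sink = 0;   // accumulate results so the calls can't be optimised away
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        sink += work(i);
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::cout << "mean: " << ns / iterations << " ns/call (sink=" << sink << ")\n";
}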
Note that:
- GMP is slowest, despite oft-repeated claims about its good performance.
GMP's performance only kicks in for large values - for tests like this memory allocation dominates.
- Expression templates make the code slower.
For sure - expression templates are about eliminating temporaries. If the temporaries are cheap (a cpp_int holding small values, with no dynamic memory allocation) then there's no benefit, and probably a hit from bringing in all that machinery. That's why expression templates are disabled for the fixed size cpp_int typedefs. On the other hand, if you tried mpz_int without expression templates, that should be even slower than the code with them - in that case eliminating temporaries does bring a benefit, from fewer memory allocations.
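(As a toy illustration of the temporaries point - a sketch only, ignoring carries and nothing like the real cpp_int internals: with lightweight expression nodes, x = a + b + c is evaluated limb-by-limb in a single pass at assignment time, whereas a plain operator+ returning Big would materialise an allocated intermediate for a + b. The extra template machinery only pays for itself when the temporaries it removes are expensive.)

#include <cstddef>
#include <iostream>
#include <vector>

// Toy "big value" type: every temporary allocates, so x = a + b + c with an
// ordinary operator+ would cost an intermediate Big for (a + b).
struct Big
{
    std::vector<unsigned> limbs;
    explicit Big(unsigned v = 0) : limbs(4, v) {}
};

inline unsigned limb(const Big& b, std::size_t i) { return b.limbs[i]; }

// Expression node representing a sum; nothing is computed until assignment.
template <class L, class R>
struct Sum
{
    const L& l;
    const R& r;
};

template <class L, class R>
unsigned limb(const Sum<L, R>& s, std::size_t i) { return limb(s.l, i) + limb(s.r, i); }

// Building the expression does no arithmetic and creates no Big objects.
inline Sum<Big, Big> operator+(const Big& a, const Big& b) { return Sum<Big, Big>{a, b}; }

template <class L, class R>
Sum<Sum<L, R>, Big> operator+(const Sum<L, R>& a, const Big& b)
{
    return Sum<Sum<L, R>, Big>{a, b};
}

// Assignment walks the expression once, limb by limb: no temporaries.
template <class E>
void assign(Big& dest, const E& expr)
{
    for (std::size_t i = 0; i < dest.limbs.size(); ++i)
        dest.limbs[i] = limb(expr, i);
}

int main()
{
    Big a(1), b(2), c(3), x;
    assign(x, a + b + c);              // one pass over the limbs, no intermediate Big
    std::cout << x.limbs[0] << "\n";   // 6
}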
- Even the fastest Boost.Multiprecision type, mp_int128_t, is significantly slower than my own quick hack.
This is all a bit disappointing. Perhaps the problem is that I'm dealing only with 128 bits and the intended application is for much larger values?
No, the problem is that it's a general-purpose solution; in almost any special-purpose case it will be slower. Here's the thing - once the compiler's optimiser gets involved, your multiplication routine is unlikely to ever be beaten by more than a trivial amount (a few percent at best), and it's sufficiently simple that even the cost of a few "if" statements would likely account for the extra time. So believe it or not, I'm actually quite pleased it's no more than 2x slower!
I am reminded of my experience with xint last year. See e.g. this thread:
http://thread.gmane.org/gmane.comp.lib.boost.devel/215565
The current proposal seems much better - xint was hundreds of times slower than my ad-hoc code - but I still find it discouraging. I suspect that the performance difference may be attributable to the use of sign-magnitude representation; as a result, the implementation is full of sign-testing like this:
template <unsigned MinBits, bool Signed, class Allocator>
inline void eval_subtract(cpp_int_backend<MinBits, Signed, Allocator>& result, const signed_limb_type& o)
{
   if(o)
   {
      if(o < 0)
         eval_add(result, static_cast<limb_type>(-o));
      else
         eval_subtract(result, static_cast<limb_type>(o));
   }
}
I think I understand the motivation for sign-magnitude for variable-length representation, but I don't see any benefit in the case of e.g. mp_int128_t.
The sign testing above is because you're adding a signed int; it has nothing to do with sign-magnitude representation. If your code deals only with positive values and you had used unsigned integers throughout, then those tests would disappear. BTW, the code originally used a 2's complement representation for fixed-precision ints - that code is (very slightly) faster if you really want to do an NxN bit operation, but slower if most of the integers aren't using all the internal limbs.
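(A toy single-limb illustration of that point - not the cpp_int backend: the signed overload needs the branch purely to re-route negatives to addition, while the unsigned overload has no branch at all.)

#include <cstdint>
#include <iostream>

// Toy single-limb type, purely for illustration.
struct toy_int { uint64_t limb; };

inline void toy_add(toy_int& r, uint64_t o)      { r.limb += o; }
inline void toy_subtract(toy_int& r, uint64_t o) { r.limb -= o; }   // no sign handling

inline void toy_subtract(toy_int& r, int64_t o)
{
    if (o < 0)
        toy_add(r, static_cast<uint64_t>(-o));   // same re-routing as eval_subtract above
    else
        toy_subtract(r, static_cast<uint64_t>(o));
}

int main()
{
    toy_int r{100};
    toy_subtract(r, int64_t(-40));   // signed path: branch, routed to toy_add
    toy_subtract(r, uint64_t(30));   // unsigned path: straight subtraction
    std::cout << r.limb << "\n";     // 110
}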
I also wonder about the work involved in e.g. a 64 x 64 -> 128-bit multiplication. Here is my version, based on 32x32->64 multiplies:
inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
{
    // Make life simple by dealing only with positive numbers:
    bool neg = false;
    if (a < 0) { neg = !neg; a = -a; }
    if (b < 0) { neg = !neg; b = -b; }

    // Divide input into 32-bit halves:
    uint32_t ah = a >> 32;
    uint32_t al = a & 0xffffffff;
    uint32_t bh = b >> 32;
    uint32_t bl = b & 0xffffffff;

    // Long multiplication, with 64-bit temporaries:
    //
    //               ah al
    //             * bh bl
    //      ----------------
    //               al*bl    (t1)
    //           + ah*bl      (t2)
    //           + al*bh      (t3)
    //         + ah*bh        (t4)
    //      ----------------
    uint64_t t1 = static_cast<uint64_t>(al)*bl;
    uint64_t t2 = static_cast<uint64_t>(ah)*bl;
    uint64_t t3 = static_cast<uint64_t>(al)*bh;
    uint64_t t4 = static_cast<uint64_t>(ah)*bh;

    int128_t r(t1);
    r.high = t4;
    r += int128_t(t2) << 32;
    r += int128_t(t3) << 32;

    if (neg) r = -r;

    return r;
}
So I need 4 32x32->64 multiplies to perform 1 64x64->128 multiply.
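(As a quick sanity check of that decomposition - separate from the timing test - the same four-multiply scheme can be compared against the compiler's own 64x64->128 multiply where GCC/Clang's unsigned __int128 extension is available:)

#include <cassert>
#include <cstdint>

// Same schoolbook decomposition as above, unsigned case only, returning the
// result in the built-in 128-bit type so it can be checked directly.
unsigned __int128 mult_u64x64_to_128(uint64_t a, uint64_t b)
{
    uint64_t ah = a >> 32, al = a & 0xffffffff;
    uint64_t bh = b >> 32, bl = b & 0xffffffff;

    uint64_t t1 = al * bl;
    uint64_t t2 = ah * bl;
    uint64_t t3 = al * bh;
    uint64_t t4 = ah * bh;

    unsigned __int128 r = t1;
    r += (unsigned __int128)t4 << 64;
    r += (unsigned __int128)t2 << 32;
    r += (unsigned __int128)t3 << 32;
    return r;
}

int main()
{
    uint64_t a = 0x123456789abcdef0ULL, b = 0xfedcba9876543210ULL;
    assert(mult_u64x64_to_128(a, b) == (unsigned __int128)a * b);
}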
But we can't write this:
int64_t a, b ...;
int128_t r = a * b;
because here a*b is a 64x64->64 multiply; instead we have to write something like
r = static_cast<int128_t>(a)*b;
so that our operator* overload is found; but this operator* doesn't know that its int128_t argument is actually a promoted int64_t, so it has to do a 128x64->128 or 128x128->128 multiply, requiring 8 or 16 32x32->64 multiplies. The only chance of avoiding the extra work is if the compiler can do constant propagation, but I have my doubts about that.
No: the fixed-precision cpp_ints know how many limbs are actually used, so the same number of multiplies will still be performed as in your code. The problem is partly the "abstraction penalty", and partly all the special-case handling inside cpp_int. I will look at your test case, but I'd be very pleasantly surprised if there's any more low-hanging fruit performance-wise. Thanks for the comments, John.
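(A toy illustration of that limb-tracking point - emphatically not the real cpp_int code: a fixed-capacity type that records how many 32-bit limbs are in use only pays for the partial products those limbs require, so multiplying two promoted 64-bit values still costs 2x2 = 4 32x32->64 multiplies rather than 4x4.)

#include <cstddef>
#include <cstdint>
#include <iostream>

// Toy fixed-capacity value that tracks how many 32-bit limbs are significant.
struct fixed128
{
    uint32_t    limbs[4] = {0, 0, 0, 0};
    std::size_t used     = 0;                 // number of significant limbs

    fixed128() = default;
    fixed128(uint64_t v)
    {
        limbs[0] = uint32_t(v);
        limbs[1] = uint32_t(v >> 32);
        used = limbs[1] ? 2 : (limbs[0] ? 1 : 0);
    }
};

inline fixed128 operator*(const fixed128& a, const fixed128& b)
{
    fixed128 r;
    for (std::size_t i = 0; i < a.used; ++i)        // only the limbs actually in use
    {
        uint64_t carry = 0;
        for (std::size_t j = 0; j < b.used && i + j < 4; ++j)
        {
            uint64_t t = uint64_t(a.limbs[i]) * b.limbs[j] + r.limbs[i + j] + carry;
            r.limbs[i + j] = uint32_t(t);
            carry = t >> 32;
        }
        if (i + b.used < 4)
            r.limbs[i + b.used] = uint32_t(carry);  // position not yet written; truncate past 128 bits
    }
    r.used = 4;
    while (r.used && !r.limbs[r.used - 1]) --r.used;
    return r;
}

int main()
{
    fixed128 a(0x123456789abcdef0ULL), b(0xfedcba9876543210ULL);
    fixed128 r = a * b;   // 2x2 = 4 partial products, as in the hand-rolled version
    std::cout << std::hex << r.limbs[3] << ' ' << r.limbs[2] << ' '
              << r.limbs[1] << ' ' << r.limbs[0] << "\n";
}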