
Dear All,

I have attempted to evaluate the proposed multiprecision library by applying
it to the "Delaunay flip" function for 32-bit integer coordinates, as this is
one of the few things that I've ever implemented that needed more than 64-bit
integers:

inline bool delaunay_test(int32_t ax, int32_t ay, int32_t bx, int32_t by,
                          int32_t cx, int32_t cy, int32_t dx, int32_t dy)
{
  // Test whether the quadrilateral ABCD's diagonal AC should be flipped to BD.
  // This is the Cline & Renka method.
  // Flip if the sum of the angles ABC and CDA is greater than 180 degrees.
  // Equivalently, flip if sin(ABC + CDA) < 0.
  // Trig identity: cos(ABC) * sin(CDA) + sin(ABC) * cos(CDA) < 0
  // We can use scalar and vector products to find sin and cos, and simplify
  // to the following code.
  // Numerical robustness is important.  This code addresses it by performing
  // exact calculations with large integer types.

  i64 ax64 = ax, ay64 = ay, bx64 = bx, by64 = by,
      cx64 = cx, cy64 = cy, dx64 = dx, dy64 = dy;

  i64 cos_abc = (ax64-bx64)*(cx64-bx64) + (ay64-by64)*(cy64-by64);
  i64 cos_cda = (cx64-dx64)*(ax64-dx64) + (cy64-dy64)*(ay64-dy64);
  if (cos_abc >= 0 && cos_cda >= 0) return false;
  if (cos_abc < 0 && cos_cda < 0) return true;

  i64 sin_abc = (ax64-bx64)*(cy64-by64) - (cx64-bx64)*(ay64-by64);
  i64 sin_cda = (cx64-dx64)*(ay64-dy64) - (ax64-dx64)*(cy64-dy64);

  i128 sin_sum = static_cast<i128>(sin_abc)*cos_cda
               + static_cast<i128>(cos_abc)*sin_cda;

  return sin_sum < 0;
}

As you can see, with 32-bit inputs the first set of computations yields 64-bit
results, and about half of the calls terminate at this stage.  Otherwise, a
calculation yielding a 128-bit result is carried out.  In practice the results
of the subtractions are typically small, and the 128-bit value rarely has more
than 64 significant bits.

I have compared the following types used for i64 and i128 in the code above:

(1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for comparison).
(2) int64_t and my own "quick and dirty" int128_t.
(3) cpp_int.
(4) int64_t and cpp_int.
(5) cpp_int without expression templates.
(6) int64_t and cpp_int without expression templates.
(7) int64_t and mp_int128_t.
(8) mpz_int.
(9) int64_t and mpz_int.

Mean times per function call are as follows:

(1)   39.3 ns
(2)   61.9 ns
(3)  513 ns
(4)  103 ns
(5)  446 ns
(6)  102 ns
(7)  102 ns
(8) 2500 ns
(9)  680 ns

(AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)

Note that:

- GMP is slowest, despite oft-repeated claims about its good performance.
- Expression templates make the code slower.
- Even the fastest Boost.Multiprecision type, mp_int128_t, is significantly
  slower than my own quick hack.

This is all a bit disappointing.  Perhaps the problem is that I'm dealing only
with 128 bits, and the intended application is for much larger values?

I am reminded of my experience with xint last year.  See e.g. this thread:

http://thread.gmane.org/gmane.comp.lib.boost.devel/215565

The current proposal seems much better (xint was hundreds of times slower than
my ad-hoc code), but I still find it discouraging.
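In case it helps to see what the "quick hack" amounts to, something along the
following lines is all that is needed.  This is a simplified sketch for
illustration, not the exact code that I benchmarked (that is in the tarball
linked at the end), and the two's-complement two-limb layout is just the
obvious choice.  The point is that everything is a fixed-size struct with a
handful of trivially inlinable operations and no allocation:

#include <stdint.h>

// Sketch of a fixed-width 128-bit signed type: two 64-bit limbs interpreted
// as a two's-complement value (the sign lives in the top bit of 'high').
// Only the operations needed by delaunay_test() and by the multiplication
// routine shown later in this message are provided.
struct int128_t
{
    uint64_t high;
    uint64_t low;

    int128_t() : high(0), low(0) {}
    explicit int128_t(uint64_t l) : high(0), low(l) {}
    explicit int128_t(int64_t l)  : high(l < 0 ? ~uint64_t(0) : 0),
                                    low(static_cast<uint64_t>(l)) {}

    int128_t& operator+=(const int128_t& o)
    {
        uint64_t old_low = low;
        low  += o.low;
        high += o.high + (low < old_low ? 1 : 0);   // propagate the carry
        return *this;
    }
};

inline int128_t operator+(int128_t a, const int128_t& b)
{
    return a += b;
}

inline int128_t operator<<(int128_t a, int n)       // only 0 < n < 64 needed
{
    a.high = (a.high << n) | (a.low >> (64 - n));
    a.low <<= n;
    return a;
}

inline int128_t operator-(const int128_t& a)        // two's-complement negate
{
    int128_t r;
    r.low  = ~a.low + 1;
    r.high = ~a.high + (r.low == 0 ? 1 : 0);
    return r;
}

inline bool operator<(const int128_t& a, int64_t b) // only ever used as "x < 0"
{
    int128_t bb(b);
    if (a.high != bb.high)
        return static_cast<int64_t>(a.high) < static_cast<int64_t>(bb.high);
    return a.low < bb.low;
}

// The interesting part, 64x64->128 multiplication, is built from four
// 32x32->64 multiplies; see the mult_64x64_to_128() routine further down.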
I suspect that the performance difference may be attributable to the use of a
sign-magnitude representation; as a result, the implementation is full of
sign-testing like this:

template <unsigned MinBits, bool Signed, class Allocator>
inline void eval_subtract(cpp_int_backend<MinBits, Signed, Allocator>& result,
                          const signed_limb_type& o)
{
   if(o)
   {
      if(o < 0)
         eval_add(result, static_cast<limb_type>(-o));
      else
         eval_subtract(result, static_cast<limb_type>(o));
   }
}

I think I understand the motivation for sign-magnitude in a variable-length
representation, but I don't see any benefit in the case of e.g. mp_int128_t.

I also wonder about the work involved in e.g. a 64 x 64 -> 128-bit
multiplication.  Here is my version, based on 32x32->64 multiplies:

inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
{
  // Make life simple by dealing only with positive numbers:
  bool neg = false;
  if (a<0) { neg = !neg; a = -a; }
  if (b<0) { neg = !neg; b = -b; }

  // Divide input into 32-bit halves:
  uint32_t ah = a >> 32;
  uint32_t al = a & 0xffffffff;
  uint32_t bh = b >> 32;
  uint32_t bl = b & 0xffffffff;

  // Long multiplication, with 64-bit temporaries:
  //
  //               ah al
  //            *  bh bl
  //      --------------
  //               al*bl      (t1)
  //    +       ah*bl         (t2)
  //    +       al*bh         (t3)
  //    +    ah*bh            (t4)
  //      --------------

  uint64_t t1 = static_cast<uint64_t>(al)*bl;
  uint64_t t2 = static_cast<uint64_t>(ah)*bl;
  uint64_t t3 = static_cast<uint64_t>(al)*bh;
  uint64_t t4 = static_cast<uint64_t>(ah)*bh;

  int128_t r(t1);
  r.high = t4;
  r += int128_t(t2) << 32;
  r += int128_t(t3) << 32;

  if (neg) r = -r;

  return r;
}

So I need 4 32x32->64 multiplies to perform 1 64x64->128 multiply.  But we
can't write this:

int64_t a, b ...;
int128_t r = a * b;

because here a*b is a 64x64->64 multiply; instead we have to write something
like

r = static_cast<int128_t>(a) * b;

so that our operator* overload is found.  But this operator* doesn't know that
its int128_t argument is actually a promoted int64_t, and it has to do a
128x64->128 or 128x128->128 multiply, requiring 8 or 16 32x32->64 multiplies.
The only chance of avoiding the extra work is if the compiler can do
constant-propagation, but I have my doubts about that.

Any thoughts?

My test code is here:

http://chezphil.org/tmp/boost_mp_review.tbz2

$ make
$ cat nn100000.dat | benchmark

(Note that this needs a fix to the resize() issue reported earlier in order to
get accurate answers, but that doesn't much alter the execution time.)

Regards, Phil.
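P.S.  To make the sign-magnitude point a little more concrete: with a
fixed-width two's-complement representation, the equivalent of the
eval_subtract() quoted above needs no dispatch between addition and
subtraction on the sign of its operand; the operand is simply sign-extended
and subtracted with borrow.  A sketch, written against the hypothetical
two-limb type from earlier in this message rather than the proposed library's
backend interface:

// Sketch only: subtract a signed 64-bit operand from the two-limb
// two's-complement int128_t above.  One code path regardless of the
// operand's sign; no branch choosing between an add and a subtract.
inline void subtract_signed(int128_t& result, int64_t o)
{
    uint64_t o_low   = static_cast<uint64_t>(o);
    uint64_t o_high  = o < 0 ? ~uint64_t(0) : 0;          // sign extension
    uint64_t old_low = result.low;
    result.low  -= o_low;
    result.high -= o_high + (old_low < o_low ? 1 : 0);    // propagate borrow
}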