
Dear All,

I have attempted to evaluate the proposed multiprecision library by applying it to the "Delaunay flip" function for 32-bit integer coordinates, as this is one of the few things that I've ever implemented that needed more than 64-bit integers:

inline bool delaunay_test(int32_t ax, int32_t ay, int32_t bx, int32_t by,
                          int32_t cx, int32_t cy, int32_t dx, int32_t dy)
{
  // Test whether the quadrilateral ABCD's diagonal AC should be flipped
  // to BD. This is the Cline & Renka method.
  // Flip if the sum of the angles ABC and CDA is greater than 180 degrees.
  // Equivalently, flip if sin(ABC + CDA) < 0.
  // Trig identity: cos(ABC) * sin(CDA) + sin(ABC) * cos(CDA) < 0
  // We can use scalar and vector products to find sin and cos, and simplify
  // to the following code.
  // Numerical robustness is important. This code addresses it by performing
  // exact calculations with large integer types.
  i64 ax64 = ax, ay64 = ay, bx64 = bx, by64 = by,
      cx64 = cx, cy64 = cy, dx64 = dx, dy64 = dy;
  i64 cos_abc = (ax64-bx64)*(cx64-bx64) + (ay64-by64)*(cy64-by64);
  i64 cos_cda = (cx64-dx64)*(ax64-dx64) + (cy64-dy64)*(ay64-dy64);
  if (cos_abc >= 0 && cos_cda >= 0) return false;
  if (cos_abc < 0 && cos_cda < 0) return true;
  i64 sin_abc = (ax64-bx64)*(cy64-by64) - (cx64-bx64)*(ay64-by64);
  i64 sin_cda = (cx64-dx64)*(ay64-dy64) - (ax64-dx64)*(cy64-dy64);
  i128 sin_sum = static_cast<i128>(sin_abc)*cos_cda
               + static_cast<i128>(cos_abc)*sin_cda;
  return sin_sum < 0;
}

As you can see, with 32-bit inputs the first set of computations yields 64-bit results, and about half of all calls terminate at that stage. Otherwise a calculation yielding a 128-bit result is carried out. In practice the results of the subtractions are typically small, and the 128-bit value rarely has more than 64 significant bits.

I have compared the following types used for i64 and i128 in the code above:

(1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for comparison).
(2) int64_t and my own "quick and dirty" int128_t.
(3) cpp_int.
(4) int64_t and cpp_int.
(5) cpp_int without expression templates.
(6) int64_t and cpp_int without expression templates.
(7) int64_t and mp_int128_t.
(8) mpz_int.
(9) int64_t and mpz_int.

Mean times per function call are as follows:

(1) 39.3 ns
(2) 61.9 ns
(3) 513 ns
(4) 103 ns
(5) 446 ns
(6) 102 ns
(7) 102 ns
(8) 2500 ns
(9) 680 ns

(AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)

Note that:

- GMP is slowest, despite oft-repeated claims about its good performance.
- Expression templates make the code slower.
- Even the fastest Boost.Multiprecision type, mp_int128_t, is significantly slower than my own quick hack.

This is all a bit disappointing. Perhaps the problem is that I'm dealing only with 128 bits and the intended application is for much larger values?

I am reminded of my experience with xint last year. See e.g. this thread:

http://thread.gmane.org/gmane.comp.lib.boost.devel/215565

The current proposal seems much better - xint was hundreds of times slower than my ad-hoc code - but I still find it discouraging. I suspect that the performance difference may be attributable to the use of sign-magnitude representation; as a result, the implementation is full of sign-testing like this:

template <unsigned MinBits, bool Signed, class Allocator>
inline void eval_subtract(cpp_int_backend<MinBits, Signed, Allocator>& result,
                          const signed_limb_type& o)
{
  if(o) {
    if(o < 0)
      eval_add(result, static_cast<limb_type>(-o));
    else
      eval_subtract(result, static_cast<limb_type>(o));
  }
}

I think I understand the motivation for sign-magnitude for variable-length representation, but I don't see any benefit in the case of e.g. mp_int128_t. I also wonder about the work involved in e.g. a 64 x 64 -> 128-bit multiplication.
Here is my version, based on 32x32->64 multiplies:

inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
{
  // Make life simple by dealing only with positive numbers:
  bool neg = false;
  if (a<0) { neg = !neg; a = -a; }
  if (b<0) { neg = !neg; b = -b; }

  // Divide input into 32-bit halves:
  uint32_t ah = a >> 32;
  uint32_t al = a & 0xffffffff;
  uint32_t bh = b >> 32;
  uint32_t bl = b & 0xffffffff;

  // Long multiplication, with 64-bit temporaries:
  //
  //        ah al
  //  *     bh bl
  //  -----------
  //        al*bl   (t1)
  //  +  ah*bl      (t2)
  //  +  al*bh      (t3)
  //  +  ah*bh      (t4)
  //  -----------

  uint64_t t1 = static_cast<uint64_t>(al)*bl;
  uint64_t t2 = static_cast<uint64_t>(ah)*bl;
  uint64_t t3 = static_cast<uint64_t>(al)*bh;
  uint64_t t4 = static_cast<uint64_t>(ah)*bh;

  int128_t r(t1);
  r.high = t4;
  r += int128_t(t2) << 32;
  r += int128_t(t3) << 32;

  if (neg) r = -r;
  return r;
}

So I need 4 32x32->64 multiplies to perform 1 64x64->128 multiply.

But we can't write this:

int64_t a, b ...;
int128_t r = a * b;

because here a*b is a 64x64->64 multiply; instead we have to write something like

r = static_cast<int128_t>(a)*b;

so that our operator* overload is found; but this operator* doesn't know that its int128_t argument is actually a promoted int64_t, and it has to do a 128x64->128 or 128x128->128 multiply, requiring 8 or 16 32x32->64 multiplies. The only chance of avoiding the extra work is if the compiler can do constant-propagation, but I have my doubts about that.

Any thoughts?

My test code is here: http://chezphil.org/tmp/boost_mp_review.tbz2

$ make
$ cat nn100000.dat | benchmark

(Note that this needs a fix to the resize() issue reported earlier in order to get accurate answers, but that doesn't much alter the execution time.)

Regards, Phil.
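[Editorial note: the function and the multiply routine above can be exercised with a small self-contained sketch. This is not the original benchmark code: GCC/Clang's built-in __int128 extension stands in for the i128/int128_t types, and the vertex ordering of the test quadrilaterals (ABCD taken clockwise) is an assumption that happens to match the predicate as written.]

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the i64/i128 typedefs in the post; __int128 is a
// GCC/Clang extension, not the library type under review.
using i64  = std::int64_t;
using i128 = __int128;

// The Delaunay flip test, verbatim apart from the typedefs.
inline bool delaunay_test(std::int32_t ax, std::int32_t ay, std::int32_t bx, std::int32_t by,
                          std::int32_t cx, std::int32_t cy, std::int32_t dx, std::int32_t dy)
{
    i64 ax64 = ax, ay64 = ay, bx64 = bx, by64 = by,
        cx64 = cx, cy64 = cy, dx64 = dx, dy64 = dy;
    i64 cos_abc = (ax64-bx64)*(cx64-bx64) + (ay64-by64)*(cy64-by64);
    i64 cos_cda = (cx64-dx64)*(ax64-dx64) + (cy64-dy64)*(ay64-dy64);
    if (cos_abc >= 0 && cos_cda >= 0) return false;
    if (cos_abc < 0 && cos_cda < 0) return true;
    i64 sin_abc = (ax64-bx64)*(cy64-by64) - (cx64-bx64)*(ay64-by64);
    i64 sin_cda = (cx64-dx64)*(ay64-dy64) - (ax64-dx64)*(cy64-dy64);
    i128 sin_sum = static_cast<i128>(sin_abc)*cos_cda
                 + static_cast<i128>(cos_abc)*sin_cda;
    return sin_sum < 0;
}

// The 4-partial-product decomposition from the multiply routine, which can
// be checked against the compiler's native 64x64->128 multiply.
inline i128 mult_64x64_to_128_sketch(std::int64_t a, std::int64_t b)
{
    bool neg = false;
    if (a < 0) { neg = !neg; a = -a; }   // caveat: overflows for INT64_MIN
    if (b < 0) { neg = !neg; b = -b; }
    std::uint32_t ah = static_cast<std::uint64_t>(a) >> 32;
    std::uint32_t al = static_cast<std::uint64_t>(a) & 0xffffffffu;
    std::uint32_t bh = static_cast<std::uint64_t>(b) >> 32;
    std::uint32_t bl = static_cast<std::uint64_t>(b) & 0xffffffffu;
    std::uint64_t t1 = static_cast<std::uint64_t>(al)*bl;
    std::uint64_t t2 = static_cast<std::uint64_t>(ah)*bl;
    std::uint64_t t3 = static_cast<std::uint64_t>(al)*bh;
    std::uint64_t t4 = static_cast<std::uint64_t>(ah)*bh;
    unsigned __int128 r = (static_cast<unsigned __int128>(t4) << 64) | t1;
    r += static_cast<unsigned __int128>(t2) << 32;
    r += static_cast<unsigned __int128>(t3) << 32;
    return neg ? -static_cast<i128>(r) : static_cast<i128>(r);
}
```

With ABCD taken clockwise, a unit square (cocircular, so no flip) returns false, and pulling D inside the circumcircle of ABC returns true.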

Phil Endecott wrote:
But we can't write this: int64_t a, b ...; int128_t r = a * b; because here a*b is a 64x64->64 multiply; instead we have to write something like r = static_cast<int128_t>(a)*b; so that our operator* overload is found; but this operator* doesn't know that its int128_t argument is actually a promoted int64_t, and it has to do an 128x64->128 or 128x128->128 multiply, requiring 8 or 16 32x32->64 multiplies. The only chance of avoiding the extra work is if the compiler can do constant-propagation, but I have my doubts about that. Any thoughts?
If you used an expression template for the multiply you could dispatch r = a * b; to a three argument multiply template function from within the assignment operator that would inspect the precision of each addend and the sum types and choose the algorithm that does minimal work with minimal loss of precision. However, it would be a big effort and it wouldn't work if both arguments of the multiply were builtin types. I'm not suggesting it is a good idea, it is just the first thing I thought of. Calling multiply(r, a, b) directly might be a better idea.

This should really be done in the vector unit, and that will become even more true as better vector units come out. We need a standard mp specification so that the compiler/hardware vendors will provide library implementations that take advantage of existing and future hardware to accelerate their use. Boost can't be expected to provide an optimal multi-precision type that makes full use of the hardware, but we can be expected to come up with an interface that will help the standard committee specify a standard mp. The hardware vendors will happily compete to provide the highest possible performance implementation of the standard and the code that uses that hardware will be portable.

By the way, gmp is only slow because of the allocations. Try declaring all of the gmp_int variables instead of letting there be temporaries and then make them static so that they recycle their storage rather than going to the heap for every use. I don't know if it will end up being the fastest, but at least it will be a fair comparison of algorithms instead of measuring primarily allocation time.

Regards, Luke
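[Editorial note: a minimal sketch of what "calling multiply(r, a, b) directly" buys. Overload resolution, rather than runtime inspection, picks the cheap path when both operands are genuinely 64-bit. The names are illustrative, and __int128 (a GCC/Clang extension) stands in for the multiprecision result type.]

```cpp
#include <cstdint>

using int128 = __int128;

// Both operands are plain 64-bit ints: one widening multiply suffices
// (4 partial products on a 32-bit-limb implementation).
inline void multiply(int128& r, std::int64_t a, std::int64_t b)
{
    r = static_cast<int128>(a) * b;
}

// General case: a full-width multiply. For a real multiprecision type
// this is where the 8- or 16-partial-product routine would live.
inline void multiply(int128& r, const int128& a, const int128& b)
{
    r = a * b;
}
```

Unlike r = static_cast<int128_t>(a)*b, the call multiply(r, a, b) never loses the fact that a and b were 64-bit, so no constant-propagation heroics are needed.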

On Thu, Jun 14, 2012 at 9:24 AM, Simonson, Lucanus J <lucanus.j.simonson@intel.com> wrote:
Phil Endecott wrote:
But we can't write this: int64_t a, b ...; int128_t r = a * b; because here a*b is a 64x64->64 multiply; instead we have to write something like r = static_cast<int128_t>(a)*b; so that our operator* overload is found; but this operator* doesn't know that its int128_t argument is actually a promoted int64_t, and it has to do an 128x64->128 or 128x128->128 multiply, requiring 8 or 16 32x32->64 multiplies. The only chance of avoiding the extra work is if the compiler can do constant-propagation, but I have my doubts about that. Any thoughts?
If you used an expression template for the multiply you could dispatch r = a * b; to a three argument multiply template function from within the assignment operator that would inspect the precision of each addend and the sum types and choose the algorithm that does minimal work with minimal loss of precision. However, it would be a big effort and it wouldn't work if both arguments of the multiply were builtin types. I'm not suggesting it is a good idea, it is just the first thing I thought of. Calling multiply(r, a, b) directly might be a better idea.
As John has alluded to earlier (and I believe you concurred), he hasn't yet found a satisfying design approach to mixed-type arithmetic (and I can't blame him). I think it's reasonable to leave this unresolved at the moment, but I guess it's up to the reviewers whether this is a deal breaker. I think it does unfortunately limit its applicability in the computational geometry realm, where your expression tree is often of bounded height (at least in my experience), hence you want to be able to use arithmetic types of varying precision depending on the level of the tree.

This should really be done in the vector unit, and that will become even more true as better vector units come out. We need a standard mp specification so that the compiler/hardware vendors will provide library implementations that take advantage of existing and future hardware to accelerate their use. Boost can't be expected to provide an optimal multi-precision type that makes full use of the hardware, but we can be expected to come up with an interface that will help the standard committee specify a standard mp. The hardware vendors will happily compete to provide the highest possible performance implementation of the standard and the code that uses that hardware will be portable.
Vectorized arithmetic is the way to go for maximum performance. - Jeff

I have compared the following types used for i64 and i128 in the code above:
(1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for comparison).
(2) int64_t and my own "quick and dirty" int128_t.
(3) cpp_int.
(4) int64_t and cpp_int.
(5) cpp_int without expression templates.
(6) int64_t and cpp_int without expression templates.
(7) int64_t and mp_int128_t.
(8) mpz_int.
(9) int64_t and mpz_int.
Mean times per function call are as follows:
(1) 39.3 ns
(2) 61.9 ns
(3) 513 ns
(4) 103 ns
(5) 446 ns
(6) 102 ns
(7) 102 ns
(8) 2500 ns
(9) 680 ns
(AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)
Note that:

- GMP is slowest, despite oft-repeated claims about its good performance.
GMP's performance only kicks in for large values - for tests like this memory allocation dominates.
- Expression templates make the code slower.
For sure - expression templates are about eliminating temporaries - if the temporaries are cheap (cpp_int containing small values and no dynamic memory allocation) then there's no benefit and probably a hit from bringing in all that machinery. That's why expression templates are disabled for the fixed-size cpp_int typedefs. On the other hand, if you tried mpz_int without expression templates it should be even slower than the code with them - in that case eliminating temporaries does bring a benefit, from fewer memory allocations.
- Even the fastest Boost.Multiprecision type, mp_int128_t, is significantly slower than my own quick hack.
This is all a bit disappointing. Perhaps the problem is that I'm dealing only with 128 bits and the intended application is for much larger values?
No, the problem is that it's a general-purpose solution; in almost any special-purpose case it will be slower. Here's the thing - once the compiler's optimiser gets involved, your multiplication routine is unlikely to ever be beaten by more than a trivial amount (a few percent at best); it's also sufficiently simple that even the cost of a few "if" statements would likely account for the extra time. So believe it or not, I'm actually quite pleased it's no more than 2x slower!
I am reminded of my experience with xint last year. See e.g. this thread:
http://thread.gmane.org/gmane.comp.lib.boost.devel/215565
The current proposal seems much better - xint was hundreds of times slower than my ad-hoc code - but I still find it discouraging. I suspect that the performance difference may be attributable to the use of sign-magnitude representation; as a result, the implementation is full of sign-testing like this:
template <unsigned MinBits, bool Signed, class Allocator>
inline void eval_subtract(cpp_int_backend<MinBits, Signed, Allocator>& result,
                          const signed_limb_type& o)
{
  if(o) {
    if(o < 0)
      eval_add(result, static_cast<limb_type>(-o));
    else
      eval_subtract(result, static_cast<limb_type>(o));
  }
}
I think I understand the motivation for sign-magnitude for variable-length representation, but I don't see any benefit in the case of e.g. mp_int128_t.
The sign testing above is because you're adding a signed int; it has nothing to do with sign-magnitude representation. If your code dealt only with positive values and used unsigned integers throughout, those tests would disappear. BTW the code originally used a 2's complement representation for fixed precision ints - that code is (very slightly) faster if you really want to do an NxN bit operation, but slower if most of the integers aren't using all the internal limbs.
I also wonder about the work involved in e.g. a 64 x 64 -> 128-bit multiplication. Here is my version, based on 32x32->64 multiplies:
inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
{
  // Make life simple by dealing only with positive numbers:
  bool neg = false;
  if (a<0) { neg = !neg; a = -a; }
  if (b<0) { neg = !neg; b = -b; }

  // Divide input into 32-bit halves:
  uint32_t ah = a >> 32;
  uint32_t al = a & 0xffffffff;
  uint32_t bh = b >> 32;
  uint32_t bl = b & 0xffffffff;

  // Long multiplication, with 64-bit temporaries:
  //
  //        ah al
  //  *     bh bl
  //  -----------
  //        al*bl   (t1)
  //  +  ah*bl      (t2)
  //  +  al*bh      (t3)
  //  +  ah*bh      (t4)
  //  -----------

  uint64_t t1 = static_cast<uint64_t>(al)*bl;
  uint64_t t2 = static_cast<uint64_t>(ah)*bl;
  uint64_t t3 = static_cast<uint64_t>(al)*bh;
  uint64_t t4 = static_cast<uint64_t>(ah)*bh;

  int128_t r(t1);
  r.high = t4;
  r += int128_t(t2) << 32;
  r += int128_t(t3) << 32;

  if (neg) r = -r;
  return r;
}
So I need 4 32x32->64 multiplies to perform 1 64x64->128 multiply.
But we can't write this:
int64_t a, b ...; int128_t r = a * b;
because here a*b is a 64x64->64 multiply; instead we have to write something like
r = static_cast<int128_t>(a)*b;
so that our operator* overload is found; but this operator* doesn't know that its int128_t argument is actually a promoted int64_t, and it has to do an 128x64->128 or 128x128->128 multiply, requiring 8 or 16 32x32->64 multiplies. The only chance of avoiding the extra work is if the compiler can do constant-propagation, but I have my doubts about that.
No: the fixed-precision cpp_ints know how many limbs are actually used, so the same number of multiplies will still be used as in your code. The problem is partly the "abstraction penalty", and partly all the special-case handling inside cpp_int. I will look at your test case, but I'd be very pleasantly surprised if there's any more low-hanging fruit performance-wise. Thanks for the comments, John.
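[Editorial note: the "knows how many limbs are actually used" point can be illustrated with a toy - this is not cpp_int's real layout - where a fixed-capacity integer tracks its significant limb count, so a 1-limb by 1-limb product costs one 32x32 partial product rather than capacity-squared of them.]

```cpp
#include <cstddef>
#include <cstdint>

// Toy fixed-capacity unsigned 128-bit integer with a used-limb count.
struct fixed_uint128
{
    std::uint32_t limb[4];   // little-endian 32-bit limbs
    std::size_t   used;      // number of significant limbs

    explicit fixed_uint128(std::uint64_t v = 0) : limb{}, used(0)
    {
        limb[0] = static_cast<std::uint32_t>(v);
        limb[1] = static_cast<std::uint32_t>(v >> 32);
        used = limb[1] ? 2 : (limb[0] ? 1 : 0);
    }
};

inline std::uint64_t low64(const fixed_uint128& x)
{
    return (static_cast<std::uint64_t>(x.limb[1]) << 32) | x.limb[0];
}

// Schoolbook multiply that only walks the limbs actually in use, so the
// work scales with the operands' true sizes, not the type's capacity.
inline fixed_uint128 mul(const fixed_uint128& a, const fixed_uint128& b)
{
    fixed_uint128 r;
    for (std::size_t i = 0; i < a.used; ++i) {
        std::uint64_t carry = 0;
        for (std::size_t j = 0; j < b.used && i + j < 4; ++j) {
            std::uint64_t t = static_cast<std::uint64_t>(a.limb[i]) * b.limb[j]
                            + r.limb[i + j] + carry;
            r.limb[i + j] = static_cast<std::uint32_t>(t);
            carry = t >> 32;
        }
        // Propagate any remaining carry into higher limbs (mod 2^128).
        for (std::size_t k = i + b.used; carry && k < 4; ++k) {
            carry += r.limb[k];
            r.limb[k] = static_cast<std::uint32_t>(carry);
            carry >>= 32;
        }
    }
    r.used = 4;
    while (r.used > 0 && r.limb[r.used - 1] == 0) --r.used;
    return r;
}
```

Multiplying two small values here touches one limb pair; the abstraction penalty John mentions is everything around that single multiply.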

on Thu Jun 14 2012, John Maddock <boost.regex-AT-virgin.net> wrote:
- Expression templates make the code slower.
For sure - expression templates are about eliminating temporaries - if the temporaries are cheap (cpp_int containing small values and no dynamic memory allocation) then there's no benefit and probably a hit from bringing in all that machinery. That's why expression templates are disabled for the fixed size cpp_int typedefs.
I'd like to propose an alternate viewpoint: in the context of numerics, expression templates are about delaying evaluation to gather more context, so you can perform the computation more efficiently. From that point-of-view there's absolutely no reason that they should ever make code slower. Just generate the appropriate code for the types and expressions involved. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 15/06/12 08:07, Dave Abrahams wrote:
on Thu Jun 14 2012, John Maddock<boost.regex-AT-virgin.net> wrote:
- Expression templates make the code slower.

For sure - expression templates are about eliminating temporaries - if the temporaries are cheap (cpp_int containing small values and no dynamic memory allocation) then there's no benefit and probably a hit from bringing in all that machinery. That's why expression templates are disabled for the fixed size cpp_int typedefs.

I'd like to propose an alternate viewpoint: in the context of numerics, expression templates are about delaying evaluation to gather more context, so you can perform the computation more efficiently. From that point-of-view there's absolutely no reason that they should ever make code slower. Just generate the appropriate code for the types and expressions involved.
Hi, I agree: the library should be able to decide whether expression templates would be an improvement or not. A minor change in the library, defaulting the ExpressionTemplates parameter to a trait depending on the backend, could help with that. The library could default to something reasonable, while the backend developer could always provide a specialization. Vicente

- Expression templates make the code slower.
For sure - expression templates are about eliminating temporaries - if the temporaries are cheap (cpp_int containing small values and no dynamic memory allocation) then there's no benefit and probably a hit from bringing in all that machinery. That's why expression templates are disabled for the fixed size cpp_int typedefs.
I'd like to propose an alternate viewpoint: in the context of numerics, expression templates are about delaying evaluation to gather more context, so you can perform the computation more efficiently. From that point-of-view there's absolutely no reason that they should ever make code slower. Just generate the appropriate code for the types and expressions involved.
Hi, I agree, the library should be able to decide whether expression templates should be an improvement or not.
A minor change in the library, defaulting the ExpressionTemplates parameter to a trait depending on the backend could help to that. The library could default to something reasonable, while the backend developer could always provide a specialization. Vicente
I like this idea. Unfortunately, I may not be the best one to program it, and I fear John would get stuck with it---if, that is, this turns out to be the way to go. So one would need to see what John thinks. So Vicente, if I understand, you are saying that one might add more *intelligence* to the kind of object used as the template parameter for enabling/disabling the expression templates. Perhaps it is like the good idea you wrote on 31-May-2012, but takes it even one step further? You wrote this, but it was for floats:
What about replacing the second bool template parameter by an enum class expression_template {disabled, enabled}; which will be more explicit. That is
typedef mp::mp_number<mp::mpfr_float_backend<300>, false> my_float;

versus

typedef mp::mp_number<mp::mpfr_float_backend<300>, mp::expression_template::disabled> my_float;
But I also fear that it would take a bit of time to specify this kind of back-end parameter type. Is it something one could add later, backwards compatibility being handled via a sensible default parameter? Best regards, Chris.

Christopher Kormanyos wrote
- Expression templates make the code slower.
Hi, I agree, the library should be able to decide whether expression templates should be an improvement or not.
A minor change in the library, defaulting the ExpressionTemplates parameter to a trait depending on the backend could help to that. The library could default to something reasonable, while the backend developer could always provide a specialization. Vicente
I like this idea. Unfortunately, I may not be the best one to program it, and I fear John would get stuck with it---if, that is, this turns out to be the way to go. So one would need to see what John thinks.
So Vicente, if I understand, you are saying that one might add more *intelligence* to the kind of object used as the template parameter for enabling/disabling the expression templates.
Yes and no. This request was in order to make the meaning of the bool parameter clearer by using something more domain-specific.
Perhaps is it like the good idea you wrote on 31-May-2012, but takes it even one step further?
You wrote this, but it was for floats:
What about replacing the second bool template parameter by an enum class expression_template {disabled, enabled}; which will be more explicit. That is
typedef mp::mp_number<mp::mpfr_float_backend<300>, false> my_float;

versus

typedef mp::mp_number<mp::mpfr_float_backend<300>, mp::expression_template::disabled> my_float;
Yes and no. This request was in order to make the meaning of the bool parameter clearer by using something more domain-specific. The ExpressionTemplates parameter could be defaulted to e.g.

ExpressionTemplates = expression_template_trait<Backend>::value

where expression_template_trait could be defined with a defaulted value as

template <typename Backend>
struct expression_template_trait
{
  static const expression_template value = Backend::expression_template_value;
};

Another default would also be possible, e.g. disabled (or enabled) by default:

template <typename Backend>
struct expression_template_trait
{
  static const expression_template value = expression_template::disabled;
};

But I think a default expressed by the backend is a better default.
But I also fear that it would take a bit of time to specify this kind of back-end parameter type.
Is it something one could add later, backwards compatibility being handled via a sensible default parameter?
Hmm, we could do it later, but I don't think we could ensure backward compatibility. IMO, if we consider it a good thing, it should be included before release, as it doesn't imply too much work; we just need to define the default of the trait. Best, Vicente
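[Editorial note: a compilable sketch of Vicente's trait proposal. All names here are illustrative, not the library's actual interface: the number type's expression-template switch defaults from a per-backend trait, which a backend author can specialize.]

```cpp
enum class expression_template { disabled, enabled };

// Default: no expression templates unless the backend says otherwise.
template <typename Backend>
struct expression_template_trait
{
    static const expression_template value = expression_template::disabled;
};

// Hypothetical backends standing in for fixed-width and allocating types.
struct toy_fixed_backend    {};
struct toy_variable_backend {};

// An allocating backend opts in via a specialization.
template <>
struct expression_template_trait<toy_variable_backend>
{
    static const expression_template value = expression_template::enabled;
};

// The number type picks up the trait's value as its default.
template <typename Backend,
          expression_template ET = expression_template_trait<Backend>::value>
struct mp_number_sketch
{
    static const expression_template et = ET;
};
```

Existing code that writes mp_number_sketch<SomeBackend> keeps compiling, which bears on the backward-compatibility question raised above: the default simply becomes smarter, and a user can still override it explicitly.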

I agree, the library should be able to decide whether expression templates should be an improvement or not.
A minor change in the library, defaulting the ExpressionTemplates parameter to a trait depending on the backend could help to that. The library could default to something reasonable, while the backend developer could always provide a specialization.
That would certainly be easy to do, but: * That wouldn't help for types like cpp_int where the best strategy may depend on the runtime size of the number. * How would this differ in practice from the use of the typedefs we already provide where we know that expression templates don't help (mp_intXXX_t etc)? Cheers, John.

John Maddock-3 wrote
I agree, the library should be able to decide whether expression templates should be an improvement or not.
A minor change in the library, defaulting the ExpressionTemplates
parameter
to a trait depending on the backend could help to that. The library could default to something reasonable, while the backend developer could always provide a specialization.
That would certainly be easy to do, but:
* That wouldn't help for types like cpp_int where the best strategy may depend on the runtime size of the number.
The backend developer could define the trait depending on the compile-time size. For run-time sizes the library could use a default, and it would be up to the user to force the use of expression templates or not.
* How would this differ in practice from the use of the typedefs we already provide where we know that expression templates don't help (mp_intXXX_t etc)?
Well, you will not have specific typedefs for all the sizes, so a way to associate the default value will be convenient. Vicente

For sure - expression templates are about eliminating temporaries - if the temporaries are cheap (cpp_int containing small values and no dynamic memory allocation) then there's no benefit and probably a hit from bringing in all that machinery. That's why expression templates are disabled for the fixed size cpp_int typedefs.
I'd like to propose an alternate viewpoint: in the context of numerics, expression templates are about delaying evaluation to gather more context, so you can perform the computation more efficiently. From that point-of-view there's absolutely no reason that they should ever make code slower. Just generate the appropriate code for the types and expressions involved.
Sigh... I hear what you're saying but it's not that simple.

Consider cpp_int: it uses something akin to the "small string optimization" for small numbers, and only switches to dynamic allocation for larger values. Now consider a simple expression:

a = b * c * d;

The optimal evaluation strategy depends upon the size of the values in b, c and d: for small values that don't require memory allocation, temporaries and copying are cheap, so it might be best to do:

t = b * c;
a = t * d;

But for larger values the cost of memory allocation and copying starts to dominate, so:

a = b * c;
a *= d;

might be best (though it depends a bit on how multiplication is implemented). For larger values still, the cost of the multiplication is so expensive that all methods are the same. So you can't choose between these at compile time. You could clearly add some backend-specific runtime logic to choose between them, but that hits another issue.

In the benchmark in question, we're looking at arithmetic on small values (mostly integer sized) that just happen to be stored in a cpp_int. It's an important use case, and cpp_int is many many times more efficient than gmp thanks to not having to allocate memory for small values, but it's slower than "naive" simple code. The problem is that the cost of performing the arithmetic is as cheap as it could possibly be, algorithmically speaking, so for cpp_int the cost is completely dominated by the other runtime logic used to select the best method.

Let me give you two concrete examples so you can tell me how to avoid them :-)

1) When multiplying in the cpp_int backend, if one value is a single limb long then we can optimize the multiplication routine by dispatching to the special "multiply by an int" code. When the other value is medium sized, it's a big win (2 or 3 times improvement), but for the trivial case that often occurs in computational geometry, the cost of the extra if statement and function call can easily add a 20% hit or more. One thing I could experiment with is overloads for the multiplication/addition of small fixed-sized integer types - that might help mp_int128_t, but not variable-sized types like cpp_int.

2) In the expression template unpacking and evaluation, one crucial piece of information is "does the LHS also appear on the RHS". Sometimes, if it does, we can optimize: for example "a = a + b" gets transformed into "a += b"; other times, if the LHS does not appear in the RHS, we can trash the LHS value and use it as working space, saving a temporary. But... it's a runtime calculation to figure that out, and it messes up inlining as well as taking a small amount of time. For non-trivial values which require memory allocation etc. it makes not one jot of difference, but in the case in question it completely dominates the runtime compared to what is, in effect, just an int*int or whatever.

Just curious, but can constexpr partially solve this at all? We would need a way to get the address of the variables in the expression into the template arg list so that questions like this could be asked at compile time. Even then I believe that getting the cost of expression template unpacking down to zero is likely to be a fool's errand.

John.
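[Editorial note: example (2) can be made concrete with a toy - this is not Boost.Multiprecision's actual machinery - in which an expression type records the operand addresses and assignment checks at runtime whether the destination aliases one of them.]

```cpp
// Toy stand-in for a multiprecision integer.
struct big_int
{
    long v;
};

// Expression node for a + b: just remembers where the operands live.
struct add_expr
{
    const big_int* l;
    const big_int* r;
};

inline add_expr operator+(const big_int& a, const big_int& b)
{
    return add_expr{&a, &b};
}

// Assignment inspects aliasing at runtime, as described above:
// "a = a + b" becomes "a += b"; otherwise the destination can be
// trashed and used as working space.
inline void assign(big_int& dst, const add_expr& e)
{
    if (&dst == e.l)       dst.v += e.r->v;   // a = a + b
    else if (&dst == e.r)  dst.v += e.l->v;   // a = b + a
    else                   dst.v = e.l->v + e.r->v;
}
```

The branches are exactly the cost described: for big operands they are noise, but when the addition itself is one machine instruction, the pointer comparisons and the damage to inlining are a measurable fraction of the total.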

on Fri Jun 15 2012, John Maddock <boost.regex-AT-virgin.net> wrote:
For sure - expression templates are about eliminating temporaries - if the temporaries are cheap (cpp_int containing small values and no dynamic memory allocation) then there's no benefit and probably a hit from bringing in all that machinery. That's why expression templates are disabled for the fixed size cpp_int typedefs.
I'd like to propose an alternate viewpoint: in the context of numerics, expression templates are about delaying evaluation to gather more context, so you can perform the computation more efficiently. From that point-of-view there's absolutely no reason that they should ever make code slower. Just generate the appropriate code for the types and expressions involved.
Sigh... I hear what you're saying but it's not that simple.
Consider cpp_int: it uses something akin to the "small string optimization" for small numbers, and then only switches to dynamic allocation for larger values.
Now consider a simple expression:
a = b * c * d;
The optimal evaluation strategy depends upon the size of the values in b, c and d: for small values that don't require memory allocation, temporaries, and copying is cheap, so it might be best to do:
t = b * c; a = t * d;
But for larger values, the cost of memory allocation and copying starts to dominate so:
a = b * c; a *= d;
Might be best (though it depends a bit on how multiplication is implemented).
For larger values still, the cost of the multiplication is so expensive, that all methods are the same.
That last fact doesn't matter, then.
So you can't choose between these at compile time.
Well you surely can. You can make a choice that is no worse than whatever is being described as "not using expression templates." My point is that there's no reason for anyone to say "using expression templates slowed this down."
You could clearly add some backend-specific runtime logic to choose between them, but that hits another issue:
In the benchmark in question, we're looking at arithmetic on small values (mostly integer sized) that just happen to be stored in a cpp_int. It's an important use case, and cpp_int is many many times more efficient than gmp thanks to not having to allocate memory for small values, but it's slower than "naive" simple code. The problem is that the cost of performing the arithmetic is as cheap as it could possibly be algorithmically speaking, so for cpp_int, the cost is completely dominated by other runtime logic used to select the best method.
Tough sitch. For cases like that you need to ask the user for hints.
Let me give you two concrete examples so you can tell me how to avoid them :-)
1) When multiplying in the cpp_int backend, if one value is a single limb long then we can optimize the multiplication routine by dispatching to the special "multiply by an int" code. When the other value is medium sized, it's a big win (2 or 3 times improvement), but for the trivial case that often occurs in computational geometry, the cost of the extra if statement and function call can easily add a 20% hit or more. One thing I could experiment with is overloads for the multiplication/addition of small fixed sized integer types - that might help mp_int128_t, but not variable sized types like cpp_int.
2) In the expression template unpacking and evaluation, one crucial piece of information is "does the LHS also appear on the RHS". Sometimes, if it does, we can optimize: for example "a = a + b" gets transformed into "a += b"; other times, if the LHS does not appear in the RHS, we can trash the LHS value and use it as working space, saving a temporary.
My favorite approach goes something like this: a [noalias]+= b
But... it's a runtime calculation to figure that out, it messes up inlining as well as taking a small amount of time.
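For illustration, here is a minimal sketch (all names hypothetical, not the library's code) of that runtime aliasing calculation: before evaluating, walk the expression tree and check whether the LHS's address appears among the leaves. The walk is cheap but not free, and the chain of calls it introduces is exactly the sort of thing that gets in the way of inlining.

```cpp
#include <cassert>

// Toy value type standing in for a multiprecision number.
struct big { long v; };

// Toy expression node holding references to its operands.
template <class L, class R>
struct add_expr { const L& l; const R& r; };

inline add_expr<big, big> operator+(const big& l, const big& r) { return {l, r}; }
template <class L, class R>
inline add_expr<add_expr<L, R>, big>
operator+(const add_expr<L, R>& l, const big& r) { return {l, r}; }

// The aliasing test: does pointer p appear among the leaves of the tree?
inline bool contains(const big& leaf, const big* p) { return &leaf == p; }
template <class L, class R>
bool contains(const add_expr<L, R>& e, const big* p) {
    return contains(e.l, p) || contains(e.r, p);
}

inline long eval(const big& b) { return b.v; }
template <class L, class R>
long eval(const add_expr<L, R>& e) { return eval(e.l) + eval(e.r); }

// Assignment chooses a strategy at runtime based on the aliasing check.
template <class E>
void assign(big& lhs, const E& e) {
    if (contains(e, &lhs)) {
        long t = eval(e);   // LHS aliases the RHS: evaluate fully first
        lhs.v = t;
    } else {
        lhs.v = eval(e);    // no aliasing: a real backend could reuse lhs
    }                       // as working space here and skip a temporary
}
```

In this toy both branches cost the same; the point is that for a real backend the winning branch differs, and the check itself plus the missed inlining is the overhead when the operands are only int-sized.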
The other thing you can do is to aggregate ever-larger computations into single expressions so that you have a hope of gathering enough context together to do better optimizations.
For non-trivial values which require memory allocation etc., it makes not one jot of difference, but in the case in question it completely dominates the runtime compared to what is, in effect, just an int*int or whatever.

Just curious, but can constexpr partially solve this at all? We would need a way to get the address of the variables in the expression into the template arg list.
Forget it :(. The great downfall of constexpr is that every constexpr function must be valid even when the arguments aren't constexprs. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
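Dave's point in miniature: a C++11 constexpr function is just an ordinary function when called with runtime arguments, so its body cannot branch on "am I being evaluated at compile time?" and select a different strategy. (Much later, C++20 added std::is_constant_evaluated for exactly this.)

```cpp
#include <cassert>

// Usable at compile time...
constexpr long square(long x) { return x * x; }

// ...but the very same body must also be valid with runtime arguments,
// which is why constexpr alone cannot feed addresses or "is this constant"
// facts into template arguments.
long runtime_square(long x) { return square(x); }
```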

On Thu, Jun 14, 2012 at 8:40 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
Dear All,
I have attempted to evaluate the proposed multiprecision library by applying it to the "Delaunay Flip" function for 32-bit integer coordinates, as this is one of the few things that I've ever implemented that needed more than 64-bit integers:
inline bool delaunay_test(int32_t ax, int32_t ay, int32_t bx, int32_t by,
                          int32_t cx, int32_t cy, int32_t dx, int32_t dy)
{
  // Test whether the quadrilateral ABCD's diagonal AC should be flipped to BD.
  // This is the Cline & Renka method.
  // Flip if the sum of the angles ABC and CDA is greater than 180 degrees.
  // Equivalently, flip if sin(ABC + CDA) < 0.
  // Trig identity: cos(ABC) * sin(CDA) + sin(ABC) * cos(CDA) < 0
  // We can use scalar and vector products to find sin and cos, and simplify
  // to the following code.
  // Numerical robustness is important. This code addresses it by performing
  // exact calculations with large integer types.
  i64 ax64 = ax, ay64 = ay, bx64 = bx, by64 = by,
      cx64 = cx, cy64 = cy, dx64 = dx, dy64 = dy;
  i64 cos_abc = (ax64-bx64)*(cx64-bx64) + (ay64-by64)*(cy64-by64);
  i64 cos_cda = (cx64-dx64)*(ax64-dx64) + (cy64-dy64)*(ay64-dy64);
  if (cos_abc >= 0 && cos_cda >= 0) return false;
  if (cos_abc < 0 && cos_cda < 0) return true;
  i64 sin_abc = (ax64-bx64)*(cy64-by64) - (cx64-bx64)*(ay64-by64);
  i64 sin_cda = (cx64-dx64)*(ay64-dy64) - (ax64-dx64)*(cy64-dy64);
  i128 sin_sum = static_cast<i128>(sin_abc)*cos_cda
               + static_cast<i128>(cos_abc)*sin_cda;
  return sin_sum < 0;
}
As you can see, with 32-bit inputs the first set of computations yields 64-bit results, and about half of the calls terminate at this stage. Otherwise, a calculation yielding a 128-bit result is carried out. In practice, the results of the subtractions are typically small and the 128-bit value rarely has more than 64 significant bits.
I have compared the following types used for i64 and i128 in the code above:
(1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for comparison)
(2) int64_t and my own "quick and dirty" int128_t.
(3) cpp_int.
(4) int64_t and cpp_int.
(5) cpp_int without expression templates.
(6) int64_t and cpp_int without expression templates.
(7) int64_t and mp_int128_t.
(8) mpz_int.
(9) int64_t and mpz_int.
Mean times per function call are as follows:
(1) 39.3 ns
(2) 61.9 ns
(3) 513 ns
(4) 103 ns
(5) 446 ns
(6) 102 ns
(7) 102 ns
(8) 2500 ns
(9) 680 ns
(AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)
Note that:
- GMP is slowest, despite oft-repeated claims about its good performance.
I think Luke made a good point about attempting to minimize allocations, but nonetheless I suppose this might not be too surprising; I've been under the impression that GMP is for *big* ints.
- Expression templates make the code slower.
Hmmm...John: comments?
- Even the fastest Boost.Multiprecision type, mp_int128_t, is significantly slower than my own quick hack.
This is all a bit disappointing. Perhaps the problem is that I'm dealing only with 128 bits and the intended application is for much larger values?
I think so, and you probably identified a weak point of the library below.
I am reminded of my experience with xint last year. See e.g. this thread: http://thread.gmane.org/gmane.comp.lib.boost.devel/215565
The current proposal seems much better - xint was hundreds of times slower than my ad-hoc code - but I still find it discouraging. I suspect that the performance difference may be attributable to the use of sign-magnitude representation; as a result, the implementation is full of sign-testing like this:
template <unsigned MinBits, bool Signed, class Allocator>
inline void eval_subtract(cpp_int_backend<MinBits, Signed, Allocator>& result,
                          const signed_limb_type& o)
{
   if(o)
   {
      if(o < 0)
         eval_add(result, static_cast<limb_type>(-o));
      else
         eval_subtract(result, static_cast<limb_type>(o));
   }
}
I think I understand the motivation for sign-magnitude for variable-length representation, but I don't see any benefit in the case of e.g. mp_int128_t.
John, perhaps this is something that can be configurable?
I also wonder about the work involved in e.g. a 64 x 64 -> 128-bit multiplication. Here is my version, based on 32x32->64 multiplies:
inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
{
  // Make life simple by dealing only with positive numbers:
  bool neg = false;
  if (a<0) { neg = !neg; a = -a; }
  if (b<0) { neg = !neg; b = -b; }
  // Divide input into 32-bit halves:
  uint32_t ah = a >> 32;
  uint32_t al = a & 0xffffffff;
  uint32_t bh = b >> 32;
  uint32_t bl = b & 0xffffffff;
  // Long multiplication, with 64-bit temporaries:
  //
  //          ah al
  //        * bh bl
  //   ----------------
  //          al*bl  (t1)
  //   +    ah*bl    (t2)
  //   +    al*bh    (t3)
  //   + ah*bh       (t4)
  //   ----------------
  uint64_t t1 = static_cast<uint64_t>(al)*bl;
  uint64_t t2 = static_cast<uint64_t>(ah)*bl;
  uint64_t t3 = static_cast<uint64_t>(al)*bh;
  uint64_t t4 = static_cast<uint64_t>(ah)*bh;
  int128_t r(t1);
  r.high = t4;
  r += int128_t(t2) << 32;
  r += int128_t(t3) << 32;
  if (neg) r = -r;
  return r;
}
So I need 4 32x32->64 multiplies to perform 1 64x64->128 multiply.
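On compilers with the unsigned __int128 extension (GCC/Clang), the schoolbook decomposition above can be checked against the built-in 128-bit product. This is a verification sketch of the identity, not part of Phil's code:

```cpp
#include <cassert>
#include <cstdint>

// Schoolbook 64x64->128 multiply on unsigned values, recombining the four
// 32x32->64 partial products; uses the GCC/Clang unsigned __int128 extension
// as the reference type.
unsigned __int128 mult_64x64_schoolbook(uint64_t a, uint64_t b) {
    uint32_t ah = static_cast<uint32_t>(a >> 32);
    uint32_t al = static_cast<uint32_t>(a);
    uint32_t bh = static_cast<uint32_t>(b >> 32);
    uint32_t bl = static_cast<uint32_t>(b);
    uint64_t t1 = static_cast<uint64_t>(al) * bl;  // bits 0..63
    uint64_t t2 = static_cast<uint64_t>(ah) * bl;  // bits 32..95
    uint64_t t3 = static_cast<uint64_t>(al) * bh;  // bits 32..95
    uint64_t t4 = static_cast<uint64_t>(ah) * bh;  // bits 64..127
    unsigned __int128 r = t1;
    r += static_cast<unsigned __int128>(t2) << 32;
    r += static_cast<unsigned __int128>(t3) << 32;
    r += static_cast<unsigned __int128>(t4) << 64;
    return r;
}
```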
But we can't write this:
int64_t a, b ...; int128_t r = a * b;
because here a*b is a 64x64->64 multiply; instead we have to write something like
r = static_cast<int128_t>(a)*b;
so that our operator* overload is found; but this operator* doesn't know that its int128_t argument is actually a promoted int64_t, and it has to do a 128x64->128 or 128x128->128 multiply, requiring 8 or 16 32x32->64 multiplies. The only chance of avoiding the extra work is if the compiler can do constant-propagation, but I have my doubts about that.
The above situation seems to be unavoidable given the present inability to mix types, which, as John has discussed previously, is an issue he's (understandably) completely ducking at the moment. I wonder also if there is a non-negligible penalty for abstracting all backend operations as, essentially, op-assign operations, as in "x = y + z", it seems one must loop twice, once over, say, y and x to effect "x = y", then once over x and z to effect "x += z".
Any thoughts?
My test code is here: http://chezphil.org/tmp/boost_mp_review.tbz2
$ make
$ cat nn100000.dat | benchmark
(Note that this needs a fix to the resize() issue reported earlier in order to get accurate answers, but that doesn't much alter the execution time.)
Regards, Phil.
Thanks, Phil, for investigating the viability of using Multiprecision for your application! - Jeff

- Expression templates make the code slower.
Hmmm...John: comments?
As I mentioned, they add a non-trivial amount of housekeeping to figure out how to best evaluate the expression template. If temporaries are expensive, then this is a big win; if they're cheap (cpp_int with small values), then it's a small hit.
I think I understand the motivation for sign-magnitude for variable-length representation, but I don't see any benefit in the case of e.g. mp_int128_t.
John, perhaps this is something that can be configurable?
Sign-magnitude and 2's complement types mostly require completely different algorithms; besides, the old 2's complement code was *mostly* slower, as all the limbs are always used.
I wonder also if there is a non-negligible penalty for abstracting all backend operations as, essentially, op-assign operations, as in "x = y + z", it seems one must loop twice, once over, say, y and x to effect "x = y", then once over x and z to effect "x += z".
That isn't what's happening here - the library can - if the backend chooses to support it via the "optional" part of the backend requirements - support out-of-place operations as well as inplace ones. In that case x = y + z doesn't involve copying followed by addition, it just does the addition direct. However, for more complex expressions, you may well get "copy-followed-by-inplace-op" if that avoids a temporary. This is normally a win, but there may be a case (cheap temporaries) where it's a dis-optimization. Basically, it's difficult to be all things to all people... John.
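A toy illustration of the two backend shapes John describes (the names follow the library's eval_* convention, but this is a sketch, not its code): the required two-argument in-place form forces copy-then-add for "x = y + z", while the optional three-argument out-of-place form computes directly into x.

```cpp
#include <cassert>

// Trivial stand-in for a backend type.
struct toy_backend { long v; };

// Required part of the backend interface: in-place x += y.
inline void eval_add(toy_backend& x, const toy_backend& y) { x.v += y.v; }

// Optional part: out-of-place x = y + z, no preliminary copy of y.
inline void eval_add(toy_backend& x, const toy_backend& y, const toy_backend& z) {
    x.v = y.v + z.v;
}

// What the front end must fall back to when only the two-argument form
// exists - the "loop twice" concern raised above:
inline void assign_add_via_copy(toy_backend& x, const toy_backend& y,
                                const toy_backend& z) {
    x = y;            // pass 1: copy y into x
    eval_add(x, z);   // pass 2: add z in place
}
```

For a type with cheap copies the two strategies cost about the same; for one that allocates, the three-argument form saves a whole pass, which is the point of making it part of the optional backend requirements.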

On Thu, Jun 14, 2012 at 10:51 AM, John Maddock <boost.regex@virgin.net>wrote:
- Expression templates make the code slower.
Hmmm...John: comments?
As I mentioned, they add a non-trivial amount of housekeeping to figure out how to best evaluate the expression template. If temporaries are expensive, then this is a big win; if they're cheap (cpp_int with small values), then it's a small hit.
I think I understand the motivation for sign-magnitude for variable-length
representation, but I don't see any benefit in the case of e.g. mp_int128_t.
John, perhaps this is something that can be configurable?
Sign-magnitude and 2's complement types mostly require completely different algorithms; besides, the old 2's complement code was *mostly* slower, as all the limbs are always used.
I wonder also if there is a non-negligible penalty for abstracting all
backend operations as, essentially, op-assign operations, as in "x = y + z", it seems one must loop twice, once over, say, y and x to effect "x = y", then once over x and z to effect "x += z".
That isn't what's happening here - the library can - if the backend chooses to support it via the "optional" part of the backend requirements - support out-of-place operations as well as inplace ones.
In that case x = y + z doesn't involve copying followed by addition, it just does the addition direct. However, for more complex expressions, you may well get "copy-followed-by-inplace-op" if that avoids a temporary. This is normally a win, but there may be a case (cheap temporaries) where it's a dis-optimization.
Basically, it's difficult to be all things to all people...
Of course, and that all makes sense, thanks for clearing the above up! - Jeff

My test code is here: http://chezphil.org/tmp/boost_mp_review.tbz2
Just a couple of updates:

1) Your test code doesn't actually test fixed-width integer types anywhere, as you have:

#ifdef USE_MP_INTN_NOET
typedef int64_t i64;
typedef boost::multiprecision::mp_number<
   boost::multiprecision::cpp_int_backend<>, false> i128;
#endif

Should be:

#ifdef USE_MP_INTN_NOET
typedef int64_t i64;
typedef boost::multiprecision::mp_number<
   boost::multiprecision::cpp_int_backend<128, true, void>, false> i128;
#endif

Of course the types are pretty similar, and it only saves a few percent - relative times on Win32 were:

Your custom int128: 1.9
mp_int128: 2.7
cpp_int (no ET's): 3.2

2) Some more comparisons with GMP show expression templates help much more in that case:

proposed mpz_int: 100
proposed mpz_int (no ET's): 143
GMP's mpz_class [1]: 140

Cheers, John.

[1] With apologies to Marc Glisse, I used the last released mpz_class; still haven't looked at his improvements in GMP development :-(
participants (8)
- Christopher Kormanyos
- Dave Abrahams
- Jeffrey Lee Hellrung, Jr.
- John Maddock
- Phil Endecott
- Simonson, Lucanus J
- Vicente Botet
- Vicente J. Botet Escriba