[Boost-users] mpfr_float stack allocation

9 May 2016

      Hi Boost Users,

I use Boost.Multiprecision's wrappers for MPFR, particularly the
boost::multiprecision::mpfr_float.  I've had good success with this library
and its types so far, enjoying performance gains because of expression
templates, and easy access to mpfr and gmp for floats, ints, and
rationals.  However, in my complex number type, I am struggling with named
real temporaries in arithmetic operators.  I am looking for help with
multiplication, and I can generalize to all other operations needing a real
temporary later.  Here's my current code for multiplication, implemented in
terms of *=.

/**
Complex multiplication.  uses a single temporary variable
1 temporary, 4 multiplications
*/
complex& operator*=(const complex & rhs)
{
mpfr_float re_cache(real_*rhs.real_ - imag_*rhs.imag_); // cache the real
part of the result
imag_ = real_*rhs.imag_ + imag_*rhs.real_;
real_ = re_cache;
return *this;
}

I am using the naive evaluation formula, for better or worse.   But the
particular method of evaluation doesn't matter to me right now, it's the
temporary variable re_cache used to cache the real part is killing me.
Profiling using Allinea, gprof, and Callgrind reveals that a significant
percentage of time in the program is consumed creating and destroying
re_cache, as well as other temporaries scattered throughout my code.

In an effort to eliminate heap allocations, I first tried replacing my using
mpfr_float
= boost::multiprecision::number<boost::multiprecision::mpfr_float_backend<0,boost::multiprecision::allocate_dynamic>,
boost::multiprecision::et_off>;  with allocate_stack, just for these named
temporary variables.

However, the documentation
<http://www.boost.org/doc/libs/1_60_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats/mpfr_float.html>
states that allocate_stack only works in fixed precision, so the 0 digits
indicating variable precision should cause problems?  I expected 0 digits
to fail to compile, but compile it does.  The conversion from the type with
allocate_stack to allocate_dynamic is not provided by the = operator, so I
can write re_cache.convert_to<mpfr_float_dynamic>().  But using variable
precision with stack allocation appears to cause all sorts of problems.
Everything depending on complex multiplication in my program breaks,
indicated by massive failures in my test suite.  Thus, I currently regard
allocate_stack as a non-solution.

My question fundamentally regards how to deal with re_cache and the *=
operator for complex multiplication.  Anyone have ideas on how to get rid
of constant heap de/allocation of re_cache without inducing a ton more
arithmetic?  Using a static for it is not a solution, both because I
anticipate multithreading in this application in the future, and the fact
that precision will vary though the run, so I'd have to check the precision
of re_cache on every evaluation.  The C version of the program I am
re-implementing used OpenMP for threads, and used a thread-id-indexed
global for re_cache.  That solution then forces OpenMP onto anyone wanting
to use the complex class in multiple threads.  Hence, I view this as a
non-solution, too.

Again, my initial thought was 'heap allocation is the problem here, so I'll
use the stack allocated backend'.  But variable precision doesn't appear to
work with allocate_stack.  And statics are no good.

Any thoughts?

Thanks for your time,
Daniel Brake
Postdoc
University of Notre Dame
Applied and Computational Math and Stats