
I'm a newbie to the list; if an answer to my question is in a FAQ I'm not aware of, please let me know, but I've googled the documentation and don't see it. In fact, I see little regarding performance factors to consider when using boost, so I may not be looking in the right places.

We're running a performance-sensitive application and so I've begun to find points to optimize in the code. The following is a digest of our experience bringing boost functionality into this effort (forgive the generic code; it was sanitized for intellectual property preservation, etc.). I am posting this experience in hopes that someone can confirm our reasoning and perhaps point us in the right direction for information on thinking through these kinds of issues in the future. Any advice is appreciated.

You'll see that, in part, my woes came from violating the cardinal rule of only optimizing the code that is most expensive; nonetheless, this specific instance is an involved enough example that I thought it worth pursuing...

==============

During the writing of an interface, a function was added to a templated class, which has-a stl vector, to sum SomeValue over its contained elements:

const unsigned allElementsValue() const
{
    unsigned result = 0;

    typename std::vector<T>::const_iterator thisElement(sequence_.begin());
    const typename std::vector<T>::const_iterator endElement(sequence_.end());

    while (thisElement != endElement)
    {
        result += thisElement->getSomeValue();
        ++thisElement;
    }

    return result;
}

At the time this interface was being written, it was a departure from a prior approach, so it was important that performance not be significantly affected. During this effort, we used KCachegrind and found that this call was (relatively) more expensive than hoped. Since std::accumulate was supposed to be a quicker way to walk through elements, that approach was tried and the following function resulted:

const unsigned allElementsValue() const
{
    return std::accumulate(
        boost::make_transform_iterator(sequence_.begin(),
                                       boost::mem_fn(&T::getSomeValue)),
        boost::make_transform_iterator(sequence_.end(),
                                       boost::mem_fn(&T::getSomeValue)),
        unsigned(0));
}

Note the use of make_transform_iterator and of a 0-ary member function of the constituent element objects. This code is semantically equivalent and yet yielded a relatively significant performance improvement. The performance increase can be attributed, in part, to the fact that the STL and boost use templates, so some of what would otherwise have to be done at run-time can now be done at compile-time.
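In case it helps anyone reproduce or profile this outside of our code base, here is a minimal, self-contained sketch of the two forms on a toy element type. The names here (Element, sumByLoop, sumByAccumulate) are invented for the example and are not from our code:

-----------
#include <iostream>
#include <numeric>
#include <vector>

#include <boost/iterator/transform_iterator.hpp>
#include <boost/mem_fn.hpp>

// Toy element type standing in for T.
struct Element
{
    explicit Element(unsigned v) : value_(v) {}
    unsigned getSomeValue() const { return value_; }
    unsigned value_;
};

// Hand-written while-loop form.
unsigned sumByLoop(const std::vector<Element>& sequence)
{
    unsigned result = 0;
    std::vector<Element>::const_iterator thisElement(sequence.begin());
    const std::vector<Element>::const_iterator endElement(sequence.end());
    while (thisElement != endElement)
    {
        result += thisElement->getSomeValue();
        ++thisElement;
    }
    return result;
}

// std::accumulate over transform_iterators that apply the 0-ary member.
unsigned sumByAccumulate(const std::vector<Element>& sequence)
{
    return std::accumulate(
        boost::make_transform_iterator(sequence.begin(),
                                       boost::mem_fn(&Element::getSomeValue)),
        boost::make_transform_iterator(sequence.end(),
                                       boost::mem_fn(&Element::getSomeValue)),
        unsigned(0));
}

int main()
{
    std::vector<Element> sequence;
    for (unsigned i = 0; i < 10; ++i)
        sequence.push_back(Element(i));

    // Both forms should print 45.
    std::cout << sumByLoop(sequence) << " "
              << sumByAccumulate(sequence) << std::endl;
    return 0;
}
-----------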
So we naively took the moral of the story to be that std::accumulate is Good and walking through elements manually with a loop is Bad.

The following code was later examined. Note that this particular code happens to be executed many, many times in our run, so any performance change to this code has a major effect on the overall performance of the program. Certainly a ripe and ready candidate for applying this new conclusion. The code was changed from this...

-----------
// Note that a and b (referred to in calcSomeValue below) are
// references to other objects passed into the function that
// has this code.
float result = 0.0;

// SomeObjectPtrCont is a typedef for a stl vector that contains
// boost::shared_ptrs managing pointers to SomeObject.
// SomeObjectPtrs_ is an instance of SomeObjectPtrCont.
SomeObjectPtrCont::const_iterator thisIter(SomeObjectPtrs_.begin());
const SomeObjectPtrCont::const_iterator endIter(SomeObjectPtrs_.end());

while (thisIter != endIter)
{
    result += (*thisIter)->calcSomeValue(a, b);
    ++thisIter;
}
-----------

...to this:

-----------
// Note that a and b (referred to in calcSomeValue below) are
// references to other objects passed into the function that
// has this code.
float result = 0.0;

// Use a typedef so that we can handle overloading of calcSomeValue.
typedef float (SomeObject::* calcFunct)(const T&, const OtherObjB&) const;

// Use stl accumulate to do the same thing as the while loop.
result = std::accumulate(
    boost::make_transform_iterator(
        SomeObjectPtrs_.begin(),
        boost::bind(static_cast<calcFunct>(&SomeObject::calcSomeValue), _1, a, b)),
    boost::make_transform_iterator(
        SomeObjectPtrs_.end(),
        boost::bind(static_cast<calcFunct>(&SomeObject::calcSomeValue), _1, a, b)),
    float(0.0));
-----------

Note that this is a significantly more sophisticated use of std::accumulate with boost::make_transform_iterator and boost::bind. Indeed, bind must now take a placeholder argument, marked by the _1, to link the function being called, calcSomeValue, with each element in the collection being walked through. The calcSomeValue() function itself is binary and, since the container holds shared_ptrs, it is not invoked directly on the constituent elements. Nonetheless, this is semantically equivalent to the more conventional while-loop form.

Remembering our earlier simplistic conclusion that std::accumulate is Good and that while-loops are Bad, this change was made, checked in, and pride swelled within the author (silly me) for so cleverly using boost, et al. The expectation at the time was that since templates would be involved more, much of the logic would be resolved at compile-time rather than run-time, giving us a performance increase just like with the simpler accumulation approach (basically, since one form of accumulation is good, then all must be better).

We discovered that the revised code now ran 4 times SLOWER. KCachegrind showed the problem was the above code. So what the *&^% gives?

Rather than analyze the situation, it's always better to tinker :), so the next attempt involved an approach that std::accumulate also allows:

-----------
// Note that a and b (referred to in calcSomeValue below) are
// references to other objects passed into the function that
// has this code.
float result = 0.0;

// Use a typedef so that we can handle overloading of calcSomeValue.
typedef float (SomeObject::* calcFunct)(const T&, const OtherObjB&) const;

// Alternate accumulation approach: plain iterators, with the
// transformation folded into the binary operation.
result = std::accumulate(
    SomeObjectPtrs_.begin(),
    SomeObjectPtrs_.end(),
    float(0.0),
    boost::bind(std::plus<float>(),
                _1,
                boost::bind(static_cast<calcFunct>(&SomeObject::calcSomeValue),
                            _2, a, b)));
-----------

This sped things up, but the code was still measured to be 3 times slower than the original while-loop form, though no longer 4 times as slow. Hardly a ringing endorsement for the use of std::accumulate.

Our best conclusion so far is that involving boost::bind in this kind of operation causes accumulate to go through additional function redirection at run-time to reach the calcSomeValue() function we ultimately want. This more than offsets any potential gain that might have been realized from std::accumulate, because its reliance on compile-time template resolution is thwarted by all of this run-time redirection.

Note that because of the arguments involved, the compiled code under boost::bind must now move variables onto the stack so that, when the underlying accumulate calls this implicit functor, those values are part of its state. All of this is done at run-time, whereas before, without having to move such variables, it was really just a change of a function pointer's value.
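If the extra run-time redirection really is the culprit, one experiment we have not yet tried would be to keep std::accumulate but replace boost::bind with a small hand-written function object that carries a and b as state while giving the compiler a plain operator() to see through. A sketch of the idea (the stand-in type names here are invented for illustration, not taken from our code):

-----------
#include <numeric>
#include <vector>

#include <boost/shared_ptr.hpp>

// Invented stand-ins for our real classes, just to make the sketch
// self-contained.
struct OtherObjA {};
struct OtherObjB {};

struct SomeObject
{
    float calcSomeValue(const OtherObjA&, const OtherObjB&) const { return 1.0f; }
};

typedef std::vector< boost::shared_ptr<SomeObject> > SomeObjectPtrCont;

// Hand-written accumulation functor: like boost::bind, it holds a and b
// as state, but its operator() is a single concrete function that the
// compiler has a good chance of inlining.
class AddCalcSomeValue
{
public:
    AddCalcSomeValue(const OtherObjA& a, const OtherObjB& b) : a_(a), b_(b) {}

    float operator()(float runningTotal,
                     const boost::shared_ptr<SomeObject>& element) const
    {
        return runningTotal + element->calcSomeValue(a_, b_);
    }

private:
    const OtherObjA& a_;
    const OtherObjB& b_;
};

float sumCalcSomeValue(const SomeObjectPtrCont& SomeObjectPtrs_,
                       const OtherObjA& a, const OtherObjB& b)
{
    // std::accumulate with the functor as the binary operation; no
    // transform_iterator and no bind-generated call wrappers involved.
    return std::accumulate(SomeObjectPtrs_.begin(),
                           SomeObjectPtrs_.end(),
                           float(0.0),
                           AddCalcSomeValue(a, b));
}
-----------

Whether this actually closes the gap is something we would still have to measure with KCachegrind.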
So the new current thinking is that std::accumulate is Better than simple while-loops that walk through elements (which are still Bad), but that boost::bind under std::accumulate is Worse. At this point, the best advice seems to be that one should use std::accumulate whenever simple accumulations are needed; however, once one has to start involving boost::bind on 1+-ary functions, or anything else that requires additional run-time function redirection, then retreating back to a while-loop is the better-performing solution.

Thank you all for reading this far...

Kind regards,

Richard Newman
Crowley Davis Research
richard@nospam.cdres.com (take out the nospam. to email me directly)