Re: [boost] non-branching min()

11 Sep 2009

      You might find some sections of Sean Aaron Anderson's "Bit-Twiddling 
Hacks" of interest, though perhaps not of practical use.  Check out:

Compute the minimum (min) or maximum (max) of two integers without 
branching 
<http://graphics.stanford.edu/%7Eseander/bithacks.html#IntegerMinOrMax>
Conditionally set or clear bits without branching 
<http://graphics.stanford.edu/%7Eseander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching>
Conditionally negate a value without branching 
<http://graphics.stanford.edu/%7Eseander/bithacks.html#ConditionalNegate>

Its worth noting that the better optimization your compiler has the 
better the loop in your test case will do on your branchless method.  
But a good optimizing compiler would probably avoid the branch in the 
first place.

To use the technique to its fullest you or the optimizer should 
translate your loop into something like (untested):

      type a[2] = {BIG_VALUE, ANY_VALUE};
      type* vp[2] = {--&v1[0], --&v2[0]};  // I know, but I'm expressing some possibly low level optimization.
      for (size_t i = 0; i<n; ++i)
      {
	a[1] = *vp[*++vp[1]<*++vp[0]];
	a[0] = a[a[1]<a[0]];
      }

(My apologies if I did something dumb -- my work situation has resulted in my not doing much C++ for a number of years -- but you should get the idea).

A good optimizing compiler should be able to do most or all of this from the inlined method, though the "relaxation" of the transfer of the move to "c[]" into an indirection as vp is pretty sophisticated.  Binding the local a and the temporary containing the min of the v's to the vector they would otherwise be transfered to is routine.

Of course, if this were code you really wanted to optimize, rather than a test loop, you would improve locality by processing each vector separately by making c[] a reference

Topher

DE wrote:
...
i'd like to present some kind of micro optimization towards pipelined
cpus
traditionaly min() has an implementation like
template<typename type> inline
    type min(type a, type b) { return a<b ? a : b; }
this function (even if inlined) introduces a branch in a sequence of
instructions
though branch prediction algorithms do exist they will guess at most
50% of them because routine input is arbitrary
since a wrong branch prediction causes pipeline stall i thought that
such a useful routine like min() MUST have non-stalling (read
'non-branching') implementation
i thought hard and in the end recalled a simple implementation which
is called LUT (for 'look-up table')
here's what i finally got:
template<typename type> inline
    type min(type a, type b)
    {
      const type c[2] = {a, b};
      return c[b<a];
    }
intuitively this implementation should work slower than the first one
but in real life it runs faster (roughly by a factor of 2 for 4 byte
types on core2duo)
here's a test code:
#include <vector>
    int main()
    {
      typedef int type;
      const int n = 1000000;
      std::vector<type> v1(n), v2(n);
      //filling v1 and v2 somehow
      type min = BIG_VALUE;
      for (size_t i = 0; i<n; ++i)
        a = min(a, min(v1[i], v2[i]));  //searching minima of v1 and v2
      return 0;
    }
however for 'double' it runs ~2% slower (i think it's because creating
a lut for 2 elements 8 bytes each is slower than a branch on c2d)
nevertheless i think it's a great approach at least for ints
also max() can implemented very similar
you may skip the rest
example of usage
imagine we want to implement a symmetric matrix that stores only half
of its elements (say upper triangle)
a00 a01 a02
    --- a11 a12
    --- --- a22
then access requests to the lower triangle elements would be
redirected to upper triangle ones
here we have a constraint that actually we only have access to
elements a(i, j) where i<=j must always be true (look at
indices above)
if we use a branch like
//...
    return i<=j ? a(i, j) : a(j, i);
we get a pipeline stall almost every time we want to access an
arbitrary element
however if we rewrite it carefully as
//...
    const size_t m = min(i, j);  //non-branching min() here
    return a(m, m + abs(int(i) - int(j)));  //abs() is also
                                            //non-branching
we get rid of branches at all! and we can hope a pipelined cpu will
handle it better than the other (branched) implementation
--
pavel
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: [boost] non-branching min()

Topher Cooper