
i'd like to present some kind of micro optimization towards pipelined cpus. traditionally min() has an implementation like

  template<typename type>
  inline type min(type a, type b) { return a<b ? a : b; }

this function (even if inlined) introduces a branch into the instruction sequence. branch prediction algorithms do exist, but since the routine's input is arbitrary they will guess at most ~50% of such branches correctly, and a mispredicted branch causes a pipeline stall. so i thought that a routine as useful as min() MUST have a non-stalling (read: non-branching) implementation. i thought hard and in the end recalled a simple technique called a LUT (for 'look-up table'). here's what i finally got:

  template<typename type>
  inline type min(type a, type b)
  { const type c[2] = {a, b}; return c[b<a]; }

intuitively this implementation should be slower than the first one, but in real life it runs faster (roughly by a factor of 2 for 4-byte types on a core2duo). here's the test code ('a' holds the running minimum):

  #include <vector>

  int main()
  {
    typedef int type;
    const int n = 1000000;
    std::vector<type> v1(n), v2(n);
    //filling v1 and v2 somehow
    type a = BIG_VALUE;
    for (size_t i = 0; i<n; ++i)
      a = min(a, min(v1[i], v2[i])); //searching the minimum of v1 and v2
    return 0;
  }

however for 'double' it runs ~2% slower (i think because building a lut of 2 elements of 8 bytes each is slower than a branch on c2d). nevertheless i think it's a great approach, at least for ints. max() can be implemented very similarly.

you may skip the rest -- it's just an example of usage. imagine we want to implement a symmetric matrix that stores only half of its elements (say the upper triangle):

  a00 a01 a02
  --- a11 a12
  --- --- a22

access requests to lower-triangle elements would then be redirected to upper-triangle ones. here we have a constraint: we may only access elements a(i, j) where i<=j (look at the indices above). if we use a branch like

  //...
  return i<=j ? a(i, j) : a(j, i);

we get a pipeline stall almost every time we access an arbitrary element. however, if we rewrite it carefully as

  //...
  const size_t m = min(i, j); //non-branching min() here
  return a(m, m + abs(int(i) - int(j))); //abs() is also non-branching

we get rid of branches altogether, and we can hope a pipelined cpu will handle this better than the branched implementation.

-- pavel

traditionally min() has an implementation like
template<typename type> inline type min(type a, type b) { return a<b ? a : b; }
well, depending on your compiler optimization and cpu architecture, the compiler may be able to optimize it, removing the branch ... e.g. different revisions of sse provide opcodes for min/max of floats/doubles and different kinds of integers ... in those cases, your branchless code may even be slower, unless it is optimized to use these instructions as well :) tim -- tim@klingt.org http://tim.klingt.org The composer makes plans, music laughs. Morton Feldman

Tim Blechmann wrote:
well, depending on your compiler optimization and cpu architecture, the compiler is able to optimize it, removing the branch ... All the code I write with ?: generates asm without branches, but with the sbb instruction. So I wonder how, and with which compiler, the OP did his test.
-- ___________________________________________ Joel Falcou - Assistant Professor PARALL Team - LRI - Universite Paris Sud XI Tel : (+33)1 69 15 66 35

on 10.09.2009 at 21:31 Tim Blechmann wrote :

well, depending on your compiler optimization and cpu architecture, the compiler is able to optimize it, removing the branch ... e.g. different revisions of sse provide opcodes for min/max of floats/doubles and different kinds of integers ... in those cases, your branchless code may be even slower, unless it isn't optimized to use these instructions as well :)

i could not get rid of a jump with msvc80; maybe other compilers can do it (tell me if you find out). i looked at what sseX can offer, but unfortunately it isn't portable. besides, AFAIK all modern cpus have a pipeline (at least 2 stages), so you probably will not miss if you count on it

-- Pavel

well, depending on your compiler optimization and cpu architecture, the compiler is able to optimize it, removing the branch ... e.g. different revisions of sse provide opcodes for min/max of floats/doubles and different kinds of integers ... in those cases, your branchless code may be even slower, unless it isn't optimized to use these instructions as well :) i could not get rid of a jump with msvc80 maybe other compiler can do it (tell me if you find out)
i am mainly working with gcc and floating point code ... i stopped trying to optimize this kind of branching code, because the compiler did a very good job ... i don't have any experience with msvc, though ... tim -- tim@klingt.org http://tim.klingt.org I must say I find television very educational. The minute somebody turns it on, I go to the library and read a good book. Groucho Marx

DE wrote:
i'd like to present some kind of micro optimization towards pipelined cpus
traditionally min() has an implementation like
template<typename type> inline type min(type a, type b) { return a<b ? a : b; }
branch free version
template<typename type> inline type min(type a, type b) { const type c[2] = {a, b}; return c[b<a]; }
Hmm, I brought this exact idea to boost about three years ago. I had min, abs and median of three. Here is my median-of-three implementation:

  template <class T>
  inline const T& median(const T& a, const T& b, const T& c)
  {
    const bool alb = a < b;
    const bool blc = b < c;
    const bool alc = a < c;
    const T* input[3] = {&a, &b, &c};
    unsigned int index = 0; //default is to return a
    index += (alb & blc) | (!alb & !blc); //return b
    index += (unsigned int)((alc & !blc) | (!alc & blc)) << 1; //return c
    return *(input[index]);
  }

No branches required. You can improve the efficiency for larger types by using arrays of pointers to the arguments instead of arrays of argument values. In my own testing the speedup on Prescott was 25% for min and 45% for median of three. I would expect less speedup than that on Core2 processors because the pipeline is shorter. Also, there is a new instruction in Core2 that allows min to compile to a branchless instruction stream. You need to use the proper flags with the compiler to enable the use of new instructions; if you are compiling in max-compatibility mode you are dropping performance on the floor. In our testing, branch-free min (as you propose) was no faster than a < b ? a : b when compiling with the new instructions enabled for a Core2 machine. Median of three was still 25% faster, though. The response from boost three years ago was that such low-level, architecture-specific optimizations have no place in portable libraries. Hope that helps, Luke

on 10.09.2009 at 21:41 Simonson, Lucanus J wrote :
Hmm, I brought this exact idea to boost about three years ago. I had min, abs and median of three. [code here] You can improve the efficiency for larger types by using arrays of pointer to argument instead of arrays of argument value. implemented that...
In my own testing the speedup for prescott was 25% for min and 45% for median of three. I would expect less speedup than that in core2 processors because the pipeline is shorter. the pipeline is shorter for a reason; it should not degrade overall performance (on the contrary, it should speed things up)
Also there is a new instruction in core2 that allows the min to compile to a branchless instruction stream. You need to use the proper flags with the compiler to enable the use of new instructions. If you are compiling in max compatibility mode you are dropping performance on the floor. In our testing branch free min (as you propose) was no faster than a < b ? a:b; when compiling with the new instructions enabled for a core2 machine. Median of three was still 25% faster though. The response from boost three years ago was that such low level architecture specific optimizations have no place in portable libraries. i wasn't here 3 years ago, but as stepanov once said, "great minds think alike". anyway, thanks; i'll think it over and certainly try it out
-- Pavel ps good luck with your polygon lib

On Thu, Sep 10, 2009 at 10:17 AM, DE <satan66613@yandex.ru> wrote:
i'd like to present some kind of micro optimization towards pipelined cpus
traditionally min() has an implementation like
template<typename type> inline type min(type a, type b) { return a<b ? a : b; }
this function (even if inlined) introduces a branch in a sequence of instructions
I'm curious, have you actually tried to look at the generated instructions? Because on a 686+ this should compile into two non-branching instructions for min():

  cmp a, b
  cmova a, b

And the same thing but with cmovb for max(). Your solution should generate at least 5 instructions. If an Intel engineer is to be believed, cmov is great for use in normal code, but when used in a short loop it creates dependencies that cause a performance hit. It sounds like your benchmark may be hitting this inefficiency squarely on the nose. https://mail.mozilla.org/pipermail/tamarin-devel/2008-April/000454.html -- Cory Nelson http://int64.org

on 11.09.2009 at 3:19 Cory Nelson wrote :
I'm curious, have you actually tried to look at the generated instructions? of course
Because on a 686+ this should compile this into two non-branching instructions for min(): cmp a, b cmova a, b And the same thing but with cmovb for max(). neither icc11 nor msvc80 generates such instructions: msvc generates a conditional jump, and icc generates code "without branches but with the sbb asm instruction" (c) Joel Falcou
You solution should generate at least 5 instructions. yep
i got an interesting result: msvc happily optimizes the lut implementation but generates a branch for the traditional one (?:), while icc eliminates the branch for ?: but generates slow code for the lut. in the end we have two competitors, msvc_lut and icc_?:, and the record holder is msvc, which is some percent faster than the icc code. also, branches were faster than the lut on amd athlon and athlon xp+, so i guess such an implementation is not a good idea in general -- Pavel

DE wrote:
on 11.09.2009 at 3:19 Cory Nelson wrote :
I'm curious, have you actually tried to look at the generated instructions? of course
Because on a 686+ this should compile this into two non-branching instructions for min(): cmp a, b cmova a, b And the same thing but with cmovb for max(). neither icc11 nor msvc80 generate such instructions msvc generates a conditional jump and icc generates code "without branches but with the sbb asm instruction" (c) Joel Falcou
I believe the conditional move introduces a dependency that is about as bad as a conditional jump. It also limits the code to the latest processor models. The compilers generally try to avoid this situation. :-) Bo Persson

On 9/11/09, Bo Persson <bop@gmb.dk> wrote:
DE wrote:
on 11.09.2009 at 3:19 Cory Nelson wrote :
I'm curious, have you actually tried to look at the generated instructions? of course
Because on a 686+ this should compile this into two non-branching instructions for min(): cmp a, b cmova a, b And the same thing but with cmovb for max(). neither icc11 nor msvc80 generate such instructions msvc generates a conditional jump and icc generates code "without branches but with the sbb asm instruction" (c) Joel Falcou
I believe the conditional move introduces a dependency that is about as bad as a conditional jump.
Conditional jumps actually break dependencies: the processor can speculate beyond them and execute code at the jump target even before computing the condition. Conditional moves do introduce dependencies, as does any other if-conversion (including an indexed load); thus an if-conversion should only be used if the CPU has no chance of predicting the jump, as it otherwise limits the out-of-order capabilities of the CPU. Cmov and other predicated instructions should be as fast as or faster than if-conversions done with bitwise instructions or other instruction combinations.
It also limits the code to the latest processor models.
The compilers generally try to avoid this situation. :-)
cmov on x86 has been available since the first Pentium Pro, which was launched almost 15 years ago, and it is guaranteed to exist on x86_64. By default, some compilers still produce code compatible with the original 386, so you might need to use appropriate compiler flags. HTH, -- gpd

You might find some sections of Sean Eron Anderson's "Bit Twiddling Hacks" of interest, though perhaps not of practical use. Check out:

Compute the minimum (min) or maximum (max) of two integers without branching <http://graphics.stanford.edu/%7Eseander/bithacks.html#IntegerMinOrMax>

Conditionally set or clear bits without branching <http://graphics.stanford.edu/%7Eseander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching>

Conditionally negate a value without branching <http://graphics.stanford.edu/%7Eseander/bithacks.html#ConditionalNegate>

It's worth noting that the better your compiler's optimization, the better the loop in your test case will do on your branchless method. But a good optimizing compiler would probably avoid the branch in the first place. To use the technique to its fullest, you or the optimizer should translate your loop into something like (untested):

  type a[2] = {BIG_VALUE, ANY_VALUE};
  type* vp[2] = {&v1[0] - 1, &v2[0] - 1}; // I know, but I'm expressing some possibly low level optimization.
  for (size_t i = 0; i<n; ++i)
  {
    a[1] = *vp[*++vp[1] < *++vp[0]];
    a[0] = a[a[1] < a[0]];
  }

(My apologies if I did something dumb -- my work situation has resulted in my not doing much C++ for a number of years -- but you should get the idea.) A good optimizing compiler should be able to do most or all of this from the inlined method, though the "relaxation" of the transfer of the move to "c[]" into an indirection as vp is pretty sophisticated. Binding the local a and the temporary containing the min of the v's to the vector they would otherwise be transferred to is routine. Of course, if this were code you really wanted to optimize, rather than a test loop, you would improve locality by processing each vector separately, making c[] a reference.

Topher

on 11.09.2009 at 4:39 Topher Cooper wrote :
You might find some sections of Sean Aaron Anderson's "Bit-Twiddling Hacks" of interest, though perhaps not of practical use. Check out: Compute the minimum (min) or maximum (max) of two integers without branching <http://graphics.stanford.edu/%7Eseander/bithacks.html#IntegerMinOrMax> Conditionally set or clear bits without branching <http://graphics.stanford.edu/%7Eseander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching> Conditionally negate a value without branching <http://graphics.stanford.edu/%7Eseander/bithacks.html#ConditionalNegate> i'll check it out
Its worth noting that the better optimization your compiler has the better the loop in your test case will do on your branchless method. But a good optimizing compiler would probably avoid the branch in the first place. To use the technique to its fullest you or the optimizer should translate your loop into something like (untested): type a[2] = {BIG_VALUE, ANY_VALUE}; type* vp[2] = {--&v1[0], --&v2[0]}; for (size_t i = 0; i<n; ++i) { a[1] = *vp[*++vp[1]<*++vp[0]]; a[0] = a[a[1]<a[0]]; } i thought about it, but haven't tried it yet
A good optimizing compiler should be able to do most or all of this from the inlined method, though the "relaxation" of the transfer of the move to "c[]" into an indirection as vp is pretty sophisticated. Binding the local a and the temporary containing the min of the v's to the vector they would otherwise be transfered to is routine. Of course, if this were code you really wanted to optimize, rather than a test loop, you would improve locality by processing each vector separately by making c[] a reference my intention was just to heavily load the min() implementation. alternatively, one could compute 'res[i] = min(v1[i], v2[i])' in a loop; the timing would be very similar
on 11.09.2009 at 13:10 Thorsten Ottosen wrote :

If you are going to benchmark over several platforms, please include the technique described here: http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax

thanks for the link. by the way, msvc's abs() does just that, like in this paper

-- Pavel

on 11.09.2009 at 4:39 Topher Cooper wrote :

You might find some sections of Sean Aaron Anderson's "Bit-Twiddling Hacks" of interest, though perhaps not of practical use. Check out: Compute the minimum (min) or maximum (max) of two integers without branching <http://graphics.stanford.edu/%7Eseander/bithacks.html#IntegerMinOrMax>

tried this. it works equally well (or badly) with icc and msvc, but worse than the best cases, so i guess it's not worth it, because we want all or nothing. in the end i think one should use the ?: implementation and hope that compiler and processor vendors will handle it efficiently

-- Pavel

DE skrev:
i'd like to present some kind of micro optimization towards pipelined cpus
traditionally min() has an implementation like
template<typename type> inline type min(type a, type b) { return a<b ? a : b; }
If you are going to benchmark over several platforms, please include the technique described here: http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax (There is also code for abs()). -Thorsten
participants (9)
- Bo Persson
- Cory Nelson
- DE
- Giovanni Piero Deretta
- joel
- Simonson, Lucanus J
- Thorsten Ottosen
- Tim Blechmann
- Topher Cooper