
i'd like to present some kind of micro optimization towards pipelined cpus traditionaly min() has an implementation like template<typename type> inline type min(type a, type b) { return a<b ? a : b; } this function (even if inlined) introduces a branch in a sequence of instructions though branch prediction algorithms do exist they will guess at most 50% of them because routine input is arbitrary since a wrong branch prediction causes pipeline stall i thought that such a useful routine like min() MUST have non-stalling (read 'non-branching') implementation i thought hard and in the end recalled a simple implementation which is called LUT (for 'look-up table') here's what i finally got: template<typename type> inline type min(type a, type b) { const type c[2] = {a, b}; return c[b<a]; } intuitively this implementation should work slower than the first one but in real life it runs faster (roughly by a factor of 2 for 4 byte types on core2duo) here's a test code: #include <vector> int main() { typedef int type; const int n = 1000000; std::vector<type> v1(n), v2(n); //filling v1 and v2 somehow type min = BIG_VALUE; for (size_t i = 0; i<n; ++i) a = min(a, min(v1[i], v2[i])); //searching minima of v1 and v2 return 0; } however for 'double' it runs ~2% slower (i think it's because creating a lut for 2 elements 8 bytes each is slower than a branch on c2d) nevertheless i think it's a great approach at least for ints also max() can implemented very similar you may skip the rest example of usage imagine we want to implement a symmetric matrix that stores only half of its elements (say upper triangle) a00 a01 a02 --- a11 a12 --- --- a22 then access requests to the lower triangle elements would be redirected to upper triangle ones here we have a constraint that actually we only have access to elements a(i, j) where i<=j must always be true (look at indices above) if we use a branch like //... return i<=j ? a(i, j) : a(j, i); we get a pipeline stall almost every time we want to access an arbitrary element however if we rewrite it carefully as //... const size_t m = min(i, j); //non-branching min() here return a(m, m + abs(int(i) - int(j))); //abs() is also //non-branching we get rid of branches at all! and we can hope a pipelined cpu will handle it better than the other (branched) implementation -- pavel