
Steven wrote:
Ok. There might be extra optimizations possible when the index is known at compile time as opposed to run time. (I'm not talking about the difference between get<X>(p) and p[X] here, but about the difference between X being known at compile time, using templates to avoid code duplication, and X being known only at run time, using function arguments.) This is really a property of the algorithm rather than the point class, though.
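To make the distinction above concrete, here is a minimal sketch that is not code from either poster. The point2d type and the get and delta functions are hypothetical names chosen for illustration. The same algorithm is written once with the axis as a template parameter and once with the axis as a run-time argument.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 2D point type, for illustration only.
struct point2d {
    int coords[2];
};

// Index fixed at compile time: each Axis value gets its own instantiation.
template <std::size_t Axis>
inline int get(const point2d& p) {
    return p.coords[Axis];
}

// Index supplied at run time as an ordinary function argument.
inline int get(const point2d& p, std::size_t axis) {
    return p.coords[axis];
}

// One algorithm body, parameterized on the axis at compile time.
template <std::size_t Axis>
inline int delta(const point2d& a, const point2d& b) {
    return get<Axis>(b) - get<Axis>(a);
}

// The same algorithm with the axis chosen at run time.
inline int delta(const point2d& a, const point2d& b, std::size_t axis) {
    return get(b, axis) - get(a, axis);
}

// Small self-check: both forms compute the same result.
inline void check_delta() {
    const point2d a = {{1, 5}};
    const point2d b = {{4, 9}};
    assert(delta<0>(a, b) == delta(a, b, 0));
    assert(delta<1>(a, b) == delta(a, b, 1));
}
```

In the template form the compiler always sees a constant index. In the run-time form it sees one only if it manages to inline the call and propagate the constant.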
I agree with you. It really taxes the compiler to optimize my highly nested inline function calls, and it has too much opportunity to give up early instead of getting the job done. Switching from gcc 3.4.2 to gcc 4.2.0 resulted in about a 30% speedup in application code that relies heavily on my types and algorithms. Compile times went up slightly too. That tells me that the compiler is less than fully successful in optimizing things. If the compiler is having trouble with constant propagation, we can't necessarily expect it to optimize away the overhead of the compile-time accessor either, but at least it doesn't have the option of giving up before instantiating the template function.

On a related note, we recently confirmed that the 4.3.0 compiler (on newer hardware) converts:

    int myMax(int a, int b){ return a > b ? a : b;}

into:

            .globl  _Z5myMaxii
            .type   _Z5myMaxii, @function
    _Z5myMaxii:
    .LFB2:
            .file 1 "t255.cc"
            .loc 1 7 0
    .LVL0:
            .loc 1 7 0
            cmpl    %edi, %esi
            cmovge  %esi, %edi
    .LVL1:
            .loc 1 10 0
            movl    %edi, %eax
            ret

instead of:

            .globl  _Z5myMaxii
            .type   _Z5myMaxii, @function
    _Z5myMaxii:
    .LFB2:
            .file 1 "t255.cc"
            .loc 1 7 0
            pushq   %rbp
    .LCFI0:
            movq    %rsp, %rbp
    .LCFI1:
            movl    %edi, -4(%rbp)
            movl    %esi, -8(%rbp)
            .loc 1 9 0
            movl    -4(%rbp), %eax
            cmpl    -8(%rbp), %eax
            jle     .L2
            movl    -4(%rbp), %eax
            movl    %eax, -12(%rbp)
            jmp     .L3
    .L2:
            movl    -8(%rbp), %eax
            movl    %eax, -12(%rbp)
    .L3:
            movl    -12(%rbp), %eax
            .loc 1 10 0
            leave
            ret

when compiling for an older processor or with an older compiler. That is about 4X fewer instructions and NO BRANCH instructions. Note: cmovge is a conditional move instruction. (The CMOVcc family dates back to the Pentium Pro rather than being new in the Core2 (Merom) processors, but it is the newer compiler targeting the newer hardware that emits it here.)

I have been using the following:

    template <class T>
    inline const T& predicated_value(const bool& pred, const T& a, const T& b) {
        const T* input[2] = {&b, &a};
        return *(input[pred]);
    }

instead of the ?: syntax because it was 35% faster than the branch-based machine code the compiler generated when executed on the Prescott-based hardware at the time.
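As a hedged usage sketch of the predicated_value helper quoted above: the wrappers max_ternary and max_predicated are hypothetical names, not part of the original code. They only show that the two forms select the same value. Whether either one compiles to a branch, a conditional move, or a table load depends on the compiler, flags, and target.

```cpp
#include <cassert>

// Branch-style selection; the compiler may emit a jump or a conditional move.
inline int max_ternary(int a, int b) {
    return a > b ? a : b;
}

// Table-lookup selection as given in the post: the predicate indexes a
// two-entry array of pointers (false -> slot 0 -> b, true -> slot 1 -> a).
template <class T>
inline const T& predicated_value(const bool& pred, const T& a, const T& b) {
    const T* input[2] = {&b, &a};
    return *(input[pred]);
}

// Hypothetical wrapper: max expressed through predicated_value.
inline int max_predicated(int a, int b) {
    return predicated_value(a > b, a, b);
}

// Small self-check: both forms agree, including on ties.
inline void check_max() {
    assert(max_predicated(3, 7) == max_ternary(3, 7));
    assert(max_predicated(7, 3) == max_ternary(7, 3));
    assert(max_predicated(5, 5) == max_ternary(5, 5));
}
```

Note that predicated_value evaluates both a and b before selecting, so it is only a drop-in replacement for ?: when neither operand has side effects or is expensive to compute.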
I'll be able to go back to letting the compiler know best as soon as we cycle out the old hardware and cycle in the new compiler.

Thanks,
Luke