Re: [boost] GTL compile time vs. run time accessors

3 May 2008

      Steven wrote:
...
Ok.  There might be extra optimizations possible when the index is
known at compile time as opposed to run time.  (I'm not talking about
the difference between get<X>(p) and p[X] here, but about the
difference between X being known at compile time, using templates to
avoid code duplication, and X being known only at run time, using
function arguments.)  This is really a property of the algorithm rather
than the point class, though.

I agree with you.  It really taxes the compiler to optimize my highly
nested inline function calls, and it has too much opportunity to give up
early instead of getting the job done.  Switching from gcc 3.4.2 to gcc
4.2.0 resulted in about a 30% speedup in application code that relies
heavily on my types and algorithms.  Compile times went up slightly too.
That tells me that the compiler is less than fully successful in
optimizing things.  If the compiler is having trouble with constant
propagation, we can't necessarily expect it to optimize away the
overhead of the compile time accessor either, but at least it doesn't
have the option of giving up before instantiating the template function.
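
To make the distinction concrete, here is a minimal sketch.  It is not
the GTL interface: point2d, get<I> and both delta functions are
hypothetical names invented for illustration.  The first delta takes the
axis as a template parameter, so each instantiation already refers to a
fixed element; the second takes it as a function argument and depends on
the compiler inlining and constant-propagating to reach the same code.

```cpp
#include <cstddef>

// Hypothetical 2D point used only for this sketch; not the GTL point type.
struct point2d {
  int coords[2];
  // Run time accessor: the index is an ordinary function argument.
  int operator[](std::size_t i) const { return coords[i]; }
};

// Compile time accessor: the index is a template parameter, so every
// instantiation refers to one fixed element.
template <std::size_t I>
inline int get(const point2d& p) {
  static_assert(I < 2, "index out of range");
  return p.coords[I];
}

// Algorithm with the axis fixed at compile time: one definition in the
// source, one instantiation per axis.
template <std::size_t Axis>
inline int delta(const point2d& a, const point2d& b) {
  return get<Axis>(b) - get<Axis>(a);
}

// Same algorithm with the axis chosen at run time.  The compiler has to
// inline the call and propagate the constant to get equivalent code.
inline int delta(const point2d& a, const point2d& b, std::size_t axis) {
  return b[axis] - a[axis];
}
```

Calling delta<0>(a, b) and delta(a, b, 0) gives the same result; the
difference is only in how much work is left to the optimizer.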

On a related note, we recently confirmed that the 4.3.0 compiler (on
newer hardware) converts:

int myMax(int a, int b){ return a > b ? a : b;}

into:

.globl _Z5myMaxii
	.type	_Z5myMaxii, @function
_Z5myMaxii:
.LFB2:
	.file 1 "t255.cc"
	.loc 1 7 0
.LVL0:
	.loc 1 7 0
	cmpl	%edi, %esi
	cmovge	%esi, %edi
.LVL1:
	.loc 1 10 0
	movl	%edi, %eax
	ret

instead of:

.globl _Z5myMaxii
	.type	_Z5myMaxii, @function
_Z5myMaxii:
.LFB2:
	.file 1 "t255.cc"
	.loc 1 7 0
	pushq	%rbp
.LCFI0:
	movq	%rsp, %rbp
.LCFI1:
	movl	%edi, -4(%rbp)
	movl	%esi, -8(%rbp)
	.loc 1 9 0
	movl	-4(%rbp), %eax
	cmpl	-8(%rbp), %eax
	jle	.L2
	movl	-4(%rbp), %eax
	movl	%eax, -12(%rbp)
	jmp	.L3
.L2:
	movl	-8(%rbp), %eax
	movl	%eax, -12(%rbp)
.L3:
	movl	-12(%rbp), %eax
	.loc 1 10 0
	leave
	ret

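when compiling for an old processor or with an old compiler.  That is
about 4X fewer instructions and NO BRANCH instructions.  Note: cmovge is
not itself a new instruction (conditional moves have been part of x86
since the Pentium Pro); what is new is that the compiler chooses it for
us on the Core2 (merom) processors.  I have been using the following: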

template <class T>
inline const T& predicated_value(const bool& pred, const T& a, const T& b) {
  // Index a two-entry table with pred (false -> b, true -> a), no branch.
  const T* input[2] = {&b, &a};
  return *(input[pred]);
}

instead of the ?: syntax because it was 35% faster than the branch-based
machine code the compiler generated when executed on the Prescott-based
hardware we had at the time.  I'll be able to go back to letting the
compiler know best as soon as we cycle out the old hardware and cycle in
the new compiler.
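
For comparison, here is a small sketch of the two forms side by side.
max_ternary and max_predicated are names made up for this example, and
predicated_value is repeated so the snippet stands alone.  Whether either
form compiles to a branch, a cmov or a table load depends on the
compiler, the flags and the target, so treat it as something to measure
rather than a guarantee.

```cpp
// Repeated from the post so this snippet is self-contained.
template <class T>
inline const T& predicated_value(const bool& pred, const T& a, const T& b) {
  const T* input[2] = {&b, &a};  // false selects b, true selects a
  return *(input[pred]);
}

// Conventional form: the compiler decides between a branch and a
// conditional move.
inline int max_ternary(int a, int b) { return a > b ? a : b; }

// Table-lookup form: the selection is expressed as an indexed load, so
// there is no branch in the source for the compiler to preserve.
inline int max_predicated(int a, int b) {
  return predicated_value(a > b, a, b);
}
```

Both functions return the same values for all inputs; only the generated
code may differ.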

Thanks,
Luke

Simonson, Lucanus J