Hi all -
I just tried out the operators library for the first time today. I
was quite happy with how easily it helped me implement my test class.
Unfortunately, I'm running into some performance issues, in particular
with operator +.
The interesting thing is that the new operator += is actually faster
than my old += by about a factor of 2, but my operator + is slower
than my old + by a factor of nearly 10!
I'm new here, and not really sure how much code is appropriate to post
to the list, so here's a minimal rundown:
--------------------------------------------------------------------------
Old code:

template<typename real>
Vector3<real> operator + (const Vector3<real> &a, const Vector3<real> &b);

template<typename real>
class Vector3 {
protected:
    real m_v[3];
    .
    .
    friend Vector3 operator +<> (const Vector3 &a, const Vector3 &b);
};

template<typename real>
inline Vector3<real> operator + (const Vector3<real> &a, const Vector3<real> &b){
    Vector3<real> t;
    t[0] = a[0] + b[0];
    t[1] = a[1] + b[1];
    t[2] = a[2] + b[2];
    return t;
}
--------------------------------------------------------------------------
New code (note that the new code is templatized on size, and that my instantiations were of size 3 to match the above code):
template
On Monday 08 May 2006 06:05, Brian Budge wrote:
--------------------------------------------------------------------------
New code (note that the new code is templatized on size, and that my instantiations were of size 3 to match the above code):

template
class vec : boost::additive< vec
          , boost::multiplicative< vec
          , T {
private:
    typedef unsigned int uint;
    typedef const T &const_reference;
    typedef T &reference;

private:
    boost::array m_v;

    template<typename U>
    vec &operator += (const vec &t){
        for(uint i = 0; i < S; ++i){
            m_v[i] += t.m_v[i];
        }
        return *this;
    }
};

Any ideas how to increase the performance of the new code here? A factor of 10 makes it seem like I am just missing something important.
I don't know anything about the operators library, but I noticed that the fast implementation does not have the loop that the slow one has. Loops typically take some time to set up, so that could be the problem.
Fred
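For readers following the thread, the pattern under discussion can be sketched without Boost. The block below is only an illustration, not the poster's code and not Boost's implementation: `addable_sketch` and `vec_sketch` are hypothetical names, and the parameter list `<typename T, std::size_t S>` is an assumption, since the original parameter list is missing from the post. It shows how a base class can derive a binary `+` from the class's own `+=`; in this sketch, `a + b` on two named objects copies the left operand and then runs the looped `+=`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an operators-style base class: it derives a
// binary operator+ from the derived class's own operator+=.
// This is NOT Boost's implementation, only an illustration of the pattern.
template <typename Derived>
struct addable_sketch {
    // Takes the left operand by value (one copy), then reuses +=.
    friend Derived operator+(Derived lhs, const Derived& rhs) {
        lhs += rhs;
        return lhs;
    }
};

// Assumed parameter list: element type T and element count S.
template <typename T, std::size_t S>
class vec_sketch : addable_sketch< vec_sketch<T, S> > {
public:
    vec_sketch() : m_v() {}  // zero-initialise all elements

    T& operator[](std::size_t i) { return m_v[i]; }
    const T& operator[](std::size_t i) const { return m_v[i]; }

    // Element-wise += written as a plain loop, as in the post.
    vec_sketch& operator+=(const vec_sketch& t) {
        for (std::size_t i = 0; i < S; ++i) {
            m_v[i] += t.m_v[i];
        }
        return *this;
    }

private:
    T m_v[S];
};

// Small self-check: (0,1,2) + (10,10,10) should give (10,11,12).
inline void vec_sketch_self_check() {
    vec_sketch<float, 3> a, b;
    for (std::size_t i = 0; i < 3; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 10.0f;
    }
    vec_sketch<float, 3> c = a + b;  // found through the friend in the base
    assert(c[0] == 10.0f && c[1] == 11.0f && c[2] == 12.0f);
}
```

Whether the extra copy and the loop actually cost anything depends on how well the compiler inlines and optimises them, which is what the rest of the thread tries to pin down.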
Any ideas how to increase the performance of the new code here? A factor of 10 makes it seem like I am just missing something important.
I would suspect it's the loop that's at fault, although I'm very surprised it's a factor of 10. Your original code had the loop unrolled, so you might try a bit of template metaprogramming to achieve the same effect here. Otherwise you're going to have to do a bit of debugging and/or inspection of the generated assembly.
BTW, the measurements you made were in release mode, right? If inline expansions are turned off (debug mode, for example), the operators-based version may well pass through many more function calls. Of course these all disappear as long as your compiler does a reasonable job of inlining.
HTH, John.
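The template metaprogramming suggestion above could look roughly like the following. This is a minimal sketch under stated assumptions: `unrolled_add` and `add_assign` are hypothetical names introduced here, and nothing in the thread shows the code Brian actually tried. The recursion is expanded at compile time, so no run-time loop counter remains in the source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: adds src[0..N-1] into dst[0..N-1], with the loop
// expanded at compile time through recursive template instantiation.
template <std::size_t N>
struct unrolled_add {
    template <typename T>
    static void apply(T* dst, const T* src) {
        unrolled_add<N - 1>::apply(dst, src);  // handle elements 0..N-2
        dst[N - 1] += src[N - 1];              // then element N-1
    }
};

// Base case: nothing left to add, which ends the recursion.
template <>
struct unrolled_add<0> {
    template <typename T>
    static void apply(T*, const T*) {}
};

// Loop-free += for a fixed-size array of S elements.
template <typename T, std::size_t S>
void add_assign(T (&dst)[S], const T (&src)[S]) {
    unrolled_add<S>::apply(dst, src);
}

// Small self-check: (1,2,3) += (10,20,30) should give (11,22,33).
inline void unrolled_add_self_check() {
    float a[3] = {1.0f, 2.0f, 3.0f};
    const float b[3] = {10.0f, 20.0f, 30.0f};
    add_assign(a, b);
    assert(a[0] == 11.0f && a[1] == 22.0f && a[2] == 33.0f);
}
```

Whether this beats a plain loop is compiler-dependent; an optimiser may unroll a short fixed-count loop on its own, so measuring both versions is the only reliable check.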
Thanks for the ideas guys.
Compile options are like so:
g++ -O3 -msse -mfpmath=sse
I tried the metaprogramming technique (which is pretty nifty :) ), and
got interesting results.
Basically, it made my += operator run twice as SLOW, while making my +
operator run twice as FAST.
I have a feeling that this is all due to the different optimizations
that gcc is doing at multiple stages of compilation. For example, it
may be doing autovectorization of the simple loop case of +=, which it
can't figure out with the metaprogramming technique. I'm still
stumped as to why I'm roughly an order of magnitude slower with + than
with +=.
Any more insights?
Thanks again for the ideas so far!
Brian
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
Well, have you considered certainty of memory aliasing? In particular, gcc supports the restrict keyword (e.g. double *__restrict__ c), indicating that the memory pointed to by c will never be accessed by anything /but/ c, allowing it to make load-store and register usage optimizations it couldn't otherwise. In particular, it's 100% certain in the manually indexed case that a[0] will never ever refer to b[1]. Then again, it can't be as sure in the looped version.
Just a thought, that may or may not pan out. All it takes to try is a quick addition of __restrict__ however, so it's not a tough test.
- Greg Link
Penn State University
York College of Pennsylvania
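The `__restrict__` idea from Greg's message can be tried on a free function like the one below. This is a sketch, not code from the thread: `add_no_alias` is a hypothetical name, and `__restrict__` is a GCC/Clang extension rather than standard C++ (other compilers may spell it differently or not support it).

```cpp
#include <cassert>
#include <cstddef>

// __restrict__ promises the compiler that, inside this function, the array
// written through `out` does not overlap the arrays read through `a` or `b`.
// Calling it with an `out` that overlaps an input breaks that promise and
// is undefined behaviour.
inline void add_no_alias(float* __restrict__ out,
                         const float* __restrict__ a,
                         const float* __restrict__ b,
                         std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Small self-check: (1,2,3) + (4,5,6) should give (5,7,9).
inline void add_no_alias_self_check() {
    const float a[3] = {1.0f, 2.0f, 3.0f};
    const float b[3] = {4.0f, 5.0f, 6.0f};
    float out[3] = {0.0f, 0.0f, 0.0f};
    add_no_alias(out, a, b, 3);
    assert(out[0] == 5.0f && out[1] == 7.0f && out[2] == 9.0f);
}
```

The qualifier changes what the optimiser is allowed to assume, not the result, so its effect only shows up in timings or in the generated assembly.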
Thanks for the idea, Greg. I thought for sure you were on to something, but I tried adding the __restrict__ keyword in the operators.hpp binary_ops functions, and it made no difference :(
On Mon, 8 May 2006, Brian Budge wrote:
Any more insights?
Did you try with -funroll-loops? I once did a few tests with vectors, one
version with loops and the other with manually unrolled loops, and with
options -O3 -funroll-loops, the generated code was identical. But then
again, that was with g++-3.3.
As for the use of boost::operators, I don't know. I did a small test using
the following, and the generated code with g++-4.0 and g++-4.1 (with
options -O3 -msse -mfpmath=sse, both with and without -DUSE_OP) is identical
(diff reports no difference).
#ifdef USE_OP
#include
participants (5)
- Brian Budge
- François Duranleau
- Fred Labrosse
- Greg Link
- John Maddock