
Any ideas how to increase the performance of the new code here? A factor of 10 makes it seem like I am just missing something important.
I would suspect it's the loop that's at fault, although I'm very surprised it's a factor of 10. Your original code had the loop unrolled, so you might try a bit of template metaprogramming to achieve the same effect here. Otherwise you're going to have to do a bit of debugging and/or inspect the generated assembly. BTW, the measurements you made were in release mode, right? If inline expansion is turned off (in debug mode, for example), the operators-based version may well pass through many more function calls. Of course, these all disappear as long as your compiler does a reasonable job of inlining. HTH, John.
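P.S. In case it helps, here is a rough sketch of the kind of template-metaprogramming unrolling I had in mind. I don't know what your loop body actually computes, so a dot product over fixed-size arrays stands in for it, and the names unrolled_dot and dot are made up for illustration. Treat it as a starting point under those assumptions, not a drop-in fix.

```cpp
#include <cstddef>

// Hypothetical helper (not from the original code): the recursion is
// resolved at compile time, so the sum of a[i] * b[i] for i in [0, N)
// is written out term by term with no run-time loop counter.
template <std::size_t N>
struct unrolled_dot {
    template <class T>
    static T apply(const T* a, const T* b) {
        // Handle the last element, then recurse on the first N - 1.
        return a[N - 1] * b[N - 1] + unrolled_dot<N - 1>::apply(a, b);
    }
};

// Base case: zero elements contribute a value-initialized T.
template <>
struct unrolled_dot<0> {
    template <class T>
    static T apply(const T*, const T*) {
        return T();
    }
};

// Convenience wrapper that deduces N from the array type.
template <std::size_t N, class T>
T dot(const T (&a)[N], const T (&b)[N]) {
    return unrolled_dot<N>::apply(a, b);
}
```

Whether this beats a plain loop still depends on the compiler inlining each apply call, so it is worth checking the generated assembly either way.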