
Anybody out there? Please tell me that I am not the only person on this mailing list to spot that the flop rates posted earlier are much too low compared to what a 2.8 GHz P4 should be able to deliver.

So, for a closer look, I get out the compiler, VC8 in my case. Example14, ok. Run. The first line appears, then the program hangs for minutes. Oh no, right, that was the debug build... ^C. Ok, once again in release mode. Works better, but the figures are still disappointing:

    f(x,y,z) took 4.218 seconds to run 1e+009 iterations with double = 9.48317e+008 flops
    f(x,y,z) took 5.047 seconds to run 1e+009 iterations with quantity<double> = 7.9255e+008 flops
    g(x,y,z) took 4.219 seconds to run 1e+009 iterations with double = 9.48092e+008 flops
    g(x,y,z) took 4.625 seconds to run 1e+009 iterations with quantity<double> = 8.64865e+008 flops

950 MFlops, already better than the numbers posted earlier. A brief look at the code: no memory referenced, just local variables. That should perform better. The zero overhead is not so zero after all.

Once again, to no avail. Another look at the code. Oh, what is this "if" in that loop? So measure that "if" alone:

    inline double f2(double x, double y, double z)
    {
        double V = 0, C = 0;
        for (int i = 0; i < TEST_LIMIT; ++i)
        {
            if (i % 100000 == 0)
                C = double(std::rand()) / RAND_MAX;
            //V = V + ((x + y) * z * C);
        }
        return V;
    }

That gives:

    f2(x,y,z) took 3.187 seconds to run 1e+009 iterations with double = 1.2551e+009 (would be) flops
    f2(x,y,z) took 3.141 seconds to run 1e+009 iterations with quantity<double> = 1.27348e+009 (would be) flops

Oops, the loop alone, even without any floating point, takes more than 3 seconds? More overhead than payload?
Another one:

    inline double f3(double x, double y, double z)
    {
        double V = 0, C = 0;
        for (int i = 0; i < TEST_LIMIT; )
        {
            C = double(std::rand()) / RAND_MAX;
            const int next_limit = std::min(TEST_LIMIT, i + 100000);
            for (; i < next_limit; ++i)
            {
                V = V + ((x + y) * z * C);
            }
        }
        return V;
    }

That gives:

    f3(x,y,z) took 1.515 seconds to run 1e+009 iterations with double = 2.64026e+009 flops
    f3(x,y,z) took 4.656 seconds to run 1e+009 iterations with quantity<double> = 8.59107e+008 flops

2.6 GFlops? That is ok for a single thread. But the zero overhead now appears to be a factor of 3!

What do you say? Gcc would do better? So switch to the Linux box. g++ -O3. Looks better right from the start, even though the P4 is supposed to be slower than the Core 2.

    f(x,y,z) took 3.22 seconds to run 1e+09 iterations with double = 1.24224e+09 flops
    f(x,y,z) took 3.21 seconds to run 1e+09 iterations with quantity<double> = 1.24611e+09 flops
    f2(x,y,z) took 3.22 seconds to run 1e+09 iterations with double = 1.24224e+09 (would be) flops
    f2(x,y,z) took 3.22 seconds to run 1e+09 iterations with quantity<double> = 1.24224e+09 (would be) flops
    f3(x,y,z) took 0.51 seconds to run 1e+09 iterations with double = 7.84314e+09 flops
    f3(x,y,z) took 0.65 seconds to run 1e+09 iterations with quantity<double> = 6.15385e+09 flops

Oh, but what is that? 7.84 GFlops over there? That goes beyond the peak performance of the processor! GCC must be cheating here! Hmm. What does the Intel compiler give?
    f(x,y,z) took 4.2 seconds to run 1e+09 iterations with double = 9.52381e+08 flops
    f(x,y,z) took 5.29 seconds to run 1e+09 iterations with quantity<double> = 7.56144e+08 flops
    f2(x,y,z) took 4.19 seconds to run 1e+09 iterations with double = 9.54654e+08 (would be) flops
    f2(x,y,z) took 4.18 seconds to run 1e+09 iterations with quantity<double> = 9.56938e+08 (would be) flops
    f3(x,y,z) took 0.47 seconds to run 1e+09 iterations with double = 8.51064e+09 flops
    f3(x,y,z) took 6.95 seconds to run 1e+09 iterations with quantity<double> = 5.7554e+08 flops

Hmm. Even more cheating on the plain doubles, but it does not seem to like the templates: for this one, the overhead increases to nearly a factor of 15!

So let's play a bit further... What are those funny "inline"s for? Let's try to #define them away... g++ -O3 again:

    f(x,y,z) took 3.25 seconds to run 1e+09 iterations with double = 1.23077e+09 flops
    f(x,y,z) took 9.96 seconds to run 1e+09 iterations with quantity<double> = 4.01606e+08 flops
    f2(x,y,z) took 3.23 seconds to run 1e+09 iterations with double = 1.23839e+09 (would be) flops
    f2(x,y,z) took 3.19 seconds to run 1e+09 iterations with quantity<double> = 1.25392e+09 (would be) flops
    f3(x,y,z) took 0.52 seconds to run 1e+09 iterations with double = 7.69231e+09 flops
    f3(x,y,z) took 10.2 seconds to run 1e+09 iterations with quantity<double> = 3.92157e+08 flops

Ouch. f3 on quantity<double> had been 0.65 seconds before; now it is 10 seconds. Somehow gcc forgot to cheat (eh... "optimize") here. And even the original example f gets about 3 times slower. There are only about 1000 (even non-virtual) function calls involved; these cannot possibly add up to 6 or even 10 seconds. Something different is going on here...

Well, it is getting late, I will stop here. The long and the short of it: I don't believe the zero overhead. Not with the compilers I currently have at hand.
Furthermore, the example is simply not meaningful; it allows the compilers to play so many tricks that the resulting numbers are little more than noise.

Matthias, I therefore have three further points for the "application domain and restrictions" page:

- By the use of the library, the performance of the debug build of your software may or may not degrade by several orders of magnitude, depending on the actual code.

- The library is very demanding on the compiler. Apart from the compatibility requirements, the performance penalty induced by the use of the library mostly varies between zero and a factor of three; even higher numbers have been observed in very special cases. [Btw, has anybody done a comparison of compile times on reasonably sized projects?]

- The use of this library may impose additional obstacles when doing in-depth performance tuning of numerical computations, as the compilers may or may not recognize certain optimization possibilities anymore.

I would have liked to give you more positive feedback,

Martin.