
Hello, I have made a few tests with the performance of BOOST_FOREACH. The results are for me unexpected slow...may be one has an idea if I make something wrong... vector[int]: index:0.00167619 vector[int]: iterator:0.00167619 vector[int]: BOOST_FOREACH:9.40036 list[int]: iterator:239.669 list[int]: BOOST_FOREACH:233.219 vector[TestData]: index:0.00167619 vector[TestData]: iterator:0.00167619 vector[TestData]: BOOST_FOREACH:9.40008 list[TestData]: iterator:233.827 list[TestData]: BOOST_FOREACH:238.513 testArray[int]: index:0.00167619 testArray[int]: BOOST_FOREACH:9.40008 Best regards Hansjörg the test program was: // ForEach.cpp : Defines the entry point for the console application. // #include "stdafx.h" #include "..\include\PerformanceCounter.h" #include <vector> #include <boost\foreach.hpp> #include <iostream> #include <list> using namespace std; using namespace Microtec; class TestData { public: TestData(int val1, int val2, int val3):var1(val1),var2(val2),var3(val3){} int var1; int var2; int var3; }; int testArray[10000000]; int _tmain(int argc, _TCHAR* argv[]) { vector<int> data; list<int> lstData; vector<TestData> vectorTestData; list<TestData> lstTestData; for(int i = 0; i < 10000000;i++) { data.push_back(i); lstData.push_back(i); vectorTestData.push_back(TestData(i,i,i)); lstTestData.push_back(TestData(i,i,i)); } int res = 0; const int actSize = (int)data.size(); for(int i = 0; i < actSize; i++) { res += data[i]; } //takes 0.0017 ms res = 0; for(vector<int>::iterator it = data.begin(); it != data.end(); it++) { res += *it; } //takes 0.0017 ms res = 0; BOOST_FOREACH(int val,data) { res += val; } //takes 9.400 ms //list res = 0; for(list<int>::iterator it = lstData.begin(); it != lstData.end(); it++) { res += *it; } //takes 239.669 ms res = 0; BOOST_FOREACH(int val,lstData) { res += val; } //takes 233.219 ms //vector TestData for(int i = 0; i < actSize; i++) { res += vectorTestData[i].var1; } //takes 0.0017 ms res = 0; for(vector<TestData>::iterator it = vectorTestData.begin(); it != vectorTestData.end(); it++) { res += it->var1; } //takes 0.0017 ms res = 0; BOOST_FOREACH(TestData& val,vectorTestData) { res += val.var1; } //takes 9.400 ms //list res = 0; for(list<TestData>::iterator it = lstTestData.begin(); it != lstTestData.end(); it++) { res += it->var1; } //takes 233.827 ms res = 0; BOOST_FOREACH(TestData& val,lstTestData) { res += val.var1; } //takes 238.827 ms for(int i = 0; i < actSize; i++) { res += testArray[i]; } //takes 0.0017 ms res = 0; BOOST_FOREACH(int val,testArray) { res += val; } //takes 9.400 ms return 0; }

Rest easy Hansi, I think your test is invalid. You write:
vector[TestData]: iterator:0.00167619 vector[TestData]: BOOST_FOREACH:9.40008
I too get this result I don't do anything with "res". If I add res to the lines that are printed to cout to stop the compiler doing so much dead code elimination I get the following results: vector[TestData]: iterator:32.1508 vector[TestData]: BOOST_FOREACH:30.584 [I won't read too much into FOREACH being faster either... in my experience the difference between iterators and FOREACH is lost in the noise of things beyond your control in all but the most extreme cases.] Pete

Hansi wrote:
<snip> During FOEACH's review, extensive performance tests showed FOREACH was only about 5% slower than the hand-coded equivalent on average. I suspect there may be something wrong with your test. Have you compiled with full optimizations? Defined NDEBUG? The code you've posted is obviously not what you are running your benchmarks with -- it takes no timing measurements. Perhaps you can post your code. -- Eric Niebler BoostPro Computing http://www.boostpro.com

It seems everything okay... You are right..it is not exactly the code.. Best regards and thanks for any hint! Hansjörg class PerformanceCounter { private: LARGE_INTEGER CurrentTickCount_; LARGE_INTEGER OldTickCount_; LARGE_INTEGER TickFrquency_; double EllapsedTime_; double TotalTime_; public: PerformanceCounter(): EllapsedTime_(0.0), TotalTime_(0.0) { ::QueryPerformanceFrequency(&TickFrquency_); ::QueryPerformanceCounter(&CurrentTickCount_); OldTickCount_ = CurrentTickCount_; } void Reset(const double intial_time = 0.0) { TotalTime_ = intial_time; EllapsedTime_ = intial_time; ::QueryPerformanceCounter(&CurrentTickCount_); OldTickCount_ = CurrentTickCount_; } void CalculateEllapsedTime() { ::QueryPerformanceCounter(&CurrentTickCount_); EllapsedTime_ = (double)(CurrentTickCount_.QuadPart - OldTickCount_.QuadPart)/TickFrquency_.QuadPart; TotalTime_ += EllapsedTime_; OldTickCount_ = CurrentTickCount_; } void CalculateEllapsedTime(const PerformanceCounter& to_intialize) { CurrentTickCount_ = to_intialize.CurrentTickCount(); EllapsedTime_ = (double)(CurrentTickCount_.QuadPart - OldTickCount_.QuadPart)/TickFrquency_.QuadPart; TotalTime_ += EllapsedTime_; OldTickCount_ = CurrentTickCount_; } const LARGE_INTEGER& CurrentTickCount() const {return CurrentTickCount_;} double EllapsedMilliseconds() const {return EllapsedTime_ * 1000;} double EllapsedSeconds() const {return EllapsedTime_;} double TotalTime() const {return TotalTime_;} }; class TestData { public: TestData(int val1, int val2, int val3):var1(val1),var2(val2),var3(val3){} int var1; int var2; int var3; }; int testArray[10000000]; int _tmain(int argc, _TCHAR* argv[]) { vector<int> data; list<int> lstData; vector<TestData> vectorTestData; list<TestData> lstTestData; for(int i = 0; i < 10000000;i++) { data.push_back(i); lstData.push_back(i); vectorTestData.push_back(TestData(i,i,i)); lstTestData.push_back(TestData(i,i,i)); } PerformanceCounter timer; int res = 0; const int actSize = (int)data.size(); timer.Reset(); for(int i = 0; i < actSize; i++) { res += data[i]; } timer.CalculateEllapsedTime(); cout<<"vector[int]: index:"<<timer.EllapsedMilliseconds()<< endl; res = 0; timer.Reset(); for(vector<int>::iterator it = data.begin(); it != data.end(); it++) { res += *it; } timer.CalculateEllapsedTime(); cout<<"vector[int]: iterator:"<<timer.EllapsedMilliseconds()<<endl; res = 0; timer.Reset(); BOOST_FOREACH(int val,data) { res += val; } timer.CalculateEllapsedTime(); cout<<"vector[int]: BOOST_FOREACH:"<<timer.EllapsedMilliseconds()<<endl<<endl; //list res = 0; timer.Reset(); for(list<int>::iterator it = lstData.begin(); it != lstData.end(); it++) { res += *it; } timer.CalculateEllapsedTime(); cout<<"list[int]: iterator:"<<timer.EllapsedMilliseconds()<<endl; res = 0; timer.Reset(); BOOST_FOREACH(int val,lstData) { res += val; } timer.CalculateEllapsedTime(); cout<<"list[int]: BOOST_FOREACH:"<<timer.EllapsedMilliseconds()<<endl<<endl; //vector TestData timer.Reset(); for(int i = 0; i < actSize; i++) { res += vectorTestData[i].var1; } timer.CalculateEllapsedTime(); cout<<"vector[int]: index:"<<timer.EllapsedMilliseconds()<< endl; res = 0; timer.Reset(); for(vector<TestData>::iterator it = vectorTestData.begin(); it != vectorTestData.end(); it++) { res += it->var1; } timer.CalculateEllapsedTime(); cout<<"vector[int]: iterator:"<<timer.EllapsedMilliseconds()<<endl; res = 0; timer.Reset(); BOOST_FOREACH(TestData& val,vectorTestData) { res += val.var1; } timer.CalculateEllapsedTime(); cout<<"vector[int]: BOOST_FOREACH:"<<timer.EllapsedMilliseconds()<<endl<<endl; //list res = 0; timer.Reset(); for(list<TestData>::iterator it = lstTestData.begin(); it != lstTestData.end(); it++) { res += it->var1; } timer.CalculateEllapsedTime(); cout<<"list[TestData]: iterator:"<<timer.EllapsedMilliseconds()<<endl; res = 0; timer.Reset(); BOOST_FOREACH(TestData& val,lstTestData) { res += val.var1; } timer.CalculateEllapsedTime(); cout<<"list[TestData]: BOOST_FOREACH:"<<timer.EllapsedMilliseconds()<<endl<<endl; timer.Reset(); for(int i = 0; i < actSize; i++) { res += testArray[i]; } timer.CalculateEllapsedTime(); cout<<"testArray[int]: index:"<<timer.EllapsedMilliseconds()<< endl; res = 0; timer.Reset(); BOOST_FOREACH(int val,testArray) { res += val; } timer.CalculateEllapsedTime(); cout<<"testArray[int]: BOOST_FOREACH:"<<timer.EllapsedMilliseconds()<<endl<<endl; return 0; }

Hansi wrote:
It seems everything okay... <snip>
By this, do you mean that you are no longer seeing the performance problem? -- Eric Niebler BoostPro Computing http://www.boostpro.com

Eric Niebler wrote:
During FOEACH's review, extensive performance tests showed FOREACH was only about 5% slower than the hand-coded equivalent on average.
Why not make FOREACH as fast as hand-coded equivalents when it is possible? With decltype, typeof, or the MSVC hack it shouldn't be a problem. Also some compilers have native support for foreach-like loops.

Mathias Gaunard wrote:
I haven't spent a lot of time thinking about it, but I imagine making this transparent to users would be hard. BOOST_FOREACH has documented customization points that any alternate implementation should respect. -- Eric Niebler BoostPro Computing http://www.boostpro.com

Quoting Hansi <hansipet@web.de>:
And that should make you sceptical of the results! Once you strip away the TMP and macros, FOREACH is not much different from a for lopp. There is an additional boolean [or so?] that looks ripe for optimization. Checking that premise: I just re-ran Michael's test with VC9 on full optimization and _SECURE_SCL=0. My results are: Iterator accumulate took 0.235 seconds. Pointer accumulate took 0.265 seconds. Iterator for loop took 0.234 seconds. Pointer for loop took 0.25 seconds. Index for loop took 0.281 seconds. Index for_each took 0.235 seconds. Pointer for_each took 0.25 seconds. BOOST_FOREACH took 0.25 seconds. MSVC for each took 0.234 seconds. Press any key to continue . . .

Am Dienstag, 18. November 2008 schrieb Peter Bartlett:
To underline these results, here what gcc-4.3.2 gives on x86_64 linux: {{{ maik@harald:~$ g++ -O0 test.cpp -I /home/maik/workspace/boost maik@harald:~$ ./a.out Iterator accumulate took 1.39 seconds. Pointer accumulate took 0.34 seconds. Iterator for loop took 1.4 seconds. Pointer for loop took 0.34 seconds. Index for loop took 0.59 seconds. Index for_each took 1.73 seconds. Pointer for_each took 0.63 seconds. BOOST_FOREACH took 4.59 seconds. maik@harald:~$ g++ -O2 test.cpp -I /home/maik/workspace/boost maik@harald:~$ ./a.out Iterator accumulate took 0.1 seconds. Pointer accumulate took 0.09 seconds. Iterator for loop took 0.1 seconds. Pointer for loop took 0.09 seconds. Index for loop took 0.1 seconds. Index for_each took 0.09 seconds. Pointer for_each took 0.1 seconds. BOOST_FOREACH took 0.1 seconds maik@harald:~$ g++ -DNDEBUG -O2 test.cpp -I /home/maik/workspace/boost maik@harald:~$ ./a.out Iterator accumulate took 0.09 seconds. Pointer accumulate took 0.1 seconds. Iterator for loop took 0.09 seconds. Pointer for loop took 0.1 seconds. Index for loop took 0.09 seconds. Index for_each took 0.1 seconds. Pointer for_each took 0.09 seconds. BOOST_FOREACH took 0.1 seconds. maik@harald:~$ g++ -DNDEBUG -O3 test.cpp -I /home/maik/workspace/boost maik@harald:~$ ./a.out Iterator accumulate took 0.09 seconds. Pointer accumulate took 0.09 seconds. Iterator for loop took 0.08 seconds. Pointer for loop took 0.09 seconds. Index for loop took 0.09 seconds. Index for_each took 0.09 seconds. Pointer for_each took 0.09 seconds. BOOST_FOREACH took 0.09 seconds. }}} where /home/maik/workspace/boost is a working copy of trunk. -- Maik

Peter Bartlett wrote:
Yes, one extra boolean test/set per iteration. That's all you should pay for using BOOST_FOREACH when optimizations are turned on. I can't get rid of it without also getting rid of the ability to iterate over a sequence by reference.
Thanks Peter! -- Eric Niebler BoostPro Computing http://www.boostpro.com

Quoting Eric Niebler <eric@boost-consulting.com>:
Right. Ahead of seeing your mail, I worked through the FOREACH implementation and once I'd removed the macros, the ?: tricks, the if(false) tricks and the static_casts came down to the basic implementation - I agree that there is an extra boolean that is checked and set per iteration. I figured this is where your 5% rule of thumb came from. But then I realised the inner for loop (which does the setting and checking of the boolean) is run exactly once for each run of the outer loop. Thus it seemed likely it might be a candidate for loop unrolling, and hence FOREACH actually be exactly efficient as a "normal" for loop. I thus compiled the following code: #define _SECURE_SCL 0 #include <iostream> #include <vector> #include <boost/foreach.hpp> struct obj { obj(int x) : value(x) {} char xxx[100000]; int value; }; __declspec(noinline) int foreach_version(std::vector<obj> const& x) { int result = 0; BOOST_FOREACH(obj const& i , x) result += i.value; return result; } __declspec(noinline) int iterator_version(std::vector<obj> const& x) { int result = 0; for( std::vector<obj>::const_iterator c = x.begin() ; c != x.end() ; ++c ) result += c->value; return result; } int main() { std::vector<obj> foo(100,1); int jj = iterator_version(foo); int j = foreach_version(foo); std::cout << "\n" << j << " " << jj << "\n\n"; } The generated assembler for the two functions was: __declspec(noinline) int foreach_version(std::vector<obj> const& x) { int result = 0; BOOST_FOREACH(obj const& i , x) 00401560 mov edx,dword ptr [esp+4] 00401564 mov ecx,dword ptr [edx+4] 00401567 mov edx,dword ptr [edx+8] 0040156A xor eax,eax 0040156C lea esp,[esp] 00401570 cmp ecx,edx 00401572 je foreach_version+22h (401582h) result += i.value; 00401574 add eax,dword ptr [ecx+186A0h] 0040157A add ecx,186A4h 00401580 jmp foreach_version+10h (401570h) return result; } //////////////////////////////// __declspec(noinline) int iterator_version(std::vector<obj> const& x) { int result = 0; for( std::vector<obj>::const_iterator c = x.begin() ; c != x.end() ; ++c ) 00401530 mov edx,dword ptr [esp+4] 00401534 mov ecx,dword ptr [edx+4] 00401537 mov edx,dword ptr [edx+8] 0040153A xor eax,eax 0040153C cmp ecx,edx 0040153E je iterator_version+20h (401550h) result += c->value; 00401540 add eax,dword ptr [ecx+186A0h] 00401546 add ecx,186A4h 0040154C cmp ecx,edx 0040154E jne iterator_version+10h (401540h) return result; } I.e. the FOREACH version has an extra lea instruction, but has one fewer cmp instruction. Unfortunately I'm too much of a newbie to weigh up the relative cost of those two instructions, but the inner for loop does indeed appear to have been optimized at least somewhat. Maik wrote:
The uniformity of these results may just be a fluke but it would be intriguing to find out if all variations have the same generated code - I suspect gcc also has no problems optimizing FOREACH to the point where one can't worry about performance when using it. Pete

Peter Bartlett wrote:
And that should make you sceptical of the results!
Indeed it should! I reran the test on VC9 SP1 (the VC8 in my post was a typo it was VC9). I got essentially identical results to my first run. I changed it from maximize speed to full optimizations and got much better results. I changed it to maximize speed and got similar results to full optimization. Confused? I was. Turns out the VC9 GUI lies. If I had optimization on inherit from project defaults it said Maximize Speed (/O2) but the /O2 switch did NOT appear on the command line. So my initial test was being run without the /O2 switch. Thanks Microsoft. VC9 SP1 with _SECURE_SCL=0 /O2 /Ob2 against boost trunk: Iterator accumulate took 0.109 seconds. Pointer accumulate took 0.11 seconds. Iterator for loop took 0.093 seconds. Pointer for loop took 0.094 seconds. Index for loop took 0.094 seconds. Iterator for_each took 0.093 seconds. Pointer for_each took 0.094 seconds. BOOST_FOREACH took 0.125 seconds. MSVC for each took 0.109 seconds. Press any key to continue . . . VC9 SP1 with _SECURE_SCL=0 /Ox /Ob2 against boost trunk: Iterator accumulate took 0.109 seconds. Pointer accumulate took 0.11 seconds. Iterator for loop took 0.109 seconds. Pointer for loop took 0.109 seconds. Index for loop took 0.11 seconds. Iterator for_each took 0.109 seconds. Pointer for_each took 0.109 seconds. BOOST_FOREACH took 0.11 seconds. MSVC for each took 0.109 seconds. Press any key to continue . . . I also corrected the "Index for_each" to read "Iterator for_each". -- Michael Marcin

on Mon Nov 17 2008, Eric Niebler <eric-AT-boost-consulting.com> wrote:
Just curious: you didn't compare with std::for_each (e.g. to see the effects of loop unrolling), did you? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

David Abrahams wrote:
http://article.gmane.org/gmane.comp.lib.boost.devel/180397 -- Michael Marcin

David Abrahams wrote:
I didn't do the benchmark myself, and I don't remember if std::for_each was part of it, sorry. -- Eric Niebler BoostPro Computing http://www.boostpro.com

David Abrahams wrote:
Just curious: you didn't compare with std::for_each (e.g. to see the effects of loop unrolling), did you?
For whatever it is worth I added: { boost::timer timer; sum_func sum = std::for_each( data.begin(), data.end(), sum_func() ); double elapsedtime = timer.elapsed(); std::cout << "std::for_each took " << elapsedtime << " seconds." << std::endl; force_calculation(sum.sum); } to the benchmark and ran it on my ubuntu box, Linux 2.6.27-7-generic #1 SMP Tue Nov 4 19:33:06 UTC 2008 x86_64 GNU/Linux g++ (Ubuntu 4.3.2-1ubuntu11) 4.3.2 I got: g++ -I../../boost_1_37_0 foreach_benchmark.cpp -o foreach_benchmark Iterator accumulate took 1.4 seconds. Pointer accumulate took 0.34 seconds. Iterator for loop took 1.39 seconds. Pointer for loop took 0.34 seconds. Index for loop took 0.6 seconds. Index for_each took 1.73 seconds. Pointer for_each took 0.63 seconds. BOOST_FOREACH took 4.36 seconds. std::for_each took 1.73 seconds. g++ -O3 -I../../boost_1_37_0 foreach_benchmark.cpp -o foreach_benchmark Iterator accumulate took 0.09 seconds. Pointer accumulate took 0.09 seconds. Iterator for loop took 0.09 seconds. Pointer for loop took 0.09 seconds. Index for loop took 0.09 seconds. Index for_each took 0.09 seconds. Pointer for_each took 0.09 seconds. BOOST_FOREACH took 0.09 seconds. std::for_each took 0.09 seconds. -- Bjørn

Bjørn Roald wrote:
(Don't post benchmarks for unoptimized code. People who read benchmarks do so because they care about performance ... so they compile with optimizations turned on. Not only are unoptimized benchmarks uninteresting, they can be downright misleading if people don't realize what they are looking at.)
That looks better. Thanks. So for libstdc++ at least, std::for_each is no faster. I just checked and it does not do loop unrolling, at least for 4.3.0, so that's not surprising. -- Eric Niebler BoostPro Computing http://www.boostpro.com

Well, if optimizing with the compiler affect your abillity to diagnose problems during testing, you may end up running slow unoptimized code during testing. So I do not agree that this is always the case. That is why I posted both. As far as production code, I do agree it is of no interest.
Sorry for the confusion, it was not intended.
Ok, do anybody know of other compilers/libs that implement loop unrolling in std::for_each? -- Bjørn

Quoting bjorn@4roald.org:
Ok, do anybody know of other compilers/libs that implement loop unrolling in std::for_each?
Are you looking for optimizations specifically for std::for_each other than those done by the compiler for general for loops? If so, possibly of interest is the fact that new in GCC 4.3 is the (experimental) ability to call parallel versions of certain algorithms in <algorithm> and <numeric> via OpenMP: http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html This could put some distance between BOOST_FOREACH and std::for_each for certain use cases though I haven't experimented myself. Pete

Peter Bartlett wrote:
Neither, or both; I am interested in what effects loop unrolling has on performance in general, and any straight forward and practical means of achieving that effect given that it is worth its while.
I think this is for completely different use-cases than those of the benchmark in question. Concurrency in algorithms are certainly of interest, but I would intuitively use that if the body of the concurrent part is of some significant size. Loop unrolling is for the case where the body of the loop is tiny. I would be surprised if the OpenMP based concurrency in a bench mark like this is not less efficient as far as wall clock, and as far as CPU time it certainly will be. With optimal partitioning of the loop iterations, it is possible an OpenMP version will win a wall clock battle, but that is something I doubt a library can archive. I may be wrong. Also, it is not clear to me how concurrency in std algorithms will effect correctness in my programs.
This could put some distance between BOOST_FOREACH and std::for_each for certain use cases though I haven't experimented myself.
Ok, I gave it a quick shot for the fun of it. But I only get confused from the results. It looks to me like the OpenMP code and flags have no effects on std::for_each, neither does explicit calls to __gnu_parallel::for_each give different results. It looks as if it is falling back to the serialized version. I have not spent any time figuring why. I added: int main() { // don't know if these are needed, I think there may be sensible defaults const int threads_wanted = 4; // I have a 4 core CPU omp_set_dynamic(false); omp_set_num_threads(threads_wanted); .... #ifdef _GLIBCXX_PARALLEL { boost::timer timer; sum_func sum = __gnu_parallel::for_each( data.begin(), data.end(), sum_func() ); double elapsedtime = timer.elapsed(); std::cout << "__gnu_parallel::for_each took " << elapsedtime << " seconds." << std::endl; force_calculation(sum.sum); } #endif to the benchmark, and compiled and executed like this: g++ -D_GLIBCXX_PARALLEL -I../../boost_1_37_0 -march=native -fopenmp -O3 foreach_benchmark.cpp-o foreach_benchmark && foreach_benchmark Iterator accumulate took 0.09 seconds. Pointer accumulate took 0.09 seconds. Iterator for loop took 0.09 seconds. Pointer for loop took 0.09 seconds. Index for loop took 0.09 seconds. Index for_each took 0.09 seconds. Pointer for_each took 0.09 seconds. BOOST_FOREACH took 0.09 seconds. std::for_each took 0.1 seconds. __gnu_parallel::for_each took 0.08 seconds. Nothing exiting, tell me why? I guess that boost::timer may not be appropriate to measure this if we in fact run in parallel threads. -- Bjørn

Eric Niebler wrote:
While I agree that people care more about optimized code, unoptimized code matters too. If the code is more than an order of magnitude slower unoptimized it might not be suitable for production code. I've use products which can't run in debug mode because they are too slow. I'm not a fan of log file debugging. In this case it looked fine for unoptimized to me. -- Michael Marcin

on Wed Nov 19 2008, Eric Niebler <eric-AT-boost-consulting.com> wrote:
I guess not. If they care enough to find out, someone should compare std::find with an equivalent FOREACH loop on a random access range. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
participants (9)
-
bjorn@4roald.org
-
Bjørn Roald
-
David Abrahams
-
Eric Niebler
-
Hansi
-
Maik Beckmann
-
Mathias Gaunard
-
Michael Marcin
-
Peter Bartlett