
Peter Bartlett wrote:
Quoting bjorn@4roald.org:
Ok, does anybody know of other compilers/libs that implement loop unrolling in std::for_each?
Are you looking for optimizations specifically for std::for_each other than those done by the compiler for general for loops?
Neither, or both; I am interested in what effect loop unrolling has on performance in general, and in any straightforward, practical means of achieving that effect, provided it is worth the while.
If so, possibly of interest is the fact that new in GCC 4.3 is the (experimental) ability to call parallel versions of certain algorithms in <algorithm> and <numeric> via OpenMP:
http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
I think this is for completely different use cases than those of the benchmark in question. Concurrency in algorithms is certainly of interest, but I would intuitively use it only when the body of the concurrent part has some significant size. Loop unrolling is for the case where the body of the loop is tiny. I would be surprised if the OpenMP-based concurrency in a benchmark like this were not less efficient in wall-clock time, and in CPU time it certainly will be. With optimal partitioning of the loop iterations an OpenMP version might win the wall-clock battle, but I doubt that is something a library can achieve. I may be wrong. Also, it is not clear to me how concurrency in std algorithms will affect the correctness of my programs.
This could put some distance between BOOST_FOREACH and std::for_each for certain use cases though I haven't experimented myself.
Ok, I gave it a quick shot for the fun of it, but the results only confuse me. It looks to me like the OpenMP code and flags have no effect on std::for_each, nor do explicit calls to __gnu_parallel::for_each give different results. It looks as if it is falling back to the serialized version; I have not spent any time figuring out why. I added:

int main()
{
    // don't know if these are needed, I think there may be sensible defaults
    const int threads_wanted = 4; // I have a 4 core CPU
    omp_set_dynamic(false);
    omp_set_num_threads(threads_wanted);

    ....

#ifdef _GLIBCXX_PARALLEL
    {
        boost::timer timer;
        sum_func sum = __gnu_parallel::for_each( data.begin(), data.end(), sum_func() );
        double elapsedtime = timer.elapsed();
        std::cout << "__gnu_parallel::for_each took " << elapsedtime
                  << " seconds." << std::endl;
        force_calculation(sum.sum);
    }
#endif

to the benchmark, and compiled and executed it like this:

g++ -D_GLIBCXX_PARALLEL -I../../boost_1_37_0 -march=native -fopenmp -O3 foreach_benchmark.cpp -o foreach_benchmark && foreach_benchmark

Iterator accumulate took 0.09 seconds.
Pointer accumulate took 0.09 seconds.
Iterator for loop took 0.09 seconds.
Pointer for loop took 0.09 seconds.
Index for loop took 0.09 seconds.
Index for_each took 0.09 seconds.
Pointer for_each took 0.09 seconds.
BOOST_FOREACH took 0.09 seconds.
std::for_each took 0.1 seconds.
__gnu_parallel::for_each took 0.08 seconds.

Nothing exciting; can anybody tell me why? I guess that boost::timer may not be appropriate for measuring this if we do in fact run in parallel threads.

--
Bjørn