[boostcon12] Trouble with tuples very compiler dependent

This thread is a continuation of the one started here:
http://article.gmane.org/gmane.comp.lib.boost.devel/235427
It's been renamed because the new name is more topical.

The tree_builder benchmarks mentioned here:
http://article.gmane.org/gmane.comp.lib.boost.devel/235386
were refactored into:
tuple_benchmark.tree_builder.cpp
tuple_impl.bcon12_horizontal.hpp
tuple_impl.bcon12_vertical.hpp
and various other .hpp files.

The benchmark was run as shown in the attached filt.compilation. The output after the line:
:filt_dr=
shows that, with compiler=gcc4_8, the bcon12_horizontal implementation is slower than bcon12_vertical. This agrees, at least qualitatively, with the page 12 bar chart from Eric's boostcon12 PDF file, trouble_with_tuples.pdf, which is in the mon directory of the download here:
https://github.com/boostcon/cppnow_presentations_2012/zipball/master
The variadic bar in the page 12 bar chart corresponds to the tuple_impl.bcon12_horizontal.hpp tuple implementation in the sandbox slim/test directory. The unrolled bar in the page 12 bar chart corresponds to tuple_impl.bcon12_vertical.hpp.

In contrast to the gcc4_8 compiler, with the clangxx compiler the relative performance is just the opposite. IOW, the bcon12_horizontal implementation is faster than the bcon12_vertical implementation. In fact, the rate of change of the performance difference accelerates as tree depth goes from 2 to 4. The rate of change is so stark that it suggests, at least to me, there may be some bug in clang. Of course, that conclusion is based on almost no knowledge, on my part, of the clang implementation.

The tuple_benchmark_filt.py can be modified to filter out other parts of the benchmark run output, which is here:
http://svn.boost.org/svn/boost/sandbox/variadic_templates/sandbox/slim/test/...

In conclusion, whether it's best to use the vertical or the horizontal tuple implementation depends on the compiler.

-regards,
Larry

On 11/19/12 10:19, Larry Evans wrote: [snip]
The tuple_benchmark_filt.py can be modified to filter out other parts of the benchmark run output, which is here:
http://svn.boost.org/svn/boost/sandbox/variadic_templates/sandbox/slim/test/...
For example, when the filter criteria restrict TUPLE_UNROLL_MAX to 10 (the same as TUPLE_SIZE), then, with compiler=clangxx, bcon12_vertical performs relatively better than bcon12_horizontal as TREE_DEPTH increases, as shown in the attached.
-regards,
Larry

On 11/19/2012 8:38 AM, Larry Evans wrote:
For example, when the filter criteria restrict TUPLE_UNROLL_MAX to 10 (the same as TUPLE_SIZE), then, with compiler=clangxx, bcon12_vertical performs relatively better than bcon12_horizontal as TREE_DEPTH increases, as shown in the attached.
All the measured times are below one second. Benchmarks become more meaningful when the thing being measured takes more than a few seconds to finish. If you don't mind, can you step up the limits? Once we have a few data points in the tens of seconds and minutes, we'll have a better idea of how the different compilers are performing.
Thanks for doing these. It's very interesting.
--
Eric Niebler
BoostPro Computing
http://www.boostpro.com
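Stepping up the limits as requested could be scripted roughly as follows. This is a minimal sketch, not the sandbox's actual driver. It assumes that the quantity being measured is compile time and that the benchmark source picks up TREE_DEPTH from a -D preprocessor define; the macro name comes from the thread, but that mechanism is an assumption. The helper names build_command and time_compile are hypothetical, and the compiler executable names are placeholders.

```python
import subprocess
import time


def build_command(compiler, source, tree_depth, extra_defines=None):
    """Return an argv list that compiles `source` with TREE_DEPTH defined.

    Assumption: the benchmark reads its parameters from -D macros.
    """
    defines = {"TREE_DEPTH": tree_depth}
    defines.update(extra_defines or {})
    cmd = [compiler, "-std=c++11", "-c"]
    cmd += ["-D{}={}".format(name, value)
            for name, value in sorted(defines.items())]
    cmd.append(source)
    return cmd


def time_compile(cmd, runner=subprocess.call, clock=time.perf_counter):
    """Run one compile and return the elapsed wall-clock seconds."""
    start = clock()
    runner(cmd)
    return clock() - start


if __name__ == "__main__":
    # Placeholder compiler names; substitute whatever is installed locally.
    for compiler in ("g++", "clang++"):
        for depth in range(2, 8):  # TREE_DEPTH from 2 to 7
            cmd = build_command(
                compiler, "tuple_benchmark.tree_builder.cpp", depth)
            print(compiler, depth, time_compile(cmd))
```

The runner and clock are parameters so the timing logic can be exercised without a compiler installed. Repeating each measurement several times and taking the minimum would reduce noise further.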

On 11/19/12 12:11, Eric Niebler wrote:
All the measured times are below one second. Benchmarks become more meaningful when the thing being measured takes more than a few seconds to finish. If you don't mind, can you step up the limits? Once we have a few data points in the tens of seconds and minutes, we'll have a better idea of how the different compilers are performing.
Thanks for doing these. It's very interesting.
OK. I just changed TREE_DEPTH to run from 2 to 7 in steps of 1. The run.txt and filt.txt files are attached. AFAICT, vertical always does better than horizontal no matter what the compiler; however, the difference is more dramatic for clangxx. For gcc4_8, horizontal/vertical is about 2 at TREE_DEPTH=7. OTOH, for clangxx, the ratio is about 16! I've no idea why.
-regards,
Larry
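The horizontal/vertical ratio mentioned above can be computed per TREE_DEPTH once the timings for the two implementations have been extracted. The sketch below assumes the timings are already parsed into dictionaries keyed by TREE_DEPTH. The function name and the input layout are hypothetical, since the format of run.txt is not shown in the thread, and no measured values are embedded.

```python
def time_ratios(horizontal, vertical):
    """Map each TREE_DEPTH present in both inputs to horizontal/vertical.

    A ratio above 1 means horizontal took longer than vertical.
    """
    ratios = {}
    for depth in sorted(set(horizontal) & set(vertical)):
        if vertical[depth] > 0:  # skip entries that would divide by zero
            ratios[depth] = horizontal[depth] / vertical[depth]
    return ratios
```

Computing the ratio separately for each compiler puts the gcc4_8 and clangxx trends side by side across the whole TREE_DEPTH range, not only at the deepest tree.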
Participants (2): Eric Niebler, Larry Evans