On 12:22 Tue 25 Oct, Larry Evans wrote:
On 10/25/2016 01:41 AM, Andreas Schäfer wrote:
On 07:50 Fri 21 Oct, Larry Evans wrote:
I can't imagine how anything could be faster than the soa_emitter_static_t because it uses a tuple of std::array.
I'd guess that the soa_emitter_block_t is only faster by luck (maybe my machine was less busy with other tasks during the soa_emitter_block_t run). I think the reason the different implementation techniques are so close is that the particle model is memory bound, i.e. it moves a lot of data while each particle update involves relatively few calculations.
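For readers who haven't looked at the benchmark: such a tuple-of-std::array layout looks roughly like this (a minimal sketch with hypothetical names, not the actual code from soa_compare.benchmark.cpp):

    #include <array>
    #include <cstddef>
    #include <tuple>

    constexpr std::size_t N = 1024;

    // One fixed-size array per particle attribute, grouped in a tuple.
    using soa_particles = std::tuple<
        std::array<float, N>,   // x position
        std::array<float, N>,   // y position
        std::array<float, N>,   // z position
        std::array<float, N>>;  // mass

    // Updating one attribute walks a single contiguous array, which is
    // what makes the layout cache- and SIMD-friendly:
    void scale_masses(soa_particles& p, float factor)
    {
        for (float& m : std::get<3>(p)) {
            m *= factor;
        }
    }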
The difference becomes larger if you're using only a few particles: then all particles sit in the upper levels of the cache and the CPU doesn't have to wait as long for the data. It would also be worthwhile to try a more complex particle model (e.g. by adding interactions between the particles). With increased computational intensity (floating point operations per byte moved), the gap between the different strategies should widen considerably.
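To put a rough number on "computational intensity": a plain Euler step such as pos += vel * dt performs about 6 flops per particle (3 multiplies and 3 adds for a 3-component float vector) while moving roughly 36 bytes (load vel, load and store pos), i.e. about 0.17 flop/byte; that's a back-of-the-envelope estimate of mine, not a number measured from the benchmark. A pairwise interaction, by contrast, performs O(n) flops per particle on the same amount of data, which is what pushes the kernel from memory bound toward compute bound.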
Thanks for the explanation. The latest version of the benchmark:
d6ee370606f7f167dedb93e174459c6c7c4d8c19
reports the relative difference of the times:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L823
Yeah, I saw that change when I merged upstream. TBH, I don't think this is helpful, as the relative difference adds noise from one measurement to all the other measurements. It also complicates comparison between multiple runs of the benchmark and prevents conversion into other metrics (e.g. GFLOPS).
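To illustrate why raw times are the more flexible output: converting a raw wall-clock time into GFLOPS is a one-liner, whereas a relative difference can't be converted back. A minimal sketch (flops_per_update is a hypothetical parameter that would have to be counted for the actual kernel):

    #include <cstddef>

    // Hypothetical helper, not part of the benchmark: derive GFLOPS
    // from a raw timing. flops_per_update = flops per particle per step.
    double gflops(double seconds, std::size_t particles,
                  std::size_t steps, double flops_per_update)
    {
        return particles * steps * flops_per_update / (seconds * 1e9);
    }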
So, based on what you say above, I guess when particle_count:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L135
increases to the point where the cache overflows, the differences between the methods should change sharply?
The difference between the methods shrinks as more and more particles are used, because memory bandwidth then becomes the limiting factor. The transition between "in cache" and "in memory" isn't sharp but rather smooth, as the L3 cache will still retain some data even if the total data set is too large to fit into L3. If you vary the number of particles, you should be able to observe different performance levels based on the cache level the data set fits into (32 kB for L1, 256 kB for L2 (on Intel), some MB for L3).
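A quick way to find those levels is to compute the working-set size for each particle count and check which cache it still fits into. A sketch using the cache sizes quoted above (bytes_per_particle is a hypothetical figure; the benchmark's actual particle layout may differ):

    #include <cstddef>
    #include <cstdio>

    int main()
    {
        // Assumed layout: 6 floats per particle (position + velocity).
        const std::size_t bytes_per_particle = 24;
        const std::size_t l1 = 32 * 1024;          // L1d, per core
        const std::size_t l2 = 256 * 1024;         // L2 (Intel)
        const std::size_t l3 = 8 * 1024 * 1024;    // L3, model dependent

        for (std::size_t n = 1024; n <= (std::size_t(1) << 22); n *= 2) {
            const std::size_t bytes = n * bytes_per_particle;
            const char* level = bytes <= l1 ? "L1"
                              : bytes <= l2 ? "L2"
                              : bytes <= l3 ? "L3" : "RAM";
            std::printf("%8zu particles: %9zu bytes -> %s\n", n, bytes, level);
        }
        return 0;
    }

Plotting time per particle against the particle count should then show plateaus roughly at those boundaries.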
I've added an implementation of the benchmark based on LibFlatArray's SoA containers and expression templates[1]. While working on the benchmark, I realized that the vector types ("short_vec") in LibFlatArray were lacking some desirable operations (e.g. masked move), so to reproduce my results you'll have to use the trunk from [2]. I'm very happy that you wrote this benchmark because it's a valuable test bed for performance, programmability, and functionality. Thanks!
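For readers unfamiliar with the term: a masked move conditionally overwrites the lanes of a vector according to a mask. Just to illustrate the operation itself (this is a generic SSE4.1 sketch, not LibFlatArray's actual short_vec API):

    #include <smmintrin.h>  // SSE4.1

    // dst = mask ? src : dst, lane by lane. _mm_blendv_ps picks from
    // 'src' where the mask's sign bit is set, from 'dst' otherwise.
    __m128 masked_move(__m128 dst, __m128 src, __m128 mask)
    {
        return _mm_blendv_ps(dst, src, mask);
    }

In a particle update this is what lets you apply, e.g., a boundary reflection only to the lanes whose particles actually crossed the boundary, without leaving the SIMD loop.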
You're welcome. Much of the credit goes to the OP, as acknowledged indirectly here:
Thanks. Sorry for the confusion. :-)
I'll have to take a look at the assembly to figure out why that is.
Oh, I bet that will be fun ;)
I hope so. Hope dies last. ;-)

Cheers
-Andreas

--
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany

+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your signature to help him gain world domination!