On 12:22 Tue 25 Oct, Larry Evans wrote:
On 10/25/2016 01:41 AM, Andreas Schäfer wrote:
On 07:50 Fri 21 Oct, Larry Evans wrote:
I can't imagine how anything could be faster than the soa_emitter_static_t because it uses a tuple of std::array.
I'd guess that the soa_emitter_block_t is only faster by luck (maybe my machine was less busy with other tasks during the soa_emitter_block_t run). I think the reason the different implementation techniques are so close is that the particle model is memory bound, i.e. it moves a lot of data while each particle update involves relatively few calculations.
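For readers who haven't looked at the benchmark: such a tuple-of-std::array layout looks roughly like this (a minimal sketch with hypothetical names, not the actual code from soa_compare.benchmark.cpp):

    #include <array>
    #include <cstddef>
    #include <tuple>

    constexpr std::size_t N = 1024;

    // One fixed-size array per particle attribute, grouped in a tuple.
    using soa_particles = std::tuple<
        std::array<float, N>,   // x position
        std::array<float, N>,   // y position
        std::array<float, N>,   // z position
        std::array<float, N>>;  // mass

    // Updating one attribute walks a single contiguous array, which is
    // what makes the layout cache- and SIMD-friendly:
    void scale_masses(soa_particles& p, float factor)
    {
        for (float& m : std::get<3>(p)) {
            m *= factor;
        }
    }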
The difference becomes larger if you're using only a few particles: then all particles sit in the upper levels of the cache and the CPU doesn't have to wait as long for the data. It would also be worthwhile to try a more complex particle model (e.g. by adding interactions between the particles). With increased computational intensity (floating point operations per byte moved), the gap between the different strategies should widen considerably.
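To put a rough number on "computational intensity": a plain Euler step such as pos += vel * dt performs about 6 flops per particle (3 multiplies and 3 adds for a 3-component float vector) while moving roughly 36 bytes (load vel, load and store pos), i.e. about 0.17 flop/byte; that's a back-of-the-envelope estimate of mine, not a number measured from the benchmark. A pairwise interaction, by contrast, performs O(n) flops per particle on the same amount of data, which is what pushes the kernel from memory bound toward compute bound.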
Thanks for the explanation. The latest version of the benchmark:
d6ee370606f7f167dedb93e174459c6c7c4d8c19
reports the relative difference of the times:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L823
Yeah, I saw that change when I merged upstream. TBH, I don't think this is helpful, as the relative difference adds noise from one measurement to all the other measurements. It also complicates comparison between multiple runs of the benchmark and prevents conversion into other metrics (e.g. GFLOPS).
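To illustrate why raw times are the more flexible output: converting a raw wall-clock time into GFLOPS is a one-liner, whereas a relative difference can't be converted back. A minimal sketch (flops_per_update is a hypothetical parameter that would have to be counted for the actual kernel):

    #include <cstddef>

    // Hypothetical helper, not part of the benchmark: derive GFLOPS
    // from a raw timing. flops_per_update = flops per particle per step.
    double gflops(double seconds, std::size_t particles,
                  std::size_t steps, double flops_per_update)
    {
        return particles * steps * flops_per_update / (seconds * 1e9);
    }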
So, based on what you say above, I guess when particle_count:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L135
increases to the point where the cache overflows, the differences between the methods should change sharply?
The difference between the methods shrinks as more and more particles are used, because memory bandwidth then becomes the limiting factor. The transition between "in cache" and "in memory" isn't sharp but rather smooth, as the L3 cache will still retain some data even if the total data set is too large to fit into L3. If you vary the number of particles, you should be able to observe different performance levels based on the cache level the data set fits into (32 kB for L1, 256 kB for L2 (on Intel), some MB for L3).
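A quick way to find those levels is to compute the working-set size for each particle count and check which cache it still fits into. A sketch using the cache sizes quoted above (bytes_per_particle is a hypothetical figure; the benchmark's actual particle layout may differ):

    #include <cstddef>
    #include <cstdio>

    int main()
    {
        // Assumed layout: 6 floats per particle (position + velocity).
        const std::size_t bytes_per_particle = 24;
        const std::size_t l1 = 32 * 1024;          // L1d, per core
        const std::size_t l2 = 256 * 1024;         // L2 (Intel)
        const std::size_t l3 = 8 * 1024 * 1024;    // L3, model dependent

        for (std::size_t n = 1024; n <= (std::size_t(1) << 22); n *= 2) {
            const std::size_t bytes = n * bytes_per_particle;
            const char* level = bytes <= l1 ? "L1"
                              : bytes <= l2 ? "L2"
                              : bytes <= l3 ? "L3" : "RAM";
            std::printf("%8zu particles: %9zu bytes -> %s\n", n, bytes, level);
        }
        return 0;
    }

Plotting time per particle against the particle count should then show plateaus roughly at those boundaries.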
I've added an implementation of the benchmark based on LibFlatArray's SoA containers and expression templates[1]. While working on the benchmark, I realized that the vector types ("short_vec") in LibFlatArray were lacking some desirable operations (e.g. masked move), so to reproduce my results you'll have to use the trunk from [2]. I'm very happy that you wrote this benchmark because it's a valuable test bed for performance, programmability, and functionality. Thanks!
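For readers unfamiliar with the term: a masked move conditionally overwrites the lanes of a vector according to a mask. Just to illustrate the operation itself (this is a generic SSE4.1 sketch, not LibFlatArray's actual short_vec API):

    #include <smmintrin.h>  // SSE4.1

    // dst = mask ? src : dst, lane by lane. _mm_blendv_ps picks from
    // 'src' where the mask's sign bit is set, from 'dst' otherwise.
    __m128 masked_move(__m128 dst, __m128 src, __m128 mask)
    {
        return _mm_blendv_ps(dst, src, mask);
    }

In a particle update this is what lets you apply, e.g., a boundary reflection only to the lanes whose particles actually crossed the boundary, without leaving the SIMD loop.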
You're welcome. Much of the credit goes to the OP, as acknowledged indirectly here:
Thanks. Sorry for the confusion. :-)
I'll have to take a look at the assembly to figure out why that is.
Oh, I bet that will be fun ;)
I hope so. Hope dies last. ;-)

Cheers
-Andreas

--
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany

+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your signature to help him gain world domination!