On 10/21/2016 01:07 AM, Michael Marcin wrote:
On 10/21/2016 12:48 AM, Michael Marcin wrote:
On 10/20/2016 10:02 PM, Larry Evans wrote:
The modification added soa_emitter_block_t which uses soa_block. Unfortunately, this soa_emitter_block_t takes about twice as long as your soa_emitter_static_t.
I've no idea why. Any guesses?
2x is quite an abstraction penalty. I can only assume your compiler is failing to optimize away some part of the abstraction.
OOPS. Yeah, I forgot about run-time optimization compiler flags :(
FWIW, on VS2015 I'm not seeing nearly as much of a difference.
particle_count=1,000,000
AoS in 6.34667 seconds
SoA in 4.26384 seconds
SoA flat in 4.16572 seconds
SoA Static in 5.4037 seconds
SoA block in 5.5588 seconds
I'm still trying to work out how to fit overaligned subarrays into your framework.
The issue is that many SIMD instructions require more than just alignof(T) alignment.
Subarrays of float/double/int/short/char, or of carefully crafted UDTs, might need to be aligned to as much as 64 bytes in the worst case.
On the MIC architecture, vector load/store operations must be called on 64-byte aligned memory addresses. On the Xeon architecture with AVX/AVX2 instruction sets (Sandy Bridge, Ivy Bridge or Haswell), alignment does not matter. In earlier architectures (Nehalem, Westmere) alignment did matter, but a 32-byte alignment was necessary.
https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/507...
At the very least, support for the basic 16-byte SSE alignment of subarrays is crucial.
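To make the requirement concrete, here is a minimal sketch of one way to place overaligned subarrays inside a single block: round each subarray's starting offset up to the requested alignment, and make sure the block's base address has that alignment too. The names `align_up`, `block_layout`, `make_layout`, and `demo`, and the three example columns, are made up for illustration; this is not how soa_block works.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Round `offset` up to the next multiple of `align` (a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t align) {
    return (offset + align - 1) & ~(align - 1);
}

// Hypothetical layout for three subarrays of n elements each:
// char tag[n], float x[n], float y[n], with x and y overaligned.
struct block_layout {
    std::size_t tag_offset;
    std::size_t x_offset;
    std::size_t y_offset;
    std::size_t total_bytes;
};

inline block_layout make_layout(std::size_t n, std::size_t simd_align) {
    block_layout l{};
    l.tag_offset = 0;
    l.x_offset = align_up(l.tag_offset + n * sizeof(char), simd_align);
    l.y_offset = align_up(l.x_offset + n * sizeof(float), simd_align);
    l.total_bytes = l.y_offset + n * sizeof(float);
    return l;
}

// True if p is aligned to `align` bytes.
inline bool is_aligned(const void* p, std::size_t align) {
    return reinterpret_cast<std::uintptr_t>(p) % align == 0;
}

// Over-allocate, then use std::align to find a base address aligned to
// simd_align. Aligned base + aligned offset gives aligned subarrays.
inline bool demo(std::size_t n, std::size_t simd_align) {
    const block_layout l = make_layout(n, simd_align);
    std::vector<unsigned char> storage(l.total_bytes + simd_align);
    void* p = storage.data();
    std::size_t space = storage.size();
    void* aligned = std::align(simd_align, l.total_bytes, p, space);
    assert(aligned != nullptr);
    auto* base = static_cast<unsigned char*>(aligned);
    return is_aligned(base + l.x_offset, simd_align) &&
           is_aligned(base + l.y_offset, simd_align);
}
```

With n = 5 and 16-byte alignment, the tag array occupies bytes 0 to 4, x starts at offset 16, and y starts at offset 48, so the padding cost is a few bytes per subarray rather than per element.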
My best idea so far is some magic wrapper type that gets special treatment. Like:

    using data_t = soa_block< float3, soa_align, bool >;

This maybe opens the door for other magic types like:

    using data_t = soa_block< float3, soa_align, soa_bit >;
That seems reasonable to me.
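As a rough sketch of how such a wrapper could get "special treatment": a trait with a partial specialization can unwrap the marker type and report the requested alignment, while plain types fall back to alignof(T). The template parameters given to `soa_align` here (an element type and a byte alignment) and the trait name `soa_column` are assumptions made purely for illustration; nothing in this thread fixes them.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical marker: a subarray of T that needs Align-byte alignment.
// The (T, Align) parameter list is an assumption, not an agreed design.
template <class T, std::size_t Align>
struct soa_align {
    static_assert((Align & (Align - 1)) == 0,
                  "Align must be a power of two");
    static_assert(Align >= alignof(T),
                  "Align must not be weaker than alignof(T)");
};

// Hypothetical trait the container could consult for each column.
// Primary template: ordinary element type, natural alignment.
template <class T>
struct soa_column {
    using value_type = T;
    static constexpr std::size_t alignment = alignof(T);
};

// Specialization: unwrap the marker and use the requested alignment.
template <class T, std::size_t Align>
struct soa_column<soa_align<T, Align>> {
    using value_type = T;
    static constexpr std::size_t alignment = Align;
};
```

The container would then store `soa_column<Ts>::value_type` elements and align each subarray to `soa_column<Ts>::alignment`, so unwrapped types keep working unchanged.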