On 10/16/2016 10:49 PM, degski wrote:
This is still a toy example but it's closer to something real.
Yes, but is 1M particles common?
Depends on the game. To be fair, depending on where your bottlenecks are, you might move code like this to compute shaders instead.
AoS in 6.54421 seconds
SoA in 5.91915 seconds
SoA SSE in 3.58603 seconds
1M particles on my Ci3 5005U 2.0GHz/AVX2/4GB laptop / Win10 / Clang/LLVM 4.0:
AoS in 14.7198 seconds
SoA in 13.5969 seconds
SoA SSE in 8.78095 seconds
I've run this with a count of 25'000 and it shows something(s) interesting:
AoS in 0.274145 seconds
SoA in 0.312875 seconds
SoA SSE in 0.0768812 seconds
1. SoA is slower than AoS.
2. SoA SSE is way faster (relatively) than either SoA or AoS.
You've definitely made your case, when using SSE. I'll have a rethink.
Indeed, the code generated for SoA here is much worse than for AoS. The AoS update is roughly 40 assembly instructions; the SoA update is roughly 200. A lot of this is probably due to the soa_emitter_t implementation being suboptimal.

Also:

sizeof( particle_t ) = 68 bytes
68 * 25'000 = 1.7 megs

Your CPU has a 3 megabyte L3 cache, so the entire data structure can stay in fast memory. All versions (aos, soa, soa_sse) have access patterns that are very friendly to the memory prefetcher, so L2 and L1 cache size should not affect the results. So with no main memory access, the update with the much better code (aos) wins.

If you bump your count up to 50'000 (3.4 megs) you might see SoA pull ahead again; at 100'000 (6.8 megs) you should definitely see it.

Alternatively, you could add more data to the particles, like say:

    struct pad_t { char data[64]; };
    struct particle_t { ... pad_t pad; };

This additional data won't affect the SoA update at all but should affect your AoS update. (Rough math: 3 megs / 25k elements = 120 bytes per element max to fit everything in cache.)
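
For anyone who wants to redo the rough math above for other counts, here is a minimal sketch of it. It only does arithmetic on the figures quoted in this message (68 bytes per particle, a 64 byte pad, a 3 megabyte L3 taken as 3'000'000 bytes to match the rough math). The names working_set, fits_in_l3 and max_bytes_per_element are made up for illustration and are not part of the benchmark code, and the padded size assumes the compiler adds no extra alignment padding.

```cpp
#include <cstddef>
#include <cstdio>

// Figures quoted in the message above; assumptions, not measured here.
constexpr std::size_t kParticleBytes = 68;         // sizeof( particle_t ) as quoted
constexpr std::size_t kPadBytes      = 64;         // sizeof( pad_t ), char data[64]
constexpr std::size_t kL3Bytes       = 3'000'000;  // "3 megs", decimal as in the rough math

// Total bytes touched when every element is visited once.
constexpr std::size_t working_set(std::size_t count, std::size_t bytes_per_element) {
    return count * bytes_per_element;
}

// True if the whole array fits in the assumed L3 size.
constexpr bool fits_in_l3(std::size_t count, std::size_t bytes_per_element) {
    return working_set(count, bytes_per_element) <= kL3Bytes;
}

// Largest element size that still lets `count` elements fit in L3.
constexpr std::size_t max_bytes_per_element(std::size_t count) {
    return kL3Bytes / count;
}

// Compile-time checks against the numbers in the message.
static_assert(working_set(25'000, kParticleBytes) == 1'700'000, "1.7 megs");
static_assert(working_set(50'000, kParticleBytes) == 3'400'000, "3.4 megs");
static_assert(working_set(100'000, kParticleBytes) == 6'800'000, "6.8 megs");
static_assert(max_bytes_per_element(25'000) == 120, "120 bytes per element");

// Prints one line per particle count for the unpadded and padded AoS layouts.
inline void print_report() {
    const std::size_t counts[] = {25'000, 50'000, 100'000, 1'000'000};
    for (std::size_t n : counts) {
        std::printf("%8zu particles: plain %10zu bytes (%s), padded %10zu bytes (%s)\n",
                    n,
                    working_set(n, kParticleBytes),
                    fits_in_l3(n, kParticleBytes) ? "fits" : "spills",
                    working_set(n, kParticleBytes + kPadBytes),
                    fits_in_l3(n, kParticleBytes + kPadBytes) ? "fits" : "spills");
    }
}
```

Calling print_report() from main prints the table. Under these assumptions the padded particle is 132 bytes, which is over the 120 byte limit, so even 25'000 padded AoS particles no longer fit in L3.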