Interest in a structure-of-arrays container?

Structure of arrays (also known as parallel arrays[1]) is a pretty well known pattern for improving memory density and enabling vectorization in high-performance code. After hand-rolling yet another of these, it occurred to me that you can make a reasonable first implementation with a simple alias template:

    template< typename... Ts >
    using soa = std::tuple< std::vector<Ts>... >;

Example:

    soa<int, float, std::string> data;

This is a decent start but has several issues, including:
* nothing to preserve the invariant keeping the arrays the same length
* no coalesced interface for vector methods (size, reserve, etc.)
* no iterator
* wasted storage for duplicated allocators, capacities, and sizes
* sizeof...(Ts) allocations where a single large block would do

The work required to address these issues is not too great, I think.

Potential names: soa<Ts>, structure_of_arrays<Ts>, parallel_array<Ts>, struct_array<Ts>, parallel_vector<Ts>, struct_vector<Ts>

Does this already exist somewhere? Does it seem useful?

[1] https://en.wikipedia.org/wiki/Parallel_array

On 15 October 2016 at 05:43, Michael Marcin <mike.marcin@gmail.com> wrote:
This is a decent start but has several issues, including:
* nothing to preserve the invariant keeping the arrays the same length
* no coalesced interface for vector methods (size, reserve, etc.)
* no iterator
* wasted storage for duplicated allocators, capacities, and sizes
* sizeof...(Ts) allocations where a single large block would do
<snip> Does this already exist somewhere?
A std::deque that exposes its chunks (and hopefully chunk size)? Does it seem useful?
To me, yes... I was working on something similar, but with the added requirement that deletion does not (necessarily) change the indices of the remaining elements. degski

On 15 October 2016 at 05:43, Michael Marcin <mike.marcin@gmail.com> wrote:
Does this already exist somewhere?
Take a look at Boost.DoubleEnded 1 <http://erenon.hu/double_ended/> and see whether boost::double_ended::batch_deque <http://erenon.hu/double_ended/double_ended/design_rationale.html#double_ended.design_rationale.batch_deque> fills all of your requirements. degski

On 10/14/2016 11:32 PM, degski wrote:
On 15 October 2016 at 05:43, Michael Marcin <mike.marcin@gmail.com> wrote:
Does this already exist somewhere?
Take a look at Boost.DoubleEnded 1 <http://erenon.hu/double_ended/> and see whether boost::double_ended::batch_deque <http://erenon.hu/double_ended/double_ended/design_rationale.html#double_ended.design_rationale.batch_deque> fills all of your requirements.
Sorry, I think I wasn't clear. Let me back up a bit and give some examples.

Array of Structures is the normal programming model. Let's take a toy example: say I'm looking at some census data and have a structure that looks like:

    struct citizen_t {
        string first_name;
        string last_name;
        int salary;
        int age;
    };

Then the Array of Structures container would be:

    vector<citizen_t> aos_citizens;

The Structure of Arrays version of the same data would look like:

    struct soa_citizens_t {
        vector<string> first_names;
        vector<string> last_names;
        vector<int> salary;
        vector<int> age;
    };

    soa_citizens_t soa_citizens;

Why does this matter? Let's say I want to calculate the average salary of 300 million citizens. The code is a simple iterative average:

    // Array of Structures
    int avg = 0;
    int t = 1;
    for ( auto & c : aos_citizens ) {
        avg += (c.salary - avg) / t;
        ++t;
    }

    // Structure of Arrays
    int avg = 0;
    int t = 1;
    for ( int salary : soa_citizens.salary ) {
        avg += (salary - avg) / t;
        ++t;
    }

Run this under a simple benchmark: https://ideone.com/QNqKpD

    AoS 3.03523 seconds
    SoA 1.94902 seconds

Both loops are doing the exact same work, but the memory layout allows the second loop to run much faster.

On 2016-10-15 10:14, Michael Marcin wrote:
On 10/14/2016 11:32 PM, degski wrote:
On 15 October 2016 at 05:43, Michael Marcin <mike.marcin@gmail.com> wrote:
Does this already exist somewhere?
Take a look at Boost.DoubleEnded 1 <http://erenon.hu/double_ended/> and see whether boost::double_ended::batch_deque <http://erenon.hu/double_ended/double_ended/design_rationale.html#double_ended.design_rationale.batch_deque> fills all of your requirements.
Sorry I think I wasn't clear.
Let me back up a bit and give some examples.
Arrays of Structures is the normal programming model.
Let's take a toy example, say I'm looking at some census data and have a structure that looks like:
struct citizen_t { string first_name; string last_name; int salary; int age; };
Then the Array of Structures container would be: vector<citizen_t> aos_citizens;
The Structure of Arrays version of the same data would look like:
struct soa_citizens_t { vector<string> first_names; vector<string> last_names; vector<int> salary; vector<int> age; };
soa_citizens_t soa_citizens;
Hi,

We implemented something similar for our software. Our use case was: given a batch of inputs to process, what is the most efficient way to store them so that we can process them with a fast batch algorithm? For example, a batch of vectors is a matrix with the vectors stored row-wise (and a batch of sparse vectors is a sparse matrix). Similarly, if we have pairs of inputs and outputs, a batch of those pairs is a pair of batches, a batch of tuples is a tuple of batches, etc.

The problem starts when we ask: how do I get an element from a batch? A matrix row is not the same as a vector. Moreover, it is unnatural to see a matrix as a "batch of rows" (why not a "batch of columns"?), and thus we likely do not have a way to get a row using op[] or to iterate over its rows. It becomes only more complicated when looking at the pair-of-vectors example (a single element of that would be a pair of two matrix rows), etc.

Even after some work on this, we have not really found a stable and "good" solution that does not involve massive amounts of preprocessor boilerplate and heavy use of traits classes. I would like to see your take on this; maybe it could give us a way to improve our work!

Best, Oswin

On 10/15/2016 03:14 AM, Michael Marcin wrote: [snip]
Arrays of Structures is the normal programming model.
Let's take a toy example, say I'm looking at some census data and have a structure that looks like:
struct citizen_t { string first_name; string last_name; int salary; int age; };
Then the Array of Structures container would be: vector<citizen_t> aos_citizens;
The Structure of Arrays version of the same data would look like:
struct soa_citizens_t { vector<string> first_names; vector<string> last_names; vector<int> salary; vector<int> age; };
soa_citizens_t soa_citizens;
[snip]
// Structures of Arrays int avg = 0; int t = 1; for ( int salary : soa_citizens.salary ) { avg += (salary - avg) / t; ++t; }
To generalize, what about:

    template< std::size_t N, typename... Fields >
    using soa_citizens_t = std::tuple< std::array<Fields, N>... >;

? Then wouldn't:

    constexpr std::size_t n = 5;
    soa_citizens_t<n, string, string, int, int> soa_citizens;
    int avg = 0;
    int t = 1;
    constexpr std::size_t salary_field = 2;
    for ( int salary : std::get<salary_field>(soa_citizens) ) {
        avg += (salary - avg) / t;
        ++t;
    }

calculate the same thing as the avg in your example? (CONFESSION: Not tested.) -regards, Larry

On 10/15/2016 04:06 PM, Larry Evans wrote:
template< std::size_t N, typename... Fields > using soa_citizens_t = std::tuple< std::array<Fields, N>... >;
? Then wouldn't:
constexpr std::size_t n = 5; soa_citizens_t<n, string, string, int, int> soa_citizens; int avg = 0; int t = 1; constexpr std::size_t salary_field = 2; for ( int salary : std::get<salary_field>(soa_citizens) ) { avg += (salary - avg) / t; ++t; }
FWIW, the attached compiles and runs with: clang version 3.8.0 (tags/RELEASE_380/final) on my ubuntu OS.

On 10/15/2016 5:15 PM, Larry Evans wrote:
On 10/15/2016 04:06 PM, Larry Evans wrote:
template< std::size_t N, typename... Fields > using soa_citizens_t = std::tuple< std::array<Fields, N>... >;
? Then wouldn't:
constexpr std::size_t n = 5; soa_citizens_t<n, string, string, int, int> soa_citizens; int avg = 0; int t = 1; constexpr std::size_t salary_field = 2; for ( int salary : std::get<salary_field>(soa_citizens) ) { avg += (salary - avg) / t; ++t; }
FWIW, the attached compiles and runs with:
Indeed, this works well for a fixed-size dataset. It doesn't have the wasted duplicated information that a tuple of std::vectors has. It still doesn't provide an interface to query the SoA directly for an iterator, size, etc. Both fixed-size and dynamic-growth containers are useful, although I tend to need dynamic growth more often.

On 16 October 2016 at 08:36, Michael Marcin <mike.marcin@gmail.com> wrote:
It doesn't have the wasted duplicated information that a tuple of std::vectors has.
In your example, which I think is flawed as an example, you create a 20+ GB data structure in memory (hope I got the maths right). I realised this after a std::bad_alloc was thrown. Some duplicated info in the std::vectors seems rather irrelevant.

This data structure optimizes one operation (calculating the average) at the cost of pessimising almost any other operation (relocation if the vector needs to grow, insert, delete, push_back etc. are all done 4 times in your example), while iterating over the records will land you right in cache-miss heaven. Keeping a running average while you read/input/delete the data gives you an O(1) average. Consider using a pector <https://github.com/aguinet/pector> instead of a vector.

You state that the example is a toy example. To me the example shows that iterating over a vector of smaller objects (PODs) is faster than iterating over a vector of larger objects, duh. The real use case might be more interesting; maybe you can describe it. degski

On 2016-10-16 08:59, degski wrote:
On 16 October 2016 at 08:36, Michael Marcin <mike.marcin@gmail.com> wrote: You state that the example is a toy example. To me the example shows that iterating over a vector of smaller objects (pods) is faster than iterating over a vector of larger objects, duh. The real use case might be more interesting, maybe you can describe it.
Hi,

I gave a real use case in a slightly more complicated setting. We had to face the problem of how to align data so that we can unify a set of slow matrix-vector multiplications into one fast matrix-matrix multiplication. We usually have to traverse the same dataset several 100-1000 times to do our computations, so some overhead in the data setup phase (e.g. cost of insertion) is okay, and we would sacrifice even more performance in that phase if we got in return more performance in the following three days of computations.

Having said that, I agree that just reordering does not give much in many cases. The real gain comes when data can be aligned better or used more independently; e.g. if the struct contains a std::vector or std::string, we can gain a lot by using a data structure which stores the contents of the vectors consecutively, thus removing one level of indirection. Note that this pattern is also common in the game industry, where objects are made up of data structures which are used in (nearly) independent parts of the game and where data has to be copied to and from the GPU.

Best, Oswin

Hey, On 12:34 Sun 16 Oct , Oswin Krause wrote:
I gave a real use-case in a slightly more complicated setting. We had to face the problem of "how can we align data in a way that we can unify a set of slow matrix-vector multiplications into one fast matrix-matrix multiplication". We usually have to traverse the same dataset several 100-1000 times in order to do our computations, so some overhead in the data setup phase (e.g. cost of insertion) is okay and we would sacrifice even more performance in that phase if we got in return more performance in the following three days of computations. Having said that, I agree that just reordering does not give much in many cases. But the real gain is given when data can be aligned better or used more independently. e.g. if the struct contains a std::vector or std::string we can gain a lot by using a data structure which stores the contents of the vectors consecutively, thus removing one level of indirection.
Note that this pattern is also common in game industry, where objects are made up of data structures which are used in (nearly) independent parts of the game and where data has to be copied to and from the gpu.
Could you give a pseudo code description of what typical operations you're running? In your fist mail you mentioned that you also have some sparse matrices. I find this interesting because we've implemented some improved containers for sparse matrices and SpMVM (sparse matrix vector multiplication). Those are written for our simulation codes and now I'm curious if they could be reused in other settings. Cheers -Andreas -- ========================================================== Andreas Schäfer HPC and Supercomputing Institute for Multiscale Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany +49 9131 85-20866 PGP/GPG key via keyserver http://www.libgeodecomp.org ========================================================== (\___/) (+'.'+) (")_(") This is Bunny. Copy and paste Bunny into your signature to help him gain world domination!

Hi, On 09:59 Sun 16 Oct , degski wrote:
In your example, which I think is flawed as an example, you create in memory a 20+ GB data structure (hope I got the maths right).
300M * 4 * 4B = 2.4GB. IMHO not too large. Even 20GB isn't that much in HPC.
This data structure optimizes one operation (calculating the average) at the cost of pessimising almost any other operation (relocation if the vector needs to grow, insert, delete, push_back etc are all done 4 times in your example)
Keep in mind that this is a sketch of what he would like to use, not a polished solution. The point is: with a Struct-of-Arrays (SoA) memory layout you can implement kernels operating on that data that can be vectorized. Depending on your CPU, vectorization can yield you a speedup of 2x - 8x. If implemented properly then insert, delete, push_back etc. can also be efficient for SoA.
while iterating over the records will land you right into cache-miss-heaven.
Depends. If your access pattern is very random and your algorithm needs to access every member of an object, then a std::vector might be faster. But if your algorithm has some (piecewise) continuity, then the SoA structure will benefit just as well from the CPU's caches. Cheers -Andreas

On 10/16/2016 1:59 AM, degski wrote:
You state that the example is a toy example. To me the example shows that iterating over a vector of smaller objects (pods) is faster than iterating over a vector of larger objects, duh.
Iterating over a vector of smaller objects being faster than iterating over a vector of larger objects is the core of the SoA approach.
The real use case might be more interesting, maybe you can describe it.
A common real use case is a particle system in a game. A very simple particle could look like this (a real particle will often have many more fields):

    struct particle_t {
        float3 position;
        float3 velocity;
        float3 acceleration;
        float2 size;
        float4 color;
        float energy;
        bool alive;
    };

Every tick you need to update all of the particles:

    void update() {
        size_t n = position.size();
        for ( size_t i = 0; i < n; ++i ) {
            auto & v = velocity[i];
            v = v + (acceleration[i] * dt);
            v.y += gravity * dt;
            position[i] = position[i] + (v * dt);
            energy[i] -= dt;
            if ( energy[i] <= 0 ) {
                alive[i] = false;
            }
        }
    }

Note that update() doesn't read or write several of the fields (size, color); those are only used for rendering. You can get a decent win purely through improved cache utilization by keeping this cold data out of the way.

You can then think about going one step further and building a SIMD-vectorized update loop, which is strictly impossible with an Array of Structures approach. There are two obvious ways to go about vectorizing this update:
1. Store position/velocity/acceleration as float4 instead of float3 and operate on xyz for a single particle in one instruction.
2. Store each component in its own array and operate on 4 particles at once.

This is still a toy example, but it's closer to something real. http://codepad.org/r9u82dOh

    AoS in 6.54421 seconds
    SoA in 5.91915 seconds
    SoA SSE in 3.58603 seconds

Note: there is still room for improvement in the SSE example.

On 16 October 2016 at 21:45, Michael Marcin <mike.marcin@gmail.com> wrote:
You can get a decent win purely through improved cache utilization by keeping this cold data out of the way.
And stick all the cold data in an AoS?
This is still a toy example but it's closer to something real.
Yes, but is 1M particles common? AoS in 6.54421 seconds
SoA in 5.91915 seconds SoA SSE in 3.58603 seconds
1M particles on my Ci3 5005U 2.0GHz/AVX2/4GB laptop / WIN10 / Clang/LLVM 4.0:

AoS in 14.7198 seconds
SoA in 13.5969 seconds
SoA SSE in 8.78095 seconds

I've run this with a count of 25'000 and it shows something(s) interesting:

AoS in 0.274145 seconds
SoA in 0.312875 seconds
SoA SSE in 0.0768812 seconds

1. SoA slower than AoS.
2. SoA SSE way faster (relatively) than either SoA or AoS.

You've definitely made your case, when using SSE. I'll have a rethink. degski

On 10/16/2016 10:49 PM, degski wrote:
This is still a toy example but it's closer to something real.
Yes, but is 1M particles common?
Depends on the game. To be fair, depending on where your bottlenecks are, you might move code like this to compute shaders instead.
AoS in 6.54421 seconds
SoA in 5.91915 seconds SoA SSE in 3.58603 seconds
1M particles on my Ci3 5005U 2.0GHZ/AVX2/4GB laptop / WIN10 / Clang/LLVM 4.0:
AoS in 14.7198 seconds SoA in 13.5969 seconds SoA SSE in 8.78095 seconds
I've run this with a count of 25'000 and it shows something(s) interesting:
AoS in 0.274145 seconds SoA in 0.312875 seconds SoA SSE in 0.0768812 seconds
1. SoA slower than AoS. 2. SoA SSE way faster (relatively) than either SoA and AoS.
You've definitely made your case, when using SSE. I'll have a rethink.
Indeed, the code generated for SoA here is much worse than for AoS. The AoS update is roughly ~40 assembly instructions; the SoA update is roughly ~200. A lot of this is probably due to the soa_emitter_t implementation being suboptimal.

Also, sizeof(particle_t) = 68 bytes, and 68 * 25'000 = 1.7 MB. Your CPU has a 3 MB L3 cache, so the entire data structure can stay in fast memory. All versions (AoS, SoA, SoA SSE) have access patterns that are very friendly to the memory prefetcher, so L1 and L2 cache sizes should not affect the results. So with no main-memory access, the version with the much better code (AoS) wins.

If you bump your count up to 50'000 (3.4 MB) you might see SoA pull ahead again; at 100'000 (6.8 MB) you should definitely see it. Alternatively you could add more data to the particles, say:

    struct pad_t { char data[64]; };
    struct particle_t {
        ...
        pad_t pad;
    };

This additional data won't affect the SoA update at all but should affect your AoS update. (Rough math: 3 MB / 25k elements = 120 bytes per element max to fit everything in cache.)

On 10/16/2016 10:49 PM, degski wrote:
On 16 October 2016 at 21:45, Michael Marcin <mike.marcin@gmail.com> wrote:
You can get a decent win purely through improved cache utilization by keeping this cold data out of the way.
And stick all the cold data in an AoS?
What's hot or cold depends on the operation. For update(), color and size are unused. For render(), energy, position, and acceleration would be unused.

Hey, On 00:36 Sun 16 Oct , Michael Marcin wrote:
Indeed this works well for a fixed sized dataset. It doesn't have the wasted duplicated information that a tuple of std::vectors has. It still doesn't provide an interface to query the soa directly for an iterator or size etc.
Both a fixed-sized and dynamic-growth container are useful. Although I tend to need dynamic growth more often.
I've implemented your example using the soa_grid from LibFlatArray:

================================= 8< ============================
#include <libflatarray/flat_array.hpp>

class citizen
{
public:
    std::string first_name;
    std::string last_name;
    int salary;
    int age;
};

LIBFLATARRAY_REGISTER_SOA(
    citizen,
    ((std::string)(first_name))
    ((std::string)(last_name))
    ((int)(salary))
    ((int)(age)))

int main()
{
    LibFlatArray::soa_grid<citizen> citizens(300, 1, 1);
    long accumulator = 0;
    citizens.callback([&citizens, &accumulator](auto i) {
            for (; i.index() < citizens.dim_x(); ++i) {
                accumulator += i.salary();
            }
        });
    int average = accumulator / citizens.dim_x();
    return 0;
}
================================= 8< ============================

The size of the soa_grid can be chosen at runtime, but it might not be a good match for the use case you describe: it's essentially 3D and hence doesn't support push_back() and friends. On the other hand, LibFlatArray::soa_array is 1D and behaves much more like std::vector, but it also has a maximum size that's fixed at compile time.

Would something like an soa_vector, resizable at runtime and behaving much like a std::vector, better suit your needs?

Cheers -Andreas

On 10/16/2016 3:28 AM, Andreas Schäfer wrote:
Hey,
On 00:36 Sun 16 Oct , Michael Marcin wrote:
Indeed this works well for a fixed sized dataset. It doesn't have the wasted duplicated information that a tuple of std::vectors has. It still doesn't provide an interface to query the soa directly for an iterator or size etc.
Both a fixed-sized and dynamic-growth container are useful. Although I tend to need dynamic growth more often.
I've implemented your example using the soa_grid from LibFlatArray:
================================= 8< ============================
#include <libflatarray/flat_array.hpp>
class citizen { public: std::string first_name; std::string last_name; int salary; int age; };
LIBFLATARRAY_REGISTER_SOA( citizen, ((std::string)(first_name)) ((std::string)(last_name)) ((int)(salary)) ((int)(age)))
int main() { LibFlatArray::soa_grid<citizen> citizens(300, 1, 1); long accumulator = 0;
citizens.callback([&citizens, &accumulator](auto i) { for (; i.index() < citizens.dim_x(); ++i) { accumulator += i.salary(); } }); int average = accumulator / citizens.dim_x();
return 0; }
================================= 8< ============================
The size of the soa_grid can be chosen at runtime, but it might not be a good match for the use case you describe: it's essentially 3D and hence doesn't support push_back() and friends. On the other hand, LibFlatArray::soa_array is 1D and behaves much more like std::vector, but it also has a maximum size that's fixed at compile time.
Thanks for clarifying.
Would something like an soa_vector, that's resizable at runtime and behaves much like a std::vector better suit your needs?
Yes exactly that. I would also strongly prefer a tuple-like variadic template interface to a macro interface.

Heya, On 13:46 Sun 16 Oct , Michael Marcin wrote:
Would something like an soa_vector, that's resizable at runtime and behaves much like a std::vector better suit your needs?
Yes exactly that.
just FYI: I've added a basic soa_vector to LibFlatArray's trunk[1]. It's not perfect, but allows basic usage and particle callbacks (see test[2]). I'm looking forward to benchmarking that code. :-)
I would also strongly prefer a tuple-like variadic template interface to a macro interface.
Yeah, macros are poor, but I wanted an interface where members have sensible names (as opposed to first, second...). Not many options without reflection.

Cheers -Andreas

[1] https://github.com/gentryx/libflatarray/blob/master/include/libflatarray/soa...
[2] https://github.com/gentryx/libflatarray/blob/master/test/soa_vector_test.cpp...

On 10/21/2016 08:11 PM, Andreas Schäfer wrote:
Heya,
On 13:46 Sun 16 Oct , Michael Marcin wrote:
Would something like an soa_vector, that's resizable at runtime and behaves much like a std::vector better suit your needs?
Yes exactly that.
just FYI: I've added a basic soa_vector to LibFlatArray's trunk[1]. It's not perfect, but allows basic usage and particle callbacks (see test[2]). I'm looking forward to benchmarking that code. :-)
Maybe by modifying:

https://github.com/cppljevans/soa/blob/master/codepad.eol6auRN.cpp

by inserting another struct, say LibFlatArray_emitter_t, which uses soa_vector. I, and I guess Michael, would be interested in seeing the output of such a modification. -regards, Larry

Am 22. Oktober 2016 08:36:21 MESZ, schrieb Larry Evans <cppljevans@suddenlink.net>:
On 10/21/2016 08:11 PM, Andreas Schäfer wrote:
Heya,
On 13:46 Sun 16 Oct , Michael Marcin wrote:
Would something like an soa_vector, that's resizable at runtime and behaves much like a std::vector better suit your needs?
Yes exactly that.
just FYI: I've added a basic soa_vector to LibFlatArray's trunk[1]. It's not perfect, but allows basic usage and particle callbacks (see test[2]). I'm looking forward to benchmarking that code. :-)
Maybe by modifying:
https://github.com/cppljevans/soa/blob/master/codepad.eol6auRN.cpp
by inserting another struct, say LifFlatArray_emitter_t, which uses soa_vector. I, and I guess, Michael, would be interested in seeing the output from such a modification.
Good idea, that should be interesting. Will do! Cheers -Andreas -- Sent from my mobile. No type gold.

On 07:32 Sat 15 Oct , degski wrote:
On 15 October 2016 at 05:43, Michael Marcin <mike.marcin@gmail.com> wrote:
Does this already exist somewhere?
Take a look at Boost.DoubleEnded 1 <http://erenon.hu/double_ended/> and see whether boost::double_ended::batch_deque <http://erenon.hu/double_ended/double_ended/design_rationale.html#double_ended.design_rationale.batch_deque> fills all of your requirements.
This might work for some SoA types where all fields have the same type. In practice that's not generally the case, though. Cheers -Andreas

Heya, On 21:43 Fri 14 Oct , Michael Marcin wrote:
Does this already exist somewhere?
you may want to take a look at LibFlatArray. It's a struct-of-arrays library for C++, which also helps with vectorization and address generation: http://libgeodecomp.org/libflatarray.html
Does it seem useful?
I think everyone who's using arrays of structs/objects and needs improved arithmetic performance can benefit from such a library. Cheers -Andreas

On 10/15/2016 3:44 AM, Andreas Schäfer wrote:
Heya,
On 21:43 Fri 14 Oct , Michael Marcin wrote:
Does this already exist somewhere?
you may want to take a look at LibFlatArray. It's a struct-of-arrays library for C++, which also helps with vectorization and address generation:
It looks like it only works for homogeneous data, and it doesn't appear to handle alignment. Otherwise it looks quite similar to what I described. Thanks for the link.

On 12:02 Sat 15 Oct , Michael Marcin wrote:
Looks to only work for homogeneous data
Could you clarify what you mean by "homogeneous data"? Do you refer to structures where all members have the same type? In LibFlatArray all member fields of an SoA may have different types. This example[1] has a member of type bool and one of type double. You can also have arrays (e.g. int foo[3]) as members (e.g. [2]).
and looks to not handle alignment.
Alignment for the allocated memory is handled by the allocator. The aligned_allocator[3] can also be used with std containers (e.g. std::vector). For 2D and 3D grids the soa_grid container will pad the grid dimensions internally to ensure alignment.
Otherwise looks quite similar to what I described. Thanks for the link.
HTH :-)

Cheers -Andreas

[1] https://github.com/gentryx/libflatarray/blob/master/test/soa_grid_test.cpp#L...
[2] https://github.com/gentryx/libflatarray/blob/master/test/soa_grid_test.cpp#L...
[3] https://github.com/gentryx/libflatarray/blob/master/src/aligned_allocator.hp...

On 10/15/2016 1:48 PM, Andreas Schäfer wrote:
On 12:02 Sat 15 Oct , Michael Marcin wrote:
Looks to only work for homogeneous data
Could you clarify what you mean by "homogeneous data"? Do you refer to structures where all members have the same type? In LibFlatArray all member fields of an SoA may have different types. This example[1] has a member of type bool and one of type double. You can also have arrays (e.g. int foo[3]) as members (e.g. [2]).
Ah I was looking at soa_array.hpp and didn't see anything that looked like it could handle multiple data types. My apologies.
and looks to not handle alignment.
Alignment for the allocated memory is handled by the allocator. The aligned_allocator[3] can also be used with std containers (e.g. std::vector). For 2D and 3D grids the soa_grid container will pad the grid dimensions internally to ensure alignment.
I see, I was looking for alignof or similar constructs which would align each subarray. But I couldn't find anything.
Otherwise looks quite similar to what I described. Thanks for the link.
HTH :-)
When the docs say: "For each member a sub-array is reserved within. We know each member's size and the array's dimensions are fixed, thus we can deduce the offsets for each of these sub-arrays." Does this mean the SoA has a compile-time fixed capacity? I was imagining a std::vector-like dynamic capacity with amortized growth.

On 16/10/2016 09:12, Michael Marcin wrote:
When the docs say: "For each member a sub-array is reserved within. We know each member's size and the array's dimensions are fixed, thus we can deduce the offsets for each of these sub-arrays."
Does this mean the SoA has a compile-time fixed capacity?
The docs also lead off the "how it works" section with "The key to LibFlatArray's functionality is that we fix its size internally at compile time." So I would assume so. :)

El 15/10/2016 a las 10:44, Andreas Schäfer escribió:
Heya,
On 21:43 Fri 14 Oct , Michael Marcin wrote:
Does this already exist somewhere?
You may want to take a look at LibFlatArray. It's a struct-of-arrays library for C++, which also helps with vectorization and address generation:
Jumping in a bit late, sorry. Two inconveniences I find with the approaches I've seen so far are: * Switching from AoS to SoA requires changing the syntax used to access the data * There is no easy way to quickly experiment with SoA vs. AoS without rewriting/refactoring code. I played with these concepts some months ago and came up with some sketches: http://bannalia.blogspot.com/2015/08/c-encapsulation-for-data-oriented-desig... http://bannalia.blogspot.com/2015/09/c-encapsulation-for-data-oriented.html http://bannalia.blogspot.com/2015/11/soa-container-for-encapsulated-c-dod.ht... ( http://tinyurl.com/qb7gmbk http://tinyurl.com/nhhz9l8 http://tinyurl.com/zmdsqyf ) The key idea of this approach is to decouple the user's code from actual access to the data by making the former templatized on an Access policy, which can then be changed from AoS to SoA and back without further modifications. The sketch also includes an SoA container along this line. With a little more work it is possible to have not only "pure" AoS and SoA but also mixed layouts where some data members are given their own cache line while others are grouped together, etc. Switching from one to the other quickly would allow for easy profiling and tuning for the specific access profiles of a given application. Of course everything above was a mere experiment very far away from production-ready code, but maybe someone can take advantage of the ideas to write something more robust. Joaquín M López Muñoz

On 10/18/2016 11:00 AM, Joaquin M López Muñoz wrote:
I played with these concepts some months ago and came up with some sketches:
http://bannalia.blogspot.com/2015/08/c-encapsulation-for-data-oriented-desig...
http://bannalia.blogspot.com/2015/09/c-encapsulation-for-data-oriented.html http://bannalia.blogspot.com/2015/11/soa-container-for-encapsulated-c-dod.ht...
( http://tinyurl.com/qb7gmbk http://tinyurl.com/nhhz9l8 http://tinyurl.com/zmdsqyf )
The key idea of this approach is to decouple the user's code from actual access to the data by making the former templatized on an Access policy, which can then be changed from AoS to SoA and back without further modifications. The sketch also includes an SoA container along this line. With a little more work it is possible to have not only "pure" AoS and SoA but also mixed layouts where some data members are given their own cache line while others are grouped together, etc. Switching from one to the other quickly would allow for easy profiling and tuning for the specific access profiles of a given application.
Of course everything above was a mere experiment very far away from production-ready code, but maybe someone can take advantage of the ideas to write something more robust.
How wonderful that you also chose a particle system as your example. Very enjoyable articles, thanks! Your dod::access and dod::pointer types look to be quite similar to what I call a view and view_iterator respectively. I need to study your examples more. I saw https://github.com/gnzlbg/scattered in the comments, which also looks interesting.

On 10/14/2016 10:43 PM, Michael Marcin wrote:
Structure of arrays is a pretty well known pattern for improving memory density and enabling vectorization in high performance code. Also known as parallel arrays[1].
Hi, can this be made compatible with Boost's multi-index container http://theboostcpplibraries.com/boost.multiindex as another interface/view to it? I use this Boost library extensively and it is powerful! It would provide selectivity.

On 10/14/2016 09:43 PM, Michael Marcin wrote:
Structure of arrays is a pretty well known pattern for improving memory density and enabling vectorization in high performance code. Also known as parallel arrays[1].
After hand-rolling yet another of these it occurred to me that you make a reasonable first implementation with a simple alias template.
template< typename... Ts > using soa = std::tuple< std::vector<Ts>... >;
example: soa<int,float,std::string> data;
This is decent start but has several issues including: * nothing to preserve the invariant keeping the arrays the same length * missing coalesced interface for vector methods (size, reserve, etc) * no iterator * wasted storage for duplicated allocators, capacities, sizes * sizeof...(Ts) allocations could be a single large block
The work required to address these issues is not too great I think.
The purpose of item: * sizeof...(Ts) allocations could be a single large block is to just require 1 heap allocation instead of N, where N is the number of vectors in soa<T1,T2,...,TN>? -regards, Larry

On 10:29 Tue 18 Oct , Larry Evans wrote:
The purpose of item:
* sizeof...(Ts) allocations could be a single large block
is to just require 1 heap allocation instead of N, where N is the number of vectors in soa<T1,T2,...,TN>?
One benefit of this would be that transferring such a container to another address space (think MPI or CUDA) would become much simpler. Cheers -Andreas

On 10/18/2016 10:55 AM, Andreas Schäfer wrote:
On 10:29 Tue 18 Oct , Larry Evans wrote:
The purpose of item:
* sizeof...(Ts) allocations could be a single large block
is to just require 1 heap allocation instead of N, where N is the number of vectors in soa<T1,T2,...,TN>?
One benefit of this would be that transferring such a container to another address space (think MPI or CUDA) would become much simpler.
It also reduces the size of your handle structure (the structure that holds the pointer to the data). Otherwise every additional member adds ~24 bytes for a simple tuple< vector<Ts>... > or sizeof(T*) bytes at a minimum for separately allocated blocks. It also can reduce the size of an iterator to a view of the data or remove an indirection from it depending on implementation. It has a nice benefit that the size of the handle + body (alloc'd data block) for a soa_vector would be identical to that of a normal AoS vector containing the same data. If a solution ticked all the other boxes and dropped this one, I'd be fine with that. There's quite a bit of complexity involved with calculating the offsets with a dynamic capacity and potentially arbitrary alignment requirements on internal subarrays. I'm also not sure how to reconcile it with another frequent SoA optimization: replace bools or small enums/ints with an array of bit-packed data. For example the alive bool in the particle_t elsewhere in this thread could be stored as a bit_vector, which can be a size and speed win. As seen in the soa_emitter_sse_opt_t addition here: http://codepad.org/eol6auRN AoS in 5.8485 seconds SoA in 4.06838 seconds SoA flat in 3.99157 seconds SoA Static in 5.26953 seconds SoA SSE in 3.53028 seconds SoA SSE opt in 2.98845 seconds P.S. this also shows an improved soa_emitter_t which generates much better code (vs2015) when vector.data() is cached for each member before the loop. Improves SoA update from 191 instructions to 52 instructions and roughly 2x speedup for small (~25k) datasets, but is still more than the 36 instructions required for AoS update.

On 10/18/2016 10:24 PM, Michael Marcin wrote:
On 10/18/2016 10:55 AM, Andreas Schäfer wrote:
On 10:29 Tue 18 Oct , Larry Evans wrote:
The purpose of item:
* sizeof...(Ts) allocations could be a single large block
is to just require 1 heap allocation instead of N, where N is the number of vectors in soa<T1,T2,...,TN>?
One benefit of this would be that transferring such a container to another address space (think MPI or CUDA) would become much simpler.
It also reduces the size of your handle structure (the structure that holds the pointer to the data). Otherwise every additional member adds ~24 bytes for a simple tuple< vector<Ts>... > or sizeof(T*) bytes at a minimum for separately allocated blocks.
It also can reduce the size of an iterator to a view of the data or remove an indirection from it depending on implementation.
It has a nice benefit that the size of the handle + body (alloc'd data block) for a soa_vector would be identical to that of a normal AoS vector containing the same data.
If a solution ticked all the other boxes and dropped this one, I'd be fine with that. There's quite a bit of complexity involved with calculating the offsets with a dynamic capacity and potentially arbitrary alignment requirements on internal subarrays.
[snip] Michael, the attached is an outline of how to do it. As noted in the comments, a lot of member functions still need implementation; however, the offset calculations I think are correct. HTH. -regards, Larry

On 10/19/2016 01:26 AM, Larry Evans wrote:
On 10/18/2016 10:24 PM, Michael Marcin wrote:
On 10/18/2016 10:55 AM, Andreas Schäfer wrote:
On 10:29 Tue 18 Oct , Larry Evans wrote:
The purpose of item:
* sizeof...(Ts) allocations could be a single large block
is to just require 1 heap allocation instead of N, where N is the number of vectors in soa<T1,T2,...,TN>?
One benefit of this would be that transferring such a container to another address space (think MPI or CUDA) would become much simpler.
It also reduces the size of your handle structure (the structure that holds the pointer to the data). Otherwise every additional member adds ~24 bytes for a simple tuple< vector<Ts>... > or sizeof(T*) bytes at a minimum for separately allocated blocks.
It also can reduce the size of an iterator to a view of the data or remove an indirection from it depending on implementation.
It has a nice benefit that the size of the handle + body (alloc'd data block) for a soa_vector would be identical to that of a normal AoS vector containing the same data.
If a solution ticked all the other boxes and dropped this one, I'd be fine with that. There's quite a bit of complexity involved with calculating the offsets with a dynamic capacity and potentially arbitrary alignment requirements on internal subarrays.
[snip] Michael, the attached is an outline of how to do it. As noted in the comments, a lot of member functions still need implementation; however, the offset calculations I think are correct.
Code has been updated to implement most of the "essentials" of soa_block. You can now call begin<Index> to get the begin iterator for the Index'th vector. The code now is on github at: https://github.com/cppljevans/soa/blob/master/soa_block.cpp HTH. -regards, Larry

On 10/19/2016 7:32 AM, Larry Evans wrote:
Michael, the attached is an outline of how to do it. As noted in the comments, a lot of member functions still need implementation; however, the offset calculations I think are correct.
Code has been updated to implement most of the "essentials" of soa_block. You can now call begin<Index> to get the begin iterator for the Index'th vector.
The code now is on github at:
Still digesting this, but one small tidbit stood out for me: //The c++ standard, IIRC, says all alignments are power of 2. I didn't actually know this was a requirement. "Every alignment value shall be a non-negative integral power of two." § 3.11/4 http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/n4606.pdf Thanks!

On 10/19/2016 11:59 PM, Michael Marcin wrote:
On 10/19/2016 7:32 AM, Larry Evans wrote:
Michael, the attached is an outline of how to do it. As noted in the comments, a lot of member functions still need implementation; however, the offset calculations I think are correct.
Code has been updated to implement most of the "essentials" of soa_block. You can now call begin<Index> to get the begin iterator for the Index'th vector.
The code now is on github at:
Still digesting this
You may be having trouble with digesting: soa_impl<...>::get_self soa_impl<...>::get_vec and also digesting: ~soa_impl() { using swallow = int[]; // guarantees left to right order (void)swallow{0, (void(this->template destruct<Indices>()), 0)...}; both techniques I copied from elsewhere (I can't remember where). If so, I could explain a bit. If it's something else, I'd be happy to try and explain that.
but one small tidbit I stood out for me
//The c++ standard, IIRC, says all alignments are power of 2.
I didn't actually know this was a requirement.
"Every alignment value shall be a non-negative integral power of two." § 3.11/4 http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/n4606.pdf
Thanks!
You're welcome. One change I made to the vec_offsets function could lead to a problem in some very special application. Suppose that, instead of: , my_storage(new char[my_offsets.back()]) in the soa_block CTOR initialization list, you wanted to create a vector of vectors, or matrix, of soa's. Then you might think you could: soa_impl(std::size_t a_row_size, std::size_t a_col_size) : my_row_size(a_row_size) , my_col_size(a_col_size) , my_offsets(vec_offsets<Ts...>(my_row_size)) , my_storage(new char[my_offsets.back()*my_col_size]) { ... } However, you couldn't, because the 2nd column of the matrix *might* not have the correct alignment. That's because I changed the vec_offsets function by replacing this: result[sizeTs]=nxt_offset( result[sizeTs-1], sizes[sizeTs-1]*vec_size, max_align); with this: result[sizeTs]=result[sizeTs-1]+sizes[sizeTs-1]*vec_size; My justification for doing that was I assumed nobody would want to do that; however, who knows. Also, if I left in the original, and there was no need for it, then people reviewing the code would wonder why I put it in :( I'll try to find some comments to put in the vec_offsets function to warn people of this unlikely problem. HTH, Larry

On 10/20/2016 03:55 AM, Larry Evans wrote:
On 10/19/2016 11:59 PM, Michael Marcin wrote:
On 10/19/2016 7:32 AM, Larry Evans wrote:
Michael, the attached is an outline of how to do it. As noted in the comments, a lot of member functions still need implementation; however, the offset calculations I think are correct.
Code has been updated to implement most of the "essentials" of soa_block. You can now call begin<Index> to get the begin iterator for the Index'th vector.
The code now is on github at:
Still digesting this [snip]
One change I made to the vec_offsets function could lead to a problem in some very special application. You might think that you could, instead of:
, my_storage(new char[my_offsets.back()])
in the soa_block CTOR initialization list, you wanted to create a vector of vectors, or matrx, of soa's. Then you might think you could:
soa_impl(std::size_t a_row_size, std::size_t a_col_size) : my_row_size(a_row_size) , my_col_size(a_col_size) , my_offsets(vec_offsets<Ts...>(my_row_size)) , my_storage(new char[my_offsets.back()*my_col_size]) { ... }
However, you couldn't because the 2nd column of the matrix *might* not have the correct alignment. That's because, I changed the vec_offsets function by replacing, in the vec_offsets function, this:
result[sizeTs]=nxt_offset( result[sizeTs-1], sizes[sizeTs-1]*vec_size, max_align);
with this:
result[sizeTs]=result[sizeTs-1]+sizes[sizeTs-1]*vec_size;
My justification for doing that was I assumed nobody would want to do that; however, who knows. Also, if I left in the original, and there was no need for it, then people reviewing the code would wonder why I put it in :(
I'll try to find some comments to put in the vec_offsets function to warn people of this unlikely problem.
Running the just-uploaded test file: https://github.com/cppljevans/soa/blob/master/vec_offsets.test.cpp produces, for the last test, the output: :vec_size=3:sizeof...(Ts)=2 sizeof...(Ts)=2 offset[0]=0 alignment[0]=16 aligned[0]=1 offset[1]=48 alignment[1]=4 aligned[1]=1 buf_size=84 max_alignment=16 max_aligned=0 buf_size+offset[0]=84 alignment[0]=16 aligned[0]=0 buf_size+offset[1]=132 alignment[1]=4 aligned[1]=1 Note the aligned[0]=0 line. This illustrates that when 2 buffers are concatenated, the alignment for the first T in the 2nd buffer will not be correct. Hope that makes things clearer. -regards, Larry

On 10/19/2016 11:59 PM, Michael Marcin wrote:
On 10/19/2016 7:32 AM, Larry Evans wrote:
Michael, the attached is an outline of how to do it. As noted in the comments, a lot of member functions still need implementation; however, the offset calculations I think are correct.
Code has been updated to implement most of the "essentials" of soa_block. You can now call begin<Index> to get the begin iterator for the Index'th vector.
The code now is on github at:
Still digesting this, but one small tidbit stood out for me
[snip] I uploaded a modification of your benchmark: http://codepad.org/eol6auRN to: https://github.com/cppljevans/soa/blob/master/codepad.eol6auRN.cpp The modification added soa_emitter_block_t which uses soa_block. Unfortunately, this soa_emitter_block_t takes about twice as long as your soa_emitter_static_t. I've no idea why. Any guesses? -regards, Larry

On 10/20/2016 10:02 PM, Larry Evans wrote:
On 10/19/2016 11:59 PM, Michael Marcin wrote:
On 10/19/2016 7:32 AM, Larry Evans wrote:
Michael, the attached is an outline of how to do it. As noted in the comments, a lot of member functions still need implementation; however, the offset calculations I think are correct.
Code has been updated to implement most of the "essentials" of soa_block. You can now call begin<Index> to get the begin iterator for the Index'th vector.
The code now is on github at:
Still digesting this, but one small tidbit stood out for me
[snip] I uploaded a modification of your benchmark:
to: https://github.com/cppljevans/soa/blob/master/codepad.eol6auRN.cpp
The modification added soa_emitter_block_t which uses soa_block. Unfortunately, this soa_emitter_block_t takes about twice as long as your soa_emitter_static_t.
I've no idea why. Any guesses?
2x is quite an abstraction penalty. I can only assume your compiler is failing to optimize away some part of the abstraction. FWIW on vs2015 I'm not seeing nearly as much of a difference. particle_count=1,000,000 AoS in 6.34667 seconds SoA in 4.26384 seconds SoA flat in 4.16572 seconds SoA Static in 5.4037 seconds SoA block in 5.5588 seconds

On 10/21/2016 12:48 AM, Michael Marcin wrote:
On 10/20/2016 10:02 PM, Larry Evans wrote:
The modification added soa_emitter_block_t which uses soa_block. Unfortunately, this soa_emitter_block_t takes about twice as long as your soa_emitter_static_t.
I've no idea why. Any guesses?
2x is quite an abstraction penalty. I can only assume your compiler is failing to optimize away some part of the abstraction.
FWIW on vs2015 I'm not seeing nearly as much of a difference.
particle_count=1,000,000 AoS in 6.34667 seconds SoA in 4.26384 seconds SoA flat in 4.16572 seconds SoA Static in 5.4037 seconds SoA block in 5.5588 seconds
I'm still trying to work out how to fit overaligned subarrays into your framework. The issue is that many SIMD instructions require more than just alignof(T) alignment. Subarrays of float/double/int/short/char or carefully crafted UDTs might need to be aligned to as much as 64 bytes in the worst case. On the MIC architecture, vector load/store operations must be called on 64-byte aligned memory addresses. On the Xeon architecture with AVX/AVX2 instruction sets (Sandy Bridge, Ivy Bridge or Haswell), alignment does not matter. In earlier architectures (Nehalem, Westmere) alignment did matter, but a 32-byte alignment was necessary. https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/507... At the very least, support for the basic SSE 16-byte alignment of subarrays is crucial. My best idea so far is some magic wrapper type that gets special treatment. Like: using data_t = soa_block< float3, soa_align<float,16>, bool >; This maybe opens the door for other magic types like: using data_t = soa_block< float3, soa_align<float,16>, soa_bit >;

On 10/21/2016 01:07 AM, Michael Marcin wrote:
On 10/21/2016 12:48 AM, Michael Marcin wrote:
On 10/20/2016 10:02 PM, Larry Evans wrote:
The modification added soa_emitter_block_t which uses soa_block. Unfortunately, this soa_emitter_block_t takes about twice as long as your soa_emitter_static_t.
I've no idea why. Any guesses?
2x is quite an abstraction penalty. I can only assume your compiler is failing to optimize away some part of the abstraction.
OOPS. Yeah, I forgot about run-time optimization compiler flags :(
FWIW on vs2015 I'm not seeing nearly as much of a difference.
particle_count=1,000,000 AoS in 6.34667 seconds SoA in 4.26384 seconds SoA flat in 4.16572 seconds SoA Static in 5.4037 seconds SoA block in 5.5588 seconds
I'm still trying to work out how to fit overaligned subarrays into your framework.
The issue is that many SIMD instructions require more than just alignof(T) alignment.
Subarrays of float/double/int/short/char or carefully crafted UDTs might need to be aligned to as much as 64 bytes in the worst case.
On the MIC architecture, vector load/store operations must be called on 64-byte aligned memory addresses. On the Xeon architecture with AVX/AVX2 instruction sets (Sandy Bridge, Ivy Bridge or Haswell), alignment does not matter. In earlier architectures (Nehalem, Westmere) alignment did matter, but a 32-byte alignment was necessary.
https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/507...
At the very least, support for the basic SSE 16-byte alignment of subarrays is crucial.
My best idea so far is some magic wrapper type that gets special treatment. Like: using data_t = soa_block< float3, soa_align<float,16>, bool >;
This maybe opens the door for other magic types like: using data_t = soa_block< float3, soa_align<float,16>, soa_bit >;
That seems reasonable to me.

On 10/21/2016 01:39 AM, Larry Evans wrote:
On 10/21/2016 01:07 AM, Michael Marcin wrote:
On 10/21/2016 12:48 AM, Michael Marcin wrote:
On 10/20/2016 10:02 PM, Larry Evans wrote:
The modification added soa_emitter_block_t which uses soa_block. Unfortunately, this soa_emitter_block_t takes about twice as long as your soa_emitter_static_t.
I've no idea why. Any guesses?
2x is quite an abstraction penalty. I can only assume your compiler is failing to optimize away some part of the abstraction.
OOPS. Yeah, I forgot about run-time optimization compiler flags :(
[snip] Using optimization flag, -O2, (and maybe soa_block.begin_all) makes soa_block fastest, as described here: https://github.com/cppljevans/soa/blob/master/codepad.eol6auRN.cpp#L47 :)

On 10/21/2016 03:01 AM, Larry Evans wrote:
On 10/21/2016 01:39 AM, Larry Evans wrote:
On 10/21/2016 01:07 AM, Michael Marcin wrote:
On 10/21/2016 12:48 AM, Michael Marcin wrote:
On 10/20/2016 10:02 PM, Larry Evans wrote:
The modification added soa_emitter_block_t which uses soa_block. Unfortunately, this soa_emitter_block_t takes about twice as long as your soa_emitter_static_t.
I've no idea why. Any guesses?
2x is quite an abstraction penalty. I can only assume your compiler is failing to optimize away some part of the abstraction.
OOPS. Yeah, I forgot about run-time optimization compiler flags :(
[snip] Using optimization flag, -O2, (and maybe soa_block.begin_all) makes soa_block fastest [snip] But only fastest by a small amount.
I can't imagine how anything could be faster than the soa_emitter_static_t because it uses a tuple of std::array<T,particle_count>. I'd guess that the soa_emitter_block_t is only faster by luck (maybe during the soa_emitter_block_t run, my machine was not as busy on some other stuff). I've not much experience with benchmarking; so, take my reasoning with a grain of salt. -regards, Larry

On 07:50 Fri 21 Oct , Larry Evans wrote:
I can't imagine how anything could be faster than the soa_emitter_static_t because it uses a tuple of std::array<T,particle_count>. I'd guess that the soa_emitter_block_t is only faster by luck (maybe during the soa_emitter_block_t run, my machine was not as busy on some other stuff).
I think the reason why the different implementation techniques are so close is that the particle model is memory bound (i.e. it's moving a lot of data while each particle update involves relatively few calculations). The difference becomes larger if you're using only a few particles: then all particles sit in the upper levels of the cache and the CPU doesn't have to wait as much for the data. It would also be worthwhile to try a more complex particle model (e.g. by adding interaction between the particles). With increased computational intensity (floating point operations per byte moved) the delta of the different strategies should increase much more. I've added an implementation of the benchmark based on LibFlatArray's SoA containers and expression templates[1]. While working on the benchmark, I realized that the vector types ("short_vec") in LibFlatArray were lacking some desirable operations (e.g. masked move), so to reproduce my results you'll have to use the trunk from [2]. I'm very happy that you wrote this benchmark because it's a valuable test bed for performance, programmability, and functionality. Thanks! One key contribution is that the LibFlatArray-based kernels will automatically be vectorized without the user having to touch intrinsics (which automatically tie your code to a specific platform). LibFlatArray supports SSE, AVX, AVX512 (not yet available in consumer products), ARM NEON... I've re-run the benchmark a couple of times on my Intel Core i7-6700HQ (Skylake quad-core) to get stable results. Interestingly your SSE code is ~13% faster than the LibFlatArray code for large particle counts. I'll have to take a look at the assembly to figure out why that is. (As a library developer having such a test case is incredibly valuable, so thanks again!) For fewer particles the LibFlatArray kernel is ~31% faster. I assume that this delta would increase with a higher computational intensity as it's using AVX. 
On a SSE-only CPU the LibFlatArray code might be a little slower than the hand-optimized SSE code. particle_count=1.000.000 AoS in 9,21448 seconds SoA in 5,87921 seconds SoA flat in 5,81664 seconds SoA Static in 7,10225 seconds SoA block in 6,16696 seconds LibFlatArray SoA in 5,31733 seconds SoA SSE in 4,79973 seconds SoA SSE opt in 4,70757 seconds particle_count=1.024 AoS in 6,10074 seconds SoA in 6,6032 seconds SoA flat in 6,70765 seconds SoA Static in 6,74453 seconds SoA block in 6,54649 seconds LibFlatArray SoA in 2,10663 seconds SoA SSE in 3,53452 seconds SoA SSE opt in 2,76819 seconds Cheers -Andreas [1] https://github.com/gentryx/soa_experiment/blob/master/soa_compare.benchmark.... [2] https://github.com/gentryx/libflatarray

On 10/25/2016 01:41 AM, Andreas Schäfer wrote:
On 07:50 Fri 21 Oct , Larry Evans wrote:
I can't imagine how anything could be faster than the soa_emitter_static_t because it uses a tuple of std::array<T,particle_count>. I'd guess that the soa_emitter_block_t is only faster by luck (maybe during the soa_emitter_block_t run, my machine was not as busy on some other stuff).
I think the reason why the different implementation techniques are so close is that the particle model is memory bound (i.e. it's moving a lot of data while each particle update involves relatively few calculations).
The difference becomes larger if you're using only a few particles: then all particles sit in the upper levels of the cache and the CPU doesn't have to wait as much for the data. It would also be worthwhile to try a more complex particle model (e.g. by adding interaction between the particles). With increased computational intensity (floating point operations per byte moved) the delta of the different strategies should increase much more.
Thanks for the explanation. The latest version of the benchmark: d6ee370606f7f167dedb93e174459c6c7c4d8c19 reports the relative difference of the times: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L823 So, based on what you say above, I guess when particle_count: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L135 increases to the point where the cache is overflowed, the relative differences between the methods should change sharply?
I've added an implementation of the benchmark based on LibFlatArray's SoA containers and expression templates[1]. While working on the benchmark, I realized that the vector types ("short_vec") in LibFlatArray were lacking some desirable operations (e.g. masked move), so to reproduce my results you'll have to use the trunk from [2]. I'm very happy that you wrote this benchmark because it's a valuable test bed for performance, programmability, and functionality. Thanks!
You're welcome. Much of the credit goes to the OP, as acknowledged, indirectly, here: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L6
One key contribution is that the LibFlatArray-based kernels will automatically be vectorized without the user having to touch intrinsics (which automatically tie your code to a specific platform). LibFlatArray supports SSE, AVX, AVX512 (not yet available in consumer products), ARM NEON...
I've re-run the benchmark a couple of times on my Intel Core i7-6700HQ (Skylake quad-core) to get stable results.
Hmmm. I didn't realize you'd have to run the benchmark several times to get stable results. I guess that reflects my ignorance of how benchmarks should be run. Could you explain how running a couple of times achieves stable results? (Actually, on some occasions, I've run the benchmark and got results completely unexpected; I suspect it was because some application daemon was stealing cycles from the benchmark, leading to the unexpected results.)
Interestingly your SSE code is ~13% faster than the LibFlatArray code for large particle counts.
Actually, the SSE code was the OP's. As intimated above, using the latest version of the benchmark should make this % difference more apparent. For example, the output looks like this: particle_count=1,024 frames=1,000 minimum duration=0.0369697 comparitive performance table: method rel_duration ________ ______________ SoA 0.902566 Flat 0.907562 Block 0.963046 AoS 1 StdArray 1.15868 LFA undefined SSE undefined SSE_opt undefined The above was done with compiler optimization flag -O0. It changes dramatically with -O2 or -O3.
I'll have to take a look at the assembly to figure out why that is.
Oh, I bet that will be fun ;)
(As a library developer having such a test case is incredibly valuable, so thanks again!) For fewer particles the LibFlatArray kernel is ~31% faster. I assume that this delta would increase with a higher computational intensity as it's using AVX. On a SSE-only CPU the LibFlatArray code might be a little slower than the hand-optimized SSE code.
particle_count=1.000.000 AoS in 9,21448 seconds SoA in 5,87921 seconds SoA flat in 5,81664 seconds SoA Static in 7,10225 seconds SoA block in 6,16696 seconds LibFlatArray SoA in 5,31733 seconds SoA SSE in 4,79973 seconds SoA SSE opt in 4,70757 seconds
particle_count=1.024
AoS in 6,10074 seconds
SoA in 6,6032 seconds
SoA flat in 6,70765 seconds
SoA Static in 6,74453 seconds
SoA block in 6,54649 seconds
LibFlatArray SoA in 2,10663 seconds
SoA SSE in 3,53452 seconds
SoA SSE opt in 2,76819 seconds
From the above, the LibFlatArray and SSE methods are the fastest. I'd guess that a new "SoA block SSE" method, which uses the _mm_* intrinsics, would narrow the difference. I'll try to figure out how to do that.

I notice:

#include <mmintrin.h>

doesn't produce a compile error; however, that #include doesn't declare the _mm_add_ps used here:

https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L621

Do you know of some package I could install on my ubuntu OS that makes those SSE functions, such as _mm_add_ps, available?

[snip] -regards, Larry

On 10/25/2016 12:22 PM, Larry Evans wrote: [snip]
From the above, the LibFlatArray and SSE methods are the fastest. I'd guess that a new "SoA block SSE" method, which uses the _mm_* methods, would narrow the difference. I'll try to figure out how to do that. I notice:
#include <mmintrin.h>
doesn't produce a compile error; however, that #include doesn't have the _mm_add_ps used here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L621
Do you know of some package I could install on my ubuntu OS that makes those SSE functions, such as _mm_add_ps, available?
[snip] Never mind. Googling for:

__mm128

led to:

http://stackoverflow.com/questions/11679741/vector-of-mm128-wont-push-back

and changing the #include to:

#include <emmintrin.h>

solved the problem.

particle_count=1,024 frames=1,000
minimum duration=0.0371714
comparitive performance table:

  method    rel_duration
  ________  ______________
  SSE_opt   0.330574
  SSE       0.440405
  Flat      0.904265
  SoA       0.911574
  Block     0.97398
  AoS       1
  StdArray  1.15079
  LFA       undefined

Compilation finished at Tue Oct 25 12:36:33

On 10/25/2016 12:41 PM, Larry Evans wrote:
On 10/25/2016 12:22 PM, Larry Evans wrote: [snip]
From the above, the LibFlatArray and SSE methods are the fastest. I'd guess that a new "SoA block SSE" method, which uses the _mm_* methods, would narrow the difference. I'll try to figure out how to do that. I notice:
#include <mmintrin.h>
doesn't produce a compile error; however, that #include doesn't have the _mm_add_ps used here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L621
Do you know of some package I could install on my ubuntu OS that makes those SSE functions, such as _mm_add_ps, available?
[snip] Never mind. Googling for:
__mm128
led to:
http://stackoverflow.com/questions/11679741/vector-of-mm128-wont-push-back
and changing the #include to:
#include <emmintrin.h>
solved the problem.
particle_count=1,024 frames=1,000 minimum duration=0.0371714
comparitive performance table:
  method    rel_duration
  ________  ______________
  SSE_opt   0.330574
  SSE       0.440405
  Flat      0.904265
  SoA       0.911574
  Block     0.97398
  AoS       1
  StdArray  1.15079
  LFA       undefined
OOPS. Another careless copy&paste error. Output should be:

--{--cut here--
particle_count=1,000,000 frames=1,000
minimum duration=3.5909
comparitive performance table:

  method    rel_duration
  ________  ______________
  SSE_opt   1
  SSE       1.01568
  StdArray  1.44133
  Flat      1.44861
  Block     1.45053
  SoA       1.52935
  AoS       2.10294
  LFA       undefined

Compilation finished at Tue Oct 25 16:12:45
--}--cut here--

which clearly shows SSE_opt as fastest. -regards, Larry

On 10/25/2016 12:22 PM, Larry Evans wrote:
Hmmm. I didn't realize you'd have to run the benchmark several times to get stable results. I guess that reflects my ignorance of how benchmarks should be run.
The code was just a quick example hacked up to show the large differences between techniques. If you want to compare similar techniques you'll need a more robust benchmark. It would be easy to convert it to use https://github.com/google/benchmark, which is quite good.
Could you explain how running it a couple of times achieves stable results? (Actually, on some occasions I've run the benchmark and got completely unexpected results; I suspect it was because some application daemon was stealing cycles from the benchmark, leading to the unexpected results.)
Interestingly your SSE code is ~13% faster than the LibFlatArray code for large particle counts.
Actually, the SSE code was the OP's.
Actually it originates from: https://software.intel.com/en-us/articles/creating-a-particle-system-with-st...
From the above, the LibFlatArray and SSE methods are the fastest. I'd guess that a new "SoA block SSE" method, which uses the _mm_* methods, would narrow the difference. I'll try to figure out how to do that. I notice:
#include <mmintrin.h>
doesn't produce a compile error; however, that #include doesn't have the _mm_add_ps used here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L621
Do you know of some package I could install on my ubuntu OS that makes those SSE functions, such as _mm_add_ps, available?
[snip]
If you're using gcc I think the header <xmmintrin.h>

On 22:13 Tue 25 Oct , Michael Marcin wrote:
On 10/25/2016 12:22 PM, Larry Evans wrote:
Hmmm. I didn't realize you'd have to run the benchmark several times to get stable results. I guess that reflects my ignorance of how benchmarks should be run.
The code was just a quick example hacked up to show the large differences between techniques.
If you want to compare similar techniques you'll need a more robust benchmark.
It would be easy to convert it to use: https://github.com/google/benchmark
Which is quite good.
When doing performance measurements you have to take into account the most common sources of noise:

1. Other processes might eat up CPU time or memory bandwidth.
2. The OS might decide to move your benchmark from one core to another, so you're losing all L1+L2 cache entries. (Solution: thread pinning)
3. Thermal conditions and thermal inertia may affect if/when the CPU increases its clock speed. (Solution: either disable turbo mode, or run the benchmark long enough to even out the thermal fluctuations.)

AFAIK Google Benchmark doesn't do thread pinning and cannot affect turbo mode. LIKWID ( https://github.com/RRZE-HPC/likwid ) can be used to set clock frequencies and pin threads, and can read the performance counters of the CPU. Might be a good idea to use both Google Benchmark and LIKWID together.
Could you explain how running a couple of times achieves stable results (actually, on some occassions, I've run the benchmark and got results completely unexpected, I suspect it was because some application deamon was stealing cycles from the benchmark, leading to the unexpedted results).
Interestingly your SSE code is ~13% faster than the LibFlatArray code for large particle counts.
Actually, the SSE code was the OP's.
Actually it originates from:
https://software.intel.com/en-us/articles/creating-a-particle-system-with-st...
Ah, thanks for the info.
From the above, the LibFlatArray and SSE methods are the fastest. I'd guess that a new "SoA block SSE" method, which uses the _mm_* methods, would narrow the difference. I'll try to figure out how to do that. I notice:
#include <mmintrin.h>
doesn't produce a compile error; however, that #include doesn't have the _mm_add_ps used here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L621
Do you know of some package I could install on my ubuntu OS that makes those SSE functions, such as _mm_add_ps, available?
[snip]
If you're using gcc I think the header <xmmintrin.h>
The header should not depend on the compiler, but on the CPU model. Or rather: the vector ISA supported by the CPU:

http://stackoverflow.com/questions/11228855/header-files-for-simd-intrinsics

Cheers
-Andreas

--
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your signature to help him gain world domination!

On 12:22 Tue 25 Oct , Larry Evans wrote:
On 10/25/2016 01:41 AM, Andreas Schäfer wrote:
On 07:50 Fri 21 Oct , Larry Evans wrote:
I can't imagine how anything could be faster than the soa_emitter_static_t because it uses a tuple of std::array<T,particle_count>. I'd guess that the soa_emitter_block_t is only faster by luck (maybe during the soa_emitter_block_t run my machine was not as busy with other work).
I think the reason why the different implementation techniques are so close is that the particle model is memory bound (i.e. it's moving a lot of data while each particle update involves relatively few calculations).
The difference becomes larger if you're using only a few particles: then all particles sit in the upper levels of the cache and the CPU doesn't have to wait as much for the data. It would also be worthwhile to try a more complex particle model (e.g. by adding interaction between the particles). With increased computational intensity (floating point operations per byte moved) the delta of the different strategies should increase much more.
Thanks for the explanation. The latest version of the benchmark:
d6ee370606f7f167dedb93e174459c6c7c4d8c19
reports the relative difference of the times:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L823
Yeah, I saw that change when I merged upstream. TBH, I don't think this is helpful, as the relative difference adds noise from one measurement to all other measurements. It complicates comparison between multiple runs of the benchmark and prevents conversion into other metrics (e.g. GFLOPS).
So, based on what you say above, I guess when particle_count:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L135
increases to a point where the cache is overflowed, the relative differences between methods should show a sharp difference?
The difference between the method is reduced when more and more particles are being used as then the memory bandwidth becomes the limiting factor. The transition between "in cache" and "in memory" isn't sharp, but rather smooth as the L3 cache will still retain some data, even if the total data set is too large to fit into L3. If you vary the number of particles, you should be able to observe different performance levels based on the cache level the data set fits into (32kB for L1, 256kB for L2 (on Intel), some MB for L3).
I've added an implementation of the benchmark based on LibFlatArray's SoA containers and expression templates[1]. While working on the benchmark, I realized that the vector types ("short_vec") in LibFlatArray were lacking some desirable operations (e.g. masked move), so to reproduce my results you'll have to use the trunk from [2]. I'm very happy that you wrote this benchmark because it's a valuable test bed for performance, programmability, and functionality. Thanks!
You're welcome. Much of the credit goes to the OP, as acknowledged, indirectly, here:
Thanks. Sorry for the confusion. :-)
I'll have to take a look at the assembly to figure out why that is.
Oh, I bet that will be fun ;)
I hope so. Hope dies last. ;-)

Cheers
-Andreas

On 10/21/2016 01:07 AM, Michael Marcin wrote:
On 10/21/2016 12:48 AM, Michael Marcin wrote:
On 10/20/2016 10:02 PM, Larry Evans wrote:
The modification added soa_emitter_block_t which uses soa_block. Unfortunately, this soa_emitter_block_t takes about twice as long as your soa_emitter_static_t.
I've no idea why. Any guesses?
2x is quite an abstraction penalty. I can only assume your compiler is failing to optimize away some part of the abstraction.
FWIW on vs2015 I'm not seeing nearly as much of a difference.
particle_count=1,000,000
AoS in 6.34667 seconds
SoA in 4.26384 seconds
SoA flat in 4.16572 seconds
SoA Static in 5.4037 seconds
SoA block in 5.5588 seconds
I'm still trying to work out how to fit overaligned subarrays into your framework.
The issue is that many simd instructions require more than just alignof(T) alignment.
subarrays of float/double/int/short/char or carefully crafted udts might need to be aligned to as much as 64bytes in the worst case.
On the MIC architecture, vector load/store operations must be called on 64-byte aligned memory addresses. On the Xeon architecture with AVX/AVX2 instruction sets (Sandy Bridge, Ivy Bridge or Haswell), alignment does not matter. In earlier architectures (Nehalem, Westmere) alignment did matter, but a 32-byte alignment was necessary.
https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/507...
At the very least support for the basic SSE 16 byte alignment of subarrays is crucial.
My best idea so far is some magic wrapper type that gets special treatment. Like:

using data_t = soa_block< float3, soa_align<float,16>, bool >;
Something like:

template<typename T, std::size_t Alignment>
struct alignas(Alignment) soa_align {
    T data;
};

Have you tried that yet? If not, I might try. -regards, Larry

On 10/25/2016 8:23 PM, Larry Evans wrote:
At the very least support for the basic SSE 16 byte alignment of subarrays is crucial.
My best idea so far is some magic wrapper type that gets special treatment. Like: using data_t = soa_block< float3, soa_align<float,16>, bool >;
Something like:
template<typename T, std::size_t Alignment>
struct alignas(Alignment) soa_align {
    T data;
};
Have you tried that yet. If not, I might try.
The issue is you don't want to overalign all elements of the array, just the first element.

I have a working solution (using Peter Dimov's mp11 library, as I'm not well-versed in post-C++03 metaprogramming). I'm just trying to play around with implementation ideas at the moment. Basically it'd be nice to store only a single pointer and have cheap constant-time member sub-array access. But with alignment concerns all I've managed so far are two implementations:

1. one pointer with linear-time member array access
2. n pointers with constant-time member array access

I feel like there should exist an implementation that trades a bit of dynamic allocation size for a single pointer and constant-time member array access.

On 10/25/2016 11:07 PM, Michael Marcin wrote:
On 10/25/2016 8:23 PM, Larry Evans wrote:
At the very least support for the basic SSE 16 byte alignment of subarrays is crucial.
My best idea so far is some magic wrapper type that gets special treatment. Like: using data_t = soa_block< float3, soa_align<float,16>, bool >;
Something like:
template<typename T, std::size_t Alignment> struct alignas(Alignment) soa_align { T data; };
Have you tried that yet. If not, I might try.
The issue is you don't want to overalign all elements of the array, just the first element.
But aligning the first soa_align<T,A> is all that's needed because sizeof(soa_align<T,A>)%A == 0, hence, all subsequent elements would be aligned. At least that's my understanding. Am I missing something?
I have a working solution (using Peter Dimov's mp11 library as I'm not well-versed in post cpp03 metaprogramming).
I'm just trying to play around with implementation ideas at the moment.
Basically it'd be a nice to store only a single pointer and cheap constant time member sub-array access.
But with alignment concerns all I've managed so far are two implementations.
1. 1 pointer with linear time member array access 2. n-pointers with constant time member array access
I feel like there should exist implementation that trades a bit of dynamic allocation size for a single pointer and constant time member array access.
I intended soa_block to fill that need (after all the tasks shown in the **TODO** comments were done). If you see some flaw in the code, of course, I'd love to hear about it. -regards, Larry

On 10/26/2016 12:58 AM, Larry Evans wrote:
On 10/25/2016 11:07 PM, Michael Marcin wrote:
On 10/25/2016 8:23 PM, Larry Evans wrote:
At the very least support for the basic SSE 16 byte alignment of subarrays is crucial.
My best idea so far is some magic wrapper type that gets special treatment. Like: using data_t = soa_block< float3, soa_align<float,16>, bool >;
Something like:
template<typename T, std::size_t Alignment> struct alignas(Alignment) soa_align { T data; };
Have you tried that yet. If not, I might try.
The issue is you don't want to overalign all elements of the array, just the first element.
But aligning the first soa_align<T,A> is all that's needed because sizeof(soa_align<T,A>)%A == 0, hence, all subsequent elements would be aligned. At least that's my understanding. Am I missing something?
Perhaps I'm misunderstanding. Using your struct above:

std::array< soa_align<float, 16>, 4 > data;
std::cout
    << "align array:  " << alignof(decltype(data)) << '\n'
    << "size element: " << sizeof( data[0] ) << '\n'
    << "size array:   " << sizeof( data ) << '\n'
    << "offset[1]:    " << (char*)&(data[1]) - (char*)data.data() << '\n';

align array: 16
size element: 16
size array: 64
offset[1]: 16

For data to work with SSE instructions this needs to report:

align array: 16
size element: 4
size array: 16
offset[1]: 4

i.e. 4 floats have to be contiguous in memory, and the *first* float has to be aligned to 16 bytes.
I have a working solution (using Peter Dimov's mp11 library as I'm not well-versed in post cpp03 metaprogramming).
I'm just trying to play around with implementation ideas at the moment.
Basically it'd be a nice to store only a single pointer and cheap constant time member sub-array access.
But with alignment concerns all I've managed so far are two implementations.
1. 1 pointer with linear time member array access 2. n-pointers with constant time member array access
I feel like there should exist implementation that trades a bit of dynamic allocation size for a single pointer and constant time member array access.
I intended soa_block to fill that need (after all the tasks shown in the **TODO** comments were done). If you see some flaw in the code, of course, I'd love to hear about it.
IIRC it had implemented roughly the #2 strategy, storing a pointer + an array of n+1 offsets to access n members in constant time.

On 10/26/2016 02:27 AM, Michael Marcin wrote:
On 10/26/2016 12:58 AM, Larry Evans wrote:
On 10/25/2016 11:07 PM, Michael Marcin wrote:
On 10/25/2016 8:23 PM, Larry Evans wrote:
At the very least support for the basic SSE 16 byte alignment of subarrays is crucial.
My best idea so far is some magic wrapper type that gets special treatment. Like: using data_t = soa_block< float3, soa_align<float,16>, bool >;
Something like:
template<typename T, std::size_t Alignment> struct alignas(Alignment) soa_align { T data; };
Have you tried that yet. If not, I might try.
The issue is you don't want to overalign all elements of the array, just the first element.
But aligning the first soa_align<T,A> is all that's needed because sizeof(soa_align<T,A>)%A == 0, hence, all subsequent elements would be aligned. At least that's my understanding. Am I missing something?
Perhaps I'm misunderstanding. Using your struct above:

std::array< soa_align<float, 16>, 4 > data;
std::cout
    << "align array:  " << alignof(decltype(data)) << '\n'
    << "size element: " << sizeof( data[0] ) << '\n'
    << "size array:   " << sizeof( data ) << '\n'
    << "offset[1]:    " << (char*)&(data[1]) - (char*)data.data() << '\n';

align array: 16
size element: 16
size array: 64
offset[1]: 16

For data to work with SSE instructions this needs to report:

align array: 16
size element: 4
size array: 16
offset[1]: 4

i.e. 4 floats have to be contiguous in memory, and the *first* float has to be aligned to 16 bytes.
Ah! I had no clue :( Thanks for explaining. I should have read more about SSE. [snip]

On 10/26/2016 02:27 AM, Michael Marcin wrote: [snip]
Perhaps I'm misunderstanding. Using your struct above:

std::array< soa_align<float, 16>, 4 > data;
std::cout
    << "align array:  " << alignof(decltype(data)) << '\n'
    << "size element: " << sizeof( data[0] ) << '\n'
    << "size array:   " << sizeof( data ) << '\n'
    << "offset[1]:    " << (char*)&(data[1]) - (char*)data.data() << '\n';

align array: 16
size element: 16
size array: 64
offset[1]: 16

For data to work with SSE instructions this needs to report:

align array: 16
size element: 4
size array: 16
offset[1]: 4

i.e. 4 floats have to be contiguous in memory, and the *first* float has to be aligned to 16 bytes.
So why not:

alignas(16) std::array<float, 4> data;

IOW, does the decltype(data) have to have the required alignment, or does &data have to have that alignment?

On 10/26/2016 4:32 AM, Larry Evans wrote:
On 10/26/2016 02:27 AM, Michael Marcin wrote:
i.e. 4 floats have to be contiguous in memory, and the *first* float has to be aligned to 16 bytes.
So why not:
alignas(16) std::array<float, 4> data;
IOW, does the decltype(data) have to have the required alignment or does &data have to have that alignment?
All that matters is that the address of the first float is 16-byte aligned and the number of floats in your array is divisible by 4. Since SSE processes 4 floats at a time, the 2nd group of 4 floats is also 16-byte aligned (sizeof(float)*4 == 16). Note: different instruction sets/hardware support different data types/alignments.

This is why all the particle_counts I used in the emitter example were multiples of 4 (and a multiple of 64 in the tests that use the bit_vector, which packs 64 bools into a uint64_t).

SSE2 has instructions to operate on:
- 2 double
- 2 int64_t
- 4 float
- 4 int32_t
- 8 short
- 16 char

All require the pointer to the data to be 16-byte aligned, and all operate on 16 bytes, such that you can operate on successive runs of data in an appropriately aligned array.

If you don't know that your particle_count is a multiple of 4 you need to write more code. For example, for an array of 39 floats you can either pad it out to 40 floats to use SSE on the whole thing, or use SSE on the first 36 floats (36/4 = 9 iterations) and have a non-vectorized implementation of the same algorithm at the end to handle the last 3 floats.

If you don't know the alignment of your data, this technique also applies to the beginning of the array. You can use the non-vectorized algorithm to process the first 0-3 floats until you reach a 16-byte alignment, then process all 16-byte-aligned groups of 4 floats, and then return to the non-vectorized implementation for the 0-3 floats at the end of the array. This is pretty much what compilers do when they vectorize a loop.

alignas(16) std::array<float, 4> data;

does work, although it doesn't much help for the soa_block implementation. Indeed, using alignof(decltype(data)) in my snippet is a little misleading, but I don't know how to query the alignment of an object rather than a type.

The SSE emitter test used an aligned_allocator to guarantee 16-byte alignment for the std::vector data:
template< typename T > using sse_vector = vector<T, aligned_allocator<T,16> >;

On 10/26/2016 10:17 AM, degski wrote:
On 26 October 2016 at 18:00, Michael Marcin <mike.marcin@gmail.com> wrote:
But I don't know how to query the alignment of an object rather than a type.
Shouldn't ( & object [ 0 ] ) % 16 == 0 do the trick?
Or use: boost::alignment::detail::is_aligned(&object[0], 16). However, that code uses:

boost::alignment::is_aligned( std::size_t value, std::size_t alignment )
{
    return (value & (alignment - 1)) == 0;
}

which is more obscure than:

return (value % alignment) == 0;

But I guess there's some reason for using the more obscure code.

On 26 October 2016 at 18:41, Larry Evans <cppljevans@suddenlink.net> wrote:
But I guess there's some reason for using the more obscure code.
I guess some optimisation from way yonder (something modern compilers do routinely, even on a Monday morning!)... but more than probably irrelevant nowadays... degski

I guess some optimisation from way yonder (something modern compilers do routinely, even on a Monday morning!)... but more than probable irrelevant nowadays...
degski
I might be pessimistic, but I never trust the compiler and generally check what's being output. In this case, FWIW, on MSVC2015, the bit-twiddling version generates faster code than the mod version -- about 25% faster. I didn't test gcc or clang.

Using google benchmark:

Code:

static void AlignedMod(benchmark::State& state) {
    while (state.KeepRunning()) {
        for(int i = state.range_x(); i < 128; i += state.range_y()) {
            bool aligned = (i % 16) == 0;
            benchmark::DoNotOptimize(aligned);
        }
    }
}
BENCHMARK(AlignedMod)->ArgPair(1, 1);

static void AlignedAnd(benchmark::State& state) {
    while (state.KeepRunning()) {
        for(int i = state.range_x(); i < 128; i += state.range_y()) {
            bool aligned = ((i - 1) & 15) == 0;
            benchmark::DoNotOptimize(aligned);
        }
    }
}
BENCHMARK(AlignedAnd)->ArgPair(1, 1);

Generated code of the inner loop:

Mod version:
mov eax,ebx
and eax,8000000Fh
jge AlignedMod+50h
dec eax
or eax,0FFFFFFF0h
inc eax
test eax,eax
lea rcx,[aligned]
sete byte ptr [aligned]
call 07FF73B84A180h
add ebx,dword ptr [rdi+1Ch]
cmp ebx,80h
jl AlignedMod+40h

And version:
lea eax,[rbx-1]
test al,0Fh
lea rcx,[aligned]
sete byte ptr [aligned]
call 07FF73B84A180h
add ebx,dword ptr [rdi+1Ch]
cmp ebx,80h
jl AlignedAnd+40h

Result:

Benchmark        Time    CPU     Iterations
-------------------------------------------
AlignedMod/1/1   204 ns  203 ns  4072727
AlignedAnd/1/1   153 ns  154 ns  4977778

-- chris

On 26 October 2016 at 21:33, Chris Glover <c.d.glover@gmail.com> wrote:
In this case, FWIW, on MSVC2015, the bit-twiddling version generates faster code than the mod version -- about 25% faster. I didn't test gcc or clang.
How often are you going to check this? I would do this only in debug.

On 27 October 2016 at 06:58, Michael Marcin <mike.marcin@gmail.com> wrote:
There's quite a large difference between "this object happens to land on a 16 byte alignment" and "what is the alignment of this object?" or even "is this object alignas(16)".
Yes, resorting to some bit-fiddling then:

size_t whatIsTheAlignmentOfThisPointer ( const void* p_ ) {
    return ( size_t ) ( ( uintptr_t ) p_ & ( uintptr_t ) -( ( intptr_t ) p_ ) );
}

Yes, it assumes two's complement, and yes, it will therefore possibly fail [running it] on your average IoT-water-cooker. degski

On 10/27/2016 12:51 AM, degski wrote:
Yes, resorting to some bit-fiddling then:
size_t whatIsTheAlignmentOfThisPointer ( const void* p_ ) {
    return ( size_t ) ( ( uintptr_t ) p_ & ( uintptr_t ) -( ( intptr_t ) p_ ) );
}
Again, where the object happens to live is *not* necessarily its required alignment. Take:

char a;
alignas(16) char b;

char a (1-byte alignment) may reside on a 16-byte boundary, but that doesn't mean the char is being forced to a 16-byte alignment. char b *is* required to be aligned to a 16-byte boundary.

Anyways, this is all a distraction from the issues at hand. Can we get back to on-topic discussions?

On 10/26/2016 1:33 PM, Chris Glover wrote:
I guess some optimisation from way yonder (something modern compilers do routinely, even on a Monday morning!)... but more than probable irrelevant nowadays...
degski
I might be pessimistic, but I never trust the compiler and generally check what's being output. In this case, FWIW, on MSVC2015, the bit-twiddling version generates faster code than the mod version -- about 25% faster. I didn't test gcc or clang.
Using google benchmark:
Code:
static void AlignedMod(benchmark::State& state) {
    while (state.KeepRunning()) {
        for(int i = state.range_x(); i < 128; i += state.range_y()) {
            bool aligned = (i % 16) == 0;
            benchmark::DoNotOptimize(aligned);
        }
    }
}
BENCHMARK(AlignedMod)->ArgPair(1, 1);

static void AlignedAnd(benchmark::State& state) {
    while (state.KeepRunning()) {
        for(int i = state.range_x(); i < 128; i += state.range_y()) {
            bool aligned = ((i - 1) & 15) == 0;
            benchmark::DoNotOptimize(aligned);
        }
    }
}
BENCHMARK(AlignedAnd)->ArgPair(1, 1);

Generated code of the inner loop:

Mod version:
mov eax,ebx
and eax,8000000Fh
jge AlignedMod+50h
dec eax
or eax,0FFFFFFF0h
inc eax
test eax,eax
lea rcx,[aligned]
sete byte ptr [aligned]
call 07FF73B84A180h
add ebx,dword ptr [rdi+1Ch]
cmp ebx,80h
jl AlignedMod+40h

And version:
lea eax,[rbx-1]
test al,0Fh
lea rcx,[aligned]
sete byte ptr [aligned]
call 07FF73B84A180h
add ebx,dword ptr [rdi+1Ch]
cmp ebx,80h
jl AlignedAnd+40h

Result:

Benchmark        Time    CPU     Iterations
-------------------------------------------
AlignedMod/1/1   204 ns  203 ns  4072727
AlignedAnd/1/1   153 ns  154 ns  4977778
It's very dangerous to draw conclusions from benchmarks like this. The operation and elapsed time are way too small for the results to be distinguishable from noise on the system.

AlignedAnd has a random '- 1' in it, so it's not even checking whether i is 16-byte aligned.

AlignedMod is taking the mod of a signed integer. Make it use an unsigned integer and some very familiar code gets generated:

    for ( unsigned i = state.range_x(); i < 128; i += state.range_y() )
00007FF7ACBA316C  mov ebx,dword ptr [rdi+14h]
00007FF7ACBA316F  cmp ebx,80h
00007FF7ACBA3175  jae AlignedMod+12h (07FF7ACBA3132h)
00007FF7ACBA3177  nop word ptr [rax+rax]
    { bool aligned = (i % 16) == 0;
00007FF7ACBA3180  test bl,0Fh
      benchmark::DoNotOptimize( aligned );
00007FF7ACBA3183  lea rcx,[aligned]
    { bool aligned = (i % 16) == 0;
00007FF7ACBA3188  sete byte ptr [aligned]
      benchmark::DoNotOptimize( aligned );
00007FF7ACBA318D  call benchmark::internal::UseCharPointer (07FF7ACBA1019h)
    for ( unsigned i = state.range_x(); i < 128; i += state.range_y() )
00007FF7ACBA3192  add ebx,dword ptr [rdi+1Ch]
00007FF7ACBA3195  cmp ebx,80h
00007FF7ACBA319B  jb AlignedMod+60h (07FF7ACBA3180h)
    } }
00007FF7ACBA319D  jmp AlignedMod+12h (07FF7ACBA3132h)
}
00007FF7ACBA319F  mov rbx,qword ptr [rsp+38h]
00007FF7ACBA31A4  mov rsi,qword ptr [rsp+40h]
00007FF7ACBA31A9  add rsp,20h
00007FF7ACBA31AD  pop rdi
00007FF7ACBA31AE  ret

I don't agree that it is more obscure. I think it's perfectly natural to understand that the bottom 4 bits must be zero in order to be a multiple of 16.

I recommend to everyone the book Hacker's Delight, which will make you truly appreciate how wonderful integers are.

On 10/26/2016 10:17 AM, degski wrote:
On 26 October 2016 at 18:00, Michael Marcin <mike.marcin@gmail.com> wrote:
But I don't know how to query the alignment of an object rather than a type.
Shouldn't ( & object [ 0 ] ) % 16 == 0 do the trick?
There's quite a large difference between "this object happens to land on a 16 byte alignment" and "what is the alignment of this object?" or even "is this object alignas(16)".

On 10/26/2016 10:00 AM, Michael Marcin wrote: [snip]
The sse emitter test used an aligned_allocator to guarantee 16 byte alignment for the std::vector data.
template< typename T > using sse_vector = vector<T, aligned_allocator<T,16> >;
I assume that vector<T>'s data is allocated with new, and new, IIUC, guarantees maximum alignment; hence, boost::alignment::is_aligned(std::vector<T>::data(), 16) should always be true. What am I missing?

BTW, see the new push:

https://github.com/cppljevans/soa/commit/ea28ff814c8c1ea879fa5ac2453c12fc382...

I've *not tested* it in:

https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp

but I think it should work. IOW, instead of:

sse_vector<float> position_x;
sse_vector<float> position_y;
sse_vector<float> position_z;
sse_vector<float> velocity_x;
sse_vector<float> velocity_y;
sse_vector<float> velocity_z;
sse_vector<float> acceleration_x;
sse_vector<float> acceleration_y;
sse_vector<float> acceleration_z;
vector<float2> size;
vector<float4> color;
sse_vector<float> energy;
vector<char> alive;

I think you could use:

soa_block
< type_align<float,16>, // position_x
  type_align<float,16>, // position_y
  type_align<float,16>, // position_z
  type_align<float,16>, // velocity_x
  type_align<float,16>, // velocity_y
  type_align<float,16>, // velocity_z
  type_align<float,16>, // acceleration_x
  type_align<float,16>, // acceleration_y
  type_align<float,16>, // acceleration_z
  float2_t,             // size
  float4_t,             // color
  type_align<float,16>, // energy
  char                  // alive
> particles;

where type_align is found here:

https://github.com/cppljevans/soa/blob/master/vec_offsets.hpp#L14

HTH. -regards, Larry.

On 10/28/2016 9:09 AM, Larry Evans wrote:
On 10/26/2016 10:00 AM, Michael Marcin wrote: [snip]
The sse emitter test used an aligned_allocator to guarantee 16 byte alignment for the std::vector data.
template< typename T > using sse_vector = vector<T, aligned_allocator<T,16> >;
I assume that vector<T>'s data is allocated with new, and new, IIUC, guarantees maximum alignment; hence, boost::alignment::is_aligned(std::vector<T>::data(),16) should always be true. What am I missing?
Yes, operator new isn't going to guarantee a 16 byte alignment on any platform I'm aware of.

From n4606:

§ 3.7.4.1/2
The pointer returned shall be suitably aligned so that it can be converted to a pointer to any suitable complete object type (18.6.2.1)

§ 18.6.2.1/1
Effects: The allocation functions (3.7.4.1) called by a new-expression (5.3.4) to allocate size bytes of storage. The second form is called for a type with new-extended alignment, and allocates storage with the specified alignment. The first form is called otherwise, and allocates storage suitably aligned to represent any object of that size provided the object's type does not have new-extended alignment.

§ 3.11/3
An extended alignment is represented by an alignment greater than alignof(std::max_align_t). It is implementation-defined whether any extended alignments are supported and the contexts in which they are supported (7.6.2). A type having an extended alignment requirement is an over-aligned type. [ Note: every over-aligned type is or contains a class type to which extended alignment applies (possibly through a non-static data member). — end note ]

A new-extended alignment is represented by an alignment greater than __STDCPP_DEFAULT_NEW_ALIGNMENT__ (16.8).

§ 16.8/2
__STDCPP_DEFAULT_NEW_ALIGNMENT__
An integer literal of type std::size_t whose value is the alignment guaranteed by a call to operator new(std::size_t) or operator new[](std::size_t). [ Note: Larger alignments will be passed to operator new(std::size_t, std::align_val_t), etc. (5.3.4). — end note ]

This has changed somewhat since C++11 ISO 14882, but the gist of it is the same in the older standard.
§ 3.7.4.1/2
The pointer returned shall be suitably aligned so that it can be converted to a pointer of any complete object type with a fundamental alignment requirement (3.11)

§ 3.11/2
A fundamental alignment is represented by an alignment less than or equal to the greatest alignment supported by the implementation in all contexts, which is equal to alignof(std::max_align_t) (18.2). The alignment required for a type might be different when it is used as the type of a complete object and when it is used as the type of a subobject. [ Example:

struct B { long double d; };
struct D : virtual B { char c; };

Basically, new guarantees the returned pointer is aligned to alignof(max_align_t).

On VS2015 max_align_t is double and alignof(max_align_t) == 8, which is not enough to guarantee 16 byte alignment.

std::aligned_alloc was added for C++17, which solves this problem; boost has boost::alignment::aligned_alloc, which also solves the problem pre-C++17.
BTW, see the new push:
https://github.com/cppljevans/soa/commit/ea28ff814c8c1ea879fa5ac2453c12fc382...
I've *not tested* it in:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp
but I think it should work. IOW, instead of:
sse_vector<float> position_x;
sse_vector<float> position_y;
sse_vector<float> position_z;
sse_vector<float> velocity_x;
sse_vector<float> velocity_y;
sse_vector<float> velocity_z;
sse_vector<float> acceleration_x;
sse_vector<float> acceleration_y;
sse_vector<float> acceleration_z;
vector<float2> size;
vector<float4> color;
sse_vector<float> energy;
vector<char> alive;
I think you could use:
soa_block
< type_align<float,16>, // position_x
  type_align<float,16>, // position_y
  type_align<float,16>, // position_z
  type_align<float,16>, // velocity_x
  type_align<float,16>, // velocity_y
  type_align<float,16>, // velocity_z
  type_align<float,16>, // acceleration_x
  type_align<float,16>, // acceleration_y
  type_align<float,16>, // acceleration_z
  float2_t,             // size
  float4_t,             // color
  type_align<float,16>, // energy
  char                  // alive
> particles;
where type_align is found here:
https://github.com/cppljevans/soa/blob/master/vec_offsets.hpp#L14
That's quite similar to what I have in one of my potential implementations.

Michael, the latest revision of the code, when run, aborts when running the SSEopt_vec method, as detailed here:

https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L953

Would you have some idea what's causing that?

On 10/28/2016 09:40 PM, Michael Marcin wrote:
On 10/28/2016 9:09 AM, Larry Evans wrote:
On 10/26/2016 10:00 AM, Michael Marcin wrote: [snip]
The sse emitter test used an aligned_allocator to guarantee 16 byte alignment for the std::vector data.
template< typename T > using sse_vector = vector<T, aligned_allocator<T,16> >;
I assume that vector<T>'s data is allocated with new, and new, IIUC, guarantees maximum alignment; hence, boost::alignment::is_aligned(std::vector<T>::data(),16) should always be true. What am I missing?
Yes operator new isn't going to guarantee a 16 byte alignment on any platform I'm aware of.
From n4606 [snip] Detailed reference to standard Docs correcting my assumptions.
Thanks for the [snipped] detailed reference to standard Docs correcting my assumptions :) It must have taken a lot of detailed reading!
BTW, see the new push:
https://github.com/cppljevans/soa/commit/ea28ff814c8c1ea879fa5ac2453c12fc382...
I've *not tested* it in:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp
but I think it should work.
Since that post, I've actually tested with the benchmark, and it ran. That run was done with the previous revision, the one with:

https://github.com/cppljevans/soa/commit/02791e47080bf51ac3a467ed89e7cb91abf...

Strangely, that one ran the SSEopt_vec method OK, in contrast to the current SSEopt_vec method.
IOW, instead of:
sse_vector<float> position_x;
sse_vector<float> position_y;
sse_vector<float> position_z;
sse_vector<float> velocity_x;
sse_vector<float> velocity_y;
sse_vector<float> velocity_z;
sse_vector<float> acceleration_x;
sse_vector<float> acceleration_y;
sse_vector<float> acceleration_z;
vector<float2> size;
vector<float4> color;
sse_vector<float> energy;
vector<char> alive;
I think you could use:
soa_block
< type_align<float,16>, // position_x
  type_align<float,16>, // position_y
  type_align<float,16>, // position_z
  type_align<float,16>, // velocity_x
  type_align<float,16>, // velocity_y
  type_align<float,16>, // velocity_z
  type_align<float,16>, // acceleration_x
  type_align<float,16>, // acceleration_y
  type_align<float,16>, // acceleration_z
  float2_t,             // size
  float4_t,             // color
  type_align<float,16>, // energy
  char                  // alive
> particles;
where type_align is found here:
https://github.com/cppljevans/soa/blob/master/vec_offsets.hpp#L14
That's quite similar to what I have in one of my potential implementations.
Would you post that somewhere? I'd be curious about how it differs. -regards, Larry

On 10/30/2016 07:45 AM, Larry Evans wrote:
Michael, the latest revision of the code, when run, aborts when running the SSEopt_vec method, as detailed here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L953
Would you have some idea what's causing that?
[snip] After rebooting, the SSEopt_vec method now runs without problem. I've no idea what was causing errors before the reboot :( -Larry

On 10/30/2016 10:40 AM, Larry Evans wrote:
On 10/30/2016 07:45 AM, Larry Evans wrote:
Michael, the latest revision of the code, when run, aborts when running the SSEopt_vec method, as detailed here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L953
Would you have some idea what's causing that?
[snip] After rebooting, the SSEopt_vec method now runs without problem. I've no idea what was causing errors before the reboot :(
I suspect the problem was a wrong bit_vector definition. Hopefully the latest push will solve the problem.

On 10/30/2016 12:29 PM, Larry Evans wrote:
On 10/30/2016 10:40 AM, Larry Evans wrote:
On 10/30/2016 07:45 AM, Larry Evans wrote:
Michael, the latest revision of the code, when run, aborts when running the SSEopt_vec method, as detailed here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L953
Would you have some idea what's causing that?
[snip] After rebooting, the SSEopt_vec method now runs without problem. I've no idea what was causing errors before the reboot :(
I suspect the problem was a wrong bit_vector definition. Hopefully the latest push will solve the problem.
I was going to suggest that bit_vector might be the problem. I'm not sure where the alternative implementation came from, but as a minimum it needs to be something like this:

#ifdef HAVE_GOON_BIT_VECTOR
using goon::bit_vector;
#else
struct bit_vector
{
    void resize( size_t num_bits, bool value = false )
    {
        size_t num_blocks = (num_bits + 63) / 64;
        constexpr uint64_t block_none = 0;
        constexpr uint64_t block_all = ~block_none;
        data_m.resize( num_blocks, value ? block_all : block_none );
        num_bits_m = num_bits;
    }
    uint64_t * data() { return data_m.data(); }
    std::size_t size() const { return num_bits_m; }

    std::vector<uint64_t> data_m;
    std::size_t num_bits_m;
};
#endif

On 10/30/2016 7:45 AM, Larry Evans wrote:
Would you post that somewhere? I'd be curious about how it differs.
My code isn't very complete, but since everyone else is sharing I'll post what I've got. FWIW I had to go back pretty far, and then still make changes, to get a version of your soa_compare.benchmark.cpp that compiled on Windows with VS2015.

particle_count=1,000,000
minimum duration=2.85542

comparative performance table:

method     rel_duration
________   ______________
Aos        3.08266
SoA        1.51109
Flat       1.50358
StdArray   1.91317
Block      1.60447
SSE        1.35994
SSE_opt    1.18056
SSE_goon   1

code: http://codepad.org/DECRpJrO
test: http://codepad.org/IbdVcdq8

On 10/31/2016 03:09 AM, Michael Marcin wrote:
On 10/30/2016 7:45 AM, Larry Evans wrote:
Would you post that somewhere? I'd be curious about how it differs.
My code isn't very complete but since everyone else is sharing I'll post what I've got.
FWIW I had to go back pretty far and then still make changes to get a version of your soa_compare.benchmark.cpp that compiled on windows VS2015.
particle_count=1,000,000
minimum duration=2.85542
comparative performance table:
method     rel_duration
________   ______________
Aos        3.08266
SoA        1.51109
Flat       1.50358
StdArray   1.91317
Block      1.60447
SSE        1.35994
SSE_opt    1.18056
SSE_goon   1
Thanks Michael. I found it interesting.

However, I was still getting the 'double free' error message; hence, I tried valgrind. It showed a problem in the alive update loop. When the code was changed to:

uint64_t *block_ptr = alive.data();
auto e_ptr = energy.data();
for ( size_t i = 0; i < n; )
{
#define REVISED_CODE
#ifdef REVISED_CODE
    auto e_i = e_ptr + i;
#endif
    uint64_t block = 0;
    do
    {
#ifndef REVISED_CODE
        //this code causes valgrind to show errors.
        auto e_i = e_ptr + i;
#endif
        _mm_store_ps( e_i, _mm_sub_ps( _mm_load_ps( e_i ), t ));
        block |= uint64_t( _mm_movemask_ps( _mm_cmple_ps( _mm_load_ps( e_i ), zero ))) << (i % bits_per_uint64_t);
        i += 4;
    } while ( i % bits_per_uint64_t != 0 );
    *block_ptr++ = block;
}

valgrind reported no errors; however, when !defined(REVISED_CODE), valgrind reported:

valgrind --tool=memcheck /tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
==7937== Memcheck, a memory error detector
==7937== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==7937== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==7937== Command: /tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
==7937==
COMPILE_OPTIM=0
particle_count=1,000
frames=1,000
{run_test=SSEopt_vec
==7937== Invalid read of size 16
==7937==    at 0x403D6B: emitter_t<(method_enum)6>::update() (soa_compare.benchmark.cpp:962)
==7937==    by 0x403094: run_result_t run_test<emitter_t<(method_enum)6> >(unsigned long, unsigned long) (soa_compare.benchmark...)

My soa_compare.benchmark.cpp:962 line is:

_mm_store_ps( e_i, _mm_sub_ps( _mm_load_ps( e_i ), t ));

-regards, Larry

On 10/31/2016 9:14 AM, Larry Evans wrote:
However, I was still getting the 'double free' error message; hence, I tried val_grind. It showed a problem in the alive update loop. When the code was changed to:
[snip code]
valgrind reported no errors; however, when !defined(REVISED_CODE), valgrind reported:
valgrind --tool=memcheck /tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
[snip valgrind banner]
==7937== COMPILE_OPTIM=0 particle_count=1,000
particle_count=1,000 is not a multiple of 64; the optimized energy/alive loop processes 64 particles at a time. I haven't bothered to analyze what the code will do in this case, but memory corruption is likely.

The code to handle a tail (if particle_count % 64 != 0) isn't difficult to add, but it is explicitly left out. One of the things you'll often do in a system such as this is fit the data to optimize the algorithm. In the case of a particle system, plus or minus 0 to 63 particles is generally unnoticeable.

You can address the problem however you like, but the simplest solution would be to change your small particle count to 16 * 64 = 1024.

On 10/31/2016 12:00 PM, Michael Marcin wrote:
On 10/31/2016 9:14 AM, Larry Evans wrote:
However, I was still getting the 'double free' error message; hence, I tried val_grind. It showed a problem in the alive update loop. When the code was changed to:
[snip code]
valgrind reported no errors; however, when !defined(REVISED_CODE), valgrind reported:
valgrind --tool=memcheck /tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
[snip valgrind banner]
==7937== COMPILE_OPTIM=0 particle_count=1,000
particle_count=1,000 is not a multiple of 64; the optimized energy/alive loop processes 64 particles at a time. I haven't bothered to analyze what the code will do in this case, but memory corruption is likely.
I see. However, still, since the calls to the _mm_* functions in the previous loop all are called with i%4==0 (because the i increment is i+=4) here: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L934 shouldn't the same apply to the alive loop call here: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L956 and putting the e_i assignment outside the alive loop here: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L953 assures that.
The code to handle a tail (if particle_count % 64 != 0) isn't difficult to add but it is explicitly left out. One of the things you'll often do in a system such as this is fit the data to optimize the algorithm. In the case of a particle system plus or minus 0 to 63 particles is generally unnoticeable.
You can address the problem however you like but the simplest solution would be to change your small particle count to 16 * 64 = 1024.
OK. I've done that.

On 10/31/2016 12:51 PM, Larry Evans wrote:
On 10/31/2016 12:00 PM, Michael Marcin wrote:
On 10/31/2016 9:14 AM, Larry Evans wrote:
[snip] particle_count=1,000 is not a multiple of 64; the optimized energy/alive loop processes 64 particles at a time. I haven't bothered to analyze what the code will do in this case, but memory corruption is likely.
I see. However, still, since the calls to the _mm_* functions in the previous loop all are called with i%4==0 (because the i increment is i+=4) here:
[snip] OOPS. Never mind. The e_i is at increments of 4. Sorry for the noise.

On 10/31/2016 12:00 PM, Michael Marcin wrote:
On 10/31/2016 9:14 AM, Larry Evans wrote:
However, I was still getting the 'double free' error message; hence, I tried val_grind. It showed a problem in the alive update loop. When the code was changed to:
[snip code]
valgrind reported no errors; however, when !defined(REVISED_CODE), valgrind reported:
valgrind --tool=memcheck /tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
[snip valgrind banner]
==7937== COMPILE_OPTIM=0 particle_count=1,000
particle_count=1,000 is not a multiple of 64; the optimized energy/alive loop processes 64 particles at a time. I haven't bothered to analyze what the code will do in this case, but memory corruption is likely.
The code to handle a tail (if particle_count % 64 != 0) isn't difficult to add but it is explicitly left out. One of the things you'll often do in a system such as this is fit the data to optimize the algorithm. In the case of a particle system plus or minus 0 to 63 particles is generally unnoticeable.
You can address the problem however you like but the simplest solution would be to change your small particle count to 16 * 64 = 1024.
IIUC, the 64 constraint is because bit_vector stores the bits in 64-bit increments (or bitsof(uint64_t), where bitsof is defined here:

https://github.com/cppljevans/soa/blob/master/bitsof.hpp

However, with the following SSE_opt alive update loop:

https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L961

I think the constraint could be relaxed to a multiple of sse_width(=4). That's because the additional test:

  i_array < n_array

here:

https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L978

prevents:

  auto e_i = e_ptr + i_array;

here:

https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L985

from accessing outside of the array. Does that make sense?

-regards, Larry

On 11/6/2016 10:36 AM, Larry Evans wrote:
On 10/31/2016 12:00 PM, Michael Marcin wrote:
On 10/31/2016 9:14 AM, Larry Evans wrote:
However, I was still getting the 'double free' error message; hence, I tried val_grind. It showed a problem in the alive update loop. When the code was changed to:
[snip code]
valgrind reported no errors; however, when !defined(REVISED_CODE), valgrind reported:
valgrind --tool=memcheck /tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
[snip valgrind banner]
==7937== COMPILE_OPTIM=0 particle_count=1,000
particle_count=1,000 is not a multiple of 64; the optimized energy/alive loop processes 64 particles at a time. I haven't bothered to analyze what the code will do in this case, but memory corruption is likely.
The code to handle a tail (if particle_count % 64 != 0) isn't difficult to add but it is explicitly left out. One of the things you'll often do in a system such as this is fit the data to optimize the algorithm. In the case of a particle system plus or minus 0 to 63 particles is generally unnoticeable.
You can address the problem however you like but the simplest solution would be to change your small particle count to 16 * 64 = 1024.
IIUC, the 64 constraint is because bit_vector stores the bits in 64-bit increments (or bitsof(uint64_t), where bitsof is defined here:
https://github.com/cppljevans/soa/blob/master/bitsof.hpp
However, with the following SSE_opt alive update loop:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L961
I think the constraint could be relaxed to a multiple of sse_width(=4). That's because the additional test:
i_array < n_array
here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L978
prevents:
auto e_i = e_ptr + i_array;
here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L985
from accessing outside of the array. Does that make sense?
The general pattern is to handle full blocks and then separately handle the final partial block. This minimizes the amount of work the code must do per iteration.

This code handles particle counts that are a multiple of 4:

http://codepad.org/iRwgxAkW

Here's an alternative implementation that benchmarks slower on my machine but is a bit easier to follow IMO:

http://codepad.org/DjqVHac8

P.S. There was a bug in the original SSE opt that wouldn't affect perf testing, but _mm_cmple_ps should have been _mm_cmpgt_ps.

On 15 October 2016 at 04:43, Michael Marcin <mike.marcin@gmail.com> wrote:
Does this already exist somewhere?
I don't know, but when I considered doing this without reflection I stopped, because it made the user code too complex or too specific.
Does it seem useful?
Yes, add +1 to the interest increment. Although I don't see how it can be done in a way that makes it easy for the user. Joël Lamotte
participants (10)
-
Andreas Schäfer
-
Chris Glover
-
degski
-
Gavin Lambert
-
Joaquin M López Muñoz
-
Klaim - Joël Lamotte
-
Larry Evans
-
Michael Marcin
-
Oswin Krause
-
Roger Martin