Re: [boost] interest in structure of arrays container?

26 Oct 2016

      On 10/26/2016 4:32 AM, Larry Evans wrote:
...
On 10/26/2016 02:27 AM, Michael Marcin wrote:
...
i.e. 4 floats have to be contiguous in memory, and the *first* float has
to be aligned to 16 bytes.
So why not:
alignas(16) std::array<float, 4> data;
IOW, does the decltype(data) have to have the required alignment
or does &data have to have that alignment?
All that matters is that the address of the first float is 16 byte 
aligned and the number of floats in your array is divisible by 4. Since 
SSE processes 4 floats at a time, the 2nd group of 4 floats is then also 
16 byte aligned (sizeof(float)*4 == 16). Note: different instruction 
sets/hardware support different data types/alignments.

This is why all the particle_count values I used in the emitter example 
were multiples of 4 (and a multiple of 64 in the tests that use the 
bit_vector, which packs 64 bools into a uint64_t).

SSE2 has instructions to operate on
- 2 double
- 2 int64_t
- 4 float
- 4 int32_t
- 8 short
- 16 char

The aligned load/store instructions for all of these require the pointer 
to the data to be 16 byte aligned (unaligned variants such as 
_mm_loadu_ps exist as well), and each group is 16 bytes in size, so you 
can operate on successive runs of data in an appropriately aligned array.
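
The element counts in the list above follow from dividing the 16 byte register width by the element size. A minimal sketch of that arithmetic, where sse_register_bytes and sse_lanes are hypothetical names introduced only for illustration and the fixed-width integer types stand in for short and int:

```cpp
#include <cstddef>
#include <cstdint>

// Width of one SSE register in bytes.
constexpr std::size_t sse_register_bytes = 16;

// Hypothetical helper: how many elements of type T fit in one 16 byte register.
template <typename T>
constexpr std::size_t sse_lanes() {
    static_assert(sse_register_bytes % sizeof(T) == 0,
                  "element size must divide the register width");
    return sse_register_bytes / sizeof(T);
}

// These hold on mainstream platforms where double is 8 bytes and float is 4.
static_assert(sse_lanes<double>() == 2, "2 double");
static_assert(sse_lanes<std::int64_t>() == 2, "2 int64_t");
static_assert(sse_lanes<float>() == 4, "4 float");
static_assert(sse_lanes<std::int32_t>() == 4, "4 int32_t");
static_assert(sse_lanes<std::int16_t>() == 8, "8 short");
static_assert(sse_lanes<char>() == 16, "16 char");
```

Because every group is exactly 16 bytes, group i starts at byte offset 16*i, so each group is aligned whenever the first one is.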

If you don't know your particle_count is a multiple of 4 you need to 
write more code.

For example, to operate on an array of 39 floats you can either pad 
that out to 40 floats and use SSE on the whole thing, or you can use SSE 
on the first 36 floats (36/4 = 9 iterations) and have a non-vectorized 
implementation of the same algorithm at the end to handle the last 3 floats.
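
Both options for the 39 float case can be sketched as follows. This is only an illustration, not the emitter code: padded_count and scale are hypothetical names, the "algorithm" is a plain multiply, the data pointer is assumed to be 16 byte aligned already, and the intrinsics require an x86/x86-64 target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics (x86/x86-64 only)

// Option 1 (hypothetical helper): round the count up to a multiple of 4,
// e.g. 39 -> 40, so the whole padded array can be processed with SSE.
inline std::size_t padded_count(std::size_t count) {
    return (count + 3) / 4 * 4;
}

// Option 2 (hypothetical example): multiply every element by factor using
// SSE for the full groups of 4 and a scalar loop for the leftovers.
// Assumes data is 16 byte aligned; count may be any value.
inline void scale(float* data, std::size_t count, float factor) {
    assert(reinterpret_cast<std::uintptr_t>(data) % 16 == 0);

    const std::size_t vec_count = count - count % 4;  // 39 -> 36
    const __m128 f = _mm_set1_ps(factor);

    // Vectorized part: 36 / 4 = 9 iterations for count == 39.
    for (std::size_t i = 0; i < vec_count; i += 4) {
        const __m128 v = _mm_load_ps(data + i);  // aligned load
        _mm_store_ps(data + i, _mm_mul_ps(v, f));
    }

    // Non-vectorized tail: the last 0-3 floats.
    for (std::size_t i = vec_count; i < count; ++i) {
        data[i] *= factor;
    }
}
```

Padding trades a little memory for a single code path; the scalar tail keeps the exact size but means the algorithm is written twice.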

If you don't know the alignment of your data this technique also applies 
to the beginning of the array. You can use the non-vectorized algorithm 
to process the first 0-3 floats until you reach a 16 byte alignment 
then process all 16 byte aligned groups of 4 floats and then return to 
the non-vectorized implementation for 0-3 floats at the end of the array.
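
The three phases described above can be expressed as a split of the element count. A minimal sketch, assuming the pointer is at least float-aligned and sizeof(float) is 4; SseSplit and split_for_sse are hypothetical names, and the actual loops are left out:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type: how many floats each phase handles.
struct SseSplit {
    std::size_t head;  // scalar, until the first 16 byte boundary (0-3)
    std::size_t body;  // vectorized, a multiple of 4
    std::size_t tail;  // scalar, leftovers at the end (0-3)
};

// Hypothetical helper: compute the split for [data, data + count).
inline SseSplit split_for_sse(const float* data, std::size_t count) {
    static_assert(sizeof(float) == 4, "sketch assumes 4 byte floats");
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(data);
    assert(addr % sizeof(float) == 0);  // floats must be naturally aligned

    // Bytes past the previous 16 byte boundary; 0 means already aligned.
    const std::size_t misalign = static_cast<std::size_t>(addr % 16);
    std::size_t head = (misalign == 0) ? 0 : (16 - misalign) / sizeof(float);
    if (head > count) head = count;  // very short arrays never reach the boundary

    const std::size_t body = (count - head) / 4 * 4;
    return SseSplit{head, body, count - head - body};
}
```

A caller would run the scalar loop over the first head elements, the SSE loop over the next body elements (which now start on a 16 byte boundary), and the scalar loop again over the final tail elements.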

This is pretty much what compilers do when they vectorize a loop.

alignas(16) std::array<float, 4> data;
does work, although it doesn't help much for the soa_block implementation.

Indeed using alignof(decltype(data)) in my snippet is a little 
misleading. But I don't know how to query the alignment of an object 
rather than a type.
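
One possible workaround is to inspect the object's address at run time instead of asking the type. A minimal sketch, assuming a flat address space where converting a pointer to std::uintptr_t yields the numeric address (the conversion is implementation-defined); address_alignment and is_aligned_to are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical helper: the largest power of two that divides the address
// of obj, i.e. the alignment this particular object happens to have.
template <typename T>
std::size_t address_alignment(const T& obj) {
    const std::uintptr_t addr =
        reinterpret_cast<std::uintptr_t>(std::addressof(obj));
    // addr & (~addr + 1) isolates the lowest set bit of the address.
    return static_cast<std::size_t>(addr & (~addr + 1));
}

// Hypothetical helper: does obj sit on an `alignment` byte boundary?
template <typename T>
bool is_aligned_to(const T& obj, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return address_alignment(obj) >= alignment;
}
```

For `alignas(16) std::array<float, 4> data;` the call is_aligned_to(data, 16) is true, whereas alignof(decltype(data)) only reports the alignment of std::array<float, 4> itself, which on common implementations is that of float. Note the run-time value can exceed the requested alignment if the object happens to land on a larger boundary.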

The sse emitter test used an aligned_allocator to guarantee 16 byte 
alignment for the std::vector data.

     template< typename T >
     using sse_vector = vector<T, aligned_allocator<T,16> >;
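
The aligned_allocator itself is not shown in this message. Purely as an illustration of what such an allocator has to do, here is a minimal sketch that over-allocates and uses std::align; simple_aligned_allocator and sse_vector_sketch are hypothetical names, not the types used in the test, and overflow checks are omitted:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <vector>

// Hypothetical minimal allocator returning storage aligned to Alignment bytes.
template <typename T, std::size_t Alignment>
struct simple_aligned_allocator {
    static_assert((Alignment & (Alignment - 1)) == 0, "power of two required");
    static_assert(Alignment >= alignof(T), "alignment too small for T");
    static_assert(Alignment >= alignof(void*), "need room for the bookkeeping pointer");

    using value_type = T;

    // Needed explicitly because Alignment is a non-type template parameter.
    template <typename U>
    struct rebind { using other = simple_aligned_allocator<U, Alignment>; };

    simple_aligned_allocator() = default;
    template <typename U>
    simple_aligned_allocator(const simple_aligned_allocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        const std::size_t payload = n * sizeof(T);
        // Extra room: one pointer for bookkeeping plus worst-case padding.
        void* raw = ::operator new(payload + Alignment + sizeof(void*));
        void* p = static_cast<char*>(raw) + sizeof(void*);
        std::size_t space = payload + Alignment;
        void* aligned = std::align(Alignment, payload, p, space);
        assert(aligned != nullptr);
        // Remember the original block just before the aligned storage.
        static_cast<void**>(aligned)[-1] = raw;
        return static_cast<T*>(aligned);
    }

    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(reinterpret_cast<void**>(p)[-1]);
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const simple_aligned_allocator<T, A>&,
                const simple_aligned_allocator<U, A>&) noexcept { return true; }

template <typename T, typename U, std::size_t A>
bool operator!=(const simple_aligned_allocator<T, A>&,
                const simple_aligned_allocator<U, A>&) noexcept { return false; }

// Same shape as the alias above, but built on the sketch allocator.
template <typename T>
using sse_vector_sketch = std::vector<T, simple_aligned_allocator<T, 16>>;
```

With this, sse_vector_sketch<float> v(40) has v.data() on a 16 byte boundary, including after reallocation, so aligned loads over groups of 4 floats are valid.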