
On 10/26/2016 4:32 AM, Larry Evans wrote:
On 10/26/2016 02:27 AM, Michael Marcin wrote:
i.e. 4 floats have to be contiguous in memory, and the *first* float has to be aligned to 16 bytes.
So why not:
alignas(16) std::array<float, 4> data;
IOW, does the decltype(data) have to have the required alignment or does &data have to have that alignment?
All that matters is that the address of the first float is 16 byte aligned and that the number of floats in your array is divisible by 4. Since SSE processes 4 floats at a time, the 2nd group of 4 floats is also 16 byte aligned (sizeof(float)*4 == 16).

Note: different instruction sets and hardware support different data types and alignments.

This is why all the particle_counts I used in the emitter example were multiples of 4 (and a multiple of 64 in the tests that use the bit_vector, which packs 64 bools into a uint64_t).

SSE2 has instructions to operate on
- 2 double
- 2 int64_t
- 4 float
- 4 int32_t
- 8 short
- 16 char

With the aligned load/store instructions, all of these require the pointer to the data to be 16 byte aligned, and all are sized to 16 bytes, so you can operate on successive runs of data in an appropriately aligned array.

If you don't know that your particle_count is a multiple of 4 you need to write more code. For example, for an array of 39 floats you can either pad it out to 40 floats and use SSE on the whole thing, or use SSE on the first 36 floats (36/4 = 9 iterations) and have a non-vectorized implementation of the same algorithm handle the last 3 floats.

If you don't know the alignment of your data this technique also applies to the beginning of the array. You can use the non-vectorized algorithm to process the first 0-3 floats until you reach a 16 byte alignment, then process all 16 byte aligned groups of 4 floats, and then return to the non-vectorized implementation for the 0-3 floats at the end of the array. This is pretty much what compilers do when they vectorize a loop.

alignas(16) std::array<float, 4> data;

does work, although it doesn't much help for the soa_block implementation. Indeed, using alignof(decltype(data)) in my snippet is a little misleading. But I don't know how to query the alignment of an object rather than a type.

The sse emitter test used an aligned_allocator to guarantee 16 byte alignment for the std::vector data.
template< typename T > using sse_vector = vector<T, aligned_allocator<T,16> >;
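
For what it's worth, the head / aligned body / tail scheme described above can be sketched roughly as follows. This is only an illustration, not code from the emitter example: is_aligned16 and add_scalar_to_all are made-up names, and the SSE path is guarded so the same code falls back to plain scalar loops on targets without SSE. The runtime address check is one way to test the alignment of a particular object rather than of its type.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Hypothetical helper: true if p sits on a 16 byte boundary.
// This checks the address of an object, not the alignment of its type.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical example operation: add k to every element of data[0..n).
inline void add_scalar_to_all(float* data, std::size_t n, float k) {
    std::size_t i = 0;

    // Head: non-vectorized until we reach a 16 byte boundary
    // (0-3 floats when data is at least float-aligned).
    while (i < n && !is_aligned16(data + i)) {
        data[i] += k;
        ++i;
    }

#ifdef SKETCH_HAS_SSE
    // Body: aligned groups of 4 floats, one SSE register at a time.
    const __m128 kk = _mm_set1_ps(k);
    for (; i + 4 <= n; i += 4) {
        const __m128 v = _mm_load_ps(data + i);   // requires 16 byte alignment
        _mm_store_ps(data + i, _mm_add_ps(v, kk));
    }
#endif

    // Tail: non-vectorized for whatever is left (0-3 floats with SSE,
    // everything after the head when the SSE path is compiled out).
    for (; i < n; ++i) {
        data[i] += k;
    }
}
```

Calling add_scalar_to_all(buf + 1, 39, 1.0f) on a 16 byte aligned buf of 40 floats would, on an SSE target, run 3 scalar head iterations, 9 SSE iterations and no tail iterations, since the 3 head floats bring the remaining 36 onto a 16 byte boundary.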