Re: [boost] interest in structure of arrays container?

6 Nov 2016

      On 11/6/2016 10:36 AM, Larry Evans wrote:
...
On 10/31/2016 12:00 PM, Michael Marcin wrote:
...
On 10/31/2016 9:14 AM, Larry Evans wrote:
...
However, I was still getting the 'double free' error message; hence,
I tried val_grind.  It showed a problem in the alive update loop.
When the code was changed to:
uint64_t *block_ptr = alive.data();
        auto e_ptr = energy.data();
        for ( size_t i = 0; i < n; ) {
          #define REVISED_CODE
          #ifdef REVISED_CODE
            auto e_i = e_ptr + i;
          #endif
            uint64_t block = 0;
            do {
              #ifndef REVISED_CODE
                //this code causes valgrind to show errors.
                auto e_i = e_ptr + i;
              #endif
                _mm_store_ps( e_i, _mm_sub_ps( _mm_load_ps( e_i ), t ));
                block |=
                  uint64_t
                  ( _mm_movemask_ps( _mm_cmple_ps( _mm_load_ps( e_i ),
zero )))
                  << (i % bits_per_uint64_t)
                  ;
                i += 4;
            } while ( i % bits_per_uint64_t != 0 );
            *block_ptr++ = block;
        }
valgrind reported no errors; however, when !defined(REVISED_CODE),
valgrind reported:
valgrind --tool=memcheck
/tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
==7937== Memcheck, a memory error detector
==7937== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==7937== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright
info
==7937== Command:
/tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
==7937==
COMPILE_OPTIM=0
particle_count=1,000
particle_count=1,000 is not a multiple of 64, the optimized energe/alive
loop processes 64 particles at a time. I haven't bothered to analyze
what the code will do in this case but memory corruption is likely
The code to handle a tail (if particle_count % 64 != 0) isn't difficult
to add but it is explicitly left out. One of the things you'll often do
in a system such as this is fit the data to optimize the algorithm. In
the case of a particle system plus or minus 0 to 63 particles is
generally unnoticeable.
You can address the problem however you like but the simplest solution
would be to change your small particle count to 16 * 64 = 1024.
IIUC, the 64 constraint is because bit_vector stores the bit's in 64
increments (or bitsof(uint_64_t) where bitsof is defined here:
https://github.com/cppljevans/soa/blob/master/bitsof.hpp
However, with the following SSE_opt alive update loop:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L961
I think the constraint could be relaxed to a multiple of sse_width(=4).
That's because the additional test:
i_array < n_array
here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L978
prevents:
auto e_i = e_ptr + i_array;
here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L985
from accessing outside of the array.
Does that make sense?
The general pattern is to handle full blocks and then separately handle 
the final partial block.
This minimizes the amount of work the code must do per iteration.

This code handles multiple of 4 particle counts.
http://codepad.org/iRwgxAkW

Here's an alternative implementation that benchmarks slower on my 
machine but is a bit easier to follow IMO.
http://codepad.org/DjqVHac8

P.S. there was a bug in the original sse opt that wouldn't affect perf 
testing but _mm_cmple_ps should have been _mm_cmpgt_ps