On 11/6/2016 10:36 AM, Larry Evans wrote:
On 10/31/2016 12:00 PM, Michael Marcin wrote:
On 10/31/2016 9:14 AM, Larry Evans wrote:
However, I was still getting the 'double free' error message; hence, I tried val_grind. It showed a problem in the alive update loop. When the code was changed to:
uint64_t *block_ptr = alive.data(); auto e_ptr = energy.data(); for ( size_t i = 0; i < n; ) { #define REVISED_CODE #ifdef REVISED_CODE auto e_i = e_ptr + i; #endif uint64_t block = 0; do { #ifndef REVISED_CODE //this code causes valgrind to show errors. auto e_i = e_ptr + i; #endif _mm_store_ps( e_i, _mm_sub_ps( _mm_load_ps( e_i ), t )); block |= uint64_t ( _mm_movemask_ps( _mm_cmple_ps( _mm_load_ps( e_i ), zero ))) << (i % bits_per_uint64_t) ; i += 4; } while ( i % bits_per_uint64_t != 0 ); *block_ptr++ = block; }
valgrind reported no errors; however, when !defined(REVISED_CODE), valgrind reported:
valgrind --tool=memcheck /tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
==7937== Memcheck, a memory error detector ==7937== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. ==7937== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info ==7937== Command: /tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
==7937== COMPILE_OPTIM=0 particle_count=1,000
particle_count=1,000 is not a multiple of 64, the optimized energe/alive loop processes 64 particles at a time. I haven't bothered to analyze what the code will do in this case but memory corruption is likely
The code to handle a tail (if particle_count % 64 != 0) isn't difficult to add but it is explicitly left out. One of the things you'll often do in a system such as this is fit the data to optimize the algorithm. In the case of a particle system plus or minus 0 to 63 particles is generally unnoticeable.
You can address the problem however you like but the simplest solution would be to change your small particle count to 16 * 64 = 1024.
IIUC, the 64 constraint is because bit_vector stores the bit's in 64 increments (or bitsof(uint_64_t) where bitsof is defined here:
https://github.com/cppljevans/soa/blob/master/bitsof.hpp
However, with the following SSE_opt alive update loop:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L961
I think the constraint could be relaxed to a multiple of sse_width(=4). That's because the additional test:
i_array < n_array
here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L978
prevents:
auto e_i = e_ptr + i_array;
here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L985
from accessing outside of the array. Does that make sense?
The general pattern is to handle full blocks and then separately handle the final partial block. This minimizes the amount of work the code must do per iteration. This code handles multiple of 4 particle counts. http://codepad.org/iRwgxAkW Here's an alternative implementation that benchmarks slower on my machine but is a bit easier to follow IMO. http://codepad.org/DjqVHac8 P.S. there was a bug in the original sse opt that wouldn't affect perf testing but _mm_cmple_ps should have been _mm_cmpgt_ps