
On Wednesday, March 30, 2011 02:35, Joel Falcou wrote:
On 30/03/11 08:04, Gruenke, Matt wrote:
[snip]
template< int i, typename V > T get_element( V );
template< int i, typename V > V set_element( V, T );
get_element is operator[] on pack and native. How do you do set_element? Every solution I found was either UB or slow.
I used shuffle, where possible. I think it's only supported for 16-bit elements or larger on MMX/SSE2. I don't remember whether I implemented it using shift, mask, and OR for 8-bit, or just left it undefined for 8-bit.
Looking at your prototype, I guess you replicate V and change the element in a memory buffer?
I'm pretty sure I avoided memory for just about everything but initialization. I even went as far as circumventing the normal register copy instruction, where possible, which was strangely slow on P4's.
The real powerful function is Altivec permute but it is harder to find a proper abstraction of it.
Perhaps you can at least think of a way to use static assertions to enforce its inherent limitations. If permute's limitations are as the name suggests, then you can use the element indices to set bits in a vector and assert that all bits have been set. But maybe the compiler already does that for you.
template< typename V > void store_uncached( V, V * ); // avoids cache pollution
Does it make any real difference? All the tests I ran gave me a minimal amount of speed-up. I'm curious to hear about your experience, and I'll add it if needed.
Well, it's all about context. It doesn't make your writes faster. In fact, small bursts will actually be slower. However, if you're protecting something else in cache, then it can definitely pay off. It should also improve hyperthreading performance (again, assuming you're not going to read the written data for a while).
I'm also a fan of having a set of common, optimized 1-D operations, such as buffer packing/interleaving & unpacking/deinterleaving, extract/insert columns, convolution, dot-product, SAD, FFT, etc.
Some of these are actually functions working at the pack level; std::fold or transform gets you the 1-D version. Some make little sense.
Often, I find the need to do things like de-interleave a scanline or tile of data, do some processing on the channels, and then re-interleave it. Processing at this granularity usually allows everything to stay in L1 cache. Efficient transpose (or at least extracting a batch of columns into horizontal buffers) is also very important.
Keep it low-level, though. IMO, any sort of high-level abstraction that ships data off to different accelerator back-ends, like GPUs, is a different animal and should go in a different library.
That's the goal of NT2 as a whole.
That's a fine thing to do - just not something I want mixed into my SIMD library. Since this is all about performance, whatever I use needs to give me the option to drop down to the next lower level if I find it necessary to get more performance in some hot spots. Thank you for the work you're doing on this. I look forward to seeing more.

Matt