
On Wednesday, March 30, 2011 02:35, Joel Falcou wrote:
On 30/03/11 08:04, Gruenke, Matt wrote:
[snip]
template< int i, typename V > T get_element( V );
template< int i, typename V > V set_element( V, T );
get_element is operator[] on pack and native. How do you do set_element? Every solution I found was either UB or slow.
I used shuffle, where possible. I think it's only supported for 16-bit elements or larger on MMX/SSE2. I don't remember whether I implemented it using shift, mask, and OR for 8-bit, or just left it undefined for 8-bit.
Looking at your prototype, I guess you replicate V and change the element in a memory buffer?
I'm pretty sure I avoided memory for just about everything but initialization. I even went as far as circumventing the normal register copy instruction, where possible, which was strangely slow on P4's.
The real powerful function is Altivec permute but it is harder to find a proper abstraction of it.
Perhaps you can at least think of a way to use static assertions to enforce its inherent limitations. If permute's limitations are as the name suggests, then you can use the element indices to set bits in a vector and assert that all bits have been set. But maybe the compiler already does that for you.
template< typename V > void store_uncached( V, V * ); // avoids cache pollution
Does it make any real difference? All the tests I ran gave me a minimal amount of speed-up. I'm curious to hear about your experience, and I'll add it if needed.
Well, it's all about context. It doesn't make your writes faster. In fact, small bursts will actually be slower. However, if you're protecting something else in cache, then it can definitely pay off. It should also improve hyperthreading performance (again, assuming you're not going to read the written data for a while).
I'm also a fan of having a set of common, optimized 1-D operations, such as buffer packing/interleaving & unpacking/deinterleaving, extract/insert columns, convolution, dot-product, SAD, FFT, etc.
Some of these are actually functions working at the pack level; std::fold or transform gets you the 1-D version. Some make little sense.
Often, I find the need to do things like de-interleave a scanline or tile of data, do some processing on the channels, and then re-interleave it. Processing at this granularity usually allows everything to stay in L1 cache. Efficient transpose (or at least extracting a batch of columns into horizontal buffers) is also very important.
Keep it low-level, though. IMO, any sort of high-level abstraction that ships data off to different accelerator back-ends, like GPUs, is a different animal and should go in a different library.
That's the goal of NT2 as a whole.
That's a fine thing to do - just not something I want mixed into my SIMD library. Since this is all about performance, whatever I use needs to give me the option to drop down to the next lower level if I find it necessary to get more performance in some hot spots. Thank you for the work you're doing on this. I look forward to seeing more.

Matt