
On 11/06/2011 17:30, David A. Greene wrote:
The benchmarks show this. Just in the slides, there is a 4x improvement when using std::accumulate (which is still a pretty trivial algorithm) with pack<float> instead of float, and in both cases automatic vectorization was enabled.
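For concreteness, the comparison in the slides is roughly between the plain scalar accumulate and something like the following, where a hand-rolled 4-wide SSE wrapper stands in for pack<float> (pack4f and hsum are placeholder names for this sketch, not the library's API):

#include <immintrin.h>
#include <numeric>

// Minimal stand-in for a 4-wide float pack: one SSE register plus
// the operator+ that std::accumulate needs.
struct pack4f {
    __m128 v;
    pack4f() : v(_mm_setzero_ps()) {}
    explicit pack4f(__m128 x) : v(x) {}
    pack4f operator+(const pack4f& o) const { return pack4f(_mm_add_ps(v, o.v)); }
};

// Horizontal sum of the four lanes, done once at the end.
static inline float hsum(const pack4f& p) {
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, p.v);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

float sum(const pack4f* first, const pack4f* last) {
    // The algorithm is unchanged; each addition now processes four floats.
    pack4f partial = std::accumulate(first, last, pack4f());
    return hsum(partial);
}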
With which compilers?
Mainstream compilers with mainstream options: what people actually use. The BoostCon talk was kept pretty simple; it was not aimed at HPC experts, but rather tried to show that the library could be useful to everyone.
But how does the programmer know what is best? It changes from implementation to implementation.
The programmer is not supposed to know; the library does. The programmer writes a * b + c, and we generate an fma instruction if one is available, or a multiplication followed by an addition otherwise (a rough sketch of that dispatch is below). Now, it is true that there are some cases with several possible code sequences, where which one is fastest is not clear and may depend on the micro-architecture and not just the instruction set. We don't really generate different code depending on the micro-architecture, but I think the cases where it really matters should be rather few. When testing on some other micro-architectures, we noticed a few unexpected run times, but nothing that really justifies different codegen, at least at our level.

We're currently setting up a test farm, and we'll try to graph the run time in cycles of all of our functions on different architectures. Any recommendations on which micro-architectures to include for x86? We can't afford too many; we mostly work with Core and Nehalem.
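To make the a * b + c point concrete, at the intrinsics level the dispatch looks roughly like this (an illustration, not our actual implementation; it assumes an SSE target and the __FMA__ macro that GCC/Clang define under -mfma):

#include <immintrin.h>

// result = a * b + c, fused when the target has FMA, split otherwise.
static inline __m128 muladd(__m128 a, __m128 b, __m128 c)
{
#if defined(__FMA__)
    return _mm_fmadd_ps(a, b, c);              // single fused instruction
#else
    return _mm_add_ps(_mm_mul_ps(a, b), c);    // multiplication then addition
#endif
}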
That's overkill. Alignment often isn't required to vectorize.
It doesn't cost the user much to enforce it for new applications, and it saves us from having to worry about it.
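For example (assuming a 32-byte requirement just for illustration; the real figure depends on the target's vector width), enforcing it amounts to:

#include <xmmintrin.h>   // _mm_malloc / _mm_free

void example()
{
    // Static or automatic storage: alignment is a one-word annotation.
    alignas(32) float stack_buf[1024];
    (void)stack_buf;

    // Heap storage: request the alignment up front.
    float* heap_buf = static_cast<float*>(_mm_malloc(1024 * sizeof(float), 32));
    // ... use heap_buf ...
    _mm_free(heap_buf);
}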
With AMD and Intel's latest offerings alignment is much less of a performance issue.
What about portability?
32 is not always the right answer on SandyBridge/Bulldozer because 256 bit vectorization is not always the right answer.
If 256-bit vectorization is not what you want, then you have to specify the size explicitly. Otherwise we always prefer the size that allows the most parallelism.
What's the vector length of a pack<int> on Bulldozer?
256 bits, because __m256i exists and there are some instructions for those types, even if they are few. But I suppose 128 bits could also be an acceptable choice. I need to benchmark this, but I think the conversions between AVX and SSE are sufficiently fast to make 256 bits a good choice in general. It's only a default anyway; you can set the size you want.
That is in fact what is happening via autotuning. Yes, in some cases hand-tuned code outperforms the compiler. But in the vast majority of cases, the compiler wins.
That's not what I remember from my last discussions with people working on the polyhedral optimization model. They told me they came close, but still weren't as fast as state-of-the-art BLAS implementations.
My case comes from years of experience.
You certainly have experience in writing compilers, but do you have experience in writing SIMD versions of algorithms within applications? You don't seem to be familiar with SIMD-style branching or other popular techniques.
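By SIMD-style branching I mean replacing a per-element if/else with a lane mask and a select; a minimal SSE sketch, not tied to any particular library:

#include <immintrin.h>

// Per lane: result[i] = (x[i] > 0) ? 2 * x[i] : -x[i], with no branch.
static inline __m128 branchless(__m128 x)
{
    __m128 mask     = _mm_cmpgt_ps(x, _mm_setzero_ps());  // all-ones where x > 0
    __m128 if_true  = _mm_mul_ps(x, _mm_set1_ps(2.0f));
    __m128 if_false = _mm_sub_ps(_mm_setzero_ps(), x);
    // Select per lane: (mask & if_true) | (~mask & if_false).
    // On SSE4.1 this is a single _mm_blendv_ps.
    return _mm_or_ps(_mm_and_ps(mask, if_true),
                     _mm_andnot_ps(mask, if_false));
}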
It may be useful in some instances, but it is not a general solution to the problem of writing vector code. In the majority of cases, the programmer will want the compiler to do it because it does a very good job.
All tools are complementary and can thus coexist. What I didn't like about your original post is that you said compilers were the one and only solution to parallelization. At least we can agree on something now ;).