
Joel falcou <joel.falcou@gmail.com> writes:
> On 10/06/11 18:05, David A. Greene wrote:
>> For writing new code, I contend that a good compiler, with a few directives here and there, can accomplish the same result as this library with less programmer effort.
> I don't really think so, especially if you take your "portable" stuff into the equation.
I have seen it done.
>> A simple example:
>>
>>   void foo(float *a, float *b, int n) {
>>     for (int i = 0; i < n; ++i)
>>       a[i] = b[i];
>>   }
>>
>> This is not obviously parallel, but with some simple help the user can get the compiler to vectorize it.
> Seriously, are you kidding me? This is a friggin' for_all... You cannot get more embarrassingly parallel.
No, it's not obviously parallel. Consider aliasing: if a and b overlap, the scalar loop and a vectorized loop compute different things, and the compiler has to assume they might overlap.
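To make that concrete, here is a rough sketch, using the GCC/Clang __restrict__ spelling (C99 calls it restrict). One keyword's worth of "simple help" removes the ambiguity and the compiler vectorizes freely:

    // A call like foo(buf + 1, buf, 7) makes every iteration read the
    // element written by the previous one; a vectorized copy would
    // compute something different, so the compiler must be conservative.
    //
    // Promising no overlap removes the problem:
    void foo(float *__restrict__ a, const float *__restrict__ b, int n) {
        for (int i = 0; i < n; ++i)
            a[i] = b[i];
    }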
>> Another less simple case:
> And this accumulate is like the most basic EP example you can get.
IF the user allows differences in answers. Sometimes users are very picky.
>> This is much less obviously parallel, but good compilers can make it so if the user allows slightly different answers, which they often do.
> Yeah, and any brain-dead developer can write the proper
>
>   boost::accumulate( simd::range(v), 0. )
>
> to get it right.
What's right? I want bit-reproducibility with that IBM machine from 10 years ago. Yes, in cases like that one would simply not use boost.simd and would tell the compiler not to vectorize. I'm trying to point out the questions and problems that arise in production systems.
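For the curious: the "slightly different answers" come from reassociation. A vectorized sum keeps several partial accumulators and combines them at the end, and floating-point addition is not associative, so the low-order bits can change. A minimal sketch of what a 4-wide vectorizer effectively does (assuming n is a multiple of 4 for brevity):

    // Scalar: evaluates ((s + a[0]) + a[1]) + ...
    float sum_scalar(const float *a, int n) {
        float s = 0.f;
        for (int i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    // Vectorized shape: four interleaved partial sums, combined at the
    // end. Mathematically equal, but rounded differently, so the result
    // can differ from sum_scalar in the last bits.
    float sum_vec4(const float *a, int n) {  // assumes n % 4 == 0
        float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i];     s1 += a[i + 1];
            s2 += a[i + 2]; s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }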
> So, who's the compiler's daddy here?
Er?
>> Can you explain why not? Assembly code in and of itself is not bad, but it raises some maintainability questions. How many different implementations of a particular ISA will the library support?
> Because it is C functions maybe :E
What's the difference between ADDPD XMM0, XMM1 and XMM0 = __builtin_ia32_addpd(XMM0, XMM1)? I would contend nothing, from a programming-effort perspective.
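For anyone following along, the intrinsic spelling of that instruction (SSE2's _mm_add_pd from <emmintrin.h>) drives the point home. It is C, but the programmer is still hand-scheduling two-element registers, exactly the exercise the assembly version demands (loop sketch; n assumed even):

    #include <emmintrin.h>  // SSE2 intrinsics

    // c[i] = a[i] + b[i], two doubles at a time. _mm_add_pd compiles to
    // the very ADDPD above; writing it takes the same thought as writing
    // the assembly, minus register allocation.
    void add2(double *c, const double *a, const double *b, int n) {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_loadu_pd(a + i);
            __m128d vb = _mm_loadu_pd(b + i);
            _mm_storeu_pd(c + i, _mm_add_pd(va, vb));
        }
    }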
> Currently we support all the SSEx family, all the AMD-specific stuff, and Altivec for PPC and Cell, and we have a protocol to extend that.
How many different implementations of DGEMM do you have for x86? I have seen libraries with 10-20.

Ok, that's a bit unfair. You are not trying to reproduce BLAS or anything. But let's say someone wants to write DGEMM. He or she has a couple of options:

- Write it in assembler. Note that the programmer will have to take into account various combinations of matrix size and alignment, target microarchitecture and ISA, and will probably have to code many different versions.
- Write it using the operator overloads provided by boost.simd. Note that the programmer will have to take into account various combinations of matrix size and alignment, target microarchitecture and ISA, and will probably have to code many different versions.
- Write just one version using either of the above. It will work reasonably well in many cases and completely stink in others.
- Use an autotuning framework that generates many different variants by exploiting the abilities of a vectorizing compiler.

I'm sure there are other options, but these are the most common approaches. Everyone in the industry is moving to the last option.
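To illustrate that last option: the autotuner emits many variants of a parameterized kernel like the sketch below (the tile sizes are made-up placeholders), times each variant on each target, and keeps the winner; the compiler is left to vectorize the unit-stride inner loop.

    // One point in the search space: C += A * B on row-major n x n
    // matrices, tiled MI x MK x MJ. An autotuner instantiates this for
    // many (MI, MJ, MK) triples and picks the fastest one per
    // microarchitecture. (n assumed divisible by the tile sizes.)
    #define MI 64
    #define MJ 256
    #define MK 64

    void dgemm_tile(int n, const double *A, const double *B, double *C) {
        for (int ii = 0; ii < n; ii += MI)
        for (int kk = 0; kk < n; kk += MK)
        for (int jj = 0; jj < n; jj += MJ)
            for (int i = ii; i < ii + MI; ++i)
                for (int k = kk; k < kk + MK; ++k) {
                    double aik = A[i * n + k];
                    for (int j = jj; j < jj + MJ; ++j)  // unit stride; vectorizable
                        C[i * n + j] += aik * B[k * n + j];
                }
    }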
>> 4 floats are available. That does not mean one always wants to use all of them. Heck, it's often the case one wants to use none of them.
> Not using all the elements in a SIMD vector is Doing It Wrong.
I have seen cases in real code where using all of the elements is exactly the wrong thing to do. How would I express this in boost.simd? What happens when I move that code to another implementation where using all of the elements in a vector is exactly the right thing to do?
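One example of the kind of thing I mean, sketched from memory rather than taken from that code: a first-match search. A full-width SIMD version must read past the hit (wasted work, and a potential fault at the end of the array) and then figure out which lane matched; for short inputs or early hits the scalar loop wins, and only a cost model knows which case you are in:

    // Index of the first element equal to key, or -1. Forcing full
    // vectors here means speculatively reading beyond the match;
    // whether that pays off depends on the trip count, the hit
    // position, and the target.
    int find_first(const float *a, int n, float key) {
        for (int i = 0; i < n; ++i)
            if (a[i] == key)
                return i;
        return -1;
    }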
>> I'm demonstrating what I mean by "performance portable." Substitute "GPU" with any CPU sufficiently different from the baseline.
> I wish you would read about what "library scope" and "rationale" mean. Are you complaining that Boost.MPI doesn't cover GPUs too?
MPI is a completely different focus and you know it.

Your rationale, as I understand it, is to make exploiting data parallelism simpler. That's good! We need more of that. I am trying to explain that simply using vector instructions is usually not enough.

Vectorization is hard. Not the mechanics, that's relatively easy. Getting the performance out is a lot of work, and that is where most of the effort in vectorizing compilers goes. The Intel and PGI compilers are not better than gcc because they can vectorize and gcc cannot. gcc can vectorize just fine. The Intel and PGI compilers are better than gcc because they understand how to restructure code at a high level and they have been taught when (and when not!) to vectorize and how best to use the vector hardware. That is something not easily captured in a library.
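A small example of the high-level restructuring I mean, loop interchange (illustrative, not taken from any particular compiler):

    // As written, the inner loop strides through a row-major array
    // column-wise (stride n): vectorizing it as-is means gathers.
    void scale_strided(double *x, int n, double s) {
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                x[i * n + j] *= s;
    }

    // A good compiler interchanges the loops first, making the inner
    // loop unit-stride, and only then vectorizes. An operator-overload
    // library never sees the loop nest, so it cannot do this for you.
    void scale_unit(double *x, int n, double s) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                x[i * n + j] *= s;
    }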
>> Intel and PGI.
> Ok, and what do the guys on machines that neither Intel nor PGI support do? Cry blood?
If boost.simd is targeted at users who have subpar compilers, that's fine. But please don't go around telling people that compilers can't vectorize and parallelize. That's simply not true.

Boost.simd could be useful to vendors providing vectorized versions of their libraries. These are cases where, for some reason or other, the compiler can't be convinced to generate the absolute best code. That happens, and I can see some portability benefits from Boost.simd there.

But it is not as easy as you make it out to be, and I don't think Boost.simd should be sold as a preferred, general way for everyday programmers to exploit data parallelism. I have seen too many cases where programmers wrote an "obviously better" vector implementation of a loop, only to have someone else rewrite it in scalar so the compiler could properly vectorize it.

-Dave