
Joel falcou <joel.falcou@gmail.com> writes:
> On 10/06/11 18:05, David A. Greene wrote:
>> For writing new code, I contend that a good compiler, with a few directives here and there, can accomplish the same result as this library with less programmer effort.
> I don't really think so, especially if you take your "portable" stuff into the equation.
I have seen it done.
>> A simple example:
>>
>>   void foo(float *a, float *b, int n) {
>>     for (int i = 0; i < n; ++i)
>>       a[i] = b[i];
>>   }
>>
>> This is not obviously parallel, but with some simple help the user can get the compiler to vectorize it.
> Seriously, are you kidding me? This is a friggin' for_all... You cannot get more embarrassingly parallel.
No, it's not obviously parallel. Consider aliasing: if a and b overlap, the scalar loop and a vectorized loop compute different things, and the compiler has to assume they might overlap.
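To make that concrete, here is a rough sketch, using the GCC/Clang __restrict__ spelling (C99 calls it restrict). One keyword's worth of "simple help" removes the ambiguity and the compiler vectorizes freely:

    // A call like foo(buf + 1, buf, 7) makes every iteration read the
    // element written by the previous one; a vectorized copy would
    // compute something different, so the compiler must be conservative.
    //
    // Promising no overlap removes the problem:
    void foo(float *__restrict__ a, const float *__restrict__ b, int n) {
        for (int i = 0; i < n; ++i)
            a[i] = b[i];
    }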
>> Another less simple case:
> And this accumulate is like the most basic EP example you can get.
IF the user allows differences in answers. Sometimes users are very picky.
>> This is much less obviously parallel, but good compilers can make it so if the user allows slightly different answers, which they often do.
> Yeah, and any brain-dead developer can write the proper
>
>   boost::accumulate( simd::range(v), 0. )
>
> to get it right.
What's right? I want bit-reproducibility with that IBM machine from 10 years ago. Yes, in cases like that one would simply not use boost.simd and would tell the compiler not to vectorize. I'm trying to point out the questions and problems that arise in production systems.
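For the curious: the "slightly different answers" come from reassociation. A vectorized sum keeps several partial accumulators and combines them at the end, and floating-point addition is not associative, so the low-order bits can change. A minimal sketch of what a 4-wide vectorizer effectively does (assuming n is a multiple of 4 for brevity):

    // Scalar: evaluates ((s + a[0]) + a[1]) + ...
    float sum_scalar(const float *a, int n) {
        float s = 0.f;
        for (int i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    // Vectorized shape: four interleaved partial sums, combined at the
    // end. Mathematically equal, but rounded differently, so the result
    // can differ from sum_scalar in the last bits.
    float sum_vec4(const float *a, int n) {  // assumes n % 4 == 0
        float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i];     s1 += a[i + 1];
            s2 += a[i + 2]; s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }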
> So, who's the compiler's daddy here?
Er?
>> Can you explain why not? Assembly code in and of itself is not bad, but it raises some maintainability questions. How many different implementations of a particular ISA will the library support?
> Because it is C functions maybe :E
What's the difference between ADDPD XMM0, XMM1 and XMM0 = __builtin_ia32_addpd(XMM0, XMM1)? I would contend nothing, from a programming-effort perspective.
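For anyone following along, the intrinsic spelling of that instruction (SSE2's _mm_add_pd from <emmintrin.h>) drives the point home. It is C, but the programmer is still hand-scheduling two-element registers, exactly the exercise the assembly version demands (loop sketch; n assumed even):

    #include <emmintrin.h>  // SSE2 intrinsics

    // c[i] = a[i] + b[i], two doubles at a time. _mm_add_pd compiles to
    // the very ADDPD above; writing it takes the same thought as writing
    // the assembly, minus register allocation.
    void add2(double *c, const double *a, const double *b, int n) {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_loadu_pd(a + i);
            __m128d vb = _mm_loadu_pd(b + i);
            _mm_storeu_pd(c + i, _mm_add_pd(va, vb));
        }
    }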
> Currently we support all the SSEx family, all the AMD-specific stuff, and Altivec for PPC and Cell, and we have a protocol to extend that.
How many different implementations of DGEMM do you have for x86? I have seen libraries with 10-20.

Ok, that's a bit unfair. You are not trying to reproduce BLAS or anything. But let's say someone wants to write DGEMM. He or she has a couple of options:

- Write it in assembler. Note that the programmer will have to take into account various combinations of matrix size and alignment, target microarchitecture and ISA, and will probably have to code many different versions.
- Write it using the operator overloads provided by boost.simd. Note that the programmer will have to take into account various combinations of matrix size and alignment, target microarchitecture and ISA, and will probably have to code many different versions.
- Write just one version using either of the above. It will work reasonably well in many cases and completely stink in others.
- Use an autotuning framework that generates many different variants by exploiting the abilities of a vectorizing compiler.

I'm sure there are other options, but these are the most common approaches. Everyone in the industry is moving to the last option.
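To illustrate that last option: the autotuner emits many variants of a parameterized kernel like the sketch below (the tile sizes are made-up placeholders), times each variant on each target, and keeps the winner; the compiler is left to vectorize the unit-stride inner loop.

    // One point in the search space: C += A * B on row-major n x n
    // matrices, tiled MI x MK x MJ. An autotuner instantiates this for
    // many (MI, MJ, MK) triples and picks the fastest one per
    // microarchitecture. (n assumed divisible by the tile sizes.)
    #define MI 64
    #define MJ 256
    #define MK 64

    void dgemm_tile(int n, const double *A, const double *B, double *C) {
        for (int ii = 0; ii < n; ii += MI)
        for (int kk = 0; kk < n; kk += MK)
        for (int jj = 0; jj < n; jj += MJ)
            for (int i = ii; i < ii + MI; ++i)
                for (int k = kk; k < kk + MK; ++k) {
                    double aik = A[i * n + k];
                    for (int j = jj; j < jj + MJ; ++j)  // unit stride; vectorizable
                        C[i * n + j] += aik * B[k * n + j];
                }
    }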
>> 4 floats are available. That does not mean one always wants to use all of them. Heck, it's often the case one wants to use none of them.
> Not using all the elements in a SIMD vector is Doing It Wrong.
I have seen cases in real code where using all of the elements is exactly the wrong thing to do. How would I express this in boost.simd? What happens when I move that code to another implementation where using all of the elements in a vector is exactly the right thing to do?
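One example of the kind of thing I mean, sketched from memory rather than taken from that code: a first-match search. A full-width SIMD version must read past the hit (wasted work, and a potential fault at the end of the array) and then figure out which lane matched; for short inputs or early hits the scalar loop wins, and only a cost model knows which case you are in:

    // Index of the first element equal to key, or -1. Forcing full
    // vectors here means speculatively reading beyond the match;
    // whether that pays off depends on the trip count, the hit
    // position, and the target.
    int find_first(const float *a, int n, float key) {
        for (int i = 0; i < n; ++i)
            if (a[i] == key)
                return i;
        return -1;
    }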
>> I'm demonstrating what I mean by "performance portable." Substitute "GPU" with any CPU sufficiently different from the baseline.
> I wish you would read about what "library scope" and "rationale" mean. Are you complaining that Boost.MPI doesn't cover GPUs too?
MPI is a completely different focus and you know it.

Your rationale, as I understand it, is to make exploiting data parallelism simpler. That's good! We need more of that. I am trying to explain that simply using vector instructions is usually not enough.

Vectorization is hard. Not the mechanics, that's relatively easy. Getting the performance out is a lot of work, and that is where most of the effort in vectorizing compilers goes. The Intel and PGI compilers are not better than gcc because they can vectorize and gcc cannot. gcc can vectorize just fine. The Intel and PGI compilers are better than gcc because they understand how to restructure code at a high level and they have been taught when (and when not!) to vectorize and how best to use the vector hardware. That is something not easily captured in a library.
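A small example of the high-level restructuring I mean, loop interchange (illustrative, not taken from any particular compiler):

    // As written, the inner loop strides through a row-major array
    // column-wise (stride n): vectorizing it as-is means gathers.
    void scale_strided(double *x, int n, double s) {
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                x[i * n + j] *= s;
    }

    // A good compiler interchanges the loops first, making the inner
    // loop unit-stride, and only then vectorizes. An operator-overload
    // library never sees the loop nest, so it cannot do this for you.
    void scale_unit(double *x, int n, double s) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                x[i * n + j] *= s;
    }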
>> Intel and PGI.
> Ok, and what do the guys on machines that neither Intel nor PGI support do? Cry blood?
If boost.simd is targeted at users who have subpar compilers, that's fine. But please don't go around telling people that compilers can't vectorize and parallelize. That's simply not true.

Boost.simd could be useful to vendors providing vectorized versions of their libraries. These are cases where, for some reason or other, the compiler can't be convinced to generate the absolute best code. That happens, and I can see some portability benefits from Boost.simd there.

But it is not as easy as you make it out to be, and I don't think Boost.simd should be sold as a preferred, general way for everyday programmers to exploit data parallelism. I have seen too many cases where programmers wrote an "obviously better" vector implementation of a loop, only to have someone else rewrite it in scalar so the compiler could properly vectorize it.

-Dave