[gsoc] boost.simd news from the front.

Hi everyone,

Last week, we (Joel Falcou, my mentor; Mathias Gaunard; and I) had a meeting to decide the course of action for the future boost.simd library. For those who aren't aware of it, it's a subset of an existing library, namely nt2, which provides an abstraction for SIMD instructions in C++. Here is what has been decided so far: boost.simd will be composed of several modules, mimicking what's done in nt2 (or in Phoenix). That is, the main operators (add, minus, etc.) will go in one module, and the extra functions we provide, like the bitwise ones (bitwise_andnot, etc.) or the arithmetic ones, will go in their own modules. Second, what will be included? We decided to keep only functions which might have a corresponding intrinsic to exploit. Finally, boost.build will come into play later in the process.

Final note: despite what I said in a previous mail a few weeks ago, the initial development will happen both in the main nt2 repository (available at https://github.com/MetaScale/nt2), in a branch, mainly to ease the merge process, and in the boost.simd repository.

Cheers,
Mathieu Masson.

On Mon, Jun 6, 2011 at 8:37 AM, Mathieu - <ptr.jetable@gmail.com> wrote:
Hi everyone,
Last week, we (Joel Falcou, my mentor; Mathias Gaunard; and I) had a meeting to decide the course of action for the future boost.simd library. For those who aren't aware of it, it's a subset of an existing library, namely nt2, which provides an abstraction for SIMD instructions in C++. Here is what has been decided so far: boost.simd will be composed of several modules, mimicking what's done in nt2 (or in Phoenix). That is, the main operators (add, minus, etc.) will go in one module, and the extra functions we provide, like the bitwise ones (bitwise_andnot, etc.) or the arithmetic ones, will go in their own modules. Second, what will be included? We decided to keep only functions which might have a corresponding intrinsic to exploit. Finally, boost.build will come into play later in the process.
Final note: despite what I said in a previous mail a few weeks ago, the initial development will happen both in the main nt2 repository (available at https://github.com/MetaScale/nt2), in a branch, mainly to ease the merge process, and in the boost.simd repository.
Hi Mathieu - I hope you forgive naive questions from the almost terminally ignorant on this topic, but I'm just trying to understand what, in the most general terms, you're trying to do.
From reading a few of the nt2 webpages, and wikipedia on SSE2, the business of exploiting SIMD capability seems to be in the domain of the compiler. How does this look from a library perspective? What are the mechanisms you'll use/consider?
You mention "in the style of phoenix", so does this mean that as users of your library we would construct the equivalent of phoenix lazy objects, which would then have enough internal intelligence to evaluate in a super-efficient, SIMD context? Thanks - Rob.

On 06/06/2011 12:10, Robert Jones wrote:
From reading a few of the nt2 webpages, and wikipedia on SSE2, the business of exploiting SIMD capability seems to be in the domain of the compiler. How does this look from a library perspective? What are the mechanisms you'll use/consider?
This has been discussed several times in the past on this mailing list already. I suggest you take a look at the Boostcon 2011 presentation: <https://github.com/boostcon/2011_presentations/raw/master/thu/simd.pdf>
You mention "in the style of phoenix", so does this mean that as users of your library we would construct the equivalent of phoenix lazy objects, which would then have enough internal intelligence to evaluate in a super-efficient, SIMD context?
Mathieu meant that the library layout is similar to that of Phoenix in that there are several more-or-less independent components. Phoenix has the core, the operator, the bind and the scope components (and probably others I'm forgetting). It's also the case for Boost.SIMD where functions are grouped together into components. That notwithstanding, Boost.SIMD does use expression templates and lazy objects in a way that is a bit similar to Phoenix.
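For readers who have not seen the slides, a minimal, self-contained illustration of the general idea (plain SSE intrinsics wrapped in a value type with ordinary operators; a sketch of the concept only, not the Boost.SIMD API itself):

    #include <xmmintrin.h>

    // A toy "pack of 4 floats": one SSE register behind ordinary operators.
    struct pack4f
    {
        __m128 data;
        explicit pack4f(const float* p) : data(_mm_loadu_ps(p)) {}   // load 4 floats
        pack4f(__m128 d) : data(d) {}
        void store(float* p) const { _mm_storeu_ps(p, data); }       // store 4 floats
    };

    inline pack4f operator+(pack4f a, pack4f b) { return pack4f(_mm_add_ps(a.data, b.data)); }
    inline pack4f operator*(pack4f a, pack4f b) { return pack4f(_mm_mul_ps(a.data, b.data)); }

    // y[i] = a*x[i] + y[i], four floats per iteration (scalar tail omitted).
    void axpy4(float a, const float* x, float* y, unsigned n)
    {
        pack4f av(_mm_set1_ps(a));
        for (unsigned i = 0; i + 4 <= n; i += 4)
            (av * pack4f(&x[i]) + pack4f(&y[i])).store(&y[i]);
    }

The real library generalizes this across element types, vector widths and instruction sets, and, as said above, uses expression templates rather than evaluating each operator eagerly.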

On Mon, Jun 6, 2011 at 7:04 AM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Mathieu meant that the library layout is similar to that of Phoenix in that there are several more-or-less independent components. Phoenix has the core, the operator, the bind and the scope components (and probably others I'm forgetting). It's also the case for Boost.SIMD where functions are grouped together into components.
That notwithstanding, Boost.SIMD does use expression templates and lazy objects in a way that is a bit similar to Phoenix.
Does that mean that Boost.SIMD expressions can be used in Boost.Spirit grammars?

On Mon, Jun 6, 2011 at 7:15 AM, Greg Rubino <bibil.thaysose@gmail.com> wrote:
On Mon, Jun 6, 2011 at 7:04 AM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Mathieu meant that the library layout is similar to that of Phoenix in that there are several more-or-less independent components. Phoenix has the core, the operator, the bind and the scope components (and probably others I'm forgetting). It's also the case for Boost.SIMD where functions are grouped together into components.
That notwithstanding, Boost.SIMD does use expression templates and lazy objects in a way that is a bit similar to Phoenix.
Does that mean that Boost.SIMD expressions can be used in Boost.Spirit grammars?
By that I mean in semantic actions, not as rules. That might not have been as obvious as I thought.

On 06/06/2011 13:17, Greg Rubino wrote:
On Mon, Jun 6, 2011 at 7:15 AM, Greg Rubino<bibil.thaysose@gmail.com> wrote:
On Mon, Jun 6, 2011 at 7:04 AM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Mathieu meant that the library layout is similar to that of Phoenix in that there are several more-or-less independent components. Phoenix has the core, the operator, the bind and the scope components (and probably others I'm forgetting). It's also the case for Boost.SIMD where functions are grouped together into components.
That notwithstanding, Boost.SIMD does use expression templates and lazy objects in a way that is a bit similar to Phoenix.
Does that mean that Boost.SIMD expressions can be used in Boost.Spirit grammars?
By that I mean in semantic actions, not as rules. That might not have been as obvious as I thought.
They're Proto expressions just like Phoenix and Spirit; but I'm not sure what this means exactly with regards to your question.

On Mon, Jun 6, 2011 at 12:04 PM, Mathias Gaunard < mathias.gaunard@ens-lyon.org> wrote:
On 06/06/2011 12:10, Robert Jones wrote:
From reading a few of the nt2 webpages, and wikipedia on SSE2, the
business of exploiting SIMD capability seems to be in the domain of the compiler. How does this look from a library perspective? What are the mechanisms you'll use/consider?
This has been discussed several times in the past on this mailing list already.
I suggest you take a look at the Boostcon 2011 presentation: <https://github.com/boostcon/2011_presentations/raw/master/thu/simd.pdf>
Excellent, just what I was looking for! Many thanks, R.

Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
On 06/06/2011 12:10, Robert Jones wrote:
From reading a few of the nt2 webpages, and wikipedia on SSE2, the business of exploiting SIMD capability seems to be in the domain of the compiler. How does this look from a library perspective? What are the mechanisms you'll use/consider?
This has been discussed several times in the past on this mailing list already.
I suggest you take a look at the Boostcon 2011 presentation: <https://github.com/boostcon/2011_presentations/raw/master/thu/simd.pdf>
I don't think this presentation makes the case for this library. That said, I am very glad you and others are thinking about these problems.

Almost everything the compiler needs to vectorize well that it does not get from most language syntax can be summed up by two concepts: aliasing and alignment.

I don't see how pack<> addresses the aliasing problem in any way that is not similar to simply grabbing local copies of global data or parameters. Various C++ "restrict" extensions already address aliasing. We desperately need something much better than "restrict" in standard C++. Manycore is the future and parallel processing is the new normal.

pack<> does address alignment, but it's overkill. It's also pessimistic. One does not always need aligned data to vectorize, so the conditions placed on pack<> are too restrictive. Furthermore, the alignment information pack<> does convey will likely get lost in the depths of the compiler, leading to suboptimal code generation unless that alignment information is available elsewhere (and it often is).

I think a far more useful design of this library would be providing standard ways to assert certain conditions. For example:

    simd::assert(simd::is_aligned(&v[0], 16))
    for (...) {
    }

(Of course one can do the above today in standard C++, but the above is more readable.) And:

    simd::assert(simd::no_overlap(v, w))
    for (...) {
    }

or even, for the old vectorheads:

    simd::assert(simd::ivdep)
    for (...) {
    }

Provide simple things the compiler can recognize via pattern matching and we'll be a long way toward getting the compiler to autovectorize.

I like simd::allocator to provide certain guarantees for memory managed by containers. That plus some of the asserts described above could help generic code a lot.

Other questions about the library:

What's under the operators on pack<>? Is it assembly code?

I wonder how pack<T> can know the best vector length. That is highly, highly code- and implementation-dependent.

How does simd::where define pack<> elements of the result where the condition is false? Often the best solution is to leave them undefined, but your example seems to require maintaining current values.

How portable is Boost.simd? By portable I mean, how easy is it to move the code from one machine to another and get the same level of performance?

I don't mean to be too discouraging. But a library to do this kind of stuff seems archaic to me. It was archaic when Intel introduced MMX. If possible, I would like to see this evolve into a library to convey information to the compiler. -Dave
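(For reference, hints like the ones proposed above are usually spelled today with compiler-specific extensions; a rough sketch, where __restrict and __builtin_assume_aligned are GCC/Clang-style extensions rather than standard C++, given purely as an illustration of the idea rather than part of the proposal:)

    // No aliasing between a and b, and both known 16-byte aligned:
    // enough information for a decent compiler to vectorize the loop.
    void scale(float* __restrict a, const float* __restrict b, int n)
    {
        float*       ap = static_cast<float*>(__builtin_assume_aligned(a, 16));
        const float* bp = static_cast<const float*>(__builtin_assume_aligned(b, 16));
        for (int i = 0; i < n; ++i)
            ap[i] = 2.0f * bp[i];
    }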

On 10/06/11 15:16, David A. Greene wrote:
I don't think this presentation makes the case for this library. That said, I am very glad you and others are thinking about these problems.
Sorry then
Almost everything the compiler needs to vectorize well that it does not get from most language syntax can be summed up by two concepts: aliasing and alignment.
No. How can a compiler vectorize a function in another binary (.o)? And who is going to vectorize cos and its ilk?
I don't see how pack<> addresses the aliasing problem in any way that is not similar to simply grabbing local copies of global data or parameters. Various C++ "restrict" extensions already address aliasing. We desperately need something much better than "restrict" in standard C++. Manycore is the future and parallel processing is the new normal.
If you had read the slides, you would have seen that pack is the messenger of the whole SIMD range system, which fits right into a *higher level of abstraction* rather than some piggy-backing on the compiler.
pack<> does address alignment, but it's overkill. It's also pessimistic. One does not always need aligned data to vectorize, so the conditions placed on pack<> are too restrictive. Furthermore, the alignment information pack<> does convey will likely get lost in the depths of the compiler, leading to suboptimal code generation unless that alignment information is available elsewhere (and it often is).
Well, my benchmarks disagree with this. See this old post of mine from one year ago about the same subject. If getting 95% of peak performance is pessimistic, then sorry.
I think a far more useful design of this library would be providing standard ways to assert certain conditions. For example:
No. Ranges that accept SIMD operations are a perfect high-level feature. We are writing a library, not a compiler extension.
What's under the operators on pack<>? Is it assembly code?
No, as naked assembly prevents proper inlining and other register-based compiler optimisations. We use whatever intrinsic is available for the current compiler/architecture at hand.
I wonder how pack<T> can know the best vector length. That is highly, highly code- and implementation-dependent.
No. On SSEx machines, SIMD vectors are 128 bits; this means pack<T, 16/sizeof(T)> is optimal, so a simple meta-function finds it.
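(For illustration, such a meta-function can be as small as the following sketch; this is just the computation of the cardinal from a fixed 16-byte register width, not the library's actual code:)

    #include <cstddef>

    // cardinal = register width in bytes / sizeof(T); 16 bytes for the SSE family.
    template <typename T, std::size_t RegisterBytes = 16>
    struct optimal_cardinal
    {
        static const std::size_t value = RegisterBytes / sizeof(T);
    };

    // optimal_cardinal<float>::value  == 4
    // optimal_cardinal<double>::value == 2
    // optimal_cardinal<short>::value  == 8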
How does simd::where define pack<> elements of the result where the condition is false? Often the best solution is to leave them undefined but your example seems to require maintaining current values.
This makes no sense. False is [0 ... 0], true is [~0 ... ~0]. Period. SIMD is all about being branchless, so everything is computed on the whole vector. It seems to me you didn't get that pack is NOT a data container but a layer above SIMD registers that then gets hidden under the concept of a ContiguousRange.
How portable is Boost.simd? By portable I mean, how easy is it to move the code from one machine to another get the same level of performance?
Works on gcc, msvc, sse and altivec, and we started looking at ARM NEON. Most of these have the same level of performance
I don't mean to be too discouraging. But a library to do this kind of stuff seems archaic to me. It was archaic when Intel introduced MMX. If possible, I would like to see this evolve into a library to convey information to the compiler.
I'll keep my archaic stuff giving me a 4x-8x speed-up rather than waiting for the compiler-based solution nobody has been able to give me since 1999... We already had this discussion two years ago, so I am not keen to go over it all again, as it clearly seems you are just retelling the same FUD as last time.

Joel falcou <joel.falcou@gmail.com> writes:
On 10/06/11 15:16, David A. Greene wrote:
Almost everything the compiler needs to vectorize well that it does not get from most language syntax can be summed up by two concepts: aliasing and alignment.
No. How can a compiler vectorize a function in another binary (.o)?
The compiler. Someone had to build that .o.
And who is going to vectorize cos and its ilk?
The library vendor. Lots of vendors produce special vector versions of these.
If you had read the slides, you would have seen that pack is the messenger of the whole SIMD range system, which fits right into a *higher level of abstraction* rather than some piggy-backing on the compiler.
It's not a high level of abstraction. It's a very low level one. Users are barely willing to restructure loops to enable vectorization. Many will be unwilling to rewrite them completely. On the other hand, the data show that they are quite willing to add directives here and there.
pack<> does address alignment, but it's overkill. It's also pessimistic. One does not always need aligned data to vectorize, so the conditions placed on pack<> are too restrictive. Furthermore, the alignment information pack<> does convey will likely get lost in the depths of the compiler, leading to suboptimal code generation unless that alignment information is available elsewhere (and it often is).
Well, my benchmarks disagree with this. See this old post of mine one year ago about the same subject. If getting 95% of peak performances is pessimistic, then sorry.
On what code? It's quite easy to achieve that on something like a DGEMM. DGEMM is also an embarrassingly vectorizable code.
What's under the operators on pack<>? Is it assembly code?
No, as naked assembly prevents proper inlining and other register-based compiler optimisations. We use whatever intrinsic is available for the current compiler/architecture at hand.
That's effectively assembly code.
I wonder how pack<T> can know the best vector length. That is highly, highly code- and implementation-dependent.
No. On SSEx machines, SIMD vectors are 128 bits; this means pack<T, 16/sizeof(T)> is optimal, so a simple meta-function finds it.
No. On SSEx machines, a vector of 32-bit floats can have 1, 2, 3 or 4 elements. Consider AVX. This is _not_ an easy problem to solve. It is not always the right answer to vectorize using the fully available vector length.
How does simd::where define pack<> elements of the result where the condition is false? Often the best solution is to leave them undefined but your example seems to require maintaining current values.
This makes no sense. False is [0 ... 0], true is [~0 ... ~0]. Period. SIMD is all about being branchless, so everything is computed on the whole vector. It seems to me you didn't get that pack is NOT a data container but a layer above SIMD registers that then gets hidden under the concept of a ContiguousRange.
I know what a pack<> is. Perhaps I wasn't clear. If I have an operation (say, negation) under where() in which the even condition elements are true and the odd condition elements are false, what is the produced result for the odd elements of the result vector?
How portable is Boost.simd? By portable I mean, how easy is it to move the code from one machine to another get the same level of performance?
Works on gcc, msvc, sse and altivec, and we started looking at ARM NEON. Most of these have the same level of performance
What happens if you move the code from Nehalem to Barcelona? How about from an NVIDIA GPU to Nehalem?
I'll keep my archaic stuff giving me a 4x-8x speed-up rather than waiting for the compiler-based solution nobody has been able to give me since 1999...
Compilers have been doing this since the '70s. gcc is not an adequate compiler in this respect, but it is slowly getting there.
We already had this discussion two years ago, so i am not keen to go all over again as it clearly seems you are just retelling the same FUD that last time.
It's not FUD. It's my experience. -Dave

On 10/06/11 17:09, David A. Greene wrote:
It's not a high level of abstraction. It's a very low level one. Users are barely willing to restructure loops to enable vectorization. Many will be unwilling to rewrite them completely. On the other hand, the data show that they are quite willing to add directives here and there.
If ranges are not higher level than for loops, I think we can stop discussing right here.
On what code? It's quite easy to achieve that on something like a DGEMM. DGEMM is also an embarrassingly vectorizable code.
Give me one example of non-EP code which needs and can be vectorized.
That's effectively assembly code.
No.
No. On SSEx machines, a vector of 32-bit floats can have 1, 2, 3 or 4 elements.
No, SSE2 __m128 contains 4 floats. Period.
Consider AVX. This is _not_ an easy problem to solve. It is not always the right answer to vectorize using the fully available vector length.
AVX has 256-bit registers and fits 8 floats. Again, what did I miss?
I know what a pack<> is. Perhaps I wasn't clear. If I have an operation (say, negation) under where() in which the even condition elements are true and the odd condition elements are false, what is the produced result for the odd elements of the result vector?
where is ?:. It requires three arguments. I'm tempted to say RTFM. a = c ? b; is not valid code, so neither is where(c,a);. The more this goes on, the more it looks like you didn't read the slides... really.
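(To make the [0 ... 0] / [~0 ... ~0] point concrete, a minimal register-level sketch of what a three-argument select boils down to on SSE; plain intrinsics purely for illustration, not the library's implementation of where:)

    #include <xmmintrin.h>

    // mask lanes are all-ones (true) or all-zeros (false);
    // result = (mask & b) | (~mask & a): b where true, a where false, no branch.
    inline __m128 select(__m128 mask, __m128 b, __m128 a)
    {
        return _mm_or_ps(_mm_and_ps(mask, b), _mm_andnot_ps(mask, a));
    }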
What happens if you move the code from Nehalem to Barcelona? How about from an NVIDIA GPU to Nehalem?
Where did I say this stuff targeted GPUs? This is a friggin strawman. We address in-CPU vectorization; this is the scope of the library. Period again. We don't claim to solve arbitrary data-parallelism problems and we never did. You are again recycling the same non-arguments as in your last intervention on this very topic last year.
Compilers have been doing this since the '70's. gcc is not an adequate compiler in this respect, but it is slowly getting there.
MSVC does not, nor does xlC... nor clang... so which compilers take random crap C code and vectorize it automagically?
It's not FUD. It's my experience.
It really is FUD and strawmen.

Joel falcou <joel.falcou@gmail.com> writes:
On 10/06/11 17:09, David A. Greene wrote:
It's not a high level of abstraction. It's a very low level one. Users are barely willing to restructure loops to enable vectorization. Many will be unwilling to rewrite them completely. On the other hand, the data show that they are quite willing to add directives here and there.
If ranges are not higher level than for loops, I think we can stop discussing right here.
The for loops already exist. I'm primarily talking about users vectorizing existing code. For writing new code, I contend that a good compiler, with a few directives here and there, can accomplish the same result as this library and with less programmer effort.
On what code? It's quite easy to achieve that on something like a DGEMM. DGEMM is also an embarrassingly vectorizable code.
Give me one example of non-EP code which needs and can be vectorized.
Many codes are not embarrassingly vectorizable. Compilers have to jump through major loop restructuring and other things to expose the parallelism. Often users have to do it for the compiler, just as they would for this library. Users don't like to manually restructure their code, but they are ok with putting directives in the code to tell the compiler what to do.

A simple example:

    void foo(float *a, float *b, int n)
    {
        for (int i = 0; i < n; ++i)
            a[i] = b[i];
    }

This is not obviously parallel, but with some simple help the user can get the compiler to vectorize it.

Another, less simple case:

    float foo(float *restrict a, int n)
    {
        float result = 0.0;
        for (int i = 0; i < n; ++i)
            result += a[i];
        return result;
    }

This is much less obviously parallel, but good compilers can make it so if the user allows slightly different answers, which they often do.
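(To make the "slightly different answers" point concrete, a rough sketch, with raw SSE intrinsics used purely for illustration, of why the vectorized reduction rounds differently: it keeps four partial sums and combines them at the end, so the additions happen in a different order than in the sequential loop. The n % 4 == 0 assumption just keeps the sketch short.)

    #include <xmmintrin.h>

    float sum_sse(const float* a, int n)   // assumes n % 4 == 0 for brevity
    {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(&a[i]));   // four running partial sums
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    }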
That's effectively assembly code.
No.
Can you explain why not? Assembly code in and of itself is not bad but it raises some maintainability questions. How many different implementations of a particular ISA will the library support?
No. On SSEx machines, a vector of 32-bit floats can have 1, 2, 3 or 4 elements.
No, SSE2 __m128 contains 4 floats. Period.
4 floats are available. That does not mean one always wants to use all of them. Heck, it's often the case one wants to use none of them.
Consider AVX. This is _not_ an easy problem to solve. It is not always the right answer to vectorize using the fully available vector length.
AVX has 256-bit registers and fits 8 floats. Again, what did I miss?
The fact that the implementation of 256-bit operations may really stink on a particular microarchitecture and the fact that integer operations are not 256 bits so that one has to make a difficult tradeoff for loops with mixed integer/floating point computation. This happens a lot.
I know what a pack<> is. Perhaps I wasn't clear. If I have an operation (say, negation) under where() in which the even condition elements are true and the odd condition elements are false, what is the produced result for the odd elements of the result vector?
where is ?:. It requires three arguments. I'm tempted to say RTFM.
a = c ? b; is not valid code, so neither is where(c,a);
Ah, I misread the slide. My example would look like this in boost.simd:

    a = c ? -b : a

So effectively, the result retains the original values. I was previously thinking more along the lines of vector predication. I guess in the context of this library, your semantics make sense.
What happens if you move the code from Nehalem to Barcelona? How about from an NVIDIA GPU to Nehalem?
Where did I say this stuff targeted GPUs?
I'm demonstrating what I mean by "performance portable." Substitute "GPU" with any CPU sufficiently different from the baseline.
You are again recycling the same non-arguments as in your last intervention on this very topic last year.
Sorry, my memory is not as good as it used to be. :) I'm not sure what you're referring to.
Compilers have been doing this since the '70's. gcc is not an adequate compiler in this respect, but it is slowly getting there.
MSVC does not, nor does xlC... nor clang... so which compilers take random crap C code and vectorize it automagically?
Intel and PGI. -Dave

On 10/06/11 18:05, David A. Greene wrote:
For writing new code, I contend that a good compiler, with a few directives here and there, can accomplish the same result as this library and with less programmer effort.
I don't really think so, especially if you take your "portable" stuff into the equation.
A simple example:
void foo(float *a, float *b, int n) { for (int i = 0; i< n; ++i) a[i] = b[i]; }
This is not obviously parallel but with some simple help the user can get the compiler to vectorize it.
Seriously, are you kidding me? This is a friggin for_all... You cannot get more embarrassingly parallel.
Another less simple case:
And this one is accumulate; they are about the most basic EP examples you can get.
This is much less obviously parallel, but good compilers can make it so if the user allows slightly different answers, which they often do.
Yeah, and any brain-dead developer can write the proper boost::accumulate( simd::range(v), 0. ) to get it right. So, who's the compiler's daddy here?
Can you explain why not? Assembly code in and of itself is not bad but it raises some maintainability questions. How many different implementations of a particular ISA will the library support?
Because they are C functions, maybe :E Currently we support the whole SSEx family, all the AMD-specific stuff, and Altivec for PPC and Cell, and we have a protocol to extend that.
4 floats are available. That does not mean one always wants to use all of them. Heck, it's often the case one wants to use none of them.
Not using all the elements in a SIMD vector is Doing It Wrong.
I'm demonstrating what I mean by "performance portable." Substitute "GPU" with any CPU sufficiently different from the baseline.
I wish you would read about what "library scope" and "rationale" mean. Are you complaining that Boost.MPI doesn't cover GPUs too?
Intel and PGI.
OK, what do the guys on machines not supported by Intel or PGI do? Cry blood? Someone is trolling someone hard here, I suppose...

Joel falcou <joel.falcou@gmail.com> writes:
On 10/06/11 18:05, David A. Greene wrote:
For writing new code, I contend that a good compiler, with a few directives here and there, can accomplish the same result as this library and with less programmer effort.
I don't really think so, especially if you take your "portable" stuff into the equation.
I have seen it done.
A simple example:
void foo(float *a, float *b, int n) { for (int i = 0; i< n; ++i) a[i] = b[i]; }
This is not obviously parallel but with some simple help the user can get the compiler to vectorize it.
Seriously, are you kidding me? This is a friggin for_all... You cannot get more embarrassingly parallel.
No, it's not obviously parallel. Consider aliasing.
Another less simple case:
And this one is accumulate; they are about the most basic EP examples you can get.
IF the user allows differences in answers. Sometimes users are very picky.
This is much less obviously parallel, but good compilers can make it so if the user allows slightly different answers, which they often do.
Yeah, and any brain-dead developer can write the proper
boost::accumulate( simd::range(v), 0. )
to get it right.
What's right? I want bit-reproducibility with that IBM machine from 10 years ago. Yes, in cases like that one would simply not use boost.simd and would tell the compiler not to vectorize. I'm trying to point out questions and problems that arise in production systems.
So, who's the compiler's daddy here ?
Er?
Can you explain why not? Assembly code in and of itself is not bad but it raises some maintainability questions. How many different implementations of a particular ISA will the library support?
Because they are C functions, maybe :E
What's the difference between:

    ADDPD XMM0, XMM1

and

    XMM0 = __builtin_ia32_addpd (XMM0, XMM1)

I would contend nothing, from a programming effort perspective.
Currently we support the whole SSEx family, all the AMD-specific stuff, and Altivec for PPC and Cell, and we have a protocol to extend that.
How many different implementations of DGEMM do you have for x86? I have seen libraries with 10-20.

Ok, that's a bit unfair. You are not trying to reproduce BLAS or anything. But let's say someone wants to write DGEMM. He or she has a couple of options:

- Write it in assembler. Note that the programmer will have to take into account various combinations of matrix size and alignment, target microarchitecture and ISA, and will probably have to code many different versions.

- Write it using the operator overloads provided by boost.simd. Note that the programmer will have to take into account various combinations of matrix size and alignment, target microarchitecture and ISA, and will probably have to code many different versions.

- Write just one version using either of the above. It will work reasonably well in many cases and completely stink in others.

- Use an autotuning framework that generates many different variants by exploiting the abilities of a vectorizing compiler.

I'm sure there are other options, but these are the most common approaches. Everyone in the industry is moving to the last option.
4 floats are available. That does not mean one always wants to use all of them. Heck, it's often the case one wants to use none of them.
Not using all the elements in a SIMD vector is Doing It Wrong.
I have seen cases in real code where using all of the elements is exactly the wrong thing to do. How would I express this in boost.simd? What happens when I move that code to another implementation where using all of the elements in a vector is exactly the right thing to do?
I'm demonstrating what I mean by "performance portable." Substitute "GPU" with any CPU sufficiently different from the baseline.
I wish you would read about what "library scope" and "rationale" mean. Are you complaining that Boost.MPI doesn't cover GPUs too?
MPI is a completely different focus and you know it. Your rationale, as I understand it, is to make exploiting data parallelism simpler. That's good! We need more of that. I am trying to explain that simply using vector instructions is usually not enough. Vectorization is hard. Not the mechanics, that's relatively easy. Getting the performance out is a lot of work. That is where most of the effort in vectorizing compilers goes. Intel and PGI compilers are not better than gcc because they can vectorize and gcc cannot. gcc can vectorize just fine. Intel and PGI compilers are better than gcc because they understand how to restructure code at a high level and they have been taught when (and when not!) to vectorize and how to best use the vector hardware. That is something not easily captured in a library.
Intel and PGI.
OK, what do the guys on machines not supported by Intel or PGI do? Cry blood?
If boost.simd is targeted to users who have subpar compilers, that's fine. But please don't go around telling people that compilers can't vectorize and parallelize. That's simply not true.

Boost.simd could be useful to vendors providing vectorized versions of their libraries. These are cases where for some reason or other the compiler can't be convinced to generate the absolute best code. That happens, and I can see some portability benefits from Boost.simd there. But it is not as easy as you make it out to be, and I don't think Boost.simd should be sold as a preferred, general way for everyday programmers to exploit data parallelism.

I have seen too many cases where programmers wrote an "obviously better" vector implementation of a loop, only to have someone else rewrite it in scalar so the compiler could properly vectorize it. -Dave

On 11/06/2011 02:08, David A. Greene wrote:
What's the difference between:
ADDPD XMM0, XMM1
and
XMM0 = __builtin_ia32_addpd (XMM0, XMM1)
I would contend nothing, from a programming effort perspective.
Register allocation.
Currently we support the whole SSEx family, all the AMD-specific stuff, and Altivec for PPC and Cell, and we have a protocol to extend that.
How many different implementations of DGEMM do you have for x86? I have seen libraries with 10-20.
That's because they don't have generic programming, which would allow them to generate all variants with a single generic core and some meta-programming. We work with the LAPACK people, and some of them have realized that the things we do with metaprogramming could be very interesting to them, but we haven't had any research opportunity to start a project on this yet.
Ok, that's a bit unfair. You are not trying to reproduce BLAS or anything. But let's say someone wants to write DGEMM. He or she has a couple of options:
We gave it a quick try, we were slower, so we didn't look into it too much. We may attack it again some other day. For now we consider the versions that exist elsewhere fast enough.
- Write it using the operator overloads provided by boost.simd. Note that the programmer will have to take into account various combinations of matrix size and alignment, target microarchitecture and ISA and will probably have to code many different versions.
Shouldn't you just need the cache line size? This is something we provide as well. Ideally you shouldn't need anything else that cannot be made architecture-agnostic. And as I said, you should make the properties on size (and even alignment if you really care) a template parameter, so as to be able to dispatch it to relevant bits at compile-time...
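(A minimal sketch of that kind of compile-time dispatch, with the alignment property carried as a template parameter; the policy names below are made up for illustration, not taken from the library:)

    #include <xmmintrin.h>

    // Alignment encoded in the type: the right load instruction is chosen
    // at compile time, with no runtime branch.
    template <bool Aligned> struct load_policy;

    template <> struct load_policy<true>
    {
        static __m128 load(const float* p) { return _mm_load_ps(p); }   // aligned load
    };

    template <> struct load_policy<false>
    {
        static __m128 load(const float* p) { return _mm_loadu_ps(p); }  // unaligned load
    };

    template <bool Aligned>
    __m128 load4(const float* p)
    {
        return load_policy<Aligned>::load(p);   // resolved at compile time
    }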
- Use an autotuning framework that generates many different variants by exploiting the abilities of a vectorizing compiler.
C++ metaprogramming *is* an autotuning framework. Except there is significantly less effort in writing a library than in writing a compiler, and a library is not tied to a particular compiler, which is a great advantage.
I have seen cases in real code where using all of the elements is exactly the wrong thing to do. How would I express this in boost.simd?
Ignore the elements you don't care about?
Your rationale, as I understand it, is to make exploiting data parallelism simpler.
No it isn't. Its goal is to provide a SIMD abstraction layer. It's an infrastructure library to build other libraries. It is still fairly low-level. Making data parallelism simpler is the goal of NT2. And we do that by removing loops and pointers entirely. There is more to data parallelism than SIMD. It's just one of the building blocks.
Intel and PGI compilers are not better than gcc because they can vectorize and gcc cannot. gcc can vectorize just fine. Intel and PGI compilers are better than gcc because they understand how to restructure code at a high level and they have been taught when (and when not!) to vectorize and how to best use the vector hardware. That is something not easily captured in a library.
Again, we're not interested in automatic restructuring and vectorization of arbitrary code. It's an interface with which the programmer can explicitly structure his code for vectorization. I don't want to write random loops and have my compiler parallelize them when it happens to find some that are parallelizable; I want a tool that helps me write my code in the way I need to in order for it to generate vectorized instructions.
Intel and PGI.
OK, what do the guys on machines not supported by Intel or PGI do? Cry blood?
If boost.simd is targeted to users who have subpar compilers
Compilers other than Intel or PGI are subpar compilers? Maybe if you live in a very secluded world. They may be good at vectorization, but they're not even that good at C++.
But please don't go around telling people that compilers can't vectorize and parallelize. That's simply not true.
Run the trivial accumulate test? The littlest of things can prevent them from vectorizing. Sure, if you add a few restrict there, a few pragmas elsewhere, some specific compiling options tied to floating point, you might be able to get the system to kick in.

But my personal belief is that automatic parallelization of arbitrary code is an approach doomed to failure. Programming is about making things explicit using the right language for the task. The language of classical loops in C with pointers is ill-suited to describe operations that can be evaluated in parallel. That approach doesn't seem to be nearly as successful as tools and languages for explicit parallelization, be it for task-parallel or data-parallel problems. Fortran is a bit better, but not quite there. Matlab, as a language, is more interesting.
Boost.simd could be useful to vendors providing vectorized versions of their libraries.
Not all fast libraries need to be provided by hardware vendors.
I have seen too many cases where programmers wrote an "obviously better" vector implementation of a loop, only to have someone else rewrite it in scalar so the compiler could properly vectorize it.
Maybe if the compiler was really that good, it could still do the optimization when vectors are involved?

Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
On 11/06/2011 02:08, David A. Greene wrote:
What's the difference between:
ADDPD XMM0, XMM1
and
XMM0 = __builtin_ia32_addpd (XMM0, XMM1)
I would contend nothing, from a programming effort perspective.
Register allocation.
But that's not where the difficult work is.
Currently we support the whole SSEx family, all the AMD-specific stuff, and Altivec for PPC and Cell, and we have a protocol to extend that.
How many different implementations of DGEMM do you have for x86? I have seen libraries with 10-20.
That's because they don't have generic programming, which would allow them to generate all variants with a single generic core and some meta-programming.
No. No, no, no. These implementations are vastly different. It's not simply a matter of changing vector length.
We work with the LAPACK people, and some of them have realized that the things we do with metaprogramming could be very interesting to them, but we haven't had any research opportunity to start a project on this yet.
I'm not saying boost.simd is never useful. I'm saying the claims made about it seem overblown.
- Write it using the operator overloads provided by boost.simd. Note that the programmer will have to take into account various combinations of matrix size and alignment, target microarchitecture and ISA and will probably have to code many different versions.
Shouldn't you just need the cache line size? This is something we provide as well.
Nope. It's a LOT more complicated than that.
Ideally you shouldn't need anything else that cannot be made architecture-agnostic.
What's the right vector length? That alone depends heavily on the microarchitecture. And as I noted above, this is one of the simpler questions.
And as I said, you should make the properties on size (and even alignment if you really care) a template parameter, so as to be able to dispatch it to relevant bits at compile-time...
Yes, I can see how that would be useful. It will cover a lot of cases. But not everything. And that's ok, as long as the library documentation spells that out.
C++ metaprogramming *is* an autotuning framework.
To a degree. How do you do different loop restructurings using the library?
Your rationale, as I understand it, is to make exploiting data parallelism simpler.
No it isn't. Its goal is to provide a SIMD abstraction layer. It's an infrastructure library to build other libraries. It is still fairly low-level.
Ok, that makes more sense.
Intel and PGI.
OK, what do the guys on machines not supported by Intel or PGI do? Cry blood?
If boost.simd is targeted to users who have subpar compilers
Compilers other than Intel or PGI are subpar compilers? Maybe if you live in a very secluded world.
No, not every compiler is subpar. But many are.
But please don't go around telling people that compilers can't vectorize and parallelize. That's simply not true.
Run the trivial accumulate test?
Vectorized.
The littlest of things can prevent them from vectorizing. Sure, if you add a few restrict there, a few pragmas elsewhere, some specific compiling options tied to floating point, you might be able to get the system to kick in.
Yep. And that's a LOT easier than hand-restructuring loops and writing vector code manually.
But my personal belief is that automatic parallelization of arbitrary code is an approach doomed to failure.
Then HPC has been failing for 30 years.
Programming is about making things explicit using the right language for the task.
Programming is about programmer productivity.
Boost.simd could be useful to vendors providing vectorized versions of their libraries.
Not all fast libraries need to be provided by hardware vendors.
No, not all. In most other cases, though, the compiler should do it.
I have seen too many cases where programmers wrote an "obviously better" vector implementation of a loop, only to have someone else rewrite it in scalar so the compiler could properly vectorize it.
Maybe if the compiler was really that good, it could still do the optimization when vectors are involved?
No, because information has been lost at that point. -Dave

On 11/06/2011 17:42, David A. Greene wrote:
Mathias Gaunard<mathias.gaunard@ens-lyon.org> writes:
Register allocation.
But that's not where the difficult work is.
Right, NP-complete problems are not difficult. It's not really a problem when you're doing a small function in isolation, but we want all the functions to be inlineable (and most of them to be inlined), and we don't know in advance whether we need to copy the operands, which registers will be used and which will be available, etc.
Currently we support the whole SSEx family, all the AMD-specific stuff, and Altivec for PPC and Cell, and we have a protocol to extend that.
How many different implementations of DGEMM do you have for x86? I have seen libraries with 10-20.
That's because they don't have generic programming, which would allow them to generate all variants with a single generic core and some meta-programming.
No. No, no, no. These implementations are vastly different. It's not simply a matter of changing vector length.
We work with the LAPACK people, and some of them have realized that the things we do with metaprogramming could be very interesting to them, but we haven't had any research opportunity to start a project on this yet.
I'm not saying boost.simd is never useful. I'm saying the claims made about it seem overblown.
What I was saying about the usage of meta-programming applied to the writing of adaptive and fast linear algebra primitives is completely unrelated to Boost.SIMD, although it could use it.
- Write it using the operator overloads provided by boost.simd. Note that the programmer will have to take into account various combinations of matrix size and alignment, target microarchitecture and ISA and will probably have to code many different versions.
Shouldn't you just need the cache line size? This is something we provide as well.
Nope. It's a LOT more complicated than that.
Well, as far as I know, the only platform-specific stuff you do for matrix multiplication apart from vectorization is loop tiling. Can your magic compiler guarantee it will do a perfect job at this, with a cache size only known at runtime?
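(For readers unfamiliar with the term, a bare-bones sketch of loop tiling for C += A*B with square N-by-N row-major matrices; the tile size T is the cache-derived parameter in question. Untuned, purely illustrative:)

    // Process the matrices in T x T blocks so that the working set of each
    // block stays cache-resident; the i-k-j order keeps the inner loop streaming.
    void matmul_tiled(const float* A, const float* B, float* C, int N, int T)
    {
        for (int ii = 0; ii < N; ii += T)
            for (int kk = 0; kk < N; kk += T)
                for (int jj = 0; jj < N; jj += T)
                    for (int i = ii; i < ii + T && i < N; ++i)
                        for (int k = kk; k < kk + T && k < N; ++k)
                        {
                            float a = A[i * N + k];
                            for (int j = jj; j < jj + T && j < N; ++j)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }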
C++ metaprogramming *is* a autotuning framework.
To a degree. How do you do different loop restructurings using the library?
I suggest you read some basic literature on what you can do with templates, in particular Todd Veldhuizen's "Active Libraries: Rethinking the roles of compilers and libraries" <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.8031&rep=rep1&type=pdf>
The littlest of things can prevent them from vectorizing. Sure, if you add a few restrict there, a few pragmas elsewhere, some specific compiling options tied to floating point, you might be able to get the system to kick in.
Yep. And that's a LOT easier than hand-restructuring loops and writing vector code manually.
I may not want to set the floating point options for my whole translation unit. And if I had some pragmas just around the bit I want, that starts to look like explicit vectorization. Our goal with Boost.SIMD is not to write vector code manually (you don't go down to the instruction level), but rather to let you make vectorization explicit (you describe a set of operations on operands whose types are vectors).
But my personal belief is that automatic parallelization of arbitrary code is an approach doomed to failure.
Then HPC has been failing for 30 years.
That's funny, because a huge portion of HPC people seems to be busy recoding stuff for multicore and GPU. How come they have to rewrite it, since we have automatic parallelization solutions? Surely they can just input their old C code to their compiler and get optimal SIMD+OpenMP+MPI+CUDA as output. Their monolithic do-it-all state-of-the-art compiler provided by their hardware vendor takes care of everything they should need, and is able to predict all of the best solutions, right?

Actually, I suppose it works fairly well with old Fortran code. Yet all of these people found reasons that made them want to use the tools themselves directly. And CUDA, which is arguably hugely popular these days, requires people to write their algorithms in terms of the kernel abstraction. It wouldn't be able to just work well with arbitrary C code. And even then, I understand it's still fairly complicated for the compiler to make this work well, despite the imposed coding paradigms.

Automatic parallelization solutions are nice to have. Like all compiler optimizations, it's a best-effort thing. But when you really need the power, you've got to go get it yourself. At least that's my opinion.
Programming is about making things explicit using the right language for the task.
Programming is about programmer productivity.
Productivity implies a product. What matters in a product is that it fulfills the requirements. If my requirements are to use some hardware -- not necessarily limited to a single architecture -- to the best of its abilities, I'm better off describing my code in the way most fit for that hardware than betting everything on the hope that the automatic code restructuring of the compiler will allow me to do that.

When compilers start to guarantee some optimizations, maybe it will change. But the feedback I gathered from compiler people is that they could not guarantee certain types of transformations would be consistently applied regardless of data size; running the passes in a different order could yield better or worse results, same with running them multiple times, etc. Compilers just give "some" optimization; there is no formalism behind them that can prove certain patterns will always be reduced. Having a fully optimized program still requires doing it explicitly. We're at a state where we even have to force inlining in some cases, because even with the inline specifier some compilers do not do the right thing.
Boost.simd could be useful to vendors providing vectorized versions of their libraries.
Not all fast libraries need to be provided by hardware vendors.
No, not all. In most other cases, though, the compiler should do it.
Monolithic designs are bad. Some people specialize in specific things, and they should be the providers for that thing.
I have seen too many cases where programmers wrote an "obviously better" vector implementation of a loop, only to have someone else rewrite it in scalar so the compiler could properly vectorize it.
Maybe if the compiler was really that good, it could still do the optimization when vectors are involved?
No, because information has been lost at that point.
How so? What information? There is no assembly involved; the compiler still has full knowledge. There is no reason why the compiler couldn't tell that

    __m128 a, b, c;
    c = _mm_add_ps(a, b); // probably calls __builtin_ia32_addps(a, b)

is the same as

    float a, b, c;
    c = a + b;

except it does it four floats at a time.

Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
Shouldn't you just need the cache line size? This is something we provide as well.
Nope. It's a LOT more complicated than that.
Well, as far as I know, the only platform-specific stuff you do for matrix multiplication apart from vectorization is loop tiling.
Vector length matters. Instruction selection matters. Prefetching matters.
Can your magic compiler guarantee it will do a perfect job at this, with a cache size only known at runtime?
No one can ever guarantee "perfect." But the compiler should aim to reduce the programming burden.
To a degree. How do you do different loop restructurings using the library?
I suggest you read some basic literature on what you can do with templates, in particular Todd Veldhuizen's "Active Libraries: Rethinking the roles of compilers and libraries"
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.8031&rep=rep1&type=pdf>
Thanks for the pointer! Printed out and ready to digest. :)
And if I had some pragmas just around the bit I want, that starts to look like explicit vectorization.
But not like boost.simd as the actual algorithm code doesn't get touched.
Our goal with Boost.SIMD is not to write vector code manually (you don't go down to the instruction level), but rather to let you make vectorization explicit (you describe a set of operations on operands whose types are vectors).
But again, that's not always the right thing to do. Often one only wants to partially vectorize a loop, for example. I'm sure boost.simd can represent that, too, but it is another very platform-dependent choice.
But my personal belief is that automatic parallelization of arbitrary code is an approach doomed to failure.
Then HPC has been failing for 30 years.
That's funny, because a huge portion of HPC people seems to be busy recoding stuff for multicore and GPU.
For the most part, they are FINALLY moving away from pure MPI code and starting to use things like OpenMP. For the GPU, good compiler solutions have only just begun to appear -- there's always a lag in tool availability for new architectures. But GPUs aren't all that new anyway. They are "just" vector machines. But I don't know of many HPC users who want to yet again restructure their loops. At best you can get them to restructure again IF you can guarantee them that the restructured code will run well on any machine. That's why performance portability is critical. HPC users have had it with rewriting code.
How come they have to rewrite it since we have automatic parallelization solutions? Surely they can just input their old C code to their compiler and get optimal SIMD+OpenMP+MPI+CUDA as output.
If the algorithm has already been written to vectorize, it should map to the GPU just fine. If not, it will have to be restructured anyway, which is generally beyond the capability of compilers or boost.simd, though many times loop directives can just tell the compiler what to do.
Their monolithic do-it-all state-of-the-art compiler provided by their hardware vendor takes care of everything they should need, and is able to predict all of the best solutions, right?
In many cases, yes. Of course, some level of user directive helps a lot. It's very hard for the compiler to automatically decide which kernels should run on a GPU, for example. But the user just needs a directive here or there to specify that.
And CUDA, which is arguably hugely popular these days, requires people to write their algorithms in terms of the kernel abstraction.
Yuck, yuck, yuck! :) CUDA was a fine technology when GPUs became popular, but it is being replaced quickly by things like the PGI GPU directives. The pattern is: do it manually first, then use compiler directives (designed based on learning from the manual approach), then have the compiler do it automatically (not possible in general, but often very effective in certain cases).
Automatic parallelization solutions are nice to have. Like all compiler optimizations, it's a best effort thing. But when you really need the power, you've got to go get it yourself. At least that's my opinion.
Sometimes, yes. But I would say with a good vectorizing compiler that is extremely rare. Given an average compiler, I certainly see the goodness in boost.simd.
If my requirements are to use some hardware -- not necessarily limited to a single architecture -- to the best of its abilities, I'm better off describing my code in the way most fit for that hardware than betting everything on the hope that the automatic code restructuring of the compiler will allow me to do that.
I think it's a combination of both. Overspecification often hampers the compiler's ability to generate good code. Things should be specified at the "right" level given the available compiler capabilities. Of course, the definition of "right" varies widely from implementation to implementation.
Maybe if the compiler was really that good, it could still do the optimization when vectors are involved?
No, because information has been lost at that point.
How so? What information?
Unrolling is a good example. A hand-unrolled and scheduled loop is very difficult to "re-roll." That has all sorts of implications for vector code generation. -Dave
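(For illustration, the kind of loop being described: hand-unrolled by four with separate accumulators, "scheduled" by the programmer. A compiler seeing only this form has a much harder time re-rolling it and re-vectorizing at a different width than it would with the plain "s += a[i]" loop. Sketch only.)

    float sum_unrolled(const float* a, int n)
    {
        float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
        int i = 0;
        for (; i + 4 <= n; i += 4)   // manually unrolled main loop
        {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; ++i)           // scalar tail
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }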

Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
Making data parallelism simpler is the goal of NT2. And we do that by removing loops and pointers entirely.
First off, I want to apologize for sparking some emotions. That was not my intent. I am deeply sorry for not expressing myself well. NT2 sounds very interesting! Does it generate the loops given calls into generic code? Let me try to clarify my thinking about boost.simd a bit. - There is a place for boost.simd. It's important that I emphasize this upfront. - Here is how I see a typical programmer using it, given a loop nest: - Programmer tries to run the compiler on it, examines code - Code sometimes (maybe most of the time) executes poorly - If not, done - Programmer restructures loop nest to expose parallelism - Try compiler directives first, if available (tell compiler which loops to interchange, where to cache block, blocking factors, which loops to collapse, etc.) - Otherwise, hand-restructure (ouch!) - Programmer tries compiler again on restructured loop nest - Code may execute poorly - If not, done - Programmer adds directives to tell the compiler which loops to vectorize, which to leave scalar, etc. - Code may still execute poorly - If not, done - Programmer uses boost.simd to write vector code at a higher level than provided compiler intrinsics Does that seem like a reasonable use case? -Dave

On 11/06/11 11:17, David A. Greene wrote:
Mathias Gaunard<mathias.gaunard@ens-lyon.org> writes:
Making data parallelism simpler is the goal of NT2. And we do that by removing loops and pointers entirely.
First off, I want to apologize for sparking some emotions. That was not my intent. I am deeply sorry for not expressing myself well.
We all fall for blatant miscommunication there I guess ;)
NT2 sounds very interesting! Does it generate the loops given calls into generic code?
Basically yes. You express container-based, semantic-driven code using a Matlab-like syntax (plus more in cases where Matlab doesn't provide anything suitable), and the various evaluation points generate loop nests with properties derived from the information carried by the container type and its settings (storage order, data-sharing status, etc.).

The evaluation is then done by forwarding the expression to a hierarchical layer of architecture-dependent meta-programs that, at each step, strip the expression of its important high-level semantic information and help generate the proper piece of code. Boost.SIMD is used at the very deepest level of the CPU-based generation to handle the variability of SIMD ISAs.

I assume the rest of the discussion is about a program written with the correct algorithm in terms of complexity, right?
- Programmer tries to run the compiler on it, examines code - Code sometimes (maybe most of the time) executes poorly - If not, done
Yes.
- Programmer restructures loop nest to expose parallelism - Try compiler directives first, if available (tell compiler which loops to interchange, where to cache block, blocking factors, which loops to collapse, etc.) - Otherwise, hand-restructure (ouch!)
If compilers allow such information to be carried, yes.
- Programmer tries compiler again on restructured loop nest - Code may execute poorly - If not, done
Yes
- Programmer adds directives to tell the compiler which loops to vectorize, which to leave scalar, etc. - Code may still execute poorly - If not, done
Again, provided such a compiler is available on said platform
- Programmer uses boost.simd to write vector code at a higher level than provided compiler intrinsics
Yes and using a proper range based interface instead of a mess of for loops.
Does that seem like a reasonable use case?
Yes. What we failed to clarify is that for a large share of people, the compilers available on their systems provide no way to do steps #2 and #3. For those people, they are on their own in dealing with this.

Joel falcou <joel.falcou@gmail.com> writes:
On 11/06/11 11:17, David A. Greene wrote:
Mathias Gaunard<mathias.gaunard@ens-lyon.org> writes:
Making data parallelism simpler is the goal of NT2. And we do that by removing loops and pointers entirely.
First off, I want to apologize for sparking some emotions. That was not my intent. I am deeply sorry for not expressing myself well.
We all fall for blatant miscommunication there I guess ;)
NT2 sounds very interesting! Does it generate the loops given calls into generic code?
Basically yes: you express container-based, semantics-driven code using a Matlab-like syntax (plus more where Matlab doesn't provide anything suitable), and the various evaluation points generate loop nests with properties derived from the information carried by the container type and its settings (storage order, data-sharing status, etc.).
This is super-cool! Anything to help the programmer restructure code (or generate the loops correctly in the first place) is a huge win.
The evaluation is then done by forwarding the expression to a hierarchy of architecture-dependent meta-programs that, at each step, strip the expression of its important high-level semantic information and help generate the proper piece of code.
You generate machine intrinsics here, yes? This is where I think the compiler might often do better. If the compiler is good. :) It's a little odd that "important" information would be stripped. I know this is not a discussion of NT2 but for the curious, can you explain this? Thanks!
I assume the rest of the discussion is about a program written with the correct algorithm in terms of complexity, right?
By correct algorithm, you mean an algorithm structured to expose data parallelism? If so, yes, I think that's right.
- Programmer tries to run the compiler on it, examines code - Code sometimes (maybe most of the time) executes poorly - If not, done
Yes.
- Programmer restructures loop nest to expose parallelism - Try compiler directives first, if available (tell compiler which loops to interchange, where to cache block, blocking factors, which loops to collapse, etc.) - Otherwise, hand-restructure (ouch!)
If compilers allow such information to be conveyed, yes.
Right. Many don't and in those cases, boost.simd is a great alternative.
- Programmer tries compiler again on restructured loop nest - Code may execute poorly - If not, done
Yes
- Programmer adds directives to tell the compiler which loops to vectorize, which to leave scalar, etc. - Code may still execute poorly - If not, done
Again, provided such a compiler is available on said platform
Of course.
- Programmer uses boost.simd to write vector code at a higher level than provided compiler intrinsics
Yes and using a proper range based interface instead of a mess of for loops.
Yep!
Does that seem like a reasonable use case?
Yes. What we failed to clarify is that for a large share of people, the compilers available on their systems provide no way to do steps #2 and #3. For those people, they are on their own in dealing with this.
Oh absolutely. But I think that such people should be aware that code generated by boost.simd may not be the best for their hardware implementation IF they have access to a good compiler later. In those cases, though, I suppose replacing, say, pack<float> with float everywhere should get most of the original scalar code back. There may yet be a little more cleanup to do but isn't that the case with _every_ HPC code? :) :) :) -Dave

On 11/06/11 12:16, David A. Greene wrote:
You generate machine intrinsics here, yes? This is where I think the compiler might often do better. If the compiler is good. :)
Well, nothing prevents us from saying that on architecture X with compiler Y nothing has to be done, so do nothing. Every arch/compiler/OS combination is an extension point in NT2.
It's a little odd that "important" information would be stripped. I know this is not a discussion of NT2 but for the curious, can you explain this? Thanks!
Not stripped, sorry: exploited where needed, and potentially removed before being passed to the underlying level, where it makes less sense.
Oh absolutely. But I think that such people should be aware that code generated by boost.simd may not be the best for their hardware implementation IF they have access to a good compiler later. In those cases, though, I suppose replacing, say, pack<float> with float everywhere should get most of the original scalar code back. There may yet be a little more cleanup to do but isn't that the case with _every_ HPC code? :) :) :)
The code can be left untouched by saying that on machine X with compiler Y, pack<float> is just whatever it needs to be, and by redirecting pack operators and functions to a simple scalar version. I think we should keep the pack abstraction, even if for some smart compilers the only thing it does is add whatever mark-up/pragmas/information the compiler wants to see. boost.simd is really about letting people who need to mess with this kind of code for whatever reason (and we evoked a few) do so in a platform-independent way with a proper abstraction. If you need more, then a higher-level tool has to do it, and that's what nt2 aims to be.
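For illustration, here is a minimal sketch of what such a scalar fallback could look like. This is not Boost.SIMD code; the names and layout are invented, and the real library dispatches per architecture.

    #include <cstddef>

    // Hypothetical scalar fallback for pack<T>: a fixed-size tuple of T whose
    // operators forward to plain scalar arithmetic. On a target with no SIMD
    // unit (or a compiler we trust to vectorize), N can simply be 1.
    template<typename T, std::size_t N = 1>
    struct pack
    {
        T data[N];
        T&       operator[](std::size_t i)       { return data[i]; }
        T const& operator[](std::size_t i) const { return data[i]; }
    };

    // Element-wise addition: on this fallback it is just N ordinary adds,
    // which the compiler remains free to optimize or vectorize itself.
    template<typename T, std::size_t N>
    pack<T, N> operator+(pack<T, N> const& a, pack<T, N> const& b)
    {
        pack<T, N> r;
        for (std::size_t i = 0; i != N; ++i)
            r[i] = a[i] + b[i];
        return r;
    }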

On 11/06/2011 18:17, David A. Greene wrote:
Mathias Gaunard<mathias.gaunard@ens-lyon.org> writes:
Making data parallelism simpler is the goal of NT2. And we do that by removing loops and pointers entirely.
First off, I want to apologize for sparking some emotions. That was not my intent. I am deeply sorry for not expressing myself well.
NT2 sounds very interesting! Does it generate the loops given calls into generic code?
Basically let's say you write

    table<float> a, b, c, d, e, f;
    f = b + 3 * c + sum(d + e);

A table is an N-dimensional container that can be statically or dynamically sized, or tagged to specify specific layout requirements (sparse matrices, distributed across the network, memory-mapped file, on the gpu...). sum takes a table and returns the sum of all its elements. Other operations here are element-wise, as you would expect. We generate something along the lines of

    float tmp = 0.f;
    for(int i ....) tmp += d[i] + e[i];

    for(int i ...) f[i] = b[i] + 3 * c[i] + tmp;

Except the for loops are parallelized using a collection of various technologies, including but not limited to SIMD. We can also generate GPU kernels from such expressions. We are not considering submitting NT2 to Boost yet, and don't know if we ever will. For now we only want to push for getting the SIMD abstraction layer into Boost.

Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
We generate something along the lines of
float tmp = 0.f; for(int i ....) tmp += d[i] + e[i];
for(int i ...) f[i] = b[i] + 3 * c[i] + tmp;
Will NT2 fuse the loops to get rid of the temporary? Does it do strip-mining or other such things (beyond that needed for vectorization)? Does NT2 try to generate a loop nest with the appropriate loops interchanged to improve performance? Again, these are not criticisms, I'm simply trying to get a better grasp of it. I am really, really interested in this. Abstracting loops for HPC is a really good idea, in my mind. It would be best if there was an option to leave the resulting loops scalar in case the user wants to try to have the compiler vectorize them. -Dave

On 14/06/11 16:38, David A. Greene wrote:
Mathias Gaunard<mathias.gaunard@ens-lyon.org> writes:
We generate something along the lines of
float tmp = 0.f; for(int i ....) tmp += d[i] + e[i];
for(int i ...) f[i] = b[i] + 3 * c[i] + tmp;
Will NT2 fuse the loops to get rid of the temporary? Does it do strip-mining or other such things (beyond that needed for vectorization)? Does NT2 try to generate a loop nest with the appropriate loops interchanged to improve performance?
If the loop size is deducible at compile time (and nt2 containers support such hints), yes. As for the rest, loop interchange could either be deduced if some compile-time hints are given, or extracted, but it usually requires a semantics-driven approach: the type of the container carries hints on how to evaluate its instances properly.

On 14/06/2011 23:38, David A. Greene wrote:
Mathias Gaunard<mathias.gaunard@ens-lyon.org> writes:
We generate something along the lines of
float tmp = 0.f; for(int i ....) tmp += d[i] + e[i];
for(int i ...) f[i] = b[i] + 3 * c[i] + tmp;
Will NT2 fuse the loops to get rid of the temporary?
Exactly how can you fuse the loops here? This is actually an instance of splitting, where we extract things that cannot/shouldn't be done in a single loop (or a single kernel).
Does it do strip-mining or other such things (beyond that needed for vectorization)? Does NT2 try to generate a loop nest with the appropriate loops interchanged to improve performance?
Loops are in the cache-friendly order, obviously. Smarter things are usually only done for higher-level abstractions than simple tables. Loop fusion of different expressions is somewhat limited to what we statically know about the sizes of the tables we're dealing with.
I am really, really interested in this. Abstracting loops for HPC is a really good idea, in my mind. It would be best if there was an option to leave the resulting loops scalar in case the user wants to try to have the compiler vectorize them.
All components of the system are meant to be independent, so that you can only use the part you want.

Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
On 14/06/2011 23:38, David A. Greene wrote:
Will NT2 fuse the loops to get rid of the temporary?
Exactly how can you fuse the loops here?
This is actually an instance of splitting, where we extract things that cannot/shouldn't be done in a single loop (or a single kernel).
Ok. The point of the question was to see what kinds of things NT2 is targeted to do. I'm really thrilled you guys are working on it!
Loops are in the cache-friendly order, obviously. Smarter things are usually only done for higher-level abstractions than simple tables.
Loop fusion of different expressions is somewhat limited to what we statically know about the sizes of the tables we're dealing with.
Sure.
All components of the system are meant to be independent, so that you can only use the part you want.
Great! I hope you seriously consider adding NT2 to your Boostification list. It would make a nice fit with boost.simd and boost.mpi. -Dave

On 15/06/11 14:12, David A. Greene wrote:
Great! I hope you seriously consider adding NT2 to your Boostification list. It would make a nice fit with boost.simd and boost.mpi.
NT2 as a whole probably will not, for various reasons, but efforts are being made so that anything in NT2 is compatible with anything in Boost.

Joel falcou wrote:
On 15/06/11 14:12, David A. Greene wrote:
Great! I hope you seriously consider adding NT2 to your Boostification list. It would make a nice fit with boost.simd and boost.mpi.
NT2 as a whole probably will not, for various reasons, but efforts are being made so that anything in NT2 is compatible with anything in Boost.
Where do I find the latest nt2? Is it http://sourceforge.net/projects/nt2/files/ ?

On 15/06/11 19:13, Neal Becker wrote:
Where do I find the latest nt2? Is it http://sourceforge.net/projects/nt2/files/ ?
We moved to github ;) but the current state is not yet on par with the old one; we're pushing through a release. Meanwhile, IRC #nt2 on freenode or the google group nt2-dev is suitable for any strictly nt2 discussions.

Hi David,
- Programmer tries to run the compiler on it, examines code
- Programmer restructures loop nest to expose parallelism
  - Try compiler directives first, if available (tell compiler which loops to interchange, where to cache block, blocking factors, which loops to collapse, etc.)
- Programmer tries compiler again on restructured loop nest
- Programmer adds directives to tell the compiler which loops to vectorize, which to leave scalar, etc.
- Programmer uses boost.simd to write vector code at a higher level than provided compiler intrinsics
Does that seem like a reasonable use case?
In theory, yes. In practice, the code can stop working when you change a compiler (sometimes just a new version of the same compiler), or when something changes in the code around the loop, not to mention the very common case where somebody makes a simple and innocent-looking change to the loop code and the auto-vectorizer silently switches off. I think explicitly using Boost.SIMD (or any other explicit solution like BLAS/MKL/whatever) is much more robust in practice. Thanks, Maxim

MaximYanchenko <MaximYanchenko@yandex.ru> writes:
Does that seem like a reasonable use case?
In theory, yes. In practice, the code can stop working when you change a compiler (sometimes just a new version of the same compiler)
Oh yes. We are not immune to introducing bugs. :)
or when something changes in the code around the loop, not to mention the very common case where somebody makes a simple and innocent-looking change to the loop code and the auto-vectorizer silently switches off.
So "silently" sounds to me like a compiler bug. Most vectorizing compilers have ways of telling the user exactly what they did or did not do. In my opinion, a compiler that does not have this capability is of little use to the HPC user. In that case, she should use boost.simd. :)
I think explicitly using Boost.SIMD (or any other explicit solution like BLAS/MKL/whatever) is much more robust in practice.
For certain cases I would agree. But in most cases I would hope the compiler would do just as good a job, if not better. There are always counter-examples. :) -Dave

Hi David,
What's the difference between

    ADDPD XMM0, XMM1

and

    XMM0 = __builtin_ia32_addpd(XMM0, XMM1)

I would contend nothing, from a programming-effort perspective.
If you compare how GCC handles this, you'll see that using any inline asm inside a loop disables virtually every optimization (like loop unrolling), even if you use automatic register allocation in the asm block. If you rewrite the same thing using builtins (almost 1-to-1), the optimization is back. Thanks, Maxim
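To make the contrast concrete, here is a minimal sketch of the two forms being compared, for GCC on an SSE2 target. It is an illustration of the point above, not code from the thread.

    #include <emmintrin.h>

    // Inline-asm version: GCC treats the asm statement as opaque, which tends
    // to block unrolling and scheduling of the surrounding loop.
    static inline __m128d add_asm(__m128d a, __m128d b)
    {
        __asm__("addpd %1, %0" : "+x"(a) : "x"(b));
        return a;
    }

    // Intrinsic version: the compiler sees an ordinary operation it can
    // schedule, unroll and combine with the rest of the loop body.
    static inline __m128d add_intrinsic(__m128d a, __m128d b)
    {
        return _mm_add_pd(a, b);
    }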

David A. Greene wrote:
For writing new code, I contend that a good compiler, with a few directives here and there, can accomplish the same result as this library and with less programmer effort.
That's great, why don't you write one and then there will be one. Let me know when you're done. ;)
Give me one example of non-EP code which needs and can be vectorized.
Many codes are not embarrassingly vectorizable. Compilers have to jump through major loop restructuring and other things to expose the parallelism. Often users have to do it for the compiler, just as they would for this library. Users don't like to manually restructure their code, but they are ok with putting directives in the code to tell the compiler what to do.
A simple example:
void foo(float *a, float *b, int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = b[i];
}
This is not obviously parallel but with some simple help the user can get the compiler to vectorize it.
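As an aside, one hedged illustration of the kind of "simple help" meant here: restrict qualifiers to rule out aliasing, plus a compiler-specific ignore-dependencies pragma (the spelling below is Intel's ivdep pragma; other compilers use different pragmas or none at all).

    // Telling the compiler that a and b do not alias lets it vectorize the
    // copy without a runtime overlap check. __restrict is a common C++
    // extension; the pragma is compiler-specific and shown only as an example.
    void foo(float* __restrict a, float* __restrict b, int n)
    {
    #pragma ivdep
        for (int i = 0; i < n; ++i)
            a[i] = b[i];
    }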
No, the compiler will substitute a couple of moves and a long jump to __memcpy assembly code. memcpy is vectorized by hand. The compiler does not need help of any kind to vectorize this loop, but this is a special case. Also, memcpy is clearly EP. All simple filters that are embarrassingly parallel are some operation done in the middle of what is effectively a memcpy; sometimes the source and destination are the same. You have the same vector loads and stores, and the only difference is that you actually do something with the data while it is in registers. The compiler should never vectorize memcpy when it has the better option of substituting the hand-coded assembly for memcpy. If you somehow tricked it into generating vector code instead of __memcpy you would get worse performance.
MSVC does not, nor does xlC... nor clang... so which compilers take random crap C code and vectorize it automagically?
Intel and PGI.
Even the best vectorizing compilers for C++ are still terrible. I think we should use "good" in an absolute rather than relative sense. Regards, Luke

On 11/06/2011 02:27, Simonson, Lucanus J wrote:
No, the compiler will substitute a couple of moves and a long jump to __memcpy assembly code. memcpy is vectorized by hand. The compiler does not need help of any kind to vectorize this loop, but this is a special case. Also, memcpy is clearly EP. All simple filters that are embarrassingly parallel are some operation done in the middle of what is effectively a memcpy; sometimes the source and destination are the same. You have the same vector loads and stores, and the only difference is that you actually do something with the data while it is in registers. The compiler should never vectorize memcpy when it has the better option of substituting the hand-coded assembly for memcpy. If you somehow tricked it into generating vector code instead of __memcpy you would get worse performance.
With GCC you certainly don't get the same results at all with an element-per-element copy and with memcpy.

"Simonson, Lucanus J" <lucanus.j.simonson@intel.com> writes:
David A. Greene wrote:
For writing new code, I contend that a good compiler, with a few directives here and there, can accomplish the same result as this library and with less programmer effort.
That's great, why don't you write one and then there will be one. Let me know when you're done. ;)
Already done. :)
MSVC does not, nor does xlC... nor clang... so which compilers take random crap C code and vectorize it automagically?
Intel and PGI.
Even the best vectorizing compilers for C++ are still terrible. I think we should use "good" in an absolute rather than relative sense.
No compiler can vectorize things like pointer chasing on the SIMD architectures because the hardware doesn't support it. Likewise, no one using boost.simd will be able to either. If the algorithm is amenable to vectorization, the compiler should be able to get it. -Dave

On 11/06/2011 17:33, David A. Greene wrote:
If the algorithm is amenable to vectorization, the compiler should be able to get it.
This is exactly what I said in the slides: the compiler will be able to vectorize it, but only if it is inherently vectorizable. Designing for vectorization is a much more difficult task, and it needs an explicit programming interface.

David, based on previous threads concerning the Boost.SIMD GSoC project, I think what you're describing is simply a different library than what they're trying to build. I feel I have a good enough understanding of their proposal and it fits real needs that I have--needs which I feel I understand, having heretofore satisfied them with my own, less-refined/elegant solutions.
While what you're describing sounds appealing, I'm finding your description very abstract and I think more work needs to be done to demonstrate whether and how those goals can be realized. Perhaps you might find more support for your ideas if you can make a more convincing case for the feasibility of implementation and the practicality of using such a library. To be honest, I'm actually skeptical that even with the extensions implemented by the majority of Boost-supported compilers, C++ is sufficiently expressive to get very far towards your ideal, though I'd be happy for someone to demonstrate otherwise.
At present, I believe that Boost.SIMD, as previously defined, would provide me with significant benefits. In my opinion, it's not worth holding up this work based on a loosely-defined set of ideals that don't fit its aims. I'm all for improving the state of auto-vectorization, but I want the more direct control that Boost.SIMD can give me, if/when auto-vectorization lets me down. Because they aren't portable and I cannot effectively use them in templates, I do not consider directly using intrinsics an acceptable fall-back.
Finally, consider that perhaps lessons can be learned in the course of developing, porting, refining, and using Boost.SIMD. Also, maybe a higher-level effort can either stack on top of it, or at least reuse some of the code. Therefore, taking the feasibility and practicality of your vision as a given and disregarding the interests of people like myself, there's still potential for plenty of value to be derived from proceeding with Boost.SIMD as presently defined.
As a potential user, I support Joel & Mathieu continuing on their current track. That said, I hope work on explicit vector programming and auto-vectorization will continue. I would be interested and supportive of anyone exploring these areas, but I feel it's unnecessary and not worthwhile to hold up Boost.SIMD in hopes of something better. Matt

"Gruenke, Matt" <mgruenke@Tycoint.com> writes:
As a potential user, I support Joel & Mathieu continuing on their current track. That said, I hope work on explicit vector programming and auto-vectorization will continue. I would be interested and supportive of anyone exploring these areas, but I feel it's unnecessary and not worthwhile to hold up Boost.SIMD in hopes of something better.
I'm not saying we should hold up boost.simd. I'm saying that the authors' claim that the compiler can't do much of this is flat-out wrong. I fear they are going to lead a lot of people down the wrong path. There is a place for boost.simd. Absolutely there is. But the way it's positioned by the linked-to presentation is misleading and very wrong. Vectorizing compilers exist today. They've existed since the 1970s. -Dave

On 11/06/11 10:45, David A. Greene wrote:
Vectorizing compilers exist today. They've existed since the 1970s.
I still think you're mixing up SIMD ISAs and vector machines... As I see it, vector machines as in http://en.wikipedia.org/wiki/SIMD#Chronology have existed since the 1970s. But what you seem determined not to acknowledge is that we don't care about that type of machine; we focus on http://en.wikipedia.org/wiki/SIMD#Hardware, which has existed since 1994 or so and for which the use cases, idioms and techniques are completely different. So yes, auto-vectorization on huge Cray systems is done automatically, with results I really don't know about and don't much care about, as that is not our target audience, but I concede it may well be good.
Now, speaking strictly of SIMD ISAs in the x86/PPC family: no, auto-vectorization is not that good and still requires manual input, function libraries and so forth. So in this case our claims hold, and boost.simd has to be seen as a set of enabling tools to help people write generic code that can be vectorized. Note that template meta-programming as we use it does not prevent the compiler from doing its own work after our code generation process. Compilers usually do (inlining, loop unrolling, etc.), and we let them do so as they wish, because they fill the gaps we cannot reach from a library.
Did you ever get to the last slides, where we show the range-based function code in which no SIMD details leak out, yet which can consume a SIMD range? I fail to see how this is *not* a correct way of designing code in C++: relying on the range abstraction and providing a range with the proper properties. This kind of code is largely sufficient and generic enough to handle most classical use cases with a high level of performance. Dealing with the microarchitecture never went as far as manual code goes, and I don't really get your obsession with that.
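To give a concrete (editorial, hypothetical) flavour of the range-based style referred to here, a generic reduction can be written with no SIMD vocabulary at all; fed a range of plain floats it runs scalar, fed a range whose value_type is something like pack<float> it processes one SIMD register per iteration. This is a sketch, not the slides' code.

    // Generic reduction over any range whose value_type supports operator+.
    // Nothing here mentions SIMD; the SIMD-ness comes from the range you pass.
    template<typename Range, typename T>
    T accumulate_range(Range const& r, T init)
    {
        typedef typename Range::const_iterator iterator;
        for (iterator it = r.begin(), e = r.end(); it != e; ++it)
            init = init + *it;
        return init;
    }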

Joel falcou <joel.falcou@gmail.com> writes:
On 11/06/11 10:45, David A. Greene wrote:
Vectorizing compilers exist today. They've existed since the 1970s.
I still think you're mixing up SIMD ISAs and vector machines...
They are the same, though existing SIMD architectures are less powerful.
As I see it, vector machines as in http://en.wikipedia.org/wiki/SIMD#Chronology have existed since the 1970s. But what you seem determined not to acknowledge is that we don't care about that type of machine; we focus on http://en.wikipedia.org/wiki/SIMD#Hardware, which has existed since 1994 or so and for which the use cases, idioms and techniques are completely different.
Not completely. They are different in the sense that the SIMD hardware doesn't provide as many facilities as the old vector machines. But the principles are exactly the same. Vector codegen is harder on the SIMD machines and it's for this reason that I think boost.simd may not always generate the best code. It's really, really not always obvious what instructions should implement an arbitrary expression.
So yes, auto-vectorization on huge Cray systems is done automatically, with results I really don't know about and don't much care about, as that is not our target audience, but I concede it may well be good.
The very same compiler vectorizes very well for x86.
Now, speaking strictly of SIMD ISAs in the x86/PPC family: no, auto-vectorization is not that good and still requires manual input, function libraries and so forth. So in this case our claims hold, and boost.simd has to be seen as a set of enabling tools to help people write generic code that can be vectorized.
There are very good autovectorizers for x86. I'm not very familiar with PPC so I can't comment on that.
Note that template meta-programming as we use it does not prevent the compiler from doing its own work after our code generation process. Compilers usually do (inlining, loop unrolling, etc.), and we let them do so as they wish, because they fill the gaps we cannot reach from a library.
But the resulting vector code may run suboptimally. It will probably run fine 80% of the time, but for that other 20% it may be very bad indeed. This is where allowing the compiler to do vector codegen is critical.
Did you ever get to the last slides, where we show the range-based function code in which no SIMD details leak out, yet which can consume a SIMD range?
Yes, I think it's very neat!
I fail to see how this is *not* a correct way of designing code in C++: relying on the range abstraction and providing a range with the proper properties. This kind of code is largely sufficient and generic enough to handle most classical use cases with a high level of performance. Dealing with the microarchitecture never went as far as manual code goes, and I don't really get your obsession with that.
My concern with it comes from experience tuning a vectorizing compiler to many different microarchitectures. If the range abstraction is correct and the underlying libraries provide hints to the compiler (the latter is a big missing piece, I admit), the compiler ought to be able to generate the same or better vector code from the high-level algorithm call. If it doesn't, it's because the compiler is not up to the task or the generic library somehow hides information from the compiler. I contend that it is often easier to add directives to the library than it is to use boost.simd to generate vector code, and the former will often perform better. If one doesn't have a good vectorizing compiler, the point is moot because the directives probably don't even exist. In that case boost.simd is the way to go.
My pithy quip to colleagues is that std::vector<> ought to vectorize. :) It likely doesn't today, at least not with the default allocator. But that is largely a problem with the library implementation not specifying the lack of aliasing issues to the compiler and/or the higher-level standard algorithms not including directives to tell the compiler to ignore possible dependencies.
Does that mean that boost.simd is a waste of time? Not at all! There will always be cases the compiler for whatever reason cannot get. In those cases, boost.simd is the far superior way to approach the problem over hand-written vector code. -Dave

[ Note: I had written this before I saw Joel's answer, so there might be a bit of duplication here. ] On 10/06/2011 22:16, David A. Greene wrote:
Almost everything the compiler needs to vectorize well that it does not get from most language syntax can be summed up by two concepts: aliasing and alignment.
That's not the problem as we see it. Our argument is that vectorization needs to be explicitly programmed by a human, because otherwise the compiler may not be able to entirely rethink your algorithm to make it vectorizable. Designing for vectorization can require changing your algorithm completely, and you might get completely different results due to running with lower precision, running operations in a different order (remember floating-point arithmetic is not associative), etc., making it a destructive conversion. If the programmer is not aware of how to program a SIMD unit, he might write code that is not vectorizable without realizing it. By making it explicit, we can guarantee to that programmer that he gets what he asked for, rather than depending on the whims of an optimizer. Even in Boost.SIMD, some users claim we depend on the optimizer too much. Also, the compiler seems to be unable to do this automatic vectorization except in the most trivial of cases. The benchmarks show this. Just in the slides, there is a 4x improvement when using std::accumulate (which is still a pretty trivial algorithm) with pack<float> instead of float, and in both cases automatic vectorization was enabled. Actually, I think you even get a 6x improvement if you unroll the loop within accumulate (due to pipelining effects), but I'm not sure of this. Joel, you did that test, care to comment?
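For readers who want to picture what "std::accumulate with pack<float>" boils down to, here is a hand-written SSE sketch of the same reduction. It is an editorial illustration of the idea only; the library expresses this generically and portably, and the actual pack interface is not shown here.

    #include <xmmintrin.h>
    #include <cstddef>

    // Sum n floats four at a time. Assumes data is 16-byte aligned and n is a
    // multiple of 4; a real implementation would also handle the tail.
    float sum4(const float* data, std::size_t n)
    {
        __m128 acc = _mm_setzero_ps();
        for (std::size_t i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_load_ps(data + i));   // vertical adds
        float partial[4];
        _mm_storeu_ps(partial, acc);                        // spill the 4 partial sums
        return partial[0] + partial[1] + partial[2] + partial[3];
    }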
I don't see how pack<> addresses the aliasing problem
It doesn't aim to. A pack abstracts a SIMD register. Aliasing is a problem tied to pointers, which are a separate thing. Here is a non-exhaustive list of concerns to keep in mind when writing high-performance code on a single CPU core with a SIMD unit:
- SIMD instructions used
- memory allocation, alignment, and padding
- loop unrolling for pipelining effects
- cache-friendliness of memory accesses (tied to alignment as well -- also important for when you go multicore)
The choice of Boost.SIMD is to separate all of those concerns. pack<T> takes care of formalizing the register and generating the best set of instructions available on the target CPU for what you ask. You're still responsible for the rest. We try to provide tools to help you in those tasks, but you don't have to use them if you don't want to. That's called low coupling. Actually, the NT2 library aims at providing a unified solution for the entire thing and more (multicore, gpu, distributed...), but Boost.SIMD is just the SIMD component, decoupled from the rest.
in any way that is not similar to simply grabbing local copies of global data or parameters. Various C++ "restrict" extensions already address the latter. We desperately need something much better than "restrict" in standard C++.
Yes, you should use the restrict keyword (which is available in most C++ compilers in one fashion or another) to instruct the compilers your pointers do not alias.
Manycore is the future and parallel processing is the new normal.
I don't see the direct link. Compilers can more easily tell that they can automatically parallelize a loop when it uses restrict pointers, but if you're explicitly parallelizing that loop (which is our approach -- explicit description of the parallelism but automatic generation of the associated code), then it's not strictly required.
pack<> does address alignment, but it's overkill.
It doesn't address it, it requires it. That's entirely the opposite, since that means that it's up to the caller, not the callee, to address it.
It's also pessimistic. One does not always need aligned data to vectorize, so the conditions placed on pack<> are too restrictive.
Loads and stores from an arbitrary pointer can be much slower, are not portable, and are just a bad idea. This is not how you're meant to use a SIMD unit, or even an ALU or a FPU. Also, there is no compelling argument at all for using them, other than automatic vectorization from arbitrary code, which is not what this library is about. It's just like reading an int from a pointer not aligned on a 4 byte boundary (assuming std::alignment_of<int>::value is 4). While that's allowed on x86 (at a certain runtime cost), it is not on a lot of other architectures, and it's not allowed by the C++ standard either.
Furthermore, the alignment information pack<> does convey will likely get lost in the depths of the compiler, leading to suboptimal code generation unless that alignment information is available elsewhere (and it often is).
It does not convey any alignment information. It's an abstraction for a register. A register does not have an address, so any concept of alignment makes no sense. In truth, it may not be a register, since we allow the compiler to transparently fall back to the stack, so that we don't have to go down to the register allocation level, but when it does it (usually because you've used more than 16 interdependent variables), it does it correctly. pack generates the right instructions through use of intrinsics, and I don't see what you mean about suboptimal code. With SSE, there is one intrinsic for aligned load, and another for unaligned load. We always use the aligned one. Other intrinsics are obviously not affected by this, since alignment does not matter for assembly instructions that operate on registers...
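The two SSE load flavours being referred to, for the curious (plain intrinsics, not library code):

    #include <xmmintrin.h>

    // movaps: requires p to be 16-byte aligned; faults otherwise.
    __m128 load_aligned(const float* p)   { return _mm_load_ps(p); }

    // movups: accepts any address, historically at a noticeable cost on
    // older micro-architectures.
    __m128 load_unaligned(const float* p) { return _mm_loadu_ps(p); }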
I think a far more useful design of this library would be providing standard ways to assert certain conditions. For example:
simd::assert(simd::is_aligned(&v[0], 16))
We already have this, though it's written is_aligned<16>(&v[0]). is_aligned(&v[0]) also works; it checks against the strongest alignment required by the largest SIMD unit (i.e. 32 on SandyBridge/Bulldozer, and possibly 64 on future Intel processors).
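Such a predicate is essentially a modulus check on the pointer value; a minimal stand-alone sketch (not the library's implementation) looks like this:

    #include <cstddef>
    #include <cstdint>

    // True if p is aligned on an N-byte boundary.
    template<std::size_t N>
    bool is_aligned(const void* p)
    {
        return reinterpret_cast<std::uintptr_t>(p) % N == 0;
    }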
Provide simple things the compiler can recognize via pattern matching and we'll be a long way to getting the compiler to autovectorize.
That would be fine if that was what we wanted, and if we were working on extending C++ compilers, but that's not the case at all. We're not compiler writers. We just want a tool to express vectorization in a way that is manageable, portable and future-proof, and that can guarantee us that we'll use the CPU capabilities to the max. Joel and I work in HPC research, and we need that performance. We can't just hope that compilers will automagically do it for us. You might as well write a naive matrix multiplication and expect the compiler to generate code with performance on par with that of an optimized BLAS routine. That's wishful thinking, albeit it surely would be nice (and I know people who are working on this kind of thing -- in my research team, even).
I like simd::allocator to provide certain guarantees to memory managed by containers. That plus some of the asserts described above could help generic code a lot.
Bear in mind there is a lot of stuff undocumented because the library is still being boostified and documentation being written.
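For context, an aligned allocator of this kind usually just wraps an aligned allocation primitive. A minimal C++11-style sketch follows; the real simd::allocator interface may differ, and a pre-C++11 allocator needs the full rebind/pointer boilerplate.

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    // Allocator returning Align-byte aligned memory, usable as
    // std::vector<float, aligned_allocator<float, 16> >.
    // posix_memalign is POSIX; Windows would use _aligned_malloc/_aligned_free.
    template<typename T, std::size_t Align = 16>
    struct aligned_allocator
    {
        typedef T value_type;

        T* allocate(std::size_t n)
        {
            void* p = 0;
            if (::posix_memalign(&p, Align, n * sizeof(T)) != 0)
                throw std::bad_alloc();
            return static_cast<T*>(p);
        }

        void deallocate(T* p, std::size_t) { ::free(p); }
    };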
Other questions about the library:
What's under the operators on pack<>? Is it assembly code?
It's intrinsic-level code. As shown in the slides, the type under the hood of pack is the type you use when you write such intrinsics. We prefer using intrinsics to assembly, because it's more readable, more portable and more optimization-friendly.
I wonder how pack<T> can know the best vector length. That is highly, highly code- and implementation-dependent.
You tell it, one way or another, the SIMD ISAs that are available on your architecture. Then when you do pack<T, N>, it selects out of all of these the ones that support vectors of N x T (the case where multiple matches are available is not supported). If you do pack<T>, it selects out of all of these the ones that support vectors of T, and takes the one with the biggest size. Of course, that means that each new instruction set requires an addition to the library. But we use a fairly fancy dispatching system (based on overloading, ADL, partial specialization and decltype) which allows us to write things generically, externally and in a nice extendable way.
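As a rough (editorial, much simplified) picture of that selection, one can imagine a compile-time trait that maps an element type to the widest cardinal the enabled instruction sets provide, which pack<T> then uses as its default:

    #include <cstddef>

    // Widest supported cardinal for float, driven here by compiler macros.
    // The real dispatching is far more elaborate and covers every element type.
    template<typename T> struct best_cardinal;

    #if defined(__AVX__)
    template<> struct best_cardinal<float> { static const std::size_t value = 8; };
    #elif defined(__SSE__)
    template<> struct best_cardinal<float> { static const std::size_t value = 4; };
    #else
    template<> struct best_cardinal<float> { static const std::size_t value = 1; };
    #endif

    // pack<float> would then default to pack<float, best_cardinal<float>::value>.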
How does simd::where define pack<> elements of the result where the condition is false? Often the best solution is to leave them undefined but your example seems to require maintaining current values.
It's never undefined, since there is no lazy evaluation. This is clearly explained in the slides. if_else (or where or select, it has various names) is implemented in terms of bitwise operations.
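Concretely, branchless selection of this kind combines the two branches with bitwise masks; a stand-alone SSE sketch of the idea (not the library's code):

    #include <xmmintrin.h>

    // Each lane of 'mask' is all-ones where the condition held and all-zeros
    // elsewhere (as produced by comparison intrinsics such as _mm_cmplt_ps),
    // so this picks a where the condition is true and b where it is false.
    static inline __m128 select(__m128 mask, __m128 a, __m128 b)
    {
        return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
    }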
How portable is Boost.simd? By portable I mean, how easy is it to move the code from one machine to another get the same level of performance?
Here is what we currently support and what we aim to support.
Operating systems:
- Linux
- Mac OS X
- Windows
- AIX (in development)
Compilers:
- GCC
- Clang
- MSVC 10+ (due to the boost.typeof trick not working for us)
Processors:
- x86 (SSE2, SSE3, SSSE3, SSE4a, SSE4.1, SSE4.2 -- AVX, XOP and FMA4 were also working at some point, but I believe they've been broken by some recent changes)
Less stable:
- PowerPC (AltiVec, Cell SPU, VMX128 (Xbox360) and VSX (POWER7), the latter two under way)
- ARM (NEON in development)
The reference platform is GCC Linux x86-64 with SSE4. There are still a couple of tests failing on MSVC/Windows, but we're working on it. Performance can be so-so with MSVC when aggressive inlining is disabled (due to calling convention and aliasing issues), but we're also working on that.
I don't mean to be too discouraging. But a library to do this kind of stuff seems archaic to me. It was archaic when Intel introduced MMX. If possible, I would like to see this evolve into a library to convey information to the compiler.
Since you don't seem to have particular real world experience with SIMD, you don't really make a case for your judgment. Experience has shown us that this kind of thing is necessary, and a lot of tools, including some from Intel, AMD or Apple, exist just to help you write SIMD code. Most if not all high-performance numerical computation libraries also write SSE instructions directly and don't rely on the compiler to do it. (BLAS, LAPACK, FFTW to name a few) Also, if you want a vectorized sine function, you have to write it. Compilers don't have any, and if they did, they'd just use a library for it. Our implementations of trigonometric functions based on Boost.SIMD are for example faster (and more precise) than the ones provided with MKL, the math kernel library by Intel for optimized math on their processors. I believe that's in part because Boost.SIMD has allowed us to write it easily; and it is portable and automatically scales to new architectures and new vector sizes. And we actually implement most of the C99 and TR1 functions and more, in an IEEE754-compliant way, with very high precision, while they only provide a handful. We have many use cases where it has proven to allow us to do better things in our research activity, and it's also at the core of the technology of a start-up that's been acknowledged to be a disruptive innovation by the local "competitiveness cluster". So I can't really agree with the fact the library is archaic with such a shallow analysis.

Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
[ Note: I had written this before I saw Joel's answer, so there might be a bit of duplication here. ]
On 10/06/2011 22:16, David A. Greene wrote:
Our argument is that vectorization needs to be explicitly programmed by a human, because otherwise the compiler may not be able to entirely rethink your algorithm to make it vectorizable.
The human may have to restructure some of the code, but a good compiler can do a lot of restructuring and it can certainly do the mechanics of generating vector code. In many cases all that's needed is a directive here or there to help the compiler understand that there isn't a dependency when it can't know that statically.
Designing for vectorization can require changing your algorithm completely, and you might get completely different results due to running with lower precision, running operations in a different order (remember floating-point arithmetic is not associative), etc., making it a destructive conversion.
Yes, the programmer needs to keep all of these things in mind no matter how vector code is generated.
If the programmer is not aware of how to program a SIMD unit, he might write code that is not vectorizable without realizing it.
If that's the case, then the programmer is just as likely to write explicit vector code that is wrong. In other words, he will vectorize when it is illegal to do so.
Also, the compiler seems to be unable to do this automatic vectorization except in the most trivial of cases.
Examples?
The benchmarks show this. Just in the slides, there is a 4x improvement when using std::accumulate (which is still a pretty trivial algorithm) with pack<float> instead of float, and in both cases automatic vectorization was enabled.
With which compilers? Again, I can see utility for boost.simd if the compiler doesn't know how to vectorize well. I'm arguing that you can't make the case that boost.simd is always necessary or even necessary in most cases.
Here is a non-exhaustive list of concerns to keep in mind when writing high-performance code on a single CPU core with a SIMD unit:
- SIMD instructions used
Should be the job of vector codegen in the compiler, if the compiler supports it.
- memory allocation, alignment, and padding
A few directives here and there solve this.
- loop unrolling for pipelining effects
The compiler should handle this, perhaps with a few directives to help.
- cache-friendliness of memory accesses (tied to alignment as well -- also important for when you go multicore)
Ditto.
The choice of Boost.SIMD is to separate all of those concerns. pack<T> takes care of formalizing the register and generating the best set of instructions available on the target CPU for what you ask.
But how does the programmer know what is best? It changes from implementation to implementation.
Manycore is the future and parallel processing is the new normal.
I don't see the direct link.
The direct link is that compilers are going to have to get good at this. Some already are.
Compilers can more easily tell that they can automatically parallelize a loop when it uses restrict pointers, but if you're explicitly parallelizing that loop (which is our approach -- explicit description of the parallelism but automatic generation of the associated code), then it's not strictly required.
And you're going to end up with code that is not performance portable.
pack<> does address alignment, but it's overkill.
It doesn't address it, it requires it.
That's overkill. Alignment often isn't required to vectorize.
It's also pessimistic. One does not always need aligned data to vectorize, so the conditions placed on pack<> are too restrictive.
Loads and stores from an arbitrary pointer can be much slower, are not portable, and are just a bad idea. This is not how you're meant to use a SIMD unit, or even an ALU or a FPU.
With AMD and Intel's latest offerings alignment is much less of a performance issue.
pack generates the right instructions through use of intrinsics, and I don't see what you mean about suboptimal code.
Because the code the programmer writes may not be the best code for the given microarchitecture. This stuff is often non-obvious.
I think a far more useful design of this library would be providing standard ways to assert certain conditions. For example:
simd::assert(simd::is_aligned(&v[0], 16))
We already have this, though it's written is_aligned<16>(&v[0]). is_aligned(&v[0]) also works; it checks against the strongest alignment required by the largest SIMD unit (i.e. 32 on SandyBridge/Bulldozer, and possibly 64 on future Intel processors).
32 is not always the right answer on SandyBridge/Bulldozer because 256 bit vectorization is not always the right answer. What's the vector length of a pack<int> on Bulldozer?
Joel and I work in HPC research, and we need that performance. We can't just hope that compilers will automagically do it for us.
I also work in HPC. Everyone I know relies on compilers.
You might as well write a naive matrix multiplication and expect the compiler to generate code with performance on par with that of an optimized BLAS routine.
That is in fact what is happening via autotuning. Yes, in some cases hand-tuned code outperforms the compiler. But in the vast majority of cases, the compiler wins. If you want to use boost.simd for the cases where it does not, that is entirely appropriate. But again, don't claim that boost.simd is always the answer because compilers are dumb. They are not.
I wonder how pack<T> can know the best vector length. That is highly, highly code- and implementation-dependent.
If you do pack<T>, it selects out of all of these the ones that support vectors of T, and takes the one with the biggest size.
That's often the wrong answer.
I don't mean to be too discouraging. But a library to do this kind of stuff seems archaic to me.
Since you don't seem to have particular real world experience with SIMD, you don't really make a case for your judgment.
My case comes from years of experience.
Most if not all high-performance numerical computation libraries also write SSE instructions directly and don't rely on the compiler to do it. (BLAS, LAPACK, FFTW to name a few)
Yes, and boost.simd could be a good solution for those cases. But I don't think it's a good solution for the programmer to write his application.
Also, if you want a vectorized sine function, you have to write it. Compilers don't have any, and if they did, they'd just use a library for it.
Yes. Again, boost.simd might help, but it does not relieve the burden of writing multiple implementations of a library function for multiple hardware implementations.
So I can't really agree with the fact the library is archaic with such a shallow analysis.
It may be useful in some instances, but it is not a general solution to the problem of writing vector code. In the majority of cases, the programmer will want the compiler to do it because it does a very good job. -Dave

On 11/06/2011 17:30, David A. Greene wrote:
The benchmarks show this. Just in the slides, there is a 4x improvement when using std::accumulate (which is still a pretty trivial algorithm) with pack<float> instead of float, and in both cases automatic vectorization was enabled.
With which compilers?
Mainstream compilers with mainstream options: what people use. The boostcon talk was pretty simple, it was not aimed at HPC experts, but rather tried to show that the library could be useful to everyone.
But how does the programmer know what is best? It changes from implementation to implementation.
The programmer is not supposed to know; the library does. The programmer writes a * b + c, and we generate an fma instruction if available, and a multiplication followed by an addition otherwise. Now it is true there are some cases where there are multiple choices, and which one is fastest is not clear and may depend on the micro-architecture and not just the instruction set. We don't really do different things depending on micro-architecture, but I think the cases where it really matters should be rather few. When testing on some other micro-architectures, we notice that there are a few unexpected run times, but nothing that really justifies doing different codegen, at least at our level. We're currently setting up a test farm, and we'll try to graph the run time in cycles of all of our functions on different architectures. Any recommendation on which micro-architectures to include for x86? We can't afford to have too many. We mostly work with Core and Nehalem.
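A tiny stand-alone sketch of that a * b + c mapping, purely as illustration (FMA4 is chosen as the example fused instruction set; the library's actual dispatch covers many more cases):

    // Fused multiply-add where the target has one, mul + add otherwise.
    #if defined(__FMA4__)
    #include <x86intrin.h>
    static inline __m128 madd(__m128 a, __m128 b, __m128 c)
    {
        return _mm_macc_ps(a, b, c);                 // single fused instruction
    }
    #else
    #include <xmmintrin.h>
    static inline __m128 madd(__m128 a, __m128 b, __m128 c)
    {
        return _mm_add_ps(_mm_mul_ps(a, b), c);      // multiply, then add
    }
    #endif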
That's overkill. Alignment often isn't required to vectorize.
It doesn't cost the user much to enforce it for new applications, and it allows us not to have to worry about it.
With AMD and Intel's latest offerings alignment is much less of a performance issue.
What about portability?
32 is not always the right answer on SandyBridge/Bulldozer because 256 bit vectorization is not always the right answer.
If 256-bit vectorization is not what you want, then you have to specify the size you want explicitly. Otherwise we always prefer the size that allows the most parallelism.
What's the vector length of a pack<int> on Bulldozer?
256 bits because __m256i exists, and there are some instructions for those types, even if they are few. But I suppose 128 bits could also be an acceptable choice. I need to benchmark this, but I think the conversions from/to AVX/SSE are sufficiently fast to make it a good choice in general. It's a default anyway, you can set the size you want.
That is in fact what is happening via autotuning. Yes, in some cases hand-tuned code outperforms the compiler. But in the vast majority of cases, the compiler wins.
That's not what I remember of my last discussions with people working with the polyhedral optimization model. They told me they came close, but still weren't as fast as state-of-the-art BLAS implementations.
My case comes from years of experience.
You certainly have experience in writing compilers, but do you have experience in writing SIMD versions of algorithms within applications? You don't seem to be familiar with SIMD-style branching or other popular techniques.
It may be useful in some instances, but it is not a general solution to the problem of writing vector code. In the majority of cases, the programmer will want the compiler to do it because it does a very good job.
All tools are complementary and can thus co-exist. What I didn't like about your original post is that you said compilers were the one and only solution to parallelization. At least we can agree on something now ;).

Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
We're currently setting up a test farm, and we'll try to graph the run time in cycles of all of our functions on different architectures. Any recommendation on which micro-architectures to include for x86? We can't afford to have too many. We mostly work with Core and Nehalem.
Try to get some AMD processors in there. Specifically, something Barcelona or later. When the AVX chips are out, it would be interesting to compare them. I suspect the offerings from Intel and AMD will be vastly different given the described Bulldozer architecture.
With AMD and Intel's latest offerings alignment is much less of a performance issue.
What about portability?
Sure, if binary portability is required the user will have to use the maximal subset of features.
What's the vector length of a pack<int> on Bulldozer?
256 bits because __m256i exists, and there are some instructions for those types, even if they are few.
It's going to perform really badly for any AVX machine because you end up having to generate tons of shuffles.
I need to benchmark this, but I think the conversions from/to AVX/SSE are sufficiently fast to make it a good choice in general.
They really are not. You get about 3-4x as many instructions in integer code, for example. These sorts of things get determined with experience, so I wouldn't expect boost.simd or anyone else to get it perfectly right out of the gate. x86 in particular is very tricky to optimize. :)
That's not what I remember of my last discussions with people working with the polyhedral optimization model. They told me they came close, but still weren't as fast as state-of-the-art BLAS implementations.
Sure, there are cases where hand-written code performs best. In those cases, boost.simd seems like a great solution.
My case comes from years of experience.
You certainly have experience in writing compilers, but do you have experience in writing SIMD versions of algorithms within applications?
Some, yes. I'm by no means a restructuring expert. But I have looked at an awful lot of HPC code. I've managed to uncover tons of compiler bugs/limitations. :)
You don't seem to be familiar with SIMD-style branching or other popular techniques.
I'm not sure what you mean by "SIMD-style branching." I think you mean mask/merge/select operations.
All tools are complementary and can thus co-exist. What I didn't like about your original post is that you said compilers were the one and only solution to parallelization.
I'm sorry if I conveyed that message. I wasn't trying to argue that but I can understand how it might seem that way. I was reacting to the set of slides that seemed to argue that the compiler never gets it right. :)
At least we can agree on something now ;).
I suspect we can agree on a great deal. HPC is hard, for example. ;) -Dave
participants (10)
- greened@obbligato.org
- Greg Rubino
- Gruenke, Matt
- Joel falcou
- Mathias Gaunard
- Mathieu -
- MaximYanchenko
- Neal Becker
- Robert Jones
- Simonson, Lucanus J