
On Sat, Jan 17, 2009 at 5:33 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
David A. Greene wrote:
Ahem: http://www.cray.com
Two points: 1/ Not everyone has access to a Cray-like machine. Parallelization tools for CotS machines are not to be neglected and, on this front, lots of things need to be done.
What do you mean by parallelization for CotS (Commodity off-the-Shelf) machines? I have personally dealt with two types of parallelization: parallelization at a high level (High Performance Computing using something like MPI for distributed message-passing computation across machines) and parallelization at a low level (SSE and auto-vectorization). If you mean high-level parallelism across machines, you already have Boost.MPI, and I suspect you don't intend to replace that or tackle parallelism at that scale. If you mean parallelism within a program, then you already have Boost.Thread and Boost.Asio (whose very nice io_service object can be used as a job pool of sorts run on many threads). At a lower level, there are the parallel algorithm extensions to the STL already being shipped by compiler vendors, free and commercial/proprietary alike, that provide parallel versions of the STL algorithms (see the GNU libstdc++ that comes with GCC 4.3.x and Microsoft's PPL).
2/ vector supercomputer != SIMD-enabled processor, even if the former may include the latter.
Actually, a vector supercomputer is basically a SIMD machine at a higher level -- that is, if I'm understanding the related literature correctly.
Auto-parallelization has been around since at least the '80s in production machines. I'm sure it was around even earlier than that.
What do you call auto-parallelization?
Are you telling me that, nowadays, I can take *any* source code written in C or C++ or whatever, compile it with some compiler specifying --parallel, and automagically get a parallel version of the code? If so, you'll have to send a memo to at least a dozen research teams (including mine) all over the world so they can stop working on this problem and move on to something else. Should I also assume that each time a new architecture comes out, those compilers also know the best way to generate code for it? I beg to differ, but automatic parallelization is far from "done".
You might be surprised, but the GNU Compiler Collection already has the -ftree-vectorize compiler flag, which does analysis on the IR (Intermediate Representation) of your code and generates the appropriate SSE/SSE2/SSE3/MMX/... instructions for your architecture. They advise adding -msse -msse2 to the flags on x86 platforms. I don't think you should stop researching better ways to do automatic vectorization; I do, however, think that doing it at the IR level in the compiler would be more fruitful, especially when combined with something like static analysis and code transformation. You might even be surprised to know that OpenCL (the new framework for dealing with heterogeneous scalable parallelism) allows for run-time transformation of code written against that framework -- and the soon-to-be-released new version of Apple's operating system and compiler/SDK will already be supporting it.
Then again, just looking at the problem of writing SIMD code: explain why, for simple code, we still get better performance when writing SIMD code by hand than when letting gcc auto-vectorize it?
Because GCC needs help with auto-vectorization, and GCC is a best-effort project (much like Boost is). If you really want to see great auto-vectorization numbers, maybe you can try the Intel compilers? Although I don't have (published and peer-reviewed) empirical studies to support my claim, just by enabling auto-vectorization in the compilation of the project I'm dealing with (pure C++, no hand-written SIMD necessary) I already get a significant improvement in performance *and* scalability (vertically, of course). I think a "better" alternative would be to help the GCC folks do a better (?) job of writing more efficient tree-vectorization implementations and transformations that produce great SIMD-aware code built into the compiler. If you can somehow help with the pattern recognition of auto-vectorizable loops/algorithms at a higher level than the IR and do source-level transformation of C++ (which I think would be way cool, BTW, much like what Lisp compilers are able to do) to produce the right (?) IR for the compiler to auto-vectorize, then *that* would be something else. Maybe you'd also like to look at GCC-ICI [0] to see how you can play around with extending GCC when it comes to implementing these (admittedly, if I may say so) cool optimizations that get algorithms automatically vectorized.
Perhaps your SIMD library could invent convenient ways to express those idioms in a machine-independent way.
Well, considering the question was first about how to structure the group of libraries I'm proposing, I apologize for not having taken the time to spell out all the features of those libraries. Moreover, even in a simple example, the fact that the library hides the differences between SSE2, SSSE3, SSE3, SSE4, AltiVec, SPU-VMX and the forthcoming AVX is a feature in its own right. Oh, and as specified in the former mail, the DSL takes care of optimizing fused operations, so things like FMA are detected and replaced by the proper intrinsic when possible. Same with reductions like min/max and operations like b*c-a or SAD on SSEx.
I think I understand what you're trying to achieve by adding a layer of indirection at the library/code level -- however, I personally think (having read through numerous papers while doing my research on parallel computing in my student days) that these optimizations/transformations are best served by the tools that create the machine code (i.e., compilers) rather than dealt with at the source-code level. What I mean is that even though it's technically possible to implement a "parallel C++ DSEL" (feasible now with Boost.Proto, some lessons from attribute grammars [1], and the way Spirit 2(x?) combines with Boost.Phoenix), it would reach a bigger audience and serve a larger community if the compilers became smarter at doing the transformations themselves, rather than getting C++ developers to learn yet another library.
Anyway, we'll be able to discuss the library itself and its features when a proper thread for it starts.
Maybe we can discuss it in this one? :D

HTH

References:
[0] http://gcc-ici.sourceforge.net/
[1] http://en.wikipedia.org/wiki/Attribute_grammar

--
Dean Michael C. Berris
Software Engineer, Friendster, Inc.