
On Sat, Jan 17, 2009 at 5:33 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
David A. Greene wrote:
Ahem: http://www.cray.com
Two points: 1/ Not everyone has access to a Cray-like machine. Parallelization tools for CotS machines are not to be neglected and, on this front, lots of things need to be done.
What do you mean by parallelization for CotS (Commodity off-the-Shelf) machines? I have personally dealt with two types of parallelization: parallelization at a high level (High Performance Computing using something like MPI for distributed message-passing computation across machines) and parallelization at a low level (SSE and auto-vectorization). If you mean high-level parallelism across machines, you already have Boost.MPI, and I suspect you don't intend to replace that or tackle parallelism at that scale. If you mean parallelism within a program, then you already have Boost.Thread and Boost.Asio (whose very nice io_service object can be used as a job pool of sorts run on many threads). At a lower level, there are the parallel algorithm extensions to the STL already being shipped by compiler vendors, free and commercial/proprietary alike, that provide parallel versions of the STL algorithms (see the GNU libstdc++ that comes with GCC 4.3.x and Microsoft's PPL).
2/ vector supercomputer != SIMD-enabled processor, even if the former may include the latter.
Actually, a vector supercomputer is basically a SIMD machine at a higher level -- that is, if I'm understanding the related literature correctly.
Auto-parallelization has been around since at least the '80s in production machines. I'm sure it was around even earlier than that.
What do you call auto-parallelization?
Are you telling me that, nowadays, I can take *any* source code written in C or C++ or whatever, compile it with some compiler specifying --parallel, and automagically get a parallel version of the code? If so, you'll have to send a memo to at least a dozen research teams (including mine) all over the world so they can stop working on this problem and move on to something else. Should I also assume that each time a new architecture comes out, those compilers also know the best way to generate code for it? I beg to differ, but automatic parallelization is far from "done".
You might be surprised, but the GNU Compiler Collection already has the -ftree-vectorize compiler flag, which does analysis on the IR (Intermediate Representation) of your code and generates the appropriate SSE/SSE2/SSE3/MMX/... instructions for your architecture. They advise adding -msse -msse2 to the flags on x86 platforms. I don't think you should stop researching better ways to do automatic vectorization; I do, however, think that doing it at the IR level in the compiler would be more fruitful, especially when combined with something like static analysis and code transformation. You might even be surprised to know that OpenCL (the new framework for dealing with heterogeneous scalable parallelism) allows for run-time transformation of code written against that framework -- and the soon-to-be-released new version of Apple's operating system and compiler/SDK will already be supporting it.
Then again, just looking at the problem of writing SIMD code: explain why, for simple code, we still get better performance when writing SIMD code by hand than when letting gcc auto-vectorize it?
Because GCC needs help with auto-vectorization, and GCC is a best-effort project (much like Boost is). If you really want to see great auto-vectorization numbers, maybe you can try the Intel compilers? Although I don't have (published and peer-reviewed) empirical studies to support my claim, just by enabling auto-vectorization in the compilation of the project I'm dealing with (pure C++, no hand-written SIMD necessary) I already get a significant improvement in performance *and* scalability (vertically, of course). I think a "better" alternative would be to help the GCC folks do a better (?) job of writing more efficient tree-vectorization implementations and transformations that produce great SIMD-aware code built into the compiler. If you can somehow help with the pattern recognition of auto-vectorizable loops/algorithms at a higher level than the IR and do source-level transformation of C++ (which I think would be way cool, BTW, much like what Lisp compilers are able to do) to produce the right (?) IR for the compiler to auto-vectorize, then *that* would be something else. Maybe you'd also like to look at GCC-ICI [0] to see how you can play around with extending GCC when it comes to implementing these (admittedly, if I may say so) cool optimizations that get algorithms automatically vectorized.
Perhaps your SIMD library could invent convenient ways to express those idioms in a machine-independent way.
Well, considering the question was first about how to structure the group of libraries I'm proposing, I apologize for not having taken the time to spell out all the features of those libraries. Moreover, even in a simple example, the fact that the library hides the differences between SSE2, SSSE3, SSE3, SSE4, AltiVec, SPU-VMX and the forthcoming AVX is a feature in its own right. Oh, and as specified in the former mail, the DSL takes care of optimizing fused operations, so things like FMA are detected and replaced by the proper intrinsic when possible. Same with reductions like min/max and operations like b*c-a or SAD on SSEx.
I think I understand what you're trying to achieve by adding a layer of indirection at the library/code level -- however, I personally think (having read through numerous papers while doing my research on parallel computing in my student days) that these optimizations/transformations are best served by the tools that create the machine code (i.e., compilers) rather than dealt with at the source-code level. What I mean is that even though it's technically possible to implement a "parallel C++ DSEL" (feasible now with Boost.Proto, some lessons from attribute grammars [1], and the way Spirit 2(x?) combines with Boost.Phoenix), it would reach a bigger audience and serve a larger community if the compilers became smarter at doing the transformations themselves, rather than getting C++ developers to learn yet another library.
Anyway, we'll be able to discuss the library itself and its features when a proper thread for it starts.
Maybe we can discuss it in this one? :D

HTH

References:
[0] http://gcc-ici.sourceforge.net/
[1] http://en.wikipedia.org/wiki/Attribute_grammar

--
Dean Michael C. Berris
Software Engineer, Friendster, Inc.