
David A. Greene wrote:
> Ahem: http://www.cray.com

Two points:
1/ Not everyone has access to a Cray-like machine. Parallelization tools for CotS machines are not to be neglected, and on this front lots of things need to be done.
2/ vector supercomputer != SIMD-enabled processor, even if the former may include the latter.
> Auto-parallelization has been around since at least the '80s in production machines. I'm sure it was around even earlier than that.
What do you call auto-parallelization? Are you telling me that, nowadays, I can take *any* source code written in C or C++ or whatever, compile it with some compiler specifying --parallel, and automagically get a parallel version of the code? If so, you'll have to send a memo to at least a dozen research teams (including mine) all over the world so they can stop working on this problem and move on to something else. Should I also assume that each time a new architecture comes out, those compilers already know the best way to generate code for it? I beg to differ, but automatic parallelization is far from "done". Then again, just look at the problem of writing SIMD code: explain why we still get better performance for simple code when writing SIMD code by hand than when letting gcc auto-vectorize it.
> Perhaps your SIMD library could invent convenient ways to express those idioms in a machine-independent way.
Well, considering the question was first about how to structure the group of libraries I'm proposing, I apologize for not having taken the time to express all the features of those libraries. Moreover, even with a simple example, the fact that the library hides the differences between SSE2, SSE3, SSSE3, SSE4, Altivec, SPU-VMX and the forthcoming AVX is a feature on its own. Oh, and as specified in the former mail, the DSL takes care of optimizing fused operations, so things like FMA are detected and replaced by the proper intrinsic when possible. Same with reductions like min/max, and operations like b*c-a or SAD on SSEx.
> Your simple SIMD expression example isn't terribly compelling. Any competent compiler should be able to vectorize a scalar loop that implements it.
Well, sorry then to have given a simple example.

> What would be compelling is a library to express things like the Cell's scratchpad. Libraries to do data staging would be interesting because more and more processors are going to add these kinds of local memory.

I don't see what you have in mind. Do you mean something like Hierarchically Tiled Arrays? Or some Cell-based development library? If the latter, I don't think Boost is the best home for it. As for HTA, lots of implementations already exist, and guess what: they just do the parallelization themselves instead of letting the computer do it.

Anyway, we'll be able to discuss the library itself and its features when a proper thread for it starts.

--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35