
On 11/06/2011 17:42, David A. Greene wrote:
Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
Register allocation.
But that's not where the difficult work is.
Right, NP-complete problems are not difficult. It's not really a problem when you're doing a small function in isolation, but we want all the functions to be inlineable (and most of them to be inlined), and we don't know in advance whether we need to copy the operands, which registers will be used and which will be available, etc.
Currently we support the whole SSEx family, all the AMD-specific extensions, and AltiVec for PPC and Cell, and we have a protocol to extend that.
How many different implementations of DGEMM do you have for x86? I have seen libraries with 10-20.
That's because they don't have generic programming, which would allow them to generate all variants with a single generic core and some meta-programming.
No. No, no, no. These implementations are vastly different. It's not simply a matter of changing the vector length.
We work with the LAPACK people, and some of them have realized that the things we do with metaprogramming could be very interesting to them, but we haven't yet had the opportunity to start a research project on this.
I'm not saying boost.simd is never useful. I'm saying the claims made about it seem overblown.
What I was saying about the use of meta-programming to write adaptive, fast linear algebra primitives is completely unrelated to Boost.SIMD, although it could use it.
- Write it using the operator overloads provided by boost.simd (see the sketch below). Note that the programmer will have to take into account various combinations of matrix size and alignment, target microarchitecture and ISA, and will probably have to code many different versions.
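For reference, a minimal sketch of what that operator-overload style looks like. This is written from memory, not copied from the library: the header path, static_size, and the load/store helpers are assumptions about the API, and the scalar tail is elided.

    #include <cstddef>
    #include <boost/simd/sdk/simd/pack.hpp>  // header path from memory; may differ

    // pack<float> models one SIMD register; its width depends on the target.
    void saxpy(float* y, float const* x, float a, std::size_t n)
    {
        typedef boost::simd::pack<float> vf;
        vf const va(a);                          // splat a into every lane
        std::size_t const w = vf::static_size;   // lanes per register (assumed member)
        for (std::size_t i = 0; i + w <= n; i += w)
        {
            vf vx = boost::simd::load<vf>(x + i);   // assumed load helper
            vf vy = boost::simd::load<vf>(y + i);
            boost::simd::store(va * vx + vy, y + i); // overloaded operators
        }
        // scalar tail for the last n % w elements omitted for brevity
    }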
Shouldn't you just need the cache line size? This is something we provide as well.
Nope. It's a LOT more complicated than that.
Well, as far as I know, the only platform-specific stuff you do for matrix multiplication apart from vectorization is loop tiling. Can your magic compiler guarantee it will do a perfect job at this, with a cache size only known at runtime?
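Concretely, this is the kind of restructuring at stake: one level of cache blocking where the tile size is only a run-time value, so a purely static pass cannot fold it into the code it generates. An illustrative sketch, not code from any particular library:

    #include <algorithm>
    #include <cstddef>

    // C += A*B on n-by-n row-major matrices, blocked with a tile size
    // chosen at run time (e.g. derived from the detected cache size).
    void gemm_blocked(float* C, float const* A, float const* B,
                      std::size_t n, std::size_t tile)
    {
        for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t kk = 0; kk < n; kk += tile)
        for (std::size_t jj = 0; jj < n; jj += tile)
            for (std::size_t i = ii; i < std::min(ii + tile, n); ++i)
            for (std::size_t k = kk; k < std::min(kk + tile, n); ++k)
            {
                float const a = A[i*n + k];
                for (std::size_t j = jj; j < std::min(jj + tile, n); ++j)
                    C[i*n + j] += a * B[k*n + j];
            }
    }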
C++ metaprogramming *is* an autotuning framework.
To a degree. How do you do different loop restructurings using the library?
I suggest you read some basic literature on what you can do with templates, in particular Todd Veldhuizen's "Active Libraries: Rethinking the roles of compilers and libraries" <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.8031&rep=rep1&type=pdf>
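To give a toy example of the idea (illustrative only, not Boost.SIMD code): make the restructuring parameters template parameters, so that every variant is stamped out from one generic core and an autotuner only has to choose among instantiations.

    #include <cstddef>

    // Fully-unrolled block of U additions, generated by the compiler.
    template <unsigned U>
    struct unrolled_sum
    {
        static float apply(float const* x)
        {
            return x[U - 1] + unrolled_sum<U - 1>::apply(x);
        }
    };

    template <>
    struct unrolled_sum<0>
    {
        static float apply(float const*) { return 0.0f; }
    };

    // One generic core; the unroll factor is a compile-time parameter.
    template <unsigned U>
    float sum(float const* x, std::size_t n)
    {
        float r = 0.0f;
        for (std::size_t i = 0; i + U <= n; i += U)
            r += unrolled_sum<U>::apply(x + i);
        return r;  // tail handling omitted
    }

    // sum<2>, sum<4>, sum<8>, ... are distinct fully-unrolled variants;
    // an autotuner instantiates and times each one, then picks a winner.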
The smallest of things can prevent them from vectorizing. Sure, if you add a few restricts here, a few pragmas there, and some compiler options specific to floating-point behaviour, you might be able to get the system to kick in.
Yep. And that's a LOT easier than hand-restructuring loops and writing vector code manually.
I may not want to set the floating-point options for my whole translation unit. And if I had to put pragmas just around the bit I want, that starts to look like explicit vectorization. Our goal with Boost.SIMD is not to have you write vector code manually (you don't go down to the instruction level), but rather to make vectorization explicit: you describe a set of operations on operands whose types are vectors.
But my personal belief is that automatic parallelization of arbitrary code is an approach doomed to failure.
Then HPC has been failing for 30 years.
That's funny, because a huge portion of HPC people seem to be busy recoding stuff for multicore and GPU. How come they have to rewrite it, if we have automatic parallelization solutions? Surely they can just feed their old C code to their compiler and get optimal SIMD+OpenMP+MPI+CUDA as output. Their monolithic do-it-all state-of-the-art compiler, provided by their hardware vendor, takes care of everything they should need and is able to predict the best solutions, right?

Actually, I suppose it works fairly well with old Fortran code. Yet all of these people found reasons that made them want to use the tools themselves, directly. And CUDA, which is arguably hugely popular these days, requires people to write their algorithms in terms of the kernel abstraction; it wouldn't just work with arbitrary C code. Even then, I understand it's still fairly complicated for the compiler to do a good job, despite the imposed coding paradigms.

Automatic parallelization solutions are nice to have. Like all compiler optimizations, they are a best-effort thing. But when you really need the power, you've got to go get it yourself. At least that's my opinion.
Programming is about making things explicit using the right language for the task.
Programming is about programmer productivity.
Productivity implies a product. What matters in a product is that it fulfills the requirements. If my requirements are to use some hardware -- not necessarily limited to a single architecture -- to the best of its abilities, I'm better off describing my code in the form best suited to that hardware than betting everything on the compiler's automatic code restructuring getting me there.

When compilers start to guarantee some optimizations, maybe that will change. But the feedback I have gathered from compiler people is that they cannot guarantee certain transformations will be applied consistently regardless of data size; running the passes in a different order can yield better or worse results, and the same goes for running them multiple times, etc. Compilers just give "some" optimization; there is no formalism behind them that can prove certain patterns will always be reduced. Getting a fully optimized program still requires doing the optimization explicitly. We're at the point where we even have to force inlining in some cases, because even with the inline specifier some compilers do not do the right thing.
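The usual workaround looks something like this (a common portability shim; FORCE_INLINE is a made-up name for illustration, not something from our code):

    // `inline` alone is only a hint, so performance-critical code falls
    // back on compiler-specific attributes to actually force inlining.
    #if defined(_MSC_VER)
    #  define FORCE_INLINE __forceinline
    #elif defined(__GNUC__)
    #  define FORCE_INLINE inline __attribute__((always_inline))
    #else
    #  define FORCE_INLINE inline
    #endif

    FORCE_INLINE float dot4(float const* a, float const* b)
    {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
    }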
Boost.simd could be useful to vendors providing vectorized versions of their libraries.
Not all fast libraries need to be provided by hardware vendors.
No, not all. In most other cases, though, the compiler should do it.
Monolithic designs are bad. Some people specialize in specific things, and they should be the ones providing those things.
I have seen too many cases where programmers wrote an "obviously better" vector implementation of a loop, only to have someone else rewrite it in scalar so the compiler could properly vectorize it.
Maybe if the compiler was really that good, it could still do the optimization when vectors are involved?
No, because information has been lost at that point.
How so? What information? There is no assembly involved; the compiler still has full knowledge. There is no reason why the compiler couldn't tell that

    __m128 a, b, c;
    c = _mm_add_ps(a, b);  // typically lowered to __builtin_ia32_addps(a, b)

is the same as

    float a, b, c;
    c = a + b;

except that it does it four floats at a time.
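Spelled out as a complete snippet (illustrative), both versions sit at the same semantic level as far as the compiler is concerned:

    #include <xmmintrin.h>

    // Intrinsic version: _mm_add_ps is a builtin with known semantics,
    // not opaque assembly; the compiler can reason about it freely.
    __m128 add4(__m128 a, __m128 b)
    {
        return _mm_add_ps(a, b);
    }

    // Scalar version: the same operation, one float at a time; nothing
    // stops the compiler from relating the two and optimizing either.
    void add4_scalar(float* c, float const* a, float const* b)
    {
        for (int i = 0; i < 4; ++i)
            c[i] = a[i] + b[i];
    }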