
Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
Shouldn't you just need the cache line size? This is something we provide as well.
Nope. It's a LOT more complicated than that.
Well, as far as I know, the only platform-specific stuff you do for matrix multiplication apart from vectorization is loop tiling.
Vector length matters. Instruction selection matters. Prefetching matters.
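To make that concrete: a cache-blocked matrix multiply might look like the sketch below. TILE is a placeholder, not a value derived from any real cache geometry, and the innermost loops are where vector length, instruction selection, and prefetching then come into play.

    #include <cstddef>

    // Minimal sketch of loop tiling for C += A * B (all N x N, row-major).
    // TILE is the platform-dependent blocking factor; choosing it well
    // requires knowing cache sizes, associativity, TLB reach, and so on.
    const std::size_t TILE = 64; // placeholder value

    void matmul_tiled(const float* A, const float* B, float* C, std::size_t N)
    {
        for (std::size_t ii = 0; ii < N; ii += TILE)
            for (std::size_t kk = 0; kk < N; kk += TILE)
                for (std::size_t jj = 0; jj < N; jj += TILE)
                    // mini-multiply on one tile: the inner kernel that
                    // vectorization and prefetching ultimately shape
                    for (std::size_t i = ii; i < ii + TILE && i < N; ++i)
                        for (std::size_t k = kk; k < kk + TILE && k < N; ++k)
                            for (std::size_t j = jj; j < jj + TILE && j < N; ++j)
                                C[i * N + j] += A[i * N + k] * B[k * N + j];
    }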
Can your magic compiler guarantee it will do a perfect job at this, with a cache size only known at runtime?
No one can ever guarantee "perfect." But the compiler should aim to reduce the programming burden.
To a degree. How do you do different loop restructurings using the library?
I suggest you read some basic literature on what you can do with templates, in particular Todd Veldhuizen's "Active Libraries: Rethinking the roles of compilers and libraries":
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.8031&rep=rep1&type=pdf>
Thanks for the pointer! Printed out and ready to digest. :)
And if I had some pragmas just around the bit I want, that starts to look like explicit vectorization.
But not like boost.simd, as the actual algorithm code doesn't get touched.
Our goal with Boost.SIMD is not to write vector code manually (you don't go down to the instruction level), but rather to make vectorization explicit (you describe a set of operations on operands whose types are vectors).
But again, that's not always the right thing to do. Often one only wants to partially vectorize a loop, for example. I'm sure boost.simd can represent that, too, but it is another very platform-dependent choice.
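To make the distinction concrete, here is a rough sketch of that programming model. packf below is a scalar stand-in, not boost.simd's actual pack<T> API; the real library maps such a type onto SSE/AVX/AltiVec registers. The algorithm reads like scalar code over vector-typed operands, with a scalar tail loop for the leftover elements.

    #include <cstddef>

    // Stand-in for a SIMD pack type in the spirit of boost.simd's pack<T>.
    // This scalar emulation only illustrates the programming model.
    struct packf
    {
        enum { size = 4 }; // number of lanes; hardware-dependent
        float v[size];
    };

    packf load(const float* p)
    {
        packf r;
        for (int i = 0; i < packf::size; ++i) r.v[i] = p[i];
        return r;
    }

    void store(float* p, const packf& a)
    {
        for (int i = 0; i < packf::size; ++i) p[i] = a.v[i];
    }

    packf operator*(const packf& a, const packf& b)
    {
        packf r;
        for (int i = 0; i < packf::size; ++i) r.v[i] = a.v[i] * b.v[i];
        return r;
    }

    // The algorithm itself mentions no instructions, only operations on
    // vector-typed operands; the scalar loop handles the remainder.
    void multiply(const float* in, const float* w, float* out, std::size_t n)
    {
        std::size_t i = 0;
        for (; i + packf::size <= n; i += packf::size)
            store(out + i, load(in + i) * load(w + i));
        for (; i < n; ++i) // scalar tail
            out[i] = in[i] * w[i];
    }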
But my personal belief is that automatic parallelization of arbitrary code is an approach doomed to failure.
Then HPC has been failing for 30 years.
That's funny, because a huge portion of HPC people seems to be busy recoding stuff for multicore and GPU.
For the most part, they are FINALLY moving away from pure MPI code and starting to use things like OpenMP. For the GPU, good compiler solutions have only just begun to appear -- there's always a lag in tool availability for new architectures. But GPUs aren't all that new anyway; they are "just" vector machines. I don't know of many HPC users who want to yet again restructure their loops. At best you can get them to restructure again IF you can guarantee them that the restructured code will run well on any machine. That's why performance portability is critical. HPC users have had it with rewriting code.
How come they have to rewrite it since we have automatic parallelization solutions? Surely they can just input their old C code to their compiler and get optimal SIMD+OpenMP+MPI+CUDA as output.
If the algorithm has already been written to vectorize, it should map to the GPU just fine. If not, it will have to be restructured anyway, which is generally beyond the capability of compilers or boost.simd, though many times loop directives can just tell the compiler what to do.
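A minimal example of that directive style, using plain OpenMP and a saxpy stand-in kernel: the pragma asserts that the iterations are independent, and the loop body itself stays untouched.

    #include <cstddef>

    // The directive tells the compiler the iterations are independent;
    // the algorithm code is not restructured by hand.
    void saxpy(float a, const float* x, float* y, std::size_t n)
    {
        #pragma omp parallel for
        for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)n; ++i)
            y[i] += a * x[i];
    }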
Their monolithic do-it-all state-of-the-art compiler provided by their hardware vendor takes care of everything they should need, and is able to predict all of the best solutions, right?
In many cases, yes. Of course, some level of user directive helps a lot. It's very hard for the compiler to automatically decide which kernels should run on a GPU, for example. But the user just needs a directive here or there to specify that.
And CUDA, which is arguably hugely popular these days, requires people to write their algorithms in terms of the kernel abstraction.
Yuck, yuck, yuck! :) CUDA was a fine technology when GPUs became popular, but it is being replaced quickly by things like the PGI GPU directives. The pattern is: do it manually first, then use compiler directives (designed based on what was learned from the manual approach), then have the compiler do it automatically (not possible in general, but often very effective in certain cases).
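For flavor, a directive-based GPU offload looks roughly like the following. The syntax is OpenACC-style, shown purely to illustrate the approach rather than as the exact PGI pragmas.

    // The compiler generates the GPU kernel, picks the launch shape, and
    // manages the data movement declared by the copy clause.
    void scale(float* x, float a, int n)
    {
        #pragma acc parallel loop copy(x[0:n])
        for (int i = 0; i < n; ++i)
            x[i] *= a;
    }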
Automatic parallelization solutions are nice to have. Like all compiler optimizations, it's a best effort thing. But when you really need the power, you've got to go get it yourself. At least that's my opinion.
Sometimes, yes. But I would say with a good vectorizing compiler that is extremely rare. Given an average compiler, I certainly see the goodness in boost.simd.
If my requirement is to use some hardware -- not necessarily limited to a single architecture -- to the best of its abilities, I'm better off describing my code in the way that best fits that hardware than betting everything on the compiler's automatic restructuring of my code getting me there.
I think it's a combination of both. Overspecification often hampers the compiler's ability to generate good code. Things should be specified at the "right" level given the available compiler capabilities. Of course, the definition of "right" varies widely from implementation to implementation.
Maybe if the compiler was really that good, it could still do the optimization when vectors are involved?
No, because information has been lost at that point.
How so? What information?
Unrolling is a good example. A hand-unrolled and scheduled loop is very difficult to "re-roll." That has all sorts of implications for vector code generation. -Dave
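A minimal sketch of the information loss Dave describes: the rolled reduction below is trivially recognizable as vectorizable, while re-rolling its hand-unrolled twin requires proving that the four accumulators are independent strands of one reduction, which is hard in general.

    // Original form: trivially recognizable as a reduction.
    float sum(const float* x, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; ++i)
            s += x[i];
        return s;
    }

    // Hand-unrolled and scheduled version (n assumed divisible by 4 for
    // brevity). Recovering the simple loop above from this -- "re-rolling"
    // -- is what compilers find very difficult. Note the accumulation
    // order also differs, so re-rolling changes floating-point results
    // unless the compiler is allowed to reassociate.
    float sum_unrolled(const float* x, int n)
    {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        for (int i = 0; i < n; i += 4)
        {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }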