
Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
Shouldn't you just need the cache line size? This is something we provide as well.
Nope. It's a LOT more complicated than that.
Well, as far as I know, the only platform-specific stuff you do for matrix multiplication apart from vectorization is loop tiling.
Vector length matters. Instruction selection matters. Prefetching matters.
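To make that concrete: a cache-blocked matrix multiply might look like the sketch below. TILE is a placeholder, not a value derived from any real cache geometry, and the innermost loops are where vector length, instruction selection, and prefetching then come into play.

    #include <cstddef>

    // Minimal sketch of loop tiling for C += A * B (all N x N, row-major).
    // TILE is the platform-dependent blocking factor; choosing it well
    // requires knowing cache sizes, associativity, TLB reach, and so on.
    const std::size_t TILE = 64; // placeholder value

    void matmul_tiled(const float* A, const float* B, float* C, std::size_t N)
    {
        for (std::size_t ii = 0; ii < N; ii += TILE)
            for (std::size_t kk = 0; kk < N; kk += TILE)
                for (std::size_t jj = 0; jj < N; jj += TILE)
                    // mini-multiply on one tile: the inner kernel that
                    // vectorization and prefetching ultimately shape
                    for (std::size_t i = ii; i < ii + TILE && i < N; ++i)
                        for (std::size_t k = kk; k < kk + TILE && k < N; ++k)
                            for (std::size_t j = jj; j < jj + TILE && j < N; ++j)
                                C[i * N + j] += A[i * N + k] * B[k * N + j];
    }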
Can your magic compiler guarantee it will do a perfect job at this, with a cache size only known at runtime?
No one can ever guarantee "perfect." But the compiler should aim to reduce the programming burden.
To a degree. How do you do different loop restructurings using the library?
I suggest you read some basic literature on what you can do with templates, in particular Todd Veldhuizen's "Active Libraries: Rethinking the roles of compilers and libraries":
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.8031&rep=rep1&type=pdf>
Thanks for the pointer! Printed out and ready to digest. :)
And if I had some pragmas just around the bit I want, that starts to look like explicit vectorization.
But not like boost.simd, as the actual algorithm code doesn't get touched.
Our goal with Boost.SIMD is not to write vector code manually (you don't go down to the instruction level), but rather to make vectorization explicit (you describe a set of operations on operands whose types are vectors).
But again, that's not always the right thing to do. Often one only wants to partially vectorize a loop, for example. I'm sure boost.simd can represent that, too, but it is another very platform-dependent choice.
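To make the distinction concrete, here is a rough sketch of that programming model. packf below is a scalar stand-in, not boost.simd's actual pack<T> API; the real library maps such a type onto SSE/AVX/AltiVec registers. The algorithm reads like scalar code over vector-typed operands, with a scalar tail loop for the leftover elements.

    #include <cstddef>

    // Stand-in for a SIMD pack type in the spirit of boost.simd's pack<T>.
    // This scalar emulation only illustrates the programming model.
    struct packf
    {
        enum { size = 4 }; // number of lanes; hardware-dependent
        float v[size];
    };

    packf load(const float* p)
    {
        packf r;
        for (int i = 0; i < packf::size; ++i) r.v[i] = p[i];
        return r;
    }

    void store(float* p, const packf& a)
    {
        for (int i = 0; i < packf::size; ++i) p[i] = a.v[i];
    }

    packf operator*(const packf& a, const packf& b)
    {
        packf r;
        for (int i = 0; i < packf::size; ++i) r.v[i] = a.v[i] * b.v[i];
        return r;
    }

    // The algorithm itself mentions no instructions, only operations on
    // vector-typed operands; the scalar loop handles the remainder.
    void multiply(const float* in, const float* w, float* out, std::size_t n)
    {
        std::size_t i = 0;
        for (; i + packf::size <= n; i += packf::size)
            store(out + i, load(in + i) * load(w + i));
        for (; i < n; ++i) // scalar tail
            out[i] = in[i] * w[i];
    }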
But my personal belief is that automatic parallelization of arbitrary code is an approach doomed to failure.
Then HPC has been failing for 30 years.
That's funny, because a huge portion of HPC people seems to be busy recoding stuff for multicore and GPU.
For the most part, they are FINALLY moving away from pure MPI code and starting to use things like OpenMP. For the GPU, good compiler solutions have only just begun to appear -- there's always a lag in tool availability for new architectures. But GPUs aren't all that new anyway; they are "just" vector machines. I don't know of many HPC users who want to yet again restructure their loops. At best you can get them to restructure again IF you can guarantee them that the restructured code will run well on any machine. That's why performance portability is critical. HPC users have had it with rewriting code.
How come they have to rewrite it since we have automatic parallelization solutions? Surely they can just input their old C code to their compiler and get optimal SIMD+OpenMP+MPI+CUDA as output.
If the algorithm has already been written to vectorize, it should map to the GPU just fine. If not, it will have to be restructured anyway, which is generally beyond the capability of compilers or boost.simd, though many times loop directives can just tell the compiler what to do.
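A minimal example of that directive style, using plain OpenMP and a saxpy stand-in kernel: the pragma asserts that the iterations are independent, and the loop body itself stays untouched.

    #include <cstddef>

    // The directive tells the compiler the iterations are independent;
    // the algorithm code is not restructured by hand.
    void saxpy(float a, const float* x, float* y, std::size_t n)
    {
        #pragma omp parallel for
        for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)n; ++i)
            y[i] += a * x[i];
    }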
Their monolithic do-it-all state-of-the-art compiler provided by their hardware vendor takes care of everything they should need, and is able to predict all of the best solutions, right?
In many cases, yes. Of course, some level of user directive helps a lot. It's very hard for the compiler to automatically decide which kernels should run on a GPU, for example. But the user just needs a directive here or there to specify that.
And CUDA, which is arguably hugely popular these days, requires people to write their algorithms in terms of the kernel abstraction.
Yuck, yuck, yuck! :) CUDA was a fine technology when GPUs became popular, but it is being replaced quickly by things like the PGI GPU directives. The pattern is: do it manually first, then use compiler directives (designed based on what was learned from the manual approach), then have the compiler do it automatically (not possible in general, but often very effective in certain cases).
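For flavor, a directive-based GPU offload looks roughly like the following. The syntax is OpenACC-style, shown purely to illustrate the approach rather than as the exact PGI pragmas.

    // The compiler generates the GPU kernel, picks the launch shape, and
    // manages the data movement declared by the copy clause.
    void scale(float* x, float a, int n)
    {
        #pragma acc parallel loop copy(x[0:n])
        for (int i = 0; i < n; ++i)
            x[i] *= a;
    }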
Automatic parallelization solutions are nice to have. Like all compiler optimizations, it's a best effort thing. But when you really need the power, you've got to go get it yourself. At least that's my opinion.
Sometimes, yes. But I would say with a good vectorizing compiler that is extremely rare. Given an average compiler, I certainly see the goodness in boost.simd.
If my requirement is to use some hardware -- not necessarily limited to a single architecture -- to the best of its abilities, I'm better off describing my code in the way that best fits that hardware than betting everything on the compiler's automatic restructuring of my code getting me there.
I think it's a combination of both. Overspecification often hampers the compiler's ability to generate good code. Things should be specified at the "right" level given the available compiler capabilities. Of course, the definition of "right" varies widely from implementation to implementation.
Maybe if the compiler was really that good, it could still do the optimization when vectors are involved?
No, because information has been lost at that point.
How so? What information?
Unrolling is a good example. A hand-unrolled and scheduled loop is very difficult to "re-roll." That has all sorts of implications for vector code generation. -Dave
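A minimal sketch of the information loss Dave describes: the rolled reduction below is trivially recognizable as vectorizable, while re-rolling its hand-unrolled twin requires proving that the four accumulators are independent strands of one reduction, which is hard in general.

    // Original form: trivially recognizable as a reduction.
    float sum(const float* x, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; ++i)
            s += x[i];
        return s;
    }

    // Hand-unrolled and scheduled version (n assumed divisible by 4 for
    // brevity). Recovering the simple loop above from this -- "re-rolling"
    // -- is what compilers find very difficult. Note the accumulation
    // order also differs, so re-rolling changes floating-point results
    // unless the compiler is allowed to reassociate.
    float sum_unrolled(const float* x, int n)
    {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        for (int i = 0; i < n; i += 4)
        {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }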