
On 11/06/2011 17:42, David A. Greene wrote:
Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
Register allocation.
But that's not where the difficult work is.
Right, NP-complete problems are not difficult. It's not really a problem when you're doing a small function in isolation, but we want all the functions to be inlineable (and most of them to be inlined), and we don't know in advance whether we need to copy the operands, which registers will be used and which will be available, etc.
Currently we support the whole SSEx family, all the AMD-specific extensions, and AltiVec for PPC and Cell, and we have a protocol to extend that.
How many different implementations of DGEMM do you have for x86? I have seen libraries with 10-20.
That's because they don't have generic programming, which would allow them to generate all variants with a single generic core and some meta-programming.
No. No, no, no. These implementations are vastly different. It's not simply a matter of changing the vector length.
We work with the LAPACK people, and some of them have realized that the things we do with metaprogramming could be very interesting to them, but we haven't yet had the opportunity to start a research project on this.
I'm not saying boost.simd is never useful. I'm saying the claims made about it seem overblown.
What I was saying about the use of meta-programming to write adaptive, fast linear algebra primitives is completely unrelated to Boost.SIMD, although it could use it.
- Write it using the operator overloads provided by boost.simd (see the sketch below). Note that the programmer will have to take into account various combinations of matrix size and alignment, target microarchitecture and ISA, and will probably have to code many different versions.
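For reference, a minimal sketch of what that operator-overload style looks like. This is written from memory, not copied from the library: the header path, static_size, and the load/store helpers are assumptions about the API, and the scalar tail is elided.

    #include <cstddef>
    #include <boost/simd/sdk/simd/pack.hpp>  // header path from memory; may differ

    // pack<float> models one SIMD register; its width depends on the target.
    void saxpy(float* y, float const* x, float a, std::size_t n)
    {
        typedef boost::simd::pack<float> vf;
        vf const va(a);                          // splat a into every lane
        std::size_t const w = vf::static_size;   // lanes per register (assumed member)
        for (std::size_t i = 0; i + w <= n; i += w)
        {
            vf vx = boost::simd::load<vf>(x + i);   // assumed load helper
            vf vy = boost::simd::load<vf>(y + i);
            boost::simd::store(va * vx + vy, y + i); // overloaded operators
        }
        // scalar tail for the last n % w elements omitted for brevity
    }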
Shouldn't you just need the cache line size? This is something we provide as well.
Nope. It's a LOT more complicated than that.
Well, as far as I know, the only platform-specific stuff you do for matrix multiplication apart from vectorization is loop tiling. Can your magic compiler guarantee it will do a perfect job at this, with a cache size only known at runtime?
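Concretely, this is the kind of restructuring at stake: one level of cache blocking where the tile size is only a run-time value, so a purely static pass cannot fold it into the code it generates. An illustrative sketch, not code from any particular library:

    #include <algorithm>
    #include <cstddef>

    // C += A*B on n-by-n row-major matrices, blocked with a tile size
    // chosen at run time (e.g. derived from the detected cache size).
    void gemm_blocked(float* C, float const* A, float const* B,
                      std::size_t n, std::size_t tile)
    {
        for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t kk = 0; kk < n; kk += tile)
        for (std::size_t jj = 0; jj < n; jj += tile)
            for (std::size_t i = ii; i < std::min(ii + tile, n); ++i)
            for (std::size_t k = kk; k < std::min(kk + tile, n); ++k)
            {
                float const a = A[i*n + k];
                for (std::size_t j = jj; j < std::min(jj + tile, n); ++j)
                    C[i*n + j] += a * B[k*n + j];
            }
    }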
C++ metaprogramming *is* an autotuning framework.
To a degree. How do you do different loop restructurings using the library?
I suggest you read some basic literature on what you can do with templates, in particular Todd Veldhuizen's "Active Libraries: Rethinking the roles of compilers and libraries" <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.8031&rep=rep1&type=pdf>
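To give a toy example of the idea (illustrative only, not Boost.SIMD code): make the restructuring parameters template parameters, so that every variant is stamped out from one generic core and an autotuner only has to choose among instantiations.

    #include <cstddef>

    // Fully-unrolled block of U additions, generated by the compiler.
    template <unsigned U>
    struct unrolled_sum
    {
        static float apply(float const* x)
        {
            return x[U - 1] + unrolled_sum<U - 1>::apply(x);
        }
    };

    template <>
    struct unrolled_sum<0>
    {
        static float apply(float const*) { return 0.0f; }
    };

    // One generic core; the unroll factor is a compile-time parameter.
    template <unsigned U>
    float sum(float const* x, std::size_t n)
    {
        float r = 0.0f;
        for (std::size_t i = 0; i + U <= n; i += U)
            r += unrolled_sum<U>::apply(x + i);
        return r;  // tail handling omitted
    }

    // sum<2>, sum<4>, sum<8>, ... are distinct fully-unrolled variants;
    // an autotuner instantiates and times each one, then picks a winner.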
The smallest of things can prevent them from vectorizing. Sure, if you add a few restricts here, a few pragmas there, and some compiler options specific to floating-point behaviour, you might be able to get the system to kick in.
Yep. And that's a LOT easier than hand-restructuring loops and writing vector code manually.
I may not want to set the floating-point options for my whole translation unit. And if I had to put pragmas just around the bit I want, that starts to look like explicit vectorization. Our goal with Boost.SIMD is not to have you write vector code manually (you don't go down to the instruction level), but rather to make vectorization explicit: you describe a set of operations on operands whose types are vectors.
But my personal belief is that automatic parallelization of arbitrary code is an approach doomed to failure.
Then HPC has been failing for 30 years.
That's funny, because a huge portion of HPC people seem to be busy recoding stuff for multicore and GPU. How come they have to rewrite it, if we have automatic parallelization solutions? Surely they can just feed their old C code to their compiler and get optimal SIMD+OpenMP+MPI+CUDA as output. Their monolithic do-it-all state-of-the-art compiler, provided by their hardware vendor, takes care of everything they should need and is able to predict the best solutions, right?

Actually, I suppose it works fairly well with old Fortran code. Yet all of these people found reasons that made them want to use the tools themselves, directly. And CUDA, which is arguably hugely popular these days, requires people to write their algorithms in terms of the kernel abstraction; it wouldn't just work with arbitrary C code. Even then, I understand it's still fairly complicated for the compiler to do a good job, despite the imposed coding paradigms.

Automatic parallelization solutions are nice to have. Like all compiler optimizations, they are a best-effort thing. But when you really need the power, you've got to go get it yourself. At least that's my opinion.
Programming is about making things explicit using the right language for the task.
Programming is about programmer productivity.
Productivity implies a product. What matters in a product is that it fulfills the requirements. If my requirements are to use some hardware -- not necessarily limited to a single architecture -- to the best of its abilities, I'm better off describing my code in the form best suited to that hardware than betting everything on the compiler's automatic code restructuring getting me there.

When compilers start to guarantee some optimizations, maybe that will change. But the feedback I have gathered from compiler people is that they cannot guarantee certain transformations will be applied consistently regardless of data size; running the passes in a different order can yield better or worse results, and the same goes for running them multiple times, etc. Compilers just give "some" optimization; there is no formalism behind them that can prove certain patterns will always be reduced. Getting a fully optimized program still requires doing the optimization explicitly. We're at the point where we even have to force inlining in some cases, because even with the inline specifier some compilers do not do the right thing.
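The usual workaround looks something like this (a common portability shim; FORCE_INLINE is a made-up name for illustration, not something from our code):

    // `inline` alone is only a hint, so performance-critical code falls
    // back on compiler-specific attributes to actually force inlining.
    #if defined(_MSC_VER)
    #  define FORCE_INLINE __forceinline
    #elif defined(__GNUC__)
    #  define FORCE_INLINE inline __attribute__((always_inline))
    #else
    #  define FORCE_INLINE inline
    #endif

    FORCE_INLINE float dot4(float const* a, float const* b)
    {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
    }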
Boost.simd could be useful to vendors providing vectorized versions of their libraries.
Not all fast libraries need to be provided by hardware vendors.
No, not all. In most other cases, though, the compiler should do it.
Monolithic designs are bad. Some people specialize in specific things, and they should be the ones providing those things.
I have seen too many cases where programmers wrote an "obviously better" vector implementation of a loop, only to have someone else rewrite it in scalar so the compiler could properly vectorize it.
Maybe if the compiler was really that good, it could still do the optimization when vectors are involved?
No, because information has been lost at that point.
How so? What information? There is no assembly involved; the compiler still has full knowledge. There is no reason why the compiler couldn't tell that

    __m128 a, b, c;
    c = _mm_add_ps(a, b);  // typically lowered to __builtin_ia32_addps(a, b)

is the same as

    float a, b, c;
    c = a + b;

except that it does it four floats at a time.
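Spelled out as a complete snippet (illustrative), both versions sit at the same semantic level as far as the compiler is concerned:

    #include <xmmintrin.h>

    // Intrinsic version: _mm_add_ps is a builtin with known semantics,
    // not opaque assembly; the compiler can reason about it freely.
    __m128 add4(__m128 a, __m128 b)
    {
        return _mm_add_ps(a, b);
    }

    // Scalar version: the same operation, one float at a time; nothing
    // stops the compiler from relating the two and optimizing either.
    void add4_scalar(float* c, float const* a, float const* b)
    {
        for (int i = 0; i < 4; ++i)
            c[i] = a[i] + b[i];
    }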