
On Sun, Jan 18, 2009 at 9:50 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
Dean Michael Berris wrote:
Please don't misunderstand me as disagreeing with you here, because I do agree there is a need to address the parallelism problem when implementing demanding solutions at a level where you don't really have to worry about the architecture the code runs on. However, given the reality of the situation -- a myriad of available platforms on which to compile/run C++ -- the pressure from both sides of the equation (library writers and tool developers on one side, hardware vendors on the other) to come up with a solution is immense, especially since the industry has to adapt sooner rather than later. ;-)
We agree then.
... I think there is a market for precisely this kind of thing/work now -- helping domain experts recognize and utilize the inherent parallelism in their solutions and the tools they are using. :-)
The best approach is to have them benefit from parallelism without their really knowing about it.
I'm a little wary about hiding these important issues from the people who understand the higher scheme of things, which is why I personally don't think leaving the (non-C++-programming) domain experts in the dark about the inherent parallelism in their solution is a good idea. The people solving the real-world problem should be aware that the computing facilities they have are actually capable of parallel computing, and that the way they write their solutions (in any programming language) will have a direct impact on the performance and scalability of those solutions. It doesn't matter whether what they write targets a GPU or a CPU+SIMD; what matters is that the way they express the solution lends itself to running in parallel. Once they are aware of the available parallelism, they can adapt the way they think and the way they come up with solutions. As far as hiding the parallelism from them goes, the compiler is the perfect place to do that, especially if your aim is just to leverage the platform-specific parallelism features of the machine. Even these domain experts, once they know about the compiler's capabilities, may be able to write their code in such a way that the compiler will happily auto-vectorize it -- and that, I think, is where it counts most.
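To make that concrete, here's a minimal sketch (my own example, not from any real codebase) of the kind of loop shape an auto-vectorizer can usually handle: plain contiguous arrays, unit stride, no aliasing, no data-dependent branches.

    #include <cstddef>

    // Hypothetical saxpy-style kernel. __restrict is a common compiler
    // extension (GCC/MSVC) promising the arrays don't alias, which helps
    // the vectorizer; e.g. GCC at -O3 (or with -ftree-vectorize) and ICC
    // can typically turn this loop into SSE code on their own.
    void scale_and_add(float* __restrict out, const float* __restrict in,
                       float a, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a * in[i] + out[i];
    }

Write it with pointer chasing or a data-dependent branch inside, and the same compilers will usually give up.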
Libraries, OTOH, are components to me rather than tools. Maybe I'm being too picky with terms, but if you meant libraries to be tools, that simply doesn't work with how my brain is wired. It doesn't make you wrong, it just doesn't feel right to me. ;-)
Beware that Embedded DSLs are no more than DSLs disguised inside a library, hence my conflating tools and libraries.
I would tend to agree with Dave that libraries tend to come disguised as DSELs (think Spirit) that perform a certain function -- thus I think of them as components that work as part of a bigger whole.
I think I understand what you mean, but I don't think it's a failure of the libraries that they're not known/used by the people doing the programming. Much like you can't blame the nail gun when a carpenter, not knowing it exists, keeps using a traditional hammer and nails.
My point was: it is not that easy to say to people "use X".
Actually, it's easy to say it -- it's getting acceptance that's the problem. Now, a library that forced users to change their code just to leverage something the compiler should be able to handle for them (like writing assembly code, for instance) sounds to me like too much to ask. After all, the reason we have higher-level programming languages is to hide from ourselves the details of the assembly/machine language of whatever platform we're going to run programs on. ;-)
True, but libraries also require that users write code that actually uses the library. If users already have code that doesn't use your library, what's the advantage if they can get the auto-vectorization from a future version of a compiler anyway, without having to butcher their code to use your library? And what if they find a bug in the code using the library, or (god forbid) a bug in the library itself?
The same can be said of any library out there. What if tomorrow a new C++ compiler can extract code from the source and build top-notch threads from it? Should we prevent people from using Boost.Threads from now on?
No, what I'm pointing at here is that libraries for very low-level parallelism will have to be maintained independently of the code that's actually using them -- another layer in which failures can hide and inefficiencies can creep in. The point of using Boost.Thread instead of a platform-specific threading library is that you can rely on a coherent interface for threading and synchronization. If a new C++ compiler were able to do that parallelism for us effectively, without our having to use Boost.Threads, then I think usage of Boost.Threads would slowly decline on its own; in the interim, the problem Boost.Threads solves is compelling enough for it to remain a viable solution. The point I'm trying to make is that if the target is simply SIMD at the processor level, a library just for that is too specific to be considered generic. I might be missing the point here, but if the compiler can already do it now (and will only get better in the future), and if I can write platform-specific code from C++ through compiler-vendor-provided libraries when I need to be explicit about what I want, what would be the value of a very narrow/specific library like a SIMD-specific thingamajig?
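For illustration, this is what I mean by using the vendor-provided bits directly (a rough sketch of mine; assumes SSE, 16-byte-aligned pointers, and n being a multiple of 4):

    #include <xmmintrin.h>  // SSE intrinsics shipped by the compiler vendor

    // Add two float arrays four lanes at a time using the SSE registers
    // directly -- no wrapper library in between.
    void add4(float* out, const float* a, const float* b, unsigned n)
    {
        for (unsigned i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));
        }
    }

It's ugly and non-portable, sure, but it's already there on every compiler that targets the platform.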
Actually, DSELs require that you write code in the domain language -- and here is where the problem lies.
Well, if the parallelism is outsourced behind the scenes, it's not a problem.
But then you (the DSEL writer) for that specific domain would have to deal with parallelism the old-fashioned way without a DSEL for that (yet) helping you to do it -- and that doesn't scale. That doesn't help the domain expert especially if he doesn't know that he can actually come up with solutions that do leverage the parallelism available in his platform.
If this were the case then maybe just having this DSEL may be good to give to parallelism-savvy C++ programmers, but not necessarily still the domain experts who will be doing the writing of the domain-specific logic. Although you can argue that parallel programming is a domain in itself, in which case you're still not bridging the gap between those that know about parallel programming and the other domain experts.
Parallel programming is a domain in itself, but not a domain for users -- it's one for tool writers. A user domain is things like math, finance, physics, anything like that. We agree.
Yes, not all platforms are Intel platforms, but I don't know if you've noticed that Intel's compilers even generate code that will run on AMD processors -- yes, even SSE[0..3] -- as per their product documentation. If your target is CotS machines, I think Intel/GCC is your best bet (at least on x86_64). I haven't dealt with platforms other than Intel/AMD, but since everybody's moving in the direction of leveraging and exploiting parallelism in hardware, it's not unreasonable to think that the compiler vendors will have to compete (and eventually get better) in this regard.
Well, we can't leave AltiVec and its offspring on the side of the road. The Cell processor uses it, and I consider the Cell a quasi-CotS part, as a PS3 costs something like only half a kidney. I don't target CotS or non-CotS; my goal is to cover the basics, and the SIMD basics involve the old AltiVec-enabled Motorola PPCs and Intel machines. So the strict minimum is the AltiVec + SSE flavors. I hope that one day (AVX v2), both will converge, though.
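Today the same 4-float add has to be spelled twice -- a rough sketch (not actual library code) of exactly the divergence a wrapper would paper over:

    #if defined(__ALTIVEC__)
      #include <altivec.h>
      typedef __vector float pack_t;             // AltiVec 128-bit register
      inline pack_t add(pack_t a, pack_t b) { return vec_add(a, b); }
    #elif defined(__SSE__)
      #include <xmmintrin.h>
      typedef __m128 pack_t;                     // SSE 128-bit register
      inline pack_t add(pack_t a, pack_t b) { return _mm_add_ps(a, b); }
    #endif

Multiply that by every operation and every SSE revision, and you see why I'd rather write the #ifdef soup once, inside a library.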
And precisely because of that, I think better compilers that leverage these platform-specific features would be the correct and more far-reaching solution than a library just for SIMD. If your goal were a library/DSEL for expressing parallelism in C++ in general, hiding the details of threads and whatnot, of which a SIMD-specific extension would be just one part, then the goal wouldn't feel too narrow to me.
Why do I get the feeling that you're saying:
compiler writing != software engineering
? :-P
No, I mean that *I* feel more comfortable writing stuff on this side of the compiler than on the other. ;)
Okay. :-)
Anyway, I think if you're looking to contribute to a compiler-building community, GCC may be a little too big (I don't want to use the term advanced, because I haven't bothered looking at the code of the GCC project), but I know the Clang folks over at LLVM are looking for help finishing the C++ implementation of the compiler front-end. From what I'm reading about Clang and LLVM, it should be feasible to write language-agnostic optimization algorithms/implementations dealing just with the LLVM IR.
Well, as I work like half a mile from Albert Cohen's office, I'll certainly have a discussion about Clang someday. ;) A C++-to-C++ tool is on my to-do list, but not for now, as I think DSELs in C++ still have untapped resources.
I agree, but if you're going to tackle the concurrency problem through a DSEL, I'd think a DSEL at a higher level than SIMD extensions would be more fruitful. For example, something like:

    vector<huge_numbers> numbers;
    // populate numbers
    async_result_stream results =
        apply(numbers, [... insert funky parallelisable lambda construction ...]);
    while (results) {
        huge_number a;
        results >> a;
        cout << a << endl;
    }

would be able to spawn thread pools, launch tasks, and provide an interface to getting the results using futures underneath. Domain experts who already know C++ would be able to express their funky parallelisable lambda construction and just know that when they use the facility, it will do the necessary decomposition and parallelism as much as it can at the library level. This, I think, is feasible (although a little hard) to achieve -- and since the compiler can even vectorize an inner loop in the decomposed lambda construction, that detail doesn't necessarily have to be dealt with by the library at all.
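If it helps, here is a rough sketch of the plumbing such an apply() could hide -- written against C++11's std::async for brevity, and with the names (parallel_apply, the chunk count) made up for this discussion:

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Split the input into chunks, run the user's function over each chunk
    // in its own asynchronous task, then drain the futures in order.
    template <typename T, typename F>
    std::vector<T> parallel_apply(const std::vector<T>& in, F f,
                                  std::size_t chunks = 4)
    {
        std::vector<std::future<std::vector<T> > > tasks;
        std::size_t const step = (in.size() + chunks - 1) / chunks;
        for (std::size_t b = 0; b < in.size(); b += step) {
            std::size_t const e = std::min(b + step, in.size());
            tasks.push_back(std::async(std::launch::async, [&in, f, b, e] {
                std::vector<T> part;
                for (std::size_t i = b; i < e; ++i)
                    part.push_back(f(in[i]));   // the "funky lambda" runs here
                return part;
            }));
        }
        std::vector<T> out;
        for (auto& t : tasks) {
            std::vector<T> part = t.get();      // futures give the results back
            out.insert(out.end(), part.begin(), part.end());
        }
        return out;
    }

A real facility would stream results out as futures become ready instead of concatenating at the end, but the shape is the same.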
In that case, I think that kind of library (DSEL) would be nice to have -- especially to abstract the details of expressing parallelism in general at the source-code level.
Except it is, like... friggin' hard?
Uh, yes. ;-)
My stance is to have application-domain-specific libraries that hide all the parallelism by themselves relying on small-scale parallel libraries like Boost.Thread or my proposal.
In which case I think that a DSEL for parallelism would be much more acceptable than even the simplest SIMD DSEL, mainly because if you really wanted to leverage SIMD by hand, you'd just use the vector registers and the vector functions directly from your code instead. At least that's how it is in my case, as both a user and a library writer.
Nice! I would agree that something like *that* is appropriate as a domain-specific language which leverages parallelism in the details of the implementation.
However, I also think there are some details that would be nice to tackle at the appropriate layer -- SIMD code construction is, well, meant to be in the domain of the compiler (as far as SSE or similar things go). OpenCL is meant to be an interface that the hardware and software vendors will be supporting for a long time to come (at least from what I'm reading in the press releases), so I'm not too worried about the combinatorial explosion of architectures and parallelism runtimes.
Except some people (like one of the posters in the previous thread) deal daily with code that needs this level of abstraction and no more. Hence the rationale behind "Boost.SIMD".
In which case I think a DSEL is clever, but a SIMD-only library would be too small in scope for my taste. But that's just me I think. ;-)
I agree completely, but I'm afraid that if the DSEL is for expressing parallelism in C++, the goal of "giving domain experts tools that know about the parallelism" won't be met readily. For C++ developers who want to leverage parallelism in general, sure, but I don't think I'd be particularly compelled to use a SIMD-only DSEL.
I think we can't just wake up and say "OK, today I'll just solve the parallelism problem in C++ using DSELs". I think that, on the contrary, a concrete, reasonable roadmap would be "tiling" the parallel-problem world with small-scale software solutions that can inter-operate and interact freely. Then, when the basic blocks of such tools have been done, we can start cementing them into higher-level ones.
Of course, maybe not in a day. But it can feasibly be achieved with some effort from brilliant library writers. I like thinking at a higher level first and then solving the problems at the lower level with a more specific focus, but within the bigger context. Only once you can recognize the patterns in the solution from a higher level can you really try solving problems at a lower level with better insight. Missing context is always hard to deal with. They came up with the STL anyway, right? Whoever thought there'd be a string class that makes sense in C++. ;-)

--
Dean Michael C. Berris
Software Engineer, Friendster, Inc.