Re: [boost] [OT?] SIMD and Auto-Vectorization (was Re: How to structurate libraries ?)

18 Jan 2009

      On Sun, Jan 18, 2009 at 4:16 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
...
Dean Michael Berris a écrit :
...
What do you mean by parallelization for CotS (Commodity off-the Shelf)
machines?
I took the link provided to cray.com as an answer to my parallelization
question. What I mean is that not everyone have to deal with large mainframe
but more so beowulf like machine or even simple multi-core machines on which
running vendor-specific runtime middleware may or may not be sensible.
Moreover, not every people that need parallelism want to do HPC. Computer
Vision and multimedia applciations are also highly demanding and sometimes
even more as they are also bound to some real-time or interactive-time
constraints in which the GFLOPS is not what you seek.
Anyway, consider my answer on this subject as miscommunication.
Ok, I understand.
...
...
If you mean dealing with high level parallelism across machines, you
already have Boost.MPI and I suspect you don't intend to replace that
or tackle that kind of parallelism at the large scale.
MPI is by no way what I would call high-level. It's only a message-passing
assembly language with a few interesting abstraction. BSP based tools or
algorithmic skeletons based tools are real high-level tools for
inter-machine parallelization. And well, I already have replaced MPI by
something with an higher abstraction level for at least three architecture
style (cluster, mutli-cores, cell) and published about this (see [1] for
reference) and plan to do so at boost'con this year to show exactly how
boost meta-programming tools made those kind of tools possible and usable.
Thanks for the reference, I'll try reading through it when I get the time. :)

At any rate, what I meant when I said MPI was a high-level abstraction
is that it forces you to think at a significantly higher level of
abstraction than parallelism constructs that are simply for SIMD
computing models, MIMD computing models, Hypercube, Meshes, etc. and
express the communication between processing components of the
solution. This means you are forced to think about how the processing
components communicate among each other, instead of thinking about
whether the architecture you're running on top of is SIMD-aware or
MIMD-aware, is a Torus or non-Torus Mesh, a Hypercube, a Connection
Machine, etc. -- although you can place infinitely many layers of
abstractions on top of this (say, expose an MPI-aware HTTP server and
have worker nodes do the "heavy processing" in a Beowulf cluster) it
doesn't change that MPI (the spec and the implementations) would be
higher level than say assembly instructions for a SIMD-processor.

Although I understand that MPI isn't the highest level, it would be a
"parallel programmer's" dream come true to be able to write a program
and express parallelism without having to worry about whether the
logical/actual concurrency happens on a single machine or a cluster of
machines. Erlang tried to do something similar by making process ID's
reflect whether they're local or remote processes but you still had to
be mindful of distinctions like that; and I think Haskell is doing
something similar with Data Parallel Haskell, but I think that doesn't
cover distributed (i.e. many-machine) parallelism.

Please don't misunderstand that I'm disagreeing with you here because
I do agree that there is a need to address the parallelism problem
when implementing considerably demanding solutions at a level where
you don't really have to worry about the architecture below on which
the code you're running on is like. However given the reality of the
situation with a myriad of available platforms on which to compile/run
C++, the pressure from both sides of the equation (library writers and
tool developers on one side, hardware vendors on the other) to come up
with a solution is immense -- especially since the industry has got to
adapt to it sooner than later. ;-)
...
...
If you mean dealing with parallelism within a program, then you
already have Boost.Thread and Boost.Asio (with the very nice
io_service object to be able to use as a job-pool of sorts run on many
threads).
Then again : low level. I've dealt with a large variety of people ranging
from  physicist to computer vision experts. They are all embarassed when
they have to take their Matlab or C or FORTRAN legacy code to C++/thread.
I understand but...
...
Some of them doesn't even KNOW their machine has those kind of features or
don't even know about simple thing like scaling, gustafson-barsis or amdhal
laws and think that as they have 3 machines with 4 cores, their code will
magically go 12 times faster no matter what. I've been a parallel software
designer in such a laboratory and I can tell you that most expert in field X
have no idea on how to tackle MPI , ASIO or even simple thread. Worse,
sometimes they don't want to because they find this uninteresting.
I agree, but...
...
Ok, you
might say that they can just send their code to someone to parallelize.
Alas, most of the time they don't want to as they' won't be able to be sure
you didn't butchered their initial code (following the good ol' Not Made
Here principles). At this point, you *have* to give them tools they
understand/want to use/are acustomed with so they can do the porting
themselves. And this requires providing either really high-level or
domain-specific interface to those  parallelism level which include, among
other, to use things like thread or asio by hiding them very very deeply.
Then again,that's my own personnal experiment. If you have some secret
management tricks to have parallelism-unaware people to use low-level tools,
then I'm all open to hear them :) cause I have to do this on a daily basis.
... I think there is a market for precisely this kind of thing/work
now -- helping domain experts be able to recognize and utilize the
inherent parallelism in their solutions and the tools they are using.
:-)

Now as far as management tricks, I don't have any -- I struggle with
this a lot too, believe me -- but the idea of tools can vary from
context to context (and person to person).

I personally think about tools to be something concrete that does a
specific task -- in this case I think "tools" mean compilers (GCC,
ICC, MSVC, etc.), linkers (GNU LD, etc.), and program runtimes
(OpenCL, CRT, LLVM). I'd argue Yacc and Lex are tools that produce
parsers and lexers -- maybe you have something that takes a
domain-specific language and creates C++ implementations out of the
DSL, then in the C++ you create it wouldn't be impossible to generate
code that uses Boost.Thread, Boost.Asio, Boost.*. :-D

Libraries OTOH are more components to me rather than tools. Maybe I'm
being too picky with terms, but if you meant libraries to be tools,
then I feel that simply doesn't work with how my brain is wired. It
doesn't make you wrong, but it just doesn't feel right to me. ;-)
...
...
At a lower level, you then have the parallel algorithm extensions to
the STL already being shipped by compiler vendors (free and
commercial/prioprietary) that have parallel versions of the STL
algorithms (see GNU libstdc++ that comes with GCC 4.3.x and
Microsoft's PPL).
Same remark than before. There are people out there that don't even know STL
(and sometimes proper C++) exists or even worse don't want to use it cause,
you never know, it can be bugged. And sadly, these are'nt jokes :/
I think I understand what you mean, but I don't think it's a failure
of the libraries that they're not known/used by the people doing the
programming. Much like how you can't blame the nailgun if a carpenter
didn't know about it that's why he's still using traditional hammers
and nails.
...
...
actually, a vector supercomputer is basically a SIMD machine at a
higher level -- that's if I'm understanding my related literature
correctly.
What I meant to say is that they have a fundamentally different low-level
API and such auto-vectorization at machine level may be different than
intra-processor SIMD code generation. Then again, this remarks are maybe due
to me badly reading previous post.
Ok.
...
...
You might be surprised but the GNU Compiler Collection already has
-ftree-vectorize compiler flag that will do analysis on the IR
(Intermediate Representation) of your code and generate the
appropriate SSE/SSE2/SSE3/MMX/... for your architecture. They advise
adding -msse -msse2 for x86 platforms to the flags.
Yes I am aware and all my experimentation show it produces correct code for
trivial software but fail to vectorize larger code piece and die in shame as
soon as you need shuffle, typecasting and other similar features.
Indeed.
...
...
Although I don't think you should stop with the research for better
ways to do automatic vectorization, I do however think that doing so
at the IR level at the compiler would be more fruitful especially when
combined with something like static analysis and code transformation.
Except this means a new compiler version, which means that end-user has
either to wait for those algorithm to be in the mainstream gcc or w/e
compiler distribution OR that they'll have to use some experimental version.
In the former case, they don't want to wait. In the later case, they don't
want to have to install such thingy. The library approach is a way to get
thing that works out fast and only based on existing compilers.
Nothing prevent this library to evolve as compiler do.
True, but libraries also require that users write code that actually
use the library. If the users already had code that didn't use your
library, how is it an advantage if they can get the auto-vectorization
from a future version of a compiler anyway without having to butcher
their code to use your library? And what if they find a bug in the
code using the library or (god forbid) find a bug in the library?
...
On code transformation, that's exactly what a DSEL do: Building new 'source'
from a high-level specification and writing "compiler extensions" from the
library side of the world. I've been doing this since POOMA and the avent of
Blitz++ and always thought they were spot-on for things like parallelism.
When proto was first announced and became usable, it was a real advance as
building DSL started to look like building a new language in the good ol'
ways.
Actually, DSELs require that you write code in the domain language --
and here is where the problem lies.

For instance, if I meant to write a loop that executed exactly N times
everytime, then I should be able to rely on my compiler to be smart
about it and figure out whether the target I'm building for would be
able to handle SIMD instructions and generate native code for that. It
doesn't sacrifice the readability of my code because I can write a
normal C++ loop and have the compiler do the transformation of the
code from C++ to native code without very little intervention on my
part as a programmer.

If you're talking about a DSEL at a higher level and want to be able
to extract parallelism at the domain level (say for example doing
parallel (or tree-wise parallel) computation of complex mathematical
equations) then I would agree that DSLs or even DSELs are great for
the job. However, that would also mean you're going to have to
write/create that DSL/DSEL for that specific domain, instead of a
being a DSL/DSEL mainly for parallelism.

Although I can imagine as a C++ developer writing something like:

 eval( ( task_group += task1 | task2 | task3 ) , tie(task1_future,
task2_future, task3_future) );

That it's a feasible DSEL for executing a task-group and tying their
futures to the result -- but that's hardly a DSL/DSEL for something
like say particle physics or linear algebra.

If this were the case then maybe just having this DSEL may be good to
give to parallelism-savvy C++ programmers, but not necessarily still
the domain experts who will be doing the writing of the
domain-specific logic. Although you can argue that parallel
programming is a domain in itself, in which case you're still not
bridging the gap between those that know about parallel programming
and the other domain experts.
...
...
You might even be surprised to know that OpenCL (the new framework for
dealing with heterogeneous scalable parallelism) even allows for
run-time transformation of code that's meant to work with that
framework -- and Apple's soon-to-be-released new version of its
operating system and compiler/SDK will be already supporting.
Oh, I await openCL fondly, so I'll have a new low-level tools to generate
high-level libraries ;)
So do I. :-)
...
...
Because GCC needs help with auto-vectorization, and that GCC is a
best-effort project (much like how Boost is). If you really want to
see great auto-vectorization numbers, maybe you can try looking at the
Intel compilers? Although I haven't personally gotten (published and
peer-reviewed) empirical studies to support my claim, just adding
auto-vectorization in the compilation of the source code of the
project I'm dealing with (pure C++, no hand-written SIMD stuff
necessary) I already get a significant improvement in performance
*and* scalability (vertically of course).
I'm aware of this. Gcc auto-vectorize is for me still in its infancy. Take a
simple dot product. The auto-vectorized version shows a speed-up of 2.5 for
floating point value. The handmade version goes up to 3.8. Moreover, it only
covers statically analyzable loop nest to be vectorized (i speak of what can
be done nowadays with 4.3 and 4.4, not what's announced in various paper).
With icc, the scope of vectorizable code is larger and contain sensible
parts. The code quality is also superior. Except not all platform are intel
platform.
Yes, not all platforms are Intel platforms, but I don't know if you've
noticed yet that Intel compilers even create code that will run on AMD
processors -- yes, even SSE[0..3] -- as per their product
documentation. If your target is CotS machines, I think Intel/GCC is
your best bet (at least in x84_64). I haven't dealt with other
platforms though aside from Intel/AMD, but it's not unreasonable to
think that since everybody's moving the direction of leveraging and
exploiting parallelism in hardware, that the compiler vendors will
have to compete (and eventually get better) in this regard.
...
...
I think a "better" alternative would be to help the GCC folks do a
better (?) job at writing more efficient tree-vectorization
implementations and transformations that produce great SIMD-aware code
built into the compiler. If somehow you can help with the pattern
recognition of auto-vectorizable loops/algorithms from a higher level
than the IR level and do source-level transformation of C++ (which I
think would be way cool BTW, much like how the Lisp compilers are able
to do so) to be able to produce the right (?) IR for the compiler to
better auto-vectorize, then *that* would be something else.
Fact is, well, I'm more acustomed to do software engineering than compiler
writing. I wish I could lend a hand but that's far out of skills scope.
But, I'm open to learn new thing. If you have entry in the gcc community,
I'm all for it.
Why do I get the feeling that you're saying:

 compiler writing != software engineering

? :-P

Anyway, I think if you're looking to contribute to a compiler-building
community, GCC may be a little too big (I don't want to use the term
advanced, because I haven't bothered looking at the code of the GCC
project) but I know Clang over at LLVM are looking for help to finish
the C++ implementation of the compiler front-end. From what I'm
reading with Clang and LLVM, it should be feasible to write
language-agnostic optimization algorithms/implementations just dealing
with the LLVM IR.

I personally am not attached with the GCC project -- I think they do
one heck of a great job -- but something that's detached from it would
also be feasible to write (although I don't know how practical it
would be). For example, something that takes C++ code and transforms
it to more optimized (?) C++ before passing it to the compiler would
be interesting to have. This I'm afraid is in the realm of static code
analysis and code transformation at a very high level that may or may
not be what others programming in C++ would be looking for (that's a
guess).

I think some people in the list have already contributed to the GCC's
C++ compiler project (I remember Doug Gregor did some work (if not all
of it?) on ConceptGCC which prototyped C++0x Concepts support in the
compiler). Maybe they can help out in that regard.
...
...
Maybe you'd also like to look at GCC-ICI [0] to see how you can play
around with extending GCC then it comes to implementing these
(admittedly if I may say so) cool optimizations of the algorithms to
become automatically vectorized.
Reference noted. :)
:-)
...
...
I think I understand what you're trying to achieve by adding a layer
of indirection at the library/code level -- however I personally think
(and have read through numerous papers already, while doing my
research about parallel computing earlier in my student days) that
these optimizations/transformations are best served by the tools that
create the machine code (i.e. compilers) rather then dealt with at the
source code level. What I mean by this is that even though it's
technically possible to implement a "parallel C++ DSEL" (which can be
feasibly achieved now with Boost.Proto and taking some lessons with
attribute grammars [1] and how Spirit 2(x?) with a combination of
Boost.Phoenix does it) it would be reaching a bigger audience and
serving a larger community if the compilers became smarter at doing
the transformations themselves, rather than getting C++ developers to
learn yet another library.
Well except if this simd library was not aimed at users. The vec library I
proposed is here for library builder as a way to quickly express SIMD code
fragment across SIMD platform. It is NOT meant to be yet another array class
with SIMD capability because this models is too restrictive for library
developpers that may need SIMD access for doing something else with a
differen tmodel. And well, it doesn't sound worse than learning a new
boost.asio or boost.thread library.
In that case, I think that kind of library (DSEL) would be nice to
have -- especially to abstract the details of expressing parallelism
in general a the source code level.
...
I already have something far more high-level from which this code is
extracted from. NT2 is a matlab-like scientific computing library[2] that
just reinject Matlab syntax into C++ via a coupel of DSEL and take care of
vectorization and thread creation on multi-core machine. This was found to
be more appealing as user can just get their Matlab code, copy+paste them
into a cpp file, search/replace a few syntax quirks and compile with NT2 to
get instant performance increase. I can tell you that this appeal more to
HPC users than any low-level API, even with the sexier encapsulation you can
have.
Nice! I would agree that something like *that* is appropriate as a
domain-specific language which leverages parallelism in the details of
the implementation.
...
As for the source level code. They are plenty and successful. Now, are you
able to name *one* that is actually used outside academic research ? I'm
still convinced that high-level, domain-oriented tools are the way to go,
not just adding new layer of things like MPI, OpenCL or what not. However,
those layers, which have the tendency to be multiplied as architecture
flavor changes, need to abstracted. And that's only such an abstraction that
I was proposing for SIMD intrinsics : a simple, POD like entity that maps
onto those and take care of generating the most efficient code for a given
platform. I don't why it could be different in scope than boost.thread that
do the same with threading API.
I'm not discouraging you with writing a library that tackles the
expression of parallelism at the source code level in a coherent
manner -- I am actually encouraging it if you haven't done so yet. I'd
love to get my hands on something like that to be able to express the
parallelism in the solution without having to worry about exactly how
many threads I'll be running to parallelize this part of the code that
simply executed a set of tasks, etc.

I however think also that there are some details that would be nice to
tackle at the appropriate layer -- SIMD code construction is, well,
meant to be at the domain of the compiler (as far as SSE or similar
things go). OpenCL is meant to be an interface the hardware and
software vendors are moving towards supporting for a long time coming
(at least what I'm reading from the press releases) so I'm not too
worried about the combinatorial explosion of architectures and
parallelism runtimes.
...
On a larger scale, my stance on the problem is the following : Architectures
get more and more complex and, following this trend, low-level tools starts
to loose their interest but are mostly the only thing available. What is
needed is a high-level abstraction layer on those. But no abstraction can
encompass *all* need of parallel software developers so there is a need for
various models and abstraction (ranging from arrays to agents to god know
what) and all of them need an interoperable interface. This is what DSL
construction tools allow us to do : quickly and properly specify those
precise scoped tools in form of library. When proper compiler support and/or
tools will become available, we'll just shift the library implementation,
much like Boost already support 0x constructs inside its implementation.
I agree completely, but I'm afraid if the DSEL is for expressing
parallelism in C++, the goal of "giving domain experts tools that knew
about the parallelism" wouldn't be met readily. For C++ developers
that want to leverage parallelism in general sure, but I don't think
I'd be particularly compelled to use a SIMD-only DSEL.
...
Hoping that we don't hijack the list too much. But I'm curious to know the
position of other boost members concerning how parallelism problem should be
solved within our bounds.
Me too, but unfortunately I don't know a better venue to be able to
discuss the development of a DSEL in C++ to tackle the parallelism
issue. :-) Boost is about the "smartest" mailing list I know on which
issues like these can be discussed in a fashion that is relevant to
C++ and technical enough for those who know about the subject. ;-)
...
References
[1] Quaff:
-> principles : http://www.lri.fr/~falcou/pub/falcou-PARCO-2007.pdf
-> the Cell version (french paper) :
http://www.lri.fr/~falcou/pub/falcou-SYMPA-2008.pdf
[2] NT2 : http://www.springerlink.com/content/l4r4462r25740127/
Thanks for the links!

-- 
Dean Michael C. Berris
Software Engineer, Friendster, Inc.