Re: [boost] [OT?] SIMD and Auto-Vectorization (was Re: How to structurate libraries ?)

20 Jan 2009

Hi Patrick,

On Tue, Jan 20, 2009 at 9:01 AM, Patrick Mihelich
<patrick.mihelich@gmail.com> wrote:
...
> I'm not sure how or why this turned into a discussion of the general
> concurrency problem in C++. This is interesting, certainly, but should
> probably be considered a separate topic from SIMD and auto-vectorization. It
> doesn't seem very fair to me to criticize a SIMD library for fighting one
> battle instead of winning the whole war; you could make similar criticisms
> of Boost.Thread or Boost.MPI, yet these are useful libraries.
I'm not concerned with being fair; I'm concerned with tackling the
issue from a larger perspective.

I am not criticizing a SIMD library in the sense of claiming that
nobody will want to use it. I personally think that wrapping it in a
DSEL is clever, but the scope of the problem is nonetheless too
narrowly defined. That is, *I* don't think putting these groups of
operations together needs to be that complicated.
...
> The sense I'm getting from this discussion is that SIMD code generation is
> uninteresting, and that we should stick our heads in the sand and wait for
> the Sufficiently Smart Compilers to come along. OK, I sympathize with this.
> Writing functions using SIMD intrinsics is a bit of a distraction from the
> computer vision tasks I actually care about, but I have time budgets to meet
> and usually some work of this type has to be done.
Actually, I think you're missing the point (at least from what I'm saying).

I'm saying SIMD code generation ought to be the job of the compiler(s)
on the platforms where it makes sense. Now *if* you want to make it
work explicitly, you can do what others have already been doing: add
a layer of indirection.

Now this layer of indirection can be as clever as a DSEL (which I
don't think it needs to be) or as simple as a function that switches
implementations at compile time using preprocessor macros or some
other facility. If you need to optimize a set of operations that are
specific to your field (for example, applying a blur to a set of
pixels represented as floats), then I don't find it hard to imagine
having that specific part hand-optimized for your need.
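
A minimal sketch of the "simple function" end of that spectrum, for
illustration only. The name scale_pixels is hypothetical, and the
choice of SSE plus a scalar fallback is an assumption about the
target; other platforms would get their own #elif branch.

```cpp
#include <cstddef>

#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics, x86 only
#endif

// Hypothetical helper: multiply n float pixels in place by a gain.
// Callers see one function; the body is chosen at compile time.
inline void scale_pixels(float* px, std::size_t n, float gain) {
    std::size_t i = 0;
#if defined(__SSE__)
    // SSE path: four floats per iteration, unaligned loads and stores.
    const __m128 vgain = _mm_set1_ps(gain);
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(px + i);
        _mm_storeu_ps(px + i, _mm_mul_ps(v, vgain));
    }
#endif
    // Scalar path: the leftover tail, or everything when SSE is absent.
    for (; i < n; ++i) {
        px[i] *= gain;
    }
}
```

The call site stays the same on every platform; only the function
body knows which instruction set is in use.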

Does this need another library? I'd wager it doesn't -- it's like
saying you're implementing a DSEL in C++ to do simple mathematics. Now
if you want to write your own DSEL for image manipulation and have it
perform the transformation in the background to use the SIMD
instructions of a specific platform, then fine, that would be great.
The details of the implementation would be just that: details. I don't
see a need for a special library just for SIMD instructions,
*especially* since the compilers will be able to automatically
vectorize the parts that can easily be vectorized. (I use the term
"easily" very loosely here, because that depends on the compiler
you're using.)
...
> IMO, waiting for compiler technology is neither pragmatic in the short-term
> nor (as I argued in the other thread) conceptually correct. If you look at
> expression-template based linear algebra libraries like uBlas and Eigen2,
> these are basically code generation libraries (if compilers were capable of
> complicated loop fusion optimizations, we might not need such libraries at
> all). Given an expression involving vectors, it's fairly mechanical to
> transform it directly into optimal assembly. Whereas at the level of
> optimizing IR code, reasoning about pointers and loops is rather
> complicated. Are the pointers 16-byte aligned? Does it make sense to
> partially unroll the loop to exploit data parallelism? What transformations
> can (and should) we make when traversing a matrix in a double loop? There
> are all sorts of obstacles the compiler must overcome to vectorize code. Why
> not handle these issues in the stage of compilation with the best
> information and clearest idea of what the final assembly code should look
> like - in this case at the library level with meta-programming?
I am not against it. If you're talking about fixing uBlas to make it
aware of the capabilities of a platform and to perform the
transformations necessary to leverage vendor-specific libraries, then
I'm all for it. Do I think it needs a special DSEL/library for doing
so? *That* is what I'm questioning.
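
For readers who have not met the expression-template technique that
the quoted text attributes to uBlas and Eigen2, here is a toy sketch
of the loop fusion it buys. The names (sketch, Vec, Sum) are
hypothetical and this is not how either library is implemented; it
only shows the idea that a + b + c builds a lightweight expression
object and that a single loop runs at assignment.

```cpp
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical namespace, keeps operator+ contained

// Lazy node for an element-wise sum; nothing is computed until indexed.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}
    float operator[](std::size_t i) const { return data[i]; }
    float& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from an expression runs one fused loop, no temporaries.
    // Assumes the expression has the same size as this vector.
    template <typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Builds an expression node instead of computing a temporary vector.
template <typename L, typename R>
Sum<L, R> operator+(const L& l, const R& r) {
    Sum<L, R> node = {l, r};
    return node;
}

}  // namespace sketch
```

With this, r = a + b + c; evaluates a[i] + b[i] + c[i] in a single
pass. Whether that loop is then vectorized is still up to the
compiler, or to a hand-written kernel inside operator=.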

The reason I like the thought of letting the compiler do the
auto-vectorization for me is that the compiler already knows about my
code and the transformations it's going to apply to it -- there's no
reason a compiler can't know the details you talk about. It's not even
absurd to imagine a compiler turning certain code patterns into
OpenMP-parallelized parts of the solution and then, at an even lower
level, generating SIMD code that leverages the SIMD extensions it has
available.
...
> The fact of the matter is that compilers do not generate optimal
> SIMD-accelerated code except in the simplest of cases, and so we end up
> using SIMD intrinsics by hand. Frankly I don't expect this to change
> dramatically anytime soon; I'm no compiler expert, but my impression is that
> some complicated algebraic optimizations (for which C++ is not very suited)
> are necessary.
And I don't question the fact that compilers do not yet generate
optimal SIMD-accelerated code -- especially if you're talking about
GCC. I might be surprised to hear the same about Intel's compiler, as
I know that it performs quite advanced optimizations on the code to
leverage SSE on the platforms it supports.

If it's a matter of arranging your C++ so that it can be automatically
vectorized by a less sophisticated yet auto-vectorizing compiler, then
I would think that is a more achievable (and easier?) goal than
releasing and maintaining a SIMD-only DSEL/library.
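
As a rough illustration of what "arranging your C++" can mean, the
sketch below uses a unit-stride loop over contiguous data with a
promise that the arrays do not overlap. The function name saxpy is
just a conventional label, and __restrict is a compiler extension
(GCC, Clang and MSVC accept it), not standard C++. Whether the loop
is actually vectorized depends on the compiler and flags, for
example GCC's -ftree-vectorize.

```cpp
#include <cstddef>

// y[i] += alpha * x[i] for i in [0, n).
// __restrict tells the compiler that x and y never alias, which removes
// one of the usual obstacles to auto-vectorization.
inline void saxpy(std::size_t n, float alpha,
                  const float* __restrict x, float* __restrict y) {
    // Simple counted loop, unit stride, no branches and no calls in the
    // body: the shape auto-vectorizers tend to handle best.
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```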
...
> Using SIMD intrinsics by hand is nasty in assorted ways.
I know.
...
> The syntax is not
> standard across compilers. The instructions are not generic; if I change
> datatypes from double to float, or int to short, I have to completely
> rewrite the function.
What stops you from adding a function that specializes on the types on
top of these SIMD-specific functions/vectors?
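
One way such a type-specializing layer could look, as a sketch only.
The trait name pack and the function add_arrays are hypothetical; the
SSE2 intrinsics are real, and a scalar fallback is assumed for
targets without SSE2.

```cpp
#include <cstddef>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in SSE)

template <typename T> struct pack;  // hypothetical trait, one per type

template <> struct pack<float> {
    typedef __m128 type;
    static const std::size_t width = 4;  // floats per register
    static type load(const float* p) { return _mm_loadu_ps(p); }
    static void store(float* p, type v) { _mm_storeu_ps(p, v); }
    static type add(type a, type b) { return _mm_add_ps(a, b); }
};

template <> struct pack<double> {
    typedef __m128d type;
    static const std::size_t width = 2;  // doubles per register
    static type load(const double* p) { return _mm_loadu_pd(p); }
    static void store(double* p, type v) { _mm_storeu_pd(p, v); }
    static type add(type a, type b) { return _mm_add_pd(a, b); }
};
#else
// Scalar fallback: a "register" holding exactly one element.
template <typename T> struct pack {
    typedef T type;
    static const std::size_t width = 1;
    static type load(const T* p) { return *p; }
    static void store(T* p, type v) { *p = v; }
    static type add(type a, type b) { return a + b; }
};
#endif

// Written once; switching from double to float needs no rewrite.
template <typename T>
void add_arrays(const T* a, const T* b, T* out, std::size_t n) {
    typedef pack<T> P;
    std::size_t i = 0;
    for (; i + P::width <= n; i += P::width)
        P::store(out + i, P::add(P::load(a + i), P::load(b + i)));
    for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
}
```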
...
> Maybe someone wants to run my SSE-enabled code on an
> Altivec processor, what then? There may be counterparts to all the
> instructions, but I don't know Altivec. Even different versions of the same
> instruction set are a problem; do I really want to think about the benefits
> of SSSE3 vs. just SSE2 and write different versions of the same function for
> the various instruction sets?
Which is why I'd rather rely on the compiler to do it for me. If the
compiler for the Altivec processor doesn't know how to auto-vectorize
my code, then tough luck: I'm going to use the SIMD-specific
functions/vectors for that platform anyway. If I already had a
SIMD-enabled min() function that broke up a large vector into smaller
SIMD vectors and ran faster than a non-SIMD version of min(), then I
could specialize it for the Altivec platform.
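
A sketch of the kind of chunked min() described above, under stated
assumptions: the name min_value is hypothetical, only an SSE branch
and a scalar fallback are shown, and NaN handling is ignored. An
Altivec specialization would be a further branch, not attempted here.

```cpp
#include <algorithm>
#include <cstddef>

#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics, x86 only
#endif

// Hypothetical helper: smallest value in data[0..n). Requires n > 0.
inline float min_value(const float* data, std::size_t n) {
    float result = data[0];
    std::size_t i = 0;
#if defined(__SSE__)
    if (n >= 4) {
        // Walk the array four floats at a time, keeping four running
        // minima (one per lane) in a single register.
        __m128 vmin = _mm_loadu_ps(data);
        for (i = 4; i + 4 <= n; i += 4) {
            vmin = _mm_min_ps(vmin, _mm_loadu_ps(data + i));
        }
        // Reduce the four lane minima to one scalar.
        float lanes[4];
        _mm_storeu_ps(lanes, vmin);
        result = std::min(std::min(lanes[0], lanes[1]),
                          std::min(lanes[2], lanes[3]));
    }
#endif
    // Scalar path: the leftover tail, or everything without SSE.
    for (; i < n; ++i) {
        result = std::min(result, data[i]);
    }
    return result;
}
```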

Heck, if I were writing an image manipulation library for different
platforms, I might chunk up the operations at an even higher level and
leverage the SIMD-izable parts at that level, specific to the image
manipulation library.

But again, I don't see a need for a library dedicated to
SIMD-enabling operations on values. But then again, that's just me.
...
> Just having a nice, generic wrapper around SIMD ops would be a big step to
> ease the writing of SIMD-enabled code and make it accessible to a wider
> community. I think the Proto-based approach holds promise for taking
> advantage of various instruction sets without the user having to think much
> about it.
Here lies the problem: SIMD operations are specific, yet generic
enough to be used on their own already. What you want to be able to
deal with is the vendor-provided interface to the SIMD vectors and
the SIMD operations they already support. More specifically, see
GCC's support for SSE/SSE2/SSE3/MMX/... registers/vectors and
operations [0]; and if you're lucky enough to get your hands on
Intel's compilers, you can also read about how to lay out your code
so that the compiler can vectorize it automatically [1].
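
For a taste of the vendor-provided interface in [0], here is a small
sketch using GCC's vector extensions (Clang accepts the same syntax).
The typedef name v4sf and the function madd are illustrative choices,
not names mandated by the compiler.

```cpp
// GCC/Clang extension: a 16-byte vector holding four packed floats.
typedef float v4sf __attribute__((vector_size(16)));

// Element-wise a * b + c. The arithmetic operators work directly on the
// vector type; the compiler picks the instructions for the target, or
// falls back to scalar code where no SIMD unit is available.
inline v4sf madd(v4sf a, v4sf b, v4sf c) {
    return a * b + c;
}
```

No intrinsics header is involved, so the same source can be compiled
for SSE, Altivec or a target with no SIMD unit at all.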

References:

[0] - http://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/Vector-Extensions.html#Vector-Ex...
[1] - http://www.aartbik.com/SSE/index.html

-- 
Dean Michael C. Berris
Software Engineer, Friendster, Inc.
