
I'm not sure how or why this turned into a discussion of the general concurrency problem in C++. It's interesting, certainly, but it should probably be considered a separate topic from SIMD and auto-vectorization. It doesn't seem fair to criticize a SIMD library for fighting one battle instead of winning the whole war; you could level similar criticisms at Boost.Thread or Boost.MPI, yet those are useful libraries.

The sense I'm getting from this discussion is that SIMD code generation is uninteresting, and that we should stick our heads in the sand and wait for the Sufficiently Smart Compilers to come along. OK, I sympathize with this. Writing functions with SIMD intrinsics is a distraction from the computer vision tasks I actually care about, but I have time budgets to meet, and usually some work of this type has to be done. IMO, waiting for compiler technology is neither pragmatic in the short term nor (as I argued in the other thread) conceptually correct.

If you look at expression-template based linear algebra libraries like uBLAS and Eigen2, these are basically code generation libraries (if compilers were capable of complicated loop fusion optimizations, we might not need such libraries at all). Given an expression involving vectors, it's fairly mechanical to transform it directly into optimal assembly. At the level of optimizing IR code, by contrast, reasoning about pointers and loops is rather complicated. Are the pointers 16-byte aligned? Does it make sense to partially unroll the loop to exploit data parallelism? What transformations can (and should) we make when traversing a matrix in a double loop? There are all sorts of obstacles the compiler must overcome to vectorize code. Why not handle these issues in the stage of compilation with the best information and the clearest idea of what the final assembly code should look like - in this case, at the library level with meta-programming?
The fact of the matter is that compilers do not generate optimal SIMD-accelerated code except in the simplest of cases, and so we end up using SIMD intrinsics by hand. Frankly, I don't expect this to change dramatically anytime soon; I'm no compiler expert, but my impression is that some complicated algebraic optimizations (for which C++ is not well suited) are necessary.

Using SIMD intrinsics by hand is nasty in assorted ways. The syntax is not standard across compilers. The instructions are not generic: if I change datatypes from double to float, or from int to short, I have to completely rewrite the function. Maybe someone wants to run my SSE-enabled code on an Altivec processor - what then? There may be counterparts to all the instructions, but I don't know Altivec. Even different versions of the same instruction set are a problem: do I really want to weigh the benefits of SSSE3 against plain SSE2 and write a different version of the same function for each instruction set?

Just having a nice, generic wrapper around SIMD ops would be a big step toward easing the writing of SIMD-enabled code and making it accessible to a wider community. I think the Proto-based approach holds promise for taking advantage of various instruction sets without the user having to think much about it.

-Patrick

On Mon, Jan 19, 2009 at 11:14 AM, David A. Greene <greened@obbligato.org> wrote:
On Monday 19 January 2009 00:50, Dean Michael Berris wrote:
I agree, but if you're going to tackle the concurrency problem through a DSEL, I'd think a DSEL at a higher level than SIMD extensions would be more fruitful. For example, something like:
    vector<huge_numbers> numbers;
    // populate numbers

    async_result_stream results =
        apply(numbers, [... insert funky parallelisable lambda construction ...]);

    while (results) {
        huge_number a;
        results >> a;
        cout << a << endl;
    }
Would be able to spawn thread pools, launch tasks, and provide an interface for getting the results using futures underneath. Domain experts who already know C++ would be able to express their funky parallelisable lambda construction and simply trust that the facility will do the necessary decomposition and parallelism as far as it can at the library level. This, I think, is feasible (although a little hard) to achieve -- and whether the compiler can then vectorize an inner loop in the decomposed lambda construction is a detail the library doesn't even necessarily have to deal with.
Indeed, I think something like this is the right approach. Intel's Threading Building Blocks is one attempt. I'm not saying it's the best, but it's at least instructive.
-Dave
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost