
On Tuesday 20 January 2009 18:51, David Abrahams wrote:
>>> Why are these SIMD operations different in that respect from, say, large matrix multiplications?
>> A matrix multiplication is a higher-level construct. Still, most compilers will pattern-match matrix multiplication to an optimal routine.
> Not hardly. No compiler is going to introduce register and cache-level blocking.
I'm not sure what you mean here. Compilers do blocking all the time. Typically, the compiler will match a matrix multiply (and a number of other patterns) to library code that has been pre-tuned. That library code usually has a number of possible paths based on the size of the matrices, etc., and some of those paths may be blocked at several different levels. Or not, if that gives better performance. There's a rich body of research on how to auto-tune library code for just such purposes.
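For a rough idea of what one such pre-tuned path bakes in, here is a minimal sketch of a single level of cache blocking; the tile size and the use of only one blocking level are illustrative assumptions, not tuned choices:

    #include <cstddef>

    // One level of cache blocking (loop tiling) for C = A * B, with
    // square N x N row-major matrices and C zero-initialized by the
    // caller. BLOCK is an illustrative tile size, not a tuned value;
    // a real library picks it per target and may block at several levels.
    const std::size_t BLOCK = 64;

    void blocked_matmul(const double* A, const double* B, double* C,
                        std::size_t N)
    {
        for (std::size_t ii = 0; ii < N; ii += BLOCK)
        for (std::size_t kk = 0; kk < N; kk += BLOCK)
        for (std::size_t jj = 0; jj < N; jj += BLOCK)
            // Work on one tile; the tiles of A and B stay cache-resident
            // across the inner loops.
            for (std::size_t i = ii; i < ii + BLOCK && i < N; ++i)
                for (std::size_t k = kk; k < kk + BLOCK && k < N; ++k) {
                    const double a = A[i * N + k];  // held in a register
                    for (std::size_t j = jj; j < jj + BLOCK && j < N; ++j)
                        C[i * N + j] += a * B[k * N + j];
                }
    }

A tuned library would choose BLOCK per target, add register-level tiling inside the inner loops, and keep an unblocked path for small matrices; the compiler's job is just to recognize the pattern and dispatch to it.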
>> SIMD code generation is extremely low-level. Programmers want to think at a higher level.
> Naturally. But are the algorithms implemented by SIMD instructions lower-level than std::for_each or std::accumulate? If not, maybe they deserve to be in a library.
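Here is the same reduction spelled at both levels, as a rough sketch assuming x86 with SSE2; the function names are illustrative:

    #include <cstddef>
    #include <numeric>
    #include <vector>
    #include <emmintrin.h>  // SSE2 intrinsics

    // High level: the algorithm as a standard library call.
    double sum_stl(const std::vector<double>& v)
    {
        return std::accumulate(v.begin(), v.end(), 0.0);
    }

    // Low level: the same reduction spelled in SSE2 intrinsics.
    // x86-only; AltiVec, NEON, etc. would each need their own version.
    // Note the reassociation can change rounding in the last bits
    // compared with the strictly left-to-right std::accumulate.
    double sum_sse2(const double* p, std::size_t n)
    {
        __m128d acc = _mm_setzero_pd();
        std::size_t i = 0;
        for (; i + 2 <= n; i += 2)                  // two doubles per step
            acc = _mm_add_pd(acc, _mm_loadu_pd(p + i));
        double lanes[2];
        _mm_storeu_pd(lanes, acc);
        double sum = lanes[0] + lanes[1];
        for (; i < n; ++i)                          // scalar remainder
            sum += p[i];
        return sum;
    }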
A library of fast routines for doing various things is quite different from creating a whole DSEL to do SIMD code generation. A library of fast matrix multiply, etc. would indeed be useful. How much does Boost want to concern itself with providing libraries tuned with asm routines for various architectures? It strikes me that writing these routines using gcc intrinsics wouldn't result in optimal code on all architectures. Similarly, it seems that a DSEL to do the same would have similar deficiencies. When you're talking "optimal," you're setting a pretty dang high bar.
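For instance, gcc's generic vector extension compiles everywhere, but what it lowers to is entirely target-dependent; an illustrative sketch (the typedef and function name are made up):

    // gcc's generic vector extension: the source is portable, the code
    // it lowers to is not. On x86 with SSE this add becomes a single
    // addps; on a target without 128-bit float vectors gcc falls back
    // to four scalar adds.
    typedef float v4sf __attribute__((vector_size(16)));

    v4sf add4(v4sf a, v4sf b)
    {
        return a + b;  // one instruction on SSE, scalarized elsewhere
    }

-Dave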