
Hi Patrick,

On Tue, Jan 20, 2009 at 9:01 AM, Patrick Mihelich <patrick.mihelich@gmail.com> wrote:
I'm not sure how or why this turned into a discussion of the general concurrency problem in C++. This is interesting, certainly, but should probably be considered a separate topic from SIMD and auto-vectorization. It doesn't seem very fair to me to criticize a SIMD library for fighting one battle instead of winning the whole war; you could make similar criticisms of Boost.Thread or Boost.MPI, yet these are useful libraries.
I'm not trying to be fair; I'm trying to tackle the issue from a larger perspective. I'm not claiming that nobody would want to use a SIMD library -- I personally think wrapping it in a DSEL is clever -- but the scope of the problem is too narrowly defined. In other words, *I* don't think putting these groups of operations together needs to be that complicated.
The sense I'm getting from this discussion is that SIMD code generation is uninteresting, and that we should stick our heads in the sand and wait for the Sufficiently Smart Compilers to come along. OK, I sympathize with this. Writing functions using SIMD intrinsics is a bit of a distraction from the computer vision tasks I actually care about, but I have time budgets to meet and usually some work of this type has to be done.
Actually, I think you're missing the point (at least of what I'm saying). I'm saying SIMD code generation ought to be the job of the compiler(s) on the platforms where it makes sense. Now *if* you wanted to make it work explicitly, you can do what others have already been doing: add a layer of indirection. This layer of indirection can be as clever as a DSEL (which I don't think it needs to be) or as simple as a function that switches implementations at compile time using preprocessor macros or some other facility. If you needed to optimize a set of operations that are specific to your field (for example, applying a blur to a set of pixels represented as floats), then I wouldn't find it hard to imagine having that specific part hand-optimized for your need. Does this need another library? I'd wager it doesn't -- it's like saying you're implementing a DSEL in C++ to do simple mathematics. Now if you want to write your own DSEL for image manipulation and have it perform the transformation in the background to use the SIMD instructions of a specific platform, then fine, that would be great -- but the details of the implementation would be just that, details. I don't see a need for a special library just for SIMD instructions, *especially* since the compilers will be able to automatically vectorize the parts that can easily be vectorized. (I use the term "easily" very loosely here, because that depends on the compiler you're using.)
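To make the "simple" end of that spectrum concrete, here is a minimal sketch of a function that switches implementations at compile time with preprocessor macros. The function name `add_floats` is hypothetical and not from any library; the sketch assumes an x86 compiler that defines `__SSE__` when SSE is enabled (GCC and Clang do), and it falls back to a plain loop everywhere else.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
inline void add_floats(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    std::size_t i = 0;
#if defined(__SSE__)
    // SSE path: four floats per iteration. Unaligned loads/stores are used
    // so callers do not have to guarantee 16-byte alignment.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar path: the whole array when SSE is unavailable,
    // or just the leftover tail elements when it is.
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

Callers only ever see `add_floats(a, b, out, n)`; which path gets compiled is decided by the build flags, not by the calling code.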
IMO, waiting for compiler technology is neither pragmatic in the short-term nor (as I argued in the other thread) conceptually correct. If you look at expression-template based linear algebra libraries like uBlas and Eigen2, these are basically code generation libraries (if compilers were capable of complicated loop fusion optimizations, we might not need such libraries at all). Given an expression involving vectors, it's fairly mechanical to transform it directly into optimal assembly. Whereas at the level of optimizing IR code, reasoning about pointers and loops is rather complicated. Are the pointers 16-byte aligned? Does it make sense to partially unroll the loop to exploit data parallelism? What transformations can (and should) we make when traversing a matrix in a double loop? There are all sorts of obstacles the compiler must overcome to vectorize code. Why not handle these issues in the stage of compilation with the best information and clearest idea of what the final assembly code should look like - in this case at the library level with meta-programming?
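For readers unfamiliar with the expression-template technique mentioned here, the following is a deliberately tiny sketch of the idea, not the way uBlas or Eigen2 actually implement it. The names `Vec` and `Sum` are hypothetical. The point it illustrates is loop fusion: `r = a + b + c` builds a lightweight expression object and evaluates it in a single loop with no temporary vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expression node for lhs + rhs. It only stores references;
// nothing is computed until an element is requested.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    Sum(const L& l, const R& r) : lhs(l), rhs(r) {}
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}
    float operator[](std::size_t i) const { return data[i]; }
    float& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from an expression runs one fused loop over all operands.
    template <typename E>
    Vec& operator=(const E& e) {
        assert(e.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Vec + Vec builds a node instead of a new vector.
inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) {
    return Sum<Vec, Vec>(a, b);
}

// (expression) + Vec nests the existing node inside a new one.
template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) {
    return Sum<Sum<L, R>, Vec>(a, b);
}
```

Because the nodes hold references, an expression should be assigned within the same statement that builds it; storing one for later use would leave dangling references. A real library would also put SIMD loads and stores inside the fused loop, which is where the library-level approach argued for above comes in.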
I am not against it -- if you're talking about fixing uBlas to make it aware of the capabilities of a platform and perform the transformations necessary to leverage vendor-specific libraries, then I'm all for it. Do I think it needs a special DSEL/library for doing so? *That* is what I'm questioning. The reason I like the thought of letting the compiler do the auto-vectorization for me is that the compiler already knows about my code and the transformations it's going to apply to it -- there's no reason a compiler couldn't know the details you talk about. It's not even absurd for a compiler to transform certain code patterns to use OpenMP to parallelize parts of the solution, and then at an even lower level generate SIMD code that leverages the SIMD extensions of the platform it's targeting.
The fact of the matter is that compilers do not generate optimal SIMD-accelerated code except in the simplest of cases, and so we end up using SIMD intrinsics by hand. Frankly I don't expect this to change dramatically anytime soon; I'm no compiler expert, but my impression is that some complicated algebraic optimizations (for which C++ is not very suited) are necessary.
And I don't question the fact that compilers do not yet generate optimal SIMD-accelerated code -- especially if you're talking about GCC. I might be surprised to hear the same about Intel's compiler, but I know that it does perform quite advanced optimizations on the code to leverage SSE on the platforms it supports. If it's a matter of arranging your C++ so that it can be automatically vectorized by a less sophisticated yet auto-vectorizing compiler, then I would think that is a more achievable (and easier?) goal than releasing/maintaining a SIMD-only DSEL/library.
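As a rough illustration of what "arranging your C++" can mean, here is a sketch of a loop written in a vectorizer-friendly shape: a simple counted loop over contiguous arrays, with no aliasing between input and output. The name `saxpy` is just the conventional label for this operation, and `__restrict` is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++. Whether a given compiler actually vectorizes the loop depends on the compiler and its flags (for GCC, `-O3` or `-ftree-vectorize`), so treat this as a sketch rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>

// y[i] = a * x[i] + y[i] for i in [0, n).
// __restrict promises the compiler that x and y never overlap, which
// removes one of the usual obstacles to automatic vectorization.
inline void saxpy(float a, const float* __restrict x,
                  float* __restrict y, std::size_t n) {
    assert(n == 0 || (x && y));
    // Unit stride, trip count known on entry, no early exits,
    // no function calls in the body.
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The same source compiles unchanged for any target; the choice of SSE, Altivec or plain scalar code is left entirely to the compiler.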
Using SIMD intrinsics by hand is nasty in assorted ways.
I know.
The syntax is not standard across compilers. The instructions are not generic; if I change datatypes from double to float, or int to short, I have to completely rewrite the function.
What stops you from adding a function that specializes on the types on top of these SIMD-specific functions/vectors?
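A minimal sketch of what such a type-dispatching function might look like: one generic entry point, with overloads underneath that pick the float or double SSE kernel and a scalar fallback for every other type. All names here (`vec_add`, `detail::add_impl`) are hypothetical, and the sketch assumes a compiler that defines `__SSE2__` when SSE2 is enabled.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2; also provides the SSE float intrinsics
#endif

namespace detail {

// Generic fallback for any type without a hand-written kernel.
template <typename T>
void add_impl(const T* a, const T* b, T* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#if defined(__SSE2__)
// float kernel: four lanes per 128-bit register.
inline void add_impl(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i) out[i] = a[i] + b[i];  // leftover tail
}

// double kernel: two lanes per 128-bit register.
inline void add_impl(const double* a, const double* b, double* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2)
        _mm_storeu_pd(out + i, _mm_add_pd(_mm_loadu_pd(a + i), _mm_loadu_pd(b + i)));
    for (; i < n; ++i) out[i] = a[i] + b[i];  // leftover tail
}
#endif

}  // namespace detail

// Single entry point: overload resolution prefers the non-template
// float/double kernels and uses the generic template for everything else.
template <typename T>
void vec_add(const T* a, const T* b, T* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    detail::add_impl(a, b, out, n);
}
```

Switching a caller from `double` to `float` then changes nothing at the call site; only the overload selected underneath differs.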
Maybe someone wants to run my SSE-enabled code on an Altivec processor, what then? There may be counterparts to all the instructions, but I don't know Altivec. Even different versions of the same instruction set are a problem; do I really want to think about the benefits of SSSE3 vs. just SSE2 and write different versions of the same function for the various instruction sets?
Which is why I'd rather rely on the compiler to do it for me -- if the compiler for the Altivec processor doesn't know how to auto-vectorize my code, then tough luck: I'm going to use the SIMD-specific functions/vectors for that platform anyway. If I already had a SIMD-enabled min() function that broke up a large vector into smaller SIMD vectors and ran faster than a non-SIMD min(), then I could specialize it for the Altivec platform. Heck, if I were writing an image manipulation library for different platforms, I might chunk up the operations at an even higher level and leverage the SIMD-izable parts at that level, specific to the image manipulation library. But again, I don't see a need for a library dedicated to SIMD-enabling operations on values. But then again, that's just me.
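The kind of SIMD-enabled min() described above could look roughly like the sketch below: the array is consumed in four-float chunks, the per-lane minima are reduced at the end, and a scalar loop handles the tail. The name `min_value` is hypothetical. Only an SSE path and a scalar fallback are shown; an Altivec branch would be one more `#elif` with that platform's intrinsics, which is omitted here rather than guessed at.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

#if defined(__SSE__)
#include <xmmintrin.h>  // _mm_loadu_ps, _mm_min_ps, _mm_storeu_ps
#endif

// Hypothetical helper: smallest value in data[0..n), n must be > 0.
// NaN inputs are not handled specially in this sketch.
inline float min_value(const float* data, std::size_t n) {
    assert(data && n > 0);
    std::size_t i = 0;
    float result = data[0];
#if defined(__SSE__)
    if (n >= 4) {
        // Keep four running minima, one per lane.
        __m128 vmin = _mm_loadu_ps(data);
        for (i = 4; i + 4 <= n; i += 4)
            vmin = _mm_min_ps(vmin, _mm_loadu_ps(data + i));
        // Reduce the four lanes to a single scalar.
        float lanes[4];
        _mm_storeu_ps(lanes, vmin);
        result = std::min(std::min(lanes[0], lanes[1]),
                          std::min(lanes[2], lanes[3]));
    }
#endif
    // Scalar loop: everything without SSE, or the remaining tail with it.
    for (; i < n; ++i) result = std::min(result, data[i]);
    return result;
}
```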
Just having a nice, generic wrapper around SIMD ops would be a big step to ease the writing of SIMD-enabled code and make it accessible to a wider community. I think the Proto-based approach holds promise for taking advantage of various instruction sets without the user having to think much about it.
Here lies the problem: SIMD operations are specific yet generic enough to be used on their own already. What you want to be able to deal with is the vendor-provided interface to the SIMD vectors and SIMD operations they already support. More specifically, see GCC's support for SSE/SSE2/SSE3/MMX/... registers/vectors and operations [0]; and if you're lucky enough to get your hands on Intel's compilers, you can also read about how to lay out your code so that the compiler can automatically vectorize it for you [1].

References:
[0] - http://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/Vector-Extensions.html#Vector-Ex...
[1] - http://www.aartbik.com/SSE/index.html

--
Dean Michael C. Berris
Software Engineer, Friendster, Inc.
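For reference, the GCC vector extensions cited in [0] let ordinary arithmetic operators work on fixed-size vector types, with the compiler choosing the instructions for the target. The sketch below assumes GCC or a compatible compiler such as Clang (both define `__GNUC__`); the helper name `axpy4` and the typedef `v4sf` are illustrative choices, and a scalar fallback is included so the code still builds elsewhere.

```cpp
#include <cassert>
#include <cstring>

#if defined(__GNUC__)
// GCC vector extension: a 16-byte vector holding four floats.
typedef float v4sf __attribute__((vector_size(16)));
#endif

// Hypothetical helper: out[i] = s * x[i] + y[i] for exactly four elements.
inline void axpy4(float s, const float* x, const float* y, float* out) {
    assert(x && y && out);
#if defined(__GNUC__)
    v4sf vx, vy;
    // memcpy avoids any alignment requirement on the caller's arrays.
    std::memcpy(&vx, x, sizeof vx);
    std::memcpy(&vy, y, sizeof vy);
    v4sf vs = {s, s, s, s};
    // Plain operators; the compiler picks SSE, Altivec, etc. for the target.
    v4sf r = vs * vx + vy;
    std::memcpy(out, &r, sizeof r);
#else
    // Fallback for compilers without the extension.
    for (int i = 0; i < 4; ++i) out[i] = s * x[i] + y[i];
#endif
}
```

No intrinsic names appear in the source, so the same function can be compiled for any target the compiler supports.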