
I'm not sure how or why this turned into a discussion of the general concurrency problem in C++. It's interesting, certainly, but it should probably be considered a separate topic from SIMD and auto-vectorization. It doesn't seem fair to criticize a SIMD library for fighting one battle instead of winning the whole war; you could level similar criticisms at Boost.Thread or Boost.MPI, yet those are useful libraries.

The sense I'm getting from this discussion is that SIMD code generation is uninteresting, and that we should stick our heads in the sand and wait for the Sufficiently Smart Compilers to come along. OK, I sympathize with this. Writing functions with SIMD intrinsics is a distraction from the computer vision tasks I actually care about, but I have time budgets to meet, and usually some work of this type has to be done. IMO, waiting for compiler technology is neither pragmatic in the short term nor (as I argued in the other thread) conceptually correct.

If you look at expression-template based linear algebra libraries like uBLAS and Eigen2, these are basically code generation libraries (if compilers were capable of complicated loop fusion optimizations, we might not need such libraries at all). Given an expression involving vectors, it's fairly mechanical to transform it directly into optimal assembly. At the level of optimizing IR code, by contrast, reasoning about pointers and loops is rather complicated. Are the pointers 16-byte aligned? Does it make sense to partially unroll the loop to exploit data parallelism? What transformations can (and should) we make when traversing a matrix in a double loop? There are all sorts of obstacles the compiler must overcome to vectorize code. Why not handle these issues in the stage of compilation with the best information and the clearest idea of what the final assembly code should look like - in this case, at the library level with meta-programming?
The fact of the matter is that compilers do not generate optimal SIMD-accelerated code except in the simplest of cases, and so we end up using SIMD intrinsics by hand. Frankly, I don't expect this to change dramatically anytime soon; I'm no compiler expert, but my impression is that some complicated algebraic optimizations (for which C++ is not well suited) are necessary.

Using SIMD intrinsics by hand is nasty in assorted ways. The syntax is not standard across compilers. The instructions are not generic: if I change datatypes from double to float, or from int to short, I have to completely rewrite the function. Maybe someone wants to run my SSE-enabled code on an Altivec processor - what then? There may be counterparts to all the instructions, but I don't know Altivec. Even different versions of the same instruction set are a problem: do I really want to weigh the benefits of SSSE3 against plain SSE2 and write a different version of the same function for each instruction set?

Just having a nice, generic wrapper around SIMD ops would be a big step toward easing the writing of SIMD-enabled code and making it accessible to a wider community. I think the Proto-based approach holds promise for taking advantage of various instruction sets without the user having to think much about it.

-Patrick

On Mon, Jan 19, 2009 at 11:14 AM, David A. Greene <greened@obbligato.org> wrote:
On Monday 19 January 2009 00:50, Dean Michael Berris wrote:
I agree, but if you're going to tackle the concurrency problem through a DSEL, I'd think a DSEL at a higher level than SIMD extensions would be more fruitful. For example, something like:
    vector<huge_numbers> numbers;
    // populate numbers

    async_result_stream results =
        apply(numbers, [... insert funky parallelisable lambda construction ...]);

    while (results) {
        huge_number a;
        results >> a;
        cout << a << endl;
    }
Would be able to spawn thread pools, launch tasks, and provide an interface for getting the results using futures underneath. Domain experts who already know C++ would be able to express their funky parallelisable lambda construction and simply trust that the facility will do the necessary decomposition and parallelism as far as it can at the library level. This, I think, is feasible (although a little hard) to achieve -- and whether the compiler can then vectorize an inner loop in the decomposed lambda construction is a detail the library doesn't even necessarily have to deal with.
Indeed, I think something like this is the right approach. Intel's Threading Building Blocks is one attempt. I'm not saying it's the best, but it's at least instructive.
-Dave
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost