[OT?] SIMD and Auto-Vectorization (was Re: How to structurate libraries ?)

On Sat, Jan 17, 2009 at 5:33 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
David A. Greene wrote:
Ahem: http://www.cray.com
Two points: 1/ Not everyone has access to a Cray-like machine. Parallelization tools for CotS machines are not to be neglected and, on this front, lots of things still need to be done.
What do you mean by parallelization for CotS (Commodity off-the Shelf) machines? I personally have dealt with two types of parallelization: parallelization at a high level (dealing with High Performance Computing using something like MPI for distributed message-passing parallel computation across machines) and parallelization at a low level (talking about SSE and auto-vectorization). If you mean dealing with high level parallelism across machines, you already have Boost.MPI and I suspect you don't intend to replace that or tackle that kind of parallelism at the large scale. If you mean dealing with parallelism within a program, then you already have Boost.Thread and Boost.Asio (with the very nice io_service object, which can be used as a job pool of sorts run on many threads). At a lower level, you then have the parallel algorithm extensions to the STL already being shipped by compiler vendors (free and commercial/proprietary) that have parallel versions of the STL algorithms (see GNU libstdc++ that comes with GCC 4.3.x and Microsoft's PPL).
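For instance, with GCC 4.3's parallel mode the switch-over is a drop-in affair (a minimal sketch; the flags are the documented way to turn the mode on, the program itself is just my toy example):

    // compile with: g++ -O2 -fopenmp -D_GLIBCXX_PARALLEL sort_example.cpp
    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    int main() {
        std::vector<int> v(10000000);
        std::generate(v.begin(), v.end(), std::rand);
        std::sort(v.begin(), v.end()); // dispatched to the parallel implementation
    }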
2/ A vector supercomputer != a SIMD-enabled processor, even if the former may include the latter.
Actually, a vector supercomputer is basically a SIMD machine at a higher level -- that's if I'm understanding my related literature correctly.
Auto-parallelization has been around in production machines since at least the '80s. I'm sure it was around even earlier than that.
What do you call auto-parallelization?
Are you telling me that, nowadays, I can take *any* source code written in C or C++ or whatever, compile it with some compiler specifying --parallel, and automagically get a parallel version of the code? If so, you'll have to send a memo to at least a dozen research teams (including mine) all over the world so they can stop working on this problem and move on to something else. Should I also assume that each time a new architecture comes out, those compilers also know the best way to generate code for it? I beg to differ, but automatic parallelization is far from "done".
You might be surprised, but the GNU Compiler Collection already has the -ftree-vectorize compiler flag that will do analysis on the IR (Intermediate Representation) of your code and generate the appropriate SSE/SSE2/SSE3/MMX/... for your architecture. They advise adding -msse -msse2 to the flags on x86 platforms. Although I don't think you should stop the research for better ways to do automatic vectorization, I do think that doing so at the IR level in the compiler would be more fruitful, especially when combined with something like static analysis and code transformation. You might even be surprised to know that OpenCL (the new framework for dealing with heterogeneous scalable parallelism) even allows for run-time transformation of code that's meant to work with that framework -- and Apple's soon-to-be-released new version of its operating system and compiler/SDK will already support it.
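For illustration, here's the kind of loop the tree vectorizer handles well today (my own toy example, nothing more):

    // compile with: g++ -O2 -ftree-vectorize -msse2 saxpy.cpp
    // (-O3 turns -ftree-vectorize on by default since GCC 4.3)
    void saxpy(float* y, const float* x, float a, int n) {
        // independent iterations, unit stride; GCC versions the loop with a
        // run-time aliasing check, or you can add __restrict__ to the pointers
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }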
Then again, just looking at the problem of writing SIMD code: explain why we still get better performance for simple code when writing SIMD code by hand than when letting GCC auto-vectorize it?
Because GCC needs help with auto-vectorization, and GCC is a best-effort project (much like Boost is). If you really want to see great auto-vectorization numbers, maybe you can try looking at the Intel compilers? Although I haven't personally produced (published and peer-reviewed) empirical studies to support my claim, just by adding auto-vectorization to the compilation of the source code of the project I'm dealing with (pure C++, no hand-written SIMD stuff necessary) I already get a significant improvement in performance *and* scalability (vertically, of course). I think a "better" alternative would be to help the GCC folks do a better (?) job at writing more efficient tree-vectorization implementations and transformations that produce great SIMD-aware code built into the compiler. If somehow you can help with the pattern recognition of auto-vectorizable loops/algorithms from a higher level than the IR level and do source-level transformation of C++ (which I think would be way cool BTW, much like how Lisp compilers are able to do so) to be able to produce the right (?) IR for the compiler to better auto-vectorize, then *that* would be something else. Maybe you'd also like to look at GCC-ICI [0] to see how you can play around with extending GCC when it comes to implementing these (admittedly, if I may say so) cool optimizations of the algorithms to become automatically vectorized.
Perhaps your SIMD library could invent convenient ways to express those idioms in a machine-independent way.
Well, considering that the question was first about how to structure the group of libraries I'm proposing,
I apologize for not having taken the time to express all the features of those libraries. Moreover, even with a simple example, the fact that the library hides the differences between SSE2, SSSE3, SSE3, SSE4, Altivec, SPU-VMX and the forthcoming AVX is a feature on its own. Oh, and as specified in the former mail, the DSL takes care of optimizing fused operations, so things like FMA are detected and replaced by the proper intrinsic when possible. Same with reductions like min/max and operations like b*c-a or SAD on SSEx.
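(Just to be sure I'm picturing the right thing -- this is purely my guess at the lowest layer, not your actual interface -- I imagine something like:

    #include <xmmintrin.h> // SSE intrinsics
    struct pack4f {
        __m128 v;
        explicit pack4f(__m128 m) : v(m) {}
    };
    inline pack4f operator+(pack4f a, pack4f b) { return pack4f(_mm_add_ps(a.v, b.v)); }
    inline pack4f operator*(pack4f a, pack4f b) { return pack4f(_mm_mul_ps(a.v, b.v)); }
    // an Altivec build would wrap vec_add/vec_madd behind the same operators,
    // and the expression layer could rewrite a*b+c into a single fused multiply-add

with the Proto-based DSL doing the fused-operation matching on top.)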
I think I understand what you're trying to achieve by adding a layer of indirection at the library/code level -- however, I personally think (and have read through numerous papers on this while doing my research about parallel computing back in my student days) that these optimizations/transformations are best served by the tools that create the machine code (i.e. compilers) rather than dealt with at the source code level. What I mean by this is that even though it's technically possible to implement a "parallel C++ DSEL" (which can feasibly be achieved now with Boost.Proto, some lessons from attribute grammars [1], and the way Spirit 2(x?) combines with Boost.Phoenix), it would reach a bigger audience and serve a larger community if the compilers became smarter at doing the transformations themselves, rather than getting C++ developers to learn yet another library.
Anyway, we'll be able to discuss the library in itself and its features when a proper thread for it starts.
Maybe we can discuss it in this one? :D

HTH

References:
[0] http://gcc-ici.sourceforge.net/
[1] http://en.wikipedia.org/wiki/Attribute_grammar

--
Dean Michael C. Berris
Software Engineer, Friendster, Inc.

Dean Michael Berris wrote:

What do you mean by parallelization for CotS (Commodity off-the Shelf) machines?

I took the link provided to cray.com as an answer to my parallelization question. What I mean is that not everyone has to deal with large mainframes, but more often with Beowulf-like machines or even simple multi-core machines on which running vendor-specific runtime middleware may or may not be sensible. Moreover, not all people who need parallelism want to do HPC. Computer vision and multimedia applications are also highly demanding, sometimes even more so, as they are also bound to real-time or interactive-time constraints in which the GFLOPS is not what you seek. Anyway, consider my answer on this subject as miscommunication.
If you mean dealing with high level parallelism across machines, you already have Boost.MPI and I suspect you don't intend to replace that or tackle that kind of parallelism at the large scale.
MPI is by no means what I would call high-level. It's only a message-passing assembly language with a few interesting abstractions. BSP-based tools or algorithmic-skeleton-based tools are real high-level tools for inter-machine parallelization. And well, I have already replaced MPI with something at a higher abstraction level for at least three architecture styles (cluster, multi-core, Cell), published about this (see [1] for a reference), and plan to show at BoostCon this year exactly how Boost meta-programming tools made those kinds of tools possible and usable.
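To make the "assembly language" point concrete: with raw MPI you choreograph every message yourself, whereas a skeleton-based tool (the skeleton interface below is purely illustrative) only asks for the structure of the computation:

    #include <mpi.h>

    void exchange(double x, int dest, int src, int tag) {
        double y;
        MPI_Status status;
        /* raw MPI: who sends what to whom, and when, is entirely on you */
        MPI_Send(&x, 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
        MPI_Recv(&y, 1, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &status);
    }

    /* skeleton style (illustrative interface): state the structure, and let
       the tool derive the messages, the distribution and the scheduling:
       out = map(f, in);  or  result = pipeline(stage1, stage2)(input);  */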
If you mean dealing with parallelism within a program, then you already have Boost.Thread and Boost.Asio (with the very nice io_service object, which can be used as a job pool of sorts run on many threads).
Then again: low level. I've dealt with a large variety of people ranging from physicists to computer vision experts. They are all embarrassed when they have to take their Matlab or C or FORTRAN legacy code to C++/threads. Some of them don't even KNOW their machine has those kinds of features, or don't know about simple things like scaling or the Gustafson-Barsis and Amdahl laws, and think that as they have 3 machines with 4 cores, their code will magically go 12 times faster no matter what. I've been a parallel software designer in such a laboratory and I can tell you that most experts in field X have no idea how to tackle MPI, Asio or even simple threads. Worse, sometimes they don't want to because they find it uninteresting. Ok, you might say that they can just send their code to someone to parallelize. Alas, most of the time they don't want to, as they won't be able to be sure you didn't butcher their initial code (following the good ol' Not Invented Here principle). At this point, you *have* to give them tools they understand/want to use/are accustomed to so they can do the porting themselves. And this requires providing either really high-level or domain-specific interfaces to those parallelism levels, which includes, among other things, using things like threads or Asio while hiding them very, very deeply. Then again, that's my own personal experience. If you have some secret management tricks to get parallelism-unaware people to use low-level tools, then I'm all open to hear them :) cause I have to do this on a daily basis.
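For the record, Amdahl's law alone deflates the "12 cores = 12x" hope: with a parallelizable fraction p, the best speed-up on N cores is

    S(N) = 1 / ((1 - p) + p/N)

so even with p = 0.9 you get about 1/(0.1 + 0.9/12) ~= 5.7 on 12 cores, and never more than 10 no matter how many cores you add.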
At a lower level, you then have the parallel algorithm extensions to the STL already being shipped by compiler vendors (free and commercial/proprietary) that have parallel versions of the STL algorithms (see GNU libstdc++ that comes with GCC 4.3.x and Microsoft's PPL).
Same remark as before. There are people out there who don't even know the STL (and sometimes proper C++) exists, or, even worse, don't want to use it 'cause, you never know, it could be buggy. And sadly, these aren't jokes :/
Actually, a vector supercomputer is basically a SIMD machine at a higher level -- that's if I'm understanding my related literature correctly.

What I meant to say is that they have a fundamentally different low-level API, and as such, auto-vectorization at the machine level may be different from intra-processor SIMD code generation. Then again, these remarks may be due to me badly reading the previous post.
You might be surprised, but the GNU Compiler Collection already has the -ftree-vectorize compiler flag that will do analysis on the IR (Intermediate Representation) of your code and generate the appropriate SSE/SSE2/SSE3/MMX/... for your architecture. They advise adding -msse -msse2 to the flags on x86 platforms.

Yes, I am aware, and all my experimentation shows it produces correct code for trivial software but fails to vectorize larger pieces of code and dies in shame as soon as you need shuffles, typecasting and other similar features.

Although I don't think you should stop the research for better ways to do automatic vectorization, I do think that doing so at the IR level in the compiler would be more fruitful, especially when combined with something like static analysis and code transformation.
Except this means a new compiler version, which means that end-users either have to wait for those algorithms to be in the mainstream GCC or whatever compiler distribution, OR they'll have to use some experimental version. In the former case, they don't want to wait. In the latter case, they don't want to have to install such a thing. The library approach is a way to get things that work out fast, based only on existing compilers. Nothing prevents the library from evolving as compilers do. On code transformation, that's exactly what a DSEL does: building new 'source' from a high-level specification and writing "compiler extensions" from the library side of the world. I've been doing this since POOMA and the advent of Blitz++ and always thought they were spot-on for things like parallelism. When Proto was first announced and became usable, it was a real advance, as building DSLs started to look like building a new language in the good ol' ways.
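For those who never looked inside Blitz++ or POOMA, the core trick boils down to a caricature like this: operators build a type that *represents* the computation instead of performing it.

    // caricature of the expression-template idea, C++03 style
    struct Vec { float* data; unsigned n; };

    template <class L, class R>
    struct Add {
        const L& l; const R& r;
        Add(const L& l_, const R& r_) : l(l_), r(r_) {}
    };

    inline Add<Vec, Vec> operator+(const Vec& a, const Vec& b) {
        return Add<Vec, Vec>(a, b); // no loop runs here: we just built an AST node
    }
    // assigning an Add<...> to a Vec walks the tree and emits ONE fused loop,
    // which is exactly where the library gets to choose scalar vs SIMD code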
You might even be surprised to know that OpenCL (the new framework for dealing with heterogeneous scalable parallelism) even allows for run-time transformation of code that's meant to work with that framework -- and Apple's soon-to-be-released new version of its operating system and compiler/SDK will already support it.
Oh, I eagerly await OpenCL, so I'll have a new low-level tool from which to generate high-level libraries ;)
Because GCC needs help with auto-vectorization, and GCC is a best-effort project (much like Boost is). If you really want to see great auto-vectorization numbers, maybe you can try looking at the Intel compilers? Although I haven't personally produced (published and peer-reviewed) empirical studies to support my claim, just by adding auto-vectorization to the compilation of the source code of the project I'm dealing with (pure C++, no hand-written SIMD stuff necessary) I already get a significant improvement in performance *and* scalability (vertically, of course).

I'm aware of this. GCC auto-vectorization is, for me, still in its infancy. Take a simple dot product: the auto-vectorized version shows a speed-up of 2.5 for floating-point values; the handmade version goes up to 3.8. Moreover, it only covers statically analyzable loop nests (I speak of what can be done nowadays with 4.3 and 4.4, not what's announced in various papers). With ICC, the scope of vectorizable code is larger and covers more substantial cases. The code quality is also superior. Except not all platforms are Intel platforms.
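For reference, the handmade version I benchmark against is essentially this (a sketch: n is assumed to be a multiple of 4 and the pointers 16-byte aligned):

    #include <xmmintrin.h>

    float dot(const float* a, const float* b, int n) {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i),
                                             _mm_load_ps(b + i)));
        float t[4];
        _mm_storeu_ps(t, acc); // horizontal reduction kept scalar for clarity
        return t[0] + t[1] + t[2] + t[3];
    }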
I think a "better" alternative would be to help the GCC folks do a better (?) job at writing more efficient tree-vectorization implementations and transformations that produce great SIMD-aware code built into the compiler. If somehow you can help with the pattern recognition of auto-vectorizable loops/algorithms from a higher level than the IR level and do source-level transformation of C++ (which I think would be way cool BTW, much like how the Lisp compilers are able to do so) to be able to produce the right (?) IR for the compiler to better auto-vectorize, then *that* would be something else.
Fact is, well, I'm more accustomed to doing software engineering than compiler writing. I wish I could lend a hand, but that's far outside my skill set. But I'm open to learning new things. If you have contacts in the GCC community, I'm all for it.
Maybe you'd also like to look at GCC-ICI [0] to see how you can play around with extending GCC when it comes to implementing these (admittedly, if I may say so) cool optimizations of the algorithms to become automatically vectorized.
Reference noted. :)
I think I understand what you're trying to achieve by adding a layer of indirection at the library/code level -- however, I personally think (and have read through numerous papers on this while doing my research about parallel computing back in my student days) that these optimizations/transformations are best served by the tools that create the machine code (i.e. compilers) rather than dealt with at the source code level. What I mean by this is that even though it's technically possible to implement a "parallel C++ DSEL" (which can feasibly be achieved now with Boost.Proto, some lessons from attribute grammars [1], and the way Spirit 2(x?) combines with Boost.Phoenix), it would reach a bigger audience and serve a larger community if the compilers became smarter at doing the transformations themselves, rather than getting C++ developers to learn yet another library.
Well, except this SIMD library is not aimed at end users. The vec library I proposed is here for library builders, as a way to quickly express SIMD code fragments across SIMD platforms. It is NOT meant to be yet another array class with SIMD capability, because that model is too restrictive for library developers who may need SIMD access for doing something else with a different model. And well, it doesn't sound worse than learning a new Boost.Asio or Boost.Thread library.

I already have something far more high-level from which this code is extracted. NT2 is a Matlab-like scientific computing library [2] that reinjects Matlab syntax into C++ via a couple of DSELs and takes care of vectorization and thread creation on multi-core machines. This was found to be more appealing, as users can just take their Matlab code, copy+paste it into a cpp file, search/replace a few syntax quirks and compile with NT2 to get an instant performance increase. I can tell you that this appeals more to HPC users than any low-level API, even with the sexiest encapsulation you can have.

As for source-level code transformers: there are plenty, and successful ones. Now, can you name *one* that is actually used outside academic research? I'm still convinced that high-level, domain-oriented tools are the way to go, not just adding new layers of things like MPI, OpenCL or whatnot. However, those layers, which have a tendency to multiply as architecture flavors change, need to be abstracted. And it's only such an abstraction that I was proposing for SIMD intrinsics: a simple, POD-like entity that maps onto them and takes care of generating the most efficient code for a given platform. I don't see why it should be different in scope from Boost.Thread, which does the same for threading APIs.

On a larger scale, my stance on the problem is the following: architectures get more and more complex and, following this trend, low-level tools start to lose their interest but are mostly the only thing available. What is needed is a high-level abstraction layer on top of them. But no abstraction can encompass *all* the needs of parallel software developers, so there is a need for various models and abstractions (ranging from arrays to agents to god knows what), and all of them need an interoperable interface. This is what DSL construction tools allow us to do: quickly and properly specify those precisely scoped tools in the form of libraries. When proper compiler support and/or tools become available, we'll just shift the library implementation, much like Boost already supports 0x constructs inside its implementations.

Hoping that we don't hijack the list too much. But I'm curious to know the position of other Boost members concerning how the parallelism problem should be solved within our bounds.

References:
[1] Quaff:
    -> principles: http://www.lri.fr/~falcou/pub/falcou-PARCO-2007.pdf
    -> the Cell version (French paper): http://www.lri.fr/~falcou/pub/falcou-SYMPA-2008.pdf
[2] NT2: http://www.springerlink.com/content/l4r4462r25740127/

--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel: (+33)1 69 15 66 35

On Sun, Jan 18, 2009 at 4:16 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
Dean Michael Berris wrote:
What do you mean by parallelization for CotS (Commodity off-the Shelf) machines?
I took the link provided to cray.com as an answer to my parallelization question. What I mean is that not everyone has to deal with large mainframes, but more often with Beowulf-like machines or even simple multi-core machines on which running vendor-specific runtime middleware may or may not be sensible. Moreover, not all people who need parallelism want to do HPC. Computer vision and multimedia applications are also highly demanding, sometimes even more so, as they are also bound to real-time or interactive-time constraints in which the GFLOPS is not what you seek.
Anyway, consider my answer on this subject as miscommunication.
Ok, I understand.
If you mean dealing with high level parallelism across machines, you already have Boost.MPI and I suspect you don't intend to replace that or tackle that kind of parallelism at the large scale.
MPI is by no means what I would call high-level. It's only a message-passing assembly language with a few interesting abstractions. BSP-based tools or algorithmic-skeleton-based tools are real high-level tools for inter-machine parallelization. And well, I have already replaced MPI with something at a higher abstraction level for at least three architecture styles (cluster, multi-core, Cell), published about this (see [1] for a reference), and plan to show at BoostCon this year exactly how Boost meta-programming tools made those kinds of tools possible and usable.
Thanks for the reference, I'll try reading through it when I get the time. :)

At any rate, what I meant when I said MPI was a high-level abstraction is that it forces you to think at a significantly higher level of abstraction than parallelism constructs that are tied to SIMD computing models, MIMD computing models, hypercubes, meshes, etc., and to express the communication between the processing components of the solution. This means you are forced to think about how the processing components communicate with each other, instead of thinking about whether the architecture you're running on top of is SIMD-aware or MIMD-aware, a Torus or non-Torus mesh, a Hypercube, a Connection Machine, etc. Although you can place infinitely many layers of abstraction on top of this (say, expose an MPI-aware HTTP server and have worker nodes do the "heavy processing" in a Beowulf cluster), it doesn't change the fact that MPI (the spec and the implementations) is higher level than, say, assembly instructions for a SIMD processor.

Although I understand that MPI isn't the highest level, it would be a "parallel programmer's" dream come true to be able to write a program and express parallelism without having to worry about whether the logical/actual concurrency happens on a single machine or a cluster of machines. Erlang tried to do something similar by making process IDs reflect whether they're local or remote processes, but you still had to be mindful of distinctions like that; and I think Haskell is doing something similar with Data Parallel Haskell, though I don't think that covers distributed (i.e. many-machine) parallelism.

Please don't misunderstand me as disagreeing with you here, because I do agree that there is a need to address the parallelism problem, when implementing considerably demanding solutions, at a level where you don't really have to worry about what the underlying architecture your code runs on looks like. However, given the reality of the situation, with a myriad of available platforms on which to compile/run C++, the pressure from both sides of the equation (library writers and tool developers on one side, hardware vendors on the other) to come up with a solution is immense -- especially since the industry has got to adapt to it sooner rather than later. ;-)
If you mean dealing with parallelism within a program, then you already have Boost.Thread and Boost.Asio (with the very nice io_service object, which can be used as a job pool of sorts run on many threads).
Then again: low level. I've dealt with a large variety of people ranging from physicists to computer vision experts. They are all embarrassed when they have to take their Matlab or C or FORTRAN legacy code to C++/threads.
I understand but...
Some of them don't even KNOW their machine has those kinds of features, or don't know about simple things like scaling or the Gustafson-Barsis and Amdahl laws, and think that as they have 3 machines with 4 cores, their code will magically go 12 times faster no matter what. I've been a parallel software designer in such a laboratory and I can tell you that most experts in field X have no idea how to tackle MPI, Asio or even simple threads. Worse, sometimes they don't want to because they find it uninteresting.
I agree, but...
Ok, you might say that they can just send their code to someone to parallelize. Alas, most of the time they don't want to, as they won't be able to be sure you didn't butcher their initial code (following the good ol' Not Invented Here principle). At this point, you *have* to give them tools they understand/want to use/are accustomed to so they can do the porting themselves. And this requires providing either really high-level or domain-specific interfaces to those parallelism levels, which includes, among other things, using things like threads or Asio while hiding them very, very deeply. Then again, that's my own personal experience. If you have some secret management tricks to get parallelism-unaware people to use low-level tools, then I'm all open to hear them :) cause I have to do this on a daily basis.
... I think there is a market for precisely this kind of thing/work now -- helping domain experts be able to recognize and utilize the inherent parallelism in their solutions and the tools they are using. :-) Now as far as management tricks, I don't have any -- I struggle with this a lot too, believe me -- but the idea of tools can vary from context to context (and person to person). I personally think of tools as something concrete that does a specific task -- in this case I think "tools" means compilers (GCC, ICC, MSVC, etc.), linkers (GNU LD, etc.), and program runtimes (OpenCL, CRT, LLVM). I'd argue Yacc and Lex are tools that produce parsers and lexers -- maybe you have something that takes a domain-specific language and creates C++ implementations out of the DSL; then, in the C++ you create, it wouldn't be impossible to generate code that uses Boost.Thread, Boost.Asio, Boost.*. :-D

Libraries OTOH are more components to me rather than tools. Maybe I'm being too picky with terms, but if you meant libraries to be tools, then I feel that simply doesn't work with how my brain is wired. It doesn't make you wrong, but it just doesn't feel right to me. ;-)
At a lower level, you then have the parallel algorithm extensions to the STL already being shipped by compiler vendors (free and commercial/prioprietary) that have parallel versions of the STL algorithms (see GNU libstdc++ that comes with GCC 4.3.x and Microsoft's PPL).
Same remark as before. There are people out there who don't even know the STL (and sometimes proper C++) exists, or, even worse, don't want to use it 'cause, you never know, it could be buggy. And sadly, these aren't jokes :/
I think I understand what you mean, but I don't think it's a failure of the libraries that they're not known/used by the people doing the programming. Much like how you can't blame the nailgun when the carpenter who doesn't know about it is still using a traditional hammer and nails.
Actually, a vector supercomputer is basically a SIMD machine at a higher level -- that's if I'm understanding my related literature correctly.
What I meant to say is that they have a fundamentally different low-level API, and as such, auto-vectorization at the machine level may be different from intra-processor SIMD code generation. Then again, these remarks may be due to me badly reading the previous post.
Ok.
You might be surprised, but the GNU Compiler Collection already has the -ftree-vectorize compiler flag that will do analysis on the IR (Intermediate Representation) of your code and generate the appropriate SSE/SSE2/SSE3/MMX/... for your architecture. They advise adding -msse -msse2 to the flags on x86 platforms.
Yes, I am aware, and all my experimentation shows it produces correct code for trivial software but fails to vectorize larger pieces of code and dies in shame as soon as you need shuffles, typecasting and other similar features.
Indeed.
Although I don't think you should stop the research for better ways to do automatic vectorization, I do think that doing so at the IR level in the compiler would be more fruitful, especially when combined with something like static analysis and code transformation.
Except this means a new compiler version, which means that end-users either have to wait for those algorithms to be in the mainstream GCC or whatever compiler distribution, OR they'll have to use some experimental version. In the former case, they don't want to wait. In the latter case, they don't want to have to install such a thing. The library approach is a way to get things that work out fast, based only on existing compilers. Nothing prevents the library from evolving as compilers do.
True, but libraries also require that users write code that actually uses the library. If the users already had code that didn't use your library, how is it an advantage if they can get the auto-vectorization from a future version of a compiler anyway, without having to butcher their code to use your library? And what if they find a bug in the code using the library or (god forbid) a bug in the library itself?
On code transformation, that's exactly what a DSEL does: building new 'source' from a high-level specification and writing "compiler extensions" from the library side of the world. I've been doing this since POOMA and the advent of Blitz++ and always thought they were spot-on for things like parallelism. When Proto was first announced and became usable, it was a real advance, as building DSLs started to look like building a new language in the good ol' ways.
Actually, DSELs require that you write code in the domain language -- and here is where the problem lies. For instance, if I meant to write a loop that executed exactly N times every time, then I should be able to rely on my compiler to be smart about it, figure out whether the target I'm building for can handle SIMD instructions, and generate native code for that. It doesn't sacrifice the readability of my code because I can write a normal C++ loop and have the compiler do the transformation from C++ to native code with very little intervention on my part as a programmer. If you're talking about a DSEL at a higher level and want to be able to extract parallelism at the domain level (say, for example, doing parallel (or tree-wise parallel) computation of complex mathematical equations), then I would agree that DSLs or even DSELs are great for the job. However, that would also mean you're going to have to write/create that DSL/DSEL for that specific domain, instead of it being a DSL/DSEL mainly for parallelism. Although I can imagine as a C++ developer writing something like:

    eval(
        ( task_group += task1 | task2 | task3 ),
        tie(task1_future, task2_future, task3_future)
    );

That's a feasible DSEL for executing a task group and tying their futures to the results -- but it's hardly a DSL/DSEL for something like, say, particle physics or linear algebra. If this were the case then maybe just having this DSEL may be good to give to parallelism-savvy C++ programmers, but not necessarily still to the domain experts who will be doing the writing of the domain-specific logic. Although you can argue that parallel programming is a domain in itself, in which case you're still not bridging the gap between those who know about parallel programming and the other domain experts.
You might even be surprised to know that OpenCL (the new framework for dealing with heterogeneous scalable parallelism) even allows for run-time transformation of code that's meant to work with that framework -- and Apple's soon-to-be-released new version of its operating system and compiler/SDK will already support it.
Oh, I eagerly await OpenCL, so I'll have a new low-level tool from which to generate high-level libraries ;)
So do I. :-)
Because GCC needs help with auto-vectorization, and GCC is a best-effort project (much like Boost is). If you really want to see great auto-vectorization numbers, maybe you can try looking at the Intel compilers? Although I haven't personally produced (published and peer-reviewed) empirical studies to support my claim, just by adding auto-vectorization to the compilation of the source code of the project I'm dealing with (pure C++, no hand-written SIMD stuff necessary) I already get a significant improvement in performance *and* scalability (vertically, of course).
I'm aware of this. GCC auto-vectorization is, for me, still in its infancy. Take a simple dot product: the auto-vectorized version shows a speed-up of 2.5 for floating-point values; the handmade version goes up to 3.8. Moreover, it only covers statically analyzable loop nests (I speak of what can be done nowadays with 4.3 and 4.4, not what's announced in various papers). With ICC, the scope of vectorizable code is larger and covers more substantial cases. The code quality is also superior. Except not all platforms are Intel platforms.
Yes, not all platforms are Intel platforms, but I don't know if you've noticed that the Intel compilers even create code that will run on AMD processors -- yes, even SSE[0..3] -- as per their product documentation. If your target is CotS machines, I think Intel/GCC is your best bet (at least on x86_64). I haven't dealt with platforms other than Intel/AMD, but it's not unreasonable to think that since everybody's moving in the direction of leveraging and exploiting parallelism in hardware, the compiler vendors will have to compete (and eventually get better) in this regard.
I think a "better" alternative would be to help the GCC folks do a better (?) job at writing more efficient tree-vectorization implementations and transformations that produce great SIMD-aware code built into the compiler. If somehow you can help with the pattern recognition of auto-vectorizable loops/algorithms from a higher level than the IR level and do source-level transformation of C++ (which I think would be way cool BTW, much like how the Lisp compilers are able to do so) to be able to produce the right (?) IR for the compiler to better auto-vectorize, then *that* would be something else.
Fact is, well, I'm more accustomed to doing software engineering than compiler writing. I wish I could lend a hand, but that's far outside my skill set. But I'm open to learning new things. If you have contacts in the GCC community, I'm all for it.
Why do I get the feeling that you're saying: compiler writing != software engineering? :-P

Anyway, I think if you're looking to contribute to a compiler-building community, GCC may be a little too big (I don't want to say too advanced, because I haven't bothered looking at the code of the GCC project), but I know the Clang folks over at LLVM are looking for help finishing the C++ implementation of the compiler front-end. From what I'm reading about Clang and LLVM, it should be feasible to write language-agnostic optimization algorithms/implementations dealing just with the LLVM IR.

I personally am not attached to the GCC project -- I think they do one heck of a great job -- but something that's detached from it would also be feasible to write (although I don't know how practical it would be). For example, something that takes C++ code and transforms it to more optimized (?) C++ before passing it to the compiler would be interesting to have. This, I'm afraid, is in the realm of static code analysis and code transformation at a very high level, which may or may not be what others programming in C++ are looking for (that's a guess). I think some people on the list have already contributed to GCC's C++ compiler (I remember Doug Gregor did some work (if not all of it?) on ConceptGCC, which prototyped C++0x Concepts support in the compiler). Maybe they can help out in that regard.
Maybe you'd also like to look at GCC-ICI [0] to see how you can play around with extending GCC when it comes to implementing these (admittedly, if I may say so) cool optimizations of the algorithms to become automatically vectorized.
Reference noted. :)
:-)
I think I understand what you're trying to achieve by adding a layer of indirection at the library/code level -- however, I personally think (and have read through numerous papers on this while doing my research about parallel computing back in my student days) that these optimizations/transformations are best served by the tools that create the machine code (i.e. compilers) rather than dealt with at the source code level. What I mean by this is that even though it's technically possible to implement a "parallel C++ DSEL" (which can feasibly be achieved now with Boost.Proto, some lessons from attribute grammars [1], and the way Spirit 2(x?) combines with Boost.Phoenix), it would reach a bigger audience and serve a larger community if the compilers became smarter at doing the transformations themselves, rather than getting C++ developers to learn yet another library.
Well, except this SIMD library is not aimed at end users. The vec library I proposed is here for library builders, as a way to quickly express SIMD code fragments across SIMD platforms. It is NOT meant to be yet another array class with SIMD capability, because that model is too restrictive for library developers who may need SIMD access for doing something else with a different model. And well, it doesn't sound worse than learning a new Boost.Asio or Boost.Thread library.
In that case, I think that kind of library (DSEL) would be nice to have -- especially to abstract the details of expressing parallelism in general at the source code level.
I already have something far more high-level from which this code is extracted. NT2 is a Matlab-like scientific computing library [2] that reinjects Matlab syntax into C++ via a couple of DSELs and takes care of vectorization and thread creation on multi-core machines. This was found to be more appealing, as users can just take their Matlab code, copy+paste it into a cpp file, search/replace a few syntax quirks and compile with NT2 to get an instant performance increase. I can tell you that this appeals more to HPC users than any low-level API, even with the sexiest encapsulation you can have.
Nice! I would agree that something like *that* is appropriate as a domain-specific language which leverages parallelism in the details of the implementation.
As for source-level code transformers: there are plenty, and successful ones. Now, can you name *one* that is actually used outside academic research? I'm still convinced that high-level, domain-oriented tools are the way to go, not just adding new layers of things like MPI, OpenCL or whatnot. However, those layers, which have a tendency to multiply as architecture flavors change, need to be abstracted. And it's only such an abstraction that I was proposing for SIMD intrinsics: a simple, POD-like entity that maps onto them and takes care of generating the most efficient code for a given platform. I don't see why it should be different in scope from Boost.Thread, which does the same for threading APIs.
I'm not discouraging you from writing a library that tackles the expression of parallelism at the source code level in a coherent manner -- I am actually encouraging it, if you haven't started yet. I'd love to get my hands on something like that, to be able to express the parallelism in a solution without having to worry about exactly how many threads I'll be running to parallelize the part of the code that simply executes a set of tasks, etc. I do however think that some details are best tackled at the appropriate layer -- SIMD code construction is, well, meant to be the domain of the compiler (as far as SSE or similar things go). OpenCL is meant to be an interface that the hardware and software vendors are moving towards supporting for a long time to come (at least from what I'm reading in the press releases), so I'm not too worried about a combinatorial explosion of architectures and parallelism runtimes.
On a larger scale, my stance on the problem is the following: architectures get more and more complex and, following this trend, low-level tools start to lose their interest but are mostly the only thing available. What is needed is a high-level abstraction layer on top of them. But no abstraction can encompass *all* the needs of parallel software developers, so there is a need for various models and abstractions (ranging from arrays to agents to god knows what), and all of them need an interoperable interface. This is what DSL construction tools allow us to do: quickly and properly specify those precisely scoped tools in the form of libraries. When proper compiler support and/or tools become available, we'll just shift the library implementation, much like Boost already supports 0x constructs inside its implementations.
I agree completely, but I'm afraid that if the DSEL is for expressing parallelism in C++, the goal of "giving domain experts tools that know about the parallelism" wouldn't be met readily. For C++ developers who want to leverage parallelism in general, sure, but I don't think I'd be particularly compelled to use a SIMD-only DSEL.
Hoping that we don't hijack the list too much. But I'm curious to know the position of other Boost members concerning how the parallelism problem should be solved within our bounds.
Me too, but unfortunately I don't know a better venue to be able to discuss the development of a DSEL in C++ to tackle the parallelism issue. :-) Boost is about the "smartest" mailing list I know on which issues like these can be discussed in a fashion that is relevant to C++ and technical enough for those who know about the subject. ;-)
References:
[1] Quaff:
    -> principles: http://www.lri.fr/~falcou/pub/falcou-PARCO-2007.pdf
    -> the Cell version (French paper): http://www.lri.fr/~falcou/pub/falcou-SYMPA-2008.pdf
[2] NT2: http://www.springerlink.com/content/l4r4462r25740127/
Thanks for the links!

--
Dean Michael C. Berris
Software Engineer, Friendster, Inc.

Dean Michael Berris wrote:
Please don't misunderstand me as disagreeing with you here, because I do agree that there is a need to address the parallelism problem, when implementing considerably demanding solutions, at a level where you don't really have to worry about what the underlying architecture your code runs on looks like. However, given the reality of the situation, with a myriad of available platforms on which to compile/run C++, the pressure from both sides of the equation (library writers and tool developers on one side, hardware vendors on the other) to come up with a solution is immense -- especially since the industry has got to adapt to it sooner rather than later. ;-)
We agree then.
... I think there is a market for precisely this kind of thing/work now -- helping domain experts be able to recognize and utilize the inherent parallelism in their solutions and the tools they are using. :-)
Best is to have them benefit from parallelism without them really knowing about it.
Libraries OTOH are more components to me rather than tools. Maybe I'm being too picky with terms, but if you meant libraries to be tools, then I feel that simply doesn't work with how my brain is wired. It doesn't make you wrong, but it just doesn't feel right to me. ;-)
Beware that embedded DSLs are no more than DSLs in disguise inside a library, hence the confusion I keep making between tools and libraries.
I think I understand what you mean, but I don't think it's a failure of the libraries that they're not known/used by the people doing the programming. Much like how you can't blame the nailgun when the carpenter who doesn't know about it is still using a traditional hammer and nails.
My point was: it is not that easy to tell people "use X".
True, but libraries also require that users write code that actually uses the library. If the users already had code that didn't use your library, how is it an advantage if they can get the auto-vectorization from a future version of a compiler anyway, without having to butcher their code to use your library? And what if they find a bug in the code using the library or (god forbid) a bug in the library itself?
The same can be said for any library out there. What if tomorrow a new C++ compiler can extract code from source and build top-notch threads from it? Should we prevent people from using Boost.Threads from now on?
Actually, DSELs require that you write code in the domain language -- and here is where the problem lies.

Well, if parallelism is outsourced behind the scenes, it's not a problem.

If this were the case then maybe just having this DSEL may be good to give to parallelism-savvy C++ programmers, but not necessarily still to the domain experts who will be doing the writing of the domain-specific logic. Although you can argue that parallel programming is a domain in itself, in which case you're still not bridging the gap between those who know about parallel programming and the other domain experts.
Parallel programming is a domain in itself, but a domain for tool writers, not for users. A user domain is things like math, finance, physics, anything. We agree.
Yes, not all platforms are Intel platforms, but I don't know if you've noticed that the Intel compilers even create code that will run on AMD processors -- yes, even SSE[0..3] -- as per their product documentation. If your target is CotS machines, I think Intel/GCC is your best bet (at least on x86_64). I haven't dealt with platforms other than Intel/AMD, but it's not unreasonable to think that since everybody's moving in the direction of leveraging and exploiting parallelism in hardware, the compiler vendors will have to compete (and eventually get better) in this regard.
Well, we can't leave Altivec and its offspring on the side of the road. The Cell processor uses it, and I consider the Cell a quasi-CotS machine, as a PS3 costs something like only half a kidney. I don't target CotS or non-CotS; my goal is to cover the basics, and the SIMD basics involve old Motorola-era PPC and Intel machines. So the strict minimum is Altivec + the SSE flavors. I hope that one day (AVX v2) both will converge, though.
Why do I get the feeling that you're saying:
compiler writing != software engineering
? :-P
No, I mean that *I* feel more comfortable writing stuff on this side of the compiler than on the other ;)
Anyway, I think if you're looking to contribute to a compiler-building community, GCC may be a little too big (I don't want to say too advanced, because I haven't bothered looking at the code of the GCC project), but I know the Clang folks over at LLVM are looking for help finishing the C++ implementation of the compiler front-end. From what I'm reading about Clang and LLVM, it should be feasible to write language-agnostic optimization algorithms/implementations dealing just with the LLVM IR.
Well, as I work like half a mile from Albert Cohen's office, I'll certainly have a discussion about Clang someday ;) A C++-to-C++ tool is on my to-do list, but not for now, as I think DSELs in C++ still have untapped resources.
In that case, I think that kind of library (DSEL) would be nice to have -- especially to abstract the details of expressing parallelism in general at the source code level.
Except it is, like... friggin' hard? My stance is to have application-domain-specific libraries that hide all the parallelism tasks by relying on small-scale parallel libraries themselves, like Boost.Thread or my proposition.
Nice! I would agree that something like *that* is appropriate as a domain-specific language which leverages parallelism in the details of the implementation.
I do however think that some details are best tackled at the appropriate layer -- SIMD code construction is, well, meant to be the domain of the compiler (as far as SSE or similar things go). OpenCL is meant to be an interface that the hardware and software vendors are moving towards supporting for a long time to come (at least from what I'm reading in the press releases), so I'm not too worried about a combinatorial explosion of architectures and parallelism runtimes.
Except some people (like one of the posters in the previous thread) daily deal with code that needs this level of abstraction and no more. Hence the rationale behind "Boost.SIMD".
I agree completely, but I'm afraid that if the DSEL is for expressing parallelism in C++, the goal of "giving domain experts tools that know about the parallelism" wouldn't be met readily. For C++ developers who want to leverage parallelism in general, sure, but I don't think I'd be particularly compelled to use a SIMD-only DSEL.
I think we can't just wake up and say "ok, today I'll just solve the parallelism problem in C++ using DSELs". I think that, on the contrary, a concrete, reasonable roadmap would be "tiling" the parallel-problem world with small-scale software solutions that can inter-operate and interact freely. Then, when the basic blocks of such tools have been done, we can start cementing them into higher ones.

--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel: (+33)1 69 15 66 35

on Sun Jan 18 2009, Joel Falcou <joel.falcou-AT-u-psud.fr> wrote:
Beware that embedded DSLs are no more than DSLs in disguise inside a library,
I'd say it's the opposite: they are libraries disguised as DSLs, the DSL-ness being a feature of how they present themselves to the world. One important thing about embedded DSLs is that they maintain the interoperability properties of libraries, thus it makes good sense to think of them as "components" rather than tools.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com

David Abrahams wrote:
I'd say it's the opposite: they are libraries disguised as DSLs, the DSL-ness being a feature of how they present themselves to the world. One important thing about embedded DSLs is that they maintain the interoperability properties of libraries, thus it makes good sense to think of them as "components" rather than tools.
Said this way, it makes a lot of sense. I'm a bit lax with my use of "tools" and "components", but the message was there, I think :)

--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel: (+33)1 69 15 66 35

On Sun, Jan 18, 2009 at 9:50 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
Dean Michael Berris wrote:
Please don't misunderstand me as disagreeing with you here, because I do agree that there is a need to address the parallelism problem, when implementing considerably demanding solutions, at a level where you don't really have to worry about what the underlying architecture your code runs on looks like. However, given the reality of the situation, with a myriad of available platforms on which to compile/run C++, the pressure from both sides of the equation (library writers and tool developers on one side, hardware vendors on the other) to come up with a solution is immense -- especially since the industry has got to adapt to it sooner rather than later. ;-)
We agree then.
... I think there is a market for precisely this kind of thing/work now -- helping domain experts be able to recognize and utilize the inherent parallelism in their solutions and the tools they are using. :-)
Best is to have them benefit from parallelism without them really knowing about it.
I'm a little wary of hiding these important issues from the people who understand the bigger scheme of things, which is why I personally don't think leaving the (non-C++-programming) domain experts in the dark about the inherent parallelism in their solution is a good idea. The reason I think this is that the people who are going to be solving the real-world problem should be aware that the computing facilities they have are actually capable of parallel computing, and that the way they write their solutions (in any programming language) will have a direct impact on the performance and scalability of their solution. It really doesn't matter whether they're writing something that should work on a GPU/CPU+SIMD; what matters is that the way they write their solution should work in parallel. Once they are aware of the available parallelism, they should be able to adapt the way they think and the way they come up with solutions. As far as hiding the parallelism from them goes, the compiler is the perfect place to do that, especially if your aim is just to leverage the platform-specific parallelism features of the machine. Even these domain experts, once they know about the compiler's capabilities, may be able to write their code in such a way that the compiler will be happy to auto-vectorize -- and that's, I think, where it counts most.
Libraries OTOH are more components to me rather than tools. Maybe I'm being too picky with terms, but if you meant libraries to be tools, then I feel that simply doesn't work with how my brain is wired. It doesn't make you wrong, but it just doesn't feel right to me. ;-)
Beware that embedded DSLs are no more than DSLs in disguise inside a library, hence the confusion I keep making between tools and libraries.
I would tend to agree with Dave that libraries tend to be disguised as DSELs (think Spirit) that perform a certain function -- hence why I think of them as components that work as part of a bigger whole.
I think I understand what you mean, but I don't think it's a failure of the libraries that they're not known/used by the people doing the programming. Much like how you can't blame the nailgun if a carpenter didn't know about it that's why he's still using traditional hammers and nails.
My point was: it is not that easy to tell people "use X".
Actually, it's easy to say it -- it's acceptance that's the problem. Now, a library that forced users to change their code just to be able to leverage something that the compiler should be able to handle for them (like writing assembly code, for instance) sounds to me like too much to ask. After all, the reason we have higher-level programming languages is to hide from ourselves the details of the assembly/machine language of whichever platform we're going to run programs on. ;-)
True, but libraries also require that users write code that actually uses the library. If the users already had code that didn't use your library, how is it an advantage if they can get the auto-vectorization from a future version of a compiler anyway, without having to butcher their code to use your library? And what if they find a bug in the code using the library or (god forbid) a bug in the library itself?
The same can be said for any library out there. What if tomorrow a new C++ compiler can extract code from source and build top-notch threads from it? Should we prevent people from using Boost.Threads from now on?
No, what I'm pointing at here is that libraries for very low-level parallelism will have to be maintained independently of the code that's actually using them -- and are thus another layer in which failures can be found and inefficiencies introduced. The point of using Boost.Thread instead of a platform-specific threading library is that you can rely on a coherent interface for threading and synchronization specifically. If a new C++ compiler were able to do that parallelism for us effectively without us having to use Boost.Threads, then I think usage of Boost.Threads would slowly go down on its own. However, I think the problem that Boost.Threads is solving is compelling enough for it to be a viable solution in the interim. The point I'm trying to make is that if the target is simply SIMD at the processor level, I'd think a library just for that is too specific to be considered generic. I might be missing the point here, but if the compiler can already do it now (and will only get better in the future), and I can write platform-specific code even in C++ through compiler-vendor-provided libraries (if I needed to be specific about what I wanted to do with the compiler and the platform, rather than relying on the compiler to do it for me), what would be the value of a very narrow/specific library like a SIMD-specific thingamajig?
Actually, DSELs require that you write code in the domain language -- and here is where the problem lies.
Well, if parallelism is outsourced behind the scenes, it's not a problem.
But then you (the DSEL writer) for that specific domain would have to deal with parallelism the old-fashioned way, without a DSEL (yet) helping you to do it -- and that doesn't scale. It also doesn't help the domain expert, especially if he doesn't know that he can actually come up with solutions that leverage the parallelism available on his platform.
If this were the case then maybe just having this DSEL may be good to give to parallelism-savvy C++ programmers, but not necessarily still to the domain experts who will be doing the writing of the domain-specific logic. Although you can argue that parallel programming is a domain in itself, in which case you're still not bridging the gap between those who know about parallel programming and the other domain experts.
Parallel programming is a domain in itself, but a domain for tool writers rather than for users. A user domain is things like math, finance, physics, anything. We agree.
Yes, not all platforms are Intel platforms, but I don't know if you've noticed that Intel compilers even generate code that will run on AMD processors -- yes, even SSE[0..3] -- as per their product documentation. If your target is CotS machines, I think Intel/GCC is your best bet (at least on x86_64). I haven't dealt with platforms other than Intel/AMD, but it's not unreasonable to think that since everybody's moving in the direction of leveraging and exploiting parallelism in hardware, the compiler vendors will have to compete (and eventually get better) in this regard.
Well, we can't leave Altivec and its offspring on the side of the road. The Cell processor uses it, and I consider the Cell a quasi-CotS platform, as a PS3 costs something like only half a kidney. I don't target CotS or non-CotS; my goal is to cover the basics, and the SIMD basics involve the old Motorola Altivec-enabled PPCs and Intel machines. So the strict minimum is the Altivec+SSE flavors. I hope that one day (AVX v2), both will converge though.
And precisely because of that, I think better compilers that leverage these platform-specific features would be the correct and far-reaching solution, rather than a library just for SIMD. If your goal were a library/DSEL for expressing parallelism in C++ in general -- hiding the details of threads and whatnot, of which a SIMD-specific extension would be one part -- then I wouldn't feel the goal is a little too narrow.
Why do I get the feeling that you're saying:
compiler writing != software engineering
? :-P
No, I mean that *I* feel more comfortable writing stuff on this side of the compiler than on the other ;)
Okay. :-)
Anyway, I think if you're looking to contribute to a compiler-building community, GCC may be a little too big (I don't want to use the term advanced, because I haven't bothered looking at the code of the GCC project), but I know the Clang folks over at LLVM are looking for help finishing the C++ implementation of the compiler front-end. From what I'm reading about Clang and LLVM, it should be feasible to write language-agnostic optimization algorithms/implementations dealing just with the LLVM IR.
Well, as I work like half a mile from Albert Cohen's office, I'll certainly have a discussion about Clang someday ;) The C++-to-C++ tools are on my todo list, but not for now, as I think DSELs in C++ still have untapped resources.
I agree, but if you're going to tackle the concurrency problem through a DSEL, I'd think a DSEL at a higher level than SIMD extensions would be more fruitful. For example, I'd think something like:

    vector<huge_numbers> numbers; // populate numbers
    async_result_stream results = apply(numbers,
        [... insert funky parallelisable lambda construction ...]);
    while (results) {
        huge_number a;
        results >> a;
        cout << a << endl;
    }

would be able to spawn thread pools, launch tasks, and provide an interface to getting the results using futures underneath. The domain experts who already know C++ will be able to express their funky parallelisable lambda construction and just know that when they use the facility, it will do the necessary decomposition and parallelism as much as it can at the library level. This I think is feasible (although a little hard) to achieve -- and if the compiler is able to vectorize an inner loop in the decomposed lambda construction, that detail isn't even necessarily dealt with by the library.
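For illustration, a minimal sketch of how such an apply() could sit on top of futures -- C++11 facilities are used for brevity, apply_async is a hypothetical name, and a real library would pool threads and chunk the work rather than spawn one task per element:

    #include <future>
    #include <iostream>
    #include <vector>

    // Launch one asynchronous task per element and hand back the futures.
    template <typename T, typename F>
    std::vector<std::future<T>> apply_async(const std::vector<T>& in, F f)
    {
        std::vector<std::future<T>> results;
        results.reserve(in.size());
        for (const T& x : in)
            results.push_back(std::async(std::launch::async, f, x));
        return results;
    }

    int main()
    {
        std::vector<double> numbers = {1.0, 2.0, 3.0};
        auto results = apply_async(numbers, [](double x) { return x * x; });
        for (auto& r : results)
            std::cout << r.get() << std::endl; // futures underneath
    }

The stream-like async_result_stream interface and the decomposition policy described above are exactly the parts a real library would add on top of this skeleton.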
In that case, I think that kind of library (DSEL) would be nice to have -- especially to abstract the details of expressing parallelism in general at the source-code level.
Except it is like ... friggin hard ?
Uh, yes. ;-)
My stance is to have application-domain-specific libraries that hide all the parallelism tasks, themselves relying on small-scale parallelism libraries like Boost.Thread or my proposal.
In which case I think that DSEL for parallelism would be much more acceptable than even the simplest SIMD DSEL mainly because I'd think if you really wanted to leverage SIMD by hand, you'd just use the vector registers and use the vector functions directly from your code instead. At least that's in my case as both a user and a library writer.
Nice! I would agree that something like *that* is appropriate as a domain-specific language which leverages parallelism in the details of the implementation.
However, I also think there are some details that would be nice to tackle at the appropriate layer -- SIMD code construction is, well, meant to be in the domain of the compiler (as far as SSE or similar things go). OpenCL is meant to be an interface that the hardware and software vendors are moving towards supporting for a long time to come (at least from what I'm reading in the press releases), so I'm not too worried about the combinatorial explosion of architectures and parallelism runtimes.
Except some people (like one of the posters in the previous thread) daily deal with code that needs this level of abstraction and no more. Hence the rationale behind "Boost.SIMD".
In which case I think a DSEL is clever, but a SIMD-only library would be too small in scope for my taste. But that's just me I think. ;-)
I agree completely, but I'm afraid if the DSEL is for expressing parallelism in C++, the goal of "giving domain experts tools that knew about the parallelism" wouldn't be met readily. For C++ developers that want to leverage parallelism in general sure, but I don't think I'd be particularly compelled to use a SIMD-only DSEL.
I think we can't just wake up and say "OK, today I just solved the parallelism problem in C++ using a DSEL". I think that, on the contrary, a concrete, reasonable roadmap would be "tiling" the parallel-problem world with small-scale software solutions that can inter-operate and interact freely. Then, when the basic blocks of such tools have been done, we can start cementing them into higher ones.
Of course, maybe not in a day. But it can feasibly be achieved with some effort from brilliant library writers. I like thinking at a higher level first and solving the problems at the lower level with more specific focus but within a bigger context. Only once you recognize the patterns in the solution from a higher level can you really try solving problems at a lower level with better insight. Missing context is always hard to deal with. They came up with the STL anyway, right -- whoever thought there'd be a string class that makes sense in C++. ;-)

-- Dean Michael C. Berris
Software Engineer, Friendster, Inc.

Dean Michael Berris wrote:
I'm a little wary about hiding these important issues from the people who understand the higher scheme of things. Which is why I personally don't think leaving the (non-C++-programming) domain experts in the dark about the inherent parallelism in their solution is a good idea.
No, we have to leave them unaware of the *details*. Of course they should know how much or how coarse the inherent parallelism is, or whether for a given platform task parallelism performs better than data parallelism. We have to hide the ugly gears from them.
As far as hiding the parallelism from them goes, the compiler is the perfect place to do that, especially if your aim is just to leverage the platform-specific parallelism features of the machine. Even these domain experts, once they know about the compiler's capabilities, may be able to write their code in such a way that the compiler will be happy to auto-vectorize -- and that, I think, is where it counts most.
Yes, I would fully agree if we lived in a perfect world. Auto-whatever-izing compilers are, alas, not the norm currently.
Actually, it's easy to say it -- it's acceptance that's the problem. Now, a library that forced users to change their code just to leverage something the compiler should be able to handle for them (like writing assembly code, for instance) sounds to me like too much to ask for. After all, the reason we have higher-level programming languages is to hide from ourselves the details of the assembly/machine language of whatever platform we're going to run programs on. ;-)
See my remarks about NT2 previously. Libraries like NT2 are the way to go for users. I think Boost.SIMD has its usefulness (as said in the other thread) for library developers.
I agree, but if you're going to tackle the concurrency problem through a DSEL, I'd think a DSEL at a higher level than SIMD extensions would be more fruitful. For example, I'd think something like:
    vector<huge_numbers> numbers; // populate numbers
    async_result_stream results = apply(numbers,
        [... insert funky parallelisable lambda construction ...]);
    while (results) {
        huge_number a;
        results >> a;
        cout << a << endl;
    }
I never said I'd tackle concurrency with Boost.SIMD ;) Of course that will need a far more expressive and abstract DSEL.
In which case I think that DSEL for parallelism would be much more acceptable than even the simplest SIMD DSEL mainly because I'd think if you really wanted to leverage SIMD by hand, you'd just use the vector registers and use the vector functions directly from your code instead. At least that's in my case as both a user and a library writer.
Then again, one person's tool is not another's need.
In which case I think a DSEL is clever, but a SIMD-only library would be too small in scope for my taste. But that's just me I think. ;-)
Patrick Mihelich seems to disagree on the other thread ;)
I like thinking at a higher level first and solving the problems in the lower level with more specific focus but within a bigger context. Once you can recognize the patterns in the solution from a higher level can you really try solving problems at a lower level with better insight. Missing context is always hard to deal with.
Maybe. I'm not sure myself how to tackle this. Let's try and see how it fares. If I end up cornered, then I'll go back to another approach. Heck, I'm even paid for doing exactly this (searching, not being cornered ;)). The issue is large, and arguments about how to do it properly are necessary, as no one may be able to find the definitive answer alone. In my larger plan, there is a mix to find between external tools that preprocess source-to-source, DSELs at a low level, and other kinds of tools.

On Monday 19 January 2009 00:50, Dean Michael Berris wrote:
I agree, but if you're going to tackle the concurrency problem through a DSEL, I'd think a DSEL at a higher level than SIMD extensions would be more fruitful. For example, I'd think something like:
    vector<huge_numbers> numbers; // populate numbers
    async_result_stream results = apply(numbers,
        [... insert funky parallelisable lambda construction ...]);
    while (results) {
        huge_number a;
        results >> a;
        cout << a << endl;
    }
Would be able to spawn thread pools, launch tasks, and provide an interface to getting the results using futures underneath. The domain experts who already know C++ will be able to express their funky parallelisable lambda construction and just know that when they use the facility, it will do the necessary decomposition and parallelism as much as it can at the library level. This I think is feasible (although a little hard) to achieve -- and if the compiler is able to vectorize an inner loop in the decomposed lambda construction, that detail isn't even necessarily dealt with by the library.
Indeed, I think something like this is the right approach. Intel's Threading Building Blocks is one attempt. I'm not saying it's the best, but it's at least instructive.
-Dave

I'm not sure how or why this turned into a discussion of the general concurrency problem in C++. This is interesting, certainly, but should probably be considered a separate topic from SIMD and auto-vectorization. It doesn't seem very fair to me to criticize a SIMD library for fighting one battle instead of winning the whole war; you could make similar criticisms of Boost.Thread or Boost.MPI, yet these are useful libraries.

The sense I'm getting from this discussion is that SIMD code generation is uninteresting, and that we should stick our heads in the sand and wait for the Sufficiently Smart Compilers to come along. OK, I sympathize with this. Writing functions using SIMD intrinsics is a bit of a distraction from the computer vision tasks I actually care about, but I have time budgets to meet and usually some work of this type has to be done.

IMO, waiting for compiler technology is neither pragmatic in the short-term nor (as I argued in the other thread) conceptually correct. If you look at expression-template based linear algebra libraries like uBlas and Eigen2, these are basically code generation libraries (if compilers were capable of complicated loop fusion optimizations, we might not need such libraries at all). Given an expression involving vectors, it's fairly mechanical to transform it directly into optimal assembly. Whereas at the level of optimizing IR code, reasoning about pointers and loops is rather complicated. Are the pointers 16-byte aligned? Does it make sense to partially unroll the loop to exploit data parallelism? What transformations can (and should) we make when traversing a matrix in a double loop? There are all sorts of obstacles the compiler must overcome to vectorize code. Why not handle these issues in the stage of compilation with the best information and clearest idea of what the final assembly code should look like - in this case at the library level with meta-programming?

The fact of the matter is that compilers do not generate optimal SIMD-accelerated code except in the simplest of cases, and so we end up using SIMD intrinsics by hand. Frankly I don't expect this to change dramatically anytime soon; I'm no compiler expert, but my impression is that some complicated algebraic optimizations (for which C++ is not very suited) are necessary.

Using SIMD intrinsics by hand is nasty in assorted ways. The syntax is not standard across compilers. The instructions are not generic; if I change datatypes from double to float, or int to short, I have to completely rewrite the function. Maybe someone wants to run my SSE-enabled code on an Altivec processor, what then? There may be counterparts to all the instructions, but I don't know Altivec. Even different versions of the same instruction set are a problem; do I really want to think about the benefits of SSSE3 vs. just SSE2 and write different versions of the same function for the various instruction sets?

Just having a nice, generic wrapper around SIMD ops would be a big step to ease the writing of SIMD-enabled code and make it accessible to a wider community. I think the Proto-based approach holds promise for taking advantage of various instruction sets without the user having to think much about it.

-Patrick

On Mon, Jan 19, 2009 at 11:14 AM, David A. Greene <greened@obbligato.org> wrote:
On Monday 19 January 2009 00:50, Dean Michael Berris wrote:
I agree, but if you're going to tackle the concurrency problem through a DSEL, I'd think a DSEL at a higher level than SIMD extensions would be more fruitful. For example, I'd think something like:
    vector<huge_numbers> numbers; // populate numbers
    async_result_stream results = apply(numbers,
        [... insert funky parallelisable lambda construction ...]);
    while (results) {
        huge_number a;
        results >> a;
        cout << a << endl;
    }
Would be able to spawn thread pools, launch tasks, and provide an interface to getting the results using futures underneath. The domain experts who already know C++ will be able to express their funky parallelisable lambda construction and just know that when they use the facility, it will do the necessary decomposition and parallelism as much as it can at the library level. This I think is feasible (although a little hard) to achieve -- and if the compiler is able to vectorize an inner loop in the decomposed lambda construction, that detail isn't even necessarily dealt with by the library.
Indeed, I think something like this is the right approach. Intel's Threading Building Blocks is one attempt. I'm not saying it's the best, but it's at least instructive.
-Dave

Hi Patrick,

On Tue, Jan 20, 2009 at 9:01 AM, Patrick Mihelich <patrick.mihelich@gmail.com> wrote:
I'm not sure how or why this turned into a discussion of the general concurrency problem in C++. This is interesting, certainly, but should probably be considered a separate topic from SIMD and auto-vectorization. It doesn't seem very fair to me to criticize a SIMD library for fighting one battle instead of winning the whole war; you could make similar criticisms of Boost.Thread or Boost.MPI, yet these are useful libraries.
This isn't about being fair: I am tackling the issue from a larger perspective. I am not criticizing the SIMD library in the sense that I don't think other people will want to use it -- I personally think that being wrapped in a DSEL makes it clever, but the scope of the problem is nonetheless too narrowly defined. That is, *I* don't think putting these groups of operations together needs to be that complicated.
The sense I'm getting from this discussion is that SIMD code generation is uninteresting, and that we should stick our heads in the sand and wait for the Sufficiently Smart Compilers to come along. OK, I sympathize with this. Writing functions using SIMD intrinsics is a bit of a distraction from the computer vision tasks I actually care about, but I have time budgets to meet and usually some work of this type has to be done.
Actually, I think you're missing the point (at least from what I'm saying). I'm saying SIMD code generation ought to be the job of the compiler(s) for the platforms where they make sense. Now *if* you wanted to make it work specifically, you can do something that others have already been doing: adding a layer of indirection. This layer of indirection can be as clever as a DSEL (which I don't think it needs to be) or as simple as a function that switches implementations at compile time using preprocessor macros or some other facility.

Now, if you needed to optimize a set of operations specific to your field (for example, applying a blur to a set of pixels represented by a set of floats), I wouldn't find it hard to imagine having that specific part hand-optimized for your needs. Does this need another library? I wager it doesn't -- it's like saying you're implementing a DSEL in C++ to do simple mathematics. If you want to write your own DSEL for image manipulation and perform the transformation in the background to use the SIMD instructions for a specific platform, then fine, that would be great -- but the details of the implementation would be just that, details. I don't see the need for a special library just for SIMD instructions, *especially* since the compilers will be able to automatically vectorize the parts that can easily be vectorized. (I use the term "easily" here very loosely, because that depends on the compiler you're using.)
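A minimal sketch of the preprocessor-switch flavor of that indirection, assuming the usual predefined macros (__SSE2__ on x86 compilers, __ALTIVEC__ on PPC); pack4f and plus4 are made-up names:

    #if defined(__SSE2__)
    #include <emmintrin.h>
    typedef __m128 pack4f;
    inline pack4f plus4(pack4f a, pack4f b) { return _mm_add_ps(a, b); }
    #elif defined(__ALTIVEC__)
    #include <altivec.h>
    typedef __vector float pack4f;
    inline pack4f plus4(pack4f a, pack4f b) { return vec_add(a, b); }
    #else
    struct pack4f { float f[4]; };          // portable scalar fallback
    inline pack4f plus4(pack4f a, pack4f b)
    {
        pack4f r;
        for (int i = 0; i < 4; ++i)
            r.f[i] = a.f[i] + b.f[i];
        return r;
    }
    #endif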
IMO, waiting for compiler technology is neither pragmatic in the short-term nor (as I argued in the other thread) conceptually correct. If you look at expression-template based linear algebra libraries like uBlas and Eigen2, these are basically code generation libraries (if compilers were capable of complicated loop fusion optimizations, we might not need such libraries at all). Given an expression involving vectors, it's fairly mechanical to transform it directly into optimal assembly. Whereas at the level of optimizing IR code, reasoning about pointers and loops is rather complicated. Are the pointers 16-byte aligned? Does it make sense to partially unroll the loop to exploit data parallelism? What transformations can (and should) we make when traversing a matrix in a double loop? There are all sorts of obstacles the compiler must overcome to vectorize code. Why not handle these issues in the stage of compilation with the best information and clearest idea of what the final assembly code should look like - in this case at the library level with meta-programming?
I am not against it -- if you're talking about fixing uBlas to make it aware of the capabilities of a platform and perform the transformations necessary to leverage vendor-specific libraries, then I'm all for it. Do I think it needs a special DSEL/library for doing so? *That* is what I'm questioning. The reason I like the thought of letting the compiler do the auto-vectorization for me is that the compiler already knows about my code and the transformations it's going to apply to make it work -- there's no reason a compiler can't learn these details you talk about. It's not even absurd for a compiler to turn certain code patterns into OpenMP to parallelize parts of the solution, and then at even lower levels generate the SIMD code to leverage the SIMD extensions of the platform it's compiling for.
The fact of the matter is that compilers do not generate optimal SIMD-accelerated code except in the simplest of cases, and so we end up using SIMD intrinsics by hand. Frankly I don't expect this to change dramatically anytime soon; I'm no compiler expert, but my impression is that some complicated algebraic optimizations (for which C++ is not very suited) are necessary.
And I don't question the fact that compilers do not yet generate optimal SIMD-accelerated code -- especially if you're talking about GCC. I might be surprised to hear the same about Intel's compiler, though, as I know it performs quite advanced optimizations on the code to leverage SSE on the platforms it supports. If it's a matter of arranging your C++ so that it can be automatically vectorized by a less sophisticated yet auto-vectorizing compiler, then I would think that would be a more achievable (and easier?) goal to accomplish than releasing/maintaining a SIMD-only DSEL/library.
Using SIMD intrinsics by hand is nasty in assorted ways.
I know.
The syntax is not standard across compilers. The instructions are not generic; if I change datatypes from double to float, or int to short, I have to completely rewrite the function.
What stops you from adding a function that specializes on the types on top of these SIMD-specific functions/vectors?
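Concretely, that shim could be as small as a pair of overloads over the intrinsics -- a sketch assuming SSE2; simd_add is a made-up name:

    #include <emmintrin.h> // SSE/SSE2 intrinsics

    // One generic-looking entry point, specialized per element type.
    inline void simd_add(float* out, const float* a, const float* b)
    {
        // four floats at a time
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }

    inline void simd_add(double* out, const double* a, const double* b)
    {
        // two doubles at a time
        _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
    }

Changing the element type then changes the overload chosen, not the calling code.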
Maybe someone wants to run my SSE-enabled code on an Altivec processor, what then? There may be counterparts to all the instructions, but I don't know Altivec. Even different versions of the same instruction set are a problem; do I really want to think about the benefits of SSSE3 vs. just SSE2 and write different versions of the same function for the various instruction sets?
Which is why I'd rather rely on the compiler to do it for me -- and if the compiler for the Altivec processor doesn't know how to auto-vectorize my code, then tough luck: I'll use the SIMD-specific functions/vectors specific to that platform anyway. If I already had a SIMD-enabled min() function that broke a large vector into smaller SIMD vectors and did a faster min() than the non-SIMD version, then I could specialize it for the Altivec platform. Heck, if I were writing an image-manipulation library for different platforms, I might chunk up the operations at a higher level and leverage the SIMD-izable parts at the level specific to the image-manipulation library. But again, I don't see the need for a library specifically for SIMD-enabling operations on values. Then again, that's just me.
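A sketch of that kind of SIMD-enabled min(), assuming SSE and at least four elements (simd_min is a made-up name):

    #include <xmmintrin.h> // SSE intrinsics
    #include <cstddef>

    float simd_min(const float* p, std::size_t n) // requires n >= 4
    {
        __m128 m = _mm_loadu_ps(p);                  // first four lanes
        std::size_t i = 4;
        for (; i + 4 <= n; i += 4)
            m = _mm_min_ps(m, _mm_loadu_ps(p + i));  // four mins per step

        float lane[4];
        _mm_storeu_ps(lane, m);
        float best = lane[0];                        // reduce the four lanes
        for (int k = 1; k < 4; ++k)
            if (lane[k] < best) best = lane[k];

        for (; i < n; ++i)                           // scalar tail
            if (p[i] < best) best = p[i];
        return best;
    }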
Just having a nice, generic wrapper around SIMD ops would be a big step to ease the writing of SIMD-enabled code and make it accessible to a wider community. I think the Proto-based approach holds promise for taking advantage of various instruction sets without the user having to think much about it.
Here lies the problem: SIMD operations are specific, yet generic enough to be used on their own already. What you want to deal with is the vendor-provided interface to the SIMD vectors and SIMD operations they already support -- more specifically, GCC's support for SSE/SSE2/SSE3/MMX/... registers/vectors and operations [0]. And if you're lucky enough to get your hands on the Intel compilers, you can also read about how to lay out your code for the compiler to be able to automatically vectorize it for you [1].

References:
[0] - http://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/Vector-Extensions.html#Vector-Ex...
[1] - http://www.aartbik.com/SSE/index.html

-- Dean Michael C. Berris
Software Engineer, Friendster, Inc.
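For a flavor of the GCC extension referenced in [0], a small example (v4sf and add4 are made-up names; GCC maps the arithmetic onto SSE when it is enabled):

    // Four packed floats in one 16-byte vector.
    typedef float v4sf __attribute__ ((vector_size (16)));

    v4sf add4(v4sf a, v4sf b)
    {
        return a + b; // emitted as a single addps where SSE is available
    }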

on Tue Jan 20 2009, "Dean Michael Berris" <mikhailberis-AT-gmail.com> wrote:
Hi Patrick,
On Tue, Jan 20, 2009 at 9:01 AM, Patrick Mihelich <patrick.mihelich@gmail.com> wrote:
The sense I'm getting from this discussion is that SIMD code generation is uninteresting, and that we should stick our heads in the sand and wait for the Sufficiently Smart Compilers to come along. OK, I sympathize with this. Writing functions using SIMD intrinsics is a bit of a distraction from the computer vision tasks I actually care about, but I have time budgets to meet and usually some work of this type has to be done.
Actually, I think you're missing the point (at least from what I'm saying).
I'm saying SIMD code generation ought to be the job of the compiler(s) for the platforms where they make sense.
Why are these SIMD operations different in that respect from, say, large matrix multiplications?
Now *if* you wanted to be able to specifically make it work, you can do something that others have already been doing: adding a layer of indirection.
I.e., a library?
Now this layer of indirection can be as clever as a DSEL (which I don't think it needs to be) or as simple as a function that switches implementations at compile time using preprocessor macros or some other facility. Now if you needed to optimize a set of operations that are specific to your field (like for example, applying a blur on a set of pixels represented by a set of floats) then I wouldn't find it hard to imagine having that specific part hand-optimized for your need.
Does this need another library? I wager to say it doesn't -- it's like saying you're implementing a DSEL in C++ to do simple mathematics.
Why do you say that? Do you routinely find yourself having to write non-portable code to do simple math in C++? Do you routinely find that the compiler generates inadequate simple math code?
-- Dave Abrahams
BoostPro Computing
http://boostpro.com

On Tuesday 20 January 2009 06:58, David Abrahams wrote:
Actually, I think you're missing the point (at least from what I'm saying).
I'm saying SIMD code generation ought to be the job of the compiler(s) for the platforms where they make sense.
Why are these SIMD operations different in that respect from, say, large matrix multiplications?
A matrix multiplication is a higher-level construct. Still, most compilers will pattern-match matrix multiplication to an optimal routine. SIMD code generation is extremely low-level. Programmers want to think at a higher level. If the programmers want to direct the compiler to do something, I'm all for it. But those directions should be expressible at the higher level at which the programmer is working. The fact that not all compilers provide this capability is a shortcoming of the compilers. gcc is a prime example, and fortunately one the community can easily fix should it choose to do so.
Does this need another library? I wager to say it doesn't -- it's like saying you're implementing a DSEL in C++ to do simple mathematics.
Why do you say that? Do you routinely find yourself having to write non-portable code to do simple math in C++? Do you routinely find that the compiler generates inadequate simple math code?
I have people who tell me that every day. :) Then we go and fix it.
-Dave

on Tue Jan 20 2009, "David A. Greene" <greened-AT-obbligato.org> wrote:
On Tuesday 20 January 2009 06:58, David Abrahams wrote:
Actually, I think you're missing the point (at least from what I'm saying).
I'm saying SIMD code generation ought to be the job of the compiler(s) for the platforms where they make sense.
Why are these SIMD operations different in that respect from, say, large matrix multiplications?
A matrix multiplication is a higher-level construct. Still, most compilers will pattern-match matrix multiplication to an optimal routine.
Not hardly. No compiler is going to introduce register and cache-level blocking.
SIMD code generation is extremely low-level. Programmers want to think at a higher level.
Naturally. But are the algorithms implemented by SIMD instructions lower-level than std::for_each or std::accumulate? If not, maybe they deserve to be in a library.
-- Dave Abrahams
BoostPro Computing
http://www.boostpro.com

On Tuesday 20 January 2009 18:51, David Abrahams wrote:
Why are these SIMD operations different in that respect from, say, large matrix multiplications?
A matrix multiplication is a higher-level construct. Still, most compilers will pattern-match matrix multiplication to an optimal routine.
Not hardly. No compiler is going to introduce register and cache-level blocking.
I'm not sure what you mean here. Compilers do blocking all the time. Typically, the compiler will match a matrix multiply (and a number of other patterns) to library code that has been pre-tuned. Typically the library code has a number of possible paths based on the size of the matrices, etc. Some of those paths may be blocked at several different levels. Or not, if that gives better performance. There's a rich amount of research going on about how to auto-tune library code for just such purposes.
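For readers unfamiliar with the term, blocking just restructures the loops so that a tile of each matrix stays cache-resident. A minimal sketch (the tile size is made up; pre-tuned libraries choose it per architecture and matrix size):

    #include <cstddef>

    // C += A * B for n x n row-major matrices; n is assumed to be a
    // multiple of the tile size to keep the sketch short.
    void matmul_blocked(float* C, const float* A, const float* B, std::size_t n)
    {
        const std::size_t T = 64; // tile size: tuned, not universal
        for (std::size_t ii = 0; ii < n; ii += T)
         for (std::size_t kk = 0; kk < n; kk += T)
          for (std::size_t jj = 0; jj < n; jj += T)
           for (std::size_t i = ii; i < ii + T; ++i)      // all work below
            for (std::size_t k = kk; k < kk + T; ++k) {   // touches one
                const float a = A[i * n + k];             // cache-sized tile
                for (std::size_t j = jj; j < jj + T; ++j)
                    C[i * n + j] += a * B[k * n + j];
            }
    }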
SIMD code generation is extremely low-level. Programmers want to think at a higher level.
Naturally. But are the algorithms implemented by SIMD instructions lower-level than std::for_each or std::accumulate? If not, maybe they deserve to be in a library.
A library of fast routines for doing various things is quite different from creating a whole DSEL to do SIMD code generation. A library of fast matrix multiply, etc. would indeed be useful. How much does Boost want to concern itself with providing libraries tuned with asm routines for various architectures? It strikes me that writing these routines using gcc intrinsics wouldn't result in optimal code on all architectures. Similarly, it seems that a DSEL to do the same would have similar deficiencies. When you're talking "optimal," you're setting a pretty dang high bar.
-Dave

David A. Greene wrote:
A library of fast routines for doing various things is quite different from creating a whole DSEL to do SIMD code generation.
How a DSEL can be different from a library still puzzles me, as the basic definition of a DSEL is a DSL embedded into a host language as a library.
A library of fast matrix multiply, etc. would indeed be useful.
You mean, useful like being told weeks in advance that it's useless because uBlas already does it, as was said earlier? And if, as you say, compilers already do what's needed, then I'd call this useless too, since we could just wait for all compilers to do the same ...
How much does Boost want to concern itself with providing libraries tuned with asm routines for various architectures?
I never implied using assembly language. And even so, what's the problem? Isn't there small assembly code or intrinsic-level stuff in shared_ptr or whatever library uses TSL and sync_lock? And isn't hiding such things inside a platform-independent interface exactly what a library is supposed to do?
It strikes me that writing these routines using gcc intrinsics wouldn't result in optimal code on all architectures. Similarly, it seems that a DSEL to do the same would have similar deficiencies.
Except that *maybe* the DSEL takes care of using the correct set of intrinsics depending on the platform, using, I don't know, architecture detection at compile time? And, IIRC, the gcc intrinsics are just C-like functions over the SIMD assembly instructions ... so I don't see how it couldn't ...

On Wednesday 21 January 2009 01:30, Joel Falcou wrote:
David A. Greene wrote:
A library of fast routines for doing various things is quite different from creating a whole DSEL to do SIMD code generation.
How a DSEL can be different from a library still puzzles me, as the basic definition of a DSEL is a DSL embedded into a host language as a library.
Implementing a DSEL to do code generation is a LOT more work than simply coding a fast library in asm. If you want to generate SIMD code for lots of libraries, then a DSEL might be worth it, but I'm talking about specialized applications here (matrix multiply, etc.).
A library of fast matrix multiply, etc. would indeed be useful.
You mean, useful like being told weeks in advance that it's useless because uBlas already does it, as was said earlier? And if, as you say, compilers already do what's needed, then I'd call this useless too, since we could just wait for all compilers to do the same ...
I'm talking about specific routines tuned in a way that a general-purpose compiler would not be able to replicate. It's a very small set of codes.
It strikes me that writing these routines using gcc intrinsics wouldn't result in optimal code on all architectures. Similarly, it seems that a DSEL to do the same would have similar deficiencies.
Except that *maybe* the DSEL takes care of using the correct set of intrinsics depending on the platform, using, I don't know, architecture detection at compile time? And, IIRC, the gcc intrinsics are just C-like functions over the SIMD assembly instructions ... so I don't see how it couldn't ...
Then your DSEL is actually a full-blown compiler code generator. Generating "optimal" code is a lot more than just picking instructions. You have to allocate registers, schedule, etc., and that changes not just based on the ISA but on the implementation of that ISA provided by a particular processor. Writing a DSEL containing all of this knowledge is much more work than just coding the library in asm if the set of libraries is small.
-Dave

on Tue Jan 20 2009, "David A. Greene" <greened-AT-obbligato.org> wrote:
When you're talking "optimal," you're setting a pretty dang high bar.
I don't have to care about "optimal" if the difference between a suboptimal use of SIMD and not using it at all is an order of magnitude.
-- Dave Abrahams
BoostPro Computing
http://www.boostpro.com

When you're talking "optimal," you're setting a pretty dang high bar.
I don't have to care about "optimal" if the difference between a suboptimal use of SIMD and not using it at all is an order of magnitude.
Indeed; however, in Gautam Sewani's GSOC project this year he looked quite hard at optimising Boost.Math with SSE2/3 instructions and found it quite hard to find *any* use cases where hand-written SSE2 code was better than compiler-generated code. One exception was the classic vectorised addition, but even then you're struggling to get a 2x improvement. Of course, if the submitter can show that his code *is* faster than the alternatives, then all this discussion is entirely moot, and strictly IMO we should stop discussing the bicycle shed colour and get on with it :-)

Just my 2c worth, John.

When you're talking "optimal," you're setting a pretty dang high bar.
I don't have to care about "optimal" if the difference between a suboptimal use of SIMD and not using it at all is an order of magnitude.
Indeed; however, in Gautam Sewani's GSOC project this year he looked quite hard at optimising Boost.Math with SSE2/3 instructions and found it quite hard to find *any* use cases where hand-written SSE2 code was better than compiler-generated code. One exception was the classic vectorised addition, but even then you're struggling to get a 2x improvement.
Hm, I cannot really comment on non-vectorized functions. I am heavily using SIMD (SSE) for vectorized operations. Doing vectorized math operations, something like:

    void tanh4(float * out, const float * in);

I measured a performance gain of a factor of 6 to 7 compared to the libm implementation ... In general, when optimizing code for SIMD operations, it makes sense to focus on vector operations ...

Reading all this discussion about vectorizing compilers, one should always take into account that compilers are not allowed to do some transformations because of aliasing issues. Writing a SIMDified vector function, one can specify that pointers are required to be aligned and that memory regions are not allowed to overlap -- an assumption the compiler is not able to make.

I would be curious to see a boost.simd library and could provide some of my SSE code for vector operations ...

best, tim
--
tim@klingt.org
http://tim.klingt.org

Which is more musical, a truck passing by a factory or a truck passing by a music school?
    John Cage
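To make that point concrete: an interface like tanh4 can *promise* alignment and no overlap, which the compiler could never prove on its own. A sketch, assuming SSE and GCC's __restrict__ spelling (mul4 is a made-up cousin of tanh4):

    #include <xmmintrin.h> // SSE intrinsics
    #include <cstddef>

    // Contract: 16-byte aligned pointers, no overlap, n divisible by 4.
    void mul4(float* __restrict__ out,
              const float* __restrict__ in1,
              const float* __restrict__ in2,
              std::size_t n)
    {
        for (std::size_t i = 0; i != n; i += 4)
            _mm_store_ps(out + i,                  // aligned store is safe
                _mm_mul_ps(_mm_load_ps(in1 + i),   // aligned loads, by the
                           _mm_load_ps(in2 + i))); // interface contract
    }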

On Wednesday 21 January 2009 03:11, David Abrahams wrote:
on Tue Jan 20 2009, "David A. Greene" <greened-AT-obbligato.org> wrote:
When you're talking "optimal," you're setting a pretty dang high bar.
I don't have to care about "optimal" if the difference between a suboptimal use of SIMD and not using it at all is an order of magnitude.
Well, the whole discussion is about "optimal." If one doesn't care about "optimal" then a compiler will do just fine all the time and there's no need for a DSEL, asm or ugly gcc intrinsics.
-Dave

David A. Greene wrote:
On Wednesday 21 January 2009 03:11, David Abrahams wrote:
on Tue Jan 20 2009, "David A. Greene" <greened-AT-obbligato.org> wrote:
When you're talking "optimal," you're setting a pretty dang high bar. I don't have to care about "optimal" if the difference between a suboptimal use of SIMD and not using it at all is an order of magnitude.
Well, the whole discussion is about "optimal." If one doesn't care about "optimal" then a compiler will do just fine all the time and there's no need for a DSEL, asm or ugly gcc intrinsics.
What if we replace optimal with optimized? Surely library code that gives a 4x speedup is desirable to have, even if you can hand-generate code that gives you 5x. Getting a 4x speedup over naive SIMD-less code in simple vector operations while still being able to concentrate on the problem at hand instead of low-level optimization details sounds fantastic to me.
-- Michael Marcin

Well, the whole discussion is about "optimal." If one doesn't care about "optimal" then a compiler will do just fine all the time and there's no need for a DSEL, asm or ugly gcc intrinsics.
What if we replace optimal with optimized?
Surely library code that gives a 4x speedup is desirable to have, even if you can hand-generate code that gives you 5x. Getting a 4x speedup over naive SIMD-less code in simple vector operations while still being able to concentrate on the problem at hand instead of low-level optimization details sounds fantastic to me.
Indeed, and like I keep saying: show us the code that produces the speedup and we can all stop arguing and start rejoicing :-)
John.

John Maddock wrote:
Well, the whole discussion is about "optimal." If one doesn't care about "optimal" then a compiler will do just fine all the time and there's no need for a DSEL, asm or ugly gcc intrinsics.
What if we replace optimal with optimized?
Surely library code that gives a 4x speedup is desirable to have, even if you can hand-generate code that gives you 5x. Getting a 4x speedup over naive SIMD-less code in simple vector operations while still being able to concentrate on the problem at hand instead of low-level optimization details sounds fantastic to me.
Indeed, and like I keep saying: show us the code that produces the speedup and we can all stop arguing and start rejoicing :-)
The compiler doesn't do better than a good assembly programmer *can* do by hand; it does as well as a good assembly programmer *usually* does. Arguing for the compiler to perform SIMD optimization is in and of itself arguing for giving up optimal for good enough.

Rather than pointing to the code that produces the speedup, we can point to the application domains where the speedup is realized. Video encoding is one place where SIMD becomes a big deal. Very often when comparing processors using benchmarks there will be a pronounced difference in performance for video encoding that isn't there on other benchmarks, because of the use of SSE in the benchmark. As John points out, these fantastic performance benefits just aren't in the offing for most applications.

Sometimes I write C++ code that looks a little like verilog and I think to myself, wow, this would be screamingly fast in hardware. For video encoding and other very important applications they actually do implement such functions in hardware, and they know exactly where they want to use that hardware in the code.

I think it makes perfect sense for Adobe (for instance) to implement a library of image processing primitives based on SIMD that detects the hardware at runtime and chooses the appropriate implementation for that hardware, so that the binary rather than the code is portable across x86 platforms. It does not, however, make sense for them to open-source such a library, because it is sufficiently difficult to implement that it represents a competitive advantage. SIMD is hard, and if you stand to benefit from it, you have to use it in order to compete.

I sympathise with the OP; I think an open source SIMD library could be a real help to guys like him. He should be asking the hardware manufacturers for support instead of boost. A SIMD library could be seen as comparable to TBB, for example, but the number of applications where it could be applied is much smaller, so it might be easier to just write the intrinsics in the few places where it actually matters, the places where you want hand-crafted code anyway.

Regards,
Luke
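The runtime-detection scheme described here typically boils down to a function pointer selected once by a CPU probe. A minimal sketch -- every name is hypothetical, and the probe is stubbed where a real implementation would execute cpuid:

    #include <cstddef>

    // Stub: a real probe would execute the cpuid instruction.
    static bool cpu_has_sse2() { return false; }

    static void blur_scalar(float* out, const float* in, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = in[i]; // placeholder for the portable blur
    }

    static void blur_sse2(float* out, const float* in, std::size_t n)
    {
        blur_scalar(out, in, n); // the hand-tuned SSE2 path would go here
    }

    typedef void (*blur_fn)(float* out, const float* in, std::size_t n);

    // Chosen once at startup: the binary, not the source, stays portable.
    static const blur_fn blur = cpu_has_sse2() ? blur_sse2 : blur_scalar;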

On Wednesday 21 January 2009 20:07, Michael Marcin wrote:
Well, the whole discussion is about "optimal." If one doesn't care about "optimal" then a compiler will do just fine all the time and there's no need for a DSEL, asm or ugly gcc intrinsics.
What if we replace optimal with optimized?
Surely library code that gives a 4x speedup is desirable to have, even if you can hand-generate code that gives you 5x. Getting a 4x speedup over naive SIMD-less code in simple vector operations while still being able to concentrate on the problem at hand instead of low-level optimization details sounds fantastic to me.
Agreed. I'm under the impression that Joel wants to get 5x through a DSEL-based code generator. I'm not sure it's worthwhile in the general case. It may be worthwhile in highly specialized cases.
-Dave

David A. Greene wrote:
Agreed. I'm under the impression that Joel wants to get 5x through a DSEL-based code generator.
I never assumed this ...
--
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

On Monday 19 January 2009 19:01, Patrick Mihelich wrote:
IMO, waiting for compiler technology is neither pragmatic in the short-term nor (as I argued in the other thread) conceptually correct. If you look at expression-template based linear algebra libraries like uBlas and Eigen2, these are basically code generation libraries (if compilers were capable of complicated loop fusion optimizations, we might not need such libraries at
Good compilers do a good amount of loop fusion. They can't get all cases, certainly, but I would encourage you to explore what's already being done. gcc is not a good example.
all). Given an expression involving vectors, it's fairly mechanical to transform it directly into optimal assembly. Whereas at the level of optimizing IR code, reasoning about pointers and loops is rather complicated. Are the pointers 16-byte aligned? Does it make sense to
The alignment issue is becoming less of a problem on newer architectures. Barcelona, for example, doesn't care. It's entirely appropriate to fix these kinds of problems in the hardware.
partially unroll the loop to exploit data parallelism? What transformations can (and should) we make when traversing a matrix in a double loop? There
Good optimizing compilers do lots of transformations: interchange, unswitching, unroll-and-jam, collapse, coalesce, etc., etc. No, they are not perfect. But it strikes me the answer is not to do SIMD codegen by hand. What would be more interesting is a general way to convey the information the compiler needs. Typically this is done with vendor-specific pragmas, but there are all kinds of tricks one could imagine that might be able to help the compiler and that are expressible directly in C++. Actually, it would be quite interesting to see what we could convey to the compiler through judicious use of "pointless" C++ code.
The fact of the matter is that compilers do not generate optimal SIMD-accelerated code except in the simplest of cases, and so we end up using SIMD intrinsics by hand. Frankly I don't expect this to change
Define "optimal." A compiler will not generate "optimal" code in most cases because the compiler is general purpose. Some transformations that benefit a specific piece of code can be disastrous on another. But usually these decisions don't happen in code generation. They happen in the optimizer where loop transformations are scheduled. If we're going to help the compiler, this is where we need to do it. Some compilers have ways to force specific transformations to be done. IMHO more compilers need to provide these interfaces. A SIMD library is too narrowly focused. In a decade another technology will come along. Will we need another DSEL or library for that one? Why not just tell the compiler what it wants to know? In any event, no one is stopping anyone from creating a SIMD DSEL. I believe it's the wrong approach but then my career doesn't depend on optimal graphics code. In certain specialized cases it may well be worth it.
dramatically anytime soon; I'm no compiler expert, but my impression is that some complicated algebraic optimizations (for which C++ is not very suited) are necessary.
Algebraic manipulations are no problem. Pointers and aliasing are problematic. Generous use of "restrict" can make a huge difference. One of the major problems I see day in and day out is developers "helping" the compiler by pre-linearizing addresses, etc. As a wise man once said, "Avoid premature optimization."
-Dave
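For instance -- a sketch using GCC's __restrict__ spelling, since plain "restrict" is C99 rather than standard C++ -- the no-aliasing promise is exactly what lets an auto-vectorizer emit packed instructions without runtime overlap checks:

    #include <cstddef>

    // Without __restrict__, the compiler must assume out may alias x or y
    // and will often refuse to vectorize this loop.
    void axpy(float* __restrict__ out,
              const float* __restrict__ x,
              const float* __restrict__ y,
              float a, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a * x[i] + y[i];
    }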

Dean Michael Berris wrote:
I personally have dealt with two types of parallelization: parallelization at a high level (dealing with High Performance Computing using something like MPI for distributed message-passing parallel computation across machines) and parallelization at a low level (talking about SSE and auto-vectorization).
MPI is in no way high-level; it's low-level in that you have to explicitly say which tasks execute where and who they communicate with. Threads, for example, are much more high-level than that: they get scheduled dynamically, trying to make the best use of the hardware as it is being used, or to optimize some other factor, depending on the scheduler. The difference between MPI and SIMD is not low-level vs. high-level, however: it's task-parallel vs. data-parallel.

On Sun, Jan 18, 2009 at 8:46 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Dean Michael Berris wrote:
I personally have dealt with two types of parallelization: parallelization at a high level (dealing with High Performance Computing using something like MPI for distributed message-passing parallel computation across machines) and parallelization at a low level (talking about SSE and auto-vectorization).
MPI is in no way high-level; it's low-level in that you have to explicitly say which tasks execute where and who they communicate with. Threads, for example, are much more high-level than that: they get scheduled dynamically, trying to make the best use of the hardware as it is being used, or to optimize some other factor, depending on the scheduler.
The difference between MPI and SIMD is not low-level vs. high-level, however: it's task-parallel vs. data-parallel.
Yes, but the comparison I was making regarding high-level and low-level was in terms of code. In MPI you coded in C/C++, expressing communication between/among tasks without having to know about the actual (i.e., physical) topology of your architecture. This was considerably higher-level than, say, writing hand-optimized SSE-aware code in your C++ to express parallelism. I understand that MPI is a specification for communication primitives (much like assembly for distributed computing across logical/actual machines), and that wasn't the comparison I was going for when I said it was "high-level". Though if you think about it, it is high-level in the sense that you don't deal directly with networking/shared-memory primitives, etc. -- but I don't want to belabor the point too much. ;-)

When you talk about threads, those are high-level in the context of single-machine parallelism -- unless your OS is a distributed OS that spawned processes across multiple machines and magically showed just one big environment to programs running in that parallel machine. Threads can be considered parallelism primitives in a single-machine context, but not in a general parallel-computing context where you might be spanning multiple distributed/independent machines.

BTW, you can use asynchronous communication primitives in MPI and still use threads in each process. I don't even think MPI and threads work at the same level anyway, so it depends on which level you're looking from whether you think MPI is low-level or threads are high-level. ;-)

-- Dean Michael C. Berris
Software Engineer, Friendster, Inc.

Dean Michael Berris wrote:
When you talk about threads, those are high-level in the context of single-machine parallelism
What is a machine? What's the difference between two machines with a CPU each and a NUMA machine with two CPUs? Only the protocol (and its throughput and latency, of course) used to access memory on the other CPU. NUMA is really a form of cluster computing, and you can implement NUMA in software over a cluster network. Threads are just tasks that share memory. There is no need for all those tasks to be on the same machine: memory can be distributed.
-- unless your OS is a distributed OS that spawned processes across multiple machines and magically showed just one big environment to programs running in that parallel machine. Threads can be considered parallelism primitives in a single machine context, but not in a general parallel computing context where you might be spanning multiple distributed/independent machines.
That's a single-system image, and it actually works very well if you have the required low-latency network.
BTW, you can use asynchronous communication primitives in MPI and still use threads in each process. I don't even think MPI and threads work at the same level anyway, so it depends on which level you're looking from whether you think MPI is low-level or threads are high-level. ;-)
I would personally prefer to describe parallel tasks in one way once and for all rather than have to express them in different ways depending on what level of hardware parallelism can be used to run them.
participants (10)
- David A. Greene
- David Abrahams
- Dean Michael Berris
- Joel Falcou
- John Maddock
- Mathias Gaunard
- Michael Marcin
- Patrick Mihelich
- Simonson, Lucanus J
- Tim Blechmann