Unfortunately, the concurrency/parallelism group has decided that they do not want C++ to provide types representing SIMD registers. I'm afraid I don't quite understand the rationale for such a refusal; proposing more high-level constructs similar to valarray (or to our own library NT2) was suggested instead, but that is obviously a more complex and more limited API, not a basic building block for programming a specific processor unit portably.
As much as a refusal can be heartbreaking, trust me when I say that indecision or indifference is worse. The fact that they gave you a definite refusal is actually one of the least bad outcomes of an ISO standards proposal, because you can now move on without uncertainty.
Development of Boost.SIMD will still proceed, aiming for integration into Boost, but standardization appears to be definitely out of the question. Any feedback on the API presented in the proposal is welcome: http://open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3571.pdf
If I were still part of ISO standards, I'd observe the following:

1. GPU and CPU stream computation technologies are still merging. In other words, it's too soon to standardize this technology lest we accidentally break some novel form of new convergence. Happy to reconsider post-C++14.

2. It's hard to standardize current CPU SIMD implementations due to extremely irritating inconsistencies between vendors. For example, a generic straight port of SSE2 to NEON will have awful performance, because SSE2 code does a lot of flipping between SIMD and non-SIMD, and because NEON is a coprocessor on ARM for which such flipping is expensive. Another bugbear of mine on NEON is the lack of an equivalent to _mm_movemask_epi8(), which can be emulated in about eight NEON instructions, but in so doing you'll make code which was very speedy on SSE2 pretty slow in most cases on NEON. My point here is that you cannot standardize such non-uniform behavior in a universally performant way, because you'll just get lowest-common-denominator performance across all SIMD implementations, which rather defeats the purpose. And even if all vendors were like NEON, or all like SSE2, we would still have to see how CUDA and OpenCL pan out in the long run.

3. I am unsure whether C++ is the appropriate language for SIMD standardization, when perhaps a meta-form of JIT-compiled C++ would be much superior (i.e. you supply LLVM bytecode, and it gets delivered to a GPU/CPU/whatever). We'll have those on the table with LLVM-type compilers. In other words, I would vote to wait and see what the market throws up.

I appreciate that none of these three rationales is what you want to hear. Still, I hope my observations are useful to you. None of them suggests you shouldn't proceed with Boost.SIMD; Boost has a much wider remit than just being a testing ground for future C++ standard library features. But I suspect that if SIMD ever does get standardized, it won't look like your library or proposal, because it will be based on technologies which don't exist yet.

Hope that helps,

Niall
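[To make the "flipping between SIMD and non-SIMD" point above concrete, here is a minimal sketch of a pattern that is idiomatic and cheap on SSE2, because moving a comparison mask from a vector register to a general-purpose register is a single fast instruction; NEON has no direct equivalent. The function name and needle value are purely illustrative, and __builtin_ctz is a GCC/Clang builtin.]

    #include <emmintrin.h>  // SSE2
    #include <cstddef>
    #include <cstdint>

    // Find the first occurrence of `needle` in p[0..n), n a multiple of 16.
    // The vector -> scalar hop via _mm_movemask_epi8 is the part that
    // ports badly to NEON, which has no single-instruction equivalent.
    std::ptrdiff_t find_byte(const std::uint8_t* p, std::size_t n,
                             std::uint8_t needle)
    {
        const __m128i vneedle = _mm_set1_epi8(static_cast<char>(needle));
        for (std::size_t i = 0; i < n; i += 16) {
            __m128i v  = _mm_loadu_si128(
                             reinterpret_cast<const __m128i*>(p + i));
            __m128i eq = _mm_cmpeq_epi8(v, vneedle);  // 0xFF where equal
            int mask   = _mm_movemask_epi8(eq);       // vector -> GPR, one instruction
            if (mask)                                 // scalar branch on a SIMD result
                return static_cast<std::ptrdiff_t>(i) + __builtin_ctz(mask);
        }
        return -1;
    }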
On 18/04/13 16:08, Niall Douglas wrote:
> 1. GPU and CPU stream computation technologies are still merging. In other words, it's too soon to standardize this technology lest we accidentally break some novel form of new convergence. Happy to reconsider post-C++14.
I think CPUs and GPUs are different things, and that it is a mistake to consider a unified programming model. A GPU is an accelerator for large, regular computations, and it requires sending data over to it and receiving the results back. It is also programmed with a very constrained programming model that cannot express all kinds of operations efficiently. A CPU, on the other hand, is a very flexible processor, and all memory is already there. You can make it do a lot of complex computations, whether irregular, sparse or iterative; it can do dynamic scheduling and work stealing; and you have fine-grained control over all its components and how they work together.
> 2. It's hard to standardize current CPU SIMD implementations due to extremely irritating inconsistencies between vendors. For example, a generic straight port of SSE2 to NEON will have awful performance, because SSE2 code does a lot of flipping between SIMD and non-SIMD, and because NEON is a coprocessor on ARM for which such flipping is expensive. Another bugbear of mine on NEON is the lack of an equivalent to _mm_movemask_epi8(), which can be emulated in about eight NEON instructions, but in so doing you'll make code which was very speedy on SSE2 pretty slow in most cases on NEON.
>
> My point here is that you cannot standardize such non-uniform behavior in a universally performant way, because you'll just get lowest-common-denominator performance across all SIMD implementations, which rather defeats the purpose. And even if all vendors were like NEON, or all like SSE2, we would still have to see how CUDA and OpenCL pan out in the long run.
A developer should try to write code that is as vertical as possible, i.e. code that applies the same operation to all elements of the register and doesn't mix elements together. Doing this will give good performance on all kinds of SIMD hardware. Using a couple of regular horizontal operations might be okay, but if your algorithm depends heavily on more advanced operations like shuffling, performance will indeed be hardware-dependent. The main use case of _mm_movemask_epi8 is to implement reduction functions (which are horizontal) like all, none or any. NEON does not have an instruction to do this directly, but you can do it with relatively few instructions using VHADD; while it might not be as fast as SSE, it shouldn't be too bad either (see the sketch below).
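[For concreteness, here is one common way to emulate _mm_movemask_epi8 on NEON; this particular sketch uses pairwise widening adds (VPADDL) rather than the VHADD route mentioned above, but it is similar in spirit and lands in the same instruction-count ballpark.]

    #include <arm_neon.h>
    #include <cstdint>

    // Emulate _mm_movemask_epi8 for a NEON comparison result, i.e. a
    // vector whose bytes are all 0x00 or 0xFF. Collects the MSB of each
    // of the 16 bytes into a 16-bit mask, as the SSE2 instruction does.
    inline int movemask_u8(uint8x16_t in)
    {
        // Weight each lane with its bit position within each 64-bit half.
        static const uint8_t powers[16] = { 1, 2, 4, 8, 16, 32, 64, 128,
                                            1, 2, 4, 8, 16, 32, 64, 128 };
        uint8x16_t bits = vandq_u8(in, vld1q_u8(powers));

        // Three pairwise widening adds reduce each 64-bit half to the
        // sum of its byte weights (which fits in one byte).
        uint16x8_t sum16 = vpaddlq_u8(bits);
        uint32x4_t sum32 = vpaddlq_u16(sum16);
        uint64x2_t sum64 = vpaddlq_u32(sum32);

        int lo = vgetq_lane_u8(vreinterpretq_u8_u64(sum64), 0); // low 8 lanes
        int hi = vgetq_lane_u8(vreinterpretq_u8_u64(sum64), 8); // high 8 lanes
        return lo | (hi << 8);
    }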
> 3. I am unsure whether C++ is the appropriate language for SIMD standardization, when perhaps a meta-form of JIT-compiled C++ would be much superior (i.e. you supply LLVM bytecode, and it gets delivered to a GPU/CPU/whatever). We'll have those on the table with LLVM-type compilers. In other words, I would vote to wait and see what the market throws up.
Hardware manufacturers already provide a C programming interface (intrinsics) for using these units. The point of a C++ library, rather than C, is to be able to use the native operators and function overloading, as the sketch below illustrates.
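[Below is a minimal, hypothetical wrapper over SSE intrinsics; it is nothing like the full Boost.SIMD design (which is generic over element types and vector ISAs), just enough to show why operator overloading matters. All names are illustrative.]

    #include <xmmintrin.h>  // SSE

    // Minimal illustrative wrapper around one register of four floats.
    struct packf {
        __m128 r;
        explicit packf(const float* p) : r(_mm_loadu_ps(p)) {}
        packf(__m128 v) : r(v) {}
        void store(float* p) const { _mm_storeu_ps(p, r); }
    };
    inline packf operator+(packf a, packf b) { return _mm_add_ps(a.r, b.r); }
    inline packf operator*(packf a, packf b) { return _mm_mul_ps(a.r, b.r); }

    // y[0..3] = a * x[0..3] + y[0..3]
    void saxpy4(float a, const float* x, float* y)
    {
        // With raw C intrinsics this would read:
        //   _mm_add_ps(_mm_mul_ps(_mm_set1_ps(a), vx), vy)
        // With overloading, the code reads like the scalar formula:
        packf va(_mm_set1_ps(a)), vx(x), vy(y);
        (va * vx + vy).store(y);
    }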
> I appreciate that none of these three rationales is what you want to hear. Still, I hope my observations are useful to you. None of them suggests you shouldn't proceed with Boost.SIMD; Boost has a much wider remit than just being a testing ground for future C++ standard library features. But I suspect that if SIMD ever does get standardized, it won't look like your library or proposal, because it will be based on technologies which don't exist yet.
>
> Hope that helps,
This is similar to the feedback I received at the meeting in Bristol. However, SIMD has been here for 25 years and is still on the roadmap of future processors, and across all this time it has mostly stayed the same. GPU computing, on the other hand, is relatively new and still evolving a lot. It's also quite trendy and buzzword-y, and in reality it is not as fast and versatile as marketing makes it out to be. A lot of people seem intent on standardizing GPU technology rather than SIMD technology; that's quite a shame.
On Tue, Apr 23, 2013 at 5:43 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
> On 18/04/13 16:08, Niall Douglas wrote:
>> 1. GPU and CPU stream computation technologies are still merging. In other words, it's too soon to standardize this technology lest we accidentally break some novel form of new convergence. Happy to reconsider post-C++14.
>
> I think CPUs and GPUs are different things, and that it is a mistake to consider a unified programming model.
>
> A GPU is an accelerator for large, regular computations, and it requires sending data over to it and receiving the results back. It is also programmed with a very constrained programming model that cannot express all kinds of operations efficiently.
>
> A CPU, on the other hand, is a very flexible processor, and all memory is already there. You can make it do a lot of complex computations, whether irregular, sparse or iterative; it can do dynamic scheduling and work stealing; and you have fine-grained control over all its components and how they work together.
IMHO, there is a clear trend of convergence. GPUs are slowly getting closer to CPUs, both in terms of capabilities and physical distance. For example, AMD's APUs share execution units between the CPU and GPU, and I expect Intel to eventually follow that lead. GPU and CPU already share memory; the only thing left is to extend the x86 instruction set to employ the additional units in the GPU. On the other hand, Nvidia is planning for Maxwell (to be released next year) to include an ARM core within the GPU, so the graphics card becomes more independent from the CPU and main memory. Also take note of the Intel Xeon Phi, which is basically a CPU coming close to a GPU in terms of number crunching.

But I have to say I don't support the idea of delaying standardization to wait and see what comes out of this diverse field of technologies. Things are evolving constantly, so we could end up waiting forever that way. I don't advocate rushing the standardization, but SIMD technologies, although they differ between architectures, are well established and have proved to be useful. If there is a common extensible layer that can simplify SIMD programming, why not standardize it?

Anyway, I will be a happy user of Boost.SIMD even if it doesn't make it into the standard, provided that it fits my needs.