
Hi,

This week I presented a proposal to the C++ standards committee for a standard library component for SIMD computation, based on the library in development Boost.SIMD (not yet a Boost library). I was hoping to get feedback on the interface and establish an API that would satisfy hardware and compiler vendors.

Unfortunately the concurrency/parallelism group has decided that they do not want C++ to provide types representing SIMD registers. I'm afraid I don't quite understand the rationale for such a refusal; proposing more high-level constructs similar to valarray (or to our own library NT2) was suggested, but that is obviously a more complex and limited API, not a basic building block for programming a specific processor unit portably.

Development of Boost.SIMD will still proceed, aiming for integration in Boost, but standardization appears to be definitely out of the question.

Any feedback on the API presented in the proposal is welcome. <http://open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3571.pdf>

On Thu, Apr 18, 2013 at 2:46 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
This is a shame. Is the rationale or the official response from the working group available somewhere?
I have a few questions:

1. When a particular algorithm or pack configuration is not supported by the hardware, is the implementation required to emulate it with scalar or partially vectorized operations? And what is the behavior of Boost.SIMD in this regard?

2. It looks like the proposal does not define any means to discover the availability of certain operations and pack configurations in hardware. How would algorithm versioning work with this proposal? I'm not assuming each algorithm and operation would dispatch its implementation based on a hardware check, as this would be too slow.

3. It supports division and modulus for integers? Is this supported by any hardware? This is out of curiosity; I'm not familiar with implementations other than SSE/AVX.

4. How would advanced operations such as FMA and integer madd be implemented? Through additional library-provided functions? IMHO, the availability of these operations is often crucial for the performance of the user's algorithm, if it is more complicated than just accumulating integers.

5. Do you have any plans or time frames for Boost.SIMD's inclusion? What is the state of the library?

I also want to encourage you to continue this work. This is a very interesting area and there is certainly demand for a higher-level abstraction of SIMD operations. Keep up the good work, and perhaps one day we'll see SIMD in the Standard after all.

On Thu, 18 Apr 2013, Andrey Semashev wrote:
Yes (according to my recollection of reading the paper).
3. It supports division and modulus for integers?
Why not?
Is it supported by any hardware?
At least some special cases are, like division by a power of 2. And if the divisor is constant, you can also let the implementation handle turning it into a multiplication. And the general case might very well be supported in the future, if it isn't already.
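A minimal illustration of those special cases (scalar code shown for clarity; a pack operator/ could apply the same strength reduction lane-wise with vector multiplies and shifts):

    #include <cstdint>

    // Division by a power of two reduces to a shift; compilers do this
    // automatically for unsigned operands.
    std::uint32_t div8(std::uint32_t x) { return x / 8; }  // emitted as x >> 3

    // For other constant divisors, compilers emit a multiply by a
    // precomputed fixed-point reciprocal followed by shifts, so a SIMD
    // implementation only needs vector multiply and shift instructions.
    std::uint32_t div7(std::uint32_t x) { return x / 7; }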
If you only want fma as a fast way to compute a+b*c, you could just let your compiler optimize an addition and a multiplication to fma. They are not bad at that. If you rely on the extra accuracy of fma, then library functions seem necessary. -- Marc Glisse
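A small sketch of the accuracy point: a*b + c rounds twice, while std::fma rounds once, so the results can differ.

    #include <cmath>
    #include <cstdio>

    int main()
    {
        double a = 1.0 + 0x1p-52;          // 1 + one ulp
        double b = 1.0 - 0x1p-52;          // 1 - one ulp
        double c = -1.0;
        double plain = a * b + c;          // a*b rounds to 1.0, so this is 0.0
        double fused = std::fma(a, b, c);  // exact product kept: -0x1p-104
        std::printf("%a vs %a\n", plain, fused);
    }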

On Friday 19 April 2013 01:21:58 Marc Glisse wrote:
I think these special cases are better coded explicitly.
Does the compiler do that with user-defined operators (which is what pack operators are)? Or do you mean the implementation of the operator will handle that? The latter means that the division will be very slow, but OK, since division is slow even in hardware...
According to my experience, compilers are reluctant to pattern-match intrinsics and replace them with other intrinsics (which is a good thing). So if the user's code a*b+c*d is equivalent to two _mm_mullo_epi16/_mm_mulhi_epi16 pairs and an _mm_add_epi32, then that's what you'll get in the output instead of a single _mm_madd_epi16. Note also that _mm_madd_epi16 requires a special layout of its operands in the xmm register elements, which is also a blocker for the compiler optimization.

Regarding FMA, this is probably easier for compilers, but due to the difference in accuracy I don't expect compilers to perform this optimization lightly (i.e. without a specific compiler switch explicitly allowing it). And a switch, being a global option, may not be suitable in every place of the application. So having a way to explicitly express the programmer's intention is useful here too.

I think special operations like FMA, madd, hadd/hsub, avg, and min/max should be provided as functions. Also, it might be helpful to be able to convert packs to the compiler-specific types, like __m128i, and back, in order to use other more specialized intrinsics that are not available as functions, or to interoperate with inline assembler.

What I also forgot to ask is how the paper and Boost.SIMD handle overflowing and saturating integer arithmetic. I assume the operators on packs implement overflowing operations, since that's how scalar operations work. Is it possible to do saturating operations then?
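For illustration, a sketch of the layout constraint mentioned above (SSE2 intrinsics; the pairing is fixed by the instruction, not chosen by the compiler):

    #include <emmintrin.h>  // SSE2

    // _mm_madd_epi16 computes, per 32-bit output lane,
    //   x[2i]*y[2i] + x[2i+1]*y[2i+1]
    // on signed 16-bit inputs. To obtain a*b + c*d, the a/c (and b/d)
    // values must already be interleaved pairwise in the registers, which
    // is why a compiler rarely synthesizes this from independent
    // multiplies and adds.
    __m128i dot_pairs(__m128i x, __m128i y)  // x = {a0,c0,a1,c1,...}, y = {b0,d0,b1,d1,...}
    {
        return _mm_madd_epi16(x, y);         // {a0*b0+c0*d0, a1*b1+c1*d1, ...}
    }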

On 19/04/2013 07:55, Andrey Semashev wrote:
They are already present in Boost.simd. Overflowing operations are the current operators + - * / and abs and neg. Saturating operations are abss, adds, subs, muls, divs, negs (the final s standing for saturated). abss and negs differ from abs and neg in that, for integers, they map Valmin -> Valmax (versus Valmin -> Valmin for the standard ones). We also have saturate<A>(a), which returns the value of a saturated to the type A (the available types in Boost.simd are signed/unsigned integers (8, 16, 32, 64 bits) and packs of such).

All operations in Boost.simd are coded so that, if proper intrinsics do not exist or offer no speed advantage, they fall back to mapping the available scalar implementation over each element of the SIMD vector (note however that this is very uncommon). This is the case for division of 64-bit integers. (Without going too far into implementation details, it is often faster, when possible, to use floating-point division intrinsics to implement integer division on today's processors.)
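A hedged sketch of the difference, using the function names listed above (the include paths and namespace are assumptions, since the library is still in development):

    #include <cstdint>
    // Assumed include paths, following Boost.SIMD's development layout:
    #include <boost/simd/sdk/simd/pack.hpp>
    #include <boost/simd/include/functions/adds.hpp>

    namespace bs = boost::simd;

    bs::pack<std::int8_t> sum_wrap(bs::pack<std::int8_t> a, bs::pack<std::int8_t> b)
    {
        return a + b;           // wraps on overflow: 100 + 100 -> -56
    }

    bs::pack<std::int8_t> sum_sat(bs::pack<std::int8_t> a, bs::pack<std::int8_t> b)
    {
        return bs::adds(a, b);  // saturates: 100 + 100 -> 127
    }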

On 19/04/13 06:55, Andrey Semashev wrote:
_mm_madd_epi16 is not a vertical operation, so it's a fairly special function, and you can't expect the compiler to recognize cases where it can use it. _mm_macc_epi16 is the vertical one (XOP only), and much easier on the optimizer. There are fma and correct_fma functions in any case.
They do. A compiler is allowed to use higher precision for intermediate results whenever it wants. This is also what allows compilers to use 80 bits of precision for operations on float or double.
The standard proposal tried to keep things simple, the library itself has quite a few more things.

On Sunday 21 April 2013 11:34:14 Mathias Gaunard wrote:
That's my point. Nonetheless this operation is very useful in some cases and I would like to be able to use it with Boost.SIMD. The same goes for many other special operations.
So, is it possible to convert pack to __m128i & co. and back in Boost.SIMD?

18.04.2013 14:46, Mathias Gaunard:
Unfortunately the concurrency/parallelism group has decided that they do not want C++ to provide types representing SIMD registers.
I would also like to know details of rejection.
I am using the Eigen library in my projects; internally it has its own abstraction around SIMD instructions and several backends for different instruction sets. Such high-level libraries would clearly benefit from some standard way to do SIMD operations.

However, I see one major drawback of a low-level SIMD interface in the ISO standard: it is not as future-proof as a higher-level API. SIMD instruction sets are expanding and becoming more complex, and I am not sure how a low-level library is supposed to catch future trends. For instance, there is the FMA instruction "d=a+b*c" - yes, your proposal has an appropriate fma function in <cmath>. But imagine that some new architecture had a "double FMA" instruction like "f=a+b*c+d*e", or an even more complex instruction such as "2x2 matrix multiplication". In order to support such new instructions, the low-level library would have to add new functions - i.e. wait for a new version of the ISO standard. And until a new version of the standard adopted these new functions, the low-level interface would not be competitive - users (developers of higher-level libraries) would again resort to compiler-specific intrinsics. With a higher-level interface (like Eigen's), only the internal implementation needs to be adjusted in order to get the benefits. -- Evgeny Panasyuk

20.04.2013 3:37, Mathias Gaunard:
And what is your point? Do you mean that we should rely on the auto-vectorizer? Quote from the proposal: "Autovectorizers have the ability to detect code fragments that can be vectorized. This automatic process finds its limits when the user code is not presenting a clear vectorizable pattern (i.e. complex data dependencies, non-contiguous memory accesses, aliasing or control flows). The SIMD code generation stays fragile and the resulting instruction flow may be suboptimal compared to an explicit vectorization." -- Evgeny Panasyuk

On Mon, Apr 22, 2013 at 6:32 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
I think the argument was that in one regard you point out that we cannot rely on the compiler to optimize the code, and in the other you suggest the opposite. Although I admit that the expression transform to FMA is simpler for the compiler to handle, I would still prefer to explicitly spell it out as a function call. In general, when writing SIMD code, I would prefer to spell out as much as possible and leave only the lowest-level optimizations to the compiler (such as instruction scheduling, register allocation and spilling, maybe CSE and DCE, things like that).

On Thu, 18 Apr 2013, Mathias Gaunard wrote:
Copying here my earlier comments so they are in the same place as others'. Some of them are only relevant for standardization, not for a Boost library.

Hello, a few comments while reading N3571.

- pack<T,N> seems more similar to std::array than std::tuple to me. We could even dream of merging pack and array into a single type. One reason existing std::array implementations (at least those I know) do not use vector registers is the ABI. An efficient implementation would pass function arguments in vector registers, but that would for instance make x86, mmx, sse, sse2 and avx five incompatible ABIs.

- As much as possible, I would like to avoid having a different interface for vectors and scalars. We have std::min for scalars; we can overload it for vectors instead of having simd::min. We have ?: for scalars; you don't need to restrict yourself to a pure library, you can use ?: for vectors as well instead of if_else, like OpenCL (g++-4.8 also implements that; see the sketch after this message).

- Why forbid & for logical? It doesn't hurt to make it equivalent to &&. For the logical class, you may want to consider sparc VIS as an example that doesn't use the same registers.

- Masking: it is a bit strange to be able to do pack<double> & int but not double & int. Currently in gcc we require that you (reinterpret) cast the pack<double> to a pack<some integer>, do the masking and go back. Not very important though.

- Any policy on reinterpret_cast-ing a pack to a pack of a different type?

- The description of some overloads of shuffle is hard to read: missing indices, missing F parameter.

- aligned_malloc doesn't exist. aligned_alloc is C11, and POSIX has posix_memalign. The closest in name is Microsoft's _aligned_malloc. N3396 might be relevant here.

- template <class T, std::size_t N = unspecified> struct alignas(sizeof(T) * N) pack: do you really want to specify that large an alignment? You give examples with N=100...

- Maybe operator[] const could return by value if it wants to?

- Since the splat constructor is implicit, you may not need to document all the mixed operations. Calling splat both the idea of copying a single element into all places and the idea of converting elementwise is confusing.

- Any notion of a subvector?

- For gather and others, the proposal accepts mixing vector sizes. However, for better performance, we will usually want to use vectors of the same size. Any convenient way, given a pack type, to ask for a signed integer pack type of the same size and number of elements?

- Reduction: people sometimes come up with proposals for a variadic min/max, which might interact in funny ways.

- cmath functions: it is not clear what signatures are supported, in particular for functions that use several types (ldexp has double and int). The list doesn't seem to exactly match cmath, actually. frexp takes an int*; does the vector version take a pointer to a vector, or some scatter-like vector?

- Traits: are those supposed to be template aliases? Or to derive from what they are supposed to "return"? Or have a typedef ... type; inside?

- For transform and accumulate, I have seen other proposals that specify new versions more in terms of permissions (what the compiler is allowed to do) and less in terms of implementation. Depending on the tag argument you pass to transform/accumulate, you give the compiler permission to reorder the operations, or do other transformations, and it then deduces that it can parallelize and/or vectorize. Looks nice. Note that it doesn't contradict this proposal; simd::transform can always forward to std::transform(..., vectorizable_tag()) (or in the reverse direction).

- constexpr, noexcept?

That's it for now :-) -- Marc Glisse
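Regarding the ?: point above, a sketch of what the GCC vector extensions allow (per Marc's note about g++-4.8; exact version support may vary):

    // GCC/Clang vector extensions: comparisons yield integer masks and,
    // in recent g++, ?: selects lane-wise, much like OpenCL's select.
    typedef float v4sf __attribute__((vector_size(16)));
    typedef int   v4si __attribute__((vector_size(16)));

    v4sf max4(v4sf a, v4sf b)
    {
        v4si mask = a > b;    // per-lane comparison: all-ones or all-zeros
        return mask ? a : b;  // per-lane selection
    }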

On 19/04/13 00:29, Marc Glisse wrote:
I wasn't subscribed to c++-lib-ext at the time, so I missed them.
It's statically-sized, so both runtime and compile-time access are possible. Compile-time access could be significantly more efficient.
We could even dream of merging pack and array into a single type.
I don't think that's a good idea: pack has very strong numerical semantics. You don't really want to do +, * or / on arrays. Plus pack<T, N> requires N to be a power of 2.
The idea is that any SIMD code should be valid scalar code as well. I'm not sure whether we want the selection for functions to use ADL or whether it should be std::min. I have no strong opinion on this.
We have ?: for scalars, you don't need to restrict yourself to a pure library
Overloading ?: would probably have to be a standalone language extension to the core language, no?
It was a wish from my colleague; I personally think it might be better to align the pack operators to be the same as the scalar equivalents, and therefore not allow it without a cast.
Any policy on reinterpret_cast-ing a pack to a pack of a different type?
In Boost.SIMD there is a bitwise_cast<To>(from) function, which is essentially the same as To to; memcpy(&to, &from, sizeof(from)); return to; This can be optimized to a reinterpret_cast in some cases, but reinterpret_cast itself is dangerous because of aliasing issues.
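A minimal sketch of such a function, assuming only what is described above:

    #include <cstring>

    template <class To, class From>
    To bitwise_cast(From const& from)
    {
        static_assert(sizeof(To) == sizeof(From),
                      "bitwise_cast requires types of equal size");
        To to;
        std::memcpy(&to, &from, sizeof(from));  // well-defined, no aliasing UB
        return to;
    }

    // e.g. reinterpreting a pack<float> as a same-sized integer pack:
    // auto bits = bitwise_cast<pack<std::int32_t>>(f);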
template <class T, std::size_t N = unspecified>
N can never be 100; it's a power of 2. It would be possible to relax the alignment requirement somewhat, however.
Maybe operator[] const could return by value if it wants to?
Yes.
Since the splat constructor is implicit, you may not need to document all the mixed operations.
It was to make things clearer, but I guess it might not be necessary.
Any notion of a subvector?
No, though it would probably be useful to be able to split any vector in two. Duly noted.
For gather and others, the proposal accepts mixing vector sizes.
For scatter/gather, it doesn't accept a number of indices different from the size of the vectors being loaded. Conversions are allowed to happen, however: you can load from a uint8* into a vector of int32.
We have some equivalents in Boost.SIMD, to make things simpler we use int32 for float and int64 for double, so that the sizes of vectors are the same.
Traits: are those supposed to be template aliases? Or to derive from what they are supposed to "return"? Or have a typedef ... type; inside?
They're metafunctions, i.e. classes with a type member typedef.
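For illustration, a hypothetical metafunction in that style (a class with a type member typedef); the real Boost.SIMD spellings may differ. Per the earlier note, float maps to int32 and double to int64 so vector sizes match:

    #include <cstddef>
    #include <cstdint>

    template <class T, std::size_t N> struct pack;  // as in N3571

    template <class T> struct as_integer;           // primary template
    template <> struct as_integer<float>  { typedef std::int32_t type; };
    template <> struct as_integer<double> { typedef std::int64_t type; };

    // Lift the mapping element-wise to packs, preserving the lane count.
    template <class T, std::size_t N>
    struct as_integer<pack<T, N>>
    {
        typedef pack<typename as_integer<T>::type, N> type;
    };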
simd::transform takes a function object that must be valid with both scalar and pack values; std::transform only requires that the function object be valid with scalar values. That's the main difference between the two functions.
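A sketch of what that requirement means for the caller; the usage line follows N3571's description, so treat the exact signature as an assumption:

    // A polymorphic function object: the same body must compile for both
    // scalar float and pack<float>, which operator overloading makes easy.
    struct scale_and_shift
    {
        template <class T>              // T is float or pack<float>
        T operator()(T const& x) const
        {
            return T(2) * x + T(1);     // the implicit splat constructor handles T(2), T(1)
        }
    };

    // usage sketch:
    // simd::transform(in.begin(), in.end(), out.begin(), scale_and_shift{});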

On Sat, 20 Apr 2013, Mathias Gaunard wrote:
Well, yes, it touches the core language, but there is a large difference between allowing one specific library type in ?: and letting users overload it as they please. Anyway, since standardization seems to be out, you can forget this comment. -- Marc Glisse

Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
It's pretty clear to me. A better approach is to add language constructs to help the compiler do the vectorization. It's 2013. We should be finished with requiring people to hand-vectorize code. Adding things like "restrict" and/or keywords like "concurrent," ways to disambiguate possible aliases, describe unknown loop dependences, etc. are going to be much more flexible and fruitful long-term than providing a library that is tied to a particular model of parallelism (and a narrow model of vectorization, BTW). Please look at what compilers from PGI, Intel, CAPS and yes, Cray, do to help users parallelize code. A reading of the pragma descriptions in the various compiler manuals would be informative. -David
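For reference, a sketch of the style David describes (OpenMP 4.0's #pragma omp simd; __restrict is a common compiler extension in C++, and vendor compilers offer similar pragmas such as ivdep):

    // The programmer asserts no aliasing and marks the loop vectorizable;
    // the compiler then picks the instructions for whatever ISA it targets.
    void saxpy(int n, float a, float const* __restrict x, float* __restrict y)
    {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }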
participants (7):
- Andrey Semashev
- dag@cray.com
- Evgeny Panasyuk
- jtl
- Marc Glisse
- Mathias Gaunard
- Tim Blechmann