[GSoC] SIMD Library

Hi everyone, I recently asked for documentation about the SIMD Library project here and I received some guides and reference links from Joel Falcou. Now I have some questions about the library itself. Is the goal of the project to build a library similar to libSIMDx86 but focusing more on the AltiVec instruction set? Joel also advised me to take a look at Proto. I read that Proto is used to build DSELs. Will Proto be used to map high-level classes and methods of the library to native SIMD instructions? What about the library itself? Which modules have to be developed? Thanks.

On 29/03/11 13:16, Faiçal Tchirou wrote:
Hi everyone, I recently asked for documentation about the SIMD Library project here and I received some guides and reference links from Joel Falcou. Now I have some questions about the library itself. Is the goal of the project to build a library similar to libSIMDx86 but focusing more on the AltiVec instruction set? Joel also advised me to take a look at Proto. I read that Proto is used to build DSELs. Will Proto be used to map high-level classes and methods of the library to native SIMD instructions? What about the library itself? Which modules have to be developed? Thanks.
The code is rather advanced in NT2, as stated in the proposal: the work is basically to make it work on non-Intel SIMD ISAs, make it Boost compliant in terms of code/doc and structure, and advance the integration with the standard library and other standard concepts like ranges, etc. The existing code base is already Proto based. Proto is mainly used to detect common SIMD compositions of intrinsics to be mapped onto better hardcoded intrinsics (like AltiVec mapping a+b*c onto vec_madd(b,c,a)). Other views are welcome of course.

On 29/03/2011 13:16, Faiçal Tchirou wrote:
Hi everyone, I recently asked for documentation about the SIMD Library project here and I received some guides and reference links from Joel Falcou. Now I have some questions about the library itself. Is the goal of the project to build a library similar to libSIMDx86 but focusing more on the AltiVec instruction set?
No. The library that will become Boost.SIMD already exists outside of Boost, as part of the NT2 project. It already has support for SSE and AVX, and basic support for AltiVec that needs to be improved.
Joel also advised me to take a look at Proto. I read that Proto is used to build DSELs. Will Proto be used to map high-level classes and methods of the library to native SIMD instructions? What about the library itself?
Proto is used to map combinations of operators or functions to specific SIMD-optimized functions. Examples:
a*b+c -> fma(a, b, c)
a+b*c -> fam(a, b, c)
a & ~b -> bitwise_andnot(a, b)
~a & b -> bitwise_notand(a, b)
etc. (this particular list has special SIMD instructions on SSE and/or AltiVec).
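To make the mechanism concrete, here is a minimal expression-template sketch of the same idea. This is a hypothetical illustration, not the actual Proto/NT2 code: `a*b` is kept as a lazy node so that the overload for "Mul + Vec" can route the whole `a*b+c` pattern to one fused operation, the way AltiVec maps it onto vec_madd.

```cpp
#include <cassert>

// A toy 4-wide vector type standing in for a SIMD register.
struct Vec { float v[4]; };

// Lazy node representing a*b; nothing is computed yet.
struct Mul { Vec a, b; };

inline Mul operator*(Vec a, Vec b) { return Mul{a, b}; }

// Generic element-wise addition.
inline Vec operator+(Vec a, Vec b) {
    Vec r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// Pattern a*b + c: a real backend would emit a single fused intrinsic here
// (vec_madd on AltiVec); this sketch just emulates it in scalar code.
inline Vec operator+(Mul m, Vec c) {
    Vec r;
    for (int i = 0; i < 4; ++i) r.v[i] = m.a.v[i] * m.b.v[i] + c.v[i];
    return r;
}

inline Vec fma_demo(Vec a, Vec b, Vec c) {
    return a * b + c;  // resolves to the fused overload at compile time
}
```

Proto generalizes this by matching arbitrary expression trees instead of hand-writing one overload per pattern.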
Which modules have to be developed?
Various things are possible:
- work on AltiVec support
- work on NEON support
- work on saturated arithmetic functions
- work on documentation
- work on tests
- work on benchmarks
- work on boostification
- work on other things I can't think of right now
Pick what you want and make it into a project proposal.

"Mathias Gaunard" <mathias.gaunard@ens-lyon.org> wrote in message news:imsgh3$90h$1@dough.gmane.org...
No. The library that will become Boost.SIMD already exists outside of Boost, as part of the NT2 project. It already has support for SSE and AVX, and basic support for AltiVec that needs to be improved.
Would it be possible to design Boost.SIMD so that it allows switching between different implementations/backends? For example, OS X already provides the Accelerate framework, so I might want to/need to use that, or I might want/need to use Intel's MKL, AMD's ACML-MV, or Framewave...?

I'm looking for a library to replace Framewave (which seems to have been dead for almost two years and might perhaps be useful for scavenging) for relatively simple 1D vector operations (multiplication, addition, trigonometric functions, log, exp...). If NT2 already supports those (SSE1 and optionally SSE2 versions) it would be a great place to start with/as an intermediate solution while waiting for Boost.SIMD...(?)

-- "What Huxley teaches is that in the age of advanced technology, spiritual devastation is more likely to come from an enemy with a smiling face than from one whose countenance exudes suspicion and hate." Neil Postman

On 29/03/2011 13:56, Domagoj Saric wrote:
"Mathias Gaunard" <mathias.gaunard@ens-lyon.org> wrote in message news:imsgh3$90h$1@dough.gmane.org...
No. The library that will become Boost.SIMD already exists outside of Boost, as part of the NT2 project. It already has support for SSE and AVX, and basic support for AltiVec that needs to be improved.
Would it be possible to design Boost.SIMD so that it allows switching between different implementations/backends. For example OS X already provides the Accelerate framework so I might want to/need to use that, or I might want/need to use Intel's MKL or AMD's ACML-MV or Framewave...?
I'm looking for a library to replace Framewave (which seems to be dead for almost two years and might perhaps be useful for scavenging), for relatively simple 1D vector operations (multiplication, addition, trigonometric functions, log, exp...), if NT2 already supports those (SSE1 and optionally SSE2 versions) it would be a great place to start with/as an intermediate solution while waiting for Boost.SIMD...(?)
Boost.SIMD does not have first-class support for 1D vectors. It only works with SIMD registers. It's up to you to allocate your memory (which you can do using the SIMD allocator), load/store SIMD registers from/to it (which you can do using the SIMD iterator adaptor), run or parallelize loops, etc. There aren't many solutions to implement most basic functions, so I don't really see the point of providing a backend that calls some other library instead of calling the couple of native instructions supported by the processor. As for MKL, it provides fast implementations of math functions. We are not certain yet whether those will be part of Boost.SIMD or not -- indeed, maybe Boost.SIMD should only contain 'trivial' arithmetic and bitwise functions. In any case, our implementations of the trigonometric and exponential math functions are faster and more accurate than those of MKL. There could be, however, wrappers on top of MKL in NT2. We already have that for several other libraries.
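The register-level workflow described above (allocate aligned memory, load a register, compute, store it back) looks roughly like this when written directly against SSE intrinsics. This is a plain-intrinsics sketch, not the Boost.SIMD API itself; the SIMD allocator and iterator adaptor mentioned above exist precisely to hide this boilerplate.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

// Multiply n floats by 10 in-place, 4 at a time. 'data' must be 16-byte
// aligned and n a multiple of 4; a SIMD allocator would guarantee the
// alignment, and an iterator adaptor would hide the load/store calls.
inline void scale_by_10(float* data, int n) {
    const __m128 ten = _mm_set1_ps(10.0f);
    for (int i = 0; i < n; i += 4) {
        __m128 r = _mm_load_ps(data + i);   // load one SIMD register
        r = _mm_mul_ps(r, ten);             // compute in the register
        _mm_store_ps(data + i, r);          // store it back
    }
}
```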

Would it be possible to design Boost.SIMD so that it allows switching between different implementations/backends. For example OS X already provides the Accelerate framework so I might want to/need to use that, or I might want/need to use Intel's MKL or AMD's ACML-MV or Framewave...?
I'm looking for a library to replace Framewave (which seems to be dead for almost two years and might perhaps be useful for scavenging), for relatively simple 1D vector operations (multiplication, addition, trigonometric functions, log, exp...), if NT2 already supports those (SSE1 and optionally SSE2 versions) it would be a great place to start with/as an intermediate solution while waiting for Boost.SIMD...(?)
Boost.SIMD does not have first-class support for 1D vectors. It only works with SIMD registers. It's up to you to allocate your memory (which you can do using the SIMD allocator), load/store SIMD registers from/to it (which you can do using the SIMD iterator adaptor), run or parallelize loops, etc.
I think this is very important, since most frameworks only provide vector functions, which are not composable.
As for MKL, it provides fast implementations of math functions. We are not certain yet whether those will be part of Boost.SIMD or not -- indeed, maybe Boost.SIMD should only contain 'trivial' arithmetic and bitwise functions. In any case, our implementations of the trigonometric and exponential math functions are faster and more accurate than those of MKL.
I would really appreciate it if a Boost.SIMD library would provide math functions! Implementing them is not really straightforward but gives a significant performance gain. They could be implemented on top of the 'trivial' arithmetic interface, so the implementation could be reused among different backends. cheers, tim -- tim@klingt.org http://tim.klingt.org Linux is like a wigwam: no windows, no gates, apache inside, stable.

On 29/03/11 14:29, Tim Blechmann wrote:
i think this is very important, since most frameworks only provide vector functions, but are not composable.
We have a set of allocators & adaptors to turn any contiguous range into a range processable as SIMD packs. Currently this works without any problem:

std::vector<float, simd::allocator<float> > v(150);
std::transform(simd::begin(v), simd::end(v), simd::begin(v), lambda::_1 * 10);

and outputs vectorized code. So for us it is better and more generic to have such adaptors than force-feeding some one-size-fits-all vector class. Discussions welcome of course ;) As for the composability, the Proto layer takes care of that.
I would really appreciate it if a Boost.SIMD library would provide math functions! Implementing them is not really straightforward but gives a significant performance gain. They could be implemented on top of the 'trivial' arithmetic interface, so the implementation could be reused among different backends.
The main question is: which? NT2 actually provides something like 250+ vectorized math functions, some rather obscure. I think we can have some basic ones and let people in the know (aka the Boost.Math authors) play with the basic functions to rebuild whatever fancy operations they need. We spent a lot of time on the trigonometrics, but I don't think we have the math background to do, for example, vectorized elliptic functions or such. Hence the idea of Boost.SIMD: to *enable* such writing in a simple and generic way instead of forcing stuff on people. Again YMMV. Oh, and currently all these functions are indeed ISA agnostic, as the algorithm is completely unaffected by the ISA itself. AltiVec wins a bit more because the inner Horner polynomial benefits from vec_madd while SSE does not.
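The Horner scheme mentioned above can be sketched at the scalar level; each step is one fused multiply-add, which is exactly the shape that vec_madd accelerates in the vectorized polynomial kernels (std::fma maps to a hardware FMA where one exists). This is an illustrative sketch, not NT2's actual kernel.

```cpp
#include <cmath>
#include <cassert>

// Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) by Horner's scheme.
// Each iteration is a single fused multiply-add: acc = acc*x + c[i].
inline double horner(const double* c, int n, double x) {
    double acc = c[n - 1];
    for (int i = n - 2; i >= 0; --i)
        acc = std::fma(acc, x, c[i]);  // fused: one rounding per step
    return acc;
}
```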

"Mathias Gaunard" <mathias.gaunard@ens-lyon.org> wrote in message news:imsimd$mei$1@dough.gmane.org...
On 29/03/2011 13:56, Domagoj Saric wrote: Boost.SIMD does not have first-class support for 1D vectors. It only works with SIMD registers. It's up to you to allocate your memory (which you can do using the SIMD allocator), load/store SIMD registers from/to it (which you can do using the SIMD iterator adaptor), run or parallelize loops, etc.
That's great too... actually I'd rather have it this way than only the higher-level vector functions as usually found in other libraries (especially if you cannot not have them, such as with the OS X Accelerate framework; plus it can prove to be more future proof, i.e. Apple updates it for new architectures and targets like iPad2, Intel AVX...). OTOH, also providing a higher-level interface (as Boost.Range and/or Boost.Math algorithm overloads or something similar) on top of the lower-level one afterwards would be great...
There aren't many solutions to implement most basic functions, so I don't really see the point of providing a backend that calls some other library instead of calling the couple of native instructions supported by the processor.
While staying at the lowest possible level yes, I agree. As soon as we depart from the 'metal' however, the amount of boilerplate increases and (mentioned) libraries can become useful... I'm not saying this is critical rather nice to have...
As for MKL, it provides fast implementations of math functions. We are not certain yet whether those will be part of Boost.SIMD or not -- indeed, maybe Boost.SIMD should only contain 'trivial' arithmetic and bitwise functions.
Hmm... well yes, if it is going to be called Boost.SIMD then it probably should be restricted to the low 'hardware abstraction' level... The higher-level vector and math functions could then be provided through/by Boost.Range, Boost.Math, Boost.uBLAS... I would certainly like to have both layers in Boost...
In any case, our implementations of the trigonometric and exponential math functions are faster and more accurate than those of MKL. There could be, however, wrappers on top of MKL in NT2. We already have that for several other libraries.
Could you perhaps help me with a few pointers on how to use/test NT2 (e.g. performing sincos on a 1D vector)? I checked out the latest revision from SVN, and downloaded the documentation for the official version 2, but it is in French which I unfortunately do not understand. The examples for version 2 that contain the cos() function also do not work, as they mention a nonexistent eve.hpp header... ps. does NT2 support SSE1 for single precision floats? Unfortunately this is a must for me (and probably for a lot of people, as there are quite a few Athlon XPs around) so I cannot use SSE2+ only libraries like Eigen... pps. if you are considering MKL you should probably also consider free alternatives like AMD's ACML and Apple's Accelerate framework... ppps. this project is great news... thanks for doing all the work, guys! ;)

"Domagoj Saric" <domagoj.saric@littleendian.com> wrote in message news:imspgl$297$1@dough.gmane.org...
That's great too...actually I'd rather have it this way than only the higher level vector functions as usually found in other libraries (especially if you cannot not have them, such as with the OS X Accelerate framework, plus it can prove to be more future proof...i.e. Apple updates it for new architectures and targets like iPad2, Intel AVX...)... OTOH also providing a higher level interface (as Boost.Range and/or Boost.Math algorithm overloads or something similar) top of the lower level one afterwards would be great...
Oops...sorry the "(especially if you cannot not have them,...." part was supposed to go into this paragraph: <<
While staying at the lowest possible level yes, I agree. As soon as we depart from the 'metal' however, the amount of boilerplate increases and (mentioned) libraries can become useful... I'm not saying this is critical rather nice to have...
after "...can become useful..."...

On 29/03/11 16:14, Domagoj Saric wrote:
That's great too...actually I'd rather have it this way than only the higher level vector functions as usually found in other libraries (especially if you cannot not have them, such as with the OS X Accelerate framework, plus it can prove to be more future proof...i.e. Apple updates it for new architectures and targets like iPad2, Intel AVX...)...
We have everything up to AVX.
OTOH also providing a higher level interface (as Boost.Range and/or Boost.Math algorithm overloads or something similar) top of the lower level one afterwards would be great...
We plan on looking at making some simd_range so you can pass them to boost::range::some_algo too. Definitely work for GSoC.
While staying at the lowest possible level yes, I agree. As soon as we depart from the 'metal' however, the amount of boilerplate increases and (mentioned) libraries can become useful... I'm not saying this is critical rather nice to have...
We have some such boilerplate already; we can discuss what makes it in and what doesn't.
Hmm... well yes, if it is going to be called Boost.SIMD then it probably should be restricted to the low 'hardware abstraction' level... The higher-level vector and math functions could then be provided through/by Boost.Range, Boost.Math, Boost.uBLAS... I would certainly like to have both layers in Boost...
My idea was to have Boost.SIMD as an infrastructure library for other higher-level ones to build upon.
Could you perhaps help me with a few pointers on how to use/test NT2 (e.g. performing sincos on a 1D vector)? I checked out the latest revision from SVN, and downloaded the documentation for the official version 2 but it is in French which I unfortunately do not understand. The examples for version 2 that contain the cos() function also do not work as they mention a nonexistent eve.hpp header...
We have to close the SF account. Everything is on GitHub atm ;) Docs are lacking so you may find it dense to use. Some basic docs and examples will be uploaded soon.
ps. does NT2 support SSE1 for single precision floats? Unfortunately this is a must for me (and probably for a lot of people as there are quite a few Athlon XP's around) so I cannot use SSE2+ only libraries like Eigen...
Eigen only supports double? Wow. Well, for SSEx we have functions for all integers, double and float.
pps. if you are considering MKL you should probably also consider free alternatives like AMD's ACML and Apple's Accelerate framework...
Something to look at for the benchmarks, yup.

"Joel Falcou" <joel.falcou@lri.fr> wrote in message news:4D91F96D.5000008@lri.fr...
Eigen only supports double? Wow. Well, for SSEx we have functions for all integers, double and float.
No, it supports float; it just does not provide an SSE1-only implementation (it uses plain x87/standard C++ for pre-SSE2 CPUs)... Speaking of which, what is the difference between NT2 and Eigen? ...and what do you think/hope/plan: would a foreseeable Boost.SIMD+Range+Math+uBLAS combination provide the Boost equivalent of Eigen, or is that outside the scope of Boost (or is it just too early to speculate)..?

On 30/03/11 14:56, Domagoj Saric wrote:
No, it supports float; it just does not provide an SSE1-only implementation (it uses plain x87/standard C++ for pre-SSE2 CPUs)...
OK so we can have a look if needed.
Speaking of which, what is the difference between NT2 and Eigen?
NT2 is more than linear algebra: it tries to be Boost/std compliant, is Proto based, and aims at being as close to Matlab as possible so the learning curve for scientists is not huge.
...and what do you think/hope/plan, would a foreseeable Boost.SIMD+Range+Math+uBLAS combination provide the Boost equivalent of Eigen or is that outside the scope of Boost (or it is just to early to speculate)..?
Oh, like Boost.NT2? Well, Dave asked me the same question at last year's BoostCon. We have diverging views on how the library should live, but we are open to bringing parts of NT2 into Boost when it makes sense (hence Boost.SIMD). NT2 should live its own life, but by being Boost compatible it should be usable next to it.

On 29/03/2011 16:14, Domagoj Saric wrote:
Could you perhaps help me with a few pointers on how to use/test NT2 (e.g. performing sincos on a 1D vector)? I checked out the latest revision from SVN, and downloaded the documentation for the official version 2 but it is in French which I unfortunately do not understand. The examples for version 2 that contain the cos() function also do not work as they mention a nonexistent eve.hpp header...
We've been using git for a while, the svn version is quite outdated. The git version has documentation purely in English, except it's almost non-existent. I had started recently writing documentation separate from the code, but I quickly realized this was a huge amount of work, so I am looking into doxygen-based solutions atm (either boostbook or Sphinx with doxylink). I'm not giving you any examples yet, because the high-level SIMD code is undergoing lots of changes recently, and things like sincos (which returns a tuple) are broken when you use the Proto-based interface :(. I'll come back to you shortly.
ps. does NT2 support SSE1 for single precision floats? Unfortunately this is a must for me (and probably for a lot of people as there are quite a few Athlon XP's around) so I cannot use SSE2+ only libraries like Eigen...
I'm afraid not, it only supports SSE2 and upwards. But it could be a nice summer project to add support for SSE1 only... But seriously, who targets that kind of old CPU? ;)

"Mathias Gaunard" <mathias.gaunard@ens-lyon.org> wrote in message news:imsv07$67b$1@dough.gmane.org...
I'm not giving you any examples yet, because the high-level SIMD code is undergoing lots of changes recently, and things like sincos (which returns a tuple) are broken when you use the Proto-based interface :(. I'll come back to you shortly.
No problem... thanks in advance ;) ps. I'd be willing to test/use a lower-level interface...
ps. does NT2 support SSE1 for single precision floats? Unfortunately this is a must for me (and probably for a lot of people as there are quite a few Athlon XP's around) so I cannot use SSE2+ only libraries like Eigen...
I'm afraid not, it only supports SSE2 and upwards. But it could be a nice summer project to add support for SSE1 only...
But seriously, who targets that kind of old CPU? ;)
Actually, I'd say most 'serious' commercial projects that do not by themselves require 64bit-era machines to be usable. The Athlon XP is a P4-era CPU that is/was more powerful and much cheaper than its Intel counterpart, and it unfortunately lacked SSE2 instructions. You can see here http://www.kvraudio.com/forum/viewtopic.php?p=4402655#4402655 one of our customers saying that he still has a working P3 setup... A lot of such machines are still around and ticking (often as secondary 24/7 under-the-desk/in-the-attic machines)... Plus, AFAIK SSE2 does not yield enough gain over SSE1 for many/most single precision operations to warrant its 'automatic usage'/exclusion of non-SSE2 CPUs... ps. does NT2 provide a way to choose/use multiple instruction sets? Important so that one can, for example, detect the level of SSE supported by the machine and switch to an appropriate routine...

On 30/03/2011 15:11, Domagoj Saric wrote:
ps. does NT2 provide a way to choose/use multiple instruction sets?
Not out of the box. What you could do is compile your code several times with different options and make each version into a separate DLL, then select the right DLL to load at runtime. Alternatively, you could also compile the code at runtime. We have things in the works for NT2 that can do that, which we use for OpenCL.
Important so that one can, for example, detect the level of SSE supported by the machine and switch to an appropriate routine...
Boost.SIMD does not have knowledge of high-level routines, and obviously you don't want to switch on each basic function...

On Wed, Mar 30, 2011 at 8:16 AM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 30/03/2011 15:11, Domagoj Saric wrote:
ps. does NT2 provide a way to choose/use multiple instruction sets?
Not out of the box. What you could do is compile your code several times with different options and make each version into a separate dll, then select the right dll to load at runtime.
It would be useful if the types chose their instruction set by policy.

template<typename Policy>
void go(float *f)
{
    typedef vec4<float, Policy> vec;
    ...
}

void (*go_ptr)(float*) = go<x87>;
if(supports_sse) go_ptr = go<sse2>;

It would let us handle our own multiple-instruction-sets-in-one-binary needs at whatever level we deem best. -- Cory Nelson http://int64.org
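A runnable version of that dispatch idea might look like the following. All names here are hypothetical (the policies are empty tags standing in for the real vec4 machinery, and the runtime CPU check is faked with a compile-time macro so the sketch stays portable).

```cpp
#include <cassert>

// Instruction-set policies as empty tags; a real vec4<float, Policy> would
// select intrinsics based on the tag.
struct x87 {};
struct sse2 {};

// One instantiation per policy; both just double each element in this sketch.
template<typename Policy>
void go(float* f, int n) {
    for (int i = 0; i < n; ++i) f[i] *= 2.0f;
}

// Pick an instantiation "at runtime" through a function pointer. A real
// dispatcher would query CPUID here instead of a preprocessor macro.
inline void (*select_go())(float*, int) {
#if defined(__SSE2__)
    return go<sse2>;
#else
    return go<x87>;
#endif
}
```

The point of the pattern is that both instantiations live in one binary, and the selection cost is paid once per function pointer, not per call.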

On Wed, 30 Mar 2011 16:46:30 -0700, Cory Nelson <phrosty@gmail.com> wrote:
It would be useful if the types chose their instruction set by policy.
template<typename Policy> void go(float *f) { typedef vec4<float, Policy> vec; ... }
void (*go_ptr)(float*) = go<x87>;
if(supports_sse) go_ptr = go<sse2>;
It would let us handle our own multiple-instruction-sets-in-one-binary needs at whatever level we deem best.
Can't be easily added at this level. However, the underlying functor system may be able to let you write:

simd::functor<some_func, sse2_> f2;
simd::functor<some_func, sse3_> f3;
boost::function<proper prototype> current_f = has_sse3 ? f3 : f2;

The main problem is that I am not sure it won't crap out at compile time due to improper intrinsics being used.

On Wed, Mar 30, 2011 at 10:31 PM, falcou <Joel.Falcou@lri.fr> wrote:
On Wed, 30 Mar 2011 16:46:30 -0700, Cory Nelson <phrosty@gmail.com> wrote:
It would be useful if the types chose their instruction set by policy.
template<typename Policy> void go(float *f) { typedef vec4<float, Policy> vec; ... }
void (*go_ptr)(float*) = go<x87>;
if(supports_sse) go_ptr = go<sse2>;
It would let us handle our own multiple-instruction-sets-in-one-binary needs at whatever level we deem best.
Can't be easily added at this level. However, the underlying functor system may be able to let you write:
simd::functor<some_func, sse2_> f2;
simd::functor<some_func, sse3_> f3;
boost::function<proper prototype> current_f = has_sse3 ? f3 : f2;
The main problem is that I am not sure it won't crap out at compile time due to improper intrinsics being used.
What would ever make intrinsics improper?

On 31/03/2011 11:47, Cory Nelson wrote:
What would ever make intrinsics improper?
GCC generates a compile-time error if you include the intrinsics files without the right option (-msse2 or similar) enabled with the compiler. For this reason, we took care of only including the files requested by the instruction set the user has chosen. Now then, I guess it would always be possible to cheat and redefine those things.
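The inclusion discipline described above looks like this in practice: guard both the include and the SIMD path on the compiler-defined macro, with a scalar fallback. A hedged sketch, not NT2's actual dispatch code.

```cpp
#include <cassert>

#ifdef __SSE2__
#include <emmintrin.h>  // only safe to include when -msse2 (or better) is on
#endif

// Add 4 floats pairwise; takes the SIMD path only when the compiler was
// allowed to emit SSE2, otherwise falls back to plain scalar code.
inline void add4(const float* a, const float* b, float* out) {
#ifdef __SSE2__
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
#endif
}
```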

"Mathias Gaunard" <mathias.gaunard@ens-lyon.org> wrote in message news:in1jp5$ugp$1@dough.gmane.org...
On 31/03/2011 11:47, Cory Nelson wrote:
What would ever make intrinsics improper?
GCC generates a compile-time error if you include the intrinsics files without the right option (-msse2 or similar) enabled with the compiler. For this reason, we took care of only including the files requested by the instruction set the user has chosen.
Yes, the GCC solution is not quite the ideal one for this problem...If I, for example, want to use SSE3 but only if it is available I obviously cannot pass -msse3 because it will use SSE3 code for the whole binary (which will then of course crash on SSE2 machines) but if I do not pass -msse3 then I do not get the intrinsics at all...The "what were they thinking" question does pop up...
Now then, I guess it would always be possible to cheat and redefine those things.
Is that possible (considering the macros are defined by the compiler)?

GCC generates a compile-time error if you include the intrinsics files without the right option (-msse2 or similar) enabled with the compiler. For this reason, we took care of only including the files requested by the instruction set the user has chosen.
Yes, the GCC solution is not quite the ideal one for this problem...If I, for example, want to use SSE3 but only if it is available I obviously cannot pass -msse3 because it will use SSE3 code for the whole binary (which will then of course crash on SSE2 machines) but if I do not pass -msse3 then I do not get the intrinsics at all...The "what were they thinking" question does pop up...
The usual approach to implementing runtime dispatching is to compile different source files for different instruction sets. tim

"Tim Blechmann" <tim@klingt.org> wrote in message news:in1nhu$jej$1@dough.gmane.org...
GCC generates a compile-time error if you include the intrinsics files without the right option (-msse2 or similar) enabled with the compiler. For this reason, we took care of only including the files requested by the instruction set the user has chosen.
Yes, the GCC solution is not quite the ideal one for this problem...If I, for example, want to use SSE3 but only if it is available I obviously cannot pass -msse3 because it will use SSE3 code for the whole binary (which will then of course crash on SSE2 machines) but if I do not pass -msse3 then I do not get the intrinsics at all...The "what were they thinking" question does pop up...
the usual approach of implementing a runtime-dispatching is to compile different source files for different instruction sets.
That's the usual approach with GCC-like compilers because they force you to do it that way... the MSVC approach (always giving you access to all intrinsics) works better in these situations (IMHO)...

"Mathias Gaunard" <mathias.gaunard@ens-lyon.org> wrote in message news:in1qcm$4f0$1@dough.gmane.org...
On 31/03/2011 12:54, Domagoj Saric wrote:
Is that possible (considering the macros are defined by the compiler)?
I don't see any reason why you couldn't do
#ifndef __SSE2__
#define __SSE2__
#endif
but that's a ugly hack at best.
Of course, silly me, you only have to add/define, not subtract/undefine macros... ps. OT: please help a total Git newbie. I tried checking out/cloning the NT2 repository with the latest TortoiseGit, however it fails with "PuTTY Fatal Error: Disconnected: No Supported authentication methods available (server sent: publickey)" after issuing the "git.exe clone --progress -v "git@github.com:MetaScale/nt2.git"" command... (OTOH I was able to successfully clone a different Git-based library that is hosted on SourceForge...)

On 31/03/11 15:43, Domagoj Saric wrote:
ps. OT: please help a total Git newbie, I tried checking out/cloning the NT2 repository with the latest TortoiseGit however it fails with "PuTTY Fatal Error: Disconnected: No Supported authentication methods available (server sent: publickey)" after issuing the "git.exe clone --progress -v "git@github.com:MetaScale/nt2.git"" command... (OTOH I was able to successfully clone a different Git based library that is hosted on Sourceforge...)
This URL is the private git; try using the https one.

"Joel Falcou" <joel.falcou@lri.fr> wrote in message news:4D94886B.5040908@lri.fr...
On 31/03/11 15:43, Domagoj Saric wrote:
ps. OT: please help a total Git newbie, I tried checking out/cloning the NT2 repository with the latest TortoiseGit however it fails with "PuTTY Fatal Error: Disconnected: No Supported authentication methods available (server sent: publickey)" after issuing the "git.exe clone --progress -v "git@github.com:MetaScale/nt2.git"" command... (OTOH I was able to successfully clone a different Git based library that is hosted on Sourceforge...)
This URL is the private git; try using the https one.
Tried that first, but it failed with the "...error setting certificate..." message... Googled that one now and it seems it is some msysgit bug; editing the gitconfig file with a double-backslashed full certificate path fixes/works around it... thanks ;)

We'll give a shot at enabling native<T,sse3_> and so forth and see if it doesn't break something silly. For pack, however, such stuff can't be done, as we map the type/cardinal pair over the underlying types and take the best extension for the computation.

On Thu, Mar 31, 2011 at 3:07 AM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 31/03/2011 11:47, Cory Nelson wrote:
What would ever make intrinsics improper?
GCC generates a compile-time error if you include the intrinsics files without the right option (-msse2 or similar) enabled with the compiler. For this reason, we took care of only including the files requested by the instruction set the user has chosen.
Now then, I guess it would always be possible to cheat and redefine those things.
Is there a way to detect those compile options? I think it would be acceptable to just disable any support for them if -msse etc. aren't given, and put a warning in the docs explaining it. Does -msse/etc. only enable those intrinsics, or does GCC actually try to use them during code gen for plain C++? -- Cory Nelson http://int64.org

On 01/04/11 01:12, Cory Nelson wrote:
On Thu, Mar 31, 2011 at 3:07 AM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 31/03/2011 11:47, Cory Nelson wrote:
What would ever make intrinsics improper?
GCC generates a compile-time error if you include the intrinsics files without the right option (-msse2 or similar) enabled with the compiler. For this reason, we took care of only including the files requested by the instruction set the user has chosen.
Now then, I guess it would always be possible to cheat and redefine those things.
Is there a way to detect those compile options?
-msseX defines __SSEX__, and this macro is checked in the intrinsic files, so you can't include the sseX intrinsics if -msseX was not set.
I think it would be acceptable to just disable any support for them if -msse etc. aren't given, and put a warning in the docs explaining it. Does -msse/etc. only enable those intrinsics, or does GCC actually try to use them during code gen for plain C++?
We have to explore this. After giving it some thought, a non-binary level of cross-instruction-set support is indeed valuable.

On 01/04/2011 09:20, Joel Falcou wrote:
Is there a way to detect those compile options?
Yes, but the equivalent options don't exist on MSVC. The problem is that the two compilers have radically different ways to deal with this.
We have to explore this. After giving it some thought, a non-binary level of cross-instruction-set support is indeed valuable.
I tested it, and it doesn't work. If -msse2 etc. are not set, the __builtin_ia32 family of functions, which are used by emmintrin.h and other headers to implement the _mm_ family of functions, don't exist.

On Fri, Apr 1, 2011 at 1:01 AM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 01/04/2011 09:20, Joel Falcou wrote:
Is there a way to detect those compile options?
Yes, but the equivalent options don't exist on MSVC.
The problem is that the two compilers have radically different ways to deal with this.
SIMD in general is a tough problem, especially if you're aiming for the user to write generic SIMD code without having to worry about it compiling down to SSE, AVX, NEON, etc.! I think that will be impractical if you want good performance, but here's how I'd imagine a try would look:

template<typename InstructionSet>
struct kernel
{
    void operator()(float const *f)
    {
        typedef simd::vec<4, InstructionSet> vec4;
        ...
    }
};

std::function<void(float const*)> func(simd::generate<kernel>());

Where generate() instantiates kernel for all available instruction sets on VC++, using cpuid at runtime to pick which one to return, and probably a single instruction set on GCC. VC++ is almost never targeting the compile machine, and it is very common for SIMD-optimized apps on Windows to use cpuid to select code paths at runtime. I don't see any other way short of separate binaries, which would be distribution (and probably compile) hell. Note this would also let the user specialize the kernel for different instruction sets if they wanted to.

Like I said before, using one generic algorithm for multiple instruction sets is probably not going to give anywhere near the performance of hand-written intrinsics, though it might still be faster than plain C++. It would provide an optimization point.

A problem with this design is that a lot of times an algorithm will only need SSE2 and not use any new instructions when you instantiate it with SSSE3. I'm not certain how one would cleanly solve this. Maybe an optional mpl::map for generate() that maps things like ssse3->sse2?

-- Cory Nelson http://int64.org

"Mathias Gaunard" <mathias.gaunard@ens-lyon.org> wrote in message news:imvhgr$a19$1@dough.gmane.org...
On 30/03/2011 15:11, Domagoj Saric wrote:
ps. does NT2 provide a way to choose/use multiple instruction sets?
Not out of the box. What you could do is compile your code several times with different options and make each version into a separate dll, then select the right dll to load at runtime.
Alternatively you could also compile the code at runtime.
We have things that can do that in the work for NT2 that we use for OpenCL.
Yes, but 'runtime compilation' and/or separate DLLs are not always the ideal (or possible) solution... If you could provide something along the lines of what Cory Nelson suggested, that would be great :) Otherwise we are left with the awkwardness of making a header template that is included in different .cpp files compiled with different preprocessor directives...
Important so that one can, for example, detect the level of SSE supported by the machine and switch to an appropriate routine...
Boost.SIMD does not have knowledge of high-level routines, and obviously you don't want to switch on each basic function...
Of course, I was thinking of higher level routines (whether provided by the library or by me/the user)...

On 29/03/11 16:14, Domagoj Saric wrote:
ps. does NT2 support SSE1 for single precision floats? Unfortunately this is a must for me (and probably for a lot of people as there are quite a few Athlon XP's around) so I cannot use SSE2+ only libraries like Eigen...
Correction to my reply here. After checking with my coworker, it seems we don't have strict SSE1 bindings in which SSE2 is not used. So I'll put it on our todo list.

My experience is mostly with MMX and integer SSE2. A useful approach I've used in the past was to create type-safe wrappers for the various intrinsics. These mostly took the form of overloaded inline functions, though I used templates whenever an immediate integer operand was required. These overloads enabled me to write higher-level templates that supported multiple vector types, even if they were sometimes machine-specific. The templates enabled optimizations of degenerate & special cases.

Even though this isn't as sophisticated as what Proto can do, I think it will be useful to have a fall-back for cases where there are either specialized instructions that aren't easily expressible as expressions, or other cases where it's difficult to get Proto to generate the instruction sequence you want. Besides type-safety, the wrappers make the code much more readable than using the native intrinsics.

A few functions and templates implemented idioms and tricks for doing common tasks, like loading a vector with zeros (hint: xor anything with itself) or filling the vector with copies of a single value (I think this is called splat, on AltiVec). Another reason to do this is to take advantage of architecture-specific optimizations. Some of these generics were:

template< typename V > V zero();                           // generates a vector of 0's
template< typename V > V full_mask();                      // sets all bits to 1
template< int i, typename V > T get_element( V );
template< int i, typename V > V set_element( V, T );
template< int i, typename V, typename T > V set_element_general( V, T );
template< int i, int j, ... > V shuffle( V );              // rearranges the elements in V
template< int n, typename T > V load( T * );               // loads n lowest elements of T[]
template< int n, typename T > void store( V, T * );        // stores n lowest elements of V
template< typename V > void store_uncached( V, V * );      // avoids cache pollution
template< typename T, typename V > T horizontal_sum( V );  // sum of all elements in V

I'm also a fan of having a set of common, optimized 1-D operations, such as buffer packing/interleaving & unpacking/deinterleaving, extract/insert columns, convolution, dot-product, SAD, FFT, etc. Keep it low-level, though. IMO, any sort of high-level abstraction that ships data off to different accelerator back-ends, like GPUs, is a different animal and should go in a different library.

Matt

On 30/03/11 08:04, Gruenke, Matt wrote:
My experience is mostly with MMX and integer SSE2. A useful approach I've used in the past was to create type-safe wrappers for the various intrinsics. These mostly took the form of overloaded inline functions, though I used templates whenever an immediate integer operand was required. These overloads enabled me to write higher-level templates that supported multiple vector types, even if they were sometimes machine-specific. The templates enabled optimizations of degenerate& special cases. Even though this isn't as sophisticated as what proto can do, I think it will be useful to have a fall-back for cases where there are either specialized instructions that aren't easily expressible as expressions or other cases where it's difficult to get proto to generate the instruction sequence you want. Besides type-safety, the wrappers make the code much more readable than using the native intrinsics.
We provide a native<Type,Extension> POD class, not using Proto, for such low-level needs. It is what makes the pack Proto terminal's data. So basically: if you don't want to think about the extension, use pack; if you need precise low-level stuff done, use native. On an x86 machine, pack<float> wraps a native<float, tag::sse_>.
A few functions and templates implemented idioms and tricks for doing common tasks, like loading a vector with zeros (hint: xor anything with itself)
Except on AltiVec, where vec_splat_u8(0) is faster :p
template< typename V> V zero(); // generates a vector of 0's template< typename V> V full_mask(); // sets all bits to 1
We have constant generators for this and untyped Proto-based constant placeholders, so: pack<float> p(1,2,3,4), q = p * pi_; does what you think it does :p
template< int i, typename V> T get_element( V ); template< int i, typename V> V set_element( V, T );
get_element is operator[] on pack and native. How do you do set_element? Every solution I found was either UB or slow. Looking at your prototype, I guess you replicate V and change the element in a memory buffer? We ended up enforcing the fact that SIMD vectors are immutable at the elementwise level. SSE4.x provides extract/insert instructions that we're providing as free functions, IIRC.
template< int i, int j, ...> V shuffle( V ); // rearranges the elements in V.
We have that, but we still ponder whether it should be p.shuffle(mask), mapped over fusion::nview, or something fancy like p.xzyw(). Note that on SSE2, shuffle is not permute and is not provided for all types. The really powerful function is AltiVec's permute, but it is harder to find a proper abstraction for it.
template< int n, typename T> V load( T * ); // loads n lowest elements of T[] template< int n, typename T> void store( V, T * ); // stores n lowest elements of V
load, store, and splat are done. load and store are also exposed as iterators for pack and native if needed.
template< typename V> void store_uncached( V, V * ); // avoids cache pollution
Does it make any real difference? All tests I ran gave me a minimal amount of speed-up. I'm curious to hear about your experience, and will add it if needed.
template< typename T, typename V> T horizontal_sum( V ); // sum of all elements in V
float k = sum(p) :p It uses add + shuffle on SSE < 3, hadd otherwise, and vec_sum on AltiVec. Note that if you want to sum the elements of a 1D vector, accumulating with + on pack works out of the box and returns you a pack. So pack<float> s = std::accumulate(simd::begin(v), simd::end(v), zero_); float r = sum(s); gives you the sum of a 1D vector in two lines.
I'm also a fan of having a set of common, optimized 1-D operations, such as buffer packing/interleaving& unpacking/deinterleaving, extract/insert columns, convolution, dot-product, SAD, FFT, etc.
Some are actually functions working at the pack level; std::fold or transform gets you the 1D version. Some make little sense: FFT in SIMD in 1D is not low-level for me, and out of my league atm.
Keep it low-level, though. IMO, any sort of high-level abstraction that ships data off to different accelerator back-ends, like GPUs, is a different animal and should go in a different library.
That's the goal of NT2 as a whole.

On Wednesday, March 30, 2011 02:35, Joel Falcou wrote:
On 30/03/11 08:04, Gruenke, Matt wrote:
[snip]
template< int i, typename V> T get_element( V ); template< int i, typename V> V set_element( V, T );
get_element is operator[] on pack and native. How do you do set_element? Every solution I found was either UB or slow.
I used shuffle, where possible. I think it's only supported for 16-bit elements or larger, on MMX/SSE2. I don't remember if I implemented it using shift, mask, and OR for 8-bit, or if I just left it undefined for 8-bit.
By looking at your prototype I guess you replicate V and change the element in a memory buffer ?
I'm pretty sure I avoided memory for just about everything but initialization. I even went as far as circumventing the normal register copy instruction, where possible, which was strangely slow on P4's.
The real powerful function is Altivec permute but it is harder to find a proper abstraction of it.
Perhaps you can at least think of a way to use static assert to enforce its inherent limitations. If permute's limitations are as the name suggests, then you can use the element indices to set bits in a vector and assert that all bits have been set. But maybe the compiler already does that for you.
template< typename V > void store_uncached( V, V * ); // avoids cache pollution
Does it make any real difference ? All tests I ran gave me minimal amount of speed-up. I'm curious to hear your experience and add it if needed.
Well, it's all about context. It doesn't make your writes faster. In fact, small bursts will actually be slower. However, if you're protecting something else in cache, then it can definitely pay off. It should also improve hyperthreading performance (again, assuming you're not going to read the written data for a while).
I'm also a fan of having a set of common, optimized 1-D operations, such as buffer packing/interleaving& unpacking/deinterleaving, extract/insert columns, convolution, dot-product, SAD, FFT, etc.
Some are actually functions working at the pack level. std::fold or transform gets you the 1D version. Some make little sense.
Often, I find the need to do things like de-interleave a scanline or tile of data, do some processing on the channels, and then re-interleave it. Processing at this granularity usually allows everything to stay in L1 cache. Efficient transpose (or at least extracting a batch of columns into horizontal buffers) is also very important.
Keep it low-level, though. IMO, any sort of high-level abstraction that ships data off to different accelerator back-ends, like GPUs, is a different animal and should go in a different library.
That's the goal of NT2 as a whole.
That's a fine thing to do - just not something I want mixed into my SIMD library. Since this is all about performance, whatever I use needs to give me the option to drop down to the next lower level if I find it necessary to get more performance in some hot spots. Thank you for the work you're doing on this. I look forward to seeing more. Matt

On 12/04/2011 12:13, Gruenke, Matt wrote:
Efficient transpose (or at least extracting a batch of columns into horizontal buffers) is also very important.
I've been looking into how to adapt forward iterators of tuples into a tuple of iterators to memory contiguous by chunks, if that's what you mean. The intent is that if you get data from a database for example, where you get the data row per row, you can then treat it with SIMD column per column.

On Tue 4/12/2011 2:17 PM, Mathias Gaunard wrote:
On 12/04/2011 12:13, Gruenke, Matt wrote:
Efficient transpose (or at least extracting a batch of columns into horizontal buffers) is also very important.
I've been looking into how to adapt forward iterators of tuples into a tuple of iterators to memory contiguous by chunks, if that's what you mean.
Well, I see how this could be handled using iterators that maintain proxy objects. The challenges would be keeping the overhead low enough and finding the right abstraction boundary. For instance, if my goal is to rearrange data to make it contiguous on the axis over which I want to operate, I don't want you to hide that from me. However, it would be nice if I could use the same interface for constructing a contiguous proxy from different input formats and datatypes. Matt
participants (8)
-
Cory Nelson
-
Domagoj Saric
-
Faiçal Tchirou
-
falcou
-
Gruenke, Matt
-
Joel Falcou
-
Mathias Gaunard
-
Tim Blechmann