Back to Boost.SIMD - Some performance figures ...

I'm still working on a potential Boost.SIMD proposal despite the apparent lack of interest from the list. The last discussion suggested that actual performance figures might be interesting, so here are some (see end of mail). The table shows, for a subset of non-trivial functions, the cycles needed to compute one value in scalar code and using SSE2, together with some precision figures and the resulting speed-up. Most speed-ups are super-linear because either: 1/ the libc algorithm is badly implemented, or 2/ non-SIMD architectural differences between the SSE2 FPU and the scalar FPU lead to additional speed-up.

Most transcendental functions use a SIMD version of the old yet useful Cephes C library, based on various polynomial approximations. Results on AltiVec processors are roughly the same, except for the transcendentals, where the use of a proper FMA instead of a sequence of mul-add increases performance.

For trivial functions like +, -, *, /, I was happily surprised that gcc is indeed able to generate SIMD code. Alas, the gcc auto-vectorization speed-up never exceeds 2.54, while our code can go up to 3.5.

Concerning the problem of the interface and the support of odd-ball vector sizes in a platform-independent fashion, we took up Matthias's remark and provide a vec<T,C> class in which the vector cardinal can be specified (it defaults to the native cardinal of the given type). Things like vec<double,5> are handled as boost::array and provide the same interface and set of functions. Any given function can be applied either to any vec<T,C> type or to any native SIMD type (__m128 in SSEx or vector xxx in AltiVec). Syntactic sugar like v = v+4 is provided and performs constant splatting before the SIMD evaluation.

This still has to be boostified and made independent of the whole project it depends on. Once done, a preliminary version will be uploaded to the Vault. Current target architectures are:
- SSE2, SSE3, SSSE3
- AltiVec for PPC and the Cell processor (a patched version of Boost is needed)

Comments and questions welcome.
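To make the interface concrete, here is a minimal sketch of what the vec<T,C> class described above could look like (illustrative only; the member layout and the operator are my assumptions, not the actual Boost.SIMD code):

    #include <cstddef>
    #include <boost/array.hpp>
    #include <emmintrin.h>  // SSE2 intrinsics

    // Primary template: odd-ball cardinals (e.g. vec<double,5>) fall back
    // to boost::array storage and get the same interface and functions.
    template <typename T, std::size_t C>
    struct vec
    {
        boost::array<T, C> data;
    };

    // Specialization for the native cardinal: wraps the SSE2 register directly.
    template <>
    struct vec<float, 4>
    {
        __m128 data;
    };

    // Syntactic sugar: v + 4 splats the scalar into every lane, then issues
    // a single SIMD addition.
    inline vec<float, 4> operator+(vec<float, 4> const& v, float s)
    {
        vec<float, 4> r;
        r.data = _mm_add_ps(v.data, _mm_set1_ps(s));
        return r;
    }

With such a layout, vec<float,4> compiles down to raw SSE2 instructions while vec<double,5> simply inherits the boost::array behaviour.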
|| -------------- || ----- || scalar || --------------- vector --------------- || ---- ||
|| function       || type  || cycles || cycles | ulp |    rms   |    peak      || s-up ||
|| ------------------------------------------------------------------------------------ ||
|| abs_           || float ||   2.0  ||   0.8  |   0 | 0.00e+00 | 0.00e+00     ||  2.4 ||
|| acosh_         || float || 148.2  ||  30.2  |   1 | 9.15e-09 | 1.19e-07     ||  4.9 ||
|| acos_          || float || 261.8  ||  14.7  |   3 | 7.01e-08 | 2.38e-07     || 17.8 ||
|| arg_           || float ||   5.0  ||   1.2  |   0 | 0.00e+00 | 0.00e+00     ||  4.2 ||
|| asinh_         || float || 152.8  ||  32.4  |   1 | 1.22e-08 | 1.19e-07     ||  4.7 ||
|| asin_          || float || 256.5  ||  11.6  |   2 | 5.32e-08 | 2.28e-07     || 22.1 ||
|| atanh_         || float || 123.9  ||  20.4  |   2 | 2.27e-08 | 4.55e-07     ||  6.1 ||
|| atan_          || float || 160.7  ||  12.7  |   1 | 3.55e-08 | 6.74e-08     || 12.7 ||
|| bitofsign_     || float ||   5.1  ||   0.8  |   0 | 0.00e+00 | 0.00e+00     ||  6.1 ||
|| boolean_       || float ||   5.4  ||   1.0  |   0 | 0.00e+00 | 0.00e+00     ||  5.4 ||
|| cbrt_          || float || 152.5  ||  39.7  |   1 | 2.76e-08 | 7.77e-08     ||  3.8 ||
|| ceil_          || float ||  16.6  ||   2.8  |   0 | 0.00e+00 | 0.00e+00     ||  5.9 ||
|| cosh_          || float || 211.5  ||  19.1  |   2 | 4.00e-08 | 1.83e-07     || 11.1 ||
|| cos_           || float || 112.2  ||  14.6  |   1 | 2.98e-08 | 1.11e-07     ||  7.7 ||
|| cospi_         || float || 103.6  ||  12.1  |   1 | 3.43e-08 | 1.19e-07     ||  8.6 ||
|| cot_           || float || 142.8  ||  17.8  |   3 | 5.54e-08 | 2.38e-07     ||  8.0 ||
|| cotpi_         || float || 142.1  ||  17.1  |   6 | 9.62e-08 | 4.08e-07     ||  8.3 ||
|| exp10_         || float || 169.3  ||  32.1  |   1 | 2.88e-08 | 1.19e-07     ||  5.3 ||
|| exp_           || float || 171.3  ||  19.3  |   1 | 2.60e-08 | 1.19e-07     ||  8.9 ||
|| expm1_         || float || 294.1  ||  42.6  |   3 | 2.89e-08 | 1.94e-07     ||  6.9 ||
|| floor_         || float ||  16.9  ||   2.7  |   0 | 0.00e+00 | 0.00e+00     ||  6.4 ||
|| gd_            || float || 602.4  ||  35.8  |   3 | 3.93e-08 | 2.46e-07     || 16.8 ||
|| indeg_         || float ||   2.2  ||   0.8  |   0 | 2.59e-08 | 5.94e-08     ||  2.7 ||
|| inrad_         || float ||   2.2  ||   0.8  |   0 | 2.53e-08 | 5.94e-08     ||  2.6 ||
|| iseqz_         || float ||   5.5  ||   0.9  |   0 | 0.00e+00 | 0.00e+00     ||  6.3 ||
|| iseven_        || float ||  45.0  ||   2.3  |   0 | 0.00e+00 | 0.00e+00     || 19.9 ||
|| isfin_         || float ||   5.9  ||   0.9  |   0 | 0.00e+00 | 0.00e+00     ||  6.6 ||
|| isflint_       || float ||  38.3  ||   1.8  |   0 | 0.00e+00 | 0.00e+00     || 21.4 ||
|| isgez_         || float ||   6.0  ||   0.8  |   0 | 0.00e+00 | 0.00e+00     ||  7.4 ||
|| isgtz_         || float ||   6.0  ||   0.9  |   0 | 0.00e+00 | 0.00e+00     ||  6.5 ||
|| isinf_         || float ||   6.0  ||   0.9  |   0 | 0.00e+00 | 0.00e+00     ||  6.5 ||
|| islez_         || float ||   5.0  ||   0.9  |   0 | 0.00e+00 | 0.00e+00     ||  5.8 ||
|| isltz_         || float ||   5.0  ||   0.9  |   0 | 0.00e+00 | 0.00e+00     ||  5.8 ||
|| isnan_         || float ||   3.0  ||   0.8  |   0 | 0.00e+00 | 0.00e+00     ||  3.6 ||
|| isnegative_    || float ||   5.6  ||   1.1  |   0 | 0.00e+00 | 0.00e+00     ||  5.0 ||
|| isnez_         || float ||   5.4  ||   0.9  |   0 | 0.00e+00 | 0.00e+00     ||  6.2 ||
|| isnotfinite_   || float ||   3.0  ||   0.8  |   0 | 0.00e+00 | 0.00e+00     ||  3.6 ||
|| isodd_         || float ||  47.8  ||   2.5  |   0 | 0.00e+00 | 0.00e+00     || 18.8 ||
|| ispositive_    || float ||   5.6  ||   1.3  |   0 | 0.00e+00 | 0.00e+00     ||  4.2 ||
|| log10abs_      || float || 107.5  ||  17.3  |   2 | 6.59e-08 | 2.12e-07     ||  6.2 ||
|| log10_         || float || 105.4  ||  16.9  |   2 | 6.58e-08 | 2.12e-07     ||  6.2 ||
|| log1p_         || float || 149.7  ||  18.8  |   1 | 9.16e-09 | 1.19e-07     ||  8.0 ||
|| log2abs_       || float || 107.5  ||  17.2  |   1 | 1.89e-08 | 1.19e-07     ||  6.3 ||
|| log2_          || float || 105.5  ||  17.1  |   4 | 1.90e-08 | 2.51e-07     ||  6.2 ||
|| logabs_        || float || 107.5  ||  23.6  |   1 | 4.36e-08 | 1.19e-07     ||  4.6 ||
|| log_           || float || 108.3  ||  15.4  |   1 | 9.12e-09 | 1.19e-07     ||  7.0 ||
|| mantissa_      || float ||  22.1  ||   5.0  |   0 | 0.00e+00 | 0.00e+00     ||  4.4 ||
|| oneminus_      || float ||   3.3  ||   0.8  |   0 | 5.16e-10 | 5.96e-08     ||  4.0 ||
|| oneplus_       || float ||   3.4  ||   0.8  |   0 | 2.87e-10 | 5.74e-08     ||  4.1 ||
|| rec_           || float ||  37.4  ||   4.3  |   0 | 2.48e-08 | 5.92e-08     ||  8.8 ||
|| round_         || float ||  43.7  ||   5.5  |   0 | 0.00e+00 | 0.00e+00     ||  8.0 ||
|| rsqrt_         || float || 105.4  ||  11.3  |   1 | 3.62e-08 | 8.81e-08     ||  9.3 ||
|| signedbool_    || float ||   9.3  ||   0.9  |   0 | nan      | 0.00e+00     || 10.6 ||
|| sign_          || float ||  12.1  ||   2.5  |   0 | 0.00e+00 | 0.00e+00     ||  4.9 ||
|| signnz_        || float ||  12.1  ||   1.4  |   0 | 0.00e+00 | 0.00e+00     ||  8.5 ||
|| sinh_          || float || 267.2  ||  19.1  |   3 | 2.38e-07 | 3.84e-07     || 14.0 ||
|| sin_           || float || 110.5  ||  17.0  |   1 | 2.98e-08 | 1.12e-07     ||  6.5 ||
|| sinpi_         || float || 115.8  ||  14.4  |   1 | 2.98e-08 | 1.10e-07     ||  8.0 ||
|| sqr_           || float ||   2.0  ||   0.7  |   0 | 2.53e-08 | 5.94e-08     ||  2.9 ||
|| sqrt_          || float ||  68.3  ||   7.1  |   0 | 2.62e-08 | 5.96e-08     ||  9.6 ||
|| sqrtabs_       || float ||  68.3  ||   7.0  |   0 | 2.61e-08 | 5.96e-08     ||  9.7 ||
|| tanh_          || float || 206.4  ||  20.8  | 197 | 1.63e-07 | 1.77e-05     ||  9.9 ||
|| tan_           || float || 153.0  ||  17.8  |   2 | 4.18e-08 | 1.45e-07     ||  8.6 ||
|| tanpi_         || float || 156.2  ||  18.0  |   2 | 4.16e-08 | 1.48e-07     ||  8.7 ||
|| ------------------------------------------------------------------------------------ ||
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

On Thu, Mar 26, 2009 at 12:13 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
I'm still working on a potential Boost.SIMD proposal despite the apparent lack of interest from the list.
I can't remember if I expressed interest last time this was brought up, but I *am* certainly interested. Will you be able to compare this against Intel's Math Kernel Library or AMD's Core Math Library? --Michael Fawcett

Michael Fawcett wrote:
Will you be able to compare this against Intel's Math Kernel Library or AMD's Core Math Library?
Well, I don't think those libraries provide this kind of facility. I thought MKL was Intel's homebrew LAPACK? I guess I can install some of them (at least Intel's, as I don't have an AMD machine over there) and see what can be compared in a proper fashion. Currently, a GEMM version written in C using our madd function is on par with the SSE2 ATLAS version to within roughly 2-3%.
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

On Thu, Mar 26, 2009 at 12:29 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
Michael Fawcett wrote:
Will you be able to compare this against Intel's Math Kernel Library or AMD's Core Math Library?
Well, I don't think those libraries provide this kind of facility. I thought MKL was Intel's homebrew LAPACK?
Both libraries provide at least a subset of the functionality in your benchmarks. http://www.intel.com/cd/software/products/asmo-na/eng/266863.htm http://developer.amd.com/cpu/Libraries/acml/downloads/pages/default.aspx#doc... (use the "ACML User Guide/FFT Documentation" link which goes to a pdf)
I guess I can install some of them (at least Intel's, as I don't have an AMD machine over there) and see what can be compared in a proper fashion.
ACML works on Intel processors as well, but is not optimal.
Currently, a GEMM version written in C using our madd function is on par with the SSE2 ATLAS version to within roughly 2-3%.
Great to hear! --Michael Fawcett

Michael Fawcett wrote:
Both libraries provide at least a subset of the functionality in your benchmarks.
OK
ACML works on Intel processors as well, but is not optimal.
OK but if I want to be fair, I'll need an AMD ;)
Great to hear!
Well, the performance is due more to proper loop blocking and interchange than to using SIMD ;)
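To illustrate, a minimal blocked-and-interchanged GEMM loop nest in that spirit (a sketch only; madd() here is a scalar stand-in with an assumed signature, not the library's actual function, and BLOCK would be tuned to the cache):

    // C += A * B for n x n row-major matrices, with C zeroed by the caller.
    enum { BLOCK = 64 };  // assumed cache-friendly tile size

    inline float madd(float a, float b, float c) { return a * b + c; }

    void gemm(const float* A, const float* B, float* C, int n)
    {
        for (int ii = 0; ii < n; ii += BLOCK)
          for (int kk = 0; kk < n; kk += BLOCK)      // k before j: loop interchange
            for (int jj = 0; jj < n; jj += BLOCK)
              for (int i = ii; i < ii + BLOCK && i < n; ++i)
                for (int k = kk; k < kk + BLOCK && k < n; ++k)
                {
                    const float a = A[i * n + k];    // reused across the j loop
                    for (int j = jj; j < jj + BLOCK && j < n; ++j)
                        C[i * n + j] = madd(a, B[k * n + j], C[i * n + j]);
                }
    }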
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

Michael Fawcett wrote:
Both libraries provide at least a subset of the functionality in your benchmarks.
http://www.intel.com/cd/software/products/asmo-na/eng/266863.htm
OK, there are some performance figures there that happen to match my machine. Looking at the High-Accuracy mode (which is what we do), I have the following samples:
our sin : 17 cpe, MKL sin : 10.5
our asinh : 11 cpe, MKL : 14.39
So it seems some are better and some worse. Given the differences, I guess some of my algorithms are either non-optimal or have some pipeline holes. I'll investigate this and see how it fares.
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

Hi Joel,
I'm still working on a potential Boost.SIMD proposal despite the apparent lack of interest from the list. (...)
Comments and questions welcome.
I'm sure that if you manage to write a platform- and compiler-independent library for vectorized basic math functions that can compete with Intel's MKL in terms of accuracy and speed, there'll be a lot of interest in your library.

Could you comment on what vector size you used for your timings, and whether you think the library will provide competitive performance for the specific vector sizes of N=1,2,3,4? Also, do the functions properly handle denormals and overflows?

BTW, are you aware of the Eigen library: http://eigen.tuxfamily.org

Stephan

Stephan Tolksdorf wrote:
I'm sure that if you manage to write a platform- and compiler-independent library for vectorized basic math functions that can compete with Intel's MKL in terms of accuracy and speed, there'll be a lot of interest in your library.
Being almost on par is already OK. But those tests use the MKL array functions and not raw calls to the functions. I bet the MKL array functions use blocking and/or pipelining.
Could you comment on what vector size you used for your timings?
We used the same settings as MKL, i.e. vectors of 100 floats and 1000 iterations per function.
Do you think the library will provide competitive performance for the specific vector sizes of N=1,2,3,4?
Well, Boost.SIMD has a different scope. It provides pervasive access to the SIMD facilities (a SIMD vector of 1 element is rather useless). Vectors of 2 and 4 elements that can be mapped onto a proper native intrinsic type will benefit from SIMD acceleration; the others will benefit from aggressive inlining of boost::array.
Also, do the functions properly handle denormals and overflows?
We try as hard as we can. That's the main activity now: providing High Accuracy versions of the functions that currently lack them. We also plan to support the LA and EP modes.
BTW, are you aware of the Eigen library: http://eigen.tuxfamily.org
Well, considering my PhD thesis was about writing such a library, I am. Let me send the question back: are you aware of NT2 (http://nt2.sourceforge.net)? It's the project this Boost.SIMD is actually extracted from ;)
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

Joel Falcou wrote:
Stephan Tolksdorf wrote:
Do you think the library will provide competitive performance for the specific vector sizes of N=1,2,3,4?
Well, Boost.SIMD has a different scope. It provides pervasive access to the SIMD facilities (a SIMD vector of 1 element is rather useless). Vectors of 2 and 4 elements that can be mapped onto a proper native intrinsic type will benefit from SIMD acceleration; the others will benefit from aggressive inlining of boost::array.
An efficient and well tested scalar math library as the by-product of a generic SIMD vector library certainly wouldn't be useless, and I'd guess that an SSE scalar implementation would look pretty similar to the SSE vector implementation...
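For instance, a scalar entry point could presumably just run the vector kernel in lane 0, something like this (a sketch under assumptions; simd_sin is a hypothetical name for such an SSE sine kernel, not an actual API):

    #include <xmmintrin.h>  // SSE

    __m128 simd_sin(__m128 x);  // hypothetical polynomial-based vector kernel

    inline float scalar_sin(float x)
    {
        __m128 v = _mm_set_ss(x);  // place x in lane 0, zero the other lanes
        v = simd_sin(v);           // exact same code path as the vector version
        return _mm_cvtss_f32(v);   // extract lane 0
    }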
BTW, are you aware of the Eigen library: http://eigen.tuxfamily.org
Well, considering my PhD thesis was about writing such a library, I am. Let me send the question back: are you aware of NT2 (http://nt2.sourceforge.net)? It's the project this Boost.SIMD is actually extracted from ;)
I obviously wasn't. It's a bit unfortunate that there are so many parallel development efforts in the area of template libraries for linear algebra: ublas, mtl, eigen, nt2... Have you thought about joining efforts with the Eigen guys? I'm no expert in this area, but their benchmark numbers look pretty compelling and the API seems to support fixed-size vectors in an elegant way. There would probably be huge economies of scale if the C++ community converged towards a single template matrix library. Stephan

An efficient and well tested scalar math library as the by-product of a generic SIMD vector library certainly wouldn't be useless, and I'd guess that an SSE scalar implementation would look pretty similar to the SSE vector implementation...
Well, in fact it does, as the emulated vec<T,N> with N != the native cardinal for T uses those scalar functions. Boost.SIMD should maybe be renamed Boost.FastMath, in fact.
I obviously wasn't. It's a bit unfortunate that there are so many parallel development efforts in the area of template libraries for linear algebra: ublas, mtl, eigen, nt2...
I know; so do I.
Have you thought about joining efforts with the Eigen guys? I'm no expert in this area, but their benchmark numbers look pretty compelling and the API seems to support fixed-size vectors in an elegant way. There would probably be huge economies of scale if the C++ community converged towards a single template matrix library.
Well, I largely prefer my API ;) but that's a domain preference. NT2 mimics the Matlab API and syntax exactly wherever possible, because it was aimed at being a tool for physicists and control engineers to port their Matlab demos onto a proper C++ platform. Moreover, the next version of NT2 has an extensive list of features that can be used as mark-up on matrix types.
Some examples of divergence:

eigen2 sum of the cubes of column i: r = m.cols(i).colwise().cube().sum()
NT2 sum of the cubes of column i: r = sum0( cube( m(_,i) ) );

Complex indexing is also supported:
Matlab : k(1,:, 1:2:10) = cos( m );
NT2 : k( 1, _, colon(1,2,10) ) = cos(m);

Mark-up settings: want an upper-triangular matrix of float, with a maximum static size of 50x50, with dynamic allocation, and want to specify that all loops involving it need to be blocked by a 3x3 pattern?

matrix<float, settings( upper_triangular, 2D_(ofCapacity<50,50>), cache(tiling<3,3>) )> m;

etc ...

I'm not against teaming up, but I'm not sure which one is better than the other. Moreover, I don't think eigen2 uses proto as a base, while NT2 performs lots of pre-optimization using proto transforms, and again that's something I don't want to lose. Anyway, I'm not here to discuss NT2 as a whole; maybe we can continue this elsewhere ;)
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

on Thu Mar 26 2009, Joel Falcou <joel.falcou-AT-u-psud.fr> wrote:
Some examples of divergence:
eigen2 sum of the cubes of column i: r = m.cols(i).colwise().cube().sum()
NT2 sum of the cubes of column i: r = sum0( cube( m(_,i) ) );
Complex indexing is also supported:
Matlab : k(1,:, 1:2:10) = cos( m );
NT2 : k( 1, _, colon(1,2,10) ) = cos(m);
Mark-up settings: want an upper-triangular matrix of float, with a maximum static size of 50x50, with dynamic allocation, and want to specify that all loops involving it need to be blocked by a 3x3 pattern?
matrix<float, settings( upper_triangular, 2D_(ofCapacity<50,50>), cache(tiling<3,3>) )> m;
etc ...
I'm not against teaming up, but I'm not sure which one is better than the other. Moreover, I don't think eigen2 uses proto as a base, while NT2 performs lots of pre-optimization using proto transforms, and again that's something I don't want to lose.
A small note of encouragement: this sounds really cool, and the numbers are promising! -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Wed, April 1, 2009, 23:00, David Abrahams wrote:
A small note of encouragement: this sounds really cool, and the numbers are promising!
Thanks for the kind words. I will soon take another shot at some new functions (binary ones for instance, and ternary operators like Knuth equality, etc.).
I may also have to discuss a simple case: when running in emulation mode, I use a large number of optimized *scalar* functions. It feels awkward to have those hidden as implementation details, so I put them into the user namespace. Now, is Boost.SIMD a proper name when the library provides equal proportions of optimized scalar *and* vector functions?

On Thu, Mar 26, 2009 at 1:13 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
I'm still working on a potential Boost.SIMD proposal despite the apparent lack of interest from the list. The last discussion suggested that actual performance figures might be interesting, so here are some (see end of mail). The table shows, for a subset of non-trivial functions, the cycles needed to compute one value in scalar code and using SSE2, together with some precision figures and the resulting speed-up.
Could the library's interface match a GPGPU implementation? Creating operations results in shader compilations, so there would have to be a way to save these operations, like boost::function<vec(vec, vec)>, for the performance to be worth it. Do you see a way to do this with your library? (This is using an OpenGL implementation.) [snip]
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35
-- Felipe Magno de Almeida

Felipe Magno de Almeida wrote:
Could the library's interface match a GPGPU implementation? Creating operations results in shader compilations, so there would have to be a way to save these operations, like boost::function<vec(vec, vec)>, for the performance to be worth it. Do you see a way to do this with your library? (This is using an OpenGL implementation.)
Well, I have a similar system planned in the matrix library that Boost.SIMD is extracted from (http://nt2.sourceforge.net). I think it's slightly out of scope for a strictly SIMD library.
Implementation-wise, it is doable, I think, as the core library is already split up around extension detection. Adding another level is easy. But I fear the performance won't be that great, as we'd run lots of small-scale operations when GPGPUs prefer large amounts of data (hence this being more of a model for a matrix library).
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

On Thu, Mar 26, 2009 at 2:56 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
Felipe Magno de Almeida wrote:
Could the library's interface match a GPGPU implementation? Creating operations results in shader compilations, so there would have to be a way to save these operations, like boost::function<vec(vec, vec)>, for the performance to be worth it. Do you see a way to do this with your library? (This is using an OpenGL implementation.)
Well, I have a similar system planned in the matrix library that Boost.SIMD is extracted from (http://nt2.sourceforge.net). I think it's slightly out of scope for a strictly SIMD library.
Implementation-wise, it is doable, I think, as the core library is already split up around extension detection. Adding another level is easy. But I fear the performance won't be that great, as we'd run lots of small-scale operations when GPGPUs prefer large amounts of data (hence this being more of a model for a matrix library).
FWIW, ACML also offers a GPU library that offloads operations to the GPU when "it makes sense". I do agree with Joel - a GPGPU implementation can be added later. It would probably be better to let OpenCL mature as well, before heading into this area. Joel, how does the extension detection mechanism work? Is there a small runtime penalty for each function as it detects which path would be optimal, or can you define at compile-time what extensions are available (e.g. if you are compiling for a fixed hardware platform, like a console). --Michael Fawcett

Michael Fawcett wrote:
Joel, how does the extension detection mechanism work? Is there a small runtime penalty for each function as it detects which path would be optimal, or can you define at compile-time what extensions are available (e.g. if you are compiling for a fixed hardware platform, like a console)?
I have a #ifdef/#elif structure that detects which extensions have been set up on the compiler, and I match this with platform detection to know where to jump and how to overload some functions or class definitions.
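Something along these lines (a simplified sketch of such an #ifdef/#elif cascade; the macro names are invented, not the library's actual ones):

    // Compile-time extension detection: pick the best extension the compiler
    // was told about, then let it drive the overload/typedef selection.
    #if defined(__SSSE3__)
    #  define SIMD_EXT_SSSE3 1
    #  define SIMD_FLOAT_CARDINAL 4   // native float cardinal for this extension
    #elif defined(__SSE3__)
    #  define SIMD_EXT_SSE3 1
    #  define SIMD_FLOAT_CARDINAL 4
    #elif defined(__SSE2__)
    #  define SIMD_EXT_SSE2 1
    #  define SIMD_FLOAT_CARDINAL 4
    #elif defined(__ALTIVEC__)
    #  define SIMD_EXT_ALTIVEC 1
    #  define SIMD_FLOAT_CARDINAL 4
    #else
    #  define SIMD_EXT_NONE 1         // pure boost::array emulation
    #  define SIMD_FLOAT_CARDINAL 1
    #endif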
I tried the runtime way and it was fugly slow. So I'm back to compile-time detection, as performance was critical.
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

On Thu, Mar 26, 2009 at 3:08 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
Michael Fawcett wrote:
Joel, how does the extension detection mechanism work? Is there a small runtime penalty for each function as it detects which path would be optimal, or can you define at compile-time what extensions are available (e.g. if you are compiling for a fixed hardware platform, like a console).
I have a #ifdef/#elif structure that detects which extensions have been set up on the compiler, and I match this with platform detection to know where to jump and how to overload some functions or class definitions.
I tried the runtime way and it was fugly slow. So I'm back to compile-time detection, as performance was critical.
That's the answer I wanted to hear. --Michael Fawcett

Joel Falcou wrote:
Michael Fawcett wrote:
Joel, how does the extension detection mechanism work? Is there a small runtime penalty for each function as it detects which path would be optimal, or can you define at compile-time what extensions are available (e.g. if you are compiling for a fixed hardware platform, like a console)?
I have a #ifdef/#elif structure that detects which extensions have been set up on the compiler, and I match this with platform detection to know where to jump and how to overload some functions or class definitions.
I tried the runtime way and it was fugly slow. So I'm back to compile-time detection, as performance was critical.
Actually, I would expect this to be a mix of runtime and compile-time decisions. While there are certainly things that can be decided at compile time (architecture, available extensions, data types), there are also parameters that are only available at runtime, such as alignment, problem size, etc.

In Sourcery VSIPL++ (http://www.codesourcery.com/vsiplplusplus/) we use a dispatch mechanism that allows programmers to chain extension 'evaluators' in a type-list; this type-list is walked over once by the compiler to eliminate unavailable matches, and the resulting list is walked at runtime to find a match based on the above runtime parameters. This is also where we parametrize for what sizes we want to dispatch to a given backend (for example, whether the performance gain outweighs the data I/O penalty, etc.).

Obviously, all this wouldn't make sense at a very fine-grained level. But for typical BLAS-level or signal-processing operations (matrix multiply, FFT, etc.) this works like a charm. (We target all sorts of hardware, from clusters over Cell processors down to GPUs.)

Regards, Stefan
--
...ich hab' noch einen Koffer in Berlin...
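As an aside, here is a stripped-down sketch of that kind of mixed compile-time/runtime dispatch (invented names and a toy operation; this is not the actual Sourcery VSIPL++ code):

    #include <cstddef>
    #include <cstdint>

    struct simd_backend
    {
        static const bool ct_valid = true;  // pruned at compile time if false
        static bool rt_valid(const float* p, std::size_t n)
        {
            // Runtime parameters: problem size and alignment.
            return n >= 64 && reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
        }
        static void exec(const float* in, float* out, std::size_t n)
        {   // stands in for a vectorized kernel
            for (std::size_t i = 0; i < n; ++i) out[i] = 2.f * in[i];
        }
    };

    struct scalar_backend
    {
        static const bool ct_valid = true;  // always-available fallback
        static bool rt_valid(const float*, std::size_t) { return true; }
        static void exec(const float* in, float* out, std::size_t n)
        {
            for (std::size_t i = 0; i < n; ++i) out[i] = 2.f * in[i];
        }
    };

    // The evaluator type-list: walked once by the compiler (ct_valid folds
    // dead branches away), then walked at runtime via rt_valid.
    template <typename Head, typename Tail>
    struct chain
    {
        static void dispatch(const float* in, float* out, std::size_t n)
        {
            if (Head::ct_valid && Head::rt_valid(in, n))
                Head::exec(in, out, n);
            else
                Tail::dispatch(in, out, n);
        }
    };

    struct chain_end
    {
        static void dispatch(const float*, float*, std::size_t) {}  // no match
    };

    typedef chain<simd_backend, chain<scalar_backend, chain_end> > evaluators;
    // usage: evaluators::dispatch(in, out, n);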

Stefan Seefeld wrote:
Joel Falcou wrote:
Michael Fawcett wrote:
Joel, how does the extension detection mechanism work? Is there a small runtime penalty for each function as it detects which path would be optimal, or can you define at compile-time what extensions are available (e.g. if you are compiling for a fixed hardware platform, like a console)?
I have a #ifdef/#elif structure that detects which extensions have been set up on the compiler, and I match this with platform detection to know where to jump and how to overload some functions or class definitions.
I tried the runtime way and it was fugly slow. So I'm back to compile-time detection, as performance was critical.
Actually, I would expect this to be a mix of runtime and compile-time decisions. While there are certainly things that can be decided at compile time (architecture, available extensions, data types), there are also parameters that are only available at runtime, such as alignment, problem size, etc.
Well, again, the grain here is the data pack, aka the generalized SIMD vector.
In Sourcery VSIPL++ (http://www.codesourcery.com/vsiplplusplus/) we use a dispatch mechanism that allows programmers to chain extension 'evaluators' in a type-list; this type-list is walked over once by the compiler to eliminate unavailable matches, and the resulting list is walked at runtime to find a match based on the above runtime parameters. This is also where we parametrize for what sizes we want to dispatch to a given backend (for example, whether the performance gain outweighs the data I/O penalty, etc.).
Obviously, all this wouldn't make sense at a very fine-grained level. But for typical BLAS-level or signal-processing operations (matrix multiply, FFT, etc.) this works like a charm.
(We target all sorts of hardware, from clusters over Cell processors down to GPUs.)
That's what we do in NT2: mixed CT/RT selectors.
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

Joel Falcou wrote:
That's what we do in NT2: mixed CT/RT selectors.
So now I'm interested :) Keep us posted.
Best
--
Fernando Cacciola
SciSoft Consulting, Founder
http://www.scisoft-consulting.com

On Thu, Mar 26, 2009 at 3:56 PM, Joel Falcou <joel.falcou@u-psud.fr> wrote:
Felipe Magno de Almeida wrote:
Could the library's interface match a GPGPU implementation? Creating operations results in shader compilations, so there would have to be a way to save these operations, like boost::function<vec(vec, vec)>, for the performance to be worth it. Do you see a way to do this with your library? (This is using an OpenGL implementation.)
Well, I have a similar system planned in the matrix library that Boost.SIMD is extracted from (http://nt2.sourceforge.net). I think it's slightly out of scope for a strictly SIMD library.
I see. Thinking about it, I agree with you. It would make more sense in a matrix library. [snip]
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35
-- Felipe Magno de Almeida

Joel,
Quick question: is some code available? For my part, I'm very interested in doing some testing!
Thanks,
Christian

Christian Henning wrote:
Joel,
Quick question: is some code available? For my part, I'm very interested in doing some testing!
As said earlier, it needs to be fully extracted from the main project it's living in right now. I'll try to get something testable without itches in a week or so.
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35
participants (9)
- Christian Henning
- David Abrahams
- Felipe Magno de Almeida
- Fernando Cacciola
- joel falcou
- Joel Falcou
- Michael Fawcett
- Stefan Seefeld
- Stephan Tolksdorf