
Hi there, is there a bit more information for "Optimizing the Boost Math and Graph Library for Multi-core and SIMD processors" project? Thanks, Christian

Christian Henning wrote:
Hi there, is there a bit more information for "Optimizing the Boost Math and Graph Library for Multi-core and SIMD processors" project?
The abstract is here: http://code.google.com/soc/2008/boost/appinfo.html?csaid=902E3C97E6715F39
I don't want to release Gautam's full proposal without his permission; I'm CC'ing him so he can respond directly, but I believe he's tied up in exams until the end of the month. Was there something specific you were interested in? Regards, John.

Hi John,
I don't want to release Gautam's full proposal without his permission: I'm CC'ing him so he can respond directly, but I believe he's tied up in exams until the end of the month.
I can wait, no problem.
Was there something specific you were interested in?
I'm very interested in the SIMD part. What functions of the math toolkit would be targeted? Also, will this project create facilities that would let me use SIMD hardware for my own calculations? For example, I would very much like to do ray/box intersection tests using SSE. Would that be possible after the SOC? This project might be of interest to you, in case you didn't know about it: http://www.pixelglow.com/macstl/ Thanks, Christian
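The ray/box intersection test mentioned above does map naturally onto SSE via the standard slab method. The following is a minimal hypothetical sketch using raw SSE intrinsics (nothing here is from the proposal; the function name and the padding convention are invented for illustration):

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <algorithm>

// Slab test: the ray origin + t*dir hits the box [bmin, bmax] iff the
// parameter intervals in which it lies inside each pair of axis-aligned
// slabs overlap. The three axes are packed into one __m128 (the fourth
// lane is padding and is never read back).
bool ray_box_hit_sse(const float org[3], const float dir[3],
                     const float bmin[3], const float bmax[3])
{
    const float o[4]   = { org[0], org[1], org[2], 0.0f };
    const float inv[4] = { 1.0f / dir[0], 1.0f / dir[1], 1.0f / dir[2], 0.0f };
    const float lo[4]  = { bmin[0], bmin[1], bmin[2], 0.0f };
    const float hi[4]  = { bmax[0], bmax[1], bmax[2], 0.0f };

    __m128 vo = _mm_loadu_ps(o);
    __m128 vi = _mm_loadu_ps(inv);
    __m128 t1 = _mm_mul_ps(_mm_sub_ps(_mm_loadu_ps(lo), vo), vi);
    __m128 t2 = _mm_mul_ps(_mm_sub_ps(_mm_loadu_ps(hi), vo), vi);

    float near4[4], far4[4];
    _mm_storeu_ps(near4, _mm_min_ps(t1, t2));  // per-axis entry times
    _mm_storeu_ps(far4,  _mm_max_ps(t1, t2));  // per-axis exit times

    float tmin = std::max(std::max(near4[0], near4[1]), std::max(near4[2], 0.0f));
    float tmax = std::min(std::min(far4[0], far4[1]), far4[2]);
    return tmax >= tmin;  // intervals overlap in front of the origin
}
```

Axis-parallel rays rely on IEEE infinities from the 1/dir divisions; the degenerate case of a zero direction component with the origin exactly on a slab plane produces NaNs and would need extra handling in production code.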

I'm very interested in the SIMD part. What functions of the math toolkit would be targeted? Also, will this project create facilities that would let me use SIMD hardware for my own calculations? For example, I would very much like to do ray/box intersection tests using SSE. Would that be possible after the SOC?
This project might be of interest to you, in case you didn't know about it:
Glen Low at PixelGlow was very positive and proactive with macstl (smart cookie!) but sadly didn't get as much interest and commercial support as he needed to put bread on the table. I found MacSTL reasonably good, but not that well supported on x86 (much better on Altivec); Glen was a little prone to hyperbole in some of his speedup claims. I also found that the Intel compiler was reasonably good at optimising C (gulp) code to achieve the same sorts of speeds (particularly relevant for complex numbers, which are hopelessly SIMD-optimised in C++ by Intel). Perhaps also take a look at www.codesourcery.com and their VSIPL++ option. It uses expression templates to coalesce sequences of ops, makes use of SIMD operations, and also utilises underlying vector libs if available, but, as with macstl, has a more restrictive license. I think at one point GIL and VSIPL++ almost intersected, with some interest by someone at CodeSourcery in providing interfaces into GIL.

Paul Baxter wrote:
I'm very interested in the SIMD part. What functions of the math toolkit would be targeted? Also, will this project create facilities that would let me use SIMD hardware for my own calculations? For example, I would very much like to do ray/box intersection tests using SSE. Would that be possible after the SOC?
This project might be of interest to you, in case you didn't know about it:
Glen Low at PixelGlow was very positive and proactive with macstl (smart cookie!) but sadly didn't get as much interest and commercial support as he needed to put bread on the table.
This is tricky, isn't it: these things are good academic projects, but are very much "niche" applications :-( The other issue is portability: although several platforms support vectorised float operations (including NVidia graphics cards, if you want to gain some extra cores!), only Intel supports vectorised double operations as far as I know, and it's the latter that I'm interested in :-(
I found MacSTL reasonably good, but not that well supported on x86 (much better on Altivec); Glen was a little prone to hyperbole in some of his speedup claims. I also found that the Intel compiler was reasonably good at optimising C (gulp) code to achieve the same sorts of speeds (particularly relevant for complex numbers, which are hopelessly SIMD-optimised in C++ by Intel).
Perhaps also take a look at www.codesourcery.com and their VSIPL++ option. It uses expression templates to coalesce sequences of ops, makes use of SIMD operations, and also utilises underlying vector libs if available, but, as with macstl, has a more restrictive license.
Unfortunately, BLAS/FFT-style operations aren't the ones we really need in this project (think vectorised Horner evaluations and the like) :-( Cheers, John.
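A "vectorised Horner evaluation" here means running Horner's rule for one polynomial at several abscissas at once, one per SIMD lane. A hypothetical single-precision sketch (the function name is invented, and the real project would more likely target doubles with the _pd intrinsics):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Evaluate the polynomial c[0] + c[1]*x + ... + c[n-1]*x^(n-1) by
// Horner's rule at four abscissas simultaneously (one per SSE lane).
__m128 horner4(const float* c, int n, __m128 x)
{
    __m128 acc = _mm_set1_ps(c[n - 1]);  // start from the leading coefficient
    for (int i = n - 2; i >= 0; --i)
        acc = _mm_add_ps(_mm_mul_ps(acc, x), _mm_set1_ps(c[i]));
    return acc;
}
```

The same loop with __m128d, _mm_mul_pd and _mm_add_pd gives two double evaluations per step, which is the case the thread is mostly concerned with.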

Christian Henning wrote:
Hi John,
I don't want to release Gautam's full proposal without his permission: I'm CC'ing him so he can respond directly, but I believe he's tied up in exams until the end of the month.
I can wait no problem.
Was there something specific you were interested in?
I'm very interested in the SIMD part. What functions of the math toolkit would be targeted? Also, will this project create facilities that would let me use SIMD hardware for my own calculations? For example, I would very much like to do ray/box intersection tests using SSE. Would that be possible after the SOC?
Maybe :-) It's not the main focus of the project; I believe the intent was to use the Intel SSE intrinsics directly rather than abstracting them, since these appear to be available on all Intel-based platforms. Although some other platforms also support vectorising operations, they're predominantly for integer or 32-bit float operations, which is less useful for our purposes (anyone know different?).
This project might be of interest to you, in case you didn't know about it:
Indeed, thanks! John.
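On the double-precision point above: SSE2 (a baseline on all x86-64 chips) does provide packed double operations, two per 128-bit register. A trivial sketch of what that looks like at the intrinsics level (the function name is invented for illustration):

```cpp
#include <emmintrin.h>  // SSE2: packed double-precision intrinsics

// Add two arrays of doubles two-at-a-time using 128-bit registers.
// n is assumed even here for brevity; a real routine would also
// handle an odd-length tail and alignment.
void add_pd(const double* a, const double* b, double* out, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
}
```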

Does the project plan include an optimized SSE implementation of the cumulative distribution function for the Student t and Fisher F distributions? That would be very useful for data-mining applications. --Johan Råde

John Maddock wrote:
Christian Henning wrote:
Hi John,
I don't want to release Gautam's full proposal without his permission: I'm CC'ing him so he can respond directly, but I believe he's tied up in exams until the end of the month. I can wait, no problem.
Was there something specific you were interested in? I'm very interested in the SIMD part. What functions of the math toolkit would be targeted? Also, will this project create facilities that would let me use SIMD hardware for my own calculations? For example, I would very much like to do ray/box intersection tests using SSE. Would that be possible after the SOC?
Maybe :-)
It's not the main focus of the project; I believe the intent was to use the Intel SSE intrinsics directly rather than abstracting them, since these appear to be available on all Intel-based platforms. Although some other platforms also support vectorising operations, they're predominantly for integer or 32-bit float operations, which is less useful for our purposes (anyone know different?).
This project might be of interest to you, in case you didn't know about it:
Indeed, thanks!
John.

Johan Råde wrote:
Does the project plan include an optimized SSE implementation of the cumulative distribution function for the Student t and Fisher F distributions? That would be very useful for data-mining applications.
Since those ultimately rely on the incomplete beta, and that's one of the functions that I hope we'll be focusing on, then yes, up to a point :-) I'm not sure there's all that much scope for SSE optimisation of that function (unless we can optimise the infinite series and continued fractions used to make use of parallel evaluation). There are some opportunities for task-based (i.e. OpenMP or similar) parallelism during the computation, and I believe Gautam has some other ideas which might work, but this is all very much in the realm of "investigate whether it's worthwhile" at present. BTW, if you (or anyone else for that matter) have any real-world test cases that Gautam can look into, that would be great - I realise there may well be confidentiality issues in some cases - but there's not much we can do about this, I guess? HTH, John.

John Maddock wrote:
Johan Råde wrote:
Does the project plan include an optimized SSE implementation of the cumulative distribution function for the Student t and Fisher F distributions? That would be very useful for data-mining applications.
Since those ultimately rely on the incomplete beta, and that's one of the functions that I hope we'll be focusing on, then yes, up to a point :-)
I'm not sure there's all that much scope for SSE optimisation of that function (unless we can optimise the infinite series and continued fractions used to make use of parallel evaluation). There are some opportunities for task-based (i.e. OpenMP or similar) parallelism during the computation, and I believe Gautam has some other ideas which might work, but this is all very much in the realm of "investigate whether it's worthwhile" at present.
BTW, if you (or anyone else for that matter) have any real-world test cases that Gautam can look into, that would be great - I realise there may well be confidentiality issues in some cases - but there's not much we can do about this, I guess?
A typical data mining scenario might be to calculate the cdf of the t- or F-distribution for each value in an array of, say, 100,000 single- or double-precision floating point numbers. (I tend to use double precision.) Anything that could speed up that task would be interesting. SSE parallelism, if possible, would be interesting. Multi-core parallelism is less interesting; that is already easy to do, for instance using OpenMP. Then there are the issues brought up by Stephen Nuchia: http://article.gmane.org/gmane.comp.lib.boost.devel/173840 --Johan Råde
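The outer loop Johan describes is embarrassingly parallel, so the OpenMP version is short. In this self-contained sketch a normal cdf built on erf stands in for the Student t cdf, which would really come from Boost.Math (a reasonable proxy at around 100 degrees of freedom):

```cpp
#include <cmath>
#include <vector>

// Stand-in for the Student t cdf: the normal cdf via erf. With ~100
// degrees of freedom the t distribution is close to normal anyway.
double cdf_stand_in(double x)
{
    return 0.5 * (1.0 + std::erf(x / std::sqrt(2.0)));
}

// Apply the cdf to every element. The iterations are independent, so
// OpenMP can split them across cores; without OpenMP the pragma is
// ignored and the code runs serially but stays correct.
void cdf_array(const std::vector<double>& in, std::vector<double>& out)
{
    out.resize(in.size());
    #pragma omp parallel for
    for (long i = 0; i < (long)in.size(); ++i)
        out[i] = cdf_stand_in(in[i]);
}
```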

Johan Råde wrote:
A typical data mining scenario might be to calculate the cdf for the t- or F-distribution for each value in an array of say 100,000 single or double precision floating point numbers. (I tend to use double precision.) Anything that could speed up that task would be interesting.
Nod, the question is what combinations of arguments actually get passed to the incomplete beta: if the data isn't unduly sensitive, what would be really useful is to have a log of those values, so we can see which parts of the implementation are getting hammered the most.
SSE parallelism, if possible, would be interesting.
Multi-core parallelism is less interesting; that is already easy to do, for instance using OpenMP.
Then there are the issues brought up by Stephen Nuchia: http://article.gmane.org/gmane.comp.lib.boost.devel/173840
Nod, there's a lot that can be done, but ground is going to be won a yard at a time :-( John.

John Maddock wrote:
Johan Råde wrote:
A typical data mining scenario might be to calculate the cdf for the t- or F-distribution for each value in an array of say 100,000 single or double precision floating point numbers. (I tend to use double precision.) Anything that could speed up that task would be interesting.
Nod, the question is what combinations of arguments actually get passed to the incomplete beta: if the data isn't unduly sensitive, what would be really useful is to have a log of those values, so we can see which parts of the implementation are getting hammered the most.
In data mining applications, usually most of the variables satisfy the null hypothesis; in other words, their distribution is given by the t- or F-distribution at hand. So you can generate test data by starting with an array of uniform [0,1] random numbers and then applying the inverse of the cdf to each value.

Concerning the degrees of freedom, in the problems we analyze: for the t-distribution, 10 - 1000, typically around 100; for the F-distribution, the first number of degrees of freedom is 2 - 10 and the second is 10 - 1000, typically around 100.

HTH, Johan Råde

John Maddock wrote:
Johan Råde wrote:
A typical data mining scenario might be to calculate the cdf for the t- or F-distribution for each value in an array of say 100,000 single or double precision floating point numbers. (I tend to use double precision.) Anything that could speed up that task would be interesting.
Nod, the question is what the actual combination of arguments that get passed to the incomplete beta are: if the data isn't unduely sensitive, what would be really useful is to have a log of those values so we can see which parts of the implementation are getting hammered the most.
Here is a detailed test case that I think is typical of the needs of data mining applications. Select a t distribution with 10 - 1000 degrees of freedom, or an F distribution where the first number of degrees of freedom is 2 - 10 and the second 10 - 1000.

1. Preparation - generate test data: Generate 100,000 random numbers with uniform [0,1] distribution. Apply the quantile function (i.e. the inverse of the cdf) to each number. This gives the test data.

2. The test: Apply the cdf or cdf complement to the test data from Step 1.

Is this the kind of suggestion you were asking for? --Johan
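The two steps above can be sketched end to end. As stand-ins so the sketch is self-contained, the normal cdf (via erf) replaces the t cdf, and a bisection on the cdf replaces the quantile function, since the standard library has no inverse cdf; with Boost.Math both would simply be cdf() and quantile() calls on a students_t distribution:

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// Normal cdf via erf, standing in for the t cdf.
double norm_cdf(double x)
{
    return 0.5 * (1.0 + std::erf(x / std::sqrt(2.0)));
}

// Quantile by bisection on the cdf: crude, but plenty for test data.
double norm_quantile(double p)
{
    double lo = -40.0, hi = 40.0;
    for (int i = 0; i < 200; ++i) {
        double mid = 0.5 * (lo + hi);
        if (norm_cdf(mid) < p) lo = mid;
        else                   hi = mid;
    }
    return 0.5 * (lo + hi);
}

// Step 1: uniforms -> quantile gives null-hypothesis test data.
// (Step 2 is then just applying the cdf back over this array.)
std::vector<double> make_test_data(int n, unsigned seed)
{
    std::srand(seed);
    std::vector<double> data(n);
    for (int i = 0; i < n; ++i) {
        double u = (std::rand() + 0.5) / (RAND_MAX + 1.0);  // uniform (0,1)
        data[i] = norm_quantile(u);
    }
    return data;
}
```

Applying the cdf to the generated data should recover values uniform on (0,1), which is exactly the property the test in Step 2 exercises.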
participants (4)
- Christian Henning
- Johan Råde
- John Maddock
- Paul Baxter