uBLAS Observation: terrible execution speed for prod(matrix,prod(matrix,prod(matrix,matrix)))

Hi,

I am new to uBLAS. I am looking to see if it will be useful for my application (I develop 6-degrees-of-freedom flight simulations used to analyze avionics systems). I have benchmarked a few lines of code against RogueWave's Math.h++ (testing for execution speed) and made the following observations.

Observation 1) For matrix products of two small (~3-by-3) matrices (i.e., prod(smallMat1, smallMat2)) uBLAS executes MUCH FASTER than Math.h++.

Observation 2) For matrix products of two large matrices (100 by 100) uBLAS and Math.h++ give similar execution speed.

Observation 3) (This is the kicker...) For strung-together matrix products like "Mat1 * Mat2 * Mat3 * Mat4 * Mat5", which I am implementing as "prod(Mat1, prod(Mat2, prod(Mat3, prod(Mat4, Mat5))))", Math.h++ is MUCH MUCH FASTER than uBLAS, by ~2 orders OF MAGNITUDE! I believe that this is a characteristic of the scalability of template expressions.

My question is this: Is there a way I can instruct uBLAS not to use template expressions for strung-together matrix products?

Regards,
Walt

From: Walter Bowen
Observation 1) For matrix products of two small (~3-by-3) matrices (i.e., prod(smallMat1, smallMat2)) uBLAS executes MUCH FASTER than Math.h++.
Observation 2) For matrix products of two large matrices (100 by 100) uBLAS and Math.h++ give similar execution speed.
Observation 3) (This is the kicker...) For strung-together matrix products like "Mat1 * Mat2 * Mat3 * Mat4 * Mat5", which I am implementing as "prod(Mat1, prod(Mat2, prod(Mat3, prod(Mat4, Mat5))))", Math.h++ is MUCH MUCH FASTER than uBLAS, by ~2 orders OF MAGNITUDE! I believe that this is a characteristic of the scalability of template expressions.
My question is this: Is there a way I can instruct uBLAS not to use template expressions for strung-together matrix products?
I'm not familiar with uBLAS, but I thought this was one of the _strengths_ of expression templates (if that's what you mean): they can be used to unroll loops (especially useful for smaller matrices) and also to fuse loops together, avoiding temporary matrices in expressions such as yours above, where several matrices are multiplied together (especially useful for larger matrices).

At least, that's the approach taken in libraries such as Blitz++, where they have achieved performance comparable to Fortran, both for small and large matrices. It performs especially well on expressions with several large matrices, as loop unrolling is not enough for that. You also need to eliminate temporary matrices and fuse the loops into one set of loops.

Regards,
Terje
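To make the loop-fusion idea concrete, here is a minimal sketch for elementwise vector addition, where fusing is a clear win. It is not how Blitz++ or uBLAS is implemented; the names `Expr`, `Sum`, `Vec` and `add_three` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base so that operator+ below only matches our own expression types.
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Lazy sum node: holds references and computes one element on demand.
template <class L, class R>
struct Sum : Expr<Sum<L, R> > {
    const L& l;
    const R& r;
    Sum(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

// Concrete vector (hypothetical stand-in for a library vector type).
struct Vec : Expr<Vec> {
    std::vector<double> d;
    explicit Vec(std::size_t n, double v = 0.0) : d(n, v) {}
    // Evaluating constructor: a single loop over the whole expression tree,
    // with no intermediate vectors allocated.
    template <class E>
    Vec(const Expr<E>& e) : d(e.self().size()) {
        for (std::size_t i = 0; i < d.size(); ++i) d[i] = e.self()[i];
    }
    double operator[](std::size_t i) const { return d[i]; }
    std::size_t size() const { return d.size(); }
};

// Building the expression does no arithmetic; it only records the operands.
template <class L, class R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    assert(l.self().size() == r.self().size());
    return Sum<L, R>(l.self(), r.self());
}

// a + b + c is evaluated in one fused loop inside the Vec constructor.
// The expression must be consumed in the same full expression, because the
// nodes hold references to temporaries.
Vec add_three(const Vec& a, const Vec& b, const Vec& c) {
    return Vec(a + b + c);
}
```

Each result element here depends on exactly one element of each operand, so lazy evaluation does no redundant work. That property does not hold for a matrix product, where each result element needs a whole row and a whole column of the operands.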

The problem is that expression templates avoid temporary objects by evaluating expressions as they are needed. By nesting too many matrix products, the inner products have to be computed many times.

----- Original Message -----
From: "Terje Slettebø" <terje.s@chello.no>
Newsgroups: gmane.comp.lib.boost.user
Sent: Monday, July 29, 2002 8:44 PM
Subject: Re: uBLAS Observation: terrible execution speed for prod(matrix,prod(matrix,prod(matrix,matrix)))

From: Walter Bowen
Observation 1) For matrix products of two small (~3-by-3) matrices (i.e., prod(smallMat1, smallMat2)) uBLAS executes MUCH FASTER than Math.h++.
Observation 2) For matrix products of two large matrices (100 by 100) uBLAS and Math.h++ give similar execution speed.
Observation 3) (This is the kicker...) For strung-together matrix products like "Mat1 * Mat2 * Mat3 * Mat4 * Mat5", which I am implementing as "prod(Mat1, prod(Mat2, prod(Mat3, prod(Mat4, Mat5))))", Math.h++ is MUCH MUCH FASTER than uBLAS, by ~2 orders OF MAGNITUDE! I believe that this is a characteristic of the scalability of template expressions.
My question is this: Is there a way I can instruct uBLAS not to use template expressions for strung-together matrix products?
I'm not familiar with uBLAS, but I thought this was one of the _strengths_ of expression templates (if that's what you mean): they can be used to unroll loops (especially useful for smaller matrices) and also to fuse loops together, avoiding temporary matrices in expressions such as yours above, where several matrices are multiplied together (especially useful for larger matrices).
At least, that's the approach taken in libraries such as Blitz++, where they have achieved performance comparable to Fortran, both for small and large matrices. It performs especially well on expressions with several large matrices, as loop unrolling is not enough for that. You also need to eliminate temporary matrices and fuse the loops into one set of loops.
Regards,
Terje
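The recomputation cost described in the reply above can be made visible with a small counting experiment. This is a sketch under stated assumptions, not uBLAS code: `Mat`, `Prod`, `lazy_prod`, `eval` and the counter `g_mults` are hypothetical names, and the lazy product is the simplest possible one (each element is computed from scratch whenever it is requested).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts scalar multiplications so the cost of each strategy is visible.
static long g_mults = 0;

// Dense square matrix, row-major; a hypothetical stand-in for a real matrix class.
struct Mat {
    std::size_t n;
    std::vector<double> a;
    explicit Mat(std::size_t n_, double v = 0.0) : n(n_), a(n_ * n_, v) {}
    double operator()(std::size_t i, std::size_t j) const { return a[i * n + j]; }
    double& operator()(std::size_t i, std::size_t j) { return a[i * n + j]; }
    std::size_t size() const { return n; }
};

// Lazy product node: stores references and computes one element on demand.
template <class L, class R>
struct Prod {
    const L& l;
    const R& r;
    std::size_t size() const { return l.size(); }
    double operator()(std::size_t i, std::size_t j) const {
        double s = 0.0;
        for (std::size_t k = 0; k < l.size(); ++k) {
            ++g_mults;
            s += l(i, k) * r(k, j);  // r(k, j) is recomputed if r is itself lazy
        }
        return s;
    }
};

template <class L, class R>
Prod<L, R> lazy_prod(const L& l, const R& r) {
    assert(l.size() == r.size());
    return Prod<L, R>{l, r};
}

// Evaluate any expression into a concrete matrix (the "temporary").
template <class E>
Mat eval(const E& e) {
    Mat m(e.size());
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < m.size(); ++j)
            m(i, j) = e(i, j);
    return m;
}

// A * (B * C) with the inner product left lazy.
long cost_nested(std::size_t n) {
    Mat A(n, 1.0), B(n, 1.0), C(n, 1.0);
    g_mults = 0;
    Mat R = eval(lazy_prod(A, lazy_prod(B, C)));
    (void)R;
    return g_mults;
}

// Same product, but B * C is stored in a temporary first.
long cost_with_temporary(std::size_t n) {
    Mat A(n, 1.0), B(n, 1.0), C(n, 1.0);
    g_mults = 0;
    Mat T = eval(lazy_prod(B, C));
    Mat R = eval(lazy_prod(A, T));
    (void)R;
    return g_mults;
}
```

In this sketch the nested form costs n^4 + n^3 scalar multiplications and the version with a temporary costs 2n^3, so for n = 10 the counts are 11000 versus 2000. Each further level of lazy nesting multiplies the cost by roughly another factor of n.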

--- In Boost-Users@y..., Walter Bowen <yg-boost-users@m...> wrote:
<snip comparison performance ublas and Roguewave math.h++ >

I find it surprising that ublas can outperform Math.h++. Nevertheless, I think that once you're handling matrices larger than 10x10, you should consider calling BLAS (preferably vendor-tuned BLAS or ATLAS).

During the review, there was some discussion on the performance. Martin Weiser, for instance, made this graph: http://www.zib.de/weiser/ublas_review.gif. Of course you need to take into account that much emphasis has been on the interface. I figure performance optimisation will now (after the review) be looked at in greater detail. (I can't speak for Joerg and Mathias, of course, but I figure they're on vacation, so I wanted to give you this feedback already anyhow.)

Also, you can use the BLAS bindings for ublas that are in the sandbox. I've only added a few BLAS calls up till now but will extend this to cover all of BLAS in the future. I use these bindings myself in my projects, and this provides me the rich functionality of ublas with the performance of ATLAS.

toon

--- In Boost-Users@y..., Walter Bowen <yg-boost-users@m...> wrote:
Hi, I am new to uBLAS. I am looking to see if it will be useful for my application (I develop 6-Degrees-of-freedom flight simulations used to analyze avionics systems.) I have benchmarked a few lines of code with RogueWave's Math.h++ (testing for execution speed) and made the following observation.
Observation 1) For matrix products of two small (~3-by-3) matrices (i.e., prod(smallMat1, smallMat2)) uBLAS executes MUCH FASTER than Math.h++.
Fine.
Observation 2) For matrix products of two large matrices (100 by 100) uBLAS and Math.h++ give similar execution speed.
As Toon already pointed out, you could try 1000 by 1000 matrices to check whether Math.h++ uses blocked operations internally (which uBLAS doesn't).
Observation 3) (This is the kicker...) For strung-together matrix products like "Mat1 * Mat2 * Mat3 * Mat4 * Mat5", which I am implementing as "prod(Mat1, prod(Mat2, prod(Mat3, prod(Mat4, Mat5))))", Math.h++ is MUCH MUCH FASTER than uBLAS, by ~2 orders OF MAGNITUDE! I believe that this is a characteristic of the scalability of template expressions.
Correct. This observation was first stated by Benedikt Weber during uBLAS review: http://lists.boost.org/MailArchives/boost/msg31283.php
My question is this: Is there a way I can instruct uBLAS not to use template expressions for strung-together matrix products?
You could reintroduce temporaries directly

    prod(Mat1, matrix<your_type> (prod(Mat2, matrix<your_type> (prod(Mat3, matrix<your_type> (prod(Mat4,Mat5)))))))

or use some recently introduced free functions:

    prod(Mat1, prod<matrix<your_type> > (Mat2, prod<matrix<your_type> > (Mat3, prod<matrix<your_type> > (Mat4,Mat5))))

HTH

Joerg

P.S.: this is certainly the third FAQ entry ;-)
participants (5): jhrwalter, Matthias Kronenberger, Terje Slettebø, tknapen, Walter Bowen