performance of a linear algebra/matrix library

hi all

i work on a linear algebra library with the intent to propose it for inclusion into the boost collection of libs

so my question is: how fast should such a lib be to make everybody happy?

i found that to increase the performance of a generic linear algebra library one must complicate the implementation to a very high degree (though this statement should be obvious to everyone)

i struggled with poor performance of matrix multiplication and finally identified the bottleneck

i improved the implementation (seriously complicating it; i'm afraid i will forget how it works in a month) and now it performs some 33% slower than C code for virtually any size of matrix

the C code is as follows:

  typedef double type;

  const size_t n = 42;
  type *a  = new type[n*n],
       *a1 = new type[n*n],
       *a2 = new type[n*n];
  // filling 'a1' and 'a2'

  for (size_t i = 0; i < n; ++i)
  {
    type * const a_i = a + i*n;
    for (size_t j = 0; j < n; ++j)
      a_i[j] = 0.;
    for (size_t k = 0; k < n; ++k)
    {
      type *a = a_i;
      const type a1_ik = a1[i*n + k];
      const type *a2_k = a2 + k*n;
      for (size_t j = 0; j < n; ++j, ++a, ++a2_k)
        *a += a1_ik*(*a2_k);
    }
  }

the C++ code is just

  matrix<type> m(n, n), m1(n, n), m2(n, n);
  // filling 'm1' and 'm2'
  m.assign(m1, m2);

more detailed results: 'n' is the dimension of the matrices, 'ratio' is the runtime of the C++ code divided by the runtime of the C code (time_cpp/time_c)

  n     |    2 |    4 |    8 |   16 |   32 |   64 |  128 |  256 |  512 | 1024 | 2048
  ratio | 1.77 | 1.57 | 1.34 | 1.12 | 1.05 | 1.12 | 1.32 | 1.33 | 1.23 | 1.23 | 1.23

as you can see, starting with 8 by 8 matrices the C++ code is roughly 33% slower than the C code

is that enough to make you (the reader) happy?

-- Pavel

----- Original Message -----
From: "DE" <satan66613@yandex.ru>
To: <boost@lists.boost.org>
Sent: Thursday, May 06, 2010 7:02 PM
Subject: [boost] performance of a linear algebra/matrix library

> hi all
> i work on a linear algebra library with the intent to propose it for inclusion into the boost collection of libs
> so my question is: how fast should such a lib be to make everybody happy?
> i found that to increase the performance of a generic linear algebra library one must complicate the implementation to a very high degree (though this statement should be obvious to everyone)
> i struggled with poor performance of matrix multiplication and finally identified the bottleneck
> i improved the implementation (seriously complicating it; i'm afraid i will forget how it works in a month) and now it performs some 33% slower than C code for virtually any size of matrix
> <snip>
> the C++ code is just
>   matrix<type> m(n, n), m1(n, n), m2(n, n);
>   // filling 'm1' and 'm2'
>   m.assign(m1, m2);
> more detailed results: 'n' is the dimension of the matrices, 'ratio' is the runtime of the C++ code divided by the runtime of the C code (time_cpp/time_c)
>   n     |    2 |    4 |    8 |   16 |   32 |   64 |  128 |  256 |  512 | 1024 | 2048
>   ratio | 1.77 | 1.57 | 1.34 | 1.12 | 1.05 | 1.12 | 1.32 | 1.33 | 1.23 | 1.23 | 1.23
> as you can see, starting with 8 by 8 matrices the C++ code is roughly 33% slower than the C code
> is that enough to make you (the reader) happy?

Why is it slower? Or better: where do you spend more time than when doing it in C?

BTW, what does m.assign(m1, m2) do (I have not read the C code)?

What about Boost.LA?

Best,
Vicente

on 06.05.2010 at 21:24 vicente.botet wrote:

> Why is it slower? Or better: where do you spend more time than when doing it in C?

abstraction penalty i guess
i'm going to dig further into it
besides, in C you can do it in a very efficient way since it's all about pointer arithmetic

> BTW, what does m.assign(m1, m2) do (I have not read the C code)?

oops... 'm.assign(m1*m2)' of course

> What about Boost.LA?

afaik boost LA concerns statically sized matrices
that's not an option for me personally

-- Pavel

----- Original Message -----
From: "DE" <satan66613@yandex.ru>
To: "vicente.botet" <boost@lists.boost.org>
Sent: Thursday, May 06, 2010 7:37 PM
Subject: Re: [boost] performance of a linear algebra/matrix library

> on 06.05.2010 at 21:24 vicente.botet wrote:
>> Why is it slower? Or better: where do you spend more time than when doing it in C?
> abstraction penalty i guess
> i'm going to dig further into it
> besides, in C you can do it in a very efficient way since it's all about pointer arithmetic

Why don't you use the same pointer arithmetic in C++?

>> BTW, what does m.assign(m1, m2) do (I have not read the C code)?
> oops... 'm.assign(m1*m2)' of course

Aaah!

>> What about Boost.LA?
> afaik boost LA concerns statically sized matrices
> that's not an option for me personally

I thought you wanted performance :) Have you compared with uBlas?

Vicente

on 06.05.2010 at 21:44 vicente.botet wrote:

> Why don't you use the same pointer arithmetic in C++?

it's not easy because the data is hidden behind an abstraction
that is, when you access an element like 'm(i, j)' it may be retrieved from some kind of storage or even computed on the fly

> I thought you wanted performance :)

performance is an important thing but i will not go C anyway

> Have you compared with uBlas?

i haven't yet
the uBLAS docs claim there is no abstraction penalty for relatively large matrices, but i don't know what kind of C code they compared against; maybe it is inefficient by the same factor as the ublas routines, giving apparently no abstraction penalty

-- Pavel
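[To illustrate what "hidden behind an abstraction" means here: a minimal sketch of a dense, row-major matrix whose operator() hides the index arithmetic. The class is hypothetical, not Pavel's actual library; with a plain contiguous store like this, a good optimizer can usually reduce m(i, j) to the same pointer arithmetic as the C version.]

  #include <cstddef>
  #include <vector>

  // dense, row-major matrix; m(i, j) hides the storage and the
  // index arithmetic behind an accessor
  template<class T>
  class matrix {
  public:
      matrix(std::size_t rows, std::size_t cols)
          : rows_(rows), cols_(cols), data_(rows * cols) {}

      T& operator()(std::size_t i, std::size_t j)
      { return data_[i * cols_ + j]; }

      const T& operator()(std::size_t i, std::size_t j) const
      { return data_[i * cols_ + j]; }

      std::size_t rows() const { return rows_; }
      std::size_t cols() const { return cols_; }

  private:
      std::size_t rows_, cols_;
      std::vector<T> data_;
  };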

----- Original Message -----
From: "DE" <satan66613@yandex.ru>
To: "vicente.botet" <boost@lists.boost.org>
Sent: Thursday, May 06, 2010 8:03 PM
Subject: Re: [boost] performance of a linear algebra/matrix library

> on 06.05.2010 at 21:44 vicente.botet wrote:
>> Why don't you use the same pointer arithmetic in C++?
> it's not easy because the data is hidden behind an abstraction
> that is, when you access an element like 'm(i, j)' it may be retrieved from some kind of storage or even computed on the fly

>> I thought you wanted performance :)
> performance is an important thing but i will not go C anyway

I was referring to Boost.LA. But if you need configurable sizes for your matrices...

>> Have you compared with uBlas?
> i haven't yet
> the uBLAS docs claim there is no abstraction penalty for relatively large matrices, but i don't know what kind of C code they compared against; maybe it is inefficient by the same factor as the ublas routines, giving apparently no abstraction penalty

The best thing to do is to check it yourself. It should not be too complex.

Vicente

on 06.05.2010 at 22:12 vicente.botet wrote:

>> performance is an important thing but i will not go C anyway
> I was referring to Boost.LA. But if you need configurable sizes for your matrices...

yes, i got it
i respect the boost la approach but it does not serve my purposes

>> i haven't yet
>> the uBLAS docs claim there is no abstraction penalty for relatively large matrices, but i don't know what kind of C code they compared against; maybe it is inefficient by the same factor as the ublas routines, giving apparently no abstraction penalty
> The best thing to do is to check it yourself. It should not be too complex.

i'll get to it eventually

-- Pavel
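[For readers who want to run the suggested comparison themselves: a rough sketch of timing a uBLAS product with prod and noalias from Boost.uBLAS; the fill code is omitted and the matrix size is arbitrary.]

  #include <boost/numeric/ublas/matrix.hpp>
  #include <cstddef>
  #include <ctime>
  #include <iostream>

  namespace ublas = boost::numeric::ublas;

  int main() {
      const std::size_t n = 512;
      ublas::matrix<double> m(n, n), m1(n, n), m2(n, n);
      // ... fill m1 and m2 ...

      const std::clock_t t0 = std::clock();
      noalias(m) = ublas::prod(m1, m2);  // noalias avoids an aliasing temporary
      const std::clock_t t1 = std::clock();

      std::cout << double(t1 - t0) / CLOCKS_PER_SEC << " s\n";
  }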

DE wrote:
> abstraction penalty i guess

If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp

> afaik boost LA concerns statically sized matrices
> that's not an option for me personally

There is Eigen, which is very fast and supports both statically-sized and dynamically-sized matrices:
http://eigen.tuxfamily.org/index.php?title=Main_Page

on 06.05.2010 at 22:48 Kenny Riddile wrote:

> DE wrote:
>> abstraction penalty i guess
> If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
> http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp

thanks, i'll certainly try it

>> afaik boost LA concerns statically sized matrices
>> that's not an option for me personally
> There is Eigen, which is very fast and supports both statically-sized and dynamically-sized matrices:
> http://eigen.tuxfamily.org/index.php?title=Main_Page

eigen performance looks impressive but i don't quite like the design (well, it's a matter of taste)

-- Pavel

on 06.05.2010 at 22:48 Kenny Riddile wrote:

> If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
> http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp

i've run the code for my everyday compiler (msvc80) and it reported no abstraction penalty
i've also tried icc11 and it showed a minor abstraction penalty (mean = 1.17)

accidentally i tried to compile my code with icc11 and -- it's a miracle! -- the test showed no abstraction penalty
i believe the implementation in fact works just as i expected
so i guess if one wants speed one should get a better compiler

for the curious, here are some results:

  n      |    2 |    4 |    8 |   16 |   32 |   64 |  128 |  256 |  512 | 1024 | 2048
  msvc80 | 1.77 | 1.57 | 1.34 | 1.12 | 1.05 | 1.12 | 1.32 | 1.33 | 1.23 | 1.23 | 1.23
  icc11  | 1.53 | 1.32 | 1.26 | 1.55 | 2.17 | 1.07 | 1.02 | 0.98 | 1.00 | 1.03 | 1.02

'n' is the matrices' dimension; the numbers in the latter two rows are ratios of the runtime of the C++ code to that of the C code (time_cpp/time_c)
'msvc80' denotes the time ratio for the msvc version, 'icc11' the one for the icc version

-- Pavel

On Friday 07 May 2010 16:06:41 DE wrote:

> on 06.05.2010 at 22:48 Kenny Riddile wrote:
>> If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
>> http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp
> i've run the code for my everyday compiler (msvc80) and it reported no abstraction penalty
> i've also tried icc11 and it showed a minor abstraction penalty (mean = 1.17)

i'm trying gcc 4.4.3 here:

  optimization | total | penalty
  -O0          | 32.15 |   5.29
  -O1          |  1.02 |   0.39
  -O2          |  0.86 |   0.94
  -O3          |  1.09 |   0.39
  -Os          |  0.96 |   0.44

Do i understand that right, or does gcc produce better code when you abstract away?

On 05/07/2010 07:57 AM, Marius Stoica wrote:

> On Friday 07 May 2010 16:06:41 DE wrote:
>> on 06.05.2010 at 22:48 Kenny Riddile wrote:
>>> If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
>>> http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp
>> i've run the code for my everyday compiler (msvc80) and it reported no abstraction penalty
>> i've also tried icc11 and it showed a minor abstraction penalty (mean = 1.17)
>
> i'm trying gcc 4.4.3 here:
>
>   optimization | total | penalty
>   -O0          | 32.15 |   5.29
>   -O1          |  1.02 |   0.39
>   -O2          |  0.86 |   0.94
>   -O3          |  1.09 |   0.39
>   -Os          |  0.96 |   0.44
>
> Do i understand that right, or does gcc produce better code when you abstract away?

[Continuing the slightly OT discussion...]

I noticed this, too, just with the default release settings on MSVC9. I can only think to attribute this to iterating through indices in the C-style version, rather than through iterators...? I haven't looked closely at it at all, though.

- Jeff

On 05/07/2010 03:06 PM, DE wrote:

> on 06.05.2010 at 22:48 Kenny Riddile wrote:
>> If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
>> http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp
> i've run the code for my everyday compiler (msvc80) and it reported no abstraction penalty
> i've also tried icc11 and it showed a minor abstraction penalty (mean = 1.17)
> accidentally i tried to compile my code with icc11 and -- it's a miracle! -- the test showed no abstraction penalty
> i believe the implementation in fact works just as i expected
> so i guess if one wants speed one should get a better compiler
> for the curious, here are some results:
>
>   n      |    2 |    4 |    8 |   16 |   32 |   64 |  128 |  256 |  512 | 1024 | 2048
>   msvc80 | 1.77 | 1.57 | 1.34 | 1.12 | 1.05 | 1.12 | 1.32 | 1.33 | 1.23 | 1.23 | 1.23
>   icc11  | 1.53 | 1.32 | 1.26 | 1.55 | 2.17 | 1.07 | 1.02 | 0.98 | 1.00 | 1.03 | 1.02
>
> 'n' is the matrices' dimension; the numbers in the latter two rows are ratios of the runtime of the C++ code to that of the C code (time_cpp/time_c)
> 'msvc80' denotes the time ratio for the msvc version, 'icc11' the one for the icc version

Hi,

maybe this is a stupid question: Do you average the performance results? When I run AbstractionPenaltyBenchmark.cpp the results differ from run to run by ca. 30%, such that a single result is not significant. I usually average them over 100-1000 runs until they have converged.

Karsten

on 08.05.2010 at 17:19 Karsten Ahnert wrote:

> maybe this is a stupid question: Do you average the performance results? When I run AbstractionPenaltyBenchmark.cpp the results differ from run to run by ca. 30%, such that a single result is not significant. I usually average them over 100-1000 runs until they have converged.

yes, i average the results, but i only need an order of magnitude and not a perfectly precise number, so the number of test runs is small

-- Pavel
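[A minimal sketch of the averaging Karsten describes: time many repetitions and report the mean; run_once() is a placeholder for the code under test, not anything from the thread.]

  #include <ctime>

  void run_once() { /* the code under test, e.g. one matrix product */ }

  // single runs can scatter by ~30%, so average over many repetitions
  double average_seconds(int runs) {
      const std::clock_t t0 = std::clock();
      for (int r = 0; r < runs; ++r)
          run_once();
      const std::clock_t t1 = std::clock();
      return double(t1 - t0) / CLOCKS_PER_SEC / runs;
  }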

DE wrote:
> hi all
> i work on a linear algebra library with the intent to propose it for inclusion into the boost collection of libs
> so my question is: how fast should such a lib be to make everybody happy?
> i found that to increase the performance of a generic linear algebra library one must complicate the implementation to a very high degree (though this statement should be obvious to everyone)
> i struggled with poor performance of matrix multiplication and finally identified the bottleneck
> i improved the implementation (seriously complicating it; i'm afraid i will forget how it works in a month) and now it performs some 33% slower than C code for virtually any size of matrix

Sounds like you just thrash the cache... doing a matrix product in i,j,k order is, well, bad, as all grad students should know...

--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35
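[For readers who have not seen why i,j,k order is bad: with row-major storage the innermost k loop strides through a2 column-wise, touching a new cache line on every iteration; reordering to i,k,j makes the inner loop walk rows contiguously. A sketch, reusing the declarations from the first message (note the i,k,j variant needs 'a' zeroed first):]

  // i,j,k: the inner loop reads a2[k*n + j] with a stride of n doubles,
  // which thrashes the cache once the matrices outgrow it
  for (size_t i = 0; i < n; ++i)
      for (size_t j = 0; j < n; ++j) {
          type s = 0.;
          for (size_t k = 0; k < n; ++k)
              s += a1[i*n + k] * a2[k*n + j];
          a[i*n + j] = s;
      }

  // i,k,j: the inner loop reads a row of a2 and updates a row of a,
  // both contiguously (the C code in the first message already does this)
  for (size_t i = 0; i < n; ++i)
      for (size_t k = 0; k < n; ++k) {
          const type a1_ik = a1[i*n + k];
          for (size_t j = 0; j < n; ++j)
              a[i*n + j] += a1_ik * a2[k*n + j];
      }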

On May 6, 2010, at 11:54 AM, joel falcou wrote:

> DE wrote:
>> i found that to increase the performance of a generic linear algebra library one must complicate the implementation to a very high degree (though this statement should be obvious to everyone)
>> i struggled with poor performance of matrix multiplication and finally identified the bottleneck
>> i improved the implementation (seriously complicating it; i'm afraid i will forget how it works in a month) and now it performs some 33% slower than C code for virtually any size of matrix
>
> Sounds like you just thrash the cache... doing a matrix product in i,j,k order is, well, bad, as all grad students should know...

Perhaps section 6.2 of Ulrich's paper would offer some ideas?

http://people.redhat.com/drepper/cpumemory.pdf

-- Noel

on 06.05.2010 at 21:54 joel falcou wrote:

> Sounds like you just thrash the cache... doing a matrix product in i,j,k order is, well, bad, as all grad students should know...

i believe that for small matrices i,j,k is as efficient as any other order
for relatively large matrices i go i,k,j (to minimize cache issues)
but still, that may be the case
i haven't tried tiling yet, maybe it will save the day

-- Pavel
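[In case it helps, a sketch of the tiling idea, again reusing the declarations from the first message: process the product in B-by-B blocks so each block's working set stays cache-resident. The block size B is a tunable guess, not a measured optimum, and 'a' must be zeroed first.]

  const size_t B = 64;  // tile edge; tune so a few B*B tiles fit in cache

  for (size_t i = 0; i < n*n; ++i)
      a[i] = 0.;

  for (size_t ii = 0; ii < n; ii += B)
      for (size_t kk = 0; kk < n; kk += B)
          for (size_t jj = 0; jj < n; jj += B)
              // i,k,j order within the tile keeps every inner access contiguous
              for (size_t i = ii; i < ii + B && i < n; ++i)
                  for (size_t k = kk; k < kk + B && k < n; ++k) {
                      const type a1_ik = a1[i*n + k];
                      for (size_t j = jj; j < jj + B && j < n; ++j)
                          a[i*n + j] += a1_ik * a2[k*n + j];
                  }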

On 05/06/2010 01:02 PM, DE wrote:

> so my question is: how fast should such a lib be to make everybody happy?

I think this is the wrong question to ask. You can't "make everybody happy". Rather than attempt that, you should think very hard about what targets (users, architectures, use-cases) are most important to you, and then design and optimize for those. (I'm particularly thinking of multi-core and similar platforms, where implicitly parallelizable code is key to acceptance.)

> as you can see, starting with 8 by 8 matrices the C++ code is roughly 33% slower than the C code
> is that enough to make you (the reader) happy?

Definitely not.

Stefan

--
...ich hab' noch einen Koffer in Berlin...

on 07.05.2010 at 1:32 Stefan Seefeld wrote:

> I think this is the wrong question to ask. You can't "make everybody happy". Rather than attempt that, you should think very hard about what targets (users, architectures, use-cases) are most important to you, and then design and optimize for those.

i do my best

> (I'm particularly thinking of multi-core and similar platforms, where implicitly parallelizable code is key to acceptance.)

this will be the next concern after the design is finished

-- Pavel

On 05/07/2010 09:10 AM, DE wrote:

>> (I'm particularly thinking of multi-core and similar platforms, where implicitly parallelizable code is key to acceptance.)
> this will be the next concern after the design is finished

I doubt you can separate such concerns from the design.

Stefan

--
...ich hab' noch einen Koffer in Berlin...

for those who still care: i investigated the "issue" which i claimed was due to abstraction penalty

it seemed to me the error was in the source code... indeed it was in the code... of my DNA

it turned out the 33% performance difference is due to loop unrolling in the C code, while the loop in the C++ code was not unrolled; in all other respects the two loops were identical (the abstraction in the C++ code was completely optimized away)
so it's not that the C++ code ran slow, but that the optimized C code ran faster
i unrolled the loop in the C++ code manually and the two started to run in the same time (it seems now it was a cpu pipeline issue)

furthermore, i looked at the assembly code generated by icc11 and was shocked
icc not only optimized the abstraction away but also unrolled both loops (the C and C++ ones) AND vectorized them
that is, both the plain C and the C++ pieces of code were transformed into instruction sequences like

  movsd
  movhpd
  mulpd
  movapd
  // etc.

as a result, the icc-generated code ran 15% faster than both the msvc80 and msvc10 versions

here a question arises: since a compiler is able to generate very fast code involving simd instructions, is one supposed to provide a simd-enabled implementation of a generic library? personally i now think it is worthless

-- Pavel
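[To illustrate the manual unrolling: a sketch of the innermost loop from the first message unrolled by four, with a remainder loop for n not divisible by 4. The factor of four is an assumption, not necessarily what msvc80 picked for the C version; a, a2_k, a1_ik, and n come from the enclosing loops of the original code.]

  // replaces: for (size_t j = 0; j < n; ++j, ++a, ++a2_k) *a += a1_ik*(*a2_k);
  size_t j = 0;
  for (; j + 4 <= n; j += 4, a += 4, a2_k += 4) {
      a[0] += a1_ik * a2_k[0];
      a[1] += a1_ik * a2_k[1];
      a[2] += a1_ik * a2_k[2];
      a[3] += a1_ik * a2_k[3];
  }
  for (; j < n; ++j, ++a, ++a2_k)  // remainder
      *a += a1_ik * (*a2_k);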

DE wrote:
> here a question arises: since a compiler is able to generate very fast code involving simd instructions, is one supposed to provide a simd-enabled implementation of a generic library? personally i now think it is worthless

Yeah, sure. Let's look at how well icc is able to vectorize arbitrary large arithmetic functions (like cos, sqrt, etc. that don't have a 1-to-1 SIMD intrinsic mapping).

Another thing: if you work on dynamic memory, the compiler won't be doing any vectorization, as the memory may not be aligned properly.

Add to that loops with dependencies that a human knows how to rewrite (but the compiler doesn't), arbitrary function support, SoA support for things like complex or other ADS...

Oh, and if your code was right, then both would have been inlined. That's what happens in our library (which led us to remove the unroll settings).

--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35
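[One way to give the vectorizer the alignment guarantee joel is talking about: a sketch using _mm_malloc from the SSE intrinsics headers, which (unlike plain new) lets you request 16-byte alignment so the compiler can emit aligned loads like movapd. Header placement of _mm_malloc varies a little between compilers.]

  #include <xmmintrin.h>  // _mm_malloc / _mm_free
  #include <cstddef>

  int main() {
      const std::size_t n = 1024;

      // request 16-byte alignment, the SSE vector width
      double* a = static_cast<double*>(_mm_malloc(n * n * sizeof(double), 16));

      // ... use a ...

      _mm_free(a);  // must pair with _mm_malloc, not delete[]
  }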

on 11.05.2010 at 20:20 joel falcou wrote:

> Yeah, sure. Let's look at how well icc is able to vectorize arbitrary large arithmetic functions (like cos, sqrt, etc. that don't have a 1-to-1 SIMD intrinsic mapping).

i think i agree

> Another thing: if you work on dynamic memory, the compiler won't be doing any vectorization, as the memory may not be aligned properly.

icc11 does it for dynamically allocated memory, both for the C and the C++ code
i think it's thanks to 'new', which allocates memory 8-byte aligned

> Add to that loops with dependencies that a human knows how to rewrite (but the compiler doesn't),

maybe

> arbitrary function support, SoA support for things like complex or other ADS...

what's SoA? and ADS?

> Oh, and if your code was right, then both would have been inlined. That's what happens in our library (which led us to remove the unroll settings).

don't get it

-- Pavel

DE wrote:

>> Another thing: if you work on dynamic memory, the compiler won't be doing any vectorization, as the memory may not be aligned properly.
> icc11 does it for dynamically allocated memory, both for the C and the C++ code
> i think it's thanks to 'new', which allocates memory 8-byte aligned

16 byte, but yeah, maybe indeed.

> what's SoA? and ADS?

Structure of Arrays, Abstract Data Structure

> don't get it

Your code should NOT preclude inlining. In our library, we still get automatic inlining from the compiler.

--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35
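[To expand the SoA acronym with an example: for complex numbers, the usual array-of-structures layout interleaves real and imaginary parts, which gets in the way of straightforward vectorization; a structure-of-arrays layout keeps each component contiguous. A sketch; the complex_soa type is illustrative, not from any library in the thread.]

  #include <complex>
  #include <cstddef>
  #include <vector>

  // AoS: memory holds re0 im0 re1 im1 ..., so a loop over the real
  // parts alone has stride 2 and vectorizes poorly
  std::vector< std::complex<double> > aos(1024);

  // SoA: each component is a plain contiguous double array, so loops
  // over re or im vectorize like a plain array of doubles
  struct complex_soa {
      std::vector<double> re, im;
      explicit complex_soa(std::size_t n) : re(n), im(n) {}
  };
  complex_soa soa(1024);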

on 11.05.2010 at 22:17 joel falcou wrote:

>> what's SoA? and ADS?
> Structure of Arrays, Abstract Data Structure

that's non-trivial, i agree

>> don't get it
> Your code should NOT preclude inlining. In our library, we still get automatic inlining from the compiler.

the C++ code inlines perfectly with both compilers (msvc80+ and icc11)
icc even unrolls and vectorizes it

-- Pavel
participants (9)

- Belcourt, Kenneth
- DE
- Jeffrey Lee Hellrung, Jr.
- joel falcou
- Karsten Ahnert
- Kenny Riddile
- Marius Stoica
- Stefan Seefeld
- vicente.botet