performance of a linear algebra/matrix library

hi all

i work on a linear algebra library with the intent to propose it for inclusion into the boost collection of libs

so my question is: how fast should such a lib be to make everybody happy?

i found that to increase the performance of a generic linear algebra library one must complicate the implementation to a very high degree (though this statement should be obvious to everyone)

i struggled with poor performance of matrix multiplication and finally identified the bottleneck

i improved the implementation (seriously complicating it; i'm afraid i will forget how it works in a month) and now it performs some 33% slower than C code for virtually any size of matrix

the C code is as follows:

  typedef double type;

  const size_t n = 42;
  type *a  = new type[n*n],
       *a1 = new type[n*n],
       *a2 = new type[n*n];
  // filling 'a1' and 'a2'

  for (size_t i = 0; i < n; ++i)
  {
    type * const a_i = a + i*n;
    for (size_t j = 0; j < n; ++j)
      a_i[j] = 0.;
    for (size_t k = 0; k < n; ++k)
    {
      type *a = a_i;
      const type a1_ik = a1[i*n + k];
      const type *a2_k = a2 + k*n;
      for (size_t j = 0; j < n; ++j, ++a, ++a2_k)
        *a += a1_ik*(*a2_k);
    }
  }

the C++ code is just

  matrix<type> m(n, n), m1(n, n), m2(n, n);
  // filling 'm1' and 'm2'
  m.assign(m1, m2);

more detailed results: 'n' is the dimension of the matrices, 'ratio' is the runtime of the C++ code divided by the runtime of the C code (time_cpp/time_c)

  n     |    2 |    4 |    8 |   16 |   32 |   64 |  128 |  256 |  512 | 1024 | 2048
  ratio | 1.77 | 1.57 | 1.34 | 1.12 | 1.05 | 1.12 | 1.32 | 1.33 | 1.23 | 1.23 | 1.23

as you can see, starting with 8 by 8 matrices the C++ code is roughly 33% slower than the C code

is that enough to make you (the reader) happy?

-- Pavel

----- Original Message -----
From: "DE" <satan66613@yandex.ru>
To: <boost@lists.boost.org>
Sent: Thursday, May 06, 2010 7:02 PM
Subject: [boost] performance of a linear algebra/matrix library

> hi all
> i work on a linear algebra library with the intent to propose it for inclusion into the boost collection of libs
> so my question is: how fast should such a lib be to make everybody happy?
> i found that to increase the performance of a generic linear algebra library one must complicate the implementation to a very high degree (though this statement should be obvious to everyone)
> i struggled with poor performance of matrix multiplication and finally identified the bottleneck
> i improved the implementation (seriously complicating it; i'm afraid i will forget how it works in a month) and now it performs some 33% slower than C code for virtually any size of matrix
> <snip>
> the C++ code is just
>   matrix<type> m(n, n), m1(n, n), m2(n, n);
>   // filling 'm1' and 'm2'
>   m.assign(m1, m2);
> more detailed results: 'n' is the dimension of the matrices, 'ratio' is the runtime of the C++ code divided by the runtime of the C code (time_cpp/time_c)
>   n     |    2 |    4 |    8 |   16 |   32 |   64 |  128 |  256 |  512 | 1024 | 2048
>   ratio | 1.77 | 1.57 | 1.34 | 1.12 | 1.05 | 1.12 | 1.32 | 1.33 | 1.23 | 1.23 | 1.23
> as you can see, starting with 8 by 8 matrices the C++ code is roughly 33% slower than the C code
> is that enough to make you (the reader) happy?

Why is it slower? Or better: where do you spend more time than when doing it in C?

BTW, what does m.assign(m1, m2) do (I have not read the C code)?

What about Boost.LA?

Best,
Vicente

on 06.05.2010 at 21:24 vicente.botet wrote:

> Why is it slower? Or better: where do you spend more time than when doing it in C?

abstraction penalty i guess
i'm going to dig further into it
besides, in C you can do it in a very efficient way since it's all about pointer arithmetic

> BTW, what does m.assign(m1, m2) do (I have not read the C code)?

oops... 'm.assign(m1*m2)' of course

> What about Boost.LA?

afaik boost LA concerns statically sized matrices
that's not an option for me personally

-- Pavel

----- Original Message -----
From: "DE" <satan66613@yandex.ru>
To: "vicente.botet" <boost@lists.boost.org>
Sent: Thursday, May 06, 2010 7:37 PM
Subject: Re: [boost] performance of a linear algebra/matrix library

> on 06.05.2010 at 21:24 vicente.botet wrote:
>> Why is it slower? Or better: where do you spend more time than when doing it in C?
> abstraction penalty i guess
> i'm going to dig further into it
> besides, in C you can do it in a very efficient way since it's all about pointer arithmetic

Why don't you use the same pointer arithmetic in C++?

>> BTW, what does m.assign(m1, m2) do (I have not read the C code)?
> oops... 'm.assign(m1*m2)' of course

Aaah!

>> What about Boost.LA?
> afaik boost LA concerns statically sized matrices
> that's not an option for me personally

I thought you wanted performance :) Have you compared with uBlas?

Vicente

on 06.05.2010 at 21:44 vicente.botet wrote:

> Why don't you use the same pointer arithmetic in C++?

it's not easy because the data is hidden behind an abstraction
that is, when you access an element like 'm(i, j)' it may be retrieved from some kind of storage or even computed on the fly

> I thought you wanted performance :)

performance is an important thing but i will not go C anyway

> Have you compared with uBlas?

i haven't yet
the uBLAS docs claim there is no abstraction penalty for relatively large matrices, but i don't know what kind of C code they compared against; maybe it is inefficient by the same factor as the ublas routines, giving apparently no abstraction penalty

-- Pavel
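[To illustrate what "hidden behind an abstraction" means here: a minimal sketch of a dense, row-major matrix whose operator() hides the index arithmetic. The class is hypothetical, not Pavel's actual library; with a plain contiguous store like this, a good optimizer can usually reduce m(i, j) to the same pointer arithmetic as the C version.]

  #include <cstddef>
  #include <vector>

  // dense, row-major matrix; m(i, j) hides the storage and the
  // index arithmetic behind an accessor
  template<class T>
  class matrix {
  public:
      matrix(std::size_t rows, std::size_t cols)
          : rows_(rows), cols_(cols), data_(rows * cols) {}

      T& operator()(std::size_t i, std::size_t j)
      { return data_[i * cols_ + j]; }

      const T& operator()(std::size_t i, std::size_t j) const
      { return data_[i * cols_ + j]; }

      std::size_t rows() const { return rows_; }
      std::size_t cols() const { return cols_; }

  private:
      std::size_t rows_, cols_;
      std::vector<T> data_;
  };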

----- Original Message -----
From: "DE" <satan66613@yandex.ru>
To: "vicente.botet" <boost@lists.boost.org>
Sent: Thursday, May 06, 2010 8:03 PM
Subject: Re: [boost] performance of a linear algebra/matrix library

> on 06.05.2010 at 21:44 vicente.botet wrote:
>> Why don't you use the same pointer arithmetic in C++?
> it's not easy because the data is hidden behind an abstraction
> that is, when you access an element like 'm(i, j)' it may be retrieved from some kind of storage or even computed on the fly

>> I thought you wanted performance :)
> performance is an important thing but i will not go C anyway

I was referring to Boost.LA. But if you need configurable sizes for your matrices...

>> Have you compared with uBlas?
> i haven't yet
> the uBLAS docs claim there is no abstraction penalty for relatively large matrices, but i don't know what kind of C code they compared against; maybe it is inefficient by the same factor as the ublas routines, giving apparently no abstraction penalty

The best thing to do is to check it yourself. It should not be too complex.

Vicente

on 06.05.2010 at 22:12 vicente.botet wrote:

>> performance is an important thing but i will not go C anyway
> I was referring to Boost.LA. But if you need configurable sizes for your matrices...

yes, i got it
i respect the boost la approach but it does not serve my purposes

>> i haven't yet
>> the uBLAS docs claim there is no abstraction penalty for relatively large matrices, but i don't know what kind of C code they compared against; maybe it is inefficient by the same factor as the ublas routines, giving apparently no abstraction penalty
> The best thing to do is to check it yourself. It should not be too complex.

i'll get to it eventually

-- Pavel
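[For readers who want to run the suggested comparison themselves: a rough sketch of timing a uBLAS product with prod and noalias from Boost.uBLAS; the fill code is omitted and the matrix size is arbitrary.]

  #include <boost/numeric/ublas/matrix.hpp>
  #include <cstddef>
  #include <ctime>
  #include <iostream>

  namespace ublas = boost::numeric::ublas;

  int main() {
      const std::size_t n = 512;
      ublas::matrix<double> m(n, n), m1(n, n), m2(n, n);
      // ... fill m1 and m2 ...

      const std::clock_t t0 = std::clock();
      noalias(m) = ublas::prod(m1, m2);  // noalias avoids an aliasing temporary
      const std::clock_t t1 = std::clock();

      std::cout << double(t1 - t0) / CLOCKS_PER_SEC << " s\n";
  }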

DE wrote:
> abstraction penalty i guess

If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp

> afaik boost LA concerns statically sized matrices
> that's not an option for me personally

There is Eigen, which is very fast and supports both statically-sized and dynamically-sized matrices:
http://eigen.tuxfamily.org/index.php?title=Main_Page

on 06.05.2010 at 22:48 Kenny Riddile wrote:

> DE wrote:
>> abstraction penalty i guess
> If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
> http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp

thanks, i'll certainly try it

>> afaik boost LA concerns statically sized matrices
>> that's not an option for me personally
> There is Eigen, which is very fast and supports both statically-sized and dynamically-sized matrices:
> http://eigen.tuxfamily.org/index.php?title=Main_Page

eigen performance looks impressive but i don't quite like the design (well, it's a matter of taste)

-- Pavel

on 06.05.2010 at 22:48 Kenny Riddile wrote:

> If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
> http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp

i've run the code for my everyday compiler (msvc80) and it reported no abstraction penalty
i've also tried icc11 and it showed a minor abstraction penalty (mean = 1.17)

accidentally i tried to compile my code with icc11 and -- it's a miracle! -- the test showed no abstraction penalty
i believe the implementation in fact works just as i expected
so i guess if one wants speed one should get a better compiler

for the curious, here are some results:

  n      |    2 |    4 |    8 |   16 |   32 |   64 |  128 |  256 |  512 | 1024 | 2048
  msvc80 | 1.77 | 1.57 | 1.34 | 1.12 | 1.05 | 1.12 | 1.32 | 1.33 | 1.23 | 1.23 | 1.23
  icc11  | 1.53 | 1.32 | 1.26 | 1.55 | 2.17 | 1.07 | 1.02 | 0.98 | 1.00 | 1.03 | 1.02

'n' is the matrices' dimension; the numbers in the latter two rows are ratios of the runtime of the C++ code to that of the C code (time_cpp/time_c)
'msvc80' denotes the time ratio for the msvc version, 'icc11' the one for the icc version

-- Pavel

On Friday 07 May 2010 16:06:41 DE wrote:

> on 06.05.2010 at 22:48 Kenny Riddile wrote:
>> If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
>> http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp
> i've run the code for my everyday compiler (msvc80) and it reported no abstraction penalty
> i've also tried icc11 and it showed a minor abstraction penalty (mean = 1.17)

i'm trying gcc 4.4.3 here:

  optimization | total | penalty
  -O0          | 32.15 |   5.29
  -O1          |  1.02 |   0.39
  -O2          |  0.86 |   0.94
  -O3          |  1.09 |   0.39
  -Os          |  0.96 |   0.44

Do i understand that right, or does gcc produce better code when you abstract away?

On 05/07/2010 07:57 AM, Marius Stoica wrote:

> On Friday 07 May 2010 16:06:41 DE wrote:
>> on 06.05.2010 at 22:48 Kenny Riddile wrote:
>>> If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
>>> http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp
>> i've run the code for my everyday compiler (msvc80) and it reported no abstraction penalty
>> i've also tried icc11 and it showed a minor abstraction penalty (mean = 1.17)
>
> i'm trying gcc 4.4.3 here:
>
>   optimization | total | penalty
>   -O0          | 32.15 |   5.29
>   -O1          |  1.02 |   0.39
>   -O2          |  0.86 |   0.94
>   -O3          |  1.09 |   0.39
>   -Os          |  0.96 |   0.44
>
> Do i understand that right, or does gcc produce better code when you abstract away?

[Continuing the slightly OT discussion...]

I noticed this, too, just with the default release settings on MSVC9. I can only think to attribute this to iterating through indices in the C-style version, rather than through iterators...? I haven't looked closely at it at all, though.

- Jeff

On 05/07/2010 03:06 PM, DE wrote:

> on 06.05.2010 at 22:48 Kenny Riddile wrote:
>> If you mean a language-imposed, inherent abstraction penalty, then I doubt it (certainly not 33%). You can find out for yourself though:
>> http://www.stepanovpapers.com/AbstractionPenaltyBenchmark.cpp
> i've run the code for my everyday compiler (msvc80) and it reported no abstraction penalty
> i've also tried icc11 and it showed a minor abstraction penalty (mean = 1.17)
> accidentally i tried to compile my code with icc11 and -- it's a miracle! -- the test showed no abstraction penalty
> i believe the implementation in fact works just as i expected
> so i guess if one wants speed one should get a better compiler
> for the curious, here are some results:
>
>   n      |    2 |    4 |    8 |   16 |   32 |   64 |  128 |  256 |  512 | 1024 | 2048
>   msvc80 | 1.77 | 1.57 | 1.34 | 1.12 | 1.05 | 1.12 | 1.32 | 1.33 | 1.23 | 1.23 | 1.23
>   icc11  | 1.53 | 1.32 | 1.26 | 1.55 | 2.17 | 1.07 | 1.02 | 0.98 | 1.00 | 1.03 | 1.02
>
> 'n' is the matrices' dimension; the numbers in the latter two rows are ratios of the runtime of the C++ code to that of the C code (time_cpp/time_c)
> 'msvc80' denotes the time ratio for the msvc version, 'icc11' the one for the icc version

Hi,

maybe this is a stupid question: Do you average the performance results? When I run AbstractionPenaltyBenchmark.cpp the results differ from run to run by ca. 30%, such that a single result is not significant. I usually average them over 100-1000 runs until they have converged.

Karsten

on 08.05.2010 at 17:19 Karsten Ahnert wrote:

> maybe this is a stupid question: Do you average the performance results? When I run AbstractionPenaltyBenchmark.cpp the results differ from run to run by ca. 30%, such that a single result is not significant. I usually average them over 100-1000 runs until they have converged.

yes, i average the results, but i only need an order of magnitude and not a perfectly precise number, so the number of test runs is small

-- Pavel
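[A minimal sketch of the averaging Karsten describes: time many repetitions and report the mean; run_once() is a placeholder for the code under test, not anything from the thread.]

  #include <ctime>

  void run_once() { /* the code under test, e.g. one matrix product */ }

  // single runs can scatter by ~30%, so average over many repetitions
  double average_seconds(int runs) {
      const std::clock_t t0 = std::clock();
      for (int r = 0; r < runs; ++r)
          run_once();
      const std::clock_t t1 = std::clock();
      return double(t1 - t0) / CLOCKS_PER_SEC / runs;
  }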

DE wrote:
> hi all
> i work on a linear algebra library with the intent to propose it for inclusion into the boost collection of libs
> so my question is: how fast should such a lib be to make everybody happy?
> i found that to increase the performance of a generic linear algebra library one must complicate the implementation to a very high degree (though this statement should be obvious to everyone)
> i struggled with poor performance of matrix multiplication and finally identified the bottleneck
> i improved the implementation (seriously complicating it; i'm afraid i will forget how it works in a month) and now it performs some 33% slower than C code for virtually any size of matrix

Sounds like you just thrash the cache... doing a matrix product in i,j,k order is, well, bad, as all grad students should know...

--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35
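[For readers who have not seen why i,j,k order is bad: with row-major storage the innermost k loop strides through a2 column-wise, touching a new cache line on every iteration; reordering to i,k,j makes the inner loop walk rows contiguously. A sketch, reusing the declarations from the first message (note the i,k,j variant needs 'a' zeroed first):]

  // i,j,k: the inner loop reads a2[k*n + j] with a stride of n doubles,
  // which thrashes the cache once the matrices outgrow it
  for (size_t i = 0; i < n; ++i)
      for (size_t j = 0; j < n; ++j) {
          type s = 0.;
          for (size_t k = 0; k < n; ++k)
              s += a1[i*n + k] * a2[k*n + j];
          a[i*n + j] = s;
      }

  // i,k,j: the inner loop reads a row of a2 and updates a row of a,
  // both contiguously (the C code in the first message already does this)
  for (size_t i = 0; i < n; ++i)
      for (size_t k = 0; k < n; ++k) {
          const type a1_ik = a1[i*n + k];
          for (size_t j = 0; j < n; ++j)
              a[i*n + j] += a1_ik * a2[k*n + j];
      }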

On May 6, 2010, at 11:54 AM, joel falcou wrote:

> DE wrote:
>> i found that to increase the performance of a generic linear algebra library one must complicate the implementation to a very high degree (though this statement should be obvious to everyone)
>> i struggled with poor performance of matrix multiplication and finally identified the bottleneck
>> i improved the implementation (seriously complicating it; i'm afraid i will forget how it works in a month) and now it performs some 33% slower than C code for virtually any size of matrix
>
> Sounds like you just thrash the cache... doing a matrix product in i,j,k order is, well, bad, as all grad students should know...

Perhaps section 6.2 of Ulrich's paper would offer some ideas?

http://people.redhat.com/drepper/cpumemory.pdf

-- Noel

on 06.05.2010 at 21:54 joel falcou wrote:

> Sounds like you just thrash the cache... doing a matrix product in i,j,k order is, well, bad, as all grad students should know...

i believe that for small matrices i,j,k is as efficient as any other order
for relatively large matrices i go i,k,j (to minimize cache issues)
but still, that may be the case
i haven't tried tiling yet, maybe it will save the day

-- Pavel
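[In case it helps, a sketch of the tiling idea, again reusing the declarations from the first message: process the product in B-by-B blocks so each block's working set stays cache-resident. The block size B is a tunable guess, not a measured optimum, and 'a' must be zeroed first.]

  const size_t B = 64;  // tile edge; tune so a few B*B tiles fit in cache

  for (size_t i = 0; i < n*n; ++i)
      a[i] = 0.;

  for (size_t ii = 0; ii < n; ii += B)
      for (size_t kk = 0; kk < n; kk += B)
          for (size_t jj = 0; jj < n; jj += B)
              // i,k,j order within the tile keeps every inner access contiguous
              for (size_t i = ii; i < ii + B && i < n; ++i)
                  for (size_t k = kk; k < kk + B && k < n; ++k) {
                      const type a1_ik = a1[i*n + k];
                      for (size_t j = jj; j < jj + B && j < n; ++j)
                          a[i*n + j] += a1_ik * a2[k*n + j];
                  }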

On 05/06/2010 01:02 PM, DE wrote:

> so my question is: how fast should such a lib be to make everybody happy?

I think this is the wrong question to ask. You can't "make everybody happy". Rather than attempt that, you should think very hard about what targets (users, architectures, use-cases) are most important to you, and then design and optimize for those. (I'm particularly thinking of multi-core and similar platforms, where implicitly parallelizable code is key to acceptance.)

> as you can see, starting with 8 by 8 matrices the C++ code is roughly 33% slower than the C code
> is that enough to make you (the reader) happy?

Definitely not.

Stefan

--
...ich hab' noch einen Koffer in Berlin...

on 07.05.2010 at 1:32 Stefan Seefeld wrote:

> I think this is the wrong question to ask. You can't "make everybody happy". Rather than attempt that, you should think very hard about what targets (users, architectures, use-cases) are most important to you, and then design and optimize for those.

i do my best

> (I'm particularly thinking of multi-core and similar platforms, where implicitly parallelizable code is key to acceptance.)

this will be the next concern after the design is finished

-- Pavel

On 05/07/2010 09:10 AM, DE wrote:

>> (I'm particularly thinking of multi-core and similar platforms, where implicitly parallelizable code is key to acceptance.)
> this will be the next concern after the design is finished

I doubt you can separate such concerns from the design.

Stefan

--
...ich hab' noch einen Koffer in Berlin...

for those who still care: i investigated the "issue" which i claimed was due to abstraction penalty

it seemed to me the error was in the source code... indeed it was in the code... of my DNA

it turned out the 33% performance difference is due to loop unrolling in the C code, while the loop in the C++ code was not unrolled; in all other respects the two loops were identical (the abstraction in the C++ code was completely optimized away)
so it's not that the C++ code ran slow, but that the optimized C code ran faster
i unrolled the loop in the C++ code manually and the two started to run in the same time (it seems now it was a cpu pipeline issue)

furthermore, i looked at the assembly code generated by icc11 and was shocked
icc not only optimized the abstraction away but also unrolled both loops (the C and C++ ones) AND vectorized them
that is, both the plain C and the C++ pieces of code were transformed into instruction sequences like

  movsd
  movhpd
  mulpd
  movapd
  // etc.

as a result, the icc-generated code ran 15% faster than both the msvc80 and msvc10 versions

here a question arises: since a compiler is able to generate very fast code involving simd instructions, is one supposed to provide a simd-enabled implementation of a generic library? personally i now think it is worthless

-- Pavel
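[To illustrate the manual unrolling: a sketch of the innermost loop from the first message unrolled by four, with a remainder loop for n not divisible by 4. The factor of four is an assumption, not necessarily what msvc80 picked for the C version; a, a2_k, a1_ik, and n come from the enclosing loops of the original code.]

  // replaces: for (size_t j = 0; j < n; ++j, ++a, ++a2_k) *a += a1_ik*(*a2_k);
  size_t j = 0;
  for (; j + 4 <= n; j += 4, a += 4, a2_k += 4) {
      a[0] += a1_ik * a2_k[0];
      a[1] += a1_ik * a2_k[1];
      a[2] += a1_ik * a2_k[2];
      a[3] += a1_ik * a2_k[3];
  }
  for (; j < n; ++j, ++a, ++a2_k)  // remainder
      *a += a1_ik * (*a2_k);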

DE wrote:
> here a question arises: since a compiler is able to generate very fast code involving simd instructions, is one supposed to provide a simd-enabled implementation of a generic library? personally i now think it is worthless

Yeah, sure. Let's look at how well icc is able to vectorize arbitrary large arithmetic functions (like cos, sqrt, etc. that don't have a 1-to-1 SIMD intrinsic mapping).

Another thing: if you work on dynamic memory, the compiler won't be doing any vectorization, as the memory may not be aligned properly.

Add to that loops with dependencies that a human knows how to rewrite (but the compiler doesn't), arbitrary function support, SoA support for things like complex or other ADS...

Oh, and if your code was right, then both would have been inlined. That's what happens in our library (which led us to remove the unroll settings).

--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35
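[One way to give the vectorizer the alignment guarantee joel is talking about: a sketch using _mm_malloc from the SSE intrinsics headers, which (unlike plain new) lets you request 16-byte alignment so the compiler can emit aligned loads like movapd. Header placement of _mm_malloc varies a little between compilers.]

  #include <xmmintrin.h>  // _mm_malloc / _mm_free
  #include <cstddef>

  int main() {
      const std::size_t n = 1024;

      // request 16-byte alignment, the SSE vector width
      double* a = static_cast<double*>(_mm_malloc(n * n * sizeof(double), 16));

      // ... use a ...

      _mm_free(a);  // must pair with _mm_malloc, not delete[]
  }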

on 11.05.2010 at 20:20 joel falcou wrote:

> Yeah, sure. Let's look at how well icc is able to vectorize arbitrary large arithmetic functions (like cos, sqrt, etc. that don't have a 1-to-1 SIMD intrinsic mapping).

i think i agree

> Another thing: if you work on dynamic memory, the compiler won't be doing any vectorization, as the memory may not be aligned properly.

icc11 does it for dynamically allocated memory, both for the C and the C++ code
i think it's thanks to 'new', which allocates memory 8-byte aligned

> Add to that loops with dependencies that a human knows how to rewrite (but the compiler doesn't),

maybe

> arbitrary function support, SoA support for things like complex or other ADS...

what's SoA? and ADS?

> Oh, and if your code was right, then both would have been inlined. That's what happens in our library (which led us to remove the unroll settings).

don't get it

-- Pavel

DE wrote:

>> Another thing: if you work on dynamic memory, the compiler won't be doing any vectorization, as the memory may not be aligned properly.
> icc11 does it for dynamically allocated memory, both for the C and the C++ code
> i think it's thanks to 'new', which allocates memory 8-byte aligned

16 byte, but yeah, maybe indeed.

> what's SoA? and ADS?

Structure of Arrays, Abstract Data Structure

> don't get it

Your code should NOT preclude inlining. In our library, we still get automatic inlining from the compiler.

--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35
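[To expand the SoA acronym with an example: for complex numbers, the usual array-of-structures layout interleaves real and imaginary parts, which gets in the way of straightforward vectorization; a structure-of-arrays layout keeps each component contiguous. A sketch; the complex_soa type is illustrative, not from any library in the thread.]

  #include <complex>
  #include <cstddef>
  #include <vector>

  // AoS: memory holds re0 im0 re1 im1 ..., so a loop over the real
  // parts alone has stride 2 and vectorizes poorly
  std::vector< std::complex<double> > aos(1024);

  // SoA: each component is a plain contiguous double array, so loops
  // over re or im vectorize like a plain array of doubles
  struct complex_soa {
      std::vector<double> re, im;
      explicit complex_soa(std::size_t n) : re(n), im(n) {}
  };
  complex_soa soa(1024);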

on 11.05.2010 at 22:17 joel falcou wrote:

>> what's SoA? and ADS?
> Structure of Arrays, Abstract Data Structure

that's non-trivial, i agree

>> don't get it
> Your code should NOT preclude inlining. In our library, we still get automatic inlining from the compiler.

the C++ code inlines perfectly with both compilers (msvc80+ and icc11)
icc even unrolls and vectorizes it

-- Pavel
participants (9)

- Belcourt, Kenneth
- DE
- Jeffrey Lee Hellrung, Jr.
- joel falcou
- Karsten Ahnert
- Kenny Riddile
- Marius Stoica
- Stefan Seefeld
- vicente.botet