
On May 6, 2010, at 11:54 AM, joel falcou wrote:
DE wrote:
I found that to increase the performance of a generic linear algebra library, one must complicate the implementation to a very high degree (though this should be obvious to everyone).
I struggled with poor matrix multiplication performance and finally identified the bottleneck.
I improved the implementation (seriously complicating it; I'm afraid I will forget how it works in a month), and now it performs some 33% slower than C code for virtually any matrix size.
Sounds like you are just thrashing the cache ... doing a matrix product in i,j,k order is, well, bad, as every grad student should know ...
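To make the loop-order point concrete, here is a minimal sketch, assuming square row-major matrices stored in a flat std::vector<double>. The names multiply_ijk and multiply_ikj are hypothetical and are not taken from DE's library. In the i,j,k version the inner loop reads b with a stride of n elements, so for large n almost every access lands on a different cache line. Swapping the two inner loops makes the inner loop walk both b and c contiguously.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption for this sketch: n x n matrices, row-major, in a flat vector.
using Matrix = std::vector<double>;

// Naive i,j,k order: the inner loop over k reads b[k * n + j],
// i.e. it walks down a column of b with stride n.
Matrix multiply_ijk(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}

// i,k,j order: the inner loop over j reads a row of b and updates a row
// of c, both with unit stride. a[i * n + k] is constant in the inner loop.
Matrix multiply_ikj(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}
```

Both functions compute the same product. How much faster the second one runs depends on matrix size, compiler, and hardware, so it is worth timing both on the target machine.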
Perhaps section 6.2 of Ulrich's paper would offer some ideas?
http://people.redhat.com/drepper/cpumemory.pdf
-- Noel
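One standard cache-oriented technique for this problem is loop tiling (blocking): work on small sub-blocks so the data being reused stays in cache. The sketch below is only an illustration of that idea, not code from the paper or from DE's library. It assumes square row-major matrices in a flat std::vector<double>; the name multiply_blocked and the default block size of 64 are hypothetical, and the block size is a tuning parameter that has to be measured per machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tiled matrix product for n x n row-major matrices (assumed layout).
// `block` is a hypothetical tuning parameter: the tile edge length.
std::vector<double> multiply_blocked(const std::vector<double>& a,
                                     const std::vector<double>& b,
                                     std::size_t n,
                                     std::size_t block = 64) {
    assert(block > 0);
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> c(n * n, 0.0);

    // The outer three loops pick a tile; the inner three loops multiply it.
    for (std::size_t ii = 0; ii < n; ii += block) {
        const std::size_t i_end = std::min(ii + block, n);
        for (std::size_t kk = 0; kk < n; kk += block) {
            const std::size_t k_end = std::min(kk + block, n);
            for (std::size_t jj = 0; jj < n; jj += block) {
                const std::size_t j_end = std::min(jj + block, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double aik = a[i * n + k];
                        // Unit-stride inner loop over one row segment.
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

The std::min calls handle sizes that are not a multiple of the block size, so the function works for any n.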