
Interesting - this definitely shows some of the pitfalls of simple performance testing. Here are my results:
...
I'm not sure why my simple matrix multiplication results are so much slower...the others are comparable.
This could very well be a cache size issue. Which processors exactly are you comparing here? You might want to get out your favourite performance analyzer and have a look at the L2 cache misses. Or play around with the matrix sizes, observe the FLOP rates and try to correlate them with the amount of memory worked on.
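To make the size sweep concrete, here is a rough sketch of such a harness. It is only an illustration, not the benchmark you ran: the names (naive_matmul, working_set_bytes, flop_count, measure, sweep) are made up, the element size of 8 bytes assumes doubles, and the FLOP count of 2*n^3 assumes the textbook triple loop. Note that in pure Python the interpreter overhead will swamp any cache effect, so the kernel below is just a stand-in; pass in a wrapper around the routine you are actually measuring.

```python
import time


def naive_matmul(a, b, n):
    """Textbook triple loop on row-major flat lists (stand-in kernel)."""
    c = [0.0] * (n * n)
    for i in range(n):
        for k in range(n):
            aik = a[i * n + k]
            for j in range(n):
                c[i * n + j] += aik * b[k * n + j]
    return c


def working_set_bytes(n, elem_size=8):
    # Three n x n matrices (A, B and C); elem_size=8 assumes doubles.
    return 3 * n * n * elem_size


def flop_count(n):
    # n^3 multiplications plus n^3 additions for the textbook algorithm.
    return 2 * n ** 3


def measure(n, kernel=naive_matmul, repeats=3):
    """Return the best observed rate in MFLOP/s for an n x n multiply."""
    a = [1.0] * (n * n)
    b = [2.0] * (n * n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        kernel(a, b, n)
        best = min(best, time.perf_counter() - t0)
    return flop_count(n) / best / 1e6


def sweep(sizes=(16, 32, 64)):
    """Return (n, working set in KiB, MFLOP/s) for each matrix size."""
    return [(n, working_set_bytes(n) / 1024.0, measure(n)) for n in sizes]


if __name__ == "__main__":
    for n, kib, mflops in sweep():
        print(f"n={n:4d}  working set={kib:8.1f} KiB  {mflops:8.2f} MFLOP/s")
```

With a compiled kernel plugged in, a drop in the rate once the working set outgrows the L2 size of one machine but not the other would point at the cache rather than at the code.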
In any case, the relative performance is obviously close enough to identical
Well, the relative performance doesn't say much unless the reference is already close to the theoretical peak for the underlying algorithm, processor, cache sizes, memory bandwidth etc.

Yours, Martin.