
On Thu, 2 Apr 2009, Joel Falcou wrote:
SIMD algorithms for double precision seem to be rather hard to do right. It's difficult to get the right precision with respect to the scalar reference as scalar algorithm take advantages of the internal 80 bits floating points register, thus leading comparison between our implementation and the reference to yields things like 3000 ulp (ie 10^-13 RMS instead of 10^16). ... Discussions welcome.
My understanding is that that the problem lies with Intel's 80-bit "internal" precision. I've seen people force a copy out of the FP registers to counteract this, but I forget the full logic behind why. Maybe just to achieve cross-platform repeatability. For your purposes, it might be best to have "slow, IEEE-compliant" scalar ops for checking results and "fast, Intel-specific" scalars for comparing timings. - Daniel