BOOST::SIMD - handling double : precision vs speed

SIMD algorithms for double precision seem to be rather hard to do right. It's difficult to match the precision of the scalar reference, as scalar algorithms take advantage of the internal 80-bit floating-point registers; comparisons between our implementation and the reference thus yield things like 3000 ulp (i.e. 10^-13 RMS instead of 10^-16). Fixing this is difficult, and even if it's possible for some algorithms, the average speed-up then drops to less than 10% - i.e. barely faster than an unrolled scalar call over the SIMD vector elements.

What should we enforce: precision or speed? Or is the 10^-13 RMS enough?

Of course, this is mostly a problem on current SIMD extensions, in which vectors of double have only 2 elements. It may be different with the upcoming AVX and Larrabee featuring larger vectors.

Discussions welcome.

--
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35
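The "3000 ulp" figure above can be quantified with a small helper. This is a minimal sketch (not Boost.SIMD code), assuming IEEE-754 binary64 and finite, non-NaN inputs:

```cpp
#include <cstdint>
#include <cstring>

// ULP distance between two doubles - the metric behind figures like
// "3000 ulp" when comparing a SIMD result against a scalar reference.
inline std::int64_t ulp_distance(double a, double b) {
    std::int64_t ia, ib;
    std::memcpy(&ia, &a, sizeof a);   // bit-level view without aliasing UB
    std::memcpy(&ib, &b, sizeof b);
    // Map the sign-magnitude bit patterns onto a monotonically
    // ordered integer range so subtraction counts representable values.
    if (ia < 0) ia = INT64_MIN - ia;
    if (ib < 0) ib = INT64_MIN - ib;
    return ia > ib ? ia - ib : ib - ia;
}
```

Averaging the squared ULP (or relative) differences over a test set gives the RMS deviation discussed in the thread.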

On Thu, 2 Apr 2009, Joel Falcou wrote:
SIMD algorithms for double precision seem to be rather hard to do right. It's difficult to match the precision of the scalar reference, as scalar algorithms take advantage of the internal 80-bit floating-point registers; comparisons between our implementation and the reference thus yield things like 3000 ulp (i.e. 10^-13 RMS instead of 10^-16). ... Discussions welcome.
My understanding is that the problem lies with Intel's 80-bit "internal" precision. I've seen people force a copy out of the FP registers to counteract this, but I forget the full logic behind why - maybe just to achieve cross-platform repeatability. For your purposes, it might be best to have "slow, IEEE-compliant" scalar ops for checking results and "fast, Intel-specific" scalars for comparing timings.

- Daniel
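The "force a copy out of the FP registers" trick Daniel mentions can be sketched as follows; interpreting it as a volatile spill is an assumption about what he means, but it is a common way to discard x87 excess precision:

```cpp
// Storing through a volatile forces the compiler to spill the x87
// register to memory, rounding an 80-bit intermediate down to a
// true 64-bit double before further use.
inline double force_double_rounding(double x) {
    volatile double v = x;  // the store discards the extra precision bits
    return v;               // reload the rounded 64-bit value
}
```

On x87 builds, wrapping intermediates this way makes the scalar reference reproduce plain 64-bit rounding at each step (at a real speed cost); on SSE-based builds it is a no-op.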

dherring@ll.mit.edu wrote:
For your purposes, it might be best to have "slow, IEEE-compliant" scalar ops for checking results and "fast, Intel-specific" scalars for comparing timings.

Yes, I agree. I am working with Joel, and this is one of the problems I am trying to address. It seems to me that we must have a speedy algorithm for SIMD double, because an accurate one will be slower than a mere scalar mapping through SIMD vectors. It also seems to me that SIMD is not yet mature for double: two elements are too few to hope for a big gain from branchless algorithms for math functions...

Jean-thierry

I just did some benchmarking with single and double precision using SSE and the CPU on a fairly recent machine. Only one SSE function is being used: _mm_sqrt_ps and _mm_sqrt_pd, respectively.

Using float I get: SSE: 328ms, CPU: 2890ms. Using double I get: SSE: 1188ms, CPU: 2875ms.

double is almost 4-times slower than float. Seems to me that one should favor float over double. I still would like to see boost.simd appear and do some more testing.

Regards,
Christian
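For reference, the operation being timed reduces to one hardware sqrt per vector; a minimal sketch of the two kernels (assuming SSE2 and unaligned data - Christian's actual harness is not shown in the thread):

```cpp
#include <emmintrin.h>  // SSE2: _mm_sqrt_ps (float) and _mm_sqrt_pd (double)

// One instruction computes 4 float square roots per vector...
inline void sqrt4f(const float* in, float* out) {
    _mm_storeu_ps(out, _mm_sqrt_ps(_mm_loadu_ps(in)));
}

// ...but only 2 double square roots, so doubles start with half
// the throughput before any extra precision work is counted.
inline void sqrt2d(const double* in, double* out) {
    _mm_storeu_pd(out, _mm_sqrt_pd(_mm_loadu_pd(in)));
}
```

This lane-count difference alone explains a factor of 2 in the 4x gap Christian measured, as Joel notes below.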

On Thu, 2 Apr 2009 22:13, Christian Henning wrote:
double is almost 4-times slower than float. Seems to me that one should favor float over double.

This x4 already contains a x2 from the difference in vector size. The rest is due to the internal algorithm to achieve correct precision.

I still would like to see boost.simd appear and do some more testing.

This won't prevent boost.simd from appearing at all; I was just checking some points ;)

dherring@ll.mit.edu wrote:
My understanding is that the problem lies with Intel's 80-bit "internal" precision. I've seen people force a copy out of the FP registers to counteract this, but I forget the full logic behind why. Maybe just to achieve cross-platform repeatability.
I suppose it's rather to have something that actually works. The way comparison works with floating point on Intel prevents simple things such as multimap from working properly in certain cases.

On Thursday 02 April 2009, Joel Falcou wrote:
SIMD algorithms for double precision seem to be rather hard to do right. It's difficult to match the precision of the scalar reference, as scalar algorithms take advantage of the internal 80-bit floating-point registers; comparisons between our implementation and the reference thus yield things like 3000 ulp (i.e. 10^-13 RMS instead of 10^-16).
Fixing this is difficult, and even if it's possible for some algorithms, the average speed-up then drops to less than 10% - i.e. as fast as an unrolled scalar call over the SIMD vector elements.
What should we enforce: precision or speed? Or is the 10^-13 RMS enough?
Why would you want your 64-bit SIMD floating-point calculations to act like they are going through an x86 processor's 80-bit floating-point unit? I think the important thing is just conformance to the IEEE floating-point standards. My impression was that some of the early generations of SIMD instructions were not compliant, but the newer versions all are.

If you're just worried about comparison against some non-SIMD reference code, maybe it would help to use compiler flags to disable internal 80-bit rounding when compiling the reference code (I think you can do this on gcc at least).
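The compiler flags Frank alludes to can be sketched concretely. The flag names below come from GCC's documentation; applying them to the reference build is an assumption about the test setup, not something stated in the thread:

```
# Build the scalar reference without x87 excess precision (GCC):
CXXFLAGS += -msse2 -mfpmath=sse      # do double math in SSE registers (64-bit rounding)
# Alternatives when the x87 unit must be used:
#   -ffloat-store                    # spill values after each assignment (slow)
#   -fexcess-precision=standard      # C99-style predictable rounding (newer GCC)
```

With the reference compiled this way, both sides round to 64 bits at each step, so the ULP comparison measures the SIMD algorithm's own error rather than the x87 excess-precision artifact.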
participants (7)
- Christian Henning
- dherring@ll.mit.edu
- Frank Mori Hess
- joel falcou
- Joel Falcou
- jtl
- Mathias Gaunard