
On Tue, Dec 25, 2012 at 7:10 PM, Joel Falcou <joel.falcou@gmail.com> wrote:
On 25/12/2012 15:43, Peter Dimov wrote:
Mathias Gaunard wrote:
The shifted iterator and the shifted load let you do aligned loads when you statically know the misalignment of the memory.
Does this have any performance advantage over just using an unaligned load? I'd expect the microcode to do whatever the shifted load does, but I haven't measured it.
A shifted load is a pair of aligned loads plus bit shuffling. The technique dates back to AltiVec. Experiments on 1D filtering with both approaches show some benefit over unaligned loads on pre-Nehalem CPUs.
AFAIK, even on post-Nehalem CPUs, unaligned loads (and stores) are slower when the access crosses a cache line boundary. I don't have numbers, though. Will shifted_iterator use palignr from SSSE3?
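
For concreteness, here is a minimal sketch of the "two aligned loads plus shuffling" idea discussed above, written with plain SSE2 intrinsics. It is not the library's implementation: the names shifted_load and shifted_load_matches_unaligned are hypothetical, and it assumes an x86 target with SSE2 and a buffer that is readable for 32 bytes from the aligned base. On SSSE3 the two byte shifts and the OR can be replaced by a single palignr, i.e. _mm_alignr_epi8(hi, lo, Offset).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: returns the 16 bytes at aligned_base + Offset using
// only aligned loads. Offset is the statically known misalignment (1..15);
// an offset of 0 would just be a plain aligned load.
template <int Offset>
inline __m128i shifted_load(const std::uint8_t* aligned_base)
{
    static_assert(Offset > 0 && Offset < 16, "Offset must be in 1..15");
    assert(reinterpret_cast<std::uintptr_t>(aligned_base) % 16 == 0);

    // Two aligned loads covering the 32 bytes that contain the target.
    const __m128i lo =
        _mm_load_si128(reinterpret_cast<const __m128i*>(aligned_base));
    const __m128i hi =
        _mm_load_si128(reinterpret_cast<const __m128i*>(aligned_base + 16));

    // Drop the first Offset bytes of lo, then fill the tail with the first
    // Offset bytes of hi.
    return _mm_or_si128(_mm_srli_si128(lo, Offset),
                        _mm_slli_si128(hi, 16 - Offset));
}

// Compares the shifted load against an ordinary unaligned load.
inline bool shifted_load_matches_unaligned()
{
    alignas(16) std::uint8_t buf[48];
    for (int i = 0; i < 48; ++i)
        buf[i] = static_cast<std::uint8_t>(i);

    const __m128i a = shifted_load<5>(buf);
    const __m128i b =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + 5));

    std::uint8_t out_a[16], out_b[16];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out_a), a);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out_b), b);
    return std::memcmp(out_a, out_b, 16) == 0;
}
```

Whether this beats a plain unaligned load depends on the CPU and on whether the access crosses a cache line, so it would need to be measured rather than assumed.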