
On Thu, Sep 8, 2011 at 4:42 AM, Pyry Jahkola <pyry.jahkola@iki.fi> wrote:
If you're doing a proper benchmark, Beman, I'd add one more trick to compare in the test:

inline void reorder(uint64_t source, uint64_t & target)
{
    uint64_t step32, step16;
    step32 = source << 32 | source >> 32;
    step16 = (step32 & 0x0000FFFF0000FFFF) << 16
           | (step32 & 0xFFFF0000FFFF0000) >> 16;
    target = (step16 & 0x00FF00FF00FF00FF) << 8
           | (step16 & 0xFF00FF00FF00FF00) >> 8;
}

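As a sanity check on the three-step swap quoted above, here is a minimal sketch that compares it against a plain byte-at-a-time loop. The names reorder_steps, reorder_bytewise, and check_reorder are hypothetical and not part of the original message, and the return-by-value signature is an assumption made to keep the check short.

```cpp
#include <cassert>
#include <cstdint>

// The three-step swap from the quoted message, returning by value
// (hypothetical name; the original writes into a reference parameter).
inline std::uint64_t reorder_steps(std::uint64_t source) {
    // Swap the two 32-bit halves.
    std::uint64_t step32 = source << 32 | source >> 32;
    // Swap the 16-bit halves inside each 32-bit half.
    std::uint64_t step16 = (step32 & 0x0000FFFF0000FFFFULL) << 16
                         | (step32 & 0xFFFF0000FFFF0000ULL) >> 16;
    // Swap the bytes inside each 16-bit unit.
    return (step16 & 0x00FF00FF00FF00FFULL) << 8
         | (step16 & 0xFF00FF00FF00FF00ULL) >> 8;
}

// Hypothetical reference: move one byte at a time, lowest byte first.
inline std::uint64_t reorder_bytewise(std::uint64_t source) {
    std::uint64_t result = 0;
    for (int i = 0; i < 8; ++i) {
        result = (result << 8) | (source & 0xFF);
        source >>= 8;
    }
    return result;
}

// Checks one known value and that both versions agree on a few inputs.
inline void check_reorder() {
    assert(reorder_steps(0x0102030405060708ULL) == 0x0807060504030201ULL);
    const std::uint64_t samples[] = {0ULL, 1ULL, 0xFFULL,
                                     0x8000000000000000ULL,
                                     0xFFFFFFFFFFFFFFFFULL,
                                     0x0123456789ABCDEFULL};
    for (std::uint64_t s : samples) {
        assert(reorder_steps(s) == reorder_bytewise(s));
    }
}
```

Each step halves the unit being swapped (32, 16, then 8 bits), so the whole reversal takes three mask-and-shift rounds instead of eight separate byte moves.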
Nice! Because my test setup is currently int32_t, I tested that flavor (and only on VC++10). Your approach is 27% faster. The assembly code is 8 instructions versus 12 instructions.

Here is the actual code tested:

inline int32_t by_return(int32_t x)
{
    return (static_cast<uint32_t>(x) << 24)
         | ((static_cast<uint32_t>(x) << 8) & 0x00ff0000)
         | ((static_cast<uint32_t>(x) >> 8) & 0x0000ff00)
         | (static_cast<uint32_t>(x) >> 24);
}

inline int32_t by_return_pyry(int32_t x)
{
    uint32_t step16;
    step16 = static_cast<uint32_t>(x) << 16 | static_cast<uint32_t>(x) >> 16;
    return ((static_cast<uint32_t>(step16) << 8) & 0xff00ff00)
         | ((static_cast<uint32_t>(step16) >> 8) & 0x00ff00ff);
}

The static_casts are important; the results are wrong without them, and fewer instructions are generated.

The Microsoft compiler is smart enough to fold "step16 = static_cast<uint32_t>(x) << 16 | static_cast<uint32_t>(x) >> 16;" into a single "rol ecx, 16" instruction.

Thanks,

--Beman
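For anyone wanting to confirm that the two 32-bit variants really compute the same thing, here is a minimal sketch of a check. The functions by_return and by_return_pyry are copied from the message above; by_bytes and variants_agree are hypothetical helpers added only for this check, and the sample values are arbitrary.

```cpp
#include <cassert>
#include <cstdint>

// Four-term version from the message above.
inline std::int32_t by_return(std::int32_t x) {
    return (static_cast<std::uint32_t>(x) << 24)
         | ((static_cast<std::uint32_t>(x) << 8) & 0x00ff0000)
         | ((static_cast<std::uint32_t>(x) >> 8) & 0x0000ff00)
         | (static_cast<std::uint32_t>(x) >> 24);
}

// Rotate-then-swap version from the message above.
inline std::int32_t by_return_pyry(std::int32_t x) {
    std::uint32_t step16;
    step16 = static_cast<std::uint32_t>(x) << 16
           | static_cast<std::uint32_t>(x) >> 16;
    return ((static_cast<std::uint32_t>(step16) << 8) & 0xff00ff00)
         | ((static_cast<std::uint32_t>(step16) >> 8) & 0x00ff00ff);
}

// Hypothetical reference: reverse the bytes one at a time on an
// unsigned copy, so no signed shift is involved.
inline std::int32_t by_bytes(std::int32_t x) {
    std::uint32_t u = static_cast<std::uint32_t>(x);
    std::uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        r = (r << 8) | (u & 0xFF);
        u >>= 8;
    }
    return static_cast<std::int32_t>(r);
}

// Returns true if all three versions agree on every sample,
// including negative inputs where signed shifts would differ.
inline bool variants_agree() {
    const std::int32_t samples[] = {0, 1, -1, -2, 0x01020304,
                                    0x7FFFFFFF, INT32_MIN};
    for (std::int32_t s : samples) {
        if (by_return(s) != by_bytes(s)) return false;
        if (by_return_pyry(s) != by_bytes(s)) return false;
    }
    return true;
}
```

One plausible reason the casts matter: right-shifting a negative int32_t is typically an arithmetic shift that copies the sign bit into the vacated high bits, so those bits can leak into the OR unless the value is first converted to uint32_t. The negative samples are there to exercise that case.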