Re: [boost] Boost.Endian comments

8 Sep 2011

On Thu, Sep 8, 2011 at 4:42 AM, Pyry Jahkola <pyry.jahkola@iki.fi> wrote:
...
If you're doing a proper benchmark, Beman, I'd add one more trick to compare
in the test:
  inline void reorder(uint64_t source, uint64_t & target)
  {
      uint64_t step32, step16;
      step32 = source << 32 | source >> 32;
      step16 = (step32 & 0x0000FFFF0000FFFF) << 16
             | (step32 & 0xFFFF0000FFFF0000) >> 16;
      target = (step16 & 0x00FF00FF00FF00FF) << 8
             | (step16 & 0xFF00FF00FF00FF00) >> 8;
  }
Nice!

Because my test setup is currently int32_t, I tested that flavor (and
only on VC++10). Your approach is 27% faster. The assembly code is 8
instructions versus 12 instructions.

Here is the actual code tested:

  inline int32_t by_return(int32_t x)
  {
    return (static_cast<uint32_t>(x) << 24)
      | ((static_cast<uint32_t>(x) << 8) & 0x00ff0000)
      | ((static_cast<uint32_t>(x) >> 8) & 0x0000ff00)
      | (static_cast<uint32_t>(x) >> 24);
  }

  inline int32_t by_return_pyry(int32_t x)
  {
    uint32_t step16;
    step16 = static_cast<uint32_t>(x) << 16 | static_cast<uint32_t>(x) >> 16;
    return
        ((static_cast<uint32_t>(step16) << 8) & 0xff00ff00)
      | ((static_cast<uint32_t>(step16) >> 8) & 0x00ff00ff);
  }
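
For anyone who wants to confirm that the two flavors agree before timing them, here is a minimal sketch of a sanity check. The two functions are the ones above, with explicit casts on the return value; reversals_agree and its sample values are hypothetical and not part of the benchmark. It assumes the usual two's-complement conversion from uint32_t to int32_t, which is only guaranteed from C++20 on.

```cpp
#include <cassert>
#include <cstdint>

// Shift-and-mask flavor: move each of the four bytes into place separately.
inline std::int32_t by_return(std::int32_t x)
{
    const std::uint32_t u = static_cast<std::uint32_t>(x);
    return static_cast<std::int32_t>(
        (u << 24)
      | ((u << 8) & 0x00ff0000u)
      | ((u >> 8) & 0x0000ff00u)
      | (u >> 24));
}

// Pyry's flavor: swap the 16-bit halves, then swap the bytes within each half.
inline std::int32_t by_return_pyry(std::int32_t x)
{
    const std::uint32_t u = static_cast<std::uint32_t>(x);
    const std::uint32_t step16 = u << 16 | u >> 16;
    return static_cast<std::int32_t>(
        ((step16 << 8) & 0xff00ff00u)
      | ((step16 >> 8) & 0x00ff00ffu));
}

// Hypothetical helper: true if both flavors agree on a few sample values
// and reversing twice gives back the input.
inline bool reversals_agree()
{
    const std::uint32_t samples[] = {
        0x00000000u, 0x01020304u, 0x80000001u, 0xFFFFFFFFu, 0xDEADBEEFu
    };
    for (std::uint32_t s : samples) {
        const std::int32_t x = static_cast<std::int32_t>(s);
        if (by_return(x) != by_return_pyry(x)) return false;
        if (by_return_pyry(by_return_pyry(x)) != x) return false;
    }
    return true;
}
```

Calling reversals_agree() once before the timing loop is enough; the samples include values with the sign bit set, which is where a missing cast would show up.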

The static_casts are important; the results are wrong without them,
and fewer instructions are generated. The Microsoft compiler is smart
enough to fold "step16 = static_cast<uint32_t>(x) << 16 |
static_cast<uint32_t>(x) >> 16;" into a single "rol ecx, 16"
instruction.
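
To illustrate why the casts matter, here is a small sketch of what happens to the top byte when the right shift is done on the signed value. The helper names are made up for this example. It assumes the compiler implements right shift of a negative int32_t as an arithmetic shift, which is implementation-defined before C++20 but is what VC++, GCC and Clang do.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: shifts the signed value directly. For negative x an
// arithmetic shift copies the sign bit into the vacated high bits.
inline std::uint32_t top_byte_signed(std::int32_t x)
{
    return static_cast<std::uint32_t>(x >> 24);
}

// Hypothetical helper: casts first, so the shift is logical and the vacated
// high bits are always zero.
inline std::uint32_t top_byte_unsigned(std::int32_t x)
{
    return static_cast<std::uint32_t>(x) >> 24;
}
```

With x = INT32_MIN (bit pattern 0x80000000), top_byte_unsigned returns 0x00000080, while an arithmetic shift in top_byte_signed yields 0xFFFFFF80. Those extra high bits would be OR-ed into the reversed result. For non-negative inputs the two helpers agree.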

Thanks,

--Beman

Beman Dawes