Boost.Endian comments

This is not a full review, but rather a few comments on the implementation of the reorder(x) functions. What you are doing is casting pointers around and assigning individual bytes, which might not be very efficient:

    inline void reorder(int64_t source, int64_t& target)
    {
        const char* s (reinterpret_cast<const char*>(&source));
        char* t (reinterpret_cast<char*>(&target) + sizeof(target) - 1);
        *t = *s;
        *--t = *++s;
        *--t = *++s;
        *--t = *++s;
        *--t = *++s;
        *--t = *++s;
        *--t = *++s;
        *--t = *++s;
    }

This does eight increments, eight decrements, one addition, one subtraction and eight assignments. Maybe something like:

    inline void reorder(int64_t source, int64_t& target)
    {
        target = ((source << 0x38) & 0xFF00000000000000)
               | ((source << 0x28) & 0x00FF000000000000)
               | ((source << 0x18) & 0x0000FF0000000000)
               | ((source << 0x08) & 0x000000FF00000000)
               | ((source >> 0x08) & 0x00000000FF000000)
               | ((source >> 0x18) & 0x0000000000FF0000)
               | ((source >> 0x28) & 0x000000000000FF00)
               | ((source >> 0x38) & 0x00000000000000FF);
    }

would be more efficient?

tymofey wrote:
(See my other recent post about performance first.) I've just added something like the above suggestion:

    #elif defined(USE_TYMOFEY)
        return (src<<24)
             | ((src<<8)  & 0x00ff0000)
             | ((src>>8)  & 0x0000ff00)
             | ((src>>24) & 0x000000ff);

The results are:

                     A      B
    USE_TYMOFEY    12.6    3.0

Impressively, on system B gcc is able to recognise that that expression is implemented by the rev instruction. On system A, the expected logical instructions are generated and they are faster than the bytewise loads and stores that Beman's code produces. Based on the similar speed, my guess is that this code is similar to what htonl() does.

Regards, Phil.

On Mon, Sep 5, 2011 at 2:56 PM, tymofey <tymofey@qip.ru> wrote:
Yes, but if I read the standard correctly, it is relying on undefined behavior. Both the C and C++ standards say:

    The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated
    bits are zero-filled. If E1 has an unsigned type, the value of the
    result is E1 × 2^E2, reduced modulo one more than the maximum value
    representable in the result type. Otherwise, if E1 has a signed type
    and non-negative value, and E1 × 2^E2 is representable in the result
    type, then that is the resulting value; otherwise, the behavior is
    undefined.

In practice I don't think this is a problem, but Boost code generally doesn't rely on code that has undefined behavior. Perhaps it would be OK to #ifdef on the architecture, and use shift-and-mask implementations only on architectures that don't do something weird (like cause an interrupt) in the case covered by the last sentence.

--Beman

On 2011-09-07 22:30:35 +0300, Beman Dawes said:
If you're doing a proper benchmark, Beman, I'd add one more trick to compare in the test:

    inline void reorder(uint64_t source, uint64_t & target)
    {
        uint64_t step32, step16;
        step32 = source << 32 | source >> 32;
        step16 = (step32 & 0x0000FFFF0000FFFF) << 16
               | (step32 & 0xFFFF0000FFFF0000) >> 16;
        target = (step16 & 0x00FF00FF00FF00FF) << 8
               | (step16 & 0xFF00FF00FF00FF00) >> 8;
    }

It's at least seemingly less work than the code above from tymofey. Of course, the compiler optimizations might end up making it worse.

PS. The code for 128-bit integers is one step more, etc.

--
Pyry Jahkola
pyry.jahkola@iki.fi

On Thu, Sep 8, 2011 at 4:42 AM, Pyry Jahkola <pyry.jahkola@iki.fi> wrote:
Nice! Because my test setup is currently int32_t, I tested that flavor (and only on VC++10). Your approach is 27% faster. The assembly code is 8 instructions versus 12 instructions. Here is the actual code tested:

    inline int32_t by_return(int32_t x)
    {
        return (static_cast<uint32_t>(x) << 24)
             | ((static_cast<uint32_t>(x) << 8) & 0x00ff0000)
             | ((static_cast<uint32_t>(x) >> 8) & 0x0000ff00)
             | (static_cast<uint32_t>(x) >> 24);
    }

    inline int32_t by_return_pyry(int32_t x)
    {
        uint32_t step16;
        step16 = static_cast<uint32_t>(x) << 16 | static_cast<uint32_t>(x) >> 16;
        return ((static_cast<uint32_t>(step16) << 8) & 0xff00ff00)
             | ((static_cast<uint32_t>(step16) >> 8) & 0x00ff00ff);
    }

The static_casts are important; the results are wrong without them, and fewer instructions are generated. The Microsoft compiler is smart enough to fold "step16 = static_cast<uint32_t>(x) << 16 | static_cast<uint32_t>(x) >> 16;" into a single "rol ecx, 16" instruction.

Thanks,

--Beman
participants (6)
-
Beman Dawes
-
Giovanni Piero Deretta
-
Phil Endecott
-
Pyry Jahkola
-
Tim Blechmann
-
tymofey