Boost.Endian comments

This is not a full review, but rather a few comments on the implementation of the reorder(x) functions. What you are doing is casting pointers around and assigning individual bytes, which might not really be efficient:

inline void reorder(int64_t source, int64_t& target)
{
    const char* s (reinterpret_cast<const char*>(&source));
    char * t (reinterpret_cast<char*>(&target) + sizeof(target) - 1);
    *t = *s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
}

It does eight increments, eight decrements, one addition, one subtraction and eight assignments. Maybe something like:

inline void reorder(int64_t source, int64_t& target)
{
    target = ((source << 0x38) & 0xFF00000000000000)
           | ((source << 0x28) & 0x00FF000000000000)
           | ((source << 0x18) & 0x0000FF0000000000)
           | ((source << 0x08) & 0x000000FF00000000)
           | ((source >> 0x08) & 0x00000000FF000000)
           | ((source >> 0x18) & 0x0000000000FF0000)
           | ((source >> 0x28) & 0x000000000000FF00)
           | ((source >> 0x38) & 0x00000000000000FF);
}

would be more efficient?

tymofey wrote:
inline void reorder(int64_t source, int64_t& target)
{
    target = ((source << 0x38) & 0xFF00000000000000)
           | ((source << 0x28) & 0x00FF000000000000)
           | ((source << 0x18) & 0x0000FF0000000000)
           | ((source << 0x08) & 0x000000FF00000000)
           | ((source >> 0x08) & 0x00000000FF000000)
           | ((source >> 0x18) & 0x0000000000FF0000)
           | ((source >> 0x28) & 0x000000000000FF00)
           | ((source >> 0x38) & 0x00000000000000FF);
}
would be more efficient?
(See my other recent post about performance first.)

I've just added something like the above suggestion:

#elif defined(USE_TYMOFEY)
    return (src<<24)
         | ((src<<8) & 0x00ff0000)
         | ((src>>8) & 0x0000ff00)
         | ((src>>24) & 0x000000ff);

The results are:

                A      B
USE_TYMOFEY    12.6    3.0

Impressively, on system B gcc is able to recognise that that expression is implemented by the rev instruction. On system A, the expected logical instructions are generated and they are faster than the bytewise loads and stores that Beman's code produces. Based on the similar speed, my guess is that this code is similar to what htonl() does.

Regards, Phil.
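For reference, the USE_TYMOFEY branch above is only one arm of Phil's benchmark harness; a self-contained sketch of the same 32-bit swap as a stand-alone function might look like the following (the name reorder32 and the unsigned src parameter are assumptions of mine, not taken from his code):

#include <cstdint>

// Shift-and-mask byte swap for 32-bit values, matching the USE_TYMOFEY
// branch above.  Doing the work on an unsigned type sidesteps the
// signed-shift concerns raised later in the thread.
inline uint32_t reorder32(uint32_t src)
{
    return (src << 24)
         | ((src << 8) & 0x00ff0000)
         | ((src >> 8) & 0x0000ff00)
         | ((src >> 24) & 0x000000ff);
}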

inline void reorder(int64_t source, int64_t& target)
{
    target = ((source << 0x38) & 0xFF00000000000000)
           | ((source << 0x28) & 0x00FF000000000000)
           | ((source << 0x18) & 0x0000FF0000000000)
           | ((source << 0x08) & 0x000000FF00000000)
           | ((source >> 0x08) & 0x00000000FF000000)
           | ((source >> 0x18) & 0x0000000000FF0000)
           | ((source >> 0x28) & 0x000000000000FF00)
           | ((source >> 0x38) & 0x00000000000000FF);
}
would be more efficient?
[snip]
Impressively, on system B gcc is able to recognise that that expression is implemented by the rev instruction.
On system A, the expected logical instructions are generated and they are faster than the bytewise loads and stores that Beman's code produces. Based on the similar speed, my guess is that this code is similar to what htonl() does.
i would assume that this algorithm is way more friendly for out-of-order machines, as it should make use of instruction level parallelism. it might be hard to verify the benefit with a synthetic benchmark, though ...

tim

Impressively, on system B gcc is able to recognise that that expression is implemented by the rev instruction.
On system A, the expected logical instructions are generated and they are faster than the bytewise loads and stores that Beman's code produces. Based on the similar speed, my guess is that this code is similar to what htonl() does.
i would assume that this algorithm is way more friendly for out-of-order machines, as it should make use of instruction level parallelism.
it might be hard to verify the benefit with a synthetic benchmark, though ...
I'm working on a benchmark that tries to mimic real-world use cases. Stay tuned...

--Beman

On Mon, Sep 5, 2011 at 2:56 PM, tymofey <tymofey@qip.ru> wrote:
This is not a full review, but rather a few comments on the implementation of the reorder(x) functions. What you are doing is casting pointers around and assigning individual bytes, which might not really be efficient:

inline void reorder(int64_t source, int64_t& target)
{
    const char* s (reinterpret_cast<const char*>(&source));
    char * t (reinterpret_cast<char*>(&target) + sizeof(target) - 1);
    *t = *s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
}

It does eight increments, eight decrements, one addition, one subtraction and eight assignments. Maybe something like:

inline void reorder(int64_t source, int64_t& target)
{
    target = ((source << 0x38) & 0xFF00000000000000)
           | ((source << 0x28) & 0x00FF000000000000)
           | ((source << 0x18) & 0x0000FF0000000000)
           | ((source << 0x08) & 0x000000FF00000000)
           | ((source >> 0x08) & 0x00000000FF000000)
           | ((source >> 0x18) & 0x0000000000FF0000)
           | ((source >> 0x28) & 0x000000000000FF00)
           | ((source >> 0x38) & 0x00000000000000FF);
}
would be more efficient?
Yes, but if I read the standard correctly, it is relying on undefined behavior.

Both the C and C++ standards say:

The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are zero-filled. If E1 has an unsigned type, the value of the result is E1 × 2 [to the power] E2, reduced modulo one more than the maximum value representable in the result type. Otherwise, if E1 has a signed type and non-negative value, and E1 × 2 [to the power] E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.

In practice I don't think this is a problem, but Boost code generally doesn't rely on code that has undefined behavior. Perhaps it would be OK to #ifdef on the architecture, and use shift-and-mask implementations only on architectures that don't do something weird (like cause an interrupt) in the case covered by the last sentence.

--Beman

On Wed, Sep 7, 2011 at 6:15 PM, Beman Dawes <bdawes@acm.org> wrote:
On Mon, Sep 5, 2011 at 2:56 PM, tymofey <tymofey@qip.ru> wrote:
This is not a full review, but rather a few comments on the implementation of the reorder(x) functions. What you are doing is casting pointers around and assigning individual bytes, which might not really be efficient:

inline void reorder(int64_t source, int64_t& target)
{
    const char* s (reinterpret_cast<const char*>(&source));
    char * t (reinterpret_cast<char*>(&target) + sizeof(target) - 1);
    *t = *s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
}

It does eight increments, eight decrements, one addition, one subtraction and eight assignments. Maybe something like:

inline void reorder(int64_t source, int64_t& target)
{
    target = ((source << 0x38) & 0xFF00000000000000)
           | ((source << 0x28) & 0x00FF000000000000)
           | ((source << 0x18) & 0x0000FF0000000000)
           | ((source << 0x08) & 0x000000FF00000000)
           | ((source >> 0x08) & 0x00000000FF000000)
           | ((source >> 0x18) & 0x0000000000FF0000)
           | ((source >> 0x28) & 0x000000000000FF00)
           | ((source >> 0x38) & 0x00000000000000FF);
}
would be more efficient?
Yes, but if I read the standard correctly, it is relying on undefined behavior.
Both the C and C++ standards say:
The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are zero-filled. If E1 has an unsigned type, the value of the result is E1 × 2 [to the power] E2, reduced modulo one more than the maximum value representable in the result type. Otherwise, if E1 has a signed type and non-negative value, and E1×2 [to the power] E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.
In practice I don't think this is a problem, but Boost code generally doesn't rely on code that has undefined behavior. Perhaps it would be OK to #ifdef on the architecture, and use shift-and-mask implementations only on architectures that don't do something weird (like cause an interrupt) in the case covered by the last sentence.
If you use uint64_t instead of int64_t (always the right thing when doing bit-twiddling), there shouldn't be any UB, right?

-- gpd

inline void reorder(int64_t source, int64_t& target)
{
    target = ((source << 0x38) & 0xFF00000000000000)
           | ((source << 0x28) & 0x00FF000000000000)
           | ((source << 0x18) & 0x0000FF0000000000)
           | ((source << 0x08) & 0x000000FF00000000)
           | ((source >> 0x08) & 0x00000000FF000000)
           | ((source >> 0x18) & 0x0000000000FF0000)
           | ((source >> 0x28) & 0x000000000000FF00)
           | ((source >> 0x38) & 0x00000000000000FF);
}
would be more efficient?
Yes, but if I read the standard correctly, it is relying on undefined behavior.
Both the C and C++ standards say:
The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are zero-filled. If E1 has an unsigned type, the value of the result is E1 × 2 [to the power] E2, reduced modulo one more than the maximum value representable in the result type. Otherwise, if E1 has a signed type and non-negative value, and E1×2 [to the power] E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.
In practice I don't think this is a problem, but Boost code generally doesn't rely on code that has undefined behavior. Perhaps it would be OK to #ifdef on the architecture, and use shift-and-mask implementations only on architectures that don't do something weird (like cause an interrupt) in the case covered by the last sentence.
If you use uint64_t instead of int64_t (always the right thing when doing bit-twiddling), there shouldn't be any UB, right?
Good point! Each case of "source" in the above code could be changed to "static_cast<uint32_t>(source)", and then the whole expression cast back to int32_t. AFAICS, that does eliminate the UB.

--Beman
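Spelling out that suggestion for the 64-bit overload under discussion (this is just my reading of it, with the casts widened to uint64_t to match the 64-bit operands; the hoisted local u is my own shorthand):

#include <cstdint>

// Keeps the int64_t interface but does all shifting and masking on an
// unsigned copy, converting back at the end, so no left shift is ever
// applied to a negative signed value.
inline void reorder(int64_t source, int64_t& target)
{
    const uint64_t u = static_cast<uint64_t>(source);
    target = static_cast<int64_t>(
          ((u << 0x38) & 0xFF00000000000000ULL)
        | ((u << 0x28) & 0x00FF000000000000ULL)
        | ((u << 0x18) & 0x0000FF0000000000ULL)
        | ((u << 0x08) & 0x000000FF00000000ULL)
        | ((u >> 0x08) & 0x00000000FF000000ULL)
        | ((u >> 0x18) & 0x0000000000FF0000ULL)
        | ((u >> 0x28) & 0x000000000000FF00ULL)
        | ((u >> 0x38) & 0x00000000000000FFULL));
}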

On 2011-09-07 22:30:35 +0300, Beman Dawes said:
inline void reorder(int64_t source, int64_t& target)
{
    target = ((source << 0x38) & 0xFF00000000000000)
           | ((source << 0x28) & 0x00FF000000000000)
           | ((source << 0x18) & 0x0000FF0000000000)
           | ((source << 0x08) & 0x000000FF00000000)
           | ((source >> 0x08) & 0x00000000FF000000)
           | ((source >> 0x18) & 0x0000000000FF0000)
           | ((source >> 0x28) & 0x000000000000FF00)
           | ((source >> 0x38) & 0x00000000000000FF);
}
If you use uint64_t instead of int64_t (always the right thing when doing bit-twidling), there shouldn't be any UB, right?
Good point! Each case of "source" in the above code could be changed to "static_cast<uint32_t>(source)", and then the whole expression cast back to int32_t.
If you're doing a proper benchmark, Beman, I'd add one more trick to compare in the test:

inline void reorder(uint64_t source, uint64_t& target)
{
    uint64_t step32, step16;
    step32 = source << 32 | source >> 32;
    step16 = (step32 & 0x0000FFFF0000FFFF) << 16
           | (step32 & 0xFFFF0000FFFF0000) >> 16;
    target = (step16 & 0x00FF00FF00FF00FF) << 8
           | (step16 & 0xFF00FF00FF00FF00) >> 8;
}

It's at least seemingly less work than the code above from tymofey. Of course, the compiler optimizations might end up making it worse.

PS. The code for 128-bit integers is one step more, etc.

--
Pyry Jahkola
pyry.jahkola@iki.fi
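To see the halving pattern in action, a quick stand-alone sanity check (the test value and main() are mine, not part of Pyry's post, and assume the reorder above is in scope) could be:

#include <cassert>
#include <cstdint>

int main()
{
    // 0x0102030405060708 byte-reversed is 0x0807060504030201.
    uint64_t out = 0;
    reorder(0x0102030405060708ULL, out);
    assert(out == 0x0807060504030201ULL);
}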

On Thu, Sep 8, 2011 at 4:42 AM, Pyry Jahkola <pyry.jahkola@iki.fi> wrote:
If you're doing a proper benchmark, Beman, I'd add one more trick to compare in the test:
inline void reorder(uint64_t source, uint64_t& target)
{
    uint64_t step32, step16;
    step32 = source << 32 | source >> 32;
    step16 = (step32 & 0x0000FFFF0000FFFF) << 16
           | (step32 & 0xFFFF0000FFFF0000) >> 16;
    target = (step16 & 0x00FF00FF00FF00FF) << 8
           | (step16 & 0xFF00FF00FF00FF00) >> 8;
}
Nice! Because my test setup is currently int32_t, I tested that flavor (and only on VC++10). Your approach is 27% faster. The assembly code is 8 instructions versus 12 instructions.

Here is the actual code tested:

inline int32_t by_return(int32_t x)
{
    return (static_cast<uint32_t>(x) << 24)
         | ((static_cast<uint32_t>(x) << 8) & 0x00ff0000)
         | ((static_cast<uint32_t>(x) >> 8) & 0x0000ff00)
         | (static_cast<uint32_t>(x) >> 24);
}

inline int32_t by_return_pyry(int32_t x)
{
    uint32_t step16;
    step16 = static_cast<uint32_t>(x) << 16 | static_cast<uint32_t>(x) >> 16;
    return ((static_cast<uint32_t>(step16) << 8) & 0xff00ff00)
         | ((static_cast<uint32_t>(step16) >> 8) & 0x00ff00ff);
}

The static_casts are important; the results are wrong without them, and fewer instructions are generated. The Microsoft compiler is smart enough to fold "step16 = static_cast<uint32_t>(x) << 16 | static_cast<uint32_t>(x) >> 16;" into a single "rol ecx, 16" instruction.

Thanks,

--Beman
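A throwaway cross-check of the two variants above (not part of Beman's benchmark; the sample values are arbitrary and both functions are assumed to be in scope) might look like:

#include <cassert>
#include <cstdint>

int main()
{
    // Both routines should agree on every input; spot-check a few bit
    // patterns, including negative values.
    const int32_t samples[] = { 0, 1, 0x12345678, -1, INT32_MIN };
    for (int32_t x : samples)
        assert(by_return(x) == by_return_pyry(x));
}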
participants (6):
- Beman Dawes
- Giovanni Piero Deretta
- Phil Endecott
- Pyry Jahkola
- Tim Blechmann
- tymofey