
6 Sep
2011
1:37 p.m.
Impressively, on system B gcc is able to recognise that the expression can be implemented with the rev instruction.
On system A, the expected logical instructions are generated and they are faster than the bytewise loads and stores that Beman's code produces. Based on the similar speed, my guess is that this code is similar to what htonl() does.
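For readers following along, here is a minimal sketch of the two styles being compared. The exact expression is not quoted in this message, so the shift-and-mask form below is an assumption, and the function names `swap32_shifts` and `swap32_bytes` are hypothetical. The bytewise version is only an illustration of the load/store style, not Beman's actual code.

```c
#include <stdint.h>
#include <string.h>

/* Assumed shift-and-mask form: the four byte lanes are computed
   independently of each other and then OR-ed together. A compiler may
   recognise this pattern and emit a single byte-reverse instruction
   where the target has one. */
static uint32_t swap32_shifts(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* Illustrative bytewise form: copy the value to a byte array, reverse
   the bytes one at a time, and copy the result back. */
static uint32_t swap32_bytes(uint32_t x)
{
    unsigned char in[4];
    unsigned char out[4];
    uint32_t result;

    memcpy(in, &x, sizeof in);
    out[0] = in[3];
    out[1] = in[2];
    out[2] = in[1];
    out[3] = in[0];
    memcpy(&result, out, sizeof result);
    return result;
}
```

Both functions reverse the byte order regardless of host endianness, so swap32_shifts(0x11223344) and swap32_bytes(0x11223344) both give 0x44332211. Whether the compiler turns either one into a single instruction depends on the target and optimisation level.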
I would assume that this algorithm is much friendlier to out-of-order machines, as it should make use of instruction-level parallelism.
It might be hard to verify the benefit with a synthetic benchmark, though ...
I'm working on a benchmark that tries to mimic real-world use cases. Stay tuned... --Beman