
Dear All,

I've just done some quick benchmarks of Beman's proposed byte-swapping code using the following test program:

  #include <cstdint>
  #include <iostream>
  #include <arpa/inet.h>  // for htonl()

  using std::uint32_t;

  static inline uint32_t byteswap(uint32_t src)
  {
  #if defined(USE_GCC_BUILTIN)
    return __builtin_bswap32(src);
  #elif defined(USE_HTONL)
    return htonl(src);
  #elif defined(USE_REV_INSTR)
    uint32_t target;
    __asm__ ("rev %0, %1\n" : "=r" (target) : "r" (src));
    return target;
  #elif defined(USE_BEMAN)
    const char* s (reinterpret_cast<const char*>(&src));
    uint32_t target;
    char* t (reinterpret_cast<char*>(&target) + sizeof(target) - 1);
    *t = *s;
    *--t = *++s;
    *--t = *++s;
    *--t = *++s;
    return target;
  #else
  #error "Define USE_*"
  #endif
  }

  int main()
  {
    uint32_t s = 0;
    for (uint32_t i = 0; i < 1000000000; ++i) {
      s += byteswap(i);
    }
    std::cout << s << "\n";  // Expect 506498560
  }

I've tested this on two systems:

(A): Marvell ARMv5TE ("Feroceon") @ 1.2 GHz, g++ 4.4
(B): Freescale ARMv7 (i.MX53, Cortex-A8) @ 1.0 GHz, g++ 4.6.1

Compiled with: --std=c++0x -O4

Execution times in seconds are:

                     A      B
  USE_GCC_BUILTIN   29.4   3.0
  USE_HTONL         12.6   3.0
  USE_REV_INSTR      n/a   3.0
  USE_BEMAN         17.6   8.0

ARM architecture version 6 (IIRC) introduced a "rev" byte-swapping instruction. On system B, which supports this, it is used by the gcc builtin and by glibc's htonl(), giving better performance than Beman's code. On system A, which does not support this instruction, the gcc builtin makes a library call; I've not looked at what htonl() does.

What do people see on other platforms?

How important do we consider performance to be for this library?

Regards, Phil.

Phil Endecott wrote:
I've tested this on two systems: (A): Marvell ARMv5TE ("Feroceon") @ 1.2 GHz, g++ 4.4 (B): Freescale ARMv7 (i.MX53, Cortex-A8) @ 1.0 GHz, g++ 4.6.1
Compiled with: --std=c++0x -O4
What do the numbers look like if you compile with the -march=native switch set? Or, if you're cross-compiling, replace 'native' with the correct gcc-supported arch. To further that, is there much of a difference when compiled with PGO (-fprofile-generate, run, -fprofile-use)? Btw, I've found that in some situations O2 performs better than O3 or O4 - though PGO cleans up a lot of those inconsistencies at O3+ levels.

Arash Partow wrote:
Phil Endecott wrote:
I've tested this on two systems: (A): Marvell ARMv5TE ("Feroceon") @ 1.2 GHz, g++ 4.4 (B): Freescale ARMv7 (i.MX53, Cortex-A8) @ 1.0 GHz, g++ 4.6.1
Compiled with: --std=c++0x -O4
What do the numbers look like if you compile with the -march=native switch set? Or, if you're cross-compiling, replace 'native' with the correct gcc-supported arch.
"error: bad value (native) for -march= switch" These are both native compilers that are already tuned to the systems that they run on.
To further that, is there much of a difference when compiled with PGO (-fprofile-generate, run, -fprofile-use)? Btw, I've found that in some situations O2 performs better than O3 or O4 - though PGO cleans up a lot of those inconsistencies at O3+ levels.
I'm sure that some tweaking could adjust the performance but mostly by changing the benchmark infrastructure i.e. loop unrolling, interleaving etc. The important thing here is what the actual byteswap gets compiled to, and there are three discrete choices: the REV instruction, word loads and stores with bit-twiddling, and byte loads and stores. The results I've posted are sufficient to show how those perform and which source code compiles to which implementation. Regards, Phil.

On Mon, Sep 5, 2011 at 6:38 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
I've just done some quick benchmarks of Beman's proposed byte-swapping code...
What do people see on other platforms?
Twenty-plus years ago I put a lot of effort into finding optimum code for a C language endian library, but real-world application tests showed that what was optimum for one compiler was a dog on another compiler, that compiler switches could change what was optimum code, and then for the next release of the compiler we had to do it all over again. That said, if we can come up with a benchmark representative of real-world use cases, and can come up with robust optimizations that have some staying power, I'll gladly include them in the code. --Beman

Dear All,

I have humbly taken a look upon the "conversion.hpp" section of Beman Dawes' library and I was able to learn a lot... Thank you... I was able to learn a lot from Tymofey's and Phil Endecott's code too, so thank you too ! I am just a programmer-wannabe... I do hope that my message is not understood as disrespect, for it is not meant as such, not at all.

I have conducted a small and non-representative benchmark, the ugly source code of which I have uploaded here: http://preview.tinyurl.com/4yhcc8t

It compares several versions of code meant to "bswap" 16-bit, 32-bit and 64-bit integer values. In the tables below, I have used the following notations:

RC-1:      This is endian::revert from the release-candidate (non-final, I think) version of Beman Dawes' library.
Tymofey:   A combination of my own unworthy hack for uint16_t, Phil Endecott's version for uint32_t (for USE_TYMOFEY) and Tymofey's suggestion for uint64_t.
Imbecil-0: Boost-wannabe.RawMemory compiled with BOOST_RAW_MEMORY_TRICKS #define'd as 0.
Imbecil-1: Boost-wannabe.RawMemory compiled with BOOST_RAW_MEMORY_TRICKS #define'd as 1 (the default).

Here are the approximate results, on the same computer that is described here: http://adder.iworks.ro/Boost/RawMemory/#Benchmarks (Ctrl+F: "And the name of the computer is").

(Much to my shame, I have not employed Windows' "high-performance counter". Also, I did not run too many repetitions, in order to avoid overheating the lovely computer or the benchmarks turning against me.)

(For the 64-bit integers -- what a devious choice ! -- I have also noted the approximate number of machine code bytes that were generated.)
Borland C++Builder 5.5:

  uint16_t:  RC-1 1765   Tymofey 265   Imbecil-0 250   Imbecil-1 250
  uint32_t:  RC-1 1922   Tymofey 453   Imbecil-0 469   Imbecil-1 453
  uint64_t:  RC-1      2360  ( 72 bytes of code)
             Tymofey   3375  (225 bytes of code)
             Imbecil-0 2234  (169 bytes of code)
             Imbecil-1  797  (7 bytes for the caller, 15 bytes for the callee)

Digital Mars C++:

  uint16_t:  RC-1 360    Tymofey 250   Imbecil-0 265    Imbecil-1 250
  uint32_t:  RC-1 1828   Tymofey 391   Imbecil-0 1203   Imbecil-1 453  <-- I am such a noob !
  uint64_t:  RC-1      2797  ( 84 bytes of code)
             Tymofey   2875  (292 bytes of code)
             Imbecil-0 2453  (202 bytes of code)
             Imbecil-1  609  (7 bytes for the caller, 15 bytes for the callee)

GCC 4.3.4 (20090804):

  uint16_t:  RC-1 1750   Tymofey 188   Imbecil-0 187   Imbecil-1 187  <-- Of course, I chose the run that favours me !
  uint32_t:  RC-1 1328   Tymofey 453   Imbecil-0 438   Imbecil-1 188
  uint64_t:  RC-1      2781  ( 67 bytes of code)
             Tymofey   1578  ( 96 bytes of code)
             Imbecil-0  578  ( 46 bytes of code)
             Imbecil-1  250  (  8 bytes of code)

Visual C++ 2003:

  uint16_t:  RC-1 984    Tymofey 187   Imbecil-0 125   Imbecil-1 203  <-- I have to investigate this !
  uint32_t:  RC-1 1563   Tymofey 375   Imbecil-0 515   Imbecil-1 125
  uint64_t:  RC-1      2328  ( 63 bytes of code)
             Tymofey   3047  (154 bytes of code)
             Imbecil-0 1110  (122 bytes of code)
             Imbecil-1  438  ( 12 bytes of code, but I had to "WorkAround.cpp" an Internal Compiler Error by avoiding inlining the function; I have to investigate this !)
Visual C++ 2005:

  uint16_t:  RC-1 906    Tymofey 188   Imbecil-0 187   Imbecil-1 203
  uint32_t:  RC-1 1906   Tymofey 391   Imbecil-0 328   Imbecil-1 125
  uint64_t:  RC-1      3437  ( 65 bytes of code)
             Tymofey   3047  (151 bytes of code)
             Imbecil-0  641  ( 61 bytes of code)
             Imbecil-1  188  (  8 bytes of code)

Visual C++ 2005 for x64:

  uint16_t:  RC-1 1687   Tymofey 203   Imbecil-0 188   Imbecil-1 203
  uint32_t:  RC-1 1390   Tymofey 328   Imbecil-0 329   Imbecil-1 188
  uint64_t:  RC-1      2406  ( 72 bytes of code)
             Tymofey    703  (210 bytes of code; the loop was unrolled 1:2)
             Imbecil-0  531  (162 bytes of code; the loop was unrolled 1:2)
             Imbecil-1  187  (  3 bytes of code)

The "Tymofey" optimizations are included in the previous emails from Tymofey and Phil Endecott. The "Imbecil" optimizations (and pessimizations) are described and included here: http://adder.iworks.ro/Boost/RawMemory

If Assembler, compiler intrinsics, __fastcall, compiler-specific tuning, etc. sound interesting, you are welcome to have a closer look.

Thank you for your time and for your work...

-- Yours truly, Adder

On 9/6/11, Beman Dawes <bdawes@acm.org> wrote:
On Mon, Sep 5, 2011 at 6:38 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
I've just done some quick benchmarks of Beman's proposed byte-swapping code...
What do people see on other platforms?
Twenty plus years ago I put a lot of effort into finding optimum code for a C language endian library, but real-world application tests showed that what was optimum for one compiler was a dog on another compiler, that compiler switches could change what was optimum code, and then for the next release of the compiler we had to do it all over again.
That said, if we can come up with a benchmark representative of real-world use cases, and can come up with robust optimizations that have some staying power, I'll gladly include them in the code.
--Beman

Adder wrote:
I have conducted a small and non-representative benchmark, the ugly source code of which I have uploaded here:
http://preview.tinyurl.com/4yhcc8t
It compares several versions of code meant to "bswap" 16-bit, 32-bit and 64-bit integer values.
Could you post the important bits of your byte_swapper::reverse() to the list? Phil.

On 9/6/11, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Could you post the important bits of your byte_swapper::reverse() to the list?
For which compiler ? (-:

16-bit un-optimized version:

  return (x >> 8) | (x << 8);

32-bit un-optimized version (edited for space):

  return uint32_t (reverse16 (uint16_t (x >> 16)))
       | (uint32_t (reverse16 (uint16_t (x))) << 16);

64-bit un-optimized version (edited for space):

  return uint64_t (reverse32 (uint32_t (x >> 32)))
       | (uint64_t (reverse32 (uint32_t (x))) << 32);

GCC 32-bit and 64-bit optimized versions use __builtin_bswap32 and __builtin_bswap64, with a warning in the documentation pointing to the http://hardwarebug.org/2010/01/14/beware-the-builtins link that has just been posted to this list.
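[Editorial note: the three un-optimized snippets above can be collected into one self-contained sketch. The helper names reverse16/reverse32/reverse64 follow the quoted code, but this assembly of them is illustrative, not Adder's exact source.]

```cpp
#include <cstdint>

// 16-bit swap: exchange the two bytes.
inline std::uint16_t reverse16(std::uint16_t x) {
    return static_cast<std::uint16_t>((x >> 8) | (x << 8));
}

// 32-bit swap: swap the two halves, byte-reversing each half.
inline std::uint32_t reverse32(std::uint32_t x) {
    return std::uint32_t(reverse16(std::uint16_t(x >> 16)))
         | (std::uint32_t(reverse16(std::uint16_t(x))) << 16);
}

// 64-bit swap: the same recursion one level up.
inline std::uint64_t reverse64(std::uint64_t x) {
    return std::uint64_t(reverse32(std::uint32_t(x >> 32)))
         | (std::uint64_t(reverse32(std::uint32_t(x))) << 32);
}
```

Despite the "un-optimized" label, recent GCC releases recognize this shift pattern and emit a single bswap (or rev) instruction where available.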

Hi there,

If only you had GCC 4.4.3 instead of 4.4.0 for platform (A)... I think it is that version that introduced __builtin_bswap32 and __builtin_bswap64. You can see the results of some benchmarks performed on another platform (including non-optimized code generated by GCC vs code that uses the above-mentioned intrinsics) in the link that I gladly respam here: http://adder.iworks.ro/Boost/RawMemory/#Benchmarks

Regarding the importance of performance: what applications do you have in mind ? Loading a (portable) (binary) configuration file when the program is launched and saving it back when the program is closed might not require hand-written Assembler. A personal firewall application that enthusiastic power users run on their Gigabit cable network and on their 300 Mbps wireless network might be a different thing. So is streaming OpenGL RGBA framebuffer data from the video game we are working on to the MPEG4 encoder which expects reversely formatted pixels, in order to create a YouTube-uploadable demo. In such a case, the performance of the byte swapper made a significant difference.

Thank you... (-:

-- Yours truly, Adder

On 9/6/11, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Dear All, [...] I've tested this on two systems: (A): Marvell ARMv5TE ("Feroceon") @ 1.2 GHz, g++ 4.4 (B): Freescale ARMv7 (i.MX53, Cortex-A8) @ 1.0 GHz, g++ 4.6.1 [...] What do people see on other platforms? [...] How important do we consider performance to be for this library?

On 06/09/2011 00:38, Phil Endecott wrote:
#elif defined(USE_BEMAN)
  const char* s (reinterpret_cast<const char*>(&src));
  uint32_t target;
  char* t (reinterpret_cast<char*>(&target) + sizeof(target) - 1);
  *t = *s;
  *--t = *++s;
  *--t = *++s;
  *--t = *++s;
  return target;
Why isn't that just std::reverse?

On Tue, Sep 6, 2011 at 8:44 AM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 06/09/2011 00:38, Phil Endecott wrote:
#elif defined(USE_BEMAN)
  const char* s (reinterpret_cast<const char*>(&src));
  uint32_t target;
  char* t (reinterpret_cast<char*>(&target) + sizeof(target) - 1);
  *t = *s;
  *--t = *++s;
  *--t = *++s;
  *--t = *++s;
  return target;
Why isn't that just std::reverse?
Good question. I don't really know the answer, but it may just be a legacy of very old code that predates the availability of std::reverse. --Beman

Mathias Gaunard wrote:
On 06/09/2011 00:38, Phil Endecott wrote:
#elif defined(USE_BEMAN)
  const char* s (reinterpret_cast<const char*>(&src));
  uint32_t target;
  char* t (reinterpret_cast<char*>(&target) + sizeof(target) - 1);
  *t = *s;
  *--t = *++s;
  *--t = *++s;
  *--t = *++s;
  return target;
Why isn't that just std::reverse?
Well std::reverse is in-place, and this particular code is not. But yes, std::reverse could be used for some of Beman's functions, and some sort of std::copy with reverse iterators could be used here. There are also places where the code could be more concise if std::swap were used. Fundamentally though, I don't think that casting to bytes is the right way to do this; shifts and bitwise ops seem to produce better code, and involve less worrying casting. Regards, Phil.
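[Editorial note: a sketch of the std::copy-with-reverse-iterators idea mentioned above (an illustration, not code from the proposed library; the memcpy steps keep the byte access within the char-aliasing rule):]

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <iterator>

// Out-of-place byte reversal expressed with std::copy and reverse iterators.
inline std::uint32_t byteswap_copy(std::uint32_t src) {
    unsigned char in[sizeof src];
    unsigned char out[sizeof src];
    std::memcpy(in, &src, sizeof src);           // read the object representation
    std::copy(in, in + sizeof src,               // write it out back-to-front
              std::reverse_iterator<unsigned char*>(out + sizeof src));
    std::uint32_t target;
    std::memcpy(&target, out, sizeof target);
    return target;
}
```

Whether a given compiler turns this into the same code as the shift/mask version is exactly the kind of thing the benchmarks in this thread would need to check.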

On 06/09/2011 18:28, Phil Endecott wrote:
Well std::reverse is in-place, and this particular code is not. But yes, std::reverse could be used for some of Beman's functions, and some sort of std::copy with reverse iterators could be used here. There are also places where the code could be more concise if std::swap were used.
Well you could copy src into target, then reverse target in place. But I guess that would actually be slower.
Fundamentally though, I don't think that casting to bytes is the right way to do this
It's a way that works for all types, even non-integer ones (i.e. floating-point). It's good for a base implementation.
shifts and bitwise ops seem to produce better code
The best code depends on the actual type and architecture. The best thing to do, for performance, is to call specialized functions for each type, such as built-ins or intrinsics. I don't think this is really important though.
, and involve less worrying casting.
Nothing to worry about here. char* is allowed to alias to permit just that.
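[Editorial note: as an illustration of that point, here is an aliasing-legal restatement of the USE_BEMAN variant using array indexing rather than pointer walking (my rewrite, not the library's actual code):]

```cpp
#include <cstdint>

// Byte-wise swap through unsigned char*, which the standard permits to
// alias any object: read src's bytes, write them to target in reverse.
inline std::uint32_t byteswap_chars(std::uint32_t src) {
    std::uint32_t target;
    const unsigned char* s = reinterpret_cast<const unsigned char*>(&src);
    unsigned char* t = reinterpret_cast<unsigned char*>(&target);
    t[0] = s[3];
    t[1] = s[2];
    t[2] = s[1];
    t[3] = s[0];
    return target;
}
```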

Mathias Gaunard wrote:
On 06/09/2011 18:28, Phil Endecott wrote:
shifts and bitwise ops seem to produce better code
The best code depends on the actual type and architecture.
Perhaps.
The best thing to do, for performance, is to call specialized functions for each type, such as built-ins or intrinsics.
Unfortunately not, no; in my tests the gcc built-ins were the worst-performing on one platform.
I don't think this is really important though.
You mean the performance of these endian operations in general is not important? Phil.

On 09/07/2011 01:54 AM, Phil Endecott wrote:
Unfortunately not, no; in my tests the gcc built-ins were the worst-performing on one platform.
Then it's a bug that should be fixed in the compiler. (But then it could also be that your benchmark is imperfect)
I don't think this is really important though.
You mean the performance of these endian operations in general is not important?
Not that much no. It's important that it's reasonably fast, but it needn't be performance critical.

Mathias Gaunard wrote:
On 09/07/2011 01:54 AM, Phil Endecott wrote:
You mean the performance of these endian operations in general is not important?
Not that much no. It's important that it's reasonably fast, but it needn't be performance critical.
Examples were given recently of swapping bytes in large data sets, where each extra cycle quickly accumulates into noticeable delays. Another example arises in high frequency trading where latency is of the utmost importance. Every little bit that can be shaved off the critical path, of which byte swapping often is a part, can mean the difference between making money or not. Thus, performance of low level functions like these can be critical. It is not necessary that Boost provide the fastest implementation on every platform, but merely provide interfaces that can be implemented in the fastest way on at least some significant platforms to prove the point. Later standardization, should it occur, can leave platform-specific optimization to implementers. _____ Rob Stewart robert.stewart@sig.com Software Engineer using std::disclaimer; Dev Tools & Components Susquehanna International Group, LLP http://www.sig.com ________________________________ IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

On Wed, Sep 7, 2011 at 7:12 AM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Mathias Gaunard wrote:
On 09/07/2011 01:54 AM, Phil Endecott wrote:
You mean the performance of these endian operations in general is not important?
Not that much no. It's important that it's reasonably fast, but it needn't be performance critical.
Examples were given recently of swapping bytes in large data sets, where each extra cycle quickly accumulates into noticeable delays. Another example arises in high frequency trading where latency is of the utmost importance. Every little bit that can be shaved off the critical path, of which byte swapping often is a part, can mean the difference between making money or not. Thus, performance of low level functions like these can be critical.
It is not necessary that Boost provide the fastest implementation on every platform, but merely provide interfaces that can be implemented in the fastest way on at least some significant platforms to prove the point. Later standardization, should it occur, can leave platform-specific optimization to implementers.
+1 I've worked on commercial GIS software where byte swapping was one of the critical performance bottlenecks. --Beman

On 07/09/2011 13:12, Stewart, Robert wrote:
Mathias Gaunard wrote:
On 09/07/2011 01:54 AM, Phil Endecott wrote:
You mean the performance of these endian operations in general is not important?
Not that much no. It's important that it's reasonably fast, but it needn't be performance critical.
Examples were given recently of swapping bytes in large data sets, where each extra cycle quickly accumulates into noticeable delays. Another example arises in high frequency trading where latency is of the utmost importance. Every little bit that can be shaved off the critical path, of which byte swapping often is a part, can mean the difference between making money or not. Thus, performance of low level functions like these can be critical.
Why not avoid swapping bytes entirely, by converting the data set to native byte ordering offline, or by choosing to transmit the data in the native byte ordering? But I'll take your word for it that it is indeed necessary, and a measurable overhead compared to I/O.

Mathias Gaunard wrote:
Why not avoid swapping bytes entirely, by converting the data set to native byte ordering offline, or by choosing to transmit the data in the native byte ordering?
Third party data. No control. Besides, in the non-streaming examples, such conversion must still be done, and making it faster is better.

_____
Rob Stewart
robert.stewart@sig.com

Stewart, Robert wrote:
Examples were given recently of swapping bytes in large data sets, where each extra cycle quickly accumulates into noticeable delays.
I've just tried this:

  static inline uint32_t byteswap(uint32_t src)
  {
    return (src<<24) | ((src<<8) & 0x00ff0000)
         | ((src>>8) & 0x0000ff00) | (src>>24);
  }

  int main()
  {
    uint32_t buf[1024];
    while (1) {
      size_t bytes = read(0,buf,4096);
      if (!bytes) break;
      std::transform(buf,buf+bytes/4,buf,&byteswap);
      write(1,buf,bytes);
    }
  }

My first test file is a 600 MB video, which is large enough to fit into the test system's RAM; these timings are for "warm" runs, i.e. the test data is all cached. This is on the i.MX53 system I called "B" in my last posts. Output is redirected to /dev/null.

1. With the byteswapping DISABLED, i.e. just the I/O:

  real 0m2.495s
  user 0m0.030s
  sys  0m2.460s

2. With the byteswapping enabled, compiled with -O4:

  real 0m3.019s
  user 0m0.900s
  sys  0m2.110s

In this case the core of the assembler is this nice simple loop:

  .L4:
    ldr r1, [r3, #0]
    rev r1, r1
    str r1, [r3], #4
    cmp r0, r3
    bne .L4

3. With the byteswapping enabled, compiled with -O4 -funroll-loops:

  real 0m2.516s
  user 0m0.450s
  sys  0m2.060s

(This code is much longer and less readable but it seems to be worthwhile.)

4. Then I tried this bytewise swapping: (Please tell me if you think this code is wrong, I didn't check the output)

  char buf[4096];
  while (1) {
    size_t bytes = read(0,buf,4096);
    if (!bytes) break;
    for (int i=0; i<bytes; i += 4) {
      std::reverse(buf+i,buf+i+4);
    }
    write(1,buf,bytes);
  }

Timing with -O4: (loop unrolling seems to make no improvement)

  real 0m5.131s
  user 0m3.180s
  sys  0m1.940s

Increasing the block size doesn't make any significant difference; reducing it below 4096 bytes does slow it down.

So the overhead of byteswapping compared to I/O - for a file cached in memory - is between about 25% (case 3) and 150% (case 4) on this system.

I've also done a quick test with a file that is larger than RAM and so must be read from the SSD.
In this case the results are:

Case 4 above, bytewise swapping:

  real 0m44.969s
  user 0m15.910s
  sys  0m12.800s

Case 3 above, rev instruction with loop unrolling:

  real 0m44.624s
  user 0m2.290s
  sys  0m12.990s

So as expected the amount of CPU time used scales approximately as before, but the elapsed time doesn't change, as it's limited by the SATA interface or SSD to around 50 MB/sec.

Personally I think these savings are worthwhile, and I believe that a library developer should normally assume that potential users of a library will have applications that need optimal performance, even if the developer is happy with something more modest.

Regards, Phil.

On 07/09/2011 21:03, Phil Endecott wrote:
Increasing the block size doesn't make any significant difference; reducing it below 4096 bytes does slow it down.
So the overhead of byteswapping compared to I/O - for a file cached in memory - is between about 25% (case 3) and 150% (case 4) on this system.
So std::reverse is six times slower than std::transform in that benchmark. Not entirely unexpected, especially on ARM.
So as expected the amount of CPU time used scales approximately as before, but the elapsed time doesn't change as it's limited by the SATA interface or SSD to around 50 MB/sec.
Personally I think these savings are worthwhile, and I believe that a library developer should normally assume that potential users of a library will have applications that need optimal performance, even if the developer is happy with something more modest.
The conclusion seems to be "doesn't matter for bandwidth, but does for latency". I take it that high-speed trading stuff needs the lowest latency possible?

Mathias Gaunard wrote:
On 07/09/2011 21:03, Phil Endecott wrote:
So as expected the amount of CPU time used scales approximately as before, but the elapsed time doesn't change as it's limited by the SATA interface or SSD to around 50 MB/sec.
The conclusion seems to be "doesn't matter for bandwidth, but does for latency".
The conclusion is that the I/O overhead swamped the test results.
I take it that high-speed trading stuff needs the lowest latency possible?
Yes, on the critical path, as I noted previously.

_____
Rob Stewart
robert.stewart@sig.com

On 9/7/11 1:49 AM, Mathias Gaunard wrote:
On 09/07/2011 01:54 AM, Phil Endecott wrote:
Unfortunately not, no; in my tests the gcc built-ins were the worst-performing on one platform.
Then it's a bug that should be fixed in the compiler. (But then it could also be that your benchmark is imperfect)
There have been reports that some gcc builtins (especially for ARM) do not generate very good code. One description of gcc's bswap builtin for ARM is here: http://hardwarebug.org/2010/01/14/beware-the-builtins/ My own personal work has found that gcc's implementation of the popcount builtin is very inefficient on IA-64 archs compared to other well-known (and easy to implement) bit-twiddling algorithms. (Of course, this is for cases where the popcount instruction is not available in the target processor.) Cheers, Tim

Then it's a bug that should be fixed in the compiler. (But then it could also be that your benchmark is imperfect)
Perhaps it is not the compilers that are failing Boost, but the other way around. For GCC for x86, we can equal and surpass the speed-ups obtained with compiler intrinsics even without those intrinsics: http://adder.iworks.ro/Boost/RawMemory/#Update-01 Can we not achieve this for other architectures too, in order to support programmers ?
I don't think this is really important though.
You mean the performance of these endian operations in general is not important?
Not that much no. It's important that it's reasonably fast, but it needn't be performance critical.

On Mon, Sep 12, 2011 at 8:43 PM, Adder <adder.thief@gmail.com> wrote:
Then it's a bug that should be fixed in the compiler. (But then it could also be that your benchmark is imperfect)
Perhaps it is not the compilers who are failing Boost, but the other way around.
For GCC for x86, we can equal and surpass the speed-ups obtained with compiler intrinsics even without those intrinsics:
http://adder.iworks.ro/Boost/RawMemory/#Update-01
Can we not achieve this for other architectures too, in order to support programmers ?
Yes; VC++ 2010 (and probably earlier versions) generated better, faster code for just plain C++ in my timing tests of some conversion functions than for intrinsics. And this was for the most important use case, not just some obscure corner case. I'm assuming that when folks ask for intrinsics, they really mean high speed. --Beman

On 09/05/2011 05:38 PM, Phil Endecott wrote:
I've just done some quick benchmarks of Beman's proposed byte-swapping code using the following test program:
...
What do people see on other platforms?
I added a test case that uses Beman's integer types that yields much better performance (~2.5X) than the reorder functions in the conversion portion of the library. I.e. I added:

  #elif USE_BIG
    return *reinterpret_cast<const ubig32_t*>(&src);
  #elif USE_LITTLE
    return *reinterpret_cast<const ulittle32_t*>(&src);
  #elif USE_NATIVE
    return src;

and on an Intel Core2 with gcc-4.4.3, I get the following results at -O2 (except where noted, -O3 and -O4 produced nearly identical results):

  USE_BUILTIN  1.42
  USE_HTONL    1.39
  USE_BEMAN    7.81
  USE_BIG      3.09
  USE_LITTLE   2.95
  USE_NATIVE   0.86  (0.23 for -O3)

So it looks like the mechanism used in the "integers" part of the library is superior to the mechanism used in the "conversion" part of the library, and maybe the latter should be adapted to use the former. It seems like for that to work the "endianness" enum would have to be extended to include "nonnative". Then the reorder<T> functions could be implemented in terms of endian<T, nonnative>.

On 07/09/2011 14:16, John Filo wrote:
I.e. I added:
  #elif USE_BIG
    return *reinterpret_cast<const ubig32_t*>(&src);
  #elif USE_LITTLE
    return *reinterpret_cast<const ulittle32_t*>(&src);
  #elif USE_NATIVE
    return src;
This breaks aliasing rules.
and on a Intel Core2 with gcc-4.4.3, I get the following results at -O2 (except where noted, -O3 and -O4 produced nearly identical results):
GCC is one of the few compilers that follow aliasing rules. There is no guarantee that the code generated is correct.
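[Editorial note: an aliasing-safe way to express such reinterpretations (a sketch; the helper name bit_cast_sketch is hypothetical) is to copy through std::memcpy, which compilers typically optimize down to a plain register move. C++20 later standardized this pattern as std::bit_cast:]

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret the object representation of Src as Dst
// without violating strict aliasing; the memcpy is the only sanctioned way
// to read one type's bytes as another (besides char access and unions in C).
template <class Dst, class Src>
Dst bit_cast_sketch(const Src& src) {
    static_assert(sizeof(Dst) == sizeof(Src), "sizes must match");
    Dst dst;
    std::memcpy(&dst, &src, sizeof dst);
    return dst;
}
```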

On 09/09/2011 03:29 AM, Mathias Gaunard wrote:
On 07/09/2011 14:16, John Filo wrote:
I.e. I added:
  #elif USE_BIG
    return *reinterpret_cast<const ubig32_t*>(&src);
  #elif USE_LITTLE
    return *reinterpret_cast<const ulittle32_t*>(&src);
  #elif USE_NATIVE
    return src;
This breaks aliasing rules.
I wondered about that; I assumed that since the return statement was the only read from memory in the function, and there were no writes, aliasing wouldn't be a problem. Also, I had -Wall on and GCC didn't complain, so I assumed it was okay. I guess I need to read up on what the exact rules are.

On Fri, Sep 9, 2011 at 8:53 AM, John Filo <filo@arlut.utexas.edu> wrote:
On 09/09/2011 03:29 AM, Mathias Gaunard wrote:
On 07/09/2011 14:16, John Filo wrote:
I.e. I added:
  #elif USE_BIG
    return *reinterpret_cast<const ubig32_t*>(&src);
  #elif USE_LITTLE
    return *reinterpret_cast<const ulittle32_t*>(&src);
  #elif USE_NATIVE
    return src;
This breaks aliasing rules.
I wondered about that; I assumed that since the return statement was the only read from memory in the function, and there were no writes, aliasing wouldn't be a problem. Also, I had -Wall on and GCC didn't complain, so I assumed it was okay. I guess I need to read up on what the exact rules are.
I'm missing something. Could you please post the full function, without any preprocessing statements, where the aliasing issue is arising? Thanks, --Beman

On 09/09/2011 10:45 AM, Beman Dawes wrote:
I'm missing something. Could you please post the full function, without any preprocessing statements, where the aliasing issue is arising?

I should have provided more context. Sorry about that. It was an addition to the body of the byteswap function in Phil's test code that sparked this whole performance discussion. With all the various byteswap implementations stripped away, the code becomes:
#include <cstdint>
#include <iostream>
#include <boost/endian/integers.hpp>

using std::uint32_t;
using namespace boost::endian;

static inline uint32_t byteswap(uint32_t src)
{
    return *reinterpret_cast<const ubig32_t*>(&src);
}

int main()
{
    uint32_t s = 0;
    for (uint32_t i = 0; i < 1000000000; ++i)
        s += byteswap(i);
    std::cout << s << "\n";
}

I did this to compare the timing of your endian<> class to your reorder functions, and to show that the former was considerably more efficient than the latter and that you could just implement reorder() in terms of endian<> using the reinterpret_cast above.

I don't fully grasp the C++ aliasing rules, but Matt claims the above reinterpret_cast can (or does) result in undefined behavior. In this particular case, gcc doesn't complain about any aliasing problems and it does produce the correct answer, but that doesn't mean my code is correct.

Oddly enough, when I used a trick I've used in the past to get rid of gcc aliasing warnings, gcc started complaining that I had aliasing problems in some cases, so clearly I *really* don't understand the aliasing rules in C++. The trick I used is to use a union, which I thought was the recommended way to alias a chunk of memory to multiple types. I.e.

template<typename DST, typename SRC>
DST aliased_ptr_cast(DST dst)
{
    union { SRC src; DST dst; } u;
    u.src = src;
    return u.dst;
}

inline uint32_t byteswap_big(uint32_t src)
{
    return *aliased_ptr_cast<const ubig32_t*>(&src);
}

Using the above code, along with main() defined further up, the app compiles warning-free and produces the correct answer on linux/intel/g++-4.4. If I change the type in the aliased_ptr_cast from ubig32_t to either ulittle32_t or unative32_t, I start getting aliasing warnings. I clearly have some learning to do.

On 09/09/2011 02:49 PM, Vicente J. Botet Escriba wrote:
Le 09/09/11 21:14, john filo a écrit :
template<typename DST, typename SRC> DST aliased_ptr_cast(DST dst)
I guess this should be
DST aliased_ptr_cast(SRC src)
Correct. Let this be a lesson to me to always cut & paste from the source instead of typing it by hand.

On Fri, Sep 9, 2011 at 3:14 PM, john filo <filo@arlut.utexas.edu> wrote:
On 09/09/2011 10:45 AM, Beman Dawes wrote:
I'm missing something. Could you please post the full function, without any preprocessing statements, where the aliasing issue is arising?
I should have provided more context. Sorry about that. It was an addition to the body of the byteswap function in Phil's test code that sparked this whole performance discussion. With all the various byteswap implementations stripped away, the code becomes:
#include <cstdint> #include <iostream> #include <boost/endian/integers.hpp>
using std::uint32_t; using namespace boost::endian;
static inline uint32_t byteswap(uint32_t src) { return *reinterpret_cast<const ubig32_t*>(&src); }
int main() { uint32_t s = 0; for (uint32_t i = 0; i < 1000000000; ++i) s += byteswap(i);
std::cout << s << "\n"; }
I did this to compare timing of your endian<> class to your reorder functions, and to show that the former was considerably more efficient than the latter and that you could just implement reorder() in terms of endian<> using the reinterpret_cast above.
I don't fully grasp the C++ aliasing rules, but Matt claims the above reinterpret_cast can (or does) result in undefined behavior. In this particular case, gcc doesn't complain about any aliasing problems and it does produce the correct answer, but that doesn't mean my code is correct.
I'm not sure I do either. I'd be curious to see a reference to the wording in the standard that Mathias thinks comes into play. I'm also curious why you didn't just write: return reinterpret_cast<ubig32_t>(src); But I'm probably trying to follow too many threads, and missing something obvious. --Beman

On 09/09/2011 04:13 PM, Beman Dawes wrote:
I'm not sure I do either. I'd be curious to see a reference to the wording in the standard that Mathias thinks comes into play.
I'm also curious why you didn't just write:
return reinterpret_cast<ubig32_t>(src);
gcc gives me: invalid cast from type ‘uint32_t’ to type ‘boost::endian::ubig32_t’. I've only ever used reinterpret_cast on pointers, so doing the above didn't even occur to me. I had to try it to see whether or not it would work. Using static_cast doesn't work because it does the byteswap operation twice, making the function a no-op.

On Fri, Sep 9, 2011 at 8:14 PM, john filo <filo@arlut.utexas.edu> wrote:
I don't fully grasp the C++ aliasing rules, but Matt claims the above reinterpret_cast can (or does) result in undefined behavior. In this particular case, gcc doesn't complain about any aliasing problems and it does produce the correct answer, but that doesn't mean my code is correct.
The rule itself is relatively simple (but variously interpreted): it is UB to dereference a pointer to T if T is 'different enough' from the actual dynamic type of the pointed-to memory. The definition of 'different enough' gives enough leeway to account for signed/unsigned and, IIRC, const/non-const, but not much else (most importantly, layout compatibility *does not* imply alias compatibility). There are special exceptions if T is 'char': a {signed,unsigned,} char* may alias anything, thus memcpy can be safely used to read/write from/to pointers to arbitrary types. How to interpret the aliasing rules when reading/writing partial objects (instead of whole objects), like fields in a structure, is anyone's guess.
Oddly enough, when I used a trick I've used in the past to get rid of gcc aliasing warnings, gcc started complaining that I had aliasing problems in some cases, so clearly I *really* don't understand the aliasing rules in C++. The trick I used is to use a union, which I thought was the recommended way to alias a chunk of memory to multiple types. I.e.
template<typename DST, typename SRC>
DST aliased_ptr_cast(DST dst)
{
    union { SRC src; DST dst; } u;
    u.src = src;
    return u.dst;
}
inline uint32_t byteswap_big(uint32_t src) { return *aliased_ptr_cast<const ubig32_t*>(&src); }
The union trick [1] can be used for safely bitcasting values, and the aliased_ptr_cast above will work correctly if DST and SRC are of the same size, converting the bit representation of src from type SRC to type DST. If SRC and DST are pointers (as in the intended usage of the ptr cast), it will convert a pointer type to another pointer type; this is ok as long as the bit pattern for SRC is a valid bit pattern for DST, but it will have no effect on the pointed-to memory, so it doesn't help in any way to circumvent pointer aliasing. I do not know of any portable way of safely casting pointers. GCC has __attribute__((may_alias)), but I haven't actually tried it. A compiler memory barrier might happen to work, as aliasing UB often manifests as unexpected read/write reordering. HTH, [1] The union trick has been blessed by C99 TC3 as standard conforming, but then again, C99 and C++ rules are different enough that it probably doesn't matter. -- gpd

On 09/12/2011 12:13 PM, Giovanni Piero Deretta wrote:
On Fri, Sep 9, 2011 at 8:14 PM, john filo<filo@arlut.utexas.edu> wrote:
I don't fully grasp the C++ aliasing rules, but Matt claims the above reinterpret_cast can (or does) result in undefined behavior. In this particular case, gcc doesn't complain about any aliasing problems and it does produce the correct answer, but that doesn't mean my code is correct.
The rule itself is relatively simple (but variously interpreted): it is UB to dereference a pointer to T if T is 'different enough' from the actual dynamic type of the pointed memory. The definition of 'different enough' gives enough leeway to account for signed/unsigned and, IIRC, const/non-const, but not much else (most importantly, layout compatibility *does not* imply alias compatibility).
Thanks for the explanation. Your reply prompted me to do what I should have done a long time ago: dig into this issue more so I better understand it. I was surprised by the amount of conflicting (and just plain wrong) information regarding how reinterpret_cast is supposed to behave, but I think I understand it now. Apologies for getting a bit off track re: this review, but the one thing I don't understand from 5.2.10 of the standard:

-7- A pointer to an object can be explicitly converted to a pointer to an object of different type. Except that converting an rvalue of type ``pointer to T1'' to the type ``pointer to T2'' (where T1 and T2 are object types and where the alignment requirements of T2 are no stricter than those of T1) and back to its original type yields the original pointer value, the result of such a pointer conversion is unspecified.

The last sentence seems to contradict itself. It first says that the following assertion will never trigger as long as T2's alignment requirements are no stricter than T1's:

void f(T1 *arg)
{
    T2 *p2 = reinterpret_cast<T2*>(arg);
    T1 *p1 = reinterpret_cast<T1*>(p2);
    assert(p1 == arg);
}

But then it goes on to say that the "conversion is unspecified". Is it saying that even though the value assigned to p1 is specified, the value assigned to p2 is unspecified?
The union trick [1] can be used for safely bitcasting values and the aliased_ptr_cast above will work correctly if DST and SRC are of the same size, converting the bit representation of src from type SRC to type DST.
If SRC and DST are pointers (as in the intended usage of the ptr cast), it will convert a pointer type to another pointer type; this is ok as long as the bit pattern for SRC is a valid bit pattern for DST, but it will have no effect on the pointed-to memory, so it doesn't help in any way to circumvent pointer aliasing.
Okay, that makes sense. Basically the fact that my aliased_ptr_cast function has worked for me up until now is a fluke and I really can't rely on it. Bummer.
I do not know of any portable way of safely casting pointers. GCC has __attribute__((may_alias)), but I haven't actually tried it. A compiler memory barrier might happen to work, as aliasing UB often manifests as unexpected read/write reordering.
That's depressing; I've been using reinterpret_cast incorrectly for a long time and have a lot of code I need to revisit. It does seem strange that there is no portable way to do pointer casting. Anyway, I think I've derailed this discussion enough. Thanks for the info.

John Filo wrote:
Apologies for getting a bit off track re: this review, but the one thing I don't understand from 5.2.10 of the standard:
-7- A pointer to an object can be explicitly converted to a pointer to an object of different type. Except that converting an rvalue of type ``pointer to T1'' to the type ``pointer to T2'' (where T1 and T2 are object types and where the alignment requirements of T2 are no stricter than those of T1) and back to its original type yields the original pointer value, the result of such a pointer conversion is unspecified.
The last sentence seems to contradict itself. It first says that the following assertion will never trigger as long as T2's alignment requirements are not stricter than T1's.
void f(T1 *arg)
{
    T2 *p2 = reinterpret_cast<T2*>(arg);
    T1 *p1 = reinterpret_cast<T1*>(p2);
    assert(p1 == arg);
}
Right.
But then it goes on to say that the "conversion is unspecified".
No. It says, "Except that converting ... 'pointer to T1' to ... 'pointer to T2' ... and back ... yields the original pointer value, the result ... is unspecified." What you did in f() is fine. Using p2 in any other way is unspecified behavior. That means what you get is up to the compiler vendor/version. It may be documented, but probably isn't. Thus, it is in no way portable.

_____
Rob Stewart robert.stewart@sig.com
Software Engineer using std::disclaimer;
Dev Tools & Components
Susquehanna International Group, LLP http://www.sig.com

On 09/09/2011 05:45 PM, Beman Dawes wrote:
On Fri, Sep 9, 2011 at 8:53 AM, John Filo<filo@arlut.utexas.edu> wrote:
On 09/09/2011 03:29 AM, Mathias Gaunard wrote:
On 07/09/2011 14:16, John Filo wrote:
I.e. I added:
#elif USE_BIG
    return *reinterpret_cast<const ubig32_t*>(&src);
#elif USE_LITTLE
    return *reinterpret_cast<const ulittle32_t*>(&src);
#elif USE_NATIVE
    return src;
This breaks aliasing rules.
I wondered about that; I assumed that since the return statement was the only read from memory in the function, and there were no writes, aliasing wouldn't be a problem. Also, I had -Wall on and GCC didn't complain, so I assumed it was okay. I guess I need to read up on what the exact rules are.
I'm missing something. Could you please post the full function, without any preprocessing statements, where the aliasing issue is arising?
It's arising in the snippet above...
participants (11)
-
Adder
-
Arash Partow
-
Beman Dawes
-
Giovanni Piero Deretta
-
john filo
-
John Filo
-
Mathias Gaunard
-
Phil Endecott
-
Stewart, Robert
-
Tim Moore
-
Vicente J. Botet Escriba