[thread] performance of thread-local storage implementation

Hi,

I did some performance measurements to find out how expensive thread-local storage is. It turned out that, in my tests, boost's thread-local storage implementation is a factor of 5.5 slower than pthread's. Why could this be?

The test spawns n threads, each of which increments its own 64-bit counter. After 2 seconds the threads are shut down and joined, and the counters are summed up. There is enough space between the counter objects to rule out false sharing. Synchronization is done via a volatile sig_atomic_t, so no locks skew the tests.

In the "no tls" test the counters are accessed directly. In the "pthread" and "boost" tests the counters are accessed through thread-local storage, using pthread's and boost's implementation, respectively.

The code can be found here [1].

The test environment is:
- Debian Lenny (gcc 4.3.2)
- MacPro, 8 x Intel Xeon 2.80GHz
- release variant, i.e. -O3 -finline-functions -fPIC -DNDEBUG

The results are as follows ("factor" is the "no tls" counter sum divided by this row's counter sum, for the same number of threads):

type      # threads   counter sum   factor
no tls    1            558e+07
no tls    2           1116e+07
no tls    3           1674e+07
no tls    4           2237e+07
no tls    5           2811e+07
no tls    6           3372e+07
no tls    7           3943e+07
no tls    8           4784e+07
pthread   1             50e+07      11.16
pthread   2            101e+07      11.05
pthread   3            152e+07      11.01
pthread   4            203e+07      11.02
pthread   5            241e+07      11.66
pthread   6            306e+07      11.01
pthread   7            339e+07      11.63
pthread   8            434e+07      11.02
boost     1              9e+07      62.00
boost     2             18e+07      62.00
boost     3             27e+07      62.00
boost     4             37e+07      60.46
boost     5             46e+07      61.10
boost     6             56e+07      60.21
boost     7             65e+07      60.66
boost     8             79e+07      60.56

My interpretation of the results is that using thread-local storage is quite expensive. This was somewhat expected. What is surprising is that the boost lib introduces another penalty of a factor of 5.5 (roughly 62 versus 11 relative to "no tls").

Does anybody see a mistake? Or is there an explanation for why the boost implementation is so expensive? Can it perhaps be done better?

regards
Alex

[1] http://github.com/copton/roadrunner/tree/b2b19d752308c253e42d7c28c86f612aa60...
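For readers who cannot reach the linked code, here is a minimal sketch of the kind of benchmark described above. It is not the code from [1]. The names PaddedCounter, worker_direct, worker_pthread_tls and run_benchmark are made up for illustration, the 128-byte padding is an assumed value, and the stop flag is a std::atomic<bool> read with relaxed ordering instead of the volatile sig_atomic_t mentioned in the post. A boost variant would need a third worker and is left out so that the sketch builds without Boost.

```cpp
#include <pthread.h>
#include <time.h>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// One counter per thread, padded so that two counters never share a
// cache line (128 bytes is an assumed, generous size).
struct PaddedCounter {
    std::uint64_t value;
    char pad[128 - sizeof(std::uint64_t)];
};

// Stop flag. The post uses a volatile sig_atomic_t; an atomic with
// relaxed loads is swapped in here to avoid a formal data race.
static std::atomic<bool> g_run(false);

static pthread_key_t g_key;
static pthread_once_t g_once = PTHREAD_ONCE_INIT;

static void make_key() {
    int rc = pthread_key_create(&g_key, 0);
    assert(rc == 0);
    (void)rc;
}

// "no tls": the counter is reached through a plain pointer.
static void* worker_direct(void* arg) {
    PaddedCounter* c = static_cast<PaddedCounter*>(arg);
    while (g_run.load(std::memory_order_relaxed)) ++c->value;
    return 0;
}

// "pthread": the counter pointer is looked up via pthread_getspecific
// on every iteration, which is the cost being measured.
static void* worker_pthread_tls(void* arg) {
    pthread_once(&g_once, make_key);
    pthread_setspecific(g_key, arg);
    while (g_run.load(std::memory_order_relaxed)) {
        PaddedCounter* c =
            static_cast<PaddedCounter*>(pthread_getspecific(g_key));
        ++c->value;
    }
    return 0;
}

// Runs `threads` workers for `millis` milliseconds, joins them and
// returns the sum of all counters.
std::uint64_t run_benchmark(void* (*worker)(void*), int threads,
                            unsigned millis) {
    std::vector<PaddedCounter> counters(threads);  // zero-initialized
    std::vector<pthread_t> ids(threads);
    g_run.store(true);
    for (int i = 0; i < threads; ++i) {
        int rc = pthread_create(&ids[i], 0, worker, &counters[i]);
        assert(rc == 0);
        (void)rc;
    }
    timespec ts;
    ts.tv_sec = millis / 1000;
    ts.tv_nsec = (millis % 1000) * 1000000L;
    nanosleep(&ts, 0);
    g_run.store(false);
    std::uint64_t sum = 0;
    for (int i = 0; i < threads; ++i) {
        pthread_join(ids[i], 0);  // join before reading the counter
        sum += counters[i].value;
    }
    return sum;
}
```

Calling run_benchmark(worker_direct, n, 2000) and run_benchmark(worker_pthread_tls, n, 2000) for n = 1..8 and dividing the first sum by the second would give the "factor" column for the pthread rows. The absolute numbers will depend on compiler, flags and hardware.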

> My interpretation of the results is that using thread-local storage is quite expensive. This was somewhat expected. What is surprising is that the boost lib introduces another penalty of a factor of 5.5.

Hi Alexander,

Have you run this benchmark on Windows? I would be very much interested in the results.

Kind regards.

-- 
EA
participants (2): Alexander Bernauer, Edouard A.