New subject: [thread] performance of thread-local storage implementation

11 Mar 2009

Hi,

I did some performance measurements to find out how expensive
thread-local storage is. In my tests, Boost's thread-local storage
implementation turned out to be a factor of 5.5 slower than pthread's.
Why could this be?

The test spawns n threads, each of which increments its own 64-bit
counter. After 2 seconds, the threads are shut down and joined, and
the counters are summed up.

The counter objects are spaced far enough apart to rule out false
sharing. Synchronization is done via a volatile sig_atomic_t, so no
locks skew the tests.
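
A minimal sketch of such a harness for the "no tls" case, to make the
setup concrete. This is not the code from [1]: it assumes C++11
std::thread and a 64-byte cache line, and all names (PaddedCounter,
g_stop, run_no_tls) are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <csignal>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is typical for x86 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// One counter per thread, padded so that consecutive counters in an
// array are a full cache line apart (no false sharing).
struct PaddedCounter {
    std::uint64_t value = 0;
    char pad[kCacheLine - sizeof(std::uint64_t)];
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "unexpected padding");

// Stop flag as described in the post: a volatile sig_atomic_t, no locks.
// Note: formally this is a data race under the C++11 memory model;
// std::atomic<bool> would be the portable choice.
volatile std::sig_atomic_t g_stop = 0;

// Each worker increments only its own counter until the flag is set.
void worker(PaddedCounter* counter) {
    while (!g_stop) {
        ++counter->value;
    }
}

// Runs n_threads workers for the given duration and returns the counter sum.
std::uint64_t run_no_tls(unsigned n_threads, std::chrono::milliseconds duration) {
    assert(n_threads > 0);
    g_stop = 0;
    std::vector<PaddedCounter> counters(n_threads);
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n_threads; ++i) {
        threads.emplace_back(worker, &counters[i]);
    }
    std::this_thread::sleep_for(duration);
    g_stop = 1;  // ask all workers to leave their loops
    for (auto& t : threads) {
        t.join();
    }
    std::uint64_t sum = 0;
    for (const auto& c : counters) {
        sum += c.value;
    }
    return sum;
}
```

A call such as run_no_tls(8, std::chrono::seconds(2)) would correspond
to one "no tls" row of the measurements.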

In the "no tls" test the counters are accessed directly.
In the "pthread" and "boost" tests the counters are accessed through
thread-local storage, using pthread's and Boost's implementation,
respectively.
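
To illustrate the difference between the access paths, here is a
minimal sketch of direct access versus access through a pthread key.
It is an assumption about what such code typically looks like, not an
excerpt from [1]; the function and variable names are hypothetical.
The Boost variant would go through boost::thread_specific_ptr instead
and is not shown.

```cpp
#include <pthread.h>

#include <cassert>
#include <cstdint>

// Hypothetical names for illustration only.
static pthread_key_t g_counter_key;
static pthread_once_t g_key_once = PTHREAD_ONCE_INIT;

// Creates the key exactly once per process.
static void make_key() {
    // No destructor: the counter is owned by whoever bound it.
    int rc = pthread_key_create(&g_counter_key, nullptr);
    assert(rc == 0);
    (void)rc;
}

// Binds the calling thread's counter to the key (done once per thread).
void bind_counter(std::uint64_t* counter) {
    pthread_once(&g_key_once, make_key);
    int rc = pthread_setspecific(g_counter_key, counter);
    assert(rc == 0);
    (void)rc;
}

// "pthread" variant: every increment pays for a pthread_getspecific lookup.
void increment_via_tls() {
    auto* counter =
        static_cast<std::uint64_t*>(pthread_getspecific(g_counter_key));
    ++*counter;
}

// "no tls" variant: the pointer to the counter is used directly.
void increment_direct(std::uint64_t* counter) {
    ++*counter;
}
```

In a benchmark loop, each thread would call bind_counter once and then
call increment_via_tls repeatedly, so the measured cost is dominated by
the lookup rather than by key setup.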

The code can be found at [1].

The test environment is:
 - Debian Lenny (gcc 4.3.2)
 - MacPro, 8 x Intel Xeon 2.80GHz
 - release variant, i.e. -O3 -finline-functions -fPIC -DNDEBUG

The results are:
("factor" is the "no tls" counter sum divided by this row's counter
sum, for the same number of threads)

type     # threads   counter sum   factor

no tls      1         558e+07
no tls      2        1116e+07
no tls      3        1674e+07
no tls      4        2237e+07
no tls      5        2811e+07
no tls      6        3372e+07
no tls      7        3943e+07
no tls      8        4784e+07

pthread     1          50e+07      11.16
pthread     2         101e+07      11.05 
pthread     3         152e+07      11.01
pthread     4         203e+07      11.02
pthread     5         241e+07      11.66
pthread     6         306e+07      11.01
pthread     7         339e+07      11.63
pthread     8         434e+07      11.02

boost       1           9e+07      62.00
boost       2          18e+07      62.00
boost       3          27e+07      62.00
boost       4          37e+07      60.46
boost       5          46e+07      61.10
boost       6          56e+07      60.21
boost       7          65e+07      60.66
boost       8          79e+07      60.56

My interpretation of the results is that using thread-local storage is
quite expensive. This was somewhat expected. What is surprising, though,
is that the Boost library introduces another penalty of a factor of 5.5.
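
For reference, the arithmetic behind these numbers can be written out
as a small sketch. The function names are hypothetical; the example
values in the usage note are the single-thread rows of the table.

```cpp
#include <cassert>

// factor = "no tls" counter sum / counter sum of the variant under test,
// both measured with the same number of threads.
double factor(double no_tls_sum, double variant_sum) {
    assert(variant_sum > 0.0);
    return no_tls_sum / variant_sum;
}

// Additional slowdown of one TLS variant relative to another,
// expressed as the ratio of their factors.
double relative_penalty(double slower_factor, double faster_factor) {
    assert(faster_factor > 0.0);
    return slower_factor / faster_factor;
}
```

With the single-thread sums, factor(558e7, 50e7) gives 11.16 for
pthread and factor(558e7, 9e7) gives 62.00 for Boost; their ratio,
relative_penalty(62.00, 11.16), is roughly 5.5.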

Does anybody see a mistake? Or is there an explanation for why the
Boost implementation is so expensive? Can it perhaps be done better?

regards

Alex

[1] http://github.com/copton/roadrunner/tree/b2b19d752308c253e42d7c28c86f612aa60...

Participants (2): Alexander Bernauer, Edouard A.