
Timmo Stange wrote:
I didn't want to go into too much detail here either and therefor chose a simple combined figure for both operations. The allocation is of course more costly if it involves locking a process-wide mutex,
It doesn't have to, given a sensible allocator.
but subsequent atomic increments roughly take 100 cycles on a Core 2 Duo and I've seen reports of higher delays with older Xeons on dual socket systems. That is all not very surprising, considering that atomic operations are essentially orthogonal to CPU designers' strategies of achieving a high instruction throughput.
Have you actually run a test and observed a 50x slowdown? You can use libs/smart_ptr/test/shared_ptr_timing_test.cpp. I'm obviously ignoring the small detail that the cleanup callback is an optional feature that appears to have no overhead unless actually used. I'm not convinced that a cleanup callback is a needed feature; just sick and tired of people citing a factor of 50 or 100 slowdown for an atomic increment. Cycle counting simply doesn't work anymore (for non-embedded CPUs); you need to measure.