
Peter Dimov wrote:
>> I didn't want to go into too much detail here either and therefore chose
>> a simple combined figure for both operations. The allocation is of course
>> more costly if it involves locking a process-wide mutex,
> It doesn't have to, given a sensible allocator.
Right, that's why the "if". We haven't discussed allocator policies for Signals yet, but I'm not sure it would be a good idea to use a single allocator for a wide variety of different objects and sizes. I've added support for custom allocators in my implementation, to be used for slot storage. I haven't decided yet whether it's a good idea to rebind the same allocator for use with the target function storage (Boost.Function by default).
>> but subsequent atomic increments take roughly 100 cycles on a Core 2 Duo,
>> and I've seen reports of higher latencies with older Xeons on dual-socket
>> systems. None of that is very surprising, considering that atomic
>> operations are essentially orthogonal to CPU designers' strategies for
>> achieving high instruction throughput.
> Have you actually run a test and observed a 50x slowdown? You can use
> libs/smart_ptr/test/shared_ptr_timing_test.cpp.
I have run tests with interlocked intrinsics on Windows/IA-32. Quite naive ones, but I also once ran into actual performance problems in a project where I could prove that excessive atomic reference counting was the bottleneck. I had wrongly assumed it would be faster than copying the objects, but it is not, at least for small objects (around 32 bytes). I even used a thread-local segregated heap and intrusive counting there, by the way.
> I'm obviously ignoring the small detail that the cleanup callback is an
> optional feature that appears to have no overhead unless actually used.
> I'm not convinced that a cleanup callback is a needed feature; just sick
> and tired of people citing a factor of 50 or 100 slowdown for an atomic
> increment. Cycle counting simply doesn't work anymore (for non-embedded
> CPUs); you need to measure.
I know, and I'm sorry if I gave you that impression. I did not mention cycles until you asked for more details, and even then I was intentionally vague. Let's not get into a futile discussion about details here, because, as you say, performance is no longer clearly predictable on modern CPU architectures. I don't think micro-benchmarks are very helpful either, with respect to caching (where they give too optimistic results) and superscalar execution (where they perform worse than more complex real-life workloads). This was in no way meant as a critique of shared_ptr anyway: there is no alternative that achieves the same functionality without becoming too complicated for such general use. I did not hesitate to agree to shared_ptr as probably the best solution for thread-safe tracking in Signals, and I use it for other purposes as well. It's a great tool!

However, I'd like to focus on the problem at hand. I don't know how closely you have looked at Frank's thread_safe_signals, so let me describe the part of the implementation I had in mind when I raised my "performance warning": Frank uses a vector of shared_ptrs to the tracked objects as a member of his slot_call_iterator. That already made me worry a little, because I usually don't expect a temporary copy of an iterator (through pass by value or other means) to involve a vector copy and reference count adjustments for an arbitrary number of pointers. But it is a feasible way of handling the tracking and won't necessarily be a real problem for most uses.

The new cleanup callback idea solely addresses the case where a slot can't be disconnected immediately because it is being executed by another thread at the same time. If I understood Frank's new suggestion correctly, he plans to add the cleanup callback to the list of tracked objects. I'm not going to count the number of allocations this involves, but it's a lot - too much just to address something that is very unlikely to happen at all.
I also see the possibility of calling the cleanup function directly (without ever copying it into a function object) in the default case. I don't care much about the exceptional case being expensive, but I don't see a reason for tracking the callback function there either.

Regards

Timmo Stange