
Peter Dimov wrote:
>> I didn't want to go into too much detail here either and therefore chose
>> a simple combined figure for both operations. The allocation is of course
>> more costly if it involves locking a process-wide mutex,
> It doesn't have to, given a sensible allocator.
Right, that's why the "if". We haven't discussed allocator policies for Signals yet, but I'm not sure it would be a good idea to use a single allocator for a wide variety of different objects and sizes. I've added support for custom allocators in my implementation, to be used for slot storage. I haven't decided yet whether it's a good idea to rebind the same allocator for use with the target function storage (Boost.Function by default).
>> but subsequent atomic increments take roughly 100 cycles on a Core 2 Duo,
>> and I've seen reports of higher latencies with older Xeons on dual-socket
>> systems. None of that is very surprising, considering that atomic
>> operations are essentially orthogonal to CPU designers' strategies for
>> achieving high instruction throughput.
> Have you actually run a test and observed a 50x slowdown? You can use
> libs/smart_ptr/test/shared_ptr_timing_test.cpp.
I have run tests with interlocked intrinsics on Windows/IA-32. Quite naive ones, but I also once ran into actual performance problems in a project where I could prove that excessive atomic reference counting was the bottleneck. I had wrongly assumed it would be faster than copying the objects, but it is not, at least for small objects (around 32 bytes). I even used a thread-local segregated heap and intrusive counting there, by the way.
> I'm obviously ignoring the small detail that the cleanup callback is an
> optional feature that appears to have no overhead unless actually used.
> I'm not convinced that a cleanup callback is a needed feature; just sick
> and tired of people citing a factor of 50 or 100 slowdown for an atomic
> increment. Cycle counting simply doesn't work anymore (for non-embedded
> CPUs); you need to measure.
I know, and I'm sorry if I gave you that impression. I did not mention cycles until you asked for more details, and even then I was intentionally vague. Let's not get into a futile discussion about details here, because, as you say, performance is no longer clearly predictable on modern CPU architectures. I don't think micro-benchmarks are very helpful either, with respect to caching (where they give too optimistic results) and superscalar execution (where they perform worse than more complex real-life workloads). This was in no way meant as a critique of shared_ptr anyway: there is no alternative that achieves the same functionality without becoming too complicated for such general use. I did not hesitate to agree to shared_ptr as probably the best solution for thread-safe tracking in Signals, and I use it for other purposes as well. It's a great tool!

However, I'd like to focus on the problem at hand. I don't know how closely you have looked at Frank's thread_safe_signals, so let me describe the part of the implementation I had in mind when I raised my "performance warning": Frank uses a vector of shared_ptrs to the tracked objects as a member of his slot_call_iterator. That already made me worry a little, because I usually don't expect a temporary copy of an iterator (through pass by value or other means) to involve a vector copy and reference count adjustments for an arbitrary number of pointers. But it is a feasible way of handling the tracking and won't necessarily be a real problem for most uses.

The new cleanup callback idea solely addresses the case where a slot can't be disconnected immediately because it is being executed by another thread at the same time. If I understood Frank's new suggestion correctly, he plans to add the cleanup callback to the list of tracked objects. I'm not going to count the number of allocations this involves, but it's a lot - too much just to address something that is very unlikely to happen at all.
I also see the possibility of calling the cleanup function directly (without ever copying it into a function object) in the default case. I don't care much about the exceptional case being expensive, but I don't see a reason for tracking the callback function there either.

Regards

Timmo Stange