[I posted this in boost.devel, but got no response. Presumably that was the
wrong group. :)]
Hi Boosters
I'm just doing some research on reference-counting, and contrasting
self-counting types with externally counted types, the latter effected with
the rather excellent shared pointer technique.
[As a side question, is it true that Nicolai Josuttis was the inventor of
the technique?]
Naturally, one of the shared pointers I've used is boost::shared_ptr. The
measurements have demonstrated some surprising performance characteristics
for shared_ptr in particular, and I want to solicit the opinion of the
experts here before I make any bold claims, as some results are just too
surprising for me to trust them.
There are five types tested (a rough sketch of the two counting strategies
follows the list):
1. A self-counting type, using an ancient Synesis (my company) template
ref-counting bolt-in class
2. A self-counting type, using a custom-written (for this test) template
ref-counting bolt-in class
3. Externally counting, using boost::shared_ptr
4. Externally counting, using a custom-written (for this test) shared
pointer type
5. Externally counting, using a custom-written (for this test) shared
pointer type, which has a custom allocator for the shared count members
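For context, here is a rough sketch of the two strategies under test. The
names and all details here are mine and purely illustrative (this is not the
Synesis code, the test code, or boost::shared_ptr), and the counting shown is
not thread-safe:

  #include <algorithm>  // std::swap

  // Types 1 & 2: self-counting - the count lives inside the managed type
  template <typename T>
  class ref_count_bolt_in
  {
  public:
    ref_count_bolt_in() : m_count(1) {}
    void add_ref() { ++m_count; }
    void release() { if(0 == --m_count) delete static_cast<T*>(this); }
  protected:
    ~ref_count_bolt_in() {}
  private:
    long m_count;
  };

  class Thing : public ref_count_bolt_in<Thing>  // CRTP bolt-in
  {};

  // Types 3-5: external counting - the count is a separately allocated
  // node shared between all copies of the pointer
  template <typename T>
  class simple_shared_ptr
  {
  public:
    explicit simple_shared_ptr(T *p = 0) : m_p(p), m_count(new long(1)) {}
    simple_shared_ptr(simple_shared_ptr const &rhs)
      : m_p(rhs.m_p), m_count(rhs.m_count) { ++*m_count; }
    simple_shared_ptr &operator =(simple_shared_ptr const &rhs)
    {
      simple_shared_ptr(rhs).swap(*this);  // copy-and-swap; safe for self-assignment
      return *this;
    }
    ~simple_shared_ptr() { if(0 == --*m_count) { delete m_p; delete m_count; } }
    void swap(simple_shared_ptr &rhs)
    {
      std::swap(m_p, rhs.m_p);
      std::swap(m_count, rhs.m_count);
    }
    T *get() const { return m_p; }
  private:
    T    *m_p;
    long *m_count;
  };

The essential difference is that the self-counting types pay no extra
allocation for the count, whereas the external types allocate a count node
per owned object, which is what the pool in type 5 is there to cheapen.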
There were three scenarios investigated, and the same algorithms were
applied to each of the above five types.
A: Saving refs
vector<smart-ptr>  ptrs;

loop (100,000 times)
{
  smart-ptr  p1(new Thing);
  p1 = p1;                  // deliberate self-assignment
  smart-ptr  p2;
  p2 = p1;
  smart-ptr  p3(p1);
  smart-ptr  p4(p3);
  smart-ptr  p5(p4);
  smart-ptr  p6(p5);
  process_Thing(p6.get());  // a function to do something (actually it just
                            // returns the result of a call on Thing) to
                            // prevent elision by the optimiser
  ptrs.push_back(p5);
}
The whole of this is executed within a single thread, and the program is
tested in both single-threaded and multi-threaded builds.
B: Discarding refs
This is identical to A ("Saving refs") except that the push_back into the
vector of ptrs is omitted. Again, this is done in one thread, and both
single-threaded and multi-threaded forms are tested.
C: Multi-threaded sharing
20 smart pointers, holding 20 Thing instances, are created in the main
thread. Then 20 threads are created and executed, each using the following
thread-proc. Two times are recorded for each scenario: the total time, and
the sum of the individual thread-times (obtained via the WinSTL
threadtimes_counter, which is based on the Win32 API function
GetThreadTimes()).
std::vector<smart-ptr>  ptrs(ITERATIONS / 10);

for(size_t i = 0; i < 10; ++i)
{
  for(size_t j = 0; j < ITERATIONS / 10; ++j)
  {
    ptrs.push_back(things[i % things.size()]);
  }
}
ITERATIONS is 100,000; things is the array of smart pointers shared between
the threads.
The purpose of this one is to really exercise the sharing between threads.
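For what it's worth, the per-thread figures come from the kernel- and
user-time values reported by GetThreadTimes(). The following is just my own
illustration of that idea, not the actual WinSTL threadtimes_counter
implementation:

  #include <windows.h>

  // Sum of kernel + user time consumed by the given thread, in 100ns units
  inline ULONGLONG thread_time_100ns(HANDLE hThread)
  {
    FILETIME  creation, exit_, kernel, user;

    if(!::GetThreadTimes(hThread, &creation, &exit_, &kernel, &user))
    {
      return 0;  // call failed (e.g. insufficient access rights)
    }

    ULARGE_INTEGER  k, u;
    k.LowPart  = kernel.dwLowDateTime;
    k.HighPart = kernel.dwHighDateTime;
    u.LowPart  = user.dwLowDateTime;
    u.HighPart = user.dwHighDateTime;

    return k.QuadPart + u.QuadPart;
  }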
Results (all results are expressed as a percentage of the most costly
pointer's time)
=============================================
[Note, all were compiled with Intel 7.0 with very aggressive speed
optimisations (-O2 -Ox -QaxMKW /Qvc7 -EHsc), and executed in a single
session on an otherwise quiescent 512MB 2GHz XP box.]
A (single-threaded)
--------------------
1. Synesis self-counter: 65%
2. Custom self-counter: 58%
3. boost::shared_ptr: 100%
4. custom SharedPtr: 53%
5. custom SharedPtr + rc pool: 48%
I was not surprised that the crusty old Synesis type was slower than the
custom self-counter. As expected, the custom shared-ptr with rc pool was the
fastest. I was a little surprised that the Boost shared_ptr was as much as
twice as slow as the fastest. However, they're all within a factor of two of
each other, so it's no big result, I guess.
A (multi-threaded)
--------------------
1. Synesis self-counter: 90%
2. Custom self-counter: 46%
3. boost::shared_ptr: 100%
4. custom SharedPtr: 28%
5. custom SharedPtr + rc pool: 27%
This was surprising. I expected the use of thread-safe operations (atomic
increments, etc.) to flatten out the curve, but the reverse has happened.
Interestingly, the two real-world types (Synesis::RefCounter and
boost::shared_ptr) fare less well. Again, not what I expected.
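By "thread-safe operations" I mean the kind of difference sketched below
between the single- and multi-threaded builds of the counting operations.
This is purely my illustration (using the _MT define and the Win32
Interlocked functions), not the actual code of any of the five types:

  #include <windows.h>

  class counted
  {
  public:
    counted() : m_count(1) {}
    virtual ~counted() {}

    void add_ref()
    {
  #ifdef _MT
      ::InterlockedIncrement(&m_count);  // atomic increment in MT builds
  #else
      ++m_count;                         // plain increment in ST builds
  #endif
    }
    void release()
    {
  #ifdef _MT
      if(0 == ::InterlockedDecrement(&m_count))
  #else
      if(0 == --m_count)
  #endif
      {
        delete this;
      }
    }
  private:
    LONG m_count;
  };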
B (single-threaded)
--------------------
1. Synesis self-counter: 65%
2. Custom self-counter: 55%
3. boost::shared_ptr: 100%
4. custom SharedPtr: 59%
5. custom SharedPtr + rc pool: 53%
This scenario has a lot more reference-increments, and a lot less
deallocation of the reference-counted instance. Nonetheless, the results are
virtually identical in relative terms (the actual times taken are about 50%
of those in scenario A, across the board).
B (multi-threaded)
--------------------
1. Synesis self-counter: 71%
2. Custom self-counter: 41%
3. boost::shared_ptr: 100%
4. custom SharedPtr: 32%
5. custom SharedPtr + rc pool: 31%
Pretty much the same picture as B (single-threaded); the results are the
same in relative terms.
Up to this point, the results are mildly interesting, but nothing earth
shattering.
C (multi-threaded sharing)
--------------------
(Times are total time and summed thread times, respectively)
1. Synesis self-counter: 2.9% 39%
2. Custom self-counter: 2.7% 35%
3. boost::shared_ptr: 100% 100%
4. custom SharedPtr: 1.0% 13%
5. custom SharedPtr + rc pool: 1.2% 15%
This scenario tells quite a different tale indeed. One thing to note is
that, because of the degree of reference-sharing between the ptr-storing
vectors in the 20 threads, the ref-count pool becomes a pessimisation: the
number of smart pointer instances created exceeds the arbitrary 1024 cached
ref-counts available. It's likely that this number would not be exceeded in
real-world scenarios, so the pool is probably still worth having.
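For reference, the "1024 cached ref-counts" above refers to a fixed-size
pool from which the shared count nodes are allocated. The following is only
my illustration of the general idea (a free-list over a fixed array, falling
back to the heap when exhausted), not the code used in the test, and it is
not thread-safe as written:

  class rc_pool
  {
  private:
    union node
    {
      long  count;  // while in use: the reference count itself
      node  *next;  // while free: link to the next free node
    };

    enum { POOL_SIZE = 1024 };

  public:
    rc_pool()
      : m_free(&m_nodes[0])
    {
      // Thread all the nodes onto the free list
      for(int i = 0; i < POOL_SIZE - 1; ++i)
      {
        m_nodes[i].next = &m_nodes[i + 1];
      }
      m_nodes[POOL_SIZE - 1].next = 0;
    }

    long *allocate()
    {
      if(0 == m_free)
      {
        return new long(0);  // pool exhausted: fall back to the heap
      }
      node  *n = m_free;
      m_free = n->next;
      return &n->count;
    }

    void deallocate(long *p)
    {
      node  *n = reinterpret_cast<node*>(p);

      if(n < &m_nodes[0] || n > &m_nodes[POOL_SIZE - 1])
      {
        delete p;  // did not come from the pool
      }
      else
      {
        n->next = m_free;
        m_free  = n;
      }
    }

  private:
    node  m_nodes[POOL_SIZE];
    node  *m_free;
  };

With a scheme like this, once the pool is exhausted every further count
comes from the heap anyway, plus the overhead of the range check on
deallocation, which is why heavy sharing can turn it into a pessimisation.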
Naturally, the very big question is: what's happening to boost::shared_ptr?
In thread times, it performs 2-3 times worse than in single threaded
scenarios (irrespective of whether they are built for single or multiple
threads). However, in total time, it performs extremely poorly, up to 100
times slower than some of the others! Clearly, there is some serious
contention involved.
There are three possibilities I have thought of:
(i) It performs very poorly in multi-threaded sharing scenarios. If this is
the case, can anyone more expert in Boost than me offer an explanation as to
precisely what the problem is, and is any such person motivated to try to
address the issue?
(ii) The particular sharing scenario is well away from any real-world
scenario in which boost::shared_ptr is likely to be used. If this is the
case, why do four different implementations, representing two very different
reference-counting strategies, all maintain a consistent relative
performance? What are the scenarios for which boost::shared_ptr is
optimised?
(iii) There are environment settings for which boost::shared_ptr will
perform much better. If that's so, can someone help me out with suggestions
of what defines to make? And why are these settings not the default?
I'd be very interested in your thoughts.
Many thanks in advance.
--
Matthew Wilson
STLSoft moderator (http://www.stlsoft.org)
Contributing editor, C/C++ Users Journal
(www.synesis.com.au/articles.html#columns)
"You can tell a Yorkshireman, but you can't tell him much!" -- Uncle Michael