
Howard Hinnant wrote:
On Feb 8, 2008, at 6:19 PM, Phil Endecott wrote:
If you can share the code, I'll see how it performs on my Linux systems.
Thanks Phil! Sorry for the delay. I wanted to update my tests to use the latest std:: syntax and rerun them. First, here's the code:
OK, I have made some minimal mods to work with my atomic-ops-and-futex mutex, and I get the following results. All of these results should be treated as provisional, since I don't have 100% confidence that my mutex implementation is actually correct.

For my 1 GHz uniprocessor VIA C3 system:

    Lock-in-order:    8 s
    Retry:            7 s
    Retry and yield:  7 s

For a 3.6 GHz Intel Pentium 4 dual-core system:

    Lock-in-order:    still running.....
    Retry:            13 s
    Retry and yield:  10 s

(For anyone else following the thread, I've put a rough sketch of the "retry and yield" strategy just below your quote.)

So the most obvious initial observation is that you really don't want to run this sort of code on a multiprocessor system; this agrees with your observation:
Interesting side note: During the first run my disk backup utility kicked in. This extra competition for resources actually made the test case run faster!
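Here is that sketch. I've written it with the proposed std:: names rather than lifting the actual test code, so treat it as an illustration of the shape of the algorithm, not the real implementation:

    #include <mutex>
    #include <thread>
    #include <utility>

    // Acquire two mutexes with no fixed global order: lock one, try the
    // other, and on failure back off, yield, and start again with the
    // one we missed.
    void lock_both(std::mutex& a, std::mutex& b)
    {
        std::mutex* first  = &a;
        std::mutex* second = &b;
        for (;;)
        {
            first->lock();
            if (second->try_lock())
                return;                   // got both chopsticks
            first->unlock();              // back off so a neighbour can eat
            std::this_thread::yield();    // the "and yield" part
            std::swap(first, second);     // retry, starting with the missed one
        }
    }

(Plain "retry" is the same loop without the yield; "lock-in-order" simply acquires the two mutexes in a fixed global order, e.g. by address, and never needs to back off.)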
I imagine that the allocation of threads to processors could also explain things like:
I have no explanation for the longer time in the third run.
I know that there is a certain amount of theoretical work about affinity of threads to CPUs, but I don't know what OSes actually do in practice; do they migrate threads based on any run-time measures?

Something that you might find interesting is that I modified the chopstick as follows:

    struct Chopstick
    {
        Chopstick() : random_next_(1) {}
        Mutex mut_;
        unsigned random_next_;
        char padding[8000];    // <<<============
    };

My aim was to prevent the chopsticks from being close together in the cache. This only works if your Mutex object actually contains the shared state itself, rather than being a pointer or some other sort of handle. By doing this, I increased the performance on the dual-core system with the retry-and-yield locking strategy to 6 seconds! (Finally better than the system with a third of the clock speed and a tenth of the price! Yay!) I think that this sort of cache-friendly design is important; the effect that I've measured here would occur in real programs, not just in artificial benchmarks like this. Perhaps the Mutex needs its own allocator, or something?

I got a further improvement, down to 5 s on the dual-core system, by using my yield-based mutex (i.e. no futex at all, just a yield-in-a-loop to lock). So the scale of this benchmark (5 threads) is sufficiently small that the benefit of the futex (i.e. the thread is not woken prematurely) is outweighed by the overhead of recording which thread is waiting for what. My guess is that your additional yield() is part of the same phenomenon: if the number of philosophers were increased until my Mutex<Futex> became better than my Mutex<Yield>, then the benefit of the additional yield in the multilock would also go away.

I plan to do some more measurements with different systems and different mutex implementations (e.g. the evil spin-then-wait and friends). But we should remember that this is only an artificial benchmark.

Something that might be interesting to measure is the total number of context switches (per CPU) while the benchmark runs. I can see that with /usr/bin/time -f '%c %w'. If you have any similar facilities, you might like to see how correlated the total number of context switches is with the total runtime.

Regards,
Phil
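P.S. In case it is useful to anyone, by "yield-based mutex" I mean essentially the following; again this is a sketch in the proposed std:: style, not my actual implementation:

    #include <atomic>
    #include <thread>

    // A mutex that never sleeps in the kernel: on contention it simply
    // gives up its timeslice and tries again.  There is no futex and no
    // record of who is waiting for what.
    class YieldMutex
    {
        std::atomic<bool> locked_{false};
    public:
        void lock()
        {
            while (locked_.exchange(true, std::memory_order_acquire))
                std::this_thread::yield();
        }
        void unlock()
        {
            locked_.store(false, std::memory_order_release);
        }
    };

The bookkeeping this omits (recording who is waiting for what) is exactly the overhead I was referring to; with only 5 threads it evidently costs more than it saves in premature wake-ups.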