
Anthony Williams wrote:
> My spreadsheet with the results in can be downloaded from http://www.justsoftwaresolutions.co.uk/files/dining_results.ods
Excellent work!
> Not all of Andrey's runs are in there, but enough to get a reasonable mean and standard deviation for most cases.
> This shows some interesting observations. Firstly, the padding (8k between entries) is hugely beneficial for the small (8-byte) mutexes (boost 1.35 and my BTS-based variant), but the benefit is smaller for the CRITICAL_SECTION-based mutex, which is 24 bytes. If you consider that the data is only 4 bytes and each cache line is 64 bytes, this is unsurprising: with a 24-byte mutex and 4-byte data you can only fit about 2.5 entries in a cache line, so only adjacent entries clash, whereas with an 8-byte mutex and 4-byte data you get 5 entries per cache line, so there is much more conflict. This makes me wonder if it's worth padding the mutex to 64 bytes so that it occupies an entire cache line, or maybe adding a 64-byte alignment requirement.
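For concreteness, the two options at the end of that paragraph might look something like this (a sketch; 64 is an assumed cache line size, the attribute is GCC's spelling, and I believe the C++0x proposal would give a portable way to say the same thing):

    #include <boost/thread/mutex.hpp>

    // Option 1: pad the mutex out to a whole cache line by hand
    // (assumes sizeof(boost::mutex) < 64, true of the 8-byte 1.35 mutex).
    struct padded_mutex {
        boost::mutex m;
        char pad[64 - sizeof(boost::mutex)];
    };

    // Option 2: request 64-byte alignment from the compiler (GCC spelling).
    struct aligned_mutex {
        boost::mutex m;
    } __attribute__((aligned(64)));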
I think that the mutex should be grouped together with the data that it's protecting, e.g.

    // BAD:
    mutex mut[64];
    Thing things[64];

    // BETTER:
    pair<mutex, Thing> protected_things[64];

The presence of the "things" will separate out the mutexes, avoiding the cache-thrashing that you'd get if they were packed together; furthermore, if you're lucky, the "thing" will be pre-fetched into your cache as soon as you lock its mutex. You could even do this:

    template <typename T>
    class Locked: public T {
        mutex mut;
    public:
        void lock()   { mut.lock(); }
        void unlock() { mut.unlock(); }
    };

    Locked<Thing> locked_things[64];

For small "things" it would still help to pad to the cache line size. Various ways to do this come to mind: add a compiler-specific alignment attribute (I think there's a proposal for a standard way to do this in C++0x, right?); add an explicit padding field dependent on sizeof(T); or, for dynamically allocated objects, write a Locked::operator new that rounds up the requested size (sketched below). Of course it's wasteful to pad mutexes that are rarely contended, and in many cases mutexes are rarely contended. Wouldn't it be great if our compilers could do profile-driven memory layout optimisation? Sounds like a good PhD topic to me. In the meantime, I wonder what else could be done at the library level to help programs not thrash their caches?
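To make the operator new idea concrete, here is the same Locked extended with a rounding allocation function (a sketch; the 64 is an assumed line size, and note the caveat about allocator alignment in the comment):

    #include <cstddef>
    #include <new>
    #include <boost/thread/mutex.hpp>

    static const std::size_t cache_line = 64;  // assumed line size

    template <typename T>
    class Locked: public T {
        boost::mutex mut;
    public:
        void lock()   { mut.lock(); }
        void unlock() { mut.unlock(); }

        // Round heap allocations up to a whole number of cache lines so
        // that consecutively allocated Locked objects never share a line.
        // (Assumes the allocator hands back cache-line-aligned blocks,
        // which is not guaranteed in general.)
        static void* operator new(std::size_t size) {
            return ::operator new((size + cache_line - 1)
                                  & ~(cache_line - 1));
        }
        static void operator delete(void* p) { ::operator delete(p); }
    };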
> Secondly, the yield is quite beneficial in some cases (e.g. 58% improvement), but often detrimental (up to 33% degradation). Overall, I think it is not worth adding.
I still find that a mutex that ONLY uses yield gives very good performance. This is because I don't yet have a benchmark where there are large numbers of _unrelated_ threads running on the system. "Dining Philosophers" is not representative of many real-world problems, unfortunately.
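By a mutex that only uses yield, I mean something along these lines (a sketch; the atomic builtins are GCC's, and on Windows the equivalents would be InterlockedExchange and SwitchToThread):

    #include <boost/thread/thread.hpp>

    // Spin on a flag, giving up the timeslice after every failed
    // attempt and never blocking in the kernel.
    class yield_mutex {
        volatile int flag_;  // 0 = unlocked, 1 = locked
    public:
        yield_mutex(): flag_(0) {}
        void lock() {
            while (__sync_lock_test_and_set(&flag_, 1))
                boost::this_thread::yield();
        }
        void unlock() { __sync_lock_release(&flag_); }
    };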
> The next thing I noticed was quite surprising. On the machines with 16 hardware threads, the 32-thread versions often ran faster than the 16-thread versions. I can only presume that this is because the increased thread count meant less contention, and thus fewer blocking waits.
Yes.
> Finally, though the BTS-based mutex is faster than the boost 1.35 one in most cases, the CRITICAL_SECTION-based version is actually very competitive, and fastest in many cases. I found this surprising because it didn't match my prior experience, but it might be a consequence of the (generally undesirable) properties of CRITICAL_SECTIONs: some threads end up getting all the locks and running to completion very quickly, thus reducing the running thread count for the remainder of the test.
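(For anyone who missed the earlier posts: "BTS-based" means the lock is taken with the x86 lock bts, bit-test-and-set, instruction. A very rough sketch of just the fast path, using the MSVC intrinsic; this is not Anthony's actual code, which blocks on a kernel object under contention rather than sleeping:)

    #include <windows.h>
    #include <intrin.h>

    class bts_mutex {
        long word_;  // bit 0 is the lock bit
    public:
        bts_mutex(): word_(0) {}
        void lock() {
            // _interlockedbittestandset compiles to "lock bts" and
            // returns the previous value of the bit.
            while (_interlockedbittestandset(&word_, 0))
                Sleep(0);  // stand-in for a real blocking wait
        }
        void unlock() { _interlockedbittestandreset(&word_, 0); }
    };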
I think fairness is over-rated. An unfair schedule will reduce mean time-to-completion, and get better cache hit rates.

Cheers,
Phil.