[thread] best practice to lock multiple mutexes

Hi,

what's the best practice in the following scenario: m threads and n resources (each resource protected by its own mutex). Thread x needs read/write access to resources (...,i,j,k,...); threads y,z,... access other subsets of the resources at the same time.

In order to prevent deadlocks I would do the following in the code of the different threads (the code executed by each thread is different):

    ...
    try_lock: // label
    unique_lock< mutex > lk1( mtx1, try_to_lock );
    unique_lock< mutex > lk2( mtx2, try_to_lock );
    unique_lock< mutex > lk3( mtx3, try_to_lock );
    if ( ! ( lk1 && lk2 && lk3 ) ) goto try_lock;
    // all mutexes are locked
    // now modify resources 1, 2 and 3 safely
    ...

Maybe you have a better pattern to solve this problem?

best regards, Oliver

On Friday 01 February 2008 14:21:40 Kowalke Oliver (QD IT PA AS) wrote:
Hi, what's the best practice in the following scenario: m threads and n resources (each resource protected by its own mutex). Thread x needs read/write access to resources (...,i,j,k,...); threads y,z,... access other subsets of the resources at the same time.

In order to prevent deadlocks I would do the following in the code of the different threads (the code executed by each thread is different):

    ...
    try_lock: // label
    unique_lock< mutex > lk1( mtx1, try_to_lock );
    unique_lock< mutex > lk2( mtx2, try_to_lock );
    unique_lock< mutex > lk3( mtx3, try_to_lock );
    if ( ! ( lk1 && lk2 && lk3 ) ) goto try_lock;
    // all mutexes are locked
    // now modify resources 1, 2 and 3 safely
    ...

Maybe you have a better pattern to solve this problem?
I'm not sure if it solves exactly your problem, but generally needing to lock two mutexes for a given resource has the problem that it may very easily run into deadlock if another thread tries to lock the same mutexes in the reversed order. As such, all you need to do is define an order relation across the mutexes (a good choice is to order the locks by the mutexes' addresses in memory; this gives you a strict total order across all the mutexes that may be in use at any moment).

I'm not sure what your example is supposed to do. Is it supposed to be a trylock or a normal lock (I suppose that's what "try_to_lock_t" signals)? If it is a trylock, then you have just described a busy-wait loop (it will consume 100% CPU while waiting for the locks). You just need to use normal blocking locks, except that you should be careful to always use the same locking order for the same set of mutexes (as described above). In my code (I often needed to lock two mutexes) I have made a lock wrapper class that does the ordered locking of the two mutexes.

-- Mihai RUSU Email: dizzy@roedu.net "Linux is obsolete" -- AST
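A minimal sketch of the ordered-locking wrapper Mihai describes might look like this (hypothetical code, not his actual class; written against C++11's std::mutex, which postdates this thread, but the idea is identical):

    #include <mutex>

    // Locks two mutexes in a globally consistent order (lowest address
    // first), so two threads taking the same pair can never deadlock.
    // Precondition: &a != &b.
    class ordered_lock {
        std::mutex& first_;
        std::mutex& second_;
    public:
        ordered_lock(std::mutex& a, std::mutex& b)
            : first_(&a < &b ? a : b),
              second_(&a < &b ? b : a)
        {
            first_.lock();
            second_.lock();
        }
        ~ordered_lock()
        {
            second_.unlock();
            first_.unlock();
        }
        ordered_lock(const ordered_lock&) = delete;
        ordered_lock& operator=(const ordered_lock&) = delete;
    };

A scope such as { ordered_lock guard(m1, m2); /* ... */ } then holds both mutexes, regardless of the order in which the caller named them.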

That's a hairy one :) I dealt with something like this at some point. This approach doesn't sound too bad. One thing you should not forget: when you find out that you didn't lock all the resources you wanted, unlock the ones you did get the lock for and sleep a while before retrying. Otherwise you will end up in a deadlock at some point, or the lock attempts will eat your CPU cycles. Some OS/thread schedulers don't do well with a tight loop, i.e. the other thread(s) will hardly get any CPU time at all anymore.

Probably (though this may depend on the access frequency and/or pattern, on the ratio of threads to resources, and on the amount of time each thread spends working with the resources) you may be better off with just a single mutex. If you need to spend more time waiting to lock only the objects you need than it actually takes to work with the resources, then that is not a win.

HTH, Harro Verkouter
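A sketch of the unlock-everything-and-sleep retry loop Harro describes (hypothetical; it leans on C++11's std::try_lock, which already releases whatever it managed to acquire when one of the locks fails):

    #include <chrono>
    #include <mutex>
    #include <thread>

    std::mutex mtx1, mtx2, mtx3;

    void modify_all_three()
    {
        // std::try_lock returns -1 on success, otherwise the 0-based index
        // of the lock that failed, having already unlocked the others.
        while (std::try_lock(mtx1, mtx2, mtx3) != -1)
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // back off

        // Adopt the already-held mutexes so they are released on scope exit.
        std::unique_lock<std::mutex> l1(mtx1, std::adopt_lock);
        std::unique_lock<std::mutex> l2(mtx2, std::adopt_lock);
        std::unique_lock<std::mutex> l3(mtx3, std::adopt_lock);
        // ... modify resources 1, 2 and 3 safely ...
    }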
Kowalke Oliver (QD IT PA AS) wrote:

Hi, what's the best practice in the following scenario: m threads and n resources (each resource protected by its own mutex). Thread x needs read/write access to resources (...,i,j,k,...); threads y,z,... access other subsets of the resources at the same time.

In order to prevent deadlocks I would do the following in the code of the different threads (the code executed by each thread is different):

    ...
    try_lock: // label
    unique_lock< mutex > lk1( mtx1, try_to_lock );
    unique_lock< mutex > lk2( mtx2, try_to_lock );
    unique_lock< mutex > lk3( mtx3, try_to_lock );
    if ( ! ( lk1 && lk2 && lk3 ) ) goto try_lock;
    // all mutexes are locked
    // now modify resources 1, 2 and 3 safely
    ...

Maybe you have a better pattern to solve this problem?

best regards, Oliver

Hi,

This article at Dr Dobb's proposes a possible solution to the problem of simultaneously locking multiple mutexes; I haven't tested it in practice yet though: http://www.ddj.com/cpp/204801163?pgno=1 (click away the ad)

Kind regards, Walter

On Feb 1, 2008 1:21 PM, Kowalke Oliver (QD IT PA AS) <Oliver.Kowalke@qimonda.com> wrote:
Hi, what's the best practice in the following scenario: m threads and n resources (each resource protected by its own mutex). Thread x needs read/write access to resources (...,i,j,k,...); threads y,z,... access other subsets of the resources at the same time.

In order to prevent deadlocks I would do the following in the code of the different threads (the code executed by each thread is different):

    ...
    try_lock: // label
    unique_lock< mutex > lk1( mtx1, try_to_lock );
    unique_lock< mutex > lk2( mtx2, try_to_lock );
    unique_lock< mutex > lk3( mtx3, try_to_lock );
    if ( ! ( lk1 && lk2 && lk3 ) ) goto try_lock;
    // all mutexes are locked
    // now modify resources 1, 2 and 3 safely
    ...

Maybe you have a better pattern to solve this problem?

best regards, Oliver

On Friday 01 February 2008 09:00 am, Walter Horsten wrote:
    ...
    try_lock: // label
    unique_lock< mutex > lk1( mtx1, try_to_lock );
    unique_lock< mutex > lk2( mtx2, try_to_lock );
    unique_lock< mutex > lk3( mtx3, try_to_lock );
    if ( ! ( lk1 && lk2 && lk3 ) ) goto try_lock;
    // all mutexes are locked
    // now modify resources 1, 2 and 3 safely
    ...

Maybe you have a better pattern to solve this problem?
I'd just establish a locking order and stick to it. You can use debug tools like helgrind to detect locking order violations, or a mutex wrapper that detects locking order violations (shameless plug): http://www.comedi.org/projects/libpoet/boostbook/doc/boostbook/html/poet/acy...

Trying to come up with a usable locking order at runtime adds overhead, requires all the mutexes to be available for locking in the same context, and generates false positives when using the aforementioned debugging tools.

-- Frank

Kowalke Oliver (QD IT PA AS) <Oliver.Kowalke <at> qimonda.com> writes:
what's the best practice in the following scenario: m threads and n resources (each resource protected by its own mutex). Thread x needs read/write access to resources (...,i,j,k,...); threads y,z,... access other subsets of the resources at the same time.

In order to prevent deadlocks I would do the following in the code of the different threads (the code executed by each thread is different):

    ...
    try_lock: // label
    unique_lock< mutex > lk1( mtx1, try_to_lock );
    unique_lock< mutex > lk2( mtx2, try_to_lock );
    unique_lock< mutex > lk3( mtx3, try_to_lock );
    if ( ! ( lk1 && lk2 && lk3 ) ) goto try_lock;
    // all mutexes are locked
    // now modify resources 1, 2 and 3 safely
    ...

Maybe you have a better pattern to solve this problem?
I intended to write a lock template for this, but never got round to it.

There are two ways to solve the multiple lock problem: order the mutexes (e.g. by address) so they are always locked in the same order, or do try-and-back-off, one variant of which you showed above.

A better version of try-and-back-off is to lock the first mutex unconditionally, and then try to lock the others in turn. If a lock fails, then unlock ALL the mutexes and start again, but this time with a blocking lock on the failed mutex. This means that you're waiting for the mutex that you couldn't get last time, but without holding any others, so you won't deadlock with another thread. There's a small possibility of livelock if two threads try this technique on precisely the same set of mutexes at the same time, but it's unlikely to occur in practice, and likely to resolve quickly if it does.

Anthony

-- Anthony Williams | Just Software Solutions Ltd Custom Software Development | http://www.justsoftwaresolutions.co.uk Registered in England, Company Number 5478976. Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
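For two mutexes, a minimal sketch of the algorithm Anthony describes (hypothetical; the N2447 reference implementation that appears later in the thread is the full variadic version of the same idea):

    #include <mutex>
    #include <utility>

    void lock_both(std::mutex& m1, std::mutex& m2)
    {
        std::mutex* block_on = &m1;  // lock this one unconditionally
        std::mutex* try_on   = &m2;  // merely try this one
        for (;;)
        {
            block_on->lock();
            if (try_on->try_lock())
                return;                  // success: both mutexes held
            block_on->unlock();          // hold nothing while we wait...
            std::swap(block_on, try_on); // ...then block on the one that failed
        }
    }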

Kowalke Oliver wrote:

what's the best practice in the following scenario: m threads and n resources (each resource protected by its own mutex). Thread x needs read/write access to resources (...,i,j,k,...); threads y,z,... access other subsets of the resources at the same time.

In order to prevent deadlocks I would do the following in the code of the different threads (the code executed by each thread is different):

    ...
    try_lock: // label
    unique_lock< mutex > lk1( mtx1, try_to_lock );
    unique_lock< mutex > lk2( mtx2, try_to_lock );
    unique_lock< mutex > lk3( mtx3, try_to_lock );
    if ( ! ( lk1 && lk2 && lk3 ) ) goto try_lock;
    // all mutexes are locked
    // now modify resources 1, 2 and 3 safely
    ...

Maybe you have a better pattern to solve this problem?
N2447 proposes a generic locking algorithm. Quote:

    template <class L1, class L2, class ...L3> void lock(L1&, L2&, L3&...);

Requires: Each template parameter type must supply the following member functions with semantics corresponding to the Mutex concept, except that try_lock is allowed to throw an exception. [Note: The unique_lock class template meets these requirements when suitably instantiated. --end note]

    void lock();
    bool try_lock();
    void unlock();

Effects: All arguments are locked with an algorithm that avoids deadlock. If an exception is thrown by a call to lock() or try_lock(), there are no effects.

End quote.

Presumably the intention is that you use it like this:

    {
        unique_lock< mutex > lk1( mtx1, defer_lock );
        unique_lock< mutex > lk2( mtx2, defer_lock );
        unique_lock< mutex > lk3( mtx3, defer_lock );
        lock( lk1, lk2, lk3 );
        // ... critical section ...
    }   // locks are released when they go out of scope

I guess that a multi-lock could be built on top of this, for more concise code:

    struct multilock
    {
        unique_lock< mutex > lk1;
        unique_lock< mutex > lk2;
        unique_lock< mutex > lk3;
        multilock( mutex& mtx1, mutex& mtx2, mutex& mtx3 )
            : lk1( mtx1, defer_lock ),
              lk2( mtx2, defer_lock ),
              lk3( mtx3, defer_lock )
        {
            lock( lk1, lk2, lk3 );
        }
    };

    {
        multilock mlk( mtx1, mtx2, mtx3 );
        // ... critical section ...
    }

This is not included in Boost.Threads (AFAIK), and doing so without C++0x variadic templates is a bit messy.

The obvious (to me!) implementation of the lock algorithm is to order the mutexes. This can be done based on their addresses, if every instance of the same underlying mutex has the same address. That doesn't work for shared-memory mutexes; maybe the mutex itself should be required to be less-than-comparable? Anyway, you can easily write a version for a fixed number of locks:

    void lock( lock& lk1, lock& lk2 )
    {
        if ( lk1.mutex() < lk2.mutex() )
        {
            lk1.lock();
            lk2.lock();
        }
        else
        {
            lk2.lock();
            lk1.lock();
        }
    }

Phil.

On Feb 1, 2008, at 11:31 AM, Phil Endecott wrote:
[Phil's message, quoted in full above; snipped.]
Here's the reference implementation for N2447:

    // Copyright Howard Hinnant 2007. Distributed under the Boost
    // Software License, Version 1.0. (see http://www.boost.org/LICENSE_1_0.txt)

    // returns index of failed lock or -1 on success
    template <class L1, class L2>
    int try_lock(L1& l1, L2& l2)
    {
        unique_lock<L1> u(l1, try_to_lock);
        if (u.owns_lock())
        {
            if (l2.try_lock())
            {
                u.release();
                return -1;
            }
            return 1;
        }
        return 0;
    }

    template <class L1, class L2, class ...L3>
    int try_lock(L1& l1, L2& l2, L3&... l3)
    {
        int r = 0;
        unique_lock<L1> u(l1, try_to_lock);
        if (u.owns_lock())
        {
            r = try_lock(l2, l3...);
            if (r == -1)
                u.release();
            else
                ++r;
        }
        return r;
    }

    template <class L1, class L2>
    void lock(L1& l1, L2& l2)
    {
        while (true)
        {
            {
                unique_lock<L1> __u1(l1);
                if (l2.try_lock())
                {
                    __u1.release();
                    break;
                }
            }
            std::this_thread::yield();
            {
                unique_lock<L2> __u2(l2);
                if (l1.try_lock())
                {
                    __u2.release();
                    break;
                }
            }
            std::this_thread::yield();
        }
    }

    template <class L1, class L2, class ...L3>
    void __lock_first(int __i, L1& l1, L2& l2, L3&... l3)
    {
        while (true)
        {
            switch (__i)
            {
            case 0:
                {
                    unique_lock<L1> __u1(l1);
                    __i = try_lock(l2, l3...);
                    if (__i == -1)
                    {
                        __u1.release();
                        return;
                    }
                }
                ++__i;
                std::this_thread::yield();
                break;
            case 1:
                {
                    unique_lock<L2> __u2(l2);
                    __i = try_lock(l3..., l1);
                    if (__i == -1)
                    {
                        __u2.release();
                        return;
                    }
                }
                if (__i == sizeof...(L3))
                    __i = 0;
                else
                    __i += 2;
                std::this_thread::yield();
                break;
            default:
                __lock_first(__i - 2, l3..., l1, l2);
                return;
            }
        }
    }

    template <class L1, class L2, class ...L3>
    inline void lock(L1& l1, L2& l2, L3&... l3)
    {
        __lock_first(0, l1, l2, l3...);
    }

Notes:

1. It doesn't require Lock::mutex().

2. It is *much* faster than the lock-in-order algorithm.

3. The yields are absolutely necessary to get the high performance mentioned in the previous note.

4. I've beat the heck out of this code (with extremely high contention tests). Live-lock just doesn't happen. Neither does thread starvation. The above code is very "fair".

5. Lock-in-order can result in thread starvation (experimentally determined under high contention tests).

6. These algorithms have the strong exception safety guarantee for a throwing lock() or try_lock(): if one of these throws, nothing is locked. unlock() must not throw.

7. Anthony's previous post:
A better version of try-and-back-off is to lock the first mutex unconditionally, and then try to lock the others in turn. If a lock fails, then unlock ALL the mutexes and start again, but this time with a blocking lock on the failed mutex. This means that you're waiting for the mutex that you couldn't get last time, but without holding any others, so you won't deadlock with another thread.
is a good description of the above algorithms.

8. The above algorithms can painstakingly be converted to not use variadic templates, for a fixed number of locks (I've done up to 4 before the hair loss became too great ;-)).

-Howard

Hi Howard, thank you for your code - I'll try it out. How about adding your implementation to boost-1.35.0? best regards, Oliver
[Howard's reference implementation and notes, quoted in full above; snipped.]

Howard Hinnant <hinnant <at> twcny.rr.com> writes:
    template <class L1, class L2>
    void lock(L1& l1, L2& l2)
    {
        while (true)
        {
            {
                unique_lock<L1> __u1(l1);
                if (l2.try_lock())
                {
                    __u1.release();
                    break;
                }
            }
            std::this_thread::yield();
snipped rest of multi-lock code.
3. The yields are absolutely necessary to get the high performance mentioned in the previous note.
I find this comment intriguing. Is this always the case, or does it depend on the specifics of how a mutex is implemented on a given platform? Anthony

On Feb 8, 2008, at 2:42 PM, Anthony Williams wrote:
Howard Hinnant <hinnant <at> twcny.rr.com> writes:
    template <class L1, class L2>
    void lock(L1& l1, L2& l2)
    {
        while (true)
        {
            {
                unique_lock<L1> __u1(l1);
                if (l2.try_lock())
                {
                    __u1.release();
                    break;
                }
            }
            std::this_thread::yield();
snipped rest of multi-lock code.
3. The yields are absolutely necessary to get the high performance mentioned in the previous note.
I find this comment intriguing. Is this always the case, or does it depend on the specifics of how a mutex is implemented on a given platform?
That is a question I've been asking myself for a long time, unfortunately without a satisfactory answer. Here's my experience, and a few guesses...

On 2-core Macs and 2-core Windows machines, I've measured improved speeds with the yields, using a "dining philosophers" style simulation. I haven't tested on Linux. Here are results (dating from around 3 years ago):

    In order:                    63 seconds.
    try/back off/without yield:  47 seconds.
    try/back off/with yield:     34 seconds.

These numbers come from 5 threads each wanting two mutexes (classic dining-philosophers setup). The same runs on single-core machines looked like:

    In order:                    293 seconds.
    try/back off/without yield:  not tested.
    try/back off/with yield:     278 seconds.

The single-core results were measured on different machines than the dual-core results and so aren't comparable to them.

I set up an additional "dining philosophers" test where each philosopher sat at a corner of a cube (8 philosophers) and attempted to acquire the mutexes associated with the edges of the cube adjacent to that corner (12 mutexes in all). That is, each diner required "3 forks" to eat. The results were:

    In order:                    226 seconds.
    try/back off/without yield:  not tested.
    try/back off/with yield:     77 seconds.

Unfortunately I didn't run the last two tests without the yields.

I believe the try/back-off algorithms are faster than the in-order algorithm because in the latter a thread sits on a mutex while blocked on another. Thus a "fork" that somebody else could be using is idled. In the try/back-off algorithms the diners are "more polite": if they can not eat, they give up their fork so that someone else may be able to eat.

Now what about the yields... Blocking on a mutex should be very similar to a yield. But the first test shows a significant speed advantage for the yields. The above isn't noise; the tests were run many times and the averages are reported.

When running the tests on a Mac I watched the CPU activity in the "Activity Monitor" application. [Graphic not included in the archive.] The red in the graphic indicates "system time" while the green indicates "user time". Without the yields, around 20% to 25% of the time is spent in "system", with the rest spent in "user". With the yields, the "system time" is almost completely gone (maybe 7% or 8%). So more "user time" translates to a faster application (not so much "system time").

But this still doesn't answer your precise question: is this behavior really portable? I'm afraid I don't know for sure. It has just been my experience on Mac and Windows. The Windows tests were done 3 years ago, and I asked someone to run them for me; I didn't have the machine available to investigate myself. The Mac tests were done at the same time, but I've replicated those results more recently; I haven't replicated the Windows results.

As long as I'm posting graphics (I hope nobody minds), here's another interesting snapshot of the 8-diner / 12-fork / 3-forks-at-a-time case. Both snapshots are taken as the test is ending. [Graphics for "lock in order" and "try/back off/with yield" not included in the archive.] The graphs show that the lock-in-order algorithm isn't making full use of the CPUs in the final seconds of the test. The reason for this is that some diners "hogged" the system and got through eating while other diners sat around not getting their work done. In the end there weren't enough diners left to keep both CPUs busy (I don't recall if we are looking at 1 or 2 diners at the end of this test). The try/back off/with yield graphic shows the same effect, but over a *much* shorter time period: both CPUs are fully busy in "user time" for nearly the entire test.

I hope this gives you some insights, despite the lack of a clear answer to your question. Perhaps you or others can present a good explanation for this behavior regarding the yields. :-)

-Howard

Howard Hinnant:
Here are results (dating around 3 years ago):
In order: 63 seconds. try/back off/without yield: 47 seconds. try/back off/with yield: 34 seconds.
With Alexander Terekhov's "swap-based mutex", I'm not seeing a significant difference between the versions with and without yield. The penalty likely comes from spin-waits in lock() (which is why they're evil).
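To illustrate the kind of lock() Peter is blaming (a hypothetical sketch using C++11 atomics, not Terekhov's design): if lock() itself spin-waits, an immediate retry after a failed try_lock burns CPU in that spin, and the explicit yield() in the retry loop hands the wasted time back to the scheduler. A mutex that blocks right away shows no such difference.

    #include <atomic>
    #include <thread>

    class spinning_mutex {
        std::atomic<bool> locked_{false};
    public:
        void lock()
        {
            // Spin for a while, hoping the holder releases soon; under
            // contention these iterations can burn most of a time slice...
            for (int i = 0; i != 4000; ++i)
                if (!locked_.exchange(true, std::memory_order_acquire))
                    return;
            // ...before finally deferring to the scheduler.
            while (locked_.exchange(true, std::memory_order_acquire))
                std::this_thread::yield();
        }
        bool try_lock()
        {
            return !locked_.exchange(true, std::memory_order_acquire);
        }
        void unlock()
        {
            locked_.store(false, std::memory_order_release);
        }
    };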

On Feb 8, 2008, at 5:14 PM, Peter Dimov wrote:
Howard Hinnant:
Here are results (dating around 3 years ago):
In order: 63 seconds. try/back off/without yield: 47 seconds. try/back off/with yield: 34 seconds.
With Alexander Terekhov's "swap-based mutex", I'm not seeing a significant difference between the versions with and without yield. The penalty likely comes from spin-waits in lock() (which is why they're evil).
Thanks Peter. That's good data. -Howard

Howard Hinnant:
Here are results (dating around 3 years ago):
In order: 63 seconds. try/back off/without yield: 47 seconds. try/back off/with yield: 34 seconds.
With Alexander Terekhov's "swap-based mutex", I'm not seeing a significant difference between the versions with and without yield. The penalty likely comes from spin-waits in lock() (which is why they're evil).
There's no significant difference with critical sections, either. Maybe I'm doing something wrong. :-)

On Feb 8, 2008, at 5:22 PM, Peter Dimov wrote:
Howard Hinnant:
Here are results (dating around 3 years ago):
In order: 63 seconds. try/back off/without yield: 47 seconds. try/back off/with yield: 34 seconds.
With Alexander Terekhov's "swap-based mutex", I'm not seeing a significant difference between the versions with and without yield. The penalty likely comes from spin-waits in lock() (which is why they're evil).
There's no significant difference with critical sections, either. Maybe I'm doing something wrong. :-)
Or this could be a manifestation of the fact that I never really had "hands on" tests with Windows. I have observed that if the contention isn't high, then the algorithm doesn't matter. Does your test have the threads sleeping or otherwise doing independent work? In mine the threads did nothing but fight over the locks. -Howard

Howard Hinnant:
On Feb 8, 2008, at 5:22 PM, Peter Dimov wrote:
There's no significant difference with critical sections, either. Maybe I'm doing something wrong. :-)
Or this could be a manifestation of the fact that I never really had "hands on" tests with Windows. I have observed that if the contention isn't high, then the algorithm doesn't matter. Does your test have the threads sleeping or otherwise doing independent work? In mine the threads did nothing but fight over the locks.
This is what I tried:

    #define _WIN32_WINNT 0x500
    #include <windows.h>
    //#include "mutex.cpp"

    class mutex
    {
    private:
        CRITICAL_SECTION cs_;
    public:
        mutex() { InitializeCriticalSection( &cs_ ); }
        ~mutex() { DeleteCriticalSection( &cs_ ); }

        void lock() { EnterCriticalSection( &cs_ ); }
        bool try_lock() { return TryEnterCriticalSection( &cs_ ) != 0; }
        void unlock() { LeaveCriticalSection( &cs_ ); }
    };

    void lock2( mutex & m1, mutex & m2 )
    {
        for( ;; )
        {
            m1.lock();
            if( m2.try_lock() ) return;
            m1.unlock();
            //Sleep( 0 );

            m2.lock();
            if( m1.try_lock() ) return;
            m2.unlock();
            //Sleep( 0 );
        }
    }

    #include <assert.h>

    mutex m[ 5 ];
    int r[ 5 ];

    unsigned __stdcall threadproc( void* pv )
    {
        int k = (int)pv;
        assert( k >= 0 && k < 5 );
        int k2 = (k+1) % 5;

        for( int i = 0; i < 1000000; ++i )
        {
            lock2( m[ k ], m[ k2 ] );

            for( int j = 0; j < 1000; ++j )
            {
                r[ k ] = r[ k ] * 1103515245 + 12345;
            }

            m[ k ].unlock();
            m[ k2 ].unlock();
        }

        return 0;
    }

    #include <iostream>
    #include <time.h>
    #include <process.h>

    int main()
    {
        HANDLE h[ 5 ] = { 0 };

        time_t t1 = time( 0 );

        for( int i = 0; i < 5; ++i )
        {
            h[ i ] = (HANDLE)_beginthreadex( 0, 0, threadproc, (void*)i, 0, 0 );
        }

        for( int i = 0; i < 5; ++i )
        {
            WaitForSingleObject( h[ i ], INFINITE );
            CloseHandle( h[ i ] );
        }

        time_t t2 = time( 0 );

        std::cout << t2 - t1 << std::endl;
    }

On Feb 8, 2008, at 5:36 PM, Peter Dimov wrote:
This is what I tried:
This test is all wrong.
unsigned __stdcall threadproc( void* pv ) { int k = (int)pv;
assert( k >= 0 && k < 5 );
This should be: assert( 0 <= k && k < 5 ); ... Just kidding around! :-) Thanks Peter! Very interesting. -Howard

On Feb 8, 2008, at 5:31 PM, Howard Hinnant wrote:
On Feb 8, 2008, at 5:22 PM, Peter Dimov wrote:
Howard Hinnant:
Here are results (dating around 3 years ago):
In order: 63 seconds. try/back off/without yield: 47 seconds. try/back off/with yield: 34 seconds.
With Alexander Terekhov's "swap-based mutex", I'm not seeing a significant difference between the versions with and without yield. The penalty likely comes from spin-waits in lock() (which is why they're evil).
There's no significant difference with critical sections, either. Maybe I'm doing something wrong. :-)
Or this could be a manifestation of the fact that I never really had "hands on" tests with Windows. I have observed that if the contention isn't high, then the algorithm doesn't matter. Does your test have the threads sleeping or otherwise doing independent work? In mine the threads did nothing but fight over the locks.
Oh, also note that my single-core results were only 5% apart (comparing lock in order to try/backoff/yield). Are you running on single core? -Howard

On Feb 8, 2008, at 5:14 PM, Peter Dimov wrote:
Howard Hinnant:
Here are results (dating around 3 years ago):
In order: 63 seconds. try/back off/without yield: 47 seconds. try/back off/with yield: 34 seconds.
With Alexander Terekhov's "swap-based mutex", I'm not seeing a significant difference between the versions with and without yield. The penalty likely comes from spin-waits in lock() (which is why they're evil).
Hey, would you mind running against lock-in-order? I'd love to confirm what I've been preachin'. ;-) Or if you see lock-in-order as better, then better for me to hear it sooner rather than later... Thanks, Howard

Hey, would you mind running against lock-in-order? I'd love to confirm what I've been preachin'. ;-) Or if you see lock-in-order as better, then better for me to hear it sooner rather than later...
No difference on single core, ~20s (vs ~14s for try/backoff) on a dual-core Pentium D, with lock2 changed to:

    void lock2_( mutex & m1, mutex & m2 )
    {
        if( &m1 < &m2 )
        {
            m1.lock();
            m2.lock();
        }
        else
        {
            m2.lock();
            m1.lock();
        }
    }

Howard Hinnant wrote:
On Feb 8, 2008, at 2:42 PM, Anthony Williams wrote:
Howard Hinnant <hinnant <at> twcny.rr.com> writes:
3. The yields are absolutely necessary to get the high performance mentioned in the previous note.
I find this comment intriguing. Is this always the case, or does it depend on the specifics of how a mutex is implemented on a given platform?
That is a question I've been asking and myself for a long time, unfortunately without a satisfactory answer. Here's my experience, and a few guesses...
Hi Howard,

If you can share the code, I'll see how it performs on my Linux systems. I bet it depends greatly on the workload.

I have been considering writing an instrumented mutex that records how often it blocks. In my real code, the answer seems to be almost never.

Peter's comment about evil spin-waits in lock() is interesting as I was considering implementing something like that. I wonder about this instead:

    void mutex::lock()
    {
        if (atomic-attempt-to-lock) {
            // uncontended
            return;
        }
        // first wait, try yielding since it's fast:
        sched_yield();
        if (atomic-attempt-to-lock) {
            // got lock on 2nd attempt
            return;
        }
        // We failed again. Maybe there are lots of threads waiting; in that
        // case it will be more efficient to futex_wait() since the kernel will
        // then only wake us when our lock has been freed.
        futex_wait(....);
    }

I'll try to find time to experiment with this soon.

Phil.
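For the curious, a concrete Linux version of that sketch could look like this (hypothetical, written with C++11 atomics, and closely following the mutex in Ulrich Drepper's paper "Futexes Are Tricky"; it assumes std::atomic<int> shares the plain int representation the futex syscall expects):

    #include <atomic>
    #include <linux/futex.h>   // FUTEX_WAIT, FUTEX_WAKE
    #include <sched.h>         // sched_yield
    #include <sys/syscall.h>   // SYS_futex
    #include <unistd.h>        // syscall

    // state: 0 = unlocked, 1 = locked, 2 = locked with possible waiters
    class futex_mutex {
        std::atomic<int> state_{0};

        long futex(int op, int val)
        {
            return syscall(SYS_futex, &state_, op, val, nullptr, nullptr, 0);
        }
    public:
        void lock()
        {
            int c = 0;
            if (state_.compare_exchange_strong(c, 1))
                return;                     // uncontended fast path
            sched_yield();                  // cheap first retry
            c = 0;
            if (state_.compare_exchange_strong(c, 1))
                return;                     // got lock on 2nd attempt
            if (c != 2)
                c = state_.exchange(2);     // announce that we will wait
            while (c != 0)                  // sleep until the lock is freed
            {
                futex(FUTEX_WAIT, 2);
                c = state_.exchange(2);
            }
        }
        bool try_lock()
        {
            int c = 0;
            return state_.compare_exchange_strong(c, 1);
        }
        void unlock()
        {
            if (state_.exchange(0) == 2)    // someone may be sleeping
                futex(FUTEX_WAKE, 1);
        }
    };

Peter's time-slice objection below applies to the sched_yield() step here just as it does to the pseudocode.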

Phil Endecott:
Peter's comment about evil spin-waits in lock() is interesting as I was considering implementing something like that. I wonder about this instead:
    void mutex::lock()
    {
        if (atomic-attempt-to-lock) {
            // uncontended
            return;
        }
        // first wait, try yielding since it's fast:
        sched_yield();
        if (atomic-attempt-to-lock) {
            // got lock on 2nd attempt
            return;
        }
        // We failed again. Maybe there are lots of threads waiting; in that
        // case it will be more efficient to futex_wait() since the kernel will
        // then only wake us when our lock has been freed.
        futex_wait(....);
    }
This isn't very good, either. Consider a scenario where your time slice is 10ms, but the mutex is unlocked 3ms after you attempt a lock. In my opinion, mutex::lock should never do fancy stuff behind the user's back in an attempt to achieve "better performance". If a trylock/sched_yield/lock pattern is better for a specific workload, it can be implemented in user space. If it isn't, there is no way to undo the damage.

Peter Dimov wrote:
Phil Endecott:
Peter's comment about evil spin-waits in lock() is interesting as I was considering implementing something like that. I wonder about this instead:
    void mutex::lock()
    {
        if (atomic-attempt-to-lock) {
            // uncontended
            return;
        }
        // first wait, try yielding since it's fast:
        sched_yield();
        if (atomic-attempt-to-lock) {
            // got lock on 2nd attempt
            return;
        }
        // We failed again. Maybe there are lots of threads waiting; in that
        // case it will be more efficient to futex_wait() since the kernel will
        // then only wake us when our lock has been freed.
        futex_wait(....);
    }
This isn't very good, either. Consider a scenario where your time slice is 10ms, but the mutex is unlocked 3ms after you attempt a lock.
So some other thread gets to do some (useful) work for 7 ms. But I take your point.
In my opinion, mutex::lock should never do fancy stuff behind the user's back in an attempt to achieve "better performance". If a trylock/sched_yield/lock pattern is better for a specific workload, it can be implemented in user space. If it isn't, there is no way to undo the damage.
Yes, I agree emphatically with this. I was pleased, for example, to see that Howard's implementation of the multiple-locking algorithm doesn't need access to any "internals" to work efficiently.

The approach that I've been experimenting with is to supply a template parameter to the mutex class that determines its behaviour, i.e.

    template <typename WaitWake>
    class mutex {
    public:
        void lock();    // tries to lock with an atomic operation, and
                        // calls WaitWake.wait() if it fails.
        void unlock();  // calls WaitWake.wake().
        // etc. - an N2447-like interface
    };

The "WaitWake" concept is something with a futex-like interface:

    class Futex {
        int& var;
    public:
        Futex(int& var_);
        void wait(int expected);  // returns immediately if var != expected,
                                  // else blocks until wake() is called
        void wake();
        // etc.
    };

I originally factored this out so that I could substitute a yield loop or spinlock using these degenerate classes:

    class Yield {
    public:
        Yield(int& var_) {}
        void wait(int expected) { ::sched_yield(); }
        void wake() {}
    };

    class Spin {
    public:
        Spin(int& var_) {}
        void wait(int expected) {}
        void wake() {}
    };

Since then I've written a condition-variable class that also takes a "WaitWake" template parameter. But there are a couple of things that I can't do with this design:

- The yield-loop and spin-loop are compatible, in the sense that different locks on the same mutex could use different implementations (based on the expectation of blocking in each case) and it would still work. The futex could be made compatible. So perhaps the WaitWake should be a parameter of the lock, not of the mutex.

- But whether a lock is better off spinning or yielding is actually a function of what the _other_ thread that currently holds the lock is doing, not what this thread is going to do. So maybe we want a scheme like this (pseudo-code):

    mutex m;
    ...
    {
        lock l(m, 1e-6);
        var++;  // lock is only held for a few nanoseconds
    }
    ...
    {
        lock l(m, 10);
        var = result_of_http_request();  // lock is held for seconds
    }

    class lock {
        mutex& m;
    public:
        lock(mutex& m_, float likely_hold_time): m(m_) {
            if (m.likely_hold_time < 1e-3) {
                m.spin_lock();
            } else {
                m.yield_lock();
            }
            m.likely_hold_time = likely_hold_time;
        }
        ~lock() { m.unlock(); }
        .....
    };

  My thought process gets about this far and then I think "this is too complicated". Simpler is better.

- My current scheme also doesn't allow for the sort of spin-then-wait or yield-then-wait design at the top of this message. There are lots of variations on this, like "if (m.lock_holder_cpu != this_cpu) { spin for a bit }".

In practice, my real code doesn't need most of these features; I have maybe a dozen threads, not thousands, and lock contention seems to be very rare. I got a worthwhile speedup by bypassing glibc's pthread_mutex and pthread_condition implementation and going straight to futexes (for the fast uncontended case), and I try to do as much work as possible between context switches. If we had a set of concurrency benchmarks available then I'd investigate a bit further.

Regards, Phil.

On Feb 8, 2008, at 6:19 PM, Phil Endecott wrote:
If you can share the code, I'll see how it performs on my Linux systems.
Thanks Phil! Sorry for the delay. I wanted to update my test to use the latest std:: syntax and rerun it. First, here's the code:

    #include <date_time>
    #include <mutex>
    #include <thread>
    #include <iostream>

    typedef std::mutex Mutex;
    typedef std::unique_lock<Mutex> Lock;

    Mutex cout_mutex;

    unsigned random_gen(unsigned& random_next)
    {
        random_next = random_next * 1103515245 + 12345;
        return (random_next >> 16) & 0x7FFF;
    }

    struct Chopstick
    {
        Chopstick() : random_next_(1) {}
        Mutex mut_;
        unsigned random_next_;
    };

    unsigned eat(unsigned i)
    {
        while (i)
            --i;
        return i;
    }

    unsigned eater(int id, Chopstick& left, Chopstick& right)
    {
        {
            std::lock_guard<Mutex> _(cout_mutex);
            std::cout << "Thread " << id << " starting\n";
        }
        unsigned i = 10000000;
        unsigned u = 0;
        while (--i)
        {
            Lock l(left.mut_, std::defer_lock);
            Lock r(right.mut_, std::defer_lock);
            std::lock(l, r);
            u += eat(random_gen(left.random_next_) + random_gen(right.random_next_));
        }
        {
            std::lock_guard<Mutex> _(cout_mutex);
            std::cout << "Thread " << id << " ending\n";
        }
        return u;
    }

    int main()
    {
        std::system_time t1 = std::get_system_time();
        Chopstick A, B, C, D, E;
        std::thread T1(eater, 1, std::ref(A), std::ref(B));
        std::thread T2(eater, 2, std::ref(B), std::ref(C));
        std::thread T3(eater, 3, std::ref(C), std::ref(D));
        std::thread T4(eater, 4, std::ref(D), std::ref(E));
        std::thread T5(eater, 5, std::ref(E), std::ref(A));
        T1.join();
        T2.join();
        T3.join();
        T4.join();
        T5.join();
        std::system_time t2 = std::get_system_time();
        std::cout << " time = " << (t2-t1).count() / 1.e9 << '\n';
    }

If anyone sees any glaring problems with the test, please let me know, thanks. Now for the results of fresh runs. Machine: iMac 2.16 GHz Intel Core 2 Duo, OS X 10.5.1.

Lock in order:

    template <class L1, class L2>
    void lock(L1& l1, L2& l2)
    {
        if (l1.mutex() < l2.mutex())
        {
            unique_lock<L1> u1(l1);
            unique_lock<L2> u2(l2);
            u1.release();
            u2.release();
        }
        else
        {
            unique_lock<L2> u2(l2);
            unique_lock<L1> u1(l1);
            u1.release();
            u2.release();
        }
    }

Three runs (the per-thread "starting"/"ending" lines are snipped here and below):

    time = 62.0437
    time = 145.683
    time = 157.347

Interesting side note: during the first run my disk backup utility kicked in. This extra competition for resources actually made the test case run faster!

Try/backoff/without yield:

    template <class L1, class L2>
    void lock(L1& l1, L2& l2)
    {
        while (true)
        {
            {
                unique_lock<L1> u1(l1);
                if (l2.try_lock())
                {
                    u1.release();
                    break;
                }
            }
            // std::this_thread::yield();
            {
                unique_lock<L2> u2(l2);
                if (l1.try_lock())
                {
                    u2.release();
                    break;
                }
            }
            // std::this_thread::yield();
        }
    }

Three runs:

    time = 32.3726
    time = 29.4687
    time = 56.226

I have no explanation for the longer time in the third run.

Try/backoff/yield (the same code with the two std::this_thread::yield() calls uncommented). Three runs:

    time = 11.216
    time = 10.6229
    time = 11.3284

-Howard

Howard Hinnant <hinnant <at> twcny.rr.com> writes:
First, here's the code:
Thanks for the test code. Here are my results, running on Windows XP Pro on an Intel Core2 Duo 1.83GHz T5600 (times in seconds, five runs each):

    Boost 1.35 mutex, no yield:                22.2656, 25.0156, 21.7969, 23.0312, 21.9375
    Boost 1.35 mutex, with yield:              20.0625, 20.4062, 19.9844, 19.9531, 20.4531
    CRITICAL_SECTION based mutex, no yield:    27.4219, 19.3281, 26.4062, 20, 28.3594
    CRITICAL_SECTION based mutex, with yield:  24.2344, 19.0625, 16.6084, 25.25, 21.5469
    new BitTestAndSet-based mutex, no yield:   20.3906, 20.2188, 20.4844, 19.7969, 20.5938
    new BitTestAndSet-based mutex, with yield: 20.3125, 20.6406, 20.0625, 20.5625, 20.125

I also tried using an in-order based lock():

    new BitTestAndSet-based mutex, in order:   23.75, 24.8906, 23.8438, 24.0938, 24
    CRITICAL_SECTION based mutex, in order:    42.9219, 38.9844

These were taking too long, and I ran out of patience to run more tests. Additional observations:

- With the Boost 1.35 mutex, the time spent in the kernel was higher without the yield than with it.

- With ALL the critical-section based tests, the first thread finished very quickly (around 5s), and for the last few seconds (the last 20s in the case of the in-order tests) there were only 2 threads running, with VERY low CPU usage.

- Lock-in-order was always the slowest, with double the time for the critical-section tests.

- The critical-section based mutex had both the fastest and slowest run times, with a huge spread, whereas the others had a very narrow spread. The average time for CS-based mutexes was larger than for the others.

- The yield improved performance on all but my new BitTestAndSet-based mutex, which already had performance comparable to the Boost-1.35-with-yield case. For this mutex, the with-yield version was slightly slower than the no-yield version, but not enough to be conclusive (0.2%).

Conclusion:

Lock in order is always worse than try-and-back-off, sometimes considerably so.

Adding yield() to try-and-back-off helps with some mutex implementations, and is effectively a no-op in others, so is worth adding.

Anthony

On Feb 10, 2008, at 6:52 AM, Anthony Williams wrote:
Conclusion:
Lock in order is always worse than try-and-back-off, sometimes considerably so.
Adding yield() to try-and-back-off helps with some mutex implementations, and is effectively a no-op in others, so is worth adding.
Thanks much for the research (and previously stated theories) Anthony! -Howard

Anthony Williams:
Adding yield() to try-and-back-off helps with some mutex implementations, and is effectively a no-op in others, so is worth adding.
I think that I would like to see how the algorithms scale to a number of cores equaling the number of philosophers before agreeing with that. As Phil notes, extra yields are not a problem if another thread can use the time to do something useful.

On Feb 10, 2008, at 2:46 PM, Peter Dimov wrote:
Anthony Williams:
Adding yield() to try-and-back-off helps with some mutex implementations, and is effectively a no-op in others, so is worth adding.
I think that I would like to see how the algorithms scale to a number of cores equaling the number of philosophers before agreeing with that. As Phil notes, extra yields are not a problem if another thread can use the time to do something useful.
Hey, Peter has volunteered to buy a 16 core machine for us to experiment on! :-) -Howard

If you guys have some static Windows builds of all the tests and a batch file to run them, in one zip file, I can try to run it on our 16-core machine (4 CPUs x 4 cores, or 2 x 4 x HyperThreading) during low-load times. If you're interested, of course...

Regards, Andrey.

On Sun, 10 Feb 2008 13:42:16 -0700, Howard Hinnant <hinnant@twcny.rr.com> wrote:
On Feb 10, 2008, at 2:46 PM, Peter Dimov wrote:
Anthony Williams:
Adding yield() to try-and-back-off helps with some mutex implementations, and is effectively a no-op in others, so is worth adding.
I think that I would like to see how the algorithms scale to a number of cores equaling the number of philosophers before agreeing with that. As Phil notes, extra yields are not a problem if another thread can use the time to do something useful.
Hey, Peter has volunteered to buy a 16 core machine for us to experiment on! :-)
-Howard

Andrey Tcherepanov <moyt63c02 <at> sneakemail.com> writes:
If you guys have some static Windows builds of all the tests and a batch file to run them, in one zip file, I can try to run it on our 16-core machine (4 CPUs x 4 cores, or 2 x 4 x HyperThreading) during low-load times. If you're interested, of course...
I for one would be very grateful if you could do that. I'll prepare the builds.

How shall I get them to you? Each build is around 1Mb, and zips to around 280k.

I have a variety of builds I would like to run: 3 mutex implementations, yield/no yield on each, with 5, 16 or 32 threads (less than, equal, and greater than the number of cores). That's 18 combinations. Each should take a lot less than a minute to run (5 threads on my dual-core machine takes around 20-30s). Are you willing to run all those builds (18) several times each to get reliable figures? If not, I'll try and pick a subset.

Your assistance is much appreciated. Thanks, Anthony

Hello, On Tue, 12 Feb 2008 00:52:47 -0700, Anthony Williams <anthony_w.geo@yahoo.com> wrote:
Andrey Tcherepanov <moyt63c02 <at> sneakemail.com> writes:
How shall I get them to you? Each build is around 1Mb, and zips to around 280k.
as long as the zip file is not above a couple of megs, I think you can email it to me at andrey<dot>tcherepanov<at>gmail<dot>com. Or, even better, put it on the vault, so anybody else who wants to rerun the tests on their machine can do it too.
I have a variety of builds I would like to run: 3 mutex implementations, yield/no yield on each, with 5, 16 or 32 threads (less than, equal, and greater than the number of cores). That's 18 combinations. Each should take a lot less than a minute to run (5 threads on my dual-core machine takes around 20-30s). Are you willing to run all those builds (18) several times each to get reliable figures? If not, I'll try and pick a subset.
I think I will be able to run them all; just create a batch script that will run them in whatever sequence you feel like. Just don't overwrite the result files with each run, so that if time permits I can rerun them more times. One thing that I am not sure about is whether you need a completely clean machine (no other processes running)? I am asking because these machines are part of db clusters (the standby node in an active-passive cluster), and might have some small processes running anyway. As for cores, CPUs and hyperthreading: I will add the full specs of the machines to the results.
Your assistance is much appreciated.
Thanks,
Anthony
You are very welcome. Regards, Andrey

Andrey Tcherepanov <moyt63c02 <at> sneakemail.com> writes:
Hello,
On Tue, 12 Feb 2008 00:52:47 -0700, Anthony Williams <anthony_w.geo <at> yahoo.com> wrote:
Andrey Tcherepanov <moyt63c02 <at> sneakemail.com> writes:
How shall I get them to you? Each build is around 1Mb, and zips to around 280k.
as long as the zip file is not above a couple of megs, I think you can email it to me at andrey<dot>tcherepanov<at>gmail<dot>com. Or, even better, put it on the vault, so anybody else who wants to rerun the tests on their machine can do it too.
I've uploaded a zip file to my website: http://www.justsoftwaresolutions.co.uk/files/dining_philosophers.zip This contains 36 EXEs and a batch file. If you run the batch file, it runs each EXE in turn, and appends the results to dining_philosophers.txt. Multiple runs will just add to the end of the file. If you only want to run a subset, edit the batch file.
One thing that I am not sure is if you need completely clean machine (no other processes running)? I am asking because these machines are parts of the db clusters (standby node in the active-passive cluster), and might have some small processes running anyway.
The less that's running, the better, but provided the machine isn't heavily loaded, it shouldn't matter.
In regards of cores and cpus and hyperthreading -- I will add full specs of the machines to the results.
Thanks. Anthony

Anthony, I sent the results to anthony_w.geo<at>yahoo.com; please let me know whether you received them, as these corporate filters play strange games sometimes...

On Wed, 13 Feb 2008 09:40:09 -0700, Anthony Williams <anthony_w.geo@yahoo.com> wrote:
I've uploaded a zip file to my website:
http://www.justsoftwaresolutions.co.uk/files/dining_philosophers.zip
This contains 36 EXEs and a batch file. If you run the batch file, it runs each EXE in turn, and appends the results to dining_philosophers.txt. Multiple runs will just add to the end of the file.
Anthony

Andrey Tcherepanov <moyt63c02 <at> sneakemail.com> writes:
I sent the results to anthony_w.geo<at>yahoo.com; please let me know whether you received them, as these corporate filters play strange games sometimes...
Received, thank you.
Would you share the data? best regards, Oliver

Kowalke Oliver (QD IT PA AS) <Oliver.Kowalke <at> qimonda.com> writes:
Andrey Tcherepanov <moyt63c02 <at> sneakemail.com> writes:
I sent the results to anthony_w.geo<at>yahoo.com; please let me know whether you received them, as these corporate filters play strange games sometimes...
Received, thank you.
Would you share the data?
Preliminary results indicate that the effects of adding yield() before a retry when locking multiple mutexes are unpredictable in general. In some cases there was a 58% drop in run time (i.e. better performance) with the yield. In other cases there was a 16% increase in run time (i.e. worse performance) with the yield. In many cases the standard deviation of the measurements was larger than the effect. I will share more details when I've analyzed all of Andrey's data, and my own. Anthony

Anthony Williams <anthony_w.geo <at> yahoo.com> writes:
Kowalke Oliver (QD IT PA AS) <Oliver.Kowalke <at> qimonda.com> writes:
Andrey Tcherepanov <moyt63c02 <at> sneakemail.com> writes:
I sent the results to anthony_w.geo<at>yahoo.com; please let me know whether you received them, as these corporate filters play strange games sometimes...
Received, thank you.
Would you share the data?
My spreadsheet with the results can be downloaded from http://www.justsoftwaresolutions.co.uk/files/dining_results.ods

Not all of Andrey's runs are in there, but enough to get a reasonable mean and standard deviation for most cases. This shows some interesting observations.

Firstly, the padding (8k between entries) is hugely beneficial for the small (8-byte) mutexes (boost 1.35, and my BTS-based variant), but the benefit is less for the CRITICAL_SECTION based mutex, which is 24 bytes. If you consider that the data is only 4 bytes, and each cache line is 64 bytes, this is unsurprising: with a 24-byte mutex and 4-byte data you can only fit about 2.5 entries in a cache line, so it's only adjacent entries that clash, whereas with an 8-byte mutex and 4-byte data you get about 5 entries per cache line, so there is much more conflict. This makes me wonder if it's worth padding the mutex to 64 bytes, so it occupies an entire cache line, or maybe adding a 64-byte alignment requirement.

Secondly, the yield is quite beneficial in some cases (e.g. a 58% improvement), but often detrimental (up to 33% degradation). Overall, I think it is not worth adding.

The next thing I noticed was quite surprising. On the machines with 16 hardware threads, the 32-thread versions often ran faster than the 16-thread versions. I can only presume that this is because the increased thread count meant less contention, and thus fewer blocking waits.

Finally, though the BTS-based mutex is faster than the boost 1.35 one in most cases, the CRITICAL_SECTION based version is actually very competitive, and fastest in many cases. I found this surprising, because it didn't match my prior experience, but it might be a consequence of the (generally undesirable) properties of CRITICAL_SECTIONs: some threads end up getting all the locks and running to completion very quickly, thus reducing the running thread count for the remainder of the test.

Anthony

Anthony Williams: ...
Finally, though the BTS-based mutex is faster than the boost 1.35 one in most cases, the CRITICAL_SECTION based version is actually very competitive, and fastest in many cases. I found this surprising, because it didn't match my prior experience, but might be a consequence of the (generally undesirable) properties of CRITICAL_SECTIONS: some threads end up getting all the locks, and running to completion very quickly, thus reducing the running thread count for the remainder of the test.
This is a general property of artificial benchmarks that do nothing outside of the critical region. In such a case running the same thread to completion is indeed the most efficient way to accomplish the overall task. Sensible user code will generally try to keep its critical regions short and move as much of the code outside as possible (to increase the amount of available parallelism).

Anthony Williams wrote:
My spreadsheet with the results can be downloaded from http://www.justsoftwaresolutions.co.uk/files/dining_results.ods
Excellent work!
Not all of Andrey's runs are in there, but enough to get a reasonable mean and standard deviation for most cases.

This shows some interesting observations. Firstly, the padding (8k between entries) is hugely beneficial for the small (8-byte) mutexes (boost 1.35, and my BTS-based variant), but the benefit is less for the CRITICAL_SECTION based mutex, which is 24 bytes. If you consider that the data is only 4 bytes, and each cache line is 64 bytes, this is unsurprising: with a 24-byte mutex and 4-byte data you can only fit about 2.5 entries in a cache line, so it's only adjacent entries that clash, whereas with an 8-byte mutex and 4-byte data you get about 5 entries per cache line, so there is much more conflict. This makes me wonder if it's worth padding the mutex to 64 bytes, so it occupies an entire cache line, or maybe adding a 64-byte alignment requirement.
I think that the mutex should be grouped together with the data that it's protecting, e.g.

// BAD:
mutex mut[64];
Thing things[64];

// BETTER:
pair<mutex,Thing> protected_things[64];

The presence of the "things" will separate out the mutexes, avoiding the cache-thrashing that you'd get if they were packed together, and furthermore if you're lucky the "thing" will be pre-fetched into your cache as soon as you lock its mutex. You could even do this:

template <typename T>
class Locked: public T
{
    mutex mut;
public:
    void lock()   { mut.lock(); }
    void unlock() { mut.unlock(); }
};

Locked<Thing> locked_things[64];

For small "things" it would still help to pad to the cache line size. Various ways to do this come to mind: add a compiler-specific alignment attribute (I think there's a proposal for a standard way to do this in C++0x, right?), add an explicit padding field dependent on sizeof(T), or for dynamically allocated objects write a Locked::operator new that rounds up the size requested. Of course it's wasteful to pad for mutexes that are rarely contested, and in many cases mutexes are rarely contested. Wouldn't it be great if our compilers could do profile-driven memory layout optimisation? Sounds like a good PhD topic to me. In the meantime, I wonder what else could be done at the library level to help programs not thrash their caches?
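For what it's worth, the standard spelling proposed for C++0x did eventually arrive as alignas. A minimal sketch of the pad-to-a-cache-line idea, assuming 64-byte lines (PaddedLocked is a hypothetical name, not from the thread):

#include <mutex>

// alignas(64) forces 64-byte alignment, and sizeof is always a
// multiple of the alignment, so adjacent array elements land on
// separate cache lines. 64 is an assumption; the real line size
// is platform-dependent.
template <typename T>
struct alignas(64) PaddedLocked
{
    std::mutex mut;
    T value;
};

PaddedLocked<int> counters[16];   // one mutex+datum per cache line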
Secondly, the yield is quite beneficial in some cases (e.g. 58% improvement), but often detrimental (up to 33% degradation). Overall, I think it is not worth adding.
I still find that a mutex that ONLY uses yield gives very good performance. This is because I don't yet have a benchmark where there are large numbers of _unrelated_ threads running on the system. "Dining Philosophers" is not representative of many real-world problems, unfortunately....
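For reference, a yield-only mutex of the kind described might look like this (a minimal sketch using C++11 atomics, not Phil's actual implementation):

#include <atomic>
#include <thread>

// A yield-in-a-loop mutex: no futex or other kernel wait object at
// all. Each waiter gives up its timeslice and retries until the
// flag is free.
class YieldMutex
{
    std::atomic_flag locked_ = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        while (locked_.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();
    }
    void unlock()
    {
        locked_.clear(std::memory_order_release);
    }
};

Under heavy load from unrelated threads this design burns timeslices spinning, which is exactly the scenario the benchmark above does not yet cover.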
The next thing I noticed was quite surprising. On the machines with 16 hardware threads, the 32-thread versions often ran faster than the 16-thread versions. I can only presume that this is because the increased thread count meant less contention, and thus fewer blocking waits.
Yes.
Finally, though the BTS-based mutex is faster than the boost 1.35 one in most cases, the CRITICAL_SECTION-based version is actually very competitive, and fastest in many cases. I found this surprising, because it didn't match my prior experience, but it might be a consequence of the (generally undesirable) properties of CRITICAL_SECTIONs: some threads end up getting all the locks and running to completion very quickly, thus reducing the running thread count for the remainder of the test.
I think fairness is over-rated. An unfair schedule will reduce mean-time-to-completion, and get better cache hit rates. Cheers, Phil.

Peter Dimov <pdimov <at> pdimov.com> writes:
Anthony Williams:
Adding yield() to try-and-back-off helps with some mutex implementations, and is effectively a no-op in others, so is worth adding.
I think that I would like to see how the algorithms scale to a number of cores equaling the number of philosophers before agreeing with that. As Phil notes, extra yields are not a problem if another thread can use the time to do something useful.
OK. I dropped down to two philosophers on my dual-core machine (same machine as before). This isn't the only thing running, but there's not much else going on.

Boost 1.35 mutex, no yield: 27s
Boost 1.35 mutex, with yield: 19s
New BTS mutex, no yield: 12s
New BTS mutex, with yield: 10s
CS-based mutex, no yield: 39s
CS-based mutex, with yield: 30s

In the boost 1.35 cases, I was seeing about 70% CPU usage, but around 25% in kernel space. In the BTS cases, I was seeing around 70% CPU usage, but very low kernel time. In the CS cases, I was seeing about 25% CPU usage, almost all in kernel space.

Note that the high contention caused the times for half the cases to be higher than for the 5-thread version, despite having less work to do! In all cases the yield improved performance vs the non-yield version.

This also makes me think I need to check in my BTS-based mutex ASAP, as it appears to perform much better under such high contention.

Anthony
--
Anthony Williams
Just Software Solutions Ltd - http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Howard Hinnant wrote:
On Feb 8, 2008, at 6:19 PM, Phil Endecott wrote:
If you can share the code, I'll see how it performs on my Linux systems.
Thanks Phil! Sorry for the delay. I wanted to update my tests to use the latest std:: syntax and rerun them. First, here's the code:
OK, I have made some minimal mods to work with my atomic-ops-and-futex mutex, and I get the following results. All of these results should be treated as provisional, since I don't have 100% confidence that my mutex implementation is actually correct.

For my 1 GHz uniprocessor VIA C3 system:
Lock-in-order: 8 s
Retry: 7 s
Retry and yield: 7 s

For a 3.6 GHz Intel Pentium 4 dual-core system:
Lock-in-order: still running.....
Retry: 13 s
Retry and yield: 10 s

So the most obvious initial observation is that you really don't want to run this sort of code on a multiprocessor system; this agrees with your observation:
Interesting side note: During the first run my disk backup utility kicked in. This extra competition for resources actually made the test case run faster!
I imagine that the allocation of threads to processors could also explain things like:
I have no explanation for the longer time in the third run.
I know that there is a certain amount of theoretical work about affinity of threads to CPUs, but I don't know what OSes actually do in practice; do they migrate threads based on any run-time measures?

Something that you might find interesting is that I modified the chopstick as follows:

struct Chopstick
{
    Chopstick() : random_next_(1) {}
    Mutex mut_;
    unsigned random_next_;
    char padding[8000];    <<<============
};

My aim was to prevent the chopsticks from being close together in the cache. This only works if your Mutex object actually contains the shared state itself, rather than being a pointer or some other sort of handle. By doing this, I increased the performance on the dual-core system with the retry-and-yield locking strategy to 6 seconds! (Finally better than the system with a third of the clock speed (and a tenth of the price)! Yay!) I think that this sort of cache-friendly design is important; the effect that I've measured here would occur in real programs, not just in artificial benchmarks like this. Perhaps the Mutex needs its own allocator, or something?

I got a further improvement, down to 5 s on the dual-core system, by using my yield-based mutex (i.e. no futex at all, just a yield-in-a-loop to lock). So the scale of this benchmark (5 threads) is sufficiently small that the benefits of the futex (i.e. the thread is not woken prematurely) do not outweigh the overhead of recording which thread is waiting for what. My guess is that your additional yield() is part of the same phenomenon: if the number of philosophers were increased until my Mutex<Futex> were better than my Mutex<Yield>, then the benefit of the additional yield in the multilock would also go away.

I plan to do some more measurements with different systems and different mutex implementations (e.g. the evil-spin-then-wait and friends). But we should remember that this is only an artificial benchmark.

Something that might be interesting to measure is the total number of context switches (per CPU) while the benchmark runs. I can see that with /usr/bin/time -f '%c %w'. If you have any similar facilities, you might like to see how correlated the total number of context switches is with the total runtime.

Regards, Phil

On Feb 10, 2008, at 1:39 PM, Phil Endecott wrote:
Something that you might find interesting is that I modified the chopstick as follows:
struct Chopstick
{
    Chopstick() : random_next_(1) {}
    Mutex mut_;
    unsigned random_next_;
    char padding[8000];    <<<============
};
This made a dramatic change here too, dropping the fastest (retry/yield) times from around 11s to around 6.5s. -Howard

Howard Hinnant <hinnant <at> twcny.rr.com> writes:
On Feb 8, 2008, at 2:42 PM, Anthony Williams wrote:
Howard Hinnant <hinnant <at> twcny.rr.com> writes:
template <class L1, class L2>
void lock(L1& l1, L2& l2)
{
    while (true)
    {
        {
            unique_lock<L1> __u1(l1);
            if (l2.try_lock())
            {
                __u1.release();
                break;
            }
        }
        std::this_thread::yield();
snipped rest of multi-lock code.
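(As an aside, not part of the original exchange: a deadlock-avoiding try-and-back-off algorithm of this general shape later shipped in C++11 as std::lock. A usage sketch:)

#include <mutex>

std::mutex a, b;

// std::lock acquires both mutexes together or neither, so different
// threads may name them in opposite orders without risking deadlock.
void worker()
{
    std::lock(a, b);
    std::lock_guard<std::mutex> ga(a, std::adopt_lock);
    std::lock_guard<std::mutex> gb(b, std::adopt_lock);
    // ... modify both resources safely ...
}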
3. The yields are absolutely necessary to get the high performance mentioned in the previous note.
I find this comment intriguing. Is this always the case, or does it depend on the specifics of how a mutex is implemented on a given platform?
Thanks for the detailed experience report.
I hope this gives you some insights, despite the lack of a clear answer to your question. Perhaps you or others can present a good explanation for this behavior regarding the yields.
I have a hypothesis about the cause for this behaviour, which is why I asked about trials with various mutexes. My hypothesis is this: if a thread finds a mutex locked (in try_lock), and then immediately does a lock() on that mutex, it will probably still find it locked, and then switch to kernel mode for a blocking wait. A yield between the try_lock() and the lock() will increase the time before the lock() call, and thus increase the chances that the mutex becomes unlocked, particularly if the lock is only held for a short time.

Some mutex implementations spin in user mode before switching to kernel mode for a blocking wait, and I can even see the potential benefits of doing the equivalent of try_lock()/yield()/try_lock()/kernel_lock() internally within the mutex. Such mutexes might not benefit from the yield() in the generic lock() template.

Do you still have your test code? If so, then I might run a yield/no-yield test with a CRITICAL_SECTION-based implementation, a kernel-Mutex-based implementation, my boost 1.35 atomics/semaphore-based mutex, and a try/yield/try/block-based mutex to see if this makes any difference.

Anthony
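The internal try/yield/try/block sequence hypothesised above might look roughly like this (a sketch only; the kernel wait and wake calls are hypothetical stand-ins, not a real API):

#include <atomic>
#include <thread>

class SpinThenBlockMutex
{
    std::atomic<bool> locked_{false};

    bool try_acquire()
    {
        return !locked_.exchange(true, std::memory_order_acquire);
    }
    void kernel_wait() { std::this_thread::yield(); } // stand-in for a futex/event wait
    void kernel_wake() {}                             // stand-in for the matching wake
public:
    void lock()
    {
        if (try_acquire()) return;      // first try_lock
        std::this_thread::yield();      // give the holder a chance to finish
        if (try_acquire()) return;      // second try_lock
        while (!try_acquire())          // only now fall back to a blocking wait
            kernel_wait();
    }
    void unlock()
    {
        locked_.store(false, std::memory_order_release);
        kernel_wake();
    }
};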
participants (10)
- Andrey Tcherepanov
- Anthony Williams
- dizzy
- Frank Mori Hess
- Harro Verkouter
- Howard Hinnant
- Kowalke Oliver (QD IT PA AS)
- Peter Dimov
- Phil Endecott
- Walter Horsten