
I'm seeing a deadlock on a dual core Pentium D from libs/smart_ptr/test/sp_atomic_mt_test.cpp when USE_RWLOCK is defined (which causes it to use boost::shared_mutex). It works on uniprocessor.

"Peter Dimov" <pdimov@pdimov.com> writes:
I'm seeing a deadlock on a dual core Pentium D from libs/smart_ptr/test/sp_atomic_mt_test.cpp when USE_RWLOCK is defined (which causes it to use boost::shared_mutex). It works on uniprocessor.
How often, on what platform? I just tried it on my Core2Duo running Windows XP, and it ran fine.

Anthony

Anthony Williams:
"Peter Dimov" <pdimov@pdimov.com> writes:
I'm seeing a deadlock on a dual core Pentium D from libs/smart_ptr/test/sp_atomic_mt_test.cpp when USE_RWLOCK is defined (which causes it to use boost::shared_mutex). It works on uniprocessor.
How often, on what platform?
Every time, on a dual core Pentium D running XP64. The program is compiled as 32 bit with MSVC 7.1. It looks like a livelock rather than a deadlock, since the program keeps spinning its wheels at 100% but never completes. The writer thread is probably either blocked or starved. This happens on another (quad core) machine too for which I don't know the specs at the moment.

"Peter Dimov" <pdimov@pdimov.com> writes:
Anthony Williams:
"Peter Dimov" <pdimov@pdimov.com> writes:
I'm seeing a deadlock on a dual core Pentium D from libs/smart_ptr/test/sp_atomic_mt_test.cpp when USE_RWLOCK is defined (which causes it to use boost::shared_mutex). It works on uniprocessor.
How often, on what platform?
Every time, on a dual core Pentium D running XP64. The program is compiled as 32 bit with MSVC 7.1. It looks like a livelock rather than a deadlock, since the program keeps spinning its wheels at 100% but never completes. The writer thread is probably either blocked or starved. This happens on another (quad core) machine too for which I don't know the specs at the moment.
Here's the thing: your writer is running in a tight loop, and so are your readers, with each reacquiring the lock immediately after releasing it. shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates. Your test should complete eventually (it takes 2-5 minutes on my machine), but it is likely that the problem is indeed due to writer starvation.

In the current implementation, a blocked writer will prevent further readers acquiring the lock, thus if you frequently try and obtain a writer lock this reduces the potential reader concurrency. However, in order to avoid reader starvation, the writer is not automatically granted the lock when the read lock is released: instead it has to compete for the privilege with the blocked readers. It is up to the OS how it chooses to schedule it.

Amusingly, the way I've programmed the upgrade_lock means that your test runs quicker (22 seconds vs 2-5 minutes) if you do:

    boost::upgrade_lock<boost::shared_mutex> up_lock( rw );
    boost::unique_lock<boost::shared_mutex> lock( up_lock.move() );

rather than:

    boost::unique_lock<boost::shared_mutex> lock( rw );

Anthony
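For reference, the access pattern being described boils down to roughly the following (a sketch of the USE_RWLOCK path, not the actual sp_atomic_mt_test source; names and structure are illustrative, the iteration count is the one mentioned later in the thread):

    #include <boost/thread/shared_mutex.hpp>
    #include <boost/thread/locks.hpp>

    boost::shared_mutex rw;
    long shared_value = 0;

    void reader_loop()              // several of these, one per reader thread
    {
        for( ;; )                   // until the writer finishes, in the real test
        {
            boost::shared_lock<boost::shared_mutex> lock( rw );
            long v = shared_value;  // read under the shared lock
            (void)v;
        }                           // lock released here and immediately retaken
    }

    void writer_loop()              // one dedicated writer thread
    {
        for( long i = 0; i < 1048576; ++i )
        {
            boost::unique_lock<boost::shared_mutex> lock( rw );
            ++shared_value;         // update under the exclusive lock
        }
    }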

Anthony Williams:
Here's the thing: your writer is running in a tight loop, and so are your readers, with each reacquiring the lock immediately after releasing it.
Right. This emulates a server that needs to service many concurrent reader requests. I don't have time to make it more real; the USE_RWLOCK path is for comparison purposes only.
shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates.
I don't believe that the behavior I'm seeing matches the intent of the design.
In the current implementation, a blocked writer will prevent further readers acquiring the lock, thus if you frequently try and obtain a writer lock this reduces the potential reader concurrency. However, in order to avoid reader starvation, the writer is not automatically granted the lock when the read lock is released: instead it has to compete for the privilege with the blocked readers. It is up to the OS how it chooses to schedule it.
In theory, given 7 readers and one writer competing for the lock, the writer should get the lock 1/8 of the time, leading to an 8x reader:writer iteration ratio. But it doesn't. I still suspect there's something wrong with the lock that unintentionally favors readers. One possible explanation is that the reader thread that just released the lock is able to obtain it again. This should never happen if there's a writer waiting; the reader should be blocked "on the first gate".
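The property in question can be sketched with a conventional two-gate formulation (a sketch only, not the actual boost::shared_mutex code): once a writer is waiting, lock_shared blocks at the first gate, including for a reader that has just released the lock.

    #include <boost/thread/mutex.hpp>
    #include <boost/thread/condition_variable.hpp>
    #include <boost/thread/locks.hpp>

    class two_gate_rw_sketch
    {
        boost::mutex m_;
        boost::condition_variable gate1_; // entry gate for readers and writers
        boost::condition_variable gate2_; // the writer waits here for readers to drain
        bool writer_pending_;
        unsigned readers_;

    public:

        two_gate_rw_sketch(): writer_pending_( false ), readers_( 0 ) {}

        void lock_shared()
        {
            boost::unique_lock<boost::mutex> lk( m_ );

            // first gate: once a writer is waiting, no new readers get in,
            // not even one that released the lock a moment ago
            while( writer_pending_ ) gate1_.wait( lk );

            ++readers_;
        }

        void unlock_shared()
        {
            boost::unique_lock<boost::mutex> lk( m_ );
            if( --readers_ == 0 && writer_pending_ ) gate2_.notify_one();
        }

        void lock()
        {
            boost::unique_lock<boost::mutex> lk( m_ );

            while( writer_pending_ ) gate1_.wait( lk ); // one pending writer at a time
            writer_pending_ = true;

            // second gate: wait for the readers already inside to leave
            while( readers_ > 0 ) gate2_.wait( lk );
        }

        void unlock()
        {
            boost::unique_lock<boost::mutex> lk( m_ );
            writer_pending_ = false;
            gate1_.notify_all(); // wake blocked readers and writers
        }
    };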

Anthony Williams:
shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates.
On second look, yes, you are right, the test is indeed pathological. I tested a few other read/write mutex implementations I have lying around, and all have varying degrees of trouble with it.

Anthony Williams:
shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates.
Some more observations. The same writer starvation occurs with 2 reader threads, so it's not caused by the overcommit. Two reader threads on two cores is common.

The update frequency doesn't matter; a lower update frequency would just scale the time it takes to perform 1M updates, it will not change the average writer wait time.

Some wait times (2R+1W):

    atomics: 7.673 microseconds
    lightweight_mutex (CRITICAL_SECTION): 3.069 us
    shared_mutex: 760 us
    rw_mutex (my implementation): 665 us (same problem)
    pthread_rwlock_t (pthreads-win32): 7.108 us
    rw_mutex (Hinnant/Terekhov): 85.532 us

This last line uses my reimplementation of Howard Hinnant's read/write mutex based on his description; Howard credits Alexander Terekhov with the original algorithm. It does stall the writer a bit in exchange for optimal reader throughput, but doesn't suffer from outright starvation.

I've attached my (not production quality) implementation of this rwlock.
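Assuming the wait time is measured as the interval between the writer requesting the lock and acquiring it, averaged over the 1M updates, it could be collected along these lines (a sketch only, not the actual test code; QueryPerformanceCounter is just one way to time it):

    #include <windows.h>
    #include <boost/thread/shared_mutex.hpp>
    #include <boost/thread/locks.hpp>

    extern boost::shared_mutex rw;   // the lock under test
    extern long shared_value;

    double average_writer_wait_us( long updates )
    {
        LARGE_INTEGER freq, t1, t2;
        QueryPerformanceFrequency( &freq );

        double total_seconds = 0;

        for( long i = 0; i < updates; ++i )
        {
            QueryPerformanceCounter( &t1 );
            boost::unique_lock<boost::shared_mutex> lock( rw );
            QueryPerformanceCounter( &t2 );

            ++shared_value;          // the actual update

            total_seconds += double( t2.QuadPart - t1.QuadPart ) / freq.QuadPart;
        }

        return total_seconds / updates * 1e6;   // microseconds per acquisition
    }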

Peter Dimov wrote:
shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates.
Some more observations.
The same writer starvation occurs with 2 reader threads, so it's not caused by the overcommit. Two reader threads on two cores is common.
The update frequency doesn't matter; a lower update frequency would just scale the time it takes to perform 1M updates, it will not change the average writer wait time.
Some wait times (2R+1W):
    atomics: 7.673 microseconds
    lightweight_mutex (CRITICAL_SECTION): 3.069 us
    shared_mutex: 760 us
    rw_mutex (my implementation): 665 us (same problem)
    pthread_rwlock_t (pthreads-win32): 7.108 us
    rw_mutex (Hinnant/Terekhov): 85.532 us
This last line uses my reimplementation of Howard Hinnant's read/write mutex based on his description; Howard credits Alexander Terekhov with the original algorithm. It does stall the writer a bit in exchange for optimal reader throughput, but doesn't suffer from outright starvation.
See pthreads-win32 for the original algorithm (except support for static init stuff).

regards,
alexander.

On Apr 21, 2008, at 2:10 PM, Peter Dimov wrote:
shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates.
Some more observations.
The same writer starvation occurs with 2 reader threads, so it's not caused by the overcommit. Two reader threads on two cores is common.
The update frequency doesn't matter; a lower update frequency would just scale the time it takes to perform 1M updates, it will not change the average writer wait time.
Some wait times (2R+1W):
    atomics: 7.673 microseconds
    lightweight_mutex (CRITICAL_SECTION): 3.069 us
    shared_mutex: 760 us
    rw_mutex (my implementation): 665 us (same problem)
    pthread_rwlock_t (pthreads-win32): 7.108 us
    rw_mutex (Hinnant/Terekhov): 85.532 us
This last line uses my reimplementation of Howard Hinnant's read/write mutex based on his description; Howard credits Alexander Terekhov with the original algorithm. It does stall the writer a bit in exchange for optimal reader throughput, but doesn't suffer from outright starvation.
I've attached my (not production quality) implementation of this rwlock. <rw_mutex.hpp>
I've so far only looked at your implementation for a few minutes. The lock-reduced version is tricky. It has been three years since I did this, but when I did, I needed 3 flags instead of your two. That may have been because I was guarding against max-readers and you're not, not sure yet.

When I wrote it, it was on ppc only, so it used the lwarx/stwcx. instructions directly and was not written in a compare/exchange style.

It might be interesting to run this http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2406.html#shared_mu... through your test. It is the algorithm, but not "lock reduced".

-Howard
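Roughly, the "lock-reduced", compare/exchange style being discussed packs the whole lock state into a single word and updates it with CAS. A much-simplified sketch (not Howard's or Peter's actual code; the flag layout is illustrative, and it spins where a real implementation would block):

    #include <windows.h>

    class cas_rw_sketch
    {
        volatile LONG state_;   // one flag bit for the writer + a reader count

        static LONG const writer_flag = (LONG)0x80000000;
        static LONG const reader_mask = 0x7FFFFFFF;

    public:

        cas_rw_sketch(): state_( 0 ) {}

        void lock_shared()
        {
            for( ;; )
            {
                LONG st = state_;

                // enter only if no writer has claimed the lock, and only if
                // the compare/exchange confirms the state hasn't changed
                if( !( st & writer_flag ) &&
                    InterlockedCompareExchange( &state_, st + 1, st ) == st )
                    return;

                Sleep( 0 );   // a real implementation would block here
            }
        }

        void unlock_shared()
        {
            InterlockedDecrement( &state_ );
        }

        void lock()
        {
            for( ;; )
            {
                LONG st = state_;

                // claim the writer flag; readers may still be inside
                if( !( st & writer_flag ) &&
                    InterlockedCompareExchange( &state_, st | writer_flag, st ) == st )
                    break;

                Sleep( 0 );
            }

            while( state_ & reader_mask ) Sleep( 0 );   // wait for readers to drain
        }

        void unlock()
        {
            // no reader can have entered while the flag was set, so the
            // state is exactly writer_flag at this point
            InterlockedExchange( &state_, 0 );
        }
    };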

Howard Hinnant:
I've so far only looked at your implementation for a few minutes. The lock-reduced version is tricky. It has been three years since I did this, but when I did, I needed 3 flags instead of your two. That may have been because I was guarding against max-readers and you're not, not sure yet.
Actually, most implementations tend to work as intended even without such a check, provided the "writer" bit is adjacent to the "#readers" bits (overflowing the count sets the writer bit). This is not the case for my implementation, though, since I've managed to put the bits in the wrong order. :-)
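For illustration, the adjacency in question looks something like this (constants are illustrative, not taken from any of the implementations discussed):

    // low bits: #readers, next bit up: writer flag
    unsigned const reader_mask = 0x7FFFFFFF;
    unsigned const writer_bit  = 0x80000000;

    // If a reader increments the count when it is already at reader_mask,
    // the carry sets writer_bit, so subsequent readers see a "writer" and
    // stop entering -- an accidental max-readers guard. With the two fields
    // in the opposite order, the overflow goes elsewhere and the guard is lost.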

"Peter Dimov" <pdimov@pdimov.com> writes:
Some wait times (2R+1W):
    atomics: 7.673 microseconds
    lightweight_mutex (CRITICAL_SECTION): 3.069 us
    shared_mutex: 760 us
    rw_mutex (my implementation): 665 us (same problem)
    pthread_rwlock_t (pthreads-win32): 7.108 us
    rw_mutex (Hinnant/Terekhov): 85.532 us
This last line uses my reimplementation of Howard Hinnant's read/write mutex based on his description; Howard credits Alexander Terekhov with the original algorithm. It does stall the writer a bit in exchange for optimal reader throughput, but doesn't suffer from outright starvation.
I've been running your sp_atomic_mt_test with 8 readers and 1 writer, and various rw mutex implementations, on my core2duo machine.

Using the pthreads-win32 pthread_rwlock_t shows *serious* reader starvation: the whole thing completes in 0.5s, with 7 out of 8 reader threads having under 85000 iterations, and the 8th having just over 110000, for 1048576 writer iterations.

Using your lightweight mutex runs in around 10s, with around 1000000 iterations for each reader.

Using boost::shared_mutex or your implementation of the Hinnant/Terekhov rw-mutex takes 100-200s, with reader multipliers around 60-80. Statistically there's not a lot in it, though your Hinnant/Terekhov implementation runs a bit faster.

shared_mutex algorithms are *hard* to get right.

Anthony

Anthony Williams:
shared_mutex algorithms are *hard* to get right.
I'm now tending towards the opinion that rwlocks are hard to use correctly (and test). For the 2R+1W case I've been testing, it occurs to me that the writer wait time is normal - statistically, the time for the two readers to finish their timeslices of 1ms is expected to be 0.75ms. A (possibly?) better test would not have a dedicated writer thread, but would use a pool of generic worker threads that alternate between reading and writing with a certain probability. I may try one at some point.
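Such a test might look roughly like the following sketch (the worker function, write probability, and counts are made up for illustration, not taken from any existing test):

    #include <boost/thread/shared_mutex.hpp>
    #include <boost/thread/locks.hpp>
    #include <cstdlib>

    extern boost::shared_mutex rw;
    extern long shared_value;

    // One generic worker out of a pool of N identical threads; each iteration
    // is a write with probability p and a read otherwise, so there is no
    // dedicated writer thread to starve.
    void worker( long iterations, double write_probability )
    {
        for( long i = 0; i < iterations; ++i )
        {
            if( std::rand() < write_probability * RAND_MAX ) // rand() is crude, but enough for a sketch
            {
                boost::unique_lock<boost::shared_mutex> lock( rw );
                ++shared_value;
            }
            else
            {
                boost::shared_lock<boost::shared_mutex> lock( rw );
                long v = shared_value;   // read-only work
                (void)v;
            }
        }
    }

    // e.g. launch 8 threads running worker( 1048576, 0.01 ) and compare the
    // per-thread read/write counts across lock implementations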
participants (4)
- Alexander Terekhov
- Anthony Williams
- Howard Hinnant
- Peter Dimov