
I'm seeing a deadlock on a dual core Pentium D from libs/smart_ptr/test/sp_atomic_mt_test.cpp when USE_RWLOCK is defined (which causes it to use boost::shared_mutex). It works on uniprocessor.

"Peter Dimov" <pdimov@pdimov.com> writes:
I'm seeing a deadlock on a dual core Pentium D from libs/smart_ptr/test/sp_atomic_mt_test.cpp when USE_RWLOCK is defined (which causes it to use boost::shared_mutex). It works on uniprocessor.
How often, on what platform? I just tried it on my Core2Duo running Windows XP, and it ran fine.

Anthony

Anthony Williams:
"Peter Dimov" <pdimov@pdimov.com> writes:
I'm seeing a deadlock on a dual core Pentium D from libs/smart_ptr/test/sp_atomic_mt_test.cpp when USE_RWLOCK is defined (which causes it to use boost::shared_mutex). It works on uniprocessor.
How often, on what platform?
Every time, on a dual core Pentium D running XP64. The program is compiled as 32 bit with MSVC 7.1. It looks like a livelock rather than a deadlock, since the program keeps spinning its wheels at 100% but never completes. The writer thread is probably either blocked or starved. This happens on another (quad core) machine too for which I don't know the specs at the moment.

"Peter Dimov" <pdimov@pdimov.com> writes:
Anthony Williams:
"Peter Dimov" <pdimov@pdimov.com> writes:
I'm seeing a deadlock on a dual core Pentium D from libs/smart_ptr/test/sp_atomic_mt_test.cpp when USE_RWLOCK is defined (which causes it to use boost::shared_mutex). It works on uniprocessor.
How often, on what platform?
Every time, on a dual core Pentium D running XP64. The program is compiled as 32 bit with MSVC 7.1. It looks like a livelock rather than a deadlock, since the program keeps spinning its wheels at 100% but never completes. The writer thread is probably either blocked or starved. This happens on another (quad core) machine too for which I don't know the specs at the moment.
Here's the thing: your writer is running in a tight loop, and so are your readers, with each reacquiring the lock immediately after releasing it. shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates. Your test should complete eventually (it takes 2-5 minutes on my machine), but it is likely that the problem is indeed due to writer starvation.

In the current implementation, a blocked writer will prevent further readers acquiring the lock, thus if you frequently try and obtain a writer lock this reduces the potential reader concurrency. However, in order to avoid reader starvation, the writer is not automatically granted the lock when the read lock is released: instead it has to compete for the privilege with the blocked readers. It is up to the OS how it chooses to schedule it.

Amusingly, the way I've programmed the upgrade_lock means that your test runs quicker (22 seconds vs 2-5 minutes) if you do:

    boost::upgrade_lock<boost::shared_mutex> up_lock( rw );
    boost::unique_lock<boost::shared_mutex> lock( up_lock.move() );

rather than:

    boost::unique_lock<boost::shared_mutex> lock( rw );

Anthony
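For reference, the access pattern being described boils down to roughly the following (a sketch of the USE_RWLOCK path, not the actual sp_atomic_mt_test source; names and structure are illustrative, the iteration count is the one mentioned later in the thread):

    #include <boost/thread/shared_mutex.hpp>
    #include <boost/thread/locks.hpp>

    boost::shared_mutex rw;
    long shared_value = 0;

    void reader_loop()              // several of these, one per reader thread
    {
        for( ;; )                   // until the writer finishes, in the real test
        {
            boost::shared_lock<boost::shared_mutex> lock( rw );
            long v = shared_value;  // read under the shared lock
            (void)v;
        }                           // lock released here and immediately retaken
    }

    void writer_loop()              // one dedicated writer thread
    {
        for( long i = 0; i < 1048576; ++i )
        {
            boost::unique_lock<boost::shared_mutex> lock( rw );
            ++shared_value;         // update under the exclusive lock
        }
    }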

Anthony Williams:
Here's the thing: your writer is running in a tight loop, and so are your readers, with each reacquiring the lock immediately after releasing it.
Right. This emulates a server that needs to service many concurrent reader requests. I don't have time to make it more real; the USE_RWLOCK path is for comparison purposes only.
shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates.
I don't believe that the behavior I'm seeing matches the intent of the design.
In the current implementation, a blocked writer will prevent further readers acquiring the lock, thus if you frequently try and obtain a writer lock this reduces the potential reader concurrency. However, in order to avoid reader starvation, the writer is not automatically granted the lock when the read lock is released: instead it has to compete for the privilege with the blocked readers. It is up to the OS how it chooses to schedule it.
In theory, given 7 readers and one writer competing for the lock, the writer should get the lock 1/8 of the time, leading to an 8x reader:writer iteration ratio. But it doesn't. I still suspect there's something wrong with the lock that unintentionally favors readers. One possible explanation is that the reader thread that just released the lock is able to obtain it again. This should never happen if there's a writer waiting; the reader should be blocked "on the first gate".
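The property in question can be sketched with a conventional two-gate formulation (a sketch only, not the actual boost::shared_mutex code): once a writer is waiting, lock_shared blocks at the first gate, including for a reader that has just released the lock.

    #include <boost/thread/mutex.hpp>
    #include <boost/thread/condition_variable.hpp>
    #include <boost/thread/locks.hpp>

    class two_gate_rw_sketch
    {
        boost::mutex m_;
        boost::condition_variable gate1_; // entry gate for readers and writers
        boost::condition_variable gate2_; // the writer waits here for readers to drain
        bool writer_pending_;
        unsigned readers_;

    public:

        two_gate_rw_sketch(): writer_pending_( false ), readers_( 0 ) {}

        void lock_shared()
        {
            boost::unique_lock<boost::mutex> lk( m_ );

            // first gate: once a writer is waiting, no new readers get in,
            // not even one that released the lock a moment ago
            while( writer_pending_ ) gate1_.wait( lk );

            ++readers_;
        }

        void unlock_shared()
        {
            boost::unique_lock<boost::mutex> lk( m_ );
            if( --readers_ == 0 && writer_pending_ ) gate2_.notify_one();
        }

        void lock()
        {
            boost::unique_lock<boost::mutex> lk( m_ );

            while( writer_pending_ ) gate1_.wait( lk ); // one pending writer at a time
            writer_pending_ = true;

            // second gate: wait for the readers already inside to leave
            while( readers_ > 0 ) gate2_.wait( lk );
        }

        void unlock()
        {
            boost::unique_lock<boost::mutex> lk( m_ );
            writer_pending_ = false;
            gate1_.notify_all(); // wake blocked readers and writers
        }
    };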

Anthony Williams:
shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates.
On second look, yes, you are right, the test is indeed pathological. I tested a few other read/write mutex implementations I have lying around, and all have varying degrees of trouble with it.

Anthony Williams:
shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates.
Some more observations. The same writer starvation occurs with 2 reader threads, so it's not caused by the overcommit. Two reader threads on two cores is common.

The update frequency doesn't matter; a lower update frequency would just scale the time it takes to perform 1M updates, it will not change the average writer wait time.

Some wait times (2R+1W):

    atomics: 7.673 microseconds
    lightweight_mutex (CRITICAL_SECTION): 3.069 us
    shared_mutex: 760 us
    rw_mutex (my implementation): 665 us (same problem)
    pthread_rwlock_t (pthreads-win32): 7.108 us
    rw_mutex (Hinnant/Terekhov): 85.532 us

This last line uses my reimplementation of Howard Hinnant's read/write mutex based on his description; Howard credits Alexander Terekhov with the original algorithm. It does stall the writer a bit in exchange for optimal reader throughput, but doesn't suffer from outright starvation.

I've attached my (not production quality) implementation of this rwlock.
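Assuming the wait time is measured as the interval between the writer requesting the lock and acquiring it, averaged over the 1M updates, it could be collected along these lines (a sketch only, not the actual test code; QueryPerformanceCounter is just one way to time it):

    #include <windows.h>
    #include <boost/thread/shared_mutex.hpp>
    #include <boost/thread/locks.hpp>

    extern boost::shared_mutex rw;   // the lock under test
    extern long shared_value;

    double average_writer_wait_us( long updates )
    {
        LARGE_INTEGER freq, t1, t2;
        QueryPerformanceFrequency( &freq );

        double total_seconds = 0;

        for( long i = 0; i < updates; ++i )
        {
            QueryPerformanceCounter( &t1 );
            boost::unique_lock<boost::shared_mutex> lock( rw );
            QueryPerformanceCounter( &t2 );

            ++shared_value;          // the actual update

            total_seconds += double( t2.QuadPart - t1.QuadPart ) / freq.QuadPart;
        }

        return total_seconds / updates * 1e6;   // microseconds per acquisition
    }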

Peter Dimov wrote:
shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates.
Some more observations.
The same writer starvation occurs with 2 reader threads, so it's not caused by the overcommit. Two reader threads on two cores is common.
The update frequency doesn't matter; a lower update frequency would just scale the time it takes to perform 1M updates, it will not change the average writer wait time.
Some wait times (2R+1W):
    atomics: 7.673 microseconds
    lightweight_mutex (CRITICAL_SECTION): 3.069 us
    shared_mutex: 760 us
    rw_mutex (my implementation): 665 us (same problem)
    pthread_rwlock_t (pthreads-win32): 7.108 us
    rw_mutex (Hinnant/Terekhov): 85.532 us
This last line uses my reimplementation of Howard Hinnant's read/write mutex based on his description; Howard credits Alexander Terekhov with the original algorithm. It does stall the writer a bit in exchange for optimal reader throughput, but doesn't suffer from outright starvation.
See pthreads-win32 for the original algorithm (except support for static init stuff).

regards,
alexander.

On Apr 21, 2008, at 2:10 PM, Peter Dimov wrote:
shared_mutex is not designed for this scenario, since you have high contention. shared_mutex is designed for infrequent updates.
Some more observations.
The same writer starvation occurs with 2 reader threads, so it's not caused by the overcommit. Two reader threads on two cores is common.
The update frequency doesn't matter; a lower update frequency would just scale the time it takes to perform 1M updates, it will not change the average writer wait time.
Some wait times (2R+1W):
    atomics: 7.673 microseconds
    lightweight_mutex (CRITICAL_SECTION): 3.069 us
    shared_mutex: 760 us
    rw_mutex (my implementation): 665 us (same problem)
    pthread_rwlock_t (pthreads-win32): 7.108 us
    rw_mutex (Hinnant/Terekhov): 85.532 us
This last line uses my reimplementation of Howard Hinnant's read/write mutex based on his description; Howard credits Alexander Terekhov with the original algorithm. It does stall the writer a bit in exchange for optimal reader throughput, but doesn't suffer from outright starvation.
I've attached my (not production quality) implementation of this rwlock. <rw_mutex.hpp>
I've so far only looked at your implementation for a few minutes. The lock-reduced version is tricky. It has been three years since I did this, but when I did, I needed 3 flags instead of your two. That may have been because I was guarding against max-readers and you're not, not sure yet.

When I wrote it, it was on ppc only, so it used the lwarx/stwcx. instructions directly and was not written in a compare/exchange style.

It might be interesting to run this http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2406.html#shared_mu... through your test. It is the algorithm, but not "lock reduced".

-Howard
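Roughly, the "lock-reduced", compare/exchange style being discussed packs the whole lock state into a single word and updates it with CAS. A much-simplified sketch (not Howard's or Peter's actual code; the flag layout is illustrative, and it spins where a real implementation would block):

    #include <windows.h>

    class cas_rw_sketch
    {
        volatile LONG state_;   // one flag bit for the writer + a reader count

        static LONG const writer_flag = (LONG)0x80000000;
        static LONG const reader_mask = 0x7FFFFFFF;

    public:

        cas_rw_sketch(): state_( 0 ) {}

        void lock_shared()
        {
            for( ;; )
            {
                LONG st = state_;

                // enter only if no writer has claimed the lock, and only if
                // the compare/exchange confirms the state hasn't changed
                if( !( st & writer_flag ) &&
                    InterlockedCompareExchange( &state_, st + 1, st ) == st )
                    return;

                Sleep( 0 );   // a real implementation would block here
            }
        }

        void unlock_shared()
        {
            InterlockedDecrement( &state_ );
        }

        void lock()
        {
            for( ;; )
            {
                LONG st = state_;

                // claim the writer flag; readers may still be inside
                if( !( st & writer_flag ) &&
                    InterlockedCompareExchange( &state_, st | writer_flag, st ) == st )
                    break;

                Sleep( 0 );
            }

            while( state_ & reader_mask ) Sleep( 0 );   // wait for readers to drain
        }

        void unlock()
        {
            // no reader can have entered while the flag was set, so the
            // state is exactly writer_flag at this point
            InterlockedExchange( &state_, 0 );
        }
    };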

Howard Hinnant:
I've so far only looked at your implementation for a few minutes. The lock-reduced version is tricky. It has been three years since I did this, but when I did, I needed 3 flags instead of your two. That may have been because I was guarding against max-readers and you're not, not sure yet.
Actually, most implementations tend to work as intended even without such a check, provided the "writer" bit is adjacent to the "#readers" bits (overflowing the count sets the writer bit). This is not the case for my implementation, though, since I've managed to put the bits in the wrong order. :-)
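For illustration, the adjacency in question looks something like this (constants are illustrative, not taken from any of the implementations discussed):

    // low bits: #readers, next bit up: writer flag
    unsigned const reader_mask = 0x7FFFFFFF;
    unsigned const writer_bit  = 0x80000000;

    // If a reader increments the count when it is already at reader_mask,
    // the carry sets writer_bit, so subsequent readers see a "writer" and
    // stop entering -- an accidental max-readers guard. With the two fields
    // in the opposite order, the overflow goes elsewhere and the guard is lost.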

"Peter Dimov" <pdimov@pdimov.com> writes:
Some wait times (2R+1W):
    atomics: 7.673 microseconds
    lightweight_mutex (CRITICAL_SECTION): 3.069 us
    shared_mutex: 760 us
    rw_mutex (my implementation): 665 us (same problem)
    pthread_rwlock_t (pthreads-win32): 7.108 us
    rw_mutex (Hinnant/Terekhov): 85.532 us
This last line uses my reimplementation of Howard Hinnant's read/write mutex based on his description; Howard credits Alexander Terekhov with the original algorithm. It does stall the writer a bit in exchange for optimal reader throughput, but doesn't suffer from outright starvation.
I've been running your sp_atomic_mt_test with 8 readers and 1 writer, and various rw mutex implementations, on my core2duo machine.

Using the pthreads-win32 pthread_rwlock_t shows *serious* reader starvation: the whole thing completes in 0.5s, with 7 out of 8 reader threads having under 85000 iterations, and the 8th having just over 110000, for 1048576 writer iterations.

Using your lightweight mutex runs in around 10s, with around 1000000 iterations for each reader.

Using boost::shared_mutex or your implementation of the Hinnant/Terekhov rw-mutex takes 100-200s, with reader multipliers around 60-80. Statistically there's not a lot in it, though your Hinnant/Terekhov implementation runs a bit faster.

shared_mutex algorithms are *hard* to get right.

Anthony

Anthony Williams:
shared_mutex algorithms are *hard* to get right.
I'm now tending towards the opinion that rwlocks are hard to use correctly (and test). For the 2R+1W case I've been testing, it occurs to me that the writer wait time is normal - statistically, the time for the two readers to finish their timeslices of 1ms is expected to be 0.75ms. A (possibly?) better test would not have a dedicated writer thread, but would use a pool of generic worker threads that alternate between reading and writing with a certain probability. I may try one at some point.
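Such a test might look roughly like the following sketch (the worker function, write probability, and counts are made up for illustration, not taken from any existing test):

    #include <boost/thread/shared_mutex.hpp>
    #include <boost/thread/locks.hpp>
    #include <cstdlib>

    extern boost::shared_mutex rw;
    extern long shared_value;

    // One generic worker out of a pool of N identical threads; each iteration
    // is a write with probability p and a read otherwise, so there is no
    // dedicated writer thread to starve.
    void worker( long iterations, double write_probability )
    {
        for( long i = 0; i < iterations; ++i )
        {
            if( std::rand() < write_probability * RAND_MAX ) // rand() is crude, but enough for a sketch
            {
                boost::unique_lock<boost::shared_mutex> lock( rw );
                ++shared_value;
            }
            else
            {
                boost::shared_lock<boost::shared_mutex> lock( rw );
                long v = shared_value;   // read-only work
                (void)v;
            }
        }
    }

    // e.g. launch 8 threads running worker( 1048576, 0.01 ) and compare the
    // per-thread read/write counts across lock implementations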
participants (4)
- Alexander Terekhov
- Anthony Williams
- Howard Hinnant
- Peter Dimov