[Thread] read_write_mutex bug?

Hi,

Testing my own wrapper on x86_64 linux exposed a seeming deadlock when constructing a read_write_lock with a *::write_locked initial state in the constructor... I've distilled it to the plain code below that exhibits the same failure.

Using a read_lock state on the same works without deadlock (or safety in this case) as expected. Using a scoped_write_lock worked once but deadlocked mostly. A normal mutex and recursive mutex work as expected.

I'm using 1.33. Is this a known bug or am I using the read_write stuff incorrectly?

Regards,

Matt.
matthurd@acm.org

____________________________________
my test in question:
...

void synch_test_019()
{
    static int global_count = 0;
    static boost::read_write_mutex doobie( boost::read_write_scheduling_policy::writer_priority );

    long number_of_threads = 10;
    static long inner_loop = 1000;

    ZOMOJO_WORKER_THREAD_BEGIN( inner_loop )
    {
        boost::read_write_mutex::scoped_read_write_lock lk(doobie, boost::read_write_lock_state::write_locked);
        global_count = global_count + 1;
    }
    ZOMOJO_WORKER_THREAD_END( number_of_threads )

    BOOST_MESSAGE( "orderly progression = " << inner_loop * number_of_threads );
    BOOST_MESSAGE( "concurrent progression from this run = " << global_count );
    BOOST_CHECK_PREDICATE( std::equal_to<int>(), (global_count)( inner_loop * number_of_threads ) );
}

_________________________________________________
The helper macros
_________________________________________________

#ifndef ZOMOJO_THREAD_TEST_HELPER_HPP_20050816
#define ZOMOJO_THREAD_TEST_HELPER_HPP_20050816

// A poor man's barrier helper to assist in unit test cases for parallel execution

#define ZOMOJO_WORKER_THREAD_BEGIN( LoopIterations ) \
    static boost::read_write_mutex m_arbitrary_( boost::read_write_scheduling_policy::writer_priority ); \
    struct simple_function_arbitrary_ \
    { \
        static void do_work_arbitrary_() \
        { \
            boost::read_write_mutex::scoped_read_lock lk_arbitrary_(m_arbitrary_); \
            for (int j_arbitrary_ = 0; j_arbitrary_ < (LoopIterations); j_arbitrary_++) \
            {

#define ZOMOJO_WORKER_THREAD_END( NumberOfThreads ) \
            } \
        } \
    }; \
    \
    boost::thread_group pool_arbitrary_; \
    \
    { \
        boost::read_write_mutex::scoped_write_lock lk(m_arbitrary_); \
        for (int i_arbitrary_ = 0; i_arbitrary_ < NumberOfThreads; ++i_arbitrary_) \
        { \
            pool_arbitrary_.create_thread( simple_function_arbitrary_::do_work_arbitrary_ ); \
        } \
    } \
    \
    pool_arbitrary_.join_all();

#endif // ZOMOJO_THREAD_TEST_HELPER_HPP_20050816

On Monday 22 August 2005 03:59, Matt Hurd wrote:
Using a read_lock state on the same works without deadlock (or safety in this case) as expected. Using a scoped_write_lock worked once but deadlocked mostly.
A normal mutex and recursive mutex work as expected.
I'm using 1.33. Is this a known bug or am I using the read_write stuff incorrectly?
I noticed the same thing last week -- a multithreaded program deadlocks with multiple threads waiting for the write lock but no one holding it. It was not immediately clear to me why.

This was reproduced on win32, linux and solaris, so it appears to be a read_write_lock-specific problem. The underlying primitives (mutex and condition) work fine on all platforms.

The same code worked fine with boost 1.32.

Teemu

On 23/08/05, Teemu Torma <teemu@torma.org> wrote:
On Monday 22 August 2005 03:59, Matt Hurd wrote:
I'm using 1.33. Is this a known bug or am I using the read_write stuff incorrectly?
I noticed the same thing last week -- a multithreaded program deadlocks with multiple threads waiting for the write lock but no one holding it. It was not immediately clear to me why.
This was reproduced on win32, linux and solaris, so it appears to be a read_write_lock-specific problem. The underlying primitives (mutex and condition) work fine on all platforms.
The same code worked fine with boost 1.32.
Teemu
Thanks for the confirmation Teemu.

Perhaps boost would be better off with a simpler but similar approach, totally rewritten to use posix on all platforms including win32. Posix is viable on Mac these days too, I assume. Support would be easier, I'd think. I'd be prepared to put an initial cut together. At least Alexander Terekhov would be happy ;-)

There is a certain attractiveness in making posix do the heavy lifting... A 90% solution that covers basic atomic ops, simple fencing, and mutex / lock implementations better suited to policy implementations wouldn't be too hard.

I guess at a minimum the tests should detect whether multithreaded apps correctly fail when not protected, so that the efficacy of other concurrency testing is known. I use this approach in my own testing, where it warns if unprotected concurrency works where it should fail.

matt.
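A minimal sketch of the shape such a posix-backed wrapper could take (illustrative only: the rw_mutex name and interface below are invented for this example, not existing boost code, and error returns are ignored for brevity):

#include <pthread.h>

// Illustrative sketch: a thin wrapper over the posix rwlock, with RAII
// helpers in the style of the boost scoped locks.
class rw_mutex
{
public:
    rw_mutex()  { pthread_rwlock_init(&m_lock, 0); }
    ~rw_mutex() { pthread_rwlock_destroy(&m_lock); }

    void read_lock()  { pthread_rwlock_rdlock(&m_lock); }
    void write_lock() { pthread_rwlock_wrlock(&m_lock); }
    void unlock()     { pthread_rwlock_unlock(&m_lock); }

private:
    pthread_rwlock_t m_lock;
};

// Scoped locks: acquire in the constructor, release in the destructor.
class scoped_read_lock
{
public:
    explicit scoped_read_lock(rw_mutex& m) : m_m(m) { m_m.read_lock(); }
    ~scoped_read_lock() { m_m.unlock(); }
private:
    rw_mutex& m_m;
};

class scoped_write_lock
{
public:
    explicit scoped_write_lock(rw_mutex& m) : m_m(m) { m_m.write_lock(); }
    ~scoped_write_lock() { m_m.unlock(); }
private:
    rw_mutex& m_m;
};

On win32 this would sit on top of a pthreads implementation such as pthreads-win32, which is the main portability cost of the approach.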

Testing my own wrapper on x86_64 linux exposed a seeming deadlock when constructing a read_write_lock with a *::write_locked initial state in the constructor...
I've distilled it to the plain code below that exhibits the same failure.
Matt, can you distil this test case a little more please:

* Get rid of the macros so we can read the code :-)
* Don't use Boost.Test for the tests, it's *not* thread safe, and bad things may be happening in there.
* Don't declare variables as function-static: it's not thread safe. I'm not sure if this is an issue or not from reading the code; with those macros in place it's hard to see which bits are getting called by multiple threads.

Many thanks!

John.

On 31/08/05, John Maddock <john@johnmaddock.co.uk> wrote:
<snip>
Matt, can you distil this test case a little more please:
* Get rid of the macros so we can read the code :-)
* Don't use Boost.Test for the tests, it's *not* thread safe, and bad things may be happening in there.
* Don't declare variables as function-static: it's not thread safe. I'm not sure if this is an issue or not from reading the code; with those macros in place it's hard to see which bits are getting called by multiple threads.
Many thanks!
John.
Sure thing, I'll do that a bit later today.

matt.

On 31/08/05, John Maddock <john@johnmaddock.co.uk> wrote:
Matt, can you distil this test case a little more please:
* Get rid of the macros so we can read the code :-)
* Don't use Boost.Test for the tests, it's *not* thread safe, and bad things may be happening in there.
* Don't declare variables as function-static: it's not thread safe. I'm not sure if this is an issue or not from reading the code; with those macros in place it's hard to see which bits are getting called by multiple threads.
John,

Hopefully this is a little more readable. My set up is gcc 3.3.5 on Suse 9.3 pro x86_64 with dual opterons.

Hope this helps,

matt.
matt@zomojo.com

___________________________

#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/thread/read_write_mutex.hpp>

using namespace boost;

// these definitions deadlock
read_write_mutex guard( read_write_scheduling_policy::writer_priority );
typedef read_write_mutex::scoped_write_lock test_lock_type;

// these definitions work as expected
// mutex guard;
// typedef mutex::scoped_lock test_lock_type;

long global_count = 0;
long loop_count_for_test = 100000;

void do_work()
{
    for (long i = 0; i < loop_count_for_test; ++i)
    {
        test_lock_type lk( guard );
        ++global_count;
    }
}

int main()
{
    thread_group pool;
    int number_of_threads = 100;
    for (int i = 0; i < number_of_threads; ++i)
    {
        pool.create_thread( do_work );
    }
    pool.join_all();

    // result check
    std::cout << "We should get here without a deadlock\n";
    std::cout << "global count = " << global_count << "\n";
    std::cout << "global count should be = " << loop_count_for_test * number_of_threads << "\n";

    return 0;
}

Hopefully this is a little more readable. My set up is gcc 3.3.5 on Suse 9.3 pro x86_64 with dual opterons.
Hope this helps,
I can reproduce that on WinXP with VC7.1. I can also reduce the number of threads to about 10 and still get the deadlock.

However, I can't see what the problem is: when the deadlock occurs, all the threads are waiting for the writer condition variable (m_waiting_writers) to wake up one of the writers at boost::detail::thread::read_write_mutex_impl<boost::mutex>::do_write_lock(), line 512. The member m_waking_writers is set to one, and as far as I can see that can only occur in read_write_mutex_impl<Mutex>::do_wake_writer(void), line 1425, which must then have notified the condition variable to wake up one thread. m_state must have been set to zero before all this happens, so the woken thread should not loop and go back to sleep. (Footnote: actually that appears not to be true; sometimes a thread is woken with m_state == -1, but that appears not to be the immediate cause of the problem.)

So... I'm stumped at present. Hopefully someone else can also take a look at this?

John.

John Maddock schrieb:
Hopefully someone else can also take a look at this?
In the file: read_write_mutex.cpp
In function: void read_write_mutex_impl<Mutex>::do_write_lock()

The lines added by me (RS):

-->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--
else if (m_sp == read_write_scheduling_policy::writer_priority)
{
    //Writer priority: wait while locked
    BOOST_DEFINE_LOOP_COUNT
    adjust_dual_count adjust_waking(m_num_waking_writers, m_num_max_waking_writers, false);

    if (m_state == 0 && m_num_waking_writers > 0)   // added by RS
        adjust_waking.set_adjust(true);             // added by RS

    while (m_state != 0)
    {
        BOOST_ASSERT_LOOP_COUNT();    //See note at BOOST_ASSERT_LOOP_COUNT definition above
        BOOST_ASSERT(waker_exists()); //There should be someone to wake us up
        adjust_count adjust_waiting(m_num_waiting_writers);
        adjust_waking.set_adjust(true);
        m_waiting_writers.wait(l);
    }
}
-->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--

These lines apparently fix the problem. As I am in a hurry just at the moment I cannot be more verbose. I'll be back later.

However, I definitely would ask someone else to review the change before checking this in, since the code is rather involved, and I am currently not sure if I am doing just cosmetics, or if I understood the consequences in their entirety.

Regards,
Roland

John Maddock schrieb:
I can also reduce the number of threads to about 10 and still get the deadlock.
You can go as low as two.
However, I can't see what the problem is: when the deadlock occurs, all the threads are waiting for the writer condition variable (m_waiting_writers) to wake up one of the writers at boost::detail::thread::read_write_mutex_impl<boost::mutex>::do_write_lock(), line 512. The member m_waking_writers is set to one, and as far as I can see that can only occur in read_write_mutex_impl<Mutex>::do_wake_writer(void), line 1425, which must then have notified the condition variable to wake up one thread. m_state must have been set to zero before all this happens, so the woken thread should not loop and go back to sleep. (Footnote: actually that appears not to be true; sometimes a thread is woken with m_state == -1, but that appears not to be the immediate cause of the problem.)

So... I'm stumped at present.
When a thread is releasing its lock, the waiters on the condition m_waiting_writers are notified via notify_one. m_state is set to 0, and m_num_waking_writers > 0.

Now when it happens (and it does happen) that another thread enters do_write_lock _before_ any other thread has been woken up, it will see an m_state of 0. And this is bad, since m_num_waking_writers > 0. It is bad because obtaining the lock (which will be granted because of m_state == 0) is in essence a kind of wakeup. But the code does not account for this and correct m_num_waking_writers. Hence do_wake_writer will never again try to notify_one any waiters. This leads to deadlock.

What is left: who actually is being woken up then? Yup, obviously the original waiting writer receives a spurious wakeup, sees that m_state is -1 again, and keeps on waiting.

My already posted bugfix solves this, but I am not yet sure about the other do_*_lock operations. Are they susceptible to this bug too?

Regards,
Roland
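To see the race in isolation, here is a simplified standalone sketch of the pattern (this is not the actual boost implementation; the class and its members are invented for illustration, and only the write-lock path is shown):

#include <boost/thread/mutex.hpp>
#include <boost/thread/condition.hpp>

class sketch_write_mutex
{
public:
    sketch_write_mutex() : state(0), waiting_writers(0), waking_writers(0) {}

    void write_lock()
    {
        boost::mutex::scoped_lock l(m);

        // The fix: if we can take the lock immediately while a wakeup is
        // still "in flight" (waking_writers > 0), this acquisition consumes
        // that wakeup, so the counter must be adjusted here. Without this,
        // the originally notified writer wakes, finds state == -1 again,
        // goes back to sleep, and because waking_writers stays > 0 no
        // further notify_one() is ever issued: deadlock.
        if (state == 0 && waking_writers > 0)
            --waking_writers;

        while (state != 0)
        {
            ++waiting_writers;
            cond.wait(l);                  // may also wake spuriously
            --waiting_writers;
            if (state == 0 && waking_writers > 0)
                --waking_writers;          // we consumed the promised wakeup
        }
        state = -1;                        // write-locked
    }

    void write_unlock()
    {
        boost::mutex::scoped_lock l(m);
        state = 0;
        if (waiting_writers > 0 && waking_writers == 0)
        {
            ++waking_writers;              // promise exactly one wakeup
            cond.notify_one();
        }
    }

private:
    boost::mutex m;
    boost::condition cond;
    int state;                             // 0 = unlocked, -1 = write-locked
    int waiting_writers;
    int waking_writers;
};

Without the first if, a thread that barges in between the notify_one() and the woken writer actually running takes the lock while leaving waking_writers at 1, which is exactly the stuck state described above.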

My already posted bugfix solves this, but I am not yet sure about the other do_*_lock operations. Are they susceptible to this bug too?
There are a lot of combinations to test here, which is why these issues are going undiagnosed, I believe. I needed several other patches similar to yours to fix deadlocks with alternate scheduling policies, but when I moved on to a try lock the old deadlocks came back again. Here's a typical example below, along with the patches I'm testing so far (someone needs to test timed locks this way as well).

John.

RCS file: /cvsroot/boost/boost/libs/thread/src/read_write_mutex.cpp,v
retrieving revision 1.23
diff -r1.23 read_write_mutex.cpp
487a488,491
> if (m_state == 0 && m_num_waking_writers > 0)
> {
>     adjust_waking.set_adjust(true); // added by RS
> }
504a509,512
> if (m_state == 0 && m_num_waking_writers > 0)
> {
>     adjust_waking.set_adjust(true); // added by RS
> }
524a533,536
> if (m_state == 0 && m_num_waking_writers > 0)
> {
>     adjust_waking.set_adjust(true); // added by RS
> }
#include <cassert>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/thread/read_write_mutex.hpp>

using namespace boost;

// these definitions deadlock
try_read_write_mutex guard( read_write_scheduling_policy::alternating_single_read );
typedef try_read_write_mutex::scoped_try_read_write_lock test_lock_type;

// these definitions work as expected
// mutex guard;
// typedef mutex::scoped_lock test_lock_type;

long global_count = 0;
long loop_count_for_test = 100000;

void do_work()
{
    for (long i = 0; i < loop_count_for_test; ++i)
    {
        test_lock_type lk( guard, read_write_lock_state::unlocked );
        if (0 == lk.try_write_lock())
        {
            std::cout << "try failed" << std::endl;
            lk.write_lock();
        }
        assert(lk.locked());
        long j = ++global_count;
        lk.demote();
        assert(j == global_count);
    }
}

int main()
{
    thread_group pool;
    int number_of_threads = 10;
    for (int i = 0; i < number_of_threads; ++i)
    {
        pool.create_thread( do_work );
    }
    pool.join_all();

    // result check
    std::cout << "We should get here without a deadlock\n";
    std::cout << "global count = " << global_count << "\n";
    std::cout << "global count should be = " << loop_count_for_test * number_of_threads << "\n";

    return 0;
}
participants (4)
- John Maddock
- Matt Hurd
- Roland Schwarz
- Teemu Torma