
John Maddock wrote:
I can also reduce the number of threads to about 10 and still get the deadlock.
You can go as low as two.
However, I can't see what the problem is: when the deadlock occurs, all the threads are waiting for the writer condition variable (m_waiting_writers) to wake up one of the writers at boost::detail::thread::read_write_mutex_impl<boost::mutex>::do_write_lock() line 512. The member m_num_waking_writers is set to one, and as far as I can see that can only occur in read_write_mutex_impl<Mutex>::do_wake_writer(void) line 1425, which then must have notified the condition variable to wake up one thread. m_state must have been set to zero before all this happens, so the woken thread should not loop and go back to sleep. (Footnote: actually, that appears not to be true; sometimes a thread is woken with m_state == -1, but that appears not to be the immediate cause of the problem.) So I'm stumped at present.
When a thread releases its lock, one waiter on the condition m_waiting_writers is notified with notify_one. m_state is set to 0, and m_num_waking_writers > 0. Now when it happens (and it does happen) that another thread enters do_write_lock _before_ any waiting thread has actually been woken up, it will see an m_state of 0. This is bad, because m_num_waking_writers > 0: obtaining the lock (which is granted because m_state == 0) is in essence a kind of wakeup, but the code does not account for this and does not correct m_num_waking_writers. Hence do_wake_writer will never again try to notify_one any waiters. This leads to deadlock.

What is left: who is actually being woken up, then? Obviously the original waiting writer: it receives what is effectively a spurious wakeup, sees that m_state is -1 again, and keeps on waiting.

My already posted bugfix solves this, but I am not yet sure about the other do_*_lock operations. Are they susceptible to this bug too?

Regards,
Roland
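
The interleaving described above can be replayed deterministically in a small single-threaded model. This is only a sketch under stated assumptions: WriterLockModel, try_enter, on_wakeup, unlock and replay are hypothetical names, the counters merely mirror m_state, m_num_waking_writers and the number of waiting writers as described in this thread, and the "fix" shown (letting a writer that takes the free lock consume the pending wakeup) is one possible correction, not necessarily the posted patch and not Boost's actual code.

```cpp
#include <cassert>

// Simplified model of the writer bookkeeping (hypothetical, not Boost code).
struct WriterLockModel {
    int state = 0;            // 0 = free, -1 = write-locked (like m_state)
    int waiting_writers = 0;  // writers blocked on the condition
    int waking_writers = 0;   // wakeups issued but not yet consumed
    int notifications = 0;    // number of notify_one calls, for observation
    bool fix_enabled;

    explicit WriterLockModel(bool fix) : fix_enabled(fix) {}

    // Entry of do_write_lock: take the lock if free, otherwise start waiting.
    bool try_enter() {
        if (state != 0) { ++waiting_writers; return false; }
        state = -1;
        // Assumed fix: taking the lock while a wakeup is pending consumes it.
        if (fix_enabled && waking_writers > 0) --waking_writers;
        return true;
    }

    // A notified waiter re-checks the state after waking up.
    bool on_wakeup() {
        if (state != 0) return false;  // lock was taken meanwhile: wait again
        state = -1;
        --waiting_writers;
        if (waking_writers > 0) --waking_writers;
        return true;
    }

    // Release: notify one waiter only if no wakeup is already pending.
    void unlock() {
        state = 0;
        if (waiting_writers > 0 && waking_writers == 0) {
            ++waking_writers;
            ++notifications;  // stands in for notify_one
        }
    }
};

// Replays the scenario; returns true if the waiter is notified a second time.
bool replay(bool fix) {
    WriterLockModel m(fix);
    m.try_enter();               // writer A acquires the lock
    m.try_enter();               // writer B finds it taken and waits
    m.unlock();                  // A releases: state = 0, one wakeup pending
    m.try_enter();               // writer C enters before B has run
    bool b_got = m.on_wakeup();  // B wakes, sees state == -1, waits again
    assert(!b_got);
    m.unlock();                  // C releases
    return m.notifications == 2; // without a new notification B sleeps forever
}
```

In this model replay(false) leaves the waiter without a second notification, which corresponds to the deadlock, while replay(true) issues one. Whether the read-lock paths need the same treatment is not covered by the sketch.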