On Dec 1, 2009, at 9:50 AM, Peter Dimov wrote:
Stonewall Ballard wrote:
I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting.
This looks like a classic CV pitfall when using "atomic" predicates that are not protected by the mutex used for the wait. The basic outline is that thread A does a CV wait on an atomic boolean variable, and thread B sets this variable and does a notify. There exists a race in which A sees false, is preempted, B stores true, does notify, waking up nobody, and then A continues with the wait. The cure is to insert an empty lock+unlock of A's mutex in B between the store and the notify; it doesn't need to encompass the store, and it doesn't need to encompass the notify call, either.
I can't read the boost.thread code well enough to be able to diagnose the problem, but from a cursory look, it looks possible to me that condition_variable::wait may perform its interruption check, be preempted, the interrupt can proceed with setting interrupt_requested and doing a broadcast, and then condition_variable::wait to block on its pthread_cond_wait.
I've found that all the threads are waiting on the condition variable at the time that I interrupt them, so that scenario doesn't seem to apply. The actual interrupt code in pthread/thread.cpp is: void thread::interrupt() { detail::thread_data_ptr const local_thread_info=get_thread_info(); if(local_thread_info) { lock_guard<mutex> lk(local_thread_info->data_mutex); local_thread_info->interrupt_requested=true; if(local_thread_info->current_cond) { BOOST_VERIFY(!pthread_cond_broadcast(local_thread_info->current_cond)); } } } As you can see, each thread has a mutex protecting its internal state, and the interrupter locks that mutex before setting interrupt_requested to true. I think that this precludes a race condition there. I think that what's actually happening here is that 16 threads are waiting on the same condition, then they're individually interrupted (using the above code). Each interrupt wakes all the threads waiting on the condition, where all but one thread locks the mutex, see that it's not interrupted, then unlocks the mutex and waits. Given the hardware parallelism on my 16-hyperthread Mac Pro, I suspect that there's ample opportunity for a bug.
Your workaround does prevent it, but I think that boost.thread should take care of that internally, by storing a mutex pointer in the thread data, along with the cv pointer.
As I'm getting more convinced that this is bug in pthreads, at least on OS X, it seems unwise to add that to boost::thread. A pthread condition variable knows its mutex, of course, but there doesn't appear to be any way to retrieve it. I'm going to try to reproduce this using just pthreads calls. Thanks for your help. - Stoney -- Stonewall Ballard stoney@sb.org http://stoney.sb.org/