Re: [Boost-users] thread_group::interrupt_all is not reliable

1 Dec 2009


      On Dec 1, 2009, at 3:29 AM, Roland Bock wrote:
...
Stonewall Ballard wrote:
...
...
I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting.
thread::interrupt() calls pthread_cond_broadcast in pthread/thread.cpp.
This gave me the clue to try that:
<http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_broadcast.html>
The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().
Although "predictable scheduling" doesn't seem like it should include a failure to wake up, taking the mutex around the call to thread_pool::interrupt_all() appears to be 100% reliable.
I can patch my app to do that, but I don't think there's a general solution. The documentation should include a note that thread::interrupt() isn't reliable unless the caller is holding the mutex associated with the condition variable on which the interrupted thread is waiting.
Of course, this could be a bug in the OS X pthreads implementation as well.
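A minimal sketch of the workaround described above, written against plain pthreads rather than Boost so that it stands alone. StopSignal, request_stop and wait_for_stop are hypothetical names, not part of Boost.Thread or of the application discussed here; the only point is that the broadcast is issued while the waiters' mutex is held.

```cpp
#include <pthread.h>

// Hypothetical helper type; none of these names come from Boost or the original app.
struct StopSignal {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    bool stop;  // predicate, guarded by mutex

    StopSignal() : stop(false) {
        pthread_mutex_init(&mutex, nullptr);
        pthread_cond_init(&cond, nullptr);
    }
    ~StopSignal() {
        pthread_cond_destroy(&cond);
        pthread_mutex_destroy(&mutex);
    }
};

// Signalling side: set the flag and broadcast while holding the same mutex
// that the waiters pass to pthread_cond_wait().
void request_stop(StopSignal& s) {
    pthread_mutex_lock(&s.mutex);
    s.stop = true;
    pthread_cond_broadcast(&s.cond);
    pthread_mutex_unlock(&s.mutex);
}

// Waiting side: loop on the predicate so a spurious wakeup is harmless.
void wait_for_stop(StopSignal& s) {
    pthread_mutex_lock(&s.mutex);
    while (!s.stop) {
        pthread_cond_wait(&s.cond, &s.mutex);
    }
    pthread_mutex_unlock(&s.mutex);
}
```

With Boost the equivalent would presumably be to lock the application's own mutex before calling interrupt_all(), as the message says; this sketch does not claim to show how Boost handles it internally.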
Hi,
FWIW, I ran that test of yours several times with varying parameters on my machine (quad core, 64bit, linux) and it did not show a single failure. Of course, since it is not a deterministic effect even on your machine, failure to reproduce does not really mean much, but well, I thought you might like to hear anyway :-)
Thanks, but this doesn't surprise me. Since the reliability drops rapidly as I add threads, I suspect it has something to do with running this on an 8-core (16 hyperthread) machine. I also suspect that it's a Mac OS bug.
...
And I totally agree: Predictable scheduling should not be required to wake up all threads, especially since the document also says
<cite>
The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex [...]
</cite>
Of course, that could just be a promise that it won't crash. I hope you're right, though.
...
As for boiling down the application for others to inspect:
Your debugger showed that the thread is still in wait() after the interrupt call.
Can you confirm that ALL worker threads are in wait() prior to the interrupt?
 * If yes: There seems to be no connection with the interlocked queue,
   the sleep and so on. It should be possible to get rid of all that
   for a much simpler test program
 * If no: OK, there seems to be a connection between the wait(), the
   interrupt and the sleep and/or mutex.
Yes, I added code to check that, and all 16 threads were waiting on that condition when I interrupted them.
...
In any case, I would assume that by analysing the situation right before the interrupt, you should be able to reproduce the problem with much less code.
If I understand this correctly, I should be able to reproduce it with pthreads alone (no boost). I'm going to try that when I get some time and file a bug report.
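One possible shape for such a pthreads-only test, as a rough sketch and not the original test program. Every name in it (Shared, worker, count_woken) is hypothetical, and the 5-second timeout is an arbitrary choice whose only job is to keep a missed wakeup from hanging the test. Unlike Boost's interruption flag, the stop flag here is set under the waiters' mutex; only the broadcast is issued without it.

```cpp
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <vector>

// Hypothetical shared state; all int/bool fields are guarded by mutex.
struct Shared {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    bool stop;
    int waiting;  // workers that have reached the wait loop
    int woken;    // workers released without needing the timeout
};

void* worker(void* arg) {
    Shared* s = static_cast<Shared*>(arg);
    timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 5;  // arbitrary safety timeout

    pthread_mutex_lock(&s->mutex);
    ++s->waiting;
    int rc = 0;
    while (!s->stop && rc == 0) {
        rc = pthread_cond_timedwait(&s->cond, &s->mutex, &deadline);
    }
    if (rc == 0) {
        ++s->woken;  // left the loop because stop was seen, not by timeout
    }
    pthread_mutex_unlock(&s->mutex);
    return nullptr;
}

// Starts n_threads waiters, confirms they are all waiting, then broadcasts
// WITHOUT holding the mutex. Returns how many woke before their timeout.
int count_woken(int n_threads) {
    Shared s;
    pthread_mutex_init(&s.mutex, nullptr);
    pthread_cond_init(&s.cond, nullptr);
    s.stop = false;
    s.waiting = 0;
    s.woken = 0;

    std::vector<pthread_t> threads(n_threads);
    for (pthread_t& t : threads) {
        pthread_create(&t, nullptr, worker, &s);
    }

    // A worker holds the mutex from its increment until it blocks in the
    // wait, so seeing waiting == n_threads here means all are blocked.
    for (;;) {
        pthread_mutex_lock(&s.mutex);
        bool ready = (s.waiting == n_threads);
        pthread_mutex_unlock(&s.mutex);
        if (ready) break;
        sched_yield();
    }

    pthread_mutex_lock(&s.mutex);
    s.stop = true;
    pthread_mutex_unlock(&s.mutex);
    pthread_cond_broadcast(&s.cond);  // deliberately issued unlocked

    for (pthread_t& t : threads) {
        pthread_join(t, nullptr);
    }
    int woken = s.woken;
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.mutex);
    return woken;
}
```

Calling count_woken(16) repeatedly and comparing the result with 16 would be one way to look for the effect; a smaller value would mean some waiters only got out through the timeout, which is the kind of evidence a bug report could cite.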
...
Hope that helps in any way?
Yes, thanks. Reasoning about threads requires code review.

 - Stoney

-- 
Stonewall Ballard 
stoney@sb.org           http://stoney.sb.org/