On Dec 1, 2009, at 3:29 AM, Roland Bock wrote:
Stonewall Ballard wrote:
The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().
I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting. This gave me the clue to try that:
http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_broadca...
thread::interrupt() calls pthread_cond_broadcast in pthread/thread.cpp. Although "predictable scheduling" doesn't seem like it should include a failure to wake up, taking the mutex around the call to thread_pool::interrupt_all() appears to be 100% reliable.
I can patch my app to do that, but I don't think there's a general solution. The documentation should include a note that thread::interrupt() isn't reliable unless the caller is holding the mutex associated with the condition variable on which the interrupted thread is waiting.
Of course, this could be a bug in the OS X pthreads implementation as well.
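For readers following along, here is a minimal plain-pthreads sketch of the "hold the mutex while broadcasting" pattern described above. It is not Boost's implementation and not the code from the application in question; the names `run_waiters`, `wake_all_locked`, `stop_requested`, and `woken` are hypothetical, chosen only for illustration. It also assumes the waiters loop on a predicate, which is the usual way to tolerate spurious wake-ups.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state; none of these names come from the thread. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool stop_requested = false;  /* predicate, guarded by `lock` */
static int  woken = 0;               /* waiters that observed the request */

static void *waiter(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    /* Loop on the predicate: this tolerates spurious wake-ups and a
       broadcast that is issued before this thread starts waiting. */
    while (!stop_requested)
        pthread_cond_wait(&cond, &lock);
    woken++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Set the predicate and broadcast while holding the waiters' mutex. */
static void wake_all_locked(void) {
    pthread_mutex_lock(&lock);
    stop_requested = true;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

/* Start n waiters (1..64), wake them all, and return how many woke up. */
int run_waiters(int n) {
    pthread_t threads[64];
    assert(n > 0 && n <= 64);

    pthread_mutex_lock(&lock);
    stop_requested = false;
    woken = 0;
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < n; i++) {
        int rc = pthread_create(&threads[i], NULL, waiter, NULL);
        assert(rc == 0);
        (void)rc;
    }
    wake_all_locked();
    for (int i = 0; i < n; i++)
        pthread_join(threads[i], NULL);
    return woken;  /* safe to read: every waiter has been joined */
}
```

Calling `run_waiters(16)` should return 16 on a conforming implementation. Note that thread::interrupt() cannot follow this pattern by itself, since it has no access to the mutex the interrupted thread passed to wait(); that is why the locking has to happen in the caller.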
Hi,
FWIW, I ran that test of yours several times with varying parameters on my machine (quad-core, 64-bit, Linux) and it did not show a single failure. Of course, since it is not a deterministic effect even on your machine, failure to reproduce does not really mean much, but well, I thought you might like to hear anyway :-)
Thanks, but this doesn't surprise me. Since the reliability drops rapidly as I add threads, I suspect it has something to do with running this on an 8-core (16 hyperthread) machine. I also suspect that it's a Mac OS bug.
And I totally agree: Predictable scheduling should not be required to wake up all threads, especially since the document also says
<cite> The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex [...] </cite>
Of course, that could just be a promise that it won't crash. I hope you're right, though.
As for boiling down the application for others to inspect: Your debugger showed that the thread is still in wait() after the interrupt call.
Can you assure that ALL worker threads are in wait() prior to the interrupt?
* If yes: There seems to be no connection with the interlocked queue, the sleep and so on. It should be possible to get rid of all that for a much simpler test program.
* If no: OK, there seems to be a connection between the wait(), the interrupt and the sleep and/or mutex.
Yes, I added code to check that, and all 16 threads were waiting on that condition when I interrupted them.
In any case, I would assume that by analysing the situation right before the interrupt, you should be able to reproduce the problem with much less code.
If I understand this correctly, I should be able to reproduce it with pthreads alone (no boost). I'm going to try that when I get some time and file a bug report.
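A pthreads-only test along these lines might look roughly like the sketch below. It is only a guess at the shape of such a test, not the program that was actually run: `count_missed_wakeups`, `worker`, `waiting`, `released`, `missed`, and the 5-second timeout are all hypothetical. Each worker registers itself under the mutex before waiting, so once the main thread sees the full count while holding the mutex, every worker is known to be blocked in the wait. Workers use pthread_cond_timedwait() so that a lost wake-up shows up as a timeout instead of a hang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

#define MAX_WORKERS 64
#define TIMEOUT_SEC 5   /* assumed generous enough to mean "never woken" */

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int waiting  = 0;  /* workers that have entered the wait loop */
static int released = 0;  /* predicate, guarded by mtx */
static int missed   = 0;  /* workers that timed out instead of waking */

static void *worker(void *arg) {
    struct timespec deadline;
    int rc = 0;
    (void)arg;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += TIMEOUT_SEC;

    pthread_mutex_lock(&mtx);
    waiting++;  /* the mutex is only released inside the wait below */
    while (!released && rc == 0)
        rc = pthread_cond_timedwait(&cv, &mtx, &deadline);
    if (rc == ETIMEDOUT)
        missed++;
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Run n workers (1..MAX_WORKERS), broadcast once all of them are waiting,
   and return how many timed out. hold_mutex selects whether the broadcast
   is issued with the mutex held. */
int count_missed_wakeups(int n, int hold_mutex) {
    pthread_t threads[MAX_WORKERS];
    struct timespec pause = {0, 1000000};  /* 1 ms between polls */
    int all_waiting = 0;
    assert(n > 0 && n <= MAX_WORKERS);

    pthread_mutex_lock(&mtx);
    waiting = 0; released = 0; missed = 0;
    pthread_mutex_unlock(&mtx);

    for (int i = 0; i < n; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }

    /* Poll until every worker has registered and is blocked in the wait. */
    while (!all_waiting) {
        pthread_mutex_lock(&mtx);
        all_waiting = (waiting == n);
        pthread_mutex_unlock(&mtx);
        if (!all_waiting)
            nanosleep(&pause, NULL);
    }

    pthread_mutex_lock(&mtx);
    released = 1;
    if (hold_mutex) {
        pthread_cond_broadcast(&cv);   /* broadcast with the mutex held */
        pthread_mutex_unlock(&mtx);
    } else {
        pthread_mutex_unlock(&mtx);
        pthread_cond_broadcast(&cv);   /* broadcast after releasing it */
    }

    for (int i = 0; i < n; i++)
        pthread_join(threads[i], NULL);
    return missed;  /* safe to read: every worker has been joined */
}
```

Calling `count_missed_wakeups(16, 0)` in a loop would approximate the unlocked case discussed here, and `count_missed_wakeups(16, 1)` the workaround. A nonzero return from either would be worth attaching to a bug report; I make no claim about what it returns on any particular OS.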
Hope that helps in any way?
Yes, thanks. Reasoning about threads requires code review.
 - Stoney
--
Stonewall Ballard
stoney@sb.org http://stoney.sb.org/