Hello,

First, I want to thank you for helping with this. How did you come to this answer so quickly? Is this a technique or tool that I can learn, or was this wisdom from having worked with your library for so long?

Next, I'd like to ensure that I fully understand what was occurring. Can you please confirm that I've got this right? I've read and thought about your answer, looked at the code, and I believe these are the details.

Main
Uses SQ.shutdown() to shut down the SQ thread:
  - it sets the SQ.m_shutdown flag
  - sends the SQ thread a boost thread interrupt
  - joins on the thread

SQ Thread
This thread is running under its thread_main() and __happens__ to be in the is_shutdown() method, after it checked the interruption_point() but BEFORE it checks the m_shutdown logic. SQ.m_shutdown was just changed to true, so is_shutdown() sees that and throws a boost::thread_interrupted exception. However, it should also be noted that the boost thread data has its internal interrupt-requested flag set too.

Now the SQ thread "throws" out of thread_main(), is caught by the base class's operator() method, and enters the SQ.thread_shutdown method.

SQ.thread_shutdown needs to shut down its lookup child thread, so it invokes the LU.shutdown method. The LU.shutdown method is just like above:
  - sets the LU.m_shutdown flag
  - sends the LU thread a boost thread interrupt
  - joins on the LU thread

However, join() is an interruption point, and I haven't "cleared the interrupt" for the boost thread yet. Therefore, when the SQ thread invokes join(), it checks whether there are any outstanding interrupts it needs to honor. There are, so it throws another thread interrupted exception, which exits the join() and which I don't catch; therefore the SQ thread exits prematurely due to my faulty logic.

Yes, I know, a wordy explanation for the brilliant summary you gave me. I just want to double-check that I've got the details down correctly.

One final word.
I feel ungrateful for bringing this up, and I'm still looking into it. However, I've encountered a new symptom.

I've added the boost::this_thread::disable_interruption object to the beginning of my thread_base_c::shutdown() function and I've removed all references to m_shutdown, as you recommended. However, I now occasionally get a deadlock during the shutdown. I call it a deadlock when the shutdown process stalls for longer than 5 minutes.

I've run the test multiple times and I see the deadlock on rare occasions. I've run seven different test cycles, with the deadlock occurring at a different iteration in each: 189, 398, 797, 999, 1282, 1527, 3416 (not in that order).

This could be an artifact of my simulation, and I'm just now starting to crawl through the gdb output. Nicely enough, I can connect to the running process and see its current running state.

Do you have any debugging insights or tips that I should apply for this investigation?

Thank you for all your help,

-=John

On 3/14/2012 3:23 PM, Anthony Williams wrote:
On 14/03/12 20:09, John Rocha wrote:
I have been able to extract the thread start/stop logic from our code base into a standalone program that illustrates the problem. Even this is still sort of long, 900 lines or so.
Thanks for the example. Your problem is that you are overlaying TWO interruption mechanisms --- boost::thread::interrupt() and your own m_shutdown flag.
thread::join() is an interruption point, so if your thread sees the m_shutdown flag before the boost::thread::interrupt(), then it will pick up the interrupt when it calls join() on its own worker threads.
I would suggest that you avoid the use of m_shutdown, since it is redundant. Also, wrap your calls to join() in a scope that holds a boost::this_thread::disable_interruption object, so that the join cannot be interrupted.
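The sequence described above can be sketched as a small single-threaded model. This is only an illustration and does not use Boost: `thread_state`, `interrupted`, `join_child`, `disable_interruption_guard` and `run_shutdown_sequence` are hypothetical stand-ins for the per-thread interruption state, boost::thread_interrupted, thread::join(), boost::this_thread::disable_interruption and the SQ shutdown path. It assumes the interrupt arrives just after the thread has acted on m_shutdown.

```cpp
#include <cassert>

// Hypothetical stand-in for boost::thread_interrupted.
struct interrupted {};

// Hypothetical stand-in for one thread's interruption state.
struct thread_state {
    bool interrupt_requested = false;   // set when the parent calls interrupt()
    bool interruption_enabled = true;   // false while a guard object is alive
};

// Models an interruption point: throws only if a request is pending and
// interruption is enabled, and clears the request when it throws.
inline void interruption_point(thread_state& self) {
    if (self.interruption_enabled && self.interrupt_requested) {
        self.interrupt_requested = false;
        throw interrupted{};
    }
}

// Models join() on a child thread, which is itself an interruption point.
inline void join_child(thread_state& self) { interruption_point(self); }

// Models the original check: the manual throw leaves any interrupt pending.
inline void check_for_shutdown(thread_state& self, bool m_shutdown) {
    interruption_point(self);
    if (m_shutdown) {
        throw interrupted{};
    }
}

// Models an RAII guard that disables interruption for its scope.
class disable_interruption_guard {
public:
    explicit disable_interruption_guard(thread_state& self)
        : self_(self), previous_(self.interruption_enabled) {
        self_.interruption_enabled = false;
    }
    ~disable_interruption_guard() { self_.interruption_enabled = previous_; }
    disable_interruption_guard(const disable_interruption_guard&) = delete;
    disable_interruption_guard& operator=(const disable_interruption_guard&) = delete;

private:
    thread_state& self_;
    bool previous_;
};

// Replays the shutdown path. Returns true if the join on the child completed,
// false if it was cut short by the second exception.
inline bool run_shutdown_sequence(bool use_guard) {
    thread_state sq;
    bool threw = false;
    try {
        check_for_shutdown(sq, /*m_shutdown=*/true);  // flag seen first
    } catch (const interrupted&) {
        threw = true;
    }
    assert(threw);
    sq.interrupt_requested = true;  // the parent's interrupt() lands afterwards

    try {
        if (use_guard) {
            disable_interruption_guard guard(sq);
            join_child(sq);
        } else {
            join_child(sq);
        }
        return true;
    } catch (const interrupted&) {
        return false;
    }
}
```

In this model, `run_shutdown_sequence(false)` returns false because the pending request fires inside the join, while `run_shutdown_sequence(true)` returns true. The guard does not clear the request; it only defers it until the guard goes out of scope, which is harmless on a path where the thread is about to exit anyway.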
void shutdown(const std::string &s_caller)
{
    EE_LOG_MSG(EE_TRACE, "%s shutting down %s", s_caller.c_str(), m_name.c_str());

    ptime start_time(microsec_clock::local_time());

    m_shutdown = true;
    m_thread.interrupt();
    m_thread.join();
inline void check_for_shutdown ()
{
    boost::this_thread::interruption_point();

    if (m_shutdown) {
        throw boost::thread_interrupted();
    }
}
Anthony