Re: [Boost-users] [Thread] Timed join returning true before thread terminated

15 Mar 2012

      On 3/15/2012 11:41 AM, Vicente J. Botet Escriba wrote:
...
Hi,
does this mean that there are some restrictions on the code that can be
executed in the boost::thread_interrupted catch handler, or some guidelines
that s/he must follow?
Is it legal for the user to throw a boost::thread_interrupted exception
themselves?
Is there a way to clear the interrupted flag so that the join() is not
interrupted in the boost::thread_interrupted catch handler?
What was wrong with the user code that made the library crash? Which
precondition of the library was violated? Maybe some assertions would help so
that the error is identified as soon as possible.
Best,
Vicente

Hello Vicente,

I'd like to point out that the library didn't crash. My code did. This was
caused by faulty logic on my part -- a race condition with a very small window
of opportunity.

I had a cascading startup and shutdown design, for example (please pardon my 
verbose answer):

Startup:
--------
    main spawns T1 and waits for T1 to indicate it is done with init

        T1 spawns T2 and waits for T2 to indicate it is done with init

            T2 spawns T3 and waits for T3 to indicate it is done with init

                T3 spawns T4 and waits for T4 to indicate it is done with init

                    T4 comes up, finishes its init, informs T3 its init is done
                    and enters its main loop.

                T3 finishes its init, signals T2 its init is done and enters its
                main loop.

            T2 finishes its init, signals T1 its init is done and enters its
            main loop.

        T1 finishes its init, signals main its init is done and enters its
        main loop.

     The system is now running, so main blocks until it receives a terminate
     signal (-2).

Shutdown:
---------
This is pretty much the reverse of the startup.

    signal -2 is received which unblocks main

    main tells T1 to shut down and then join()s on the thread, waiting for it
    to finish.

         T1 tells T2 to shut down and then join()s on the thread, waiting for
         it to finish.

             T2 tells T3 to shut down and then join()s on the thread, waiting
             for it to finish.

                 T3 tells T4 to shut down and then join()s on the thread,
                 waiting for it to finish.

                     T4 does its shutdown logic and then exits the thread

                 T3 wakes from join(), finishes its shutdown logic and exits
                 the thread

             T2 wakes from join(), finishes its shutdown logic and exits the
             thread

         T1 wakes from join(), finishes its shutdown logic and exits the
         thread

     main finishes its shutdown and the program exits. As part of that shutdown
     it invokes the destructor for its T1 object. Part of T1's destruction is to
     destruct the T2 object, which destructs T3, which destructs T4.

With this understanding of the overall architecture, we can focus on my faulty
shutdown logic.

main() told T1 to shut down, but it used two methods to inform T1. The first
was the Boost thread interrupt; the second was to set a shutdown flag in T1's
object. However, when main uses T1.interrupt(), that also sets a flag in the
Boost thread object.

At that moment T1 happened to be in its T1.check_for_shutdown() routine, after
the boost::this_thread::interruption_point() check but before the "if shutdown
flag" check.

      T1.check_for_shutdown()
          // this will throw if Boost's interruption-requested flag is set
          boost::this_thread::interruption_point();

          // **** the code is here when main() ran its T1.shutdown() ****

          if (T1.m_shutdown) {
              throw boost::thread_interrupted();
          }

So I detected the shutdown with MY logic m_shutdown, not with an interruption
point. Consequently the boost thread's interruption is still pending, and will
remain pending until another interruption point is hit.

In my code, the thread_interrupted is caught by a handler that is waiting for
it, and the handler then invokes the T1.thread_shutdown() routine, which for T1
means sending a Boost interrupt to T2 and then joining on T2.

BUT, join() is an interruption point, so my code would now act upon that
pending interruption, throwing ANOTHER thread_interrupted, exiting join(),
finishing off its shutdown logic and then exiting the thread.

So my code would terminate T1 before all of its child threads had terminated.

Now, main legally unblocks from join(), since T1 exited, and then it invokes
the destructor on T1, which cascades destruction through the T1-T4 objects.

THIS is what leads to my segmentation fault, "pure virtual called", and similar
errors: there are threads still alive, running code based on that object and
accessing data from that object, which was just deleted out from underneath
them.

Can the user throw a boost::thread_interrupted exception? To be honest, this
wasn't the problem. Throwing it doesn't affect the thread's "do I have an
interrupt pending" flag. I could have thrown my own custom exception and I
still would have encountered this problem. The problem is that my
"check_for_shutdown" logic assumed that when it exited no interrupts would be
pending.

Is there a way to clear the interrupted flag so that the join is not
interrupted? I would argue that clearing the flag isn't correct. One could add
logic such as:
       try {
           boost::this_thread::interruption_point();
       } catch (boost::thread_interrupted &)
       {
       }
       join();

This would clear the flag. However, while I am blocked in join(), some other
thread could send me another interrupt, which would break me out of join(). Not
what I wanted. I feel that Anthony's suggestion of blocking interrupts for this
method is the appropriate way to go.

I don't think any assertions could help, but maybe a documentation improvement?
Or maybe it's there and I didn't read carefully enough? After looking at the
boost codebase and this learning exchange, I learned that when the interrupt()
method is invoked the thread notes the interrupt and the interrupt is pending
until an interrupt point is hit. Moreover, only one interrupt can be pending at
a time. For example, two calls to interrupt() set the thread's interrupt flag
once. It's either on or off. So the next interruption point will clear that
flag. It's like the old style signals.

Regards,

-=John

John Rocha