Threads, condition variables, and signals
Hello,
I'm writing a multithreaded server on Unix using boost::thread. The
basic operation is fairly straightforward; a master thread starts up
worker threads, and coordinates various management tasks using
condition variables.
As with most Unix servers, the server is shut down with an OS signal
telling the server to shut down. The server catches this signal, then
simply calls exit(1). Any important cleanup happens in the
destructors of global variables, so it will happen properly no matter
how the server exits.
One object which is always cleaned up is the object for the master
thread. The master thread sets a flag, then does a notify_all() on
some condition variables to tell the other threads to shut down.
The problem I'm having is that if the master thread is waiting on a
condition variable when the signal arrives, and the destructor does a
notify_all() on that condition variable, then the destructor for the
boost::condition variable hangs.
For example, if I run the included code and hit CTRL-C after a short
bit, I get this output:
Waiting...
^C
Caught signal 2
Destroyed ThreadTest
Running destructor 1
[ ... hangs ... ]
Here is the program:
#include <iostream>
[ ... rest of the listing truncated ... ]
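(The listing above is cut off. What follows is a minimal sketch of the pattern being described - a global object whose destructor does a notify_all() while another thread is blocked in wait(), and a signal handler that calls exit(). It is illustrative only, not the original program; apart from ThreadTest, which appears in the output above, the names are made up.)

    #include <csignal>
    #include <cstdlib>
    #include <iostream>
    #include <boost/bind.hpp>
    #include <boost/thread/thread.hpp>
    #include <boost/thread/mutex.hpp>
    #include <boost/thread/condition.hpp>

    struct ThreadTest {
        boost::mutex mtx;
        boost::condition cond;
        bool done;

        ThreadTest() : done(false) {}

        ~ThreadTest() {
            // Runs during exit(): wake up anyone still waiting on the condition.
            std::cout << "Destroyed ThreadTest" << std::endl;
            done = true;
            cond.notify_all();
        }

        // Stands in for the master thread's wait loop.
        void run() {
            std::cout << "Waiting..." << std::endl;
            boost::mutex::scoped_lock lock(mtx);
            while (!done)
                cond.wait(lock);
        }
    };

    ThreadTest test;   // global, so its destructor runs when exit() is called

    extern "C" void on_signal(int sig) {
        // The problematic pattern under discussion: calling exit() from a
        // signal handler runs the global destructors in signal-handler context.
        std::cout << "Caught signal " << sig << std::endl;
        std::exit(1);
    }

    int main() {
        std::signal(SIGINT, on_signal);
        boost::thread t(boost::bind(&ThreadTest::run, &test));
        t.join();
        return 0;
    }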
On Wed, Oct 03, 2007 at 07:47:57PM -0400, Scott Gifford wrote:
As with most Unix servers, the server is shut down with an OS signal telling the server to shut down. The server catches this signal, then simply calls exit(1). Any important cleanup happens in the destructors of global variables, so it will happen properly no matter how the server exits.
This is not "properly"; I guessed your problem upfront before I had finished reading this very paragraph :) First, you need to block SIGINT in all threads except the main thread. If you don't have a main thread, make it just for the purpose of signal handling. Otherwise, SIGINT will be delivered to a random thread. Maybe that's OK in your case. Second, you don't need to lock the mutex before signaling the condition variable (this is what causes your deadlock). Third, how do you clean up threads which are NOT waiting at the condition variable at the moment of signal arrival? Fourth, your cleanup is NOT "proper" in any way because your cleanup executes in the context of signal handler. In that context (i.e. before the signal handler returns - which is never in your case), only async-signal safe functions may be used. Neither mutex locking nor condition signal is async signal safe. So how do you do it "properly." Hm: have another, 'main' thread just for the purposes of signal handling; have SIGINT unblocked in this thread and block it in all other threads. That thread does something like: while(!flag) sigsuspend(...); and SIGINT just sets the flag to true. When the while() loop exits, you're out of the signal context and then you may use functions such as pthread_cancel() to cancel all threads, or pthread_kill() to explicitly deliver signal to all threads and yes, also pthread_cond_signal() and pthread_cond_broadcast(). Then use pthread_join() to wait that all threads finish and _then_ call exit from the main thread. (If you use pthread_kill, you have the same caveat with async signal safe functions). I'm not sure that even the above scheme is 100% fool-proof, but it seems less broken than your current solution. Also, I don't see how you can use them to notify threads that also do some work outside of their monitor, i.e. if they are structured like: while(1) { // get mutex // wait on condition // do some work // release mutex // do some more work (*) } If you signal the variable while the thread is executing in (*), it will never pick up the signal. How do you do it in Boost.Thread - I don't know. The whole idea of using destructors to clean up global data in a multi-threaded program sounds like calling for trouble. While the destructors themselves may use locks, in how many threads is the destructor list walked and cleaned (the code generated by the compiler to walk over the list of global objects and call destructor for each) executing? Does your runtime library and/or compiler guarantee that every global destructor is executed exactly once even in a MT setting? (This sounds kinda the inverse of the threadsafe singleton pattern.)
Does anybody have any suggestions for a straightforward way to handle this properly?
Um, signals and threads don't mix well. No "straightforward" solution. Read about and understand async-signal safety.
Thanks for your help, Zeljko! A few questions below...
Zeljko Vrba writes:
On Wed, Oct 03, 2007 at 07:47:57PM -0400, Scott Gifford wrote:
As with most Unix servers, the server is shut down with an OS signal telling the server to shut down. The server catches this signal, then simply calls exit(1). Any important cleanup happens in the destructors of global variables, so it will happen properly no matter how the server exits.
This is not "properly"; I guessed your problem upfront before I had finished reading this very paragraph :)
First, you need to block SIGINT in all threads except the main thread. If you don't have a main thread, make it just for the purpose of signal handling. Otherwise, SIGINT will be delivered to a random thread. Maybe that's OK in your case.
It's not OK in my case, but according to pthread_signal(3), Linux threads will do the right thing accidentally (each thread has its own PID, so if I signal the PID the server had when it started, that will always be the first thread). Still, the point of using boost::thread is portability, so I should probably figure out something more robust. I don't see anything in boost::thread to handle thread signal masks. Does anybody know of a portable way to handle this, or will I just have to write it for pthreads and adapt to other environments as I port my application?
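For the pthreads route, the masking Zeljko describes boils down to something like this (a sketch, with an illustrative function name; threads inherit the signal mask of the thread that creates them):

    #include <signal.h>
    #include <pthread.h>

    // Sketch: call this in the main thread before any workers are created.
    // The workers inherit the blocked mask, so SIGINT is never delivered to
    // them; the thread that later waits for it (e.g. with sigwait() or
    // sigsuspend()) is the only one that sees it.
    void block_sigint_before_spawning_workers() {
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, SIGINT);
        pthread_sigmask(SIG_BLOCK, &set, NULL);
    }

As noted above, boost::thread doesn't appear to expose the signal mask, so this part stays pthreads-specific until it's ported.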
Second, you don't need to lock the mutex before signaling the condition variable (this is what causes your deadlock).
I was under the impression it was necessary to avoid deadlocks. For example, if I have code like this:

    // Global
    bool flag;
    boost::mutex flagMutex;
    boost::condition flagCond;

    // Thread 1
    {
        boost::mutex::scoped_lock lock(flagMutex);
        while (!flag)
            flagCond.wait(lock);
    }

    // Thread 2
    flag = true;
    flagCond.notify_all();

If flag is initially false, I could have a flow like this:

    Thread 1                                        Thread 2
    {
      boost::mutex::scoped_lock lock(flagMutex);
      while (!flag)
                                                    flag = true;
                                                    flagCond.notify_all();
          flagCond.wait(lock);

Basically: Thread 1 checks the condition before it is set, then Thread 2 notifies of a change before Thread 1 starts waiting, then Thread 1 starts waiting for a change, but it will never see one, so it hangs forever. If Thread 2 acquires the mutex before changing the flag, this race condition can't happen, because it cannot change flag between Thread 1's test and its wait. Is there some mechanism I'm not aware of that prevents this race from happening?
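For comparison, a sketch of the variant where Thread 2 holds the mutex while changing the flag (same illustrative names as above), which closes that window:

    // Thread 2, taking the mutex before changing the flag
    {
        boost::mutex::scoped_lock lock(flagMutex);
        flag = true;
    }
    // At this point Thread 1 is either already in wait() (and will be woken),
    // or it has not yet tested the flag (and will see it set before waiting).
    flagCond.notify_all();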
Third, how do you clean up threads which are NOT waiting at the condition variable at the moment of signal arrival?
They will notice the flag has changed when they are done with their work, and since their work is in fairly small units, they will always finish within a few seconds, which is fine.
Fourth, your cleanup is NOT "proper" in any way because your cleanup executes in the context of signal handler. In that context (i.e. before the signal handler returns - which is never in your case), only async-signal safe functions may be used. Neither mutex locking nor condition signal is async signal safe.
Ah, this is where I was confused. For some reason, I thought that exit(3) was safe to call from a signal handler, and it would leave the signal handling context and clean up safely, but I'm not sure where I got that idea.
So how do you do it "properly"? Hm: have another, 'main' thread just for the purposes of signal handling; have SIGINT unblocked in this thread and block it in all other threads. That thread does something like:
while(!flag) sigsuspend(...);
and SIGINT just sets the flag to true. When the while() loop exits, you're out of the signal context and then you may use functions such as pthread_cancel() to cancel all threads, or pthread_kill() to explicitly deliver signal to all threads and yes, also pthread_cond_signal() and pthread_cond_broadcast(). Then use pthread_join() to wait until all threads finish and _then_ call exit from the main thread. (If you use pthread_kill, you have the same caveat with async signal safe functions).
Thanks, I'll try that! [...]
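A minimal sketch of that scheme (pthreads, error handling omitted; all names here are illustrative):

    #include <signal.h>
    #include <pthread.h>

    // The handler only sets a flag - a sig_atomic_t store is async-signal-safe.
    static volatile sig_atomic_t got_sigint = 0;

    extern "C" void sigint_handler(int) {
        got_sigint = 1;
    }

    int main() {
        // Install the handler, then block SIGINT for normal execution.
        // Worker threads created after this point inherit the blocked mask
        // and never see SIGINT.
        struct sigaction sa = {};
        sa.sa_handler = sigint_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);

        sigset_t block_set, suspend_mask;
        sigemptyset(&block_set);
        sigaddset(&block_set, SIGINT);
        pthread_sigmask(SIG_BLOCK, &block_set, &suspend_mask); // old mask saved

        // ... create worker threads here; they inherit the blocked mask ...

        while (!got_sigint)
            sigsuspend(&suspend_mask);  // atomically unblocks SIGINT and sleeps

        // Out of signal context now: it is safe to set the shutdown flag under
        // a mutex, notify_all() the condition variables, join the workers
        // (pthread_join() or boost::thread::join()), and only then call exit().
        return 0;
    }

The important property is that everything after the while loop runs in ordinary thread context rather than inside a signal handler, so it is no longer restricted to async-signal-safe functions.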
Also, I don't see how you can use them to notify threads that also do some work outside of their monitor, i.e. if they are structured like:
    while (1) {
        // get mutex
        // wait on condition
        // do some work
        // release mutex
        // do some more work (*)
    }
If you signal the variable while the thread is executing in (*), it will never pick up the signal.
The work will always finish within a second or so, and it checks the condition before waiting on the condition variable, as in the example code above.
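For concreteness, a sketch of the worker-loop shape being described, where the shutdown flag is tested both before waiting and again after each unit of work (all names here - shutdownFlag, workQueue, process() - are illustrative, not from the original program):

    #include <queue>
    #include <boost/thread/mutex.hpp>
    #include <boost/thread/condition.hpp>

    // Illustrative shared state, protected by flagMutex.
    typedef int WorkItem;               // stand-in for the real work type
    std::queue<WorkItem> workQueue;
    bool shutdownFlag = false;
    boost::mutex flagMutex;
    boost::condition flagCond;

    void process(WorkItem /*item*/) { /* ... do one small unit of work ... */ }

    void worker_loop() {
        for (;;) {
            WorkItem item;
            {
                boost::mutex::scoped_lock lock(flagMutex);
                // The flag is re-checked here, so a notification sent while
                // this thread was busy in process() is not lost.
                while (!shutdownFlag && workQueue.empty())
                    flagCond.wait(lock);
                if (shutdownFlag)
                    return;             // noticed the flag; exit cleanly
                item = workQueue.front();
                workQueue.pop();
            }
            process(item);              // work done outside the lock (*)
        }
    }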
How do you do it in Boost.Thread - I don't know. The whole idea of using destructors to clean up global data in a multi-threaded program sounds like calling for trouble. While the destructors themselves may use locks, in how many threads is the destructor list walked and cleaned (the code generated by the compiler to walk over the list of global objects and call destructor for each) executing? Does your runtime library and/or compiler guarantee that every global destructor is executed exactly once even in a MT setting? (This sounds kinda the inverse of the threadsafe singleton pattern.)
I have no idea how I would go about looking for this guarantee, but it seems that if an environment doesn't provide it, it would be completely impossible to use global data reliably, making it too broken to use. I'm using g++ 4.1.2. Any pointers as to where to look for some sort of guarantee like this?
Does anybody have any suggestions for a straightforward way to handle this properly?
Um, signals and threads don't mix well. No "straightforward" solution. Read about and understand async-signal safety.
Blech, that's too bad. I should have known this would not be easy. ----Scott.
On Mon, Oct 08, 2007 at 02:27:53PM -0400, Scott Gifford wrote:
It's not OK in my case, but according to pthread_signal(3), Linux threads will do the right thing accidentally (each thread has its own PID, so if I signal the PID the server had when it started, that will always be the first thread). Still, the point of using boost::thread is portability, so I should probably figure out something more robust.
The NPTL implementation implements signals differently. When you send a signal, either with kill(2) or ^C on the controlling terminal (which in the end again uses kill(2); pthread_kill(2) can't be used externally to the process), SIGINT is delivered to the process _as a whole_. In this case, as POSIX prescribes, an arbitrary thread that does not block the signal will be picked to handle it.

As for portability - what platforms do you care about? If it's only POSIX and Win32, and you don't need Win32 GUI - I suggest that you look into Cygwin or Microsoft's SFU (Services for UNIX), which is also free (of charge). Personally I prefer the latter and, unless I'm mistaken, it has received UNIX certification. Yes, according to http://en.wikipedia.org/wiki/POSIX an NT kernel + SFU is fully POSIX compliant. And you get gcc in the package too :)

[No, I'm *not* a MS advocate. But I do recognize and recommend a quality solution when I see one. And I *do* have a good opinion of SFU.]
I was under the impression it was necessary to avoid deadlocks.
Not deadlocks, but lost signals (i.e. race conditions).
Basically: Thread 1 checks the condition before it is set, then Thread 2 notifies of a change before Thread 1 starts waiting, then Thread 1 starts waiting for a change, but it will never see one, so it hangs forever.
Indeed, but that's a "lost signal", not a deadlock. Deadlock is a completely different situation: either circular waiting on a chain of locks or a thread attempting to lock a mutex that it _itself_ already has locked (as happens in your case; but this is just a special case of circular waiting).
Is there some mechanism I'm not aware of that prevents this race from happening?
No. Hmm, it'd be best to avoid using shared data. Make a message queue and send work to your worker threads as messages in the queue. If there are no messages in the queue, a thread trying to read from it will just sleep. When it is time to quit the application, just send N messages to the queue (where N is the number of worker threads) and wait for them to finish. [Thus, you can use the message queue as a condition variable with "memory" - no signal (= message) ever gets lost.]

Plus, message queues are implicit synchronization points, so you don't need any additional mutexes and CVs. I believe that POSIX MQs also support priority, so your "quit" messages could arrive earlier than any other "normal" messages. You might want to look into Boost.Interprocess for portable MQs (I haven't personally used it, so I have no idea what features it supports).
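For illustration, a minimal in-process version of the idea - a queue protected by a mutex and a condition variable, so a "quit" message works like a notification that cannot be lost. This is just a sketch with made-up names, not the Boost.Interprocess or POSIX MQ API:

    #include <queue>
    #include <boost/thread/mutex.hpp>
    #include <boost/thread/condition.hpp>

    // A tiny blocking queue: the "signal" is a queued message, so it has
    // "memory" - a message pushed while no one is waiting is still there
    // when a worker next calls pop().
    template <typename T>
    class MessageQueue {
    public:
        void push(const T& msg) {
            boost::mutex::scoped_lock lock(mtx_);
            queue_.push(msg);
            cond_.notify_one();
        }

        T pop() {                       // blocks until a message is available
            boost::mutex::scoped_lock lock(mtx_);
            while (queue_.empty())
                cond_.wait(lock);
            T msg = queue_.front();
            queue_.pop();
            return msg;
        }

    private:
        std::queue<T> queue_;
        boost::mutex mtx_;
        boost::condition cond_;
    };

    // Shutdown then becomes: push one "quit" message per worker and join them,
    // e.g. for (int i = 0; i < numWorkers; ++i) queue.push(quitMessage);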
[...] destructor for each) executing? Does your runtime library and/or compiler guarantee that every global destructor is executed exactly once even in a MT setting? (This sounds kinda the inverse of the threadsafe singleton pattern.)
I have no idea how I would go about looking for this guarantee, but it seems that if an environment doesn't provide it, it would be completely impossible to use global data reliably, making it too broken to use. I'm using g++ 4.1.2. Any pointers as to where to look for some sort of guarantee like this?
The best place would be the gcc mailing list.
participants (2)
- Scott Gifford
- Zeljko Vrba