[thread] condition suffering from lost wakeup bug?

While reviewing the condition variable implementation I came across a case where a "lost wakeup" seems to take place. The example is admittedly artificial, but I think it is still correct code. (I hope someone can prove me wrong.) Could this be an indication that the current implementation is not using Alexander Terekhov's algorithm? If this really is the case we should definitely change our code. On the other hand, I constructed the test case by carefully studying Alexander's algorithm, in order to better understand the design. I did the tests on W2K with VC7.1.

My questions:
1) Is this correct code with respect to condition variable semantics (POSIX thread semantics)?
2) If 1) applies: Can someone reproduce this on another platform too?

Thank you, Roland

#include <iostream>
#include <boost/thread.hpp>

using namespace boost;

mutex guard;
typedef mutex::scoped_lock lock_type;
condition cond;

long global_count = 0;
long count_for_test = 10000;
long global_number_of_threads = 0;
long number_of_threads_for_test = 1000;
bool global_flag = true;

void do_work()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_one();
    while (global_count < count_for_test)
        cond.wait(lock);
}

void do_flag()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_one();
    while (global_count < count_for_test) {
        cond.wait(lock);
        global_flag = false;
        // Step 3: The following signal occasionally
        // is being lost. See below.
        cond.notify_one();
    }
}

int main()
{
    thread_group pool;
    for (long i = 0; i < number_of_threads_for_test; ++i) {
        pool.create_thread(do_work);
    }
    pool.create_thread(do_flag);

    {
        // we wait until all threads are up
        lock_type lock(guard);
        while (global_number_of_threads < number_of_threads_for_test + 1)
            cond.wait(lock);
    }

    // the loop will run for some time and
    // then suddenly deadlock
    for (long j = 0; j < count_for_test; ++j) {
        lock_type lock(guard);
        ++global_count;
        // Step 1: We do a heavy notification
        cond.notify_all();
        // Step 2: Then we wait on the same condition.
        // Since we hold the lock we can be sure
        // all threads are waiting at this point.
        // Entering the below wait should
        // atomically wait and release the lock.
        // In turn I expect all threads being woken up,
        // except the main thread since the wait
        // occurs after the notify_all().
        // The do_flag thread will signal after
        // we have entered the below wait.
        // However the signal occasionally
        // is being lost. Is this a kind of
        // "lost wakeup"?
        while (global_flag)
            cond.wait(lock);  // deadlocks
        // When you break in with the debugger
        // as the deadlock occurs, you can see
        // that the flag indeed is false, which is
        // proof that the signal has been sent.
        global_flag = true;
    }

    pool.join_all();

    // result check
    std::cout << "We should get here without a deadlock\n";
    return 0;
}

On 15/09/05, Roland Schwarz <roland.schwarz@chello.at> wrote:
1) Is this correct code with respect to condition variable semantics (POSIX thread semantics)?
Haven't had a chance to scan through your code yet, just thought I'd run it and let you know first.
2) If 1) applies: Can someone reproduce this on another platform too?
Linux POSIX deadlocks waiting too (gcc 4.01, SuSE 9.3 x86_64, Boost 1.33). matt.

On 15/09/05, Roland Schwarz <roland.schwarz@chello.at> wrote:
Thank you, Roland
#include <iostream>
#include <boost/thread.hpp>
using namespace boost;
mutex guard;
typedef mutex::scoped_lock lock_type;
condition cond;
long global_count = 0;
long count_for_test = 10000;
long global_number_of_threads = 0;
long number_of_threads_for_test = 1000;
bool global_flag = true;
void do_work()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_one();
    while (global_count < count_for_test)
        cond.wait(lock);
}
void do_flag()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_one();
    while (global_count < count_for_test) {
        cond.wait(lock);
        global_flag = false;
        // Step 3: The following signal occasionally
        // is being lost. See below.
        cond.notify_one();
    }
}
int main() {
thread_group pool;
for (long i = 0; i < number_of_threads_for_test; ++i) {
    pool.create_thread(do_work);
}
pool.create_thread(do_flag);
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
I can't see why you get here? I might be missing something but you're likely to end up with all threads waiting on conditions here... All the do_work threads are waiting, the do_flag is waiting. Then if all the threads aren't up and running we, the main thread, wait on the condition too, but we might not be the one that consumes a notify_one as the other threads start, so we never progress. I could have the logic arse about though... matt.

Matt Hurd wrote:
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
I can't see why you get here? I might be missing something but you're likely to end up with all threads waiting on conditions here...
This is intentional, since I need to create a "thundering herd" here.

for (long j = 0; j < count_for_test; ++j) {
    lock_type lock(guard);
    ++global_count;
    // Step 1: We do a heavy notification
    cond.notify_all();
    // Step 2: Then we wait on the same condition.
    // Since we hold the lock we can be sure
    ...

The notify_all unblocks them, and every thread should get a chance to run. They are going to sleep again soon, but that is not important. What matters is the relative sequence:

t1: Main signals
t2: Main waits
t3: do_flag signals, and this signal occasionally is being lost.

Some background: When the unblocking starts, the condition internally closes a gate. While the gate is closed, new waiters are blocked off but not counted yet. Unfortunately, if another signal arrives now, that signal is not counted either, and when the gate is opened again the threads that queued up in front of the gate will come in and get counted. They cannot be unblocked by the signals that arrived while the gate was closed, since they had not been counted at that point. The time the gate is closed is usually short, but not zero. If, during this time, the sequence "wait, signal" occurs, the signal is lost. My thundering herd is just a simple means of keeping the gate closed long enough to make this event more probable. But it can in principle also happen during normal operation; the probability is just very low. The do_flag() thread is in the same pool and gets scheduled sooner or later. On occasion it is indeed scheduled soon enough to fall into the gate, and consequently deadlocks.

Regards, Roland

On 15/09/05, Roland Schwarz <roland.schwarz@chello.at> wrote:
Matt Hurd wrote:
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
I can't see why you get here? I might be missing something but you're likely to end up with all threads waiting on conditions here...
This is intentional, since I need to create a "thundering herd" here.
Sorry, I wasn't clear. I didn't get to looking at the thundering herd bit in detail because, as far as I can tell (and my debugger agrees), your main thread is stuck waiting before this loop. That is, I don't get out of this loop:
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
Which is what I was trying to explain but did a rather poor job. matt.

Matt Hurd wrote:
Sorry, I wasn't clear. I didn't get to looking at the thundering herd bit in detail because, as far as I can tell (and my debugger agrees), your main thread is stuck waiting before this loop.
That is, I don't get out of this loop:
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
Are you using Linux or Windows? Are you on the MAIN trunk or thread_rewrite? Roland

Roland Schwarz wrote:
Matt Hurd wrote:
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
I can't see why you get here? I might be missing something but you're likely to end up with all threads waiting on conditions here...
Oh so sorry, I did not answer your question, but instead just started to answer another one which seemed more interesting to me ;-)
The upstarting threads increment global_number_of_threads as their first statement. Your code snippet simply waits until all increments are done, at which point the number is equal to the number of threads that have been put into the pool. The shown code snippet tests this predicate. If you fire this up and set a breakpoint just at the start of the following loop, you can see that main indeed will stop at this breakpoint. Regards, Roland

On 15/09/05, Roland Schwarz <roland.schwarz@chello.at> wrote:
If you fire this up and set a breakpoint just at the start of the following loop, you can see that main indeed will stop at this breakpoint.
No it doesn't, is what I'm saying. Main waits forever in this loop as it doesn't get a notification after the predicate is satisfied. When I add the std::cout below

{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
std::cout << "Do I even get here?\n";

I don't get there. matt.

Matt Hurd wrote:
std::cout << "Do I even get here?\n";
I don't get there.
Oops. Now I see. Sorry, I just was not careful enough about the startup synchronization. It just happened to work for me, and I simply was confirming this fact by stopping there with the debugger. However, the remedy should be easy. Just replace:

++global_number_of_threads;
cond.notify_one();

by:

++global_number_of_threads;
cond.notify_all();

Sorry, my fault. Roland
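P.S.: For completeness, do_work() with that replacement applied looks as follows (a sketch only, reusing the globals from the test program; do_flag() gets the same change). This addresses the startup hang Matt ran into, not the later deadlock discussed below.

// do_work() with the startup notification broadcast instead of signalled,
// so the main thread waiting on global_number_of_threads is guaranteed
// to be among the woken waiters.
void do_work()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_all();   // was cond.notify_one()
    while (global_count < count_for_test)
        cond.wait(lock);
}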

Roland Schwarz <roland.schwarz@chello.at> writes:
1) Is this correct code with respect to condition variable semantics (POSIX thread semantics)?
No.

do_flag only signals one thread waiting on the condition. If this is one of the do_work threads, then it will wake up, go back to sleep, and the notification has been swallowed. Since the main thread didn't wake up, it won't break out of the loop checking global_flag. Since the only threads that do notification are the main thread and the do_flag thread, once all the threads have entered their waiting loop, if the do_flag notification wakes anyone other than the main thread, we have deadlock.

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk
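A conventional way around this, not proposed in the thread itself, is to broadcast whenever waiters that wait on different predicates share a single condition variable. A sketch of do_flag() along those lines, reusing the globals from the test program:

// do_flag() broadcasting instead of signalling, so that after global_flag
// is cleared the main thread is guaranteed to be among the woken waiters
// and can leave its while (global_flag) loop; the do_work threads simply
// re-check their predicate and go back to sleep.
void do_flag()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_all();
    while (global_count < count_for_test) {
        cond.wait(lock);
        global_flag = false;
        cond.notify_all();   // was cond.notify_one(), which a do_work thread could swallow
    }
}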

Anthony Williams wrote:
Roland Schwarz <roland.schwarz@chello.at> writes:
1) Is this correct code with respect to condition variable semantics (POSIX thread semantics)?
No.
do_flag only signals one thread waiting on the condition. If this is one of the do_work threads, then it will wake up, go back to sleep, and the notification has been swallowed. Since the main thread didn't wake up, it won't break out of the loop checking global_flag. Since the only threads that do notification are the main thread and the do_flag thread, once all the threads have entered their waiting loop, if the do_flag notification wakes anyone other than the main thread, we have deadlock.
Oops, how obvious. Shame on me /:~/ It seems that I cannot easily get a test case that would demonstrate my suspicion. I considerably underestimated the difficulty of making sure all threads are currently waiting on the same condition, and then firing them all together. Still, I would very much appreciate it if someone could tell me why it is not a problem when the sequence "wait then signal" occurs while the gate is closed. It might very well be that this indeed _is_ no problem, I just can't see why. Roland

Roland Schwarz wrote: [...]
Still, I would very much appreciate it if someone could tell me why it is not a problem when the sequence "wait then signal" occurs while the gate is closed. It might very well be that this indeed _is_ no problem, I just can't see why.
Because incoming waiter holds external mutex while waiting on gate. regards, alexander.

Alexander Terekhov wrote:
Roland Schwarz wrote: [...]
Still, I would very much appreciate it if someone could tell me why it is not a problem when the sequence "wait then signal" occurs while the gate is closed. It might very well be that this indeed _is_ no problem, I just can't see why.
Because incoming waiter holds external mutex while waiting on gate.
This means that there can be at most one thread blocked on the gate, correct? But this still doesn't get me going. I am considering the following sequence of events. Given that sem_wait(semBlockQueue) currently is waiting with nWaitersBlocked == N, and the gate is open:

1) A signal-all takes place, closing the gate: nWaitersBlocked = 0, nWaitersToUnblock = N.

2) sem_wait gets woken and starts processing, needing some time to proceed while the gate is closed.

3) A new waiter arrives while the gate is still closed; it will block on the gate and not increment nWaitersBlocked yet.

4) A signal arrives while the gate is still closed. Now 0 < nWaitersToUnblock < N, correct? But nWaitersBlocked == 0, so the signal results in a no-op, and consequently the waiter from 3) will never get to see it - hence it is lost??

I am studying the pthread_cond_wait.c pseudo code taken from SNAPSHOT 2004-05-16. What do I miss here? I see this is not going to happen if the thread that signals is holding the external mutex when trying to signal. But this isn't a requirement, is it?

Thank you for your kind help. Roland
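P.S.: To make the window concrete, here is a toy, single-threaded walk-through of the bookkeeping in steps 1) to 4) - an illustration only with simplified state, not the actual pthreads-win32 code; the counter names follow the pseudo code mentioned above.

#include <iostream>

int  nWaitersBlocked   = 0;     // waiters registered behind the gate
int  nWaitersToUnblock = 0;     // waiters a broadcast has promised to release
bool gateOpen          = true;

int main()
{
    nWaitersBlocked = 3;                     // N threads already waiting
    // 1) signal-all: close the gate and move the count over
    gateOpen          = false;
    nWaitersToUnblock = nWaitersBlocked;     // = N
    nWaitersBlocked   = 0;
    // 2) the released waiters are still draining; the gate stays closed
    // 3) a new waiter arrives, parks in front of the closed gate,
    //    and has NOT incremented nWaitersBlocked yet
    // 4) a signal arrives while the gate is closed:
    if (nWaitersBlocked > 0)                 // it sees no registered waiter ...
        ++nWaitersToUnblock;                 // ... so this line is never reached
    gateOpen = true;                         // unblocking done, gate reopens
    ++nWaitersBlocked;                       // the parked waiter registers and waits
    std::cout << "toUnblock=" << nWaitersToUnblock
              << ", blocked=" << nWaitersBlocked << "\n";
    // prints toUnblock=3, blocked=1: the signal left no trace, so the late
    // waiter could only be released by a future notification.
    return 0;
}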

Roland Schwarz wrote:
4) A signal arrives while the gate is still closed. Now 0 < nWaitersToUnblock < N, correct? But nWaitersBlocked == 0, so the signal results in a no-op, and consequently the waiter from 3) will never get to see it - hence it is lost??
Replacing the sequence on entry to wait:

sem_wait(semBlockLock);
nWaitersBlocked++;
sem_post(semBlockLock);

by:

atomic_increment(nWaitersBlocked);
sem_wait(semBlockLock);
sem_post(semBlockLock);

would avoid losing this signal, since the signal(bAll) function would then increment nWaitersToUnblock. Roland

Roland Schwarz wrote: [...]
3) A new waiter arrives while the gate is still closed; it will block on the gate and not increment nWaitersBlocked yet.
4) A signal arrives while the gate is still closed. Now 0 < nWaitersToUnblock < N, correct? But nWaitersBlocked == 0, so the signal results in a no-op, and consequently the waiter from 3) will never get to see it - hence it is lost??
I am studying the pthread_cond_wait.c pseudo code taken from SNAPSHOT 2004-05-16.
What do I miss here?
http://www.opengroup.org/onlinepubs/007908799/xsh/pthread_cond_wait.html regards, alexander.

Alexander Terekhov wrote:
Roland Schwarz wrote:
What do I miss here?
http://www.opengroup.org/onlinepubs/007908799/xsh/pthread_cond_wait.html
I.e. being able to get hold of the mutex is what establishes the interrelationship between the threads. Sending a signal alone cannot be related reliably to other threads, even when it comes before the wait as viewed in "real time". (The only state change it could be signalling is: nothing changed.) Thank you very much, now everything makes sense again :-) Regards, Roland
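In code form, the rule this amounts to (a minimal sketch, not taken from the thread, using the same Boost.Thread primitives as the test program): the notifying thread changes the shared predicate and notifies while holding the mutex the waiter uses, so the notification is always tied to a protected state change that the waiter re-checks.

#include <boost/thread.hpp>

boost::mutex m;
boost::condition cv;
bool ready = false;              // the predicate, protected by m

void producer()
{
    boost::mutex::scoped_lock lock(m);
    ready = true;                // state change under the mutex ...
    cv.notify_one();             // ... then the signal, so the ordering is well defined
}

void consumer()
{
    boost::mutex::scoped_lock lock(m);
    while (!ready)               // predicate loop also guards against spurious wakeups
        cv.wait(lock);
}

int main()
{
    boost::thread t1(consumer);
    boost::thread t2(producer);
    t1.join();
    t2.join();
    return 0;
}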

I have found a test suite for various POSIX APIs, including the threading APIs at http://posixtest.sourceforge.net --- maybe some of the tests can be adapted to the boost threads API.

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk

I have found a test suite for various POSIX APIs, including the threading APIs at http://posixtest.sourceforge.net --- maybe some of the tests can be adapted to the boost threads API.
We certainly need better tests, the tests from pthread-win32 could probably be rewritten to use Boost.Thread primitives as well. Basically we want to beat this to death :-) John.

On 16/09/05, John Maddock <john@johnmaddock.co.uk> wrote:
I have found a test suite for various POSIX APIs, including the threading APIs at http://posixtest.sourceforge.net --- maybe some of the tests can be adapted to the boost threads API.
We certainly need better tests, the tests from pthread-win32 could probably be rewritten to use Boost.Thread primitives as well. Basically we want to beat this to death :-)
I'm overseas on business for the next week and a bit. Another week after I get back we should have a basic framework and some tests in place, I hope. With the testing I find it important to demonstrate that unsafe code fails where that is possible, so that the validity of the concurrency tests is strong for the protected versions. matt.

Roland Schwarz wrote: [...]
Could this be an indication that the current implementation
It's basically the same as pthreads-win32 except that it has a busted destructor that doesn't lock the gate sema to synchronize destruction with respect to exiting waiters in order to conform to POSIX. regards, alexander.
participants (5): Alexander Terekhov, Anthony Williams, John Maddock, Matt Hurd, Roland Schwarz