[thread] condition suffering from lost wakeup bug?

While reviewing the condition variable implementation I came across a case where a "lost wakeup" seems to take place. The example is admittedly artificial, but I think it is still correct code. (I hope someone can prove me wrong.) Could this be an indication that the current implementation is not using Alexander Terekhov's algorithm? If this really is the case we should definitely change our code. On the other hand, I constructed the test case by carefully studying Alexander's algorithm, in order to better understand the design. I did the tests on W2K with VC7.1.

My questions:
1) Is this correct code with respect to condition variable semantics (POSIX thread semantics)?
2) If 1) applies: Can someone reproduce this on another platform too?

Thank you, Roland

#include <iostream>
#include <boost/thread.hpp>

using namespace boost;

mutex guard;
typedef mutex::scoped_lock lock_type;
condition cond;

long global_count = 0;
long count_for_test = 10000;
long global_number_of_threads = 0;
long number_of_threads_for_test = 1000;
bool global_flag = true;

void do_work()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_one();
    while (global_count < count_for_test)
        cond.wait(lock);
}

void do_flag()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_one();
    while (global_count < count_for_test) {
        cond.wait(lock);
        global_flag = false;
        // Step 3: The following signal occasionally
        // is being lost. See below.
        cond.notify_one();
    }
}

int main()
{
    thread_group pool;
    for (long i = 0; i < number_of_threads_for_test; ++i) {
        pool.create_thread(do_work);
    }
    pool.create_thread(do_flag);

    {
        // we wait until all threads are up
        lock_type lock(guard);
        while (global_number_of_threads < number_of_threads_for_test + 1)
            cond.wait(lock);
    }

    // the loop will run for some time and
    // then suddenly deadlock
    for (long j = 0; j < count_for_test; ++j) {
        lock_type lock(guard);
        ++global_count;
        // Step 1: We do a heavy notification
        cond.notify_all();
        // Step 2: Then we wait on the same condition.
        // Since we hold the lock we can be sure
        // all threads are waiting at this point.
        // Entering the below wait should
        // atomically wait and release the lock.
        // In turn I expect all threads being woken up,
        // except the main thread since the wait
        // occurs after the notify_all().
        // The do_flag thread will signal after
        // we have entered the below wait.
        // However the signal occasionally
        // is being lost. Is this a kind of
        // "lost wakeup"?
        while (global_flag)
            cond.wait(lock);  // deadlocks
        // When you break in with the debugger
        // as the deadlock occurs, you can see
        // that the flag indeed is false, which is
        // proof that the signal has been sent.
        global_flag = true;
    }

    pool.join_all();

    // result check
    std::cout << "We should get here without a deadlock\n";
    return 0;
}

On 15/09/05, Roland Schwarz <roland.schwarz@chello.at> wrote:
1) Is this correct code with respect to condition variable semantics (POSIX thread semantics)?
Haven't had a chance to scan through your code yet, just thought I'd run it and let you know first.
2) If 1) applies: Can someone reproduce this on another platform too?
Linux POSIX deadlocks waiting too (gcc 4.01, SuSE 9.3 x86_64, Boost 1.33). matt.

On 15/09/05, Roland Schwarz <roland.schwarz@chello.at> wrote:
Thank you, Roland
#include <iostream>
#include <boost/thread.hpp>
using namespace boost;
mutex guard;
typedef mutex::scoped_lock lock_type;
condition cond;
long global_count = 0;
long count_for_test = 10000;
long global_number_of_threads = 0;
long number_of_threads_for_test = 1000;
bool global_flag = true;
void do_work()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_one();
    while (global_count < count_for_test)
        cond.wait(lock);
}
void do_flag()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_one();
    while (global_count < count_for_test) {
        cond.wait(lock);
        global_flag = false;
        // Step 3: The following signal occasionally
        // is being lost. See below.
        cond.notify_one();
    }
}
int main() {
thread_group pool;
for (long i = 0; i < number_of_threads_for_test; ++i) {
    pool.create_thread(do_work);
}
pool.create_thread(do_flag);
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
I can't see why you get here? I might be missing something but you're likely to end up with all threads waiting on conditions here... All the do_work threads are waiting, the do_flag is waiting. Then if all the threads aren't up and running we, the main thread, wait on the condition too, but we might not be the one that consumes a notify_one as the other threads start, so we never progress. I could have the logic arse about though... matt.

Matt Hurd wrote:
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
I can't see why you get here? I might be missing something but you're likely to end up with all threads waiting on conditions here...
This is intentional, since I need to create a "thundering herd" here.

for (long j = 0; j < count_for_test; ++j) {
    lock_type lock(guard);
    ++global_count;
    // Step 1: We do a heavy notification
    cond.notify_all();
    // Step 2: Then we wait on the same condition.
    // Since we hold the lock we can be sure
    ...

The notify_all unblocks them, and every thread should get a chance to run. They are going to sleep again soon, but that is not important. What matters is the relative sequence:

t1: Main signals
t2: Main waits
t3: do_flag signals, and this signal occasionally is being lost.

Some background: When the unblocking starts, the condition internally closes a gate. While the gate is closed, new waiters are blocked off but not counted yet. Unfortunately, if another signal arrives now, that signal is not counted either, and when the gate is opened again the threads that queued up in front of the gate will come in and get counted. They cannot be unblocked by the signals that arrived while the gate was closed, since they had not been counted at that point. The time the gate is closed is usually short, but not zero. If, during this time, the sequence "wait, signal" occurs, the signal is lost. My thundering herd is just a simple means of keeping the gate closed long enough to make this event more probable. But it can in principle also happen during normal operation; the probability is just very low. The do_flag() thread is in the same pool and gets scheduled sooner or later. On occasion it is indeed scheduled soon enough to fall into the gate, and consequently deadlocks.

Regards, Roland

On 15/09/05, Roland Schwarz <roland.schwarz@chello.at> wrote:
Matt Hurd wrote:
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
I can't see why you get here? I might be missing something but you're likely to end up with all threads waiting on conditions here...
This is intentional, since I need to create a "thundering herd" here.
Sorry, I wasn't clear. I didn't get to looking at the thundering herd bit in detail because, as far as I can tell (and my debugger agrees), your main thread is stuck waiting before this loop. That is, I don't get out of this loop:
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
Which is what I was trying to explain but did a rather poor job. matt.

Matt Hurd wrote:
Sorry, I wasn't clear. I didn't get to looking at the thundering herd bit in detail because, as far as I can tell (and my debugger agrees), your main thread is stuck waiting before this loop.
That is, I don't get out of this loop:
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
Are you using Linux or Windows? Are you on the MAIN trunk or thread_rewrite? Roland

Roland Schwarz wrote:
Matt Hurd wrote:
{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
I can't see why you get here? I might be missing something but you're likely to end up with all threads waiting on conditions here...
Oh so sorry, I did not answer your question, but instead just started to answer another one which seemed more interesting to me ;-)
The upstarting threads increment global_number_of_threads as their first statement. Your code snippet simply waits until all increments are done, at which point the number is equal to the number of threads that have been put into the pool. The shown code snippet tests this predicate. If you fire this up and set a breakpoint just at the start of the following loop, you can see that main indeed will stop at this breakpoint. Regards, Roland

On 15/09/05, Roland Schwarz <roland.schwarz@chello.at> wrote:
If you fire this up and set a breakpoint just at the start of the following loop, you can see that main indeed will stop at this breakpoint.
No it doesn't, is what I'm saying. Main waits forever in this loop as it doesn't get a notification after the predicate is satisfied. When I add the std::cout below

{
    // we wait until all threads are up
    lock_type lock(guard);
    while (global_number_of_threads < number_of_threads_for_test + 1)
        cond.wait(lock);
}
std::cout << "Do I even get here?\n";

I don't get there. matt.

Matt Hurd wrote:
std::cout << "Do I even get here?\n";
I don't get there.
Oops. Now I see. Sorry, I just was not careful enough about the startup synchronization. It just happened to work for me, and I simply was confirming this fact by stopping there with the debugger. However, the remedy should be easy. Just replace:

++global_number_of_threads;
cond.notify_one();

by:

++global_number_of_threads;
cond.notify_all();

Sorry, my fault. Roland
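P.S.: For completeness, do_work() with that replacement applied looks as follows (a sketch only, reusing the globals from the test program; do_flag() gets the same change). This addresses the startup hang Matt ran into, not the later deadlock discussed below.

// do_work() with the startup notification broadcast instead of signalled,
// so the main thread waiting on global_number_of_threads is guaranteed
// to be among the woken waiters.
void do_work()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_all();   // was cond.notify_one()
    while (global_count < count_for_test)
        cond.wait(lock);
}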

Roland Schwarz <roland.schwarz@chello.at> writes:
1) Is this correct code with respect to condition variable semantics (POSIX thread semantics)?
No.

do_flag only signals one thread waiting on the condition. If this is one of the do_work threads, then it will wake up, go back to sleep, and the notification has been swallowed. Since the main thread didn't wake up, it won't break out of the loop checking global_flag. Since the only threads that do notification are the main thread and the do_flag thread, once all the threads have entered their waiting loop, if the do_flag notification wakes anyone other than the main thread, we have deadlock.

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk
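A conventional way around this, not proposed in the thread itself, is to broadcast whenever waiters that wait on different predicates share a single condition variable. A sketch of do_flag() along those lines, reusing the globals from the test program:

// do_flag() broadcasting instead of signalling, so that after global_flag
// is cleared the main thread is guaranteed to be among the woken waiters
// and can leave its while (global_flag) loop; the do_work threads simply
// re-check their predicate and go back to sleep.
void do_flag()
{
    lock_type lock(guard);
    ++global_number_of_threads;
    cond.notify_all();
    while (global_count < count_for_test) {
        cond.wait(lock);
        global_flag = false;
        cond.notify_all();   // was cond.notify_one(), which a do_work thread could swallow
    }
}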

Anthony Williams wrote:
Roland Schwarz <roland.schwarz@chello.at> writes:
1) Is this correct code with respect to condition variable semantics (POSIX thread semantics)?
No.
do_flag only signals one thread waiting on the condition. If this is one of the do_work threads, then it will wake up, go back to sleep, and the notification has been swallowed. Since the main thread didn't wake up, it won't break out of the loop checking global_flag. Since the only threads that do notification are the main thread and the do_flag thread, once all the threads have entered their waiting loop, if the do_flag notification wakes anyone other than the main thread, we have deadlock.
Oops, how obvious. Shame on me /:~/ It seems that I cannot easily get a test case that would demonstrate my suspicion. I considerably underestimated the difficulty of making sure all threads are currently waiting on the same condition, and then firing them all together. Still, I would very much appreciate it if someone could tell me why it is not a problem when the sequence "wait then signal" occurs while the gate is closed. It might very well be that this indeed _is_ no problem, I just can't see why. Roland

Roland Schwarz wrote: [...]
Still, I would very much appreciate it if someone could tell me why it is not a problem when the sequence "wait then signal" occurs while the gate is closed. It might very well be that this indeed _is_ no problem, I just can't see why.
Because incoming waiter holds external mutex while waiting on gate. regards, alexander.

Alexander Terekhov wrote:
Roland Schwarz wrote: [...]
Still, I would very much appreciate it if someone could tell me why it is not a problem when the sequence "wait then signal" occurs while the gate is closed. It might very well be that this indeed _is_ no problem, I just can't see why.
Because incoming waiter holds external mutex while waiting on gate.
This means that there can be at most one thread blocked on the gate, correct? But this still doesn't get me going. I am considering the following sequence of events. Given that sem_wait(semBlockQueue) currently is waiting with nWaitersBlocked == N, and the gate is open:

1) A signal-all takes place, closing the gate: nWaitersBlocked = 0, nWaitersToUnblock = N.

2) sem_wait gets woken and starts processing, needing some time to proceed while the gate is closed.

3) A new waiter arrives while the gate is still closed; it will block on the gate and not increment nWaitersBlocked yet.

4) A signal arrives while the gate is still closed. Now 0 < nWaitersToUnblock < N, correct? But nWaitersBlocked == 0, so the signal results in a no-op, and consequently the waiter from 3) will never get to see it - hence it is lost??

I am studying the pthread_cond_wait.c pseudo code taken from SNAPSHOT 2004-05-16. What do I miss here? I see this is not going to happen if the thread that signals is holding the external mutex when trying to signal. But this isn't a requirement, is it?

Thank you for your kind help. Roland
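P.S.: To make the window concrete, here is a toy, single-threaded walk-through of the bookkeeping in steps 1) to 4) - an illustration only with simplified state, not the actual pthreads-win32 code; the counter names follow the pseudo code mentioned above.

#include <iostream>

int  nWaitersBlocked   = 0;     // waiters registered behind the gate
int  nWaitersToUnblock = 0;     // waiters a broadcast has promised to release
bool gateOpen          = true;

int main()
{
    nWaitersBlocked = 3;                     // N threads already waiting
    // 1) signal-all: close the gate and move the count over
    gateOpen          = false;
    nWaitersToUnblock = nWaitersBlocked;     // = N
    nWaitersBlocked   = 0;
    // 2) the released waiters are still draining; the gate stays closed
    // 3) a new waiter arrives, parks in front of the closed gate,
    //    and has NOT incremented nWaitersBlocked yet
    // 4) a signal arrives while the gate is closed:
    if (nWaitersBlocked > 0)                 // it sees no registered waiter ...
        ++nWaitersToUnblock;                 // ... so this line is never reached
    gateOpen = true;                         // unblocking done, gate reopens
    ++nWaitersBlocked;                       // the parked waiter registers and waits
    std::cout << "toUnblock=" << nWaitersToUnblock
              << ", blocked=" << nWaitersBlocked << "\n";
    // prints toUnblock=3, blocked=1: the signal left no trace, so the late
    // waiter could only be released by a future notification.
    return 0;
}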

Roland Schwarz wrote:
4) A signal arrives while the gate is still closed. Now 0 < nWaitersToUnblock < N, correct? But nWaitersBlocked == 0, so the signal results in a no-op, and consequently the waiter from 3) will never get to see it - hence it is lost??
Replacing the sequence on entry to wait:

sem_wait(semBlockLock);
nWaitersBlocked++;
sem_post(semBlockLock);

by:

atomic_increment(nWaitersBlocked);
sem_wait(semBlockLock);
sem_post(semBlockLock);

would avoid losing this signal, since the signal(bAll) function would then increment nWaitersToUnblock. Roland

Roland Schwarz wrote: [...]
3) A new waiter arrives while the gate is still closed; it will block on the gate and not increment nWaitersBlocked yet.
4) A signal arrives while the gate is still closed. Now 0 < nWaitersToUnblock < N, correct? But nWaitersBlocked == 0, so the signal results in a no-op, and consequently the waiter from 3) will never get to see it - hence it is lost??
I am studying the pthread_cond_wait.c pseudo code taken from SNAPSHOT 2004-05-16.
What do I miss here?
http://www.opengroup.org/onlinepubs/007908799/xsh/pthread_cond_wait.html regards, alexander.

Alexander Terekhov wrote:
Roland Schwarz wrote:
What do I miss here?
http://www.opengroup.org/onlinepubs/007908799/xsh/pthread_cond_wait.html
I.e. being able to get hold of the mutex is what establishes the interrelationship between the threads. Sending a signal alone cannot be related reliably to other threads, even when it comes before the wait as viewed in "real time". (The only state change it could be signalling is: nothing changed.) Thank you very much, now everything makes sense again :-) Regards, Roland
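In code form, the rule this amounts to (a minimal sketch, not taken from the thread, using the same Boost.Thread primitives as the test program): the notifying thread changes the shared predicate and notifies while holding the mutex the waiter uses, so the notification is always tied to a protected state change that the waiter re-checks.

#include <boost/thread.hpp>

boost::mutex m;
boost::condition cv;
bool ready = false;              // the predicate, protected by m

void producer()
{
    boost::mutex::scoped_lock lock(m);
    ready = true;                // state change under the mutex ...
    cv.notify_one();             // ... then the signal, so the ordering is well defined
}

void consumer()
{
    boost::mutex::scoped_lock lock(m);
    while (!ready)               // predicate loop also guards against spurious wakeups
        cv.wait(lock);
}

int main()
{
    boost::thread t1(consumer);
    boost::thread t2(producer);
    t1.join();
    t2.join();
    return 0;
}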

I have found a test suite for various POSIX APIs, including the threading APIs at http://posixtest.sourceforge.net --- maybe some of the tests can be adapted to the boost threads API.

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk

I have found a test suite for various POSIX APIs, including the threading APIs at http://posixtest.sourceforge.net --- maybe some of the tests can be adapted to the boost threads API.
We certainly need better tests, the tests from pthread-win32 could probably be rewritten to use Boost.Thread primitives as well. Basically we want to beat this to death :-) John.

On 16/09/05, John Maddock <john@johnmaddock.co.uk> wrote:
I have found a test suite for various POSIX APIs, including the threading APIs at http://posixtest.sourceforge.net --- maybe some of the tests can be adapted to the boost threads API.
We certainly need better tests, the tests from pthread-win32 could probably be rewritten to use Boost.Thread primitives as well. Basically we want to beat this to death :-)
I'm overseas on business for the next week and a bit. Another week after I get back we should have a basic framework and some tests in place, I hope. With the testing I find it important to demonstrate that unsafe code fails where that is possible, so that the validity of the concurrency tests is strong for the protected versions. matt.

Roland Schwarz wrote: [...]
Could this be an indication that the current implementation
It's basically the same as pthreads-win32 except that it has a busted destructor that doesn't lock the gate sema to synchronize destruction with respect to exiting waiters in order to conform to POSIX. regards, alexander.
participants (5): Alexander Terekhov, Anthony Williams, John Maddock, Matt Hurd, Roland Schwarz