[thread] Condition::notify_one, wait

Hiho,

I'm using the new Boost Threads for a thread pool; however, I've got some weird problems.

The documentation of Condition::notify_one says it unblocks one of the threads currently waiting on the condition -- what happens if I have several threads waiting and I call notify_one twice? It seems that on Windows/x64/VC9, only one thread wakes up if notify_one is called twice in a row. Is this right?

Second, wait says: "The thread will unblock when notified by a call to this->notify_one() or this->notify_all(), or spuriously". Is it likely that a condition unblocks spuriously? After all, I'd expect to unblock only after calling notify_* and not "by accident". Is there any way to get diagnostic output when a wait has returned spuriously?

The problem I'm having at the moment is: I have PIMPLed Boost.Thread behind some simple wrappers, and I'm using them like this:

    Lock lock (mutex_);
    if (queue.empty ()) { waitForItem.wait (lock); }

with Lock being a wrapper for scoped_lock, mutex_ being a wrapper for mutex and waitForItem being a condition. Funnily enough, it goes waiting even though there is an item in the queue (i.e., queue.empty () == false). I'm not sure where this is coming from (all access to the queue is protected by mutexes); the only candidate at the moment is that when inserting, I call notify_one, and if two items get inserted one right after the other, only one thread is woken up and hence the other remains waiting although there is an item in the queue. It works when I use notify_all (), but this does not smell right to me :/

Thanks,
Anteru

Anteru
The documentation of Condition::notify_one says it unblocks one of the threads waiting currently on the condition -- what happens if I have several threads waiting and I call notify_one twice?
If there are multiple threads waiting, the first call to notify_one() should wake one, and the second call should wake another.
It seems that on Windows/x64/VC9, only one thread wakes up if notify_one is called twice in a row. Is this right?
That would be incorrect behaviour if there really are two threads waiting.
Second, wait says: "The thread will unblock when notified by a call to this->notify_one() or this->notify_all(), or spuriously". Is it likely that a condition unblocks spuriously? After all, I'd expect to unblock only after calling notify_* and not "by accident". Is there any way to get diagnostic output when a wait has been called spuriously?
It is not possible to detect a spurious wake: if it was, then the implementation would resume waiting. However, spurious wakes should be rare occurrences.
The problem I'm having at the moment is: I have PIMPLed Boost.Thread behind some simple wrappers, and I'm using them like this here:
Lock lock (mutex_);
if (queue.empty ()) { waitForItem.wait (lock); }
with Lock being a wrapper for scoped_lock, mutex being a wrapper for mutex and waitForItem being a condition, and funnily enough it goes waiting even though there is an item in queue (i.e., queue.empty () == false). I'm not sure where this is coming from (all access to queue is protected using mutexes), only candidate at the moment is that when inserting, I call notify_one, and if two items get inserted one right after the other, I assume that only one thread is woken up and hence the other remains waiting although there is an item in the queue.
Notifies are not sticky, so if there is no thread waiting when notify_one() is called, the notify is lost.
It works when I use notify_all (), but this does not smell right to me :/
Can you show me some code?

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

It is not possible to detect a spurious wake: if it was, then the implementation would resume waiting. However, spurious wakes should be rare occurrences.
Does someone have time to give us a quick overview as to why spurious wakes may occur? Thanks in advance.

Patrick Loney wrote:
It is not possible to detect a spurious wake: if it was, then the implementation would resume waiting. However, spurious wakes should be rare occurrences.
Does someone have time to give us a quick overview as to why spurious wakes may occur?
I'm sure you have typed "spurious wakeups" into google, and read whatever is found? If not, I'm afraid I only have a shameless plug: http://vladimir_prus.blogspot.com/2005/07/spurious-wakeups.html - Volodya

Anthony Williams schrieb:
It is not possible to detect a spurious wake: if it was, then the implementation would resume waiting. However, spurious wakes should be rare occurrences.
All right, maybe this could be added to the documentation somewhere.
Can you show me some code?
Worker threads are supposed to die as soon as a null job is encountered. The worker threads call executeJob. It works with queueFull_.notifyAll () in insertJob, but it fails with notifyOne -- note, these simply pass through to the boost::condition counterparts. It used to work with a home-grown condition implementation; moreover, I had a prototype in Python and that one worked too.

The problem is that during the destructor, not all threads die properly (usually one to two remain, if started with 4 threads on a dual-core) -- I can reproduce it nearly always. Adding some debug output usually makes it work, so it does look like a race condition. I only observed a deadlock if the following happened:

    insertJob called
    insertJob called
    ...

then one thread remained waiting forever. However, if only one job gets inserted at a time, it works. Tested on a dual-core, so it was really concurrent. Note: it might very well be that it also deadlocks in more cases, it's just *damn* difficult to observe properly.

The only problem I had at the beginning was in the part

    queueFull_.wait (lock);
    if (jobQueue_.empty ()) { continue; }

as one thread might have been waiting for the mutex_, fast-tracked through the executeJob loop without waiting, grabbed its job and executed it, while notifyOne had been called during insertJob -- which wakes up a waiting thread /after/ the other thread removed the item, so the thread has to re-check that the queue still has an item. Without this, a thread might pop an empty queue.
    class ThreadPool {
    private:
        std::queue<IJob::Ptr>          jobQueue_;
        std::vector<WorkerThread::Ptr> workers_;
        Condition                      queueEmpty_, queueFull_;
        Mutex                          mutex_;
        uint                           threadCount_;
    };

    //////////////////////////////////////////////////////////////////////////
    void ThreadPool::stopAllThreads ()
    {
        for (int i = 0; i < threadCount_; ++i) {
            insertJob (IJob::Ptr ());
            waitFinished ();
        }
    }

    //////////////////////////////////////////////////////////////////////////
    void ThreadPool::insertJob (ThreadPool::IJob::Ptr job)
    {
        Lock lock (mutex_);
        jobQueue_.push (job);
        queueFull_.notifyAll (); // notifyOne leads to a deadlock, notify all
    }

    //////////////////////////////////////////////////////////////////////////
    void ThreadPool::executeJob ()
    {
        while (true) {
            IJob::Ptr job;
            {
                Lock lock (mutex_);
                if (jobQueue_.empty ()) {
                    queueFull_.wait (lock); // Deadlocks here, with jobQueue_.empty () == false!
                    if (jobQueue_.empty ()) {
                        continue;
                    }
                }
                job = jobQueue_.front ();
                jobQueue_.pop ();
            }

            if (job.get () == 0) {
                jobFinished ();
                return;
            } else {
                job->run ();
                jobFinished ();
            }
        }
    }

    //////////////////////////////////////////////////////////////////////////
    ThreadPool::ThreadPool (const uint threadCount)
        : threadCount_ (threadCount)
    {
        for (uint i = 0; i < threadCount; ++i) {
            WorkerThread::Ptr w (new WorkerThread (this));
            workers_.push_back (w);
            w->run ();
        }
    }

    //////////////////////////////////////////////////////////////////////////
    ThreadPool::~ThreadPool ()
    {
        waitFinished (); // Get through, no work waiting
        stopAllThreads ();
        joinAllThreads ();
    }

    //////////////////////////////////////////////////////////////////////////
    void ThreadPool::joinAllThreads ()
    {
        BOOST_FOREACH(Thread::Ptr t, workers_) {
            t->join ();
        }
    }

    //////////////////////////////////////////////////////////////////////////
    void ThreadPool::jobFinished ()
    {
        Lock lock (mutex_);
        if (jobQueue_.empty ()) {
            queueEmpty_.notifyAll ();
        }
    }

    //////////////////////////////////////////////////////////////////////////
    void ThreadPool::waitFinished ()
    {
        Lock lock (mutex_);
        if (jobQueue_.empty ()) {
            return;
        } else {
            queueEmpty_.wait (lock);
        }
    }

Cheers,
Anteru

Anteru
Anthony Williams schrieb:
It is not possible to detect a spurious wake: if it was, then the implementation would resume waiting. However, spurious wakes should be rare occurrences.
All right, maybe this could be added to the documentation somewhere.
I'll see if I can word things better.
Can you show me some code?
Worker threads are supposed to die as soon as a null job is encountered. The worker threads call executeJob. It works with queueFull_.notifyAll () in insertJob, but it fails with notifyOne -- note, these simply pass through to the boost::condition counterparts.
Could you show the code for Lock and Condition? Also, does this happen on Windows, a pthreads platform or both?

Anthony

Anthony Williams schrieb:
I'll see if I can word things better. Thanks!
Could you show the code for Lock and Condition? Also, does this happen on Windows, a pthreads platform or both?
Happens on Windows, not tested on pthreads. I only tested on Windows, as I had been using custom conditions/threads there.
My Lock is:

    class Lock : private boost::noncopyable {
    public:
        Lock (Mutex& m);

        bool locked () const;

    private:
        std::tr1::shared_ptr<detail::LockImpl> lock_;

        friend class Condition;
    };

with

    class LockImpl {
    public:
        boost::mutex::scoped_lock lock_;

        LockImpl (MutexImpl& m) : lock_ (m.m_) { }
        bool locked () const { return lock_; }
    };

and the Lock constructor does:

    Lock::Lock (Mutex& m) : lock_ (new LockImpl (*m.m_)) { }

The Condition is:

    class Condition : private boost::noncopyable {
    public:
        Condition ();

        // Just pass through
        void notifyAll ();
        void notifyOne ();

        // See below
        void wait (Lock& lock);

    private:
        std::tr1::shared_ptr<detail::ConditionPrivate> impl_;
    };

with

    class ConditionPrivate {
    public:
        // Just pass through
        void notifyAll ();
        void notifyOne ();

        // See below
        void wait (LockImpl& lock);

    private:
        boost::condition cond_;
    };

and here is the interesting part -- I just extract the scoped_lock and pass it on:

    void ConditionPrivate::wait (detail::LockImpl& lock) {
        cond_.wait (lock.lock_);
    }

    Condition::Condition () : impl_ (new ConditionPrivate ()) { }

    void Condition::wait (Lock& lock) {
        impl_->wait (*lock.lock_);
    }

and finally:

    class Mutex : private boost::noncopyable {
    public:
        Mutex ();
        ~Mutex ();

        void lock ();
        bool tryLock ();
        void release ();

    private:
        std::tr1::shared_ptr<detail::MutexImpl> mutex_;

        friend class Lock;
    };

which passes through to

    class MutexImpl {
    public:
        boost::mutex m_;

        void lock ()    { m_.lock (); }
        void release () { m_.unlock (); }
        bool tryLock () { return m_.try_lock (); }
    };

These are just simple wrappers, to get around the problem that windows.h gets included by date_time, and to reduce compile times. Moreover, I had the old implementation (Win32 only) working just like this.
Calling a lock via another function shouldn't matter, as it just makes the call slower but not less thread-safe (the object does get locked eventually, and from that point on execution is serialized) -- or am I misunderstanding something? Thanks for taking a look, Anteru

Anteru
Anthony Williams schrieb:
I'll see if I can word things better. Thanks!
Could you show the code for Lock and Condition? Also, does this happen on Windows, a pthreads platform or both? Happens on Windows, not tested on pthreads. I only tested on windows as I have been using custom condition/threads there.
OK. I can't reproduce the problem here, but looking at the code with fresh eyes, there might be a bug in the code that tries to avoid spurious wakes. Here's something to try: comment out the

    while(!woken)

on line 175 of boost/thread/win32/condition_variable.hpp. This will increase the incidence of spurious wakes, but might fix the problem. If the problem is reproducible without the change, and this change fixes it, then I need to investigate the new condition variable code more thoroughly.

Anthony

Anthony Williams schrieb:
OK. I can't reproduce the problem here, but looking at the code with fresh eyes, there might be a bug in the code that tries to avoid spurious wakes. Here's something to try: comment out the
while(!woken)
on line 175 of boost/thread/win32/condition_variable.hpp
This will increase the incidence of spurious wakes, but might fix the problem. If the problem is reproducible without the change, and this change fixes it, then I need to investigate the new condition variable code more thoroughly.
With the line, I can reproduce it, although not always (maybe 2 out of 3 runs show it). Without the line, I had 15+ runs without any problem (with heavily varying system load, from totally idle to fully loaded, debug and release tested). Either your fix reduces the probability of the error dramatically or it even fixes it ;) If there is something more I can test, let me know. Thanks, Anteru

Anteru
Anthony Williams schrieb:
OK. I can't reproduce the problem here, but looking at the code with fresh eyes, there might be a bug in the code that tries to avoid spurious wakes. Here's something to try: comment out the
while(!woken)
on line 175 of boost/thread/win32/condition_variable.hpp
This will increase the incidence of spurious wakes, but might fix the problem. If the problem is reproducible without the change, and this change fixes it, then I need to investigate the new condition variable code more thoroughly.
With the line, I can reproduce it, although not always (maybe 2 out of 3 runs show it). Without the line, I had 15+ runs without any problem (with heavily varying system load, from totally idle to fully loaded, debug and release tested). Either your fix reduces the probability of the error dramatically or it even fixes it ;)
If there is something more I can test, let me know. Thanks,
I found a usage pattern that triggers the bug, and added a test to trunk. I've also added a "proper" fix to trunk: the simple fix above would yield many spurious wakes. I'd be grateful if you could try the version of Boost.Thread from svn trunk.

Anthony

Anthony Williams schrieb:
I found a usage pattern that triggers the bug, and added a test to trunk. I've also added a "proper" fix to trunk: the simple fix above would yield many spurious wakes. I'd be grateful if you could try the version of boost.thread from svn trunk.
Tried it from trunk. That is, I recompiled Boost.Thread from trunk (r44150) and used the updated headers/binaries from that -- seems to work. At least, I cannot reproduce the problem any more (I'm a bit cautious with "works", as I could only test it on 1-2 CPUs). Thanks for fixing this so quickly. Any idea when this fix is going to be included in an official release? (1.35.1 or 1.36?) Anteru

Anteru
Anthony Williams schrieb:
I found a usage pattern that triggers the bug, and added a test to trunk. I've also added a "proper" fix to trunk: the simple fix above would yield many spurious wakes. I'd be grateful if you could try the version of boost.thread from svn trunk.
Tried it from trunk. That is, I recompiled Boost.Thread from trunk (r44150) and used the updated headers/binaries from that -- seems to work. At least, I cannot reproduce the problem any more (I'm a bit cautious with "works", as I could only test it on 1-2 CPUs).
Thanks for testing it.
Thanks for fixing this that quickly. Any idea when this fix is going to be included in an official release? (1.35.1 or 1.36?)
I'll make sure it goes in the next one, but I don't know when that will be.

Anthony

Anthony Williams wrote:
Anteru writes:
Thanks for fixing this that quickly. Any idea when this fix is going to be included in an official release? (1.35.1 or 1.36?)
I'll make sure it goes in the next one, but I don't know when that will be.
*If* there will be a 1.35.1 release, it would be really charming if this fix could be included there. Just my personal 2 Euro cent, Daniel Krügler

Anthony Williams schrieb:
Thanks for fixing this that quickly. Any idea when this fix is going to be included in an official release? (1.35.1 or 1.36?)
I'll make sure it goes in the next one, but I don't know when that will be.
Hmm, I'm still having problems, but mostly only under heavy system load. It seems a notifyOne gets lost somewhere (at least, the result is that a thread is waiting at a point where it shouldn't be). I've tried the latest SVN trunk (as the commit log said another bug similar to the one I reported was fixed) -- no luck. Funnily enough, it's at exactly the same place again where I had my original problem. It seems that the switch during condition::wait is not fully atomic or something similar, as everything should be protected by a mutex and yet the state I'm getting into is inconsistent. I'll try to investigate a bit further. Any ideas? Cheers, Anteru

Anteru schrieb:
a mutex and yet the state I'm getting into is inconsistent. I'll try to investigate a bit further.
Some closer inspection: with notify_one in Debug builds I get a deadlock in 1 of 3 tries, and with notify_all no deadlocks, while both deadlock nearly always in Release mode. Seems to be really timing-sensitive. The problem is at exactly the same place as the original problem, i.e. the code in question is:

    Lock lock (mutex_);
    while (queue.empty ()) { waitForItem.wait (lock); }

and the thread deadlocks on this condition, although queue.empty () == false and insertion only happens inside a critical section (secured by a Lock). However, at most one thread remains blocked; previously I often had 3 threads deadlocked. If there is anything I can try to nail it down further, please let me know.

Cheers,
Anteru

Anteru wrote:
Hiho,
I'm using the new Boost Threads for a Thread Pool, however, I've got some weird problems.
The documentation of Condition::notify_one says it unblocks one of the threads waiting currently on the condition -- what happens if I have several threads waiting and I call notify_one twice?
It seems that on Windows/x64/VC9, only one thread wakes up if notify_one is called twice in a row. Is this right?
Second, wait says: "The thread will unblock when notified by a call to this->notify_one() or this->notify_all(), or spuriously". Is it likely that a condition unblocks spuriously? After all, I'd expect to unblock only after calling notify_* and not "by accident". Is there any way to get diagnostic output when a wait has been called spuriously?
The problem I'm having at the moment is: I have PIMPLed Boost.Thread behind some simple wrappers, and I'm using them like this here:
Lock lock (mutex_);
if (queue.empty ()) { waitForItem.wait (lock); }
with Lock being a wrapper for scoped_lock, mutex being a wrapper for mutex and waitForItem being a condition, and funnily enough it goes waiting even though there is an item in queue (i.e., queue.empty () == false). I'm not sure where this is coming from (all access to queue is protected using mutexes), only candidate at the moment is that when inserting, I call notify_one, and if two items get inserted one right after the other, I assume that only one thread is woken up and hence the other remains waiting although there is an item in the queue.
It works when I use notify_all (), but this does not smell right to me :/
I think it is a common misconception that condition variables can be used alone to synchronize threads waiting for an event. Idiomatic usage requires some element of state (often simply a boolean value) to be changed before the condition is signalled. The activated thread should then check that state WHILE HOLDING THE LOCK that the condition wait re-acquires on exit. Only if the state is set should the woken thread perform the action for which it was waiting, and reset the state. This model means that spurious returns from wait (for example, interrupted system calls on a *nix system) will not cause unwanted behaviour.
Thanks, Anteru
--
Bill Somerville
Class Design Limited
participants (6)
- Anteru
- Anthony Williams
- Bill Somerville
- Daniel Krügler
- Patrick Loney
- Vladimir Prus