[Thread] thousands of spurious wakeups from timed_wait() per second
I've been trying to get useful behavior out of
condition_variable::timed_wait() with little success. Depending on
the platform, I'm getting about 3000 or nearly 100000 spurious wakeups
per second.
I'm working on a few different Linux platforms with different versions
of Boost: 1.46, 1.48, and another in the high 40's. The eventual target
is 1.48. I've looked for bug reports and haven't found any, and I have
no evidence yet that suggests the different versions are significant to
this problem.
Because of the lack of problem reports, I figure I'm doing something
wrong. That's my question: "What am I doing wrong here?"
I have a test program that sets up a typical producer/consumer example.
The main program spawns a thread, the producer, that notifies the main
thread via a condition_variable once a second for a few seconds, then
ends. The main thread, the consumer, waits efficiently for notifications
with a 2-second timeout. When the producer thread finishes, the main
(consumer) thread eventually times out, joins the spawned thread,
and exits. At least, that's what is supposed to happen.
If I don't bother with timeouts and use condition_variable::wait(), it
works perfectly - except for the timeout, join, and exit. This is as
expected. So far I have seen no spurious wakeups at all with this test.
There is an ominous yet ambiguous remark in the documentation
"The duration overload of timed_wait is difficult to use correctly.
The overload taking a predicate should be preferred in most cases."
Due to that remark I stopped experimenting with relative timeouts before
I got anything to work. Using absolute timeouts, I get very high numbers
of spurious wakeups as mentioned above - 3000 and about 100000 per second
depending on the platform. I see no difference between timed_wait with
a predicate and without a predicate.
I'm also having trouble with the return value of timed_wait().
The documentation carefully states that timed_wait() returns "false if
the call is returning because the time specified by abs_time was reached,
true otherwise". This seems to mean that spurious wakeups return true
-- and if you think about it, the return value is useless if spurious
wakeups could return false. You'd have to add your own check for timeout.
I was quite surprised to find that every spurious wakeup of timed_wait()
that I've ever seen returns false. That just doesn't make sense to me.
Here is the version of my test program that uses timed_wait without a
predicate, followed by its output.
#include
I've been trying to get useful behavior out of condition_variable::timed_wait() with little success. Depending on the platform, I'm getting about 3000 or nearly 100000 spurious wakeups per second. Which platforms are you testing? I'm working on a few different Linux platforms with different versions of Boost: 1.46, 1.48, and another in the high 40's. The eventual target is 1.48. I've looked for bug reports and haven't found any, and I have no evidence yet that suggests the different versions are significant to this problem. Could you test with the develop branch?
Because of the lack of problem reports, I figure I'm doing something wrong. Well, not everyone writes a report. That's my question: "What am I doing wrong here?"
I have a test program that sets up a typical producer/consumer example. The main program spawns a thread, the producer, that notifies the main thread via a condition_variable once a second for a few seconds, then ends. The main thread, the consumer, waits efficiently for notifications with a 2-second timeout. When the producer thread finishes, the main (consumer) thread eventually times out, joins the spawned thread, and exits. At least, that's what is supposed to happen.
If I don't bother with timeouts and use condition_variable::wait(), it works perfectly - except for the timeout, join, and exit. This is as expected. So far I have seen no spurious wakeups at all with this test.
There is an ominous yet ambiguous remark in the documentation "The duration overload of timed_wait is difficult to use correctly. The overload taking a predicate should be preferred in most cases." I suspect that this mean that you need to check for the spurious wakeups, and the predicate recall you to do it. Changing the duration is something the users forgot to do as it is not evident. Due to that remark I stopped experimenting with relative timeouts before I got anything to work. Using absolute timeouts, I get very high numbers of spurious wakeups as mentioned above - 3000 and about 100000 per second depending on the platform. I see no difference between timed_wait with a predicate and without a predicate.
I'm also having trouble with the return value of timed_wait(). The documentation carefully states that timed_wait() returns "false if the call is returning because the time specified by abs_time was reached, true otherwise". This seems to mean that spurious wakeups return true -- and if you think about it, the return value is useless if spurious wakeups could return false. You'd have to add your own check for timeout.
Le 10/09/14 04:38, Steven Clark a écrit : the function return true if there were a notify on this condition. This doesn't mean that another thread has not already changed the result of the predicate. the function return false when there were no notify before the timeout was reached.
I was quite surprised to find that every spurious wakeup of timed_wait() that I've ever seen returns false. That just doesn't make sense to me.
Well, this is how it works. The problem is why do you receive so many spurious wakeup.
Here is the version of my test program that uses timed_wait without a predicate, followed by its output.
<snip>
******* Using timed_wait() now is 2014-Sep-09 20:48:12.183781 deadline is 2014-Sep-09 20:48:14.183781 hello from threadfn(), #0 timed_wait() returned true on wakeup #1451 main() exited inner loop with 1451 wakeups; gGotData is true main(), notified, gValue is 0 now is 2014-Sep-09 20:48:12.215471 deadline is 2014-Sep-09 20:48:14.215471 hello from threadfn(), #1 timed_wait() returned true on wakeup #96864 main() exited inner loop with 96864 wakeups; gGotData is true main(), notified, gValue is 1 now is 2014-Sep-09 20:48:13.216511 deadline is 2014-Sep-09 20:48:15.216511 hello from threadfn(), #2 timed_wait() returned true on wakeup #95911 main() exited inner loop with 95911 wakeups; gGotData is true main(), notified, gValue is 2 now is 2014-Sep-09 20:48:14.216805 deadline is 2014-Sep-09 20:48:16.216805 hello from threadfn(), #3 timed_wait() returned true on wakeup #97526 main() exited inner loop with 97526 wakeups; gGotData is true main(), notified, gValue is 3 now is 2014-Sep-09 20:48:15.217191 deadline is 2014-Sep-09 20:48:17.217191 hello from threadfn(), #4 timed_wait() returned true on wakeup #104710 main() exited inner loop with 104710 wakeups; gGotData is true main(), notified, gValue is 4 now is 2014-Sep-09 20:48:16.217453 deadline is 2014-Sep-09 20:48:18.217453 hello from threadfn(), #5 timed_wait() returned true on wakeup #102404 main() exited inner loop with 102404 wakeups; gGotData is true main(), notified, gValue is 5 now is 2014-Sep-09 20:48:17.217685 deadline is 2014-Sep-09 20:48:19.217685 hello from threadfn(), #6 timed_wait() returned true on wakeup #97906 main() exited inner loop with 97906 wakeups; gGotData is true main(), notified, gValue is 6 now is 2014-Sep-09 20:48:18.217291 deadline is 2014-Sep-09 20:48:20.217291 ^C
I have not reached to reproduce it on develop branch. I will try later with other versions. Here are my results on MacOs ******* Using timed_wait() hello from threadfn(), #0 now is 2014-Sep-10 08:27:08.045253 deadline is 2014-Sep-10 08:27:10.045253 main() exited inner loop with 0 wakeups; gGotData is true main(), notified, gValue is 0 now is 2014-Sep-10 08:27:08.046008 deadline is 2014-Sep-10 08:27:10.046008 hello from threadfn(), #1 timed_wait() returned true on wakeup #1 main() exited inner loop with 1 wakeups; gGotData is true main(), notified, gValue is 1 now is 2014-Sep-10 08:27:09.046545 deadline is 2014-Sep-10 08:27:11.046545 hello from threadfn(), #2 timed_wait() returned true on wakeup #1 main() exited inner loop with 1 wakeups; gGotData is true main(), notified, gValue is 2 now is 2014-Sep-10 08:27:10.047779 deadline is 2014-Sep-10 08:27:12.047779 hello from threadfn(), #3 timed_wait() returned true on wakeup #1 main() exited inner loop with 1 wakeups; gGotData is true main(), notified, gValue is 3 now is 2014-Sep-10 08:27:11.048845 deadline is 2014-Sep-10 08:27:13.048845 hello from threadfn(), #4 timed_wait() returned true on wakeup #1 main() exited inner loop with 1 wakeups; gGotData is true main(), notified, gValue is 4 now is 2014-Sep-10 08:27:12.049441 deadline is 2014-Sep-10 08:27:14.049441 hello from threadfn(), #5 timed_wait() returned true on wakeup #1 main() exited inner loop with 1 wakeups; gGotData is true main(), notified, gValue is 5 now is 2014-Sep-10 08:27:13.050441 deadline is 2014-Sep-10 08:27:15.050441 hello from threadfn(), #6 timed_wait() returned true on wakeup #1 main() exited inner loop with 1 wakeups; gGotData is true main(), notified, gValue is 6 main(), after joining Best, Vicente
Thank you for your interest!
Vicente J. Botet Escriba
Le 10/09/14 04:38, Steven Clark a écrit : Which platforms are you testing?
In the original post, I erred in some details regarding platforms. I am currently testing on two virtual machines, both Virtual Box running Ubuntu Linux (64-bit), and hosted on two different computers running Windows. * One is running Linux 3.13.0, Ubuntu 14.04.1, and Boost 1.54. * The other runs Linux 3.2.0, Ubuntu 12.04.5, and Boost 1.46. * My intended target is an Intel arm i.MX27 running Linux x.y.z and Boost 1.48. I have not run my tests on the intended target yet. I can move to a different version of Boost if necessary but it's a nuisance. Is this the platform information you wanted?
Could you test with the develop branch?
That would be a nuisance. Have there been relevant changes since Boost 1.54? It is much more recent than what I mentioned in the original post. I will try the development branch if you're sure it is worthwhile.
I'm also having trouble with the return value of timed_wait(). The documentation carefully states that timed_wait() returns "false if the call is returning because the time specified by abs_time was reached, true otherwise". This seems to mean that spurious wakeups return true -- and if you think about it, the return value is useless if spurious wakeups could return false. You'd have to add your own check for timeout. the function return true if there were a notify on this condition. This doesn't mean that another thread has not already changed the result of the predicate. the function return false when there were no notify before the timeout was reached.
Let me restate what I think you said, which is also what I've observed so far. Timed_wait() can return for three reasons: notification, timeout, or spurious. It returns true on notification, and false on timeout or spurious. (This is contrary to the documentation, which appears to be carefully written.) I find it hard to believe that normal wait() cannot tell the difference between spurious and notification, yet timed_wait() can. Granted, I don't understand the issues that lead to spurious wakeups. Perhaps you meant that timed_wait() returns true for notification or spurious, and false for timeout or spurious. That makes no sense to me either - there's no point in timed_wait() having a return value. What decision can the caller reliably make on the basis of such a return value? Thank you for running my test on a Mac. Your output is what I expected to see. I converted the test program to make pthread calls instead of Boost. It works perfectly, just like your Mac run. This is unsatisfactory because eventually I want the same code to run on Windows. I've been suspicious that timed_wait() isn't actually waiting at all - just doing something like yield(). When I added a check for the current time vs. the timeout deadline into my test program, the number of spurious wakeups dropped almost to half, suggesting to me that the boost::posix_time::microsec_clock::local_time() call almost doubles the time it takes to run through the inner loop. Are there any compiler flags that direct how Boost implements its thread functions? I think I ran across something once, something like BOOST_POSIX_THREADS, but I haven't found it and I have no idea what the alternative to pthreads might be.
Best, Vicente
Merci, Steve
Thank you for your interest!
Vicente J. Botet Escriba
writes: Le 10/09/14 04:38, Steven Clark a écrit : Which platforms are you testing? In the original post, I erred in some details regarding platforms. I am currently testing on two virtual machines, both Virtual Box running Ubuntu Linux (64-bit), and hosted on two different computers running Windows.
* One is running Linux 3.13.0, Ubuntu 14.04.1, and Boost 1.54. * The other runs Linux 3.2.0, Ubuntu 12.04.5, and Boost 1.46. * My intended target is an Intel arm i.MX27 running Linux x.y.z and Boost 1.48. I have not run my tests on the intended target yet. I can move to a different version of Boost if necessary but it's a nuisance. Is this the platform information you wanted? Thanks for the details. Could you check directly on windows?
I'm also having trouble with the return value of timed_wait(). The documentation carefully states that timed_wait() returns "false if the call is returning because the time specified by abs_time was reached, true otherwise". This seems to mean that spurious wakeups return true -- and if you think about it, the return value is useless if spurious wakeups could return false. You'd have to add your own check for timeout.
Could you test with the develop branch? That would be a nuisance. Have there been relevant changes since Boost 1.54? It is much more recent than what I mentioned in the original post. I will try the development branch if you're sure it is worthwhile. No. It is not worth if you have tested with 1.54. But version 1.48 is a very old version. the function return true if there were a notify on this condition. This doesn't mean that another thread has not already changed the result of the predicate. the function return false when there were no notify before the timeout was reached. Let me restate what I think you said, which is also what I've observed so far. Timed_wait() can return for three reasons: notification, timeout, or spurious. It returns true on notification, and false on timeout or spurious. (This is contrary to the documentation, which appears to be carefully written.) No, it can return either because notified or timeout. A spurious notification is one that doesn't satisfy the predicate you expect after wait. This could be because another thread has changed in the mean time
Le 11/09/14 18:17, Steve Clark a écrit : the condition.
I find it hard to believe that normal wait() cannot tell the difference between spurious and notification, yet timed_wait() can. Granted, I don't understand the issues that lead to spurious wakeups. Both wait and timed_wait can have associated spurious notifications.
Perhaps you meant that timed_wait() returns true for notification or spurious, and false for timeout or spurious.
No. When the timed_wait returns false, the condition has not been notified, so there is no possible spurious notification in this case ;-)
That makes no sense to me either - there's no point in timed_wait() having a return value. What decision can the caller reliably make on the basis of such a return value? If there is a timeout, this gives you an indication that maybe there is no thread that will notify you, protecting you blocking forever. Thank you for running my test on a Mac. Your output is what I expected to see.
I converted the test program to make pthread calls instead of Boost. It works perfectly, just like your Mac run. This is unsatisfactory because eventually I want the same code to run on Windows. Agreed. Could you check directly on windows?
Another possibility could be to use the chrono interface to see if the
bug is around the date-time interface.
#include
I've been suspicious that timed_wait() isn't actually waiting at all - just doing something like yield(). Yes, the date on your traces indicates that the timeout has not been reached at all, and that the notifications are done every second, as expected. The problem been the erroneous timeouts.
now is 2014-Sep-09 20:48:12.183781 now is 2014-Sep-09 20:48:12.215471 now is 2014-Sep-09 20:48:13.216511 now is 2014-Sep-09 20:48:14.216805 now is 2014-Sep-09 20:48:15.217191 now is 2014-Sep-09 20:48:16.217453 now is 2014-Sep-09 20:48:17.217685 now is 2014-Sep-09 20:48:18.217291
When I added a check for the current time vs. the timeout deadline into my test program, the number of spurious wakeups dropped almost to half, suggesting to me that the boost::posix_time::microsec_clock::local_time() call almost doubles the time it takes to run through the inner loop.
Are there any compiler flags that direct how Boost implements its thread functions? I think I ran across something once, something like BOOST_POSIX_THREADS, but I haven't found it and I have no idea what the alternative to pthreads might be.
Sorry, I don't understand what you mean. Could you rephrase? Best, Vicente
On 11 Sep 2014 at 16:17, Steve Clark wrote:
I converted the test program to make pthread calls instead of Boost. It works perfectly, just like your Mac run. This is unsatisfactory because eventually I want the same code to run on Windows.
I've been suspicious that timed_wait() isn't actually waiting at all - just doing something like yield(). When I added a check for the current time vs. the timeout deadline into my test program, the number of spurious wakeups dropped almost to half, suggesting to me that the boost::posix_time::microsec_clock::local_time() call almost doubles the time it takes to run through the inner loop.
Hmm. Maybe my timer changes on Windows for 1.56 are broken? I have to admit I relied on the unit test suite instead of manually checking I made the time calculations correctly. I believe there is a test in the unit test which checks that sleeping for a second really did take roughly a second. Can you test 1.55 or some earlier Boost? Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
Le 12/09/14 12:35, Niall Douglas a écrit :
On 11 Sep 2014 at 16:17, Steve Clark wrote:
I converted the test program to make pthread calls instead of Boost. It works perfectly, just like your Mac run. This is unsatisfactory because eventually I want the same code to run on Windows.
I've been suspicious that timed_wait() isn't actually waiting at all - just doing something like yield(). When I added a check for the current time vs. the timeout deadline into my test program, the number of spurious wakeups dropped almost to half, suggesting to me that the boost::posix_time::microsec_clock::local_time() call almost doubles the time it takes to run through the inner loop. Hmm. Maybe my timer changes on Windows for 1.56 are broken? I have to admit I relied on the unit test suite instead of manually checking I made the time calculations correctly. I believe there is a test in the unit test which checks that sleeping for a second really did take roughly a second.
Can you test 1.55 or some earlier Boost?
It seems that the problem is present in 1.55 (see https://svn.boost.org/trac/boost/ticket/9856). Niall please, could you try to reproduce it on Windows 1.55? Best, Vicente
On 14 Sep 2014 at 21:59, Vicente J. Botet Escriba wrote:
Hmm. Maybe my timer changes on Windows for 1.56 are broken? I have to admit I relied on the unit test suite instead of manually checking I made the time calculations correctly. I believe there is a test in the unit test which checks that sleeping for a second really did take roughly a second.
Can you test 1.55 or some earlier Boost?
It seems that the problem is present in 1.55 (see https://svn.boost.org/trac/boost/ticket/9856).
Ah okay. My changes shouldn't have changed any prior behaviour, so if it was broken before it will still be broken.
Niall please, could you try to reproduce it on Windows 1.55?
Assign me the task. I'll look into it this Saturday. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
Vicente J. Botet Escriba
Thanks for the details. Could you check directly on windows? [...] Agreed. Could you check directly on windows?
My windows development setup is broken at this time, and fixing it is not easy for me. (I have nearly 40 years of Unix-like experience but I'm a rank newbie at Windows development.) Besides, I need the code to work on both Linux and (eventually) Windows. On 14 Sep 2014 at 21:59, Vicente J. Botet Escriba wrote:
It seems that the problem is present in 1.55 (see https://svn.boost.org/trac/boost/ticket/9856).
Yes, that seems to be the same issue now that I understand the API correctly. The bug appears to be that timed_wait() incorrectly thinks it has timed out. Should I add anything to that bug ticket, such as mentioning the additional places where I've seen it or attaching my test program? How do I make or propose some minor clarifications to the Boost::thread documentation to get into the next release?
Vicente J. Botet Escriba
writes: Another possibility could be to use the chrono interface to see if the bug is around the date-time interface.
Last week when I casually looked through Boost documentation, it was not at all clear to me that chrono is newer and should be preferred in new code. This morning I made the straightforward transformation of my test program to use chrono and it works correctly on Linux 3.13.0 with Boost 1.54. I've not yet back-ported wait_until() to Boost 1.48 or 1.46 to see if it works on my other platforms. My time might be better spent updating those platforms to use the latest Boost. I'll move forward with the workaround to use Boost::chrono. Thank you very much! Would it be a good idea for me to morph my test program into a unit test and add it to the Boost:thread test suite? If so, would someone like to point me in the right direction and answer questions? Steven J. Clark VGo Communications
On 15 Sep 2014 at 20:27, Steven Clark wrote:
Last week when I casually looked through Boost documentation, it was not at all clear to me that chrono is newer and should be preferred in new code. This morning I made the straightforward transformation of my test program to use chrono and it works correctly on Linux 3.13.0 with Boost 1.54. I've not yet back-ported wait_until() to Boost 1.48 or 1.46 to see if it works on my other platforms. My time might be better spent updating those platforms to use the latest Boost.
chrono is what ended up in the C++ 11 STL. It is probably much better tested with the older xtime API pretty much deprecated. We probably should actually mark that older API with deprecation compiler warnings really ...
Would it be a good idea for me to morph my test program into a unit test and add it to the Boost:thread test suite? If so, would someone like to point me in the right direction and answer questions?
Please do and send it to me privately. I'll see if I can fix the bug next Saturday and I'll add any test case you supply to keep in sane into the future. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
Le 15/09/14 22:27, Steven Clark a écrit :
Vicente J. Botet Escriba
writes: Thanks for the details. Could you check directly on windows? [...] Agreed. Could you check directly on windows? My windows development setup is broken at this time, and fixing it is not easy for me. (I have nearly 40 years of Unix-like experience but I'm a rank newbie at Windows development.) Besides, I need the code to work on both Linux and (eventually) Windows.
On 14 Sep 2014 at 21:59, Vicente J. Botet Escriba wrote:
It seems that the problem is present in 1.55 (see https://svn.boost.org/trac/boost/ticket/9856). Yes, that seems to be the same issue now that I understand the API correctly. The bug appears to be that timed_wait() incorrectly thinks it has timed out. Should I add anything to that bug ticket, such as mentioning the additional places where I've seen it or attaching my test program? You are welcome to participate on the Trac system. Please, add whatever information you consider is useful, examples, platform/compiler used, ..., analysis of the code, patched fxing the issues ;-)
How do I make or propose some minor clarifications to the Boost::thread documentation to get into the next release? You can create a ticket on the Trac system (preferably with a patch :)
Vicente J. Botet Escriba
writes: Another possibility could be to use the chrono interface to see if the bug is around the date-time interface. Last week when I casually looked through Boost documentation, it was not at all clear to me that chrono is newer and should be preferred in new code. Well , chrono related interface was added in version 1.50, which is not so new. This morning I made the straightforward transformation of my test program to use chrono and it works correctly on Linux 3.13.0 with Boost 1.54. Great. I've not yet back-ported wait_until() to Boost 1.48 or 1.46 to see if it works on my other platforms. My time might be better spent updating those platforms to use the latest Boost. I'll move forward with the workaround to use Boost::chrono. Thank you very much!
Would it be a good idea for me to morph my test program into a unit test and add it to the Boost:thread test suite? If so, would someone like to point me in the right direction and answer questions? Yes please. Any help in making the code more robust, adding tests that show bugs, ... is really welcome.
Best, Vicente
participants (4)
-
Niall Douglas
-
Steve Clark
-
Steven Clark
-
Vicente J. Botet Escriba