[thread] Can boost::this_thread::sleep_for() totally stop the thread?

I'm having, sometime, trouble with a rare heisenbug (rare but appearing several times a week). I described it there: http://stackoverflow.com/questions/17918548/can-windows-totally-stop-a-threa... There I ask if it's possible for windows to totally stop a thread that is sleeping too often. As I'm using boost::thread for sleeping this specific thread that, for unknown reason, sometime totally stop working, I'm also wondering if the boost::this_thread() implementation could not be doing something that the OS might not like? ----- I'll report the content of the page here: Can Windows totally stop a thread if it's sleeping too often? I have a rare heisenbug in a multi-threaded application where the main thread, and only this thread, will just do nothing. As it's an heisenbug it's really hard to understand why this is happening. The main thread is basically just looping. In the loop, it check several concurrent priority queues which contain tasks ordered by time to be executed. It pop a task, see if it's time to execute it. If it's time, it will just schedule it into TBB's task scheduler (using a root task which is the parent of all other tasks). If it's not time, the task is pushed again in the priority queue. That's for one cycle. At the end of the cycle, the main thread is put to sleep for a very short time that I expect will be longer in practice but it's not really a problem, I just don't want it to use too much resources when not necessary. Litterally: static const auto TIME_SCHEDULED_TASKS_SPAWN_FREQUENCY = microseconds(250); while( !m_task_scheduler.is_exiting() ) // check if the application should exit { m_clock_scheduler.spawn_realtime_tasks(); // here we spawn tasks if it's time this_thread::sleep_for( TIME_SCHEDULED_TASKS_SPAWN_FREQUENCY ); } m_clock_scheduler.clear_tasks(); m_root_task.wait_for_all(); I have a special task that just log a "TICK" message each second. It is automatically rescheduling until the end of the program. However, when the heisenbug appear, I can see the "TICK" disappearing and the application not doing anything else than the work that occurs in non-main threads. So it appear that only the main thread is touched. The problem can come from different places, maybe in the scheduling logic, but then it's also the only thread that have a sleep call. That sleep is a boost::this_thread::sleep_for(). My question is: **Is it possible that Windows (7 64bit) consider the main thread to be sleeping often and decide that it should sleep for a longer period of time than asked or be definitely ended?** I expect that it is not possible but I would like to be sure. I didn't find any precision on this in online documentation so far. ----- I really suspect that the problem can't come from the sleeping function but I need to eliminate each potential source of problem to isolate it. Thanks for your time. Joel Lamotte

As someone suggested to me in the comments of the page, I re-implemented the loop using a condition_variable so that pushing a task in the queue notify it. I have a friend who can reproduce the bug systematically (windows Vista, core 2 duo) so I sent him versions: - removing the sleep call works (but is a bit less performant apparently) - changing the loop to use condition_variable works So it looks like the sleep call does have a strange behaviour but I don't know the source of the problem. Using the condition_variable seems more appropriate so I'll keep this fix for now. Joel Lamotte

Le 29/07/13 14:44, Klaim - Joël Lamotte a écrit :
As someone suggested to me in the comments of the page, I re-implemented the loop using a condition_variable so that pushing a task in the queue notify it.
I have a friend who can reproduce the bug systematically (windows Vista, core 2 duo) so I sent him versions:
- removing the sleep call works (but is a bit less performant apparently) - changing the loop to use condition_variable works
So it looks like the sleep call does have a strange behaviour but I don't know the source of the problem. Using the condition_variable seems more appropriate so I'll keep this fix for now.
Hi, there were an issue fixed on 1.54 https://svn.boost.org/trac/boost/ticket/8136. Please, could you tell me if this fix is related? Best, Vicente

On Sat, Aug 3, 2013 at 8:42 AM, Vicente J. Botet Escriba < vicente.botet@wanadoo.fr> wrote:
there were an issue fixed on 1.54 https://svn.boost.org/trac/** boost/ticket/8136 <https://svn.boost.org/trac/boost/ticket/8136>.
Please, could you tell me if this fix is related?
It might be related to this fix mixed with another fix from another dependency. I use Ogre3D with TBB, where Ogre3D will look for TBB using a CMake script. There was a bug I reported[1] recently to the Ogre3D team because in specific conditions (TBB from sources+32bit) where the CMake script was adding add_definitions(-D_WIN32_WINNT=0x0501) which hid some windows functions use by Ogre code. I fixed it by adding a condition so that it does it only with MinGW (I'm using exclusively Visual Studio 2012 U3 right now). Now, I don't know if it's related, but this change looks like it might have an impact on the fix you're talking about. As I was saying, I have a friend who can reproduce the bug systematically (windows Vista, core 2 duo) so maybe finding a computer with these characteristics might make the issue find it easily. When I reimplemented the loop using condition variables - which was actually better because now the thread wait as long as the time necessary to spawn the next task and is woken up when a new task is pushed in the task priority queue - solves the problem for my friend, so there might be a strong difference in waiting code there that might be related? I'll try to make a repro-case with my friend tomorrow if it helps. [1] http://www.ogre3d.org/forums/viewtopic.php?f=4&t=78443 Joel Lamotte

On Sat, Aug 3, 2013 at 12:13 PM, Klaim - Joël Lamotte <mjklaim@gmail.com>wrote:
It might be related to this fix mixed with another fix from another dependency.
To clarify: I first upgraded Boost to 1.54, changed my code accordingly to use boost::future, checked once that it compiled and did the same thing. Then I upgraded Ogre and changed the configuration to force it to use TBB (tbb was up to date in my setup). It's when I hit the Ogre bug that I fixed in CMake. Concerning the sleep bug I'm seeing, I remarked it after that point, but it might have been there before. It's hard to know because it's really rare on my laptop (win7 64bit). Joel Lamotte

Le 03/08/13 14:37, Klaim - Joël Lamotte a écrit :
On Sat, Aug 3, 2013 at 12:13 PM, Klaim - Joël Lamotte <mjklaim@gmail.com>wrote:
It might be related to this fix mixed with another fix from another dependency.
To clarify: I first upgraded Boost to 1.54, changed my code accordingly to use boost::future, checked once that it compiled and did the same thing. Then I upgraded Ogre and changed the configuration to force it to use TBB (tbb was up to date in my setup). It's when I hit the Ogre bug that I fixed in CMake.
Concerning the sleep bug I'm seeing, I remarked it after that point, but it might have been there before. It's hard to know because it's really rare on my laptop (win7 64bit).
Let me know if you reach to reproduce it. Vicente

Sorry for the delay, I had to move to another more pressing problem. Anyway, my friend who can reproduce the problem wasn't available this month, and meanwhile I managed to reproduce the bug with my current version of the loop. Reminder: I'm using Boost 1.54, Visual Studio 2012 Update 3. Yesterday night I decided to try to build a repro case and now I think I see at least one of the problems but I need confirmation. I managed to build 2 repro cases at the end of this email, one very simple (A) and one a bit more complex and closer to what I actually do (B). Unfortunately, I am not sure if the problem in A, that I can understand, is really exactly the same as the problem in B. In A, the thread stops immediately, without even hitting the "TICK" log. Apparently the problem, even with the initial case which used this_thread::sleep_until, is related to condition_variable::wait_until. Basically, if the time given is 1 millisecond past the "now" timepoint used into the wait_until function, the thread will wait forever. Here is the implementation of wait_until in Boost 1.54: template <class Clock, class Duration> cv_status wait_until( unique_lock<mutex>& lock, const chrono::time_point<Clock, Duration>& t) { using namespace chrono; do_wait(lock, ceil<milliseconds>(t-Clock::now()).count()); return Clock::now() < t ? cv_status::no_timeout : cv_status::timeout; } My current hypothesis is that the ceil() call in this particular case doesn't do what is intended. Here is the implementation of ceil(): /** * rounds up */ template <class To, class Rep, class Period> To ceil(const duration<Rep, Period>& d) { To t = duration_cast<To>(d); if (t < d) ++t; return t; } Assuming the call to Clock::now() in wait_until() gives exactly t + 1ms, then ceil, if I'm correct will return 0. The call do_wait( 0 ) seems to wait forever, but I don't understand the details of the implementation. The same exact problem occurs with wait_for() (by the way, why does it always have to use steady_clock? see the implementation) Suggestion: add a check so that do_wait() returns immediately if the given time is <= 0. I'm assuming that the time given to the wait_until() call is always based on int-kind of value... if that's not the case, it might be harder to get right. However, in B, it is a bit more complex: the endless wait don't occur immediately but I managed to make it occur very quickly in my current setup at least. It seems that the problem is similar, but I couldnt' reproduce the behaviour of B in A by modifying the time to wait until, only (now - 1ms) seems to work. In B, we first acquire the current time point, then do some work, then call wait_until(). I believe it is certainly a variation of the bug in A but this is what happen in my application. It takes much more time to trigger the endless wait in my application but you can tweak the values to get the same result, for example by setting MIN_TIME_TASKS_SPAWN_FREQUENCY to 125 milliseconds (it's the value in my app). Basically, the problem seems to be that I introduce delay by working between the acquisition of the timepoint and the wait_until() call, which makes possible to have a duration value calculated into wait_until which would be ceild to 0, which would trigger the endless wait. First, please tell me if you can reproduce the problem. Second, I'll try to see if adding the (duration <= 0) check in do_wait() does fix the problem for me. I can reproduce it easily now so I should be able to figure it out today. ------------------------------ // A #include <boost/log/trivial.hpp> #include <boost/chrono.hpp> #include <boost/thread.hpp> #include <boost/thread/condition_variable.hpp> typedef boost::chrono::high_resolution_clock Clock; typedef Clock::time_point TimePoint; inline TimePoint real_time_now() { return Clock::now(); } int main() { using namespace boost::chrono; boost::condition_variable m_task_spawn_condition; boost::mutex main_thread_mutex; boost::unique_lock<boost::mutex> main_thread_lock( main_thread_mutex ); BOOST_LOG_TRIVIAL(info) << "[TaskScheduler::run_and_wait] Scheduling loop - BEGIN"; while( true ) { static const auto TIME_BACK = milliseconds(1); m_task_spawn_condition.wait_until( main_thread_lock, real_time_now() - TIME_BACK ); // wait forever //m_task_spawn_condition.wait_for( main_thread_lock, - TIME_BACK ); // same problem BOOST_LOG_TRIVIAL(trace) << "TICK"; } } ------------------------------ // B #include <atomic> #include <boost/log/trivial.hpp> #include <boost/chrono.hpp> #include <boost/thread.hpp> #include <boost/thread/condition_variable.hpp> typedef boost::chrono::high_resolution_clock Clock; typedef Clock::time_point TimePoint; inline TimePoint real_time_now() { return Clock::now(); } class Foo { std::atomic<bool> m_is_exiting; TimePoint m_next_tick_time; public: bool is_exiting() const { return m_is_exiting; } TimePoint spawn_tasks() // note that in my app, this call takes more time than here { using namespace boost::chrono; const auto now = real_time_now(); if( m_next_tick_time < now ) { m_next_tick_time = now + seconds(1); BOOST_LOG_TRIVIAL(info) << "TICK!"; } return m_next_tick_time; } }; int main() { using namespace boost::chrono; static const auto MIN_TIME_TASKS_SPAWN_FREQUENCY = milliseconds(1); //microseconds(1); // THE SHORTER THE QUICKER TO REPRODUCE THE BUG boost::condition_variable m_task_spawn_condition; Foo foo; boost::mutex main_thread_mutex; boost::unique_lock<boost::mutex> main_thread_lock( main_thread_mutex ); BOOST_LOG_TRIVIAL(info) << "[TaskScheduler::run_and_wait] Scheduling loop - BEGIN"; while( !foo.is_exiting() ) { const auto next_task_spawn_time = foo.spawn_tasks(); const auto now = real_time_now(); const auto next_minimum_spawn_time = now + MIN_TIME_TASKS_SPAWN_FREQUENCY; const auto next_spawn_time = next_task_spawn_time > TimePoint() && next_task_spawn_time < next_minimum_spawn_time ? next_task_spawn_time : next_minimum_spawn_time; const auto wait_time = next_spawn_time - now; if( wait_time > wait_time.zero() ) { BOOST_LOG_TRIVIAL(trace) << "WAIT TIME: " << wait_time; // UNCOMMENT THIS: MAKES IT WORKS. WAT?????? auto reason = m_task_spawn_condition.wait_until( main_thread_lock, next_spawn_time ); // DON'T WORK: WILL WAIT IF next_spawn_time is too close! } } }

Diving into the code, I just realize the real source of the problem: struct BOOST_SYMBOL_VISIBLE timeout { win32::ticks_type start; uintmax_t milliseconds; // HERE bool relative; boost::system_time abs_time; static unsigned long const max_non_infinite_wait=0xfffffffe; timeout(uintmax_t milliseconds_): // HERE do_wait() takses a timeout which will use a unsined int as milliseconds, which will make any negative number very big. Assuming ceil() returns a negative number (I might have made a mistake in my precedent mail), then we get a very big timeout.... I'm not sure how to fix this other than adding checks into wait_until and wait_for....

On Tue, Sep 3, 2013 at 12:05 PM, Klaim - Joël Lamotte <mjklaim@gmail.com>wrote:
Diving into the code, I just realize the real source of the problem:
struct BOOST_SYMBOL_VISIBLE timeout { win32::ticks_type start; uintmax_t milliseconds; // HERE bool relative; boost::system_time abs_time;
static unsigned long const max_non_infinite_wait=0xfffffffe;
timeout(uintmax_t milliseconds_): // HERE
do_wait() takses a timeout which will use a unsined int as milliseconds, which will make any negative number very big. Assuming ceil() returns a negative number (I might have made a mistake in my precedent mail), then we get a very big timeout....
I'm not sure how to fix this other than adding checks into wait_until and wait_for....
I tried replacing uintmax_t by long long in this quoted code, as a quick-fix, and it seems to work in my repro-cases. I'll try now recompiling my real application, see if it fix the problem there too. Joel Lamotte

A few points: 0. I only tried all my tests in 32bits. 1. For completeness, here is the CMake configuration I used to generate projects for both tests: cmake_minimum_required( VERSION 2.8.11 ) if( WIN32 AND CMAKE_SIZEOF_VOID_P EQUAL 8 ) # On Windows 64bit: check that the right boost binaries are set before continuing if( NOT DEFINED BOOST_LIBRARYDIR OR BOOST_LIBRARYDIR STREQUAL "BOOST_LIBRARYDIR-NOT-SET" ) set( BOOST_LIBRARYDIR "BOOST_LIBRARYDIR-NOT-SET" CACHE PATH "Location of the Boost library binaries" FORCE ) message( FATAL_ERROR "BOOST_LIBRARYDIR is not set (we need the 64bit version). Before continuing, please set it to the correct binary path (depending on if you want to link with 32 or 64bit version)." ) endif() endif() # Force Boost.Thread V4 use for all projects, including dependencies. add_definitions( -DBOOST_THREAD_VERSION=4 -DBOOST_RESULT_OF_USE_DECLTYPE ) set( Boost_USE_STATIC_LIBS ON ) set( Boost_USE_MULTITHREADED ON ) set( Boost_USE_STATIC_RUNTIME OFF ) find_package( Boost 1.54.0 REQUIRED COMPONENTS system date_time filesystem chrono thread program_options log log_setup iostreams ) if( NOT Boost_FOUND ) message( SEND_ERROR "Boost libraries NOT FOUND!" ) endif() include_directories( ${Boost_INCLUDE_DIR} ) add_executable( test main.cpp ) target_link_libraries( test ${Boost_LIBRARIES} ) # add_dependencies( test tbb ) 2. The quickfix seems to work even in my application. 3. I created a ticket: https://svn.boost.org/trac/boost/ticket/9079
participants (2)
-
Klaim - Joël Lamotte
-
Vicente J. Botet Escriba