[thread] Can boost::this_thread::sleep_for() totally stop the thread?
I'm having, sometime, trouble with a rare heisenbug (rare but appearing several times a week). I described it there: http://stackoverflow.com/questions/17918548/can-windows-totally-stop-a-threa... There I ask if it's possible for windows to totally stop a thread that is sleeping too often. As I'm using boost::thread for sleeping this specific thread that, for unknown reason, sometime totally stop working, I'm also wondering if the boost::this_thread() implementation could not be doing something that the OS might not like? ----- I'll report the content of the page here: Can Windows totally stop a thread if it's sleeping too often? I have a rare heisenbug in a multi-threaded application where the main thread, and only this thread, will just do nothing. As it's an heisenbug it's really hard to understand why this is happening. The main thread is basically just looping. In the loop, it check several concurrent priority queues which contain tasks ordered by time to be executed. It pop a task, see if it's time to execute it. If it's time, it will just schedule it into TBB's task scheduler (using a root task which is the parent of all other tasks). If it's not time, the task is pushed again in the priority queue. That's for one cycle. At the end of the cycle, the main thread is put to sleep for a very short time that I expect will be longer in practice but it's not really a problem, I just don't want it to use too much resources when not necessary. Litterally: static const auto TIME_SCHEDULED_TASKS_SPAWN_FREQUENCY = microseconds(250); while( !m_task_scheduler.is_exiting() ) // check if the application should exit { m_clock_scheduler.spawn_realtime_tasks(); // here we spawn tasks if it's time this_thread::sleep_for( TIME_SCHEDULED_TASKS_SPAWN_FREQUENCY ); } m_clock_scheduler.clear_tasks(); m_root_task.wait_for_all(); I have a special task that just log a "TICK" message each second. It is automatically rescheduling until the end of the program. However, when the heisenbug appear, I can see the "TICK" disappearing and the application not doing anything else than the work that occurs in non-main threads. So it appear that only the main thread is touched. The problem can come from different places, maybe in the scheduling logic, but then it's also the only thread that have a sleep call. That sleep is a boost::this_thread::sleep_for(). My question is: **Is it possible that Windows (7 64bit) consider the main thread to be sleeping often and decide that it should sleep for a longer period of time than asked or be definitely ended?** I expect that it is not possible but I would like to be sure. I didn't find any precision on this in online documentation so far. ----- I really suspect that the problem can't come from the sleeping function but I need to eliminate each potential source of problem to isolate it. Thanks for your time. Joel Lamotte
As someone suggested to me in the comments of the page, I re-implemented the loop using a condition_variable so that pushing a task in the queue notify it. I have a friend who can reproduce the bug systematically (windows Vista, core 2 duo) so I sent him versions: - removing the sleep call works (but is a bit less performant apparently) - changing the loop to use condition_variable works So it looks like the sleep call does have a strange behaviour but I don't know the source of the problem. Using the condition_variable seems more appropriate so I'll keep this fix for now. Joel Lamotte
Le 29/07/13 14:44, Klaim - Joël Lamotte a écrit :
As someone suggested to me in the comments of the page, I re-implemented the loop using a condition_variable so that pushing a task in the queue notify it.
I have a friend who can reproduce the bug systematically (windows Vista, core 2 duo) so I sent him versions:
- removing the sleep call works (but is a bit less performant apparently) - changing the loop to use condition_variable works
So it looks like the sleep call does have a strange behaviour but I don't know the source of the problem. Using the condition_variable seems more appropriate so I'll keep this fix for now.
Hi, there were an issue fixed on 1.54 https://svn.boost.org/trac/boost/ticket/8136. Please, could you tell me if this fix is related? Best, Vicente
On Sat, Aug 3, 2013 at 8:42 AM, Vicente J. Botet Escriba < vicente.botet@wanadoo.fr> wrote:
there were an issue fixed on 1.54 https://svn.boost.org/trac/** boost/ticket/8136 https://svn.boost.org/trac/boost/ticket/8136.
Please, could you tell me if this fix is related?
It might be related to this fix mixed with another fix from another dependency. I use Ogre3D with TBB, where Ogre3D will look for TBB using a CMake script. There was a bug I reported[1] recently to the Ogre3D team because in specific conditions (TBB from sources+32bit) where the CMake script was adding add_definitions(-D_WIN32_WINNT=0x0501) which hid some windows functions use by Ogre code. I fixed it by adding a condition so that it does it only with MinGW (I'm using exclusively Visual Studio 2012 U3 right now). Now, I don't know if it's related, but this change looks like it might have an impact on the fix you're talking about. As I was saying, I have a friend who can reproduce the bug systematically (windows Vista, core 2 duo) so maybe finding a computer with these characteristics might make the issue find it easily. When I reimplemented the loop using condition variables - which was actually better because now the thread wait as long as the time necessary to spawn the next task and is woken up when a new task is pushed in the task priority queue - solves the problem for my friend, so there might be a strong difference in waiting code there that might be related? I'll try to make a repro-case with my friend tomorrow if it helps. [1] http://www.ogre3d.org/forums/viewtopic.php?f=4&t=78443 Joel Lamotte
On Sat, Aug 3, 2013 at 12:13 PM, Klaim - Joël Lamotte
It might be related to this fix mixed with another fix from another dependency.
To clarify: I first upgraded Boost to 1.54, changed my code accordingly to use boost::future, checked once that it compiled and did the same thing. Then I upgraded Ogre and changed the configuration to force it to use TBB (tbb was up to date in my setup). It's when I hit the Ogre bug that I fixed in CMake. Concerning the sleep bug I'm seeing, I remarked it after that point, but it might have been there before. It's hard to know because it's really rare on my laptop (win7 64bit). Joel Lamotte
Le 03/08/13 14:37, Klaim - Joël Lamotte a écrit :
On Sat, Aug 3, 2013 at 12:13 PM, Klaim - Joël Lamotte
wrote: It might be related to this fix mixed with another fix from another dependency.
To clarify: I first upgraded Boost to 1.54, changed my code accordingly to use boost::future, checked once that it compiled and did the same thing. Then I upgraded Ogre and changed the configuration to force it to use TBB (tbb was up to date in my setup). It's when I hit the Ogre bug that I fixed in CMake.
Concerning the sleep bug I'm seeing, I remarked it after that point, but it might have been there before. It's hard to know because it's really rare on my laptop (win7 64bit).
Let me know if you reach to reproduce it. Vicente
Sorry for the delay, I had to move to another more pressing problem.
Anyway, my friend who can reproduce the problem wasn't available this
month, and meanwhile I
managed to reproduce the bug with my current version of the loop.
Reminder: I'm using Boost 1.54, Visual Studio 2012 Update 3.
Yesterday night I decided to try to build a repro case and now I think I
see at least one of the problems but I need confirmation.
I managed to build 2 repro cases at the end of this email, one very simple
(A) and one a bit more complex and closer to what I actually do (B).
Unfortunately, I am not sure if the problem in A, that I can understand, is
really exactly the same as the problem in B.
In A, the thread stops immediately, without even hitting the "TICK" log.
Apparently the problem, even with the initial case which used
this_thread::sleep_until, is related to condition_variable::wait_until.
Basically, if the time given is 1 millisecond past the "now" timepoint used
into the wait_until function, the thread will wait forever.
Here is the implementation of wait_until in Boost 1.54:
template
Diving into the code, I just realize the real source of the problem: struct BOOST_SYMBOL_VISIBLE timeout { win32::ticks_type start; uintmax_t milliseconds; // HERE bool relative; boost::system_time abs_time; static unsigned long const max_non_infinite_wait=0xfffffffe; timeout(uintmax_t milliseconds_): // HERE do_wait() takses a timeout which will use a unsined int as milliseconds, which will make any negative number very big. Assuming ceil() returns a negative number (I might have made a mistake in my precedent mail), then we get a very big timeout.... I'm not sure how to fix this other than adding checks into wait_until and wait_for....
On Tue, Sep 3, 2013 at 12:05 PM, Klaim - Joël Lamotte
Diving into the code, I just realize the real source of the problem:
struct BOOST_SYMBOL_VISIBLE timeout { win32::ticks_type start; uintmax_t milliseconds; // HERE bool relative; boost::system_time abs_time;
static unsigned long const max_non_infinite_wait=0xfffffffe;
timeout(uintmax_t milliseconds_): // HERE
do_wait() takses a timeout which will use a unsined int as milliseconds, which will make any negative number very big. Assuming ceil() returns a negative number (I might have made a mistake in my precedent mail), then we get a very big timeout....
I'm not sure how to fix this other than adding checks into wait_until and wait_for....
I tried replacing uintmax_t by long long in this quoted code, as a quick-fix, and it seems to work in my repro-cases. I'll try now recompiling my real application, see if it fix the problem there too. Joel Lamotte
A few points: 0. I only tried all my tests in 32bits. 1. For completeness, here is the CMake configuration I used to generate projects for both tests: cmake_minimum_required( VERSION 2.8.11 ) if( WIN32 AND CMAKE_SIZEOF_VOID_P EQUAL 8 ) # On Windows 64bit: check that the right boost binaries are set before continuing if( NOT DEFINED BOOST_LIBRARYDIR OR BOOST_LIBRARYDIR STREQUAL "BOOST_LIBRARYDIR-NOT-SET" ) set( BOOST_LIBRARYDIR "BOOST_LIBRARYDIR-NOT-SET" CACHE PATH "Location of the Boost library binaries" FORCE ) message( FATAL_ERROR "BOOST_LIBRARYDIR is not set (we need the 64bit version). Before continuing, please set it to the correct binary path (depending on if you want to link with 32 or 64bit version)." ) endif() endif() # Force Boost.Thread V4 use for all projects, including dependencies. add_definitions( -DBOOST_THREAD_VERSION=4 -DBOOST_RESULT_OF_USE_DECLTYPE ) set( Boost_USE_STATIC_LIBS ON ) set( Boost_USE_MULTITHREADED ON ) set( Boost_USE_STATIC_RUNTIME OFF ) find_package( Boost 1.54.0 REQUIRED COMPONENTS system date_time filesystem chrono thread program_options log log_setup iostreams ) if( NOT Boost_FOUND ) message( SEND_ERROR "Boost libraries NOT FOUND!" ) endif() include_directories( ${Boost_INCLUDE_DIR} ) add_executable( test main.cpp ) target_link_libraries( test ${Boost_LIBRARIES} ) # add_dependencies( test tbb ) 2. The quickfix seems to work even in my application. 3. I created a ticket: https://svn.boost.org/trac/boost/ticket/9079
participants (2)
-
Klaim - Joël Lamotte
-
Vicente J. Botet Escriba