Boost Threads Condition may block forever on clock change

Hello, I've noticed a strange change in behaviour when condition is called with xtime parameter, condition::timed_wait. If right after the condition is called the clock is moved back the code never exits from condition unless signalled, even moving the clock back forward did not seem to fix the problem. The test was done on Windows XP. Is this desired behaviour or a bug? It seems strange that the condition is based on the current time and not just on a time offset. Any help with resolving this would be really appreciated. Thank you, Alex

<ak1mbox-boost <at> yahoo.ca> writes:
I've noticed a strange change in behaviour when condition is called with xtime parameter, condition::timed_wait. If right after the condition is called the clock is moved back the code never exits from condition unless signalled, even moving the clock back forward did not seem to fix the problem. The test was done on Windows XP.
Is this desired behaviour or a bug? It seems strange that the condition is based on the current time and not just on a time offset.
Any help with resolving this would be really appreciated.
The implementation uses an absolute time internally, since absolute times are composable --- though the win32 API calls are made with a timeout in milliseconds, it is the number of milliseconds until the supplied absolute time. This allows for multiple win32 API calls without having to work out how much of the timeout has elapsed --- this is done implicitly by the calculation of the number of milliseconds remaining. Ideally we would have a monotonically increasing timer which is independent of the system clock. Unfortunately, we don't have such a timer --- we have to rely on the system's idea of UTC time. If the clock is set back after the timeout for the wait has been chosen, but before the number of milliseconds to wait in a win32 API call has been calculated, the number of milliseconds will be rather large. Anthony

Anthony Williams wrote:
The implementation uses an absolute time internally, since absolute times are composable --- though the win32 API calls are made with a timeout in milliseconds, it is the number of milliseconds until the supplied absolute time. This allows for multiple win32 API calls without having to work out how much of the timeout has elapsed --- this is done implicitly by the calculation of the number of milliseconds remaining.
Ideally we would have a monotonically increasing timer which is independent of the system clock. Unfortunately, we don't have such a timer --- we have to rely on the system's idea of UTC time. If the clock is set back after the timeout for the wait has been chosen, but before the number of milliseconds to wait in a win32 API call has been calculated, the number of milliseconds will be rather large.
With all due respect, it should be clear that such a dependency on a user-controlled setting that can indefinably block a program is simply not an acceptable design choice for professional programmers. No matter what the reason, infinite blocking behavior is a bug in the boost implementation and must be fixed in boost, not elsewhere. Thorsten

Thorsten Froehlich <froetho <at> iit.edu> writes:
Anthony Williams wrote:
The implementation uses an absolute time internally, since absolute times are composable --- though the win32 API calls are made with a timeout in milliseconds, it is the number of milliseconds until the supplied absolute time. This allows for multiple win32 API calls without having to work out how much of the timeout has elapsed --- this is done implicitly by the calculation of the number of milliseconds remaining.
Ideally we would have a monotonically increasing timer which is independent of the system clock. Unfortunately, we don't have such a timer --- we have to rely on the system's idea of UTC time. If the clock is set back after the timeout for the wait has been chosen, but before the number of milliseconds to wait in a win32 API call has been calculated, the number of milliseconds will be rather large.
With all due respect, it should be clear that such a dependency on a user-controlled setting that can indefinably block a program is simply not an acceptable design choice for professional programmers. No matter what the reason, infinite blocking behavior is a bug in the boost implementation and must be fixed in boost, not elsewhere.
Harsh words. This is not infinite blocking, just blocking with a long timeout (e.g. an hour if you set the clock back an hour). This has been a property of boost threads since it was first committed, over 6 years ago. The problem is that the POSIX and win32 APIs have different policies with respect to timeouts. POSIX takes absolute times, whereas win32 takes relative times. As the POSIX spec points out, there is a race condition implementing an absolute timeout on top of a relative-timeout-based API. If we want to support absolute timeouts (and we do), then their use on Windows will always be subject to this race condition. If the user changes the clock, this just exacerbates the problem. However, the timeout is an absolute time: if you just set the clock back an hour, then it's an hour longer before the specified absolute time is reached. The problem is when the clock is advanced forward passed the absolute timeout --- on Windows, the timeout is expressed in milliseconds, and this is independent of the clock time, so once we're waiting, we're waiting. In addition, the win32 API does not support condition variables pre-Vista, so we need to implement them using the available win32 primitives. This requires waiting multiple times on different synchronization primitives, and looping. The only reliable way to get the timeout right is to calculate the absolute timeout at the start, and use that as the basis for all the relative timeouts on the individual calls. Having said all that, there may be things that can be done. Windows apps *should* send WM_TIMECHANGE when they update the clock, so if we're in a message-handling thread, and we receive that message, we can potentially handle that by interrupting the wait and resuming based on the new clock time. That's quite a few "if"s, though. The GetTickCount API is good for 49 days, so we could use that as the basis for the timeout once the wait routine was actually called, but it still won't handle clock changes --- once the timeout is calculated it will remain fixed. This is a hard problem, and not one to be dismissed with "simply not an acceptable design choice for professional programmers". Anthony

Anthony Williams wrote:
As the POSIX spec points out, there is a race condition implementing an absolute timeout on top of a relative-timeout-based API. If we want to support absolute timeouts (and we do)
But why does this imply forcing everybody to use absolute timeouts only in the boost interface? Sure, there are as many uses for relative timeouts as there are for absolute timeouts in an interface, yet only absolute timeouts have this race condition. IMHO it makes more sense to use relative timeouts internally and have users compute relative timeouts for their absolute timeouts if they need them. The benefit would be that the absolute time needs to be queried only once to compute a relative timeout in code that wishes to work with absolute timeouts, eliminating the race condition on Windows. POSIX "users" can get predictable behavior using CLOCK_MONOTONIC even with the boost code internally using relative timeouts. Making this change, the race condition goes away on one more platform using in the boost interface. It should also be considered that relative timeouts in the boost interface certainly eliminate a problem on a platform with a large market share relevant to many boost developers for better or worse ;-) As "xtime" is documented as "...temporary solution that will be replaced by a more robust time library once available in Boost." I cannot see a good argument to keep it given its real problems. Further, the relative timeouts also create a much neater interface compared to what is currently needed to get something as simple as a 10 millisecond timeout: xtime t; get_xtime(&t, TIME_UTC); t.nsec += 10000000; if(t.nsec >= 1000000000) { t.nsec %= 1000000000; t.sec++; } condition bar; bar.timed_wait(somelock, t); to condition bar; bar.timed_wait(somelock, 10000000); For the absolute time case, the date_time library would be a nice help (not that I have used it before, so not sure the code below would be legal or not). First, looking at the current code (assuming "myabsolutetime" is a ptime object): time_duration td(second_clock::universal_time() - myabsolutetime); xtime t; t.nsec = td.total_nanoseconds() % 1000000000; t.sec = td.total_nanoseconds() - t.nsec; condition bar; bar.timed_wait(somelock, t); to time_duration td(second_clock::universal_time() - myabsolutetime); condition bar; bar.timed_wait(somelock, td.total_nanoseconds()); A lot nicer, isn't it? - Especially I strongly assume most programs rarely need absolute timeouts just a few milliseconds or even nanoseconds apart and using the date_time library's potential overhead (assuming there is a significant overhead) will not be a major burden for those who need absolute timeouts. Of course, what this does not address is the issue of absolute timeouts that intentionally depend on absolute time being changed by the user ... but is this really so important to justify a fully absolute timeout-based interface? Thorsten

Thorsten Froehlich <froetho@iit.edu> writes:
Anthony Williams wrote:
As the POSIX spec points out, there is a race condition implementing an absolute timeout on top of a relative-timeout-based API. If we want to support absolute timeouts (and we do)
But why does this imply forcing everybody to use absolute timeouts only in the boost interface? Sure, there are as many uses for relative timeouts as there are for absolute timeouts in an interface, yet only absolute timeouts have this race condition.
The new thread interface supports both in most places.
IMHO it makes more sense to use relative timeouts internally and have users compute relative timeouts for their absolute timeouts if they need them.
I side with the POSIX spec on this one: absolute timeouts are more scalable.
The benefit would be that the absolute time needs to be queried only once to compute a relative timeout in code that wishes to work with absolute timeouts, eliminating the race condition on Windows. POSIX "users" can get predictable behavior using CLOCK_MONOTONIC even with the boost code internally using relative timeouts.
On Windows, the internal relative timeouts need to be calculated potentially many times, as the code may need to wait multiple times, on either the same or different internal sync objects.
Making this change, the race condition goes away on one more platform using in the boost interface. It should also be considered that relative timeouts in the boost interface certainly eliminate a problem on a platform with a large market share relevant to many boost developers for better or worse ;-)
How does the race condition go away by making everyone use relative timeouts?
As "xtime" is documented as "...temporary solution that will be replaced by a more robust time library once available in Boost." I cannot see a good argument to keep it given its real problems.
xtime is only kept for backwards compatibility. The new interface uses boost::system_time, which is a typedef for boost::posix_time::ptime from boost::date_time
Of course, what this does not address is the issue of absolute timeouts that intentionally depend on absolute time being changed by the user ... but is this really so important to justify a fully absolute timeout-based interface?
Relative timeouts are now supported in most boost::thread interfaces. The thread docs are out of date: I am in the process of updating them. Anthony -- Anthony Williams Just Software Solutions Ltd - http://www.justsoftwaresolutions.co.uk Registered in England, Company Number 5478976. Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Anthony Williams wrote:
Making this change, the race condition goes away on one more platform using in the boost interface. It should also be considered that relative timeouts in the boost interface certainly eliminate a problem on a platform with a large market share relevant to many boost developers for better or worse ;-)
How does the race condition go away by making everyone use relative timeouts?
You no longer internally (invisible to the user of boost::thread) query the absolute time makes it go away for Windows. And setting the clock condition attribute CLOCK_MONOTONIC with POSIX would make it impossible there as far as I can tell, does it not?
Relative timeouts are now supported in most boost::thread interfaces. The thread docs are out of date: I am in the process of updating them.
Ah, good to know. I was indeed looking at the old documentation online... Thorsten

Thorsten Froehlich <froetho@iit.edu> writes:
Anthony Williams wrote:
Making this change, the race condition goes away on one more platform using in the boost interface. It should also be considered that relative timeouts in the boost interface certainly eliminate a problem on a platform with a large market share relevant to many boost developers for better or worse ;-)
How does the race condition go away by making everyone use relative timeouts?
You no longer internally (invisible to the user of boost::thread) query the absolute time makes it go away for Windows. And setting the clock condition attribute CLOCK_MONOTONIC with POSIX would make it impossible there as far as I can tell, does it not?
I've updated the implementation of boost::condition_variable and boost::condition_variable_any to allow for relative timeouts as well as absolute timeouts on the predicate version of timed_wait. On POSIX, the timeouts are always absolute under-the-hood, so a relative timeout is converted into an absolute one. On Windows, the relative and absolute timeouts are now treated differently. If an absolute timeout is specified (using boost::xtime or boost::system_time), then this is converted into a relative timeout immediately prior to each internal wait by comparison against the system clock. If the clock is changed, then this may or may not affect the duration of the wait, depending on whether the clock change occurred before or after each wait period. If a relative timeout is specified (using e.g. boost::posix_time::seconds), then the GetTickCount API is used as a baseline for the relative timeout, and before each internal wait the current tick count is compared against the baseline in order to determine the remaining time. Note that the non-predicate version of timed_wait still only takes an absolute time, because the potential for spurious wake-ups makes it hard to use correctly with a relative time, as demonstrated by the bug in APR. Anthony -- Anthony Williams Just Software Solutions Ltd - http://www.justsoftwaresolutions.co.uk Registered in England, Company Number 5478976. Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Anthony Williams:
If a relative timeout is specified (using e.g. boost::posix_time::seconds), then the GetTickCount API is used as a baseline for the relative timeout, and before each internal wait the current tick count is compared against the baseline in order to determine the remaining time.
We should probably think of a way to enable the use of a monotonic clock with the absolute timeout overloads. It doesn't feel right to make the relative overloads the only way to access the functionality.

"Peter Dimov" <pdimov@pdimov.com> writes:
Anthony Williams:
If a relative timeout is specified (using e.g. boost::posix_time::seconds), then the GetTickCount API is used as a baseline for the relative timeout, and before each internal wait the current tick count is compared against the baseline in order to determine the remaining time.
We should probably think of a way to enable the use of a monotonic clock with the absolute timeout overloads. It doesn't feel right to make the relative overloads the only way to access the functionality.
Yes, we should. However, absolute time is always going to have issues with clock changes --- if I schedule for something now (2007-11-27 18:23:15 UTC by the current clock) to run tomorrow (2007-11-28 18:23:15 UTC), does it schedule for 24-hours from now, or for when-the-clock-reads the right time? As far as I can gather, POSIX expects the latter. Windows only offers relative timeouts, so we get the former, though there's a potential race between calculating "now" vs changing the clock. If I had a reliable way to abort a wait if the clock changed. Anthony -- Anthony Williams Just Software Solutions Ltd - http://www.justsoftwaresolutions.co.uk Registered in England, Company Number 5478976. Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Anthony Williams:
We should probably think of a way to enable the use of a monotonic clock with the absolute timeout overloads. It doesn't feel right to make the relative overloads the only way to access the functionality.
Yes, we should. However, absolute time is always going to have issues with clock changes --- if I schedule for something now (2007-11-27 18:23:15 UTC by the current clock) to run tomorrow (2007-11-28 18:23:15 UTC), does it schedule for 24-hours from now, or for when-the-clock-reads the right time?
These are UTC times. The monotonic clock (CLOCK_MONOTONIC) is not user-settable and for it, now+K always occurs after K seconds. For CLOCK_REALTIME, it schedules for the given time. http://www.opengroup.org/onlinepubs/009695399/functions/pthread_condattr_get... http://www.opengroup.org/onlinepubs/000095399/functions/clock_getres.html

"Peter Dimov" <pdimov@pdimov.com> writes:
Anthony Williams:
We should probably think of a way to enable the use of a monotonic clock with the absolute timeout overloads. It doesn't feel right to make the relative overloads the only way to access the functionality.
Yes, we should. However, absolute time is always going to have issues with clock changes --- if I schedule for something now (2007-11-27 18:23:15 UTC by the current clock) to run tomorrow (2007-11-28 18:23:15 UTC), does it schedule for 24-hours from now, or for when-the-clock-reads the right time?
These are UTC times. The monotonic clock (CLOCK_MONOTONIC) is not user-settable and for it, now+K always occurs after K seconds. For CLOCK_REALTIME, it schedules for the given time.
http://www.opengroup.org/onlinepubs/009695399/functions/pthread_condattr_get... http://www.opengroup.org/onlinepubs/000095399/functions/clock_getres.html
OK, so we have three overloads: cv.timed_wait(mutex,system_time,pred); cv.timed_wait(mutex,duration,pred); cv.timed_wait(mutex,monotonic_clock_time,pred); The question is: how do we implement a monotonic clock? CLOCK_MONOTONIC is optional on POSIX, and GetTickCount wraps on Windows. On Windows, we could use the System Up Time performance Counter, but I'm not sure what the resolution is. Anthony -- Anthony Williams Just Software Solutions Ltd - http://www.justsoftwaresolutions.co.uk Registered in England, Company Number 5478976. Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Anthony Williams <anthony_w.geo@yahoo.com> writes:
On Windows, we could use the System Up Time performance Counter, but I'm not sure what the resolution is.
It appears that the counter is provided as a number of 100ns (like FILETIME), but it is not updated that fast --- it appears to be only updated every few milliseconds on my system. Also, it's not available on Windows 9x. Anthony -- Anthony Williams Just Software Solutions Ltd - http://www.justsoftwaresolutions.co.uk Registered in England, Company Number 5478976. Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Anthony Williams wrote: [snip]
As far as I can gather, POSIX expects the latter. Windows only offers relative timeouts, so we get the former, though there's a potential race between calculating "now" vs changing the clock. If I had a reliable way to abort a wait if the clock changed.
For Windows: As you said yourself earlier, WM_TIMECHANGE is available. Create a hidden window and run the message loop in a worker thread to catch those msgs. This could presumably be implemented using some kind of lazy init to work transparently even for the DLL version of Boost.Thread. IIRC, WM_TIMECHANGE is broadcasted by the system on time changes since w2k (and strongly recommended for applications before that). Whether that is reliable enough would be up to you. / Johan

Johan Nilsson <r.johan.nilsson <at> gmail.com> writes:
Anthony Williams wrote:
As far as I can gather, POSIX expects the latter. Windows only offers relative timeouts, so we get the former, though there's a potential race between calculating "now" vs changing the clock. If I had a reliable way to abort a wait if the clock changed.
For Windows: As you said yourself earlier, WM_TIMECHANGE is available. Create a hidden window and run the message loop in a worker thread to catch those msgs. This could presumably be implemented using some kind of lazy init to work transparently even for the DLL version of Boost.Thread.
Thanks for the suggestion, but I don't like the idea of forcing the creation of a background thread with a hidden Window just to handle the rare event of the system clock being changed during a wait. Also, you've still got to wake the waiting threads. You could do this with a manual-reset event, but how do you know to reset it? Another option is just to have the wait divide the wait up so it only waits for a maximum of some small period (e.g. 100ms or 10s or something) before waking up and checking the system time, and resuming waiting if the time has not expired. Again, this adds overhead.
IIRC, WM_TIMECHANGE is broadcasted by the system on time changes since w2k (and strongly recommended for applications before that). Whether that is reliable enough would be up to you.
That's good to know. Anthony

Anthony Williams wrote:
Johan Nilsson <r.johan.nilsson <at> gmail.com> writes:
[snip]
For Windows: As you said yourself earlier, WM_TIMECHANGE is available. Create a hidden window and run the message loop in a worker thread to catch those msgs. This could presumably be implemented using some kind of lazy init to work transparently even for the DLL version of Boost.Thread.
Thanks for the suggestion, but I don't like the idea of forcing the creation of a background thread with a hidden Window just to handle the rare event of the system clock being changed during a wait. Also, you've still got to wake the waiting threads. You could do this with a manual-reset event, but how do you know to reset it?
[Caveat: just thinking out loud here] What about using user-mode APCs? Queue the user APCs from the background thread when the time changes. Whenever a thread is created, they could register a APC do-nothing callback (and remove on destruction). For all timed waits inside the thread library, use the alertable Wait functions (WaitXxxEx) and handle the APC delivery accordingly.
Another option is just to have the wait divide the wait up so it only waits for a maximum of some small period (e.g. 100ms or 10s or something) before waking up and checking the system time, and resuming waiting if the time has not expired. Again, this adds overhead.
Well, that's also susceptible to a race. You really can't know for sure how much time that elapses between your sleeps and wake-ups (and adjust the total according to that). / Johan

Anthony Williams wrote: [...]
system clock being changed during a wait. Also, you've still got to wake the waiting threads. You could do this with a manual-reset event, but how do you know to reset it?
You can extend predicates for timed waits with clock change event count atomic and simply broadcast the condvars (after lock-unlock of associated mutexes) on clock change. regards, alexander.

Johan Nilsson <r.johan.nilsson <at> gmail.com> writes:
Anthony Williams wrote:
As far as I can gather, POSIX expects the latter. Windows only offers relative timeouts, so we get the former, though there's a potential race between calculating "now" vs changing the clock. If I had a reliable way to abort a wait if the clock changed.
For Windows: As you said yourself earlier, WM_TIMECHANGE is available. Create a hidden window and run the message loop in a worker thread to catch those msgs. This could presumably be implemented using some kind of lazy init to work transparently even for the DLL version of Boost.Thread.
It appears that the win32 API does provide a solution for this after all: Waitable timers. According to the docs, when you specify an absolute timeout for a timer it tracks clock changes. I've checked in a new version that uses these where available. Since the resolution of waitable timers is only about 10ms, if the wait is less than 20ms then that a simple relative wait is used (even if the initial target was absolute). At the moment, this is only done for the interruptible waits (sleep and condition::timed_wait), but a similar scheme could be used for all timed waits (e.g. mutex timed lock). Anthony

Anthony Williams wrote: [...]
Waitable timers. According to the docs, when you specify an absolute timeout for a timer it tracks clock changes.
That isn't really helpful given that win32 waitable timers don't synchronize with condvar waiters by locking associated mutexes. regards, alexander.

Alexander Terekhov <terekhov <at> web.de> writes:
Anthony Williams wrote: [...]
Waitable timers. According to the docs, when you specify an absolute timeout for a timer it tracks clock changes.
That isn't really helpful given that win32 waitable timers don't synchronize with condvar waiters by locking associated mutexes.
The point is that a condvar waiter doing a timed wait can use the waitable timer as part of a WaitForMultipleObjects call waiting on the sync objects internal to the cond var, rather than using a timeout: if the waitable timer is triggered then the specified absolute time has been reached, and the call has timed out. Of course, the condvar implementation then needs to handle relocking the mutexes. Anthony

Anthony Williams wrote:
Alexander Terekhov <terekhov <at> web.de> writes:
Anthony Williams wrote: [...]
Waitable timers. According to the docs, when you specify an absolute timeout for a timer it tracks clock changes.
That isn't really helpful given that win32 waitable timers don't synchronize with condvar waiters by locking associated mutexes.
The point is that a condvar waiter doing a timed wait can use the waitable timer as part of a WaitForMultipleObjects call waiting on the sync objects internal to the cond var, rather than using a timeout: if the waitable timer is triggered then the specified absolute time has been reached, and the call has timed out.
Of course, the condvar implementation then needs to handle relocking the mutexes.
I meant the race between clock change and "WaitForMultipleObjects call waiting". regards, alexander.

Alexander Terekhov <terekhov <at> web.de> writes:
Anthony Williams wrote:
Alexander Terekhov <terekhov <at> web.de> writes:
Anthony Williams wrote: [...]
Waitable timers. According to the docs, when you specify an absolute timeout for a timer it tracks clock changes.
That isn't really helpful given that win32 waitable timers don't synchronize with condvar waiters by locking associated mutexes.
The point is that a condvar waiter doing a timed wait can use the waitable timer as part of a WaitForMultipleObjects call waiting on the sync objects internal to the cond var, rather than using a timeout: if the waitable timer is triggered then the specified absolute time has been reached, and the call has timed out.
Of course, the condvar implementation then needs to handle relocking the mutexes.
I meant the race between clock change and "WaitForMultipleObjects call waiting".
Here's the flow as it stands at the moment: 1. User calls timed_wait with an absolute time 2. Thread added to waiter list and external mutex unlocked 3. Waitable timer configured for specified absolute time 4. Thread waits for internal notify or timer expiry 5a. if timer expired we're done: relock external mutex 5b. if we were notified, check if it was really us: if so, we're done: relock external mutex 6. If we didn't time out, and the notify wasn't us, loop to 3. Since the waitable timer is tied to the system clock time, the only "race" is if the clock is changed before the timer is configured in step 3. I believe the only issue is if the clock changes multiple times in quick succession. If, for example, the clock is changed back and then forward again, then the timer may or may not expire depending on whether it was configured before or after the forward change. There is no race with WaitForMultipleObjects, as the timer is tied to the system clock. If the timer has expired by the time we wait, then we time out immediately. It might be better to configure the timer once rather than multiple times, but that's not a big deal. Anthony

Anthony Williams wrote:
Thorsten Froehlich <froetho@iit.edu> writes:
Anthony Williams wrote:
Making this change, the race condition goes away on one more platform using in the boost interface. It should also be considered that relative timeouts in the boost interface certainly eliminate a problem on a platform with a large market share relevant to many boost developers for better or worse ;-) How does the race condition go away by making everyone use relative timeouts? You no longer internally (invisible to the user of boost::thread) query the absolute time makes it go away for Windows. And setting the clock condition attribute CLOCK_MONOTONIC with POSIX would make it impossible there as far as I can tell, does it not?
I've updated the implementation of boost::condition_variable and boost::condition_variable_any to allow for relative timeouts as well as absolute timeouts on the predicate version of timed_wait.
On POSIX, the timeouts are always absolute under-the-hood, so a relative timeout is converted into an absolute one.
Thank you _very_ much for the time you put into improving the thread implementation! Looks like there are some very useful thread changes to look forward to in the next boost release :-) Thorsten
participants (6)
-
ak1mbox-boost@yahoo.ca
-
Alexander Terekhov
-
Anthony Williams
-
Johan Nilsson
-
Peter Dimov
-
Thorsten Froehlich