[interprocess] message_queue hangs when another process dies

Hi,

I have found the following problem with message_queue on Win32 (MSVC 2008 + Windows Vista). Suppose we have two processes: one sends messages to a queue, the other reads them. When the reading process dies (for instance via "End Process") while blocked in message_queue::receive, the other process hangs in send. As I found during investigation, send tries to lock the interprocess mutex, but it is still held by the dead process, so this message queue instance becomes unusable. I attached an example.

Steps to reproduce:
1) Start "mq.exe s" (i.e. with argument s). It prints: sender: $PID
2) Start mq.exe without arguments. It prints: receiver: $PID 0 1 2 3 4. At this point the sender prints: send done:
3) Now the message queue is empty and the receiver waits for messages. End it with Task Manager.
4) Enter some number in the sender console and press Enter.
5) The sender hangs in message_queue::send.

Best Regards,
Sergei
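For reference, here is a minimal sketch of the kind of sender/receiver pair described above. The queue name "mq_demo" and the fixed-size int messages are assumptions; the attached example is not reproduced in the archive, so the details are illustrative only.

    // Minimal repro sketch; queue name and message layout are assumptions.
    #include <boost/interprocess/ipc/message_queue.hpp>
    #include <cstddef>
    #include <iostream>

    namespace bip = boost::interprocess;

    int main(int argc, char* argv[])
    {
        bip::message_queue mq(bip::open_or_create, "mq_demo", 100, sizeof(int));

        if (argc > 1 && argv[1][0] == 's') {           // sender
            std::cout << "sender" << std::endl;
            int value;
            while (std::cin >> value) {
                // Hangs here if the receiver was killed while holding
                // the queue's internal mutex/condition.
                mq.send(&value, sizeof(value), 0);
                std::cout << "send done" << std::endl;
            }
        } else {                                       // receiver
            std::cout << "receiver" << std::endl;
            for (;;) {
                int value = 0;
                std::size_t recvd = 0;
                unsigned int prio = 0;
                mq.receive(&value, sizeof(value), recvd, prio);
                std::cout << value << std::endl;
            }
        }
        return 0;
    }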

Hi Sergei, Sergei Politov wrote:
Hi,
I have found the following problem with message_queue on win32 (MSVC 2008 + Windows Vista).
Sadly, Interprocess message queues are built above interprocess_mutex and interprocess_condition, and those don't support process termination. It's not a bug, because they never tried to be termination-safe, but if you have an idea on how to do that without kernel support, I'd be glad if you could help. On Windows, shared memory is emulated with mapped files, so it's possible that the mapped file will get corrupted if you kill processes. Sorry for the bad news, Ion

On Wed, 9 Jul 2008, Ion Gaztañaga wrote:
Sergei Politov wrote:
I have found the following problem with message_queue on win32 (MSVC 2008 + Windows Vista).
Sadly, Interprocess message queues are built above interprocess_mutex and interprocess_condition, and those don't support process termination. It's not a bug, because they never tried to be termination-safe, but if you have an idea on how to do that without kernel support, I'd be glad if you could help. On Windows, shared memory is emulated with mapped files, so it's possible that the mapped file will get corrupted if you kill processes.
If you can trust an atomic compare-and-swap to work in shared memory, there are lock-free queue algorithms... e.g. the Michael/Scott algorithm. - Daniel
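A full Michael/Scott queue needs dynamically allocated nodes and safe memory reclamation, which is awkward in shared memory, but the underlying lock-free idea can be illustrated with a simpler structure. Below is a hedged sketch (modern C++; the RingQueue type and the "mq_lockfree" name are illustrative, not Boost.Interprocess API) of a bounded single-producer/single-consumer ring buffer placed in shared memory. Because neither side ever holds a lock, a crashed peer leaves the other side with a full or empty queue rather than a deadlock. It assumes std::atomic<std::size_t> is lock-free and address-free on the target platform.

    // Hedged sketch: a lock-free SPSC ring buffer living in shared memory.
    // Not the Michael/Scott algorithm itself, and not part of Boost.Interprocess.
    #include <boost/interprocess/shared_memory_object.hpp>
    #include <boost/interprocess/mapped_region.hpp>
    #include <atomic>
    #include <cstddef>
    #include <new>

    namespace bip = boost::interprocess;

    struct RingQueue
    {
        static const std::size_t Capacity = 64;     // slots, power of two
        std::atomic<std::size_t> head;               // advanced by the consumer
        std::atomic<std::size_t> tail;               // advanced by the producer
        int slots[Capacity];

        bool try_push(int v)                         // producer side only
        {
            std::size_t t = tail.load(std::memory_order_relaxed);
            if (t - head.load(std::memory_order_acquire) == Capacity)
                return false;                        // full, caller retries
            slots[t % Capacity] = v;
            tail.store(t + 1, std::memory_order_release);
            return true;
        }

        bool try_pop(int& v)                         // consumer side only
        {
            std::size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire))
                return false;                        // empty
            v = slots[h % Capacity];
            head.store(h + 1, std::memory_order_release);
            return true;
        }
    };

    int main()
    {
        bip::shared_memory_object shm(bip::open_or_create, "mq_lockfree",
                                      bip::read_write);
        shm.truncate(sizeof(RingQueue));
        bip::mapped_region region(shm, bip::read_write);
        // Only the first creator should construct/zero the queue; shown
        // unconditionally here for brevity.
        RingQueue* q = new (region.get_address()) RingQueue();
        int out = 0;
        q->try_push(42);
        q->try_pop(out);
        return out == 42 ? 0 : 1;
    }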

Ion Gaztañaga <igaztanaga <at> gmail.com> writes:
Sadly, Interprocess message queues are built above interprocess_mutex and interprocess_condition, and those don't support process termination. It's not a bug, because they never tried to be termination-safe, but if you have an idea on how to do that without kernel support, I'd be glad if you could help. On Windows, shared memory is emulated with mapped files, so it's possible that the mapped file will get corrupted if you kill processes.
Why can't we use named mutexes, which are available on Windows? Also, I would be glad to hear any suggestions about replacements for message_queue. Best Regards, Sergei
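For context, the property a native Windows named mutex offers here is abandoned-mutex detection: if the owning process dies, the next waiter's WaitForSingleObject call returns WAIT_ABANDONED instead of blocking forever. A hedged plain-Win32 sketch follows; the class and the way it is used are illustrative, and this is not how Boost.Interprocess is implemented.

    // Hedged sketch: native Windows named mutex with abandoned-owner
    // detection. Plain Win32, not Boost.Interprocess.
    #include <windows.h>
    #include <stdexcept>

    class NamedMutex
    {
        HANDLE h_;
    public:
        explicit NamedMutex(const wchar_t* name)
            : h_(::CreateMutexW(NULL, FALSE, name))
        {
            if (!h_) throw std::runtime_error("CreateMutexW failed");
        }
        ~NamedMutex() { ::CloseHandle(h_); }

        // Returns true if ownership was gained because the previous owner
        // died while holding the mutex (the protected data may then be
        // inconsistent); false for a normal acquisition.
        bool lock_detect_abandoned()
        {
            DWORD r = ::WaitForSingleObject(h_, INFINITE);
            if (r == WAIT_OBJECT_0)  return false;   // normal ownership
            if (r == WAIT_ABANDONED) return true;    // previous owner terminated
            throw std::runtime_error("WaitForSingleObject failed");
        }
        void unlock() { ::ReleaseMutex(h_); }
    };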

Sergei Politov <spolitov <at> gmail.com> writes:
[interprocess] message_queue hangs when another process dies
Suppose we have 2 processes, one sends messages to the queue, another reads them. When the reading process dies (for instance using End Process) during message_queue.receive, the other process hangs in send.
I've run into an apparently old problem with message_queue (see the post above from two years ago) and I am wondering if there isn't a fairly simple solution. It wouldn't be perfect, but it would be far better behavior than we have now. Please let me know if this looks like a good idea for inclusion in the Interprocess library.

The problem is that the message_queue send operation will block forever trying to send to a process that has been abnormally terminated. The send is trying to do an interprocess_condition::notify_one call. Inside interprocess_condition::notify it executes the statement "m_enter_mut.lock()". This mutex is holding back the send call from completing because the dead process still has ownership.

My solutions to the problem lie within the interprocess_condition class, as this is really the source of the problem.

Solution 1: Fixed timeout notify
--------------------------------
Change the mutex lock call in interprocess_condition::notify to a timed_lock call using a fixed timeout value. This feature could be enabled/disabled and the timeout value configured through preprocessor symbols.

Replace:

    inline void interprocess_condition::notify(boost::uint32_t command)
    {
       m_enter_mut.lock();

With:

    inline void interprocess_condition::notify(boost::uint32_t command)
    {
    #ifdef ENABLE_BOOST_INTERPROCESS_TIMEOUT
       boost::posix_time::ptime expires =
          boost::posix_time::microsec_clock::universal_time() +
          boost::posix_time::milliseconds(BOOST_INTERPROCESS_TIMEOUT_MS);
       if (!m_enter_mut.timed_lock(expires))
          throw timeout_exception();
    #else
       m_enter_mut.lock();
    #endif

This allows an exception to be thrown if the call waits too long at the mutex. This may be adequate for most applications; I don't see a good reason for this call to block for very long. The change will of course affect anything using interprocess_condition, which could be seen as a good thing or a bad thing. Good in that anything using it, like message_queue for instance, immediately gets the improved behavior: message_queue's send will now throw a timeout exception without any code changes. Bad in that the thrown exception may be unexpected behavior (although not expecting exceptions is not a wise thing).

Solution 2: Notify with timeout
-------------------------------
Introduce an interprocess_condition::notify overload that specifies how long to wait for the notification to complete.

Add:

    inline void interprocess_condition::notify(
       boost::uint32_t command, const boost::posix_time::ptime &abs_time)
    {
       if (!m_enter_mut.timed_lock(abs_time))
          throw timeout_exception();

This solution is much the same as the first, but introduces new methods to accomplish the functionality. The advantages of this approach are control of the timeout value and that existing functionality is not changed. The disadvantage is that software wanting this feature would need to be rewritten; for example, the message_queue send and try_send functions would need additional timeout parameters. One issue I have with this solution: why would anyone still want the old notify API? It seems the new methods would deprecate the old ones and create mild confusion. Actually, looking closely at the message_queue API, this presents some challenges:

    // We can add the timeout here, no problem.
    void send(const void *buffer, std::size_t buffer_size,
              unsigned int priority,
              const boost::posix_time::ptime &abs_time); // <-- new timeout value

    // The nature of this method is not to block, so adding a timeout value
    // here is counter-intuitive. But this is exactly what we need to do,
    // because it blocks in our exceptional case.
    bool try_send(const void *buffer, std::size_t buffer_size,
                  unsigned int priority,
                  const boost::posix_time::ptime &abs_time); // <-- new timeout value

    // Here we probably need to use the existing timeout for the
    // timed_notify call.
    bool timed_send(const void *buffer, std::size_t buffer_size,
                    unsigned int priority,
                    const boost::posix_time::ptime &abs_time); // <-- existing timeout value

I was originally thinking this was the best solution, but now, after looking at the details, Solution 1 is looking more appealing.

Solution 3: Try notify
----------------------
This solution pushes the waiting code back to the caller. The advantage is that it never blocks, but its usage will be more complicated.

Add:

    inline bool interprocess_condition::try_notify(boost::uint32_t command)
    {
       if (!m_enter_mut.try_lock())
          return false;

I'm not sure I love this solution, but it could be used to create a better-behaved message_queue::try_send. One that would also be difficult to use, and I'm afraid not very popular (i.e. try_send returning false because it can't acquire the mutex right away).
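For completeness, here is a hedged sketch of what caller code could look like under Solution 1, assuming the patched library throws an interprocess exception out of send when the internal mutex times out. The timeout behavior and the recovery policy are taken from the proposal above, not from released Boost.

    // Hedged sketch of caller-side handling under "Solution 1".
    #include <boost/interprocess/ipc/message_queue.hpp>
    #include <iostream>

    namespace bip = boost::interprocess;

    bool send_or_give_up(bip::message_queue& mq, int value)
    {
        try {
            mq.send(&value, sizeof(value), 0);
            return true;
        }
        catch (const bip::interprocess_exception& e) {
            // With the proposed patch we land here instead of deadlocking
            // when the peer died while holding the queue's internal mutex.
            std::cerr << "send timed out / failed: " << e.what() << "\n";
            // The caller can now destroy and recreate the queue, or drop data.
            return false;
        }
    }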

Was this ever resolved? Are there any plans on resolving this in future versions? I am having the exact same issue. Are there any non-invasive solutions? Even if it means a possible loss of data?

kopo <kopinskc <at> msoe.edu> writes:
Was this ever resolved? Are there any plans on resolving this in future versions?
Not that I am aware of; I think many assumed it was too difficult to fix. But I have a proposed solution for which I am trying to generate some feedback. I am using the "Solution 1" modification to the library myself. For me, introducing a message_queue timeout was far better than having the Interprocess library deadlock.
I am having the exact same issue. Are there any non-invasive solutions? Even if it means a possible loss of data?
Are you using message_queue or interprocess_condition? I don't know what you mean by "non-invasive". The only solutions outside of a library fix are: 1) monitor your thread for a deadlock, and kill it and clean up if it is deadlocked; 2) ensure your processes are never terminated. I think #2 is how the library is meant to be used, but IMHO this is not a requirement most real-world applications can satisfy.
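As an illustration of option 1), here is a hedged sketch (modern C++; guarded_send and the 10 ms poll interval are illustrative) that runs the potentially deadlocking send on a helper thread and gives up after a deadline. The blocked thread cannot be killed safely, so on timeout it is simply abandoned and the queue is treated as dead; the shared_ptr keeps the queue alive for that orphaned thread.

    // Hedged sketch of workaround 1): a watchdog around a blocking send.
    #include <boost/interprocess/ipc/message_queue.hpp>
    #include <atomic>
    #include <chrono>
    #include <memory>
    #include <thread>

    namespace bip = boost::interprocess;

    // Returns false if the send did not finish within the deadline,
    // i.e. the queue is probably wedged by a dead peer.
    bool guarded_send(const std::shared_ptr<bip::message_queue>& mq,
                      int value, std::chrono::milliseconds deadline)
    {
        auto done = std::make_shared<std::atomic<bool>>(false);
        std::thread([mq, value, done] {
            try {
                int v = value;
                mq->send(&v, sizeof(v), 0);
            } catch (...) { /* treat a failed send as finished */ }
            done->store(true);
        }).detach();

        const auto give_up = std::chrono::steady_clock::now() + deadline;
        while (!done->load()) {
            if (std::chrono::steady_clock::now() > give_up)
                return false;   // helper stays blocked; stop using this queue
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
        return true;
    }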

Ross MacGregor wrote:
Not that I am aware of; I think many assumed it was too difficult to fix. But I have a proposed solution for which I am trying to generate some feedback. I am using the "Solution 1" modification to the library myself. For me, introducing a message_queue timeout was far better than having the Interprocess library deadlock.
I meant to reply to your original message with proposed solutions but clicked on the wrong post. I definitely thought that your first solution was probably the best, and plan on using it if no other solutions can be found. Ross MacGregor wrote:
Are you using message_queue or interprocess_condition? I don't know what you mean by "non-invasive". The only solutions outside of a library fix are: 1) monitor your thread for a deadlock, and kill it and clean up if it is deadlocked; 2) ensure your processes are never terminated.
I think #2 is how the library is meant to be used, but IMHO this is not a requirement most real-world applications can satisfy.
I was planning on using message_queue to pass log messages between external processes and a central logger.
1) This really could happen on any thread, so monitoring each thread for deadlock is not an option, let alone terminating it.
2) I would love to ensure that my logger process never terminates, but we don't live in a perfect world, and if it terminates abnormally (or even normally) I need the other processes to keep functioning normally.
When I say a non-invasive solution, I mean it would be best not to modify the library, so that other people who wish to use my API can do so with the existing Boost code base and do not have to take a specially modified version. I also want my consumers to have the ability to upgrade if bug fixes come out for the library.


kopo <kopinskc <at> msoe.edu> writes:
Ross MacGregor wrote:
I am using the "Solution 1" modification to the library myself.
I meant to reply to your original message with proposed solutions but clicked on the wrong post. I definitely thought that your first solution was probably the best, and plan on using it if no other solutions can be found.
I am considering a slight modification to this solution; I will post it in more detail when I get time. Basically, instead of changing interprocess_condition, we should modify the interprocess_mutex lock function to time out. The lock function could simply call through to timed_lock() and throw an exception if it times out. This would be a more effective change for the library.
When I say a non-invasive solution, I mean it would be best not to modify the library, so that other people who wish to use my API can do so with the existing Boost code base and do not have to take a specially modified version.
Yes, that is a real problem for a library writer. I am using 1.44 right now, and I also needed to patch Interprocess to work around the boot-time folder problem. Use the Boost bug tracker to check the status of your release. Since the library is completely header based, you may be able to replace one or more of the Boost header files with your own. It would be a REAL hack job though (making use of the include guards).
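A hedged sketch of that include-guard trick; the paths and file names are illustrative, and the patched copy must define the same include guard macro as the original header for the stock version to be suppressed:

    // Include a locally patched copy before anything pulls in the stock
    // header; its include guard then keeps the Boost original from being used.
    #include "patched/interprocess_mutex.hpp"            // defines the same guard macro
    #include <boost/interprocess/ipc/message_queue.hpp>  // now sees the patched mutex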

Ross MacGregor wrote:
I am considering a slight modification to this solution; I will post it in more detail when I get time. Basically, instead of changing interprocess_condition, we should modify the interprocess_mutex lock function to time out. The lock function could simply call through to timed_lock() and throw an exception if it times out. This would be a more effective change for the library.
Are you proposing a modification such as the following, in interprocess_condition.hpp?

    inline void interprocess_condition::notify(boost::uint32_t command)
    {
       //m_enter_mut.lock();
       boost::posix_time::ptime till =
          boost::posix_time::microsec_clock::universal_time() +
          boost::posix_time::milliseconds(1000);
       if (!m_enter_mut.timed_lock(till))
          throw interprocess_exception(lock_error);

That is, where you would just provide a timed lock on the notify? Possibly in a notify_timed function, where the time to wait could be specified at construction of the message queue. Are you planning on submitting your proposal to the library for consideration in the next release?

kopo <kopinskc <at> msoe.edu> writes:
Are you proposing a modification such as:
No, that was my original fix. This is the same solution, just moving the location of the exception. I now think this timeout needs to be added directly to interprocess_mutex. This would affect the files emulation/interprocess_mutex.hpp and posix/interprocess_mutex.hpp. If the BOOST_INTERPROCESS_ENABLE_TIMEOUT symbol is set, the lock() method would actually call timed_lock() using the BOOST_INTERPROCESS_TIMEOUT_MS value. If it fails to acquire the lock, it will throw a newly defined timeout_exception.
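A hedged sketch of what that change to interprocess_mutex could look like; the macro and exception names follow the proposal in this thread (BOOST_INTERPROCESS_ENABLE_TIMEOUT, BOOST_INTERPROCESS_TIMEOUT_MS, timeout_exception) and are not necessarily what would eventually be merged:

    // Hedged sketch of the proposed interprocess_mutex::lock() change.
    inline void interprocess_mutex::lock()
    {
    #ifdef BOOST_INTERPROCESS_ENABLE_TIMEOUT
       boost::posix_time::ptime expires =
          boost::posix_time::microsec_clock::universal_time() +
          boost::posix_time::milliseconds(BOOST_INTERPROCESS_TIMEOUT_MS);
       if (!this->timed_lock(expires))
          throw timeout_exception();   // assumed new exception type
    #else
       // original unbounded blocking lock body, unchanged
    #endif
    }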
Are you planning on submitting your proposal to the library for consideration in the next release?
Yes, I plan on submitting this, but I am afraid it may take some time. It's not a high priority for me, plus I have some vacation time next week.

Ross MacGregor <gordonrossmacgregor <at> gmail.com> writes:
Sergei Politov <spolitov <at> gmail.com> writes:
[interprocess] message_queue hangs when another process dies
Suppose we have 2 processes, one sends messages to the queue, another reads them. When the reading process dies (for instance using End Process) during message_queue.receive, the other process hangs in send.
I am wondering if there isn't a fairly simple solution. It wouldn't be perfect but would be far better behavior than we have now.
Just an update for everyone. Back in September, I worked with Ion to get a configurable timeout feature built into interprocess_mutex. It will throw a new timeout exception if a lock cannot be acquired within a set period of time (configurable at compile time). This will fix the issue of message_queue and interprocess_condition hanging when a cooperating process is terminated. Hopefully this feature will be available in Boost 1.48.
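If it ships as described, enabling it should be a compile-time switch. Here is a hedged sketch of the intended usage; the macro names below are a guess at the final spelling (names along the lines of BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING), so check the shipped headers before relying on them:

    // Hedged sketch: the macros must be defined before any Interprocess
    // header is included; the names are assumptions, verify against the release.
    #define BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING
    #define BOOST_INTERPROCESS_TIMEOUT_WHEN_LOCKING_DURATION_MS 10000
    #include <boost/interprocess/ipc/message_queue.hpp>
    // A lock that cannot be acquired within ~10 seconds now throws an
    // interprocess exception instead of blocking forever.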

On 11/3/2011 2:06 PM, Ross MacGregor wrote:
Just an update for everyone. Back in September, I worked with Ion to get a configurable timeout feature built into interprocess_mutex. It will throw a new timeout exception if a lock cannot be acquired within a set period of time (configurable at compile time). This will fix the issue of message_queue and interprocess_condition hanging when a cooperating process is terminated.
Hopefully this feature will be available in Boost 1.48.
I hope so too. Thanks much for your help. -DB
participants (6)
- David Byron
- dherring@ll.mit.edu
- Ion Gaztañaga
- kopo
- Ross MacGregor
- Sergei Politov