[test] Possible design flaw in execution_monitor::catch_signals

A recent problem I encountered that caused the total failure of a regression test run has highlighted what I (and others) believe to be a design flaw in execution_monitor::catch_signals (in file execution_monitor.ipp). IIUC the purpose of the code in question is to intercept UNIX signals and convert them to exceptions which contain a diagnostic message indicating the signal type. For most signals this should work reliably. However, we suggest that this mechanism is prone to failure when attempting to deal with a SIGSEGV, as happened in my case. What I observed was a test process that had stalled by blocking against a mutex. Detailed analysis showed that the test failed with a memory segment violation resulting in a SIGSEGV. In brief, the primary execution failure corrupts the program heap. When the C++ exception is thrown at execution_monitor:462, the exception handling mechanism calls __cxa_allocate_exception which then calls std::malloc. But, because of the corrupted heap, this call blocks against the malloc mutex. This is the specific case with QNX, and other OSs will deal with this in different ways. The general point I would like to make is that after a memory segment violation, any process's memory will be in an undefined state and it is unreasonable to assume that it will be capable of continued execution as per the present design. I propose that there is no safe way to intercept a SIGSEGV and that it should be allowed to terminate the process without intervention. Comments please... Jim

Jim Douglas wrote:
A recent problem I encountered that caused the total failure of a regression test run has highlighted what I (and others) believe to be a design flaw in execution_monitor::catch_signals (in file execution_monitor.ipp).
IIUC the purpose of the code in question is to intercept UNIX signals and convert them to exceptions which contain a diagnostic message indicating the signal type. For most signals this should work reliably. However, we suggest that this mechanism is prone to failure when attempting to deal with a SIGSEGV, as happened in my case.
What I observed was a test process that had stalled by blocking against a mutex. Detailed analysis showed that the test failed with a memory segment violation resulting in a SIGSEGV.
In brief, the primary execution failure corrupts the program heap. When the C++ exception is thrown at execution_monitor:462, the exception handling mechanism calls __cxa_allocate_exception which then calls std::malloc. But, because of the corrupted heap, this call blocks against the malloc mutex.
This is the specific case with QNX, and other OSs will deal with this in different ways. The general point I would like to make is that after a memory segment violation, any process's memory will be in an undefined state and it is unreasonable to assume that it will be capable of continued execution as per the present design. I propose that there is no safe way to intercept a SIGSEGV and that it should be allowed to terminate the process without intervention.
Completely correct. The SIGSEGV handler is not allowed to return (except in the case where the signal was generated by a raise()), doing so is undefined bahaviour. Whatever extra stuff needs to be done to report the error, needs to be done in the signal handler itself, using async-signal-safe functions only. Definitely no malloc(). Cheers, Ian

In brief, the primary execution failure corrupts the program heap. When the C++ exception is thrown at execution_monitor:462, the exception handling mechanism calls __cxa_allocate_exception which then calls std::malloc. But, because of the corrupted heap, this call blocks against the malloc mutex.
Is there any way to avoid memory allocation? The exception itself doesn't allocate the memory (namely for the reasons above). We just need to avoid allocation on system level.
This is the specific case with QNX, and other OSs will deal with this in different ways. The general point I would like to make is that after a memory segment violation, any process's memory will be in an undefined state and it is unreasonable to assume that it will be capable of continued execution as per the present design. I propose that there is no safe way to intercept a SIGSEGV and that it should be allowed to terminate the process without intervention.
You could disable catching system errors for this test or this configuration (using either command line argument or environment variable). But be aware that this may cause different problems for regression testing runs. Gennadiy

Gennadiy Rozental wrote:
In brief, the primary execution failure corrupts the program heap. When the C++ exception is thrown at execution_monitor:462, the exception handling mechanism calls __cxa_allocate_exception which then calls std::malloc. But, because of the corrupted heap, this call blocks against the malloc mutex.
Is there any way to avoid memory allocation?
Yes, don't throw an exception. It is the system that is allocating memory for the exception.
The exception itself doesn't allocate the memory (namely for the reasons above).
Oh yes it does. Before you can throw an exception you have to create one and that will involve a call to new, which in most cases results in a call to malloc. [...]
You could disable catching system errors for this test or this configuration (using either command line argument or environment variable).
I am trying to make a general point here that the code is badly designed and is prone to failure on any OS, not just QNX. To use a simple analogy you are currently asking a mortally wounded soldier to get back in the front line and continue to fight. A memory segment violation _may_ cripple a process. There is _no_guarantee_ that following a memory segment violation a process is in a fit state to continue execution. Don't you remember the good old days when a memory segment violation under Windows would wipe out the entire OS :-) See also Ian McCulloch's post that backs me up. BTW the specific case has been dealt with by withdrawing the test. I just want to make sure this does not happen again to anyone. Jim

"Jim Douglas" <jim@dramatec.co.uk> wrote in message news:du8qc7$ikt$1@sea.gmane.org...
Gennadiy Rozental wrote:
In brief, the primary execution failure corrupts the program heap. When the C++ exception is thrown at execution_monitor:462, the exception handling mechanism calls __cxa_allocate_exception which then calls std::malloc. But, because of the corrupted heap, this call blocks against the malloc mutex.
Is there any way to avoid memory allocation?
Yes, don't throw an exception. It is the system that is allocating memory for the exception.
The exception itself doesn't allocate the memory (namely for the reasons above).
Oh yes it does. Before you can throw an exception you have to create one and that will involve a call to new, which in most cases results in a call to malloc.
I don't see why. The exception is defined on stack. Why would the system need to call to new? There are other alternatives: I could make exception object static, I could throw special int value to indicate falire and have exception object available globally.
[...]
You could disable catching system errors for this test or this configuration (using either command line argument or environment variable).
I am trying to make a general point here that the code is badly designed and is prone to failure on any OS, not just QNX. To use a simple analogy you are currently asking a mortally wounded soldier to get back in the front line and continue to fight.
I do not. Immediate reaction to SIGSEGV is an attempt to log an error and exit process.
A memory segment violation _may_ cripple a process. There is _no_guarantee_ that following a memory segment violation a process is in a fit state to continue execution. Don't you remember the good old days when a memory segment violation under Windows would wipe out the entire OS :-)
See also Ian McCulloch's post that backs me up.
BTW the specific case has been dealt with by withdrawing the test. I just want to make sure this does not happen again to anyone.
This is really very simple. For the usages in regression testing user is not interested in core files. Instead we prefer as much information in output/log that indicate the error. I do not have an ability to properly log an error from within signal handler. So I am trying to jump out of it to the place where I could. In worse case scenario we should end up with another core. Which is unusable, but we are not interested in core files in a fist place. And as usual any alternative proposition would be very welcome. Gennadiy
participants (3)
-
Gennadiy Rozental
-
Ian McCulloch
-
Jim Douglas