asio queue OOM stress test

I wrote a little program to stress the queue under low memory situations, and to demonstrate that there is no upper bound on memory usage when using async function calls. The first time I ran it under Linux the kernel killed mysqld and pretty much hung the system. The second time the kernel killed the test program. In neither run could I recover from the OOM error, which is one reason I wrote this. I want to demonstrate that under Linux you often can't catch exceptions from new, because your program is dead before the exception is thrown. Some might consider that a limitation of overcommitting memory managers like Linux's and FreeBSD's, but that's how it works, and I think we are stuck with it.

What is really interesting is that on Windows, as I expected, I did get the exception and started running the handlers, but the program deadlocked each time I ran it after running about 600k function calls. I'm starting to wonder if this is a bug unrelated to the OOM situation.

#include <iostream>
#include <boost/bind.hpp>
#include <boost/asio.hpp>

void doit(unsigned long* count)
{
  ++(*count);
  if(!(*count % 100000)){
    std::cout<<*count<<std::endl;
  }
}

int main(int argc, char* argv[])
{
  boost::asio::demuxer d;
  unsigned long call_count = 0;
  unsigned long post_count = 0;

  //
  // Post messages until we run out of memory and then
  // run them.
  try{
    std::cout<<"posting events..."<<std::endl;
    for(post_count = 1;;++post_count){
      //
      // Internally allocates queueing structures, and
      // a copy of the functor returned by bind.
      d.post(boost::bind(doit, &call_count));
      if(!(post_count % 100000)){
        //
        // Print out something to show that
        // we are still alive.
        std::cout<<post_count<<std::endl;
      }
    }
  }
  catch(...){
    //
    // Let's not do anything here in case we throw again.
  }

  //
  // There is a reasonable chance we might throw here,
  // because we are pretty much out of memory. Might
  // want to comment this line out.
  std::cout<<"caught exception. post_count: "
           <<post_count<<std::endl;

  try{
    d.run();
  }
  catch(...){
    std::cout<<"caught exception running events."<<std::endl;
    //
    // Can't really do anything graceful here like continue
    // to handle connected sockets.
    //
    // I could try to call run again, but I doubt that's safe,
    // as we could have been mucking with internal structures
    // when the exception let go.
  }

  std::cout<<"all done"<<std::endl;
  std::cout<<"post_count: "<<post_count<<std::endl;
  std::cout<<"call_count: "<<call_count<<std::endl;

  return 0;
}
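For what it's worth, one way to exercise the bad_alloc path on Linux without invoking the OOM killer is to cap the process address space before posting, so allocation fails deterministically. A minimal sketch follows; the 256 MB figure and the helper name are just for illustration, and whether new then throws cleanly still depends on the allocator and the kernel's overcommit settings:

#include <sys/resource.h> // setrlimit, RLIMIT_AS

// Cap the address space so that new throws std::bad_alloc instead of
// the kernel's OOM killer taking the process (or mysqld) down with it.
void cap_address_space(rlim_t bytes)
{
  rlimit rl;
  rl.rlim_cur = bytes;
  rl.rlim_max = bytes;
  ::setrlimit(RLIMIT_AS, &rl);
}

// Call early in main(), e.g. cap_address_space(256 * 1024 * 1024);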

function calls. I'm starting to wonder if this is a bug unrelated to the OOM situation.
I have confirmed that on my machine the program deadlocks on Windows after running about 6 million queued events, even if the post loop is broken before the demuxer throws with OOM.

I have confirmed that on my machine the program deadlocks on Windows after running about 6 million queued events, even if the post loop is broken before the demuxer throws with OOM.
Running your test program on my machine (compiled with VC 8.0, debug build, running XP) I got the exception at post_count 6698316, then it got to displaying 4800000 before progress stopped. I did it twice and got the same numbers. Simon

I have confirmed that on my machine the program deadlocks on Windows after running about 6 million queued events, even if the post loop is broken before the demuxer throws with OOM.
Running your test program on my machine (compiled with VC 8.0, debug build, running XP) I got the exception at post_count 6698316, then it got to displaying 4800000 before progress stopped.
You might want to try breaking the post loop at 5 million to get the OOM exception out of the picture, just to confirm the deadlock isn't a result of the exception being thrown. cheers, christopher
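Concretely, that would just be a guard inside the post loop of the test program; 5,000,000 is an arbitrary figure comfortably below where the exception was seen:

// Stop posting well before the allocator gives out, so any deadlock
// that follows can't be blamed on the bad_alloc / OOM path.
if (post_count == 5000000)
  break;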

Hi Christopher,

--- christopher baus <christopher@baus.net> wrote:
I have confirmed that on my machine the program deadlocks on Windows after running about 6 million queued events, even if the post loop is broken before the demuxer throws with OOM.
I can reproduce it too. However, there is an earlier access violation during the post() phase, down inside the std::exception constructor (yikes!). You are suppressing this AV because you are doing a catch(...), so I don't think the deadlock is the actual problem here. Cheers, Chris

Turns out the AV only occurs with MSVC 7.1. With VC 8.0 Express I was able to diagnose the true error: PostQueuedCompletionStatus is failing with ERROR_NO_SYSTEM_RESOURCES, however I was not checking the return value. My bad :*)

Here is a diff to fix this (against CVS version of asio). With this change the test runs successfully to completion.

Cheers,
Chris

----------------------

Index: win_iocp_demuxer_service.hpp
===================================================================
RCS file: /cvsroot/asio/asio/include/asio/detail/win_iocp_demuxer_service.hpp,v
retrieving revision 1.21
diff -u -r1.21 win_iocp_demuxer_service.hpp
--- win_iocp_demuxer_service.hpp  2 Dec 2005 01:44:56 -0000  1.21
+++ win_iocp_demuxer_service.hpp  16 Dec 2005 10:02:57 -0000
@@ -102,7 +102,12 @@
       if (::InterlockedExchangeAdd(&interrupted_, 0) != 0)
       {
         // Wake up next thread that is blocked on GetQueuedCompletionStatus.
-        ::PostQueuedCompletionStatus(iocp_.handle, 0, 0, 0);
+        if (!::PostQueuedCompletionStatus(iocp_.handle, 0, 0, 0))
+        {
+          DWORD last_error = ::GetLastError();
+          system_exception e("pqcs", last_error);
+          boost::throw_exception(e);
+        }
         break;
       }
     }
@@ -113,7 +118,14 @@
   void interrupt()
   {
     if (::InterlockedExchange(&interrupted_, 1) == 0)
-      ::PostQueuedCompletionStatus(iocp_.handle, 0, 0, 0);
+    {
+      if (!::PostQueuedCompletionStatus(iocp_.handle, 0, 0, 0))
+      {
+        DWORD last_error = ::GetLastError();
+        system_exception e("pqcs", last_error);
+        boost::throw_exception(e);
+      }
+    }
   }
 
   // Reset the demuxer in preparation for a subsequent run invocation.
@@ -183,8 +195,15 @@
   template <typename Handler>
   void post(Handler handler)
   {
-    win_iocp_operation* op = new handler_operation<Handler>(*this, handler);
-    ::PostQueuedCompletionStatus(iocp_.handle, 0, 0, op);
+    handler_operation<Handler>* op =
+      new handler_operation<Handler>(*this, handler);
+    if (!::PostQueuedCompletionStatus(iocp_.handle, 0, 0, op))
+    {
+      DWORD last_error = ::GetLastError();
+      delete op;
+      system_exception e("pqcs", last_error);
+      boost::throw_exception(e);
+    }
   }
 
 private:

Turns out the AV only occurs with MSVC 7.1. With VC 8.0 Express I was able to diagnose the true error: PostQueuedCompletionStatus is failing with ERROR_NO_SYSTEM_RESOURCES, however I was not checking the return value. My bad :*)
Here is a diff to fix this (against CVS version of asio). With this change the test runs successfully to completion.
Wow, that was fast. I shouldn't have used catch(...), but I was being lazy. See, laziness never pays. :) Thanks, Christopher
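For reference, a narrower set of handlers in the test program would have let the access violation surface instead of being swallowed. A sketch of what the post loop's catch clauses could look like instead (needs <new> and <exception> on top of the includes already in the program):

try{
  // posting loop as before
}
catch(std::bad_alloc&){
  // Expected once the queue finally exhausts memory.
}
catch(std::exception& e){
  // Anything else asio reports; at least say what it was.
  std::cout<<"unexpected exception: "<<e.what()<<std::endl;
}
// No catch(...), so a structured exception such as an access
// violation is no longer silently suppressed.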

Here is a diff to fix this (against CVS version of asio). With this change the test runs successfully to completion.
Ok, I looked at your patch and I see what was going on here. The OS queue operation was failing before new ever threw, so post returned even though the event didn't get queued to the OS. That's why run didn't return. Makes sense. It seems there is an OS-imposed limit on the queue depth in Windows. Another reason I wrote this is that I think a hard limit should be put in the reactive servers so we don't end up with runaway applications. We should at least attempt to stop enqueueing operations before the OOM killer has its way with us. This is the friction point between the theory of deferred execution and the unfortunate fact that our machines have finite resources. Thanks, Christopher
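Until something like that exists in asio itself, here is a rough sketch of what application-level throttling might look like around the demuxer from the test program. Everything here (bounded_poster, max_pending, the choice to throw when full) is hypothetical, and it assumes a single thread both posts and later calls run():

#include <stdexcept>
#include <boost/asio.hpp>
#include <boost/bind.hpp>

// Hypothetical wrapper that refuses to queue more than max_pending
// handlers at once. Not thread safe: one thread posts, the same
// thread runs the demuxer.
class bounded_poster
{
public:
  bounded_poster(boost::asio::demuxer& d, unsigned long max_pending)
    : demuxer_(d), max_pending_(max_pending), pending_(0) {}

  template <typename Handler>
  void post(Handler handler)
  {
    if (pending_ >= max_pending_)
      throw std::runtime_error("post queue limit reached");
    ++pending_;
    // The wrapped handler decrements the count when it actually runs.
    demuxer_.post(boost::bind(&bounded_poster::run_one<Handler>, this, handler));
  }

private:
  template <typename Handler>
  void run_one(Handler handler)
  {
    --pending_;
    handler();
  }

  boost::asio::demuxer& demuxer_;
  unsigned long max_pending_;
  unsigned long pending_;
};

In the test program this would amount to replacing d.post(...) with something like: bounded_poster p(d, 5000000); p.post(boost::bind(doit, &call_count));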

Hi,

--- christopher baus <christopher@baus.net> wrote:
It seems there is an OS-imposed limit on the queue depth in Windows. Another reason I wrote this is that I think a hard limit should be put in the reactive servers so we don't end up with runaway applications. We should at least attempt to stop enqueueing operations before the OOM killer has its way with us.
Yep, sorry I haven't replied to your related email on event throttling... still trying to catch up on digesting all the emails!

I think this hard limit could be made configurable by adding a new class/service pair. E.g.:

template <typename Allocator = std::allocator<void> >
class limits_service
{
  ...
};

template <typename Service>
class basic_limits
{
public:
  basic_limits(demuxer_type& d);

  void post_queue(std::size_t value);
  std::size_t post_queue() const;

  ... and so on ...
};

typedef basic_limits<limits_service<> > limits;

Usage:

boost::asio::demuxer d;
boost::asio::limits l(d);
l.post_queue(42);

Any suggestions about what sort of numbers should be used by default?

Cheers,
Chris

----- Original Message -----
From: "Christopher Kohlhoff" <chris@kohlhoff.com>
To: <boost@lists.boost.org>
Sent: Saturday, December 17, 2005 12:30 AM
Subject: Re: [boost] asio queue OOM stress test -- Windows deadlock
--- christopher baus <christopher@baus.net> wrote:
Usage:
boost::asio::demuxer d;
boost::asio::limits l(d);
l.post_queue(42);
Any suggestions about what sort of numbers should be used by default?
6698316 - 1 :-) Simon

"Christopher Kohlhoff" <chris@kohlhoff.com> wrote:
Yep, sorry I haven't replied to your related email on event throttling... still trying to catch up on digesting all the emails!
No problem. I was beginning to wonder if you ever sleep :)
I think this hard limit could be made configurable by adding a new class/service pair. E.g.:
template <typename Allocator = std::allocator<void> >
class limits_service
{
  ...
};

template <typename Service>
class basic_limits
{
public:
  basic_limits(demuxer_type& d);

  void post_queue(std::size_t value);
  std::size_t post_queue() const;

  ... and so on ...
};
typedef basic_limits<limits_service<> > limits;
Usage:
boost::asio::demuxer d;
boost::asio::limits l(d);
l.post_queue(42);
That looks good to me.
Any suggestions about what sort of numbers should be used by default?
42 sounds good. :) My gut instinct would be to make it a multiple of the number of connections you intend to handle, like num_connections * 5. The chance that there are 5 operations pending per connection is pretty slim; usually there are at most two, the I/O event and the timeout.
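To put that heuristic into code, using the limits interface proposed above (which of course doesn't exist in asio yet) and a made-up connection count:

// Five pending operations per connection leaves headroom; in practice
// at most the I/O event and the timeout tend to be outstanding.
std::size_t post_queue_limit(std::size_t num_connections)
{
  return num_connections * 5;
}

// e.g., for a server sized for 10,000 connections:
//   boost::asio::demuxer d;
//   boost::asio::limits l(d);
//   l.post_queue(post_queue_limit(10000));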
participants (3)
- christopher baus
- Christopher Kohlhoff
- simon meiklejohn