On Fri, Jul 2, 2010 at 9:55 AM, Ragnar Cederlund
Hi,
I'm having trouble with the io_service destructor never completing. The application I am working on uses one worker thread to handle all actual work to be done including communication on a socket in handler functions posted to the asio queue and another thread to handle the gui of the application.
At a certain point in time the program decides that it has nothing more to do and calls the io_service::stop function. Shortly thereafter the thread returns from its call to the io_service::run function. The next thing that happens is that the io_service object goes out of scope and thus the io_service destructor is called. After this the thread signals that it is finished and the application exits.
The problem I am seeing is that the destructor never finishes. Or actually, a customer sees this problem, but I have not been able to reproduce this so all the information I have comes from adding (excessive) logging in the boost source.
The io_service destructor loops through all services and calls shutdown_service on each one. The win_iocp_io_service handles this call by entering a while loop that should complete when the outstanding_work_ variable becomes zero. Inside the loop, there is a call to GetQueuedCompletionStatus to dequeue a completion packet, which is then tested to see if it contains a valid operation that is then destroyed and the outstanding_work_ variable is decremented.
In all the runs I have made, outstanding_work_ is initially 1 when the shutdown_service function is entered. The first and second calls to GetQueuedCompletionStatus come back immediately and the second results in the outstanding_work_ variable being decremented and the loop exits. Everything is fine.
At the customer location however, the second call times out instead of completing immediately and it continues to do so for subsequent loop iterations until the application is forcibly closed. There is only one of our customers who gets this problem. The customer runs Windows XP SP3, but I have failed to reproduce this on my own XP test machine.
We have tried both Boost version 1.43 and version 1.40 with the same result. In both cases, the customer has the problem but I cannot reproduce it. The asio behavior has changed somewhat between the versions and the description of the problem above is with version 1.43.
Does anyone have any idea of what may be going on? Any hints or tips are appreciated! Could we be using asio wrong in some way that could cause this, but we're just lucky most of the time? The customer may be running in a virtual machine, could this affect io completion ports in a way that could be problematic for asio?
I have been experiencing this *EXACT* same issue, exactly as described
above, when using an external library that uses Boost.ASIO internally
to host a webserver.
http://www.webtoolkit.eu/wt
It happens even when running the examples, locally.
Here is the copy/paste of my email in their mailing list when I was
trying to figure it out yesterday:
On Sat, Jul 3, 2010 at 8:16 PM, OvermindDL1
On Sat, Jul 3, 2010 at 7:32 PM, OvermindDL1
wrote: On Sat, Jul 3, 2010 at 7:31 PM, OvermindDL1
wrote: On Sat, Jul 3, 2010 at 5:57 PM, OvermindDL1
wrote: On Sat, Jul 3, 2010 at 3:48 PM, OvermindDL1
wrote: On Sat, Jul 3, 2010 at 5:47 AM, Alex Sadovsky
wrote: On Sat, Jul 3, 2010 at 10:46 AM, OvermindDL1
wrote: > > On Fri, Jul 2, 2010 at 11:03 PM, OvermindDL1 > wrote: > > I am still trying to figure out if it is something with my setup or > > what, but everytime my server gets under any tiny bit of load from > > multiple users, it just 'stops', will not respond to localhost, just > > 'nothing', in both debug and release. > > > > If I run it under debug in Visual Studio, VS (or one of my plugins, > > unsure anymore, it all just melds together over time) reports that the > > app is deadlocked. This happens with even simple examples that > > require a session, like the Bob Smith example (which is what I am > > trying to trace it through currently since it is so simple). > > > > Windows XP SP3. > > 4 gigs of ram (although XP 32-bit only uses ~3gigs). > > Computer is not under load. > > Using the simple Bob Smith example that comes with the latest Wt > > Master/head, all build types, just get two people to connect at once, > > seems to cause a deadlock (very easy to reproduce when they refresh > > the page repeatedly). Any thoughts? > > > Interestingly, the browser still initiates a connection to the server, > but never gets a response, it just hangs waiting for the response... > I can confirm the same behavior on a Win2003r2/32bit sytem. I haven't dug into this too much though.
I am using Visual Studio 2005 Standard for note, and although I have a deadlock detector, I do not have one that can identify them ahead of time nor what causes them. I might try to look at this later if I get a chance.
Actually I have a moment, and did just a quick check to see what the threads do.
Using the bobsmith.wt.exe demo, I launched the debug version (so I can see it stack data easily), opened up a browser, and held down refresh for a couple of seconds. At that time I saw all the worker threads just die, vanish, kaput, there were two threads left at that point, one was some wmifree thread, but it disappeared a few seconds later, and the last one remaining is the mainCRTStartup thread, the top of its stack holding: bobsmith.wt.exe!boost::this_thread::interruptible_wait+0x211 So it is still waiting for the threads to return, but since they all died, it does not.
Running it again, I caught the entire stack for the second thread that lives for a few seconds after the thing dies, it is: ADVAPI32.dll!WmiFreeBuffer+0x24e kernel32.dll!GetModuleFileNameA+0x1ba
I ran it again, and set the thread count to 1, and debugged that thread, apparently when it seems to 'lock up', the ASIO do_one function never returns, it never gets signal from the socket that there is incoming data, and then the socket closes, which kills ASIO.
Debugging into the actual GetQueuedCompletionStatus call in ASIO, it seems that it always return status code WAIT_TIMEOUT when not serving anything, but the last time through, the one that it is killed right after, the error code is 0.
Debugging into it and tracing it line by line, it seems that the thread stopped flag becomes set, trying to find out why...
Searched for every place that variable is ever set to 1, which is apparently only one place, the stop() method, to try to figure out what is setting it, and it is hit, so something is indeed setting it, tracing back up reveals it is called by: bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::work_finished() Line 315 C++ bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::auto_work::~auto_work() Line 641 C++ bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::do_one(bool block=true, boost::system::error_code & ec={...}) Line 548 + 0x19 bytes C++ bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::run(boost::system::error_code & ec={...}) Line 229 + 0xe bytes C++ bobsmith.wt.exe!boost::asio::io_service::run() Line 68 + 0xf bytes C++ bobsmith.wt.exe!http::server::Server::run() Line 213 C++
So basically, it is saying that it is running out of work, hmm...
After some more digging, it seems that the on_completion handler for handling a new socket is failing for some reason, which is causing the work to be decremented, without it ever being incremented, trying to figure out what is going on here, but regardless, with the auto-work holder is released right after it, it causes the thread to close since there is no more work.
Okay, figured out what is closing it all, the IO completion port call returns ERROR_SEM_TIMEOUT, so the semaphore is timing out, and it seems like it is timing out *before* the timeout call, this generally indicates a failure condition in the port, hmm...
Hmm, based on some other information noticed, it seems there is a bad WMI state at the time that happens, may or may not be related, although probably is...
Also, this happens on two different computers with two very different networking cards (same OS though).
Decided to watch the network traffic, apparently what kills it pretty well is opening up about 8 connections at once to it, most are in CLOSE_WAIT status, a few in ESTABLISHED, and the ones that locked up are in SYN_RCVD, but the computer never gets another response, and it seems that when any of these SYN_RCVD state ports are timed out, the program receives the ERROR_SEM_TIMEOUT (which is not related at all to the timeout given to the function I found out by reading, that would return something different). Technically this error should never be thrown from that function according to MSDN, but it is, maybe a bug in the Windows Networking Stack pre-Vista, unsure yet, can anyone else find out anything?
My test computers are both Windows XP SP3, compiled using Visual Studio 2005, fully updated. I have been trying my darndest to figure out what is going on in that handler that calls that one destructor that sets the reference count low enough that the thread dies on the next invocation. I honestly do not know ASIO nor IOCP well enough to know what is going on, but if anyone does, and they have access to VS 2005, I can setup a remote debugging platform so they can debug into the problem as it happens on my computer since I can reliably cause it to happen every single time. The above post, Ragnar Cederlund, if you want, I can test your product as well, if it happens on my computer too then I can let you remote debug it in real-time (using the remote debugging capabilities of Visual Studio 2005, or I can setup a quick VNC server for local access if you prefer that, if you do not want me to have access to your code then the remote debugging capabilities of your IDE will work, although I do not have any above 2005, I can install the remote debugging toolkits from the microsoft site if you need me to).