Re: [Boost-users] [Asio] io_service destructor never completes

5 Jul 2010

      On Fri, Jul 2, 2010 at 9:55 AM, Ragnar Cederlund
<ragnar.cederlund@illuminatelabs.com> wrote:
...
Hi,
I'm having trouble with the io_service destructor never completing. The
application I am working on uses one worker thread to handle all actual work
to be done including communication on a socket in handler functions posted
to the asio queue and another thread to handle the gui of the application.
At a certain point in time the program decides that it has nothing more to
do and calls the io_service::stop function. Shortly thereafter the thread
returns from its call to the io_service::run function. The next thing that
happens is that the io_service object goes out of scope and thus the
io_service destructor is called. After this the thread signals that it is
finished and the application exits.
The problem I am seeing is that the destructor never finishes. Or actually,
a customer sees this problem, but I have not been able to reproduce this so
all the information I have comes from adding (excessive) logging in the
boost source.
The io_service destructor loops through all services and calls
shutdown_service on each one. The win_iocp_io_service handles this call by
entering a while loop that should complete when the outstanding_work_
variable becomes zero. Inside the loop, there is a call to
GetQueuedCompletionStatus to dequeue a completion packet, which is then
tested to see if it contains a valid operation that is then destroyed and
the outstanding_work_ variable is decremented.
In all the runs I have made, outstanding_work_ is initially 1 when the
shutdown_service function is entered. The first and second calls to
GetQueuedCompletionStatus come back immediately and the second results in
the outstanding_work_ variable being decremented and the loop exits.
Everything is fine.
At the customer location however, the second call times out instead of
completing immediately and it continues to do so for subsequent loop
iterations until the application is forcibly closed. There is only one of
our customers who gets this problem. The customer runs Windows XP SP3, but I
have failed to reproduce this on my own XP test machine.
We have tried both Boost version 1.43 and version 1.40 with the same result.
In both cases, the customer has the problem but I cannot reproduce it. The
asio behavior has changed somewhat between the versions and the description
of the problem above is with version 1.43.
Does anyone have any idea of what may be going on? Any hints or tips are
appreciated! Could we be using asio wrong in some way that could cause this,
but we're just lucky most of the time? The customer may be running in a
virtual machine, could this affect io completion ports in a way that could
be problematic for asio?
I have been experiencing this *EXACT* same issue, exactly as described
above, when using an external library that uses Boost.ASIO internally
to host a webserver.

http://www.webtoolkit.eu/wt

It happens even when running the examples, locally.

Here is the copy/paste of my email in their mailing list when I was
trying to figure it out yesterday:

On Sat, Jul 3, 2010 at 8:16 PM, OvermindDL1 <overminddl1@gmail.com> wrote:
...
On Sat, Jul 3, 2010 at 7:32 PM, OvermindDL1 <overminddl1@gmail.com> wrote:
...
On Sat, Jul 3, 2010 at 7:31 PM, OvermindDL1 <overminddl1@gmail.com> wrote:
...
On Sat, Jul 3, 2010 at 5:57 PM, OvermindDL1 <overminddl1@gmail.com> wrote:
...
On Sat, Jul 3, 2010 at 3:48 PM, OvermindDL1 <overminddl1@gmail.com> wrote:
...
On Sat, Jul 3, 2010 at 5:47 AM, Alex Sadovsky <alexeysad@gmail.com> wrote:
...
On Sat, Jul 3, 2010 at 10:46 AM, OvermindDL1 <overminddl1@gmail.com> wrote:
>
> On Fri, Jul 2, 2010 at 11:03 PM, OvermindDL1 <overminddl1@gmail.com>
> wrote:
> > I am still trying to figure out if it is something with my setup or
> > what, but everytime my server gets under any tiny bit of load from
> > multiple users, it just 'stops', will not respond to localhost, just
> > 'nothing', in both debug and release.
> >
> > If I run it under debug in Visual Studio, VS (or one of my plugins,
> > unsure anymore, it all just melds together over time) reports that the
> > app is deadlocked.  This happens with even simple examples that
> > require a session, like the Bob Smith example (which is what I am
> > trying to trace it through currently since it is so simple).
> >
> > Windows XP SP3.
> > 4 gigs of ram (although XP 32-bit only uses ~3gigs).
> > Computer is not under load.
> > Using the simple Bob Smith example that comes with the latest Wt
> > Master/head, all build types, just get two people to connect at once,
> > seems to cause a deadlock (very easy to reproduce when they refresh
> > the page repeatedly).  Any thoughts?
>
>
> Interestingly, the browser still initiates a connection to the server,
> but never gets a response, it just hangs waiting for the response...
>
I can confirm the same behavior on a Win2003r2/32bit sytem. I haven't dug
into this too much though.
I am using Visual Studio 2005 Standard for note, and although I have a
deadlock detector, I do not have one that can identify them ahead of
time nor what causes them.  I might try to look at this later if I get
a chance.
Actually I have a moment, and did just a quick check to see what the threads do.
Using the bobsmith.wt.exe demo, I launched the debug version (so I can
see it stack data easily), opened up a browser, and held down refresh
for a couple of seconds.  At that time I saw all the worker threads
just die, vanish, kaput, there were two threads left at that point,
one was some wmifree thread, but it disappeared a few seconds later,
and the last one remaining is the mainCRTStartup thread, the top of
its stack holding:
  bobsmith.wt.exe!boost::this_thread::interruptible_wait+0x211
So it is still waiting for the threads to return, but since they all
died, it does not.
Running it again, I caught the entire stack for the second thread that
lives for a few seconds after the thing dies, it is:
 ADVAPI32.dll!WmiFreeBuffer+0x24e
 kernel32.dll!GetModuleFileNameA+0x1ba
I ran it again, and set the thread count to 1, and debugged that
thread, apparently when it seems to 'lock up', the ASIO do_one
function never returns, it never gets signal from the socket that
there is incoming data, and then the socket closes, which kills ASIO.
Debugging into the actual GetQueuedCompletionStatus call in ASIO, it
seems that it always return status code WAIT_TIMEOUT when not serving
anything, but the last time through, the one that it is killed right
after, the error code is 0.
Debugging into it and tracing it line by line, it seems that the
thread stopped flag becomes set, trying to find out why...
Searched for every place that variable is ever set to 1, which is
apparently only one place, the stop() method, to try to figure out
what is setting it, and it is hit, so something is indeed setting it,
tracing back up reveals it is called by:
 bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::work_finished()
 Line 315       C++
 bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::auto_work::~auto_work()
 Line 641       C++
 bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::do_one(bool
block=true, boost::system::error_code & ec={...})  Line 548 + 0x19
bytes   C++
 bobsmith.wt.exe!boost::asio::detail::win_iocp_io_service::run(boost::system::error_code
& ec={...})  Line 229 + 0xe bytes       C++
 bobsmith.wt.exe!boost::asio::io_service::run()  Line 68 + 0xf bytes   C++
 bobsmith.wt.exe!http::server::Server::run()  Line 213 C++
So basically, it is saying that it is running out of work, hmm...
After some more digging, it seems that the on_completion handler for
handling a new socket is failing for some reason, which is causing the
work to be decremented, without it ever being incremented, trying to
figure out what is going on here, but regardless, with the auto-work
holder is released right after it, it causes the thread to close since
there is no more work.
Okay, figured out what is closing it all, the IO completion port call
returns ERROR_SEM_TIMEOUT, so the semaphore is timing out, and it
seems like it is timing out *before* the timeout call, this generally
indicates a failure condition in the port, hmm...
Hmm, based on some other information noticed, it seems there is a bad
WMI state at the time that happens, may or may not be related,
although probably is...
Also, this happens on two different computers with two very different
networking cards (same OS though).
Decided to watch the network traffic, apparently what kills it pretty
well is opening up about 8 connections at once to it, most are in
CLOSE_WAIT status, a few in ESTABLISHED, and the ones that locked up
are in SYN_RCVD, but the computer never gets another response, and it
seems that when any of these SYN_RCVD state ports are timed out, the
program receives the ERROR_SEM_TIMEOUT (which is not related at all to
the timeout given to the function I found out by reading, that would
return something different).  Technically this error should never be
thrown from that function according to MSDN, but it is, maybe a bug in
the Windows Networking Stack pre-Vista, unsure yet, can anyone else
find out anything?
My test computers are both Windows XP SP3, compiled using Visual
Studio 2005, fully updated.  I have been trying my darndest to figure
out what is going on in that handler that calls that one destructor
that sets the reference count low enough that the thread dies on the
next invocation.

I honestly do not know ASIO nor IOCP well enough to know what is going
on, but if anyone does, and they have access to VS 2005, I can setup a
remote debugging platform so they can debug into the problem as it
happens on my computer since I can reliably cause it to happen every
single time.

The above post, Ragnar Cederlund, if you want, I can test your product
as well, if it happens on my computer too then I can let you remote
debug it in real-time (using the remote debugging capabilities of
Visual Studio 2005, or I can setup a quick VNC server for local access
if you prefer that, if you do not want me to have access to  your code
then the remote debugging capabilities of your IDE will work, although
I do not have any above 2005, I can install the remote debugging
toolkits from the microsoft site if you need me to).