
On Tue, Dec 15, 2009 at 10:25 AM, Scott Gifford
Jonathan Franklin
writes:

Sure, I agree completely that attempting a read is necessary, but IME it is not sufficient. You must additionally send data, either with an OS write or TCP keepalive, to detect a completely unresponsive peer (i.e. one which has fallen off the network). The only way to detect an unresponsive peer is via a timeout, and with no data to send, there is nothing to time out.

I also agree that write() won't always return an error, but it should attempt to send data, which will cause the TCP layer to wait for an acknowledgement of that data. If that times out, the TCP layer should detect an error on the socket, and a subsequent call to read() should return an error.

I think we're agreeing, but not being clear enough for each other, or the OP.

The OP is interested in detecting a closed (or crashed) remote socket. There are 2 scenarios to consider:

1. The OP has no control over the application protocol, and there is no application-level ping or ACK mechanism built in. In this case, the application cannot send any data outside of "normal" operation (i.e. it can't actively probe whether the remote host is still there). The application must rely on read() returning an EOF when it is notified that the remote socket has closed (e.g. by the remote system, the TCP keep-alive mechanism, an ICMP message from an intermediate router, etc). If the remote host is hard-down (blue screened, cable cut, etc), and there is no TCP keep-alive, then you're pretty much hosed. The only possibility would be to add an application-level timeout to the read: reset your timer each time you read data, and kill the socket when the timeout occurs. However, this may not be an option for your use case.

2. There is an application-level ping/ACK mechanism available (the OP may need to add it). In this case, the "ping" is sent to the hard-down remote host. The write() call will not fail, and it may take many write() calls to generate a failure. However, as soon as the TCP stack times out the send (right about when the writes will begin to fail), the read() call will immediately return an EOF.

In neither case can one rely on write() failing. In case 2, one *can* rely on the read() eventually returning EOF. In the worst-case scenario for case 1, the downed remote host will never be detected. However, attempting to send data in case 1 under "normal" operation will generate an EOF from read(), but not a failure in write().

I prefer timing out "inactive" connections to sending "heart-beat" messages, when possible.

Jon
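
P.S. A minimal sketch of the application-level read timeout from case 1, written in Python only because the thread does not name a language. The function name `read_until_idle` and its parameters are hypothetical, made up for illustration. It assumes a blocking stream socket and treats "no data for `idle_timeout` seconds" as a dead peer, which may be wrong for protocols with long quiet periods.

```python
import socket


def read_until_idle(sock, idle_timeout, bufsize=4096):
    """Read until the peer closes, an error occurs, or the peer goes quiet.

    Returns (data, reason), where reason is "eof", "idle", or "error".
    Hypothetical helper for illustration, not part of any library.
    """
    # The timeout applies to each recv() call separately, so the
    # inactivity timer is effectively reset every time data arrives.
    sock.settimeout(idle_timeout)
    chunks = []
    reason = "eof"
    while True:
        try:
            chunk = sock.recv(bufsize)
        except socket.timeout:
            # Nothing arrived within idle_timeout: give up on the peer.
            reason = "idle"
            break
        except OSError:
            # The TCP layer reported a failure (reset, send timed out, ...).
            reason = "error"
            break
        if not chunk:
            # Empty read: the remote side closed the connection (EOF).
            break
        chunks.append(chunk)
    sock.close()  # kill the socket in every case
    return b"".join(chunks), reason
```

Called as `read_until_idle(conn, 30.0)`, this gives up on a connection that has been silent for 30 seconds, even if the remote host is hard-down and no keep-alive is configured.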