I’m currently fighting with a crappy piece of custom server software that doesn’t accept its connections properly (written in Java by a PHP programmer who had never touched sockets, let alone threads). My guess is that a thread dies before the socket is properly accepted in the client thread. I can’t be sure, and it doesn’t matter much, since the software is currently being reimplemented; the old version has to be kept running as reliably as possible until the new version goes online, but without spending any time or money on debugging the old codebase.
The bug manifests itself in the following netstat output; some connections are never transferred from the kernel to user space (that’s how I interpret this, better interpretations are welcome):
Proto  Recv-Q  Send-Q  Local Address      Foreign Address        State        PID/Program name
tcp6   228     0       192.0.2.105:1988   220.127.116.11:7925    ESTABLISHED  -
tcp6   0       0       192.0.2.105:1988   18.104.22.168:9826     ESTABLISHED  14741/java
tcp6   0       0       192.0.2.105:1988   22.214.171.124:5867    ESTABLISHED  14741/java
tcp6   2677    0       192.0.2.105:1988   126.96.36.199:15688    ESTABLISHED  -
tcp6   3375    0       192.0.2.105:1988   188.8.131.52:3045      ESTABLISHED  -
tcp6   14742   0       192.0.2.105:1988   184.108.40.206:4679    ESTABLISHED  -
tcp6   774     0       192.0.2.105:1988   220.127.116.11:36064   ESTABLISHED  -
tcp6   92      0       192.0.2.105:1988   18.104.22.168:7164     ESTABLISHED  -
tcp6   0       0       192.0.2.105:1988   22.214.171.124:6322    ESTABLISHED  14741/java
tcp6   0       0       192.0.2.105:1988   126.96.36.199:13937    ESTABLISHED  14741/java
tcp6   3051    0       192.0.2.105:1988   188.8.131.52:31239     ESTABLISHED  -
tcp6   246     0       192.0.2.105:1988   184.108.40.206:5458    ESTABLISHED  -
tcp6   618     0       192.0.2.105:1988   220.127.116.11:20209   ESTABLISHED  -
tcp6   1041    0       192.0.2.105:1988   18.104.22.168:7424     ESTABLISHED  -
tcp6   0       0       192.0.2.105:1988   22.214.171.124:5065    ESTABLISHED  14741/java
When this happens and the clients reconnect, they tend to work. But they won’t reconnect by themselves until they run into a rather long timeout. The custom full-duplex protocol in its current incarnation doesn’t ack any data sent by the client, and the client doesn’t expect regular incoming requests from the server, so this can take days: the client happily keeps sending data until the kernel’s receive queue fills up. On the server (kernel) side it should be possible to detect stale sockets, since the clients send data regularly.
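To illustrate what "detecting stale sockets" could look like from user space, here is a rough Python sketch that scans netstat-style output for ESTABLISHED connections on the service port that have unread bytes in Recv-Q and no owning process. The function name `find_unaccepted` and the `SAMPLE` text are made up for this example; it assumes the seven-column layout shown above and that netstat was run with enough privileges to show the PID column (otherwise every socket shows `-`). It only reports candidates, it does not close anything.

```python
# Sketch only: column layout is assumed to match the netstat output in the question.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp6 228 0 192.0.2.105:1988 220.127.116.11:7925 ESTABLISHED -
tcp6 0 0 192.0.2.105:1988 18.104.22.168:9826 ESTABLISHED 14741/java
tcp6 2677 0 192.0.2.105:1988 126.96.36.199:15688 ESTABLISHED -
"""


def find_unaccepted(netstat_output, port):
    """Return (foreign_address, recv_q) pairs for sockets that look stuck.

    A socket counts as stuck when it is ESTABLISHED on `port`, has no
    owning process ("-") and has unread data queued in the kernel.
    """
    stale = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a TCP row.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        recv_q, local, foreign, state, owner = (
            fields[1], fields[3], fields[4], fields[5], fields[6]
        )
        if not recv_q.isdigit():
            continue
        local_port = local.rsplit(":", 1)[-1]
        if (
            state == "ESTABLISHED"
            and owner == "-"
            and int(recv_q) > 0
            and local_port == str(port)
        ):
            stale.append((foreign, int(recv_q)))
    return stale


if __name__ == "__main__":
    for foreign, queued in find_unaccepted(SAMPLE, 1988):
        print(f"{foreign}: {queued} bytes waiting in the kernel")
```

Running something like this from cron against live netstat output would at least show how many clients are stuck at any given time.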
So, assuming my interpretation of this problem is correct, I wonder whether there is a kernel parameter I can tune that makes the kernel drop/close TCP connections with a RST if they aren’t read from by user space in a timely manner.
Better explanations of what happens here are welcome as well.
You can try tuning TCP keepalive to much shorter values. By default (on Linux, net.ipv4.tcp_keepalive_time is 7200 seconds) a connection can be idle for two hours before keepalive kicks in.
Exactly what values you should use depends on what your application does and on what your users expect or how they interact with it.
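As a rough aid for picking values, the following Python sketch reads the three Linux keepalive sysctls from `/proc/sys/net/ipv4` and computes how long an idle connection to a dead peer survives before the kernel gives up: the idle time plus one probe interval per unanswered probe. The function names are made up for this example, and the fallback numbers are the usual Linux defaults (7200 s, 75 s, 9 probes). Keep in mind that these settings only apply to sockets on which the application has enabled SO_KEEPALIVE.

```python
from pathlib import Path

# Usual Linux defaults, used as a fallback when /proc is not readable.
DEFAULTS = {
    "tcp_keepalive_time": 7200,   # idle seconds before the first probe
    "tcp_keepalive_intvl": 75,    # seconds between probes
    "tcp_keepalive_probes": 9,    # unanswered probes before giving up
}


def read_keepalive_settings(base="/proc/sys/net/ipv4"):
    """Read the current keepalive sysctls, falling back to DEFAULTS."""
    settings = {}
    for name, default in DEFAULTS.items():
        try:
            settings[name] = int(Path(base, name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = default
    return settings


def seconds_until_dead_peer_detected(time_s, intvl_s, probes):
    """Idle time plus the time spent sending unanswered probes."""
    return time_s + intvl_s * probes


if __name__ == "__main__":
    s = read_keepalive_settings()
    total = seconds_until_dead_peer_detected(
        s["tcp_keepalive_time"],
        s["tcp_keepalive_intvl"],
        s["tcp_keepalive_probes"],
    )
    print(f"current settings: dead peer detected after about {total} s")
```

With the defaults this comes to 7200 + 75 * 9 = 7875 seconds; something like 60 / 10 / 6 (an arbitrary example, not a recommendation) would bring it down to 120 seconds.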