Reduce keepalive ping/timeout

sipa commented at 5:17 pm on June 22, 2013: member

Ping automatically every 2 minutes (unconditionally), instead of of after 30 minutes of no sending, for latency measurement and keep-alive. Also, disconnect if no reply arrives within 5 minutes, instead of after 90 minutes of inactivity.

This should make detection of stalled connections much faster.

sipa commented at 5:30 pm on June 22, 2013: member

@mikehearn Would this interact badly with BitcoinJ wallets? How frequently do they send pings? (or anyone with an alternate implementation)

mikehearn commented at 1:56 pm on June 23, 2013: contributor

bitcoinj sends pings every two seconds, it’s much more aggressive than this patch. We could certainly reduce it. Ping times are used to pick which peer to download from.

jgarzik commented at 7:50 pm on June 23, 2013: contributor

The disconnect logic seems like it would negatively impact testnet.

sipa commented at 10:20 pm on June 23, 2013: member

Inverted the logic, so it now becomes a ping if nothing has been received for a while, rather than sent.

Tested that it indeed detects broken connections within one minute.

mikehearn commented at 10:41 am on July 5, 2013: contributor

Change looks good to me, although it will have the effect of disconnecting any nodes that don’t respond to pings with pongs (or some other message). As pong messages were added in protocol version 60000 it would effectively EOL clients older than that and this may deserve an announcement somewhere.

However bitcoinj clients will be fine with it.

gmaxwell commented at 4:46 pm on July 7, 2013: contributor

So, this causes nodes to impose an upper bound on their peers response latency. Effectively, someone who will take a minute to respond can’t participate in the network at all. This may adversely impact some anonymity protocols which cause high latency— for many kinds of Bitcoin usage the latency isn’t all that critical. I’d suggest that such users should be using an alternative transport, except the bitcoin p2p protocol is already reasonably well designed for high latency usage (at least its highly asynchronous).

Some of the other side effects of timeouts is that they can reduce link stability during a DOS attack, and they can preclude some kinds of node high availability schemes which might cause tens of seconds of unresponsiveness (e.g. during a state resync or hardware migration). Connection slot filling attacks are more effective if you can overload a peer temporarily and make them drop all their good connections.

Some other protocols, like BGP, negotiate the heart-beating— 30, 90 being a common set of parameters in that case— to address the problem of potentially imposing too fast a response requirement. But since our network protocol is mostly stateless (though— it has become less so with bloom-filtering, and never was completely: e.g. inv caching) there is less of an issue as restarts aren’t so bad.

I’d instead prefer a longer timeout and separately having peer rotation code that periodically slays the least recently active node from node out of the set which been heard from in >60 seconds. E.g. imposing the shorter timeout but only when we could potentially better use the slot.

TCP’s design targets a maximum segment lifetime of 2 minutes. Many operating systems have an SO_KEEPALIVE timeout of around 10 minutes. Somewhere in that range

mikehearn commented at 1:00 pm on July 8, 2013: contributor

It would be easy to adjust the timeout for onion addresses, but really, I’m not sure we want nodes in the network that can’t respond within a minute. Slow nodes can have terrible effects on the user experience for SPV wallets. They will automatically drop to the bottom of the preference list in bitcoinj and not be used for much except observing broadcasts, for that reason.

That said, just dropping the slowest peer when we run out of slots on the bitcoind side indeed sounds reasonable.

jgarzik commented at 3:05 am on August 25, 2013: contributor

How about a 5-minute timeout, and get this merged? Maybe 1 minute is too short, but 90 is far too long.

sipa commented at 0:16 am on October 14, 2013: member

Suggestion: send pings every 2 minutes (even in case something was received recently, with the new ping-response-time-measurement that means we always get some useful latency information), and disconnect after receiving nothing for 5 minutes.

gmaxwell commented at 5:15 pm on October 14, 2013: contributor

@sipa ACK unconditional 2 minute ping plus 5 minute disconnect.

sipa commented at 10:42 pm on October 14, 2013: member

Rebased & updated.

gavinandresen commented at 11:10 pm on October 14, 2013: contributor

ACK.

sipa commented at 11:21 pm on October 14, 2013: member

There may be a bug in this code; will investigate later.

EDIT: fixed.

jgarzik commented at 2:21 am on October 15, 2013: contributor

ACK

Optional nit: “5 * 60” is more self-documenting than “300”, and the previous code used the “M * 60” notation. All modern compilers will automatically convert the more human-readable M*60 at compile time, so there is no added runtime cost.

sipa commented at 8:08 pm on October 15, 2013: member

Moved/added the ping/timeout constants to net.h (where they can be accessed by both net & main), and changed the timeout logic a bit: either there is an unanswered ping >5 minutes old, or there has been no message received at all for 5 minutes.

in src/net.h: in 22bef8dafd outdated

240@@ -237,11 +241,11 @@ class CNode
241     std::multimap<int64, CInv> mapAskFor;
242 
243     // Ping time measurement
244-    uint64 nPingNonceSent;
245-    int64 nPingUsecStart;
246-    int64 nPingUsecTime;
247-    bool fPingQueued;
248-    
249+    uint64 nPingNonceSent; // The pong reply we're expecting, or 0 if no pong expected.

Diapolo commented at 8:55 pm on October 15, 2013:

Small nit: Perhaps indent the comments to be in line.

sipa commented at 9:36 pm on October 15, 2013:

Done.

sipa commented at 6:25 pm on October 16, 2013: member

Currently running this patch myself on bitcoin.sipa.be. I see a surprisingly high number of ping timeouts and inactivity timeouts (every 10 minutes or so, very irregularly). I’ll investigate whether these are actual connections that go dead (in which case this patch is actually useful…) or a problem with the disconnection logic.

luke-jr commented at 1:11 am on February 21, 2014: member

@sipa What did you find?

sipa commented at 1:30 pm on May 10, 2014: member

Rebased, and made some changes (better reporting, and don’t enforce 5-minute receive timeout on pre-BIP31 peers).

I’m testing it again myself.

sipa commented at 1:35 pm on May 10, 2014: member

Strange, I have one peer (/Satoshi:0.8.6/, proto v70001) which seems to disconnect when it receives a ping message, and then immediately reconnects.

sipa commented at 9:34 am on May 11, 2014: member

Seems the majority of timeouts detected are ping timeouts from getaddr.bitnodes.io (which apparently only recently implemented ping/pong - though probably not yet deployed).

Apart from that I do quite some others, from various clients and IPs, every ~10 minutes on average the past hours. These look like genuine connection problems.

ayeowch commented at 9:48 am on May 11, 2014: contributor

@sipa I have just deployed the ping/pong update to getaddr.bitnodes.io. The crawler should be responding to incoming ping properly now.

sipa commented at 10:17 am on May 11, 2014: member

@ayeowch Cool, good to know. I’m not seeing any pongs yet, though…

ayeowch commented at 10:26 am on May 11, 2014: contributor

@sipa Hmm.. If the crawler (subver=/getaddr.bitnodes.io:0.1/) is already connected to your node. Try sending a ping with bitcoind ping. I did get pong from the crawler by doing this.

sipa commented at 10:31 am on May 11, 2014: member

2014-05-11 10:27:53 Sending ping to [2a01:4f8:202:81b1::2]:48980 /getaddr.bitnodes.io:0.1/

No pong afterwards (while I do receive pongs from other nodes).

sipa commented at 11:50 am on May 31, 2014: member

For reference: the problem with bitnodes.io was lingering connections from a crawler that didn’t support BIP31 (but advertized a protocol version that did), which was fixed on their side now.

I’m seeing occassional disconnects because of the new ping timeouts, but I believe these are correct.

Any objections to merging?

laanwj commented at 4:55 am on June 5, 2014: member

@sipa No objections to pinging more, but I’m a bit wary about the ‘disconnect nodes faster’ part of this. Going from 90 minutes to 5 minutes is a big change. It wouldn’t be the first time that something is introduced that can hang bitcoind for a few minutes.

leofidus commented at 0:55 am on June 6, 2014: none

I think there are already scenarios where network pings are not answered within 5 minutes. For example the importprivkey rpc command can lock cs_main for many minutes during a rescan. If during that time a network message arrives which also requires a lock on cs_main (for example a harmless inv), this locks the network thread until the rescan is complete. This leaves the peer unable to answer pings for more than 10 minutes.

sipa commented at 1:02 am on June 6, 2014: member

I think it’s reasonable that peers disconnect you during that time. You can’t use their service, and they can’t use yours.

laanwj commented at 3:53 am on June 6, 2014: member

To me it just doesn’t feel very robust to have the network fall apart when somehow 5 minutes of lag are introduced.

sipa commented at 9:25 pm on June 8, 2014: member

@laanwj What timeout would you find acceptable?

laanwj commented at 11:11 am on June 9, 2014: member

Let’s pick uhm… 20 minutes. That slices the current inactivity timeout in four already. We can always lower the value later.

gmaxwell commented at 8:57 pm on June 9, 2014: contributor

@wumpus TCP itself will fail over long before that if the network itself is delayed. A node thats holding up our connection but not responding for >5 minutes is really broken as far as we should care. It can reconnect when it can start responding again. (and yes, in some cases bitocoind is ‘broken’, but thats okay… reconnecting is fairly cheap).

The flip side of a long timeout is that we’ll leave connections up to busted peers— keeping ourselves offline (or our resources consumed) for up to the timeout— when we could otherwise have useful connections up.

Maybe it would make sense to have a conservative upper limit for now, I certantly wouldn’t want to delay pulling this further to debate the timeout. I’m happy with anything 5 minutes or longer but would prefer closer to 5 minutes. Future peer rotation could prioritize punting long idle connection, which would address the problem that you might sit partitioned from the network for many minutes with no working connections.

sipa commented at 9:00 pm on June 9, 2014: member

Increased the timeout to 20 minutes.

gmaxwell commented at 9:05 pm on June 9, 2014: contributor

@sipa The commit message is now wrong.

Ping automatically every 2 minutes (unconditionally)

... instead of after 30 minutes of no sending, for latency measurement
and keep-alive. Also, disconnect if no reply arrives within 20 minutes,
instead of 90 of inactivity (for peers supporting the 'pong' message).

f1920e8606

sipa commented at 9:07 pm on June 9, 2014: member

Fixed.

BitcoinPullTester commented at 9:31 pm on June 9, 2014: none

Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/f1920e86063d0ed008c6028d8223b0e21889bf75 for binaries and test log. This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/ Contact BlueMatt on freenode if something looks broken.

laanwj commented at 9:56 pm on June 9, 2014: member

@gmaxwell I just want to be careful here. @sipa Looks good to me now.

laanwj merged this on Jun 12, 2014

laanwj closed this on Jun 12, 2014

laanwj referenced this in commit 3f39b9d455 on Jun 12, 2014

cozz commented at 5:58 pm on June 13, 2014: contributor

Just tested on mainnet:

start with maxconnections=1
initial pingtime under a second
wait for block sync to start
ping
pingtime=140s

The problem is the node is busy sending us blocks, so does not respond to the ping. I think we should delay the ping when blocks are in flight.

ping

gmaxwell commented at 6:06 pm on June 13, 2014: contributor

@cozz I don’t understand your concern. Yes, pings will take a lot of time in the case, but it will not be disconnected.

cozz commented at 6:32 pm on June 13, 2014: contributor

Well, I thought if the pingtime is used for peer-selection later, then you probably dont want to ping, if you know that the ping will be much slower as normal, because blocks are in flight. I dont have any concerns for now, just saying that if we are going to judge a peer by its periodically pingtime, a slow ping does not necessarily mean a slow node. In my case the node is a good node, just busy.

Bushstar referenced this in commit 9e70209e49 on Apr 5, 2019

fanquake deleted a comment on May 11, 2019

DrahtBot locked this on Sep 8, 2021

Reduce keepalive ping/timeout #2784