Reduce keepalive ping/timeout #2784

pull sipa wants to merge 1 commits into bitcoin:master from sipa:shorttimeout changing 3 files +27 −17
  1. sipa commented at 5:17 pm on June 22, 2013: member

    Ping automatically every 2 minutes (unconditionally), instead of of after 30 minutes of no sending, for latency measurement and keep-alive. Also, disconnect if no reply arrives within 5 minutes, instead of after 90 minutes of inactivity.

    This should make detection of stalled connections much faster.

  2. sipa commented at 5:30 pm on June 22, 2013: member
    @mikehearn Would this interact badly with BitcoinJ wallets? How frequently do they send pings? (or anyone with an alternate implementation)
  3. mikehearn commented at 1:56 pm on June 23, 2013: contributor
    bitcoinj sends pings every two seconds, it’s much more aggressive than this patch. We could certainly reduce it. Ping times are used to pick which peer to download from.
  4. jgarzik commented at 7:50 pm on June 23, 2013: contributor
    The disconnect logic seems like it would negatively impact testnet.
  5. sipa commented at 10:20 pm on June 23, 2013: member

    Inverted the logic, so it now becomes a ping if nothing has been received for a while, rather than sent.

    Tested that it indeed detects broken connections within one minute.

  6. mikehearn commented at 10:41 am on July 5, 2013: contributor

    Change looks good to me, although it will have the effect of disconnecting any nodes that don’t respond to pings with pongs (or some other message). As pong messages were added in protocol version 60000 it would effectively EOL clients older than that and this may deserve an announcement somewhere.

    However bitcoinj clients will be fine with it.

  7. gmaxwell commented at 4:46 pm on July 7, 2013: contributor

    So, this causes nodes to impose an upper bound on their peers response latency. Effectively, someone who will take a minute to respond can’t participate in the network at all. This may adversely impact some anonymity protocols which cause high latency— for many kinds of Bitcoin usage the latency isn’t all that critical. I’d suggest that such users should be using an alternative transport, except the bitcoin p2p protocol is already reasonably well designed for high latency usage (at least its highly asynchronous).

    Some of the other side effects of timeouts is that they can reduce link stability during a DOS attack, and they can preclude some kinds of node high availability schemes which might cause tens of seconds of unresponsiveness (e.g. during a state resync or hardware migration). Connection slot filling attacks are more effective if you can overload a peer temporarily and make them drop all their good connections.

    Some other protocols, like BGP, negotiate the heart-beating— 30, 90 being a common set of parameters in that case— to address the problem of potentially imposing too fast a response requirement. But since our network protocol is mostly stateless (though— it has become less so with bloom-filtering, and never was completely: e.g. inv caching) there is less of an issue as restarts aren’t so bad.

    I’d instead prefer a longer timeout and separately having peer rotation code that periodically slays the least recently active node from node out of the set which been heard from in >60 seconds. E.g. imposing the shorter timeout but only when we could potentially better use the slot.

    TCP’s design targets a maximum segment lifetime of 2 minutes. Many operating systems have an SO_KEEPALIVE timeout of around 10 minutes. Somewhere in that range

  8. mikehearn commented at 1:00 pm on July 8, 2013: contributor

    It would be easy to adjust the timeout for onion addresses, but really, I’m not sure we want nodes in the network that can’t respond within a minute. Slow nodes can have terrible effects on the user experience for SPV wallets. They will automatically drop to the bottom of the preference list in bitcoinj and not be used for much except observing broadcasts, for that reason.

    That said, just dropping the slowest peer when we run out of slots on the bitcoind side indeed sounds reasonable.

  9. jgarzik commented at 3:05 am on August 25, 2013: contributor
    How about a 5-minute timeout, and get this merged? Maybe 1 minute is too short, but 90 is far too long.
  10. sipa commented at 0:16 am on October 14, 2013: member
    Suggestion: send pings every 2 minutes (even in case something was received recently, with the new ping-response-time-measurement that means we always get some useful latency information), and disconnect after receiving nothing for 5 minutes.
  11. gmaxwell commented at 5:15 pm on October 14, 2013: contributor
    @sipa ACK unconditional 2 minute ping plus 5 minute disconnect.
  12. sipa commented at 10:42 pm on October 14, 2013: member
    Rebased & updated.
  13. gavinandresen commented at 11:10 pm on October 14, 2013: contributor
    ACK.
  14. sipa commented at 11:21 pm on October 14, 2013: member

    There may be a bug in this code; will investigate later.

    EDIT: fixed.

  15. jgarzik commented at 2:21 am on October 15, 2013: contributor

    ACK

    Optional nit: “5 * 60” is more self-documenting than “300”, and the previous code used the “M * 60” notation. All modern compilers will automatically convert the more human-readable M*60 at compile time, so there is no added runtime cost.

  16. sipa commented at 8:08 pm on October 15, 2013: member
    Moved/added the ping/timeout constants to net.h (where they can be accessed by both net & main), and changed the timeout logic a bit: either there is an unanswered ping >5 minutes old, or there has been no message received at all for 5 minutes.
  17. in src/net.h: in 22bef8dafd outdated
    240@@ -237,11 +241,11 @@ class CNode
    241     std::multimap<int64, CInv> mapAskFor;
    242 
    243     // Ping time measurement
    244-    uint64 nPingNonceSent;
    245-    int64 nPingUsecStart;
    246-    int64 nPingUsecTime;
    247-    bool fPingQueued;
    248-    
    249+    uint64 nPingNonceSent; // The pong reply we're expecting, or 0 if no pong expected.
    


    Diapolo commented at 8:55 pm on October 15, 2013:
    Small nit: Perhaps indent the comments to be in line.

    sipa commented at 9:36 pm on October 15, 2013:
    Done.
  18. sipa commented at 6:25 pm on October 16, 2013: member
    Currently running this patch myself on bitcoin.sipa.be. I see a surprisingly high number of ping timeouts and inactivity timeouts (every 10 minutes or so, very irregularly). I’ll investigate whether these are actual connections that go dead (in which case this patch is actually useful…) or a problem with the disconnection logic.
  19. luke-jr commented at 1:11 am on February 21, 2014: member
    @sipa What did you find?
  20. sipa commented at 1:30 pm on May 10, 2014: member

    Rebased, and made some changes (better reporting, and don’t enforce 5-minute receive timeout on pre-BIP31 peers).

    I’m testing it again myself.

  21. sipa commented at 1:35 pm on May 10, 2014: member
    Strange, I have one peer (/Satoshi:0.8.6/, proto v70001) which seems to disconnect when it receives a ping message, and then immediately reconnects.
  22. sipa commented at 9:34 am on May 11, 2014: member

    Seems the majority of timeouts detected are ping timeouts from getaddr.bitnodes.io (which apparently only recently implemented ping/pong - though probably not yet deployed).

    Apart from that I do quite some others, from various clients and IPs, every ~10 minutes on average the past hours. These look like genuine connection problems.

  23. ayeowch commented at 9:48 am on May 11, 2014: contributor
    @sipa I have just deployed the ping/pong update to getaddr.bitnodes.io. The crawler should be responding to incoming ping properly now.
  24. sipa commented at 10:17 am on May 11, 2014: member
    @ayeowch Cool, good to know. I’m not seeing any pongs yet, though…
  25. ayeowch commented at 10:26 am on May 11, 2014: contributor
    @sipa Hmm.. If the crawler (subver=/getaddr.bitnodes.io:0.1/) is already connected to your node. Try sending a ping with bitcoind ping. I did get pong from the crawler by doing this.
  26. sipa commented at 10:31 am on May 11, 2014: member

    2014-05-11 10:27:53 Sending ping to [2a01:4f8:202:81b1::2]:48980 /getaddr.bitnodes.io:0.1/

    No pong afterwards (while I do receive pongs from other nodes).

  27. sipa commented at 11:50 am on May 31, 2014: member

    For reference: the problem with bitnodes.io was lingering connections from a crawler that didn’t support BIP31 (but advertized a protocol version that did), which was fixed on their side now.

    I’m seeing occassional disconnects because of the new ping timeouts, but I believe these are correct.

    Any objections to merging?

  28. laanwj commented at 4:55 am on June 5, 2014: member
    @sipa No objections to pinging more, but I’m a bit wary about the ‘disconnect nodes faster’ part of this. Going from 90 minutes to 5 minutes is a big change. It wouldn’t be the first time that something is introduced that can hang bitcoind for a few minutes.
  29. leofidus commented at 0:55 am on June 6, 2014: none
    I think there are already scenarios where network pings are not answered within 5 minutes. For example the importprivkey rpc command can lock cs_main for many minutes during a rescan. If during that time a network message arrives which also requires a lock on cs_main (for example a harmless inv), this locks the network thread until the rescan is complete. This leaves the peer unable to answer pings for more than 10 minutes.
  30. sipa commented at 1:02 am on June 6, 2014: member
    I think it’s reasonable that peers disconnect you during that time. You can’t use their service, and they can’t use yours.
  31. laanwj commented at 3:53 am on June 6, 2014: member
    To me it just doesn’t feel very robust to have the network fall apart when somehow 5 minutes of lag are introduced.
  32. sipa commented at 9:25 pm on June 8, 2014: member
    @laanwj What timeout would you find acceptable?
  33. laanwj commented at 11:11 am on June 9, 2014: member
    Let’s pick uhm… 20 minutes. That slices the current inactivity timeout in four already. We can always lower the value later.
  34. gmaxwell commented at 8:57 pm on June 9, 2014: contributor

    @wumpus TCP itself will fail over long before that if the network itself is delayed. A node thats holding up our connection but not responding for >5 minutes is really broken as far as we should care. It can reconnect when it can start responding again. (and yes, in some cases bitocoind is ‘broken’, but thats okay… reconnecting is fairly cheap).

    The flip side of a long timeout is that we’ll leave connections up to busted peers— keeping ourselves offline (or our resources consumed) for up to the timeout— when we could otherwise have useful connections up.

    Maybe it would make sense to have a conservative upper limit for now, I certantly wouldn’t want to delay pulling this further to debate the timeout. I’m happy with anything 5 minutes or longer but would prefer closer to 5 minutes. Future peer rotation could prioritize punting long idle connection, which would address the problem that you might sit partitioned from the network for many minutes with no working connections.

  35. sipa commented at 9:00 pm on June 9, 2014: member
    Increased the timeout to 20 minutes.
  36. gmaxwell commented at 9:05 pm on June 9, 2014: contributor
    @sipa The commit message is now wrong.
  37. Ping automatically every 2 minutes (unconditionally)
    ... instead of after 30 minutes of no sending, for latency measurement
    and keep-alive. Also, disconnect if no reply arrives within 20 minutes,
    instead of 90 of inactivity (for peers supporting the 'pong' message).
    f1920e8606
  38. sipa commented at 9:07 pm on June 9, 2014: member
    Fixed.
  39. BitcoinPullTester commented at 9:31 pm on June 9, 2014: none
    Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/f1920e86063d0ed008c6028d8223b0e21889bf75 for binaries and test log. This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/ Contact BlueMatt on freenode if something looks broken.
  40. laanwj commented at 9:56 pm on June 9, 2014: member
    @gmaxwell I just want to be careful here. @sipa Looks good to me now.
  41. laanwj merged this on Jun 12, 2014
  42. laanwj closed this on Jun 12, 2014

  43. laanwj referenced this in commit 3f39b9d455 on Jun 12, 2014
  44. cozz commented at 5:58 pm on June 13, 2014: contributor

    Just tested on mainnet:

    • start with maxconnections=1
    • initial pingtime under a second
    • wait for block sync to start
    • ping
    • pingtime=140s

    The problem is the node is busy sending us blocks, so does not respond to the ping. I think we should delay the ping when blocks are in flight.

    ping

  45. gmaxwell commented at 6:06 pm on June 13, 2014: contributor
    @cozz I don’t understand your concern. Yes, pings will take a lot of time in the case, but it will not be disconnected.
  46. cozz commented at 6:32 pm on June 13, 2014: contributor
    Well, I thought if the pingtime is used for peer-selection later, then you probably dont want to ping, if you know that the ping will be much slower as normal, because blocks are in flight. I dont have any concerns for now, just saying that if we are going to judge a peer by its periodically pingtime, a slow ping does not necessarily mean a slow node. In my case the node is a good node, just busy.
  47. Bushstar referenced this in commit 9e70209e49 on Apr 5, 2019
  48. fanquake deleted a comment on May 11, 2019
  49. DrahtBot locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-22 03:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me