Disconnect from outbound peers with bad headers chains

sdaftuar commented at 9:06 pm on October 12, 2017: member

The first commit will disconnect an outbound peer that serves us a headers chain with insufficient work while we’re in IBD.

The second commit introduces a way to disconnect outbound peers whose chains fall out of sync with ours:

For a given outbound peer, we check whether their best known block (which is known from the blocks they announce to us) has at least as much work as our tip. If it doesn’t, we set a 20 minute timeout, and if we still haven’t heard about a block with as much work as our tip had when we set the timeout, then we send a single getheaders message, and wait 2 more minutes. If after two minutes their best known block has insufficient work, we disconnect that peer.

We protect 4 of our outbound peers (who provide some “good” headers chains, ie a chain with at least as much work as our tip at some point) from being subject to this logic, to prevent excessive network topology changes as a result of this algorithm, while still ensuring that we have a reasonable number of nodes not known to be on bogus chains.

We also don’t require our peers to be on the same chain as us, to prevent accidental partitioning of the network in the event of a chain split. Note that if our peers are ever on a more work chain than our tip, then we will download and validate it, and then either reorg to it, or learn of a consensus incompatibility with that peer and disconnect. This PR is designed to protect against peers that are on a less work chain which we may never try to download and validate.

fanquake added the label P2P on Oct 12, 2017

fanquake added this to the milestone 0.15.0.2 on Oct 12, 2017

in src/net_processing.cpp:520 in d5b4bf323f outdated

516@@ -502,7 +517,7 @@ void PeerLogicValidation::InitializeNode(CNode *pnode) {
517     NodeId nodeid = pnode->GetId();
518     {
519         LOCK(cs_main);
520-        mapNodeState.emplace_hint(mapNodeState.end(), std::piecewise_construct, std::forward_as_tuple(nodeid), std::forward_as_tuple(addr, std::move(addrName)));
521+        auto it = mapNodeState.emplace_hint(mapNodeState.end(), std::piecewise_construct, std::forward_as_tuple(nodeid), std::forward_as_tuple(addr, std::move(addrName)));

sipa commented at 0:40 am on October 13, 2017:

Unused variable

in src/net_processing.cpp:3282 in d5b4bf323f outdated

3277+            // This is an outbound peer subject to disconnection if their chain
3278+            // lags behind ours (note: if their chain has more work than ours,
3279+            // we should sync to it, unless it's invalid, in which case we
3280+            // should find that out and disconnect from them elsewhere).
3281+            if (state.pindexBestKnownBlock != nullptr && state.pindexBestKnownBlock->nChainWork >= chainActive.Tip()->nChainWork)
3282+            {

sipa commented at 0:48 am on October 13, 2017:

Nit: same line

sipa commented at 1:55 am on October 13, 2017: member

I’ve spent some time verifying the logic, and it looks very good.

I briefly misunderstood it to mean that if a peer makes progress during the 20 minute window, it would be extended. That would have been a problem (an incompatible chain that has less work but does grow continuously), but it’s not what happens (it requires reaching the same amount of work as we had at the beginning of the window).

A test for the second commit would be great.

in src/net_processing.cpp:2448 in 3203afcc7a outdated

2375+                // This peer has too little work on their headers chain to help
2376+                // us sync -- disconnect if using an outbound slot (unless
2377+                // whitelisted).
2378+                if (!(pfrom->fInbound || pfrom->fWhitelisted || pfrom->fAddnode)) {
2379+                    pfrom->fDisconnect = true;
2380+                }

gmaxwell commented at 9:17 am on October 13, 2017:

I think this needs to be backed off a bit: Consider– imagine we start and are fully synced up, but we are still in IsInitialBlockDownload because the network has been slow and so the tip’s timestamp is a ways back. Then we have a peer we get headers from who is a single block behind us. This would disconnect them, and I think that is probably undesirable.

It’s probably a good idea to review all code as if IsInitialBlockDownload() were just return true; and if there would be misbehavior in that case, then the usage is perhaps suboptimal. An obvious conservative thing to do would be to just use the nMinimumChainWork … though perhaps too conservative. It might also be reasonable to use a GetBlockProofEquivalentTime like test, or just the work several blocks back from chainactive tip. I think if it was conservative enough it could apply at all times, not just IsIninitialBlockDownload().

Maybe the existence of the second commit means that it would be okay for this one to be less aggressive?

sdaftuar commented at 1:14 pm on October 13, 2017:

I guess with the second commit, we will eventually disconnect some peers if they stay on chains with less work than our tip. So I think we could just leave this as nMinimumChainWork, to protect against the case that all our peers only have incomplete/bogus chains (which wouldn’t be prevented by the second commit).

promag commented at 2:47 pm on October 24, 2017:

Does it make sense to avoid the push above: https://github.com/bitcoin/bitcoin/blob/e9d58023aef97d954b7a75da806ee4f22c9760c3/src/net_processing.cpp#L2332 now that the node will disconnect?

Edit: nevermind, since this is protected with nCount != MAX_HEADERS_RESULTS.

sdaftuar commented at 1:16 pm on October 13, 2017: member

Thanks for the review. I’ve pushed up a couple quick fixes for the nits. Will work on a test, as well as trying to implement a suggestion @gmaxwell gave me offline for improving the behavior in the situation where none of our peers are giving us any block announcements.

sdaftuar force-pushed on Oct 14, 2017

in src/net_processing.cpp:2433 in 6e21a4a793 outdated

2382@@ -2383,6 +2383,21 @@ bool static ProcessMessage(CNode* pfrom, const std::string& strCommand, CDataStr
2383                 }
2384             }
2385         }
2386+        // If we're in IBD, we want outbound peers that will serve us a useful
2387+        // chain.  Disconnect peers that are behind us or on chains with
2388+        // insufficient work.
2389+        if (IsInitialBlockDownload() && nCount != MAX_HEADERS_RESULTS) {

TheBlueMatt commented at 8:23 pm on October 18, 2017:

I believe this adds a new invariant (currently true), that we shall never send a GETHEADERS or do something which expects, as a result, a HEADERS message, which is not MAX_HEADERS_RESULT if it doesn’t end with the peer’s tip (ie no GETHEADERS with hashStop of not-their-tip, or at least not when in IBD/before we’ve finished syncing the state of another peer). I don’t think this is bad, but it should be documented somewhere explicitly.

Eventually I really think we need to have peers in a “mode” which describes were we are 1. connect-handshake (version/verack/sendcmpct/sendheaders/feefilter/etc) 2. header-chain-sync (slightly complicated by the fact that we want to block all peers but one during ibd until we’ve finished one) 3. normal operation and then we can simplify and document strange things like this significantly instead of making everything yet more of a mess.

sdaftuar commented at 3:54 pm on October 19, 2017:

I believe this adds a new invariant (currently true), that we shall never send a GETHEADERS or do something which expects, as a result, a HEADERS message, which is not MAX_HEADERS_RESULT if it doesn’t end with the peer’s tip (ie no GETHEADERS with hashStop of not-their-tip, or at least not when in IBD/before we’ve finished syncing the state of another peer). I don’t think this is bad, but it should be documented somewhere explicitly.

I don’t think this is precisely correct. What we are assuming is the same thing we use for headers sync (eg on initial connection) – if we receive fewer than MAX_HEADERS_RESULTS, then we have no reason to request more headers in order to update pindexBestKnownBlock. Note that in the logic below, we are using pindexBestKnownBlock, not pindexLast.

In other words, if we did add a reason for requesting a single header to a random location in the chain, I think this logic would still be correct.

promag commented at 11:28 pm on October 19, 2017:

Swap conditions as the second is faster?

sdaftuar commented at 0:13 am on October 20, 2017:

I don’t think this really matters? In the steady state (ie after we’ve caught up from initial sync), the second condition will generally be true, while the first condition will be false.

TheBlueMatt commented at 0:16 am on October 20, 2017:

if we receive fewer than MAX_HEADERS_RESULTS, then we have no reason to request more headers in order to update pindexBestKnownBlock. Note that in the logic below, we are using pindexBestKnownBlock, not pindexLast.

I’m confused, then. If we randomly decide to request the header at height 1 in the middle of syncing our peer’s header chain, we’d do fine in the rest of the ::HEADERS handling here, just (correctly) not requesting any further headers built on top of it. However, we’d immediately disconnect them if we’re not in IBD by this new check. Previously this wouldn’t break anything, as the next HEADERS they send us will result in us requesting further headers.

sdaftuar commented at 1:09 am on October 20, 2017:

Ah, you’re right about that case – I was thinking of the case where we send a random getheaders after we’ve already synced the chain, and therefore updated pindexBestKnownBlock to be some recent-ish thing.

I guess it is true that if you just added a random getheaders message in the middle of initial sync, you might not break the sync logic, but this new logic would cause a disconnect… I can add a comment indicating that we shouldn’t do anything funky until after we’ve synced a peer’s headers chain?

in src/net_processing.cpp:207 in 4b8a4bd29c outdated

202@@ -200,6 +203,14 @@ struct CNodeState {
203      * otherwise: whether this peer sends non-witnesses in cmpctblocks/blocktxns.
204      */
205     bool fSupportsDesiredCmpctVersion;
206+    //! Whether this peer is protected from disconnection due to a bad/slow chain
207+    bool fProtectFromDisconnect;

TheBlueMatt commented at 10:13 pm on October 18, 2017:

new code should have the m_ prefix here (and other places), and probably drop the hungarian type prefix.

in src/net_processing.cpp:3323 in b44013e9e7 outdated

3318+                    // They've run out of time to catch up!
3319+                    LogPrintf("Disconnecting outbound peer %d for old chain, best known block = %s\n", pto->GetId(), state.pindexBestKnownBlock != nullptr ? state.pindexBestKnownBlock->GetBlockHash().ToString() : "<none>");
3320+                    pto->fDisconnect = true;
3321+                } else {
3322+                    const CBlockIndex *pindexStart = state.header_with_required_work;
3323+                    assert (pindexStart != nullptr);

TheBlueMatt commented at 11:21 pm on October 18, 2017:

Ugh, just for a print?

sdaftuar commented at 4:59 pm on October 19, 2017:

agreed, chucking it

TheBlueMatt commented at 11:24 pm on October 18, 2017: member

Looks pretty good. This is begging for a clearer node-sync-state model, but its fine for 0.15.0.2. Does want a test.

laanwj added the label Needs backport on Oct 19, 2017

sdaftuar force-pushed on Oct 19, 2017

sdaftuar commented at 5:05 pm on October 19, 2017: member

Thanks for the review – pushed up some fixup commits, which I’ll squash.

I switched to using GetTime() everywhere, and was then able to write up a unit test. This is currently next-to-impossible to test in the regtest environment, I believe, since we don’t support non-manual connections in regtest.

sdaftuar force-pushed on Oct 19, 2017

sdaftuar commented at 5:23 pm on October 19, 2017: member

Squashed (https://github.com/sdaftuar/bitcoin/commits/11490.1 -> 032dc531b911a0c0ea570aa74531508f54a36073)

jonasschnelli commented at 7:03 pm on October 19, 2017: contributor

Travis found:

0test/DoS_tests.cpp(80): error in "outbound_slow_chain_eviction": check dummyNode1.fDisconnect == true failed

sdaftuar force-pushed on Oct 19, 2017

sdaftuar commented at 9:02 pm on October 19, 2017: member

Unit test issue should be fixed now.

in src/net_processing.cpp:552 in 0a3d957935 outdated

548@@ -534,6 +549,8 @@ void PeerLogicValidation::FinalizeNode(NodeId nodeid, bool& fUpdateConnectionTim
549     nPreferredDownload -= state->fPreferredDownload;
550     nPeersWithValidatedDownloads -= (state->nBlocksInFlightValidHeaders != 0);
551     assert(nPeersWithValidatedDownloads >= 0);
552+    g_outbound_peers_with_protect_from_disconnect -= state->m_protect_from_disconnect;

promag commented at 11:07 pm on October 19, 2017:

int -= bool?

promag commented at 11:26 pm on October 19, 2017:

More verbose but IMO more clear:

0if (state->m_protect_from_disconnect) {
1    assert(g_outbound_peers_with_protect_from_disconnect > 0);
2    --g_outbound_peers_with_protect_from_disconnect;
3}

sdaftuar commented at 0:12 am on October 20, 2017:

I was following the pattern already in use in the lines above (nPreferredDownload -= state->fPreferredDownload). But I can change to something like what you suggest if this is frowned upon for some reason.

in src/net_processing.cpp:128 in 0a3d957935 outdated

123@@ -124,6 +124,9 @@ namespace {
124     /** Number of peers from which we're downloading blocks. */
125     int nPeersWithValidatedDownloads = 0;
126 
127+    /** Number of outbound peers with m_protect_from_disconnect. */
128+    int g_outbound_peers_with_protect_from_disconnect = 0;

promag commented at 11:10 pm on October 19, 2017:

Remove since this is easily calculated by counting nodes with m_protect_from_disconnect?

sdaftuar commented at 0:10 am on October 20, 2017:

By looping over all the CNodeState objects when this value is needed? I don’t think that makes sense. Note that tracking the value here is analogous to what we do with nPeersWithValidatedDownloads and nPreferredDownload.

in src/net_processing.h:27 in 0a3d957935 outdated

20@@ -21,6 +21,12 @@ static const unsigned int DEFAULT_BLOCK_RECONSTRUCTION_EXTRA_TXN = 100;
21  *  Timeout = base + per_header * (expected number of headers) */
22 static constexpr int64_t HEADERS_DOWNLOAD_TIMEOUT_BASE = 15 * 60 * 1000000; // 15 minutes
23 static constexpr int64_t HEADERS_DOWNLOAD_TIMEOUT_PER_HEADER = 1000; // 1ms/header
24+/** Protect at least this many outbound peers from disconnection due to slow/
25+ * behind headers chain.
26+ */
27+static constexpr int32_t MAX_OUTBOUND_PEERS_TO_PROTECT = 4;

promag commented at 11:12 pm on October 19, 2017:

_FROM_DISCONNECT?

in src/net_processing.cpp:2405 in 0a3d957935 outdated

2400@@ -2383,6 +2401,31 @@ bool static ProcessMessage(CNode* pfrom, const std::string& strCommand, CDataStr
2401                 }
2402             }
2403         }
2404+        // If we're in IBD, we want outbound peers that will serve us a useful
2405+        // chain.  Disconnect peers that are behind us or on chains with

promag commented at 11:28 pm on October 19, 2017:

Remove extra space.

in src/net_processing.cpp:3317 in 0a3d957935 outdated

3312+                state.m_headers_chain_timeout = time_in_seconds + CHAIN_SYNC_TIMEOUT;
3313+                state.m_header_with_required_work = chainActive.Tip();
3314+                state.m_sent_getheaders_to_check_chain_sync = false;
3315+            } else if (state.m_headers_chain_timeout > 0 && time_in_seconds > state.m_headers_chain_timeout) {
3316+                // No evidence yet that our peer has synced to a chain with work equal to that
3317+                // of our tip, when we first detected it was behind.  Send a single getheaders

promag commented at 11:34 pm on October 19, 2017:

Remove extra space.

sdaftuar force-pushed on Oct 20, 2017

sdaftuar commented at 0:32 am on October 20, 2017: member

Addressed some of @promag’s nits at https://github.com/sdaftuar/bitcoin/commits/11490.2, squashed -> 39e0c8e

in src/net_processing.cpp:2435 in e9d58023ae outdated

2382@@ -2383,6 +2383,20 @@ bool static ProcessMessage(CNode* pfrom, const std::string& strCommand, CDataStr
2383                 }
2384             }
2385         }
2386+        // If we're in IBD, we want outbound peers that will serve us a useful
2387+        // chain. Disconnect peers that are on chains with insufficient work.
2388+        if (IsInitialBlockDownload() && nCount != MAX_HEADERS_RESULTS) {
2389+            // When nCount < MAX_HEADERS_RESULTS, we know we have no more
2390+            // headers to fetch from this peer.
2391+            if (nodestate->pindexBestKnownBlock->nChainWork < nMinimumChainWork) {

ryanofsky commented at 7:58 pm on October 23, 2017:

In commit “Disconnecting from bad outbound peers in IBD”

Can you update comment to say why this is checking against nMinimumChainWork (and not chainActive.Tip()->nChainWork)?

ryanofsky commented at 4:20 pm on October 24, 2017:

Thanks. New “compare their tip” text might make a little more sense as part of the “peer has too little work” comment below rather than the “no more headers to fetch from this peer” comment up here.

in test/functional/minchainwork.py:30 in e9d58023ae outdated

26@@ -27,7 +27,7 @@ class MinimumChainWorkTest(BitcoinTestFramework):
27     def set_test_params(self):
28         self.setup_clean_chain = True
29         self.num_nodes = 3
30-        self.extra_args = [[], ["-minimumchainwork=0x65"], ["-minimumchainwork=0x65"]]
31+        self.extra_args = [['-whitelist=127.0.0.1'], ["-minimumchainwork=0x65", '-whitelist=127.0.0.1'], ["-minimumchainwork=0x65"]]

ryanofsky commented at 8:06 pm on October 23, 2017:

In commit “Disconnecting from bad outbound peers in IBD”

A comment saying why -whitelist is needed would be helfpul I think

in src/net_processing.cpp:206 in 9a70054f65 outdated

202@@ -200,6 +203,14 @@ struct CNodeState {
203      * otherwise: whether this peer sends non-witnesses in cmpctblocks/blocktxns.
204      */
205     bool fSupportsDesiredCmpctVersion;
206+    //! Whether this peer is protected from disconnection due to a bad/slow chain

ryanofsky commented at 8:19 pm on October 23, 2017:

In commit “Permit disconnection of outbound peers on bad/slow chains”

I think you should do something to more clearly group these variables together. You could use doxygen grouping syntax:

0//! State used to enforce CHAIN_SYNC_TIMEOUT.
1//! @{
2... variable declarations here ...
3//! @}

You could also give them a common prefix like m_sync_timeout_enable / m_sync_timeout_deadline / m_sync_timeout_min_work / m_sync_timeout_sent_getheaders or move them to a nested struct, though that’s probably asking a lot this far into the review.

in src/net_processing.cpp:3295 in 9a70054f65 outdated

3288@@ -3261,6 +3289,52 @@ bool PeerLogicValidation::SendMessages(CNode* pto, std::atomic<bool>& interruptM
3289             }
3290         }
3291 
3292+        // Check that outbound peers have reasonable chains
3293+        // GetTime() is used by this anti-DoS logic so we can test this using mocktime
3294+        int64_t time_in_seconds = GetTime();
3295+        if (!(state.m_protect_from_disconnect || pto->fInbound || pto->m_manual_connection || pto->fFeeler || pto->fOneShot) && state.fSyncStarted) {

ryanofsky commented at 8:40 pm on October 23, 2017:

In commit “Permit disconnection of outbound peers on bad/slow chains”

It seems like it could be good to have an pto->isOutbound() helper returning !pto->fInbound && !pto->m_manual_connection && !pto->fFeeler && !pto->fOneShot so this long string of checks doesn’t need to be repeated here and above in getheaders.

in src/net_processing.cpp:3297 in 9a70054f65 outdated

3288@@ -3261,6 +3289,52 @@ bool PeerLogicValidation::SendMessages(CNode* pto, std::atomic<bool>& interruptM
3289             }
3290         }
3291 
3292+        // Check that outbound peers have reasonable chains
3293+        // GetTime() is used by this anti-DoS logic so we can test this using mocktime
3294+        int64_t time_in_seconds = GetTime();
3295+        if (!(state.m_protect_from_disconnect || pto->fInbound || pto->m_manual_connection || pto->fFeeler || pto->fOneShot) && state.fSyncStarted) {
3296+            // This is an outbound peer subject to disconnection if their chain
3297+            // lags behind ours (note: if their chain has more work than ours,

ryanofsky commented at 8:48 pm on October 23, 2017:

In commit “Permit disconnection of outbound peers on bad/slow chains”

Instead of “if their chain lags behind ours” maybe be more specific and say “if they don’t announce a block with as much work as the current tip within CHAIN_SYNC_TIMEOUT+HEADERS_RESPONSE_TIME seconds.”

in src/net_processing.cpp:3328 in 9a70054f65 outdated

3295+        if (!(state.m_protect_from_disconnect || pto->fInbound || pto->m_manual_connection || pto->fFeeler || pto->fOneShot) && state.fSyncStarted) {
3296+            // This is an outbound peer subject to disconnection if their chain
3297+            // lags behind ours (note: if their chain has more work than ours,
3298+            // we should sync to it, unless it's invalid, in which case we
3299+            // should find that out and disconnect from them elsewhere).
3300+            if (state.pindexBestKnownBlock != nullptr && state.pindexBestKnownBlock->nChainWork >= chainActive.Tip()->nChainWork) {

ryanofsky commented at 9:08 pm on October 23, 2017:

In commit “Permit disconnection of outbound peers on bad/slow chains”:

I don’t see why the first and second branches of this if/else are distinct. It seems like you could unite them and drop unneeded checks with:

0if (!state.m_header_with_required_work || (state.pindexBestKnownBlock && state.pindexBestKnownBlock->nChainWork >= state.m_header_with_required_work->nChainWork) {
1    // Reset timeout and min required work if peer caught up to previous requirement.
2    state.m_headers_chain_timeout = time_in_seconds + CHAIN_SYNC_TIMEOUT;
3    state.m_header_with_required_work = chainActive.Tip();
4    state.m_sent_getheaders_to_check_chain_sync = false;
5} else if (time_in_seconds > state.m_headers_chain_timeout) {
6...

sdaftuar commented at 1:22 pm on October 24, 2017:

Hm, I’ll think about this. I haven’t quite convinced myself yet that there are no edge-case differences between your simplification and my original patch but you may be right.

in src/test/DoS_tests.cpp:78 in 39e0c8e1e7 outdated

71+    peerLogic->SendMessages(&dummyNode1, interruptDummy); // should result in getheaders
72+
73+    int64_t nStartTime = GetTime();
74+    // Wait 21 minutes, to
75+    SetMockTime(nStartTime+21*60);
76+    peerLogic->SendMessages(&dummyNode1, interruptDummy); // should result in getheaders

ryanofsky commented at 9:34 pm on October 23, 2017:

In commit “Add unit test for outbound peer eviction”

Maybe check dummyNode1.vSendMsg size or contents to see if there actually was a getheaders.

ryanofsky commented at 9:44 pm on October 23, 2017: member

utACK 39e0c8e1e7b5cdf57437aa0f957b41183fd0e197

sdaftuar commented at 1:43 pm on October 24, 2017: member

@ryanofsky Thanks for the review; I’ve addressed all but one of your comments, which I’m still thinking about.

in src/net_processing.cpp:210 in b73dcd12b9 outdated

213-    bool m_sent_getheaders_to_check_chain_sync;
214+
215+    /** State used to enforce CHAIN_SYNC_TIMEOUT
216+      * Only in effect for outbound, non-manual connections, with
217+      * m_protect == false
218+      * Algorithm: if a peer's best known block has less work than our tip,

ryanofsky commented at 4:17 pm on October 24, 2017:

Could maybe be more specific about the timeouts in the description. Instead of “T in the future”, could write “CHAIN_SYNC_TIMEOUT seconds in the future”, instead of “bump the timeout” could write “reset the timeout”, instead of “new shorter timeout” could refer to HEADERS_RESPONSE_TIME.

sdaftuar commented at 5:41 pm on October 24, 2017:

Done

sdaftuar force-pushed on Oct 24, 2017

sdaftuar commented at 4:20 pm on October 24, 2017: member

Squashed https://github.com/sdaftuar/bitcoin/commits/11490.3 -> 9dafb45

ryanofsky commented at 4:21 pm on October 24, 2017: member

Thanks for the updates! utACK 9dafb4587baa13ead7fe78ced3deedd303989016

in src/net_processing.cpp:2395 in 0c4f59b7f8 outdated

2390+            // headers to fetch from this peer. Compare their tip to
2391+            // nMinimumChainWork (rather than chainActive.Tip()) because we
2392+            // won't start block download until we have a headers chain that
2393+            // has at least nMinimumChainWork, even if a peer has a chain past
2394+            // our tip, as an anti-DoS measure.
2395+            if (nodestate->pindexBestKnownBlock->nChainWork < nMinimumChainWork) {

theuni commented at 5:27 pm on October 24, 2017:

I’m unsure if it would be possible to get here without setting a pindexBestKnownBlock, but please double-check to be safe: if (nodestate->pindexBestKnownBlock && ...

sdaftuar commented at 5:41 pm on October 24, 2017:

Done

ryanofsky commented at 5:54 pm on October 24, 2017: member

utACK d9a05060ce3c6b8d752898ef68d3dcc0f68c7fba (just adds two suggested comment updates).

in src/net_processing.cpp:2455 in ac5d938ef7 outdated

2445         }
2446+
2447+        if (!pfrom->fDisconnect && IsOutboundPeer(pfrom)) {
2448+            // If this is an outbound peer, check to see if we should protect
2449+            // it from the bad/lagging chain logic.
2450+            if (g_outbound_peers_with_protect_from_disconnect < MAX_OUTBOUND_PEERS_TO_PROTECT_FROM_DISCONNECT && nodestate->pindexBestKnownBlock->nChainWork >= chainActive.Tip()->nChainWork && !nodestate->m_chain_sync.m_protect) {

theuni commented at 6:36 pm on October 24, 2017:

nodestate->pindexBestKnownBlock needs a nullptr check here too. Though after a quick look, it doesn’t look like we can get here without a pindexBestKnownBlock. If that’s the case (I haven’t traced it fully) an assert may make more sense.

sdaftuar commented at 8:06 pm on October 24, 2017:

Yeah it should be impossible due to the call to UpdateBlockAvailability() above, but I’ll add the nullptr check anyway.

in src/net_processing.cpp:3320 in ac5d938ef7 outdated

3314@@ -3265,6 +3315,54 @@ bool PeerLogicValidation::SendMessages(CNode* pto, std::atomic<bool>& interruptM
3315             }
3316         }
3317 
3318+        // Check that outbound peers have reasonable chains
3319+        // GetTime() is used by this anti-DoS logic so we can test this using mocktime
3320+        int64_t time_in_seconds = GetTime();

theuni commented at 6:57 pm on October 24, 2017:

Can this be factored out into a function? It seems we really only need to call it when we get new headers, or when a timer expires. Moving it out will make it easier to break up the SendMessages monster down the road.

sdaftuar commented at 8:42 pm on October 24, 2017:

Done

in src/net_processing.cpp:537 in ac5d938ef7 outdated

531@@ -502,6 +532,13 @@ void FindNextBlocksToDownload(NodeId nodeid, unsigned int count, std::vector<con
532 
533 } // namespace
534 
535+// Returns true for outbound peers, excluding manual connections, feelers, and
536+// one-shots
537+bool IsOutboundPeer(const CNode *node)

theuni commented at 7:13 pm on October 24, 2017:

This name is really deceiving. Maybe IsOutboundDisconnectionCandidate?

sdaftuar commented at 8:42 pm on October 24, 2017:

done

in src/net_processing.cpp:3334 in ac5d938ef7 outdated

3329+                if (state.m_chain_sync.m_timeout != 0) {
3330+                    state.m_chain_sync.m_timeout = 0;
3331+                    state.m_chain_sync.m_work_header = nullptr;
3332+                    state.m_chain_sync.m_sent_getheaders = false;
3333+                }
3334+            } else if (state.m_chain_sync.m_timeout == 0 || (state.m_chain_sync.m_work_header != nullptr && state.pindexBestKnownBlock != nullptr && state.pindexBestKnownBlock->nChainWork >= state.m_chain_sync.m_work_header->nChainWork)) {

theuni commented at 7:36 pm on October 24, 2017:

I believe peers can avoid being added as disconnection candidates if they just never respond to requests for blocks/headers (pindexBestKnownBlock is unset). In that case (assuming the “Start block sync” fired off an initial GETHEADERS above), should the clock also start here if pindexBestKnownBlock is null?

Note: This comment conflicts with the one above about possibly asserting pindexBestKnownBlock here. Ignore whichever doesn’t apply :)

achow101 commented at 8:26 pm on October 24, 2017:

If they don’t respond, doesn’t the timeout just expire and pindexBestKnownBlock doesn’t matter?

sdaftuar commented at 8:31 pm on October 24, 2017:

~~Nice catch, thanks!~~

Actually I misread – I think if a peer just never responds to a headers request, then (a) they won’t be protected from this logic (because we only protect peers who give us headers with work at least equal to our tip), and (b) we’ll drop into this else if clause if m_timeout == 0, ie if we’ve never set a timeout. The rest of the condition is for the case where we had an existing timeout, then we’ll reset it if we received a header that has sufficient work.

Agree?

theuni commented at 8:54 pm on October 24, 2017:

ok, agreed.

sdaftuar commented at 8:43 pm on October 24, 2017: member

Updated to address latest comments from @theuni.

achow101 commented at 4:09 am on October 25, 2017: member

utACK 011423049270dc288bb99d921620f9e0ba0bb43c Needs squash

sdaftuar force-pushed on Oct 25, 2017

sdaftuar commented at 3:55 pm on October 25, 2017: member

Squashed https://github.com/sdaftuar/bitcoin/commits/11490.4 -> d6c4ad762a1f369a02367bfc18a79e79ab913857

gmaxwell commented at 7:22 pm on October 25, 2017: contributor

ACK. Looks good and I’ve been running it on a node for a couple days, seen some getheaders but no disconnections yet, though I’ve started disconnecting healthy outbound peers with the hope of finding some that fail naturally. :)

gmaxwell commented at 3:22 am on October 26, 2017: contributor

After a kicking all my outbounds every couple hours, I eventually hit on this disconnecting a peer… logs show that it was obviously broken. Good job patch.

Disconnecting from bad outbound peers in IBD

When in IBD, we'd like to use all our outbound peers to help us
sync the chain.  Disconnect any outbound peers whose headers have
insufficient work.

c60fd71a65

Permit disconnection of outbound peers on bad/slow chains

Currently we have no rotation of outbound peers.  If an outbound peer
stops serving us blocks, or is on a consensus-incompatible chain with
less work than our tip (but otherwise valid headers), then we will never
disconnect that peer, even though that peer is using one of our 8
outbound connection slots.  Because we rely on our outbound peers to
find an honest node in order to reach consensus, allowing an
incompatible peer to occupy one of those slots is undesirable,
particularly if it is possible for all such slots to be occupied by such
peers.

Protect against this by always checking to see if a peer's best known
block has less work than our tip, and if so, set a 20 minute timeout --
if the peer is still not known to have caught up to a chain with as much
work as ours after 20 minutes, then send a single getheaders message,
wait 2 more minutes, and if a better header hasn't been received by then,
disconnect that peer.

Note:

- we do not require that our peer sync to the same tip as ours, just an
equal or greater work tip.  (Doing otherwise would risk partitioning the
network in the event of a chain split, and is also unnecessary.)

- we pick 4 of our outbound peers and do not subject them to this logic,
to be more conservative. We don't wish to permit temporary network
issues (or an attacker) to excessively disrupt network topology.

5a6d00c6de

jnewbery commented at 5:48 pm on October 26, 2017: member

utACK 453545ffb5b2843757e6644db98f4e8ac11092d2

Add unit test for outbound peer eviction e065249c01

sdaftuar force-pushed on Oct 26, 2017

sdaftuar commented at 5:53 pm on October 26, 2017: member

@jnewbery pointed out to that the functional test didn’t actually need a change, due to protections given to manual connections (and every connection in regtest is a manual connection). So I reverted the change to minchainwork and just added a comment that explains the situation. Also fixed a comment typo that he pointed out in the unit test.

Squashed https://github.com/sdaftuar/bitcoin/commits/11490.6 -> e065249c014a070a8799b2ff947af5b8f012c5c1

jnewbery commented at 6:25 pm on October 26, 2017: member

utACK e065249c014a070a8799b2ff947af5b8f012c5c1

laanwj added this to the "Review priority 0.15.0.2" column in a project

laanwj merged this on Oct 26, 2017

laanwj closed this on Oct 26, 2017

laanwj referenced this in commit d93fa261f0 on Oct 26, 2017

fanquake removed this from the "Review priority 0.15.0.2" column in a project

MarcoFalke referenced this in commit a764b17d96 on Nov 1, 2017

MarcoFalke referenced this in commit 9e599f1a2c on Nov 1, 2017

MarcoFalke referenced this in commit b36650b1f4 on Nov 1, 2017

MarcoFalke removed the label Needs backport on Nov 1, 2017

MarcoFalke referenced this in commit bf191a7183 on Nov 2, 2017

MarcoFalke referenced this in commit 9961abf9e4 on Nov 2, 2017

MarcoFalke referenced this in commit e3272242e2 on Nov 2, 2017

codablock referenced this in commit 7b909fe998 on Sep 26, 2019

codablock referenced this in commit b451094e2e on Sep 29, 2019

barrystyle referenced this in commit bb9d414a2d on Jan 22, 2020

MarcoFalke locked this on Sep 8, 2021

Disconnect from outbound peers with bad headers chains #11490