Periodically make block-relay connections and sync headers #19858

pull sdaftuar wants to merge 4 commits into bitcoin:master from sdaftuar:2020-08-blocks-only-rotation changing 6 files +126 −20
  1. sdaftuar commented at 9:09 pm on September 1, 2020: member

    To make eclipse attacks more difficult, regularly initiate outbound connections and stay connected long enough to sync headers and potentially learn of new blocks. If we learn a new block, rotate out an existing block-relay peer in favor of the new peer.

    This augments the existing outbound peer rotation: currently we make new full-relay connections when our tip is stale, and disconnect them after waiting a short time to see if we learn a new block. As block-relay connections use minimal bandwidth, we can make these connections regularly and not just when our tip is stale.

    Like feeler connections, these connections are not aggressive; whenever our timer fires (once every 5 minutes on average), we’ll try to initiate a new block-relay connection as described, but if we fail to connect we just wait for our timer to fire again before repeating with a new peer.
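
    For orientation, here is a condensed sketch of the connection-selection branch this adds in CConnman::ThreadOpenConnections (abridged; the variable and constant names are taken from the diffs quoted later in this thread, not the exact code):

        // Abridged sketch, not the exact PR code.
        if (nOutboundBlockRelay < m_max_outbound_block_relay) {
            // Fill the permanent block-relay-only slots first.
            conn_type = ConnectionType::BLOCK_RELAY;
        } else if (GetTryNewOutboundPeer()) {
            // Stale tip detected: try an extra OUTBOUND_FULL_RELAY connection.
        } else if (nTime > nNextExtraBlockRelay && m_start_extra_block_relay_peers) {
            // Periodically open a short-lived extra block-relay-only connection,
            // sync headers with it, and let the eviction logic in net_processing
            // decide whether to keep it (if it gave us a new block) or drop it.
            nNextExtraBlockRelay = PoissonNextSend(nTime, EXTRA_BLOCK_RELAY_ONLY_PEER_INTERVAL);
            conn_type = ConnectionType::BLOCK_RELAY;
        } else if (nTime > nNextFeeler) {
            nNextFeeler = PoissonNextSend(nTime, FEELER_INTERVAL);
            conn_type = ConnectionType::FEELER;
        }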

  2. sdaftuar commented at 9:10 pm on September 1, 2020: member

    This implements something similar to what is discussed in #16859.

    Marking this as draft for now, as it builds on top of #19724.

  3. DrahtBot added the label P2P on Sep 1, 2020
  4. DrahtBot commented at 11:51 pm on September 1, 2020: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #19884 (p2p: No delay in adding fixed seeds if -dnsseed=0 and peers.dat is empty by dhruv)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  5. in src/net.cpp:1768 in 720f112599 outdated
    1765                 ++nOutbound;
    1766             }
    1767         }
    1768     }
    1769-    return std::max(nOutbound - m_max_outbound_full_relay - m_max_outbound_block_relay, 0);
    1770+    return std::max(nOutbound - m_max_outbound_full_relay, 0);
    


    naumenkogs commented at 7:55 am on September 2, 2020:

    Perhaps it’s a good opportunity to refactor this function:

    • rename GetExtraOutboundCount -> GetExtraFullOutboundCount
    • rename int nOutbound -> uint32_t full_outbound_peers

    sdaftuar commented at 2:45 pm on September 3, 2020:
    Fixed.
  6. in src/net_processing.cpp:4049 in d3612a1fd0 outdated
    3960@@ -3961,13 +3961,11 @@ void PeerLogicValidation::EvictExtraOutboundPeers(int64_t time_in_seconds)
    3961             AssertLockHeld(cs_main);
    3962 
    3963             // Ignore non-outbound peers, or nodes marked for disconnect already
    


    naumenkogs commented at 8:08 am on September 2, 2020:
    Should this be something along the lines of “ignore non-outbound-full-relay peers”? Or probably even better: “Consider only outbound full-relay peers not already marked for disconnection”?

    sdaftuar commented at 2:45 pm on September 3, 2020:
    Done.
  7. practicalswift commented at 8:09 am on September 2, 2020: contributor
    Concept ACK
  8. in src/net.cpp:1773 in f0975e7564 outdated
    1767@@ -1768,6 +1768,20 @@ int CConnman::GetExtraOutboundCount()
    1768     return std::max(nOutbound - m_max_outbound_full_relay, 0);
    1769 }
    1770 
    1771+int CConnman::GetExtraBlockRelayCount()
    1772+{
    1773+    int nBlockRelay = 0;
    


    naumenkogs commented at 8:15 am on September 2, 2020:
    nBlockRelay doesn’t seem to follow the code conventions we have for vars

    sdaftuar commented at 2:45 pm on September 3, 2020:
    Fixed.
  9. naumenkogs commented at 8:25 am on September 2, 2020: member

    Concept ACK

    The most interesting question seems to be the eviction criteria after this new block-relay peer gave us a new block, and we want to evict someone.

    I think the “evict the youngest” approach is reasonable: it would be very hard for an attacker to control our block-relay-only connections by just serving blocks faster when we connect to them periodically. They’d have to also maintain very long-lived connections to evict honest peers rather than their own Sybils.


    We still have a couple places with m_tx_relay == nullptr, and in some of them we mention block-relay-only in a related comment. Should we converge at least those to IsBlockOnlyConn() check?
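
    (For context: at this point IsBlockOnlyConn() is essentially a thin wrapper over the connection type, so converging on it is mainly a readability/intent win. Roughly, from net.h:)

        // Paraphrased from net.h; checks the connection type rather than
        // relying on m_tx_relay being unset.
        bool IsBlockOnlyConn() const { return m_conn_type == ConnectionType::BLOCK_RELAY; }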

  10. sdaftuar force-pushed on Sep 3, 2020
  11. sdaftuar force-pushed on Sep 3, 2020
  12. sdaftuar force-pushed on Sep 3, 2020
  13. sdaftuar marked this as ready for review on Sep 3, 2020
  14. sdaftuar commented at 2:46 pm on September 3, 2020: member
    Now that #19724 has been merged, this is ready for review.
  15. sdaftuar commented at 3:06 pm on September 3, 2020: member

    We still have a couple places with m_tx_relay == nullptr, and in some of them we mention block-relay-only in a related comment. Should we converge at least those to IsBlockOnlyConn() check?

    I actually didn’t necessarily mean to include that m_tx_relay == nullptr in the LogPrintf() that I did; the suggestion to switch to the IsBlockOnlyConn() check was leftover from a previous version of this work, where I was going to introduce a new connection type altogether and the m_tx_relay == nullptr check would have been incorrect for identifying the peer type.

    In this PR that change is unnecessary (though an improvement), but I guess I don’t want to do a wholesale review of all the m_tx_relay==nullptr checks still in the code – many of them are places where we check for nullptr before dereferencing, which I think would make sense to leave as-is. Places in the code where we use that check to determine how to do something logical with that peer are different, and should be switched at some point. If there are places like that which make this new code harder to understand, please flag and I can include fixes here, but I think since I’m not adding new peer-types I hope that wouldn’t be necessary.

  16. sdaftuar force-pushed on Sep 3, 2020
  17. in src/net.cpp:1885 in be956cd435 outdated
    1880@@ -1866,15 +1881,39 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect)
    1881         // until we hit our block-relay-only peer limit.
    1882         // GetTryNewOutboundPeer() gets set when a stale tip is detected, so we
    1883         // try opening an additional OUTBOUND_FULL_RELAY connection. If none of
    1884-        // these conditions are met, check the nNextFeeler timer to decide if
    1885-        // we should open a FEELER.
    1886+        // these conditions are met, check to see if it's time to try an extra
    1887+        // block-relay peer (to confirm our tip is current, see below) or the nNextFeeler
    


    ariard commented at 0:56 am on September 4, 2020:
    I think it would be better to converge naming for BLOCK_RELAY and thus ease reviewing. They’re commented as either block-relay or block-relay-only peers through the codebase but actually designate the same type. I’m even on which one we pick as long as we’re consistent.

    sdaftuar commented at 4:25 pm on September 4, 2020:
    Fixed in latest version.
  18. in src/net_processing.cpp:3923 in be956cd435 outdated
    4002+        std::pair<NodeId, int64_t> second_youngest_peer = {-1, std::numeric_limits<int64_t>::max()};
    4003+
    4004+        m_connman.ForEachNode([&](CNode* pnode) {
    4005+            LockAssertion lock(::cs_main);
    4006+            if (!pnode->IsBlockOnlyConn() || pnode->fDisconnect) return;
    4007+            if (pnode->GetId() > youngest_peer.first) {
    


    ariard commented at 1:02 am on September 4, 2020:
    This is making an assumption that GetNewNodeId() is a monotonic counter reflecting connection order. It may silently break if we modify id selection to something else (like random nonces). Can we use nTimeConnected instead?

    naumenkogs commented at 7:39 am on September 4, 2020:
    I’d also prefer not to have this assumption, and it seems easy to avoid.

    naumenkogs commented at 9:01 am on September 4, 2020:
    If m_connman.ForEachNode is not ordered, this algorithm may mistakenly select second_youngest.

    sdaftuar commented at 3:29 pm on September 4, 2020:

    Sure I can change this, but keep in mind this logic is used a little further down already, in the existing outbound full-relay peer eviction algorithm.

    EDIT: Actually, I think that while this has some logical appeal, it makes the code read strictly worse: CNode::nTimeConnected is tracked in seconds, so it’s perfectly plausible that you might have two nodes that tie, and you’d presumably break the tie using nodeid anyway! I’m just going to add a comment that we’re using nodeid (i.e. higher NodeId) to determine which peer is younger.
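
    (For illustration, a hypothetical comparator for the nTimeConnected alternative discussed above; not part of the PR, which keeps the simpler NodeId comparison:)

        // Sorting by connection time would still need a NodeId tie-break,
        // since nTimeConnected only has one-second granularity.
        auto is_younger = [](const CNode* a, const CNode* b) {
            if (a->nTimeConnected != b->nTimeConnected) return a->nTimeConnected > b->nTimeConnected;
            return a->GetId() > b->GetId();
        };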


    ariard commented at 2:08 pm on September 8, 2020:

    As an alternative, have you considered caching them? We know both when we open such connections and when we drop them. It would avoid the risk of a logic bug, and avoid iterating over all the other connection types not concerned by this eviction.

    I think it’s more a future work direction.


    sdaftuar commented at 3:04 pm on September 9, 2020:
    I assume you mean caching at the time we open the connection? I think that is problematic: in order to keep that extra state up to date when peers get disconnected and we make new connections afterwards, you have to do a lot of additional error checking, which adds complexity. Doing all the checks in one place, at the point in time when we care about getting to a new/correct state because we’re over our maximum, seems simpler to reason about.
  19. in src/net.cpp:1979 in be956cd435 outdated
    1898+            // methodology from addrman) and stay connected long enough to sync
    1899+            // headers, but not much else.
    1900+            //
    1901+            // Then disconnect the peer, if we haven't learned anything new.
    1902+            //
    1903+            // The idea is to make eclipse attacks very difficult to pull off,
    


    ariard commented at 1:15 am on September 4, 2020:

    I would be more conservative in claims about this new eclipse counter-measure’s effectiveness.

    I believe we should have a hierarchy of eclipse attacks, classified by the resources they require from the attacker to perform them successfully, and use that as a basis to evaluate a counter-measure against a given scenario. The fact that a stronger attack A can easily bypass the counter-measure for attack B doesn’t invalidate the worthiness of counter-measure B.

    For this new periodic-chain-sync counter-measure, I do agree it improves things against eviction-logic takeover or partial addrman poisoning. However, I guess it doesn’t score well against total addrman poisoning or infrastructure-based eclipse.

    As a suggestion, maybe we can add a prefix to any mention of eclipse attacks hinting at the scenario considered, like addrman-poisoning or eviction-gaming?


    sdaftuar commented at 4:00 pm on September 4, 2020:

    I re-read my comment, and I think it’s pretty accurate. If other reviewers think that the language here is somehow too strong and implies this logic is doing something it isn’t, I’ll reconsider.

    Note, by the way, that the behavior introduced here is beneficial not just to the node doing it but to the whole network: a node that is already connected to the honest network and periodically connects to new peers to sync tips helps bridge the entire network.

  20. in src/net_processing.cpp:4116 in be956cd435 outdated
    4107@@ -4064,6 +4108,11 @@ void PeerLogicValidation::CheckForStaleTipAndEvictPeers(const Consensus::Params
    4108         }
    4109         m_stale_tip_check_time = time_in_seconds + STALE_CHECK_INTERVAL;
    4110     }
    4111+
    4112+    if (!m_initial_sync_finished && CanDirectFetch(consensusParams)) {
    


    ariard commented at 1:20 am on September 4, 2020:

    I suppose this doesn’t protect against initial-network-connection eclipse attacks like DNS cache poisoning. Maybe after some timer, based on an optimistic headers-chain sync delay plus observing that our tip isn’t near the local clock, we should trigger this logic anyway?

    That said, if you’ve effectively been eclipsed from the beginning and don’t have any good peers available in your addrman, I don’t think it would change anything.


    sdaftuar commented at 3:35 pm on September 4, 2020:
    Wouldn’t the existing stale-tip check let us get new outbound peers in the case that our tip isn’t updating at all?

    ariard commented at 1:32 pm on September 8, 2020:

    I was assuming someone slowly feeding you the most-PoW valid chain, thus never triggering the stale-tip check. I think a broader definition of eclipse attack should include slow chain feeding, as it opens the door to offchain exploitation.

    That said, I think eclipse attacks during the bootstrap of your network view are a special case, and we can address them later with smarter heuristics based on this work.


    sdaftuar commented at 2:53 pm on September 9, 2020:

    We do require that our initial headers-sync peer provide us with a headers chain that looks reasonable within a bounded amount of time (on the order of 20 minutes if I remember correctly; the time scales with the expected honest chain length and very conservative notions of how long it takes to download headers). However, if we’re connecting blocks slowly, we can’t distinguish between our own node being too slow to validate the entire blockchain (say due to local computing/memory/disk/network resources) and our peers colluding to collectively slow us down.

    This seems like something that needs human intervention to determine that initial sync is in fact going too slowly.
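
    (For reference, the bound mentioned above comes from the existing headers-sync timeout in net_processing.cpp, which has roughly this shape; constants paraphrased from memory and not part of this PR:)

        // ~15 minutes base, plus ~1ms per header we'd expect an honest chain to
        // have accumulated since our best header's timestamp.
        constexpr int64_t HEADERS_DOWNLOAD_TIMEOUT_BASE = 15 * 60 * 1000000;  // microseconds
        constexpr int64_t HEADERS_DOWNLOAD_TIMEOUT_PER_HEADER = 1000;         // microseconds

        int64_t headers_timeout = GetTimeMicros() + HEADERS_DOWNLOAD_TIMEOUT_BASE +
            HEADERS_DOWNLOAD_TIMEOUT_PER_HEADER *
                (GetAdjustedTime() - pindexBestHeader->GetBlockTime()) / consensusParams.nPowTargetSpacing;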

  21. in src/net_processing.cpp:3926 in be956cd435 outdated
    4005+            LockAssertion lock(::cs_main);
    4006+            if (!pnode->IsBlockOnlyConn() || pnode->fDisconnect) return;
    4007+            if (pnode->GetId() > youngest_peer.first) {
    4008+                second_youngest_peer = youngest_peer;
    4009+                youngest_peer.first = pnode->GetId();
    4010+                youngest_peer.second = pnode->nLastBlockTime;
    


    ariard commented at 1:31 am on September 4, 2020:
    This is a bit of a name collision; maybe second_youngest_peer -> next_youngest_peer

    sdaftuar commented at 5:06 pm on September 4, 2020:
    Changed.
  22. in src/net.cpp:1909 in be956cd435 outdated
    1906+            //
    1907+            // This is similar to the logic for trying extra outbound (full-relay)
    1908+            // peers, except:
    1909+            // - we do this all the time on a poisson timer, rather than just when
    1910+            //   our tip is stale
    1911+            // - we potentially disconnect our next-youngest block-relay peer, if we
    


    ariard commented at 1:46 am on September 4, 2020:

    I think this is proper even if it biases things a bit towards attackers. I assume they run modified clients, and thus always win new-block announcement races and have public sybils well spread across /16s.

    We start with Alice and Malicia as our 2 block-relay connections:

    • Alice’s connection gets dropped due to an event that is external from our node’s viewpoint (like a network issue, or Alice’s inbound eviction logic)
    • Malicia becomes the oldest connection
    • We open a new block-relay connection to Bob
    • We open a new chain-sync connection to Malicia_2
    • Malicia_2 announces a new best block, updating her nLastBlockTime
    • We drop Bob
    • We have Malicia and Malicia_2 as stable block-only peers

    Now, to conserve its advantage, the attacker must make sure to update our chain tip faster than we learn any headers through the newly spawned chain-sync peer; otherwise Malicia_2 would be evicted by Caroll, the first such peer to send us a better block. The attacker has to perform useful work to keep its bias, which defeats the purpose of the attack?


    sdaftuar commented at 3:44 pm on September 4, 2020:
    Yep. Most importantly, because we’re always trying a new peer and exchanging best-tip information, this should create a very high cost for an attacker to actually split the network for a meaningful amount of time (even if they’ve taken all our initial peer slots).
  23. ariard commented at 1:48 am on September 4, 2020: member
    I’m leaning towards Concept ACK but have you considered impact with #17428 ? I fear it may reduce its usefulness.
  24. in src/net.h:55 in 013b41fc27 outdated
    50@@ -51,6 +51,8 @@ static const bool DEFAULT_WHITELISTFORCERELAY = false;
    51 static const int TIMEOUT_INTERVAL = 20 * 60;
    52 /** Run the feeler connection loop once every 2 minutes or 120 seconds. **/
    53 static const int FEELER_INTERVAL = 120;
    54+/** Run the chain-sync connection loop once every 5 minutes. **/
    55+static const int CHAIN_SYNC_INTERVAL = 300;
    


    naumenkogs commented at 7:34 am on September 4, 2020:
    can we make this mockable from the beginning? (std::chrono)

    sdaftuar commented at 4:26 pm on September 4, 2020:
    Can we punt until someone also changes the feeler logic to be the same? Right now the logic for both is very similar, which I think helps readability. (Also, I find std::chrono to be harder to work with than the tools I know, so I’m afraid I’ll introduce an error if I try to make the change myself.)

    naumenkogs commented at 8:48 am on September 7, 2020:

    Not sure what you mean by “feeler logic to be the same”, but I’m making feeler timings mockable as part of #19869, you’re very welcome to review :)

    My opinion is not very strong here, we can update it later, I just thought it’s a good opportunity.


    sdaftuar commented at 2:54 pm on September 9, 2020:
    If #19869 is merged first, then I’ll update the code here as well when I rebase.
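
    (If the interval were made chrono-based as suggested, it might look something like the following; purely illustrative, and #19869 may do it differently:)

        // Hypothetical std::chrono version of the new interval constant; the PR
        // currently uses a plain integer number of seconds.
        static constexpr auto EXTRA_BLOCK_RELAY_ONLY_PEER_INTERVAL = std::chrono::minutes{5};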
  25. in src/net_processing.h:211 in 013b41fc27 outdated
    93@@ -94,6 +94,9 @@ class PeerLogicValidation final : public CValidationInterface, public NetEventsI
    94 private:
    95     int64_t m_stale_tip_check_time; //!< Next time to check for stale tip
    96 
    97+    /** Whether we've completed initial sync yet, for determining when to turn
    98+      * on extra block-relay peers. */
    99+    bool m_initial_sync_finished{false};
    


    naumenkogs commented at 7:41 am on September 4, 2020:
    Should this ever be set back to false? For example, if we were offline for a week and we know we’re catching up.

    sdaftuar commented at 3:39 pm on September 4, 2020:

    If you go offline for a week by shutting down bitcoind there is no issue; if you close your laptop or disconnect from the network though then yes you’re right that we’ll use these occasional peers to help us catch up, which is not the intent. However, we don’t have a good way to distinguish that situation in our code right now… Arguably stale-tip checking shouldn’t fire either in those circumstances but we don’t try to do anything to limit that?

    I’m inclined to leave this, and if we somehow improve our software to detect circumstances like that, then we can update this logic accordingly.


    jnewbery commented at 12:48 pm on November 22, 2020:

    if you close your laptop or disconnect from the network though then yes you’re right that we’ll use these occasional peers to help us catch up, which is not the intent.

    If this is the intent (to stop making these short-lived connections if we’ve fallen behind the tip), then I think we can achieve that fairly easily by removing this caching variable, making CanDirectFetch() a public method on PeerManager and calling that function whenever we need to test if we should make an additional block-relay-only connection:

     0diff --git a/src/net.cpp b/src/net.cpp
     1index 48977aeadf..1de1bda9a8 100644
     2--- a/src/net.cpp
     3+++ b/src/net.cpp
     4@@ -1957,7 +1957,7 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect)
     5             conn_type = ConnectionType::BLOCK_RELAY;
     6         } else if (GetTryNewOutboundPeer()) {
     7             // OUTBOUND_FULL_RELAY
     8-        } else if (nTime > nNextExtraBlockRelay && m_start_extra_block_relay_peers) {
     9+        } else if (nTime > nNextExtraBlockRelay && m_msgproc->CanDirectFetch()) {
    10             // Periodically connect to a peer (using regular outbound selection
    11             // methodology from addrman) and stay connected long enough to sync
    12             // headers, but not much else.
    13diff --git a/src/net.h b/src/net.h
    14index 58a5b36918..c836161f83 100644
    15--- a/src/net.h
    16+++ b/src/net.h
    17@@ -635,6 +635,7 @@ public:
    18     virtual bool SendMessages(CNode* pnode) = 0;
    19     virtual void InitializeNode(CNode* pnode) = 0;
    20     virtual void FinalizeNode(NodeId id, bool& update_connection_time) = 0;
    21+    virtual bool CanDirectFetch() const = 0;
    22 
    23 protected:
    24     /**
    25diff --git a/src/net_processing.cpp b/src/net_processing.cpp
    26index ad40d67a97..ef47d00e73 100644
    27--- a/src/net_processing.cpp
    28+++ b/src/net_processing.cpp
    29@@ -883,6 +883,11 @@ void RequestTx(CNodeState* state, const GenTxid& gtxid, std::chrono::microsecond
    30 
    31 } // namespace
    32 
    33+bool PeerManager::CanDirectFetch() const
    34+{
    35+    return ::CanDirectFetch(m_chainparams.GetConsensus());
    36+}
    37+
    38 // This function is used for testing the stale tip eviction logic, see
    39 // denialofservice_tests.cpp
    40 void UpdateLastBlockAnnounceTime(NodeId node, int64_t time_in_seconds)
    41@@ -1956,7 +1961,7 @@ void PeerManager::ProcessHeadersMessage(CNode& pfrom, const std::vector<CBlockHe
    42             m_connman.PushMessage(&pfrom, msgMaker.Make(NetMsgType::GETHEADERS, ::ChainActive().GetLocator(pindexLast), uint256()));
    43         }
    44 
    45-        bool fCanDirectFetch = CanDirectFetch(m_chainparams.GetConsensus());
    46+        bool fCanDirectFetch = CanDirectFetch();
    47         // If this set of headers is valid and ends in a block with at least as
    48         // much work as our tip, download as much as possible.
    49         if (fCanDirectFetch && pindexLast->IsValid(BLOCK_VALID_TREE) && ::ChainActive().Tip()->nChainWork <= pindexLast->nChainWork) {
    50@@ -3261,7 +3266,7 @@ void PeerManager::ProcessMessage(CNode& pfrom, const std::string& msg_type, CDat
    51         }
    52 
    53         // If we're not close to tip yet, give up and let parallel block fetch work its magic
    54-        if (!fAlreadyInFlight && !CanDirectFetch(m_chainparams.GetConsensus()))
    55+        if (!fAlreadyInFlight && !CanDirectFetch())
    56             return;
    57 
    58         if (IsWitnessEnabled(pindex->pprev, m_chainparams.GetConsensus()) && !nodestate->fSupportsDesiredCmpctVersion) {
    59@@ -4073,7 +4078,7 @@ void PeerManager::CheckForStaleTipAndEvictPeers()
    60         m_stale_tip_check_time = time_in_seconds + STALE_CHECK_INTERVAL;
    61     }
    62 
    63-    if (!m_initial_sync_finished && CanDirectFetch(m_chainparams.GetConsensus())) {
    64+    if (!m_initial_sync_finished && CanDirectFetch()) {
    65         m_connman.StartExtraBlockRelayPeers();
    66         m_initial_sync_finished = true;
    67     }
    68diff --git a/src/net_processing.h b/src/net_processing.h
    69index 6e3e032831..88bf7ff2a6 100644
    70--- a/src/net_processing.h
    71+++ b/src/net_processing.h
    72@@ -93,6 +93,11 @@ public:
    73      */
    74     void Misbehaving(const NodeId pnode, const int howmuch, const std::string& message);
    75 
    76+    /**
    77+     * Return whether our tip block's time is close enough to current time that we can directly fetch.
    78+     */
    79+    bool CanDirectFetch() const;
    80+
    81 private:
    82     /**
    83      * Potentially mark a node discouraged based on the contents of a BlockValidationState object
    

    That approach may be preferable for a couple of reasons:

    • Placing the logic that checks/sets the condition under which we’ll make additional block-relay-only peers in the same place that it makes those connections makes it much easier to reason about those state transitions. Currently m_start_extra_block_relay_peers is set based on a loop in net_processing and then read on a timer in net.
    • Caching the state internally makes it more difficult to test all the code paths. If the start_extra_block_relay_peers condition is set by a call into PeerManager, then that interface can be mocked and code paths can be hit more easily in unit/fuzz testing.

    sdaftuar commented at 3:42 pm on November 27, 2020:
    I think this would mean that we stop using block-relay-only peer rotation when our tip is stale, which might be when we want these connections happening the most? It becomes arguable whether we should just rely on stale-tip-checking + full outbound peer rotation at that point, but I think we would want to carefully reason about our protections in that scenario.
  26. in src/net_processing.cpp:3991 in 013b41fc27 outdated
    3987@@ -3988,6 +3988,51 @@ void PeerLogicValidation::ConsiderEviction(CNode& pto, int64_t time_in_seconds)
    3988 
    3989 void PeerLogicValidation::EvictExtraOutboundPeers(int64_t time_in_seconds)
    3990 {
    3991+    // If we have any extra block-relay peers, disconnect the youngest unless
    


    naumenkogs commented at 9:07 am on September 4, 2020:
    nit: it’s a little unclear that the youngest peer is the temporary connection created just a couple of minutes ago; it would be cool to explain that more explicitly.

    sdaftuar commented at 5:05 pm on September 4, 2020:
    Fixed in latest version.
  27. in src/net_processing.cpp:3937 in 013b41fc27 outdated
    4014+            // second youngest.
    4015+            to_disconnect = second_youngest_peer.first;
    4016+        }
    4017+        m_connman.ForNode(to_disconnect, [&](CNode* pnode) {
    4018+            LockAssertion lock(::cs_main);
    4019+            // Make sure we're not getting a block right now, and that
    


    naumenkogs commented at 9:17 am on September 4, 2020:
    nit: I had to double-check that nBlocksInFlight can’t be tricked with fake blocks. If you’re going to update it, maybe just say “block with a valid header”?

    sdaftuar commented at 3:30 pm on September 4, 2020:
    Sure I can add a hint – I thought it was well-understood that we only request blocks that are on a (valid) headers chain resulting in a tip that has >= work as our tip, as we use that fact in multiple places in net_processing, I think.

    naumenkogs commented at 8:30 am on September 7, 2020:

    My worry was that nBlocksInFlight might contain just a hash for an announced block (pre-compact blocks) which we have no idea about. Then announcing any random hash would trick us.

    Your current comment is sufficient to clarify that this is not possible.
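
    (The guard being referred to is the direct-fetch condition in ProcessHeadersMessage, which also appears in a diff quoted later in this thread; roughly:)

        // Blocks are only requested (and counted in nBlocksInFlight) when they
        // sit on a valid headers chain with at least as much work as our tip.
        if (fCanDirectFetch && pindexLast->IsValid(BLOCK_VALID_TREE) &&
            ::ChainActive().Tip()->nChainWork <= pindexLast->nChainWork) {
            // ... select and request the missing blocks ...
        }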

  28. sdaftuar commented at 3:28 pm on September 4, 2020: member

    I’m leaning towards Concept ACK but have you considered impact with #17428 ? I fear it may reduce its usefulness.

    I don’t think #17428 has any meaningful interaction with this. These connections are short-lived, and generally speaking under normal circumstances rotation doesn’t actually occur, because we’re unlikely to get a new block from one of these connections rather than from one of our high-bandwidth compact block peers. Even if rotation were to occur, I don’t think #17428 has any meaningful negative consequences here – keeping track of helpful block-relay peers seems an independently good idea.

  29. sdaftuar force-pushed on Sep 4, 2020
  30. sdaftuar commented at 5:19 pm on September 4, 2020: member

    Responded to feedback so far, mostly doc/comment/variable naming changes.

    I did make a change to the eviction algorithm: it now checks that the newest peer served us a block more recently than the second-youngest peer before deciding to evict the second-youngest peer instead. However, I haven’t yet been able to test the eviction logic (in a few weeks of running various versions of this code, I haven’t seen it evict an older block-relay-only peer), so this is something to be looked at carefully.

  31. naumenkogs commented at 9:06 am on September 7, 2020: member

    utACK 02cf9c7c649c6e3a425b486ad2a64c4f1a0c3fe7


    If we want to make a strong statement that this is by all means an improvement over the existing approach, we might want to increase the block-relay-only conns limit (m_max_outbound_block_relay) by 1. Otherwise, an attacker may make (currently) 1 out of 2 of our block-relay-only conns more vulnerable to eviction by feeding us blocks faster.

    Of course, to be a successful eclipse attack, they would have to do much more…

  32. DrahtBot added the label Needs rebase on Sep 7, 2020
  33. in src/net.cpp:1944 in 411056e3ce outdated
    1907+            // This is similar to the logic for trying extra outbound (full-relay)
    1908+            // peers, except:
    1909+            // - we do this all the time on a poisson timer, rather than just when
    1910+            //   our tip is stale
    1911+            // - we potentially disconnect our next-youngest block-relay-only peer, if we
    1912+            //   download a new block from this new block-relay-only peer.
    


    ariard commented at 1:58 pm on September 8, 2020:
    Update comment to reflect latest changes (411056e) on keeping the most recent block announcement?

    sdaftuar commented at 3:52 pm on September 9, 2020:
    Thanks! Fixed.
  34. in src/net.h:335 in 411056e3ce outdated
    319@@ -318,13 +320,20 @@ class CConnman
    320     void SetTryNewOutboundPeer(bool flag);
    321     bool GetTryNewOutboundPeer();
    322 
    323+    void StartExtraBlockRelayPeers() {
    


    ariard commented at 2:00 pm on September 8, 2020:
    Have you considered a MarkExtraBlockRelayPeers function which (a) sets m_start_extra_block_relay_peers to true if it isn’t already, and (b) returns the state of m_start_extra_block_relay_peers? That way you could drop the new m_initial_sync_finished, I think.

    sdaftuar commented at 3:52 pm on September 9, 2020:
    I had a slight preference for not frequently looking at an atomic; m_initial_sync_finished is only used in one thread and therefore doesn’t need locks. This issue seems minor either way, so planning to leave it as-is.
  35. in src/net_processing.cpp:3973 in 411056e3ce outdated
    3987@@ -3988,6 +3988,55 @@ void PeerLogicValidation::ConsiderEviction(CNode& pto, int64_t time_in_seconds)
    3988 
    3989 void PeerLogicValidation::EvictExtraOutboundPeers(int64_t time_in_seconds)
    3990 {
    3991+    // If we have any extra block-relay-only peers, disconnect the youngest unless
    3992+    // it's given us a block -- in which case, disconnect the second youngest
    


    ariard commented at 2:02 pm on September 8, 2020:
    Same here, reflect the latest changes: “a more recent block than the one provided by the next-youngest”?

    sdaftuar commented at 3:53 pm on September 9, 2020:
    Thanks! Fixed.
  36. in src/net_processing.cpp:3941 in 411056e3ce outdated
    4020+            LockAssertion lock(::cs_main);
    4021+            // Make sure we're not getting a block right now, and that
    4022+            // we've been connected long enough for this eviction to happen
    4023+            // at all.
    4024+            // Note that we only request blocks from a peer if we learn of a
    4025+            // valid headers chain with at least as much work as our tip.
    


    ariard commented at 2:11 pm on September 8, 2020:
    “so any in-flight block is a good hint that such peer will soon provide us with useful work”

    sdaftuar commented at 3:53 pm on September 9, 2020:
    This seems self-evident! Leaving as-is.
  37. in src/net_processing.h:209 in 411056e3ce outdated
    93@@ -94,6 +94,9 @@ class PeerLogicValidation final : public CValidationInterface, public NetEventsI
    94 private:
    95     int64_t m_stale_tip_check_time; //!< Next time to check for stale tip
    96 
    97+    /** Whether we've completed initial sync yet, for determining when to turn
    


    ariard commented at 2:11 pm on September 8, 2020:
    You can mention CanDirectFetch as the completed-initial-sync-yet condition.

    sdaftuar commented at 3:54 pm on September 9, 2020:
    I don’t think that’s necessary here.
  38. practicalswift commented at 2:12 pm on September 8, 2020: contributor

    @naumenkogs

    If we want to make a strong statement that this is by all means an improvement over the existing approach, we might want to increase the block-relay-only conns limit (m_max_outbound_block_relay) by 1. Otherwise, an attacker may make (currently) 1 out of 2 of our block-relay-only conns more vulnerable to eviction by feeding us blocks faster.

    What would be the best arguments against bumping m_max_outbound_block_relay by 1?

    (Personally I think (tentatively) that increasing m_max_outbound_block_relay is a good idea, but it would be nice to have the best arguments against too :))

  39. in src/net_processing.cpp:3916 in 411056e3ce outdated
    3987@@ -3988,6 +3988,55 @@ void PeerLogicValidation::ConsiderEviction(CNode& pto, int64_t time_in_seconds)
    3988 
    3989 void PeerLogicValidation::EvictExtraOutboundPeers(int64_t time_in_seconds)
    3990 {
    3991+    // If we have any extra block-relay-only peers, disconnect the youngest unless
    3992+    // it's given us a block -- in which case, disconnect the second youngest
    3993+    // instead (so we rotate our youngest block-relay-only peer).
    3994+    // The youngest block-relay-only peer would be the extra peer we connected
    3995+    // to temporarily in order to sync our tip; see net.cpp.
    


    ariard commented at 2:12 pm on September 8, 2020:
    “See ThreadOpenConnections”

    sdaftuar commented at 3:54 pm on September 9, 2020:
    I think this is fine as-is.
  40. ariard commented at 2:26 pm on September 8, 2020: member

    Code Review ACK-pending-on-rebase. I took the opportunity to add a few more code/comment improvements; feel free to take them.

    Even if rotation were to occur, I don’t think #17428 has any meaningful negative consequences here – keeping track of helpful block-relay peers seems an independently good idea.

    I was thinking the reverse! More like #19858 having negative consequences on #17428. An attacker’s sybil, if selected as an extra block-relay peer, might be able to outrun a high-bandwidth compact-block peer and thereby evict that peer from its anchor slot, breaking the best-effort connection stability of anchor peers? After this current work, I would lean towards introducing these anchors as a completely new type of peer to ease reasoning about them.

    With regards to functional test coverage we are limited by #19315 and mocking addresses of P2P connections I guess ?

  41. sdaftuar commented at 2:33 pm on September 8, 2020: member

    What would be the best arguments against bumping m_max_outbound_block_relay by 1?

    The best argument I’m aware of: we use up more connection slots on listening nodes, which is a limited resource. So if we start increasing the number of outbound connections we make, we should think hard about also increasing the default number of connections we expect a listening node to be able to handle.

    I think that these connections should be lightweight enough that we ought to be able to allocate more inbound connection slots to low-resource peers like these; however that is not easy for us to do until we have a way to communicate to our outbound peer when we connect that we’re a block-relay-only peer and not planning to re-enable transaction relay later on in the connection’s lifetime.

    My plan to increase this number is to: (1) first propose a p2p protocol upgrade that would inform our peer that we’re block-relay only, then (2) add support for reducing resources that we allocate to such inbound peers, and then (3) propose that we increase the number of inbound connections by accounting for block-relay inbounds differently from the rest, and finally (4) then increase the default number of outbound block-relay-only peers.

    But if there are other concerns about increasing this number, I’d like to hear them! I’m currently working on steps 1 and 2 in the above.

    At any rate, I’m not inclined to propose increasing the number of connections in this PR, because I didn’t want to use further permanent block-relay-only connections until we made more progress reducing the resource impact.

  42. sdaftuar force-pushed on Sep 9, 2020
  43. DrahtBot removed the label Needs rebase on Sep 9, 2020
  44. sdaftuar force-pushed on Sep 9, 2020
  45. sdaftuar commented at 3:59 pm on September 9, 2020: member

    Rebased and updated to fix some documentation issues pointed out by @ariard.

    I was thinking the reverse! More like #19858 having negative consequences on #17428. An attacker’s sybil, if selected as an extra block-relay peer, might be able to outrun a high-bandwidth compact-block peer and thereby evict that peer from its anchor slot.

    I don’t think that is a big concern – by construction, only 1 out of our current 2 block-relay-only peers will be able to be rotated this way, and to maintain this attack an adversary has to continually give us the best block faster than anyone else, which makes them not much of an attacker until they stop. And once they stop, this logic continues to protect us because it’s always connecting to new peers to sync tips.

    With regards to functional test coverage we are limited by #19315 and mocking addresses of P2P connections I guess ?

    Yeah, I’m not aware of any way to test this code in our current testing framework. I believe that even after #19315, we still will not have a way to exercise the logic in ThreadOpenConnections, so I think manual testing is best for this PR.

  46. naumenkogs commented at 6:39 am on September 10, 2020: member
    ACK 8ea760b10221eb01b057363693ab74cdc810fd4a
  47. sdaftuar commented at 1:25 pm on September 10, 2020: member

    If anyone is curious, I tried to exercise the eviction logic a bit with this patch added on top of this PR:

     0diff --git a/src/net.cpp b/src/net.cpp
     1index c7b64f9d505..5a47d9d8c8d 100644
     2--- a/src/net.cpp
     3+++ b/src/net.cpp
     4@@ -2863,6 +2863,11 @@ CNode::CNode(NodeId idIn, ServiceFlags nLocalServicesIn, int nMyStartingHeightIn
     5 
     6     m_deserializer = MakeUnique<V1TransportDeserializer>(V1TransportDeserializer(Params().MessageStart(), SER_NETWORK, INIT_PROTO_VERSION));
     7     m_serializer = MakeUnique<V1TransportSerializer>(V1TransportSerializer());
     8+
     9+    // 10% chance of initializing nLastBlockTime to current time
    10+    if (GetRandInt(100) <= 10) {
    11+        nLastBlockTime = GetTime();
    12+    }
    13 }
    14 
    15 CNode::~CNode()
    16diff --git a/src/net_processing.cpp b/src/net_processing.cpp
    17index 2f98663e7cf..05dd1f38a6c 100644
    18--- a/src/net_processing.cpp
    19+++ b/src/net_processing.cpp
    20@@ -4006,6 +4006,9 @@ void PeerManager::EvictExtraOutboundPeers(int64_t time_in_seconds)
    21             if (node_state == nullptr ||
    22                 (time_in_seconds - pnode->nTimeConnected >= MINIMUM_CONNECT_TIME && node_state->nBlocksInFlight == 0)) {
    23                 pnode->fDisconnect = true;
    24+                if (to_disconnect != youngest_peer.first) {
    25+                    LogPrint(BCLog::NET, "Block-relay peer rotation: keeping new peer=%d, disconnecting old peer=%d\n", youngest_peer.first, to_disconnect);
    26+                }
    27                 LogPrint(BCLog::NET, "disconnecting extra block-relay-only peer=%d (last block received at time %d)\n", pnode->GetId(), pnode->nLastBlockTime);
    28                 return true;
    29             } else {
    

    Perhaps other reviewers will find additional ways to fully exercise this code.

  48. ariard commented at 4:24 pm on September 15, 2020: member

    Tested/Code Review ACK 8ea760b under the following caveats.

    I observed that the new eviction logic is working as expected and extra block-relay-only peers are rotated based on their last block received time.

    Note for the other reviewers, I took the proposed patch with further modifications:

    • scale down EXTRA_PEER_CHECK_INTERVAL/EXTRA_BLOCK_RELAY_ONLY_PEER_INTERVAL to 5s and 15s
    • activate extra block relay peers opening independently of IBD
    • disable block-sync
    • 50% chance to initialize nLastBlockTime to current time

    This interferes with protecting peers with in-flight blocks, but at least it demonstrates well that the opening and rotation logic works.

  49. in src/net_processing.cpp:3984 in 8ea760b102 outdated
    3981+    if (extra_block_relay_peers > 0) {
    3982+        std::pair<NodeId, int64_t> youngest_peer = {-1, 0};
    3983+        std::pair<NodeId, int64_t> next_youngest_peer = {-1, 0};
    3984+
    3985+        m_connman.ForEachNode([&](CNode* pnode) {
    3986+            LockAssertion lock(::cs_main);
    


    jonatack commented at 3:23 pm on October 4, 2020:

    The two lock assertions added in this function need to be updated to build on current master:

     0@@ -3964,8 +3964,8 @@ void PeerManager::EvictExtraOutboundPeers(int64_t time_in_seconds)
     1         std::pair<NodeId, int64_t> youngest_peer = {-1, 0};
     2         std::pair<NodeId, int64_t> next_youngest_peer = {-1, 0};
     3 
     4-        m_connman.ForEachNode([&](CNode* pnode) {
     5-            LockAssertion lock(::cs_main);
     6+        m_connman.ForEachNode([&](CNode* pnode) EXCLUSIVE_LOCKS_REQUIRED(::cs_main) {
     7+            AssertLockHeld(::cs_main);
     8             if (!pnode->IsBlockOnlyConn() || pnode->fDisconnect) return;
     9@@ -3979,8 +3979,8 @@ void PeerManager::EvictExtraOutboundPeers(int64_t time_in_seconds)
    10             // disconnect our second youngest.
    11             to_disconnect = next_youngest_peer.first;
    12         }
    13-        m_connman.ForNode(to_disconnect, [&](CNode* pnode) {
    14-            LockAssertion lock(::cs_main);
    15+        m_connman.ForNode(to_disconnect, [&](CNode* pnode) EXCLUSIVE_LOCKS_REQUIRED(::cs_main) {
    16+            AssertLockHeld(::cs_main);
    17             // Make sure we're not getting a ```
    

    jonatack commented at 3:27 pm on October 4, 2020:
    (see #19979)

    sdaftuar commented at 5:49 pm on October 5, 2020:
    Thanks, updated.
  50. jonatack commented at 3:25 pm on October 4, 2020: member
    Concept ACK
  51. in src/net_processing.cpp:3964 in 8ea760b102 outdated
    3978+    // to temporarily in order to sync our tip; see net.cpp.
    3979+    // Note that we use higher nodeid as a measure for most recent connection.
    3980+    int extra_block_relay_peers = m_connman.GetExtraBlockRelayCount();
    3981+    if (extra_block_relay_peers > 0) {
    3982+        std::pair<NodeId, int64_t> youngest_peer = {-1, 0};
    3983+        std::pair<NodeId, int64_t> next_youngest_peer = {-1, 0};
    


    jonatack commented at 8:54 am on October 5, 2020:

    perhaps

    0     // Note that we use higher nodeid as a measure for most recent connection.
    1-    int extra_block_relay_peers = m_connman.GetExtraBlockRelayCount();
    2-    if (extra_block_relay_peers > 0) {
    3-        std::pair<NodeId, int64_t> youngest_peer = {-1, 0};
    4-        std::pair<NodeId, int64_t> next_youngest_peer = {-1, 0};
    5+    if (m_connman.GetExtraBlockRelayCount() > 0) {
    6+        std::pair<NodeId, int64_t> youngest_peer{-1, 0}, next_youngest_peer{-1, 0};
    

    sdaftuar commented at 6:18 pm on October 5, 2020:
    Done.
  52. in src/net_processing.cpp:4021 in 8ea760b102 outdated
    4018+        });
    4019+    }
    4020+
    4021+    // Check whether we have too many OUTBOUND_FULL_RELAY peers
    4022+    int extra_peers = m_connman.GetExtraFullOutboundCount();
    4023     if (extra_peers > 0) {
    


    jonatack commented at 8:57 am on October 5, 2020:

    perhaps, as extra_peers isn’t reused

    0-    int extra_peers = m_connman.GetExtraFullOutboundCount();
    1-    if (extra_peers > 0) {
    2+    if (m_connman.GetExtraFullOutboundCount() > 0) {
    

    sdaftuar commented at 6:18 pm on October 5, 2020:
    Done.
  53. in src/net_processing.cpp:3910 in 8ea760b102 outdated
    3968@@ -3969,11 +3969,58 @@ void PeerManager::ConsiderEviction(CNode& pto, int64_t time_in_seconds)
    3969 
    3970 void PeerManager::EvictExtraOutboundPeers(int64_t time_in_seconds)
    


    jonatack commented at 9:00 am on October 5, 2020:
    EvictExtraOutboundPeers() seems like a natural candidate to be separated into two functions, EvictExtraOutboundBlockRelayPeers() and EvictExtraOutboundFullRelayPeers(). This might enable better unit testing of the logic. If you agree, it may be easier to do this now than later.

    sdaftuar commented at 6:24 pm on October 5, 2020:

    Well, there’s also an argument to be made that it’s simpler to have all of the outbound peer eviction logic in one place/one callback.

    If there is a reason (like testing) to do so, I think splitting this up in the future could make sense, but for now I’d like to keep the changes here to be more straightforward. So assuming this implementation is correct I’d prefer to leave it as-is.


    jnewbery commented at 1:51 pm on November 22, 2020:
    I agree that splitting this function would be a better structure. There’s no shared state between the top and bottom half of the function, and splitting logic into smaller pieces makes it easier to follow. It also allows guard clauses and early exits, which reduces deep nesting and is a very clear way of expressing the intent of the function.

    sdaftuar commented at 3:25 pm on November 27, 2020:
    Feel free to pick this up in a future PR if you like; I’m leaving this as-is for now as I think this is easier to review and understand if this all happens in the same scheduler callback.
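
    (For future reference, the split discussed above might look something like this; hypothetical names and signatures, not part of the PR:)

        // Hypothetical refactor: keep a single scheduler callback, delegate to
        // two helpers so each piece can be unit-tested separately.
        void PeerManager::EvictExtraOutboundPeers(int64_t time_in_seconds)
        {
            EvictExtraBlockRelayOnlyPeers(time_in_seconds);
            EvictExtraFullRelayPeers(time_in_seconds);
        }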
  54. jonatack commented at 9:43 am on October 5, 2020: member

    Ran this overnight with the following variant of @sdaftuar’s patch #19858 (comment) with some additional logging.

     0
     1--- a/src/net.cpp
     2+++ b/src/net.cpp
     3@@ -1831,6 +1831,7 @@ int CConnman::GetExtraFullOutboundCount()
     4+    LogPrintf("GetExtraFullOutboundCount() %d\n", std::max(full_outbound_peers - m_max_outbound_full_relay, 0));
     5     return std::max(full_outbound_peers - m_max_outbound_full_relay, 0);
     6 }
     7@@ -1845,6 +1846,7 @@ int CConnman::GetExtraBlockRelayCount()
     8+    LogPrintf("GetExtraBlockRelayCount() %d\n", std::max(block_relay_peers - m_max_outbound_block_relay, 0));
     9     return std::max(block_relay_peers - m_max_outbound_block_relay, 0);
    10 }
    11@@ -1877,6 +1879,7 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect)
    12     // Minimum time before next feeler connection (in microseconds).
    13     int64_t nNextFeeler = PoissonNextSend(nStart*1000*1000, FEELER_INTERVAL);
    14     int64_t nNextExtraBlockRelay = PoissonNextSend(nStart*1000*1000, EXTRA_BLOCK_RELAY_ONLY_PEER_INTERVAL);
    15+    LogPrintf("nNextExtraBlockRelay %d\n", nNextExtraBlockRelay);
    16     while (!interruptNet)
    17@@ -1980,6 +1983,7 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect)
    18             // connections, they do not get their own ConnectionType enum
    19             // (similar to how we deal with extra outbound peers).
    20             nNextExtraBlockRelay = PoissonNextSend(nTime, EXTRA_BLOCK_RELAY_ONLY_PEER_INTERVAL);
    21+            LogPrintf("nNextExtraBlockRelay %d\n", nNextExtraBlockRelay);
    22             conn_type = ConnectionType::BLOCK_RELAY;
    23@@ -2936,6 +2940,11 @@ CNode::CNode(NodeId idIn, ServiceFlags nLocalServicesIn, int nMyStartingHeightIn
    24 
    25     m_deserializer = MakeUnique<V1TransportDeserializer>(V1TransportDeserializer(Params(), GetId(), SER_NETWORK, INIT_PROTO_VERSION));
    26     m_serializer = MakeUnique<V1TransportSerializer>(V1TransportSerializer());
    27+    // 10% chance of initializing nLastBlockTime to current time
    28+    if (GetRandInt(100) <= 10) {
    29+        nLastBlockTime = GetTime();
    30+    }
    31--- a/src/net_processing.cpp
    32+++ b/src/net_processing.cpp
    33@@ -3990,6 +3988,9 @@ void PeerManager::EvictExtraOutboundPeers(int64_t time_in_seconds)
    34             if (node_state == nullptr ||
    35                 (time_in_seconds - pnode->nTimeConnected >= MINIMUM_CONNECT_TIME && node_state->nBlocksInFlight == 0)) {
    36                 pnode->fDisconnect = true;
    37+                if (to_disconnect != youngest_peer.first) {
    38+                    LogPrint(BCLog::NET, "Block-relay peer rotation: keeping new peer=%d, disconnecting old peer=%d\n", youngest_peer.first, to_disconnect);
    39+                }
    

    Here was the result: pr19858-custom-logging-debug-log-grep.txt

  55. sdaftuar force-pushed on Oct 5, 2020
  56. sdaftuar force-pushed on Oct 5, 2020
  57. sdaftuar commented at 6:28 pm on October 5, 2020: member
    @jonatack Thanks for the review and testing; I’ve rebased to fix the silent merge conflict with the LockAssertion stuff and addressed your comments.
  58. jonatack commented at 2:23 pm on October 6, 2020: member
    Diff per git range-diff 875e1cc 8ea760b 93a0ec1 looks good, will circle back shortly.
  59. luke-jr commented at 3:58 pm on October 20, 2020: member
    Should this wait until it’s verified the new chain is valid?
  60. sdaftuar commented at 4:25 pm on October 20, 2020: member

    Should this wait until it’s verified the new chain is valid?

    @luke-jr I’m interpreting your question as asking: “Should we wait to decide which peer to keep and which to evict, until after any new chain received from the new peer has been validated?” If that question is not what you meant, please clarify.

    I think a good question is exactly what we want to happen if we connect to a peer that is on another (equal-work) fork. (There’s no real need to stay connected to a peer on a less-work chain, while a peer on a more-work chain will have its blocks downloaded and validated and we’ll either stay connected because they gave us new blocks, or disconnect if the chain is invalid.)

    With this PR, it’s very likely we’d stay connected to a peer on an equal-work chain and evict an older peer, because our download logic is designed to download blocks on all chains with work equal to or greater than our current tip. And then the eviction criteria would cause the old peer to be disconnected. I think this is okay behavior; in the event that the new fork gets extended and is valid, it’s probably better for us to have been connected to the part of the network graph that was working on that tip as well as the part of the graph working on the one we had.

    If the alternate chain proves to be invalid, then we can disconnect, but arguably we’ve done some harm by losing our previous (potentially honest) block-relay-peer. However, I think this is a negligible problem, for two reasons: we only ever rotate 1 of our 2 block-relay-peers this way, so there’s no way to use this to disconnect us from all our peers. Also, practically speaking, mining a block in order to have a chain at equal work as our current tip is an extremely expensive attack to pull off, and so while someone could do so in order to take over 1 of our peer slots, it’s not sustainable to repeat block after block. In practice, if this were to occur I’d expect it to happen at the time of a consensus fork (like segwit2x or bcash), in which case this would resolve shortly after the forking point, as the competing chain extends an invalid block, and we’d either eventually detect that (if that chain has more work) or ignore it (if it has less work).

  61. naumenkogs commented at 1:07 pm on October 22, 2020: member
    ACK 93a0ec1a629af533bb21418a3e134c268bc57395
  62. ariard commented at 0:06 am on November 5, 2020: member
    Code Review ACK 93a0ec1. Changes since last ACK: new lock assertions around node iteration in EvictExtraOutboundPeers, more compact conditional in same function, new comments.
  63. in src/net_processing.cpp:3943 in daf74979d7 outdated
    3983+            // we've been connected long enough for this eviction to happen
    3984+            // at all.
    3985+            // Note that we only request blocks from a peer if we learn of a
    3986+            // valid headers chain with at least as much work as our tip.
    3987+            CNodeState *node_state = State(pnode->GetId());
    3988+            if (node_state == nullptr ||
    


    sipa commented at 0:45 am on November 5, 2020:
    Is it possible to hit node_state == nullptr here? If not, I’d rather assert it.

    sdaftuar commented at 2:19 pm on November 10, 2020:

    It shouldn’t be possible (as far as I understand), but this pattern of checking for nullptr seems to be present throughout the code (and we seem to have merged recent commits that add nullptr checks in analogous places, eg fc66d0a65cdc52a3b259effe0c29b5eafb1b5ff5).

    In this particular place in our code, nullptr should be handled just fine if it were to somehow occur (all we’d do is mark the CNode for disconnection, which seems safe and correct), so I don’t think crashing is required in this situation. (Though likely there are other places in our code where we do assert(), so probably we’d crash anyway somewhere else…)


    jnewbery commented at 1:17 pm on November 22, 2020:

    I think it’s better/more defensive not to add asserts in the p2p stack if we can gracefully handle the condition. Here we can just set fDisconnect and hopefully FinalizeNode() will handle cleaning up the state.

    At most, we should use Assume() (https://github.com/bitcoin/bitcoin/pull/20255) which will crash in tests and continue in production.


    ajtowns commented at 9:42 am on December 11, 2020:

    Might have been better to have made it:

    0        CNodeState* state = State(pnode->GetId());
    1        int blocks_in_flight = (state == nullptr ? 0 : state->nBlocksInFlight);
    

    As it is, it looks like node_state == nullptr is something that could happen and requires special casing to ensure we do disconnect it immediately, whereas in reality it’s never reachable (we ensure we have CNodeState prior to adding it to vNodes, and remove it only after removing from vNodes).

    There are plenty of other places in the code where we assert if State() returns a nullptr, so I think an assert here would also be fine.

  64. sipa commented at 1:39 am on November 5, 2020: member
    utACK 93a0ec1a629af533bb21418a3e134c268bc57395
  65. in src/net_processing.cpp:3952 in 93a0ec1a62 outdated
    3997+                return true;
    3998+            } else {
    3999+                LogPrint(BCLog::NET, "keeping block-relay-only peer=%d chosen for eviction (connect time: %d, blocks_in_flight: %d)\n",
    4000+                    pnode->GetId(), pnode->nTimeConnected, node_state->nBlocksInFlight);
    4001+            }
    4002+            return false;
    


    jonatack commented at 10:08 am on November 6, 2020:

    daf74979 nit suggestion (or the inverse change), for symmetry with the extra full outbound code in the same function

    0             } else {
    1                 LogPrint(BCLog::NET, "keeping block-relay-only peer=%d chosen for eviction (connect time: %d, blocks_in_flight: %d)\n",
    2                     pnode->GetId(), pnode->nTimeConnected, node_state->nBlocksInFlight);
    3+                return false;
    4             }
    5-            return false;
    6         });
    
  66. in src/net_processing.cpp:3946 in 93a0ec1a62 outdated
    3991+            // valid headers chain with at least as much work as our tip.
    3992+            CNodeState *node_state = State(pnode->GetId());
    3993+            if (node_state == nullptr ||
    3994+                (time_in_seconds - pnode->nTimeConnected >= MINIMUM_CONNECT_TIME && node_state->nBlocksInFlight == 0)) {
    3995+                pnode->fDisconnect = true;
    3996+                LogPrint(BCLog::NET, "disconnecting extra block-relay-only peer=%d (last block received at time %d)\n", pnode->GetId(), pnode->nLastBlockTime);
    


    jonatack commented at 10:22 am on November 6, 2020:
    daf74979 In my testing, this logging nearly always returned “(last block received at time 0)”; perhaps for this common case the logging could return “(no block received)” or similar.

    sdaftuar commented at 1:16 pm on December 18, 2020:
    @jonatack If you have a patch to do this, please feel free to open a PR (I agree that this is mildly annoying).

    jonatack commented at 2:55 pm on December 19, 2020:
    thanks @sdaftuar, done in #20723
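
    For illustration, a change along those lines could look roughly like this (a sketch only; the wording actually adopted in #20723 may differ):

        LogPrint(BCLog::NET, "disconnecting extra block-relay-only peer=%d (%s)\n", pnode->GetId(),
                 pnode->nLastBlockTime == 0 ? "no block received"
                                            : strprintf("last block received at time %d", pnode->nLastBlockTime));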
  67. jonatack commented at 10:25 am on November 6, 2020: member

    ACK 93a0ec1a629af533bb21418a3e134c268bc57395 code review, running a node with this patch rebased to master for 3 days (previously 24 hours), also previously with custom logging per #19858#pullrequestreview-501847787, with the caveat that I am still thinking about the possible ramifications of this change.

    If you retouch, perhaps (per the language used in EvictExtraOutboundPeers):

    0+++ b/src/net_processing.h
    1@@ -81,7 +81,7 @@ public:
    2-    /** If we have extra outbound peers, try to disconnect the one with the oldest block announcement */
    3+    /** If we have extra outbound or block-relay-only peers, try to disconnect the one with the oldest block announcement */
    4     void EvictExtraOutboundPeers(int64_t time_in_seconds) EXCLUSIVE_LOCKS_REQUIRED(cs_main);
    
  68. in src/net.cpp:1830 in 93a0ec1a62 outdated
    1819@@ -1820,18 +1820,32 @@ void CConnman::SetTryNewOutboundPeer(bool flag)
    1820 // Also exclude peers that haven't finished initial connection handshake yet
    1821 // (so that we don't decide we're over our desired connection limit, and then
    1822 // evict some peer that has finished the handshake)
    1823-int CConnman::GetExtraOutboundCount()
    1824+int CConnman::GetExtraFullOutboundCount()
    


    jnewbery commented at 11:29 am on November 22, 2020:

    Since you’re cleaning this up and adding another very similar function below, consider factoring out those loops and using a more functional style:

     0diff --git a/src/net.cpp b/src/net.cpp
     1index 48977aeadf..8456af01c5 100644
     2--- a/src/net.cpp
     3+++ b/src/net.cpp
     4@@ -1822,30 +1822,14 @@ void CConnman::SetTryNewOutboundPeer(bool flag)
     5 // evict some peer that has finished the handshake)
     6 int CConnman::GetExtraFullOutboundCount()
     7 {
     8-    int full_outbound_peers = 0;
     9-    {
    10-        LOCK(cs_vNodes);
    11-        for (const CNode* pnode : vNodes) {
    12-            if (pnode->fSuccessfullyConnected && !pnode->fDisconnect && pnode->IsFullOutboundConn()) {
    13-                ++full_outbound_peers;
    14-            }
    15-        }
    16-    }
    17-    return std::max(full_outbound_peers - m_max_outbound_full_relay, 0);
    18+    int count{CountNodesIf([](CNode* node) {return node->IsFullOutboundConn();})};
    19+    return std::max(count - m_max_outbound_full_relay, 0);
    20 }
    21 
    22 int CConnman::GetExtraBlockRelayCount()
    23 {
    24-    int block_relay_peers = 0;
    25-    {
    26-        LOCK(cs_vNodes);
    27-        for (const CNode* pnode : vNodes) {
    28-            if (pnode->fSuccessfullyConnected && !pnode->fDisconnect && pnode->IsBlockOnlyConn()) {
    29-                ++block_relay_peers;
    30-            }
    31-        }
    32-    }
    33-    return std::max(block_relay_peers - m_max_outbound_block_relay, 0);
    34+    int count{CountNodesIf([](CNode* node) {return node->IsBlockOnlyConn();})};
    35+    return std::max(count - m_max_outbound_block_relay, 0);
    36 }
    37 
    38 void CConnman::ThreadOpenConnections(const std::vector<std::string> connect)
    39diff --git a/src/net.h b/src/net.h
    40index 58a5b36918..5c8014ace5 100644
    41--- a/src/net.h
    42+++ b/src/net.h
    43@@ -299,6 +299,15 @@ public:
    44         }
    45     };
    46 
    47+    using NodePred = std::function<bool(CNode*)>;
    48+    /** Returns a count of how many fully connected peers satisfy a given predicate. */
    49+    int CountNodesIf(const NodePred& pred) const
    50+    {
    51+        int count{0};
    52+        ForEachNode([&count, &pred](CNode* node) {if (pred(node)) ++count;});
    53+        return count;
    54+    };
    55+
    56     template<typename Callable, typename CallableAfter>
    57     void ForEachNodeThen(Callable&& pre, CallableAfter&& post)
    58     {
    

    I think that style is more compact, expressive, and less bug-prone than hand-written for loops.


    ajtowns commented at 9:15 am on December 11, 2020:
    If this does get refactored further, making it bool HasExtraBlockRelay() and bool HasExtraFullOutbound and returning count > m_max_x instead of max(0, count - m_max_x) would be an idea too.
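
    A rough sketch of that variant, assuming the hypothetical CountNodesIf() helper from the suggestion above:

        bool CConnman::HasExtraFullOutbound()
        {
            // More outbound full-relay peers than our target?
            return CountNodesIf([](CNode* node) { return node->IsFullOutboundConn(); }) > m_max_outbound_full_relay;
        }

        bool CConnman::HasExtraBlockRelay()
        {
            // More outbound block-relay-only peers than our target?
            return CountNodesIf([](CNode* node) { return node->IsBlockOnlyConn(); }) > m_max_outbound_block_relay;
        }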
  69. in src/net_processing.cpp:2469 in 93a0ec1a62 outdated
    2556@@ -2557,7 +2557,7 @@ void PeerManager::ProcessMessage(CNode& pfrom, const std::string& msg_type, CDat
    2557             LogPrintf("New outbound peer connected: version: %d, blocks=%d, peer=%d%s (%s)\n",
    2558                       pfrom.nVersion.load(), pfrom.nStartingHeight,
    2559                       pfrom.GetId(), (fLogIPs ? strprintf(", peeraddr=%s", pfrom.addr.ToString()) : ""),
    2560-                      pfrom.m_tx_relay == nullptr ? "block-relay" : "full-relay");
    2561+                      pfrom.IsBlockOnlyConn() ? "block-relay" : "full-relay");
    


    jnewbery commented at 11:41 am on November 22, 2020:

    I agree with @naumenkogs review comment #19858 (comment) about changing the remaining places where m_tx_relay == nullptr is used as a proxy for IsBlockOnlyConn() to use IsBlockOnlyConn(). There’s already a commit in this branch for Use conn_type to identify block-relay peers, rather than m_tx_relay == nullptr, so I don’t see why we wouldn’t fix them all up at the same time.

    Using m_tx_relay == nullptr is more indirect and confusing for reviewers. See for example #20217 (comment):

    I think I’m confused because the check for block-relay-only connections in the code is sometimes to see if m_tx_relay exists, which is non-intuitive because block-relay-only also implies that addrs are not gossiped with this peer either, not just transactions, and sometimes to call IsBlockOnlyConn().


    sdaftuar commented at 3:18 pm on November 27, 2020:
    Happy to pick this up in a followup PR. I don’t think we need to clutter this one further with style cleanups now that this has some review and ACKs.

    ajtowns commented at 9:01 am on December 11, 2020:

    Isn’t this suggestion (changing m_tx_relay == nullptr to IsBlockOnlyConn() when logging a new outbound peer) already done? I think github is confusing me…

    Anyway, changing that entire x ? "y" : "z" to just be pfrom.ConnectionTypeAsString() seems like it might be better.

  70. in src/net_processing.h:211 in 93a0ec1a62 outdated
    134@@ -135,6 +135,10 @@ class PeerManager final : public CValidationInterface, public NetEventsInterface
    135     CTxMemPool& m_mempool;
    136 
    137     int64_t m_stale_tip_check_time; //!< Next time to check for stale tip
    138+
    139+    /** Whether we've completed initial sync yet, for determining when to turn
    140+      * on extra block-relay-only peers. */
    141+    bool m_initial_sync_finished{false};
    


    jnewbery commented at 12:05 pm on November 22, 2020:

    This variable name is very confusing. It’s a different concept from IsInitialBlockDownload() but very similarly named. IsInitialBlockDownload() returns whether we’ve ever had a block at our tip that is within 24 hours of current time. This returns whether we’ve ever had a block at our tip that is within 200 minutes of current time.

    The variable also isn’t needed. We can just make StartExtraBlockRelayPeers() idempotent and always call it:

     0diff --git a/src/net.h b/src/net.h
     1index 58a5b36918..2c6221391c 100644
     2--- a/src/net.h
     3+++ b/src/net.h
     4@@ -340,8 +340,9 @@ public:
     5     bool GetTryNewOutboundPeer();
     6 
     7     void StartExtraBlockRelayPeers() {
     8+        if (m_start_extra_block_relay_peers.load()) return;
     9         LogPrint(BCLog::NET, "net: enabling extra block-relay-only peers\n");
    10-        m_start_extra_block_relay_peers = true;
    11+        m_start_extra_block_relay_peers.store(true);
    12     }
    13 
    14     // Return the number of outbound peers we have in excess of our target (eg,
    
  71. in src/net_processing.cpp:3966 in 93a0ec1a62 outdated
    3966+    // Note that we use higher nodeid as a measure for most recent connection.
    3967+    if (m_connman.GetExtraBlockRelayCount() > 0) {
    3968+        std::pair<NodeId, int64_t> youngest_peer{-1, 0}, next_youngest_peer{-1, 0};
    3969+
    3970+        m_connman.ForEachNode([&](CNode* pnode) EXCLUSIVE_LOCKS_REQUIRED(::cs_main) {
    3971+            AssertLockHeld(::cs_main);
    


    jnewbery commented at 12:55 pm on November 22, 2020:
    I don’t think cs_main is needed for this lambda. Unless I’ve missed something, we’re not accessing anything in here that is guarded by cs_main.

    sdaftuar commented at 3:51 pm on November 27, 2020:
    Oops, that seems correct. I’ll add a commit that removes this.
  72. in src/net_processing.cpp:3947 in 93a0ec1a62 outdated
    3992+            CNodeState *node_state = State(pnode->GetId());
    3993+            if (node_state == nullptr ||
    3994+                (time_in_seconds - pnode->nTimeConnected >= MINIMUM_CONNECT_TIME && node_state->nBlocksInFlight == 0)) {
    3995+                pnode->fDisconnect = true;
    3996+                LogPrint(BCLog::NET, "disconnecting extra block-relay-only peer=%d (last block received at time %d)\n", pnode->GetId(), pnode->nLastBlockTime);
    3997+                return true;
    


    jnewbery commented at 1:18 pm on November 22, 2020:
    It’s strange to return true here and false below, and then throw away that return value. It’d be clearer to always return the same value.
  73. in src/net_processing.cpp:3914 in 93a0ec1a62 outdated
    3958-    if (extra_peers > 0) {
    3959-        // If we have more outbound peers than we target, disconnect one.
    3960-        // Pick the outbound peer that least recently announced
    3961+    // If we have any extra block-relay-only peers, disconnect the youngest unless
    3962+    // it's given us a block -- in which case, compare with the second-youngest, and
    3963+    // out of those two, disconnect the peer who least recently gave us a block.
    


    jnewbery commented at 1:25 pm on November 22, 2020:

    Why not disconnect the worst peer (the one that has been longest without providing a block)? e.g.:

    • connect to block-relay-only peer 1. It never provides any blocks.
    • connect to block-relay-only peer 2. It provides blocks.
    • connect to extra block-relay-only peer 3. It provides one block in the time it’s connected.

    With this logic we’d disconnect peer 2, but I think it’d be strictly better to disconnect peer 1.

    Doing this would (I think) make the logic simpler, since you’d only need to keep track of the worst peer in the ForEachNode() lambda, rather than keeping track of two peers and then deciding which to disconnect.
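
    For comparison, a sketch of the single-worst-peer tracking being suggested (illustrative only; this is not what the PR implements, and it has the drawback discussed in the replies below):

        // Track the block-relay-only peer that has gone longest without giving us a block.
        std::pair<NodeId, int64_t> worst_peer{-1, std::numeric_limits<int64_t>::max()};
        m_connman.ForEachNode([&worst_peer](CNode* pnode) {
            if (!pnode->IsBlockOnlyConn() || pnode->fDisconnect) return;
            if (pnode->nLastBlockTime < worst_peer.second) {
                worst_peer = {pnode->GetId(), pnode->nLastBlockTime};
            }
        });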


    naumenkogs commented at 7:53 am on November 27, 2020:

    A peer would never provide us a block if we provided them a block earlier.

    So, an attacker can try being very fast at sending blocks to a victim (I assume it’s possible) so that the other peers look worse, and slowly evict them one by one. This of course assumes the attacker has eviction capabilities.

    The current approach of this PR is more conservative and wouldn’t allow this.


    jnewbery commented at 9:20 am on November 27, 2020:

    Essentially this means that we have one ‘fixed’ block-relay-only peer (even across restarts due to anchor connections) and one evictable block-relay-only peer.

    You may be right that this makes it more difficult for an adversary to take over both block-relay-only connections, although it seems like it’d be very difficult to pull off - the adversary would need to take over a large part of the victim’s addrman, be lucky with timings of blocks and additional block-relay-only connections, and be sufficiently well connected to the network that they can always win a block race against all of the victim’s other connections.

    In any case, if this is the model (one fixed and one evictable block-relay-only peer), then it’d be much better to explicitly document that rather than it be a consequence of the logic here.


    sdaftuar commented at 3:23 pm on November 27, 2020:
    I thought I did document this in net.h, but I think it would make sense to put a writeup of how all this works on our wiki, which I’d be happy to do after this is merged.

    sdaftuar commented at 2:57 pm on December 16, 2020:

    I’ve updated the documentation on the wiki: https://github.com/bitcoin-core/bitcoin-devwiki/wiki/P2P-Design-Philosophy/_compare/452d8a0c96503ac7158a975acf7d55cd674a3735

    Please let me know if you think anything is missing and I’d be happy to clarify further.

  74. in src/net_processing.cpp:3942 in 93a0ec1a62 outdated
    3987+            // Make sure we're not getting a block right now, and that
    3988+            // we've been connected long enough for this eviction to happen
    3989+            // at all.
    3990+            // Note that we only request blocks from a peer if we learn of a
    3991+            // valid headers chain with at least as much work as our tip.
    3992+            CNodeState *node_state = State(pnode->GetId());
    


    jnewbery commented at 1:26 pm on November 22, 2020:
    0            CNodeState* node_state = State(pnode->GetId());
    

    (* and & should bind to the left in new code)

  75. in src/net_processing.cpp:3965 in 93a0ec1a62 outdated
    3965+    // to temporarily in order to sync our tip; see net.cpp.
    3966+    // Note that we use higher nodeid as a measure for most recent connection.
    3967+    if (m_connman.GetExtraBlockRelayCount() > 0) {
    3968+        std::pair<NodeId, int64_t> youngest_peer{-1, 0}, next_youngest_peer{-1, 0};
    3969+
    3970+        m_connman.ForEachNode([&](CNode* pnode) EXCLUSIVE_LOCKS_REQUIRED(::cs_main) {
    


    jnewbery commented at 1:57 pm on November 22, 2020:

    In general, it’s better to avoid default capture in lambdas, since if that lambda is used outside the scope that it’s defined, it can lead to dangling references. That can’t happen here, but I think it’s just best practice to explicitly state what you’ll need.

    0        m_connman.ForEachNode([&youngest_peer, &next_youngest_peer](CNode* pnode) EXCLUSIVE_LOCKS_REQUIRED(::cs_main) {
    
  76. in src/net_processing.cpp:3935 in 93a0ec1a62 outdated
    3980+        if (youngest_peer.second > next_youngest_peer.second) {
    3981+            // Our newest block-relay-only peer gave us a block more recently;
    3982+            // disconnect our second youngest.
    3983+            to_disconnect = next_youngest_peer.first;
    3984+        }
    3985+        m_connman.ForNode(to_disconnect, [&](CNode* pnode) EXCLUSIVE_LOCKS_REQUIRED(::cs_main) {
    


    jnewbery commented at 1:58 pm on November 22, 2020:

    It’s best to capture simple types by value (and again, explicitly rather than by default):

    0        m_connman.ForNode(to_disconnect, [time_in_seconds](CNode* pnode) EXCLUSIVE_LOCKS_REQUIRED(::cs_main) {
    
  77. jnewbery commented at 9:34 am on November 27, 2020: member

    I’m curious how the timings for this mechanism were chosen:

    • we’ll connect to an additional block-relay-only peer every 300 seconds on average
    • we’ll consider eviction of the additional block-relay-only peer every 45 seconds
    • if the youngest block-relay-connection is less than 30 seconds old, then we won’t disconnect it.

    That means that the additional block-relay-only connection has a lifetime of 30-75 seconds (52.5 on average), and we make one every 300 seconds. We therefore have an additional block-relay-only connection for 52.5/300 = 17.5% of the time. Put differently, this PR increases the average number of block-relay-only connections we have from 2 to 2.175.
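
    Spelling out that arithmetic (constant names here are made up for illustration):

        constexpr double AVG_EXTRA_CONN_INTERVAL{300.0};          // a new extra block-relay-only connection every ~300s on average
        constexpr double MIN_LIFETIME{30.0}, MAX_LIFETIME{75.0};  // eviction happens 30-75s after connecting
        constexpr double avg_lifetime{(MIN_LIFETIME + MAX_LIFETIME) / 2};              // 52.5s
        constexpr double extra_conn_fraction{avg_lifetime / AVG_EXTRA_CONN_INTERVAL};  // ~0.175, i.e. 17.5% of the time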

    Would it be simpler to simply increase the number of block-relay-only connections to 3, and then periodically evict the worst/youngest? That empty slot would then be filled using the normal mechanism in ThreadOpenConnections(). That seems like it’d give just as much protection as this approach (a new connection is tried every 300 seconds), but without adding yet another complex timing interaction between net and net_processing which is difficult to reason about and test. The downsides would be potentially a bit more churn in connections (since 100% of blocks can now cause the second-oldest block-relay-only peer to be evicted instead of 17.5% of blocks), a bit more resource usage on the node (which should be fine - if it can handle an additional connection 17.5% of the time, it can handle it 100% of the time), and a bit more usage of listening sockets (which again should be fine, it’s much less of an impact than when we added block-relay-only connections in the first place).

  78. sdaftuar commented at 3:00 pm on November 27, 2020: member

    @jnewbery I think the biggest difference between what is implemented here and what you seem to be suggesting is that your proposed approach would be much more aggressive about making outbound connections. When we increase the target number of connections we want to make, ThreadOpenConnections will constantly be trying to achieve that target, making connection attempts whenever we are below our target. In this PR, we use a feeler-like approach to making these additional connections, which should be lower load on the network, as we just make one connection attempt every 5 minutes.

    As for your question about how the times were chosen - the only new parameter here is the frequency of these block-relay-only connections (the existing 30-second and 45-second timers were re-used from the existing outbound peer eviction logic; those were based on checking frequently enough that we don’t persist connections for a long time, while still giving a reasonable amount of time for headers sync and block download to occur). I figured adding about 2 connections every 10 minutes was not such a big increase in the rate of outbound connection initiation, though if reviewers are concerned about the frequency we could reduce it.

    The trade-off here is that the more frequently we make connections (and therefore add edges to the network graph of nodes), the more security we have against partitioning attacks, but also the more network load we sustain. I don’t think we have a great framework for evaluating the load, so I’d just eyeball it and observe after we deploy it, and we can always tune it if we think it’s excessive or insufficient. From watching the logs on my own node to verify how frequently connections are made and how much bandwidth is used by these connections, these numbers seemed fine to me, fwiw.

    Also, I don’t think that we gain all that much in complexity reduction by going that route, as we still need to have intelligent eviction logic, which I think is where the meat of the complexity lies. I took a quick look right now, and the additional changes to net.cpp to support feeler-like connection behavior (versus just having a target of 3 peers, of which we’d always try to evict 1) seem to be about 4 lines of code (not including comments), patterned off the existing feeler logic. Not too bad?
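
    Roughly, the feeler-like scheduling in ThreadOpenConnections looks something like this (a simplified sketch of the approach, not the exact diff):

        if (nTime > nNextExtraBlockRelay && m_start_extra_block_relay_peers) {
            // Attempt exactly one extra block-relay-only connection per timer firing;
            // if the attempt fails, we simply wait for the timer to fire again.
            nNextExtraBlockRelay = PoissonNextSend(nTime, EXTRA_BLOCK_RELAY_ONLY_PEER_INTERVAL);
            conn_type = ConnectionType::BLOCK_RELAY;
        }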

  79. sdaftuar commented at 3:54 pm on November 27, 2020: member

    (Oops, prematurely hit submit on this comment before I finished writing.) @jnewbery I fixed the unnecessary lock assertion in the latest commit. My strong preference for now is to merge this PR as-is: no problems seem to have emerged in review and testing thus far, it has several ACKs, it represents a substantial improvement in our peering behavior, and frankly this work has gotten stale for me, as it has been several months since I last worked on this extensively. So if we can save minor improvements or tweaks for future work, that would make progress much easier – I’m not inclined to do minor rewrites at this point (and if there are major problems remaining I should just close this PR!).

    I think for some of the changes (like the m_tx_relay==nullptr thing), we could easily do that in a followup refactoring-only PR. For some of the initialization sequence points that came up, I believe the code here is reasonable and correct, but there is room for further improvement too – I think it’s worth careful consideration of behavior changes that might be introduced in doing so, and since those seem non-essential to the behavior introduced here, I think this PR shouldn’t be held up for such minor optimizations.

    I’d like to focus any further discussion on this PR on the actual behavior introduced (such as concerns about the timing of connections, maintenance or cognitive complexities introduced, or behaviors I’ve glossed over as unimportant such as the initialization sequence, if I’m mistaken and there are indeed significant concerns around how that will work).

  80. sdaftuar force-pushed on Nov 29, 2020
  81. sdaftuar commented at 11:00 pm on November 29, 2020: member
    Rebased for the silent merge conflict with #20188, and squashed the lock annotation fix in while I was at it.
  82. sdaftuar commented at 2:04 pm on December 1, 2020: member
    @naumenkogs @sipa @jonatack @ariard Care to re-review?
  83. jonatack commented at 5:43 pm on December 2, 2020: member
    Am re-testing with added logging on a node, re-review soon.
  84. DrahtBot added the label Needs rebase on Dec 9, 2020
  85. jonatack commented at 4:18 pm on December 9, 2020: member
    Reviewers may be interested in the article in today’s Bitcoin Optech about this PR: https://bitcoinops.org/en/newsletters/2020/12/09/#bitcoin-core-pr-review-club
  86. sdaftuar force-pushed on Dec 9, 2020
  87. DrahtBot removed the label Needs rebase on Dec 9, 2020
  88. DrahtBot added the label Needs rebase on Dec 10, 2020
  89. in src/net_processing.cpp:3970 in efa0e0df05 outdated
    3912@@ -3913,13 +3913,11 @@ void PeerManager::EvictExtraOutboundPeers(int64_t time_in_seconds)
    3913             AssertLockHeld(::cs_main);
    3914 
    3915             // Ignore non-outbound peers, or nodes marked for disconnect already
    3916-            if (!pnode->IsOutboundOrBlockRelayConn() || pnode->fDisconnect) return;
    3917+            if (!pnode->IsFullOutboundConn() || pnode->fDisconnect) return;
    


    naumenkogs commented at 8:55 am on December 10, 2020:
    To be clear, there was a bug here (nothing behavior-critical, just redundant code). It was already pointless to include block-relay connections in this check, because they were filtered out later by “if (pnode->m_tx_relay == nullptr)”. All good after this change.
  90. naumenkogs commented at 9:33 am on December 10, 2020: member

    ACK 7dc24baf7c88f3b03fa2aa59fc4cba332d7e7a69


    I find it slightly confusing that IsBlockOnlyConn really means outbound block-only conn. It makes the code review harder for me, and I assume even more so for new developers. Not sure what we could do here though.


    Shoot, another “needs rebase”

  91. Simplify and clarify extra outbound peer counting 91d61952a8
  92. Use conn_type to identify block-relay peers, rather than m_tx_relay == nullptr 3cc8a7a0f5
  93. Periodically make block-relay connections and sync headers
    To make eclipse attacks more difficult, regularly initiate outbound connections
    and stay connected long enough to sync headers and potentially learn of new
    blocks. If we learn a new block, rotate out an existing block-relay peer in
    favor of the new peer.
    
    This augments the existing outbound peer rotation that exists -- currently we
    make new full-relay connections when our tip is stale, which we disconnect
    after waiting a small time to see if we learn a new block.  As block-relay
    connections use minimal bandwidth, we can make these connections regularly and
    not just when our tip is stale.
    
    Like feeler connections, these connections are not aggressive; whenever our
    timer fires (once every 5 minutes on average), we'll try to initiate a new
    block-relay connection as described, but if we fail to connect we just wait for
    our timer to fire again before repeating with a new peer.
    daffaf03fb
  94. Clarify comments around outbound peer eviction b3a515c0be
  95. sdaftuar force-pushed on Dec 10, 2020
  96. sdaftuar commented at 1:55 pm on December 10, 2020: member

    @naumenkogs Thanks for the re-review – I just rebased again to deal with the merge conflict, another trivial change.

    I find it slightly confusing that IsBlockOnlyConn really means outbound block-only conn. It makes the code-review harder for me, and I assume even more for new developers.

    I do as well! On my to-do list is to propose and implement support for negotiating block-relay only behavior at connection time; I think at that point we can add better understanding to our code for inbound versus outbound block-relay-only peers (as right now we have no way of knowing that an inbound peer is a block-relay-only peer).

  97. MarcoFalke commented at 1:55 pm on December 10, 2020: member

    @naumenkogs @sipa @jonatack @ariard you previously reviewed this, but there have been a few (silent) merge conflicts, so a re-ACK would be appreciated.

    In the meantime it would be good if other maintainers refrained from merging conflicting pulls, see the list by DrahtBot: #19858 (comment)

  98. DrahtBot removed the label Needs rebase on Dec 10, 2020
  99. jonatack commented at 3:38 pm on December 10, 2020: member

    Tested ACK b3a515c0bec97633a76bec101af47c3c90c0b749 over several weeks, though this change and behavior could benefit from test coverage and other follow-ups (refactoring, etc.) described in the review feedback. I did not verify the behavior of m_start_extra_block_relay_peers only being enabled after initial chain sync. Since my last review, one unneeded cs_main lock was removed.

    A few empirical observations from running this code over time on a tor v3 node with mostly only 18 outbounds (~half very good manual ones) and 0, sometimes 1, inbounds:

    • The timing happens as expected
    • GetExtraFullOutboundCount() and GetExtraBlockRelayCount() mostly return 0 (97%+ of the time); when not, they return 1
    • Most of the time (97%+), the newly added block-relay-only peer is disconnected within ~50 seconds (and the log says “(last block received at time 0)”)
    • Actual peer rotation is rare
    • Without the patch to induce a 10% chance of initializing nLastBlockTime to current time, I no longer saw any keeping/evicting events

    My main wishlist would be that the code be designed from the start to be testable and have regression coverage. I’m not as confident in reviewing or changing code without coverage.

  100. jonatack commented at 3:44 pm on December 10, 2020: member
    Logging used for my testing: https://github.com/jonatack/bitcoin/commit/8986db429738cb1f8e0057e4ac41af1a89bc714e (a variant of Suhas’ patch above).
  101. ariard commented at 4:06 pm on December 10, 2020: member

    Code Review ACK b3a515c; the only change since last time is dropping an unnecessary cs_main lock. I manually tested a previous version of the PR, and no substantial change has been introduced since then which would alter behavior IMO.


    The trade-off here is that the more frequently we make connections (and therefore add edges to the network graph of nodes), the more security we have against partitioning attacks, but also the more network load we sustain. I don’t think we have a great framework for evaluating the load, so I’d just eyeball it and observe after we deploy it, and we can always tune it if we think it’s excessive or insufficient.

    AFAIU this PR, I’m not worried about the network load it introduces, whatever metric we pick (connection slots / application bandwidth / IP/TCP bandwidth). Let’s assume a true worst-case scenario, where the victim node has always fallen behind a better-work chain by ~5 headers and has to download them every 5 min (EXTRA_BLOCK_RELAY_ONLY_PEER_INTERVAL). Then 80 * 5 * 12 * 24 * 30 = 3,456,000, i.e. this node would consume roughly 3.46 MB per month from the network. If I get the math right, I think that’s fairly acceptable.
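
    Spelling out that estimate (the constants reflect the assumptions above: ~80 bytes per header, 5 headers per sync, one sync every 5 minutes):

        constexpr long long HEADER_BYTES{80};
        constexpr long long HEADERS_PER_SYNC{5};
        constexpr long long SYNCS_PER_HOUR{12};  // one extra block-relay-only connection every 5 minutes
        constexpr long long BYTES_PER_MONTH{HEADER_BYTES * HEADERS_PER_SYNC * SYNCS_PER_HOUR * 24 * 30};  // 3,456,000 bytes, ~3.46 MB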

    That said, I would be glad if we start to develop and sustain frameworks to evaluate question like network load which rightfully spawn up in this kind of work. Beyond agreeing on security model efficiency, having sound quantitative model would ease reaching Concept ACK/NACK.

    I thought I did document this innet.h, but I think it would make sense to put a writeup of how all this works on our wiki, which I’d be happy to do after this is merged.

    I know I’m a minority opinion, but I still feel we should have a sound discussion before dissociating further rationale from code by writing more documentation in the wiki instead of in-tree.

    Contra:

    • splitting code rationale from code paths increases the risk of the documentation being airy, hard to link back, and easily outdated
    • rationale should be submitted to the same review process; the wiki is free to edit for project members, and IMO code rationale is at least as important as code correctness, because ultimately the rationale drives what the code should be
    • it increases project reliance on centralized GH infrastructure rather than duplicated git copies

    I’m a big fan of the code documentation approach which has been done for #19988, and I hope to stick more to this kind of code documentation standard in the future.

  102. laanwj added this to the "Blockers" column in a project

  103. MarcoFalke merged this on Dec 11, 2020
  104. MarcoFalke closed this on Dec 11, 2020

  105. in src/net_processing.cpp:3958 in b3a515c0be
    3958+        });
    3959+    }
    3960+
    3961+    // Check whether we have too many OUTBOUND_FULL_RELAY peers
    3962+    if (m_connman.GetExtraFullOutboundCount() > 0) {
    3963+        // If we have more OUTBOUND_FULL_RELAY peers than we target, disconnect one.
    


    ajtowns commented at 9:48 am on December 11, 2020:
    Seems better to use the names from ConnectionTypeAsString (i.e. outbound-full-relay) in comments here than the shouty enums.
  106. ajtowns commented at 10:21 am on December 11, 2020: member

    Post merge ACK b3a515c0bec97633a76bec101af47c3c90c0b749

    Testing this, I’m seeing some cases where the extra block-relay peer is being evicted for sending a tx, eg:

    02020-12-11T10:12:05Z Added connection peer=38
    12020-12-11T10:12:06Z receive version message: /Satoshi:0.20.1/: version 70015, blocks=660884, us=xxx:8333, peer=38
    22020-12-11T10:12:06Z New outbound peer connected: version: 70015, blocks=660884, peer=38 (block-relay)
    32020-12-11T10:12:17Z received: inv (1261 bytes) peer=38
    42020-12-11T10:12:17Z got inv: tx 565fed8bc9ff5fa333a8130ec399f23d24d4bcdd435778b1fd2a67278a980ee6  have peer=38
    52020-12-11T10:12:17Z transaction (565fed8bc9ff5fa333a8130ec399f23d24d4bcdd435778b1fd2a67278a980ee6) inv sent in violation of protocol, disconnecting peer=38
    62020-12-11T10:12:17Z disconnecting peer=38
    72020-12-11T10:12:17Z Cleared nodestate for peer=38
    

    which might be worth looking into further just in case those peers are actually running core like they say and we have a bug.

  107. sidhujag referenced this in commit 47cbc7d34d on Dec 11, 2020
  108. ajtowns commented at 11:44 am on December 16, 2020: member

    Testing this, I’m seeing some cases where the extra block-relay peer is being evicted for sending a tx, […]

    Looking into this a bit further: I’m only seeing this coming from nodes advertising themselves as 0.20.1, and the timing seems like it’s consistent with a 5s poisson delay since the connection is established. Not all 0.20.1 nodes are failing this way. I’ve just added some additional logging, and it looks like the txs they’re advertising aren’t particularly strange, but all the violating nodes seem to be running on cloud hosting (digital ocean, amazon, google cloud). So seems plausible that they’re just buggy and lying about their version details?

    One thing that might be worth considering: our sybil mitigation only works for concurrent connections – our 10 regular outbounds all have to be in different netgroups because they’re simultaneously connected, but 10 sequential extra block-relay-only connections could all end up in the same netgroup. Could fix this by keeping track of the last 10 extra connections we’ve tried, and trying to choose the next one from a different netgroup.
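
    A rough sketch of the kind of bookkeeping that could do this (all names and placement here are hypothetical; nothing like this is in the PR):

        // In net.h (hypothetical new CConnman member): netgroups of the last 10
        // extra block-relay-only connection attempts, most recent first.
        std::deque<std::vector<unsigned char>> m_recent_extra_netgroups;

        // In net.cpp (hypothetical helper): returns true if we tried this netgroup
        // recently, otherwise records it for future checks.
        bool CConnman::RecentlyTriedNetgroup(const CAddress& addr, const std::vector<bool>& asmap)
        {
            const auto group = addr.GetGroup(asmap);
            if (std::find(m_recent_extra_netgroups.begin(), m_recent_extra_netgroups.end(), group) != m_recent_extra_netgroups.end()) {
                return true;  // same netgroup as a recent extra connection; pick another candidate
            }
            m_recent_extra_netgroups.push_front(group);
            if (m_recent_extra_netgroups.size() > 10) m_recent_extra_netgroups.pop_back();
            return false;
        }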

  109. jnewbery commented at 12:55 pm on December 16, 2020: member

    Testing this, I’m seeing some cases where the extra block-relay peer is being evicted for sending a tx …

    I’m surprised that your log isn’t showing the “send version message” line. You obviously have NET logging enabled since you’re seeing the “receive version message” line. These are outbound connections, so I’d expect to see “send version message” (in InitializeNode()) between “Added connection peer” (in ConnectNode()) and “receive version message” (in ProcessMessage()). Very odd.

    Even if we did have that logging in PushNodeVersion(), it doesn’t include what the fRelay was set to. That’d be evidence that the peer is misbehaving and we don’t have a bug.

    Perhaps peer message capture would help here (#19509)

  110. ajtowns commented at 1:49 pm on December 16, 2020: member

    Oh, with logips enabled, I see those nodes also don’t appear to be telling me my IP address, instead reporting their own IP in both us= and peeraddr=. Might just be a proxy that’s munging the version message both ways, and losing the relay bit as well as address info.

    Edited to add:

    I think about 12% of the extra block-relay-only connections my peer is opening get disconnected for this reason

    Example:

    02020-12-16T13:12:01Z Added connection to [their_ip]:8333 peer=157
    12020-12-16T13:12:01Z sending version (103 bytes) peer=157
    22020-12-16T13:12:01Z ABCD, sent version message peer=157 block-relay-only m_ignore_incoming_txs==0, pnode.m_tx_relay == nullptr, relaytxes=0
    

    Note: we are signalling no tx relay in the version message we send

    02020-12-16T13:12:01Z send version message: version 70016, blocks=661619, us=[::]:0, them=[their_ip]:8333, peer=157
    12020-12-16T13:12:01Z received: version (102 bytes) peer=157
    22020-12-16T13:12:01Z sending verack (0 bytes) peer=157
    32020-12-16T13:12:01Z receive version message: /Satoshi:0.20.1/: version 70015, blocks=661619, us=[their_ip]:8333, peer=157, peeraddr=[their_ip]:8333
    

    Note: they are claiming to be 0.20.1, and are telling us our ip is the same as their ip.

     02020-12-16T13:12:01Z received: verack (0 bytes) peer=157
     12020-12-16T13:12:01Z New outbound peer connected: version: 70015, blocks=661619, peer=157, peeraddr=[their_ip]:8333 (block-relay)
     22020-12-16T13:12:01Z sending sendheaders (0 bytes) peer=157
     32020-12-16T13:12:01Z sending sendcmpct (9 bytes) peer=157
     42020-12-16T13:12:01Z sending sendcmpct (9 bytes) peer=157
     52020-12-16T13:12:01Z sending ping (8 bytes) peer=157
     62020-12-16T13:12:01Z initial getheaders (661618) to peer=157 (startheight:661619)
     72020-12-16T13:12:01Z sending getheaders (1029 bytes) peer=157
     82020-12-16T13:12:02Z received: sendheaders (0 bytes) peer=157
     92020-12-16T13:12:02Z received: sendcmpct (9 bytes) peer=157
    102020-12-16T13:12:02Z received: sendcmpct (9 bytes) peer=157
    112020-12-16T13:12:02Z received: ping (8 bytes) peer=157
    122020-12-16T13:12:02Z sending pong (8 bytes) peer=157
    132020-12-16T13:12:02Z received: getheaders (1029 bytes) peer=157
    142020-12-16T13:12:02Z getheaders 661619 to end from peer=157
    152020-12-16T13:12:02Z sending headers (82 bytes) peer=157
    162020-12-16T13:12:02Z received: feefilter (8 bytes) peer=157
    172020-12-16T13:12:02Z received: feefilter of 0.00001000 BTC/kvB from peer=157
    182020-12-16T13:12:02Z received: pong (8 bytes) peer=157
    192020-12-16T13:12:02Z received: headers (82 bytes) peer=157
    

    Everything looks fine so far. Except 8s later we get sent some txids, despite saying we don’t want them. The txids themselves don’t seem particularly suspicious.

     02020-12-16T13:12:10Z received: inv (1261 bytes) peer=157
     12020-12-16T13:12:10Z got inv: tx 75bd60faafd08519ef2e792df0545c9392df41bf291d1134e7a5db9710ced23b  have peer=157
     22020-12-16T13:12:10Z transaction (75bd60faafd08519ef2e792df0545c9392df41bf291d1134e7a5db9710ced23b) inv sent in violation of protocol, disconnecting peer=157
     32020-12-16T13:12:10Z   ABCD, inv tx 75bd60faafd08519ef2e792df0545c9392df41bf291d1134e7a5db9710ced23b have peer=157
     42020-12-16T13:12:10Z   ABCD, inv tx 7b449d756f6dda691f992987e17bf3f7dc34e0ea0ce65d43622401bc8f60dd88 have peer=157
     52020-12-16T13:12:10Z   ABCD, inv tx bcb93b03d029cb3ec2d778b7d52dd3ec6dd281c5a3a0a5db9fb2acda71638c44 new peer=157
     62020-12-16T13:12:10Z   ABCD, inv tx 5b5a2b221caa73ed6b703add97c72f826ac70148b973cf375e7a53db0a1fb379 have peer=157
     72020-12-16T13:12:10Z   ABCD, inv tx 81921bce2a40d6a461996feec5a47b1a4121d0da247d837e6e4dea899230b659 have peer=157
     82020-12-16T13:12:10Z   ABCD, inv tx 7a34e32d7077960769e5c0cbacad4a12dda84418204a2652c491a570058fcb0a have peer=157
     92020-12-16T13:12:10Z   ABCD, inv tx af263f51587ff049cc64222d6660c63abb13458c92902d639882e2d3ab7287d6 have peer=157
    102020-12-16T13:12:10Z   ABCD, inv tx 3d4b1ecae99adc8c8077ae091d88ca269c71c76c18de7a871b3218d6ba08356d have peer=157
    112020-12-16T13:12:10Z   ABCD, inv tx 390071b455d1ed37f6c16863e08fcc97a53331a1f0ecd5d8f397658c8524afac have peer=157
    122020-12-16T13:12:10Z   ABCD, inv tx d21e808b5d6dd2ed4e6e462b8beef3a4dcc3e791324bd7c8f7cc4c9f29dfba5d have peer=157
    132020-12-16T13:12:10Z   ABCD, inv tx 9a7d584b6cd86fb404f9cd4f64278f00111db009ccbcf1c7838cc32990149e4e new peer=157
    142020-12-16T13:12:10Z   ABCD, inv tx 79c47859b388197e2f54e84b8e4127c774cd411a1d7e212a9260f69616bed798 new peer=157
    152020-12-16T13:12:10Z   ABCD, inv tx 43ee2d62b43292aa5f93456afeea622201f8617041d41b3a5270b8d069074787 have peer=157
    162020-12-16T13:12:10Z   ABCD, inv tx f6ec29d439fe01ff4253086aa31ac82e03304fe1b8bb335b1afcf9be24140787 new peer=157
    172020-12-16T13:12:10Z   ABCD, inv tx d81d8167089003070e65d909c51365f5906b30b0b9d41a842b0e87bdaa4b3f95 have peer=157
    182020-12-16T13:12:10Z   ABCD, inv tx f6e90b5f427eba175ef3b464556b2deb2a6b6239f8ea45f82c6c3f99702ab740 new peer=157
    192020-12-16T13:12:10Z   ABCD, inv tx e3f89991c70433914ca25c371a3ede00f3909bc72fc4c6debabe996f459fe095 have peer=157
    202020-12-16T13:12:10Z   ABCD, inv tx 4af670a50a53b3949528a375cd958818c436efafbf8cf2d9476ba251e0624d03 new peer=157
    212020-12-16T13:12:10Z   ABCD, inv tx 99c9200154e040337c4d853897930c6fb6ba5e052f2a9b801aa9a85d08cd4563 have peer=157
    222020-12-16T13:12:10Z   ABCD, inv tx a2706abb0f8470c0e84aa07801fef1a56b1d2184b316bd13f1a18716a27da43e new peer=157
    232020-12-16T13:12:10Z   ABCD, inv tx 5d41a07d349067fadc2ca165c32c2a1f369756006b512590fee88db0d3b05538 have peer=157
    242020-12-16T13:12:10Z   ABCD, inv tx 0e25bc1d72e1d6585e8e0b5a6e30e228136d8f25e8af15600c5b23d16d99dce0 have peer=157
    252020-12-16T13:12:10Z   ABCD, inv tx 9c7d8fa67e81da350ae834bbefb9dc2955a18067512129fee672cc28c1118629 have peer=157
    262020-12-16T13:12:10Z   ABCD, inv tx 9d4bc391dc0f4ee61e39f0ed65ef3f611a461a2bd54b216aa0263265fa50cca4 new peer=157
    272020-12-16T13:12:10Z   ABCD, inv tx 9d9a74778fbaa7605f3386142f3ac5169a333ca5a846863394ff3c9e9ae69778 have peer=157
    282020-12-16T13:12:10Z   ABCD, inv tx bf464249cf02bc4caac19c04d6a0cb76ea1628531e6d52b0d8dc9e3d2f1a5477 new peer=157
    292020-12-16T13:12:10Z   ABCD, inv tx 457df7291de77b3b5ca0871ad87621c819efab6fb1471842e88fb7959ab1e2ca new peer=157
    302020-12-16T13:12:10Z   ABCD, inv tx 3f644cb1a138d6dd1bfd461470172b7dab8e161f5b6b056cc477eda781414aee have peer=157
    312020-12-16T13:12:10Z   ABCD, inv tx 9e6dd4d37cc7a56fd188ffe99cc1dc523901e18f8b6ec02e8b99c5ed6f3f608a new peer=157
    322020-12-16T13:12:10Z   ABCD, inv tx e64861304f5f7d829c82d559c0c1c7c57c5cb5e51bdfd96f10fcf04abb858408 have peer=157
    332020-12-16T13:12:10Z   ABCD, inv tx 13bc931cd0a09e546b9d1b16c66964b33cb872e5eec65d21c2344dec1b317baa have peer=157
    342020-12-16T13:12:10Z   ABCD, inv tx 038ee30078d52b1af945b16578ff4c302040248be273809aa285f7667a8f7edc have peer=157
    352020-12-16T13:12:10Z   ABCD, inv tx 36a363550c4c8a5fbb38b69cfde0a938faa06fa300836d769cb085da963174a3 new peer=157
    362020-12-16T13:12:10Z   ABCD, inv tx b09beb484f21adcec10665aa526132331f117e748997a41d9330370c8ce580f5 new peer=157
    372020-12-16T13:12:10Z   ABCD, inv tx 12b3adb1fc1ccfbeec6d0700b1fce0b8e6a2dbe85e227745dca4f388b8b2d926 have peer=157
    382020-12-16T13:12:10Z disconnecting peer=157
    392020-12-16T13:12:10Z Cleared nodestate for peer=157
    
  111. sdaftuar commented at 2:33 pm on December 16, 2020: member

    One thing that might be worth considering: our sybil mitigation only works for concurrent connections – our 10 regular outbounds all have to be in different netgroups because they’re simultaneously connected, but 10 sequential extra block-relay-only could all end up to the same netgroup. Could fix this by keeping track of the last 10 extra connections we’ve tried, and trying to choose the next one from a different netgroup. @ajtowns I think this is an interesting idea – seems like it would be a strict improvement in security (in a mathematical sense, ie I can’t imagine how our security could be any worse off with that approach); but I’m not sure if the additional complexity is worth the potential gain? Not intrinsically opposed, but maybe this isn’t low-hanging fruit either.

  112. laanwj referenced this in commit f0913f2f95 on Dec 17, 2020
  113. laanwj removed this from the "Blockers" column in a project

  114. MarcoFalke referenced this in commit 83e4670fd7 on Dec 18, 2020
  115. sidhujag referenced this in commit 38768e9509 on Dec 18, 2020
  116. dhruv commented at 2:22 am on January 1, 2021: member
    IIUC, since this merge, an extra outbound-full-relay connection is made upon a stale tip (3 block intervals), and an extra outbound-block-relay connection is now made every 5 minutes. Is there still value in the former?
  117. sdaftuar commented at 2:38 am on January 1, 2021: member

    IIUC, since this merge, an extra outbound-full-relay connection is made upon a stale tip(3 block intervals), and an extra outbound-block-relay connection is now made every 5 minutes. Is there still value in the former?

    Yes, I think there is. The stale tip logic more aggressively seeks out a new peer to connect to (staying in that state until a new connection is actually made), while this logic fires once on a selection from addrman and gives up even if no connection is made. Moreover, the stale tip logic is an eviction algorithm for full-relay peers, while this logic is only for block-relay peers. I think having rotation logic for both makes sense, though there is probably room to improve the interaction between these two behaviors in all the various cases we can think of.

  118. dhruv commented at 5:09 am on January 2, 2021: member

    @sdaftuar Thanks for explaining. Could this be a way to improve the interaction between the two behaviors?:

    1. Upon stale tip, seek an extra outbound-block-relay connection more aggressively rather than on a 5-minute interval (stay in that state until the tip is no longer stale).
    2. Novel block discovery by the extra outbound-block-relay triggers outbound-full-relay eviction.
    3. Eliminate extra outbound-full-relay connections.

    Perhaps the simpler starting point is just to implement (2)? Are there any downsides/known attacks if we do that?

  119. MarcoFalke referenced this in commit 8b2c0df83e on Dec 6, 2021
  120. sidhujag referenced this in commit f021c04462 on Dec 7, 2021
  121. Fabcien referenced this in commit 9c0ffc9f4d on Jan 26, 2022
  122. DrahtBot locked this on Aug 16, 2022
