Protect localhost and block-relay-only peers from eviction #19670

pull sdaftuar wants to merge 1 commit into bitcoin:master from sdaftuar:2020-08-improved-eviction changing 1 file +36 −2
  1. sdaftuar commented at 6:58 pm on August 5, 2020: member

    Onion peers are disadvantaged under our eviction criteria, so prevent their eventual eviction in the presence of contention for inbound slots by reserving some slots for localhost peers (sorted by longest uptime).

    Block-relay-only connections exist as a protection against eclipse attacks, by creating a path for block propagation that may be unknown to adversaries. Protect such peers from being disconnected by inbound connection slot attacks by attempting to protect up to 8 peers that are not relaying transactions but have provided us with blocks.

    Thanks to gmaxwell for suggesting these strategies.
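
    In code terms (a simplified standalone sketch, not the PR’s exact diff), both protections reuse the pattern already used in AttemptToEvictConnection(): sort the eviction candidates so the peers we want to keep land at the end of the vector, then erase that tail so those peers can no longer be chosen for eviction. The Candidate struct and ProtectNewCategories below are illustrative names only.

        #include <algorithm>
        #include <cstdint>
        #include <vector>

        // Assumed minimal candidate type; the real code uses net.cpp's NodeEvictionCandidate.
        struct Candidate {
            int64_t nTimeConnected;   // smaller = connected longer
            int64_t nLastBlockTime;   // last time the peer gave us a novel block
            bool fRelayTxes;          // peer participates in tx relay
            bool fRelevantServices;   // peer advertises full-node service bits
            bool m_is_local;          // inbound via localhost (e.g. through a Tor proxy)
        };

        void ProtectNewCategories(std::vector<Candidate>& cands)
        {
            // 1) Protect up to 8 apparent block-relay-only full nodes, preferring those
            //    that most recently delivered us a block.
            std::sort(cands.begin(), cands.end(), [](const Candidate& a, const Candidate& b) {
                if (a.fRelayTxes != b.fRelayTxes) return a.fRelayTxes; // non-tx-relay peers sort last
                return a.nLastBlockTime < b.nLastBlockTime;            // recent block providers sort last
            });
            size_t n = std::min<size_t>(8, cands.size());
            cands.erase(std::remove_if(cands.end() - n, cands.end(),
                                       [](const Candidate& c) { return !c.fRelayTxes && c.fRelevantServices; }),
                        cands.end());

            // 2) Reserve up to half of the longest-uptime slots (a quarter of the remaining
            //    candidates) for localhost peers, longest-connected first.
            std::sort(cands.begin(), cands.end(), [](const Candidate& a, const Candidate& b) {
                if (a.m_is_local != b.m_is_local) return b.m_is_local; // localhost peers sort last
                return a.nTimeConnected > b.nTimeConnected;            // longest uptime sorts last
            });
            n = cands.size() / 4;
            cands.erase(std::remove_if(cands.end() - n, cands.end(),
                                       [](const Candidate& c) { return c.m_is_local; }),
                        cands.end());
        }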

  2. sdaftuar commented at 7:00 pm on August 5, 2020: member
    This should fix #19500, but I need to figure out how to test this…
  3. sdaftuar force-pushed on Aug 5, 2020
  4. DrahtBot added the label P2P on Aug 5, 2020
  5. in src/net.cpp:924 in ce35ffee1a outdated
    917@@ -903,12 +918,31 @@ bool CConnman::AttemptToEvictConnection()
    918     // Protect 4 nodes that most recently sent us transactions.
    919     // An attacker cannot manipulate this metric without performing useful work.
    920     EraseLastKElements(vEvictionCandidates, CompareNodeTXTime, 4);
    921+    // Protect up to 8 non-tx-relay peers that have sent us blocks.
    922+    std::sort(vEvictionCandidates.begin(), vEvictionCandidates.end(), CompareNodeBlockRelayOnlyTime);
    923+    size_t erase_size = std::min(size_t(8), vEvictionCandidates.size());
    924+    std::remove_if(vEvictionCandidates.end() - erase_size, vEvictionCandidates.end(), [](NodeEvictionCandidate const &n) { return !n.fRelayTxes && n.nLastBlockTime > 0; });
    


    sdaftuar commented at 8:45 pm on August 5, 2020:
    @gmaxwell I should get rid of this n.nLastBlockTime > 0, right? So that we’ll always protect up to 8 non-tx-relay peers, even if they haven’t managed to be first to send us a block. But perhaps I should only protect them if fRelevantServices is true, so that non-full nodes do not get protected here?

    gmaxwell commented at 9:12 pm on August 5, 2020:
    I think requiring fRelevantServices makes sense. I think it’s not important to require that they sent a block; helping them stay connected so long as they (claim they) could might help keep them around long enough to successfully do so.

    sdaftuar commented at 1:01 am on August 6, 2020:
    Fixed, thanks.
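
    Concretely, the change discussed here amounts to swapping the predicate used in the remove_if (a sketch of the change, not a verbatim quote of the updated diff; it assumes the candidate struct exposes an fRelevantServices flag as discussed above):

        // Before: only protect non-tx-relay peers that had already delivered a block.
        [](NodeEvictionCandidate const &n) { return !n.fRelayTxes && n.nLastBlockTime > 0; }
        // After: protect non-tx-relay peers that advertise full-node services, whether or
        // not they have yet managed to be first to deliver us a block.
        [](NodeEvictionCandidate const &n) { return !n.fRelayTxes && n.fRelevantServices; }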
  6. sdaftuar force-pushed on Aug 6, 2020
  7. fanquake added this to the milestone 0.21.0 on Aug 6, 2020
  8. practicalswift commented at 8:49 am on August 6, 2020: contributor
    Concept ACK (modulo proper testing of course :))
  9. in src/net.cpp:860 in e5af0d1b65 outdated
    851@@ -845,6 +852,14 @@ static bool CompareNodeTXTime(const NodeEvictionCandidate &a, const NodeEviction
    852     return a.nTimeConnected > b.nTimeConnected;
    853 }
    854 
    855+// Pick out the potential block-relay only peers, and sort them by last block time.
    856+static bool CompareNodeBlockRelayOnlyTime(const NodeEvictionCandidate &a, const NodeEvictionCandidate &b)
    


    jnewbery commented at 10:20 am on August 6, 2020:
    If this PR was built on top of #19316, you could test directly for block-relay-only peers. I think that’d be a cleaner implementation.

    sdaftuar commented at 11:41 am on August 6, 2020:

    We don’t currently have a way in our codebase to know that an inbound peer is blocks-only; that is a concept we can only know for outbound peers. We can infer it for inbound peers, as this PR does, but the right way to do this in the future would be to negotiate it at startup, so both sides know that a connection will stay blocks only.

    I believe that PR is only cleaning things up for outbound blocks-only peers?


    jnewbery commented at 12:51 pm on August 6, 2020:
    Yes, you’re right. 19316 only changes things for outbound peers.
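
    For reference, the inference described here can be expressed as a comparator along these lines (a sketch consistent with the truncated diff context above; the merged code may differ in detail):

        // Sketch: sort so that peers which *look* block-relay-only (inbound peers that
        // disabled tx relay in their version message) end up at the tail, preferring
        // more recent block delivery, then full-node service bits, then longer uptime.
        static bool CompareNodeBlockRelayOnlyTime(const NodeEvictionCandidate &a, const NodeEvictionCandidate &b)
        {
            if (a.fRelayTxes != b.fRelayTxes) return a.fRelayTxes;
            if (a.nLastBlockTime != b.nLastBlockTime) return a.nLastBlockTime < b.nLastBlockTime;
            if (a.fRelevantServices != b.fRelevantServices) return b.fRelevantServices;
            return a.nTimeConnected > b.nTimeConnected;
        }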
  10. naumenkogs commented at 10:21 am on August 6, 2020: member
    Concept ACK.
  11. in src/net.cpp:926 in e5af0d1b65 outdated
    917@@ -903,12 +918,31 @@ bool CConnman::AttemptToEvictConnection()
    918     // Protect 4 nodes that most recently sent us transactions.
    919     // An attacker cannot manipulate this metric without performing useful work.
    920     EraseLastKElements(vEvictionCandidates, CompareNodeTXTime, 4);
    921+    // Protect up to 8 non-tx-relay peers that have sent us blocks.
    922+    std::sort(vEvictionCandidates.begin(), vEvictionCandidates.end(), CompareNodeBlockRelayOnlyTime);
    


    ariard commented at 12:06 pm on August 6, 2020:

    nLastBlockTime is updated when ProcessNewBlock returns, if the processed block is new to us and connects to our best chain, among other yardsticks. An attacker-controlled inbound connection, by always being the fastest to announce blocks to us, can trump other honest block-relay-only inbounds and prevent their nLastBlockTime from being updated. I think assuming a malicious peer can always be the fastest is reasonable: just connect to Fibre and bypass some validation checks. By rotating the fastest block announcement among a few of its inbound connections, a malicious attacker seems to win over this protection.

    Have you considered instead selecting 8 of our block-relay-only peers at random? I think that would make this protection better, as an attacker can’t predict whether its block-relay-only inbounds will be favored. Because if an attacker can forge the fast-block-announcement characteristic protected by the following protection (“Protect 4 nodes that most recently sent us blocks”), it’s not hard to also forge this new one, as signaling as block-relay-only is cheap.

    Correct me if I’m wrong about ProcessNewBlock and nLastBlockTime updates; this part is fairly complex with all the anti-DoS/trickery measures.


    sdaftuar commented at 1:50 pm on August 6, 2020:

    You’re right about the code and behavior – an adversary that is trying to be well-behaved by being the first to deliver blocks to us, from 8 different block-relay peers, can indeed take over this protection – but randomly choosing among block-relay-only peers is not helpful protection. If we choose randomly each time, an attacker can game that by merely spamming us with inbound connections until any given peer that had protection in a prior round has lost its protection in this round, and might get evicted.

    Note also that it costs nothing to pretend to be a blocks-only full-node peer; you merely send the right version message to us on connection. It is more work, and useful work, to be the first peer to deliver us a new block. So randomly protecting inbound peers who turn off tx-relay is not a useful thing to do, even if your random selection were on a longer timer to try to avoid the spamming problem I described above.

    One way to think about the protections here is: if we can pick out a set of peers that may be useful to us, protect them. Because the alternative to protection is that we may lose peers that are demonstrably useful just because an attacker is spamming us. If we make a mistake and include peers that happen to also be adversaries because they’re acting like good peers, that’s no worse than if we didn’t add the protection at all; but if we can meaningfully increase the cost to an adversary of bypassing the protection, then we’ve accomplished something.

    So maybe a poor criterion would be one where an attacker does useful work for us once and then can misbehave but still take over most or all of our protected spots. I don’t think that’s what is happening here (arguably, earliest connection time has the risk of an attacker being able to take it over, eventually – but the first sort is on most-recently-gave-us-a-block, so an attacker would have to be able to keep doing this to ensure that other peers don’t gain an advantage).


    sdaftuar commented at 1:56 pm on August 6, 2020:

    So maybe a poor criterion would be one where an attacker does useful work for us once and then can misbehave but still take over most or all of our protected spots. I don’t think that’s what is happening here (arguably, earliest connection time has the risk of an attacker being able to take it over, eventually – but the first sort is on most-recently-gave-us-a-block, so an attacker would have to be able to keep doing this to ensure that other peers don’t gain an advantage).

    I do worry that this might be the case, however, because block-relay-only peers may not be the first to deliver us blocks very often (it’s possible that due to compact block relay, there are biases in favor of compact block reconstruction happening best on links that are also relaying transactions – I am not sure). In that case, the attack you describe, of an adversary connecting 8 times, delivering the next 8 blocks from each one, and then being silent, may well work for a while to cause only their links to be protected, because the honest peers have a tough time beating other peers.

    We can try to get some anecdotal data on this by looking at our own existing nodes to see if apparent block-relay peers are being selected as HB compact block peers? @gmaxwell any thoughts here?

    EDIT: I guess my overall thought is that even if this is somewhat gameable, adding this won’t hurt and may help. If anyone comes up with something better we can always improve it.


    ariard commented at 4:54 pm on August 6, 2020:

    Yes, after writing this, I realized on further thought that my counter-proposal wasn’t worth it. As you hinted, analyzing eviction as a series of interdependent events means that sooner or later an attacker still has a high chance of success.

    If we make a mistake and include peers that happen to also be adversaries because they’re acting like good peers, that’s no worse than if we didn’t add the protection at all; but if we can meaningfully increase the cost to an adversary of bypassing the protection, then we’ve accomplished something.

    Thanks for the explanation. I agree that the current protection of fast-block-delivery peers is worth it and compels an attacker to position its nodes well with regard to block topology and propagation, even if it is potentially gameable. Where I was more doubtful was on the value of adding a second protection relying on an already-used assumption. I was also presuming that an attacker, by forgoing validation (accepting invalid blocks is zero-cost for it), will always win the block announcement race, independently of being in competition with compact block peers.

    I think block-relay-only peers have the interesting property that it’s far, far harder to guess their pairing, as you’ll remember from back when they were introduced. Whereas our protected-by-netgroup peers (1st heuristic) might be full-relay ones, and thus an attacker might be able to guess their IP prefix by probing the tx-relay topology.

    Block-relay-only peers don’t have this weakness, so it could be more worthwhile to combine them with the netgroup criterion rather than with the fast-block-delivery one?

    Overall, if we can’t come up with better ideas, you’re right, let’s move forward and try to address this kind of issue in future work on network anomaly detection/anti-eclipse.


    gmaxwell commented at 8:40 pm on August 6, 2020:

    @ariard There are, unfortunately, a lot of “nodes” that just connect and don’t do anything useful. Reserving slots for zombies that don’t even once do something useful isn’t very helpful. Unfortunately the software doesn’t really have that many criteria for deciding usefulness, and most things you could imagine could be easily faked. One advantage of favouring many peers that have recently sent blocks is that even if the leader is someone who’s only making themselves first for a little while before misbehaving, there are still many other slots. (Also, as an aside, the public Fibre network is gone and the software is now unmaintained.)

    Keep in mind that blocks-only peers are currently disadvantaged because they’re ineligible for the CompareNodeTXTime criteria, yet they are important: because they’re difficult for an attacker to map, they are intended to form an important part of the network’s protection against an attacker carving partitions into the network via DoS attacks. It would be extremely unfortunate to have that protection get undermined by a simple connection flooding attack that gets most of the blocks-only peers evicted.

    You can see the recent-blocks protection for blocks-only peers as being morally similar to reserving part of the longest-uptime slots for localhost/onion connections. In this case actually ‘reserving’ doesn’t make sense because the number protected for that reason is relatively small. Basically there are now two reasons for a group to exist: spreading connections across orthogonal kinds of being good, under the theory that it’s harder for an attacker to be good in many ways, and spreading connections across different kinds of (useful) peers.

    One could even imagine a generalization where you have a collection of different kinds of peers {ipv4, ipv6, onion, local-network, blocks-only, accepts-inbound-according-to-the-feeler-test} and different kinds of goodness {latency, block relay, tx relay, netgroups, uptime} and then you form a cross-product of the two sets and keep the best in each bucket— but currently there just aren’t that many slots to fight over.
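
    (Purely to illustrate that cross-product idea, and nothing like this exists in the codebase: one could bucket peers by (kind, goodness) and keep the best-scoring peer in each bucket. The names below are hypothetical.)

        #include <map>
        #include <utility>

        enum class PeerKind { IPv4, IPv6, Onion, LocalNetwork, BlocksOnly, AcceptsInbound };
        enum class Goodness { Latency, BlockRelay, TxRelay, Netgroup, Uptime };

        struct BucketBest {
            // For each (kind, goodness) bucket, remember the best-scoring peer seen so far;
            // those would be the peers protected from eviction.
            std::map<std::pair<PeerKind, Goodness>, std::pair<int, double>> best; // -> (peer id, score)

            void Consider(PeerKind kind, Goodness metric, int peer_id, double score)
            {
                const auto key = std::make_pair(kind, metric);
                const auto it = best.find(key);
                if (it == best.end() || score > it->second.second) best[key] = {peer_id, score};
            }
        };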

    If it were changed to netgroup based, I’d still recommend nLastBlockTime!=0 and fRelevantServices as sorting criteria ahead of a random netgroup preference.

    At one point I wanted to add a generic “helpful” flag based on a set of criteria that judge whether the peer is probably capable of serving us a novel block: do they have the right services? If not, not helpful. Have they recently given us a block? Helpful. Have they recently given us a new transaction we accepted? Helpful. Have they been recently announcing blocks we eventually accepted? Helpful. Otherwise not helpful. … stuff like that, so essentially all working peers should be flagged as helpful.

    And then have all the eviction criteria (except perhaps the longest uptime) use the helpful flag as the first sorting key, so that reserved slots won’t be taken by stuff that is very obviously broken. Attackers can feign helpfulness, sure, but some won’t bother. And weirdly broken nodes (some of which are just incompetent attackers) are common and don’t feign anything.

    I just found it to be too difficult to meet people’s expectations for testing for this kind of behaviour.


    ariard commented at 11:43 pm on August 7, 2020:

    @gmaxwell thanks for your explanation. I would push your model a bit further: beyond types of peers/goodness, you can classify the goodness criteria among themselves with regard to hardness assumptions on attackers. Some are good-behavior monitoring (block delivery, stable connections, …), others are costly topological identifiers (asmap, net address), others are non-observable ones (the netgroup sort; you could imagine a peer-nonce sort).

    It needs further research, but I fear the good-behavior ones are on the low end of the hardness scale, because once you have attacker infrastructure deployed it seems a marginal cost to fake them (e.g. learn the block first, give it to all your node puppets). These types of goodness are the ones we use the most; I don’t deny their usefulness, but I lean towards thinking of them more as sanitization measures than as protections against inbound slot takeover. By pruning out zombie nodes, as you underscore, they optimize your resources, in the sense that for a given bucket they increase potential utility.

    With regard to switching to a proactive logic, like the “helpful” flag described, it disfavors low-grade full nodes even when they are protocol compliant. I presume that’s already an effect of our eviction heuristics, but I would be cautious about going further and thereby breaking some of the network-wide diversity of peers. Obviously, pushing for more efficiency-based pairing isn’t great for robustness.

    Stricter slot allocation, especially based on netgroup, might have the side effect of facilitating some net-splits: if you allocate X slots based on netgroup ABC, an attacker, once deployed in this netgroup, can exhaust the quota towards every listening peer. I guess peer netaddresses aren’t evenly distributed across netgroups, so you might be able to drop a fair chunk of the network.

    Note, we do addr-relay for blocks-only peers, not for block-relay-only ones. So it might be easier for an attacker to map them than expected. I don’t think we want to sever addr-relay for blocks-only peers, as otherwise it would be harder (impossible?) for them to learn about new addresses.


    gmaxwell commented at 2:11 am on August 14, 2020:

    others are non-observable ones (the netgroup sort; you could imagine a peer-nonce sort).

    Not sure if it was clear, but the motivation for netgroup diverse peers was also along the lines of trying to restore topology diversity.

    Many criteria – latency, block relay, and tx relay – all somewhat favor peers that are geographically close. So you don’t want a situation where most nodes are just incestuously linked within the same geography, resulting in a small min-cut to other continents. It’s easy to imagine the network becoming that way over time when you factor in that longer links crossing more networks are a little less reliable than shorter ones.

    I fear the good-behavior ones are on the low end of the hardness scale, because once you have attacker infrastructure deployed it seems a marginal cost to fake them

    Sure, though if they stop being good, they’ll quickly lose the advantage it gave them. … assuming they haven’t already successfully eclipsed you entirely. If they keep being good then hurrah, you won.

    , once deployed in this netgroup, can exhaust the quota towards every listening peer

    Yes, though the preference isn’t directly observable.

    presume that’s already an effect of our eviction heuristics, but I would be cautious about going further and thereby breaking some of the network-wide diversity of peers

    I really disagree. It’s a reason not to be overly specific about what constitutes helpful, but there is a pretty obvious set of behaviours (like relaying us blocks) which, if a peer doesn’t do them, means it isn’t helping us stay non-partitioned from the honest network. A peer is free to not do helpful stuff; they wouldn’t be disconnected directly for it, but for criteria that are expressly intended to protect the node’s connectivity, it can make sense to filter on it.

    I don’t even think it would be completely absurd to reserve some connection capacity for peers which claim to be exactly the same protocol & subversion as you. Sure, a malicious party could spoof it, always matching it – but there have been several incidents in the past where we were concerned about fork software accidentally engaging in a partitioning attack: say Mallory releases a fork of the software that on some date will change its consensus rules to require every block to send half its coinbase to her. Mallory isn’t malicious, just stupid. She suckers a fair number of people into running this fork. When the time comes, your node is, by chance, totally surrounded by BitcoinMallory nodes. No one mines any Mallory blocks (or, if they do, the Mallory node graph may be partitioned and not get them to any of your neighbours) and the Mallory peers block the ‘invalid’ valid blocks from reaching you. Without any blocks showing up you can’t tell that the Mallory nodes’ consensus rules differ. The stale tip detection is a last-ditch Hail Mary to recover from this state, but it may not work well when you need it (e.g. due to honest nodes being full or attacked, or just taking a long time to find one that’s up and not a Mallory node). Wouldn’t it have been better if a couple of your connections were running the same software as you – so that if any of them were connected to the graph of healthy nodes you would be too?

    It’s critical to keep in mind that broken software radically outnumbers attackers at almost all times, especially because if countermeasures against attackers are generally effective then there won’t be attackers. Also, if merely being a “broken” peer is enough to harm the network…

    Note, we do addr-relay for blocks-only peers,

    It’s a choice of the peer if they want to announce themselves through us. You make a great point that a peer might be wise to suppress its announcements toward and from its outbound block-only connections because those connections exist in part to help obscure the node connectivity graph. I thought they already did this.


    naumenkogs commented at 7:48 am on September 1, 2020:

    Considering how easy it is to be the first to announce a block and then launch a rotation right after, I’m not very positive about this solution.

    I acknowledge that finding these criteria is hard, though. I would suggest: “take all no-tx nodes that announced a block within a day, and keep the longest-lived connections among them”.

    It would still protect against completely fake nodes (no blocks at all), and impose a stronger attack cost, I’d say (in terms of connection lifetime). Another benefit is that slower-but-persistent honest blocks-only peers won’t be evicted.

    Not a strong opinion; I’m not a big fan of the long-lifetime bias either, it just feels a bit stronger for now. For now I’m going to ACK the existing code.
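
    (For concreteness, that alternative could be sketched roughly as below. This is only an illustration of the suggestion, not code from the PR; it reuses the simplified Candidate struct from the sketch near the top of the thread and assumes now is the current unix time in seconds.)

        // Protect up to 8 no-tx peers that delivered a block within the last day,
        // preferring the longest-lived connections among them.
        void ProtectPersistentBlocksOnly(std::vector<Candidate>& cands, int64_t now)
        {
            const auto qualifies = [now](const Candidate& c) {
                return !c.fRelayTxes && now - c.nLastBlockTime < 24 * 60 * 60;
            };
            std::sort(cands.begin(), cands.end(), [&](const Candidate& a, const Candidate& b) {
                if (qualifies(a) != qualifies(b)) return qualifies(b); // qualifying peers sort last
                return a.nTimeConnected > b.nTimeConnected;            // then longest uptime last
            });
            const size_t n = std::min<size_t>(8, cands.size());
            cands.erase(std::remove_if(cands.end() - n, cands.end(), qualifies), cands.end());
        }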

  12. ariard commented at 12:21 pm on August 6, 2020: member
    Concept ACK on the onion peers protection. We might do the block-relay-only peer protection a bit better.
  13. sdaftuar force-pushed on Aug 6, 2020
  14. instagibbs commented at 3:47 pm on August 6, 2020: member
  15. in src/net.cpp:909 in ec4ee3fd7e outdated
    901@@ -887,7 +902,7 @@ bool CConnman::AttemptToEvictConnection()
    902                                                node->nLastBlockTime, node->nLastTXTime,
    903                                                HasAllDesirableServiceFlags(node->nServices),
    904                                                peer_relay_txes, peer_filter_not_null, node->addr, node->nKeyedNetGroup,
    905-                                               node->m_prefer_evict};
    906+                                               node->m_prefer_evict, node->addr.IsLocal()};
    


    sdaftuar commented at 3:50 pm on August 6, 2020:
    I discovered that doing IsLocal(node->addr) is different from node->addr.IsLocal(), and the latter appears to be correct, but if someone can confirm that this captures the right idea I’d appreciate it.

    dongcarl commented at 9:01 pm on August 6, 2020:

    Somewhat confusingly:

    • node->addr.IsLocal() returns true only if node->addr is IPv[46] and is in the loopback address ranges
    • IsLocal(node->addr) checks whether or not node->addr resides in mapLocalHost, which tracks our view of addresses we might be referred to externally

    So in this case, I believe your use of node->addr.IsLocal() is correct.


    gmaxwell commented at 9:08 pm on August 6, 2020:
    Yes. addr.IsLocal() is right. IsLocal() ought to be renamed to IsProbablyMyAddress. :)
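
    To summarize the distinction in one place (paraphrasing the explanations above, not quoting the actual implementations; is_local_inbound below is just an illustrative name):

        // node->addr.IsLocal()  -> is the peer's own address a loopback address?
        //                          True for inbound connections arriving via a local proxy
        //                          such as the Tor daemon on 127.0.0.1, which is the case
        //                          this PR wants to flag.
        // IsLocal(node->addr)   -> is this one of the addresses we believe we can be
        //                          reached at (a mapLocalHost lookup)? Unrelated to whether
        //                          the inbound connection came from localhost.
        const bool is_local_inbound{node->addr.IsLocal()}; // feeds the new m_is_local field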
  16. sdaftuar commented at 4:18 pm on August 6, 2020: member
    Updated to fix two bugs I discovered, and I did some light manual testing and verified that this seems to now do what is intended.
  17. ghost commented at 10:47 pm on August 6, 2020: none
    @sdaftuar I’ve boldly copied all the changes to net.cpp and recompiled/rebuilt bitcoind (0.20.0). I will report back if it solves my issue, but that will take some time (a few weeks or longer).
  18. gmaxwell commented at 2:12 am on August 14, 2020: contributor
    ACK ec4ee3fd7e9d0a8572c5e29404820bf5899b203a
  19. in src/net.cpp:859 in ec4ee3fd7e outdated
    851@@ -845,6 +852,14 @@ static bool CompareNodeTXTime(const NodeEvictionCandidate &a, const NodeEviction
    852     return a.nTimeConnected > b.nTimeConnected;
    853 }
    854 
    855+// Pick out the potential block-relay only peers, and sort them by last block time.
    


    ariard commented at 5:26 pm on August 19, 2020:
    If you have to modify this, you might change it to “non-tx-relay peers”/CompareNonTxRelayPeers, since that is what the comment above the sort refers to.
  20. in src/net.cpp:948 in ec4ee3fd7e outdated
    940+    std::sort(vEvictionCandidates.begin(), vEvictionCandidates.end(), CompareLocalHostTimeConnected);
    941+    size_t local_erase_size = total_protect_size / 2;
    942+    vEvictionCandidates.erase(std::remove_if(vEvictionCandidates.end() - local_erase_size, vEvictionCandidates.end(), [](NodeEvictionCandidate const &n) { return n.m_is_local; }), vEvictionCandidates.end());
    943+    // Calculate how many we removed, and update our total number of peers that
    944+    // we want to protect based on uptime accordingly.
    945+    total_protect_size -= initial_size - vEvictionCandidates.size();
    


    ariard commented at 5:56 pm on August 19, 2020:
    You can write it as size_t total_protect_size = initial_size / 2 - (initial_size - vEvictionCandidates.size()) and thus save one line above. I find this equation expresses better that up to 1/2 of the remaining peers are protected, as a function of how many localhost peers were effectively protected.

    sdaftuar commented at 9:40 pm on August 28, 2020:
    I think the way I wrote it makes sense, so not inclined to change this (seems like mostly style and not actually any more or less confusing?).
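
    For a concrete feel of this bookkeeping (made-up numbers, just walking through the quoted diff):

        size_t initial_size = 32;                          // candidates remaining at this point
        size_t total_protect_size = initial_size / 2;      // 16 slots protected by uptime overall
        size_t local_erase_size = total_protect_size / 2;  // up to 8 of them reserved for localhost peers
        // Suppose only 3 of the last local_erase_size sorted candidates are actually
        // localhost peers: the remove_if/erase drops those 3, so the candidate list
        // shrinks from 32 to 29 entries.
        total_protect_size -= initial_size - 29;           // 16 - 3 = 13
        // Net effect: 3 localhost peers + 13 longest-uptime peers = 16 protected in total.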
  21. ariard commented at 5:57 pm on August 19, 2020: member
    Code Review ACK ec4ee3f.
  22. sdaftuar commented at 9:40 pm on August 28, 2020: member
    Anything left here? From #19500 it seems this fixes the user’s issue.
  23. naumenkogs commented at 7:50 am on September 1, 2020: member
    ACK ec4ee3f
  24. DrahtBot commented at 7:01 pm on September 1, 2020: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    No conflicts as of last run.

  25. Protect localhost and block-relay-only peers from eviction
    Onion peers are disadvantaged under our eviction criteria, so prevent eventual
    eviction of them in the presence of contention for inbound slots by reserving
    some slots for localhost peers (sorted by longest uptime).
    
    Block-relay-only connections exist as a protection against eclipse attacks, by
    creating a path for block propagation that may be unknown to adversaries.
    Protect such peers from being disconnected by inbound connection slot attacks
    by attempting to protect up to 8 peers that are not relaying transactions
    but appear to be full-nodes, sorted by recency of last delivered block.
    
    Thanks to gmaxwell for suggesting these strategies.
    752e6ad533
  26. sdaftuar force-pushed on Sep 2, 2020
  27. laanwj commented at 2:43 pm on September 3, 2020: member
    I think this strategy makes sense. Code review ACK 752e6ad5336d5af0db9fe16d24c0c6aa25b74a3f
  28. laanwj merged this on Sep 3, 2020
  29. laanwj closed this on Sep 3, 2020

  30. sidhujag referenced this in commit 7abe2f8ae9 on Sep 3, 2020
  31. fanquake deleted a comment on Jan 8, 2021
  32. laanwj referenced this in commit dede9eb924 on Mar 30, 2021
  33. Fabcien referenced this in commit f9de217f49 on Jun 15, 2021
  34. DrahtBot locked this on Aug 16, 2022
