The current mechanism for choosing outbound peers picks one at random among known-reachable addresses, with the caveat that we do not connect twice to the same netgroup (by default /16’s and, if an ASmap is configured, AS’s). A more robust mechanism for preventing an attacker from controlling all of a node’s outbound connections would first randomize over netgroups, and then pick a known-reachable address within that netgroup.
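To make the difference concrete, here is a minimal Python sketch of the two approaches (the `netgroup()` helper and the flat list of known addresses are simplifications for illustration, not Bitcoin Core’s actual data structures):

```python
import random

def netgroup(addr: str) -> str:
    """Simplified netgroup: the /16 prefix of an IPv4 address."""
    return ".".join(addr.split(".")[:2])

def pick_current(known: list[str], used_groups: set[str]) -> str:
    """Current mechanism: uniform over all known-reachable addresses,
    never reusing a netgroup we are already connected to."""
    candidates = [a for a in known if netgroup(a) not in used_groups]
    return random.choice(candidates)

def pick_alternative(known: list[str], used_groups: set[str]) -> str:
    """Alternative mechanism: uniform over netgroups first, then
    uniform over known-reachable addresses within the chosen netgroup."""
    by_group: dict[str, list[str]] = {}
    for addr in known:
        group = netgroup(addr)
        if group not in used_groups:
            by_group.setdefault(group, []).append(addr)
    return random.choice(by_group[random.choice(list(by_group))])
```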
This alternative mechanism would make the probability for an attacker to control all of a node’s outbound connections exponentially decreasing in the number of connections, roughly $(\frac{k}{n})^c$ with $k$ the number of netgroups controlled by the attacker, $n$ the total number of netgroups available to be chosen from, and $c$ the number of outbound connections made. There are today on the order of 5k different /16’s to choose from in the network[^1]. If we were to use this method, even an adversary that introduced 1k new ones populated exclusively with their own nodes (which would be absurdly expensive) would not be able to control all of a node’s outbound peers with any relevant probability.
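For instance, taking $n \approx 6000$ netgroups once the attacker has added their $k = 1000$, and $c = 10$ outbound connections (Bitcoin Core’s default of 8 full-relay plus 2 block-relay-only):

$$\left(\frac{1000}{6000}\right)^{10} = \left(\frac{1}{6}\right)^{10} \approx 1.7 \times 10^{-8}$$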
By contrast, the current mechanism allows an adversary with enough node IPs to control all outbound connections of a node with a realistic probability, as long as those IPs are spread across at least as many netgroups as we make outbound connections by default. This is not merely a theoretical concern. This summer I investigated[^2] an entity that spun up hundreds of reachable nodes on the network. They have since scaled up to 3000 nodes, spread across 8 /16’s (and 3 or 4 AS’s). As a result, a freshly started clearnet node nowadays will make 3 to 5 of its outbound connections on average to this single entity, which is not even actively attacking (for instance by more aggressively sharing its own node addresses and/or not relaying other reachable nodes’). More discussions regarding this entity are available here and here.
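To make that figure concrete, here is a rough Monte Carlo sketch of the current mechanism. The 3000 attacker nodes over 8 netgroups are the figures above; the honest population (6000 reachable nodes over 5000 netgroups) is an assumption for illustration, not a measurement:

```python
import random

def simulate(trials: int = 1000, outbound: int = 10) -> float:
    """Estimate how many outbound connections a freshly started node
    makes to the attacker under the current selection mechanism."""
    # Assumed population: 3000 attacker nodes spread over 8 netgroups
    # (figures from this post), plus 6000 honest nodes spread over
    # 5000 netgroups (a rough guess).
    addrs = [(f"atk{i % 8}", True) for i in range(3000)]
    addrs += [(f"hon{i % 5000}", False) for i in range(6000)]

    total = 0
    for _ in range(trials):
        used_groups: set[str] = set()
        for _ in range(outbound):
            candidates = [a for a in addrs if a[0] not in used_groups]
            group, is_attacker = random.choice(candidates)
            used_groups.add(group)
            total += is_attacker
    return total / trials

print(simulate())  # ~3 connections to the attacker on average
```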
Of course, switching to this mechanism for choosing outbound peers would have consequences on the network graph. Because we currently sample over all known node addresses, we are biased towards netgroups that contain a lot of nodes (such as hosting providers). First sampling by netgroup would remove this bias, and make it significantly more likely to connect to more “obscure” netgroups. This could cause a resource allocation issue on the network, with the inbound connection slots of netgroups containing few nodes getting overused (and maxed out) while many inbound connection slots in netgroups containing a lot of nodes sit unused. Interestingly, this is a similar concern to that of switching to ASmap by default, shared by @virtu here.
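To quantify the bias: for a single connection with no netgroup exclusions, the probability that a given node $v$ in netgroup $g(v)$ is picked changes from address-uniform to netgroup-first sampling as

$$\frac{1}{N} \quad\longrightarrow\quad \frac{1}{n \cdot |g(v)|}$$

with $N$ the total number of known addresses and $n$ the number of netgroups. A node alone in its /16 would thus attract as much expected inbound demand as an entire hosting-provider /16 with hundreds of nodes.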
Naturally, a middle-of-the-road solution could be to use the alternative mechanism for only half of our connections ($c = 5$ in the formula above is more than enough) to get the local eclipse-resistance benefits while minimizing the risk of global network disruption. An alternative would be to fully move to sampling by netgroups, but not uniformly: the draw could be biased toward netgroups with more available resources.
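One way to express such a biased draw, as a sketch only (the square-root weighting is purely illustrative; choosing the right proxy for “available resources” is exactly the open question):

```python
import math
import random

def pick_weighted(by_group: dict[str, list[str]]) -> str:
    """Netgroup-first sampling, but biased towards netgroups with more
    nodes. The square-root weight is a made-up middle ground between
    per-address sampling (linear in group size) and per-netgroup
    sampling (uniform)."""
    groups = list(by_group)
    weights = [math.sqrt(len(by_group[g])) for g in groups]
    group = random.choices(groups, weights=weights, k=1)[0]
    return random.choice(by_group[group])
```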
A related discussion is how we want a node to behave when its inbound connection slots are full (see #16599 (comment)).
A related question is whether we want to keep the “never connect twice to the same netgroup” rule if we adopt the alternative mechanism. Without the rule but with the new mechanism, could it be the case that, if all connection slots in “small” netgroups are taken, a large fraction of the network eventually converges towards the larger netgroups, possibly making several connections to the same netgroup? That seems unlikely. On the other hand, if it happens it would organically spread resource usage (though not with a distribution we would be happy with).
This topic was discussed during yesterday’s IRC meeting (which this issue is following up on). Logs available here.
[^1]: A conservative estimate from querying the /16’s present in the tried table of a number of long-running nodes, and comparing with what a number of sources (1, 2, 3) claim is the number of reachable ipv4 nodes. The command run to gather the number of /16’s in a node’s addrman is the following: `bitcoin-cli getrawaddrman | jq -r '.tried[].address | select(test("^[0-9]{1,3}(\\.[0-9]{1,3}){3}$")) | (split(".")[0:2] | join("."))' | sort -u | wc -l`.
[^2]: See this blog post. The investigation started because the nodes were misconfigured, and I ended up being in contact with the person running them. It appears the person is purposefully trying to optimize for the highest possible number of node addresses announced, in particular by advertising several IPs per node.