Increase # of block-relay-only connections #28462

issue amitiuttarwar opened this issue on September 12, 2023
  1. amitiuttarwar commented at 8:28 pm on September 12, 2023: contributor

    tldr:

    Block-relay-only connections help improve partition resistance on the bitcoin network by increasing connectivity while obfuscating the network graph. To be conservative and reduce the chance of unexpected network effects, only 2 outbound block-relay-only connections per node were introduced initially. Since then, we have improved resource utilization and network behaviors to work towards increasing that number. Now, let’s try to do it!

    This is joint work with mzumsande. #28463 proposes a specific implementation to increase the number of inbound slots for block-relay-only connections, which is a prerequisite for later increasing the outbound slots. This issue is a place to discuss any conceptual questions or concerns.

    History & Context:

    In 2019, PR #15759 introduced 2 outbound block-relay-only connections for bitcoind nodes. The primary motivation of introducing these connections was to help obfuscate the network graph, since leaked information could help an adversary execute a partition attack.

    From the beginning, there was open questioning around how many block-relay-only connections we should add. More connections increase robustness, but the effects to consider include resource utilization and addr relay implications. Introducing 2 extra connections was evaluated to be a good first step that balanced tangible benefits with potential risk. gmaxwell mentioned in this comment, “ultimately I’d like to optimize memory usage for both inbound and outbound blocksonly links, then at least double the inbound connection limit for blocks only links, and make 8 blocks only links.” @sdaftuar expressed agreement with this direction here.

    At the time, there were a few different mechanisms that would have been a cause for concern if the number of block-relay-only connections were increased. Many have since been resolved, and we have highlighted one significant open question that needs to be evaluated in the context of this proposal. Additionally, reviewers should consider whether there may be any other undesirable side effects.

    • Addrman interactions (fixed) [PR #20187](https://github.com/bitcoin/bitcoin/pull/20187): There were a couple of nuanced interactions between block-relay-only connections and addrman that were addressed in this PR. The changes fix a privacy leak, ensure block-relay-only addresses are recognized as reliable connections, and fix a pre-existing addrman bug around active connections.
    • Addr relay implications (fixed) [PR #21528](https://github.com/bitcoin/bitcoin/pull/21528): As @ajtowns described in this comment, introducing block-relay-only connections initially degraded the propagation of addr messages on the network. [PR #21528](https://github.com/bitcoin/bitcoin/pull/21528) reduces addr blackholes by improving the behavior for honest nodes. The default behavior was updated to not treat inbound connections as addr relay peers until they indicated interest by initiating an addr-related p2p message.
    • Memory utilization (fixed) [PR #22778](https://github.com/bitcoin/bitcoin/pull/22778): block-relay-only connections do not require as much memory as transaction relay peers. They don’t need a TxRelay data structure, which is significant because the tx relay bloom filter uses approximately 500 kB per peer (see the sizing sketch after this list). [PR #22778](https://github.com/bitcoin/bitcoin/pull/22778) allowed nodes to identify whether an inbound peer might ever relay transactions over the lifetime of the connection, and to skip initializing the TxRelay data structure when unnecessary.
    • Netgroup limitation (fixed) [PR #27374](https://github.com/bitcoin/bitcoin/pull/27374): In certain circumstances, the logic to diversify netgroups of our outbound connections was limiting the total number of connections permitted by the node. This PR fixed this issue by applying separate logic for clearnet peers vs privacy network peers.
    • Number of available connection slots on the network (OPEN): available connection slots are a shared and limited network resource. Increasing the defaults for outbound connections should be carefully calibrated against the expected values for inbound slots over time. The next section shares context for how numbers have been selected for our proposal, and this is an area where we are very interested in reviewer feedback.
    • Is there anything else we are missing?
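    To make the memory point above concrete, here is a rough sizing sketch. It is a hedged estimate: the parameters are based on the commonly cited configuration of Bitcoin Core’s m_tx_inventory_known_filter (roughly 50,000 elements at a one-in-a-million false-positive rate), and the 2-3x rolling-filter multiplier is an approximation, not the exact implementation.

    ```python
    from math import log

    # Assumed parameters of the per-peer tx-inventory rolling bloom filter.
    n = 50_000  # transactions remembered per peer
    p = 1e-6    # target false-positive rate

    # Optimal bloom filter size: m = -n * ln(p) / (ln 2)^2 bits.
    bits = -n * log(p) / log(2) ** 2
    print(f"plain filter: ~{bits / 8 / 1000:.0f} kB")  # ~180 kB

    # A rolling filter keeps multiple generations of entries, so the total
    # footprint is roughly 2-3x that, in the neighborhood of the ~500 kB
    # per peer cited above.
    ```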

    Availability of Connection Slots:

    Context

    The patch in #28463 proposes values based on observing network statistics & calculating expected memory utilization. This section provides more reasoning behind selecting those numbers, so reviewers can evaluate the methodology.

    The fundamental question is: how much do we need to increase inbound capacity to accommodate increasing the number of outbounds from 10 to 16? This is tricky to answer with precision because (1) estimating the number of non-reachable nodes is hard and (2) these numbers will inevitably change over time, and we want to accommodate fluctuations.

    There is no way to guarantee sufficient connection slots - if all users disabled inbounds, the network would fail. However, we can still observe network behaviors to build confidence around projecting likely proportions that would be maintained over time. After all, the default number of 8 outbound connections dates back to satoshi code from 2010, and has successfully held up over the years 🙂

    Estimating the number of reachable clearnet nodes (as of September 2023)

    6,155 (bitnodes), 8,509 (KIT), 3,862 (Luke Dashjr), 7,910 (21 Ninja)

    Quality of data: The bitnodes number is verifiable because peers can be queried, and random sampling has demonstrated a high probability of successfully connecting. Luke Dashjr reports a significantly lower number than the other sources, but the methodology is not disclosed. The data from 21 Ninja also clearly specifies its methodology and crawler code.

    Estimating the number of non-reachable clearnet nodes

    50,956 (Luke, Sept 2023, unknown methodology and included networks); 32,300 (KIT Addr Spam Paper 07/21, estimated from the degrees calculated from observed addr spam); 27,000–35,000 (KIT Monitoring Paper 12/21, estimated from the number of gossip addrs received); 35,000 (bitnodes, estimated from gossip addrs received)

    Quality of data: Both KIT methods don’t take into account nodes that don’t self-advertise (e.g. listen=0, or SPV clients) but do consume inbound slots. Estimating based on addr gossip will include many spam addresses, so it may provide a reasonable upper bound.

    Extrapolating numbers

    Let’s use rough estimates of 8,000 reachable clearnet nodes & 40,000 non-reachable nodes.

    Estimated slots required for each increment to the default number of outbound connections: 8,000 + 40,000 = 48,000, since every node (reachable or not) would open one additional outbound connection. Required additional inbound slots for each increment to default outbounds: 48,000 / 8,000 = 6.

    With the current network estimates, we would need at least 6 additional inbound slots for each increment to the default number of outbounds. To be more conservative, we should probably add ~8-10 inbound slots for each additional outbound.
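    For concreteness, here is a minimal sketch of the slot arithmetic above, using the rough September 2023 estimates (assumptions, not measurements):

    ```python
    reachable = 8_000     # clearnet nodes accepting inbound connections
    unreachable = 40_000  # clearnet nodes making outbound connections only

    # Each +1 to the default outbound count means every node, reachable or
    # not, opens one more connection, and each of those connections must
    # land in some reachable node's inbound slots.
    extra_slots_needed = reachable + unreachable         # 48,000
    per_reachable_node = extra_slots_needed / reachable
    print(per_reachable_node)  # 6.0 extra inbound slots per reachable node
    ```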

    This is all only clearnet. However, it seems likely that inbound capacity for privacy networks is higher in comparison because, unlike clearnet, there is no need to forward ports, so accepting inbounds is the default behavior. While node operators are able to easily disable those inbounds, we’d anticipate that happening less frequently than on clearnet, where the reverse effort is required.

    Future work:

    We need users to adopt the changes to increase the default for inbound block-relay-only connections before we can safely increase the default for outbounds. #28463 proposes the increase in inbounds. If these changes are accepted, we would want to wait until the corresponding release is widely adopted, then have a future release update the outbound default.

    Anchors were implemented using block-relay-only connections to mitigate against restart-based eclipse attacks. For more context see [issue #17326](https://github.com/bitcoin/bitcoin/issues/17326) & [PR #17428](https://github.com/bitcoin/bitcoin/pull/17428). If we increase the number of outbound block-relay-only connections, we will want to thoughtfully design the interactions with anchors. As sdaftuar mentions in this comment, we will likely want to cap the anchors at 2, which entails selection logic. In this comment, @brunoerg mentions the idea of having (at least) one anchor per network instead of just two generic ones. While discussing this in depth would be premature, we also want to keep these implications and options in mind as we advance the current work.

  2. amitiuttarwar added the label Feature on Sep 12, 2023
  3. naumenkogs commented at 8:12 am on September 13, 2023: member

    I’m curious about some back-of-the-envelope estimates on the memory use: block-relay-only, tx-relay, and the node in general. That would help to understand the impact of such a patch.

    Also, do you think netgroup limitation stuff would need to change? Would these connections be diversified separately, or together with tx-relay peers?

    Otherwise I don’t see any issues, so this is probably a good idea. I agree with the next step:

    We need users to adopt the changes to increase the default for inbound block-relay-only connections before we can safely increase the default for outbounds

  4. amitiuttarwar commented at 8:01 pm on September 29, 2023: contributor

    Good question @naumenkogs, I agree that understanding the memory usage is valuable, so I’m working on gathering more specific data. Off hand, the m_tx_inventory_known_filter is very large, and there’s a significant worst-case amount of memory used for storing txs in the queue for sending to peers. Will report back with my findings once I get some numbers that at least roughly reconcile with each other :)

    On a different note, when discussing this proposal in person, an idea came up for an alternate approach to allocating inbound capacity between block-relay-only and full-relay links. The idea is to assign each connection type a weight indicating the proportion of resources it uses, and to cap the number of inbounds a node will accept based on a dynamic weighted calculation.

    For example: let’s say we select a weighted score of 5 for tx-relay to 1 for block-relay. The current configuration of 115 inbounds could map to 115 * 5 = 575 available points. When running, the node would accept both types of connections and decide whether it needs to evict based on 5 * current-tx-conns + 1 * current-block-relay-conns.
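    A minimal sketch of how such a weighted cap could work (the weights, cap, and names are illustrative assumptions taken from the example above, not Bitcoin Core’s actual eviction logic):

    ```python
    TX_RELAY_WEIGHT = 5     # assumed cost of a full (tx) relay inbound
    BLOCK_RELAY_WEIGHT = 1  # assumed cost of a block-relay-only inbound
    CAPACITY = 115 * TX_RELAY_WEIGHT  # 575 points, from 115 inbound slots

    def over_capacity(tx_relay_conns: int, block_relay_conns: int) -> bool:
        """Would the weighted score of current inbounds exceed the budget?"""
        score = (TX_RELAY_WEIGHT * tx_relay_conns
                 + BLOCK_RELAY_WEIGHT * block_relay_conns)
        return score > CAPACITY

    # 115 tx-relay peers fill the budget exactly; giving up one tx-relay
    # slot frees room for five block-relay-only peers.
    assert not over_capacity(115, 0)
    assert not over_capacity(110, 25)
    assert over_capacity(115, 1)
    ```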

    An advantage of this approach is that the network would utilize inbound capacity more fully. With the current approach, especially while rolling out the increase, there would be unused slots because more capacity is offered for block-relay connections than the number of outbound block-relay connections that need to be made.

    A disadvantage of this approach is that it seems a bit tricky to communicate to node operators exactly what to expect. Also, selecting the ratio can only be a rough estimate because the different connections vary on expected/worst-case usage of memory, bandwidth & cpu.

    If we went with this approach, we would also want to update the inbound eviction logic to make sure that tx-relay connections cannot crowd out block-relay-only connections in a strained network environment.

    Curious to hear if reviewers have thoughts or preferences between the current approach (specific slots allocated to inbound-block-relay connections) or the alternate approach (have a weighted ratio to cap resource usage).

  5. naumenkogs commented at 9:52 am on October 2, 2023: member

    @amitiuttarwar Should we consider outbound connections for the weighting as well? Sometimes we’re only at 6 outbound peers; would that leave more room for inbounds (say, at the same time trying to always converge to 8 outbounds if that becomes possible)? Alternatively, if we have no inbounds, should that allow us to have more outbound tx-relay peers?

    Also, when you say resource usage, should we include: memory AND bandwidth? I think @sipa thought about this approach some time ago.

    This is probably a better way forward, so it’s mostly a matter of dev effort allocation. It’s a solid project!

  6. amitiuttarwar commented at 8:21 pm on October 2, 2023: contributor

    Should we consider outbound connections for the weighting as well? Sometimes we’re only at 6 outbound peers; would that leave more room for inbounds (say, at the same time trying to always converge to 8 outbounds if that becomes possible)? Alternatively, if we have no inbounds, should that allow us to have more outbound tx-relay peers?

    hmmm, what are the intended benefits of this?

    from my POV, I think inbounds and outbounds are best treated separately. a couple reasons come to mind:

    • slots to connect to on the network are a shared, limited resource. we want to make sure there are sufficient inbound slots to service all nodes. this is already hard to reason about with opaque info from the network, but it would be even more difficult if the number of outbounds were variable per node.
    • for a node, outbounds are selfishly necessary to ensure connection to the network. for the most part, offering inbounds is providing a service to the network more than individually benefitting (the attack surface increases significantly)

    Also, when you say resource usage, should we include: memory AND bandwidth? I think @sipa thought about this approach some time ago.

    yes I think we should evaluate memory, bandwidth, and even cpu usage (which can be more severe in certain network scenarios or worst-case situations)

  7. naumenkogs commented at 7:31 am on October 3, 2023: member

    hmmm, what are the intended benefits of this?

    The examples I have above.

    1. You have 6 outbound connections and you’ve been stuck at it for an hour (can’t find 2 more). You might want to allow 2 extra inbounds instead (or 10 if they are SPV).
    2. You are ready to take 100 inbounds, but it takes a week for them to find you, so in the meantime you can have extra outbound connections.

    Currently, in both cases, you would have under-use of resources allocated for the node.

    You may argue that (1) is too exotic, and (2) is rather a waste of resources than a benefit, which is probably true. I will think of more.


    An even more advanced thing is letting the peer know “Keep me inbound only if you’re not at capacity; otherwise drop me and let those who need it more join”. “A non-critical outbound connection”. This would help with waste (2) in a resource-based framework, but it could also be useful on its own. I first thought this was incompatible with the current approach to altruism/selfishness of the network, but now I think it changes nothing.

  8. amitiuttarwar commented at 6:47 pm on October 4, 2023: contributor

    okay, so here’s my understanding:

    • the proposal is to calculate number of inbounds & outbounds together, instead of the current system of treating them separately
    • the main motivation is to maximize resource utilization of each node
    • another motivation would be to increase the number of connections in a case where the node is unable to find sufficient outbounds

    my responses:

    • when we talk about resource utilization, it doesn’t make sense to treat maximizing available resources as a general value. for example- we don’t want a node to be sustained at 100% CPU usage. in terms of available inbound slots, we want to find a balance. individual nodes & the entire p2p network are most robust with strong interconnectedness. however, inbound slots are a shared, limited resource that are important to ensure nodes that are starting up are able to connect to the honest network. so, imo the ideal network state would balance strong connectivity between nodes with having some unused capacity available for new peers.
    • the case where a node is stuck at 6 outbound peers and is unable to find any additional would be a problem. I agree that we should be working to ensure that doesn’t actually occur. however, in that circumstance having additional inbounds could potentially be more harmful because an adversary could slowly take over a greater percentage of connections, have more influence over the addrman, and eventually partition / eclipse the target node. so imo it makes sense to continue to treat inbounds and outbounds separately, and work on other mitigation factors to ensure this kinda scenario doesn’t occur.
    • "(2) is rather a waste of resources than a benefit," -> agree. the value of network inbound slots is significantly higher than the value of a specific node maximizing connections/CPU/bandwidth/memory…

    “A non-critical outbound connection”.

    that’s interesting. what information do you imagine makes a particular inbound slot more valuable to one node vs another? if there are available slots on the network, then there are some attributes (such as diversifying netgroup) that would value them differently, but how would a node anticipate its alternate available options?

  9. ajtowns commented at 0:50 am on October 5, 2023: contributor

    Should we consider outbound connections for the weighting as well?

    If we’re at ~100 inbounds and ~8 outbounds, the weighting doesn’t make much difference – it just means 108 inbounds vs 100, so under a 10% difference, and likely less. So I don’t think it matters much either way – if either way makes the code simpler, that would be a win, otherwise don’t worry about it?

    OTOH, the reason we have ~8 outbounds and ~100 inbounds is because we figure ~10x as many nodes are going to be unreachable for one reason or another (behind a NAT/firewall, misconfigured, mobile and not having a consistent IP, behind an ISP that doesn’t allow inbound connections). If that changed (due to increased tor/i2p usage, natpmp availability?) then it might make sense to target more of a balance between inbounds/outbounds. I think that would look more the other way though: e.g., you’d be making more than 8 outbound connections if you had fewer than 30 inbounds. I think that’s all theoretical though, and we don’t currently have a reason to change things along those lines.

    IMHO, YMMV, just my 2sats, etc

  10. naumenkogs commented at 11:43 am on October 20, 2023: member

    @amitiuttarwar

    what information do you imagine makes a particular inbound slot more valuable to one node vs another?

    I think we’re talking about different things. My answer is — Whatever the initiator (inbound peer) says. If the peer says the connection is crucial for them, we prioritize them. Inbound slots are half-altruistic in the first place… Worst case: bad nodes always ask to prioritize them. Then, we’re back to the status quo.

  11. ajtowns commented at 3:07 am on October 21, 2023: contributor

    @amitiuttarwar

    what information do you imagine makes a particular inbound slot more valuable to one node vs another?

    I think we’re talking about different things. My answer is — Whatever the initiator (inbound peer) says. If the peer says the connection is crucial for them, we prioritize them. Inbound slots are half-altruistic in the first place… Worst case: bad nodes always ask to prioritize them. Then, we’re back to the status quo.

    Worst case is bad nodes ask to be prioritised and good nodes don’t, and we end up not accepting inbound connections from good nodes. Having outbound connections is crucial, but having an outbound connection to a particular node should only matter if you have a meaningful relationship with that node, in which case I think you should just be whitelisting the connection on one (both) side(s) in order to prioritise it in whatever way you like.

  12. virtu commented at 10:45 am on November 7, 2023: contributor

    Estimating the number of reachable clearnet nodes (as of September 2023)

    6,155 (bitnodes), 8,509 (KIT), 3,862 (Luke Dashjr), 7,910 (21 Ninja)

    I think the misnomer “number of reachable nodes” is finally coming back to haunt us, since these numbers actually represent the number of reachable clearnet addresses, not nodes. Given that a node can connect to the Bitcoin network using multiple network types (as well as multiple addresses of the same network type), estimates for network-wide inbound slots based on these “node” counts will be too optimistic.

    I have tried estimating the number of actual nodes under the moniker “number of unique Bitcoin nodes” here (feel free to double-check the methodology).

    To get an estimate for the number of clearnet inbound slots, clearnet-only nodes can simply be added up (~1,100 IPv4, ~430 IPv6 and ~2,300 IPv4/IPv6 nodes) and multiplied by ~115; mixed clearnet-darknet nodes’ inbound slots, however, will not be exclusively available to clearnet nodes because a share of them will be occupied by darknet nodes.

    Lacking information about the network composition of unreachable nodes, the distribution of reachable nodes could serve as a makeshift proxy for the network-type ratio of incoming connections: for instance, if there are ~11k reachable Tor nodes but only ~6k reachable IPv4 ones, 65% of an IPv4/Onion node’s inbound slots could be estimated to be occupied by Tor connections, while only the remaining 35% would be available to IPv4 nodes.

    A quick back-of-the-envelope calculation using these assumptions results in the equivalent of only 5,400 reachable clearnet nodes. Assuming the 40,000 figure for the number of unreachable nodes is correct, this, unfortunately, would raise the required additional inbound slots for each increment to default outbounds from six to eight or nine (45,400 / 5,400 ≈ 8.4).
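    For reference, a rough reconstruction of that calculation (the node counts and the network-share weighting are the approximations quoted above, so treat this as a sketch rather than a measurement):

    ```python
    # Clearnet-only nodes contribute all of their inbound slots to clearnet.
    slots_per_node = 115
    clearnet_only = 1_100 + 430 + 2_300  # IPv4, IPv6, and IPv4/IPv6 nodes

    # Mixed clearnet/darknet nodes contribute only a fraction of their
    # slots: with ~11k reachable Tor nodes vs ~6k reachable IPv4 ones, an
    # IPv4/Onion node is assumed to spend ~65% of its slots on Tor peers.
    ipv4_share = 6_000 / (6_000 + 11_000)  # ~0.35 left for IPv4 peers

    # Summing over all node classes this way yields the equivalent of
    # ~5,400 reachable clearnet nodes; the required extra inbound slots
    # per outbound increment then become:
    effective_reachable = 5_400
    unreachable = 40_000
    print((effective_reachable + unreachable) / effective_reachable)  # ~8.4
    ```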

  13. mzumsande commented at 5:16 pm on November 30, 2023: contributor

    (feel free to double-check the methodology)

    I think that this methodology (using networks in getaddr responses) will likely overestimate the number of peers reachable on multiple networks, because accepting addrs from a network does not mean you can actually accept connections from it. It doesn’t even mean that you can connect to others over it: for example, I think it’s quite typical that some node operators would enable an onion service if they aren’t reachable through clearnet (maybe they don’t control the firewall settings of their network, or don’t want to forward ports manually) but still want to be reachable. These nodes could make outgoing connections to both clearnet and tor peers but would only be reachable via tor. The opposite scenario of only being reachable via clearnet is also possible.

    Another effect is that having support for IPv6 (enabled by default) doesn’t even mean that you can reach others because we don’t have an auto-detection (see #28061) - we’d still accept rumoured IPv6 into addrman in this case and would include these in getaddr answers.

    I’m not sure though how this affects the calculations.

  14. virtu commented at 6:20 am on December 2, 2023: contributor

    (feel free to double-check the methodology)

    I think that this methodology (using networks in getaddr responses) will likely overestimate the number of peers reachable on multiple networks, because accepting addrs from a network does not mean you can actually accept connections from it. It doesn’t even mean that you can connect to others over it: for example, I think it’s quite typical that some node operators would enable an onion service if they aren’t reachable through clearnet (maybe they don’t control the firewall settings of their network, or don’t want to forward ports manually) but still want to be reachable. These nodes could make outgoing connections to both clearnet and tor peers but would only be reachable via tor. The opposite scenario of only being reachable via clearnet is also possible.

    Another effect is that having support for IPv6 (enabled by default) doesn’t even mean that you can reach others because we don’t have an auto-detection (see #28061) - we’d still accept rumoured IPv6 into addrman in this case and would include these in getaddr answers.

    I’m not sure though how this affects the calculations.

    Thanks for the review!

    I agree with the general critique and the examples you brought up, as well as the implication that this leads to an under-estimation of the number of reachable nodes (nodes meaning actual node instances, not reachable addresses). The approach might still be useful as a lower-bound estimate, though.

    I will investigate the feasibility of using addrman’s cached getaddr replies to deduplicate nodes. Maybe that will help us zero in on the actual number of reachable nodes.

  15. amitiuttarwar commented at 0:07 am on December 29, 2023: contributor

    thank you @virtu & @mzumsande for the thoughtful discussion around estimating available slots by network! I agree that estimating clearnet inbound slots at 114 per address is an overestimation because of availability on multiple networks, so we should update our expectations accordingly.

    suggestion on next steps

    from an offline conversation with @mzumsande, we think a reasonable path forward is to continue with the proposal in #28463 to increase max connections to 200 while continuing this research around number of network nodes & available inbound slots. based on our findings, we can later decide how much to increment the number of outbound block-relay-only connections (eg. maybe bump to 6 per node instead of 8). curious to hear what reviewers think of this approach.

    we also changed the proportion of slots available for block-relay vs full-relay in #28463, so we would want to re-calculate expectations for bumping the outbound number regardless.

    thoughts around estimating inbound slots

    @virtu - couple questions about your research:

    this leads to an under-estimation of the number of reachable nodes (nodes meaning actual node instances, not reachable addresses). The approach might still be useful as a lower-bound estimate, though.

    I don’t quite follow how this would lead to a general underestimation. eg. if a node is actually IPv4-only but reports IPv4 & IPv6, wouldn’t the technique overestimate IPv6 and underestimate IPv4? in the case of a reachable node with IPv4 & tor enabled, I think this would lead to overestimating available slots on IPv4 because of firewalls. does my thinking / question make sense?

    another curiosity I have is wondering how come the results show so many IPv4 only nodes. bitcoin core enables IPv6 by default (as explained in #28061), and my understanding is the vast majority of nodes on the network run bitcoin core (based on the satoshi user agents eg on bitnodes, which ofc is self-reported by the nodes). but it surprised me that 30% of peers would only return IPv4 addresses. any ideas?

