RFC: Erlay Conceptual Discussion #34542

issue opened by marcofleon on February 9, 2026
  1. marcofleon commented at 9:53 pm on February 9, 2026: contributor

    As suggested during the IRC meeting last week (logs), this issue is for discussing recent progress on Erlay, evaluating its trade-offs and goals, and ultimately deciding whether it’s worth continuing to pursue.

    The current project tracking issue is #30249, though some of the performance research results shown there may be outdated. We can gather updated data here and discuss.

    Couple questions on my mind:

    • What does the bandwidth/latency trade-off look like as we scale the number of connections?
    • Do the benefits outweigh the added code complexity?

    Of course, discussion isn’t limited to the above. Feel free to bring up any other points that come to mind.

  2. bitcoin deleted a comment on Feb 9, 2026
  3. fanquake added the label Brainstorming on Feb 10, 2026
  4. sr-gi commented at 1:09 pm on February 11, 2026: member

    Thanks for opening this @marcofleon.

    I’d like to give a refresher and overview of Erlay, and explain some of the details of the experimentation and current approach. I also have a few questions/points that I’d like to discuss with people in the project.

    Overview

    Erlay is an alternative method of announcing transactions between peers in the Bitcoin P2P network. Erlay’s main goal is to reduce bandwidth utilization when propagating transaction data by minimizing the amount of redundant inventory messages (INV) exchanged between peers.

    In our current transaction relaying approach (which we’ll call txflood or fanout from now on), every transaction is announced on every transaction-relaying link in the peer-to-peer network at least once. That is, for each transaction-relaying connection in the network, each side of the connection will either send or receive an INV message containing the given transaction. Most of these announcements ((n-1)/n, for a node with n peers) are actually redundant, but it is impossible for the announcer to know that until the announcement has been sent.

    Erlay tries to solve this issue by using set reconciliation to work out the transaction differences between two nodes connected to each other instead of just announcing all transactions through all links. To do so, nodes keep a set of transactions to be reconciled (or reconciliation set) between each of their Erlay-enabled peers, and reconciliation is performed at regular intervals. Once it is time to reconcile, peers exchange sketches of their reconciliation sets, which can be used to optimally compute the symmetrical difference between their reconciliation sets and, therefore, identify which transactions need to be sent by each end of the connection.

    Erlay works optimally (sketches are smaller and sketch differences are cheaper to compute) when the differences between sets are small. This means that, even if all peers of a given node are Erlay-enabled, some transactions will still be exchanged using fanout. Finally, as we will expand on later on, fanout is more efficient and considerably faster than set reconciliation provided the receiving node does not know about the transaction being announced.
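    To build intuition for why sketches work best on small diffs, here is a deliberately tiny toy construction (not minisketch’s actual BCH-based PinSketch): a one-word XOR “sketch” whose size is independent of the set size, but which can only recover a symmetric difference of exactly one element. Recovering larger diffs requires proportionally larger sketches, which is exactly why big set differences make reconciliation expensive.

```python
from functools import reduce

def sketch(txids):
    """Toy one-word 'sketch': the XOR of all elements. Its size is constant
    regardless of set size, but it can only recover a symmetric difference
    of exactly one element; minisketch generalizes this so that capacity
    (and sketch size) grows with the expected diff, not with the set size."""
    return reduce(lambda a, b: a ^ b, txids, 0)

alice = {0x1111, 0x2222, 0x3333, 0x4444}
bob = {0x1111, 0x2222, 0x3333}  # Bob is missing exactly one txid

# XORing the two sketches cancels the shared elements, leaving the diff.
missing = sketch(alice) ^ sketch(bob)
print(hex(missing))  # -> 0x4444
```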

    Fanout: The current approach

    Before jumping into Erlay, it’s important to understand how fanout works in Core, so we can reason about the proposed changes and tradeoffs.

    Transactions are exchanged between network peers using a 3-step protocol (INV/GETDATA/TX). The first round trip of this protocol (INV/GETDATA) can be seen as the announcement/request phase; this is where both sides of the connection are made aware of what the other side knows. The last half round trip (TX) is where the actual transaction is exchanged, if necessary. Since transactions are significantly bigger than transaction IDs, the INV/GETDATA part of the protocol is used to minimize the amount of potentially redundant information exchanged between peers. If I don’t know whether you know something, I’d better make sure first before dumping a lot of data on you. Hence, if the announcement receiver knows about the target transaction, we waste a minimal amount of bandwidth, and if not, we add a minimal amount of overhead to sending the transaction.
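    The trade-off in that last sentence can be made concrete with a little arithmetic. This sketch uses assumed sizes (36-byte INV/GETDATA entries, 400-byte average transaction) purely for illustration:

```python
# Expected bytes per peer per transaction for "announce first" versus
# "blindly send the full tx", as a function of the probability p that the
# peer already knows it. Sizes are rough assumptions: 36-byte INV/GETDATA
# entries and a 400-byte average transaction.
INV, GETDATA, TX = 36, 36, 400

def announce_first(p_known):
    # Always pay the INV; pay GETDATA + TX only if the peer lacks the tx.
    return INV + (1 - p_known) * (GETDATA + TX)

def blind_send(p_known):
    return TX  # full transaction goes out regardless

for p in (0.0, 0.5, 0.875):  # 0.875 ~ 7 of a node's 8 peers already know it
    print(f"p={p}: announce-first {announce_first(p):.1f}B, blind {blind_send(p)}B")
```

    The higher the chance the peer already knows the transaction, the bigger the win from announcing first.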

    Transactions are announced between peers at random intervals following a Poisson process. The expected delay between announcements depends on the type of connection: outbounds have an independent clock per connection with an expected value of 2 seconds, whereas inbounds share a timer with an expected value of 5 seconds. Transactions are queued to be announced until the timer goes off, and then announced using as many inventory messages as needed to clear the queue (each INV message can hold up to 50,000 entries, IIRC). These timers are in place to obfuscate when a node learns about a transaction, in order to make it harder for a global network observer to pinpoint the transaction origin.
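    As a rough sketch of this scheduling (the timer mechanics are simplified here; only the 2 s / 5 s expected values come from the paragraph above):

```python
import random

random.seed(7)

# Delays until the next announcement trickle are exponentially distributed
# (the inter-arrival time of a Poisson process): outbounds get an
# independent timer with a 2 s mean, while all inbounds share one timer
# with a 5 s mean.
OUTBOUND_MEAN, INBOUND_MEAN = 2.0, 5.0

def next_trickle_delay(is_inbound):
    mean = INBOUND_MEAN if is_inbound else OUTBOUND_MEAN
    return random.expovariate(1.0 / mean)

samples = [next_trickle_delay(False) for _ in range(10_000)]
print(sum(samples) / len(samples))  # empirical mean, close to 2.0
```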

    When a node learns about a new transaction via an inventory message, it may request it back from the announcer. Announcements are requested back in order of arrival, prioritizing outbound peers. If a peer takes too long to respond to a request, we may ask the next peer and so on. Inbound peers are de-prioritized here, by giving them a 2-second delay before requesting. This is to prevent them from stalling us (e.g., sending multiple announcements of the same transaction from different connections to delay us from learning the announced transaction).

    Last but not least, when a transaction is being purposely delayed (in the to_be_announced queue), a node will pretend not to know about it, so it won’t reply to GETDATA messages requesting it. This is to prevent probing.

    Pros: fast (artificially slower for privacy/DoS protection purposes)

    Cons: each transaction is announced at least once per link (too redundant)

    Erlay: The second take

    Erlay is not a drop-in replacement for fanout, but a protocol that works alongside it. Part of this is obvious for backwards-compatibility reasons (nodes are not forced to implement or adopt the change), but that’s not the only reason why. Set reconciliation is really good at fixing small differences between sets (both fast and efficient), but it becomes far less optimal when the differences between the sets to reconcile are big: sketches become big (they grow with the size of the diffs they can compute), and the diffs are expensive to compute. This is why set reconciliation works best alongside fanout, and why our approach implements Erlay as a hybrid of both.

    Once this becomes clear, the main question to answer when implementing Erlay is: how do we partition peers between fanout and reconciliation, and how big should the split be? This is the question that has driven most of the experimentation in the project and that most of the simulations have tried to answer. To answer it, we need to understand the properties of each group:

    • Fanout peers are good at propagating the transaction fast, but too much use of them leads to redundant announcements
    • Reconciliation peers are good at fixing small differences between peers, but they are, inherently, slower at propagating transactions.

    The reason why reconciliation is slower lies in how transaction announcement is implemented in Core. When a transaction is propagated using set reconciliation, we need to go over an additional 1.5 round trips (REQRECON/SKETCH/RECONDIFF) to learn whether the target peer knows about a certain transaction AND THEN, if it doesn’t, fall back to INV/GETDATA/TX to send it. Therefore, 3 round trips in total. This may seem a bit counterintuitive: why can we not just send the transaction straight away once we learn the peer doesn’t have it? Because another peer could have announced the transaction before us, and the target peer may already have selected them for download.
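    Counting half round trips makes the gap explicit. The 100 ms RTT below is an assumed value for illustration, and the trickle/reconciliation-interval delays (which dominate in practice) are ignored:

```python
# Wire latency from the half-round-trip counts described above, ignoring
# trickle and reconciliation-interval delays; the 100 ms RTT is an assumed
# illustrative value.
RTT = 0.100

fanout_steps = ["INV", "GETDATA", "TX"]                            # 1.5 RTT
recon_steps = ["REQRECON", "SKETCH", "RECONDIFF"] + fanout_steps   # 3 RTT

def wire_latency(steps):
    return len(steps) * (RTT / 2)  # each message is one half round trip

print(f"fanout:         {wire_latency(fanout_steps):.3f}s")
print(f"recon + fanout: {wire_latency(recon_steps):.3f}s")
```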

    To answer this question, I built a discrete-time network event simulator for Bitcoin (you can find it here) so I could quickly iterate over different configurations on reasonably large networks.

    Experimentation and findings

    The initial approach was to simply answer the following question: how small can we keep fanout so that network latency remains similar while maximizing the bandwidth savings of set reconciliation? This proved tricky right off the bat: set reconciliation is much slower than I originally imagined, and what counts as acceptable latency is hard to pin down. As an example, here are some simulation results with no optimizations:

    Network: 110K nodes (100K unreachable, 10K reachable), random topology, 8 outbounds.

    | No optimizations | Full recon | Erlay 1 (out), 10% (ins) | Full fanout |
    | --- | --- | --- | --- |
    | T-90% | 25.07007s | 20.413506s | 8.676374s |
    | T-100% | 32.31443s | 27.577503s | 14.88974s |
    | data-volume (reachable) | 1038.9468 messages (4589.844 bytes) | 846.8382 messages (4876.821 bytes) | 119.51961 messages (6120.7563 bytes) |
    | data-volume (unreachable) | 87.36723 messages (405.09848 bytes) | 71.30196 messages (429.1808 bytes) | 10.086697 messages (524.5129 bytes) |

    This shows the difference between doing full reconciliation (no fanout involved), doing minimal reconciliation (using parameters similar to those defined as “best” in the previous Erlay attempt), and full fanout (the current state of things). It becomes pretty clear that the impact on latency is significant when no optimizations are applied, but it also raises the question: what is acceptable in terms of latency, and at what cost?

    One of the first attempts to reduce latency while maximizing our bandwidth savings was to reduce the previously mentioned expected values for the peer-announcement Poisson process sampling. Halving those artificially reduces latency with almost no effect on bandwidth (if any). This was the approach in the original Erlay implementation, but I never felt too comfortable with it. There is little documentation (if any) of why the broadcast intervals were picked as {5s, 2s}, and if that’s the route we want to take, I think it should come from a well-reasoned discussion.

    Fine-tuning the approach

    There have been several experiments and theories tested to try to optimize the aforementioned approach, making it more latency-efficient while preserving the bandwidth savings. For an extended and more detailed explanation of some of these, I’d recommend checking the series I wrote on Delving. Here, I’ll highlight the two optimizations that have been shown to work in simulation.

    Reconcile on trickle

    One of the main reasons why Erlay is slow has to do with when data is made available to our peers. As mentioned earlier, this happens when we trickle for them (their timer has gone off, so we will try to send them whatever we have available). However, adding reconciliation on top of this means that the two mechanisms now need to synchronize; otherwise, if data is being purposely delayed for a peer when they request reconciliation, they will have to wait until the next reconciliation cycle to get it (FYI, this is currently once every 8 seconds per outbound peer).

    A solution to this problem is to not reply to reconciliation requests straight away. Instead, the request can be recorded and replied to when we trickle for that peer, ensuring that they get all transactions we have received up to that point. Here are the simulation results using the same network as before, but applying this optimization:
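    A minimal sketch of this deferral, with hypothetical names (Core’s actual data structures differ):

```python
# Hypothetical sketch of "reconcile on trickle": a peer's reconciliation
# request is parked rather than answered immediately, and the reply is
# built when we next trickle for that peer, so it covers everything that
# was being delayed for them up to that point.
class Peer:
    def __init__(self):
        self.recon_set = set()      # txids queued for this peer
        self.pending_recon = False  # unanswered REQRECON?

def on_reqrecon(peer):
    peer.pending_recon = True       # record the request, don't reply yet

def on_trickle(peer):
    """Fires with the peer's trickle timer; answers a parked request with
    everything accumulated so far (the sketch would be built from this)."""
    if not peer.pending_recon:
        return None
    peer.pending_recon = False
    return frozenset(peer.recon_set)

p = Peer()
on_reqrecon(p)
p.recon_set.update({"tx_a", "tx_b"})  # arrives while the request is parked
reply = on_trickle(p)
print(reply)  # contains tx_a and tx_b, which an immediate reply would miss
```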

    | Recon on trickle | Full recon | Erlay 1 (out), 10% (ins) | Full fanout |
    | --- | --- | --- | --- |
    | T-90% | 22.339193s | 16.797455s | 8.654676s |
    | T-100% | 29.805967s | 23.903776s | 14.878592s |
    | data-volume (reachable) | 713.03534 messages (4071.1865 bytes) | 562.6937 messages (4548.8276 bytes) | 119.527855 messages (6121.74 bytes) |
    | data-volume (unreachable) | 59.583134 messages (350.2714 bytes) | 47.12237 messages (395.96457 bytes) | 10.085797 messages (524.41187 bytes) |

    INV-based fanout

    The second optimization worth mentioning has to do with how we pick nodes for fanout and reconciliation. So far, nodes have been picked as a static split: we select some for fanout and leave the rest to reconciliation, making the fanout group big enough that a large chunk of the network can be reached fast, but not so big that it starts being redundant. The specific numbers here come from simulating different partitions and seeing what performs best in random networks.

    However, we can do better than this. In an ideal world, if we knew how long the transaction has been propagating and what fraction of the network it has reached, we could use that to turn our fanout/reconciliation up or down. Unfortunately, we don’t have that information, but we know one thing: fanout is faster than reconciliation, therefore if we receive a transaction via fanout, we are likely to be amongst the first peers to receive it, whereas if we receive it via reconciliation, it is likely that we are further down the line.

    We can use this to tweak our fanout rates:

    • When receiving a transaction via fanout, keep fanning out until we reach a certain threshold, then reconcile with the rest of our peers
    • If a transaction is received via reconciliation, treat it as being late and reconcile with the rest of our peers too

    This approach applies to outbound connections only, given that anything learnt from an inbound connection is easy to fake, so it’s better not to trust it. Applying this optimization on top of the previous one, and simulating the same network, we obtain:
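    The heuristic can be condensed into a small decision function (names and the threshold value are hypothetical, for illustration only):

```python
# Hypothetical sketch of the INV-based fanout heuristic: a tx first
# learned via fanout keeps being fanned out until a threshold of fanout
# announcements is reached; a tx first learned via reconciliation is
# treated as "late" and only reconciled. Threshold is illustrative.
FANOUT_THRESHOLD = 2

def relay_method(learned_via_fanout, fanouts_so_far):
    # Learned via INV (from an outbound peer): likely early in propagation,
    # so keep fanning out until the threshold is hit. Learned via
    # reconciliation: likely late, so only reconcile.
    if learned_via_fanout and fanouts_so_far < FANOUT_THRESHOLD:
        return "fanout"
    return "reconcile"

print(relay_method(True, 0), relay_method(True, 2), relay_method(False, 0))
```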

    | on-trickle + inv-based-fanout | Full recon | Erlay 1 (out), 10% (ins) | Full fanout |
    | --- | --- | --- | --- |
    | T-90% | 21.70218s | 14.427859s | 8.654676s |
    | T-100% | 28.977697s | 21.619083s | 14.878592s |
    | data-volume (reachable) | 695.2272 messages (4085.9502 bytes) | 493.16547 messages (4593.411 bytes) | 119.527855 messages (6121.74 bytes) |
    | data-volume (unreachable) | 58.117065 messages (349.74905 bytes) | 41.418915 messages (401.29572 bytes) | 10.085797 messages (524.41187 bytes) |

    Conclusions and questions

    With the two presented optimizations, we are able to achieve savings of ~25% in bandwidth at the cost of ~67% more latency. Some of these combinations have also been tested on smaller networks with actual Bitcoin nodes implementing Erlay (using Warnet), and the results obtained have been similar. However, that kind of testing is far harder (and more expensive) to run.
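    For reference, these headline figures can be re-derived from the last table (reachable data-volume bytes and T-90% latency, Erlay column vs full-fanout column):

```python
# Re-deriving the headline numbers from the "on-trickle + inv-based-fanout"
# table: reachable-node data volume in bytes, and T-90% latency.
erlay_bytes, fanout_bytes = 4593.411, 6121.74
erlay_t90, fanout_t90 = 14.427859, 8.654676

bandwidth_saving = 1 - erlay_bytes / fanout_bytes   # ~0.25
latency_increase = erlay_t90 / fanout_t90 - 1       # ~0.67
print(f"{bandwidth_saving:.1%} less bandwidth, {latency_increase:.1%} more latency")
```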

    I personally think the design space for improvements and optimizations may still be substantial. However, I also think it’d be good to define what is acceptable in terms of bandwidth and latency tradeoffs, and what things are open to change:

    • Are the expected values for the Poisson processes set in stone? Is it OK to reduce them? How can we reason about the implications?
    • What are acceptable transaction propagation times? Are we OK with increasing them at all? If so, by how much?
    • What would be acceptable bandwidth savings given the additional complexity brought by this change?

    I’ve tried to give as much context and information as possible while keeping this digestible. I’m happy to expand on any of the directions others are curious about, explain things that may have fallen off, and provide additional data.

  5. sr-gi commented at 1:11 pm on February 11, 2026: member

    Couple questions on my mind:

    • What does the bandwidth/latency trade-off look like as we scale the number of connections?

    I’m going to reserve this spot to expand on the tradeoffs when the number of connections is increased, since I did not touch on that in the original post, but it’s one of the main goals of the project (the write-up was already pretty lengthy). I may need to re-run some of the simulations for it.

  6. stickies-v commented at 1:55 pm on February 11, 2026: contributor

    Erlay’s main goal is to reduce bandwidth utilization when propagating transaction data by minimizing the amount of redundant inventory messages (INV) exchanged between peers.

    I’m not excited about this amount of research and code complexity for having bandwidth reduction as a goal. Bandwidth reduction by itself is nice, and obviously it is something we should improve where possible, but I don’t see the current bandwidth consumption as problematic for users who wish to run without blocksonly? Is it possible that, with this project going on for so long and with a couple of reboots, we kinda lost track of why we were doing this in the first place?

    It’s been a while since I initially read the Erlay paper, but IIRC the reason I found this interesting at the time was that Erlay allows having more connections with acceptable bandwidth trade-offs, improving eclipse attack resistance. Even if it seems we currently already have reasonable eclipse attack resistance, this security benefit seems to me like a much worthier goal to investigate and - if Erlay turns out to be a worthwhile improvement - pursue. I’ve not been following this project closely, so apologies if this has been covered already; I’m just not seeing it mentioned in the current write-up.

  7. dergoegge commented at 2:23 pm on February 11, 2026: member
    To add to Stephan’s comment: simpler proposals seem to achieve (for the most part, at least) the same goal of increased network security: #28463, #28462. It’s not clear to me that the additional benefits that Erlay offers over these (increased full-outbound relay connections instead of block-relay-only, plus bandwidth reduction) are worth the complexity (minisketch, reconciliation module, extra protocol messages).
  8. sr-gi commented at 3:19 pm on February 11, 2026: member

    Thanks for your comments @stickies-v @dergoegge .

    I think there are two conflicting goals being discussed here. If what we want to achieve is simply to increase network connectivity using any connection type, then sure, blocksonly connections are always going to be a cheaper and simpler solution. This can make eclipse attacks harder with far less code complexity.

    If, on the other hand, we want to be able to scale the network without having to scale its backbone of reachable nodes (or at least not linearly), then allowing nodes to accept more tx-relaying connections with little to no impact on their bandwidth is important, and that is what Erlay is trying to achieve.

    I think it’s important to agree on what the goal is, because that will affect the complexity of the solution. @stickies-v I’ll expand on the network scaling bit on my second comment. I’ve parked it there for that reason.

  9. mzumsande commented at 3:53 pm on February 11, 2026: contributor

    As mentioned in IRC, one reason to increase connectivity with Erlay is to increase resistance specifically against transaction censorship (as opposed to only resistance against eclipse attacks). With the recent attempts to censor valid fee-paying transactions deemed undesirable for ideological reasons by not forwarding them to peers, this has become more relevant than it was in the past.

    However, even with the current connectivity, the threshold is quite low (a minority of ~10% of random reachable nodes should suffice to get transactions to miners) - plus there is always the possibility of preferential peering so I’m not sure how important it is to lower that threshold even more.

  10. sipa commented at 4:09 pm on February 11, 2026: member

    Let me take a step back, because I think the results don’t paint the full picture.

    The primary goal is increasing partition-resistance of the network. In what follows I’m going to assume 16 connections per node is the desired number, but this is just for presentation purposes. I’ll get back to the number later.

    We could add 8 extra full outbound connections. But that also means simultaneously increasing the number of (default) permitted inbound connections for listening peers (say from 125 to 250), to not make the network run out of connection slots or make it easy for an attacker with tons of inbound slots to grab a large share of them. Unfortunately, with the current transaction relay design, every transaction INV announcement (36 bytes) goes over every connection at least once (and sometimes twice), so it scales with $\mathcal{O}(\mathrm{txrate} \times \mathrm{peers})$. The more tx-carrying connections a node has (including inbound), the larger the percentage of its bandwidth is taken by INVs. And this is in a way accomplishing too much, because in addition to better partition resistance, it’ll also reduce the tail latency for transactions to reach nodes in the network - something that isn’t actually desired.
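    To put rough numbers on that scaling (the 36-byte announcement size comes from the text above; the 5 tx/s network-wide transaction rate is an assumption for illustration only):

```python
# Back-of-envelope INV bandwidth as a function of peer count, following
# the O(txrate x peers) scaling above. Assumes every tx is announced once
# per connection; 5 tx/s is an illustrative rate, not a measurement.
INV_ENTRY_BYTES = 36
TX_PER_SEC = 5.0
SECONDS_PER_DAY = 86_400

def inv_bytes_per_day(peers):
    return TX_PER_SEC * INV_ENTRY_BYTES * peers * SECONDS_PER_DAY

for peers in (8, 16, 125, 250):
    print(f"{peers:>3} peers: {inv_bytes_per_day(peers) / 1e6:,.0f} MB/day")
```

    The per-peer cost is fixed, so doubling the connection count doubles the INV bandwidth, which is exactly why extra full connections are expensive.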

    It’s possible to instead just add 8 extra block-only connections. These obviously do not cause a bandwidth increase due to INVs, and additionally impose less memory and CPU usage on the node. However, they also only improve block-relay partition resistance. This means that an eclipse attack on just the tx-carrying connections might still be successful, which would result in bad transaction propagation, potentially bad block reconstruction/propagation if severe, and in extreme cases, transaction censorship.

    What if there was something in between the two, that has a lower bandwidth impact than full connections, improves tx-partition-resistance unlike block-only connections, but didn’t care about improving tail latency? As a thought experiment, add 8 extra reconciliation-connections. These are not full Erlay as proposed; they’re an entirely separate connection class, and exclusively use reconciliation for transaction relay on a slower time schedule than normal transaction propagation (+ normal block/addr relay). These connections should have very low bandwidth usage in non-adversarial settings because the large majority of transaction relay is done by the normal full connections, and reconciliation bandwidth scales with the amount of transactions being reconciled. They don’t worsen propagation latency at all, because these 8 connections are just added on top of the existing 8 full ones. However, in adversarial settings, rather than transactions not propagating at all, they cause reconciliation to take over when normal relay fails. This will happen at the cost of extra latency as it’s a second-stage fallback mechanism, but it’s better than not propagating at all.

    So where we are at this point is that reconciliation-based connections are an alternative to (some of the) extra block-only connections that also improves tx-partition-resistance, but without the $\mathcal{O}(\mathrm{peers})$ factor in bandwidth that full connections have.

    The full Erlay idea is to not have these reconciliation connections be separate from the normal full outbound connections, but have both mechanisms (flooding and reconciliation) inside all the normal connections, and intelligently pick between them: use flooding when you guess a peer is unlikely to have already heard of the transaction, and use reconciliation otherwise. This is where I think the majority of the conceptual complexity of Erlay comes in, as it involves intricate relay policy decisions to make these guesses well, and it’s also where most of @sr-gi’s efforts have been in the past year, because it just feels that with good decisions, you should be able to do better than two entirely separate connection classes.

    Back to the number-of-connections question. There is no reason to assume that today, 8 connections is the right number for both tx propagation delay and tx-partition resistance. It’s a limitation of the existing technology that forces both numbers to be the same. If we didn’t care about partition resistance, would we consider reducing the number of outbound connections to reduce bandwidth (especially for listening nodes with ~100 connections)? It’s not unreasonable to me that we might conclude that 4 connections is actually sufficient for transaction propagation purposes; this may be especially true with BIP-153 block template sharing. And if so, we could consider today replacing 4 of the normal outbound connections with Erlay (or similar) connections. It’d worsen propagation delay, but perhaps in an acceptable way, and not worsen partition resistance. Of course, I’m not suggesting actually doing that, but as a demonstration that what reconciliation offers is untangling tx-propagation and tx-partition-resistance from connection-count decisions.

    However, with the effort to integrate flooding and reconciliation into one, and thus considering the possibility of replacing (all) normal connections with mixed flood-reconciliation connections, we maybe lost sight of the fact that the goal is increasing partition resistance (without unnecessary bandwidth and propagation-delay improvements), rather than bandwidth reduction per se.

    Maybe it’s worth reevaluating the need for combining the two. If we already had (say) 8 extra block-only connections, and it was proposed to replace some of those with reconciliation-based ones, I think it would look like a fairly interesting picture: a slight bandwidth increase, extra memory/CPU for the nodes that choose to participate, but we upgrade the partition-resistance offered by them from block-only to also cover transactions. And maybe the complexity of trying to integrate flooding and reconciliation isn’t worth it here, as it’s an improvement even without it.

  11. sr-gi commented at 10:27 am on February 12, 2026: member

    Is it possible that, with this project going on for so long and with a couple of reboots, we kinda lost track of why we were doing this in the first place?

    However, with the effort to integrate flooding and reconciliation into one, and thus considering the possibility of replacing (all) normal connections with mixed flood-reconciliation connections, we maybe lost sight of the fact that the goal is increasing partition resistance (without unnecessary bandwidth and propagation-delay improvements), rather than bandwidth reduction per se.

    My understanding of why the project didn’t move along in the previous iteration was that the latency/bandwidth tradeoff was not worth the change. I may have gotten that wrong, and missed the forest for the trees when picking this back up, putting too much effort into optimizing the bandwidth/latency tradeoff instead of focusing on partition resistance.

    Maybe it’s worth reevaluating the need for combining the two. If we already had (say) 8 extra block-only connections, and it was proposed to replace some of those with reconciliation-based ones, I think it would look like a fairly interesting picture: a slight bandwidth increase, extra memory/CPU for the nodes that choose to participate, but we upgrade the partition-resistance offered by them from block-only to also cover transactions. And maybe the complexity of trying to integrate flooding and reconciliation isn’t worth it here, as it’s an improvement even without it.

    I agree that this would be an interesting thing to try out, and it would significantly simplify the design of the approach. This would still require the added complexity of minisketch, the set reconciliation module, and the extra network messages, but the design would be much cleaner. For those concerned about the additional complexity, would this be an acceptable tradeoff (@marcofleon, @dergoegge, @stickies-v )?

  12. dergoegge commented at 3:48 pm on February 12, 2026: member

    I’d like to understand the current state of partition resistance and what concrete improvement Erlay would bring (and what metric best captures this). Since the papers cited by Erlay (regarding partitioning attacks) have been published, we’ve deployed several mitigations (e.g. anchors, asmap, v2 transport, etc.), and the theoretical attacks in the literature often rely on simplifying assumptions that seem to understate the engineering/operational cost to an attacker. Given all that, how much of the gap has already been closed, and how much does Erlay still improve the situation?

    If we’d just be doing this to improve partition resistance (because increased connectivity is obviously better in that regard) while it’s not clear that improvements are needed, then I’m not sure if I’d make this a priority myself. Given a project of this size, we’d need some momentum and at least a couple dedicated reviewers, otherwise we’ll still be discussing this in 5 years.

  13. marcofleon commented at 6:55 pm on February 12, 2026: contributor

    Still digesting and reading some more, but clarifying Erlay’s ultimate goal and walking through the different approaches has already been helpful to me. Thanks sipa.

    If, on the other hand, we want to be able to scale the network without having to scale its backbone of reachable nodes

    Interesting, I hadn’t thought of this before. This would entail enabling more inbound slots without increasing outbound connections as much. So a listening node would be able to support more than the current ~11 fully-connected nodes and not take a meaningful bandwidth hit.

    For those concerned about the additional complexity, would this be an acceptable tradeoff

    If the simpler approach, which iiuc could look something like 8 full, 4 reconciliation, and 4 blocks-only, improves the tx relay network’s partition resistance without significant drawbacks, then I would prefer that over Erlay. I’m interested in seeing what that implementation would look like in comparison.

    I’d still like to see the updated Erlay bandwidth and latency data with increased connection counts (no rush, just clarifying). However, unless there’s a compelling reason to take on the extra complexity, I’d rather go with a simpler design that achieves a similar outcome.

  14. stickies-v commented at 4:33 am on February 18, 2026: contributor

    Note: I’m not super involved with p2p, so my comments in this thread are a best-effort to contribute to the conversation moving forward, but shouldn’t be taken for much more than that.

    For those concerned about the additional complexity, would this be an acceptable tradeoff

    I think it makes sense to reframe the tradeoff like this. Improving partition resistance (as per sipa’s writing) and sustainably scaling the network (as per your writing) are both interesting goals to pursue. Eventual project decisions can then be framed as how much we improve on those two domains, for a certain cost (such as code complexity, bandwidth, latency, …). Having a measure of the current state of partition resistance and network scaling would be helpful too, if at all possible.

    I generally prefer incremental, meaningful improvements. If separate reconciliation connections address the most important pain points at minimal complexity, that absolutely seems like a good direction to focus on.

    This would still require the added complexity of minisketch

    Since minisketch is a separate library, I’m much less concerned about its complexity than I am about complexity in p2p code.

  15. sr-gi commented at 2:59 pm on February 18, 2026: member

    I’d like to understand the current state of partition resistance and what concrete improvement Erlay would bring (and what metric best captures this).

    Having a measure of the current state of partition resistance and network scaling currently would be helpful too, if at all possible.

    I’m happy to be corrected, but I do not think there is any metric that realistically captures the current state of network partition resistance, or that can quantify what would be good enough. Theoretically, we could compute how many /16 groups (or ASes for ASMap) would need to be controlled to have a certain likelihood of a node being eclipsed, but whether an attacker capable of doing this realistically exists feels more speculative to me.

    If we’d just be doing this to improve partition resistance (because increased connectivity is obviously better in that regard) while it’s not clear that improvements are needed, then I’m not sure if I’d make this a priority myself.

    That is fair, but it applies to any network proposal trying to improve partition resistance. I personally think that improving partition resistance is a good goal on its own, especially if it doesn’t come with significant tradeoffs with bandwidth and latency.

    Given a project of this size, we’d need some momentum and at least a couple dedicated reviewers, otherwise we’ll still be discussing this in 5 years.

    I agree with you there. I have no interest in pursuing this if there is no sufficient reviewer buy-in.

  16. ajtowns commented at 6:50 am on February 20, 2026: contributor

    FWIW, I like the idea of reconciliation-only connections – in thinking about inv2send stuff which is now PRed as #34628 I’ve been worried about the distinction we make for the inbound vs outbound side of connections (ie, the side that made the connection will send txs at potentially 2.5x the rate the other side will), which will mess up reconciliation while that zone is being hit. Having a dedicated class of connection (or even just a mode switch like the high-bw vs low-bw compact block stuff) would avoid that – once you decide to make it a reconciliation connection, just have both sides use the lower rate.
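    The 2.5x figure matches the average Poisson announcement intervals Bitcoin Core uses in net_processing (2s for outbound peers, 5s for inbound). A quick sketch of the asymmetry, modeling the inter-INV delays as exponential draws (a simplification of the actual scheduler):

    ```python
    import random

    random.seed(1)

    OUTBOUND_MEAN = 2.0  # seconds, average delay between INV flushes to outbound peers
    INBOUND_MEAN = 5.0   # seconds, average delay to inbound peers

    def announcements_per_hour(mean_delay, hours=200):
        """Count INV flushes over `hours`, drawing exponential inter-flush delays."""
        t, count, horizon = 0.0, 0, hours * 3600
        while t < horizon:
            t += random.expovariate(1.0 / mean_delay)
            count += 1
        return count / hours

    out_rate = announcements_per_hour(OUTBOUND_MEAN)  # ~1800 flushes/hour
    in_rate = announcements_per_hour(INBOUND_MEAN)    # ~720 flushes/hour
    print(f"ratio ~{out_rate / in_rate:.1f}x")        # ~2.5x
    ```

    Reconciliation sets on the two sides of a link would fill at these mismatched rates, which is the asymmetry a dedicated reconciliation-connection class (or a mode switch) would remove.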

    FWIW, I think bumping the number of blocks-only connections should be a high priority (and higher pri than erlay), but it keeps slipping off my radar.

    FWIW, if you look at erlay as an anti-censorship addition to the network (ie, we’re just using it to add connectivity, not replacing the existing INV connectivity with it), then I don’t think latency matters very much. If you’re getting your tx relayed in five minutes, when without erlay it would not have been relayed at all due to censorship, that’s a win, even if uncensored txs take a fraction of that time.

    As far as the stats go, I don’t think 6-8 significant figures of precision is very realistic…

    Are the expected values for the poisson processes set in stone? Is it OK to reduce them? How can we reason about the implications?

    The random delays between INVs is to give a little privacy – if you wait 8s between INVs and txs are coming in at 7tx/s, then you have an anonymity set of 56 txs where your peers can’t discern the order in which they arrived. Because it’s an exponential/poisson process, the 5s average will mean that 10% of the time it will be less than 0.52s between INVs, so the anonymity set may be 3.7 transactions or less. I think “90% of the time, the anonymity set will be at least ~4 transactions” is the right sort of implication to think about here.

    One thing I’ve been thinking about with template sharing is if the ephemeral connections we make can be used for tx relay – ie, we do an extra block relay connection every few minutes already, if we share a template on that, that gives us much broader top-of-mempool tx relay capacity than a set of static connections, at least if there’s a high-feerate tx that’s getting censored by a majority of the network for an extended period. Maybe doing a one-off “txs from the last few minutes” reconciliation immediately on making a new “blocks-only” connection to a compatible peer would also be feasible and useful. (Or maybe not 🤷‍♂️)

    Anyway, just some thoughts.

  17. gmaxwell commented at 4:56 pm on February 20, 2026: contributor

    the threshold is quite low

    I disagree.

    It’s quite inexpensive to spin up more “nodes”. There are already existing known attackers running multiple /24s with every IP pretending to be a distinct node. The already mentioned censorship attackers have written attack software to gum up the connections of non-censoring nodes and fake their identity, which at best leaves preferential peering an expensive cat and mouse game.

    When it only costs someone a nanosecond to test a password, the space of possible passwords must be enormous to have meaningful security. Similar applies here.

    Beyond just spinning up ’nodes’ attackers can approach the 10% threshold by trying to knock honest nodes offline through DOS attacks, false abuse reporting, etc. This is a risk to all operators that they’ll be disrupted by an attacker with a realistic prospect of making progress by attacking others.

    So I think the 10% sort of threshold only holds against attackers that are hardly competent or hardly trying, maybe ones that care if their efforts go unnoticed or not– not the sort of thing that should be relied on.

    It’s also a target on average, rather than worst case. If the consequence of being censored is a mere delay then knowing the attacker is unlikely to be successful on average if at least 10% of the ’nodes’ you select from are honest is helpful. But if the consequence is that your funds will be lost, for example, then you likely care about how many nodes must be honest for the attacker success to be less than, say, one in a million. And by that metric the existing behavior performs poorly.

    The number of effective connections is an important exponent in many kinds of attack resistance, in that it makes most other defenses more effective. E.g. say you want to use behavioral analysis to peer preferentially, having more connections makes that work better and work safer (by applying the preference only to a subset of connections you’re at least no worse off than if you didn’t make that subset of connections at all).
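    The “important exponent” point can be made concrete with a toy eclipse model (hypothetical numbers, and assuming peers are drawn independently and uniformly, which real addrman selection is not): if an attacker controls a fraction f of selectable nodes, the chance that all n peers are hostile is f^n, so the required n grows only logarithmically in the target risk.

    ```python
    import math

    def min_connections(attacker_fraction, target_risk):
        """Smallest n with attacker_fraction**n < target_risk, i.e. the number
        of independently drawn peers needed so the chance that *all* of them
        are attacker-controlled drops below the target."""
        return math.ceil(math.log(target_risk) / math.log(attacker_fraction))

    # With half of the selectable 'nodes' hostile, 20 peers already push the
    # all-hostile probability below one in a million (0.5**20 ~= 9.5e-7)...
    print(min_connections(0.50, 1e-6))
    # ...but if the attacker reaches 90% of selectable 'nodes', ~132 are needed.
    print(min_connections(0.90, 1e-6))
    ```

    This is the sense in which extra (cheap, reconciliation-based) connections multiply the value of every other defense: each added honest-selection chance shrinks worst-case eclipse odds geometrically.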

    Similarly, people have suggested things like POW or ZK-proof-of-utxo-ownership or trading captcha solving for blinded access tokens to create attack resistant connection slots… but all those schemes have their own weaknesses and if they were a primary mechanism could make some attackers more powerful because they’re better at meeting them than ordinary honest participants. If, instead, these mechanisms only run in addition to connections you’d otherwise have then that risk doesn’t matter. But that only works if you can expand the connection pool for these new attack resistant connections.

    And the only reason to not have every node constantly connect to every other node is resource consumption.

    With respect to latency: the latency of transaction relay over a connection that doesn’t exist is infinite. I think that’s the right metric to judge this. One could hold connection count constant and try to improve bandwidth, sure, but the only reason to do that is if you care more about bandwidth than latency. I suspect for most people who do, blocksonly would probably be a better tool.

    In terms of latency requirements: For the operation of the network there is no particular reason to get a transaction widely relayed before it would be rationally mined, and the network frequently has a backlog. Even for the secondary purpose of transaction notification, delays of many seconds should be fine– consider that even knowing there has been a delay at that time scale means that you have another communication channel to know that the transaction exists, which could just as well be used to give you the transaction. :) For transactions that are being censored even fairly delayed propagation is probably fine (it just has to be fast enough to not make the delay into an exploitable attack against exchange protocols).

    It might be somewhat unsatisfying to have tools that are only valuable against attacks and then have the attacks not happen specifically because the tool moots the attack. But that’s the nature of defense: you often only need it if you don’t have it.

  18. sr-gi commented at 2:49 pm on February 25, 2026: member

    Thanks for chiming in @ajtowns @gmaxwell

    FWIW, if you look at erlay as an anti-censorship addition to the network (ie, we’re just using it to add connectivity, not replacing the existing INV connectivity with it), then I don’t think latency matters very much. If you’re getting your tx relayed in five minutes, when without erlay it would not have been relayed at all due to censorship, that’s a win, even if uncensored txs take a fraction of that time.

    In terms of latency requirements: For the operation of the network there is no particular reason to get a transaction widely relayed before it would be rationally mined, and the network frequently has a backlog. Even for the secondary purpose of transaction notification, delays of many seconds should be fine– consider that even knowing there has been a delay at that time scale means that you have another communication channel to know that the transaction exists, which could just as well be used to give you the transaction. :) For transactions that are being censored even fairly delayed propagation is probably fine (it just has to be fast enough to not make the delay into an exploitable attack against exchange protocols).

    These sound like good reasons to move forward with a recon-only version of Erlay to me.

    The random delays between INVs is to give a little privacy – if you wait 8s between INVs and txs are coming in at 7tx/s, then you have an anonymity set of 56 txs where your peers can’t discern the order in which they arrived. Because it’s an exponential/poisson process, the 5s average will mean that 10% of the time it will be less than 0.52s between INVs, so the anonymity set may be 3.7 transactions or less. I think “90% of the time, the anonymity set will be at least ~4 transactions” is the right sort of implication to think about here.

    Oh, this is really interesting. I never thought about this in terms of the anonymity set that it provides based on the expected transaction throughput. So if I got this right, reducing the expected values to {2s, 1s} would yield an “anonymity set of at least ~4 transactions” only ~78% and ~61% of the time, respectively, which is much less desirable.
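    Those percentages can be sanity-checked under the same assumptions as the quoted example (exponential delays, 7 tx/s; the exact figures depend a little on rounding, but the computed values land close to the quoted ~78% and ~61%):

    ```python
    import math

    TX_RATE = 7.0            # tx/s, as in the quoted example
    T_ANON4 = 3.7 / TX_RATE  # interval below which the anonymity set drops under ~4 txs

    probs = {}
    for mean in (2.0, 1.0):
        # P(delay >= T) for an exponential distribution with the given mean
        p = math.exp(-T_ANON4 / mean)
        probs[mean] = p
        print(f"mean {mean}s: anonymity set of ~4+ txs about {p:.0%} of the time")
    ```

    So halving the delays trades privacy for latency in a directly quantifiable way, which seems like the right frame for deciding whether the current expected values can be reduced.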

