[p2p] Halt processing of unrequested transactions

ariard commented at 2:15 pm on February 18, 2021: member

Split-off from #20277

This PR halts the processing of unrequested transactions in Bitcoin at TX message reception. An unrequested transaction is one defined by which a “getdata” message for its specific identifier (either txid or wtxid) has not been previously issued by the node.

This change is motivated by reducing the CPU DoS surface of Bitcoin Core around mempool acceptance. Currently, an attacker can open multiple inbound connections to a node and send expensive to validate, junk transactions. Once the canonical INV/GETDATA sequence is enforced on the network, a further protection would be to deprioritize bandwidth and validation resources allocation, or even to wither connections with such DoSy peers. A permissioned peer (PF_RELAY) will still be able to bypass such restrictions.

Few high-level questions to go through :

Is this change ease any privacy attack, such as transaction origin inference or tx-relay connection graph leaks ?
Is this change ease observation of internal tx-requester configuration, such as preferred peers ?
As one identified downside of this change is the potential breaking of tx-relay capabilities of non-compliant Bitcoin clients, should we offer them a wider adaptation period ?

See mailing proposal and discussions: https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2021-February/018391.html

Note, our tx-relay functional test framework wasn’t at all compliant with the canonical inv-getdata sequence. This change modifies some tx-relay tests by introducing INV-GETDATA sequence enforcement (wait_for_getdata). It comes with a meaningful delay of running the framework (e.g +4min to run p2p_segwit.py)

Better could be either to label such tested nodes with PF_RELAY and such bypass the new restriction but I fear it might mask some behaviors verified and silently break tests (e.g PF_RELAY impacts the max announcements per-peer buffer). Or either to switch the whitelisting of the new restriction under the more-scoped PF_FORCERELAY permission or even some new one (e.g PF_FORCETXPROCESS) ?

[test]: Make tx-relay test framework with the inv-getdata sequence

Currently, our testing framework doesn't comply with the canonical
inv-getdata announcement sequence (`p2p_compactblocks.py,
`p2p_eviction.py`, `p2p_segwit.py`, `p2p_invalid_tx.py`,
`p2p_permissions.py`). We either add the required sequence or
whitelist testing peers when it doesn't have effect on test behaviors.

37fe0c933e

[p2p]: Stop unrequested transactions processing

To mitigate potential DoS risks, stop unrequested transaction
processing early, unless the peer has PF_RELAY permission.
Unrequested transaction is defined as one for which we have not
issued getdatas on this connection link.

Note, parent-orphan fetching happens through MSG_TX, thus we
should expect txid requests even on wtxid-relay peers.

f6eeda99b4

[test]: Add new coverage for processing-halt of unrequested txn 0b9b98d29b

Add a release note for unrequested transactions processing stop 63103daf80

fanquake added the label P2P on Feb 18, 2021

ajtowns commented at 2:48 pm on February 18, 2021: member

This PR halts the processing of unrequested transactions in Bitcoin at TX message reception. An unrequested transaction is one defined by which a “getdata” message for its specific identifier (either txid or wtxid) has not been previously issued by the node.

I don’t see the point of this? An attacker with valid utxos can generate invalid txs that are costly to reject, and simply announce their txids and wait for a request. That’s limited to 5000 txs in a batch (MAX_PEER_TX_ANNOUNCEMENTS), and I think each batch would have at least a 4s window (NONPREF_PEER_TX_DELAY + OVERLOADED_PEER_TX_DELAY). But that seems like it should be enough to keep the CPU busy? Even then, we’ll make progress through every other node in between dealing with each tx from the attacker, so this shouldn’t cause a denial of service as far as I can see? Presuming the tx’s are consensus invalid, the attacker will get given a Misbehaving score for each invalid tx and eventually kicked.

It comes with a meaningful delay of running the framework (e.g +4min to run p2p_segwit.py)

Err nack? :) Maybe better to add PF_RELAY to those tests than slow them down?

DrahtBot commented at 2:57 pm on February 18, 2021: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#21160 (Net/Net processing: Move tx inventory into net_processing by jnewbery)
#21148 (Split orphan handling from net_processing into txorphanage by ajtowns)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

ariard commented at 6:18 pm on February 18, 2021: member

@ajtowns

I don’t see the point of this? An attacker with valid utxos can generate invalid txs that are costly to reject, and simply announce their txids and wait for a request. That’s limited to 5000 txs in a batc>h ( MAX_PEER_TX_ANNOUNCEMENTS), and I think each batch would have at least a 4s window (NONPREF_PEER_TX_DELAY + OVERLOADED_PEER_TX_DELAY).

Are you saying this change isn’t worhty because it doesn’t prevent further tx-relay DoS ? As of today, an attacker doesn’t have to “wait” for a request and can just spam us we costly-to-evaluate transactions. Rate-limiting DoSy peers beyond MAX_TX_PEER_ANNOUNCEMENTS is already a strict improvement, you force a time burden on them.

But that seems like it should be enough to keep the CPU busy?

Once you’re dropping the unrequested transactions, you might start to squeeze a peer announcement buffer if its top-of-transaction-to-evaluate buffer is too slow/costly to evaluate. But that kind of further mitigations is useless if you can bypass it with unrequested transactions.

Even then, we’ll make progress through every other node in between dealing with each tx from the attacker, so this shouldn’t cause a denial of service as far as I can see?

If you have % of inbound peers being malicious and DoSing you, this starts to be a concern. What’s the % is a more serious question.

Presuming the tx’s are consensus invalid, the attacker will get given a Misbehaving score for each invalid tx > and eventually kicked.

If you’re a motivated attacker, this is super easy to avoid the kick-out. Just craft malicious non-standard junk (e.g make the sig of the last input high-s) and you will avoid MaybePunishNodeForTx ban. Generally, we should assume an attacker to be always able to circumvent bans triggered from invalid transactions, just target discrepancies in tx-relay policies.

Note, this mitigation proposal was a suggestion from @sdaftuar, following some CPU grinding of its own.

The last time I looked at this, it seemed like spending on the order of 1 second (on my modern workstation) to validate a transaction that would ultimately fail mempool acceptance was easily achievable; multiply that by the number of inbound connections we allow and you get a pretty bad state of affairs…

(in #20277)

That said, I concede it would be better to have a common attacker model (number of malicious inbound, worst-case transactions, CPU resources on a middle-grade host, I/O layout of UTXOs fetched, …) before to work or propose eventual tx-relay DoS mitigations. If so, I would be happy to do such investigations.

Err nack? :) Maybe better to add PF_RELAY to those tests than slow them down?

Yeah I feel the same. I fixed the tests just-in-case someone was willingly to analyze tests impacted, at least I’ve a branch…

sdaftuar commented at 7:01 pm on February 18, 2021: member

@ajtowns I think not processing unrequested transactions is a first step towards being able to protect the node from transaction-based CPU DoS; it’s of course not helpful on its own. I think as a next step we could consider keeping some sort of score on peers based on how many slow-to-validate transactions they’ve relayed to us that were not accepted to the mempool, and for badly scoring peers (or new peers that have just connected to us) we could have a longer delay between inv and getdata, or tolerate fewer transactions in flight (or both).

I haven’t tried to figure out exactly what these measures would look like in code or exactly how much benefit we would get, so for sure this is speculative, but I also don’t think any of those ideas work at all if we don’t first stop processing unrequested transactions – which seems like both a harmless and intuitive way for relay to work.

ajtowns commented at 1:05 am on February 19, 2021: member

If a single tx can take 1s to validate and fail for policy reasons to avoid getting disconnected/banned, then a single peer can generate a backlog of about 1h20m (5000 seconds) of processing, beginning only 2s after getting connected using INV/GETDATA rather than just using TX, so this doesn’t seem like a big help?

I think an attacker could also queue ~50 slow orphan txs, have them all downloaded without revealing anything slow is happening, and then activate them by relaying a good parent for them – without INV/GETDATA delays only able to prevent them from doing the next round of 50 slows txs. Depending on the scoring mechanism, rotating that sort of spam amongst a few peers might be enough to keep 100% load without much hassle.

The way I’ve been looking at this is that INV/GETDATA (and eventually erlay) is just a way to avoid wasting bandwidth, and that we should be robust to having to validate any random tx that anyone decides to send to us. Avoiding wasting bandwidth only matters for honest peers not attackers, who can presumably find plenty of ways to send us garbage data – masses of PINGs eg. That seems easier to design for to me – in particular, I think it might be hard to get txrequest to handle changes in the delay between INV/GETDATA efficiently, which might mean adding DoS vectors rather than removing them.

As an alternative approach, if we think a peer is wasting too much time, we could avoid calling ProcessMessages() for it at all until they’re back at a reasonable level, relying on hitting the fPauseRecv=true flag as new messages get queued, and it not being cleared again until we start doing ProcessMessages() again. We’d need to ensure that orphan tx’s get put in the from peer’s workset instead of the workset of whichever peer relayed the parent that made the tx minable, but that seems fine. Something like:

 0std::chrono::microseconds CNode::delay_processing_until{0};
 1
 2for (pnode : vNodes) {
 3    // for caling ProcessMessages
 4    start = GetTimeMicros();
 5    if (start > pnode->delay_processing_until) {
 6        ProcessMessages(pnode);
 7        fin = GetTimeMicros();
 8        pnode->delay_processing_until = std::clamp(pnode->delay_processing_until + 20*(fin-start), fin - 10s, fin + 60s);
 9    }
10}

(I think the factor of 20 there would means each peer can only use up 5% of your CPU)

EDIT: Changed to use std::clamp so that peers don’t get delayed too excessively. I’ve tried this out now (along with some changes to make it monitorable) on a node with no inbound connections; it takes about 5 minutes for the mempool to load at present and peers get delayed while that’s happening, but otherwise it seems to have no effect, as expected in the absence of an attack.

sdaftuar commented at 2:17 pm on February 19, 2021: member

If a single tx can take 1s to validate and fail for policy reasons to avoid getting disconnected/banned, then a single peer can generate a backlog of about 1h20m (5000 seconds) of processing, beginning only 2s after getting connected using INV/GETDATA rather than just using TX, so this doesn’t seem like a big help?

That assumes that we don’t limit the number of transactions in flight to each peer, which I think we could do as well. I think it’d be reasonable to limit to something like 5 in-flight transactions to a new peer (say) until we have some track record of accepting the transactions they send us, and then increasing it.

I think an attacker could also queue ~50 slow orphan txs, have them all downloaded without revealing anything slow is happening, and then activate them by relaying a good parent for them – without INV/GETDATA delays only able to prevent them from doing the next round of 50 slows txs. Depending on the scoring mechanism, rotating that sort of spam amongst a few peers might be enough to keep 100% load without much hassle.

I hadn’t thought of using orphans in this way – good point! I’ll have to give this more thought, but presumably a good first start would just be to ensure orphan validation times get accounted to the peers that relay them, and then maybe the number of in-flight orphans from a peer should be limited? In theory, if we made a change so that we generally only request missing ancestors from the peer that relays an orphan (via a new p2p message to request all of them at once, a la package relay), then I think that should make the accounting easier (since we’d never be commingling one peer’s outstanding transactions with another), but this probably would cause some multiplicative factor in the number of in-flight transactions to a peer.

The way I’ve been looking at this is that INV/GETDATA (and eventually erlay) is just a way to avoid wasting bandwidth, and that we should be robust to having to validate any random tx that anyone decides to send to us. Avoiding wasting bandwidth only matters for honest peers not attackers, who can presumably find plenty of ways to send us garbage data – masses of PINGs eg. That seems easier to design for to me – in particular, I think it might be hard to get txrequest to handle changes in the delay between INV/GETDATA efficiently, which might mean adding DoS vectors rather than removing them.

I have not given any thought to whether the things I’ve said would be easy or hard to implement in the new txrequest class. So you may be right that these ideas would take a lot of work… However I do think that your proposed strategy of delaying all messages from a peer has problems. In particular it means that honest peers relaying us a lot of valid transactions, and maybe also blocks, would have all their messages delayed relative to less useful peers – which could easily be expected to slow down block propagation. (Also it seems like as you coded it, there could be plenty of times when we don’t process any peers’ traffic, because they all are in a suspended state, which just seems wasteful and will increase our backlogs and latency.)

I think another problem with this strategy is that once you establish 21 inbound connections to a peer running with this logic, then they can work together to grant each other more CPU time and completely bypass the delays – the first peer uses 0.25 seconds of CPU and is stalled for (say) 5 seconds. The next 20 peers that get to run in the same loop iteration do the same, thing, so that by the time the first peer is up again, it’s time for it to run. So an attacker who gets to 21 connections is no longer limited by the logic. While increasing that 20 number would make the attack harder for an adversary to carry out, it also has a tradeoff of delaying processing of honest peer’s traffic too, so I’m not sure how to tune it in a way that would both be protective and not add undesirable latency.

–

I think the fundamental problem here is that we will have lots of cases where a peer can be getting us to use our CPU and it will not be possible for us to tell if the peer is adversarial or not. That’s basically the issue with transactions that fail policy checks but pass consensus checks: we can try to be more careful than we currently are about maybe classifying some of those policy checks (such as now-old soft forks, like DERSIG) as consensus rather than policy, but at the end of the day there will always be cases that we aren’t sure about. For instance, when we loosen a policy requirement, then older software will think that newer software is relaying invalid stuff, and it’s hard for a single node to tell that the looser policy is the one that maybe all its peers are using. (And obviously in general we should be tolerant of different policies on the network, whether due to user modifications, different software, etc.)

I guess the other problem (which I think your patch runs into) is that we want to always be using our CPU to process something if we have data to read, but in a way that guarantees that no peer or group of peers can cause us to not service the connection of a different peer.

–

Getting back to this proposal: what do you think the downside is of adopting this change, and not processing unrequested transactions? If it breaks a lot of existing software, I think that would be an argument for giving people time to fix their stuff and not merging it in a hurry. If there are old wallets that rely on this behavior (to random peers, who aren’t going to give them PF_RELAY) then that would probably be a good argument to shelve this altogether I guess, and try to find another way[*] . Do we think there is such software out there?

Otherwise, I think if we establish that transaction relay on the network should work via INV and then GETDATA, then that just gives us more freedom to design anti-DoS strategies than if we have a mindset of being required to accept unrequested transactions. My impression is that there’s no clear consensus that unrequested transactions should be permitted, so I thought that making that clearer by sending a note to the mailing list, and see if there are any responses, might solidify that understanding with the developer community. Maybe it’s enough to stop there and delay deploying changes to our code until we have something more fleshed out, but this all started because @ariard wanted to stop processing unrequested transactions during IBD – and I figure that we might as well just do this in general, if that is our understanding, if we’re going to do a change like that at all.

[*] For example, I proposed #15169 as a mitigation; at some point now that wtxidrelay has started to be deployed, I hope that we will eventually get rid of 2 of the 3 calls to CheckInputs too, since eventually net_processing won’t need to care about whether failure is due to being WITNESS_STRIPPED. I believe the combined effect of these two changes would be substantial, so maybe we might decide to not worry about this further if we did both of these things.

ajtowns commented at 4:18 pm on February 19, 2021: member

Rearranging:

Getting back to this proposal: what do you think the downside is of adopting this change, and not processing unrequested transactions? If it breaks a lot of existing software, […]

I don’t think it solves the problem – INV/GETDATA needs to be able to feed us ~7 tx/s to us just to keep up with block capacity and RBF, and if we receive 7 tx’s that take 1s to validate each second, we already can’t keep up.

I just don’t see the INV/GETDATA pairing as adding any useful information beyond de-duping across peers; if what we want to do is rate-limit TX messages we receive from a peer IMO we should literally do that – and we can do it when we get up to the TX message by pausing processing for that peer, there’s no need to complicate it by adding it into the INV handling.

If a single tx can take 1s to validate and fail for policy reasons to avoid getting disconnected/banned, then a single peer can generate a backlog of about 1h20m (5000 seconds) of processing, beginning only 2s after getting connected using INV/GETDATA rather than just using TX, so this doesn’t seem like a big help? That assumes that we don’t limit the number of transactions in flight to each peer, which I think we could do as well.

We dropped per-peer inflight limits with txrequest; see the rationale in the PR summary for #19988. The argument was similar to your point about my patch potentially meaning we just sit idle because we’re punishing all peers – we don’t want to limit inflights from any peer because that might delay us hearing about a transaction only offered by that peer.

I think it’d be reasonable to limit to something like 5 in-flight transactions to a new peer (say) until we have some track record of accepting the transactions they send us, and then increasing it.

Wouldn’t we expect an attacker to just wait until we’ve increased the limit before attacking us?

I hadn’t thought of using orphans in this way – good point! I’ll have to give this more thought, but presumably a good first start would just be to ensure orphan validation times get accounted to the peers that relay them,

Yes, I think so. I think after #21148 we could move the orphan worksets out of the per-peer structures and into txorphanage directly which I think would make this a little easier.

Hmm, actually I don’t think even a per-peer-orphan limit would help enough? If you’re attack is just increasing latency, then you could connect 20 peers to your victim, have each propose 5 orphan txs that will all take 1s to fail to validate and all depend on a single valid tx; then relay the valid tx. On the next run, you’ll take take 20s to process the first of each of those orphans; the run after that you’ll take 20s to process the next set, etc – but your honest peer will also only be getting one message processed in each of these sets of 20s too….

However I do think that your proposed strategy of delaying all messages from a peer has problems. In particular it means that honest peers relaying us a lot of valid transactions, and maybe also blocks, would have all their messages delayed relative to less useful peers – which could easily be expected to slow down block propagation. (Also it seems like as you coded it, there could be plenty of times when we don’t process any peers’ traffic, because they all are in a suspended state, which just seems wasteful and will increase our backlogs and latency.)

Peers that don’t do much work will end up with a 10s buffer (delay_until = fin-10s) so they’d need to actually use up ~~an entire~~ half a second in ProcessMessages before they’ll get delayed at all – I’m seeing something like 100 potential calls to ProcessMessages being missed per peer (so a total of 10s of additional latency – though I’m not checking whether there’s any actual work for the peer to do so it might just be latency for ProcessMessages saying “nope, nothing to do”) for peers that have been connected for about 10 hours.

It was the simplest token-bucket like thing I could think of in five minutes, so that’s already performing way better than I expected…

I think the fundamental problem here is that we will have lots of cases where a peer can be getting us to use our CPU and it will not be possible for us to tell if the peer is adversarial or not. That’s basically the issue with transactions that fail policy checks but pass consensus checks: […]

So… the next thought that comes to mind is what if ProcessMessages returned a “that message was useful!” boolean, and we only considered it a waste of time when the boolean indicated it was useless? So non-standard, too-low-fee, too-many-in-mempool-ancestors, pong-that-wasn’t-in-response-to-a-ping, etc? That is, have a per-peer token bucket of some sort for useless messages, and temporarily de-prioritise peers for useless messages, without building up a misbehaving/ban score. If a peer is sending hard-to-validate transactions that are actually valid and make it into our mempool, we’ll forward them, and therefore we can’t consider the peer to be an attacker, even if this is an attack, I think?

sdaftuar commented at 6:25 pm on February 19, 2021: member

@ajtowns (Perhaps this would be better for us to discuss on IRC or something, but timezones…! ) I’ll try to focus my reply on the most relevant point (in my mind) to this PR, though there’s certainly a lot more to talk about as well:

Getting back to this proposal: what do you think the downside is of adopting this change, and not processing unrequested transactions? If it breaks a lot of existing software, […]

I don’t think it solves the problem – INV/GETDATA needs to be able to feed us ~7 tx/s to us just to keep up with block capacity and RBF, and if we receive 7 tx’s that take 1s to validate each second, we already can’t keep up.

Putting aside whether the approach I described could be made to work, mustn’t it be the case that the universe of solutions we can come up with gets bigger if we allow our software the freedom to not process unrequested transactions, versus arguing that we must? I think if we have a solution that doesn’t require this behavior, then of course that’s great; I’m not sure that we do, and this seems like a useful step forward for increasing the search space of possible solutions.

I just want to establish what the arguments are against doing this:

it definitely can’t help us [while I’m skeptical that we know this now, since I don’t have a concrete solution in mind, I agree this is possibly a valid reason!]
it might work, but since we don’t know if it would, we should hold off for now [this seems like a perfectly reasonable argument against to me, though I don’t share this point of view]
there are potentially material downsides to the network from us deploying this, whether or not this benefits us [I have no idea whether this is true, but I sort of assume it isn’t and/or shouldn’t be true, though if it were, it would be helpful to know so that we abandon this line of thinking]
something else I’m missing?

I think if there’s significant uncertainty around (3) then that would make either line of thinking on (1) or (2) more potent, in terms of weighing the risks versus the benefits. But if no one thinks (3) is a material concern, it just seems to me like giving us more degrees of freedom on this is better than not.

I also think that since this doesn’t solve anything by itself, that it’s fine to wait until we have a concrete proposal that builds on this, if people are concerned that it might both be unnecessary and detrimental to the network. But my own view is that this is strictly increasing the set of solutions we might be able to deploy and hence is useful preparatory work, so in the absence of objections I’d prefer to see this change: I think it’s more likely that someone will someday pick this problem up to solve if we’ve laid this groundwork, and I worry that without it other solutions will be inadequate.

That said, if you or anyone else plans to work on this problem more fully in the near future, that’s a pretty good reason to hold off on this PR until more analysis is done and we have a better understanding of what a solution might entail.

ajtowns commented at 4:22 am on February 20, 2021: member

@sdaftuar Maybe we should at least move to the mailing list at least?

I think as long as PF_RELAY bypasses the INV/GETDATA requirement, forbidding TX’s that we haven’t asked for via GETDATA is fine on the current network (provided that doesn’t make tests slower) – I might think the only value is in deduping tx relay, but that’s still valuable enough on the live network that everyone should be doing it.

I’m more thinking that once we figure out some sane/implementable rate-limiting strategy, that some more careful thought will mean we can apply it at the “received a TX” point instead, or generalise it to “received any sort of message”; which is more general but would also allow INV handling to be less complicated. I guess that’s a variant on (2) – yes, it’s likely that some approach of this nature might work, but I think once we figure one out that does we can adapt it to be better/more general in a way that won’t need this patch; and until that point this change doesn’t really help much in other ways?

(On the other hand, I thought the previous idea of ignoring TX’s during IBD made sense because we’ve already said via FEEFILTER that we don’t want TX’s yet, but I took that as more of dont-waste-time thing than DoS prevention per se)

jnewbery commented at 10:25 am on February 20, 2021: member

I don’t have anything specific to add to the excellent points made by @sdaftuar and @ajtowns. I share aj’s scepticism that this is a beneficial change. I think this is the crucial point:

I don’t think it solves the problem – INV/GETDATA needs to be able to feed us ~7 tx/s to us just to keep up with block capacity and RBF, and if we receive 7 tx’s that take 1s to validate each second, we already can’t keep up.

I just don’t see the INV/GETDATA pairing as adding any useful information beyond de-duping across peers; if what we want to do is rate-limit TX messages we receive from a peer IMO we should literally do that – and we can do it when we get up to the TX message by pausing processing for that peer, there’s no need to complicate it by adding it into the INV handling.

There are a couple of other very interesting ideas which could be split off from this PR for discussion in IRC or separate issues:

How should orphan reprocessing be handled? We currently reprocess orphan transactions in the context of the peer that provided the parent (but punishing the peer that provided the orphan transaction if it turns out to be consensus-invalid). Would it instead make more sense to reprocess the orphan transaction in the context of the peer that provided the orphan transaction or in a global context?
Should we prioritize/rate-limit traffic from peers based on prior behaviour, and if so how? Should it be specialized based on the traffic type (eg txs), or should it apply generally to all message types? My reading is that there’s general agreement that we should do something, but it needs careful consideration. The next step might be for someone to come up with a design/proof-of-concept that could be reviewed and criticized.

I’ll add these as discussion topics for next week’s p2p irc meeting.

On the other hand, I thought the previous idea of ignoring TX’s during IBD made sense because we’ve already said via FEEFILTER that we don’t want TX’s yet, but I took that as more of dont-waste-time thing than DoS prevention per se

I also agree with this. The narrower approach of ignoring txs during IBD seems like an obvious win.

ariard commented at 6:05 pm on February 20, 2021: member

@ajtowns, Thanks for all the great points. While working on this, I became skeptical of naive time-based approach for rate-limiting as much from drop-on-the-floor abundant TX messages for the following reasons :

We have a heterogeneity of hosts deployed on the network. If we rely on hardcoded time-bounds, transaction X might be classified as expensive-to-validate by node A while being filtered out by node B. Of course, we could come with some smart heuristics per-grades of hosts, but it’s more likely to clogg for nothing low-grades ones, damaging their compact blocks processing
A high throughput of expensive-to-validate txn is a lawful behavior from your peers. Think about a big LN node suddently closing 10% of its ~100 channels, with an average of 5 pending HTLCs, it’s 60 txns to propagate through the network. This peer might burst its validation bandwidth, but those good feerate txn are the best candidate for block inclusion. Shunning away from this peer, and you might again degrade your compact block processing

Rather than node-dependent checks I would rather favor message-centric checks. As an ulterior work building on this one I was thinking about some StaticTxAnalyse() runnning in ATMP and yelling a score weighted by feerate back to net_processing.cpp. Such tx static analyzer would scope number of UTXO spent, size of transaction, number of sigops or other low-cost checks. From such score you can compresse peers announcement buckets. Doesn’t work if peers can send you unrequested transaction message.

That said, even before to discuss the best set of DoS mitigations (no silver-bullet!), it would better to refine the problem. A peer’s transactions traffic might be okay under regular transaction throughput, but gauged as DoSy relatively to another known, better feerate traffic under some heavy workload. IMHO, two bounds of the problem might be drawn up while mitigating DoS. Avoid block-processing fallout of the low-grade peer subset. Favor high-feerate, valid transaction traffic.

–

Hmm, actually I don’t think even a per-peer-orphan limit would help enough? If you’re attack is just increasing latency, then you could connect 20 peers to your victim, have each propose 5 orphan txs that will all take 1s to fail to validate and all depend on a single valid tx; then relay the valid tx. On the next run, you’ll take take 20s to process the first of each of those orphans; the run after that you’ll take 20s to process the next set, etc – but your honest peer will also only be getting one message processed in each of these sets of 20s too….

I think we have a fundemental issue with orphans processing. Orphanness source isn’t strictly assignable. I.e, Mallory might provoke mempool conflicts between Alice and Bob, sending Y to Alice, Y’ to Bob. From then, Mallory can send a set of Y childrens to Alice, she will accept them and relay forward to Bob. Alice will be the one swalloing the DoS penalty. And now you have a vector of tx-relay jamming…

Honestly, I hope once we get package relay we can deprecate and remove orphan processing. Good propagation of a package should be the responsibility of the issuer (or set of them if you have non-interactive chain of transactions).

–

I’m more thinking that once we figure out some sane/implementable rate-limiting strategy, that some more careful thought will mean we can apply it at the “received a TX” point instead, or generalise it to “received any sort of message”; which is more general but would also allow INV handling to be less complicated.

I don’t know about the generalized approach. Not all messages are safety-critical (“is globally dropping unsolicited pongs harmless ? what about severing tx-relay for few hours ?”), some of them are hard to fake, but also propagation of high-feerate transactions is essential to keep the chain moving forward. IMO, like we’re doing for connection types, it might be better to have different strategies.

–

I think the fundamental problem here is that we will have lots of cases where a peer can be getting us to use our CPU and it will not be possible for us to tell if the peer is adversarial or not. That’s basically the issue with transactions that fail policy checks but pass consensus checks: we can try to be more careful than we currently are about maybe classifying some of those policy checks (such as now-old soft forks, like DERSIG) as consensus rather than policy, but at the end of the day there will always be cases that we aren’t sure about. For instance, when we loosen a policy requirement, then older software will think that newer software is relaying invalid stuff, and it’s hard for a single node to tell that the looser policy is the one that maybe all its peers are using. (And obviously in general we should be tolerant of different policies on the network, whether due to user modifications, different software, etc.)

I would rather pass away w.r.t to a smarter typing of our script failures. First, it’s really footgunish when we have to classify a failed transaction in our reject filter (cf. discussion around BIP329 implementation). And secondly, as you pointed it well it’s conflicting with policies tolerance, such you’ll always have some exploitation margin around policies disjunction. The enemy knows the system :/

(Marking the PR as a draft until further IRC/mail discussions)

ariard marked this as a draft on Feb 20, 2021

in src/txrequest.cpp:689 in 63103daf80

684+        // We need to traverse all the REQUESTED/COMPLETED announcements to verify we effectively
685+        // requested the txhash from this peer.
686+        while (it_hash != m_index.get<ByTxHash>().end() && it_hash->m_txhash == txhash) {
687+            if (it_hash->m_peer == peer) return true;
688+            ++it_hash;
689+        }

pinheadmz commented at 2:52 pm on February 24, 2021:

I’m just trying to understand the boost multi-index data structure, and want to do due diligence whenever I see a while loop: This will iterate only as many times as we have actively requested a given TX, correct? So the most number of iterations possible is the total number of peers we have? Because the TX hash is specified, so it’s not like we run through the entire data structure.

pinheadmz commented at 2:58 pm on February 24, 2021: member

I think I agree with concept NACK group: Although there may be some benefit to this, the fact that it ultimately requires all wallets and alternative implementations to upgrade is too expensive (although I checked bcoin real quick and I’m pretty sure we broadcast all wallet TXs with an initial INV)

I also agree with @jnewbery – I’ve often wondered why I see INV messages in the logs during IBD, but I assume that is because blocksonly is one of those peer settings we don’t want to change after a connection has been established (I think we had a discussion in PR review club about this type of thing when reviewing BIP157: https://bitcoincore.reviews/16442) and that seems like something that can be avoided.

laanwj referenced this in commit 63c0d0e937 on Nov 30, 2021

DrahtBot commented at 11:18 pm on December 13, 2021: member

🐙 This pull request conflicts with the target branch and needs rebase.

Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a “draft”.

DrahtBot added the label Needs rebase on Dec 13, 2021

DrahtBot commented at 1:06 pm on March 21, 2022: member

Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
Is it no longer relevant? ➡️ Please close.
Did the author lose interest or time to work on this? ➡️ Please close it and mark it ‘Up for grabs’ with the label, so that it can be picked up in the future.

MarcoFalke commented at 1:42 pm on March 21, 2022: member

Closing for now. Let us know when this should be reopened.

MarcoFalke closed this on Mar 21, 2022

DrahtBot locked this on Mar 21, 2023

[p2p] Halt processing of unrequested transactions #21224

Conflicts