The way I look at this issue is that we have two competing goals:
a) Have well-defined behavior on the p2p network, to limit the range of behavior and make interactions easier to reason about in the future. This tends towards being strict around deploying new specifications, both to simplify our software logic and so that when people write their own implementations, they might notice if they break something during testing.
b) Avoid disconnecting peers unless it’s important/we have a good reason, because otherwise, over time, the network graph will become more fragile and less partition-resistant: older software that is silently failing to stay connected to newer software will have fewer honest peers to connect to and become easier to sybil. (As an aside: to the extent that we do sometimes disconnect peers for imperfect behavior, it can be less risky to the network if we differentiate between inbound and outbound peers, and be more tolerant of inbound peers – this would allow slightly broken software to still find honest peers and not be partitioned off the network over time.)
So with that in mind, I’m reluctant to add new disconnection semantics for behavior that is harmless; in this case, I think it is supposed to be harmless to not have implemented BIP 338, as this is meant to be a purely optional specification. The question to me then becomes: since we already disconnect peers for some messages if we have sent fRelay=0
, should we add to the set of disallowed messages to such peers based on how we understand our own software to behave?
I think it’s a good point that those messages (tx-getdata, tx-notfound) should never come through to us if we’ve sent a disabletx
, because presuming that our software isn’t itself broken, we would never announce or send transactions to the peer. However, if there is some other software on the network that is slightly broken around transaction relay and occasionally sends us a tx-getdata or tx-notfound when not appropriate, is that a big deal? I’m not sure…
First, changing our behavior in this way wouldn’t prevent malicious peers from wasting our bandwidth (as they can do that without triggering disconnection, even without using these messages). So instead, it would be to catch/handle broken peers and prevent them from taking up a connection slot or wasting our bandwidth.
If there is broken software out there that we know has a bug like this which causes these fRelay=0
connections to be high bandwidth when they shouldn’t be, by sending these specific p2p messages that you’re suggestion we disconnect for, then I think that would be sufficient reason to do exactly what you suggest. However, I’m not sure that we’re aware of any such software. In the absence of knowing such broken software exists, I think two other options are reasonable: a) do nothing until we notice it, or b) write smarter logic that catches the thing that I think we actually care about for these connections, which is bandwidth waste (ie seeing high traffic of bogus messages as a percentage of total traffic), and manages our connections using that metric as a factor.
So I’d say right now I’m reluctant, but not strongly opposed, to adding more disconnection criteria like you’re suggesting in this PR. However if someone proposed adding more disconnection criteria as part of a bigger resource management strategy, I think that could be totally reasonable.