[WIP] p2p: send not_found msgs for unknown, pruned or unwilling to share blocks

furszy commented at 1:45 pm on July 3, 2024: member

The ’not_found’ message, introduced in protocol version 70001 (Bitcoin Core 0.8.0 - Feb 2013), serves as a response to a ‘getdata’ message when the receiving node does not have the requested data available for relay. Thus far, this has only been applied to transactions. This PR proposes extending its use to blocks as well.

Currently, when the remote peer does not have the requested block data, it ignores the ‘getdata’ inventory block request. This lack of response causes the requesting node to wait (and stall if the request was for an IBD block) for 10 minutes until the request timeout is triggered. As a result, the local node abruptly disconnects the peer due to perceived unresponsiveness in order to download the block from another source, even though the peer handled the message accordantly and may still be relaying other headers, blocks, and transactions properly.

This behavior is suboptimal for well-behaving nodes. Therefore, this PR proposes sending ’not_found’ messages for blocks the remote peer does not know about or no longer stores, allowing the local node to promptly request them from another peer. Doing so will help prevent network synchronization stalling scenarios, allow us to shorten idle wait times (by reducing the block request timeout for peers supporting this feature), and ensure well-behaving peers remain connected.

Pending tasks:

Write detailed commits description.
Expand PR description
1. Describe feature negotiation message. Annotate all considered options.
2. Describe ’not_found’ handler behavior for blocks closer to the tip vs historical ones.
Create BIP for the new feature negotiation message.
Think about increasing the peer’s block download priority when it signals support for the feature.
Think about disconnecting the peer when a known feature negotiation message is received prior to the version message.
Complete TODOs in the code.

DrahtBot commented at 1:45 pm on July 3, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process. A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#30393 (refactor: use existing RNG object in ProcessGetBlockData by maflcko)
#30364 (logging: Replace LogError and LogWarning with LogAlert by ryanofsky)
#30233 (refactor: move m_is_inbound out of CNodeState by sr-gi)
#28055 (Bugfix: net_processing: Restore “Already requested” error for FetchBlock by luke-jr)
#27052 (test: rpc: add last block announcement time to getpeerinfo result by LarryRuane)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot added the label P2P on Jul 3, 2024

furszy renamed this:
~~p2p: send not_found msgs for unknown, pruned or unwilling to share blocks~~
[WIP] p2p: send not_found msgs for unknown, pruned or unwilling to share blocks
on Jul 3, 2024

dergoegge commented at 1:59 pm on July 3, 2024: member

Currently, when the remote peer does not have the requested block data, it ignores the ‘getdata’ inventory block request. This lack of response causes the requesting node to wait (and stall if the request was for an IBD block) for 10 minutes until the request timeout is triggered.

Nodes advertise which blocks they have and what their current tip is, so the timeouts should only happen if the peer is misbehaving or if you disrespect the protocol by asking for things they don’t have. Disconnection seems appropriate in both cases?

What is the problem this is trying to solve?

furszy commented at 2:43 pm on July 3, 2024: member

Nodes advertise which blocks they have and what their current tip is, so the timeouts should only happen if the peer is misbehaving or if you disrespect the protocol by asking for things they don’t have.

We don’t ask all peers about all the blocks they know about. If the node is in the main-chain, most of the time we only know their last advertised header only. Once we are headers-wise sync, the block downloading procedure download blocks from different peers. It has a window of 16 in-flight blocks per peer. The peer who advertised the header might not be the one who ends up providing the final block.

This open a window for other scenarios such peers not responding when you ask for an old block that isn’t in the main chain (fingerprinting protection), not responding when you ask for a block deeper than the pruning threshold (to avoid leaking the prune-height) or for a block that was just pruned (erased in-between the inv roundtrip). Which doesn’t necessarily represent a bad behaving peer.

When this situations arises, well-behaving peers responding with a not_found will allow us to promptly request the block from another source instead of staying idle for 10 minutes and disconnect the well-behaving peer that is still in full compliance with the protocol specs. Preferring peers signaling this feature over others over time seems the correct path to follow to me.

sipa commented at 2:48 pm on July 3, 2024: member

The peer who advertised the header might not be the one who ends up providing the final block.

I don’t think we ever ask for a block from a peer without that specific peer having announced the block or a descendant to us?

I can imagine asking a peer for a block that is not on the main chain (due to a very recent reorg) or due to just being pruned, though I don’t know how often this happens.

I think this is generally a good change, but to judge how important it is, I think it’s good to have numbers about whether this actually happens in production between honest nodes.

furszy commented at 3:20 pm on July 3, 2024: member

The peer who advertised the header might not be the one who ends up providing the final block.

I don’t think we ever ask for a block from a peer without that specific peer having announced the block or a descendant to us?

Yeah. I actually spent some time checking that specifically because it would allow an easy way to partition the network by sending only the new header tip, disconnecting and expect the peer requests it to someone else who does not have it etc..

What I meant with the “The peer who advertised the header might not be the one who ends up providing the final block” is that providing a descendant of a block doesn’t mean the peer has the block data in all circumstances. Thus why wrote those examples.

I think it’s good to have numbers about whether this actually happens in production between honest nodes.

Ideally, yeah. But.. how would you get numbers when the disconnecting peer does not tell you the idle reason? Will think about this further.

Protocol wise, and generally speaking here, I don’t see why a connection should stay idle for 10 minutes when two honest peers are connected. There should always be a response, even more when the other peer continue sending all other messages accordantly.

p2p: negotiate NOT_FOUND support for blocks

Introduces a new p2p protocol message 'SENDNOTFOUND' sent during the
handshake phase to notify the remote peer we support receiving
'notfound' messages for unknown blocks.

ac0c285258

net: move disconnection outside the 'getdata' handler

No-behavior change.

This will allow us to be consistent in terms of when the node
will disconnect and also let us send a 'notfound' for blocks
instead of disconnecting in a simpler manner.

a5c0fcbaf1

p2p: send NOT_FOUND for unknown/pruned blocks b8e3502d2a

DRAFT: disconnect when a negotiation msg is received prior to the version msg? 9b2e70d11e

test: validate 'notfound' block responses cfc37985dd

net: handle 'notfound' block msgs 9d61bfd135

Drop: note for the future 5e1023d83d

RPC: getpeerinfo, return peer missing blocks 7a354a7be3

test: coverage for 'notfound' handler 5b6317c9c8

furszy force-pushed on Jul 4, 2024

DrahtBot commented at 1:42 pm on July 9, 2024: contributor

🐙 This pull request conflicts with the target branch and needs rebase.

DrahtBot added the label Needs rebase on Jul 9, 2024

gmaxwell commented at 8:11 pm on August 3, 2024: contributor

I believed the current behavior was this: If a node is non-pruned it will have all blocks, no issue. If a node is pruned it will respond to requests for the last 288 which are protocol required and will disconnect any peer that asks for a block outside of the window. Again, no issue.

If I’m mistaken about the current behavior, I think it should be changed to match that behavior because:

If you allow peers to probe exactly what blocks you have, it allows tracking nodes around the network as they change addresses and potentially exposing confidential information regarding which blocks were kept (e.g. if a node kept blocks containing transactions of relevance to them).

This was a specific consideration in the design of the limited flag, and is why there isn’t some kind of fancy signaling about what blocks a node has or doesn’t have.

Kind of separately, I think that sending “not found” is a regression to an older period of protocol development– and that generally not founds have been replaced with not requesting things that won’t be found and disconnecting when things are unexpected as disconnection is the strongest protection against accidentally jamming a peer (and yet still harmless if it caused by some kind of crazy race condition). Not founds were indeed used in the past but they were found to be largely a counterproductive bandwidth waste, generally useless, and a source of vulnerabilities (particularly privacy ones). So to a first approximation I think if you find yourself thinking sending a not found is a solution to a problem, you’ve got a mistake somewhere else.

furszy commented at 9:35 pm on August 14, 2024: member

Thanks for the comment gmaxwell.

I think if you find yourself thinking sending a not found is a solution to a problem, you’ve got a mistake somewhere else.

Yeah, thats actually how this working path started #28120.

I believed the current behavior was this: If a node is non-pruned it will have all blocks, no issue. If a node is pruned it will respond to requests for the last 288 which are protocol required and will disconnect any peer that asks for a block outside of the window. Again, no issue.

Those scenarios are ok, but what about the following ones:

For AssumeUTXO: AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash. This causes honest peers trying to synchronize historical block from them to forcibly disconnect due to the lack of response while the AssumeUTXO node is synchronizing the tip chain or the background one. This scenario can be replicated just by running an AssumeUTXO node and connecting a new node -> test case exercising the disconnection can be found here.

So, based on your comment, instead of implementing something like this, would your suggestion be to signal the AssumeUTXO sync state at the network level in some way? Because that sounds like signaling when (1) the node does not have the background chain and (2) when the node finishes syncing the background chain. Which sounds odd at first. Probably not a fingerprinting concern because AssumeUTXO nodes should already be distinguishable enough just by observing their block requests. But.. I’m not so sure about it.
High-Bandwidth Peers: BIP152 mentions the possibility of relaying PoW-valid headers before fully validating the block data. So, what if the block turns out to be invalid? the relayer will ignore any following-up request and lead to disconnection.

gmaxwell commented at 10:53 pm on August 17, 2024: contributor

The assumeutxo node would just be a limited node, it’ll offer the most recent blocks and none of the history. (it might have more history then it offers, as is the case of any other limited node.) Signaling in a more fine grained manner would have little to no value to the network, but would enable tracking the node around as it moves from IP to IP. In any case, it’s no justification for not-found even that aside as we wouldn’t want nodes wasting their resources to connect to one only to be told not-found.

If they were used at all, it would be via signaling and so you’d know what they had to offer. The disadvantage of signing is all those tracking problems, but those problems exist for not-found too, so signaling is equivalent for privacy but avoids wasting time connecting to and requesting from things that can’t help you.

An obvious way that might evolve is through the definition of a “less limited peer” that also has all the blocks in the current window of X blocks or whatnot. This was evaluated when limited was defined, but measurements showed that all older blocks past a VERY short interval were fetched at equal probability which resulted in the current limited definition. In a world where assumeutxo were in widespread use the fetching pattern would be different but the resulting fetching pattern would depend a lot on how assumeutxo values were chosen (as they need to be chosen in a way that avoids attack), and so at the time limited was defined it was decided to not presume anything about those future fetch patterns.

I believe the security/decenteralization considerations for assumeutxo have still just been ignored for the time being, so I don’t think there are any clear answers on what fetching patterns it could expect. But it would be reasonable to assume that whatever they are they could be handled by a single additional node type.

High-Bandwidth Peers: BIP152 mentions the possibility of relaying PoW-valid headers before fully validating the block data. So, what if the block > turns out to be invalid? the relayer will ignore any following-up request and lead to disconnection.

They don’t relay just headers, they relay the block, which is why the BIP directs peers to not ban each other for sending an invalid block this way so long as it was POW valid.

furszy referenced this in commit 943449fae8 on Sep 3, 2024

furszy referenced this in commit 1f91abcc2c on Sep 3, 2024

furszy referenced this in commit fafc73bc03 on Sep 4, 2024

furszy commented at 10:37 pm on September 4, 2024: member

The assumeutxo node would just be a limited node, it’ll offer the most recent blocks and none of the history. (it might have more history then it offers, as is the case of any other limited node.)

Yeah, this was the piece I was missing. Thanks! Funny, I thought about a dynamically adjustable network signal but didn’t consider reusing the limited service flag by enabling/disabling full-node service support based on the assumeUTXO background chain-sync status..

Created #30807 implementing it. It fixes the last honest peer disconnection scenario I had in mind.

High-Bandwidth Peers: BIP152 mentions the possibility of relaying PoW-valid headers before fully validating the block data. So, what if the block > turns out to be invalid? the relayer will ignore any following-up request and lead to disconnection.

They don’t relay just headers, they relay the block, which is why the BIP directs peers to not ban each other for sending an invalid block this way so long as it was POW valid.

Yeah ok, great.

achow101 referenced this in commit 349632e022 on Sep 11, 2024

furszy closed this on Oct 13, 2024

[WIP] p2p: send not_found msgs for unknown, pruned or unwilling to share blocks #30385

Code Coverage

Reviews

Conflicts