Problem: For stalling at the tip, we have a parallel download mechanism for compact blocks that was added in #27626. For stalling during IBD, we have a lookahead window of 1024 blocks, and if that is exceeded, we disconnect the stalling peer. However, if we are close to but not at the tip (<=1024 blocks away), neither of these mechanisms applies. We can’t do compact blocks yet, and the stalling mechanism doesn’t work because the 1024-block window cannot be exceeded.
As a result, we have to resort to `BLOCK_DOWNLOAD_TIMEOUT_BASE`, which only disconnects a peer after 10 minutes (plus 5 more minutes for each additional peer we currently have blocks in flight from). This is too long in my opinion, especially since peers get assigned up to 16 blocks (`MAX_BLOCKS_IN_TRANSIT_PER_PEER`) and could repeat this process to stall us even longer if they send us a block after 10 minutes.
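To make the arithmetic concrete, here is a minimal sketch of the timeout as described above. The names `kBaseTimeout`, `kPerPeerTimeout` and `BlockDownloadTimeout` are illustrative and not the actual identifiers or code in `net_processing`; the values are simply the 10 and 5 minutes quoted in this description.

```cpp
#include <chrono>

// Values quoted in the description: 10 minutes base, plus 5 minutes for
// each additional peer we currently have blocks in flight from.
// The names are hypothetical, chosen for this sketch only.
constexpr std::chrono::minutes kBaseTimeout{10};
constexpr std::chrono::minutes kPerPeerTimeout{5};

// Time before a non-delivering peer is disconnected, given how many
// other peers we are also downloading blocks from.
constexpr std::chrono::minutes BlockDownloadTimeout(int other_peers_downloading)
{
    return kBaseTimeout + kPerPeerTimeout * other_peers_downloading;
}

// With no other peers downloading we wait the full 10 minutes;
// with two others it grows to 20 minutes.
static_assert(BlockDownloadTimeout(0) == std::chrono::minutes{10}, "base case");
static_assert(BlockDownloadTimeout(2) == std::chrono::minutes{20}, "two other peers");
```

Even in the best case, a single unresponsive peer can hold back the tip for 10 minutes under this rule.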
This issue was observed in #29281 and #12291 (comment) with broken peers that didn’t send us blocks.
Proposed solution: If we are 1024 or fewer blocks away from the tip and haven’t requested or received a block from any peer for 30 seconds, request the critical block (the one that would let us advance our tip) from an additional peer. Add up to two additional peers this way.
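The rule can be sketched as a single predicate. This is only an illustration of the condition stated above, not the code in this PR: `NearTipState`, `ShouldAddParallelPeer` and the constant names are hypothetical, and the `blocks_from_tip > 0` check is an added assumption (at the tip there is nothing left to fetch this way).

```cpp
#include <chrono>

// Numbers taken from the proposal; all names are hypothetical.
constexpr int kNearTipWindow = 1024;               // max distance from the tip, in blocks
constexpr std::chrono::seconds kStallTimeout{30};  // no block requested or received for this long
constexpr int kMaxExtraPeers = 2;                  // at most two additional peers

struct NearTipState {
    int blocks_from_tip;                            // how far our tip is behind the best known header
    std::chrono::seconds since_last_block_activity; // time since we last requested or received a block from any peer
    int extra_peers_added;                          // additional peers already asked for the critical block
};

// Returns true if the critical block should also be requested from
// one more peer.
constexpr bool ShouldAddParallelPeer(const NearTipState& s)
{
    return s.blocks_from_tip > 0 &&                        // assumption: nothing to do at the tip
           s.blocks_from_tip <= kNearTipWindow &&          // close to, but not at, the tip
           s.since_last_block_activity >= kStallTimeout && // download appears stalled
           s.extra_peers_added < kMaxExtraPeers;           // cap on parallel requests
}
```

For example, a node 500 blocks behind with 45 seconds of inactivity and no extra peers yet would add one; once two extra peers have been added, the predicate stays false and the existing timeout logic remains the only fallback.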
Other thoughts
- I also considered the alternative of extending the existing stalling mechanism, which disconnects peers, instead of introducing parallel downloads. This could potentially be less wasteful, but we might be over-eager to disconnect peers when really close to the tip, and it might lead to cycling through lots of peers in extreme situations where we have a very slow internet connection.
- The chosen timeout of 30 seconds could lead to inefficiencies / bandwidth waste when we have a really slow internet connection. Maybe it could make sense to track the last successful download times from existing peers and use a dynamic timeout based on those statistics, instead of setting it to a fixed value.
- I will leave this PR in draft until I have tested it a bit more in the wild.
Fixes #29281