furszy
commented at 1:55 pm on July 21, 2023:
member
Even when the node believes it has completed IBD, it needs to avoid
requesting historical blocks from network-limited peers.
Otherwise, the limited peer will disconnect right away.
The simplest scenario is a node that gets synced, drops its
connections, and stays inactive for a while. Then, once it reconnects
(IBD remains completed), it tries to fetch all the missing blocks
from any peer and gets disconnected by the limited ones.
Note:
You can verify the behavior by cherry-picking the test commit alone
onto master. It will fail there.
DrahtBot
commented at 1:55 pm on July 21, 2023:
contributor
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Conflicts
No conflicts as of last run.
DrahtBot added the label
P2P
on Jul 21, 2023
mzumsande
commented at 6:53 pm on July 21, 2023:
contributor
Maybe getting disconnected could be beneficial in this rare situation? If our tip is that much behind but we’re not in IBD, we can’t help them and they can’t help us, so our priority should be to get the missing blocks asap, and getting disconnected by useless peers seems like it’s not such a bad thing towards that goal.
What would be bad is if all our current peers were limited, and we wouldn’t try to request the needed blocks from anyone, but also wouldn’t try to exchange them for better peers, so that we would be stuck.
furszy
commented at 8:54 pm on July 21, 2023:
member
Maybe getting disconnected could be beneficial in this rare situation? If our tip is that much behind but we’re not in IBD, we can’t help them and they can’t help us, so our priority should be to get the missing blocks asap, and getting disconnected by useless peers seems like it’s not such a bad thing towards that goal.
Agree, but while the outcome might be the desired one, the approach to achieve it isn't the best. It seems unwise to have the node send a message over the wire, knowing it will fail, just to force the other peer to disconnect.
The node should be smarter than that and not rely on external factors to make decisions.
Moreover, in cases where the other peer chooses to ignore the message without disconnecting (maybe because it runs software other than Bitcoin Core), the node will end up waiting for a response that will never arrive, wasting time and resources, which could easily have been avoided.
Also, while the node still has connection slots available, I don't see why it should disconnect a peer that is behaving properly.
Instead, I think we should keep the processes separated. The eviction process should be in charge of disconnecting peers that aren't or won't provide any meaningful data, and the block download logic should only be responsible for deciding which blocks to request from each peer and which ones not to.
naumenkogs
commented at 9:40 am on July 25, 2023:
member
What would be bad is if all our current peers were limited, and we wouldn’t try to request the needed blocks from anyone, but also wouldn’t try to exchange them for better peers, so that we would be stuck.
I agree with this part. I think the solution should be expanded to give a chance in such cases, and it should be merged within the same PR.
luke-jr
commented at 6:33 pm on July 25, 2023:
member
Agree with @mzumsande and @naumenkogs. Maybe not the same PR if it’s substantially different, but the new eviction logic should be merged before this fix.
furszy
commented at 2:56 pm on August 1, 2023:
member
What would be bad is if all our current peers were limited, and we wouldn’t try to request the needed blocks from anyone, but also wouldn’t try to exchange them for better peers, so that we would be stuck.
Right now, this will hardly happen. The reason is CheckForStaleTipAndEvictPeers. The logic is as follows:
When the node hasn't updated its tip for longer than 30 minutes (TipMayBeStale()), it triggers the extra outbound connection process, which instructs the net layer to bypass the maximum outbound connection restriction and create another full-relay outbound connection to sync up the missing blocks.
Now consider the worst-case scenario: the extra outbound connection is also a limited peer. In that case, 45 seconds after connecting to the extra peer (scheduler task timeout), CheckForStaleTipAndEvictPeers will call EvictExtraOutboundPeers, which evicts the peer that least recently announced a block header.
Then, if the chain still hasn't moved, the process starts again: another extra outbound connection is added, and so on, until the node finds a peer that can provide the missing blocks.
Note:
While this sooner or later corrects the stalling situation, it is not ideal to keep establishing connections to limited peers via the extra outbound connection process while the node is stalled. To address this, I have created #28170, which basically disallows connections to limited peers when the node is far enough behind to know that limited peers cannot provide any of the required historical blocks (although they can still provide valid headers).
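As an aside, here is a toy model of that recovery loop (a minimal sketch with made-up helper names and a random chance that the extra peer is a full node; none of this is the real net-layer code):
import random

def recover_from_stalled_tip(prob_full_node=0.3, rng=random.Random(42)):
    """Count extra outbound connections opened before the tip can advance (toy model)."""
    extra_connections = 0
    while True:
        # Tip stale for > 30 minutes (TipMayBeStale()): open one extra full-relay
        # outbound connection, bypassing the normal outbound limit.
        extra_connections += 1
        if rng.random() < prob_full_node:
            return extra_connections  # this peer can serve the missing blocks
        # ~45 seconds later the tip still has not advanced, so EvictExtraOutboundPeers()
        # drops the extra (limited) peer that least recently announced a header; retry.

print("extra outbound connections needed:", recover_from_stalled_tip())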
achow101 requested review from vasild
on Sep 20, 2023
achow101 requested review from sipa
on Sep 20, 2023
DrahtBot added the label
Needs rebase
on Oct 2, 2023
furszy force-pushed
on Oct 18, 2023
furszy
commented at 1:35 pm on October 18, 2023:
member
rebased. Conflicts solved.
DrahtBot removed the label
Needs rebase
on Oct 18, 2023
in
src/net_processing.cpp:1480
in
c1612ea1f0 (outdated)
1491+ }
1492+ return;
1493+ }
1494+
1495+ // Don't request blocks that go further than what limited peers can provide
1496+ if (is_limited_peer && (state->pindexBestKnownBlock->nHeight - pindex->nHeight > static_cast<int>(NODE_NETWORK_LIMITED_MIN_BLOCKS) + 2 /* add two blocks buffer extension for possible races */)) {
andrewtoth
commented at 9:14 pm on October 29, 2023:
I don't think we want to keep the + 2 buffer here. This means we would potentially still request blocks up to 2 past 288, so up to 290 deep. The buffer that allows serving requested blocks up to 288 + 2 deep is there for when the sender sees new blocks that the requester has not yet seen. In this case we are the requester, so we should respect the 288 limit.
I agree - no need for +2; it could even be harmful and defeat the margin on the sender side. E.g. if the sender is at height 1290, this logic would allow requesting block 1000. Then if the sender progresses to height 1291 in the meantime and sees the request for 1000, they will disconnect.
Further, to make this symmetric with the sender, maybe it should be -2? If we assume the sender has no margin and disconnects for anything deeper than 288, then the requester should only request blocks up to depth 286 (inclusive).
Oops, yeah. This was intended to be -2. Fixing it.
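To make the race concrete, a small numeric sketch (Python; the 288 constant and the sender's +2 tolerance come from the discussion above, the helper names are made up):
NODE_NETWORK_LIMITED_MIN_BLOCKS = 288

def requester_would_ask(peer_height, block_height, buffer):
    # Requester-side guard sketched in this PR: ask only if the block is not
    # deeper than 288 adjusted by `buffer`.
    return peer_height - block_height <= NODE_NETWORK_LIMITED_MIN_BLOCKS + buffer

def limited_peer_would_serve(peer_height, block_height):
    # Sender-side tolerance (existing behavior): serve up to 288 + 2 deep,
    # otherwise disconnect the requester.
    return peer_height - block_height <= NODE_NETWORK_LIMITED_MIN_BLOCKS + 2

# With a +2 buffer: peer seen at 1290, block 1000 (depth 290) gets requested...
assert requester_would_ask(1290, 1000, buffer=+2)
# ...but if the peer advances to 1291 before serving, depth becomes 291 -> disconnect.
assert not limited_peer_would_serve(1291, 1000)

# With a -2 buffer, the deepest request (depth 286) survives the peer advancing a few blocks.
assert requester_would_ask(1290, 1004, buffer=-2)
assert limited_peer_would_serve(1294, 1004)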
in
src/net_processing.cpp:1482
in
c1612ea1f0 (outdated)
@@ -1443,30 +1444,46 @@ void PeerManagerImpl::FindNextBlocks(std::vector<const CBlockIndex*>& vBlocks, c
1444 // We consider the chain that this peer is on invalid.
1445 return;
1446 }
1447+
1448 if (!CanServeWitnesses(peer) && DeploymentActiveAt(*pindex, m_chainman, Consensus::DEPLOYMENT_SEGWIT)) {
1449 // We wouldn't download this block or its descendants from this peer.
1450 return;
1451 }
1452+
in
src/net_processing.cpp:1481
in
c1612ea1f0 (outdated)
1492+ return;
1493+ }
1494+
1495+ // Don't request blocks that go further than what limited peers can provide
1496+ if (is_limited_peer && (state->pindexBestKnownBlock->nHeight - pindex->nHeight > static_cast<int>(NODE_NETWORK_LIMITED_MIN_BLOCKS) + 2 /* add two blocks buffer extension for possible races */)) {
1497+ return;
Should this not be continue;? I find PeerManagerImpl::FindNextBlocks() hard to follow. Are the entries in vToFetch sorted in ascending or descending order?
Should this not be continue;? I find PeerManagerImpl::FindNextBlocks() hard to follow. Are the entries in vToFetch sorted in ascending or descending order?
vToFetch is arranged in increasing order. It needs to stop once it reaches the last block the peer can provide, without continuing further.
FindNextBlocks begins from the last block in common with the peer. Then, during each iteration of the loop, it gets the block that is 128 blocks ahead and populates vToFetch by moving backwards: it places the last block of the window in the last position of the array, the second-to-last block in the second-to-last position, and so on.
Subsequently, the function iterates over the now-ordered vToFetch in the forward direction, adding the blocks that will be requested to vBlocks.
Ok, let's use an example (from the newly added test):
vToFetch = [2 ... 129]
state->pindexBestKnownBlock->nHeight = 301
So the peer’s height is 301. We can assume they can provide blocks [13 ... 301].
At the first iteration of the loop pindex->nHeight is 2 (vToFetch[0]).
The code ends up with return; because 301 - 2 > 286 and we never request any blocks from that peer.
Shouldn’t we request blocks [13 ... 301] instead?
furszy
commented at 11:41 pm on November 10, 2023:
Hmm, that will only happen when the node is close to the tip, because of the 1024-block download window. More precisely, when the node is within 1312 blocks of the tip (window + limited peer threshold). Nice observation!
Will modify it and add coverage for it.
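Putting the pieces above together, a minimal Python model of the window building and forward scan (simplified: a linear chain of heights, the 1024/128/16 constants mirror the real ones, and no in-flight or validity tracking), using the example of a last common block at height 1 and a limited peer at height 301 serving only the last 286 blocks:
BLOCK_DOWNLOAD_WINDOW = 1024
MAX_BLOCKS_IN_TRANSIT_PER_PEER = 16

def find_next_blocks(last_common_height, peer_height, limited_serve_depth=None):
    """Return the block heights that would be requested from this peer (toy model)."""
    blocks = []
    max_height = min(peer_height, last_common_height + BLOCK_DOWNLOAD_WINDOW)
    walk = last_common_height
    while walk < max_height and len(blocks) < MAX_BLOCKS_IN_TRANSIT_PER_PEER:
        # Build the next chunk of up to 128 consecutive heights (vToFetch is filled
        # backwards in the real code, which is equivalent to this ascending range).
        n_to_fetch = min(max_height - walk, 128)
        to_fetch = range(walk + 1, walk + n_to_fetch + 1)
        walk += n_to_fetch
        # Forward scan: skip blocks a limited peer cannot serve, keep the rest.
        for height in to_fetch:
            if limited_serve_depth is not None and peer_height - height >= limited_serve_depth:
                continue
            blocks.append(height)
            if len(blocks) == MAX_BLOCKS_IN_TRANSIT_PER_PEER:
                break
    return blocks

# Peer at 301 serving only the last 286 blocks: requests start at height 16, not 2.
print(find_next_blocks(1, 301, limited_serve_depth=286))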
in
test/functional/p2p_node_network_limited.py:77
in
c1612ea1f0outdated
74+ self.disconnect_nodes(0, 2)
75+ self.disconnect_nodes(1, 2)
76+ assert_equal(len(full_node.getpeerinfo()), 0)
77+
78+ # Mine blocks and sync the pruned node. Surpass the NETWORK_NODE_LIMITED threshold.
79+ self.generate(miner, MIN_BLOCKS_TO_KEEP + 12, sync_fun=self.no_op)
Is the intention here to mimic NODE_NETWORK_LIMITED_MIN_BLOCKS from the C++ source code, which is 288? MIN_BLOCKS_TO_KEEP is a different constant defined in both the C++ and Python code which happens to be also 288. Should NODE_NETWORK_LIMITED_MIN_BLOCKS be defined in the Python code (equal to 288, but not necessary equal or tied to MIN_BLOCKS_TO_KEEP)?
Is the intention here to mimic NODE_NETWORK_LIMITED_MIN_BLOCKS from the C++ source code, which is 288?
Yeah, I used MIN_BLOCKS_TO_KEEP merely because it is already part of the test framework.
MIN_BLOCKS_TO_KEEP is a different constant defined in both the C++ and Python code which happens to be also 288. Should NODE_NETWORK_LIMITED_MIN_BLOCKS be defined in the Python code (equal to 288, but not necessary equal or tied to MIN_BLOCKS_TO_KEEP)?
If it helps readability, sure (I would just need to see where to place it).
It is important to note that NODE_NETWORK_LIMITED_MIN_BLOCKS is dependent on MIN_BLOCKS_TO_KEEP and should always be less than or equal to it.
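Something like the following would do (a sketch only; where exactly the constant should live in the framework is left open, and MIN_BLOCKS_TO_KEEP is re-declared here just to keep the snippet self-contained):
# Mirror of the C++ constant, kept separate from MIN_BLOCKS_TO_KEEP even though
# both happen to be 288 today.
NODE_NETWORK_LIMITED_MIN_BLOCKS = 288
MIN_BLOCKS_TO_KEEP = 288  # pruning floor, already defined in the test framework

# NODE_NETWORK_LIMITED_MIN_BLOCKS depends on MIN_BLOCKS_TO_KEEP and must not exceed it.
assert NODE_NETWORK_LIMITED_MIN_BLOCKS <= MIN_BLOCKS_TO_KEEP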
Here sync_blocks() is used, and above self.wait_until(lambda: node.getbestblockhash() == ... is used to achieve the same purpose. For consistency, maybe use only one of the two methods.
in
test/functional/p2p_node_network_limited.py:92
in
c1612ea1f0outdated
83+ # (they will disconnect if full_node, which is chain-wise behind, request blocks
84+ # older than MIN_BLOCKS_TO_KEEP)
85+ self.connect_nodes(0, 2)
86+ assert_equal(len(full_node.getpeerinfo()), 1)
87+ time.sleep(3) # Wait for a few seconds to ensure peers remain connected
88+ assert_equal(len(full_node.getpeerinfo()), 1)
This sleep is non-deterministic. The good thing is that it will not cause sporadic test failures, but it could still occasionally report success even if the full node requests an old block and gets disconnected (a false positive). Also, it adds an unconditional 3 seconds to the test's execution time.
Is it possible to look up something in the log or via RPC to cross-check that the full node has decided not to request blocks from the pruned node?
Is it possible to look up something in the log or via RPC to cross-check that the full node has decided not to request blocks from the pruned node?
Hmm, not really. We aren't logging any of the reasons behind the lack of a getdata request for a certain block from a certain peer (invalid tree, no segwit support, already in-flight). That approach would lead to excessive logging for no real gain.
What we could do, maybe in a follow-up, to eliminate the unnecessary sleep time, is migrate this functional test check into a unit test, similar to what I'm doing in #28170 for FindNextBlocks.
The test would fail if run standalone, without the preceding tests. The failure is on line 78: self.sync_blocks([miner, pruned_node]). That is because the test relies on the pruned node being connected to the miner, which is indeed the case due to some leftover connections from previous tests. This diff makes it succeed when run standalone:
--- i/test/functional/p2p_node_network_limited.py
+++ w/test/functional/p2p_node_network_limited.py
@@ -54,25 +54,24 @@ class NodeNetworkLimitedTest(BitcoinTestFramework):
         # The request of blocks further than the NETWORK_NODE_LIMITED threshold causes a direct disconnection.
         pruned_node = self.nodes[0]
         miner = self.nodes[1]
         full_node = self.nodes[2]

         # Connect and generate block to ensure IBD=false
+        self.connect_nodes(0, 1)
         self.connect_nodes(1, 2)
-        self.connect_nodes(0, 2)
         self.generate(miner, 1, sync_fun=self.no_op)

         # Verify all peers are at the same height
         self.sync_blocks([miner, pruned_node])

         # Verify peers are out of IBD
         for node in self.nodes:
             assert not node.getblockchaininfo()['initialblockdownload']

         # Disconnect full_node from the other two peers (the node will remain out of IBD)
-        self.disconnect_nodes(0, 2)
         self.disconnect_nodes(1, 2)
         assert_equal(len(full_node.getpeerinfo()), 0)

         # Mine blocks and sync the pruned node. Surpass the NETWORK_NODE_LIMITED threshold.
Ended up doing it in a slightly different way, using setnetworkactive to isolate the full node from the other two. I think it is clearer this way.
furszy force-pushed
on Nov 11, 2023
furszy
commented at 1:06 am on November 11, 2023:
member
Updated per feedback, thanks for the review @vasild!
The node will now request blocks within the limited peer threshold when the block download window includes them, and not only when the previous blocks have arrived or are in flight. Added test coverage exercising this behavior and verifying that the node can get synced even when blocks arrive out of order.
DrahtBot added the label
CI failed
on Nov 11, 2023
furszy force-pushed
on Nov 11, 2023
DrahtBot removed the label
CI failed
on Nov 11, 2023
in
src/net_processing.cpp:1511
in
930e531046 (outdated)
1492+ return;
1493+ }
1494+
1495+ // Don't request blocks that go further than what limited peers can provide
1496+ if (is_limited_peer && (state->pindexBestKnownBlock->nHeight - pindex->nHeight > static_cast<int>(NODE_NETWORK_LIMITED_MIN_BLOCKS) - 2 /* two blocks buffer for possible races */)) {
1497+ continue;
I think there is an off-by-one error here; should this be >= instead of >? Let's say the limited peer is at height H; then the intention is to request blocks (H-286, H] from it, a total of 286 blocks. However, the loop would not continue for height H-286 and would request that block too.
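A quick numeric check of that boundary (Python; depths are hypothetical, the constants follow the discussion above):
LIMIT = 288 - 2  # NODE_NETWORK_LIMITED_MIN_BLOCKS minus the safety margin

# Depth of a block below the limited peer's tip; depth 0 is the tip itself.
requested_with_gt  = [d for d in range(0, 400) if not (d >  LIMIT)]   # '>'  keeps depths 0..286
requested_with_gte = [d for d in range(0, 400) if not (d >= LIMIT)]   # '>=' keeps depths 0..285

print(len(requested_with_gt), len(requested_with_gte))  # 287 vs 286, i.e. exactly (H-286, H]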
in
test/functional/p2p_node_network_limited.py:104
in
930e531046 (outdated)
99+ full_node.getblock(pruned_node.getblockhash(height)) # just need to assert it does not raise an exception
100+
101+ # Verify all blocks that should not have been requested.
102+ for move in range(1, limited_threshold_buffer):
103+ height = start_height_full_node + move
104+ assert_raises_rpc_error(-1, "Block not found on disk", full_node.getblock, pruned_node.getblockhash(height))
The full node is at height 1 and the pruned node is at height 289. The first loop verifies that the full node has blocks [4, 289] (286 blocks in total), which is ok. But then it only checks that the full node does not have block 2. What about block 3? It is not checked; it should not have been downloaded, but it has been.
To avoid the gap, the range in the second loop should be range(1, limited_threshold_buffer + 1), or even:
for height in range(start_height_full_node + 1, start_height_full_node + 1 + limited_threshold_buffer):
or have one loop instead of two:
for height in range(start_height_full_node + 1, tip_height + 1):
    if height <= tip_height - (NODE_NETWORK_LIMITED_MIN_BLOCKS - 2):
        assert_raises_rpc_error(-1, "Block not found on disk", full_node.getblock, pruned_node.getblockhash(height))
    else:
        full_node.getblock(pruned_node.getblockhash(height))
Sure, will do if I have to retouch.
The number is there to test the block download procedure and its moving window.
For instance, if the peer is at block 100 and it mines 300 blocks (288 + 12), the test verifies that the node only downloads the blocks below the threshold from the limited peer (the last 288 blocks minus the buffer). Then the test connects a full node and checks that the node downloads the remaining historical blocks from it.
andrewtoth
commented at 12:55 pm on December 4, 2023:
Ah yes, it makes sense what num_historical_blocks' purpose is. The nit is why the number 12 was chosen specifically. I suppose it is just to make it an even 300, and it can be arbitrary?
Ah yes, it makes sense what num_historical_blocks' purpose is. The nit is why the number 12 was chosen specifically. I suppose it is just to make it an even 300, and it can be arbitrary?
Yep.
andrewtoth approved
andrewtoth
commented at 6:21 pm on December 3, 2023:
contributor
ACK a0687ffd568b0984a8b1e51fd0061f87f25e95d7
in
src/net_processing.cpp:1510
in
ea3138b7b1 (outdated)
@@ -1475,6 +1476,11 @@ void PeerManagerImpl::FindNextBlocks(std::vector<const CBlockIndex*>& vBlocks, c
1476 return;
1477 }
1478
1479+ // Don't request blocks that go further than what limited peers can provide
1480+ if (is_limited_peer && (state->pindexBestKnownBlock->nHeight - pindex->nHeight >= static_cast<int>(NODE_NETWORK_LIMITED_MIN_BLOCKS) - 2 /* two blocks buffer for possible races */)) {
naumenkogs
commented at 12:54 pm on December 6, 2023:
ea3138b7b189158db80bfd1f9d9dd17e7ef52f3f
might as well jump pindex forward here, so that the second condition is satisfied?
I'm not sure what you are referring to. Could you rephrase it, please?
Do you want to update pindexLastCommonBlock to pindex?
andrewtoth
commented at 11:29 pm on February 21, 2024:
I think what @naumenkogs is referring to here is that if this condition is true for the first pindex in the loop, it will keep being hit for every subsequent pindex until it is no longer true. This could be made more efficient by just bumping pindex to the first block that is past the network-limited blocks threshold.
So, we can either change the loop from range-based to a regular for-loop and advance the index, or move this check out of this loop and into the loop on line 1464 and break early if the condition is met.
I think what @naumenkogs is referring to here is that if this condition is true for the first pindex in the loop, it will keep being hit for every subsequent pindex until it is no longer true. This could be made more efficient by just bumping pindex to the first block that is past the network-limited blocks threshold.
So, we can either change the loop from range-based to a regular for-loop and advance the index, or move this check out of this loop and into the loop on line 1464 and break early if the condition is met.
Ok, we are on the same page now, thanks. Let's leave it for a follow-up. It would be nice to re-think this function and benchmark it properly. At first glance, I think we would be optimizing the function for the less likely case and might degrade performance for the likely ones.
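For reference, the suggested skip-ahead amounts to computing the first height the limited peer can serve and starting the scan there instead of skipping block by block (a sketch with hypothetical helper names, not what this PR implements):
NODE_NETWORK_LIMITED_MIN_BLOCKS = 288

def first_servable_height(peer_best_height, margin=2):
    """Lowest height a limited peer is assumed to serve, given the safety margin."""
    return peer_best_height - (NODE_NETWORK_LIMITED_MIN_BLOCKS - margin) + 1

# Limited peer at height 301: start scanning at height 16 instead of skipping 2..15.
print(first_servable_height(301))  # 16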
jonatack
commented at 11:33 pm on February 5, 2025:
Let’s leave it for a follow-up. It would be nice to re-think this function
Any update?
naumenkogs
commented at 12:59 pm on December 6, 2023:
member
The code looks alright, but I'd still rather have this addressed first. It looks like even #28170 doesn't directly handle this, but rather prepares the codebase for another PR handling it?
furszy
commented at 4:08 pm on December 24, 2023:
member
Sorry for the delay @naumenkogs. I've been working on #28170 before returning here. All good now, it is ready to go.
The code looks alright, but i’d still rather have this addressed first. It looks like even #28170 doesn’t directly handle this, but rather prepare the codebase for another PR handling it?
The concern was:
All connections are from limited peers, we are lagging behind, and we no longer get disconnected from them due to the removal of the historical block request disconnection misfeature.
With #28170, when this happens, the node will detect that it is lagging behind and update the desirable connection service flags accordingly, ensuring that the extra outbound connection process only connects to full nodes (unlike now, where the node can still connect to another limited peer, etc). Then, once the connection is established, the node will recover from the stalling situation.
Essentially, instead of resolving the situation by evicting a good limited peer connection (which could be self-defeating in some cases), we will solve it by establishing a new connection. With this, I’m not saying that we couldn’t actually do both (it wouldn’t be bad to reserve some outbound slots for full nodes and/or evict the least helpful limited peer when the node is stalled), but I think the concern is no longer a blocker after #28170.
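A rough model of that idea (the service-flag bit positions are the real ones from the protocol; the threshold logic below only paraphrases the description above and is not the actual #28170 code):
NODE_NETWORK = 1 << 0           # serves the full historical chain
NODE_NETWORK_LIMITED = 1 << 10  # serves only the last 288 blocks
NODE_NETWORK_LIMITED_MIN_BLOCKS = 288

def peer_is_desirable(peer_services, blocks_we_are_behind):
    if blocks_we_are_behind > NODE_NETWORK_LIMITED_MIN_BLOCKS:
        # Far behind the tip: only full nodes can serve the blocks we are missing.
        return bool(peer_services & NODE_NETWORK)
    # Close to the tip: limited peers are good enough.
    return bool(peer_services & (NODE_NETWORK | NODE_NETWORK_LIMITED))

print(peer_is_desirable(NODE_NETWORK_LIMITED, blocks_we_are_behind=1000))  # False
print(peer_is_desirable(NODE_NETWORK_LIMITED, blocks_we_are_behind=10))    # True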
DrahtBot added the label
CI failed
on Jan 12, 2024
achow101 referenced this in commit
0b768746ef
on Jan 31, 2024
refactor: Make FindNextBlocks friendlier
No behavior change.
73127722a2
p2p: sync from limited peer, only request blocks below threshold
Requesting historical blocks from network-limited peers is a
direct cause of disconnection.
The node must only request the blocks it knows for sure the
limited peer can provide.
Even when the node believes it has completed IBD, it needs to avoid
requesting historical blocks from network-limited peers.
Otherwise, the limited peer will disconnect right away.
c5b5843d8f
furszy force-pushed
on Feb 1, 2024
DrahtBot removed the label
CI failed
on Feb 1, 2024
vasild approved
vasild
commented at 1:15 pm on February 4, 2024:
contributor
ACK c5b5843d8f10d96f76ee6b95f2b1b1b4ce755f75
DrahtBot requested review from andrewtoth
on Feb 4, 2024
DrahtBot removed review request from andrewtoth
on Feb 6, 2024
DrahtBot requested review from andrewtoth
on Feb 6, 2024
bitcoin deleted a comment
on Feb 6, 2024
in
test/functional/p2p_node_network_limited.py:70
in
c5b5843d8f
mzumsande
commented at 4:56 pm on February 15, 2024:
I don’t understand this part of the test. Why do we connect nodes 1 and 2, just to disconnect again without doing anything in between?
Why is the generate call necessary? For me, the test also succeeds if I delete it. But if we do have to generate a block, shouldn’t we then also at least add a sync_blocks to ensure it propagates to all nodes before disconnecting?
I don’t understand this part of the test. Why do we connect nodes 1 and 2, just to disconnect again without doing anything in between? Why is the generate call necessary?
To ensure all nodes are out of IBD. The idea was to isolate this test case from the previous cases executed along the file, so it can be run at any point in the file. This was vasild's request, because he was running it first and commenting out the rest of the cases to speed the process up.
The connection requirement is there because the test overrides setup_network, which skips the framework's connection creation process (peers start the test with no connections).
shouldn’t we then also at least add a sync_blocks to ensure it propagates to all nodes before disconnecting?
The self.generate(miner, 1) function calls sync_blocks() internally.
mzumsande
commented at 9:01 pm on February 15, 2024:
ok, makes sense!
DrahtBot removed review request from andrewtoth
on Feb 15, 2024
DrahtBot requested review from andrewtoth
on Feb 15, 2024
in
src/net_processing.cpp:1454
in
2f6a05512f (outdated)
DrahtBot removed review request from andrewtoth
on Feb 26, 2024
DrahtBot requested review from andrewtoth
on Feb 26, 2024
DrahtBot removed review request from andrewtoth
on Feb 27, 2024
DrahtBot requested review from andrewtoth
on Feb 27, 2024
pinheadmz approved
pinheadmz
commented at 11:23 pm on March 9, 2024:
member
ACK c5b5843d8f10d96f76ee6b95f2b1b1b4ce755f75
Built and ran all functional and unit tests on x86 Ubuntu. Confirmed the test fails on master and passes with the patch in the branch. Reviewed the code and the existing discussion. Regarding the decision to keep the pruning peer connected at all, I agree with the author that the affected code is only responsible for requesting available blocks from a peer, and it is reasonable to take the limited flag into account. I like the cleanup in the first commit as well; it's easier to read the loop with continue.
For sanity I added checks in the test to confirm blocks from the pruned node were “found” by the full node but had 'confirmations': -1, which makes sense because there is a gap between the full node’s chain tip and the blocks available from the pruning peer.