rpc: Pruning nodes can not fetch blocks before syncing past their height #23927

pull fjahr wants to merge 2 commits into bitcoin:master from fjahr:2021-12-prunefutureblockfetch changing 2 files +37 −1
  1. fjahr commented at 4:37 pm on December 31, 2021: contributor

    This PR prevents getblockfrompeer from being used on blocks that the node has not synced past yet if the node is running in prune mode.

    Problem

    While a node is still catching up to the tip that it is aware of via the headers, the user can currently use getblockfrompeer to fetch blocks close to or at the tip. These blocks are stored in the block/rev file that otherwise contains blocks the node is receiving as part of the syncing process.

    This creates a problem for pruned nodes: the files containing a fetched block are not pruned during syncing because they contain a block close to the tip. This means the entire file (~130MB) will not be pruned until the tip has moved on far enough from the fetched block. In extreme cases with heavy pruning (like -prune=550) and multiple blocks being fetched, this could mean that the disk usage far exceeds what the user expects, potentially running out of space.
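
    For a rough sense of scale, using the numbers above: fetching, say, three blocks that happen to land in three different block files keeps roughly 3 × 130MB ≈ 390MB of blk*.dat data (plus the corresponding rev*.dat files) un-prunable, so a node targeting 550MB can easily end up using well over double that.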

    Approach

    There are certainly other approaches that could fix the problem while still allowing the current behavior, but all of the ideas I came up with seemed like overkill for a niche problem on a new RPC where it’s still unclear how, and how much, it will be used.

    Testing

    So far I did not see a simple enough way to test this. I am still looking into it and, if it’s complex, will potentially add a test in a follow-up. What would be needed is a way to have a node fetch headers but not sync the blocks yet; that seems like a pattern that could be generally useful.

    To manually reproduce the problematic behavior:

    1. Start a node with current master with -prune=550 and an empty/new datadir; Testnet and Mainnet should both work.
    2. While the node is syncing, run getblockfrompeer on the current tip and a few other recent blocks.
    3. Go to your datadir and observe the blocks folder: there should be a few full blk*.dat and rev*.dat files that are not being pruned. Once you have “pinned” a few of these files, the blocks folder should be significantly above the target size of 550MB.
  2. DrahtBot added the label RPC/REST/ZMQ on Dec 31, 2021
  3. DrahtBot commented at 3:06 am on January 1, 2022: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #23813 (Add test and docs for getblockfrompeer with pruning by fjahr)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  4. brunoerg commented at 3:09 pm on January 1, 2022: contributor

    Concept ACK.

    I started testing it by creating a functional test. Not sure if the approach is right: I started two nodes, the second one with -prune=550; then node0 mines 200 blocks, I get the bestblockhash from node0 and use it in getblockfrompeer (node1 -> node0). It should return an error (apparently it worked):

    def run_test(self):
        self.log.info("Mine 200 blocks on Node 0")
        self.generate(self.nodes[0], 200, sync_fun=self.no_op)
        assert_equal(self.nodes[0].getblockcount(), 400)

        self.log.info("Connect nodes")
        self.connect_nodes(0, 1)

        peers = self.nodes[1].getpeerinfo()
        assert_equal(len(peers), 1)
        peer_1_peer_0_id = peers[0]["id"]
        best_block_hash_0 = self.nodes[0].getbestblockhash()
        assert_raises_rpc_error(-1, 'In prune mode, only blocks that the node has already synced previously can be fetched from a peer', self.nodes[1].getblockfrompeer, best_block_hash_0, peer_1_peer_0_id)
    

    In the master branch, this test would fail because an exception won’t be raised.

  5. fjahr commented at 6:24 pm on January 1, 2022: contributor

    I started testing it by creating a functional test. Not sure if the approach is right, I started two nodes, the second one with -prune=550 and then, node0 mines 20 blocks, I get the bestblockhash from node0 and use it on getblockfrompeer (node1 –> node0), it should return an error (apparently it worked):

    Hey @brunoerg, thanks for giving it a try, but unfortunately I don’t think this approach works reliably. The problem is that the outcome of this test is a race because node 1 is downloading the blocks from node 0 in the background. It may or may not have the current tip in flight by the time the assert is called. We try to avoid tests where the outcome is not 100% reliable because we are already seeing intermittent test failures quite often.

    What would be needed is a reliable way to let node 1 sync the headers but then prevent it from syncing the blocks. It seems we don’t have something like this and I am now thinking about how this could be done and where else it could be useful.

  6. brunoerg commented at 1:40 pm on January 2, 2022: contributor

    The problem is that the outcome of this test is a race because node 1 is downloading the blocks from node 0 in the background.

    Interesting, this is new to me.

    def setup_network(self):
        self.setup_nodes()
    

    even with this config (setup_network), will node1 download the blocks from node0 in the background after node0 mines 200 more blocks?

    def run_test(self):
        self.log.info("Mine 200 blocks on Node 0")
        self.generate(self.nodes[0], 200, sync_fun=self.no_op)
        assert_equal(self.nodes[0].getblockcount(), 400)

        self.log.info("Connect nodes")
        self.connect_nodes(0, 1)

        peers = self.nodes[1].getpeerinfo()
        assert_equal(len(peers), 1)
        peer_1_peer_0_id = peers[0]["id"]
        best_block_hash_0 = self.nodes[0].getbestblockhash()
        assert_raises_rpc_error(-1, 'In prune mode, only blocks that the node has already synced previously can be fetched from a peer', self.nodes[1].getblockfrompeer, best_block_hash_0, peer_1_peer_0_id)

        self.sync_blocks()
        self.nodes[1].getblockfrompeer(best_block_hash_0, peer_1_peer_0_id)
    

    I thought the first getblockfrompeer call inside the assert would fail because I didn’t set it up to sync blocks, and the second one would work because of self.sync_blocks().

  7. fjahr commented at 2:25 pm on January 2, 2022: contributor

    even with this config (setup_network), will node1 download the blocks from node0 in the background after node0 mines 200 more blocks?

    No, the nodes indeed cannot sync if they are not connected. But you are connecting the nodes in your test here: self.connect_nodes(0, 1). The moment the nodes are connected they also start syncing. You can see this, for example, by inserting print(self.nodes[1].getblockcount()) in the line before your first assert and then running the test a couple of times. You will see that the node is not at 200 blocks anymore, and each time you run the test it will be a different number because of this race between the different processes.

    I thought the first getblockfrompeer into the assert would fail because I didn’t setup it to sync blocks and the second one would work because of self.sync_blocks().

    What sync_blocks() does is wait for all the nodes to be caught up with each other. This ensures the reverse of our problem doesn’t happen: a test where all the nodes need to be caught up to continue does not fail intermittently, so the second assert is safe. But for the first assert we would basically need the inverse of that functionality, i.e. something that ensures that the nodes are definitely not caught up with each other until we have finished what we want to test (see the sketch below).
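
    For illustration, a minimal sketch of what sync_blocks() effectively waits for (simplified; the real helper lives in the functional test framework and does more, e.g. mempool syncing variants and better error reporting):

    import time

    # Simplified sketch of the sync_blocks() idea: poll until every node
    # reports the same best block hash, or give up after a timeout.
    def wait_until_synced(nodes, timeout=60):
        deadline = time.time() + timeout
        while time.time() < deadline:
            if len({node.getbestblockhash() for node in nodes}) == 1:
                return
            time.sleep(0.5)
        raise AssertionError("nodes did not sync within the timeout")

    There is no equivalent way to poll for “definitely not synced yet”, which is why the approach below withholds the blocks entirely and delivers only a header.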

  8. fjahr commented at 4:09 pm on January 2, 2022: contributor
    I think we have a working test now. I remembered that we can submit the header via a P2PInterface to the pruning node and that seems to work. :)
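
    The idea, as a hedged sketch (not the exact committed test; names like miner and pruned_node are illustrative): hand the pruned node only the header of a block it has not synced yet via a P2P connection, then check that getblockfrompeer refuses the fetch.

    from test_framework.messages import CBlockHeader, from_hex, msg_headers
    from test_framework.p2p import P2PInterface
    from test_framework.util import assert_raises_rpc_error

    # Inside run_test(): miner is a non-pruned node that is NOT connected
    # to pruned_node, so the block itself does not propagate on its own.
    blockhash = self.generate(miner, 1, sync_fun=self.no_op)[0]

    # Deliver only the header to the pruned node over P2P.
    peer = pruned_node.add_p2p_connection(P2PInterface())
    headers_message = msg_headers()
    headers_message.headers = [from_hex(CBlockHeader(), miner.getblockheader(blockhash, False))]
    peer.send_and_ping(headers_message)

    # The pruned node now knows the header but has not synced past that
    # height, so the new check rejects the request.
    peer_id = pruned_node.getpeerinfo()[0]["id"]
    assert_raises_rpc_error(-1,
        "In prune mode, only blocks that the node has already synced previously can be fetched from a peer",
        pruned_node.getblockfrompeer, blockhash, peer_id)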
  9. brunoerg commented at 7:01 pm on January 2, 2022: contributor

    You will see that the node is not at 200 blocks anymore and each time you run the test it will be a different number because of this race between the different processes.

    Yes, my bad. That approach would only appear to work: since node0 mines 200 more blocks, node1 probably hasn’t received the last one yet by the time of the assertion, but the nodes are syncing, so it could cause intermittent failures.

  10. brunoerg approved
  11. brunoerg commented at 7:48 pm on January 2, 2022: contributor

    tACK 105b2876c5cd4f3710e3510491070f47ae11b0e4

    Compiled the branch code on macOS 12 and started bitcoind with an empty datadir: ./src/bitcoind --prune=550 --daemon

    Got a hash of a recent block from a block explorer: 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b

    Executed getpeerinfo to see the connections and get some peer ids to use.

    And then, executed getblockfrompeer with 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b as block hash.

    ➜  bitcoin git:(fjahr) ✗ ./src/bitcoin-cli getblockfrompeer 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b 11
    error code: -1
    error message:
    In prune mode, only blocks that the node has already synced previously can be fetched from a peer
    ➜  bitcoin git:(fjahr) ✗ ./src/bitcoin-cli getblockfrompeer 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b 8
    error code: -1
    error message:
    In prune mode, only blocks that the node has already synced previously can be fetched from a peer
    ➜  bitcoin git:(fjahr) ✗ ./src/bitcoin-cli getblockfrompeer 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b 5
    error code: -1
    error message:
    In prune mode, only blocks that the node has already synced previously can be fetched from a peer
    
  12. Sjors commented at 4:12 pm on January 6, 2022: member
    Concept ACK, but light selfish preference for getting #23706 in first, since there’s a (small) conflict.
  13. luke-jr commented at 7:07 am on January 12, 2022: member
    Weak concept NACK. I think it’s better to allow it. Pruning is only best-effort anyway, not a guarantee.
  14. DrahtBot added the label Needs rebase on Jan 25, 2022
  15. fjahr force-pushed on Jan 25, 2022
  16. fjahr commented at 8:26 pm on January 25, 2022: contributor

    Rebased

    Weak concept NACK. I think it’s better to allow it. Pruning is only best-effort anyway, not a guarantee.

    Why do you think it’s better to allow it? Do you have any specific use cases in mind? Of course there is no guarantee to stay below the exact number, but for prune=550 this has the potential to double or triple the number quickly. And the worst case scenario of crashing due to a full disk seems bad enough that it’s worth disabling something where it’s unclear if there is a use case for it at all (using the RPC in this way). If there is a use case, I am genuinely interested in hearing about it and would then look for a better solution rather than just disallowing it.

  17. fjahr force-pushed on Jan 25, 2022
  18. DrahtBot removed the label Needs rebase on Jan 25, 2022
  19. Sjors commented at 1:08 pm on January 26, 2022: member

    In the context of ForkMonitor I use this feature to fetch blocks that are, by definition, at or below the tip height, i.e. stale blocks. This PR doesn’t impact that use case, because these nodes are always up to date.

    In fact, this PR adds some safety for when a fresh node is added to the site, or an existing node is reinstalled, and it’s still catching up (though in practice we don’t call getblockfrompeer on a node that is in IBD).

    Another use case that seems obvious to me is to fetch an historical block, perhaps because you’re rescanning a wallet. In that case I don’t see any harm in waiting until the node is synced.

  20. DrahtBot added the label Needs rebase on Jan 27, 2022
  21. fjahr force-pushed on Jan 29, 2022
  22. fjahr commented at 4:56 pm on January 29, 2022: contributor
    Rebased
  23. DrahtBot removed the label Needs rebase on Jan 29, 2022
  24. DrahtBot added the label Needs rebase on Jun 1, 2022
  25. rpc: Pruned nodes can not fetch unsynced blocks
    While a node is still catching up to the tip that it is aware of via the headers, the user can currently use getblockfrompeer to fetch blocks close to the tip. These blocks are stored in the current block/rev file which otherwise contains blocks the node is receiving as part of the syncing process.
    
    This creates a problem for pruned nodes: the files containing a fetched block are not pruned during syncing because they contain a block close to the tip. This means the entire file will not be pruned until the tip has moved on far enough from the fetched block. In extreme cases with heavy pruning (550) and multiple blocks being fetched this could mean that the disk usage far exceeds what the user expects, potentially running out of space.
    7fa851fba8
  26. test: Add test for getblockfrompeer on syncing pruned nodes 5826bf546e
  27. fjahr force-pushed on Jun 5, 2022
  28. fjahr commented at 11:44 pm on June 5, 2022: contributor
    Rebased
  29. DrahtBot removed the label Needs rebase on Jun 6, 2022
  30. Sjors commented at 9:49 am on July 18, 2022: member

    utACK 5826bf546e83478947edbdf49978414f0b69eb1a

    Very nice test.

  31. achow101 commented at 9:09 pm on October 25, 2022: member
    ACK 5826bf546e83478947edbdf49978414f0b69eb1a
  32. in src/rpc/blockchain.cpp:458 in 7fa851fba8 outdated
    452@@ -453,6 +453,12 @@ static RPCHelpMan getblockfrompeer()
    453         throw JSONRPCError(RPC_MISC_ERROR, "Block header missing");
    454     }
    455 
    456+    // Fetching blocks before the node has synced past their height can prevent block files from
    457+    // being pruned, so we avoid it if the node is in prune mode.
    458+    if (index->nHeight > chainman.ActiveChain().Tip()->nHeight && node::fPruneMode) {
    


    furszy commented at 9:50 pm on October 25, 2022:

    In 7fa851fb: (non-blocking nit)

    Why not use IsBlockPruned instead?

    That should be the same as saying that, in prune mode, one can only fetch blocks that were downloaded and discarded.


    aureleoules commented at 0:21 am on October 26, 2022:
    If IsBlockPruned was used instead, fetching an older block that isn’t pruned (blocks close to the tip, for instance) on a pruned node would result in the error In prune mode, only blocks that the node has already synced previously can be fetched from a peer instead of Block already downloaded.

    andrewtoth commented at 3:33 pm on October 26, 2022:

    Doesn’t chainman.ActiveChain() need to be called with cs_main locked?

    CChain& ActiveChain() const EXCLUSIVE_LOCKS_REQUIRED(GetMutex()) { return ActiveChainstate().m_chain; }
    

    achow101 commented at 3:59 pm on October 26, 2022:

    Hmm, indeed it does. I really need to be building with clang…

    rpc/blockchain.cpp:464:35: warning: calling function 'ActiveChain' requires holding mutex 'cs_main' exclusively [-Wthread-safety-analysis]
        if (index->nHeight > chainman.ActiveChain().Tip()->nHeight && node::fPruneMode) {
                                      ^
    1 warning generated.
    
  33. aureleoules approved
  34. aureleoules commented at 0:22 am on October 26, 2022: member
    tACK 5826bf546e83478947edbdf49978414f0b69eb1a. I tested the behavior by invalidating the tip with invalidateblock and trying to fetch it again with getblockfrompeer, which resulted in In prune mode, only blocks that the node has already synced previously can be fetched from a peer as expected (see the sketch below).
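
    In functional-test form that check could look roughly like this (a hypothetical sketch, assuming self.nodes[1] is the pruned node and is connected to self.nodes[0]):

    from test_framework.util import assert_raises_rpc_error

    # Invalidate the current tip: the active chain drops below the
    # invalidated block's height, so the node no longer counts as having
    # synced past it.
    tip = self.nodes[1].getbestblockhash()
    self.nodes[1].invalidateblock(tip)

    # Fetching the invalidated block now hits the new prune-mode check.
    peer_id = self.nodes[1].getpeerinfo()[0]["id"]
    assert_raises_rpc_error(-1,
        "In prune mode, only blocks that the node has already synced previously can be fetched from a peer",
        self.nodes[1].getblockfrompeer, tip, peer_id)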
  35. achow101 merged this on Oct 26, 2022
  36. achow101 closed this on Oct 26, 2022

  37. maflcko referenced this in commit ec92d23fb8 on Oct 26, 2022
  38. sidhujag referenced this in commit 02a24fbe9d on Oct 27, 2022
  39. sidhujag referenced this in commit 261830c108 on Oct 27, 2022
  40. bitcoin locked this on Oct 26, 2023
