Assumeutxo: bugfix on loadtxoutset with a divergent chain + test

alfonsoromanz commented at 3:13 pm on April 29, 2024: contributor

This PR adds a test to cover the scenario of loading an assumeutxo snapshot when the current chain tip is not an ancestor of the snapshot block but has less work.

During the review process, a bug was discovered where blocks between the last common ancestor and the background tip were not being requested if the background tip was not an ancestor of the snapshot block. mzumsande suggested a fix (65343ec49a6b73c4197dfc38e1c2f433b0a3838a) to start downloading historical blocks from the last common ancestor to address this issue. This fix has been incorporated into the PR with a slight modification.

Related to #28648

DrahtBot commented at 3:13 pm on April 29, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	fjahr, mzumsande, achow101
Concept ACK	naiyoma
Stale ACK	rkrux

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#30403 (test, assumeutxo: Remove resolved todo comments and add new test by fjahr)
#30320 (assumeutxo: Don’t load a snapshot if it’s not in the best header chain by mzumsande)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot added the label Tests on Apr 29, 2024

alfonsoromanz force-pushed on Apr 30, 2024

alfonsoromanz commented at 7:25 pm on April 30, 2024: contributor

I splitted the original commit into two commits (one for each test). The second test (https://github.com/bitcoin/bitcoin/pull/29996/commits/af0f401258e0c189799a36f4487eaa751d779e7b) may be redundant with this one: #29428. The only difference is that my test is executed on a node that has a divergent chain after block 199. I did that to cover this scenario described in the comments

[...] Loading a snapshot when the current chain tip is: [...] Not an ancestor or a descendant of the snapshot block and has more work

If it’s redundant I can delete the second commit. Thoughts?

naiyoma commented at 10:05 pm on April 30, 2024: contributor

Concept ACK, covers more tests scenarios for AssumeUTXO

in test/functional/feature_assumeutxo.py:172 in af0f401258 outdated

167+        assert_equal(loaded['coins_loaded'], SNAPSHOT_BASE_HEIGHT)
168+        assert_equal(loaded['base_height'], SNAPSHOT_BASE_HEIGHT)
169+
170+        # Restart the node and delete the snapshot chainstate from previous test.
171+        self.restart_node(2, extra_args=['-reindex-chainstate=1', *self.extra_args[2]])
172+        assert_equal(node.getblockcount(), 298)

kevkevinpal commented at 1:23 am on May 2, 2024:

instead of number 298 would it make more sense to use SNAPSHOT_BASE_HEIGHT?

alfonsoromanz commented at 11:38 am on May 2, 2024:

Actually SNAPSHOT_BASE_HEIGHT is 299, and here I am trying to make sure that the snapshot was deleted and therefore the node is now using it’s previous chain which has less work. I changed it to assert node.getblockcount() < SNAPSHOT_BASE_HEIGHT for more clarity

in test/functional/feature_assumeutxo.py:155 in af0f401258 outdated

147@@ -152,6 +148,38 @@ def test_invalid_mempool_state(self, dump_output_path):
148 
149         self.restart_node(2, extra_args=self.extra_args[2])
150 
151+    def test_snapshot_in_a_divergent_chain(self, dump_output_path):
152+        # First rollback node2's chain to the pregenerated one, up to height 199
153+        node = self.nodes[2]
154+        current_height = node.getblockcount()
155+        base_height = 199

kevkevinpal commented at 1:25 am on May 2, 2024:

is this not the same as START_HEIGHT?

alfonsoromanz commented at 11:39 am on May 2, 2024:

Yes, good catch. This line was removed since the iteration is no longer needed based on your other comment below.

in test/functional/feature_assumeutxo.py:208 in af0f401258 outdated

154+        current_height = node.getblockcount()
155+        base_height = 199
156+        for block_height in range(current_height, base_height, -1):
157+            block_hash = node.getblockhash(block_height)
158+            node.invalidateblock(block_hash)
159+        assert_equal(node.getblockcount(), START_HEIGHT)

kevkevinpal commented at 1:34 am on May 2, 2024:

0        block_hash = node.getblockhash(START_HEIGHT + 1)
1        node.invalidateblock(block_hash)
2        assert_equal(node.getblockcount(), START_HEIGHT)

alfonsoromanz commented at 11:39 am on May 2, 2024:

That’s a good one. Thanks. I Fixed it.

alfonsoromanz force-pushed on May 2, 2024

in test/functional/feature_assumeutxo.py:205 in ce80f7ec9a outdated

147@@ -152,6 +148,35 @@ def test_invalid_mempool_state(self, dump_output_path):
148 
149         self.restart_node(2, extra_args=self.extra_args[2])
150 
151+    def test_snapshot_in_a_divergent_chain(self, dump_output_path):
152+        # First rollback node2's chain to the pregenerated one, up to height 199
153+        node = self.nodes[2]

rkrux commented at 1:05 pm on May 8, 2024:

Nit: Because this test involves dealing with multiple nodes, for verbosity and clarity we can call this variable as second_node.

alfonsoromanz commented at 1:48 pm on May 10, 2024:

Thanks for your review. Actually this test uses only one node and I used the same notation used in other tests, but I am willing to change it if others agree.

in test/functional/feature_assumeutxo.py:212 in ce80f7ec9a outdated

155+        node.invalidateblock(block_hash)
156+        assert_equal(node.getblockcount(), START_HEIGHT)
157+
158+        self.log.info(f"Check importing a snapshot where current chain-tip is not an ancestor of the snapshot block but has less work")
159+        # Generate a divergent chain in node2 but with less work compared to the snapshot
160+        self.generate(node, nblocks=99, sync_fun=self.no_op)

rkrux commented at 1:06 pm on May 8, 2024:

nblocks=99 We can make this value dependent on SNAPSHOT_BASE_HEIGHT such as (SNAPSHOT_BASE_HEIGHT - 200) or (SNAPSHOT_BASE_HEIGHT / 2) so that it’s tied to SNAPSHOT_BASE_HEIGHT, and any future changes will automatically account for this variable as well.

alfonsoromanz commented at 1:48 pm on May 10, 2024:

Good point. The goal of this test is to generate a divergent chain starting from START_HEIGHT but having less work than the snapshot, which height equals SNAPSHOT_BASE_HEIGHT. Maybe I can set nBlocks=SNAPSHOT_BASE_HEIGHT-START_HEIGHT-1. This will result in 99 and it will adapt to any future changes to either START_HEIGHT or SNAPSHOT_BASE_HEIGHT

rkrux commented at 7:13 am on May 12, 2024:

Yes, that would be good.

alfonsoromanz commented at 3:14 pm on May 13, 2024:

Done. Thanks

in test/functional/feature_assumeutxo.py:220 in ce80f7ec9a outdated

163+        loaded = node.loadtxoutset(dump_output_path)
164+        assert_equal(loaded['coins_loaded'], SNAPSHOT_BASE_HEIGHT)
165+        assert_equal(loaded['base_height'], SNAPSHOT_BASE_HEIGHT)
166+
167+        # Restart the node and delete the snapshot chainstate from previous test.
168+        self.restart_node(2, extra_args=['-reindex-chainstate=1', *self.extra_args[2]])

rkrux commented at 1:11 pm on May 8, 2024:

Got to know from the CPP code that -reindex-chainstate will load up from the blk.dat files. That’s why I understand only 2 more blocks are needed to exceed the work of the snapshot chain.

in test/functional/feature_assumeutxo.py:230 in ce80f7ec9a outdated

173+        # This covers the scenario where the snapshot block is not on the most-work chain
174+        self.generate(node, nblocks=2, sync_fun=self.no_op)
175+        assert node.getblockcount() > SNAPSHOT_BASE_HEIGHT
176+        # Import the snapshot and assert its failure
177+        with node.assert_debug_log(expected_msgs=["[snapshot] activation failed - work does not exceed active chainstate"]):
178+            assert_raises_rpc_error(-32603, "Unable to load UTXO snapshot", node.loadtxoutset, dump_output_path)

rkrux commented at 1:14 pm on May 8, 2024:

-32603

Although not necessary for this PR, I am wondering how big of a lift would it be to tie these errors to the ones defined here: https://github.com/bitcoin/bitcoin/blob/master/src/rpc/protocol.h

alfonsoromanz commented at 3:15 pm on May 13, 2024:

Not sure. Maybe you can try opening an issue to discuss it

rkrux approved

rkrux commented at 1:16 pm on May 8, 2024: contributor

tACK ce80f7e

Make and tests successful, thanks for adding this test - it’s easy to follow through.

DrahtBot added the label Needs rebase on May 13, 2024

alfonsoromanz force-pushed on May 13, 2024

alfonsoromanz commented at 3:18 pm on May 13, 2024: contributor

Rebased. I also addressed some feedback from rkrux’s comments: changed hardcoded value of nblocks from 99 to a dynamic value, i.e SNAPSHOT_BASE_HEIGHT-START_HEIGHT-1

DrahtBot removed the label Needs rebase on May 13, 2024

naiyoma commented at 8:43 pm on May 16, 2024: contributor

Test passes for all three scenarios:

https://github.com/bitcoin/bitcoin/pull/29996/commits/a7501f779eb84effd5579070b8b1c1a3020273d6 loading a snapshot when chain tip isn’t ancestor but has less work
https://github.com/bitcoin/bitcoin/pull/29996/commits/3f033542b869dfbe037cda705c3b8f7faa32da5d loading a snapshot when chain tip isn’t ancestor/descendant, has more work and A valid snapshot file & snapshot block, but the block is not on the most-work chain

naiyoma commented at 9:17 pm on May 16, 2024: contributor

I splitted the original commit into two commits (one for each test). The second test (af0f401) may be redundant with this one: #29428. The only difference is that my test is executed on a node that has a divergent chain after block 199. I did that to cover this scenario described in the comments

[...] Loading a snapshot when the current chain tip is: [...] Not an ancestor or a descendant of the snapshot block and has more work

If it’s redundant I can delete the second commit. Thoughts?

IMO you should not delete the second commit, both tests verify that the snapshot load fails but under different scenarios

DrahtBot added the label Needs rebase on Jun 5, 2024

alfonsoromanz force-pushed on Jun 6, 2024

in test/functional/feature_assumeutxo.py:218 in 50cdb2450f outdated

209+
210+        self.log.info(f"Check importing a snapshot where current chain-tip is not an ancestor of the snapshot block but has less work")
211+        # Generate a divergent chain in node2 but with less work compared to the snapshot
212+        self.generate(node, nblocks=99, sync_fun=self.no_op)
213+        assert node.getblockcount() < SNAPSHOT_BASE_HEIGHT
214+        # Try importing the snapshot and assert its success

maflcko commented at 11:01 am on June 6, 2024:

This should also check that the background validation succeeds. Otherwise there could be a bug where the diverging chain is not rewound?

alfonsoromanz commented at 4:12 pm on June 7, 2024:

Thanks for pointing this out. The background validation does not seem to finish in this case.

I am now testing with a new node n3 (in my local copy) to avoid the manual rollback to START_HEIGHT (L207). Additionally, sync_blocks() was throwing a timeout when reusing n2 for this test (not sure why).

Here’s the new approach I’m trying:

The new node starts at START_HEIGHT (199).
I generate a divergent chain from START_HEIGHT up to height 298 (< SNAPSHOT_BASE_HEIGHT).
I load the snapshot (height=299).

After loading the snapshot, I can see these two chain states: 

0[{'blocks': 298, 'bestblockhash': '171f1d8af9371c9d54a3731f8befdfa6dd2fe553831970ddffdaeb0b93aa54d3', 'difficulty': Decimal('4.656542373906925E-10'), 'verificationprogress': 1, 'coins_db_cache_bytes': 7969177, 'coins_tip_cache_bytes': 438304768, 'validated': True}, {'blocks': 299, 'bestblockhash': '3bb7ce5eba0be48939b7a521ac1ba9316afee2c7bada3a0cca24188e6d7d96c0', 'difficulty': Decimal('4.656542373906925E-10'), 'verificationprogress': 1, 'coins_db_cache_bytes': 419430, 'coins_tip_cache_bytes': 23068672, 'snapshot_blockhash': '3bb7ce5eba0be48939b7a521ac1ba9316afee2c7bada3a0cca24188e6d7d96c0', 'validated': False}]

Next, I connect the nodes and ensure they all see the same tip: 

0self.connect_nodes(0, 3)
1self.wait_until(lambda: n3.getchainstates()['chainstates'][-1]['blocks'] == FINAL_HEIGHT)
2self.sync_blocks(nodes=(self.nodes[0], n3))

 After syncing, these are the chain states:

0[{'blocks': 298, 'bestblockhash': '171f1d8af9371c9d54a3731f8befdfa6dd2fe553831970ddffdaeb0b93aa54d3', 'difficulty': Decimal('4.656542373906925E-10'), 'verificationprogress': 1, 'coins_db_cache_bytes': 7969177, 'coins_tip_cache_bytes': 438304768, 'validated': True}, {'blocks': 399, 'bestblockhash': '193ad9344966a54125f4b8d3596572c356ca3dcc216b25b91cb0022fbf61c7e1', 'difficulty': Decimal('4.656542373906925E-10'), 'verificationprogress': 1, 'coins_db_cache_bytes': 419430, 'coins_tip_cache_bytes': 23068672, 'snapshot_blockhash': '3bb7ce5eba0be48939b7a521ac1ba9316afee2c7bada3a0cca24188e6d7d96c0', 'validated': False}]

It seems that the snapshot chain has now synced to the tip (height 399). However, this line times out after syncing the blocks:

self.wait_until(lambda: len(n3.getchainstates()['chainstates']) == 1)

I’m not sure if I’m doing something wrong or if there is indeed a bug where the divergent chain is not rewound. I will continue investigating.

Something to note here: if I follow the same process but I don’t generate any divergent chain, then the validation completes successfully.

Any directions on how to proceed with this would be appreciated.

maflcko commented at 9:12 am on June 9, 2024:

I’m not sure if I’m doing something wrong or if there is indeed a bug where the divergent chain is not rewound. I will continue investigating.

This sounds like a bug. Just to clarify, the active chain is stuck at height 298 and the background chain continues to sync past 399?

alfonsoromanz commented at 12:13 pm on June 10, 2024:

I guess I am confused with the terms “active” and “background” chain. I assume the active chain is the snapshot chain which starts at height 299, and the divergent chain is the one stuck at 298. I base this on the fact that when I run getbestblockhash after loadtxoutset, I get the hash of the snapshot tip. However, I might be mistaken. Shouldn’t the divergent chain be rewound to START_HEIGHT and become the background validation chain?

To address your questions:

The divergent chain is indeed stuck at height 298.
The snapshot chain continues to sync past height 399. Although 399 is the FINAL_HEIGHT for this test, I was able to mine an additional 100 blocks on top of node0, resulting in both node0 and node3 syncing again, and the snapshot chain syncing up to height 499.

However, even after syncing past 399, the background validation does not seem to finish. I always get a timeout when running this line after syncing the nodes:

0self.wait_until(lambda: len(node.getchainstates()['chainstates']) == 1)

Additionally, I am experiencing an intermittent issue where the sync does not always finish, and I have not been able to determine the cause yet. This issue happens only after mining past 399.

mzumsande commented at 10:35 pm on June 10, 2024:

I can confirm this behavior and also think that this has uncovered a bug in net_processing.

I think that the root cause is in TryDownloadingHistoricalBlocks (which is responsible for downloading the background chain): This function calls FindNextBlocks() with pindexWalk set to the current tip of the background chainstate (from_tip), and target_block set to the snapshot block. FindNextBlocks then walks from the snapshot block backwards to the the height of from_tip, save these blocks in vToFetch and then begins to download these blocks in forward order. This is incorrect, because the blocks in starting at the last common ancestor of from_tip and the snapshot block, up to the height of from_tip are never requested that way (their height is smaller than the height of from_tip). So, my proposed fix would be something like https://github.com/mzumsande/bitcoin/commit/edb2b69a16889552ddd40b71a491e7722d9b4e12 (feel free to cherry-pick/ adjust as you like). @alfonsoromanz: Could you check if that fix would solve the issue for you? @ryanofsky Could you take a look - would you agree with that explanation and the proposed fix?

mzumsande commented at 11:05 pm on June 10, 2024:

Although I’m not sure if this is an actual problem in practice, because with our DoS protections for low-work blocks and headers we shouldn’t really get into this situation with a divergent chain in the first place.

fjahr commented at 8:50 am on June 11, 2024:

Hi, I am a bit confused as well but I think I have an idea :) I think the approach chosen in this test doesn’t work for what it’s actually trying to do. Block 299 is marked invalid in the node under test so it’s not surprising that the node is stuck on 298. I don’t understand how the “divergent” chain could get “rewound” when the alternative “real” chain is still marked invalid. I don’t think the dump should mark and invalid block valid but that is what it would need to do in order to sync the chain containing the snapshot and connect both chainstates. What is indeed surprising here is that the dump does load even though the base block is invalid. I have suggested a fix here: #30267 although this scenario seems highly unlikely in practice. I am using a simplified version of your test here and made you co-author there @alfonsoromanz .

So I think the approach here would need to be altered so that the right chain is not marked invalid. I think that is fairly hard to do in regtest. It seems we would need a (premined) heavy chain to simulate this scenario. I think the scenario that James had in mind is also a lot less likely to occur since we pre-sync headers (see #25717). I didn’t check but I guess this test file predates that PR and a test would have been more practical when the attack scenario described there would have been exploited.

I don’t have another idea how to test this in a simpler way but such a heavy work regtest chain could be useful for a number of scenario, don’t you think @maflcko ?

EDIT: Also note that the todo in the comment explicitly states “Valid snapshot file and snapshot block” but this is not the case here. The snapshot block is invalid.

alfonsoromanz commented at 9:42 am on June 11, 2024:

Thanks @fjahr and @mzumsande, for your help.

Block 299 is marked invalid in the node under test so it’s not surprising that the node is stuck on 298 […] So I think the approach here would need to be altered so that the right chain is not marked invalid

Yes, that makes sense. Initially, I didn’t want to create a new node for this test and instead wanted to reuse the existing node2. To achieve this, I needed to rollback the chain to create a divergent chain from START_HEIGHT up to height 298. I used invalidateblock for that purpose, but I didn’t consider the side effects.

After trying to ensure the background validation finishes, I encountered an issue where I was unable to sync node0 and node2. This led me to adopt a new approach that doesn’t require invalidation by creating a new node starting at START_HEIGHT and creating a divergent chain from there. However, I did not push that code yet because the validation was not working either until @mzumsande provided the fix.

So, my proposed fix would be something like https://github.com/mzumsande/bitcoin/commit/edb2b69a16889552ddd40b71a491e7722d9b4e12 (feel free to cherry-pick/ adjust as you like).

@alfonsoromanz: Could you check if that fix would solve the issue for you?

Yes, this issue is fixed with your proposed change and my new approach with a new node3. Here is the updated approach:

The new node starts at START_HEIGHT (199).
I generate a divergent chain from START_HEIGHT up to height 298 (< SNAPSHOT_BASE_HEIGHT).
I load the snapshot (height=299).
I ensure the background validation finishes.

I can continue working on this new approach and also on adapting the second part of this test, which involves a divergent chain with more work than the snapshot. However this new approach will require merging @mzumsande proposed fix to work.

Thanks again, everyone!

fjahr commented at 9:45 am on June 11, 2024:

Huh, I didn’t see @mzumsande ’s post when I wrote mine… Will look into this approach as well.

fjahr commented at 10:14 am on June 11, 2024:

Yes, this issue is fixed with your proposed change and my new approach with a new node3. Here is the updated approach:

Sounds good to me in principle but before doing 3 ensure that the node’s tip is actually the tip of the divergent chain. This may happen through the node not knowing the real chain, i.e. not being in sync, but I don’t think this is what we are actually trying to test here.

Maybe I am blind or interpret the original Todo differently than the rest of you but I don’t see a simpler approach than the following to actually have a scenario that simulates what might happen on mainnet:

Start with a new node under test that is not connected to the nodes that have the real chain
Mine a divergent chain of the same height as the real chain
Compare total work of the real chain with the divergent chain and grind the last block(s) of the divergent chain until you see that its tip has less total work than that of the real chain, if this is not the case right away.
Then connect the node under test to a node with the real chain
The node under test should have knowledge of the real chain with more work but it should still have the divergent chain as its tip because it has seen that block first
Now load the snapshot into node under test

fjahr commented at 10:28 am on June 11, 2024:

To clarify the meaning of the Todo:

Interesting starting states could be loading a snapshot when the current chain tip is:

TODO: Not an ancestor of the snapshot block but has less work

Particularly “but has less work” could mean A) less work than the tip of the chain that includes the snapshot or B) less work than the snapshot block itself. I believe A is the more interesting and intended scenario because B only seems to be a state a node can be in if the snapshot block is invalid or its ancestors are unknown which should lead to a much earlier error.

ryanofsky commented at 2:46 pm on June 11, 2024:

re: https://github.com/bitcoin/bitcoin/pull/29996/files#r1633915935

my proposed fix would be something like mzumsande@edb2b69 (feel free to cherry-pick/ adjust as you like).

Nice find! Would suggest opening a separate PR so it is easier to understand the problem and fix. And maybe it is possible to come up with a simpler test for this problem specifically, like by adding an assert in FindNextBlocks() that pindexWalk is an ancestor of state->pindexBestKnownBlock and then adding a test that triggers the assert.

Would also consider tweaking the fix to call LastCommonAncestor() before calling TryDownloadingHistoricalBlocks(), so it is easier to understand from_tip variable being the immediate predecessor of blocks to download next instead of having a more complicated meaning.

ryanofsky commented at 3:22 pm on June 11, 2024:

re: #29996 (review)

To clarify the meaning of the Todo:

Interesting starting states could be loading a snapshot when the current chain tip is:

TODO: Not an ancestor of the snapshot block but has less work

Particularly “but has less work” could mean A) less work than the tip of the chain that includes the snapshot or B) less work than the snapshot block itself. I believe A is the more interesting and intended scenario because B only seems to be a state a node can be in if the snapshot block is invalid or its ancestors are unknown which should lead to a much earlier error.

I think the todo list came from my comment #27596 (review), and I probably was thinking of interpretation (B) not (A), but maybe (B) is not a very interesting scenario.

I’m not sure why it needs to “lead to a much earlier error,” though, or lead to any error. If the node is syncing to some chain that seems to have the most work, but then headers for a second chain are announced that has more work, the chainstate tip is not going to switch to the second chain, even though it has more work, until enough blocks from it are downloaded and validated, and a block from the second chain is reached that is that is valid and has more work than the chainstate tip. Before that happens, a snapshot from the second chain could be loaded such that the current chain tip has less work than the snapshot block and is not an ancestor of the snapshot block, but the snapshot block is valid and its ancestors can be downloaded.

Or maybe that is wrong, but at least it’s my understanding.

mzumsande commented at 3:37 pm on June 11, 2024:

Sounds good to me in principle but before doing 3 ensure that the node’s tip is actually the tip of the divergent chain (…)

Maybe I misunderstand, but that seems overly complicated. I assume we’re talking about the scenario “Not an ancestor of the snapshot block but has less work”: In my local test, I just gave the node the headers of the snapshot chain, and then used -generate to mine a divergent chain from the old tip. Number of blocks doesn’t really matter. The other chain (with the snapshot in it) will have more work, but it is headers-only, so the tip will be on the divergent chain no matter how much work it has. Then, after the snapshot is loaded and we connected to a peer that has all the blocks, the node will successfully download the snapshot chain, but currently the background sync won’t complete unless you apply my fix above. I assume that @alfonsoromanz’s test (not pushed yet) works in a similar way.

have suggested a fix here: #30267

Huh, I didn’t see @mzumsande ’s post when I wrote mine… Will look into this approach as well.

Just to avoid any confusion: There are two independent issues. My issue pops up if you don’t use invalidateblock anymore, as the current version of the PR still does.

However this new approach will require merging @mzumsande proposed fix to work.

I neither want to hijack this PR nor open a PR with just the fix without a test, so my suggestion would be that you could incorporate the one-line fix into this PR (meaning that this PR wouldn’t be test-only anymore) - if you’re interested and have the time, that is.

alfonsoromanz commented at 4:07 pm on June 11, 2024:

Particularly “but has less work” could mean A) less work than the tip of the chain that includes the snapshot or B) less work than the snapshot block itself. I believe A is the more interesting and intended scenario because B only seems to be a state a node can be in if the snapshot block is invalid or its ancestors are unknown which should lead to a much earlier error.

I interpreted it as B) less work than the snapshot block itself.

Regarding A): If the new node (divergent chain) has less work than the chain tip (399) but more than the snapshot (299), the expected behavior would be to get the error: "Unable to Load UTXO Snapshot - [snapshot] activation failed - work does not exceed active chainstate." This same behavior is expected for the other scenario of a node with a divergent chain but with more work, i.e “TODO: Not an ancestor or a descendant of the snapshot block and has more work”

So both scenarios look very similar to me. That’s why I was leaning towards option B. However, I am not an expert on real scenarios in mainnet and I may be missing something.

Also, I started working on your approach and completed this part:

Start with a new node under test that is not connected to the nodes that have the real chain.

Mine a divergent chain of the same height as the real chain.

Compare the total work of the real chain with the divergent chain and grind the last block(s) of the divergent chain until its tip has less total work than that of the real chain, if this is not the case right away.

Then connect the node under test to a node with the real chain.

But this is where I get confused:

The node under test should have knowledge of the real chain with more work but it should still have the divergent chain as its tip because it has seen that block first.

After following you steps and connecting and syncing the nodes, the divergent chain is replaced with the original chain because has more work. If I don’t run sync, it’s not replaced, but I guess it’s just a matter of time? or maybe I don’t understand how connect and sync works.

Given that the snapshot will not be loaded because it doesn’t exceed the active chainstate, I don’t see much difference in making the divergent chain have less or more work than the original chain. What am I missing?

Either way, I am happy to add tests for both A) and B) scenarios.

In my local test, I just gave the node the headers of the snapshot chain, and then used -generate to mine a divergent chain from the old tip. Number of blocks doesn’t really matter. The other chain (with the snapshot in it) will have more work, but it is headers-only, so the tip will be on the divergent chain no matter how much work it has. Then, after the snapshot is loaded and we connected to a peer that has all the blocks, the node will successfully download the snapshot chain, but currently the background sync won’t complete unless you apply my fix above. I assume that @alfonsoromanz’s test (not pushed yet) works in a similar way.

Yes that’s what I’m doing in my local code (not pushed). I’m submitting the headers to n3 just like the original code is doing to n1 and n2. After that, I call this test function test_snapshot_in_a_divergent_chain where I generate the divergent chain and load the snapshot. The background validation only finishes if I apply your fix.

I neither want to hijack this PR nor open a PR with just the fix without a test, so my suggestion would be that you could incorporate the one-line fix into this PR (meaning that this PR wouldn’t be test-only anymore) - if you’re interested and have the time, that is.

Yes, I can incorporate it and add you as a co-author. Thanks

alfonsoromanz commented at 6:44 am on June 12, 2024:

I just pushed my recent changes for this PR. This is the approach I decided to move forward with:

Scenario Choice: Between these two scenarios mentioned by @fjahr: “Particularly ‘but has less work’ could mean A) less work than the tip of the chain that includes the snapshot or B) less work than the snapshot block itself,” I chose B) for this PR: less work than the snapshot block itself. I am happy to add tests for the other scenario in a separate PR.
Added Fix: I incorporated the fix from @mzumsande to start downloading historical blocks from the last common ancestor. This avoids the scenario where blocks in between are not requested, which causes the background validation to never finish. https://github.com/bitcoin/bitcoin/pull/29996/commits/8b6a18a7d5bdc6b2bb057613e661816917314084
New Node for “Divergent Chain with Less Work”: I added a new node (n3) for the scenario where we load the snapshot in a node with a divergent chain but less work. I am not reusing previous nodes because I don’t know a way to do a clean rollback without invalidating blocks. As mentioned by @fjahr, we shouldn’t expect to load a snapshot in a scenario where part of the snapshot was invalidated. There is actually a new PR to prevent this from happening: #30267.
New Node for “Divergent Chain with More Work”: I also added a new node (n4) for the scenario where we load the snapshot in a node with a divergent chain and more work. I was not able to reuse n3 for the same reason described previously: n3 has already synced to the tip, and I don’t know any other way to rollback the chain other than invalidating blocks.

Any feedback is appreciated.

Thanks!

fjahr commented at 4:19 pm on June 18, 2024:

After following you steps and connecting and syncing the nodes, the divergent chain is replaced with the original chain because has more work. If I don’t run sync, it’s not replaced, but I guess it’s just a matter of time? or maybe I don’t understand how connect and sync works.

When we see two blocks on the same height, we take the one that is seen first as the tip, not the one that has more work. We do this to ensure a miner who found a very low hash doesn’t withhold it. What I described builds on this behavior but maybe there are other things at play that I am missing. Might be that this doesn’t work with longer forks.

In my local test, I just gave the node the headers of the snapshot chain, and then used -generate to mine a divergent chain from the old tip. Number of blocks doesn’t really matter. The other chain (with the snapshot in it) will have more work, but it is headers-only, so the tip will be on the divergent chain no matter how much work it has.

Ok, I think it was never mentioned before that one of the chains is a headers-only chain. When not specified otherwise I assume knowing a “chain” means having the blocks as well. I guess I am interpreting the TODOs differently here than everyone else.

fjahr commented at 12:59 pm on June 19, 2024:

I’m not sure why it needs to “lead to a much earlier error,” though, or lead to any error. If the node is syncing to some chain that seems to have the most work, but then headers for a second chain are announced that has more work, the chainstate tip is not going to switch to the second chain, even though it has more work, until enough blocks from it are downloaded and validated, and a block from the second chain is reached that is that is valid and has more work than the chainstate tip. Before that happens, a snapshot from the second chain could be loaded such that the current chain tip has less work than the snapshot block and is not an ancestor of the snapshot block, but the snapshot block is valid and its ancestors can be downloaded.

That’s not how the tests currently work if I am not mistaken. The nodes currently get the headers of the “second chain” first and then they mine their own chain that is the referred to as the “divergent chain” in the tests. But you are correct, we are not failing as early as I thought. I think we should though. We are currently failing only pretty late when calling CBlockIndexWorkComparator() (see also the error message in the last test added here in the PR) and I find that a weird point to fail at if the base block is not in the best (header) chain. I think we should actually already fail when looking for the snapshot start block the first time.

So I think something like this might be interesting to discuss later as an optimization: https://github.com/fjahr/bitcoin/commit/3ef2eea6e750e811198019173a39f2ad0a4da415 It’s untested because it makes some unrelated tests fail that are run before the headers are synced. So it’s not critical at all. But this is actually how I thought things worked when I wrote the that we would fail earlier.

DrahtBot removed the label Needs rebase on Jun 6, 2024

in test/functional/feature_assumeutxo.py:447 in 50cdb2450f outdated

443@@ -419,10 +444,10 @@ def check_tx_counts(final: bool) -> None:
444                 self.wait_until(lambda: n.getindexinfo() == completed_idx_state)
445 
446 
447-        # Node 2: all indexes + reindex
448-        # -----------------------------
449+        # Node 2: all indexes + reindex + divergent chain

fjahr commented at 8:52 am on June 11, 2024:

I think this edit here (and in the log below) isn’t needed. “all indexes + reindex” is describing the configuration of the node. Divergent chain is just one of the test cases that this node runs through. I think just having the logs in the test you already have is enough.

alfonsoromanz commented at 6:45 am on June 12, 2024:

Fixed

in test/functional/feature_assumeutxo.py:227 in 50cdb2450f outdated

222+
223+        self.log.info(f"Check importing a snapshot where current chain-tip is not an ancestor or a descendant of the snapshot block and has more work")
224+        # Generate two extra blocks and make sure node2's chain has more work than the snapshot's chain
225+        # This covers the scenario where the snapshot block is not on the most-work chain
226+        self.generate(node, nblocks=2, sync_fun=self.no_op)
227+        assert node.getblockcount() > SNAPSHOT_BASE_HEIGHT

fjahr commented at 9:02 am on June 11, 2024:

I would prefer usage of assert_equal with the precise height you expect here.

alfonsoromanz commented at 6:45 am on June 12, 2024:

Fixed

in test/functional/feature_assumeutxo.py:230 in 50cdb2450f outdated

215@@ -219,6 +216,19 @@ def test_snapshot_in_a_divergent_chain(self, dump_output_path):
216         assert_equal(loaded['coins_loaded'], SNAPSHOT_BASE_HEIGHT)
217         assert_equal(loaded['base_height'], SNAPSHOT_BASE_HEIGHT)
218 
219+        # Restart the node and delete the snapshot chainstate from previous test.
220+        self.restart_node(2, extra_args=['-reindex-chainstate=1', *self.extra_args[2]])
221+        assert node.getblockcount() < SNAPSHOT_BASE_HEIGHT
222+
223+        self.log.info(f"Check importing a snapshot where current chain-tip is not an ancestor or a descendant of the snapshot block and has more work")

fjahr commented at 9:26 am on June 11, 2024:

This may be resolved through addressing the comments on the previous test case anyway but for this one also, both chains should be valid.

ryanofsky commented at 1:49 pm on June 11, 2024: contributor

This PR adds tests to cover two scenarios of loading a snapshot when the current chain tip is:

Not an ancestor of the snapshot block but has less work

Not an ancestor or a descendant of the snapshot block and has more work

In the second scenario, the snapshot block does not belong to the most-work chain anymore so I believe it covers this scenario too: TODO: Valid snapshot file and snapshot block, but the block is not on the most-work chain. Therefore I deleted that TODO as well.

I’m not sure about this last part. It should be possible for the current chain tip to have more work than the snapshot block, which is on a different chain, and for the chain the snapshot block is on to ultimately be the most-work chain. Maybe this is not an interesting scenario to test though, since it involves a later-reorg. I’m not sure.

alfonsoromanz force-pushed on Jun 12, 2024

alfonsoromanz commented at 7:16 am on June 12, 2024: contributor

This PR adds tests to cover two scenarios of loading a snapshot when the current chain tip is:

Not an ancestor of the snapshot block but has less work

Not an ancestor or a descendant of the snapshot block and has more work

In the second scenario, the snapshot block does not belong to the most-work chain anymore so I believe it covers this scenario too: TODO: Valid snapshot file and snapshot block, but the block is not on the most-work chain. Therefore I deleted that TODO as well.

I’m not sure about this last part. It should be possible for the current chain tip to have more work than the snapshot block, which is on a different chain, and for the chain the snapshot block is on to ultimately be the most-work chain. Maybe this is not an interesting scenario to test though, since it involves a later-reorg. I’m not sure.

Yes, that makes sense. I am not sure what exact scenario the comment refers to. If my statement is not correct, I can undo the comment deletion.

in test/functional/feature_assumeutxo.py:227 in 4b34a0cfb6 outdated

222+        assert_equal(loaded['base_height'], SNAPSHOT_BASE_HEIGHT)
223+
224+        # Now lets sync the nodes and wait for the background validation to finish
225+        self.connect_nodes(0, 3)
226+        self.sync_blocks(nodes=(n0, n3))
227+        print('Ensuring background validation finishes')

fjahr commented at 3:30 pm on June 18, 2024:

0        self.log.info('Ensuring background validation finishes')

alfonsoromanz commented at 11:48 am on June 19, 2024:

Fixed

in test/functional/feature_assumeutxo.py:230 in 4b34a0cfb6 outdated

225+        self.connect_nodes(0, 3)
226+        self.sync_blocks(nodes=(n0, n3))
227+        print('Ensuring background validation finishes')
228+        self.wait_until(lambda: len(n3.getchainstates()['chainstates']) == 1)
229+
230+        self.log.info(f"Check importing a snapshot where current chain-tip is not an ancestor or a descendant of the snapshot block and has more work")

fjahr commented at 3:36 pm on June 18, 2024:

This is a f-string now but its actually not needed here.

alfonsoromanz commented at 11:49 am on June 19, 2024:

Fixed

in test/functional/feature_assumeutxo.py:221 in 4b34a0cfb6 outdated

216+        assert_equal(n3.getblockcount(), SNAPSHOT_BASE_HEIGHT - 1)
217+
218+        # Try importing the snapshot and assert its success
219+        self.log.info('Importing the snapshot into n3')
220+        loaded = n3.loadtxoutset(dump_output_path)
221+        assert_equal(loaded['coins_loaded'], SNAPSHOT_BASE_HEIGHT)

fjahr commented at 4:22 pm on June 18, 2024:

I would remove this line because it doesn’t really add anything that the line below doesn’t test.

fjahr commented at 4:23 pm on June 18, 2024:

Maybe rather check the chainstates instead if you want to add something more but I think it’s not absolutely necessary.

alfonsoromanz commented at 12:01 pm on June 19, 2024:

Thanks. I removed the line as suggested. I wasn’t sure which chainstates check might be useful to add. Something like this, maybe?

fjahr commented at 12:49 pm on June 19, 2024:

Yeah, just that each of them is at the height you expect.

alfonsoromanz commented at 1:17 pm on June 19, 2024:

Thanks. Updated

alfonsoromanz force-pushed on Jun 19, 2024

alfonsoromanz commented at 12:26 pm on June 19, 2024: contributor

Pushed changes to address the latest feedback from fjahr

alfonsoromanz force-pushed on Jun 19, 2024

alfonsoromanz force-pushed on Jun 27, 2024

alfonsoromanz renamed this:
~~test: Assumeutxo: import snapshot in a node with a divergent chain~~
Assumeutxo: bugfix on loadtxoutset with a divergent chain + tests
on Jun 27, 2024

alfonsoromanz commented at 8:36 am on June 27, 2024: contributor

Rearranging the order of the commits to make it easier to test the bugfix: the first test (53f714d8bfd2331651445bcadb773a10f4d30264) can be run without the bugfix and is expected to fail, but it should succeed after applying the fix (65343ec49a6b73c4197dfc38e1c2f433b0a3838a).

I also updated the PR title and description to reflect that this PR is not a test-only PR anymore

mzumsande commented at 3:59 pm on June 28, 2024: contributor

Rearranging the order of the commits to make it easier to test the bugfix: the first test (53f714d) can be run without the bugfix and is expected to fail, but it should succeed after applying the fix (65343ec).

I’d suggest to revert that: The CI should pass on each commit, we even have a designated “test each commit” job that checks this (and now obviously fails, see below). It’s ok if reviewers have to rearrange commits locally for testing.

Would also consider tweaking the fix to call LastCommonAncestor() before calling TryDownloadingHistoricalBlocks(), so it is easier to understand from_tip variable being the immediate predecessor of blocks to download next instead of having a more complicated meaning.

Have you considered addressing this suggestion by @ryanofsky above? Without having tried it out, it makes sense to me.

alfonsoromanz force-pushed on Jun 29, 2024

p2p: Start downloading historical blocks from common ancestor

Otherwise, if the background tip is not an ancestor of the snapshot, blocks in between that ancestor up to the height of the background tip will never be requested.

Co-authored-by: Martin Zumsande <mzumsande@gmail.com>
Co-authored-by: Alfonso Roman Zubeldia <19962151+alfonsoromanz@users.noreply.github.com>

d35efe1efc

alfonsoromanz force-pushed on Jun 29, 2024

DrahtBot added the label CI failed on Jun 29, 2024

DrahtBot commented at 12:09 pm on June 29, 2024: contributor

🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the documentation.

Possibly this is due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

Leave a comment here, if you need help tracking down a confusing failure.

Debug: https://github.com/bitcoin/bitcoin/runs/26839553678

alfonsoromanz commented at 1:10 pm on June 29, 2024: contributor

Thanks @mzumsande. I have reverted the change regarding the commit order and have also addressed the feedback from @ryanofsky by calling LastCommonAncestor() before TryDownloadingHistoricalBlocks().

DrahtBot removed the label CI failed on Jun 29, 2024

achow101 referenced this in commit 9251bc7111 on Jul 2, 2024

in test/functional/feature_assumeutxo.py:219 in 69de09a836 outdated

214+        # Generate a divergent chain in n3 up to 298
215+        self.generate(n3, nblocks=99, sync_fun=self.no_op)
216+        assert_equal(n3.getblockcount(), SNAPSHOT_BASE_HEIGHT - 1)
217+
218+        # Try importing the snapshot and assert its success
219+        self.log.info("Importing the snapshot into n3")

fjahr commented at 2:22 pm on July 4, 2024:

nit: n3 is a pretty meaningless variable name that I wouldn’t include in the logs. The comment above seems to be enough to explain what is going on so I think this can just be removed. The “Ensuring…” comment is also not adding much IMO but I am not too passionate about it.

alfonsoromanz commented at 10:10 am on July 5, 2024:

I deleted the logs mentioned.

in test/functional/feature_assumeutxo.py:238 in 556ff68cea outdated

233+        # Generate a divergent chain in n4 that has more work than the snapshot
234+        # This covers the scenario where the snapshot block is not on the most-work chain
235+        self.generate(n4, nblocks=101, sync_fun=self.no_op)
236+        assert_equal(n4.getblockcount(), SNAPSHOT_BASE_HEIGHT + 1)
237+        # Import the snapshot and assert its failure
238+        with n4.assert_debug_log(expected_msgs=["[snapshot] activation failed - work does not exceed active chainstate"]):

fjahr commented at 3:13 pm on July 4, 2024:

This goes back to some of my earlier comments but I think it got lost because there were multiple things discussed at the same time: I think this is not the error we want to hit in this case (as I interpret the TODO) and I think we will not hit it in the future either, because once #30320 is merged this will hit the error message added there and not this one. This error message here mentions “work” but this is the actual work of the active chainstate, not some headers that the node has received. I.e. this is the case where the node has basically already synced when loading the utxo set. And that makes sense because here the node loading the snapshot has generated the chain itself, so it has already synced too. So I am not sure that this case here adds anything new compared to the existing case for this scenario in test_snapshot_with_less_work(). And I tend to think that actually #30320 completes the TODOs as I understand them and this commit here is not needed in addition to #30320 and test_snapshot_with_less_work().

cc @mzumsande - What do you think?

alfonsoromanz commented at 6:46 pm on July 4, 2024:

Thanks @fjahr . I’m open to deleting the third commit. I mentioned this earlier but kept it because I thought my scenario might be somehow different due to the divergent chain.

Your explanation helped me understand the concept of the headers chain, and I think you’re right that #30320 may cover the remaining TODOs. I interpreted the part about "...loading a snapshot when the current chain tip is: [...]" as referring to the chainstate of the node importing the snapshot, not the network’s chain tip.

I’ll wait for @mzumsande’s input to confirm before deleting the third commit if needed. I’ll also address your new feedback.

mzumsande commented at 5:11 am on July 5, 2024:

I think that “chain tip” usually refers to blocks we have connected to a chainstate (so headers-only would not qualify), so I would suspect that the original intention of the TODO was just what the third commit implements.

But I agree with @fjahr that it doesn’t really add too much to test_snapshot_with_less_work - less work is less work, no matter if the chain is divergent or not. Plus, I agree that if #30320 gets merged it would become a special case of that, because a chain tip with more work than the snapshot requires a header with more work (but not vice versa).

These are just my 2 cents, will review the PR in more detail soon.

Unrelated: @alfonsoromanz Thanks for the mention, but could you remove the “@” from my name in the PR description - if it’d get merged with it, I’ll get notifications whenever some altcoin fork cherrypicks that commit later.

alfonsoromanz commented at 10:14 am on July 5, 2024:

Thanks @mzumsande. Makes sense.

I removed the third commit since there is agreement that it is redundant with test_snapshot_with_less_work().

Unrelated: @alfonsoromanz Thanks for the mention, but could you remove the “@” from my name in the PR description - if it’d get merged with it, I’ll get notifications whenever some altcoin fork cherrypicks that commit later.

Sorry about that. The “@” was already removed by fanquake.

fjahr commented at 3:16 pm on July 4, 2024: contributor

The first two commits look very good but I am not sure the third commit is still needed, but I am not completely sure yet and would like to hear what others think.

test: loadtxoutset in divergent chain with less work 5b7f70ba26

alfonsoromanz force-pushed on Jul 5, 2024

alfonsoromanz commented at 10:18 am on July 5, 2024: contributor

I removed the third commit (https://github.com/bitcoin/bitcoin/commit/af0f401258e0c189799a36f4487eaa751d779e7b) since there is agreement that it is redundant with test_snapshot_with_less_work() (https://github.com/bitcoin/bitcoin/pull/29428). See discussion: #29996 (review)

alfonsoromanz renamed this:
~~Assumeutxo: bugfix on loadtxoutset with a divergent chain + tests~~
Assumeutxo: bugfix on loadtxoutset with a divergent chain + test
on Jul 5, 2024

fjahr commented at 7:20 pm on July 5, 2024: contributor

tACK 5b7f70ba2661a194a3c476b81e03785feddb4e1c

Reviewed the code and confirmed the test fails without the behavior change.

DrahtBot requested review from rkrux on Jul 5, 2024

fjahr commented at 5:23 pm on July 8, 2024: contributor

nit: Could mention #28648 as related in the PR description.

mzumsande commented at 8:06 pm on July 8, 2024: contributor

Code Review ACK 5b7f70ba2661a194a3c476b81e03785feddb4e1c

achow101 commented at 7:05 pm on July 10, 2024: member

ACK 5b7f70ba2661a194a3c476b81e03785feddb4e1c

achow101 merged this on Jul 10, 2024

achow101 closed this on Jul 10, 2024

Fabcien referenced this in commit 9b32c3476a on Apr 22, 2025

bitcoin locked this on Jul 10, 2025

Assumeutxo: bugfix on loadtxoutset with a divergent chain + test #29996

Code Coverage

Reviews

Conflicts