p2p: For assumeutxo, download snapshot chain before background chain #29519

pull mzumsande wants to merge 2 commits into bitcoin:master from mzumsande:202202_fix_assumeutxo_block_download changing 1 files +13 −3
  1. mzumsande commented at 9:44 pm on February 29, 2024: contributor
    After loading a snapshot, pindexLastCommonBlock is usually already set to some block for existing peers. That means we’d continue syncing the background chain from those peers instead of prioritising the snapshot chain, which defeats the purpose of doing assumeutxo in the first place. Only existing peers are affected by this bug.
  2. DrahtBot commented at 9:44 pm on February 29, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK fjahr
    Stale ACK Sjors, ryanofsky

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  3. DrahtBot added the label P2P on Feb 29, 2024
  4. fanquake added this to the milestone 27.0 on Mar 1, 2024
  5. fanquake requested review from ryanofsky on Mar 1, 2024
  6. fanquake commented at 4:18 pm on March 5, 2024: member
    Note that given that 27 will not ship with mainnet assumeutxo, this isn’t a blocker, and a fix can still be backported to 27.x if deemed necessary.
  7. Sjors commented at 3:21 pm on March 11, 2024: member

    Concept ACK

    ac547ea43beb1d66d2c11e7e8abb76dd1bfc2883 looks reasonable, but it would be nice to somehow add a test for it.

    What happens once we’ve reached the tip and need more peers for background sync? And once background sync is done, is it updated to the tip again?

    Would it make sense to introduce pindexLastCommonBackgroundSyncBlock so we have more control over what peers to dedicate to the tip vs background (which might take weeks on some devices)?

  8. mzumsande commented at 8:30 pm on March 11, 2024: contributor

    ac547ea looks reasonable, but it would be nice to somehow add a test for it.

    I tested this syncing on signet, but I’ll think about a functional test; I imagine that it would be hard to prevent race conditions in a functional test because we’d:

    • need to first connect to a peer that has the entire chain and start syncing the background chain
    • then load the snapshot (at the same time the peer would continue syncing the background chain)
    • then check that we switch to requesting from the snapshot chain (while the bg chain hopefully hasn’t finished downloading yet)

    It might be possible though if that peer was a P2PConnection() (that could stall / withhold certain blocks) instead of a real node.

    What happens once we’ve reached the tip and need more peers for background sync? And once background sync is done, is it updated to the tip again?

    Would it make sense to introduce pindexLastCommonBackgroundSyncBlock so we have more control over what peers to dedicate to the tip vs background (which might take weeks on some devices)?

    We have this logic in SendMessages: It first tries FindNextBlocksToDownload meant for the active chainstate, and, if it doesn’t fill up the vToDownload buffer (e.g. because we are at / near the tip), then calls TryDownloadingHistoricalBlocks specifically for the background chainstate. I think that this existing separation should work well, the problem that this PR fixes is that FindNextBlocksToDownload would pick historical blocks from the background chain when it shouldn’t because higher-prio blocks are available.

  9. Sjors commented at 3:45 pm on March 12, 2024: member

    I can see how it would be difficult to test, indeed.

    I encountered this scenario while testing #29370 on signet. All my peers were only downloading the background state. I only got headers for the tip chain, but it didn’t progress. The workaround there was to turn the network off and on, which is consistent with your observation that only existing peers are affected.

    utACK ac547ea43beb1d66d2c11e7e8abb76dd1bfc2883

  10. fjahr commented at 5:58 pm on March 25, 2024: contributor

    The code and reasoning look good to me and I have been trying to write a test but I am currently unable to reproduce the behavior described before the change, i.e. to make the test fail on master.

    Here is the test so far: https://github.com/fjahr/bitcoin/commit/478fb28dac4e22175da150cee97991a66b4f5695

    This now does fail on master but only because it also checks for the debug log HERE I have added. This log confirms that the test does trigger the behavior change, i.e. on master the last common block is set once, after the change here it is set twice within FindNextBlockToDownload. But, of course, we would rather like to have the test fail at the asserts at the end but that is not the case so far. @mzumsande do you have an idea why this doesn’t reproduce the behavior you have seen on master?

  11. mzumsande commented at 10:24 pm on March 25, 2024: contributor

    @mzumsande do you have an idea why this doesn’t reproduce the behavior you have seen on master?

    I tested your branch, and I see the same - this is really interesting, and it works differently than I thought when I opened the PR!

    I think the following happens in PeerManagerImpl::FindNextBlocks (after being called from FindNextBlocksToDownload()): We process up to 1024 blocks ahead of the current pindexWalk: https://github.com/bitcoin/bitcoin/blob/2102c978b5f2d0bfaa0290c00ed693555f3403d4/src/net_processing.cpp#L1430 But due to the lines https://github.com/bitcoin/bitcoin/blob/2102c978b5f2d0bfaa0290c00ed693555f3403d4/src/net_processing.cpp#L1497-L1502 we never ask for the blocks below the snapshot block here: Even though we don’t have their data, they are now in the active chain, so they are skipped instead. In the test, we immediately ask for the blocks after the snapshot instead, which is exactly the desired behavior.

    So why did I observe a different behavior on signet? I think it only works like this in the test because there, the snapshot is just a 100 blocks away from pindexLastCommonBlock.

    If this wasn’t the case, we’d request nothing in FindNextBlocksToDownload(), only move our pindexLastCommonBlock ahead by a maximum of 1024. Then we have capacity for this peer left and would actually ask for the historical blocks via TryDownloadingHistoricalBlocks().

    So I think what would actually happen on mainnet and signet is that each time we can process a block request with an existing peer, we’d move pindexLastCommonBlock 1024 blocks ahead, and then request historical blocks - until pindexLastCommonBlock has reached the snapshot. So we’d continuing downloading the background chain for a while (a few minutes?) with existing peers until we switch over to the snapshot chain (instead of never switching over as I originally thought when opening this PR).

    Still, I think that this behavior is not great, and the fix is an improvement - not only because we start downloading the snapshot chain immediately, we also do much less iterating in FindNextBlocksToDownload , moving pindexLastCommonBlock ahead without actually downloading anything.

  12. fanquake removed this from the milestone 27.0 on Apr 16, 2024
  13. fjahr commented at 11:48 pm on May 30, 2024: contributor

    tACK ac547ea43beb1d66d2c11e7e8abb76dd1bfc2883

    I reproduced the issue on current master with signet and confirmed that it is fixed with the change here (three times actually with slightly different logs). Based on reading the code and some custom logging I agree with @mzumsande that his description above is what is indeed happening. Based on this I am fine to add this without a test.

  14. fjahr commented at 11:51 pm on May 30, 2024: contributor
    @ryanofsky would be really great to get your review here when you get the chance :)
  15. in src/net_processing.cpp:1402 in ac547ea43b outdated
    1396@@ -1397,9 +1397,12 @@ void PeerManagerImpl::FindNextBlocksToDownload(const Peer& peer, unsigned int co
    1397         return;
    1398     }
    1399 
    1400-    if (state->pindexLastCommonBlock == nullptr) {
    1401+    std::optional<int> snapshot_height{m_chainman.GetSnapshotBaseHeight()};
    1402+    if (state->pindexLastCommonBlock == nullptr ||
    1403+        (snapshot_height.has_value() && state->pindexLastCommonBlock->nHeight < snapshot_height.value())) {
    


    ryanofsky commented at 9:05 pm on June 5, 2024:

    In commit “p2p: For assumeutxo, download snapshot chain before background chain” (ac547ea43beb1d66d2c11e7e8abb76dd1bfc2883)

    It seems like the pindexLastCommonBlock->nHeight < snapshot_height check is not strict enough because the snapshot block might not necessarily be an ancestor the peer’s best known block.

    In this case, the pindexLastCommonBlock assignments on line 1406 and 1411 will reset it back to the fork point between the two nodes each time FindNextBlocksToDownload is called. And if the peer’s best block is more than BLOCK_DOWNLOAD_WINDOW blocks away from the current chain, the node will never be able to download enough blocks to reach it and validate it. This means if the snapshot is not on the best chain, the node may be unable to sync to a better chain.

    Suggestion to fix would be:

     0--- a/src/net_processing.cpp
     1+++ b/src/net_processing.cpp
     2@@ -1487,12 +1487,14 @@ void PeerManagerImpl::FindNextBlocksToDownload(const Peer& peer, unsigned int co
     3         return;
     4     }
     5 
     6-    std::optional<int> snapshot_height{m_chainman.GetSnapshotBaseHeight()};
     7+    // If pindexLastCommonBlock has not been set yet, set it now. Also reset it
     8+    // if an assumetxo snapshot has been loaded and it is an ancestor of the
     9+    // snapshot, so only blocks after the snapshot will be downloaded.
    10+    const CBlockIndex* snap_base{m_chainman.GetSnapshotBaseBlock()};
    11     if (state->pindexLastCommonBlock == nullptr ||
    12-        (snapshot_height.has_value() && state->pindexLastCommonBlock->nHeight < snapshot_height.value())) {
    13+        (snap_base && snap_base->GetAncestor(state->pindexLastCommonBlock->nHeight) == state->pindexLastCommonBlock)) {
    14         // Bootstrap quickly by guessing a parent of our best tip is the forking point.
    15         // Guessing wrong in either direction is not a problem.
    16-        // Also applies after loading a snapshot, when we want to prioritise loading the snapshot chain and therefore need to skip ahead.
    17         state->pindexLastCommonBlock = m_chainman.ActiveChain()[std::min(state->pindexBestKnownBlock->nHeight, m_chainman.ActiveChain().Height())];
    18     }
    19 
    

    mzumsande commented at 8:39 pm on June 11, 2024:

    This means if the snapshot is not on the best chain, the node may be unable to sync to a better chain.

    I’ve thought about this for a while, and I’m not sure if that problem is really being fixed by your suggestion for the following (slightly philosophical) reasons:

    • Accepting a snapshot means that we’ve created a new chainstate with its Active Tip being set to that block.
    • That is a commitment, we can’t just reorg away from it to another chain that doesn’t contain the snapshot block, even if such a chain existed - at least not until the background sync has finished (we’d need the full blocks below the snapshot), and once that has happened, the snapshot chainstate gets deleted.
    • If all of our peers are on a better chain not containing the Snapshot, it seems to me that we are in a hopeless situation anyway: With the current patch we might download no blocks from those peers, with your suggestion we’d download useless blocks that can’t be connected - I’m not sure why one is better than the other.

    So I think that the premise that we could get into this situation should be rejected: If a valid snapshot was provided and put into a release, it should only be possible if there was a massive reorg of multiple thousands of blocks (and we’d also need to have accepted header chains ending in both chaintips from different peers). I’d say that in this hypothetical situation, AssumeUtxo is not at the top of the list of our concerns - bitcoin would have far bigger problems.

    Without a massive reorg, there is the alternative that the snapshot provided was bad somehow. But then (apart from the fact that it shouldn’t have been hardcoded into chainparams), it also shouldn’t have been loaded in the first place because its header should not have been accepted to our block index with our header DoS protections.

    So all in all I think that once we’ve accepted a snapshot, we’ve made a commitment and have to treat it as part of the best chain, at least until background sync has finished.


    ryanofsky commented at 2:00 pm on June 12, 2024:

    re: #29519 (review)

    Good points. My suggested change was premised on the false assumption that the node would be able to seamlessly reorg to the most-work chain. But that’s not true because of missing undo data.

    Your philosophical point is interesting. I believe that:

    • Loading snapshots should not affect how nodes come to consensus, it should only affect what order they download blocks, and help them compute the latest UTXO sets sooner.

    And you seem to be suggesting:

    • Loading a snapshot should force a node to only consider chains that include the snapshot block and ignore other chains, even if they have more work.

    On this PR more specifically:

    When I made my suggested change, I was forgetting that we don’t have any undo data for the snapshot block and blocks before it, so unfortunately it is not possible for a node to follow the most-work chain when a snapshot is loaded until all the preceding blocks are downloaded.

    Given this fact, I think the best thing FindNextBlocksToDownload can do when an unvalidated snapshot is loaded is to ignore peers on other chains not containing the snapshot block, until it actually has the blocks and undo data that would be needed to validate other chains. I think the current PR is not exactly doing this though. It looks like it will still download up to BLOCK_DOWNLOAD_WINDOW useless blocks on the peer’s chain, starting at the fork point between the two chains, which is not great. It will also not download later blocks after snapshot is validated, when undo data needed to enable a re-org becomes available.


    On the broader approach to downloading blocks when a snapshot is loaded:

    I’d agree now that the only sensible block downloading behavior for an unvalidated snapshot chainstate is to only download blocks after the snapshot (due to lack of undo data before then) and ignore any peers whose best blocks don’t descend from the snapshot block.

    But it is less clear which blocks should be downloaded and attached to the older chainstate, and whether it should target the most-work block or the snapshot block. Since d43a1f1a2fa35d377c7a9ad7ab92d1ae325bde3d from #27746 it targets the snapshot block, but was not doing that originally. Comparing the two alternatives:

    • Targeting the most-work block seems like a purer approach that would probably simplify the code and remove special cases, and would also have the benefit of the node syncing to the most-work chain faster, with less wasted work if it the chain turns out to be valid.

    • Targeting the snapshot block is also very reasonable, and is what the code is doing currently, and guarantees the snapshot will be validated even if it is not used, and could be a practical defense against eclipse attacks, or fork chains that appear to have more work but turn out to be invalid.


    On whether we should care about any of this:

    I don’t think we need to care about these design choices too much, since everything above is referring to a case we don’t expect to encounter, where there is a possibly-valid chain of blocks, not including the snapshot block, that has more work than any known chain including the snapshot block. The only time when we should see this is when there is a major fork with a lot of work behind it.

    However, I don’t think “the premise that we could get into this situation should be rejected.” That seems like it is going too far. We should put some thought into how code will behavior in presence for forks for practical reasons, and also to inform design of the code and to write it in a clean way that isn’t full of assumptions and special cases.

    I do think, as long as we are hardcoding snapshot hashes in the source code, we can reject premise that snapshot chainstate turns out to be invalid, and just refuse to run in that case. But the case where the snapshot is valid, but the snapshot block is not on the most-work chain seems different and worth thinking about.


    mzumsande commented at 3:27 pm on June 12, 2024:

    And you seem to be suggesting: Loading a snapshot should force a node to only consider chains that include the snapshot block and ignore other chains, even if they have more work.

    This was meant more as a description of the status quo than a description of the ideal world. I think that with today’s logic we don’t really have another choice but to do so - temporarily until the background sync has finished.

    but the snapshot block is not on the most-work chain seems different and worth thinking about

    I agree with that.Some possible ways to deal with that:

    1.) Tighten the criteria for snapshot acceptance: We could reject loading the snapshot if the snapshot block is not a predecessor of m_best_header. Currently, we only check that it has more work than the ActiveTip. This is easy to implement, though it doesn’t cover the case where we only learn about the better chain after the snapshot has been loaded.

    2.) After a snapshot is accepted and we detect that the snapshot is no longer on the most-work chain we know of, we could abort syncing the snapshot, delete the chainstate and go back to normal sync mode. This sounds more complicated to implement, and I’m not convinced it’s worth it.

    It looks like it will still download up to BLOCK_DOWNLOAD_WINDOW useless blocks on the peer’s chain, starting at the fork point between the two chains, which is not great.

    That’s true. I will try to adjust the PR to deal with that. Since I find the logic in FindNextBlocksToDownload / TryDownloadingHistoricalBlocks / FindNextBlocks rather hard to understand, I would like to reproduce the situation in a test (at least locally, sounds like timing issues could make this hard to include in a PR) to be more confident that this is really how it works, which could take a few days.


    mzumsande commented at 3:42 pm on June 14, 2024:
    I have now added a first commit that avoids downloading blocks not on the snapshot chain completely. Local testing revealed that not only are we unable to reorg to a more-work chain, we also won’t fail gracefully: On master, we would still download the blocks and attempt to reorg, but then the node would just crash.
  16. ryanofsky commented at 9:37 pm on June 5, 2024: contributor
    Sorry for not seeing this earlier. I think I found a potential problem with this change, so I left a comment below and would be curious to find out more.
  17. ryanofsky approved
  18. ryanofsky commented at 2:28 pm on June 12, 2024: contributor
    Code review ACK ac547ea43beb1d66d2c11e7e8abb76dd1bfc2883. This is an improvement over the status quo in the normal case, though it seems like it introduces some odd behavior if snapshot block is not on the most work chain, as described in other comments. I think it would be nice to clean the code up more and clarify what is trying to do, instead of merging this change as it is. But it would also be ok to merge in its current form, if you prefer that.
  19. mzumsande commented at 3:28 pm on June 12, 2024: contributor

    I think it would be nice to clean the code up more and clarify what is trying to do, instead of merging this change as it is.

    I will look into improving this (see thread above) in the next days, so hold off merging please.

  20. p2p: Restrict downloading of blocks for snapshot chain
    If the best chain of the peer doesn't include the snapshot
    block, it is futile to download blocks from this chain,
    because we couldn't reorg to it. We'd also crash
    trying to reorg because this scenario is not handled.
    019fba99b0
  21. p2p: For assumeutxo, download snapshot chain before background chain
    After loading a snapshot, pindexLastCommonBlock is usually already set
    to some block for existing peers. That means we'd continue syncing the
    background chain from those peers instead of prioritising the snapshot
    chain, which defeats the purpose of doing assumeutxo in the first place.
    Only existing peers are affected by this bug.
    e977c698f9
  22. mzumsande force-pushed on Jun 14, 2024
  23. mzumsande commented at 3:46 pm on June 14, 2024: contributor
    ac547ea to e977c69: Added a commit that avoids downloading blocks not on the snapshot chain (see discussion above). Also made small changes to the (now) second commit using snap_base instead of GetSnapshotBaseHeight(), and reworked the documentation.
  24. in src/net_processing.cpp:1400 in 019fba99b0 outdated
    1396@@ -1397,6 +1397,14 @@ void PeerManagerImpl::FindNextBlocksToDownload(const Peer& peer, unsigned int co
    1397         return;
    1398     }
    1399 
    1400+    // When syncing with AssumeUtxo, if the snapshot is not in the peer's best chain, abort:
    


    fjahr commented at 2:14 pm on June 18, 2024:

    nit: A bit easier to parse IMO, only if you have to retouch.

    0    // When we sync with AssumeUtxo and discover the snapshot is not in the peer's best chain, abort:
    
  25. fjahr commented at 2:25 pm on June 18, 2024: contributor

    tACK e977c698f97ac9bba30e4e3837f41721841c28c4

    I reviewed the new changes and re-tested the behavior on Signet.

    I mostly agree with @mzumsande above and will post some more thoughts on #30288 rather than here since the conversation seems to have moved over there.

    So, since I agree this is not something we should encounter this is not a blocker, but just to confirm my understanding: with the current state of the change, if there was an actual other best chain deeper than the snapshot discovered, our node would get stuck because all other nodes would reorg to the best chain and then we would not request blocks from any of our peers, right?

  26. DrahtBot requested review from ryanofsky on Jun 18, 2024
  27. DrahtBot requested review from Sjors on Jun 18, 2024
  28. mzumsande commented at 3:39 pm on June 18, 2024: contributor

    if there was an actual other best chain deeper than the snapshot discovered, our node would get stuck because all other nodes would reorg to the best chain and then we would not request blocks from any of our peers, right?

    yes, but as you mentioned only if all peers were on that other chain. Whereas the behavior on master would be to download some blocks and then crash with an assert trying to reorg to the new chain - note that this would also happen if the other chain was invalid (for reasons we’d only find out during connecting the blocks) and even if there was just a single peer feeding us that chain.

  29. fjahr commented at 10:27 pm on June 19, 2024: contributor

    yes, but as you mentioned only if all peers were on that other chain.

    It may be worth logging something every time we ignore a peer that is on another chain. In case we encounter this (very unlikely) scenario it would help identify the issue faster and it could be an interesting data point alone to know if anyone ever sees this at all.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-07-03 10:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me