validation: Make ReplayBlocks interruptible #30155

pull mzumsande wants to merge 1 commits into bitcoin:master from mzumsande:202405_replay_blocks changing 2 files +31 −9
  1. mzumsande commented at 3:12 pm on May 22, 2024: contributor

    This addresses the problem from #11600 that the Rolling Forward process after a crash (ReplayBlocks()) is uninterruptible. ReplayBlocks can take a long time to finish, and this can be especially annoying to GUI users who are taunted to “press q to shutdown” even though pressing “q” does nothing.

    Now, when an interrupt is received during ReplayBlocks(), the intermediate progress is saved: In addition to writing the updated coins to disk, the flush adjusts the lower head block (saved in DB_HEAD_BLOCKS), so that with the next restart, the replay continues from where it was stopped without losing progress.

    I tested this manually on signet: A situation where ReplayBlocks() becomes necessary can be created by syncing with -dbcrashratio=1 and stopping the node after ~70k blocks. Then, after the next restart, the replay process can be interrupted. I then used the dumptxoutset output to check that the final utxo-set hash is unaffected by how many times the process is interrupted.

  2. DrahtBot commented at 3:12 pm on May 22, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK TheCharlatan, andrewtoth

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    No conflicts as of last run.

  3. DrahtBot added the label Validation on May 22, 2024
  4. validation: Make ReplayBlocks interruptible
    Now, when an interrupt is received during ReplayBlocks the
    intermediate progress is flushed:
    In addition to updating the coins on disk, this
    flush adjusts the head blocks, so that with the next
    restart, the replay continues from where we stopped.
    26e78f2ee4
  5. in src/validation.cpp:4758 in 4bd626ec6e outdated
    4753@@ -4754,6 +4754,12 @@ bool Chainstate::ReplayBlocks()
    4754         LogPrintf("Rolling forward %s (%i)\n", pindex.GetBlockHash().ToString(), nHeight);
    4755         m_chainman.GetNotifications().progress(_("Replaying blocks…"), (int)((nHeight - nForkHeight) * 100.0 / (pindexNew->nHeight - nForkHeight)), false);
    4756         if (!RollforwardBlock(&pindex, cache)) return false;
    4757+        if (m_chainman.m_interrupt) {
    4758+            LogPrintf("Flushing intermediate state of replay\n");
    


    maflcko commented at 6:35 pm on May 22, 2024:
    0            LogInfo("Flushing intermediate state of replay\n");
    

    nit: For new code it would be better to use the non-deprecated non-confusing alias LogInfo over LogPrintf.


    mzumsande commented at 8:33 pm on May 22, 2024:
    sure - updated.
  6. mzumsande force-pushed on May 22, 2024
  7. in src/txdb.cpp:103 in 26e78f2ee4
    100-        // We may be in the middle of replaying.
    101+        // We may be in the process of replaying.
    102         std::vector<uint256> old_heads = GetHeadBlocks();
    103         if (old_heads.size() == 2) {
    104             if (old_heads[0] != hashBlock) {
    105-                LogPrintLevel(BCLog::COINDB, BCLog::Level::Error, "The coins database detected an inconsistent state, likely due to a previous crash or shutdown. You will need to restart bitcoind with the -reindex-chainstate or -reindex configuration option.\n");
    


    luke-jr commented at 5:17 pm on May 23, 2024:
    Is it a safe assumption that this cannot happen except an interrupted replay?

    mzumsande commented at 5:59 pm on May 23, 2024:

    Currently we abort with an assert in the next line after the log, so the expectation in master is that this shouldn’t happen outside of db corruption. I couldn’t find any reports on the internet of users that have run into this error / assert, so it probably doesn’t happen very often in practice. But theoretically speaking, if the coins db gets corrupted, I guess anything can happen depending on the specifics of the corruption?

    While working on this PR, I initially went with the different approach of adding an explicit in_replay flag to BatchWrite(), and setting that flag only during the interrupt flush. I switched to the current approach of detecting this implicitly inside BatchWrite() because it seemed more elegant and led to less boilerplate code, but it’s still an option.


    ryanofsky commented at 2:18 pm on June 6, 2024:

    In commit “validation: Make ReplayBlocks interruptible” (26e78f2ee4a325f20c3b3bfaac5be532ab9e7bac)

    I switched to the current approach of detecting this implicitly inside BatchWrite() because it seemed more elegant and led to less boilerplate code, but it’s still an option.

    I would definitely be curious to see the other approach. Not just because it might be better at detecting inconsistent states like Luke is suggesting, but also because it seems like it would make the code more clear.

    I think it would be pretty hard to look at the code after this PR (without looking at the PR itself), and understand why old_heads[0] != hashBlock implies that this BatchWrite call is finalizing a replay that is currently being interrupted.

    I was also confused by the name of the “unfinished_replay” variable below, because I thought “unfinished replay” was just referring to a replay of an previously unfinished flush. I did not realize it was referring specifically to an interrupted replay of an unfinished flush.

    Also as a possible followup, for future maintainability of this code I think it could be a good idea to remove the replay logic entirely from BatchWrite(), since ReplayBlocks() is the only BatchWrite() caller that should need it. Other BatchWrite calls should be able to assume and assert that GetHeadBlocks() is empty, and this is just a normal write, not a replay.


    mzumsande commented at 7:16 pm on June 10, 2024:

    I would definitely be curious to see the other approach. Not just because it might be better at detecting inconsistent states like Luke is suggesting, but also because it seems like it would make the code more clear.

    See https://github.com/mzumsande/bitcoin/commit/db8232d47068bba9ae1bb8135ef456be731e5ca5 - because of the inheritance the bool interrupted_replay needed to be added to quite a few spots, which I didn’t love.

    I think it would be pretty hard to look at the code after this PR (without looking at the PR itself), and understand why old_heads[0] != hashBlock implies that this BatchWrite call is finalizing a replay that is currently being interrupted.

    I’ll add more documentation (if I don’t switch to the other approach).

    I was also confused by the name of the “unfinished_replay” variable below,

    Maybe interrupted_replay is better? But the issue remains that it’s the replay that is being interrupted (or unfinished), and not the flush.

    Also as a possible followup, for future maintainability of this code I think it could be a good idea to remove the replay logic entirely from BatchWrite(), since ReplayBlocks() is the only BatchWrite() caller that should need it. Other BatchWrite calls should be able to assume and assert that GetHeadBlocks() is empty, and this is just a normal write, not a replay.

    Seems reasonable for a follow-up, though I haven’t looked into details yet.

  8. achow101 requested review from sipa on Oct 15, 2024
  9. TheCharlatan commented at 2:54 pm on October 15, 2024: contributor
    Concept ACK
  10. andrewtoth commented at 5:52 pm on October 19, 2024: contributor

    Concept ACK

    It appears to me that the size of the cache used in ReplayBlocks is never checked if it grows too large. Wouldn’t this mean that if the node was sufficiently far behind, the rolling forward would fill up the cache to levels exceeding dbcache? This would result in a cycle of rolling forward and then OOM crashing, never being able to write any chainstate to disk.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-21 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me