validation: periodically flush dbcache during reindex-chainstate #32414

pull andrewtoth wants to merge 3 commits into bitcoin:master from andrewtoth:reindex-flush changing 2 files +75 −13
  1. andrewtoth commented at 12:55 pm on May 3, 2025: contributor

    After #30611 we periodically do a non-erasing flush of the dbcache to disk roughly every hour during IBD. The intention was to also do this periodic flush during reindex-chainstate, so we would not risk losing progress during a system failure when reindexing with a high dbcache value.

    It was discovered that reindex-chainstate does not perform a PERIODIC flush until it has already reached the tip. Since reindexing to tip usually happens within 24 hours, this behaviour was unnoticed with the previous periodic flush interval. Note that reindex-chainstate still does IF_NEEDED flushes during ConnectBlock, so this also would not be noticed when running with a lower dbcache value.

    This patch moves the PERIODIC flush from after the outer loop in ActivateBestChain to inside the outer loop after we release cs_main. This will periodically flush during IBD, reindex-chainstate, and steady state.

  2. DrahtBot commented at 12:56 pm on May 3, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32414.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK l0rinc, shahsb

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  3. DrahtBot added the label Validation on May 3, 2025
  4. andrewtoth commented at 12:57 pm on May 3, 2025: contributor
    Thanks to @l0rinc for discovering this behaviour, and @mzumsande for the suggested fix.
  5. andrewtoth force-pushed on May 3, 2025
  6. l0rinc changes_requested
  7. l0rinc commented at 4:20 pm on May 3, 2025: contributor

    I’m verifying is this fixes things or not, whether it slows down anything - or if it changes caching behavior in any other way.

    Regardless, I think we should cover this with a test - do you think we could extend e.g. https://github.com/bitcoin/bitcoin/blob/master/test/functional/feature_reindex.py with an assert_debug_log checking the periodic write logs of #32404?

  8. l0rinc commented at 6:07 pm on May 3, 2025: contributor

    I can confirm that this fixes the periodic flushes for -reindex-chainstate:

    0cat debug-7fc8d7f9c1f0fe3795b397327e38465ee6f76b83-1746277780.log | grep FlushStateToDisk
    12025-05-03T14:14:56Z [coindb] FlushStateToDisk write: flush mode=PERIODIC, prune=0, cache_large=0, cache_critical=0, periodic=1
    22025-05-03T15:16:31Z [coindb] FlushStateToDisk write: flush mode=PERIODIC, prune=0, cache_large=0, cache_critical=0, periodic=1
    32025-05-03T16:16:31Z [coindb] FlushStateToDisk write: flush mode=PERIODIC, prune=0, cache_large=0, cache_critical=0, periodic=1
    42025-05-03T16:36:48Z [coindb] FlushStateToDisk write: flush mode=ALWAYS, prune=0, cache_large=0, cache_critical=0, periodic=0
    52025-05-03T16:39:20Z [coindb] FlushStateToDisk write: flush mode=ALWAYS, prune=0, cache_large=0, cache_critical=0, periodic=0
    

    height_vs_time cache_vs_time


    Nit: is this comment still accurate? https://github.com/bitcoin/bitcoin/blob/master/src/validation.cpp#L2782 Guess it just refers to the size based trigger, not the time based - which is probably fine given its small context.

  9. in src/validation.cpp:3592 in b56a7e2e98 outdated
    3586@@ -3587,15 +3587,15 @@ bool Chainstate::ActivateBestChain(BlockValidationState& state, std::shared_ptr<
    3587         // caused an assert() failure during interrupt in such cases as the UTXO DB flushing checks
    3588         // that the best block hash is non-null.
    3589         if (m_chainman.m_interrupt) break;
    3590+
    3591+        // Write changes periodically to disk, after relay.
    3592+        if (!FlushStateToDisk(state, FlushStateMode::PERIODIC)) {
    


    TheCharlatan commented at 8:40 pm on May 3, 2025:
    Previously we didn’t flush before calling CheckBlockIndex, but I don’t think that would degrade the usefulness of the check. I am not sure though about taking the lock here again. If I read this right the lock is taken three times now on every iteration of the loop with little work being done in between. Might it be useful to consolidate these?

    andrewtoth commented at 11:21 pm on May 3, 2025:

    Previously we didn’t flush before calling CheckBlockIndex

    We do flush for every block via ActivateBestChain -> ActivateBestChainStep -> ConnectTip -> FlushStateToDisk. However, it is only flushed IF_NEEDED, and here we do it PERIODIC.


    andrewtoth commented at 11:32 pm on May 3, 2025:

    If I read this right the lock is taken three times now on every iteration of the loop with little work being done in between. Might it be useful to consolidate these?

    Not a bad idea. How does da4090eedf8aea5d4d05bb51c6325e4291209f1a look?


    andrewtoth commented at 0:05 am on May 4, 2025:
    Actually we can’t do that because the lock can’t be held for m_chainman.snapshot_download_completed(); :/.

    andrewtoth commented at 7:05 pm on May 4, 2025:
    Done in c1e554d3e5834a140f2a53854018499a3bfe6822. I had to move FlushStateToDisk above possibly calling m_chainman.snapshot_download_completed() and breaking or breaking on m_chainman.m_interrupt, but that order should not make a difference.
  10. DrahtBot added the label CI failed on May 3, 2025
  11. DrahtBot commented at 11:33 pm on May 3, 2025: contributor

    🚧 At least one of the CI tasks failed. Task lint: https://github.com/bitcoin/bitcoin/runs/41595509701 LLM reason (✨ experimental): The CI failure is caused by a trailing whitespace error detected during the lint check.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  12. andrewtoth force-pushed on May 3, 2025
  13. andrewtoth force-pushed on May 3, 2025
  14. andrewtoth commented at 11:55 pm on May 3, 2025: contributor

    is this comment still accurate? https://github.com/bitcoin/bitcoin/blob/master/src/validation.cpp#L2782 Guess it just refers to the size based trigger, not the time based - which is probably fine given its small context.

    I think it’s still accurate, and #30611 doesn’t make it stale. I think the comment https://github.com/bitcoin/bitcoin/blob/master/src/validation.cpp#L2861 describes the behavior.

    Regardless, I think we should cover this with a test - do you think we could extend e.g. https://github.com/bitcoin/bitcoin/blob/master/test/functional/feature_reindex.py with an assert_debug_log checking the periodic write logs of #32404?

    Hmm I was wondering if we could do something with mocktime and validation interface to confirm it flushes.

  15. andrewtoth closed this on May 3, 2025

  16. l0rinc commented at 8:12 am on May 4, 2025: contributor

    You didn’t mean to close it, right?


    I can confirm that this change retains the flushing behavior during IBD as well as far as I can tell from the plots: height_vs_time

    cache_coins_vs_time


    I’ll compare the reindex-chainstate performance next, before & after, let’s see if there’s a regression (we do expect a 4% slowdown for max cache, like in IBD).

  17. andrewtoth reopened this on May 4, 2025

  18. andrewtoth force-pushed on May 4, 2025
  19. DrahtBot removed the label CI failed on May 4, 2025
  20. validation: periodically flush dbcache during reindex-chainstate
    Move the periodic flush inside the outer loop of ActivateBestChain.
    For very long activations, such as with reindex-chainstate, this
    calls periodic flushes so progress can be saved to disk.
    
    Co-Authored-By: l0rinc <pap.lorinc@gmail.com>
    84820561dc
  21. andrewtoth force-pushed on May 4, 2025
  22. andrewtoth force-pushed on May 4, 2025
  23. in src/test/chainstate_write_tests.cpp:84 in 6f77517845 outdated
    79+    SetMockTime(GetTime());
    80+    chainstate.FlushStateToDisk(state_dummy, FlushStateMode::ALWAYS);
    81+    m_node.validation_signals->SyncWithValidationInterfaceQueue();
    82+    // The periodic flush interval is between 50 and 70 minutes (inclusive)
    83+    // The next call to a PERIODIC write will flush
    84+    SetMockTime(GetTime<std::chrono::minutes>() + 70min);
    


    l0rinc commented at 5:40 pm on May 4, 2025:

    moving DATABASE_WRITE_INTERVAL_MAX to the header might be overkill for this, but now that we’re repeating it here in multiple tests, consider adding a copy to this file as well.

    Nit: I’ve renamed it to DB_WRITE_INTERVAL_MAX in https://github.com/bitcoin/bitcoin/pull/32404/commits/d415f242a6f45229543fabe81d5a6b8b4234c9a6#diff-97c3a52bc5fad452d82670a7fd291800bae20c7bc35bb82686c2c0a4ea7b5b98R98


    l0rinc commented at 6:04 pm on May 4, 2025:

    After we’ve initialized the mock time we might as well use that as the basis (it also gets rid of the type hint):

    0    SetMockTime(GetMockTime() + 70min);
    

    andrewtoth commented at 7:01 pm on May 4, 2025:
    Done.

    andrewtoth commented at 7:01 pm on May 4, 2025:
    Done.
  24. in src/test/chainstate_write_tests.cpp:91 in 6f77517845 outdated
    86+    const auto sub{std::make_shared<TestSubscriber>()};
    87+    m_node.validation_signals->RegisterSharedValidationInterface(sub);
    88+
    89+    // ActivateBestChain back to tip
    90+    chainstate.ActivateBestChain(state_dummy, nullptr);
    91+    BOOST_CHECK(tip == chainstate.m_chain.Tip());
    


    l0rinc commented at 6:02 pm on May 4, 2025:
    0    BOOST_CHECK_EQUAL(tip, chainstate.m_chain.Tip());
    

    (and a few others, though displaying the pointer values isn’t very useful, unless one of the is a nullptr)


    andrewtoth commented at 7:01 pm on May 4, 2025:
    Done.
  25. in src/test/chainstate_write_tests.cpp:50 in 6f77517845 outdated
    41@@ -42,4 +42,58 @@ BOOST_FIXTURE_TEST_CASE(chainstate_write_interval, TestingSetup)
    42     BOOST_CHECK(sub->m_did_flush);
    43 }
    44 
    45+// Test that we do PERIODIC flushes inside ActivateBestChain.
    46+// This is necessary for reindex-chainstate to be able to periodically flush
    47+// before reaching chain tip.
    48+BOOST_FIXTURE_TEST_CASE(write_during_multiblock_activation, TestChain100Setup)
    49+{
    50+    struct TestSubscriber final : CValidationInterface {
    


    l0rinc commented at 6:07 pm on May 4, 2025:

    Nit - developer notes suggests to put the braces to new lines:

    0  - Braces on new lines for classes, functions, methods.
    

    andrewtoth commented at 6:23 pm on May 4, 2025:
    What about structs?

    l0rinc commented at 6:51 pm on May 4, 2025:
    AfterStruct seems to fall back to AfterClass if unspecified https://github.com/bitcoin/bitcoin/blob/master/src/.clang-format#L19

    andrewtoth commented at 7:01 pm on May 4, 2025:
    Done.
  26. in src/test/chainstate_write_tests.cpp:96 in 6f77517845 outdated
    91+    BOOST_CHECK(tip == chainstate.m_chain.Tip());
    92+    // Check that we flushed inside ActivateBestChain while we were at the
    93+    // second block from tip, since FlushStateToDisk is called with PERIODIC
    94+    // inside the outer loop.
    95+    m_node.validation_signals->SyncWithValidationInterfaceQueue();
    96+    BOOST_CHECK(sub->m_flushed_at_block == second_from_tip);
    


    l0rinc commented at 6:11 pm on May 4, 2025:
    👍 I can confirm that this passes after the PR and fails without the PERIODIC flushing moved inside the loop
  27. in src/test/chainstate_write_tests.cpp:71 in 6f77517845 outdated
    66+    const CBlockIndex* tip{chainstate.m_chain.Tip()};
    67+    CBlockIndex* second_from_tip{tip->pprev};
    68+
    69+    {
    70+        LOCK(m_node.chainman->GetMutex());
    71+        LOCK(chainstate.MempoolMutex());
    


    l0rinc commented at 6:15 pm on May 4, 2025:

    My understanding this is equivalent to:

    0        LOCK2(m_node.chainman->GetMutex(), chainstate.MempoolMutex());
    

    andrewtoth commented at 7:01 pm on May 4, 2025:
    Done.
  28. in src/validation.cpp:1 in 8d1e9fedb3


    l0rinc commented at 6:19 pm on May 4, 2025:
    Can you please explain where the 3 separate locks were that we’re replacing (I only see 2, but don’t yet have experience with how we’re locking exactly)?


    l0rinc commented at 7:02 pm on May 4, 2025:
    Yeah, but isn’t the locking kept in FlushStateToDisk? So we’re reducing 3 locks to 2 locks, right? That’s not what I understand from the commit message

    andrewtoth commented at 7:08 pm on May 4, 2025:
    If we kept the code as is, and exited_ibd is true and m_disabled is false, then we would be releasing and acquiring the lock 3 times after we release it for the inner loop.
  29. l0rinc approved
  30. l0rinc commented at 6:22 pm on May 4, 2025: contributor

    Concept ACK

    Thanks for fixing it! I need a bit mor explanation to be able to fully ack

  31. test: add test for periodic flush inside ActivateBestChain 41479ed1d2
  32. refactor: consolidate 3 separate locks into one block
    The main lock needs to be taken 3 separate times.
    It simplifies readability to take it once, and is more efficient.
    c1e554d3e5
  33. andrewtoth force-pushed on May 4, 2025
  34. in src/test/chainstate_write_tests.cpp:54 in c1e554d3e5
    51+// Test that we do PERIODIC flushes inside ActivateBestChain.
    52+// This is necessary for reindex-chainstate to be able to periodically flush
    53+// before reaching chain tip.
    54+BOOST_FIXTURE_TEST_CASE(write_during_multiblock_activation, TestChain100Setup)
    55+{
    56+    struct TestSubscriber final : CValidationInterface
    


    l0rinc commented at 7:04 pm on May 4, 2025:
    nit: if you edit again please reformat TestSubscriber and UpdatedBlockTip as well
  35. in src/test/chainstate_write_tests.cpp:37 in c1e554d3e5
    34@@ -31,15 +35,68 @@ BOOST_FIXTURE_TEST_CASE(chainstate_write_interval, TestingSetup)
    35     BOOST_CHECK(!sub->m_did_flush);
    36 
    37     // The periodic flush interval is between 50 and 70 minutes (inclusive)
    


    l0rinc commented at 7:05 pm on May 4, 2025:
    now that we have the constants these comments don’t really add anything
  36. in src/test/chainstate_write_tests.cpp:85 in c1e554d3e5
    82+    BOOST_CHECK_EQUAL(second_from_tip->pprev, chainstate.m_chain.Tip());
    83+
    84+    // Set m_next_write to current time
    85+    chainstate.FlushStateToDisk(state_dummy, FlushStateMode::ALWAYS);
    86+    m_node.validation_signals->SyncWithValidationInterfaceQueue();
    87+    // The periodic flush interval is between 50 and 70 minutes (inclusive)
    


    l0rinc commented at 7:05 pm on May 4, 2025:
    same, can be removed now I think (and 70 minutes isn’t inclusive anymore, right?)
  37. l0rinc approved
  38. l0rinc commented at 7:15 pm on May 4, 2025: contributor

    Code review ACK - I’m still running the reindex / IBD benchmarks to see if this changed anything unexpected.

    Do you think I should stop and restart after the locking changes?

     0diff --git a/src/test/chainstate_write_tests.cpp b/src/test/chainstate_write_tests.cpp
     1index e1b82ebc12..44257d5591 100644
     2--- a/src/test/chainstate_write_tests.cpp
     3+++ b/src/test/chainstate_write_tests.cpp
     4@@ -16,7 +16,8 @@ BOOST_AUTO_TEST_SUITE(chainstate_write_tests)
     5 
     6 BOOST_FIXTURE_TEST_CASE(chainstate_write_interval, TestingSetup)
     7 {
     8-    struct TestSubscriber final : CValidationInterface {
     9+    struct TestSubscriber final : CValidationInterface
    10+    {
    11         bool m_did_flush{false};
    12         void ChainStateFlushed(ChainstateRole, const CBlockLocator&) override
    13         {
    14@@ -34,13 +35,12 @@ BOOST_FIXTURE_TEST_CASE(chainstate_write_interval, TestingSetup)
    15     m_node.validation_signals->SyncWithValidationInterfaceQueue();
    16     BOOST_CHECK(!sub->m_did_flush);
    17 
    18-    // The periodic flush interval is between 50 and 70 minutes (inclusive)
    19-    SetMockTime(GetTime<std::chrono::minutes>() + DATABASE_WRITE_INTERVAL_MIN - 1min);
    20+    SetMockTime(GetMockTime() + DATABASE_WRITE_INTERVAL_MIN - 1min);
    21     chainstate.FlushStateToDisk(state_dummy, FlushStateMode::PERIODIC);
    22     m_node.validation_signals->SyncWithValidationInterfaceQueue();
    23     BOOST_CHECK(!sub->m_did_flush);
    24 
    25-    SetMockTime(GetTime<std::chrono::minutes>() + DATABASE_WRITE_INTERVAL_MAX);
    26+    SetMockTime(GetMockTime() + DATABASE_WRITE_INTERVAL_MAX);
    27     chainstate.FlushStateToDisk(state_dummy, FlushStateMode::PERIODIC);
    28     m_node.validation_signals->SyncWithValidationInterfaceQueue();
    29     BOOST_CHECK(sub->m_did_flush);
    30@@ -59,7 +59,8 @@ BOOST_FIXTURE_TEST_CASE(write_during_multiblock_activation, TestChain100Setup)
    31         {
    32             m_flushed_at_block = m_tip;
    33         }
    34-        void UpdatedBlockTip(const CBlockIndex* block_index, const CBlockIndex*, bool) override {
    35+        void UpdatedBlockTip(const CBlockIndex* block_index, const CBlockIndex*, bool) override
    36+        {
    37             m_tip = block_index;
    38         }
    39     };
    40@@ -82,7 +83,6 @@ BOOST_FIXTURE_TEST_CASE(write_during_multiblock_activation, TestChain100Setup)
    41     // Set m_next_write to current time
    42     chainstate.FlushStateToDisk(state_dummy, FlushStateMode::ALWAYS);
    43     m_node.validation_signals->SyncWithValidationInterfaceQueue();
    44-    // The periodic flush interval is between 50 and 70 minutes (inclusive)
    45     // The next call to a PERIODIC write will flush
    46     SetMockTime(GetMockTime() + DATABASE_WRITE_INTERVAL_MAX);
    
  39. shahsb commented at 3:11 am on May 5, 2025: none
    Concept ACK

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-05-05 12:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me