validation: sync chainstate to disk after syncing to tip #15218

pull andrewtoth wants to merge 2 commits into bitcoin:master from andrewtoth:flush-after-ibd changing 5 files +203 −0
  1. andrewtoth commented at 7:04 pm on January 20, 2019: contributor

    When finishing syncing the chainstate to tip, the chainstate is not persisted to disk until 24 hours after startup. This can cause an issue where the unpersisted chainstate must be resynced if bitcoind is not cleanly shut down. If using a large enough dbcache, it’s possible the entire chainstate from genesis would have to be resynced.

    This fixes the issue by persisting the chainstate to disk right after syncing to tip, but not clearing the utxo cache (using the Sync method introduced in #17487). This happens by scheduling a call to the new function SyncCoinsTipAfterChainSync every 30 seconds. This function checks that the node is out of IBD, and then checks if no new block has been added since the last call. Finally, it checks that there are no blocks currently being downloaded by peers. If all these conditions are met, then the chainstate is persisted and the function is no longer scheduled.

    Mitigates #11600.
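    A minimal sketch of this scheduling flow, pieced together from the snippets quoted later in the thread; the names (SyncCoinsTipAfterChainSync, SYNC_CHECK_INTERVAL, GetNumberOfPeersWithValidatedDownloads) follow the PR branch, but the exact signatures and the final persist call are illustrative, not the final code:

        // Sketch only: reschedule every SYNC_CHECK_INTERVAL until all three
        // conditions (out of IBD, no new block since the last check, no blocks
        // in flight) hold, then persist without emptying the utxo cache.
        static void SyncCoinsTipAfterChainSync(const NodeContext& node)
        {
            const auto reschedule = [&node] {
                node.scheduler->scheduleFromNow([&node] { SyncCoinsTipAfterChainSync(node); }, SYNC_CHECK_INTERVAL);
            };

            if (node.chainman->IsInitialBlockDownload()) {
                reschedule(); // still in IBD, check again in 30s
                return;
            }

            LOCK(node.chainman->GetMutex());

            static int last_chain_height{-1};
            const int current_height{node.chainman->ActiveHeight()};
            if (current_height != last_chain_height) {
                last_chain_height = current_height; // a block was connected since the last check
                reschedule();
                return;
            }

            if (node.peerman->GetNumberOfPeersWithValidatedDownloads() > 0) {
                reschedule(); // peers still have block downloads in flight
                return;
            }

            // All conditions met: write the dirty coins cache to disk without
            // emptying it (the Sync behaviour from #17487) and stop rescheduling.
        }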

  2. andrewtoth force-pushed on Jan 20, 2019
  3. in src/validation.cpp:1196 in c5c5702ffe outdated
    1189@@ -1190,6 +1190,11 @@ bool IsInitialBlockDownload()
    1190         return true;
    1191     if (chainActive.Tip()->GetBlockTime() < (GetTime() - nMaxTipAge))
    1192         return true;
    1193+
    1194+    // Flush state after IBD is finished, but don't block return
    1195+    void (*flush_ptr)() = &FlushStateToDisk;
    1196+    std::thread(flush_ptr).detach();
    


    practicalswift commented at 8:19 pm on January 20, 2019:
    Could be written as std::thread(FlushStateToDisk).detach();?

    andrewtoth commented at 8:23 pm on January 20, 2019:
    No, because FlushStateToDisk is overloaded and the compiler can’t determine which function to call in that case.

    practicalswift commented at 8:35 pm on January 20, 2019:

    @andrewtoth Ah, yes! Thanks!

    0$ git grep ' FlushStateToDisk.*{'
    1src/validation.cpp:bool static FlushStateToDisk(const CChainParams& chainparams, CValidationState &state, FlushStateMode mode, int nManualPruneHeight) {
    2src/validation.cpp:void FlushStateToDisk() {
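    For context, a stand-alone illustration of the overload-resolution issue (not the Bitcoin Core code): naming an overloaded function alone gives the compiler no target type, so a typed function pointer, an explicit cast, or a lambda is needed.

        #include <thread>

        void Flush() {}     // nullary overload
        void Flush(int) {}  // second overload makes the bare name ambiguous

        int main()
        {
            // std::thread t(Flush);                         // error: overloaded name, no target type
            void (*flush_ptr)() = &Flush;                    // pattern used in the diff above
            std::thread t1(flush_ptr);
            std::thread t2(static_cast<void (*)()>(&Flush)); // equivalent: explicit cast
            std::thread t3([] { Flush(); });                 // or wrap the call in a lambda
            t1.join(); t2.join(); t3.join();
            return 0;
        }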
    
  4. fanquake added the label Validation on Jan 21, 2019
  5. andrewtoth force-pushed on Jan 21, 2019
  6. laanwj commented at 3:42 pm on January 21, 2019: member

    Concept ACK, but I think IsInitialBlockDownload is the wrong place to implement this: it’s a query function, so having it suddenly spawn a thread that flushes is unexpected.

    Would be better to implement it closer to the validation logic and database update logic itself.

  7. andrewtoth force-pushed on Jan 21, 2019
  8. andrewtoth commented at 11:34 pm on January 21, 2019: contributor
    @laanwj Good point. I refactored to move this behaviour to ActivateBestChain in an area where periodic flushes are already expected.
  9. andrewtoth force-pushed on Jan 22, 2019
  10. in src/validation.cpp:2758 in e3e1e7aac4 outdated
    2750@@ -2751,7 +2751,13 @@ bool CChainState::ActivateBestChain(CValidationState &state, const CChainParams&
    2751     CheckBlockIndex(chainparams.GetConsensus());
    2752 
    2753     // Write changes periodically to disk, after relay.
    2754-    if (!FlushStateToDisk(chainparams, state, FlushStateMode::PERIODIC)) {
    2755+    // Unless we just finished initial sync, then always write to disk
    2756+    static bool initially_in_ibd = true;
    2757+    bool finished_ibd = initially_in_ibd && !IsInitialBlockDownload();
    2758+    if (finished_ibd) {
    2759+      initially_in_ibd = false;
    


    practicalswift commented at 6:37 am on January 22, 2019:
    Wrong indentation :-)

    andrewtoth commented at 2:25 pm on January 22, 2019:
    Fixed.
  11. laanwj commented at 12:29 pm on January 22, 2019: member

    @laanwj Good point. I refactored to move this behaviour to ActivateBestChain in an area where periodic flushes are already expected.

    Thanks, much better!

  12. andrewtoth force-pushed on Jan 22, 2019
  13. in src/validation.cpp:2755 in 36628231dc outdated
    2750@@ -2751,7 +2751,13 @@ bool CChainState::ActivateBestChain(CValidationState &state, const CChainParams&
    2751     CheckBlockIndex(chainparams.GetConsensus());
    2752 
    2753     // Write changes periodically to disk, after relay.
    2754-    if (!FlushStateToDisk(chainparams, state, FlushStateMode::PERIODIC)) {
    2755+    // Unless we just finished initial sync, then always write to disk
    2756+    static bool initially_in_ibd = true;
    


    ken2812221 commented at 2:51 pm on January 22, 2019:
    I think this variable should be a member of CChainState, to avoid sharing the same variable if we have two CChainState objects in the program.

    andrewtoth commented at 1:34 am on January 23, 2019:
    Done.
  14. sdaftuar commented at 8:52 pm on January 22, 2019: member

    I’m not really a fan of this change – the problem described in #11600 is from an unclean shutdown (ie system crash), where our recovery code could take a long time (but typically would be much faster than doing a -reindex to recover, which is how our code used to work).

    This change doesn’t really solve that problem, it just changes the window in which an unclean shutdown could occur (reducing it at most by 24 hours). But extra flushes, particularly during initial sync, aren’t obviously a good idea, since they harm performance. (Note that we leave IBD before we’ve synced all the way to the tip, I think once we’re within a day or two?)

    Because we flush every day anyway, it’s hard for me to say that this is really that much worse, performance-wise (after all we don’t currently support a node configuration where the utxo is kept entirely cached). But I’m not sure this solves anything either, and a change like this would have to be reverted if, for instance, we wanted to make the cache actually more useful on startup (something I’ve thought we should do for a while). So I think I’m a -0 on this change.

  15. andrewtoth commented at 1:18 am on January 23, 2019: contributor

    @sdaftuar This change also greatly improves the common workflow of spinning up a high performance instance to sync, then immediately shutting it down and using a cheaper one. Currently, you have to enter it and do a clean shutdown instead of just terminating. Similarly, when syncing to an external drive, you can now just unplug the drive or turn off the machine when finished.

    I would argue that moving the window to 0 hours directly after initial sync is an objective improvement. There is a lot of data that would be lost if a crash happened right after, so why risk another 24 hours? After the flush, the most a user would lose is 24 hours’ worth of rolling back, instead of 10 years’ worth. Also, this change does not do any extra flushes during initial sync, only after.

    I can’t speak to your last point about changing the way we use the cache, since I don’t know what your ideas are.

  16. andrewtoth force-pushed on Jan 23, 2019
  17. sdaftuar commented at 5:08 pm on January 23, 2019: member

    Currently, you have to enter it and do a clean shutdown instead of just terminating.

    @andrewtoth We already support this (better, I think) with the -stopatheight argument, no?

    I don’t really view data that is in memory as “at risk”; I view it as a massive performance optimization that will allow a node to process new blocks at the fastest possible speed while the data hasn’t yet been flushed. I also don’t feel very strongly about this for the reasons I gave above, so if others want this behavior then so be it.

  18. sipa commented at 7:31 pm on January 23, 2019: member
    @sdaftuar Maybe this is a bit of a different discussion, but there is another option; namely, supporting flushing the dirty state to disk, but without wiping it from the cache. Based on our earlier benchmarking, we wouldn’t want to do this purely for maximizing IBD performance, but it could be done at specific times to minimize losses in case of crashes (the once-per-day flush for example, and also this IBD-is-finished one).
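    A toy illustration of the distinction sipa describes, with deliberately simplified types (this is not the CCoinsViewCache implementation): a flush writes dirty entries and then empties the cache, while a non-clearing sync writes the same dirty entries but keeps them cached.

        #include <map>
        #include <string>

        struct ToyDiskDB { std::map<std::string, int> table; };

        class ToyCoinsCache
        {
            struct Entry { int value; bool dirty{false}; };
            ToyDiskDB& m_base;
            std::map<std::string, Entry> m_cache;

        public:
            explicit ToyCoinsCache(ToyDiskDB& base) : m_base(base) {}
            void Add(const std::string& key, int value) { m_cache[key] = {value, /*dirty=*/true}; }

            // Write dirty entries to the backing store but keep them cached, so
            // reads and block connection stay fast afterwards (the non-clearing flush).
            void Sync()
            {
                for (auto& [key, entry] : m_cache) {
                    if (entry.dirty) {
                        m_base.table[key] = entry.value;
                        entry.dirty = false;
                    }
                }
            }

            // Write dirty entries and then drop everything, leaving an empty cache.
            void Flush()
            {
                Sync();
                m_cache.clear();
            }
        };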
  19. sdaftuar commented at 7:43 pm on January 23, 2019: member
    @sipa Agreed, I think that would make a lot more sense as a first pass optimization for the periodic flushes and would also work better for this purpose as well.
  20. gmaxwell commented at 7:18 pm on January 24, 2019: contributor

    Currently, you have to enter it and do a clean shutdown instead of just terminating.

    Well, with this, if you “just terminate” you’re going to end up with a replay of several days’ blocks at startup, which is still ugly, even if this makes it less bad.

    As an aside, if you actually shut off the computer at any time during IBD you’ll likely corrupt the state completely and need to reindex, because we don’t use fsync during IBD for performance reasons.

    We really need to get background writing going, so that our writes are never more than (say) a week of blocktime behind… but that is a much bigger change, so I don’t suggest “just do that instead”, though it would make the change here completely unnecessary.

    Might it be better to trigger the flush the first time it goes 30 seconds without connecting a block and there are no queued transfers, from the scheduler thread?

  21. in test/functional/mempool_accept.py:61 in 4787054903 outdated
    56@@ -57,6 +57,8 @@ def run_test(self):
    57         self.log.info('Start with empty mempool, and 200 blocks')
    58         self.mempool_size = 0
    59         wait_until(lambda: node.getblockcount() == 200)
    60+        # Generate 1 more block to exit IBD
    61+        node.generate(1)
    


    maflcko commented at 7:29 pm on January 24, 2019:
    Seems unrelated to the changes?

    andrewtoth commented at 1:44 am on January 25, 2019:
    The test below, which checks that a mined transaction can be added to the mempool, will fail if this flush gets triggered by the block generated there: the transaction is rejected with the missing-inputs message rather than the expected tx-already-known. Generating a block here triggers the flush earlier, so that test passes.
  22. andrewtoth commented at 2:22 am on January 25, 2019: contributor

    @andrewtoth We already support this (better, I think) with the -stopatheight argument, no?

    @sdaftuar Ahh, I never considered using that for this purpose. Thanks!

    @gmaxwell It might still be ugly to have a replay of a few days, but much better than making everything unusable for hours.

    There are comments from several people in this PR about adding background writing and writing dirty state to disk without wiping the cache. This change wouldn’t affect either of those improvements, and is an improvement by itself in the interim.

    As for moving this to the scheduler thread, I think this is better since it happens in a place where periodic flushes are already expected. Also, checking every 30 seconds for a new block wouldn’t work if, for instance, the network cuts out for a few minutes.

  23. sipa commented at 2:36 am on January 25, 2019: member
    @andrewtoth The problem is that right now, causing a flush when exiting IBD will (temporarily) kill your performance right before finishing the sync (because it leaves you with an empty cache). If instead it was a non-clearing flush, there would be no such downside.
  24. sdaftuar commented at 6:56 pm on January 29, 2019: member

    My experiment in #15265 has changed my view on this a bit – now I think that we might as well make a change like this for now, but should change the approach slightly to do something like @gmaxwell’s proposal so that we don’t trigger the flush before we are done syncing:

    Might it be better to trigger the flush the first time it goes 30 seconds without connecting a block and there are no queued transfers, from the scheduler thread?

  25. andrewtoth force-pushed on Feb 10, 2019
  26. andrewtoth force-pushed on Feb 10, 2019
  27. andrewtoth commented at 8:34 pm on February 10, 2019: contributor

    @sdaftuar @gmaxwell I’ve updated this to check every 30 seconds on the scheduler thread whether there has been an update to the active chain height. It only actually checks after IsInitialBlockDownload is false, which happens once the latest block is within a day of the current time.

    I’m not sure how to check if there are queued transfers. If this is not sufficient, some guidance on how to do that would be appreciated.

  28. in src/init.cpp:1790 in 442db9d79d outdated
    1783@@ -1781,5 +1784,19 @@ bool AppInitMain(InitInterfaces& interfaces)
    1784         g_banman->DumpBanlist();
    1785     }, DUMP_BANS_INTERVAL * 1000);
    1786 
    1787+    // Once initial sync is finished, flush state to protect against data loss
    1788+    bool did_initial_flush = false;
    1789+    int last_chain_height = -1;
    1790+    scheduler.scheduleEvery([&did_initial_flush, &last_chain_height] {
    


    practicalswift commented at 9:12 pm on February 10, 2019:
    Lifetime issue with did_initial_flush and last_chain_height when AppInitMain goes out of scope?

    andrewtoth commented at 10:06 pm on February 10, 2019:
    Moved them inside the lambda as static variables.
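    A stand-alone illustration of the lifetime hazard and the fix (the task body is hypothetical, not the init.cpp code):

        #include <functional>

        // Broken: the locals die when the function returns, but the scheduler may
        // run the lambda much later, so the captured references dangle.
        std::function<void()> MakeBrokenTask()
        {
            bool did_initial_flush{false};
            int last_chain_height{-1};
            return [&did_initial_flush, &last_chain_height] {
                if (!did_initial_flush) { last_chain_height = 0; did_initial_flush = true; }
            };
        }

        // Fix adopted here: keep the state as function-local statics inside the
        // lambda, so nothing is captured and the state outlives AppInitMain.
        std::function<void()> MakeFixedTask()
        {
            return [] {
                static bool did_initial_flush{false};
                static int last_chain_height{-1};
                if (!did_initial_flush) { last_chain_height = 0; did_initial_flush = true; }
            };
        }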
  29. andrewtoth force-pushed on Feb 10, 2019
  30. andrewtoth force-pushed on Feb 11, 2019
  31. DrahtBot commented at 3:49 am on February 11, 2019: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK furszy
    Concept NACK maflcko
    Concept ACK laanwj, luke-jr, Sjors, mzumsande

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    No conflicts as of last run.

  32. andrewtoth force-pushed on Feb 14, 2019
  33. andrewtoth force-pushed on Feb 16, 2019
  34. andrewtoth force-pushed on Mar 10, 2019
  35. in src/init.cpp:1833 in 8ecd177c3a outdated
    1828+        if (last_chain_height == -1 || last_chain_height != current_height) {
    1829+            last_chain_height = current_height;
    1830+            return;
    1831+        }
    1832+
    1833+        did_initial_flush = true;
    


    luke-jr commented at 6:33 am on April 17, 2019:
    Can’t we just unschedule the task?

    andrewtoth commented at 2:44 am on April 18, 2019:
    Unless I’m missing something, I don’t think it’s possible to unschedule without making a lot more changes https://github.com/bitcoin/bitcoin/blob/master/src/scheduler.h#L58

    luke-jr commented at 2:17 am on April 19, 2019:
    Use scheduleFromNow conditionally?

    andrewtoth commented at 4:31 pm on April 19, 2019:
    Updated. How is that?
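    Roughly, the conditional scheduleFromNow approach looks like this (reconstructed from the hunks quoted below; the flush call is shown only schematically):

        // Instead of a scheduleEvery() task that would need unscheduling, the check
        // reschedules itself only while initial sync is still in progress.
        static void FlushAfterSync()
        {
            if (IsInitialBlockDownload()) {
                // Not done syncing yet: check again in SYNC_CHECK_INTERVAL seconds.
                scheduler.scheduleFromNow(FlushAfterSync, SYNC_CHECK_INTERVAL * 1000);
                return;
            }
            // Initial sync finished: flush once; nothing is rescheduled, so the task ends here.
            FlushStateToDisk();
        }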
  36. andrewtoth force-pushed on Apr 19, 2019
  37. andrewtoth force-pushed on Apr 19, 2019
  38. in src/init.cpp:1256 in 4ff12b7e98 outdated
    1252@@ -1250,6 +1253,26 @@ bool AppInitLockDataDirectory()
    1253     return true;
    1254 }
    1255 
    1256+static void FlushAfterSync() 
    


    luke-jr commented at 5:01 pm on April 19, 2019:
    Nit: extra space on the end

    andrewtoth commented at 5:23 pm on April 19, 2019:
    Removed.
  39. in src/init.cpp:1260 in 4ff12b7e98 outdated
    1252@@ -1250,6 +1253,26 @@ bool AppInitLockDataDirectory()
    1253     return true;
    1254 }
    1255 
    1256+static void FlushAfterSync() 
    1257+{
    1258+    // Once initial sync is finished, flush state to protect against data loss
    1259+    if (IsInitialBlockDownload()) {
    1260+        scheduler.scheduleFromNow(*FlushAfterSync, SYNC_CHECK_INTERVAL * 1000);
    


    luke-jr commented at 5:02 pm on April 19, 2019:
    I don’t understand the purpose of * before FlushAfterSync.

    andrewtoth commented at 5:23 pm on April 19, 2019:
    Removed them.
  40. luke-jr approved
  41. andrewtoth force-pushed on Apr 19, 2019
  42. luke-jr approved
  43. luke-jr commented at 7:39 pm on April 19, 2019: member
    utACK
  44. in src/init.cpp:1258 in 6471856400 outdated
    1252@@ -1250,6 +1253,26 @@ bool AppInitLockDataDirectory()
    1253     return true;
    1254 }
    1255 
    1256+static void FlushAfterSync()
    1257+{
    1258+    // Once initial sync is finished, flush state to protect against data loss
    


    maflcko commented at 9:10 pm on April 19, 2019:
    doc-nit: This comment should be moved down after the if or as docstring on top of the method, imo. The condition in the if below is “initial sync is not finished”.

    andrewtoth commented at 3:04 pm on April 20, 2019:
    Updated comments to align with coding style.
  45. andrewtoth force-pushed on Apr 20, 2019
  46. andrewtoth force-pushed on Apr 20, 2019
  47. andrewtoth force-pushed on Apr 20, 2019
  48. andrewtoth force-pushed on Apr 20, 2019
  49. maflcko commented at 2:14 pm on April 22, 2019: member
    utACK 5d9aa4c6432bd41e8d65717976293e006a6c732a
  50. sdaftuar commented at 3:04 pm on April 22, 2019: member

    I’m not sure how to check if there are queued transfers. If this is not sufficient, some guidance on how to do that would be appreciated.

    There’s a variable called nPeersWithValidatedDownloads in net_processing.cpp which indicates how many peers we are downloading blocks from. So to implement @gmaxwell’s suggestion I think you would just need to expose that variable and then check to see if it equals 0.
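    A sketch of the resulting check, assuming the counter is exposed through a small accessor (the accessor name follows the commits added later in this thread; ReadyToFlushAfterSync is a hypothetical wrapper):

        // Defer the post-sync flush while any peer still has validated block
        // downloads in flight; the scheduled check simply tries again next run.
        static bool ReadyToFlushAfterSync()
        {
            LOCK(cs_main);
            return GetNumberOfPeersWithValidatedDownloads() == 0;
        }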

  51. andrewtoth force-pushed on Apr 23, 2019
  52. andrewtoth commented at 0:57 am on April 23, 2019: contributor
    @sdaftuar Thanks. Updated.
  53. andrewtoth force-pushed on Apr 23, 2019
  54. in src/net_processing.cpp:806 in 6f76c4f97b outdated
    802@@ -803,6 +803,11 @@ bool GetNodeStateStats(NodeId nodeid, CNodeStateStats &stats) {
    803     return true;
    804 }
    805 
    806+int GetNumberOfPeersWithValidatedDownloads() EXCLUSIVE_LOCKS_REQUIRED(cs_main)
    


    maflcko commented at 11:47 am on April 23, 2019:

    Annotations in definitions are ignored, they need to be in the declaration

    0int GetNumberOfPeersWithValidatedDownloads()
    

    andrewtoth commented at 2:05 pm on April 23, 2019:
    Fixed.
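    In other words (a sketch, with the header placement assumed): the thread-safety annotation belongs on the declaration that callers see, and the out-of-line definition can assert the lock at runtime instead.

        // net_processing.h (declaration): annotate here so the compile-time lock
        // analysis applies at every call site.
        int GetNumberOfPeersWithValidatedDownloads() EXCLUSIVE_LOCKS_REQUIRED(cs_main);

        // net_processing.cpp (definition): no annotation needed; a runtime assert
        // still documents and enforces the requirement.
        int GetNumberOfPeersWithValidatedDownloads()
        {
            AssertLockHeld(cs_main);
            return nPeersWithValidatedDownloads;
        }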
  55. andrewtoth force-pushed on Apr 23, 2019
  56. DrahtBot added the label Needs rebase on Jul 31, 2019
  57. andrewtoth force-pushed on Jul 31, 2019
  58. andrewtoth force-pushed on Jul 31, 2019
  59. DrahtBot removed the label Needs rebase on Jul 31, 2019
  60. andrewtoth force-pushed on Jul 31, 2019
  61. andrewtoth force-pushed on Jul 31, 2019
  62. TheBlueMatt commented at 6:17 pm on August 21, 2019: contributor
    -0.5. This is really not the right way to go here, I think. Namely, people with a high dbcache may also want to use it after IBD completes (i.e. because they want to connect blocks quickly), and this breaks that by wiping the cache completely. Coupling it with a change to not drop non-dirty entries while flushing is the easy fix to avoid that regression, though a more complete cleanup of background flushing would also be nice.
  63. jamesob commented at 7:14 pm on August 21, 2019: member

    this breaks that by wiping the cache completely

    On a sort of related note, I think it’s kind of crazy that we couple writing back to disk with emptying the in-memory cache. Feels like those should be separate considerations, especially in a case like this.

  64. laanwj added the label Up for grabs on Oct 8, 2019
  65. laanwj commented at 9:55 am on October 8, 2019: member
    This PR seems to be too controversial to merge in the current state, and discussion has become inactive. Closing as up for grabs. Let me know if you start work on this again and I’ll re-open it.
  66. laanwj closed this on Oct 8, 2019

  67. maflcko commented at 9:42 pm on November 15, 2019: member
    Could rebase on top of something like #17487, then open a new pull request and link to this discussion, which is mostly about “NACK, because this will erase the cache”.
  68. andrewtoth commented at 3:00 am on November 16, 2019: contributor
    @MarcoFalke Thanks for bringing that to my attention. What is the benefit of making a new PR instead of reopening this one? Just for visibility?
  69. maflcko commented at 3:16 am on November 16, 2019: member
    Up to you. Let me know if you want this reopened.
  70. luke-jr commented at 5:02 am on November 16, 2019: member
    The benefit is that the no-longer-applicable criticism is no longer right there at first glance.
  71. maflcko commented at 1:09 pm on November 22, 2019: member
    Given that it was found this does not negatively affect performance, see #15265 (comment), I think it should be reconsidered in its current form.
  72. andrewtoth commented at 2:27 pm on November 22, 2019: contributor
    @MarcoFalke I don’t think the takeaway from that comment is that it doesn’t affect performance. That was benchmarking IBD for pruned nodes without emptying the cache on flush. This PR is flushing the dbcache after IBD. I’m sure for a non-pruned node with a high dbcache this will negatively affect performance directly after the flush, as @TheBlueMatt’s comment points out.
  73. andrewtoth commented at 2:31 pm on November 22, 2019: contributor
    If #17487 gets merged I will rebase this, perhaps without checking for active peers since the cache would no longer be affected and it will make the diff smaller.
  74. Sjors commented at 11:26 am on November 23, 2019: member
    Concept ACK on doing a Flush(erase=false) at the end of IBD once support for that is merged. It probably still makes sense to wait a little bit (e.g. checking for active peers), otherwise the user might be bothered with this flush right before sync really finishes.
  75. bitcoin locked this on Dec 16, 2021
  76. fanquake reopened this on Jan 31, 2023

  77. bitcoin unlocked this on Jan 31, 2023
  78. maflcko removed the label Up for grabs on Jan 31, 2023
  79. maflcko removed the label Validation on Jan 31, 2023
  80. DrahtBot added the label Validation on Jan 31, 2023
  81. theStack referenced this in commit 82903a7a8d on Jan 31, 2023
  82. DrahtBot added the label Needs rebase on Jan 31, 2023
  83. sidhujag referenced this in commit 8ff3d69a6a on Mar 25, 2023
  84. achow101 requested review from jamesob on Apr 25, 2023
  85. Sjors commented at 12:23 pm on July 28, 2023: member

    If #17487 gets merged I will rebase this

    It has been merged.

  86. andrewtoth force-pushed on Jul 30, 2023
  87. in src/net_processing.cpp:530 in 9c27c626cf outdated
    522@@ -523,6 +523,7 @@ class PeerManagerImpl final : public PeerManager
    523                         const std::chrono::microseconds time_received, const std::atomic<bool>& interruptMsgProc) override
    524         EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_recent_confirmed_transactions_mutex, !m_most_recent_block_mutex, !m_headers_presync_mutex, g_msgproc_mutex);
    525     void UpdateLastBlockAnnounceTime(NodeId node, int64_t time_in_seconds) override;
    526+    int GetNumberOfPeersWithValidatedDownloads() override EXCLUSIVE_LOCKS_REQUIRED(cs_main);
    


    luke-jr commented at 1:06 am on July 30, 2023:
    Why was the order here swapped? Seems like override should be the very last keyword IMO

    andrewtoth commented at 8:58 pm on August 17, 2023:
    Hmm it seems like this is the ordering for all methods in this class definition.
  88. in src/init.cpp:142 in 9c27c626cf outdated
    136@@ -137,6 +137,8 @@ static constexpr bool DEFAULT_PROXYRANDOMIZE{true};
    137 static constexpr bool DEFAULT_REST_ENABLE{false};
    138 static constexpr bool DEFAULT_I2P_ACCEPT_INCOMING{true};
    139 static constexpr bool DEFAULT_STOPAFTERBLOCKIMPORT{false};
    140+//! Check if initial sync is done with no change in block height or queued downloads every 30s
    


    luke-jr commented at 1:08 am on July 30, 2023:
    nit: There used to be a blank line before this. IMO it was nicer.
  89. in src/init.cpp:1066 in 9c27c626cf outdated
    1055+ * sync utxo state to protect against data loss
    1056+ */
    1057+static void SyncCoinsTipAfterChainSync(const NodeContext& node)
    1058+{
    1059+    LOCK(node.chainman->GetMutex());
    1060+    if (node.chainman->ActiveChainstate().IsInitialBlockDownload()) {
    


    luke-jr commented at 1:17 am on July 30, 2023:
    Do we want to monitor only the active chainstate? Or should we be paying attention to the background one too?

    andrewtoth commented at 2:46 am on July 30, 2023:
    That’s a good question. Please correct me if my assumptions are incorrect, but I believe the other chainstate would be the background-syncing one, while our active one is the assumeutxo chainstate, which we want to sync quickly and then have active ASAP. In that case, there are only two chainstates, active and background. The background chainstate can take a long time to sync, possibly >24h, which means the periodic flush would take effect anyway, so syncing it here would not really be beneficial. For the active chainstate it would be much less than 24h, and therefore we would want to sync it to disk to protect against data loss.
  90. in src/init.cpp:1069 in 9c27c626cf outdated
    1064+        return;
    1065+    }
    1066+
    1067+    static auto last_chain_height{-1};
    1068+    const auto current_height{node.chainman->ActiveHeight()};
    1069+    if (last_chain_height == -1 || last_chain_height != current_height) {
    


    luke-jr commented at 1:19 am on July 30, 2023:
    -1 is always going to be != current_height anyway…?
  91. luke-jr changes_requested
  92. DrahtBot removed the label Needs rebase on Jul 30, 2023
  93. andrewtoth commented at 3:07 am on July 30, 2023: contributor
    @luke-jr thank you for the speedy review! Will address your comments, but will wait for others to review as well. @Sjors @MarcoFalke rebased, apologies for the delay.
  94. luke-jr referenced this in commit ecac3c2be0 on Aug 16, 2023
  95. andrewtoth force-pushed on Aug 17, 2023
  96. DrahtBot added the label CI failed on Aug 18, 2023
  97. DrahtBot removed the label CI failed on Aug 18, 2023
  98. andrewtoth force-pushed on Aug 21, 2023
  99. andrewtoth force-pushed on Aug 26, 2023
  100. luke-jr referenced this in commit 51544ee85d on Aug 29, 2023
  101. DrahtBot added the label CI failed on Jan 9, 2024
  102. DrahtBot added the label Needs rebase on Jan 31, 2024
  103. andrewtoth force-pushed on Jan 31, 2024
  104. DrahtBot removed the label Needs rebase on Jan 31, 2024
  105. DrahtBot removed the label CI failed on Feb 3, 2024
  106. andrewtoth force-pushed on Feb 22, 2024
  107. andrewtoth force-pushed on Feb 22, 2024
  108. andrewtoth commented at 9:21 pm on February 22, 2024: contributor
    @Sjors @maflcko @luke-jr I’ve rebased, added some logging as well as a functional test.
  109. andrewtoth force-pushed on Mar 13, 2024
  110. andrewtoth renamed this:
    validation: Flush state after initial sync
    validation: sync chainstate to disk after syncing to tip
    on Mar 13, 2024
  111. andrewtoth force-pushed on Mar 13, 2024
  112. andrewtoth force-pushed on Mar 13, 2024
  113. luke-jr referenced this in commit 2e729d248c on Mar 14, 2024
  114. luke-jr referenced this in commit 38c883d635 on Mar 14, 2024
  115. DrahtBot added the label Needs rebase on Apr 30, 2024
  116. andrewtoth force-pushed on May 2, 2024
  117. DrahtBot removed the label Needs rebase on May 2, 2024
  118. Cinnamonrou approved
  119. Cinnamonrou approved
  120. Cinnamonrou approved
  121. in src/init.cpp:2025 in c9abaaf731 outdated
    2021@@ -1981,6 +2022,10 @@ bool AppInitMain(NodeContext& node, interfaces::BlockAndHeaderTipInfo* tip_info)
    2022     StartupNotify(args);
    2023 #endif
    2024 
    2025+    node.scheduler->scheduleFromNow([&node] {
    


    mzumsande commented at 7:11 pm on May 20, 2024:
    Should the original scheduling be conditional on IsInitialBlockDownload() == true? Seems like the goal of this PR is to trigger flush if we transition out of IBD, but not so much after restarting an almost or completely synced node. Currently, it will always sync 1 minute after restart when the node is completely synced (with the second call to SyncCoinsTipAfterChainSync, because for the first one after 30s last_chain_height is still -1).

    andrewtoth commented at 0:33 am on June 3, 2024:
    Yes, good idea. Done.
  122. mzumsande commented at 7:39 pm on May 20, 2024: contributor

    Concept ACK

    While this one-time sync after IBD should help in some situations, I’m not sure that it completely resolves #11600 (I encountered this PR while looking into possible improvements to ReplayBlocks()). After all, there are several other situations in which a crash / unclean shutdown could lead to extensive replays (e.g. during IBD) that this PR doesn’t address.

  123. in src/init.cpp:1118 in 011d9b70f3 outdated
    1113+ */
    1114+static void SyncCoinsTipAfterChainSync(const NodeContext& node)
    1115+{
    1116+    LOCK(node.chainman->GetMutex());
    1117+    if (node.chainman->IsInitialBlockDownload()) {
    1118+        LogDebug(BCLog::COINDB, "Node is still in IBD, rescheduling chainstate disk sync...\n");
    


    chrisguida commented at 5:23 pm on May 29, 2024:

    Can we clarify this message a bit to emphasize that this rescheduling does not preclude periodic flushes to disk?

    It’s slightly confusing to see this message as it appears that disk flushes are being deferred indefinitely.

    See: https://github.com/bitcoinknots/bitcoin/issues/70#issuecomment-2136078696


    andrewtoth commented at 0:33 am on June 3, 2024:
    Updated the log message to be "Node is still in IBD, rescheduling post-IBD chainstate disk sync...". Does that clarify it?

    chrisguida commented at 12:03 pm on June 3, 2024:
    Yep, that’s great!!
  124. andrewtoth force-pushed on Jun 3, 2024
  125. DrahtBot added the label CI failed on Jun 3, 2024
  126. DrahtBot commented at 1:45 am on June 3, 2024: contributor

    🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the documentation.

    Possibly this is due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    Leave a comment here, if you need help tracking down a confusing failure.

    Debug: https://github.com/bitcoin/bitcoin/runs/25710459287

  127. andrewtoth commented at 2:48 am on June 3, 2024: contributor
    @mzumsande @chrisguida thank you for your reviews and suggestions. I’ve addressed them and rebased.
  128. DrahtBot removed the label CI failed on Jun 3, 2024
  129. DrahtBot added the label Needs rebase on Jun 11, 2024
  130. validation: sync utxo state after block sync eb8bc83510
  131. test: add test for SyncCoinsTipAfterChainSync 8887d28a01
  132. andrewtoth force-pushed on Jun 11, 2024
  133. DrahtBot removed the label Needs rebase on Jun 11, 2024
  134. in src/init.cpp:1127 in eb8bc83510 outdated
    1122+        LogDebug(BCLog::COINDB, "Node is still in IBD, rescheduling post-IBD chainstate disk sync...\n");
    1123+        node.scheduler->scheduleFromNow([&node] {
    1124+            SyncCoinsTipAfterChainSync(node);
    1125+        }, SYNC_CHECK_INTERVAL);
    1126+        return;
    1127+    }
    


    furszy commented at 7:55 pm on June 12, 2024:

    No need to lock the chainman mutex for IsInitialBlockDownload(). The function already locks it internally.

    Still, I think we shouldn’t use that. The more we lock cs_main, the more unresponsive the software is. Could use a combination of peerman.ApproximateBestBlockDepth() with a constant like we do inside the desirable services flags variation (GetDesirableServiceFlags). Or the peerman m_initial_sync_finished field.


    andrewtoth commented at 1:43 pm on June 13, 2024:

    Point taken for moving the explicit lock after this check, since the lock is taken in IsInitialBlockDownload().

    However, this check only runs once every 30 seconds. I don’t see how it could possibly affect the responsiveness of the software. It is a very fast check, I would assume on the order of microseconds, every 30 seconds.

  135. in src/init.cpp:1137 in eb8bc83510 outdated
    1132+        LogDebug(BCLog::COINDB, "Chain height updated since last check, rescheduling post-IBD chainstate disk sync...\n");
    1133+        last_chain_height = current_height;
    1134+        node.scheduler->scheduleFromNow([&node] {
    1135+            SyncCoinsTipAfterChainSync(node);
    1136+        }, SYNC_CHECK_INTERVAL);
    1137+        return;
    


    furszy commented at 8:25 pm on June 12, 2024:

    Isn’t this going to always reschedule the task on the first run?

    Also, the active height refers to the latest connected block. It doesn’t tell us whether we are up to date with the network; to know if we are synced, we should use the best known header or call the ApproximateBestBlockDepth() function.

    And thinking more about this; what about adjusting the check interval based on the distance between the active chain height and the best header height?

    I know this could vary a lot but.. something simple like: “if the node is more than 400k blocks away, wait 5 or 10 minutes, if it is 100k blocks away wait 3 or 5 minutes, and if it is less than that, wait 1 minute” would save a good number of unneeded checks in slow machines.
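    A sketch of that interval ramp (thresholds and the helper name are illustrative; this variation was not adopted in the PR):

        #include <chrono>

        // Pick the next check delay from how far the active chain is behind the
        // best known header, so slow machines run far fewer checks during a long sync.
        static std::chrono::minutes NextSyncCheckInterval(int blocks_behind)
        {
            if (blocks_behind > 400'000) return std::chrono::minutes{10};
            if (blocks_behind > 100'000) return std::chrono::minutes{5};
            return std::chrono::minutes{1};
        }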


    andrewtoth commented at 1:55 pm on June 13, 2024:

    Isn’t this going to always reschedule the task on the first run?

    Yes, but do you think this is a problem? It just makes sure the node has not connected any blocks for at least 30 seconds.

    Also, the active height refers to the latest connected block. It doesn’t tell us we are up-to-date with the network; To know if we are sync, should use the best known header or call to the ApproximateBestBlockDepth() function.

    Doesn’t the fact that IsInitialBlockDownload() returns false make this point moot? It checks that our latest block is at most 24 hours old.

    And let’s say this call is triggered before we are completely up-to-date with the network. All that happens is the chainstate is synced to disk, but the utxo cache is not cleared. So at most 24 hours of blocks (~144 blocks) will be downloaded and processed (quickly still with the cache), but not persisted to disk until the next periodic flush (24 hours). I think this patch still achieves its goal and there is no downside now with Sync.

    would save a good number of unneeded checks in slow machines.

    I think this is premature optimization. I don’t think this check will be noticeable by system or user.


    andrewtoth commented at 2:43 pm on July 2, 2024:

    would save a good number of unneeded checks in slow machines.

    @furszy Let’s say that on a slow machine IBD takes 48 hours and, being generous, each check takes 10ms (I think in reality it would be more than two orders of magnitude faster). Then the total is 48h * 60 minutes * 2 (twice a minute) = 5,760 checks * 10ms = 57.6 seconds. So on a 48-hour sync with an excessively slow check, it still adds less than a minute of extra time.


    furszy commented at 2:16 pm on July 4, 2024:

    Ok. np.

    Still, I know it is overkill if we only introduce what I’m going to suggest for this PR, but I thought about adding a signal for the IBD completion state: https://github.com/furszy/bitcoin-core/commit/85a050a8690e5431848658604e913c58ae45a4aa. It might be useful if we ever add any other scenario apart from this one.

  136. luke-jr referenced this in commit 2cff3f267b on Jun 13, 2024
  137. luke-jr referenced this in commit 7ebe6085b8 on Jun 13, 2024
  138. furszy commented at 2:18 pm on July 4, 2024: member
    Code ACK 8887d28a014420668801d7e4c5d1bf45c5e93684
  139. DrahtBot requested review from laanwj on Jul 4, 2024
  140. DrahtBot requested review from Sjors on Jul 4, 2024
  141. DrahtBot requested review from luke-jr on Jul 4, 2024
  142. DrahtBot requested review from mzumsande on Jul 4, 2024
  143. DrahtBot commented at 7:20 pm on July 29, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/26084905920

    Make sure to run all tests locally, according to the documentation.

    The failure may happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  144. DrahtBot added the label CI failed on Jul 29, 2024
  145. andrewtoth commented at 6:21 pm on August 8, 2024: contributor
    Closing in favor of #30611.
  146. andrewtoth closed this on Aug 8, 2024

  147. andrewtoth deleted the branch on Aug 15, 2024
