validation: sync chainstate to disk after syncing to tip #15218

andrewtoth wants to merge 2 commits into bitcoin:master from andrewtoth:flush-after-ibd, changing 5 files (+203 −0)
  1. andrewtoth commented at 7:04 pm on January 20, 2019: contributor

    After the chainstate finishes syncing to tip, it is not persisted to disk until 24 hours after startup. This can cause an issue where the unpersisted chainstate must be resynced if bitcoind is not shut down cleanly. With a large enough dbcache, it’s possible the entire chainstate from genesis would have to be resynced.

    This fixes the issue by persisting the chainstate to disk right after syncing to tip, without clearing the utxo cache (using the Sync method introduced in #17487). This works by scheduling a call to the new function SyncCoinsTipAfterChainSync every 30 seconds. The function checks that the node is out of IBD, then checks that no new block has been added since the last call, and finally checks that no blocks are currently being downloaded from peers. If all these conditions are met, the chainstate is persisted and the function is no longer scheduled (a condensed sketch follows below).

    Mitigates #11600.
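
    A condensed sketch of that scheduled check, assembled from the diffs quoted later in this thread (names follow the PR; the final persist call is elided because the exact Sync entry point is not shown here):

    static void SyncCoinsTipAfterChainSync(const NodeContext& node)
    {
        const auto reschedule{[&node] {
            node.scheduler->scheduleFromNow([&node] { SyncCoinsTipAfterChainSync(node); },
                                            SYNC_CHECK_INTERVAL);
        }};

        // 1) Still in IBD? Check again in 30 seconds.
        if (node.chainman->IsInitialBlockDownload()) return reschedule();

        // 2) Was a block connected since the last check?
        static int last_chain_height{-1};
        const int current_height{WITH_LOCK(node.chainman->GetMutex(),
                                           return node.chainman->ActiveHeight())};
        if (current_height != last_chain_height) {
            last_chain_height = current_height;
            return reschedule();
        }

        // 3) Are peers still delivering blocks to us?
        if (node.peerman->GetNumberOfPeersWithValidatedDownloads() > 0) return reschedule();

        // All conditions met: persist the chainstate via the non-erasing Sync
        // path from #17487 (call elided) and stop rescheduling.
    }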

  2. andrewtoth force-pushed on Jan 20, 2019
  3. in src/validation.cpp:1196 in c5c5702ffe outdated
    1189@@ -1190,6 +1190,11 @@ bool IsInitialBlockDownload()
    1190         return true;
    1191     if (chainActive.Tip()->GetBlockTime() < (GetTime() - nMaxTipAge))
    1192         return true;
    1193+
    1194+    // Flush state after IBD is finished, but don't block return
    1195+    void (*flush_ptr)() = &FlushStateToDisk;
    1196+    std::thread(flush_ptr).detach();
    


    practicalswift commented at 8:19 pm on January 20, 2019:
    Could be written as std::thread(FlushStateToDisk).detach();?

    andrewtoth commented at 8:23 pm on January 20, 2019:
    No, because FlushStateToDisk is overloaded and the compiler can’t determine which function to call in that case.

    practicalswift commented at 8:35 pm on January 20, 2019:

    @andrewtoth Ah, yes! Thanks!

    $ git grep ' FlushStateToDisk.*{'
    src/validation.cpp:bool static FlushStateToDisk(const CChainParams& chainparams, CValidationState &state, FlushStateMode mode, int nManualPruneHeight) {
    src/validation.cpp:void FlushStateToDisk() {
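
    For readers unfamiliar with the ambiguity, a minimal standalone illustration (hypothetical Flush overloads, not the real validation.cpp signatures):

    #include <thread>

    void Flush() {}           // stands in for void FlushStateToDisk()
    void Flush(int mode) {}   // stands in for the parameterized overload

    int main()
    {
        // std::thread(Flush).join();        // error: overloaded name is ambiguous
        void (*flush_ptr)() = &Flush;        // explicitly select the nullary overload
        std::thread(flush_ptr).join();       // OK (the PR used .detach() instead)
        std::thread([] { Flush(); }).join(); // a lambda also disambiguates
    }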
    
  4. fanquake added the label Validation on Jan 21, 2019
  5. andrewtoth force-pushed on Jan 21, 2019
  6. laanwj commented at 3:42 pm on January 21, 2019: member

    Concept ACK, but I think IsInitialBlockDownload is the wrong place to implement this: it’s a query function, and having it suddenly spawn a thread that flushes is unexpected.

    Would be better to implement it closer to the validation logic and database update logic itself.

  7. andrewtoth force-pushed on Jan 21, 2019
  8. andrewtoth commented at 11:34 pm on January 21, 2019: contributor
    @laanwj Good point. I refactored to move this behaviour to ActivateBestChain in an area where periodic flushes are already expected.
  9. andrewtoth force-pushed on Jan 22, 2019
  10. in src/validation.cpp:2758 in e3e1e7aac4 outdated
    2750@@ -2751,7 +2751,13 @@ bool CChainState::ActivateBestChain(CValidationState &state, const CChainParams&
    2751     CheckBlockIndex(chainparams.GetConsensus());
    2752 
    2753     // Write changes periodically to disk, after relay.
    2754-    if (!FlushStateToDisk(chainparams, state, FlushStateMode::PERIODIC)) {
    2755+    // Unless we just finished initial sync, then always write to disk
    2756+    static bool initially_in_ibd = true;
    2757+    bool finished_ibd = initially_in_ibd && !IsInitialBlockDownload();
    2758+    if (finished_ibd) {
    2759+      initially_in_ibd = false;
    


    practicalswift commented at 6:37 am on January 22, 2019:
    Wrong indentation :-)

    andrewtoth commented at 2:25 pm on January 22, 2019:
    Fixed.
  11. laanwj commented at 12:29 pm on January 22, 2019: member

    @laanwj Good point. I refactored to move this behaviour to ActivateBestChain in an area where periodic flushes are already expected.

    Thanks, much better!

  12. andrewtoth force-pushed on Jan 22, 2019
  13. in src/validation.cpp:2755 in 36628231dc outdated
    2750@@ -2751,7 +2751,13 @@ bool CChainState::ActivateBestChain(CValidationState &state, const CChainParams&
    2751     CheckBlockIndex(chainparams.GetConsensus());
    2752 
    2753     // Write changes periodically to disk, after relay.
    2754-    if (!FlushStateToDisk(chainparams, state, FlushStateMode::PERIODIC)) {
    2755+    // Unless we just finished initial sync, then always write to disk
    2756+    static bool initially_in_ibd = true;
    


    ken2812221 commented at 2:51 pm on January 22, 2019:
    I think this variable should be a member of CChainState, to avoid sharing the same variable if we have two CChainState objects in the program.

    andrewtoth commented at 1:34 am on January 23, 2019:
    Done.
  14. sdaftuar commented at 8:52 pm on January 22, 2019: member

    I’m not really a fan of this change – the problem described in #11600 is from an unclean shutdown (i.e. a system crash), where our recovery code could take a long time (but typically much faster than doing a -reindex to recover, which is how our code used to work).

    This change doesn’t really solve that problem, it just changes the window in which an unclean shutdown could occur (reducing it at most by 24 hours). But extra flushes, particularly during initial sync, aren’t obviously a good idea, since they harm performance. (Note that we leave IBD before we’ve synced all the way to the tip, I think once we’re within a day or two?)

    Because we flush every day anyway, it’s hard for me to say that this is really that much worse, performance-wise (after all we don’t currently support a node configuration where the utxo is kept entirely cached). But I’m not sure this solves anything either, and a change like this would have to be reverted if, for instance, we wanted to make the cache actually more useful on startup (something I’ve thought we should do for a while). So I think I’m a -0 on this change.

  15. andrewtoth commented at 1:18 am on January 23, 2019: contributor

    @sdaftuar This change also greatly improves the common workflow of spinning up a high performance instance to sync, then immediately shutting it down and using a cheaper one. Currently, you have to log into it and do a clean shutdown instead of just terminating it. Similarly, when syncing to an external drive, you can now just unplug the drive or turn off the machine when finished.

    I would argue that moving the window to 0 hours directly after initial sync is an objective improvement. There is a lot of data at risk directly after initial sync, so why risk another 24 hours? After that, the most a user will lose is 24 hours’ worth of rollback, instead of 10 years’ worth. Also, this change does not add any extra flushes during initial sync, only one after.

    I can’t speak to your last point about changing the way we use the cache, since I don’t know what your ideas are.

  16. andrewtoth force-pushed on Jan 23, 2019
  17. sdaftuar commented at 5:08 pm on January 23, 2019: member

    Currently, you have to log into it and do a clean shutdown instead of just terminating it.

    @andrewtoth We already support this (better, I think) with the -stopatheight argument, no?

    I don’t really view data that is in memory as “at risk”; I view it as a massive performance optimization that will allow a node to process new blocks at the fastest possible speed while the data hasn’t yet been flushed. I also don’t feel very strongly about this for the reasons I gave above, so if others want this behavior then so be it.

  18. sipa commented at 7:31 pm on January 23, 2019: member
    @sdaftuar Maybe this is a bit of a different discussion, but there is another option; namely supporting flushing the dirty state to disk, but without wiping it from the cache. Based on our earlier benchmarking, we wouldn’t want to do this purely for maximizing IBD performance, but it could be done at specific times to minimize losses in case of crashes (the once per day flush for example, and also this IBD-is-finished one).
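
    A toy illustration of the distinction sipa describes (not the CCoinsViewCache interface; the real non-erasing flush later landed as the Sync method in #17487):

    #include <map>

    struct Entry { int value{0}; bool dirty{false}; };

    struct ToyCache {
        std::map<int, Entry> entries;

        void WriteDirty() {
            for (auto& [key, e] : entries) {
                if (e.dirty) { /* persist e.value for key to the parent view */ }
            }
        }
        void Flush() {   // persist and wipe: the next lookups must go to disk
            WriteDirty();
            entries.clear();
        }
        void Sync() {    // persist but keep entries warm for future lookups
            WriteDirty();
            for (auto& [key, e] : entries) e.dirty = false;
        }
    };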
  19. sdaftuar commented at 7:43 pm on January 23, 2019: member
    @sipa Agreed, I think that would make a lot more sense as a first pass optimization for the periodic flushes and would also work better for this purpose as well.
  20. gmaxwell commented at 7:18 pm on January 24, 2019: contributor

    Currently, you have to log into it and do a clean shutdown instead of just terminating it.

    Well with this, if you “just terminate” you’re going to end up with a replay of several days of blocks at start, which is still ugly, even if less bad with this change.

    As an aside, if you actually shut off the computer at any time during IBD you’ll likely completely corrupt the state and need to reindex, because we don’t use fsync during IBD for performance reasons.

    We really need to get background writing going, so that our writes are never more than (say) a week of blocktime behind… but that is a much bigger change, so I don’t suggest “just do that instead”, though it would make the change here completely unnecessary.

    Might it be better to trigger the flush the first time it goes 30 seconds without connecting a block and there are no queued transfers, from the scheduler thread?

  21. in test/functional/mempool_accept.py:61 in 4787054903 outdated
    56@@ -57,6 +57,8 @@ def run_test(self):
    57         self.log.info('Start with empty mempool, and 200 blocks')
    58         self.mempool_size = 0
    59         wait_until(lambda: node.getblockcount() == 200)
    60+        # Generate 1 more block to exit IBD
    61+        node.generate(1)
    


    maflcko commented at 7:29 pm on January 24, 2019:
    Seems unrelated to the changes?

    andrewtoth commented at 1:44 am on January 25, 2019:
    The test below, which checks that a mined transaction can be added to the mempool, will fail once this flush is triggered by the generated block: it rejects the transaction with the missing-inputs message rather than the expected tx-already-known. Triggering the flush here makes the test pass.
  22. andrewtoth commented at 2:22 am on January 25, 2019: contributor

    @andrewtoth We already support this (better, I think) with the -stopatheight argument, no?

    @sdaftuar Ahh, I never considered using that for this purpose. Thanks!

    @gmaxwell It might still be ugly to have a replay of a few days, but much better than making everything unusable for hours.

    There are comments from several people in this PR about adding background writing and writing dirty state to disk without wiping the cache. This change wouldn’t affect either of those improvements, and is an improvement by itself in the interim.

    As for moving this to the scheduler thread, I think the current approach is better since it happens in a place where periodic flushes are already expected. Also, checking every 30 seconds for a new block wouldn’t work if, for instance, the network cuts out for a few minutes.

  23. sipa commented at 2:36 am on January 25, 2019: member
    @andrewtoth The problem is that right now, causing a flush when exiting IBD will (temporarily) kill your performance right before finishing the sync (because it leaves you with an empty cache). If instead it was a non-clearing flush, there would be no such downside.
  24. sdaftuar commented at 6:56 pm on January 29, 2019: member

    My experiment in #15265 has changed my view on this a bit – now I think that we might as well make a change like this for now, but should change the approach slightly to do something like @gmaxwell’s proposal so that we don’t trigger the flush before we are done syncing:

    Might it be better to trigger the flush the first time it goes 30 seconds without connecting a block and there are no queued transfers, from the scheduler thread?

  25. andrewtoth force-pushed on Feb 10, 2019
  26. andrewtoth force-pushed on Feb 10, 2019
  27. andrewtoth commented at 8:34 pm on February 10, 2019: contributor

    @sdaftuar @gmaxwell I’ve updated this to check every 30 seconds on the scheduler thread whether there has been an update to the active chain height. The check only proceeds once IsInitialBlockDownload is false, which happens when the latest block is within a day of the current time.

    I’m not sure how to check if there are queued transfers. If this is not sufficient, some guidance on how to do that would be appreciated.

  28. in src/init.cpp:1790 in 442db9d79d outdated
    1783@@ -1781,5 +1784,19 @@ bool AppInitMain(InitInterfaces& interfaces)
    1784         g_banman->DumpBanlist();
    1785     }, DUMP_BANS_INTERVAL * 1000);
    1786 
    1787+    // Once initial sync is finished, flush state to protect against data loss
    1788+    bool did_initial_flush = false;
    1789+    int last_chain_height = -1;
    1790+    scheduler.scheduleEvery([&did_initial_flush, &last_chain_height] {
    


    practicalswift commented at 9:12 pm on February 10, 2019:
    Lifetime issue with did_initial_flush and last_chain_height when AppInitMain goes out of scope?

    andrewtoth commented at 10:06 pm on February 10, 2019:
    Moved them inside the lambda as static variables.
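
    The problem and the fix in miniature (g_stored_task stands in for the scheduler’s task queue; this is not the CScheduler API):

    #include <functional>

    std::function<void()> g_stored_task;

    void AppInitMain_buggy()
    {
        bool did_initial_flush{false};            // destroyed when this function returns
        g_stored_task = [&did_initial_flush] {    // captures by reference...
            if (!did_initial_flush) { /* ... */ } // ...dangles when run later
        };
    }

    void AppInitMain_fixed()
    {
        g_stored_task = [] {
            static bool did_initial_flush{false}; // static storage outlives the scope
            if (!did_initial_flush) { /* flush once */ did_initial_flush = true; }
        };
    }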
  29. andrewtoth force-pushed on Feb 10, 2019
  30. andrewtoth force-pushed on Feb 11, 2019
  31. DrahtBot commented at 3:49 am on February 11, 2019: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type          Reviewers
    ACK           furszy
    Concept NACK  maflcko
    Concept ACK   laanwj, luke-jr, Sjors, mzumsande

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #27052 (test: rpc: add last block announcement time to getpeerinfo result by LarryRuane)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  32. andrewtoth force-pushed on Feb 14, 2019
  33. andrewtoth force-pushed on Feb 16, 2019
  34. andrewtoth force-pushed on Mar 10, 2019
  35. in src/init.cpp:1833 in 8ecd177c3a outdated
    1828+        if (last_chain_height == -1 || last_chain_height != current_height) {
    1829+            last_chain_height = current_height;
    1830+            return;
    1831+        }
    1832+
    1833+        did_initial_flush = true;
    


    luke-jr commented at 6:33 am on April 17, 2019:
    Can’t we just unschedule the task?

    andrewtoth commented at 2:44 am on April 18, 2019:
    Unless I’m missing something, I don’t think it’s possible to unschedule without making a lot more changes https://github.com/bitcoin/bitcoin/blob/master/src/scheduler.h#L58

    luke-jr commented at 2:17 am on April 19, 2019:
    Use scheduleFromNow conditionally?

    andrewtoth commented at 4:31 pm on April 19, 2019:
    Updated. How is that?
  36. andrewtoth force-pushed on Apr 19, 2019
  37. andrewtoth force-pushed on Apr 19, 2019
  38. in src/init.cpp:1256 in 4ff12b7e98 outdated
    1252@@ -1250,6 +1253,26 @@ bool AppInitLockDataDirectory()
    1253     return true;
    1254 }
    1255 
    1256+static void FlushAfterSync() 
    


    luke-jr commented at 5:01 pm on April 19, 2019:
    Nit: extra space on the end

    andrewtoth commented at 5:23 pm on April 19, 2019:
    Removed.
  39. in src/init.cpp:1260 in 4ff12b7e98 outdated
    1252@@ -1250,6 +1253,26 @@ bool AppInitLockDataDirectory()
    1253     return true;
    1254 }
    1255 
    1256+static void FlushAfterSync() 
    1257+{
    1258+    // Once initial sync is finished, flush state to protect against data loss
    1259+    if (IsInitialBlockDownload()) {
    1260+        scheduler.scheduleFromNow(*FlushAfterSync, SYNC_CHECK_INTERVAL * 1000);
    


    luke-jr commented at 5:02 pm on April 19, 2019:
    I don’t understand the purpose of * before FlushAfterSync.

    andrewtoth commented at 5:23 pm on April 19, 2019:
    Removed them.
  40. luke-jr approved
  41. andrewtoth force-pushed on Apr 19, 2019
  42. luke-jr approved
  43. luke-jr commented at 7:39 pm on April 19, 2019: member
    utACK
  44. in src/init.cpp:1258 in 6471856400 outdated
    1252@@ -1250,6 +1253,26 @@ bool AppInitLockDataDirectory()
    1253     return true;
    1254 }
    1255 
    1256+static void FlushAfterSync()
    1257+{
    1258+    // Once initial sync is finished, flush state to protect against data loss
    


    maflcko commented at 9:10 pm on April 19, 2019:
    doc-nit: This comment should be moved down after the if, or turned into a docstring on top of the method, imo. The condition in the if below is “initial sync is not finished”.

    andrewtoth commented at 3:04 pm on April 20, 2019:
    Updated comments to align with coding style.
  45. andrewtoth force-pushed on Apr 20, 2019
  46. andrewtoth force-pushed on Apr 20, 2019
  47. andrewtoth force-pushed on Apr 20, 2019
  48. andrewtoth force-pushed on Apr 20, 2019
  49. maflcko commented at 2:14 pm on April 22, 2019: member
    utACK 5d9aa4c6432bd41e8d65717976293e006a6c732a
  50. sdaftuar commented at 3:04 pm on April 22, 2019: member

    I’m not sure how to check if there are queued transfers. If this is not sufficient, some guidance on how to do that would be appreciated.

    There’s a variable called nPeersWithValidatedDownloads in net_processing.cpp which indicates how many peers we are downloading blocks from. So to implement @gmaxwell’s suggestion I think you would just need to expose that variable and then check to see if it equals 0.

  51. andrewtoth force-pushed on Apr 23, 2019
  52. andrewtoth commented at 0:57 am on April 23, 2019: contributor
    @sdaftuar Thanks. Updated.
  53. andrewtoth force-pushed on Apr 23, 2019
  54. in src/net_processing.cpp:806 in 6f76c4f97b outdated
    802@@ -803,6 +803,11 @@ bool GetNodeStateStats(NodeId nodeid, CNodeStateStats &stats) {
    803     return true;
    804 }
    805 
    806+int GetNumberOfPeersWithValidatedDownloads() EXCLUSIVE_LOCKS_REQUIRED(cs_main)
    


    maflcko commented at 11:47 am on April 23, 2019:

    Annotations in definitions are ignored; they need to be in the declaration:

    int GetNumberOfPeersWithValidatedDownloads()
    

    andrewtoth commented at 2:05 pm on April 23, 2019:
    Fixed.
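
    A sketch of the corrected split, assuming the usual pattern of asserting the lock in the body (not the exact committed code):

    // Declaration (header): the thread-safety annotation belongs here,
    // where the analysis sees it at every call site.
    int GetNumberOfPeersWithValidatedDownloads() EXCLUSIVE_LOCKS_REQUIRED(cs_main);

    // Definition (net_processing.cpp): no annotation needed.
    int GetNumberOfPeersWithValidatedDownloads()
    {
        AssertLockHeld(cs_main);
        return nPeersWithValidatedDownloads;
    }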
  55. andrewtoth force-pushed on Apr 23, 2019
  56. DrahtBot added the label Needs rebase on Jul 31, 2019
  57. andrewtoth force-pushed on Jul 31, 2019
  58. andrewtoth force-pushed on Jul 31, 2019
  59. DrahtBot removed the label Needs rebase on Jul 31, 2019
  60. andrewtoth force-pushed on Jul 31, 2019
  61. andrewtoth force-pushed on Jul 31, 2019
  62. TheBlueMatt commented at 6:17 pm on August 21, 2019: contributor
    -0.5. This is really not the right way to go here, I think. Namely, people with a high dbcache may also want to use it after IBD completes (i.e. because they want to connect blocks quickly), and this breaks that by wiping the cache completely. Coupling it with a change to not drop non-dirty entries while flushing is the easy fix to avoid a regression, though a more complete cleanup of background flushing would also be nice.
  63. jamesob commented at 7:14 pm on August 21, 2019: member

    this breaks that by wiping the cache completely

    On a sort of related note, I think it’s kind of crazy that we couple writing back to disk with emptying the in-memory cache. Feels like those should be separate considerations, especially in a case like this.

  64. laanwj added the label Up for grabs on Oct 8, 2019
  65. laanwj commented at 9:55 am on October 8, 2019: member
    This PR seems to be too controversial to merge in the current state, and discussion has become inactive. Closing as up for grabs. Let me know if you start work on this again and I’ll re-open it.
  66. laanwj closed this on Oct 8, 2019

  67. maflcko commented at 9:42 pm on November 15, 2019: member
    Could rebase on top of something like #17487, and then open a new pull request and link to this discussion, which is mostly about “NACK, because this will erase the cache”.
  68. andrewtoth commented at 3:00 am on November 16, 2019: contributor
    @MarcoFalke Thanks for bringing that to my attention. What is the benefit of making a new PR instead of reopening this one? Just for visibility?
  69. maflcko commented at 3:16 am on November 16, 2019: member
    Up to you. Let me know if you want this reopened.
  70. luke-jr commented at 5:02 am on November 16, 2019: member
    The benefit is that the no-longer-applicable criticism is no longer right there at first glance.
  71. maflcko commented at 1:09 pm on November 22, 2019: member
    Given that it was found this does not negatively affect performance, see #15265 (comment), I think it should be reconsidered in its current form.
  72. andrewtoth commented at 2:27 pm on November 22, 2019: contributor
    @MarcoFalke I don’t think the takeaway from that comment is that it doesn’t affect performance. That benchmark measured IBD for pruned nodes when not emptying the cache on flush, whereas this PR flushes the dbcache after IBD. I’m sure that for a non-pruned node with a high dbcache this will negatively affect performance directly after the flush, as @TheBlueMatt’s comment points out.
  73. andrewtoth commented at 2:31 pm on November 22, 2019: contributor
    If #17487 gets merged I will rebase this, perhaps without checking for active peers since the cache would no longer be affected and it will make the diff smaller.
  74. Sjors commented at 11:26 am on November 23, 2019: member
    Concept ACK on doing a Flush(erase=false) at the end of IBD once support for that is merged. It probably still makes sense to wait a little bit (e.g. checking for active peers), otherwise the user might be bothered with this flush right before sync really finishes.
  75. bitcoin locked this on Dec 16, 2021
  76. fanquake reopened this on Jan 31, 2023

  77. bitcoin unlocked this on Jan 31, 2023
  78. maflcko removed the label Up for grabs on Jan 31, 2023
  79. maflcko removed the label Validation on Jan 31, 2023
  80. DrahtBot added the label Validation on Jan 31, 2023
  81. theStack referenced this in commit 82903a7a8d on Jan 31, 2023
  82. DrahtBot added the label Needs rebase on Jan 31, 2023
  83. sidhujag referenced this in commit 8ff3d69a6a on Mar 25, 2023
  84. achow101 requested review from jamesob on Apr 25, 2023
  85. Sjors commented at 12:23 pm on July 28, 2023: member

    If #17487 gets merged I will rebase this

    It has been merged.

  86. andrewtoth force-pushed on Jul 30, 2023
  87. in src/net_processing.cpp:530 in 9c27c626cf outdated
    522@@ -523,6 +523,7 @@ class PeerManagerImpl final : public PeerManager
    523                         const std::chrono::microseconds time_received, const std::atomic<bool>& interruptMsgProc) override
    524         EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_recent_confirmed_transactions_mutex, !m_most_recent_block_mutex, !m_headers_presync_mutex, g_msgproc_mutex);
    525     void UpdateLastBlockAnnounceTime(NodeId node, int64_t time_in_seconds) override;
    526+    int GetNumberOfPeersWithValidatedDownloads() override EXCLUSIVE_LOCKS_REQUIRED(cs_main);
    


    luke-jr commented at 1:06 am on July 30, 2023:
    Why was the order here swapped? Seems like override should be the very last keyword IMO

    andrewtoth commented at 8:58 pm on August 17, 2023:
    Hmm it seems like this is the ordering for all methods in this class definition.
  88. in src/init.cpp:142 in 9c27c626cf outdated
    136@@ -137,6 +137,8 @@ static constexpr bool DEFAULT_PROXYRANDOMIZE{true};
    137 static constexpr bool DEFAULT_REST_ENABLE{false};
    138 static constexpr bool DEFAULT_I2P_ACCEPT_INCOMING{true};
    139 static constexpr bool DEFAULT_STOPAFTERBLOCKIMPORT{false};
    140+//! Check if initial sync is done with no change in block height or queued downloads every 30s
    


    luke-jr commented at 1:08 am on July 30, 2023:
    nit: There used to be a blank line before this. IMO it was nicer.
  89. in src/init.cpp:1066 in 9c27c626cf outdated
    1055+ * sync utxo state to protect against data loss
    1056+ */
    1057+static void SyncCoinsTipAfterChainSync(const NodeContext& node)
    1058+{
    1059+    LOCK(node.chainman->GetMutex());
    1060+    if (node.chainman->ActiveChainstate().IsInitialBlockDownload()) {
    


    luke-jr commented at 1:17 am on July 30, 2023:
    Do we want to monitor only the active chainstate? Or should we be paying attention to the background one too?

    andrewtoth commented at 2:46 am on July 30, 2023:
    That’s a good question. Please correct me if my assumptions are incorrect, but I believe the other chainstate would be the background-syncing one, while our active chainstate is the assumeutxo one, which we want to sync quickly and have active ASAP. In that case, there are only two chainstates, active and background. The background chainstate can take a long time to sync, possibly >24h, in which case the periodic flush would kick in anyway, so syncing it here would not be very beneficial. The active chainstate syncs in much less than 24h, so we do want to sync it to disk to protect against data loss.
  90. in src/init.cpp:1069 in 9c27c626cf outdated
    1064+        return;
    1065+    }
    1066+
    1067+    static auto last_chain_height{-1};
    1068+    const auto current_height{node.chainman->ActiveHeight()};
    1069+    if (last_chain_height == -1 || last_chain_height != current_height) {
    


    luke-jr commented at 1:19 am on July 30, 2023:
    -1 is always going to be != current_height anyway…?
  91. luke-jr changes_requested
  92. DrahtBot removed the label Needs rebase on Jul 30, 2023
  93. andrewtoth commented at 3:07 am on July 30, 2023: contributor
    @luke-jr thank you for the speedy review! Will address your comments, but will wait for others to review as well. @Sjors @MarcoFalke rebased, apologies for the delay.
  94. luke-jr referenced this in commit ecac3c2be0 on Aug 16, 2023
  95. andrewtoth force-pushed on Aug 17, 2023
  96. DrahtBot added the label CI failed on Aug 18, 2023
  97. DrahtBot removed the label CI failed on Aug 18, 2023
  98. andrewtoth force-pushed on Aug 21, 2023
  99. andrewtoth force-pushed on Aug 26, 2023
  100. luke-jr referenced this in commit 51544ee85d on Aug 29, 2023
  101. DrahtBot added the label CI failed on Jan 9, 2024
  102. DrahtBot added the label Needs rebase on Jan 31, 2024
  103. andrewtoth force-pushed on Jan 31, 2024
  104. DrahtBot removed the label Needs rebase on Jan 31, 2024
  105. DrahtBot removed the label CI failed on Feb 3, 2024
  106. andrewtoth force-pushed on Feb 22, 2024
  107. andrewtoth force-pushed on Feb 22, 2024
  108. andrewtoth commented at 9:21 pm on February 22, 2024: contributor
    @Sjors @maflcko @luke-jr I’ve rebased, added some logging as well as a functional test.
  109. andrewtoth force-pushed on Mar 13, 2024
  110. andrewtoth renamed this:
    validation: Flush state after initial sync
    validation: sync chainstate to disk after syncing to tip
    on Mar 13, 2024
  111. andrewtoth force-pushed on Mar 13, 2024
  112. andrewtoth force-pushed on Mar 13, 2024
  113. luke-jr referenced this in commit 2e729d248c on Mar 14, 2024
  114. luke-jr referenced this in commit 38c883d635 on Mar 14, 2024
  115. DrahtBot added the label Needs rebase on Apr 30, 2024
  116. andrewtoth force-pushed on May 2, 2024
  117. DrahtBot removed the label Needs rebase on May 2, 2024
  118. Cinnamonrou approved
  119. Cinnamonrou approved
  120. Cinnamonrou approved
  121. in src/init.cpp:2025 in c9abaaf731 outdated
    2021@@ -1981,6 +2022,10 @@ bool AppInitMain(NodeContext& node, interfaces::BlockAndHeaderTipInfo* tip_info)
    2022     StartupNotify(args);
    2023 #endif
    2024 
    2025+    node.scheduler->scheduleFromNow([&node] {
    


    mzumsande commented at 7:11 pm on May 20, 2024:
    Should the original scheduling be conditional on IsInitialBlockDownload() == true? Seems like the goal of this PR is to trigger flush if we transition out of IBD, but not so much after restarting an almost or completely synced node. Currently, it will always sync 1 minute after restart when the node is completely synced (with the second call to SyncCoinsTipAfterChainSync, because for the first one after 30s last_chain_height is still -1).

    andrewtoth commented at 0:33 am on June 3, 2024:
    Yes, good idea. Done.
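
    A sketch of the conditional arming mzumsande suggests, following the scheduling code quoted above (not the exact committed diff):

    // Only arm the post-IBD sync task if the node starts out in IBD; a node
    // that is already synced at startup skips it entirely.
    if (node.chainman->IsInitialBlockDownload()) {
        node.scheduler->scheduleFromNow([&node] {
            SyncCoinsTipAfterChainSync(node);
        }, SYNC_CHECK_INTERVAL);
    }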
  122. mzumsande commented at 7:39 pm on May 20, 2024: contributor

    Concept ACK

    While this one-time sync after IBD should help in some situations, I’m not sure it completely resolves #11600 (I encountered this PR while looking into possible improvements to ReplayBlocks()). After all, there are several other situations in which a crash / unclean shutdown could lead to extensive replays (e.g. during IBD) that this PR doesn’t address.

  123. in src/init.cpp:1118 in 011d9b70f3 outdated
    1113+ */
    1114+static void SyncCoinsTipAfterChainSync(const NodeContext& node)
    1115+{
    1116+    LOCK(node.chainman->GetMutex());
    1117+    if (node.chainman->IsInitialBlockDownload()) {
    1118+        LogDebug(BCLog::COINDB, "Node is still in IBD, rescheduling chainstate disk sync...\n");
    


    chrisguida commented at 5:23 pm on May 29, 2024:

    Can we clarify this message a bit to emphasize that this rescheduling does not preclude periodic flushes to disk?

    It’s slightly confusing to see this message as it appears that disk flushes are being deferred indefinitely.

    See: https://github.com/bitcoinknots/bitcoin/issues/70#issuecomment-2136078696


    andrewtoth commented at 0:33 am on June 3, 2024:
    Updated the log message to be "Node is still in IBD, rescheduling post-IBD chainstate disk sync...". Does that clarify it?

    chrisguida commented at 12:03 pm on June 3, 2024:
    Yep, that’s great!!
  124. andrewtoth force-pushed on Jun 3, 2024
  125. DrahtBot added the label CI failed on Jun 3, 2024
  126. DrahtBot commented at 1:45 am on June 3, 2024: contributor

    🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the documentation.

    Possibly this is due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    Leave a comment here, if you need help tracking down a confusing failure.

    Debug: https://github.com/bitcoin/bitcoin/runs/25710459287

  127. andrewtoth commented at 2:48 am on June 3, 2024: contributor
    @mzumsande @chrisguida thank you for your reviews and suggestions. I’ve addressed them and rebased.
  128. DrahtBot removed the label CI failed on Jun 3, 2024
  129. DrahtBot added the label Needs rebase on Jun 11, 2024
  130. validation: sync utxo state after block sync eb8bc83510
  131. test: add test for SyncCoinsTipAfterChainSync 8887d28a01
  132. andrewtoth force-pushed on Jun 11, 2024
  133. DrahtBot removed the label Needs rebase on Jun 11, 2024
  134. in src/init.cpp:1127 in eb8bc83510 outdated
    1122+        LogDebug(BCLog::COINDB, "Node is still in IBD, rescheduling post-IBD chainstate disk sync...\n");
    1123+        node.scheduler->scheduleFromNow([&node] {
    1124+            SyncCoinsTipAfterChainSync(node);
    1125+        }, SYNC_CHECK_INTERVAL);
    1126+        return;
    1127+    }
    


    furszy commented at 7:55 pm on June 12, 2024:

    No need to lock the chainman mutex for IsInitialBlockDownload(). The function already locks it internally.

    Still, I think we shouldn’t use that. The more we lock cs_main, the more unresponsive the software is. Could use a combination of peerman.ApproximateBestBlockDepth() with a constant like we do inside the desirable services flags variation (GetDesirableServiceFlags). Or the peerman m_initial_sync_finished field.


    andrewtoth commented at 1:43 pm on June 13, 2024:

    Point taken for moving the explicit lock after this check, since the lock is taken in IsInitialBlockDownload().

    However, this check only runs once every 30 seconds. I don’t see how it could possibly affect the responsiveness of the software; it is a very fast check, I would assume on the order of microseconds, every 30 seconds.

  135. in src/init.cpp:1137 in eb8bc83510 outdated
    1132+        LogDebug(BCLog::COINDB, "Chain height updated since last check, rescheduling post-IBD chainstate disk sync...\n");
    1133+        last_chain_height = current_height;
    1134+        node.scheduler->scheduleFromNow([&node] {
    1135+            SyncCoinsTipAfterChainSync(node);
    1136+        }, SYNC_CHECK_INTERVAL);
    1137+        return;
    


    furszy commented at 8:25 pm on June 12, 2024:

    Isn’t this going to always reschedule the task on the first run?

    Also, the active height refers to the latest connected block. It doesn’t tell us we are up-to-date with the network; to know if we are synced, we should use the best known header or call the ApproximateBestBlockDepth() function.

    And thinking more about this: what about adjusting the check interval based on the distance between the active chain height and the best header height?

    I know this could vary a lot, but something simple like “if the node is more than 400k blocks away, wait 5 or 10 minutes; if it is 100k blocks away, wait 3 or 5 minutes; and if it is less than that, wait 1 minute” would save a good number of unneeded checks on slow machines.


    andrewtoth commented at 1:55 pm on June 13, 2024:

    Isn’t this going to always reschedule the task on the first run?

    Yes, but do you think this is a problem? It just makes sure the node has not connected any blocks for at least 30 seconds.

    Also, the active height refers to the latest connected block. It doesn’t tell us we are up-to-date with the network; to know if we are synced, we should use the best known header or call the ApproximateBestBlockDepth() function.

    Doesn’t the fact that IsInitialBlockDownload() returns false make this point moot? It checks that our latest block is at most 24 hours old.

    And let’s say this call is triggered before we are completely up-to-date with the network. All that happens is that the chainstate is synced to disk, but the utxo cache is not cleared. So at most 24 hours of blocks (~144 blocks) will be downloaded and processed (still quickly, thanks to the cache), but not persisted to disk until the next periodic flush (24 hours later). I think this patch still achieves its goal, and with Sync there is no longer a downside.

    would save a good number of unneeded checks on slow machines.

    I think this is premature optimization. I don’t think this check will be noticeable to the system or the user.


    andrewtoth commented at 2:43 pm on July 2, 2024:

    would save a good number of unneeded checks on slow machines.

    @furszy Let’s say on a slow machine IBD takes 48 hours, and, being generous, each check takes 10ms (I think in reality it would be more than two orders of magnitude faster). Then the total is 48h × 60 minutes × 2 (twice a minute) = 5,760 checks, and 5,760 × 10ms = 57.6 seconds. So even a 48-hour sync with an excessively slow check adds less than a minute of extra time.


    furszy commented at 2:16 pm on July 4, 2024:

    Ok. np.

    Still, I know it is overkill if we only introduce what I’m going to suggest for this PR, but I thought of adding a signal for the IBD completion state: https://github.com/furszy/bitcoin-core/commit/85a050a8690e5431848658604e913c58ae45a4aa. It might be useful if we ever add other scenarios apart from this one.
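
    For concreteness, furszy’s earlier interval-scaling idea could look roughly like this (the helper name and thresholds are illustrative only):

    #include <chrono>

    // Scale the recheck interval by how far the active chain lags the
    // best known header.
    std::chrono::minutes NextSyncCheckInterval(int active_height, int best_header_height)
    {
        const int blocks_behind{best_header_height - active_height};
        if (blocks_behind > 400'000) return std::chrono::minutes{10};
        if (blocks_behind > 100'000) return std::chrono::minutes{5};
        return std::chrono::minutes{1};
    }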

  136. luke-jr referenced this in commit 2cff3f267b on Jun 13, 2024
  137. luke-jr referenced this in commit 7ebe6085b8 on Jun 13, 2024
  138. furszy commented at 2:18 pm on July 4, 2024: member
    Code ACK 8887d28a014420668801d7e4c5d1bf45c5e93684
  139. DrahtBot requested review from laanwj on Jul 4, 2024
  140. DrahtBot requested review from Sjors on Jul 4, 2024
  141. DrahtBot requested review from luke-jr on Jul 4, 2024
  142. DrahtBot requested review from mzumsande on Jul 4, 2024
