Use non-atomic flushing with block replay

sipa commented at 9:50 am on April 4, 2017: member

This implements an alternative solution to the flush-time memory usage peak, suggested by @gmaxwell.

Instead of relying on using atomic batch writes in LevelDB for the chainstate, we rely on the fact that we have an external log of updates to it already (called the blockchain).

This patch adds an extra “head blocks” to the chainstate, which gives the range of blocks for writes may be incomplete. At the start of a flush, we write this record, write the dirty dbcache entries in 16 MiB batches, and at the end we remove the heads record again. If it is present at startup it means we crashed during flush, and we rollback/roll forward blocks inside of it to get a consistent tip on disk before proceeding.

If a flush completes succesfully, the resulting database is compatible with previous versions (down to 0.8). If the node crashes in the middle of a flush, a version of the code with this patch is needed to recovery.

sipa commented at 9:55 am on April 4, 2017: member

This badly needs testing, but I’m not sure how to simulate crashes in the middle of flushing (I’ve manually verified this patch can recover from failure by introducing an exit(0) in the middle of the flush code).

sipa force-pushed on Apr 4, 2017

laanwj commented at 10:43 am on April 4, 2017: member

Cool!

but I’m not sure how to simulate crashes in the middle of flushing

I’ll get to that :)

laanwj added the label UTXO Db and Indexes on Apr 4, 2017

laanwj added the label Resource usage on Apr 4, 2017

sipa force-pushed on Apr 5, 2017

sipa commented at 9:38 am on April 5, 2017: member

Rebased, fixed a bug, and added a commit to allows simulating crashes after partial flushes.

in src/txdb.cpp:84 in aa249d732f outdated

75@@ -74,6 +76,15 @@ bool CCoinsViewDB::BatchWrite(CCoinsMap &mapCoins, const uint256 &hashBlock) {
76             LogPrint(BCLog::COINDB, "Writing partial batch of %.2f MiB\n", batch.SizeEstimate() * (1.0 / 1048576.0));
77             db.WriteBatch(batch);
78             batch.Clear();
79+            int crash_simulate = GetArg("-dbcrashratio", 0);
80+            if (crash_simulate) {
81+                static FastRandomContext rng;
82+                if (rng.rand32() % crash_simulate == 0) {
83+                    LogPrintf("Simulating a crash. Goodbye.");
84+                    sync();

laanwj commented at 9:41 am on April 5, 2017:

This sync should be optional at least (not a realistic crash otherwise)

sipa commented at 12:29 pm on April 5, 2017:

I’ve just removed the sync. It seems not needed for testing (I’ve done reindexes with hundreds of crashes in between).

sipa force-pushed on Apr 5, 2017

gmaxwell commented at 4:28 pm on April 5, 2017: contributor

contrib/devtools/check-doc.py is unhappy that you added new arguments without asking for permission from the argument gods.

in src/txdb.cpp:61 in 7c2459d0e3 outdated

54+
55 bool CCoinsViewDB::BatchWrite(CCoinsMap &mapCoins, const uint256 &hashBlock) {
56     CDBBatch batch(db);
57     size_t count = 0;
58     size_t changed = 0;
59+    size_t batch_size = (size_t)GetArg("-dbbatchsize", nDefaultDbBatchSize) << 20;

gmaxwell commented at 4:31 pm on April 5, 2017:

If we have memory usage that is some multiple of this, perhaps the argument should be in the form of the actual usage rather than the batch size?

sipa commented at 7:47 am on April 6, 2017:

Well the relevant constraint is the memory usage peak from allocating the batch, which depends on the batch memory usage, not dbcache memory usage. Also, I don’t think anyone will need to change this property (except for tests, where it’s very useful to get much more frequent partial flushes).

sipa force-pushed on Apr 6, 2017

in src/dbwrapper.h:86 in a4ddc70da2 outdated

82@@ -75,6 +83,7 @@ class CDBBatch
83         leveldb::Slice slValue(ssValue.data(), ssValue.size());
84 
85         batch.Put(slKey, slValue);
86+        size_estimate += 3 + slKey.size() + slValue.size();

TheBlueMatt commented at 2:01 pm on April 11, 2017:

Can you add comments as to why 3 (and, below, 2) bytes are overhead here?

sipa commented at 9:27 pm on April 11, 2017:

Done

in src/init.cpp:1486 in a4ddc70da2 outdated

1478@@ -1476,6 +1479,12 @@ bool AppInitMain(boost::thread_group& threadGroup, CScheduler& scheduler)
1479                     break;
1480                 }
1481 
1482+                if (!ReplayBlocks(chainparams, pcoinsdbview)) {
1483+                    strLoadError = _("Unable to replay blocks. You will need to rebuild the databse using -reindex.");
1484+                    break;
1485+                }
1486+                pcoinsTip->SetBestBlock(pcoinsdbview->GetBestBlock()); // TODO: only initialize pcoinsTip after ReplayBlocks

TheBlueMatt commented at 6:10 pm on April 11, 2017:

Hmm, yes. Were you intending to do this after this PR? Can you just delete it and re-create it here? I feel like it may make sense to move the chainActive.Tip-setting from LoadBlockIndexDB to after this point.

Speaking of which, did you mean to add a PruneBlockIndexCandidates() to ReplayBlocks ala LoadBlockIndexDB?

sipa commented at 9:28 pm on April 11, 2017:

Were you intending to do this after this PR?

Yes, I tried doing it inside the PR, but doing it properly requires a bit more shuffling around and refactoring, which I’d prefer to keep for later.

Speaking of which, did you mean to add a PruneBlockIndexCandidates() to ReplayBlocks ala LoadBlockIndexDB?

ReplayBlocks doesn’t touch the block index, so I don’t think that would have any effect.

TheBlueMatt commented at 10:12 pm on April 11, 2017:

The PruneBlockIndexCandidates in LoadBlockIndexDB uses chainActive.Tip(), so I assumed it may need to be re-run with the new tip (though likely not a bug without it, just a should-do). I’m ok with cleaning this stuff up in a followup PR, but it seems less than ideal as-is right now.

in src/validation.cpp:3762 in a4ddc70da2 outdated

3709+        return error("ReplayBlocks(): chainstate boundaries not in block index");
3710+    }
3711+    auto pindexUpto = mapBlockIndex[hashUpto];
3712+
3713+    int nHeight = 1; // Skip the genesis block
3714+    if (mapBlockIndex.count(hashBest)) {

TheBlueMatt commented at 6:37 pm on April 11, 2017:

Should we also fail here if hashBest has been written (ie is non-IsNull) but isnt in mapBlockIndex?

sipa commented at 9:34 pm on April 11, 2017:

I think that would be caught by other code we already have, but I’ve added it here.

in src/validation.cpp:1308 in a4ddc70da2 outdated

1303+        if (!tx->IsCoinBase()) {
1304+            for (const CTxIn &txin : tx->vin) {
1305+                inputs.ModifyCoins(txin.prevout.hash)->Spend(txin.prevout.n);
1306+            }
1307+        }
1308+        inputs.ModifyNewCoins(tx->GetHash(), tx->IsCoinBase())->FromTx(*tx, nHeight);

TheBlueMatt commented at 6:53 pm on April 11, 2017:

Based on the comment above ModifyNewCoins I do not believe this works, we may need something new to capture the “maybe not fresh, but definitely fully overwrite in any case” case.

sipa commented at 9:31 pm on April 11, 2017:

Nice catch, fixed.

in src/validation.cpp:3735 in a4ddc70da2 outdated

3730+            return error("ReplayBlocks(): ReadBlockFromDisk failed at %d, hash=%s", pindex->nHeight, pindex->GetBlockHash().ToString());
3731+        }
3732+        ReplayBlock(block, cache, pindex->nHeight);
3733+    }
3734+    cache.SetBestBlock(hashUpto);
3735+    chainActive.SetTip(pindexUpto);

TheBlueMatt commented at 6:55 pm on April 11, 2017:

It seems super weird to be acting entirely on non-globals, and then suddenly set a global here.

sipa commented at 9:37 pm on April 11, 2017:

Good point, fixed.

sipa force-pushed on Apr 11, 2017

sipa force-pushed on Apr 12, 2017

sipa commented at 10:42 am on April 12, 2017: member

Addressed some of @TheBlueMatt’s comments.

in src/coins.h:318 in 16fc01343c outdated

314@@ -315,6 +315,9 @@ class CCoinsView
315     //! Retrieve the block hash whose state this CCoinsView currently represents
316     virtual uint256 GetBestBlock() const;
317 
318+    //! Retrieve the block hash up to which changes are included

TheBlueMatt commented at 2:04 pm on April 12, 2017:

s/which changes/which some changes/?

in src/validation.cpp:3765 in 16fc01343c outdated

3721+
3722+    int nHeight = 1; // Skip the genesis block
3723+    if (mapBlockIndex.count(hashBest)) {
3724+        auto pindexBest = mapBlockIndex[hashBest];
3725+        if (pindexUpto->GetAncestor(pindexBest->nHeight) != pindexBest) {
3726+            return error("ReplayBlocks(): chainstate tip does not derive from final boundary");

TheBlueMatt commented at 2:14 pm on April 12, 2017:

I believe we’ll hit this if we ever crash during a disconnect? Seems kinda annoying to not support disconnect.

sipa commented at 4:06 pm on April 12, 2017:

Good catch. ReplayBlocks should learn to deal with a reorg.

in src/txdb.cpp:63 in 16fc01343c outdated

58     CDBBatch batch(db);
59     size_t count = 0;
60     size_t changed = 0;
61+    size_t batch_size = (size_t)GetArg("-dbbatchsize", nDefaultDbBatchSize) << 20;
62+    if (!hashBlock.IsNull()) {
63+        batch.Write(DB_BEST_BLOCK_UPTO, hashBlock);

TheBlueMatt commented at 2:21 pm on April 12, 2017:

It seems to me this API ties us to only doing batches per-block, and never across long chains of actions (or at least not across multiple reorgs). Consider the case where you disconnect A to get to B, then disconnect B to get to C then connect D. There is no way to encode that you need to ensure everything from disconnecting B must be replayed to ensure there are no leftover entries from that, I believe. This is likely OK, but should likely be documented somewhere to ensure we dont end up adding a multi-reorg-flush bug later on.

sipa commented at 4:05 pm on April 12, 2017:

That’s a great point. I hadn’t considered that in the case of a reorg the set of partially written changes may include things from multiple branches. It seems solvable by allowing the ‘upto’ blocks to be a list of tip hashes, and then at start time choose which ones to undo and which ones to replay. I think that’s a problem for later, but it makes sense to have a comment about it.

sipa force-pushed on Apr 13, 2017

sipa force-pushed on Apr 16, 2017

sipa commented at 2:18 pm on April 16, 2017: member

Updated to deal with reorganizations. The disk format and recovery code can now also deal with multiple partially written branches. That functionality is not needed yet, but means we can switch to different partial flushing strategies later without breaking compatibility with older versions.

sipa force-pushed on Apr 16, 2017

in src/init.cpp:1486 in c9066bcf36 outdated

1478@@ -1476,6 +1479,13 @@ bool AppInitMain(boost::thread_group& threadGroup, CScheduler& scheduler)
1479                     break;
1480                 }
1481 
1482+                if (!ReplayBlocks(chainparams, pcoinsdbview)) {
1483+                    strLoadError = _("Unable to replay blocks. You will need to rebuild the database using -reindex-chainstate.");
1484+                    break;
1485+                }
1486+                pcoinsTip->SetBestBlock(pcoinsdbview->GetBestBlock()); // TODO: only initialize pcoinsTip after ReplayBlocks

TheBlueMatt commented at 4:01 pm on April 17, 2017:

I think you can do this for (almost) free now. See https://github.com/TheBlueMatt/bitcoin/commit/747b766a02c32a47a5f0018ed352991b55aaab19, though if you dont want to take it here I’ll just PR it afterwards.

in src/validation.cpp:3736 in c9066bcf36 outdated

3731+        return error("RollbackBlock(): failure reading undo data");
3732+
3733+    if (blockUndo.vtxundo.size() + 1 != block.vtx.size())
3734+        return error("RollbackBlock(): block and undo data inconsistent");
3735+
3736+    for (size_t i = 0; i < block.vtx.size(); ++i) {

TheBlueMatt commented at 4:44 pm on April 17, 2017:

Doesn’t this need to be in reverse order (like in DisconnectBlock, maybe you should just go ahead and add an option to DisconnectBlock to ignore errors in a pervious commit to make it easier to review?)?

sipa commented at 7:07 pm on April 17, 2017:

Good catch, fixed.

I’ll try to do the merging in a extra commit.

in src/validation.cpp:1507 in c9066bcf36 outdated

1504@@ -1505,18 +1505,15 @@ bool ApplyTxInUndo(const CTxInUndo& undo, CCoinsViewCache& view, const COutPoint
1505     CCoinsModifier coins = view.ModifyCoins(out.hash);
1506     if (undo.nHeight != 0) {
1507         // undo data contains height: this is the last output of the prevout tx being spent
1508-        if (!coins->IsPruned())
1509-            fClean = fClean && error("%s: undo data overwriting existing transaction", __func__);
1510+        if (!coins->IsPruned()) fClean = false;

TheBlueMatt commented at 4:56 pm on April 17, 2017:

Would be nice to not lose the error messages by adding a flag for printing errors. I think you need the flag either way for the next line, as I dont think you can run the Clear() if we’re re-undoing a tx.

sipa commented at 6:57 pm on April 17, 2017:

There is no change in behavior here; the error case doesn’t cause a return from the function. I believe the new (and existing) code is fine: if a failure is detected, the caller (VerifyDB or DisconnectBlock) won’t flush the changes view to the level below, ignoring the resulting inconsistent state.

If you insist, I’ll add a flag to ignore the error messages, but (perhaps in a separate PR) I think we should rid of these error() calls and instead perhaps return some return code. Reporting of these things doesn’t belong here.

TheBlueMatt commented at 7:19 pm on April 17, 2017:

I just find the errors useful to keep around, my real concern is the Clear() one line down, which I believe is an actual bug for the new usage in RollbackBlock.

sipa commented at 8:03 pm on April 17, 2017:

I’m not convinced the Clear() is wrong - it does mean we’re passing over a state where the outputs for that TX were all full spent. However, this is hard to reason about, and you may well be right. Furthermore, it seems that Clear() has no purpose. In the ‘clean’ case, the output is already pruned, to the Clear() is a no-op. In the other case it doesn’t matter. I’m removing it.

TheBlueMatt commented at 8:23 pm on April 17, 2017:

Hmm, you may be right. Indeed, however, hard to reason about.

in src/validation.cpp:3751 in c9066bcf36 outdated

3784+
3785+    // Build a set of all blocks to rollback.
3786+    std::set<const CBlockIndex*, CBlockIndexWorkComparator> vpindexRollback;
3787+    for (size_t i = 1; i < pindexHeads.size(); ++i) {
3788+        const CBlockIndex *pindexHead = pindexHeads[i];
3789+        while (fGenesis ? pindexHead != nullptr : pindexHead != pindexFork) {

TheBlueMatt commented at 6:44 pm on April 17, 2017:

I’m not convinced this is right. What if you connect both A and B, in the simple case? Now you’ll disconnect B before you re-connect A and then re-connect B? Is that neccessary?

sipa commented at 7:07 pm on April 17, 2017:

Note that i starts at 1 in the loop, skipping the branch that leads to the new tip. I’ve added a comment to clarify.

TheBlueMatt commented at 7:17 pm on April 17, 2017:

Indeed, but that only means you wont do the disconnect-then-reconnect thing for one block (which I suppose may be fine for this PR), but you will do it if you have two back-to-back blocks in the list (or am I confused?).

sipa commented at 7:59 pm on April 17, 2017:

No. Let’s say you have a chain A<-B<-C that was being flushed (meaning the old tip was A, you crashed in the middle of writing the changes for B and C, with C the intended new tip). In this case, at recovery time, GetBlockHeads() will return [C,A]. pindexFork will be A. The loop above will only process A, but because A is already the fork point, nothing is added to the disconnect set.

TheBlueMatt commented at 8:18 pm on April 17, 2017:

Yes, OK, it wasnt clear to me what GetBlockHeads() should be returning there, Flush seemed to indicate something different.

sipa commented at 7:09 pm on April 17, 2017: member

Thanks for the detailed review, @TheBlueMatt.

sipa force-pushed on Apr 17, 2017

sipa renamed this:
~~Use non-atomic flushing with block replay~~
[WIP] Use non-atomic flushing with block replay
on Apr 17, 2017

in src/txdb.cpp:66 in 02dfa4a54d outdated

61+        // Read existing heads from database.
62+        uint256 tip = GetBestBlock();
63+        std::vector<uint256> heads = GetHeadBlocks();
64+        // Construct a set with all existing heads, excluding the new tip.
65+        std::set<uint256> setHeads(heads.begin(), heads.end());
66+        if (setHeads.empty() || !tip.IsNull()) setHeads.insert(tip);

NicolasDorier commented at 5:52 am on April 18, 2017:

can tip be null ? I would have expected it is at least genesis.

sipa commented at 11:53 am on April 18, 2017:

Tip can be null when it’s the first write ever to the database.

NicolasDorier commented at 4:31 am on April 20, 2017:

I am surprised, I always thought the coinbase of the genesis block was not added in the database. This would mean that having it to null to mean “before processing the genesis”, is the same than enforcing the best block to be at least the genesis block.

in src/validation.cpp:3771 in 02dfa4a54d outdated

3766+    const CBlockIndex* pindexFork = nullptr;
3767+    bool fGenesis = false;
3768+
3769+    // Find last common ancestor of all heads.
3770+    for (const uint256& hash : hashHeads) {
3771+        if (hash.IsNull()) {

NicolasDorier commented at 6:05 am on April 18, 2017:

I am confused about all this fGenesis variable. I guess this is related to my other comment on https://github.com/bitcoin/bitcoin/pull/10148/commits/02dfa4a54da44e40b1d29687cc68885f3a306ef3#r111875593

I would not expect, pindexFork to be null, as the work fork point should be the genesis block.

sipa commented at 11:53 am on April 18, 2017:

I’ve added a lengthy comment to clarify.

laanwj commented at 6:46 am on April 18, 2017: member

contrib/devtools/check-doc.py is unhappy that you added new arguments without asking for permission from the argument gods.

Not sure if you were being sarcastic here, but to be clear: that script checks whether options are documented (either as debug option or as normal option), not whether you have a signed permission on a stone tablet from $PANTHEON.

Going to test this on a node.

sipa force-pushed on Apr 19, 2017

sipa commented at 5:02 pm on April 19, 2017: member

Rebased, and removed the unused multi-reorg support in the recovery code.

sdaftuar commented at 8:02 pm on April 19, 2017: member

@sipa Thanks for the simplification. I’ve read through the logic and I believe it is correct; will think about ways to carefully test.

There was some discussion on IRC but I wanted to summarize my current thoughts: this is still some complication we’re adding to consensus, and I expect that the per-txout caching from #10195 will change the behavior and performance of the cache substantially, so I’d prefer to review that PR independently of this one to better consider the design decisions of this PR. Edit: Just noticed you rebased #10195 without this, thanks!

sipa commented at 8:45 pm on April 19, 2017: member

@sdaftuar My plan is adding a unit test that uses wrapper around CCoinsViewDB that drops 50% of the dirty entries before writing, and then issues ReplayBlock to recover from it. Then run a test that creates random regtest blocks with a few transactions, and frequently reorganizes.

in src/validation.cpp:3755 in d9b3da62b8 outdated

3750+                return error("RollbackBlock(): ReadBlockFromDisk() failed at %d, hash=%s", pindexOld->nHeight, pindexOld->GetBlockHash().ToString());
3751+            }
3752+            LogPrintf("Rolling back %s (%i)\n", pindexOld->GetBlockHash().ToString(), pindexOld->nHeight);
3753+            bool fClean = true; // Ignoring resulting state of fClean.
3754+            CValidationState state;
3755+            if (!DisconnectBlock(block, state, pindexOld, cache, &fClean)) return false;

NicolasDorier commented at 5:00 am on April 20, 2017:

maybe return an error() here ?

sipa commented at 11:37 am on April 20, 2017:

Done. All cases in which DisconnectBlock returns false already report an error(), but it may help.

in src/validation.cpp:3598 in d9b3da62b8 outdated

3737+    if (!hashHeads[1].IsNull()) { // The old tip is allowed to be 0, indicating it's the first flush.
3738+        if (mapBlockIndex.count(hashHeads[1]) == 0) {
3739+            return error("ReplayBlocks(): reorganization from unknown block requested");
3740+        }
3741+        pindexOld = mapBlockIndex[hashHeads[1]];
3742+        pindexFork = LastCommonAncestor(pindexOld, pindexNew);

NicolasDorier commented at 5:03 am on April 20, 2017:

I would add a assert(pindexFork);

sipa commented at 11:37 am on April 20, 2017:

Done.

NicolasDorier commented at 5:14 am on April 20, 2017: contributor

CodeReview ACK. Would be more at ease with some functional tests.

An easy test is trying to sync a node for 1000 blocks, with dbbatchsize to 1 bytes (it is in MB right now though) and dbcrashratio to like 10, and see if it eventually manage to do it.

in src/txdb.cpp:121 in 641fc0d7c7 outdated

79@@ -59,12 +80,22 @@ bool CCoinsViewDB::BatchWrite(CCoinsMap &mapCoins, const uint256 &hashBlock) {
80         count++;
81         CCoinsMap::iterator itOld = it++;
82         mapCoins.erase(itOld);
83+        if (batch.SizeEstimate() > batch_size) {
84+            LogPrint(BCLog::COINDB, "Writing partial batch of %.2f MiB\n", batch.SizeEstimate() * (1.0 / 1048576.0));
85+            db.WriteBatch(batch);

morcos commented at 3:31 pm on April 20, 2017:

Should we be looking for failure here and returning false if any of the WriteBatch calls fail?

sipa commented at 10:58 pm on April 24, 2017:

WriteBatch throws a dbwrapper_error when writing fails.

in src/txdb.cpp:89 in 29548bcdf9 outdated

81@@ -59,12 +82,30 @@ bool CCoinsViewDB::BatchWrite(CCoinsMap &mapCoins, const uint256 &hashBlock) {
82         count++;
83         CCoinsMap::iterator itOld = it++;
84         mapCoins.erase(itOld);
85+        if (batch.SizeEstimate() > batch_size) {
86+            LogPrint(BCLog::COINDB, "Writing partial batch of %.2f MiB\n", batch.SizeEstimate() * (1.0 / 1048576.0));
87+            db.WriteBatch(batch);
88+            batch.Clear();
89+            int crash_simulate = GetArg("-dbcrashratio", 0);

TheBlueMatt commented at 3:55 pm on April 20, 2017:

Can we move GetArg lookup out of the hot loop?

sipa commented at 10:57 pm on April 24, 2017:

Done.

TheBlueMatt commented at 3:56 pm on April 20, 2017: member

utACK 29548bcdf9130bb40f164c20692fcbfcfd42c109 minus one GetArg and needing squashing.

morcos commented at 4:04 pm on April 20, 2017: member

Concept ACK. I’ve done a preliminary review and it looks pretty good, but I’d like to spend some more time thinking about it.

laanwj added this to the "Blockers" column in a project

sipa force-pushed on Apr 27, 2017

laanwj removed this from the "Blockers" column in a project

sipa force-pushed on May 3, 2017

sipa force-pushed on May 26, 2017

sipa commented at 11:57 pm on May 26, 2017: member

Rebased on #10195.

gmaxwell commented at 11:37 pm on June 1, 2017: contributor

@sipa Needs rebase!

sipa force-pushed on Jun 2, 2017

sipa commented at 0:05 am on June 2, 2017: member

Rebased after the merge of #10195.

gmaxwell commented at 0:17 am on June 2, 2017: contributor

@TheBlueMatt you didn’t want to finish a review of this before until after the per-txo changes happened. They’ve happened. Time to turn over your soul^wreview!

sdaftuar commented at 6:59 pm on June 7, 2017: member

Needs rebase. Also is it time now to remove WIP tag?

sipa force-pushed on Jun 7, 2017

sipa renamed this:
~~[WIP] Use non-atomic flushing with block replay~~
Use non-atomic flushing with block replay
on Jun 7, 2017

sipa commented at 7:48 pm on June 7, 2017: member

Rebased.

laanwj added this to the "Blockers" column in a project

in src/txdb.cpp:99 in 37528f7333 outdated

89+            if (old_heads.size() >= 2) old_tip = old_heads[1];
90+        }
91+        // In the first batch, mark the database as being in the middle of a
92+        // transition from old_tip to hashBlock.
93+        batch.Erase(DB_BEST_BLOCK);
94+        batch.Write(DB_HEAD_BLOCKS, std::vector<uint256>{hashBlock, old_tip});

ryanofsky commented at 4:48 pm on June 9, 2017:

This vector-of-two format seems a little strange. Would it work to not erase DB_BEST_BLOCK above, and just write a plain DB_BEST_NEW_BLOCK entry containing hash of the new tip. Then if there is a crash, the rescan can run between DB_BEST_BLOCK and DB_BEST_NEW_BLOCK.

sipa commented at 6:45 pm on June 9, 2017:

It’s for future compatibility with a design where entries are continuously written to disk in the background, which may lead to multiple partially-applied reorgs to co-exist on disk.

in src/validation.cpp:3607 in 37528f7333 outdated

3602+            LogPrintf("Rolling back %s (%i)\n", pindexOld->GetBlockHash().ToString(), pindexOld->nHeight);
3603+            DisconnectResult res = DisconnectBlock(block, pindexOld, cache);
3604+            if (res == DISCONNECT_FAILED) {
3605+                return error("RollbackBlock(): DisconnectBlock failed at %d, hash=%s", pindexOld->nHeight, pindexOld->GetBlockHash().ToString());
3606+            }
3607+            // We can safely ignore unclean here.

ryanofsky commented at 4:54 pm on June 9, 2017:

Could you add a few words about why it’s safe to ignore unclean? (It’s unclear to me.)

sipa commented at 6:42 pm on June 9, 2017:

Added a comment.

in src/validation.cpp:3616 in 37528f7333 outdated

3611+
3612+    // Roll forward from the forking point to the new tip.
3613+    int nForkHeight = pindexFork ? pindexFork->nHeight : 0;
3614+    for (int nHeight = nForkHeight + 1; nHeight <= pindexNew->nHeight; ++nHeight) {
3615+        const CBlockIndex* pindex = pindexNew->GetAncestor(nHeight);
3616+        LogPrintf("Rolling forward %s (%i)\n", pindex->nHeight, nHeight);

ryanofsky commented at 5:06 pm on June 9, 2017:

%s could be %i.

Also, is there a case where pindex->nHeight would not be the same as nHeight?

sipa commented at 6:42 pm on June 9, 2017:

Fixed.

sipa force-pushed on Jun 9, 2017

in src/txdb.cpp:74 in fa1300c0ad outdated

68@@ -68,10 +69,32 @@ uint256 CCoinsViewDB::GetBestBlock() const {
69     return hashBestChain;
70 }
71 
72+std::vector<uint256> CCoinsViewDB::GetHeadBlocks() const {
73+    std::vector<uint256> vhashHeadBlocks;
74+    if (!db.Read(DB_HEAD_BLOCKS, vhashHeadBlocks))

sdaftuar commented at 5:42 pm on June 14, 2017:

style nit: one line if should have braces

sipa commented at 7:36 am on June 15, 2017:

Fixed.

in src/validation.cpp:3571 in fa1300c0ad outdated

3559+        AddCoins(inputs, *tx, pindex->nHeight, true);
3560+    }
3561+    return true;
3562+}
3563+
3564+bool ReplayBlocks(const CChainParams& params, CCoinsView* view)

sdaftuar commented at 5:50 pm on June 14, 2017:

Would we ever call this on a view that isn’t a CCoinsViewDB? If not, perhaps the code would be slightly clearer if we passed in a CCoinsViewDB * instead, to clue future code readers in to what’s going on here?

sipa commented at 8:51 pm on June 14, 2017:

I prefer making types as generic as possible, and I think this code should work fine even on a cache on top of a db view.

in src/txdb.cpp:116 in fa1300c0ad outdated

112+            db.WriteBatch(batch);
113+            batch.Clear();
114+        }
115     }
116-    if (!hashBlock.IsNull())
117+    if (!hashBlock.IsNull()) {

sdaftuar commented at 6:23 pm on June 14, 2017:

To make sure we’ve considered all the cases, I’m trying to understand the circumstances in which we’d call BatchWrite with a null hashBlock – any ideas?

sipa commented at 7:36 am on June 15, 2017:

I’ve removed the need for flushing with hash=0.

in src/txdb.cpp:89 in fa1300c0ad outdated

84+    if (!hashBlock.IsNull()) {
85+        uint256 old_tip = GetBestBlock();
86+        if (old_tip.IsNull()) {
87+            // We may be in the middle of replaying.
88+            std::vector<uint256> old_heads = GetHeadBlocks();
89+            if (old_heads.size() >= 2) old_tip = old_heads[1];

sdaftuar commented at 6:41 pm on June 14, 2017:

Can we assert here that in this case, old_heads[0] == hashBlock?

sipa commented at 7:36 am on June 15, 2017:

Done.

in src/init.cpp:1497 in 4b9293d716 outdated

1492@@ -1494,9 +1493,13 @@ bool AppInitMain(boost::thread_group& threadGroup, CScheduler& scheduler)
1493                     strLoadError = _("Unable to replay blocks. You will need to rebuild the database using -reindex-chainstate.");
1494                     break;
1495                 }
1496-                pcoinsTip->SetBestBlock(pcoinsdbview->GetBestBlock()); // TODO: only initialize pcoinsTip after ReplayBlocks
1497+                pcoinsTip = new CCoinsViewCache(pcoinscatcher);
1498+                pcoinsTip->SetBestBlock(pcoinsdbview->GetBestBlock());

sdaftuar commented at 6:46 pm on June 14, 2017:

Am I correct in thinking that this call to SetBestBlock() is now unnecessary?

sipa commented at 7:37 am on June 15, 2017:

Indeed, removed.

in src/init.cpp:1501 in 4b9293d716 outdated

1497+                pcoinsTip = new CCoinsViewCache(pcoinscatcher);
1498+                pcoinsTip->SetBestBlock(pcoinsdbview->GetBestBlock());
1499                 LoadChainTip(chainparams);
1500 
1501+                // Force a chainstate write so that when we VerifyDB in a moment, it doesn't check stale data
1502+                FlushStateToDisk();

sdaftuar commented at 6:55 pm on June 14, 2017:

I know this is just a code move, but this comment is confusing – if I understand correctly, there’s no utxo information to flush at this point, just potentially the block index?

On further analysis, I think this line is not needed at all. I don’t think it hurts for belt and suspenders, but would be nice to fix the documentation if that is the case.

sipa commented at 7:37 am on June 15, 2017:

Indeed, removed (and this also allows removing the hash=0 flushing logic).

sdaftuar commented at 7:13 pm on June 14, 2017: member

Re-reviewed; it looks pretty good though I have a couple questions… Will work on testing.

sipa force-pushed on Jun 15, 2017

sipa force-pushed on Jun 16, 2017

in src/txdb.h:79 in 5e6f0f01ee outdated

75@@ -74,6 +76,7 @@ class CCoinsViewDB : public CCoinsView
76     bool GetCoin(const COutPoint &outpoint, Coin &coin) const override;
77     bool HaveCoin(const COutPoint &outpoint) const override;
78     uint256 GetBestBlock() const override;
79+    std::vector<uint256> GetHeadBlocks() const;

sdaftuar commented at 4:15 pm on June 22, 2017:

nit: add override here to eliminate this warning:

0In file included from txdb.cpp:6:
1./txdb.h:77:26: warning: 'GetHeadBlocks' overrides a member function but is not marked 'override' [-Winconsistent-missing-override]
2    std::vector<uint256> GetHeadBlocks() const;
3                         ^
4./coins.h:163:34: note: overridden virtual function is here
5    virtual std::vector<uint256> GetHeadBlocks() const;
6                                 ^
71 warning generated.
8    std::vector<uint256> GetHeadBlocks() const;

sipa commented at 6:59 pm on June 22, 2017:

Fixed.

in src/validation.cpp:1338 in 9786367f5c outdated

1327@@ -1328,17 +1328,15 @@ int ApplyTxInUndo(Coin&& undo, CCoinsViewCache& view, const COutPoint& out)
1328             return DISCONNECT_FAILED; // adding output for transaction without known metadata
1329         }
1330     }
1331-    view.AddCoin(out, std::move(undo), undo.fCoinBase);
1332+    view.AddCoin(out, std::move(undo), !fClean);

sdaftuar commented at 4:21 pm on June 22, 2017:

I think it’s worth adding some explanatory comments around the use of !fClean here. In particular, I think it’s non-obvious that the call to HaveCoin() at line 1317 is necessary for correctness, and not just a sanity check.

sipa commented at 7:00 pm on June 22, 2017:

Added some comments.

sdaftuar commented at 5:46 pm on June 22, 2017: member

It occurs to me that we aren’t bounding memory usage at all during recovery. This seems like it could cause problems for users who are running with a big -dbcache and then crash.

Perhaps we should suggest a -reindex if there are a lot of blocks to disconnect/connect, or abort and suggest a -reindex if the cache in ReplayBlocks exceeds the dbcache+mempool size?

sdaftuar commented at 5:51 pm on June 22, 2017: member

Also, I was able to write a functional test that exercises this logic (and which caught the undo.fCoinbase vs !fClean issue in ApplyTxInUndo()): https://github.com/sdaftuar/bitcoin/commits/test-10148

sipa force-pushed on Jun 22, 2017

sipa commented at 7:04 pm on June 22, 2017: member

Made the added argument to AddCoins check instead of overwrite. Its semantics are now to issue a HaveCoin to determine whether an overwrite happens, rather than just assume there is one. This has the advantage that (some) of the coins added in RollForwardBlock correctly get a FRESH marker, possibly reducing memory usage if the rollforward is large.
Added @sdaftuar ’s dbcrash test (which still runs succesfully). @sdaftuar As for memory usage during recovery, I think the correct approach is to make the recovery itself non-atomic later.

sdaftuar commented at 9:02 pm on June 22, 2017: member

Looks good, ACK. However, if we include the dbcrash test, then I think we also need something like https://github.com/sdaftuar/bitcoin/commit/96a7ef4e990cb33f4e5cc248b630d17b2384d33b in order for test/functional/test_runner.py --extended to not fail?

sipa force-pushed on Jun 22, 2017

sdaftuar commented at 2:14 am on June 23, 2017: member

ACK 03604ea5f3de4dd3301533ab353b2490105ebdb0

laanwj commented at 1:21 pm on June 24, 2017: member

Small merge conflict in check-doc.py

gmaxwell commented at 5:38 pm on June 26, 2017: contributor

@sdaftuar Thats a great test. @sipa needs conflicts fixed

[MOVEONLY] Move LastCommonAncestor to chain b3a279cd58

Non-atomic flushing using the blockchain as replay journal 013a56aa1a

Adapt memory usage estimation for flushing 0580ee08ff

Random db flush crash simulator eaca1b7b08

Dont create pcoinsTip until after ReplayBlocks.

This requires that we not access pcoinsTip in InitBlockIndex's
FlushStateToDisk (so we just skip it until later in AppInitMain)
and the LoadChainTip in LoadBlockIndex (which there is already one
later in AppinitMain, after ReplayBlocks, so skipping it there is
fine).

Includes some simplifications by Suhas Daftuar and Pieter Wuille.

d6af06d68a

sipa force-pushed on Jun 26, 2017

sipa commented at 6:46 pm on June 26, 2017: member

Rebased, and squashed the last two commits.

sipa commented at 8:16 pm on June 26, 2017: member

Travis where art thou

[qa] Test non-atomic chainstate writes

Adds new functional test, dbcrash.py, which uses -dbcrashratio to exercise the
logic for recovering from a crash during chainstate flush.

dbcrash.py is added to the extended tests, as it may take ~10 minutes to run

Use _Exit() instead of exit() for crash simulation

This eliminates stderr output such as:
    terminate called without an active exception
or
    Assertion failed: (!pthread_mutex_destroy(&m)), function ~recursive_mutex, file /usr/local/include/boost/thread/pthread/recursive_mutex.hpp, line 104.

Eliminating the stderr output on crash simulation allows testing with
test_runner.py, which reports a test as failed if stderr is produced.

176c021d08

sipa force-pushed on Jun 26, 2017

sdaftuar commented at 9:06 pm on June 27, 2017: member

re-ACK 176c021d085f5a45bc9e038e760942aa648dd797

laanwj commented at 4:25 pm on June 28, 2017: member

Tested ACK 176c021

laanwj merged this on Jun 28, 2017

laanwj closed this on Jun 28, 2017

laanwj referenced this in commit d4e551adfe on Jun 28, 2017

laanwj removed this from the "Blockers" column in a project

TheBlueMatt commented at 7:56 pm on July 7, 2017: member

utACK-sans-tests once the fixes for init-order bugs here goes through in #10758.

morcos commented at 4:09 pm on July 20, 2017: member

posthumous utACK-sans-tests and modulo some of the same bugs @TheBlueMatt found and fixed in #10758. I’ll review that now.

PastaPastaPasta referenced this in commit 2183204493 on Jul 6, 2019

PastaPastaPasta referenced this in commit 8645b8d75b on Jul 6, 2019

PastaPastaPasta referenced this in commit 2259404880 on Jul 6, 2019

PastaPastaPasta referenced this in commit 453d756571 on Aug 2, 2019

PastaPastaPasta referenced this in commit feadc24a70 on Aug 6, 2019

PastaPastaPasta referenced this in commit e9a09a0f39 on Aug 6, 2019

barrystyle referenced this in commit 4d0ac3e85c on Jan 22, 2020

random-zebra referenced this in commit ac523660b4 on Feb 21, 2021

DrahtBot locked this on Aug 16, 2022

Use non-atomic flushing with block replay #10148