blockstorage: Return on fatal flush errors

TheCharlatan commented at 6:43 pm on June 12, 2023: contributor

The goal of this PR is to establish that fatal blockstorage flush errors should be treated as errors at their call site.

Prior to this patch FlushBlockFile may have failed without returning in Chainstate::FlushStateToDisk, leading to a potential write from WriteBlockIndexDB that may refer to a block that is not fully flushed to disk yet. By returning if either FlushUndoFile or FlushBlockFile fail, we ensure that no further write operations take place that may lead to an inconsistent database when crashing. Add [[nodiscard]] annotations to them such that they are not ignored in future.

Functions that call either FlushUndoFile or FlushBlockFile, need to handle these extra abort cases properly. Since Chainstate::FlushStateToDisk already produces an abort error in case of WriteBlockIndexDB failing, no extra logic for functions calling Chainstate::FlushStateToDisk is required.

Besides Chainstate::FlushStateToDisk, FlushBlockFile is also called by FindBlockPos, while FlushUndoFile is only called by FlushBlockFile and WriteUndoDataForBlock. For both these cases, the flush error is not further bubbled up. Instead, the error is logged and a comment is provided why bubbling up an error would be less desirable in these cases.

This pull request is part of a larger effort towards improving the shutdown / abort / fatal error handling in validation code. It is a first step towards implementing proper fatal error return type enforcement similar as proposed by theuni in this pull request comment. For ease of review of these critical changes, a first step would be checking that AbortNode leads to early and error-conveying returns at its call site. Further work for enforcing returns when AbortNode is called is done in #27862.

DrahtBot commented at 6:43 pm on June 12, 2023: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	stickies-v, ryanofsky
Concept ACK	hebasto, naumenkogs
Stale ACK	dergoegge

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#28516 (validation: assumeutxo params for testnet and signet by Sjors)
#27596 (assumeutxo (2) by jamesob)
#27039 (blockstorage: do not flush block to disk if it is already there by pinheadmz)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot added the label Validation on Jun 12, 2023

in src/validation.cpp:2515 in 1fe06a811b outdated

2505@@ -2506,7 +2506,9 @@ bool Chainstate::FlushStateToDisk(
2506                 LOG_TIME_MILLIS_WITH_CATEGORY("write block and undo data to disk", BCLog::BENCH);
2507 
2508                 // First make sure all block and undo data is flushed to disk.
2509-                m_blockman.FlushBlockFile();
2510+                if (!m_blockman.FlushBlockFile()) {
2511+                    return false;

ryanofsky commented at 6:53 pm on June 12, 2023:

In commit “blockstorage: Return on fatal flush errors” (1fe06a811bf8be941bef5336af97688f48885828)

This method is not marked nodiscard, and it seems like there are some cases where this error is ignored and not returned to the caller.

TheCharlatan commented at 8:33 am on June 29, 2023:

There are many cases where these fatal error codes are eventually ignored as you go up the call stack. I would like to keep this PR focused on the question whether these blockstorage flush functions should at least indicate that they ran into an error. If you want to get a sense of just how many such “ignored” cases there are, I have my own set of patches that achieve something similar to Cory’s “bubble up”, but with your Result<T, E> type and introduced granularly for each fatal error call site here: https://github.com/TheCharlatan/bitcoin/pull/13.

ryanofsky commented at 5:50 pm on July 6, 2023:

re: #27866 (review)

introduced granularly for each fatal error call site here: TheCharlatan#13.

Thanks, it’s good to see nodiscard being used there, and introducing more meaningful error codes.

It’s possible my suggestion and your PRs have slightly different goals. Maybe your goal is to enhance libbitcoinkernel applications ability to handle and report errors, and my goal is to improve clarify of validation code and find to potential bugs in it. From my perspective, just adding [[nodiscard]] to function declarations and (void) to call sites along with “// This error is intentionally ignored because …” and “// This error is currently ignored but might be better to handle by …” comments would already be improvements. Maybe that doesn’t go far enough though, or is more tangential to what this PR is trying to do.

TheCharlatan commented at 7:34 pm on July 6, 2023:

Maybe your goal is to enhance libbitcoinkernel applications ability to handle and report errors, and my goal is to improve clarify of validation code and find to potential bugs in it.

I think that hits it on the nail. In any case I updated the PR description with much of the body of the commit message to make it clearer what I am trying to achieve here and why I’d consider it an improvement.

in src/node/chainstate.cpp:212 in 360fe5a47f outdated

206@@ -207,7 +207,9 @@ ChainstateLoadResult LoadChainstate(ChainstateManager& chainman, const CacheSize
207     } else if (snapshot_completion == SnapshotCompletionResult::SUCCESS) {
208         LogPrintf("[snapshot] cleaning up unneeded background chainstate, then reinitializing\n");
209         if (!chainman.ValidatedSnapshotCleanup()) {
210-            AbortNode("Background chainstate cleanup failed unexpectedly.");
211+            const auto error = _("Background chainstate cleanup failed unexpectedly.");
212+            AbortNode(error.original);
213+            return {ChainstateLoadStatus::INTERRUPTED, error};

ryanofsky commented at 7:03 pm on June 12, 2023:

In commit “chainstate: Return on ValidatedSnapshotCleanup fatal failure” (360fe5a47f75514621ef62bc4c906f2112d3308f)

I think this should return FAILURE not INTERRUPTED so caller can know that an error happened, not just an early shutdown request.

But I’d actually suggest dropping this commit from the PR. #27862 should cover this case, and I opened it earlier because I think AbortNode calls in assumeutxo code and in flush code seem different. There should be less of a rationale for ignoring errors in assumeutxo.

TheCharlatan commented at 7:51 pm on June 12, 2023:

Yup, I’ll drop, make this PR about the flush errors only and continue discussion on your PR.

ryanofsky commented at 7:22 pm on June 12, 2023: contributor

I’m not sure about all the changes in this PR because it seems like some validation code is intentionally written to ignore flush errors, and now they are returning early and skipping other work. It also seems reasonable that some failed flushes like periodic flushes might not need to be fatal errors.

But adding return values and nodiscard attributes could at least make it clear if errors are being ignored intentionally or not.

TheCharlatan force-pushed on Jun 12, 2023

DrahtBot added the label CI failed on Jun 12, 2023

TheCharlatan renamed this:
~~validation: Return on abort~~
blockstorage: Return on fatal flush errors
on Jun 12, 2023

TheCharlatan commented at 8:39 pm on June 12, 2023: contributor

Re #27866#pullrequestreview-1475602374

It also seems reasonable that some failed flushes like periodic flushes might not need to be fatal errors.

This is a good point, and I did not think of this before. A flush may fail once due to some sporadic error, but may succeed again at a later point in time. I’m still not sure though if it is really correct to ignore the flush error. If we fail to flush, but succeed in writing the block index, and then crash, we might have indexed a block, that has not been synced to disk. It also feels weird that we are currently okay with the block files failing to flush, while the call to CoinsTip().Flush() returns early on failure. Can you comment on this rationale?

DrahtBot removed the label CI failed on Jun 12, 2023

hebasto commented at 1:44 pm on June 15, 2023: member

A flush may fail once due to some sporadic error, but may succeed again at a later point in time. I’m still not sure though if it is really correct to ignore the flush error.

At least, such a flush error can be logged.

ryanofsky commented at 2:24 pm on June 15, 2023: contributor

Can you comment on this rationale?

I’m not familiar enough with existing code to know why it’s written the way it is. It could have a great rationale or just be weird like it appears to be.

I’m less concerned about this PR than I am about #27861, which is treating flush errors as fatal errors when they are not inherently fatal. If code writing to disk fails it could either mean (1) state on disk is corrupt or invalid and can’t be recovered from or (2) state on disk is valid but out of date and requires more work to catch up. These cases are inherently different and I think kernel API would benefit off having separate “flush error” and “fatal error” notifications to distinguish them, regardless of implementation details of current code.

in src/node/blockstorage.cpp:742 in 1fe06a811b outdated

738@@ -733,7 +739,9 @@ bool BlockManager::WriteUndoDataForBlock(const CBlockUndo& blockundo, BlockValid
739         // with the block writes (usually when a synced up node is getting newly mined blocks) -- this case is caught in
740         // the FindBlockPos function
741         if (_pos.nFile < m_last_blockfile && static_cast<uint32_t>(block.nHeight) == m_blockfile_info[_pos.nFile].nHeightLast) {
742-            FlushUndoFile(_pos.nFile, true);
743+            if(!FlushUndoFile(_pos.nFile, true)) {

jamesob commented at 5:36 pm on June 29, 2023:

Formatting: if (

in src/node/blockstorage.h:102 in 1fe06a811b outdated

94-    void FlushUndoFile(int block_file, bool finalize = false);
95-    bool FindBlockPos(FlatFilePos& pos, unsigned int nAddSize, unsigned int nHeight, CChain& active_chain, uint64_t nTime, bool fKnown);
96+    [[nodiscard]] bool FlushBlockFile(bool fFinalize = false, bool finalize_undo = false);
97+    [[nodiscard]] bool FlushUndoFile(int block_file, bool finalize = false);
98+    [[nodiscard]] bool FindBlockPos(FlatFilePos& pos, unsigned int nAddSize, unsigned int nHeight, CChain& active_chain, uint64_t nTime, bool fKnown);
99     bool FindUndoPos(BlockValidationState& state, int nFile, FlatFilePos& pos, unsigned int nAddSize);

stickies-v commented at 9:35 pm on June 29, 2023:

Any reason why this isn’t [[nodiscard]] too? Given that it returns the AbortNode result, and is used in blockstorage?

TheCharlatan commented at 8:32 am on June 30, 2023:

I can’t think of a good reason why this shouldn’t be [[nodiscard]] too. However, as I said here I am trying to keep this PR’s scope limited to just the flush functions and their immediate call sites. I added the [[nodiscard]] to FindBlockPos, because it reports on the flush error now too. FindUndoPos on the other hand is not immediately affected by the flush errors. There are many such examples where we ignore error codes upon return throughout the kernel code and I’d like to tackle them bit by bit.

stickies-v commented at 10:05 pm on June 29, 2023: contributor

Concept ACK

The blockstorage changes seem pretty straightforward, the validation.cpp one I’ll need to dive deeper into since it’s got quite a few callsites.

ryanofsky commented at 6:00 pm on July 6, 2023: contributor

Noticed in IRC libbitcoinkernel updates:

<achow101> it looks like the current pr is #27866, although #27861 has also been getting review

IMO, this PR could be an improvement, but right now seems like a draft change. I think the main thing that’s missing is a clear description of how it changes behavior, and why the behavior change is safe.

DrahtBot added the label Needs rebase on Jul 6, 2023

TheCharlatan force-pushed on Jul 7, 2023

TheCharlatan commented at 5:30 pm on July 7, 2023: contributor

Rebased 1fe06a811bf8be941bef5336af97688f48885828 -> d1c92b57a72b62ffa202f1d3d357b59befdc9c12 (kernelReturnOnAbort_0 -> kernelReturnOnAbort_1, compare)

Fixed conflict with #27861
Fixed formatting if formatting

in src/node/blockstorage.cpp:654 in d1c92b57a7 outdated

649@@ -644,7 +650,9 @@ bool BlockManager::FindBlockPos(FlatFilePos& pos, unsigned int nAddSize, unsigne
650         if (!fKnown) {
651             LogPrint(BCLog::BLOCKSTORAGE, "Leaving block file %i: %s\n", m_last_blockfile, m_blockfile_info[m_last_blockfile].ToString());
652         }
653-        FlushBlockFile(!fKnown, finalize_undo);
654+        if (!FlushBlockFile(!fKnown, finalize_undo)) {
655+            return false;

ryanofsky commented at 6:37 pm on July 7, 2023:

In commit “blockstorage: Return on fatal flush errors” (d1c92b57a72b62ffa202f1d3d357b59befdc9c12)

Let me know if I understand this correctly: The new return false avoids writing block data to a new block file if the previous block file is full and cannot be synced and trimmed. This seems to be safe because this FindBlockPos function only has one caller, SaveBlockToDisk, which only has two callers, Chainstate::AcceptBlock and Chainstate::LoadGenesisBlock which both seem to handle this error by stopping what they are doing and writing “Failed to find position to write new block” log messages. I did not look up AcceptBlock and LoadGenesisBlock callers but assume they are handling this case reasonably.

I think existing log messages are vague / misleading, so would suggest adding a new log message here like:

0LogPrintf("Failed to flush previous block file %05i (finalize=%i, finalize_undo=%i) before opening new block file %05i", m_last_blockfile, !fKnown, finalize_undo, nFile)`.

I’m not sure what the benefits are of returning false here. I think unlike the other changes in this commit, this change doesn’t help avoid any “potential write from WriteBlockIndexDB that may refer to a block that is not fully flushed to disk yet”, because this flush call is happening after previous block data has already been successfully written and the database has already been updated.

I think the clearest and most conservative thing to do would be to drop “return false” line in this commit, and only log an error message here. If you think returning false is a good idea, just do it in a followup commit decoupled from the other changes here with a commit message that explains why that is safe / desirable.

DrahtBot added the label CI failed on Jul 7, 2023

in src/node/blockstorage.cpp:553 in d1c92b57a7 outdated

549     assert(static_cast<int>(m_blockfile_info.size()) > m_last_blockfile);
550 
551     FlatFilePos block_pos_old(m_last_blockfile, m_blockfile_info[m_last_blockfile].nSize);
552     if (!BlockFileSeq().Flush(block_pos_old, fFinalize)) {
553         m_opts.notifications.flushError("Flushing block file to disk failed. This is likely the result of an I/O error.");
554+        return false;

ryanofsky commented at 7:03 pm on July 7, 2023:

In commit “blockstorage: Return on fatal flush errors” (d1c92b57a72b62ffa202f1d3d357b59befdc9c12)

This skips trying to flush the undo file if flushing the block file fails. It think it would be safer and more conservative to keep previous behavior of flushing both files in sequence.

You can still return false if either flush call fails, but I don’t see a benefit in changing behavior here and potentially making code less robust, when it’s not necessary just to add a return code.

I think it would also make review simpler to get rid of this unnecessary change, since the other changes in this commit are already pretty complex.

DrahtBot removed the label Needs rebase on Jul 7, 2023

in src/node/blockstorage.cpp:746 in d1c92b57a7 outdated

741@@ -734,7 +742,9 @@ bool BlockManager::WriteUndoDataForBlock(const CBlockUndo& blockundo, BlockValid
742         // with the block writes (usually when a synced up node is getting newly mined blocks) -- this case is caught in
743         // the FindBlockPos function
744         if (_pos.nFile < m_last_blockfile && static_cast<uint32_t>(block.nHeight) == m_blockfile_info[_pos.nFile].nHeightLast) {
745-            FlushUndoFile(_pos.nFile, true);
746+            if (!FlushUndoFile(_pos.nFile, true)) {
747+                return false;

ryanofsky commented at 7:24 pm on July 7, 2023:

In commit “blockstorage: Return on fatal flush errors” (d1c92b57a72b62ffa202f1d3d357b59befdc9c12)

This change seems fragile to me, and it’s unclear what the benefits are. I think it would be great to add a log print here, but if this this code is going to be changed to return false, I think that should happen in a separate followup commit, not combined with other changes here, and have a clear explanation and justification.

IIUC, this change introduces an inconsistency. For all block data, and for vast majority of undo data, the data is considered to be successfully written if the write call succeeeds regardless of whether syncing and trimming succeeds later. But now with this change, for some small fraction of undo block data, the data will be considered not written if the syncing or trimming fails. And the fraction of blocks this will be true for will be essentially random due to the undo flush heuristic (described in a new comment from #27746)

I understand that your goal here is to have libbitcoinkernel functions return more complete error information. I understand that goal is easier to achieve by changing behavior of validation and block storage code and returning early in certain cases. But fundamentally it is not necessary to change behavior of any existing code to have it return more error information. Without returning early, you can always accumulate errors and just return the error code at the end.

If returning early makes sense, I think it should be justified in its own terms and done carefully in smaller commits.

in src/validation.cpp:2515 in d1c92b57a7 outdated

2504@@ -2505,7 +2505,9 @@ bool Chainstate::FlushStateToDisk(
2505                 LOG_TIME_MILLIS_WITH_CATEGORY("write block and undo data to disk", BCLog::BENCH);
2506 
2507                 // First make sure all block and undo data is flushed to disk.
2508-                m_blockman.FlushBlockFile();
2509+                if (!m_blockman.FlushBlockFile()) {
2510+                    return false;

ryanofsky commented at 7:48 pm on July 7, 2023:

In commit “blockstorage: Return on fatal flush errors” (d1c92b57a72b62ffa202f1d3d357b59befdc9c12)

This line appears to implement the major change in behavior which is avoiding the “write from WriteBlockIndexDB that may refer to a block that is not fully flushed to disk yet.”

I wonder what this failure looks like in practice, though. Could we write a test for it? I think it would at least make sense to have a log print that explains the problem that’s happening.

There are also a lot of callers to FlushStateToDisk and I haven’t looked at all of them to determine if they are handling this correctly. Can you describe what they should be doing, and do you know if they actually are doing it? The commit message says “no extra logic for functions calling Chainstate::FlushStateToDisk is required” mentioning an “abort error”, but there is no abort error here, only a “return false” so I’m not sure how this is supposed to work.

TheCharlatan commented at 3:35 pm on July 25, 2023:

I wonder what this failure looks like in practice, though. Could we write a test for it?

I could not come up with a way to test this with our existing code. If anybody has any pointers for how to achieve this, I’d happy to give it a shot at implementing the test.

Can you describe what they should be doing, and do you know if they actually are doing it?

Callers of FlushStateToDisk already need to handle failures of the block index writing, so I don’t think they require extra logic for handling block file flushing failures. We already have places in the code where we ignore flushing errors (including the block index writing), so I think in places where we are fine with that failure, we should also be fine with the block file flushing failing.

ryanofsky commented at 2:32 pm on August 17, 2023:

re: #27866 (review)

I think this return false should be removed and replaced by a log print and comment, at least in this PR. Maybe a followup PR could try to change behavior in a bigger way, but it’s not clear here that the intended change of returning false is here actually good, and that there aren’t other unintended effects.

If a FlatFileSeq::Flush call fails, effectively meaning that an fflush or ftruncate call failed on block file (or undo file after the next commit), it’s not clear to me that the new behavior of shutting down right away without trying to save state to disk is better than the old behavior of updating the block index, pruning old block files (which may free up disk space), and updating the coins database. This appears to be the intended change, and it does not seem to be a clear win. In general if there is some low level filesystem error which might be transient or spurious, it seems better to me for an application to handle it by at least trying to save its state than to just give up immediately and throw away data that’s still in memory.

I’m also concerned about possible unintended effects of this change because FlushStateToDisk is called so many places, that it’s not clear what other effects are outside of this function if it now returns false instead of true.

I like all the other changes in this PR of adding more return information and adding comments explaining error how errors are handled. And I wouldn’t be even be surprised if it turns out the new behavior here is actually better than the old behavior in relevant cases. But I think if behavior is going to change, it should really be motivated by something specific like a bug report or test case, and ideally it should happen in a separate PR that doesn’t mostly consist of code cleanup.

dergoegge commented at 12:54 pm on August 23, 2023:

fwiw I think it is fine to change the behavior here but also don’t mind if it’s done in a follow-up. To me this just seems like a weird quirk given that we do return a fatal error for other flush operations.

I think, a failed flush opens the node up to be crashed in any case, e.g.: https://github.com/bitcoin/bitcoin/blob/5aa67eb3655a0023f0cf115176fc8d5bac53cdcd/src/net_processing.cpp#L2217-L2219

Cleanly shutting down seems preferable.

ryanofsky commented at 7:57 pm on July 7, 2023: contributor

Code review d1c92b57a72b62ffa202f1d3d357b59befdc9c12. I think the extra error reporting here is great, and some of the changes in behavior seem good too. But I think there is too much happening in one commit. There are 3-4 separate changes in behavior that could be implemented in separate commits, and I think would be more clearly understood and evaluated that way.

hebasto commented at 10:19 am on July 14, 2023: member

Concept ACK.

TheCharlatan commented at 9:23 pm on July 18, 2023: contributor

Putting this back to draft until I’ve invested the time to properly address @ryanofsky’s comments.

TheCharlatan marked this as a draft on Jul 18, 2023

TheCharlatan force-pushed on Jul 25, 2023

TheCharlatan commented at 3:25 pm on July 25, 2023: contributor

Thank you for the excellent review @ryanofsky.

Updated d1c92b57a72b62ffa202f1d3d357b59befdc9c12 -> 5720741c14e617ed338f64361b23f4e66ec99e07 (kernelReturnOnAbort_1 -> kernelReturnOnAbort_2, compare)

Split the changes up into three commits
Addressed @ryanofsky’s comment, ignoring FlushBlockFile failure in FindBlockPos. Also added a comment explaining why the return type is ignored and added the suggested log line.
Addressed @ryanofsky’s comment, always attempting FlushUndoFile in FlushBlockFIle, even when the block file flushing fails and return a success code only at the end of the function.
Addressed @ryanofsky’s comment, ignoring the FlushUndoFile failure in WriteUndoDataForBlock and instead added a comment and a log message in case of flush failure.

TheCharlatan force-pushed on Jul 26, 2023

DrahtBot removed the label CI failed on Jul 26, 2023

TheCharlatan marked this as ready for review on Jul 27, 2023

DrahtBot added the label Needs rebase on Jul 31, 2023

TheCharlatan force-pushed on Aug 1, 2023

TheCharlatan commented at 9:28 pm on August 1, 2023: contributor

Rebased 5720741c14e617ed338f64361b23f4e66ec99e07 -> 7721ab9139013d70ef0f058754fabc5aaeda2246 (kernelReturnOnAbort_2 -> kernelReturnOnAbort_3, compare)

Fixed conflict with https://github.com/bitcoin/bitcoin/pull/27746

DrahtBot removed the label Needs rebase on Aug 1, 2023

in src/node/blockstorage.cpp:553 in 7721ab9139 outdated

550 }
551 
552-void BlockManager::FlushBlockFile(bool fFinalize, bool finalize_undo)
553+bool BlockManager::FlushBlockFile(bool fFinalize, bool finalize_undo)
554 {
555+    bool success = true;

stickies-v commented at 5:06 pm on August 9, 2023:

nit: can be declared closer to where it’s used

in src/node/blockstorage.cpp:679 in 7721ab9139 outdated

675+        // inconsistency arising from the flush failure here. However, the undo
676+        // data may be inconsistent after a crash if the flush is called during
677+        // a reindex. A flush error might also leave some of the data files
678+        // untrimmed.
679+        if (!FlushBlockFile(!fKnown, finalize_undo)) {
680+            LogPrintf("Failed to flush previous block file %05i (finalize=%i, finalize_undo=%i) before opening new block file %05i\n", m_last_blockfile, !fKnown, finalize_undo, nFile);

stickies-v commented at 5:09 pm on August 9, 2023:

nit: long line

jonatack commented at 4:50 pm on August 17, 2023:

If you retouch, the transition to severity-based logging will be easier if used when adding new ones. Probably choose between error or warning here; along with info, all 3 will be logged unconditionally by default, unless the user sets a custom severity level.

0            LogPrintLevel(BCLog::BLOCKSTORAGE, BCLog::Level::Warning, "Failed to flush previous block file %05i (finalize=%i, finalize_undo=%i) before opening new block file %05i\n", m_last_blockfile, !fKnown, finalize_undo, nFile);

stickies-v approved

stickies-v commented at 10:23 am on August 10, 2023: contributor

ACK 7721ab9139013d70ef0f058754fabc5aaeda2246

The changes in blockstorage.cpp and blockstorage.h are strict improvements, increasing visibility in flushing failures. I found the change in validation.cpp much more difficult to review, given the wide variety of ways in which FlushStateToDisk() is used. I’ve found that:

it improves behaviour in some ways. For example, if FlushBlockFile() fails and the WriteBlockIndexDB() call still succeeds, we can end up in a situation where we expect to have a block, but don’t actually have the data on disk. I found this quite easy to replicate by modifying FlushBlockFile() to a no-op, pkill bitcoind during IBD, and then call getblock on the latest synced block. This PR fixes that.
I don’t see how it can degrade behaviour:
- logically, it doesn’t make sense to flush the block index and chainstate if the block data may be missing
- we already return early in FlushStateToDisk() for flushing failures, just not for FlushBlockFile()
- I don’t think this can affect consensus, e.g. the BlockValidationState &state out-parameter is unaffected; AcceptBlock() is unaffected
- this code path is only reached when fDoFullFlush || fPeriodicWrite, which pretty much happens only every hour, when we shutdown, or when we exceed cache sizes.

(thanks @dergoegge for helping me understand the latter two points)

ryanofsky changes_requested

ryanofsky commented at 3:39 pm on August 17, 2023: contributor

Code review 7721ab9139013d70ef0f058754fabc5aaeda2246. Thanks for splitting up the commits and adding helpful log prints and comments. I like everything in this PR except the new return false after FlushBlockFile() fails in Chainstate::FlushStateToDisk. I think stickies-v makes the best possible case for it here: #27866#pullrequestreview-1570159243, but the test mentioned there of commenting out the block flush code entirely seems not very realistic, and none of the other cleanup in the PR requires this change in behavior, so I would really like to see it moved to another PR and justified on its own terms if it is a change worth making.

in src/node/blockstorage.cpp:777 in 7721ab9139 outdated

773+            // be an indication for a failed write. If it were propagated here,
774+            // the caller would assume the undo data not to be written, when in
775+            // fact it is. Note though, that a failed flush might leave the data
776+            // file untrimmed.
777+            if (!FlushUndoFile(_pos.nFile, true)) {
778+                LogPrintf("Failed to flush undo file %05i\n", _pos.nFile);

jonatack commented at 4:51 pm on August 17, 2023:

Idem as #27866 (review) if you retouch.

dergoegge approved

dergoegge commented at 12:57 pm on August 23, 2023: member

Code review ACK 7721ab9139013d70ef0f058754fabc5aaeda2246

TheCharlatan commented at 7:37 pm on August 29, 2023: contributor

Putting this back in draft while I address the change requests here.

TheCharlatan marked this as a draft on Aug 29, 2023

blockstorage: Mark FindBlockPos as nodiscard

A false return value indicates a fatal error (disk space being too low),
so make sure we always consume this error code.

This commit is part of an ongoing process for making the handling of
fatal errors more transparent and easier to understand.

5671c15f45

blockstorage: Return on fatal block file flush error

By returning an error code if `FlushBlockFile` fails, the caller now has
to explicitly handle block file flushing errors. Before this change
such errors were non-explicitly ignored without a clear rationale.

Prior to this patch `FlushBlockFile` may have failed silently in
`Chainstate::FlushStateToDisk`. Improve this with a log line. Also add a
TODO comment to flesh out whether returning early in the case of an
error is appropriate or not. Returning early might be appropriate to
prohibit `WriteBlockIndexDB` from writing a block index entry that does
not refer to a fully flushed block.

Besides `Chainstate::FlushStateToDisk`, `FlushBlockFile` is also called
by `FindBlockPos`. Don't change the abort behavior there, since we don't
want to fail the function if the flushing of already written blocks
fails. Instead, just document it.

f0207e0030

blockstorage: Return on fatal undo file flush error

By returning an error code if either `FlushUndoFile` or `FlushBlockFile`
fail, the caller now has to explicitly handle block undo file flushing
errors. Before this change such errors were non-explicitly ignored
without a clear rationale.

Besides the call to `FlushUndoFile` in `FlushBlockFile`, ignore its
return code at its call site in `WriteUndoDataForBlock`. There, a failed
flush of the undo data should not be indicative of a failed write.

Add [[nodiscard]] annotations to `FlushUndoFile` such that its return
value is not just ignored in the future.

d8041d4e04

TheCharlatan force-pushed on Sep 1, 2023

TheCharlatan commented at 5:30 am on September 1, 2023: contributor

Updated 7721ab9139013d70ef0f058754fabc5aaeda2246 -> d8041d4e042957660827313951b18c8dd9a99a16 (kernelReturnOnAbort_3 -> kernelReturnOnAbort_4, compare)

Removed early return in FlushStateToDisk, replaced it with a TODO comment and a log line.
Addressed @jonatack’s comment, added severity-based logging.
Addressed @stickies-v’s comment, splitting up long log line.

TheCharlatan marked this as ready for review on Sep 1, 2023

stickies-v commented at 11:39 am on September 1, 2023: contributor

re-ACK d8041d4

I was okay with the validation logic change that we had before, but as I don’t have a lot of familiarity with the complexities of FlushStateToDisk I am definitely okay with ryanofsky’s suggestion to keep that for a later, more focused improvement and making this pretty much a no-brainer PR.

DrahtBot requested review from dergoegge on Sep 1, 2023

naumenkogs commented at 9:28 am on September 4, 2023: member

Concept ACK

ryanofsky approved

ryanofsky commented at 3:59 pm on September 11, 2023: contributor

Code review ACK d8041d4e042957660827313951b18c8dd9a99a16. Thanks for making suggested changes. This is now an easier PR to review since it is only adding new return codes, not changing existing ones, and no longer changing the way flush errors are handled other than by adding log prints.

If I could make one more suggestion, it would be to switch the order of the second and third commits so the FlushBlockFile return value doesn’t get added with some initial behavior in the second commit (f0207e00303a1030eca795ede231e3c0d94df061) which is later modified in the third commit (d8041d4e042957660827313951b18c8dd9a99a16). If the commit order were switched, the FlushBlockFile return value could just be determined one way with no code churn.

It may be a good idea to follow up this PR with more changes that change error handling beyond adding log prints and return codes, but since FlushStateToDisk is called so many places, I think changes like these are harder to review, so it would be better for them to happen in a dedicated PR.

ryanofsky approved

ryanofsky commented at 4:51 pm on September 29, 2023: contributor

Code review ACK d8041d4e042957660827313951b18c8dd9a99a16

There should be no change in behavior other than logging here. Just to remind my self why, the reason is that it mostly adds new return values and doesn’t change existing ones. The only case where it does change a return value is in FlushBlockFile in the third commit. But there is only one call to FlushBlockFile, and if that fails it just triggers a log print.

DrahtBot requested review from ryanofsky on Sep 29, 2023

DrahtBot removed review request from ryanofsky on Sep 29, 2023

ryanofsky merged this on Sep 29, 2023

ryanofsky closed this on Sep 29, 2023

Frank-GER referenced this in commit 53099fdff4 on Oct 5, 2023

bitcoin locked this on Sep 28, 2024

blockstorage: Return on fatal flush errors #27866

Reviews

Conflicts