kernel: Handle fatal errors through return values

TheCharlatan commented at 9:34 pm on March 12, 2024: contributor

Based on #25665.

Currently functions issuing fatal errors call the fatalError notification method to issue a shutdown procedure. This makes it hard for higher level functions to figure out if an error deeper in the call stack occurred. For users of the kernel it might also be difficult to assert which function call issued a fatal error if they are being run concurrently. If the kernel would eventually be used by external users, getting fatal error information through a callback instead of function return values would also be cumbersome and limiting. Unit, bench, and fuzz tests currently don’t have a way to effectively test against fatal errors.

This patch set is an attempt to make fatal error handling in the kernel code more transparent. Fatal errors are now passed up the call stack through util::Result<T, FatalError> failure values. A previous attempt at this by theuni always immediately returned failure values if a function call returned a failure. However, this is not always desirable (see discussion here). Sometimes, further operations can still be completed even if a fatal error was issued. The solution to this is that these “ignored” errors are still moved through util::Result’s error string values with its .MoveMessages method, even while a failure value in the result is not present.

Next to some smaller behavior changes, the most significant change is that the issuing of a shutdown procedure is delayed until a potential fatal error is handled as opposed to immediately when it is first encountered. Another effect is that potential fatal errors are now asserted in the bench, fuzz and unit tests. Some of the currently not immediately returned fatal errors need some further scrutiny. These are marked with a TODO (fatal error) comment and could be tackled in a later PR.

To validate this approach a new clang-tidy check is introduced. It implements the following checks:

If a function calls another function with a util::Result<T, FatalCondition> return type, its return type has to be util::Result<T, FatalCondition> too, or it has to handle the value returned by the function with one of CheckFatal, HandleFatalError, UnwrapFatalError, or CheckFatalFailure.
In functions returning a util::Result<T, FatalCondition> a call to a function returning a util::Result<T, FatalCondition> needs to propagate the value by either:
- Returning it immediately
- Assigning it immediately to an existing result with .MoveMessages() or .Set()
- Eventually passing it as an argument to a .MoveMessages()

This PR is part of the libbitcoinkernel project and is a step towards stage 2, creating a more refined kernel API.

DrahtBot commented at 9:34 pm on March 12, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Concept ACK	stickies-v, ryanofsky

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#29735 (AcceptMultipleTransactions: Fix workspace not being set as client_maxfeerate failure by instagibbs)
#29700 (kernel, refactor: return error status on all fatal errors by ryanofsky)
#29641 (scripted-diff: Use LogInfo/LogDebug over LogPrintf/LogPrint by maflcko)
#29553 (assumeutxo: Add dumptxoutset height param, remove shell scripts by fjahr)
#29415 (Broadcast own transactions only via short-lived Tor or I2P connections by vasild)
#29402 (mempool: Log added for dumping mempool transactions to disk by kevkevinpal)
#29278 (Wallet: Add maxfeerate wallet startup option by ismaelsadeeq)
#29231 (logging: Update to new logging API by ajtowns)
#29015 (kernel: Streamline util library by ryanofsky)
#28843 ([refactor] Remove BlockAssembler m_mempool member by TheCharlatan)
#28687 (C++20 std::views::reverse by stickies-v)
#28233 (validation: don’t clear cache on periodic flush: >2x block connection speed by andrewtoth)
#26593 (tracing: Only prepare tracepoint arguments when actually tracing by 0xB10C)
#26022 (Add util::ResultPtr class by ryanofsky)
#25722 (refactor: Use util::Result class for wallet loading by ryanofsky)
#25665 (refactor: Add util::Result failure values, multiple error and warning messages by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot added the label Validation on Mar 12, 2024

TheCharlatan force-pushed on Mar 12, 2024

stickies-v commented at 3:01 pm on March 15, 2024: contributor

Concept ACK, nice one. Will first re-review #25665

TheCharlatan force-pushed on Mar 15, 2024

DrahtBot added the label Needs rebase on Mar 18, 2024

TheCharlatan force-pushed on Mar 19, 2024

DrahtBot removed the label Needs rebase on Mar 19, 2024

ryanofsky commented at 6:53 pm on March 19, 2024: contributor

Concept ACK, it is better for a function to return an error status than set a flag, initiate a global shutdown, and not return a success or failure value.

But the FatalError enum and HandleFatalError / CheckFatal / clang-tidy machinery do not seem very friendly to use, and I don’t see what benefits they offer over just using util::Result and [[nodiscard]].

It also seems like it would be nice to have local enums for functions like ProcessTransaction and ActivateBestChain that just have 2 or 3 error codes instead of 22 error codes like FatalError.

Also, I understand the goal of this is to improve the libbitcoinkernel interface, but I’m not sure this should mean moving AbortNode calls out of kernel code into net_processing code. I think it would be probably be better if the net_processing functions could return success/failure as well, and AbortNode calls could move to an even higher level. Or if AbortNode calls could just be left where they are and this could be a pure refactoring that just adds missing return values.

I guess my main question is would it be possible to make a stripped down version of this PR that just adds [[nodiscard]] to all functions where fatalError/flushError are called, and bubbles up util::Result values, and is just a plain refactoring that doesn’t change behavior, and doesn’t add the FatalError type and handlers? It seems like could make the PR smaller, and make libbitcoinkernel act like a normal library that can return errors, without changing other things.

ryanofsky commented at 7:40 pm on March 19, 2024: contributor

I don’t see what benefits they offer over just using util::Result and [[nodiscard]]

I guess I figured out the answer to this question while I was writing the rest of the rest of my comment. The benefit of adding all the FatalError/HandleFatalError/CheckFatal/clang-tidy stuff is that it can replace the fatalError and flushError notifications and ensure that new kernel code nevers trigger a shutdown without also returning an error code.

But IMO the benefit of enforcing that errors are always returned doesn’t clearly outweigh the costs, so am wondering about about a more lightweight approach that just bubbles up errors from existing fatalError/flushError calls.

TheCharlatan commented at 11:27 am on March 20, 2024: contributor

Re #29642 (comment) and #29642 (comment)

ensure that new kernel code nevers trigger a shutdown without also returning an error code.

Yes, that is the goal of this pull request. I first wrote the error handling and then the tidy plugin. The plugin helped uncover a handful of bugs that were not immediately obvious to me, so I do think it is useful beyond [[nodiscard]]. It also uncovered some cases where the current handling of the fatal errors is questionable. I don’t think the helper functions are all that complicated, though their naming sure could be improved to make it more obvious what they do. They also only form a small part of what the linter enforces. If reviewers view the linter as too specialized, it could be dropped from the PR, but still used to validate the work.

The benefit of adding all the FatalError/HandleFatalError/CheckFatal/clang-tidy stuff is that it can replace the fatalError and flushError notifications

To me this is more the logical conclusion of the PR than its goal. The goal is to bubble up an error code. Adding an error code next to the existing fatalError and flushError notifications in validation that is not used by production code for the sake of not changing things too much seems unfortunate though. The PR already tries hard to keep the rest of the behavior when a fatal error is encountered the same. Having three different fatal mechanisms seems more confusing than one consistent one that is enforced with a linter and is very mechanical. I also feel like having multiple ways to achieve the same outcome is bad API design. Removing the notifications could be done at a later point, but I don’t think it makes this PR significantly harder to review and it would leave things in an inconclusive state.

It also seems like it would be nice to have local enums for functions like ProcessTransaction and ActivateBestChain that just have 2 or 3 error codes instead of 22 error codes like FatalError.

The code has always treated fatal errors as their own weird separate thing, so putting them in a a single enum does not really feel that unnatural to me. However this PR could also be realized with more specialized types. What benefit do you see with such an approach?

DrahtBot added the label Needs rebase on Mar 20, 2024

ryanofsky commented at 9:22 pm on March 20, 2024: contributor

Thanks for explanations. So as I understand it, this PR is trying to do 3 things:

(1) Ensure that if an error happens while calling a function, the error is returned to the caller. (2) Ensure if that if an error happens while calling a function, and the node will be shutting down as a result of that error, the fact node is shutting down will be returned to the caller. (3) Fix a “handful of bugs” in error handling.

I think (1) is great and definitely worth doing, but I don’t think this PR is doing an ideal job in its implementation. I can describe the specific problems I see there currently below.

I don’t really think (2) is a worthwhile goal, because it should have no effect on libbitcoinkernel users and I don’t really see how it could catch any significant internal bugs. But if there is a way to implement it without negatively impacting libbitcoinkernel API, I think it would be fine to pursue.

I’m extremely wary of (3). I fully believe there are bugs in error handling and many things that could be improved, but I don’t think those changes can be effectively reviewed as part of a big refactoring PR like this. I think it can be ok to combine behavior changes with refactoring when the combining these simplifies things, but in this case I only see adding return information at the same time as changing error handling making things more complicated. (If there are counterexamples, though, I could be convinced otherwise.)

The specific problems I see with error reporting in this PR have to do with the FatalError enum. I think it is a bad developer experience to have a bunch of functions returning an enum with 20 possible error values, and not knowing which functions might return which values, and not having a suggestion of how different values need to be handled. I’m also skeptical that “fatal errors” are a concept that should be exposed to libbitcoinkernel callers, since not all callers will be implementing nodes that need to shut down, and normally library functions should be able to report errors without requiring they be handled in a certain way. Also, I think the fact that error type is an enum type rather than something more general like a std::variant means that it might be hard to return add useful metadata to the result value (like which block could not be read if a block read failed).

I was experimenting today with a change to just return [[nodiscard]] util::Result from every function calling fatalError or flushError, and I think is mostly complete (validation.cpp and blockstorage.cpp compile) which I think could be a simpler approach, providing a better API for libbitcoinkernel callers, and also being easier to review. I think other improvements could be made on top of it but maybe it is a simpler starting point. Commit is bc1055157b2e6458683953a41a7b243f27f77d98 and I can work on getting the rest of it compiling and then splitting it up.

TheCharlatan commented at 10:23 pm on March 20, 2024: contributor

Re #29642 (comment)

I’m extremely wary of (3).

Just for clarity, we are completely on the same page here. The bugs I was referring to before were all in how I was introducing the bubbling up of the errors and not with when or how an error is being issued currently. To reiterate, this patch set does not change when errors are being issued, nor should it influence the control flow. I’m not fixing bugs in error handling here.

I’m also skeptical that “fatal errors” are a concept that should be exposed to libbitcoinkernel callers, since not all callers will be implementing nodes that need to shut down, and normally library functions should be able to report errors without requiring they be handled in a certain way.

That is a fair point. I think in that case a more granular approach may be warranted.

Also, I think the fact that error type is an enum type rather than something more general like a std::variant means that it might be hard to return add useful metadata to the result value (like which block could not be read if a block read failed).

There is always the util::Result error string that describes where the error is coming from and what caused it, but I agree that a richer error type could be useful here. The reason I was using util::Result was that it provides a single, predictable way to handle the errors. An approach with a variant or tuple type would be able to return richer information, but also be less predictable.

EDIT:

I’m also skeptical that “fatal errors” are a concept that should be exposed to libbitcoinkernel callers, since not all callers will be implementing nodes that need to shut down.

Thinking about this some more, I’m still not convinced that putting them in a separate class of their own is bad. Core is and will be the primary user of the kernel, so consideration for how outside users might interpret the errors does not seem that important to me. I get that it might be nice to see when calling a function which fatal errors exactly it might issue, but for functions like ImportBlocks this is going to be a big subset of them. We also don’t really have similar mechanisms in place for existing error handling in the code., e.g. we often just return a plain error string, which I’m doing here too.

ryanofsky commented at 11:34 pm on March 21, 2024: contributor

I plan to look into this PR more and compare similarities & differences with the pure refactoring approach I suggested of just returning util::Result objects wherever fatal errors are triggered, and ensuring information is passed up with [[nodiscard]] attributes. I posted a draft PR following that approach in #29700, and I saw a lot of places where error handling could be improved so I think changes from both PRs could be useful together.

Thinking about this some more, I’m still not convinced that putting them in a separate class of their own is bad.

Yes, I agree returning information about which errors trigger shutdowns could be useful in theory. I just don’t see it providing an obvious practical benefit, and I think returning the information in the form of the FatalError enum has other drawbacks (documenting all the error codes could be a maintance burden, and the enum type could encourage writing awkward handling code with large switch statements).

refactor: Drop util::Result operator=

In cases where different error conditions need to be handled differently, it is
useful for `util::`Result` objects to be able to hold failure values and error
messages simultaneously. Also it is useful for `util::Result` objects to be
able to hold multiple error message strings and warnings, instead of just
single error strings.

In both of these cases, which are supported in upcoming commits, having an
`operator=` method is potentially dangerous if it clears existing error and
warnings strings while setting result values, or confusing if it doesn't clear
them, so the safest thing is to disable `operator=` and provide an explicit
`Set()` method so callers have a way of setting result values without having to
clear message strings.

440f67472d

refactor: Avoid copying util::Result values

Copying util::Result values is less efficient than moving them because they
allocate memory and contain strings. Also this is needed to avoid compile
errors in the next commit which adds a std::unique_ptr member to util::Result
which implicity disables copying.

882577bc37

refactor: Add util::Result failure values

Add util::Result support for returning more error information. For better error
handling, adds the ability to return a value on failure, not just a value on
success. This is a key missing feature that made the result class not useful
for functions like LoadChainstate() which produce different errors that need to
be handled differently.

This change also makes some more minor improvements:

- Smaller type size. util::Result<int> is 16 bytes, and util::Result<void> is 8
  bytes. Previously util::Result<int> and util::Result<void> were 72 bytes.

- More documentation and tests.

93cee983b0

refactor: Use util::Result class in LoadChainstate and VerifyLoadedChainstate 015b011c42

refactor: Add util::Result multiple error and warning messages

Add util::Result support for returning warning messages and multiple errors,
not just a single error string. This provides a way for functions to report
errors and warnings in a standard way, and simplifies interfaces.

The functionality is unit tested here, and put to use in followup PR
https://github.com/bitcoin/bitcoin/pull/25722

c7d4f43198

test: add static test for util::Result memory usage

Suggested by Martin Leitner-Ankerl <martin.ankerl@gmail.com>
https://github.com/bitcoin/bitcoin/pull/25722#discussion_r1174298529

Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com>

5e38b42b1e

Add bitcoin-fatal-error clang-tidy check

The check warns about incorrect usage of util::Result<T, FatalError>.
The check enforces the following rules:

1. If a function calls another function with a util::Result<T,
FatalCondition> return type, its return type has to be util::Result<T,
FatalCondition> too, or it has to handle the value returned by the
function with one of "CheckFatal", "HandleFatalError",
"UnwrapFatalError", "CheckFatalFailure".

2. In functions returning a util::Result<T, FatalCondition> a call to a
function returning a util::Result<T, FatalCondition> needs to propagate
the value by either:
 a) Returning it immediately
 b) Assigning it immediately to an existing result with .MoveMessages()
    or .Set()
 c) Eventually passing it as an arugment to a .MoveMessages() call

e5936fe79c

kernel: Handle ImportBlocks fatal errors with Result<T, FatalError>

This introduces a new FatalError enum class whose semantics when used as
a failure type in a util::Result<T, FatalError> are controlled by the
previously introduced bitcoin-fatal-error clang-tidy plugin.

The kernel's `UnwrapFatalError` method is meant to be used in unit and
fuzz tests, while production code should use `CheckFatal` to handle
fatal errors.

b16787d584

refactor: Replace call to fatalError with AbortNode in init e641470b6d

kernel: Handle LoadExternalBlockFile fatal errors with Result<T, FatalError> cf55203ed2

kernel: Handle ActivateSnapshot fatal errors with Result<T, FatalError> 1850f3b857

kernel: Add [[nodiscard]] attribute for LoadChainstate 93b9c3ba3f

kernel: Handle ValidatedSnapshotCleanup fatal errors with Result<T, FatalError>

This slightly refactors the method to return the FatalError instead of
throwing.

4bfa9da313

kernel: Handle MaybeCompleteSnapshotValidation fatal errors with Result<T, FatalError> 4e2de7b332

kernel: Handle AcceptBlock fatal errors with Result<T, FatalError>

To propagate the FatalError, also add the type to
InvalidateCoinsDBOnDisk and AcceptBlock.

The net_processing module now has to handle a FatalError, so introduce
shutdown and exit_status member variables for to be able to abort.

0758b72e2b

kernel: Handle ActivateBestChainStep fatal errors with Result<T, FatalError>

To propagate the FatalError, also add the type to PreciousBlock and
ActivateBestChain.

6eceef84b4

kernel: Handle ConnectTip fatal errors with Result<T, FatalError> 11a0d3fba9

kernel: Handle FlushStateToDisk fatal errors with Result<T, FatalError>

To propagate the FatalError, also add the type to PruneBlockFilesManual,
AcceptToMemoryPool, ProcessNewPackage, ResizeCoinsCaches,
ForceFlushStateToDisk, PruneAndFlush, DisconnectTip, InvalidateBlock,
MaybeUpdateMempoolForReorg, ProcessTransaction, MaybeRebalanceCaches,
LoadMempool

42943e5567

kernel: Handle ConnectBlock fatal errors with Result<T, FatalError>

To propagate the FatalError, also add the type to VerifyDB and
TestBlockValidity.

The miner module now has to handle a FatalError, so introduce shutdown
and exit_status member variables for it to be able to abort.

a555387dab

kernel: Handle SaveBlockToDisk fatal errors with Result<T, FatalError>

To propagate the FatalError, also add the type to LoadGenesisBlock.

16dee3c227

kernel: Handle FindBlockPos fatal errors with Result<T, FatalError> f9b17733c4

kernel: Handle FlushBlockFile fatal errors with Result<T, FatalError>

To propagate the FatalError, also add the type to
FlushChainstateBlockFile.

57b7430f8d

kernel: Handle WriteUndoDataForBlock fatal errors with Result<T, FatalError> 738e8505c9

kernel: Handle FlushUndoFile fatal errors with Result<T, FatalError> 0273acfd5a

kernel: Handle FindUndoPos fatal errors with Result<T, FatalError> 76f1420e29

kernel: Handle LoadBlockIndex fatal errors with Result<T, FatalError>

To propagate the FatalError, also add the type to LoadBlockIndexDB.

e2c336bbd6

Remove flushError, fatalError from kernel notifications

Also add using declarations where now possible.

bd0a9bb5da

TheCharlatan force-pushed on Mar 26, 2024

DrahtBot removed the label Needs rebase on Mar 26, 2024

DrahtBot added the label Needs rebase on Mar 28, 2024

DrahtBot commented at 11:56 am on March 28, 2024: contributor

🐙 This pull request conflicts with the target branch and needs rebase.

DrahtBot commented at 0:02 am on June 25, 2024: contributor