kernel: Handle fatal errors through return values #29642

pull TheCharlatan wants to merge 27 commits into bitcoin:master from TheCharlatan:returnTypeShutdown changing 70 files +2058 −700
  1. TheCharlatan commented at 9:34 pm on March 12, 2024: contributor

    Based on #25665.

    Currently functions issuing fatal errors call the fatalError notification method to issue a shutdown procedure. This makes it hard for higher level functions to figure out if an error deeper in the call stack occurred. For users of the kernel it might also be difficult to assert which function call issued a fatal error if they are being run concurrently. If the kernel would eventually be used by external users, getting fatal error information through a callback instead of function return values would also be cumbersome and limiting. Unit, bench, and fuzz tests currently don’t have a way to effectively test against fatal errors.

    This patch set is an attempt to make fatal error handling in the kernel code more transparent. Fatal errors are now passed up the call stack through util::Result<T, FatalError> failure values. A previous attempt at this by theuni always immediately returned failure values if a function call returned a failure. However, this is not always desirable (see discussion here). Sometimes, further operations can still be completed even if a fatal error was issued. The solution to this is that these “ignored” errors are still moved through util::Result’s error string values with its .MoveMessages method, even while a failure value in the result is not present.

    Next to some smaller behavior changes, the most significant change is that the issuing of a shutdown procedure is delayed until a potential fatal error is handled as opposed to immediately when it is first encountered. Another effect is that potential fatal errors are now asserted in the bench, fuzz and unit tests. Some of the currently not immediately returned fatal errors need some further scrutiny. These are marked with a TODO (fatal error) comment and could be tackled in a later PR.

    To validate this approach a new clang-tidy check is introduced. It implements the following checks:

    1. If a function calls another function with a util::Result<T, FatalCondition> return type, its return type has to be util::Result<T, FatalCondition> too, or it has to handle the value returned by the function with one of CheckFatal, HandleFatalError, UnwrapFatalError, or CheckFatalFailure.
    2. In functions returning a util::Result<T, FatalCondition> a call to a function returning a util::Result<T, FatalCondition> needs to propagate the value by either:
      • Returning it immediately
      • Assigning it immediately to an existing result with .MoveMessages() or .Set()
      • Eventually passing it as an argument to a .MoveMessages()

    This PR is part of the libbitcoinkernel project and is a step towards stage 2, creating a more refined kernel API.

  2. DrahtBot commented at 9:34 pm on March 12, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK stickies-v, ryanofsky

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #29735 (AcceptMultipleTransactions: Fix workspace not being set as client_maxfeerate failure by instagibbs)
    • #29700 (kernel, refactor: return error status on all fatal errors by ryanofsky)
    • #29641 (scripted-diff: Use LogInfo/LogDebug over LogPrintf/LogPrint by maflcko)
    • #29553 (assumeutxo: Add dumptxoutset height param, remove shell scripts by fjahr)
    • #29415 (Broadcast own transactions only via short-lived Tor or I2P connections by vasild)
    • #29402 (mempool: Log added for dumping mempool transactions to disk by kevkevinpal)
    • #29278 (Wallet: Add maxfeerate wallet startup option by ismaelsadeeq)
    • #29231 (logging: Update to new logging API by ajtowns)
    • #29015 (kernel: Streamline util library by ryanofsky)
    • #28843 ([refactor] Remove BlockAssembler m_mempool member by TheCharlatan)
    • #28687 (C++20 std::views::reverse by stickies-v)
    • #28233 (validation: don’t clear cache on periodic flush: >2x block connection speed by andrewtoth)
    • #26593 (tracing: Only prepare tracepoint arguments when actually tracing by 0xB10C)
    • #26022 (Add util::ResultPtr class by ryanofsky)
    • #25722 (refactor: Use util::Result class for wallet loading by ryanofsky)
    • #25665 (refactor: Add util::Result failure values, multiple error and warning messages by ryanofsky)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label Validation on Mar 12, 2024
  4. TheCharlatan force-pushed on Mar 12, 2024
  5. stickies-v commented at 3:01 pm on March 15, 2024: contributor
    Concept ACK, nice one. Will first re-review #25665
  6. TheCharlatan force-pushed on Mar 15, 2024
  7. DrahtBot added the label Needs rebase on Mar 18, 2024
  8. TheCharlatan force-pushed on Mar 19, 2024
  9. DrahtBot removed the label Needs rebase on Mar 19, 2024
  10. ryanofsky commented at 6:53 pm on March 19, 2024: contributor

    Concept ACK, it is better for a function to return an error status than set a flag, initiate a global shutdown, and not return a success or failure value.

    But the FatalError enum and HandleFatalError / CheckFatal / clang-tidy machinery do not seem very friendly to use, and I don’t see what benefits they offer over just using util::Result and [[nodiscard]].

    It also seems like it would be nice to have local enums for functions like ProcessTransaction and ActivateBestChain that just have 2 or 3 error codes instead of 22 error codes like FatalError.

    Also, I understand the goal of this is to improve the libbitcoinkernel interface, but I’m not sure this should mean moving AbortNode calls out of kernel code into net_processing code. I think it would be probably be better if the net_processing functions could return success/failure as well, and AbortNode calls could move to an even higher level. Or if AbortNode calls could just be left where they are and this could be a pure refactoring that just adds missing return values.

    I guess my main question is would it be possible to make a stripped down version of this PR that just adds [[nodiscard]] to all functions where fatalError/flushError are called, and bubbles up util::Result values, and is just a plain refactoring that doesn’t change behavior, and doesn’t add the FatalError type and handlers? It seems like could make the PR smaller, and make libbitcoinkernel act like a normal library that can return errors, without changing other things.

  11. ryanofsky commented at 7:40 pm on March 19, 2024: contributor

    I don’t see what benefits they offer over just using util::Result and [[nodiscard]]

    I guess I figured out the answer to this question while I was writing the rest of the rest of my comment. The benefit of adding all the FatalError/HandleFatalError/CheckFatal/clang-tidy stuff is that it can replace the fatalError and flushError notifications and ensure that new kernel code nevers trigger a shutdown without also returning an error code.

    But IMO the benefit of enforcing that errors are always returned doesn’t clearly outweigh the costs, so am wondering about about a more lightweight approach that just bubbles up errors from existing fatalError/flushError calls.

  12. TheCharlatan commented at 11:27 am on March 20, 2024: contributor

    Re #29642 (comment) and #29642 (comment)

    ensure that new kernel code nevers trigger a shutdown without also returning an error code.

    Yes, that is the goal of this pull request. I first wrote the error handling and then the tidy plugin. The plugin helped uncover a handful of bugs that were not immediately obvious to me, so I do think it is useful beyond [[nodiscard]]. It also uncovered some cases where the current handling of the fatal errors is questionable. I don’t think the helper functions are all that complicated, though their naming sure could be improved to make it more obvious what they do. They also only form a small part of what the linter enforces. If reviewers view the linter as too specialized, it could be dropped from the PR, but still used to validate the work.

    The benefit of adding all the FatalError/HandleFatalError/CheckFatal/clang-tidy stuff is that it can replace the fatalError and flushError notifications

    To me this is more the logical conclusion of the PR than its goal. The goal is to bubble up an error code. Adding an error code next to the existing fatalError and flushError notifications in validation that is not used by production code for the sake of not changing things too much seems unfortunate though. The PR already tries hard to keep the rest of the behavior when a fatal error is encountered the same. Having three different fatal mechanisms seems more confusing than one consistent one that is enforced with a linter and is very mechanical. I also feel like having multiple ways to achieve the same outcome is bad API design. Removing the notifications could be done at a later point, but I don’t think it makes this PR significantly harder to review and it would leave things in an inconclusive state.

    It also seems like it would be nice to have local enums for functions like ProcessTransaction and ActivateBestChain that just have 2 or 3 error codes instead of 22 error codes like FatalError.

    The code has always treated fatal errors as their own weird separate thing, so putting them in a a single enum does not really feel that unnatural to me. However this PR could also be realized with more specialized types. What benefit do you see with such an approach?

  13. DrahtBot added the label Needs rebase on Mar 20, 2024
  14. ryanofsky commented at 9:22 pm on March 20, 2024: contributor

    Thanks for explanations. So as I understand it, this PR is trying to do 3 things:

    (1) Ensure that if an error happens while calling a function, the error is returned to the caller. (2) Ensure if that if an error happens while calling a function, and the node will be shutting down as a result of that error, the fact node is shutting down will be returned to the caller. (3) Fix a “handful of bugs” in error handling.

    I think (1) is great and definitely worth doing, but I don’t think this PR is doing an ideal job in its implementation. I can describe the specific problems I see there currently below.

    I don’t really think (2) is a worthwhile goal, because it should have no effect on libbitcoinkernel users and I don’t really see how it could catch any significant internal bugs. But if there is a way to implement it without negatively impacting libbitcoinkernel API, I think it would be fine to pursue.

    I’m extremely wary of (3). I fully believe there are bugs in error handling and many things that could be improved, but I don’t think those changes can be effectively reviewed as part of a big refactoring PR like this. I think it can be ok to combine behavior changes with refactoring when the combining these simplifies things, but in this case I only see adding return information at the same time as changing error handling making things more complicated. (If there are counterexamples, though, I could be convinced otherwise.)


    The specific problems I see with error reporting in this PR have to do with the FatalError enum. I think it is a bad developer experience to have a bunch of functions returning an enum with 20 possible error values, and not knowing which functions might return which values, and not having a suggestion of how different values need to be handled. I’m also skeptical that “fatal errors” are a concept that should be exposed to libbitcoinkernel callers, since not all callers will be implementing nodes that need to shut down, and normally library functions should be able to report errors without requiring they be handled in a certain way. Also, I think the fact that error type is an enum type rather than something more general like a std::variant means that it might be hard to return add useful metadata to the result value (like which block could not be read if a block read failed).

    I was experimenting today with a change to just return [[nodiscard]] util::Result from every function calling fatalError or flushError, and I think is mostly complete (validation.cpp and blockstorage.cpp compile) which I think could be a simpler approach, providing a better API for libbitcoinkernel callers, and also being easier to review. I think other improvements could be made on top of it but maybe it is a simpler starting point. Commit is bc1055157b2e6458683953a41a7b243f27f77d98 and I can work on getting the rest of it compiling and then splitting it up.

  15. TheCharlatan commented at 10:23 pm on March 20, 2024: contributor

    Re #29642 (comment)

    I’m extremely wary of (3).

    Just for clarity, we are completely on the same page here. The bugs I was referring to before were all in how I was introducing the bubbling up of the errors and not with when or how an error is being issued currently. To reiterate, this patch set does not change when errors are being issued, nor should it influence the control flow. I’m not fixing bugs in error handling here.

    I’m also skeptical that “fatal errors” are a concept that should be exposed to libbitcoinkernel callers, since not all callers will be implementing nodes that need to shut down, and normally library functions should be able to report errors without requiring they be handled in a certain way.

    That is a fair point. I think in that case a more granular approach may be warranted.

    Also, I think the fact that error type is an enum type rather than something more general like a std::variant means that it might be hard to return add useful metadata to the result value (like which block could not be read if a block read failed).

    There is always the util::Result error string that describes where the error is coming from and what caused it, but I agree that a richer error type could be useful here. The reason I was using util::Result was that it provides a single, predictable way to handle the errors. An approach with a variant or tuple type would be able to return richer information, but also be less predictable.

    EDIT:

    I’m also skeptical that “fatal errors” are a concept that should be exposed to libbitcoinkernel callers, since not all callers will be implementing nodes that need to shut down.

    Thinking about this some more, I’m still not convinced that putting them in a separate class of their own is bad. Core is and will be the primary user of the kernel, so consideration for how outside users might interpret the errors does not seem that important to me. I get that it might be nice to see when calling a function which fatal errors exactly it might issue, but for functions like ImportBlocks this is going to be a big subset of them. We also don’t really have similar mechanisms in place for existing error handling in the code., e.g. we often just return a plain error string, which I’m doing here too.

  16. ryanofsky commented at 11:34 pm on March 21, 2024: contributor

    I plan to look into this PR more and compare similarities & differences with the pure refactoring approach I suggested of just returning util::Result objects wherever fatal errors are triggered, and ensuring information is passed up with [[nodiscard]] attributes. I posted a draft PR following that approach in #29700, and I saw a lot of places where error handling could be improved so I think changes from both PRs could be useful together.

    Thinking about this some more, I’m still not convinced that putting them in a separate class of their own is bad.

    Yes, I agree returning information about which errors trigger shutdowns could be useful in theory. I just don’t see it providing an obvious practical benefit, and I think returning the information in the form of the FatalError enum has other drawbacks (documenting all the error codes could be a maintance burden, and the enum type could encourage writing awkward handling code with large switch statements).

  17. refactor: Drop util::Result operator=
    In cases where different error conditions need to be handled differently, it is
    useful for `util::`Result` objects to be able to hold failure values and error
    messages simultaneously. Also it is useful for `util::Result` objects to be
    able to hold multiple error message strings and warnings, instead of just
    single error strings.
    
    In both of these cases, which are supported in upcoming commits, having an
    `operator=` method is potentially dangerous if it clears existing error and
    warnings strings while setting result values, or confusing if it doesn't clear
    them, so the safest thing is to disable `operator=` and provide an explicit
    `Set()` method so callers have a way of setting result values without having to
    clear message strings.
    440f67472d
  18. refactor: Avoid copying util::Result values
    Copying util::Result values is less efficient than moving them because they
    allocate memory and contain strings. Also this is needed to avoid compile
    errors in the next commit which adds a std::unique_ptr member to util::Result
    which implicity disables copying.
    882577bc37
  19. refactor: Add util::Result failure values
    Add util::Result support for returning more error information. For better error
    handling, adds the ability to return a value on failure, not just a value on
    success. This is a key missing feature that made the result class not useful
    for functions like LoadChainstate() which produce different errors that need to
    be handled differently.
    
    This change also makes some more minor improvements:
    
    - Smaller type size. util::Result<int> is 16 bytes, and util::Result<void> is 8
      bytes. Previously util::Result<int> and util::Result<void> were 72 bytes.
    
    - More documentation and tests.
    93cee983b0
  20. refactor: Use util::Result class in LoadChainstate and VerifyLoadedChainstate 015b011c42
  21. refactor: Add util::Result multiple error and warning messages
    Add util::Result support for returning warning messages and multiple errors,
    not just a single error string. This provides a way for functions to report
    errors and warnings in a standard way, and simplifies interfaces.
    
    The functionality is unit tested here, and put to use in followup PR
    https://github.com/bitcoin/bitcoin/pull/25722
    c7d4f43198
  22. test: add static test for util::Result memory usage
    Suggested by Martin Leitner-Ankerl <martin.ankerl@gmail.com>
    https://github.com/bitcoin/bitcoin/pull/25722#discussion_r1174298529
    
    Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com>
    5e38b42b1e
  23. Add bitcoin-fatal-error clang-tidy check
    The check warns about incorrect usage of util::Result<T, FatalError>.
    The check enforces the following rules:
    
    1. If a function calls another function with a util::Result<T,
    FatalCondition> return type, its return type has to be util::Result<T,
    FatalCondition> too, or it has to handle the value returned by the
    function with one of "CheckFatal", "HandleFatalError",
    "UnwrapFatalError", "CheckFatalFailure".
    
    2. In functions returning a util::Result<T, FatalCondition> a call to a
    function returning a util::Result<T, FatalCondition> needs to propagate
    the value by either:
     a) Returning it immediately
     b) Assigning it immediately to an existing result with .MoveMessages()
        or .Set()
     c) Eventually passing it as an arugment to a .MoveMessages() call
    e5936fe79c
  24. kernel: Handle ImportBlocks fatal errors with Result<T, FatalError>
    This introduces a new FatalError enum class whose semantics when used as
    a failure type in a util::Result<T, FatalError> are controlled by the
    previously introduced bitcoin-fatal-error clang-tidy plugin.
    
    The kernel's `UnwrapFatalError` method is meant to be used in unit and
    fuzz tests, while production code should use `CheckFatal` to handle
    fatal errors.
    b16787d584
  25. refactor: Replace call to fatalError with AbortNode in init e641470b6d
  26. kernel: Handle LoadExternalBlockFile fatal errors with Result<T, FatalError> cf55203ed2
  27. kernel: Handle ActivateSnapshot fatal errors with Result<T, FatalError> 1850f3b857
  28. kernel: Add [[nodiscard]] attribute for LoadChainstate 93b9c3ba3f
  29. kernel: Handle ValidatedSnapshotCleanup fatal errors with Result<T, FatalError>
    This slightly refactors the method to return the FatalError instead of
    throwing.
    4bfa9da313
  30. kernel: Handle MaybeCompleteSnapshotValidation fatal errors with Result<T, FatalError> 4e2de7b332
  31. kernel: Handle AcceptBlock fatal errors with Result<T, FatalError>
    To propagate the FatalError, also add the type to
    InvalidateCoinsDBOnDisk and AcceptBlock.
    
    The net_processing module now has to handle a FatalError, so introduce
    shutdown and exit_status member variables for to be able to abort.
    0758b72e2b
  32. kernel: Handle ActivateBestChainStep fatal errors with Result<T, FatalError>
    To propagate the FatalError, also add the type to PreciousBlock and
    ActivateBestChain.
    6eceef84b4
  33. kernel: Handle ConnectTip fatal errors with Result<T, FatalError> 11a0d3fba9
  34. kernel: Handle FlushStateToDisk fatal errors with Result<T, FatalError>
    To propagate the FatalError, also add the type to PruneBlockFilesManual,
    AcceptToMemoryPool, ProcessNewPackage, ResizeCoinsCaches,
    ForceFlushStateToDisk, PruneAndFlush, DisconnectTip, InvalidateBlock,
    MaybeUpdateMempoolForReorg, ProcessTransaction, MaybeRebalanceCaches,
    LoadMempool
    42943e5567
  35. kernel: Handle ConnectBlock fatal errors with Result<T, FatalError>
    To propagate the FatalError, also add the type to VerifyDB and
    TestBlockValidity.
    
    The miner module now has to handle a FatalError, so introduce shutdown
    and exit_status member variables for it to be able to abort.
    a555387dab
  36. kernel: Handle SaveBlockToDisk fatal errors with Result<T, FatalError>
    To propagate the FatalError, also add the type to LoadGenesisBlock.
    16dee3c227
  37. kernel: Handle FindBlockPos fatal errors with Result<T, FatalError> f9b17733c4
  38. kernel: Handle FlushBlockFile fatal errors with Result<T, FatalError>
    To propagate the FatalError, also add the type to
    FlushChainstateBlockFile.
    57b7430f8d
  39. kernel: Handle WriteUndoDataForBlock fatal errors with Result<T, FatalError> 738e8505c9
  40. kernel: Handle FlushUndoFile fatal errors with Result<T, FatalError> 0273acfd5a
  41. kernel: Handle FindUndoPos fatal errors with Result<T, FatalError> 76f1420e29
  42. kernel: Handle LoadBlockIndex fatal errors with Result<T, FatalError>
    To propagate the FatalError, also add the type to LoadBlockIndexDB.
    e2c336bbd6
  43. Remove flushError, fatalError from kernel notifications
    Also add using declarations where now possible.
    bd0a9bb5da
  44. TheCharlatan force-pushed on Mar 26, 2024
  45. DrahtBot removed the label Needs rebase on Mar 26, 2024
  46. DrahtBot added the label Needs rebase on Mar 28, 2024
  47. DrahtBot commented at 11:56 am on March 28, 2024: contributor

    🐙 This pull request conflicts with the target branch and needs rebase.

  48. DrahtBot commented at 0:02 am on June 25, 2024: contributor

    ⌛ There hasn’t been much activity lately and the patch still needs rebase. What is the status here?

    • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
    • Is it no longer relevant? ➡️ Please close.
    • Did the author lose interest or time to work on this? ➡️ Please close it and mark it ‘Up for grabs’ with the label, so that it can be picked up in the future.
  49. DrahtBot commented at 1:06 am on September 22, 2024: contributor

    ⌛ There hasn’t been much activity lately and the patch still needs rebase. What is the status here?

    • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
    • Is it no longer relevant? ➡️ Please close.
    • Did the author lose interest or time to work on this? ➡️ Please close it and mark it ‘Up for grabs’ with the label, so that it can be picked up in the future.
  50. TheCharlatan commented at 6:03 pm on October 10, 2024: contributor
    I’m going to leave this in draft a little while longer. I think we should first focus on ryanofsky’s #29700 though. If that is merged, there is always opportunity to revisit the approach taken here.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-21 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me