kernel: Remove shutdown globals from kernel library

TheCharlatan commented at 11:20 am on May 21, 2023: contributor

This pull request is part of the libbitcoinkernel project #27587 https://github.com/orgs/bitcoin/projects/3 and more specifically its “Step 2: Decouple most non-consensus code from libbitcoinkernel”.

This removes the shutdown files from the kernel library. As a standalone library the libbitcoinkernel should not have to rely on code that accesses shutdown state through a global. Instead, replace it with an interrupt class that is owned by the kernel context. Additionally this moves the remaining usages of the uiInterface out of the kernel code. The process of moving the uiInterface out of the kernel was started in #27636.

The approach taken here uses the notification interface to externalize the existing shutdown code from the kernel library. The SignalInterrupt kernel context member is passed by reference to the ChainstateManager to allow for the termination of long-running kernel functions and sending interrupt signals.

DrahtBot commented at 11:20 am on May 21, 2023: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Concept ACK	theuni, hebasto

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#27823 (init: return error when block index is non-contiguous, fix feature_init.py file perturbation by mzumsande)
#27746 (Draft: rework validation logic for assumeutxo by sdaftuar)
#27708 (Return EXIT_FAILURE on post-init fatal errors by furszy)
#27607 (index: improve initialization and pruning violation check by furszy)
#27596 (assumeutxo (2) by jamesob)
#27576 (kernel: Remove args, settings, chainparams, chainparamsbase from kernel library by TheCharlatan)
#26966 (index: blockfilter initial sync speedup, parallelize process by furszy)
#26762 (refactor: Make CCheckQueue RAII-styled by hebasto)
#26045 (rpc: Optimize serialization disk space of dumptxoutset by aureleoules)
#10443 (Add fee_est tool for debugging fee estimation code by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot added the label Validation on May 21, 2023

fanquake commented at 9:06 am on May 22, 2023: member

cc @theuni

theuni commented at 12:24 pm on May 22, 2023: member

Nice! Huge concept ACK. Haven’t looked at the approach/code yet but will review.

TheCharlatan force-pushed on May 22, 2023

TheCharlatan force-pushed on May 24, 2023

theuni commented at 10:07 pm on May 25, 2023: member

I spent some time reviving my clang-tidy transform plugin which is able to force bubble-ups of fatal errors and decorations of functions which may call them. I’m not suggesting that we use this approach (I mentioned at core-dev that it works but I find it pretty ugly), though it’s useful to point out our current shutdown behavior.

Note that the plugin is self-written, so it’s not flawless. It does produce compiling code though, so I have a reasonable amount of confidence.

See here for the transform of the current master branch.

A quick guide:

Any functions/assignments/declarations which may call shutdown is wrapped in a macro like MAYBE_EXIT, EXIT_OR_IF, etc.
Any function containing one of those gets a return type of MaybeEarlyExit<real_return_type> in order to bubble it up the call-stack.
This repeats recursively until the topmost functions have had their return type modified.

This provides an easy way to see whether or not a high-level function may call shutdown from some deeply nested function.

It also provides an easy way to see when calls to shutdown are made without returning to the calling function.

I haven’t looked at the actual result yet at all. I suspect it will point out several things for us to fix.

maflcko commented at 5:56 am on May 26, 2023: member

It also provides an easy way to see when calls to shutdown are made without returning to the calling function.

Can you explain this? Am I correct when guessing that any added EXIT* macro, and not the BUBBLE* macros are indicating a missing return?

theuni commented at 3:12 pm on May 30, 2023: member

It also provides an easy way to see when calls to shutdown are made without returning to the calling function.

Can you explain this? Am I correct when guessing that any added EXIT* macro, and not the BUBBLE* macros are indicating a missing return? @MarcoFalke Right, the BUBBLE_UP is kinda like a cast, it’s not so interesting. The others introduce conditional returns to pass any errors back up the stack. So the problem cases are ones where we don’t already have that behavior.

The main complication comes from the fact that the callstack might look like:

0void foo()
1  int bar()
2    std::string baz()

Where baz() returns a fatal error. The solution implemented here is to wrap everything in a variant:

0MaybeEarlyExit<void> foo()
1  MaybeEarlyExit<int> bar()
2    MaybeEarlyExit<std::string> baz()

Where: MaybeEarlyExit : std::variant<T, FatalError, UserInterrupted> (and those are simple enum classes)

So basically if the call resulted in an error, pass an error type back up immediately rather than a real return value. These macros are defined here and do that as necessary (you can largely think of them of different notations of try/catch/rethrow). In fact, this model forces that invariant: you get a return value or a fatal error but never both.

BlockManager::WriteUndoDataForBlock is a good example of correct/incorrect:

FindUndoPos: potential error caught and passed back up (what callers do with it is a different question)
UndoWriteToDisk: aborts and returns immediately as needed
FlushUndoFile: this one is (I believe) an example of a missed return in master that could cause problems with changes like these. Not necessarily this one in particular, but it’s clearly not alone. Edit: to be clear, the problem here is that FlushUndoFile may call AbortNode, but we don’t react to that in WriteUndoDataForBlock.

I’m not sure how to proceed. @ryanofsky mentioned maybe fixing them up as part of this shutdown callback effort?

maflcko commented at 3:49 pm on May 30, 2023: member

Edit: to be clear, the problem here is that FlushUndoFile may call AbortNode, but we don’t react to that in WriteUndoDataForBlock.

Do we have to react there? AbortNode starts a shutdown, and the shutdown is later checked in ABC: https://github.com/bitcoin/bitcoin/blob/71300489af362c3fed4736de6bffab4d758b6a84/src/validation.cpp#L3193

Sure, better annotations are better, but then why not use the existing Result class over the MaybeEarlyExit type?

DrahtBot added the label Needs rebase on May 30, 2023

TheCharlatan force-pushed on May 30, 2023

theuni commented at 4:07 pm on May 30, 2023: member

Edit: to be clear, the problem here is that FlushUndoFile may call AbortNode, but we don’t react to that in WriteUndoDataForBlock.

Do we have to react there? AbortNode starts a shutdown, and the shutdown is later checked in ABC:

https://github.com/bitcoin/bitcoin/blob/71300489af362c3fed4736de6bffab4d758b6a84/src/validation.cpp#L3193

Yes, looks like I picked this example too hastily. You’re correct that this one is caught higher up.

Sure, better annotations are better, but then why not use the existing Result class over the MaybeEarlyExit type?

Again, I’m not suggesting that we use MaybeEarlyExit. Only suggesting analyzing the above transformation to see if there are true bugs hiding.

DrahtBot added the label CI failed on May 30, 2023

DrahtBot removed the label Needs rebase on May 30, 2023

TheCharlatan force-pushed on May 31, 2023

TheCharlatan marked this as ready for review on May 31, 2023

DrahtBot removed the label CI failed on May 31, 2023

ryanofsky commented at 5:30 pm on June 1, 2023: contributor

I think I agree with Cory, and think it is a problem even if it is not a bug for BlockManager::FlushBlockFile and BlockManager::FlushUndoFile methods to call Notifications::fatalError without returning errors themselves. The fatal errors there seem different than other fatal errors, and the contradict the “whatever function triggered the error should also return an error code or raise an exception” documention.

I would suggest making one (or both) of the following changes to this PR:

Changing FlushBlockFile and FlushUndoFile to send a different “flush failed” notification rather than “fatal error”. The node’s KernelNotifications handler could treat this just like a fatal error to keep current behavior the same. Other applications using libbitcoinkernel might choose to handle the error in a different way for example by sending a notification or trying to free up disk space, rather than shutting down the service right away.
Changing FlushBlockFile and FlushUndoFile to return [[nodiscard]] bool, and changing calls to these functions either bubble up the error or have comments about why the error is safe to ignore in that particular context.

in src/kernel/blockmanager_opts.h:29 in 978c05d58c outdated

25@@ -25,6 +26,7 @@ struct BlockManagerOpts {
26     bool fast_prune{false};
27     bool stop_after_block_import{DEFAULT_STOPAFTERBLOCKIMPORT};
28     const fs::path blocks_dir;
29+    std::atomic<bool>& shutdown_requested;

ryanofsky commented at 5:36 pm on June 1, 2023:

In commit “kernel: Pass reference of global shutdown requested atomic” (978c05d58cea9f96f2713667663af35b43905701)

I don’t think I understand what the purpose of this option is at a high level. If I am writing an application which uses libbitcoinkernel, the thing I would want from the kernel library is a cancel() function I could call to cancel potentially slow operations. I don’t want to have to mirror my application state to an atomic<bool> and be give the kernel read/write access to the bool as a more awkward way of cancelling.

So it seems to me the approach here is not very friendly to potential libbitcoinkernel applications, and is not even very friendly to our own application since it exposes the fRequestShutdown global more broadly and makes it harder to get rid of.

I think a better approach would be to get rid of the fRequestShutdown global and replace it with a CThreadInterrupt cancel member in kernel::Context. I think this should not be too hard to do, and would still be compatible with most other changes in the PR.

TheCharlatan commented at 8:11 pm on June 1, 2023:

Thank you for taking a look!

So it seems to me the approach here is not very friendly to potential libbitcoinkernel applications, and is not even very friendly to our own application since it exposes the fRequestShutdown global more broadly and makes it harder to get rid of.

I was trying this out today on my proof of concept branch and got the same impression. It is unwieldy and the naming is bad.

I think a better approach would be to get rid of the fRequestShutdown global and replace it with a CThreadInterrupt cancel member in kernel::Context. I think this should not be too hard to do, and would still be compatible with most other changes in the PR.

Just for clarity: This would imply passing a reference of the kernel context to the validation and blockstorage code? Giving the kernel ownership over this interrupt state seems a bit weird to me at first glance, since it is shared with all the other components that we have. Thinking of it as a pure execution interrupt, and not a kill signal makes this a bit more palatable though.

ryanofsky commented at 11:18 pm on June 1, 2023:

Just for clarity: This would imply passing a reference of the kernel context to the validation and blockstorage code? Giving the kernel ownership over this interrupt state seems a bit weird to me at first glance, since it is shared with all the other components that we have.

I think the idea is to give the kernel ownership of it’s own interrupt state, and not have to rely on an external mechanism. If other components that are already depending on the kernel want to share the same interrupt mechanism, they are then free to do that or not to do that.

I implemented this approach in a branch based of this PR: https://github.com/ryanofsky/bitcoin/commits/review.27711.1-edit.

The first commit 04ee8f1190dc6712ea053caed0428ff0c70fca85 is the main one which moves the globals in shutdown.cpp into a kernel::Context::cancel member that should allow code using libbitcoinkernel to cancel ongoing tasks and interrupt execution.

The second commit 6df7bf3adf8f2a683379ed7830c296847ebfcb51 is just an updated version of your first commit 978c05d58cea9f96f2713667663af35b43905701 which passes references to the interrupt object everywhere instead of references to a std::atomic<bool>, and calls the object m_interrupt instead of shutdown_requested and passes it directly to *Manager constructors instead of requiring another field to be filled in the options structs.

Free to incorporate any of those changes here.

One thing that I didn’t think about enough is the last commit. I like the shutdown reason enum added there, but I don’t think it should be necessary for the kernel to call out to the application cancel it’s own execution if it can just call the cancellation object directly. So it might make sense to change that commit a little too in light of the first commit.

TheCharlatan commented at 11:26 am on June 2, 2023:

I don’t really understand the improvement in the context of achieving step 2 of the kernel library project between the solution you implemented now and what is implemented currently here in the first commit, besides it being a big step towards de-globalizing. One of my goals for this PR here is to stop the kernel relying on code that handles platform-dependent blocking while still being signal-aware (e.g. the bulk of what is currently in shutdown). I don’t see why such logic should exist in a standalone kernel library. Do you have a different perspective here?

I like the shutdown reason enum added there, but I don’t think it should be necessary for the kernel to call out to the application cancel it’s own execution if it can just call the cancellation object directly.

This suggestion sounds good to me, setting the interrupt flag/bool/condition within kernel code would greatly improve ergonomics.

ryanofsky commented at 12:57 pm on June 2, 2023:

One of my goals for this PR here is to stop the kernel relying on code that handles platform-dependent blocking while still being signal-aware (e.g. the bulk of what is currently in shutdown). I don’t see why such logic should exist in a standalone kernel library. Do you have a different perspective here?

Yes, I don’t think we should do this because it gets in the way of the kernel having a nice and future-proof API.

I think we both agree that applications using libbitcoinkernel should have a way to cancel long running kernel threads and operations from an outside thread or signal handler. My branch provides that functionality though a SignalInterrupt object that sits in the kernel::Context struct. By contrast, the current PR uses a std::atomic<bool> object that application needs to initialize itself and pass to the library.

I think the approach in the branch is better for two reasons:

Using an SignalInterrupt object instead of an atomic bool provides a more future-proof API. It is true that right now, all the code running in the kernel can just get away with polling a bool every once in a while to decide whether to continue executing. But we don’t know that this will be true in the future. If future kernel code wants to use a thread pool or some other asynchronous processing, there should be a way for outside code to actively send an interrupt and cancel that processing, not just passively set a bool and require the kernel to poll for it. Supporting active cancellation should not require a change to the kernel API when we can just define the API in a general way without reference to std::atomic<bool> right now.
The API in the branch is friendlier than the API in the PR. Adding the interrupt object to the kernel::Context struct is nicer than requiring the application to initialize the interrupt object itself. Not every application needs to cancel kernel operations (bitcoin-chainstate.cpp is one example) so requiring those applications to declare an atomic shutdown variable adds boilerplate and complexity. You could address this by moving the atomic bool into kernel::Context. But even with that change I still think SignalInterrupt interface is cleaner than the atomic bool interface.

I understand the approach in the PR makes kernel a littler smaller and lightweight, but to me overhead of depending on CThreadInterrupt, TokenPipe, SignalInterrupt, and unistd.h does not seem high enough to worry about. Also, whatever overhead there is here is an implemention detail of the kernel, and can be changed later. It’s not something that’s hardcoded into the kernel API.

TheCharlatan commented at 1:57 pm on June 2, 2023:

Thank you for your comprehensive answer. I think it boils down to: A fatal error should not necessarily have to trigger an interrupt for all kernel operations. The user of the library should decide this. A startShutdown should interrupt all operations in order to cleanly halt kernel execution.

I don’t think it should be necessary for the kernel to call out to the application cancel it’s own execution if it can just call the cancellation object directly.

Going back to this earlier remark again, you declared the m_interrupt member in your branch as const. Do you think it is a bad idea to mutate it from within the chainman to forego a call to StartShutdown?

ryanofsky commented at 3:39 pm on June 2, 2023:

Thank you for your comprehensive answer. I think it boils down to: A fatal error should not necessarily have to trigger an interrupt for all kernel operations.

Yes, this is definitely a goal. Libbitcoinkernel should try to be a library, not an application or application framework. It should have a convenient way of cancelling operations and not force everything to use one shutdown variable.

But both the current PR and my branch do already support this. The main difference between the PR and branch is that the PR uses a std::atomic<bool> object, while the branch uses a SignalInterrupt object.

Going back to this earlier remark again, you declared the m_interrupt member in your branch as const. Do you think it is a bad idea to mutate it from within the chainman to forego a call to StartShutdown?

It’s not necessarily bad, but it would be nice to avoid it. It would be cleaner if interrupt state cleanly represented whether a cancellation was requested or not, and wasn’t also used to handle the -stopatheight and -stopafterblockimport corner cases.

Because of this, I was thinking of suggesting dropping the last commit a6a3c3245303d05917c04460e71790e33241f3b5 from the current PR, and leaving the two StartShutdown() calls in place for now. I think a better approach than calling StartShutdown() or calling m_interrupt() in those cases would be to just return “finished” or “success” status codes. Also in both cases, it would be better if the kernel wasn’t deciding itself whether to return early, but instead asked the application through std::function callbacks. I.e. there could be a std::function callbacks for “I just a new chain tip, should I keep processing or return early?” that would be more general than having hardcoded -stopatheight logic in the kernel. The third “FailedConnectingBestBlock” shutdown case also seems like it should be returning a real error code to the application and calling fatalErrror, not just sending a shutdown notification.

TheCharlatan commented at 3:46 pm on June 2, 2023:

Also in both cases, it would be better if the kernel wasn’t deciding itself whether to return early, but instead asked the application through std::function callbacks. I.e. there could be a std::function callbacks for “I just a new chain tip, should I keep processing or return early?” that would be more general than having hardcoded -stopatheight logic in the kernel.

How about making startShutdown return a boolean then that can be interpreted by calling code to either terminate the function or continue?

The third “FailedConnectingBestBlock” shutdown case also seems like it should be returning a real error code to the application and calling fatalErrror, not just sending a shutdown notification.

Agreed.

ryanofsky commented at 4:11 pm on June 2, 2023:

How about making startShutdown return a boolean then that can be interpreted by calling code to either terminate the function or continue?

I think it’s best if the notification interfaces just streamed notifications in one direction to the application, to update its UI, or trigger tasks like starting other services, saving state, freeing up disk space, or shutting down.

The notifications interface seems like an awkward way to send information the other direction from the application to the kernel, because it requires implementing a subclass, and because as a user of the kernel library, I would expect to look in option structs, not in the notification interface if to help implement functionality like -stopatheight.

I think it would be great to get rid of the 3 StartShutdown calls in kernel code, but probably better to do in a separate PR that was independent of this one, since there’s lots of approaches that could be used and none of them seem to depend on the other changes here.

ryanofsky commented at 5:16 pm on July 7, 2023:

re: #27711 (review)

I think it would be great to get rid of the 3 StartShutdown calls in kernel code, but probably better to do in a separate PR that was independent of this one, since there’s lots of approaches that could be used and none of them seem to depend on the other changes here.

I implemented this idea in #28048, getting rid of the stop_after_block_import and stop_at_height kernel options and replacing them with more flexible blocks imported and block tips hooks.

I did change my mind about making the hooks std::function callbacks instead of kernel::Notification callbacks, despite this adding a little more overhead to the notifications interface. Since the existing blockTip notification was already being sent at the same place the StopAtHeight check was implemented, it seemed less confusing to just reuse the existing callback instead of adding another callback at the same place.

theuni commented at 7:25 pm on June 1, 2023: member

Changing FlushBlockFile and FlushUndoFile to return [[nodiscard]] bool, and changing calls to these functions either bubble up the error or have comments about why the error is safe to ignore in that particular context.

I learned while working on the bubble-up stuff that a type can be [[nodiscard]] too. You can see an example of it here to force bubble-up warnings.

Rather than returning bool, maybe return a custom [[nodiscard]] result type? That way if the caller wants to bubble-up the result, the use-checking bubbles up as well. Of course that starts breaking down as soon as something actually needs its return value, so maybe it’s not useful. Just throwing it out there because imo it’s neat :)

TheCharlatan commented at 9:19 pm on June 1, 2023: contributor

Re #27711 (comment)

The node’s KernelNotifications handler could treat this just like a fatal error to keep current behavior the same. Other applications using libbitcoinkernel might choose to handle the error in a different way for example by sending a notification or trying to free up disk space, rather than shutting down the service right away.

Is there really a qualitative difference between the errors? Instead of adding multiple functions, it feels like a better approach here might be to encode some common error reasons into an enum and let the user then decide how to handle them. I don’t understand why these abort calls were ignored so far (the code was introduced here).

ryanofsky commented at 9:50 pm on June 1, 2023: contributor

Is there really a qualitative difference between the errors?

I didn’t dig into the actual errors, but the qualitative difference here seems to be that the other AbortNode errors actually need to be fatal and should try to shut down to prevent getting into a worse state. But an error flushing state in memory to disk does not need to be fatal, as long the state in memory is kept and can be saved later. There are lots of ways an application could handle a failed flush notification, and I think it would be better to avoid using fatal error callback for any notification that does not have to be fatal.

The choice of whether to use separate methods or enums is basically just aesthetic. And I didn’t dig into whether these particular flush errors need to fatal or not. I think you can decide what to do here or whether to do anything at all. My reason for suggesting “flushFailed” as an alternative to “nodiscard bool” was that latter change might be more invasive and require changing more sensitive code.

util: Add SignalInterrupt class and use in shutdown.cpp

This change helps generalize shutdown code so an interrupt can be
provided to libbitcoinkernel callers. This may also be useful to
eventually de-globalize all of the shutdown code.

Co-authored-by: Russell Yanofsky <russ@yanofsky.org>
Co-authored-by: TheCharlatan <seb.kung@gmail.com>

8b3ac7d952

hebasto commented at 1:44 pm on June 5, 2023: member

Concept ACK.

hebasto commented at 2:46 pm on June 5, 2023: member

#27711 (review):

I think the approach in the branch is better for two reasons…

I agree with both of them.

kernel: Pass interrupt reference to chainman

This and the following commit seek to decouple the libbitcoinkernel
library from the shutdown code. As a library, it should it should have
its own flexible interrupt infrastructure without relying on node-wide
globals.

The commit takes the first step towards this goal by de-globalising
`ShutdownRequested` calls in kernel code.

Co-authored-by: Russell Yanofsky <russ@yanofsky.org>
Co-authored-by: TheCharlatan <seb.kung@gmail.com>

57a57bae00

blockstorage: Return on fatal flush errors

By returning if either `FlushUndoFile` or `FlushBlockFile` fail, we
ensure that no further write operations take place that may lead to an
inconsistent database when crashing.

Prior to this patch `FlushBlockFile` may have failed without returning
in `Chainstate::FlushStateToDisk`, leading to a potential write from
`WriteBlockIndexDB` that may refer to a block that is not fully
flushed to disk yet.

Functions that have either `FlushUndoFile` or `FlushBlockFile` in their
call stack, need to handle these extra abort cases properly. Since
`Chainstate::FlushStateToDisk` already produces an abort error in case
of `WriteBlockIndexDB` failing, no extra logic for functions calling
`Chainstate::FlushStateToDisk` is required.

Besides `Chainstate::FlushStateToDisk`, `FlushBlockFile` is also called
by `FindBlockPos`, which also already produces other potential abort
failures. FlushUndoFile is only called by `FlushBlockFile` and
`WriteUndoDataForBlock`. Both already produce their own abort errors, so
again no extra handling of these cases in calling code is required.

Add [[nodiscard]] annotations to these cases such that they are not
ignored in future.

60f5461e16

scripted-diff: Rename FatalError to FatalErrorf

This is done in preparation for the next commit where a new FatalError
function is introduced. FatalErrorf follows common convention to append
'f' for functions accepting format arguments.

-BEGIN VERIFY SCRIPT-
sed -i 's/FatalError/FatalErrorf/g' $( git grep -l 'FatalError')
-END VERIFY SCRIPT-

bdca3c79fb

kernel: Add fatalError method to notifications

FatalError replaces what previously was the AbortNode function in
shutdown.cpp.

This commit is part of the libbitcoinkernel project and removes the
shutdown's and, more generally, the kernel library's dependency on
interface_ui with a kernel notification method. By removing interface_ui
from the kernel library, its dependency on boost is reduced to just
boost::multi_index. At the same time it also takes a step towards
de-globalising the interrupt infrastructure.

Also add an interrupt switch in chainman and blockman. This simplifies
testing of validation methods that may normally trigger an interrupt on a
fatal error by leaving out the interrupt invocation.

Co-authored-by: Russell Yanofsky <russ@yanofsky.org>
Co-authored-by: TheCharlatan <seb.kung@gmail.com>

59f00d5fcd

blockstorage: Make failed to connect best block a FatalError

This is an error and we return on it, so treat it like one and don't
call the normal shutdown procedure.

79b549abee

kernel: Remove shutdown from kernel library

This decouples the libbitcoinkernel library from the shutdown routines.
As a library, it should it should have its own flexible interrupt
infrastructure without relying on node-wide globals.

This commit should achieve the final step towards this goal by replacing
calls to shutdown in kernel code with calls to interrupt. Since some
kernel library user might not want to interrupt, but still receive the
notification, add a bool to control whether an interrupt is triggered
upon reaching one of the trigger conditions.

While touching the lines around StartShutdown, it became apparent that
there was no logging indicating the reason for a call to StartShutdown.
Add an InterruptReason enum to the interrupt notification to communicate
this.

f13880e832

TheCharlatan force-pushed on Jun 7, 2023

TheCharlatan commented at 10:02 am on June 7, 2023: contributor

Thank you @theuni and @ryanofsky for the great discussion so far.

Updated a6a3c3245303d05917c04460e71790e33241f3b5 -> f13880e83200ec824d7528061f96708d96bf9bd4 (rmKernelShutdown_0 -> rmKernelShutdown_1, compare).

This update reworks the existing logic in this pull request by integrating @ryanofsky’s suggested branch.

Instead of getting rid of the shutdown logic in the kernel, the core logic is now retained, but de-globalized and wrapped into a more user-friendly and self-descriptive class (SignalInterrupt).

As an alternative to @ryanofsky’s suggestion for dropping the commits handling startShutdown I added boolean options that can be set by users of the kernel to indicate whether it should terminate if either a FatalError or Interrupt should trigger an actual interrupt.

Also adds two new commits, one for making a failure connecting the best block a FatalError and another for making FlushBlockFile and FlushUndoFile return on calls to FatalError. Rationale for each of these is provided in the commit messages.

These patches now went through quite some evolution in terms of their scope and intention. I am pushing this patch set to show how I’d currently unify the feeback received so far. In terms of next steps I’d like to lay out the following options:

Get the current patch set through review, completing the goal of removing the ui_interface and some of the boost headers from the kernel. Complete step 2 of this phase of the kernel project. Add another step in this phase of the kernel project for strictly typing any functions that may cause an interrupt. My current thinking is to use @ryanofsky’s #25665 extension of the Result type to return either a result or these fatal errors. Each instance of an interrupt call will be replaced piecemeal in separate PR’s to allow for clean review. I found that some of these cases get quite complicated, meaning simply returning on a fatal error may lead to bugs if surrounding code is not also adapted, and catching all nuances in a single, sprawling PR is probably too burdensome. Eventually this should lead to code where we’d only have to trigger a fatal interrupt at the edges of the library, and otherwise just push fatal errors to code using it.
Revert to a more minimal patch set, basically containing the new interrupt class + the FatalError wrapper to remove the ui_interface. Then proceed as in step 1.
Close this PR, then proceed as in step 1. Once all fatal interrupt cases are handled, circle back to the kernel interrupt handling for long-running functions and conditional interrupts (in other words those lines of code that currently call StartShutdown).

Option 1 is what seems pragmatic to me currently, option 2 would be closest to the goals as laid out in the kernel tracking issue (first cleave unwanted functionality / dependencies, then improve interfaces and fix weirdness discovered during development). Option 3 is where my heart is, even if it might block progress for some time and would depend on a bunch of other improvements to be completed first.

TheCharlatan renamed this:
~~kernel: Remove shutdown from kernel library~~
kernel: Remove shutdown globals from kernel library
on Jun 7, 2023

DrahtBot added the label CI failed on Jun 7, 2023

ryanofsky commented at 1:07 pm on June 7, 2023: contributor

Thanks for following up and incorporating the feedback!

On the topic of how to move forward with these changes, I think we should see if we can find a way to avoid making temporary changes to validation code that are not desirable for the long term. Specifically:

Having interrupt_on_fatal_error and interrupt_on_condition ChainStateManager options which are complicated (and very confusing even to me!)
Giving ChainStateManager a mutable reference to the interrupt object instead of a const reference, and using the mutable reference to have it send interrupt signals to itself.

If temporary kludges are necessary to avoid having a PR that is too big to review, maybe that’s a good tradeoff. But it would be better if we could see what the final code looks like before making that tradeoff. If this isn’t possible, I think we should at least be able to add comments / documentation in the code saying which things are temporary and how they should be used in the meantime.

If you want to take time to iterate more on this code more, I think we could easily pull out parts of this PR right now that are pure improvements and could be reviewed and merged quickly. If you took the first commit here 8b3ac7d95253ac9c17faee3df29f11144e79f8b4 and the last two commits a4034e719c0dfa07bc8c267af3c2139d3b353757 and bbfbcd0360e486d47025a412f2c0f4e8535ab255 that were previously part of #27636, I think we could quickly review and merge these in a separate PR as they are pure refactorings which do not change behavior or add temporary kludges to validation code.

After that PR or alongside it commits 60f5461e160c4ef9aa5ee8e34f70a6459d12cae5 and 79b549abee60d13edcf8fe6a4c899c1b0bd08f88 from this PR also seem like good candidates for splitting off and being in their own PR so they could get more careful review as they do change behavior subtly.

Great to see more work here, though. I think there are still some tricky details to work out but most of the changes seem very good.

DrahtBot added the label Needs rebase on Jun 9, 2023

DrahtBot commented at 8:50 pm on June 9, 2023: contributor

🐙 This pull request conflicts with the target branch and needs rebase.

TheCharlatan commented at 11:20 am on June 12, 2023: contributor

Re #27711 (comment)

If you want to take time to iterate more on this code more, I think we could easily pull out parts of this PR right now that are pure improvements and could be reviewed and merged quickly.

Ok, I will put this PR back to draft, and open two new ones as you suggested. I’ll continue using this draft to stage my further improvements.

TheCharlatan marked this as a draft on Jun 12, 2023

TheCharlatan commented at 6:45 pm on June 12, 2023: contributor

I opened #27861 to introduce the interrupt class and implement the fatalError notification. Also opened #27866 to start tackling proper error return types when a function calls abort.

ryanofsky referenced this in commit 75135c673e on Jul 6, 2023

TheCharlatan commented at 10:45 am on July 8, 2023: contributor

Closing this PR in favor of:

#28053 Handling shutdown with -stopafterblockimport
#28048 Handling shutdown with -stopatheight
#27866 Handling blockstorage flush errors
#28051 Getting rid of shutdown code

TheCharlatan closed this on Jul 8, 2023

ryanofsky referenced this in commit f562856d02 on Sep 29, 2023

Frank-GER referenced this in commit 53099fdff4 on Oct 5, 2023

bitcoin locked this on Jul 7, 2024

kernel: Remove shutdown globals from kernel library #27711

Reviews

Conflicts