Increase init file stop timeout

setpill commented at 11:48 am on August 8, 2019: contributor

bitcoind can take a long time to flush its db cache to disk upon shutdown. Systemd sends a SIGKILL after a timeout, causing unclean shutdowns and triggering a long “Rolling forward” at the next startup. Disabling the timeout should prevent this from happening, and does not break systemd’s restart logic.

Addresses #13736.

fanquake added the label Scripts and tools on Aug 8, 2019

MarcoFalke assigned dongcarl on Aug 8, 2019

DrahtBot commented at 12:52 pm on August 8, 2019: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#15268 (doc: suggest using timeoutstopsec in systemd file during IBD by d3spwn)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

dongcarl commented at 3:55 pm on August 8, 2019: member

Documentation for TimeoutStopSec=: https://www.freedesktop.org/software/systemd/man/systemd.service.html#TimeoutStopSec=

Relevant quotations:

> …it configures the time to wait for the service itself to stop. If it doesn’t terminate in the specified time, it will be forcibly terminated by SIGKILL (see KillMode= in systemd.kill(5)).

> Takes a unit-less value in seconds, or a time span value such as “5min 20s”. Pass “infinity” to disable the timeout logic. Defaults to DefaultTimeoutStopSec= from the manager configuration file (see systemd-system.conf(5)).

I believe it’s warranted for us to set our own TimeoutStopSec= instead of relying on the fallback to DefaultTimeoutStopSec= (which has a default of 90 seconds). However, infinity is perhaps overkill. I believe 10 minutes is probably enough for most systems on which we expect Bitcoin Core to operate healthily, but that’s just my speculation and others can/should chime in.
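For illustration, a minimal sketch of what an explicit stop timeout could look like in a systemd unit file. The 600-second value mirrors the 10 minutes suggested above; the rest of the unit is omitted, and this fragment is an assumption-laden example rather than the exact contents of the repository's service file.

```ini
[Service]
# Wait up to 10 minutes for a clean shutdown before systemd sends SIGKILL.
# A unit-less value is interpreted as seconds; a time span such as "10min"
# is also accepted, and "infinity" disables the timeout entirely.
TimeoutStopSec=600
```

Without this line, the unit falls back to DefaultTimeoutStopSec= from the manager configuration.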

setpill commented at 12:10 pm on August 9, 2019: contributor

I have recently encountered a situation on a system with decent specs (4 cores, 16 GB ram, 15 GB db cache, blocksdir on HDD, datadir on SSD) where the clean shutdown took almost 9 minutes. 10 minutes seems like it would be too close for comfort, especially considering the myriad of lower-end hardware that might run bitcoind. I have not attempted to analyse the bounds, but in a situation where the shutdown would hang truly indefinitely manual intervention would be required anyway, I assume, to inspect the error before continuing to use the software.

Is there a potential scenario in which the timeout of infinity would function as a regression compared to the status quo?

instagibbs commented at 1:10 pm on August 12, 2019: member

@setpill if the shutdown gets wedged somehow, would infinity cause this to never recover? Unclean shutdown at some point may be preferable.

setpill commented at 9:51 am on August 16, 2019: contributor

@instagibbs It is true that it will never automatically recover if it gets wedged; but, like I said, if that happens one probably wants to inspect manually what is going on and whether the node is still safe to use, and perhaps file a bug report.

The systemd service file can also be overridden by admins to set a shutdown timeout in case it is needed. Imho the default one provided with the program should err on the side of caution.
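As a sketch of the override mentioned above, an admin could place a drop-in file next to the packaged unit instead of editing it. The unit name `bitcoind.service`, the drop-in path, and the 30-minute value are assumptions chosen for illustration, not something specified in this thread.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/bitcoind.service.d/override.conf
# (assumes the unit is installed as bitcoind.service)
[Service]
# Local policy: allow up to 30 minutes for shutdown on this machine.
# Use "infinity" here to disable the stop timeout altogether.
TimeoutStopSec=30min
```

Running `systemctl edit bitcoind.service` creates such a drop-in and reloads the configuration; if the file is created by hand, `systemctl daemon-reload` is needed before the new timeout takes effect.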

luke-jr commented at 6:59 pm on August 23, 2019: member

> I have recently encountered a situation on a system with decent specs (4 cores, 16 GB ram, 15 GB db cache, blocksdir on HDD, datadir on SSD) where the clean shutdown took almost 9 minutes.

What caused this? It certainly doesn’t look normal.

IMO some timeout should be kept.

> if that happens one probably wants to inspect manually what is going on

Not possible during shutdown…?

dongcarl commented at 5:43 pm on September 5, 2019: member

So perhaps what @setpill is saying is that for those who enabled the systemd debug shell (sudo systemctl enable --now debug-shell.service), even when the service is stuck, they can debug using a root shell on TTY9?

setpill commented at 1:33 pm on September 6, 2019: contributor

My opinion is that if Bitcoin Core truly does hang in a way that does not recover automatically, it should not just be killed and restarted. It’s (imho) better to require manual intervention and let whoever runs the node decide whether it is still ok to trust that node.

But that is my preference; at the very least I think the current timeout should be reconsidered, as it appears to be far too short for common use cases, and (especially during IBD) waiting for a clean shutdown saves a lot of time compared to the long “rolling forward” sequence triggered by an unclean shutdown.

instagibbs commented at 1:40 pm on September 6, 2019: member

> at the very least I think the current timeout should be reconsidered

I think everyone is in agreement on this point. Mind compromising and just picking a “fairly long” shutdown period of say 10 minutes as @dongcarl suggests?

setpill commented at 3:02 pm on September 6, 2019: contributor

I suppose a 15 GB dbcache is a somewhat exceptional situation; perhaps 10 minutes is a decent default. I’m also looking into implementing it in the other init files for consistency’s sake.

Set init stop timeout to 10 min

`bitcoind` can take a long time to flush its db cache to disk upon
shutdown. Most init files send a `SIGKILL` after a timeout of 1 minute,
causing unclean shutdowns and triggering a long "Rolling forward" at the
next startup. Increasing this timeout to 10 minutes should reduce how
often this occurs, especially during IBD.

fixup! Set ProtectHome in systemd service file

7fb7acfc20

setpill force-pushed on Sep 6, 2019

setpill renamed this:
~~Disable systemd stop timeout~~
Increase init file stop timeout
on Sep 6, 2019

setpill commented at 3:08 pm on September 6, 2019: contributor

NB: the init files other than systemd are untested, especially the CentOS bitcoind.init, since that change is not merely an increase of an existing parameter but the setting of a new one.

instagibbs commented at 3:12 pm on September 6, 2019: member

utACK https://github.com/bitcoin/bitcoin/pull/16569/commits/7fb7acfc206b4bf8c296d72b66f3bd4fe342fd87

(~no experience in non systemd stuff fwiw)

laanwj commented at 9:37 am on October 8, 2019: member

Agree on increasing this timeout. Better be safe than sorry. It is frustrating to lose your entire sync state (with a high dbcache) on shutdown.

laanwj referenced this in commit 99cebc922c on Oct 8, 2019

laanwj merged this on Oct 8, 2019

laanwj closed this on Oct 8, 2019

sidhujag referenced this in commit a46e4acabc on Oct 8, 2019

MarkLTZ referenced this in commit 64a2993987 on Nov 17, 2019

deadalnix referenced this in commit 161dd286ff on Aug 24, 2021

DrahtBot locked this on Dec 16, 2021
