persist IBD latch across restarts #21106

pull pstratem wants to merge 1 commit into bitcoin:master from pstratem:2021-02-07-isinitialblockdownload-timeout changing 2 files +16 −0
  1. pstratem commented at 9:57 pm on February 7, 2021: contributor

    In the unlikely event that all or most of the listening nodes are restarted into Initial Block Download mode, the network might get stuck with few or no listening nodes providing headers or blocks.

    The timeout is intended to move nodes out of IBD so they will request and serve blocks and headers to inbound peers.

    Edit: By latching IBD across restarts we prevent the network from getting into this state at all.

  2. pstratem force-pushed on Feb 7, 2021
  3. DrahtBot added the label Validation on Feb 7, 2021
  4. in src/validation.cpp:1310 in 22c27cca43 outdated
    1304@@ -1305,6 +1305,14 @@ bool CChainState::IsInitialBlockDownload() const
    1305         return false;
    1306     if (fImporting || fReindex)
    1307         return true;
    1308+    uint64_t ibdtimeout = gArgs.GetArg("-ibdtimeout", DEFAULT_IBD_TIMEOUT);
    1309+    int64_t uptime = GetTime() - GetStartupTime();
    1310+    assert(uptime >= 0);
    


    jnewbery commented at 10:46 am on February 8, 2021:

    This assumes that GetTime() is monotonic, which I don’t think is true. GetTime() is mockable so may move backwards in tests.

    It looks like GetTime() is also deprecated.
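
    (A minimal sketch, not the PR's code: measuring uptime against a monotonic clock sidesteps both mocktime and system-clock jumps, since std::chrono::steady_clock can never move backwards.)

        #include <chrono>
        #include <cstdint>

        // Captured once at startup; steady_clock is guaranteed monotonic,
        // so the difference computed below can never be negative.
        static const auto g_startup_steady{std::chrono::steady_clock::now()};

        int64_t UptimeSeconds()
        {
            using namespace std::chrono;
            return duration_cast<seconds>(steady_clock::now() - g_startup_steady).count();
        }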


    pstratem commented at 0:43 am on February 9, 2021:

    Nodes in IBD will not respond to GETHEADERS. Nodes in IBD do not request headers from inbound peers, only from outbound ones.

    If all or most of the listening nodes are in IBD (emergency hot fix or something) then nobody will be able to get headers.

    To get out of this the listening nodes need to request headers from inbound connections, either by exiting IBD or special casing that limit while in IBD.
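
    (For context, the behaviour described here comes from a guard in net_processing's GETHEADERS handling; around that time it looked roughly like the following, quoted from memory, so treat the exact form as approximate.)

        // GETHEADERS is ignored while the node is still in IBD, unless the
        // peer was granted download permission (e.g. via -whitelist).
        if (::ChainstateActive().IsInitialBlockDownload() && !pfrom.HasPermission(PF_DOWNLOAD)) {
            LogPrint(BCLog::NET, "Ignoring getheaders from peer=%d because node is in initial block download\n", pfrom.GetId());
            return;
        }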


  5. jnewbery commented at 10:50 am on February 8, 2021: member
    I think this is essentially saying that if the node can’t get out of IBD within two days, then just pretend we’ve completed IBD anyway. I don’t think that’s the right behaviour - if we’ve stalled and aren’t making any progress towards getting out of IBD, we’d be better off warning the user very loudly and then perhaps shutting down.
  6. MarcoFalke commented at 11:58 am on February 8, 2021: member

    all or most of the listening nodes are restarted into Initial Block Download mode

    Wouldn’t that mean that there hasn’t been a block in 24 hours? Then, waiting another two days for the next block doesn’t seem like a fallback that will be needed.

  7. pstratem commented at 0:45 am on February 9, 2021: contributor
    @MarcoFalke no it doesn’t, just that the listening nodes haven’t seen the block yet.
  8. pstratem force-pushed on Feb 9, 2021
  9. MarcoFalke commented at 7:08 am on February 9, 2021: member
    If the listening nodes haven’t seen a block from 24h ago, something is going wrong and needs human attention to manually fix up. Having an automated fallback that kicks in after two days doesn’t seem needed when manual intervention is required anyway.
  10. MarcoFalke commented at 7:04 am on February 10, 2021: member
    I fully agree that the node should try to recover (to avoid manual intervention as much as possible) when it’s both reasonable and safe to do so. Though in this particular case, the users that are eager enough to corrupt their datadir won’t be patient enough to wait three full days for this workaround to take effect. And similarly, the patient users won’t see the benefits either, because some eager users will have successfully poked their node on the first day of issues.
  11. in src/validation.cpp:1309 in 99283bf751 outdated
    1304@@ -1305,11 +1305,19 @@ bool CChainState::IsInitialBlockDownload() const
    1305         return false;
    1306     if (fImporting || fReindex)
    1307         return true;
    1308+    uint64_t ibdtimeout = gArgs.GetArg("-ibdtimeout", DEFAULT_IBD_TIMEOUT);
    1309+    int64_t uptime = GetTime() - GetStartupTime();
    


    jonatack commented at 9:25 am on February 10, 2021:

    The ibdtimeout and uptime localvars can be const.

    As @jnewbery mentions, we probably don’t want to use GetTime() here, see src/util/time.h.

    Also maybe consider how this interacts with setmocktime

    0    // For now, don't change mocktime if we're in the middle of validation, as
    1    // this could have an effect on mempool time-based eviction, as well as
    2    // IsCurrentForFeeEstimation() and IsInitialBlockDownload().
    3    // TODO: figure out the right way to synchronize around mocktime, and
    4    // ensure all call sites of GetTime() are accessing this safely.
    
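
    (A sketch of both suggestions applied; whether the templated chrono getter from src/util/time.h is the right replacement here is an assumption, and note that it is still mockable, so the mocktime caveat above stands.)

        const int64_t ibd_timeout = gArgs.GetArg("-ibdtimeout", DEFAULT_IBD_TIMEOUT);
        const int64_t uptime = GetTime<std::chrono::seconds>().count() - GetStartupTime();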
  12. in src/validation.h:56 in 99283bf751 outdated
    52@@ -53,6 +53,7 @@ struct DisconnectedBlockTransactions;
    53 struct PrecomputedTransactionData;
    54 struct LockPoints;
    55 
    56+static const unsigned int DEFAULT_IBD_TIMEOUT = 3600 * 48; // 48 Hours
    


    jonatack commented at 9:25 am on February 10, 2021:
    Can use constexpr and mention that the constant refers to a time duration in seconds.
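
    (A minimal sketch of that suggestion:)

        //! Time after startup, in seconds, at which a node still in IBD latches it to false.
        static constexpr unsigned int DEFAULT_IBD_TIMEOUT{48 * 60 * 60}; // 48 hours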
  13. jonatack commented at 10:04 am on February 10, 2021: member
    Interesting scenario. I wonder if latching IBD to false is a sufficiently fine-grained way to do this; maybe a separate timeout latch could be added, as you mention, “The listening nodes need to request headers from inbound connections, either by exiting IBD or special casing that limit while in IBD.” I’m not sure about the default -ibdtimeout value of 48 hours, which seems long, what other durations users would want to provide, or what guidance can be given to users. This probably needs a functional test if it goes forward.
  14. jonatack commented at 10:08 am on February 10, 2021: member

    the tricky part is being equally confident that the various attacks that the IBD specific behaviour is intended to protect against are still protected against – esp because no one reviewing this work may remember all of them

    Yes.

  15. MarcoFalke commented at 10:46 am on February 10, 2021: member

    If this were a boolean option (defaulting to false), it would be easier to review than this time-based threshold. Obviously, it would also lose the node-fixes-itself-after-three-days feature. Though, I am still sceptical of its use. It assumes that all of your connections (inbound, as well as block-feeler, outbound block-relay, …) are out of IBD for three days without a single one of them setting the flag manually.

    (Edit: I wrote this before the previous comment, so it is not a reply to that)

  16. pstratem commented at 6:51 pm on February 10, 2021: contributor

    @MarcoFalke In IBD we do not request headers from inbound peers. Literally only from a single outbound peer with a timeout.

    A setting to disable IBD for your own node might fix your node if your node is listening and receiving inbound connections from nodes which are sync’d with miners. Relying on that would be extremely fragile.

    Even if you were able to fix your own node, you can’t then heal the rest of the network without them connecting to you as the single initial sync peer, the odds of which are extremely low.

    The timeout prevents the network from being in this state for more than 48 hours with a bunch of nodes stuck in IBD.

    I’m assuming that the vast majority of listening nodes are not actively maintained and have no wallet, but are rather used as a kind of gateway, so being down for 48 hours is a nuisance to the rest of the network, but is probably not directly going to cause them to take action.

  17. pstratem commented at 7:23 pm on February 10, 2021: contributor
    @gmaxwell Yes just writing the latch to disk is probably the best solution.
  18. jonatack commented at 7:26 pm on February 10, 2021: member
    Yes, if we can avoid using time that would be a real win. Easier to test too.
  19. pstratem force-pushed on Feb 10, 2021
  20. pstratem force-pushed on Feb 10, 2021
  21. in src/validation.cpp:1328 in cf742ecc44 outdated
    1322@@ -1312,6 +1323,9 @@ bool CChainState::IsInitialBlockDownload() const
    1323     if (m_chain.Tip()->GetBlockTime() < (GetTime() - nMaxTipAge))
    1324         return true;
    1325     LogPrintf("Leaving InitialBlockDownload (latching to false)\n");
    1326+    f = fsbridge::fopen(latch_filename, "wb");
    1327+    if (!f) LogPrintf("Failed to write %s \n", latch_filename.string().c_str());
    1328+    else fclose(f);
    


    jonatack commented at 9:19 pm on February 10, 2021:

    I think we now use bracket syntax for the conditional (after a famous Apple bug iirc)

    0-    if (!f) LogPrintf("Failed to write %s \n", latch_filename.c_str());
    1-    else fclose(f);
    2+    if (f) {
    3+        fclose(f);
    4+    } else {
    5+        LogPrintf("Failed to write %s\n", latch_filename.c_str());
    6+    }
    

    pstratem commented at 9:32 pm on February 10, 2021:

    The coding style guidelines don’t agree, but I will expand it to new lines to make it easier to read.

    (Indeed ironically I argued for using braces everywhere, but we don’t).

    https://github.com/bitcoin/bitcoin/blob/master/doc/developer-notes.md#coding-style-c


    jonatack commented at 9:43 pm on February 10, 2021:

    The developer notes state:

    0  - If an `if` only has a single-statement `then`-clause, it can appear
    1    on the same line as the `if`, without braces. In every other case,
    2    braces are required, and the `then` and `else` clauses must appear
    3    correctly indented on a new line.
    

    jonatack commented at 9:45 pm on February 10, 2021:
    So if I read it correctly, without the else it would be fine, but with the else, braces are required (at any rate, that is what I’ve been doing).
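
    (A tiny self-contained illustration of the rule as read above:)

        #include <cstdio>

        void StyleExample(bool ok)
        {
            // Single-statement then-clause with no else: may share the if's line.
            if (!ok) std::puts("no else, so braces are optional");

            // Once an else is present, both branches need braces and new lines.
            if (ok) {
                std::puts("then-branch");
            } else {
                std::puts("else-branch");
            }
        }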
  22. pstratem force-pushed on Feb 10, 2021
  23. pstratem force-pushed on Feb 10, 2021
  24. pstratem force-pushed on Feb 10, 2021
  25. pstratem force-pushed on Feb 10, 2021
  26. pstratem force-pushed on Feb 10, 2021
  27. net: latch IBD across restarts
    this prevents a network wide failure in the event that all or most listening
    nodes are restarted at approximately the same time
    a3e713ace8
  28. pstratem force-pushed on Feb 11, 2021
  29. pstratem commented at 3:16 am on February 11, 2021: contributor
    The CI failures are all random zmq issues.
  30. decryp2kanon commented at 6:27 pm on February 11, 2021: contributor
    Concept ACK
  31. pstratem renamed this:
    add timeout to initial block download
    persist IBD latch across restarts
    on Feb 12, 2021
  32. in src/validation.cpp:1298 in a3e713ace8
    1294@@ -1294,15 +1295,23 @@ void CChainState::InitCoinsCache(size_t cache_size_bytes)
    1295 // `const` so that `CValidationInterface` clients (which are given a `const CChainState*`)
    1296 // can call it.
    1297 //
    1298+// The file .ibd_complete is used to latch the datadir to IBD false across restarts
    


    jonatack commented at 4:53 pm on February 13, 2021:
    If this moves forward, noting that a file description should be added to doc/files.md (alternatively, perhaps persist the latch in settings.json, which is also chain-specific).

    pstratem commented at 7:35 pm on February 13, 2021:
    I did this as a separate file so that users can easily add the file to override; with GetDataDir() it should be chain-specific, no?

    jonatack commented at 7:50 pm on February 13, 2021:
    Yes, I think they should both be.

    luke-jr commented at 1:45 am on February 24, 2021:

    Users can easily override settings.json as well - they’re just interpreted as command-line options.

    So something like --ibd_complete=0/1 would work

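    (An illustrative sketch of the marker-file approach under discussion, written against plain std::filesystem rather than Core's fs wrapper; the .ibd_complete name comes from the PR, everything else here is hypothetical.)

        #include <filesystem>
        #include <fstream>

        namespace fs = std::filesystem;

        // The latch is an empty, chain-specific file: its presence means "this
        // datadir has completed IBD before". Users can create or delete it by
        // hand to override the node's view.
        bool LatchExists(const fs::path& datadir)
        {
            return fs::exists(datadir / ".ibd_complete");
        }

        bool WriteLatch(const fs::path& datadir)
        {
            std::ofstream f{datadir / ".ibd_complete"}; // creates an empty file
            return f.good();
        }
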
  33. pstratem commented at 5:10 am on February 22, 2021: contributor
    Is there anything I need to do here?
  34. jnewbery commented at 9:37 am on February 22, 2021: member

    Is there anything I need to do here?

    Convince other contributors that this is a good idea. Currently, if a node is switched off and then switched back on some time later, it’ll start up in IBD until it catches up to the tip. This seems like good behaviour - we don’t want to download transactions, relay our address, respond to getheaders, etc until we’ve caught back up. If you want other contributors to ACK this PR, you’ll need to convince them that changing that behaviour is an improvement, and does not introduce any regressions/vulnerabilities/etc.

  35. in src/validation.cpp:1313 in a3e713ace8
    1308     if (m_cached_finished_ibd.load(std::memory_order_relaxed))
    1309         return false;
    1310+    if (fs::exists(latch_filename)) {
    1311+        LogPrintf("Leaving InitialBlockDownload (latching to false)\n");
    1312+        m_cached_finished_ibd.store(true, std::memory_order_relaxed);
    1313+        return false;
    


    MarcoFalke commented at 9:40 am on February 22, 2021:
    This will set IBD to false when -reindex? If yes, that seems wrong

    pstratem commented at 0:04 am on February 28, 2021:
    yes, it is wrong
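
    (A sketch of the reordering this exchange implies, reusing the names from the PR's diff; a fragment, not a full function: consult the on-disk latch only after the import/reindex guard.)

        if (m_cached_finished_ibd.load(std::memory_order_relaxed))
            return false;
        if (fImporting || fReindex)
            return true; // -reindex keeps reporting IBD even if the latch file exists
        if (fs::exists(latch_filename)) {
            LogPrintf("Leaving InitialBlockDownload (latching to false)\n");
            m_cached_finished_ibd.store(true, std::memory_order_relaxed);
            return false;
        }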
  36. sipa commented at 9:10 pm on February 22, 2021: member

    TL;DR: I think we should just drop the “don’t respond to GETHEADERS while in IBD” behavior, rather than keeping the IsInitialBlockDownload() latch across restarts. Possibly we can also improve the find-another-peer-to-sync-headers-from logic (and, if that’s the real problem, this PR doesn’t currently address it).

    These are the things that IsInitialBlockDownload() affects (there are more, but these are probably the most important ones).

    1. We don’t respond to GETHEADERS while in IBD.
    2. We don’t announce our local address while in IBD.
    3. The getblocktemplate RPC doesn’t work while in IBD.
    4. We don’t request transactions in IBD.
    5. Various compact block related things are disabled in IBD.

    There is also a related behavior that is not controlled by IsInitialBlockDownload(), but by pindexBestHeader->GetBlockTime() > GetAdjustedTime() - 24 * 60 * 60 instead:

    6. Only request headers from one peer at a time (however, if it times out, another peer will be selected).

    I believe it is mostly (1) and (6) we’re concerned about here?

    I don’t actually know why (1) exists in the first place. Perhaps it’s trying to minimize time to be synced with the network first before providing services to others? I think that only has a minimal effect, especially when combined with (2). It seems to me that just removing (1) would address most of the concern here.

    It seems possible to improve on (6) as well, but it’s fundamentally a tradeoff between bandwidth and (high probability of) fast header syncing. The headers sync timeout is ~15 minutes though, which can be annoying, but is probably short enough that (6) doesn’t actually matter for the scenario under consideration here.

    A way to improve (6) (especially useful when combined with removing (1)) would be to treat a “less than 2000 headers received at once” while the last one doesn’t get us out of IBD as “done syncing with this peer, pick another one”.

    These avoid the need to permanently latch to IBD, which seems like a very heavy hammer. Especially the transaction (4) and compact block related behaviours (5) we probably want to keep for nodes that were once out of IBD, but have been offline for a substantial period of time.

    (3) does look fairly scary if nodes can actually get stuck in IBD - but it’s also a sign of a deeper problem. I’m not sure if that warrants addressing beyond just trying to make sure we don’t get stuck in IBD.

  37. sipa commented at 2:00 am on February 27, 2021: member

    @pstratem So I suspect that what happened in $ALTCOIN is that the headers fetching logic is approximately doing this:

    • until the tip header is less than a day old, we pick a single peer to fetch headers from. If we have at least one outbound peer, only outbounds are considered.
    • headers fetching times out after 15 minutes, and after that a new peer is picked (but, in a normal situation, again an outbound one, and it’ll never actually get to asking an inbound peer)

    I don’t believe the current PR actually addresses this, because none of the logic above is dependent on IsInitialBlockDownload. If actually all reachable nodes on the network are in “headers 24 hours behind” state, then making them IsIBD()==false would make them respond to headers, but never enough to get close enough to $NOW to get the network unstuck.

    Here are a couple alternative ideas (not all necessary, but probably most powerful if all 3 are done):

    • When headers fetch reaches timeout, instead of disconnecting a peer, it is marked “failed headers sync”, and another peer is tried. Perhaps such “failed headers sync” peers can be preferred for eviction/replacement.
    • If we can’t respond to a GETHEADERS request (for whatever reason), just echo the reported tip back to the requester in a HEADERS. That has no they-might-ban-us risk, because that tip is clearly valid to them, and can be used as a “done sending you headers” signal (see next point).
    • If a HEADERS message arrives from the headers sync peer with less than MAX_HEADERS entries, which doesn’t get us within 24h of $NOW, treat that as a “failed headers sync” as well. That means that headers sync peer cycling can go almost instantly, rather than needing 15-minute timeouts.
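
    (A hypothetical sketch of the third idea, with all names illustrative rather than Core code: a short HEADERS batch that still leaves the best header more than a day old means this peer has nothing more, so rotate immediately instead of waiting out the timeout.)

        #include <cstddef>
        #include <cstdint>

        bool ShouldRotateHeadersSyncPeer(std::size_t batch_size, int64_t best_header_time, int64_t now)
        {
            constexpr std::size_t MAX_HEADERS_RESULTS{2000}; // max headers per HEADERS message
            constexpr int64_t ONE_DAY{24 * 60 * 60};
            return batch_size < MAX_HEADERS_RESULTS && best_header_time < now - ONE_DAY;
        }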
  38. fanquake deleted a comment on Feb 28, 2021
  39. fanquake deleted a comment on Feb 28, 2021
  40. sdaftuar commented at 5:45 pm on March 4, 2021: member

    @sipa Originally the logic was added to not respond to GETHEADERS while in IBD because we relaxed the checkpoint logic, so that we could accept a non-checkpointed chain during initial sync. If a node was in IBD and fed a bogus initial headers chain, then a peer that is on the checkpointed chain would ban us if we gave them a checkpoint-violating header.

    Since then, our logic has changed a lot, both around banning and with the addition of minimum-chain-work. An easy improvement would be to respond to getheaders messages as long as our tip is past the minimum chain work that we’re configured with.

    @pstratem My understanding of block relay is that the scenario that I think you have in mind should not really exist, but perhaps I’m not understanding the scenario. Let’s say the whole network is in IBD for some reason, because no blocks have been found in several days and then everyone restarts their software. At this point, initial headers sync with every peer is timing out every 20 minutes or whatever, causing some peer churn (this all sounds bad, but not yet disastrous).

    (1) Then a miner finds a block and submits it to their node. That node is not a listening node, but (2) with the new block it will leave IBD. It will also (3) announce that block (via INV) to all its peers, which will in turn immediately (4) send a getheaders to the node, which it will (5) respond to because it’s out of IBD.

    That will (6) trigger block relay to all those peers via a getdata, which in turn will cause those nodes to (7) leave IBD and then (8) announce the block to their peers. Things should progress fine from there, in the same way, I think – what do you think is breaking down exactly?

    I guess (1) could break down because miners have to manually cause their own node to leave IBD in order for getblocktemplate to work, via the -maxtipage option. But if you assume miners will quickly figure that out to not waste their hashpower, I think everything should work after that.

  41. sdaftuar commented at 6:33 pm on March 4, 2021: member

    That will (6) trigger block relay to all those peers via a getdata, which in turn will cause those nodes to (7) leave IBD and then (8) announce the block to their peers. Things should progress fine from there, in the same way, I think – what do you think is breaking down exactly?

    Aha, on further investigation, I think (6) is where things break down: when a node is in IBD because its current tip is too old, if we learn of a new block header from an inbound peer, we will not fetch the block when we get the header via direct-fetch (because CanDirectFetch will return false due to our old tip), and we won’t fetch the block through parallel block fetch (via FindNextBlocksToDownload) if we’re in IBD and we have any outbound peers. I was able to replicate this behavior in a simple regtest setting.

    My recollection is that we don’t want to do full-on IBD from inbound peers because users running bitcoind at home might not want to be used for some outbound peer’s initial sync. Assuming we think that is still a reasonable concern, a simple fix might be to try fetching blocks via FindNextBlocksToDownload from inbound peers if we have no blocks in flight, as presumably that means our outbound peers have nothing for us, so we might as well treat that scenario the same as if we have no outbounds at all.

    (To be clear, I don’t believe that the headers-sync behaviors with inbound peers are really a problem, because if an inbound peer is the only one to be finding new blocks, we will get their headers via our inv-processing behavior. I think this is just a concern around block fetching.)
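
    (A hypothetical sketch of that fix, names illustrative, not Core code: treat “outbounds exist but nothing is in flight” the same as “no outbounds at all” when deciding whether block download may use inbound peers.)

        bool ConsiderInboundForBlockDownload(int num_outbound_peers, int num_blocks_in_flight)
        {
            // With no outbound peers we already fall back to inbounds; extend
            // that to the case where outbounds are apparently giving us nothing.
            return num_outbound_peers == 0 || num_blocks_in_flight == 0;
        }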

  42. JeremyRubin commented at 4:51 am on May 24, 2021: contributor

    Noting that I was operating a signet for a while which crashed, and by the time I rebooted it (life happens) it was a few months later. Now after restarting, I think the IBD latch is keeping my seed/mining node in IBD since it’s in the past, even though I’m producing new blocks. This also prevents this node from relaying the new blocks to peers.

    This patch might fix this (pathological) case.

    h/t @ajtowns for pointing this out.

  43. MarcoFalke added the label Waiting for author on May 24, 2021
  44. MarcoFalke commented at 6:59 am on January 27, 2022: member
    Can this be closed now that #24171 was opened?
  45. MarcoFalke commented at 7:56 am on January 27, 2022: member

    Now after restarting, I think the IBD latch is keeping my seed/mining node in IBD since it’s in the past, even though I’m producing new blocks.

    Staying in IBD while producing a block with a recent enough timestamp shouldn’t be possible. It may only happen that getblocktemplate refuses to give you a reply due to IBD, but since you produced a block this is probably not something you ran into. (If you were, you could use -maxtipage)

  46. MarcoFalke commented at 1:05 pm on January 27, 2022: member

    Going to close this for now, since there hasn’t been any activity by the author for about a year after the author acknowledged that the current patch has a bug (https://github.com/bitcoin/bitcoin/pull/21106#discussion_r584207780).

    Discussion can continue and this can be reopened any time.

  47. MarcoFalke closed this on Jan 27, 2022

  48. ajtowns commented at 1:11 pm on January 27, 2022: member

    Staying in IBD while producing a block with a recent enough timestamp shouldn’t be possible.

    Once started, the signet miner defaults to calculating timestamps as “previous block time + 10 minutes” with optional randomisation, so if the node was down for weeks or more it could take a while to catch up with realtime and hence leave the miner’s node in IBD.

  49. MarcoFalke commented at 1:59 pm on January 27, 2022: member

    Ok, I see. So this pull could indeed make sense for signet. Or, as an alternative, the signet miner could modify the mine script to mine a block with a more recent timestamp, or somehow mine the “missing” blocks faster if it is important to have them.

    For mainnet, my understanding is that #24171 will be sufficient to address the issue, as explained in #21106 (comment) and the following comment.

  50. ajtowns commented at 2:06 pm on January 27, 2022: member

    Ok, I see. So this pull could indeed make sense for signet. Or, as an alternative, the signet miner could modify the mine script to mine a block with a more recent timestamp, or somehow mine the “missing” blocks faster if it is important to have them.

    It’s just a default: you can run the miner with --set-block-time=-1 --max-blocks=1 to mine a single block at the current time, then restart with --ongoing, which will continue from there. (Not sure if that had been implemented when Jeremy was having problems.) So I don’t think it’s too important to optimise for that scenario.

  51. DrahtBot locked this on Jan 27, 2023
