Stuck in Endless Pre-Syncing Headers Loop

da2ce7 commented at 8:26 am on October 26, 2022: none

Expected behavior

Pre-Sync Headers Like Normal.

Actual behavior

2022-10-26T07:34:14Z Pre-synchronizing blockheaders, height: 748853 (~98.55%)
2022-10-26T07:34:16Z Pre-synchronizing blockheaders, height: 226853 (~31.00%)

Unaffected by restarting the program.

2022-10-26T07:34:14Z Pre-synchronizing blockheaders, height: 748853 (~98.55%)
2022-10-26T07:34:16Z Pre-synchronizing blockheaders, height: 226853 (~31.00%)

To reproduce

I think that this is a one-off sort of error:

Here is a backup of my .bitcoin folder. https://drive.proton.me/urls/V04QAGG998#GlCnfHpkWW7F

~~The files before blk00047.dat and rev00047.dat are omitted, and need to be copied in from another source.~~ As sipa says, the blk files are not deterministic. - Will upload the full .bitcoin folder so people can reproduce…

Here is the 4.6 GB full backup of my .bitcoin folder: https://drive.proton.me/urls/JA11NDEA14#GeG83qrpmvtt

System information

Fedora 37 Silverblue Running in Toolbox.

Bitcoin Core: 28cf75697186ea8e473e120a643994bdf8237d6c

da2ce7 added the label Bug on Oct 26, 2022

maflcko commented at 8:34 am on October 26, 2022: member

This may happen when one of the blocks in the main chain is marked invalid (for example due to corruption)

fanquake renamed this:
~~Stuck in Endless Pre-Syncing Headders Loop~~
Stuck in Endless Pre-Syncing Headers Loop
on Oct 26, 2022

maflcko added the label P2P on Oct 26, 2022

kouloumos commented at 9:13 am on October 26, 2022: contributor

Your debug.log shows that before seeing this behavior you were doing IBD up until height=224854, where this happened, which I think matches what MarcoFalke said.

2022-10-26T07:24:30Z UpdateTip: new best=00000000000000cd7d1c3d5137423c00e6a221d5492ace06d8fb9d990f2d7c96 height=224854 version=0x00000002 log2_work=69.513761 tx=14063312 date='2013-03-08T15:46:54Z' progress=0.018147 cache=356.2MiB(2733513txo)
2022-10-26T07:24:30Z ERROR: ConnectBlock: Consensus::CheckTxInputs: 878d6685666400b75a1947ccfc676249ecdf52678b2dc0d83e0328f8c24a951a, bad-txns-inputs-missingorspent, CheckTxInputs: inputs missing/spent
2022-10-26T07:24:30Z InvalidChainFound: invalid block=000000000000032021a6d18011d202df36cf07822a657b47390ab90568bb14e2  height=224855  log2_work=69.513793  date=2013-03-08T15:58:52Z
2022-10-26T07:24:30Z InvalidChainFound:  current best=00000000000000cd7d1c3d5137423c00e6a221d5492ace06d8fb9d990f2d7c96  height=224854  log2_work=69.513761  date=2013-03-08T15:46:54Z
2022-10-26T07:24:30Z ERROR: ConnectTip: ConnectBlock 000000000000032021a6d18011d202df36cf07822a657b47390ab90568bb14e2 failed, bad-txns-inputs-missingorspent, CheckTxInputs: inputs missing/spent

On the next run it started the pre-sync phase from that height, and that’s the point it restarts every time.
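The restart point can also be spotted mechanically, because the reported height drops between two consecutive pre-sync lines. Below is a minimal sketch that counts such drops in a list of debug.log lines. It assumes the log format shown in this issue; the function names `ParsePresyncHeight` and `CountPresyncRestarts` are made up for illustration and are not part of Bitcoin Core.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Extract N from a "Pre-synchronizing blockheaders, height: N (~P%)" log line.
// Returns std::nullopt for any line that does not contain that message.
std::optional<int> ParsePresyncHeight(const std::string& line)
{
    static const std::string marker{"Pre-synchronizing blockheaders, height: "};
    const auto pos = line.find(marker);
    if (pos == std::string::npos) return std::nullopt;
    std::size_t i = pos + marker.size();
    if (i >= line.size() || line[i] < '0' || line[i] > '9') return std::nullopt;
    int height = 0;
    while (i < line.size() && line[i] >= '0' && line[i] <= '9') {
        height = height * 10 + (line[i] - '0');
        ++i;
    }
    return height;
}

// Count how often the pre-sync height goes down from one pre-sync line to the
// next, i.e. how often pre-sync started over. Other log lines are ignored.
int CountPresyncRestarts(const std::vector<std::string>& lines)
{
    int restarts = 0;
    std::optional<int> prev;
    for (const auto& line : lines) {
        const auto height = ParsePresyncHeight(line);
        if (!height) continue;
        if (prev && *height < *prev) ++restarts;
        prev = height;
    }
    return restarts;
}
```

A single drop can be harmless (pre-sync may legitimately start again with another peer), so a count that keeps growing while the chain tip does not move is the signal to look for.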

I’ve tried to reproduce using the backup of your directory, but I couldn’t. Probably because using it requires a -reindex.

maflcko commented at 9:35 am on October 26, 2022: member

Steps to reproduce (with a diff to force corruption):

diff --git a/src/validation.cpp b/src/validation.cpp
index 37e68cfe4a..811ff2f9eb 100644
--- a/src/validation.cpp
+++ b/src/validation.cpp
@@ -2201,7 +2201,7 @@ bool Chainstate::ConnectBlock(const CBlock& block, BlockValidationState& state,
         {
             CAmount txfee = 0;
             TxValidationState tx_state;
-            if (!Consensus::CheckTxInputs(tx, tx_state, view, pindex->nHeight, txfee)) {
+            if (Consensus::CheckTxInputs(tx, tx_state, view, pindex->nHeight, txfee)) {
                 // Any transaction validation failure in ConnectBlock is a block consensus failure
                 state.Invalid(BlockValidationResult::BLOCK_CONSENSUS,
                             tx_state.GetRejectReason(), tx_state.GetDebugMessage());

then call ./src/qt/bitcoin-qt -datadir=/tmp -signet -printtoconsole=1

fanquake commented at 10:16 am on October 26, 2022: member

cc @sdaftuar @sipa @dergoegge

da2ce7 commented at 4:32 pm on October 26, 2022: none

@kouloumos

I’ve tried to reproduce using the backup of your directory, but I couldn’t. Probably because using it requires a -reindex.

The files before blk00047.dat and rev00047.dat are omitted, and need to be copied in from another source.

sipa commented at 4:49 pm on October 26, 2022: member

@da2ce7 The contents of those files are not deterministic, as they depend on the order you received blocks in. It’s not necessarily possible for someone to reconstruct your state without those files (as some blocks may fall before/after the cutoff differently).

sipa commented at 5:22 pm on October 26, 2022: member

Discussed this a bit with @sdaftuar.

What’s going on here is actually expected: your node believes that the chain other nodes are offering it is invalid, thus it’s correct behavior that it doesn’t actually manage to synchronize and accept that chain. This invalidity is only detected during the headers sync phase, and not during the new pre-sync phase that precedes it. The result is that your node goes through peers one by one, attempting to synchronize headers with them, by performing a full pre-sync phase, and only after that completes noticing they’re giving us a known invalid chain.
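To get a feel for what this loop costs, here is a back-of-the-envelope sketch. It assumes 81 bytes per header on the wire (the 80-byte block header plus the one-byte transaction count carried in a `headers` message) and ignores message framing and the later redownload phase, so it is only a rough lower bound. The function name and the peer count are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per entry in a P2P `headers` message:
// 80-byte block header + 1-byte transaction count (always 0).
constexpr std::uint64_t HEADER_WIRE_BYTES{81};

// Rough lower bound on the bytes downloaded when `headers_per_attempt` headers
// are pre-synced from each of `peers_tried` peers and every attempt is discarded.
constexpr std::uint64_t WastedPresyncBytes(std::uint64_t headers_per_attempt,
                                           std::uint64_t peers_tried)
{
    return headers_per_attempt * HEADER_WIRE_BYTES * peers_tried;
}
```

For the node in this issue, pre-sync runs from about height 225,000 to about 750,000, roughly 525,000 headers per attempt. `WastedPresyncBytes(525000, 1)` gives 42,525,000 bytes, so about 42.5 MB is thrown away per peer.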

A question is whether we could detect this during the pre-sync phase instead, which wouldn’t stop the lack of progress, but would avoid the bandwidth waste on repeated presyncs with everyone. The answer is yes - it wouldn’t be hard to check for known-invalid headers in the presync phase as well, however, I don’t think we want to do that because of fingerprinting reasons: it would permit an attacker to feed you an invalid, low-PoW, block during IBD, and then later follow you around the network by claiming to have a chain that extends this invalid block. If you stop fetching immediately, they know you’re the same node as the one they gave the invalid block to earlier.

This fingerprinting is partially solvable: by keeping track of how much work was built on top of an invalid block, and if that work meets our anti-DoS threshold, permit reacting on known-invalid blocks when they’re fed to us during presync. This is however a fair bit of complexity and I’m not sure it’s worth it for just somewhat improving the situation for essentially broken nodes which will never recover anyway.
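One possible reading of that idea, as a toy sketch: remember the most work ever seen on a header chain extending an invalid block, and only let pre-sync react to that block once the figure reaches the anti-DoS threshold. Everything here is hypothetical (the `InvalidBlockInfo` struct, the field names, the function name), and chain work is modelled as a 64-bit integer purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for accumulated chain work (an assumption for this sketch).
using Work = std::uint64_t;

// Hypothetical bookkeeping for a block this node has marked invalid.
struct InvalidBlockInfo {
    Work work_at_block{0};        // total chain work up to and including the invalid block
    Work best_descendant_work{0}; // most total work seen on any header chain extending it
};

// React to a known-invalid ancestor during pre-sync only if a chain containing it
// has reached `anti_dos_threshold`. Below the threshold the node keeps pre-syncing
// as if it knew nothing, so a cheap low-work chain cannot be used as a probe.
bool MayRejectDuringPresync(const InvalidBlockInfo& info, Work anti_dos_threshold)
{
    return info.best_descendant_work >= anti_dos_threshold;
}
```

The extra state that has to be tracked and kept up to date per invalid block is the complexity mentioned above.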

Perhaps time is better spent on better reporting to the user, in the form of targeted warnings in logs (or even failure to start) when there appears to be a long invalid high-PoW chain out there.

maflcko removed the label Bug on Oct 26, 2022

maflcko added the label Feature on Oct 26, 2022

sipa commented at 9:49 pm on October 26, 2022: member

Arguably the fact that this results in a corrupted node wasting bandwidth on redownloading headers multiple times is a 24.0 regression. But I don’t know if it’s worth fixing as it involves some complexity, and would only benefit already broken nodes anyway.

maflcko added the label Brainstorming on Oct 27, 2022

Shekelme commented at 3:43 am on May 16, 2023: none

Same bug here. v24.0.1 Endless_Pre-synchronizing.txt

aleks-mariusz commented at 11:15 am on June 20, 2023: none

I’m seeing this with v25.0.0 as well :-/

2023-06-20T11:09:46Z Pre-synchronizing blockheaders, height: 782560 (~98.44%)
2023-06-20T11:09:49Z Pre-synchronizing blockheaders, height: 738560 (~93.00%)

What are the recommendations on fixing this?

maflcko commented at 12:01 pm on June 20, 2023: member

What are the recommendations on fixing this?

Bitcoin Core makes heavy use of CPU, RAM and disk IO. Hardware defects might only become visible when running Bitcoin Core. You might want to check your hardware for defects.

memtest86 to check your RAM
to check the CPU behaviour under load, use linpack or Prime95
to test your storage device use smartctl or CrystalDiskInfo

Source: https://bitcoin.stackexchange.com/a/12206

If your hardware doesn’t have any faults, you can do a -reindex to wipe the corrupt block file from the storage.

tansanDOTeth commented at 10:48 am on June 24, 2023: none

Is this normal? I’m looking at the logs and it looks like it happens quite often.

2023-06-24T10:32:33Z Pre-synchronizing blockheaders, height: 775060 (~97.47%)
2023-06-24T10:32:33Z Pre-synchronizing blockheaders, height: 777060 (~97.71%)
2023-06-24T10:32:33Z Pre-synchronizing blockheaders, height: 779060 (~97.95%)
2023-06-24T10:32:33Z Pre-synchronizing blockheaders, height: 781060 (~98.19%)
2023-06-24T10:32:34Z Pre-synchronizing blockheaders, height: 783060 (~98.43%)
2023-06-24T10:32:39Z New outbound peer connected: version: 70016, blocks=795702, peer=408 (outbound-full-relay)
2023-06-24T10:32:40Z New outbound peer connected: version: 70016, blocks=795702, peer=409 (outbound-full-relay)
2023-06-24T10:32:46Z Pre-synchronizing blockheaders, height: 335060 (~42.81%)
2023-06-24T10:32:46Z New outbound peer connected: version: 70016, blocks=795702, peer=410 (outbound-full-relay)
2023-06-24T10:32:49Z New outbound peer connected: version: 70015, blocks=795702, peer=411 (outbound-full-relay)
2023-06-24T10:32:51Z Pre-synchronizing blockheaders, height: 337060 (~43.06%)
2023-06-24T10:32:58Z Pre-synchronizing blockheaders, height: 339060 (~43.31%)
2023-06-24T10:33:06Z Pre-synchronizing blockheaders, height: 341060 (~43.57%)
2023-06-24T10:33:12Z Pre-synchronizing blockheaders, height: 343060 (~43.82%)
2023-06-24T10:33:20Z Pre-synchronizing blockheaders, height: 345060 (~44.07%)
2023-06-24T10:33:29Z Pre-synchronizing blockheaders, height: 347060 (~44.32%)
2023-06-24T10:33:34Z Pre-synchronizing blockheaders, height: 349060 (~44.58%)
2023-06-24T10:33:44Z Pre-synchronizing blockheaders, height: 351060 (~44.83%)
2023-06-24T10:33:49Z Pre-synchronizing blockheaders, height: 353060 (~45.09%)

aleks-mariusz commented at 10:53 am on June 24, 2023: none

What are the recommendations on fixing this?

If your hardware doesn’t have any faults, you can do a -reindex to wipe the corrupt block file from the storage.

This helped, re-indexing, but this throws away the entire progress having been made, and starts over at 0% :-/ it took 3+ days to get back to current state sadly w/ my hardware/network connection

tansanDOTeth commented at 10:55 am on June 24, 2023: none

What are the recommendations on fixing this?

If your hardware doesn’t have any faults, you can do a -reindex to wipe the corrupt block file from the storage.

This helped, re-indexing, but this throws away the entire progress having been made, and starts over at 0% :-/ it took 3+ days to get back to current state sadly w/ my hardware/network connection

Is there a way to do this from the GUI?

Edit: For Mac users:

/Applications/Bitcoin-Qt.app/Contents/MacOS/Bitcoin-Qt -reindex

willcl-ark commented at 8:11 am on July 1, 2024: member

Do we want to keep this open to address this comment?:

Perhaps time is better spent on better reporting to the user, in the form of targeted warnings in logs (or even failure to start) when there appears to be a long invalid high-PoW chain out there.

Otherwise I think we can probably close this as stale.

jstefanop commented at 3:17 pm on December 13, 2024: none

FYI, we are seeing this across a large number of our nodes (FutureBit Apollos; tens of thousands of our users’ nodes are on the network).

A reindex usually fixes it, but there should be a way for core to detect this corruption and self fix/reindex from the point of corruption. Being stuck in an endless pre-sync loop is not great (or at least shut down the node with an error?)

maflcko commented at 2:41 pm on December 16, 2024: member

detect this corruption and self fix/reindex from the point of corruption.

I am not sure this will be an improvement, because:

It will turn an endless pre-sync loop into an endless reindex loop (at least for some).
If there is an issue with the hardware, it would be better to diagnose and fix it, instead of continuing. Otherwise the issue will possibly be silently ignored and re-appear in the future.

My recommendation would be to check which configs of your hardware run into this problem and then try to determine and fix the root cause.

eshutov commented at 12:47 pm on January 12, 2025: none

I have faced the same issue:

2025-01-12T12:01:56Z Pre-synchronizing blockheaders, height: 852216 (~97.02%)
2025-01-12T12:02:00Z Pre-synchronizing blockheaders, height: 854216 (~97.23%)
2025-01-12T12:02:02Z Pre-synchronizing blockheaders, height: 856216 (~97.47%)
2025-01-12T12:02:29Z Pre-synchronizing blockheaders, height: 712215 (~81.31%)
2025-01-12T12:02:33Z Pre-synchronizing blockheaders, height: 714215 (~81.52%)
2025-01-12T12:02:37Z Pre-synchronizing blockheaders, height: 716215 (~81.75%)

Pre-synchronizing repeats in an infinite loop and consumes network bandwidth. I can confirm this happened for me due to a hardware issue with my PC (the GPU extra power cord wasn’t connected originally, and this caused kernel backtraces in dmesg). As was mentioned above, I tried a -reindex, but this didn’t help: I still see messages like the ones above. Possibly the best option for those who run into this is redownloading the entire blockchain.

version: v28.0.0

mzumsande commented at 9:03 pm on February 12, 2025: contributor

A question is whether we could detect this during the pre-sync phase instead, which wouldn’t stop the lack of progress, but would avoid the bandwidth waste on repeated presyncs with everyone. The answer is yes - it wouldn’t be hard to check for known-invalid headers in the presync phase as well, however, I don’t think we want to do that because of fingerprinting reasons: it would permit an attacker to feed you an invalid, low-PoW, block during IBD, and then later follow you around the network by claiming to have a chain that extends this invalid block. If you stop fetching immediately, they know you’re the same node as the one they gave the invalid block to earlier.

I don’t think that fingerprinting is a concern here. An attacker cannot feed us an invalid, low-PoW block during IBD to fingerprint us - if they could get us to accept such a header to our block tree db somehow, that would mean that the Anti-DoS headers-sync algorithm would have failed, and we’d probably have way bigger problems than fingerprinting.

So I don’t think there is a conceptual problem with detecting this in the pre-sync phase.

But the point remains that this is just a symptom of the underlying corruption - we’d still endlessly cycle through peers that we would now disconnect immediately for sending us an “invalid” header, so I’m not sure how that would actually be more helpful for users?

sipa commented at 9:06 pm on February 12, 2025: member

@mzumsande Agreed, reading back what I wrote, I don’t see why I thought fingerprinting would be an issue.

I think the most useful course of action here is detecting the presence of a high-PoW header-invalid chain, and reporting it to the user as a sign of likely corruption. Unsure what that would mean for anything other than the GUI, though.

mzumsande commented at 10:04 pm on February 12, 2025: contributor

I think the most useful course of action here is detecting the presence of a high-PoW header-invalid chain, and reporting it to the user as a sign of likely corruption. Unsure what that would mean for anything other than the GUI, though.

We actually have this kind of check (CheckForkWarningConditions) but have it disabled during IBD for no good reason I can see (at least after #19905). I’ll work on a PR suggesting to use this check in the pre-sync header loop situation (and also rework it, e.g. I don’t think it needs to be called in ActivateBestChainStep()).
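As a rough illustration of what such a check does and why skipping it during IBD hides the problem, here is a simplified model. It is not the actual `CheckForkWarningConditions` code: the `ChainView` struct, the 64-bit work values, and the margin of six tip-blocks' worth of work are assumptions made for this sketch and should be checked against the source.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for accumulated chain work (an assumption for this sketch).
using Work = std::uint64_t;

// Hypothetical snapshot of the state the check looks at.
struct ChainView {
    Work tip_work{0};          // total work of the active chain tip
    Work tip_block_proof{0};   // work contributed by the tip block alone
    Work best_invalid_work{0}; // total work of the most-work chain marked invalid (0 if none)
    bool in_ibd{false};        // whether the node is still in initial block download
};

// Warn when a chain marked invalid has clearly more work than the active chain.
// `skip_during_ibd` models the early return during IBD mentioned above.
bool ShouldWarnLargeInvalidChain(const ChainView& v, bool skip_during_ibd)
{
    if (skip_during_ibd && v.in_ibd) return false;
    return v.best_invalid_work > v.tip_work + v.tip_block_proof * 6;
}
```

A node stuck at a 2013 tip is still in IBD, so with `skip_during_ibd` set it never warns, even though the chain it considers invalid has vastly more work than its own tip.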

Stuck in Endless Pre-Syncing Headers Loop #26391