validation: Improve warnings in case of chain corruption #33553

pull mzumsande wants to merge 2 commits into bitcoin:master from mzumsande:202510_headersync_loop changing 2 files +10 −5
  1. mzumsande commented at 9:49 pm on October 6, 2025: contributor

    In case of corruption that leads to a block being marked as invalid that is seen as valid by the rest of the network, the user currently doesn’t receive good error messages, but will often be stuck in an endless headers-sync loop with no explanation (#26391).

    This PR improves warnings in two ways:

    • A warning is now logged when a peer sends us a header for a block that we have previously marked as invalid, pointing at likely database corruption (p2p commit).

    • CheckForkWarningConditions is now also called during IBD and at startup, so its existing warning about a more-work invalid chain is no longer suppressed in those cases (validation commit).

    Fixes #26391 (We’ll still do the endless looping, trying to find a peer with headers that we can use, but will now repeatedly log warnings while doing so).
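
    For orientation, the p2p warning discussed in the review threads below ends up with roughly the following shape (pieced together from the review snippets quoted in this thread; the exact message text was still being adjusted during review):

        // Sketch of the check added to PeerManagerImpl::ProcessHeadersMessage() in
        // net_processing.cpp, reconstructed from the diffs quoted below.
        if (!processed) {
            if (state.IsInvalid()) {
                // Only warn for outbound peers, consistent with only disconnecting those.
                if (!pfrom.IsInboundConn() && state.GetResult() == BlockValidationResult::BLOCK_CACHED_INVALID) {
                    LogWarning("Header received from peer=%i was previously marked as invalid. "
                               "If this happens with all peers, database corruption is likely and "
                               "-reindex may be necessary to recover.",
                               pfrom.GetId());
                }
                MaybePunishNodeForBlock(pfrom.GetId(), state, via_compact_block, "invalid header received");
                return;
            }
        }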

  2. DrahtBot added the label Validation on Oct 6, 2025
  3. DrahtBot commented at 9:49 pm on October 6, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33553.

    Reviews

    See the guideline for information on the review process.

    Type         Reviewers
    ACK          stickies-v, polespinasa
    Concept ACK  janb84, hodlinator

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30214 (refactor: Improve assumeutxo state representation by ryanofsky)
    • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  4. in src/validation.cpp:2035 in d55830cd50
             return;
         }
     
         if (m_chainman.m_best_invalid && m_chainman.m_best_invalid->nChainWork > m_chain.Tip()->nChainWork + (GetBlockProof(*m_chain.Tip()) * 6)) {
    -        LogPrintf("%s: Warning: Found invalid chain at least ~6 blocks longer than our best chain.\nChain state database corruption likely.\n", __func__);
    +        LogPrintf("%s: Warning: Found invalid chain at least ~6 blocks longer than our best chain. Chain state database corruption likely.\n", __func__);
    


    maflcko commented at 7:52 am on October 7, 2025:
            LogWarning("Found invalid chain at least ~6 blocks longer than our best chain. Chain state database corruption likely.");
    

    nit: while touching this, could add the correct log severity?


    mzumsande commented at 2:43 pm on October 7, 2025:
    done
  5. in src/validation.cpp:2029 in d55830cd50 outdated
    @@ -2026,15 +2026,13 @@ void Chainstate::CheckForkWarningConditions()
     {
         AssertLockHeld(cs_main);
     
    -    // Before we get past initial download, we cannot reliably alert about forks
    -    // (we assume we don't get stuck on a fork before finishing our initial sync)
    -    // Also not applicable to the background chainstate
    -    if (m_chainman.IsInitialBlockDownload() || this->GetRole() == ChainstateRole::BACKGROUND) {
    +    // Not applicable to the background chainstate
    


    stickies-v commented at 11:05 am on October 7, 2025:
    nit: trivial docstring, would prefer removing (or elaborating)

    mzumsande commented at 4:03 pm on October 8, 2025:
    removed it. I think it’s obvious enough why it’s not applicable.
  6. mzumsande force-pushed on Oct 7, 2025
  7. janb84 commented at 2:56 pm on October 7, 2025: contributor

    Concept ACK

    This PR introduces an informative warning for the user in the case that the user has an invalid block (header) due to corruption or otherwise.

    NIT: why keep looping? Is there a path where the node can recover from this?

    partial code review ✅

    Gives a warning in case of chain corruption:

    2025-10-07T14:24:52Z Pre-synchronizing blockheaders, height: 266146 (~97.53%)
    2025-10-07T14:24:55Z [warning] Received header for a block previously marked as invalid from peer=0. If this happens with all peers, database corruption is likely and -reindex may be necessary to recover.
    2025-10-07T14:24:56Z Pre-synchronizing blockheaders, height: 4146 (~1.54%)
    

    Loops without warning:

     2025-10-07T14:17:05Z Pre-synchronizing blockheaders, height: 258146 (~94.63%)
     2025-10-07T14:17:05Z Pre-synchronizing blockheaders, height: 262146 (~96.10%)
     2025-10-07T14:17:05Z Pre-synchronizing blockheaders, height: 266146 (~97.53%)
     2025-10-07T14:17:07Z Pre-synchronizing blockheaders, height: 4146 (~1.54%)
     2025-10-07T14:17:07Z Pre-synchronizing blockheaders, height: 16146 (~6.00%)
     2025-10-07T14:17:08Z Pre-synchronizing blockheaders, height: 28146 (~10.45%)
     2025-10-07T14:17:08Z Pre-synchronizing blockheaders, height: 42146 (~15.66%)
     2025-10-07T14:17:08Z Pre-synchronizing blockheaders, height: 54146 (~20.12%)
     2025-10-07T14:17:08Z Pre-synchronizing blockheaders, height: 64146 (~23.84%)
     2025-10-07T14:17:09Z Pre-synchronizing blockheaders, height: 78146 (~29.03%)
     2025-10-07T14:17:09Z Pre-synchronizing blockheaders, height: 92146 (~34.24%)
    
  8. mzumsande commented at 3:23 pm on October 7, 2025: contributor

    NIT: why keep looping? Is there a path where the node can recover from this?

    Not if we are actually corrupted, but we can’t be 100% sure. Maybe we are right, and the next peer would extend the chain that we see as valid.

    There is the possibility of an attack where a miner creates an invalid block with valid PoW (which is costly), and has some sybil nodes feed that chain to others; it wouldn’t be good if this could be used to shut down nodes. Or there could be some kind of chaotic chain split / hard fork scenario. This is not very likely to happen (hopefully), but I don’t think shutting down the node automatically would be a good idea.

  9. in src/net_processing.cpp:2960 in 6ae9af6b98
    @@ -2956,6 +2956,10 @@ void PeerManagerImpl::ProcessHeadersMessage(CNode& pfrom, Peer& peer,
                                                                state, &pindexLast)};
         if (!processed) {
             if (state.IsInvalid()) {
    +            if (state.GetResult() == BlockValidationResult::BLOCK_CACHED_INVALID) {
    +                LogWarning("Received header for a block previously marked as invalid from peer=%d. "
    


    stickies-v commented at 6:09 pm on October 7, 2025:

    nit: to respect line-length and clang-format, and to make it grammatically clearer that the peer in question did not send the previous version of the header:

     diff --git a/src/net_processing.cpp b/src/net_processing.cpp
     index 073a434335..4574c6cf19 100644
     --- a/src/net_processing.cpp
     +++ b/src/net_processing.cpp
     @@ -2957,8 +2957,10 @@ void PeerManagerImpl::ProcessHeadersMessage(CNode& pfrom, Peer& peer,
          if (!processed) {
              if (state.IsInvalid()) {
                  if (state.GetResult() == BlockValidationResult::BLOCK_CACHED_INVALID) {
     -                LogWarning("Received header for a block previously marked as invalid from peer=%d. "
     -                           "If this happens with all peers, database corruption is likely and -reindex may be necessary to recover.", pfrom.GetId());
     +                LogWarning("Header received from peer=%d was previously marked as invalid. If this "
     +                           "happens with all peers, database corruption is likely and restarting "
     +                           "with -reindex may be necessary to recover.",
     +                           pfrom.GetId());
                  }
                  MaybePunishNodeForBlock(pfrom.GetId(), state, via_compact_block, "invalid header received");
                  return;
    

    mzumsande commented at 4:04 pm on October 8, 2025:
    done.
  10. stickies-v commented at 6:12 pm on October 7, 2025: contributor
    Approach ACK, agreed that something non-invasive like logging is the way to go here.
  11. in src/net_processing.cpp:2959 in 6ae9af6b98
    @@ -2956,6 +2956,10 @@ void PeerManagerImpl::ProcessHeadersMessage(CNode& pfrom, Peer& peer,
                                                                state, &pindexLast)};
         if (!processed) {
             if (state.IsInvalid()) {
    +            if (state.GetResult() == BlockValidationResult::BLOCK_CACHED_INVALID) {
    


    stickies-v commented at 11:38 am on October 8, 2025:

    I think there is potential for unconditional logging abuse here. If an inbound peer (which doesn’t get disconnected for sending BLOCK_CACHED_INVALID headers) knows of (or spends the PoW to construct) an invalid block with a valid header, they can resend the same header and spam the log. The impacts should be limited since we now have disk filling mitigation in place, but it’s still annoying at the very least, and it’s very cheap to execute on the entire network when e.g. a miner accidentally constructs such a block.

    Minimal test to showcase:

     diff --git a/test/functional/p2p_invalid_block.py b/test/functional/p2p_invalid_block.py
     index 88bdbe22a8..c5d22c3382 100755
     --- a/test/functional/p2p_invalid_block.py
     +++ b/test/functional/p2p_invalid_block.py
     @@ -104,6 +104,8 @@ class InvalidBlockRequestTest(BitcoinTestFramework):
              block3.solve()
      
              peer.send_blocks_and_test([block3], node, success=False, reject_reason='bad-cb-amount')
     +        for i in range(100):
     +            peer.send_blocks_and_test([block3], node, success=False, reject_reason='Received header for a block previously marked')
      
      
              # Complete testing of CVE-2012-2459 by sending the original block.
    

    mzumsande commented at 4:15 pm on October 8, 2025:

    good point - I restricted the log to outbound peers, which makes sense anyway, because it’s consistent with the fact that we only disconnect those.

    (Besides, I’m curious what the current way to think about unconditional logging should be - be more liberal / care less about spamming? Or keep on being careful about log levels just as before when the rate limiting didn’t exist? A bit offtopic here, but maybe something to discuss at CoreDev…)


    stickies-v commented at 11:10 am on October 30, 2025:

    Or keep on being careful about log levels just as before when the rate limiting didn’t exist?

    Imo rate limiting changes nothing about how we should think about carefully using log levels. Logs are meant to be informative to the user, and any potential for uninformative entries to flood the log, potentially burying important information, should be avoided. Rate limiting is just a belt-and-suspenders, in my view.


    polespinasa commented at 3:27 am on October 31, 2025:

    If an inbound peer (which doesn’t get disconnected for sending BLOCK_CACHED_INVALID headers) knows of (or spends the PoW to construct) an invalid block with a valid header, they can resend the same header and spam the log.

    Just out of curiosity, why don’t we disconnect it?

  12. mzumsande force-pushed on Oct 8, 2025
  13. in src/net_processing.cpp:2965 in 04b943b63c outdated
    +                LogWarning("Header received from peer=%i was previously marked as invalid. "
    +                           "If this happens with all peers, database corruption is likely and "
    +                           "-reindex may be necessary to recover.",
    +                           pfrom.GetId());
    +            }
                 MaybePunishNodeForBlock(pfrom.GetId(), state, via_compact_block, "invalid header received");
    


    hodlinator commented at 9:05 pm on October 8, 2025:

    (Inline comment in random place): Heads up that I’ve started working on a functional test for this at https://github.com/bitcoin/bitcoin/compare/04b943b63ca4c27aa857b03e5de82ed8cae7b17d...hodlinator:bitcoin:pr/33553_suggestions

    It’s super-slow (partially due to HeadersSyncState only being created when receiving the very hardcoded 2000 headers message), and still of questionable usefulness.

  14. hodlinator commented at 9:06 pm on October 8, 2025: contributor

    Concept ACK 04b943b63ca4c27aa857b03e5de82ed8cae7b17d

    Good to see progress on this issue!

  15. stickies-v commented at 1:30 pm on October 13, 2025: contributor

    In case of corruption that leads to a block being marked as invalid

    Is the working assumption that the errors users are reporting in #26391 are due to disk corruption causing exactly the serialized nStatus::BLOCK_FAILED_{VALID,CHILD} bit to be flipped, causing the block to be loaded as invalid instead of valid, without causing any other corruption that would likely crash the application? Because while that wouldn’t be impossible, it seems like an incredibly unlikely event to me?

  16. mzumsande commented at 7:17 pm on October 14, 2025: contributor

    Is the working assumption that the errors users are reporting in #26391 are due to disk corruption causing exactly the serialized nStatus::BLOCK_FAILED_{VALID,CHILD} bit to be flipped, causing the block to be loaded as invalid instead of valid, without causing any other corruption that would likely crash the application? Because while that wouldn’t be impossible, it seems like an incredibly unlikely event to me?

    No, that would indeed be unlikely. The assumption is that some kind of corruption led to a downloaded block being marked as invalid during ConnectBlock() erroneously - the BLOCK_FAILED flag is then set by the validation code as a result. The original corruption could also be with the UTXO set (e.g. an input needed by the block is no longer there) or with the block.

  17. hodlinator commented at 1:36 pm on October 15, 2025: contributor

    Please excuse me while I beat the dead horse before digging deeper. I was also assuming this had to do with some kind of corruption in the bits of CBlockIndex::nStatus while at rest on disk, thanks for clearing this up! In #26391 (comment) we find what triggered the original issue:

    ERROR: ConnectBlock: Consensus::CheckTxInputs: 878d6685666400b75a1947ccfc676249ecdf52678b2dc0d83e0328f8c24a951a, bad-txns-inputs-missingorspent, CheckTxInputs: inputs missing/spent
    InvalidChainFound: invalid block=000000000000032021a6d18011d202df36cf07822a657b47390ab90568bb14e2  height=224855  log2_work=69.513793  date=2013-03-08T15:58:52Z
    

    When checking the inputs of the above transaction https://mempool.space/tx/878d6685666400b75a1947ccfc676249ecdf52678b2dc0d83e0328f8c24a951a, the UTXO set (chainstate) for some reason doesn’t contain the expected entries. We mark the block as invalid, even though the UTXO set is the culprit. The transaction itself is not corrupt, since it hashed into a valid txid. And the block (224855) is probably intact, as the inputs are from older blocks, 224108 and 224589, and earlier checks in ConnectBlock() would probably have failed too if the block data had been mutated.
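
    The path described here is roughly the following, simplified from ConnectBlock() in validation.cpp (signatures approximate, surrounding code omitted):

        // Inside ConnectBlock(): a missing/spent input makes CheckTxInputs fail, and that
        // failure is treated as a consensus failure of the block itself, even if the real
        // problem is a corrupted UTXO set.
        TxValidationState tx_state;
        CAmount txfee = 0;
        if (!Consensus::CheckTxInputs(tx, tx_state, view, pindex->nHeight, txfee)) {
            // e.g. "bad-txns-inputs-missingorspent", as in the log above
            state.Invalid(BlockValidationResult::BLOCK_CONSENSUS,
                          tx_state.GetRejectReason(), tx_state.GetDebugMessage());
            return false; // the block then gets BLOCK_FAILED_VALID set and InvalidChainFound is logged
        }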

    Q1: Maybe bitcoind could automatically wipe the UTXO set if the current block is an ancestor of the assume valid hash, instead of marking the block as invalid?

    Another, less drastic alternative could be to not automatically wipe the UTXO set, but log a suggestion to -reindex and, if the block is an ancestor of the assume-valid point, shut the node down.

    In the full debug.log from the original issue one can see that 224108 and 224589 were downloaded within the same process lifetime as 224855. So this is the UTXO set being corrupted while bitcoind is running.

    Q2: Do we suspect that the UTXO set was corrupted / out of sync due to a logic error, or due to a cosmic-ray-type event out of our control?

    Edit: Clarified Qs and avoided negation. Edit2: Suggested less drastic alternative to wiping UTXO set. Corrected assume UTXO -> assume valid.

  18. in src/net_processing.cpp:2960 in 04b943b63c outdated
    @@ -2956,6 +2956,12 @@ void PeerManagerImpl::ProcessHeadersMessage(CNode& pfrom, Peer& peer,
                                                                state, &pindexLast)};
         if (!processed) {
             if (state.IsInvalid()) {
    +            if (!pfrom.IsInboundConn() && state.GetResult() == BlockValidationResult::BLOCK_CACHED_INVALID) {
    +                LogWarning("Header received from peer=%i was previously marked as invalid. "
    


    stickies-v commented at 3:11 pm on October 15, 2025:

    This doesn’t combine well with invalidateblock: e.g. when invalidating the tip, the log just gets spammed with warnings as we cycle through new peers:

     2025-10-15T14:45:32Z UpdateTip: new best=000000092ca51c152010fd76552e12d9c9c7f410ebd04012dcff49eeecb6cfff height=274019 version=0x20000000 log2_work=42.943635 tx=27881630 date='2025-10-15T14:27:13Z' progress=1.000000 cache=71.8MiB(503926txo)
     ...
     2025-10-15T14:45:50Z UpdateTip: new best=0000000b9e7f0760c48d76ef301616dbaa1421788b29911c02f7beacbd20b867 height=274018 version=0x20000000 log2_work=42.943600 tx=27881585 date='2025-10-15T14:12:48Z' progress=0.999984 cache=71.8MiB(503258txo)
     2025-10-15T14:45:50Z InvalidChainFound: invalid block=000000092ca51c152010fd76552e12d9c9c7f410ebd04012dcff49eeecb6cfff  height=274019  log2_work=42.943635  date=2025-10-15T14:27:13Z
     2025-10-15T14:45:50Z InvalidChainFound:  current best=0000000b9e7f0760c48d76ef301616dbaa1421788b29911c02f7beacbd20b867  height=274018  log2_work=42.943600  date=2025-10-15T14:12:48Z
     2025-10-15T14:46:01Z New outbound-full-relay v2 peer connected: version: 70016, blocks=274019, peer=8
     2025-10-15T14:46:01Z [warning] Header received from peer=8 was previously marked as invalid. If this happens with all peers, database corruption is likely and -reindex may be necessary to recover.
     2025-10-15T14:46:03Z New outbound-full-relay v2 peer connected: version: 70016, blocks=274019, peer=9
     2025-10-15T14:46:03Z [warning] Header received from peer=9 was previously marked as invalid. If this happens with all peers, database corruption is likely and -reindex may be necessary to recover.
     2025-10-15T14:46:03Z New outbound-full-relay v2 peer connected: version: 70016, blocks=274019, peer=10
     2025-10-15T14:46:03Z [warning] Header received from peer=10 was previously marked as invalid. If this happens with all peers, database corruption is likely and -reindex may be necessary to recover.
    

    It seems like the main purpose of this new log message is to help users understand endless pre-sync loops. In that case, maybe we can improve on this by making the warning specific to the pre-sync?

    I’ve created a branch which implements that approach, see diff: https://github.com/mzumsande/bitcoin/compare/202510_headersync_loop..stickies-v:bitcoin:2025-10/log-presync-invalid-chain

    (note: I used static std::atomic<int64_t> previous_height{height} as a quick fix that works well enough, but adding a ChainstateManager::m_last_presync_height would be better)
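
    A rough illustration of that idea (hypothetical member and function names, not code from this PR or the linked branch): track the last reported pre-sync height and warn when a new pre-sync starts over below it.

        // Hypothetical sketch: a ChainstateManager member instead of a function-local static.
        std::atomic<int64_t> m_last_presync_height{0};

        void NotePresyncHeight(int64_t height)
        {
            const int64_t prev{m_last_presync_height.exchange(height)};
            if (prev > 0 && height < prev) {
                // Pre-sync went backwards: the previous attempt was abandoned and restarted.
                LogWarning("Headers pre-sync restarted below the previously reached height (%d). "
                           "If this keeps happening with all peers, database corruption or a "
                           "consensus failure is possible.", prev);
            }
        }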


    mzumsande commented at 1:57 am on October 16, 2025:

    This doesn’t combine well with invalidateblock

    Being out of consensus with all outbound peers is a situation as catastrophic as it gets, and probably requires intervention by the user, so why wouldn’t we want to be a bit spammy in this scenario? Outside of a real catastrophic chain split, one could of course artificially induce this situation with invalidateblock (which is debug-only) and then cycle through outbound peers, but I don’t see the use case: why would anyone do this, and why should we care about it?


    stickies-v commented at 11:12 am on October 16, 2025:

    why wouldn’t we want to be a bit spammy in this scenario?

    I don’t disagree, but it seems orthogonal to this PR which is addressing db corruption issues? It’s okay for warning messages to not cover all edge cases, but I think it is good to try to avoid incorrect diagnostics or solutions if we can help it without too much complexity. invalidateblock is debug-only, but is also commonly referenced as a break-in-case-of-emergency RPC. With the current change, usage of the RPC is guaranteed to produce incorrect and undocumented warning logs, so I don’t think that’s ideal.

    In this case, it’s tricky because we have 2 different and quite catastrophic issues ((1) db corruption and (2) consensus issues) with overlapping symptoms: receiving seemingly invalid headers from peers. However, since statistically speaking the db corruption is very likely to affect historical blocks and thus trigger a new headers (pre-)sync, logging that at the header (pre-)sync stage seems more accurate to me, without downside? Consensus issues (that don’t seem likely to be db corruption issues) should in most cases happen at the tip, and can still be logged/warned as such, with a more accurate message.


    stickies-v commented at 11:14 am on October 16, 2025:
    Alternatively, I would be okay with (but not prefer) the log message to be generalized to highlight that db corruption or consensus failure are possible, and the invalidateblock RPC documentation updated to highlight that this warning will be flooded upon usage.

    mzumsande commented at 8:49 pm on October 29, 2025:

    I agree that mentioning a consensus error makes sense, so I added it to the warning message. Note that I didn’t change the existing CheckForkWarningConditions() message as well, but could do that too if reviewers suggest it.

    In principle, I don’t see a difference whether this scenario happens during IBD or close to the tip. In both cases, it is most likely corruption, but a consensus error is possible too. Current behaviour is that in the IBD case, we will cycle through header-syncs with no warning (but unrelated logging); in the tip case, we will silently cycle through outbound peers, all of which we disconnect, with no warning at all. I think it makes sense to alert the user in both situations.

    As for invalidateblock: If it is used in an actual emergency case, I think that it is correct to repeatedly warn the user about what’s going on because they should probably do something to get compatible outbound peers - a few repeated log messages shouldn’t be a concern there, and eventually rate-limiting will kick in.

    If it is used for testing, having these warnings could remind the user to switch off network connectivity, as the -dumptxoutset use case already does.


    stickies-v commented at 11:05 am on October 30, 2025:

    As for invalidateblock:… because they should probably do something to get compatible outbound peers…

    I don’t think that’s necessarily true. In the “happy” path use case of invalidateblock, a large number of nodes will have done the same, so compatible nodes should be found automatically. Other approaches to prevent connecting with incompatible nodes would of course be better, but we’re talking about an emergency here.

    … and eventually rate-limiting will kick in.

    Yes, but rate limiting still allows 1MB of logging every hour, so while that mitigates crash issues, it definitely isn’t a solution for UX.

    That said, I agree that being noisy about receiving invalid headers is better than being quiet, and that my earlier suggested approach is no improvement. In case of a legitimate network-wide use case for invalidateblock, sustained noisy logs will be the last of people’s concerns, and they can always be filtered until a patch is rolled out. Perhaps someone comes up with a better alternative later on, but I think this PR is an improvement to master and I can’t think of a better approach myself.


    stickies-v commented at 11:12 am on October 30, 2025:
    I think logging the block hash would be helpful here?
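
    For illustration, the hash could be added along these lines, with header_hash standing in for however the rejected header's hash would be obtained at this point (hypothetical, not part of the PR):

        LogWarning("Header %s received from peer=%i was previously marked as invalid. "
                   "This may be due to database corruption or a consensus failure; if this "
                   "happens with all peers, -reindex may be necessary to recover.",
                   header_hash.ToString(), pfrom.GetId());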
  19. p2p: Add warning message when receiving headers for blocks cached as invalid
    Currently, if database corruption leads to a block being marked as
    invalid incorrectly, we can get stuck in an infinite headerssync
    loop with no indication what went wrong or how to fix it.
    With the added log message, users will receive an explicit warning after each
    failed headerssync attempt with an outbound peer.
    f8955eaee1
  20. validation: call CheckForkWarningConditions during IBD, and at startup
    The existing IBD check was added at a time when CheckForkWarningConditions
    also did sophisticated fork detection that could lead to false positives
    during IBD (55ed3f14751206fc87f0cbf8cb4e223efacef338).
    
    The fork detection logic doesn't exist anymore
    (since fa62304c9760f0de9838e56150008816e7a9bacb), so the IBD check is no
    longer necessary.
    
    Displaying the log at startup will help node operators diagnose the
    problem better.
    6d2c8ea9db
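
    Taken together with the review diffs above, the guard in Chainstate::CheckForkWarningConditions() roughly ends up as follows (sketch reconstructed from the snippets in this thread, not the exact merged code; the new startup call site is not shown):

        void Chainstate::CheckForkWarningConditions()
        {
            AssertLockHeld(cs_main);

            // The IBD early-return is gone; only the background chainstate is skipped.
            if (this->GetRole() == ChainstateRole::BACKGROUND) {
                return;
            }

            if (m_chainman.m_best_invalid && m_chainman.m_best_invalid->nChainWork > m_chain.Tip()->nChainWork + (GetBlockProof(*m_chain.Tip()) * 6)) {
                LogWarning("Found invalid chain at least ~6 blocks longer than our best chain. Chain state database corruption likely.");
                // ...
            }
            // ...
        }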
  21. mzumsande force-pushed on Oct 29, 2025
  22. stickies-v approved
  23. stickies-v commented at 11:17 am on October 30, 2025: contributor
    ACK 6d2c8ea9dbd77c71051935b5ab59224487509559
  24. DrahtBot requested review from hodlinator on Oct 30, 2025
  25. DrahtBot requested review from janb84 on Oct 30, 2025
  26. polespinasa approved
  27. polespinasa commented at 3:28 am on October 31, 2025: member

    code reviewed and tested ACK 6d2c8ea9dbd77c71051935b5ab59224487509559

    This PR introduces an informative warning about an invalid block header, which may fix issues like #26391 by giving more information to the user.

    Tried to force the warning (f8955eaee19d592cc20c497d145f5eae25277014) following #26391 (comment) and it behaves as expected.

    Non-blocking nit: I agree with @stickies-v that the block hash may be useful in the message.

