(Updated OP after reworking the approach)
Shut down if a corrupted block is found in ConnectBlock(). This prevents an infinite loop trying to connect such a block, and alerts the node operator to a potential hardware failure.
For context: I had a node go into an infinite loop, when a bit flip in a transaction in the block file caused the merkle root check to fail. It kept printing the same message, pegging the CPU at 100% but without advancing:
2018-02-27 07:47:02 ERROR: ConnectBlock: Consensus::CheckBlock: bad-txnmrklroot, hashMerkleRoot mismatch (code 16)
2018-02-27 07:47:02 ERROR: ConnectTip(): ConnectBlock 0000000000000000003cd56ca33cf3dbc186ff3a6713f6972a11089b6ae4bf625 failed
It might be due to software or hardware reasons (in my case, I found out in retrospect that the CPU was seriously overheating). In this kind of case it is better to give up and have the user check the issue manually than to try to continue.
I think we might as well get rid of this call and assume that the corruption will eventually be detected later.
I’m not sure removing the check is the best way forward. I’m happy that the corruption was detected as early as possible so I could take action and fix the CPU cooler. However, it could stop trying after the first failure and then give up. Or am I misunderstanding what this does?
Yeah it’s not clear to me what is best. With my current proposal here, we skip the CheckBlock check altogether and go straight on to transaction validation. If some transaction was mutated (say due to disk corruption), it’s likely that some validation would fail, and the node would fall out of sync, which would be an indicator to the operator that something is wrong.
For situations where the disk corruption could have been detected, I think it’s better to more explicitly let the user know, rather than have to infer it from a random-looking validation failure. However, in general we aren’t able to detect all the forms of disk corruption that can happen (eg, typically we never know if the block was written correctly because we use an in-memory copy for validation). So I wasn’t sure it was worth trying to maintain some special-case logic to detect something that in general is unknowable.
But if you think catching this kind of thing some of the time is still better than nothing then I could just add back the CheckBlock() call and an assert that any failures cannot be due to potential corruption; I think that’s also a reasonable way to improve this behavior.
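For reference, a rough sketch of that alternative (not the code in this PR; the assert placement and error wording are just illustrative): keep the CheckBlock() call in ConnectBlock(), but assert that a failure can never be attributed to potential corruption, since blocks that may be corrupt are never written to disk in the first place.

    if (!CheckBlock(block, state, chainparams.GetConsensus(), !fJustCheck, !fJustCheck)) {
        // Blocks that might be corrupt are never stored to disk, so a failure
        // here that could be blamed on corruption points at hardware trouble.
        assert(!state.CorruptionPossible());
        return error("%s: Consensus::CheckBlock: %s", __func__, FormatStateMessage(state));
    }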
The call to CheckBlock() in ConnectBlock() is redundant with calls to it
prior to storing a block on disk. If CheckBlock() fails with an error
indicating the block is potentially corrupted, then shut down
immediately, as this is an indication that the node is experiencing
hardware issues. (If we didn't shut down, we could go into an infinite
loop trying to reconnect this same bad block, as we're not setting the
block's status to FAILED in the case where there is potential
corruption.)
If CheckBlock() fails for some other reason, we'll end up flagging this
block as bad (perhaps some prior software version "let a bad block in",
as the comment indicates), and not trying to connect it again, so this
case should be properly handled.
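For context, this is roughly how the caller handles the failure (paraphrased from validation.cpp; details may differ): ConnectTip() only marks the block index as failed when corruption is not a possibility, which is why a corruption-possible failure would otherwise be retried forever.

    // In ConnectTip(), paraphrased:
    bool rv = ConnectBlock(blockConnecting, state, pindexNew, view, chainparams);
    if (!rv) {
        if (state.IsInvalid())
            InvalidBlockFound(pindexNew, state); // flags the block so we won't retry it
        return error("ConnectTip(): ConnectBlock %s failed", pindexNew->GetBlockHash().ToString());
    }

    // In InvalidBlockFound(), paraphrased: FAILED status is only set when
    // corruption is not a possibility, so a corruption-possible failure leaves
    // the block eligible to be picked again by ActivateBestChain().
    if (!state.CorruptionPossible()) {
        pindex->nStatus |= BLOCK_FAILED_VALID;
        // ... plus the rest of the bookkeeping
    }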
Updated OP to reflect what this PR is doing, since those get included in our merge commits. Original OP for posterity:
The call to CheckBlock() in ConnectBlock() was largely redundant – the case it tried to protect against was one where we have somehow previously stored an invalid block (eg with different software, presumably). However, there are other ways that we could get invalid blocks stored on disk, such as if a block was previously accepted (again by different software) that violates rules enforced in ContextualCheckBlock() or ContextualCheckBlockHeader().
Because CheckBlock() can return failure due to potential malleability, this call could result in an infinite failure loop, in the event that we have disk corruption and the stored block that we’re now trying to connect doesn’t match. Since we’re already not catching all forms of corruption between AcceptBlock() and ConnectBlock() (for instance, corruption in a transaction witness), I think we might as well get rid of this call and assume that the corruption will eventually be detected later.
Note: in the case of a planned soft-fork upgrade, we already need to solve the problem of how to ensure that we don’t have invalid blocks stored on disk that might sneak through validation, but that problem is more general than what this invocation of CheckBlock() covers. I’ve updated a long comment explaining this issue.
    if (!CheckBlock(block, state, chainparams.GetConsensus(), !fJustCheck, !fJustCheck)) {
        if (state.CorruptionPossible()) {
            // We don't write down blocks to disk if they may have been
            // corrupted, so this should be impossible unless we're having hardware
            // problems.
            return AbortNode(state, "Corrupt block found indicating potential hardware failure; shutting down");
AbortNode() directs the user to check debug.log, so you can add a LogPrintf() statement before returning the error.
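For example, one way to apply that suggestion (the exact wording and the choice to log the block hash are just illustrative):

    if (state.CorruptionPossible()) {
        // Record which block tripped the check before aborting, so debug.log
        // has something concrete to investigate.
        LogPrintf("%s: block %s read from disk failed CheckBlock() with possible corruption\n",
                  __func__, block.GetHash().ToString());
        return AbortNode(state, "Corrupt block found indicating potential hardware failure; shutting down");
    }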