(Updated OP after reworking the approach)
Shut down if a corrupted block is found in ConnectBlock(). This prevents an infinite loop trying to connect such a block, and alerts the node operator to a potential hardware failure.
For context: I had a node go into an infinite loop, when a bit flip in a transaction in the block file caused the merkle root check to fail. It kept printing the same message, pegging the CPU at 100% but without advancing:
2018-02-27 07:47:02 ERROR: ConnectBlock: Consensus::CheckBlock: bad-txnmrklroot, hashMerkleRoot mismatch (code 16)
2018-02-27 07:47:02 ERROR: ConnectTip(): ConnectBlock 0000000000000000003cd56ca33cf3dbc186ff3a6713f6972a11089b6ae4bf625 failed
It might be due to software or hardware reasons (in my case, I found out in retrospect that the CPU was seriously overheating). In this kind of case it is better to give up and have the user check the issue manually than to try to continue.
I think we might as well get rid of this call and assume that the corruption will eventually be detected later.
I’m not sure removing the check is the best way forward. I’m happy that the corruption was detected as early as possible so I could take action and fix the CPU cooler. However, it could stop trying after the first failure and then give up. Or am I misunderstanding what this does?
Yeah it’s not clear to me what is best. With my current proposal here, we skip the CheckBlock check altogether and go straight on to transaction validation. If some transaction was mutated (say due to disk corruption), it’s likely that some validation would fail, and the node would fall out of sync, which would be an indicator to the operator that something is wrong.
For situations where the disk corruption could have been detected, I think it’s better to more explicitly let the user know, rather than have to infer it from a random-looking validation failure. However, in general we aren’t able to detect all the forms of disk corruption that can happen (eg, typically we never know if the block was written correctly because we use an in-memory copy for validation). So I wasn’t sure it was worth trying to maintain some special-case logic to detect something that in general is unknowable.
But if you think catching this kind of thing some of the time is still better than nothing then I could just add back the CheckBlock() call and an assert that any failures cannot be due to potential corruption; I think that’s also a reasonable way to improve this behavior.
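For reference, a rough sketch of that alternative (not the code in this PR; the assert placement and error wording are just illustrative): keep the CheckBlock() call in ConnectBlock(), but assert that a failure can never be attributed to potential corruption, since blocks that may be corrupt are never written to disk in the first place.

    if (!CheckBlock(block, state, chainparams.GetConsensus(), !fJustCheck, !fJustCheck)) {
        // Blocks that might be corrupt are never stored to disk, so a failure
        // here that could be blamed on corruption points at hardware trouble.
        assert(!state.CorruptionPossible());
        return error("%s: Consensus::CheckBlock: %s", __func__, FormatStateMessage(state));
    }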
The call to CheckBlock() in ConnectBlock() is redundant with calls to it
prior to storing a block on disk. If CheckBlock() fails with an error
indicating the block is potentially corrupted, then shut down
immediately, as this is an indication that the node is experiencing
hardware issues. (If we didn't shut down, we could go into an infinite
loop trying to reconnect this same bad block, as we're not setting the
block's status to FAILED in the case where there is potential
corruption.)
If CheckBlock() fails for some other reason, we'll end up flagging this
block as bad (perhaps some prior software version "let a bad block in",
as the comment indicates), and not trying to connect it again, so this
case should be properly handled.
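For context, this is roughly how the caller handles the failure (paraphrased from validation.cpp; details may differ): ConnectTip() only marks the block index as failed when corruption is not a possibility, which is why a corruption-possible failure would otherwise be retried forever.

    // In ConnectTip(), paraphrased:
    bool rv = ConnectBlock(blockConnecting, state, pindexNew, view, chainparams);
    if (!rv) {
        if (state.IsInvalid())
            InvalidBlockFound(pindexNew, state); // flags the block so we won't retry it
        return error("ConnectTip(): ConnectBlock %s failed", pindexNew->GetBlockHash().ToString());
    }

    // In InvalidBlockFound(), paraphrased: FAILED status is only set when
    // corruption is not a possibility, so a corruption-possible failure leaves
    // the block eligible to be picked again by ActivateBestChain().
    if (!state.CorruptionPossible()) {
        pindex->nStatus |= BLOCK_FAILED_VALID;
        // ... plus the rest of the bookkeeping
    }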
Updated OP to reflect what this PR is doing, since those get included in our merge commits. Original OP for posterity:
The call to CheckBlock() in ConnectBlock() was largely redundant – the case it tried to protect against was one where we have somehow previously stored an invalid block (eg with different software, presumably). However, there are other ways that we could get invalid blocks stored on disk, such as if a block was previously accepted (again by different software) that violates rules enforced in ContextualCheckBlock() or ContextualCheckBlockHeader().
Because CheckBlock() can return failure due to potential malleability, this call could result in an infinite failure loop, in the event that we have disk corruption and the stored block that we’re now trying to connect doesn’t match. Since we’re already not catching all forms of corruption between AcceptBlock() and ConnectBlock() (for instance, corruption in a transaction witness), I think we might as well get rid of this call and assume that the corruption will eventually be detected later.
Note: in the case of a planned soft-fork upgrade, we already need to solve the problem of how to ensure that we don’t have invalid blocks stored on disk that might sneak through validation, but that problem is more general than what this invocation of CheckBlock() covers. I’ve updated a long comment explaining this issue.
    if (!CheckBlock(block, state, chainparams.GetConsensus(), !fJustCheck, !fJustCheck)) {
        if (state.CorruptionPossible()) {
            // We don't write down blocks to disk if they may have been
            // corrupted, so this should be impossible unless we're having hardware
            // problems.
            return AbortNode(state, "Corrupt block found indicating potential hardware failure; shutting down");
AbortNode() directs the user to check debug.log, so you can add a LogPrintf() statement before returning the error.
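For example, one way to apply that suggestion (the exact wording and the choice to log the block hash are just illustrative):

    if (state.CorruptionPossible()) {
        // Record which block tripped the check before aborting, so debug.log
        // has something concrete to investigate.
        LogPrintf("%s: block %s read from disk failed CheckBlock() with possible corruption\n",
                  __func__, block.GetHash().ToString());
        return AbortNode(state, "Corrupt block found indicating potential hardware failure; shutting down");
    }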