Check for block corruption in ConnectBlock() #12561

pull sdaftuar wants to merge 1 commits into bitcoin:master from sdaftuar:2018-02-improve-checkblock-failure changing 1 files +8 −1
  1. sdaftuar commented at 4:40 pm on February 27, 2018: member

    (Updated OP after reworking the approach)

    Shut down if a corrupted block is found in ConnectBlock(). This prevents an infinite loop trying to connect such a block, and alerts the node operator that there may be potential hardware failure.

  2. fanquake added the label Validation on Feb 27, 2018
  3. laanwj commented at 7:13 am on March 1, 2018: member

    For context: I had a node go into an infinite loop, when a bit flip in a transaction in the block file caused the merkle root check to fail. It kept printing the same message, pegging the CPU at 100% but without advancing:

    02018-02-27 07:47:02 ERROR: ConnectBlock: Consensus::CheckBlock: bad-txnmrklroot,
    1 hashMerkleRoot mismatch (code 16)
    22018-02-27 07:47:02 ERROR: ConnectTip(): ConnectBlock 0000000000000000003cd56ca3
    3cf3dbc186ff3a6713f6972a11089b6ae4bf625 failed
    

    It might be due to software or hardware reasons (in my case, I found out in retrospect that the CPU was seriously overheating). In this kind of case it is better to give up and have the user manually check the issue, than try to continue.

    I think we might as well get rid of this call and assume that the corruption will eventually be detected later.

    I’m not sure removing the check is the best way forward. I’m happy that the corruption was detected as early as possible so I could take action and fix the CPU cooler. However, it could stop trying after one time, then give up. Or am I misunderstanding what this does?

  4. laanwj added the label Data corruption on Mar 1, 2018
  5. sdaftuar commented at 2:16 am on March 5, 2018: member

    I’m not sure removing the check is the best way forward. I’m happy that the corruption was detected as early as possible so I could take action and fix the CPU cooler. However, it could stop trying after one time, then give up. Or am I misunderstanding what this does?

    Yeah it’s not clear to me what is best. With my current proposal here, we skip the CheckBlock check altogether and go straight on to transaction validation. If some transaction was mutated (say due to disk corruption), it’s likely that some validation would fail, and the node would fall out of sync, which would be an indicator to the operator that something is wrong.

    For situations where the disk corruption could have been detected, I think it’s better to more explicitly let the user know, rather than have to infer it from a random-looking validation failure. However, in general we aren’t able to detect all the forms of disk corruption that can happen (eg, typically we never know if the block was written correctly because we use an in-memory copy for validation). So I wasn’t sure it was worth trying to maintain some special-case logic to detect something that in general is unknowable.

    But if you think catching this kind of thing some of the time is still better than nothing then I could just add back the CheckBlock() call and an assert that any failures cannot be due to potential corruption; I think that’s also a reasonable way to improve this behavior.

  6. sdaftuar force-pushed on Mar 5, 2018
  7. sdaftuar renamed this:
    Remove call to CheckBlock() in ConnectBlock()
    Check for block corruption in ConnectBlock()
    on Mar 5, 2018
  8. sdaftuar commented at 3:48 pm on March 5, 2018: member
    @laanwj I changed this to just shut down if we encounter a corrupt block in ConnectBlock().
  9. Shut down if trying to connect a corrupted block
    The call to CheckBlock() in ConnectBlock() is redundant with calls to it
    prior to storing a block on disk. If CheckBlock() fails with an error
    indicating the block is potentially corrupted, then shut down
    immediately, as this is an indication that the node is experiencing
    hardware issues.  (If we didn't shut down, we could go into an infinite
    loop trying to reconnect this same bad block, as we're not setting the
    block's status to FAILED in the case where there is potential
    corruption.)
    
    If CheckBlock() fails for some other reason, we'll end up flagging this
    block as bad (perhaps some prior software version "let a bad block in",
    as the comment indicates), and not trying to connect it again, so this
    case should be properly handled.
    0e7c52dc6c
  10. sdaftuar force-pushed on Mar 5, 2018
  11. laanwj commented at 5:45 pm on March 5, 2018: member
    utACK https://github.com/bitcoin/bitcoin/pull/12561/commits/0e7c52dc6cbb8fd881a0dd57a6167a812fe71dc4 looks good to me now (will try to test this once I get home, as I still have the corrupted file)
  12. sdaftuar commented at 2:10 pm on March 7, 2018: member

    Updated OP to reflect what this PR is doing, since those get included in our merge commits. Original OP for posterity:

    The call to CheckBlock() in ConnectBlock() was largely redundant – the case it tried to protect against was one where we have somehow previously stored an invalid block (eg with different software, presumably). However, there are other ways that we could get invalid blocks stored on disk, such as if a block was previously accepted (again by different software) that violates rules enforced in ContextualCheckBlock() or ContextualCheckBlockHeader().

    Because CheckBlock() can return failure due to potential malleability, this call could result in an infinite failure loop, in the event that we have disk corruption and the stored block that we’re now trying to connect doesn’t match. Since we’re already not catching all forms of corruption between AcceptBlock() and ConnectBlock() (for instance, corruption in a transaction witness), I think we might as well get rid of this call and assume that the corruption will eventually be detected later.

    Note: in the case of a planned soft-fork upgrade, we already need to solve the problem of how to ensure that we don’t have invalid blocks stored on disk that might sneak through validation, but that problem is more general than what this invocation of CheckBlock() covers. I’ve updated a long comment explaining this issue.

  13. in src/validation.cpp:1799 in 0e7c52dc6c
    1795+    if (!CheckBlock(block, state, chainparams.GetConsensus(), !fJustCheck, !fJustCheck)) {
    1796+        if (state.CorruptionPossible()) {
    1797+            // We don't write down blocks to disk if they may have been
    1798+            // corrupted, so this should be impossible unless we're having hardware
    1799+            // problems.
    1800+            return AbortNode(state, "Corrupt block found indicating potential hardware failure; shutting down");
    


    eklitzke commented at 2:37 am on March 11, 2018:
    You should log the block hash as well, as that is useful information for anyone who wants to try to dig deeper (and is the first thing any developer would want to know). It looks like AbortNode() directs the user to check debug.log, so you can add a LogPrintf() statement before returning the error.
  14. eklitzke commented at 2:39 am on March 11, 2018: contributor
    Would like to see more information in debug.log in this case, but otherwise looks good to me.
  15. laanwj commented at 9:07 am on April 8, 2018: member
    Going to merge this; it’s an improvement to error handing either way. Additional information can be added in future PRs.
  16. laanwj merged this on Apr 8, 2018
  17. laanwj closed this on Apr 8, 2018

  18. laanwj referenced this in commit 24133b177a on Apr 8, 2018
  19. MarcoFalke added the label Up for grabs on Apr 8, 2018
  20. MarcoFalke commented at 3:51 pm on April 8, 2018: member
    Marking “up for grabs” to add the logprint
  21. MarcoFalke removed the label Up for grabs on Mar 5, 2019
  22. Fabcien referenced this in commit 546d897364 on Aug 30, 2019
  23. jonspock referenced this in commit bf06abb7a2 on Dec 8, 2019
  24. jonspock referenced this in commit b4b9d97f75 on Dec 8, 2019
  25. jonspock referenced this in commit 2426a9777c on Dec 8, 2019
  26. proteanx referenced this in commit 789ba94130 on Dec 12, 2019
  27. random-zebra referenced this in commit c9741c53b1 on Jun 4, 2020
  28. PastaPastaPasta referenced this in commit ca0ed8dcfb on Nov 1, 2020
  29. wqking referenced this in commit be1747ac9e on Jan 10, 2021
  30. DrahtBot locked this on Dec 16, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-21 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me