[validation] Crash if disconnecting a block fails #15305

pull sdaftuar wants to merge 2 commits into bitcoin:master from sdaftuar:2019-01-disconnect-failure-shutdown changing 3 files +55 −1
  1. sdaftuar commented at 9:15 pm on January 31, 2019: member

    If we’re unable to disconnect a block during normal operation, then that is a failure of our local system (such as disk failure) or the chain that we are on (eg CVE-2018-17144), but cannot be due to failure of the (more work) chain that we’re trying to validate.

    We should abort rather than stay on a less work chain.

    Fixes #14341.

  2. in src/validation.cpp:2556 in 5331e2a4d5 outdated
    2552@@ -2551,6 +2553,9 @@ bool CChainState::ActivateBestChainStep(CValidationState& state, const CChainPar
    2553             // This is likely a fatal error, but keep the mempool consistent,
    2554             // just in case. Only remove from the mempool in this case.
    2555             UpdateMempoolForReorg(disconnectpool, false);
    2556+            if (!g_abort_node) {
    


    MarcoFalke commented at 9:21 pm on January 31, 2019:
    Wouldn’t it make more sense to move this check into AbortNode otherwise it seems that this is a flag to indicate the node should be aborted as opposed to a flag that indicates the nodes has been aborted via AbortNode.

    sdaftuar commented at 9:28 pm on January 31, 2019:

    So I was thinking that I didn’t want to add a place to the code where we might call into AbortNode twice, but since I haven’t analyzed exactly what happens in that case I wasn’t sure I wanted to change any existing behavior if we’re doing that somewhere already.

    If everyone is comfortable with globally ensuring that AbortNode() enforces the 1 abort call limit, then I can move this in there, or otherwise perhaps I could just rename this variable to be a bit clearer?


    MarcoFalke commented at 8:02 pm on February 5, 2019:

    I don’t think it is worth the burden of having this undocumented global in the source code to ease today’s review of the diff at the cost of all future review of the overall code.

    I’d prefer to be consistent by allowing AbortNode to be called multiple times and then either overwriting the warning string or keeping the previous one, but special casing the reorg warning to not overwrite any previous warnings seems arbitrary.


    sdaftuar commented at 8:18 pm on February 5, 2019:
    Perhaps I should get rid of the call to AbortNode in DisconnectTip, and then do away with this new global and unconditionally call AbortNode in ABCStep? That would result in a behavior change for invalidateblock and certain failures in RewindBlockIndex, but that might actually be desirable anyway.
  3. in test/functional/feature_abortnode.py:42 in 5331e2a4d5 outdated
    37+        # attempt
    38+        self.nodes[1].generate(4)
    39+        connect_nodes(self.nodes[0], 1)
    40+
    41+        # Check that node0 aborted
    42+        self.nodes[0].assert_debug_log("Failed to disconnect block")
    


    MarcoFalke commented at 9:24 pm on January 31, 2019:
    Would be nice to limit the scope of this context to the minimum required (right after the wait_until). (Compare other uses of with node.assert_debug_log(expected_msgs=reason):)

    sdaftuar commented at 9:35 pm on January 31, 2019:
    Not sure I understand exactly how this is supposed to be used, but can you take a look at the fixup I just pushed and let me know if this is what you have in mind?

    MarcoFalke commented at 9:44 pm on January 31, 2019:

    It is like a context manager that checks what lines get appended to the debug log during its use. And the shortest use seems to me

    0        with self.nodes[0].assert_debug_log("Failed to disconnect block"):
    1            connect_nodes(self.nodes[0], 1)
    2
    3            # Check that node0 aborted
    4            self.log.info("Waiting for crash")
    5            wait_until(lambda: self.nodes[0].is_node_stopped(), timeout=60)
    6        #Exits here and checks debug log for error string
    
  4. fanquake added the label Validation on Jan 31, 2019
  5. in test/functional/feature_abortnode.py:39 in 06e19c76d6 outdated
    34+        os.unlink(undo_file)
    35+
    36+        # Connecting to a node with a more work chain will trigger a reorg
    37+        # attempt
    38+        self.nodes[1].generate(4)
    39+        with self.nodes[0].assert_debug_log("Failed to disconnect block"):
    


    MarcoFalke commented at 6:03 pm on February 1, 2019:

    This fails for me with:

     0AssertionError: [node 0] Expected message "F" does not partially match log:
     1
     2 - 2019-02-01T18:00:39.595725Z Received a POST request for / from 127.0.0.1:41758
     3 - 2019-02-01T18:00:39.595784Z ThreadRPCServer method=addnode user=__cookie__
     4 - 2019-02-01T18:00:39.595808Z trying connection 127.0.0.1:11153 lastseen=0.0hrs
     5 - 2019-02-01T18:00:39.596518Z Added connection peer=0
     6 - 2019-02-01T18:00:39.596544Z sending version (114 bytes) peer=0
     7 - 2019-02-01T18:00:39.596611Z send version message: version 70015, blocks=3, us=[::]:0, peer=0
     8 - 2019-02-01T18:00:39.598387Z Received a POST request for / from 127.0.0.1:41758
     9 - 2019-02-01T18:00:39.598441Z ThreadRPCServer method=getpeerinfo user=__cookie__
    10 - 2019-02-01T18:00:39.645960Z received: version (114 bytes) peer=0
    11 - 2019-02-01T18:00:39.646044Z sending verack (0 bytes) peer=0
    12 - 2019-02-01T18:00:39.646189Z sending getaddr (0 bytes) peer=0
    13 - 2019-02-01T18:00:39.646257Z receive version message: /Satoshi:0.17.99(testnode1)/: version 70015, blocks=4, us=[::]:0, peer=0
    14 - 2019-02-01T18:00:39.646287Z added time data, samples 2, offset +0 (+0 minutes)
    15 - 2019-02-01T18:00:39.646307Z received: verack (0 bytes) peer=0
    16 - 2019-02-01T18:00:39.646319Z New outbound peer connected: version: 70015, blocks=4, peer=0
    17 - 2019-02-01T18:00:39.646326Z sending sendheaders (0 bytes) peer=0
    18 - 2019-02-01T18:00:39.646355Z sending sendcmpct (9 bytes) peer=0
    19 - 2019-02-01T18:00:39.646390Z sending sendcmpct (9 bytes) peer=0
    20 - 2019-02-01T18:00:39.646439Z sending ping (8 bytes) peer=0
    21 - 2019-02-01T18:00:39.646505Z initial getheaders (2) to peer=0 (startheight:4)
    22 - 2019-02-01T18:00:39.646522Z sending getheaders (133 bytes) peer=0
    23 - 2019-02-01T18:00:39.646634Z sending feefilter (8 bytes) peer=0
    24 - 2019-02-01T18:00:39.650027Z Received a POST request for / from 127.0.0.1:41758
    25 - 2019-02-01T18:00:39.650092Z ThreadRPCServer method=getpeerinfo user=__cookie__
    26 - 2019-02-01T18:00:39.650672Z received: sendheaders (0 bytes) peer=0
    27 - 2019-02-01T18:00:39.650721Z received: sendcmpct (9 bytes) peer=0
    28 - 2019-02-01T18:00:39.650737Z received: sendcmpct (9 bytes) peer=0
    29 - 2019-02-01T18:00:39.650751Z received: ping (8 bytes) peer=0
    30 - 2019-02-01T18:00:39.650763Z sending pong (8 bytes) peer=0
    31 - 2019-02-01T18:00:39.650825Z received: getheaders (165 bytes) peer=0
    32 - 2019-02-01T18:00:39.650841Z getheaders 1 to end from peer=0
    33 - 2019-02-01T18:00:39.650854Z sending headers (244 bytes) peer=0
    34 - 2019-02-01T18:00:39.650895Z received: feefilter (8 bytes) peer=0
    35 - 2019-02-01T18:00:39.650907Z received: feefilter of 0.00001000 BTC/kB from peer=0
    36 - 2019-02-01T18:00:39.650934Z received: pong (8 bytes) peer=0
    37 - 2019-02-01T18:00:39.650950Z received: headers (325 bytes) peer=0
    38 - 2019-02-01T18:00:39.651028Z Requesting block 031ddb2a71a0787d38ffcf11ac71c785003a4a4d05a79b6edac7f2e7c86728a6 from  peer=0
    39 - 2019-02-01T18:00:39.651035Z Requesting block 24466befe557061e633ff435931ce3d802ce50fbe93199b55a9b3d689e12b346 from  peer=0
    40 - 2019-02-01T18:00:39.651042Z Requesting block 72dc4f648af86c3051f264f161d5ca18283adbe9b2b16eed8763ed4594173a73 from  peer=0
    41 - 2019-02-01T18:00:39.651048Z Requesting block 22046fc9668cae25c57f70a4b02f8599320a259f63564de4d7cf15648cdd9910 from  peer=0
    42 - 2019-02-01T18:00:39.651054Z Downloading blocks toward 22046fc9668cae25c57f70a4b02f8599320a259f63564de4d7cf15648cdd9910 (4) via headers direct fetch
    

    MarcoFalke commented at 6:04 pm on February 1, 2019:

    Could be fixed with

     0diff --git a/test/functional/feature_abortnode.py b/test/functional/feature_abortnode.py
     1index ad1b0cd8ca..9b5b3a67d1 100755
     2--- a/test/functional/feature_abortnode.py
     3+++ b/test/functional/feature_abortnode.py
     4@@ -36,12 +36,11 @@ class AbortNodeTest(BitcoinTestFramework):
     5         # Connecting to a node with a more work chain will trigger a reorg
     6         # attempt
     7         self.nodes[1].generate(4)
     8-        with self.nodes[0].assert_debug_log("Failed to disconnect block"):
     9+        with self.nodes[0].assert_debug_log(["Failed to disconnect block"]):
    10             connect_nodes(self.nodes[0], 1)
    11-
    12-        # Check that node0 aborted
    13-        self.log.info("Waiting for crash")
    14-        wait_until(lambda: self.nodes[0].is_node_stopped(), timeout=60)
    15+            # Check that node0 aborted
    16+            self.log.info("Waiting for crash")
    17+            wait_until(lambda: self.nodes[0].is_node_stopped(), timeout=60)
    18         self.log.info("Node crashed - now verifying restart fails")
    19         self.nodes[0].assert_start_raises_init_error()
    20 
    
  6. promag commented at 11:14 am on February 2, 2019: member
    Concept ACK.
  7. gmaxwell commented at 10:17 pm on February 2, 2019: contributor
    Concept ACK. The test seems a little fragile (an unrelated change to rename undo files or merge them into block files would break it), but I don’t see a better way to do it.
  8. in test/functional/feature_abortnode.py:26 in 06e19c76d6 outdated
    21+        self.num_nodes = 2
    22+
    23+    def setup_network(self):
    24+        self.setup_nodes()
    25+        # We'll connect the nodes later
    26+        pass
    


    practicalswift commented at 9:29 pm on February 4, 2019:
    pass redundant?

    sdaftuar commented at 6:07 pm on February 5, 2019:
    Removed in latest commit.
  9. sdaftuar force-pushed on Feb 5, 2019
  10. sdaftuar commented at 6:07 pm on February 5, 2019: member
    Thanks @MarcoFalke I took your change to the test, hopefully it passes travis now. If it’s too flaky though I’ll just drop it from this PR.
  11. sdaftuar force-pushed on Feb 5, 2019
  12. sdaftuar commented at 9:14 pm on February 5, 2019: member

    Updated this PR to get rid of the g_abort_node global, and instead change DisconnectTip to no longer call AbortNode itself, and instead leave it to the caller to decide how to handle failures.

    This slightly changes the behavior of invalidateblock and RewindBlockIndex – I believe it’s an improvement but reviewers should take note.

  13. sdaftuar commented at 0:05 am on February 7, 2019: member
    The appveyor test failure appears to be an unrelated, long-standing bug in the wallet_txn_doublespend.py test (fixed in #15360).
  14. in test/functional/feature_abortnode.py:36 in be155578b2 outdated
    32+        # Deleting the undo file will result in reorg failure
    33+        os.unlink(undo_file)
    34+
    35+        # Connecting to a node with a more work chain will trigger a reorg
    36+        # attempt
    37+        self.nodes[1].generate(3)
    


    promag commented at 2:55 pm on February 7, 2019:
    Any reason to not generate 4 here instead?

    sdaftuar commented at 6:03 pm on February 8, 2019:
    Yes– if you generate 4 here, then the call to connect_nodes() afterward can trigger a crash on node0 before the function returns, which causes problems for the test.
  15. in test/functional/feature_abortnode.py:30 in be155578b2 outdated
    25+        # We'll connect the nodes later
    26+
    27+    def run_test(self):
    28+        self.nodes[0].generate(3)
    29+        datadir = get_datadir_path(self.options.tmpdir, 0)
    30+        undo_file = datadir + "/regtest/blocks/rev00000.dat"
    


    promag commented at 2:56 pm on February 7, 2019:

    Should use os.path.join.

    nit, could inline this below

    0os.unlink(os.path.join(datadir, 'regtest', 'blocks', 'rev00000.dat'))
    

    sdaftuar commented at 6:45 pm on February 8, 2019:
    Done in latest commit.
  16. in test/functional/feature_abortnode.py:16 in be155578b2 outdated
    11+"""
    12+
    13+from test_framework.test_framework import BitcoinTestFramework
    14+from test_framework.util import wait_until, get_datadir_path, connect_nodes
    15+import os
    16+
    


    promag commented at 2:56 pm on February 7, 2019:
    nit, double new line.

    sdaftuar commented at 6:46 pm on February 8, 2019:
    I’m not seeing the extra new-line in our style guide, nor does it seem to be failing our style linter? Leaving as-is.

    promag commented at 8:59 pm on June 11, 2019:

    @sdaftuar in test/functional/README.md you see

    Then in https://www.python.org/dev/peps/pep-0008/#blank-lines you can see guide.


    MarcoFalke commented at 9:05 pm on June 11, 2019:
    No this should not be changed. Unlike other rules in pep8, some of which prevent nasty bugs in python, having two or one new line is completely equivalent.
  17. in test/functional/feature_abortnode.py:46 in be155578b2 outdated
    42+            # Check that node0 aborted
    43+            self.log.info("Waiting for crash")
    44+            wait_until(lambda: self.nodes[0].is_node_stopped(), timeout=60)
    45+        self.log.info("Node crashed - now verifying restart fails")
    46+        self.nodes[0].assert_start_raises_init_error()
    47+
    


    promag commented at 2:56 pm on February 7, 2019:
    nit, double new line.
  18. sdaftuar force-pushed on Feb 8, 2019
  19. DrahtBot commented at 5:03 am on February 14, 2019: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    No conflicts as of last run.

  20. DrahtBot added the label Needs rebase on Mar 7, 2019
  21. sdaftuar force-pushed on Mar 8, 2019
  22. DrahtBot removed the label Needs rebase on Mar 8, 2019
  23. in test/functional/feature_abortnode.py:43 in daf67ad1d3 outdated
    38+            connect_nodes(self.nodes[0], 1)
    39+            self.nodes[1].generate(1)
    40+
    41+            # Check that node0 aborted
    42+            self.log.info("Waiting for crash")
    43+            wait_until(lambda: self.nodes[0].is_node_stopped(), timeout=60)
    


    practicalswift commented at 8:50 am on March 25, 2019:
    Nit: Is the lambda needed here?

  24. sipa commented at 11:19 pm on March 25, 2019: member
    Concept ACK
  25. in src/validation.cpp:2298 in dafa0b7e16 outdated
    2294@@ -2295,7 +2295,7 @@ bool CChainState::DisconnectTip(CValidationState& state, const CChainParams& cha
    2295     std::shared_ptr<CBlock> pblock = std::make_shared<CBlock>();
    2296     CBlock& block = *pblock;
    2297     if (!ReadBlockFromDisk(block, pindexDelete, chainparams.GetConsensus()))
    2298-        return AbortNode(state, "Failed to read block");
    


    ryanofsky commented at 7:12 pm on April 10, 2019:

    In commit “[validation] Crash if disconnecting a block fails” (dafa0b7e1615a4e0985ece5d5f044f14753edcf6)

    Note: I was initially confused by the change to this line which removes an existing abort call. But this is just cleanup. Aborting here would be redundant with the new abort call below, which is broader and covers more failure cases.

  26. ryanofsky approved
  27. ryanofsky commented at 7:14 pm on April 10, 2019: member
    utACK daf67ad1d3cd6de67776fa26d49e6b1ba66aae7d. Or slightly tested. Confirmed that when the fix is reverted, wait_until in the test times out.
  28. in src/validation.cpp:2558 in daf67ad1d3 outdated
    2550@@ -2551,6 +2551,7 @@ bool CChainState::ActivateBestChainStep(CValidationState& state, const CChainPar
    2551             // This is likely a fatal error, but keep the mempool consistent,
    2552             // just in case. Only remove from the mempool in this case.
    2553             UpdateMempoolForReorg(disconnectpool, false);
    2554+            AbortNode(state, "Failed to disconnect block; see debug.log for details");
    


    ryanofsky commented at 7:21 pm on April 10, 2019:

    In commit “[validation] Crash if disconnecting a block fails” (dafa0b7e1615a4e0985ece5d5f044f14753edcf6)

    Probably would be good to have a comment saying that it’s better to abort than stay active using a chain that doesn’t have the most work (so it’s clear that handling the error this way is really intentional and this isn’t just fragile code).


    sdaftuar commented at 9:19 am on June 5, 2019:
    Done.
  29. [validation] Crash if disconnecting a block fails
    If we're unable to disconnect a block during normal operation, then that is a
    failure of our local system (such as disk failure) or the chain that we are on
    (eg CVE-2018-17144), but cannot be due to failure of the (more work) chain that
    we're trying to validate.
    
    We should abort rather than stay on a less work chain.
    4433ed0f73
  30. [qa] Test disconnect block failure -> shutdown a47df13471
  31. sdaftuar force-pushed on Jun 5, 2019
  32. sdaftuar commented at 9:20 am on June 5, 2019: member
    Added a comment as @ryanofsky requested.
  33. ryanofsky approved
  34. ryanofsky commented at 9:00 pm on June 12, 2019: member
    utACK a47df13471e3168e2e02023fb20cdf2414141b36. Only change since last review is new comment.
  35. practicalswift commented at 9:09 pm on June 12, 2019: contributor
    utACK a47df13471e3168e2e02023fb20cdf2414141b36
  36. in test/functional/feature_abortnode.py:37 in a47df13471
    32+        os.unlink(os.path.join(datadir, 'regtest', 'blocks', 'rev00000.dat'))
    33+
    34+        # Connecting to a node with a more work chain will trigger a reorg
    35+        # attempt.
    36+        self.nodes[1].generate(3)
    37+        with self.nodes[0].assert_debug_log(["Failed to disconnect block"]):
    


    promag commented at 10:05 pm on June 12, 2019:
    nit, use ' throughout.
  37. promag commented at 10:20 pm on June 12, 2019: member

    ACK a47df1347, it takes awhile to quit (RPC connection timeouts) but that’s unrelated - hope to fix that soon.

    Left 2 nits and a suggestion to improve the test Here’s

     0@@ -11,6 +11,7 @@
     1 """
     2
     3 from test_framework.test_framework import BitcoinTestFramework
     4+from test_framework.test_node import ErrorMatch
     5 from test_framework.util import wait_until, get_datadir_path, connect_nodes
     6 import os
     7
     8@@ -42,7 +43,9 @@ class AbortNodeTest(BitcoinTestFramework):
     9             self.log.info("Waiting for crash")
    10             wait_until(lambda: self.nodes[0].is_node_stopped(), timeout=60)
    11         self.log.info("Node crashed - now verifying restart fails")
    12-        self.nodes[0].assert_start_raises_init_error()
    13+        self.nodes[0].assert_start_raises_init_error(expected_msg='Corrupted block database detected', match=ErrorMatch.PARTIAL_REGEX)
    14+
    15+        self.start_node(0, ['-reindex'])
    16
    17 if __name__ == '__main__':
    18     AbortNodeTest().main()
    
  38. sdaftuar commented at 5:12 pm on July 8, 2019: member
    Thanks for the review @promag; I’ll leave the test nits alone for now, someone can pick up in a future PR if there’s interest.
  39. TheBlueMatt commented at 5:56 pm on July 10, 2019: member
    utACK a47df13471e3168e2e02023fb20cdf2414141b36. Didn’t bother to review the test in detail, it looked fine. Debated whether invalidateblock should ever crash the node, but not crashing in the case of hitting a pruned block (which is the only change here) is clearly better, even if there are other cases I’d argue we should crash in.
  40. sdaftuar commented at 7:41 pm on July 24, 2019: member
    Anything else needed here?
  41. fanquake approved
  42. fanquake commented at 1:04 am on July 25, 2019: member

    ACK a47df13471e3168e2e02023fb20cdf2414141b36

    I’ll leave the test nits alone for now, someone can pick up in a future PR if there’s interest.

    Agree

    even if there are other cases I’d argue we should crash in. @TheBlueMatt Did you want to open an issue listing these cases, or just dump out your thoughts?

  43. fanquake merged this on Jul 25, 2019
  44. fanquake closed this on Jul 25, 2019

  45. fanquake referenced this in commit d5a54ce8f0 on Jul 25, 2019
  46. sidhujag referenced this in commit fd15183645 on Jul 29, 2019
  47. laanwj referenced this in commit 871d3ae45b on Sep 16, 2019
  48. sidhujag referenced this in commit f0d2d15e05 on Sep 16, 2019
  49. deadalnix referenced this in commit 993b00906a on Jun 10, 2020
  50. ftrader referenced this in commit c25bfb8318 on Aug 17, 2020
  51. random-zebra referenced this in commit ac83fdf94c on Apr 9, 2021
  52. PastaPastaPasta referenced this in commit 3b2b94249a on Jun 27, 2021
  53. PastaPastaPasta referenced this in commit 95c1f3f244 on Jun 28, 2021
  54. PastaPastaPasta referenced this in commit 780d01110d on Jun 29, 2021
  55. PastaPastaPasta referenced this in commit 6ad276567d on Jul 1, 2021
  56. PastaPastaPasta referenced this in commit 0cbf492788 on Jul 1, 2021
  57. PastaPastaPasta referenced this in commit cb15f42793 on Jul 12, 2021
  58. PastaPastaPasta referenced this in commit db4c8cf41e on Jul 13, 2021
  59. DrahtBot locked this on Dec 16, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-07-05 22:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me