net_processing: improve block data I/O error handling in P2P paths #35003

pull furszy wants to merge 5 commits into bitcoin:master from furszy:2026_abc_io_exception changing 8 files +426 −3
  1. furszy commented at 3:29 pm on April 4, 2026: member

    Early note: the majority of this PR consists of test coverage. The changes per se are quite small, just the issue shows up in multiple paths in slightly different forms.

    The goal of the PR is to ensure I/O errors are not silently ignored during validation or P2P message handling. In some cases, such as GETDATA or GETBLOCKTXN requests, errors can be swallowed, leaving the node unresponsive to the request without any action. In a harder to reach but more severe case, a swallowed error can leave the node alive but stuck, unable to process new blocks and advance the chain. Also, silently ignoring errors obviously makes problems harder to diagnose. So this PR seeks to improve all of that.

    Currently, the same root cause, inability to access the blocks directory, can produce different behaviors depending on where it occurs:


    1. If it happens during block connection, inside AcceptBlock(), the error gets caught and triggers a fatal error graceful shutdown. This is the correct behavior.

    2. If it happens during block connection, after AcceptBlock(), inside ActivateBestChain(), the file system error gets thrown and swallowed by the P2P general try-catch, leaving the node in a living but stuck state. Where ActivateBestChain() will always silently fail on any posterior block processing, retrying to active the failing block.

    3. If it happens during GETDATA request, the file system error gets swallowed by the message handling general try-catch, only logging a net-debug-level message.
The remote peer is not disconnected, nor the node aborts. It just silently ignores the request. This is different to a missing block data, which currently logs the error + disconnects the peer.

    4. If it happens during GETBLOCKTXN request, the file system error either gets swallowed by the message handling general try-catch or crashes the system via an assertion, depending on if the problem is at the blocks directory level or at the block data level, which is inconsistent.

    So, this PR makes failure handling consistent and ensures the node never enters a stuck state due to a block I/O error. Changes:

    1. GETDATA now treats the issue in the same way as other I/O errors: it logs the error + disconnect the remote peer.
    2. GETBLOCKTXN now triggers a fatal error and graceful shutdown. The error here is stricter than in GETDATA because it can only access the last 10 blocks, which must be available for reorgs.
    3. ActivateBestChain() now triggers a fatal error and graceful shutdown when it happens. Fixing the silent stuck node state scenario.

    Testing Note Can reproduce the different behaviors that cause the stuck node scenario by running the second commit’s unit test 28af683b5dfdce88892e367b308857b94ec630fe or by manually introducing an exception in ConnectTip() (which occurs inside ActivateBestChain()). Previously, the thrown exception would have being caught by the general P2P try-catch and logged only at net-debug level, which most users do not have enabled, leaving the node alive but stuck, as subsequent block connections repeatedly attempt to process the failing block, throw the exception again, and get silently swallowed. Adding a functional test for this scenario is not possible due to the requirement of failing inside ActivateBestChain(), which always occurs after AcceptBlock(), which correctly captures the error.

    Separate Note There is another assertion scenario that we can move to graceful shutdown (fatal error) in the compact block relay that I haven’t done here.

    Extra Note If want to see a similar error producing a crash, but in the wallet, go to #34176.

  2. DrahtBot commented at 3:30 pm on April 4, 2026: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK frankomosh
    Concept ACK pinheadmz

    If your review is incorrectly listed, please copy-paste <!–meta-tag:bot-skip–> into the comment that the bot should ignore.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #34858 (test: Use NodeClockContext in more tests by maflcko)
    • #33477 (Rollback for dumptxoutset without invalidating blocks by fjahr)
    • #32317 (kernel: Separate UTXO set access from validation functions by sedited)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    LLM Linter (✨ experimental)

    Possible typos and grammar issues:

    • Unless we have a I/O issue, ... -> Unless we have an I/O issue, ... [Article is wrong (“a” vs “an”), which can impede quick comprehension of the intended meaning.]

    • ... must not be swallowed by the P2P message handling general try-catch. -> ... must not be swallowed by the P2P message handling's general try-catch. [Missing possessive/connector makes the phrase unclear; it reads as if “handling general” is describing the wrong thing.]

    Possible places where named args for integral literals may be used (e.g. func(x, /*named_arg=*/0) in C++, and func(x, named_arg=0) in Python):

    • FlatFileSeq(data_dir, "a", 16 * 1024) in src/test/flatfile_tests.cpp (flatfile_open_readonly_missing_dir)
    • FlatFileSeq(data_dir, "a", 16 * 1024) in src/test/flatfile_tests.cpp (flatfile_open_write_creates_dir)
    • FlatFileSeq(dir / "blocks", "a", 16 * 1024) in src/test/flatfile_tests.cpp (flatfile_open_write_missing_unwritable_parent)
    • FlatFileSeq(data_dir, "a", 100) in src/test/flatfile_tests.cpp (flatfile_flush_unwritable_dir)
    • FlatFileSeq(data_dir, "a", 100) in src/test/flatfile_tests.cpp (flatfile_allocate_unwritable_dir)

    Possible places where comparison-specific test macros should replace generic comparisons:

    • [src/test/flatfile_tests.cpp] BOOST_CHECK_THROW(seq.Allocate(FlatFilePos(0, 0), 1, out_of_space), fs::filesystem_error); -> Prefer BOOST_CHECK_EXCEPTION(...) with a HasReason(...) matcher if you want to ensure the specific failure reason/message (rather than only the exception type).

    2026-04-05 18:48:22

  3. furszy force-pushed on Apr 4, 2026
  4. DrahtBot added the label CI failed on Apr 4, 2026
  5. FlatFile: do not throw for parent dir creation failure
    The callers of this function don't expect it to throw.
    
    For example, when the blocks directory becomes inaccessible
    (e.g. volume detach or permission changed), FlatFileSeq::Open's
    call to fs::create_directories() throws filesystem_error. On the
    block connection path, if this occurs during ActivateBestChain(),
    just after AcceptBlock() the exception bypasses FatalError entirely.
    It gets caught by ProcessMessage's general catch-all with only a
    NET-level debug log, effectively swallowing the error msg for
    regular users, and every subsequent block arrival retries the
    same failing read, leaving the node looking healthy while the
    chain is stuck.
    89318e74b6
  6. test: verify ActivateBestChain fatal errors on I/O failure
    Cover the two ActivateBestChain paths that read data from
    disk: ConnectTip (block connection) and DisconnectTip (reorg).
    
    Simulate inaccessible block and undo files, ensure fatal error
    and shutdown is triggered.
    c59592bbcb
  7. test: ensure consistent failure behavior in BlockManager reads/writes
    Verifies that BlockManager behaves the same way whether a block file is
    missing or the blocks directory is inaccessible.
    
    Reads (ReadBlock, ReadRawBlock, ReadBlockUndo) return false, writes
    (WriteBlock, WriteBlockUndo) return null/false, and failures are logged.
    
    The goal is consistent failure handling. Previously, missing block files
    returned false and logged an error, while an inaccessible blocks directory
    was throwing an exception.
    0aeece851d
  8. net: replace GETBLOCKTXN assert on block read error with graceful shutdown
    A failure here indicates a serious inconsistency or disk issue, and the
    current behavior is to abort immediately, which is correct but not the best.
    Replace assert with a fatal error so the node shuts down through the normal
    shutdown path instead.
    
    This does not change the outcome (the node still stops), but allows a more
    controlled teardown, giving other components a chance to flush their state
    even when a particular block (or the blocks dir) on disk is unreachable.
    E.g. the blocks dir might be the only affected, the mempool might be ok,
    and we don't want to lose such information.
    ba034f4db7
  9. furszy force-pushed on Apr 4, 2026
  10. test: add P2P coverage for block disk I/O failures
    Add tests to ensure that block disk I/O failures triggered
    through P2P requests are not silently swallowed.
    
    The idea is to simulate I/O errors by making the blocks directory
    temporarily inaccessible while the node is running. This allows block
    reads to fail when serving P2P requests.
    
    Two behaviors are exercised:
    
    - GETDATA: failures should be handled locally by disconnecting the
      peer and logging the error, without impacting node operation.
    
    - GETBLOCKTXN: failures when accessing recent blocks violate our
      "last 10 blocks must always be available for reorgs" rule and
      must result in a fatal error, rather than being caught and only
      logged by the P2P message handling general try-catch.
    72f7a67370
  11. furszy force-pushed on Apr 5, 2026
  12. DrahtBot removed the label CI failed on Apr 5, 2026
  13. pinheadmz commented at 6:02 pm on April 6, 2026: member
    concept ACK Thanks for the thorough PR description! Definitely helps me to understand the code changes before even looking at them. Can also see its minimal changes with lots of test coverage, will review.
  14. frankomosh commented at 4:45 am on April 8, 2026: contributor

    tACK 72f7a67370255cab1bf0dfcf39920dd23a3c268b

    Built from source. All new tests pass:

    • flatfile_tests: 9 cases, no errors
    • validation_chainstate_tests: 5 cases, no errors
    • blockmanager_tests: 8 cases, no errors
    • p2p_handle_io_errors.py: both scenarios pass

    Also manually reverted the GETBLOCKTXN fix (commit ba034f4) by removing the fatalError call and replacing with a silent return. The functional test correctly caught this kind of mutation. the test expected the node to shut down, but the mutant node stayed alive, causing a timeout failure. Mutant killed by p2p_handle_io_errors.py. Unit tests did not detect this mutation. Also process_messages and process_message fuzz harnesses both passed without detecting the mutation, confirming the functional test is providing the critical oracle here.

    Line 5915 in SendMessages has assert(ret) after a ReadBlock call in the compact block announcement path. I think this is the same pattern as the GETBLOCKTXN assert fixed in commit 4. I believe this is the “another assertion scenario in compact block relay” mentioned in the PR description. Is this planned as a follow-up? Happy to take it if so.

  15. DrahtBot requested review from pinheadmz on Apr 8, 2026
  16. in test/functional/p2p_handle_io_errors.py:109 in 72f7a67370
    104+        if platform.system() == 'Windows' or os.geteuid() == 0:
    105+           self.log.warning("Skipping test: unable to enforce dir permissions")
    106+           return
    107+
    108+        self.test_getdata_on_broken_fs()
    109+        self.test_getblocktxn_fatal_on_block_io_error()
    


    maflcko commented at 8:36 am on April 8, 2026:

    Is there a reason why the p2p block msg flow is not tested end-to-end? I understand that the end-to-end behavior is already correct for it, and that there are unit tests, but they only look at the problem from a zoomed-in-on-flatfile perspective.

    I mention it, because the error flow happens in reality. I’ve last seen it in #34592 (comment).

    So I think it could make sense to test it, to avoid regressing on it.

    Also, I think, it could make sense to make the first commit of the pull request the functional test. This way, it is easier to see what the end-to-end behavior changes are, without having to manually undo commits or patches.

    Feel free to grab the test from this diff:

     0diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
     1index 7b3f164ebe..309f4bae63 100755
     2--- a/test/functional/p2p_handle_io_errors.py
     3+++ b/test/functional/p2p_handle_io_errors.py
     4@@ -12,2 +12,6 @@ import stat
     5 
     6+from test_framework.blocktools import (
     7+    create_block,
     8+    create_coinbase,
     9+)
    10 from test_framework.messages import (
    11@@ -17,2 +21,3 @@ from test_framework.messages import (
    12     MSG_WITNESS_FLAG,
    13+    msg_block,
    14     msg_getblocktxn,
    15@@ -46,2 +51,19 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
    16 
    17+    def test_block_connect_on_broken_fs(self):
    18+        self.log.info("Test block connect on inaccessible filesystem")
    19+        node = self.nodes[0]
    20+        peer = node.add_p2p_connection(P2PInterface())
    21+        block = create_block(tmpl=node.getblocktemplate({"rules": ["segwit"]}))
    22+        block.solve()
    23+
    24+        with simulate_io_error(node.blocks_path):
    25+            with node.assert_debug_log(expected_msgs=["AcceptBlock FAILED (System error while saving block to disk"]):
    26+                peer.send_without_ping(msg_block(block))
    27+                peer.wait_for_disconnect()
    28+
    29+            node.wait_until_stopped(
    30+                expect_error=True,
    31+                expected_stderr=re.compile(r"fatal internal error"),
    32+            )
    33+
    34     def test_getdata_on_broken_fs(self):
    35@@ -104,5 +126,6 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
    36         if platform.system() == 'Windows' or os.geteuid() == 0:
    37-           self.log.warning("Skipping test: unable to enforce dir permissions")
    38-           return
    39+            self.log.warning("Skipping test: unable to enforce dir permissions")
    40+            return
    41 
    42+        self.test_block_connect_on_broken_fs()
    43         self.test_getdata_on_broken_fs()
    
  17. maflcko commented at 8:39 am on April 8, 2026: member
    concept ack, left a test suggestion, which could make review easier
  18. DrahtBot added the label Needs rebase on Apr 9, 2026
  19. DrahtBot commented at 10:52 pm on April 9, 2026: contributor
    🐙 This pull request conflicts with the target branch and needs rebase.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-10 21:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me