validation: improve block data I/O error handling in P2P paths #35003

pull furszy wants to merge 5 commits into bitcoin:master from furszy:2026_abc_io_exception changing 10 files +489 −3
  1. furszy commented at 3:29 PM on April 4, 2026: member

    Early note: the majority of this PR consists of test coverage. The changes per se are quite small, just the issue shows up in multiple paths in slightly different forms.

    The goal of the PR is to ensure I/O errors are not silently ignored during validation or P2P message handling. In some cases, such as GETDATA or GETBLOCKTXN requests, errors can be swallowed, leaving the node unresponsive to the request without any action. In a harder to reach but more severe case, a swallowed error can leave the node alive but stuck, unable to process new blocks and advance the chain. Also, silently ignoring errors obviously makes problems harder to diagnose. So this PR seeks to improve all of that.

    Currently, the same root cause, inability to access the blocks directory, can produce different behaviors depending on where it occurs:


    1. If it happens during block connection, inside AcceptBlock(), the error gets caught and triggers a fatal error graceful shutdown. This is the correct behavior.

    2. If it happens during block connection (due to the unguarded FlatFileSeq::Open -> fs::create_directories call), after AcceptBlock(), inside ActivateBestChain(), the error gets thrown and swallowed by the P2P general try-catch, leaving the node in a living but stuck state. Where ActivateBestChain() will always silently fail on any posterior block processing, retrying to active the failing block.

    3. If it happens during GETDATA request, the file system error gets swallowed by the message handling general try-catch, only logging a net-debug-level message.
The remote peer is not disconnected, nor the node aborts. It just silently ignores the request. This is different to a missing block data, which currently logs the error + disconnects the peer.

    4. If it happens during GETBLOCKTXN request, the file system error either gets swallowed by the message handling general try-catch or crashes the system via an assertion, depending on if the problem is at the blocks directory level or at the block data level, which is inconsistent.

    So, this PR makes failure handling consistent and ensures the node never enters a stuck state due to a block I/O error. Changes:

    1. GETDATA now treats the issue in the same way as other I/O errors: it logs the error + disconnect the remote peer.
    2. GETBLOCKTXN now triggers a fatal error and graceful shutdown. The error here is stricter than in GETDATA because it can only access the last 10 blocks, which must be available for reorgs.
    3. ActivateBestChain() now triggers a fatal error and graceful shutdown when it happens. Fixing the silent stuck node state scenario.

    Testing Note Can reproduce the different behaviors that cause the stuck node scenario by running the second commit’s unit test 28af683b5dfdce88892e367b308857b94ec630fe or by manually introducing an exception in ConnectTip() (which occurs inside ActivateBestChain()). Previously, the thrown exception would have being caught by the general P2P try-catch and logged only at net-debug level, which most users do not have enabled, leaving the node alive but stuck, as subsequent block connections repeatedly attempt to process the failing block, throw the exception again, and get silently swallowed. Adding a functional test for this scenario is not possible due to the requirement of failing inside ActivateBestChain(), which always occurs after AcceptBlock(), which correctly captures the error.

    Separate Note There is another assertion scenario that we can move to graceful shutdown (fatal error) in the compact block relay that I haven't done here.

    Extra Note If want to see a similar error producing a crash, but in the wallet, go to #34176.

  2. DrahtBot commented at 3:30 PM on April 4, 2026: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--021abf342d371248e50ceaed478a90ca-->

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK pinheadmz, rkrux
    Stale ACK maflcko, frankomosh

    If your review is incorrectly listed, please copy-paste <code>&lt;!--meta-tag:bot-skip--&gt;</code> into the comment that the bot should ignore.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #35139 (test: Add thread-safe fast-failing test macros by maflcko)
    • #32317 (kernel: Separate UTXO set access from validation functions by sedited)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    <!--5faf32d7da4f0f540f40219e4f7537a3-->

    LLM Linter (✨ experimental)

    Possible typos and grammar issues:

    • “must not be swallowed by the P2P message handling general try-catch.” -> “must not be swallowed by the P2P message handling try-catch.” [“general” is extraneous/grammatically wrong here and makes the phrase unclear.]

    Possible places where comparison-specific test macros should replace generic comparisons:

    • [src/test/flatfile_tests.cpp] BOOST_CHECK_THROW(seq.Allocate(FlatFilePos(0, 0), 1, out_of_space), fs::filesystem_error); -> Use BOOST_CHECK_EXCEPTION(..., fs::filesystem_error, HasReason(<exact/substring message>)) (or otherwise validate the filesystem_error.what() text) instead of only checking the generic exception type.

    <sup>2026-04-20 14:44:05</sup>

  3. furszy force-pushed on Apr 4, 2026
  4. DrahtBot added the label CI failed on Apr 4, 2026
  5. furszy force-pushed on Apr 4, 2026
  6. furszy force-pushed on Apr 5, 2026
  7. DrahtBot removed the label CI failed on Apr 5, 2026
  8. pinheadmz commented at 6:02 PM on April 6, 2026: member

    concept ACK Thanks for the thorough PR description! Definitely helps me to understand the code changes before even looking at them. Can also see its minimal changes with lots of test coverage, will review.

  9. frankomosh commented at 4:45 AM on April 8, 2026: contributor

    tACK 72f7a67370255cab1bf0dfcf39920dd23a3c268b

    Built from source. All new tests pass:

    • flatfile_tests: 9 cases, no errors
    • validation_chainstate_tests: 5 cases, no errors
    • blockmanager_tests: 8 cases, no errors
    • p2p_handle_io_errors.py: both scenarios pass

    Also manually reverted the GETBLOCKTXN fix (commit ba034f4) by removing the fatalError call and replacing with a silent return. The functional test correctly caught this kind of mutation. the test expected the node to shut down, but the mutant node stayed alive, causing a timeout failure. Mutant killed by p2p_handle_io_errors.py. Unit tests did not detect this mutation. Also process_messages and process_message fuzz harnesses both passed without detecting the mutation, confirming the functional test is providing the critical oracle here.

    Line 5915 in SendMessages has assert(ret) after a ReadBlock call in the compact block announcement path. I think this is the same pattern as the GETBLOCKTXN assert fixed in commit 4. I believe this is the "another assertion scenario in compact block relay" mentioned in the PR description. Is this planned as a follow-up? Happy to take it if so.

  10. DrahtBot requested review from pinheadmz on Apr 8, 2026
  11. in test/functional/p2p_handle_io_errors.py:109 in 72f7a67370 outdated
     104 | +        if platform.system() == 'Windows' or os.geteuid() == 0:
     105 | +           self.log.warning("Skipping test: unable to enforce dir permissions")
     106 | +           return
     107 | +
     108 | +        self.test_getdata_on_broken_fs()
     109 | +        self.test_getblocktxn_fatal_on_block_io_error()
    


    maflcko commented at 8:36 AM on April 8, 2026:

    Is there a reason why the p2p block msg flow is not tested end-to-end? I understand that the end-to-end behavior is already correct for it, and that there are unit tests, but they only look at the problem from a zoomed-in-on-flatfile perspective.

    I mention it, because the error flow happens in reality. I've last seen it in #34592 (comment).

    So I think it could make sense to test it, to avoid regressing on it.

    Also, I think, it could make sense to make the first commit of the pull request the functional test. This way, it is easier to see what the end-to-end behavior changes are, without having to manually undo commits or patches.

    Feel free to grab the test from this diff:

    diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
    index 7b3f164ebe..309f4bae63 100755
    --- a/test/functional/p2p_handle_io_errors.py
    +++ b/test/functional/p2p_handle_io_errors.py
    @@ -12,2 +12,6 @@ import stat
     
    +from test_framework.blocktools import (
    +    create_block,
    +    create_coinbase,
    +)
     from test_framework.messages import (
    @@ -17,2 +21,3 @@ from test_framework.messages import (
         MSG_WITNESS_FLAG,
    +    msg_block,
         msg_getblocktxn,
    @@ -46,2 +51,19 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
     
    +    def test_block_connect_on_broken_fs(self):
    +        self.log.info("Test block connect on inaccessible filesystem")
    +        node = self.nodes[0]
    +        peer = node.add_p2p_connection(P2PInterface())
    +        block = create_block(tmpl=node.getblocktemplate({"rules": ["segwit"]}))
    +        block.solve()
    +
    +        with simulate_io_error(node.blocks_path):
    +            with node.assert_debug_log(expected_msgs=["AcceptBlock FAILED (System error while saving block to disk"]):
    +                peer.send_without_ping(msg_block(block))
    +                peer.wait_for_disconnect()
    +
    +            node.wait_until_stopped(
    +                expect_error=True,
    +                expected_stderr=re.compile(r"fatal internal error"),
    +            )
    +
         def test_getdata_on_broken_fs(self):
    @@ -104,5 +126,6 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
             if platform.system() == 'Windows' or os.geteuid() == 0:
    -           self.log.warning("Skipping test: unable to enforce dir permissions")
    -           return
    +            self.log.warning("Skipping test: unable to enforce dir permissions")
    +            return
     
    +        self.test_block_connect_on_broken_fs()
             self.test_getdata_on_broken_fs()
    

    furszy commented at 8:03 PM on April 11, 2026:

    Sure, thanks! Test taken, and also reworked the branch per suggestion.

  12. maflcko commented at 8:39 AM on April 8, 2026: member

    concept ack, left a test suggestion, which could make review easier

  13. DrahtBot added the label Needs rebase on Apr 9, 2026
  14. furszy force-pushed on Apr 11, 2026
  15. DrahtBot removed the label Needs rebase on Apr 11, 2026
  16. furszy force-pushed on Apr 11, 2026
  17. DrahtBot added the label CI failed on Apr 11, 2026
  18. furszy commented at 8:05 PM on April 11, 2026: member

    Line 5915 in SendMessages has assert(ret) after a ReadBlock call in the compact block announcement path. I think this is the same pattern as the GETBLOCKTXN assert fixed in commit 4. I believe this is the "another assertion scenario in compact block relay" mentioned in the PR description. Is this planned as a follow-up? Happy to take it if so.

    Yes @frankomosh. All yours :).

  19. furszy force-pushed on Apr 11, 2026
  20. in test/functional/p2p_handle_io_errors.py:81 in 2902fef4a7
      76 | +
      77 | +        peer = node.add_p2p_connection(P2PInterface())
      78 | +        block_hash = node.getblockhash(3)  # Not in recent cache
      79 | +        with simulate_io_error(node.blocks_path):
      80 | +            # Bug: we currently swallow the GETDATA request, never reply to the other party, and log issue under debug-NET
      81 | +            with node.assert_debug_log(expected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
    


    furszy commented at 10:30 PM on April 11, 2026:

    Note: intentionally not checking for a specific filesystem error msg, since it slightly vary across OS. Specific msg doesn't matter, this first commit just proves the error is swallowed in master.

  21. furszy force-pushed on Apr 12, 2026
  22. in test/functional/p2p_handle_io_errors.py:44 in 75a8a957a5
      39 | +    try:
      40 | +        yield
      41 | +    finally:
      42 | +        os.chmod(parent_dir, old_mode)
      43 | +        if blocks_bak.exists() and not blocks_path.exists():
      44 | +            os.rename(blocks_bak, blocks_path)
    


    maflcko commented at 10:14 AM on April 13, 2026:

    nit in 75a8a957a5181a3cebfc9d3dd064330d52c51dbb: Is there a reason for this guard? The following seems stricter and what we want in this test. It also passes locally for me on this commit:

    diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
    index 4438ad1dd5..5e05e9b768 100755
    --- a/test/functional/p2p_handle_io_errors.py
    +++ b/test/functional/p2p_handle_io_errors.py
    @@ -40,8 +40,7 @@ def simulate_io_error(blocks_path):
             yield
         finally:
             os.chmod(parent_dir, old_mode)
    -        if blocks_bak.exists() and not blocks_path.exists():
    -            os.rename(blocks_bak, blocks_path)
    +        os.rename(blocks_bak, blocks_path)
     
     class P2PBlockIOErrorTest(BitcoinTestFramework):
         def set_test_params(self):
    

    furszy commented at 4:07 PM on April 13, 2026:

    Not anymore. I had more tests here before. Will drop it. Thanks.

  23. in test/functional/p2p_handle_io_errors.py:52 in 825dfc2005
      47 | +    def set_test_params(self):
      48 | +        self.num_nodes = 1
      49 | +        self.setup_clean_chain = True
      50 | +
      51 | +    def test_block_connect_on_broken_fs(self):
      52 | +        self.log.info("Test block connect on inaccessible filesystem")
    


    maflcko commented at 10:59 AM on April 13, 2026:

    nit in 75a8a957a5181a3cebfc9d3dd064330d52c51dbb:

    From the pull description:

    If it happens during block connection, after AcceptBlock(), inside ActivateBestChain(), the file system error gets thrown and swallowed by the P2P general try-catch, leaving the node in a living but stuck state. Where ActivateBestChain() will always silently fail on any posterior block processing, retrying to active the failing block.

    You claim that a failure in ActivateBestChain will leave the node alive, but stuck. However, when writing a test for this, I found that it properly shut down:

    diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
    index 4438ad1dd5..4706dfebd4 100755
    --- a/test/functional/p2p_handle_io_errors.py
    +++ b/test/functional/p2p_handle_io_errors.py
    @@ -14,2 +14,3 @@ from test_framework.blocktools import (
         create_block,
    +    create_empty_fork,
     )
    @@ -49,5 +50,6 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
             self.setup_clean_chain = True
    +        self.extra_args = [['-fastprune']]
     
    -    def test_block_connect_on_broken_fs(self):
    -        self.log.info("Test block connect on inaccessible filesystem")
    +    def test_block_accept_on_broken_fs(self):
    +        self.log.info("Test block accept on inaccessible filesystem")
             node = self.nodes[0]
    @@ -70,2 +72,24 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
     
    +    def test_block_connect_on_broken_fs(self):
    +        self.log.info("Test block connect on inaccessible filesystem")
    +        node = self.nodes[0]
    +        peer = node.add_p2p_connection(P2PInterface())
    +        blocks = create_empty_fork(node, fork_length=2)
    +        large_block = self.generatetodescriptor(node, 1, f"raw({'55'*100_000})")[0]
    +        peer.send_and_ping(msg_block(blocks[0]))
    +        (node.blocks_path / 'blk00002.dat').unlink()
    +        node.getblock(large_block)
    +
    +        with node.assert_debug_log(expected_msgs=["ActivateBestChain failed (Failed to read block.)"]):
    +            peer.send_without_ping(msg_block(blocks[1]))
    +            peer.wait_for_disconnect()
    +
    +        node.wait_until_stopped(
    +            expect_error=True,
    +            expected_stderr=re.compile(r"fatal internal error"),
    +        )
    +
    +        # Restart node for next test
    +        self.start_node(0, extra_args=['-reindex'])
    +
         def test_getdata_on_broken_fs(self):
    @@ -125,2 +149,3 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
     
    +        self.test_block_accept_on_broken_fs()
             self.test_block_connect_on_broken_fs()
    

    I guess it could make sense to add the test?

    Also, it could make sense to clarify that point 2 in the pull request description refers to a thrown exception, presumably only from create_directories, and does not refer to any IO error?

    Just a nit, and up to you, if you want to keep this pull focussed only on the create_directories call, or be a more general io error handling improvement in P2P paths, with test coverage.


    furszy commented at 2:41 PM on April 13, 2026:

    You claim that a failure in ActivateBestChain will leave the node alive, but stuck. However, when writing a test for this, I found that it properly shut down.

    Yeah ok. Your test exercises a missing block file, not the inability to access the directory. If we want to replicate the stuck, the goal there should be to get the node stuck by making it throw inside the create_directories call.

    I had a test early in this branch that reproduced the "throw inside ActivateBestChain" through the P2P path in a deterministic way (which is not easy due to AcceptBlock happening first). But I did not like that it relied on another bug in reconsiderblock to trigger this one.

    Basically, the approach was to invalidate the tip, submit a fork, make the directory inaccessible, and then reconsider the block, which throws internally but resets the block validity flags (the operation is not atomically reverted upon failure). After that, sending any number of descendant blocks via P2P will be swallowed due to the inability to activate the very first block. But it was ugly to depend on reconsiderblock to trigger it.

    I guess it could make sense to add the test?

    Yes! for sure. Pulling it.

    Also, it could make sense to clarify that point 2 in the pull request description refers to a thrown exception, presumably only from create_directories, and does not refer to any IO error?

    Sure. Will do.


    maflcko commented at 8:00 AM on April 14, 2026:

    I had a test early in this branch that reproduced the "throw inside ActivateBestChain" through the P2P path in a deterministic way (which is not easy due to AcceptBlock happening first). But I did not like that it relied on another bug in reconsiderblock to trigger this one.

    Basically, the approach was to invalidate the tip, submit a fork, make the directory inaccessible, and then reconsider the block, which throws internally but resets the block validity flags (the operation is not atomically reverted upon failure). After that, sending any number of descendant blocks via P2P will be swallowed due to the inability to activate the very first block. But it was ugly to depend on reconsiderblock to trigger it.

    Ah interesting. I presume the bug still exists after this pull requests, as the ReconsiderBlock function is not changed here? Just wondering, as the bug seems unrelated to the bug fixed in this pull request. Also, I agree that such a test may be too spicy to include here.


    furszy commented at 2:19 PM on April 15, 2026:

    I presume the bug still exists after this pull requests, as the ReconsiderBlock function is not changed here?

    Would say it is partially fixed but haven't tried it. ReconsiderBlock internally calls ActivateBestChain, which triggers a fatal error after this PR, instead of throwing an exception. So it is a matter of whether we are flushing the block flags reset to disk prior to aborting or not. Because, if we do, the node would fail early on any following up startup.

  24. in test/functional/p2p_handle_io_errors.py:75 in 825dfc2005
      70 | +
      71 | +    def test_getdata_on_broken_fs(self):
      72 | +        """The node should not swallow GETDATA requests during I/O issues"""
      73 | +        self.log.info("Test GETDATA on inaccessible filesystem")
      74 | +        node = self.nodes[0]
      75 | +        self.generate(node, 6, sync_fun=self.no_op)
    


    maflcko commented at 11:08 AM on April 13, 2026:

    nit in the first commit: Any reason to disable something that is already skipped? Diff:

    diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
    index d75a9f1d2e..4706dfebd4 100755
    --- a/test/functional/p2p_handle_io_errors.py
    +++ b/test/functional/p2p_handle_io_errors.py
    @@ -96,7 +96,7 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
             """The node should not swallow GETDATA requests during I/O issues"""
             self.log.info("Test GETDATA on inaccessible filesystem")
             node = self.nodes[0]
    -        self.generate(node, 6, sync_fun=self.no_op)
    +        self.generate(node, 6)
     
             peer = node.add_p2p_connection(P2PInterface())
             block_hash = node.getblockhash(3)  # Not in recent cache
    

    furszy commented at 6:45 PM on April 13, 2026:

    Done as suggested.

  25. in test/functional/p2p_handle_io_errors.py:103 in 825dfc2005
      98 | +        self.log.info("Test GETBLOCKTXN triggers fatal error on inaccessible filesystem")
      99 | +        node = self.nodes[0]
     100 | +
     101 | +        # Generate two blocks so that the first one is evicted from the
     102 | +        # m_most_recent_block cache.
     103 | +        self.generate(node, 1, sync_fun=self.no_op)
    


    maflcko commented at 11:10 AM on April 13, 2026:

    nit: Same here about sync_fun.


    furszy commented at 6:44 PM on April 13, 2026:

    Done as suggested.

  26. maflcko commented at 11:14 AM on April 13, 2026: member

    Looks like the test fails on the second commit, or so?

    Also, provided some more test nits. Feel free to ignore.

    825dfc200520d503fec64346ed01a8b12ec8fa54~4 📭

    <details><summary>Show signature</summary>

    Signature:

    untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
    RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
    trusted comment: 825dfc200520d503fec64346ed01a8b12ec8fa54~4 📭
    HuyWCmP9dlHdaEd4PL2XkITSC+f4T6nxsowBmRtzkzVkthldhUIEhYd8dWE2Z+BGoWVYLPWjRRxIPvoLUhqCCg==
    

    </details>

  27. furszy commented at 2:18 PM on April 13, 2026: member

    Looks like the test fails on the second commit, or so?

    It seems the assertion error message is libc-dependent. Linux logs "Assertion `expr' failed" while MacOS logs "Assertion failed: (expr), ...".

    Will make the wait_until_stopped expected error a bit more general and check for "Assertion" wording only. That is more than enough to check the crash reason. The last commit makes it clear with the fatal error change anyway.

  28. furszy force-pushed on Apr 13, 2026
  29. DrahtBot removed the label CI failed on Apr 13, 2026
  30. in src/test/util/common.h:74 in a98cebe088
      69 | + * Simulate a filesystem access failure for `path` by temporarily making it
      70 | + * unavailable. The target is removed and its parent made inaccessible while
      71 | + * `fn()` executes, then everything is restored.
      72 | + */
      73 | +template <typename Fn>
      74 | +void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, Fn&& fn)
    


    maflcko commented at 8:00 AM on April 14, 2026:

    nit in a98cebe088a31933b5ee1c12e862f93165c32263: This function is pretty large, and there is no need to inline it, so it would be better to move it to the Cpp file:

    diff --git a/src/test/util/CMakeLists.txt b/src/test/util/CMakeLists.txt
    index d6864a2720..c0d47d157c 100644
    --- a/src/test/util/CMakeLists.txt
    +++ b/src/test/util/CMakeLists.txt
    @@ -5,6 +5,7 @@
     add_library(test_util STATIC EXCLUDE_FROM_ALL
       blockfilter.cpp
       coins.cpp
    +  common.cpp
       coverage.cpp
       json.cpp
       logging.cpp
    diff --git a/src/test/util/common.cpp b/src/test/util/common.cpp
    new file mode 100644
    index 0000000000..f547595531
    --- /dev/null
    +++ b/src/test/util/common.cpp
    @@ -0,0 +1,60 @@
    +// Copyright (c) The Bitcoin Core developers
    +// Distributed under the MIT software license, see the accompanying
    +// file COPYING or https://opensource.org/license/mit/.
    +
    +#include <test/util/common.h>
    +
    +void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn)
    +{
    +#ifdef WIN32
    +    // On Windows, any open file (such as the db) prevents directory renaming,
    +    // so we can't simulate a filesystem error in this platform. Skip it.
    +    return;
    +#else
    +    // This check relies on filesystem permission manipulation to simulate I/O
    +    // failures. Running as root bypasses permission checks, so the OS will
    +    // allow directory creation and file writes even when perms are set to
    +    // none, making it impossible to simulate the expected failures.
    +    if (getuid() == 0) return;
    + #endif
    +
    +    const fs::path root = fs::weakly_canonical(test_root_dir);
    +    const fs::path target = fs::weakly_canonical(path);
    +
    +    // Simple sanity check: ensure target is inside the test directory.
    +    if (!fs::PathToString(target).starts_with(fs::PathToString(root))) {
    +        throw std::runtime_error("Path escapes test root directory");
    +    }
    +
    +    if (!fs::exists(target)) {
    +        throw std::runtime_error("Path does not exist");
    +    }
    +
    +    const fs::path parent = target.parent_path();
    +    const fs::path backup = parent / fs::PathFromString(target.filename().string() + "_bak");
    +
    +    // Make the target disappear (file or directory)
    +    fs::rename(target, backup);
    +
    +    const auto old_perms = fs::status(parent).permissions();
    +
    +    // Helper to restore state (permissions + original path)
    +    auto restore = [&]() {
    +        fs::permissions(parent, old_perms, fs::perm_options::replace);
    +        if (fs::exists(backup) && !fs::exists(target)) {
    +            fs::rename(backup, target);
    +        }
    +    };
    +
    +    // Block any access and dir recreation under the parent directory
    +    fs::permissions(parent, fs::perms::none, fs::perm_options::replace);
    +
    +    try {
    +        fn(); // run test under simulated failure
    +    } catch (...) {
    +        restore();
    +        throw;
    +    }
    +
    +    restore();
    +}
    diff --git a/src/test/util/common.h b/src/test/util/common.h
    index 85e62b7ef7..06a972ced8 100644
    --- a/src/test/util/common.h
    +++ b/src/test/util/common.h
    @@ -70,60 +70,6 @@ inline std::ostream& operator<<(std::ostream& os, const T& obj)
      * unavailable. The target is removed and its parent made inaccessible while
      * `fn()` executes, then everything is restored.
      */
    -template <typename Fn>
    -void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, Fn&& fn)
    -{
    -#ifdef WIN32
    -    // On Windows, any open file (such as the db) prevents directory renaming,
    -    // so we can't simulate a filesystem error in this platform. Skip it.
    -    return;
    -#else
    -    // This check relies on filesystem permission manipulation to simulate I/O
    -    // failures. Running as root bypasses permission checks, so the OS will
    -    // allow directory creation and file writes even when perms are set to
    -    // none, making it impossible to simulate the expected failures.
    -    if (getuid() == 0) return;
    - #endif
    -
    -    const fs::path root = fs::weakly_canonical(test_root_dir);
    -    const fs::path target = fs::weakly_canonical(path);
    -
    -    // Simple sanity check: ensure target is inside the test directory.
    -    if (!fs::PathToString(target).starts_with(fs::PathToString(root))) {
    -        throw std::runtime_error("Path escapes test root directory");
    -    }
    -
    -    if (!fs::exists(target)) {
    -        throw std::runtime_error("Path does not exist");
    -    }
    -
    -    const fs::path parent = target.parent_path();
    -    const fs::path backup = parent / fs::PathFromString(target.filename().string() + "_bak");
    -
    -    // Make the target disappear (file or directory)
    -    fs::rename(target, backup);
    -
    -    const auto old_perms = fs::status(parent).permissions();
    -
    -    // Helper to restore state (permissions + original path)
    -    auto restore = [&]() {
    -        fs::permissions(parent, old_perms, fs::perm_options::replace);
    -        if (fs::exists(backup) && !fs::exists(target)) {
    -            fs::rename(backup, target);
    -        }
    -    };
    -
    -    // Block any access and dir recreation under the parent directory
    -    fs::permissions(parent, fs::perms::none, fs::perm_options::replace);
    -
    -    try {
    -        fn(); // run test under simulated failure
    -    } catch (...) {
    -        restore();
    -        throw;
    -    }
    -
    -    restore();
    -}
    +void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn);
     
     #endif // BITCOIN_TEST_UTIL_COMMON_H
    

    furszy commented at 2:39 PM on April 14, 2026:

    Sure, taken. Thanks!

  31. in src/test/flatfile_tests.cpp:139 in a98cebe088
     130 | @@ -130,4 +131,71 @@ BOOST_AUTO_TEST_CASE(flatfile_flush)
     131 |      BOOST_CHECK_EQUAL(fs::file_size(seq.FileName(FlatFilePos(0, 1))), 1U);
     132 |  }
     133 |  
     134 | +BOOST_AUTO_TEST_CASE(flatfile_open_readonly_missing_dir)
     135 | +{
     136 | +    // Open file with read_only=true on a directory that does not exist.
     137 | +    // File should be null instead of throwing an exception.
     138 | +    const auto data_dir = m_args.GetDataDirBase() / "nonexistent";
     139 | +    const FlatFileSeq seq(data_dir, "a", 16 * 1024);
    


    maflcko commented at 8:10 AM on April 14, 2026:

    nit in a98cebe088a31933b5ee1c12e862f93165c32263 : It is a bit confusing to see all those oddly specific (and different) numbers for the chunk size, when none of them matter. It would be better to just define a dummy once and use it in all places that don't matter:

    diff --git a/src/test/flatfile_tests.cpp b/src/test/flatfile_tests.cpp
    index e31a0773d1..e33c12265b 100644
    --- a/src/test/flatfile_tests.cpp
    +++ b/src/test/flatfile_tests.cpp
    @@ -13,2 +13,4 @@
     
    +static constexpr uint32_t DUMMY_CHUNK_SZ{1};
    +
     BOOST_FIXTURE_TEST_SUITE(flatfile_tests, BasicTestingSetup)
    @@ -138,3 +140,3 @@ BOOST_AUTO_TEST_CASE(flatfile_open_readonly_missing_dir)
         const auto data_dir = m_args.GetDataDirBase() / "nonexistent";
    -    const FlatFileSeq seq(data_dir, "a", 16 * 1024);
    +    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
     
    @@ -149,3 +151,3 @@ BOOST_AUTO_TEST_CASE(flatfile_open_write_creates_dir)
         const auto data_dir = m_args.GetDataDirBase() / "new_write_dir";
    -    const FlatFileSeq seq(data_dir, "a", 16 * 1024);
    +    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
     
    @@ -165,3 +167,3 @@ BOOST_AUTO_TEST_CASE(flatfile_open_write_missing_unwritable_parent)
         SimulateFileSystemError(m_path_root, dir, [&dir]() {
    -        const FlatFileSeq seq(dir / "blocks", "a", 16 * 1024);
    +        const FlatFileSeq seq(dir / "blocks", "a", DUMMY_CHUNK_SZ);
             const AutoFile file{seq.Open(FlatFilePos(0, 0), /*read_only=*/false)};
    @@ -176,3 +178,3 @@ BOOST_AUTO_TEST_CASE(flatfile_flush_unwritable_dir)
         fs::create_directories(data_dir);
    -    const FlatFileSeq seq(data_dir, "a", 100);
    +    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
     
    @@ -190,3 +192,3 @@ BOOST_AUTO_TEST_CASE(flatfile_allocate_unwritable_dir)
         fs::create_directories(data_dir);
    -    const FlatFileSeq seq(data_dir, "a", 100);
    +    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
     
     
    

    furszy commented at 2:39 PM on April 14, 2026:

    Sure, taken. Thanks!

  32. in test/functional/p2p_handle_io_errors.py:106 in a98cebe088
      99 | @@ -100,12 +100,14 @@ def test_getdata_on_broken_fs(self):
     100 |          peer = node.add_p2p_connection(P2PInterface())
     101 |          block_hash = node.getblockhash(3)  # Not in recent cache
     102 |          with simulate_io_error(node.blocks_path):
     103 | -            # Bug: we currently swallow the GETDATA request, never reply to the other party, and log issue under debug-NET
     104 | -            with node.assert_debug_log(expected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
     105 | +            # Request block and expect disconnection. The node must not swallow the exception and do nothing.
     106 | +            with node.assert_debug_log(expected_msgs=["Cannot load block from disk"],
     107 | +                                       unexpected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
     108 |                                         timeout=30):
    


    maflcko commented at 8:28 AM on April 14, 2026:

    nit in a98cebe088a31933b5ee1c12e862f93165c32263: Now that the disconnect happens, the timeout here is not needed and confusing and should be removed. Also, I think the unexpected_msgs should be removed, as they depend on the exact log formatting, which may change in the future, so they will likely go silently stale anyway, and the disconnection is what matters here and is already checked.

    (Same in other places in this commit)

    Suggested diff:

    diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
    index 26c1ed7b51..354f7a6c4c 100755
    --- a/test/functional/p2p_handle_io_errors.py
    +++ b/test/functional/p2p_handle_io_errors.py
    @@ -102,6 +102,4 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
             with simulate_io_error(node.blocks_path):
    -            # Request block and expect disconnection. The node must not swallow the exception and do nothing.
    -            with node.assert_debug_log(expected_msgs=["Cannot load block from disk"],
    -                                       unexpected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
    -                                       timeout=30):
    +            # Request block and expect disconnection.
    +            with node.assert_debug_log(expected_msgs=["Cannot load block from disk"]):
                     peer.send_without_ping(msg_getdata([CInv(MSG_BLOCK | MSG_WITNESS_FLAG, int(block_hash, 16))]))
    @@ -133,7 +131,6 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
     
    -            # Request block and expect crash. The node must not swallow the exception and do nothing.
    -            with node.assert_debug_log(expected_msgs=["Unable to open file"],
    -                                       unexpected_msgs=["[ProcessMessages] [net] ProcessMessages(getblocktxn, 34 bytes): Exception 'filesystem error"],
    -                                       timeout=10):
    +            # Request block and expect crash+disconnect.
    +            with node.assert_debug_log(expected_msgs=["Unable to open file"]):
                     peer.send_without_ping(gbtn)
    +                peer.wait_for_disconnect()
     
    @@ -143,3 +140,2 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
                     expected_stderr=re.compile("Assertion"), # assertion crash
    -                timeout=30,
                 )
    

    furszy commented at 2:40 PM on April 14, 2026:

    looks good!, taken. Thanks

  33. in src/test/util/common.h:111 in a98cebe088
     106 | +    const auto old_perms = fs::status(parent).permissions();
     107 | +
     108 | +    // Helper to restore state (permissions + original path)
     109 | +    auto restore = [&]() {
     110 | +        fs::permissions(parent, old_perms, fs::perm_options::replace);
     111 | +        if (fs::exists(backup) && !fs::exists(target)) {
    


    maflcko commented at 8:35 AM on April 14, 2026:

    a98cebe088a31933b5ee1c12e862f93165c32263: same here: Any reason for the check?

    diff --git a/src/test/util/common.cpp b/src/test/util/common.cpp
    index f547595531..e89fe11e71 100644
    --- a/src/test/util/common.cpp
    +++ b/src/test/util/common.cpp
    @@ -43,5 +43,3 @@ void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path
             fs::permissions(parent, old_perms, fs::perm_options::replace);
    -        if (fs::exists(backup) && !fs::exists(target)) {
    -            fs::rename(backup, target);
    -        }
    +        fs::rename(backup, target);
         };
    

    furszy commented at 2:41 PM on April 14, 2026:

    not anymore. Removed. Thanks.

  34. in src/test/validation_chainstate_tests.cpp:191 in 63e4541495
     186 | +    // Disconnect block without changing validity so ActivateBestChain
     187 | +    // has a block to reconnect.
     188 | +    {
     189 | +        LOCK2(chainman->GetMutex(), chainstate.MempoolMutex());
     190 | +        BlockValidationState state;
     191 | +        BOOST_CHECK(chainstate.DisconnectTip(state, nullptr));
    


    maflcko commented at 9:14 AM on April 14, 2026:

    nit in 63e45414952b4cd155921b0cee20e3d3f97142a5: Could use named arg for /*disconnectpool=*/?


    furszy commented at 2:41 PM on April 14, 2026:

    Sure. Done.

  35. in src/test/blockmanager_tests.cpp:337 in 55ac4b4f2f
     332 | +        // create_directories fails so the file cannot be created.
     333 | +        {
     334 | +            ASSERT_DEBUG_LOG("Failed to write block");
     335 | +            CBlock dummy_block;
     336 | +            dummy_block.nVersion = 1;
     337 | +            const auto write_pos = blockman.WriteBlock(dummy_block, 999);
    


    maflcko commented at 9:30 AM on April 14, 2026:

    nit in 55ac4b4f2f389b588a1f372e506c2f5d63db701b: would be easier to read, if integral literal args used names. In this case, I guess it is /*nHeight=*/999?


    furszy commented at 2:41 PM on April 14, 2026:

    Sure. Done.

  36. in src/test/blockmanager_tests.cpp:356 in 55ac4b4f2f outdated
     351 | +            tip->nStatus |= BLOCK_HAVE_UNDO;
     352 | +        }
     353 | +    });
     354 | +
     355 | +    // Ensure we haven't corrupted any internal state during the failure.
     356 | +    // Check we can read the block again now that there is no dir access issue going on.
    


    maflcko commented at 9:36 AM on April 14, 2026:

    nit in 55ac4b4f2f389b588a1f372e506c2f5d63db701b : Should also check undo beside the block?

    Also, when assuming a pointer can not be null, it would be better to use a reference.

    Diff to achieve both:

    diff --git a/src/test/blockmanager_tests.cpp b/src/test/blockmanager_tests.cpp
    index 492cb518d0..b1ff1139ae 100644
    --- a/src/test/blockmanager_tests.cpp
    +++ b/src/test/blockmanager_tests.cpp
    @@ -309,3 +309,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
         auto& blockman = m_node.chainman->m_blockman;
    -    CBlockIndex* tip = WITH_LOCK(chainman->GetMutex(), return chainman->ActiveTip());
    +    CBlockIndex& tip{*WITH_LOCK(chainman->GetMutex(), return chainman->ActiveTip())};
     
    @@ -314,3 +314,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
                 ASSERT_DEBUG_LOG("OpenBlockFile failed");
    -            FlatFilePos tip_pos = WITH_LOCK(chainman->GetMutex(), return tip->GetBlockPos());
    +            FlatFilePos tip_pos = WITH_LOCK(chainman->GetMutex(), return tip.GetBlockPos());
                 BOOST_CHECK(!blockman.ReadRawBlock(tip_pos));
    @@ -321,3 +321,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
                 CBlock block;
    -            BOOST_CHECK(!blockman.ReadBlock(block, *tip));
    +            BOOST_CHECK(!blockman.ReadBlock(block, tip));
             }
    @@ -327,3 +327,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
                 CBlockUndo block_undo;
    -            BOOST_CHECK(!blockman.ReadBlockUndo(block_undo, *tip));
    +            BOOST_CHECK(!blockman.ReadBlockUndo(block_undo, tip));
             }
    @@ -345,8 +345,8 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
                 LOCK(chainman->GetMutex());
    -            tip->nStatus &= ~BLOCK_HAVE_UNDO;
    +            tip.nStatus &= ~BLOCK_HAVE_UNDO;
     
                 BlockValidationState state;
    -            BOOST_CHECK(!blockman.WriteBlockUndo(CBlockUndo{}, state, *tip));
    +            BOOST_CHECK(!blockman.WriteBlockUndo(CBlockUndo{}, state, tip));
     
    -            tip->nStatus |= BLOCK_HAVE_UNDO;
    +            tip.nStatus |= BLOCK_HAVE_UNDO;
             }
    @@ -357,3 +357,5 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
         CBlock recovered_block;
    -    BOOST_CHECK(blockman.ReadBlock(recovered_block, *tip));
    +    BOOST_CHECK(blockman.ReadBlock(recovered_block, tip));
    +    CBlockUndo block_undo;
    +    BOOST_CHECK(blockman.ReadBlockUndo(block_undo, tip));
     }
    

    furszy commented at 2:41 PM on April 14, 2026:

    Looks good!, taken. Thanks.

  37. in src/net_processing.cpp:4287 in 476fc0f899
    4283 | @@ -4284,7 +4284,11 @@ void PeerManagerImpl::ProcessMessage(Peer& peer, CNode& pfrom, const std::string
    4284 |              const bool ret{m_chainman.m_blockman.ReadBlock(block, block_pos, req.blockhash)};
    4285 |              // If height is above MAX_BLOCKTXN_DEPTH then this block cannot get
    4286 |              // pruned after we release cs_main above, so this read should never fail.
    4287 | -            assert(ret);
    4288 | +            // Unless we have a I/O issue, in such case, we want to shutdown gracefully.
    


    maflcko commented at 9:38 AM on April 14, 2026:

    nit in 476fc0f8991141f9d2d4f687b38b45b82a98bb39 from the llm:

    • a I/O issue -> an I/O issue [Grammatical article error; “I/O” starts with a vowel sound, so comprehension is improved by using “an”.]

    furszy commented at 2:42 PM on April 14, 2026:

    Done as suggested.

  38. maflcko approved
  39. maflcko commented at 9:41 AM on April 14, 2026: member

    Nice changes. I think the added tests are useful to have proper coverage for this. Also, the code changes move in the right direction. I also like that further places are pointed out where stuff can be improved.

    Everything looks great, and I've left more nits on the tests, given that all the code changes are tests. Feel free to ignore any of them, but I'd be happy to re-review all of them.

    lgtm 476fc0f8991141f9d2d4f687b38b45b82a98bb39 🏸

    <details><summary>Show signature</summary>

    Signature:

    untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
    RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
    trusted comment: lgtm 476fc0f8991141f9d2d4f687b38b45b82a98bb39 🏸
    XRrVpPFFIr4wq1HN1XUn/Y+JjL2oo7FUVxxSS8lD5F/L2WAqGj/ZewYzFssaE6uY8HeSbaD7nCh4rvbUeaOlBQ==
    

    </details>

  40. maflcko commented at 9:48 AM on April 14, 2026: member

    nit on the title: Maybe the prefix should be validation:, because it not only affects the p2p validation flow, but also the RPC one, and possibly the kernel one?

  41. furszy renamed this:
    net_processing: improve block data I/O error handling in P2P paths
    validation: improve block data I/O error handling in P2P paths
    on Apr 14, 2026
  42. DrahtBot added the label Validation on Apr 14, 2026
  43. furszy force-pushed on Apr 14, 2026
  44. furszy commented at 2:44 PM on April 14, 2026: member

    Updated per feedback and very nice suggestions. Thanks maflcko!

  45. maflcko commented at 4:07 PM on April 14, 2026: member

    Only change is the nits.

    review ACK f6c62f1bd4887d7865bf64c1fb399d47286fe531 🐟

    <details><summary>Show signature</summary>

    Signature:

    untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
    RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
    trusted comment: review ACK f6c62f1bd4887d7865bf64c1fb399d47286fe531 🐟
    QmQ331B1u4334Pm/xCen1sDl+6UdVg8nw67GYZXSKW4pVOWc8QaJLmqxz1b7wh8w5sUV++MOMmen3sKr+9kTDg==
    

    </details>

  46. DrahtBot requested review from frankomosh on Apr 14, 2026
  47. frankomosh commented at 1:54 PM on April 15, 2026: contributor

    reACK f6c62f1bd4887d7865bf64c1fb399d47286fe531.

  48. rkrux commented at 2:19 PM on April 15, 2026: contributor

    Concept ACK https://github.com/bitcoin/bitcoin/commit/f6c62f1bd4887d7865bf64c1fb399d47286fe531, will review.

    +1 for the detailed PR description and verbose commit messages that makes the reviewer feel like a story is unfolding in the PR.

  49. in test/functional/p2p_handle_io_errors.py:144 in c673aacfd7 outdated
     144 | +                expect_error=True,
     145 | +                expected_stderr=re.compile("Assertion"), # assertion crash
     146 | +            )
     147 | +
     148 | +        # Restart node for next test
     149 | +        self.start_node(0)
    


    rkrux commented at 1:45 PM on April 17, 2026:

    In c673aacfd78653a3d49cdcac355df3b6024c0608: this is the last test, is starting node here really required?


    furszy commented at 3:27 PM on April 17, 2026:

    Better to leave the test in a consistent state than force the next person working on it to figure out why their test isn't working.


    rkrux commented at 8:11 AM on April 18, 2026:

    than force the next person working on it to figure out why their test isn't working.

    That's thoughtful but it also adds marginal latency in the test - will leave it to your preference.

  50. in test/functional/p2p_handle_io_errors.py:77 in bd19cea308
      72 | +    def test_block_connect_on_broken_fs(self):
      73 | +        self.log.info("Test block connect on inaccessible filesystem")
      74 | +        node = self.nodes[0]
      75 | +        peer = node.add_p2p_connection(P2PInterface())
      76 | +        blocks = create_empty_fork(node, fork_length=2)
      77 | +        large_block = self.generatetodescriptor(node, 1, f"raw({'55'*100_000})")[0]
    


    rkrux commented at 1:46 PM on April 17, 2026:

    In bd19cea30865efc0c8a353843662090446f4aada: s/large_block/large_block_hex


    rkrux commented at 3:03 PM on April 17, 2026:

    Nit: s/({'55'100_000})/({'55'66_000}) can also work because the file size is 64kib for fast prune.


    furszy commented at 6:36 PM on April 17, 2026:

    string fixed, thx.

  51. in test/functional/p2p_handle_io_errors.py:124 in bd19cea308
     119 | +        self.log.info("Test GETBLOCKTXN triggers fatal error on inaccessible filesystem")
     120 | +        node = self.nodes[0]
     121 | +
     122 | +        # Generate two blocks so that the first one is evicted from the
     123 | +        # m_most_recent_block cache.
     124 | +        self.generate(node, 1)
    


    rkrux commented at 2:58 PM on April 17, 2026:

    In bd19cea30865efc0c8a353843662090446f4aada:

    Generate two blocks

    Isn't it generating one block?


    furszy commented at 6:35 PM on April 17, 2026:

    yep, fixed. Thanks.

  52. in test/functional/p2p_handle_io_errors.py:80 in bd19cea308
      75 | +        peer = node.add_p2p_connection(P2PInterface())
      76 | +        blocks = create_empty_fork(node, fork_length=2)
      77 | +        large_block = self.generatetodescriptor(node, 1, f"raw({'55'*100_000})")[0]
      78 | +        peer.send_and_ping(msg_block(blocks[0]))
      79 | +        (node.blocks_path / 'blk00002.dat').unlink()
      80 | +        node.getblock(large_block)
    


    rkrux commented at 3:13 PM on April 17, 2026:

    In bd19cea: I usually find these standalone calls unintuitive because no value is captured and no assertion done, often time it takes me a second to realise that this is there just to show that it works.

    Maybe add the following assertion as well to show that forked block is gone now.

    + assert_raises_rpc_error(-1, 'Block not found on disk', node.getblock, blocks[0].hash_hex)
    

    furszy commented at 6:35 PM on April 17, 2026:

    sure, pushed.

  53. rkrux commented at 3:18 PM on April 17, 2026: contributor

    Started reviewing at f6c62f1bd4887d7865bf64c1fb399d47286fe531, shared few comments in the test.

  54. furszy force-pushed on Apr 17, 2026
  55. furszy commented at 6:37 PM on April 17, 2026: member

    Updated per feedback. Test comments fixed. Thanks rkrux.

  56. in test/functional/p2p_handle_io_errors.py:143 in 3fec4bca85 outdated
     141 |  
     142 |              node.wait_until_stopped(
     143 | -                expected_ret_code=-6,
     144 |                  expect_error=True,
     145 | -                expected_stderr=re.compile("Assertion"), # assertion crash
     146 | +                expected_stderr=re.compile(r"fatal internal error"),
    


    rkrux commented at 8:10 AM on April 18, 2026:

    In c990e375dbbc5ac2d2ed701dbc0f4f51533d2650: A new error has been added in the commit and it seems like a missed opportunity that it's not asserted for in the corresponding test.

    -                expected_stderr=re.compile(r"fatal internal error"),
    +                expected_stderr=re.compile(r"Failed to read block during GETBLOCKTXN"),
    

    or

    -            with node.assert_debug_log(expected_msgs=["Unable to open file"]):
    +            with node.assert_debug_log(expected_msgs=["Unable to open file", "Failed to read block during GETBLOCKTXN"]):
    

    furszy commented at 2:43 PM on April 20, 2026:

    Done as suggested.

  57. DrahtBot added the label Needs rebase on Apr 19, 2026
  58. test: add P2P coverage for block disk I/O failures
    Add tests showing that block disk I/O failures triggered
    through P2P requests are silently swallowed.
    
    The idea is to simulate I/O errors by making the blocks directory
    temporarily inaccessible while the node is running. This allows block
    reads to fail when serving P2P requests.
    
    Aside from the properly behaving tests, this contains two cases were
    the node swallows the error:
    
    - GETDATA: failures should be handled locally by disconnecting the
      peer and logging the error, without impacting node operation.
    
    - GETBLOCKTXN: failures when accessing recent blocks violate our
      "last 10 blocks must always be available for reorgs" rule and
      must result in a fatal error, rather than being caught and only
      logged by the P2P message handling general try-catch.
    
    Co-authored-by: MarcoFalke <*~=`'#}+{/-|&$^_@721217.xyz>
    c4204d9137
  59. FlatFile: do not throw for parent dir creation failure
    The callers of this function don't expect it to throw.
    
    For example, when the blocks directory becomes inaccessible
    (e.g. volume detach or permission changed), FlatFileSeq::Open's
    call to fs::create_directories() throws filesystem_error. On the
    block connection path, if this occurs during ActivateBestChain(),
    just after AcceptBlock() the exception bypasses FatalError entirely.
    It gets caught by ProcessMessage's general catch-all with only a
    NET-level debug log, effectively swallowing the error msg for
    regular users, and every subsequent block arrival retries the
    same failing read, leaving the node looking healthy while the
    chain is stuck.
    d1fa303bd5
  60. test: verify ActivateBestChain fatal errors on I/O failure
    Cover the two ActivateBestChain paths that read data from
    disk: ConnectTip (block connection) and DisconnectTip (reorg).
    
    Simulate inaccessible block and undo files, ensure fatal error
    and shutdown is triggered.
    e718217909
  61. test: ensure consistent failure behavior in BlockManager reads/writes
    Verifies that BlockManager behaves the same way whether a block file is
    missing or the blocks directory is inaccessible.
    
    Reads (ReadBlock, ReadRawBlock, ReadBlockUndo) return false, writes
    (WriteBlock, WriteBlockUndo) return null/false, and failures are logged.
    
    The goal is consistent failure handling. Previously, missing block files
    returned false and logged an error, while an inaccessible blocks directory
    was throwing an exception.
    b1757671e8
  62. furszy force-pushed on Apr 19, 2026
  63. DrahtBot removed the label Needs rebase on Apr 19, 2026
  64. in test/functional/p2p_handle_io_errors.py:94 in c4204d9137 outdated
      89 | +            peer.wait_for_disconnect()
      90 | +
      91 | +        node.wait_until_stopped(
      92 | +            expect_error=True,
      93 | +            expected_stderr=re.compile(r"fatal internal error"),
      94 | +        )
    


    rkrux commented at 9:32 AM on April 20, 2026:

    In c4204d9137b66593a5d38c0e210dac863f6453d0: This new test that has been added seems fine on its own. However, the reviewer might try to relate it to the 2nd point in the PR description because the error here is of ActivateBestChain. But this testcase doesn't exactly test that point, which I found confusing. Can consider moving the following statement from the testing note of the description in the 2nd point itself there to make it obvious.

    Adding a functional test for this scenario is not possible due to the requirement of failing inside ActivateBestChain(), which always occurs after AcceptBlock(), which correctly captures the error.


    furszy commented at 2:37 PM on April 20, 2026:

    This behaves properly not because of the functions calls ordering, but because we are unlinking one single blk file and not the entire directory. The test that is above (test_block_accept_on_broken_fs()) is the one exercising the note you are mentioning.

  65. rkrux commented at 9:34 AM on April 20, 2026: contributor

    Overall lgtm at c990e375dbbc5ac2d2ed701dbc0f4f51533d2650

    Nit: Can consider switching the order of 3rd and 4th commits because latter is more closely related to the changes in 2nd commit as blockman methods are directly dependent on flatfile methods.

  66. net: replace GETBLOCKTXN assert on block read error with graceful shutdown
    A failure here indicates a serious inconsistency or disk issue, and the
    current behavior is to abort immediately, which is correct but not the best.
    Replace assert with a fatal error so the node shuts down through the normal
    shutdown path instead.
    
    This does not change the outcome (the node still stops), but allows a more
    controlled teardown, giving other components a chance to flush their state
    even when a particular block (or the blocks dir) on disk is unreachable.
    E.g. the blocks dir might be the only affected, the mempool might be ok,
    and we don't want to lose such information.
    65347588a1
  67. furszy force-pushed on Apr 20, 2026
  68. furszy commented at 2:47 PM on April 20, 2026: member

    Updated per feedback. Single test line diff to target a specific error msg during fatal error shutdown. Ready to go.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-05-02 12:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me