validation: improve block data I/O error handling in P2P paths #35003

pull furszy wants to merge 5 commits into bitcoin:master from furszy:2026_abc_io_exception changing 10 files +520 −3
  1. furszy commented at 3:29 PM on April 4, 2026: member

    Early note: the majority of this PR consists of test coverage. The changes per se are quite small, just the issue shows up in multiple paths in slightly different forms.

    The goal of the PR is to ensure I/O errors are not silently ignored during validation or P2P message handling. In some cases, such as GETDATA or GETBLOCKTXN requests, errors can be swallowed, leaving the node unresponsive to the request without any action. In a harder to reach but more severe case, a swallowed error can leave the node alive but stuck, unable to process new blocks and advance the chain. Also, silently ignoring errors obviously makes problems harder to diagnose. So this PR seeks to improve all of that.

    Currently, the same root cause, inability to access the blocks directory, can produce different behaviors depending on where it occurs:


    1. If it happens during block connection, inside AcceptBlock(), the error gets caught and triggers a fatal error graceful shutdown. This is the correct behavior.

    2. If it happens during block connection (due to the unguarded FlatFileSeq::Open -> fs::create_directories call), after AcceptBlock(), inside ActivateBestChain(), the error gets thrown and swallowed by the P2P general try-catch, leaving the node in a living but stuck state. Where ActivateBestChain() will always silently fail on any posterior block processing, retrying to active the failing block.

    3. If it happens during GETDATA request, the file system error gets swallowed by the message handling general try-catch, only logging a net-debug-level message.
The remote peer is not disconnected, nor the node aborts. It just silently ignores the request. This is different to a missing block data, which currently logs the error + disconnects the peer.

    4. If it happens during GETBLOCKTXN request, the file system error either gets swallowed by the message handling general try-catch or crashes the system via an assertion, depending on if the problem is at the blocks directory level or at the block data level, which is inconsistent.

    5. If it happens during GETCFILTERS request, the file system error gets swallowed by the message handling general try-catch.

    6. If this happens in any of the index functions, the node crashes.

    So, this PR makes failure handling consistent and ensures the node never enters a stuck state due to a block I/O error. Changes:

    1. GETDATA now treats the issue in the same way as other I/O errors: it logs the error + disconnect the remote peer.
    2. GETBLOCKTXN now triggers a fatal error and graceful shutdown. The error here is stricter than in GETDATA because it can only access the last 10 blocks, which must be available for reorgs.
    3. ActivateBestChain() now triggers a fatal error and graceful shutdown when it happens. Fixing the silent stuck node state scenario.
    4. GETCFILTERS now logs the error through the expected path: "Failed to find block filter in index". A more descriptive error message can be added in a follow-up PR.
    5. The index-related functions now ignore the error instead of crashing the node. This matches how the code was written, as these paths were not expecting OpenFile to throw. Better error handling can be added in a follow-up PR.

    Testing Note Can reproduce the different behaviors that cause the stuck node scenario by running the second commit’s unit test 28af683b5dfdce88892e367b308857b94ec630fe or by manually introducing an exception in ConnectTip() (which occurs inside ActivateBestChain()). Previously, the thrown exception would have being caught by the general P2P try-catch and logged only at net-debug level, which most users do not have enabled, leaving the node alive but stuck, as subsequent block connections repeatedly attempt to process the failing block, throw the exception again, and get silently swallowed. Adding a functional test for this scenario is not possible due to the requirement of failing inside ActivateBestChain(), which always occurs after AcceptBlock(), which correctly captures the error.

    Separate Note There is another assertion scenario that we can move to graceful shutdown (fatal error) in the compact block relay that I haven't done here.

    Extra Note If want to see a similar error producing a crash, but in the wallet, go to #34176.

  2. DrahtBot commented at 3:30 PM on April 4, 2026: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--006a51241073e994b41acfe9ec718e94-->

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/35003.

    <!--021abf342d371248e50ceaed478a90ca-->

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK pinheadmz, l0rinc, josibake
    Stale ACK frankomosh, maflcko, sedited, rkrux, w0xlt

    If your review is incorrectly listed, please copy-paste <code>&lt;!--meta-tag:bot-skip--&gt;</code> into the comment that the bot should ignore.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #35139 (test: Add thread-safe fast-failing test macros by maflcko)
    • #32317 (kernel: Separate UTXO set access from validation functions by sedited)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    <!--5faf32d7da4f0f540f40219e4f7537a3-->

    LLM Linter (✨ experimental)

    Possible places where comparison-specific test macros should replace generic comparisons:

    • [src/test/flatfile_tests.cpp] BOOST_CHECK_THROW(seq.Allocate(FlatFilePos(0, 0), 1, out_of_space), fs::filesystem_error); -> Consider BOOST_CHECK_EXCEPTION with a predicate that checks the exception message, so the test verifies the specific failure reason instead of only the exception type.

    <sup>2026-06-07 14:04:44</sup>

  3. furszy force-pushed on Apr 4, 2026
  4. DrahtBot added the label CI failed on Apr 4, 2026
  5. furszy force-pushed on Apr 4, 2026
  6. furszy force-pushed on Apr 5, 2026
  7. DrahtBot removed the label CI failed on Apr 5, 2026
  8. pinheadmz commented at 6:02 PM on April 6, 2026: member

    concept ACK Thanks for the thorough PR description! Definitely helps me to understand the code changes before even looking at them. Can also see its minimal changes with lots of test coverage, will review.

  9. frankomosh commented at 4:45 AM on April 8, 2026: contributor

    tACK 72f7a67370255cab1bf0dfcf39920dd23a3c268b

    Built from source. All new tests pass:

    • flatfile_tests: 9 cases, no errors
    • validation_chainstate_tests: 5 cases, no errors
    • blockmanager_tests: 8 cases, no errors
    • p2p_handle_io_errors.py: both scenarios pass

    Also manually reverted the GETBLOCKTXN fix (commit ba034f4) by removing the fatalError call and replacing with a silent return. The functional test correctly caught this kind of mutation. the test expected the node to shut down, but the mutant node stayed alive, causing a timeout failure. Mutant killed by p2p_handle_io_errors.py. Unit tests did not detect this mutation. Also process_messages and process_message fuzz harnesses both passed without detecting the mutation, confirming the functional test is providing the critical oracle here.

    Line 5915 in SendMessages has assert(ret) after a ReadBlock call in the compact block announcement path. I think this is the same pattern as the GETBLOCKTXN assert fixed in commit 4. I believe this is the "another assertion scenario in compact block relay" mentioned in the PR description. Is this planned as a follow-up? Happy to take it if so.

  10. DrahtBot requested review from pinheadmz on Apr 8, 2026
  11. in test/functional/p2p_handle_io_errors.py:109 in 72f7a67370 outdated
     104 | +        if platform.system() == 'Windows' or os.geteuid() == 0:
     105 | +           self.log.warning("Skipping test: unable to enforce dir permissions")
     106 | +           return
     107 | +
     108 | +        self.test_getdata_on_broken_fs()
     109 | +        self.test_getblocktxn_fatal_on_block_io_error()
    


    maflcko commented at 8:36 AM on April 8, 2026:

    Is there a reason why the p2p block msg flow is not tested end-to-end? I understand that the end-to-end behavior is already correct for it, and that there are unit tests, but they only look at the problem from a zoomed-in-on-flatfile perspective.

    I mention it, because the error flow happens in reality. I've last seen it in #34592 (comment).

    So I think it could make sense to test it, to avoid regressing on it.

    Also, I think, it could make sense to make the first commit of the pull request the functional test. This way, it is easier to see what the end-to-end behavior changes are, without having to manually undo commits or patches.

    Feel free to grab the test from this diff:

    diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
    index 7b3f164ebe..309f4bae63 100755
    --- a/test/functional/p2p_handle_io_errors.py
    +++ b/test/functional/p2p_handle_io_errors.py
    @@ -12,2 +12,6 @@ import stat
     
    +from test_framework.blocktools import (
    +    create_block,
    +    create_coinbase,
    +)
     from test_framework.messages import (
    @@ -17,2 +21,3 @@ from test_framework.messages import (
         MSG_WITNESS_FLAG,
    +    msg_block,
         msg_getblocktxn,
    @@ -46,2 +51,19 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
     
    +    def test_block_connect_on_broken_fs(self):
    +        self.log.info("Test block connect on inaccessible filesystem")
    +        node = self.nodes[0]
    +        peer = node.add_p2p_connection(P2PInterface())
    +        block = create_block(tmpl=node.getblocktemplate({"rules": ["segwit"]}))
    +        block.solve()
    +
    +        with simulate_io_error(node.blocks_path):
    +            with node.assert_debug_log(expected_msgs=["AcceptBlock FAILED (System error while saving block to disk"]):
    +                peer.send_without_ping(msg_block(block))
    +                peer.wait_for_disconnect()
    +
    +            node.wait_until_stopped(
    +                expect_error=True,
    +                expected_stderr=re.compile(r"fatal internal error"),
    +            )
    +
         def test_getdata_on_broken_fs(self):
    @@ -104,5 +126,6 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
             if platform.system() == 'Windows' or os.geteuid() == 0:
    -           self.log.warning("Skipping test: unable to enforce dir permissions")
    -           return
    +            self.log.warning("Skipping test: unable to enforce dir permissions")
    +            return
     
    +        self.test_block_connect_on_broken_fs()
             self.test_getdata_on_broken_fs()
    

    furszy commented at 8:03 PM on April 11, 2026:

    Sure, thanks! Test taken, and also reworked the branch per suggestion.

  12. maflcko commented at 8:39 AM on April 8, 2026: member

    concept ack, left a test suggestion, which could make review easier

  13. DrahtBot added the label Needs rebase on Apr 9, 2026
  14. furszy force-pushed on Apr 11, 2026
  15. DrahtBot removed the label Needs rebase on Apr 11, 2026
  16. furszy force-pushed on Apr 11, 2026
  17. DrahtBot added the label CI failed on Apr 11, 2026
  18. furszy commented at 8:05 PM on April 11, 2026: member

    Line 5915 in SendMessages has assert(ret) after a ReadBlock call in the compact block announcement path. I think this is the same pattern as the GETBLOCKTXN assert fixed in commit 4. I believe this is the "another assertion scenario in compact block relay" mentioned in the PR description. Is this planned as a follow-up? Happy to take it if so.

    Yes @frankomosh. All yours :).

  19. furszy force-pushed on Apr 11, 2026
  20. in test/functional/p2p_handle_io_errors.py:81 in 2902fef4a7
      76 | +
      77 | +        peer = node.add_p2p_connection(P2PInterface())
      78 | +        block_hash = node.getblockhash(3)  # Not in recent cache
      79 | +        with simulate_io_error(node.blocks_path):
      80 | +            # Bug: we currently swallow the GETDATA request, never reply to the other party, and log issue under debug-NET
      81 | +            with node.assert_debug_log(expected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
    


    furszy commented at 10:30 PM on April 11, 2026:

    Note: intentionally not checking for a specific filesystem error msg, since it slightly vary across OS. Specific msg doesn't matter, this first commit just proves the error is swallowed in master.

  21. furszy force-pushed on Apr 12, 2026
  22. in test/functional/p2p_handle_io_errors.py:44 in 75a8a957a5
      39 | +    try:
      40 | +        yield
      41 | +    finally:
      42 | +        os.chmod(parent_dir, old_mode)
      43 | +        if blocks_bak.exists() and not blocks_path.exists():
      44 | +            os.rename(blocks_bak, blocks_path)
    


    maflcko commented at 10:14 AM on April 13, 2026:

    nit in 75a8a957a5181a3cebfc9d3dd064330d52c51dbb: Is there a reason for this guard? The following seems stricter and what we want in this test. It also passes locally for me on this commit:

    diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
    index 4438ad1dd5..5e05e9b768 100755
    --- a/test/functional/p2p_handle_io_errors.py
    +++ b/test/functional/p2p_handle_io_errors.py
    @@ -40,8 +40,7 @@ def simulate_io_error(blocks_path):
             yield
         finally:
             os.chmod(parent_dir, old_mode)
    -        if blocks_bak.exists() and not blocks_path.exists():
    -            os.rename(blocks_bak, blocks_path)
    +        os.rename(blocks_bak, blocks_path)
     
     class P2PBlockIOErrorTest(BitcoinTestFramework):
         def set_test_params(self):
    

    furszy commented at 4:07 PM on April 13, 2026:

    Not anymore. I had more tests here before. Will drop it. Thanks.

  23. in test/functional/p2p_handle_io_errors.py:52 in 825dfc2005
      47 | +    def set_test_params(self):
      48 | +        self.num_nodes = 1
      49 | +        self.setup_clean_chain = True
      50 | +
      51 | +    def test_block_connect_on_broken_fs(self):
      52 | +        self.log.info("Test block connect on inaccessible filesystem")
    


    maflcko commented at 10:59 AM on April 13, 2026:

    nit in 75a8a957a5181a3cebfc9d3dd064330d52c51dbb:

    From the pull description:

    If it happens during block connection, after AcceptBlock(), inside ActivateBestChain(), the file system error gets thrown and swallowed by the P2P general try-catch, leaving the node in a living but stuck state. Where ActivateBestChain() will always silently fail on any posterior block processing, retrying to active the failing block.

    You claim that a failure in ActivateBestChain will leave the node alive, but stuck. However, when writing a test for this, I found that it properly shut down:

    diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
    index 4438ad1dd5..4706dfebd4 100755
    --- a/test/functional/p2p_handle_io_errors.py
    +++ b/test/functional/p2p_handle_io_errors.py
    @@ -14,2 +14,3 @@ from test_framework.blocktools import (
         create_block,
    +    create_empty_fork,
     )
    @@ -49,5 +50,6 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
             self.setup_clean_chain = True
    +        self.extra_args = [['-fastprune']]
     
    -    def test_block_connect_on_broken_fs(self):
    -        self.log.info("Test block connect on inaccessible filesystem")
    +    def test_block_accept_on_broken_fs(self):
    +        self.log.info("Test block accept on inaccessible filesystem")
             node = self.nodes[0]
    @@ -70,2 +72,24 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
     
    +    def test_block_connect_on_broken_fs(self):
    +        self.log.info("Test block connect on inaccessible filesystem")
    +        node = self.nodes[0]
    +        peer = node.add_p2p_connection(P2PInterface())
    +        blocks = create_empty_fork(node, fork_length=2)
    +        large_block = self.generatetodescriptor(node, 1, f"raw({'55'*100_000})")[0]
    +        peer.send_and_ping(msg_block(blocks[0]))
    +        (node.blocks_path / 'blk00002.dat').unlink()
    +        node.getblock(large_block)
    +
    +        with node.assert_debug_log(expected_msgs=["ActivateBestChain failed (Failed to read block.)"]):
    +            peer.send_without_ping(msg_block(blocks[1]))
    +            peer.wait_for_disconnect()
    +
    +        node.wait_until_stopped(
    +            expect_error=True,
    +            expected_stderr=re.compile(r"fatal internal error"),
    +        )
    +
    +        # Restart node for next test
    +        self.start_node(0, extra_args=['-reindex'])
    +
         def test_getdata_on_broken_fs(self):
    @@ -125,2 +149,3 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
     
    +        self.test_block_accept_on_broken_fs()
             self.test_block_connect_on_broken_fs()
    

    I guess it could make sense to add the test?

    Also, it could make sense to clarify that point 2 in the pull request description refers to a thrown exception, presumably only from create_directories, and does not refer to any IO error?

    Just a nit, and up to you, if you want to keep this pull focussed only on the create_directories call, or be a more general io error handling improvement in P2P paths, with test coverage.


    furszy commented at 2:41 PM on April 13, 2026:

    You claim that a failure in ActivateBestChain will leave the node alive, but stuck. However, when writing a test for this, I found that it properly shut down.

    Yeah ok. Your test exercises a missing block file, not the inability to access the directory. If we want to replicate the stuck, the goal there should be to get the node stuck by making it throw inside the create_directories call.

    I had a test early in this branch that reproduced the "throw inside ActivateBestChain" through the P2P path in a deterministic way (which is not easy due to AcceptBlock happening first). But I did not like that it relied on another bug in reconsiderblock to trigger this one.

    Basically, the approach was to invalidate the tip, submit a fork, make the directory inaccessible, and then reconsider the block, which throws internally but resets the block validity flags (the operation is not atomically reverted upon failure). After that, sending any number of descendant blocks via P2P will be swallowed due to the inability to activate the very first block. But it was ugly to depend on reconsiderblock to trigger it.

    I guess it could make sense to add the test?

    Yes! for sure. Pulling it.

    Also, it could make sense to clarify that point 2 in the pull request description refers to a thrown exception, presumably only from create_directories, and does not refer to any IO error?

    Sure. Will do.


    maflcko commented at 8:00 AM on April 14, 2026:

    I had a test early in this branch that reproduced the "throw inside ActivateBestChain" through the P2P path in a deterministic way (which is not easy due to AcceptBlock happening first). But I did not like that it relied on another bug in reconsiderblock to trigger this one.

    Basically, the approach was to invalidate the tip, submit a fork, make the directory inaccessible, and then reconsider the block, which throws internally but resets the block validity flags (the operation is not atomically reverted upon failure). After that, sending any number of descendant blocks via P2P will be swallowed due to the inability to activate the very first block. But it was ugly to depend on reconsiderblock to trigger it.

    Ah interesting. I presume the bug still exists after this pull requests, as the ReconsiderBlock function is not changed here? Just wondering, as the bug seems unrelated to the bug fixed in this pull request. Also, I agree that such a test may be too spicy to include here.


    furszy commented at 2:19 PM on April 15, 2026:

    I presume the bug still exists after this pull requests, as the ReconsiderBlock function is not changed here?

    Would say it is partially fixed but haven't tried it. ReconsiderBlock internally calls ActivateBestChain, which triggers a fatal error after this PR, instead of throwing an exception. So it is a matter of whether we are flushing the block flags reset to disk prior to aborting or not. Because, if we do, the node would fail early on any following up startup.

  24. in test/functional/p2p_handle_io_errors.py:75 in 825dfc2005
      70 | +
      71 | +    def test_getdata_on_broken_fs(self):
      72 | +        """The node should not swallow GETDATA requests during I/O issues"""
      73 | +        self.log.info("Test GETDATA on inaccessible filesystem")
      74 | +        node = self.nodes[0]
      75 | +        self.generate(node, 6, sync_fun=self.no_op)
    


    maflcko commented at 11:08 AM on April 13, 2026:

    nit in the first commit: Any reason to disable something that is already skipped? Diff:

    diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
    index d75a9f1d2e..4706dfebd4 100755
    --- a/test/functional/p2p_handle_io_errors.py
    +++ b/test/functional/p2p_handle_io_errors.py
    @@ -96,7 +96,7 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
             """The node should not swallow GETDATA requests during I/O issues"""
             self.log.info("Test GETDATA on inaccessible filesystem")
             node = self.nodes[0]
    -        self.generate(node, 6, sync_fun=self.no_op)
    +        self.generate(node, 6)
     
             peer = node.add_p2p_connection(P2PInterface())
             block_hash = node.getblockhash(3)  # Not in recent cache
    

    furszy commented at 6:45 PM on April 13, 2026:

    Done as suggested.

  25. in test/functional/p2p_handle_io_errors.py:103 in 825dfc2005
      98 | +        self.log.info("Test GETBLOCKTXN triggers fatal error on inaccessible filesystem")
      99 | +        node = self.nodes[0]
     100 | +
     101 | +        # Generate two blocks so that the first one is evicted from the
     102 | +        # m_most_recent_block cache.
     103 | +        self.generate(node, 1, sync_fun=self.no_op)
    


    maflcko commented at 11:10 AM on April 13, 2026:

    nit: Same here about sync_fun.


    furszy commented at 6:44 PM on April 13, 2026:

    Done as suggested.

  26. maflcko commented at 11:14 AM on April 13, 2026: member

    Looks like the test fails on the second commit, or so?

    Also, provided some more test nits. Feel free to ignore.

    825dfc200520d503fec64346ed01a8b12ec8fa54~4 📭

    <details><summary>Show signature</summary>

    Signature:

    untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
    RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
    trusted comment: 825dfc200520d503fec64346ed01a8b12ec8fa54~4 📭
    HuyWCmP9dlHdaEd4PL2XkITSC+f4T6nxsowBmRtzkzVkthldhUIEhYd8dWE2Z+BGoWVYLPWjRRxIPvoLUhqCCg==
    

    </details>

  27. furszy commented at 2:18 PM on April 13, 2026: member

    Looks like the test fails on the second commit, or so?

    It seems the assertion error message is libc-dependent. Linux logs "Assertion `expr' failed" while MacOS logs "Assertion failed: (expr), ...".

    Will make the wait_until_stopped expected error a bit more general and check for "Assertion" wording only. That is more than enough to check the crash reason. The last commit makes it clear with the fatal error change anyway.

  28. furszy force-pushed on Apr 13, 2026
  29. DrahtBot removed the label CI failed on Apr 13, 2026
  30. in src/test/util/common.h:74 in a98cebe088
      69 | + * Simulate a filesystem access failure for `path` by temporarily making it
      70 | + * unavailable. The target is removed and its parent made inaccessible while
      71 | + * `fn()` executes, then everything is restored.
      72 | + */
      73 | +template <typename Fn>
      74 | +void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, Fn&& fn)
    


    maflcko commented at 8:00 AM on April 14, 2026:

    nit in a98cebe088a31933b5ee1c12e862f93165c32263: This function is pretty large, and there is no need to inline it, so it would be better to move it to the Cpp file:

    diff --git a/src/test/util/CMakeLists.txt b/src/test/util/CMakeLists.txt
    index d6864a2720..c0d47d157c 100644
    --- a/src/test/util/CMakeLists.txt
    +++ b/src/test/util/CMakeLists.txt
    @@ -5,6 +5,7 @@
     add_library(test_util STATIC EXCLUDE_FROM_ALL
       blockfilter.cpp
       coins.cpp
    +  common.cpp
       coverage.cpp
       json.cpp
       logging.cpp
    diff --git a/src/test/util/common.cpp b/src/test/util/common.cpp
    new file mode 100644
    index 0000000000..f547595531
    --- /dev/null
    +++ b/src/test/util/common.cpp
    @@ -0,0 +1,60 @@
    +// Copyright (c) The Bitcoin Core developers
    +// Distributed under the MIT software license, see the accompanying
    +// file COPYING or https://opensource.org/license/mit/.
    +
    +#include <test/util/common.h>
    +
    +void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn)
    +{
    +#ifdef WIN32
    +    // On Windows, any open file (such as the db) prevents directory renaming,
    +    // so we can't simulate a filesystem error in this platform. Skip it.
    +    return;
    +#else
    +    // This check relies on filesystem permission manipulation to simulate I/O
    +    // failures. Running as root bypasses permission checks, so the OS will
    +    // allow directory creation and file writes even when perms are set to
    +    // none, making it impossible to simulate the expected failures.
    +    if (getuid() == 0) return;
    + #endif
    +
    +    const fs::path root = fs::weakly_canonical(test_root_dir);
    +    const fs::path target = fs::weakly_canonical(path);
    +
    +    // Simple sanity check: ensure target is inside the test directory.
    +    if (!fs::PathToString(target).starts_with(fs::PathToString(root))) {
    +        throw std::runtime_error("Path escapes test root directory");
    +    }
    +
    +    if (!fs::exists(target)) {
    +        throw std::runtime_error("Path does not exist");
    +    }
    +
    +    const fs::path parent = target.parent_path();
    +    const fs::path backup = parent / fs::PathFromString(target.filename().string() + "_bak");
    +
    +    // Make the target disappear (file or directory)
    +    fs::rename(target, backup);
    +
    +    const auto old_perms = fs::status(parent).permissions();
    +
    +    // Helper to restore state (permissions + original path)
    +    auto restore = [&]() {
    +        fs::permissions(parent, old_perms, fs::perm_options::replace);
    +        if (fs::exists(backup) && !fs::exists(target)) {
    +            fs::rename(backup, target);
    +        }
    +    };
    +
    +    // Block any access and dir recreation under the parent directory
    +    fs::permissions(parent, fs::perms::none, fs::perm_options::replace);
    +
    +    try {
    +        fn(); // run test under simulated failure
    +    } catch (...) {
    +        restore();
    +        throw;
    +    }
    +
    +    restore();
    +}
    diff --git a/src/test/util/common.h b/src/test/util/common.h
    index 85e62b7ef7..06a972ced8 100644
    --- a/src/test/util/common.h
    +++ b/src/test/util/common.h
    @@ -70,60 +70,6 @@ inline std::ostream& operator<<(std::ostream& os, const T& obj)
      * unavailable. The target is removed and its parent made inaccessible while
      * `fn()` executes, then everything is restored.
      */
    -template <typename Fn>
    -void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, Fn&& fn)
    -{
    -#ifdef WIN32
    -    // On Windows, any open file (such as the db) prevents directory renaming,
    -    // so we can't simulate a filesystem error in this platform. Skip it.
    -    return;
    -#else
    -    // This check relies on filesystem permission manipulation to simulate I/O
    -    // failures. Running as root bypasses permission checks, so the OS will
    -    // allow directory creation and file writes even when perms are set to
    -    // none, making it impossible to simulate the expected failures.
    -    if (getuid() == 0) return;
    - #endif
    -
    -    const fs::path root = fs::weakly_canonical(test_root_dir);
    -    const fs::path target = fs::weakly_canonical(path);
    -
    -    // Simple sanity check: ensure target is inside the test directory.
    -    if (!fs::PathToString(target).starts_with(fs::PathToString(root))) {
    -        throw std::runtime_error("Path escapes test root directory");
    -    }
    -
    -    if (!fs::exists(target)) {
    -        throw std::runtime_error("Path does not exist");
    -    }
    -
    -    const fs::path parent = target.parent_path();
    -    const fs::path backup = parent / fs::PathFromString(target.filename().string() + "_bak");
    -
    -    // Make the target disappear (file or directory)
    -    fs::rename(target, backup);
    -
    -    const auto old_perms = fs::status(parent).permissions();
    -
    -    // Helper to restore state (permissions + original path)
    -    auto restore = [&]() {
    -        fs::permissions(parent, old_perms, fs::perm_options::replace);
    -        if (fs::exists(backup) && !fs::exists(target)) {
    -            fs::rename(backup, target);
    -        }
    -    };
    -
    -    // Block any access and dir recreation under the parent directory
    -    fs::permissions(parent, fs::perms::none, fs::perm_options::replace);
    -
    -    try {
    -        fn(); // run test under simulated failure
    -    } catch (...) {
    -        restore();
    -        throw;
    -    }
    -
    -    restore();
    -}
    +void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn);
     
     #endif // BITCOIN_TEST_UTIL_COMMON_H
    

    furszy commented at 2:39 PM on April 14, 2026:

    Sure, taken. Thanks!

  31. in src/test/flatfile_tests.cpp:139 in a98cebe088
     130 | @@ -130,4 +131,71 @@ BOOST_AUTO_TEST_CASE(flatfile_flush)
     131 |      BOOST_CHECK_EQUAL(fs::file_size(seq.FileName(FlatFilePos(0, 1))), 1U);
     132 |  }
     133 |  
     134 | +BOOST_AUTO_TEST_CASE(flatfile_open_readonly_missing_dir)
     135 | +{
     136 | +    // Open file with read_only=true on a directory that does not exist.
     137 | +    // File should be null instead of throwing an exception.
     138 | +    const auto data_dir = m_args.GetDataDirBase() / "nonexistent";
     139 | +    const FlatFileSeq seq(data_dir, "a", 16 * 1024);
    


    maflcko commented at 8:10 AM on April 14, 2026:

    nit in a98cebe088a31933b5ee1c12e862f93165c32263 : It is a bit confusing to see all those oddly specific (and different) numbers for the chunk size, when none of them matter. It would be better to just define a dummy once and use it in all places that don't matter:

    diff --git a/src/test/flatfile_tests.cpp b/src/test/flatfile_tests.cpp
    index e31a0773d1..e33c12265b 100644
    --- a/src/test/flatfile_tests.cpp
    +++ b/src/test/flatfile_tests.cpp
    @@ -13,2 +13,4 @@
     
    +static constexpr uint32_t DUMMY_CHUNK_SZ{1};
    +
     BOOST_FIXTURE_TEST_SUITE(flatfile_tests, BasicTestingSetup)
    @@ -138,3 +140,3 @@ BOOST_AUTO_TEST_CASE(flatfile_open_readonly_missing_dir)
         const auto data_dir = m_args.GetDataDirBase() / "nonexistent";
    -    const FlatFileSeq seq(data_dir, "a", 16 * 1024);
    +    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
     
    @@ -149,3 +151,3 @@ BOOST_AUTO_TEST_CASE(flatfile_open_write_creates_dir)
         const auto data_dir = m_args.GetDataDirBase() / "new_write_dir";
    -    const FlatFileSeq seq(data_dir, "a", 16 * 1024);
    +    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
     
    @@ -165,3 +167,3 @@ BOOST_AUTO_TEST_CASE(flatfile_open_write_missing_unwritable_parent)
         SimulateFileSystemError(m_path_root, dir, [&dir]() {
    -        const FlatFileSeq seq(dir / "blocks", "a", 16 * 1024);
    +        const FlatFileSeq seq(dir / "blocks", "a", DUMMY_CHUNK_SZ);
             const AutoFile file{seq.Open(FlatFilePos(0, 0), /*read_only=*/false)};
    @@ -176,3 +178,3 @@ BOOST_AUTO_TEST_CASE(flatfile_flush_unwritable_dir)
         fs::create_directories(data_dir);
    -    const FlatFileSeq seq(data_dir, "a", 100);
    +    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
     
    @@ -190,3 +192,3 @@ BOOST_AUTO_TEST_CASE(flatfile_allocate_unwritable_dir)
         fs::create_directories(data_dir);
    -    const FlatFileSeq seq(data_dir, "a", 100);
    +    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
     
     
    

    furszy commented at 2:39 PM on April 14, 2026:

    Sure, taken. Thanks!

  32. in test/functional/p2p_handle_io_errors.py:106 in a98cebe088
      99 | @@ -100,12 +100,14 @@ def test_getdata_on_broken_fs(self):
     100 |          peer = node.add_p2p_connection(P2PInterface())
     101 |          block_hash = node.getblockhash(3)  # Not in recent cache
     102 |          with simulate_io_error(node.blocks_path):
     103 | -            # Bug: we currently swallow the GETDATA request, never reply to the other party, and log issue under debug-NET
     104 | -            with node.assert_debug_log(expected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
     105 | +            # Request block and expect disconnection. The node must not swallow the exception and do nothing.
     106 | +            with node.assert_debug_log(expected_msgs=["Cannot load block from disk"],
     107 | +                                       unexpected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
     108 |                                         timeout=30):
    


    maflcko commented at 8:28 AM on April 14, 2026:

    nit in a98cebe088a31933b5ee1c12e862f93165c32263: Now that the disconnect happens, the timeout here is not needed and confusing and should be removed. Also, I think the unexpected_msgs should be removed, as they depend on the exact log formatting, which may change in the future, so they will likely go silently stale anyway, and the disconnection is what matters here and is already checked.

    (Same in other places in this commit)

    Suggested diff:

    diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
    index 26c1ed7b51..354f7a6c4c 100755
    --- a/test/functional/p2p_handle_io_errors.py
    +++ b/test/functional/p2p_handle_io_errors.py
    @@ -102,6 +102,4 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
             with simulate_io_error(node.blocks_path):
    -            # Request block and expect disconnection. The node must not swallow the exception and do nothing.
    -            with node.assert_debug_log(expected_msgs=["Cannot load block from disk"],
    -                                       unexpected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
    -                                       timeout=30):
    +            # Request block and expect disconnection.
    +            with node.assert_debug_log(expected_msgs=["Cannot load block from disk"]):
                     peer.send_without_ping(msg_getdata([CInv(MSG_BLOCK | MSG_WITNESS_FLAG, int(block_hash, 16))]))
    @@ -133,7 +131,6 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
     
    -            # Request block and expect crash. The node must not swallow the exception and do nothing.
    -            with node.assert_debug_log(expected_msgs=["Unable to open file"],
    -                                       unexpected_msgs=["[ProcessMessages] [net] ProcessMessages(getblocktxn, 34 bytes): Exception 'filesystem error"],
    -                                       timeout=10):
    +            # Request block and expect crash+disconnect.
    +            with node.assert_debug_log(expected_msgs=["Unable to open file"]):
                     peer.send_without_ping(gbtn)
    +                peer.wait_for_disconnect()
     
    @@ -143,3 +140,2 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
                     expected_stderr=re.compile("Assertion"), # assertion crash
    -                timeout=30,
                 )
    

    furszy commented at 2:40 PM on April 14, 2026:

    looks good!, taken. Thanks

  33. in src/test/util/common.h:111 in a98cebe088
     106 | +    const auto old_perms = fs::status(parent).permissions();
     107 | +
     108 | +    // Helper to restore state (permissions + original path)
     109 | +    auto restore = [&]() {
     110 | +        fs::permissions(parent, old_perms, fs::perm_options::replace);
     111 | +        if (fs::exists(backup) && !fs::exists(target)) {
    


    maflcko commented at 8:35 AM on April 14, 2026:

    a98cebe088a31933b5ee1c12e862f93165c32263: same here: Any reason for the check?

    diff --git a/src/test/util/common.cpp b/src/test/util/common.cpp
    index f547595531..e89fe11e71 100644
    --- a/src/test/util/common.cpp
    +++ b/src/test/util/common.cpp
    @@ -43,5 +43,3 @@ void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path
             fs::permissions(parent, old_perms, fs::perm_options::replace);
    -        if (fs::exists(backup) && !fs::exists(target)) {
    -            fs::rename(backup, target);
    -        }
    +        fs::rename(backup, target);
         };
    

    furszy commented at 2:41 PM on April 14, 2026:

    not anymore. Removed. Thanks.

  34. in src/test/validation_chainstate_tests.cpp:191 in 63e4541495
     186 | +    // Disconnect block without changing validity so ActivateBestChain
     187 | +    // has a block to reconnect.
     188 | +    {
     189 | +        LOCK2(chainman->GetMutex(), chainstate.MempoolMutex());
     190 | +        BlockValidationState state;
     191 | +        BOOST_CHECK(chainstate.DisconnectTip(state, nullptr));
    


    maflcko commented at 9:14 AM on April 14, 2026:

    nit in 63e45414952b4cd155921b0cee20e3d3f97142a5: Could use named arg for /*disconnectpool=*/?


    furszy commented at 2:41 PM on April 14, 2026:

    Sure. Done.

  35. in src/test/blockmanager_tests.cpp:337 in 55ac4b4f2f
     332 | +        // create_directories fails so the file cannot be created.
     333 | +        {
     334 | +            ASSERT_DEBUG_LOG("Failed to write block");
     335 | +            CBlock dummy_block;
     336 | +            dummy_block.nVersion = 1;
     337 | +            const auto write_pos = blockman.WriteBlock(dummy_block, 999);
    


    maflcko commented at 9:30 AM on April 14, 2026:

    nit in 55ac4b4f2f389b588a1f372e506c2f5d63db701b: would be easier to read, if integral literal args used names. In this case, I guess it is /*nHeight=*/999?


    furszy commented at 2:41 PM on April 14, 2026:

    Sure. Done.

  36. in src/test/blockmanager_tests.cpp:356 in 55ac4b4f2f outdated
     351 | +            tip->nStatus |= BLOCK_HAVE_UNDO;
     352 | +        }
     353 | +    });
     354 | +
     355 | +    // Ensure we haven't corrupted any internal state during the failure.
     356 | +    // Check we can read the block again now that there is no dir access issue going on.
    


    maflcko commented at 9:36 AM on April 14, 2026:

    nit in 55ac4b4f2f389b588a1f372e506c2f5d63db701b : Should also check undo beside the block?

    Also, when assuming a pointer can not be null, it would be better to use a reference.

    Diff to achieve both:

    diff --git a/src/test/blockmanager_tests.cpp b/src/test/blockmanager_tests.cpp
    index 492cb518d0..b1ff1139ae 100644
    --- a/src/test/blockmanager_tests.cpp
    +++ b/src/test/blockmanager_tests.cpp
    @@ -309,3 +309,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
         auto& blockman = m_node.chainman->m_blockman;
    -    CBlockIndex* tip = WITH_LOCK(chainman->GetMutex(), return chainman->ActiveTip());
    +    CBlockIndex& tip{*WITH_LOCK(chainman->GetMutex(), return chainman->ActiveTip())};
     
    @@ -314,3 +314,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
                 ASSERT_DEBUG_LOG("OpenBlockFile failed");
    -            FlatFilePos tip_pos = WITH_LOCK(chainman->GetMutex(), return tip->GetBlockPos());
    +            FlatFilePos tip_pos = WITH_LOCK(chainman->GetMutex(), return tip.GetBlockPos());
                 BOOST_CHECK(!blockman.ReadRawBlock(tip_pos));
    @@ -321,3 +321,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
                 CBlock block;
    -            BOOST_CHECK(!blockman.ReadBlock(block, *tip));
    +            BOOST_CHECK(!blockman.ReadBlock(block, tip));
             }
    @@ -327,3 +327,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
                 CBlockUndo block_undo;
    -            BOOST_CHECK(!blockman.ReadBlockUndo(block_undo, *tip));
    +            BOOST_CHECK(!blockman.ReadBlockUndo(block_undo, tip));
             }
    @@ -345,8 +345,8 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
                 LOCK(chainman->GetMutex());
    -            tip->nStatus &= ~BLOCK_HAVE_UNDO;
    +            tip.nStatus &= ~BLOCK_HAVE_UNDO;
     
                 BlockValidationState state;
    -            BOOST_CHECK(!blockman.WriteBlockUndo(CBlockUndo{}, state, *tip));
    +            BOOST_CHECK(!blockman.WriteBlockUndo(CBlockUndo{}, state, tip));
     
    -            tip->nStatus |= BLOCK_HAVE_UNDO;
    +            tip.nStatus |= BLOCK_HAVE_UNDO;
             }
    @@ -357,3 +357,5 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
         CBlock recovered_block;
    -    BOOST_CHECK(blockman.ReadBlock(recovered_block, *tip));
    +    BOOST_CHECK(blockman.ReadBlock(recovered_block, tip));
    +    CBlockUndo block_undo;
    +    BOOST_CHECK(blockman.ReadBlockUndo(block_undo, tip));
     }
    

    furszy commented at 2:41 PM on April 14, 2026:

    Looks good!, taken. Thanks.

  37. in src/net_processing.cpp:4287 in 476fc0f899
    4283 | @@ -4284,7 +4284,11 @@ void PeerManagerImpl::ProcessMessage(Peer& peer, CNode& pfrom, const std::string
    4284 |              const bool ret{m_chainman.m_blockman.ReadBlock(block, block_pos, req.blockhash)};
    4285 |              // If height is above MAX_BLOCKTXN_DEPTH then this block cannot get
    4286 |              // pruned after we release cs_main above, so this read should never fail.
    4287 | -            assert(ret);
    4288 | +            // Unless we have a I/O issue, in such case, we want to shutdown gracefully.
    


    maflcko commented at 9:38 AM on April 14, 2026:

    nit in 476fc0f8991141f9d2d4f687b38b45b82a98bb39 from the llm:

    • a I/O issue -> an I/O issue [Grammatical article error; “I/O” starts with a vowel sound, so comprehension is improved by using “an”.]

    furszy commented at 2:42 PM on April 14, 2026:

    Done as suggested.

  38. maflcko approved
  39. maflcko commented at 9:41 AM on April 14, 2026: member

    Nice changes. I think the added tests are useful to have proper coverage for this. Also, the code changes move in the right direction. I also like that further places are pointed out where stuff can be improved.

    Everything looks great, and I've left more nits on the tests, given that all the code changes are tests. Feel free to ignore any of them, but I'd be happy to re-review all of them.

    lgtm 476fc0f8991141f9d2d4f687b38b45b82a98bb39 🏸

    <details><summary>Show signature</summary>

    Signature:

    untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
    RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
    trusted comment: lgtm 476fc0f8991141f9d2d4f687b38b45b82a98bb39 🏸
    XRrVpPFFIr4wq1HN1XUn/Y+JjL2oo7FUVxxSS8lD5F/L2WAqGj/ZewYzFssaE6uY8HeSbaD7nCh4rvbUeaOlBQ==
    

    </details>

  40. maflcko commented at 9:48 AM on April 14, 2026: member

    nit on the title: Maybe the prefix should be validation:, because it not only affects the p2p validation flow, but also the RPC one, and possibly the kernel one?

  41. furszy renamed this:
    net_processing: improve block data I/O error handling in P2P paths
    validation: improve block data I/O error handling in P2P paths
    on Apr 14, 2026
  42. DrahtBot added the label Validation on Apr 14, 2026
  43. furszy force-pushed on Apr 14, 2026
  44. furszy commented at 2:44 PM on April 14, 2026: member

    Updated per feedback and very nice suggestions. Thanks maflcko!

  45. maflcko commented at 4:07 PM on April 14, 2026: member

    Only change is the nits.

    review ACK f6c62f1bd4887d7865bf64c1fb399d47286fe531 🐟

    <details><summary>Show signature</summary>

    Signature:

    untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
    RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
    trusted comment: review ACK f6c62f1bd4887d7865bf64c1fb399d47286fe531 🐟
    QmQ331B1u4334Pm/xCen1sDl+6UdVg8nw67GYZXSKW4pVOWc8QaJLmqxz1b7wh8w5sUV++MOMmen3sKr+9kTDg==
    

    </details>

  46. DrahtBot requested review from frankomosh on Apr 14, 2026
  47. frankomosh commented at 1:54 PM on April 15, 2026: contributor

    reACK f6c62f1bd4887d7865bf64c1fb399d47286fe531.

  48. rkrux commented at 2:19 PM on April 15, 2026: contributor

    Concept ACK https://github.com/bitcoin/bitcoin/commit/f6c62f1bd4887d7865bf64c1fb399d47286fe531, will review.

    +1 for the detailed PR description and verbose commit messages that makes the reviewer feel like a story is unfolding in the PR.

  49. in test/functional/p2p_handle_io_errors.py:144 in c673aacfd7 outdated
     144 | +                expect_error=True,
     145 | +                expected_stderr=re.compile("Assertion"), # assertion crash
     146 | +            )
     147 | +
     148 | +        # Restart node for next test
     149 | +        self.start_node(0)
    


    rkrux commented at 1:45 PM on April 17, 2026:

    In c673aacfd78653a3d49cdcac355df3b6024c0608: this is the last test, is starting node here really required?


    furszy commented at 3:27 PM on April 17, 2026:

    Better to leave the test in a consistent state than force the next person working on it to figure out why their test isn't working.


    rkrux commented at 8:11 AM on April 18, 2026:

    than force the next person working on it to figure out why their test isn't working.

    That's thoughtful but it also adds marginal latency in the test - will leave it to your preference.

  50. in test/functional/p2p_handle_io_errors.py:77 in bd19cea308
      72 | +    def test_block_connect_on_broken_fs(self):
      73 | +        self.log.info("Test block connect on inaccessible filesystem")
      74 | +        node = self.nodes[0]
      75 | +        peer = node.add_p2p_connection(P2PInterface())
      76 | +        blocks = create_empty_fork(node, fork_length=2)
      77 | +        large_block = self.generatetodescriptor(node, 1, f"raw({'55'*100_000})")[0]
    


    rkrux commented at 1:46 PM on April 17, 2026:

    In bd19cea30865efc0c8a353843662090446f4aada: s/large_block/large_block_hex


    rkrux commented at 3:03 PM on April 17, 2026:

    Nit: s/({'55'100_000})/({'55'66_000}) can also work because the file size is 64kib for fast prune.


    furszy commented at 6:36 PM on April 17, 2026:

    string fixed, thx.


    maflcko commented at 7:28 AM on May 8, 2026:

    I don't think large_block_hex makes sense here. This is the hash of the block, not the full block data in hex. The naming should either be reverted to large_block or use large_block_hash.


    maflcko commented at 3:59 PM on May 11, 2026:

    Looks like this wasn't addressed in the latest push


    furszy commented at 6:00 PM on May 15, 2026:

    Looks like this wasn't addressed in the latest push

    It seems the latest push predates your comment. That's why it isn't there.


    furszy commented at 6:04 PM on May 15, 2026:

    Done as suggested.

  51. in test/functional/p2p_handle_io_errors.py:124 in bd19cea308
     119 | +        self.log.info("Test GETBLOCKTXN triggers fatal error on inaccessible filesystem")
     120 | +        node = self.nodes[0]
     121 | +
     122 | +        # Generate two blocks so that the first one is evicted from the
     123 | +        # m_most_recent_block cache.
     124 | +        self.generate(node, 1)
    


    rkrux commented at 2:58 PM on April 17, 2026:

    In bd19cea30865efc0c8a353843662090446f4aada:

    Generate two blocks

    Isn't it generating one block?


    furszy commented at 6:35 PM on April 17, 2026:

    yep, fixed. Thanks.

  52. in test/functional/p2p_handle_io_errors.py:80 in bd19cea308
      75 | +        peer = node.add_p2p_connection(P2PInterface())
      76 | +        blocks = create_empty_fork(node, fork_length=2)
      77 | +        large_block = self.generatetodescriptor(node, 1, f"raw({'55'*100_000})")[0]
      78 | +        peer.send_and_ping(msg_block(blocks[0]))
      79 | +        (node.blocks_path / 'blk00002.dat').unlink()
      80 | +        node.getblock(large_block)
    


    rkrux commented at 3:13 PM on April 17, 2026:

    In bd19cea: I usually find these standalone calls unintuitive because no value is captured and no assertion done, often time it takes me a second to realise that this is there just to show that it works.

    Maybe add the following assertion as well to show that forked block is gone now.

    + assert_raises_rpc_error(-1, 'Block not found on disk', node.getblock, blocks[0].hash_hex)
    

    furszy commented at 6:35 PM on April 17, 2026:

    sure, pushed.

  53. rkrux commented at 3:18 PM on April 17, 2026: contributor

    Started reviewing at f6c62f1bd4887d7865bf64c1fb399d47286fe531, shared few comments in the test.

  54. furszy force-pushed on Apr 17, 2026
  55. furszy commented at 6:37 PM on April 17, 2026: member

    Updated per feedback. Test comments fixed. Thanks rkrux.

  56. in test/functional/p2p_handle_io_errors.py:143 in 3fec4bca85 outdated
     141 |  
     142 |              node.wait_until_stopped(
     143 | -                expected_ret_code=-6,
     144 |                  expect_error=True,
     145 | -                expected_stderr=re.compile("Assertion"), # assertion crash
     146 | +                expected_stderr=re.compile(r"fatal internal error"),
    


    rkrux commented at 8:10 AM on April 18, 2026:

    In c990e375dbbc5ac2d2ed701dbc0f4f51533d2650: A new error has been added in the commit and it seems like a missed opportunity that it's not asserted for in the corresponding test.

    -                expected_stderr=re.compile(r"fatal internal error"),
    +                expected_stderr=re.compile(r"Failed to read block during GETBLOCKTXN"),
    

    or

    -            with node.assert_debug_log(expected_msgs=["Unable to open file"]):
    +            with node.assert_debug_log(expected_msgs=["Unable to open file", "Failed to read block during GETBLOCKTXN"]):
    

    furszy commented at 2:43 PM on April 20, 2026:

    Done as suggested.

  57. DrahtBot added the label Needs rebase on Apr 19, 2026
  58. furszy force-pushed on Apr 19, 2026
  59. DrahtBot removed the label Needs rebase on Apr 19, 2026
  60. in test/functional/p2p_handle_io_errors.py:94 in c4204d9137 outdated
      89 | +            peer.wait_for_disconnect()
      90 | +
      91 | +        node.wait_until_stopped(
      92 | +            expect_error=True,
      93 | +            expected_stderr=re.compile(r"fatal internal error"),
      94 | +        )
    


    rkrux commented at 9:32 AM on April 20, 2026:

    In c4204d9137b66593a5d38c0e210dac863f6453d0: This new test that has been added seems fine on its own. However, the reviewer might try to relate it to the 2nd point in the PR description because the error here is of ActivateBestChain. But this testcase doesn't exactly test that point, which I found confusing. Can consider moving the following statement from the testing note of the description in the 2nd point itself there to make it obvious.

    Adding a functional test for this scenario is not possible due to the requirement of failing inside ActivateBestChain(), which always occurs after AcceptBlock(), which correctly captures the error.


    furszy commented at 2:37 PM on April 20, 2026:

    This behaves properly not because of the functions calls ordering, but because we are unlinking one single blk file and not the entire directory. The test that is above (test_block_accept_on_broken_fs()) is the one exercising the note you are mentioning.

  61. rkrux commented at 9:34 AM on April 20, 2026: contributor

    Overall lgtm at c990e375dbbc5ac2d2ed701dbc0f4f51533d2650

    Nit: Can consider switching the order of 3rd and 4th commits because latter is more closely related to the changes in 2nd commit as blockman methods are directly dependent on flatfile methods.

  62. furszy force-pushed on Apr 20, 2026
  63. furszy commented at 2:47 PM on April 20, 2026: member

    Updated per feedback. Single test line diff to target a specific error msg during fatal error shutdown. Ready to go.

  64. maflcko commented at 7:36 AM on May 8, 2026: member

    lgtm.

    Only nit changes in the test.

    re-ACK 65347588a1488d13f925da071a5cbf6af56d7a3f 🏹

    <details><summary>Show signature</summary>

    Signature:

    untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
    RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
    trusted comment: re-ACK 65347588a1488d13f925da071a5cbf6af56d7a3f 🏹
    hVSIl3SJ9ICN4IOBkD6rctCSUNVcwg1C9WtUacDs4gChpqAnd8MKAHscr7tKhpebVcDT8AxooFrXo0Ct8ItABA==
    

    </details>

  65. DrahtBot requested review from frankomosh on May 8, 2026
  66. DrahtBot requested review from rkrux on May 8, 2026
  67. rkrux approved
  68. rkrux commented at 10:41 AM on May 8, 2026: contributor

    lgtm ACK 65347588a1488d13f925da071a5cbf6af56d7a3f

    I like the simulate_io_error context manager - I have noticed similar use cases in other functional tests as well, this might be generalised in the future and made as a utility.

  69. maflcko removed review request from pinheadmz on May 8, 2026
  70. maflcko removed review request from frankomosh on May 8, 2026
  71. DrahtBot requested review from frankomosh on May 8, 2026
  72. DrahtBot requested review from pinheadmz on May 8, 2026
  73. sedited commented at 10:07 PM on May 10, 2026: contributor

    Concept ACK

    I'm not sure on this approach. If this is just for this single ReadBlock path, might it be easier ascertain that this does not introduce unwanted side effects by moving the call to OpenBlockFile(...) in ReadRawBlock into the try...catch block? I did not test this, but it seems like the same failure could trigger a runaway exception in the indexes at the moment, but I'm not sure if we are still terminating in all cases after this patch. Similarly does this now ignore the exception in ImportBlock?

    Nit on the description: Is the node being "stuck" is the best way to describe it? After all, it does not freeze up, but rather can not do any block processing. I guess an admin could theoretically also "unstuck" it in this scenario.

  74. maflcko commented at 6:36 AM on May 11, 2026: member

    I'm not sure on this approach. If this is just for this single ReadBlock path, might it be easier ascertain that this does not introduce unwanted side effects by moving the call to OpenBlockFile(...) in ReadRawBlock into the try...catch block?

    Generally, I think it would be better to use one type of error handling (at least inside a single function). IIUC, if someone were to rm a blockfile, FlatFileSeq::Open (with read_only=true) would return nullptr and log an error. Same, if someone were to truncate a file. So it seems odd to treat create_directories differently. Either all of them should return nullptr, or all of them should throw (and then return a NotNull). So I think the current patch is preferable to moving the call around inside ReadRawBlock. The alternative patch of reworking the error handling to use exceptions consistently for this function, may be more involved, but I haven't checked.

    I did not test this, but it seems like the same failure could trigger a runaway exception in the indexes at the moment, but I'm not sure if we are still terminating in all cases after this patch.

    If the indexers are not properly checking the return value of Open(), then it seems like a pre-existing issue, because it can happen also when a file is removed or truncated. If this is the case, it would be good to fix those issues in a separate pull request.

    Similarly does this now ignore the exception in ImportBlock?

    IIUC it will be leading to a break, just like other Open() errors. I think this is fine, because fs::exists was called prior, so the error could only happen in a racy way, in which case it can't reliably be detected anyway. And it will lead to an abort of the node on the next block read or write anyway.

    So I think this PR is the a minimal and still correct patch. Alternative patches are certainly possible, though I think (regardless of who writes them), they should go into a fresh pull request, as they are conceptually different. Then, one of them can be merged and the other one can be closed or rebased.

  75. sedited commented at 9:18 AM on May 11, 2026: contributor

    If the indexers are not properly checking the return value of Open(), then it seems like a pre-existing issue, because it can happen also when a file is removed or truncated. If this is the case, it would be good to fix those issues in a separate pull request.

    The scenarios where this is not handled adequately at the moment are pretty contrived, but I feel like so are the ones that would make us reach the conditions we aim to guard against in this PR. As far as I can tell by sprinkling in a few error conditions into the code, they do trigger an unclean exit of the node in the indexes, and will instead continue after this patch. It would be good to document that this may go from crashing to ignoring the error on non-p2p paths.

    Other than that, the changes do look correct to me. I think it would also be nice to annotate the various member functions using Open in the block manager and Open itself with nodiscard.

  76. maflcko commented at 9:37 AM on May 11, 2026: member

    they do trigger an unclean exit of the node in the indexes

    Heh, right. Completely unrelated (just related to the error handling mess): IIRC, in other code paths, the indexes do not properly catch exceptions, so if such an exception (which should lead to a fatal node shutdown) is thrown in an rpc thread, it may silently continue (See also #34132 (review))

    It would be good to document that this may go from crashing to ignoring the error on non-p2p paths.

    Other than that, the changes do look correct to me. I think it would also be nice to annotate the various member functions using Open in the block manager and Open itself with nodiscard.

    Yeah, I guess that can't hurt. I'd also be curious about re-thinking the error handling wholesale from the ground up, but maybe this should be a brainstorming issue.

  77. furszy force-pushed on May 15, 2026
  78. furszy commented at 6:32 PM on May 15, 2026: member

    Updated per feedback. Annotated the various flat file functions as nodiscard.

    The indexes discussion made me realize we are also moving from swallowing a p2p getcfilters exception to properly handling the error now. Which is nice. We could have a test for it.

    It would be good to document that this may go from crashing to ignoring the error on non-p2p paths.

    What type of doc are you expecting? Commit message pointing to each place or go through each caller and add a comment on top?

    While it is true that this changes the indexes paths behavior, it is evident that the previous behavior was not intentional. The code was not expecting OpenFile to throw and crash the process. It was just missing that failure path. So someone could also argue that this is correcting unexpected crashes as well.

    If you are not that strong on it, I would rather move forward here, and work on the indexes in a separate PR. I'm happy to continue moving forward over #34897, #34489, #26966 and more. There is a lot we can do there.

  79. sedited commented at 7:52 PM on May 15, 2026: contributor

    If you are not that strong on it, I would rather move forward here, and work on the indexes in a separate PR.

    Yes, I was only asking for mentioning the various affected paths in the PR description. I don't think it should hold up this pull request.

  80. furszy commented at 9:56 PM on May 15, 2026: member

    Yes, I was only asking for mentioning the various affected paths in the PR description. I don't think it should hold up this pull request.

    Done. Ready to go.

  81. in src/node/blockstorage.h:227 in d09974cf3d
     223 | @@ -224,7 +224,7 @@ class BlockManager
     224 |      [[nodiscard]] bool FlushChainstateBlockFile(int tip_height);
     225 |      bool FindUndoPos(BlockValidationState& state, int nFile, FlatFilePos& pos, unsigned int nAddSize);
     226 |  
     227 | -    AutoFile OpenUndoFile(const FlatFilePos& pos, bool fReadOnly = false) const;
     228 | +    [[nodiscard]] AutoFile OpenUndoFile(const FlatFilePos& pos, bool fReadOnly = false) const;
    


    maflcko commented at 7:36 AM on May 16, 2026:

    in the last commit: Not sure I understand what the benefit of those would be. Of course, it is a bug to open a file and not use the returned file. But this is conceptually true for all functions whose goal is to call them to use their return value. Personally I don't think we should be adding [[nodiscard]] to all those functions, because this is extra verbosity/bloat and I don't think there has ever been a bug of this kind in this codebase, and I fail to see how one can be added in the future. Code like this won't compile anyway with or without the attribute:

    auto ReadUndo(const auto& pos) {
      OpenUndoFile(pos);
      Data undo;
      /* nothing -> does not compile */ >> undo;
      return undo;
    }
    

    I am not saying that [[nodiscard]] is useless, there are certainly valid use-cases, when a function name does not imply that something (e.g. an error) must be returned and could plausibly return void. For example, two lines up bool Flush() correctly uses the attribute. Also, on AutoFile itself, the attribute is correctly used on fclose. Otherwise, there could be code that compiles, like this:

    bool WriteUndo(const auto& pos, const auto& undo) {
     auto f =  OpenUndoFile(pos);
      f << undo;
      f.fclose(); // attribute required here
      return true;
    }
    

    So I htink the last commit should be dropped.


    furszy commented at 12:24 PM on May 16, 2026:

    I pretty much agree with you, that's why I did not add it before. I just assumed you guys were agreeing there based on your comment: #35003 (comment)

    And I didn't feel like arguing against it was worth the extra back and forth for such a small change.

    Just dropped the last commit.

  82. maflcko commented at 7:37 AM on May 16, 2026: member

    Only change is basically adding the last commit, but I think it should be dropped again.

    review ACK d09974cf3d5db404b11e7f1f3f959feae19838c3 🐚

    <details><summary>Show signature</summary>

    Signature:

    untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
    RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
    trusted comment: review ACK d09974cf3d5db404b11e7f1f3f959feae19838c3 🐚
    pzfG6+o8Fvf5F63HxAzASFsj6LVWtfvzj2zYEsSw87/4qt8Lc+4qwTethMkOVzOpnzc7BKo+E0dHNIuAfsp1Cw==
    

    </details>

  83. DrahtBot requested review from sedited on May 16, 2026
  84. DrahtBot requested review from rkrux on May 16, 2026
  85. l0rinc commented at 8:39 AM on May 16, 2026: contributor

    Concept ACK

  86. furszy force-pushed on May 16, 2026
  87. maflcko commented at 12:27 PM on May 16, 2026: member

    review ACK 01daa552c3ce06c8c8bfdab8c5b93a662cee6e79 🌴

    <details><summary>Show signature</summary>

    Signature:

    untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
    RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
    trusted comment: review ACK 01daa552c3ce06c8c8bfdab8c5b93a662cee6e79 🌴
    rHv5rLR5K0wwYrDTf/qFl8KopCbdE6DCO7BHuMOVG4vXCX+WCetd9ODtS8VMaSisoFlX068cVIFnxLKhDuPxCQ==
    

    </details>

  88. DrahtBot requested review from l0rinc on May 16, 2026
  89. furszy commented at 6:10 PM on May 20, 2026: member

    @sedited @rkrux wanna take another look? nothing changed from the past review round.

  90. sedited approved
  91. sedited commented at 9:34 PM on May 24, 2026: contributor

    ACK 65347588a1488d13f925da071a5cbf6af56d7a3f

    While I don't think these are the most desirable mechanics for handling system errors, this is an improvement over the status quo. Some followup work improving the handling in the indexes and in the file allocation path would be good. I like the additional tests.

  92. DrahtBot requested review from sedited on May 24, 2026
  93. rkrux commented at 10:59 AM on May 25, 2026: contributor

    lgtm re-ACK 01daa552c3ce06c8c8bfdab8c5b93a662cee6e79

    git range-diff 6534758...01daa55
    

    I have not gone through the last 3-4 comments in the PR discussion above but the range-diff is minimal since last a-c-k.

  94. rkrux commented at 11:01 AM on May 25, 2026: contributor

    In #35003#pullrequestreview-4353425125

    ACK https://github.com/bitcoin/bitcoin/commit/65347588a1488d13f925da071a5cbf6af56d7a3f

    This seems to be an outdated commit.

  95. DrahtBot added the label Needs rebase on May 26, 2026
  96. furszy force-pushed on May 26, 2026
  97. furszy commented at 1:51 PM on May 26, 2026: member

    Rebased due to conflicts with #33966. Ready to go.

  98. DrahtBot removed the label Needs rebase on May 26, 2026
  99. DrahtBot added the label Needs rebase on May 27, 2026
  100. test: add P2P coverage for block disk I/O failures
    Add tests showing that block disk I/O failures triggered
    through P2P requests are silently swallowed.
    
    The idea is to simulate I/O errors by making the blocks directory
    temporarily inaccessible while the node is running. This allows block
    reads to fail when serving P2P requests.
    
    Aside from the properly behaving tests, this contains two cases were
    the node swallows the error:
    
    - GETDATA: failures should be handled locally by disconnecting the
      peer and logging the error, without impacting node operation.
    
    - GETBLOCKTXN: failures when accessing recent blocks violate our
      "last 10 blocks must always be available for reorgs" rule and
      must result in a fatal error, rather than being caught and only
      logged by the P2P message handling general try-catch.
    
    Co-authored-by: MarcoFalke <*~=`'#}+{/-|&$^_@721217.xyz>
    356b0307d8
  101. furszy force-pushed on May 27, 2026
  102. DrahtBot removed the label Needs rebase on May 27, 2026
  103. DrahtBot added the label CI failed on May 27, 2026
  104. DrahtBot removed the label CI failed on May 28, 2026
  105. in src/test/util/common.cpp:9 in 400a5b16e3 outdated
       0 | @@ -0,0 +1,58 @@
       1 | +// Copyright (c) The Bitcoin Core developers
       2 | +// Distributed under the MIT software license, see the accompanying
       3 | +// file COPYING or https://opensource.org/license/mit/.
       4 | +
       5 | +#include <test/util/common.h>
       6 | +
       7 | +void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn)
       8 | +{
       9 | +#ifdef WIN32
    


    w0xlt commented at 6:11 PM on June 4, 2026:

    nit: in unsupported environments, the assertions never run, but the tests still pass. This can create false-positive test coverage.

    <details> <summary>diff</summary>

    diff --git a/src/test/blockmanager_tests.cpp b/src/test/blockmanager_tests.cpp
    index 79f5b15177..9cd6c2d61c 100644
    --- a/src/test/blockmanager_tests.cpp
    +++ b/src/test/blockmanager_tests.cpp
    @@ -331,7 +331,7 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
         auto& blockman = m_node.chainman->m_blockman;
         CBlockIndex& tip{*WITH_LOCK(chainman->GetMutex(), return chainman->ActiveTip())};
     
    -    SimulateFileSystemError(m_path_root, m_args.GetBlocksDirPath().parent_path(), [&]() {
    +    if (!SimulateFileSystemError(m_path_root, m_args.GetBlocksDirPath().parent_path(), [&]() {
             {
                 ASSERT_DEBUG_LOG("OpenBlockFile failed");
                 FlatFilePos tip_pos = WITH_LOCK(chainman->GetMutex(), return tip.GetBlockPos());
    @@ -372,7 +372,10 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
     
                 tip.nStatus |= BLOCK_HAVE_UNDO;
             }
    -    });
    +    })) {
    +        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
    +        return;
    +    }
     
         // Ensure we haven't corrupted any internal state during the failure.
         // Check we can read the block again now that there is no dir access issue going on.
    diff --git a/src/test/flatfile_tests.cpp b/src/test/flatfile_tests.cpp
    index e33c12265b..f835c5e8cb 100644
    --- a/src/test/flatfile_tests.cpp
    +++ b/src/test/flatfile_tests.cpp
    @@ -164,11 +164,14 @@ BOOST_AUTO_TEST_CASE(flatfile_open_write_missing_unwritable_parent)
         fs::create_directories(dir);
     
         // Make it read-only so creating subdirectories fails.
    -    SimulateFileSystemError(m_path_root, dir, [&dir]() {
    +    if (!SimulateFileSystemError(m_path_root, dir, [&dir]() {
             const FlatFileSeq seq(dir / "blocks", "a", DUMMY_CHUNK_SZ);
             const AutoFile file{seq.Open(FlatFilePos(0, 0), /*read_only=*/false)};
             BOOST_CHECK(file.IsNull());
    -    });
    +    })) {
    +        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
    +        return;
    +    }
     }
     
     BOOST_AUTO_TEST_CASE(flatfile_flush_unwritable_dir)
    @@ -178,9 +181,12 @@ BOOST_AUTO_TEST_CASE(flatfile_flush_unwritable_dir)
         fs::create_directories(data_dir);
         const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
     
    -    SimulateFileSystemError(m_path_root, data_dir, [&seq]() {
    +    if (!SimulateFileSystemError(m_path_root, data_dir, [&seq]() {
             BOOST_CHECK(!seq.Flush(FlatFilePos(0, 1)));
    -    });
    +    })) {
    +        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
    +        return;
    +    }
     }
     
     BOOST_AUTO_TEST_CASE(flatfile_allocate_unwritable_dir)
    @@ -192,12 +198,15 @@ BOOST_AUTO_TEST_CASE(flatfile_allocate_unwritable_dir)
         fs::create_directories(data_dir);
         const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
     
    -    SimulateFileSystemError(m_path_root, data_dir, [&seq]() {
    +    if (!SimulateFileSystemError(m_path_root, data_dir, [&seq]() {
             bool out_of_space;
             // Note: this throws due to 'CheckDiskSpace'. In the future, this shouldn't throw either.
             BOOST_CHECK_THROW(seq.Allocate(FlatFilePos(0, 0), 1, out_of_space), fs::filesystem_error);
             BOOST_CHECK(!out_of_space);
    -    });
    +    })) {
    +        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
    +        return;
    +    }
     }
     
     BOOST_AUTO_TEST_SUITE_END()
    diff --git a/src/test/util/common.cpp b/src/test/util/common.cpp
    index e89fe11e71..43f666e5fb 100644
    --- a/src/test/util/common.cpp
    +++ b/src/test/util/common.cpp
    @@ -4,19 +4,25 @@
     
     #include <test/util/common.h>
     
    -void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn)
    +#include <stdexcept>
    +
    +#ifndef WIN32
    +#include <unistd.h>
    +#endif
    +
    +[[nodiscard]] bool SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn)
     {
     #ifdef WIN32
         // On Windows, any open file (such as the db) prevents directory renaming,
         // so we can't simulate a filesystem error in this platform. Skip it.
    -    return;
    +    return false;
     #else
         // This check relies on filesystem permission manipulation to simulate I/O
         // failures. Running as root bypasses permission checks, so the OS will
         // allow directory creation and file writes even when perms are set to
         // none, making it impossible to simulate the expected failures.
    -    if (getuid() == 0) return;
    - #endif
    +    if (geteuid() == 0) return false;
    +#endif
     
         const fs::path root = fs::weakly_canonical(test_root_dir);
         const fs::path target = fs::weakly_canonical(path);
    @@ -55,4 +61,5 @@ void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path
         }
     
         restore();
    +    return true;
     }
    diff --git a/src/test/util/common.h b/src/test/util/common.h
    index 77df87cff0..457c33cb7f 100644
    --- a/src/test/util/common.h
    +++ b/src/test/util/common.h
    @@ -8,14 +8,11 @@
     #include <util/fs.h>
     
     #include <chrono>
    +#include <functional>
     #include <optional>
     #include <ostream>
     #include <string>
     
    -#ifndef WIN32
    -#include <unistd.h>
    -#endif
    -
     /**
      * BOOST_CHECK_EXCEPTION predicates to check the specific validation error.
      * Use as
    @@ -70,8 +67,8 @@ inline std::ostream& operator<<(std::ostream& os, const T& obj)
      * unavailable. The target is removed and its parent made inaccessible while
      * `fn()` executes, then everything is restored.
      *
    - * Note: Returns early if the environment does not allow simulating a fs error.
    + * Returns false if the environment does not allow simulating a fs error.
      */
    -void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn);
    +[[nodiscard]] bool SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn);
     
     #endif // BITCOIN_TEST_UTIL_COMMON_H
    diff --git a/src/test/validation_chainstate_tests.cpp b/src/test/validation_chainstate_tests.cpp
    index 1b8cf7121e..7df38707c6 100644
    --- a/src/test/validation_chainstate_tests.cpp
    +++ b/src/test/validation_chainstate_tests.cpp
    @@ -201,15 +201,21 @@ BOOST_FIXTURE_TEST_CASE(activate_best_chain_connect_io_failure, TestChain100Setu
     
         // Inaccessible block file must trigger a fatal error.
         const auto blk_file = blocks_dir / "blk00000.dat";
    -    SimulateFileSystemError(m_path_root, blk_file, [&]() {
    +    if (!SimulateFileSystemError(m_path_root, blk_file, [&]() {
             check_activate_best_chain_fatal_error("Failed to read block", m_node, chainstate);
    -    });
    +    })) {
    +        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
    +        return;
    +    }
     
         // Inaccessible blocks directory must also trigger a fatal error.
         const auto parent_dir = blocks_dir.parent_path();
    -    SimulateFileSystemError(m_path_root, parent_dir, [&]() {
    +    if (!SimulateFileSystemError(m_path_root, parent_dir, [&]() {
             check_activate_best_chain_fatal_error("Failed to read block", m_node, chainstate);
    -    });
    +    })) {
    +        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
    +        return;
    +    }
     
         // Sanity check: the block reconnects once the filesystem is restored.
         BlockValidationState state;
    @@ -255,21 +261,30 @@ BOOST_FIXTURE_TEST_CASE(activate_best_chain_disconnect_io_failure, TestChain100S
     
         // Inaccessible block file during reorg must trigger fatal error.
         const auto blk_file = blocks_dir / "blk00000.dat";
    -    SimulateFileSystemError(m_path_root, blk_file, [&]() {
    +    if (!SimulateFileSystemError(m_path_root, blk_file, [&]() {
             check_activate_best_chain_fatal_error("Failed to disconnect block", m_node, chainstate);
    -    });
    +    })) {
    +        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
    +        return;
    +    }
     
         // Inaccessible undo file during reorg must trigger fatal error.
         const auto rev_file = blocks_dir / "rev00000.dat";
    -    SimulateFileSystemError(m_path_root, rev_file, [&]() {
    +    if (!SimulateFileSystemError(m_path_root, rev_file, [&]() {
             check_activate_best_chain_fatal_error("Failed to disconnect block", m_node, chainstate);
    -    });
    +    })) {
    +        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
    +        return;
    +    }
     
         // Inaccessible blocks directory during reorg must also trigger fatal error.
         const auto parent_dir = blocks_dir.parent_path();
    -    SimulateFileSystemError(m_path_root, parent_dir, [&]() {
    +    if (!SimulateFileSystemError(m_path_root, parent_dir, [&]() {
             check_activate_best_chain_fatal_error("Failed to disconnect block", m_node, chainstate);
    -    });
    +    })) {
    +        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
    +        return;
    +    }
     
         // Sanity check: the reorg completes once the filesystem is restored.
         state = BlockValidationState();
    

    </details>


    furszy commented at 2:04 PM on June 7, 2026:

    Cool, thanks. Taken.

  106. w0xlt commented at 7:42 PM on June 4, 2026: contributor

    ACK 400a5b16e396498d41ec1ae70ec91f7baf45d1db

  107. DrahtBot requested review from maflcko on Jun 4, 2026
  108. FlatFile: do not throw for parent dir creation failure
    The callers of this function don't expect it to throw.
    
    For example, when the blocks directory becomes inaccessible
    (e.g. volume detach or permission changed), FlatFileSeq::Open's
    call to fs::create_directories() throws filesystem_error. On the
    block connection path, if this occurs during ActivateBestChain(),
    just after AcceptBlock() the exception bypasses FatalError entirely.
    It gets caught by ProcessMessage's general catch-all with only a
    NET-level debug log, swallowing the error msg for any regular
    user, and every subsequent block arrival retries the same failing
    read, leaving the node looking healthy while the block processing
    mechanism is stuck.
    b9bc7297ef
  109. test: verify ActivateBestChain fatal errors on I/O failure
    Cover the two ActivateBestChain paths that read data from
    disk: ConnectTip (block connection) and DisconnectTip (reorg).
    
    Simulate inaccessible block and undo files, ensure fatal error
    and shutdown is triggered.
    1bfad4eb4f
  110. test: ensure consistent failure behavior in BlockManager reads/writes
    Verifies that BlockManager behaves the same way whether a block file is
    missing or the blocks directory is inaccessible.
    
    Reads (ReadBlock, ReadRawBlock, ReadBlockUndo) return false, writes
    (WriteBlock, WriteBlockUndo) return null/false, and failures are logged.
    
    The goal is consistent failure handling. Previously, missing block files
    returned false and logged an error, while an inaccessible blocks directory
    was throwing an exception.
    93a575c7ce
  111. net: replace GETBLOCKTXN assert on block read error with graceful shutdown
    A failure here indicates a serious inconsistency or disk issue, and the
    current behavior is to abort immediately, which is correct but not the best.
    Replace assert with a fatal error so the node shuts down through the normal
    shutdown path instead.
    
    This does not change the outcome (the node still stops), but allows a more
    controlled teardown, giving other components a chance to flush their state
    even when a particular block (or the blocks dir) on disk is unreachable.
    E.g. the blocks dir might be the only affected, the mempool might be ok,
    and we don't want to lose such information.
    7d4d4b8530
  112. furszy force-pushed on Jun 7, 2026
  113. furszy commented at 2:07 PM on June 7, 2026: member

    Updated per @w0xlt feedback, thanks. Added test warn logging message on unsupported environments.

  114. in src/test/validation_chainstate_tests.cpp:287 in 7d4d4b8530
     282 | +    if (!SimulateFileSystemError(m_path_root, parent_dir, [&]() {
     283 | +        check_activate_best_chain_fatal_error("Failed to disconnect block", m_node, chainstate);
     284 | +    })) {
     285 | +        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
     286 | +        return;
     287 | +    }
    


    maflcko commented at 6:35 PM on June 9, 2026:

    not sure about the recent test-change push. this was done to avoid "false coverage" on windows. However, i don't think anyone creates coverage reports on Windows. Also, after the first warning, the later ones are dead and unreachable code.

    Also, the patch is incomplete, because the "skipping" regex in src/test/CMakeLists.txt would have to be updated as well.

    Also, even with this patch, there will be "false coverage" through all of the code that is run before the check. That is this will be "false coverage":

    BOOST_FIXTURE_TEST_CASE(activate_best_chain_disconnect_io_failure, TestChain100Setup)
    {
        auto& chainman = m_node.chainman;
        auto& chainstate = chainman->ActiveChainstate();
        const auto blocks_dir = m_args.GetBlocksDirPath();
    
        // Set up a scenario where ActivateBestChain must reorg:
        //
        //   Original chain: ... -> 100 -> 101 -> 102 -> 103   (more work)
        //   Fork:           ... -> 100 -> F101 -> F102        (currently active)
        //
        // The fork is active but has less work than the original chain, so
        // ActivateBestChain will disconnect the fork blocks via DisconnectTip
        // before reconnecting the original chain.
    
        // Extend the chain by 3 blocks (heights 101–103).
        for (int i = 0; i < 3; i++) CreateAndProcessBlock({}, CScript() << OP_TRUE);
    
        // Invalidate at height 101 to create room for a shorter fork.
        CBlockIndex* b101_index = WITH_LOCK(chainman->GetMutex(), return chainman->ActiveChain()[101]);
        BlockValidationState state;
        chainstate.InvalidateBlock(state, b101_index);
    
        // Build a 2-block fork from height 100. The original chain is invalid,
        // so these become the active tip.
        for (int i = 0; i < 2; i++) CreateAndProcessBlock({}, CScript() << OP_TRUE << OP_TRUE);
    
        // Now restore the original chain's validity. It has more work (103 > 102) but ActivateBestChain
        // hasn't been called yet, so we're still on the fork. Similar to ReconsiderBlock but without the ABC call.
        {
            LOCK(chainman->GetMutex());
            chainstate.ResetBlockFailureFlags(b101_index);
            chainman->RecalculateBestHeader();
        }
    
        // Inaccessible block file during reorg must trigger fatal error.
        const auto blk_file = blocks_dir / "blk00000.dat";
    

    My recommendation would be to either revert the last test-only push for now and leave it for a follow-up, or properly do it:

    • Add the missing skip regex
    • Do an early check in the first line of each affected test case, and then fatally failing or Asserting inside SimulateFileSystemError that the simulation did work. This also avoids the [[nodiscard]] bool bloat and a simple void can be used.
  115. in test/functional/p2p_handle_io_errors.py:151 in 356b0307d8
     146 | +        # their parent directory, so we can't trigger the expected filesystem error.
     147 | +        # Also, when running as root, permission changes don't restrict access, so the
     148 | +        # test condition can't be enforced either.
     149 | +        if platform.system() == 'Windows' or os.geteuid() == 0:
     150 | +            self.log.warning("Skipping test: unable to enforce dir permissions")
     151 | +            return
    


    josibake commented at 9:37 AM on June 10, 2026:

    This feels a bit gross and it indicates to me we should be doing the fault injection differently. I don't have a good suggestion on how to do the fault injection better but perhaps this could at least be a skip_windows style function , same as the other functional tests?

  116. in test/functional/p2p_handle_io_errors.py:100 in 356b0307d8
      95 | +
      96 | +        # Restart node for next test
      97 | +        self.start_node(0, extra_args=['-reindex'])
      98 | +
      99 | +    def test_getdata_on_broken_fs(self):
     100 | +        """The node should not swallow GETDATA requests during I/O issues"""
    


    josibake commented at 9:48 AM on June 10, 2026:

    It would be much nicer to write the postconditions of the test, e.g.: "GETDATA must discard the failed request and disconnect during block I/O issues", instead of writing the behaviour you are trying to avoid.

    I think this makes it easier to assess the correctness of the test and whether or not the test is testing the right things. This also prompts the reviewer to consider other faults that could cause the postconditions to be violated , rather than only focusing their attention on the (possibly) more narrow set of bad behaviour you're testing for here.

  117. in test/functional/p2p_handle_io_errors.py:117 in 356b0307d8
     112 | +
     113 | +            # And the node keeps running
     114 | +            node.getblockcount()
     115 | +
     116 | +    def test_getblocktxn_fatal_on_block_io_error(self):
     117 | +        """GETBLOCKTXN block read failures should trigger a fatal error.
    


    josibake commented at 9:52 AM on June 10, 2026:

    Same comment regarding post conditions, perhaps this could be more specific: "GETBLOCKTXN block read failures must trigger a fatal shutdown, not an assertion or silent catch."

  118. in src/test/util/common.cpp:25 in b9bc7297ef
      20 | +    // This check relies on filesystem permission manipulation to simulate I/O
      21 | +    // failures. Running as root bypasses permission checks, so the OS will
      22 | +    // allow directory creation and file writes even when perms are set to
      23 | +    // none, making it impossible to simulate the expected failures.
      24 | +    if (geteuid() == 0) return false;
      25 | +#endif
    


    josibake commented at 9:56 AM on June 10, 2026:

    Same comment as the functional test: feels a bit smelly. Perhaps theres a better way to do the fault injection?

  119. in src/flatfile.cpp:49 in b9bc7297ef
      46 | +        fs::create_directories(path.parent_path(), ec);
      47 | +        if (ec) {
      48 | +            LogError("Unable to open file '%s'. Error creating parent directories: %s", fs::PathToString(path), ec.message());
      49 | +            return nullptr;
      50 | +        }
      51 | +    }
    


    josibake commented at 11:06 AM on June 10, 2026:

    I think this is the wrong direction.

    The problem, based on my understanding of the PR description, this fs::filesystem_error was bypassing the validation fatal-error path and bubbling up to a broad P2P catch-all, where it was logged and swallowed. More generally, the problem was that the exception was not handled at the right layer, and in practice was not handled at all.

    What you're doing now is folding the exception into the existing nullptr return from FlatFileSeq::Open() and then expecting every caller to check and do the right thing with a now more ambiguous failure value.

    Ideally, albeit a much bigger change, we would have something like:

    CBlock BlockManager::ReadBlockRequired(const CBlockIndex& index);
    std::vector<std::byte> BlockManager::ReadRawBlockRequired(const FlatFilePos& pos);
    FlatFileHandle FlatFileSeq::OpenRequired(const FlatFilePos& pos, OpenMode mode);
    

    ..where "required" indicates the postcondition of the function. More specifically, if FlatFileSeq is given valid pos and mode arguments (preconditions), it is required to open the file at that position. If it cannot satisfy its postcondition, it throws an exception because it is now violating the contract. Then we would have:

    struct BlockStorageError : std::runtime_error {
          BlockStorageErrorCode code;
          FlatFilePos pos;
          std::optional<uint256> block_hash;
      };
    

    ..and GETDATA would catch a blockstorageerror , discard the request, disconnect the peer and keep running. GETBLOCKTXN would catch a blockstorageerror and fatally shutdown, etc etc.

    I realise this is likely too expansive of a refactor at this stage, but what you are doing here feels like the wrong abstraction boundary. What about something like this instead:

    enum class BlockReadError {
          MissingOrPruned,
          IoError,
          CorruptData,
          WrongBlock,
      };
    
      util::Expected<CBlock, BlockReadError> ReadBlock(...);
    

    Then callers choose how to handle or propagate:

    • GETDATA: disconnect peer on IoError, ignore/handle pruned as appropriate.
    • GETBLOCKTXN: fatal on IoError for recent block.
    • ActivateBestChain: fatal on any required block/undo read failure.
    • etc, etc
  120. in src/net_processing.cpp:4289 in 7d4d4b8530
    4284 | @@ -4285,7 +4285,11 @@ void PeerManagerImpl::ProcessMessage(Peer& peer, CNode& pfrom, const std::string
    4285 |              const bool ret{m_chainman.m_blockman.ReadBlock(block, block_pos, req.blockhash)};
    4286 |              // If height is above MAX_BLOCKTXN_DEPTH then this block cannot get
    4287 |              // pruned after we release cs_main above, so this read should never fail.
    4288 | -            assert(ret);
    4289 | +            // Unless we have an I/O issue, in such case, we want to shutdown gracefully.
    4290 | +            if (!ret) {
    


    josibake commented at 11:12 AM on June 10, 2026:

    Strongly agree with the change here! But based on my previous comment, why couldn't this be catching and handling a fs::filesystem_error runtime throw? Or the typed errors I proposed?

  121. josibake commented at 11:18 AM on June 10, 2026: member

    Concept ACK

    Fault suppression masquerading as defensive programing strikes again! Big fan of not swallowing the errors and of improving the test coverage!

    I know this is not a fair critique, since I assume this is the best we can come up with, but the file permissions approach seems hacky and the fact you have to work around root users and windows is evidence. Thats not really a blocking piece of feedback, but it would be nice if you could add a TODO comment.

    I think the meat of this PR is your change to flatfileseq, so I left the majority of my review there. I've been reading https://sean-parent.stlab.cc/presentations/2022-08-31-exceptions/2022-08-31-exceptions.pdf and really enjoying it, so a lot of the ideas and terminology in my review are borrowed from there.

    More generally, I'd recommend everyone reviewing this PR or interested in error handling read/watch the presentation. I think the principles laid out are clear and very applicable to our codebase. h/t @purpleKarrot for recommending it to me.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-06-11 10:51 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me