validation: improve block data I/O error handling in P2P paths #35003

furszy commented at 3:29 PM on April 4, 2026: member

Early note: the majority of this PR consists of test coverage. The changes per se are quite small, just the issue shows up in multiple paths in slightly different forms.

The goal of the PR is to ensure I/O errors are not silently ignored during validation or P2P message handling. In some cases, such as GETDATA or GETBLOCKTXN requests, errors can be swallowed, leaving the node unresponsive to the request without any action. In a harder to reach but more severe case, a swallowed error can leave the node alive but stuck, unable to process new blocks and advance the chain. Also, silently ignoring errors obviously makes problems harder to diagnose. So this PR seeks to improve all of that.

Currently, the same root cause, inability to access the blocks directory, can produce different behaviors depending on where it occurs: 

If it happens during block connection, inside AcceptBlock(), the error gets caught and triggers a fatal error graceful shutdown. This is the correct behavior.
If it happens during block connection (due to the unguarded FlatFileSeq::Open -> fs::create_directories call), after AcceptBlock(), inside ActivateBestChain(), the error gets thrown and swallowed by the P2P general try-catch, leaving the node in a living but stuck state. Where ActivateBestChain() will always silently fail on any posterior block processing, retrying to active the failing block.
If it happens during GETDATA request, the file system error gets swallowed by the message handling general try-catch, only logging a net-debug-level message. The remote peer is not disconnected, nor the node aborts. It just silently ignores the request. This is different to a missing block data, which currently logs the error + disconnects the peer.
If it happens during GETBLOCKTXN request, the file system error either gets swallowed by the message handling general try-catch or crashes the system via an assertion, depending on if the problem is at the blocks directory level or at the block data level, which is inconsistent.
If it happens during GETCFILTERS request, the file system error gets swallowed by the message handling general try-catch.
If this happens in any of the index functions, the node crashes.

So, this PR makes failure handling consistent and ensures the node never enters a stuck state due to a block I/O error. Changes:

GETDATA now treats the issue in the same way as other I/O errors: it logs the error + disconnect the remote peer.
GETBLOCKTXN now triggers a fatal error and graceful shutdown. The error here is stricter than in GETDATA because it can only access the last 10 blocks, which must be available for reorgs.
ActivateBestChain() now triggers a fatal error and graceful shutdown when it happens. Fixing the silent stuck node state scenario.
GETCFILTERS now logs the error through the expected path: "Failed to find block filter in index". A more descriptive error message can be added in a follow-up PR.
The index-related functions now ignore the error instead of crashing the node. This matches how the code was written, as these paths were not expecting OpenFile to throw. Better error handling can be added in a follow-up PR.

Testing Note Can reproduce the different behaviors that cause the stuck node scenario by running the second commit’s unit test 28af683b5dfdce88892e367b308857b94ec630fe or by manually introducing an exception in ConnectTip() (which occurs inside ActivateBestChain()). Previously, the thrown exception would have being caught by the general P2P try-catch and logged only at net-debug level, which most users do not have enabled, leaving the node alive but stuck, as subsequent block connections repeatedly attempt to process the failing block, throw the exception again, and get silently swallowed. Adding a functional test for this scenario is not possible due to the requirement of failing inside ActivateBestChain(), which always occurs after AcceptBlock(), which correctly captures the error.

Separate Note There is another assertion scenario that we can move to graceful shutdown (fatal error) in the compact block relay that I haven't done here.

Extra Note If want to see a similar error producing a crash, but in the wallet, go to #34176.

DrahtBot commented at 3:30 PM on April 4, 2026: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/35003.

Reviews

See the guideline and AI policy for information on the review process.

Type	Reviewers
ACK	l0rinc
Concept ACK	pinheadmz
Approach NACK	josibake
Stale ACK	frankomosh, sedited, rkrux, w0xlt
User requested bot ignore	maflcko

If your review is incorrectly listed, please copy-paste <code></code> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#35731 (Indexes: Harden the flush-error notification invariant by arejula27)
#35676 (util: Abort in CheckDiskSpace/FlatFileSeq::Open on rare exceptions by maflcko)
#35139 (test: Add thread-safe fast-failing test macros by maflcko)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

txospenderindex -> txspendingindex [misspelled index name in the docstring; the intended term is unclear otherwise]

Possible places where named args for integral literals may be used (e.g. func(x, /*named_arg=*/0) in C++, and func(x, named_arg=0) in Python):

FlatFileSeq(data_dir, "a", 100) in src/test/flatfile_tests.cpp

Possible places where comparison-specific test macros should replace generic comparisons:

src/test/flatfile_tests.cpp BOOST_CHECK_THROW(seq.Allocate(FlatFilePos(0, 0), 1, out_of_space), fs::filesystem_error); -> consider using BOOST_CHECK_EXCEPTION with a predicate that checks the expected error message/details instead of only the exception type.

2026-07-02 15:13:56

furszy force-pushed on Apr 4, 2026

DrahtBot added the label CI failed on Apr 4, 2026

furszy force-pushed on Apr 4, 2026

furszy force-pushed on Apr 5, 2026

DrahtBot removed the label CI failed on Apr 5, 2026

pinheadmz commented at 6:02 PM on April 6, 2026: member

concept ACK Thanks for the thorough PR description! Definitely helps me to understand the code changes before even looking at them. Can also see its minimal changes with lots of test coverage, will review.

frankomosh commented at 4:45 AM on April 8, 2026: contributor

tACK 72f7a67370255cab1bf0dfcf39920dd23a3c268b

Built from source. All new tests pass:

flatfile_tests: 9 cases, no errors
validation_chainstate_tests: 5 cases, no errors
blockmanager_tests: 8 cases, no errors
p2p_handle_io_errors.py: both scenarios pass

Also manually reverted the GETBLOCKTXN fix (commit ba034f4) by removing the fatalError call and replacing with a silent return. The functional test correctly caught this kind of mutation. the test expected the node to shut down, but the mutant node stayed alive, causing a timeout failure. Mutant killed by p2p_handle_io_errors.py. Unit tests did not detect this mutation. Also process_messages and process_message fuzz harnesses both passed without detecting the mutation, confirming the functional test is providing the critical oracle here.

Line 5915 in SendMessages has assert(ret) after a ReadBlock call in the compact block announcement path. I think this is the same pattern as the GETBLOCKTXN assert fixed in commit 4. I believe this is the "another assertion scenario in compact block relay" mentioned in the PR description. Is this planned as a follow-up? Happy to take it if so.

DrahtBot requested review from pinheadmz on Apr 8, 2026

in test/functional/p2p_handle_io_errors.py:109 in 72f7a67370 outdated

 104 | +        if platform.system() == 'Windows' or os.geteuid() == 0:
 105 | +           self.log.warning("Skipping test: unable to enforce dir permissions")
 106 | +           return
 107 | +
 108 | +        self.test_getdata_on_broken_fs()
 109 | +        self.test_getblocktxn_fatal_on_block_io_error()

maflcko commented at 8:36 AM on April 8, 2026:

Is there a reason why the p2p block msg flow is not tested end-to-end? I understand that the end-to-end behavior is already correct for it, and that there are unit tests, but they only look at the problem from a zoomed-in-on-flatfile perspective.

I mention it, because the error flow happens in reality. I've last seen it in #34592 (comment).

So I think it could make sense to test it, to avoid regressing on it.

Also, I think, it could make sense to make the first commit of the pull request the functional test. This way, it is easier to see what the end-to-end behavior changes are, without having to manually undo commits or patches.

Feel free to grab the test from this diff:

diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
index 7b3f164ebe..309f4bae63 100755
--- a/test/functional/p2p_handle_io_errors.py
+++ b/test/functional/p2p_handle_io_errors.py
@@ -12,2 +12,6 @@ import stat
 
+from test_framework.blocktools import (
+    create_block,
+    create_coinbase,
+)
 from test_framework.messages import (
@@ -17,2 +21,3 @@ from test_framework.messages import (
     MSG_WITNESS_FLAG,
+    msg_block,
     msg_getblocktxn,
@@ -46,2 +51,19 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
 
+    def test_block_connect_on_broken_fs(self):
+        self.log.info("Test block connect on inaccessible filesystem")
+        node = self.nodes[0]
+        peer = node.add_p2p_connection(P2PInterface())
+        block = create_block(tmpl=node.getblocktemplate({"rules": ["segwit"]}))
+        block.solve()
+
+        with simulate_io_error(node.blocks_path):
+            with node.assert_debug_log(expected_msgs=["AcceptBlock FAILED (System error while saving block to disk"]):
+                peer.send_without_ping(msg_block(block))
+                peer.wait_for_disconnect()
+
+            node.wait_until_stopped(
+                expect_error=True,
+                expected_stderr=re.compile(r"fatal internal error"),
+            )
+
     def test_getdata_on_broken_fs(self):
@@ -104,5 +126,6 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
         if platform.system() == 'Windows' or os.geteuid() == 0:
-           self.log.warning("Skipping test: unable to enforce dir permissions")
-           return
+            self.log.warning("Skipping test: unable to enforce dir permissions")
+            return
 
+        self.test_block_connect_on_broken_fs()
         self.test_getdata_on_broken_fs()

furszy commented at 8:03 PM on April 11, 2026:

Sure, thanks! Test taken, and also reworked the branch per suggestion.

maflcko commented at 8:39 AM on April 8, 2026: member

concept ack, left a test suggestion, which could make review easier

DrahtBot added the label Needs rebase on Apr 9, 2026

furszy force-pushed on Apr 11, 2026

DrahtBot removed the label Needs rebase on Apr 11, 2026

furszy force-pushed on Apr 11, 2026

DrahtBot added the label CI failed on Apr 11, 2026

furszy commented at 8:05 PM on April 11, 2026: member

Line 5915 in SendMessages has assert(ret) after a ReadBlock call in the compact block announcement path. I think this is the same pattern as the GETBLOCKTXN assert fixed in commit 4. I believe this is the "another assertion scenario in compact block relay" mentioned in the PR description. Is this planned as a follow-up? Happy to take it if so.

Yes @frankomosh. All yours :).

furszy force-pushed on Apr 11, 2026

in test/functional/p2p_handle_io_errors.py:81 in 2902fef4a7

  76 | +
  77 | +        peer = node.add_p2p_connection(P2PInterface())
  78 | +        block_hash = node.getblockhash(3)  # Not in recent cache
  79 | +        with simulate_io_error(node.blocks_path):
  80 | +            # Bug: we currently swallow the GETDATA request, never reply to the other party, and log issue under debug-NET
  81 | +            with node.assert_debug_log(expected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],

furszy commented at 10:30 PM on April 11, 2026:

Note: intentionally not checking for a specific filesystem error msg, since it slightly vary across OS. Specific msg doesn't matter, this first commit just proves the error is swallowed in master.

furszy force-pushed on Apr 12, 2026

in test/functional/p2p_handle_io_errors.py:44 in 75a8a957a5

  39 | +    try:
  40 | +        yield
  41 | +    finally:
  42 | +        os.chmod(parent_dir, old_mode)
  43 | +        if blocks_bak.exists() and not blocks_path.exists():
  44 | +            os.rename(blocks_bak, blocks_path)

maflcko commented at 10:14 AM on April 13, 2026:

nit in 75a8a957a5181a3cebfc9d3dd064330d52c51dbb: Is there a reason for this guard? The following seems stricter and what we want in this test. It also passes locally for me on this commit:

diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
index 4438ad1dd5..5e05e9b768 100755
--- a/test/functional/p2p_handle_io_errors.py
+++ b/test/functional/p2p_handle_io_errors.py
@@ -40,8 +40,7 @@ def simulate_io_error(blocks_path):
         yield
     finally:
         os.chmod(parent_dir, old_mode)
-        if blocks_bak.exists() and not blocks_path.exists():
-            os.rename(blocks_bak, blocks_path)
+        os.rename(blocks_bak, blocks_path)
 
 class P2PBlockIOErrorTest(BitcoinTestFramework):
     def set_test_params(self):

furszy commented at 4:07 PM on April 13, 2026:

Not anymore. I had more tests here before. Will drop it. Thanks.

in test/functional/p2p_handle_io_errors.py:52 in 825dfc2005

  47 | +    def set_test_params(self):
  48 | +        self.num_nodes = 1
  49 | +        self.setup_clean_chain = True
  50 | +
  51 | +    def test_block_connect_on_broken_fs(self):
  52 | +        self.log.info("Test block connect on inaccessible filesystem")

maflcko commented at 10:59 AM on April 13, 2026:

nit in 75a8a957a5181a3cebfc9d3dd064330d52c51dbb:

From the pull description:

If it happens during block connection, after AcceptBlock(), inside ActivateBestChain(), the file system error gets thrown and swallowed by the P2P general try-catch, leaving the node in a living but stuck state. Where ActivateBestChain() will always silently fail on any posterior block processing, retrying to active the failing block.

You claim that a failure in ActivateBestChain will leave the node alive, but stuck. However, when writing a test for this, I found that it properly shut down:

diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
index 4438ad1dd5..4706dfebd4 100755
--- a/test/functional/p2p_handle_io_errors.py
+++ b/test/functional/p2p_handle_io_errors.py
@@ -14,2 +14,3 @@ from test_framework.blocktools import (
     create_block,
+    create_empty_fork,
 )
@@ -49,5 +50,6 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
         self.setup_clean_chain = True
+        self.extra_args = [['-fastprune']]
 
-    def test_block_connect_on_broken_fs(self):
-        self.log.info("Test block connect on inaccessible filesystem")
+    def test_block_accept_on_broken_fs(self):
+        self.log.info("Test block accept on inaccessible filesystem")
         node = self.nodes[0]
@@ -70,2 +72,24 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
 
+    def test_block_connect_on_broken_fs(self):
+        self.log.info("Test block connect on inaccessible filesystem")
+        node = self.nodes[0]
+        peer = node.add_p2p_connection(P2PInterface())
+        blocks = create_empty_fork(node, fork_length=2)
+        large_block = self.generatetodescriptor(node, 1, f"raw({'55'*100_000})")[0]
+        peer.send_and_ping(msg_block(blocks[0]))
+        (node.blocks_path / 'blk00002.dat').unlink()
+        node.getblock(large_block)
+
+        with node.assert_debug_log(expected_msgs=["ActivateBestChain failed (Failed to read block.)"]):
+            peer.send_without_ping(msg_block(blocks[1]))
+            peer.wait_for_disconnect()
+
+        node.wait_until_stopped(
+            expect_error=True,
+            expected_stderr=re.compile(r"fatal internal error"),
+        )
+
+        # Restart node for next test
+        self.start_node(0, extra_args=['-reindex'])
+
     def test_getdata_on_broken_fs(self):
@@ -125,2 +149,3 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
 
+        self.test_block_accept_on_broken_fs()
         self.test_block_connect_on_broken_fs()

I guess it could make sense to add the test?

Also, it could make sense to clarify that point 2 in the pull request description refers to a thrown exception, presumably only from create_directories, and does not refer to any IO error?

Just a nit, and up to you, if you want to keep this pull focussed only on the create_directories call, or be a more general io error handling improvement in P2P paths, with test coverage.

furszy commented at 2:41 PM on April 13, 2026:

You claim that a failure in ActivateBestChain will leave the node alive, but stuck. However, when writing a test for this, I found that it properly shut down.

Yeah ok. Your test exercises a missing block file, not the inability to access the directory. If we want to replicate the stuck, the goal there should be to get the node stuck by making it throw inside the create_directories call.

I had a test early in this branch that reproduced the "throw inside ActivateBestChain" through the P2P path in a deterministic way (which is not easy due to AcceptBlock happening first). But I did not like that it relied on another bug in reconsiderblock to trigger this one.

Basically, the approach was to invalidate the tip, submit a fork, make the directory inaccessible, and then reconsider the block, which throws internally but resets the block validity flags (the operation is not atomically reverted upon failure). After that, sending any number of descendant blocks via P2P will be swallowed due to the inability to activate the very first block. But it was ugly to depend on reconsiderblock to trigger it.

I guess it could make sense to add the test?

Yes! for sure. Pulling it.

Also, it could make sense to clarify that point 2 in the pull request description refers to a thrown exception, presumably only from create_directories, and does not refer to any IO error?

Sure. Will do.

maflcko commented at 8:00 AM on April 14, 2026:

I had a test early in this branch that reproduced the "throw inside ActivateBestChain" through the P2P path in a deterministic way (which is not easy due to AcceptBlock happening first). But I did not like that it relied on another bug in reconsiderblock to trigger this one.

Basically, the approach was to invalidate the tip, submit a fork, make the directory inaccessible, and then reconsider the block, which throws internally but resets the block validity flags (the operation is not atomically reverted upon failure). After that, sending any number of descendant blocks via P2P will be swallowed due to the inability to activate the very first block. But it was ugly to depend on reconsiderblock to trigger it.

Ah interesting. I presume the bug still exists after this pull requests, as the ReconsiderBlock function is not changed here? Just wondering, as the bug seems unrelated to the bug fixed in this pull request. Also, I agree that such a test may be too spicy to include here.

furszy commented at 2:19 PM on April 15, 2026:

I presume the bug still exists after this pull requests, as the ReconsiderBlock function is not changed here?

Would say it is partially fixed but haven't tried it. ReconsiderBlock internally calls ActivateBestChain, which triggers a fatal error after this PR, instead of throwing an exception. So it is a matter of whether we are flushing the block flags reset to disk prior to aborting or not. Because, if we do, the node would fail early on any following up startup.

in test/functional/p2p_handle_io_errors.py:75 in 825dfc2005

  70 | +
  71 | +    def test_getdata_on_broken_fs(self):
  72 | +        """The node should not swallow GETDATA requests during I/O issues"""
  73 | +        self.log.info("Test GETDATA on inaccessible filesystem")
  74 | +        node = self.nodes[0]
  75 | +        self.generate(node, 6, sync_fun=self.no_op)

maflcko commented at 11:08 AM on April 13, 2026:

nit in the first commit: Any reason to disable something that is already skipped? Diff:

diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
index d75a9f1d2e..4706dfebd4 100755
--- a/test/functional/p2p_handle_io_errors.py
+++ b/test/functional/p2p_handle_io_errors.py
@@ -96,7 +96,7 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
         """The node should not swallow GETDATA requests during I/O issues"""
         self.log.info("Test GETDATA on inaccessible filesystem")
         node = self.nodes[0]
-        self.generate(node, 6, sync_fun=self.no_op)
+        self.generate(node, 6)
 
         peer = node.add_p2p_connection(P2PInterface())
         block_hash = node.getblockhash(3)  # Not in recent cache

furszy commented at 6:45 PM on April 13, 2026:

Done as suggested.

in test/functional/p2p_handle_io_errors.py:103 in 825dfc2005

  98 | +        self.log.info("Test GETBLOCKTXN triggers fatal error on inaccessible filesystem")
  99 | +        node = self.nodes[0]
 100 | +
 101 | +        # Generate two blocks so that the first one is evicted from the
 102 | +        # m_most_recent_block cache.
 103 | +        self.generate(node, 1, sync_fun=self.no_op)

maflcko commented at 11:10 AM on April 13, 2026:

nit: Same here about sync_fun.

furszy commented at 6:44 PM on April 13, 2026:

Done as suggested.

maflcko commented at 11:14 AM on April 13, 2026: member

Looks like the test fails on the second commit, or so?

Also, provided some more test nits. Feel free to ignore.

825dfc200520d503fec64346ed01a8b12ec8fa54~4 📭

<details><summary>Show signature</summary>

Signature:

untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: 825dfc200520d503fec64346ed01a8b12ec8fa54~4 📭
HuyWCmP9dlHdaEd4PL2XkITSC+f4T6nxsowBmRtzkzVkthldhUIEhYd8dWE2Z+BGoWVYLPWjRRxIPvoLUhqCCg==

</details>

furszy commented at 2:18 PM on April 13, 2026: member

Looks like the test fails on the second commit, or so?

It seems the assertion error message is libc-dependent. Linux logs "Assertion `expr' failed" while MacOS logs "Assertion failed: (expr), ...".

Will make the wait_until_stopped expected error a bit more general and check for "Assertion" wording only. That is more than enough to check the crash reason. The last commit makes it clear with the fatal error change anyway.

furszy force-pushed on Apr 13, 2026

DrahtBot removed the label CI failed on Apr 13, 2026

in src/test/util/common.h:74 in a98cebe088

  69 | + * Simulate a filesystem access failure for `path` by temporarily making it
  70 | + * unavailable. The target is removed and its parent made inaccessible while
  71 | + * `fn()` executes, then everything is restored.
  72 | + */
  73 | +template <typename Fn>
  74 | +void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, Fn&& fn)

maflcko commented at 8:00 AM on April 14, 2026:

nit in a98cebe088a31933b5ee1c12e862f93165c32263: This function is pretty large, and there is no need to inline it, so it would be better to move it to the Cpp file:

diff --git a/src/test/util/CMakeLists.txt b/src/test/util/CMakeLists.txt
index d6864a2720..c0d47d157c 100644
--- a/src/test/util/CMakeLists.txt
+++ b/src/test/util/CMakeLists.txt
@@ -5,6 +5,7 @@
 add_library(test_util STATIC EXCLUDE_FROM_ALL
   blockfilter.cpp
   coins.cpp
+  common.cpp
   coverage.cpp
   json.cpp
   logging.cpp
diff --git a/src/test/util/common.cpp b/src/test/util/common.cpp
new file mode 100644
index 0000000000..f547595531
--- /dev/null
+++ b/src/test/util/common.cpp
@@ -0,0 +1,60 @@
+// Copyright (c) The Bitcoin Core developers
+// Distributed under the MIT software license, see the accompanying
+// file COPYING or https://opensource.org/license/mit/.
+
+#include <test/util/common.h>
+
+void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn)
+{
+#ifdef WIN32
+    // On Windows, any open file (such as the db) prevents directory renaming,
+    // so we can't simulate a filesystem error in this platform. Skip it.
+    return;
+#else
+    // This check relies on filesystem permission manipulation to simulate I/O
+    // failures. Running as root bypasses permission checks, so the OS will
+    // allow directory creation and file writes even when perms are set to
+    // none, making it impossible to simulate the expected failures.
+    if (getuid() == 0) return;
+ #endif
+
+    const fs::path root = fs::weakly_canonical(test_root_dir);
+    const fs::path target = fs::weakly_canonical(path);
+
+    // Simple sanity check: ensure target is inside the test directory.
+    if (!fs::PathToString(target).starts_with(fs::PathToString(root))) {
+        throw std::runtime_error("Path escapes test root directory");
+    }
+
+    if (!fs::exists(target)) {
+        throw std::runtime_error("Path does not exist");
+    }
+
+    const fs::path parent = target.parent_path();
+    const fs::path backup = parent / fs::PathFromString(target.filename().string() + "_bak");
+
+    // Make the target disappear (file or directory)
+    fs::rename(target, backup);
+
+    const auto old_perms = fs::status(parent).permissions();
+
+    // Helper to restore state (permissions + original path)
+    auto restore = [&]() {
+        fs::permissions(parent, old_perms, fs::perm_options::replace);
+        if (fs::exists(backup) && !fs::exists(target)) {
+            fs::rename(backup, target);
+        }
+    };
+
+    // Block any access and dir recreation under the parent directory
+    fs::permissions(parent, fs::perms::none, fs::perm_options::replace);
+
+    try {
+        fn(); // run test under simulated failure
+    } catch (...) {
+        restore();
+        throw;
+    }
+
+    restore();
+}
diff --git a/src/test/util/common.h b/src/test/util/common.h
index 85e62b7ef7..06a972ced8 100644
--- a/src/test/util/common.h
+++ b/src/test/util/common.h
@@ -70,60 +70,6 @@ inline std::ostream& operator<<(std::ostream& os, const T& obj)
  * unavailable. The target is removed and its parent made inaccessible while
  * `fn()` executes, then everything is restored.
  */
-template <typename Fn>
-void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, Fn&& fn)
-{
-#ifdef WIN32
-    // On Windows, any open file (such as the db) prevents directory renaming,
-    // so we can't simulate a filesystem error in this platform. Skip it.
-    return;
-#else
-    // This check relies on filesystem permission manipulation to simulate I/O
-    // failures. Running as root bypasses permission checks, so the OS will
-    // allow directory creation and file writes even when perms are set to
-    // none, making it impossible to simulate the expected failures.
-    if (getuid() == 0) return;
- #endif
-
-    const fs::path root = fs::weakly_canonical(test_root_dir);
-    const fs::path target = fs::weakly_canonical(path);
-
-    // Simple sanity check: ensure target is inside the test directory.
-    if (!fs::PathToString(target).starts_with(fs::PathToString(root))) {
-        throw std::runtime_error("Path escapes test root directory");
-    }
-
-    if (!fs::exists(target)) {
-        throw std::runtime_error("Path does not exist");
-    }
-
-    const fs::path parent = target.parent_path();
-    const fs::path backup = parent / fs::PathFromString(target.filename().string() + "_bak");
-
-    // Make the target disappear (file or directory)
-    fs::rename(target, backup);
-
-    const auto old_perms = fs::status(parent).permissions();
-
-    // Helper to restore state (permissions + original path)
-    auto restore = [&]() {
-        fs::permissions(parent, old_perms, fs::perm_options::replace);
-        if (fs::exists(backup) && !fs::exists(target)) {
-            fs::rename(backup, target);
-        }
-    };
-
-    // Block any access and dir recreation under the parent directory
-    fs::permissions(parent, fs::perms::none, fs::perm_options::replace);
-
-    try {
-        fn(); // run test under simulated failure
-    } catch (...) {
-        restore();
-        throw;
-    }
-
-    restore();
-}
+void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn);
 
 #endif // BITCOIN_TEST_UTIL_COMMON_H

furszy commented at 2:39 PM on April 14, 2026:

Sure, taken. Thanks!

in src/test/flatfile_tests.cpp:139 in a98cebe088

 130 | @@ -130,4 +131,71 @@ BOOST_AUTO_TEST_CASE(flatfile_flush)
 131 |      BOOST_CHECK_EQUAL(fs::file_size(seq.FileName(FlatFilePos(0, 1))), 1U);
 132 |  }
 133 |  
 134 | +BOOST_AUTO_TEST_CASE(flatfile_open_readonly_missing_dir)
 135 | +{
 136 | +    // Open file with read_only=true on a directory that does not exist.
 137 | +    // File should be null instead of throwing an exception.
 138 | +    const auto data_dir = m_args.GetDataDirBase() / "nonexistent";
 139 | +    const FlatFileSeq seq(data_dir, "a", 16 * 1024);

maflcko commented at 8:10 AM on April 14, 2026:

nit in a98cebe088a31933b5ee1c12e862f93165c32263 : It is a bit confusing to see all those oddly specific (and different) numbers for the chunk size, when none of them matter. It would be better to just define a dummy once and use it in all places that don't matter:

diff --git a/src/test/flatfile_tests.cpp b/src/test/flatfile_tests.cpp
index e31a0773d1..e33c12265b 100644
--- a/src/test/flatfile_tests.cpp
+++ b/src/test/flatfile_tests.cpp
@@ -13,2 +13,4 @@
 
+static constexpr uint32_t DUMMY_CHUNK_SZ{1};
+
 BOOST_FIXTURE_TEST_SUITE(flatfile_tests, BasicTestingSetup)
@@ -138,3 +140,3 @@ BOOST_AUTO_TEST_CASE(flatfile_open_readonly_missing_dir)
     const auto data_dir = m_args.GetDataDirBase() / "nonexistent";
-    const FlatFileSeq seq(data_dir, "a", 16 * 1024);
+    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
 
@@ -149,3 +151,3 @@ BOOST_AUTO_TEST_CASE(flatfile_open_write_creates_dir)
     const auto data_dir = m_args.GetDataDirBase() / "new_write_dir";
-    const FlatFileSeq seq(data_dir, "a", 16 * 1024);
+    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
 
@@ -165,3 +167,3 @@ BOOST_AUTO_TEST_CASE(flatfile_open_write_missing_unwritable_parent)
     SimulateFileSystemError(m_path_root, dir, [&dir]() {
-        const FlatFileSeq seq(dir / "blocks", "a", 16 * 1024);
+        const FlatFileSeq seq(dir / "blocks", "a", DUMMY_CHUNK_SZ);
         const AutoFile file{seq.Open(FlatFilePos(0, 0), /*read_only=*/false)};
@@ -176,3 +178,3 @@ BOOST_AUTO_TEST_CASE(flatfile_flush_unwritable_dir)
     fs::create_directories(data_dir);
-    const FlatFileSeq seq(data_dir, "a", 100);
+    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
 
@@ -190,3 +192,3 @@ BOOST_AUTO_TEST_CASE(flatfile_allocate_unwritable_dir)
     fs::create_directories(data_dir);
-    const FlatFileSeq seq(data_dir, "a", 100);
+    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);

furszy commented at 2:39 PM on April 14, 2026:

Sure, taken. Thanks!

in test/functional/p2p_handle_io_errors.py:106 in a98cebe088

  99 | @@ -100,12 +100,14 @@ def test_getdata_on_broken_fs(self):
 100 |          peer = node.add_p2p_connection(P2PInterface())
 101 |          block_hash = node.getblockhash(3)  # Not in recent cache
 102 |          with simulate_io_error(node.blocks_path):
 103 | -            # Bug: we currently swallow the GETDATA request, never reply to the other party, and log issue under debug-NET
 104 | -            with node.assert_debug_log(expected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
 105 | +            # Request block and expect disconnection. The node must not swallow the exception and do nothing.
 106 | +            with node.assert_debug_log(expected_msgs=["Cannot load block from disk"],
 107 | +                                       unexpected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
 108 |                                         timeout=30):

maflcko commented at 8:28 AM on April 14, 2026:

nit in a98cebe088a31933b5ee1c12e862f93165c32263: Now that the disconnect happens, the timeout here is not needed and confusing and should be removed. Also, I think the unexpected_msgs should be removed, as they depend on the exact log formatting, which may change in the future, so they will likely go silently stale anyway, and the disconnection is what matters here and is already checked.

(Same in other places in this commit)

Suggested diff:

diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
index 26c1ed7b51..354f7a6c4c 100755
--- a/test/functional/p2p_handle_io_errors.py
+++ b/test/functional/p2p_handle_io_errors.py
@@ -102,6 +102,4 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
         with simulate_io_error(node.blocks_path):
-            # Request block and expect disconnection. The node must not swallow the exception and do nothing.
-            with node.assert_debug_log(expected_msgs=["Cannot load block from disk"],
-                                       unexpected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
-                                       timeout=30):
+            # Request block and expect disconnection.
+            with node.assert_debug_log(expected_msgs=["Cannot load block from disk"]):
                 peer.send_without_ping(msg_getdata([CInv(MSG_BLOCK | MSG_WITNESS_FLAG, int(block_hash, 16))]))
@@ -133,7 +131,6 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
 
-            # Request block and expect crash. The node must not swallow the exception and do nothing.
-            with node.assert_debug_log(expected_msgs=["Unable to open file"],
-                                       unexpected_msgs=["[ProcessMessages] [net] ProcessMessages(getblocktxn, 34 bytes): Exception 'filesystem error"],
-                                       timeout=10):
+            # Request block and expect crash+disconnect.
+            with node.assert_debug_log(expected_msgs=["Unable to open file"]):
                 peer.send_without_ping(gbtn)
+                peer.wait_for_disconnect()
 
@@ -143,3 +140,2 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
                 expected_stderr=re.compile("Assertion"), # assertion crash
-                timeout=30,
             )

furszy commented at 2:40 PM on April 14, 2026:

looks good!, taken. Thanks

in src/test/util/common.h:111 in a98cebe088

 106 | +    const auto old_perms = fs::status(parent).permissions();
 107 | +
 108 | +    // Helper to restore state (permissions + original path)
 109 | +    auto restore = [&]() {
 110 | +        fs::permissions(parent, old_perms, fs::perm_options::replace);
 111 | +        if (fs::exists(backup) && !fs::exists(target)) {

maflcko commented at 8:35 AM on April 14, 2026:

a98cebe088a31933b5ee1c12e862f93165c32263: same here: Any reason for the check?

diff --git a/src/test/util/common.cpp b/src/test/util/common.cpp
index f547595531..e89fe11e71 100644
--- a/src/test/util/common.cpp
+++ b/src/test/util/common.cpp
@@ -43,5 +43,3 @@ void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path
         fs::permissions(parent, old_perms, fs::perm_options::replace);
-        if (fs::exists(backup) && !fs::exists(target)) {
-            fs::rename(backup, target);
-        }
+        fs::rename(backup, target);
     };

furszy commented at 2:41 PM on April 14, 2026:

not anymore. Removed. Thanks.

in src/test/validation_chainstate_tests.cpp:191 in 63e4541495

 186 | +    // Disconnect block without changing validity so ActivateBestChain
 187 | +    // has a block to reconnect.
 188 | +    {
 189 | +        LOCK2(chainman->GetMutex(), chainstate.MempoolMutex());
 190 | +        BlockValidationState state;
 191 | +        BOOST_CHECK(chainstate.DisconnectTip(state, nullptr));

maflcko commented at 9:14 AM on April 14, 2026:

nit in 63e45414952b4cd155921b0cee20e3d3f97142a5: Could use named arg for /*disconnectpool=*/?

furszy commented at 2:41 PM on April 14, 2026:

Sure. Done.

in src/test/blockmanager_tests.cpp:337 in 55ac4b4f2f

 332 | +        // create_directories fails so the file cannot be created.
 333 | +        {
 334 | +            ASSERT_DEBUG_LOG("Failed to write block");
 335 | +            CBlock dummy_block;
 336 | +            dummy_block.nVersion = 1;
 337 | +            const auto write_pos = blockman.WriteBlock(dummy_block, 999);

maflcko commented at 9:30 AM on April 14, 2026:

nit in 55ac4b4f2f389b588a1f372e506c2f5d63db701b: would be easier to read, if integral literal args used names. In this case, I guess it is /*nHeight=*/999?

furszy commented at 2:41 PM on April 14, 2026:

Sure. Done.

in src/test/blockmanager_tests.cpp:356 in 55ac4b4f2f outdated

 351 | +            tip->nStatus |= BLOCK_HAVE_UNDO;
 352 | +        }
 353 | +    });
 354 | +
 355 | +    // Ensure we haven't corrupted any internal state during the failure.
 356 | +    // Check we can read the block again now that there is no dir access issue going on.

maflcko commented at 9:36 AM on April 14, 2026:

nit in 55ac4b4f2f389b588a1f372e506c2f5d63db701b : Should also check undo beside the block?

Also, when assuming a pointer can not be null, it would be better to use a reference.

Diff to achieve both:

diff --git a/src/test/blockmanager_tests.cpp b/src/test/blockmanager_tests.cpp
index 492cb518d0..b1ff1139ae 100644
--- a/src/test/blockmanager_tests.cpp
+++ b/src/test/blockmanager_tests.cpp
@@ -309,3 +309,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
     auto& blockman = m_node.chainman->m_blockman;
-    CBlockIndex* tip = WITH_LOCK(chainman->GetMutex(), return chainman->ActiveTip());
+    CBlockIndex& tip{*WITH_LOCK(chainman->GetMutex(), return chainman->ActiveTip())};
 
@@ -314,3 +314,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
             ASSERT_DEBUG_LOG("OpenBlockFile failed");
-            FlatFilePos tip_pos = WITH_LOCK(chainman->GetMutex(), return tip->GetBlockPos());
+            FlatFilePos tip_pos = WITH_LOCK(chainman->GetMutex(), return tip.GetBlockPos());
             BOOST_CHECK(!blockman.ReadRawBlock(tip_pos));
@@ -321,3 +321,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
             CBlock block;
-            BOOST_CHECK(!blockman.ReadBlock(block, *tip));
+            BOOST_CHECK(!blockman.ReadBlock(block, tip));
         }
@@ -327,3 +327,3 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
             CBlockUndo block_undo;
-            BOOST_CHECK(!blockman.ReadBlockUndo(block_undo, *tip));
+            BOOST_CHECK(!blockman.ReadBlockUndo(block_undo, tip));
         }
@@ -345,8 +345,8 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
             LOCK(chainman->GetMutex());
-            tip->nStatus &= ~BLOCK_HAVE_UNDO;
+            tip.nStatus &= ~BLOCK_HAVE_UNDO;
 
             BlockValidationState state;
-            BOOST_CHECK(!blockman.WriteBlockUndo(CBlockUndo{}, state, *tip));
+            BOOST_CHECK(!blockman.WriteBlockUndo(CBlockUndo{}, state, tip));
 
-            tip->nStatus |= BLOCK_HAVE_UNDO;
+            tip.nStatus |= BLOCK_HAVE_UNDO;
         }
@@ -357,3 +357,5 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
     CBlock recovered_block;
-    BOOST_CHECK(blockman.ReadBlock(recovered_block, *tip));
+    BOOST_CHECK(blockman.ReadBlock(recovered_block, tip));
+    CBlockUndo block_undo;
+    BOOST_CHECK(blockman.ReadBlockUndo(block_undo, tip));
 }

furszy commented at 2:41 PM on April 14, 2026:

Looks good!, taken. Thanks.

in src/net_processing.cpp:4287 in 476fc0f899

4283 | @@ -4284,7 +4284,11 @@ void PeerManagerImpl::ProcessMessage(Peer& peer, CNode& pfrom, const std::string
4284 |              const bool ret{m_chainman.m_blockman.ReadBlock(block, block_pos, req.blockhash)};
4285 |              // If height is above MAX_BLOCKTXN_DEPTH then this block cannot get
4286 |              // pruned after we release cs_main above, so this read should never fail.
4287 | -            assert(ret);
4288 | +            // Unless we have a I/O issue, in such case, we want to shutdown gracefully.

maflcko commented at 9:38 AM on April 14, 2026:

nit in 476fc0f8991141f9d2d4f687b38b45b82a98bb39 from the llm:

a I/O issue -> an I/O issue [Grammatical article error; “I/O” starts with a vowel sound, so comprehension is improved by using “an”.]

furszy commented at 2:42 PM on April 14, 2026:

Done as suggested.

maflcko approved

maflcko commented at 9:41 AM on April 14, 2026: member

Nice changes. I think the added tests are useful to have proper coverage for this. Also, the code changes move in the right direction. I also like that further places are pointed out where stuff can be improved.

Everything looks great, and I've left more nits on the tests, given that all the code changes are tests. Feel free to ignore any of them, but I'd be happy to re-review all of them.

lgtm 476fc0f8991141f9d2d4f687b38b45b82a98bb39 🏸

<details><summary>Show signature</summary>

Signature:

untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: lgtm 476fc0f8991141f9d2d4f687b38b45b82a98bb39 🏸
XRrVpPFFIr4wq1HN1XUn/Y+JjL2oo7FUVxxSS8lD5F/L2WAqGj/ZewYzFssaE6uY8HeSbaD7nCh4rvbUeaOlBQ==

</details>

maflcko commented at 9:48 AM on April 14, 2026: member

nit on the title: Maybe the prefix should be validation:, because it not only affects the p2p validation flow, but also the RPC one, and possibly the kernel one?

furszy renamed this:
~~net_processing: improve block data I/O error handling in P2P paths~~
validation: improve block data I/O error handling in P2P paths
on Apr 14, 2026

DrahtBot added the label Validation on Apr 14, 2026

furszy force-pushed on Apr 14, 2026

furszy commented at 2:44 PM on April 14, 2026: member

Updated per feedback and very nice suggestions. Thanks maflcko!

maflcko commented at 4:07 PM on April 14, 2026: member

Only change is the nits.

review ACK f6c62f1bd4887d7865bf64c1fb399d47286fe531 🐟

<details><summary>Show signature</summary>

Signature:

untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: review ACK f6c62f1bd4887d7865bf64c1fb399d47286fe531 🐟
QmQ331B1u4334Pm/xCen1sDl+6UdVg8nw67GYZXSKW4pVOWc8QaJLmqxz1b7wh8w5sUV++MOMmen3sKr+9kTDg==

</details>

DrahtBot requested review from frankomosh on Apr 14, 2026

frankomosh commented at 1:54 PM on April 15, 2026: contributor

reACK f6c62f1bd4887d7865bf64c1fb399d47286fe531.

rkrux commented at 2:19 PM on April 15, 2026: contributor

Concept ACK https://github.com/bitcoin/bitcoin/commit/f6c62f1bd4887d7865bf64c1fb399d47286fe531, will review.

+1 for the detailed PR description and verbose commit messages that makes the reviewer feel like a story is unfolding in the PR.

in test/functional/p2p_handle_io_errors.py:144 in c673aacfd7 outdated

 144 | +                expect_error=True,
 145 | +                expected_stderr=re.compile("Assertion"), # assertion crash
 146 | +            )
 147 | +
 148 | +        # Restart node for next test
 149 | +        self.start_node(0)

rkrux commented at 1:45 PM on April 17, 2026:

In c673aacfd78653a3d49cdcac355df3b6024c0608: this is the last test, is starting node here really required?

furszy commented at 3:27 PM on April 17, 2026:

Better to leave the test in a consistent state than force the next person working on it to figure out why their test isn't working.

rkrux commented at 8:11 AM on April 18, 2026:

than force the next person working on it to figure out why their test isn't working.

That's thoughtful but it also adds marginal latency in the test - will leave it to your preference.

in test/functional/p2p_handle_io_errors.py:77 in bd19cea308

  72 | +    def test_block_connect_on_broken_fs(self):
  73 | +        self.log.info("Test block connect on inaccessible filesystem")
  74 | +        node = self.nodes[0]
  75 | +        peer = node.add_p2p_connection(P2PInterface())
  76 | +        blocks = create_empty_fork(node, fork_length=2)
  77 | +        large_block = self.generatetodescriptor(node, 1, f"raw({'55'*100_000})")[0]

rkrux commented at 1:46 PM on April 17, 2026:

In bd19cea30865efc0c8a353843662090446f4aada: s/large_block/large_block_hex

rkrux commented at 3:03 PM on April 17, 2026:

Nit: s/({'55'*100_000})/({'55'*66_000}) can also work because the file size is 64kib for fast prune.

furszy commented at 6:36 PM on April 17, 2026:

string fixed, thx.

maflcko commented at 7:28 AM on May 8, 2026:

I don't think large_block_hex makes sense here. This is the hash of the block, not the full block data in hex. The naming should either be reverted to large_block or use large_block_hash.

maflcko commented at 3:59 PM on May 11, 2026:

Looks like this wasn't addressed in the latest push

furszy commented at 6:00 PM on May 15, 2026:

Looks like this wasn't addressed in the latest push

It seems the latest push predates your comment. That's why it isn't there.

furszy commented at 6:04 PM on May 15, 2026:

Done as suggested.

in test/functional/p2p_handle_io_errors.py:124 in bd19cea308

 119 | +        self.log.info("Test GETBLOCKTXN triggers fatal error on inaccessible filesystem")
 120 | +        node = self.nodes[0]
 121 | +
 122 | +        # Generate two blocks so that the first one is evicted from the
 123 | +        # m_most_recent_block cache.
 124 | +        self.generate(node, 1)

rkrux commented at 2:58 PM on April 17, 2026:

In bd19cea30865efc0c8a353843662090446f4aada:

Generate two blocks

Isn't it generating one block?

furszy commented at 6:35 PM on April 17, 2026:

yep, fixed. Thanks.

in test/functional/p2p_handle_io_errors.py:80 in bd19cea308

  75 | +        peer = node.add_p2p_connection(P2PInterface())
  76 | +        blocks = create_empty_fork(node, fork_length=2)
  77 | +        large_block = self.generatetodescriptor(node, 1, f"raw({'55'*100_000})")[0]
  78 | +        peer.send_and_ping(msg_block(blocks[0]))
  79 | +        (node.blocks_path / 'blk00002.dat').unlink()
  80 | +        node.getblock(large_block)

rkrux commented at 3:13 PM on April 17, 2026:

In bd19cea: I usually find these standalone calls unintuitive because no value is captured and no assertion done, often time it takes me a second to realise that this is there just to show that it works.

Maybe add the following assertion as well to show that forked block is gone now.

+ assert_raises_rpc_error(-1, 'Block not found on disk', node.getblock, blocks[0].hash_hex)

furszy commented at 6:35 PM on April 17, 2026:

sure, pushed.

rkrux commented at 3:18 PM on April 17, 2026: contributor

Started reviewing at f6c62f1bd4887d7865bf64c1fb399d47286fe531, shared few comments in the test.

furszy force-pushed on Apr 17, 2026

furszy commented at 6:37 PM on April 17, 2026: member

Updated per feedback. Test comments fixed. Thanks rkrux.

in test/functional/p2p_handle_io_errors.py:143 in 3fec4bca85 outdated

 141 |  
 142 |              node.wait_until_stopped(
 143 | -                expected_ret_code=-6,
 144 |                  expect_error=True,
 145 | -                expected_stderr=re.compile("Assertion"), # assertion crash
 146 | +                expected_stderr=re.compile(r"fatal internal error"),

rkrux commented at 8:10 AM on April 18, 2026:

In c990e375dbbc5ac2d2ed701dbc0f4f51533d2650: A new error has been added in the commit and it seems like a missed opportunity that it's not asserted for in the corresponding test.

-                expected_stderr=re.compile(r"fatal internal error"),
+                expected_stderr=re.compile(r"Failed to read block during GETBLOCKTXN"),

or

-            with node.assert_debug_log(expected_msgs=["Unable to open file"]):
+            with node.assert_debug_log(expected_msgs=["Unable to open file", "Failed to read block during GETBLOCKTXN"]):

furszy commented at 2:43 PM on April 20, 2026:

Done as suggested.

DrahtBot added the label Needs rebase on Apr 19, 2026

furszy force-pushed on Apr 19, 2026

DrahtBot removed the label Needs rebase on Apr 19, 2026

in test/functional/p2p_handle_io_errors.py:94 in c4204d9137 outdated

  89 | +            peer.wait_for_disconnect()
  90 | +
  91 | +        node.wait_until_stopped(
  92 | +            expect_error=True,
  93 | +            expected_stderr=re.compile(r"fatal internal error"),
  94 | +        )

rkrux commented at 9:32 AM on April 20, 2026:

In c4204d9137b66593a5d38c0e210dac863f6453d0: This new test that has been added seems fine on its own. However, the reviewer might try to relate it to the 2nd point in the PR description because the error here is of ActivateBestChain. But this testcase doesn't exactly test that point, which I found confusing. Can consider moving the following statement from the testing note of the description in the 2nd point itself there to make it obvious.

Adding a functional test for this scenario is not possible due to the requirement of failing inside ActivateBestChain(), which always occurs after AcceptBlock(), which correctly captures the error.

furszy commented at 2:37 PM on April 20, 2026:

This behaves properly not because of the functions calls ordering, but because we are unlinking one single blk file and not the entire directory. The test that is above (test_block_accept_on_broken_fs()) is the one exercising the note you are mentioning.

rkrux commented at 9:34 AM on April 20, 2026: contributor

Overall lgtm at c990e375dbbc5ac2d2ed701dbc0f4f51533d2650

Nit: Can consider switching the order of 3rd and 4th commits because latter is more closely related to the changes in 2nd commit as blockman methods are directly dependent on flatfile methods.

furszy force-pushed on Apr 20, 2026

furszy commented at 2:47 PM on April 20, 2026: member

Updated per feedback. Single test line diff to target a specific error msg during fatal error shutdown. Ready to go.

maflcko commented at 7:36 AM on May 8, 2026: member

lgtm.

Only nit changes in the test.

re-ACK 65347588a1488d13f925da071a5cbf6af56d7a3f 🏹

<details><summary>Show signature</summary>

Signature:

untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: re-ACK 65347588a1488d13f925da071a5cbf6af56d7a3f 🏹
hVSIl3SJ9ICN4IOBkD6rctCSUNVcwg1C9WtUacDs4gChpqAnd8MKAHscr7tKhpebVcDT8AxooFrXo0Ct8ItABA==

</details>

DrahtBot requested review from frankomosh on May 8, 2026

DrahtBot requested review from rkrux on May 8, 2026

rkrux approved

rkrux commented at 10:41 AM on May 8, 2026: contributor

lgtm ACK 65347588a1488d13f925da071a5cbf6af56d7a3f

I like the simulate_io_error context manager - I have noticed similar use cases in other functional tests as well, this might be generalised in the future and made as a utility.

maflcko removed review request from pinheadmz on May 8, 2026

maflcko removed review request from frankomosh on May 8, 2026

DrahtBot requested review from frankomosh on May 8, 2026

DrahtBot requested review from pinheadmz on May 8, 2026

sedited commented at 10:07 PM on May 10, 2026: contributor

Concept ACK

I'm not sure on this approach. If this is just for this single ReadBlock path, might it be easier ascertain that this does not introduce unwanted side effects by moving the call to OpenBlockFile(...) in ReadRawBlock into the try...catch block? I did not test this, but it seems like the same failure could trigger a runaway exception in the indexes at the moment, but I'm not sure if we are still terminating in all cases after this patch. Similarly does this now ignore the exception in ImportBlock?

Nit on the description: Is the node being "stuck" is the best way to describe it? After all, it does not freeze up, but rather can not do any block processing. I guess an admin could theoretically also "unstuck" it in this scenario.

maflcko commented at 6:36 AM on May 11, 2026: member

I'm not sure on this approach. If this is just for this single ReadBlock path, might it be easier ascertain that this does not introduce unwanted side effects by moving the call to OpenBlockFile(...) in ReadRawBlock into the try...catch block?

Generally, I think it would be better to use one type of error handling (at least inside a single function). IIUC, if someone were to rm a blockfile, FlatFileSeq::Open (with read_only=true) would return nullptr and log an error. Same, if someone were to truncate a file. So it seems odd to treat create_directories differently. Either all of them should return nullptr, or all of them should throw (and then return a NotNull). So I think the current patch is preferable to moving the call around inside ReadRawBlock. The alternative patch of reworking the error handling to use exceptions consistently for this function, may be more involved, but I haven't checked.

I did not test this, but it seems like the same failure could trigger a runaway exception in the indexes at the moment, but I'm not sure if we are still terminating in all cases after this patch.

If the indexers are not properly checking the return value of Open(), then it seems like a pre-existing issue, because it can happen also when a file is removed or truncated. If this is the case, it would be good to fix those issues in a separate pull request.

Similarly does this now ignore the exception in ImportBlock?

IIUC it will be leading to a break, just like other Open() errors. I think this is fine, because fs::exists was called prior, so the error could only happen in a racy way, in which case it can't reliably be detected anyway. And it will lead to an abort of the node on the next block read or write anyway.

So I think this PR is the a minimal and still correct patch. Alternative patches are certainly possible, though I think (regardless of who writes them), they should go into a fresh pull request, as they are conceptually different. Then, one of them can be merged and the other one can be closed or rebased.

sedited commented at 9:18 AM on May 11, 2026: contributor

If the indexers are not properly checking the return value of Open(), then it seems like a pre-existing issue, because it can happen also when a file is removed or truncated. If this is the case, it would be good to fix those issues in a separate pull request.

The scenarios where this is not handled adequately at the moment are pretty contrived, but I feel like so are the ones that would make us reach the conditions we aim to guard against in this PR. As far as I can tell by sprinkling in a few error conditions into the code, they do trigger an unclean exit of the node in the indexes, and will instead continue after this patch. It would be good to document that this may go from crashing to ignoring the error on non-p2p paths.

Other than that, the changes do look correct to me. I think it would also be nice to annotate the various member functions using Open in the block manager and Open itself with nodiscard.

maflcko commented at 9:37 AM on May 11, 2026: member

they do trigger an unclean exit of the node in the indexes

Heh, right. Completely unrelated (just related to the error handling mess): IIRC, in other code paths, the indexes do not properly catch exceptions, so if such an exception (which should lead to a fatal node shutdown) is thrown in an rpc thread, it may silently continue (See also #34132 (review))

It would be good to document that this may go from crashing to ignoring the error on non-p2p paths.

Other than that, the changes do look correct to me. I think it would also be nice to annotate the various member functions using Open in the block manager and Open itself with nodiscard.

Yeah, I guess that can't hurt. I'd also be curious about re-thinking the error handling wholesale from the ground up, but maybe this should be a brainstorming issue.

furszy force-pushed on May 15, 2026

furszy commented at 6:32 PM on May 15, 2026: member

Updated per feedback. Annotated the various flat file functions as nodiscard.

The indexes discussion made me realize we are also moving from swallowing a p2p getcfilters exception to properly handling the error now. Which is nice. We could have a test for it.

It would be good to document that this may go from crashing to ignoring the error on non-p2p paths.

What type of doc are you expecting? Commit message pointing to each place or go through each caller and add a comment on top?

While it is true that this changes the indexes paths behavior, it is evident that the previous behavior was not intentional. The code was not expecting OpenFile to throw and crash the process. It was just missing that failure path. So someone could also argue that this is correcting unexpected crashes as well.

If you are not that strong on it, I would rather move forward here, and work on the indexes in a separate PR. I'm happy to continue moving forward over #34897, #34489, #26966 and more. There is a lot we can do there.

sedited commented at 7:52 PM on May 15, 2026: contributor

If you are not that strong on it, I would rather move forward here, and work on the indexes in a separate PR.

Yes, I was only asking for mentioning the various affected paths in the PR description. I don't think it should hold up this pull request.

furszy commented at 9:56 PM on May 15, 2026: member

Yes, I was only asking for mentioning the various affected paths in the PR description. I don't think it should hold up this pull request.

Done. Ready to go.

in src/node/blockstorage.h:227 in d09974cf3d

 223 | @@ -224,7 +224,7 @@ class BlockManager
 224 |      [[nodiscard]] bool FlushChainstateBlockFile(int tip_height);
 225 |      bool FindUndoPos(BlockValidationState& state, int nFile, FlatFilePos& pos, unsigned int nAddSize);
 226 |  
 227 | -    AutoFile OpenUndoFile(const FlatFilePos& pos, bool fReadOnly = false) const;
 228 | +    [[nodiscard]] AutoFile OpenUndoFile(const FlatFilePos& pos, bool fReadOnly = false) const;

maflcko commented at 7:36 AM on May 16, 2026:

in the last commit: Not sure I understand what the benefit of those would be. Of course, it is a bug to open a file and not use the returned file. But this is conceptually true for all functions whose goal is to call them to use their return value. Personally I don't think we should be adding [[nodiscard]] to all those functions, because this is extra verbosity/bloat and I don't think there has ever been a bug of this kind in this codebase, and I fail to see how one can be added in the future. Code like this won't compile anyway with or without the attribute:

auto ReadUndo(const auto& pos) {
  OpenUndoFile(pos);
  Data undo;
  /* nothing -> does not compile */ >> undo;
  return undo;
}

I am not saying that [[nodiscard]] is useless, there are certainly valid use-cases, when a function name does not imply that something (e.g. an error) must be returned and could plausibly return void. For example, two lines up bool Flush() correctly uses the attribute. Also, on AutoFile itself, the attribute is correctly used on fclose. Otherwise, there could be code that compiles, like this:

bool WriteUndo(const auto& pos, const auto& undo) {
 auto f =  OpenUndoFile(pos);
  f << undo;
  f.fclose(); // attribute required here
  return true;
}

So I htink the last commit should be dropped.

furszy commented at 12:24 PM on May 16, 2026:

I pretty much agree with you, that's why I did not add it before. I just assumed you guys were agreeing there based on your comment: #35003 (comment)

And I didn't feel like arguing against it was worth the extra back and forth for such a small change.

Just dropped the last commit.

maflcko commented at 7:37 AM on May 16, 2026: member

Only change is basically adding the last commit, but I think it should be dropped again.

review ACK d09974cf3d5db404b11e7f1f3f959feae19838c3 🐚

<details><summary>Show signature</summary>

Signature:

untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: review ACK d09974cf3d5db404b11e7f1f3f959feae19838c3 🐚
pzfG6+o8Fvf5F63HxAzASFsj6LVWtfvzj2zYEsSw87/4qt8Lc+4qwTethMkOVzOpnzc7BKo+E0dHNIuAfsp1Cw==

</details>

DrahtBot requested review from sedited on May 16, 2026

DrahtBot requested review from rkrux on May 16, 2026

l0rinc commented at 8:39 AM on May 16, 2026: contributor

Concept ACK

furszy force-pushed on May 16, 2026

maflcko commented at 12:27 PM on May 16, 2026: member

review ACK 01daa552c3ce06c8c8bfdab8c5b93a662cee6e79 🌴

<details><summary>Show signature</summary>

Signature:

untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: review ACK 01daa552c3ce06c8c8bfdab8c5b93a662cee6e79 🌴
rHv5rLR5K0wwYrDTf/qFl8KopCbdE6DCO7BHuMOVG4vXCX+WCetd9ODtS8VMaSisoFlX068cVIFnxLKhDuPxCQ==

</details>

DrahtBot requested review from l0rinc on May 16, 2026

furszy commented at 6:10 PM on May 20, 2026: member

@sedited @rkrux wanna take another look? nothing changed from the past review round.

sedited approved

sedited commented at 9:34 PM on May 24, 2026: contributor

ACK 65347588a1488d13f925da071a5cbf6af56d7a3f

While I don't think these are the most desirable mechanics for handling system errors, this is an improvement over the status quo. Some followup work improving the handling in the indexes and in the file allocation path would be good. I like the additional tests.

DrahtBot requested review from sedited on May 24, 2026

rkrux commented at 10:59 AM on May 25, 2026: contributor

lgtm re-ACK 01daa552c3ce06c8c8bfdab8c5b93a662cee6e79

git range-diff 6534758...01daa55

I have not gone through the last 3-4 comments in the PR discussion above but the range-diff is minimal since last a-c-k.

rkrux commented at 11:01 AM on May 25, 2026: contributor

In #35003#pullrequestreview-4353425125

ACK https://github.com/bitcoin/bitcoin/commit/65347588a1488d13f925da071a5cbf6af56d7a3f

This seems to be an outdated commit.

DrahtBot added the label Needs rebase on May 26, 2026

furszy force-pushed on May 26, 2026

furszy commented at 1:51 PM on May 26, 2026: member

Rebased due to conflicts with #33966. Ready to go.

DrahtBot removed the label Needs rebase on May 26, 2026

DrahtBot added the label Needs rebase on May 27, 2026

furszy force-pushed on May 27, 2026

DrahtBot removed the label Needs rebase on May 27, 2026

DrahtBot added the label CI failed on May 27, 2026

DrahtBot removed the label CI failed on May 28, 2026

in src/test/util/common.cpp:9 in 400a5b16e3 outdated

   0 | @@ -0,0 +1,58 @@
   1 | +// Copyright (c) The Bitcoin Core developers
   2 | +// Distributed under the MIT software license, see the accompanying
   3 | +// file COPYING or https://opensource.org/license/mit/.
   4 | +
   5 | +#include <test/util/common.h>
   6 | +
   7 | +void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn)
   8 | +{
   9 | +#ifdef WIN32

w0xlt commented at 6:11 PM on June 4, 2026:

nit: in unsupported environments, the assertions never run, but the tests still pass. This can create false-positive test coverage.

diff --git a/src/test/blockmanager_tests.cpp b/src/test/blockmanager_tests.cpp
index 79f5b15177..9cd6c2d61c 100644
--- a/src/test/blockmanager_tests.cpp
+++ b/src/test/blockmanager_tests.cpp
@@ -331,7 +331,7 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
     auto& blockman = m_node.chainman->m_blockman;
     CBlockIndex& tip{*WITH_LOCK(chainman->GetMutex(), return chainman->ActiveTip())};
 
-    SimulateFileSystemError(m_path_root, m_args.GetBlocksDirPath().parent_path(), [&]() {
+    if (!SimulateFileSystemError(m_path_root, m_args.GetBlocksDirPath().parent_path(), [&]() {
         {
             ASSERT_DEBUG_LOG("OpenBlockFile failed");
             FlatFilePos tip_pos = WITH_LOCK(chainman->GetMutex(), return tip.GetBlockPos());
@@ -372,7 +372,10 @@ BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
 
             tip.nStatus |= BLOCK_HAVE_UNDO;
         }
-    });
+    })) {
+        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
+        return;
+    }
 
     // Ensure we haven't corrupted any internal state during the failure.
     // Check we can read the block again now that there is no dir access issue going on.
diff --git a/src/test/flatfile_tests.cpp b/src/test/flatfile_tests.cpp
index e33c12265b..f835c5e8cb 100644
--- a/src/test/flatfile_tests.cpp
+++ b/src/test/flatfile_tests.cpp
@@ -164,11 +164,14 @@ BOOST_AUTO_TEST_CASE(flatfile_open_write_missing_unwritable_parent)
     fs::create_directories(dir);
 
     // Make it read-only so creating subdirectories fails.
-    SimulateFileSystemError(m_path_root, dir, [&dir]() {
+    if (!SimulateFileSystemError(m_path_root, dir, [&dir]() {
         const FlatFileSeq seq(dir / "blocks", "a", DUMMY_CHUNK_SZ);
         const AutoFile file{seq.Open(FlatFilePos(0, 0), /*read_only=*/false)};
         BOOST_CHECK(file.IsNull());
-    });
+    })) {
+        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
+        return;
+    }
 }
 
 BOOST_AUTO_TEST_CASE(flatfile_flush_unwritable_dir)
@@ -178,9 +181,12 @@ BOOST_AUTO_TEST_CASE(flatfile_flush_unwritable_dir)
     fs::create_directories(data_dir);
     const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
 
-    SimulateFileSystemError(m_path_root, data_dir, [&seq]() {
+    if (!SimulateFileSystemError(m_path_root, data_dir, [&seq]() {
         BOOST_CHECK(!seq.Flush(FlatFilePos(0, 1)));
-    });
+    })) {
+        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
+        return;
+    }
 }
 
 BOOST_AUTO_TEST_CASE(flatfile_allocate_unwritable_dir)
@@ -192,12 +198,15 @@ BOOST_AUTO_TEST_CASE(flatfile_allocate_unwritable_dir)
     fs::create_directories(data_dir);
     const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
 
-    SimulateFileSystemError(m_path_root, data_dir, [&seq]() {
+    if (!SimulateFileSystemError(m_path_root, data_dir, [&seq]() {
         bool out_of_space;
         // Note: this throws due to 'CheckDiskSpace'. In the future, this shouldn't throw either.
         BOOST_CHECK_THROW(seq.Allocate(FlatFilePos(0, 0), 1, out_of_space), fs::filesystem_error);
         BOOST_CHECK(!out_of_space);
-    });
+    })) {
+        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
+        return;
+    }
 }
 
 BOOST_AUTO_TEST_SUITE_END()
diff --git a/src/test/util/common.cpp b/src/test/util/common.cpp
index e89fe11e71..43f666e5fb 100644
--- a/src/test/util/common.cpp
+++ b/src/test/util/common.cpp
@@ -4,19 +4,25 @@
 
 #include <test/util/common.h>
 
-void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn)
+#include <stdexcept>
+
+#ifndef WIN32
+#include <unistd.h>
+#endif
+
+[[nodiscard]] bool SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn)
 {
 #ifdef WIN32
     // On Windows, any open file (such as the db) prevents directory renaming,
     // so we can't simulate a filesystem error in this platform. Skip it.
-    return;
+    return false;
 #else
     // This check relies on filesystem permission manipulation to simulate I/O
     // failures. Running as root bypasses permission checks, so the OS will
     // allow directory creation and file writes even when perms are set to
     // none, making it impossible to simulate the expected failures.
-    if (getuid() == 0) return;
- #endif
+    if (geteuid() == 0) return false;
+#endif
 
     const fs::path root = fs::weakly_canonical(test_root_dir);
     const fs::path target = fs::weakly_canonical(path);
@@ -55,4 +61,5 @@ void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path
     }
 
     restore();
+    return true;
 }
diff --git a/src/test/util/common.h b/src/test/util/common.h
index 77df87cff0..457c33cb7f 100644
--- a/src/test/util/common.h
+++ b/src/test/util/common.h
@@ -8,14 +8,11 @@
 #include <util/fs.h>
 
 #include <chrono>
+#include <functional>
 #include <optional>
 #include <ostream>
 #include <string>
 
-#ifndef WIN32
-#include <unistd.h>
-#endif
-
 /**
  * BOOST_CHECK_EXCEPTION predicates to check the specific validation error.
  * Use as
@@ -70,8 +67,8 @@ inline std::ostream& operator<<(std::ostream& os, const T& obj)
  * unavailable. The target is removed and its parent made inaccessible while
  * `fn()` executes, then everything is restored.
  *
- * Note: Returns early if the environment does not allow simulating a fs error.
+ * Returns false if the environment does not allow simulating a fs error.
  */
-void SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn);
+[[nodiscard]] bool SimulateFileSystemError(const fs::path& test_root_dir, const fs::path& path, const std::function<void()>& fn);
 
 #endif // BITCOIN_TEST_UTIL_COMMON_H
diff --git a/src/test/validation_chainstate_tests.cpp b/src/test/validation_chainstate_tests.cpp
index 1b8cf7121e..7df38707c6 100644
--- a/src/test/validation_chainstate_tests.cpp
+++ b/src/test/validation_chainstate_tests.cpp
@@ -201,15 +201,21 @@ BOOST_FIXTURE_TEST_CASE(activate_best_chain_connect_io_failure, TestChain100Setu
 
     // Inaccessible block file must trigger a fatal error.
     const auto blk_file = blocks_dir / "blk00000.dat";
-    SimulateFileSystemError(m_path_root, blk_file, [&]() {
+    if (!SimulateFileSystemError(m_path_root, blk_file, [&]() {
         check_activate_best_chain_fatal_error("Failed to read block", m_node, chainstate);
-    });
+    })) {
+        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
+        return;
+    }
 
     // Inaccessible blocks directory must also trigger a fatal error.
     const auto parent_dir = blocks_dir.parent_path();
-    SimulateFileSystemError(m_path_root, parent_dir, [&]() {
+    if (!SimulateFileSystemError(m_path_root, parent_dir, [&]() {
         check_activate_best_chain_fatal_error("Failed to read block", m_node, chainstate);
-    });
+    })) {
+        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
+        return;
+    }
 
     // Sanity check: the block reconnects once the filesystem is restored.
     BlockValidationState state;
@@ -255,21 +261,30 @@ BOOST_FIXTURE_TEST_CASE(activate_best_chain_disconnect_io_failure, TestChain100S
 
     // Inaccessible block file during reorg must trigger fatal error.
     const auto blk_file = blocks_dir / "blk00000.dat";
-    SimulateFileSystemError(m_path_root, blk_file, [&]() {
+    if (!SimulateFileSystemError(m_path_root, blk_file, [&]() {
         check_activate_best_chain_fatal_error("Failed to disconnect block", m_node, chainstate);
-    });
+    })) {
+        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
+        return;
+    }
 
     // Inaccessible undo file during reorg must trigger fatal error.
     const auto rev_file = blocks_dir / "rev00000.dat";
-    SimulateFileSystemError(m_path_root, rev_file, [&]() {
+    if (!SimulateFileSystemError(m_path_root, rev_file, [&]() {
         check_activate_best_chain_fatal_error("Failed to disconnect block", m_node, chainstate);
-    });
+    })) {
+        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
+        return;
+    }
 
     // Inaccessible blocks directory during reorg must also trigger fatal error.
     const auto parent_dir = blocks_dir.parent_path();
-    SimulateFileSystemError(m_path_root, parent_dir, [&]() {
+    if (!SimulateFileSystemError(m_path_root, parent_dir, [&]() {
         check_activate_best_chain_fatal_error("Failed to disconnect block", m_node, chainstate);
-    });
+    })) {
+        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
+        return;
+    }
 
     // Sanity check: the reorg completes once the filesystem is restored.
     state = BlockValidationState();

</details>

furszy commented at 2:04 PM on June 7, 2026:

Cool, thanks. Taken.

w0xlt commented at 7:42 PM on June 4, 2026: contributor

ACK 400a5b16e396498d41ec1ae70ec91f7baf45d1db

DrahtBot requested review from maflcko on Jun 4, 2026

furszy force-pushed on Jun 7, 2026

furszy commented at 2:07 PM on June 7, 2026: member

Updated per @w0xlt feedback, thanks. Added test warn logging message on unsupported environments.

in src/test/validation_chainstate_tests.cpp:287 in 7d4d4b8530

 282 | +    if (!SimulateFileSystemError(m_path_root, parent_dir, [&]() {
 283 | +        check_activate_best_chain_fatal_error("Failed to disconnect block", m_node, chainstate);
 284 | +    })) {
 285 | +        BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");
 286 | +        return;
 287 | +    }

maflcko commented at 6:35 PM on June 9, 2026:

not sure about the recent test-change push. this was done to avoid "false coverage" on windows. However, i don't think anyone creates coverage reports on Windows. Also, after the first warning, the later ones are dead and unreachable code.

Also, the patch is incomplete, because the "skipping" regex in src/test/CMakeLists.txt would have to be updated as well.

Also, even with this patch, there will be "false coverage" through all of the code that is run before the check. That is this will be "false coverage":

BOOST_FIXTURE_TEST_CASE(activate_best_chain_disconnect_io_failure, TestChain100Setup)
{
    auto& chainman = m_node.chainman;
    auto& chainstate = chainman->ActiveChainstate();
    const auto blocks_dir = m_args.GetBlocksDirPath();

    // Set up a scenario where ActivateBestChain must reorg:
    //
    //   Original chain: ... -> 100 -> 101 -> 102 -> 103   (more work)
    //   Fork:           ... -> 100 -> F101 -> F102        (currently active)
    //
    // The fork is active but has less work than the original chain, so
    // ActivateBestChain will disconnect the fork blocks via DisconnectTip
    // before reconnecting the original chain.

    // Extend the chain by 3 blocks (heights 101–103).
    for (int i = 0; i < 3; i++) CreateAndProcessBlock({}, CScript() << OP_TRUE);

    // Invalidate at height 101 to create room for a shorter fork.
    CBlockIndex* b101_index = WITH_LOCK(chainman->GetMutex(), return chainman->ActiveChain()[101]);
    BlockValidationState state;
    chainstate.InvalidateBlock(state, b101_index);

    // Build a 2-block fork from height 100. The original chain is invalid,
    // so these become the active tip.
    for (int i = 0; i < 2; i++) CreateAndProcessBlock({}, CScript() << OP_TRUE << OP_TRUE);

    // Now restore the original chain's validity. It has more work (103 > 102) but ActivateBestChain
    // hasn't been called yet, so we're still on the fork. Similar to ReconsiderBlock but without the ABC call.
    {
        LOCK(chainman->GetMutex());
        chainstate.ResetBlockFailureFlags(b101_index);
        chainman->RecalculateBestHeader();
    }

    // Inaccessible block file during reorg must trigger fatal error.
    const auto blk_file = blocks_dir / "blk00000.dat";

My recommendation would be to either revert the last test-only push for now and leave it for a follow-up, or properly do it:

Add the missing skip regex
Do an early check in the first line of each affected test case, and then fatally failing or Asserting inside SimulateFileSystemError that the simulation did work. This also avoids the [[nodiscard]] bool bloat and a simple void can be used.

furszy commented at 4:43 PM on June 16, 2026:

sure, reverted.

in test/functional/p2p_handle_io_errors.py:151 in 356b0307d8 outdated

 146 | +        # their parent directory, so we can't trigger the expected filesystem error.
 147 | +        # Also, when running as root, permission changes don't restrict access, so the
 148 | +        # test condition can't be enforced either.
 149 | +        if platform.system() == 'Windows' or os.geteuid() == 0:
 150 | +            self.log.warning("Skipping test: unable to enforce dir permissions")
 151 | +            return

josibake commented at 9:37 AM on June 10, 2026:

This feels a bit gross and it indicates to me we should be doing the fault injection differently. I don't have a good suggestion on how to do the fault injection better but perhaps this could at least be a skip_windows style function , same as the other functional tests?

furszy commented at 4:53 PM on June 16, 2026:

We actually do something similar inside wallet_multiwallet.py. But I agree, we could do a follow-up standardizing this in the test framework properly; instead of a broad skip_windows, would say we should have a general "can_change_permissions" (or a better name) that explicitly checks for the test requirement.

in test/functional/p2p_handle_io_errors.py:100 in 356b0307d8 outdated

  95 | +
  96 | +        # Restart node for next test
  97 | +        self.start_node(0, extra_args=['-reindex'])
  98 | +
  99 | +    def test_getdata_on_broken_fs(self):
 100 | +        """The node should not swallow GETDATA requests during I/O issues"""

josibake commented at 9:48 AM on June 10, 2026:

It would be much nicer to write the postconditions of the test, e.g.: "GETDATA must discard the failed request and disconnect during block I/O issues", instead of writing the behaviour you are trying to avoid.

I think this makes it easier to assess the correctness of the test and whether or not the test is testing the right things. This also prompts the reviewer to consider other faults that could cause the postconditions to be violated , rather than only focusing their attention on the (possibly) more narrow set of bad behaviour you're testing for here.

furszy commented at 6:40 PM on June 16, 2026:

Sure, taken. thanks

in test/functional/p2p_handle_io_errors.py:117 in 356b0307d8 outdated

 112 | +
 113 | +            # And the node keeps running
 114 | +            node.getblockcount()
 115 | +
 116 | +    def test_getblocktxn_fatal_on_block_io_error(self):
 117 | +        """GETBLOCKTXN block read failures should trigger a fatal error.

josibake commented at 9:52 AM on June 10, 2026:

Same comment regarding post conditions, perhaps this could be more specific: "GETBLOCKTXN block read failures must trigger a fatal shutdown, not an assertion or silent catch."

l0rinc commented at 10:14 AM on June 11, 2026:

Could we also cover that GETCFILTERS failures reach the existing filter-index error path in a new testcase?

furszy commented at 6:52 PM on June 16, 2026:

Same comment regarding post conditions, perhaps this could be more specific: "GETBLOCKTXN block read failures must trigger a fatal shutdown, not an assertion or silent catch."

Sure, taken. thanks.

Could we also cover that GETCFILTERS failures reach the existing filter-index error path in a new testcase?

We surely could. We just need to set up + sync the index first, then send the request.

l0rinc commented at 2:15 AM on June 25, 2026:

We just need to set up + sync the index first, then send the request

Given that you explicitly mention GETCFILTERS in the description, could you please cover that as well?

furszy commented at 9:00 PM on June 30, 2026:

Added coverage for GETCFILTERS, and txindex + spendtxoindex.

in src/test/util/common.cpp:25 in b9bc7297ef outdated

  20 | +    // This check relies on filesystem permission manipulation to simulate I/O
  21 | +    // failures. Running as root bypasses permission checks, so the OS will
  22 | +    // allow directory creation and file writes even when perms are set to
  23 | +    // none, making it impossible to simulate the expected failures.
  24 | +    if (geteuid() == 0) return false;
  25 | +#endif

josibake commented at 9:56 AM on June 10, 2026:

Same comment as the functional test: feels a bit smelly. Perhaps theres a better way to do the fault injection?

furszy commented at 6:51 PM on June 16, 2026:

it seems good material for an investigation issue/PR. Here I looked for the simplest and most contained way to trigger the error path. We surely can come up with a more involved approach. The nice point about this PR is that only simulate_io_error will need to change in the future.

Overall, I find this + #34176 the beginning of a long journey.

in src/flatfile.cpp:49 in b9bc7297ef outdated

  46 | +        fs::create_directories(path.parent_path(), ec);
  47 | +        if (ec) {
  48 | +            LogError("Unable to open file '%s'. Error creating parent directories: %s", fs::PathToString(path), ec.message());
  49 | +            return nullptr;
  50 | +        }
  51 | +    }

josibake commented at 11:06 AM on June 10, 2026:

I think this is the wrong direction.

The problem is, based on my understanding of the PR description, this fs::filesystem_error was bypassing the validation fatal-error path and bubbling up to a broad P2P catch-all, where it was logged and swallowed. More generally, the problem was that the exception was not handled at the right layer, and in practice was not handled at all.

What you're doing now is folding the exception into the existing nullptr return from FlatFileSeq::Open() and then expecting every caller to check and do the right thing with a now more ambiguous failure value.

Ideally, albeit a much bigger change, we would have something like:

CBlock BlockManager::ReadBlockRequired(const CBlockIndex& index);
std::vector<std::byte> BlockManager::ReadRawBlockRequired(const FlatFilePos& pos);
FlatFileHandle FlatFileSeq::OpenRequired(const FlatFilePos& pos, OpenMode mode);

..where "required" indicates the postcondition of the function. More specifically, if FlatFileSeq is given valid pos and mode arguments (preconditions), it is required to open the file at that position. If it cannot satisfy its postcondition, it throws an exception because it is now violating the contract. Then we would have:

struct BlockStorageError : std::runtime_error {
      BlockStorageErrorCode code;
      FlatFilePos pos;
      std::optional<uint256> block_hash;
  };

..and GETDATA would catch a blockstorageerror , discard the request, disconnect the peer and keep running. GETBLOCKTXN would catch a blockstorageerror and fatally shutdown, etc etc.

I realise what I preposed above is likely too expansive of a refactor at this stage, but what you are doing here feels like the wrong abstraction boundary. What about something like this instead:

enum class BlockReadError {
      MissingOrPruned,
      IoError,
      CorruptData,
      WrongBlock,
  };

  util::Expected<CBlock, BlockReadError> ReadBlock(...);

Then callers choose how to handle or propagate:

GETDATA: disconnect peer on IoError, ignore/handle pruned as appropriate.
GETBLOCKTXN: fatal on IoError for recent block.
ActivateBestChain: fatal on any required block/undo read failure.
etc, etc

furszy commented at 7:21 PM on June 16, 2026:

What you're doing now is folding the exception into the existing nullptr return from FlatFileSeq::Open() and then expecting every caller to check and do the right thing with a now more ambiguous failure value.

I don't think that is the right frame for the changes. Here we are accommodating the return value to the current function's contract. The issue exists because the current FlatFileSeq::Open implementation does not respect it. Callers expect nullptr return when FlatFileSeq::Open cannot open the file for any reason, and there is only one spot inside the function that doesn't follow such rule.

About your proposal: I think the back and forth in your comment over how to handle errors, either through exceptions or expected, kinda presents the difficulty of deciding the right approach to follow. I'm sure people will have different perspectives on this, so I think it deserves its own standalone issue/PR.

josibake commented at 9:11 AM on June 18, 2026:

I think that is too strong a claim. The contract you describe was not documented, still is not documented, and was not actually followed before this PR, since fs::create_directories() could throw before the nullable fopen path. This is, to me, is an example of why this sort of implicit contract and mixing error handling is a bad design: it leaves room for bugs like these to happen.

Even after this change, FlatFileSeq::Open() is not a no-throw API; it can still throw from support code such as path construction/conversion, formatting, logging, or allocation. So we can't really enforce your desired contract with a noexcept. Separately, FlatFileSeq::Allocate() still has a filesystem-throwing path through CheckDiskSpace(). So I do not think FlatFileSeq currently has a consistent error-handling contract, which, again, leads to bugs like this.

But I suspect where we actually disagree is that you view the nullable Open() contract as the right direction, while I think it is the wrong abstraction. This PR expands the meaning of nullptr to cover null position, missing/pruned files, ordinary fopen failure, seek failure, and now parent directory creation failure. Those cases do not all have the same recovery semantics, as demonstrated in this PR:

GETDATA can disconnect the peer and continue.
GETBLOCKTXN should fatal.
validation paths should fatal

I would strongly prefer we go in the other direction and stop using nullptr to mean “something went wrong; infer what from surrounding context.”

I also think my earlier proposal may not have been clear. I am not arguing back and forth between exceptions and typed errors as equivalent options. My preferred approach is exceptions: if a function is called with satisfied preconditions but cannot satisfy its postconditions, it should throw, and that exception should be handled or translated at the appropriate layer, likely block storage here, not with small try/catch blocks all over the code. This is the approach described in the presentation I linked.

I only mentioned typed errors as a lesser evil if exceptions (and their consistent handling) are considered too large a change for this PR. I still think that is suboptimal, but it would at least avoid collapsing distinct failure modes into an ambiguous nullptr. I understand your desire to keep this PR scoped, but if we are touching this code, I think we should try to move it toward a more robust, self-documenting, idiomatic C++ error model rather than further entrenching the ambiguous nullptr contract.

Another useful companion reference that helped frame my thinking on this: https://isocpp.org/wiki/faq/exceptions

furszy commented at 3:25 PM on June 18, 2026:

It seems you constructed a bad view of what I said and went really deep on it. Almost all of it doesn't represent the position I hold.

But I suspect where we actually disagree is that you view the nullable Open() contract as the right direction

No, I never said that. I do not think returning nullptr is necessarily the right long term direction. This is just an improvement over the current status quo. The simplest form of it. The one that lets us introduce proper test coverage before making bigger changes in the future.

Btw, we have a very similar view on when to use exceptions vs typed errors vs something else.

Even after this change, FlatFileSeq::Open() is not a no-throw API; it can still throw from support code such as path construction/conversion, formatting, logging, or allocation. So we can't really enforce your desired contract with a noexcept.

I never said this was my desired contract, nor did I propose noexcept (which I'm actually not a fan of). I just described the current state. I was strictly talking about the current source code, which talks for itself. The three lines change fixes all callers because they all expect nullptr when the file cannot be accessed.

I would strongly prefer we go in the other direction and stop using nullptr to mean “something went wrong; infer what from surrounding context.”

I agree. This PR is just about consistency and test coverage over what we currently have. We can dream bigger after it, I have no doubts about that.

josibake commented at 9:22 AM on June 19, 2026:

Fair enough, I may have inferred too much about your longterm preference. I was reading that from the framing that Open() has a current contract of returning nullptr whenever the file cannot be accessed, and that callers expect this (said in a different response to me). If your view is only “this is the smallest consistency fix for the current code,” then thanks for clarifying.

My concern remains, though. Desipte neither of us wanting nullptr as the model, this PR still expands the set of distinct failures represented by nullptr at the lowest layer. That makes the eventual cleanup harder imo, because error information is discarded before a domain layer can decide how to translate it.

I took a look at the call sites again and I think the better incremental fix is to preserve the exception at FlatFileSeq::Open() and translate it at a block storage boundary. For example, BlockManager::ReadRawBlock() / ReadBlockUndo() / write helpers can catch fs::filesystem_error and convert it into their existing false, Unexpected{ReadRawError::IO}, or fatal paths. That still keeps this PR scoped and avoids try/catch noise in net_processing, but it does not collapse another distinct FlatFileSeq::Open() into a nullptr. This might be more lines of code but I fee this would be much easier to reason about from an architecture standpoint and clear improvement over the existing code while setting us up for success in the future.

If that boundary change is too much for this PR, I would still prefer a small amount of local try/catch at the currently affected call sites over expanding Open()’s nullable contract. That keeps the exception as the error propagation, so a later PR can move the translation boundary to block storage without needing to touch the FlatFileSeq code again.

furszy commented at 7:01 PM on June 29, 2026:

My concern remains, though. Desipte neither of us wanting nullptr as the model, this PR still expands the set of distinct failures represented by nullptr at the lowest layer. That makes the eventual cleanup harder imo, because error information is discarded before a domain layer can decide how to translate it.

I don't see how this would make any eventual change harder, and the error isn't being discarded either, it's logged exactly like every other file access error. None of that information is used anywhere today, every file access error is treated the same, and that equal treatment is the consistent part of this PR.

Would also argue that the half-way path you are proposing would just introduce an inconsistency. As one file access error would go through a channel while all others access errors go through a different one. Which would make this structure harder to follow and more error-prone, not easier as you are suggesting.

Overall, I don't think a partial implementation of a different error handling path is useful really. We should either change it all or don't touch it.

the better incremental fix is to preserve the exception at FlatFileSeq::Open() and translate it at a block storage boundary. For example, BlockManager::ReadRawBlock() / ReadBlockUndo() / write helpers can catch fs::filesystem_error

I partially answered this above in the inconsistency argument, but can also add that ReadRawBlock() / ReadBlockUndo() are not the only ones using the Open function. Constraining the changes there only would leave other places un-fixed: TxIndex::FindTx, TxoSpenderIndex::ReadTransaction open the file directly via OpenBlockFile, not through ReadRawBlock/ReadBlockUndo, same for the startup checks and the reindex loop. And BlockFilterIndex::ReadFilterFromDisk uses its own FlatFileSeq directly.

josibake commented at 11:33 AM on June 30, 2026:

What I am proposing is not worse than todays behaviour. Today we already have two error channels: most Open() failures return nullptr, while fs::create_directories() throws and escapes. The bug exists because that exception crosses the wrong boundary and gets swallowed by the P2P catch all. Preserving the exception until BlockManager translates it would not introduce any inconsistency, rather it would put the exceptional path at the layer that actually knows the recovery semantics, while preserving error information.

Logging does not preserve the error in the sense that matters here. A log line is a diagnostic side effect for humans, not error propagation. Once Open() logs and returns nullptr, the caller has lost cause and context.

The reason I keep harping on this is because this PR itself assigns different contexts different recovery semantics. So “all file access errors are treated the same” is only true after the lower layer has already collapsed the information.

I think we both agree that a full cleanup is bigger than this PR, but I don’t agree that the only reasonable options are “change it all” or “keep consistency and collapse this into nullptr too.” Translating the existing throw at a block storage boundary is a valid incremental step, imo, as it fixes the swallowed error bug while preserving the information needed for a cleaner error model later.

It doesn't feel like we see eye to eye on this, though, so I'll leave my review at that.

furszy commented at 3:39 PM on June 30, 2026:

Yeah. We agree on the future, just not on the first step. I understand your point, I'm just not in favor of leaving inconsistencies around since they are usually source of bugs. This error is not treated different by any caller, so I fail to see the point of having two error signaling paths.

in src/net_processing.cpp:4289 in 7d4d4b8530 outdated

4284 | @@ -4285,7 +4285,11 @@ void PeerManagerImpl::ProcessMessage(Peer& peer, CNode& pfrom, const std::string
4285 |              const bool ret{m_chainman.m_blockman.ReadBlock(block, block_pos, req.blockhash)};
4286 |              // If height is above MAX_BLOCKTXN_DEPTH then this block cannot get
4287 |              // pruned after we release cs_main above, so this read should never fail.
4288 | -            assert(ret);
4289 | +            // Unless we have an I/O issue, in such case, we want to shutdown gracefully.
4290 | +            if (!ret) {

josibake commented at 11:12 AM on June 10, 2026:

Strongly agree with the change here! But based on my previous comment, why couldn't this be catching and handling a fs::filesystem_error runtime throw? Or the typed errors I proposed?

furszy commented at 7:33 PM on June 16, 2026:

Strongly agree with the change here! But based on my previous comment, why couldn't this be catching and handling a fs::filesystem_error runtime throw? Or the typed errors I proposed?

This is related to the comment above, the discussion deserves its own issue/PR.

But to at least explain where I currently stand, I think the approach depends on what the caller expects and whether knowing the exact error would let it do anything differently, or whether it would just handle all errors the same way, as is the case here.

At least in this scenario, and in net_processing more generally, adding small try-catch blocks everywhere would be quite noisy and make the code harder to read and maintain.

josibake commented at 8:33 AM on June 19, 2026:

Handling all failures identically does not imply they should be collapsed into an ambiguous return value. A domain exception lets a caller handle the whole class uniformly while still preserving cause and context.

I also don't think existing caller's expectations should dictate the strategy: a functions error propagation strategy should be based on its own contract and boundary. Different callers have different needs, the expectations of callers may change over time, and new callers may have new requirements.

I agree that scattered try/catch blocks would be noisy, and that is not what I am advocating as the ideal design. imo, these disk I/O errors should be handled or translated in BlockManager. But as you've said, this is not a PR for the ideal state. In that case, I think a small amount of local try/catch at the current failure sites would be a better incremental tradeoff than expanding FlatFileSeq::Open()’s nullptr. It fixes the swallowed error behaviour problem stated in the PR while preserving the exception, which keeps us set up for a cleaner refactor of the handling/translation boundary later.

furszy commented at 7:03 PM on June 29, 2026:

My reply here would be similar to what I wrote here #35003 (review).

josibake commented at 11:18 AM on June 10, 2026: member

Concept ACK

Fault suppression masquerading as defensive programing strikes again! Big fan of not swallowing the errors and of improving the test coverage!

I know this is not a fair critique, since I assume this is the best we can come up with, but the file permissions approach seems hacky and the fact you have to work around root users and windows is evidence. Thats not really a blocking piece of feedback, but it would be nice if you could add a TODO comment.

I think the meat of this PR is your change to flatfileseq, so I left the majority of my review there. I've been reading https://sean-parent.stlab.cc/presentations/2022-08-31-exceptions/2022-08-31-exceptions.pdf and really enjoying it, so a lot of the ideas and terminology in my review are borrowed from there.

More generally, I'd recommend everyone reviewing this PR or interested in error handling read/watch the presentation. I think the principles laid out are clear and very applicable to our codebase. h/t @purpleKarrot for recommending it to me.

in test/functional/p2p_handle_io_errors.py:80 in 356b0307d8 outdated

  75 | +    def test_block_connect_on_broken_fs(self):
  76 | +        self.log.info("Test block connect on inaccessible filesystem")
  77 | +        node = self.nodes[0]
  78 | +        peer = node.add_p2p_connection(P2PInterface())
  79 | +        blocks = create_empty_fork(node, fork_length=2)
  80 | +        large_block_hash = self.generatetodescriptor(node, 1, f"raw({'55'*100_000})")[0]

l0rinc commented at 8:34 AM on June 11, 2026:

356b030 test: add P2P coverage for block disk I/O failures:

nit: we could use a smaller block that still crosses the fastprune file boundary:

        large_block_hash = self.generatetodescriptor(node, 1, f"raw({'55'*70_000})")[0]

furszy commented at 7:34 PM on June 16, 2026:

Done as suggested. thanks

in test/functional/p2p_handle_io_errors.py:85 in 356b0307d8 outdated

  80 | +        large_block_hash = self.generatetodescriptor(node, 1, f"raw({'55'*100_000})")[0]
  81 | +        peer.send_and_ping(msg_block(blocks[0]))
  82 | +        # Remove access to the fork block only.
  83 | +        (node.blocks_path / 'blk00002.dat').unlink()
  84 | +        assert_raises_rpc_error(-1, 'Block not found on disk', node.getblock, blocks[0].hash_hex)
  85 | +        node.getblock(large_block_hash) # can still access the tip

l0rinc commented at 8:51 AM on June 11, 2026:

356b030 test: add P2P coverage for block disk I/O failures:

Instead of discarding the result, could the P2P block-connect test assert that the active-tip block is still readable?

        assert_equal(node.getblock(large_block_hash)["hash"], large_block_hash)

furszy commented at 10:03 PM on June 16, 2026:

Done as suggested. thanks

in src/test/flatfile_tests.cpp:128 in 7d4d4b8530

l0rinc commented at 9:28 AM on June 11, 2026:

In the new tests we're validating the state of Allocate and Flush, could we backport it to existing tests, something like:

BOOST_AUTO_TEST_CASE(flatfile_flush)
{
    const auto data_dir = m_args.GetDataDirBase();
    FlatFileSeq seq(data_dir, "a", 100);

    bool out_of_space;
    BOOST_CHECK_EQUAL(seq.Allocate(FlatFilePos(0, 0), 1, out_of_space), 100U);
    BOOST_CHECK(!out_of_space);

    // Flush without finalize should not truncate file.
    BOOST_CHECK(seq.Flush(FlatFilePos(0, 1)));
    BOOST_CHECK_EQUAL(fs::file_size(seq.FileName(FlatFilePos(0, 1))), 100U);

    // Flush with finalize should truncate file.
    BOOST_CHECK(seq.Flush(FlatFilePos(0, 1), true));
    BOOST_CHECK_EQUAL(fs::file_size(seq.FileName(FlatFilePos(0, 1))), 1U);
}

furszy commented at 11:34 PM on June 16, 2026:

Done as suggested. thanks

in src/test/flatfile_tests.cpp:203 in 7d4d4b8530 outdated

 198 | +    fs::create_directories(data_dir);
 199 | +    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
 200 | +
 201 | +    if (!SimulateFileSystemError(m_path_root, data_dir, [&seq]() {
 202 | +        bool out_of_space;
 203 | +        // Note: this throws due to 'CheckDiskSpace'. In the future, this shouldn't throw either.

l0rinc commented at 9:34 AM on June 11, 2026:

b9bc729 FlatFile: do not throw for parent dir creation failure:

Could we fix this instead?

diff --git a/src/flatfile.cpp b/src/flatfile.cpp
index cd25476bdc..230dee2dc2 100644
--- a/src/flatfile.cpp
+++ b/src/flatfile.cpp
@@ -75,7 +75,15 @@ size_t FlatFileSeq::Allocate(const FlatFilePos& pos, size_t add_size, bool& out_
         size_t new_size = n_new_chunks * m_chunk_size;
         size_t inc_size = new_size - old_size;
 
-        if (CheckDiskSpace(m_dir, inc_size)) {
+        bool has_disk_space;
+        try {
+            has_disk_space = CheckDiskSpace(m_dir, inc_size);
+        } catch (const fs::filesystem_error& e) {
+            LogError("Unable to allocate file '%s'. Error checking disk space: %s", fs::PathToString(FileName(pos)), e.what());
+            return 0;
+        }
+
+        if (has_disk_space) {
             FILE *file = Open(pos);
             if (file) {
                 LogDebug(BCLog::VALIDATION, "Pre-allocating up to position 0x%x in %s%05u.dat\n", new_size, m_prefix, pos.nFile);
diff --git a/src/test/flatfile_tests.cpp b/src/test/flatfile_tests.cpp
index 3ad37e0f13..9f288a4b28 100644
--- a/src/test/flatfile_tests.cpp
+++ b/src/test/flatfile_tests.cpp
@@ -201,8 +201,7 @@ BOOST_AUTO_TEST_CASE(flatfile_allocate_unwritable_dir)
 
     if (!SimulateFileSystemError(m_path_root, data_dir, [&seq]() {
         bool out_of_space;
-        // Note: this throws due to 'CheckDiskSpace'. In the future, this shouldn't throw either.
-        BOOST_CHECK_THROW(seq.Allocate(FlatFilePos(0, 0), 1, out_of_space), fs::filesystem_error);
+        BOOST_CHECK_EQUAL(seq.Allocate(FlatFilePos(0, 0), 1, out_of_space), 0U);
         BOOST_CHECK(!out_of_space);
     })) {
         BOOST_WARN_MESSAGE(false, "skipping: unable to enforce filesystem error simulation");

furszy commented at 11:12 PM on June 16, 2026:

I thought about this initially, but the try-catch seemed something we may want inside CheckDiskSpace rather than the per call try-catch. There isn't any diff for the caller between "no disk space" and "no disk access", they both should probably make CheckDiskSpace return false.

The change isn't big but introducing it would affect other areas that are not being tested; like the scheduler disk space check and FlushStateToDisk. That's why thought about leaving it for a focused follow-up and not expand this PR more than what is needed.

in src/net_processing.cpp:4288 in 7d4d4b8530

4284 | @@ -4285,7 +4285,11 @@ void PeerManagerImpl::ProcessMessage(Peer& peer, CNode& pfrom, const std::string
4285 |              const bool ret{m_chainman.m_blockman.ReadBlock(block, block_pos, req.blockhash)};
4286 |              // If height is above MAX_BLOCKTXN_DEPTH then this block cannot get
4287 |              // pruned after we release cs_main above, so this read should never fail.
4288 | -            assert(ret);
4289 | +            // Unless we have an I/O issue, in such case, we want to shutdown gracefully.

l0rinc commented at 9:35 AM on June 11, 2026:

7d4d4b8 net: replace GETBLOCKTXN assert on block read error with graceful shutdown:

I had to read it multiple times, maybe we could simplify it a bit:

            // If an I/O issue makes the read fail, shut down gracefully.

furszy commented at 11:18 PM on June 16, 2026:

The idea was to connect the sentence to the lines above. I don't find the repeated "If" at the beginning helpful.

Maybe

// If height is above MAX_BLOCKTXN_DEPTH then this block cannot get
// pruned after we release cs_main above, so this read should never fail.
// Unless an I/O error occurs, in which case we want to shut down gracefully.

in test/functional/p2p_handle_io_errors.py:41 in 356b0307d8 outdated

  36 | +    blocks_bak = blocks_path.parent / "blocks_bak"
  37 | +    parent_dir = blocks_path.parent
  38 | +    old_mode = stat.S_IMODE(os.stat(parent_dir).st_mode)
  39 | +
  40 | +    os.rename(blocks_path, blocks_bak)
  41 | +    os.chmod(parent_dir, 0o500)  # Prevent re-creation of blocks/

l0rinc commented at 9:58 AM on June 11, 2026:

356b030 test: add P2P coverage for block disk I/O failures:

These helpers rename a path before changing permissions - but if we fail we should probably still put the path back:

diff --git a/src/test/util/common.cpp b/src/test/util/common.cpp
index 43f666e5fb..5c2926ff66 100644
--- a/src/test/util/common.cpp
+++ b/src/test/util/common.cpp
@@ -38,22 +38,20 @@
 
     const fs::path parent = target.parent_path();
     const fs::path backup = parent / fs::PathFromString(target.filename().string() + "_bak");
+    const auto old_perms = fs::status(parent).permissions();
 
     // Make the target disappear (file or directory)
     fs::rename(target, backup);
 
-    const auto old_perms = fs::status(parent).permissions();
-
     // Helper to restore state (permissions + original path)
     auto restore = [&]() {
         fs::permissions(parent, old_perms, fs::perm_options::replace);
         fs::rename(backup, target);
     };
 
-    // Block any access and dir recreation under the parent directory
-    fs::permissions(parent, fs::perms::none, fs::perm_options::replace);
-
     try {
+        // Block any access and dir recreation under the parent directory
+        fs::permissions(parent, fs::perms::none, fs::perm_options::replace);
         fn(); // run test under simulated failure
     } catch (...) {
         restore();
diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
index 9b9dd3f2cf..217bd58737 100755
--- a/test/functional/p2p_handle_io_errors.py
+++ b/test/functional/p2p_handle_io_errors.py
@@ -39,9 +39,8 @@ def simulate_io_error(blocks_path):
     old_mode = stat.S_IMODE(os.stat(parent_dir).st_mode)
 
     os.rename(blocks_path, blocks_bak)
-    os.chmod(parent_dir, 0o500)  # Prevent re-creation of blocks/
-
     try:
+        os.chmod(parent_dir, 0o500)  # Prevent re-creation of blocks/
         yield
     finally:
         os.chmod(parent_dir, old_mode)

furszy commented at 11:23 PM on June 16, 2026:

Done as suggested. thanks

maflcko commented at 5:46 AM on June 25, 2026:

#35003 (review): I don't think this change was a beneficial change. When the chmod fails, the test should also fail, not silently pass and even skip running the fn/yield.

I think the correct fix would be to revert this change and then wait for the test to fail, and then figure out if that env/platform should be skipped or supported.

l0rinc commented at 1:18 PM on June 25, 2026:

When the chmod fails, the test should also fail

The suggestion was about renaming before we check for permissions.

maflcko commented at 3:46 PM on June 30, 2026:

When the chmod fails, the test should also fail

The suggestion was about renaming before we check for permissions.

Yes, I understand. I don't see the risk in renaming before changing permissions. Either changing the permissions will work fine, in which case nothing was gained. Or, changing the permissions will fail, in which case the test should fail anyway, and also nothing was gained by silently renaming back. In fact, it seems harmful, because it will silently be skipping the test.

A test failure should lead to a loud failure, not a silent skip.

in src/test/util/common.cpp:31 in b9bc7297ef outdated

  26 | +
  27 | +    const fs::path root = fs::weakly_canonical(test_root_dir);
  28 | +    const fs::path target = fs::weakly_canonical(path);
  29 | +
  30 | +    // Simple sanity check: ensure target is inside the test directory.
  31 | +    if (!fs::PathToString(target).starts_with(fs::PathToString(root))) {

l0rinc commented at 10:05 AM on June 11, 2026:

b9bc729 FlatFile: do not throw for parent dir creation failure:

Is this safe against a sibling path like /tmp/node10/blocks starting with /tmp/node1?

furszy commented at 7:45 PM on June 18, 2026:

The unit tests root is not /tmp/, the root is /tmp/test_common bitcoin/<test_name>/<rand_id>. Where <rand_id> is unique per run, and unit test initialize a single node.

in src/test/blockmanager_tests.cpp:328 in 93a575c7ce outdated

 322 | @@ -322,4 +323,66 @@ BOOST_FIXTURE_TEST_CASE(prune_lock_update_and_delete, TestingSetup)
 323 |      BOOST_CHECK(!blockman.DeletePruneLock("nonexistent"));
 324 |  }
 325 |  
 326 | +// Verify that block read operations return false (not throw) when the
 327 | +// blocks directory is missing (mimicking a detached volume).
 328 | +BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)

l0rinc commented at 10:50 AM on June 11, 2026:

93a575c test: ensure consistent failure behavior in BlockManager reads/writes:

Seem to me we should hold cs_main for the duration of the test and inside the SimulateFileSystemError lambda, matching other blockmanager test cases. Running it with -DDEBUG_LOCKORDER fails with:

Assertion failed: lock ::cs_main not held in node/blockstorage.cpp:1138; locks held:
unknown location:0: fatal error: in "blockmanager_tests/blockmanager_io_failure_consistency": signal: SIGABRT (application abort requested)
test/blockmanager_tests.cpp:352: last checkpoint

Something like this makes it pass again:

diff --git a/src/test/blockmanager_tests.cpp b/src/test/blockmanager_tests.cpp
--- a/src/test/blockmanager_tests.cpp	(revision 831ad45f33a296f28c3847e4dbc3645ab4292308)
+++ b/src/test/blockmanager_tests.cpp	(revision 3a73313ddd9cc0e4c2c43e7080c292bacde7a44d)
@@ -329,14 +329,16 @@
 // blocks directory is missing (mimicking a detached volume).
 BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
 {
+    LOCK(::cs_main);
     auto& chainman = m_node.chainman;
     auto& blockman = m_node.chainman->m_blockman;
-    CBlockIndex& tip{*WITH_LOCK(chainman->GetMutex(), return chainman->ActiveTip())};
+    CBlockIndex& tip{*chainman->ActiveTip()};
 
     if (!SimulateFileSystemError(m_path_root, m_args.GetBlocksDirPath().parent_path(), [&]() {
+        LOCK(::cs_main);
         {
             ASSERT_DEBUG_LOG("OpenBlockFile failed");
-            FlatFilePos tip_pos = WITH_LOCK(chainman->GetMutex(), return tip.GetBlockPos());
+            FlatFilePos tip_pos = tip.GetBlockPos();
             BOOST_CHECK(!blockman.ReadRawBlock(tip_pos));
         }
 
@@ -366,7 +368,6 @@
         // Clear BLOCK_HAVE_UNDO so it actually attempts to write.
         {
             ASSERT_DEBUG_LOG("Failed to write undo data");
-            LOCK(chainman->GetMutex());
             tip.nStatus &= ~BLOCK_HAVE_UNDO;
 
             BlockValidationState state;

(I'm surprised this passed CI, cc: @maflcko)

maflcko commented at 1:52 PM on June 11, 2026:

(I'm surprised this passed CI, cc: @maflcko)

Hmm, not sure. It is set in one CI config:

ci/test/00_setup_env_native_asan.sh: -DAPPEND_CPPFLAGS='-DARENA_DEBUG -DDEBUG_LOCKORDER' \

Also, the clang TSA do not seem to be hitting?

maflcko commented at 4:27 PM on June 12, 2026:

@l0rinc See ec6cf49b91f726aa8cb70087eb3fcacb14b4ff22

maflcko commented at 4:29 PM on June 12, 2026:

My preference would be to fix it with a minimal lock guard, not a recursive guard that spans over several scopes.

furszy commented at 11:31 PM on June 16, 2026:

Done as suggested. thanks

l0rinc commented at 12:49 AM on June 25, 2026:

No lock order problems now 👍

in src/test/flatfile_tests.cpp:144 in b9bc7297ef outdated

 139 | +    // File should be null instead of throwing an exception.
 140 | +    const auto data_dir = m_args.GetDataDirBase() / "nonexistent";
 141 | +    const FlatFileSeq seq(data_dir, "a", DUMMY_CHUNK_SZ);
 142 | +
 143 | +    const AutoFile file{seq.Open(FlatFilePos(0, 0), /*read_only=*/true)};
 144 | +    BOOST_CHECK(file.IsNull());

l0rinc commented at 10:57 AM on June 11, 2026:

b9bc729 FlatFile: do not throw for parent dir creation failure:

This passes on master as well, maybe we could add this to the end for completeness:

diff --git a/src/test/flatfile_tests.cpp b/src/test/flatfile_tests.cpp
--- a/src/test/flatfile_tests.cpp	(revision 3a73313ddd9cc0e4c2c43e7080c292bacde7a44d)
+++ b/src/test/flatfile_tests.cpp	(revision 8af0268a17fe930d99425bc885b5b461bc1609eb)
@@ -143,6 +143,8 @@
 
     const AutoFile file{seq.Open(FlatFilePos(0, 0), /*read_only=*/true)};
     BOOST_CHECK(file.IsNull());
+    // A read-only open must not create the missing directory either.
+    BOOST_CHECK(!fs::exists(data_dir));
 }
 
 BOOST_AUTO_TEST_CASE(flatfile_open_write_creates_dir)

furszy commented at 11:32 PM on June 16, 2026:

Done as suggested. thanks

in test/functional/p2p_handle_io_errors.py:97 in 356b0307d8 outdated

  92 | +            expect_error=True,
  93 | +            expected_stderr=re.compile(r"fatal internal error"),
  94 | +        )
  95 | +
  96 | +        # Restart node for next test
  97 | +        self.start_node(0, extra_args=['-reindex'])

l0rinc commented at 10:58 AM on June 11, 2026:

356b030 test: add P2P coverage for block disk I/O failures:

Can we assert that reindexing actually recovered the chain?

diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
--- a/test/functional/p2p_handle_io_errors.py	(revision 8af0268a17fe930d99425bc885b5b461bc1609eb)
+++ b/test/functional/p2p_handle_io_errors.py	(revision 51ff3dd9a9826f696952b59c1f581b71c5b82eee)
@@ -96,8 +96,10 @@
             expected_stderr=re.compile(r"fatal internal error"),
         )
 
-        # Restart node for next test
+        # Restart node for next test. Only the fork block file was deleted,
+        # so reindexing the remaining block files must restore the old tip.
         self.start_node(0, extra_args=['-reindex'])
+        self.wait_until(lambda: node.getbestblockhash() == large_block_hash)
 
     def test_getdata_on_broken_fs(self):
         """The node should not swallow GETDATA requests during I/O issues"""

furszy commented at 11:34 PM on June 16, 2026:

Done as suggested. thanks

l0rinc changes_requested

l0rinc commented at 11:01 AM on June 11, 2026: contributor

Thanks for taking care of this, went over the changes quickly, left a few comments.

maflcko closed this on Jun 12, 2026

maflcko reopened this on Jun 12, 2026

DrahtBot added the label CI failed on Jun 12, 2026

DrahtBot closed this on Jun 12, 2026

DrahtBot reopened this on Jun 12, 2026

DrahtBot removed the label CI failed on Jun 12, 2026

DrahtBot added the label CI failed on Jun 12, 2026

DrahtBot commented at 5:04 PM on June 12, 2026: contributor

🚧 At least one of the CI tasks failed. Task test ancestor commits: https://github.com/bitcoin/bitcoin/actions/runs/27427396736/job/81068654537 LLM reason (✨ experimental): CI failed because Clang -Wthread-safety-analysis (treated as -Werror) reported WriteBlock was called without exclusively holding cs_main in blockmanager_tests.cpp.

<details><summary>Hints</summary>

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

</details>

furszy force-pushed on Jun 16, 2026

furszy commented at 11:53 PM on June 16, 2026: member

Thanks everyone, updated per feedback. I hope to have tackled them all.

DrahtBot removed the label CI failed on Jun 17, 2026

in test/functional/p2p_handle_io_errors.py:153 in 69bb470ab0

 148 | +        # their parent directory, so we can't trigger the expected filesystem error.
 149 | +        # Also, when running as root, permission changes don't restrict access, so the
 150 | +        # test condition can't be enforced either.
 151 | +        if platform.system() == 'Windows' or os.geteuid() == 0:
 152 | +            self.log.warning("Skipping test: unable to enforce dir permissions")
 153 | +            return

l0rinc commented at 10:48 PM on June 24, 2026:

69bb470 test: add P2P coverage for block disk I/O failures:

This doesn't seem to work for me. I don't have Windows access so I just changed this to if True: and I'm getting:

p2p_handle_io_errors.py | ✓ Passed | 1 s

This one did work for me locally:

            raise SkipTest("unable to enforce dir permissions")

showing:

p2p_handle_io_errors.py | ○ Skipped | 1 s

furszy commented at 8:07 PM on June 29, 2026:

Done as suggested

in test/functional/p2p_handle_io_errors.py:111 in 1f1fd29591 outdated

 106 | @@ -107,12 +107,12 @@ def test_getdata_on_broken_fs(self):
 107 |          peer = node.add_p2p_connection(P2PInterface())
 108 |          block_hash = node.getblockhash(3)  # Not in recent cache
 109 |          with simulate_io_error(node.blocks_path):
 110 | -            # Bug: we currently swallow the GETDATA request, never reply to the other party, and log issue under debug-NET
 111 | -            with node.assert_debug_log(expected_msgs=["[ProcessMessages] [net] ProcessMessages(getdata, 37 bytes): Exception 'filesystem error"],
 112 | -                                       timeout=30):
 113 | +            # Request block and expect disconnection.
 114 | +            with node.assert_debug_log(expected_msgs=["Cannot load block from disk"]):

l0rinc commented at 10:59 PM on June 24, 2026:

1f1fd29 FlatFile: do not throw for parent dir creation failure:

beautiful, I love that you've covered the existing functionality separately

in src/test/validation_chainstate_tests.cpp:208 in 063fd2ce52

 203 | +    const auto blk_file = blocks_dir / "blk00000.dat";
 204 | +    SimulateFileSystemError(m_path_root, blk_file, [&]() {
 205 | +        check_activate_best_chain_fatal_error("Failed to read block", m_node, chainstate);
 206 | +    });
 207 | +
 208 | +    // Inaccessible blocks directory must also trigger a fatal error.

l0rinc commented at 11:38 PM on June 24, 2026:

063fd2c test: verify ActivateBestChain fatal errors on I/O failure:

We're simulating two unrelated failures here where the first might theoretically cause/affect the second - could we do the full disconnect-tip/simulate-the-specific-path-failure/reconnect-after-restoration for both cases, something like:

BOOST_FIXTURE_TEST_CASE(activate_best_chain_connect_io_failure, TestChain100Setup)
{
    auto& chainstate = m_node.chainman->ActiveChainstate();
    const auto blocks_dir = m_args.GetBlocksDirPath();
    for (const auto& path : {blocks_dir / "blk00000.dat", fs::path{blocks_dir.parent_path()}}) {
        // Disconnect block without changing validity so ActivateBestChain has a block to reconnect.
        BlockValidationState state;
        BOOST_REQUIRE([&] {
            LOCK2(m_node.chainman->GetMutex(), chainstate.MempoolMutex());
            return chainstate.DisconnectTip(state, /*disconnectpool=*/nullptr);
        }());
        BOOST_CHECK(state.IsValid());

        SimulateFileSystemError(m_path_root, path, [&] {
            check_activate_best_chain_fatal_error("Failed to read block", m_node, chainstate);
        });

        // Reconnect once the filesystem is restored.
        BOOST_REQUIRE(chainstate.ActivateBestChain(state));
        BOOST_CHECK(state.IsValid());
    }
}

furszy commented at 8:46 PM on June 29, 2026:

Done as suggested

in src/test/validation_chainstate_tests.cpp:183 in 063fd2ce52

 178 | +{
 179 | +    ASSERT_DEBUG_LOG(expected_log);
 180 | +    BlockValidationState state;
 181 | +    BOOST_CHECK(!chainstate.ActivateBestChain(state));
 182 | +    BOOST_CHECK_EQUAL(node.exit_status.load(), EXIT_FAILURE); // Ensure fatal error
 183 | +    node.exit_status = EXIT_SUCCESS;

l0rinc commented at 11:46 PM on June 24, 2026:

063fd2c test: verify ActivateBestChain fatal errors on I/O failure:

Resetting only exit_status after an expected fatal error seems fragile - it does not fully reset the node context, because fatalError() also requests shutdown and leaves the fixture interrupt set - and could leave lingering state. Could we rather prevent it from sending the shutdown signal, something like:

static void check_activate_best_chain_fatal_error(const std::string& expected_log, node::NodeContext& node, Chainstate& chainstate)
{
    Assert(node.notifications)->m_shutdown_on_fatal_error = false;
    node.exit_status = EXIT_SUCCESS;

    ASSERT_DEBUG_LOG(expected_log);
    BlockValidationState state;
    BOOST_REQUIRE(!chainstate.ActivateBestChain(state));
    BOOST_CHECK(state.IsError());
    BOOST_CHECK_EQUAL(node.exit_status, EXIT_FAILURE); // Ensure fatal error
}

furszy commented at 8:49 PM on June 29, 2026:

Done as suggested

in src/test/validation_chainstate_tests.cpp:246 in 063fd2ce52

 241 | +    BlockValidationState state;
 242 | +    chainstate.InvalidateBlock(state, b101_index);
 243 | +
 244 | +    // Build a 2-block fork from height 100. The original chain is invalid,
 245 | +    // so these become the active tip.
 246 | +    for (int i = 0; i < 2; i++) CreateAndProcessBlock({}, CScript() << OP_TRUE << OP_TRUE);

l0rinc commented at 12:04 AM on June 25, 2026:

063fd2c test: verify ActivateBestChain fatal errors on I/O failure:

We're doing OP_TRUE << OP_TRUE instead of previous OP_TRUE coinbases because the fork blocks need to be different from the original blocks created just above?

nit: please consider making the distinction slightly simpler with something like:

    for (int i{0}; i < 3; ++i) CreateAndProcessBlock({}, CScript{} << OP_1);
...
    for (int i{0}; i < 2; ++i) CreateAndProcessBlock({}, CScript{} << OP_2);

furszy commented at 8:52 PM on June 29, 2026:

We're doing OP_TRUE << OP_TRUE instead of previous OP_TRUE coinbases because the fork blocks need to be different from the original blocks created just above?

Yes. A quick way of getting a different hash. There wouldn't be a fork if blocks are the same.

in src/test/validation_chainstate_tests.cpp:260 in 063fd2ce52

 255 | +
 256 | +    // Inaccessible block file during reorg must trigger fatal error.
 257 | +    const auto blk_file = blocks_dir / "blk00000.dat";
 258 | +    SimulateFileSystemError(m_path_root, blk_file, [&]() {
 259 | +        check_activate_best_chain_fatal_error("Failed to disconnect block", m_node, chainstate);
 260 | +    });

l0rinc commented at 12:33 AM on June 25, 2026:

063fd2c test: verify ActivateBestChain fatal errors on I/O failure:

Similarly to before, can the reorg failure test make the pending reorg precondition visible for each disk-read failure path separately?

// Invalidate at height 101 to create room for a shorter fork.
CBlockIndex* b101_index = WITH_LOCK(chainman->GetMutex(), return chainman->ActiveChain()[101]);
BlockValidationState state;
BOOST_REQUIRE(chainstate.InvalidateBlock(state, b101_index));
BOOST_CHECK(state.IsValid());
BOOST_REQUIRE_EQUAL(100, WITH_LOCK(chainman->GetMutex(), return chainman->ActiveHeight()));

// Build a 2-block fork from height 100. The original chain is invalid,
// so these become the active tip. Use a distinct coinbase script so this
// creates new fork blocks instead of recreating the invalidated blocks.
for (int i{0}; i < 2; ++i) CreateAndProcessBlock({}, CScript{} << OP_2);
BOOST_REQUIRE_EQUAL(102, WITH_LOCK(chainman->GetMutex(), return chainman->ActiveHeight()));

// Restore the original chain's validity without activating it. It has
// more work (103 > 102), so ActivateBestChain must reorg back to it.
{
    LOCK(chainman->GetMutex());
    chainstate.ResetBlockFailureFlags(b101_index);
    chainman->RecalculateBestHeader();
}

for (const auto& path : {blocks_dir / "blk00000.dat", blocks_dir / "rev00000.dat", fs::path{blocks_dir.parent_path()}}) {
    BOOST_REQUIRE_EQUAL(102, WITH_LOCK(chainman->GetMutex(), return chainman->ActiveHeight()));
    SimulateFileSystemError(m_path_root, path, [&] {
        check_activate_best_chain_fatal_error("Failed to disconnect block", m_node, chainstate);
    });
    BOOST_CHECK_EQUAL(102, WITH_LOCK(chainman->GetMutex(), return chainman->ActiveHeight()));
}

// Sanity check: the reorg completes once the filesystem is restored.
state = BlockValidationState{};
BOOST_REQUIRE(chainstate.ActivateBestChain(state));
BOOST_CHECK(state.IsValid());
BOOST_CHECK_EQUAL(103, WITH_LOCK(chainman->GetMutex(), return chainman->ActiveHeight()));

furszy commented at 4:58 PM on June 30, 2026:

Done as suggested

in src/test/blockmanager_tests.cpp:337 in a55a553a68

 332 | +    auto& chainman = m_node.chainman;
 333 | +    auto& blockman = m_node.chainman->m_blockman;
 334 | +    CBlockIndex& tip{*WITH_LOCK(chainman->GetMutex(), return chainman->ActiveTip())};
 335 | +
 336 | +    SimulateFileSystemError(m_path_root, m_args.GetBlocksDirPath().parent_path(), [&]() {
 337 | +        {

l0rinc commented at 2:00 AM on June 25, 2026:

a55a553 test: ensure consistent failure behavior in BlockManager reads/writes:

Running every assertion under the same SimulateFileSystemError block makes it harder to see which operation depends on state left by a previous check, could we split them out (the test is very fast, don't see the need to optimize it):

BOOST_FIXTURE_TEST_CASE(blockmanager_io_failure_consistency, TestChain100Setup)
{
    auto& chainman = m_node.chainman;
    auto& blockman = m_node.chainman->m_blockman;
    CBlockIndex& tip{*WITH_LOCK(chainman->GetMutex(), return chainman->ActiveTip())};
    const auto blocks_dir = m_args.GetBlocksDirPath().parent_path();

    SimulateFileSystemError(m_path_root, blocks_dir, [&] {
        ASSERT_DEBUG_LOG("OpenBlockFile failed");
        FlatFilePos tip_pos = WITH_LOCK(chainman->GetMutex(), return tip.GetBlockPos());
        BOOST_CHECK(!blockman.ReadRawBlock(tip_pos));
    });
    SimulateFileSystemError(m_path_root, blocks_dir, [&] {
        ASSERT_DEBUG_LOG("OpenBlockFile failed");
        CBlock block;
        BOOST_CHECK(!blockman.ReadBlock(block, tip));
    });
    SimulateFileSystemError(m_path_root, blocks_dir, [&] {
        ASSERT_DEBUG_LOG("OpenUndoFile failed");
        CBlockUndo block_undo;
        BOOST_CHECK(!blockman.ReadBlockUndo(block_undo, tip));
    });
    SimulateFileSystemError(m_path_root, blocks_dir, [&] {
        // WriteBlock should return a null position, not throw.
        // create_directories fails so the file cannot be created.
        ASSERT_DEBUG_LOG("Failed to write block");
        CBlock dummy_block;
        dummy_block.nVersion = 1;
        LOCK(chainman->GetMutex());
        const auto write_pos = blockman.WriteBlock(dummy_block, /*nHeight=*/999);
        BOOST_CHECK(write_pos.IsNull());
    });
    SimulateFileSystemError(m_path_root, blocks_dir, [&] {
        // WriteBlockUndo should return false, not throw.
        // Clear BLOCK_HAVE_UNDO so it actually attempts to write.
        ASSERT_DEBUG_LOG("Failed to write undo data");
        LOCK(chainman->GetMutex());
        tip.nStatus &= ~BLOCK_HAVE_UNDO;

        BlockValidationState state;
        BOOST_CHECK(!blockman.WriteBlockUndo(CBlockUndo{}, state, tip));
        BOOST_CHECK(state.IsError());

        tip.nStatus |= BLOCK_HAVE_UNDO;
    });

    // Ensure we haven't corrupted any internal state during the failure.
    // Check we can read the block again now that there is no dir access issue going on.
    CBlock recovered_block;
    BOOST_CHECK(blockman.ReadBlock(recovered_block, tip));
    CBlockUndo block_undo;
    BOOST_CHECK(blockman.ReadBlockUndo(block_undo, tip));
}

furszy commented at 4:59 PM on June 30, 2026:

Done as suggested

in src/net_processing.cpp:4362 in d4f95399b5 outdated

4357 | @@ -4358,7 +4358,11 @@ void PeerManagerImpl::ProcessMessage(Peer& peer, CNode& pfrom, const std::string
4358 |              const bool ret{m_chainman.m_blockman.ReadBlock(block, block_pos, req.blockhash)};
4359 |              // If height is above MAX_BLOCKTXN_DEPTH then this block cannot get
4360 |              // pruned after we release cs_main above, so this read should never fail.
4361 | -            assert(ret);
4362 | +            // Unless an I/O error occurs, in which case we want to shut down gracefully.
4363 | +            if (!ret) {

l0rinc commented at 2:03 AM on June 25, 2026:

d4f9539 net: replace GETBLOCKTXN assert on block read error with graceful shutdown:

Alternatively we could group the two branches and have a single return:

if (m_chainman.m_blockman.ReadBlock(block, block_pos, req.blockhash)) {
    SendBlockTransactions(pfrom, peer, block, req);
} else {
    m_chainman.GetNotifications().fatalError(_("Failed to read block during GETBLOCKTXN"));
}
return;

l0rinc commented at 9:49 PM on June 30, 2026:

It seems to me compact-block announcements still abort on the same kind of failure: https://github.com/bitcoin/bitcoin/blob/dc282ff31d1cc97507530a541d9cec8a8f6a6ef4/src/net_processing.cpp#L5997-L5998

Is that deliberate, or could we use the same fatal-error style there?

nit: please see my suggestion above on how we can simplify the code after the assertion removal.

furszy commented at 6:50 PM on July 1, 2026:

nit: please see my suggestion above on how we can simplify the code after the assertion removal.

As it is a style nit, will leave it as is for now.

l0rinc changes_requested

l0rinc commented at 2:16 AM on June 25, 2026: contributor

Went over the change more thoroughly, I love the structure, covering current behavior first, followed by fix and surgical test changes. I left a few comments, mostly about isolating effects of the file/folder corruptions to make sure they don't affect each other and about covering all mentioned cases with tests.

maflcko commented at 5:56 AM on June 25, 2026: member

Haven't read all of the discussion threads, but since my last review, the only change seems to be moving the chmod, which doesn't seem like a good change. (See inline comment)

Also, the suggestion by @josibake to use exceptions here sounds interesting. However, I haven't looked at the details and code, so it is harder to tell which is the right fit here. I'd be happy to review either approach, and wonder what other reviewers prefer here?

lgtm ACK d4f95399b5af4643b90bb396c4d172e908ae87c2 🐻

<details><summary>Show signature</summary>

Signature:

untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: lgtm ACK d4f95399b5af4643b90bb396c4d172e908ae87c2 🐻
oH+HVBgFxqvZd9iiXWWh2X30aRqlpbsZRmLXAWKmHc3lkuFDKeMRDVJRktd+GT8xFT0DR5CtMzWrWtrUHDDUCA==

</details>

DrahtBot requested review from l0rinc on Jun 25, 2026

DrahtBot requested review from josibake on Jun 25, 2026

DrahtBot requested review from w0xlt on Jun 25, 2026

maflcko commented at 5:56 AM on June 25, 2026: member

.

furszy force-pushed on Jun 30, 2026

furszy commented at 8:28 PM on June 30, 2026: member

I hopefully tackled all remaining feedback. Also added coverage for the GETCFILTERS p2p path (swallowed exception) and the RPC getrawtransaction and gettxspendingprevout paths (low-level filesystem exception reaching the user).

l0rinc approved

l0rinc commented at 10:08 PM on June 30, 2026: contributor

code review ACK 81cddf0d878a27b84c85db831be0165648b89e94

The changes look correct to me and the new coverage exercises the relevant failure paths well. Seems to me the remaining compact-block announcement should likely get the same fatal-error treatment, but I'm also fine with handling it as a follow-up. (nit: PR description needs slight adjustment now regarding commit order)

DrahtBot requested review from maflcko on Jun 30, 2026

furszy commented at 6:47 PM on July 1, 2026: member

It seems to me compact-block announcements still abort on the same kind of failure: https://github.com/bitcoin/bitcoin/blob/dc282ff31d1cc97507530a541d9cec8a8f6a6ef4/src/net_processing.cpp#L5997-L5998

Is that deliberate, or could we use the same fatal-error style there?

Yes, it is in the PR description. You probably forgot this one #35003#pullrequestreview-4072836130 and my reply #35003 (comment).

in test/functional/p2p_handle_io_errors.py:41 in cfdcd3b313

  36 | +from test_framework.wallet import MiniWallet
  37 | +
  38 | +
  39 | +@contextlib.contextmanager
  40 | +def simulate_io_error(target_path):
  41 | +    # Make files under target_path inaccessible by renaming its parent dir,

maflcko commented at 12:20 PM on July 2, 2026:

in cfdcd3b3136b58e3815e8e1333da7d50fb668cfc: I don't think this recent force push is correct.

Not the target_path's parent is renamed, but the target_path itself.

furszy commented at 3:13 PM on July 2, 2026:

ups, comment fixed.

in test/functional/p2p_handle_io_errors.py:153 in cfdcd3b313 outdated

 148 | +
 149 | +            # And the node keeps running
 150 | +            node.getblockcount()
 151 | +
 152 | +    def test_getcfilters_on_broken_fs(self):
 153 | +        """GETCFILTERS must log the filter-index read failure and keep running."""

maflcko commented at 12:39 PM on July 2, 2026:

in the first commit: Just like getdata, it would be kind to the peer to disconnect it, rather than silently not reply to the request?

furszy commented at 3:11 PM on July 2, 2026:

I was thinking about tackling it in a separate PR because it would be good to cover all filter-related messages at once, not just this one. We disconnect inside PrepareBlockFilterRequest for all of them, but not in any of the subsequent lookups.

maflcko commented at 12:59 PM on July 2, 2026: member

only test changes since last review, re-ACK 81cddf0d878a27b84c85db831be0165648b89e94 🌊

<details><summary>Show signature</summary>

Signature:

untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: only test changes since last review, re-ACK 81cddf0d878a27b84c85db831be0165648b89e94 🌊
25kQW99wcgHBdEbUId8fCqwmWawZrgdJHnNKS3hwKbmceslmlwVE98zG84KPEXu7zjtO4F8GL1Yeaj9ewBhNAQ==

</details>

test: add P2P coverage for block disk I/O failures

Add tests showing that block disk I/O failures triggered
through P2P requests are silently swallowed.

The idea is to simulate I/O errors by making the blocks directory
temporarily inaccessible while the node is running. This allows block
reads to fail when serving P2P requests.

Aside from the well behaving test cases, this covers a few cases
where the node mishandles the error:

- GETDATA: failures should be handled locally by disconnecting the
  peer and logging the error, without impacting node operation.

- GETBLOCKTXN: failures when accessing recent blocks violate our
  "last 10 blocks must always be available for reorgs" rule and
  must result in a fatal error, rather than being caught and only
  logged by the P2P message handling general try-catch.

- txindex/txospenderindex: getrawtransaction and gettxspendingprevout
  open the block file directly, so filesysteme exceptions leak out as
  a generic RPC error instead of being handled cleanly.

- GETCFILTERS: failures should be logged through the filter index read
  path and leave the node running, rather than being swallowed by the
  P2P messages general try-catch.

Co-authored-by: MarcoFalke <*~=`'#}+{/-|&$^_@721217.xyz>

beeb248e22

FlatFile: do not throw for parent dir creation failure

The callers of this function don't expect it to throw.

For example, when the blocks directory becomes inaccessible
(e.g. volume detach or permission changed), FlatFileSeq::Open's
call to fs::create_directories() throws filesystem_error. On the
block connection path, if this occurs during ActivateBestChain(),
just after AcceptBlock() the exception bypasses FatalError entirely.
It gets caught by ProcessMessage's general catch-all with only a
NET-level debug log, swallowing the error msg for any regular
user, and every subsequent block arrival retries the same failing
read, leaving the node looking healthy while the block processing
mechanism is stuck.

1acce228e7

test: verify ActivateBestChain fatal errors on I/O failure

Cover the two ActivateBestChain paths that read data from
disk: ConnectTip (block connection) and DisconnectTip (reorg).

Simulate inaccessible block and undo files, ensure fatal error
and shutdown is triggered.

23205b4b14

test: ensure consistent failure behavior in BlockManager reads/writes

Verifies that BlockManager behaves the same way whether a block file is
missing or the blocks directory is inaccessible.

Reads (ReadBlock, ReadRawBlock, ReadBlockUndo) return false, writes
(WriteBlock, WriteBlockUndo) return null/false, and failures are logged.

The goal is consistent failure handling. Previously, missing block files
returned false and logged an error, while an inaccessible blocks directory
was throwing an exception.

7814d7f00b

net: replace GETBLOCKTXN assert on block read error with graceful shutdown

A failure here indicates a serious inconsistency or disk issue, and the
current behavior is to abort immediately, which is correct but not the best.
Replace assert with a fatal error so the node shuts down through the normal
shutdown path instead.

This does not change the outcome (the node still stops), but allows a more
controlled teardown, giving other components a chance to flush their state
even when a particular block (or the blocks dir) on disk is unreachable.
E.g. the blocks dir might be the only affected, the mempool might be ok,
and we don't want to lose such information.

768014068b

furszy force-pushed on Jul 2, 2026

l0rinc commented at 3:58 PM on July 2, 2026: contributor

reACK 768014068b8ccdb205b4ad8e666eed07db247221

DrahtBot requested review from maflcko on Jul 2, 2026

maflcko commented at 4:17 PM on July 2, 2026: member

test-doc-only re-ACK 768014068b8ccdb205b4ad8e666eed07db247221 🍏

<details><summary>Show signature</summary>

Signature:

untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: test-doc-only re-ACK 768014068b8ccdb205b4ad8e666eed07db247221  🍏
eTqlGYFVii78opeRBwUyzB67YFv2RzmwLLXHJWOpZnMwKSqaIkoE59cn2wV8OTyGy5qmBLInA6fVRY01kz3RAQ==

</details>

sedited commented at 9:47 AM on July 4, 2026: contributor

@rkrux @w0xlt @frankomosh @josibake can you take another look here?

josibake commented at 12:10 PM on July 9, 2026: member

Approach NACK

While I certainly appreciate the work @furszy has put into this, the PR in its current form is not the direction I think we should be going. I spent some time reading and thinking through a more comprehensive error handling strategy for the codebase in order to convince myself I'm not pushing back simply for aesthetics, and I am convinced removing exceptions from the low level file primitive and folding it into a nullptr is the wrong direction.

I would much prefer something like this in, instead of the FlatFileSeq commit:

index b060108c18..df89984d53 100644
--- a/src/node/blockstorage.cpp
+++ b/src/node/blockstorage.cpp
@@ -821,13 +821,23 @@ void BlockManager::UnlinkPrunedFiles(const std::set<int>& setFilesToPrune) const

 AutoFile BlockManager::OpenBlockFile(const FlatFilePos& pos, bool fReadOnly) const
 {
-    return AutoFile{m_block_file_seq.Open(pos, fReadOnly), m_obfuscation};
+    try {
+        return AutoFile{m_block_file_seq.Open(pos, fReadOnly), m_obfuscation};
+    } catch (const fs::filesystem_error& e) {
+        LogError("OpenBlockFile failed for %s: %s", pos.ToString(), e.what());
+        return AutoFile{nullptr, m_obfuscation};
+    }
 }

 /** Open an undo file (rev?????.dat) */
 AutoFile BlockManager::OpenUndoFile(const FlatFilePos& pos, bool fReadOnly) const
 {
-    return AutoFile{m_undo_file_seq.Open(pos, fReadOnly), m_obfuscation};
+    try {
+        return AutoFile{m_undo_file_seq.Open(pos, fReadOnly), m_obfuscation};
+    } catch (const fs::filesystem_error& e) {
+        LogError("OpenUndoFile failed for %s: %s", pos.ToString(), e.what());
+        return AutoFile{nullptr, m_obfuscation};
+    }
 }

 fs::path BlockManager::GetBlockPosFilename(const FlatFilePos& pos) const

This preserves the exception, and moves the translation of that error into BlockManager which I have argued is the appropriate layer for handling the exception. This directly solves the stated problem: exceptions were being swallowed at a higher layer and not handled. It solves the problem by handling the exception at the appropriate layer, without needing to touch the lower level file primitives.

What my proposal does not address is the BlockFilterIndex uses of FlatFileSeq. I think thats fine, as those are not in the critical path. I'd argue its even desirable, as we may want to have a different policy on how FlatFileSeq failures are handled there, further demonstrating the strength of this approach.

I do understand that the approach in this PR was done with consistency in mind, but I don't think that is enough of a justification when the thing being made consistent is, as I've tried to argue here, a bad design.

in src/flatfile.cpp:42 in 1acce228e7

  36 | @@ -37,8 +37,16 @@ FILE* FlatFileSeq::Open(const FlatFilePos& pos, bool read_only) const
  37 |      if (pos.IsNull()) {
  38 |          return nullptr;
  39 |      }
  40 | -    fs::path path = FileName(pos);
  41 | -    fs::create_directories(path.parent_path());
  42 | +    const fs::path path = FileName(pos);
  43 | +    // Create parent path only if we are going to write.
  44 | +    if (!read_only) {

josibake commented at 12:12 PM on July 9, 2026:

Note: this is unrelated to error handling afaict and could be pulled out into its own PR with its own test, no?

maflcko commented at 1:38 PM on July 9, 2026: member

Let's recall this pull has more than 150 comments and is sitting for 3 months (which is reasonable for its size). However, it doesn't seem to be getting traction and has outstanding issues, some of which are enough to NACK it.

So, I don't think the pull will be merged as-is. Even if it was merged, there'd be a bunch of follow-ups, so it would be good to decide how to address them before merge.

I think it would be good to step back and answer a meta question:

Do we even care about providing fallbacks for those exceptions? Let's recall they can only happen when there is a coding bug, hardware corruption, or the users physically fiddles with the hardware (e.g. disconnect an external blocksdir)

There are different answers:

"They are so rare, we don't care about them at all". Then I think this pull (and other ones) can be closed and we can all move on to other things.
"They are rare, but providing a fatal shutdown path in all of them is too tedious". Then, I think it is fine to manually terminate/abort in the functions (See #35676)
"They are rare, but providing a fatal shutdown path in all of them should be done". Then, I think there needs to be a single large pull request to do it in all paths.
... Some other answer I am not seeing?

josibake commented at 2:16 PM on July 9, 2026: member

Thanks for zooming out @maflcko , I think its helpful. I'll respond here, but this is starting to feel like a discussion that can be moved to an issue around error handling in general, considering the discussion now spans two PRs.

Do we even care about providing fallbacks for those exceptions?

This is a policy question. I think its up to the caller to decide, which is why I've been pushing back on folding multiple errors into a nullptr, and pushing back on having the low level function abort instead of propagating the error up. Said differently, the answer to this question will change in the future, and it will likely be different for different callers.

Some other answer I am not seeing?

I think we should stop swallowing errors with broad catch alls that do nothing. Instead, if we can't catch it, let it terminate. If we would prefer a graceful fatal shutdown instead, have broad catch all that executes a safe shut down. Even better, have each boundary own their own error handling policy as I recommend here: BlockManager should own storage errors, BlockfilterIndex should manage storage errors related to indexes, etc etc.

In each of these boundaries, if they have the information they need to know that a graceful fatal shutdown is possible/safe/etc, they can do that. If the cant make that call, they should allow the error to propagate.

DrahtBot requested review from josibake on Jul 9, 2026

furszy commented at 2:49 PM on July 9, 2026: member

Restating the discussion we already had: #35003 (review).

Today, every filesystem error in FlatFileSeq::Open is handled through the same path except for this one case. Your proposal @josibake keeps that one case special, how leaving a function with two error-signaling paths for the same class of failure makes sense to you? That is just error-prone. It doesn't really matter which kind of error we are talking about.

Overall, following the by the books approach without pondering the actual sources is just bad in my view. And treating this bug fix PR as a broader architectural design discussion spot, instead of just creating a new issue or PR (as it was suggested in my first replies to you), is just blocking progress unnecessarily.

Let's recall this pull has more than 150 comments and is sitting for 3 months (which is reasonable for its size)

The PR is 97% about adding tests. It changes 5 lines of code in the node at most.

maflcko commented at 3:40 PM on July 9, 2026: member

Let's recall this pull has more than 150 comments and is sitting for 3 months (which is reasonable for its size)

The PR is 97% about adding tests. It changes 5 lines of code in the node at most.

Sorry for being unclear, with size I didn't mean the LOC in ./src, but the review size, also approximated by the number of diverse tests. Let's recall that the exceptions involved here concern all threads: scheduler, indexes, initload, main, p2p, rpc, except for the script and fetcher threads. Also, the exceptions here (or the return value) can propagate up ~anywhere in the codebase (except maybe for wallet and unrelated low-level utils). So any reviewer must be familiar with the whole codebase.

I can't provide an answer here, but I think reviewers should ask themselves the meta question (https://github.com/bitcoin/bitcoin/pull/35003#issuecomment-4925615702) and then hopefully there is rough consensus on the answer, whatever that will be.