net_processing: improve block data I/O error handling in P2P paths

furszy commented at 3:29 pm on April 4, 2026: member

Early note: the majority of this PR consists of test coverage. The changes per se are quite small, just the issue shows up in multiple paths in slightly different forms.

The goal of the PR is to ensure I/O errors are not silently ignored during validation or P2P message handling. In some cases, such as GETDATA or GETBLOCKTXN requests, errors can be swallowed, leaving the node unresponsive to the request without any action. In a harder to reach but more severe case, a swallowed error can leave the node alive but stuck, unable to process new blocks and advance the chain. Also, silently ignoring errors obviously makes problems harder to diagnose. So this PR seeks to improve all of that.

Currently, the same root cause, inability to access the blocks directory, can produce different behaviors depending on where it occurs: 

If it happens during block connection, inside AcceptBlock(), the error gets caught and triggers a fatal error graceful shutdown. This is the correct behavior.
If it happens during block connection, after AcceptBlock(), inside ActivateBestChain(), the file system error gets thrown and swallowed by the P2P general try-catch, leaving the node in a living but stuck state. Where ActivateBestChain() will always silently fail on any posterior block processing, retrying to active the failing block.
If it happens during GETDATA request, the file system error gets swallowed by the message handling general try-catch, only logging a net-debug-level message. The remote peer is not disconnected, nor the node aborts. It just silently ignores the request. This is different to a missing block data, which currently logs the error + disconnects the peer.
If it happens during GETBLOCKTXN request, the file system error either gets swallowed by the message handling general try-catch or crashes the system via an assertion, depending on if the problem is at the blocks directory level or at the block data level, which is inconsistent.

So, this PR makes failure handling consistent and ensures the node never enters a stuck state due to a block I/O error. Changes:

GETDATA now treats the issue in the same way as other I/O errors: it logs the error + disconnect the remote peer.
GETBLOCKTXN now triggers a fatal error and graceful shutdown. The error here is stricter than in GETDATA because it can only access the last 10 blocks, which must be available for reorgs.
ActivateBestChain() now triggers a fatal error and graceful shutdown when it happens. Fixing the silent stuck node state scenario.

Testing Note Can reproduce the different behaviors that cause the stuck node scenario by running the second commit’s unit test 28af683b5dfdce88892e367b308857b94ec630fe or by manually introducing an exception in ConnectTip() (which occurs inside ActivateBestChain()). Previously, the thrown exception would have being caught by the general P2P try-catch and logged only at net-debug level, which most users do not have enabled, leaving the node alive but stuck, as subsequent block connections repeatedly attempt to process the failing block, throw the exception again, and get silently swallowed. Adding a functional test for this scenario is not possible due to the requirement of failing inside ActivateBestChain(), which always occurs after AcceptBlock(), which correctly captures the error.

Separate Note There is another assertion scenario that we can move to graceful shutdown (fatal error) in the compact block relay that I haven’t done here.

Extra Note If want to see a similar error producing a crash, but in the wallet, go to #34176.

DrahtBot commented at 3:30 pm on April 4, 2026: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	frankomosh
Concept ACK	pinheadmz

If your review is incorrectly listed, please copy-paste <!–meta-tag:bot-skip–> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#34858 (test: Use NodeClockContext in more tests by maflcko)
#33477 (Rollback for dumptxoutset without invalidating blocks by fjahr)
#32317 (kernel: Separate UTXO set access from validation functions by sedited)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

Unless we have a I/O issue, ... -> Unless we have an I/O issue, ... [Article is wrong (“a” vs “an”), which can impede quick comprehension of the intended meaning.]
... must not be swallowed by the P2P message handling general try-catch. -> ... must not be swallowed by the P2P message handling's general try-catch. [Missing possessive/connector makes the phrase unclear; it reads as if “handling general” is describing the wrong thing.]

Possible places where named args for integral literals may be used (e.g. func(x, /*named_arg=*/0) in C++, and func(x, named_arg=0) in Python):

FlatFileSeq(data_dir, "a", 16 * 1024) in src/test/flatfile_tests.cpp (flatfile_open_readonly_missing_dir)
FlatFileSeq(data_dir, "a", 16 * 1024) in src/test/flatfile_tests.cpp (flatfile_open_write_creates_dir)
FlatFileSeq(dir / "blocks", "a", 16 * 1024) in src/test/flatfile_tests.cpp (flatfile_open_write_missing_unwritable_parent)
FlatFileSeq(data_dir, "a", 100) in src/test/flatfile_tests.cpp (flatfile_flush_unwritable_dir)
FlatFileSeq(data_dir, "a", 100) in src/test/flatfile_tests.cpp (flatfile_allocate_unwritable_dir)

Possible places where comparison-specific test macros should replace generic comparisons:

[src/test/flatfile_tests.cpp] BOOST_CHECK_THROW(seq.Allocate(FlatFilePos(0, 0), 1, out_of_space), fs::filesystem_error); -> Prefer BOOST_CHECK_EXCEPTION(...) with a HasReason(...) matcher if you want to ensure the specific failure reason/message (rather than only the exception type).

2026-04-05 18:48:22

furszy force-pushed on Apr 4, 2026

DrahtBot added the label CI failed on Apr 4, 2026

FlatFile: do not throw for parent dir creation failure

The callers of this function don't expect it to throw.

For example, when the blocks directory becomes inaccessible
(e.g. volume detach or permission changed), FlatFileSeq::Open's
call to fs::create_directories() throws filesystem_error. On the
block connection path, if this occurs during ActivateBestChain(),
just after AcceptBlock() the exception bypasses FatalError entirely.
It gets caught by ProcessMessage's general catch-all with only a
NET-level debug log, effectively swallowing the error msg for
regular users, and every subsequent block arrival retries the
same failing read, leaving the node looking healthy while the
chain is stuck.

89318e74b6

test: verify ActivateBestChain fatal errors on I/O failure

Cover the two ActivateBestChain paths that read data from
disk: ConnectTip (block connection) and DisconnectTip (reorg).

Simulate inaccessible block and undo files, ensure fatal error
and shutdown is triggered.

c59592bbcb

test: ensure consistent failure behavior in BlockManager reads/writes

Verifies that BlockManager behaves the same way whether a block file is
missing or the blocks directory is inaccessible.

Reads (ReadBlock, ReadRawBlock, ReadBlockUndo) return false, writes
(WriteBlock, WriteBlockUndo) return null/false, and failures are logged.

The goal is consistent failure handling. Previously, missing block files
returned false and logged an error, while an inaccessible blocks directory
was throwing an exception.

0aeece851d

net: replace GETBLOCKTXN assert on block read error with graceful shutdown

A failure here indicates a serious inconsistency or disk issue, and the
current behavior is to abort immediately, which is correct but not the best.
Replace assert with a fatal error so the node shuts down through the normal
shutdown path instead.

This does not change the outcome (the node still stops), but allows a more
controlled teardown, giving other components a chance to flush their state
even when a particular block (or the blocks dir) on disk is unreachable.
E.g. the blocks dir might be the only affected, the mempool might be ok,
and we don't want to lose such information.

ba034f4db7

furszy force-pushed on Apr 4, 2026

test: add P2P coverage for block disk I/O failures

Add tests to ensure that block disk I/O failures triggered
through P2P requests are not silently swallowed.

The idea is to simulate I/O errors by making the blocks directory
temporarily inaccessible while the node is running. This allows block
reads to fail when serving P2P requests.

Two behaviors are exercised:

- GETDATA: failures should be handled locally by disconnecting the
  peer and logging the error, without impacting node operation.

- GETBLOCKTXN: failures when accessing recent blocks violate our
  "last 10 blocks must always be available for reorgs" rule and
  must result in a fatal error, rather than being caught and only
  logged by the P2P message handling general try-catch.

72f7a67370

furszy force-pushed on Apr 5, 2026

DrahtBot removed the label CI failed on Apr 5, 2026

pinheadmz commented at 6:02 pm on April 6, 2026: member

concept ACK Thanks for the thorough PR description! Definitely helps me to understand the code changes before even looking at them. Can also see its minimal changes with lots of test coverage, will review.

frankomosh commented at 4:45 am on April 8, 2026: contributor

tACK 72f7a67370255cab1bf0dfcf39920dd23a3c268b

Built from source. All new tests pass:

flatfile_tests: 9 cases, no errors
validation_chainstate_tests: 5 cases, no errors
blockmanager_tests: 8 cases, no errors
p2p_handle_io_errors.py: both scenarios pass

Also manually reverted the GETBLOCKTXN fix (commit ba034f4) by removing the fatalError call and replacing with a silent return. The functional test correctly caught this kind of mutation. the test expected the node to shut down, but the mutant node stayed alive, causing a timeout failure. Mutant killed by p2p_handle_io_errors.py. Unit tests did not detect this mutation. Also process_messages and process_message fuzz harnesses both passed without detecting the mutation, confirming the functional test is providing the critical oracle here.

Line 5915 in SendMessages has assert(ret) after a ReadBlock call in the compact block announcement path. I think this is the same pattern as the GETBLOCKTXN assert fixed in commit 4. I believe this is the “another assertion scenario in compact block relay” mentioned in the PR description. Is this planned as a follow-up? Happy to take it if so.

DrahtBot requested review from pinheadmz on Apr 8, 2026

in test/functional/p2p_handle_io_errors.py:109 in 72f7a67370

104+        if platform.system() == 'Windows' or os.geteuid() == 0:
105+           self.log.warning("Skipping test: unable to enforce dir permissions")
106+           return
107+
108+        self.test_getdata_on_broken_fs()
109+        self.test_getblocktxn_fatal_on_block_io_error()

maflcko commented at 8:36 am on April 8, 2026:

Is there a reason why the p2p block msg flow is not tested end-to-end? I understand that the end-to-end behavior is already correct for it, and that there are unit tests, but they only look at the problem from a zoomed-in-on-flatfile perspective.

I mention it, because the error flow happens in reality. I’ve last seen it in #34592 (comment).

So I think it could make sense to test it, to avoid regressing on it.

Also, I think, it could make sense to make the first commit of the pull request the functional test. This way, it is easier to see what the end-to-end behavior changes are, without having to manually undo commits or patches.

Feel free to grab the test from this diff:

 0diff --git a/test/functional/p2p_handle_io_errors.py b/test/functional/p2p_handle_io_errors.py
 1index 7b3f164ebe..309f4bae63 100755
 2--- a/test/functional/p2p_handle_io_errors.py
 3+++ b/test/functional/p2p_handle_io_errors.py
 4@@ -12,2 +12,6 @@ import stat
 5 
 6+from test_framework.blocktools import (
 7+    create_block,
 8+    create_coinbase,
 9+)
10 from test_framework.messages import (
11@@ -17,2 +21,3 @@ from test_framework.messages import (
12     MSG_WITNESS_FLAG,
13+    msg_block,
14     msg_getblocktxn,
15@@ -46,2 +51,19 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
16 
17+    def test_block_connect_on_broken_fs(self):
18+        self.log.info("Test block connect on inaccessible filesystem")
19+        node = self.nodes[0]
20+        peer = node.add_p2p_connection(P2PInterface())
21+        block = create_block(tmpl=node.getblocktemplate({"rules": ["segwit"]}))
22+        block.solve()
23+
24+        with simulate_io_error(node.blocks_path):
25+            with node.assert_debug_log(expected_msgs=["AcceptBlock FAILED (System error while saving block to disk"]):
26+                peer.send_without_ping(msg_block(block))
27+                peer.wait_for_disconnect()
28+
29+            node.wait_until_stopped(
30+                expect_error=True,
31+                expected_stderr=re.compile(r"fatal internal error"),
32+            )
33+
34     def test_getdata_on_broken_fs(self):
35@@ -104,5 +126,6 @@ class P2PBlockIOErrorTest(BitcoinTestFramework):
36         if platform.system() == 'Windows' or os.geteuid() == 0:
37-           self.log.warning("Skipping test: unable to enforce dir permissions")
38-           return
39+            self.log.warning("Skipping test: unable to enforce dir permissions")
40+            return
41 
42+        self.test_block_connect_on_broken_fs()
43         self.test_getdata_on_broken_fs()

maflcko commented at 8:39 am on April 8, 2026: member

concept ack, left a test suggestion, which could make review easier

DrahtBot added the label Needs rebase on Apr 9, 2026

DrahtBot commented at 10:52 pm on April 9, 2026: contributor

🐙 This pull request conflicts with the target branch and needs rebase.

net_processing: improve block data I/O error handling in P2P paths #35003

Reviews

Conflicts

LLM Linter (✨ experimental)