Early note: the majority of this PR consists of test coverage. The changes per se are quite small, just the issue shows up in multiple paths in slightly different forms.
The goal of the PR is to ensure I/O errors are not silently ignored during validation or P2P message handling. In some cases, such as GETDATA or GETBLOCKTXN requests, errors can be swallowed, leaving the node unresponsive to the request without any action. In a harder to reach but more severe case, a swallowed error can leave the node alive but stuck, unable to process new blocks and advance the chain. Also, silently ignoring errors obviously makes problems harder to diagnose. So this PR seeks to improve all of that.
Currently, the same root cause, inability to access the blocks directory, can produce different behaviors depending on where it occurs:
-
If it happens during block connection, inside
AcceptBlock(), the error gets caught and triggers a fatal error graceful shutdown. This is the correct behavior. -
If it happens during block connection, after
AcceptBlock(), insideActivateBestChain(), the file system error gets thrown and swallowed by the P2P general try-catch, leaving the node in a living but stuck state. WhereActivateBestChain()will always silently fail on any posterior block processing, retrying to active the failing block. -
If it happens during
GETDATArequest, the file system error gets swallowed by the message handling general try-catch, only logging a net-debug-level message. The remote peer is not disconnected, nor the node aborts. It just silently ignores the request. This is different to a missing block data, which currently logs the error + disconnects the peer. -
If it happens during
GETBLOCKTXNrequest, the file system error either gets swallowed by the message handling general try-catch or crashes the system via an assertion, depending on if the problem is at the blocks directory level or at the block data level, which is inconsistent.
So, this PR makes failure handling consistent and ensures the node never enters a stuck state due to a block I/O error. Changes:
GETDATAnow treats the issue in the same way as other I/O errors: it logs the error + disconnect the remote peer.GETBLOCKTXNnow triggers a fatal error and graceful shutdown. The error here is stricter than inGETDATAbecause it can only access the last 10 blocks, which must be available for reorgs.ActivateBestChain()now triggers a fatal error and graceful shutdown when it happens. Fixing the silent stuck node state scenario.
Testing Note
Can reproduce the different behaviors that cause the stuck node scenario by running the second commit’s unit test 28af683b5dfdce88892e367b308857b94ec630fe or by manually introducing an exception in ConnectTip() (which occurs inside ActivateBestChain()). Previously, the thrown exception would have being caught by the general P2P try-catch and logged only at net-debug level, which most users do not have enabled, leaving the node alive but stuck, as subsequent block connections repeatedly attempt to process the failing block, throw the exception again, and get silently swallowed. Adding a functional test for this scenario is not possible due to the requirement of failing inside ActivateBestChain(), which always occurs after AcceptBlock(), which correctly captures the error.
Separate Note There is another assertion scenario that we can move to graceful shutdown (fatal error) in the compact block relay that I haven’t done here.
Extra Note If want to see a similar error producing a crash, but in the wallet, go to #34176.