wallet: fix crash on double block disconnection #31757

pull furszy wants to merge 2 commits into bitcoin:master from furszy:2025_wallet_fix_disconnectBlock_state changing 2 files +51 −2
  1. furszy commented at 6:46 pm on January 29, 2025: member

    The wallet crashes if it processes the same block disconnection event twice in a row due to an incompatible coinbase transaction state. This happens because disconnectBlock provides TxStateInactive without the “abandoned” flag for coinbase transactions to SyncTransaction, while AddToWallet() internally modifies it to retain the abandoned state.

    The crash flow is as follows:

    1. On the first disconnection, the transaction state transitions from “confirmed” to “inactive,” bypassing the state equality check since the provided state differs. Then, AddToWallet internally updates the state to “inactive + abandoned”

    2. On the second disconnection, as we provide only the “inactive” state to SyncTransaction(), the state equality assertion fails and crashes the wallet.

    Reviewers Note: The crash can easily be replicated by cherry-picking the test commit in master.

  2. DrahtBot commented at 6:46 pm on January 29, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31757.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK mzumsande, achow101
    Stale ACK rkrux

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #29500 (test: create assert_not_equal util by kevkevinpal)
    • #27865 (wallet: Track no-longer-spendable TXOs separately by achow101)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label Wallet on Jan 29, 2025
  4. furszy force-pushed on Jan 29, 2025
  5. DrahtBot added the label CI failed on Jan 29, 2025
  6. DrahtBot commented at 7:01 pm on January 29, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/36373842434

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  7. DrahtBot removed the label CI failed on Jan 29, 2025
  8. mzumsande commented at 7:17 pm on February 7, 2025: contributor

    Concept ACK - I opened #31824, which shows that this can actually happen in production.

    The suggested fix seems good from a defensive programming point of view, although it doesn’t address the root problem why double disconnections can happen.

  9. in src/wallet/wallet.cpp:1552 in 4229839b45 outdated
    1549-        SyncTransaction(ptx, TxStateInactive{});
    1550+    for (size_t index = 0; index < block.data->vtx.size(); index++) {
    1551+        const CTransactionRef& ptx = Assert(block.data)->vtx[index];
    1552+        // Coinbase transactions are not only inactive but also abandoned,
    1553+        // meaning they should never be relayed standalone via the p2p protocol.
    1554+        SyncTransaction(ptx, TxStateInactive{ /*abandoned=*/index == 0});
    


    rkrux commented at 2:28 pm on February 14, 2025:
    0- for (const CTransactionRef& ptx : Assert(block.data)->vtx) {
    1-        SyncTransaction(ptx, TxStateInactive{});
    2+ for (const CTransactionRef& ptx : Assert(block.data)->vtx) {
    3+        // Coinbase transactions are not only inactive but also abandoned,
    4+        // meaning they should never be relayed standalone via the p2p protocol.
    5+        SyncTransaction(ptx, TxStateInactive{ /*abandoned=*/ ptx->IsCoinBase()});
    

    Build and tests all good with this.

    I considered this with readability in mind, upon checking the IsCoinBase() implementation I noticed a basic sanity if condition on the input, which would be helpful for robustness.


    rkrux commented at 8:40 am on February 18, 2025:
    The same function seems to be used in the first block disconnection as well: https://github.com/bitcoin/bitcoin/blob/master/src/wallet/wallet.cpp#L1129
  10. rkrux commented at 2:28 pm on February 14, 2025: contributor

    Concept ACK

    Initial pass, will review in detail soon.

  11. achow101 commented at 10:52 pm on February 14, 2025: member
    Concept ACK
  12. rkrux approved
  13. rkrux commented at 8:40 am on February 18, 2025: contributor

    I have checked that a coinbase transaction is set to “inactive + abandoned” here on first block disconnect inside the AddToWallet function: https://github.com/bitcoin/bitcoin/blob/master/src/wallet/wallet.cpp#L1132

    And on an unwanted second block disconnection, the same state needs to be passed so that the assertion on line 1114 is satisfied if it’s not a new transaction and with the same state variant: https://github.com/bitcoin/bitcoin/blob/master/src/wallet/wallet.cpp#L1114

    I cherry-picked the unit test on master and as expected it failed with the below error due to wallet crash:

    0wallet/test/wallet_tests.cpp:1027: Entering test case "wallet_crash_during_disconnectBlock"
    1wallet/test/wallet_tests.cpp:1037: info: check GetBalance(*wallet, 0, false).m_mine_immature == 50 * COIN has passed
    2Assertion failed: (TxStateSerializedIndex(wtx.m_state) == TxStateSerializedIndex(state)), function AddToWallet, file wallet.cpp, line 1114.
    3unknown location:0: fatal error: in "wallet_tests/wallet_crash_during_disconnectBlock": signal: SIGABRT (application abort requested)
    4wallet/test/wallet_tests.cpp:1037: last checkpoint
    5wallet/test/wallet_tests.cpp:1027: Leaving test case "wallet_crash_during_disconnectBlock"; testing time: 576342us
    

    ACK 4229839b454befea2e18764eee20186b6c18f1c5

  14. DrahtBot requested review from achow101 on Feb 18, 2025
  15. DrahtBot requested review from mzumsande on Feb 18, 2025
  16. in src/wallet/test/wallet_tests.cpp:1047 in 4229839b45 outdated
    1042+    m_node.chain->findBlock(tip->GetBlockHash(), interfaces::FoundBlock().data(block_data));
    1043+
    1044+    // Construct block info for disconnection
    1045+    interfaces::BlockInfo info(*tip->phashBlock);
    1046+    info.data = &block_data;
    1047+    info.prev_hash = tip->phashBlock;
    


    rkrux commented at 9:11 am on February 18, 2025:
    0- info.prev_hash = tip->phashBlock;
    1+ info.prev_hash = tip->pprev->phashBlock;
    

    The hash of the previous block needs to be added here instead of the hash of the same block. I ran the test with this and it works.

    Reference: MakeBlockInfo() here https://github.com/bitcoin/bitcoin/blob/9da0820ec55e136d5213bf5bb90c36499debc034/src/kernel/chain.cpp#L18

    The pprev presence check can be avoided because the test is using TestChain100Setup that has 100 blocks mined already leading to pprev being present after CreateAndProcessBlock() is called on line 1035.


    furszy commented at 10:11 pm on February 18, 2025:
    Ended up changing the testing approach, preferring functional over unit tests. But thanks.
  17. in src/wallet/test/wallet_tests.cpp:1031 in 4229839b45 outdated
    1026+ */
    1027+BOOST_FIXTURE_TEST_CASE(wallet_crash_during_disconnectBlock, TestChain100Setup)
    1028+{
    1029+    const auto& chainman = Assert(m_node.chainman);
    1030+    const int64_t BLOCK_TIME = WITH_LOCK(chainman->GetMutex(), return chainman->ActiveChain().Tip()->GetBlockTimeMax() + 5);
    1031+    SetMockTime(BLOCK_TIME);  // to by-pass the wallet birth time.
    


    rkrux commented at 9:19 am on February 18, 2025:

    // to by-pass the wallet birth time.

    Can you explain this a bit more? I ran the test w/o it and it works.


    furszy commented at 10:10 pm on February 18, 2025:
    It was a leftover from a previous implementation. But see #27469.
  18. furszy force-pushed on Feb 18, 2025
  19. furszy commented at 10:16 pm on February 18, 2025: member
    Ended up changing the testing approach, opting for a functional test over a unit test due to its reduced maintenance costs across code refactorings. Made a few modifications to mzumzande’s #31824 functional tests.
  20. DrahtBot added the label Needs rebase on Feb 19, 2025
  21. in test/functional/wallet_reorgsrestore.py:110 in df87cf2a52 outdated
    45+        assert_equal(wallet.getwalletinfo()['immature_balance'], 25)
    46+
    47+        # Disconnect tip
    48+        tip = wallet.getbestblockhash()
    49+        wallet.invalidateblock(tip)
    50+        wallet.syncwithvalidationinterfacequeue()
    


    rkrux commented at 12:22 pm on February 19, 2025:
    Can add a comment here highlighting that this RPC call is added to ensure the block disconnected notification is handled by the wallet.

    furszy commented at 5:25 pm on February 19, 2025:
    Done as suggested
  22. in test/functional/wallet_reorgsrestore.py:45 in df87cf2a52 outdated
    40+        self.generatetoaddress(node, 1, wallet.getnewaddress(), sync_fun=self.no_op)
    41+
    42+        # Restart to ensure node and wallet are flushed
    43+        self.restart_node(0)
    44+        wallet = node.get_wallet_rpc("reorg_crash")
    45+        assert_equal(wallet.getwalletinfo()['immature_balance'], 25)
    


    rkrux commented at 12:49 pm on February 19, 2025:

    Might be in the nit-land but I find reading 25 here distracting. I had to figure out why it’s 25 that is because in regtest the subsidy halving interval is 150 and initially a 199 blocks long chain is created in the regtest. The block count at this stage is 215 which is because few blocks were created in the preceding test in this file, thereby making this test dependent on the changes of that test as well. IMO, a simple balance greater than 0 check is sufficient.

    Also generally, I am inclined to creating and using a constant, for such usages of 25, called INITIAL_ACTIVE_BLOCK_REWARD in the test framework.


    furszy commented at 5:25 pm on February 19, 2025:
    Done as suggested
  23. in test/functional/wallet_reorgsrestore.py:119 in df87cf2a52 outdated
    53+        assert_equal(wallet.getwalletinfo()['immature_balance'], 0)
    54+        coinbase_tx_id = wallet.getblock(tip, verbose=1)["tx"][0]
    55+        assert_equal(wallet.gettransaction(coinbase_tx_id)['details'][0]['abandoned'], True)
    56+
    57+        # Abort process abruptly to force an unclean shutdown (no flush)
    58+        node.process.kill()
    


    rkrux commented at 12:59 pm on February 19, 2025:
    This test passes in my system but one thing that’s on my mind is that can this test ever become flaky? Can there be a scenario that the best block locator changes are written to the disk before the node is killed here? Do you think a time delay should be added before killing the node here?

    furszy commented at 4:24 pm on February 19, 2025:

    This test passes in my system but one thing that’s on my mind is that can this test ever become flaky? Can there be a scenario that the best block locator changes are written to the disk before the node is killed here?

    That wouldn’t be a problem, it would be a fix. See #30221 and the last paragraph of #31824’s issue description.

    Do you think a time delay should be added before killing the node here?

    The test will still pass if the best locator is flushed to disk before the abrupt shutdown. And that’s fine, that’s the expected behavior.

  24. in test/functional/wallet_reorgsrestore.py:57 in df87cf2a52 outdated
    52+        # Tip was disconnected, ensure coinbase has been abandoned
    53+        assert_equal(wallet.getwalletinfo()['immature_balance'], 0)
    54+        coinbase_tx_id = wallet.getblock(tip, verbose=1)["tx"][0]
    55+        assert_equal(wallet.gettransaction(coinbase_tx_id)['details'][0]['abandoned'], True)
    56+
    57+        # Abort process abruptly to force an unclean shutdown (no flush)
    


    rkrux commented at 1:01 pm on February 19, 2025:
    0+  # Abort process abruptly to force an unclean shutdown (no flush) so that the best block locator changes are not persisted
    

    furszy commented at 5:25 pm on February 19, 2025:
    Done as suggested
  25. in test/functional/wallet_reorgsrestore.py:129 in df87cf2a52 outdated
    60+        # Start wallet again. Ensure the tx is still abandoned and the wallet has no balance.
    61+        self.start_node(0)
    62+        wallet = node.get_wallet_rpc("reorg_crash")
    63+        assert_equal(wallet.getbestblockhash(), tip)
    64+        assert_equal(wallet.getwalletinfo()['immature_balance'], 0)
    65+        wallet.invalidateblock(tip)
    


    rkrux commented at 1:03 pm on February 19, 2025:
    Can add a comment here: Disconnect tip again

    furszy commented at 5:25 pm on February 19, 2025:
    Done as suggested
  26. in src/wallet/wallet.cpp:1555 in f99b2f136c outdated
    1549-        SyncTransaction(ptx, TxStateInactive{});
    1550+    for (size_t index = 0; index < block.data->vtx.size(); index++) {
    1551+        const CTransactionRef& ptx = Assert(block.data)->vtx[index];
    1552+        // Coinbase transactions are not only inactive but also abandoned,
    1553+        // meaning they should never be relayed standalone via the p2p protocol.
    1554+        SyncTransaction(ptx, TxStateInactive{/*abandoned=*/index == 0});
    


    rkrux commented at 1:06 pm on February 19, 2025:

    Re: #31757 (review)

    Bringing it back again because I feel besides the robustness IsCoinBase() brings in here, we can also get rid of introducing index that is not used elsewhere in the function.

    LMK your thoughts.


    furszy commented at 4:17 pm on February 19, 2025:

    Bringing it back again because I feel besides the robustness IsCoinBase() brings in here

    By consensus, coinbase transactions must be the first in a block. Using IsCoinBase() is slightly more readable but not more robust. It is also a slower check, as it verifies that all bytes in the input hash are zero for every tx in the block. I’m a ~0 here.


    rkrux commented at 9:49 am on February 20, 2025:
    Thanks for entertaining this request, the readability was indeed pulling me towards it but with the consensus rule in place, I agree with leaving this as-is.
  27. in test/functional/wallet_reorgsrestore.py:134 in df87cf2a52 outdated
    62+        wallet = node.get_wallet_rpc("reorg_crash")
    63+        assert_equal(wallet.getbestblockhash(), tip)
    64+        assert_equal(wallet.getwalletinfo()['immature_balance'], 0)
    65+        wallet.invalidateblock(tip)
    66+        # Ensure abandoned state persists after restart
    67+        assert_equal(wallet.gettransaction(coinbase_tx_id)['details'][0]['abandoned'], True)
    


    rkrux commented at 1:20 pm on February 19, 2025:

    Might not hurt to check for wallet balance again. But more importantly it would be nice to check that the best block hash is not tip now like below. Should we stop the node again (cleanly this time) and check this on start?

    0+ assert_equal(node.getbestblockhash() == tip, False)
    

    furszy commented at 5:20 pm on February 19, 2025:

    Should we stop the node again (cleanly this time) and check this on start?

    There should be another test case already ensuring locator persistence during a clean shutdown.

  28. in test/functional/wallet_reorgsrestore.py:116 in df87cf2a52 outdated
    51+
    52+        # Tip was disconnected, ensure coinbase has been abandoned
    53+        assert_equal(wallet.getwalletinfo()['immature_balance'], 0)
    54+        coinbase_tx_id = wallet.getblock(tip, verbose=1)["tx"][0]
    55+        assert_equal(wallet.gettransaction(coinbase_tx_id)['details'][0]['abandoned'], True)
    56+
    


    rkrux commented at 1:27 pm on February 19, 2025:

    Interesting that the best block hash at the stage is not the tip because of invalidation but not so after unclean-stop-start.

    0assert_equal(node.getbestblockhash() == tip, False)
    
  29. in test/functional/wallet_reorgsrestore.py:35 in df87cf2a52 outdated
    30@@ -31,6 +31,41 @@ def set_test_params(self):
    31     def skip_test_if_missing_module(self):
    32         self.skip_if_no_wallet()
    33 
    34+    def test_reorg_handling_during_unclean_shutdown(self):
    35+        self.log.info("Test crash due to a duplicate block disconnection event during an unclean shutdown")
    


    rkrux commented at 1:31 pm on February 19, 2025:
    Test that wallet doesn't crash due to a duplicate block disconnection event after an unclean shutdown

    furszy commented at 5:25 pm on February 19, 2025:
    Done as suggested
  30. rkrux commented at 1:36 pm on February 19, 2025: contributor

    I reviewed till df87cf2

    I had thought about adding the functional test earlier but had ruled it out because I thought technically it’s a bug fix and not a functionality per se. But after reading this test, I’m in favour of this test, makes me think from the POV of what’s happening to the node overall.

  31. wallet: fix crash on double block disconnection
    The wallet crashes if it processes the same block disconnection event twice in a row due
    to an incompatible coinbase transaction state.
    This happens because 'disconnectBlock' provides 'TxStateInactive' without the "abandoned"
    flag for coinbase transactions to 'SyncTransaction', while 'AddToWallet()' internally
    modifies it to retain the abandoned state.
    
    The flow is as follows:
    1) On the first disconnection, the transaction state transitions from "confirmed" to
    "inactive," bypassing the state equality check since the provided state differs. Then,
    'AddToWallet' internally updates the state to "inactive + abandoned"
    
    2) On the second disconnection, as we provide only the "inactive" state
    to 'SyncTransaction()', the state equality assertion fails and crashes the wallet.
    9ef429b6ae
  32. test: wallet, coverage for crash on dup block disconnection during unclean shutdown
    Co-authored-by: furszy <matiasfurszyfer@protonmail.com>
    c80b105f77
  33. furszy force-pushed on Feb 19, 2025
  34. DrahtBot removed the label Needs rebase on Feb 19, 2025
  35. in test/functional/wallet_reorgsrestore.py:121 in c80b105f77
    116+
    117+        # Abort process abruptly to mimic an unclean shutdown (no flush)
    118+        # The best block locator changes might not have been persisted
    119+        node.process.kill()
    120+
    121+        # Start wallet again. Ensure the tx is still abandoned and the wallet has no balance
    


    mzumsande commented at 7:12 pm on February 19, 2025:
    “Ensure the tx is still abandoned and the wallet has no balance” sounds too positive to me - the test comment should say clearly that this is a bug in master and the wallet is showing the wrong balance because it incorrectly treats a tx which is in the current chain as abandoned (even if it doesn’t crash anymore).

    rkrux commented at 11:55 am on February 20, 2025:
    Agree, it’s a subtle point but important to convey in the comment. The 2 assertions on line 124 and 125 below are correct individually but incongruent with regards to the overall node state.
  36. in test/functional/wallet_reorgsrestore.py:124 in c80b105f77
    119+        node.process.kill()
    120+
    121+        # Start wallet again. Ensure the tx is still abandoned and the wallet has no balance
    122+        self.start_node(0)
    123+        wallet = node.get_wallet_rpc("reorg_crash")
    124+        assert_equal(wallet.getbestblockhash(), tip)
    


    rkrux commented at 11:50 am on February 20, 2025:

    Re: #31757 (review) Continuing this thread here because there doesn’t seem to be a way to continue that thread while making it part of the review.

    As per this comment and related code #31824 (comment), I understand that FlushStateMode::IF_NEEDED mode is used and therefore I assume that makes it quite unlikely to flush chainstate in a functional test, so the test will not fail.

    The test will still pass if the best locator is flushed to disk before the abrupt shutdown. And that’s fine, that’s the expected behavior.

    However, I don’t understand this comment. Hypothetically, if the state were to be flushed before abrupt node kill, wouldn’t this assertion here fail?

  37. rkrux commented at 11:58 am on February 20, 2025: contributor
    Asked a question for clarity.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-02-22 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me