net: don’t lock cs_main while reading blocks in net processing #26326

pull andrewtoth wants to merge 2 commits into bitcoin:master from andrewtoth:remove-read-lock-in-net changing 1 file +73 −43
  1. andrewtoth commented at 3:34 pm on October 17, 2022: contributor

    Inspired by #11913 and #26308.

    cs_main doesn’t need to be locked while reading blocks. This removes the locks in net_processing.
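
    In short, the change copies the block’s FlatFilePos while cs_main is held, releases the lock, and only then reads from disk. A minimal sketch of the pattern (condensed from the diff hunks quoted in the review comments below; depth checks and error handling elided):

    FlatFilePos block_pos{};
    {
        LOCK(cs_main);
        const CBlockIndex* pindex{m_chainman.m_blockman.LookupBlockIndex(inv.hash)};
        if (!pindex || !(pindex->nStatus & BLOCK_HAVE_DATA)) return;
        // Copy only what the read needs; everything below runs unlocked.
        block_pos = pindex->GetBlockPos();
    }
    // Lengthy disk I/O no longer blocks other cs_main users. The read can now
    // fail if the block is pruned in this window, which is handled explicitly
    // instead of asserting.
    CBlock block;
    if (!m_chainman.m_blockman.ReadBlockFromDisk(block, block_pos)) { /* see review discussion */ }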

  2. DrahtBot added the label P2P on Oct 17, 2022
  3. andrewtoth force-pushed on Oct 17, 2022
  4. andrewtoth force-pushed on Oct 17, 2022
  5. andrewtoth force-pushed on Oct 17, 2022
  6. andrewtoth force-pushed on Oct 17, 2022
  7. DrahtBot commented at 6:24 pm on October 17, 2022: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type           Reviewers
    ACK            sr-gi, furszy, mzumsande, TheCharlatan, achow101
    Concept ACK    dergoegge, pablomartin4btc
    Approach ACK   hernanmarino

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #29641 (scripted-diff: Use LogInfo/LogDebug over LogPrintf/LogPrint by maflcko)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  8. dergoegge commented at 2:42 pm on October 18, 2022: member

    Concept ACK

    I would suggest that you rebase this PR on top of #26316 (if there is a dependency) and then maybe also mark this PR as a draft until the base is merged.

  9. andrewtoth force-pushed on Oct 19, 2022
  10. andrewtoth marked this as a draft on Oct 19, 2022
  11. andrewtoth marked this as ready for review on Nov 18, 2022
  12. andrewtoth commented at 4:11 pm on November 18, 2022: contributor
    I don’t think #26316 is necessary anymore. We can simply remove files that are still on disk like in #26533.
  13. andrewtoth force-pushed on Nov 18, 2022
  14. andrewtoth force-pushed on Nov 18, 2022
  15. andrewtoth force-pushed on Nov 24, 2022
  16. andrewtoth force-pushed on Nov 24, 2022
  17. in src/net_processing.cpp:3872 in 351e5873ab outdated
    3874+            is_above_blocktxn_depth = pindex->nHeight >= m_chainman.ActiveChain().Height() - MAX_BLOCKTXN_DEPTH;
    3875+        }
    3876+        if (is_above_blocktxn_depth) {
    3877+            CBlock block;
    3878+            if (!ReadBlockFromDisk(block, pindex, m_chainparams.GetConsensus())) {
    3879+                LogPrint(BCLog::NET, "Cannot load block from disk. It was likely pruned before we could read it.\n");
    


    maflcko commented at 10:12 am on December 8, 2022:
    I don’t think this can happen. MAX_BLOCKTXN_DEPTH is 10, so we could keep the assert, or at least make it an error log category + Assume(false)?

    andrewtoth commented at 1:45 pm on December 8, 2022:
    Could it not happen in regtest or unit tests?

    maflcko commented at 2:01 pm on December 8, 2022:
    Not sure, but it seems fine to crash in tests if they violate assumptions

    maflcko commented at 3:01 pm on December 8, 2022:
    See MIN_BLOCKS_TO_KEEP, which is also enforced in regtest

    andrewtoth commented at 3:50 am on December 19, 2022:
    Updated this to keep the assert. Also changed it to take a FlatFilePos in ReadBlockFromDisk so it doesn’t have to reacquire the lock after releasing it.
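
    Condensed, the resolved shape of this GETBLOCKTXN path (per the hunks and the static_assert quoted later in the thread; SendBlockTransactions stands in for the existing reply code) is roughly:

    // The most recent MIN_BLOCKS_TO_KEEP (288) blocks are never pruned, so a
    // block within MAX_BLOCKTXN_DEPTH (10) of the tip must still be on disk.
    static_assert(MAX_BLOCKTXN_DEPTH <= MIN_BLOCKS_TO_KEEP, "MAX_BLOCKTXN_DEPTH too high");

    FlatFilePos block_pos{};
    {
        LOCK(cs_main);
        // ... height/depth checks against the active chain ...
        block_pos = pindex->GetBlockPos();
    }
    if (!block_pos.IsNull()) {
        CBlock block;
        const bool ret{m_chainman.m_blockman.ReadBlockFromDisk(block, block_pos)};
        assert(ret); // cannot fail for a recent, unpruned block (see static_assert)
        SendBlockTransactions(pfrom, *peer, block, req);
    }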
  18. in src/node/blockstorage.h:214 in 351e5873ab outdated
    210@@ -211,7 +211,7 @@ void UnlinkPrunedFiles(const std::set<int>& setFilesToPrune);
    211 /** Functions for disk access for blocks */
    212 bool ReadBlockFromDisk(CBlock& block, const FlatFilePos& pos, const Consensus::Params& consensusParams);
    213 bool ReadBlockFromDisk(CBlock& block, const CBlockIndex* pindex, const Consensus::Params& consensusParams);
    214-bool ReadRawBlockFromDisk(std::vector<uint8_t>& block, const FlatFilePos& pos, const CMessageHeader::MessageStartChars& message_start);
    215+bool ReadRawBlockFromDisk(std::vector<uint8_t>& block, const CBlockIndex* pindex, const CMessageHeader::MessageStartChars& message_start);
    


    maflcko commented at 10:14 am on December 8, 2022:
    bool ReadRawBlockFromDisk(std::vector<uint8_t>& block, const CBlockIndex& block, const CMessageHeader::MessageStartChars& message_start);
    

    shouldn’t the * be a &, like before? (nullptr is not accepted)


    andrewtoth commented at 3:49 am on December 19, 2022:
    Removed touching this file.
  19. jamesob commented at 4:11 pm on December 8, 2022: member
    Cool. Can someone add a “needs benchmark” label?
  20. fanquake added the label Needs Benchmark on Dec 8, 2022
  21. andrewtoth commented at 5:14 pm on December 8, 2022: contributor

    Can someone add a “needs benchmark” label?

    What kind of benchmark would be appropriate here? AFAICT message processing is done on a single thread, so this patch wouldn’t speed up handling requests. It would free up other threads that are waiting on the lock while the message processing thread does lengthy disk I/O.

  22. hernanmarino commented at 6:59 pm on December 16, 2022: contributor
    Approach ACK. Ran all tests; everything ran fine.
  23. pablomartin4btc commented at 10:45 pm on December 16, 2022: member
    cr ACK. I like the idea and think this change is very useful for performance. I ran the unit and functional tests with no errors or warnings. Benchmarking is pending if I manage to find a way to do it; otherwise, I’m happy to re-evaluate once someone posts results here.
  24. andrewtoth commented at 3:47 am on December 19, 2022: contributor

    I wrote a tool to benchmark the affected code paths: https://github.com/andrewtoth/spam_block_reqs. I ran it on block 768022 using 5000 requests for each of the four paths on both this branch and master. For example:

    cargo run --release -- -r block-transactions -b 000000000000000000064b1c6e9606714fd4a96d6ff877e4a93b170d86361e28 -n 5000
    

    I did not benchmark the MSG_FILTERED_BLOCK path because it isn’t supported in rust-bitcoin; however, there isn’t anything in that path that requires taking the lock other than the code shared with the other paths.

    Results show no regression:

    Request Type         master (cb32328d1b80d0ccd6eb9532bd8fe4e0a4de385e)   branch
    witness-block        22.6s                                               22.3s
    compact-block        60.9s                                               59.1s
    block-transactions   58.5s                                               59.0s
    legacy-block         77.7s                                               77.4s

    However, while running the spam-blocks tool with a very high number of requests (so it would not complete), I ran a benchmark with ApacheBench, requesting 2000 headers starting from block 750k, 1000 times on a single core over REST.

    ab -n 1000 -c 1 "http://127.0.0.1:8332/rest/headers/0000000000000000000592a974b1b9f087cb77628bb4a097d5c2c11b3476a58e.bin?count=2000"
    

    The mean response times for fetching headers while running the spam-blocks tool with each request type are below:

    Request Type         master (cb32328d1b80d0ccd6eb9532bd8fe4e0a4de385e)   branch
    witness-block        14ms                                                1ms
    compact-block        19ms                                                1ms
    block-transactions   19ms                                                1ms
    legacy-block         42ms                                                1ms

    So it seems this definitely improves responsiveness for any other threads that need to acquire cs_main.

  25. andrewtoth force-pushed on Dec 19, 2022
  26. andrewtoth force-pushed on Dec 19, 2022
  27. andrewtoth commented at 2:30 pm on December 19, 2022: contributor
    Moved all whitespace-only scope changes to a separate commit to make review easier.
  28. andrewtoth force-pushed on Dec 20, 2022
  29. in src/net_processing.cpp:4412 in b630e5dc16 outdated
    3878         }
    3879 
    3880+        if (!block_pos.IsNull()) {
    3881+            CBlock block;
    3882+            bool ret = ReadBlockFromDisk(block, block_pos, m_chainparams.GetConsensus());
    3883+            assert(ret);
    


    furszy commented at 1:16 pm on February 28, 2023:
    What about this assertion? Couldn’t ReadBlockFromDisk return false if the block is not on disk anymore?

    andrewtoth commented at 2:55 pm on February 28, 2023:
    See #26326 (review). Blocks within MAX_BLOCKTXN_DEPTH of the tip can’t be pruned, so we should retain the current behavior if this read fails.
  30. DrahtBot added the label Needs rebase on May 11, 2023
  31. andrewtoth force-pushed on Aug 2, 2023
  32. andrewtoth force-pushed on Aug 2, 2023
  33. DrahtBot removed the label Needs rebase on Aug 3, 2023
  34. DrahtBot added the label CI failed on Aug 3, 2023
  35. DrahtBot removed the label CI failed on Aug 3, 2023
  36. in src/net_processing.cpp:2296 in 3e93e6b756 outdated
    2209@@ -2205,23 +2210,29 @@ void PeerManagerImpl::ProcessGetBlockData(CNode& pfrom, Peer& peer, const CInv&
    2210         if (!(pindex->nStatus & BLOCK_HAVE_DATA)) {
    2211             return;
    2212         }
    2213+        can_direct_fetch = CanDirectFetch();
    2214+        block_pos = pindex->GetBlockPos();
    2215+    }
    2216+    const CNetMsgMaker msgMaker(pfrom.GetCommonVersion());
    


    furszy commented at 4:25 pm on August 3, 2023:
    tiny nit: would be good to add a blank line here.
  37. in src/net_processing.cpp:2171 in 3e93e6b756 outdated
    2171@@ -2172,16 +2172,20 @@ void PeerManagerImpl::ProcessGetBlockData(CNode& pfrom, Peer& peer, const CInv&
    2172         }
    2173     }
    2174 
    2175+    const CBlockIndex* pindex;
    2176+    const CBlockIndex* tip;
    2177+    bool can_direct_fetch;
    


    furszy commented at 4:29 pm on August 3, 2023:
    tiny nit: would be good to initialize these variables to nullptr and false.
  38. furszy commented at 4:36 pm on August 3, 2023: member
    Code ACK 3e93e6b7
  39. DrahtBot added the label Needs rebase on Aug 7, 2023
  40. andrewtoth force-pushed on Aug 7, 2023
  41. DrahtBot removed the label Needs rebase on Aug 7, 2023
  42. furszy commented at 2:36 pm on August 7, 2023: member
    re-ACK 144dd46. No changes since the last review.
  43. andrewtoth commented at 2:56 pm on August 7, 2023: contributor
    @furszy thank you for the review. Sorry I forgot to address your nits on this rebase. Will do them if I have to rebase again.
  44. DrahtBot added the label CI failed on Aug 7, 2023
  45. andrewtoth force-pushed on Aug 17, 2023
  46. andrewtoth force-pushed on Aug 17, 2023
  47. DrahtBot removed the label CI failed on Aug 18, 2023
  48. furszy commented at 9:25 pm on August 30, 2023: member
    reACK e8357eb after rebase.
  49. DrahtBot added the label Needs rebase on Oct 16, 2023
  50. andrewtoth force-pushed on Oct 23, 2023
  51. DrahtBot removed the label Needs rebase on Oct 23, 2023
  52. DrahtBot added the label CI failed on Oct 23, 2023
  53. furszy commented at 1:34 pm on October 23, 2023: member

    diff-ACK fae1fd61

    Have to admit that I’m not a fan of the first commit. It forces me to remember where the indentation changes were located while I’m reviewing the other commits.

  54. DrahtBot requested review from hernanmarino on Oct 23, 2023
  55. DrahtBot requested review from dergoegge on Oct 23, 2023
  56. DrahtBot requested review from pablomartin4btc on Oct 23, 2023
  57. in src/net_processing.cpp:2494 in fae1fd6180 outdated
    2314         std::shared_ptr<CBlock> pblockRead = std::make_shared<CBlock>();
    2315-        if (!m_chainman.m_blockman.ReadBlockFromDisk(*pblockRead, *pindex)) {
    2316-            assert(!"cannot load block from disk");
    2317+        if (!m_chainman.m_blockman.ReadBlockFromDisk(*pblockRead, block_pos)) {
    2318+            LogPrint(BCLog::NET, "Cannot load block from disk. It was likely pruned before we could read it.\n");
    2319+            return;
    


    maflcko commented at 8:58 am on October 25, 2023:
    Last commit: Not sure. This looks like a p2p protocol change?

    furszy commented at 1:53 am on October 27, 2023:

    Last commit: Not sure. This looks like a p2p protocol change?

    For the local node, this change means that the node will continue functioning instead of crashing, which I think is good. The node shouldn’t crash because it received a p2p message it cannot answer.

    For the remote peer, this means that the connection will remain active for ~10 more minutes until the block request times out and the peer closes the socket. That isn’t ideal, but it is how we currently handle unanswered block requests.

    Are you seeing something else?


    maflcko commented at 8:11 am on October 27, 2023:

    For the local node, this change means that the node will continue functioning instead of crashing.

    The node wouldn’t crash here, because cs_main was being held, so no pruning could happen in-between.

    My point is that it would be good to explain behavior changes, especially if they are p2p protocol changes. Hiding this under a “reduce cs_main” commit, which looks like a refactor, doesn’t seem ideal.


    furszy commented at 1:45 pm on October 27, 2023:

    For the local node, this change means that the node will continue functioning instead of crashing.

    The node wouldn’t crash here, because cs_main was being held, so no pruning could happen in-between.

    It could also be that the requested block data is not on disk or not accessible at the time this happens; something external to our software.

    Still, taking the external cause out of the equation, the only diff here is that we allow a tiny extra pruning window between the index lookup and the block data read. We currently return early here if the block is pruned, without logging anything. After this PR, we could either return early on that same no-logging path (if the block is already pruned) or log “Cannot load block from disk. It was likely pruned before we could read it” if the block was pruned right after the index lookup.

    But this is all internal to the node; I’m not seeing any p2p protocol change.

    Maybe @andrewtoth could take this to the PR description?


    maflcko commented at 1:59 pm on October 27, 2023:

    For the local node, this change means that the node will continue functioning instead of crashing.

    The node wouldn’t crash here, because cs_main was being held, so no pruning could happen in-between.

    It could also be that the requested block data is not on disk or not accessible at the time this happens; something external to our software.

    If your datadir was corrupted from outside, I think a crash is acceptable. You likely wouldn’t be able to re-org or otherwise stay in consensus anyway.

    I’m not seeing any p2p protocol change.

    Maybe “protocol change” was the wrong word? I guess “behavior change” may be better?


    furszy commented at 2:32 pm on October 27, 2023:

    If your datadir was corrupted from outside, I think a crash is acceptable. You likely wouldn’t be able to re-org or otherwise stay in consensus anyway.

    You know, if the block in conflict is old enough, the node should be “mostly” ok. The problem would be for new peers performing IBD, which would disconnect from the node for lack of a response.

    I’m not seeing any p2p protocol change.

    Maybe “protocol change” was the wrong word? I guess “behavior change” may be better?

    Sounds good.


    andrewtoth commented at 6:22 pm on October 29, 2023:

    If a block is pruned before we lock cs_main, this function returns early. This behavior is not changed in this PR.

    If a block is not pruned before we lock cs_main, previously we were guaranteed to send the block to the peer.

    This PR introduces a short window after unlocking cs_main but before opening the fd; if the block is pruned and unlinked by another thread during that window, this function also returns early.

    I think we all agree this is not a p2p protocol change? I also believe I documented this behavior change in the PR description and in the debug logs.


    andrewtoth commented at 6:35 pm on October 29, 2023:

    So the other behavior change is that if we fail to read a block for reasons other than being pruned, previously the node would crash. Now we will log the failure and continue.

    If your datadir was corrupted from outside, I think a crash is acceptable. You likely wouldn’t be able to re-org or otherwise stay in consensus anyway.

    I suppose if the read fails, we can retake cs_main and check if the block was pruned out from under us. If so, we log and continue, otherwise we can keep crashing on an assert? Do we really want to keep crashing here if we fail to read a block? I’m not sure.


    andrewtoth commented at 6:53 pm on October 29, 2023:

    if the read fails, we can retake cs_main and check if the block was pruned out from under us. If so, we log and continue, otherwise we can keep crashing on an assert

    @maflcko I updated the code to do this. Thus I don’t think there is any behavior change: if the requested block is pruned, we return early both before and after this PR; if it is unpruned and we fail to read it, we crash both before and after this PR.
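
    For reference, the failure handling eventually merged (the full hunk is quoted further down) retakes the lock only to classify the failure:

    std::vector<uint8_t> block_data;
    if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
        // Re-acquire cs_main only on the failure path, to tell a benign prune
        // race (NET-category log) apart from a real disk error (unconditional log).
        if (WITH_LOCK(m_chainman.GetMutex(), return m_chainman.m_blockman.IsBlockPruned(*pindex))) {
            LogPrint(BCLog::NET, "Block was pruned before it could be read, disconnect peer=%d\n", pfrom.GetId());
        } else {
            LogError("Cannot load block from disk, disconnect peer=%d\n", pfrom.GetId());
        }
        pfrom.fDisconnect = true;
        return;
    }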


    maflcko commented at 8:05 am on October 30, 2023:

    Ah sorry for being unclear. I am not asking to add back the assert. Just adding the comment from #26326 (review) to the commit description would be great.

    However, I wonder what the point is of forcing the remote peer into a 10-minute delay+disconnect. Why not disconnect immediately? Also, if the remote has set a permission flag to avoid the disconnect (not sure if this exists), I wonder what the correct behavior would be.


    furszy commented at 2:31 pm on November 3, 2023:

    I am not asking to add back the assert.

    +1 on not re-adding the assert.

    However, I wonder what the point is of forcing the remote peer into a 10-minute delay+disconnect. Why not disconnect immediately? Also, if the remote has set a permission flag to avoid the disconnect (not sure if this exists), I wonder what the correct behavior would be.

    If the remote peer has a permission flag to avoid the disconnection, maybe it could respond with a NOTFOUND message? Same as we do with tx getdata requests. But probably better to discuss this in a follow-up or in an issue?


    andrewtoth commented at 0:12 am on November 29, 2023:

    Updated to log and return instead of hitting an assert. Also updated the commit description to clarify this behavior.

    However, I wonder what the point is of forcing the remote peer into a 10-minute delay+disconnect. Why not disconnect immediately?

    We could, but I think this is very unlikely to occur in practice. We already disconnect above on line 2293 if we are past the NODE_NETWORK_LIMITED_MIN_BLOCKS threshold, which equals MIN_BLOCKS_TO_KEEP (288), so we won’t prune below that anyway.

  58. DrahtBot removed review request from hernanmarino on Oct 25, 2023
  59. DrahtBot requested review from hernanmarino on Oct 25, 2023
  60. DrahtBot removed review request from hernanmarino on Oct 27, 2023
  61. DrahtBot requested review from hernanmarino on Oct 27, 2023
  62. DrahtBot removed review request from hernanmarino on Oct 27, 2023
  63. DrahtBot requested review from hernanmarino on Oct 27, 2023
  64. DrahtBot removed review request from hernanmarino on Oct 27, 2023
  65. DrahtBot requested review from hernanmarino on Oct 27, 2023
  66. DrahtBot removed review request from hernanmarino on Oct 27, 2023
  67. DrahtBot requested review from hernanmarino on Oct 27, 2023
  68. DrahtBot removed review request from hernanmarino on Oct 27, 2023
  69. DrahtBot requested review from hernanmarino on Oct 27, 2023
  70. DrahtBot removed review request from hernanmarino on Oct 27, 2023
  71. DrahtBot requested review from hernanmarino on Oct 27, 2023
  72. DrahtBot removed review request from hernanmarino on Oct 27, 2023
  73. DrahtBot requested review from hernanmarino on Oct 27, 2023
  74. DrahtBot removed review request from hernanmarino on Oct 29, 2023
  75. DrahtBot requested review from hernanmarino on Oct 29, 2023
  76. DrahtBot removed review request from hernanmarino on Oct 29, 2023
  77. DrahtBot requested review from hernanmarino on Oct 29, 2023
  78. andrewtoth force-pushed on Oct 29, 2023
  79. DrahtBot removed the label CI failed on Oct 29, 2023
  80. andrewtoth force-pushed on Oct 29, 2023
  81. DrahtBot added the label Needs rebase on Nov 28, 2023
  82. andrewtoth force-pushed on Nov 28, 2023
  83. DrahtBot removed the label Needs rebase on Nov 29, 2023
  84. furszy commented at 2:44 am on November 29, 2023: member
    rebase utACK 1027339f
  85. DrahtBot removed review request from hernanmarino on Nov 29, 2023
  86. DrahtBot requested review from hernanmarino on Nov 29, 2023
  87. in src/net_processing.cpp:4412 in 1027339f63 outdated
    4078         }
    4079 
    4080+        if (!block_pos.IsNull()) {
    4081+            CBlock block;
    4082+            const bool ret{m_chainman.m_blockman.ReadBlockFromDisk(block, block_pos)};
    4083+            assert(ret);
    


    darosior commented at 9:10 pm on December 17, 2023:
    Worth adding a comment here; it’s not immediately obvious that this assertion can never fail because MAX_BLOCKTXN_DEPTH < MIN_BLOCKS_TO_KEEP.

    andrewtoth commented at 3:24 pm on January 23, 2024:
    Rebased and added a comment to address this.

    sr-gi commented at 6:07 pm on May 1, 2024:

    Would it be worth having a static assert for this when defining MAX_BLOCKTXN_DEPTH?

    static_assert(MAX_BLOCKTXN_DEPTH < MIN_BLOCKS_TO_KEEP, "MIN_BLOCKS_TO_KEEP too low");
    

    This was not obvious to me either, but I’m not too familiar with this part of net_processing.

  88. DrahtBot removed review request from hernanmarino on Dec 17, 2023
  89. DrahtBot requested review from hernanmarino on Dec 17, 2023
  90. DrahtBot added the label CI failed on Jan 13, 2024
  91. andrewtoth force-pushed on Jan 23, 2024
  92. DrahtBot removed the label CI failed on Jan 23, 2024
  93. furszy commented at 8:39 pm on January 27, 2024: member
    Re-ACK f9d4152. Since my last review, this PR was only rebased and a comment was added.
  94. DrahtBot removed review request from hernanmarino on Jan 27, 2024
  95. DrahtBot requested review from hernanmarino on Jan 27, 2024
  96. achow101 removed review request from hernanmarino on Apr 9, 2024
  97. achow101 requested review from mzumsande on Apr 9, 2024
  98. achow101 requested review from sr-gi on Apr 9, 2024
  99. DrahtBot requested review from hernanmarino on May 1, 2024
  100. sr-gi commented at 8:18 pm on May 1, 2024: member

    Concept ACK

    I agree with @furszy in #26326 (review) that we should follow this up with sending NotFounds for all the blocks that we are unable (or unwilling) to provide.

  101. sipa commented at 8:23 pm on May 1, 2024: member

    we should follow this up with sending NotFounds for all the blocks that we are unable (or unwilling) to provide

    We can’t just change the protocol like that; software other than Bitcoin Core might not expect notfound for block messages. But I do believe it would be useful to have some way of signalling that a block is unavailable (whether that’s notfound or something else), and to make use of that information. It would need a short BIP and a way to negotiate support for it, but it would mean that over time we may also be able to prefer downloading from peers which commit to supporting such a message, and perhaps have shorter timeouts for non-responses from those.

  102. andrewtoth commented at 11:28 pm on May 1, 2024: contributor

    @sr-gi thank you for your review!

    that we should follow this up with sending NotFounds for all the blocks that we are unable (or unwilling) to provide

    After #28120 I don’t think this is very likely to be triggered by updated nodes. For older nodes and other software that request blocks past the prune threshold, it would require a protocol change, as @sipa mentioned.

  103. furszy commented at 0:57 am on May 2, 2024: member
    A small addition about the notfound response: I already have a draft for it. I’m busy with other stuff this week but will open the PR next week :).
  104. in src/net_processing.cpp:2313 in f9d41527d5 outdated
    2310         // as the network format matches the format on disk
    2311         std::vector<uint8_t> block_data;
    2312-        if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, pindex->GetBlockPos())) {
    2313-            assert(!"cannot load block from disk");
    2314+        if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
    2315+            LogPrint(BCLog::NET, "Cannot load block from disk. It was likely pruned before we could read it.\n");
    


    mzumsande commented at 7:22 pm on May 2, 2024:

    As written in #26326 (review), we never prune blocks within 288 blocks of the tip, and if pruning, we should have disconnected any peer requesting older blocks, so it should be impossible to trigger this log even with pruning (unless there are extreme scenarios like major reorgs). On the other hand, it could still be hit while not pruning due to disk problems, e.g. while helping other peers with IBD. Searching for this assert gives me multiple matches.

    It apparently also crashed forks like Bitcoin Unlimited back in the day.

    I’m not against removing it, but I think that if we do, we should at least have an error message that mentions the possibility of disk failures, and that is also unconditional so it doesn’t get hidden in the spammy NET log that very few people enable.


    andrewtoth commented at 8:19 pm on May 2, 2024:

    so it should be impossible to trigger this log even with pruning

    Since we have a 2-block buffer, peers can request blocks up to 290 deep. So one way this can be triggered: a peer requests block 290 and we get the filepos; we release the lock; the pruneblockchain RPC is called; the RPC thread acquires the lock, and block 290 is pruned and unlinked before this thread can acquire the fd; the read now fails.

    I don’t think this is really likely to occur though.


    mzumsande commented at 3:10 pm on May 3, 2024:
    In that case, maybe have a different and unconditional error message or even keep the assert just for the case where pruning isn’t enabled?

    andrewtoth commented at 3:24 pm on May 3, 2024:

    Indeed, I think that’s the best way. I did that before #26326 (review), but other reviewers thought removing the assert was better. However, I think this PR is easier to merge if there is strictly no behavior change; other patches can be proposed to modify the assertion behavior or to disconnect before returning early. This PR is just to reduce lock scope when reading blocks.

    So, if reading the block fails, we retake cs_main and determine if the block was pruned out from under us. If so, we log and return early. Otherwise, we assert as before. @maflcko @furszy what do you think?


    mzumsande commented at 3:32 pm on May 3, 2024:

    determine if the block was pruned out from under us

    That sounds like work, not sure how easy it is - I just meant to check whether we’re running in pruning mode, so the block could have been pruned theoretically. I don’t have a strong opinion between asserting / logging an error unconditionally, I just think that a conditional NET log is not sufficient in a case where something is seriously wrong on our end.


    furszy commented at 4:44 pm on May 3, 2024:

    I just meant to check whether we’re running in pruning mode, so the block could have been pruned theoretically.

    Just in case, it should also check that the block is beyond the pruning threshold.

    So, if reading the block fails, we retake cs_main and determine if the block was pruned out from under us. If so, we log and return early. Otherwise, we assert as before.

    For now (at least in this PR), we should maintain the current network behavior if the assert is replaced. That means disconnecting the peer immediately when it happens (the same behavior as if the node had crashed on the assertion: the socket gets closed).

  105. mzumsande commented at 7:33 pm on May 2, 2024: contributor
    Concept ACK
  106. net: reduce LOCK(cs_main) scope in GETBLOCKTXN
    Also adds a static assertion that MAX_BLOCKTXN_DEPTH <= MIN_BLOCKS_TO_KEEP
    613a45cd4b
  107. net: reduce LOCK(cs_main) scope in ProcessGetBlockData
    This also changes behavior if ReadBlockFromDisk or
    ReadRawBlockFromDisk fails. Previously, the node would crash
    due to an assert. This has been replaced with logging the error,
    disconnecting the peer, and returning early.
    75d27fefc7
  108. andrewtoth force-pushed on May 4, 2024
  109. andrewtoth commented at 7:45 pm on May 4, 2024: contributor

    Rebased and updated

    • The initial whitespace-only commit has been removed
    • Added static_assert(MAX_BLOCKTXN_DEPTH <= MIN_BLOCKS_TO_KEEP, "MAX_BLOCKTXN_DEPTH too high"); in the first commit
    • In the second commit, if reading the block fails, we retake cs_main and check if the block has been pruned. If so, we conditionally log to NET category, otherwise we unconditionally log an error. In either case, we also disconnect the peer before returning.

    Thank you for your reviews @furszy @sr-gi @mzumsande !

  110. sr-gi commented at 2:29 pm on May 6, 2024: member

    ACK 75d27fe

    The only differences are the ones mentioned, which align with the approach settled on after considering #26326 (review).

  111. DrahtBot requested review from furszy on May 6, 2024
  112. DrahtBot requested review from mzumsande on May 6, 2024
  113. in src/net_processing.cpp:2479 in 75d27fefc7
    2507+            if (WITH_LOCK(m_chainman.GetMutex(), return m_chainman.m_blockman.IsBlockPruned(*pindex))) {
    2508+                LogPrint(BCLog::NET, "Block was pruned before it could be read, disconnect peer=%s\n", pfrom.GetId());
    2509+            } else {
    2510+                LogError("Cannot load block from disk, disconnect peer=%d\n", pfrom.GetId());
    2511+            }
    2512+            pfrom.fDisconnect = true;
    


    furszy commented at 7:51 pm on May 6, 2024:
    nit: It wouldn’t hurt to add a comment above this disconnection (just mention how this behaved historically). I think people reading the code for the first time will not understand why the peer is being disconnected for a node-local failure.

    andrewtoth commented at 8:38 pm on May 6, 2024:
    There is a comment a few lines up that describes the behavior https://github.com/bitcoin/bitcoin/pull/26326/files#diff-6875de769e90cec84d2e8a9c1b962cdbcda44d870d42e4215827e599e11e90e3R2453. But I agree, and will add that same comment here if I need to touch this up again.
  114. furszy commented at 7:52 pm on May 6, 2024: member
    ACK 75d27fefc with a non-blocking nit.
  115. in src/net_processing.cpp:2477 in 75d27fefc7
    2505-            assert(!"cannot load block from disk");
    2506+        if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
    2507+            if (WITH_LOCK(m_chainman.GetMutex(), return m_chainman.m_blockman.IsBlockPruned(*pindex))) {
    2508+                LogPrint(BCLog::NET, "Block was pruned before it could be read, disconnect peer=%s\n", pfrom.GetId());
    2509+            } else {
    2510+                LogError("Cannot load block from disk, disconnect peer=%d\n", pfrom.GetId());
    


    mzumsande commented at 6:14 pm on May 7, 2024:
    nit: maybe mentioning the block hash and/or height in the unconditional logs here and below could be helpful, so that in case of a disk failure users could reproduce the failure more easily (e.g. with the getblock RPC).
  116. mzumsande commented at 6:18 pm on May 7, 2024: contributor
    Code Review ACK 75d27fefc7a04ebdda7be5724a014b6a896e7325
  117. TheCharlatan approved
  118. TheCharlatan commented at 6:32 pm on May 8, 2024: contributor
    ACK 75d27fefc7a04ebdda7be5724a014b6a896e7325
  119. achow101 commented at 9:57 pm on May 8, 2024: member
    ACK 75d27fefc7a04ebdda7be5724a014b6a896e7325
  120. achow101 merged this on May 8, 2024
  121. achow101 closed this on May 8, 2024

  122. andrewtoth deleted the branch on May 8, 2024
