rpc_getblockfrompeer.py intermittent failure: assert_equal(pruneheight, 248); not(249 == 248) #27749

issue maflcko opened this issue on May 25, 2023
  1. maflcko commented at 6:34 AM on May 25, 2023: member

    Is there an existing issue for this?

    • I have searched the existing issues

    Current behaviour

    https://cirrus-ci.com/task/4776911524593664?logs=ci#L3147

     test  2023-05-24T12:16:53.212000Z TestFramework (ERROR): Assertion failed 
                                       Traceback (most recent call last):
                                         File "/tmp/cirrus-build-825335453/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/test_framework/test_framework.py", line 132, in main
                                           self.run_test()
                                         File "/tmp/cirrus-build-825335453/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/rpc_getblockfrompeer.py", line 126, in run_test
                                           assert_equal(pruneheight, 248)
                                         File "/tmp/cirrus-build-825335453/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/test_framework/util.py", line 55, in assert_equal
                                           raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
                                       AssertionError: not(249 == 248)
    

    Expected behaviour

    .

    Steps to reproduce

    CI

    Relevant log output

    No response

    How did you obtain Bitcoin Core

    Compiled from source

    What version of Bitcoin Core are you using?

    current master

    Operating system and version

    [previous releases, qt5 dev package and depends packages, DEBUG] [focal]

    Machine specifications

    No response

  2. maflcko added the label Tests on May 25, 2023
  3. fanquake commented at 9:22 AM on May 25, 2023: member
  4. willcl-ark commented at 9:59 AM on May 26, 2023: member

    FWIW, I have seen this error once before while running tests locally, but I was making changes to reconsiderblock at the time and assumed it was related to that.

  5. furszy commented at 9:17 PM on May 26, 2023: member

    I wasn't able to reproduce it but my spider sense tells me that it will be fixed by #27770.

  6. theStack commented at 1:22 AM on May 29, 2023: contributor

    After digging deeper and analyzing the failed CI log, I got a clearer picture of what's going on here. Comparing the "Leaving block file" debug messages from a successful run and the failed one gave the essential hint:

    $ grep "Leaving block" successful_run.log
    Leaving block file 0: CBlockFileInfo(blocks=249, size=65319, heights=0...248, time=2011-02-02...2023-05-28)
    Leaving block file 1: CBlockFileInfo(blocks=251, size=65511, heights=249...499, time=2023-05-28...2023-05-28)
    
    $ grep "Leaving block" failed_ci.log
    Leaving block file 0: CBlockFileInfo(blocks=249, size=65319, heights=0...249, time=2011-02-02...2023-05-24)
    Leaving block file 1: CBlockFileInfo(blocks=251, size=65511, heights=248...499, time=2023-05-24...2023-05-24)
    

    Blocks are appended to blk*.dat files in whatever order they are received from the network (either directly as a "block" message or reconstructed via compact blocks). It turns out that in this (apparently rare) scenario, node2 saw block 249 before block 248, so block 249 still made it into blk00000.dat while block 248 ended up in the next file, blk00001.dat. In this case, the pruneblockchain return value is one higher than usual (249 vs. 248).
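    The effect can be illustrated with a tiny self-contained model (a hypothetical sketch, not Bitcoin Core code): blocks are appended to files in receive order, and pruning the first file reports the highest height it contained, which is only the "last block pruned" when storage happened to be in height order.

```python
# Hypothetical sketch: simulate how the receive order of blocks decides
# which blk*.dat file each block lands in, and hence the height reported
# when the first file is pruned. File sizes are scaled down to 6 blocks.

def assign_to_files(receive_order, blocks_per_file):
    """Append blocks to files in receive order, rolling over to a new
    file when the current one is full; return {file_no: [heights]}."""
    files, current = {0: []}, 0
    for height in receive_order:
        if len(files[current]) >= blocks_per_file:
            current += 1
            files[current] = []
        files[current].append(height)
    return files

# In-order reception: file 0 holds heights 0..5, so pruning it
# removes everything up to and including height 5.
in_order = assign_to_files(list(range(12)), 6)
print(max(in_order[0]))  # 5

# Heights 5 and 6 swapped on the wire: 6 lands in file 0 and 5 in
# file 1, so the max height in the pruned file is now 6 even though
# block 5 is still on disk -- the analogue of 249 vs. 248 here.
swapped = list(range(12))
swapped[5], swapped[6] = swapped[6], swapped[5]
out_of_order = assign_to_files(swapped, 6)
print(max(out_of_order[0]))  # 6
```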

    I was surprised to see that even with only a single peer (node2 only gets blocks from node0) blocks could arrive out of order, but that seems to be due to a mix of receiving full blocks and compact blocks. I wrote up some hacky python-regex-log-parsing magic to compare the order of the actual chain (as generated on node0) with the order in which it was received on node2, confirming what's stated above:

    <details>

    $ ./analyze_issue_27749.py
    node2 received block 201, OK.
    node2 received block 202, OK.
    node2 received block 203, OK.
    node2 received block 204, OK.
    node2 received block 205, OK.
    node2 reconstructed block 206, OK.
    node2 received block 207, OK.
    node2 received block 208, OK.
    node2 received block 209, OK.
    node2 received block 210, OK.
    node2 received block 211, OK.
    !!! expected to receive/reconstruct block 212, but reconstructed block 213 !!!
    !!! expected to receive/reconstruct block 213, but received block 212 !!!
    node2 received block 214, OK.
    node2 received block 215, OK.
    node2 received block 216, OK.
    node2 received block 217, OK.
    node2 received block 218, OK.
    node2 received block 219, OK.
    node2 received block 220, OK.
    node2 received block 221, OK.
    node2 received block 222, OK.
    node2 received block 223, OK.
    node2 received block 224, OK.
    node2 received block 225, OK.
    node2 received block 226, OK.
    node2 received block 227, OK.
    node2 received block 228, OK.
    node2 received block 229, OK.
    node2 received block 230, OK.
    node2 received block 231, OK.
    node2 received block 232, OK.
    node2 received block 233, OK.
    node2 received block 234, OK.
    node2 received block 235, OK.
    node2 received block 236, OK.
    node2 received block 237, OK.
    node2 received block 238, OK.
    node2 received block 239, OK.
    node2 received block 240, OK.
    node2 received block 241, OK.
    node2 received block 242, OK.
    node2 received block 243, OK.
    node2 received block 244, OK.
    node2 received block 245, OK.
    node2 received block 246, OK.
    node2 received block 247, OK.
    !!! expected to receive/reconstruct block 248, but reconstructed block 249 !!!
    !!! expected to receive/reconstruct block 249, but received block 248 !!!
    node2 reconstructed block 250, OK.
    node2 reconstructed block 251, OK.
    node2 reconstructed block 252, OK.
    node2 reconstructed block 253, OK.
    node2 reconstructed block 254, OK.
    node2 reconstructed block 255, OK.
    node2 reconstructed block 256, OK.
    ..........
    node2 reconstructed block 605, OK.
    

    </details>
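    For reference, the comparison script could look roughly like this (a minimal sketch; the log line formats below are simplified placeholders, not the actual debug.log messages, and the expected heights are hardcoded for illustration):

```python
# Rough reconstruction of the kind of log parsing described above:
# extract "received"/"reconstructed" block events from a node2 log and
# compare their order against the expected ascending height sequence.
import re

# Simplified stand-in for node2's debug.log (hypothetical format).
LOG = """\
received block 201
received block 202
reconstructed block 204
received block 203
received block 205
"""

events = re.findall(r"(received|reconstructed) block (\d+)", LOG)
for expected, (verb, height) in enumerate(events, start=201):
    if int(height) == expected:
        print(f"node2 {verb} block {height}, OK.")
    else:
        print(f"!!! expected to receive/reconstruct block {expected}, "
              f"but {verb} block {height} !!!")
```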

    Not sure how to best solve this issue. Is there a way to enforce linear block reception, at least when there is only a single peer? Or maybe the best solution would be to not involve the network at node2 at all (for the sake of the pruning part of the test) and just submitblock the blocks generated at node0? #27770 is of course always an alternative.
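    The submitblock idea could take roughly this shape (illustrative only; FakeNode below stands in for the test framework's RPC proxies, and the real test would call node0/node2 directly):

```python
# Sketch: instead of letting node2 fetch blocks over P2P (where arrival
# order can vary), feed it node0's blocks via submitblock, strictly in
# height order, so block files are always filled sequentially.

def feed_blocks_in_order(src, dst, block_hashes):
    """Submit raw blocks from src to dst in the given (height) order."""
    for h in block_hashes:
        raw = src.getblock(h, 0)  # verbosity 0: hex-serialized block
        dst.submitblock(raw)

class FakeNode:
    """Minimal stand-in for a test-framework node RPC proxy."""
    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})
        self.stored_order = []
    def getblock(self, block_hash, verbosity):
        return self.blocks[block_hash]
    def submitblock(self, raw_block):
        self.stored_order.append(raw_block)

node0 = FakeNode({f"hash{i}": f"raw{i}" for i in range(3)})
node2 = FakeNode()
feed_blocks_in_order(node0, node2, [f"hash{i}" for i in range(3)])
print(node2.stored_order)  # ['raw0', 'raw1', 'raw2']
```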

  7. mzumsande commented at 5:07 AM on May 29, 2023: contributor

    Nice analysis!

    Not sure how to best solve this issue. Is there a way to enforce linear blocks reception at least when only having a single peer?

    I think the cause is that a failed DoS check if (pindex->nHeight <= m_chainman.ActiveChain().Height() + 2) (link) makes the node revert to headers processing when node2's chain hasn't caught up enough (the new block is too far ahead of the existing tip), adding another roundtrip to the block download compared to compact block reconstruction. Once node2 has caught up and switches to compact block reconstruction, the two blocks can arrive on disk in a different order. This can happen in the first place because node2 initially also needs to catch up with blocks 201-205 from the first part of the test (and is rather slow to do so in the failed run!).

    So I think that an alternative fix would be to add a sync_blocks call between node0 and node2 before letting node0 generate the 400 blocks. This should guarantee that all of these 400 blocks will be processed by compact block reconstruction.
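    The quoted condition can be modeled in isolation (a deliberate simplification of the net_processing logic, for illustration only): a compact block announcement is handled directly only when it is at most two blocks ahead of the active tip.

```python
# Simplified model of the DoS check cited above (not the actual
# net_processing code): a "cmpctblock" is processed directly only when
# it is at most 2 blocks ahead of our tip; otherwise the node falls
# back to headers-first download, adding a roundtrip during which a
# later block can reach disk first.

def direct_compact_block(announced_height, tip_height):
    return announced_height <= tip_height + 2

# node2 lagging at height 205 when height 248 is announced: falls back.
print(direct_compact_block(248, 205))  # False
# After a sync_blocks call, node2's tip tracks node0, so every newly
# generated block qualifies for direct compact block reconstruction.
print(direct_compact_block(249, 248))  # True
```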

  8. furszy commented at 2:27 PM on May 29, 2023: member

    Awesome investigation! It feels good when all dots are connected.

    Apart from mzumsande's patch to enforce sequential block storage, which looks promising at first glance, I would like to refer to a portion of what I just wrote in #27770 (comment):

    I believe the main conclusion of the analysis (at least mine) is that the accuracy of the pruneblockchain result is questionable. The real issue lies in the fact that the RPC command's result description claims to return the "height of the last block pruned", which may not be true since blocks could have been stored out of order. Therefore, obtaining the block height 249 as the "last block pruned" does not necessarily indicate that block 248 has been pruned from disk.

    Considering this aspect, alongside mzumsande's fix, I hold the view that the test is quite fragile and that it would be unwise to keep the current hardcoded pruneblockchain numbers. These numbers show a significant discrepancy vs. the block file information (as indicated in the comment) and aren't really documented anywhere.

  9. Sjors commented at 1:23 PM on May 30, 2023: member

    cc @fjahr since he worked on the prune part of this test.

  10. mzumsande referenced this in commit 9fe9074266 on May 30, 2023
  11. mzumsande commented at 6:35 PM on May 30, 2023: contributor

    Opened #27784 which just adds an additional sync, as suggested in #27770 (comment) .

  12. fanquake referenced this in commit 433f17bc3f on May 31, 2023
  13. fanquake closed this on May 31, 2023

  14. sidhujag referenced this in commit 7704bfe509 on May 31, 2023
  15. bitcoin locked this on May 30, 2024
