rpc_getblockfrompeer.py intermittent failure: assert_equal(pruneheight, 248); not(249 == 248) #27749

issue maflcko opened this issue on May 25, 2023
  1. maflcko commented at 6:34 am on May 25, 2023: member

    Is there an existing issue for this?

    • I have searched the existing issues

    Current behaviour

    https://cirrus-ci.com/task/4776911524593664?logs=ci#L3147

    test  2023-05-24T12:16:53.212000Z TestFramework (ERROR): Assertion failed
                                      Traceback (most recent call last):
                                        File "/tmp/cirrus-build-825335453/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/test_framework/test_framework.py", line 132, in main
                                          self.run_test()
                                        File "/tmp/cirrus-build-825335453/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/rpc_getblockfrompeer.py", line 126, in run_test
                                          assert_equal(pruneheight, 248)
                                        File "/tmp/cirrus-build-825335453/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/test_framework/util.py", line 55, in assert_equal
                                          raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
                                      AssertionError: not(249 == 248)
    

    Expected behaviour

    .

    Steps to reproduce

    CI

    Relevant log output

    No response

    How did you obtain Bitcoin Core

    Compiled from source

    What version of Bitcoin Core are you using?

    current master

    Operating system and version

    [previous releases, qt5 dev package and depends packages, DEBUG] [focal]

    Machine specifications

    No response

  2. maflcko added the label Tests on May 25, 2023
  3. fanquake commented at 9:22 am on May 25, 2023: member
  4. willcl-ark commented at 9:59 am on May 26, 2023: member
    FWIW I have seen this error once before when running tests locally, but I was making changes to reconsiderblock at the time and assumed it was related to that.
  5. furszy commented at 9:17 pm on May 26, 2023: member
    I wasn’t able to reproduce it but my spider sense tells me that it will be fixed by #27770.
  6. theStack commented at 1:22 am on May 29, 2023: contributor

    After digging deeper and analyzing the failed CI log, I got a clearer picture of what’s going on here. Comparing the “Leaving block file” debug messages from a successful run and the failed run gave the essential hint:

    0$ grep "Leaving block" successful_run.log
    1Leaving block file 0: CBlockFileInfo(blocks=249, size=65319, heights=0...248, time=2011-02-02...2023-05-28)
    2Leaving block file 1: CBlockFileInfo(blocks=251, size=65511, heights=249...499, time=2023-05-28...2023-05-28)
    3
    4$ grep "Leaving block" failed_ci.log
    5Leaving block file 0: CBlockFileInfo(blocks=249, size=65319, heights=0...249, time=2011-02-02...2023-05-24)
    6Leaving block file 1: CBlockFileInfo(blocks=251, size=65511, heights=248...499, time=2023-05-24...2023-05-24)
    

    Blocks are appended to blk*.dat files in whatever order they are received from the network (either directly as a “block” message or reconstructed via compact blocks), and it turns out that in this (obviously rare?) scenario node2 saw block 249 before block 248, so block 249 still makes it into blk00000.dat while block 248 ends up in the next one, blk00001.dat. In this case, the pruneblockchain return value is one higher than usual (249 vs. 248).
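
    For illustration, here is a minimal sketch (not an actual script from this investigation) that recovers the per-file height ranges from the “Leaving block file” messages quoted above; the log path handling and regex are assumptions based purely on the format shown:

    #!/usr/bin/env python3
    # Sketch: recover per-file height ranges from "Leaving block file" debug messages.
    # Assumes lines formatted like the CBlockFileInfo dumps quoted above.
    import re
    import sys

    PATTERN = re.compile(
        r"Leaving block file (?P<file>\d+): CBlockFileInfo\(blocks=\d+, "
        r"size=\d+, heights=(?P<low>\d+)\.\.\.(?P<high>\d+)"
    )

    def file_height_ranges(log_path):
        """Return {file_number: (lowest_height, highest_height)} seen in the log."""
        ranges = {}
        with open(log_path) as f:
            for line in f:
                m = PATTERN.search(line)
                if m:
                    ranges[int(m.group("file"))] = (int(m.group("low")), int(m.group("high")))
        return ranges

    if __name__ == "__main__":
        for n, (low, high) in sorted(file_height_ranges(sys.argv[1]).items()):
            print(f"blk{n:05d}.dat: heights {low}..{high}")
        # In the failed run this prints overlapping ranges (0..249 and 248..499),
        # i.e. block 249 was written to disk before block 248.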

    I was surprised to see that even with only a single peer (node2 is only getting blocks from node0) blocks could arrive out of order, but that seems to be due to a mix of receiving full blocks and compact blocks. I wrote up some hacky python-regex-log-parsing magic to compare the order of the actual chain (as generated on node0) with the order in which it’s received on node2, confirming what’s stated above:

     $ ./analyze_issue_27749.py
     node2 received block 201, OK.
     node2 received block 202, OK.
     node2 received block 203, OK.
     node2 received block 204, OK.
     node2 received block 205, OK.
     node2 reconstructed block 206, OK.
     node2 received block 207, OK.
     node2 received block 208, OK.
     node2 received block 209, OK.
     node2 received block 210, OK.
     node2 received block 211, OK.
     !!! expected to receive/reconstruct block 212, but reconstructed block 213 !!!
     !!! expected to receive/reconstruct block 213, but received block 212 !!!
     node2 received block 214, OK.
     node2 received block 215, OK.
     node2 received block 216, OK.
     node2 received block 217, OK.
     node2 received block 218, OK.
     node2 received block 219, OK.
     node2 received block 220, OK.
     node2 received block 221, OK.
     node2 received block 222, OK.
     node2 received block 223, OK.
     node2 received block 224, OK.
     node2 received block 225, OK.
     node2 received block 226, OK.
     node2 received block 227, OK.
     node2 received block 228, OK.
     node2 received block 229, OK.
     node2 received block 230, OK.
     node2 received block 231, OK.
     node2 received block 232, OK.
     node2 received block 233, OK.
     node2 received block 234, OK.
     node2 received block 235, OK.
     node2 received block 236, OK.
     node2 received block 237, OK.
     node2 received block 238, OK.
     node2 received block 239, OK.
     node2 received block 240, OK.
     node2 received block 241, OK.
     node2 received block 242, OK.
     node2 received block 243, OK.
     node2 received block 244, OK.
     node2 received block 245, OK.
     node2 received block 246, OK.
     node2 received block 247, OK.
     !!! expected to receive/reconstruct block 248, but reconstructed block 249 !!!
     !!! expected to receive/reconstruct block 249, but received block 248 !!!
     node2 reconstructed block 250, OK.
     node2 reconstructed block 251, OK.
     node2 reconstructed block 252, OK.
     node2 reconstructed block 253, OK.
     node2 reconstructed block 254, OK.
     node2 reconstructed block 255, OK.
     node2 reconstructed block 256, OK.
     ..........
     node2 reconstructed block 605, OK.
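
    The script itself isn’t reproduced here, but the comparison boils down to something like the following sketch (the debug.log lines it greps for — node0’s “UpdateTip: new best=…” and node2’s “received block …” / “Successfully reconstructed block …” — as well as the file paths and the >200 cut-off are assumptions, not the original analyze_issue_27749.py):

    #!/usr/bin/env python3
    # Sketch of the comparison described above (not the original analyze_issue_27749.py).
    # Assumption: node0's log maps hashes to heights via "UpdateTip: new best=<hash> height=<h>",
    # while node2's log shows "received block <hash>" for full blocks and
    # "Successfully reconstructed block <hash>" for compact block reconstruction.
    import re

    def chain_order(node0_log):
        """(hash, height) pairs in the order node0's tip advanced."""
        order = []
        for line in open(node0_log):
            m = re.search(r"UpdateTip: new best=([0-9a-f]{64}) height=(\d+)", line)
            if m:
                order.append((m.group(1), int(m.group(2))))
        return order

    def receive_order(node2_log):
        """("received"|"reconstructed", hash) in the order node2 saw blocks."""
        events = []
        for line in open(node2_log):
            m = re.search(r"received block ([0-9a-f]{64})", line)
            if m:
                events.append(("received", m.group(1)))
                continue
            m = re.search(r"Successfully reconstructed block ([0-9a-f]{64})", line)
            if m:
                events.append(("reconstructed", m.group(1)))
        return events

    heights = dict(chain_order("node0/regtest/debug.log"))
    expected = sorted(h for h in heights.values() if h > 200)  # node2 joins the test around block 200
    for exp, (how, blockhash) in zip(expected, receive_order("node2/regtest/debug.log")):
        got = heights.get(blockhash, -1)
        if got == exp:
            print(f"node2 {how} block {got}, OK.")
        else:
            print(f"!!! expected to receive/reconstruct block {exp}, but {how} block {got} !!!")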
    

    Not sure how to best solve this issue. Is there a way to enforce linear block reception, at least when there is only a single peer? Or maybe the best solution would be to not involve the network at node2 at all (for the sake of the pruning part of the test) and just submitblock the blocks generated on node0? #27770 is of course always an alternative.
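
    For the submitblock idea, a rough sketch in the functional test could look like this (assumptions: node indices, the 400-block count, and that node2 does not also receive these blocks over P2P):

    # Sketch: feed node2 the blocks directly and in strict height order via
    # submitblock, so the order they hit blk*.dat is deterministic.
    # Assumes node2 is not receiving these blocks over P2P at the same time.
    blockhashes = self.generate(self.nodes[0], 400, sync_fun=self.no_op)
    for blockhash in blockhashes:
        block_hex = self.nodes[0].getblock(blockhash, 0)  # verbosity 0 -> raw block hex
        self.nodes[2].submitblock(block_hex)
    assert_equal(self.nodes[2].getblockcount(), self.nodes[0].getblockcount())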

  7. mzumsande commented at 5:07 am on May 29, 2023: contributor

    Nice analysis!

    Not sure how to best solve this issue. Is there a way to enforce linear blocks reception at least when only having a single peer?

    I think the cause is that a failed DoS check if (pindex->nHeight <= m_chainman.ActiveChain().Height() + 2) (link) will fall back to headers processing if node2’s chain hasn’t caught up enough (the new block is too far ahead of the existing chain), adding another round trip to the block download compared to compact block reconstruction. When node2 has caught up and switches to compact block reconstruction, the two blocks can end up on disk in a different order. This can happen in the first place because node2 initially also needs to catch up with blocks 201-205 from the first part of the test (and is rather slow to do so in the failed run!).

    So I think that an alternative fix would be to add a sync_blocks call between node0 and node2 before letting node0 generate the 400 blocks. This should guarantee that all of these 400 blocks will be processed by compact block reconstruction.
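
    For illustration, such a change might look roughly like the following in rpc_getblockfrompeer.py (a sketch only; the surrounding test code, node indices and sync helper usage are assumptions, not the exact patch that was later proposed):

    # Sketch: make sure node2 has caught up with blocks 201-205 from the earlier
    # part of the test before mining the 400 blocks used for the pruning checks,
    # so all of them are relayed as compact blocks and stored in height order.
    self.sync_blocks([self.nodes[0], self.nodes[2]])
    self.generate(self.nodes[0], 400, sync_fun=lambda: self.sync_blocks([self.nodes[0], self.nodes[2]]))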

  8. furszy commented at 2:27 pm on May 29, 2023: member

    Awesome investigation! It feels good when all dots are connected.

    Apart from mzumsande’s patch to enforce sequential block storage, which looks promising at first glance, I would like to refer to a portion of what I just wrote in #27770 (comment):

    I believe the main conclusion of the analysis (at least mine) is that the accuracy of the pruneblockchain result is questionable. The real issue lies in the fact that the RPC command’s result description claims to return the “height of the last block pruned”, which may not be true since blocks could have been stored out of order. Therefore, obtaining the block height 249 as the “last block pruned” does not necessarily indicate that block 248 has been pruned from disk.

    Considering this aspect, alongside mzumsande’s fix, I hold the view that the test is quite fragile and that it would be unwise to keep the current hardcoded pruneblockchain numbers. These numbers have a significant discrepancy with the block file information (as indicated in the comment) and aren’t really documented anywhere.
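
    As an illustration of that fragility, a less file-boundary-dependent check could look roughly like this (a sketch only; the pruneblockchain target height is a placeholder, and accepting both heights follows from the analysis above rather than from any agreed-upon fix):

    # Sketch: tolerate the off-by-one that out-of-order block files can cause,
    # and cross-check against the node's own view of what is still stored.
    pruneheight = self.nodes[2].pruneblockchain(250)  # target height is a placeholder
    assert pruneheight in (248, 249)  # last block of blk00000.dat, whichever height landed there

    # "pruneheight" in getblockchaininfo is the lowest-height complete block still stored.
    assert self.nodes[2].getblockchaininfo()["pruneheight"] in (248, 249)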

  9. Sjors commented at 1:23 pm on May 30, 2023: member
    cc @fjahr since he worked on the prune part of this test.
  10. mzumsande referenced this in commit 9fe9074266 on May 30, 2023
  11. mzumsande commented at 6:35 pm on May 30, 2023: contributor
    Opened #27784 which just adds an additional sync, as suggested in #27770 (comment) .
  12. fanquake referenced this in commit 433f17bc3f on May 31, 2023
  13. fanquake closed this on May 31, 2023

  14. sidhujag referenced this in commit 7704bfe509 on May 31, 2023
  15. bitcoin locked this on May 30, 2024
