test: fix non-determinism in p2p_headers_sync_with_minchainwork.py #25978

pull sdaftuar wants to merge 1 commits into bitcoin:master from sdaftuar:2022-09-fix-flaky-headers-sync-test changing 1 files +2 −2
  1. sdaftuar commented at 7:47 PM on September 1, 2022: member

    The test for node3's chaintips (added in PR25960) needs some sort of synchronization in order to be reliable.

  2. test: fix non-determinism in p2p_headers_sync_with_minchainwork.py
    The test for node3's chaintips needs some sort of synchronization in order to
    be reliable.
    88e7807e77
  3. DrahtBot added the label Tests on Sep 1, 2022
  4. MarcoFalke added this to the milestone 24.0 on Sep 2, 2022
  5. satsie commented at 4:56 PM on September 2, 2022: contributor

    I checked out this PR and the tests run for me fine both with and without your changes. However, when I drop in some debugger breakpoints to understand things better, I am able to observe failures in the first test maybe 25% of the time (this is with the changes that are in this PR). The failure is that the Synchronizing blockheaders, height: 14 log message cannot be found.

    Is there a known race condition here? Am I observing the behavior that this PR is aiming to fix?

    Here are my test steps:

    • check out branch, build
    • Use import pdb; pdb.set_trace() to add some breakpoints. Add a few at the beginning of the first test, but also to test_node.py assert_debug_log and test_framework.py generate
    • run the test: test/functional/p2p_headers_sync_with_minchainwork.py
    • During a breakpoint, get the log for node3 with print(self.nodes[3].debug_log_path) and tail it in another window. I thought this would be helpful but after comparing logs for a successful run, and a failed run, I can't quite catch when the failure is happening. The expected log message does appear in logs for both failed and successful runs.
  6. MarcoFalke commented at 5:16 PM on September 2, 2022: member

    However, when I drop in some debugger breakpoints to understand things better, I am able to observe failures

    While a debugger is attached, you may run into (unrelated) test timeouts, because the debugger doesn't prevent your system clock from advancing. You may increase the timeout with the timeout-factor command line option, but this will also make it harder to reproduce this intermittent issue, because you may have to wait longer for the timeout to happen if the issue occurs.

    Usually intermittent issues are observed when the system running them puts a thread to sleep for a longer time. Thus, I find it easiest to deterministically reproduce the failures by inserting the sleep myself. For example, you can modify the source code to insert a UninterruptibleSleep() into the right thread. (For example like done in #25914)

  7. mzumsande commented at 5:23 PM on September 2, 2022: contributor

    Haven't had the chance to look into this yet, but just pointing out that the test currently fails intermittently in the CI, which this PR should fix: https://cirrus-ci.com/task/5407068940140544 https://cirrus-ci.com/task/5945347225681920

  8. mzumsande commented at 6:15 PM on September 2, 2022: contributor

    Code Review ACK 88e7807e771a568ac34c320b4055d832990049df

    Is there a known race condition here? Am I observing the behavior that this PR is aiming to fix?

    Yes, there is a race condition: While node 0 mines some blocks, and we wait for node 1 and 2 to complete the headers-sync (and decide to reject the headers because of insufficient work), there was no such check for node 3 before, so that it would be possible to call getchaintips for node 3 before it had completed the headerssync. This can be seen in the linked logs (around line 4543) where the "Synchronizing blockheaders, height: 14" log appears after the getchaintips call.

  9. satsie commented at 1:26 AM on September 3, 2022: contributor

    @MarcoFalke thank you for the valuable insight on the debugger's relationship (or lack thereof) with the system clock. I like what you said about intentionally inserting sleeps to reproduce intermittent timing failures. That's clever! @mzumsande thank you for explaining that race condition so thoroughly! It was really helpful to have the CI logs with the failure to reference.

    Based on these two comments I'm convinced I was introducing instability with the debugger, especially as I added additional breakpoints. Anyways I got confused about something and ended up running the tests 10 times (without the debugger) and never saw any failures.

    ACK 88e7807e771a568ac34c320b4055d832990049df

  10. fanquake merged this on Sep 4, 2022
  11. fanquake closed this on Sep 4, 2022

  12. sidhujag referenced this in commit 8bf0aa3cbc on Sep 5, 2022
  13. bitcoin locked this on Sep 4, 2023
Labels

Milestone
24.0


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-14 12:14 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me