ci: add delay between commits while testing all ancestor commits #34943

pull rkrux wants to merge 1 commits into bitcoin:master from rkrux:ancestor-commits changing 1 files +4 −0
  1. rkrux commented at 11:12 AM on March 28, 2026: contributor

    Addresses #34731

    It was earlier suspected that there is an intermittent deadlock in the server but following are the reasons why I didn't believe it is a deadlock issue based on the debug logs and debugger thread backtrace output shared in the issue:

    • The HTTP worker threads (b-http_pool_x) are waiting on the condition variable and not on the mutex that signals that these threads are idle & waiting for work to be assigned to them.
    • The HTTP thread (b-http) is epoll waiting that means it is waiting for a request (or a part of it) to be received.
    • The added logs show that the first few testmempool RPCs were successful and the next one timed out. But the logs don't show a request for it being logged unlike in the previous ones, hinting that the server never received such a request (or in full) and thus never processed it. Even then the functional test client timed out, which means that it did send it (at least a part of it).

    The large orphan transactions being sent via p2p_orphan_handling.py and p2p_opportunistic_1p1c.py tests are each 780KB in size that are sent sequentially. The test tries to send 60 of them in a loop amounting to 46MB of data over a single HTTP connection that is reused. Later, tcpdump was collected via this PR #34726 (credit maflcko) that showed the HTTP server signalled TCP zero window win 0 intermittently that led to such large requests never being read fully, so the server never processed them and thus never sent a response - the test client rightfully timed out after 30 seconds.

    Interestingly, this intermittent issue has been observed only in the "test ancestor commits" CI job that recently started testing all the commits in the PR, which is more robust, as opposed to testing only the last 6 commits like it used to do earlier. For each commit in the PR, this job runs 16 tests in parallel where the CPU nproc is 8. These two are the only tests that send such large orphans to the server in the same instance 50-60 times amounting to 45MB being sent in a burst. I've noticed this issue in this job never in the first commit being tested but instead only in the subsequent ones.

    Adding a 6s delay before testing each commit in the "test ancestor commits job" seems to get rid of the issue, the passing CI job is here that tests 130 commits.

    Note: The previous approach I tried in PR #34847 added a 50ms delay before sending such large orphan transaction requests but I prefer this solution over it because this change is restricted in this CI job only that fails.

    <!-- *** Please remove the following help text before submitting: *** Pull requests without a rationale and clear improvement may be closed immediately. GUI-related pull requests should be opened against https://github.com/bitcoin-core/gui first. See CONTRIBUTING.md -->

    <!-- Please provide clear motivation for your patch and explain how it improves Bitcoin Core user experience or Bitcoin Core developer experience significantly: * Any test improvements or new tests that improve coverage are always welcome. * All other changes should have accompanying unit tests (see `src/test/`) or functional tests (see `test/`). Contributors should note which tests cover modified code. If no tests exist for a region of modified code, new tests should accompany the change. * Bug fixes are most welcome when they come with steps to reproduce or an explanation of the potential issue as well as reasoning for the way the bug was fixed. * Features are welcome, but might be rejected due to design or scope issues. If a feature is based on a lot of dependencies, contributors should first consider building the system outside of Bitcoin Core, if possible. * Refactoring changes are only accepted if they are required for a feature or bug fix or otherwise improve developer experience significantly. For example, most "code style" refactoring changes require a thorough explanation why they are useful, what downsides they have and why they *significantly* improve developer experience or avoid serious programming bugs. Note that code style is often a subjective matter. Unless they are explicitly mentioned to be preferred in the [developer notes](/doc/developer-notes.md), stylistic code changes are usually rejected. -->

    <!-- Bitcoin Core has a thorough review process and even the most trivial change needs to pass a lot of eyes and requires non-zero or even substantial time effort to review. There is a huge lack of active reviewers on the project, so patches often sit for a long time. -->

  2. DrahtBot added the label Tests on Mar 28, 2026
  3. DrahtBot commented at 11:12 AM on March 28, 2026: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--021abf342d371248e50ceaed478a90ca-->

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Stale ACK maflcko

    If your review is incorrectly listed, please copy-paste <code>&lt;!--meta-tag:bot-skip--&gt;</code> into the comment that the bot should ignore.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #34847 (test: throttle large orphan transactions while being sent in RPCs by rkrux)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    <!--5faf32d7da4f0f540f40219e4f7537a3-->

  4. DrahtBot added the label CI failed on Mar 28, 2026
  5. rkrux commented at 1:28 PM on March 28, 2026: contributor
  6. rkrux force-pushed on Mar 28, 2026
  7. rkrux force-pushed on Mar 29, 2026
  8. rkrux commented at 4:13 AM on March 29, 2026: contributor

    With -j=nproc//2 the error didn't occur but the tests were slow by factor of more than 3.5 that led to the job being cancelled at the end because all the commits could not be tested (only 45 could). So, this is not a viable solution.

  9. DrahtBot removed the label CI failed on Mar 29, 2026
  10. rkrux commented at 9:23 AM on March 29, 2026: contributor

    So adding a 3s delay before testing each commit in the "test ancestor commits job" seems to fix the issue: https://github.com/bitcoin/bitcoin/actions/runs/23701103981/job/69044839367?pr=34943

    I suppose adding a delay before sending large orphan transactions over the wire doesn't seem necessary as done in PR #34847.

  11. rkrux renamed this:
    ci: ancestor commits [draft, ignore]
    ci: add delay between commit while testing all ancestor commits
    on Mar 29, 2026
  12. rkrux renamed this:
    ci: add delay between commit while testing all ancestor commits
    ci: add delay between commits while testing all ancestor commits
    on Mar 29, 2026
  13. rkrux force-pushed on Mar 29, 2026
  14. rkrux marked this as ready for review on Mar 29, 2026
  15. in .github/ci-test-each-commit-exec.py:23 in 103036adc7 outdated
      19 | @@ -19,6 +20,7 @@ def run(cmd, **kwargs):
      20 |  
      21 |  def main():
      22 |      print("Running tests on commit ...")
      23 | +    time.sleep(3)
    


    maflcko commented at 6:13 AM on March 30, 2026:

    I don't like hard sleeps, especially if they come without any reason. This should add a comment to say it is a temporary workaround for something.


    rkrux commented at 6:31 AM on March 30, 2026:

    Added comment here and reworded from "fixes" to "addresses" in PR description.

  16. DrahtBot approved
  17. DrahtBot commented at 6:17 AM on March 30, 2026: contributor

    .

  18. maflcko approved
  19. maflcko commented at 6:19 AM on March 30, 2026: member

    Fixes #34731

    I don't think this is a fix. This is just hiding the issue temporarily, not fixing the underlying issue.

    lgtm either way as a temp fix

  20. rkrux force-pushed on Mar 30, 2026
  21. maflcko commented at 6:33 AM on March 30, 2026: member

    lgtm ACK 8e3af53081b7d88204c09a75ecb7d00cfe64a41f

    I don't understand how and why this even affects #34731, but if this is really a workaround, it seems simply and harmless enough.

  22. ci: add delay between commits while testing ancestor commits
    This patch adds a 6s delay before starting testing the next commit.
    This is done to give the CI instance some time to clean up in this
    job that can potentially test upto 130 commits or so.
    3a4603326e
  23. in .github/ci-test-each-commit-exec.py:25 in 8e3af53081
      19 | @@ -19,6 +20,9 @@ def run(cmd, **kwargs):
      20 |  
      21 |  def main():
      22 |      print("Running tests on commit ...")
      23 | +    # Sleep before testing next commit because this avoids the issue of
      24 | +    # intermittent timeouts in 780KB large RPC requests in this CI job
      25 | +    time.sleep(3)
    


    maflcko commented at 5:46 PM on April 1, 2026:

    it is a bit odd that this passed CI, because I still don't understand why it passed.

    Also, adding a sleep 3 to my reproducer in #34731 (comment) doesn't affect it at all, so maybe this just passed by accident?


    rkrux commented at 5:58 PM on April 1, 2026:

    I've seen CI passing consistently with delays.

    It passed here with a 3s delay between commits, it passed in #34847 with either 50ms or 200ms delay between testmempoolaccept RPC calls, and it passed in #34963 with a 1s delay alongwith tcp_tw_reuse tuning.


    maflcko commented at 7:03 PM on April 1, 2026:

    I can see why a sleep helps while the connection is up. However the sleep here happens when the VM is fully empty and idle (no connections up), except for maybe time-wait ones.

    However, time-wait also doesn't seem likely to be related to the issues here, because the issue happens in established connections, not while setting up a new connection.

    In any case, just wondering. Seems fine to merge this from my side as a temporary workaround.


    rkrux commented at 10:47 AM on April 2, 2026:

    the sleep here happens when the VM is fully empty and idle (no connections up), except for maybe time-wait ones.

    The VM never stays idle until this sleep was added because as soon as one commit is finished testing, the other one starts immediately. From the debug stats I gathered, there are around 3000 wait-time sockets after a commit is finished, though that doesn't seem problematic on its own because the system never falls short of sockets while opening new connections.

    The test ancestor commits job is unusual in the way that every commit is not tested with a clean slate - commit N+1 is tested in a more worked-up VM environment than commit N, which doesn't give each commit the same baseline. This effect has become more pronounced after ancestor limit of 6 was removed recently. That's why I guess we see failures in later commits such as 12th, 34th, 47th, 54th, 113th, etc. This job currently seems like a combination of functional and performance testing, which seems unusual to me.

    Due to the nature of this CI job, there is constant socket churn happening because a connection is reused only within a test and can't be reused across tests because each starts and shuts down its node(s) as per its requirements. The idea behind adding a sleep here between testing commits was to give the system some rest before testing next commit. Though the remaining time-wait sockets won't go away in this time period of 3s because the time-wait period is 60s.

    so maybe this just passed by accident?

    I wanted to be sure that it didn't pass by accident, so in other draft PRs I tested with few combinations such as tcp_reuse with 1s delay, only 3s delay. Both of them had passed earlier but failed recently. So, unfortunately it does seem like it passed by accident. I'm trying with 6s at the moment because this delay is more than the combined delay of 5.5s (0.05 * (60 + 50)) between large RPC calls in the two p2p failing tests that I had earlier in other PR. However in one of the earlier tries, just by running few debug commands (sockstat, vmstat, loadaverage) between tests, which indirectly add a delay, made this job successful. So, I have seen a combination of results with these experiments.


    rkrux commented at 9:13 AM on April 7, 2026:

    I wanted to be sure that it didn't pass by accident, so in other draft PRs I tested with few combinations such as tcp_reuse with 1s delay, only 3s delay. Both of them had passed earlier but failed recently. So, unfortunately it does seem like it passed by accident.

    Adding delays seems to work intermittently and that doesn't instil a lot of confidence, I'm closing this pull.

  24. rkrux force-pushed on Apr 2, 2026
  25. rkrux closed this on Apr 7, 2026


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-22 09:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me