ci: add delay between commits while testing all ancestor commits

rkrux commented at 11:12 am on March 28, 2026: contributor

Addresses #34731

It was earlier suspected that there is an intermittent deadlock in the server but following are the reasons why I didn’t believe it is a deadlock issue based on the debug logs and debugger thread backtrace output shared in the issue:

The HTTP worker threads (b-http_pool_x) are waiting on the condition variable and not on the mutex that signals that these threads are idle & waiting for work to be assigned to them.
The HTTP thread (b-http) is epoll waiting that means it is waiting for a request (or a part of it) to be received.
The added logs show that the first few testmempool RPCs were successful and the next one timed out. But the logs don’t show a request for it being logged unlike in the previous ones, hinting that the server never received such a request (or in full) and thus never processed it. Even then the functional test client timed out, which means that it did send it (at least a part of it).

The large orphan transactions being sent via p2p_orphan_handling.py and p2p_opportunistic_1p1c.py tests are each 780KB in size that are sent sequentially. The test tries to send 60 of them in a loop amounting to 46MB of data over a single HTTP connection that is reused. Later, tcpdump was collected via this PR #34726 (credit maflcko) that showed the HTTP server signalled TCP zero window win 0 intermittently that led to such large requests never being read fully, so the server never processed them and thus never sent a response - the test client rightfully timed out after 30 seconds.

Interestingly, this intermittent issue has been observed only in the “test ancestor commits” CI job that recently started testing all the commits in the PR, which is more robust, as opposed to testing only the last 6 commits like it used to do earlier. For each commit in the PR, this job runs 16 tests in parallel where the CPU nproc is 8. These two are the only tests that send such large orphans to the server in the same instance 50-60 times amounting to 45MB being sent in a burst. I’ve noticed this issue in this job never in the first commit being tested but instead only in the subsequent ones.

Adding a 3s delay before testing each commit in the “test ancestor commits job” seems to get rid of the issue, the passing CI job is here that tests 125 commits.

Note: The previous approach I tried in PR #34847 added a 50ms delay before sending such large orphan transaction requests but I prefer this solution over it because this change is restricted in this CI job only that fails.

DrahtBot added the label Tests on Mar 28, 2026

DrahtBot commented at 11:12 am on March 28, 2026: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	maflcko

If your review is incorrectly listed, please copy-paste <!–meta-tag:bot-skip–> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#34847 (test: throttle large orphan transactions while being sent in RPCs by rkrux)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot added the label CI failed on Mar 28, 2026

rkrux commented at 1:28 pm on March 28, 2026: contributor

Failed at 18th commit with -j=nrpoc: https://github.com/bitcoin/bitcoin/actions/runs/23683980772/job/69000417113?pr=34943

rkrux force-pushed on Mar 28, 2026

rkrux force-pushed on Mar 29, 2026

rkrux commented at 4:13 am on March 29, 2026: contributor

With -j=nproc//2 the error didn’t occur but the tests were slow by factor of more than 3.5 that led to the job being cancelled at the end because all the commits could not be tested (only 45 could). So, this is not a viable solution.

DrahtBot removed the label CI failed on Mar 29, 2026

rkrux commented at 9:23 am on March 29, 2026: contributor

So adding a 3s delay before testing each commit in the “test ancestor commits job” seems to fix the issue: https://github.com/bitcoin/bitcoin/actions/runs/23701103981/job/69044839367?pr=34943

I suppose adding a delay before sending large orphan transactions over the wire doesn’t seem necessary as done in PR #34847.

rkrux renamed this:
~~ci: ancestor commits [draft, ignore]~~
ci: add delay between commit while testing all ancestor commits
on Mar 29, 2026

rkrux renamed this:
~~ci: add delay between commit while testing all ancestor commits~~
ci: add delay between commits while testing all ancestor commits
on Mar 29, 2026

rkrux force-pushed on Mar 29, 2026

rkrux marked this as ready for review on Mar 29, 2026

in .github/ci-test-each-commit-exec.py:23 in 103036adc7 outdated

19@@ -19,6 +20,7 @@ def run(cmd, **kwargs):
20 
21 def main():
22     print("Running tests on commit ...")
23+    time.sleep(3)

maflcko commented at 6:13 am on March 30, 2026:

I don’t like hard sleeps, especially if they come without any reason. This should add a comment to say it is a temporary workaround for something.

rkrux commented at 6:31 am on March 30, 2026:

Added comment here and reworded from “fixes” to “addresses” in PR description.

DrahtBot approved

DrahtBot commented at 6:17 am on March 30, 2026: contributor

.

maflcko approved

maflcko commented at 6:19 am on March 30, 2026: member

Fixes #34731

I don’t think this is a fix. This is just hiding the issue temporarily, not fixing the underlying issue.

lgtm either way as a temp fix

ci: add delay between commits while testing ancestor commits

This patch adds a 3s delay before starting testing the next commit.
This is done to give the CI instance some time to clean up in this
job that can potentially test upto 150 commits or so.

8e3af53081

rkrux force-pushed on Mar 30, 2026

maflcko commented at 6:33 am on March 30, 2026: member

lgtm ACK 8e3af53081b7d88204c09a75ecb7d00cfe64a41f

I don’t understand how and why this even affects #34731, but if this is really a workaround, it seems simply and harmless enough.

ci: add delay between commits while testing all ancestor commits #34943

Reviews

Conflicts