Reasons why I don't believe it is a deadlock issue, based on the debug logs and the debugger thread backtraces shared in the issue:
The HTTP worker threads (b-http_pool_x) are waiting on the condition variable, not on the mutex, which indicates these threads are idle and waiting for work to be assigned to them.
The HTTP thread (b-http) is waiting in epoll, which means it is waiting for a request (or part of one) to be received.
The added logs show that the first few testmempoolaccept RPCs succeeded and the next one timed out. Unlike for the previous ones, no request is logged for the timed-out call, hinting that the server never received it (at least not in full) and thus never processed it. Even so, the functional test client timed out, which means it did send the request (or at least part of it).
The large orphan transactions are each 780KB in size and are sent sequentially by the test. It tries to send 60 of them in a loop, amounting to 46MB of data over a single, reused HTTP connection.
More details are shared in the first commit message.
This PR throttles the RPCs on the client side. I have not been able to reproduce this intermittent issue, so I can't guarantee that this fixes it altogether.
Note: a previous approach in this PR avoided reusing the HTTP connection for the RPCs in this test. But I noticed a CI run where the affected test took around 75 minutes to complete, which led me to the current approach: the HTTP connection is reused as before, but with some throttling.
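The client-side throttling can be sketched roughly as follows. This is a minimal illustration, not the exact patch: `node` and `large_orphans` mirror names used in the functional test, and the delay value is an illustrative assumption.

```python
import time

# Hypothetical sketch of throttled submission of large orphan transactions
# via the testmempoolaccept RPC; the delay value is an assumption.
def send_large_orphans(node, large_orphans, delay=0.3):
    """Submit each large orphan via testmempoolaccept, sleeping between calls."""
    results = []
    for tx in large_orphans:
        results.append(node.testmempoolaccept([tx.serialize().hex()]))
        # Give the HTTP server a chance to drain its TCP receive buffer
        # before the next ~780KB request arrives on the reused connection.
        time.sleep(delay)
    return results
```

The sleep trades a few seconds of test latency per commit for headroom on the server side when requests arrive in a burst.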
DrahtBot added the label Tests on Mar 18, 2026
DrahtBot
commented at 8:57 AM on March 18, 2026:
contributor
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Reviews
See the guideline for information on the review process.
A summary of reviews will appear here.
Conflicts
Reviewers, this pull request conflicts with the following ones:
#34943 (ci: add delay between commits while testing all ancestor commits by rkrux)
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
LLM Linter (✨ experimental)
Possible places where named args for integral literals may be used (e.g. func(x, /*named_arg=*/0) in C++, and func(x, named_arg=0) in Python):
evbuffer_peek(buf, -1, nullptr, v, 8) in src/httpserver.cpp
2026-04-02 08:06:24
rkrux
commented at 10:31 AM on March 18, 2026:
contributor
maflcko
commented at 11:08 AM on March 18, 2026:
member
I don't think the issue happens on macOS; it happens only in the task that tests ancestor commits, but that task is skipped for pull requests with one commit. Also, reproducing it requires tens of runs/commits.
hebasto
commented at 11:08 AM on March 18, 2026:
member
rkrux
commented at 11:13 AM on March 18, 2026:
contributor
I don't think the issue happens in macOS, but only in the task that tests ancestor commits,
Oh interesting, I do have a second commit that I will push after this recently started job ends/succeeds.
Also, reproducing requires tens of runs/commits.
I'm not trying to reproduce the issue, just ensuring that the tests don't fail for some other reason on this commit. I will undraft the PR once the CI is green (or at least not yellow).
DrahtBot removed the label CI failed on Mar 18, 2026
rkrux force-pushed on Mar 18, 2026
rkrux force-pushed on Mar 18, 2026
rkrux renamed this from "test: work in progress commit" to "test: conditionally throttle large testmempoolaccept rpcs in p2p_orphan_handling test" on Mar 18, 2026
rkrux
commented at 2:42 PM on March 18, 2026:
contributor
A previous instance of the CI run where the test took 4536s (~75min) to complete when the HTTP connection was not reused and a fresh one was created for every RPC in the test: ASan + LSan + UBSan + integer
rkrux marked this as ready for review on Mar 18, 2026
rkrux force-pushed on Mar 19, 2026
in test/functional/p2p_orphan_handling.py:633 in b309bdb2c4
629 | @@ -630,7 +630,8 @@ def test_maximal_package_protected(self):
630 |
631 | # Check to make sure these are orphans, within max standard size (to be accepted into the orphanage)
632 | for large_orphan in large_orphans:
633 | - testres = node.testmempoolaccept([large_orphan.serialize().hex()])
634 | + # throttle these 780KB large requests if the RPC latency is greater than 1s
Above is a log excerpt from the last successful request. There is no such delay in these logs (going by the timestamps), although these are only server-side logs and not from the client (test) side. I think I will just revert to an unconditional sleep instead of doing it conditionally, for which I don't have any basis.
The tests are run on a fast gaming CPU,
Nice, I didn't know this, but do we know how much load it is under (or at least how much when the intermittent issue occurred)?
A previous instance of the CI run where the test took 4536s (~75min) to complete when the HTTP connection was not reused and a fresh one was created for every RPC in the test: ASan + LSan + UBSan + integer
I like this approach for this test, but this one occurrence discouraged me. I do feel it hints at the CI instance(s) being intermittently under load, for which an unconditional sleep can be a remedy.
Nice, I didn't know this but do we know how much load it is under (or atleast when the intermittent issue occured)?
I wouldn't expect a high load to be the issue here. This is an optimized build without any sanitizers, running on a high-end CPU. Seeing a spurious 30 seconds timeout for an RPC that would otherwise take milliseconds seems off.
In fact, it may be a race, that is only visible because the CPU is so fast.
I've reworked the PR to throttle at the large-orphan level whenever one is sent over the network to the RPC server. This handles both p2p tests (p2p_orphan_handling, p2p_opportunistic_1p1c) that timed out intermittently.
CyberNFT
commented at 9:18 AM on March 19, 2026:
none
👍🏻
rkrux force-pushed on Mar 19, 2026
rkrux renamed this from "test: conditionally throttle large testmempoolaccept rpcs in p2p_orphan_handling test" to "test: throttle large testmempoolaccept rpcs in p2p_orphan_handling test" on Mar 19, 2026
rkrux force-pushed on Mar 19, 2026
DrahtBot added the label CI failed on Mar 19, 2026
DrahtBot removed the label CI failed on Mar 19, 2026
rkrux force-pushed on Mar 27, 2026
rkrux marked this as a draft on Mar 27, 2026
rkrux force-pushed on Mar 27, 2026
rkrux force-pushed on Mar 27, 2026
DrahtBot added the label CI failed on Mar 27, 2026
rkrux renamed this from "test: throttle large testmempoolaccept rpcs in p2p_orphan_handling test" to "test: throttle large orphan transactions while being sent in RPCs" on Mar 27, 2026
rkrux marked this as ready for review on Mar 27, 2026
DrahtBot removed the label CI failed on Mar 27, 2026
maflcko
commented at 10:40 AM on March 27, 2026:
member
Not sure this is the correct fix. We are not sending 1MB from somewhere outside the solar system to the earth. This is sending 780KB on a local socket from one process to another. Why should this take 30 seconds? Normally, the whole test passes in less time than that, and then suddenly a single RPC times out?
Also, you haven't even tested if this fix is working. 12 runs/commits is not enough. It can happen after the 40th or 60th run. You'll have to add 125 empty commits or so.
I don't mind a temporary workaround, but at least it should be tested, and it should be explained that this is just a temporary workaround for a real underlying bug.
Otherwise, are we going to update the docs to say: "If you call an RPC with a large payload, you have to manually sleep after each call"?
rkrux
commented at 11:00 AM on March 27, 2026:
contributor
This is sending 780KB on a local socket from one process to another. Why should this take 30 seconds?
It shouldn't take 30 seconds. Since it's all local, I don't think this is a network latency issue; rather, the server drains its TCP receive buffer more slowly than the client sends. That's why I believe the zero TCP window issue occurs.
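The suspected mechanism, a sender outpacing a receiver that doesn't drain its buffer, can be demonstrated generically. This is not Bitcoin Core code; it uses a Unix socketpair for portability, so there is no TCP window here, but the buffer-fill effect is analogous to what a real TCP peer would report as `win 0`.

```python
import socket

# Generic demonstration: write into a connected socket whose peer never
# reads. Once the kernel buffers on both ends fill up, a non-blocking
# send stalls, the local analogue of a zero TCP receive window.
def fill_until_stall(chunk_size=65536, max_chunks=1024):
    rx, tx = socket.socketpair()
    tx.setblocking(False)
    chunk = b"x" * chunk_size
    sent = 0
    try:
        for _ in range(max_chunks):
            sent += tx.send(chunk)  # the receiver never calls recv()...
    except BlockingIOError:
        pass  # ...so the kernel buffers fill up and sending stalls
    finally:
        tx.close()
        rx.close()
    return sent
```

On typical default buffer sizes, the send stalls well short of the 64MB cap, just as a burst of unread 780KB requests can fill the HTTP server's receive buffer here.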
it should be explained that this is just a temporary workaround
The PR description does hint at this at the end, but I can make it explicit.
Not sure this is the correct fix.
It can happen after the 40th or 60th run. You'll have to add 125 empty commits or so.
I can test with more commits and put it in draft until then.
for a real underlying bug.
The presence of this issue in only one CI job is what I find most confusing (and interesting).
"If you call an RPC with a large payload, you have to manually sleep after each call"
This shouldn't be required, because the issue doesn't happen all the time; it is intermittent and specific to one CI job, which even calls that job's setup into question.
rkrux marked this as a draft on Mar 27, 2026
in test/functional/p2p_orphan_handling.py:632 in 56898d5d5a (outdated)
628 | @@ -630,7 +629,7 @@ def test_maximal_package_protected(self):
629 |
630 | # Check to make sure these are orphans, within max standard size (to be accepted into the orphanage)
631 | for large_orphan in large_orphans:
632 | - testres = node.testmempoolaccept([large_orphan.serialize().hex()])
633 | + testres = node.testmempoolaccept([large_orphan.to_send.serialize().hex()])
Instead of sleeping 300ms, it would be a smaller temporary workaround to just quickly spin up a new tcp connection. You can do this either:
by calling .cli() (spawns a bitcoin-cli process) in a trivial one-line patch
or cherry-pick fa8fc5a23752c2a590b95f62833cf013a3d6febc, which was meant for different threads, but using the new authproxy for a single rpc call should also be fine and work around the issue for now.
If you want to keep the unconditional sleep, my preference would be to inline it here again, like it was in the beginning of this pull?
it would be a smaller temporary workaround to just quickly spin up a new tcp connection.
A new connection for every iteration of testmempoolaccept?
If you want to keep the unconditional sleep, my preference would be to inline it here again, like it was in the beginning of this pull?
I preferred that too, and then noticed the same failure in the p2p_opportunistic_1p1c test. So I thought I would highlight it in the code by putting the sleep in the class itself, so that a sleep happens whenever a LargeOrphan is sent over the wire (though the sleep only matters when many such large orphans are sent in a burst). Otherwise it seemed easy for a new call site to miss adding the sleep.
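The fresh-connection alternative raised above could look roughly like this. This is a sketch only, not a tested patch: `node.cli` is the test framework's bitcoin-cli wrapper, and the exact call shape is an assumption for illustration.

```python
# Hypothetical sketch: submit each large orphan through bitcoin-cli so that
# every RPC uses a brand-new TCP connection instead of the pooled HTTP one.
def send_via_fresh_connections(node, large_orphans):
    results = []
    for large_orphan in large_orphans:
        # Each node.cli call spawns a bitcoin-cli process, i.e. a fresh
        # connection, which sidesteps any state stuck on a reused socket.
        results.append(node.cli.testmempoolaccept([large_orphan.serialize().hex()]))
    return results
```

The trade-off discussed in this thread is process-spawn overhead per call, which is what made the earlier no-reuse CI run take ~75 minutes.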
in test/functional/p2p_opportunistic_1p1c.py:444 in 56898d5d5a (outdated)
maflcko
commented at 3:01 PM on March 27, 2026:
member
lgtm (assuming ci passes)
Looks like it is on track of passing ...
So I guess adding a sleep to work around a timeout bug is another data point showing there is an underlying racy bug that is only triggered by weird timing (and can be avoided by adding weird timing/sleeps).
rkrux
commented at 3:07 PM on March 27, 2026:
contributor
Looks like it is on track of passing ...
Yeah, no failure yet. But I sense that the job's 360-minute threshold will be hit before all the ancestor commits are tested, so it might get cancelled in the end. :(
The job finished 13 minutes faster than the one with the 300ms sleep, saving more than 6 seconds per commit.
rkrux
commented at 11:00 AM on March 29, 2026:
contributor
Opened PR #34943 for review, which I prefer over this solution.
rkrux force-pushed on Mar 30, 2026
rkrux force-pushed on Mar 30, 2026
DrahtBot added the label CI failed on Mar 30, 2026
rkrux force-pushed on Mar 30, 2026
DrahtBot removed the label CI failed on Mar 30, 2026
rkrux force-pushed on Mar 31, 2026
rkrux force-pushed on Mar 31, 2026
rkrux force-pushed on Mar 31, 2026
DrahtBot added the label CI failed on Mar 31, 2026
rkrux force-pushed on Mar 31, 2026
rkrux force-pushed on Mar 31, 2026
rkrux force-pushed on Mar 31, 2026
DrahtBot removed the label CI failed on Mar 31, 2026
rkrux force-pushed on Apr 1, 2026
rkrux force-pushed on Apr 1, 2026
DrahtBot added the label CI failed on Apr 1, 2026
DrahtBot removed the label CI failed on Apr 1, 2026
rpc: wip 0a5d744032
ci: add system network level debugging between testing each commit b6d90081f7
test: throttle large orphan transactions while being sent in RPCs
Each of these large orphan transactions is around 780KB, and they are sent
sequentially, without waiting, via the `testmempoolaccept` RPC. For the
`p2p_orphan_handling` and `p2p_opportunistic_1p1c` tests, which send these
large orphan transactions sequentially 50-60 times, it has been observed in
the CI via the tcpdump outputs (refer to the issue 34731 thread) that
the HTTP server intermittently advertises a zero TCP window (`win 0`), which leads to
such requests never being read fully, so the server never processes them and
thus never sends a response. The test client rightfully times out after 30 seconds.
Interestingly, this intermittent issue has been observed only in the "test ancestor
commits" CI job, which recently started testing all the commits in the PR (more
robust than testing only the last 6 commits, as it used to). For each commit in
the PR, this job runs 16 tests in parallel on a machine where nproc is 8. These
two are the only tests that send such large orphans to the same server instance
50-60 times, amounting to 45MB sent in a burst. I've never seen this issue in the
first commit being tested, only in subsequent ones.
This commit creates a LargeOrphanTransaction class that provides two properties:
one to get the large orphan transaction for internal operations, and the other
to send the transaction over the network, which by default adds a 50ms sleep
before returning. This ensures that the test client doesn't bombard the server
with such large transactions without giving it a cool-down period for the TCP
window to clear.
One of the earlier CI runs I tested this change on had 125 commits with a 300ms
delay between each such RPC, and this issue didn't occur:
https://github.com/bitcoin/bitcoin/actions/runs/23643332720/job/68869238262?pr=34847
All the `p2p_orphan_handling` tests finished in under 60 seconds each, and all the
`p2p_opportunistic_1p1c` tests finished in under 70 seconds each. Adding the sleep
does increase the overall latency of these two tests but might help avoid
the intermittent timeouts.
b8e79e3192
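The LargeOrphanTransaction wrapper described in the commit message might look roughly like this. Property names follow the commit message and the review diff (`to_send`), the 50ms default matches the commit message, and the wrapped transaction type is left abstract; the rest is a hedged sketch, not the actual patch.

```python
import time

# Hypothetical sketch of the LargeOrphanTransaction wrapper: one property
# for internal use, one (`to_send`) that sleeps before handing the tx to
# callers that put it on the wire.
class LargeOrphanTransaction:
    def __init__(self, tx, send_delay=0.05):
        self._tx = tx
        self._send_delay = send_delay

    @property
    def tx(self):
        """The transaction, for internal bookkeeping (no delay)."""
        return self._tx

    @property
    def to_send(self):
        """The transaction as it goes over the wire: sleep first, so bursts
        of ~780KB requests give the server's TCP window time to clear."""
        time.sleep(self._send_delay)
        return self._tx
```

Putting the sleep behind the property means a new call site that sends the transaction cannot forget it, which is the rationale given in the review thread.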
test: empty commit 11e472bf31d
test: empty commit 29af315df11
test: empty commit 344b251a097
test: empty commit 487c32ae9a7
test: empty commit 576e152c3a8
test: empty commit 6eef2e823bb
test: empty commit 795596b7cf8
test: empty commit 8961c4ff75a
test: empty commit 93b5565ad8a
test: empty commit 102eb0893fc6
test: empty commit 1100af81cdc6
test: empty commit 12bfcbb0b17b
test: empty commit 13da4a68ef4c
test: empty commit 141ea9df7c25
test: empty commit 15006974dba8
test: empty commit 1624b999a852
test: empty commit 17b509d5261b
test: empty commit 185fd1b3ed46
test: empty commit 19b37fd30914
test: empty 202533c7d4a4
test: empty 2121436524eb
test: empty 22335f671f98
test: empty 23fd50711dc7
test: empty 24126bddf0cd
test: empty 257d924cd96a
test: empty 266c003dc4cb
test: empty 2744bcd04782
test: empty 28ffe2b2c3a3
test: empty 295fda3f20a5
test: empty 30356fcb317c
test: empty 31763655b16a
test: empty 3229bd9c6761
test: empty 337cf2961707
test: empty 34491dc8dd33
test: empty 35f540f8707a
test: empty 360c3487025b
test: empty 373f75344119
test: empty 3879cb72df11
test: empty 396389535a13
test: empty 40901f2f64d7
test: empty 418b622d2252
test: empty 422e4705706c
test: empty 4340b21eb561
test: empty 44a914e3c2b7
test: empty 456e7f6da624
test: empty 46ce7496d21a
test: empty 47783043151f
test: empty 48ac6cce970a
test: empty 49f0ccf83cfb
test: empty 50f5109a4696
test: empty 51058c074783
test: empty 52d75aa9ec52
test: empty 5329ef8b06e5
test: empty 54bfd8496510
test: empty 55635534ef66
test: empty 5653ed8cf902
test: empty 57a928a19ffb
test: empty 588ef261c0f5
test: empty 59bbeabee2ca
test: empty 60d5aa917f0f
test: empty 61f02b43f41b
test: empty 62af88be56f4
test: empty 631ca0c07371
test: empty 649ffa9913b4
test: empty 65fdf87e19cf
test: empty 66dfc62daf59
test: empty 672b16d8eff2
test: empty 68461eebb18b
test: empty 6978816e82dd
test: empty 7058c874630e
test: empty 7154909be604
test: empty 72ffcea5c88e
test: empty 73bcf82497a5
test: empty 741a585b2188
test: empty 7584c871d9c9
test: empty 7649c21c0e53
test: empty 7715475289a7
test: empty 78a159ccc1c6
test: empty 79b7bc02a01e
test: empty 80a8dfb8f7af
test: empty 812075d12803
test: empty 8267fbef99c1
test: empty 83f054b6b8f4
test: empty 845e9401e19b
test: empty 857abd67dfaf
test: empty 865ed255e6c4
test: empty 879a6a890de8
test: empty 880f79cbd1ab
test: empty 89d074d4679f
test: empty 90dc7f17153e
test: empty 91a8647e740e
test: empty 92457bb412a8
test: empty 932abed3e70d
test: empty 9441c6107be8
test: empty 95f46df1cff2
test: empty 96630bb6e5b5
test: empty 97b09dc7be4f
test: empty 988bba2e588b
test: empty 99c978fc4ea6
test: empty 100c621fc9fdb
test: empty 10134f527b3a3
test: empty 1027d92d8c2f4
test: empty 1039e3b1f2862
test: empty 1044f1cf26233
test: empty 105f6e2c92938
test: empty 1063064b46593
test: empty 10785be54124e
test: empty 1088842a62c9b
test: empty 10986ab6577c6
test: empty 1101d409ab69d
test: empty 1111264bfad1d
test: empty 112fa1ea82306
test: empty 113a0a9ee3103
test: empty 114f21615d5da
test: empty 11590a70d36ad
test: empty 1162ebd9cb232
test: empty 11722eea3e2bb
test: empty 11855ab806b94
test: empty 11995ea044fd7
test: empty 12003fe69a9ab
test: empty 1214f3a182fcd
test: empty 1229ab3a64a93
test: empty 1237bf74e6764
rkrux force-pushed on Apr 2, 2026
DrahtBot added the label CI failed on Apr 2, 2026
rkrux
commented at 9:10 AM on April 7, 2026:
contributor
I am not satisfied with the solution of adding some delay between each affected RPC call, so I'm closing this PR.
This is a metadata mirror of the GitHub repository
bitcoin/bitcoin.
This site is not affiliated with GitHub.
Content is generated from a GitHub metadata backup.
generated: 2026-04-22 09:12 UTC
This site is hosted by @0xB10C More mirrored repositories can be found on mirror.b10c.me