test: throttle large orphan transactions while being sent in RPCs #34847

rkrux commented at 8:56 AM on March 18, 2026: contributor

Addresses #34731

Reasons why I don't believe it is a deadlock issue based on the debug logs and debugger thread backtrace output shared in the issue:

The HTTP worker threads (b-http_pool_x) are waiting on the condition variable and not on the mutex that signals that these threads are idle & waiting for work to be assigned to them.
The HTTP thread (b-http) is epoll waiting that means it is waiting for a request (or a part of it) to be received.
The added logs show that the first few testmempool RPCs were successful and the next one timed out. But the logs don't show a request for it being logged unlike in the previous ones, hinting that the server never received such a request (or in full) and thus never processed it. Even then the functional test client timed out, which means that it did send it (at least a part of it).
The large orphan transactions being sent are each 780KB in size that are sent sequentially by the test. It tries to send 60 of them in a loop amounting to 46MB of data over a single HTTP connection that is reused.

More details are shared in the first commit message.

This PR throttles the RPCs on client side. I've not been able to reproduce this intermittent issue and thus I don't gurantee that this fixes the issue altogether.

Note: A previous approach in this PR tried to not reuse the HTTP connection for the RPCs in this test instead. But I noticed a CI run where this affected test took around 75mins to complete that led me to move to this approach where the HTTP connection is reused like before but with some throttling.

DrahtBot added the label Tests on Mar 18, 2026

DrahtBot commented at 8:57 AM on March 18, 2026: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process. A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#34943 (ci: add delay between commits while testing all ancestor commits by rkrux)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible places where named args for integral literals may be used (e.g. func(x, /*named_arg=*/0) in C++, and func(x, named_arg=0) in Python):

evbuffer_peek(buf, -1, nullptr, v, 8) in src/httpserver.cpp

rkrux commented at 10:31 AM on March 18, 2026: contributor

Is this job CI / macOS native (pull_request) stuck? It was started 1.5hrs back.

DrahtBot added the label CI failed on Mar 18, 2026

rkrux commented at 11:02 AM on March 18, 2026: contributor

Can the CI / macOS native (pull_request) job be re-run? It was cancelled after being stuck for 120 mins.

maflcko commented at 11:08 AM on March 18, 2026: member

I don't think the issue happens in macOS, but only in the task that tests ancestor commits, but that task is skipped for pull requests with one commit. Also, reproducing requires tens of runs/commits.

hebasto commented at 11:08 AM on March 18, 2026: member

Can the CI / macOS native (pull_request) job be re-run? It was cancelled after being stuck for 120 mins.

Done.

rkrux commented at 11:13 AM on March 18, 2026: contributor

I don't think the issue happens in macOS, but only in the task that tests ancestor commits,

Oh interesting, I do have a second commit that I will push after this recently started job ends/succeeds.

Also, reproducing requires tens of runs/commits.

I'm not trying to reproduce the issue. Just ensuring that the tests don't fail for some other reason on this commit. Will undraft the PR once the CI is green (or not yellow at least).

DrahtBot removed the label CI failed on Mar 18, 2026

rkrux force-pushed on Mar 18, 2026

rkrux renamed this:
~~test: work in progress commit~~
test: conditionally throttle large testmempoolaccept rpcs in p2p_orphan_handling test
on Mar 18, 2026

rkrux commented at 2:42 PM on March 18, 2026: contributor

A previous instance of the CI run where the test took 4536s (~75min) to complete when the HTTP connection was not reused and a fresh one was created for every RPC in the test: ASan + LSan + UBSan + integer

rkrux marked this as ready for review on Mar 18, 2026

rkrux force-pushed on Mar 19, 2026

in test/functional/p2p_orphan_handling.py:633 in b309bdb2c4

 629 | @@ -630,7 +630,8 @@ def test_maximal_package_protected(self):
 630 |  
 631 |          # Check to make sure these are orphans, within max standard size (to be accepted into the orphanage)
 632 |          for large_orphan in large_orphans:
 633 | -            testres = node.testmempoolaccept([large_orphan.serialize().hex()])
 634 | +            # throttle these 780KB large requests if the RPC latency is greater than 1s

maflcko commented at 8:52 AM on March 19, 2026:

I don't think the logs show such a delay? The tests are run on a fast gaming CPU, and the delay between calls is basically zero, never 1s.

Of course, for the failure the delay is inf (times out), but then the test already has failed and delaying the next one doesn't help?

What am I missing? Can you refer to the logs with excerpts to explain your thinking?

rkrux commented at 9:08 AM on March 19, 2026:

Thanks for raising this, I don't think you are missing anything.

  node0 2026-03-13T16:52:33.591304Z (mocktime: 2026-03-13T16:52:33Z) [http] [httpserver.cpp:191] [http_request_cb] [http] http_request_cb starting 
  node0 2026-03-13T16:52:33.591319Z (mocktime: 2026-03-13T16:52:33Z) [http] [httpserver.cpp:196] [http_request_cb] [http] http_request tracked 
  node0 2026-03-13T16:52:33.591328Z (mocktime: 2026-03-13T16:52:33Z) [http] [httpserver.cpp:237] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:50884 
  node0 2026-03-13T16:52:33.591349Z (mocktime: 2026-03-13T16:52:33Z) [http_pool_1] [httpserver.cpp:267] [operator()] [http] http_worker processing request for '/' 
  node0 2026-03-13T16:52:33.591354Z (mocktime: 2026-03-13T16:52:33Z) [http_pool_1] [httprpc.cpp:106] [HTTPReq_JSONRPC] [rpc] HTTPReq_JSONRPC starting for / 
  node0 2026-03-13T16:52:33.591953Z (mocktime: 2026-03-13T16:52:33Z) [http_pool_1] [httprpc.cpp:142] [HTTPReq_JSONRPC] [rpc] HTTPReq_JSONRPC body read, size=780258 
  node0 2026-03-13T16:52:33.594496Z (mocktime: 2026-03-13T16:52:33Z) [http_pool_1] [rpc/request.cpp:243] [parse] [rpc] ThreadRPCServer method=testmempoolaccept user=__cookie__ 
  node0 2026-03-13T16:52:33.594503Z (mocktime: 2026-03-13T16:52:33Z) [http_pool_1] [rpc/request.cpp:244] [parse] [rpc] ThreadRPCServer method=testmempoolaccept finished parse 
  node0 2026-03-13T16:52:33.594544Z (mocktime: 2026-03-13T16:52:33Z) [http_pool_1] [rpc/server.cpp:348] [JSONRPCExec] [rpc] ThreadRPCServer method=testmempoolaccept starting execution 
  node0 2026-03-13T16:52:33.594572Z (mocktime: 2026-03-13T16:52:33Z) [http_pool_1] [rpc/mempool.cpp:320] [operator()] DEBUG: testmempoolaccept - starting 
  node0 2026-03-13T16:52:33.597400Z (mocktime: 2026-03-13T16:52:33Z) [http_pool_1] [txmempool.cpp:441] [check] [mempool] Checking mempool with 0 transactions and 0 inputs 
  node0 2026-03-13T16:52:33.597412Z (mocktime: 2026-03-13T16:52:33Z) [http_pool_1] [rpc/mempool.cpp:409] [operator()] DEBUG: testmempoolaccept - finished 
  node0 2026-03-13T16:52:33.597425Z (mocktime: 2026-03-13T16:52:33Z) [http_pool_1] [rpc/server.cpp:364] [JSONRPCExec] [rpc] ThreadRPCServer method=testmempoolaccept finished execution 
  node0 2026-03-13T16:52:33.597434Z (mocktime: 2026-03-13T16:52:33Z) [http_pool_1] [httpserver.cpp:593] [WriteReply] [http] HTTPRequest::WriteReply status=200 size=238 
  node0 2026-03-13T16:52:33.597454Z (mocktime: 2026-03-13T16:52:33Z) [http] [httpserver.cpp:603] [operator()] [http] HTTPRequest::WriteReply callback starting status=200 
  node0 2026-03-13T16:52:33.597466Z (mocktime: 2026-03-13T16:52:33Z) [http] [httpserver.cpp:616] [operator()] [http] HTTPRequest::WriteReply callback finished 
  node0 2026-03-13T16:52:33.597486Z (mocktime: 2026-03-13T16:52:33Z) [http] [httpserver.cpp:198] [operator()] [http] http_request completed

Above is a log excerpt from the last successful request. There is no such delay in these logs (as per the first timestamps), although these are only server side logs and not from the client (test) side. I think I will just revert to unconditional timeout instead of doing it conditionally for which I don't have any basis.

The tests are run on a fast gaming CPU,

Nice, I didn't know this but do we know how much load it is under (or atleast when the intermittent issue occured)?

rkrux commented at 9:18 AM on March 19, 2026:

From #34847 (comment)

A previous instance of the CI run where the test took 4536s (~75min) to complete when the HTTP connection was not reused and a fresh one was created for every RPC in the test: ASan + LSan + UBSan + integer

I like this approach for this test but this one occurence discouraged me. I do feel that it hints at the CI instance(s) being under load intermittently for which an unconditional timeout can be a remedy.

maflcko commented at 12:01 PM on March 19, 2026:

Nice, I didn't know this but do we know how much load it is under (or atleast when the intermittent issue occured)?

I wouldn't expect a high load to be the issue here. This is an optimized build without any sanitizers, running on a high-end CPU. Seeing a spurious 30 seconds timeout for an RPC that would otherwise take milliseconds seems off.

In fact, it may be a race, that is only visible because the CPU is so fast.

rkrux commented at 9:55 AM on March 27, 2026:

I've reworked the PR to throttle at the large orphan level if it's being sent over the network to the RPC. This handles both the p2p tests (p2p_orphan_handling, p2p_opportunistic_1p1c) that timed out intermittently.

CyberNFT commented at 9:18 AM on March 19, 2026: none

👍🏻

rkrux force-pushed on Mar 19, 2026

rkrux renamed this:
~~test: conditionally throttle large testmempoolaccept rpcs in p2p_orphan_handling test~~
test: throttle large testmempoolaccept rpcs in p2p_orphan_handling test
on Mar 19, 2026

rkrux force-pushed on Mar 19, 2026

DrahtBot added the label CI failed on Mar 19, 2026

DrahtBot removed the label CI failed on Mar 19, 2026

rkrux force-pushed on Mar 27, 2026

rkrux marked this as a draft on Mar 27, 2026

rkrux force-pushed on Mar 27, 2026

DrahtBot added the label CI failed on Mar 27, 2026

rkrux renamed this:
~~test: throttle large testmempoolaccept rpcs in p2p_orphan_handling test~~
test: throttle large orphan transactions while being sent in RPCs
on Mar 27, 2026

rkrux marked this as ready for review on Mar 27, 2026

DrahtBot removed the label CI failed on Mar 27, 2026

maflcko commented at 10:40 AM on March 27, 2026: member

Not sure this is the correct fix. We are not sending 1MB from somewhere outside the solar system to the earth. This is sending 780KB on a local socket from one process to another. Why should this take 30 seconds? Normally, the whole test passes in less time than that, and then suddenly a single RPC times out?

Also, you haven't even tested if this fix is working. 12 runs/commits is not enough. It can happen after the 40th or 60th run. You'll have to add 125 empty commits or so.

I don't mind a temporary workaround, but at least it should be tested, and it should be explained that this is just a temporary workaround for a real underlying bug.

Otherwise, are we going to update the docs to say: "If you call an RPC with a large payload, you have to manually sleep after each call"?

rkrux commented at 11:00 AM on March 27, 2026: contributor

This is sending 780KB on a local socket from one process to another. Why should this take 30 seconds?

It shouldn't take 30 seconds. Since it's all on local, I don't think this is a network latency issue, but more of an issue with the server drain rate of its TCP buffer not as quick as the client send rate. That's why I believe the zero TCP window issue occurs.

it should be explained that this is just a temporary workaround

The PR description does hint at it in the end but I can make this explicit.

Not sure this is the correct fix. It can happen after the 40th or 60th run. You'll have to add 125 empty commits or so.

I can test with more commits and put it in draft until then.

for a real underlying bug.

This presence of this issue in only one CI job is what I find confusing (and interesting) the most.

"If you call an RPC with a large payload, you have to manually sleep after each call"

This shouldn't be required because this issue doesn't happen all the time and is intermittent in a specific CI job, which can even put into question the setup of that CI job.

rkrux marked this as a draft on Mar 27, 2026

in test/functional/p2p_orphan_handling.py:632 in 56898d5d5a outdated

 628 | @@ -630,7 +629,7 @@ def test_maximal_package_protected(self):
 629 |  
 630 |          # Check to make sure these are orphans, within max standard size (to be accepted into the orphanage)
 631 |          for large_orphan in large_orphans:
 632 | -            testres = node.testmempoolaccept([large_orphan.serialize().hex()])
 633 | +            testres = node.testmempoolaccept([large_orphan.to_send.serialize().hex()])

maflcko commented at 11:32 AM on March 27, 2026:

Instead of sleeping 300ms, it would be a smaller temporary workaround to just quickly spin up a new tcp connection. You can do this either:

by calling .cli() (spawns a bitcoin-cli process) in a trivial one-line patch
or cherry-pick fa8fc5a23752c2a590b95f62833cf013a3d6febc, which was meant for different threads, but using the new authproxy for a single rpc call should also be fine and work around the issue for now.

If you want to keep the unconditional sleep, my preference would be to inline it here again, like it was in the beginning of this pull?

rkrux commented at 3:16 PM on March 27, 2026:

it would be a smaller temporary workaround to just quickly spin up a new tcp connection.

A new connection for every iteration of testmempoolaccept?

If you want to keep the unconditional sleep, my preference would be to inline it here again, like it was in the beginning of this pull?

I preferred that too and then noticed the same failure in the p2p_opportunistic_1p1c test. So thought maybe highlight it in the code to add a sleep whenever a LargeOrphan is sent over the wire by putting the sleep in the class itself (though the sleep is effective only when many such large orphans are sent in a burst). Otherwise it seemed easy that a new call site might miss adding the sleep.

in test/functional/p2p_opportunistic_1p1c.py:444 in 56898d5d5a outdated

 443 | -            assert_greater_than_or_equal(3 * large_orphan.get_vsize(), 2 * 100000)
 444 | -            testres = node.testmempoolaccept([large_orphan.serialize().hex()])
 445 | +            assert_greater_than_or_equal(100000, large_orphan.get.get_vsize())
 446 | +            assert_greater_than(MAX_STANDARD_TX_WEIGHT, large_orphan.get.get_weight())
 447 | +            assert_greater_than_or_equal(3 * large_orphan.get.get_vsize(), 2 * 100000)
 448 | +            testres = node.testmempoolaccept([large_orphan.to_send.serialize().hex()])

maflcko commented at 2:59 PM on March 27, 2026:

I don't think the bug happened in this test, so no need to change it?

rkrux commented at 3:06 PM on March 27, 2026:

It did in PR #34927 here.

maflcko approved

maflcko commented at 3:01 PM on March 27, 2026: member

lgtm (assuming ci passes)

Looks like it is on track of passing ...

So I guess adding a sleep to workaround a timeout bug is another data point that shows there is an underlying racy bug, which is only triggered by weird timing. (And can be avoided by adding weird timing/sleeps)

rkrux commented at 3:07 PM on March 27, 2026: contributor

Looks like it is on track of passing ...

Yeah, no failure yet. But I sense that the 360 minutes threshold of the job will be hit before all the ancestor commits are tested, so this job might get cancelled in the end. :(

Edit: It had passed in 5hr 30mins here - https://github.com/bitcoin/bitcoin/actions/runs/23643332720/job/68869238262?pr=34847

rkrux force-pushed on Mar 28, 2026

DrahtBot added the label CI failed on Mar 28, 2026

rkrux force-pushed on Mar 28, 2026

rkrux commented at 10:54 AM on March 28, 2026: contributor

With a 20ms sleep, the CI job failed with the timeout in 47th commit: https://github.com/bitcoin/bitcoin/actions/runs/23681547345/job/68994196366?pr=34847

Passed with 300ms sleep earlier. Trying with 50ms.

DrahtBot removed the label CI failed on Mar 28, 2026

rkrux commented at 4:12 AM on March 29, 2026: contributor

Successful with 50ms sleep: https://github.com/bitcoin/bitcoin/actions/runs/23683685089/job/68999661705?pr=34847

Job finished 13mins faster than the one with 300ms sleep, saving more than 6s per commit.

rkrux commented at 11:00 AM on March 29, 2026: contributor

Opened PR #34943 for review that I prefer over this solution.

rkrux force-pushed on Mar 30, 2026

DrahtBot added the label CI failed on Mar 30, 2026

rkrux force-pushed on Mar 30, 2026

DrahtBot removed the label CI failed on Mar 30, 2026

rkrux force-pushed on Mar 31, 2026

DrahtBot added the label CI failed on Mar 31, 2026

rkrux force-pushed on Mar 31, 2026

DrahtBot removed the label CI failed on Mar 31, 2026

rkrux force-pushed on Apr 1, 2026

DrahtBot added the label CI failed on Apr 1, 2026

DrahtBot removed the label CI failed on Apr 1, 2026

rpc: wip 0a5d744032

ci: add system network level debugging between testing each commit b6d90081f7

test: throttle large orphan transactions while being sent in RPCs

Each of these large orphan transaction is around 780KB large that are
sent sequentially without waiting in the `testmempoolaccept` RPC. For the
`p2p_orphan_handling` and `p2p_opportunistic_1p1c` tests that send these
large orphan transactions sequentially 50-60 times, it has been observed in
the CI via the tcpdump outputs (refer issue 34731 thread) that
the HTTP server is showing zero TCP window `win 0` intermittently that leads to
such requests never being read fully, so the server never processes them and
thus never sends a response. The test client rightfully times out after 30 seconds.

Interestingly, this intermittent issue has been observed only in the "test ancestor
commits" CI job that recently started testing all the commits in the PR, which is more
robust, as opposed to testing only the last 6 commits like it used to do earlier. For each
commit in the PR, this job runs 16 tests in parallel where the CPU nproc is 8. These
two are the only tests that send such large orphans to the server in the same instance
50-60 times amounting to 45MB being sent in a burst. I've noticed this issue in this
job never in the first commit being tested but instead only in the subsequent ones.

This commit creates a LargeOrphanTransaction class that provides two properties
- one to get the large orphan transaction for the internal operations and the
other to send this transaction over the network, which by default adds a 50ms
sleep before returning. This is to ensure that the test client doesn't bombard
the server with such large transactions without providing it with a cool down
period for the TCP window to clear.

One of the ealier CI runs I tested this change on had 125 commits with a 300ms
delay between each such RPC where this issue didn't occur:
https://github.com/bitcoin/bitcoin/actions/runs/23643332720/job/68869238262?pr=34847
All the `p2p_orphan_handling` tests finished in under 60sec each and all the
`p2p_opportunistic_1p1c` tests finished in undr 70sec each. Adding the timeout
does increase the overall latency of these two tests but might help in avoiding
the intermittent timeouts.

b8e79e3192

test: empty commit 1 1e472bf31d

test: empty commit 2 9af315df11

test: empty commit 3 44b251a097

test: empty commit 4 87c32ae9a7

test: empty commit 5 76e152c3a8

test: empty commit 6 eef2e823bb

test: empty commit 7 95596b7cf8

test: empty commit 8 961c4ff75a

test: empty commit 9 3b5565ad8a

test: empty commit 10 2eb0893fc6

test: empty commit 11 00af81cdc6

test: empty commit 12 bfcbb0b17b

test: empty commit 13 da4a68ef4c

test: empty commit 14 1ea9df7c25

test: empty commit 15 006974dba8

test: empty commit 16 24b999a852

test: empty commit 17 b509d5261b

test: empty commit 18 5fd1b3ed46

test: empty commit 19 b37fd30914

test: empty 20 2533c7d4a4

test: empty 21 21436524eb

test: empty 22 335f671f98

test: empty 23 fd50711dc7

test: empty 24 126bddf0cd

test: empty 25 7d924cd96a

test: empty 26 6c003dc4cb

test: empty 27 44bcd04782

test: empty 28 ffe2b2c3a3

test: empty 29 5fda3f20a5

test: empty 30 356fcb317c

test: empty 31 763655b16a

test: empty 32 29bd9c6761

test: empty 33 7cf2961707

test: empty 34 491dc8dd33

test: empty 35 f540f8707a

test: empty 36 0c3487025b

test: empty 37 3f75344119

test: empty 38 79cb72df11

test: empty 39 6389535a13

test: empty 40 901f2f64d7

test: empty 41 8b622d2252

test: empty 42 2e4705706c

test: empty 43 40b21eb561

test: empty 44 a914e3c2b7

test: empty 45 6e7f6da624

test: empty 46 ce7496d21a

test: empty 47 783043151f

test: empty 48 ac6cce970a

test: empty 49 f0ccf83cfb

test: empty 50 f5109a4696

test: empty 51 058c074783

test: empty 52 d75aa9ec52

test: empty 53 29ef8b06e5

test: empty 54 bfd8496510

test: empty 55 635534ef66

test: empty 56 53ed8cf902

test: empty 57 a928a19ffb

test: empty 58 8ef261c0f5

test: empty 59 bbeabee2ca

test: empty 60 d5aa917f0f

test: empty 61 f02b43f41b

test: empty 62 af88be56f4

test: empty 63 1ca0c07371

test: empty 64 9ffa9913b4

test: empty 65 fdf87e19cf

test: empty 66 dfc62daf59

test: empty 67 2b16d8eff2

test: empty 68 461eebb18b

test: empty 69 78816e82dd

test: empty 70 58c874630e

test: empty 71 54909be604

test: empty 72 ffcea5c88e

test: empty 73 bcf82497a5

test: empty 74 1a585b2188

test: empty 75 84c871d9c9

test: empty 76 49c21c0e53

test: empty 77 15475289a7

test: empty 78 a159ccc1c6

test: empty 79 b7bc02a01e

test: empty 80 a8dfb8f7af

test: empty 81 2075d12803

test: empty 82 67fbef99c1

test: empty 83 f054b6b8f4

test: empty 84 5e9401e19b

test: empty 85 7abd67dfaf

test: empty 86 5ed255e6c4

test: empty 87 9a6a890de8

test: empty 88 0f79cbd1ab

test: empty 89 d074d4679f

test: empty 90 dc7f17153e

test: empty 91 a8647e740e

test: empty 92 457bb412a8

test: empty 93 2abed3e70d

test: empty 94 41c6107be8

test: empty 95 f46df1cff2

test: empty 96 630bb6e5b5

test: empty 97 b09dc7be4f

test: empty 98 8bba2e588b

test: empty 99 c978fc4ea6

test: empty 100 c621fc9fdb

test: empty 101 34f527b3a3

test: empty 102 7d92d8c2f4

test: empty 103 9e3b1f2862

test: empty 104 4f1cf26233

test: empty 105 f6e2c92938

test: empty 106 3064b46593

test: empty 107 85be54124e

test: empty 108 8842a62c9b

test: empty 109 86ab6577c6

test: empty 110 1d409ab69d

test: empty 111 1264bfad1d

test: empty 112 fa1ea82306

test: empty 113 a0a9ee3103

test: empty 114 f21615d5da

test: empty 115 90a70d36ad

test: empty 116 2ebd9cb232

test: empty 117 22eea3e2bb

test: empty 118 55ab806b94

test: empty 119 95ea044fd7

test: empty 120 03fe69a9ab

test: empty 121 4f3a182fcd

test: empty 122 9ab3a64a93

test: empty 123 7bf74e6764

rkrux force-pushed on Apr 2, 2026

DrahtBot added the label CI failed on Apr 2, 2026

rkrux commented at 9:10 AM on April 7, 2026: contributor

I am not satisfied with the solution of adding some delay between each such affected RPC call, I'm closing this PR.

rkrux closed this on Apr 7, 2026