Addresses #34731
It was earlier suspected that there is an intermittent deadlock in the server but following are the reasons why I didn’t believe it is a deadlock issue based on the debug logs and debugger thread backtrace output shared in the issue:
- The HTTP worker threads (b-http_pool_x) are waiting on the condition variable and not on the mutex that signals that these threads are idle & waiting for work to be assigned to them.
- The HTTP thread (b-http) is epoll waiting that means it is waiting for a request (or a part of it) to be received.
- The added logs show that the first few testmempool RPCs were successful and the next one timed out. But the logs don’t show a request for it being logged unlike in the previous ones, hinting that the server never received such a request (or in full) and thus never processed it. Even then the functional test client timed out, which means that it did send it (at least a part of it).
The large orphan transactions being sent via p2p_orphan_handling.py and
p2p_opportunistic_1p1c.py tests are each 780KB in size that are sent
sequentially. The test tries to send 60 of them in a loop amounting to 46MB
of data over a single HTTP connection that is reused. Later, tcpdump was collected
via this PR #34726 (credit maflcko) that showed the HTTP server signalled TCP zero
window win 0 intermittently that led to such large requests never being read fully,
so the server never processed them and thus never sent a response - the test client
rightfully timed out after 30 seconds.
Interestingly, this intermittent issue has been observed only in the “test ancestor commits” CI job that recently started testing all the commits in the PR, which is more robust, as opposed to testing only the last 6 commits like it used to do earlier. For each commit in the PR, this job runs 16 tests in parallel where the CPU nproc is 8. These two are the only tests that send such large orphans to the server in the same instance 50-60 times amounting to 45MB being sent in a burst. I’ve noticed this issue in this job never in the first commit being tested but instead only in the subsequent ones.
Adding a 3s delay before testing each commit in the “test ancestor commits job” seems to get rid of the issue, the passing CI job is here that tests 125 commits.
Note: The previous approach I tried in PR #34847 added a 50ms delay before sending such large orphan transaction requests but I prefer this solution over it because this change is restricted in this CI job only that fails.