Addresses #34731
It was earlier suspected that there is an intermittent deadlock in the server but following are the reasons why I didn't believe it is a deadlock issue based on the debug logs and debugger thread backtrace output shared in the issue:
- The HTTP worker threads (b-http_pool_x) are waiting on the condition variable and not on the mutex that signals that these threads are idle & waiting for work to be assigned to them.
- The HTTP thread (b-http) is epoll waiting that means it is waiting for a request (or a part of it) to be received.
- The added logs show that the first few testmempool RPCs were successful and the next one timed out. But the logs don't show a request for it being logged unlike in the previous ones, hinting that the server never received such a request (or in full) and thus never processed it. Even then the functional test client timed out, which means that it did send it (at least a part of it).
The large orphan transactions being sent via p2p_orphan_handling.py and
p2p_opportunistic_1p1c.py tests are each 780KB in size that are sent
sequentially. The test tries to send 60 of them in a loop amounting to 46MB
of data over a single HTTP connection that is reused. Later, tcpdump was collected
via this PR #34726 (credit maflcko) that showed the HTTP server signalled TCP zero
window win 0 intermittently that led to such large requests never being read fully,
so the server never processed them and thus never sent a response - the test client
rightfully timed out after 30 seconds.
Interestingly, this intermittent issue has been observed only in the "test ancestor commits" CI job that recently started testing all the commits in the PR, which is more robust, as opposed to testing only the last 6 commits like it used to do earlier. For each commit in the PR, this job runs 16 tests in parallel where the CPU nproc is 8. These two are the only tests that send such large orphans to the server in the same instance 50-60 times amounting to 45MB being sent in a burst. I've noticed this issue in this job never in the first commit being tested but instead only in the subsequent ones.
Adding a 6s delay before testing each commit in the "test ancestor commits job" seems to get rid of the issue, the passing CI job is here that tests 130 commits.
Note: The previous approach I tried in PR #34847 added a 50ms delay before sending such large orphan transaction requests but I prefer this solution over it because this change is restricted in this CI job only that fails.
<!-- *** Please remove the following help text before submitting: *** Pull requests without a rationale and clear improvement may be closed immediately. GUI-related pull requests should be opened against https://github.com/bitcoin-core/gui first. See CONTRIBUTING.md -->
<!-- Please provide clear motivation for your patch and explain how it improves Bitcoin Core user experience or Bitcoin Core developer experience significantly: * Any test improvements or new tests that improve coverage are always welcome. * All other changes should have accompanying unit tests (see `src/test/`) or functional tests (see `test/`). Contributors should note which tests cover modified code. If no tests exist for a region of modified code, new tests should accompany the change. * Bug fixes are most welcome when they come with steps to reproduce or an explanation of the potential issue as well as reasoning for the way the bug was fixed. * Features are welcome, but might be rejected due to design or scope issues. If a feature is based on a lot of dependencies, contributors should first consider building the system outside of Bitcoin Core, if possible. * Refactoring changes are only accepted if they are required for a feature or bug fix or otherwise improve developer experience significantly. For example, most "code style" refactoring changes require a thorough explanation why they are useful, what downsides they have and why they *significantly* improve developer experience or avoid serious programming bugs. Note that code style is often a subjective matter. Unless they are explicitly mentioned to be preferred in the [developer notes](/doc/developer-notes.md), stylistic code changes are usually rejected. -->
<!-- Bitcoin Core has a thorough review process and even the most trivial change needs to pass a lot of eyes and requires non-zero or even substantial time effort to review. There is a huge lack of active reviewers on the project, so patches often sit for a long time. -->