qa: timeout in StopHTTPServer() #31894

issue hodlinator openend this issue on February 18, 2025
  1. hodlinator commented at 10:49 am on February 18, 2025: contributor

    Ran into a CI timeout in feature_assumeutxo.py on Windows: https://github.com/hodlinator/bitcoin/actions/runs/13371466603/job/37340702068#step:12:123

    In the combined log we see the following, “Restarting node to stop at height” is referring to node 1):

     0test  ...T14:10:53.420000Z TestFramework (INFO): Restarting node to stop at height 359
     1test  ...T14:10:53.420000Z TestFramework.node1 (DEBUG): Stopping node
     2node1 ...T14:10:53.420155Z (mocktime...) [http] [D:\a\bitcoin\bitcoin\src\httpserver.cpp:306] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:56740
     3node1 ...T14:10:53.420240Z (mocktime...) [httpworker.8] [D:\a\bitcoin\bitcoin\src\rpc\request.cpp:241] [parse] [rpc] ThreadRPCServer method=gettxout user=__cookie__
     4node1 ...T14:10:53.421024Z (mocktime...) [http] [D:\a\bitcoin\bitcoin\src\httpserver.cpp:306] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:56740
     5node1 ...T14:10:53.421114Z (mocktime...) [httpworker.11] [D:\a\bitcoin\bitcoin\src\rpc\request.cpp:241] [parse] [rpc] ThreadRPCServer method=stop user=__cookie__
     6node1 ...T14:10:53.421203Z (mocktime...) [init] [D:\a\bitcoin\bitcoin\src\httpserver.cpp:508] [InterruptHTTPServer] [http] Interrupting HTTP server
     7node1 ...T14:10:53.421251Z (mocktime...) [init] [D:\a\bitcoin\bitcoin\src\httprpc.cpp:382] [InterruptHTTPRPC] [rpc] Interrupting HTTP RPC server
     8node1 ...T14:10:53.421280Z (mocktime...) [init] [D:\a\bitcoin\bitcoin\src\rpc\server.cpp:291] [operator ()] [rpc] Interrupting RPC
     9node1 ...T14:10:53.421320Z (mocktime...) [init] [D:\a\bitcoin\bitcoin\src\init.cpp:287] [Shutdown] Shutdown: In progress...
    10node1 ...T14:10:53.421345Z (mocktime...) [shutoff] [D:\a\bitcoin\bitcoin\src\httprpc.cpp:387] [StopHTTPRPC] [rpc] Stopping HTTP RPC server
    11node1 ...T14:10:53.422054Z (mocktime...) [addcon] [D:\a\bitcoin\bitcoin\src\util\thread.cpp:22] [TraceThread] addcon thread exit 
    12node1 ...T14:10:53.422540Z (mocktime...) [shutoff] [D:\a\bitcoin\bitcoin\src\httpserver.cpp:770] [UnregisterHTTPHandler] [http] Unregistering HTTP handler for / (exactmatch 1) 
    13node1 ...T14:10:53.422641Z (mocktime...) [shutoff] [D:\a\bitcoin\bitcoin\src\httpserver.cpp:770] [UnregisterHTTPHandler] [http] Unregistering HTTP handler for /wallet/ (exactmatch 0) 
    14node1 ...T14:10:53.422674Z (mocktime...) [shutoff] [D:\a\bitcoin\bitcoin\src\rpc\server.cpp:303] [operator ()] [rpc] Stopping RPC
    15node1 ...T14:10:53.422847Z (mocktime...) [shutoff] [D:\a\bitcoin\bitcoin\src\rpc\server.cpp:306] [operator ()] [rpc] RPC stopped.
    16node1 ...T14:10:53.422879Z (mocktime...) [shutoff] [D:\a\bitcoin\bitcoin\src\httpserver.cpp:520] [StopHTTPServer] [http] Stopping HTTP server
    17node1 ...T14:10:53.422945Z (mocktime...) [shutoff] [D:\a\bitcoin\bitcoin\src\httpserver.cpp:522] [StopHTTPServer] [http] Waiting for HTTP worker threads to exit
    18node1 ...T14:10:53.423118Z (mocktime...) [shutoff] [D:\a\bitcoin\bitcoin\src\httpserver.cpp:536] [StopHTTPServer] [http] Waiting for 1 connections to stop HTTP server
    19node1 ...T14:10:53.423164Z (mocktime...) [http] [D:\a\bitcoin\bitcoin\src\httpserver.cpp:355] [ThreadHTTP] [http] Exited http event loop
    20node1 ...T14:10:53.440186Z (mocktime...) [net] [D:\a\bitcoin\bitcoin\src\util\thread.cpp:22] [TraceThread] net thread exit
    21node1 ...T14:10:53.652881Z (mocktime...) [msghand] [D:\a\bitcoin\bitcoin\src\util\thread.cpp:22] [TraceThread] msghand thread exit
    22...
    23node1 ...T14:11:32.925206Z (mocktime...) [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2428] [StartExtraBlockRelayPeers] [net] enabling extra block-relay-only peers
    24...
    25node1 ...T14:25:47.935847Z (mocktime...) [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2391] [DumpAddresses] [net] Flushed 0 addresses to peers.dat  1ms
    26...
    27node1 ...T14:40:47.947785Z (mocktime...) [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2391] [DumpAddresses] [net] Flushed 0 addresses to peers.dat  1ms
    28...
    29test  ...T14:50:53.446000Z TestFramework (ERROR): JSONRPC error
    30
    31    Traceback (most recent call last):
    32      File "D:\a\bitcoin\bitcoin\build\test\functional\test_framework\authproxy.py", line 164, in _get_response
    33        http_response = self.__conn.getresponse()
    34...
    35      File "C:\hostedtoolcache\windows\Python\3.13.2\x64\Lib\socket.py", line 719, in readinto
    36        return self._sock.recv_into(b)
    37               ~~~~~~~~~~~~~~~~~~~~^^^
    38    TimeoutError: timed out
    39    During handling of the above exception, another exception occurred:
    40    Traceback (most recent call last):
    41      File "D:\a\bitcoin\bitcoin\build\test\functional\test_framework\test_framework.py", line 135, in main
    42        self.run_test()
    43        ~~~~~~~~~~~~~^^
    44      File "D:\a\bitcoin\bitcoin\build\test\functional\feature_assumeutxo.py", line 587, in run_test
    45        self.restart_node(1, extra_args=[
    46        ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
    47            f"-stopatheight={PAUSE_HEIGHT}", *self.extra_args[1]])
    48            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    49...
    50      File "D:\a\bitcoin\bitcoin\build\test\functional\test_framework\test_node.py", line 395, in stop_node
    51        self.stop(wait=wait)
    52        ~~~~~~~~~^^^^^^^^^^^
    53...
    54      File "D:\a\bitcoin\bitcoin\build\test\functional\test_framework\authproxy.py", line 106, in _request
    55        return self._get_response()
    56               ~~~~~~~~~~~~~~~~~~^^
    57      File "D:\a\bitcoin\bitcoin\build\test\functional\test_framework\authproxy.py", line 166, in _get_response
    58        raise JSONRPCException({
    59...
    60    test_framework.authproxy.JSONRPCException: 'stop' RPC took longer than 2400.000000 seconds. Consider using larger timeout for calls that take longer to return. (-344)
    

    Note how it takes:

    • ~40 seconds to go from receiving stop-RPC to “enabling extra block-relay-only peers”
    • Additional 14 minutes to reach “Flushed 0 addresses to peers.dat”
    • Additional 15 minutes for the second one
    • Additional 10 minutes until we reach the JSONRPC error/timeout.

    It would be good to have log output describing what the node is doing, it obviously started shutting down, and is maybe still flushing some state to disk?

    Also, I wonder if bitcoind might have terminated the TCP/HTTP connection before responding to the stop-RPC (maybe gettxout-RPC being called right before it is a factor)?

    Included in the log, but seems unrelated as it’s another node:

    0node0 stderr Error: A fatal internal error occurred, see debug.log for details: Assumeutxo data not found for the given blockhash '7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a7a'.
    
  2. maflcko added the label CI failed on Feb 18, 2025
  3. maflcko added the label Windows on Feb 18, 2025
  4. hodlinator commented at 11:02 am on February 18, 2025: contributor

    @pinheadmz since you are working on the HTTP rewrite… have you noticed anything fishy regarding whether the stop-RPC will actually always get the response sent off back to the client or not?

    Current stop-RPC implementation implies it should always work: https://github.com/bitcoin/bitcoin/blob/790c509a4796ec47be2275d86b35370ff25a599a/src/rpc/server.cpp#L170-L172

    But it’s from 2015 and the log above makes me very suspicious.

  5. maflcko commented at 11:23 am on February 18, 2025: member

    It looks like the scheduler thread never exited? It may be related to the comment:

    0    // After everything has been shut down, but before things get flushed, stop the
    1    // the scheduler. After this point, SyncWithValidationInterfaceQueue() should not be called anymore
    2    // as this would prevent the shutdown from completing.
    3    if (node.scheduler) node.scheduler->stop();
    

    However, the scheduler thread seems to cycle normally, so there doesn’t seem to be any SyncWithValidationInterfaceQueue event stuck in it.

  6. pinheadmz commented at 11:44 am on February 18, 2025: member

    @pinheadmz since you are working on the HTTP rewrite…

    I hadn’t noticed anything about the current implementation but managing the RPC stop race condition was definitely a tricky part of the rewrite.

    One thing to note about the actual total time elapsed in your case is that on CI the timeout factors are multiplied by large factors in some cases to avoid timeout errors on low-resource runners:

    https://github.com/bitcoin/bitcoin/blob/78c19ddd7c56a275d2ae835fd13ed1508acd7806/.github/workflows/ci.yml#L153-L163

    I suppose enabling libevent logs may help understanding whats happening…

    Or maybe investigate why the net thread appears to not get the interrupt signal?

  7. maflcko commented at 12:06 pm on February 18, 2025: member

    Or maybe investigate why the net thread appears to not get the interrupt signal?

    My impression was that the net thread is stopped, along with the http workers, and the http thread. See:

    0node1 ...T14:10:53.440186Z (mocktime...) [net] [D:\a\bitcoin\bitcoin\src\util\thread.cpp:22] [TraceThread] net thread exit
    

    There seem to be two issues here:

    • The stop RPC timing out
    • The node not stopping
  8. pinheadmz commented at 12:08 pm on February 18, 2025: member
    Hm yeah, but then why does enabling extra block-relay-only peers happen after that?
  9. maflcko commented at 1:15 pm on February 18, 2025: member

    Hm yeah, but then why does enabling extra block-relay-only peers happen after that?

    The StartExtraBlockRelayPeers function is logging to the [net] category (not to be confused with the [net] thread) and is called by CheckForStaleTipAndEvictPeers, which is running in the [scheduler] thread.

  10. hodlinator commented at 3:15 pm on February 18, 2025: contributor
    0[shutoff] [...\httpserver.cpp:536] [StopHTTPServer] [http] Waiting for 1 connections to stop HTTP server
    1[http] [...\httpserver.cpp:355] [ThreadHTTP] [http] Exited http event loop
    2[net] [...\util\thread.cpp:22] [TraceThread] net thread exit
    

    The fact that we don’t get “Stopped HTTP server” or even “Waiting for HTTP event thread to exit” after these indicates that we are stuck in the lower part of StopHTTPServer():

    https://github.com/bitcoin/bitcoin/blob/790c509a4796ec47be2275d86b35370ff25a599a/src/httpserver.cpp#L518-L557

    So we’re either stuck on g_requests.WaitUntilEmpty(); or just after, in the evhttp_free logic.

    The pending request taking time to complete is likely stop or gettxout.

    It is a bit worrying that we get “net thread exit” before completely stopping HTTP, but I guess the “net” thread (CConnman::ThreadSocketHandler) is only used for P2P traffic anyways, not HTTP/RPC, so it shouldn’t be related.

  11. mzumsande commented at 4:21 pm on February 18, 2025: contributor
    #31019 (macOS) might be the same issue, in which case this would not be windows-specific.
  12. maflcko removed the label Windows on Feb 18, 2025
  13. maflcko commented at 4:32 pm on February 18, 2025: member

    StopHTTPServer

    Could update the issue title to say “timeout in StopHTTPServer”?

    #31019

    They are not exactly identical. 31019 likely never received an RPC, and the node is expected to raise an init error after running into a wallet init error. However, the function of the deadlock is the same:

    0 node0 2024-10-02T00:15:45.783904Z [shutoff] [src/httpserver.cpp:522] [StopHTTPServer] [http] Stopping HTTP server 
    1 node0 2024-10-02T00:15:45.783911Z [shutoff] [src/httpserver.cpp:524] [StopHTTPServer] [http] Waiting for HTTP worker threads to exit 
    2...
    3(no further log messages from the `[shutoff]` thread)
    

    So I agree with you that they seem to be the same underlying cause/issue.

  14. hodlinator renamed this:
    test: feature_assumeutxo.py Windows timeout
    qa: timeout in StopHTTPServer()
    on Feb 18, 2025
  15. hodlinator commented at 3:32 pm on February 19, 2025: contributor

    I’m working on making the code a bit more robust:

    • Impose a timeout on HTTP connections shutting down - Maybe related to this issue.
    • Fix an issue where we were not always processing the event to free eventHTTP - Would probably just leak otherwise, shutdown completes regardless of this in my testing.
     0diff --git a/src/httpserver.cpp b/src/httpserver.cpp
     1index 88e640c377..ce2534fbd0 100644
     2--- a/src/httpserver.cpp
     3+++ b/src/httpserver.cpp
     4@@ -197,10 +197,15 @@ public:
     5         return WITH_LOCK(m_mutex, return m_tracker.size());
     6     }
     7     //! Wait until there are no more connections with active requests in the tracker
     8-    void WaitUntilEmpty() const EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
     9+    //! or return the number of remaining requests on timeout.
    10+    size_t WaitUntilEmpty(std::chrono::seconds timeout) const EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
    11     {
    12-        WAIT_LOCK(m_mutex, lock);
    13-        m_cv.wait(lock, [this]() EXCLUSIVE_LOCKS_REQUIRED(m_mutex) { return m_tracker.empty(); });
    14+        {
    15+            WAIT_LOCK(m_mutex, lock);
    16+            m_cv.wait_for(lock, timeout, [this]() EXCLUSIVE_LOCKS_REQUIRED(m_mutex) { return m_tracker.empty(); });
    17+        }
    18+
    19+        return WITH_LOCK(m_mutex, return m_tracker.size());
    20     }
    21 };
    22 //! Track active requests
    23@@ -535,13 +540,17 @@ void StopHTTPServer()
    24         if (const auto n_connections{g_requests.CountActiveConnections()}; n_connections != 0) {
    25             LogDebug(BCLog::HTTP, "Waiting for %d connections to stop HTTP server\n", n_connections);
    26         }
    27-        g_requests.WaitUntilEmpty();
    28+        if (int connections = g_requests.WaitUntilEmpty(/*timeout=*/5min); connections > 0) {
    29+            LogError("%d HTTP connections remain after timeout expired. Aborting to avoid freeing resources still under use.", connections);
    30+            std::abort();
    31+        }
    32     }
    33     if (eventHTTP) {
    34         // Schedule a callback to call evhttp_free in the event base thread, so
    35         // that evhttp_free does not need to be called again after the handling
    36         // of unfinished request connections that follows.
    37         event_base_once(eventBase, -1, EV_TIMEOUT, [](evutil_socket_t, short, void*) {
    38+            LogDebug(BCLog::HTTP, "Freeing eventHTTP");
    39             evhttp_free(eventHTTP);
    40             eventHTTP = nullptr;
    41         }, nullptr, nullptr);
    42@@ -549,6 +558,17 @@ void StopHTTPServer()
    43     if (eventBase) {
    44         LogDebug(BCLog::HTTP, "Waiting for HTTP event thread to exit\n");
    45         if (g_thread_http.joinable()) g_thread_http.join();
    46+        bool first_run = true;
    47+        while (event_base_get_num_events(eventBase, EVENT_BASE_COUNT_ACTIVE | EVENT_BASE_COUNT_ADDED) > 0) {
    48+            if (first_run) LogDebug(BCLog::HTTP, "Processing events left behind by HTTP event thread");
    49+            event_base_loop(eventBase, EVLOOP_NONBLOCK);
    50+            if (first_run) {
    51+                first_run = false;
    52+            } else {
    53+                // Only sleep when this happens repetitively.
    54+                std::this_thread::sleep_for(100ms);
    55+            }
    56+        }
    57         event_base_free(eventBase);
    58         eventBase = nullptr;
    59     }
    

    With the change I get:

    02025-02-19T15:22:35Z [http] Stopping HTTP server
    12025-02-19T15:22:35Z [http] Waiting for HTTP worker threads to exit
    22025-02-19T15:22:35Z [http] Exited http event loop
    32025-02-19T15:22:35Z [http] Waiting for HTTP event thread to exit
    42025-02-19T15:22:35Z [http] Processing events left behind by HTTP event thread
    52025-02-19T15:22:35Z [http] Freeing eventHTTP
    62025-02-19T15:22:35Z [http] Stopped HTTP server
    
  16. maflcko added the label RPC/REST/ZMQ on Feb 20, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-02-22 06:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me