http: Fix HTTP server shutdown #14670

pull promag wants to merge 6 commits into bitcoin:master from promag:2018-11-httpserver changing 7 files +54 −28
  1. promag commented at 3:04 pm on November 6, 2018: member

    Fixes #11777. Reverts #11006. Replaces #13501.

    With this change the HTTP server will exit gracefully, meaning that all requests will finish processing and sending the response, even if this means to wait more than 2 seconds (current time allowed to exit the event loop).

    Another small change is that connections are accepted even when the server is stopping, but HTTP requests are rejected. This can be improved later, especially if chunked replies are implemented.

    Briefly, before this PR, this is the order or events when a request arrives (RPC stop):

    1. bufferevent_disable(..., EV_READ)
    2. StartShutdown()
    3. evhttp_del_accept_socket(...)
    4. ThreadHTTP terminates (event loop exits) because there are no active or pending events thanks to 1. and 3.
    5. client doesn’t get the response thanks to 4.

    This can be verified by applying

    0     // Event loop will exit after current HTTP requests have been handled, so
    1     // this reply will get back to the client.
    2     StartShutdown();
    3+    MilliSleep(2000);
    4     return "Bitcoin server stopping";
    5 }
    

    and checking the log output:

     0    Received a POST request for / from 127.0.0.1:62443
     1    ThreadRPCServer method=stop user=__cookie__
     2    Interrupting HTTP server
     3**  Exited http event loop
     4    Interrupting HTTP RPC server
     5    Interrupting RPC
     6    tor: Thread interrupt
     7    Shutdown: In progress...
     8    torcontrol thread exit
     9    Stopping HTTP RPC server
    10    addcon thread exit
    11    opencon thread exit
    12    Unregistering HTTP handler for / (exactmatch 1)
    13    Unregistering HTTP handler for /wallet/ (exactmatch 0)
    14    Stopping RPC
    15    RPC stopped.
    16    Stopping HTTP server
    17    Waiting for HTTP worker threads to exit
    18    msghand thread exit
    19    net thread exit
    20
    21    ... sleep 2 seconds ...
    22
    23    Waiting for HTTP event thread to exit
    24    Stopped HTTP server
    

    For this reason point 3. is moved right after all HTTP workers quit. In that moment HTTP replies are queued in the event loop which keeps spinning util all connections are closed. In order to trigger the server side close with keep alive connections (implicit in HTTP/1.1) the header Connection: close is sent if shutdown was requested. This can be tested by

    0bitcoind -regtest
    1nc localhost 18443
    2POST / HTTP/1.1
    3Authorization: Basic ...
    4Content-Type: application/json
    5Content-Length: 44
    6
    7{"jsonrpc": "2.0","method":"stop","id":123}
    

    Summing up, this PR:

    • removes explicit event loop exit — event loop exits once there are no active or pending events
    • changes the moment the listening sockets are removed — explained above
    • sends header Connection: close on active requests when shutdown was requested which is relevant when it’s a persistent connection (default in HTTP 1.1) — libevent is aware of this header and closes the connection gracefully
    • removes event loop explicit break after 2 seconds timeout
  2. fanquake added the label RPC/REST/ZMQ on Nov 6, 2018
  3. jnewbery commented at 3:07 pm on November 6, 2018: member
    @promag can you explain how this is different from #13501 and what the reasoning was for closing that PR?
  4. MarcoFalke commented at 3:08 pm on November 6, 2018: member
    This is now the trillionth attempt in fixing the issue. Could you provide some rationale how this is different from the other approaches and why it should be preferred over the other ones?
  5. ryanofsky commented at 3:38 pm on November 6, 2018: member

    This is simpler than #13501, and if it works I think it is preferable. Instead of trying to wait for all reply data to be sent before stopping the event loop and calling evhttp_free, it lets evhttp_free worry about this and stop the event loop itself. Assuming http_free does the right thing, this is a cleaner and less invasive.

    I am curious what exactly the behavior of evhttp_free is. For example if we are in the middle of sending a huge reply to a slow client, and the server is shut down, how long does this wait (if it waits at all) to send all the data before shutting down? IIUC #13501 would wait as long as it took, and currently we wait for 2 seconds.

  6. ryanofsky commented at 4:07 pm on November 6, 2018: member

    It seems like evhttp_free just immediately calls shutdown and close on all open connections, so I would expect this change not to always fix #11777.

    https://github.com/libevent/libevent/blob/9afe7a6c12c112a6cbc49cf1bd684781275b2579/http.c#L1243 https://github.com/libevent/libevent/blob/9afe7a6c12c112a6cbc49cf1bd684781275b2579/http.c#L3721

    But maybe the WriteHeader("Connection", "close") call added to HTTPRequest::WriteReply improves things in some way, so even if there is a race condition, it won’t trigger in practice?

  7. DrahtBot commented at 5:57 pm on November 6, 2018: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #14726 (Use RPCHelpMan for all RPCs by MarcoFalke)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  8. promag force-pushed on Nov 6, 2018
  9. promag force-pushed on Nov 7, 2018
  10. promag force-pushed on Nov 7, 2018
  11. promag commented at 3:19 am on November 7, 2018: member

    @ryanofsky right, can’t call evhttp_free with ongoing requests.

    Updated code and PR description, should be more clear of the current problem and the approach used.

  12. promag force-pushed on Nov 7, 2018
  13. promag force-pushed on Nov 7, 2018
  14. promag commented at 9:33 am on November 7, 2018: member

    Split in multiple commits for easy review.

    It can be arguable to remove the timeout but IMO it’s safe because:

    • if an HTTP worker is with a long task then calling event_base_loopbreak is no good
    • active connections will timeout and close if no request/response happens (30 seconds by default).

    This can be observed by:

    0bitcoind -regtest
    1time nc localhost 18443
    2bitcoin-cli -regtest stop
    3nc localhost 18443  0.00s user 0.01s system 0% cpu 30.025 total
    4                                                   ^^^^^^
    

    The only way to improve this in a correct way is to lower the connection timeout (that I know of)

  15. in src/httpserver.cpp:573 in 4155634870 outdated
    565@@ -579,6 +566,9 @@ void HTTPRequest::WriteHeader(const std::string& hdr, const std::string& value)
    566 void HTTPRequest::WriteReply(int nStatus, const std::string& strReply)
    567 {
    568     assert(!replySent && req);
    569+    if (ShutdownRequested()) {
    


    ken2812221 commented at 1:19 pm on November 7, 2018:
    I think you could have a flag and set it in InterruptHTTPServer(). Then you can detect the flag here.

    promag commented at 1:23 pm on November 7, 2018:
    Why create redundancy?

    ken2812221 commented at 1:51 pm on November 7, 2018:
    To keep consistency and have lesser dependencies. Logically this should happen after http server interrupted.

    promag commented at 4:54 pm on November 7, 2018:
    Ah I see, I’ll do it.

    promag commented at 0:43 am on November 8, 2018:
    Actually I can’t do that because that way the RPC stop reply won’t get the Connection: close header — between StartShutdown() and InterruptHTTPServer() can go around 200ms: https://github.com/bitcoin/bitcoin/blob/11e1ac3ae08535cefbd8235a8deb6cd100bcb2b1/src/bitcoind.cpp#L46-L54

    ken2812221 commented at 5:04 am on November 8, 2018:
    Ok. Let’s use ShutdownRequested for now. Then we can switch to the flag if #13211 merged.

    promag commented at 2:06 pm on November 12, 2018:
    I don’t think that helps in this case.
  16. ken2812221 commented at 1:21 pm on November 7, 2018: contributor
    Concept ACK.
  17. ryanofsky commented at 3:12 pm on November 7, 2018: member

    Can you update the PR description to describe the change in external behavior instead of diving immediately into details of the code?

    It would also help if each individual commit stated how it changes behavior, and why the change makes sense by itself or in conjunction with other changes.

  18. promag commented at 4:56 pm on November 7, 2018: member

    describe the change in external behavior @ryanofsky sorry I don’t understand what you mean with external behavior.

    and why the change makes sense by itself or in conjunction with other changes

    I think I did?

    It would also help if each individual commit stated how it changes behavior

    Will do.

  19. MarcoFalke commented at 5:00 pm on November 7, 2018: member
    I think external behaviour means the behaviour users or json-rpc-wrapper libraries see.
  20. ryanofsky commented at 5:26 pm on November 7, 2018: member
    What will happen if bitcoin is in the middle of sending a large response to a slow client and it gets shutdown? Will it try to send as much as of the reply as possible but give up after a timeout? Will it stop sending immediately? Will it wait forever? Previous versions of this PR have done a bunch of different things, so what externally observable changes is this PR is intending to make? I think this would be the most useful information to know and include in the description.
  21. promag force-pushed on Nov 7, 2018
  22. promag force-pushed on Nov 7, 2018
  23. promag commented at 10:52 pm on November 7, 2018: member

    @ryanofsky still testing but I believe it waits until HTTP request finishes or timeouts due to inactivity on the connection. In detail this means:

    • completes reading the request headers and body
    • worker computes full response and adds event to send the response
    • the response is sent and then the connection is finally closed
    • at this point, if it’s the last connection, the event loop has no registered events and exists.

    Updated commits and PR description. Added @ken2812221 suggestion.

  24. promag force-pushed on Nov 8, 2018
  25. laanwj added this to the "Blockers" column in a project

  26. laanwj commented at 9:24 am on November 20, 2018: member

    Can you update the PR description to describe the change in external behavior instead of diving immediately into details of the code?

    Also, can we please add a test that fails without this, but passes with it? That’s the best way to “prove” that this fixes something.

  27. promag force-pushed on Nov 20, 2018
  28. promag force-pushed on Nov 20, 2018
  29. promag force-pushed on Nov 20, 2018
  30. promag commented at 6:16 pm on November 20, 2018: member
    • ef7c54c1e - rpc: Add wait argument to stop
    • 8589891ea - http: Send “Connection: close” header if shutdown is requested
    • e10898267 - http: Unlisten sockets after all workers quit
    • b1f241a0f - http: Remove unnecessary event_base_loopexit call
    • a72fbc8aa - http: Remove timeout to exit event loop
    • c76cde3c9 - qa: Test shutdown @laanwj if you build only the marked commits then test test/functional/feature_shutdown.py fails.
  31. promag force-pushed on Nov 20, 2018
  32. in test/functional/feature_shutdown.py:23 in fee25926eb outdated
    18+        self.setup_clean_chain = True
    19+        self.num_nodes = 1
    20+
    21+    def run_test(self):
    22+        node = get_rpc_proxy(self.nodes[0].url, 1, timeout=600, coveragedir=self.nodes[0].coverage_dir)
    23+        t = Thread(target=waitfornewblock, args=(node,))
    


    DrahtBot commented at 11:55 pm on November 20, 2018:

    nit: Could explicitly pass a timeout to waitfornewblock and inline as lambda:

    0t = Thread(target=lambda: self.nodes[0].waitfornewblock(timeout=             ))
    

    promag commented at 9:41 am on November 21, 2018:
    Thanks, I’ve changed a bit the code and now I don’t think this applies, please check again.
  33. promag force-pushed on Nov 21, 2018
  34. promag commented at 9:40 am on November 21, 2018: member
    Thanks @ken2812221, I was going to try something along that after inspecting stop_node implementation. @laanwj in order to properly test the fix, and to trigger the “race to exit the event loop”, I had to add wait argument to stop RPC as explained in #13501 (comment).
  35. laanwj commented at 10:17 am on November 21, 2018: member
    @promag Makes sense, thanks for adding a test!
  36. in src/rpc/server.cpp:230 in 92a81adbe8 outdated
    223@@ -224,11 +224,20 @@ UniValue stop(const JSONRPCRequest& jsonRequest)
    224     // Accept the deprecated and ignored 'detach' boolean argument
    225     if (jsonRequest.fHelp || jsonRequest.params.size() > 1)
    226         throw std::runtime_error(
    227-            "stop\n"
    228-            "\nStop Bitcoin server.");
    229+            "stop ( wait )\n"
    230+            "\nStop Bitcoin server.\n"
    231+            "\nArguments:\n"
    232+            "1. \"wait\"     (int, optional, default=0) Milliseconds to wait before returning to the client\n"
    


    laanwj commented at 10:18 am on November 21, 2018:
    maybe mention that this is for testing purposes (unless there are other reasons this can be useful to users but can’t really think of one)

    promag commented at 10:16 am on November 22, 2018:
    I also don’t see a reason to use this argument.
  37. promag commented at 10:18 am on November 22, 2018: member
    Sorry @ken2812221, just pushed what @laanwj suggested. I’ll squash later.
  38. DrahtBot added the label Needs rebase on Nov 23, 2018
  39. promag force-pushed on Nov 23, 2018
  40. promag commented at 12:00 pm on November 23, 2018: member
    Rebased.
  41. DrahtBot removed the label Needs rebase on Nov 23, 2018
  42. in src/rpc/server.cpp:239 in 447e91ad66 outdated
    236+                {
    237+                    {"wait", RPCArg::Type::NUM, true}
    238+                }}
    239+                .ToString() +
    240+            "\nArguments:\n"
    241+            "1. \"wait\"     (int, optional, default=0) Milliseconds to wait before returning to the client (intended for testing)\n"
    


    MarcoFalke commented at 4:27 pm on November 23, 2018:

    Maybe don’t put it in the help if it is only for testing? Also note that this has been a boolean argument in the past (first comment line in this function).

    So just like we hide testing RPCs altogether, this argument should be hidden as well.

    Could append a line to the comment further up to mention that this is now a testing-only argument…


    promag commented at 4:33 pm on November 23, 2018:
    I see, thanks.

    promag commented at 4:58 pm on November 23, 2018:
    Done.
  43. rpc: Add wait argument to stop 02e1e4eff6
  44. http: Send "Connection: close" header if shutdown is requested
    Sending the header "Connection: close" makes libevent close persistent
    connections (implicit with HTTP 1.1) which cleans the event base when
    shutdown is requested.
    18e9685816
  45. http: Unlisten sockets after all workers quit
    This (almost) move only ensures the event base loop doesn't exit before
    HTTP worker threads exit. This way events registered by HTTP workers are
    processed and not discarded.
    6b13580f4e
  46. http: Remove unnecessary event_base_loopexit call
    Let event base loop exit cleanly by processing all active and pending
    events. The call is no longer necessary because closing persistent
    connections is now properly handled.
    e98a9eede2
  47. http: Remove timeout to exit event loop
    Let HTTP connections to timeout due to inactivity.
    Let all remaning connections finish sending the response and close.
    8d3f46ec39
  48. qa: Test bitcond shutdown 28479f926f
  49. promag force-pushed on Nov 23, 2018
  50. ken2812221 approved
  51. ken2812221 commented at 8:40 am on November 30, 2018: contributor
    tACK 28479f926f21f2a91bec5a06671c60e5b0c55532
  52. laanwj commented at 4:39 pm on December 6, 2018: member
    utACK 28479f926f21f2a91bec5a06671c60e5b0c55532
  53. laanwj commented at 4:42 pm on December 6, 2018: member
    utACK 28479f926f21f2a91bec5a06671c60e5b0c55532
  54. laanwj merged this on Dec 6, 2018
  55. laanwj closed this on Dec 6, 2018

  56. laanwj referenced this in commit a88bd3186d on Dec 6, 2018
  57. jnewbery removed this from the "Blockers" column in a project

  58. ken2812221 commented at 8:41 am on December 14, 2018: contributor

    Is this another HTTP server bug? https://travis-ci.org/bitcoin/bitcoin/jobs/467883032

     0stdout:
     12018-12-14T08:34:22.452000Z TestFramework (INFO): Initializing test directory /tmp/test_runner_₿_🏃_20181214_082355/feature_shutdown_0
     22018-12-14T08:34:24.197000Z TestFramework (INFO): Stopping nodes
     32018-12-14T08:34:24.197000Z TestFramework (INFO): Cleaning up /tmp/test_runner_₿_🏃_20181214_082355/feature_shutdown_0 on exit
     42018-12-14T08:34:24.197000Z TestFramework (INFO): Tests successful
     5stderr:
     6Exception in thread Thread-1:
     7Traceback (most recent call last):
     8  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
     9    self.run()
    10  File "/usr/lib/python3.6/threading.py", line 864, in run
    11    self._target(*self._args, **self._kwargs)
    12  File "/home/travis/build/bitcoin/bitcoin/build/bitcoin-x86_64-unknown-linux-gnu/test/functional/feature_shutdown.py", line 12, in test_long_call
    13    block = node.waitfornewblock()
    14  File "/home/travis/build/bitcoin/bitcoin/build/bitcoin-x86_64-unknown-linux-gnu/test/functional/test_framework/coverage.py", line 47, in __call__
    15    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
    16  File "/home/travis/build/bitcoin/bitcoin/build/bitcoin-x86_64-unknown-linux-gnu/test/functional/test_framework/authproxy.py", line 134, in __call__
    17    response = self._request('POST', self.__url.path, postdata.encode('utf-8'))
    18  File "/home/travis/build/bitcoin/bitcoin/build/bitcoin-x86_64-unknown-linux-gnu/test/functional/test_framework/authproxy.py", line 105, in _request
    19    return self._get_response()
    20  File "/home/travis/build/bitcoin/bitcoin/build/bitcoin-x86_64-unknown-linux-gnu/test/functional/test_framework/authproxy.py", line 166, in _get_response
    21    'code': -342, 'message': 'non-JSON HTTP response with \'%i %s\' from server' % (http_response.status, http_response.reason)})
    22test_framework.authproxy.JSONRPCException: non-JSON HTTP response with '503 Service Unavailable' from server (-342)
    
  59. promag commented at 8:57 am on December 14, 2018: member
    At least it doesn’t look. Looks more a race between the thread that runs test_long_call and stop_node — if stop_node runs before then RPC would be unavailable. It seems unlikely because of the 1 second wait but it should be possible considering the probable load.
  60. promag commented at 12:00 pm on December 14, 2018: member

    Tested the change:

     0diff --git a/test/functional/feature_shutdown.py b/test/functional/feature_shutdown.py
     1index 5a0c88546..b0df81736 100755
     2--- a/test/functional/feature_shutdown.py
     3+++ b/test/functional/feature_shutdown.py
     4@@ -7,8 +7,11 @@
     5 from test_framework.test_framework import BitcoinTestFramework
     6 from test_framework.util import assert_equal, get_rpc_proxy
     7 from threading import Thread
     8+from time import sleep
     9+
    10
    11 def test_long_call(node):
    12+    sleep(1)
    13     block = node.waitfornewblock()
    14     assert_equal(block['height'], 0)
    

    which gives the following error:

    0ConnectionRefusedError: [Errno 61] Connection refused
    

    Then on top of that:

    0@@ -20,6 +23,8 @@ class ShutdownTest(BitcoinTestFramework):
    1
    2     def run_test(self):
    3         node = get_rpc_proxy(self.nodes[0].url, 1, timeout=600, coveragedir=self.nodes[0].coverage_dir)
    4+        # force connection establishment
    5+        node.getblockcount()
    

    the error is now

    0test_framework.authproxy.JSONRPCException: non-JSON HTTP response with '503 Service Unavailable' from server (-342)
    

    This could be easily fixed if we had a way to inspect how many RPC workers are running, for instance, by having a getrpcinfo.

    Meanwhile this can be “prevented” by adding some relaxation:

    0@@ -26,6 +26,7 @@ class ShutdownTest(BitcoinTestFramework):
    1         # force connection establishment
    2         node.getblockcount()
    3         Thread(target=test_long_call, args=(node,)).start()
    4+        sleep(1)
    
  61. promag deleted the branch on Dec 15, 2018
  62. laanwj referenced this in commit 3ae3748ce1 on Jan 16, 2019
  63. scravy referenced this in commit 6753dac00f on Feb 28, 2019
  64. codablock referenced this in commit 835636d9b0 on Oct 14, 2019
  65. codablock referenced this in commit fefe07003b on Oct 14, 2019
  66. barrystyle referenced this in commit 32ea3cc4ef on Jan 22, 2020
  67. deadalnix referenced this in commit 66eda510af on Feb 6, 2020
  68. TheArbitrator referenced this in commit 7df7b4f53c on Jun 23, 2021
  69. TheArbitrator referenced this in commit 47b8deeda8 on Jun 25, 2021
  70. PastaPastaPasta referenced this in commit f8bbe7a8c2 on Jun 26, 2021
  71. PastaPastaPasta referenced this in commit aa199d882d on Jun 27, 2021
  72. PastaPastaPasta referenced this in commit 6d6c12b81c on Jun 28, 2021
  73. PastaPastaPasta referenced this in commit 77fc7bd31c on Jun 29, 2021
  74. PastaPastaPasta referenced this in commit b67f5124cc on Jul 1, 2021
  75. PastaPastaPasta referenced this in commit 399514859c on Jul 1, 2021
  76. Munkybooty referenced this in commit aa25265942 on Aug 2, 2021
  77. Munkybooty referenced this in commit 2e1fb3f3e8 on Aug 3, 2021
  78. Munkybooty referenced this in commit 21d93c1aea on Aug 5, 2021
  79. Munkybooty referenced this in commit 36b2f4e0f0 on Aug 5, 2021
  80. Munkybooty referenced this in commit 6db39063f8 on Aug 8, 2021
  81. Munkybooty referenced this in commit 9740c21531 on Aug 11, 2021
  82. Munkybooty referenced this in commit 09963a6af0 on Aug 11, 2021
  83. MarcoFalke locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-10-04 22:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me