bitcoin-cli: print useful error if bitcoind rpc work queue exceeded #18335

pull LarryRuane wants to merge 2 commits into bitcoin:master from LarryRuane:cli-work-queue-error-msg changing 3 files +29 −1
  1. LarryRuane commented at 5:49 pm on March 12, 2020: contributor
    If bitcoind is processing 16 RPC requests, attempting to submit another request using bitcoin-cli produces this less-than-helpful error message: error: couldn't parse reply from server. This PR changes the error to: error: server response: Work queue depth exceeded.
  2. LarryRuane commented at 5:49 pm on March 12, 2020: contributor

    To reproduce the problem, start the client: bitcoind -rpcworkqueue=1. This reduces the maximum number of in-progress RPCs from 16 (DEFAULT_HTTP_WORKQUEUE) to 1. Then run:

    0seq 10 | while read i;do src/bitcoin-cli getblockhash 0 >/dev/null&done
    

    This submits this trivial RPC 10 time in the background (the & character) and usually produces one or more of the error: couldn't parse reply from server errors. If it doesn’t, try again, or increase the load to more than 10. You will see a reasonable error message in debug.log:

    02020-03-12T17:32:28Z WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting
    

    But most users wouldn’t think to look there, or wouldn’t have access.

    With this PR, the error is more understandable. The failed request in this case is retryable, so maybe that should be indicated in the text, but this text (Work queue depth exceeded) comes from bitcoind, which probably shouldn’t be changed, especially since there could be other RPC clients that depend on that exact text. We could have bitcoin-cli include additional text like please try again but it doesn’t really know for certain in every case that the request is retryable.

  3. in src/bitcoin-cli.cpp:407 in 86af496741 outdated
    403@@ -404,7 +404,7 @@ static UniValue CallRPC(BaseRequestHandler *rh, const std::string& strMethod, co
    404     // Parse reply
    405     UniValue valReply(UniValue::VSTR);
    406     if (!valReply.read(response.body))
    407-        throw std::runtime_error("couldn't parse reply from server");
    408+        throw std::runtime_error(strprintf("server response: %s", response.body));
    


    laanwj commented at 6:47 pm on March 12, 2020:
    I think it’s strange that it ends up here. It’s trying to parse an error message as JSON! Isn’t there some error code (either HTTP or JSON-RPC) that could be checked for instead? If not, that seems like a bug in the server.

    LarryRuane commented at 9:59 pm on March 12, 2020:

    It does seem wrong that my proposed fix retains the practice of trying to decode a non-JSON string as JSON, and depending on that to fail. Yes, perhaps this is a bug in the server – should it be returning JSON in body here rather than a plain string? But as I said at the end of the comment above, I was hesitant to change the server because I’m worried that there may be other RPC clients besides bitcoin-cli that depend on the current (plain string) response. What do you think? If the fix is better in the server, I can do that.

    Just for further information, the response returned in this work queue depth exceeded is (status 500 is HTTP_INTERNAL_SERVER_ERROR):

     0(gdb) f 0
     1Breakpoint 1, CallRPC (rh=0x555555845e90, strMethod="getblockhash", args=std::vector of length 1, capacity 2 = {...})
     2    at bitcoin-cli.cpp:406
     3406	    if (!valReply.read(response.body))
     4(gdb) p response
     5$2 = {
     6  status = 500, 
     7  error = -1, 
     8  body = "Work queue depth exceeded"
     9}
    10(gdb) 
    

    Here’s a successful reply:

    0(gdb) p response
    1$6 = {
    2  status = 200, 
    3  error = -1, 
    4  body = "{\"result\":\"000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f\",\"error\":null,\"id\":1}\n"
    5}
    

    Here’s the response when requesting a nonsense method:

    0(gdb) p response
    1$2 = {
    2  status = 404, 
    3  error = -1, 
    4  body = "{\"result\":null,\"error\":{\"code\":-32601,\"message\":\"Method not found\"},\"id\":1}\n"
    5}
    

    As part of this PR, I could also systematically search for and fix any other places where body is returned as plain string (not JSON); so far I see only one:

    0req->WriteReply(HTTP_BAD_METHOD, "JSONRPC server handles only POST requests")
    
  4. LarryRuane commented at 10:59 pm on March 12, 2020: contributor

    The second commit undoes the first, and fixes the problem in the server by having it return JSON in the body of the response:

    0error code: -32603
    1error message:
    2Work queue depth exceeded
    

    (Here’s where that error code, RPC_INTERNAL_ERROR is defined.) It makes sense that this is similar to:

    0$ src/bitcoin-cli not-a-valid-rpc
    1error code: -32601
    2error message:
    3Method not found
    4$ 
    
  5. DrahtBot added the label RPC/REST/ZMQ on Mar 13, 2020
  6. laanwj commented at 9:11 pm on March 17, 2020: member

    Ok, looks like his is actually non-trivial.

    HTTP_INTERNAL_SERVER_ERROR

    0$2 = {
    1  status = 500, 
    2  error = -1, 
    3  body = "Work queue depth exceeded"
    4}
    

    So in this case, possibly you could detect this error code in bitcoin-cli, and print an error message instead of trying to parse as JSON. Or are there cases in which it returns 500 but the response needs to be parsed as JSON?

    I must admit we’re sometimes inconsistent about whether to report errors at the HTTP layer or the JSON-RPC, so it’s very possible that your solution of changing the server to return a JSON error is the most consistent one!

    (I’m not personally very worried about clients depending on a specific exact error message text, though we might want to mention it briefly in the release notes)

  7. laanwj commented at 2:10 pm on March 27, 2020: member

    I must admit we’re sometimes inconsistent about whether to report errors at the HTTP layer or the JSON-RPC

    I checked this and yes, we do return JSON errors with HTTP_INTERNAL_SERVER_ERROR:

     0static void JSONErrorReply(HTTPRequest* req, const UniValue& objError, const UniValue& id)
     1{
     2    // Send error reply from json-rpc error object
     3    int nStatus = HTTP_INTERNAL_SERVER_ERROR;
     4    int code = find_value(objError, "code").get_int();
     5
     6    if (code == RPC_INVALID_REQUEST)
     7        nStatus = HTTP_BAD_REQUEST;
     8    else if (code == RPC_METHOD_NOT_FOUND)
     9        nStatus = HTTP_NOT_FOUND;
    10
    11    std::string strReply = JSONRPCReply(NullUniValue, objError, id);
    12
    13    req->WriteHeader("Content-Type", "application/json");
    14    req->WriteReply(nStatus, strReply);
    15}
    

    So we have no way to distinguish “parse this as JSON?” based on just the status code.

    I think the current change of changing the server to always return JSON errors is OK. I think it’s slightly more elegant than doing “try to parse as JSON and print the raw text otherwise” in the client. But I don’t have a strong opinion.

  8. hebasto commented at 7:39 pm on April 4, 2020: member

    @LarryRuane Just was noted about this PR by GH :)

    If you find this approach simpler you may grab it :)

  9. LarryRuane force-pushed on Apr 4, 2020
  10. DrahtBot commented at 8:02 pm on April 4, 2020: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    No conflicts as of last run.

  11. LarryRuane commented at 8:04 pm on April 4, 2020: contributor

    @hebasto

    If you find this approach simpler you may grab it :)

    Yes, your solution is much better, thank you! I tested your fix, and cherry-picked your commit to this PR (and force-push-removed my 3 commits).

  12. luke-jr approved
  13. luke-jr commented at 3:21 am on May 5, 2020: member
    utACK 4bb892cec2375d919ae66a743e2ad29cd56ba888
  14. util: Handle HTTP_SERVICE_UNAVAILABLE in bitcoin-cli b5a80fa7e4
  15. LarryRuane force-pushed on Jul 12, 2020
  16. LarryRuane commented at 8:54 pm on July 12, 2020: contributor
    Rebase (refresh)
  17. hebasto approved
  18. hebasto commented at 4:08 pm on July 14, 2020: member
    self-ACK b5a80fa7e487c37b7ac0e3874a2fabade41b9ca8
  19. jonatack commented at 4:48 pm on July 17, 2020: member
    Concept ACK. Could you please add functional tests for this in one of the test/functional/interface* files? Say, interface_bitcoin_cli.py (which has a lot of error response testing) or interface_rpc.py (which has a batch request test).
  20. in test/functional/interface_rpc.py:35 in c067589b7a outdated
    30+            node.cli('getrpcinfo').send_cli()
    31+        except subprocess.CalledProcessError as e:
    32+            if e.output != 'error: Server response: Work queue depth exceeded\n':
    33+                raise AssertionError("Unexpected cli error: :{}:".format(e.output))
    34+            got_exceeded_error = True
    35+            break
    


    LarryRuane commented at 3:55 pm on July 21, 2020:
    This 400 is, of course, arbitrary, as is the number of threads this test starts (5, see below). These numbers are the result of simple trial-and-error. We want each thread to do enough work that it’s still running by the time all 5 threads are started. Still, the test only takes about 1.3 seconds to run on my system. Increasing the number of threads allows the iteration count here to be reduced, but I don’t want to start too many threads, because each one probably uses a nontrivial amount of resources and I don’t want to overburden the CI workers.
  21. jonatack commented at 12:31 pm on July 22, 2020: member

    Tested almost-ACK. Thanks for adding the test! Runs quickly, but one of the Travis jobs failed at assert got_exceeded_error so it may need to be more reliable. Some suggested changes below: add a line of logging, attempt to remove perhaps-unneeded breaks, and PEP8 suggestions from pycodestyle and black – feel free to ignore the style nits. Tested that the test fails on master and passes on this branch.

     0index e0e6fc57aa..cdbed53dd6 100755
     1--- a/test/functional/interface_rpc.py
     2+++ b/test/functional/interface_rpc.py
     3@@ -11,6 +11,7 @@ from test_framework.util import assert_equal, assert_greater_than_or_equal
     4 from threading import Thread
     5 import subprocess
     6 
     7+
     8 def expect_http_status(expected_http_status, expected_rpc_code,
     9                        fcn, *args):
    10     try:
    11@@ -20,7 +21,10 @@ def expect_http_status(expected_http_status, expected_rpc_code,
    12         assert_equal(exc.error["code"], expected_rpc_code)
    13         assert_equal(exc.http_status, expected_http_status)
    14 
    15+
    16 got_exceeded_error = False
    17+
    18+
    19 def test_work_queue_getblock(node):
    20     global got_exceeded_error
    21     for _ in range(400):
    22@@ -29,10 +33,9 @@ def test_work_queue_getblock(node):
    23         try:
    24             node.cli('getrpcinfo').send_cli()
    25         except subprocess.CalledProcessError as e:
    26-            if e.output != 'error: Server response: Work queue depth exceeded\n':
    27-                raise AssertionError("Unexpected cli error: :{}:".format(e.output))
    28+            assert_equal(e.output, 'error: Server response: Work queue depth exceeded\n')
    29             got_exceeded_error = True
    30-            break
    31+
    32 
    33 class RPCInterfaceTest(BitcoinTestFramework):
    34     def set_test_params(self):
    35@@ -84,19 +87,18 @@ class RPCInterfaceTest(BitcoinTestFramework):
    36         expect_http_status(500, -8, self.nodes[0].getblockhash, 42)
    37 
    38     def test_work_queue_exceeded(self):
    39+        self.log.info("Test work queue exceeded")
    40         global got_exceeded_error
    41-        self.restart_node(0,['-rpcworkqueue=1'])
    42+        self.restart_node(0, ['-rpcworkqueue=1'])
    43         threads = []
    44         for _ in range(5):
    45-            t = Thread(target=test_work_queue_getblock, args=(self.nodes[0], ))
    46+            t = Thread(target=test_work_queue_getblock, args=(self.nodes[0],))
    47             t.start()
    48             threads.append(t)
    49-            if got_exceeded_error:
    50-                break
    51         for t in threads:
    52             t.join()
    53         # This likely just means more threads are needed.
    54-        assert(got_exceeded_error)
    55+        assert got_exceeded_error
    56 
    57     def run_test(self):
    58         self.test_getrpcinfo()
    
  22. fjahr commented at 1:04 pm on July 22, 2020: member

    Concept ACK

    Code looks good at a glance but the test fails on Travis, so I will come back for a more in-depth review when that is fixed :)

  23. LarryRuane force-pushed on Jul 22, 2020
  24. in test/functional/interface_rpc.py:86 in db0d4605b9 outdated
    79@@ -67,10 +80,23 @@ def test_http_status_codes(self):
    80         expect_http_status(404, -32601, self.nodes[0].invalidmethod)
    81         expect_http_status(500, -8, self.nodes[0].getblockhash, 42)
    82 
    83+    def test_work_queue_exceeded(self):
    84+        self.log.info("Test work queue exceeded")
    85+        self.restart_node(0, ['-rpcworkqueue=1'])
    86+        got_exceeded_error = []
    


    LarryRuane commented at 3:54 pm on July 22, 2020:
    This is a list, rather than just a boolean, because a list is a mutable object, and this eliminates the need for the global variable, which seems cleaner. Is it worth a comment explaining that? I tend to think not.

    darosior commented at 1:45 pm on October 31, 2020:

    nit:

    Is it worth a comment explaining that? I tend to think not.

    I don’t think so, but it’s a bit awkward for the list to be named like a boolean, maybe exceeded_errors if you have to retouch ?

  25. LarryRuane commented at 3:58 pm on July 22, 2020: contributor
    Addressed review suggestions, force-pushed (diff), thank you. The test will run until it succeeds (there’s no longer an arbitrary loop iteration count), or CI will timeout the test (which is extremely unlikely). Starting three threads should be enough, even two works, but the test runs faster with three (greater chance of exceeded the work queue depth limit).
  26. jonatack commented at 6:20 am on July 24, 2020: member
    Thanks @LarryRuane for removing the global and the update. The tests appear to be timing out, though it’s unclear.
  27. in test/functional/interface_rpc.py:85 in db0d4605b9 outdated
    79@@ -67,10 +80,23 @@ def test_http_status_codes(self):
    80         expect_http_status(404, -32601, self.nodes[0].invalidmethod)
    81         expect_http_status(500, -8, self.nodes[0].getblockhash, 42)
    82 
    83+    def test_work_queue_exceeded(self):
    84+        self.log.info("Test work queue exceeded")
    85+        self.restart_node(0, ['-rpcworkqueue=1'])
    


    hebasto commented at 8:51 am on August 4, 2020:

    This would make CI testing more reliable:

    0        self.restart_node(0, ["-rpcworkqueue=1", "-rpcthreads=1"])
    
  28. in test/functional/interface_rpc.py:84 in db0d4605b9 outdated
    79@@ -67,10 +80,23 @@ def test_http_status_codes(self):
    80         expect_http_status(404, -32601, self.nodes[0].invalidmethod)
    81         expect_http_status(500, -8, self.nodes[0].getblockhash, 42)
    82 
    83+    def test_work_queue_exceeded(self):
    84+        self.log.info("Test work queue exceeded")
    


    hebasto commented at 8:53 am on August 4, 2020:

    nit: Mind rewording to make it consistent with other log messages

    0        self.log.info("Testing work queue exceeded...")
    

    LarryRuane commented at 5:28 pm on August 4, 2020:
    Force-pushed to implement review suggestions.
  29. hebasto approved
  30. hebasto commented at 8:56 am on August 4, 2020: member

    ACK db0d4605b9e57230a02f8b620f93e5e4d191972c, tested on Linux Mint 20 (x86_64).

    https://cirrus-ci.com/task/5913926858702848 timeout seems irrelevant (could be restarted though).

  31. add functional test 8dd5946c0b
  32. LarryRuane force-pushed on Aug 4, 2020
  33. hebasto commented at 6:05 pm on August 4, 2020: member
    Travis failure seems related to #19654.
  34. hebasto approved
  35. hebasto commented at 6:25 pm on August 4, 2020: member
    re-ACK 8dd5946c0b7aa8f3976b14a5de4ce84b80a9c32a, only suggested changes since my previous review.
  36. fjahr commented at 11:44 am on August 5, 2020: member

    tACK 8dd5946c0b7aa8f3976b14a5de4ce84b80a9c32a

    Confirmed that the test fails without the fix. Ran test multiple times and seemed stable.

    mini-nit if you have to retouch: The commit message of the test commit is too general for our usual conventions.

  37. luke-jr referenced this in commit 6d207af2f6 on Aug 15, 2020
  38. hebasto commented at 8:17 am on August 22, 2020: member
    @luke-jr as you’ve incorporated this PR into Bitcoin Knots, mind refreshing your previous ACK to get things done?
  39. luke-jr commented at 2:41 pm on October 13, 2020: member
    utACK 8dd5946c0b7aa8f3976b14a5de4ce84b80a9c32a (no changes since previous utACK)
  40. in test/functional/interface_rpc.py:25 in 8dd5946c0b
    20@@ -18,6 +21,16 @@ def expect_http_status(expected_http_status, expected_rpc_code,
    21         assert_equal(exc.error["code"], expected_rpc_code)
    22         assert_equal(exc.http_status, expected_http_status)
    23 
    24+
    25+def test_work_queue_getblock(node, got_exceeded_error):
    


    darosior commented at 1:42 pm on October 31, 2020:
    nit: this is called queue_getblock but queues a getrpcinfo ? Maybe queue_rpc_call
  41. darosior approved
  42. darosior commented at 1:48 pm on October 31, 2020: member

    ACK 8dd5946c0b7aa8f3976b14a5de4ce84b80a9c32a

    Thanks for doing this, going to be helpful downstream :-) Left a few nits only if you have to retouch

  43. in src/httpserver.cpp:272 in 8dd5946c0b
    268@@ -269,7 +269,7 @@ static void http_request_cb(struct evhttp_request* req, void* arg)
    269             item.release(); /* if true, queue took ownership */
    270         else {
    271             LogPrintf("WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting\n");
    272-            item->req->WriteReply(HTTP_INTERNAL_SERVER_ERROR, "Work queue depth exceeded");
    273+            item->req->WriteReply(HTTP_SERVICE_UNAVAILABLE, "Work queue depth exceeded");
    


    MarcoFalke commented at 6:57 pm on March 19, 2021:

    Does this need release notes to explain the changed http status code? Just in case a user relies on this?

    Just asking, you don’t have to invalidate review. Release note can be added after merge.


    LarryRuane commented at 7:15 pm on March 19, 2021:
    Good point, I think you’re right, I’ll do it separately, thanks.

    MarcoFalke commented at 7:54 pm on March 19, 2021:
    You can add them directly to doc/release-notes.md on the master branch

    jonatack commented at 7:58 pm on March 19, 2021:
    Maybe add a “needs release note” label.
  44. MarcoFalke merged this on Mar 19, 2021
  45. MarcoFalke closed this on Mar 19, 2021

  46. LarryRuane referenced this in commit d7fbab5461 on Mar 20, 2021
  47. sidhujag referenced this in commit 581817cbd5 on Mar 20, 2021
  48. LarryRuane referenced this in commit 71fc3cb65e on Mar 22, 2021
  49. LarryRuane referenced this in commit 169bb257be on Mar 22, 2021
  50. LarryRuane referenced this in commit 51eef4a03e on Mar 23, 2021
  51. MarcoFalke referenced this in commit 11840509fb on Mar 29, 2021
  52. sidhujag referenced this in commit c6261f8300 on Mar 29, 2021
  53. DrahtBot locked this on Aug 16, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-22 06:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me