bitcoin-cli: print useful error if bitcoind rpc work queue exceeded

LarryRuane commented at 5:49 pm on March 12, 2020: contributor

If bitcoind is processing 16 RPC requests, attempting to submit another request using bitcoin-cli produces this less-than-helpful error message: error: couldn't parse reply from server. This PR changes the error to: error: server response: Work queue depth exceeded.

LarryRuane commented at 5:49 pm on March 12, 2020: contributor

To reproduce the problem, start the client: bitcoind -rpcworkqueue=1. This reduces the maximum number of in-progress RPCs from 16 (DEFAULT_HTTP_WORKQUEUE) to 1. Then run:

0seq 10 | while read i;do src/bitcoin-cli getblockhash 0 >/dev/null&done

This submits this trivial RPC 10 time in the background (the & character) and usually produces one or more of the error: couldn't parse reply from server errors. If it doesn’t, try again, or increase the load to more than 10. You will see a reasonable error message in debug.log:

02020-03-12T17:32:28Z WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting

But most users wouldn’t think to look there, or wouldn’t have access.

With this PR, the error is more understandable. The failed request in this case is retryable, so maybe that should be indicated in the text, but this text (Work queue depth exceeded) comes from bitcoind, which probably shouldn’t be changed, especially since there could be other RPC clients that depend on that exact text. We could have bitcoin-cli include additional text like please try again but it doesn’t really know for certain in every case that the request is retryable.

in src/bitcoin-cli.cpp:407 in 86af496741 outdated

403@@ -404,7 +404,7 @@ static UniValue CallRPC(BaseRequestHandler *rh, const std::string& strMethod, co
404     // Parse reply
405     UniValue valReply(UniValue::VSTR);
406     if (!valReply.read(response.body))
407-        throw std::runtime_error("couldn't parse reply from server");
408+        throw std::runtime_error(strprintf("server response: %s", response.body));

laanwj commented at 6:47 pm on March 12, 2020:

I think it’s strange that it ends up here. It’s trying to parse an error message as JSON! Isn’t there some error code (either HTTP or JSON-RPC) that could be checked for instead? If not, that seems like a bug in the server.

LarryRuane commented at 9:59 pm on March 12, 2020:

It does seem wrong that my proposed fix retains the practice of trying to decode a non-JSON string as JSON, and depending on that to fail. Yes, perhaps this is a bug in the server – should it be returning JSON in body here rather than a plain string? But as I said at the end of the comment above, I was hesitant to change the server because I’m worried that there may be other RPC clients besides bitcoin-cli that depend on the current (plain string) response. What do you think? If the fix is better in the server, I can do that.

Just for further information, the response returned in this work queue depth exceeded is (status 500 is HTTP_INTERNAL_SERVER_ERROR):

 0(gdb) f 0
 1Breakpoint 1, CallRPC (rh=0x555555845e90, strMethod="getblockhash", args=std::vector of length 1, capacity 2 = {...})
 2    at bitcoin-cli.cpp:406
 3406	    if (!valReply.read(response.body))
 4(gdb) p response
 5$2 = {
 6  status = 500, 
 7  error = -1, 
 8  body = "Work queue depth exceeded"
 9}
10(gdb)

Here’s a successful reply:

0(gdb) p response
1$6 = {
2  status = 200, 
3  error = -1, 
4  body = "{\"result\":\"000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f\",\"error\":null,\"id\":1}\n"
5}

Here’s the response when requesting a nonsense method:

0(gdb) p response
1$2 = {
2  status = 404, 
3  error = -1, 
4  body = "{\"result\":null,\"error\":{\"code\":-32601,\"message\":\"Method not found\"},\"id\":1}\n"
5}

As part of this PR, I could also systematically search for and fix any other places where body is returned as plain string (not JSON); so far I see only one:

0req->WriteReply(HTTP_BAD_METHOD, "JSONRPC server handles only POST requests")

LarryRuane commented at 10:59 pm on March 12, 2020: contributor

The second commit undoes the first, and fixes the problem in the server by having it return JSON in the body of the response:

0error code: -32603
1error message:
2Work queue depth exceeded

(Here’s where that error code, RPC_INTERNAL_ERROR is defined.) It makes sense that this is similar to:

0$ src/bitcoin-cli not-a-valid-rpc
1error code: -32601
2error message:
3Method not found
4$

DrahtBot added the label RPC/REST/ZMQ on Mar 13, 2020

laanwj commented at 9:11 pm on March 17, 2020: member

Ok, looks like his is actually non-trivial.

HTTP_INTERNAL_SERVER_ERROR

0$2 = {
1  status = 500, 
2  error = -1, 
3  body = "Work queue depth exceeded"
4}

So in this case, possibly you could detect this error code in bitcoin-cli, and print an error message instead of trying to parse as JSON. Or are there cases in which it returns 500 but the response needs to be parsed as JSON?

I must admit we’re sometimes inconsistent about whether to report errors at the HTTP layer or the JSON-RPC, so it’s very possible that your solution of changing the server to return a JSON error is the most consistent one!

(I’m not personally very worried about clients depending on a specific exact error message text, though we might want to mention it briefly in the release notes)

laanwj commented at 2:10 pm on March 27, 2020: member

I must admit we’re sometimes inconsistent about whether to report errors at the HTTP layer or the JSON-RPC

I checked this and yes, we do return JSON errors with HTTP_INTERNAL_SERVER_ERROR:

 0static void JSONErrorReply(HTTPRequest* req, const UniValue& objError, const UniValue& id)
 1{
 2    // Send error reply from json-rpc error object
 3    int nStatus = HTTP_INTERNAL_SERVER_ERROR;
 4    int code = find_value(objError, "code").get_int();
 5
 6    if (code == RPC_INVALID_REQUEST)
 7        nStatus = HTTP_BAD_REQUEST;
 8    else if (code == RPC_METHOD_NOT_FOUND)
 9        nStatus = HTTP_NOT_FOUND;
10
11    std::string strReply = JSONRPCReply(NullUniValue, objError, id);
12
13    req->WriteHeader("Content-Type", "application/json");
14    req->WriteReply(nStatus, strReply);
15}

So we have no way to distinguish “parse this as JSON?” based on just the status code.

I think the current change of changing the server to always return JSON errors is OK. I think it’s slightly more elegant than doing “try to parse as JSON and print the raw text otherwise” in the client. But I don’t have a strong opinion.

hebasto commented at 7:39 pm on April 4, 2020: member

@LarryRuane Just was noted about this PR by GH :)

If you find this approach simpler you may grab it :)

LarryRuane force-pushed on Apr 4, 2020

DrahtBot commented at 8:02 pm on April 4, 2020: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

No conflicts as of last run.

LarryRuane commented at 8:04 pm on April 4, 2020: contributor

@hebasto

If you find this approach simpler you may grab it :)

Yes, your solution is much better, thank you! I tested your fix, and cherry-picked your commit to this PR (and force-push-removed my 3 commits).

luke-jr approved

luke-jr commented at 3:21 am on May 5, 2020: member

utACK 4bb892cec2375d919ae66a743e2ad29cd56ba888

util: Handle HTTP_SERVICE_UNAVAILABLE in bitcoin-cli b5a80fa7e4

LarryRuane force-pushed on Jul 12, 2020

LarryRuane commented at 8:54 pm on July 12, 2020: contributor

Rebase (refresh)

hebasto approved

hebasto commented at 4:08 pm on July 14, 2020: member

self-ACK b5a80fa7e487c37b7ac0e3874a2fabade41b9ca8

jonatack commented at 4:48 pm on July 17, 2020: member

Concept ACK. Could you please add functional tests for this in one of the test/functional/interface* files? Say, interface_bitcoin_cli.py (which has a lot of error response testing) or interface_rpc.py (which has a batch request test).

in test/functional/interface_rpc.py:35 in c067589b7a outdated

30+            node.cli('getrpcinfo').send_cli()
31+        except subprocess.CalledProcessError as e:
32+            if e.output != 'error: Server response: Work queue depth exceeded\n':
33+                raise AssertionError("Unexpected cli error: :{}:".format(e.output))
34+            got_exceeded_error = True
35+            break

LarryRuane commented at 3:55 pm on July 21, 2020:

This 400 is, of course, arbitrary, as is the number of threads this test starts (5, see below). These numbers are the result of simple trial-and-error. We want each thread to do enough work that it’s still running by the time all 5 threads are started. Still, the test only takes about 1.3 seconds to run on my system. Increasing the number of threads allows the iteration count here to be reduced, but I don’t want to start too many threads, because each one probably uses a nontrivial amount of resources and I don’t want to overburden the CI workers.

jonatack commented at 12:31 pm on July 22, 2020: member

Tested almost-ACK. Thanks for adding the test! Runs quickly, but one of the Travis jobs failed at assert got_exceeded_error so it may need to be more reliable. Some suggested changes below: add a line of logging, attempt to remove perhaps-unneeded breaks, and PEP8 suggestions from pycodestyle and black – feel free to ignore the style nits. Tested that the test fails on master and passes on this branch.

 0index e0e6fc57aa..cdbed53dd6 100755
 1--- a/test/functional/interface_rpc.py
 2+++ b/test/functional/interface_rpc.py
 3@@ -11,6 +11,7 @@ from test_framework.util import assert_equal, assert_greater_than_or_equal
 4 from threading import Thread
 5 import subprocess
 6 
 7+
 8 def expect_http_status(expected_http_status, expected_rpc_code,
 9                        fcn, *args):
10     try:
11@@ -20,7 +21,10 @@ def expect_http_status(expected_http_status, expected_rpc_code,
12         assert_equal(exc.error["code"], expected_rpc_code)
13         assert_equal(exc.http_status, expected_http_status)
14 
15+
16 got_exceeded_error = False
17+
18+
19 def test_work_queue_getblock(node):
20     global got_exceeded_error
21     for _ in range(400):
22@@ -29,10 +33,9 @@ def test_work_queue_getblock(node):
23         try:
24             node.cli('getrpcinfo').send_cli()
25         except subprocess.CalledProcessError as e:
26-            if e.output != 'error: Server response: Work queue depth exceeded\n':
27-                raise AssertionError("Unexpected cli error: :{}:".format(e.output))
28+            assert_equal(e.output, 'error: Server response: Work queue depth exceeded\n')
29             got_exceeded_error = True
30-            break
31+
32 
33 class RPCInterfaceTest(BitcoinTestFramework):
34     def set_test_params(self):
35@@ -84,19 +87,18 @@ class RPCInterfaceTest(BitcoinTestFramework):
36         expect_http_status(500, -8, self.nodes[0].getblockhash, 42)
37 
38     def test_work_queue_exceeded(self):
39+        self.log.info("Test work queue exceeded")
40         global got_exceeded_error
41-        self.restart_node(0,['-rpcworkqueue=1'])
42+        self.restart_node(0, ['-rpcworkqueue=1'])
43         threads = []
44         for _ in range(5):
45-            t = Thread(target=test_work_queue_getblock, args=(self.nodes[0], ))
46+            t = Thread(target=test_work_queue_getblock, args=(self.nodes[0],))
47             t.start()
48             threads.append(t)
49-            if got_exceeded_error:
50-                break
51         for t in threads:
52             t.join()
53         # This likely just means more threads are needed.
54-        assert(got_exceeded_error)
55+        assert got_exceeded_error
56 
57     def run_test(self):
58         self.test_getrpcinfo()

fjahr commented at 1:04 pm on July 22, 2020: member

Concept ACK

Code looks good at a glance but the test fails on Travis, so I will come back for a more in-depth review when that is fixed :)

LarryRuane force-pushed on Jul 22, 2020

in test/functional/interface_rpc.py:86 in db0d4605b9 outdated

79@@ -67,10 +80,23 @@ def test_http_status_codes(self):
80         expect_http_status(404, -32601, self.nodes[0].invalidmethod)
81         expect_http_status(500, -8, self.nodes[0].getblockhash, 42)
82 
83+    def test_work_queue_exceeded(self):
84+        self.log.info("Test work queue exceeded")
85+        self.restart_node(0, ['-rpcworkqueue=1'])
86+        got_exceeded_error = []

LarryRuane commented at 3:54 pm on July 22, 2020:

This is a list, rather than just a boolean, because a list is a mutable object, and this eliminates the need for the global variable, which seems cleaner. Is it worth a comment explaining that? I tend to think not.

darosior commented at 1:45 pm on October 31, 2020:

nit:

Is it worth a comment explaining that? I tend to think not.

I don’t think so, but it’s a bit awkward for the list to be named like a boolean, maybe exceeded_errors if you have to retouch ?

LarryRuane commented at 3:58 pm on July 22, 2020: contributor

Addressed review suggestions, force-pushed (diff), thank you. The test will run until it succeeds (there’s no longer an arbitrary loop iteration count), or CI will timeout the test (which is extremely unlikely). Starting three threads should be enough, even two works, but the test runs faster with three (greater chance of exceeded the work queue depth limit).

jonatack commented at 6:20 am on July 24, 2020: member

Thanks @LarryRuane for removing the global and the update. The tests appear to be timing out, though it’s unclear.

in test/functional/interface_rpc.py:85 in db0d4605b9 outdated

79@@ -67,10 +80,23 @@ def test_http_status_codes(self):
80         expect_http_status(404, -32601, self.nodes[0].invalidmethod)
81         expect_http_status(500, -8, self.nodes[0].getblockhash, 42)
82 
83+    def test_work_queue_exceeded(self):
84+        self.log.info("Test work queue exceeded")
85+        self.restart_node(0, ['-rpcworkqueue=1'])

hebasto commented at 8:51 am on August 4, 2020:

This would make CI testing more reliable:

0        self.restart_node(0, ["-rpcworkqueue=1", "-rpcthreads=1"])

in test/functional/interface_rpc.py:84 in db0d4605b9 outdated

79@@ -67,10 +80,23 @@ def test_http_status_codes(self):
80         expect_http_status(404, -32601, self.nodes[0].invalidmethod)
81         expect_http_status(500, -8, self.nodes[0].getblockhash, 42)
82 
83+    def test_work_queue_exceeded(self):
84+        self.log.info("Test work queue exceeded")

hebasto commented at 8:53 am on August 4, 2020:

nit: Mind rewording to make it consistent with other log messages

0        self.log.info("Testing work queue exceeded...")

LarryRuane commented at 5:28 pm on August 4, 2020:

Force-pushed to implement review suggestions.

hebasto approved

hebasto commented at 8:56 am on August 4, 2020: member

ACK db0d4605b9e57230a02f8b620f93e5e4d191972c, tested on Linux Mint 20 (x86_64).

https://cirrus-ci.com/task/5913926858702848 timeout seems irrelevant (could be restarted though).

add functional test 8dd5946c0b

LarryRuane force-pushed on Aug 4, 2020

hebasto commented at 6:05 pm on August 4, 2020: member

Travis failure seems related to #19654.

hebasto approved

hebasto commented at 6:25 pm on August 4, 2020: member

re-ACK 8dd5946c0b7aa8f3976b14a5de4ce84b80a9c32a, only suggested changes since my previous review.

fjahr commented at 11:44 am on August 5, 2020: member

tACK 8dd5946c0b7aa8f3976b14a5de4ce84b80a9c32a

Confirmed that the test fails without the fix. Ran test multiple times and seemed stable.

mini-nit if you have to retouch: The commit message of the test commit is too general for our usual conventions.

luke-jr referenced this in commit 6d207af2f6 on Aug 15, 2020

hebasto commented at 8:17 am on August 22, 2020: member

@luke-jr as you’ve incorporated this PR into Bitcoin Knots, mind refreshing your previous ACK to get things done?

luke-jr commented at 2:41 pm on October 13, 2020: member

utACK 8dd5946c0b7aa8f3976b14a5de4ce84b80a9c32a (no changes since previous utACK)

in test/functional/interface_rpc.py:25 in 8dd5946c0b

20@@ -18,6 +21,16 @@ def expect_http_status(expected_http_status, expected_rpc_code,
21         assert_equal(exc.error["code"], expected_rpc_code)
22         assert_equal(exc.http_status, expected_http_status)
23 
24+
25+def test_work_queue_getblock(node, got_exceeded_error):

darosior commented at 1:42 pm on October 31, 2020:

nit: this is called queue_getblock but queues a getrpcinfo ? Maybe queue_rpc_call

darosior approved

darosior commented at 1:48 pm on October 31, 2020: member

ACK 8dd5946c0b7aa8f3976b14a5de4ce84b80a9c32a

Thanks for doing this, going to be helpful downstream :-) Left a few nits only if you have to retouch

in src/httpserver.cpp:272 in 8dd5946c0b

268@@ -269,7 +269,7 @@ static void http_request_cb(struct evhttp_request* req, void* arg)
269             item.release(); /* if true, queue took ownership */
270         else {
271             LogPrintf("WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting\n");
272-            item->req->WriteReply(HTTP_INTERNAL_SERVER_ERROR, "Work queue depth exceeded");
273+            item->req->WriteReply(HTTP_SERVICE_UNAVAILABLE, "Work queue depth exceeded");

MarcoFalke commented at 6:57 pm on March 19, 2021:

Does this need release notes to explain the changed http status code? Just in case a user relies on this?

Just asking, you don’t have to invalidate review. Release note can be added after merge.

LarryRuane commented at 7:15 pm on March 19, 2021:

Good point, I think you’re right, I’ll do it separately, thanks.

MarcoFalke commented at 7:54 pm on March 19, 2021:

You can add them directly to doc/release-notes.md on the master branch

jonatack commented at 7:58 pm on March 19, 2021:

Maybe add a “needs release note” label.

MarcoFalke merged this on Mar 19, 2021

MarcoFalke closed this on Mar 19, 2021

LarryRuane referenced this in commit d7fbab5461 on Mar 20, 2021

sidhujag referenced this in commit 581817cbd5 on Mar 20, 2021

LarryRuane referenced this in commit 71fc3cb65e on Mar 22, 2021

LarryRuane referenced this in commit 169bb257be on Mar 22, 2021

LarryRuane referenced this in commit 51eef4a03e on Mar 23, 2021

MarcoFalke referenced this in commit 11840509fb on Mar 29, 2021

sidhujag referenced this in commit c6261f8300 on Mar 29, 2021

DrahtBot locked this on Aug 16, 2022

bitcoin-cli: print useful error if bitcoind rpc work queue exceeded #18335

Conflicts