This fixes a race between http_request_cb
and StopHTTPServer
where
the work queue is used after release.
Fixes #18856.
Concept ACK, assuming that it does what we do in the python code. Haven’t looked at the code here.
0self.network_event_loop.call_soon_threadsafe(self.network_event_loop.stop)
setting env variable: QT_QPA_PLATFORM=windows ->same warning QWARN : AppTests::appTests() QObject::connect: Cannot queue arguments of type ‘size_t’ (Make sure ‘size_t’ is registered using qRegisterMetaType().)
also on windows 10 & msvc the first test exits silent becoz fs::remove_all(m_path_root); in BasicTestingSetup::~BasicTestingSetup() says cannot remove dir,its not empty, but u can catch with try statement, if u want do more small fixes 👍
setting env variable: QT_QPA_PLATFORM=windows ->same warning QWARN : AppTests::appTests() QObject::connect: Cannot queue arguments of type ‘size_t’ (Make sure ‘size_t’ is registered using qRegisterMetaType().) @tarboss 38e2be4be00914e8be7ae9c7215985c9b481b0ae fixes it right?
at least that warning
********* Start testing of AppTests ********* Config: Using QtTest library 5.9.8, Qt 5.9.8 (x86_64-little_endian-llp64 static debug build; by MSVC 2017) PASS : AppTests::initTestCase() QINFO : AppTests::appTests() Backing up GUI settings to “C:\Users\tester\AppData\Local\Temp\test_common_Bitcoin Core\b516affdfbc77f1640d7e3e97f95db107556a19c4535813c07f5a21862418b89\regtest\guisettings.ini.bak” QDEBUG : AppTests::appTests() requestInitialize : Requesting initialize QDEBUG : AppTests::appTests() initialize : Running initialization in thread QDEBUG : AppTests::appTests() initializeResult : Initialization result: true QINFO : AppTests::appTests() Platform customization: “windows” QDEBUG : AppTests::appTests() TransactionTablePriv::refreshWallet QWARN : AppTests::appTests() QWindowsWindow::setGeometry: Unable to set geometry 751x430+598+336 on QWidgetWindow/‘RPCConsoleWindow’. Resulting geometry: 751x431+598+336 (frame: 8, 31, 8, 8, custom margin: 0, 0, 0, 0, minimum size: 751x365, maximum size: 16777215x16777215). QDEBUG : AppTests::appTests() requestShutdown : Requesting shutdown QDEBUG : AppTests::appTests() shutdown : Running Shutdown in thread QDEBUG : AppTests::appTests() NotifyUnload QDEBUG : AppTests::appTests() shutdown : Shutdown finished
E:\win10dev\bc\bitcoin\build_msvc\x64\Debug\test_bitcoin-qt.exe (process 11012) exited with code 3. To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops. Press any key to close this window . . .
449+ if (eventBase) {
450+ event_base_once(eventBase, -1, EV_TIMEOUT, [](evutil_socket_t, short, void*) {
451+ workQueue->Interrupt();
452+ }, nullptr, nullptr);
453+ } else {
454+ workQueue->Interrupt();
workQueue
is not created if eventBase
couldn’t be created, so it seems this else {}
clause is redundant?
I’m confused as to why the Qt change is in here. Please add an explanation to the opening post, or open a separate PR for it.
ACK 1d39b7cf00f9d4a64d93caa9d47a177d759c2d6a
ACK fec477712f9b37dc16b61a54e037073974633f67
Testing with diff:
0diff --git a/src/httpserver.cpp b/src/httpserver.cpp
1index 5e78fd1d71..fe3a3a8aab 100644
2--- a/src/httpserver.cpp
3+++ b/src/httpserver.cpp
4@@ -264,6 +264,7 @@ static void http_request_cb(struct evhttp_request* req, void* arg)
5 // Dispatch to worker thread
6 if (i != iend) {
7 std::unique_ptr<HTTPWorkItem> item(new HTTPWorkItem(std::move(hreq), path, i->handler));
8+ UninterruptibleSleep(std::chrono::milliseconds{55});
9 assert(workQueue);
10 if (workQueue->Enqueue(item.get()))
11 item.release(); /* if true, queue took ownership */
Before:
0$ ./test/functional/feature_uacomment.py --valgrind
12020-06-06T19:12:04.028000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_rjp_1t0x
22020-06-06T19:12:14.514000Z TestFramework (INFO): test multiple -uacomment
32020-06-06T19:12:25.066000Z TestFramework (INFO): test -uacomment max length
42020-06-06T19:13:04.021000Z TestFramework (INFO): test -uacomment unsafe characters
52020-06-06T19:14:39.777000Z TestFramework (ERROR): Assertion failed
6Traceback (most recent call last):
7 File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 484, in assert_start_raises_init_error
8 self.wait_for_rpc_connection()
9 File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 219, in wait_for_rpc_connection
10 raise FailedToStartError(self._node_msg(
11test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status -6 during initialization
12
13During handling of the above exception, another exception occurred:
14
15Traceback (most recent call last):
16 File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_framework.py", line 119, in main
17 self.run_test()
18 File "./test/functional/feature_uacomment.py", line 36, in run_test
19 self.nodes[0].assert_start_raises_init_error(["-uacomment=" + unsafe_char], expected, match=ErrorMatch.FULL_REGEX)
20 File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 501, in assert_start_raises_init_error
21 self._raise_assertion_error(
22 File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 161, in _raise_assertion_error
23 raise AssertionError(self._node_msg(msg))
24AssertionError: [node 0] Expected message "Error: User Agent comment \(\)\) contains unsafe characters." does not fully match stderr:
25"Error: User Agent comment ()) contains unsafe characters.
26bitcoind: httpserver.cpp:268: void http_request_cb(struct evhttp_request *, void *): Assertion `workQueue' failed."
After:
0$ while ./test/functional/feature_uacomment.py --valgrind ; do true ; done
12020-06-06T19:17:01.430000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_oy_7aci5
22020-06-06T19:17:12.355000Z TestFramework (INFO): test multiple -uacomment
32020-06-06T19:17:23.166000Z TestFramework (INFO): test -uacomment max length
42020-06-06T19:18:02.112000Z TestFramework (INFO): test -uacomment unsafe characters
52020-06-06T19:19:56.602000Z TestFramework (ERROR): Assertion failed
6Traceback (most recent call last):
7 File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 484, in assert_start_raises_init_error
8 self.wait_for_rpc_connection()
9 File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 219, in wait_for_rpc_connection
10 raise FailedToStartError(self._node_msg(
11test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization
12
13During handling of the above exception, another exception occurred:
14
15Traceback (most recent call last):
16 File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_framework.py", line 119, in main
17 self.run_test()
18 File "./test/functional/feature_uacomment.py", line 36, in run_test
19 self.nodes[0].assert_start_raises_init_error(["-uacomment=" + unsafe_char], expected, match=ErrorMatch.FULL_REGEX)
20 File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 501, in assert_start_raises_init_error
21 self._raise_assertion_error(
22 File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 161, in _raise_assertion_error
23 raise AssertionError(self._node_msg(msg))
24AssertionError: [node 0] Expected message "Error: User Agent comment \(🏃\) contains unsafe characters." does not fully match stderr:
25"Error: User Agent comment (🏃) contains unsafe characters.
26==3931735== Invalid read of size 8
27==3931735== at 0x4ACA354: evhttp_request_get_output_buffer (in /usr/lib64/libevent-2.1.so.6.0.2)
28==3931735== by 0x3E67AC: HTTPRequest::WriteReply(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (httpserver.cpp:581)
29==3931735== by 0x3E62C5: HTTPRequest::~HTTPRequest() (httpserver.cpp:526)
30==3931735== by 0x3E9417: operator() (unique_ptr.h:84)
31==3931735== by 0x3E9417: ~unique_ptr (unique_ptr.h:360)
32==3931735== by 0x3E9417: ~HTTPWorkItem (httpserver.cpp:47)
33==3931735== by 0x3E9417: HTTPWorkItem::~HTTPWorkItem() (httpserver.cpp:47)
34==3931735== by 0x3EA42D: operator() (unique_ptr.h:84)
35==3931735== by 0x3EA42D: ~unique_ptr (unique_ptr.h:360)
36==3931735== by 0x3EA42D: _Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > > (stl_construct.h:140)
37==3931735== by 0x3EA42D: __destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *> (stl_construct.h:152)
38==3931735== by 0x3EA42D: _Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *> (stl_construct.h:184)
39==3931735== by 0x3EA42D: _Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > > (alloc_traits.h:738)
40==3931735== by 0x3EA42D: std::deque<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >, std::allocator<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > > >::_M_destroy_data_aux(std::_Deque_iterator<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >&, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >*>, std::_Deque_iterator<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >&, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >*>) (deque.tcc:872)
41==3931735== by 0x3EA1F9: _M_destroy_data (stl_deque.h:2046)
42==3931735== by 0x3EA1F9: std::deque<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >, std::allocator<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > > >::~deque() (stl_deque.h:1004)
43==3931735== by 0x3E5D7D: ~WorkQueue (httpserver.cpp:89)
44==3931735== by 0x3E5D7D: StopHTTPServer() (httpserver.cpp:480)
45==3931735== by 0x14329B: Shutdown(NodeContext&) (init.cpp:190)
46==3931735== by 0x13C627: AppInit (bitcoind.cpp:164)
47==3931735== by 0x13C627: main (bitcoind.cpp:180)
48==3931735== Address 0x1b92c238 is 152 bytes inside a block of size 216 free'd
49==3931735== at 0x483B9F5: free (vg_replace_malloc.c:538)
50==3931735== by 0x4ACA8BC: evhttp_connection_free (in /usr/lib64/libevent-2.1.so.6.0.2)
51==3931735== by 0x4ACACD4: evhttp_free (in /usr/lib64/libevent-2.1.so.6.0.2)
52==3931735== by 0x3E5D41: StopHTTPServer() (httpserver.cpp:472)
53==3931735== by 0x14329B: Shutdown(NodeContext&) (init.cpp:190)
54==3931735== by 0x13C627: AppInit (bitcoind.cpp:164)
55==3931735== by 0x13C627: main (bitcoind.cpp:180)
56==3931735== Block was alloc'd at
57==3931735== at 0x483CAE9: calloc (vg_replace_malloc.c:760)
58==3931735== by 0x4ACA66C: evhttp_request_new (in /usr/lib64/libevent-2.1.so.6.0.2)
59==3931735== by 0x4ACA7AC: ??? (in /usr/lib64/libevent-2.1.so.6.0.2)
60==3931735== by 0x4ACB074: ??? (in /usr/lib64/libevent-2.1.so.6.0.2)
61==3931735== by 0x4AB8850: ??? (in /usr/lib64/libevent-2.1.so.6.0.2)
62==3931735== by 0x4AB04A9: ??? (in /usr/lib64/libevent-2.1.so.6.0.2)
63==3931735== by 0x4AB0BA6: event_base_loop (in /usr/lib64/libevent-2.1.so.6.0.2)
64==3931735== by 0x3E59A6: ThreadHTTP(event_base*) (httpserver.cpp:292)
65==3931735== by 0x3E9D17: __invoke_impl<bool, bool (*)(event_base *), event_base *> (invoke.h:60)
66==3931735== by 0x3E9D17: __invoke<bool (*)(event_base *), event_base *> (invoke.h:95)
67==3931735== by 0x3E9D17: _M_invoke<0, 1> (thread:264)
68==3931735== by 0x3E9D17: operator() (thread:271)
69==3931735== by 0x3E9D17: std::thread::_State_impl<std::thread::_Invoker<std::tuple<bool (*)(event_base*), event_base*> > >::_M_run() (thread:215)
70==3931735== by 0x4C6FAC3: ??? (in /usr/lib64/libstdc++.so.6.0.28)
71==3931735== by 0x48C0431: start_thread (in /usr/lib64/libpthread-2.31.so)
72==3931735== by 0x4FEB9D2: clone (in /usr/lib64/libc-2.31.so)
73==3931735==
74{
75 <insert_a_suppression_name_here>
76 Memcheck:Addr8
77 fun:evhttp_request_get_output_buffer
78 fun:_ZN11HTTPRequest10WriteReplyEiRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
79 fun:_ZN11HTTPRequestD1Ev
80 fun:operator()
81 fun:~unique_ptr
82 fun:~HTTPWorkItem
83 fun:_ZN12HTTPWorkItemD0Ev
84 fun:operator()
85 fun:~unique_ptr
86 fun:_Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > >
87 fun:__destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *>
88 fun:_Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *>
89 fun:_Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > >
90 fun:_ZNSt5dequeISt10unique_ptrI11HTTPClosureSt14default_deleteIS1_EESaIS4_EE19_M_destroy_data_auxESt15_Deque_iteratorIS4_RS4_PS4_ESA_
91 fun:_M_destroy_data
92 fun:_ZNSt5dequeISt10unique_ptrI11HTTPClosureSt14default_deleteIS1_EESaIS4_EED2Ev
93 fun:~WorkQueue
94 fun:_Z14StopHTTPServerv
95 fun:_Z8ShutdownR11NodeContext
96 fun:AppInit
97 fun:main
98}
99==3931735==
100==3931735== Exit program on first error (--exit-on-first-error=yes)"
474@@ -477,6 +475,10 @@ void StopHTTPServer()
475 event_base_free(eventBase);
476 eventBase = nullptr;
477 }
478+ if (workQueue) {
eventHTTP
/ eventBase
is freed it needs to be sure that there are no workers any more, and that there’s nothing in the work queue. After all, work in progress would refer to them.
In latest commit:
So at this point queue is empty and concurrent http_request_cb
fail to enqueue.
This fixes a race between http_request_cb and StopHTTPServer where
the work queue is used after release.
Code review ACK 4e353cb618745cdb5d98e58e7dcd400ded01299a
Couldn’t reproduce it locally per Marco’s instructions right now (some local Valgrind issue so I couldn’t run the tests under it) but this looks like a correct fix for the race to me.
90@@ -91,7 +91,7 @@ class WorkQueue
91 bool Enqueue(WorkItem* item)
92 {
93 LOCK(cs);
94- if (queue.size() >= maxDepth) {
95+ if (!running || queue.size() >= maxDepth) {
This is a necessary change, but because of this it, the following can appear in the debug.log
:
02020-12-14T19:03:41Z WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting
which is harmless, but a bit confusing and definitely misleading. I suggest adding this guard around that LogPrintf
:
0 if (workQueue->Enqueue(item.get()))
1 item.release(); /* if true, queue took ownership */
2 else {
3 if (!ShutdownRequested()) {
4 LogPrintf("WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting\n");
5 }
6 item->req->WriteReply(HTTP_INTERNAL_SERVER_ERROR, "Work queue depth exceeded");
(I did manually test that change.) Or you could change Enqueue()
to return more than just a bool
to indicate the reason the request was rejected, but it’s probably not worth it. Or, you could check ShutdownRequested()
before calling Enqueue()
and there would be no need to check running
in Enqueue
. But I think your change is best because the WorkQueue
object knows if it’s running or not, makes sense to take advantage of that knowledge.
ShutdownRequested
. Maybe just reword the error to something along “work queue full or shutdown in progress”.
106@@ -107,7 +107,7 @@ class WorkQueue
107 WAIT_LOCK(cs, lock);
108 while (running && queue.empty())
109 cond.wait(lock);
110- if (!running)
111+ if (!running && queue.empty())
Good change! This ensures that the queue is empty before exiting the thread, so there are no leftover items in the work queue when the destructor runs; without this, I got the following to appear in debug.log
during testing:
02020-12-14T19:11:40Z ~HTTPRequest: Unhandled request
connection: close
header, see HTTPRequest::WriteReply
- another case I’d like to remove the call to ShutdownRequested
.
Yes, I tested this, and without the changes to Enqueue
and Run
, bitcoin-cli
prints the following:
0error: Could not connect to the server 127.0.0.1:8332 (error code 1 - "EOF reached")
1
2Make sure the bitcoind server is running and that you are connecting to the correct RPC port.
(I think this is the client not getting a response.)
With your changes to Enqueue
and Run
:
0error: couldn't parse reply from server
This is a slight improvement, I guess. This error seems familiar… If I cherry-pick the 2-line change from #18335 then we get:
0error: Server response: Work queue depth exceeded
It’s the same wrong error message that I mentioned for the LogPrintf
:laughing:. I guess that PR should change to say something like work queue depth exceeded or client or shutdown in progress, as you suggested. (Nothing for this needed by your PR here.)
474@@ -477,6 +475,10 @@ void StopHTTPServer()
475 event_base_free(eventBase);
476 eventBase = nullptr;
477 }
478+ if (workQueue) {
479+ delete workQueue;
480+ workQueue = nullptr;
481+ }
Agree that this is the correct fix, because this now occurs after the call to g_thread_http.join()
(just above, line 468). This join ensures that no new items can be enqueued.
Suggestion, you may want to keep all of the workQueue
shutdown code together and move it to just after the call to g_thread_http.join()
, that is, after line 469 (with the current line numbering, in other words just before if (eventHTTP)
), instead of having it split between two places (beginning at line 453 above, and here). I don’t think there’s any reason those worker threads can’t live a little longer (and there’s no harm that I can think of). This would follow the pattern of tearing down the higher-level stuff before the lower-level stuff (not doing that is the source of this bug in the first place!).
g_thread_http.join()
there can be stuff pending on the event base, and if you delete the workQueue
before the event base finishes then all pending workItems
will be destroyed (because they are unique_ptr
) and then it’s only problems from here on.
Ah, okay, I didn’t understand about the event base; so maybe we could do all the shutdown (deconstruct) of the workQueue
down here, where you now have the delete
. Would this work?
0void StopHTTPServer()
1{
2 LogPrint(BCLog::HTTP, "Stopping HTTP server\n");
3 // Unlisten sockets, these are what make the event loop running, which means
4 // that after this and all connections are closed the event loop will quit.
5 for (evhttp_bound_socket *socket : boundSockets) {
6 evhttp_del_accept_socket(eventHTTP, socket);
7 }
8 boundSockets.clear();
9 if (eventBase) {
10 LogPrint(BCLog::HTTP, "Waiting for HTTP event thread to exit\n");
11 if (g_thread_http.joinable()) g_thread_http.join();
12 }
13 if (eventHTTP) {
14 evhttp_free(eventHTTP);
15 eventHTTP = nullptr;
16 }
17 if (eventBase) {
18 event_base_free(eventBase);
19 eventBase = nullptr;
20 }
21 if (workQueue) {
22 LogPrint(BCLog::HTTP, "Waiting for HTTP worker threads to exit\n");
23 for (auto& thread: g_thread_http_workers) {
24 thread.join();
25 }
26 g_thread_http_workers.clear();
27 delete workQueue;
28 workQueue = nullptr;
29 }
30 LogPrint(BCLog::HTTP, "Stopped HTTP server\n");
31}
(That seems slightly simpler and clearer to me, but the way you have it now is good.) If you make changes, I don’t mind taking another look (and testing) and quickly giving another ack.
I’m able to reproduce the problem consistently by adding the artificial delay suggested by @MarcoFalke and then starting a constant stream of RPC requests (of any kind):
0$ while src/bitcoin-cli listbanned ; do :; done
and then stopping bitcoind
with either control-c or the stop
RPC.
0bitcoind: httpserver.cpp:271: void http_request_cb(evhttp_request*, void*): Assertion `workQueue' failed.
1Aborted (core dumped)
I verified that with this PR, this doesn’t happen. Nice PR! My suggestions are optional, although guarding the LogPrintf
is probably worth doing.
ACK 4e353cb618745cdb5d98e58e7dcd400ded01299a
ACK 4e353cb618745cdb5d98e58e7dcd400ded01299a, tested (rebased on top of master 9313c4e6aa4b707c06a86b33d5d2753cd8383340) on Linux Mint 20.1 (x86_64) using MarcoFalke’s patch, including different -rpcthreads
/-rpcworkqueue
cases. The bug is fixed. The code is correct.
Agree, that the message “error: Server response: Work queue depth exceeded” should be reworded as it covers two different cases now.
Edit (laanwj): removed the @ in the ACK line
ACK 4e353cb618745cdb5d98e58e7dcd400ded01299a
Reviewed the code and it makes sense. However I was unable to reproduce the race condition so I cannot test this.
promag
MarcoFalke
tarboss
laanwj
hebasto
fjahr
LarryRuane
achow101
Labels
RPC/REST/ZMQ
Milestone
22.0