This fixes a race between http_request_cb and StopHTTPServer where
the work queue is used after release.
Fixes #18856.
@MarcoFalke can't find the issue where this was discussed 😞
Thanks! I was filtering for your issues...
Concept ACK, assuming that it does what we do in the python code. Haven't looked at the code here.
self.network_event_loop.call_soon_threadsafe(self.network_event_loop.stop)
setting env variable: QT_QPA_PLATFORM=windows ->same warning QWARN : AppTests::appTests() QObject::connect: Cannot queue arguments of type 'size_t' (Make sure 'size_t' is registered using qRegisterMetaType().)
also on windows 10 & msvc the first test exits silent becoz fs::remove_all(m_path_root); in BasicTestingSetup::~BasicTestingSetup() says cannot remove dir,its not empty, but u can catch with try statement, if u want do more small fixes 👍
setting env variable: QT_QPA_PLATFORM=windows ->same warning QWARN : AppTests::appTests() QObject::connect: Cannot queue arguments of type 'size_t' (Make sure 'size_t' is registered using qRegisterMetaType().) @tarboss 38e2be4be00914e8be7ae9c7215985c9b481b0ae fixes it right?
at least that warning
********* Start testing of AppTests ********* Config: Using QtTest library 5.9.8, Qt 5.9.8 (x86_64-little_endian-llp64 static debug build; by MSVC 2017) PASS : AppTests::initTestCase() QINFO : AppTests::appTests() Backing up GUI settings to "C:\Users\tester\AppData\Local\Temp\test_common_Bitcoin Core\b516affdfbc77f1640d7e3e97f95db107556a19c4535813c07f5a21862418b89\regtest\guisettings.ini.bak" QDEBUG : AppTests::appTests() requestInitialize : Requesting initialize QDEBUG : AppTests::appTests() initialize : Running initialization in thread QDEBUG : AppTests::appTests() initializeResult : Initialization result: true QINFO : AppTests::appTests() Platform customization: "windows" QDEBUG : AppTests::appTests() TransactionTablePriv::refreshWallet QWARN : AppTests::appTests() QWindowsWindow::setGeometry: Unable to set geometry 751x430+598+336 on QWidgetWindow/'RPCConsoleWindow'. Resulting geometry: 751x431+598+336 (frame: 8, 31, 8, 8, custom margin: 0, 0, 0, 0, minimum size: 751x365, maximum size: 16777215x16777215). QDEBUG : AppTests::appTests() requestShutdown : Requesting shutdown QDEBUG : AppTests::appTests() shutdown : Running Shutdown in thread QDEBUG : AppTests::appTests() NotifyUnload QDEBUG : AppTests::appTests() shutdown : Shutdown finished
E:\win10dev\bc\bitcoin\build_msvc\x64\Debug\test_bitcoin-qt.exe (process 11012) exited with code 3. To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops. Press any key to close this window . . .
449 | + if (eventBase) { 450 | + event_base_once(eventBase, -1, EV_TIMEOUT, [](evutil_socket_t, short, void*) { 451 | + workQueue->Interrupt(); 452 | + }, nullptr, nullptr); 453 | + } else { 454 | + workQueue->Interrupt();
workQueue is not created if eventBase couldn't be created, so it seems this else {} clause is redundant?
This might result in a connection reset by peer or something like that because one request job can be enqueued but not handled (no response sent).
I'm confused as to why the Qt change is in here. Please add an explanation to the opening post, or open a separate PR for it.
ACK 1d39b7cf00f9d4a64d93caa9d47a177d759c2d6a
ACK fec477712f9b37dc16b61a54e037073974633f67
@laanwj see #19033 (comment). I can drop the commit, again.
I prefer a new pull as well. The gui test thing doesn't need backport
Done, and rebased.
ACK fec477712f9b37dc16b61a54e037073974633f67, I have not tested the code, but I have reviewed it and it looks OK, I agree it can be merged.
Testing with diff:
diff --git a/src/httpserver.cpp b/src/httpserver.cpp
index 5e78fd1d71..fe3a3a8aab 100644
--- a/src/httpserver.cpp
+++ b/src/httpserver.cpp
@@ -264,6 +264,7 @@ static void http_request_cb(struct evhttp_request* req, void* arg)
// Dispatch to worker thread
if (i != iend) {
std::unique_ptr<HTTPWorkItem> item(new HTTPWorkItem(std::move(hreq), path, i->handler));
+ UninterruptibleSleep(std::chrono::milliseconds{55});
assert(workQueue);
if (workQueue->Enqueue(item.get()))
item.release(); /* if true, queue took ownership */
Before:
$ ./test/functional/feature_uacomment.py --valgrind
2020-06-06T19:12:04.028000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_rjp_1t0x
2020-06-06T19:12:14.514000Z TestFramework (INFO): test multiple -uacomment
2020-06-06T19:12:25.066000Z TestFramework (INFO): test -uacomment max length
2020-06-06T19:13:04.021000Z TestFramework (INFO): test -uacomment unsafe characters
2020-06-06T19:14:39.777000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 484, in assert_start_raises_init_error
self.wait_for_rpc_connection()
File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 219, in wait_for_rpc_connection
raise FailedToStartError(self._node_msg(
test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status -6 during initialization
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_framework.py", line 119, in main
self.run_test()
File "./test/functional/feature_uacomment.py", line 36, in run_test
self.nodes[0].assert_start_raises_init_error(["-uacomment=" + unsafe_char], expected, match=ErrorMatch.FULL_REGEX)
File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 501, in assert_start_raises_init_error
self._raise_assertion_error(
File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 161, in _raise_assertion_error
raise AssertionError(self._node_msg(msg))
AssertionError: [node 0] Expected message "Error: User Agent comment \(\)\) contains unsafe characters." does not fully match stderr:
"Error: User Agent comment ()) contains unsafe characters.
bitcoind: httpserver.cpp:268: void http_request_cb(struct evhttp_request *, void *): Assertion `workQueue' failed."
After:
$ while ./test/functional/feature_uacomment.py --valgrind ; do true ; done
2020-06-06T19:17:01.430000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_oy_7aci5
2020-06-06T19:17:12.355000Z TestFramework (INFO): test multiple -uacomment
2020-06-06T19:17:23.166000Z TestFramework (INFO): test -uacomment max length
2020-06-06T19:18:02.112000Z TestFramework (INFO): test -uacomment unsafe characters
2020-06-06T19:19:56.602000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 484, in assert_start_raises_init_error
self.wait_for_rpc_connection()
File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 219, in wait_for_rpc_connection
raise FailedToStartError(self._node_msg(
test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_framework.py", line 119, in main
self.run_test()
File "./test/functional/feature_uacomment.py", line 36, in run_test
self.nodes[0].assert_start_raises_init_error(["-uacomment=" + unsafe_char], expected, match=ErrorMatch.FULL_REGEX)
File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 501, in assert_start_raises_init_error
self._raise_assertion_error(
File "/home/marco/workspace/btc_bitcoin_core/test/functional/test_framework/test_node.py", line 161, in _raise_assertion_error
raise AssertionError(self._node_msg(msg))
AssertionError: [node 0] Expected message "Error: User Agent comment \(🏃\) contains unsafe characters." does not fully match stderr:
"Error: User Agent comment (🏃) contains unsafe characters.
==3931735== Invalid read of size 8
==3931735== at 0x4ACA354: evhttp_request_get_output_buffer (in /usr/lib64/libevent-2.1.so.6.0.2)
==3931735== by 0x3E67AC: HTTPRequest::WriteReply(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (httpserver.cpp:581)
==3931735== by 0x3E62C5: HTTPRequest::~HTTPRequest() (httpserver.cpp:526)
==3931735== by 0x3E9417: operator() (unique_ptr.h:84)
==3931735== by 0x3E9417: ~unique_ptr (unique_ptr.h:360)
==3931735== by 0x3E9417: ~HTTPWorkItem (httpserver.cpp:47)
==3931735== by 0x3E9417: HTTPWorkItem::~HTTPWorkItem() (httpserver.cpp:47)
==3931735== by 0x3EA42D: operator() (unique_ptr.h:84)
==3931735== by 0x3EA42D: ~unique_ptr (unique_ptr.h:360)
==3931735== by 0x3EA42D: _Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > > (stl_construct.h:140)
==3931735== by 0x3EA42D: __destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *> (stl_construct.h:152)
==3931735== by 0x3EA42D: _Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *> (stl_construct.h:184)
==3931735== by 0x3EA42D: _Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > > (alloc_traits.h:738)
==3931735== by 0x3EA42D: std::deque<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >, std::allocator<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > > >::_M_destroy_data_aux(std::_Deque_iterator<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >&, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >*>, std::_Deque_iterator<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >&, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >*>) (deque.tcc:872)
==3931735== by 0x3EA1F9: _M_destroy_data (stl_deque.h:2046)
==3931735== by 0x3EA1F9: std::deque<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> >, std::allocator<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > > >::~deque() (stl_deque.h:1004)
==3931735== by 0x3E5D7D: ~WorkQueue (httpserver.cpp:89)
==3931735== by 0x3E5D7D: StopHTTPServer() (httpserver.cpp:480)
==3931735== by 0x14329B: Shutdown(NodeContext&) (init.cpp:190)
==3931735== by 0x13C627: AppInit (bitcoind.cpp:164)
==3931735== by 0x13C627: main (bitcoind.cpp:180)
==3931735== Address 0x1b92c238 is 152 bytes inside a block of size 216 free'd
==3931735== at 0x483B9F5: free (vg_replace_malloc.c:538)
==3931735== by 0x4ACA8BC: evhttp_connection_free (in /usr/lib64/libevent-2.1.so.6.0.2)
==3931735== by 0x4ACACD4: evhttp_free (in /usr/lib64/libevent-2.1.so.6.0.2)
==3931735== by 0x3E5D41: StopHTTPServer() (httpserver.cpp:472)
==3931735== by 0x14329B: Shutdown(NodeContext&) (init.cpp:190)
==3931735== by 0x13C627: AppInit (bitcoind.cpp:164)
==3931735== by 0x13C627: main (bitcoind.cpp:180)
==3931735== Block was alloc'd at
==3931735== at 0x483CAE9: calloc (vg_replace_malloc.c:760)
==3931735== by 0x4ACA66C: evhttp_request_new (in /usr/lib64/libevent-2.1.so.6.0.2)
==3931735== by 0x4ACA7AC: ??? (in /usr/lib64/libevent-2.1.so.6.0.2)
==3931735== by 0x4ACB074: ??? (in /usr/lib64/libevent-2.1.so.6.0.2)
==3931735== by 0x4AB8850: ??? (in /usr/lib64/libevent-2.1.so.6.0.2)
==3931735== by 0x4AB04A9: ??? (in /usr/lib64/libevent-2.1.so.6.0.2)
==3931735== by 0x4AB0BA6: event_base_loop (in /usr/lib64/libevent-2.1.so.6.0.2)
==3931735== by 0x3E59A6: ThreadHTTP(event_base*) (httpserver.cpp:292)
==3931735== by 0x3E9D17: __invoke_impl<bool, bool (*)(event_base *), event_base *> (invoke.h:60)
==3931735== by 0x3E9D17: __invoke<bool (*)(event_base *), event_base *> (invoke.h:95)
==3931735== by 0x3E9D17: _M_invoke<0, 1> (thread:264)
==3931735== by 0x3E9D17: operator() (thread:271)
==3931735== by 0x3E9D17: std::thread::_State_impl<std::thread::_Invoker<std::tuple<bool (*)(event_base*), event_base*> > >::_M_run() (thread:215)
==3931735== by 0x4C6FAC3: ??? (in /usr/lib64/libstdc++.so.6.0.28)
==3931735== by 0x48C0431: start_thread (in /usr/lib64/libpthread-2.31.so)
==3931735== by 0x4FEB9D2: clone (in /usr/lib64/libc-2.31.so)
==3931735==
{
<insert_a_suppression_name_here>
Memcheck:Addr8
fun:evhttp_request_get_output_buffer
fun:_ZN11HTTPRequest10WriteReplyEiRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
fun:_ZN11HTTPRequestD1Ev
fun:operator()
fun:~unique_ptr
fun:~HTTPWorkItem
fun:_ZN12HTTPWorkItemD0Ev
fun:operator()
fun:~unique_ptr
fun:_Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > >
fun:__destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *>
fun:_Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *>
fun:_Destroy<std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > *, std::unique_ptr<HTTPClosure, std::default_delete<HTTPClosure> > >
fun:_ZNSt5dequeISt10unique_ptrI11HTTPClosureSt14default_deleteIS1_EESaIS4_EE19_M_destroy_data_auxESt15_Deque_iteratorIS4_RS4_PS4_ESA_
fun:_M_destroy_data
fun:_ZNSt5dequeISt10unique_ptrI11HTTPClosureSt14default_deleteIS1_EESaIS4_EED2Ev
fun:~WorkQueue
fun:_Z14StopHTTPServerv
fun:_Z8ShutdownR11NodeContext
fun:AppInit
fun:main
}
==3931735==
==3931735== Exit program on first error (--exit-on-first-error=yes)"
Thanks for these instructions @MarcoFalke.
474 | @@ -477,6 +475,10 @@ void StopHTTPServer() 475 | event_base_free(eventBase); 476 | eventBase = nullptr; 477 | } 478 | + if (workQueue) {
I don't think this is the correct fix. By the time eventHTTP / eventBase is freed it needs to be sure that there are no workers any more, and that there's nothing in the work queue. After all, work in progress would refer to them.
In latest commit:
So at this point queue is empty and concurrent http_request_cb fail to enqueue.
@MarcoFalke updated, see #19033 (review).
Please remove tag "waiting for author".
Concept ACK. Fixing bugs is good, but I can't review this. Happy to test some time
Ops, bad rebase! Should have db867b59d9ce29bcc2c872986085d99af4d1bcfc rebased. Will fix..
This fixes a race between http_request_cb and StopHTTPServer where
the work queue is used after release.
Code review ACK 4e353cb618745cdb5d98e58e7dcd400ded01299a
Couldn't reproduce it locally per Marco's instructions right now (some local Valgrind issue so I couldn't run the tests under it) but this looks like a correct fix for the race to me.
Ping.
90 | @@ -91,7 +91,7 @@ class WorkQueue 91 | bool Enqueue(WorkItem* item) 92 | { 93 | LOCK(cs); 94 | - if (queue.size() >= maxDepth) { 95 | + if (!running || queue.size() >= maxDepth) {
This is a necessary change, but because of this it, the following can appear in the debug.log:
2020-12-14T19:03:41Z WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting
which is harmless, but a bit confusing and definitely misleading. I suggest adding this guard around that LogPrintf:
if (workQueue->Enqueue(item.get()))
item.release(); /* if true, queue took ownership */
else {
if (!ShutdownRequested()) {
LogPrintf("WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting\n");
}
item->req->WriteReply(HTTP_INTERNAL_SERVER_ERROR, "Work queue depth exceeded");
(I did manually test that change.) Or you could change Enqueue() to return more than just a bool to indicate the reason the request was rejected, but it's probably not worth it. Or, you could check ShutdownRequested() before calling Enqueue() and there would be no need to check running in Enqueue. But I think your change is best because the WorkQueue object knows if it's running or not, makes sense to take advantage of that knowledge.
Well in that case the RPC error is also misleading. I'd like to avoid more calls to ShutdownRequested. Maybe just reword the error to something along "work queue full or shutdown in progress".
106 | @@ -107,7 +107,7 @@ class WorkQueue 107 | WAIT_LOCK(cs, lock); 108 | while (running && queue.empty()) 109 | cond.wait(lock); 110 | - if (!running) 111 | + if (!running && queue.empty())
Good change! This ensures that the queue is empty before exiting the thread, so there are no leftover items in the work queue when the destructor runs; without this, I got the following to appear in debug.log during testing:
2020-12-14T19:11:40Z ~HTTPRequest: Unhandled request
Right, I would have to test to make sure but I think without this the client wouldn't get a response. I also think that very last response would now have connection: close header, see HTTPRequest::WriteReply - another case I'd like to remove the call to ShutdownRequested.
Yes, I tested this, and without the changes to Enqueue and Run, bitcoin-cli prints the following:
error: Could not connect to the server 127.0.0.1:8332 (error code 1 - "EOF reached")
Make sure the bitcoind server is running and that you are connecting to the correct RPC port.
(I think this is the client not getting a response.)
With your changes to Enqueue and Run:
error: couldn't parse reply from server
This is a slight improvement, I guess. This error seems familiar... If I cherry-pick the 2-line change from #18335 then we get:
error: Server response: Work queue depth exceeded
It's the same wrong error message that I mentioned for the LogPrintf :laughing:. I guess that PR should change to say something like work queue depth exceeded or client or shutdown in progress, as you suggested. (Nothing for this needed by your PR here.)
474 | @@ -477,6 +475,10 @@ void StopHTTPServer() 475 | event_base_free(eventBase); 476 | eventBase = nullptr; 477 | } 478 | + if (workQueue) { 479 | + delete workQueue; 480 | + workQueue = nullptr; 481 | + }
Agree that this is the correct fix, because this now occurs after the call to g_thread_http.join() (just above, line 468). This join ensures that no new items can be enqueued.
Suggestion, you may want to keep all of the workQueue shutdown code together and move it to just after the call to g_thread_http.join(), that is, after line 469 (with the current line numbering, in other words just before if (eventHTTP)), instead of having it split between two places (beginning at line 453 above, and here). I don't think there's any reason those worker threads can't live a little longer (and there's no harm that I can think of). This would follow the pattern of tearing down the higher-level stuff before the lower-level stuff (not doing that is the source of this bug in the first place!).
It can't be 😄 this is the goal of this PR - destroy the work queue after event base finish. After g_thread_http.join() there can be stuff pending on the event base, and if you delete the workQueue before the event base finishes then all pending workItems will be destroyed (because they are unique_ptr) and then it's only problems from here on.
Ah, okay, I didn't understand about the event base; so maybe we could do all the shutdown (deconstruct) of the workQueue down here, where you now have the delete. Would this work?
<details> <summary>suggestion</summary>
void StopHTTPServer()
{
LogPrint(BCLog::HTTP, "Stopping HTTP server\n");
// Unlisten sockets, these are what make the event loop running, which means
// that after this and all connections are closed the event loop will quit.
for (evhttp_bound_socket *socket : boundSockets) {
evhttp_del_accept_socket(eventHTTP, socket);
}
boundSockets.clear();
if (eventBase) {
LogPrint(BCLog::HTTP, "Waiting for HTTP event thread to exit\n");
if (g_thread_http.joinable()) g_thread_http.join();
}
if (eventHTTP) {
evhttp_free(eventHTTP);
eventHTTP = nullptr;
}
if (eventBase) {
event_base_free(eventBase);
eventBase = nullptr;
}
if (workQueue) {
LogPrint(BCLog::HTTP, "Waiting for HTTP worker threads to exit\n");
for (auto& thread: g_thread_http_workers) {
thread.join();
}
g_thread_http_workers.clear();
delete workQueue;
workQueue = nullptr;
}
LogPrint(BCLog::HTTP, "Stopped HTTP server\n");
}
</details>
(That seems slightly simpler and clearer to me, but the way you have it now is good.) If you make changes, I don't mind taking another look (and testing) and quickly giving another ack.
I'm able to reproduce the problem consistently by adding the artificial delay suggested by @MarcoFalke and then starting a constant stream of RPC requests (of any kind):
$ while src/bitcoin-cli listbanned ; do :; done
and then stopping bitcoind with either control-c or the stop RPC.
bitcoind: httpserver.cpp:271: void http_request_cb(evhttp_request*, void*): Assertion `workQueue' failed.
Aborted (core dumped)
I verified that with this PR, this doesn't happen. Nice PR! My suggestions are optional, although guarding the LogPrintf is probably worth doing.
ACK 4e353cb618745cdb5d98e58e7dcd400ded01299a
Thanks for reviewing @LarryRuane! I'll hold on your suggestion to keep your and @fjahr ACK. I also think the error and log messages can be improved in a follow to avoid dragging this. In any case I hope my above comments are clear to you, otherwise let me know. @hebasto mind to test again? It would be great to push this early on the release cycle.
ACK 4e353cb618745cdb5d98e58e7dcd400ded01299a, tested (rebased on top of master 9313c4e6aa4b707c06a86b33d5d2753cd8383340) on Linux Mint 20.1 (x86_64) using MarcoFalke's patch, including different -rpcthreads/-rpcworkqueue cases. The bug is fixed. The code is correct.
Agree, that the message "error: Server response: Work queue depth exceeded" should be reworded as it covers two different cases now.
Edit (laanwj): removed the @ in the ACK line
ACK 4e353cb618745cdb5d98e58e7dcd400ded01299a
Reviewed the code and it makes sense. However I was unable to reproduce the race condition so I cannot test this.
(re-)ACK 4e353cb618745cdb5d98e58e7dcd400ded01299a Tested again rebased on master d75a1df6178aabf6f19271786e48bc5d57eddb54, Ubuntu 20.04, was able to reproduce the bug using the one-line patch Marco suggested, and the PR still fixes it.