This revives #19420. Since promag is not very active at the moment, I am picking this up to finally get it merged.
The PR is rebased and comments by jonatack have been addressed.
Once this is merged, I will also reopen #19434.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
See the guideline for information on the review process.
| Type | Reviewers |
|---|---|
| ACK | stickies-v, hebasto |
| Concept ACK | brunoerg, promag |
| Stale ACK | kevkevinpal |
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Running test/functional/feature_abortnode.py before and after the change shows a significant performance gain.
Before the change (f3bc1a72825fe2b51f4bc20e004cef464f05b965):

```
test/functional/feature_abortnode.py
2022-12-23T04:47:19.826000Z TestFramework (INFO): Initializing test directory /var/folders/9g/cvx014yx4dq5lwl_x5zvn_j80000gn/T/bitcoin_func_test_k3_6xric
2022-12-23T04:47:21.229000Z TestFramework (INFO): Waiting for crash
2022-12-23T04:47:51.429000Z TestFramework (INFO): Node crashed - now verifying restart fails
2022-12-23T04:47:52.086000Z TestFramework (INFO): Stopping nodes
2022-12-23T04:47:52.300000Z TestFramework (INFO): Cleaning up /var/folders/9g/cvx014yx4dq5lwl_x5zvn_j80000gn/T/bitcoin_func_test_k3_6xric on exit
2022-12-23T04:47:52.300000Z TestFramework (INFO): Tests successful
32.50 seconds to run test
```
With the changes applied (40528e7):

```
test/functional/feature_abortnode.py
2022-12-23T04:44:30.075000Z TestFramework (INFO): Initializing test directory /var/folders/9g/cvx014yx4dq5lwl_x5zvn_j80000gn/T/bitcoin_func_test_0h3w_rbx
2022-12-23T04:44:31.516000Z TestFramework (INFO): Waiting for crash
2022-12-23T04:44:31.837000Z TestFramework (INFO): Node crashed - now verifying restart fails
2022-12-23T04:44:32.452000Z TestFramework (INFO): Stopping nodes
2022-12-23T04:44:32.727000Z TestFramework (INFO): Cleaning up /var/folders/9g/cvx014yx4dq5lwl_x5zvn_j80000gn/T/bitcoin_func_test_0h3w_rbx on exit
2022-12-23T04:44:32.727000Z TestFramework (INFO): Tests successful
1.2 seconds to run test
```
Concept ACK - thanks for reviving this!
Concept ACK
```diff
@@ -146,6 +148,10 @@ static GlobalMutex g_httppathhandlers_mutex;
 static std::vector<HTTPPathHandler> pathHandlers GUARDED_BY(g_httppathhandlers_mutex);
 //! Bound listening sockets
 static std::vector<evhttp_bound_socket *> boundSockets;
+//! Track active requests
+static Mutex g_requests_mutex;
```
Maybe a GlobalMutex would be more appropriate here?
```cpp
static GlobalMutex g_requests_mutex;
```
I'm still newish to the repo, is there a convention on when we should use Mutex vs GlobalMutex?
> I'm still newish to the repo, is there a convention on when we should use Mutex vs GlobalMutex?
It depends on a scope. For the global namespace the GlobalMutex is an appropriate option: https://github.com/bitcoin/bitcoin/blob/f3bc1a72825fe2b51f4bc20e004cef464f05b965/src/sync.h#L132-L141
> Maybe a `GlobalMutex` would be more appropriate here?
Applying this suggestion also requires fixing the following errors:
```
httpserver.cpp:221:13: error: calling function 'MaybeCheckNotHeld' requires negative capability '!g_requests_mutex' [-Werror,-Wthread-safety-analysis]
            LOCK(g_requests_mutex);
            ^
./sync.h:258:56: note: expanded from macro 'LOCK'
#define LOCK(cs) UniqueLock UNIQUE_NAME(criticalblock)(MaybeCheckNotHeld(cs), #cs, __FILE__, __LINE__)
                                                       ^
httpserver.cpp:218:9: error: calling function 'MaybeCheckNotHeld' requires negative capability '!g_requests_mutex' [-Werror,-Wthread-safety-analysis]
        LOCK(g_requests_mutex);
        ^
./sync.h:258:56: note: expanded from macro 'LOCK'
#define LOCK(cs) UniqueLock UNIQUE_NAME(criticalblock)(MaybeCheckNotHeld(cs), #cs, __FILE__, __LINE__)
                                                       ^
httpserver.cpp:481:9: error: calling function 'MaybeCheckNotHeld' requires negative capability '!g_requests_mutex' [-Werror,-Wthread-safety-analysis]
        WAIT_LOCK(g_requests_mutex, lock);
        ^
./sync.h:263:45: note: expanded from macro 'WAIT_LOCK'
#define WAIT_LOCK(cs, name) UniqueLock name(MaybeCheckNotHeld(cs), #cs, __FILE__, __LINE__)
                                            ^
3 errors generated.
```
Done, errors are fixed. Thanks!
Concept ACK :trollface:
```diff
+        while (!g_requests.empty()) {
+            g_requests_cv.wait(lock);
+        }
+    }
+    if (eventHTTP) {
+        event_base_once(eventBase, -1, EV_TIMEOUT, [](evutil_socket_t, short, void*) {
```
maybe include a comment explaining this, see #19420 (review). At this point, it is possible that the eventHTTP struct is still being used on g_thread_http.
done, I hope that covers it.
```diff
+        g_requests.insert(req);
+        evhttp_request_set_on_complete_cb(req, [](struct evhttp_request* req, void*) {
+            LOCK(g_requests_mutex);
+            auto n = g_requests.erase(req);
+            assert(n == 1);
+            if (g_requests.empty()) g_requests_cv.notify_all();
```
maybe always call g_requests_cv.notify_all() and update comment above.
why would we want to always call .notify_all() if we are waiting for all the http requests to finish?
because that way you would get a log every time a connection closes. Otherwise g_requests_cv should be renamed to g_requests_empty_cv.
If g_requests_cv is used to signal changes to g_requests then also add call .notify_all() after g_requests.insert(req).
done, I have opted for always calling it
```diff
@@ -459,14 +477,22 @@ void StopHTTPServer()
         evhttp_del_accept_socket(eventHTTP, socket);
     }
     boundSockets.clear();
+    {
+        WAIT_LOCK(g_requests_mutex, lock);
+        while (!g_requests.empty()) {
+            g_requests_cv.wait(lock);
```
maybe log "Waiting for %d requests to stop HTTP server" or something like that.
done
```diff
+        WAIT_LOCK(g_requests_mutex, lock);
+        if (!g_requests.empty()) {
+            LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", g_requests.size());
+        }
+        while (!g_requests.empty()) {
+            g_requests_cv.wait(lock);
```
my suggestion was to add the log here in the loop.
Yeah, I think doing it in the loop essentially creates a countdown that I didn't think was necessary. But I am kind of indifferent on this, so I have changed it now.
retACK f95063e with test/functional/feature_abortnode.py
From #15363 (comment):
The libevent HTTP interface is kind of crazy, so I don't think it's surprising that fixing this requires some trial and error. But it would be good to have tests for each issue that gets fixed to avoid going in circles.
Suggesting to add a test which will prevent any regressions in the future.
```diff
+        LOCK(g_requests_mutex);
+        g_requests.insert(req);
+        g_requests_cv.notify_all();
+        evhttp_request_set_on_complete_cb(req, [](struct evhttp_request* req, void*) {
+            LOCK(g_requests_mutex);
+            auto n = g_requests.erase(req);
```
Calling notify_all() while holding a mutex is a pessimization. Suggesting to apply a diff:
```cpp
auto n = WITH_LOCK(g_requests_mutex, return g_requests.erase(req));
```
done
```diff
@@ -207,6 +213,19 @@ std::string RequestMethodString(HTTPRequest::RequestMethod m)
 /** HTTP request callback */
 static void http_request_cb(struct evhttp_request* req, void* arg)
 {
+    // Track requests and notify when a request is completed.
+    {
+        LOCK(g_requests_mutex);
+        g_requests.insert(req);
```
Calling notify_all() while holding a mutex is a pessimization. Suggesting to apply a diff:
```cpp
WITH_LOCK(g_requests_mutex, g_requests.insert(req));
```
or to introduce a new scope.
done
```diff
+        if (!g_requests.empty()) {
+            LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", g_requests.size());
+        }
+        while (!g_requests.empty()) {
+            g_requests_cv.wait(lock);
+        }
```
While this code snippet is used across the current codebase, for new code I'd prefer another wait() overload with an "inlined" stop_waiting condition:
```cpp
g_requests_cv.wait(lock, []() EXCLUSIVE_LOCKS_REQUIRED(g_requests_mutex) { return g_requests.empty(); });
```
done
```diff
@@ -146,6 +148,10 @@ static GlobalMutex g_httppathhandlers_mutex;
 static std::vector<HTTPPathHandler> pathHandlers GUARDED_BY(g_httppathhandlers_mutex);
 //! Bound listening sockets
 static std::vector<evhttp_bound_socket *> boundSockets;
+//! Track active requests
+static GlobalMutex g_requests_mutex;
+static std::condition_variable g_requests_cv;
+static std::set<evhttp_request*> g_requests GUARDED_BY(g_requests_mutex);
```
Is any particular reason for not using std::unordered_set which has more efficient insert() and erase() methods?
No reason, should be a small set anyway, both are fine IMHO.
done
```diff
     }
-    if (eventHTTP) {
-        evhttp_free(eventHTTP);
-        eventHTTP = nullptr;
-    }
     if (eventBase) {
```
Combine two if (eventBase) blocks into a single one?
done
From #15363 (comment):
The libevent HTTP interface is kind of crazy, so I don't think it's surprising that fixing this requires some trial and error. But it would be good to have tests for each issue that gets fixed to avoid going in circles.
Suggesting to add a test which will prevent any regressions in the future. @hebasto like https://github.com/bitcoin/bitcoin/blob/7bb07bf8bd8e240562714f9cc4b7696868cb6c26/test/functional/feature_abortnode.py#L44 should take less than 5s or so?
> From #15363 (comment):
> The libevent HTTP interface is kind of crazy, so I don't think it's surprising that fixing this requires some trial and error. But it would be good to have tests for each issue that gets fixed to avoid going in circles.
> Suggesting to add a test which will prevent any regressions in the future.
> @hebasto like
> should take less than 5s or so?
I have updated the test and it seems to be reliably failing on master, so I think this is a good enough regression test @hebasto ?
Also addressed other comments by @hebasto .
```diff
+        WAIT_LOCK(g_requests_mutex, lock);
+        g_requests_cv.wait(lock, []() EXCLUSIVE_LOCKS_REQUIRED(g_requests_mutex) {
+                if (!g_requests.empty()) {
+                    LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", g_requests.size());
+                }
+                return g_requests.empty();
```
nit: Isn't indentation too big?
fixed
```diff
-    if (g_thread_http.joinable()) g_thread_http.join();
+    {
+        WAIT_LOCK(g_requests_mutex, lock);
+        g_requests_cv.wait(lock, []() EXCLUSIVE_LOCKS_REQUIRED(g_requests_mutex) {
+            if (!g_requests.empty()) {
+                LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", g_requests.size());
```
Having side effects in the predicate seems a bad idea due to spurious wakeups. It should be a pure function.
Approach ACK 6d3a848fc76f76af2b2b262011f37be952325b44.
```diff
@@ -207,6 +214,17 @@ std::string RequestMethodString(HTTPRequest::RequestMethod m)
 /** HTTP request callback */
 static void http_request_cb(struct evhttp_request* req, void* arg)
 {
+    // Track requests and notify when a request is completed.
+    {
+        WITH_LOCK(g_requests_mutex, g_requests.insert(req));
+        g_requests_cv.notify_all();
```
What's the purpose of this notification? I think we (currently) only need to monitor for completed requests, not new ones?
Did you see #26742 (review)?
I hadn't seen that, thanks for linking.
It's a minor concern, but I'm a bit worried that we're creating a very general contract ("g_requests_cv will always notify for any change to g_requests") on a global variable that is enforced outside of g_requests itself; in the future we might easily introduce mutations on g_requests without notifying the cv. A more scoped cv, such as g_requests_cv_handled, largely mitigates that.
But, again, no big concern, and it's not likely to cause any major problems should we miss something there. As you see fit.
FYI, I think one approach to improve on this contract (and, as a nice benefit, also encapsulate the code a bit better, which I think improves readability) is shown in the diff below.
(WaitUntilEmpty() needs improvement because it shouldn't directly be logging about the HTTP server like that. I guess we also don't need a GlobalMutex here.)
<details>
<summary>git diff on 55bbf04ad</summary>

```diff
diff --git a/src/httpserver.cpp b/src/httpserver.cpp
index b481ef093..90ef37bd5 100644
--- a/src/httpserver.cpp
+++ b/src/httpserver.cpp
@@ -134,6 +134,40 @@ struct HTTPPathHandler
     HTTPRequestHandler handler;
 };
 
+class RequestTracker
+{
+public:
+    typedef std::unordered_set<evhttp_request*> requests_t;
+private:
+    GlobalMutex m_requests_mutex;
+    std::condition_variable m_requests_cv;
+    requests_t m_requests GUARDED_BY(m_requests_mutex);
+public:
+    void Register(struct evhttp_request* req)
+    {
+        WITH_LOCK(m_requests_mutex, m_requests.insert(req));
+        m_requests_cv.notify_all();
+    }
+
+    size_t Remove(struct evhttp_request* req)
+    {
+        auto n{WITH_LOCK(m_requests_mutex, return m_requests.erase(req))};
+        m_requests_cv.notify_all();
+        return n;
+    }
+
+    void WaitUntilEmpty()
+    {
+        WAIT_LOCK(m_requests_mutex, lock);
+        if (!m_requests.empty()) {
+            LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", m_requests.size());
+        }
+        m_requests_cv.wait(lock, [this]() EXCLUSIVE_LOCKS_REQUIRED(m_requests_mutex) {
+            return m_requests.empty();
+        });
+    }
+};
+
 /** HTTP module state */
 //! libevent event loop
@@ -149,10 +183,8 @@ static GlobalMutex g_httppathhandlers_mutex;
 static std::vector<HTTPPathHandler> pathHandlers GUARDED_BY(g_httppathhandlers_mutex);
 //! Bound listening sockets
 static std::vector<evhttp_bound_socket *> boundSockets;
-//! Track active requests
-static GlobalMutex g_requests_mutex;
-static std::condition_variable g_requests_cv;
-static std::unordered_set<evhttp_request*> g_requests GUARDED_BY(g_requests_mutex);
+static RequestTracker g_request_tracker;
+
 /** Check if a network address is allowed to access the HTTP server */
 static bool ClientAllowed(const CNetAddr& netaddr)
@@ -216,12 +248,9 @@ static void http_request_cb(struct evhttp_request* req, void* arg)
 {
     // Track requests and notify when a request is completed.
     {
-        WITH_LOCK(g_requests_mutex, g_requests.insert(req));
-        g_requests_cv.notify_all();
+        g_request_tracker.Register(req);
         evhttp_request_set_on_complete_cb(req, [](struct evhttp_request* req, void*) {
-            auto n = WITH_LOCK(g_requests_mutex, return g_requests.erase(req));
-            assert(n == 1);
-            g_requests_cv.notify_all();
+            assert(g_request_tracker.Remove(req) == 1);
         }, nullptr);
     }
@@ -477,15 +506,7 @@ void StopHTTPServer()
         evhttp_del_accept_socket(eventHTTP, socket);
     }
     boundSockets.clear();
-    {
-        WAIT_LOCK(g_requests_mutex, lock);
-        if (!g_requests.empty()) {
-            LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", g_requests.size());
-        }
-        g_requests_cv.wait(lock, []() EXCLUSIVE_LOCKS_REQUIRED(g_requests_mutex) {
-            return g_requests.empty();
-        });
-    }
+    g_request_tracker.WaitUntilEmpty();
     if (eventHTTP) {
         // Schedule a callback to call evhttp_free in the event base thread, so
         // that evhttp_free does not need to be called again after the handling
```

</details>
Sorry but it's a significant change and this PR already has a pretty long history and several ACKs for the current approach. I would suggest you open a follow-up PR with this as an additional improvement.
```diff
@@ -146,6 +149,10 @@ static GlobalMutex g_httppathhandlers_mutex;
 static std::vector<HTTPPathHandler> pathHandlers GUARDED_BY(g_httppathhandlers_mutex);
 //! Bound listening sockets
 static std::vector<evhttp_bound_socket *> boundSockets;
+//! Track active requests
+static GlobalMutex g_requests_mutex;
+static std::condition_variable g_requests_cv;
+static std::unordered_set<evhttp_request*> g_requests GUARDED_BY(g_requests_mutex);
```
I think this can also be a simple atomic counter which doesn't require a lock?
<details>
<summary>git diff on 6d3a848fc</summary>

```diff
diff --git a/src/httpserver.cpp b/src/httpserver.cpp
index 346c6d5cb..a058a9c49 100644
--- a/src/httpserver.cpp
+++ b/src/httpserver.cpp
@@ -152,7 +152,7 @@ static std::vector<evhttp_bound_socket *> boundSockets;
 //! Track active requests
 static GlobalMutex g_requests_mutex;
 static std::condition_variable g_requests_cv;
-static std::unordered_set<evhttp_request*> g_requests GUARDED_BY(g_requests_mutex);
+static std::atomic<size_t> g_requests_count;
 /** Check if a network address is allowed to access the HTTP server */
 static bool ClientAllowed(const CNetAddr& netaddr)
@@ -216,11 +216,9 @@ static void http_request_cb(struct evhttp_request* req, void* arg)
 {
     // Track requests and notify when a request is completed.
     {
-        WITH_LOCK(g_requests_mutex, g_requests.insert(req));
-        g_requests_cv.notify_all();
+        ++g_requests_count;
         evhttp_request_set_on_complete_cb(req, [](struct evhttp_request* req, void*) {
-            auto n = WITH_LOCK(g_requests_mutex, return g_requests.erase(req));
-            assert(n == 1);
+            --g_requests_count;
             g_requests_cv.notify_all();
         }, nullptr);
     }
@@ -480,10 +478,10 @@ void StopHTTPServer()
     {
         WAIT_LOCK(g_requests_mutex, lock);
         g_requests_cv.wait(lock, []() EXCLUSIVE_LOCKS_REQUIRED(g_requests_mutex) {
-            if (!g_requests.empty()) {
-                LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", g_requests.size());
-            }
-            return g_requests.empty();
+            if (g_requests_count > 0) {
+                LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", g_requests_count);
+            }
+            return g_requests_count == 0;
         });
     }
     if (eventHTTP) {
```

</details>
(cc @hebasto - I think this is a similar but less "controversial" approach than what I proposed offline)
Okay, makes sense. I'm not sure how controversial #19434 is, need to take a closer look. Would it make sense to use the atomic counter in this PR and then use a set in the follow-up? In case we can't get that merged, that would leave master in a better state, and I think the code churn is manageable?
You still need a lock, otherwise, how would you wait for the count to be zero?
I'm not sure I understand your comment - my diff (https://github.com/bitcoin/bitcoin/pull/26742#discussion_r1063664153) does not involve removing the lock, although it reduces reliance on it since a counter can be encapsulated with a more efficient std::atomic<size_t> as opposed to set mutations which require acquiring a lock. If I'm misunderstanding you, could you please clarify?
Ah nevermind, we could use std::atomic&lt;T&gt;::wait.
I prefer to keep it as is as well. I don't think #19434 will be controversial because it's not conceptually different, just an improvement. And using the count is just a minor simplification; the code works the same either way. We have put stranger things into the code base, like unused functions for example, as long as there were follow-ups where they would be used and these follow-ups followed the same concept that was already ACKed.
Alright, it's not my ideal approach but let's be pragmatic and keep it as is. Can be marked as resolved for me.
Approach ACK 6d3a848fc
Can anyone please confirm or correct my understanding of why this is so much faster?
My current understanding is that even when all of the workers in g_thread_http_workers have terminated (i.e. the entire queue is processed and running==false), the http server thread g_thread_http does not return (when joining) immediately because there are still some idle connections (with whom?) open, preventing the server from shutting down. By first (instead of afterwards) calling evhttp_free(eventHTTP), we can safely destroy the server even if connections are still open (but all requests are handled), causing g_thread_http.join() to return immediately in StopHTTPServer() as opposed to waiting for a timeout.
I think the commit and PR message would benefit from a more clear reason why this works.
> Can anyone please confirm or correct my understanding of why this is so much faster?
> My current understanding is that even when all of the workers in g_thread_http_workers have terminated (i.e. the entire queue is processed and running==false), the http server thread g_thread_http does not return (when joining) immediately because there are still some idle connections (with whom?) open, preventing the server from shutting down. By first (instead of afterwards) calling evhttp_free(eventHTTP), we can safely destroy the server even if connections are still open (but all requests are handled), causing g_thread_http.join() to return immediately in StopHTTPServer() as opposed to waiting for a timeout.
Sounds good to me. I guess libevent doesn't close the idle connection as an optimization (less overhead if the client wants to send more requests) and if the client doesn't close them then we are seeing this issue. I would add that the overhead that we see in the functional test is the timeout we have configured on the http server ourselves here.
> I guess libevent doesn't close the idle connection as an optimization (less overhead if the client wants to send more requests) and if the client doesn't close them then we are seeing this issue.
Not really an optimization of libevent, AuthServiceProxy used in the test framework creates a persistent connection (to reuse the same connection for consecutive HTTP requests, as @fjahr says).
ACK. With this change evhttp_free is called after all requests are completed. Calling evhttp_free at this point will cause all active connections to close and nothing remains on the event base, and all RPC threads are ready to be joined. As mentioned above, this is useful to avoid hitting the idle timeout.
Note that shutdown still hangs while a request is in-flight.
```diff
@@ -21,12 +21,15 @@
 #include <util/threadnames.h>
 #include <util/translation.h>
 
+#include <condition_variable>
 #include <cstdio>
 #include <cstdlib>
 #include <deque>
 #include <memory>
+#include <mutex>
```
can be removed.
+1, and we're missing #include <event2/http_struct.h> (for evhttp_request)
Both done
```diff
@@ -207,6 +214,17 @@ std::string RequestMethodString(HTTPRequest::RequestMethod m)
 /** HTTP request callback */
 static void http_request_cb(struct evhttp_request* req, void* arg)
 {
+    // Track requests and notify when a request is completed.
+    {
+        WITH_LOCK(g_requests_mutex, g_requests.insert(req));
+        g_requests_cv.notify_all();
+        evhttp_request_set_on_complete_cb(req, [](struct evhttp_request* req, void*) {
+            auto n = WITH_LOCK(g_requests_mutex, return g_requests.erase(req));
```
nit - think I'd prefer removing the n var altogether, the temp var doesn't really make it more readable imo
```cpp
auto n{WITH_LOCK(g_requests_mutex, return g_requests.erase(req))};
```
Done
ACK https://github.com/bitcoin/bitcoin/commit/55bbf04adc4879dcacd135f40aa053e60032af4b
I have some open suggestions, but I'm happy with the PR as it is.
This was made obsolete by tracking the active requests and explicitly waiting for them to finish before shutdown.
Addressed comments by @promag and @stickies-v
re-ACK 60978c8
```diff
 #include <sys/stat.h>
 
 #include <event2/buffer.h>
 #include <event2/bufferevent.h>
 #include <event2/http.h>
+#include <event2/http_struct.h>
```
nit: IWYU suggests to s/#include <event2/http_struct.h>/#include <event2/event.h>/.
ACK 60978c8080ec13ff4571c8a89e742517b2aca692
ACK 60978c8080ec13ff4571c8a89e742517b2aca692