http: speed up shutdown

laanwj commented at 5:12 pm on November 11, 2015: member

Pardon a bit of iteration. This continues/fixes #6719.

event_base_loopexit was not doing what I expected it to. What I expected was that it sets a timeout, given that no other pending events it would exit in N seconds. However, what it does was delay the event loop exit with N seconds, even if nothing is pending.

Solve it in a different way: give the event loop thread time to exit out of itself, and if it doesn’t, send loopbreak (and then join it).

This speeds up the RPC tests a lot, each exit incurred a 10 second overhead, with this change there should be no shutdown overhead in the common case and up to two seconds if the event loop is blocking.

As a bonus this breaks dependency on the global boost::thread_group, as the HTTP server minds its own offspring.

Time for rpctests with this patch:

0real    7m50.215s
1user    0m30.152s
2sys     0m11.752s

Without:

0real    13m12.069s
1user    0m30.412s
2sys     0m11.168s

laanwj added the label RPC on Nov 11, 2015

dcousens commented at 9:01 pm on November 11, 2015: contributor

event_base_loopbreak was not doing what I expected it to. What I expected was that it sets a timeout, given that no other pending events it would exit in N seconds. However, what it does was delay the event loop exit with N seconds, even if nothing is pending.

~~Sounds like you wanted event_base_loopexit? Ref:~~

The next event_base_loop() iteration after the given timer expires will complete normally (handling all queued events) then exit without blocking for events again.

~~In any case, event_base_loopbreak should still exit immediately, stopping any loops after the next event:~~

event_base_loop() will abort the loop after the next event is completed; event_base_loopbreak() is typically invoked from this event’s callback. This behavior is analogous to the “break;” statement.

Are you sure this isn’t symptomatic of something else?

Reference: http://www.wangafu.net/~nickm/libevent-2.0/doxygen/html/event_8h.html#a07a7599e478e4031fa8cf52e26d8aa1e

edit: Right, never mind, you were using event_base_loopexit, hence the confusion. You’re OP (before it was edited) makes it sound like you switched from event_base_loopbreak.

in src/httpserver.cpp: in 856be4fe46 outdated

474@@ -480,6 +475,14 @@ void StopHTTPServer()
475         workQueue->WaitExit();
476         delete workQueue;
477     }
478+    if (eventBase) {
479+        LogPrint("http", "Waiting for HTTP event thread to exit\n");
480+        // Give event loop a few seconds to exit (to send back last RPC responses), then break it
481+        if (!threadHTTP.try_join_for(boost::chrono::milliseconds(2000))) {

dcousens commented at 9:06 pm on November 11, 2015:

Out of interest, why do we have to wait [a hard coded timeout]? Why can’t we wait for all events to finish (and maybe just ignore all new incoming connections?)

MarcoFalke commented at 9:11 pm on November 11, 2015:

all events to finish

may take a long time (minutes) depending what was going in in the past?

dcousens commented at 9:27 pm on November 11, 2015:

@MarcoFalke but this doesn’t stop new connections (or is that done elsewhere?) in that waiting time. So conceptually, the status quo could be the same? That is, new connections coming in, old ones still finishing.

theuni commented at 4:14 am on November 12, 2015:

New connections are stopped just above, see where the accept sockets are torn down.

Though it does look like we should close connected sockets in http_reject_request_cb as well for good measure. @laanwj Is there a reason to keep them open once they’ve had a req rejected?

laanwj commented at 6:34 am on November 12, 2015:

Waiting for all events to finish could take forever. There’s tons of ways that someone could keep a connection option by slowly sending data. Also, timing events (such as the walletpassphrase timeout) may hold up the event queue.
Just breaking off everything in progress immediately prevents the answer to stop from getting back. We could just declare this a feature of course and be done with it… But it was a race condition, sometimes you would get a reply, sometimes not. We could define stop to cease all RPC activity immediately. But I think I like this better.
There AFAIK is no way to ask libevent, nor libevent_http if the events in the queue are important enough to wait for or not. And specially tagging the connection that sends ‘stop’ is quite ugly, not sure if even possible without diving into internal data structures.

For context read #6719.

I’m not 100% happy with this solution either, but I wasn’t able to find another way to do it, and it’s better than the previous one. @cfields sure, you can close the connection afterwards there, but that doesn’t solve this issue.

laanwj commented at 6:42 am on November 12, 2015:

@MarcoFalke but this doesn’t stop new connections (or is that done elsewhere?) in that waiting time. So conceptually, the status quo could be the same?

It does:

0        // Unlisten sockets
1        BOOST_FOREACH (evhttp_bound_socket *socket, boundSockets) {
2            evhttp_del_accept_socket(eventHTTP, socket);
3        }
4        // Reject requests on current connections
5        evhttp_set_gencb(eventHTTP, http_reject_request_cb, NULL);

This closes the port and makes new requests error (so that no new work items are created). But that’s best-effort. Current connections could still do e.g. a slowloris attack. The two seconds are a belt and suspenders, nothing more.

dcousens commented at 9:06 pm on November 11, 2015: contributor

utACK

jonasschnelli commented at 7:46 am on November 12, 2015: contributor

utACK.

laanwj commented at 8:48 am on November 12, 2015: member

Some more information: it appears that with libevent master, this issue doesn’t exist. event_base_loopexit does actually do what I expect it to in the OP. As that is the version I’m normally using, that is why I didn’t notice before.

Edit: so I think we should still do this for backward compat, but at least I’m not crazy :)

dcousens commented at 1:00 am on November 13, 2015: contributor

@laanwj maybe just inline a comment to mention this might be fixed in latest version of libevent? Could be useful for some future refactoring, save someone trying to establish whether there was a reason we moved away from it that they aren’t aware of.

laanwj commented at 9:14 am on November 13, 2015: member

Will do

Edit: OK, added comment and updated commit message.

laanwj commented at 9:14 am on November 13, 2015: member

Huh! I pushed this branch to bitcoin/bitcoin instead of my own repository. Oh well, no harm done, I’ll just remove it again when it’s merged.

laanwj force-pushed on Nov 13, 2015

http: speed up shutdown

This continues/fixes #6719.

`event_base_loopbreak` was not doing what I expected it to, at least in
libevent 2.0.21.
What I expected was that it sets a timeout, given that no other pending
events it would exit in N seconds. However, what it does was delay the
event loop exit with 10 seconds, even if nothing is pending.

Solve it in a different way: give the event loop thread time to exit
out of itself, and if it doesn't, send loopbreak.

This speeds up the RPC tests a lot, each exit incurred a 10 second
overhead, with this change there should be no shutdown overhead in the
common case and up to two seconds if the event loop is blocking.

As a bonus this breaks dependency on boost::thread_group, as the HTTP
server minds its own offspring.

a264c32e33

laanwj force-pushed on Nov 13, 2015

laanwj commented at 11:24 am on November 13, 2015: member

~~@theuni can you take a look here as well?~~ Edit: nm, you already did, but didn’t ack

gmaxwell commented at 7:32 pm on November 13, 2015: contributor

ACK.

gmaxwell merged this on Nov 13, 2015

gmaxwell closed this on Nov 13, 2015

gmaxwell referenced this in commit dbd2c135dd on Nov 13, 2015

dcousens commented at 8:51 am on November 14, 2015: contributor

Awesome, thanks @laanwj :)

MarcoFalke commented at 1:05 pm on November 15, 2015: member

Post merge tested ACK! @laanwj this makes travis a lot faster. (runtime reduced for some rpc tests by 90%)

Before:

0$ time for i in {1..10};do qa/pull-tester/rpc-tests.py disablewallet;done
1real    1m53.003s
2user    0m1.135s
3sys 0m0.437s
4
5$ time for i in {1..10};do qa/pull-tester/rpc-tests.py wallet;done
6real    10m4.888s
7user    0m21.829s
8sys 0m8.239s

After:

0disablewallet:
1real    0m13.809s
2user    0m1.166s
3sys 0m0.510s
4
5wallet:
6real    5m7.703s
7user    0m22.128s
8sys 0m8.222s

laanwj deleted the branch on Nov 30, 2015

zkbot referenced this in commit fb2f98e00b on Oct 23, 2017

sickpig referenced this in commit 46b5b86be5 on Mar 9, 2018

sickpig referenced this in commit dadc2880a1 on Mar 9, 2018

sickpig referenced this in commit 6d4a55878a on Mar 12, 2018

sickpig referenced this in commit 6db640a393 on Mar 12, 2018

sickpig referenced this in commit 76cf243311 on Mar 12, 2018

sickpig referenced this in commit 4428056252 on Mar 12, 2018

marlengit referenced this in commit 297ac08c3a on Sep 25, 2018

MarcoFalke locked this on Sep 8, 2021

http: speed up shutdown #6990