Add a -rpckeepalive and disable RPC use of HTTP persistent connections.

gmaxwell commented at 1:54 am on January 14, 2015: contributor

It turns out that some miners have been staying with old versions of Bitcoin Core because their software behaves poorly with persistent connections and the Bitcoin Core thread and connection limits.

What happens is that underlying HTTP libraries leave connections open invisibly to their users and then the user runs into the default four thread limit. This looks like Bitcoin Core is unresponsive to RPC.

There are many things that should be improved in Bitcoin Core’s behavior here, e.g. supporting more concurrent connections, not tying up threads for idle connections, disconnecting kept-alive connections when limits are reached, etc. All are fairly big, risky changes.

Disabling keep-alive is a simple workaround. It’s often not easy to turn off the keep-alive support in the client where it may be buried in some platform library.

If you are one of the few who really needs persistent connections you probably know that you want them and can find a switch; while if you don’t and the misbehavior is hitting you it is hard to discover the source of your problems is keepalive related. Given that it is best to default to off until they’re handled better.

gmaxwell added this to the milestone 0.10.0 on Jan 14, 2015

sipa commented at 2:06 am on January 14, 2015: member

Nitty suggestion: call it -rpckeepalive with default value 1.

gmaxwell renamed this:
~~Add a -disablerpckeepalive to disable RPC HTTP persistent connections.~~
Add a -rpckeepalive to control RPC use of HTTP persistent connections.
on Jan 14, 2015

gmaxwell commented at 2:18 am on January 14, 2015: contributor

Fair enough.

jonasschnelli commented at 4:47 am on January 14, 2015: contributor

Somehow related to #5526.

laanwj commented at 7:24 am on January 14, 2015: member

Is there any practical advantage to having http keep-alive support in Bitcoin Core? e.g. the “pipeline a lot of requests” case we already support RPC batching. RPC is meant as a local mechanism, so any latency for setting up a new TCP connection will be dwarfed by, say, parsing overhead.

Looks like something that is reasonably hard to handle right in client libraries, although OTOH they could just send a Connection: close header.

On the server side, a well-behaved implementation would disconnect if no new request comes in after ~15 seconds. Which we don’t.

Also taking #5526 into account: +1 for nuking this functionality (or at least disabling by default) for now.

jonasschnelli commented at 8:04 am on January 14, 2015: contributor

Because i recently analyzed the http behavior i allow myself to comment on this: IMO the problem of blocking or say open connections from/to clients is the missing timeout implementation. Disabling http keep-alive option alone will not prevent from dead-and-open connection because the implementation currently uses the blocking getline(). If a malfunction client is not sending a valid http request, it can end up with a never closed connection until a restart of bitcoind.

But right, the keep-alive option is questionable. I assume most RPC connection are coming from localhost or LAN. IMO there is no big benefit of saving http overhead with keep-alive. Therefor i can ACK disabling keep-alive in general or over the flag.

But still there is a need for implementing a non-blocking way of handling connection with a proper http timeout according to #5526 (see @jgarzik info there).

gmaxwell renamed this:
~~Add a -rpckeepalive to control RPC use of HTTP persistent connections.~~
Add a -rpckeepalive and disable RPC use of HTTP persistent connections.
on Jan 14, 2015

gmaxwell commented at 8:35 am on January 14, 2015: contributor

Based on conversation with @laanwj I’ve switched this off by default. This is actually my preference; but I expected to be alone in the opinion that we should just shut it off until it works better, at least for 0.10.

jonasschnelli commented at 9:43 am on January 14, 2015: contributor

ACK nit: you need to remove or rewrite the recently added rcp-test (httpbasics.py)

laanwj referenced this in commit 19e20c0086 on Jan 14, 2015

laanwj referenced this in commit ecfe8a600b on Jan 14, 2015

laanwj referenced this in commit 4c65e99ed8 on Jan 14, 2015

laanwj commented at 11:54 am on January 14, 2015: member

@jonaschnelli Looks like the test framework has no easy way to pass extra arguments to bitcoind :( Added it in https://github.com/laanwj/bitcoin/tree/2015_01_fix_tests_for_5655 @gmaxwell you can cherry-pick https://github.com/laanwj/bitcoin/commit/4c65e99ed84cf3da1b91efaabe5e85b11333df1a to make this pass travis again

jonasschnelli commented at 12:16 pm on January 14, 2015: contributor

@laanwj: test looks good but I would recommend to also add a tiny check if keep-alive is disabled when starting a node without the new arg.

I can also add a commit on top of yours if you like to have this additional test but lack of time to do it.

laanwj commented at 12:19 pm on January 14, 2015: member

@jonasschnelli more tests is always good, you could e.g. start one of the four nodes without the option (it passes extraargs for every node individually) and then try that out. I don’t have time right now.

in src/init.cpp: in 9ed055ef8f outdated

379@@ -380,6 +380,7 @@ std::string HelpMessage(HelpMessageMode mode)
380     strUsage += "  -rpcport=<port>        " + strprintf(_("Listen for JSON-RPC connections on <port> (default: %u or testnet: %u)"), 8332, 18332) + "\n";
381     strUsage += "  -rpcallowip=<ip>       " + _("Allow JSON-RPC connections from specified source. Valid for <ip> are a single IP (e.g. 1.2.3.4), a network/netmask (e.g. 1.2.3.4/255.255.255.0) or a network/CIDR (e.g. 1.2.3.4/24). This option can be specified multiple times") + "\n";
382     strUsage += "  -rpcthreads=<n>        " + strprintf(_("Set the number of threads to service RPC calls (default: %d)"), 4) + "\n";
383+    strUsage += "  -rpckeepalive          " + _("RPC support for HTTP persistent connections (default: 0)") + "\n";

Diapolo commented at 2:25 pm on January 14, 2015:

Can you please use the format from rpcthreads, which allows changing defaults without the need for re-translating the help message (use strprintf)?

jonasschnelli referenced this in commit c9792d1363 on Jan 14, 2015

jonasschnelli commented at 3:37 pm on January 14, 2015: contributor

@gmaxwell you might also pull in https://github.com/jonasschnelli/bitcoin/commit/c9792d1363c914737aad4bfc727e523d1d68c4cd This will extend the tests.

Add a -rpckeepalive and disable RPC use of HTTP persistent connections.

It turns out that some miners have been staying with old versions of
 Bitcoin Core because their software  behaves poorly with persistent
 connections and the Bitcoin Core thread and connection limits.

What happens is that underlying  HTTP libraries leave connections open
 invisibly to their users and then the user runs into the default four
 thread limit.  This looks like Bitcoin Core is unresponsive to RPC.

There are many things that should be improved in Bitcoin Core's behavior
 here, e.g. supporting more concurrent connections, not tying up threads
 for idle connections, disconnecting kept-alive  connections when limits
 are reached, etc. All are fairly big, risky changes.

Disabling keep-alive is a simple workaround. It's often not easy to turn
 off the keep-alive support in the client where it may be buried in some
 platform library.

If you are one of the few who really needs persistent connections you
 probably know that you want them and can find a switch; while if you
 don't and the misbehavior is hitting you it is hard to discover the
 source of your problems is keepalive related.  Given that it is best
 to default to off until they're handled better.

16a5c18cea

fix tests for #5655 56c1093dae

improve tests for #5655 1dd8ee72af

gmaxwell commented at 6:06 pm on January 14, 2015: contributor

@Diapolo Thanks for the air-cover on the translation strings.

jonasschnelli commented at 8:09 am on January 15, 2015: contributor

Tested again ACK.

laanwj merged this on Jan 15, 2015

laanwj closed this on Jan 15, 2015

laanwj referenced this in commit 9ca6e04790 on Jan 15, 2015

gmaxwell referenced this in commit aaf55d25c6 on Jan 15, 2015

laanwj commented at 9:19 am on January 15, 2015: member

Cherry-picked to 0.10 branch via aaf55d25c6191c0effd5fa5dd5d2d0f17e715854

jgarzik commented at 1:39 pm on January 15, 2015: contributor

Did anyone research the reason for adding keep-alive support?

Turning this option on by default will now break those clients that were fixed, yes?

jonasschnelli commented at 1:47 pm on January 15, 2015: contributor

#1101 is related, was basically merged via https://github.com/luke-jr/bitcoin/commit/7ee8f04aa18052b269c74fc1488ac65b8ce2359b

See also: #214 (comment)

Once there was a open PR of a asio implementation of the RPC server (https://github.com/bitcoin/bitcoin/pull/214 wonder why this never made it to the master).

laanwj commented at 2:14 pm on January 15, 2015: member

@jgarzik The reason clients break is a combination of a bug in the client - not effectively reusing open connections - and misbehavior at our side - not properly timing out unused connections.

Disabling keep-alive is always safe: clients know that they have to reconnect if the server slams the connection (especially as the Connection: Close header is sent).

The option can be safely reenabled by default again as soon as the problems on our side are fixed. Hopefully the client issues are also solved by that time, but it’s not necessary as we’ll be robust against them. @jonasschnelli Probably because there was no testing suite for RPC back then. Swapping the RPC server implementation is extremely risky if you are working blindly.

jgarzik commented at 7:00 pm on January 15, 2015: contributor

@laanwj “Disabling keep-alive is always safe” Wrong. This demonstrates the need to research why a code change was added before disabling it!

The keep-alive feature was added [in part] because the repeated open/close cycles of many clients would eventually cause a hang due to so many local sockets in TIME_WAIT.

Disabling keep-alive will resurrect that problem, introducing hangs into working client scenarios.

Telling users with working setups to rewrite their working code to use batches is a big request.

jonasschnelli commented at 7:33 pm on January 15, 2015: contributor

@jgarzik’s concerns sounds realistic to me. What if we just switch the default for -rpckeepalive from the current false to true? Then users with client behaviors after @gmaxwell can still disable it.

But the long term goal are unclear to me now. Would it make sense to rewrite the RPC server into the direction of https://github.com/jgarzik/rpcsrv? Gavin also wrote today on IRC:

If we did RPC again, I think a binary zeromq channel would be the way to go. Maybe with a little python listener that did JSON, if you want JSON…..

Maybe there should be a consensus direction to where the RPC implementations should go to so somebody could start working on it.

jgarzik commented at 8:42 pm on January 15, 2015: contributor

This is an example of a bug that this change re-introduces to bitcoind: #344

Or just google for bitcoind TIME_WAIT.

Merging this into 0.10 creates obvious regressions.

gmaxwell commented at 9:14 pm on January 15, 2015: contributor

Jeff, I believe I read every commit and PR related to the introduction of persistent connections. Not a single one of them mentioned this. I also initially created this a a default no change in behavior in the interest of being conservative, but unfortunately; you can’t tell if it’s the broken persistent connection support thats breaking you: it just looks like Bitcoin core is unresponsive.

In the future if a commit fixes a problem it would be helpful if it were mentioned in the commit message.

The last round of changes that impacted our persistent connections was a massive regression which caused complaints from many users– many more than the time_wait complaint. Apparently after our failure to do anything about it many people just stuck on 0.8.1. This is particularly problematic because some of the most frequently impacted parties were miners. I’ve also been told at conferences by various service providers that they’d switched to using hosted APIs due to Bitcoin core being unresponsive, which made no sense to me at the time. In the new light of being reminded why miners were staying stuck on 0.8.1, it finally does make sense.

I’m open to having persist default to on (after all, that was my original PR), but I think that doing so may be a mistake unless we really have reason to believe the timewait problems are more frequent than the others. I also think persistent connections are not the right solution for timewait in any case: isn’t the correct solution SO_REUSEADDR?

(I’m travelling today and won’t be submitting any commits, so don’t feel like waiting for me is required)

jgarzik commented at 9:29 pm on January 15, 2015: contributor

The original genesis of keep-alive support is that it was originally written (by JoelKatz?) to work around problems that miners were seeing with the original open/close/open/close behavior that leaves behind reams of TIME_WAIT etc. sockets, eventually leading to a non-responsive bitcoind.

RE SO_REUSEADDR, that is for servers restarting, not clients reconnecting. It does not help this situation, where the server (pre-KA & post-KA) is not closing+reopening the server socket.

Defaulting KA-off in 0.10 with little testing and thought to user impact is not a good choice so late in the release cycle.

For users whom this will actually break, the only salve we have offered is e.g. the “pipeline a lot of requests” case we already support RPC batching.

That is not what we should be telling production users with code that works today.

gmaxwell commented at 1:45 am on January 16, 2015: contributor

@jgarzik “the only salve we have offered is” not so: “turn keep alive back on”– it’s a flag, it’s documented in the help output. I’m not making an argument if this is sufficient or not; just pointing it out: I wouldn’t introduce a change in behavior like this without at least a switch to get the old behavior back.

We have multiple reports that the software is unusable since 0.8.1. We got a rash of complaints even from the few who figured out what was up, to the point where people stopped complaining but stayed with old versions. Doing nothing at all isn’t acceptable either. It’s not acceptable for us to do nothing. Again.

Sorry that I’m not maximally up to speed on the issues people saw with sockets in time_wait; it’s not super well documented (even on that complaint) what the issue is: My current understanding is that clients are making connections and closing them. The client’s outbound socket sticks around in TIME_WAIT for two minutes (with common TCP settings). If the client makes more than about 275 connections (32768 high ports vs 120 seconds) per second, they risk running out of ephemeral sockets toward the particular server address:port.. If so, I can see that behavior, but my quick test while on an airplane wasn’t able to make enough requests per second to actually observe it running out.

Do we have a reproduction? The issue this change addressed is easily reproducible and has been reported many times now and considering how tricky it is to figure out the cause (instead of mistaking it for ‘bitcoind hags’) that is quite concerning.

jgarzik commented at 1:57 am on January 16, 2015: contributor

The reproducer was typically high speed ‘getwork’ traffic from generation-1 pool server proxies, but non-miners were also able to make it fail with high speed wallet RPCs. 275/second would be a low number versus historical examples.

RE flag: It’s still a behavior change that forces clients to fix their software – software that has been working for years – with very little warning.

This behavior change was merged at the last minute during an -rc cycle. It is simply too risky to turn keep-alive off by default in 0.10. For 0.11, with plenty of user warning (“you might need to change your code”), it can be considered, sure.

Strong NAK for 0.10 behavior change. The same default should be kept for 0.10.

It is a strawman to argue against “doing nothing” Ship the feature in 0.10 if you like, but keep the current default to prevent breaking even more users. Miners can disable -rpckeepalive just as easily, to prove the new code works better than the old.

laanwj commented at 8:40 am on January 16, 2015: member

@jgarzik I don’t feel like arguing this anymore. Damned if we do, damned if we don’t. I get tired of this. Let’s keep the option but feel free to change the default to 1 again. I’ll ACK it.

jgarzik commented at 9:14 am on January 16, 2015: contributor

If people want the feature, a properly release-managed plan would not introduce this regression in 0.10 – particularly so late in the 0.10 release cycle:

0.10 defaults to 0.9.x behavior to prevent further regressions (keep-alive on)
Communicate to the users that you are introducing a change that might break their software in 0.11
Then turn keep-alive off in master if that’s what people want.

Fixing/preventing regressions should trump introducing new, speculative fixes based on conference anecdotes. Lack of Keep-Alive caused problems in 2011, and KA-off clearly reverts back to pre-fix behavior.

Should we address these new concerns? Absolutely, but not by rushing KA-off into release at the last moment. Sane release management policy implies KA-on for 0.10.

laanwj commented at 9:21 am on January 16, 2015: member

My last words on this: Disabling it was meant as a quick workaround for the problems reported by @gmaxwell.

We perceived that in its current implementation the keep-alive does more damage than it’s worth. I was not aware that the other way around also could break clients. Thanks for pointing to the issue.

Hence, I’m fine with changing the default back to 1 for 0.10, and for master too. If this goes into the long haul, might as well fix keep-alive instead instead of having a phase where it’s disabled by default. People experiencing issues can use -rpckeepalive=0.

sipa commented at 10:38 am on January 16, 2015: member

ACK on adding the option, bu lt leaving it enabled by default.

wtogami referenced this in commit 8c1d1474f5 on Jan 23, 2015

laanwj referenced this in commit 1065bb1cd2 on Aug 28, 2015

laanwj referenced this in commit ea55964826 on Sep 2, 2015

laanwj referenced this in commit 49183bb7fb on Sep 2, 2015

laanwj referenced this in commit 40b556d374 on Sep 3, 2015

reddink referenced this in commit a084d77b42 on May 27, 2020

DrahtBot locked this on Dec 16, 2021

Add a -rpckeepalive and disable RPC use of HTTP persistent connections. #5655