Change download logic to allow calling getdata on inbound peers

sdaftuar commented at 8:34 pm on January 14, 2015: member

This fixes the download behavior to allow downloading blocks on a chain that is known only by inbound peers and not by outbound/preferred download peers, mentioned in issue #5138. @sipa I tried implementing your suggestion (on IRC) of just guarding the getdata code block with a (preferredpeer || rand() %100 == 0). I think this works for what we were discussing (where an inbound peer announces a tip more than 1 block ahead on a chain no one else knows about), but it doesn’t resolve @gavinandresen’s test case in issue #5138 because we still wouldn’t be sending getheaders when an inbound peer connects like we do for outbound peers. Instead, we’d have to wait for an inv that builds on the longer chain from the inbound peer before we send getheaders, and then with this code send the getdata.

One question: is it necessary to also check !pto->fOneShot() with this change? I didn’t think it would be possible for us to have a one shot peer’s headers and therefore no blocks we’d ever try to download, so I left it out, but perhaps it’s preferable to include that check just to be safe…?

As an aside: I noticed that an additional way we could workaround the regtest issue @gavinandresen raised, besides connecting nodes to each other in both directions, would be to whitelist localhost.

I’ve tested this code on a modified version of the test, where after connecting a node with a longer chain and observing that the tips didn’t sync, I then generated one more block on that inbound peer and observed that the tips then did sync.

Finally, I believe that if we wanted to send an initial getheaders message to all inbound peers just as we do for outbound peers so that we could immediately download a longer chain known only by an inbound peer, we could do so with a similar small change to the code that sends the initial getheaders messages (while still preferring outbound peers for downloading blocks during initial block download), but it wasn’t clear to me whether such a behavior change would be desirable.

laanwj added the label P2P on Jan 15, 2015

in src/main.cpp: in 4a5ab51958 outdated

4534@@ -4535,7 +4535,7 @@ bool SendMessages(CNode* pto, bool fSendTrickle)
4535         // Message: getdata (blocks)
4536         //
4537         vector<CInv> vGetData;
4538-        if (!pto->fDisconnect && !pto->fClient && fFetch && state.nBlocksInFlight < MAX_BLOCKS_IN_TRANSIT_PER_PEER) {
4539+        if (!pto->fDisconnect && !pto->fClient && (fFetch || GetRandInt(100) == 0) && state.nBlocksInFlight < MAX_BLOCKS_IN_TRANSIT_PER_PEER) {

laanwj commented at 12:31 pm on January 29, 2015:

Please use insecure_rand here. GetRandInt() returns cryptographic randomness and is thus quite expensive to call, and its unecessary here.

sdaftuar commented at 2:47 pm on January 29, 2015:

Thanks – fixed (commit squashed and amended).

sdaftuar force-pushed on Jan 29, 2015

laanwj commented at 12:14 pm on March 9, 2015: member

utACK

gmaxwell commented at 9:17 pm on March 13, 2015: contributor

Seems like a slightly kludgy approach, but better than not, and at least its simple. Have we tested much with it set to also pull from inbound just to make sure there are no issues with that?

WRT localhost, thats probably reasonable but otoh it might hide issues. I think of this restriction on inbound as temporary until we get more controls in place to manage resource usage, after that it can go away completely I think.

sdaftuar commented at 5:12 pm on March 23, 2015: member

Yeah that’s pretty much how I feel too (kludgy but simple). I did some controlled regtest-testing when I wrote this, but I haven’t done any live network tests yet, which I agree would be worth doing so I’ll set something up and report back…

sipa commented at 12:29 pm on March 24, 2015: member

utACK. It’s kludgy, but fine as an interim solution.

It should even be perfectly fine to always enable fetching from any peer, except for the implicitly accepted behavior that not accepting incoming connections will reduce bandwidth usage. When headers-first is more widely deployed, we can get rid of that policy I think, and just add an actual bandwidth throttler.

sdaftuar commented at 4:51 pm on April 1, 2015: member

I tried testing initial block download with this code to see what kind of load it might put on an inbound peer (in the steady state, it doesn’t seem that this code triggers very often, as expected). It is apparently easier than I realized to run into a situation where a node could start up and acquire inbound peers quickly, before any outbound connections are established. In repeated testing last night, I restarted a long-running node (with -reindex and without the first blockfile, to trigger downloading the whole chain), and my node was selecting an inbound peer to be the peer it was using for initial headers sync, and then with this code that peer was exclusively serving blocks for a while.

Of course that load would have come down after the headers were almost synced and parallel download started, but even at that point you might expect that this code could cause that inbound peer to be serving as much as 1/9 of the blocks to the node (probably less than that due to the insecure_rand() guard but would depend on download speeds from the outbound peers I think).

And more generally, for a node starting up and doing a large download of the blockchain, even if it starts up and is downloading only from outbound peers for a while, if it receives an inv from an inbound peer for a new block, then it would retrieve headers from that inbound peer using the existing inv-processing logic, and this change would then cause that inbound peer to serve earlier blocks as well.

Perhaps this shouldn’t be merged as is if we want to avoid using inbound peers for large-ish block download?

sipa commented at 5:49 pm on April 1, 2015: member

@sdaftuar Any suggestions for a better strategy?

Change download logic to allow calling getheaders/getdata on inbound peers

SendMessages will now call getheaders on both inbound and outbound peers,
once the headers chain is close to synced.  It will also try downloading
blocks from inbound peers once we're out of initial block download (so
inbound peers will participate in parallel block fetching for the last day
or two of blocks being downloaded).

00dcaf4beb

sdaftuar force-pushed on Apr 2, 2015

sdaftuar commented at 5:50 pm on April 2, 2015: member

I’ve pushed an alternate fix here; please review this alternative solution.

What about dropping the insecure_rand in favor of a call to IsInitialBlockDownload():

0if (!pto->fDisconnect && !pto->fClient && (fFetch || !IsInitialBlockDownload()) && state.nBlocksInFlight < MAX_BLOCKS_IN_TRANSIT_PER_PEER) {

This would mean that we’d be willing to let inbound peers share the download burden on up to the last ~288 blocks or so (our tip is within 24 hours of our best known header and our best known header is within 24 hours of the current time, in practice this is probably usually closer to 144 blocks). If we do this, I think we should also go ahead and send the initial getheaders messages to inbound peers as well, which I’ve done in this commit. This now fixes the problem gavin reported here: #5138 (comment)

Also, I don’t think the insecure_rand was giving us much; I think the idea was that in the steady state case, that would cause us to prefer downloading blocks on a reorg chain from outbound peers rather than inbound peers, but that is sufficiently rare and the load sufficiently low that I don’t think optimizing that case is very important (especially because we’re already willing to download the first new block announced by an inbound peer immediately, so why not the next block or two as well?).

I think this ought to work reasonably well when a far-behind node starts up, because it should have a number of peers by the time it gets close to catching up, meaning that the burden on inbound peers shouldn’t be very large.

If a node were to come up an only be a day or two behind, though, then it’s possible that a couple inbound peers would be hit with a large fraction of these 288 blocks, so if that is too much potential load, we could lighten it further by reducing MAX_BLOCKS_IN_TRANSIT_PER_PEER for inbound peers – perhaps just allow 1 or 2 blocks in flight at a time from inbound peers here? I’m not sure how much optimizing is reasonable for this case; the existing logic already would put all the load on inbound peers if we don’t have any outbound peers.

sdaftuar commented at 7:38 pm on April 3, 2015: member

One behavior worth noting is that with this change, our maximum number of blocks in flight goes up (from 128 to 16 * Number of Network Peers). It’s probably not generally going to exceed 288, the most blocks that can be outstanding when inbound peers start being used with the outbound peers while fetching. But if you have a stalling peer that isn’t returning a block, then with the existing block download timeout logic (which depends on the number of blocks (with validated headers) in flight at the time of a block request), that stalling peer could have about 24 hours to deliver a block before being disconnected.

Still, there are easy workarounds (eg restarting a sync with listen=0 if you hit a bad peer before finishing a big sync with the network), and this change doesn’t make the problem of an unresponsive peer any worse in the steady state, so I’m not sure how much optimization is worthwhile for this case. Thoughts?

sdaftuar commented at 7:19 pm on April 6, 2015: member

FYI I just opened #5976 to mitigate any issues with longer block download timeouts that the approach here might otherwise exacerbate.

sipa commented at 10:05 pm on April 7, 2015: member

Concept ACK on the new approach. In the steady state, this shouldn’t matter much, as almost all blocks there are fetched through the direct-fetching logic, rather than the asynchronous headers-based fetching.

gavinandresen commented at 2:28 pm on April 27, 2015: contributor

utACK.

laanwj commented at 8:12 am on April 28, 2015: member

utACK

laanwj merged this on Apr 28, 2015

laanwj closed this on Apr 28, 2015

laanwj referenced this in commit 5048465fc5 on Apr 28, 2015

laanwj referenced this in commit cd737214ce on Sep 16, 2019

sidhujag referenced this in commit f9a3ff0065 on Sep 16, 2019

MarcoFalke referenced this in commit 59c138d2f1 on Sep 18, 2019

random-zebra referenced this in commit 46585f7b62 on Mar 31, 2020

MarcoFalke locked this on Sep 8, 2021

Change download logic to allow calling getdata on inbound peers #5662