Fix many orphan problem and stuck block download problem

rebroad commented at 1:18 pm on June 27, 2014: contributor

This has taken me several days to code and test, and I’ve tested it thoroughly.

With the current code, that uses only one syncnode, with these changes, the node will never download orphan blocks. With net debugging enabled, you will get a good view of what it’s doing.

It also downloads the blockchain very fast, and never allows more than 60 seconds (the default) of inactivity to occur, so recovers very quickly from stuck downloads. With my testing, this can safely be reduced to 10 seconds as if a downloads sticks for as long as 10 seconds, it never recovers, so rather than wait two minutes as the current code does (well, tries to do), it quickly switches to a new syncnode.

Please do test, and give comments. So far, with the orphan problem, many people are unable to run the node without limiting orphan blocks (which still doesn’t fix the stuck download anyway).

Feel free to copy this branch and improve it. I’ve included quite a few commits to make it easier to see the development process, but I won’t always have time to make the changes needed that people ask for, and from experience my pull requests often don’t get pulled for some reason. I think this functionality is needed now based upon the issues people have been raising.

in src/init.cpp: in fdf0338606 outdated

287@@ -288,6 +288,7 @@ std::string HelpMessage(HelpMessageMode mode)
288     strUsage += "  -genproclimit=<n>      " + _("Set the processor limit for when generation is on (-1 = unlimited, default: -1)") + "\n";
289     strUsage += "  -help-debug            " + _("Show all debugging options (usage: --help -help-debug)") + "\n";
290     strUsage += "  -logtimestamps         " + _("Prepend debug output with timestamp (default: 1)") + "\n";
291+    strUsage += "  -logips                " + _("Include IP addresses in debug output") + "\n";

Diapolo commented at 2:06 pm on June 27, 2014:

Please always list the default like in the other help strings.

rebroad commented at 2:56 pm on June 27, 2014:

I thought the default wasn’t shown unless it was 1.

in src/net.cpp: in fdf0338606 outdated

499@@ -502,6 +500,8 @@ CNode* ConnectNode(CAddress addrConnect, const char *pszDest)
500         }
501 
502         pnode->nTimeConnected = GetTime();
503+        if (fLogIPs) LogPrint("net", "connected %s peer=%d\n", pszDest ? pszDest : addrConnect.ToString(), pnode->id);

Diapolo commented at 2:08 pm on June 27, 2014:

Please use 2 lines for ifs.

rebroad closed this on Jul 4, 2014

rebroad reopened this on Jul 4, 2014

rebroad commented at 6:51 am on July 4, 2014: contributor

I’ve removed the debugging messages, and simplified the commits. Also removed the nodeid stuff, and the getblocks changes as these are in separate pulls.

rebroad commented at 7:13 am on July 4, 2014: contributor

@sipa this pull is much more readable now. Also, I’ve included the commit that splits MarkBlockAsReceived into two (also MarkBlockAsRequested), as I felt it was confusing that it was being used to mark blocks as requested but which are still in flight. Hope this is ok, but if not, I can remove this commit.

sipa commented at 5:21 pm on July 5, 2014: member

I hate to tell you after all the time you’ve spent on this, but headers-first will make most of this obsolete. It completely removes the concepts of “sync node” or “orphan blocks”, and stops using “getblocks” altogether (fixing #4387 as well).

Decent logic for disconnecting blocks when they’re stalling will still be useful, but many of the apparently stuck problems just caused by us not knowing what to request from whom will be gone.

ghost commented at 5:54 pm on July 5, 2014: none

@sipa Is there a PR yet for headers-first feature?

sipa commented at 5:56 pm on July 5, 2014: member

@drak Very soon!

rebroad commented at 4:41 pm on July 16, 2014: contributor

I realise some of this pull request may disappear when header-first is merged, but this is still a useful fix to the many orphan problem right now, and also improves debug.log information. Some ACKs would be appreciated. @sipa @gmaxwell @laanwj @drak @gavinandresen @Diapolo thanks

laanwj added the label P2P on Jul 31, 2014

ghost commented at 11:08 pm on August 26, 2014: none

I appreciate you writing up fixes for this. Definitely important.

Will this be merged soon? The stalling due to orphans is so bad that IMO, it makes the client mostly useless. I have to baby sit it and restart it. I finally got it going with lowered max connections.

gmaxwell commented at 11:10 pm on August 26, 2014: contributor

This will almost certantly not be merged. Issues with orphans are resolved by #4468.

ghost commented at 11:14 pm on August 26, 2014: none

@gmaxwell Ah, thanks for linking that one over. Looking forward to seeing that one merged in, then.

rebroad commented at 5:20 pm on August 29, 2014: contributor

@gmaxwell This could have been merged in the mean time though, I’d have thought. It was pull-requested quite some time before #4468 came about, too.

gmaxwell commented at 5:25 pm on August 29, 2014: contributor

@rebroad 4468 is just the latest rebase of it.

rebroad commented at 0:52 am on September 3, 2014: contributor

@gmaxwell This pull request is ready to merge (or at least has been at several stages), whereas #4468 isn’t, and has never been so far - it currently breaks re-orgs, so it’s clearly still under development, and no one knows when it will be ready to test.

The issues this pull fixes are urgent, and have been for some time now - the amount of bandwidth being wasted by duplicate blocks and orphans is unimpressing quite a few people who’ve downloaded the software and expected better.

What’s the reason for not merging this now? #4468 can provide the benefit it provides when it’s finally developed and ready, but I don’t see why that should delay these urgent issues from being fixed now…

gmaxwell commented at 1:05 am on September 3, 2014: contributor

#4468 is ready to merge except the testing infrastructure has not been updated to accommodate it, if not for that it would already be merged. Even with the it would already be merged except breaking pull tester in master would make all other pulls fail. It has no currently known problems and is suitable for testing, and I have several nodes running on it without issue.

When I looked at this patch previously it was performing a number of risky behaviors like disconnecting peers on low arbitrary timeouts. A partial adjustment that introduces new vulnerabilities and doesn’t address the larger performance problems is not a good use of review or testing resources. We certantly wouldn’t be rushing something like that out the door, so the claim of urgency is not helpful here.

rebroad commented at 1:46 am on September 3, 2014: contributor

@gmaxwell The default timeouts were address as soon as they were mentioned, and changed to 2 minutes as requested. I still mantain, however, that 2 minutes is far riskier than my suggested 10 seconds, as at a 2 minute timeout this can significantly cause atypical nodes to delay block propagation by them issuing the latest block inv as quickly as possible (before they’ve even got the headers), and then ignoring the getdata block request. With my code, this is mitigated significantly, and it’s highly unlikely and unusual for a node to not start sending a block within 10 seconds of it being requested - this is based on over two years of testing. On what are you basing your argument?

Add CaughtUp() function (needed by later commits) bcd2c9bb55

Rearrange ProcessMessages + add nLastDataPos variable for partial message tracking. d36e39ac05

Don't request blocks when block invs received unless it's the syncnode, or we've CaughtUp() 673ed30f53

Force selection of a new syncnode since the current syncnode is about to stop sending blocks. 3fae9f4813

Replacement stalled block logic. Depending on the situation, it will either disconnect

the peer (as previous behaviour), or it will reselect a syncnode (e.g. when 500 blocks
have been received from the syncnode).

48474db015

Add commandline option for sync timeout 284fcfafc1

Rename MarkBlockAsReceived to MarkBlockAsRequested to reduce confusion. 917aca9fb3

Show downloaded block size and time taken to download. 39e8166053

Debug when a block is incoming and show how many blocks still to come. cf203c45e2

rebroad force-pushed on Sep 3, 2014

gmaxwell commented at 1:58 am on September 3, 2014: contributor

I though previously explained in this in adequate detail in your prior pull request that disconnected well behaving nodes: This kind approach makes the network fragile in a multitude of situations, including hosts on even slightly slow links. Effectively a two minute timeout increases the minimum bandwidth that can keep a node running bitcoin by more than 5 fold. This of behavior means that moderate or short lived DOS attacks (or just high utilization) can cause disconnection potentially allowing an attacker to partition the network.

Yes— malicious nodes attacking the network can potentially slow down receiving blocks (and thats still true with your change). This is part of why the standard advice (and widespread practice) for mining nodes is to not expose them to the public network directly. Increasing partition risk, or reducing the ability to run slow or overloaded links at all isn’t a good trade-off.

I’d really encourage you to spend time testing headers first.

rebroad commented at 2:03 am on September 3, 2014: contributor

@gmaxwell Strawman arguments aren’t helping - no one is suggesting disconnecting well-behaving nodes. A node that doesn’t start sending a block within 2 minutes of it being requested, I think we can both agree, is not a well-behaving node.

Since the default behaviour in this pull is set to 2 minutes, I’m confused as to why you’re focusing on this point. Is there anything in this pull or for that matter, the nine commits that this pull comprises of, that you have any objections to?

gmaxwell commented at 2:06 am on September 3, 2014: contributor

It may be wortking fine— could even be the only honest node you’re connected to. Start is the wrong description of what the changes do, they may have sent immediately, and you’ve received 99.9% of it by the time two minutes pass. This is also a tangent, the changes here are not primarily related to steady state block relay.

rebroad commented at 2:12 am on September 3, 2014: contributor

@gmaxwell Are you saying you want me to change the code so that it disconnects nodes if they haven’t managed to send an entire block within two minutes? Surely if it’s send 99.9% then it would be counterproductive to start the download from scratch from another node.

I’m sorry, I’m not understanding what your issue is still. Can you just state simply what you would need to see changed in this pull request before you’d be happy to ACK it?

pstratem commented at 2:16 am on September 3, 2014: contributor

@rebroad this is a fairly large patch for something which will be entirely irrelevant when #4468 is merged

I appreciate the work to improve orphans that stall IBD (it’s annoying to me too!) but really making these changes with headers first right around the corner is unnecessary

NACK

rebroad commented at 2:17 am on September 3, 2014: contributor

@pstratem Hasn’t headers first been “right around the corner” for over 2 years now?

jgarzik commented at 2:21 am on September 3, 2014: contributor

headers-first is the consensus direction. It is not directly apparent, but many headers first-related patches are already in the tree, merged in tiny bites.

Any notable work in this area should be based on top of headers-first, to avoid wasting work.

gmaxwell commented at 2:26 am on September 3, 2014: contributor

Two years? no way. It was originally slated for 0.9 but it was too big to review and test all at once, so it was split up and missed that release. As Jeff notes much of it has already been merged.

BitcoinPullTester commented at 2:54 am on September 3, 2014: none

Automatic sanity-testing: FAILED BUILD/TEST, see http://jenkins.bluematt.me/pull-tester/p4431_cf203c45e2b2dadf74ec20494015680218dd5fa5/ for binaries and test log.

This could happen for one of several reasons:

It chanages paths in makefile.linux-mingw or otherwise changes build scripts in a way that made them incompatible with the automated testing scripts (please tweak those patches in qa/pull-tester)
It adds/modifies tests which test network rules (thanks for doing that), which conflicts with a patch applied at test time
It does not build on either Linux i386 or Win32 (via MinGW cross compile)
The test suite fails on either Linux i386 or Win32
The block test-cases failed (lookup the first bNN identifier which failed in https://github.com/TheBlueMatt/test-scripts/blob/master/FullBlockTestGenerator.java)

If you believe this to be in error, please ping BlueMatt on freenode or TheBlueMatt here.

This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/ Contact BlueMatt on freenode if something looks broken.

laanwj commented at 6:21 am on September 3, 2014: member

@rebroad Instead of impatiently complaining how slow headers-first is being merged you could also help testing and reviewing it. Or even build on top of it. You know the code is available as pull request #4468 and it fully works?

laanwj closed this on Sep 3, 2014

rebroad commented at 6:50 am on September 3, 2014: contributor

@laanwj I am testing it and building on top of it, however, I’d always understood that we were supposed to submit pull requests built on-top of master, so I’m not sure how you would want me to submit patches without including the entirity of #4468 along with every pull request I make.

laanwj commented at 10:26 am on September 3, 2014: member

@rebroad You could submit patches to @sipa directly. Otherwise, just make sure that you don’t conflict with him or do duplicate work.

MarcoFalke locked this on Sep 8, 2021

Fix many orphan problem and stuck block download problem #4431