Per-peer block tracking, stalled block download detection, orphan pool limiting #3514

pull sipa wants to merge 1 commit into bitcoin:master from sipa:headersfirst4 changing 4 files +131 −29
  1. sipa commented at 6:21 pm on January 11, 2014: member

    This pull request brings in several block-download-related improvements. It’s intended as a preparation for headers-first, but it’s useful on its own.

    Changes:

    • Track blocks being downloaded per-peer, and limit the number in-flight.
    • Never request the same block twice, so no more duplicate downloads during sync.
    • If a peer seems to be too slow to send blocks we need, it is disconnected and its in-flight blocks can be requested elsewhere (which saves bandwidth for both us and them).
    • Limit the number of orphans held in memory, as these still happen very frequently during sync (the result of accidentally downloading blocks from two peers at once).

    This should result in generally better synchronization behaviour from random/multiple peers, but may slow down the first part of the process (due to small blocks) and fetching over high-latency, high-bandwidth connections (which is relatively rare).
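
    A minimal sketch of the per-peer bookkeeping the first two bullets describe, assuming a simple hash-to-peer index; every name below (PeerStateSketch, MarkBlockAsInFlightSketch, g_blocksInFlight, …) is an illustrative placeholder rather than an identifier from this commit:

        // Illustrative sketch of per-peer in-flight block tracking; the names
        // are placeholders, not the identifiers used in this commit.
        #include <cstdint>
        #include <list>
        #include <map>
        #include <string>

        typedef int NodeId;
        typedef std::string BlockHash; // stand-in for uint256

        struct QueuedBlock {
            BlockHash hash;
            int64_t nTimeRequested; // when the getdata went out, for stall detection
        };

        struct PeerStateSketch {
            std::list<QueuedBlock> vBlocksInFlight; // blocks requested from this peer
            int64_t nLastBlockReceive = 0;          // last time any block arrived
        };

        static std::map<NodeId, PeerStateSketch> g_peerState;
        // Global index: which peer each block is in flight from, so the same block
        // is never requested from two peers at once.
        static std::map<BlockHash, NodeId> g_blocksInFlight;

        // Returns false if the block is already in flight from some peer, in which
        // case the caller should not send another getdata for it.
        bool MarkBlockAsInFlightSketch(NodeId peer, const BlockHash& hash, int64_t now)
        {
            if (g_blocksInFlight.count(hash)) return false;
            g_peerState[peer].vBlocksInFlight.push_back(QueuedBlock{hash, now});
            g_blocksInFlight[hash] = peer;
            return true;
        }

        // Called when the block arrives or the peer goes away, so the hash becomes
        // requestable from another peer if it is still needed.
        void MarkBlockAsReceivedSketch(const BlockHash& hash, int64_t now)
        {
            std::map<BlockHash, NodeId>::iterator it = g_blocksInFlight.find(hash);
            if (it == g_blocksInFlight.end()) return;
            PeerStateSketch& state = g_peerState[it->second];
            state.vBlocksInFlight.remove_if(
                [&hash](const QueuedBlock& qb) { return qb.hash == hash; });
            state.nLastBlockReceive = now;
            g_blocksInFlight.erase(it);
        }

    Disconnecting a stalled peer would then walk its vBlocksInFlight list and release each entry the same way, so those hashes become requestable from other peers again.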

    Test plan:

    • Delete blocks/ and chainstate/
    • Start a download from random peers (no special options, or use -addnode=IP optionally).
      • Verify that no “already have block …” messages appear during sync.
      • Check that synchronization completes (it may still be slow, as there is no optimization to select particularly good peers to download from, and orphans are expected).
    • Delete blocks/ and chainstate/ again.
    • Start a download from a single fast peer (-connect=IP)
      • Verify that no “already have block …” messages appear during sync.
      • Check that synchronization completes and progresses quickly (doesn’t get stuck; it may be a bit slower than before this patch, particularly in the initial part of the chain; very occasional orphans are possible)

    Headers-first will remove part of the logic in this pull request again, but as it’s useful on its own, I prefer getting this reviewed and merged anyway:

    • There won’t be a per-peer to-be-downloaded queue of block hashes anymore; those will be decided globally based on good known (and verified) headers chains.
    • The per-peer block-downloads-in-flight will remain, but becomes a queue of CBlockIndex*’es instead of uint256’s.
    • There won’t be any orphan processing in the block download mechanism anymore (so no need to limit orphans).
    • Stalled download detection won’t use as many hardcoded constants anymore (as the question becomes “is this peer too slow compared to the others to keep up” rather than “is this peer slow according to some absolute measurement”).

    Tested by synchronizing while running in valgrind.

    Depends on #3370 (Prepare block connection logic for headers-first).

  2. laanwj commented at 5:16 pm on January 13, 2014: member
    Needs rebase after #3516
  3. sipa commented at 5:17 pm on January 13, 2014: member
    Will rebase tonight.
  4. sipa commented at 8:49 pm on January 18, 2014: member
    Did a sync from scratch using this branch (plus #3517) on mainnet, a significant part of which under valgrind. Seems to work fine, though sometimes long chains of accumulated orphans are connected, resulting in cs_main being locked for a long time. I’ll try to fix that independently.
  5. sipa commented at 9:36 pm on January 27, 2014: member
    Rebase after logging changes.
  6. jgarzik commented at 7:52 pm on January 29, 2014: contributor
    ACK
  7. laanwj commented at 1:22 pm on January 30, 2014: member
    I did some testing with block syncs, no problems found, worked fine for me.
  8. sipa commented at 1:26 pm on January 30, 2014: member

    There is one potential problem here, namely that the many orphans can result in very long chains (I’ve seen up to ~400) of blocks being connected within a single cs_main lock, essentially making the program look frozen. This can be solved by pushing some locks down, but may need some discussion first.

    I could easily submit the limit-orphans commit as a separate pull request though, as I suppose that can be seen as a memory usage bug fix.
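
    One simple way to realize such a limit-orphans change is a capped pool with random eviction; the sketch below is illustrative only (the cap value and all names are made up, not taken from the commit):

        // Sketch of a capped orphan-block pool with random eviction. The cap value
        // and all names are illustrative, not taken from the actual commit.
        #include <cstdlib>
        #include <iterator>
        #include <map>
        #include <string>
        #include <vector>

        typedef std::string BlockHash; // stand-in for uint256

        struct OrphanBlockSketch {
            BlockHash hashPrev;                  // parent block we are still missing
            std::vector<unsigned char> vchBlock; // serialized block data
        };

        static const size_t MAX_ORPHAN_BLOCKS_SKETCH = 750; // illustrative cap
        static std::map<BlockHash, OrphanBlockSketch> g_orphanBlocks;

        // Evict random orphans until the pool is back under the cap. Random eviction
        // keeps the logic simple and avoids giving an attacker a predictable way to
        // push out a specific orphan.
        void LimitOrphanBlocksSketch()
        {
            while (g_orphanBlocks.size() > MAX_ORPHAN_BLOCKS_SKETCH) {
                std::map<BlockHash, OrphanBlockSketch>::iterator it = g_orphanBlocks.begin();
                std::advance(it, std::rand() % g_orphanBlocks.size());
                g_orphanBlocks.erase(it);
            }
        }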

  9. Per-peer block download tracking and stalled download detection.
    Keep track of which block is being requested (and to be requested) from
    each peer, and limit the number of blocks in-flight per peer. In addition,
    detect stalled downloads, and disconnect if they persist for too long.
    
    This means blocks are never requested twice, and should eliminate duplicate
    downloads during synchronization.
    f59d8f0b64
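
    As a rough illustration of the threshold-based stall detection the commit message describes: a peer with blocks in flight that has delivered nothing for longer than some hardcoded timeout is treated as stalled and disconnected, so its blocks can be re-requested elsewhere. The constant and names below are placeholders, not the ones in the commit:

        // Rough sketch of threshold-based stall detection. All constants and
        // names here are placeholders.
        #include <algorithm>
        #include <cstdint>

        static const int64_t BLOCK_DOWNLOAD_TIMEOUT_SKETCH = 60; // seconds, illustrative

        struct PeerDownloadStateSketch {
            int nBlocksInFlight = 0;
            int64_t nLastBlockReceive = 0;  // last time a requested block arrived
            int64_t nLastBlockRequest = 0;  // last time we sent this peer a getdata
        };

        // Returns true if the peer should be dropped as a staller; its in-flight
        // blocks can then be re-requested from other peers.
        bool IsPeerStallingSketch(const PeerDownloadStateSketch& s, int64_t now)
        {
            if (s.nBlocksInFlight == 0)
                return false; // nothing outstanding, nothing to stall on
            int64_t nLastActivity = std::max(s.nLastBlockReceive, s.nLastBlockRequest);
            return now - nLastActivity > BLOCK_DOWNLOAD_TIMEOUT_SKETCH;
        }

    As noted in the pull description, the headers-first follow-up is expected to replace this absolute timeout with a comparison of one peer’s progress against the others.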
  10. BitcoinPullTester commented at 4:31 pm on February 8, 2014: none
    Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/f59d8f0b644d49324cabd19c58cf2262d49e1392 for binaries and test log. This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/ Contact BlueMatt on freenode if something looks broken.
  11. rebroad commented at 1:06 am on February 23, 2014: contributor
    Regarding slowing down the download of the small blocks - how about factoring in the block size and incrementally changing how many peers are used as the block sizes get larger?
  12. in src/main.cpp: in f59d8f0b64
    2104@@ -2021,7 +2105,6 @@ bool AddToBlockIndex(CBlock& block, CValidationState& state, const CDiskBlockPos
    2105          pindexNew->nSequenceId = nBlockSequenceId++;
    2106     }
    2107     assert(pindexNew);
    2108-    mapAlreadyAskedFor.erase(CInv(MSG_BLOCK, hash));
    


    rebroad commented at 1:08 am on February 23, 2014:
    Why was this here?

    sipa commented at 1:40 pm on February 28, 2014:
    Because blocks used to be tracked using the (weak) mapAlreadyAskedFor mechanism; this patch introduces a better way that retains information about which peer each block was requested from.
  13. rebroad commented at 11:29 pm on February 26, 2014: contributor

    I’ve been testing this a few days now. The stuck block detection doesn’t seem to be working:-

    2014-02-26 22:24:01 Requesting block 00000000000000014e06412631ff67305cde77c6631dc58a2ce35308558cee05 from 216.99.101.192:8333
    2014-02-26 22:28:31 ResendWalletTransactions()
    2014-02-26 22:54:16 socket recv error 110
    2014-02-26 22:54:17 receive version message: /Satoshi:0.8.6/: version 70001, blocks=287993, us=70.42.240.30:57711, them=207.255.42.202:8333, peer=207.255.42.202:8333
    2014-02-26 22:54:17 Added time data, samples 10, offset +0 (+0 minutes)
    2014-02-26 22:56:34 socket recv error 110
    2014-02-26 22:57:26 connect() to 173.79.229.22:8333 failed after select(): Connection refused
    2014-02-26 22:57:28 socket recv error 110
    2014-02-26 22:58:03 socket recv error 110

    Almost 30 minutes went by with no download happening, and it didn’t seem to notice.

  14. in src/main.h: in f59d8f0b64
    58@@ -59,6 +59,11 @@
    59 static const unsigned int LOCKTIME_THRESHOLD = 500000000; // Tue Nov  5 00:53:20 1985 UTC
    60 /** Maximum number of script-checking threads allowed */
    61 static const int MAX_SCRIPTCHECK_THREADS = 16;
    62+/** Number of blocks that can be requested at any given time from a single peer. */
    63+static const int MAX_BLOCKS_IN_TRANSIT_PER_PEER = 128;
    


    rebroad commented at 11:31 pm on February 26, 2014:
    Magic number… why so high? In the later stages of block download I’d have thought a number close to 1 would be appropriate.

    sipa commented at 10:26 pm on March 1, 2014:
    That would be a further optimization, yes.
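
    Purely as an illustration of what such a further optimization could look like (nothing along these lines is in this commit): the in-flight window could shrink as the node approaches the best known height:

        // Purely illustrative: shrink the per-peer in-flight window as the node
        // approaches the best known height. Not part of this commit.
        static const int MAX_BLOCKS_IN_TRANSIT_SKETCH = 128;

        int BlocksInTransitWindowSketch(int nOurHeight, int nBestKnownHeight)
        {
            int nRemaining = nBestKnownHeight - nOurHeight;
            if (nRemaining <= 0)
                return 1;                          // at (or past) the tip
            if (nRemaining < MAX_BLOCKS_IN_TRANSIT_SKETCH)
                return nRemaining;                 // nearly caught up: small window
            return MAX_BLOCKS_IN_TRANSIT_SKETCH;   // deep in initial sync
        }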
  15. rebroad commented at 8:37 pm on March 1, 2014: contributor
    So far upon testing this code, I’ve not yet seen it downloading blocks from more than one node. There’s not even any code to send a getblocks to more than one node from what I can tell. If a node is deemed to be unresponsive it’s forcefully disconnected. How come this is ok, and yet my pull request to disconnect nodes sending blocks my nodes already had was considered not ok?
  16. sipa commented at 8:44 pm on March 1, 2014: member

    This does not do parallel block download yet, as that requires headers-first sync. This is a step towards that, though.

    And disconnecting unresponsive peers is fine - it’s the only way to not waste their bandwidth if we don’t need/want requested blocks anymore.

    What is not fine is disconnecting peers because they’re the second to respond to a block that we requested twice. We should simply not request the same block twice in the first place. That is what this pull request is supposed to do. If you still see “already have block” messages with this patch, there is possibly a bug in it that will need to be addressed.

  17. rebroad commented at 0:27 am on March 2, 2014: contributor
    @sipa I deleted my previous comment about “already have” messages - none seen with this patch so far! But… a misbehaving node could still send random blocks if it wanted to, so at some point in the future I suspect we’ll want to set that sort of behaviour as misbehaviour - maybe even as part of this pull request…?
  18. sipa commented at 0:44 am on March 2, 2014: member
    @rebroad That’s a good question. The problem is that the current code actually does broadcast blocks sometimes without an inv, namely newly mined blocks, which are certain not to exist anywhere else.
  19. rebroad commented at 7:41 am on March 2, 2014: contributor

    Why would it broadcast a block without broadcasting the inv first?

    As a side note, I noticed the block download got stuck today and wasn’t able to recover - it was still downloading blocks that were being advertised (i.e. the latest block mined), but the initial block download stayed stuck.

  20. rebroad commented at 9:09 am on March 2, 2014: contributor
    @sipa Re misbehaviour, it could just be for nodes sending unsolicited blocks that the node already has. Perhaps +30.
  21. rebroad commented at 0:15 am on March 8, 2014: contributor
    The stuck download detection seems like it might not be working.
  22. laanwj referenced this in commit b76733d8e8 on Mar 10, 2014
  23. laanwj merged this on Mar 10, 2014
  24. laanwj closed this on Mar 10, 2014

  25. rebroad commented at 4:08 am on March 30, 2014: contributor
    I think this has been pulled prematurely. It’s still got problems, as I mentioned above. For one, it doesn’t resume the download if the OS is suspended and resumed.
  26. laanwj commented at 7:47 am on March 30, 2014: member
    @rebroad can you post new problems and bug reports in new issues? If you just dump them into existing, closed issues they will probably get lost.
  27. laanwj commented at 10:32 am on March 30, 2014: member

    And IMO, going toward headers-first / parallel block download is the single most important point in continued development of Bitcoin Core right now.

    As this pull is an important step in that direction, it was prioritized to be merged as soon as possible after 0.9.0 was released.

    If a little temporary inconvenience while the kinks are worked out is too much for you, I suggest staying with 0.9 for now. Thanks for helping test though.

  28. sipa commented at 3:01 pm on March 30, 2014: member
    The stuck download detection here is just a very simple prototype. It only uses some hardcoded constants as thresholds. The purpose is to turn it into a detection of one peer being slower than the others, once actual parallel downloading is there. @rebroad Is this a regression introduced by this pullreq, or a pre-existing problem that you expected to be fixed by this but wasn’t?
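
    A hedged sketch of the relative notion described here: once blocks are fetched from several peers in parallel, a peer counts as stalling when it alone holds the block the chain needs next while every other peer has already emptied its window. Names and structure below are illustrative, not from Bitcoin Core:

        // Illustrative sketch of "relative" stall detection across parallel peers.
        #include <map>

        typedef int NodeId;

        struct PeerWindowSketch {
            int nBlocksInFlight;          // outstanding requests to this peer
            bool fHoldsNextNeededBlock;   // it was asked for the block the tip needs next
        };

        // Returns the peer blocking the download window, or -1 if none.
        NodeId FindStallingPeerSketch(const std::map<NodeId, PeerWindowSketch>& peers)
        {
            int nOtherInFlight = 0;
            NodeId candidate = -1;
            for (std::map<NodeId, PeerWindowSketch>::const_iterator it = peers.begin();
                 it != peers.end(); ++it) {
                if (it->second.fHoldsNextNeededBlock && it->second.nBlocksInFlight > 0)
                    candidate = it->first;
                else
                    nOtherInFlight += it->second.nBlocksInFlight;
            }
            // Everyone except the candidate has delivered their blocks: the candidate
            // is the one holding the whole download back.
            return (candidate != -1 && nOtherInFlight == 0) ? candidate : -1;
        }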
  29. rebroad commented at 5:27 pm on March 31, 2014: contributor
    @sipa it’s a regression.
  30. laanwj commented at 7:27 am on April 4, 2014: member
    I suspended my PC with bitcoin running a few nights and it had no problem recovering connections when re-awakened. @rebroad Do you perhaps have a dynamic IP that changes if your computer was suspended for a while? If so, Bitcoin indeed doesn’t cope with that too well (it will AFAIK still advertise the old address, so you won’t get incoming connections). But that’s not a regression or new issue.
  31. rebroad commented at 3:29 am on April 6, 2014: contributor
    @laanwj yes, my IP address changes, but I don’t accept incoming connections either. The getting stuck after a suspend didn’t happen before I applied this pullreq AFAIK.
  32. in src/main.cpp: in f59d8f0b64
    300+void MarkBlockAsInFlight(NodeId nodeid, const uint256 &hash) {
    301+    CNodeState *state = State(nodeid);
    302+    assert(state != NULL);
    303+
    304+    // Make sure it's not listed somewhere already.
    305+    MarkBlockAsReceived(hash);
    


    rebroad commented at 9:28 am on October 28, 2016:

    Really don’t like the use of functions that clearly don’t do what their name suggests they do.

    What parts of this function are needed? And what would the function be called if it did only what it needs to do here?

  33. satoshi-n commented at 11:32 pm on September 5, 2020: none
    Incompetence or malicious bbbb
  34. DrahtBot locked this on Feb 15, 2022
