testnet: 0.13 peers are disconnected due to "stalling"

MarcoFalke commented at 7:23 pm on August 15, 2016: member

As segwit is active on testnet, blocks will only be downloaded from NODE_WITNESS peers. Though, exactly those peers are disconnected after BLOCK_STALLING_TIMEOUT = 2 seconds. I am running master (6e6ab2c)

This behavior is consistent across restarts. Sample debug.log and getchaintips.

It is possible that I shot my own foot at some point, but it seemed smelly enough to create an issue.

MarcoFalke added the label P2P on Aug 15, 2016

sipa added this to the milestone 0.13.1 on Aug 16, 2016

MarcoFalke commented at 4:40 pm on August 16, 2016: member

What is the reason of “Rewinding blocks” in the debug.log at height=894055 date=2016-07-03 13:51:13? I don’t understand how blocks with missing data could end up in the active chain…

Edit: Oh, nevemind. It will appear always in the log, even if no actual rewinding is done.

sipa commented at 4:46 pm on August 16, 2016: member

@MarcoFalke that happens when you progress past the segwit forkpoint with a pre-segwit client, and then upgrade to a post-segwit one.

sipa commented at 5:16 pm on August 16, 2016: member

Ah, yes.

sdaftuar commented at 7:51 pm on August 18, 2016: member

My guess: it looks to me like the stalling logic doesn’t make sense if you have non-witness, NODE_NETWORK peers:

 0        //
 1        // Message: getdata (blocks)
 2        //
 3        vector<CInv> vGetData;
 4        if (!pto->fDisconnect && !pto->fClient && (fFetch || !IsInitialBlockDownload()) && state.nBlocksInFlight < MAX_BLOCKS_IN_TRANSIT_PER_PEER) {
 5            vector<CBlockIndex*> vToDownload;
 6            NodeId staller = -1;
 7            FindNextBlocksToDownload(pto->GetId(), MAX_BLOCKS_IN_TRANSIT_PER_PEER - state.nBlocksInFlight, vToDownload, staller);
 8            BOOST_FOREACH(CBlockIndex *pindex, vToDownload) {
 9                if (State(pto->GetId())->fHaveWitness || !IsWitnessEnabled(pindex->pprev, consensusParams)) {
10                    uint32_t nFetchFlags = GetFetchFlags(pto, pindex->pprev, consensusParams);
11                    vGetData.push_back(CInv(MSG_BLOCK | nFetchFlags, pindex->GetBlockHash()));
12                    MarkBlockAsInFlight(pto->GetId(), pindex->GetBlockHash(), consensusParams, pindex);
13                    LogPrint("net", "Requesting block %s (%d) peer=%d\n", pindex->GetBlockHash().ToString(),
14                        pindex->nHeight, pto->id);
15                }
16            }
17            if (state.nBlocksInFlight == 0 && staller != -1) {
18                if (State(staller)->nStallingSince == 0) {
19                    State(staller)->nStallingSince = nNow;
20                    LogPrint("net", "Stall started peer=%d\n", staller);
21                }
22            }
23        }

Looks to me like when we call FindNextBlocksToDownload for an old, non-segwit peer, you’ll end up setting some segwit peer as a staller, and then potentially disconnecting it, even though doing so doesn’t allow you to download more blocks (as the peers found later may be non-segwit peers, as happened to @marcofalke, based on his debug log).

If this is right, then this is really just a segwit p2p issue. A quick fix would be to not set nStallingSince for the staller if you’re looking at a non-witness peer. I can try to put together a quick patch for you to test.

sdaftuar commented at 8:02 pm on August 18, 2016: member

@MarcoFalke I think there’s a one-liner that should work for you, can you change https://github.com/bitcoin/bitcoin/blob/6e6ab2c3238264b34c0c83ebf703502f5ec72848/src/main.cpp#L6712 to read:

0            if (state.nBlocksInFlight == 0 && staller != -1 && State(pto->GetId())->fHaveWitness) {

I think we’d want a slightly more subtle fix for master, but I think that should prevent you from disconnecting all your segwit peers.

gmaxwell commented at 8:06 pm on August 18, 2016: contributor

@sdaftuar seems plausible. I’m interested in knowing if that fixes it for marco. He should backup his state before trying it. (so if we come up with a different fix we could test that one too)

sdaftuar commented at 8:20 pm on August 18, 2016: member

@MarcoFalke It would also be helpful to see a debug log with -debug=net, while you’re able to reproduce the problem, so we can more fully see what’s happening.

MarcoFalke commented at 8:34 pm on August 18, 2016: member

Thanks @gmaxwell @sdaftuar !

I will try the patch and post results as well as a debug.log.

I just noticed I was on a different network (rural area) so if anyone tries to reproduce, make sure to limit the bandwidth available to a few hundred kB/s

MarcoFalke commented at 8:40 pm on August 18, 2016: member

debug log with -debug=net

Done: debug.log.txt

MarcoFalke commented at 8:52 pm on August 18, 2016: member

Apparently the patch from above

 0diff --git a/src/main.cpp b/src/main.cpp
 1index 75f991d..44eefc7 100644
 2--- a/src/main.cpp
 3+++ b/src/main.cpp
 4@@ -6715,7 +6715,7 @@ bool SendMessages(CNode* pto)
 5                         pindex->nHeight, pto->id);
 6                 }
 7             }
 8-            if (state.nBlocksInFlight == 0 && staller != -1) {
 9+            if (state.nBlocksInFlight == 0 && staller != -1 && State(pto->GetId())->fHaveWitness) {
10                 if (State(staller)->nStallingSince == 0) {
11                     State(staller)->nStallingSince = nNow;
12                     LogPrint("net", "Stall started peer=%d\n", staller);

improves the situation and sync appears to work. Though, I can still see disconnects due to stall in the log…

sdaftuar commented at 8:57 pm on August 18, 2016: member

@MarcoFalke Ok great!

I would still expect to see disconnects due to “stalling” if you’re on a low bandwidth connection, during IBD. I would guess that you’d see that on mainnet as well (just without getting stuck).

As I understand the stalling code (maybe @sipa will correct me), we will pretty aggressively disconnect our slowest peers during IBD, especially if you’re on a low-bandwidth connection. Pre-segwit, I think this is not such a big deal, because it’s helpful to try to download the most blocks from your fastest peers, and you’ll only disconnect if you have some other peer to try using, and you’ll typically be able to easily replace the ones you disconnect with other acceptable peers.

MarcoFalke commented at 9:08 pm on August 18, 2016: member

I ran some more tests each time resetting the datadir to the “corrupt” state and I can confirm the patch by @sdaftuar indeed fixes the “Unable to sync”-problem.

sipa commented at 9:11 pm on August 18, 2016: member

Great to hear. I believe this means we can treat this as a non-blocker for 0.13.0.

MarcoFalke commented at 9:28 pm on August 18, 2016: member

Agree, @laanwj could tag 0.13.0 tomorrow, so we can use the weekend to produce gitian results.

gmaxwell commented at 0:25 am on August 19, 2016: contributor

I agree that this issue is not a blocker for 0.13.

ntpd commented at 10:05 pm on August 25, 2016: none

 02016-08-25 21:43:02.365982 Peer=3151 is stalling block download, disconnecting
 12016-08-25 21:43:54.759148 Peer=3152 is stalling block download, disconnecting
 22016-08-25 21:45:13.561092 Peer=3153 is stalling block download, disconnecting
 32016-08-25 21:45:33.461048 Peer=3154 is stalling block download, disconnecting
 42016-08-25 21:46:40.283416 Peer=3155 is stalling block download, disconnecting
 52016-08-25 21:46:52.484168 Peer=3156 is stalling block download, disconnecting
 62016-08-25 21:47:57.365459 Peer=3157 is stalling block download, disconnecting
 72016-08-25 21:49:15.097728 Peer=3158 is stalling block download, disconnecting
 82016-08-25 21:49:19.248249 Peer=3159 is stalling block download, disconnecting
 92016-08-25 21:49:42.205163 Peer=3160 is stalling block download, disconnecting
102016-08-25 21:50:33.362082 Peer=3161 is stalling block download, disconnecting
112016-08-25 21:51:43.281139 Peer=3162 is stalling block download, disconnecting
122016-08-25 21:52:11.135347 Peer=3163 is stalling block download, disconnecting
132016-08-25 21:52:52.918473 Peer=3164 is stalling block download, disconnecting
142016-08-25 21:53:03.191162 Peer=3165 is stalling block download, disconnecting
152016-08-25 21:53:56.103521 Peer=3166 is stalling block download, disconnecting
162016-08-25 21:55:14.881598 Peer=3167 is stalling block download, disconnecting

I hit this on testnet, building the patch.

ntpd commented at 10:22 pm on August 25, 2016: none

best=0000000000000086c72142840f295185d472afb5697788b20c01724bd163298f height=918040 version=0x20000000 log2_work=67.888601 tx=10810312 date='2016-07-31 18:33:26' progress=0.999302 cache=47.4MiB(126718tx)

Back to syncing with peers using Marco’s patch.

maaku commented at 10:44 pm on August 25, 2016: contributor

I appear to have hit this as well on testnet. Patch got me unstuck.

(I disdain +1 posts on github, but am offering this as evidence that this isn’t a rare or isolated case.)

MarcoFalke added the label Priority Medium on Aug 26, 2016

sipa commented at 7:45 pm on August 26, 2016: member

@MarcoFalke Could you try https://github.com/sipa/bitcoin/commits/fixwitban ?

laanwj closed this on Aug 31, 2016

dooglus commented at 11:13 pm on August 12, 2017: contributor

@MarcoFalke @sdaftuar re. this patch:

-            if (state.nBlocksInFlight == 0 && staller != -1) {
+            if (state.nBlocksInFlight == 0 && staller != -1 && State(pto->GetId())->fHaveWitness) {

I think that should prevent you from disconnecting all your segwit peers.

Don’t you mean all my non-segwit peers? fHaveWitness is true for all my segwit peers and so adding that as a condition is a no-op for them.

The patch made no different for me anyway. But putting a ! before the State() call stopped it disconnecting my peers, and I did eventually get sent a couple of blocks I hadn’t seen before.

I’m thinking I need to disconnect stalling peers, but give them more than 2 seconds to start sending blocks.

DrahtBot locked this on Sep 8, 2021

testnet: 0.13 peers are disconnected due to “stalling” #8518