Strange four-day pause in block-processing on synchronized node #6188

issue laanwj opened this issue on May 26, 2015
  1. laanwj commented at 9:59 AM on May 26, 2015: member

    I noticed a curious four-day pause in block processing on one of my nodes. Before the pause the node was fully synchronized with the network.

    The node was started at, and is running:

    2015-05-15 11:50:17 Bitcoin version v0.10.99.0-524c6f8 (2015-05-15 13:43:29 +0200)
    

    This is master as of commit 6fb90d8, with a few debug logging commits on top.

    It is a well-connected node (both incoming and outgoing connections, both through Tor and IPv4, usually hovering around ~100)

    2015-05-22 01:18:43 UpdateTip: new best=000000000000000009f5588ef4079517c94b05d78b51fdc77284f4a6af28d4f6  height=357497  log2_work=82.820292  tx=69423257  date=2015-05-22 01:17:22 progress=0.999999  cache=37.9MiB(6045tx)
    2015-05-22 01:18:48 receive version message: /getaddr.bitnodes.io:0.1/: version 70002, blocks=357496, us=, peer=17650, peeraddr=
    ...
    

    Today it suddenly started connecting blocks again. This seems to indicate that it had the more recent blocks available, but was missing one intermediate block, which it had been waiting for all that time.

    2015-05-26 08:35:52 Pre-allocating up to position 0x100000 in rev00273.dat
    2015-05-26 08:36:02 UpdateTip: new best=00000000000000000a77f8bcc2949a4e115291e3921d0a95290bf7532ad0d3bd  height=357498  log2_work=82.820328  tx=69423978  date=2015-05-22 01:30:24 progress=0.993628  cache=0.8MiB(0tx)
    2015-05-26 08:36:16 UpdateTip: new best=0000000000000000100bb5f9bb1c1e8c0283d1378ad947b480e2575f72ff3c62  height=357499  log2_work=82.820363  tx=69424950  date=2015-05-22 01:52:45 progress=0.993650  cache=7.4MiB(2790tx)
    2015-05-26 08:36:16 UpdateTip: new best=000000000000000015e8551f12d2e4b8bc262e0415ff09e65691aa1a87416db8  height=357500  log2_work=82.820399  tx=69424951  date=2015-05-22 01:53:02 progress=0.993651  cache=7.4MiB(2791tx)
    2015-05-26 08:36:24 UpdateTip: new best=0000000000000000104e4b5bf9abfe77696e0433bff7034d528775d14de3e462  height=357501  log2_work=82.820434  tx=69425382  date=2015-05-22 01:56:42 progress=0.993654  cache=17.6MiB(4064tx)
    2015-05-26 08:36:31 UpdateTip: new best=00000000000000000f10e4cad7ca93faaf781c62299ae67fd6637d87653ca71f  height=357502  log2_work=82.820469  tx=69425976  date=2015-05-22 02:05:59 progress=0.993664  cache=19.2MiB(5180tx)
    2015-05-26 08:36:53 UpdateTip: new best=0000000000000000052f285f3ab330eb5d45b48fb93b291e0f50737608393d26  height=357503  log2_work=82.820505  tx=69426146  date=2015-05-22 02:07:13 progress=0.993665  cache=20.4MiB(5709tx)
    

    In the meantime it was connected to the network normally, e.g. accepting transactions and receiving new connections. Nothing abnormal appears in the log, either before the pause or before continuing again. Currently it is still connecting blocks, and all the while the node is unreachable through RPC, and all peer connections have timed out (but that's a different issue - #5851).

  2. laanwj added the label P2P on May 26, 2015
  3. laanwj commented at 9:59 AM on May 26, 2015: member

    It could be that this is the issue solved in #5976

  4. sdaftuar commented at 2:11 PM on May 26, 2015: member

    Though that is the type of thing that #5976 should address, I don't think that can be what is going on here -- if this were resolved due to a block download timeout causing a stalling peer to get disconnected, then I think that would have appeared in your debug.log:

    LogPrintf("Timeout downloading block %s from peer=%d, disconnecting\n", state.vBlocksInFlight.front().hash.ToString(), pto->id);.
    Also, it seems implausible that a fully synced node would have so many blocks in flight that it would take 4 days to disconnect a stalling peer. But I don't have an alternate guess to offer about what this could be either; it's puzzling.
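
    A rough sketch of the scaling sdaftuar alludes to (hypothetical `DownloadDeadline` helper and illustrative constants, not Bitcoin Core's actual net-processing code): if the stall deadline grows with the number of blocks queued from a peer, only a very long queue could stretch it toward days, which is why this seems implausible for a fully synced node.

    ```cpp
    #include <cassert>
    #include <cstdint>

    // Simplified model: the timeout deadline is the time downloading
    // started, plus a base allowance, plus a per-queued-block allowance.
    // All names and constants here are illustrative assumptions.
    int64_t DownloadDeadline(int64_t downloading_since_us,
                             size_t blocks_in_flight,
                             int64_t base_timeout_us,
                             int64_t per_block_us) {
        return downloading_since_us + base_timeout_us +
               per_block_us * static_cast<int64_t>(blocks_in_flight);
    }

    int main() {
        const int64_t base = 10'000'000;   // 10 s base (illustrative)
        const int64_t per_block = 500'000; // 0.5 s per queued block

        // A synced node with one block in flight times out quickly...
        assert(DownloadDeadline(0, 1, base, per_block) == 10'500'000);
        // ...while a large queue stretches the deadline far out.
        assert(DownloadDeadline(0, 100, base, per_block) == 60'000'000);
        return 0;
    }
    ```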

  5. laanwj commented at 3:47 PM on May 26, 2015: member

    That message indeed does not appear in the log. After reconnecting four days of blocks, the node is back to running as normal.

  6. laanwj commented at 3:20 PM on May 27, 2015: member

    The same node, still running the same version, and not restarted since, has become stuck again since:

    2015-05-27 10:01:15 UpdateTip: new best=000000000000000001643f7706f3dcbc3a386e4c1bfba852ff628d8024f875b6  height=358228  log2_work=82.845954  tx=69953772  date=2015-05-27 10:00:38 progress=0.999999 cache=30.0MiB(3697tx)
    

    No strange errors in the log - this time I have getpeerinfo though: https://download.visucore.com/tmp/tangerine_20150527_getpeerinfo_filtered.txt No blocks in flight from any node, it appears.

    Anything I should try, to see if it gets going again?

  7. sipa commented at 4:14 PM on May 27, 2015: member

    Strange, not a single block in flight, but many nodes with headers beyond your last synced block.

  8. sipa commented at 4:17 PM on May 27, 2015: member

    Can you also post a getchaintips result?

  9. laanwj commented at 4:48 PM on May 27, 2015: member
  10. sdaftuar commented at 5:08 PM on May 27, 2015: member

    Could you verify that the node doesn't have block 358229? Perhaps: bitcoin-cli getblock 00000000000000000a3ed9a4e25407518aa854f09fa1981adaae9455a91d1966 (I'd like to eliminate the possibility that somehow the node received it and it just failed to connect.)

  11. laanwj commented at 5:17 PM on May 27, 2015: member

    Result:

    $ cli getblock 00000000000000000a3ed9a4e25407518aa854f09fa1981adaae9455a91d1966
     error: {"code":-32603,"message":"Can't read block from disk"}
    

    In the log:

    2015-05-27 17:10:56 ERROR: ReadBlockFromDisk: OpenBlockFile failed for CBlockDiskPos(nFile=-1, nPos=0)
    

    I do have the block after it, #358230,

    cli getblock 000000000000000001f942eb4bfa0aeccb6a14c268f4c72d5fff17270da771b9
    {
        "hash" : "000000000000000001f942eb4bfa0aeccb6a14c268f4c72d5fff17270da771b9",
        "confirmations" : -1,
        "size" : 749157,
        "height" : 358230,
    
  12. sdaftuar commented at 7:39 PM on May 27, 2015: member

    If no one has any better ideas (this is a shot in the dark) maybe it would be helpful to know what happens if you bring down this node's network connection to force disconnection from all your peers, and then bring your network connection back up. Perhaps that would give us a clue as to whether it could be an inconsistency with mapBlocksInFlight (ie if it is unable to sync even after disconnecting from all peers), or if it does sync, maybe that would be a hint that one of your peers is triggering an issue that disconnecting from them resolves?

  13. laanwj commented at 9:50 AM on May 28, 2015: member

    OK - I have rejected all external network access for the user the node runs as, as well as stopped Tor to make sure it doesn't communicate with nodes through the proxy. Now waiting for all connections to time out.

  14. laanwj commented at 10:28 AM on May 28, 2015: member

    getpeerinfo() is empty now. Still no blocks are being connected, so it is not a peer deliberately holding anything up. Unfortunately this bitcoind does not have detailed debugging symbols, so I'm not easily able to dump any internal data structures not exposed on RPC.

    Going to reopen network connectivity...

    Edit: Reopening the network connectivity immediately triggered downloading of the block! It is connecting again. This indeed seems to indicate a problem in the blocks in flight administration, then, although the peerinfo never showed any blocks in flight!

    Note: this is a reasonably slow system; it may be a race condition, or an unhandled edge case when a node times out while it is sending the block to you.

  15. sdaftuar commented at 3:27 PM on May 28, 2015: member

    It might be helpful to add some debugging to show what's in mapBlocksInFlight and vNodesDisconnected. I wonder if perhaps your node requested the block from a peer that disconnected while the block was in flight, but the disconnect was slow to be fully processed.

    It looks like when a peer disconnects, we take it out of vNodes, but leave it hanging around in vNodesDisconnected while we try to clean up. I don't know what might trigger that cleanup process to stall, but if it were to stall, then I think it would match what you've reported here, where getpeerinfo wouldn't return any blocks in flight (since it loops over vNodes to see what those peers have outstanding), but mapBlocksInFlight would still have an entry for that block hash (since that only gets cleaned up when the peer is actually deleted, which doesn't happen until being removed from vNodesDisconnected), preventing FindNextBlocksToDownload from returning it.

  16. laanwj commented at 5:33 AM on May 29, 2015: member

    It caught up partially, but after a while it went off-track again. Going to upgrade the node to latest master (that includes #5976), this time with full debug information.

    done: first run with the new version at least led it to catch up, we'll see what happens now in the following days

  17. laanwj commented at 6:39 AM on June 2, 2015: member

    No issues since. Will reopen this if and when the issue comes back.

  18. laanwj closed this on Jun 2, 2015

  19. DrahtBot locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-13 15:15 UTC
