Hopefully final fix for the stuck blockchain issue #1315

sipa commented at 3:40 PM on May 15, 2012: member

Immediately issue a "getblocks", instead of a "getdata" (which will trigger the relevant "inv" to be sent anyway), and only do so when the previous set of invs led us into a known and attached part of the block tree.

This patch has been tested on a (constructed) blockchain that was effectively stuck.

rebroad commented at 3:43 PM on May 15, 2012: contributor

I suspected getblocks was probably the answer :)

It would be very nice to see a mini white paper on what this does and how it works....

Hopefully final fix for the stuck blockchain issue

Immediately issue a "getblocks", instead of a "getdata" (which will
trigger the relevant "inv" to be sent anyway), and only do so when
the previous set of invs led us into a known and attached part of
the block tree.

385f730f31

gmaxwell commented at 2:13 AM on May 16, 2012: contributor

K. Tested resync from start several times. Tested partial resync. Tested recovery from fork with reorg on non-stuck node. Tested recovery from a forkmode stuck node. Tested with loadblocks. Made sure it wasn't bloating up the chain with a ton of copies of extra blocks. I can't break it, so I'm pulling.

gmaxwell referenced this in commit 462f5d98a2 on May 16, 2012

gmaxwell merged this on May 16, 2012

gmaxwell closed this on May 16, 2012

rebroad commented at 2:09 AM on May 17, 2012: contributor

Just out of interest, does the initial getblocks (that's sent to the first peer upon starting the node) also cause the recovery from a fork-stuck node? Is this change meant to let it become unstuck without restarting the node? Or did even restarting not fix things?

gmaxwell commented at 2:23 AM on May 17, 2012: contributor

Even restarting did not fix the particular issue this fix was needed to address, but normal nodes probably can't get into that state. (The nodes in question were ones that got stuck after incorrectly rejecting the correct chain, e.g. because of premature BIP16 enforcement.)

sipa commented at 10:10 AM on May 17, 2012: member

A bit more elaborate: if you were running an 0.6 RC, you would have code that used the old BIP16 switchover date. The date passed, but you did not update your software. Suddenly someone sends an invalid BIP16 transaction (so, one that is valid according to the traditional rules, but not according to the BIP16 rules). On the main network BIP16 validation is not active, so the transaction gets accepted. However, your old RC enforces BIP16 validation, so it considers this transaction invalid. This only happens after downloading the block that contains it, and adding it to the tree in the database. A few hundred blocks are added on top of this one, all in your database, but this chain does not become the best chain (as it is considered invalid).

Finally, you upgrade your software, and you now have the correct BIP16 switchover date. The correct chain is already downloaded in your block database, but it is not marked as the active best chain. At startup, your node sends a getblocks from its current best tip (which is one block before the one that contained the invalid BIP16 transactions) to the top of the chain. The peer answers by sending 500 invs back, and remembers to request 500 more when the last of those is downloaded. However, we already have the first 500, so not one is requested, and nothing happens. We must somehow make the peer send us the rest of the invs, as that is our only means for reconnecting that chain to the current best block. Earlier versions of this patch forced a getdata of that 500th block; this one sends a getblocks immediately.
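The stall described above can be reproduced with a toy model. This is only a sketch under stated assumptions, not the patch itself: blocks are represented by their heights, the peer always answers a getblocks with at most 500 invs, and the names `getblocks`, `sync`, and `immediate_getblocks` are hypothetical.

```python
BATCH = 500  # invs returned per getblocks, as described in the comment above


def getblocks(peer_chain, from_height):
    """Toy peer reply: up to BATCH block heights following from_height."""
    return peer_chain[from_height + 1 : from_height + 1 + BATCH]


def sync(peer_chain, have, tip, immediate_getblocks):
    """Return (getblocks messages sent, first height requested via getdata or None).

    have: set of block heights already stored in our database.
    tip:  height of our current best chain tip.
    immediate_getblocks: hypothetical switch modelling the behaviour of this patch.
    """
    sent = 0
    start = tip
    while True:
        invs = getblocks(peer_chain, start)
        sent += 1
        if not invs:
            return sent, None  # peer has nothing beyond `start`
        unknown = [h for h in invs if h not in have]
        if unknown:
            return sent, unknown[0]  # normal download resumes from here
        if not immediate_getblocks:
            # Every announced block is already known, nothing is requested,
            # and the peer is never prompted for the next batch: stuck.
            return sent, None
        start = invs[-1]  # ask again, starting from the last known block


# Made-up numbers: the peer has heights 0..2000, we have 0..1200 on disk,
# but our active best tip is still at height 100.
peer_chain = list(range(2001))
have = set(range(1201))

stuck = sync(peer_chain, have, tip=100, immediate_getblocks=False)
fixed = sync(peer_chain, have, tip=100, immediate_getblocks=True)
print(stuck, fixed)
```

In this model the old behaviour stops after a single getblocks without requesting anything, while the new behaviour keeps asking until the invs reach a block that is not yet in the database.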

rebroad commented at 10:22 AM on May 17, 2012: contributor

I understand the explanation so far, but it still doesn't explain how the new getblocks achieves that, why receiving the very latest block doesn't fix it, or why it can't be coded to re-evaluate the last 6 or so blocks in the last known valid chain to see if they are still valid under any new rules. The last solution would be better, IMHO, as it wouldn't increase network traffic, unlike this fix (kludge?).

rebroad commented at 10:29 AM on May 17, 2012: contributor

Sorry, meant to say that it could re-check the invalid chain upon startup, perhaps via a command-line option, or perhaps automatically whenever the invalid chain is longer by 6 blocks or more.

sipa commented at 10:30 AM on May 17, 2012: member

This fix will, over the course of an entire blockchain syncup, maybe cause 50 kilobytes of extra communication. What you suggest is also possible, but harder and with fewer guarantees, in my opinion. You'd need to traverse the entire blockchain database, find stale chains, and re-evaluate them all?

rebroad commented at 10:35 AM on May 17, 2012: contributor

Doesn't this fix also increase data transfer even after the node has caught up? I thought it does getblocks upon receipt of every block, doesn't it?

To re-evaluate the invalid chain, it would only need to re-evaluate one block upon start-up in the example you give: the first block in the longest invalid chain.

rebroad commented at 10:37 AM on May 17, 2012: contributor

Also, technically, I'd say this current fix requires a BIP.

sipa commented at 10:44 AM on May 17, 2012: member

In normal operation, this patch does nothing. It only sends out a getblocks when an inv is received with blocks that are already known and part of the block tree. During normal operation, this never happens, as you only request invs for the part after the main chain. And the block-sync process has never been well-formalized, though the responses to the network requests are. Those aren't changed however.
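The rule stated above can be written as a small decision function. This is a sketch only: `on_block_inv`, `block_index`, and `orphans` are hypothetical names, and checking only the last entry of the inv message is an assumption made for illustration, not something this thread confirms.

```python
def on_block_inv(inv_hashes, block_index, orphans):
    """Decide what to send back for one inv message of block hashes.

    block_index: hashes of blocks that are known and attached to the block tree.
    orphans:     hashes of blocks that are known but not yet attached.
    Returns a list of (message, hash) tuples.
    """
    actions = []
    for i, h in enumerate(inv_hashes):
        is_last = i == len(inv_hashes) - 1
        if h not in block_index and h not in orphans:
            # Unknown block: request it as usual.
            actions.append(("getdata", h))
        elif is_last and h in block_index:
            # Assumption: the inv bundle ended in a known, attached block,
            # so ask the peer to continue from there.
            actions.append(("getblocks", h))
    return actions


# Invs for blocks past our chain only produce getdata requests.
normal = on_block_inv(["c", "d"], block_index={"a", "b"}, orphans=set())
# Invs that end in an already attached block produce a getblocks.
known = on_block_inv(["a", "b"], block_index={"a", "b"}, orphans=set())
print(normal, known)
```

Under these assumptions, an inv that only announces blocks beyond the node's chain never reaches the getblocks branch.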

rebroad commented at 10:51 AM on May 17, 2012: contributor

I think you are incorrect to say it doesn't happen during normal operation. This is not my experience. When a new block arrives on the network, let's say 8 nodes announce it in invs. My node will getdata it from the first one, download it, and ProcessBlock it, usually well before the last connected peer sends an inv for it. So with this code, for each new block, the slowest peers to announce it will receive the new getblocks from this fix. As the network gets bigger, this could get worse. It could also get less bad if ProcessBlock takes longer due to larger blocks.
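The timing argument above can be illustrated with a rough model. All numbers here are made up, `extra_getblocks` is a hypothetical helper, and the model assumes that any inv arriving after the block has been processed triggers one getblocks, with no de-duplication of requests.

```python
def extra_getblocks(announce_times_ms, processing_delay_ms):
    """Count peers whose inv for a new block arrives after we processed it.

    announce_times_ms:   arrival time of each peer's inv for the same block.
    processing_delay_ms: time from the first inv until the block is downloaded
                         and processed (getdata round trip plus validation).
    """
    done = min(announce_times_ms) + processing_delay_ms
    # Invs that arrive before `done` refer to a block not yet in the tree,
    # so in this model they do not trigger a getblocks.
    return sum(1 for t in announce_times_ms if t > done)


# Eight peers announcing the same block at made-up times (milliseconds).
arrivals = [0, 40, 90, 150, 300, 450, 700, 1200]

fast = extra_getblocks(arrivals, processing_delay_ms=200)
slow = extra_getblocks(arrivals, processing_delay_ms=1000)
print(fast, slow)
```

In this model, a longer processing delay leaves fewer late announcements, which matches the last point of the comment above.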

gmaxwell commented at 1:15 PM on May 17, 2012: contributor

While previously testing this I specifically looked for excess requests during normal operations and didn't see any. Either I made a mistake or just had unlucky timing, or it's something about the peer mix that's triggering it, because I see them now: about 1769 of them on 05/16.

Actually, they seem to be caused in high volume by specific peers. E.g. I have a couple which are each responsible for several hundred of them.

rebroad commented at 4:32 PM on May 17, 2012: contributor

I've added a fix for this in my current bitcoin-ParallelBlockDownload branch (the 3rd commit of pull #1326). I still think the ideal solution is to do it without using the network, though.

coblee referenced this in commit 4c682bbb38 on Jul 17, 2012

lateminer referenced this in commit d202c9c170 on Jan 22, 2019

lateminer referenced this in commit 586a051c7f on May 6, 2020

bitcoin locked this on Sep 8, 2021