Add integrity checks for missing blk*.dat files. #4173

pull ashleyholman wants to merge 2 commits into bitcoin:master from ashleyholman:missingblock changing 1 files +19 −1

ashleyholman commented at 5:41 AM on May 11, 2014: contributor

This is a fix for #3957.

Patch behaviour: if a peer requests a block which is in the block index, but the block file (blk*.dat) is missing, an error is logged to debug.log and no response is sent to the peer.

Buggy behaviour: without this patch, an invalid 'block' message is sent which contains null bytes.

Should anything further be done in response to this issue? Perhaps re-request the block from peers? Something else?
luke-jr commented at 7:39 AM on May 11, 2014: member

A reject message seems appropriate. It may be necessary to disconnect the peer as well, since the blockchain downloading code cannot handle a peer's failure to provide the block.
ashleyholman commented at 8:09 AM on May 11, 2014: contributor

@luke-jr currently all of the rejection states seem to relate to the peer doing something unacceptable (MALFORMED, INVALID, OBSOLETE, DUPLICATE, NONSTANDARD, DUST, INSUFFICIENTFEE, CHECKPOINT). Does there need to be an additional state added for when the request was unable to be fulfilled due to some runtime issue? like, REJECT_UNAVAILABLE? Similar in purpose to an HTTP 503.
gmaxwell commented at 8:11 AM on May 11, 2014: contributor

This should just be an assert right now, it's not a covered case in the code. (we probably should test for it at startup too).
ashleyholman commented at 8:20 AM on May 11, 2014: contributor

@gmaxwell That makes sense. So the process is terminated if a missing block is requested. Would this behaviour address the concern raised by @luke-jr, ie. the requesting node will continue syncing?

I will also work on doing the check of blk*.dat files on startup so it can be merged at the same time.
gmaxwell commented at 8:22 AM on May 11, 2014: contributor

Hanging up should make the peer choose a new syncnode, so it should indeed fix that.
sipa commented at 8:23 AM on May 11, 2014: member

Indeed, there is currently no level of service you can indicate to peers between NODE_NETWORK and nothing it all. If you can't serve every block you advertise, you're lying to your peers. You could in theory downgrade to nothing at all, but that's pointless, as peers also won't expect you to validate or relay transactions anymore.

For now, this should just cause the program to quit. In the longer term, we should push for some protocol changes to indicate partial block availability, so this can become a supported case.
ashleyholman commented at 10:01 AM on May 11, 2014: contributor

Further question: Considering this can and will shut down peoples nodes, ideally there is some communication to the user about what has happened and how they can repair it. So what are the steps to repair this problem? I tried restarting with a -rescan but it didn't fix the problem. I also had bootstrap.dat present but it said "Already have block 1" even though it was missing from disk. Does the user have to manually source the missing .dat file and copy it into their blocks directory?
ashleyholman commented at 10:11 AM on May 11, 2014: contributor

Oh, I just realised that -reindex fixes it. So, I guess it should automatically run a -reindex if the startup check finds any missing .dat files?
ProcessGetData(): abort if a block file is missing from disk 7a0e84dd63
ashleyholman closed this on May 11, 2014
ashleyholman deleted the branch on May 11, 2014
ashleyholman reopened this on May 11, 2014
ashleyholman commented at 12:13 PM on May 11, 2014: contributor
See attached new patches.
- On startup, LoadBlockIndexDB() does an additional scan of blk*.dat files to check that all files required by the index are present and readable.
- If "getdata" is received for a block thats file is missing, the process will abort with a failed assertion.
This hooks into the existing code that says "Error loading block database. Do you want to rebuild the block database now?" with an option to reindex or abort.
sipa commented at 12:58 PM on May 11, 2014: member

Pulltester hasn't even begun sending any blocks at the time it fails. There's no blocks to load.
ashleyholman commented at 1:01 PM on May 11, 2014: contributor

@sipa It seems my code is failing on first startup with a blank datadir. nLastBlockFile defaults to 0, but blk00000.dat isn't there yet at that point. I'm trying to find if there's a nice solution to this, other than skipping the check with nLastBlockFile == 0.
sipa commented at 1:07 PM on May 11, 2014: member

You can iterate the mapBlockIndex entries, and build a set of the occurring nBlockFile fields of those with BLOCK_HAVE_DATA set.
ashleyholman commented at 2:01 PM on May 11, 2014: contributor

Updated patch to use @sipa's method for finding blk files used by the index.
sipa commented at 7:09 PM on May 11, 2014: member

Untested ACK.
ashleyholman renamed this:
~~ProcessGetData: fix corrupt block message when requested block is missing on disk~~
Add integrity checks for missing blk*.dat files.
on May 12, 2014
ashleyholman commented at 1:52 AM on May 12, 2014: contributor

Renamed this pull request to better describe the new patch.
laanwj commented at 11:40 AM on May 12, 2014: member

Untested ACK
laanwj added this to the milestone 0.10.0 on May 12, 2014
in src/main.cpp:None in 5616149708 outdated
```
2952 | +        }
2953 | +    }
2954 | +    for (std::set<int>::iterator it = setBlkDataFiles.begin(); it != setBlkDataFiles.end(); it++)
2955 | +    {
2956 | +        CDiskBlockPos pos(*it, 0);
2957 | +        if (!OpenBlockFile(pos, true)) {
```
cozz commented at 12:08 PM on May 12, 2014:

Any reason not to fclose? All file descriptors stay open this way...

sipa commented at 12:31 PM on May 12, 2014:

Good point. Wrapping the OpenBlockFile call in a CAutoFile would also automatically close it again.
ashleyholman commented at 12:52 PM on May 12, 2014: contributor

Oh I see. I had assumed that the file gets closed when the pointer goes out of scope. C++ is kinda new for me :)

The same type of loop is in ThreadImport() as well when reindex is done, so possibly that code also leaves file handles open.

My code should really just use a stat() call instead but I wanted to re-use the existing OpenBlockFile code to prevent having to hardcode the file paths in a 2nd place.
LoadBlockIndexDB(): Require block db reindex if any blk*.dat files are missing. 8c93bf4c28
ashleyholman commented at 1:11 PM on May 12, 2014: contributor

Pushed new patch which wraps it in CAutoFile. (and re-tested)

Thanks for the review.
laanwj commented at 1:25 PM on May 12, 2014: member

@ashleyholman Yup, it's indeed error-prone to have to manually close handles. We try to write code that frees something automatically when it goes out of scope, and if not we create RAII wrappers such as CAutoFile, which you've already found :)
BitcoinPullTester commented at 1:55 PM on May 12, 2014: none

Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/8c93bf4c2857cdc01f0e5efca2becef08d2c5f1e for binaries and test log. This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/ Contact BlueMatt on freenode if something looks broken.
ashleyholman commented at 3:26 AM on May 13, 2014: contributor

The same type of loop is in ThreadImport() as well when reindex is done, so possibly that code also leaves file handles open.

False alarm. ThreadImport() doesn't leak any file handles because LoadExternalBlockFile() closes them.
laanwj commented at 2:58 PM on May 19, 2014: member

ACK, works for me. I've tried moving blk00000 and blk00001 out of the way and successfully got an error. Moved them back and bitcoind worked again.
laanwj merged this on May 19, 2014
laanwj closed this on May 19, 2014
laanwj referenced this in commit aed38cbcb5 on May 19, 2014
laanwj cross-referenced this on May 28, 2014 from issue Startup succeeds with unreadable blk0*.dat files by luke-jr
langerhans cross-referenced this on Jan 2, 2015 from issue ProcessGetData(): abort if a block file is missing from disk by rnicoll
bitcoin locked this on Sep 8, 2021

Contributors

Milestone
0.10.0

Linked (view graph)

#2038 Startup succeeds with unreadable blk0*.dat files #3869 can't clear the 'amount' field on the 'receive' tab #3957 Missing blk00000.dat doesn't cause failure when retrieving blocks.