Reindex subtly corrupts blknnnnn.dat files and causes subsequent deserialization errors #21379

issue prairiefalcon opened this issue on March 7, 2021
  1. prairiefalcon commented at 3:55 am on March 7, 2021: none

    Summary When block downloads restart after a reindex of block files, spurious bytes are inserted in the blk.dat file. A subsequent reindex will show a deserialization error in the log file.

    Test system CentOS 7, Bitcoin Core v0.21.0 pre-built binaries; also reproduced on releases going back to 0.10.3.

    Problem Description Performing a reindex of the block files reliably introduces spurious bytes, resulting in subsequent deserialization errors.

    At the point where block downloads resume and are added to the blk.dat file, 8 extra bytes are introduced in front of the first new data block. Depending on the scenario, these 8 bytes are either all 0’s, or a valid message start string (0xf9beb4d9 “magic” sequence) followed by 4 more bytes. Normal blocks of <magic> <length> <raw block> resume after these 8 bytes.

    Investigation revealed that the 4 bytes after the magic were actually the length of a previously existing block in the blk.dat file that had been overwritten.
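
    As a concrete way to inspect this on disk, here is a minimal standalone C++ sketch (hypothetical, not part of Bitcoin Core) that walks a blk*.dat file as a sequence of <magic> <length> <raw block> records, assuming the mainnet message start f9 be b4 d9 and a little-endian 4-byte length, and reports the first offset where the expected magic is missing:

    // Hypothetical standalone helper (not part of Bitcoin Core): scan a
    // blk*.dat file record by record and report where the magic goes missing.
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <fstream>

    int main(int argc, char** argv)
    {
        if (argc != 2) { std::fprintf(stderr, "usage: %s blk00000.dat\n", argv[0]); return 1; }
        std::ifstream f(argv[1], std::ios::binary);
        if (!f) { std::fprintf(stderr, "cannot open %s\n", argv[1]); return 1; }

        static const unsigned char MAGIC[4] = {0xf9, 0xbe, 0xb4, 0xd9};
        unsigned long long offset = 0, records = 0;
        unsigned char hdr[8];

        while (f.read(reinterpret_cast<char*>(hdr), sizeof(hdr))) {
            if (std::memcmp(hdr, MAGIC, 4) != 0) {
                std::printf("no magic at offset 0x%08llx after %llu records\n", offset, records);
                return 2;
            }
            // little-endian record length stored right after the magic
            uint32_t len = hdr[4] | (hdr[5] << 8) | (hdr[6] << 16) | (uint32_t{hdr[7]} << 24);
            f.seekg(len, std::ios::cur);  // skip the raw block payload
            offset += 8 + len;
            ++records;
        }
        std::printf("%llu records, clean up to offset 0x%08llx\n", records, offset);
        return 0;
    }

    For the zero-fill case this pinpoints the exact start of the gap; in the overwrite case (where the spurious bytes themselves begin with a magic) the scan only desynchronizes at or shortly after the insertion point, so the reported offset may lie somewhat past it.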

    It appears what is happening is that after reindexing, a file position is calculated for adding newly downloaded blocks, and this position is incorrectly computed to be 8 bytes beyond the proper place. If the insert position is at the end of the file, the intervening 8 bytes are zero filled. If the insert position overwrites pre-existing blocks, the first 8 bytes of the overwritten block(s) are left behind.

    It appears the problem results from an inconsistency in the operation of FindBlockPos during reindexing, when it returns file positions pointing at the beginning of the raw block rather than at the end of the previous block, i.e. the beginning of the next record's magic.
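
    To make that arithmetic concrete, here is a minimal, hypothetical sketch (the names FileInfoSketch and AddRecordSketch are invented; this is not the actual FindBlockPos code) of how such a known-position path can drift by 8 bytes: the reindex path records a position that already points past the 8-byte magic+length header, yet the size it adds for the record still includes that header.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>

    struct FileInfoSketch { uint32_t nSize = 0; };  // tracked "used bytes" of a blk file

    // fKnown == true models the reindex path: pos_raw_block points at the start
    // of the serialized block, i.e. 8 bytes past the start of the on-disk record.
    void AddRecordSketch(FileInfoSketch& info, uint32_t pos_raw_block,
                         uint32_t block_size, bool fKnown)
    {
        const uint32_t add_size = block_size + 8;  // record = 8-byte header + raw block
        if (fKnown) {
            // Inconsistent accounting: the position excludes the header, but
            // add_size still counts it, so nSize lands 8 bytes past the record end.
            info.nSize = std::max(info.nSize, pos_raw_block + add_size);
        } else {
            // Fresh-download path: the record is appended at the current end of data.
            info.nSize += add_size;
        }
    }

    int main()
    {
        FileInfoSketch info;
        // One record at the start of the file: header at offset 0, raw block at 8, 100 bytes long.
        AddRecordSketch(info, /*pos_raw_block=*/8, /*block_size=*/100, /*fKnown=*/true);
        std::printf("tracked size = %u, actual end of data = %u\n", info.nSize, 108u);
        return 0;
    }

    With bookkeeping like this, the next freshly downloaded block would be written at offset 116 instead of 108, leaving exactly the 8-byte gap described above (zero filled, or the first 8 bytes of whatever record used to occupy that space).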

    The block reading algorithm is apparently able to recover and continue, so this problem is not fatal. Inspection of the debug.log shows “LoadExternalBlockFile: Deserialize or I/O error” when it occurs. To be clear, the first reindex doesn’t show this error, but causes the problem such that a subsequent reindex will generate it.
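
    The recovery presumably works by resynchronizing on the message start: after a record fails to deserialize, the loader scans forward until it finds the next 4-byte magic and resumes parsing there. A sketch of that idea (an assumption about the general strategy, not the exact LoadExternalBlockFile code):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Returns the offset of the next f9 be b4 d9 sequence at or after 'from',
    // or SIZE_MAX if none remains in the buffer.
    size_t FindNextMagic(const std::vector<unsigned char>& buf, size_t from)
    {
        static const unsigned char MAGIC[4] = {0xf9, 0xbe, 0xb4, 0xd9};
        for (size_t i = from; i + 4 <= buf.size(); ++i) {
            if (std::memcmp(&buf[i], MAGIC, 4) == 0) return i;
        }
        return SIZE_MAX;
    }

    This would also explain why the error shows up at all in the Scenario 2 case below: the spurious 8 bytes begin with a valid magic, so the loader presumably tries to parse a record whose length field is stale, fails, logs the error, and only then skips ahead to the next real magic.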

    Researching old issues, I suspect this problem contributed to symptoms observed in issues #6492 and #14933. It also looks like #5864 might have been an attempt to address this or a similar problem.

    Reproduction Scenarios These scenarios are rather contrived, but are the simplest ones I could come up with to quickly and reliably demonstrate the issue.

    Scenario 1 To create a deterministic scenario, this example uses what I’ll refer to as “canonical” block files. That is, block files where the chain blocks are stored in order up to the 128 MiB file size limit. Thus, the canonical blk00000.dat file has the following properties:

    size: 134214363
    blocks: 119965 (up to height 119964)
    SHA-256: fe4b764e0ce523cfce6d9e4f326e0d0977ee317a51b4a129cda707cfa4699dd7

    1. Start with a “canonical” blk00000.dat as the only file in a new data dir
    2. Start up bitcoin-qt with -reindex.
    3. Once the reindex completes and block downloads resume over the network, terminate the process.

    In the debug.log, you should see: 2021-03-06T21:41:52Z Leaving block file 0: CBlockFileInfo(blocks=119965, size=134214371, heights=0...119964, time=2009-01-03...2011-04-24)

    Note that the size has grown by 8 bytes. Inspection of hexdumps shows 8 extra zero bytes at the end of the file.

    Before reindex:

    07fff2c0  14 9f ce 4e af a6 17 17  d9 24 b1 d0 9b 30 28 3d  |...N.....$...0(=|
    07fff2d0  ee a3 f1 e3 fd 88 ac 00  00 00 00                 |...........|
    

    After reindex:

    07fff2c0  14 9f ce 4e af a6 17 17  d9 24 b1 d0 9b 30 28 3d  |...N.....$...0(=|
    07fff2d0  ee a3 f1 e3 fd 88 ac 00  00 00 00 00 00 00 00 00  |................|
    07fff2e0  00 00 00                                          |...|
    

    In this scenario, blk00000.dat is “full”, so the new block will go in blk00001.dat. But, in the process of testing blk00000.dat to see if the new block can be appended, it appears an incorrect file pointer offset winds up adding 8 bytes to the file.
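
    One plausible explanation for the zero fill (an assumption about the mechanism, not something taken from the sources) is that when block file 0 is finalized, it is extended out to the tracked size, which in this case is 8 bytes too large; on POSIX systems, growing a file with ftruncate() fills the newly added bytes with zeros:

    // Demonstrates only the POSIX zero-extension behaviour; the file name and
    // sizes are made up for illustration.
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        int fd = open("demo.dat", O_CREAT | O_RDWR | O_TRUNC, 0644);
        if (fd < 0) { std::perror("open"); return 1; }
        if (write(fd, "blockdata", 9) != 9) { std::perror("write"); return 1; }
        if (ftruncate(fd, 9 + 8) != 0) { std::perror("ftruncate"); return 1; }  // extend 8 bytes past the real data
        close(fd);  // demo.dat now ends with 8 zero bytes
        return 0;
    }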

    Scenario 2 part 1 The results of this scenario are more variable due to the random ordering of blocks resulting from parallel block downloads.

    1. Start with a blk00000.dat and blk00001.dat containing out-of-order blocks from a parallel block download.
    2. Perform a reindex on these block files.
    3. Wait past header synchronization until new block downloads resume over the network (blk00002.dat is created), then shut the process down.

    What you should discover upon inspection of blk00001.dat is that the file has been changed, and at the point where the content changes, the spurious magic and length bytes will appear. This will be somewhere near but not at the end of the file. (It appears that the insert point is after the last block that could be verified to be part of the main chain; because of parallel block downloads, some remaining blocks that couldn’t be attached are overwritten.) An example from my run:

    Before reindex:

    07f788d0  88 ac 00 00 00 00 f9 be  b4 d9 ad bf 00 00 01 00  |................|
    07f788e0  00 00 ea 8d fd 5c 0e 23  08 c8 b3 a6 73 61 5d fa  |.....\.#....sa].|
    

    Note the magic at offset 0x07f788d6, the block length of 0x0000bfad, and the 0x01 version byte beginning the raw block.

    After reindex:

    07f788d0  88 ac 00 00 00 00 f9 be  b4 d9 ad bf 00 00 f9 be  |................|
    07f788e0  b4 d9 61 bf 00 00 01 00  00 00 c3 70 ca 79 eb ed  |..a........p.y..|
    

    Note the magic at offset 0x07f788d6 and the length 0x0000bfad as before, but now followed by another magic and a length of 0x0000bf61 before reaching the 0x01 version byte.

    Scenario 2 part 2

    1. Save a copy of the block files resulting from the end of the previous scenario.
    2. Start another reindex using those files. Again, wait until new block downloads resume, then shut down.

    In the debug.log, you should find the following where the corrupted block file was processed: 2021-03-06T22:19:55Z LoadExternalBlockFile: Deserialize or I/O error - ReadCompactSize(): size too large: iostream error

    Depending on exactly how things go off the rails, the error could appear as: 2021-03-06T23:07:48Z LoadExternalBlockFile: Deserialize or I/O error - Read attempted past buffer limit: iostream error

    If you inspect your last block file from the first reindex and compare it to its state after the second reindex, you should find a second instance of the spurious magic+length bytes.

    If you perform another reindex with the block files as they stand now, you’ll get two deserialize errors in the debug log. Each additional reindex will add another instance.

  2. prairiefalcon added the label Bug on Mar 7, 2021
  3. mruddy commented at 8:51 am on April 13, 2022: contributor
    @prairiefalcon Thanks for this great detailed bug report. I have reproduced your given scenarios and can confirm that the bug still exists in the current master branch as of at least 2b5a741e98f186e50d9fbe1ceadcc8b8c91547f7. I have a fix to avoid adding the invalid data to the block files. I will look into creating a way to automatically clean up affected blk files, and will also look into updating the automated tests to catch this going forward.
  4. mruddy referenced this in commit 84580a7e31 on Apr 14, 2022
  5. mruddy commented at 11:55 pm on April 14, 2022: contributor
    As a side note, I did look into updating the reindex process to have a gated option to allow cleaning, or vacuuming, of the blk data files. I would have gated this behind a flag like -reindex-vacuum. My idea was that it would happen during the reindex process and vacuum each file just before it was processed. It would temporarily use an extra 128 MiB of storage space to avoid changing blk files in place. However, after looking at it, I decided that it is better to avoid possibly mutating the blk files during the reindex process. Bad things can happen: for example, a read error early in the process could corrupt a block, which would cause the rest of the blocks to be considered out of order during that reindex, and on the following reindex all of the blocks from the point of failure onward would need to be redownloaded. With the current reindex process the blk files are immutable (after my fix for this issue), and a read error could be recovered from by simply re-running the reindex process. I want to avoid adding a possibly destructive option in this situation when a simple abort and retry could work.
  6. mruddy referenced this in commit 2da96174b9 on Apr 15, 2022
  7. laanwj added the label Block storage on Apr 15, 2022
  8. mruddy referenced this in commit 66b8d3c23d on Apr 19, 2022
  9. mruddy referenced this in commit 425fc88133 on Apr 20, 2022
  10. mruddy referenced this in commit bcb0cacac2 on May 7, 2022
  11. mzumsande referenced this in commit 5691b7101e on Jul 6, 2022
  12. LarryRuane referenced this in commit e8b90d475b on Jul 27, 2022
  13. LarryRuane referenced this in commit 35cb9cdf87 on Jul 27, 2022
  14. glozow closed this on Oct 12, 2022

  15. glozow referenced this in commit cc12b8947b on Oct 12, 2022
  16. adam2k referenced this in commit 0db876ab32 on Oct 19, 2022
  17. sidhujag referenced this in commit e0075d67c2 on Oct 23, 2022
  18. bitcoin locked this on Oct 12, 2023
