Do not crash if peers.dat is corrupted

NicolasDorier commented at 12:34 pm on November 29, 2022: contributor

When peers.dat is corrupted, an error message is shown: Invalid or corrupt peers.dat (Checksum mismatch, data corrupted). Then the node restarts.

Most of our users aren’t technical enough to manually delete the peers.dat file, nor can we detect the corruption for them. It means that this error gives us lots of work for our support team whenever somebody is affected.

peers.dat isn’t an essential file; as such, Bitcoin Core should be fine restarting without crashing.

A bash workaround to detect the checksum mismatch would also considerably help us.

NicolasDorier added the label Feature on Nov 29, 2022

maflcko commented at 12:38 pm on November 29, 2022: member

It means that this error gives us lots of work

Do you happen to know why it corrupts? If it is due to hardware error, it might be scary to just continue, because it might also corrupt wallet.dat.

maflcko commented at 12:41 pm on November 29, 2022: member

If the cause is known and harmless, it can be fixed like https://github.com/bitcoin/bitcoin/commit/b00b60ed4f27066656e45635f1386e6b1550c6ed

fanquake removed the label Feature on Nov 29, 2022

ghost commented at 10:05 pm on November 30, 2022: none

@NicolasDorier Were you able to find the reason for corruption? Steps to reproduce could help in knowing the severity of the issue

NicolasDorier commented at 2:29 am on December 1, 2022: contributor

@1440000bytes @MarcoFalke I don’t know the reason, I can just report it happened to maybe 2 or 3 different people on our support in one or two months. Removing the peers.dat fixed their problem. My guess is that bitcoin core was not shut down cleanly. (such as the process killed while saving peers.dat?)

On another note: a wallet corruption, which is genuinely more scary, doesn’t make Bitcoin Core crash when starting. (It isn’t related to this issue, but I can share a corrupted wallet.dat that doesn’t crash Bitcoin Core on startup if that interests anybody.)

maflcko commented at 8:46 am on December 1, 2022: member

such as the process killed while saving peers.dat

RenameOver should be atomic, so I don’t think this can happen.

https://github.com/bitcoin/bitcoin/blob/e2bfd41f832dc7c7be6f17e928352f0eb2865f66/src/addrdb.cpp#L79

maflcko added the label Questions and Help on Dec 1, 2022

NicolasDorier commented at 9:03 am on December 1, 2022: contributor

The next time it happens, I will ask the user to save the peers.dat so we can analyze it.

maflcko commented at 9:13 am on December 1, 2022: member

Ok, thanks. Closing for now, but this can be reopened if you leave another comment or referred to if you open a new issue.

maflcko closed this on Dec 1, 2022

NicolasDorier commented at 9:47 am on December 1, 2022: contributor

@MarcoFalke I think the reason why it is corrupt is different from the topic of not crashing for a peers.dat file. This file is non-essential, it shouldn’t be a problem to restart.

maflcko reopened this on Dec 1, 2022

maflcko removed the label Questions and Help on Dec 1, 2022

maflcko added the label Brainstorming on Dec 1, 2022

NicolasDorier commented at 11:21 am on December 1, 2022: contributor

I don’t know if this helps, but I was told that it mainly happens to people running with low-spec RAM (2 GB). Apparently, one user had repeated corruptions with 1 GB of RAM.

maflcko commented at 11:23 am on December 1, 2022: member

To me it sounds like a bug that should be fixed and not silently ignored

willcl-ark commented at 12:14 pm on December 1, 2022: member

The peers.dat file is designed to avoid having to reach out to the DNS or hardcoded seeds more than once, as this is the moment your node is most susceptible to being poisoned with attacker IP addresses and perhaps, in the future, blocks and transactions.

If the file becomes corrupted then anchors.dat should help protect the node from a successful future eclipse attack, but new addresses will have to either be added manually or fetched from DNS or hardcoded seeds again.

I agree that the best course of action here is to find out what’s corrupting peers.dat and fix that, rather than have Core silently ignore errors on something that could be used as a first step towards eclipse attacking you…

Side note: it does make me wonder whether it could be worth having certain runtime “profiles”. For example I have seen software with “paranoia level” settings, and we could perhaps have something like

Paranoid: Fail on detecting any corrupt file, data, etc. debugging enabled on many categories by default, notification of re-orgs >= n blocks etc.
Normal (default): Something similar to today’s defaults
Resilient: Try to recover from more non-critical errors to stay operational. Automatically restart and rebuild broken indexes, etc. if needed

mzumsande commented at 3:41 pm on December 1, 2022: contributor

Maybe related: #25874 (with a slightly different error)

I agree that having access to a corrupted peers.dat would be very helpful.

willcl-ark commented at 3:58 pm on December 1, 2022: member

It would be interesting if somehow Core is corrupting its own peers.dat in some way, but a more common source of errors is bad disk failure or a mv or archive operation going wrong/being cancelled mid-way through.

I just used dd if=/dev/urandom of=peers.dat bs=1024 seek=(math (random) % 10) count=1 conv=notrunc (fish syntax) to write some random bytes to one of the first 10 blocks for a quick test which reproduced the error.
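
For readers not using fish, a rough bash equivalent of the one-liner above might look like the sketch below. The function name `corrupt_block` is hypothetical, `status=none` assumes GNU dd, and it should only ever be pointed at a copy of peers.dat, never at a live data directory.

```shell
#!/usr/bin/env bash

# Overwrite one random 1 KiB block among the first ten blocks of a file.
# conv=notrunc keeps the rest of the file (and its length) intact.
corrupt_block() {
    local file=$1
    # bash counterpart of fish's (math (random) % 10): a block index from 0 to 9
    local block=$((RANDOM % 10))
    dd if=/dev/urandom of="$file" bs=1024 seek="$block" count=1 conv=notrunc status=none
}
```

Example use on a scratch copy: `cp peers.dat /tmp/peers-test.dat && corrupt_block /tmp/peers-test.dat`. If the file is shorter than the chosen offset, dd will extend it rather than fail, so a very small peers.dat may end up longer than before.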

I suppose in theory we could try to recover whatever entries we can. If an attacker has modified it they would also be able to update the checksum, so the hash is not providing any protection in this scenario.
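
To illustrate why the checksum only guards against accidental corruption: it is an unkeyed double SHA-256 over the preceding bytes, so anyone who can write the file can also recompute it. Below is a minimal bash sketch under that assumption, using only coreutils. The helper names (`hex_to_bin`, `sha256d_hex`, `rewrite_checksum`) are hypothetical and not part of Bitcoin Core.

```shell
#!/usr/bin/env bash

# Turn a hex string into raw bytes (bash's printf %b understands \xHH escapes).
hex_to_bin() {
    printf '%b' "$(sed 's/../\\x&/g' <<<"$1")"
}

# Double SHA-256 of stdin, printed as 64 hex characters.
sha256d_hex() {
    local first
    first=$(sha256sum | cut -c1-64)
    # The second round hashes the raw 32-byte digest, not its hex text.
    hex_to_bin "$first" | sha256sum | cut -c1-64
}

# Copy $1 to $2, replacing the trailing 32 bytes with a freshly computed checksum.
rewrite_checksum() {
    local src=$1 dst=$2 digest
    head -c -32 "$src" > "$dst"          # payload without the stored checksum
    digest=$(sha256d_hex < "$dst")
    hex_to_bin "$digest" >> "$dst"
}
```

Run against a saved corrupt copy, for example `rewrite_checksum peers_corrupted.dat peers_rechecked.dat`, this can help tell apart a file whose payload is damaged from one where only the checksum is wrong. It is meant for analysis, not as a way to put a suspect file back into service.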

ghost commented at 7:10 pm on December 1, 2022: none

It would be interesting if somehow Core is corrupting its own peers.dat in some way, but a more common source of errors is bad disk failure or a mv or archive operation going wrong/being cancelled mid-way through.

It would be more interesting if I can corrupt it remotely for another node. That would be something we need to report to: https://github.com/bitcoin/bitcoin/security/policy

willcl-ark commented at 10:02 am on December 2, 2022: member

@NicolasDorier I was just re-reading your original issue and had forgotten you had asked for a bash script to detect issues. peers.dat is just double sha256 checksummed, so something like this should work for you:

#!/usr/bin/env bash

# Check that a file was provided as an argument
if [ -z "$1" ]; then
	echo "Please provide a file as an argument"
	exit 1
fi

# Check that the file exists
if [ ! -f "$1" ]; then
	echo "File does not exist"
	exit 1
fi

# Use a full path for better error logging if possible
if command -v realpath &>/dev/null; then
	file=$(realpath "$1")
else
	file="$1"
fi

echo "Validating double sha256 checksum of file $file"

# Stored hash: the last 32 bytes of the file, converted to hex.
# (Raw binary cannot be kept in a bash variable because NUL bytes are dropped.)
file_hash=$(tail --bytes=32 "$file" | xxd -p -c 32)

# Calculated double sha256 hash of everything before the last 32 bytes, as hex
calc_hash=$(head --bytes=-32 "$file" | sha256sum -b | cut -c1-64 | xxd -r -p | sha256sum -b | cut -c1-64)

if [ "$file_hash" == "$calc_hash" ]; then
	echo "File $file has a valid sha256 checksum"
	exit 0
else
	echo "File $file has an invalid sha256 checksum and may be corrupted"
	echo "$file sha256 hash:"
	echo "$file_hash"
	echo "$file calculated sha256 hash:"
	echo "$calc_hash"
	echo "Please consider communicating this corrupt $file file to Bitcoin Core developers for analysis"

	# You could optionally move/rename the file here so that Bitcoin Core will start up without error
	# e.g.:
	# mv "$file" "$file.corrupt"
	exit 1
fi

If you do use a check like this, perhaps you could consider a way we could have these multiple corrupt peers.dat files returned for analysis?

NicolasDorier commented at 11:08 am on December 10, 2022: contributor

@willcl-ark The corrupt peers.dat probably didn’t have the sha256 matching. @petzsch was going to try to reproduce. Will ping him.

maflcko added the label Data corruption on Dec 10, 2022

petzsch commented at 11:39 am on December 10, 2022: none

Still working at reproducing the issue on a VirtualBox VM with 1 GB RAM and 8 CPU cores for hopefully faster syncing (at around 20% synced now). So far no errors concerning the peers.dat.

Is there anything else I should try for forcing data corruption to happen? Like powering down the VM without a regular shutdown sequence? Not sure what our users were doing with their installs to torture them into this error. :-)

Could this error be hardware specific (not just low RAM)? I’m virtualizing with VirtualBox on a i9-9900k …the VM has 64 Bit Ubuntu 22.04 running with the usual btcpay docker stack. @NicolasDorier you mentioned a user who ran into this error several times with 1 GB RAM: Can we still reach them for more details what machine they used?

cpleonardo commented at 0:41 am on December 11, 2022: none

This error affected my btcpay server apparently since 1.7.1 update (11/29/22). $docker logs btcpayserver_bitcoind -> Error: Invalid or corrupt peers.dat (Checksum mismatch, data corrupted). If you believe this is a bug, please report it to https://github.com/bitcoin/bitcoin/issues. As a workaround, you can move the file ("/home/bitcoin/.bitcoin/peers.dat") out of the way (rename, move, or delete) to have a new one created on the next start.

As suggested, after deleting peers.dat my node resumed blocks syncing smoothly. Before that, I was able to make a copy of the corrupted peers.dat, but I haven’t found a way to read it.

My server has low hardware capacity: 2 GB RAM, 50 GB Disk (running in prune mode 25GB) and OS Ubuntu 20.04 x64.

maflcko commented at 9:34 am on December 12, 2022: member

Can you check the debug log for possible causes of the corruption (maybe at the time of shutdown)? Alternatively, you can upload it here.

You can find the debug.log in your data dir.

Please be aware that the debug log might contain personally identifying information.

You may also upload the peers.dat, with the same disclaimer.

willcl-ark commented at 9:40 am on December 12, 2022: member

@cpleonardo You can see the deserialisation code in addrdb.cpp here if you want to try and re-implement some manual deserialisation.

In addition to @MarcoFalke’s suggestions, perhaps dmesg contains logs of a hardware failure? You could check with e.g. dmesg --level=err,warn (needs sudo).

jaonoctus commented at 3:25 pm on December 24, 2022: none

Can confirm this. Happened to a friend of mine after a power outage

NicolasDorier commented at 0:40 am on January 6, 2023: contributor

For now I will modify our docker image to wipe peers.dat after every restart, as too many people are hitting the issue. Or maybe I will try to detect corruption in bash.

kristapsk commented at 0:57 am on January 6, 2023: contributor

I don’t know if this helps, but I was told that it mainly happens to people running with low-spec RAM (2 GB). Apparently, one user had repeated corruptions with 1 GB of RAM.

OOM kill?

NicolasDorier commented at 1:18 am on January 6, 2023: contributor

Adding this to our docker image

	# peers.dat is routinely corrupted, preventing bitcoind from starting, see #26599
	peers_dat="peers.dat"
	peers_dat_corrupted="peers_corrupted.dat"
	if [[ -f "${peers_dat}" ]]; then
		actual_hash=$(head -c -32 "${peers_dat}" | sha256sum | cut -c1-64 | xxd -r -p | sha256sum | cut -c1-64)
		expected_hash=$(tail -c 32 "${peers_dat}" | xxd -ps -c 32)
		if [[ "${actual_hash}" != "${expected_hash}" ]]; then
			echo "${peers_dat} is corrupted, moving it to ${peers_dat_corrupted}"
			rm -f "${peers_dat_corrupted}"
			mv "${peers_dat}" "${peers_dat_corrupted}"
		fi
	fi

If that happens, I keep the corrupted file so I can share it here.

NicolasDorier referenced this in commit 00e4ee954f on Jan 6, 2023

NicolasDorier referenced this in commit e1ec79b299 on Jan 6, 2023

NicolasDorier referenced this in commit ba4e9c30c9 on Jan 6, 2023

NicolasDorier commented at 4:03 am on January 6, 2023: contributor

For people having issues getting bitcoind to start on a btcpay server docker deployment, run btcpay-update.sh. Our new image now automatically moves the corrupted file.

On a possibly related matter, for a long time we have also been deleting settings.json before starting Bitcoin Core with rm -f /home/bitcoin/.bitcoin/settings.json, as bitcoind was often unable to start due to random corruption of that file.

fedevegili commented at 1:05 am on January 12, 2023: none

This just happened to my bitcoind node running inside an old raspberry pi 2 (1gb memory).

Here is my corrupted peers.dat: crashed-peers.zip

Here are the relevant lines from my debug.log: https://gist.github.com/fedevegili/d8afe7ca6a46bea5c5281aeda661e5e9

Hope it helps.

maflcko commented at 9:09 am on January 12, 2023: member

What type of underlying storage does /mnt/usb/bitcoin-data point to? If the medium is eager to corrupt itself you may find other files (such as the coins db) corrupted as well.

fedevegili commented at 11:26 am on January 12, 2023: none

What type of underlying storage does /mnt/usb/bitcoin-data point to? If the medium is eager to corrupt itself you may find other files (such as the coins db) corrupted as well.

It’s an ext4 1tb SSD connected to the raspberry pi through an external HDD case. I’d say it’s quite error prone.

However, only this file got corrupted so far, and only this one time. I have had this setup for about 2 months.

mzumsande commented at 4:40 pm on January 12, 2023: contributor

Here are the relevant lines from my debug.log: https://gist.github.com/fedevegili/d8afe7ca6a46bea5c5281aeda661e5e9

What happened here, did the node crash without an error, and then you restarted it and got the corrupted-peers error? I’m asking because the time stamps in the debug.log aren’t chronological, at 2023-01-08T18:55:41Z a new outbound peer is connected, and then it jumps back in time.

jonatack commented at 8:15 pm on January 12, 2023: contributor

I don’t know the reason, I can just report it happened to maybe 2 or 3 different people on our support in one or two months. Removing the peers.dat fixed their problem. My guess is that bitcoin core was not shut down cleanly. (such as the process killed while saving peers.dat?)

I also saw this happen a couple times a year on my older laptop running debian. I restart nodes frequently in order to test pull requests, and IIRC the corruption would happen after one of the Bitcoin Core peers.dat-related threads hung during shutdown and bitcoind had to be killed manually.

mzumsande commented at 10:54 pm on January 12, 2023: contributor

I believe I’ve found the bug that caused this with the help of the provided peers.dat (which was completely ok as far as I can see, just that the checksum was wrong, and when overwriting the bad checksum with the correct one it would load correctly):

Every 15 minutes, the scheduler thread will dump peers.dat to disk - for this it calls https://github.com/bitcoin/bitcoin/blob/f4ef856375c5b295d78169b136c6aee928c19bc9/src/addrdb.cpp#L38-L40

which first writes the data (i.e. AddrMan) into the stream, and then writes the same data into a hasher - which then provides the hash that is added to the stream in the third line. The problem is that AddrMan can change in between the first two calls (e.g. if we receive a new address), and then the data and hash won’t match anymore and the written file is corrupt.

I could reproduce this by adding a sleep for the scheduler thread in between the two writes of data, manually adding artificial addresses with addpeeraddress during this sleep, and then killing bitcoind (so that it can’t correct the peers.dat at a clean shutdown). That way, I would corrupt my own peers.dat.
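
The race described above can be imitated outside Bitcoin Core with plain files. The sketch below is only an illustration, not the actual addrdb.cpp code or the eventual fix: a `state` file stands in for AddrMan, a caller-supplied `mutate` function stands in for a newly received address, and all function names are hypothetical. `dump_racy` serializes the state and then hashes it in a second pass, while `dump_once` hashes exactly the bytes that were written.

```shell
#!/usr/bin/env bash

hex_to_bin() { printf '%b' "$(sed 's/../\\x&/g' <<<"$1")"; }

# Double SHA-256 of stdin as 64 hex characters.
sha256d_hex() {
    local first
    first=$(sha256sum | cut -c1-64)
    hex_to_bin "$first" | sha256sum | cut -c1-64
}

# Racy: write the data, let the state change, then hash the (newer) state.
dump_racy() {
    local state=$1 out=$2 mutate=$3 digest
    cat "$state" > "$out"
    "$mutate" "$state"                  # state changes between the two passes
    digest=$(sha256d_hex < "$state")    # hash no longer matches what was written
    hex_to_bin "$digest" >> "$out"
}

# Safer: serialize once and hash the bytes that actually went to disk.
dump_once() {
    local state=$1 out=$2 mutate=$3 digest
    cat "$state" > "$out"
    "$mutate" "$state"                  # a late change no longer matters
    digest=$(sha256d_hex < "$out")
    hex_to_bin "$digest" >> "$out"
}

# Succeeds if the trailing 32 bytes equal the double SHA-256 of the rest.
is_valid() {
    local stored
    stored=$(tail -c 32 "$1" | od -An -tx1 -v | tr -d ' \n')
    [ "$(head -c -32 "$1" | sha256d_hex)" = "$stored" ]
}
```

With a `mutate` function that appends a line to the state file, `is_valid` fails for the output of `dump_racy` and succeeds for the output of `dump_once`, which mirrors the "data fine, checksum wrong" symptom seen in the uploaded file.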

I will work on a fix!

fedevegili commented at 5:48 pm on January 13, 2023: none

Here are the relevant lines from my debug.log: https://gist.github.com/fedevegili/d8afe7ca6a46bea5c5281aeda661e5e9

What happened here, did the node crash without an error, and then you restarted it and got the corrupted-peers error? I’m asking because the time stamps in the debug.log aren’t chronological, at 2023-01-08T18:55:41Z a new outbound peer is connected, and then it jumps back in time.

Unfortunately I have no idea. I realized the logs were out of order too, but couldn’t figure out why.

Everything was working, and 4-5 days later when I checked again, the bitcoin process was not running and I was not able to start it anymore. No one touches this raspberry pi and no one connected to it in the meantime. The bitcoind process starts with system boot, so I’d say there was definitely a restart at some point.

Maybe the logs seem to be out of order because of clock syncing adjustments after a system restart?

CubicEarth commented at 4:56 pm on January 15, 2023: none

To me it sounds like a bug that should be fixed and not silently ignored

What about having a backup peers.dat? On startup, after peers.dat is successfully loaded, the backup would be updated / replaced with the current version. If peers.dat was corrupted and failed to load, the backup could be used instead, along with an alert to the user.
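
As a rough approximation of this idea outside Bitcoin Core, a startup wrapper could refresh a backup whenever peers.dat validates and fall back to it otherwise. This is a sketch only: the `.bak` and `.corrupt` suffixes and the function names are hypothetical, and the check runs before bitcoind starts rather than after a successful load as proposed above.

```shell
#!/usr/bin/env bash

hex_to_bin() { printf '%b' "$(sed 's/../\\x&/g' <<<"$1")"; }

# Double SHA-256 of stdin as 64 hex characters.
sha256d_hex() {
    local first
    first=$(sha256sum | cut -c1-64)
    hex_to_bin "$first" | sha256sum | cut -c1-64
}

# Succeeds if the trailing 32 bytes equal the double SHA-256 of the rest.
is_valid() {
    local stored
    stored=$(tail -c 32 "$1" | od -An -tx1 -v | tr -d ' \n')
    [ "$(head -c -32 "$1" | sha256d_hex)" = "$stored" ]
}

# Call before starting bitcoind, e.g. prepare_peers /home/bitcoin/.bitcoin/peers.dat
prepare_peers() {
    local peers=$1 backup=$1.bak
    [ -f "$peers" ] || return 0            # nothing to do on a fresh data dir
    if is_valid "$peers"; then
        cp -f "$peers" "$backup"           # keep the last known-good copy
    elif [ -f "$backup" ]; then
        echo "WARNING: $peers is corrupt, restoring $backup" >&2
        mv -f "$peers" "$peers.corrupt"    # keep the bad file for analysis
        cp -f "$backup" "$peers"
    else
        echo "WARNING: $peers is corrupt and no backup exists, moving it aside" >&2
        mv -f "$peers" "$peers.corrupt"
    fi
}
```

The warning on stderr is the "alert to the user" part of the proposal. Keeping the `.corrupt` file around also leaves something to upload for analysis.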

mzumsande commented at 6:58 pm on January 17, 2023: contributor

I opened #26909 to fix this.

maflcko referenced this in commit b5c88a5479 on Jan 19, 2023

maflcko closed this on Jan 19, 2023

sidhujag referenced this in commit 4ab996b227 on Jan 19, 2023

beeduul commented at 5:57 pm on May 20, 2023: none

Although this issue has been marked as fixed for the next release, I’ll leave this additional note here for posterity.

This issue happens to me every few days on my 4 GB Pi Umbrel. It appears that immediately before each crash, the log contains Socks5() connect to xxx.xxx.xxx.xxx:8333 failed: InterruptibleRecv() timeout or other failure.

mzumsande commented at 9:15 pm on May 20, 2023: contributor

Although this issue has been marked as fixed for the next release, I’ll leave this additional note here for posterity.

This issue happens to me every few days on my 4 GB Pi Umbrel. It appears that immediately before each crash, the log contains Socks5() connect to xxx.xxx.xxx.xxx:8333 failed: InterruptibleRecv() timeout or other failure.

To be clear: the fix doesn’t prevent any crashes from happening - what it fixes is that if the node crashes for some unrelated reason, peers.dat shouldn’t get corrupted anymore (which would only be visible at the next startup). So if your node crashes every few days, it sounds like you have another, unrelated problem.

fanquake commented at 9:09 am on May 22, 2023: member

So if your node crashes every few days, it sounds like you have another, unrelated problem.

@beeduul do you want to follow up with a new issue, providing more info if possible? Assuming this isn’t a hardware related problem.

bitcoin locked this on May 21, 2024

Do not crash if peers.dat is corrupted #26599