Do not crash if peers.dat is corrupted #26599

NicolasDorier opened this issue on November 29, 2022
  1. NicolasDorier commented at 12:34 pm on November 29, 2022: contributor

    When peers.dat is corrupted, an error message is shown: Invalid or corrupt peers.dat (Checksum mismatch, data corrupted). Then the node restarts.

    Most of our users aren’t technical enough to manually delete the peers.dat file, nor can we detect it for them. It means that this error gives us lots of work on our support team when somebody is impacted.

    peers.dat isn’t an essential file, so Bitcoin Core should be fine restarting without crashing.

    A bash workaround to detect the checksum mismatch would also considerably help us.

  2. NicolasDorier added the label Feature on Nov 29, 2022
  3. maflcko commented at 12:38 pm on November 29, 2022: member

    It means that this error gives us lots of work

    Do you happen to know why it gets corrupted? If it is due to a hardware error, it might be scary to just continue, because it might also corrupt wallet.dat.

  4. maflcko commented at 12:41 pm on November 29, 2022: member
    If the cause is known and harmless, it can be fixed like https://github.com/bitcoin/bitcoin/commit/b00b60ed4f27066656e45635f1386e6b1550c6ed
  5. fanquake removed the label Feature on Nov 29, 2022
  6. ghost commented at 10:05 pm on November 30, 2022: none
    @NicolasDorier Were you able to find the reason for the corruption? Steps to reproduce could help in assessing the severity of the issue.
  7. NicolasDorier commented at 2:29 am on December 1, 2022: contributor

    @1440000bytes @MarcoFalke I don’t know the reason; I can only report that it happened to maybe 2 or 3 different people contacting our support in one or two months. Removing the peers.dat fixed their problem. My guess is that Bitcoin Core was not shut down cleanly (such as the process killed while saving peers.dat?).

    On another note: wallet corruption, which is genuinely scarier, doesn’t make Bitcoin Core crash when starting. (It isn’t related to this issue, but I can share a corrupted wallet.dat that doesn’t crash Bitcoin Core on startup, if that interests anybody.)

  8. maflcko commented at 8:46 am on December 1, 2022: member

    such as the process killed while saving peers.dat

    RenameOver should be atomic, so I don’t think this can happen.

    https://github.com/bitcoin/bitcoin/blob/e2bfd41f832dc7c7be6f17e928352f0eb2865f66/src/addrdb.cpp#L79

  9. maflcko added the label Questions and Help on Dec 1, 2022
  10. NicolasDorier commented at 9:03 am on December 1, 2022: contributor
    The next time it happens, I will ask them to save the peers.dat so we can analyze it.
  11. maflcko commented at 9:13 am on December 1, 2022: member
    Ok, thanks. Closing for now, but this can be reopened if you leave another comment or referred to if you open a new issue.
  12. maflcko closed this on Dec 1, 2022

  13. NicolasDorier commented at 9:47 am on December 1, 2022: contributor
    @MarcoFalke I think the reason why it is corrupt is a separate topic from not crashing over a corrupt peers.dat file. This file is non-essential, so it shouldn’t be a problem to restart.
  14. maflcko reopened this on Dec 1, 2022

  15. maflcko removed the label Questions and Help on Dec 1, 2022
  16. maflcko added the label Brainstorming on Dec 1, 2022
  17. NicolasDorier commented at 11:21 am on December 1, 2022: contributor
    I don’t know if this helps, but I was told that it mainly happens to people running with low-spec RAM (2 GB). Apparently, one user had repeated corruptions with 1 GB of RAM.
  18. maflcko commented at 11:23 am on December 1, 2022: member
    To me it sounds like a bug that should be fixed and not silently ignored
  19. willcl-ark commented at 12:14 pm on December 1, 2022: member

    The peers.dat file is designed to avoid having to reach out to the DNS or hardcoded seeds more than once, as this is the moment your node is most susceptible to being poisoned with attacker IP addresses, and perhaps in the future with blocks and transactions.

    If the file becomes corrupted then anchors.dat should help protect the node from a successful future eclipse attack, but new addresses will have to either be added manually or fetched from DNS or hardcoded seeds again.

    I agree that the best course of action here is to find out what’s corrupting peers.dat and fix that, rather than have Core silently ignore errors on something that could be used as a first step towards eclipse attacking you…

    Side note: it does make me wonder whether it could be worth having certain runtime “profiles”. For example I have seen software with “paranoia level” settings, and we could perhaps have something like

    • Paranoid: Fail on detecting any corrupt file, data, etc.; debugging enabled on many categories by default, notification of re-orgs >= n blocks, etc.
    • Normal (default): Something similar to today’s defaults
    • Resilient: Try to recover from more non-critical errors to stay operational. Automatically restart and rebuild broken indexes, etc. if needed
  20. mzumsande commented at 3:41 pm on December 1, 2022: contributor

    Maybe related: #25874 (with a slightly different error)

    I agree that having access to a corrupted peers.dat would be very helpful.

  21. willcl-ark commented at 3:58 pm on December 1, 2022: member

    It would be interesting if somehow Core is corrupting its own peers.dat in some way, but a more common source of errors is disk failure or a mv or archive operation going wrong or being cancelled midway through.

    I just used dd if=/dev/urandom of=peers.dat bs=1024 seek=(math (random) % 10) count=1 conv=notrunc (fish syntax) to write some random bytes to one of the first 10 blocks for a quick test which reproduced the error.
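    The command above uses fish syntax. As a rough bash counterpart, the sketch below performs the same overwrite on a copy, so the real peers.dat stays untouched. The function name corrupt_copy and its arguments are illustrative assumptions and not part of any Bitcoin Core tooling.

```shell
#!/usr/bin/env bash
# Sketch only: corrupt a COPY of a file for testing, never the original.
# corrupt_copy is a hypothetical helper name chosen for this example.

# corrupt_copy SRC DST
#   Copies SRC to DST, then overwrites one randomly chosen 1 KiB block
#   among the first ten blocks of DST with random bytes.
corrupt_copy() {
    local src="$1" dst="$2" block
    cp -- "$src" "$dst" || return 1
    block=$((RANDOM % 10))   # bash counterpart of fish's (math (random) % 10)
    # conv=notrunc keeps the rest of the file intact instead of truncating it
    dd if=/dev/urandom of="$dst" bs=1024 seek="$block" count=1 conv=notrunc 2>/dev/null
}
```

    For example, corrupt_copy ~/.bitcoin/peers.dat /tmp/peers-corrupt.dat would leave the original alone. Note that if the file is shorter than the chosen offset, dd extends the copy instead of overwriting existing bytes.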

    I suppose in theory we could try to recover whatever entries we can. If an attacker has modified it they would also be able to update the checksum, so the hash is not providing any protection in this scenario.

  22. ghost commented at 7:10 pm on December 1, 2022: none

    It would be interesting if somehow Core is corrupting its own peers.dat in some way, but a more common source of errors is disk failure or a mv or archive operation going wrong or being cancelled midway through.

    It would be more interesting if I could corrupt it remotely on another node. That would be something we need to report to: https://github.com/bitcoin/bitcoin/security/policy

  23. willcl-ark commented at 10:02 am on December 2, 2022: member

    @NicolasDorier I was just re-reading your original issue and had forgotten you had asked for a bash script to detect issues. peers.dat is just double sha256 checksummed, so something like this should work for you:

        #!/usr/bin/env bash
        # Validate the trailing double sha256 checksum of a file such as peers.dat

        # Check if file is provided as argument
        if [ -z "$1" ]; then
            echo "Please provide a file as an argument"
            exit 1
        fi

        # Check if file exists
        if [ ! -f "$1" ]; then
            echo "File does not exist"
            exit 1
        fi

        # Use a full path for better error logging if possible
        if command -v realpath &>/dev/null; then
            file=$(realpath "$1")
        else
            file="$1"
        fi

        echo "Validating double sha256 checksum of file $file"

        # Original hash stored in the last 32 bytes of the file, as hex
        # (raw bytes in a shell variable would lose NUL bytes and trailing newlines)
        file_hash=$(tail --bytes=32 "$file" | xxd -p -c 32)

        # Calculated double sha256 hash of everything before the last 32 bytes, as hex
        calc_hash=$(head --bytes=-32 "$file" | sha256sum -b | cut -c1-64 | xxd -r -p | sha256sum | cut -c1-64)

        if [ "$file_hash" = "$calc_hash" ]; then
            echo "File $file has a valid sha256 checksum"
            exit 0
        else
            echo "File $file has an invalid sha256 checksum and may be corrupted"
            echo "$file sha256 hash:"
            echo "$file_hash"
            echo "$file calculated sha256 hash:"
            echo "$calc_hash"
            echo "Please consider communicating this corrupt $1 file to Bitcoin Core developers for analysis"

            # You could optionally move/rename the file here so that Bitcoin Core will start up without error
            # e.g.:
            # mv "$1" "$1.corrupt"
            exit 1
        fi
    

    If you do use a check like this, perhaps you could consider a way we could have these multiple corrupt peers.dat files returned for analysis?

  24. NicolasDorier commented at 11:08 am on December 10, 2022: contributor
    @willcl-ark The corrupt peers.dat probably didn’t have the sha256 matching. @petzsch was going to try to reproduce. Will ping him.
  25. maflcko added the label Data corruption on Dec 10, 2022
  26. petzsch commented at 11:39 am on December 10, 2022: none

    Still working on reproducing the issue on a VirtualBox VM with 1 GB RAM and 8 CPU cores for hopefully faster syncing (at around 20% synced now). So far no errors concerning the peers.dat.

    Is there anything else I should try to force data corruption to happen? Like powering down the VM without a regular shutdown sequence? Not sure what our users were doing with their installs to torture them into this error. :-)

    Could this error be hardware specific (not just low RAM)? I’m virtualizing with VirtualBox on an i9-9900K; the VM has 64-bit Ubuntu 22.04 running with the usual btcpay docker stack. @NicolasDorier you mentioned a user who ran into this error several times with 1 GB RAM: Can we still reach them for more details about what machine they used?

  27. cpleonardo commented at 0:41 am on December 11, 2022: none

    This error affected my btcpay server apparently since 1.7.1 update (11/29/22). $docker logs btcpayserver_bitcoind -> Error: Invalid or corrupt peers.dat (Checksum mismatch, data corrupted). If you believe this is a bug, please report it to https://github.com/bitcoin/bitcoin/issues. As a workaround, you can move the file ("/home/bitcoin/.bitcoin/peers.dat") out of the way (rename, move, or delete) to have a new one created on the next start.

    As suggested, after deleting peers.dat my node resumed blocks syncing smoothly. Before that, I was able to make a copy of the corrupted peers.dat, but I haven’t found a way to read it.

    My server has low hardware capacity: 2 GB RAM, 50 GB Disk (running in prune mode 25GB) and OS Ubuntu 20.04 x64.

  28. maflcko commented at 9:34 am on December 12, 2022: member

    Can you check the debug log for possible causes of the corruption (maybe at the time of shutdown)? Alternatively, you can upload it here.

    You can find the debug.log in your data dir.

    Please be aware that the debug log might contain personally identifying information.

    You may also upload the peers.dat, with the same disclaimer.

  29. willcl-ark commented at 9:40 am on December 12, 2022: member

    @cpleonardo You can see the deserialisation code in addrdb.cpp here if you want to try and re-implement some manual deserialisation.

    In addition to @MarcoFalke’s suggestions, perhaps dmesg contains logs of a hardware failure? You could check with e.g. dmesg --level=err,warn (needs sudo).

  30. jaonoctus commented at 3:25 pm on December 24, 2022: none
    Can confirm this. Happened to a friend of mine after a power outage
  31. NicolasDorier commented at 0:40 am on January 6, 2023: contributor
    For now I will modify our docker image to wipe peers.dat after every restart, since too many people are hitting the issue. Or maybe try to detect corruption in bash.
  32. kristapsk commented at 0:57 am on January 6, 2023: contributor

    I don’t know if this helps, but I was told that it mainly happens to people running with low-spec RAM (2 GB). Apparently, one user had repeated corruptions with 1 GB of RAM.

    OOM kill?

  33. NicolasDorier commented at 1:18 am on January 6, 2023: contributor

    Adding this to our docker image

        # peers.dat is routinely corrupted, preventing bitcoind from starting, see #26599
        peers_dat="peers.dat"
        peers_dat_corrupted="peers_corrupted.dat"
        if [[ -f "${peers_dat}" ]]; then
            # Double sha256 of everything except the trailing 32-byte checksum, as hex
            actual_hash=$(head -c -32 "${peers_dat}" | sha256sum | cut -c1-64 | xxd -r -p | sha256sum | cut -c1-64)
            # Checksum stored in the last 32 bytes of the file, as hex
            expected_hash=$(tail -c 32 "${peers_dat}" | xxd -ps -c 32)
            if [[ "${actual_hash}" != "${expected_hash}" ]]; then
                echo "${peers_dat} is corrupted, moving it to ${peers_dat_corrupted}"
                rm -f "${peers_dat_corrupted}"
                mv "${peers_dat}" "${peers_dat_corrupted}"
            fi
        fi
    

    If that happens, I keep the corrupted file so I can share it here.

  34. NicolasDorier referenced this in commit 00e4ee954f on Jan 6, 2023
  35. NicolasDorier referenced this in commit e1ec79b299 on Jan 6, 2023
  36. NicolasDorier referenced this in commit ba4e9c30c9 on Jan 6, 2023
  37. NicolasDorier commented at 4:03 am on January 6, 2023: contributor

    For people having issues getting bitcoind to start on a btcpay server docker deployment, run btcpay-update.sh. Our new image now automatically moves the corrupted file.

    On a possibly related matter, we have for a long time also been deleting settings.json before starting Bitcoin Core with rm -f /home/bitcoin/.bitcoin/settings.json, as it also often happened that bitcoind was unable to start due to random corruption of it.

  38. fedevegili commented at 1:05 am on January 12, 2023: none

    This just happened to my bitcoind node running on an old Raspberry Pi 2 (1 GB memory).

    Here is my corrupted peers.dat: crashed-peers.zip

    Here are the relevant lines from my debug.log: https://gist.github.com/fedevegili/d8afe7ca6a46bea5c5281aeda661e5e9

    Hope it helps.

  39. maflcko commented at 9:09 am on January 12, 2023: member
    What type of underlying storage does /mnt/usb/bitcoin-data point to? If the medium is eager to corrupt itself you may find other files (such as the coins db) corrupted as well.
  40. fedevegili commented at 11:26 am on January 12, 2023: none

    What type of underlying storage does /mnt/usb/bitcoin-data point to? If the medium is eager to corrupt itself you may find other files (such as the coins db) corrupted as well.

    It’s an ext4 1 TB SSD connected to the Raspberry Pi through an external HDD case. I’d say it’s quite error prone.

    However, only this file got corrupted so far, and only this one time. I have had this setup for about 2 months.

  41. mzumsande commented at 4:40 pm on January 12, 2023: contributor

    Here are the relevant lines from my debug.log: https://gist.github.com/fedevegili/d8afe7ca6a46bea5c5281aeda661e5e9

    What happened here, did the node crash without an error, and then you restarted it and got the corrupted-peers error? I’m asking because the time stamps in the debug.log aren’t chronological, at 2023-01-08T18:55:41Z a new outbound peer is connected, and then it jumps back in time.

  42. jonatack commented at 8:15 pm on January 12, 2023: contributor

    I don’t know the reason; I can only report that it happened to maybe 2 or 3 different people contacting our support in one or two months. Removing the peers.dat fixed their problem. My guess is that Bitcoin Core was not shut down cleanly (such as the process killed while saving peers.dat?).

    I also saw this happen a couple times a year on my older laptop running debian. I restart nodes frequently in order to test pull requests, and IIRC the corruption would happen after one of the Bitcoin Core peers.dat-related threads hung during shutdown and bitcoind had to be killed manually.

  43. mzumsande commented at 10:54 pm on January 12, 2023: contributor

    I believe I’ve found the bug that caused this with the help of the provided peers.dat (which was completely ok as far as I can see, just that the checksum was wrong, and when overwriting the bad checksum with the correct one it would load correctly):

    Every 15 minutes, the scheduler thread will dump peers.dat to disk - for this it calls https://github.com/bitcoin/bitcoin/blob/f4ef856375c5b295d78169b136c6aee928c19bc9/src/addrdb.cpp#L38-L40

    which first writes the data (i.e. AddrMan) into the stream, and then writes the same data into a hasher - which then provides the hash that is added to the stream in the third line. The problem is that AddrMan can change in between the first two calls (e.g. if we receive a new address), and then the data and hash won’t match anymore and the written file is corrupt.
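    The mismatch described above can be illustrated with ordinary shell tools. This is a minimal sketch, assuming GNU coreutils; the function names (hex_to_bin, dsha256_hex, write_checksummed, verify_checksummed) are hypothetical, and it does not reproduce the actual code or fix in Bitcoin Core. The idea is to take one snapshot of the data and then compute the checksum from those same bytes, so a change to the source in between cannot make data and hash disagree.

```shell
#!/usr/bin/env bash
# Sketch only: shows "hash the exact bytes that were written" using shell tools.
# All function names are hypothetical; this is not Bitcoin Core's code.

# hex_to_bin HEX: print the raw bytes encoded by the hex string HEX.
hex_to_bin() {
    local hex="$1" byte
    while [ -n "$hex" ]; do
        byte=${hex%"${hex#??}"}                   # first two hex digits
        hex=${hex#??}                             # remaining digits
        printf "\\$(printf '%03o' "0x$byte")"     # emit the byte via an octal escape
    done
}

# dsha256_hex: double SHA-256 of stdin, printed as 64 hex digits.
dsha256_hex() {
    local first
    first=$(sha256sum | cut -c1-64)
    hex_to_bin "$first" | sha256sum | cut -c1-64
}

# write_checksummed TARGET: read the payload from stdin exactly once,
# then derive the checksum from the bytes already on disk.
write_checksummed() {
    local target="$1" tmp="$1.new" digest
    cat > "$tmp" || return 1              # 1. take a single snapshot of the data
    digest=$(dsha256_hex < "$tmp")        # 2. hash that snapshot, not the live source
    hex_to_bin "$digest" >> "$tmp"        # 3. append the 32-byte checksum
    mv -f "$tmp" "$target"                # 4. move the finished file into place
}

# verify_checksummed FILE: succeed if the last 32 bytes match the payload's hash.
verify_checksummed() {
    local stored actual
    stored=$(tail -c 32 "$1" | od -An -v -tx1 | tr -d ' \n')
    actual=$(head -c -32 "$1" | dsha256_hex)
    [ "$stored" = "$actual" ]
}
```

    Under the same assumptions, head -c -32 peers.dat | write_checksummed repaired.dat would recompute the checksum for a file whose payload is intact but whose stored hash is wrong, which is the kind of manual repair mentioned at the start of this comment.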

    I could reproduce this by adding a sleep for the scheduler thread in between the two writes of data, manually adding artificial addresses with addpeeraddress during this sleep, and then killing bitcoind (so that it can’t correct the peers.dat at a clean shutdown). That way, I would corrupt my own peers.dat.

    I will work on a fix!

  44. fedevegili commented at 5:48 pm on January 13, 2023: none

    Here are the relevant lines from my debug.log: https://gist.github.com/fedevegili/d8afe7ca6a46bea5c5281aeda661e5e9

    What happened here, did the node crash without an error, and then you restarted it and got the corrupted-peers error? I’m asking because the time stamps in the debug.log aren’t chronological, at 2023-01-08T18:55:41Z a new outbound peer is connected, and then it jumps back in time.

    Unfortunately I have no idea. I realized the logs were out of order too, but couldn’t figure out why.

    Everything was working, and 4-5 days later when I checked again, the bitcoin process was not running and I was not able to start it anymore. No one touches this Raspberry Pi and no one connected to it in the meantime. The bitcoind process starts with system boot, so I’d say there was definitely a restart at some point.

    Maybe the logs seem to be out of order because of clock syncing adjustments after a system restart?

  45. CubicEarth commented at 4:56 pm on January 15, 2023: none

    To me it sounds like a bug that should be fixed and not silently ignored

    What about having a backup peers.dat? On startup, after peers.dat is successfully loaded, the backup would be updated / replaced with the current version. If peers.dat was corrupted and failed to load, the backup could be used instead, along with an alert to the user.
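    A rough sketch of this backup idea, written as a pre-start step outside of bitcoind rather than inside Bitcoin Core as proposed. It assumes GNU coreutils; the file names peers.dat.bak and peers.dat.corrupt and the function names are hypothetical choices for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: keep a last-known-good copy of peers.dat and fall back to it.
# File and function names are assumptions made for this example.

# hex_to_bin HEX: print the raw bytes encoded by the hex string HEX.
hex_to_bin() {
    local hex="$1" byte
    while [ -n "$hex" ]; do
        byte=${hex%"${hex#??}"}
        hex=${hex#??}
        printf "\\$(printf '%03o' "0x$byte")"
    done
}

# checksum_ok FILE: succeed if the trailing 32 bytes equal the
# double SHA-256 of everything before them.
checksum_ok() {
    local stored first actual
    stored=$(tail -c 32 "$1" | od -An -v -tx1 | tr -d ' \n')
    first=$(head -c -32 "$1" | sha256sum | cut -c1-64)
    actual=$(hex_to_bin "$first" | sha256sum | cut -c1-64)
    [ "$stored" = "$actual" ]
}

# prepare_peers DATADIR: run before starting the node.
prepare_peers() {
    local peers="$1/peers.dat" backup="$1/peers.dat.bak"
    [ -f "$peers" ] || return 0                 # nothing to check yet
    if checksum_ok "$peers"; then
        cp -f "$peers" "$backup"                # refresh the backup from a good file
    elif [ -f "$backup" ] && checksum_ok "$backup"; then
        echo "peers.dat failed its checksum, restoring $backup" >&2
        mv -f "$peers" "$peers.corrupt"         # keep the bad file for analysis
        cp -f "$backup" "$peers"
    else
        echo "peers.dat failed its checksum and no valid backup exists" >&2
        mv -f "$peers" "$peers.corrupt"         # node will create a new file on start
    fi
}
```

    Calling prepare_peers /home/bitcoin/.bitcoin before launching the node would then either refresh the backup or restore it and warn on stderr. The corrupt file is kept so it can still be shared for analysis.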

  46. mzumsande commented at 6:58 pm on January 17, 2023: contributor
    I opened #26909 to fix this.
  47. maflcko referenced this in commit b5c88a5479 on Jan 19, 2023
  48. maflcko closed this on Jan 19, 2023

  49. sidhujag referenced this in commit 4ab996b227 on Jan 19, 2023
  50. beeduul commented at 5:57 pm on May 20, 2023: none

    Although this issue has been marked as fixed for the next release, I’ll leave this additional note here for posterity.

    This issue happens to me every few days on my 4 GB Pi Umbrel. It appears that immediately before each crash, the log contains Socks5() connect to xxx.xxx.xxx.xxx:8333 failed: InterruptibleRecv() timeout or other failure.

  51. mzumsande commented at 9:15 pm on May 20, 2023: contributor

    Although this issue has been marked as fixed for the next release, I’ll leave this additional note here for posterity.

    This issue happens to me every few days on my 4 GB Pi Umbrel. It appears that immediately before each crash, the log contains Socks5() connect to xxx.xxx.xxx.xxx:8333 failed: InterruptibleRecv() timeout or other failure.

    To be clear: the fix doesn’t prevent any crashes from happening - what it fixes is that if the node crashes for some unrelated reason, peers.dat shouldn’t get corrupted anymore (which would only be visible at the next startup). So if your node crashes every few days, it sounds like you have another, unrelated problem.

  52. fanquake commented at 9:09 am on May 22, 2023: member

    So if your node crashes every few days, it sounds like you have another, unrelated problem.

    @beeduul do you want to follow up with a new issue, providing more info if possible? Assuming this isn’t a hardware related problem.

  53. bitcoin locked this on May 21, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-09-28 22:12 UTC
