LevelDB read failure: Corruption: block checksum mismatch #30159

issue apulsifer opened this issue on May 23, 2024
  1. apulsifer commented at 3:22 pm on May 23, 2024: none

    Is there an existing issue for this?

    • I have searched the existing issues

    Current behaviour

    When running in prune=550 mode, I consistently get the following error about once every 10 days per machine:

    LevelDB read failure: Corruption: block checksum mismatch

    There is no recovery from this error (reindex doesn’t work in prune mode), so the only solution is to nuke the datadir and do a full resync or restore the datadir from a backup.

    Searching the webs, the conventional wisdom is that this is caused by a hardware/disk problem. That is definitely not the case here, as I’ll explain. I suspect a bug in the code is causing some thread to write to an incorrect memory location, possibly a memory use-after-free/reallocation/reorganization bug.

    Details:

    I set up bitcoind in prune=550 mode to run on ten Amazon EC2 t4g.nano instances that each have 512 MB RAM and a 4 GB gp3 (SSD) swap drive. The only job the machines have is to track the blockchain and provide RPC information on recent blocks and confirmed transactions. There is a lot of memory pressure and paging when a block comes in, but since speed is not an issue, that shouldn’t be a problem.

    About once per day, bitcoind on one of the machines reports LevelDB read failure: Corruption: block checksum mismatch and stops working. The only fix is to delete and restore the bitcoin data directory, and then restart bitcoind to sync and catch back up.

    I am virtually certain this is a software bug in bitcoind because (a) Amazon EC2 has some of the most tested and reliable hardware in the world; (b) this problem has shown up on all ten Amazon EC2 instances, all running in different EC2 availability zones; (c) Bitcoin Cash Node is also installed on all ten machines, and it has had zero problems. Bitcoin Cash Node forked from Bitcoin Core a while back and also uses LevelDB, among other commonalities.

    I tried this with the bitcoin binary distributions bitcoin-25.2-aarch64-linux-gnu.tar.gz and bitcoin-26.1-aarch64-linux-gnu.tar.gz, as well as bitcoin-25.2 compiled from source, and it made no difference.

    I collected some of the files that have the checksum mismatch and can provide them if someone wants to look for clues on which software component corrupted the data.
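
    For anyone who wants to look at such files: the error refers to the per-block trailer in LevelDB's table (.ldb) format, which is one compression-type byte followed by a 4-byte little-endian masked CRC-32C computed over the block contents plus the type byte. The following is a minimal sketch of that check, not LevelDB's own code. The function names are hypothetical, and locating blocks inside a real file (via the footer and index block) is not shown.

```python
import struct

def _make_table():
    """Build the CRC-32C (Castagnoli) table, reflected polynomial 0x82F63B78."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _make_table()

def crc32c(data: bytes) -> int:
    """Plain (unmasked) CRC-32C of data."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

MASK_DELTA = 0xA282EAD8  # constant used by LevelDB's CRC masking

def mask(crc: int) -> int:
    """Rotate right by 15 bits and add a constant, as LevelDB does before storing."""
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + MASK_DELTA) & 0xFFFFFFFF

def make_trailer(contents: bytes, block_type: int = 0) -> bytes:
    """Hypothetical helper: build the 5-byte trailer for a block (type 0 = uncompressed)."""
    type_byte = bytes([block_type])
    return type_byte + struct.pack("<I", mask(crc32c(contents + type_byte)))

def block_checksum_ok(contents: bytes, trailer: bytes) -> bool:
    """Return True if the stored masked CRC matches the block contents and type byte."""
    stored = struct.unpack("<I", trailer[1:5])[0]
    return mask(crc32c(contents + trailer[:1])) == stored
```

    A single flipped bit anywhere in a block's contents or type byte makes block_checksum_ok return False, which is the condition that surfaces as "block checksum mismatch". The check only says that the bytes read differ from the bytes checksummed at write time, not where or when they changed.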

    Expected behaviour

    I expect to never see a LevelDB data corruption failure.

    Steps to reproduce

    bitcoind config file:

    datadir=/bdata/bitcoin-data

    discover=0
    listen=1
    maxconnections=24

    par=1

    blocksonly=1
    dbcache=200
    maxsigcachesize=4
    prune=550

    maxmempool=5
    blockreconstructionextratxn=1
    maxorphantx=1
    mempoolexpiry=1
    persistmempool=0

    disablewallet=1

    server=1
    rpcallowip=127.0.0.1
    rpcuser=btc
    rpcpassword=btc
    rpcworkqueue=40
    rpcthreads=1

    printtoconsole=1
    nodebuglogfile=1

    [main]
    rpcport=8332
    rpcbind=127.0.0.1:8332
    bind=[::]:9333
    bind=127.0.0.1:8334=onion

    Relevant log output

    This is one example; I have more, and they all look the same:

    Started bitcoind.service.
    2024-05-22T17:53:08Z Bitcoin Core version v25.2.0 (release build)
    …
    2024-05-22T19:05:51Z UpdateTip: new best=00000000000000000002c656268be2b9e044b5963af0507e16414552aa526d57 height=844603 version=0x237c6000 log2_work=94.939158 tx=1009037420 date='2024-05-22T14:46:04Z' progress=0.999943 cache=72.4MiB(515946txo)
    2024-05-22T19:05:57Z UpdateTip: new best=000000000000000000028401d5cd96ea647cc9adae836735615d7dbf64feed6f height=844604 version=0x2f50c000 log2_work=94.939172 tx=1009041112 date='2024-05-22T15:00:11Z' progress=0.999946 cache=73.9MiB(526621txo)
    2024-05-22T19:06:00Z Socks5() connect to 78.44.10.186:8333 failed: connection refused
    2024-05-22T19:06:03Z UpdateTip: new best=000000000000000000018332b3b2594e340a0dfd150cbc2a852930c0cddaa91b height=844605 version=0x20000000 log2_work=94.939185 tx=1009044861 date='2024-05-22T15:14:46Z' progress=0.999949 cache=75.4MiB(537704txo)
    2024-05-22T19:06:05Z Socks5() connect to 2601:283:5080:8540::55d6:8333 failed: general failure
    2024-05-22T19:06:12Z UpdateTip: new best=0000000000000000000163e90fef2b79654d0235d68a603f46ac5e41ce62d827 height=844606 version=0x2e000000 log2_work=94.939199 tx=1009048116 date='2024-05-22T15:31:30Z' progress=0.999953 cache=76.8MiB(548883txo)
    2024-05-22T19:06:20Z UpdateTip: new best=00000000000000000001a4ce0b96e5a761337a84974d27a694c7b8d2c74b8cf0 height=844607 version=0x27a94000 log2_work=94.939212 tx=1009050900 date='2024-05-22T15:43:50Z' progress=0.999956 cache=78.1MiB(558437txo)
    2024-05-22T19:06:26Z UpdateTip: new best=00000000000000000001d82049db35f2dfabccfba593ee3a433f0500c2734f4e height=844608 version=0x274a6000 log2_work=94.939226 tx=1009054026 date='2024-05-22T16:09:29Z' progress=0.999961 cache=79.5MiB(569127txo)
    2024-05-22T19:06:49Z Socks5() connect to 212.102.36.243:8333 failed: general failure
    2024-05-22T19:07:09Z UpdateTip: new best=00000000000000000003571b667acb77721004099827b38802f77500cd370d8b height=844609 version=0x20000000 log2_work=94.939240 tx=1009057717 date='2024-05-22T16:21:29Z' progress=0.999964 cache=5.4MiB(0txo)
    2024-05-22T19:07:17Z LevelDB read failure: Corruption: block checksum mismatch: /bdata/bitcoin-data/chainstate/5847811.ldb
    2024-05-22T19:07:17Z Fatal LevelDB error: Corruption: block checksum mismatch: /bdata/bitcoin-data/chainstate/5847811.ldb
    2024-05-22T19:07:17Z You can use -debug=leveldb to get more complete diagnostic messages
    2024-05-22T19:07:17Z Error: Error reading from database, shutting down.
    Error: Error reading from database, shutting down.
    2024-05-22T19:07:17Z Error reading from database: Fatal LevelDB error: Corruption: block checksum mismatch: /bdata/bitcoin-data/chainstate/5847811.ldb
    bitcoind.service: Main process exited, code=dumped, status=6/ABRT
    bitcoind.service: Failed with result 'core-dump'.
    bitcoind.service: Consumed 16min 43.740s CPU time.

    How did you obtain Bitcoin Core

    Compiled from source

    What version of Bitcoin Core are you using?

    bitcoin-25.2-aarch64-linux-gnu.tar.gz and bitcoin-26.1-aarch64-linux-gnu.tar.gz

    Operating system and version

    Linux 6.1.87-99.174.amzn2023.aarch64 #1 SMP

    Machine specifications

    Amazon EC2 t4g.nano instance (512 MB RAM) with unlimited CPU
    zram driver disabled (sudo yum remove zram-generator, reboot)
    14 GB gp3 root drive
    4 GB gp3 swap drive
    40 GB gp3 data drive for bitcoin blockchain

    NAME          MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
    nvme1n1       259:0    0  40G  0 disk /bdata
    nvme2n1       259:1    0   4G  0 disk [SWAP]
    nvme0n1       259:2    0  14G  0 disk
    ├─nvme0n1p1   259:3    0  14G  0 part /
    └─nvme0n1p128 259:4    0  10M  0 part /boot/efi

    IPv4 enabled but not assigned an IP address
    IPv6 enabled, assigned an IP address, and routed to internet

    Some machines have Tor installed and enabled for testing (including the one for the log file attached above), but this made no difference in results.

  2. maflcko added the label Data corruption on May 23, 2024
  3. maflcko added the label Bug on May 23, 2024
  4. maflcko commented at 3:35 pm on May 23, 2024: member

    I suspect a bug in the code is causing some thread to write to an incorrect memory location, possibly a memory use-after-free/reallocation/reorganization bug.

    Would it be possible for you to compile and run with asan, or a similar sanitizer?

    Also, what filesystem are you using on the drives? Something like df --print-type --human-readable /bdata should print it.

  5. apulsifer commented at 3:40 pm on May 23, 2024: none

    xfs on the root and data drive

    sudo mkswap /dev/nvme[4 GB disk]
    sudo swapon /dev/nvme[4 GB disk]

    sudo mkdir /bdata
    sudo mkfs -t xfs /dev/nvme[40 GB disk]
    lsblk -o name,size,type,uuid
    sudo nano /etc/fstab

    add to fstab:
    UUID=[4 GB disk uuid] swap swap defaults 0 0
    UUID=[40 GB disk uuid] /bdata xfs defaults,nofail 0 2

    sudo mount -a

    Filesystem        Type      Size  Used Avail Use% Mounted on
    devtmpfs          devtmpfs  4.0M     0  4.0M   0% /dev
    tmpfs             tmpfs     210M     0  210M   0% /dev/shm
    tmpfs             tmpfs      84M  552K   84M   1% /run
    /dev/nvme0n1p1    xfs        14G  3.4G   11G  25% /
    tmpfs             tmpfs     210M     0  210M   0% /tmp
    /dev/nvme1n1      xfs        40G   19G   22G  46% /bdata
    /dev/nvme0n1p128  vfat       10M  1.4M  8.7M  14% /boot/efi
    tmpfs             tmpfs      42M     0   42M   0% /run/user/1000

  6. apulsifer commented at 3:45 pm on May 23, 2024: none

    Would it be possible for you to compile and run with asan, or a similar sanitizer?

    I don’t see where I would get a chunk of time to do that right now…. But as I mentioned, I copied the corrupted files, and that might give some clues to someone familiar with their format (especially if the corruption is ascii in the middle of binary, or vice versa)
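
    The "ASCII in the middle of binary" idea can be checked mechanically. Below is a minimal sketch, similar in spirit to the Unix strings tool, that lists long runs of printable ASCII together with their byte offsets. The function name and the default minimum length are illustrative choices, not anything taken from Bitcoin Core or LevelDB.

```python
import re

def printable_runs(data: bytes, min_len: int = 16):
    """Yield (offset, text) for every run of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")
```

    To use it, read a saved copy of a corrupted .ldb file in binary mode and pass the bytes to printable_runs. Long readable runs at an offset where binary records are expected would be a hint about what overwrote the data. Short runs occur by chance in binary data, so a fairly large min_len keeps the noise down.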

  7. maflcko commented at 4:22 pm on May 23, 2024: member

    I don’t see where I would get a chunk of time to do that right now….

    Sure, no rush. It’ll probably take some time to pin this down. (I don’t have an AWS account, so I can’t test it, but maybe someone else has).

    Some other ideas to test in the meantime:

    • Try another filesystem instead of xfs
    • Try the master branch (not for production, just for testing whether the issue still happens there)
  8. maflcko added the label UTXO Db and Indexes on May 23, 2024
  9. apulsifer commented at 5:51 pm on May 23, 2024: none
    xfs is used on the root drive of every Amazon EC2 instance running Amazon Linux. If xfs on Amazon EC2 were the problem, a lot of critical infrastructure would be failing right now. And as I mentioned, these machines are also running bitcoin cash in an almost identical configuration (data file path and ports changed) and it has had zero problems. Since the data corruption only happens about once per week when running on mainnet, I think figuring this problem out will probably take a customized and instrumented version of bitcoind being fed blocks at high speed with random jitter and waits and synthetic memory pressure. This could probably be done on a virtual machine anywhere, like Xen, KVM, VirtualBox, etc.
  10. maflcko commented at 6:27 pm on May 23, 2024: member

    Since the data corruption only happens about once per week

    Once per week is a lot, and if this were a broader problem, I’d assume that more people would be complaining. Given that you can consistently reproduce it on different machines, it seems there is a real bug somewhere. However, Bitcoin Core is running fine in a lot of other places, so there has to be some hardware or configuration setting (or combination thereof) that triggers this bug on your side. It would be good to know which one it is.

  11. mzumsande commented at 6:32 pm on May 23, 2024: contributor

    (edited first question out, I misunderstood)

    From your log it appears that this happened while the node was catching up with the tip (almost but not completely synced yet), receiving blocks quickly. Is that typical, or does it usually happen when the node is synced and receives blocks as they are mined?

  12. apulsifer commented at 7:10 pm on May 23, 2024: none

    Once per week is a lot and if this was a broader problem, I’d assume that more people were complaining.

    It could just be that the memory pressure is uncovering the problem. Of course, any machine can experience memory pressure at times, but one thing that’s unique is that these machines start paging hard to SSD for about 20 seconds after each new block arrives.

  13. apulsifer commented at 7:58 pm on May 23, 2024: none

    Could you explain the paging in a bit more detail?

    The machines page pretty hard to SSD for about 20 seconds after each new block arrives. That info comes from “sar -d 10”, which I logged for a while when I was setting up the first machine.
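
    As a complement to sar, the kernel's cumulative swap counters can be sampled directly. This is a minimal sketch that assumes a Linux /proc/vmstat with the pswpin and pswpout fields (pages swapped in and out since boot). The function names are hypothetical.

```python
def parse_vmstat(text: str) -> dict:
    """Parse the 'name value' lines of /proc/vmstat into a dict of integers."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def swap_delta(before: dict, after: dict) -> dict:
    """Pages swapped in and out between two samples of the counters."""
    return {key: after.get(key, 0) - before.get(key, 0)
            for key in ("pswpin", "pswpout")}
```

    Reading /proc/vmstat once before and once after a block arrives, parsing both with parse_vmstat, and passing them to swap_delta gives the number of pages moved in that window. Logging these deltas next to the UpdateTip timestamps would show whether a failure lines up with an unusually heavy paging burst.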

    Is the node constantly being bombarded with lots of simultaneous RPC calls (which ones?) or is some other interface used?

    None of these machines has at this point serviced a single RPC call (I’m still trying to get things set up). No other interface is being used, just the bitcoind peer network, half of the machines with direct IPv6, half via torproxy.

    Also, from your log it appears that this happened while the node was catching up with the tip (almost but not completely synced yet), receiving blocks quickly. Is that typical, or does it usually happen when the node is synced and receives blocks as they are mined?

    Sync’ing has been pretty typical at the moment, since I’m still setting things up and bitcoind has been started and stopped from time-to-time to try out different settings.

    The initial sync from the genesis block was done on a machine with 16 GB RAM. Then on May 7, bitcoind on that machine was stopped and the /bdata directory was copied to these 10 machines with only 512 MB RAM that are only expected to keep up with new blocks as they arrive. Since that time, two of the ten machines have had no data corruption. The other eight machines have had data corruption one or more times.

    The problem doesn’t just show up during block sync however. I do know that on Monday at end of day, all the machines were running and fully synced, and by Tuesday morning, two machines had data corruption. I was busy with other things and left all the machines alone, and by the time I went to fix them Wednesday afternoon, four machines had data corruption.

    I can’t definitively rule out that starting and stopping bitcoind, or rebooting the machines, has contributed to this problem. The files could sit in the bdata directory for a while, and it’s possible one is getting corrupted when bitcoind shuts down or the machine reboots, but bitcoind doesn’t notice it until sometime later when it reads the file. Note that bitcoind is being started and stopped by a systemd service file (attached below) which has a 10 minute timeout. So systemd will politely ask bitcoind to stop, and if it hasn’t exited after 10 minutes, systemd will kill it. Another thing worth noting from the service file is that I set up bitcoind to run at Nice=16, which might contribute to triggering the problem.

    As of late last night, all machines are running and fully synced again. By next week, I’m going to start leaving them alone to run autonomously (with the exception of rebuilding a datadir if needed), so that will be a much better test of what happens when the machines are fully synced and bitcoind is running continuously.

    [Service]
    WorkingDirectory=/home/ec2-user
    ExecStart=/home/ec2-user/bitcoin/bin/bitcoind -conf=/home/ec2-user/bitcoin.conf
    Restart=always
    RestartSec=60
    TimeoutStopSec=600
    Nice=16
    User=ec2-user
    Group=ec2-user
    StandardOutput=journal
    StandardError=journal

    [Unit]
    After=network-online.target

    [Install]
    WantedBy=multi-user.target

  14. maflcko commented at 8:20 pm on May 23, 2024: member

    I think if you call the RPC gettxoutsetinfo muhash on all nodes (when they are synced to the same block) and the result matches for all of them, then the chainstate leveldb at that point in time is probably fine. I presume all failures happened in the /bdata/bitcoin-data/chainstate/ leveldb?

    Edit: Calling that RPC will take a long time on your machines, I suspect.

  15. apulsifer commented at 11:51 pm on May 23, 2024: none

    Yes, all the checksum errors are in numerically-named NNNNNN.ldb files in /bdata/bitcoin-data/chainstate/

    I had no luck with gettxoutsetinfo muhash:

    bitcoin/bin/bitcoin-cli -rpcwaittimeout=0 -conf=/home/ec2-user/bitcoin.conf getblockcount
    844826

    bitcoin/bin/bitcoin-cli -rpcwaittimeout=0 -conf=/home/ec2-user/bitcoin.conf gettxoutsetinfo muhash
    error: timeout on transient error: Could not connect to the server 127.0.0.1:8332 (error code 0 - "timeout reached")

  16. maflcko commented at 6:27 am on May 24, 2024: member

    The RPC will take a long time (probably hours), so you’ll have to disable the client timeout with -rpcclienttimeout=0.

    bitcoin/bin/bitcoin-cli -rpcclienttimeout=0 -rpcwaittimeout=0 -conf=/home/ec2-user/bitcoin.conf gettxoutsetinfo muhash
    
  17. apulsifer commented at 12:43 pm on May 24, 2024: none

    Took about 20 minutes. They all match.

    Fri May 24 12:12:33 UTC 2024
    {
      "height": 844917,
      "bestblock": "000000000000000000017dd5f59b73629f6f88797e90017a9df39c5e435296bf",
      "txouts": 181984254,
      "bogosize": 13994337434,
      "muhash": "b97a3fb13a61e8c889668d064f7ae5408a78ee7e4c6a4fdebedb30ecb2f23378",
      "total_amount": 19702649.24256659,
      "transactions": 125720653,
      "disk_size": 12220356091
    }
    Fri May 24 12:39:58 UTC 2024
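
    With ten machines, the comparison itself is easy to automate. This is a minimal sketch that assumes the gettxoutsetinfo muhash output of each node has already been collected as JSON text. The host names and the function name are hypothetical.

```python
import json

def group_by_utxo_fingerprint(outputs: dict) -> dict:
    """Group hosts by (height, bestblock, muhash) taken from their RPC output.

    outputs maps a host name to the raw JSON text printed by
    `bitcoin-cli gettxoutsetinfo muhash` on that host.
    """
    groups = {}
    for host, raw in outputs.items():
        info = json.loads(raw)
        fingerprint = (info["height"], info["bestblock"], info["muhash"])
        groups.setdefault(fingerprint, []).append(host)
    return groups
```

    If the returned dict has exactly one key, all nodes agree on the UTXO set at that block. More than one key means either the nodes were at different heights when queried or at least one chainstate differs, and the height and bestblock fields in each key tell the two cases apart.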

  18. apulsifer commented at 4:55 pm on May 31, 2024: none
    Update: Since leaving these servers alone for a week and not rebooting them or restarting bitcoind, they have stayed perfectly in sync without issues. So it looks like the problem is triggered by starting and stopping bitcoind (which I can live with: if I do have to restart a server and I get data corruption, I’ll image the data from another server).
  19. apulsifer commented at 12:53 pm on June 4, 2024: none

    Update: After running continuously since 05-23 (no reboots or restarting bitcoind), one of the servers failed this morning. So it seems the data corruption bug occurs even when bitcoind is running continuously, although at a much lower rate.

    Started bitcoind.service.
    2024-05-23T11:23:40Z Bitcoin Core version v25.2.0 (release build)
    2024-05-23T11:23:40Z InitParameterInteraction: parameter interaction: -blocksonly=1 -> setting -whitelistrelay=0
    2024-05-23T11:23:40Z Using the 'arm_shani(1way,2way)' SHA256 implementation
    2024-05-23T11:23:40Z Default data directory /home/ec2-user/.bitcoin
    2024-05-23T11:23:40Z Using data directory /bdata/bitcoin-data
    2024-05-23T11:23:40Z Config file: /home/ec2-user/bitcoin.conf
    2024-05-23T11:23:40Z Config file arg: blockreconstructionextratxn="1"
    2024-05-23T11:23:40Z Config file arg: blocksonly="1"
    2024-05-23T11:23:40Z Config file arg: datadir="/bdata/bitcoin-data"
    2024-05-23T11:23:40Z Config file arg: dbcache="200"
    2024-05-23T11:23:40Z Config file arg: debuglogfile=false
    2024-05-23T11:23:40Z Config file arg: disablewallet="1"
    2024-05-23T11:23:40Z Config file arg: discover="0"
    2024-05-23T11:23:40Z Config file arg: dns="0"
    2024-05-23T11:23:40Z Config file arg: dnsseed="0"
    2024-05-23T11:23:40Z Config file arg: listen="1"
    2024-05-23T11:23:40Z Config file arg: maxconnections="24"
    2024-05-23T11:23:40Z Config file arg: maxmempool="5"
    2024-05-23T11:23:40Z Config file arg: maxorphantx="1"
    2024-05-23T11:23:40Z Config file arg: maxsigcachesize="4"
    2024-05-23T11:23:40Z Config file arg: mempoolexpiry="1"
    2024-05-23T11:23:40Z Config file arg: par="1"
    2024-05-23T11:23:40Z Config file arg: persistmempool="0"
    2024-05-23T11:23:40Z Config file arg: printtoconsole="1"
    2024-05-23T11:23:40Z Config file arg: prune="550"
    2024-05-23T11:23:40Z Config file arg: rest="1"
    2024-05-23T11:23:40Z Config file arg: rpcallowip="127.0.0.1"
    2024-05-23T11:23:40Z Config file arg: rpcthreads="1"
    2024-05-23T11:23:40Z Config file arg: rpcworkqueue="40"
    2024-05-23T11:23:40Z Config file arg: server="1"
    2024-05-23T11:23:40Z Config file arg: [main] bind="127.0.0.1:8334=onion"
    2024-05-23T11:23:40Z Config file arg: [main] rpcbind="127.0.0.1:8332"
    2024-05-23T11:23:40Z Command-line arg: conf="/home/ec2-user/bitcoin.conf"
    2024-05-23T11:23:40Z Using at most 24 automatic connections (65535 file descriptors available)
    2024-05-23T11:23:40Z Using 2 MiB out of 2 MiB requested for signature cache, able to store 65536 elements
    2024-05-23T11:23:40Z Using 2 MiB out of 2 MiB requested for script execution cache, able to store 65536 elements
    2024-05-23T11:23:40Z Script verification uses 0 additional threads
    2024-05-23T11:23:40Z Wallet disabled!
    2024-05-23T11:23:40Z scheduler thread start
    2024-05-23T11:23:40Z Binding RPC on address 127.0.0.1 port 8332
    2024-05-23T11:23:40Z [http] creating work queue of depth 40
    2024-05-23T11:23:40Z [http] starting 1 worker threads
    2024-05-23T11:23:40Z Using /16 prefix for IP bucketing
    2024-05-23T11:23:40Z init message: Loading P2P addresses…
    2024-05-23T11:23:41Z Loaded 67288 addresses from peers.dat 1001ms
    2024-05-23T11:23:41Z init message: Loading banlist…
    2024-05-23T11:23:41Z SetNetworkActive: true
    2024-05-23T11:23:41Z Cache configuration:
    2024-05-23T11:23:41Z * Using 2.0 MiB for block index database
    2024-05-23T11:23:41Z * Using 8.0 MiB for chain state database
    2024-05-23T11:23:41Z * Using 190.0 MiB for in-memory UTXO set (plus up to 4.8 MiB of unused mempool space)
    2024-05-23T11:23:41Z init message: Loading block index…
    2024-05-23T11:23:41Z Assuming ancestors of block 000000000000000000035c3f0d31e71a5ee24c5aaf3354689f65bd7b07dee632 have valid signatures.
    2024-05-23T11:23:41Z Setting nMinimumChainWork=000000000000000000000000000000000000000044a50fe819c39ad624021859
    2024-05-23T11:23:41Z Prune configured to target 550 MiB on disk for block and undo files.
    2024-05-23T11:23:41Z Opening LevelDB in /bdata/bitcoin-data/blocks/index
    2024-05-23T11:23:41Z Opened LevelDB successfully
    2024-05-23T11:23:41Z Using obfuscation key for /bdata/bitcoin-data/blocks/index: 0000000000000000
    2024-05-23T11:23:50Z LoadBlockIndexDB: last block file = 4298
    2024-05-23T11:23:50Z LoadBlockIndexDB: last block file info: CBlockFileInfo(blocks=12, size=18765314, heights=844726…844737, time=2024-05-23…2024-05-23)
    2024-05-23T11:23:50Z Checking all blk files are present…
    2024-05-23T11:23:51Z LoadBlockIndexDB(): Block files have previously been pruned
    2024-05-23T11:23:53Z Initializing chainstate Chainstate [ibd] @ height -1 (null)
    2024-05-23T11:23:53Z Opening LevelDB in /bdata/bitcoin-data/chainstate
    2024-05-23T11:23:53Z Opened LevelDB successfully
    2024-05-23T11:23:53Z Using obfuscation key for /bdata/bitcoin-data/chainstate: 27687fc922c5e117
    2024-05-23T11:23:59Z Loaded best chain: hashBestChain=000000000000000000027d7ef87e117148fb2f0fd86daa593be6a9ab60d90b55 height=844737 date=2024-05-23T11:14:21Z progress=0.999998
    2024-05-23T11:23:59Z [snapshot] allocating all cache to the IBD chainstate
    2024-05-23T11:23:59Z Opening LevelDB in /bdata/bitcoin-data/chainstate
    2024-05-23T11:23:59Z Opened LevelDB successfully
    2024-05-23T11:23:59Z Using obfuscation key for /bdata/bitcoin-data/chainstate: 27687fc922c5e117
    2024-05-23T11:23:59Z [Chainstate [ibd] @ height 844737 (000000000000000000027d7ef87e117148fb2f0fd86daa593be6a9ab60d90b55)] resized coinsdb cache to 8.0 MiB
    2024-05-23T11:23:59Z [Chainstate [ibd] @ height 844737 (000000000000000000027d7ef87e117148fb2f0fd86daa593be6a9ab60d90b55)] resized coinstip cache to 190.0 MiB
    2024-05-23T11:23:59Z init message: Verifying blocks…
    2024-05-23T11:23:59Z Verifying last 6 blocks at level 3
    2024-05-23T11:23:59Z Verification progress: 0%
    2024-05-23T11:24:08Z Verification progress: 16%
    2024-05-23T11:24:13Z Verification progress: 33%
    2024-05-23T11:24:16Z Verification progress: 50%
    2024-05-23T11:24:21Z Verification progress: 66%
    2024-05-23T11:24:26Z Verification progress: 83%
    2024-05-23T11:24:30Z Verification progress: 99%
    2024-05-23T11:24:30Z Verification: No coin database inconsistencies in last 6 blocks (17307 transactions)
    2024-05-23T11:24:30Z block index 49358ms
    2024-05-23T11:24:30Z init message: Pruning blockstore…
    2024-05-23T11:24:30Z Leaving InitialBlockDownload (latching to false)
    2024-05-23T11:24:30Z block tree size = 844738
    2024-05-23T11:24:30Z nBestHeight = 844737
    2024-05-23T11:24:30Z loadblk thread start
    2024-05-23T11:24:30Z loadblk thread exit
    2024-05-23T11:24:30Z torcontrol thread start
    2024-05-23T11:24:30Z Bound to 127.0.0.1:8334
    2024-05-23T11:24:30Z init message: Starting network threads…
    2024-05-23T11:24:30Z DNS seeding disabled
    2024-05-23T11:24:30Z init message: Done loading
    2024-05-23T11:24:30Z opencon thread start
    2024-05-23T11:24:30Z net thread start
    2024-05-23T11:24:30Z addcon thread start
    2024-05-23T11:24:30Z msghand thread start
    2024-05-23T11:24:30Z New outbound peer connected: version: 70016, blocks=844737, peer=1 (manual)
    2024-05-23T11:24:31Z New outbound peer connected: version: 70016, blocks=844737, peer=4 (manual)
    2024-05-23T11:24:31Z New outbound peer connected: version: 70016, blocks=844737, peer=5 (manual)
    2024-05-23T11:24:31Z New outbound peer connected: version: 70016, blocks=844737, peer=6 (manual)
    2024-05-23T11:25:09Z New outbound peer connected: version: 70016, blocks=844737, peer=10 (manual)
    2024-05-23T11:26:10Z New outbound peer connected: version: 70016, blocks=844737, peer=13 (manual)
    2024-05-23T11:26:13Z Saw new header hash=00000000000000000003508531e1ec11798f1972e307235a54ef91bf945e246c height=844738
    2024-05-23T11:26:58Z UpdateTip: new best=00000000000000000003508531e1ec11798f1972e307235a54ef91bf945e246c height=844738 version=0x322d6000 log2_work=94.940996 tx=1009663503 date='2024-05-23T11:22:20Z' progress=0.999999 cache=1.8MiB(11945txo)
    2024-05-23T11:26:58Z Saw new header hash=00000000000000000000c18685513156cfd695edd2378ca2ba819d785866a571 height=844739
    2024-05-23T11:27:08Z UpdateTip: new best=00000000000000000000c18685513156cfd695edd2378ca2ba819d785866a571 height=844739 version=0x29a3e000 log2_work=94.941009 tx=1009668305 date='2024-05-23T11:25:43Z' progress=1.000000 cache=2.6MiB(18123txo)
    2024-05-23T11:27:39Z New outbound peer connected: version: 70016, blocks=844737, peer=15 (manual)
    2024-05-23T11:27:55Z New outbound peer connected: version: 70016, blocks=844737, peer=14 (manual)
    2024-05-23T11:36:44Z Saw new header hash=00000000000000000000e9be8cfceef5f4d12313ab2657d7fcf4e617dc9bb839 height=844740
    2024-05-23T11:36:51Z UpdateTip: new best=00000000000000000000e9be8cfceef5f4d12313ab2657d7fcf4e617dc9bb839 height=844740 version=0x224c8000 log2_work=94.941023 tx=1009673632 date='2024-05-23T11:36:11Z' progress=1.000000 cache=3.9MiB(26582txo)
    2024-05-23T11:47:09Z Saw new header hash=0000000000000000000152fee6b2cb2779c2fe0ce34aaad57f9034c1613463a0 height=844741
    2024-05-23T11:47:15Z UpdateTip: new best=0000000000000000000152fee6b2cb2779c2fe0ce34aaad57f9034c1613463a0 height=844741 version=0x24000000 log2_work=94.941037 tx=1009679243 date='2024-05-23T11:46:34Z' progress=1.000000 cache=4.9MiB(34377txo)
    2024-05-23T11:49:55Z Saw new header hash=000000000000000000025c416dc7962405d500d87238bf392c95aa9610c3a71e height=844742
    2024-05-23T11:49:58Z UpdateTip: new best=000000000000000000025c416dc7962405d500d87238bf392c95aa9610c3a71e height=844742 version=0x2e000000 log2_work=94.941051 tx=1009686227 date='2024-05-23T11:49:28Z' progress=1.000000 cache=5.3MiB(37584txo)
    2024-05-23T11:59:40Z Saw new header hash=000000000000000000014072f1d5d67100bf6c097e971cd2af2d579b32a30f93 height=844743
    2024-05-23T11:59:46Z UpdateTip: new best=000000000000000000014072f1d5d67100bf6c097e971cd2af2d579b32a30f93 height=844743 version=0x2652e000 log2_work=94.941064 tx=1009691358 date='2024-05-23T11:59:05Z' progress=1.000000 cache=6.8MiB(45760txo)
    2024-05-23T12:04:32Z Saw new header hash=00000000000000000002b692a4141102da57b78d667d8a3f9a461fe85106a4c5 height=844744

    2024-06-04T06:23:12Z Saw new header hash=00000000000000000001de5e312d55f873e73d14f3cd8a8ee656a392dbc28236 height=846459
    2024-06-04T06:23:52Z UpdateTip: new best=00000000000000000001de5e312d55f873e73d14f3cd8a8ee656a392dbc28236 height=846459 version=0x2403a000 log2_work=94.964467 tx=1017950010 date='2024-06-04T06:22:45Z' progress=1.000000 cache=87.4MiB(578849txo)
    2024-06-04T06:24:43Z Saw new header hash=00000000000000000000225789427db9f0e8f310d8bc0a205f884a8ce68a2aaf height=846460
    2024-06-04T06:25:09Z UpdateTip: new best=00000000000000000000225789427db9f0e8f310d8bc0a205f884a8ce68a2aaf height=846460 version=0x25ed2000 log2_work=94.964481 tx=1017955017 date='2024-06-04T06:24:38Z' progress=1.000000 cache=88.0MiB(583022txo)
    2024-06-04T06:37:08Z Saw new header hash=0000000000000000000012f2d726f8a033a2bfb5eada30cd92e15e6e1d196ce7 height=846461
    2024-06-04T06:37:47Z UpdateTip: new best=0000000000000000000012f2d726f8a033a2bfb5eada30cd92e15e6e1d196ce7 height=846461 version=0x23c16000 log2_work=94.964494 tx=1017959189 date='2024-06-04T06:36:37Z' progress=1.000000 cache=89.1MiB(590143txo)
    2024-06-04T07:36:10Z Saw new header hash=000000000000000000031f97130e48c0a7797547416d16ccd3d7dd8a6cc6d0b0 height=846462
    2024-06-04T07:36:52Z UpdateTip: new best=000000000000000000031f97130e48c0a7797547416d16ccd3d7dd8a6cc6d0b0 height=846462 version=0x2001e000 log2_work=94.964508 tx=1017962502 date='2024-06-04T07:35:52Z' progress=1.000000 cache=90.6MiB(602543txo)
    2024-06-04T07:42:53Z Saw new header hash=000000000000000000002bde133693a19d84616a4cf1db767f8864b5288cce6b height=846463
    2024-06-04T07:44:01Z UpdateTip: new best=000000000000000000002bde133693a19d84616a4cf1db767f8864b5288cce6b height=846463 version=0x21aea000 log2_work=94.964521 tx=1017965939 date='2024-06-04T07:42:39Z' progress=1.000000 cache=11.0MiB(0txo)
    2024-06-04T08:01:04Z Saw new header hash=00000000000000000001d5e0369520ead2dc646b7b592b8bafff8dc02e368600 height=846464
    2024-06-04T08:01:40Z LevelDB read failure: Corruption: block checksum mismatch: /bdata/bitcoin-data/chainstate/5978736.ldb
    2024-06-04T08:01:40Z Fatal LevelDB error: Corruption: block checksum mismatch: /bdata/bitcoin-data/chainstate/5978736.ldb
    2024-06-04T08:01:40Z You can use -debug=leveldb to get more complete diagnostic messages
    2024-06-04T08:01:40Z Error: Error reading from database, shutting down.
    Error: Error reading from database, shutting down.
    2024-06-04T08:01:40Z Error reading from database: Fatal LevelDB error: Corruption: block checksum mismatch: /bdata/bitcoin-data/chainstate/5978736.ldb
    bitcoind.service: Main process exited, code=dumped, status=6/ABRT
    bitcoind.service: Failed with result 'core-dump'.
    bitcoind.service: Consumed 3h 46min 58.907s CPU time.
    bitcoind.service: Scheduled restart job, restart counter is at 1.
    Stopped bitcoind.service.
    bitcoind.service: Consumed 3h 46min 58.907s CPU time.

  20. maflcko commented at 5:17 pm on June 4, 2024: member

    Another thing you could try to debug this further is to put a swapfile, and the datadir on the same AWS gp3 SSD filesystem.

    I am happy to create an AWS account to test this, but it would be good if there was a single (bash) script, which can be deployed to AWS, so that it is easy for anyone to reproduce your exact setup.

  21. apulsifer commented at 6:05 pm on June 4, 2024: none

    IMO, the first thing to do would be for someone who’s familiar with the format of these block files to look at the corrupted files and see if they can figure out what code may have stomped on the blocks (it might be obvious, like a fragment of p2p networking data in the middle of a block – you never know until you look).
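
    One caveat for anyone inspecting these files: Bitcoin Core XORs each stored value with a repeating per-database obfuscation key (the startup log prints it as "Using obfuscation key for ..."), so intact values will not look like plain serialized coins until the XOR is undone. The sketch below shows that step under the assumption that the key repeats from the first byte of each value. The function name is hypothetical, and the key in the usage note is the one printed in the log above.

```python
from itertools import cycle

def xor_with_key(value: bytes, key: bytes) -> bytes:
    """XOR value with key repeated to the value's length; applying it twice restores the input."""
    if not key:
        return value
    return bytes(b ^ k for b, k in zip(value, cycle(key)))
```

    For the chainstate database in the log above the key would be bytes.fromhex("27687fc922c5e117"), while the all-zero key shown for blocks/index leaves values unchanged. Data that was stomped on by unrelated code would not carry the XOR, so it should stand out more clearly in the raw bytes than in the de-obfuscated ones.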

    The most likely scenario is that this is a latent software bug that will show up on any machine if it’s under memory pressure and heavy paging. In my experience, finding low-incidence, seemingly random problems like this requires instrumenting the code (or using automated tools) with frequent memory buffer guard checks, injected faults such as networking jitter, stalls, disconnects, and invalid data, and random waits before and after memory is allocated, freed, and used (including networking and I/O buffers) and locks are acquired and released. I myself am more familiar with troubleshooting these problems under Windoze than Linux tho.

  22. apulsifer commented at 9:39 am on July 1, 2024: none
    Update: After seeing no data corruption for over a week, 5 of the 10 servers experienced data corruption over the last two days. This makes me suspect the problem is not completely random, but depends on contents of the blocks. A replay of the mainnet blocks generated from 2024-06-28 to 2024-06-31 might make good test data (starting with the best blocks, and possibly also including the orphan blocks).
  23. apulsifer commented at 11:33 pm on July 15, 2024: none
    Another update: After again seeing no data corruption for a while (since the last update above), five of the ten servers again got data corruption over a two day period. This reinforces my beliefs that (a) this data corruption depends on the contents of the blocks received by bitcoind, and that whatever triggers it is currently relatively infrequent (about once every two weeks); and (b) the data corruption happens at some earlier time and is not detected by bitcoind until sometime later when it attempts to read the corrupted block from the disk.
  24. maflcko commented at 7:59 am on July 25, 2024: member

    Ok, I spun up two machines to see if I can reproduce. I left zram and put everything on one SSD. Also, my config has some debug logging enabled. Also, I am using a recent guix build, instead of a source compile of 25.x.

    Let me know when this happens again, so that I can check if it happened to me as well. I’ll then try to debug this further.

    sh-5.2$ nproc
    2
    sh-5.2$ uname --kernel-release --kernel-version
    6.1.97-104.177.amzn2023.aarch64 #1 SMP Tue Jul 16 15:18:22 UTC 2024
    sh-5.2$ free --human
                   total        used        free      shared  buff/cache   available
    Mem:           419Mi       354Mi        11Mi       0.0Ki        52Mi        54Mi
    Swap:          4.4Gi       435Mi       4.0Gi
    sh-5.2$ df --print-type --human-readable ./
    Filesystem     Type  Size  Used Avail Use% Mounted on
    /dev/nvme0n1p1 xfs    30G   19G   12G  63% /
    sh-5.2$ cat ./bitcoin.conf
    discover=0
    listen=1
    maxconnections=24

    par=1

    blocksonly=1
    dbcache=200
    maxsigcachesize=4
    prune=550

    maxmempool=5
    blockreconstructionextratxn=1
    maxorphantx=1
    mempoolexpiry=1
    persistmempool=0

    disablewallet=1

    server=1
    rpcallowip=127.0.0.1
    rpcuser=btc
    rpcpassword=btc
    rpcworkqueue=40
    rpcthreads=1

    printtoconsole=1
    nodebuglogfile=1

    [main]
    rpcport=8332
    rpcbind=127.0.0.1:8332
    bind=[::]:9333
    bind=127.0.0.1:8334=onion

    # extra args
    logthreadnames=1
    logsourcelocations=1
    debug=1
    debugexclude=libevent
    #debugexclude=leveldb
    
  25. maflcko commented at 8:02 am on July 25, 2024: member

    In the meantime it could make sense for you to consider upgrading to a more recent version of Bitcoin Core. According to https://bitcoincore.org/en/lifecycle/ and https://bitcoincore.org/en/security-advisories/ , 25.x will be EOL soon and “Medium and High severity bugs will be disclosed 2 weeks after the last affected release goes EOL. This is a year after a fixed version was first released. A pre-announcement will be made 2 weeks prior to disclosure.”

    (I don’t know if there will be any disclosures, but given the advisory, it seems better to attempt an upgrade, than not to)

  26. apulsifer commented at 7:30 pm on July 26, 2024: none
    ok, thx, I updated all the servers to bitcoin-27.1-aarch64-linux-gnu.tar.gz. The last data corruption occurred on July 13 and 14. I’ll let you know next time I see it.
  27. apulsifer commented at 9:11 pm on July 31, 2024: none
    FYI, I’m now seeing an outbreak of data corruption, 1 server yesterday, 2 more today, won’t be surprised to see more later today or tomorrow…
  28. maflcko commented at 6:26 am on August 1, 2024: member

    Both of mine are still up.

    What type of RPCs are you calling?

    I am not calling any RPCs and I am running a version after #30094 (which may be related).

  29. maflcko added the label Resource usage on Aug 1, 2024
  30. apulsifer commented at 8:18 pm on August 1, 2024: none
    Several times a day, we call “bitcoin-cli -rpcclienttimeout=0 getblockcount” to check if bitcoind is still operating. (This is done thru an SSH tunnel, which is the easiest way to check all servers.) That’s the only RPC call we’ve been making at this point.
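    For reference, a liveness check of the kind described above could be scripted roughly like this. It is a sketch, not our actual monitoring code: the function names are made up for illustration, and the only bitcoin-cli arguments used are the ones quoted above (-rpcclienttimeout=0 and getblockcount).

```python
import subprocess
from typing import Optional, Sequence


def parse_block_count(stdout: str) -> Optional[int]:
    """Turn the CLI output into a block height, or None if it is not a plain number."""
    text = stdout.strip()
    return int(text) if text.isdigit() else None


def get_block_count(
    cli: Sequence[str] = ("bitcoin-cli", "-rpcclienttimeout=0"),
) -> Optional[int]:
    """Run `<cli> getblockcount`; return the height, or None if the node did not answer."""
    try:
        result = subprocess.run(
            [*cli, "getblockcount"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # The binary is missing or could not be started.
        return None
    if result.returncode != 0:
        # bitcoin-cli exits non-zero when the RPC call fails.
        return None
    return parse_block_count(result.stdout)


if __name__ == "__main__":
    height = get_block_count()
    print("bitcoind not responding" if height is None else f"height={height}")
```

    The cli argument is there so the same check can be pointed at a different binary path or wrapped in whatever remote-access command is in use.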
  31. maflcko commented at 12:13 pm on August 2, 2024: member

    getblockcount

    yeah, that shouldn’t cause OOM, because all it does is serialize a single integer to JSON. I’ll downgrade my two versions to 27.1 and see what happens then. Otherwise, I’ll try to use different SSDs (c.f. #30159 (comment)).

    Can you try if using a single SSD works around the problem for you for now?

    If that doesn’t pin down the problem, I am not sure how to proceed, because without steps to reproduce, this will be close to impossible to debug or diagnose.

  32. apulsifer commented at 3:12 pm on August 2, 2024: none

    At some point last night, another server got data corruption, so up to 4 of 10 on this outbreak. The incidence rate is still only about half the machines every two weeks, so it’s going to take a while to duplicate this using only data from the live mainnet. (Note, also very unlikely getblockcount is causing OOM because it’s only called about 100 times between data corruption issues.)
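    A rough back-of-the-envelope version of that estimate, as a Python sketch. It assumes corruption events are independent and arrive at a constant rate (a Poisson model), which is an assumption and probably a poor one, since the failures seen so far come in bursts. The input of 5 events across 10 machines in 14 days is just “about half the machines every two weeks” turned into numbers, and the function names are hypothetical.

```python
import math


def rate_per_machine_day(events: float, machines: int, days: float) -> float:
    """Average number of corruption events per machine per day."""
    return events / (machines * days)


def expected_days_to_first_event(rate: float, machines: int) -> float:
    """Mean waiting time until any one of `machines` sees an event (Poisson assumption)."""
    return 1.0 / (rate * machines)


def prob_no_event(rate: float, machines: int, days: float) -> float:
    """Probability that none of `machines` sees an event within `days` (Poisson assumption)."""
    return math.exp(-rate * machines * days)


if __name__ == "__main__":
    rate = rate_per_machine_day(events=5, machines=10, days=14)
    for fleet in (2, 6, 10):
        wait = expected_days_to_first_event(rate, fleet)
        quiet = prob_no_event(rate, fleet, days=30)
        print(f"{fleet} machines: mean wait {wait:.1f} days, P(no event in 30 days) {quiet:.3f}")
```

    Under those assumptions, two test machines would wait about 14 days on average for a first event, so a few quiet weeks on a small fleet would not say much either way.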

    Another difference in our setup is that our servers do all of their communication over IPv6, while it looks like yours is only configured with IPv4. (There is some very weak evidence this might be related to the problem – I have two servers that are only configured with IPv4, and these two servers have never had a data corruption issue, but there are other differences in these servers: one is running testnet, not mainnet, and the other was only running bitcoind for about 2 weeks before bitcoind was shutdown because all of that machine’s resources were needed for something else.)

    Configuring an EC2 instance with IPv6 is a little cumbersome, but works something like this (from my notes):

    • edit VPC to assign an Amazon-provided IPv6 CIDR address block
    • edit VPC route table to create a route from ::/0 to the Internet Gateway (a local route will also automatically be created for the new route table IPv6 CIDR)
    • edit all subnets:
        • disable auto IPv4 assignment [note: IPv4 is still enabled and a private IPv4 address is assigned; if IPv4 were disabled, the instance’s NTP client might have to be reconfigured]
        • add IPv6 /64 address block (each subnet must have a unique [sequential] prefix)
    • edit security group to allow:
        • incoming SSH from admin IPv6 address (an outgoing rule will also automatically be created to allow all traffic to ::/0)

  33. maflcko commented at 9:07 pm on August 9, 2024: member

    Did you get a chance to see if you can reproduce the crash, if you put all data and swap on a single root SSD?

  34. maflcko commented at 10:29 am on August 10, 2024: member

    In the meantime I booted up 4 more machines with zram disabled, two of which use three SSDs, as in your setup.

    I’d be highly surprised if leveldb corruption has something to do with ipv6, so I won’t be testing that.

    I’ll let them run for a month or two, but if they don’t find anything, I am not sure what to do here.

    Without exact and full steps to reproduce every single step, including the exact AWS VM setup, as well as the internal VM setup, there is little that can be done here.

  35. maflcko commented at 7:46 am on August 12, 2024: member
    The very first machine I set up on AWS refused ssh access and had to be rebooted (fine so far after a reboot). Maybe all of this is just AWS hardware failures?
  36. cryptoquick commented at 9:54 am on August 21, 2024: none

    I’m getting this error locally also, upon IBD:

    bitcoind -server -txindex=1 -datadir=/mnt/Node/bitcoind-mainnet -rpccookiefile=/mnt/Node/bitcoind-mainnet/.cookie

    2024-08-21T05:15:10Z UpdateTip: new best=00000000000000000002e0f442b3299c1e3a3c56f6a4efdb063c43f0091bd949 height=822516 version=0x26e2a000 log2_work=94.618928 tx=940818791 date='2023-12-23T04:55:07Z' progress=0.872532 cache=667.5MiB(5165648txo)
    2024-08-21T05:15:10Z UpdateTip: new best=00000000000000000003d85e0ab75b6831241c5b915e228c67bfb6a0e2541f59 height=822517 version=0x26d3a000 log2_work=94.618942 tx=940822146 date='2023-12-23T04:56:37Z' progress=0.872535 cache=668.2MiB(5171427txo)
    2024-08-21T05:15:10Z UpdateTip: new best=000000000000000000030fbdc651da1dc956e3fca4ac11a6288f7913d8b30827 height=822518 version=0x22b0a000 log2_work=94.618956 tx=940825569 date='2023-12-23T04:58:00Z' progress=0.872539 cache=669.1MiB(5180745txo)
    2024-08-21T05:15:10Z Cache size (702717616) exceeds total space (702653184)
    2024-08-21T05:16:56Z UpdateTip: new best=00000000000000000000cf89647acdaf074b2f20c587f1457c383d54dca546aa height=822519 version=0x331c8000 log2_work=94.618970 tx=940829349 date='2023-12-23T05:00:23Z' progress=0.872542 cache=0.3MiB(0txo)
    2024-08-21T05:16:56Z Fatal LevelDB error: Corruption: block checksum mismatch: /mnt/Node/bitcoind-mainnet/indexes/txindex/387117.ldb
    2024-08-21T05:16:56Z You can use -debug=leveldb to get more complete diagnostic messages
    2024-08-21T05:16:56Z

    ************************
    EXCEPTION: 15dbwrapper_error
    Fatal LevelDB error: Corruption: block checksum mismatch: /mnt/Node/bitcoind-mainnet/indexes/txindex/387117.ldb
    bitcoin in scheduler



    ************************
    EXCEPTION: 15dbwrapper_error
    Fatal LevelDB error: Corruption: block checksum mismatch: /mnt/Node/bitcoind-mainnet/indexes/txindex/387117.ldb
    bitcoin in scheduler

    terminate called after throwing an instance of 'dbwrapper_error'
      what():  Fatal LevelDB error: Corruption: block checksum mismatch: /mnt/Node/bitcoind-mainnet/indexes/txindex/387117.ldb
    fish: Job 1, 'bitcoind -server -txindex=1 -da…' terminated by signal SIGABRT (Abort)
    

    bitcoind --version

    Bitcoin Core version v27.1.0
    Copyright (C) 2009-2024 The Bitcoin Core developers
    

    This is preventing me from syncing the node locally. The media seems fine, it’s just a 2.5" SSD, but I can try syncing to a different drive just in case the issue is with the media.

  37. maflcko commented at 10:24 am on August 21, 2024: member

    The media seems fine, it’s just a 2.5" SSD, but I can try syncing to a different drive just in case the issue is with the media.

    There are many known reasons for database corruptions, such as Apple or Windows filesystems that leveldb can’t handle, or known broken hardware (shipped broken by the manufacturer), or otherwise broken hardware (overheating, etc…). So my recommendation would be to create a new issue for each new instance. Otherwise, it is hard to keep track of a single instance in one thread. (This one is about AWS)

    Meta note: Maybe a meta issue with a list of all known and reproducible corruptions can be created? Maybe also a new issue template for database corruption can be created, so that issue creators have all the context that is relevant before opening an issue?

  38. maflcko commented at 7:04 am on August 22, 2024: member

    The very first machine I set up on AWS refused ssh access and had to be rebooted (fine so far after a reboot). Maybe all of this is just AWS hardware failures?

    Another machine (the second one I spun up?) dropped dead, without SSH access anymore.

    Screenshot from 2024-08-22 08-56-52

    Screenshot from 2024-08-22 09-03-09

  39. maflcko commented at 9:09 am on August 22, 2024: member
    A reboot fixed the network issue again and I still can not reproduce or otherwise see any corruption on any machine.
  40. maflcko commented at 10:30 am on October 8, 2024: member

    I still could not reproduce. @apulsifer Do you see any unusual metrics in the monitoring? Do the graphs look different when the corruption happens for you?

    Also, now that 28.0 is released, you may want to test and try it.

  41. maflcko commented at 10:01 am on October 15, 2024: member

    Closing for now due to inactivity and because I couldn’t reproduce this so far.

    If you have exact and full steps to reproduce, starting from a fresh install of the AWS machine, with all settings, etc, I can take another look.

    In the meantime, closing for now, but please do provide a comment with more details, if you find anything.

  42. maflcko closed this on Oct 15, 2024

  43. panicfarm commented at 4:34 pm on October 15, 2024: none

    I wanted to weigh in here, even though my LevelDB problem was different: it only seemed to happen on xfs. I tried to sync a 27.1 node on a KVM guest with Ubuntu 24.04.1/xfs. During the initial sync I got mysterious segfaults and crashes, apparently due to LevelDB corruption. I unmounted the volume, but xfs_repair did not find any filesystem errors. I downgraded the node to 25.0, but the segfaults persisted. I also checked the KVM’s RAM with memtest86+, no errors. In the kernel log there were some xfs warnings, however.

    After this, I reinstalled 24.04.1 on the same KVM with btrfs. The errors seem to have gone away. So I suspect it is the xfs that is the culprit.

    A representative backtrace from a segfault’s core dump:

    (gdb) bt

    #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
    #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
    #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
    #3  0x00007a48aae4526e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
    #4  0x00007a48aae288ff in __GI_abort () at ./stdlib/abort.c:79
    #5  0x00007a48aae2881b in __assert_fail_base (fmt=0x7a48aafd01e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x5c43cc9a8e34 "internal_key.size() >= 8", file=file@entry=0x5c43cc9a8e1c "./leveldb/db/dbformat.h",
        line=line@entry=96, function=function@entry=0x5c43cc9a85b8 "leveldb::Slice leveldb::ExtractUserKey(const leveldb::Slice&)") at ./assert/assert.c:94
    #6  0x00007a48aae3b507 in __assert_fail (assertion=assertion@entry=0x5c43cc9a8e34 "internal_key.size() >= 8", file=file@entry=0x5c43cc9a8e1c "./leveldb/db/dbformat.h", line=line@entry=96,
        function=function@entry=0x5c43cc9a85b8 "leveldb::Slice leveldb::ExtractUserKey(const leveldb::Slice&)") at ./assert/assert.c:103
    #7  0x00005c43cc5eded5 in leveldb::ExtractUserKey (internal_key=...) at ./leveldb/db/dbformat.h:96
    #8  0x00005c43cc5ee0c2 in leveldb::ExtractUserKey (internal_key=...) at leveldb/db/dbformat.cc:63
    #9  leveldb::InternalKeyComparator::Compare (this=<optimized out>, akey=..., bkey=...) at leveldb/db/dbformat.cc:53
    #10 0x00005c43cc5f0c73 in leveldb::Block::Iter::Compare (b=..., a=..., this=0x7a47dbf74b20) at leveldb/table/block.cc:92
    #11 leveldb::Block::Iter::Seek (this=0x7a47dbf74b20, target=...) at leveldb/table/block.cc:181
    #12 0x00005c43cc5e0ad3 in leveldb::Table::InternalGet (this=0x7a47e6bbacf0, options=..., k=..., arg=arg@entry=0x7a486fdf8390,
        handle_result=handle_result@entry=0x5c43cc5d3950 <leveldb::SaveValue(void*, leveldb::Slice const&, leveldb::Slice const&)>) at leveldb/table/table.cc:221
    #13 0x00005c43cc5cddd9 in leveldb::TableCache::Get (this=0x5c43cd7ab2c0, options=..., file_number=<optimized out>, file_size=<optimized out>, k=..., arg=0x7a486fdf8390,
        handle_result=0x5c43cc5d3950 <leveldb::SaveValue(void*, leveldb::Slice const&, leveldb::Slice const&)>) at leveldb/db/table_cache.cc:108
    #14 0x00005c43cc5d03c2 in State::Match (arg=arg@entry=0x7a486fdf8390, level=level@entry=4, f=f@entry=0x7a47d9e388b0) at leveldb/db/version_set.cc:355
    #15 0x00005c43cc5d67e0 in leveldb::Version::ForEachOverlapping (this=this@entry=0x7a48104f19c0, user_key=..., internal_key=..., arg=arg@entry=0x7a486fdf8390, func=0x5c43cc5d0360 <State::Match(void*, int, leveldb::FileMetaData*)>)
        at leveldb/db/version_set.cc:317
    #16 0x00005c43cc5d6991 in leveldb::Version::Get (this=this@entry=0x7a48104f19c0, options=..., k=..., value=value@entry=0x7a486fdf85f0, stats=stats@entry=0x7a486fdf8470) at leveldb/db/version_set.cc:398
    #17 0x00005c43cc5bc4f4 in leveldb::DBImpl::Get (this=0x5c43cd67c1d0, options=..., key=..., value=0x7a486fdf85f0) at leveldb/db/db_impl.cc:1143
    #18 0x00005c43cc1d4352 in CDBWrapper::ReadImpl[abi:cxx11](Span<std::byte const>) const (this=this@entry=0x5c43cd7ac930, key=...) at ./util/check.h:43
    #19 0x00005c43cc12e07c in CDBWrapper::Read<(anonymous namespace)::CoinEntry, Coin> (value=..., key=..., this=<optimized out>) at ./dbwrapper.h:226
    #20 CCoinsViewDB::GetCoin (this=<optimized out>, outpoint=..., coin=...) at txdb.cpp:69
    #21 0x00005c43cc3f3e08 in CCoinsViewBacked::GetCoin (coin=..., outpoint=..., this=<optimized out>) at coins.cpp:25
    #22 operator() (__closure=<synthetic pointer>) at coins.cpp:374
    #23 ExecuteBackedWrapper<CCoinsViewErrorCatcher::GetCoin(const COutPoint&, Coin&) const::<lambda()> > (err_callbacks=..., func=...) at coins.cpp:359
    #24 CCoinsViewErrorCatcher::GetCoin (this=0x5c43cd68f068, outpoint=..., coin=...) at coins.cpp:374
    #25 0x00005c43cc3f4f2d in CCoinsViewCache::FetchCoin (this=0x5c43cd692590, outpoint=...) at coins.cpp:48
    #26 0x00005c43cc3f51aa in CCoinsViewCache::GetCoin (this=<optimized out>, outpoint=..., coin=...) at coins.cpp:61
    #27 0x00005c43cc3f4f2d in CCoinsViewCache::FetchCoin (this=this@entry=0x7a486fdf9ce0, outpoint=...) at coins.cpp:48
    #28 0x00005c43cc3f583b in CCoinsViewCache::HaveCoin (outpoint=..., this=0x7a486fdf9ce0) at coins.cpp:162
    #29 CCoinsViewCache::HaveInputs (this=this@entry=0x7a486fdf9ce0, tx=...) at coins.cpp:306
    #30 0x00005c43cc1d12e6 in Consensus::CheckTxInputs (tx=..., state=..., inputs=..., nSpendHeight=818902, txfee=txfee@entry=@0x7a486fdf8de8: 0) at consensus/tx_verify.cpp:171
    #31 0x00005c43cc170c50 in Chainstate::ConnectBlock (this=this@entry=0x5c43cd676f80, block=..., state=..., pindex=<optimized out>, pindex@entry=0x7a47ecc137a8, view=..., fJustCheck=fJustCheck@entry=false) at validation.cpp:2404
    #32 0x00005c43cc173298 in Chainstate::ConnectTip (this=this@entry=0x5c43cd676f80, state=..., pindexNew=0x7a47ecc137a8, pblock=..., connectTrace=..., disconnectpool=...) at validation.cpp:2951
    #33 0x00005c43cc180394 in Chainstate::ActivateBestChainStep (this=this@entry=0x5c43cd676f80, state=..., pindexMostWork=pindexMostWork@entry=0x7a47ecc137a8, pblock=..., fInvalidFound=@0x7a486fdfa08e: false, connectTrace=...)
        at validation.cpp:3143
    #34 0x00005c43cc180b81 in Chainstate::ActivateBestChain (this=0x5c43cd676f80, state=..., pblock=...) at validation.cpp:3280
    #35 0x00005c43cc181c7b in ChainstateManager::ProcessNewBlock (this=0x5c43cd7a67f0, block=..., force_processing=force_processing@entry=true, min_pow_checked=min_pow_checked@entry=false, new_block=new_block@entry=0x7a486fdfa5cf)
        at validation.cpp:4302
    #36 0x00005c43cbf527f0 in (anonymous namespace)::PeerManagerImpl::ProcessBlock (this=this@entry=0x5c43cd7b8ab0, node=..., block=..., force_processing=force_processing@entry=true, min_pow_checked=min_pow_checked@entry=false)
        at net_processing.cpp:3282
    #37 0x00005c43cbf74641 in (anonymous namespace)::PeerManagerImpl::ProcessMessage (this=this@entry=0x5c43cd7b8ab0, pfrom=..., msg_type=..., vRecv=..., time_received=..., interruptMsgProc=...) at net_processing.cpp:4752
    #38 0x00005c43cbf79b91 in (anonymous namespace)::PeerManagerImpl::ProcessMessages (this=<optimized out>, pfrom=0x7a4804018f00, interruptMsgProc=...) at net_processing.cpp:5110
    #39 0x00005c43cbf24fd6 in CConnman::ThreadMessageHandler (this=<optimized out>) at net.cpp:2918
    #40 0x00005c43cc538b00 in std::function<void ()>::operator()() const (this=0x7a486fdffe10) at /usr/include/c++/bits/std_function.h:622
    #41 util::TraceThread(std::basic_string_view<char, std::char_traits<char> >, std::function<void ()>) (thread_name=..., thread_func=...) at util/thread.cpp:21
    #42 0x00005c43cbf1e205 in std::__invoke_impl<void, void (*)(std::basic_string_view<char>, std::function<void()>), char const*, CConnman::Start(CScheduler&, const CConnman::Options&)::<lambda()> > (__f=<optimized out>)
        at /usr/include/c++/bits/invoke.h:60
    #43 std::__invoke<void (*)(std::basic_string_view<char>, std::function<void()>), char const*, CConnman::Start(CScheduler&, const CConnman::Options&)::<lambda()> > (__fn=<optimized out>) at /usr/include/c++/bits/invoke.h:95
    #44 std::thread::_Invoker<std::tuple<void (*)(std::basic_string_view<char, std::char_traits<char> >, std::function<void()>), char const*, CConnman::Start(CScheduler&, const CConnman::Options&)::<lambda()> > >::_M_invoke<0, 1, 2> (
        this=<optimized out>) at /usr/include/c++/thread:264
    #45 std::thread::_Invoker<std::tuple<void (*)(std::basic_string_view<char, std::char_traits<char> >, std::function<void()>), char const*, CConnman::Start(CScheduler&, const CConnman::Options&)::<lambda()> > >::operator() (
        this=<optimized out>) at /usr/include/c++/thread:271
    #46 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(std::basic_string_view<char, std::char_traits<char> >, std::function<void()>), char const*, CConnman::Start(CScheduler&, const CConnman::Options&)::<lambda()> > > >::_M_run(void) (this=0x5c43cd7b4700) at /usr/include/c++/thread:215
    #47 0x00005c43cc8b40e0 in execute_native_thread_routine ()
    #48 0x00007a48aae9ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
    #49 0x00007a48aaf29c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
    

    On node restart, immediate core dump again:

    when I restart the bitcoin node I immediately get

    (gdb) bt
    #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
    #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
    #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
    #3  0x00007d1b3304526e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
    #4  0x00007d1b330288ff in __GI_abort () at ./stdlib/abort.c:79
    #5  0x00007d1b3302881b in __assert_fail_base (fmt=0x7d1b331d01e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x5e65d2cb9e34 "internal_key.size() >= 8", file=file@entry=0x5e65d2cb9e1c "./leveldb/db/dbformat.h",
        line=line@entry=96, function=function@entry=0x5e65d2cb95b8 "leveldb::Slice leveldb::ExtractUserKey(const leveldb::Slice&)") at ./assert/assert.c:94
    #6  0x00007d1b3303b507 in __assert_fail (assertion=assertion@entry=0x5e65d2cb9e34 "internal_key.size() >= 8", file=file@entry=0x5e65d2cb9e1c "./leveldb/db/dbformat.h", line=line@entry=96,
        function=function@entry=0x5e65d2cb95b8 "leveldb::Slice leveldb::ExtractUserKey(const leveldb::Slice&)") at ./assert/assert.c:103
    #7  0x00005e65d28feed5 in leveldb::ExtractUserKey (internal_key=...) at ./leveldb/db/dbformat.h:96
    #8  0x00005e65d28ff0c2 in leveldb::ExtractUserKey (internal_key=...) at leveldb/db/dbformat.cc:63
    #9  leveldb::InternalKeyComparator::Compare (this=<optimized out>, akey=..., bkey=...) at leveldb/db/dbformat.cc:53
    #10 0x00005e65d2901c73 in leveldb::Block::Iter::Compare (b=..., a=..., this=0x5e65ede1ebf0) at leveldb/table/block.cc:92
    #11 leveldb::Block::Iter::Seek (this=0x5e65ede1ebf0, target=...) at leveldb/table/block.cc:181
    #12 0x00005e65d28f1ad3 in leveldb::Table::InternalGet (this=0x5e65ede1f3f0, options=..., k=..., arg=arg@entry=0x7ffe21251980,
        handle_result=handle_result@entry=0x5e65d28e4950 <leveldb::SaveValue(void*, leveldb::Slice const&, leveldb::Slice const&)>) at leveldb/table/table.cc:221
    #13 0x00005e65d28dedd9 in leveldb::TableCache::Get (this=0x5e65df4b8460, options=..., file_number=<optimized out>, file_size=<optimized out>, k=..., arg=0x7ffe21251980,
        handle_result=0x5e65d28e4950 <leveldb::SaveValue(void*, leveldb::Slice const&, leveldb::Slice const&)>) at leveldb/db/table_cache.cc:108
    #14 0x00005e65d28e13c2 in State::Match (arg=arg@entry=0x7ffe21251980, level=level@entry=4, f=f@entry=0x5e65ee8ff380) at leveldb/db/version_set.cc:355
    #15 0x00005e65d28e77e0 in leveldb::Version::ForEachOverlapping (this=this@entry=0x7d1a9c011290, user_key=..., internal_key=..., arg=arg@entry=0x7ffe21251980, func=0x5e65d28e1360 <State::Match(void*, int, leveldb::FileMetaData*)>)
        at leveldb/db/version_set.cc:317
    #16 0x00005e65d28e7991 in leveldb::Version::Get (this=this@entry=0x7d1a9c011290, options=..., k=..., value=value@entry=0x7ffe21251be0, stats=stats@entry=0x7ffe21251a60) at leveldb/db/version_set.cc:398
    #17 0x00005e65d28cd4f4 in leveldb::DBImpl::Get (this=0x5e65d4836f00, options=..., key=..., value=0x7ffe21251be0) at leveldb/db/db_impl.cc:1143
    #18 0x00005e65d24e5352 in CDBWrapper::ReadImpl[abi:cxx11](Span<std::byte const>) const (this=this@entry=0x5e65ee6cd2e0, key=...) at ./util/check.h:43
    #19 0x00005e65d243f07c in CDBWrapper::Read<(anonymous namespace)::CoinEntry, Coin> (value=..., key=..., this=<optimized out>) at ./dbwrapper.h:226
    #20 CCoinsViewDB::GetCoin (this=<optimized out>, outpoint=..., coin=...) at txdb.cpp:69
    #21 0x00005e65d2705f2d in CCoinsViewCache::FetchCoin (this=0x7ffe212527e0, outpoint=...) at coins.cpp:48
    #22 0x00005e65d270675d in CCoinsViewCache::HaveCoin (this=<optimized out>, outpoint=...) at coins.cpp:162
    #23 0x00005e65d2467292 in ApplyTxInUndo (undo=..., view=..., out=...) at validation.cpp:2012
    #24 0x00005e65d247b9a9 in Chainstate::DisconnectBlock (this=this@entry=0x5e65d47c9cf0, block=..., pindex=pindex@entry=0x5e65d8212188, view=...) at validation.cpp:2095
    #25 0x00005e65d24858e5 in CVerifyDB::VerifyDB (this=this@entry=0x7ffe212529a8, chainstate=..., consensus_params=..., coinsview=..., nCheckLevel=nCheckLevel@entry=3, nCheckDepth=nCheckDepth@entry=6) at validation.cpp:4486
    #26 0x00005e65d22b46f3 in node::VerifyLoadedChainstate (chainman=..., options=...) at ./kernel/chainparams.h:93
    #27 0x00005e65d220b888 in operator() (__closure=<optimized out>) at init.cpp:1550
    #28 operator()<AppInitMain(node::NodeContext&, interfaces::BlockAndHeaderTipInfo*)::<lambda()> > (f=..., f=..., __closure=<synthetic pointer>) at init.cpp:1537
    #29 AppInitMain (node=..., tip_info=tip_info@entry=0x0) at init.cpp:1550
    #30 0x00005e65d21da84b in AppInit (node=...) at bitcoind.cpp:228
    #31 main (argc=<optimized out>, argv=<optimized out>) at bitcoind.cpp:274
    

    Suspicious xfs kernel messages:

    [180335.292017] workqueue: vmstat_shepherd hogged CPU for >10000us 128 times, consider switching to WQ_UNBOUND
    [180536.534137] workqueue: xfs_end_io [xfs] hogged CPU for >10000us 256 times, consider switching to WQ_UNBOUND
    [183309.452362] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 1653126605 wd_nsec: 1653132723
    [184950.366650] workqueue: xlog_ioend_work [xfs] hogged CPU for >10000us 128 times, consider switching to WQ_UNBOUND
    [186217.630387] workqueue: submit_flushes hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
    [189464.104161] workqueue: vmstat_update hogged CPU for >10000us 256 times, consider switching to WQ_UNBOUND
    [190262.440205] show_signal: 87 callbacks suppressed
    [190262.440388] traps: b-msghand[39801] general protection fault ip:73e1e28ab7ec sp:73e0769f9630 error:0 in libc.so.6[73e1e2828000+188000]
    [190799.851632] workqueue: vmstat_shepherd hogged CPU for >10000us 256 times, consider switching to WQ_UNBOUND
    [192583.510479] workqueue: md_submit_flush_data hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
    [192583.932649] workqueue: xfs_buf_ioend_work [xfs] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
    [193644.783802] workqueue: xfs_reclaim_worker [xfs] hogged CPU for >10000us 256 times, consider switching to WQ_UNBOUND
    [193684.534933] workqueue: xfs_inodegc_worker [xfs] hogged CPU for >10000us 128 times, consider switching to WQ_UNBOUND
    [193788.801343] workqueue: psi_avgs_work hogged CPU for >10000us 128 times, consider switching to WQ_UNBOUND
    [196872.983987] workqueue: submit_flushes hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
    [207400.742368] workqueue: blk_mq_requeue_work hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
    [210783.023757] b-msghand[40733]: segfault at 117 ip 0000742fe42a9ae7 sp 0000742e89ff8950 error 4 in libc.so.6[742fe4228000+188000] likely on CPU 2 (core 0, socket 2)
    [210783.023961] Code: 0f 1f 00 55 48 8b 4f 08 48 89 c8 48 83 e0 f8 48 89 e5 48 3b 04 07 0f 85 a1 00 00 00 f3 0f 6f 47 10 48 8b 57 18 66 48 0f 7e c0 <48> 3b 78 18 75 73 48 3b 7a 10 75 6d 48 8b 77 10 48 89 50 18 66 0f
    [210887.447527] workqueue: xfs_end_io [xfs] hogged CPU for >10000us 512 times, consider switching to WQ_UNBOUND
    [213131.836465] workqueue: blk_mq_run_work_fn hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
    [215632.110219] workqueue: xfs_log_worker [xfs] hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
    [221156.397986] workqueue: psi_avgs_work hogged CPU for >10000us 256 times, consider switching to WQ_UNBOUND
    [221685.589604] workqueue: delayed_vfree_work hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
    [222251.945507] workqueue: xfs_inodegc_worker [xfs] hogged CPU for >10000us 256 times, consider switching to WQ_UNBOUND
    [223706.567200] workqueue: css_killed_work_fn hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
    [223844.041922] workqueue: md_submit_flush_data hogged CPU for >10000us 64 times, consider switching to WQ_UNBOUND
    [225499.219594] workqueue: xlog_ioend_work [xfs] hogged CPU for >10000us 256 times, consider switching to WQ_UNBOUND
    [228320.031895] workqueue: submit_flushes hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
    [228637.675466] workqueue: vmstat_shepherd hogged CPU for >10000us 512 times, consider switching to WQ_UNBOUND
    [236922.903089] workqueue: blk_mq_timeout_work hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
    

    I do not know if it’s important, but I was using .raw files on two xfs NVMe partitions on the host, which I combined into a single raid0 striping volume inside the guest. The host did not have any xfs errors, and the KVM logs also did not have anything unusual. The node used to run directly on the host before the guest was created, with no errors.

  44. apulsifer commented at 0:28 am on October 16, 2024: none

    I haven’t seen any data corruption since Sept 12, so at this point I can’t really give you any information to duplicate the problem.

    I did note when checking the system logs that I have only ever seen corruption on systems where the bitcoind process is listening on an IPv6 port. I believe all of those machines are also getting incoming peer connections via IPv6, but at minimum, they are either getting incoming connections via IPv6 or making outgoing connections via IPv6, since that is the only interface on these systems that is routed to the outside world. I also have two machines that only have their IPv4 interface routed, and those machines have not experienced any data corruption. There are, however, some other differences in those machines: one is an ARM64 machine running testnet, while the other is an AMD64 machine running mainnet; the IPv6 systems are all ARM64 machines running mainnet.

    I also believe there is zero chance that the problem I was seeing is related to xfs, for two reasons. First, most of the multiple millions of Amazon EC2 instances around the world are running xfs, and if there were a problem with xfs running on Amazon EC2, we would know about it. Second, all of my machines are using xfs and are running bitcoincash core as well, and neither bitcoincashd nor any other process running on the machines has experienced data corruption, only bitcoind.

    If I see any data corruption in the future, I’ll let you know….

    Thanks.

  45. wonder75 commented at 7:12 pm on October 22, 2024: none

    I haven’t seen any data corruption since Sept 12, so at this point I can’t really give you any information to duplicate the problem.

    Hello @apulsifer, could you please explain how to get the node back to a working state after the corruption occurs? I suffer from frequent LevelDB corruptions too and have found no way to fix the problem other than redoing the Initial Block Download, which means a downtime of 2-3 days for me. After that it might work for a week or a month until the next corruption. I would really like to know a way to manually fix this issue without downloading everything over and over again.

    Neither -reindex nor -reindex-chainstate as suggested by others fixed the corruption for me.

    This is a node running on an Intel NUC device. I checked the hardware on the node for several days with stress tests for the RAM and the SSD. They all succeeded without errors.

    Thank you very much!

  46. apulsifer commented at 7:23 pm on October 22, 2024: none

    I’m running on Amazon EC2 (Elastic Compute), and have multiple servers running bitcoind. The servers all have the bitcoin data directory stored on a dedicated EBS (Elastic Block Store) volume. If one fails, I have been snapshot’ing the volume on a working server and copying it over to the failed server. Time to repair is about 30 minutes. (Reindex also does not work for me, but I’m running pruned mode, and reindex is documented to not work in prune mode.)

    The only thing I could suggest for you is to make regular snapshots, so you could go back to a working snapshot instead of completely resync’ing. (Please read and +1 my related follow-up post.)
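
    For a machine without EBS snapshots, the same idea can be approximated with plain directory copies. The following is only a minimal sketch under stated assumptions: bitcoind is stopped while the copy runs (copying a live LevelDB directory may produce an inconsistent copy), there is enough free disk space for the copies, and the function name `snapshot_datadir`, the `snapshot-` prefix, and the `keep` parameter are hypothetical names chosen for illustration.

```python
import shutil
import time
from pathlib import Path


def snapshot_datadir(datadir, backup_root, keep=3):
    """Copy datadir into a timestamped folder under backup_root.

    Assumes bitcoind is NOT running during the copy. Keeps only the
    newest `keep` snapshots and deletes older ones.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    datadir, backup_root = Path(datadir), Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)

    # Timestamped names sort chronologically when sorted as strings.
    target = backup_root / time.strftime("snapshot-%Y%m%d-%H%M%S")
    shutil.copytree(datadir, target)

    # Remove everything except the newest `keep` snapshots.
    snapshots = sorted(
        p for p in backup_root.iterdir() if p.name.startswith("snapshot-")
    )
    for old in snapshots[:-keep]:
        shutil.rmtree(old)
    return target
```

    Restoring would then mean stopping bitcoind, replacing the data directory with the newest snapshot that is known to be good, and letting the node catch up from there.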

  47. apulsifer commented at 7:28 pm on October 22, 2024: none
    Suggestion for bitcoind code maintainers: As I’ve mentioned, I believe this data corruption happens but is not detected until a few days later, when bitcoind attempts to read an older block (for whatever reason, idk). It would be very handy if there were some utility, bitcoind RPC command, or command line option that read and checked every file on the disk, or all the files containing data from the prior X days or X blocks. That would help users ensure that they have a known good copy/backup of the blockchain, and also might help narrow down when the data corruption occurs.
  48. sipa commented at 7:31 pm on October 22, 2024: member
    gettxoutsetinfo should read through all the chainstate LevelDB files.
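
    As a rough illustration of using gettxoutsetinfo as a read check before taking a backup, here is a minimal sketch. It assumes `bitcoin-cli` is on the PATH and can reach a running node; the function name `check_chainstate` and the `extra_args` and `run` parameters are hypothetical, introduced only for this example. It covers the chainstate database only, not the block files, and a corrupted database may cause bitcoind to shut down rather than return a clean RPC error, so any non-zero exit is simply treated as a failure.

```python
import json
import subprocess


def check_chainstate(cli="bitcoin-cli", extra_args=(), run=subprocess.run):
    """Call gettxoutsetinfo, which has to read through the chainstate.

    Returns (True, parsed_result) on success, or (False, error_text)
    if bitcoin-cli exits with a non-zero status.
    """
    cmd = [cli, *extra_args, "gettxoutsetinfo"]
    proc = run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, json.loads(proc.stdout)


# Example use (needs a running node, so it is left as a comment):
#   ok, result = check_chainstate()
#   print("chainstate readable" if ok else f"read check failed: {result}")
```

    The call can take a long time on a slow machine, so it is probably best run from a scheduled job just before a snapshot is taken rather than interactively.
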
  49. wonder75 commented at 4:43 pm on October 23, 2024: none

    I have now upgraded to bitcoind 28.0 and deleted the blockchain and all indexes. I started a fresh initial block download and will report here if it crashes again. I enabled LevelDB debugging to generate better logs.

    I have a feeling that it could have something to do with starting and stopping the node regularly. My bitcoind only runs for 30 minutes a day to catch up on the blockchain. I start and stop it with cron and systemd.

  50. apulsifer commented at 5:06 pm on October 23, 2024: none
    When I was initially setting up these systems, I was starting and stopping bitcoind. But since the setup was completed, bitcoind has been running continuously without stopping, and I still saw data corruption.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-21 09:12 UTC
