Blockchain sync failure

bol-van commented at 12:23 pm on August 31, 2015: none

I’m on Bitcoin Core v0.11.0, Windows x64. The OS is Windows Server 2012 R2.

I’ve been using Bitcoin Core for years without significant problems, but last month the database got corrupted. I deleted everything except wallet.dat and resynced the database. I tried about 5 times and put the datadir on different hard drives. Each time, sync stops with an error at a random position. After relaunching the process, the same error is displayed and the program crashes with an assertion.

[Screenshot: bitcoin_read_database]

bol-van commented at 1:57 pm on August 31, 2015: none

Same thing happens to bitcoind.

C:\Program Files\Bitcoin\daemon>bitcoind.exe -datadir=H:\bitcoin
Error: Error reading from database, shutting down.

This application has requested the Runtime to terminate it in an unusual way. Please contact the application’s support team for more information.

debug.log:

2015-08-31 12:39:34 LevelDB read failure: Corruption: block checksum mismatch
2015-08-31 12:39:34 Corruption: block checksum mismatch
2015-08-31 12:39:34 Error: Error reading from database, shutting down.
2015-08-31 12:39:34 Error reading from database: Database corrupted
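For anyone triaging a similar report, a minimal Python sketch for pulling corruption-related entries out of a debug.log might look like the following. The function names and the list of marker strings are assumptions made for illustration; the markers are taken only from the excerpt above, so other wordings would be missed.

```python
import re

# Each debug.log entry in the excerpt starts with a "YYYY-MM-DD HH:MM:SS " timestamp.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ")

# Assumption: substrings copied from the log excerpt in this thread only.
CORRUPTION_MARKERS = ("Corruption:", "Database corrupted", "LevelDB read failure")


def split_entries(text):
    """Split raw log text into entries, even when several share one line."""
    starts = [match.start() for match in TIMESTAMP.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[start:end].strip() for start, end in zip(starts, ends)]


def corruption_entries(text):
    """Return (timestamp, message) pairs for entries that mention corruption."""
    hits = []
    for entry in split_entries(text):
        timestamp, message = entry[:19], entry[20:]
        if any(marker in message for marker in CORRUPTION_MARKERS):
            hits.append((timestamp, message))
    return hits
```

Comparing the timestamps of the first hit across several sync attempts is one way to check whether the failure really occurs at a random position.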

bol-van closed this on Aug 31, 2015

bol-van reopened this on Aug 31, 2015

laanwj commented at 2:23 pm on August 31, 2015: member

“Error reading from database: Database corrupted”: LevelDB corruption is usually caused by disk or memory corruption (while writing to disk). You could try using -par=1 to restrict syncing to one thread and then -reindex. Sometimes this helps when, for example, the CPU is overheating.
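As a rough illustration of that suggestion, the sketch below assembles the command line in Python. The helper name `build_reindex_command` is hypothetical; the executable name and datadir in the usage note are the ones from the report above. Only the `-datadir`, `-par`, and `-reindex` options are used.

```python
def build_reindex_command(executable, datadir, script_threads=1):
    """Return an argv list for a reindex run with limited verification threads."""
    return [
        executable,
        f"-datadir={datadir}",
        f"-par={script_threads}",  # number of script verification threads
        "-reindex",                # rebuild the index from the block files on disk
    ]
```

With the paths from the report, `build_reindex_command("bitcoind.exe", r"H:\bitcoin")` yields a list that could be passed to `subprocess.run`. Note that the rebuild takes a long time.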

bol-van commented at 3:51 pm on August 31, 2015: none

This is unlikely to be a RAM or disk problem. The OS runs stable for weeks and memtest reports nothing. There are no bad block events in the event log. One of the disks I tried to put the db on is only several days old. RAM problems are mostly random; here I get a 100% failure rate every time. Are there any ways to further diagnose the source of the problem?
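One simple, general diagnostic (not something proposed in this thread) is a write-then-verify test against the target drive: write random data, read it back, and compare checksums. The sketch below uses only the Python standard library; the function name and default sizes are assumptions. A pass is weak evidence, since the read-back may be served from the OS cache rather than the disk, but a mismatch would point clearly at the I/O path.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, size_mb=16, chunk_mb=1):
    """Write random data under `directory`, read it back, compare SHA-256."""
    chunk_size = chunk_mb * 1024 * 1024
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory, suffix=".iotest")
    try:
        with os.fdopen(fd, "wb") as handle:
            for _ in range(size_mb // chunk_mb):
                chunk = os.urandom(chunk_size)
                written.update(chunk)
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the data to the device
        read_back = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                read_back.update(chunk)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        os.remove(path)  # always clean up the temporary test file
```

Running it repeatedly with a larger `size_mb` while the suspected workload is active gives a better chance of catching an intermittent fault.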

bol-van commented at 8:05 am on September 1, 2015: none

I reproduced exactly the same behavior in a VM with Windows Server 2003 x64. Could someone please try to resync the whole db? Am I alone with this?

laanwj added the label Windows on Sep 1, 2015

laanwj commented at 9:47 am on September 1, 2015: member

I’d be interested to know if the same happens in that VM with Bitcoin 0.10.2.

wtogami commented at 9:50 am on September 1, 2015: contributor

You are able to reproduce the failure on other hardware? What about bitcoind or bitcoin-qt for Linux in a VM?

bol-van commented at 9:54 am on September 1, 2015: none

Additional notice.

Both 0.10.2 and 0.11.0 cannot start db sync when an empty datadir is on “\\vmware-host\shared folders”, but they do start successfully when the datadir is on a Windows network drive.

2015-09-01 09:51:55 init message: Loading block index…
2015-09-01 09:51:55 Opening LevelDB in Z:\home-h\Bit2test\blocks\index
2015-09-01 09:51:55 Corruption: no meta-nextfile entry in descriptor
2015-09-01 09:52:23 init message: Loading block index…
2015-09-01 09:52:23 Wiping LevelDB in Z:\home-h\Bit2test\blocks\index
2015-09-01 09:52:23 Opening LevelDB in Z:\home-h\Bit2test\blocks\index
2015-09-01 09:52:23 Corruption: no meta-nextfile entry in descriptor
2015-09-01 09:52:25 Shutdown: In progress…
2015-09-01 09:52:25 StopNode()
2015-09-01 09:52:25 Shutdown: done

bol-van commented at 11:01 am on September 1, 2015: none

I have one guess: the trouble may be in memory-mapped files. I know Bitcoin Core uses them; this can be seen in the RamMap utility. I also run the BURST coin pocminer, which uses mapped files extensively. Because of that, the kernel paged pool grows very large, up to more than half of physical memory (that is, gigabytes). The huge pool tag is “MmSt”, which contains PTEs. A detailed description of the subject is here: http://blogs.technet.com/b/askperf/archive/2011/09/23/getting-to-know-the-mmst-pool-tag.aspx

I’m on a 24 GB system and set PoolUsageMaximum to 10 (that is 10 percent of RAM, 2.4G in my case). This measure effectively limits MmSt growth, and it worked great until... what changed in the last weeks? I replaced a failing hard drive that contained 3 TB of BURST miner plots. This time I formatted the NTFS volume with a 64K cluster size (it was 4K). The bitcoin db corruptions probably started from that point.

Now I have killed pocminer and am trying to sync bitcoin both on the host and in the VM. Without pocminer, bitcoin could start syncing on the vmware-host shared folder. I will report once my guess is confirmed or not.

PS. Bitcoin 0.11.0 Linux x86 runs on a different hardware node without a VM. It has already synced up to blocks about 1 year old, still with no problems.
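The 2.4G figure in the comment above can be reproduced with a few lines of Python. This sketch follows the comment's own reading that PoolUsageMaximum is a percentage of physical RAM; that interpretation is an assumption carried over from the comment, not a statement about how Windows defines the setting, and the function name is hypothetical.

```python
def pool_trim_threshold_gb(physical_ram_gb, pool_usage_maximum_percent):
    """Threshold in GB, assuming the setting is a percentage of physical RAM."""
    return physical_ram_gb * pool_usage_maximum_percent / 100


# Values from the comment: a 24 GB system with PoolUsageMaximum set to 10.
threshold = pool_trim_threshold_gb(24, 10)
print(f"{threshold:.1f} GB")
```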

bol-van commented at 4:30 am on September 2, 2015: none

Yes, the trouble was triggered by the BURST miner. Without it, sync was successful. Running with an almost exhausted paged pool causes errors not only in Bitcoin Core but also has other negative effects, and having a large-cluster volume seems to make them worse.

bol-van closed this on Sep 2, 2015

laanwj commented at 10:28 am on September 2, 2015: member

Thanks for looking into this so deeply. This issue could be useful for other people who experience issues on Windows.

I still wonder how the combination of hw and sw caused corruption, but it’s likely the problem lies outside bitcoin core if it affects other software negatively as well.

bol-van commented at 1:12 pm on September 2, 2015: none

One of the negative effects was the following. Attempts to start db sync from Bitcoin Core running in a VMware guest to a VMware host drive were failing right at the start. Then I tried to mount a network drive from the VM guest to the VM host using the virtual network (a regular ’net use \\192.168.1.5’) and sync the db to that drive. The start was successful, but after some time I saw messages in the tray stating that Windows could not flush data to the network drive and that data could be lost. Obviously, Bitcoin Core cannot display such messages; explorer.exe displays them. The event source is the guest OS kernel not being able to get read/write success confirmation from the server side. Thus the lanmanserver (the ‘Server’ service) component on the host was experiencing problems, probably I/O related, while the paged pool was being exhausted. It is all very strange, because in the task manager on the host I see that the pool is trimmed from 2.8G to 800M after its exhaustion and then grows to 2.8G again. I suppose some paged pool allocations or some map-view-of-file operations fail before the MmSt trim actually happens. Kernel components are written carefully so as not to crash under any condition, but a denial of service still occurs. Is this a Windows architecture problem? I know the burst miner is badly designed. It should not map terabytes of data files into memory. But the OS should not behave badly in this condition either. If the MmSt pool is like a cache, it must be trimmed transparently without allocation failures.

From the Bitcoin Core perspective, maybe some checks are missing, or the db engine lacks enough atomicity to roll back failing changes? At the moment I can state: the BURST miner can kill the bitcoin db under some conditions, possibly when the burst plots are on a large-cluster volume. This is not HW related at all. It is mainly a problem of the OS not being resistant enough to some conditions.

laanwj added the label Data corruption on Feb 9, 2016

DrahtBot locked this on Dec 16, 2021

Blockchain sync failure #6606