verifychain 4 111000 aborts with assertion "hashPrevBlock == view.GetBestBlock()" failed

maflcko commented at 9:47 am on July 7, 2022: member

Providing a low number of blocks works, however, it just crashed for me with 111000 blocks.

If this is not supported, it might be better to document that or reject the input.

Commit 691a08718beff31d1b821b192609ea3bfdb24d41

Output:

0bitcoin-qt: validation.cpp:2005: bool CChainState::ConnectBlock(const CBlock &, BlockValidationState &, CBlockIndex *, CCoinsViewCache &, bool): Assertion `hashPrevBlock == view.GetBestBlock()' failed.

maflcko added the label Bug on Jul 7, 2022

maflcko added the label Validation on Jul 7, 2022

maflcko commented at 9:53 am on July 7, 2022: member

The debug log:

 02022-07-07T07:57:50Z [qt-rpcconsole] [validation.cpp:3925] [VerifyDB] Verifying last 111000 blocks at level 4
 12022-07-07T07:57:50Z [qt-rpcconsole] [validation.cpp:3932] [VerifyDB] [0%]...Selected []:8333 from tried
 22022-07-07T07:57:50Z [opencon] [net.cpp:460] [ConnectNode] [net:debug] trying connection []:8333 lastseen=713.5hrs
 32022-07-07T07:57:50Z [opencon] [net.cpp:2959] [CNode] [net] Added connection peer=4
 42022-07-07T07:57:56Z [torcontrol] [torcontrol.cpp:120] [eventcb] [tor] Error connecting to Tor control socket
 52022-07-07T07:57:56Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
 62022-07-07T07:57:57Z [dnsseed] [net.cpp:1670] [ThreadDNSAddressSeed] P2P peers available. Skipped DNS seeding.
 72022-07-07T07:57:57Z [dnsseed] [util/thread.cpp:19] [TraceThread] dnsseed thread exit
 82022-07-07T07:58:07Z [torcontrol] [torcontrol.cpp:120] [eventcb] [tor] Error connecting to Tor control socket
 92022-07-07T07:58:07Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
102022-07-07T07:58:24Z [torcontrol] [torcontrol.cpp:120] [eventcb] [tor] Error connecting to Tor control socket
112022-07-07T07:58:24Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
122022-07-07T07:58:50Z [torcontrol] [torcontrol.cpp:120] [eventcb] [tor] Error connecting to Tor control socket
132022-07-07T07:58:50Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
142022-07-07T07:59:29Z [torcontrol] [torcontrol.cpp:120] [eventcb] [tor] Error connecting to Tor control socket
152022-07-07T07:59:29Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
162022-07-07T07:59:50Z [net] [net.cpp:1547] [SocketHandlerConnected] [net] socket closed for peer=3
172022-07-07T07:59:50Z [net] [net.cpp:568] [CloseSocketDisconnect] [net] disconnecting peer=3
182022-07-07T08:00:26Z [torcontrol] [torcontrol.cpp:120] [eventcb] [tor] Error connecting to Tor control socket
192022-07-07T08:00:26Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
202022-07-07T08:01:53Z [torcontrol] [torcontrol.cpp:120] [eventcb] [tor] Error connecting to Tor control socket
212022-07-07T08:01:53Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
222022-07-07T08:04:03Z [torcontrol] [torcontrol.cpp:120] [eventcb] [tor] Error connecting to Tor control socket
232022-07-07T08:04:03Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
242022-07-07T08:07:17Z [torcontrol] [torcontrol.cpp:120] [eventcb] [tor] Error connecting to Tor control socket
252022-07-07T08:07:17Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
262022-07-07T08:12:09Z [torcontrol] [torcontrol.cpp:120] [eventcb] [tor] Error connecting to Tor control socket
272022-07-07T08:12:09Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
282022-07-07T08:17:37Z [net] [net.cpp:1390] [InactivityCheck] [net] socket sending timeout: 1201s peer=1
292022-07-07T08:17:37Z [net] [net.cpp:568] [CloseSocketDisconnect] [net] disconnecting peer=1
302022-07-07T08:19:27Z [torcontrol] [torcontrol.cpp:120] [eventcb] [tor] Error connecting to Tor control socket
312022-07-07T08:19:27Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
322022-07-07T08:21:48Z [qt-rpcconsole] [validation.cpp:3940] [VerifyDB] [10%]...Error connecting to Tor control socket
332022-07-07T08:30:24Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
342022-07-07T08:31:12Z [qt-rpcconsole] [validation.cpp:3940] [VerifyDB] [20%]...[30%]...Error connecting to Tor control socket
352022-07-07T08:46:50Z [torcontrol] [torcontrol.cpp:618] [disconnected_cb] [tor] Not connected to Tor control port 127.0.0.1:9051, trying to reconnect
362022-07-07T08:49:53Z [qt-rpcconsole] [validation.cpp:3940] [VerifyDB] [40%]...[50%]...

saranshisatgit commented at 5:43 am on July 8, 2022: none

How to reproduce this issue?

mzumsande commented at 4:18 pm on July 8, 2022: contributor

This can be reproduced by lowering the dbcache (e.g. with -dbcache=10, verifychain 4 1000 aborts on signet): The level 3 checks do in-memory disconnect of blocks - and skip these if the memory exhausted (code) Then, the level 4 checks try to reconnect the blocks again in memory, and ConnectBlock() asserts because the coinsview best block and the block to be connected are out of sync. I’m working on a fix, I think that if we run out of cache memory we should issue a warning and not attempt the level 4 checks.

mzumsande referenced this in commit d634c0e0c6 on Jul 8, 2022

mzumsande referenced this in commit b7adbca206 on Jul 8, 2022

verdy-p commented at 0:28 am on July 28, 2022: none

And this also looks like the old persistant bug that exists in Bitcoin Core (Qt) 0.22: incorrect assumption of block files numbering between blk..dat and rev.dat in /blocks, as well as in /blocks/.ldb, or after closing the app in chainstate/.ldb: It seems that the current code just uses an enumeration of files existing in directories to generate a new file number for rev* and index/*.ldb files, which may be created out of the expected sequence; when later checking blocks, they don’t match the expected signatures.

I’ve not been able to use Bitcoin core since ever I started trying using Bitcoin Core 0.22: I have a fast fiber Internet, and storage is clean; I have placed the Bitcoin data store on a dedicated clean volume, stored on a fast RAID. No I/O corruption has ever occured, but Bitcoin Core was unable to finish the IBD phase, bugging about half the way (after storing about 300GB of data). Using the “prune” option (to run as a partial block node instead of a full node), reduced the chance for that bug to occur, but it still occured.

Using more cache of memory reduced the bug to occur during the IBD phase, but NOT when the program had to be stopped (and it started generating chainstate/*.ldb files at extreme speed, but sometimes out of order (incorrect file numbers due to unordered I/O delays between threads generating them).

Now even with version 0.23 this continues as well, and I constantly need to erase all blocks/* and chainstate/* files (plus the enormous debug.log file, or just perform a fast reformat of the partition containing the datastore. (and there has never been any error reported on the filesystem after all crashes, or by the underlying OS.

Bitcoin.org uses unsafe synchronization when it generates new files in its storage (and bugs are depending on how peers are replying to requests and on variable I/O rate that storage disks and filesystems can sustain, which is extremely high during the IBD phase (several hundreds tens of megabytes per second downloaded, and many active concurrent threads for verifiers, we can see some pikes of CPU activity or I/O, delays to complete them, some background delays occasionnaly caused by the filesystem which may pause one thread but not another waiting for completion; so threads can become out of sync.)

Definitely this is NOT a hardware problem (as stated many times since long in various Bitcoin support forums, but really an issue in the code).

For tests, I tried to compile Bitcoin.org and force it to use a single thread and a single verifier thread, most of these bugs are disappearing (but they still occur, probably because Qt also uses its own internal threads, and massive I/O to “debug.log” is also adding to this.

Another symptom: frquently the “debug.log” file is missing the leading part which should start by a couple of lines: several kilobytes are missing, or were overwritten, most proably by concurrent threads using unsafe “appends” when the program was initializing and trying to connect to peer. Here again this means lack of synchronization between network threads for peer connections/downloads/uploads and other concurrent threads (such as background verifications of blocks in cache, or internal initializations of Qt itself: the debug logger is also affected by incorrect synchronization when it is used by background worker threads, forgetting to use the correct locking mechanism, or not properly synchronizing with RPC/IPC services inside Bitcoin Core or with external CLI).

Note that the doc says that we can tune the “maxmempoolsize” whouse default is 300 MB (not MiB). In the Qt interface, it shows a different value (in “MiB”) which is excessive. I had to manually set to to “512” MiB (in the GUI) instead of the default (it generated a cache of 300 GiB! causing lot of frequent swaps on the system drives, causing randomized delays inside BitBoin Core (Qt) and increasing the likeliness of desinchronization of threads (notably those writing to the storage cache, which get corrupted too frequently and randomly because of unsafe assumption inside the code, but never because of hardware or OS/driver issues).

BitCoind core should be able to work very well with a much more modest default than 300MB in its poolsize, it should be about 64MiB without causing significant delays when using 4 verifer threads and two dozen of peer connections (but I’ve never seen more than 12 peers except sometimes up to 10 network “feelers” that don’t consume of lot of I/O.

Another note: to run as a “full node”, Bitcoin Core now requires AT LEAST 512GB of storage bot its blocks and chainstate. That storage is growing. The default indicated in the doc is NO longer correct (or you have to run in “pruning” mode, but pruning operations are known to cause these desynchronizations of the store, and fatal errors caused by its internal corruption)

So those running “full nodes” successfully are on servers that use very fast storage in large and fast RAID (only on SSD, not HDD arrays); others running “partial block” nodes are running only on modest PCs and slow Internet connections (not fibers…): I could effectively avoid all corruptions by running Bitcoin Core inside a hosted VM where I had strongly capped the maximum speed of the Internet correction below about 20 Megabit/s; and not 1 Gigabit/s like on the native OS (where those crashes are always occuring if your local storage or filesystem is not very fast and you’ere hit by I/O limitations on disks or filesystems).

mzumsande commented at 8:53 pm on July 28, 2022: contributor

And this also looks like the old persistant bug (…)

I don’t think any of this is related to the verifychain abort described in this issue, which has nothing to do with block files or mempoolsize. As explained above, it is a memory issue, which wouldn’t ever come up during normal runs, but only if someone has a small dbcache and/or calls verifychain with a unusually large number of blocks. Also, I think your post mentions several separate problems - I think it would be best to open separate issues for these.

sipa commented at 9:14 pm on July 28, 2022: member

@verdy-p You seem to have many distinct complaints here, though most seem unrelated to the issue here. If you want to help getting them resolved, I suggest you open separate actionable issues for them.

Still, a few comments:

And this also looks like the old persistant bug that exists in Bitcoin Core (Qt) 0.22: incorrect assumption of block files numbering between blk..dat and rev.dat in /blocks, as well as in /blocks/.ldb, or after closing the app in chainstate/.ldb: It seems that the current code just uses an enumeration of files existing in directories to generate a new file number for rev* and index/*.ldb files, which may be created out of the expected sequence; when later checking blocks, they don’t match the expected signatures.

The *.ldb files are created and managed by the LevelDB database system, not Bitcoin Core. It’s of course possible there is a bug there, but if there is, it’s not in the Bitcoin Core codebase.

The blk*.dat and rev*.dat files are managed by Bitcoin Core directly, and there is no assumption they are consecutive. The block index database maintains a table that stores for each block in which *.dat file its data resides, and in which position.

Bitcoin.org uses unsafe synchronization when it generates new files in its storage (and bugs are depending on how peers are replying to requests and on variable I/O rate that storage disks and filesystems can sustain, which is extremely high during the IBD phase (several hundreds tens of megabytes per second downloaded, and many active concurrent threads for verifiers, we can see some pikes of CPU activity or I/O, delays to complete them, some background delays occasionally caused by the filesystem which may pause one thread but not another waiting for completion; so threads can become out of sync.)

It’s certainly possible there is a bug, and if so, accurate reports may help developers locate and fix it. But as stated here, this sounds like a lot of conjecture and assumptions. Bitcoin Core only has a single thread that processes incoming block messages from peers, and flushes to the database are always synchronized. Signature validation does occur on several separate threads (you can reduce this to one with -par=1 on the command line or par=1 in the config file), but that’s separate from database interaction.

Also: Bitcoin Core is not bitcoin.org.

Definitely this is NOT a hardware problem (as stated many times since long in various Bitcoin support forums, but really an issue in the code).

It’s hard to say based on what you provide here. If reducing parallelism improves the situation, but does not fix it, that makes me suspect your CPU is overheating and occasionally flipping bits. That may not be observable in other applications, as few programs are as susceptible to occasional bitflips throwing the whole validation process off.

Again, I don’t want to presume this is an issue on your side, but without more information, plus the fact that we don’t see reports like this from everyone, does suggest looking more widely for causes.

Another symptom: frquently the “debug.log” file is missing the leading part which should start by a couple of lines: several kilobytes are missing, or were overwritten, most proably by concurrent threads using unsafe “appends” when the program was initializing and trying to connect to peer. Here again this means lack of synchronization between network threads for peer connections/downloads/uploads and other concurrent threads (such as background verifications of blocks in cache, or internal initializations of Qt itself: the debug logger is also affected by incorrect synchronization when it is used by background worker threads, forgetting to use the correct locking mechanism, or not properly synchronizing with RPC/IPC services inside Bitcoin Core or with external CLI).

No, this is documented and intended behavior. The debug.log file is truncated to the last 10 MB at startup, unless -debug or -shrinkdebugfile=0 are set.

BitCoind core should be able to work very well with a much more modest default than 300MB in its poolsize, it should be about 64MiB without causing significant delays when using 4 verifer threads and two dozen of peer connections (but I’ve never seen more than 12 peers except sometimes up to 10 network “feelers” that don’t consume of lot of I/O.

Are you stating that the current codebase should be able to work with 64 MiB or RAM, or expressing the opinion that it should be possible to write a Bitcoin node that does? In the first case, definitely not - just loading the block headers into memory needs more than that. In the second case, pull requests welcome.

(or you have to run in “pruning” mode, but pruning operations are known to cause these desynchronizations of the store, and fatal errors caused by its internal corruption)

Reference for claims that it’s known that pruning causes corruption? Is this just your experience?

verdy-p commented at 0:03 am on July 30, 2022: none

I’ve NEVER been ablt to run Bitcoin core to finalize the IBD (since many months). It has always terminated abruptly with corruptions detected in *.ldb index files (most often for the chainstate). There are less frequent bugs in version 0.23, but really it still shows the same bugs (that are likely caused inside the Berkeley DB engine library you use, or many be in Qt).

And, no my PC has NEVER overheated, and NEVER detected any hardware problem or fielsystem corruptions. Even if I use it extensively.

And I’be tried as well on other PCs (including new ones), or inside a Linux VM, over Hyper-V on a Windows host, or VirtualBox on various Linux distribs, or on a NUC with Proxmox), and it has always failed. So I just wonder on what kind of machines you make it work (most probably only on very costly servers with lot of memory, many many cores, and extremely fast storages, so that you’re only bound to the Internet bandwidth).

To make Bitcoin Core work more successfully, I always need to cap the Internet bandwidth to less than 10 megabit/s, and even when it crashes, restarting it with “-reindex” will fail again in just a few minutes, but it defintely fails ALL the time after just a couple of hours (sometimes only a few minutes) if I let it use the fiber speed (>1 Gbit/s on download, even if Bitcoin core does not seem able to reach more than about 200 megabits/s in download). So it really seems affected by lack of proper synchronization between its working threads (for its actual work, or those created by your third party libraries, includng Qt or the DB engine, which you may not use safely according to their own recommendations).

And you’ve got severe problems into managing the I/O and memory usage. And there are very likely buffer overruns or use-after-free problems. This should be diagnose by stressing test tools and many more assertion checks (at least in a debug version). For now it’s just impossible to diagnose as the OS never reports any hardware problem, or any corruption for itself.

Or may be it was compiled to work only with some ranges of processors (may be you use a CPU that has hardware implementations of hashing or encryption algorithms, and your emulation software code used for fallbacks is wrong and have never been seriously tested… Or the bug leaves in incorrect agressive optimizations (unsafe AVX or SSE code?) made in your compiler settings, so that it only works with a few processors models and in very strict environement on isolated servers.

If bit flipping was occuring, I would have detected them since long in many another apps or in the filesystem. All my machines are very stable (and I’ve never used any overclocking, tempoeratures is constantly below the 40°C even today in hot summer), and the storage is using reasonable fast but very reliable RAID, never reporting so many inconsistantly failures than those presumed by your code. Something is worng but you have difficulties to isolate them, may be compiling your app not for native processors but for .net or Webassembly (just an example) could help you detect those bugs (even if your app is a bit slower, at least if will be safer, running on a more reproductible base environment). All crashes for now occur randomly in the progression of the IBD phase, where Bitcoin Core performs the largest amount of storage I/O and network I/O in parallel.

Do you have a supported verion of Bitcoin Core that can run in CLI only, without Qt (to see if its is the cause of these desynchronisations and frequent corruption of DB indexes or of the block cache? Why do you need to rebuild all validated blocks all the time, they should be readonly for most of them, not needing so many rewrites)? why corruptions are not self-healed, when you have Merkle trees with strong hashes (may be it’s architectural in the way you manage blocks dependencies, caches and transactions; may be you should use other safer technologies for the DB, which seems to use uses unsage optimistic locks without using any transaction journal, or use Memcach, plus MariaDB or PostgreSQL for the actual storage and indexing? as well why are individual stored blocks so large, all of them at least 128MB; they should not exceed 64MB at most, notablyt because we reach a threashild where the OS stops all kind of storage optimizations in its filesystems, causing huge and unpredictable variations of access time for I/O). And in my opinion, there’s NO gain at all to use a so large cache ni memory, when it requires hours to fill it and when as long it is not full you do not perform and intermediate flush: relaoding from disk occasionnaly will not impact the performance significantly.

maflcko commented at 6:59 am on July 30, 2022: member

Bitcoin Core makes heavy use of CPU, RAM and disk IO. Hardware defects might only become visible when running Bitcoin Core. You might want to check your hardware for defects.

memtest86 to check your RAM
to check the CPU behaviour under load, use linpack or Prime95
to test your storage device use smartctl or CrystalDiskInfo

Source: https://bitcoin.stackexchange.com/a/12206

verdy-p commented at 9:30 am on July 30, 2022: none

memtest86 check: OK linpack check: OK Prime95 check: OK smartctl: OK CrystalDiskInfo: OK SFC /SCANNOW: OK

And this still does not work!

I can use other CPU-intensive and lengthy apps, they never fail like Bitcoin Core (v0.22 or v0.23) which always fails (in Wnidows but as well in Linux machines, or in VMs with various hypervisors) within the first 2 hours (and never with any overheating CPU/GPU/memory/chipsets/storage drives) on every machine I tried to use. This can only be bugs in your existing sources (most probably related by unchecked order of completion events between threads due to lack of proper serialization for some resources, or incorrect measurement of time using unstable clock references).

mzumsande referenced this in commit 46ec70affa on Oct 7, 2022

mzumsande referenced this in commit 5eaca08dbc on Jan 19, 2023

mzumsande referenced this in commit 61431e3a57 on Jan 31, 2023

fanquake referenced this in commit 8f4ae65818 on Feb 5, 2023

fanquake closed this on Feb 5, 2023

josibake referenced this in commit 7b310d0b48 on Apr 5, 2023

Fabcien referenced this in commit 38a301820d on Dec 7, 2023

bitcoin locked this on Feb 5, 2024

verifychain 4 111000 aborts with assertion “hashPrevBlock == view.GetBestBlock()” failed #25563