Stale chain tips can block the RPC and all other responses from the peer for very long time (more than 15 minutes, sometimes hours, or undefinetely). During that time I can see that my local node no longer replies to ANY other peer (e.g. to bitnode.io core). And in process viewers, I can see that there’s a single thread blocking ALL other threads in a tight loop using 100% CPU in a tight loop.
I’ve seen that this is occurs in any db indexer (notably the chainstate indexer during IBD, or the txindexer, or the coinstat db indexer), notably in some critical steps where new blocks are added to the chain (once every about 15 minutes). This seems to happen when a new block contains a transaction validating a better chain, and invalidating other blocks from dead branches. In that case, the ongoin indexer will fail with unchecked conditions in its internal iterator, apparently because it expect the next block to be present and locks it incorrectly, and is not able to release it and retry if that next valid block has been replaced by a better block in the chain.
02022-08-21T19:44:15.875834Z [net_processing.cpp:2797] [ProcessMessage] New outbound peer connected: version: 70015, blocks=750474, peer=58, peeraddr=[2604:a880:cad:d0::d9e:f001]:8333 (block-relay-only)
12022-08-21T19:44:45.012022Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 330879
22022-08-21T19:45:16.005414Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 331746
32022-08-21T19:45:47.003036Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 332578
42022-08-21T19:46:18.039548Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 333378
52022-08-21T19:46:49.007385Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 334219
62022-08-21T19:47:20.013586Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 334916
72022-08-21T19:47:51.022263Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 335794
82022-08-21T19:48:22.007479Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 336769
92022-08-21T19:48:53.009529Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 337743
102022-08-21T19:49:24.039053Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 338611
112022-08-21T19:49:55.041510Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 339322
122022-08-21T19:50:26.010515Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 340016
132022-08-21T19:50:57.016185Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 340630
142022-08-21T19:51:28.055804Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 341499
152022-08-21T19:52:08.460256Z [validation.cpp:2441] [UpdateTipLog] UpdateTip: new best=0000000000000000000758f0830f8c7cf41c06740b2004b4186e439b0b989062 height=750475 version=0x2281c000 log2_work=93.689008 tx=758464817 date='2022-08-21T19:50:59Z' progress=1.000000 cache=3.7MiB(27192txo)
162022-08-21T19:52:10.247487Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 341803
172022-08-21T19:52:30.568835Z [validation.cpp:2441] [UpdateTipLog] UpdateTip: new best=00000000000000000005d0b2bbb26ce7bb0eef932eed58853ebe6525ca71c85a height=750476 version=0x20000004 log2_work=93.689019 tx=758465224 date='2022-08-21T19:52:12Z' progress=1.000000 cache=3.9MiB(28964txo)
182022-08-21T19:52:41.785630Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 341814
192022-08-21T19:53:12.001085Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 342683
202022-08-21T19:53:43.027307Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 343432
212022-08-21T19:54:14.012871Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 344012
222022-08-21T19:54:45.022063Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 344718
232022-08-21T19:54:59.410934Z [validation.cpp:2441] [UpdateTipLog] UpdateTip: new best=000000000000000000098f79590f23c8bb78c3d4d22d1cd4bab3a0e94fc7e6be height=750477 version=0x2ab96000 log2_work=93.689030 tx=758465722 date='2022-08-21T19:54:36Z' progress=1.000000 cache=4.1MiB(30966txo)
242022-08-21T19:55:16.033929Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 344913
252022-08-21T19:56:00.985269Z [mapport.cpp:204] [ProcessUpnp] UPnP Port Mapping successful.
262022-08-21T20:16:01.005268Z [mapport.cpp:204] [ProcessUpnp] UPnP Port Mapping successful.
Finally later it may eventually detect that and detect the inconsistency, by eviting some peers.
02022-08-21T20:26:10.234268Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 344915
12022-08-21T20:26:10.236286Z [net_processing.cpp:4381] [CheckForStaleTipAndEvictPeers] Potential stale tip detected, will try using extra outbound peer (last tip update: 1871 seconds ago)
22022-08-21T20:26:10.877268Z [net_processing.cpp:2797] [ProcessMessage] New outbound peer connected: version: 70016, blocks=750480, peer=194, peeraddr=78.108.108.38:8333 (outbound-full-relay)
32022-08-21T20:26:18.436270Z [net_processing.cpp:2797] [ProcessMessage] New outbound peer connected: version: 70016, blocks=750480, peer=195, peeraddr=208.104.92.74:8333 (outbound-full-relay)
42022-08-21T20:26:20.609320Z [net_processing.cpp:2797] [ProcessMessage] New outbound peer connected: version: 70016, blocks=750480, peer=196, peeraddr=[2a01:5200:6c:6162:6263:6173:6861:6263]:8333 (outbound-full-relay)
52022-08-21T20:26:22.546436Z [net_processing.cpp:2797] [ProcessMessage] New outbound peer connected: version: 70016, blocks=750480, peer=198, peeraddr=141.95.45.187:30034 (outbound-full-relay)
62022-08-21T20:27:18.146310Z [validation.cpp:2441] [UpdateTipLog] UpdateTip: new best=00000000000000000007cb8a63d4c0f24bf0e1c3e6bedbf0436ebab43a3ee690 height=750478 version=0x20800000 log2_work=93.689041 tx=758467003 date='2022-08-21T20:05:24Z' progress=0.999995 cache=5.5MiB(41861txo)
72022-08-21T20:27:20.065326Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 344926
82022-08-21T20:27:51.486742Z [validation.cpp:2441] [UpdateTipLog] UpdateTip: new best=00000000000000000006e1bb77ce90948b3050df8ee2143d1913fd583ce3c765 height=750479 version=0x20c00000 log2_work=93.689052 tx=758468307 date='2022-08-21T20:08:38Z' progress=0.999996 cache=6.6MiB(48588txo)
92022-08-21T20:28:03.488188Z [validation.cpp:2441] [UpdateTipLog] UpdateTip: new best=0000000000000000000171f39891841e7128e85f45f386ddde82f65c12dd5773 height=750480 version=0x20400000 log2_work=93.689063 tx=758468706 date='2022-08-21T20:10:40Z' progress=0.999996 cache=6.8MiB(50180txo)
102022-08-21T20:28:05.362170Z [index/base.cpp:169] [ThreadSync] Syncing coinstatsindex with block chain from height 344927
112022-08-21T20:28:05.364170Z [net_processing.cpp:2797] [ProcessMessage] New outbound peer connected: version: 70015, blocks=750480, peer=205, peeraddr=116.203.112.73:8333 (outbound-full-relay)
122022-08-21T20:28:05.718559Z [net_processing.cpp:2797] [ProcessMessage] New outbound peer connected: version: 70016, blocks=750480, peer=208, peeraddr=[2404:7a85:4161:2b00:49a1:427a:fac:3409]:8333 (outbound-full-relay)
132022-08-21T20:28:08.302936Z [net_processing.cpp:2797] [ProcessMessage] New outbound peer connected: version: 70016, blocks=750480, peer=211, peeraddr=185.167.113.59:8333 (outbound-full-relay)
142022-08-21T20:28:21.039390Z [net_processing.cpp:2797] [ProcessMessage] New outbound peer connected: version: 70016, blocks=750480, peer=213, peeraddr=[240b:252:9480:9d00:e65f:1ff:fe10:d767]:8333 (outbound-full-relay)
And here again we see the deadlock occuring for long time in the coindb indexer (note the difference of time between the last too events). And in fact nothing else occured: no resposes to other peers, the coindb indexer was stalled, the RPC service stopped to respond even locally.
It seems then that there are malicious nodes on the P2P network sending bad blocks that “partially pass” the basic security, but can cause BitCoin Core to be blocked and become unresponsive. It can no longer respond even to a basic RPC such as “bitcoin -getinfo”, or “bitcoin -netinfo”
It will not even repond to a “bitcoin-cli stop”: some threads are stopped, but not all, and the indexer thread is still running in its tight loop; if you are in that situation, the only way to recover is to KILL the process entirely, and you’ll get the related index to become invalid as it was not committed, it will have to be reconstructed from zero.
Most often, you have no way to recover (given that the RPC service is not responsive, and even the CTRL+C breaker, or a simple kill -hangup" will not work. You have to do an instant kill (kill -9 on Linux, or manual kill in the Process Viewer on Windows: nothing else will be synced to disk).
If the index was the IBD index, the chainstate can only be recovered by check level 4 (works often, but not necessarily always, but never with check level 3). If it does not work, then you have to rebuild the chainstate entirely.
But for any other index (such as the “txindex” or “coinstats” index, there’s NO check at all with recovery at startup, so the only way to recover is to entirely delete the content of that index directory and rebuild it from zero (and this will take many hours, and may fail again in the middle, due to bad blocks sent by malicious peers and that may have still not been replaced by valid blocks in the main chain).
Apparently these bad blocks seem to come from unapproved forks (possibly implementing SegWit with too large block sizes?), which are not correctly detected and cause the indexer to fail or run into some infinite loops, or some bad signatures using unsupported/snon-standard bytecodes, taking extremely long time to validate or just discard rapidly as invalid.
Such events with bad blocks coming from malicious peers tend to be frequent now. Unfortunately it’s not easy to track which nodes caused that bug as they may not even be connected since long: they have sent their garbage and have disconnected themselves as soon as done, just to crash specific versions of Bitcoin Core not detecting them.
(Note: this is once again occuring with the current unmodified release v23.0 from your official site; it affects both Linux and Windows versions).
Are there blocks that are extremely long to validate or index (taking much more than just about 1 minutes normally needed here to process every logged group of about 15 blocks, such as more than 30 minutes in the 1st break above, but still not completed after several hours at end of the logs above)? Isn’t it one of the reasons why SegWit was asked for, with blocks grouping many signatures, for example those generated by very large miner pools and needing tons of transactions to validate all their micropayments to many participants?
See also Bug #25365 (may be related with the same race issue when there’s a new chaintip updated while the indexer is running and iterating from the older chain view)
If this is the cause, the bug is manifested only when you run it on a machine with enough cores to allow more threads to run concurrently. The crashes/hangs I experiment are all runing with 8 cores activates; your test environment may just have 1 or 2 cores configured, most probably not enough to experiment the race issue between concurrent threads, and not experienced by people running it on cheap PC or notebooks, or in cheap/free VMs for Azure and Amazon (if you get just 1 or 2 non dedicated cores).
Note also that synchronization issues in code may even more likely to cause bugs on CPUs running multiple threads per core, or using more complex parallelism (e.g. new Intel processors with power-efficient and performance CPUs, with more complex caching layers, or needing special kernel-initiated firmware mitigations like Retpoline against attacks based on latency response time on caches and with speculative execution models; those possibile attacks and firmware or OS mitigations exist on many processors, and their effect on execution latency may vary, depending on the number of active concurrent threads; if you have low I/O latency on disk storage, this will also increase the likelyhood of race issues caused by improper synchronization. But in all these cases this greatly impacts the leveldb indexers, not properly iterating over the main chain of blocks.)
If you want to experiment it faster, you may want to test it on a machine with many more CPU cores, like newest AMD ThreadRipper (48 concurrent threads!). And then try running multiple indexers running in parallel (the chainstate indexer in IBD, the coinsdbindexer, the txindex, and the coinsdbindex: one of them will crash/hand, or report “data corruption”, but not caused by any hardware issue, but incorrect synchronization in Bitcoin Core’s code).
As well it seems that all bugs like #25894 (reported quite often, but always diagnosed in replies" incorrectly as a “hardware issue”, just supposed but never really verified) are also related to the same synchronization issues between any indexer thread and the thread handling any changes of the tips on the chainstate (notably if there’s a change between concurrent forks for new blocks after establishing a consensus. They more likely occurs at end of a trading day, when more blocks are announced for compensating transactions made by major traders (it’s then more frequent to get new blocks emitted more rapidly than just once every 10 minutes during the rest of the day.
Subsidiary question: Is there way to have a knonw list of trustable peers, and a blacklist for bad peers sending fake/corrupted data? Isn’t Bitcoin missing some additional checks of the structure of these blocks (not just the validation of the individual attempted transactions, but basically its assumed format, includnig for complex multipart signature schemes, or peers attempting to force the use of a hard fork with a disappoved BIP proposal, or attemping to use it too early before a valid date and proper nogociation, if they run some specific subchains)?
One remote peer I remarked was always present when I saw crashes/hangs was this one:
<-> type net mping ping send recv txn blk hb addrp addrl age id address version out full ipv6 27 27 4 1 0 1002 1 11 [2001:41d0:303:2b2b::]:8333 70015/Satoshi:0.14.2(bitcore-sl)/UASF-Segwit:1.0(BIP148)/
I’ve seen it several times sending groups of 2 or 3 unconfirmed blocks with large transactions in a row. It is likely a non-upgraded miner node.
(According to WHOIS, is is hosted by OVH in France, and likely on a fiber access; it uses the most antique version of Satoshi I seen for a “full node”, but with local patches trying to enforce BIP148 but with wrong version announcements).
Do I need to list it in my banlist?