IBD core dumped during recursive call to CCoinsViewCache::FetchCoin (while connecting best chain tip)

verdy-p commented at 8:44 am on August 16, 2022: none

During IBD, sometimes I get this crash reported by gdb:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fffff0c5537 in __GI_abort () at abort.c:79
#2  0x00000000080949d8 in CCoinsViewErrorCatcher::GetCoin (this=<optimized out>, outpoint=..., coin=...) at coins.cpp:307
#3  0x00000000083b0153 in CCoinsViewCache::FetchCoin (this=0x88b26a0, outpoint=...) at coins.cpp:46
#4  0x00000000083b033a in CCoinsViewCache::GetCoin (this=<optimized out>, outpoint=..., coin=...) at coins.cpp:59
#5  0x00000000083b0153 in CCoinsViewCache::FetchCoin (this=0x7fff3338f1c0, outpoint=...) at coins.cpp:46
#6  0x000000000829fa5a in CCoinsViewCache::HaveCoin (outpoint=..., this=0x7fff3338f1c0) at coins.cpp:160
#7  CCoinsViewCache::HaveInputs (tx=..., this=<optimized out>) at coins.cpp:265
#8  CCoinsViewCache::HaveInputs (tx=..., this=0x7fff3338f1c0) at coins.cpp:261
#9  Consensus::CheckTxInputs (tx=..., state=..., inputs=..., nSpendHeight=661434, txfee=@0x7fff3338dcd0: 0) at consensus/tx_verify.cpp:171
#10 0x00000000085559f0 in CChainState::ConnectBlock(CBlock const&, BlockValidationState&, CBlockIndex*, CCoinsViewCache&, bool) [clone .constprop.0] (this=this@entry=0x88acfa0, block=..., state=...,
    pindex=<optimized out>, pindex@entry=0xa6f0238, view=..., fJustCheck=fJustCheck@entry=false) at validation.cpp:2186
#11 0x000000000827b79c in CChainState::ConnectTip (disconnectpool=..., connectTrace=<synthetic pointer>..., pblock=std::shared_ptr<const CBlock> (empty) = {...}, pindexNew=0xa6f0238, state=...,
    this=0x88acfa0) at validation.cpp:2720
#12 CChainState::ActivateBestChainStep (connectTrace=..., fInvalidFound=<optimized out>, pblock=..., pindexMostWork=<optimized out>, state=..., this=<optimized out>) at validation.cpp:2883
#13 CChainState::ActivateBestChain (this=0x88acfa0, state=..., pblock=std::shared_ptr<const CBlock> (empty) = {...}) at validation.cpp:3010
#14 0x000000000818ed7a in node::ThreadImport (chainman=..., vImportFiles=..., args=..., mempool_path=...) at node/blockstorage.cpp:887
#15 0x00000000081199f2 in operator() (__closure=0x7fff2c000b60) at init.cpp:1575
#16 std::__invoke_impl<void, AppInitMain(node::NodeContext&, interfaces::BlockAndHeaderTipInfo*)::<lambda()>&> (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#17 std::__invoke_r<void, AppInitMain(node::NodeContext&, interfaces::BlockAndHeaderTipInfo*)::<lambda()>&> (__fn=...) at /usr/include/c++/10/bits/invoke.h:110
#18 std::_Function_handler<void(), AppInitMain(node::NodeContext&, interfaces::BlockAndHeaderTipInfo*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/10/bits/std_function.h:291
#19 0x00000000084331c4 in std::function<void ()>::operator()() const (this=0x7fff3338fe50) at /usr/include/c++/10/bits/std_function.h:622
#20 util::TraceThread(char const*, std::function<void ()>) (thread_name=<optimized out>, thread_func=...) at util/thread.cpp:19
#21 0x00000000081197cb in std::__invoke_impl<void, void (*)(char const*, std::function<void()>), char const*, AppInitMain(node::NodeContext&, interfaces::BlockAndHeaderTipInfo*)::<lambda()> > (
    __f=@0x13344fe8: 0x8433080 <util::TraceThread(char const*, std::function<void ()>)>) at /usr/include/c++/10/bits/invoke.h:60
#22 std::__invoke<void (*)(char const*, std::function<void()>), char const*, AppInitMain(node::NodeContext&, interfaces::BlockAndHeaderTipInfo*)::<lambda()> > (
    __fn=@0x13344fe8: 0x8433080 <util::TraceThread(char const*, std::function<void ()>)>) at /usr/include/c++/10/bits/invoke.h:95
#23 std::thread::_Invoker<std::tuple<void (*)(char const*, std::function<void()>), char const*, AppInitMain(node::NodeContext&, interfaces::BlockAndHeaderTipInfo*)::<lambda()> > >::_M_invoke<0, 1, 2> (
    this=0x13344fb8) at /usr/include/c++/10/thread:264
#24 std::thread::_Invoker<std::tuple<void (*)(char const*, std::function<void()>), char const*, AppInitMain(node::NodeContext&, interfaces::BlockAndHeaderTipInfo*)::<lambda()> > >::operator() (
    this=0x13344fb8) at /usr/include/c++/10/thread:271
#25 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(char const*, std::function<void()>), char const*, AppInitMain(node::NodeContext&, interfaces::BlockAndHeaderTipInfo*)::<lambda()> > > >::_M_run(void) (this=0x13344fb0) at /usr/include/c++/10/thread:215
#26 0x00007fffff4beed0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#27 0x00007fffff796ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#28 0x00007fffff19ddef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Last logs displayed (running bitcoind with -debug=all) before the crash when running inside gdb (with current sources sync’ed from this git repo).

2022-08-16T08:41:08.970168Z [loadblk] [validationinterface.cpp:199] [UpdatedBlockTip] [validation] Enqueuing UpdatedBlockTip: new block hash=00000000000000000007797ea03040a6ce4bbba93edc0797be66593f342b0e80 fork block hash=000000000000000000035e2d3a32a3249d4db7ef6c89db3a9f5d8e5111046285 (in IBD=true)
2022-08-16T08:41:08.970455Z [scheduler] [validationinterface.cpp:227] [operator()] [validation] BlockConnected: block hash=00000000000000000007797ea03040a6ce4bbba93edc0797be66593f342b0e80 block height=661432
2022-08-16T08:41:08.982873Z [loadblk] [validation.cpp:2717] [ConnectTip] [bench]   - Load block from disk: 12.38ms [1.19s (14.02ms/blk)]
2022-08-16T08:41:08.985121Z [loadblk] [validation.cpp:2060] [ConnectBlock] [bench]     - Sanity checks: 2.00ms [0.19s (2.20ms/blk)]
2022-08-16T08:41:08.985349Z [loadblk] [validation.cpp:2159] [ConnectBlock] [bench]     - Fork checks: 0.23ms [0.02s (0.23ms/blk)]
2022-08-16T08:41:09.695210Z [loadblk] [validation.cpp:2244] [ConnectBlock] [bench]       - Connect 1308 transactions: 709.85ms (0.543ms/tx, 0.113ms/txin) [47.74s (555.09ms/blk)]
2022-08-16T08:41:09.695478Z [loadblk] [validation.cpp:2257] [ConnectBlock] [bench]     - Verify 6271 txins: 710.13ms (0.113ms/txin) [47.76s (555.36ms/blk)]
2022-08-16T08:41:09.695698Z [loadblk] [validation.cpp:2267] [ConnectBlock] [bench]     - Write undo data: 0.22ms [0.02s (0.22ms/blk)]
2022-08-16T08:41:09.695895Z [loadblk] [validation.cpp:2278] [ConnectBlock] [bench]     - Index writing: 0.20ms [0.02s (0.19ms/blk)]
2022-08-16T08:41:09.696274Z [loadblk] [validationinterface.cpp:251] [BlockChecked] [validation] BlockChecked: block hash=0000000000000000000b5996770f8489d67422b9797569fc4f7ff78d06029943 state=Valid
2022-08-16T08:41:09.696475Z [loadblk] [validation.cpp:2729] [ConnectTip] [bench]   - Connect total: 713.62ms [48.01s (558.28ms/blk)]
2022-08-16T08:41:09.702334Z [loadblk] [validation.cpp:2734] [ConnectTip] [bench]   - Flush: 5.86ms [0.64s (7.47ms/blk)]
2022-08-16T08:41:09.702531Z [loadblk] [validation.cpp:2740] [ConnectTip] [bench]   - Writing chainstate: 0.20ms [0.02s (0.22ms/blk)]
2022-08-16T08:41:09.703023Z [loadblk] [validation.cpp:2511] [UpdateTipLog] UpdateTip: new best=0000000000000000000b5996770f8489d67422b9797569fc4f7ff78d06029943 height=661433 version=0x20006000 log2_work=92.516788 tx=596277697 date='2020-12-15T05:08:04Z' progress=0.788542 cache=79.7MiB(606180txo)
2022-08-16T08:41:09.703239Z [loadblk] [validation.cpp:2751] [ConnectTip] [bench]   - Connect postprocess: 0.71ms [0.08s (0.95ms/blk)]
2022-08-16T08:41:09.703429Z [loadblk] [validation.cpp:2752] [ConnectTip] [bench] - Connect block: 732.76ms [49.95s (580.77ms/blk)]
2022-08-16T08:41:09.703618Z [loadblk] [txmempool.cpp:736] [check] [mempool] Checking mempool with 0 transactions and 0 inputs
2022-08-16T08:41:09.703806Z [loadblk] [validationinterface.cpp:227] [BlockConnected] [validation] Enqueuing BlockConnected: block hash=0000000000000000000b5996770f8489d67422b9797569fc4f7ff78d06029943 block height=661433
2022-08-16T08:41:09.704008Z [loadblk] [validationinterface.cpp:199] [UpdatedBlockTip] [validation] Enqueuing UpdatedBlockTip: new block hash=0000000000000000000b5996770f8489d67422b9797569fc4f7ff78d06029943 fork block hash=00000000000000000007797ea03040a6ce4bbba93edc0797be66593f342b0e80 (in IBD=true)
2022-08-16T08:41:09.711612Z [loadblk] [validation.cpp:2717] [ConnectTip] [bench]   - Load block from disk: 7.41ms [1.20s (13.94ms/blk)]
2022-08-16T08:41:09.713156Z [loadblk] [validation.cpp:2060] [ConnectBlock] [bench]     - Sanity checks: 1.24ms [0.19s (2.19ms/blk)]
2022-08-16T08:41:09.713345Z [loadblk] [validation.cpp:2159] [ConnectBlock] [bench]     - Fork checks: 0.19ms [0.02s (0.22ms/blk)]
2022-08-16T08:41:09.723584Z [loadblk] [dbwrapper.h:250] [Read] LevelDB read failure: Corruption: block checksum mismatch: /mnt/g/bitcoin/chainstate/308716.ldb
2022-08-16T08:41:09.723798Z [loadblk] [dbwrapper.cpp:246] [HandleError] Fatal LevelDB error: Corruption: block checksum mismatch: /mnt/g/bitcoin/chainstate/308716.ldb
2022-08-16T08:41:09.723998Z [loadblk] [dbwrapper.cpp:247] [HandleError] You can use -debug=leveldb to get more complete diagnostic messages
2022-08-16T08:41:09.724248Z [loadblk] [noui.cpp:43] [noui_ThreadSafeMessageBox] Error: Error reading from database, shutting down.
Error: Error reading from database, shutting down.
2022-08-16T08:41:09.724697Z [loadblk] [coins.cpp:302] [GetCoin] Error reading from database: Fatal LevelDB error: Corruption: block checksum mismatch: /mnt/g/bitcoin/chainstate/308716.ldb

Thread 15 "b-loadblk" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff32f60700 (LWP 14607)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb)

Running on Debian bullseye (with all current apt updates).

Note that restarting bitcoind (with -checklevel=4 -checkblocks=6) just restarts from a recent height and does not detect any corruption; it processes a few dozen blocks for a couple of minutes and crashes again. Retrying restarts from the same block height.

I’ve checked the leveldb with external tools, and did not find any corruption or missing index in the key indexes, or any incorrect sort of keys in the 6 levels, or any corruption with dummy/partial records in the journal or sorted indexes.

verdy-p added the label Bug on Aug 16, 2022

verdy-p renamed this:
~~IBD core dumped during recursive call to CCoinsViewCache::FetchCoin (while connecting bet chain tip)~~
IBD core dumped during recursive call to CCoinsViewCache::FetchCoin (while connecting best chain tip)
on Aug 16, 2022

MarcoFalke removed the label Bug on Aug 16, 2022

MarcoFalke added the label Data corruption on Aug 16, 2022

MarcoFalke commented at 8:55 am on August 16, 2022: member

What is the version (pass -version on the command line)? Did you modify the source code? What is gettxoutsetinfo?

Otherwise, I am pretty sure this is caused by the same underlying issue that also causes #25800

verdy-p commented at 10:38 am on August 16, 2022: none

No modification of the sources, these are exactly as obtained with “git pull” from this repo (with or without the patch for dbwrapper’s logger, which allows going further and seems to avoid overwriting the stack by allocating its buffer once on the heap and not on the stack).

(So it is version v23.99.0-22d96d76ab02-dirty; it includes all the recent modifications found in current git HEAD). Note that I CANNOT use gettxoutsetinfo after this crash, the RPC thread is no longer running… So if you have a way to perform this call from gdb…

My opinion is that I am approaching the case where some temporary object on the stack is used after it should have been freed. So maybe #25800 is related to this one (which occurs much less often than without the small patches I proposed in #25800). Note also that in #25800 I propose to always fully memset the buffer on the stack with zeroes (possibly enlarging its size to more than 500 bytes), if it helps reproduce the bug faster.

But it seems that the bug reported here is an unexpected effect of the recursion in CCoinsViewCache::FetchCoin (without any protection offered by its associated exclusive lock: this occurs purely inside the same thread; if there’s a lock on a mutex, it is already owned, and there’s no check that counts how many times it has been requested and caps it at a limit). I also see that FetchCoin uses an iterator on a range that may not tolerate modifications made inside the range by a recursive call.
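In the backtrace above, frames #3 and #5 show FetchCoin running with different `this` pointers (0x88b26a0 and 0x7fff3338f1c0), which would be consistent with two stacked cache objects forwarding a miss downward rather than one object re-entering itself. The following is a minimal sketch of that call pattern only; `View`, `DbView` and `CacheView` are hypothetical simplified types, not Bitcoin Core's real classes, and the coin is reduced to an `int`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins; not Bitcoin Core's real classes.
struct View {
    virtual ~View() = default;
    virtual std::optional<int> GetCoin(const std::string& outpoint) = 0;
};

// Bottom layer: plays the role of the on-disk database.
struct DbView : View {
    std::map<std::string, int> disk{{"tx1:0", 50}};
    std::optional<int> GetCoin(const std::string& outpoint) override {
        auto it = disk.find(outpoint);
        if (it == disk.end()) return std::nullopt;
        return it->second;
    }
};

// A cache layer that forwards misses to the view beneath it.
struct CacheView : View {
    View* base;                        // next layer down (not owned)
    std::map<std::string, int> cache;  // entries already fetched
    int fetches = 0;                   // how often this layer asked its base
    explicit CacheView(View* b) : base(b) {}

    std::optional<int> GetCoin(const std::string& outpoint) override {
        auto it = cache.find(outpoint);
        if (it != cache.end()) return it->second;  // hit: no call downward
        ++fetches;
        std::optional<int> coin = base->GetCoin(outpoint);  // one layer down
        if (coin) cache.emplace(outpoint, *coin);
        return coin;
    }
};
```

With a `DbView db`, a `CacheView tip_cache(&db)` and a `CacheView block_cache(&tip_cache)`, the first `block_cache.GetCoin("tx1:0")` walks through both caches to the database, and each cache modifies only its own map; a second lookup is answered by `block_cache` alone.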

Note that I can reproduce this bug even when I compile bitcoind with debug=true (the default is false); it always occurs at the same block height, with no incoming connections (prevented by firewall rules), only outgoing connections to 2 full nodes, and without performing any RPC call (so this is not caused by RPC threads).

Notice that this occurs just after detecting a “fork block”.

I wonder about the validity of this line [src/coins.cpp:48] in CCoinsViewCache::FetchCoin(…):

    CCoinsMap::iterator ret = cacheCoins.emplace(std::piecewise_construct, std::forward_as_tuple(outpoint),
                                                 std::forward_as_tuple(std::move(tmp))).first;

And how could this reorganization, made in a recursive call, affect the validity of the iterator in the parent call made via GetCoin(…)? For me the semantics of std::move(tmp) are not clear. The doc says this is a destructive read, and the order of evaluation of this expression (which calls forward_as_tuple twice) may have already invalidated the state of the outpoint iterator, making tmp possibly invalid, or possibly the two forward iterators are overlapping, possibly processing the same transaction twice, including the second time after it was removed from the chain. These are some parts of the C++11 library that I do not understand clearly. Shouldn’t we call std::move(tmp) first, before creating the two tuples?

I initially did not want to debug the program, so I still do not understand its logic entirely. If you have hints about possible patches I could make inside the code to get more useful traces, it would be really nice.

That code, using recursion and self-modifying iterators (with heavy C++11-based optimizations, apparently to avoid copy-constructions by using move semantics and trying to get automatic management of the lifetime for indexed coin transactions, to determine when to delete or merge blocks indexed in the chainstate or purge/prune/split them) is really complex and I’m not sure I completely understand how all this works. It seems that some objects are deleted too early, because their references are “lost” in the middle, allowing some objects (still persisting somewhere on the stack after a function return) to be deleted too early, or to be unexpectedly overwritten elsewhere in the application, if this is a case of “use-after-free” (e.g. if some coin transactions checked here fall on the starting or ending boundary between two successive “blocks” in the chainstate cache).
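For reference, the standard-library pieces in the quoted statement can be exercised in isolation. This is a minimal sketch, assuming the cache behaves like a `std::unordered_map`; `Entry`, `Cache` and `InsertEntry` are hypothetical names, and `Entry` is not Bitcoin Core's coin type. It only illustrates the language rules: `std::move` is a cast that moves nothing by itself, and `std::forward_as_tuple` builds tuples of references, so the order in which the two tuples are created does not consume `tmp`.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a cache entry; not Bitcoin Core's Coin type.
struct Entry {
    std::string payload;
    explicit Entry(std::string p) : payload(std::move(p)) {}
};

using Cache = std::unordered_map<int, Entry>;

// Inserts `tmp` under `key`, mirroring the shape of the quoted statement.
Cache::iterator InsertEntry(Cache& cache, int key, std::string& tmp) {
    // std::move(tmp) is only a cast to an rvalue reference: nothing moves here.
    // forward_as_tuple stores references, so neither tuple copies or consumes
    // its argument. The real move happens inside emplace, when Entry is
    // constructed in place from the second tuple.
    return cache.emplace(std::piecewise_construct,
                         std::forward_as_tuple(key),
                         std::forward_as_tuple(std::move(tmp))).first;
}
```

After a successful insertion `tmp` is left in a valid but unspecified state, so it should not be read again without being reassigned. For `std::unordered_map`, a later insertion can invalidate iterators if it triggers a rehash, while references and pointers to existing elements stay valid.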

MarcoFalke commented at 2:36 pm on August 16, 2022: member

No modification of the sources,

So it is version v23.99.0-22d96d76ab02-dirty

“dirty” means you modified it

Note that I CANNOT use gettxoutsetinfo

You can use it after a restart and then compare the output with the expected output at that height (from another node)

verdy-p commented at 3:38 pm on August 16, 2022: none

Dirty means that I have applied the two patches described in #25800 (and only those): one on the dbwrapper, and missing initializers to 0 for peace of mind. I think they are safe, but without them the crashes occur much sooner.

Also I don’t need to execute an RPC gettxoutsetinfo because the log above already shows the block heights in the gdb output.

Consensus::CheckTxInputs (tx=..., state=..., inputs=..., nSpendHeight=661434, txfee=@0x7fff3338dcd0: 0)

and the debug.log output (just before the crash and I get the stack trace in gdb) also shows:

2022-08-16T08:41:09.703806Z [loadblk] [validationinterface.cpp:227] [BlockConnected] [validation] Enqueuing BlockConnected: block hash=0000000000000000000b5996770f8489d67422b9797569fc4f7ff78d06029943 block height=661433

So this has crashed when processing block height=661434 (the previous height was successfully processed). However that still does not explain anything (unless you know how to inspect precisely what is so special for the existing block at that height in the main chain).

Later I tried to detect when CCoinsViewCache::FetchCoin recurses: I used a global static counter initialized to 0, which I increment just before line [src/coins.cpp:48] and decrement just after (this global static counter is safe as there’s a single thread calling that function, at least in the IBD phase; I do not modify any other variable or object). Additionally, just before the increment I test if (counter) and, when it is nonzero, log its value with a basic LogPrintf("Recurse %d", counter);.

The result is that this line [src/coins.cpp:48] is extremely rarely reached (this is apparently when a transaction is not fully spent and its parent is not already in the cache, but may be very old and has been purged from the cache, forcing it to be reloaded and changing the current view of the cache, which has then unexpectedly changed in the parent of the recursive call). It passes well most of the time, but when it crashes, this is always when there has been some pruning in the LevelDB cache (because it was full) to allow another block to be loaded (in which case the cache may contain the data of any other unrelated block and no longer the data of the block being processed).

There’s apparently nothing in CCoinsViewCache that instructs leveldb NOT to purge the current block from its cache if you call FetchCoin() multiple times to load blocks of parent transactions (we have not kept any local copy of that block when processing it; we apparently treat it directly from the cache view). Also note that purging the cache in LevelDB does not necessarily invalidate the virtual memory addresses used to map a block file; that virtual address may be reused for newer blocks loaded just after.

What I am attempting to find is a reproducible case, because none has been found for a long time (but such strange crashes or corruptions have been experienced by MANY people if you look at online forums, complaining that IBD never succeeds; they were all told that their machine was not reliable, but I don’t think this is the case for most of them; so all of these users have just given up). This bug seems to be very sensitive to memory usage (notably the fill level of the dbcache, so it depends on how much memory you’ve configured for it; but this cache fill level is ALSO sensitive to the current network activity, notably when running with default options that accept incoming connections, which may request random blocks no longer present in the cache, including “getheaders” requests coming from other remote nodes in IBD phase).

And this is probably related to the recently reported bug #25632.
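The counter described above could be wrapped in a small scope guard so that the decrement cannot be skipped on an early return or an exception. This is a sketch under the same single-thread assumption; `DepthGuard`, `g_depth`, `g_max_depth` and `Nested` are hypothetical names that do not exist in Bitcoin Core, and `std::fprintf` stands in for the `LogPrintf` call.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical instrumentation helper; not part of Bitcoin Core.
// Only safe while a single thread uses it, as assumed in the comment above.
static int g_depth = 0;      // current nesting level
static int g_max_depth = 0;  // deepest nesting seen so far

class DepthGuard {
public:
    DepthGuard() {
        if (g_depth > 0) {
            // In bitcoind this would be a LogPrintf call instead.
            std::fprintf(stderr, "Recurse %d\n", g_depth);
        }
        ++g_depth;
        if (g_depth > g_max_depth) g_max_depth = g_depth;
    }
    ~DepthGuard() { --g_depth; }  // runs on every exit path of the scope
    DepthGuard(const DepthGuard&) = delete;
    DepthGuard& operator=(const DepthGuard&) = delete;
};

// Toy function that nests `levels` times to exercise the guard.
int Nested(int levels) {
    DepthGuard guard;
    if (levels <= 1) return g_depth;  // depth reached at the innermost call
    return Nested(levels - 1);
}
```

Declaring a `DepthGuard` as the first statement of the instrumented function is enough; a multi-threaded variant would need a `thread_local` or atomic counter instead.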

verdy-p commented at 6:48 pm on August 16, 2022: none

You can use it after a restart and then compare the output with the expected output at that height (from another node)

Not really:

- If we use another node, it will have to be able to connect to the faulting node, which the faulting node does not allow before it has checked some parts (the last 6 blocks), possibly rearranged them to collect the tips, and then restarted to reindex the chain. The RPC server will also take some time to become responsive (initially the node has other priorities, including making outgoing connections to other full nodes, but not accepting an incoming connection).
- If we just restart the faulting node, the same happens, and its local RPC client (bitcoin-cli) will also have to wait.

In both cases, we get the last value for ‘gettxoutsetinfo’ too late, and it may not even be there at all after the initial recovery. So we have to rely on what was last displayed by the faulting node in its log, or what could be retrieved in the debugger by looking at the stack trace when the fatal exception occurred (and this is exactly what I did above).

The alternative is to use a third party leveldb tool to scan its database without making any write to it (no attempt to repair it, just like what bitcoind does when we start it: the reason of the fault is lost at that time).

So the best thing is always using logs, or a debugger (which is not easily possible with the released version, as debug info has been stripped), or other tracing systems if they are implemented (like DTRACE, running with an “eBPF” helper of the kernel on Linux). To get enough info we just have -debug=all, which must be activated before the crash, but if it takes a very long time to reach the case where there’s a crash, we need a giant log file.

The alternative is to run the daemon with -nodebuglog, but then capture the output on the standard terminal with a program that will preserve some significant amount of logs in a rotating buffer (e.g. the last 10000 lines, hoping that this will be enough). But even in this case, we may not capture enough info even with “-debug=all”: the only solution in that case is to modify the sources to add additional custom logs (with non-destructive tests to perform a useful selection of the cases we suspect; notably we should compile with the debug option to enable “asserts” and safety checks that may be already present or added for that purpose). This requires some modifications of the program, so isolating bugs is not easy; without them it’s hard to isolate problems and finally recreate the conditions in a test case that will be added to the project, to make sure it won’t happen again in future releases.

If you see better solutions, I’m ready to hear them! (notably proposed patches that I could insert in the sources, or that could be hosted in Git in some branch, so that I can compile with them to get better diagnostics and finally isolate the bug, because I’m now sure this is a bug of Bitcoin Core itself, and not a bug of its upstream library dependencies or of the OSes or machines on which I’ve tested the current releases without any success for months, after trying many configurations).
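The rotating buffer mentioned above can be sketched in a few lines. This is only an illustration of the idea, assuming the log arrives line by line on an input stream; `TailLines` is a hypothetical helper, not an existing tool.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <istream>
#include <sstream>
#include <string>

// Keeps only the most recent `capacity` lines read from `in`.
// Hypothetical helper for illustration; not an existing tool.
std::deque<std::string> TailLines(std::istream& in, std::size_t capacity) {
    std::deque<std::string> tail;
    std::string line;
    while (std::getline(in, line)) {
        tail.push_back(line);
        if (tail.size() > capacity) tail.pop_front();  // drop the oldest line
    }
    return tail;
}
```

A small filter program could call `TailLines(std::cin, 10000)` with the daemon's terminal output piped into it and print the returned lines once the input ends, which happens when the daemon aborts. Memory use stays bounded by the capacity, however long the run lasts.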

verdy-p commented at 8:00 pm on August 16, 2022: none

Hmmm… Now that I compile bitcoind with “debug=true” set in config.status, and after “make clean; make”, I cannot reproduce the bug at all. It reappears with “debug=false” (I did not change the code at all between the two), even if it’s rarer.

Are there known issues in gcc with regard to “move” semantics that cause unsafe optimizations? Both versions were compiled with “-O2”; just “-g” is added (and stripping the debug symbols or not has no effect). Are we hitting a case of “undefined behavior” somewhere in the standard C++ library that we could detect with third-party analysis tools?

Is there another way to trace the release version (debug=false)? How do we use DTRACE to get some info with help of the kernel (provided that it has support of eBPF)?

Is it worth removing the “-g” option? bitcoind is still very fast even if some optimizations are disabled (we are more bound by I/O and network bandwidth than by the very few extreme optimizations possibly performed by gcc).

Note that I compiled with gcc 10.2.1-6 (20210110), the version natively provided in Debian bullseye. Should I retry using Clang instead? Which C++ dialect should we use: C++11 (C++17 is used by gcc), if there were changes in the semantics of the standard library impacting us?

MarcoFalke commented at 7:55 am on August 22, 2022: member

Yeah, sure you can try clang++, but I don’t think this makes any difference.

You may also enable the gettxoutsetinfo index, to allow querying the utxo set hash at any previous height to see if and where it diverges from a “nominal” node.

MarcoFalke commented at 6:39 am on August 23, 2022: member

Let’s continue the data corruption issues in #25800, as they should all be related

MarcoFalke closed this on Aug 23, 2022

verdy-p commented at 7:29 pm on September 6, 2022: none

Yeah, sure you can try clang++, but I don’t think this makes any difference.

You may also enable the gettxoutsetinfo index, to allow querying the utxo set hash at any previous height to see if and where it diverges from a “nominal” node.

I have already applied the 3 other indexes supported. They work, but not with the v22 and v23 released binaries, only after I have made some changes discussed above (and in #25800, including the change for the logging buffer no longer allocated on the stack but allocated once on the heap, as it rapidly causes stack corruption); I have also compiled with lock contention debugs, but they are not the cause; and had to disable the non-working #25632 (causing crashes) and a non-working check option (also crashing as it causes deadlocks). Compile time debug options (-DDEBUG_LOCK* or -g) do not seem to have any effect.

But now if it ever crashes I always set the option to check the chainstate at level 4 (not 3 by default), and it seems to always recover from “lost chaintips”; it does not cause excessive launch time, just about 15-20 more seconds.

I’m more concerned by the fact that RPC threads do not start early, and that they are too restricted (4 only) when they could be allocated dynamically from a larger pool if needed (if just one thread is too busy and does not reply immediately, it hangs bitcoin-cli, which needs 4 RPC requests just to complete a “-getinfo”, or any other RPC request).

And I’m still concerned by the stderr output made by the UPNP library (it should pass through logs, and I wonder if it’s possible to either patch that library or compile it with a macro to force it to use our logger).

bitcoin locked this on Sep 6, 2023

IBD core dumped during recursive call to CCoinsViewCache::FetchCoin (while connecting best chain tip) #25857