#13023 highlights a concurrency issue in ABC that I think is worth more explanation and discussion, particularly as there are two bugs that should be fixed, and at least one of them is now causing occasional test failures in rpc_deprecated.py
on travis.
- A known issue in ABC is that if we try to connect a block and discover that is invalid, then we can release cs_main while chainActive.Tip() has less work than our prior tip. However, in such a situation, another thread can be invoking CheckBlockIndex() just as cs_main is released (eg on regtest – CheckBlockIndex isn’t run by default on mainnet), and see an assertion fail because the current tip is not in setBlockIndexCandidates (it would have been previously pruned because we know about a valid more-work tip).
A band-aid fix would be to add the current tip to setBlockIndexCandidates after disconnecting a block, but that would just be masking the bigger issue, which is that we should fix ABC/ABCStep to not release cs_main while we’re at a lower-work tip.
- The second bug is the one @skeees found in #13023, which I tried to more fully explain in #13023 (comment). For posterity:
If ActivateBestChain is invoked simultaneously in separate threads, then we can end up at a lower-work tip, and remain stuck until the next block comes in.
- Suppose in thread 1, we have just been delivered a block that causes us to try to activate two blocks in ABC. We connect 1 of them in ABCStep, and then release cs_main before connecting the second.
- In thread 2, suppose we have been delivered a 3rd block (say via rpc) that builds on the first two. It invokes ABC and gets to run after the first block has been connected in thread 1. It connects one block, releases cs_main, and then connects one more, and finishes.
- When thread 1 gets to run again, the most work chain has advanced, but (before this PR) we don’t refresh pindexMostWork (except when we find an invalid block). Consequently we would invoke ABCStep with a less work tip than our current tip(!). This would cause us to disconnect our tip and return. Some travis failures have been observed due to this bug, as seen for instance here: https://travis-ci.org/bitcoin/bitcoin/jobs/370848272. The test that sometimes fails, rpc_deprecated.py, generates blocks on two nodes roughly simultaneously, so one of the nodes is generating blocks in an rpc thread while also processing blocks on the network thread, which I believe is enough to trigger this bug.
I think both bugs should be fixed, though the second one is more severe (as it could result in an unlucky miner losing a block) even in the absence of any kind of attack or misbehavior.