ChainstateManager locking improvements #21620

jamesob wants to merge 7 commits into bitcoin:master from jamesob:2021-04-cs-lock-improvements, changing 3 files (+66 −64)
  1. jamesob commented at 7:12 pm on April 6, 2021: member

    Hopefully this’ll make everyone happy, and clarify some chainstate locking issues going forward.

    Background

    There is some amount of confusion because I chose to have cs_main serve two purposes:

    • its original purpose: guarding legacy CChainState-related data structures (i.e. pre-ChainstateManager coverage of chainActive), and
    • its new purpose: guarding pointers to the constituent CChainState instances managed by ChainstateManager, e.g. m_ibd_chainstate, m_active_chainstate.

    I initially thought that reusing cs_main would be a simplification, but it turned out to be mostly confusing.

    As a quick refresher, ChainstateManager manages two constituent CChainState instances (m_ibd_chainstate, m_snapshot_chainstate) and exposes a pointer to the one in active use (m_active_chainstate) per the assumeutxo semantics. As @jnewbery has pointed out, the active pointer doesn’t need to be guarded by a mutex; it can be marked atomic to avoid explicit locking.

    Throughout the course of deglobalizing chainstate use, @dongcarl has had to embed cs_main acquisitions in the active chainstate accessor functions (e.g. ::ChainstateActive()) to avoid the risk of unserialized access to the ChainstateManager::m_active_chainstate state. This is suboptimal because callers are almost always already holding cs_main anyway, so we just end up recursively acquiring the lock in many cases. And fundamentally, cs_main use isn’t necessary for that once we decouple chainstate pointer management from cs_main.
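
    To make the recursion concrete, the accessor before these changes looks roughly like this (simplified; the assert/return lines match the rg output quoted in the first commit below):

    ```
    CChainState& ChainstateActive()
    {
        // Acquired even though callers almost always hold cs_main already,
        // so this typically recurses into an already-held RecursiveMutex.
        LOCK(::cs_main);
        assert(g_chainman.m_active_chainstate);
        return *g_chainman.m_active_chainstate;
    }
    ```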

    These changes

    • make ChainstateManager::m_active_chainstate atomic to avoid requiring explicit locking (per @jnewbery’s suggestion),
    • introduce ChainstateManager::m_cs_chainstates to relieve cs_main of guarding the internal chainstates managed by ChainstateManager,
    • remove the cs_main acquisitions in ::ChainstateActive() and ChainstateManager::ActiveChainstate(), which are no longer necessary, and
    • annotate CChainState::m_chain as guarded by cs_main, which was an implicit requirement that I’m just making explicit here. (A rough sketch of the resulting structure follows.)
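
    Put together, a rough sketch of the shape these changes give ChainstateManager (member names from this PR; types and details simplified, so treat this as illustrative rather than the exact diff):

    ```
    class ChainstateManager {
    private:
        //! New in this PR: guards the constituent chainstates, relieving cs_main.
        //! Recursive so that internal callers (e.g. MaybeRebalanceCaches()) can
        //! take it without the lock needing public visibility.
        RecursiveMutex m_cs_chainstates;
        std::unique_ptr<CChainState> m_ibd_chainstate GUARDED_BY(m_cs_chainstates);
        std::unique_ptr<CChainState> m_snapshot_chainstate GUARDED_BY(m_cs_chainstates);

        //! Now atomic: readable without explicit locking.
        std::atomic<CChainState*> m_active_chainstate{nullptr};

    public:
        //! No longer requires cs_main.
        CChainState& ActiveChainstate() const
        {
            CChainState* chainstate = m_active_chainstate.load();
            assert(chainstate);
            return *chainstate;
        }
    };
    ```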
  2. validation: unannotate g_chainman from needing cs_main
    In preparation for removing `cs_main` requirements for
    ChainstateActive()/ActiveChainstate(), relax lock requirements for
    g_chainman.
    
    I have verified that all usages of `g_chainman` that are concurrency-sensitive
    are usages of its members (e.g. `m_blockman`), which are already annotated. The
    references to `g_chainman.m_active_chainstate` will be protected by use of
    `std::atomic` in later commits.
    
    Here are all non-`BlockManager`-related usages:
    
    ```
    % rg --pcre2 --trim 'g_chainman(?!.m_blockman)'
    
    src/net_processing.cpp
    1271:assert(std::addressof(g_chainman) == std::addressof(m_chainman));
    
    src/validation.h
    994:extern ChainstateManager g_chainman;
    
    src/init.cpp
    1423:node.chainman = &g_chainman;
    
    doc/release-notes/release-notes-0.21.0.md
    577:- #19927 Reduce direct `g_chainman` usage (dongcarl)
    
    src/validation.cpp
    105:ChainstateManager g_chainman;
    109:assert(g_chainman.m_active_chainstate);
    110:return *g_chainman.m_active_chainstate;
    174:assert(std::addressof(g_chainman.BlockIndex()) == std::addressof(m_block_index));
    
    src/node/interfaces.cpp
    479:assert(std::addressof(g_chainman) == std::addressof(*m_node.chainman));
    489:assert(std::addressof(g_chainman) == std::addressof(*m_node.chainman));
    502:assert(std::addressof(g_chainman) == std::addressof(*m_node.chainman));
    514:assert(std::addressof(g_chainman) == std::addressof(*m_node.chainman));
    516:assert(std::addressof(g_chainman) == std::addressof(*m_node.chainman));
    525:assert(std::addressof(g_chainman) == std::addressof(*m_node.chainman));
    527:assert(std::addressof(g_chainman) == std::addressof(*m_node.chainman));
    
    src/test/util/setup_common.cpp
    143:m_node.chainman = &::g_chainman;
    
    src/qt/test/apptests.cpp
    89:UnloadBlockIndex(/* mempool */ nullptr, g_chainman);
    90:g_chainman.Reset();
    ```
    361dd79c27
  3. validation: make ChainstateManager.m_active_chainstate atomic
    This will allow us to not have to worry about repurposing cs_main as
    a mutex for setting this state, and will allow us to remove additional
    `cs_main` acquisitions and annotations in subsequent commits.
    3d949629c5
  4. validation: remove cs_main requirement from m_active_chainstate
    Atomicity of this data no longer requires explicit locking.
    09c924d3be
  5. validation: introduce ChainstateManager.m_cs_chainstates
    To avoid (confusingly) repurposing cs_main to manage internal
    ChainstateManager chainstate references, introduce a specific lock for this
    purpose.
    
    It's a RecursiveMutex (and not a Mutex) to facilitate acquiring the lock
    in MaybeRebalanceCaches, since we'd otherwise have to give the lock public
    class visibility (to allow callers to lock it).
    5b8587f0bf
  6. doc: fix outdated ChainstateManager doc 98480e293b
  7. doc: note about InitializeChainstate() cs_main use 7ef10328fc
  8. validation: annotate m_chain with cs_main
    Which in turn requires `PruneBlockIndexCandidates()` to be annotated;
    luckily no locking changes are needed there.
    8257e9ab75
  9. jamesob force-pushed on Apr 6, 2021
  10. DrahtBot added the label Validation on Apr 6, 2021
  11. ryanofsky commented at 8:19 pm on April 6, 2021: member

    It’s great for this to be removing the recursive locks in 09c924d3beb56a6b64d8855b1818b6e73c40ec0d and 5b8587f0bf5076d6651ec67d4a941034b5287429. Also annotating CChainState::m_chain in 8257e9ab7592a7fe459089f021053c7740004817 makes a lot of sense.

    The thing I don’t understand is what bug is prevented by making m_active_chainstate std::atomic<CChainState*> instead of CChainState*? What bugs are prevented by adding LOCK(m_cs_chainstates); or having m_cs_chainstates exist at all?

    Not putting EXCLUSIVE_LOCKS_REQUIRED(cs_main) on ChainstateManager methods seems like Six Flags saying congratulations, we’re waiving the $100 fee to get into the park, we’re just going to charge you $100 the first time you get onto any ride. It seems like this could only be appealing if you wanted to go to a CChainState theme park without trying any rides (accessing any data inside).

  12. DrahtBot commented at 11:51 pm on April 6, 2021: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #21789 (refactor: Remove ::Params() global from CChainState by MarcoFalke)
    • #21584 (Fix assumeutxo crash due to invalid base_blockhash by MarcoFalke)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  13. jamesob commented at 2:08 am on April 7, 2021: member

    Thanks for the quick look @ryanofsky.

    The thing I don’t understand is what bug is prevented by making m_active_chainstate std::atomic<CChainState*> instead of CChainState*? What bugs are prevented by adding LOCK(m_cs_chainstates); or having m_cs_chainstates exist at all?

    Strictly speaking, I don’t think there’s a difference in bugs between this approach and just repurposing cs_main to guard this stuff. But there are a number of usability/maybe-performance arguments. I think the major arguments are:

    • cs_main is a lock that’s used for many different things, and eventually (I think?) we want to break it up into more granular, specified locks - though I could be wrong about that,
    • it’s kind of nice that callers don’t have to worry about any locks, if indeed this std::atomic approach works the way I claim it does, because…
    • otherwise, if we apply a GUARDED_BY(::cs_main) to m_active_chainstate, we’ll need to do awkward stuff like temporarily holding cs_main just to acquire the active chainstate before we call ActivateBestChain() (roughly the pattern sketched below), since you can’t go into ABC holding cs_main.
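
    For illustration, the awkward pattern that a GUARDED_BY(::cs_main) annotation would force on callers might look like this (a sketch; the surrounding code is hypothetical):

    ```
    CChainState* chainstate;
    {
        // Hold cs_main only long enough to read the guarded pointer...
        LOCK(::cs_main);
        chainstate = &g_chainman.ActiveChainstate();
    }
    // ...then release it before the call, because ActivateBestChain() must
    // not be entered with cs_main held (it takes and releases cs_main
    // internally).
    BlockValidationState state;
    chainstate->ActivateBestChain(state, chainparams, /*block=*/nullptr);
    ```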

    I think your Six Flags analogy is both entertaining and true (in a strict sense), but if everyone generally agrees that we eventually need to break up cs_main to avoid unnecessary lock contention, then I think these changes are a step in that direction.

    But otherwise I’m happy to separate out the commits that you mention and just annotate everything with cs_main if it makes more sense for that lock to live on in its spirit of “I lock every damn thing related to chainstate,” which may be easier to reason about than having three or four more granular locks running around.

    But it does seem appealing to decouple, say, block storage locking from chain locking so that we can do things like prune while continuing validation. And longer term I suspect that each chainstate instance will want its own little cs_main so that we can do various chainstate operations in parallel should something like utreexo come around.

  14. ryanofsky commented at 9:28 pm on April 8, 2021: member

    Will review in more detail, and I’m all for lock granularity if locks are independent. But if we’re just turning one lock into two dependent locks where you acquire one lock, release it, then always acquire the other lock, it seems pointlessly complex. It increases the number of states things can be in and makes thread safety annotations less likely to catch race conditions. If there is theoretically a place where having these two locks could be better for performance, I would be curious to know where it is. The ActivateBestChain example just seems like a case of weird code organization that should be cleaned up, not a case where two locks make more sense for code as it is executed.

    I think the clearest thing would be for this low level code not to have any synchronization: no locks or atomics, just plain annotations stating locking assumptions so higher level code can call it correctly. I’ve suggested this before #19806#pullrequestreview-584279133, and it can be done separately from this PR. This PR does make other improvements, and I’m happy with all the changes, and planning to review it.

  15. dongcarl referenced this in commit 2bf5597966 on Apr 13, 2021
  16. dongcarl commented at 1:51 am on April 13, 2021: member
    I haven’t read everything yet, and plan on doing it soon. Q for you: Does this mean that even if we hold ::cs_main, we won’t be sure that the m_active_chainstate won’t change from under us?
  17. dongcarl referenced this in commit 4d7947f31c on Apr 13, 2021
  18. jamesob commented at 3:12 pm on April 13, 2021: member

    Good points @ryanofsky @dongcarl. After digesting this a little, I actually think that both this approach and our current approach have bugs.

    Consider the case where ActivateSnapshot() is called at this moment during ProcessNewBlock(). Neither holding cs_main nor the atomic approach here for guarding m_active_chainstate will prevent a different active chainstate from having been used earlier in PNB, creating a situation where we’ve called AcceptBlock() on a different chainstate than the one we’ve called ActivateBestChain() on.

    The narrow solution in this case is maybe obviously to grab a single chainstate at the top of the function (instead of calling ActiveChainstate() for each usage), but I’m trying to think about what the right arrangement of locks and annotations would be to prevent this situation in general.
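
    A sketch of that narrow fix (parameters and most of the body elided; a hypothetical simplification of ProcessNewBlock):

    ```
    bool ProcessNewBlock(/* ...params elided... */)
    {
        // Grab a single reference up front, instead of calling
        // ::ChainstateActive() at each use site, so that AcceptBlock() and
        // ActivateBestChain() run against the same chainstate even if
        // ActivateSnapshot() swaps the active pointer mid-call.
        CChainState& chainstate = ::ChainstateActive();

        // ... chainstate.AcceptBlock(...)        [under cs_main] ...
        // ... chainstate.ActivateBestChain(...)  [cs_main released] ...
        return true;
    }
    ```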

  19. ryanofsky commented at 3:40 pm on April 13, 2021: member

    The narrow solution in this case is maybe obviously to grab a single chainstate at the top of the function (instead of calling ActiveChainstate() for each usage)

    Yes, please. This was on my unimplemented review suggestion followup list. #20749 (review) #20749#pullrequestreview-578845627 #21025#pullrequestreview-581735705 “Best would be to have a CChainState* chain_state local variable that is initialized while cs_main is held above, and then used here.”

    but I’m trying to think about what the right arrangement of locks and annotations would be to prevent this situation in general.

    It seems to me replacing synchronization with annotations in ChainstateManager (removing the locks and atomic and adding EXCLUSIVE_LOCKS_REQUIRED(cs_main) as suggested) would ensure callers use it correctly, or at least have to think about using it correctly.
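
    In code, the annotation-only arrangement would look roughly like this (a sketch: no locks or atomics inside the manager, just compiler-checked assumptions):

    ```
    class ChainstateManager {
    private:
        //! A plain pointer; the annotation, not any synchronization inside
        //! the manager, is what forces callers to hold cs_main around access.
        CChainState* m_active_chainstate GUARDED_BY(::cs_main){nullptr};

    public:
        CChainState& ActiveChainstate() EXCLUSIVE_LOCKS_REQUIRED(::cs_main)
        {
            assert(m_active_chainstate);
            return *m_active_chainstate;
        }
    };
    ```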

  20. ryanofsky commented at 5:05 pm on April 14, 2021: member

    Offline thread:

    <james>: I think we need cs_active_chainstate or something that functions like PNB either acquire or are annotated with to prevent the active chainstate from being swapped out during their usage

    <russ>: I’m not trying to claim adding lock annotations will prevent all ChainstateManager callers from being buggy and wrongly assuming chainstate doesn’t change when they didn’t lock it. I don’t think there is anything ChainstateManager can do to prevent that unless it’s going to go into the kernel and freeze all running threads. My claim is just that bugs are less likely if locking cs_main locks the chainstate and callers are forced to lock cs_main and to think about locking when they use ChainstateManager.

    <james>: The kind of sad thing is that that swap-out only happens once during runtime (i.e. when ActivateSnapshot() is called) and really we could avoid a lot of this mess if we restricted snapshots to being loaded at init time

    <russ>: I don’t see what’s that complicated about this. If you don’t want the active chainstate to change, then you should lock cs_main. If you don’t care whether the active chainstate changes then you can take a reference to the chainstate and not lock cs_main.
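
    In code, the two usages russ describes come out roughly as follows (illustrative; chainman is a hypothetical reference to the ChainstateManager):

    ```
    // 1) Don't want the active chainstate to change: hold cs_main across
    //    the whole critical section.
    {
        LOCK(::cs_main);
        CChainState& chainstate = chainman.ActiveChainstate();
        // ... the active chainstate cannot be swapped out in here ...
    }

    // 2) Don't care whether the active pointer changes later: take a
    //    reference once, then keep using that object without cs_main.
    CChainState* pinned;
    {
        LOCK(::cs_main);
        pinned = &chainman.ActiveChainstate();
    }
    // ... chainman may switch its active chainstate, but *pinned is still
    //     the object we grabbed ...
    ```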

  21. ryanofsky commented at 5:11 pm on April 14, 2021: member

    If there is a case where it’s necessary for code to release cs_main but prevent the active chain from changing, I would suggest a condition variable:

    ```
    bool m_lock_chainstate = false;
    std::condition_variable m_lock_chainstate_cv;
    ```

    Then in the code that wants to update the chainstate:

    ```
    m_lock_chainstate_cv.wait(cs_main, []{ return !m_lock_chainstate; });
    ```
    

    And the code that wants to lock the chainstate without holding cs_main:

    ```
    m_lock_chainstate = true;
    m_lock_chainstate_cv.notify_all();

    ...

    m_lock_chainstate = false;
    m_lock_chainstate_cv.notify_all();
    ```
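
    For completeness, a minimal self-contained sketch of the scheme (my assumptions: a plain std::mutex stands in for cs_main, since std::condition_variable requires std::unique_lock<std::mutex>; all names are illustrative):

    ```
    #include <condition_variable>
    #include <mutex>

    std::mutex g_cs_main_stub;             // stand-in for cs_main
    bool g_lock_chainstate = false;        // "chainstate is pinned" flag
    std::condition_variable g_lock_chainstate_cv;

    // Code that wants to swap the active chainstate: wait until nobody has
    // pinned it, then do the swap while still holding the lock.
    void SwapActiveChainstate()
    {
        std::unique_lock<std::mutex> lock(g_cs_main_stub);
        g_lock_chainstate_cv.wait(lock, [] { return !g_lock_chainstate; });
        // ... swap the active chainstate pointer here ...
    }

    // Code that wants the active chainstate to stay stable without holding
    // the lock for the whole duration of its work.
    void UsePinnedChainstate()
    {
        {
            std::unique_lock<std::mutex> lock(g_cs_main_stub);
            g_lock_chainstate = true;
        }
        g_lock_chainstate_cv.notify_all();

        // ... long-running work; the lock is released, but the swap above
        //     cannot proceed until we unpin ...

        {
            std::unique_lock<std::mutex> lock(g_cs_main_stub);
            g_lock_chainstate = false;
        }
        g_lock_chainstate_cv.notify_all();
    }
    ```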
  22. ryanofsky commented at 5:20 pm on April 14, 2021: member
    Actually I guess that could be equivalent to your active_chainstate mutex if it used manual locking instead of a scoped lock. So maybe I’ll arrive at the same conclusion as you from a different angle. It just seems like there are a lot of solutions here, and there shouldn’t be any conflict between a good solution and one that uses compiler thread safety annotations.
  23. jnewbery commented at 10:53 am on April 15, 2021: member

    I haven’t looked at the code here yet, but I want to respond to a couple of the points raised in the discussion here:

    Not putting EXCLUSIVE_LOCKS_REQUIRED(cs_main) on ChainstateManager methods seems like Six Flags saying congratulations, we’re waiving the $100 fee to get into the park, we’re just going to charge you $100 the first time you get onto any ride. It seems like this could only be appealing if you wanted to go to a CChainState theme park without trying any rides (accessing any data inside).

    I think the clearest thing would be for this low level code not to have any synchronization: no locks or atomics, just plain annotations stating locking assumptions so higher level code can call it correctly.

    Most of the client code (eg in net_processing, rpc, etc) shouldn’t need to lock chainman in most cases. Usually it’s just trying to read a single piece of data (whether we’re in IBD, what the current height is, fetch a CBlockIndex pointer to the tip, etc). The higher level code shouldn’t have to concern itself with locks in those cases, and in fact those operations can be made lock-free by using atomics to cache the data (as IsInitialBlockDownload() used to do, but now sadly requires cs_main in most cases). Removing cs_main locks from the rpc methods and the GUI’s node/interface methods would ensure that RPC/GUI is never blocking validation from doing useful work. Reducing cs_main usage in net_processing would open the possibility to parallelizing IBD (either with multiple net_processing threads or a separate validation thread).
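
    For example, the caching pattern being described might look like this (a hedged sketch along the lines of the latch IsInitialBlockDownload() historically used; names here are illustrative):

    ```
    // Once we have left IBD we never re-enter it, so the answer can be
    // latched into an atomic and read lock-free from then on.
    static std::atomic<bool> g_finished_ibd{false};  // illustrative name

    bool IsInitialBlockDownloadSketch()
    {
        if (g_finished_ibd.load(std::memory_order_relaxed)) return false;
        LOCK(::cs_main);
        // Re-check under the lock, then evaluate the real conditions.
        if (g_finished_ibd.load(std::memory_order_relaxed)) return false;
        const bool still_ibd = /* tip height / chain work / tip age checks */ false;
        if (!still_ibd) g_finished_ibd.store(true, std::memory_order_relaxed);
        return still_ibd;
    }
    ```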

    There are also cases where the client code does want to carry out multiple operations atomically on chainman. For those cases, I think it makes sense to expose cs_main to the caller and enforce it with annotations.

  24. ryanofsky commented at 12:09 pm on April 15, 2021: member

    Most of the client code (eg in net_processing, rpc, etc) shouldn’t need to lock chainman in most cases. Usually it’s just trying to read a single piece of data (whether we’re in IBD, what the current height is, fetch a CBlockIndex pointer to the tip, etc).

    Yes, I think we are on the same page. When you just need one piece of data, this is what atomics are for and it makes sense to switch a variable from T to atomic<T> or T* to atomic<T*> so you can read it without locking. And for an API:

    ```
    T GetData() EXCLUSIVE_LOCKS_REQUIRED(cs_main);
    T GetCachedData();
    ```

    is nice. But you can make individual variables atomic or not without affecting anything else more broadly. I do think it makes sense to delay switching a variable from T to atomic<T> until you have at least one piece of code that’s reading it without a lock.
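
    Fleshed out slightly, that accessor pair might look like this (a sketch with hypothetical names; the writer keeps the atomic copy in sync under the lock):

    ```
    class Example {
    private:
        int m_data GUARDED_BY(::cs_main){0};
        std::atomic<int> m_cached_data{0};

    public:
        //! Authoritative value; caller must hold cs_main.
        int GetData() const EXCLUSIVE_LOCKS_REQUIRED(::cs_main) { return m_data; }

        //! Possibly slightly stale copy, readable without any lock.
        int GetCachedData() const { return m_cached_data.load(); }

        void SetData(int v) EXCLUSIVE_LOCKS_REQUIRED(::cs_main)
        {
            m_data = v;
            m_cached_data.store(v);  // keep the lock-free copy in sync
        }
    };
    ```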

  25. DrahtBot commented at 9:32 am on May 3, 2021: member

    🕵️ @sipa has been requested to review this pull request as specified in the REVIEWERS file.

  26. DrahtBot added the label Needs rebase on May 12, 2021
  27. DrahtBot commented at 8:46 pm on May 12, 2021: member

    🐙 This pull request conflicts with the target branch and needs rebase.

    Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a “draft”.

  28. jamesob closed this on Aug 5, 2021

  29. DrahtBot locked this on Aug 16, 2022
