How to avoid logic races (and recursive mutexes)?

MarcoFalke commented at 8:42 am on June 10, 2022: member

Currently many mutexes are recursive to accommodate for callers that have the mutex already locked. However, people seem not to like recursive mutexes, see also #19303.

Moreover, there are many logic races in the codebase. (Not data races, which are UB, but code that returns valid, but logically inconsistent data. For example #25077).

For example, if a method wants to return the latest block hash and the height of that block, it could do:

0return { chain.tip_hash(), chain.tip_height() };

Assuming that chain has a private lock, which is acquired in its member methods. Though, in a multi-threaded context this may return a stale hash, which does not correspond to the height. So it would be nice to have a way to fix that, even if all locks are private and non-recursive.

So I was thinking about a helper that returns a locked chain to the caller:

0auto lc{chain.locked()};
1return { lc.tip_hash(), lc.tip_height() };

Does this sound like a good idea?

MarcoFalke added the label Brainstorming on Jun 10, 2022

hebasto commented at 9:07 am on June 10, 2022: member

0return { chain.tip_hash(), chain.tip_height() };
Assuming that chain has a private lock, which is acquired in its member methods. Though, in a multi-threaded context this may return a stale hash, which does not correspond to the height.

Redesign a class API?

0const auto tip = chain.tip();
1const auto tip_hash = tip.hash;
2const auto tip_height = tip.heigh;

MarcoFalke commented at 9:21 am on June 10, 2022: member

I think this wouldn’t work if you also want to get a CCoinsViewCache and CCoinsViewMemPool and call their methods while holding the lock.

MarcoFalke commented at 12:29 pm on June 10, 2022: member

cc @ryanofsky @ajtowns @vasild

ryanofsky commented at 2:59 pm on June 10, 2022: member

To be concrete, is the idea is that the locked chain “lc” class would lock cs_main when it is constructed, and unlock it when it destructed?

If so, I think that can be a nice interface, and it is similar to the to the interfaces::Chain::Lock class which wallet code was using before #16426

Another benefit besides making locking easier to enforce is that it could eliminate explicit references to cs_main, and distinguish uses of cs_main for locking chain state from uses of cs_main for other reasons, and maybe make it easier to split up cs_main in the future.

A risk of this approach is that the locked chain class could turn into a monolithic class doing a lot of unrelated things. Also it sounds like introducing it might require a lot of code changes, though that would depend on the details.

Seems like a useful idea to explore

MarcoFalke commented at 3:31 pm on June 10, 2022: member

A risk of this approach is that the locked chain class could turn into a monolithic class doing a lot of unrelated things.

Yeah, my hope was that the returned object inherits the interface.

Something like:

(Could be annotated with https://clang.llvm.org/docs/ThreadSafetyAnalysis.html#scoped-capability ?)

 0
 1template<typename T>
 2struct Locked{
 3 T& fwd;
 4 Locked(T&t) : fwd{t} { t.lock(); };
 5 ~Locked() { t.unlock(); };
 6};
 7
 8void stuff() {
 9  Locked lc{node.chainman};
10  lc.fwd.tip_hash();
11  ...
12}

Obviously this only works with recursive mutexes, if we assume that tip_hash() also takes the lock.

ajtowns commented at 0:42 am on June 13, 2022: member

I think something like this makes sense for the combination of chain tip / utxo set / mempool access, since that is complicated, and controls a whole mess of different data structures – and assumeutxo means we’ve now sometimes got two tips/utxo sets, which makes it even more complicated.

For the general case, I don’t think the template really adds much over what we do now.

ajtowns commented at 1:54 am on June 13, 2022: member

I guess what I’m thinking is something like:

 0class KernelChain
 1{
 2public:
 3    CChainState& chainstate;
 4    CoinsViews& coinviews;
 5private:
 6    DebugLock<decltype(::cs_main)> main_lock;
 7};
 8
 9class KernelMemPool
10{
11public:
12    CTxMempool& mempool;
13private:
14    DebugLock<decltype(mempool.cs)> mempool_lock;
15};
16
17class KernelChainAndMemPool : public KernelChain, public KernelMemPool
18{
19};
20
21class ChainstateManager
22{
23public:
24    CBlockIndex* GetTip() const;
25    KernelChain GetChain();
26    KernelChainAndMemPool GetChainMemPool();
27
28    KernelChain GetIBDChain();
29};

If we figured out reader/writer locks; we could maybe have a const KernelChain& GetChainRO() const; to allow multiple threads simultaneous read-only access to the chain/mempool.

The idea above is that you only have access to the mempool once you’ve already taken both mempool.cs, and lose access to it at exactly the same time as you lose the lock; so none of the mempool methods need to worry about locking at all (and the mempool.cs could be moved into ChainstateManager, I guess), and likewise for cs_main.

vasild commented at 9:21 am on June 13, 2022: member

» A good interface is impossible to misuse «

What @hebasto suggested above is in-line with that - get all fields that need to be consistent with each other in one call. Replace GetX() and GetY() with GetState() (or GetSnapshot()) which returns both X and Y. That is also pretty simple to implement, understand and to use. It is dumb, which is good. Code should be dumb.

There may be some problems around the other idea about Locked lc{node.chainman}:

one could still call node.chainman.tip_hash() directly without creating lc beforehand
it opens the possibility to pass around lc as argument to functions, transferring mutex ownership, which opens up a can of worms and is better avoided if possible
the caller can unintentionally hold it locked for too long
will not work with thread safety annotations?
it would require a recurstive mutex? I think that every time a recursive mutex is needed there is some design flaw or at least low-level coding flaw.

MarcoFalke commented at 1:56 pm on June 13, 2022: member

GetState, KISS.

You may be over-simplifying things to the extend where it becomes a performance regression. If you wanted to check some outpoints that aren’t spent in the mempool against the utxo set, you’d have to create a snapshot of the whole utxo set, and a copy of the whole mempool, as it is not possible to create a snapshot of the mempool. (See for example #25077 )

I guess an alternative would be to pass lambdas into chainman methods to be executed when iterating chainman data structures.

vasild commented at 3:26 pm on June 13, 2022: member

Right, the GetState() approach is not suitable in all cases. I guess it is best to decide on a case-by-case basis. The lambda approach sounds nice.

ajtowns commented at 3:58 pm on June 13, 2022: member

Having now looked more at #25077 I’m a big concept ack on doing something along these lines; the current interface is a mess, and this does seem like the simplest way of addressing it cleanly.

ajtowns commented at 6:09 am on June 14, 2022: member

I guess the other thing to consider is that aliasing makes any analysis hard, including thread safety – if you do “foo.bar = 3” then it’s easy to lookup what bar is and whether you need a mutex; but when you do “x = &foo; x->bar = 3;” that becomes a lot harder.

So perhaps a better approach is to avoid making aliases of things that are protected by external mutexes. So if you’ve got:

0struct X {
1    Mutex m;
2    int i GUARDED_BY(m);
3};
4X x;

then it’s fine to say X& x2 = x; and make an alias of the entire structure, but it’s not fine to say int* pi = &x.i. ChainstateManager::ActiveChain in particular is returning an alias of m_chain which doesn’t have an internal mutex protecting itself, and isn’t natively thread safe. But, at least in theory, an alias of CChainState could be safe, since it’s already designed to be accessed from multiple threads and has its own internal locking logic to be thread safe.

MarcoFalke commented at 12:59 pm on June 15, 2022: member

Another thing is that referring to a mutex that is supposed to be private in a context outside its class also makes thread-safety annotations brittle, unless the mutex is made global and annotated with the recursive annotations or the mutex is made a recursive mutex to avoid double-locking.

People don’t seem to be liking globals and recursive mutexes, so while changing the code, it might be better to prefer an approach that doesn’t need either.

vasild commented at 3:53 pm on June 16, 2022: member

#25390 has some solution that is relevant to this.

MarcoFalke commented at 4:11 pm on June 16, 2022: member

Yeah, that works when the lock can live isolated inside the wrapping class. Otherwise my previous comment applies, so I don’t think we can do this with cs_main (unless we remove cs_main in a single commit, which I doubt will happen).

MarcoFalke commented at 2:01 pm on June 21, 2022: member

Hmm. Maybe the status quo isn’t perfect but good enough for now. It might be more fruitful to work on:

Making cs_main a member of chainman instead of it being a global
Maybe introduce more specific locks in classes that don’t need cs_main. That is a CChain lock, a BlockMan lock, …

Closing for now, but let me know if there is a comment that I missed that might be worth to look into further.

MarcoFalke closed this on Jun 21, 2022

DrahtBot locked this on Jun 21, 2023

How to avoid logic races (and recursive mutexes)? #25330