Large locking change #1995

alexanderkjeldaas commented at 6:37 PM on November 9, 2012: contributor

A while back I started looking at bitcoin, and I was disappointed with the code quality wrt locking.

I tried to fix the code to be clang -Wthread-safety clean, but I don't know the code enough to understand some of the entanglements that are currently present.

I've added lots of annotations that specifies the locking requirements on functions and data. This might seem a little overzealous, but I believe that this is necessary to "get back on track" and make the locking strategy clear and consistent.

Even with this patch there are lots of locking issues left - probably benign like things happening during startup and shutdown. There are also false positives from me not getting clang to grok TRY_LOCK correctly.

Below is a long note I wrote while looking through the source code a while back. Maybe you guys agree or maybe not - at least it explains the whats and the whys.

For an object that is accessed by multiple threads we have, for every field:

It is const, constant after initialization, or guarded by a mutex.
Re-using an object is a bad idea. It should never be done. If we do, we are mutating the constant-after-initialization fields. This leads to exessive locking and overcomplicated code. It is a lot simpler to discard the object and re-create it. This works, because the object is owned by some other object, and in that other object, it is a "guarded by mutex" field. Thus by re-creating the object, we lift the responsibility for locking into the parent.
A result of the above is that "constructors" must be separate from other functions in the class. Constructors that are not strictly C++ constructors, such as deserializing should be static functions if possible. This makes it impossible for them to to be used to mutate constant-after-initialization fields.
Locks should not be exported outside of a class. Java did this mistake and it is considered an anti-pattern.
Locks should never serialize code, they should protect data from being accessed while invariants on that data are broken.
Locks should be textually close to the fields. Given a lock, it should be easy to understand what it guards, and given a field, it should be easy to see which locks guard that field. Thus, locks should be private members of a class.

A good way to obfuscate this is to put a lock in a parent class, as done in keystore.h where CKeyStore. It is hard to know what that lock guards. It is a lot cleaner to duplicate that lock in the data section of the subclasses.
Recursive locks should never be used. They lead to bugs, and are only necessary in order to write sloppy code. If some functionality is both a transaction that ensures that the invariants of the data are obeyed, and part of larger transactions, then it is much clearer to separate these conserns into two functions:

void DoFoo() { LOCK(cs); DoFooUnlocked(); } void DoFooUnlocked() { ... }
Using locks to protect simple setters and getters is often buggy and/or complex. In most cases, mutating some state is part of a larger transaction, so the setter is often protected by some other lock external to the class. In these cases, re-building a new object is usually the best approach, but otherwise, explicitly marking setters as SetFooUnlocked() and not lock is clearer. It makes it explicit that you can't just mutate randomly without doing some external locking.
Do not hold locks over callbacks out of an object. This means that objects should signal some state change, and not guarantee which state the object is in when a callback happens. Sometimes, it makes sense to break this, but then the callback will have to be treated like a signal handler, or irq handler - i.e. it can only do simple things and not call out to other random code.
Every time you call TRY_LOCK, a kitten dies. The reason why this is bad is that it makes reasoning, and testing the code incredibly hard. TRY_LOCK is basically this:

if (rand() % 1) { DoA(); } else { DoB(); }

It is simply unnecessary randomness. The TRY_LOCK macro should be removed, it kills kittens.

If something like TRY_LOCK is truly needed, then at least it should be abstracted out of the actual functions, so that the functions themselves are testable. Thus in the above case, DoA() would be a separate, testable, function.

In some really large systems (millions of large tps), sharding work queues by lock is a reasonable design to reduce resource usage by threads (so you do cs_someLock.DoWorkNowOrWhenUnlocked(&DoA), and it is more efficient than TRY_LOCK, but for bitcoin - this is just overkill.
Having lots of semantically different locks in a class is not ideal. For example CNode has 4 locks. This means that there are 4 sets of invariants that can be established independently of each other in this class. To put it another way, when you call a function on this object, it should work on a subset of the state, or it should hold all 4 locks. This suggests that either the class is a composite of multiple objects, and should be broken up, or the number of locks should be reduced.
We cannot compose multiple critical sections which all take a lock into a large critical section. After we have released locks, all checks that we have done are invalidated. Most of CWallet is just plain wrong and based on this false assumption. Locking is done in CCryptoKeyStore instead of in CWallet. See above "Locks should be textually close to the fields" for a reason why people confuse themselves.
There is a lot of convoluted code like this: bool result; { LOCK(cs_KeyStore); result = (mapKeys.count(address) > 0); } return result;

which is equivalent to this:
```
LOCK(cs_KeyStore);
return (mapKeys.count(address) > 0);
```
The destructor is guaranteed to run after the expression is evaluated. There is no need to over-engineer.

Similarly, there is a lot of code like this:

void SomeFunction(...) { something(); { LOCK(cs_something); ... } }

or this: void SomeFunction(...) { { LOCK(cs_something); .. rest of function ... } return false; }

There is no point in all the excessive blocks. What are they for?

laanwj commented at 11:41 AM on November 10, 2012: member

Need to look at this in detail, but I certainly applaud the effort of improving the locking situation. I dread having to debug deadlocks in the current source.

I also took a look at this in the past (see #1801), but I didn't have the patience at that time to unentangle the complex web of locks. Also the nested mutexes are indeed a bad idea.

If you go through the difficulties of making sense of it all, it may be wise to switch from shared state completely and switch to a message-passing architecture. This will also allow running the different parts in different locked-down processs eventually, increasing robustness and security. But it's a lot of work.

alexanderkjeldaas commented at 3:21 PM on November 10, 2012: contributor

On 10 November 2012 08:41, Wladimir J. van der Laan < notifications@github.com> wrote:

Need to look at this in detail, but I certainly applaud the effort of improving the locking situation. I dread having to debug deadlocks in the current source.

I also took a look at this in the past (see #1801 https://github.com/bitcoin/bitcoin/issues/1801), but I didn't have the patience at that time to unentangle the complex web of locks. Also the nested mutexes are indeed a bad idea.

If you go through the difficulties of making sense of it all, it may be wise to switch from shared state completely and switch to a message-passing architecture. This will also allow running the different parts in different locked-down processs eventually, increasing robustness and security. But it's a lot of work.

Actually what I am doing is a series of transforms.

First I annotate data and locks based on evidence in the code. So I see that a given lock is taken before mutation of a variable, then that locks protects the variable. The annotations on functions are again just deduced from that. Then the compiler will prove inconsistencies, and that is when one either have to change the annotation, or change the locking code. The latter is not automatic but depends on knowing the code in question.

I understand that the result of all of the above can be somewhat hard to digest as a patch. I can make it easier by splitting it into smaller pieces, but first I'd like to get thumbs up for this kind of annotations. It really is a domain-specific type system available in clang.

I've done a large backend system at Google using this strategy, and not once did I have to debug locking issues. And this in a very popular product. Unfortunately, at this point in time I don't have much time to dedicate to this (I'm doing a startup, so I need to be selective about my time).

If you are interested, I can also insert TODOs for other developers to look at, for dubious locking/APIs that are too entangled. I have personally found that having a lot of TOODs in the code makes it easier for others to understand suboptimal code that they can reasonably try to fix when they refactor a piece of code, but others find that it makes the code look even more horrible than it is. YMMV.

Smaller sized changes would be something like:

Add the annotation support
Do the necessary changes to sync.h to support annotations
Annotate most of the classes/interfaces and do transformations of methods that are involved in recursive locking.
Create a patch that does the actual locking changes, including the compler-deduced inconsistency at each point.
Remove the compiler comments from the code.

Alexander

— Reply to this email directly or view it on GitHubhttps://github.com/bitcoin/bitcoin/pull/1995#issuecomment-10254336.

sipa commented at 5:38 PM on November 10, 2012: member

I'm certainly in favor of improving the locking system. I didn't know about clang's -Wthread-safety, but it certainly looks like a nice way to improvements. If we go that way, I'd like to see it completed as well, so that probably means rebasing from time to time, and following up. Unfortunately, refactoring patches are often not considered high priority, and they conflict easily with other patches.

About the locks in general, I think most of the uglyness/entanglement is the result of lack of encapsulation. Many objects expose their inner state publically, and are also accessed/modified from almost everywhere. I think we made at least some improvement since Satoshi's days (there wasn't even a CWallet or CkeyStore, and all was done from within main), but there's a long road ahead still.

For the block chain, I've been planning to add a class to separate the block tree (mapBlockChain, pblocktree & co) and the coin state (pcoinsTip, hashBestChain) entirely after ultraprune (which is now merged). Ideally, both get guarded by separate locks (so cs_main can go), and encapsulate them in well-behaved classes.

Regarding callbacks, @TheBlueMatt has worked on creating a message queue for the core before, so callbacks to wallet could run in separate threads. This never got merged, unfortunately.

laanwj commented at 8:58 PM on November 10, 2012: member

I agree that doing this in multiple phases is a good idea, and that adding the annotations would be a good first step, so that we can all get an idea of the kind of warnings that clang gives in this case.

gavinandresen commented at 10:37 PM on November 10, 2012: contributor

Thumbs up from me on improving the locking; ideas for how we can test to make sure we're not making things worse as we improve the code are welcome.

alexanderkjeldaas merged this on Nov 11, 2012

alexanderkjeldaas closed this on Nov 11, 2012

HashUnlimited referenced this in commit 5e4c1fca71 on Mar 21, 2018

KolbyML referenced this in commit b09752dc69 on Dec 5, 2020

bitcoin locked this on Sep 8, 2021