Improve deadlock detection #19157

issue vasild opened this issue on June 3, 2020
  1. vasild commented at 1:22 PM on June 3, 2020: contributor

    Background

    If two or more threads acquire mutexes in different orders, that can cause a deadlock. Currently we have two mechanisms for detecting that: our DEBUG_LOCKORDER checks and the thread sanitizer (TSan).

    Problem

    Both methods may fail to detect some deadlocks:

    | deadlock type | detected by DEBUG_LOCKORDER | detected by TSan | detected by the proposed solution |
    | --- | --- | --- | --- |
    | A => B => C => A | :x: | :heavy_check_mark: | :heavy_check_mark: |
    | test case `deadlock_unlock_not_last` | :x: | :heavy_check_mark: | :heavy_check_mark: |
    | test case `deadlock_3` | :x: | :x: <sup>[1]</sup> | :heavy_check_mark: |
    | A => B, restart the program, B => A <sup>[2]</sup> | :x: | :x: | :heavy_check_mark: <sup>[3]</sup> |

    <sup>[1]</sup> submitted as a bug report at https://github.com/google/sanitizers/issues/1258
    <sup>[2]</sup> I guess this is how the bug which #19132 fixes sneaked in
    <sup>[3]</sup> as long as just B => A is executed

    <details> <summary>test cases</summary>

    #include <array>
    #include <condition_variable>
    #include <cstdarg>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    
    #include <boost/test/unit_test.hpp>
    
    #include <sync.h> // Mutex, LOCK, ENTER_CRITICAL_SECTION, LEAVE_CRITICAL_SECTION
    
    class Event
    {
    public:
        void signal()
        {
            std::unique_lock<std::mutex> lk(m_mutex);
            m_occurred = true;
            m_cond.notify_all();
        }
    
        void wait()
        {
            std::unique_lock<std::mutex> lk(m_mutex);
            m_cond.wait(lk, [&]() { return m_occurred; });
        }
    
    private:
        bool m_occurred{false};
        std::condition_variable m_cond;
        std::mutex m_mutex;
    };
    
    static std::mutex printf_mutex;
    void printf_sync(const char* format, ...)
    {
        va_list ap;
        va_start(ap, format);
    
        {
            std::unique_lock<std::mutex> lk(printf_mutex);
            vprintf(format, ap);
        }
    
        va_end(ap);
    }
    
    BOOST_AUTO_TEST_CASE(deadlock_3)
    {
    #if 0
        // detected by DEBUG_LOCKORDER
        // not detected by the thread sanitizer (when DEBUG_LOCKORDER is disabled)
        constexpr size_t n_threads = 2;
    #else
        // deadlock is not detected by either one
        constexpr size_t n_threads = 3;
    #endif
    
        // t0: lock m0
        // t1: lock m1
        // t2: lock m2
        // t0: try to lock m1, waits for t1
        // t1: try to lock m2, waits for t2
        // t2: try to lock m0, waits for t0 => deadlock
    
        std::array<Mutex, n_threads> mutexes;
        std::array<std::thread, n_threads> threads;
        std::array<Event, n_threads> locked_own;
        std::array<Event, n_threads> try_to_lock_next;
    
        auto thread = [&](size_t i) {
            LOCK(mutexes[i]);
            printf_sync("thread%zu locked mutex%zu\n", i, i);
            locked_own[i].signal();
    
            try_to_lock_next[i].wait();
    
            const size_t next = (i + 1) % n_threads;
            printf_sync("thread%zu trying to lock mutex%zu\n", i, next);
            LOCK(mutexes[next]);
        };
    
        for (size_t i = 0; i < n_threads; ++i) {
            threads[i] = std::thread{thread, i};
        }
    
        for (size_t i = 0; i < n_threads; ++i) {
            locked_own[i].wait();
        }
    
        for (size_t i = 0; i < n_threads; ++i) {
            try_to_lock_next[i].signal();
        }
    
        for (size_t i = 0; i < n_threads; ++i) {
            threads[i].join();
        }
    }
    
    BOOST_AUTO_TEST_CASE(deadlock_unlock_not_last)
    {
        // t0: lock m0
        // t0: lock m1
        // t0: unlock m0
        // t1: lock m0
        // t1: try to lock m1, waits for t0
        // t0: try to lock m0, waits for t1 => deadlock
    
        Mutex m0;
        Mutex m1;
        Event t1_locked_m0;
    
        std::thread t0{[&]() {
            ENTER_CRITICAL_SECTION(m0);
            LOCK(m1);
            LEAVE_CRITICAL_SECTION(m0);
            t1_locked_m0.wait();
            LOCK(m0);
        }};
    
        std::thread t1{[&]() {
            LOCK(m0);
            t1_locked_m0.signal();
            LOCK(m1);
        }};
    
        t0.join();
        t1.join();
    }
    

    </details>

    Solution

    Attach a predefined integer to each mutex that denotes its locking order relative to other mutexes. Then, whenever a thread attempts to lock a mutex, it would check that the most recently acquired mutex has a lower number.

    This would require some thorough consideration when adding a new mutex: when is it going to be used, which other mutexes will be held at the time the new one is acquired, and which other mutexes may be acquired while holding the new one. That is good practice and should be done anyway; such a mechanism would enforce it.

  2. vasild added the label Bug on Jun 3, 2020
  3. maflcko removed the label Bug on Jun 3, 2020
  4. maflcko added the label Feature on Jun 3, 2020
  5. maflcko commented at 1:29 PM on June 3, 2020: member

    Both methods may fail to detect some deadlocks

    It seems they fail to detect a deadlock, but isn't the deadlock happening good enough of an indication that something is wrong? For example running the tests and observing a reproducible timeout should be enough to hint to a deadlock, no?

  6. vasild commented at 1:42 PM on June 3, 2020: contributor

    Well, the DEBUG_LOCKORDER checks and the deadlock detection from tsan exist because deadlocks happening are not good enough to analyze the problem. A sporadic timeout on CI once per week, or worse, on the users' computers without any printouts is very difficult to analyze.

    Deadlocks are often elusive and not 100% reproducible. Even in the test case above, most of the code is there to ensure that the deadlock actually happens :)

  7. maflcko commented at 1:48 PM on June 3, 2020: member

    Concept ACK

  8. vasild commented at 7:21 PM on June 3, 2020: contributor

    I submitted the testcase at https://github.com/google/sanitizers/issues/1258. It looks like a bug to me.

  9. sipa commented at 7:33 PM on June 3, 2020: member

    @vasild Are you familiar with the dining philosophers problem?

  10. vasild commented at 6:53 AM on June 4, 2020: contributor

    @sipa, in the context of the dining philosophers problem I propose the resource hierarchy solution where all resources (mutexes) are related by order.

  11. sipa commented at 7:30 PM on June 4, 2020: member

    @vasild I'm not sure that works in the presence of multiple partially overlapping sets of dependent locks.

    Let's say the following lock orders exist: A->B A->C B->C (which is not a deadlock). Assume A->C is grabbed first, so you assign A=1, C=2; then A->B is grabbed, assigning B=3. Now B->C is grabbed which is considered a reversal and incorrectly treated as a deadlock.

  12. vasild commented at 8:12 AM on June 5, 2020: contributor

    Sorry, I should have elaborated more - by "predefined" I mean hardcoded in the source code (not assigned dynamically at runtime).

    In the above case the developer(s) would assign A=1, B=2, C=3. Then, if the code has a bug and locks in a different order, e.g. first B, then A, just executing that code path is enough to catch the problem. No need for two contradicting executions (A->B and B->A) as is required now.

    I counted about 70 mutexes in the entire code base, including tests. If it turns out that this is too big to do in one go, we can start by defining the order between some mutexes and assigning a "wildcard" to the rest, skipping the hierarchy check for them.

  13. maflcko commented at 11:51 AM on June 5, 2020: member

    This sounds similar to https://clang.llvm.org/docs/ThreadSafetyAnalysis.html#acquired-before-acquired-after, but I heard someone say they are not actually implemented in libcxx <sup>[Citation needed]</sup>

  14. hebasto commented at 11:53 AM on June 5, 2020: member
  15. sipa commented at 5:23 PM on June 5, 2020: member

    @vasild I think that's a very ugly solution, as it requires global coordination over the whole codebase. It makes components less reusable, and means that the introduction of code that now grabs two previously unrelated locks may result in requiring a renumbering of several locks.

    As performance does not matter that much, I think you could just run a naive cycle-finding algorithm (DFS traversal, caching which nodes you've already hit) to find issues.

  16. hebasto commented at 5:44 PM on June 5, 2020: member

    FWIW, a custom hierarchical_mutex mutex type and its usage are provided by Anthony Williams in his C++ Concurrency in Action, SE, 2019 (3.2.4 Deadlock: the problem and a solution -- Use a lock hierarchy).

  17. vasild commented at 7:20 AM on June 8, 2020: contributor

    All, thanks for the input! Next I will see if I can turn this into some code, so that we have something concrete to discuss.

    PS The cycle-finding algorithm is what TSan already does.

  18. adamjonas commented at 6:07 PM on August 3, 2022: member

    @vasild did you ever investigate this further?

  19. vasild commented at 1:22 PM on August 12, 2022: contributor

    No, but I don't think much investigation is needed. I just need to sit down and implement a hierarchical mutex (the resource hierarchy solution) and see how it will be accepted by the community. I have worked with a simple version of that in the InnoDB storage engine in MySQL that was written in the 90s and it was perfectly reasonable (thanks, Heikki! :heart:). I think it can be done in a more elegant way with the new language extensions that we have now in 2022.

  20. willcl-ark commented at 2:53 PM on October 21, 2025: member

    This issue hasn’t attracted much interest from other contributors in quite some time.

    Given that, it doesn’t seem important enough to keep open indefinitely. I’m going to close it for now due to lack of activity, but pull requests or renewed discussion are always welcome.

    Comment here if you think this should be re-opened.

  21. willcl-ark closed this on Oct 21, 2025


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-21 21:14 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me