Ensure wallet and chain tip are in sync

bvbfan commented at 2:34 pm on March 6, 2020: contributor

Signed-off-by: Anthony Fieroni bvbfan@abv.bg

fanquake added the label Wallet on Mar 6, 2020

in src/wallet/wallet.cpp:1163 in ddcc4b23b1 outdated

1155@@ -1156,8 +1156,11 @@ void CWallet::BlockUntilSyncedToCurrentChain() {
1156     // ::ChainActive().Tip(), otherwise put a callback in the validation interface queue and wait
1157     // for the queue to drain enough to execute it (indicating we are caught up
1158     // at least with the time we entered this function).
1159-    uint256 last_block_hash = WITH_LOCK(cs_wallet, return m_last_block_processed);
1160-    chain().waitForNotificationsIfTipChanged(last_block_hash);
1161+    bool tipIsSame = false;
1162+    do {
1163+        uint256 last_block_hash = WITH_LOCK(cs_wallet, return m_last_block_processed);
1164+        tipIsSame = chain().waitForNotificationsIfTipIsNotSame(last_block_hash);

MarcoFalke commented at 2:58 pm on March 6, 2020:

Shouldn’t this whole scope happen under the cs_wallet lock?

Also, shouldn’t m_chain_notifications_handler.reset() happen inside here under the wallet lock?

Otherwise a new block could come in and put new notifications in the handler while we reset it.

bvbfan commented at 3:10 pm on March 6, 2020:

Shouldn’t this whole scope happen under the cs_wallet lock?

No, we leave mutex for process thread to finish.

Also, shouldn’t m_chain_notifications_handler.reset() happen inside here under the wallet lock?

No.

Otherwise a new block could come in and put new notifications in the handler while we reset it.

I’m aware of that, we can have a function that takes a mutex and reset notifications

promag commented at 3:00 pm on March 6, 2020: member

I think this is the wrong approach.

bvbfan commented at 3:16 pm on March 6, 2020: contributor

I think this is the wrong approach.

That’s at least ensure sync between wallet and chain tips otherwise all calls to BlockUntilSyncedToCurrentChain are just pointless, they don’t do what is supposed to do. Your patch looks fine but not what we want when calls a function like BlockUntilSyncedToCurrentChain

DrahtBot commented at 2:45 am on March 7, 2020: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#18338 (wip: Fix wallet unload race condition by promag)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

bvbfan commented at 9:22 am on March 7, 2020: contributor

@promag the problem is not the queued connection / disconnection, actually

0void UnregisterValidationInterface(CValidationInterface* pwalletIn) {
1    if (g_signals.m_internals) {
2        g_signals.m_internals->m_connMainSignals.erase(pwalletIn);
3    }
4}

signal is received before disconnect is called by erasing the caller.

Fix unit test and rebase.

bvbfan commented at 10:47 am on March 7, 2020: contributor

Let’s clarify a bit more.

0void disconnect() override
1{
2    if (m_notifications) {
3        m_notifications = nullptr; // <---------------- it makes no sense to be before unregister
4        UnregisterValidationInterface(this);
5    }
6}

Since ValidationInterfaceConnections calls destructors in reverse order, for a bad luck UpdatedBlockTip is disconnected last which is not held the mutex but update time by atomic variable. Running in testnet for at least an hour in parallel with

0#!/bin/bash
1while [ 1 ]
2do
3    src/bitcoin-cli -testnet loadwallet test
4    src/bitcoin-cli -testnet unloadwallet test
5done

no crashes anymore.

ariard commented at 4:47 am on March 9, 2020: member

I think I’ve hit into #16307 issue while doing refactoring on this part of the code. Current refactoring direction is to move towards the wallet and chain being independent tips, not more synchronized so another way to solve this would be to reset handler correctly while holding wallet lock and then test handler still exists in BlockConnected/BlockDisconnected/…, flushing correctly and committing right locator means worst-case we redo a rescan at wallet loading.

Doing it race-condition free likely need a handler lock though (or `NotificationsHandlerImpl locking wallet…)

bvbfan commented at 1:06 pm on March 10, 2020: contributor

while holding wallet lock and then test handler still exists in BlockConnected/BlockDisconnected

i don’t think that’s right approach to me.

in src/interfaces/chain.cpp:163 in 78a88ef667 outdated

159@@ -160,8 +160,8 @@ class NotificationsHandlerImpl : public Handler, CValidationInterface
160     void disconnect() override
161     {
162         if (m_notifications) {
163-            m_notifications = nullptr;

ryanofsky commented at 4:35 pm on March 10, 2020:

This change seems clearly correct. Good catch! Maybe m_notifications should be an atomic pointer too since m_notifications is get and set from different threads, though probably the synchronization done by boost signals & the scheduler prevents bugs from it not being atomic

bvbfan commented at 8:30 am on March 11, 2020:

Maybe m_notifications should be an atomic pointer too since m_notifications is get and set from different threads, though probably the synchronization done by boost signals & the scheduler prevents bugs from it not being atomic

You’re right

in src/interfaces/chain.cpp:349 in 78a88ef667 outdated

345@@ -346,13 +346,14 @@ class ChainImpl : public Chain
346     {
347         return MakeUnique<NotificationsHandlerImpl>(*this, notifications);
348     }
349-    void waitForNotificationsIfTipChanged(const uint256& old_tip) override
350+    bool waitForNotificationsIfTipIsNotSame(const uint256& tip) override

ryanofsky commented at 4:39 pm on March 10, 2020:

I like the name change, and I can see how the having the bool return value should make races and the deadlock reported #16307 (comment) less likely, though I suspect there is also probably a more direct and reliable fix we can find

in src/wallet/wallet.cpp:114 in 78a88ef667 outdated

109@@ -110,8 +110,8 @@ static void ReleaseWallet(CWallet* wallet)
110     const std::string name = wallet->GetName();
111     wallet->WalletLogPrintf("Releasing wallet\n");
112     wallet->BlockUntilSyncedToCurrentChain();
113-    wallet->Flush();
114     wallet->m_chain_notifications_handler.reset();
115+    WITH_LOCK(wallet->cs_wallet, wallet->Flush());

ryanofsky commented at 4:46 pm on March 10, 2020:

Acquiring the cs_wallet lock here is unexpected. If this is really neccessary, I think the lock should be acquired closer to where it’s used inside the Flush implementation, probably with a comment if the purpose of the lock isn’t more obvious there

bvbfan commented at 8:15 am on March 11, 2020:

I think there is a misunderstanding here (maybe my own), but my understanding is that the point of BlockUntilSyncedToCurrentChain is just to ensure that if you make two RPC calls in a row the second RPC call is always aware of the result of first call. Ensuring that notification in the queue from the time of the call are processed is sufficient to do this, without trying to guarantee that the wallet tip and node tips are equal.

Yes, but mostly after BlockUntilSyncedToCurrentChain lock is acquired and there is no guarantee that notification is faster than this held, i.e. you can leave BlockUntilSyncedToCurrentChain and don’t do what you want, you gain mutex before notification thread and you’re not sync anything.

bvbfan commented at 8:17 am on March 11, 2020:

Here after BlockUntilSyncedToCurrentChain is ensuring it’s on current tip, the lock tries to guarantee that notification thread is finished before we flush and release memory.

ryanofsky commented at 4:59 pm on March 10, 2020: member

re: #18279 (comment)

That’s at least ensure sync between wallet and chain tips otherwise all calls to BlockUntilSyncedToCurrentChain are just pointless, they don’t do what is supposed to do. Your patch looks fine but not what we want when calls a function like BlockUntilSyncedToCurrentChain

I think there is a misunderstanding here (maybe my own), but my understanding is that the point of BlockUntilSyncedToCurrentChain is just to ensure that if you make two RPC calls in a row the second RPC call is always aware of the result of first call. Ensuring that notification in the queue from the time of the call are processed is sufficient to do this, without trying to guarantee that the wallet tip and node tips are equal.

Checking tips for equality is just an optimization to avoid waiting for the queue, and guaranteeing that the tips are equal isn’t actually possible (despite the attempt here) because BlockUntilSyncedToCurrentChain isn’t called with any locks held.

bvbfan commented at 1:21 pm on March 11, 2020: contributor

@promag, @ryanofsky i have another, looking better to me, solution ensuring there is no background tasks, but i’ve not push since it’s not well tested

 0diff --git a/src/scheduler.cpp b/src/scheduler.cpp
 1index 7cb7754fd..17c26be74 100644
 2--- a/src/scheduler.cpp
 3+++ b/src/scheduler.cpp
 4@@ -199,12 +199,16 @@ void SingleThreadedSchedulerClient::AddToProcessQueue(std::function<void ()> fun
 5 }
 6 
 7 void SingleThreadedSchedulerClient::EmptyQueue() {
 8-    assert(!m_pscheduler->AreThreadsServicingQueue());
 9-    bool should_continue = true;
10+    if (!m_pscheduler->AreThreadsServicingQueue())
11+        return;
12+    auto hasCallbacks = [this]() -> bool {
13+        LOCK(m_cs_callbacks_pending);
14+        return !m_callbacks_pending.empty();
15+    };
16+    auto should_continue = hasCallbacks();
17     while (should_continue) {
18         ProcessQueue();
19-        LOCK(m_cs_callbacks_pending);
20-        should_continue = !m_callbacks_pending.empty();
21+        should_continue = hasCallbacks();
22     }
23 }
24 
25diff --git a/src/validationinterface.cpp b/src/validationinterface.cpp
26index 1deb93c97..87ade954c 100644
27--- a/src/validationinterface.cpp
28+++ b/src/validationinterface.cpp
29@@ -90,6 +90,7 @@ void RegisterValidationInterface(CValidationInterface* pwalletIn) {
30 void UnregisterValidationInterface(CValidationInterface* pwalletIn) {
31     if (g_signals.m_internals) {
32         g_signals.m_internals->m_connMainSignals.erase(pwalletIn);
33+        g_signals.FlushBackgroundCallbacks();
34     }
35 }
36 
37diff --git a/src/wallet/wallet.cpp b/src/wallet/wallet.cpp
38index 4c36caed8..51732c0ae 100644
39--- a/src/wallet/wallet.cpp
40+++ b/src/wallet/wallet.cpp
41@@ -111,7 +111,7 @@ static void ReleaseWallet(CWallet* wallet)
42     wallet->WalletLogPrintf("Releasing wallet\n");
43     wallet->BlockUntilSyncedToCurrentChain();
44     wallet->m_chain_notifications_handler.reset();
45-    WITH_LOCK(wallet->cs_wallet, wallet->Flush());
46+    wallet->Flush();
47     delete wallet;
48     // Wallet is now released, notify UnloadWallet, if any.
49     {

Flushing callback right after disconnecting signals should effectively ensure we don’t have pending and it will not need atomic notification pointer or so.

ryanofsky commented at 2:10 pm on March 11, 2020: member

re: #18279 (comment)

@promag, @ryanofsky i have another, looking better to me, solution ensuring there is no background tasks, but i’ve not push since it’s not well tested

This fix is very similar to c3722190bc4f0efb76008dff16c6721ef0248856 from #18280, and probably we should review and merge #18280 when it’s ready and close this PR.

In my opinion though, waiting for notifications is a fragile fix that will slow down wallet unloading and node shutdown without good reason. A better fix would be to switch to shared pointers instead of raw pointers and fix the use-after-delete bug more directly: 06d2b530e57b88c0542e2bd8db00ca0b034b589d (branch, comment)

bvbfan commented at 2:22 pm on March 11, 2020: contributor

This fix is very similar to c372219 from #18280, and probably we should review and merge #18280 when it’s ready and close this PR.

Still no to me, changes in BlockUntilSyncedToCurrentChain are significant to me.

bvbfan commented at 2:25 pm on March 11, 2020: contributor

Also sync should be in UnregisterValidationInterface to be applied in all cases.

ryanofsky commented at 3:02 pm on March 11, 2020: member

Still no to me, changes in BlockUntilSyncedToCurrentChain are significant to me.

Of course no need to close this PR if you want to make other improvements, but I hope you can review #18280 when it’s ready as a fix for the original issue. Also would encourage closing this PR and opening a new one if the new version will differ substantially, so the review comment history will be intelligible.

I would also caution that while BlockUntilSyncedToCurrentChain can definitely be cleaned up, its purpose is ensuring consistent results between RPC calls and it shouldn’t be changed to block longer than actually necessary to fix real bugs.

Also sync should be in UnregisterValidationInterface to be applied in all cases.

This is only true when unregistering wallets, as far as I know. Not things like indexes. Also if we fix shared_ptr reference counting with 06d2b530e57b88c0542e2bd8db00ca0b034b589d (branch, comment) it should not even be necessary for wallets. No reason conceptually wallets should have to wait for events they don’t care about just to fix a pointer lifetime issue

bvbfan commented at 3:31 pm on March 11, 2020: contributor

This is only true when unregistering wallets, as far as I know. Not things like indexes.

No, that’s not correct to me,

Maybe m_notifications should be an atomic pointer too since m_notifications is get and set from different threads, though probably the synchronization done by boost signals & the scheduler prevents bugs from it not being atomic

doing in UnregisterValidationInterface it ensures notification is not accessed in race condition.

promag commented at 4:01 pm on March 11, 2020: member

No reason conceptually wallets should have to wait for events they don’t care about just to fix a pointer lifetime issue

Agree with this.

From #18280:

From boost signal documentation at https://www.boost.org/doc/libs/1_72_0/doc/html/signals2/thread-safety.html:

When a signal is invoked by calling signal::operator(), the invocation first acquires a lock on the signal’s mutex. Then it obtains a handle to the signal’s slot list and combiner. Next it releases the signal’s mutex, before invoking the combiner to iterate through the slot list.

This means that UnregisterValidationInterface doesn’t prevent more calls to that interface.

But just calling SyncWithValidationInterfaceQueue (which BlockUntilSyncedToCurrentChain might call) doesn’t fully fix the issue because on shutdown the scheduler thread is stopped earlier.

Obviously there are various possible fixes, like #18280 or 06d2b53. But this approach is clearly not good.

bvbfan closed this on Mar 15, 2020

DrahtBot locked this on Aug 16, 2022

Ensure wallet and chain tip are in sync #18279

Conflicts