net_processing: lock clean up #21527

ajtowns wants to merge 18 commits into bitcoin:master from ajtowns:202102-orphanworkset, changing 15 files (+257 −230)
  1. ajtowns commented at 4:16 am on March 25, 2021: contributor
    cs_sendProcessing is replaced by a private mutex in net_processing; non-orphan-specific state is moved out of g_cs_orphans, and g_cs_orphans is replaced by a private mutex in txorphanage.
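    An abridged sketch of where this ends up (pieced together from the diffs discussed below; simplified, not the complete class):

    #include <net.h>                     // NodeId
    #include <primitives/block.h>
    #include <primitives/transaction.h>
    #include <sync.h>
    #include <map>
    #include <set>

    class TxOrphanage {
    public:
        // Callers must not hold m_mutex; each method takes it internally.
        bool HaveTx(const GenTxid& gtxid) const EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);
        void AddChildrenToWorkSet(const CTransaction& tx, NodeId peer) EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);
        void EraseForPeer(NodeId peer) EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);
        void EraseForBlock(const CBlock& block) EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);

    protected:
        // Private mutex replacing the global g_cs_orphans.
        mutable Mutex m_mutex;
        // Per-peer sets of orphan txids queued for reconsideration.
        std::map<NodeId, std::set<uint256>> m_peer_work_set GUARDED_BY(m_mutex);
    };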
  2. fanquake added the label P2P on Mar 25, 2021
  3. in src/net_processing.cpp:3454 in 59cfe0c74b outdated
    @@ -3059,7 +3060,7 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
                 }
     
                 // Recursively process any orphan transactions that depended on this one
    -            ProcessOrphanTx(peer->m_orphan_work_set);
    +            ProcessOrphanTx(*peer);
    


    jnewbery commented at 8:28 am on March 25, 2021:
    What do you think about removing this call? The comment above is incorrect (we only process one tx at most, not recursively), and we’ll process the orphans in subsequent calls to ProcessMessages(). It seems strange that in this one case we can process up to two transactions in ProcessMessage() (the one that we just received, and up to one orphan child of it).

    ajtowns commented at 11:27 am on March 25, 2021:
    We could process 3 txs I think – the last orphan already in the workset (in ProcessMessages), one from a just received TX message, and an additional orphan whose parent was the contents of that TX message. I don’t think it’s problematic that way, but limiting it to one non-trivial ATMP call per ProcessMessages invocation could be reasonable.

    jnewbery commented at 11:49 am on March 25, 2021:
    Very good point. I hadn’t considered processing the orphan before processing the net message.
  4. in src/net_processing.cpp:3899 in 59cfe0c74b outdated
    @@ -3871,8 +3871,7 @@ bool PeerManagerImpl::ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt
         }
     
         {
    -        LOCK(g_cs_orphans);
    -        if (!peer->m_orphan_work_set.empty()) return true;
    +        if (more_orphans) return true;
    


    jnewbery commented at 8:44 am on March 25, 2021:

    How about moving this check up to immediately after the call to ProcessOrphanTx()? It’s not possible to have both more_orphans be true and have m_getdata_requests be non-empty:

    • ProcessOrphanTx(*peer) can only return true if there are elements in peer’s orphan work set, which can only be true if the last network message processed was a tx (m_peer_work_set only gets added to when processing a tx or when processing an element of m_peer_work_set, and we won’t process another network message until it’s empty)
    • m_getdata_requests can only be non-empty if the last network message processed was a getdata (m_getdata_requests only gets added to when processing a getdata, and we won’t process another message until it’s empty)

    That’d eliminate the need for a more_orphans variable, and make it immediately obvious that we’re not going to process an orphan and a getdata request on the same pass through ProcessMessages().
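    A sketch of the suggested reordering (hedged; close to the shape the PR eventually adopts, per the later commits below):

    // Sketch (not the exact PR code): check for remaining orphan work right
    // after processing one, so no more_orphans variable is needed.
    {
        LOCK(cs_main);
        // Returns true when this peer still has orphans queued to reconsider.
        if (ProcessOrphanTx(*peer)) return true;
    }

    if (pfrom->fDisconnect) return false;

    {
        LOCK(peer->m_getdata_requests_mutex);
        // Only reachable once the orphan work set is empty, so an orphan and
        // a getdata request are never handled on the same pass.
        if (!peer->m_getdata_requests.empty()) return true;
    }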

  5. in src/net_processing.cpp:3885 in 59cfe0c74b outdated
    -        LOCK2(cs_main, g_cs_orphans);
    -        if (!peer->m_orphan_work_set.empty()) {
    -            ProcessOrphanTx(peer->m_orphan_work_set);
    -        }
    +        LOCK(cs_main);
    +        more_orphans = ProcessOrphanTx(*peer);
    


    jnewbery commented at 8:47 am on March 25, 2021:

    Can be more concisely written as:

    diff --git a/src/net_processing.cpp b/src/net_processing.cpp
    index 02e2785c24..c90d73cb95 100644
    --- a/src/net_processing.cpp
    +++ b/src/net_processing.cpp
    @@ -3854,11 +3854,7 @@ bool PeerManagerImpl::ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt
             }
         }
     
    -    bool more_orphans = false;
    -    {
    -        LOCK(cs_main);
    -        more_orphans = ProcessOrphanTx(*peer);
    -    }
    +    const bool more_orphans{WITH_LOCK(cs_main, return ProcessOrphanTx(*peer);)};
     
         if (pfrom->fDisconnect)
             return false;
    
  6. DrahtBot commented at 8:57 am on March 25, 2021: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #25168 (refactor: Avoid passing params where not needed by MarcoFalke)
    • #24970 (net processing: Move cleanSubVer, fPreferredDownload and nLocalHostNonce to Peer by jnewbery)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  7. in src/net_processing.cpp:2392 in 59cfe0c74b outdated
    - *                                  reconsidered.
    - */
    -void PeerManagerImpl::ProcessOrphanTx(std::set<uint256>& orphan_work_set)
    +bool PeerManagerImpl::ProcessOrphanTx(Peer& peer)
     {
         AssertLockHeld(cs_main);
    


    jnewbery commented at 8:57 am on March 25, 2021:
    It’d be good to remove the EXCLUSIVE_LOCKS_REQUIRED(cs_main) and only take the cs_main lock inside the while loop (once we know that we have orphans to reprocess). In the vast majority of cases we’re taking and releasing cs_main for no reason because we have no orphans to process.

    ajtowns commented at 6:39 am on April 20, 2021:

    I think:

    • just call AddChildrenToWorkSet when a new tx arrives, don’t also immediately process them (ie don’t also call ProcessOrphanTx)
    • when ProcessOrphanTx does work, don’t continue on to also do a ProcessMessage, give another peer a chance to get some work done first
    • have ProcessOrphanTx take cs_main only when necessary
    • make the GetTxToReconsider signature simpler

    would all fit together nicely and be a good idea, but it also feels like it’s adding a bit much to this PR?
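    A rough sketch of how the last two bullets might look (hypothetical; this shape was deferred to the follow-up branch rather than done in this PR):

    bool PeerManagerImpl::ProcessOrphanTx(Peer& peer)
    {
        // Simpler signature: returns nullptr when the peer has no orphan to retry.
        while (CTransactionRef porphanTx = m_orphanage.GetTxToReconsider(peer.m_id)) {
            // cs_main is only taken once there is definitely work to do.
            LOCK(cs_main);
            const MempoolAcceptResult result = AcceptToMemoryPool(m_chainman.ActiveChainstate(), m_mempool, porphanTx, /*bypass_limits=*/false);
            if (result.m_result_type == MempoolAcceptResult::ResultType::VALID) {
                m_orphanage.AddChildrenToWorkSet(*porphanTx, peer.m_id);
            }
            // Simplified: the real code distinguishes failure types before erasing.
            m_orphanage.EraseTx(porphanTx->GetHash());
            // One non-trivial ATMP call done; return so other peers get a turn.
            return true;
        }
        return false;
    }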


    ajtowns commented at 1:44 pm on April 20, 2021:

    This PR changes orphan processing from being done for the peer that provided the parent to the peer that provided the orphan. I think that’s fine, but perhaps we should split the PR into one that does refactoring only and one that has behaviour changes.

    Hmm, might make sense to put the changes above together with the last commit in a separate PR, so this is just refactoring and the behaviour changes (who processes orphans, and how many orphans are processed in a cycle) are collected together.


    jnewbery commented at 2:00 pm on April 20, 2021:
    Yes, I agree that it’s better to split the PR into two: one that refactors and one that changes behaviour.

    ajtowns commented at 10:41 pm on April 22, 2021:
  8. in src/txorphanage.cpp:91 in 59cfe0c74b outdated
    @@ -87,24 +86,27 @@ int TxOrphanage::EraseTx(const uint256& txid)
     
     void TxOrphanage::EraseForPeer(NodeId peer)
     {
    -    AssertLockHeld(g_cs_orphans);
    +    LOCK(m_mutex);
    +
    +    const int work_erased = m_peer_work_set.erase(peer);
    


    jnewbery commented at 8:58 am on March 25, 2021:
    Well remembered to do this. Perhaps add a TxOrphanage::Empty() function that returns whether there are any elements in any of the containers in the orphanage, and then assert m_orphanage.Empty() in the FinalizeNode() logic when the last peer is deleted.

    ajtowns commented at 11:28 am on March 25, 2021:
    It crossed my mind that this might be another case where a multi index could be a win.
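    For reference, a minimal sketch of the suggested helper (hypothetical; m_outpoint_to_orphan_it is assumed here to be the orphanage's existing parent-outpoint index):

    bool TxOrphanage::Empty() const
    {
        LOCK(m_mutex);
        // True only once every container is cleaned up, so FinalizeNode()
        // could assert this when the last peer is removed.
        return m_orphans.empty() && m_peer_work_set.empty() && m_outpoint_to_orphan_it.empty();
    }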
  9. in src/txorphanage.cpp:104 in 59cfe0c74b outdated
    -            nErased += EraseTx(maybeErase->second.tx->GetHash());
    +            nErased += _EraseTx(maybeErase->second.tx->GetHash());
             }
         }
    -    if (nErased > 0) LogPrint(BCLog::MEMPOOL, "Erased %d orphan tx from peer=%d\n", nErased, peer);
    +    if (nErased > 0 || work_erased > 0) LogPrint(BCLog::MEMPOOL, "Erased %d orphan tx and %d from orphan work set for peer=%d\n", nErased, work_erased, peer);
    


    jnewbery commented at 9:01 am on March 25, 2021:
    Not sure that it’s very interesting to log work_erased. It’ll only ever be 0 (nothing in the work set) or 1 (items in the work set).
  10. in src/net.h:235 in 59cfe0c74b outdated
    @@ -232,6 +232,7 @@ struct LocalServiceInfo {
         uint16_t nPort;
     };
     
    +extern Mutex g_mutex_message_handler;
    


    jnewbery commented at 9:11 am on March 25, 2021:

    I don’t think this is a good idea. Anything that you guard with this mutex will never be accessible on other threads, which would prevent us from eg exposing it to RPC methods, or using it in validation interface callbacks. This seems like a way of reinventing thread local storage, which I don’t think we want.

    In addition, this mutex is exposed and used in both the net and net_processing layer, which adds to the coupling between those components. That’s the opposite direction that I think we want to go in.


    ajtowns commented at 11:34 am on March 25, 2021:

    The reasoning isn’t that this is particularly “good”, it’s that it’s documenting the safety mechanisms we already have. It would have caught #21528 (review) for instance.

    net_processing is fundamentally coupled with net – it implements the NetEventsInterface defined and called from net. I mean I guess if you preferred, it could be NetEventsInterface::m_mutex_message_handler? (EDIT: actually, whether or not you prefer that, I think I do…)


    jnewbery commented at 12:46 pm on March 25, 2021:
    And if you’re doing that then perhaps you can make it a PeerManagerImpl::m_mutex_message_handler and take it in ProcessMessages() and SendMessages(). Taking that lock once at the top of those functions would be very low cost (since there would never be any contention).

    ajtowns commented at 6:04 am on April 20, 2021:
    I’ve done this now – the first three commits introduce m_mutex_messages, remove cs_sendProcessing and move extra txns to be protected by m_mutex_messages instead of g_cs_orphans. I think it would be good to cherry-pick those commits into #21186 so that the moved addr variables can actually be marked as guarded by something.

    jnewbery commented at 6:53 am on April 20, 2021:
    Also see #21563
  11. in src/net_processing.cpp:2090 in 59cfe0c74b outdated
    +    bool more = false;
     
    -        const auto [porphanTx, from_peer] = m_orphanage.GetTx(orphanHash);
    -        if (porphanTx == nullptr) continue;
    +    while (m_orphanage.GetTxToReconsider(peer.m_id, porphanTx, from_peer, more)) {
    +        if (porphanTx == nullptr) break;
    


    jnewbery commented at 9:26 am on March 25, 2021:
    Is this possible? I think the logic in GetTxToReconsider() ensures that if true is returned, then porphanTx will not be null.

    ajtowns commented at 11:36 am on March 25, 2021:
    assert(porphanTx != nullptr) seemed overkill and not having anything seemed like it might be annoying for static analysers

    jnewbery commented at 11:52 am on March 25, 2021:
    How about: if (!Assume(porphanTx)) break ?
  12. in src/txorphanage.h:35 in 59cfe0c74b outdated
    +     *  to work on. Otherwise returns true, and populates its arguments with tx,
    +     *  the originating peer, and whether there are more orphans for this
    +     *  peer to work on after this tx.
          */
    -    std::pair<CTransactionRef, NodeId> GetTx(const uint256& txid) const EXCLUSIVE_LOCKS_REQUIRED(g_cs_orphans);
    +    bool GetTxToReconsider(NodeId peer, CTransactionRef& ref, NodeId& originator, bool& more);
    


    jnewbery commented at 9:49 am on March 25, 2021:
    This is a very strange interface. How about returning a std::optional<std::pair<CTransactionRef, NodeId>>, and then adding another public method HasTxToReconsider(). It’s a bit less efficient, but we’ll only ever call it after processing a transaction, so it seems negligible.
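    A sketch of the interface being suggested (hypothetical signatures):

    /** Pop one orphan from the peer's work set; nullopt when there is none. */
    std::optional<std::pair<CTransactionRef, NodeId>> GetTxToReconsider(NodeId peer)
        EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);

    /** Whether the peer's work set still has entries to process. */
    bool HaveTxToReconsider(NodeId peer) EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);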
  13. in src/txorphanage.h:14 in 59cfe0c74b outdated
    @@ -10,8 +10,10 @@
     #include <primitives/transaction.h>
     #include <sync.h>
     
    -/** Guards orphan transactions and extra txs for compact blocks */
    -extern RecursiveMutex g_cs_orphans;
    +#include <map>
    +#include <optional>
    


    jnewbery commented at 9:49 am on March 25, 2021:
    I guess this means you already tried using an optional in the return value of GetTxToReconsider()?
  14. in src/init.cpp:209 in 59cfe0c74b outdated
    -    // Thus the implicit locking order requirement is: (1) cs_main, (2) g_cs_orphans, (3) cs_vNodes.
    +    // Thus the implicit locking order requirement is: (1) cs_main, (2) cs_vNodes.
         if (node.connman) {
             node.connman->StopThreads();
    -        LOCK2(::cs_main, ::g_cs_orphans);
    +        LOCK(::cs_main);
    


    jnewbery commented at 9:53 am on March 25, 2021:

    I don’t think you need this lock. It was previously here to enforce that cs_main was taken before g_cs_orphans.

    It can also be removed from the process_message and process_messages fuzz tests.


    ajtowns commented at 11:48 am on March 25, 2021:
    I updated the comments which should still indicate why cs_main is needed there – it needs to be locked before cs_vNodes, but cs_vNodes is locked first if you call StopThreads directly?

    jnewbery commented at 12:00 pm on March 25, 2021:

    Ah, very good point. And we can’t take cs_main inside StopNodes because net isn’t aware of cs_main.

    It’s outside the scope of this PR, but it’d be good to untangle this lock ordering dependency by eg making sure that cs_vNodes is not held when calling any PeerManager functions.


    jnewbery commented at 6:52 am on April 20, 2021:
    See #21563

    jnewbery commented at 3:02 pm on April 25, 2021:
    #21563 is merged
  15. in src/txorphanage.h:92 in 59cfe0c74b outdated
     
    +    /** Map from txid to orphan transaction record. Limited by
    +     *  -maxorphantx/DEFAULT_MAX_ORPHAN_TRANSACTIONS */
    +    OrphanMap m_orphans GUARDED_BY(m_mutex);
    +
    +    /** Which peer provided a parent tx of orphans that need to be reconsidered */
    


    jnewbery commented at 10:05 am on March 25, 2021:

    This comment is now incorrect. The multimap stores the node that provided the orphan tx.

    Also, this data is redundant with the fromPeer in the OrphanTx struct (this could easily be a set of txids to reconsider, and the originating peer is then looked up in the OrphanTx struct). We store it in the multimap for efficiency - the comment could be updated to indicate that.

  16. jnewbery commented at 10:13 am on March 25, 2021: member

    Concept ACK. Lots of minor review comments inline, but a bit of high-level design feedback:

    • I’m not at all convinced by the g_mutex_message_handler. It seems like you’re trying to reinvent thread_local storage. That makes potential future changes difficult. It also adds unnecessary coupling between net and net_processing.
    • This PR changes orphan processing from being done for the peer that provided the parent to the peer that provided the orphan. I think that’s fine, but perhaps we should split the PR into one that does refactoring only and one that has behaviour changes.
  17. ajtowns force-pushed on Apr 20, 2021
  18. ajtowns force-pushed on Apr 20, 2021
  19. ajtowns force-pushed on Apr 20, 2021
  20. ajtowns force-pushed on Apr 20, 2021
  21. ajtowns commented at 6:53 am on April 20, 2021: contributor
    Rebased, some of the suggestions picked up, reworked the “message handler” mutex.
  22. in src/net_processing.cpp:330 in 9cc2da27d3 outdated
    +     * Anything only accessed by message parsing can be guarded by
    +     * this mutex, which is acquired by ProcessMessages and SendMessages
    +     */
    +    Mutex m_mutex_messages;
    +
    +    void _ProcessMessage(CNode& pfrom, const std::string& msg_type, CDataStream& vRecv,
    


    jnewbery commented at 10:14 am on April 20, 2021:
    I think it’d be cleaner for the m_mutex_messages lock to be taken by ProcessMessages, rather than ProcessMessage. That would remove the need to define an inner _ProcessMessage to be called when the mutex is already locked. It’s also consistent with the comment on the m_mutex_messages member, which says “is acquired by ProcessMessages and SendMessages”

    ajtowns commented at 10:40 am on April 20, 2021:
    It is taken by ProcessMessages() (and also by SendMessages()), however ProcessMessage() is also called from the test suite, so cannot assume the lock has been taken. That’s why the public interface, ProcessMessage(), is implemented as a wrapper that takes the lock, then calls the internal function.

    jnewbery commented at 1:02 pm on April 20, 2021:
    Ah, sorry - misread.
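    For clarity, a sketch of the wrapper pattern being described (simplified; parameter list follows the _ProcessMessage declaration quoted above):

    void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type, CDataStream& vRecv,
                                         const std::chrono::microseconds time_received,
                                         const std::atomic<bool>& interruptMsgProc)
    {
        // Public entry point, also used by the test suite: take the lock itself.
        LOCK(m_mutex_messages);
        _ProcessMessage(pfrom, msg_type, vRecv, time_received, interruptMsgProc);
    }
    // ProcessMessages() and SendMessages() already hold m_mutex_messages and
    // call _ProcessMessage() directly.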
  23. jnewbery commented at 10:15 am on April 20, 2021: member

    I don’t understand the second commit log: “SendMessages() is now protected internally by m_mutex_messages; so this additional lock is not useful.” SendMessages() is not protected by the new mutex as far as I can see.

    I prefer the approach in #21563, which locks the mutex whenever any NetEvents method is called.

  24. jnewbery commented at 1:08 pm on April 20, 2021: member
    I’ve re-reviewed the locking changes and they look reasonable. There are still a few review comments outstanding. I’m happy to review this PR again once those have been addressed.
  25. ajtowns force-pushed on Apr 22, 2021
  26. ajtowns renamed this:
    NOMERGE: net_processing: orphan handling changes
    net_processing: lock clean up
    on Apr 22, 2021
  27. ajtowns commented at 10:43 pm on April 22, 2021: contributor
    I think this is now cleaned up, and ready for review. The behaviour changes that were in the last commit are now deferred to https://github.com/ajtowns/bitcoin/commits/202104-whohandlesorphans .
  28. ajtowns added the label Refactoring on Apr 22, 2021
  29. ajtowns marked this as ready for review on Apr 22, 2021
  30. DrahtBot added the label Needs rebase on Apr 25, 2021
  31. jnewbery commented at 3:38 pm on April 25, 2021: member

    When you rebase this, there are a few files that no longer need to include txorphanage.h:

    diff --git a/src/init.cpp b/src/init.cpp
    index bb5b144802..e7b5ed60e3 100644
    --- a/src/init.cpp
    +++ b/src/init.cpp
    @@ -52,7 +52,6 @@
     #include <torcontrol.h>
     #include <txdb.h>
     #include <txmempool.h>
    -#include <txorphanage.h>
     #include <util/asmap.h>
     #include <util/check.h>
     #include <util/moneystr.h>
    diff --git a/src/test/fuzz/process_message.cpp b/src/test/fuzz/process_message.cpp
    index 7b99193ad0..5a71b25768 100644
    --- a/src/test/fuzz/process_message.cpp
    +++ b/src/test/fuzz/process_message.cpp
    @@ -18,7 +18,6 @@
     #include <test/util/net.h>
     #include <test/util/setup_common.h>
     #include <test/util/validation.h>
    -#include <txorphanage.h>
     #include <validationinterface.h>
     #include <version.h>
     
    diff --git a/src/test/fuzz/process_messages.cpp b/src/test/fuzz/process_messages.cpp
    index 11b236c9bd..f8b1c8fc90 100644
    --- a/src/test/fuzz/process_messages.cpp
    +++ b/src/test/fuzz/process_messages.cpp
    @@ -13,7 +13,6 @@
     #include <test/util/net.h>
     #include <test/util/setup_common.h>
     #include <test/util/validation.h>
    -#include <txorphanage.h>
     #include <validation.h>
     #include <validationinterface.h>

    Those files were only including txorphanage.h to use the g_cs_orphans mutex.

  32. ajtowns force-pushed on Apr 29, 2021
  33. DrahtBot removed the label Needs rebase on Apr 29, 2021
  34. ajtowns force-pushed on Apr 29, 2021
  35. ajtowns commented at 9:21 am on April 29, 2021: contributor
    Rebased (and likewise the whohandlesorphans followup)
  36. in src/txorphanage.h:29 in 584043a020 outdated
    -    bool HaveTx(const GenTxid& gtxid) const LOCKS_EXCLUDED(::g_cs_orphans);
    +    bool HaveTx(const GenTxid& gtxid) const EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);
     
    -    /** Get an orphan transaction and its originating peer
    -     * (Transaction ref will be nullptr if not found)
    +    /** Get an orphan transaction for a peer to work on
    


    jnewbery commented at 12:20 pm on April 29, 2021:
    /** Get an orphan transaction for a peer to work on and erase it from the peer's workset.
    

    ajtowns commented at 2:04 pm on May 7, 2021:
    Done-ish

    jnewbery commented at 2:39 pm on May 7, 2021:
    thank you!
  37. jnewbery commented at 12:40 pm on April 29, 2021: member

    utACK 584043a0203332fe645529b1c27c086e4f14a61c

    Very nice cleanup.

  38. DrahtBot commented at 9:32 am on May 3, 2021: contributor

    🕵️ @sipa has been requested to review this pull request as specified in the REVIEWERS file.

  39. DrahtBot added the label Needs rebase on May 7, 2021
  40. ajtowns force-pushed on May 7, 2021
  41. ajtowns commented at 12:27 pm on May 7, 2021: contributor
    Rebased past #21845. cc @MarcoFalke @promag for reviews :)
  42. jnewbery commented at 1:00 pm on May 7, 2021: member

    utACK 8b4f685ebef5eb14f049d04e2c4f4ce5b44878de

    Do you mind addressing #21527 (review) while there are no other ACKs on the branch?

  43. DrahtBot removed the label Needs rebase on May 7, 2021
  44. ajtowns force-pushed on May 7, 2021
  45. jnewbery commented at 2:39 pm on May 7, 2021: member
    ACK dd754fd73ab0fcbe9461a6a21b5fc2bc22faf968
  46. in src/net_processing.cpp:2284 in fec0ff1662 outdated
    @@ -2093,6 +2095,7 @@ void PeerManagerImpl::ProcessHeadersMessage(CNode& pfrom, const Peer& peer,
     void PeerManagerImpl::ProcessOrphanTx(std::set<uint256>& orphan_work_set)
     {
         AssertLockHeld(cs_main);
    +    AssertLockHeld(m_mutex_message_handling);
    


    MarcoFalke commented at 5:21 am on May 9, 2021:

    nit fec0ff16627c1cbf5545c9de8db19b749c90beee:

    The lock order is m_mutex_message_handling -> cs_main -> g_cs_orphans, so it would be nice to also order the Asserts here in the same way.

  47. in src/net_processing.cpp:3959 in 57900348db outdated
    @@ -3895,9 +3902,7 @@ bool PeerManagerImpl::ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt
     
         {
             LOCK2(cs_main, g_cs_orphans);
    -        if (!peer->m_orphan_work_set.empty()) {
    -            ProcessOrphanTx(*peer);
    -        }
    +        if (ProcessOrphanTx(*peer)) return true;
    


    MarcoFalke commented at 5:30 am on May 9, 2021:

    57900348db566105351b525ae18cc2830e9665b5:

    This undocumented one-line-if might be a bit too minimal. Previously there was a nice comment as to why return early is needed here. Now the comment is gone, or at least can’t be trivially attached to the return here.


    ajtowns commented at 3:45 pm on May 9, 2021:
    Not sure what you mean by “the comment is gone” ? The “maintains the order of responses and prevents ~vRecvGetData~ m_getdata_requests to grow unbounded” is still there, but at least I thought that sensibly described the ~!pfrom->vRecvGetData.empty()~ !peer->m_getdata_requests.empty() check it remains attached to, and didn’t really fit the “more orphans to work on” check. (EDIT: oops, was looking at 0.19 to see if there was any other comment)

    ajtowns commented at 11:21 pm on September 2, 2021:
    Added a comment.
  48. in src/txorphanage.h:47 in 67618f0246 outdated
    42@@ -43,9 +43,11 @@ class TxOrphanage {
    43     /** Limit the orphanage to the given maximum */
    44     unsigned int LimitOrphans(unsigned int max_orphans) EXCLUSIVE_LOCKS_REQUIRED(g_cs_orphans);
    45 
    46-    /** Add any orphans that list a particular tx as a parent into a peer's work set
    47-     * (ie orphans that may have found their final missing parent, and so should be reconsidered for the mempool) */
    48-    void AddChildrenToWorkSet(const CTransaction& tx, std::set<uint256>& orphan_work_set) const EXCLUSIVE_LOCKS_REQUIRED(g_cs_orphans);
    49+    /** Which peer provided a parent tx of orphans that need to be reconsidered */
    50+    std::multimap<NodeId, uint256> m_peer_work_set GUARDED_BY(g_cs_orphans);
    


    MarcoFalke commented at 6:00 am on May 9, 2021:
    67618f024690bc9b926aa48ead23a0f687f03fe4: I know there is a check to avoid duplicate insertion of the same txid, but is there a reason to pick this data structure, which doesn’t disallow duplicate entries like set does? Also insertion is trivially more expensive (log(all peers’ work sets) vs log(this peer’s work set)). Finally, handling is done in insertion order, not in sorted order.

    vasild commented at 1:53 pm on May 13, 2021:

    … is there a reason to pick this datastructure, which doesn’t disallow duplicate entries …

    I think the reason could be to allow (nodeX, tx1), (nodeX, tx2)?


    ajtowns commented at 3:49 am on May 17, 2021:

    Note that if you’ve got N peers each with K items in their work set, then a log search over everything is log(N*K), and a log search for the peer followed by a log search for the work item is log(N) + log(K), which is the same. If there’s one peer with K items and every other peer has 0, you instead get log(K) vs log(N) + 1/N log(K); so a map of sets would be better than a set of pairs provided N < K, but probably isn’t very interesting.

    I picked the data structure because I thought it was simpler than a set of pairs, but I don’t think there’s an adequate check to avoid duplicate insertion of the same txid (and to do that, you’d have to search through all the entries for a peer, which would be annoying).


    ajtowns commented at 3:52 am on May 17, 2021:
    Changed to map<NodeId,set<uint256>>
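    A sketch of the insertion pattern this enables (hypothetical helper name; the real call site is AddChildrenToWorkSet):

    std::map<NodeId, std::set<uint256>> m_peer_work_set GUARDED_BY(m_mutex);

    void AddToWorkSet(NodeId peer, const uint256& orphan_txid) EXCLUSIVE_LOCKS_REQUIRED(m_mutex)
    {
        // Get this peer's work set, creating it if it does not exist yet.
        std::set<uint256>& work_set = m_peer_work_set.try_emplace(peer).first->second;
        // set::insert is a no-op if this txid is already queued for the peer,
        // giving the duplicate check a multimap could not provide cheaply.
        work_set.insert(orphan_txid);
    }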
  49. MarcoFalke commented at 6:05 am on May 9, 2021: member

    dd754fd73ab0fcbe9461a6a21b5fc2bc22faf968 🕵

    Signature:

    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA512
    
    dd754fd73ab0fcbe9461a6a21b5fc2bc22faf968 🕵
    -----BEGIN PGP SIGNATURE-----
    
    iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
    pUgcDAwAqL240wertSr+Mw2Ei6s6SQizRd2AzRniN6bec0MgUdiNK8aOz7CBTuf5
    3g5Jt5vqORpMkWidS6+JZFMDJQzOzcZXet+eGP6OpuUEFjUKQ2DXIPvRUvtZZU0s
    d0Ad75v5Ov/jw5dU/+Mh1o/eaNJvtvE0bkHJMZrounpUDIH+0M0XKuzCVDPe1NeN
    PomxQwCDlCBbpYJ02ca6oOo9mSzVi0IaMToCdb+S0nI3/1fZwJ2saN16w1ul9Xg+
    8w8Z6j9sxvV9DTTkQTVqv7ckyaXO7V/pZVGzW2FI2hqHYSf8LLRae07+Muqxlec7
    6ldtgIZE0v262nVbcWJFITGr6hrZPr/Oyo6FBfAZnNjZmr/7Z60hY2n2FSNu8GUG
    B84+424mEkH2oOcevu6GeP/GKoL/JXu1gLNxKKPUoOohBfTMHM1OfAeEVm/o/DnI
    xIhl10GoHg3jo127zutHb1TCvknUKCzJzPeHbVvUSJB9g5DIbKqPaABFn7k+cCUy
    zJ3COLUx
    =f8yk
    -----END PGP SIGNATURE-----

    Timestamp of file with hash 04c8a7094b10796be32033da7a78c70f9926aef2456fc44aaece53aa2519e97d -

  50. in src/net_processing.cpp:328 in dd754fd73a outdated
    +    /** Message handling mutex.
    +     *  Message processing is single-threaded, so anything only accessed
    +     *  by ProcessMessage(s) or SendMessages can be guarded by this mutex,
    +     *  which guarantees it's only accessed by a single thread.
    +     */
    +    Mutex m_mutex_message_handling;
    


    vasild commented at 3:25 pm on May 13, 2021:

    If it is single-threaded, then we don’t need to guard anything with a mutex?

    It looks to me that PeerManagerImpl::m_mutex_message_handling is useless because it is only being acquired from a single thread. No two threads are ever going to race to acquire it at the same time:

    ThreadMessageHandler() // there is just one thread executing this
       ProcessMessages()
           ...
           lock m_mutex_message_handling
           ...
       ...
       SendMessages()
           ...
           lock m_mutex_message_handling
           ...
    

    I may be missing something but to me it looks like m_mutex_message_handling can be removed safely, which will reduce the complexity of the code.


    ajtowns commented at 3:56 am on May 17, 2021:
    You don’t need the mutex at runtime, but having the GUARDED_BY(m_mutex_message_handling) at compile time makes it easy to verify nobody’s accidentally accessing the object from some other thread (either currently, or as a result of some future change, eg adding an RPC call that wants to expose some of the information stored in those objects). The cost of also having the mutex at runtime is trivial since there’s never any contention on it.
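    A self-contained illustration of that argument (assumes Bitcoin Core's sync.h and clang's -Wthread-safety-analysis; names are hypothetical):

    #include <sync.h>

    class MessageHandler
    {
        Mutex m_mutex_message_handling;
        int m_messages_processed GUARDED_BY(m_mutex_message_handling){0};

    public:
        void ProcessMessages() EXCLUSIVE_LOCKS_REQUIRED(!m_mutex_message_handling)
        {
            LOCK(m_mutex_message_handling);
            ++m_messages_processed; // fine: lock held
        }

        // If a future change (say, a new RPC method) reads the member directly,
        // the annotation turns the race into a compile error:
        int GetStatsForRpc() const
        {
            return m_messages_processed; // error: requires holding the mutex
        }
    };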

    vasild commented at 4:11 pm on May 17, 2021:

    NACK the addition of m_mutex_message_handling.

    Following the logic of

    GUARDED_BY() makes it easy to verify nobody’s accidentally accessing…

    should we add a dummy mutex for every variable?

    IMO mutexes should be used only where concurrent access is possible.


    jnewbery commented at 4:43 pm on May 17, 2021:
    @vasild can you take a look at the commits that build on top of this branch in https://github.com/jnewbery/bitcoin/commits/2021-05-use-mutex-message-handling. Currently, a lot of the data in net_processing is guarded by cs_main, which is also used in validation. One way to loosen that coupling between validation and net_processing is to introduce an internal mutex inside net_processing (m_mutex_message_handling) that guards its own members.

    vasild commented at 1:26 pm on May 18, 2021:

    I think that https://github.com/jnewbery/bitcoin/commits/2021-05-use-mutex-message-handling can be simplified and the common m_mutex_message_handling avoided by changing some of the variables to atomics and using a dedicated mutex for e.g. m_txrequest.

    Btw, it does not compile.

    diff --git i/src/net_processing.cpp w/src/net_processing.cpp
    index c56863e5ef..cc04679683 100644
    --- i/src/net_processing.cpp
    +++ w/src/net_processing.cpp
    @@ -401,13 +401,13 @@ private:
          * Set mapBlockSource[hash].second to false if the node should not be
          * punished if the block is invalid.
          */
         std::map<uint256, std::pair<NodeId, bool>> mapBlockSource GUARDED_BY(cs_main);
     
         /** Number of peers with wtxid relay. */
    -    int m_wtxid_relay_peers GUARDED_BY(m_mutex_message_handing) {0};
    +    int m_wtxid_relay_peers GUARDED_BY(m_mutex_message_handling) {0};
     
         /** Number of outbound peers with m_chain_sync.m_protect. */
         int m_outbound_peers_with_protect_from_disconnect GUARDED_BY(cs_main) = 0;
     
         bool AlreadyHaveTx(const GenTxid& gtxid) EXCLUSIVE_LOCKS_REQUIRED(cs_main);
     
    @@ -1000,20 +1000,15 @@ void UpdateLastBlockAnnounceTime(NodeId node, int64_t time_in_seconds)
         CNodeState *state = State(node);
         if (state) state->m_last_block_announcement = time_in_seconds;
     }
     
     void PeerManagerImpl::InitializeNode(CNode *pnode)
     {
    -    LOCK(m_mutex_message_handling);
    -
         NodeId nodeid = pnode->GetId();
    -    {
    -        LOCK(cs_main);
    -        mapNodeState.emplace_hint(mapNodeState.end(), std::piecewise_construct, std::forward_as_tuple(nodeid), std::forward_as_tuple(pnode->IsInboundConn()));
    -        assert(m_txrequest.Count(nodeid) == 0);
    -    }
    +    WITH_LOCK(cs_main, mapNodeState.emplace_hint(mapNodeState.end(), std::piecewise_construct, std::forward_as_tuple(nodeid), std::forward_as_tuple(pnode->IsInboundConn())));
    +    WITH_LOCK(m_mutex_message_handling, assert(m_txrequest.Count(nodeid) == 0));
         {
             PeerRef peer = std::make_shared<Peer>(nodeid, m_mutex_message_handling);
             LOCK(m_peer_mutex);
             m_peer_map.emplace_hint(m_peer_map.end(), nodeid, std::move(peer));
         }
         if (!pnode->IsInboundConn()) {

    (clang 13)

    net_processing.cpp:986:33: error: reading variable 'fPreferredDownload' requires holding mutex 'peer.m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
        const bool preferred = peer.fPreferredDownload;
                                    ^
    net_processing.cpp:1033:13: error: calling function '_RelayTransaction' requires holding mutex 'cs_main' exclusively [-Werror,-Wthread-safety-analysis]
                _RelayTransaction(txid, tx->GetWitnessHash());
                ^
    net_processing.cpp:1062:41: error: reading variable 'fPreferredDownload' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
                nPreferredDownload -= peer->fPreferredDownload;
                                            ^
    net_processing.cpp:1063:42: error: reading variable 'm_wtxid_relay' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
                m_wtxid_relay_peers -= peer->m_wtxid_relay;
                                             ^
    net_processing.cpp:1537:41: error: calling function '_RelayTransaction' requires holding mutex 'cs_main' exclusively [-Werror,-Wthread-safety-analysis]
        WITH_LOCK(m_mutex_message_handling, _RelayTransaction(txid, wtxid););
                                            ^
    net_processing.cpp:1550:43: error: reading variable 'm_wtxid_relay' requires holding mutex 'it->second->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
            node->PushTxInventory(it->second->m_wtxid_relay ? wtxid : txid);
                                              ^
    net_processing.cpp:2485:15: error: writing variable 'fPreferredDownload' requires holding mutex 'peer->m_mutex_message_handling' exclusively [-Werror,-Wthread-safety-analysis]
            peer->fPreferredDownload = (!pfrom.IsInboundConn() || pfrom.HasPermission(PF_NOBAN)) && !pfrom.IsAddrFetchConn() && !pfrom.fClient;
                  ^
    net_processing.cpp:2486:37: error: reading variable 'fPreferredDownload' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
            nPreferredDownload -= peer->fPreferredDownload;
                                        ^
    net_processing.cpp:2653:24: error: reading variable 'm_wtxid_relay' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
                if (!peer->m_wtxid_relay) {
                           ^
    net_processing.cpp:2654:23: error: writing variable 'm_wtxid_relay' requires holding mutex 'peer->m_mutex_message_handling' exclusively [-Werror,-Wthread-safety-analysis]
                    peer->m_wtxid_relay = true;
                          ^
    net_processing.cpp:2777:23: error: reading variable 'm_wtxid_relay' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
                if (peer->m_wtxid_relay) {
                          ^
    net_processing.cpp:3049:37: error: reading variable 'm_wtxid_relay' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
            const uint256& hash = peer->m_wtxid_relay ? wtxid : txid;
                                        ^
    net_processing.cpp:3051:19: error: reading variable 'm_wtxid_relay' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
            if (peer->m_wtxid_relay && txid != wtxid) {
                      ^
    net_processing.cpp:3075:13: error: calling function 'AlreadyHaveTx' requires holding mutex 'cs_main' exclusively [-Werror,-Wthread-safety-analysis]
            if (AlreadyHaveTx(GenTxid(/* is_wtxid=*/true, wtxid))) {
                ^
    net_processing.cpp:3084:21: error: calling function '_RelayTransaction' requires holding mutex 'cs_main' exclusively [-Werror,-Wthread-safety-analysis]
                        _RelayTransaction(tx.GetHash(), tx.GetWitnessHash());
                        ^
    net_processing.cpp:3090:44: error: calling function 'AcceptToMemoryPool' requires holding mutex 'cs_main' exclusively [-Werror,-Wthread-safety-analysis]
            const MempoolAcceptResult result = AcceptToMemoryPool(m_chainman.ActiveChainstate(), m_mempool, ptx, false /* bypass_limits */);
                                               ^
    net_processing.cpp:3094:23: error: calling function 'check' requires holding mutex 'cs_main' exclusively [-Werror,-Wthread-safety-analysis]
                m_mempool.check(m_chainman.ActiveChainstate());
                          ^
    net_processing.cpp:3099:13: error: calling function '_RelayTransaction' requires holding mutex 'cs_main' exclusively [-Werror,-Wthread-safety-analysis]
                _RelayTransaction(tx.GetHash(), tx.GetWitnessHash());
                ^
    net_processing.cpp:3114:13: error: calling function 'ProcessOrphanTx' requires holding mutex 'cs_main' exclusively [-Werror,-Wthread-safety-analysis]
                ProcessOrphanTx(*peer);
                ^
    net_processing.cpp:3131:21: error: reading variable 'recentRejects' requires holding mutex 'cs_main' [-Werror,-Wthread-safety-analysis]
                    if (recentRejects->contains(parent_txid)) {
                        ^
    net_processing.cpp:3147:26: error: calling function 'AlreadyHaveTx' requires holding mutex 'cs_main' exclusively [-Werror,-Wthread-safety-analysis]
                        if (!AlreadyHaveTx(gtxid)) AddTxAnnouncement(pfrom, *peer, gtxid, current_time);
                             ^
    net_processing.cpp:3172:17: error: reading variable 'recentRejects' requires holding mutex 'cs_main' [-Werror,-Wthread-safety-analysis]
                    recentRejects->insert(tx.GetHash());
                    ^
    net_processing.cpp:3173:17: error: reading variable 'recentRejects' requires holding mutex 'cs_main' [-Werror,-Wthread-safety-analysis]
                    recentRejects->insert(tx.GetWitnessHash());
                    ^
    net_processing.cpp:3192:24: error: reading variable 'recentRejects' requires holding mutex 'cs_main' [-Werror,-Wthread-safety-analysis]
                    assert(recentRejects);
                           ^
    net_processing.cpp:3193:17: error: reading variable 'recentRejects' requires holding mutex 'cs_main' [-Werror,-Wthread-safety-analysis]
                    recentRejects->insert(tx.GetWitnessHash());
                    ^
    net_processing.cpp:3204:21: error: reading variable 'recentRejects' requires holding mutex 'cs_main' [-Werror,-Wthread-safety-analysis]
                        recentRejects->insert(tx.GetHash());
                        ^
    net_processing.cpp:4291:29: error: reading variable 'fPreferredDownload' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
            bool fFetch = peer->fPreferredDownload || (nPreferredDownload == 0 && !pto->fClient && !pto->IsAddrFetchConn()); // Download if this is a nice peer, or we have no nice peers and this one might do.
                                ^
    net_processing.cpp:4504:53: error: reading variable 'm_wtxid_relay' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
                            const uint256& hash = peer->m_wtxid_relay ? txinfo.tx->GetWitnessHash() : txinfo.tx->GetHash();
                                                        ^
    net_processing.cpp:4505:40: error: reading variable 'm_wtxid_relay' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
                            CInv inv(peer->m_wtxid_relay ? MSG_WTX : MSG_TX, hash);
                                           ^
    net_processing.cpp:4536:85: error: reading variable 'm_wtxid_relay' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
                        CompareInvMempoolOrder compareInvMempoolOrder(&m_mempool, peer->m_wtxid_relay);
                                                                                        ^
    net_processing.cpp:4548:40: error: reading variable 'm_wtxid_relay' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
                            CInv inv(peer->m_wtxid_relay ? MSG_WTX : MSG_TX, hash);
                                           ^
    net_processing.cpp:4636:117: error: reading variable 'fPreferredDownload' requires holding mutex 'peer->m_mutex_message_handling' [-Werror,-Wthread-safety-analysis]
                    if (current_time > state.m_headers_sync_timeout && nSyncStarted == 1 && (nPreferredDownload - peer->fPreferredDownload >= 1)) {
                                                                                                                        ^
    32 errors generated.
    

    ajtowns commented at 4:10 pm on May 18, 2021:

    “Should we add a dummy mutex for every variable” – we should have a GUARDED_BY for every object that’s reachable by multiple threads, yes. The mutex is not a dummy, it’s a real mutex.

    For this specific case, here’s an example of a patch that adds an access from another thread: https://github.com/ajtowns/bitcoin/commits/202105-whyhaveaguard . With the GUARDED_BY(m_mutex_message_handling) this is caught at compile time; without the guard (removed in the second commit), it compiles fine but introduces a race condition.


    vasild commented at 10:41 am on May 19, 2021:

    we should have a GUARDED_BY for every object that’s reachable by multiple threads

    I assume here you mean – even if the current code accesses it from a single thread.

    If yes, then I think that is an overkill - it means having a mutex attached to every non-const global or class member variable, including private ones (coz in the future somebody may access it from another thread).

    For this specific case, here’s an example…

    Yeah, but one can do such an example for any other variable. IMO annotating everything with GUARDED_BY(), just in case, would have a net-adverse effect on the code base.

    diff --git i/src/net.h w/src/net.h
    index 6b9a7ef136..18d57a7d8e 100644
    --- i/src/net.h
    +++ w/src/net.h
    @@ -1256,12 +1256,18 @@ private:
          * an address and port that are designated for incoming Tor connections.
          */
         std::vector<CService> m_onion_binds;
     
         friend struct CConnmanTest;
         friend struct ConnmanTestMsg;
    +
    +public:
    +    size_t ResponseCachesSize() const
    +    {
    +        return m_addr_response_caches.size();
    +    }
     };
     
     /** Return a timestamp in the future (in microseconds) for exponentially distributed events. */
     std::chrono::microseconds PoissonNextSend(std::chrono::microseconds now, std::chrono::seconds average_interval);
     
     /** Dump binary message to file, with timestamp */
    diff --git i/src/rpc/net.cpp w/src/rpc/net.cpp
    index 1f6b6e8d7e..cfab1ebafc 100644
    --- i/src/rpc/net.cpp
    +++ w/src/rpc/net.cpp
    @@ -631,12 +631,14 @@ static RPCHelpMan getnetworkinfo()
         NodeContext& node = EnsureAnyNodeContext(request.context);
         if (node.connman) {
             ServiceFlags services = node.connman->GetLocalServices();
             obj.pushKV("localservices", strprintf("%016x", services));
             obj.pushKV("localservicesnames", GetServicesNames(services));
         }
    +    obj.pushKV("bug1", node.connman->ResponseCachesSize());
    +    obj.pushKV("bug2", node.addrman->m_asmap.size());
         if (node.peerman) {
             obj.pushKV("localrelay", !node.peerman->IgnoresIncomingTxs());
         }
         obj.pushKV("timeoffset",    GetTimeOffset());
         if (node.connman) {
             obj.pushKV("networkactive", node.connman->GetNetworkActive());
    

    ajtowns commented at 2:37 pm on May 20, 2021:

    diff --git i/src/net.h w/src/net.h

    Yes, net.cpp/net.h is terrible for having globally accessible variables that aren’t annotated for thread safety. Being able to introduce races that the compiler doesn’t catch is a bad thing… txmempool’s a better example, though it’s also missing some guards (cf #22003).

    IMO annotating everything with GUARDED_BY(), just in case, would have a net-adverse effect on the code base.

    GUARDED_BY is always a “just in case” thing – any time the code compiles correctly with GUARDED_BY it will also compile correctly without it – the benefit is “just in case” someone forgets while modifying the code later. With a guard, the compiler will catch the mistake; without it, you have to hope someone reviewing the code remembers, or wait for some obscure bug to be noticed and reported.

    Not consistently guarding variables has an easily observable adverse effect on the code base as a whole: it makes it hard to change code, because you’re not sure when things might be safe to reuse in a different context. eg see #21061 (review)

    it means having a mutex attached to every non-const global or class member variable, including private ones

    Yes and no – no, in that most classes don’t need to manually manage locks, but should rather rely on whatever instantiates them to ensure the entire object is accessed safely; but for the ones that do, yes: every non-const, non-atomic member should be guarded (as should every class’s non-const, non-atomic static members, since they’re effectively globals themselves).


    vasild commented at 1:44 pm on May 25, 2021:

    @ajtowns, thanks for the elaboration.

    I still think that mutexes should be introduced when needed, not beforehand (with a reason that in the future somebody may access a variable from a different thread).

    Looks like neither one of us is convinced by the other’s arguments. It’s ok, it would have been too boring if everybody agreed on everything all the time.


    MarcoFalke commented at 3:29 pm on January 24, 2022:
    I think it is fine to not have thread safety stuff on things that aren’t meant to be called by more than one thread. For example, most code in init.cpp doesn’t need them because they are only called once per process. Though, other stuff should communicate to the developer whether it is thread safe. Either via code comments (https://github.com/bitcoin/bitcoin/blob/e3de7cb9039770e0fd5b8bb8a5cba35c87ae8f00/src/random.h#L67, https://github.com/bitcoin/bitcoin/blob/e3de7cb9039770e0fd5b8bb8a5cba35c87ae8f00/src/random.h#L129) or via annotations.
  51. ajtowns force-pushed on May 17, 2021
  52. in src/txorphanage.cpp:147 in d2f5cc9de1 outdated
    +void TxOrphanage::AddChildrenToWorkSet(const CTransaction& tx, NodeId peer)
     {
    -    AssertLockHeld(g_cs_orphans);
    +    LOCK(m_mutex);
    +
    +    // lookup this peer's work set, ensuring it exists
    


    jnewbery commented at 7:31 am on May 17, 2021:
    // Look up this peer's work set, ensuring it exists.
    
  53. in src/net_processing.cpp:2108 in bdd2274936 outdated
    -        const uint256 orphanHash = *peer.m_orphan_work_set.begin();
    -        peer.m_orphan_work_set.erase(peer.m_orphan_work_set.begin());
    +    while (!orphan_work_set.empty()) {
    +        auto it = orphan_work_set.begin();
    +        const uint256 orphanHash = *it;
    +        it = orphan_work_set.erase(it);
    


    jnewbery commented at 7:42 am on May 17, 2021:

    In commit bdd227493602437d4b6ee2d366ca0a83189f071c:

    The new value for it is unused. No need to set it here:

        orphan_work_set.erase(it);
    
  54. in src/txorphanage.h:40 in d2f5cc9de1 outdated
    +     *  the work set, and populates its arguments with tx, the originating
    +     *  peer, and whether there are more orphans for this peer to work on
    +     *  after this tx.
          */
    -    std::pair<CTransactionRef, NodeId> GetTx(const uint256& txid) const EXCLUSIVE_LOCKS_REQUIRED(g_cs_orphans);
    +    bool GetTxToReconsider(NodeId peer, CTransactionRef& ref, NodeId& originator, bool& more) EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);
    


    jnewbery commented at 11:22 am on May 17, 2021:

    I left a comment about this interface further up that seems to have been lost without being resolved. I think the interface is needlessly complex and doing too much. From my comment above:

    How about returning a std::optional<std::pair<CTransactionRef, NodeId>>, and then adding another public method HasTxToReconsider(). It’s a bit less efficient, but we’ll only ever call it after processing a transaction, so it seems negligible.


    ajtowns commented at 11:40 am on May 17, 2021:
  55. in src/txorphanage.cpp:185 in d2f5cc9de1 outdated
    +            uint256 txid = *it;
    +            it = work_set.erase(it);
    +
    +            const auto orphan_it = m_orphans.find(txid);
    +            if (orphan_it != m_orphans.end()) {
    +                more = (it != work_set.end());
    


    jnewbery commented at 11:23 am on May 17, 2021:

    (if you’re keeping more), the following is equivalent, and maybe more direct/clear:

                more = !(work_set.empty());
    
  56. jnewbery commented at 11:26 am on May 17, 2021: member

    utACK d2f5cc9de12196c870b2079bffc38e90076074d4.

    Verified range-diff. Only significant difference was replacing std::multimap<NodeId, uint256> m_peer_work_set with std::map<NodeId, std::set<uint256>> m_peer_work_set.

  57. DrahtBot added the label Needs rebase on May 24, 2021
  58. ajtowns force-pushed on May 25, 2021
  59. DrahtBot removed the label Needs rebase on May 25, 2021
  60. ajtowns force-pushed on May 25, 2021
  61. ajtowns commented at 4:28 am on May 25, 2021: contributor
    Rebased past #21186, addressed @jnewbery’s nits.
  62. jnewbery commented at 10:55 am on May 25, 2021: member
    Code review ACK d6dfa5977a7c03d9a81727e1692edb58bfeab62c
  63. in src/net.cpp:2257 in d6dfa5977a outdated
    @@ -2197,10 +2197,7 @@ void CConnman::ThreadMessageHandler()
                 if (flagInterruptMsgProc)
                     return;
                 // Send messages
    -            {
    -                LOCK(pnode->cs_sendProcessing);
    -                m_msgproc->SendMessages(pnode);
    -            }
    +            m_msgproc->SendMessages(pnode);
    


    vasild commented at 2:18 pm on May 25, 2021:

    Previously this would have allowed concurrent executions of SendMessages() for different peers, whereas now the concurrency is reduced by serializing all SendMessages() under the newly introduced single mutex m_mutex_message_handling.

    Currently this code is executed by a single thread, so that is irrelevant, but in a possible future where we might want to do concurrent SendMessages() for different peers, the deleted cs_sendProcessing will have to be re-introduced.


    ariard commented at 6:28 pm on June 16, 2021:

    Actually, sharing block headers/compact blocks messages in parallel has been discussed a while back (iirc https://bitcoincore.org/en/meetings/2018/05/03/, see “Call ProcessNewBlock asynchronously”), though we could insert some queueing interface between a net_processing thread and multiple net threads?

    That said, if it’s a direction we agree on, I think the discussion is worth having.

  64. in src/net_processing.cpp:4244 in d6dfa5977a outdated
    -        }
    +        LOCK(cs_main);
    +        if (ProcessOrphanTx(*peer)) return true;
         }
     
         if (pfrom->fDisconnect)
    


    vasild commented at 4:27 pm on May 25, 2021:

    Before this PR we would have returned false (no more work) if fDisconnect was set regardless of the outcome of ProcessOrphanTx() and regardless of whether peer->m_orphan_work_set was empty.

    Now fDisconnect may be set, but we may return true if ProcessOrphanTx() returns true.

    If this change is not intentional, then maybe swap the order:

    if (pfrom->fDisconnect) {
        return false;
    }

    {
        LOCK(cs_main);
        if (ProcessOrphanTx(*peer)) return true;
    }
    

    ajtowns commented at 11:23 pm on September 2, 2021:
    Changed to return !pfrom->fDisconnect to preserve the same behaviour.
  65. in src/txorphanage.cpp:148 in d6dfa5977a outdated
     {
    -    AssertLockHeld(g_cs_orphans);
    +    LOCK(m_mutex);
    +
    +    // Get this peer's work set, ensuring it exists
    +    std::set<uint256>& orphan_work_set = m_peer_work_set.emplace(peer, std::set<uint256>{}).first->second;
    


    vasild commented at 4:35 pm on May 25, 2021:

    Can be simplified?

    std::set<uint256>& orphan_work_set = m_peer_work_set[peer];
    

    ajtowns commented at 11:24 pm on September 2, 2021:
    Changed to C++17’s try_emplace which simplifies it slightly. Not really a fan of map::operator[]
  66. in src/txorphanage.h:93 in d6dfa5977a outdated
    +    /** Map from txid to orphan transaction record. Limited by
    +     *  -maxorphantx/DEFAULT_MAX_ORPHAN_TRANSACTIONS */
    +    OrphanMap m_orphans GUARDED_BY(m_mutex);
    +
    +    /** Which peer provided a parent tx of orphans that need to be reconsidered */
    +    std::map<NodeId, std::set<uint256>> m_peer_work_set GUARDED_BY(m_mutex);
    


    vasild commented at 4:39 pm on May 25, 2021:

    s/map/unordered_map/ to make lookup O(1) (we don’t rely on this being ordered by NodeId). Same for the set of transaction ids.

    std::unordered_map<NodeId, std::unordered_set<uint256>> m_peer_work_set GUARDED_BY(m_mutex);
    

    ajtowns commented at 11:25 pm on September 2, 2021:
    Left as map; I don’t think performance matters enough to justify using extra space for the hashmap; and longer term, using a multiindex rather than a bunch of different containers referencing each other is probably better.
  67. in src/net_processing.cpp:2399 in d6dfa5977a outdated
    +    bool more = false;
     
    -        const auto [porphanTx, from_peer] = m_orphanage.GetTx(orphanHash);
    -        if (porphanTx == nullptr) continue;
    +    while (m_orphanage.GetTxToReconsider(peer.m_id, porphanTx, from_peer, more)) {
    +        if (!Assume(porphanTx)) break;
    


    vasild commented at 4:53 pm on May 26, 2021:
    This Assume() seems unnecessary because GetTxToReconsider() is documented to populate porphanTx if it returns true.

    ajtowns commented at 2:53 am on May 28, 2021:
    All Assume’s should be unnecessary by definition? See #21527 (review) for prior discussion.

    vasild commented at 8:48 am on May 28, 2021:

    Assume() makes sense if the value is derived in an obscure way. But not for checking whether a function has set its “out” parameters as documented.

    For example, when calling CSHA256::Finalize(result) we don’t set result to a dummy value before the call and check that it is not that dummy value after the call with Assume(). Same for any other function that has “out” parameters.

    If you insist to check that porphanTx was set by the function, then I guess you should do the same with from_peer: Assume(from_peer != -1).

  68. in src/txorphanage.h:90 in d6dfa5977a outdated
             }
         };
     
    +    /** Map from txid to orphan transaction record. Limited by
    +     *  -maxorphantx/DEFAULT_MAX_ORPHAN_TRANSACTIONS */
    +    OrphanMap m_orphans GUARDED_BY(m_mutex);
    


    vasild commented at 9:36 am on May 27, 2021:
    Is it possible for TxOrphanage::m_orphans to contain keys (transaction ids) which are not in TxOrphanage::m_peer_work_set or the other way around?

    ajtowns commented at 3:20 am on May 28, 2021:

    m_orphans will normally contain txids that aren’t in m_peer_work_set – they only get added to the work set when a parent appears and are (hopefully) very quickly removed from the work set after retrying ATMP.

    It’s also possible for something to expire from m_orphans and still be present in m_peer_work_set: if a tx that’s in the work set is removed via LimitOrphans or EraseForBlock you’ll get that case. It will eventually be removed from the work set when GetTxToReconsider is called.
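    A sketch of the lazy cleanup being described (a fragment simplified from the PR's GetTxToReconsider):

    auto work_set_it = m_peer_work_set.find(peer);
    if (work_set_it != m_peer_work_set.end()) {
        std::set<uint256>& work_set = work_set_it->second;
        while (!work_set.empty()) {
            uint256 txid = *work_set.begin();
            work_set.erase(work_set.begin());

            const auto orphan_it = m_orphans.find(txid);
            if (orphan_it != m_orphans.end()) {
                // Still in the orphanage: hand it back for reconsideration.
                return orphan_it->second.tx;
            }
            // Otherwise the orphan expired via LimitOrphans/EraseForBlock;
            // drop the stale work-set entry and keep scanning.
        }
    }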

  69. michaelfolkson commented at 11:42 am on June 16, 2021: contributor

    NACK the addition of m_mutex_message_handling.

    Looks like neither one of us is convinced by the other’s arguments. It’s ok, it would have been too boring if everybody agreed on everything all the time. @vasild: Are you in a position where you can ACK this PR overall (with reservations) or would you prefer not to ACK it? Either is fine (obviously), I just suspect other reviewers may be waiting to see if your concerns are addressed or not before spending time reviewing this.

  70. in src/net_processing.cpp:346 in d6dfa5977a outdated
    342+     * Reconsider orphan transactions after a parent has been accepted to the mempool.
    343+     *
    344+     * @param[in,out]  peer             The peer whose orphan transactions we will reconsider. Generally only one
    345+     *                                  orphan will be reconsidered on each call of this function. This peer's set
    346+     *                                  may be added to if accepting an orphan causes its children to be
    347+     *                                  reconsidered.
    


    ariard commented at 5:30 pm on June 16, 2021:

    I think the second part of this comment (“This peer’s set may be added…”) is confusing, and annotating the parameter as an output doesn’t make sense anymore.

    After this PR, the parameter is no longer a std::map<NodeId, std::set<uint256>>, and the new one could actually be const.


    ajtowns commented at 11:28 pm on September 2, 2021:
    Dropped the “out”, which doesn’t make sense. Knowing that this may increase the size of the set, not simply reduce it, is valuable I think, so I have left that text alone.
  71. in src/txorphanage.h:49 in d6dfa5977a outdated
    49-    void EraseForPeer(NodeId peer) EXCLUSIVE_LOCKS_REQUIRED(g_cs_orphans);
    50+    void EraseForPeer(NodeId peer) EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);
    51 
    52     /** Erase all orphans included in or invalidated by a new block */
    53-    void EraseForBlock(const CBlock& block) LOCKS_EXCLUDED(::g_cs_orphans);
    54+    void EraseForBlock(const CBlock& block) EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);
    


    ariard commented at 5:36 pm on June 16, 2021:
    What do you think about an EXCLUSIVE_LOCKS_EXCLUDED(...) annotation instead? The logical negation operator is easy to miss for a reviewer.

    ajtowns commented at 11:29 pm on September 2, 2021:
    Seems like something more for clang upstream if anything?
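
    For readers unfamiliar with the annotation under discussion: the ! is Clang’s negative-capability syntax, meaning the caller must be able to prove it does not hold the mutex. A hedged sketch using Bitcoin Core’s sync.h/threadsafety.h macros (the function body is illustrative):

    #include <sync.h>
    #include <cstdint>

    using NodeId = int64_t;

    class TxOrphanage {
    public:
        // "!m_mutex" is a negative capability: callers must NOT hold
        // m_mutex, so this function can take it without deadlocking.
        void EraseForPeer(NodeId peer) EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
        {
            LOCK(m_mutex);  // safe: the annotation rules out re-entry
            // ... drop this peer's orphans ...
        }
    private:
        mutable Mutex m_mutex;
    };

    void Caller(TxOrphanage& orphanage, NodeId peer)
    {
        orphanage.EraseForPeer(peer);  // ok: m_mutex not held here
        // Calling while holding m_mutex would be flagged at compile time
        // by Clang's -Wthread-safety-negative, not at runtime as a deadlock.
    }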
  72. in src/net_processing.cpp:2284 in d6dfa5977a outdated
    2161- *                                  reconsidered.
    2162- */
    2163-void PeerManagerImpl::ProcessOrphanTx(std::set<uint256>& orphan_work_set)
    2164+bool PeerManagerImpl::ProcessOrphanTx(Peer& peer)
    2165 {
    2166+    AssertLockHeld(m_mutex_message_handling);
    


    ariard commented at 6:00 pm on June 16, 2021:
    At commit f86a525, I don’t think this lock is useful to protect vExtraTxnForCompact; isn’t there already one in AddToCompactExtraTransactions? Though it might be useful at branch tip?

    ajtowns commented at 6:49 am on September 2, 2021:
    I was just adding AssertLockHeld to correspond with the EXCLUSIVE_LOCKS_REQUIRED annotation as a matter of course (see the recommendation in developer-notes.md).
  73. in src/net_processing.cpp:4239 in d6dfa5977a outdated
    3968@@ -3943,11 +3969,6 @@ bool PeerManagerImpl::ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt
    3969         if (!peer->m_getdata_requests.empty()) return true;
    3970     }
    3971 
    3972-    {
    3973-        LOCK(g_cs_orphans);
    3974-        if (!peer->m_orphan_work_set.empty()) return true;
    


    ariard commented at 6:18 pm on June 16, 2021:

    I think this introduces a slight behavior change.

    Previously, if our peer had fDisconnect=true set from a previous round of message processing and the orphan work set was still non-empty after the ProcessOrphanTx above, we would have returned false.

    From now on, if the orphan work set is still non-empty after ProcessOrphanTx, we’re going to return true.

    Though I don’t think it lets our peer abuse any more of our CPU time: when we return from ProcessMessages to ThreadMessageHandler, we go to SendMessages, which calls MaybeDiscourageAndDisconnect as the first thing.

    (Actually, why isn’t the fDisconnect check first in ProcessMessages? Shouldn’t it come before any processing work is attempted in ProcessGetData/ProcessOrphanTx?)


    ajtowns commented at 6:51 am on September 2, 2021:
    fDisconnect is already checked in net.cpp prior to ProcessMessages being called, but there’s always a chance that there’s a race and fDisconnect is set while in the middle of ProcessMessages for the peer.
  74. ariard commented at 6:19 pm on June 16, 2021: member
    Approach ACK, reviewed up to 1e3f227 so far
  75. in src/net_processing.cpp:4581 in 7961d94c8b outdated
    4323@@ -4303,6 +4324,8 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
    4324     if (!peer) return false;
    4325     const Consensus::Params& consensusParams = m_chainparams.GetConsensus();
    4326 
    4327+    LOCK(m_mutex_message_handling);
    


    narula commented at 10:55 pm on June 18, 2021:

    /** Message handling mutex.
     *
     * Message processing is single-threaded, so anything only accessed
     * by ProcessMessage(s) or SendMessages can be guarded by this mutex,
     * which guarantees it's only accessed by a single thread.
     */

    I’m confused. This mutex ends up protecting everything accessed in those functions. For example, in SendMessages it also covers MaybeDiscourageAndDisconnect, which accesses data behind peer.m_misbehavior_mutex that the validation thread can also touch. So this mutex is guarding more than just things accessed only by the ProcessMessage(s)/SendMessages thread.

    I suppose it’s safe that this mutex also ends up guarding things that are accessed by multiple threads, but I find the comment confusing.


    ajtowns commented at 10:26 pm on September 2, 2021:

    If you have:

    0Mutex mutex;
    1int x GUARDED_BY(mutex);
    2int y;
    3static void A() { LOCK(mutex); ++x; ++y; }
    4static void B() { LOCK(mutex); --x; }
    5void ThreadA() { for (int i = 0; i < 100; ++i) { A(); } }
    6void ThreadB() { for (int i = 0; i < 100; ++i) { B(); } }
    

    then I’d say that only x is guarded by the mutex, even though y is only actually accessed while the mutex is held. If you later introduce

    0static void C() { y *= 2; }
    

    then that would still be safe/correct if ThreadA were to start calling C(), but y would no longer always be protected by the mutex; conversely if ThreadB were to start calling C(), then y would no longer be thread safe, but because it’s not guarded by anything, the compiler would not catch that bug.

    I guess I’d look at it more as mutexes protect code from being run simultaneously; it’s the GUARDED_BY annotations that protect the variables. If the variables aren’t annotated, they don’t have any automatic protections, only manual protection by the programmer constantly being careful.
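
    A short follow-on sketch of the fix implied here: once y carries its own GUARDED_BY annotation, the unsafe variant of C() becomes a compile-time diagnostic rather than a latent race (same Mutex/LOCK/GUARDED_BY macros as the example above):

    Mutex mutex;
    int x GUARDED_BY(mutex);
    int y GUARDED_BY(mutex);                  // now annotated as well
    static void C() { y *= 2; }               // -Wthread-safety: writing 'y'
                                               // requires holding 'mutex'
    static void D() { LOCK(mutex); y *= 2; }  // compiles cleanly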


    vasild commented at 8:06 am on September 3, 2021:

    I guess I’d look at it more as mutexes protect code from being run simultaneously

    IMO mutexes protect variables, not code.

    From https://en.wikipedia.org/wiki/Lock_(computer_science)

    > … a mechanism that enforces limits on access to a resource …

    > … each thread cooperates by acquiring the lock before accessing the corresponding data


    ajtowns commented at 2:55 am on September 7, 2021:

    Read your second quote carefully – it’s not the mutex/lock that protects the data, it’s the threads that protect it by cooperatively acquiring the lock before accessing the data.

    It’s 99% true that the point of using a mutex is to protect data, but it’s not the mutex itself that ensures the data is protected, it’s how you use the mutex. It’s like looking both ways before crossing the road – if you don’t do that, you’ll probably walk into traffic; but looking alone isn’t enough, you have to make sure you do it before crossing the road, and depending on what you see, delay crossing the road. The point of annotations here is to ensure at compile time that you’re not metaphorically stepping into the road before looking, or literally not accessing variables that are only meant to be accessed by a particular thread from another thread.

    (I think I’ve repeated this point enough now; if there’s something not clear, feel free to ask, but I’m not seeing any point to going around in circles another time)

  76. in src/net_processing.cpp:2643 in 7961d94c8b outdated
    2406@@ -2396,6 +2407,14 @@ void PeerManagerImpl::ProcessBlock(CNode& pfrom, const std::shared_ptr<const CBl
    2407 void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type, CDataStream& vRecv,
    


    amitiuttarwar commented at 11:28 pm on June 18, 2021:
    do I understand correctly that after these changes PeerManagerImpl::ProcessMessage is a test-only function? If so, I think the comment in net_processing.h should be updated. Or even better, the function renamed to something like ProcessMessageTest :)

    ajtowns commented at 10:35 pm on September 2, 2021:

    PeerManager::ProcessMessage has always only been exposed for tests, and is already documented that way in the header file:

    0    /** Process a single message from a peer. Public for fuzz testing */
    1    virtual void ProcessMessage(CNode& pfrom, const std::string& msg_type, CDataStream& vRecv,
    

    PeerManagerImpl::ProcessMessage is thus needed to complete the PeerManager interface.

  77. vasild commented at 2:02 pm on June 21, 2021: contributor

    > NACK the addition of m_mutex_message_handling.
    >
    > Looks like neither one of us is convinced by the other’s arguments. It’s ok, it would have been too boring if everybody agreed on everything all the time.
    >
    > @vasild: Are you in a position where you can ACK this PR overall (with reservations) or would you prefer not to ACK it? Either is fine (obviously), I just suspect other reviewers may be waiting to see if your concerns are addressed or not before spending time reviewing this.

    The rest of this PR looks good overall (modulo some minor other comments I left). But the mutex m_mutex_message_handling looks really strange to me, with a confusing comment.

    I would re-review carefully and probably ACK this PR if the mutex is removed. If needed later then it can be added in another PR.

  78. jnewbery commented at 10:20 am on July 27, 2021: member
    @ajtowns are you still working on this? I think this new net_processing lock is the best way we have to make progress with much of the net/net_processing/validation separation work.
  79. jnewbery referenced this in commit 1b84b99caa on Jul 28, 2021
  80. ajtowns commented at 11:31 pm on September 2, 2021: contributor
    Rebased past #21562. Minor updates per reviews.
  81. ajtowns force-pushed on Sep 2, 2021
  82. glozow commented at 1:43 pm on September 24, 2021: member
    Concept ACK, I like the idea of txorphanage being internally thread-safe
  83. DrahtBot added the label Needs rebase on Sep 27, 2021
  84. ajtowns force-pushed on Sep 29, 2021
  85. ajtowns commented at 7:13 pm on September 29, 2021: contributor
    Rebased past #22976
  86. DrahtBot removed the label Needs rebase on Sep 29, 2021
  87. DrahtBot added the label Needs rebase on Oct 25, 2021
  88. ajtowns force-pushed on Oct 26, 2021
  89. ajtowns commented at 12:42 pm on October 26, 2021: contributor
    Rebased past #23157
  90. DrahtBot removed the label Needs rebase on Oct 26, 2021
  91. DrahtBot added the label Needs rebase on Nov 10, 2021
  92. ajtowns force-pushed on Dec 9, 2021
  93. ajtowns commented at 3:17 pm on December 9, 2021: contributor
    Rebased past a few conflicts.
  94. DrahtBot removed the label Needs rebase on Dec 9, 2021
  95. DrahtBot added the label Needs rebase on Jan 24, 2022
  96. ajtowns force-pushed on Mar 4, 2022
  97. ajtowns commented at 7:42 pm on March 4, 2022: contributor
    Rebased with the cs_sendProcessing parts split out into #24474; marking as draft while that one gets reviewed.
  98. ajtowns marked this as a draft on Mar 4, 2022
  99. DrahtBot removed the label Needs rebase on Mar 4, 2022
  100. DrahtBot added the label Needs rebase on Mar 7, 2022
  101. ajtowns force-pushed on Mar 7, 2022
  102. DrahtBot removed the label Needs rebase on Mar 7, 2022
  103. ajtowns force-pushed on Mar 25, 2022
  104. DrahtBot added the label Needs rebase on Apr 26, 2022
  105. ajtowns force-pushed on May 3, 2022
  106. ajtowns commented at 5:52 pm on May 3, 2022: contributor
    Rebased to keep up with #24474
  107. DrahtBot removed the label Needs rebase on May 3, 2022
  108. DrahtBot added the label Needs rebase on May 16, 2022
  109. ajtowns force-pushed on May 16, 2022
  110. ajtowns force-pushed on May 16, 2022
  111. DrahtBot removed the label Needs rebase on May 16, 2022
  112. DrahtBot added the label Needs rebase on May 19, 2022
  113. net/net_processing: add missing thread safety annotations 70a3324d84
  114. net: mark TransportSerializer/m_serializer as const
    The (V1)TransportSerializer instance CNode::m_serializer is used from
    multiple threads via PushMessage without protection by a mutex. This
    is only thread safe because the class does not have any mutable state,
    so document that by marking the methods and the object as "const".
    fca9fadd4b
  115. net: mark CNode unique_ptr members as const
    Dereferencing a unique_ptr is not necessarily thread safe. The reason
    these are safe is because their values are set at construction and do
    not change later; so mark them as const and set them via the initializer
    list to guarantee that.
    021a96d41b
  116. net: note CNode members that are treated as const
    m_permissionFlags and m_prefer_evict are treated as const -- they're
    only set immediately after construction before any other thread has
    access to the object, and not changed again afterwards. As such they
    don't need to be marked atomic or guarded by a mutex; though it would
    probably be better to actually mark them as const...
    5df0bd1e8d
  117. net: add NetEventsInterface::g_mutex_msgproc_thread mutex
    There are many cases where we assume message processing is
    single-threaded in order for our accesses to node-related memory to be
    safe. Add an explicit mutex that we can use to document this, which allows
    the compiler to catch any cases where we try to access that memory from
    other threads and break that assumption. (A sketch of this pattern
    follows the commit list below.)
    c8c4faac48
  118. net: drop cs_sendProcessing
    SendMessages() is now protected by g_mutex_msgproc_thread, so this
    additional per-node mutex is redundant.
    b1af785346
  119. [move-only] net: move NetEventsInterface before CNode
    This allows CNode members to be marked as guarded by the
    g_mutex_msgproc_thread mutex.
    8e75e74e6a
  120. net: add thread safety annotations for CNode/Peer members accessed only via the msgproc thread 49b72d235c
  121. net_processing: add thread safety annotations for Peer members accessed only via the msgproc thread 3ca9617d45
  122. net_processing: extra transactions data are accessed only via the msgproc thread
    Previously vExtraTxnForCompact and vExtraTxnForCompactIt were protected
    by g_cs_orphans; protect them by g_mutex_msgproc_thread instead, as they
    are only used during message processing.
    49c3b28ab4
  123. Remove unnecessary includes of txorphanage.h 953e9bc0a8
  124. net_processing: Pass a Peer to ProcessOrphanTx 0dd568f41e
  125. net_processing: Localise orphan_work_set handling to ProcessOrphanTx 983a6c09ca
  126. txorphange/net_processing: move orphan workset to txorphanage d30d1da235
  127. txorphanage: make m_peer_work_set private 5b3f4d4101
  128. Move all g_cs_orphans locking to txorphanage 04a8962ead
  129. txorphanage: move g_cs_orphans to TxOrphanage::m_mutex 9037716fe8
  130. txorphanage: fix missing includes 023909c638
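
    A hedged sketch of the pattern commit c8c4faac48 above describes (simplified; Peer::m_msgproc_state and these function shapes are hypothetical, not the commit’s code). A mutex acquired by the message-handler thread turns the informal rule “only the msgproc thread touches this” into something -Wthread-safety can verify:

    #include <sync.h>  // Bitcoin Core's Mutex/LOCK and GUARDED_BY

    class NetEventsInterface {
    public:
        /** Held whenever ProcessMessages/SendMessages run. */
        static Mutex g_mutex_msgproc_thread;
    };
    Mutex NetEventsInterface::g_mutex_msgproc_thread;

    struct Peer {
        // Hypothetical per-peer state only used during message processing:
        int m_msgproc_state GUARDED_BY(NetEventsInterface::g_mutex_msgproc_thread) = 0;
    };

    void ThreadMessageHandler(Peer& peer)
    {
        LOCK(NetEventsInterface::g_mutex_msgproc_thread);
        ++peer.m_msgproc_state;  // fine: we hold the msgproc mutex
    }

    void SomeOtherThread(Peer& peer)
    {
        // ++peer.m_msgproc_state;  // compile-time thread-safety warning
    }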
  131. ajtowns force-pushed on May 19, 2022
  132. DrahtBot removed the label Needs rebase on May 19, 2022
  133. DrahtBot commented at 1:41 pm on May 20, 2022: contributor

    🐙 This pull request conflicts with the target branch and needs rebase.

    Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a “draft”.

  134. DrahtBot added the label Needs rebase on May 20, 2022
  135. ajtowns commented at 6:53 am on August 29, 2022: contributor
    Closing this. If you’d like it resurrected please help #25174 make progress.
  136. ajtowns closed this on Aug 29, 2022

  137. bitcoin locked this on Aug 29, 2023
