net: replace manual reference counting of CNode with shared_ptr #32015

pull vasild wants to merge 7 commits into bitcoin:master from vasild:cnode_shared_ptr changing 6 files +161 −226
  1. vasild commented at 9:53 am on March 7, 2025: contributor

    Before this change the code counted references to CNode objects manually via CNode::nRefCount. Unneeded CNodes were scheduled for deletion by putting them in CConnman::m_nodes_disconnected and were deleted after their reference count reached zero. Deletion consisted of calling PeerManager::FinalizeNode() and then destroying the CNode object.

    Replace this scheme with std::shared_ptr. This simplifies the code and removes:

    • CNode::nRefCount
    • CNode::GetRefCount()
    • CNode::AddRef()
    • CNode::Release()
    • CConnman::m_nodes_disconnected
    • CConnman::NodesSnapshot

    Now creating a snapshot of CConnman::m_nodes is done by simply copying it (under the mutex).
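    For illustration, a snapshot then boils down to the one-liner the PR itself uses later in the thread (each copied shared_ptr keeps its CNode alive until the copy goes out of scope):

        // A snapshot is just a copy of the vector of shared_ptrs, taken under the mutex.
        auto nodes = WITH_LOCK(m_nodes_mutex, return m_nodes);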

    Call PeerManager::FinalizeNode() from the destructor of CNode, which is called when the reference count reaches 0.
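    Schematically (a simplified sketch of the PR's m_destruct_cb mechanism, not the exact code):

        CNode::~CNode()
        {
            // shared_ptr guarantees the destructor runs exactly once, after the
            // last reference is gone, on whichever thread dropped that reference.
            if (m_destruct_cb) m_destruct_cb(*this);
        }

        // The callback that CConnman wires in at construction (as in the PR's diffs):
        //     [this](CNode& node) { m_msgproc->FinalizeNode(node); }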

    This removes brittle code and replaces it with more robust code from the standard library. All of the following are possible with manual reference counting and impossible with shared_ptr (see the sketch after this list):

    • Use an object without adding a reference to it
    • Add too many references to an object, mismatching add/remove with too many “add”s
    • Forget to decrement the reference count, mismatching add/remove with too few “remove”s
    • Drive the reference count negative by mistakenly having too many “remove”s
    • Destroy an object while there are still references to it
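    A minimal, self-contained sketch of the difference (the ManualNode type is hypothetical, not the real CNode API): with manual counting nothing enforces the protocol, while with shared_ptr the bookkeeping is done by the library.

        #include <cassert>
        #include <memory>

        // Hypothetical stand-in for the manual scheme: nothing stops callers from
        // skipping AddRef(), calling Release() too often, or deleting too early.
        struct ManualNode {
            int nRefCount{0};
            void AddRef() { ++nRefCount; }
            void Release() { --nRefCount; } // can go negative; nothing checks
        };

        int main()
        {
            auto* m = new ManualNode{};
            m->Release();               // mismatched “remove”: count is now -1
            delete m;                   // destroyed regardless of outstanding references

            auto s = std::make_shared<ManualNode>();
            auto copy = s;              // copying is the only way to “add” a reference
            assert(s.use_count() == 2); // the count is maintained by the library
            // the object is destroyed exactly once, when the last copy goes away
        }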

    There is zero learning curve for the average C++ programmer who knows how shared_ptr works. Not so with the manual reference counting, because one has to learn about AddRef() and Release(), check and maintain where they are called, and understand their relationship with m_nodes_disconnected.


    This has some history in #28222 and #10738. See #32015 (comment) and #32015 (comment) below.

  2. DrahtBot commented at 9:53 am on March 7, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32015.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK hodlinator, theuni

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #32394 (net: make m_nodes_mutex non-recursive by vasild)
    • #32326 (net: improve the interface around FindNode() and avoid a recursive mutex lock by vasild)
    • #32278 (doc: better document NetEventsInterface and the deletion of “CNode”s by vasild)
    • #32065 (i2p: make a time gap between creating transient sessions and using them by vasild)
    • #30988 (Split CConnman by vasild)
    • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)
    • #28584 (Fuzz: extend CConnman tests by vasild)
    • #28463 (p2p: Increase inbound capacity for block-relay only connections by mzumsande)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label P2P on Mar 7, 2025
  4. vasild commented at 9:53 am on March 7, 2025: contributor
  5. vasild commented at 9:55 am on March 7, 2025: contributor
    cc @hodlinator, you might be interested in this as well
  6. fanquake commented at 10:13 am on March 7, 2025: member

    https://github.com/bitcoin/bitcoin/actions/runs/13718139296/job/38367326902?pr=32015#step:7:1603:

    Run p2p_headers_presync with args ['/Users/runner/work/bitcoin/bitcoin/ci/scratch/build-aarch64-apple-darwin23.6.0/src/test/fuzz/fuzz', PosixPath('/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync')]net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
    Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a"

    net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
    Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a"
    
  7. in src/net.cpp:1938 in 2747f135be outdated
    +
    +                // Keep a reference to this CNode object to delay its destruction until
    +                // after m_nodes_mutex has been released. Destructing a node involves
    +                // calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order
    +                // should be cs_main, m_nodes_mutex.
    +                disconnected_nodes.push_back(node);
    


    dergoegge commented at 11:03 am on March 7, 2025:

    Now that ~CNode is responsible for calling FinalizeNode, the guarantees around which thread does the actual cleanup are lost. It’s basically just whichever thread ends up holding the last reference to a CNode. This seems bug-prone.

    E.g. there are no guarantees (afaict) that disconnected_nodes will actually be the last owner.


    vasild commented at 1:29 pm on March 7, 2025:

    The idea is that it does not matter which thread will destroy the CNode objects.

    there are no guarantees (afaict) that disconnected_nodes will actually be the last owner.

    That is right, but there is no need for such a guarantee. disconnected_nodes exists to guarantee that the objects are not destroyed while m_nodes_mutex is held.


    hodlinator commented at 11:06 pm on March 7, 2025:

    In thread #32015 (review):

    Maybe one could have an additional vector<shared_ptr<CNode>> member in CConnman beyond m_nodes that always keeps the shared_ptr::use_count() above 0. Then use removal from that vector (when use_count() == 1) for controlled deletion with enforced lock order, more like how the code was before?
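    A rough sketch of that idea (hypothetical member name m_owned_nodes; assumes C++20 std::erase_if), with nodes destroyed in exactly one place and outside m_nodes_mutex:

        void CConnman::ReapDisconnectedNodes()
        {
            std::vector<std::shared_ptr<CNode>> to_destroy;
            {
                LOCK(m_nodes_mutex);
                std::erase_if(m_owned_nodes, [&](std::shared_ptr<CNode>& node) {
                    if (node.use_count() == 1) { // we are the last owner
                        to_destroy.push_back(std::move(node));
                        return true;
                    }
                    return false;
                });
            }
            to_destroy.clear(); // ~CNode (and thus FinalizeNode) runs here, lock released
        }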


    vasild commented at 8:55 am on March 11, 2025:

    I do not think that is necessary. I mean: if we assume that the code is going to be modified carelessly enough to break the logic, in other words to destroy CNode objects while holding m_nodes_mutex, then we might as well assume that the code could be modified in a way that breaks the manual reference counting. IMO the manual reference counting is more fragile and error-prone.

    Further, if the code is changed to destroy CNode objects while holding m_nodes_mutex, then 83 functional tests fail. I think it is safe to assume that such a thing will not go unnoticed.


    hodlinator commented at 8:18 pm on March 11, 2025:

    In thread #32015 (review):

    Started proving out my suggestion and landed on something closer to master.

    • Resurrect m_nodes_disconnected (now holding shared_ptr) along with clearly defined lifetimes.
    • Resurrect DeleteNode(), but make it take an r-value shared_ptr and assert that use_count() == 1 (sketched below).
    • Avoid adding ~CNode() and CNode::m_destruct_cb along with passing it in everywhere via the constructor. This means we don’t need to touch 5 of the test files.

    https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_suggestion

    Passes unit tests and functional tests.
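    The suggested DeleteNode() from the second bullet would then look roughly like this (an assumed shape; see the branch for the real code):

        // The caller must hand over the last remaining owner.
        void CConnman::DeleteNode(std::shared_ptr<CNode>&& node)
        {
            assert(node.use_count() == 1); // no snapshot may still reference it
            m_msgproc->FinalizeNode(*node);
            node.reset();                  // the CNode is destroyed here
        }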

  8. DrahtBot added the label CI failed on Mar 7, 2025
  9. DrahtBot commented at 11:14 am on March 7, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/38367328595

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen for a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  10. vasild force-pushed on Mar 7, 2025
  11. vasild commented at 1:23 pm on March 7, 2025: contributor
    2747f135be...e35a382388: fix the CI failure, thanks!
  12. vasild force-pushed on Mar 7, 2025
  13. DrahtBot removed the label CI failed on Mar 7, 2025
  14. in src/net.cpp:1888 in 36d932cebb outdated
    @@ -1884,8 +1884,15 @@ bool CConnman::AddConnection(const std::string& address, ConnectionType conn_typ
         } // no default case, so the compiler can warn about missing cases

         // Count existing connections
    -    int existing_connections = WITH_LOCK(m_nodes_mutex,
    -                                         return std::count_if(m_nodes.begin(), m_nodes.end(), [conn_type](CNode* node) { return node->m_conn_type == conn_type; }););
    +    const int existing_connections = WITH_LOCK(m_nodes_mutex,
    +        return std::count_if(
    


    hodlinator commented at 10:24 pm on March 7, 2025:
    nit: Might as well use std::ranges::count_if here and in GetFullOutboundConnCount.

    vasild commented at 4:06 pm on March 10, 2025:
    Indeed, done, thanks!
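    For reference, the ranges form has roughly this shape (a sketch assuming m_nodes holds shared_ptrs, not the exact committed code):

        const int existing_connections = WITH_LOCK(m_nodes_mutex,
            return std::ranges::count_if(m_nodes, [conn_type](const auto& node) {
                return node->m_conn_type == conn_type;
            }););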
  15. hodlinator commented at 11:13 pm on March 7, 2025: contributor
    Concept ACK 36d932cebbb235a79e90575960646e86a23d39a8
  16. vasild force-pushed on Mar 10, 2025
  17. vasild commented at 4:07 pm on March 10, 2025: contributor
    36d932cebb...af622d00ba: address #32015 (review)
  18. purpleKarrot commented at 6:24 pm on March 10, 2025: contributor
    Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics? Every C++ programmer should rewatch https://youtu.be/W2tWOdzgXHA from time to time.
  19. hodlinator commented at 7:51 pm on March 10, 2025: contributor

    Every C++ programmer should rewatch https://youtu.be/W2tWOdzgXHA from time to time.

    I remembered that as “the seminal std::rotate()-talk”, but it also has “No Raw Pointers” from 48:45. Probably stoked my hatred of smart pointers which led to #31713. To be fair, in the talk he also says that we used to do intrusive ref-counting… so this is at least one step away from that.

    Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics?

    Guess one could have an InternalCNode which is the actual meat, and make the public CNode instances hold shared_ptr<InternalCNode>. Not sure how well that plays out. Maybe you @purpleKarrot could take a stab at something along the lines of what you pointed to? Keep in mind that we try to minimize the number of lines touched in a change unless necessary.

  20. purpleKarrot commented at 9:08 pm on March 10, 2025: contributor
    After reviewing the code line by line, I realized that the only place where AddRef was ever called was directly after construction (plus one place in the fuzz tests, not sure why that is needed). It looks like the CNode instances are not actually shared (good news). Could they be passed around by std::unique_ptr instead? This would greatly simplify future refactoring.
  21. in src/net.cpp:1717 in af622d00ba outdated
    @@ -1716,7 +1714,7 @@ bool CConnman::AttemptToEvictConnection()
             return false;
         }
         LOCK(m_nodes_mutex);
    -    for (CNode* pnode : m_nodes) {
    +    for (const auto& pnode : m_nodes) {
    


    hodlinator commented at 10:01 pm on March 10, 2025:

    Not sure about adding non-transitive const here since we set pnode->fDisconnect inside the loop.

    Similar case in DisconnectNodes()-loop.


    vasild commented at 11:55 am on March 11, 2025:
    Removed the const.
  22. dergoegge commented at 10:15 pm on March 10, 2025: member

    realized that the only places where AddRef was ever called, was directly after construction

    It is also called in the RAII helper NodesSnapshot which is used in different threads (i.e. “net” and “msghand”), so unique_ptr won’t work unfortunately.

    plus one place in the fuzz tests, not sure why that is needed

    It’s probably just there to have it “covered”.

  23. in src/test/util/net.h:57 in af622d00ba outdated
    @@ -54,20 +54,19 @@ struct ConnmanTestMsg : public CConnman {
         void AddTestNode(CNode& node)
         {
            LOCK(m_nodes_mutex);
    -        m_nodes.push_back(&node);
    +        m_nodes.push_back(std::shared_ptr<CNode>(&node));
    


    purpleKarrot commented at 10:41 pm on March 10, 2025:

    This looks suspicious. No custom deleter is passed to the shared_ptr constructor, so it defaults to default_delete, which will eventually run delete on the pointer. Is the argument CNode& node guaranteed to be heap-allocated? Otherwise:

    m_nodes.push_back(std::shared_ptr<CNode>(&node, [](CNode*){}));
    

    vasild commented at 10:25 am on March 11, 2025:

    Yes, the argument CNode& node is always heap-allocated. Not sure about “guaranteed”. If it is not heap-allocated, then this is a gross error, resulting in an immediate crash or an error report by Valgrind or a memory sanitizer.

    Note that the semantics of this piece of code are not changed by this PR. It was the same before as well: it would eventually delete the object.

  24. in src/net.cpp:438 in af622d00ba outdated
    @@ -438,8 +436,7 @@ CNode* CConnman::ConnectNode(CAddress addrConnect, const char *pszDest, bool fCo
                     // It is possible that we already have a connection to the IP/port pszDest resolved to.
                     // In that case, drop the connection that was just created.
                     LOCK(m_nodes_mutex);
    


    hodlinator commented at 10:52 pm on March 10, 2025:
    Since we aren’t doing anything with pnode anymore, we should be able to remove this LOCK() too, avoiding re-entrant locking inside FindNode().

    vasild commented at 12:00 pm on March 11, 2025:
    Right! Removed.
  25. in src/net.cpp:335 in af622d00ba outdated
    @@ -332,32 +332,32 @@ bool IsLocal(const CService& addr)
         return mapLocalHost.count(addr) > 0;
     }

    -CNode* CConnman::FindNode(const CNetAddr& ip)
    


    hodlinator commented at 10:55 pm on March 10, 2025:
    It is notable how unsafe the design of the FindNode() methods was before switching to shared_ptr: returning a raw pointer while letting go of the mutex protecting it.

    purpleKarrot commented at 11:15 pm on March 10, 2025:

    This is not a raw pointer, but an observer. (Sean Parent defines “raw pointer” as “anything that implies ownership”, which explicitly includes shared_ptr. Strictly speaking, changing CNode* to std::shared_ptr<CNode> turns an observer into a raw pointer.)

    I actually think that returning an observer is fine here, even after the change.


    hodlinator commented at 8:02 am on March 11, 2025:

    If that is the case I think Parent goes one step too far in redefining terms.

    Returning a raw C pointer is not an okay design, as another thread could potentially come along and delete whatever the returned pointer references, resulting in a use-after-free. This is mitigated by FindNode being private and node deletion being done in a controlled fashion on master. But it’s a loaded foot-gun IMO.

    Adding EXCLUSIVE_LOCKS_REQUIRED() would have been a different fix, forcing the caller to protect the returned value with the mutex.

    CNode* FindNode(...) EXCLUSIVE_LOCKS_REQUIRED(m_nodes_mutex)
    

    Introducing shared_ptr solves the use-after-free design issue. (Although I agree it would be great to get rid of shared_ptr).


    vasild commented at 12:10 pm on March 11, 2025:
    I agree with @hodlinator that this has a lifetime issue in master, which this PR resolves. That issue was also discussed in another PR. Good to resolve it.
  26. purpleKarrot commented at 11:18 pm on March 10, 2025: contributor

    unique_ptr won’t work unfortunately.

    Got it. m_nodes needs to be able to be snapshotted. But that seems to be the only place where copies are made. FindNode should return non-owning observers rather than extend the ownership.
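    Combined with the EXCLUSIVE_LOCKS_REQUIRED() idea above, an observer-returning FindNode might look like this sketch (hypothetical, not what either branch does):

        // A non-owning observer whose validity is tied to holding the mutex.
        CNode* CConnman::FindNode(const CNetAddr& ip) EXCLUSIVE_LOCKS_REQUIRED(m_nodes_mutex)
        {
            AssertLockHeld(m_nodes_mutex);
            for (const auto& node : m_nodes) {
                if (static_cast<CNetAddr>(node->addr) == ip) return node.get();
            }
            return nullptr;
        }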

  27. vasild force-pushed on Mar 11, 2025
  28. vasild commented at 11:55 am on March 11, 2025: contributor
    af622d00ba...d233c7e999: address #32015 (review)
  29. vasild force-pushed on Mar 11, 2025
  30. vasild commented at 12:00 pm on March 11, 2025: contributor
    d233c7e999...4492abbf0a: rebase and address #32015 (review)
  31. in src/net.h:1361 in 4492abbf0a outdated
    @@ -1363,7 +1355,7 @@ class CConnman
         bool AlreadyConnectedToAddress(const CAddress& addr);

         bool AttemptToEvictConnection();
    -    CNode* ConnectNode(CAddress addrConnect, const char *pszDest, bool fCountFailure, ConnectionType conn_type, bool use_v2transport) EXCLUSIVE_LOCKS_REQUIRED(!m_unused_i2p_sessions_mutex);
    +    std::shared_ptr<CNode> ConnectNode(CAddress addrConnect, const char *pszDest, bool fCountFailure, ConnectionType conn_type, bool use_v2transport) EXCLUSIVE_LOCKS_REQUIRED(!m_unused_i2p_sessions_mutex);
         void AddWhitelistPermissionFlags(NetPermissionFlags& flags, const CNetAddr &addr, const std::vector<NetWhitelistPermissions>& ranges) const;

         void DeleteNode(CNode* pnode);
    


    hodlinator commented at 1:01 pm on March 11, 2025:
    Function body was removed.

    vasild commented at 5:07 am on March 12, 2025:
    Removed the declaration as well, thanks!
  32. in src/net.cpp:3481 in 4492abbf0a outdated
         WITH_LOCK(m_nodes_mutex, nodes.swap(m_nodes));
    -    for (CNode* pnode : nodes) {
    +    for (auto& pnode : nodes) {
             LogDebug(BCLog::NET, "Stopping node, %s", pnode->DisconnectMsg(fLogIPs));
             pnode->CloseSocketDisconnect();
    -        DeleteNode(pnode);
    


    hodlinator commented at 1:14 pm on March 11, 2025:
    Seems brittle on master that we just delete the node here without checking the refcount. From what I see we only call StopNodes() from tests and CConnman::Stop(), and we only create NodesSnapshots inside threads we’ve already stopped earlier in Stop().
  33. vasild force-pushed on Mar 12, 2025
  34. vasild commented at 5:06 am on March 12, 2025: contributor
    4492abbf0a...8c0ce1ca1c: address #32015 (review)
  35. Sjors commented at 10:24 am on March 12, 2025: member

    This seems like a nice simplification. It would be good to do it before nobody remembers what the old code was trying to do (and prevent). Since @theuni tried something similar before in #10738 it would be great if he could look at this again.

    In particular the 2017 PR claimed “special care must be taken” as to which thread destroys the object, but @vasild believes this doesn’t matter (here): #32015 (review)

    The end result of that PR seems a lot more complicated than what this PR does, and perhaps that’s part of why it didn’t get a huge amount of review and was closed eventually? @willcl-ark then tried again in 2023 in #28222 (initially not knowing about the earlier attempt). Similar concerns about threading were raised: #28222#pullrequestreview-1565029167

    It seems that was never really settled and the PR died.

    It would be good to address this in the PR description to prevent going in circles.

  36. vasild commented at 1:18 pm on March 12, 2025: contributor

    Some comments from the other PRs:

    from #10738#issue-240283546:

    Special care must be taken, though, to only delete CNodes from a single thread, and to control which thread that is.

    In master, deletion of CNode objects can happen in the net thread or in the main thread. In other words, some CNode object might be destroyed in the net thread and another CNode object might be destroyed in the main thread. I do not see any problem with that. Of course we do not destroy the same object from more than one thread, in master or in this PR. With this PR deletion might also happen in the msghand thread (when destroying the temporary copy of m_nodes in ThreadMessageHandler()). That is fine as well; correct me if I am wrong.

    from #28222#pullrequestreview-1565029167:

    which thread actually deletes the CNode.

    The reason why that is important is because we only want to call FinalizeNode once and we want there to be no other references to that CNode at that time.

    This PR achieves that by calling FinalizeNode() from ~CNode(). Just to clarify the obvious: the destructor is called only once, and at the time the destructor is called shared_ptr has ensured that there are no other references to the object. This is simpler and easier to reason about than the code in master, which manually waits for the number of references to drop.

  37. vasild commented at 3:48 pm on March 14, 2025: contributor

    Further idea around the second sentence of this:

    #10738#issue-240283546:

    Special care must be taken, though, to only delete CNodes from a single thread, and to control which thread that is. Eventually, this should move to the message processing thread, so that it’s not possible to hit cs_main from the main network thread.

    It is possible to decouple CNode deletion from the FinalizeNode() call and do the latter in the msghand thread regardless of when the CNode is deleted. FinalizeNode() only needs 3 properties from CNode, so they can be extracted from the CNode object and saved to a place that the msghand thread later looks up, calling FinalizeNode() on that data.

    Here is a patch that does that:

    diff --git i/src/net_processing.cpp w/src/net_processing.cpp
    index 1da3ec9d21..1124036a2a 100644
    --- i/src/net_processing.cpp
    +++ w/src/net_processing.cpp
    @@ -487,13 +487,13 @@ public:
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex);
         void NewPoWValidBlock(const CBlockIndex *pindex, const std::shared_ptr<const CBlock>& pblock) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_most_recent_block_mutex);

         /** Implement NetEventsInterface */
         void InitializeNode(const CNode& node, ServiceFlags our_services) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_tx_download_mutex);
    -    void FinalizeNode(const CNode& node) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_headers_presync_mutex, !m_tx_download_mutex);
    +    void FinalizeNode(const NodeFinalizationData& node) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_headers_presync_mutex, !m_tx_download_mutex);
         bool HasAllDesirableServiceFlags(ServiceFlags services) const override;
         bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, !m_headers_presync_mutex, g_msgproc_mutex, !m_tx_download_mutex);
         bool SendMessages(CNode* pto) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, g_msgproc_mutex, !m_tx_download_mutex);

    @@ -1560,80 +1560,78 @@ void PeerManagerImpl::ReattemptInitialBroadcast(CScheduler& scheduler)
         // Schedule next run for 10-15 minutes in the future.
         // We add randomness on every cycle to avoid the possibility of P2P fingerprinting.
         const auto delta = 10min + FastRandomContext().randrange<std::chrono::milliseconds>(5min);
         scheduler.scheduleFromNow([&] { ReattemptInitialBroadcast(scheduler); }, delta);
     }

    -void PeerManagerImpl::FinalizeNode(const CNode& node)
    +void PeerManagerImpl::FinalizeNode(const NodeFinalizationData& node)
     {
    -    NodeId nodeid = node.GetId();
         {
         LOCK(cs_main);
         {
             // We remove the PeerRef from g_peer_map here, but we don't always
             // destruct the Peer. Sometimes another thread is still holding a
             // PeerRef, so the refcount is >= 1. Be careful not to do any
             // processing here that assumes Peer won't be changed before it's
             // destructed.
    -        PeerRef peer = RemovePeer(nodeid);
    +        PeerRef peer = RemovePeer(node.id);
             assert(peer != nullptr);
             m_wtxid_relay_peers -= peer->m_wtxid_relay;
             assert(m_wtxid_relay_peers >= 0);
         }
    -    CNodeState *state = State(nodeid);
    +    CNodeState *state = State(node.id);
         assert(state != nullptr);

         if (state->fSyncStarted)
             nSyncStarted--;

         for (const QueuedBlock& entry : state->vBlocksInFlight) {
             auto range = mapBlocksInFlight.equal_range(entry.pindex->GetBlockHash());
             while (range.first != range.second) {
                 auto [node_id, list_it] = range.first->second;
    -            if (node_id != nodeid) {
    +            if (node_id != node.id) {
                     range.first++;
                 } else {
                     range.first = mapBlocksInFlight.erase(range.first);
                 }
             }
         }
         {
             LOCK(m_tx_download_mutex);
    -        m_txdownloadman.DisconnectedPeer(nodeid);
    +        m_txdownloadman.DisconnectedPeer(node.id);
         }
    -    if (m_txreconciliation) m_txreconciliation->ForgetPeer(nodeid);
    +    if (m_txreconciliation) m_txreconciliation->ForgetPeer(node.id);
         m_num_preferred_download_peers -= state->fPreferredDownload;
         m_peers_downloading_from -= (!state->vBlocksInFlight.empty());
         assert(m_peers_downloading_from >= 0);
         m_outbound_peers_with_protect_from_disconnect -= state->m_chain_sync.m_protect;
         assert(m_outbound_peers_with_protect_from_disconnect >= 0);

    -    m_node_states.erase(nodeid);
    +    m_node_states.erase(node.id);

         if (m_node_states.empty()) {
             // Do a consistency check after the last peer is removed.
             assert(mapBlocksInFlight.empty());
             assert(m_num_preferred_download_peers == 0);
             assert(m_peers_downloading_from == 0);
             assert(m_outbound_peers_with_protect_from_disconnect == 0);
             assert(m_wtxid_relay_peers == 0);
             WITH_LOCK(m_tx_download_mutex, m_txdownloadman.CheckIsEmpty());
         }
         } // cs_main
    -    if (node.fSuccessfullyConnected &&
    -        !node.IsBlockOnlyConn() && !node.IsInboundConn()) {
    +    if (node.was_connected_full_outbound) {
             // Only change visible addrman state for full outbound peers.  We don't
             // call Connected() for feeler connections since they don't have
             // fSuccessfullyConnected set.
             m_addrman.Connected(node.addr);
         }
         {
             LOCK(m_headers_presync_mutex);
    -        m_headers_presync_stats.erase(nodeid);
    +        m_headers_presync_stats.erase(node.id);
         }
    -    LogDebug(BCLog::NET, "Cleared nodestate for peer=%d\n", nodeid);
    +    LogDebug(BCLog::NET, "Cleared nodestate for peer=%d\n", node.id);
     }

     bool PeerManagerImpl::HasAllDesirableServiceFlags(ServiceFlags services) const
     {
         // Shortcut for (services & GetDesirableServiceFlags(services)) == GetDesirableServiceFlags(services)
         return !(GetDesirableServiceFlags(services) & (~services));
    diff --git i/src/net.h w/src/net.h
    index ff83ecde4d..cbe044f27e 100644
    --- i/src/net.h
    +++ w/src/net.h
    @@ -1000,14 +1000,29 @@ public:
         /** Mutex for anything that is only accessed via the msg processing thread */
         static Mutex g_msgproc_mutex;

         /** Initialize a peer (setup state) */
         virtual void InitializeNode(const CNode& node, ServiceFlags our_services) = 0;

    +    /** Data needed for `FinalizeNode()`. */
    +    struct NodeFinalizationData {
    +        NodeFinalizationData(const CNode& node)
    +            : id{node.GetId()},
    +              addr{node.addr},
    +              was_connected_full_outbound{node.fSuccessfullyConnected &&
    +                                          !node.IsBlockOnlyConn() &&
    +                                          !node.IsInboundConn()}
    +        {}
    +
    +        const NodeId id;
    +        const CService addr;
    +        const bool was_connected_full_outbound;
    +    };
    +
         /** Handle removal of a peer (clear state) */
    -    virtual void FinalizeNode(const CNode& node) = 0;
    +    virtual void FinalizeNode(const NodeFinalizationData& node_finalization_data) = 0;

         /**
          * Callback to determine whether the given set of service flags are sufficient
          * for a peer to be "relevant".
          */
         virtual bool HasAllDesirableServiceFlags(ServiceFlags services) const = 0;
    @@ -1112,15 +1127,17 @@ public:

         ~CConnman();

         bool Start(CScheduler& scheduler, const Options& options) EXCLUSIVE_LOCKS_REQUIRED(!m_total_bytes_sent_mutex, !m_added_nodes_mutex, !m_addr_fetches_mutex, !mutexMsgProc);

         void StopThreads();
    -    void StopNodes();
    -    void Stop()
    +    void StopNodes() EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex);
    +    void Stop() EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex)
         {
    +        AssertLockNotHeld(m_nodes_for_finalization_mutex);
    +
             StopThreads();
             StopNodes();
         };

         void Interrupt() EXCLUSIVE_LOCKS_REQUIRED(!mutexMsgProc);
         bool GetNetworkActive() const { return fNetworkActive; };
    @@ -1261,12 +1278,15 @@ public:

         /** Return true if we should disconnect the peer for failing an inactivity check. */
         bool ShouldRunInactivityChecks(const CNode& node, std::chrono::seconds now) const;

         bool MultipleManualOrFullOutboundConns(Network net) const EXCLUSIVE_LOCKS_REQUIRED(m_nodes_mutex);

    +    /** Remember some of the node's properties for calling `m_msgproc->FinalizeNode()` later. */
    +    void RememberForFinalization(const CNode& node) EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex);
    +
     private:
         struct ListenSocket {
         public:
             std::shared_ptr<Sock> sock;
             inline void AddSocketPermissionFlags(NetPermissionFlags& flags) const { NetPermissions::AddFlag(flags, m_permissions); }
             ListenSocket(std::shared_ptr<Sock> sock_, NetPermissionFlags permissions_)
    @@ -1287,13 +1307,13 @@ private:
         bool InitBinds(const Options& options);

         void ThreadOpenAddedConnections() EXCLUSIVE_LOCKS_REQUIRED(!m_added_nodes_mutex, !m_unused_i2p_sessions_mutex, !m_reconnections_mutex);
         void AddAddrFetch(const std::string& strDest) EXCLUSIVE_LOCKS_REQUIRED(!m_addr_fetches_mutex);
         void ProcessAddrFetch() EXCLUSIVE_LOCKS_REQUIRED(!m_addr_fetches_mutex, !m_unused_i2p_sessions_mutex);
         void ThreadOpenConnections(std::vector<std::string> connect, Span<const std::string> seed_nodes) EXCLUSIVE_LOCKS_REQUIRED(!m_addr_fetches_mutex, !m_added_nodes_mutex, !m_nodes_mutex, !m_unused_i2p_sessions_mutex, !m_reconnections_mutex);
    -    void ThreadMessageHandler() EXCLUSIVE_LOCKS_REQUIRED(!mutexMsgProc);
    +    void ThreadMessageHandler() EXCLUSIVE_LOCKS_REQUIRED(!mutexMsgProc, !m_nodes_for_finalization_mutex);
         void ThreadI2PAcceptIncoming();
         void AcceptConnection(const ListenSocket& hListenSocket);

         /**
          * Create a `CNode` object from a socket that has just been accepted and add the node to
          * the `m_nodes` member.
    @@ -1305,12 +1325,16 @@ private:
         void CreateNodeFromAcceptedSocket(std::unique_ptr<Sock>&& sock,
                                           NetPermissionFlags permission_flags,
                                           const CService& addr_bind,
                                           const CService& addr);

         void DisconnectNodes() EXCLUSIVE_LOCKS_REQUIRED(!m_reconnections_mutex, !m_nodes_mutex);
    +
    +    /** Call `m_msgproc->FinalizeNode()` on all entries from `m_nodes_for_finalization`. */
    +    void FinalizeNodes() EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex);
    +
         void NotifyNumConnectionsChanged();
         /** Return true if the peer is inactive and should be disconnected. */
         bool InactivityCheck(const CNode& node) const;

         /**
          * Generate a collection of sockets to check for IO readiness.
    @@ -1395,12 +1419,17 @@ private:
         // Whether the node should be passed out in ForEach* callbacks
         static bool NodeFullyConnected(const CNode& node);

         uint16_t GetDefaultPort(Network net) const;
         uint16_t GetDefaultPort(const std::string& addr) const;

    +    Mutex m_nodes_for_finalization_mutex;
    +
    +    /** When a node is destroyed some of its properties are saved here for calling m_msgproc->FinalizeNode() later. */
    +    std::queue<std::unique_ptr<NetEventsInterface::NodeFinalizationData>> m_nodes_for_finalization GUARDED_BY(m_nodes_for_finalization_mutex);
    +
         // Network usage totals
         mutable Mutex m_total_bytes_sent_mutex;
         std::atomic<uint64_t> nTotalBytesRecv{0};
         uint64_t nTotalBytesSent GUARDED_BY(m_total_bytes_sent_mutex) {0};

         // outbound limit & stats
    diff --git i/src/net.cpp w/src/net.cpp
    index d641579f6b..3a6ba226d0 100644
    --- i/src/net.cpp
    +++ w/src/net.cpp
    @@ -534,13 +534,13 @@ std::shared_ptr<CNode> CConnman::ConnectNode(CAddress addrConnect, const char *p
                 CalculateKeyedNetGroup(target_addr),
                 nonce,
                 addr_bind,
                 pszDest ? pszDest : "",
                 conn_type,
                 /*inbound_onion=*/false,
    -            [this](CNode& node) { m_msgproc->FinalizeNode(node); },
    +            [this](CNode& node) { RememberForFinalization(node); },
                 CNodeOptions{
                     .permission_flags = permission_flags,
                     .i2p_sam_session = std::move(i2p_transient_session),
                     .recv_flood_size = nReceiveFloodSize,
                     .use_v2transport = use_v2transport,
                 });
    @@ -1832,13 +1832,13 @@ void CConnman::CreateNodeFromAcceptedSocket(std::unique_ptr<Sock>&& sock,
             CalculateKeyedNetGroup(addr),
             nonce,
             addr_bind,
             /*addrNameIn=*/"",
             ConnectionType::INBOUND,
             inbound_onion,
    -        [this](CNode& node) { m_msgproc->FinalizeNode(node); },
    +        [this](CNode& node) { RememberForFinalization(node); },
             CNodeOptions{
                 .permission_flags = permission_flags,
                 .prefer_evict = discouraged,
                 .recv_flood_size = nReceiveFloodSize,
                 .use_v2transport = use_v2transport,
             });
    @@ -1904,14 +1904,12 @@ void CConnman::DisconnectNodes()
         AssertLockNotHeld(m_reconnections_mutex);

         // Use a temporary variable to accumulate desired reconnections, so we don't need
         // m_reconnections_mutex while holding m_nodes_mutex.
         decltype(m_reconnections) reconnections_to_add;

    -    std::vector<std::shared_ptr<CNode>> disconnected_nodes;
    -
         {
             LOCK(m_nodes_mutex);

             if (!fNetworkActive) {
                 // Disconnect any connected nodes
                 for (auto& pnode : m_nodes) {
    @@ -1925,18 +1923,12 @@ void CConnman::DisconnectNodes()
             // Disconnect unused nodes
             for (auto it = m_nodes.begin(); it != m_nodes.end();) {
                 auto node = *it;
                 if (node->fDisconnect) {
                     it = m_nodes.erase(it);

    -                // Keep a reference to this CNode object to delay its destruction until
    -                // after m_nodes_mutex has been released. Destructing a node involves
    -                // calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order
    -                // should be cs_main, m_nodes_mutex.
    -                disconnected_nodes.push_back(node);
    -
                     // Add to reconnection list if appropriate. We don't reconnect right here, because
                     // the creation of a connection is a blocking operation (up to several seconds),
                     // and we don't want to hold up the socket handler thread for that long.
                     if (node->m_transport->ShouldReconnectV1()) {
                         reconnections_to_add.push_back({
                             .addr_connect = node->addr,
    @@ -1964,12 +1956,38 @@ void CConnman::DisconnectNodes()
             // Move entries from reconnections_to_add to m_reconnections.
             LOCK(m_reconnections_mutex);
             m_reconnections.splice(m_reconnections.end(), std::move(reconnections_to_add));
         }
     }

    +void CConnman::RememberForFinalization(const CNode& node)
    +{
    +    LOCK(m_nodes_for_finalization_mutex);
    +    m_nodes_for_finalization.emplace(std::make_unique<NetEventsInterface::NodeFinalizationData>(node));
    +}
    +
    +void CConnman::FinalizeNodes()
    +{
    +    AssertLockNotHeld(m_nodes_for_finalization_mutex);
    +
    +    for (;;) {
    +        std::unique_ptr<NetEventsInterface::NodeFinalizationData> node_finalization_data;
    +
    +        {
    +            LOCK(m_nodes_for_finalization_mutex);
    +            if (m_nodes_for_finalization.empty()) {
    +                break;
    +            }
    +            node_finalization_data.swap(m_nodes_for_finalization.front());
    +            m_nodes_for_finalization.pop();
    +        }
    +
    +        m_msgproc->FinalizeNode(*node_finalization_data);
    +    }
    +}
    +
     void CConnman::NotifyNumConnectionsChanged()
     {
         size_t nodes_size;
         {
             LOCK(m_nodes_mutex);
             nodes_size = m_nodes.size();
    @@ -3050,12 +3068,14 @@ void CConnman::ThreadMessageHandler()

             WAIT_LOCK(mutexMsgProc, lock);
             if (!fMoreWork) {
                 condMsgProc.wait_until(lock, std::chrono::steady_clock::now() + std::chrono::milliseconds(100), [this]() EXCLUSIVE_LOCKS_REQUIRED(mutexMsgProc) { return fMsgProcWake; });
             }
             fMsgProcWake = false;
    +
    +        FinalizeNodes();
         }
     }

     void CConnman::ThreadI2PAcceptIncoming()
     {
         static constexpr auto err_wait_begin = 1s;
    @@ -3440,12 +3460,14 @@ void CConnman::StopThreads()
         if (threadSocketHandler.joinable())
             threadSocketHandler.join();
     }

     void CConnman::StopNodes()
     {
    +    AssertLockNotHeld(m_nodes_for_finalization_mutex);
    +
         if (fAddressesInitialized) {
             DumpAddresses();
             fAddressesInitialized = false;

             if (m_use_addrman_outgoing) {
                 // Anchor connections are only dumped during clean shutdown.
    @@ -3466,12 +3488,14 @@ void CConnman::StopNodes()
         }
         nodes.clear();

         vhListenSocket.clear();
         semOutbound.reset();
         semAddnode.reset();
    +
    +    FinalizeNodes(); // Finalize any leftovers which were added after the msghand thread was interrupted.
     }

     CConnman::~CConnman()
     {
         Interrupt();
         Stop();
    diff --git i/src/test/fuzz/p2p_handshake.cpp w/src/test/fuzz/p2p_handshake.cpp
    index d881cb35ba..99b2d70706 100644
    --- i/src/test/fuzz/p2p_handshake.cpp
    +++ w/src/test/fuzz/p2p_handshake.cpp
    @@ -61,14 +61,14 @@ FUZZ_TARGET(p2p_handshake, .init = ::initialize)

         LOCK(NetEventsInterface::g_msgproc_mutex);

         std::vector<CNode*> peers;
         const auto num_peers_to_add = fuzzed_data_provider.ConsumeIntegralInRange(1, 3);
         for (int i = 0; i < num_peers_to_add; ++i) {
    -        peers.push_back(ConsumeNodeAsUniquePtr(fuzzed_data_provider, i, [&peerman](CNode& node) {
    -                            peerman->FinalizeNode(node);
    +        peers.push_back(ConsumeNodeAsUniquePtr(fuzzed_data_provider, i, [&connman](CNode& node) {
    +                            connman.RememberForFinalization(node);
                             }).release());
             connman.AddTestNode(*peers.back());
             peerman->InitializeNode(
                 *peers.back(),
                 static_cast<ServiceFlags>(fuzzed_data_provider.ConsumeIntegral<uint64_t>()));
         }
    diff --git i/src/test/fuzz/p2p_headers_presync.cpp w/src/test/fuzz/p2p_headers_presync.cpp
    index 4c470dab1c..d6159efaee 100644
    --- i/src/test/fuzz/p2p_headers_presync.cpp
    +++ w/src/test/fuzz/p2p_headers_presync.cpp
    @@ -58,13 +58,13 @@ void HeadersSyncSetup::ResetAndInitialize()
             ConnectionType::INBOUND
         };

         for (auto conn_type : conn_types) {
             CAddress addr{};
             m_connections.push_back(new CNode(id++, nullptr, addr, 0, 0, addr, "", conn_type, false,
    -                                          [this](CNode& node) { m_node.peerman->FinalizeNode(node); }));
    +                                          [this](CNode& node) { m_node.connman->RememberForFinalization(node); }));
             CNode& p2p_node = *m_connections.back();

             connman.Handshake(
                 /*node=*/p2p_node,
                 /*successfully_connected=*/true,
                 /*remote_services=*/ServiceFlags(NODE_NETWORK | NODE_WITNESS),
    diff --git i/src/test/fuzz/process_message.cpp w/src/test/fuzz/process_message.cpp
    index b2031a60cc..a7b6913584 100644
    --- i/src/test/fuzz/process_message.cpp
    +++ w/src/test/fuzz/process_message.cpp
    @@ -64,13 +64,13 @@ FUZZ_TARGET(process_message, .init = initialize_process_message)

         const std::string random_message_type{fuzzed_data_provider.ConsumeBytesAsString(CMessageHeader::MESSAGE_TYPE_SIZE).c_str()};
         if (!LIMIT_TO_MESSAGE_TYPE.empty() && random_message_type != LIMIT_TO_MESSAGE_TYPE) {
             return;
         }
         CNode& p2p_node = *ConsumeNodeAsUniquePtr(fuzzed_data_provider, std::nullopt, [](CNode& node) {
    -                           g_setup->m_node.peerman->FinalizeNode(node);
    +                           g_setup->m_node.connman->RememberForFinalization(node);
                            }).release();

         connman.AddTestNode(p2p_node);
         FillNode(fuzzed_data_provider, connman, p2p_node);

         const auto mock_time = ConsumeTime(fuzzed_data_provider);
    diff --git i/src/test/fuzz/process_messages.cpp w/src/test/fuzz/process_messages.cpp
    index 89584e3d59..9377327f8e 100644
    --- i/src/test/fuzz/process_messages.cpp
    +++ w/src/test/fuzz/process_messages.cpp
    @@ -53,13 +53,13 @@ FUZZ_TARGET(process_messages, .init = initialize_process_messages)
         LOCK(NetEventsInterface::g_msgproc_mutex);

         std::vector<CNode*> peers;
         const auto num_peers_to_add = fuzzed_data_provider.ConsumeIntegralInRange(1, 3);
         for (int i = 0; i < num_peers_to_add; ++i) {
             peers.push_back(ConsumeNodeAsUniquePtr(fuzzed_data_provider, i, [&connman](CNode& node) {
    -                            connman.MsgProc()->FinalizeNode(node);
    +                            connman.RememberForFinalization(node);
                             }).release());
             CNode& p2p_node = *peers.back();

             FillNode(fuzzed_data_provider, connman, p2p_node);

             connman.AddTestNode(p2p_node);
         }

    Maybe worth exploring in a followup if people think that makes sense. Calling FinalizeNode() on the msghand thread to avoid locking cs_main in the net thread looks out of scope for this PR, which is about removing the manual reference counting of CNode.

  38. DrahtBot added the label Needs rebase on Mar 20, 2025
  39. vasild force-pushed on Mar 20, 2025
  40. vasild commented at 5:45 pm on March 20, 2025: contributor
    8c0ce1ca1c...dcb0be18ac: rebase due to conflicts
  41. DrahtBot removed the label Needs rebase on Mar 20, 2025
  42. hodlinator commented at 8:59 am on March 24, 2025: contributor

    unique_ptr

    Edit: Please see newer branch: #32015 (comment)

    Here is a branch implementing another possible variant, using unique_ptr as suggested by @purpleKarrot in #32015 (comment): https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_unique_ptr

    Pros

    • It keeps the CNode ownership/lifetime under control of CConnman::m_nodes/m_nodes_disconnected. This can be considered an advancement over the unclear ownership/lifetime of shared_ptr (“global variables in disguise”).

    • It avoids the need for introducing ~CNode() and m_destruct_cb together with passing a callback into CNode() everywhere.

    Cons

    • It requires more ongoing awareness of CNode lifetime than when using shared_ptr. However, once we grok CNode lifetimes, they appear fairly simple (unless I’m missing something):
    CConnman::ThreadSocketHandler()
        while (!interruptNet)
            DisconnectNodes()
                for m_nodes
                    move disconnected nodes to m_nodes_disconnected
                for m_nodes_disconnected
                    // Not referenced from other threads.
                    // And no new snapshots possible since we are in m_nodes_disconnected.
                    if refcount <= 0
                        DeleteNode()
            CConnman::SocketHandler()
                NodesSnapshot

    CConnman::ThreadMessageHandler()
        while (!flagInterruptMsgProc)
            NodesSnapshot

    Shutdown() (init.cpp)
        CConnman::Stop()
            StopThreads()
                threadMessageHandler.join()
                threadSocketHandler.join()
            StopNodes()
                for m_nodes
                    DeleteNode()
                for m_nodes_disconnected
                    DeleteNode()
    • It retains CNode::nRefCount (but documents its purpose as tracking NodesSnapshot references).
  43. DrahtBot added the label Needs rebase on Mar 24, 2025
  44. ryanofsky commented at 5:13 pm on April 1, 2025: contributor

    Comparing the current shared_ptr change dcb0be18ac64f001abaf646930bae1b5f6dba8c0 with the suggested unique_ptr change cb959340a6734c3461f6d51a4a689f6067ba5e91, both seem like reasonable approaches.

    • I agree with vasild’s point that the drawbacks of manual reference counting outweigh the drawbacks of shared_ptr, even if the new shared_ptr code makes it a little harder to understand exactly where objects will be deleted. So this PR does seem like an improvement over the status quo.
    • Not sure I understand the purpose of the FinalizeNode patch #32015 (comment) and whether it has benefits related to this PR, or unrelated to it?
    • The suggested unique_ptr branch also seems nice because the changes are broken up into smaller commits. And even though it doesn’t completely remove reference counting, it does hide it inside the NodesSnapshot class, so maybe that is enough to achieve the goals of this PR. I think the branch might also be able to go further, deleting the CNode AddSnapshot, ReleaseSnapshot, and GetSnapshotCount methods and making NodesSnapshot a friend class that accesses the reference count directly (see the sketch below). This could make it easier to see and guarantee that NodesSnapshot is the only thing accessing the reference count.

    Overall it could be nice to adopt at least parts of the unique_ptr branch here.
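    A sketch of the friend-class idea (hypothetical, reduced to a single node for brevity; as hodlinator notes below, CNode would have to name NodesSnapshot, which conflicts with it being private to CConnman):

        #include <atomic>

        class NodesSnapshot; // assumed hoisted out of CConnman for this sketch

        class CNode
        {
            friend class NodesSnapshot;          // the only code allowed to count refs
            std::atomic<int> m_snapshot_refs{0}; // no public AddSnapshot()/ReleaseSnapshot()
        };

        class NodesSnapshot
        {
        public:
            explicit NodesSnapshot(CNode& node) : m_node{node} { ++m_node.m_snapshot_refs; }
            ~NodesSnapshot() { --m_node.m_snapshot_refs; }

        private:
            CNode& m_node;
        };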

  45. hodlinator commented at 7:50 pm on April 1, 2025: contributor

    I think the branch might also be able to also go further, deleting the CNode AddSnapshot, ReleaseSnapshot, and GetSnapshotCount methods and making NodesSnapshot a friend class that accesses the reference count directly. This could make it easier to see and guarantee that NodesSnapshot is the only thing accessing the reference count.

    I thought about making a base class of CNode that exposed snapshot refcounting through a friend-relationship, but I like the fact that NodesSnapshot is a private class inside CConnman (and didn’t see a way to serve both goals).

  46. theuni commented at 9:43 pm on April 11, 2025: member

    As @Sjors mentioned, I attempted to do this a while ago, but brought in a whole new smart pointer impl to keep the threading guarantees, so it never got much review.

    In master deletion of CNode objects could happen in the net thread or in the main thread. In other words, some CNode object might be destroyed in the net thread and another CNode object might be destroyed in the main thread.

    Where can this happen? At least in the current model, that sounds like a bug to me.

    The idea is that it does not matter which thread will destroy the CNode objects.

    What makes this suddenly ok? I recall that deleting from the socket handler thread was a hard requirement for safety/sanity.

    This makes me very uncomfortable. It’s been a while since I’ve looked at this code so the details are fuzzy now, but I don’t like the idea of refactoring the simple refcounters and changing the threading guarantees for no obvious gain. Is there some strong motivation for this I’m missing?

  47. hodlinator commented at 10:26 pm on April 11, 2025: contributor

    In master deletion of CNode objects could happen in the net thread or in the main thread. In other words, some CNode object might be destroyed in the net thread and another CNode object might be destroyed in the main thread.

    Where can this happen? At least in the current model, that sounds like a bug to me.

    The “diagram” at the bottom of #32015 (comment) may be helpful. The net-thread corresponds to CConnman::ThreadSocketHandler() and the main-thread corresponds to Shutdown() (init.cpp). (See calls to DeleteNode() in the diagram).

  48. vasild force-pushed on Apr 13, 2025
  49. vasild commented at 5:22 pm on April 13, 2025: contributor

    dcb0be1...71750da: rebase due to conflicts

    I do not understand the fixation on knowing and controlling where an object is destroyed from. The elegance and simplicity of the whole idea behind shared_ptr is that it does not matter. @hodlinator, I checked https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_unique_ptr, here is some quick feedback:

    • It does not remove the manual reference counting, which is the aim of this PR. That diff mostly replaces the raw pointer CNode* with std::unique_ptr<CNode>.
    • To implement the snapshots, it takes raw pointers out of the unique_ptr and stores them, which is an anti-pattern because the lifetime of the raw pointer then has to be manually kept shorter than the lifetime of the unique_ptr. I think that defeats the purpose of unique_ptr itself.
    • Any operation on the std::vector<std::unique_ptr<CNode>> m_nodes that deletes elements, e.g. clear(), erase(), pop_back() or resize(), will automatically destroy the CNode objects even if they are referenced (manual ref count > 0). @theuni,

    The idea is that it does not matter which thread will destroy the CNode objects.

    What makes this suddenly ok? I recall that deleting from the socket handler thread was a hard requirement for safety/sanity.

    I do not know about that. Can you elaborate? PeerManagerImpl::FinalizeNode() has to be called on the CNode before it is destroyed. Could this be a misunderstanding, i.e. that from this it follows that CNode must be destroyed from a particular thread?

    … I don’t like the idea of refactoring the simple refcounters and changing the threading guarantees for no obvious gain …

    Hmm, “simple refcounters”… IMO those are “complicated”. That might be subjective. But this patch reduces the code by about 70 lines and entirely drops the need for snapshots as well. There is no changing of “threading guarantees” here. You seem to be thinking that CNode objects can and must only be destroyed from the net thread, but that can just as well happen from the main thread. I do not think that this is a bug. And it is just fine, IMO, that with this PR deletion can now happen from the msghand thread. If you think that is not ok, can you explain why?

  50. DrahtBot removed the label Needs rebase on Apr 13, 2025
  51. vasild commented at 4:53 am on April 14, 2025: contributor

    I thought a bit more about the unique_ptr + manual refcounting. It seems to me that the manual refcounting can be replaced by the internal refcounting in shared_ptr, like this:

    • drop CNode::nRefCount; the same counter lives inside shared_ptr
    • CNode::GetSnapshotCount() -> std::shared_ptr::use_count()
    • CNode::AddSnapshot() + copying the raw pointer out of the unique_ptr -> copy the shared_ptr
    • CNode::ReleaseSnapshot() -> destroy the shared_ptr copy
    • drop all of CConnman::NodesSnapshot -> just make a copy of m_nodes under the mutex

    The purpose of CConnman::m_nodes_disconnected and CConnman::DeleteNode() seems to be to call FinalizeNode() before the CNode object is destroyed. If m_nodes_disconnected and DeleteNode() are replaced with calling FinalizeNode() from ~CNode(), then it becomes this PR.

  52. theuni commented at 8:39 pm on April 14, 2025: member

    Note: This is a long and rambling comment, which mostly consists of me reminding myself about the specifics of this code. I’m leaving the whole thing in case my thought process helps any other reviewers.


    The idea is that it does not matter which thread will destroy the CNode objects.

    What makes this suddenly ok? I recall that deleting from the socket handler thread was a hard requirement for safety/sanity.

    I do not know about that. Can you elaborate? PeerManagerImpl::FinalizeNode() has to be called on the CNode before it is destroyed. Could this be some misunderstanding that from this follows that CNode must be destroyed from a particular thread?

    Yes, the main thread may call FinalizeNode(), but at that point the message handler and socket handler threads have already been joined, so that’s not really interesting.

    We currently guarantee that no node will be deleted from threadMessageHandler as it iterates. That’s the one that matters.

    This guarantee protects us from a large footgun. Note that my original motivation for #10738 had nothing to do with code cleanliness; rather, I was targeting a substantial optimization: CConnman::ForNode / CConnman::ForEachNode both hold m_nodes_mutex for the duration of their callbacks. Those callbacks may be long-lived, substantial, and/or have scary locks themselves, so blocking the socket handler thread for them really sucks.

    Crucially, PeerManagerImpl::State(nodeid) will stay valid for any reachable nodeid for the duration of SendMessages and ProcessMessages.

    Now consider if:

    with this PR deletion can now happen from the msghand thread

    A reasonable follow-up to this PR (I actually expected to see it here) would be:

    diff --git a/src/net.cpp b/src/net.cpp
    index b68d2fb37aa..d5631c056ce 100644
    --- a/src/net.cpp
    +++ b/src/net.cpp
    @@ -3902,13 +3902,16 @@ void CConnman::PushMessage(CNode* pnode, CSerializedNetMsg&& msg)

     bool CConnman::ForNode(NodeId id, std::function<bool(CNode* pnode)> func)
     {
    +    std::shared_ptr<CNode> found_node = nullptr;
    +    {
         LOCK(m_nodes_mutex);
         for (auto&& pnode : m_nodes) {
             if(pnode->GetId() == id) {
    -            return NodeFullyConnected(*pnode) && func(pnode.get());
    +            found_node = pnode;
             }
         }
    -    return false;
    +    }
    +    return found_node && NodeFullyConnected(*found_node) && func(found_node.get());
     }

     CSipHasher CConnman::GetDeterministicRandomizer(uint64_t id) const
    diff --git a/src/net.h b/src/net.h
    index e1318c8e66b..bff5c5f423f 100644
    --- a/src/net.h
    +++ b/src/net.h
    @@ -1140,11 +1140,9 @@ public:
         using NodeFn = std::function<void(CNode*)>;
         void ForEachNode(const NodeFn& func)
         {
    -        LOCK(m_nodes_mutex);
    -        for (auto&& node : m_nodes) {
    -            if (NodeFullyConnected(*node)) {
    -                func(node.get());
    -            }
    +        auto nodes = WITH_LOCK(m_nodes_mutex, return m_nodes);
    +        for (auto&& node : nodes) {
    +            if (NodeFullyConnected(*node)) func(node.get());
         }
         };
    

    Right? Because if the threading restriction is “gone”, that seems like it should be safe. Like I said, I did that as part of the PR mentioned above: https://github.com/bitcoin/bitcoin/pull/10738/commits/6dc671a22046c6c09a6de2b1d38e9e2fa3920628

    But if we’re not careful, the nodes could now be deleted as a result of ForNode/ForEachNode, or any function that copies the node vector under the lock and uses the copy.

    The problem with this is that it makes interactions with PeerManager super hard to reason about. Consider a function that does something like this:

    PeerManagerImpl::SomeFunction()
    {
        NodeId to_disconnect = CallSomeFunctionToGetWorstNode();
        m_connman.ForNode(to_disconnect, [&](CNode* pnode) {
            ...
            pnode->fDisconnect = true;
            ...
        });
        SomeFunctionThatTakesAFewSeconds();
        CNodeState *state = State(to_disconnect); // YIKES
    }
    

    The above isn’t contrived either; EvictExtraOutboundPeers/NewPoWValidBlock do things like that.

    HOWEVER

    As-is, the impl here currently protects against nodes being deleted in SendMessages and ProcessMessages using the same refcounting trick that we do in master. Frustratingly, in master it’s already abstracted away by NodesSnapshot so it’s not immediately obvious (RAII notwithstanding, it’s an annoying trend in our repo imo to hide crucial implementation details behind abstractions for the sake of having fewer lines of code). And this PR takes it one step further:

    auto nodes = WITH_LOCK(m_nodes_mutex, return m_nodes);
    

    ^^ This line is now in charge of keeping the shared_ptrs alive until the end of the message handling loop. If not for that (for ex, if we instead kept weak_ptrs and upgraded them as we passed through the loop), the ForEachNode case I explained above would be a problem.
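
    To make the contrast concrete, here is a sketch of that weak_ptr alternative (not what the PR does), which would pin each node only for its own iteration:

        std::vector<std::weak_ptr<CNode>> weak_nodes; // filled from m_nodes under the mutex
        for (const auto& weak : weak_nodes) {
            if (std::shared_ptr<CNode> node = weak.lock()) {
                // The node is kept alive only inside this block; between
                // iterations the last strong reference could drop, and ~CNode
                // (and thus FinalizeNode) could run mid-loop.
            }
        }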

    So I don’t think it’s true (and imo it’s quite dangerous to assume) that it’s “safe to delete a CNode from any thread”. What’s happening here is that it’s now possible that nodes are deleted in two places rather than one: either after a pass through the socket handler loop, or after a pass through the message handler loop. That still makes me uncomfortable, but I admit that I can’t find any problems with it.

    This PR also maintains the invariant that the socket handler thread is always completely finished with all socket operations before FinalizeNode is called.

    And as far as I can tell, this PR does allow us to take the patch above, getting rid of the nasty lock-holding in ForEachNode and ForNode. BUT that would open up the possibility of deletion from other callers. For example, it would make it possible for the RPC thread to delete the node from sendmsgtopeer(). And if any future code stashed the shared_ptrs from CConnman, it would make it possible for the deletion to happen anywhere.


    TLDR

    This is a combination of 2 changes: replacing our refcounters with shared_ptrs (style), and adding a new deletion point (functionality). As far as I can tell, the 2nd change is a side-effect of the first because it’s not straightforward to implement this with shared pointers while avoiding the possible deletion in ThreadMessageHandler(). I really don’t care for behavioral changes that are dictated by style changes. So the question for reviewers, imo, is: is the functional change a good one?

    I don’t trust that it would be safe to simply allow CNodes to be deleted just anywhere. But crucially, that’s not what’s being proposed in the code here. Instead, a single new deletion point has been introduced (at the end of the message processing loop), which is much more reasonable, as it protects SendMessages and ProcessMessages as before.

    I’m not completely opposed to this, but I still find it scary and rather unjustified. I’m not convinced by the “it reduces the lines of code” argument. In fact, I would probably prefer to add comments around the NodesSnapshot to obviate what it’s doing.

    If we’re going to move forward with this I think we need:

    • Split the style change and functional change into separate commits: first the shared_ptr change as a no-op, then the additional deletion point. Edit: Looks like this isn’t realistic.
    • Justification besides “code cleanup”. I don’t think that’s a valid reason to be making behavioral changes to such low-level code.
    • In-place comments that describe the guarantees we provide. If the safety of the p2p layer revolves around a copy of a vector of shared_ptrs, that needs to be made clear. That nodes.clear() is also too crucial to not have a comment.
    • Documentation about the finalization/deletion process in general. This should probably happen even if we don’t take these changes.
  53. ryanofsky commented at 3:21 pm on April 15, 2025: contributor

    @theuni would be curious for your feedback on hodlinator’s unique_ptr + isolated refcount approach in #32015 (comment).

    The more I read here (not fully understanding everything) the more that approach seems appealing because when you say things like “it’s now possible that nodes are deleted in two places rather than one” and “But if we’re not careful, the nodes could now be deleted as a result of ForNode/ForEachNode” it really makes shared_ptr seem like a poor fit. Shared_ptr can be great when objects have trivial destructors and it doesn’t actually matter when they are deleted, but if you need control over when resources are freed, more explicit refcounting would seem safer and simpler.

    It seems like both you and vasild like making a dichotomy between (1) using manual refcounting and (2) using shared_ptr, and not wanting to choose a middle ground like hodlinator’s approach, which provides automatic reference counting through an explicit mechanism that is less flexible but simpler than shared_ptr. I’m not really sure why that is.

  54. theuni commented at 3:47 pm on April 15, 2025: member

    The more I read here (not fully understanding everything) the more that approach seems appealing because when you say things like “it’s now possible that nodes are deleted in two places rather than one” and “But if we’re not careful, the nodes could now be deleted as a result of ForNode/ForEachNode” it really makes shared_ptr seem like a poor fit. Shared_ptr can be great when objects have trivial destructors and it doesn’t actually matter when they are deleted, but if you need control over when resources are freed, more explicit refcounting would seem safer and simpler.

    I agree.

    It seems like both you and vasild like making a dichotomy between (1) using manual refcounting and (2) using shared_ptr, and not wanting to choose a middle ground like hodlinator’s approach, which provides automatic reference counting through an explicit mechanism that is less flexible but simpler than shared_ptr. I’m not really sure why that is.

    No reason, I just haven’t looked at it yet.

    I spent all of yesterday poking around and trying to remind myself of where be the dragons in this code, hence the rambly and not very useful comment :)

    I originally intended to comment that it was definitely unsafe to change the way deletion happens, but after poking around all day I concluded that even though it makes me uneasy (because I feel like I’ve forgotten something important over the years), I can’t really find any tangible issues with the addition of a new deletion point. I wouldn’t feel comfortable enough to ACK the approach, but I recognize that I can’t argue against it either. However, since there’s no real justification for the behavioral change (which I consider scary) other than “code cleanup”, I can’t really get onboard with it.

    If the unique_ptr approach cleans things up without the behavioral change, that sounds much nicer to me. Will have a look.

  55. vasild force-pushed on Apr 15, 2025
  56. vasild commented at 4:15 pm on April 15, 2025: contributor

    71750daa51...75acdcbaa9: add some comments and rename a variable, see below

    We currently guarantee that no node will be deleted from threadMessageHandler as it iterates. That’s the one that matters.

    Yes. I just want to clarify - what do you mean by “from”? I would say “no node will be deleted (does not matter from which thread) while threadMessageHandler iterates”. Is this what you mean? This guarantee is achieved by the caller of ProcessMessages() and SendMessages() taking a snapshot of all nodes to keep them from destruction while those functions are running. That is the same in master and in this PR.

    Crucially, PeerManagerImpl::State(nodeid) will stay valid for any reachable nodeid for the duration of SendMessages and ProcessMessages.

    Yes. True in master and in this PR for the same reason - the caller takes a snapshot while calling those functions.

    Now consider if:

    with this PR deletion can now happen from the msghand thread

    Yes. Sounds good. As long as ProcessMessages() and SendMessages() are protected by the snapshot (they are in master and in this PR).

    A reasonable follow-up to this PR (I actually expected to see it here) would be … Right? Because if the threading restriction is “gone”, that seems like it should be safe.

    Right. That is safe and super reasonable because it minimizes the time m_nodes_mutex is held, like you mentioned. I think that belongs in this PR or a followup.

    But if we’re not careful, the nodes could now be deleted as a result of ForNode/ForEachNode, or any function that copies the node vector under the lock and uses the copy.

    Only if all others have released their copies. In which case it is ok.

    The problem with this is that it makes interactions with PeerManager super hard to reason about. Consider a function that does something like this: PeerManagerImpl::SomeFunction()

    This does not change rules about interactions with PeerManager (so it is not harder or easier, it is the same). The fact that nodes may be deleted if nobody holds a reference to them is true in master and in this PR. I think the rule is “Don’t delete nodes while ProcessMessages() and SendMessages() are running”.

    PeerManagerImpl::SomeFunction() could result in state being nullptr (YIKES) also in master, depending on where SomeFunction() is called from. If that is a problem for SomeFunction() it could either:

    • handle State() returning nullptr (some callers of State() already do that in master) or
    • it could increment the reference count internally at the start and reduce it at the end or demand that its callers do that, like ProcessMessages() and SendMessages().

    This reasoning is unchanged by this PR.
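
    For illustration, the first option is a pattern like this (sketch, reusing the to_disconnect example from above):

        CNodeState* state = State(to_disconnect);
        if (state == nullptr) {
            return; // the node was finalized in the meantime; nothing to do
        }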

    the impl here currently protects against nodes being deleted in SendMessages and ProcessMessages using the same refcounting trick that we do in master.

    Yes.

    auto nodes = WITH_LOCK(m_nodes_mutex, return m_nodes);

    This line is now in charge of keeping the shared_ptrs alive until the end of the message handling loop

    Renamed it to nodes_snapshot and added a comment. I find the pattern used in master and in this PR reasonable:

    1. Take a snapshot of the nodes
    2. Do something with the nodes from the snapshot (Implying they will not get deleted while we work on them, that is obvious, right? What’s the meaning of “snapshot” otherwise?)
    3. Release the snapshot // after the release this code does not care, nodes may get deleted if all others have released their references
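
    The same three steps expressed with shared_ptrs (sketch):

        auto nodes_snapshot = WITH_LOCK(m_nodes_mutex, return m_nodes); // 1. take a snapshot
        for (const auto& node : nodes_snapshot) {
            // 2. work on the nodes; the copied shared_ptrs keep them alive here
        }
        nodes_snapshot.clear(); // 3. release; deletion may follow if nobody else holds them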

    So I don’t think it’s true (and imo it’s quite dangerous to assume) that it’s “safe to delete a CNode from any thread”.

    But why? I agree it is unsafe to:

    A. Delete a node without calling FinalizeNode(), or
    B. Delete a node while there are references to it,

    but the thread that does the deletion? That does not matter.

    What’s happening here is that it’s now possible that nodes are deleted in two places rather than one: either after a pass through the socket handler loop, or after a pass through the message handler loop.

    IMO that is safe as long as A. and B. are guaranteed not to happen.

    This PR also maintains the invariant that the socket handler thread is always completely finished with all socket operations before FinalizeNode is called.

    That is also the same in master. FinalizeNode() is called after CNode::CloseSocketDisconnect(). To me the code in this PR makes that more obvious compared to master because FinalizeNode() is called from the CNode’s destructor and at that time it is clear that all socket operations are completely finished.

    And as far as I can tell, this PR does allow us to take the patch above, getting rid of the nasty lock-holding in ForEachNode and ForNode.

    I think so too.

    BUT that would open up the possibility of deletion from other callers. For example, it would make it possible for the RPC thread to delete the node from sendmsgtopeer(). And if any future code stashed the shared_ptrs from CConnman, it would make it possible for the deletion to happen anywhere.

    Yes, and that is fine. But don’t forget - only if all others have released their copies. We shouldn’t be worried that deleting a CNode object from the RPC thread will break some other thread that is using it. That is because deletion only happens when the reference count drops to zero, which means nobody is using it; at that point it is irrelevant which thread does the deletion, because there are 0 users of the object.

    • Justification besides “code cleanup”. I don’t think that’s a valid reason to be making behavioral changes to such low-level code.

    Added some to the PR description.

    In-place comments that describe the guarantees we provide. If the safety of the p2p layer revolves around a copy of a vector of shared_ptrs, that needs to be made clear. That nodes.clear() is also too crucial to not have a comment.

    Added some comments.

    Documentation about the finalization/deletion process in general. This should probably happen even if we don’t take these changes.

    I agree. Maybe if that doc existed beforehand it would have made this discussion shorter. Opened #32278

  57. vasild commented at 7:29 pm on April 15, 2025: contributor

    when you say things like “it’s now possible that nodes are deleted in two places rather than one” and

    It is actually “… in 3 places rather than 2”

    “But if we’re not careful, the nodes could now be deleted as a result of ForNode/ForEachNode”

    This sounds as if some unexpected deletion is going to happen in the middle of everything, bad things will follow, and this PR is to blame. But that is not true. Nodes could be deleted as a result of ForNode even now in master - if ForNode sets fDisconnect to true, then that will eventually result in the deletion of the node when it is safe to do so. This is both in master and in this PR. Nodes will not be deleted unexpectedly or accidentally while there are references to them just because we are not careful with ForNode (again, same in master and in this PR).

    It seems like both you and vasild like making a dichotomy between (1) using manual refcounting and (2) using shared_ptr, and not wanting to choose a middle ground like hodlinator’s approach, which provides automatic reference counting through an explicit mechanism that is less flexible but simpler than shared_ptr. I’m not really sure why that is.

    I expressed some concerns about the “unique_ptr+manual refcounting” in #32015 (comment)

  58. hodlinator commented at 7:52 am on April 16, 2025: contributor

    I do not understand the fixation on knowing and controlling where an object is destroyed from. The elegance and simplicity behind the sole idea of shared_ptr is that it does not matter.

    I used to appreciate shared_ptr-like constructs, but as @purpleKarrot wrote, they introduce global variables in disguise. Using unique_ptr prevents that and makes ownership clear, which is a step forward.

    @hodlinator, I checked master…hodlinator:bitcoin:pr/32015_unique_ptr, here is some quick feedback:

    • Does not remove the manual reference counting which is the aim of this PR. That diff mostly replaces the raw pointer CNode* with std::unique_ptr<CNode>.

    Yes, the PR instead makes clear that refcounting is there just to track live snapshots. Making that sole purpose clear removes the itch to eradicate it for me.

    • To implement the snapshots it takes and stores raw pointers away from the unique_ptr, which is an anti-pattern because then the lifetime of the raw pointer has to be manually kept shorter than the lifetime of the unique_ptr. I think that defeats the purpose of unique_ptr itself.

    In a post-unique_ptr (and shared_ptr) world, raw pointers indicate non-ownership. It is subtle but not an anti-pattern.

    • Any operation on the std::vector<std::unique_ptr<CNode>> m_nodes that deletes elements, e.g. clear(), erase(), pop_back() or resize() will automatically destroy the CNode objects even if they are referenced (manual ref count >0).

    Good point. Might be worth adding ~CNode() { Assume(GetSnapshotCount() == 0); }.

  59. vasild commented at 8:16 am on April 16, 2025: contributor

    Consider a function like that does something like this: … PeerManagerImpl::SomeFunction()

    Here is an example that YIKES on master as well:

    diff --git i/src/net.cpp w/src/net.cpp
    index fc0edc1a5c..c48a62f31e 100644
    --- i/src/net.cpp
    +++ w/src/net.cpp
    @@ -2758,12 +2758,14 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect, std
             const auto current_time{NodeClock::now()};
             int nTries = 0;
             const auto reachable_nets{g_reachable_nets.All()};

             while (!interruptNet)
             {
    +            m_msgproc->HasAllDesirableServiceFlags(ServiceFlags{123});
    +
                 if (anchor && !m_anchors.empty()) {
                     const CAddress addr = m_anchors.back();
                     m_anchors.pop_back();
                     if (!addr.IsValid() || IsLocal(addr) || !g_reachable_nets.Contains(addr) ||
                         !m_msgproc->HasAllDesirableServiceFlags(addr.nServices) ||
                         outbound_ipv46_peer_netgroups.count(m_netgroupman.GetGroup(addr))) continue;
    diff --git i/src/net_processing.cpp w/src/net_processing.cpp
    index 0b25396dd2..6148c1d8de 100644
    --- i/src/net_processing.cpp
    +++ w/src/net_processing.cpp
    @@ -519,12 +519,13 @@ public:
             EXCLUSIVE_LOCKS_REQUIRED(!m_most_recent_block_mutex);

         /** Implement NetEventsInterface */
         void InitializeNode(const CNode& node, ServiceFlags our_services) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_tx_download_mutex);
         void FinalizeNode(const CNode& node) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_headers_presync_mutex, !m_tx_download_mutex);
         bool HasAllDesirableServiceFlags(ServiceFlags services) const override;
    +    void SomeFunction() const;
         bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, !m_headers_presync_mutex, g_msgproc_mutex, !m_tx_download_mutex);
         bool SendMessages(CNode* pto) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, g_msgproc_mutex, !m_tx_download_mutex);

         /** Implement PeerManager */
    @@ -1660,14 +1661,32 @@ void PeerManagerImpl::FinalizeNode(const CNode& node)
             LOCK(m_headers_presync_mutex);
             m_headers_presync_stats.erase(nodeid);
         }
         LogDebug(BCLog::NET, "Cleared nodestate for peer=%d\n", nodeid);
     }

    +void PeerManagerImpl::SomeFunction() const
    +{
    +    const NodeId to_disconnect{0};
    +    m_connman.ForNode(to_disconnect, [&](CNode* node) {
    +        node->fDisconnect = true;
    +        return true;
    +    });
    +    while (WITH_LOCK(cs_main, return State(to_disconnect)) != nullptr) {
    +        std::this_thread::sleep_for(1s);
    +    }
    +    LogInfo("YIKES");
    +    Assert(false);
    +}
    +
     bool PeerManagerImpl::HasAllDesirableServiceFlags(ServiceFlags services) const
     {
    +    if (services == 123) {
    +        SomeFunction();
    +    }
    +
         // Shortcut for (services & GetDesirableServiceFlags(services)) == GetDesirableServiceFlags(services)
         return !(GetDesirableServiceFlags(services) & (~services));
     }

     ServiceFlags PeerManagerImpl::GetDesirableServiceFlags(ServiceFlags services) const
     {
    

    That might happen on master and on this PR. I would say that for each yikes in this PR there is an equivalent yikes on master; in other words, this PR is not prone to more yikes. The rule is the same - “if code relies on a CNode not being deleted, then it should either hold m_nodes_mutex, or increment the reference count during its usage”.

    Split the style change and functional change into separate commits: first the shared_ptr change as a no-op, then the additional deletion point. Edit: Looks like this isn’t realistic.

    I just did that, see https://github.com/vasild/bitcoin/commits/cnode_shared_ptr_split_in_two_commits/ The intermediate state after the first commit is somewhat odd - we have shared_ptr but don’t let it destroy the owned object by itself when we drop the shared_ptr, e.g. just remove it from the vector. Instead, in this intermediate state we monitor the use_count() and call reset() manually. Also NodesSnapshot is a useless wrapper that just copies the vector. The second commit cleans all that up and arrives at the same end result as this PR.

  60. hodlinator commented at 1:55 pm on April 16, 2025: contributor

    Published new iteration of unique_ptr change in separate branch: https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_unique_ptr2

    • Moved changes clarifying CNode::nRefCount really only being used for NodesSnapshots to happen before the unique_ptr commit.
    • Split the CNode::nRefCount commit in two: one to change usage to only be for snapshots, one to rename the methods. Now we also rename the field to m_snapshot_count and group the refcounting methods together.
    • Added ~CNode()-dtor with assert(GetSnapshotCount() == 0) in order to catch invalid operations like CConnman::m_nodes.clear() as pointed out by vasild.
    • Added speculative final commit to introduce use of NodesSnapshot in ForNode and ForEachNode to decrease time m_nodes_mutex is locked, inspired by @theuni’s comment. Feels fairly safe given that we only delete nodes with 0 snapshots prior to CConnman::Stop() & StopNodes(). If another thread was still inside ForNode/ForEachNode at that point, we are already shutting down, and the asserts for zero snapshots in DeleteNode() will simply kill the process quicker. Should it be deemed too risky this commit could be skipped.
    • Moved some paranoid TRACEPOINT changes and making StopThreads & StopNodes private into early “Minor improvements” commit.
    • DeleteNode(std::unique_ptr<CNode> pnode) -> DeleteNode(std::unique_ptr<CNode>&& pnode): Clarifies that we are assuming ownership of the unique_ptr.
    • Rebased on same parent as this PR.
  61. theuni commented at 10:17 pm on April 16, 2025: member

    I’ve been working on some experiments and tests for this again today, but unfortunately I didn’t have time to get finished.

    I’ll be away until Tuesday, but I’ll resume here next week. Sorry for the delay :(

  62. in src/net.h:1435 in 75acdcbaa9 outdated
    1431@@ -1442,8 +1432,7 @@ class CConnman
    1432     std::vector<AddedNodeParams> m_added_node_params GUARDED_BY(m_added_nodes_mutex);
    1433 
    1434     mutable Mutex m_added_nodes_mutex;
    1435-    std::vector<CNode*> m_nodes GUARDED_BY(m_nodes_mutex);
    1436-    std::list<CNode*> m_nodes_disconnected;
    1437+    std::vector<std::shared_ptr<CNode>> m_nodes GUARDED_BY(m_nodes_mutex);
    


    vasild commented at 11:49 am on April 17, 2025:

    Bringing the anti-shared_ptr discussion to its own dedicated thread here.

    #32015 (comment) Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics?

    #32015 (comment) I used to appreciate shared_ptr-like constructs, but as @purpleKarrot wrote in #32015 (comment), they introduce global variables in disguise.

    How come? The CNode objects have a lifetime equal to the lifetime of the connection with the peer, usually much less than the lifetime of a global variable (which is the lifetime of the entire program). Also those CNode objects are in a private variable inside CConnman. Global variables can be accessed from everywhere. Private class members are the opposite.

    would it be possible to increase the level of abstraction and pass around values with regular semantics?

    • ProcessMessages() and SendMessages() take CNode* and they can / should be changed to take CNode&.
    • FindNode() returns shared_ptr, this can be changed to return bool and the sole user that makes use of the return value can be changed to use ForNode().
    • ConnectNode() is the function that creates the objects and it returns shared_ptr, not sure if it makes sense to change it?
    • GenerateWaitSockets() takes a const vector of shared_ptr. It needs just a list of nodes, could take a vector of CNode, but since we already have a vector of shared_ptr we pass that. Same for SocketHandlerConnected() and both need to run on the same “snapshot” when called one after the other.

    If there is review interest I would pursue those.


    hodlinator commented at 8:07 pm on April 17, 2025:
    1. The intent behind the code on master is to delete the nodes at specific points; otherwise CNode::Release() would just do if (--nRefCount == 0) delete this (and there would be different logic around m_nodes_disconnected). You might be correct in that having been okay; nobody has been able to prove otherwise in this PR so far.

    2. I agree that shared_ptr-usage in this PR not being exposed in the public interface of CConnman is a mitigating factor.

    3. I bet you’ve checked all angles I would check when it comes to how the shared_ptr would lead to CNode destruction in the current PR.

    4. shared_ptr still opens up the door towards future changes in the direction of chaos when compared to unique_ptr. Nailing down where ownership/lifetime is managed helps fresh eyes follow the code, improving maintainability. Using shared_ptr can be perceived as a cop-out, saying it will be destroyed whenever & wherever the last reference is destroyed. shared_ptr in other cases/projects is often an indication of imprecise/prototype-quality code, so as a general rule it makes sense to avoid it if possible.

    5. This is not my first objection to shared_ptr in this project; I experimented with removing it from miniscript code here: #30866#pullrequestreview-2434704657, which led to uncovering an extra issue to fix in that PR. The PR settled on taking that fix and also replacing shared_ptr with unique_ptr for a less invasive change. I have #31713 which replaces those new unique_ptrs with straight-up instances, so I’m by no means wedded to unique_ptr if there’s a better option.

    • ProcessMessages() and SendMessages() take CNode* and they can / should be changed to take CNode&.
    • FindNode() returns shared_ptr, this can be changed to return bool and the sole user that makes use of the return value can be changed to use ForNode().
    • ConnectNode() is the function that creates the objects and it returns shared_ptr, not sure if it makes sense to change it?
    • GenerateWaitSockets() takes a const vector of shared_ptr. It needs just a list of nodes, could take a vector of CNode, but since we already have a vector of shared_ptr we pass that. Same for SocketHandlerConnected() and both need to run on the same “snapshot” when called one after the other.

    If there is review interest I would pursue those.

    IMO if we go with shared_ptr we don’t need to change those, thanks for investigating though!


    vasild commented at 2:50 pm on April 22, 2025:

    Thanks for the input, I respect and agree with most of it. The only bit I do not agree with is a general rule to avoid shared_ptr. I agree using unique_ptr is better than shared_ptr, especially if unique_ptr is used as intended - to manage the lifetime of the owned object. If we start extracting the raw pointers from unique_ptr, managing them separately and manually ref-counting them, then maybe it is better to just use shared_ptr.

    FindNode() returns shared_ptr, this can be changed to return bool and the sole user that makes use of the return value can be changed to use ForNode()

    I think that’s worth doing, regardless of this PR. Done in https://github.com/bitcoin/bitcoin/pull/32326


    hodlinator commented at 7:59 am on April 24, 2025:

    If we start extracting the raw pointers from unique_ptr, managing them separately and manually ref-counting them, then maybe it is better to just use shared_ptr.

    I think doing ref-counting for snapshots is fine, but agree raw pointers in this codebase are ambiguous. It would be one thing if we started over with strict policies for shared_ptr/unique_ptr/raw pointers. Then one could consistently use raw pointers to always mean a non-owning “view-pointer” (https://github.com/bitcoin/bitcoin/pull/32015#issuecomment-2712060971).

    Maybe that distinction could still be made by introducing something like:

    template <typename T>
    class ViewPtr
    {
        T* ptr;
    public:
        explicit ViewPtr(T* p) : ptr{p} {}
        T* operator->() const { return ptr; }
        T& operator*() const { return *ptr; }
    };
    

    + OwnerPtr<T> in the vein of GSL: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#SS-views

    I think that’s worth doing, regardless of this PR. Done in #32326

    Good call on peeling off a separate PR as it’s nice but not a requirement here.

  63. theuni commented at 7:17 pm on April 24, 2025: member

    Ok, I’m back after a week of paging all of this back in and hacking around. This is actually something I wanted to get cleaned up a long time ago, so I’m happy it has eyes on it again. Sorry in advance for the novel :)


    I think the discussions about shared_ptr/unique_ptr/refcounting have distracted from what really needs to be discussed here: the lifetime and concurrency requirements/guarantees of the various subsystems.

    The refcounters are (ab)used as a generic means of controlling the teardown order of a CNode. As I mentioned above, historically, this has meant that disconnection/finalization/deletion have always happened on the socket handler thread and always in that order (with the exception of when we’re shutting down).

    However, as @vasild has pointed out, this is neither ideal nor necessary.

    So rather than starting by discussing what types of pointers we should be using, we need to start by figuring out what we’re trying to solve/improve. And before we can do that, we need to lay out the way things currently work.

    These are the important guarantees at the moment, as I see them:

    FinalizeNode may only be called once per node.

    This one is obvious, but it’s necessary to keep in mind that we shouldn’t be introducing any possible races.

    FinalizeNode will never be called as a result of setting fDisconnect from ProcessMessages/SendMessages

    This prevents any pointers obtained by calling State(id) or GetPeerRef(id) from being yanked out from under us when we’re in those functions.

    I think this is the one that has led to the workarounds and shared_ptr discussion here. It is currently enforced by NodesSnapshot: a class which bumps the refcount for each node before calling ProcessMessages/SendMessages, then decref’ing them when finished. This has the effect of keeping them all alive for the duration of a single pass through the loop. @vasild’s observation was that this looks a lot like an ownership model which can be handled with shared_ptrs by taking advantage of their inherent refcounting and introducing a deletion point in ThreadMessageHandler.
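
    As a sketch of that mechanism (not the exact code in master, which has more details):

        class NodesSnapshot
        {
            std::vector<CNode*> m_nodes_copy;
        public:
            explicit NodesSnapshot(const std::vector<CNode*>& nodes) : m_nodes_copy{nodes}
            {
                for (auto* node : m_nodes_copy) node->AddRef(); // pin every node for one pass
            }
            ~NodesSnapshot()
            {
                for (auto* node : m_nodes_copy) node->Release(); // unpin; deletion may follow
            }
            const std::vector<CNode*>& Nodes() const { return m_nodes_copy; }
        };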

    I maintain that this is non-ideal and foot-gunny, as making a copy of m_nodes is the mechanism used to enforce the above contract, though it’s not at all obvious that it’s doing so.

    Finalization happens just before deletion, both on the socket handler thread

    Again, this one isn’t strictly necessary, but it’s currently the case.

    And again, @vasild’s shared_ptr approach handles this by removing that restriction, and calling FinalizeNode as part of CNode’s destructor, as that means that its refcount has reached zero.

    He points out that it is safe to call this sometimes from the message handler thread instead, as long as it is only called once. I agree with this and suggest (see below) taking it a step further and only calling it from the message handler thread, as that guarantees that it can’t be running during ProcessMessages/SendMessages.

    Disconnection always happens before calling FinalizeNode.

    I’ll come back to this one later as it’s not strictly necessary, but it’s currently the case. By the time FinalizeNode is called, the socket handler thread is completely done with the node (the socket has been shutdown).

    When the socket handler disconnects a peer, it is almost immediately (in the same pass through the loop) removed from m_nodes

    This one is annoying, but the unit tests depend on it. Basically, as soon as we notice that a peer has disconnected, the socket handler thread takes care to ensure that other subsystems won’t find it by looking through m_nodes. This includes ForEachNode/ForNode/CopyStats, etc.

    Only a single thread deletes CNodes.

    Duh.

    Any other function passing a CNode (for ex, a caller of ForNode() from RPC) shouldn’t have it deleted for its duration

    This is a completely separate issue from the disconnection/finalization semantics, and is a good use-case for shared_ptrs.


    In reality, the refcounting (and by extension NodesSnapshot) has been used as a loose way of enforcing teardown semantics. It’s true that we can mimic those semantics by introducing shared_ptrs to manage the nodes instead, but that further obscures what the actual requirements are. As a few people have pointed out on this PR, abusing the facilities of shared_ptr/unique_ptr to replace the manual refcounting is not the solution. Rather, we should enforce the actual semantics in code.

    I’ve pushed a branch here to illustrate the above: https://github.com/theuni/bitcoin/commits/finalize_threadheandler2/

    It makes the following changes:

    • Expose m_nodes_disconnected to the message handling thread.
    • Add CNode::m_finalized which is set once FinalizeNode has been called. This, combined with the move from m_nodes to m_nodes_disconnected is enough for us to determine when the socket handler/message handler threads are done with a CNode.
    • Call FinalizeNode from the top/bottom of the message handler thread instead, to guarantee that it will never be called from another thread during a ProcessMessages/SendMessages invocation. This also makes much more sense conceptually, as FinalizeNode is not something the socket handler should have anything to do with.

    With the above changes in place, the ordering of the teardown is fully proceduralized, so the refcounting is no longer needed and is thus removed. At that point, the CNode can be safely deleted once (in no particular order):

    • It has been moved to m_nodes_disconnected
    • CNode::m_finalized is true
    • No other threads (RPC via ForNode as an example) are holding it

    In order to guarantee the last one, CNodes are kept in m_nodes and m_nodes_disconnected as shared_ptrs. Since the teardown semantics are enforced elsewhere, deletion is now uninteresting and can happen safely from any thread.
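
    Putting those conditions together, deletion safety reduces to a check like this (CanDelete() is a hypothetical helper; the branch expresses it differently):

        // A node that sits in m_nodes_disconnected holds its last owning
        // reference there, so use_count() == 1 means no other thread pins it.
        bool CanDelete(const std::shared_ptr<CNode>& node)
        {
            return node->m_finalized && node.use_count() == 1;
        }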

    It also contains a few optimizations that become possible once the teardown procedure is proceduralized:

    • m_nodes_mutex is no longer held for the duration of ForNode/ForEachNode. This should be a nice improvement, especially for nodes with hundreds of connections.
    • FinalizeNode is now allowed to happen before disconnection, as there’s no actual need to order those events.

    I realize that I’ve duplicated a good amount of work here like the shared_ptr commit as well as some of the pre-requisite cleanup commits that are similar to @hodlinator’s. It’s also incomplete, for example it’s missing the fuzzing fixups that @vasild has already done, as well as a bunch of additional comments/docs as I’ve suggested already. I’d be happy to rebase my branch to include those. I agree with #32326 as well. It’s not necessary for my branch, but it’s a nice improvement and I did have a similar commit at one point for the exact same reason.

    I feel pretty strongly about addressing the teardown order procedurally rather than mimicking our current refcounting semantics which don’t make the requirements/guarantees clear. Ultimately I agree that shared_ptrs are best to enforce the lifetimes, but only after codifying the other rules.

    I think the code in my branch is much easier to reason about. There’s no hidden side-effects of node deletion, magic copies of m_nodes aren’t used to prevent FinalizeNode being called at the wrong time, and disconnection/finalization/deletion no longer carry hidden dependencies on each other.

  64. theuni commented at 1:56 pm on April 30, 2025: member

    Can we maybe just hold off on this for now? The investigation above prompted me to start working on a much larger refactor... one that would finish the network refactor (CConnman creation) that I never managed to fully complete almost a decade ago.

    I’ve been head-down working on it, hope to have a POC to demonstrate in the next week or two.

  65. vasild commented at 3:22 pm on April 30, 2025: contributor
    I am incorporating your feedback here, looks good so far.
  66. vasild force-pushed on May 1, 2025
  67. vasild commented at 10:10 am on May 1, 2025: contributor

    75acdcbaa9...15cc989538: incorporate feedback from above

    I took your branch https://github.com/theuni/bitcoin/commits/finalize_threadheandler2/ and made some changes:

    No need to ping-pong the teardown of a CNode between the net and msghand threads so much

    In finalize_threadheandler2:

    1. The net thread would disconnect a node and move it from m_nodes to m_nodes_disconnected.
    2. Then the msghand thread would finalize those nodes.
    3. Then the net thread would delete finalized nodes.

    I simplified this to:

    1. same as above; the sole fact that a node is in m_nodes_disconnected now means that it has been disconnected but not yet finalized.
    2. same as above; plus also delete the nodes.

    So CNode::m_finalized is not needed because it becomes implicit: while a node is in m_nodes_disconnected it has not been finalized; once it is removed from there, it has been finalized and deleted.
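
    A sketch of that simplified step on the msghand thread (the mutex name is assumed from the “add a lock for m_nodes_disconnected” commit below):

        LOCK(m_nodes_disconnected_mutex);
        for (auto it = m_nodes_disconnected.begin(); it != m_nodes_disconnected.end();) {
            if (it->use_count() == 1) {              // no other thread holds the node
                m_msgproc->FinalizeNode(**it);       // membership here implies "not yet finalized"
                it = m_nodes_disconnected.erase(it); // destroys the CNode
            } else {
                ++it;
            }
        }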

    Reduced the optimization about holding m_nodes_mutex for long-lived callbacks

    I did not remove m_nodes_mutex from ForEachNode() because in one of its usages the callback calls m_connman.MultipleManualOrFullOutboundConns() which requires m_nodes_mutex. Dealing with that deserves a separate PR.

    Omitted the optimization to finalize a node at the end of the message handler loop

    Could do that, but to avoid a second finalization it would need CNode::m_finalized just for that. I think it is not worth having an extra member in CNode for this optimization. Can be done in a followup if desired.

    Other minor tweaks and polishes

    Like using std::ranges::foo() instead of std::foo(), and a simpler CConnman::DisconnectNodes() which uses no temporary variables when moving nodes around.


    This PR deviated from the usual workflow where there is one author and reviewers that only review and don’t modify the code. The current state is that the PR has two authors - @theuni and @vasild - who both reviewed and authored code. Let’s see how this pans out…

  68. vasild commented at 10:14 am on May 1, 2025: contributor

    Can we maybe just hold off on this for now? The investigation above prompted me to start working on a much larger refactor…

    Not sure about that because I have not seen that other refactor. In general I prefer to do stuff in smaller, more manageable steps. I guess a further refactor should be possible on top of this PR. I would be happy to review. This PR should make a more solid foundation for further changes in the code.

  69. DrahtBot added the label CI failed on May 1, 2025
  70. DrahtBot commented at 11:47 am on May 1, 2025: contributor

    🚧 At least one of the CI tasks failed. Task tidy: https://github.com/bitcoin/bitcoin/runs/41477750293 LLM reason (✨ experimental): Clang-tidy found a performance warning treated as an error, causing CI failure.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  71. net: store CNode objects as shared_ptr
    This is a non-functional change - the manual reference counting,
    disconnection, finalization and deletion are unchanged.
    4326f97dc6
  72. net: remove unnecessary nodes_copy and nodes_disconnected_copy temporaries
    The other copies have a purpose for locking; these were just an awkward
    convenience for deleting while iterating.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    9c1d580dde
  73. net: add a lock for m_nodes_disconnected
    Subsequent commits will allow this to be shared with the message
    handling thread.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    04eb706008
  74. net: finalize and delete nodes from ThreadMessageHandler()
    Rather than relying on refcounting to abstract away our teardown
    procedure, make the rules explicit:
    - FinalizeNode() is only called when ProcessMessages() and
      SendMessages() are not running. This is enforced by finalizing only at
      the top of the message handler loop.
    - FinalizeNode() may only be called on nodes which have been
      disconnected, by virtue of finding them in m_nodes_disconnected. This
      enforces the order: disconnect -> finalize -> delete. This order is
      the current behavior but is not strictly necessary, and the
      restriction will be removed in a later commit.
    - Nodes are only deleted after they've been disconnected and finalized.
      This is enforced by deleting them from m_nodes_disconnected only
      after calling FinalizeNode().
    - Callers of ForNode() and ForEachNode() are currently protected from
      finalization or deletion due to their locking of m_nodes_mutex.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    9b8dc37ced
  75. net: remove CNode reference counting and NodesSnapshot
    Snapshots of `m_nodes` are not needed anymore because finalization and
    deletion of nodes only happen in `ThreadMessageHandler()` and only
    after nodes have been disconnected and moved from `m_nodes` to
    `m_nodes_disconnected`.
    
    The manual reference counting becomes unused, remove it too.
    
    Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
    9b8e9d40d6
  76. net: don't hold m_nodes_mutex for ForNode() callbacks
    The shared_ptrs now ensure that the CNodes will not be deleted for the
    duration of the callbacks.
    
    Disconnection/Finalization may happen while the shared_ptr is being held,
    though it won't be deleted for the duration of the callback.
    
    The majority of uses are in ProcessMessages and SendMessages, during which
    disconnection may occur but FinalizeNode will not be called.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    50e33ff8df
  77. net: optimization: wake message handler when nodes are disconnected
    This keeps us from waiting for the condvar timeout (currently 100ms) before
    finalizing a node that has been disconnected on the socket handler thread.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    8b93e0894f
  78. vasild force-pushed on May 1, 2025
  79. vasild commented at 12:34 pm on May 1, 2025: contributor
    15cc989538...8b93e0894f: fix CI clang-tidy
  80. DrahtBot removed the label CI failed on May 1, 2025
