net: replace manual reference counting of CNode with shared_ptr #32015

pull vasild wants to merge 7 commits into bitcoin:master from vasild:cnode_shared_ptr changing 6 files +163 −224
  1. vasild commented at 9:53 AM on March 7, 2025: contributor

    Before this change the code used to count references to CNode objects manually via CNode::nRefCount. Unneeded CNodes were scheduled for deletion by putting them in CConnman::m_nodes_disconnected and were deleted after their reference count reached zero. Deleting consists of calling PeerManager::FinalizeNode() and destroying the CNode object.

    Replace this scheme with std::shared_ptr. This simplifies the code and removes: CNode::nRefCount, CNode::GetRefCount(), CNode::AddRef(), CNode::Release(), CConnman::m_nodes_disconnected and CConnman::NodesSnapshot.

    Now creating a snapshot of CConnman::m_nodes is done by simply copying it (under the mutex).

    Call PeerManager::FinalizeNode() from the destructor of CNode, which is called when the reference count reaches 0.

    This removes brittle code and replaces it with a standard-library facility that is more robust. All of the following are possible with manual reference counting and impossible with shared_ptr:

    • Use an object without adding a reference to it
    • Add too many references to an object, mismatching add/remove with extra "add" calls
    • Forget to release a reference, mismatching add/remove with missing "remove" calls
    • Drive the reference count negative with too many "remove" calls
    • Destroy an object while there are references to it

    There is zero learning curve for the average C++ programmer who knows how shared_ptr works. Not so with manual reference counting, where one has to learn about AddRef() and Release(), check and maintain where they are called, and understand their relationship with m_nodes_disconnected.
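    The overall scheme can be sketched in isolation. This is a minimal illustration with hypothetical types (Node, Connman), not the actual Bitcoin Core classes: finalization runs in the destructor, and a snapshot is just a copy of the vector under the mutex.

    ```cpp
    #include <cassert>
    #include <memory>
    #include <mutex>
    #include <vector>

    static int finalized_count{0};

    struct Node {
        int id;
        explicit Node(int i) : id{i} {}
        ~Node() { ++finalized_count; } // stands in for PeerManager::FinalizeNode()
    };

    class Connman {
    public:
        void Add(int id)
        {
            std::lock_guard lock{m_nodes_mutex};
            m_nodes.push_back(std::make_shared<Node>(id));
        }
        // A snapshot is a copy under the mutex; each copy bumps the reference
        // count, so nodes outlive the snapshot's users automatically.
        std::vector<std::shared_ptr<Node>> Snapshot() const
        {
            std::lock_guard lock{m_nodes_mutex};
            return m_nodes;
        }
        void Clear()
        {
            std::lock_guard lock{m_nodes_mutex};
            m_nodes.clear();
        }
    private:
        mutable std::mutex m_nodes_mutex;
        std::vector<std::shared_ptr<Node>> m_nodes;
    };

    int main()
    {
        Connman connman;
        connman.Add(0);
        auto snapshot = connman.Snapshot();
        connman.Clear();              // main list dropped...
        assert(finalized_count == 0); // ...but the snapshot keeps the node alive
        snapshot.clear();
        assert(finalized_count == 1); // finalized exactly once, by the last owner
    }
    ```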


    This has some history in #28222 and #10738. See below #32015 (comment) and #32015 (comment).


    Possible followup: change ProcessMessages() and SendMessages() to take CNode&: #32015 (review)

  2. DrahtBot commented at 9:53 AM on March 7, 2025: contributor


    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.


    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32015.


    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK hodlinator, theuni
    Stale ACK pinheadmz

    If your review is incorrectly listed, please copy-paste <code>&lt;!--meta-tag:bot-skip--&gt;</code> into the comment that the bot should ignore.


    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #33956 (net: fix use-after-free with v2->v1 reconnection logic by Crypt-iQ)
    • #32394 (net: make m_nodes_mutex non-recursive by vasild)
    • #32278 (doc: better document NetEventsInterface and the deletion of "CNode"s by vasild)
    • #32061 (Replace libevent with our own HTTP and socket-handling implementation by pinheadmz)
    • #30988 (Split CConnman by vasild)
    • #29415 (Broadcast own transactions only via short-lived Tor or I2P connections by vasild)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.


  3. DrahtBot added the label P2P on Mar 7, 2025
  4. vasild commented at 9:53 AM on March 7, 2025: contributor
  5. vasild commented at 9:55 AM on March 7, 2025: contributor

    cc @hodlinator, you might be interested in this as well

  6. fanquake commented at 10:13 AM on March 7, 2025: member

    https://github.com/bitcoin/bitcoin/actions/runs/13718139296/job/38367326902?pr=32015#step:7:1603:

    Run p2p_headers_presync with args ['/Users/runner/work/bitcoin/bitcoin/ci/scratch/build-aarch64-apple-darwin23.6.0/src/test/fuzz/fuzz', PosixPath('/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync')]
    net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
    Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a"
    
    net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
    Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a"
    
  7. in src/net.cpp:1939 in 2747f135be outdated
    1941 | +
    1942 | +                // Keep a reference to this CNode object to delay its destruction until
    1943 | +                // after m_nodes_mutex has been released. Destructing a node involves
    1944 | +                // calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order
    1945 | +                // should be cs_main, m_nodes_mutex.
    1946 | +                disconnected_nodes.push_back(node);
    


    dergoegge commented at 11:03 AM on March 7, 2025:

    Now that ~CNode is responsible for calling FinalizeNode, the guarantees around which thread does the actual cleanup are lost. It's basically just whichever thread ends up holding the last reference to a CNode. This seems bug-prone.

    E.g. there are no guarantees (afaict) that disconnected_nodes will actually be the last owner.


    vasild commented at 1:29 PM on March 7, 2025:

    The idea is that it does not matter which thread will destroy the CNode objects.

    there are no guarantees (afaict) that disconnected_nodes will actually be the last owner.

    That is right, but there is no need for such a guarantee. disconnected_nodes exists to guarantee that the objects will not be destroyed while holding m_nodes_mutex.
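    The mechanism being described can be sketched as follows (hypothetical names, simplified from the PR): references are moved out of the node list while the mutex is held, but the objects are only destroyed after the lock is released, so the destructor never runs with the mutex held.

    ```cpp
    #include <cassert>
    #include <memory>
    #include <mutex>
    #include <vector>

    struct Node { bool disconnect{false}; };

    int main()
    {
        std::mutex nodes_mutex;
        std::vector<std::shared_ptr<Node>> nodes{std::make_shared<Node>()};
        nodes[0]->disconnect = true;

        std::vector<std::shared_ptr<Node>> disconnected_nodes;
        {
            std::lock_guard lock{nodes_mutex};
            for (auto it = nodes.begin(); it != nodes.end();) {
                if ((*it)->disconnect) {
                    // Keep a reference so destruction is delayed past the lock.
                    disconnected_nodes.push_back(*it);
                    it = nodes.erase(it);
                } else {
                    ++it;
                }
            }
        } // nodes_mutex released here
        assert(nodes.empty());
        assert(disconnected_nodes.size() == 1);
        // The last reference may be dropped by any thread; here it is dropped now:
        disconnected_nodes.clear();
    }
    ```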


    hodlinator commented at 11:06 PM on March 7, 2025:

    In thread #32015 (review):

    Maybe one could have an additional vector<shared_ptr<CNode>> member in CConnman beyond m_nodes that always keeps the shared_ptr::use_count() above 0. Then use removal from that vector (when use_count() == 1) for controlled deletion with enforced lock order, more like how the code was before?


    vasild commented at 8:55 AM on March 11, 2025:

    I do not think that is necessary. I mean - if we would assume that the code is going to be modified in a careless way to break the logic, in other words to destroy CNode objects while holding m_nodes_mutex, then we might as well assume that the code could be modified in a way that breaks the manual reference counting. IMO the manual reference counting is more fragile and error prone.

    Further, if the code is changed to destroy CNode objects while holding m_nodes_mutex, then 83 functional tests fail. I think it is safe to assume that such a thing will not go unnoticed.


    hodlinator commented at 8:18 PM on March 11, 2025:

    In thread #32015 (review):

    Started proving out my suggestion and landed on something closer to master.

    • Resurrect m_nodes_disconnected (now holding shared_ptr) along with clearly defined lifetimes.
    • Resurrect DeleteNode(), but make it take an r-value shared_ptr and assert that use_count() == 1.
    • Avoid adding ~CNode() and CNode::m_destruct_cb along with passing it in everywhere via the constructor. This means we don't need to touch 5 of the test files.

    https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_suggestion

    Passes unit tests and functional tests.

  8. DrahtBot added the label CI failed on Mar 7, 2025
  9. DrahtBot commented at 11:14 AM on March 7, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/38367328595</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen for a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  10. vasild force-pushed on Mar 7, 2025
  11. vasild commented at 1:23 PM on March 7, 2025: contributor

    2747f135be...e35a382388: fix the CI failure, thanks!

  12. vasild force-pushed on Mar 7, 2025
  13. DrahtBot removed the label CI failed on Mar 7, 2025
  14. in src/net.cpp:1888 in 36d932cebb outdated
    1883 | @@ -1884,8 +1884,15 @@ bool CConnman::AddConnection(const std::string& address, ConnectionType conn_typ
    1884 |      } // no default case, so the compiler can warn about missing cases
    1885 |  
    1886 |      // Count existing connections
    1887 | -    int existing_connections = WITH_LOCK(m_nodes_mutex,
    1888 | -                                         return std::count_if(m_nodes.begin(), m_nodes.end(), [conn_type](CNode* node) { return node->m_conn_type == conn_type; }););
    1889 | +    const int existing_connections = WITH_LOCK(m_nodes_mutex,
    1890 | +        return std::count_if(
    


    hodlinator commented at 10:24 PM on March 7, 2025:

    nit: Might as well use std::ranges::count_if here and in GetFullOutboundConnCount.


    vasild commented at 4:06 PM on March 10, 2025:

    Indeed, done, thanks!
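    For reference, the nit in isolation: std::ranges::count_if (C++20) takes the range directly instead of an iterator pair. Illustrative types only, not the PR's actual code.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <vector>

    enum class ConnType { Inbound, Outbound };
    struct Node { ConnType conn_type; };

    int main()
    {
        std::vector<Node> nodes{{ConnType::Inbound}, {ConnType::Outbound}, {ConnType::Outbound}};
        // No begin()/end() pair needed; the predicate is applied to each element.
        const auto outbound = std::ranges::count_if(
            nodes, [](const Node& n) { return n.conn_type == ConnType::Outbound; });
        assert(outbound == 2);
    }
    ```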

  15. hodlinator commented at 11:13 PM on March 7, 2025: contributor

    Concept ACK 36d932cebbb235a79e90575960646e86a23d39a8

  16. vasild force-pushed on Mar 10, 2025
  17. vasild commented at 4:07 PM on March 10, 2025: contributor

    36d932cebb...af622d00ba: address #32015 (review)

  18. purpleKarrot commented at 6:24 PM on March 10, 2025: contributor

    Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics? Every C++ programmer should rewatch https://youtu.be/W2tWOdzgXHA from time to time.

  19. hodlinator commented at 7:51 PM on March 10, 2025: contributor

    Every C++ programmer should rewatch https://youtu.be/W2tWOdzgXHA from time to time.

    I remembered that as "the seminal std::rotate()-talk", but it also has "No Raw Pointers" from 48:45. Probably stoked my hatred of smart pointers which led to #31713. To be fair, in the talk he also says that we used to do intrusive ref-counting... so this is at least one step away from that.

    Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics?

    Guess one could have an InternalCNode which is the actual meat, and make the public CNode instances hold shared_ptr<InternalCNode>. Not sure how well that plays out. Maybe you @purpleKarrot could take a stab at something along the lines of what you pointed to? Keep in mind that we try to minimize the number of lines touched in a change unless necessary.

  20. purpleKarrot commented at 9:08 PM on March 10, 2025: contributor

    After reviewing the code line by line, I realized that the only place where AddRef was ever called was directly after construction (plus one place in the fuzz tests; not sure why that is needed). It looks like the CNode instances are not actually shared (good news). Could they be passed around by std::unique_ptr instead? This would greatly simplify future refactoring.

  21. in src/net.cpp:1717 in af622d00ba outdated
    1713 | @@ -1716,7 +1714,7 @@ bool CConnman::AttemptToEvictConnection()
    1714 |          return false;
    1715 |      }
    1716 |      LOCK(m_nodes_mutex);
    1717 | -    for (CNode* pnode : m_nodes) {
    1718 | +    for (const auto& pnode : m_nodes) {
    


    hodlinator commented at 10:01 PM on March 10, 2025:

    Not sure about adding non-transitive const here since we set pnode->fDisconnect inside the loop.

    Similar case in DisconnectNodes()-loop.


    vasild commented at 11:55 AM on March 11, 2025:

    Removed the const.
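    The "non-transitive const" in question, shown standalone: a const reference to a shared_ptr only makes the pointer const, not the pointee, so the loop body can still flip fDisconnect even through `const auto&`.

    ```cpp
    #include <cassert>
    #include <memory>
    #include <vector>

    struct Node { bool fDisconnect{false}; };

    int main()
    {
        std::vector<std::shared_ptr<Node>> nodes{std::make_shared<Node>()};
        for (const auto& pnode : nodes) { // const reference to the shared_ptr...
            pnode->fDisconnect = true;    // ...but the Node itself stays mutable
        }
        assert(nodes[0]->fDisconnect);
    }
    ```

    This compiles because shared_ptr's const operator-> still yields a pointer to non-const Node, which is why the reviewer considered the const misleading here.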

  22. dergoegge commented at 10:15 PM on March 10, 2025: member

    realized that the only places where AddRef was ever called, was directly after construction

    It is also called in the RAII helper NodesSnapshot which is used in different threads (i.e. "net" and "msghand"), so unique_ptr won't work unfortunately.

    plus one place in the fuzz tests, not sure why that is needed

    It's probably just there to have it "covered".
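    The reason unique_ptr does not fit the snapshot pattern, demonstrated in isolation: a snapshot is a copy of the owning container, and unique_ptr is move-only. Illustrative sketch only.

    ```cpp
    #include <cassert>
    #include <memory>
    #include <type_traits>
    #include <vector>

    int main()
    {
        // unique_ptr cannot be copied; shared_ptr can.
        static_assert(!std::is_copy_constructible_v<std::unique_ptr<int>>);
        static_assert(std::is_copy_constructible_v<std::shared_ptr<int>>);

        std::vector<std::shared_ptr<int>> nodes{std::make_shared<int>(7)};
        auto snapshot = nodes; // fine: both vectors now co-own the element
        assert(snapshot[0].use_count() == 2);
        // A vector<unique_ptr<int>> copy would fail to compile here.
    }
    ```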

  23. in src/test/util/net.h:57 in af622d00ba outdated
      53 | @@ -54,20 +54,19 @@ struct ConnmanTestMsg : public CConnman {
      54 |      void AddTestNode(CNode& node)
      55 |      {
      56 |          LOCK(m_nodes_mutex);
      57 | -        m_nodes.push_back(&node);
      58 | +        m_nodes.push_back(std::shared_ptr<CNode>(&node));
    


    purpleKarrot commented at 10:41 PM on March 10, 2025:

    This looks suspicious. No custom deleter is passed to the shared_ptr constructor, so it defaults to default_delete, which will eventually call delete on the pointer. Is the argument CNode& node guaranteed to be heap allocated? Otherwise:

    m_nodes.push_back(std::shared_ptr<CNode>(&node, [](CNode*){}));
    

    vasild commented at 10:25 AM on March 11, 2025:

    Yes, the argument CNode& node is always heap allocated. Not sure about "guaranteed" - if it were not heap allocated, that would be a gross error, resulting in an immediate crash or a report from Valgrind or a memory sanitizer.

    Note that the semantics of this piece of code are not changed by this PR. It was the same before as well - it would eventually delete the object.
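    The suggested no-op deleter, shown standalone: it lets a shared_ptr refer to an object it does not own, so delete is never called on it.

    ```cpp
    #include <cassert>
    #include <memory>

    struct Node { int id{42}; };

    int main()
    {
        Node stack_node; // NOT heap allocated; default_delete would be fatal
        {
            // The lambda deleter deliberately does nothing on destruction.
            std::shared_ptr<Node> p{&stack_node, [](Node*) { /* no-op */ }};
            assert(p->id == 42);
        } // deleter runs here and does not delete
        assert(stack_node.id == 42); // still valid: the shared_ptr never owned it
    }
    ```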

  24. in src/net.cpp:438 in af622d00ba outdated
     435 | @@ -438,8 +436,7 @@ CNode* CConnman::ConnectNode(CAddress addrConnect, const char *pszDest, bool fCo
     436 |                  // It is possible that we already have a connection to the IP/port pszDest resolved to.
     437 |                  // In that case, drop the connection that was just created.
     438 |                  LOCK(m_nodes_mutex);
    


    hodlinator commented at 10:52 PM on March 10, 2025:

    Since we aren't doing anything with pnode anymore, we should be able to remove this LOCK() too, avoiding re-entrant locking inside FindNode().


    vasild commented at 12:00 PM on March 11, 2025:

    Right! Removed.

  25. in src/net.cpp:335 in af622d00ba outdated
     331 | @@ -332,32 +332,32 @@ bool IsLocal(const CService& addr)
     332 |      return mapLocalHost.count(addr) > 0;
     333 |  }
     334 |  
     335 | -CNode* CConnman::FindNode(const CNetAddr& ip)
    


    hodlinator commented at 10:55 PM on March 10, 2025:

    It is notable how unsafe the design was for the FindNode()-methods before switching to shared_ptr. Returning a raw pointer as we let go of the mutex protecting it.


    purpleKarrot commented at 11:15 PM on March 10, 2025:

    This is not a raw pointer, but an observer ("raw pointer" is coined by Sean Parent as "anything that implies ownership", which explicitly includes shared_ptr. Strictly speaking, changing CNode* to std::shared_ptr<CNode> turns an observer into a raw pointer).

    I actually think that returning an observer is fine here, even after the change.


    hodlinator commented at 8:02 AM on March 11, 2025:

    If that is the case I think Parent goes one step too far in redefining terms.

    Returning a raw C pointer is not an okay design, as another thread could potentially come along and delete whatever the returned pointer references, resulting in a use-after-free. This is mitigated by FindNode being private and node deletion being done in a controlled fashion on master, but it's a loaded foot-gun IMO.

    Adding EXCLUSIVE_LOCKS_REQUIRED() would have been a different fix, forcing the caller to protect the returned value with the mutex.

    CNode* FindNode(...) EXCLUSIVE_LOCKS_REQUIRED(m_nodes_mutex)
    

    Introducing shared_ptr solves the use-after-free design issue. (Although I agree it would be great to get rid of shared_ptr).


    vasild commented at 12:10 PM on March 11, 2025:

    I agree with @hodlinator that this has a lifetime issue in master which this PR removes. That issue was also discussed in another PR. Good to resolve it.

  26. purpleKarrot commented at 11:18 PM on March 10, 2025: contributor

    unique_ptr won't work unfortunately.

    Got it. m_nodes needs to be snapshottable, but that seems to be the only place where copies are made. FindNode should return non-owning observers rather than extend ownership.
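    One way a non-owning observer could look, sketched with std::weak_ptr (a hypothetical illustration of the suggestion, not what the PR implements): the lookup does not extend ownership, and callers lock() the observer only for the duration of their use.

    ```cpp
    #include <cassert>
    #include <memory>
    #include <vector>

    struct Node { int id; };

    static std::vector<std::shared_ptr<Node>> g_nodes;

    std::weak_ptr<Node> FindNode(int id)
    {
        for (const auto& n : g_nodes) {
            if (n->id == id) return n; // observe without extending ownership
        }
        return {};
    }

    int main()
    {
        g_nodes.push_back(std::make_shared<Node>(1));
        auto obs = FindNode(1);
        if (auto node = obs.lock()) { // temporarily co-own while using it
            assert(node->id == 1);
        }
        g_nodes.clear();       // node destroyed: no owning references remain
        assert(obs.expired()); // the observer detects that safely, no UAF
    }
    ```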

  27. vasild force-pushed on Mar 11, 2025
  28. vasild commented at 11:55 AM on March 11, 2025: contributor

    af622d00ba...d233c7e999: address #32015 (review)

  29. vasild force-pushed on Mar 11, 2025
  30. vasild commented at 12:00 PM on March 11, 2025: contributor

    d233c7e999...4492abbf0a: rebase and address #32015 (review)

  31. in src/net.h:1361 in 4492abbf0a outdated
    1354 | @@ -1363,7 +1355,7 @@ class CConnman
    1355 |      bool AlreadyConnectedToAddress(const CAddress& addr);
    1356 |  
    1357 |      bool AttemptToEvictConnection();
    1358 | -    CNode* ConnectNode(CAddress addrConnect, const char *pszDest, bool fCountFailure, ConnectionType conn_type, bool use_v2transport) EXCLUSIVE_LOCKS_REQUIRED(!m_unused_i2p_sessions_mutex);
    1359 | +    std::shared_ptr<CNode> ConnectNode(CAddress addrConnect, const char *pszDest, bool fCountFailure, ConnectionType conn_type, bool use_v2transport) EXCLUSIVE_LOCKS_REQUIRED(!m_unused_i2p_sessions_mutex);
    1360 |      void AddWhitelistPermissionFlags(NetPermissionFlags& flags, const CNetAddr &addr, const std::vector<NetWhitelistPermissions>& ranges) const;
    1361 |  
    1362 |      void DeleteNode(CNode* pnode);
    


    hodlinator commented at 1:01 PM on March 11, 2025:

    Function body was removed.


    vasild commented at 5:07 AM on March 12, 2025:

    Removed the declaration as well, thanks!

  32. in src/net.cpp:3478 in 4492abbf0a outdated
    3463 |      WITH_LOCK(m_nodes_mutex, nodes.swap(m_nodes));
    3464 | -    for (CNode* pnode : nodes) {
    3465 | +    for (auto& pnode : nodes) {
    3466 |          LogDebug(BCLog::NET, "Stopping node, %s", pnode->DisconnectMsg(fLogIPs));
    3467 |          pnode->CloseSocketDisconnect();
    3468 | -        DeleteNode(pnode);
    


    hodlinator commented at 1:14 PM on March 11, 2025:

    Seems brittle on master that we just delete the node here without checking the refcount. From what I see we only call StopNodes() from tests and CConnman::Stop(), and we only create NodesSnapshots inside threads we've already stopped earlier in Stop().

  33. vasild force-pushed on Mar 12, 2025
  34. vasild commented at 5:06 AM on March 12, 2025: contributor

    4492abbf0a...8c0ce1ca1c: address #32015 (review)

  35. Sjors commented at 10:24 AM on March 12, 2025: member

    This seems like a nice simplification. It would be good to do this before nobody remembers what the old code was trying to do (and prevent). Since @theuni tried something similar before in #10738 it would be great if he could look at this again.

    In particular the 2017 PR claimed "special care must be taken" as to which thread destroys the object, but @vasild believes this doesn't matter (here): #32015 (review)

    The end result of that PR seems a lot more complicated than what this PR does, and perhaps that's part of why it didn't get a huge amount of review and was closed eventually? @willcl-ark then tried again in 2023 in #28222 (initially not knowing about the earlier attempt). Similar concerns about threading were raised: #28222#pullrequestreview-1565029167

    It seems that was never really settled and the PR died.

    It would be good to address this in the PR description to prevent going in circles.

  36. vasild commented at 1:18 PM on March 12, 2025: contributor

    Some comments from the other PRs:

    from #10738#issue-240283546:

    Special care must be taken, though, to only delete CNodes from a single thread, and to control which thread that is.

    In master, deletion of CNode objects can happen in the net thread or in the main thread. In other words, one CNode object might be destroyed in the net thread and another in the main thread. I do not see any problem with that. Of course we never destroy the same object from more than one thread, in master or in this PR. With this PR, deletion might also happen in the msghand thread (when destroying the temporary copy of m_nodes in ThreadMessageHandler()). That is fine as well; correct me if I am wrong.

    from #28222#pullrequestreview-1565029167:

    which thread actually deletes the CNode.

    The reason why that is important is because we only want to call FinalizeNode once and we want there to be no other references to that CNode at that time.

    This PR achieves that by calling FinalizeNode() from ~CNode(). Just to state the obvious: the destructor is called only once, and at the time it is called shared_ptr has ensured that there are no other references to the object. This is simpler and easier to reason about than the code in master, which manually waits for the number of references to drop.
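    The guarantee being relied on here can be demonstrated directly: shared_ptr runs the destructor exactly once, when the last reference is released, no matter how many copies existed.

    ```cpp
    #include <cassert>
    #include <memory>

    static int destructions{0};
    struct Node { ~Node() { ++destructions; } }; // stands in for FinalizeNode()

    int main()
    {
        auto a = std::make_shared<Node>();
        auto b = a;                // two owners
        a.reset();
        assert(destructions == 0); // b still owns the node
        b.reset();
        assert(destructions == 1); // destroyed once, by the last owner
    }
    ```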

  37. vasild commented at 3:48 PM on March 14, 2025: contributor

    Further idea around the second sentence of this:

    #10738#issue-240283546:

    Special care must be taken, though, to only delete CNodes from a single thread, and to control which thread that is. Eventually, this should move to the message processing thread, so that it's not possible to hit cs_main from the main network thread.

    It is possible to decouple CNode deletion from the FinalizeNode() call and do the latter in the msghand thread regardless of when the CNode is deleted. FinalizeNode() only needs three properties from CNode, so they can be extracted from the CNode object and saved to a place that the msghand thread later looks up to call FinalizeNode() on that data.

    Here is a patch that does that:

    <details> <summary>[patch] Call FinalizeNode() in the message processing thread</summary>

    diff --git i/src/net_processing.cpp w/src/net_processing.cpp
    index 1da3ec9d21..1124036a2a 100644
    --- i/src/net_processing.cpp
    +++ w/src/net_processing.cpp
    @@ -487,13 +487,13 @@ public:
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex);
         void NewPoWValidBlock(const CBlockIndex *pindex, const std::shared_ptr<const CBlock>& pblock) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_most_recent_block_mutex);
     
         /** Implement NetEventsInterface */
         void InitializeNode(const CNode& node, ServiceFlags our_services) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_tx_download_mutex);
    -    void FinalizeNode(const CNode& node) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_headers_presync_mutex, !m_tx_download_mutex);
    +    void FinalizeNode(const NodeFinalizationData& node) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_headers_presync_mutex, !m_tx_download_mutex);
         bool HasAllDesirableServiceFlags(ServiceFlags services) const override;
         bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, !m_headers_presync_mutex, g_msgproc_mutex, !m_tx_download_mutex);
         bool SendMessages(CNode* pto) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, g_msgproc_mutex, !m_tx_download_mutex);
     
    @@ -1560,80 +1560,78 @@ void PeerManagerImpl::ReattemptInitialBroadcast(CScheduler& scheduler)
         // Schedule next run for 10-15 minutes in the future.
         // We add randomness on every cycle to avoid the possibility of P2P fingerprinting.
         const auto delta = 10min + FastRandomContext().randrange<std::chrono::milliseconds>(5min);
         scheduler.scheduleFromNow([&] { ReattemptInitialBroadcast(scheduler); }, delta);
     }
     
    -void PeerManagerImpl::FinalizeNode(const CNode& node)
    +void PeerManagerImpl::FinalizeNode(const NodeFinalizationData& node)
     {
    -    NodeId nodeid = node.GetId();
         {
         LOCK(cs_main);
         {
             // We remove the PeerRef from g_peer_map here, but we don't always
             // destruct the Peer. Sometimes another thread is still holding a
             // PeerRef, so the refcount is >= 1. Be careful not to do any
             // processing here that assumes Peer won't be changed before it's
             // destructed.
    -        PeerRef peer = RemovePeer(nodeid);
    +        PeerRef peer = RemovePeer(node.id);
             assert(peer != nullptr);
             m_wtxid_relay_peers -= peer->m_wtxid_relay;
             assert(m_wtxid_relay_peers >= 0);
         }
    -    CNodeState *state = State(nodeid);
    +    CNodeState *state = State(node.id);
         assert(state != nullptr);
     
         if (state->fSyncStarted)
             nSyncStarted--;
     
         for (const QueuedBlock& entry : state->vBlocksInFlight) {
             auto range = mapBlocksInFlight.equal_range(entry.pindex->GetBlockHash());
             while (range.first != range.second) {
                 auto [node_id, list_it] = range.first->second;
    -            if (node_id != nodeid) {
    +            if (node_id != node.id) {
                     range.first++;
                 } else {
                     range.first = mapBlocksInFlight.erase(range.first);
                 }
             }
         }
         {
             LOCK(m_tx_download_mutex);
    -        m_txdownloadman.DisconnectedPeer(nodeid);
    +        m_txdownloadman.DisconnectedPeer(node.id);
         }
    -    if (m_txreconciliation) m_txreconciliation->ForgetPeer(nodeid);
    +    if (m_txreconciliation) m_txreconciliation->ForgetPeer(node.id);
         m_num_preferred_download_peers -= state->fPreferredDownload;
         m_peers_downloading_from -= (!state->vBlocksInFlight.empty());
         assert(m_peers_downloading_from >= 0);
         m_outbound_peers_with_protect_from_disconnect -= state->m_chain_sync.m_protect;
         assert(m_outbound_peers_with_protect_from_disconnect >= 0);
     
    -    m_node_states.erase(nodeid);
    +    m_node_states.erase(node.id);
     
         if (m_node_states.empty()) {
             // Do a consistency check after the last peer is removed.
             assert(mapBlocksInFlight.empty());
             assert(m_num_preferred_download_peers == 0);
             assert(m_peers_downloading_from == 0);
             assert(m_outbound_peers_with_protect_from_disconnect == 0);
             assert(m_wtxid_relay_peers == 0);
             WITH_LOCK(m_tx_download_mutex, m_txdownloadman.CheckIsEmpty());
         }
         } // cs_main
    -    if (node.fSuccessfullyConnected &&
    -        !node.IsBlockOnlyConn() && !node.IsInboundConn()) {
    +    if (node.was_connected_full_outbound) {
             // Only change visible addrman state for full outbound peers.  We don't
             // call Connected() for feeler connections since they don't have
             // fSuccessfullyConnected set.
             m_addrman.Connected(node.addr);
         }
         {
             LOCK(m_headers_presync_mutex);
    -        m_headers_presync_stats.erase(nodeid);
    +        m_headers_presync_stats.erase(node.id);
         }
    -    LogDebug(BCLog::NET, "Cleared nodestate for peer=%d\n", nodeid);
    +    LogDebug(BCLog::NET, "Cleared nodestate for peer=%d\n", node.id);
     }
     
     bool PeerManagerImpl::HasAllDesirableServiceFlags(ServiceFlags services) const
     {
         // Shortcut for (services & GetDesirableServiceFlags(services)) == GetDesirableServiceFlags(services)
         return !(GetDesirableServiceFlags(services) & (~services));
    diff --git i/src/net.h w/src/net.h
    index ff83ecde4d..cbe044f27e 100644
    --- i/src/net.h
    +++ w/src/net.h
    @@ -1000,14 +1000,29 @@ public:
         /** Mutex for anything that is only accessed via the msg processing thread */
         static Mutex g_msgproc_mutex;
     
         /** Initialize a peer (setup state) */
         virtual void InitializeNode(const CNode& node, ServiceFlags our_services) = 0;
     
    +    /** Data needed for `FinalizeNode()`. */
    +    struct NodeFinalizationData {
    +        NodeFinalizationData(const CNode& node)
    +            : id{node.GetId()},
    +              addr{node.addr},
    +              was_connected_full_outbound{node.fSuccessfullyConnected &&
    +                                          !node.IsBlockOnlyConn() &&
    +                                          !node.IsInboundConn()}
    +        {}
    +
    +        const NodeId id;
    +        const CService addr;
    +        const bool was_connected_full_outbound;
    +    };
    +
         /** Handle removal of a peer (clear state) */
    -    virtual void FinalizeNode(const CNode& node) = 0;
    +    virtual void FinalizeNode(const NodeFinalizationData& node_finalization_data) = 0;
     
         /**
          * Callback to determine whether the given set of service flags are sufficient
          * for a peer to be "relevant".
          */
         virtual bool HasAllDesirableServiceFlags(ServiceFlags services) const = 0;
    @@ -1112,15 +1127,17 @@ public:
     
         ~CConnman();
     
         bool Start(CScheduler& scheduler, const Options& options) EXCLUSIVE_LOCKS_REQUIRED(!m_total_bytes_sent_mutex, !m_added_nodes_mutex, !m_addr_fetches_mutex, !mutexMsgProc);
     
         void StopThreads();
    -    void StopNodes();
    -    void Stop()
    +    void StopNodes() EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex);
    +    void Stop() EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex)
         {
    +        AssertLockNotHeld(m_nodes_for_finalization_mutex);
    +
             StopThreads();
             StopNodes();
         };
     
         void Interrupt() EXCLUSIVE_LOCKS_REQUIRED(!mutexMsgProc);
         bool GetNetworkActive() const { return fNetworkActive; };
    @@ -1261,12 +1278,15 @@ public:
     
         /** Return true if we should disconnect the peer for failing an inactivity check. */
         bool ShouldRunInactivityChecks(const CNode& node, std::chrono::seconds now) const;
     
         bool MultipleManualOrFullOutboundConns(Network net) const EXCLUSIVE_LOCKS_REQUIRED(m_nodes_mutex);
     
    +    /** Remember some of the node's properties for calling `m_msgproc->FinalizeNode()` later. */
    +    void RememberForFinalization(const CNode& node) EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex);
    +
     private:
         struct ListenSocket {
         public:
             std::shared_ptr<Sock> sock;
             inline void AddSocketPermissionFlags(NetPermissionFlags& flags) const { NetPermissions::AddFlag(flags, m_permissions); }
             ListenSocket(std::shared_ptr<Sock> sock_, NetPermissionFlags permissions_)
    @@ -1287,13 +1307,13 @@ private:
         bool InitBinds(const Options& options);
     
         void ThreadOpenAddedConnections() EXCLUSIVE_LOCKS_REQUIRED(!m_added_nodes_mutex, !m_unused_i2p_sessions_mutex, !m_reconnections_mutex);
         void AddAddrFetch(const std::string& strDest) EXCLUSIVE_LOCKS_REQUIRED(!m_addr_fetches_mutex);
         void ProcessAddrFetch() EXCLUSIVE_LOCKS_REQUIRED(!m_addr_fetches_mutex, !m_unused_i2p_sessions_mutex);
         void ThreadOpenConnections(std::vector<std::string> connect, Span<const std::string> seed_nodes) EXCLUSIVE_LOCKS_REQUIRED(!m_addr_fetches_mutex, !m_added_nodes_mutex, !m_nodes_mutex, !m_unused_i2p_sessions_mutex, !m_reconnections_mutex);
    -    void ThreadMessageHandler() EXCLUSIVE_LOCKS_REQUIRED(!mutexMsgProc);
    +    void ThreadMessageHandler() EXCLUSIVE_LOCKS_REQUIRED(!mutexMsgProc, !m_nodes_for_finalization_mutex);
         void ThreadI2PAcceptIncoming();
         void AcceptConnection(const ListenSocket& hListenSocket);
     
         /**
          * Create a `CNode` object from a socket that has just been accepted and add the node to
          * the `m_nodes` member.
    @@ -1305,12 +1325,16 @@ private:
         void CreateNodeFromAcceptedSocket(std::unique_ptr<Sock>&& sock,
                                           NetPermissionFlags permission_flags,
                                           const CService& addr_bind,
                                           const CService& addr);
     
         void DisconnectNodes() EXCLUSIVE_LOCKS_REQUIRED(!m_reconnections_mutex, !m_nodes_mutex);
    +
    +    /** Call `m_msgproc->FinalizeNode()` on all entries from `m_nodes_for_finalization`. */
    +    void FinalizeNodes() EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex);
    +
         void NotifyNumConnectionsChanged();
         /** Return true if the peer is inactive and should be disconnected. */
         bool InactivityCheck(const CNode& node) const;
     
         /**
          * Generate a collection of sockets to check for IO readiness.
    @@ -1395,12 +1419,17 @@ private:
         // Whether the node should be passed out in ForEach* callbacks
         static bool NodeFullyConnected(const CNode& node);
     
         uint16_t GetDefaultPort(Network net) const;
         uint16_t GetDefaultPort(const std::string& addr) const;
     
    +    Mutex m_nodes_for_finalization_mutex;
    +
    +    /** When a node is destroyed some of its properties are saved here for calling m_msgproc->FinalizeNode() later. */
    +    std::queue<std::unique_ptr<NetEventsInterface::NodeFinalizationData>> m_nodes_for_finalization GUARDED_BY(m_nodes_for_finalization_mutex);
    +
         // Network usage totals
         mutable Mutex m_total_bytes_sent_mutex;
         std::atomic<uint64_t> nTotalBytesRecv{0};
         uint64_t nTotalBytesSent GUARDED_BY(m_total_bytes_sent_mutex) {0};
     
         // outbound limit & stats
    diff --git i/src/net.cpp w/src/net.cpp
    index d641579f6b..3a6ba226d0 100644
    --- i/src/net.cpp
    +++ w/src/net.cpp
    @@ -534,13 +534,13 @@ std::shared_ptr<CNode> CConnman::ConnectNode(CAddress addrConnect, const char *p
                 CalculateKeyedNetGroup(target_addr),
                 nonce,
                 addr_bind,
                 pszDest ? pszDest : "",
                 conn_type,
                 /*inbound_onion=*/false,
    -            [this](CNode& node) { m_msgproc->FinalizeNode(node); },
    +            [this](CNode& node) { RememberForFinalization(node); },
                 CNodeOptions{
                     .permission_flags = permission_flags,
                     .i2p_sam_session = std::move(i2p_transient_session),
                     .recv_flood_size = nReceiveFloodSize,
                     .use_v2transport = use_v2transport,
                 });
    @@ -1832,13 +1832,13 @@ void CConnman::CreateNodeFromAcceptedSocket(std::unique_ptr<Sock>&& sock,
             CalculateKeyedNetGroup(addr),
             nonce,
             addr_bind,
             /*addrNameIn=*/"",
             ConnectionType::INBOUND,
             inbound_onion,
    -        [this](CNode& node) { m_msgproc->FinalizeNode(node); },
    +        [this](CNode& node) { RememberForFinalization(node); },
             CNodeOptions{
                 .permission_flags = permission_flags,
                 .prefer_evict = discouraged,
                 .recv_flood_size = nReceiveFloodSize,
                 .use_v2transport = use_v2transport,
             });
    @@ -1904,14 +1904,12 @@ void CConnman::DisconnectNodes()
         AssertLockNotHeld(m_reconnections_mutex);
     
         // Use a temporary variable to accumulate desired reconnections, so we don't need
         // m_reconnections_mutex while holding m_nodes_mutex.
         decltype(m_reconnections) reconnections_to_add;
     
    -    std::vector<std::shared_ptr<CNode>> disconnected_nodes;
    -
         {
             LOCK(m_nodes_mutex);
     
             if (!fNetworkActive) {
                 // Disconnect any connected nodes
                 for (auto& pnode : m_nodes) {
    @@ -1925,18 +1923,12 @@ void CConnman::DisconnectNodes()
             // Disconnect unused nodes
             for (auto it = m_nodes.begin(); it != m_nodes.end();) {
                 auto node = *it;
                 if (node->fDisconnect) {
                     it = m_nodes.erase(it);
     
    -                // Keep a reference to this CNode object to delay its destruction until
    -                // after m_nodes_mutex has been released. Destructing a node involves
    -                // calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order
    -                // should be cs_main, m_nodes_mutex.
    -                disconnected_nodes.push_back(node);
    -
                     // Add to reconnection list if appropriate. We don't reconnect right here, because
                     // the creation of a connection is a blocking operation (up to several seconds),
                     // and we don't want to hold up the socket handler thread for that long.
                     if (node->m_transport->ShouldReconnectV1()) {
                         reconnections_to_add.push_back({
                             .addr_connect = node->addr,
    @@ -1964,12 +1956,38 @@ void CConnman::DisconnectNodes()
             // Move entries from reconnections_to_add to m_reconnections.
             LOCK(m_reconnections_mutex);
             m_reconnections.splice(m_reconnections.end(), std::move(reconnections_to_add));
         }
     }
     
    +void CConnman::RememberForFinalization(const CNode& node)
    +{
    +    LOCK(m_nodes_for_finalization_mutex);
    +    m_nodes_for_finalization.emplace(std::make_unique<NetEventsInterface::NodeFinalizationData>(node));
    +}
    +
    +void CConnman::FinalizeNodes()
    +{
    +    AssertLockNotHeld(m_nodes_for_finalization_mutex);
    +
    +    for (;;) {
    +        std::unique_ptr<NetEventsInterface::NodeFinalizationData> node_finalization_data;
    +
    +        {
    +            LOCK(m_nodes_for_finalization_mutex);
    +            if (m_nodes_for_finalization.empty()) {
    +                break;
    +            }
    +            node_finalization_data.swap(m_nodes_for_finalization.front());
    +            m_nodes_for_finalization.pop();
    +        }
    +
    +        m_msgproc->FinalizeNode(*node_finalization_data);
    +    }
    +}
    +
     void CConnman::NotifyNumConnectionsChanged()
     {
         size_t nodes_size;
         {
             LOCK(m_nodes_mutex);
             nodes_size = m_nodes.size();
    @@ -3050,12 +3068,14 @@ void CConnman::ThreadMessageHandler()
     
             WAIT_LOCK(mutexMsgProc, lock);
             if (!fMoreWork) {
                 condMsgProc.wait_until(lock, std::chrono::steady_clock::now() + std::chrono::milliseconds(100), [this]() EXCLUSIVE_LOCKS_REQUIRED(mutexMsgProc) { return fMsgProcWake; });
             }
             fMsgProcWake = false;
    +
    +        FinalizeNodes();
         }
     }
     
     void CConnman::ThreadI2PAcceptIncoming()
     {
         static constexpr auto err_wait_begin = 1s;
    @@ -3440,12 +3460,14 @@ void CConnman::StopThreads()
         if (threadSocketHandler.joinable())
             threadSocketHandler.join();
     }
     
     void CConnman::StopNodes()
     {
    +    AssertLockNotHeld(m_nodes_for_finalization_mutex);
    +
         if (fAddressesInitialized) {
             DumpAddresses();
             fAddressesInitialized = false;
     
             if (m_use_addrman_outgoing) {
                 // Anchor connections are only dumped during clean shutdown.
    @@ -3466,12 +3488,14 @@ void CConnman::StopNodes()
         }
         nodes.clear();
     
         vhListenSocket.clear();
         semOutbound.reset();
         semAddnode.reset();
    +
    +    FinalizeNodes(); // Finalize any leftovers which were added after the msghand thread was interrupted.
     }
     
     CConnman::~CConnman()
     {
         Interrupt();
         Stop();
    diff --git i/src/test/fuzz/p2p_handshake.cpp w/src/test/fuzz/p2p_handshake.cpp
    index d881cb35ba..99b2d70706 100644
    --- i/src/test/fuzz/p2p_handshake.cpp
    +++ w/src/test/fuzz/p2p_handshake.cpp
    @@ -61,14 +61,14 @@ FUZZ_TARGET(p2p_handshake, .init = ::initialize)
     
         LOCK(NetEventsInterface::g_msgproc_mutex);
     
         std::vector<CNode*> peers;
         const auto num_peers_to_add = fuzzed_data_provider.ConsumeIntegralInRange(1, 3);
         for (int i = 0; i < num_peers_to_add; ++i) {
    -        peers.push_back(ConsumeNodeAsUniquePtr(fuzzed_data_provider, i, [&peerman](CNode& node) {
    -                            peerman->FinalizeNode(node);
    +        peers.push_back(ConsumeNodeAsUniquePtr(fuzzed_data_provider, i, [&connman](CNode& node) {
    +                            connman.RememberForFinalization(node);
                             }).release());
             connman.AddTestNode(*peers.back());
             peerman->InitializeNode(
                 *peers.back(),
                 static_cast<ServiceFlags>(fuzzed_data_provider.ConsumeIntegral<uint64_t>()));
         }
    diff --git i/src/test/fuzz/p2p_headers_presync.cpp w/src/test/fuzz/p2p_headers_presync.cpp
    index 4c470dab1c..d6159efaee 100644
    --- i/src/test/fuzz/p2p_headers_presync.cpp
    +++ w/src/test/fuzz/p2p_headers_presync.cpp
    @@ -58,13 +58,13 @@ void HeadersSyncSetup::ResetAndInitialize()
             ConnectionType::INBOUND
         };
     
         for (auto conn_type : conn_types) {
             CAddress addr{};
             m_connections.push_back(new CNode(id++, nullptr, addr, 0, 0, addr, "", conn_type, false,
    -                                          [this](CNode& node) { m_node.peerman->FinalizeNode(node); }));
    +                                          [this](CNode& node) { m_node.connman->RememberForFinalization(node); }));
             CNode& p2p_node = *m_connections.back();
     
             connman.Handshake(
                 /*node=*/p2p_node,
                 /*successfully_connected=*/true,
                 /*remote_services=*/ServiceFlags(NODE_NETWORK | NODE_WITNESS),
    diff --git i/src/test/fuzz/process_message.cpp w/src/test/fuzz/process_message.cpp
    index b2031a60cc..a7b6913584 100644
    --- i/src/test/fuzz/process_message.cpp
    +++ w/src/test/fuzz/process_message.cpp
    @@ -64,13 +64,13 @@ FUZZ_TARGET(process_message, .init = initialize_process_message)
     
         const std::string random_message_type{fuzzed_data_provider.ConsumeBytesAsString(CMessageHeader::MESSAGE_TYPE_SIZE).c_str()};
         if (!LIMIT_TO_MESSAGE_TYPE.empty() && random_message_type != LIMIT_TO_MESSAGE_TYPE) {
             return;
         }
         CNode& p2p_node = *ConsumeNodeAsUniquePtr(fuzzed_data_provider, std::nullopt, [](CNode& node) {
    -                           g_setup->m_node.peerman->FinalizeNode(node);
    +                           g_setup->m_node.connman->RememberForFinalization(node);
                            }).release();
     
         connman.AddTestNode(p2p_node);
         FillNode(fuzzed_data_provider, connman, p2p_node);
     
         const auto mock_time = ConsumeTime(fuzzed_data_provider);
    diff --git i/src/test/fuzz/process_messages.cpp w/src/test/fuzz/process_messages.cpp
    index 89584e3d59..9377327f8e 100644
    --- i/src/test/fuzz/process_messages.cpp
    +++ w/src/test/fuzz/process_messages.cpp
    @@ -53,13 +53,13 @@ FUZZ_TARGET(process_messages, .init = initialize_process_messages)
         LOCK(NetEventsInterface::g_msgproc_mutex);
     
         std::vector<CNode*> peers;
         const auto num_peers_to_add = fuzzed_data_provider.ConsumeIntegralInRange(1, 3);
         for (int i = 0; i < num_peers_to_add; ++i) {
             peers.push_back(ConsumeNodeAsUniquePtr(fuzzed_data_provider, i, [&connman](CNode& node) {
    -                            connman.MsgProc()->FinalizeNode(node);
    +                            connman.RememberForFinalization(node);
                             }).release());
             CNode& p2p_node = *peers.back();
     
             FillNode(fuzzed_data_provider, connman, p2p_node);
     
             connman.AddTestNode(p2p_node);
    

    </details>

    Maybe worth exploring in a followup if people think that makes sense. Calling FinalizeNode() on the msghand thread to avoid locking cs_main in the net thread seems out of scope for this PR, which is about removing the manual reference counting of CNode.

  38. DrahtBot added the label Needs rebase on Mar 20, 2025
  39. vasild force-pushed on Mar 20, 2025
  40. vasild commented at 5:45 PM on March 20, 2025: contributor

    8c0ce1ca1c...dcb0be18ac: rebase due to conflicts

  41. DrahtBot removed the label Needs rebase on Mar 20, 2025
  42. hodlinator commented at 8:59 AM on March 24, 2025: contributor

    unique_ptr

    Edit: Please see newer branch: #32015 (comment)

    Here is a branch implementing another possible variant, using unique_ptr as suggested by @purpleKarrot in #32015 (comment): https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_unique_ptr

    Pros

    • It keeps the CNode ownership/lifetime under control of CConnman::m_nodes/m_nodes_disconnected. This can be considered an advancement over the unclear ownership/lifetime of shared_ptr ("global variables in disguise").

    • It avoids the need for introducing ~CNode() and m_destruct_cb together with passing a callback into CNode() everywhere.

    Cons

    • It requires more ongoing awareness of CNode lifetime than when using shared_ptr. However, once we grok CNode lifetimes, they appear fairly simple (unless I'm missing something):
    CConnman::ThreadSocketHandler()
        while (!interruptNet)
            DisconnectNodes()
                for m_nodes
                    move disconnected nodes to m_nodes_disconnected
                for m_nodes_disconnected
                    // Not referenced from other threads.
                    // And no new snapshots possible since we are in m_nodes_disconnected.
                    if refcount <= 0
                        DeleteNode()
            CConnman::SocketHandler()
                NodesSnapshot
    
    CConnman::ThreadMessageHandler()
        while (!flagInterruptMsgProc)
            NodesSnapshot
    
    Shutdown() (init.cpp)
        CConnman::Stop()
            StopThreads()
                threadMessageHandler.join()
                threadSocketHandler.join()
            StopNodes()
                for m_nodes
                    DeleteNode()
                for m_nodes_disconnected
                    DeleteNode()
    
    • It retains CNode::nRefCount (but documents its purpose as tracking NodesSnapshot references).
  43. DrahtBot added the label Needs rebase on Mar 24, 2025
  44. ryanofsky commented at 5:13 PM on April 1, 2025: contributor

    Comparing current shared_ptr change dcb0be18ac64f001abaf646930bae1b5f6dba8c0 with suggested unique_ptr change cb959340a6734c3461f6d51a4a689f6067ba5e91 both seem like reasonable approaches.

    • I think vasild's point stands that the drawbacks of manual reference counting outweigh the drawbacks of shared_ptr, even if the new shared_ptr code makes it a little harder to understand where exactly objects will be deleted. So this PR does seem like an improvement over the status quo.
    • Not sure I understand the purpose of the FinalizeNode patch #32015 (comment) and whether it has benefits related to this PR, or unrelated to it?
    • The suggested unique_ptr branch also seems nice because changes are broken up into smaller commits. And even though it doesn't completely remove reference counting, it does hide it inside the NodesSnapshot class, so maybe that is enough to achieve goals of this PR. I think the branch might also be able to go further, deleting the CNode AddSnapshot, ReleaseSnapshot, and GetSnapshotCount methods and making NodesSnapshot a friend class that accesses the reference count directly. This could make it easier to see and guarantee that NodesSnapshot is the only thing accessing the reference count.

    Overall it could be nice to adopt at least parts of the unique_ptr branch here.

  45. hodlinator commented at 7:50 PM on April 1, 2025: contributor

    I think the branch might also be able to also go further, deleting the CNode AddSnapshot, ReleaseSnapshot, and GetSnapshotCount methods and making NodesSnapshot a friend class that accesses the reference count directly. This could make it easier to see and guarantee that NodesSnapshot is the only thing accessing the reference count.

    I thought about making a base class of CNode that exposed snapshot refcounting through a friend-relationship, but I like the fact that NodesSnapshot is a private class inside CConnman (and didn't see a way to serve both goals).

  46. theuni commented at 9:43 PM on April 11, 2025: member

    As @Sjors mentioned, I attempted to do this a while ago, but brought in a whole new smart pointer impl to keep the threading guarantees, so it never got much review.

    In master deletion of CNode objects could happen in the net thread or in the main thread. In other words, some CNode object might be destroyed in the net thread and another CNode object might be destroyed in the main thread.

    Where can this happen? At least in the current model, that sounds like a bug to me.

    The idea is that it does not matter which thread will destroy the CNode objects.

    What makes this suddenly ok? I recall that deleting from the socket handler thread was a hard requirement for safety/sanity.

    This makes me very uncomfortable. It's been a while since I've looked at this code so the details are fuzzy now, but I don't like the idea of refactoring the simple refcounters and changing the threading guarantees for no obvious gain. Is there some strong motivation for this I'm missing?

  47. hodlinator commented at 10:26 PM on April 11, 2025: contributor

    In master deletion of CNode objects could happen in the net thread or in the main thread. In other words, some CNode object might be destroyed in the net thread and another CNode object might be destroyed in the main thread.

    Where can this happen? At least in the current model, that sounds like a bug to me.

    The "diagram" at the bottom of #32015 (comment) may be helpful. The net-thread corresponds to CConnman::ThreadSocketHandler() and the main-thread corresponds to Shutdown() (init.cpp). (See calls to DeleteNode() in the diagram).

  48. vasild force-pushed on Apr 13, 2025
  49. vasild commented at 5:22 PM on April 13, 2025: contributor

    dcb0be1...71750da: rebase due to conflicts

    I do not understand the fixation on knowing and controlling where an object is destroyed from. The elegance and simplicity of the whole idea behind shared_ptr is that it does not matter. @hodlinator, I checked https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_unique_ptr, here is some quick feedback:

    • Does not remove the manual reference counting which is the aim of this PR. That diff mostly replaces the raw pointer CNode* with std::unique_ptr<CNode>.
    • To implement the snapshots it takes and stores raw pointers away from the unique_ptr, which is an anti-pattern, because then the lifetime of the raw pointer has to be manually kept shorter than the lifetime of the unique_ptr. I think that defeats the purpose of unique_ptr itself.
    • Any operation on the std::vector<std::unique_ptr<CNode>> m_nodes that deletes elements, e.g. clear(), erase(), pop_back() or resize(), will automatically destroy the CNode objects even if they are referenced (manual ref count > 0).

    @theuni,

    The idea is that it does not matter which thread will destroy the CNode objects.

    What makes this suddenly ok? I recall that deleting from the socket handler thread was a hard requirement for safety/sanity.

    I do not know about that. Can you elaborate? PeerManagerImpl::FinalizeNode() has to be called on the CNode before it is destroyed. Could this be a misunderstanding, i.e. assuming that from this it follows that CNode must be destroyed from a particular thread?

    ... I don't like the idea of refactoring the simple refcounters and changing the threading guarantees for no obvious gain ...

    Hmm, "simple refcounters"... IMO those are "complicated". That might be subjective. But this patch reduces the code by about 70 lines. It entirely drops the need for snapshots as well. There is no change of "threading guarantees" here. You seem to be thinking that CNode objects can and must only be destroyed from the net thread, but destruction may just as well happen from the main thread. I do not think that this is a bug. And it is just fine, IMO, that with this PR deletion can now happen from the msghand thread. If you think that is not ok, can you explain why?

  50. DrahtBot removed the label Needs rebase on Apr 13, 2025
  51. vasild commented at 4:53 AM on April 14, 2025: contributor

    I thought a bit more about the unique_ptr + manual refcounting. It seems to me that the manual refcounting can be replaced by the internal refcounting in shared_ptr, like this:

    • drop CNode::nRefCount, same counter is somewhere inside shared_ptr
    • CNode::GetSnapshotCount() -> std::shared_ptr::use_count()
    • CNode::AddSnapshot() + copy the raw pointer out from unique_ptr -> copy the shared_ptr
    • CNode::ReleaseSnapshot() -> destroy the shared_ptr copy
    • drop all of CConnman::NodesSnapshot -> just make a copy of m_nodes under the mutex.

    The purpose of CConnman::m_nodes_disconnected and CConnman::DeleteNode() seems to be to call FinalizeNode() before the CNode object is destroyed. If m_nodes_disconnected and DeleteNode() are replaced with calling FinalizeNode() from ~CNode(), then it becomes this PR.

  52. theuni commented at 8:39 PM on April 14, 2025: member

    Note: This is a long and rambling comment, which mostly consists of me reminding myself about the specifics of this code. I'm leaving the whole thing in case my thought process helps any other reviewers.


    The idea is that it does not matter which thread will destroy the CNode objects.

    What makes this suddenly ok? I recall that deleting from the socket handler thread was a hard requirement for safety/sanity.

    I do not know about that. Can you elaborate? PeerManagerImpl::FinalizeNode() has to be called on the CNode before it is destroyed. Could this be some misunderstanding that from this follows that CNode must be destroyed from a particular thread?

    Yes, the main thread may call FinalizeNode(), but at that point the message handler and socket handler threads have already been joined, so that's not really interesting.

    We currently guarantee that no node will be deleted from threadMessageHandler as it iterates. That's the one that matters.

    This guarantee protects us from a large footgun. Note that my original motivation for #10738 had nothing to do with code cleanliness, rather I was targeting a substantial optimization: CConnman::ForNode / CConnman::ForEachNode both hold m_nodes_mutex for the duration of their callbacks. Those callbacks may be long-lived, substantial, and/or have scary locks themselves, so blocking the socket handler thread for them really sucks.

    Crucially, PeerManagerImpl::State(nodeid) will stay valid for any reachable nodeid for the duration of SendMessages and ProcessMessages.

    Now consider if:

    with this PR deletion can now happen from the msghand thread

    A reasonable follow-up to this PR (I actually expected to see it here) would be:

    diff --git a/src/net.cpp b/src/net.cpp
    index b68d2fb37aa..d5631c056ce 100644
    --- a/src/net.cpp
    +++ b/src/net.cpp
    @@ -3902,13 +3902,16 @@ void CConnman::PushMessage(CNode* pnode, CSerializedNetMsg&& msg)
    
     bool CConnman::ForNode(NodeId id, std::function<bool(CNode* pnode)> func)
     {
    +    std::shared_ptr<CNode> found_node = nullptr;
    +    {
         LOCK(m_nodes_mutex);
         for (auto&& pnode : m_nodes) {
             if(pnode->GetId() == id) {
    -            return NodeFullyConnected(*pnode) && func(pnode.get());
    +            found_node = pnode;
             }
         }
    -    return false;
    +    }
    +    return found_node && NodeFullyConnected(*found_node) && func(found_node.get());
     }
    
     CSipHasher CConnman::GetDeterministicRandomizer(uint64_t id) const
    diff --git a/src/net.h b/src/net.h
    index e1318c8e66b..bff5c5f423f 100644
    --- a/src/net.h
    +++ b/src/net.h
    @@ -1140,11 +1140,9 @@ public:
         using NodeFn = std::function<void(CNode*)>;
         void ForEachNode(const NodeFn& func)
         {
    -        LOCK(m_nodes_mutex);
    -        for (auto&& node : m_nodes) {
    -            if (NodeFullyConnected(*node)) {
    -                func(node.get());
    -            }
    +        auto nodes = WITH_LOCK(m_nodes_mutex, return m_nodes);
    +        for (auto&& node : nodes) {
    +            if (NodeFullyConnected(*node)) func(node.get());
             }
         };
    
    

    Right? Because if the threading restriction is "gone", that seems like it should be safe. Like I said, I did that as part of the PR mentioned above: https://github.com/bitcoin/bitcoin/pull/10738/commits/6dc671a22046c6c09a6de2b1d38e9e2fa3920628

    But if we're not careful, the nodes could now be deleted as a result of ForNode/ForEachNode, or any function that copies the node vector under the lock and uses the copy.

    The problem with this is that it makes interactions with PeerManager super hard to reason about. Consider a function that does something like this:

    PeerManagerImpl::SomeFunction()
    {
        NodeId to_disconnect = CallSomeFunctionToGetWorstNode();
        m_connman.ForNode(to_disconnect, [&](CNode* pnode) {
            ...
            pnode->fDisconnect = true;
            ...
        });
        SomeFunctionThatTakesAFewSeconds();
        CNodeState *state = State(to_disconnect); // YIKES
    }
    

    The above isn't contrived either, EvictExtraOutboundPeers/NewPoWValidBlock do things like that.

    HOWEVER

    As-is, the impl here currently protects against nodes being deleted in SendMessages and ProcessMessages using the same refcounting trick that we do in master. Frustratingly, in master it's already abstracted away by NodesSnapshot so it's not immediately obvious (RAII notwithstanding, it's an annoying trend in our repo imo to hide crucial implementation details behind abstractions for the sake of having fewer lines of code). And this PR takes it one step further:

    auto nodes = WITH_LOCK(m_nodes_mutex, return m_nodes);
    

    ^^ This line is now in charge of keeping the shared_ptrs alive until the end of the message handling loop. If not for that (for example, if we instead kept weak_ptrs and upgraded them as we passed through the loop), the ForEachNode case I explained above would be a problem.

    So I don't think it's true (and imo it's quite dangerous to assume) that it's "safe to delete a CNode from any thread". What's happening here is that it's now possible that nodes are deleted in two places rather than one: either after a pass through the socket handler loop, or after a pass through the message handler loop. That still makes me uncomfortable, but I admit that I can't find any problems with it.

    This PR also maintains the invariant that the socket handler thread is always completely finished with all socket operations before FinalizeNode is called.

    And as far as I can tell, this PR does allow us to take the patch above, getting rid of the nasty lock-holding in ForEachNode and ForNode. BUT that would open up the possibility of deletion from other callers. For example, it would make it possible for the RPC thread to delete the node from sendmsgtopeer(). And if any future code stashed the shared_ptrs from CConnman, it would make it possible for the deletion to happen anywhere.


    TLDR

    This is a combination of 2 changes: replacing our refcounters with shared_ptrs (style), and adding a new deletion point (functionality). As far as I can tell, the 2nd change is a side-effect of the first because it's not straightforward to implement this with shared pointers while avoiding the possible deletion in ThreadMessageHandler(). I really don't care for behavioral changes that are dictated by style changes. So the question for reviewers, imo, is: is the functional change a good one?

    I don't trust that it would be safe to simply allow CNodes to be deleted just anywhere. But crucially, that's not what's being proposed in the code here. Instead, a single new deletion point has been introduced (at the end of the message processing loop), which is much more reasonable, as it protects SendMessages and ProcessMessages as-before.

    I'm not completely opposed to this, but I still find it scary and rather unjustified. I'm not convinced by the "it reduces the lines of code" argument. In fact, I would probably prefer to add comments around the NodesSnapshot to obviate what it's doing.

    If we're going to move forward with this I think we need:

    • ~~Split the style change and functional change into separate commits: first the shared_ptr change as a no-op, then the additional deletion point.~~ Edit: Looks like this isn't realistic.
    • Justification besides "code cleanup". I don't think that's a valid reason to be making behavioral changes to such low-level code.
    • In-place comments that describe the guarantees we provide. If the safety of the p2p layer revolves around a copy of a vector of shared_ptrs, that needs to be made clear. That nodes.clear() is also too crucial to not have a comment.
    • Documentation about the finalization/deletion process in general. This should probably happen even if we don't take these changes.
  53. ryanofsky commented at 3:21 PM on April 15, 2025: contributor

    @theuni would be curious for your feedback on hodlinator's unique_ptr + isolated refcount approach in #32015 (comment).

    The more I read here (not fully understanding everything) the more that approach seems appealing because when you say things like "it's now possible that nodes are deleted in two places rather than one" and "But if we're not careful, the nodes could now be deleted as a result of ForNode/ForEachNode" it really makes shared_ptr seem like a poor fit. Shared_ptr can be great when objects have trivial destructors and it doesn't actually matter when they are deleted, but if you need control over when resources are freed, more explicit refcounting would seem safer and simpler.

    It seems like both you and vasild like making a dichotomy between (1) using manual refcounting and (2) using shared_ptr, and not wanting to choose a middle ground like hodlinator's approach, which provides automatic reference counting through an explicit mechanism that is less flexible but simpler than shared_ptr. I'm not really sure why that is.

  54. theuni commented at 3:47 PM on April 15, 2025: member

    The more I read here (not fully understanding everything) the more that approach seems appealing because when you say things like "it's now possible that nodes are deleted in two places rather than one" and "But if we're not careful, the nodes could now be deleted as a result of ForNode/ForEachNode" it really makes shared_ptr seem like a poor fit. Shared_ptr can be great when objects have trivial destructors and it doesn't actually matter when they are deleted, but if you need control over when resources are freed, more explicit refcounting would seem safer and simpler.

    I agree.

    It seems like both you and vasild like making a dichotomy between (1) using manual refcounting and (2) using shared_ptr, and not wanting to choose a middle ground like hodlinators approach which provides automatic reference counting through an explicit mechanism that is less flexible but simpler than shared_ptr. I'm not really sure why that is.

    No reason, I just haven't looked at it yet.

    I spent all of yesterday poking around and trying to remind myself of where be the dragons in this code, hence the rambly and not very useful comment :)

    I originally intended to comment that it was definitely unsafe to change the way deletion happens, but after poking around all day I concluded that even though it makes me uneasy (because I feel like I've forgotten something important over the years), I can't really find any tangible issues with the addition of a new deletion point. I wouldn't feel comfortable enough to ACK the approach, but I recognize that I can't argue against it either. However, since there's no real justification for the behavioral change (which I consider scary) other than "code cleanup", I can't really get onboard with it.

    If the unique_ptr approach cleans things up without the behavioral change, that sounds much nicer to me. Will have a look.

  55. vasild force-pushed on Apr 15, 2025
  56. vasild commented at 4:15 PM on April 15, 2025: contributor

    71750daa51...75acdcbaa9: add some comments and rename a variable, see below

    We currently guarantee that no node will be deleted from threadMessageHandler as it iterates. That's the one that matters.

    Yes. I just want to clarify - what do you mean by "from"? I would say "no node will be deleted (does not matter from which thread) while threadMessageHandler iterates". Is this what you mean? This guarantee is achieved by the caller of ProcessMessages() and SendMessages() taking a snapshot of all nodes to keep them from destruction while those functions are running. That is the same in master and in this PR.

    Crucially, PeerManagerImpl::State(nodeid) will stay valid for any reachable nodeid for the duration of SendMessages and ProcessMessages.

    Yes. True in master and in this PR for the same reason - the caller takes a snapshot while calling those functions.

    Now consider if:

    with this PR deletion can now happen from the msghand thread

    Yes. Sounds good. As long as ProcessMessages() and SendMessages() are protected by the snapshot (they are in master and in this PR).

    A reasonable follow-up to this PR (I actually expected to see it here) would be ... Right? Because if the threading restriction is "gone", that seems like it should be safe.

    Right. That is safe and super reasonable because it minimizes the time m_nodes_mutex is held, like you mentioned. I think that belongs to this PR or a followup.

    But if we're not careful, the nodes could now be deleted as a result of ForNode/ForEachNode, or any function that copies the node vector under the lock and uses the copy.

    Only if all others have released their copies. In which case it is ok.

    The problem with this is that it makes interactions with PeerManager super hard to reason about. Consider a function that does something like this: PeerManagerImpl::SomeFunction() ...

    This does not change rules about interactions with PeerManager (so it is not harder or easier, it is the same). The fact that nodes may be deleted if nobody holds a reference to them is true in master and in this PR. I think the rule is "Don't delete nodes while ProcessMessages() and SendMessages() are running".

    PeerManagerImpl::SomeFunction() could result in state being nullptr (YIKES) also in master, depending on where SomeFunction() is called from. If that is a problem for SomeFunction() it could either:

    • handle State() returning nullptr (some callers of State() already do that in master) or
    • it could increment the reference count internally at the start and reduce it at the end or demand that its callers do that, like ProcessMessages() and SendMessages().

    This reasoning is unchanged by this PR.

    the impl here currently protects against nodes being deleted in SendMessages and ProcessMessages using the same refcounting trick that we do in master.

    Yes.

    auto nodes = WITH_LOCK(m_nodes_mutex, return m_nodes);

    This line is now in charge of keeping the shared_ptrs alive until the end of the message handling loop

    Renamed it to nodes_snapshot and added a comment. I find the pattern used in master and in this PR reasonable:

    1. Take a snapshot of the nodes
    2. Do something with the nodes from the snapshot (Implying they will not get deleted while we work on them, that is obvious, right? What's the meaning of "snapshot" otherwise?)
    3. Release the snapshot // after the release this code does not care, nodes may get deleted if all others have released their references
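    The three steps above can be sketched as a minimal, self-contained program (hypothetical `Manager`/`Node` types standing in for `CConnman`/`CNode`, not the actual code):

    ```cpp
    #include <cstdio>
    #include <memory>
    #include <mutex>
    #include <vector>

    struct Node {
        int id;
        explicit Node(int i) : id{i} {}
        ~Node() { std::printf("node %d deleted\n", id); }
    };

    class Manager {
        std::mutex m_nodes_mutex;
        std::vector<std::shared_ptr<Node>> m_nodes;
    public:
        void Add(int id)
        {
            std::lock_guard lock{m_nodes_mutex};
            m_nodes.push_back(std::make_shared<Node>(id));
        }
        void DisconnectAll()
        {
            std::lock_guard lock{m_nodes_mutex};
            m_nodes.clear(); // drops only the manager's references
        }
        void MessageHandlerPass()
        {
            // 1. Take a snapshot: copy the vector under the mutex.
            std::vector<std::shared_ptr<Node>> nodes_snapshot;
            {
                std::lock_guard lock{m_nodes_mutex};
                nodes_snapshot = m_nodes;
            }
            // 2. Work on the snapshot; the copied shared_ptrs keep the
            //    nodes alive even if they are removed from m_nodes
            //    concurrently by another thread.
            for (const auto& node : nodes_snapshot) {
                std::printf("processing node %d\n", node->id);
            }
            // 3. The snapshot goes out of scope here: any node whose last
            //    reference it held is deleted at this point.
        }
    };

    int main()
    {
        Manager mgr;
        mgr.Add(7);
        mgr.MessageHandlerPass();
        mgr.DisconnectAll(); // last reference dropped: node 7 deleted now
        return 0;
    }
    ```

    The copy under the mutex is cheap (shared_ptr copies, not CNode copies) and after it the mutex is not needed for the rest of the pass.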

    So I don't think it's true (and imo it's quite dangerous to assume) that it's "safe to delete a CNode from any thread".

    But why? I agree it is unsafe to:

    A. Delete a node without calling FinalizeNode(), or
    B. Delete a node while there are references to it.

    but the thread that does the deletion? That does not matter.

    What's happening here is that it's now possible that nodes are deleted in two places rather than one: either after a pass through the socket handler loop, or after a pass through the message handler loop.

    IMO that is safe as long as A. and B. are guaranteed not to happen.

    This PR also maintains the invariant that the socket handler thread is always completely finished with all socket operations before FinalizeNode is called.

    That is also the same in master. FinalizeNode() is called after CNode::CloseSocketDisconnect(). To me the code in this PR makes that more obvious compared to master because FinalizeNode() is called from the CNode's destructor and at that time it is clear that all socket operations are completely finished.
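    The finalize-from-destructor idea can be sketched as follows (a hypothetical Node type with a finalize callback standing in for PeerManager::FinalizeNode(); not the actual interface):

    ```cpp
    #include <cstdio>
    #include <functional>
    #include <memory>
    #include <utility>

    struct Node {
        int id;
        std::function<void(int)> finalize; // stand-in for FinalizeNode()
        Node(int i, std::function<void(int)> f) : id{i}, finalize{std::move(f)} {}
        ~Node()
        {
            // Runs only when the last shared_ptr is dropped, so no owner
            // can still be using the node's socket or state.
            finalize(id);
        }
    };

    int main()
    {
        auto node = std::make_shared<Node>(42, [](int id) {
            std::printf("FinalizeNode(peer=%d)\n", id);
        });
        auto copy = node; // second owner, e.g. a message handler snapshot
        node.reset();     // not finalized yet: a reference remains
        std::printf("before last release\n");
        copy.reset();     // last owner gone -> destructor -> finalize
        return 0;
    }
    ```

    The ordering guarantee falls out of shared_ptr's semantics: the destructor, and therefore the finalize hook, cannot run while any owner still holds a reference.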

    And as far as I can tell, this PR does allow us to take the patch above, getting rid of the nasty lock-holding in ForEachNode and ForNode.

    I think so too.

    BUT that would open up the possibility of deletion from other callers. For example, it would make it possible for the RPC thread to delete the node from sendmsgtopeer(). And if any future code stashed the shared_ptrs from CConnman, it would make it possible for the deletion to happen anywhere.

    Yes, and that is fine. But don't forget - only if all others have released their copies. We shouldn't worry that deleting a CNode object from the RPC thread will break some other thread that is using it. Deletion only happens when the reference count drops to zero, which means nobody is using the object; at that point it is irrelevant which thread performs the deletion because there are zero users of the object.
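    This property can be demonstrated in isolation. In the sketch below (hypothetical types, not the real code), exactly one of the two releasing threads runs the destructor, namely whichever one drops the count to zero; shared_ptr's control block makes the decrement thread-safe:

    ```cpp
    #include <cstdio>
    #include <memory>
    #include <thread>

    struct Node {
        ~Node() { std::printf("deleted on some thread, refcount hit zero\n"); }
    };

    int main()
    {
        auto node = std::make_shared<Node>();
        std::thread rpc{[n = node]() mutable {
            // The "RPC thread" holds its own reference; releasing it here
            // deletes the node only if this was the last reference.
            n.reset();
        }};
        node.reset(); // the "main" thread releases its reference too
        rpc.join();
        // Exactly one of the two reset() calls above ran the destructor,
        // regardless of which thread it happened on.
        return 0;
    }
    ```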

    • Justification besides "code cleanup". I don't think that's a valid reason to be making behavioral changes to such low-level code.

    Added some to the PR description.

    In-place comments that describe the guarantees we provide. If the safety of the p2p layer revolves around a copy of a vector of shared_ptrs, that needs to be made clear. That nodes.clear() is also too crucial to not have a comment.

    Added some comments.

    Documentation about the finalization/deletion process in general. This should probably happen even if we don't take these changes.

    I agree. Maybe if that doc existed beforehand it would have made this discussion shorter. Opened #32278

  57. vasild commented at 7:29 PM on April 15, 2025: contributor

    when you say things like "it's now possible that nodes are deleted in two places rather than one" and

    It is actually "... in 3 places rather than 2"

    "But if we're not careful, the nodes could now be deleted as a result of ForNode/ForEachNode"

    This sounds like some unexpected deletion is going to happen in the middle of everything and bad things will follow and this is going to be because of this PR. But that is not true. Nodes could be deleted as a result of ForNode even now in master - if ForNode sets fDisconnect to true, then that will eventually result in the deletion of the node when it is safe to do so. This is both in master and in this PR. Nodes will not be deleted unexpectedly, accidentally, while there are references to them because we are not careful with ForNode (again, same in master and in this PR).

    It seems like both you and vasild like making a dichotomy between (1) using manual refcounting and (2) using shared_ptr, and not wanting to choose a middle ground like hodlinators approach which provides automatic reference counting though an explicit mechanism that is less flexible but simpler than shared_ptr. I'm not really sure why that is.

    I expressed some concerns about the "unique_ptr+manual refcounting" in #32015 (comment)

  58. hodlinator commented at 7:52 AM on April 16, 2025: contributor

    I do not understand the fixation on knowing and controlling where an object is destroyed from. The elegance and simplicity behind the sole idea of shared_ptr is that it does not matter.

    I used to appreciate shared_ptr-like constructs, but as @purpleKarrot wrote, they introduce global variables in disguise. Using unique_ptr prevents that and makes ownership clear, which is a step forward.

    @hodlinator, I checked master...hodlinator:bitcoin:pr/32015_unique_ptr, here is a quick feedback:

    • Does not remove the manual reference counting which is the aim of this PR. That diff mostly replaces the raw pointer CNode* with std::unique_ptr<CNode>.

    Yes, the PR instead makes clear that refcounting is there just to track live snapshots. Making that sole purpose clear removes the itch to eradicate it for me.

    • To implement the snapshots it takes and stores raw pointers away from the unique_ptr which is an anti-pattern because then the lifetime of the raw pointer has to be manually kept shorter than the lifetime of the unique_ptr. I think that defeats the purpose of unique_ptr itself.

    In a post-unique_ptr (and shared_ptr) world, raw pointers indicate non-ownership. It is subtle but not an anti-pattern.

    • Any operation on the std::vector<std::unique_ptr<CNode>> m_nodes that deletes elements, e.g. clear(), erase(), pop_back() or resize() will automatically destroy the CNode objects even if they are referenced (manual ref count >0).

    Good point. Might be worth adding ~CNode() { Assume(GetSnapshotCount() == 0); }.
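    A rough sketch of that guard, using a simplified node type and a plain assert in place of Assume() (names are illustrative, not the actual CNode):

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <cstdio>

    struct NodeSketch {
        std::atomic<int> m_snapshot_count{0};
        int GetSnapshotCount() const { return m_snapshot_count; }
        ~NodeSketch()
        {
            // Catch invalid operations, e.g. clearing the owning vector
            // while snapshots still reference this node.
            assert(GetSnapshotCount() == 0);
        }
    };

    int main()
    {
        {
            NodeSketch node; // destroyed with zero snapshots: fine
        }
        std::puts("ok");
        return 0;
    }
    ```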

  59. vasild commented at 8:16 AM on April 16, 2025: contributor

    Consider a function that does something like this: ... PeerManagerImpl::SomeFunction()

    Here is an example that YIKES on master as well:

    <details> <summary>[patch] YIKES on master</summary>

    diff --git i/src/net.cpp w/src/net.cpp
    index fc0edc1a5c..c48a62f31e 100644
    --- i/src/net.cpp
    +++ w/src/net.cpp
    @@ -2758,12 +2758,14 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect, std
             const auto current_time{NodeClock::now()};
             int nTries = 0;
             const auto reachable_nets{g_reachable_nets.All()};
     
             while (!interruptNet)
             {
    +            m_msgproc->HasAllDesirableServiceFlags(ServiceFlags{123});
    +
                 if (anchor && !m_anchors.empty()) {
                     const CAddress addr = m_anchors.back();
                     m_anchors.pop_back();
                     if (!addr.IsValid() || IsLocal(addr) || !g_reachable_nets.Contains(addr) ||
                         !m_msgproc->HasAllDesirableServiceFlags(addr.nServices) ||
                         outbound_ipv46_peer_netgroups.count(m_netgroupman.GetGroup(addr))) continue;
    diff --git i/src/net_processing.cpp w/src/net_processing.cpp
    index 0b25396dd2..6148c1d8de 100644
    --- i/src/net_processing.cpp
    +++ w/src/net_processing.cpp
    @@ -519,12 +519,13 @@ public:
             EXCLUSIVE_LOCKS_REQUIRED(!m_most_recent_block_mutex);
     
         /** Implement NetEventsInterface */
         void InitializeNode(const CNode& node, ServiceFlags our_services) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_tx_download_mutex);
         void FinalizeNode(const CNode& node) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_headers_presync_mutex, !m_tx_download_mutex);
         bool HasAllDesirableServiceFlags(ServiceFlags services) const override;
    +    void SomeFunction() const;
         bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, !m_headers_presync_mutex, g_msgproc_mutex, !m_tx_download_mutex);
         bool SendMessages(CNode* pto) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, g_msgproc_mutex, !m_tx_download_mutex);
     
         /** Implement PeerManager */
    @@ -1660,14 +1661,32 @@ void PeerManagerImpl::FinalizeNode(const CNode& node)
             LOCK(m_headers_presync_mutex);
             m_headers_presync_stats.erase(nodeid);
         }
         LogDebug(BCLog::NET, "Cleared nodestate for peer=%d\n", nodeid);
     }
     
    +void PeerManagerImpl::SomeFunction() const
    +{
    +    const NodeId to_disconnect{0};
    +    m_connman.ForNode(to_disconnect, [&](CNode* node) {
    +        node->fDisconnect = true;
    +        return true;
    +    });
    +    while (WITH_LOCK(cs_main, return State(to_disconnect)) != nullptr) {
    +        std::this_thread::sleep_for(1s);
    +    }
    +    LogInfo("YIKES");
    +    Assert(false);
    +}
    +
     bool PeerManagerImpl::HasAllDesirableServiceFlags(ServiceFlags services) const
     {
    +    if (services == 123) {
    +        SomeFunction();
    +    }
    +
         // Shortcut for (services & GetDesirableServiceFlags(services)) == GetDesirableServiceFlags(services)
         return !(GetDesirableServiceFlags(services) & (~services));
     }
     
     ServiceFlags PeerManagerImpl::GetDesirableServiceFlags(ServiceFlags services) const
     {
    

    </details>

    That might happen on master and in this PR. I would say that for each yikes in this PR there is an equivalent yikes on master; in other words, this PR is not prone to more yikes. The rule is the same: "if code relies on CNode not being deleted, then it should either hold m_nodes_mutex, or increment the reference count during its usage."

    ~Split the style change and functional change into separate commits: first the shared_ptr change as a no-op, then the additional deletion point.~ Edit: Looks like this isn't realistic.

    I just did that, see https://github.com/vasild/bitcoin/commits/cnode_shared_ptr_split_in_two_commits/ The intermediate state, after the first commit, is somewhat odd - we have shared_ptr but don't let it destroy the owned object by itself when we drop the shared_ptr, e.g. just remove it from the vector. Instead, in this intermediate state we monitor the use_count() and call reset() manually. Also, NodesSnapshot is a useless wrapper that just copies the vector. The second commit cleans all that up and arrives at the same end result as this PR.

  60. hodlinator commented at 1:55 PM on April 16, 2025: contributor

    Published new iteration of unique_ptr change in separate branch: https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_unique_ptr2

    • Moved changes clarifying CNode::nRefCount really only being used for NodesSnapshots to happen before unique_ptr commit.
    • Split the CNode::nRefCount-commit in two, one to change usage to only be for snapshots, one for renaming the methods. Now we also rename the field to m_snapshot_count and group refcounting methods together.
    • Added ~CNode()-dtor with assert(GetSnapshotCount() == 0) in order to catch invalid operations like CConnman::m_nodes.clear() as pointed out by vasild.
    • Added speculative final commit to introduce use of NodesSnapshot in ForNode and ForEachNode to decrease time m_nodes_mutex is locked, inspired by @theuni's comment. Feels fairly safe given that we only delete nodes with 0 snapshots prior to CConnman::Stop() & StopNodes(). If another thread was still inside ForNode/ForEachNode at that point, we are already shutting down, and the asserts for zero snapshots in DeleteNode() will simply kill the process quicker. Should it be deemed too risky this commit could be skipped.
    • Moved some paranoid TRACEPOINT changes and making StopThreads & StopNodes private into early "Minor improvements" commit.
    • DeleteNode(std::unique_ptr<CNode> pnode) -> DeleteNode(std::unique_ptr<CNode>&& pnode): Clarifies that we are assuming ownership of the unique_ptr.
    • Rebased on same parent as this PR.
  61. theuni commented at 10:17 PM on April 16, 2025: member

    I've been working on some experiments and tests for this again today, but unfortunately I didn't have time to get finished.

    I'll be away until Tuesday, but I'll resume here next week. Sorry for the delay :(

  62. in src/net.h:1435 in 75acdcbaa9 outdated
    1431 | @@ -1442,8 +1432,7 @@ class CConnman
    1432 |      std::vector<AddedNodeParams> m_added_node_params GUARDED_BY(m_added_nodes_mutex);
    1433 |  
    1434 |      mutable Mutex m_added_nodes_mutex;
    1435 | -    std::vector<CNode*> m_nodes GUARDED_BY(m_nodes_mutex);
    1436 | -    std::list<CNode*> m_nodes_disconnected;
    1437 | +    std::vector<std::shared_ptr<CNode>> m_nodes GUARDED_BY(m_nodes_mutex);
    


    vasild commented at 11:49 AM on April 17, 2025:

    Bringing the anti-shared_ptr discussion to its own dedicated thread here.

    #32015 (comment) Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics?

    #32015 (comment) I used to appreciate shared_ptr-like constructs, but as @purpleKarrot #32015 (comment), they introduce global variables in disguise.

    How come? The CNode objects have a lifetime equal to the lifetime of the connection with the peer, usually much less than the lifetime of a global variable (which is the lifetime of the entire program). Also those CNode objects are in a private variable inside CConnman. Global variables can be accessed from everywhere. Private class members are the opposite.

    would it be possible to increase the level of abstraction and pass around values with regular semantics?

    • ProcessMessages() and SendMessages() take CNode* and they can / should be changed to take CNode&.
    • FindNode() returns shared_ptr, this can be changed to return bool and the sole user that makes use of the return value can be changed to use ForNode().
    • ConnectNode() is the function that creates the objects and it returns shared_ptr, not sure if it makes sense to change it?
    • GenerateWaitSockets() takes a const vector of shared_ptr. It needs just a list of nodes, could take a vector of CNode, but since we already have a vector of shared_ptr we pass that. Same for SocketHandlerConnected() and both need to run on the same "snapshot" when called one after the other.

    If there is review interest I would pursue those.


    hodlinator commented at 8:07 PM on April 17, 2025:
    1. The intent behind the code on master is to delete the nodes at specific points, otherwise CNode::Release() would do if (--nRefCount == 0) delete this (and have different logic around m_nodes_disconnected). You might be correct in that having been okay; nobody has been able to prove otherwise in this PR so far.

    2. I agree that shared_ptr-usage in this PR not being exposed in the public interface of CConnman is a mitigating factor.

    3. I bet you've checked all angles I would check when it comes to how the shared_ptr would lead to CNode destruction in the current PR.

    4. shared_ptr still opens the door to future changes in the direction of chaos when compared to unique_ptr. Nailing down where ownership/lifetime is managed helps fresh eyes follow the code, improving maintainability. Using shared_ptr can be perceived as a cop-out, saying the object will be destroyed whenever & wherever the last reference is destroyed. shared_ptr in other cases/projects is often an indication of imprecise/prototype-quality code, so as a general rule it makes sense to avoid it if possible.

    5. This is not my first objection to shared_ptr in this project; I experimented with removing it from miniscript code here: #30866#pullrequestreview-2434704657, which led to uncovering an extra issue to fix in that PR. The PR settled on taking that fix and also replacing shared_ptr with unique_ptr for a less invasive change. I have #31713 which replaces those new unique_ptrs with straight-up instances, so I'm by no means wedded to unique_ptr if there's a better option.

    • ProcessMessages() and SendMessages() take CNode* and they can / should be changed to take CNode&.
    • FindNode() returns shared_ptr, this can be changed to return bool and the sole user that makes use of the return value can be changed to use ForNode().
    • ConnectNode() is the function that creates the objects and it returns shared_ptr, not sure if it makes sense to change it?
    • GenerateWaitSockets() takes a const vector of shared_ptr. It needs just a list of nodes, could take a vector of CNode, but since we already have a vector of shared_ptr we pass that. Same for SocketHandlerConnected() and both need to run on the same "snapshot" when called one after the other.

    If there is review interest I would pursue those.

    IMO if we go with shared_ptr we don't need to change those, thanks for investigating though!


    vasild commented at 2:50 PM on April 22, 2025:

    Thanks for the input, I respect and agree with most of it. The only bit I do not agree with is a general rule to avoid shared_ptr. I agree using unique_ptr is better than shared_ptr, especially if unique_ptr is used as intended - to manage the lifetime of the owned object. If we start extracting the raw pointers from unique_ptr, managing them separately and manually ref-counting, then maybe it is better to just use shared_ptr.

    FindNode() returns shared_ptr, this can be changed to return bool and the sole user that makes use of the return value can be changed to use ForNode()

    I think that's worth doing, regardless of this PR. Done in https://github.com/bitcoin/bitcoin/pull/32326


    hodlinator commented at 7:59 AM on April 24, 2025:

    If we start extracting the raw pointers from unique_ptr and managing them separately and manually ref-counting then maybe it is better to just use shared_ptr.

    I think doing ref-counting for snapshots is fine, but agree raw pointers in this codebase are ambiguous. It would be one thing if we started over with strict policies for shared_ptr/unique_ptr/raw pointers. Then one could consistently use raw pointers to always mean a non-owning "view-pointer" (https://github.com/bitcoin/bitcoin/pull/32015#issuecomment-2712060971).

    Maybe that distinction could still be made through introducing something like:

    template <typename T>
    class ViewPtr
    {
        T* ptr;
    public:
        explicit ViewPtr(T* p) : ptr{p} {}
        T* operator->() const { return ptr; }
        T& operator*() const { return *ptr; }
    };
    

    + OwnerPtr<T> in the vein of GSL: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#SS-views

    I think that's worth doing, regardless of this PR. Done in #32326

    Good call on peeling off a separate PR as it's nice but not a requirement here.

  63. theuni commented at 7:17 PM on April 24, 2025: member

    Ok, I'm back after a week of paging all of this back in and hacking around. This is actually something I wanted to get cleaned up a long time ago, so I'm happy it has eyes on it again. Sorry in advance for the novel :)


    I think the discussions about shared_ptr/unique_ptr/refcounting have distracted from what really needs to be discussed here: the lifetime and concurrency requirements/guarantees of the various subsystems.

    The refcounters are (ab)used as a generic means of controlling the teardown order of a CNode. As I mentioned above, historically, this has meant that disconnection/finalization/deletion have always happened on the socket handler thread and always in that order (with the exception of when we're shutting down).

    However, as @vasild has pointed out, this is neither ideal nor necessary.

    So rather than starting by discussing what types of pointers we should be using, we need to start by figuring out what we're trying to solve/improve. And before we can do that, we need to lay out the way things currently work.

    These are the important guarantees at the moment, as I see them:

    FinalizeNode may only be called once per node.

    This one is obvious, but it's necessary to keep in mind that we shouldn't be introducing any possible races.

    FinalizeNode will never be called as a result of setting fDisconnect from ProcessMessages/SendMessages

    This prevents any pointers obtained by calling State(id) or GetPeerRef(id) from being yanked out from under us when we're in those functions.

    I think this is the one that has led to the workarounds and shared_ptr discussion here. It is currently enforced by NodesSnapshot: a class which bumps the refcount for each node before calling ProcessMessages/SendMessages, then decrefs them when finished. This has the effect of keeping them all alive for the duration of a single pass through the loop. @vasild's observation was that this looks a lot like an ownership model which can be handled with shared_ptrs by taking advantage of their inherent refcounting and introducing a deletion point in ThreadMessageHandler.
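    The master-branch mechanism can be sketched roughly as follows (simplified stand-ins for CNode's manual refcount and the snapshot class, not the actual code):

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <cstdio>
    #include <vector>

    // Simplified stand-in for CNode's manual refcount on master.
    struct Node {
        std::atomic<int> nRefCount{0};
        void AddRef() { ++nRefCount; }
        void Release() { --nRefCount; }
    };

    // RAII snapshot: AddRef() every node on construction, Release() on
    // destruction, keeping all of them alive for one handler-loop pass.
    class NodesSnapshot {
        std::vector<Node*> m_nodes_copy;
    public:
        explicit NodesSnapshot(const std::vector<Node*>& nodes) : m_nodes_copy{nodes}
        {
            for (Node* n : m_nodes_copy) n->AddRef();
        }
        ~NodesSnapshot()
        {
            for (Node* n : m_nodes_copy) n->Release();
        }
        const std::vector<Node*>& Nodes() const { return m_nodes_copy; }
    };

    int main()
    {
        Node a, b;
        std::vector<Node*> nodes{&a, &b};
        {
            NodesSnapshot snap{nodes};
            assert(a.nRefCount == 1 && b.nRefCount == 1);
            // ProcessMessages/SendMessages would run here; the socket
            // handler does not delete nodes whose nRefCount > 0.
        }
        assert(a.nRefCount == 0 && b.nRefCount == 0);
        std::printf("refcounts back to zero\n");
        return 0;
    }
    ```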

    I maintain that this is non-ideal and foot-gunny, as making a copy of m_nodes is the mechanism used to enforce the above contract, though it's not at all obvious that it's doing so.

    Finalization happens just before deletion, both on the socket handler thread

    Again, this one isn't strictly necessary, but it's currently the case.

    And again, @vasild's shared_ptr approach handles this by removing that restriction, and calling FinalizeNode as part of CNode's destructor, as that means that its refcount has reached zero.

    He points out that it is safe to call this sometimes from the message handler thread instead, as long as it is only called once. I agree with this and suggest (see below) taking it a step further and only calling it from the message handler thread, as that guarantees that it can't be running during ProcessMessages/SendMessages.

    Disconnection always happens before calling FinalizeNode.

    I'll come back to this one later as it's not strictly necessary, but it's currently the case. By the time FinalizeNode is called, the socket handler thread is completely done with the node (the socket has been shutdown).

    When the socket handler disconnects a peer, it is almost immediately (in the same pass through the loop) removed from m_nodes

    This one is annoying, but the unit tests depend on it. Basically, as soon as we notice that a peer has disconnected, the socket handler thread takes care to ensure that other subsystems won't find it by looking through m_nodes. This includes ForEachNode/ForNode/CopyStats, etc.

    Only a single thread deletes CNodes.

    Duh.

    Any other function passing a CNode (for ex, a caller of ForNode() from RPC) shouldn't have it deleted for its duration

    This is a completely separate issue from the disconnection/finalization semantics, and is a good use-case for shared_ptrs.


    In reality, the refcounting (and by extension NodesSnapshot) has been used as a loose way of enforcing teardown semantics. It's true that we can mimic those semantics by introducing shared_ptrs to manage the nodes instead, but that further obscures what the actual requirements are. As a few people have pointed out on this PR, abusing the facilities of shared_ptr/unique_ptr to replace the manual refcounting is not the solution. Rather, we should enforce the actual semantics in code.

    I've pushed a branch here to illustrate the above: https://github.com/theuni/bitcoin/commits/finalize_threadheandler2/

    It makes the following changes:

    • Expose m_nodes_disconnected to the message handling thread.
    • Add CNode::m_finalized which is set once FinalizeNode has been called. This, combined with the move from m_nodes to m_nodes_disconnected is enough for us to determine when the socket handler/message handler threads are done with a CNode.
    • Call FinalizeNode from the top/bottom of the message handler thread instead, to guarantee that it will never be called from another thread during a ProcessMessages/SendMessages invocation. This also makes much more sense conceptually, as FinalizeNode is not something the socket handler should have anything to do with.

    With the above changes in place, the ordering of the teardown is fully proceduralized, so the refcounting is no longer needed and is thus removed. At that point, the CNode can be safely deleted once all of the following hold (in no particular order):

    • It has been moved to m_nodes_disconnected
    • CNode::m_finalized is true
    • No other threads (RPC via ForNode as an example) are holding it

    In order to guarantee the last one, CNodes are kept in m_nodes and m_nodes_disconnected as shared_ptrs. Since the teardown semantics are enforced elsewhere, deletion is now uninteresting and can happen safely from any thread.

    It also contains a few optimizations that become possible once the teardown procedure is proceduralized:

    • m_nodes_mutex is no longer held for the duration of ForNode/ForEachNode. This should be a nice improvement, especially for nodes with hundreds of connections.
    • FinalizeNode is now allowed to happen before disconnection, as there's no actual need to order those events.

    I realize that I've duplicated a good amount of work here like the shared_ptr commit as well as some of the pre-requisite cleanup commits that are similar to @hodlinator's. It's also incomplete, for example it's missing the fuzzing fixups that @vasild has already done, as well as a bunch of additional comments/docs as I've suggested already. I'd be happy to rebase my branch to include those. I agree with #32326 as well. It's not necessary for my branch, but it's a nice improvement and I did have a similar commit at one point for the exact same reason.

    I feel pretty strongly about addressing the teardown order procedurally rather than mimicking our current refcounting semantics which don't make the requirements/guarantees clear. Ultimately I agree that shared_ptrs are best to enforce the lifetimes, but only after codifying the other rules.

    I think the code in my branch is much easier to reason about. There's no hidden side-effects of node deletion, magic copies of m_nodes aren't used to prevent FinalizeNode being called at the wrong time, and disconnection/finalization/deletion no longer carry hidden dependencies on each other.

  64. theuni commented at 1:56 PM on April 30, 2025: member

    Can we maybe just hold off on this for now? The investigation above prompted me to start working on a much larger refactor.. one that would finish the network refactor (CConnman creation) that I never managed to fully complete almost a decade ago.

    I've been head-down working on it, hope to have a POC to demonstrate in the next week or two.

  65. vasild commented at 3:22 PM on April 30, 2025: contributor

    I am incorporating your feedback here, looks good so far.

  66. vasild force-pushed on May 1, 2025
  67. vasild commented at 10:10 AM on May 1, 2025: contributor

    75acdcbaa9...15cc989538: incorporate feedback from above

    I took your branch https://github.com/theuni/bitcoin/commits/finalize_threadheandler2/ and made some changes:

    No need to ping-pong the tear down of a CNode between the net and msghand threads so much

    In finalize_threadheandler2:

    1. The net thread would disconnect a node and move it from m_nodes to m_nodes_disconnected.
    2. Then the msghand thread would finalize those nodes.
    3. Then the net thread would delete finalized nodes.

    I simplified this to:

    1. same as above; the sole fact that a node is in m_nodes_disconnected now means that it has been disconnected but not yet finalized.
    2. same as above; plus the nodes are also deleted.

    So CNode::m_finalized is not needed because it becomes implicit - when a node is in m_nodes_disconnected it means it is not finalized. Then it is removed from there => finalized and deleted.

    Reduced the optimization about holding m_nodes_mutex for long-lived callbacks

    I did not remove m_nodes_mutex from ForEachNode() because in one of its usages the callback calls m_connman.MultipleManualOrFullOutboundConns() which requires m_nodes_mutex. Dealing with that deserves a separate PR.

    Omitted the optimization to finalize a node at the end of the message handler loop

    Could do that, but to avoid a second finalization it would need CNode::m_finalized just for that. I think it is not worth having an extra member in CNode for this optimization. Can be done in a followup if desired.

    Other minor tweaks and polishes

    Like using std::ranges::foo() instead of std::foo() and simpler CConnman::DisconnectNodes() which uses no temporary variables when moving nodes around.


    This PR deviated from the usual workflow, where there is one author and reviewers who only review and don't modify the code. The current state is that the PR has two authors, @theuni and @vasild, who both reviewed and authored code. Let's see how this pans out...

  68. vasild commented at 10:14 AM on May 1, 2025: contributor

    Can we maybe just hold off on this for now? The investigation above prompted me to start working on a much larger refactor...

    Not sure about that because I have not seen that other refactor. In general I prefer to do stuff in smaller, more manageable steps. I guess a further refactor should be possible on top of this PR; I would be happy to review it. This PR should make a more solid foundation for further changes in the code.

  69. DrahtBot added the label CI failed on May 1, 2025
  70. DrahtBot commented at 11:47 AM on May 1, 2025: contributor


    🚧 At least one of the CI tasks failed. Task tidy: https://github.com/bitcoin/bitcoin/runs/41477750293 LLM reason (✨ experimental): Clang-tidy found a performance warning treated as an error, causing CI failure.

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  71. vasild force-pushed on May 1, 2025
  72. vasild commented at 12:34 PM on May 1, 2025: contributor

    15cc989538...8b93e0894f: fix CI clang-tidy

  73. DrahtBot removed the label CI failed on May 1, 2025
  74. in src/net.h:1330 in 4326f97dc6 outdated
    1326 | @@ -1325,7 +1327,7 @@ class CConnman
    1327 |       * @param[in] nodes Select from these nodes' sockets.
    1328 |       * @return sockets to check for readiness
    1329 |       */
    1330 | -    Sock::EventsPerSock GenerateWaitSockets(std::span<CNode* const> nodes);
    1331 | +    Sock::EventsPerSock GenerateWaitSockets(const std::vector<std::shared_ptr<CNode>>& nodes);
    


    pinheadmz commented at 6:29 PM on May 13, 2025:

    4326f97dc6dccc9e558d99931f8235fc6b6af405

    Why change this from span to vector?


    vasild commented at 11:28 AM on May 14, 2025:

    Because the only caller of GenerateWaitSockets() has a vector, I decided to pass a const reference to a vector. However, that was already the case before this PR, and the change from raw pointer to shared_ptr (the aim of this PR) is unrelated to it. Changed back to std::span.


    vasild commented at 1:16 PM on May 21, 2025:

    It didn't compile: the return value of snap.Nodes(), which is const std::vector<std::shared_ptr<CNode>>&, couldn't be converted to std::span<std::shared_ptr<CNode>>. So I used vector in the first commit and later, when the snapshots are removed, changed it to span.

    #32015 (comment)

  75. in src/net.h:1369 in 4326f97dc6 outdated
    1364 | @@ -1363,11 +1365,9 @@ class CConnman
    1365 |      bool AlreadyConnectedToAddress(const CAddress& addr);
    1366 |  
    1367 |      bool AttemptToEvictConnection();
    1368 | -    CNode* ConnectNode(CAddress addrConnect, const char *pszDest, bool fCountFailure, ConnectionType conn_type, bool use_v2transport) EXCLUSIVE_LOCKS_REQUIRED(!m_unused_i2p_sessions_mutex);
    1369 | +    std::shared_ptr<CNode> ConnectNode(CAddress addrConnect, const char *pszDest, bool fCountFailure, ConnectionType conn_type, bool use_v2transport) EXCLUSIVE_LOCKS_REQUIRED(!m_unused_i2p_sessions_mutex);
    1370 |      void AddWhitelistPermissionFlags(NetPermissionFlags& flags, const CNetAddr &addr, const std::vector<NetWhitelistPermissions>& ranges) const;
    1371 |  
    1372 | -    void DeleteNode(CNode* pnode);
    


    pinheadmz commented at 6:29 PM on May 13, 2025:

    4326f97dc6dccc9e558d99931f8235fc6b6af405

    Commit message says deletion is unchanged.


    vasild commented at 11:31 AM on May 14, 2025:

    Yes, I meant that the location of the deletes is unchanged. Before we had DeleteNode() which did FinalizeNode() + delete and after that commit, DeleteNode() is expanded into the callers that now do FinalizeNode() directly instead of calling DeleteNode(). Also callers do clear() on the vector which is equivalent to the delete that was in DeleteNode().

    Elaborated a little bit in the commit message.

  76. in src/net.cpp:440 in 4326f97dc6 outdated
     434 | @@ -437,9 +435,7 @@ CNode* CConnman::ConnectNode(CAddress addrConnect, const char *pszDest, bool fCo
     435 |                  }
     436 |                  // It is possible that we already have a connection to the IP/port pszDest resolved to.
     437 |                  // In that case, drop the connection that was just created.
     438 | -                LOCK(m_nodes_mutex);
    


    pinheadmz commented at 6:44 PM on May 13, 2025:

    4326f97dc6dccc9e558d99931f8235fc6b6af405

    How did this not deadlock before? FindNode(), called on the next line, locks m_nodes_mutex


    vasild commented at 11:26 AM on May 14, 2025:

    m_nodes_mutex is a recursive mutex, allowing multiple nested locks from a single thread. But see #32326, which changes it to a non-recursive mutex that is easier to reason about.

  77. in src/net.cpp:3049 in 4326f97dc6 outdated
    3046 |                  if (pnode->fDisconnect)
    3047 |                      continue;
    3048 |  
    3049 |                  // Receive messages
    3050 | -                bool fMoreNodeWork = m_msgproc->ProcessMessages(pnode, flagInterruptMsgProc);
    3051 | +                bool fMoreNodeWork = m_msgproc->ProcessMessages(pnode.get(), flagInterruptMsgProc);
    


    pinheadmz commented at 7:03 PM on May 13, 2025:

    4326f97dc6dccc9e558d99931f8235fc6b6af405

    Why did you choose to pass the raw pointer instead of refactoring ProcessMessages() to accept the shared_ptr?


    vasild commented at 11:37 AM on May 14, 2025:

    I did not want to bloat this PR with changes to the interface of ProcessMessages() and SendMessages(). They currently take a raw pointer. Since they do not take ownership, the object is guaranteed to stay alive for the duration of the call, and nullptr is never passed, those functions could take CNode& instead.

    Here is a patch to do that on top of this PR. Could be a followup. I do not want to bloat this PR with this unless reviewers want it.

    <details> <summary>[patch] Change ProcessMessages() and SendMessage() to take CNode&</summary>

    It is a mechanical change

    11 files changed, 93 insertions(+), 93 deletions(-)
    
    diff --git i/src/net.cpp w/src/net.cpp
    index 81ee1eec16..4189cee0df 100644
    --- i/src/net.cpp
    +++ w/src/net.cpp
    @@ -3051,18 +3051,18 @@ void CConnman::ThreadMessageHandler()
     
                 for (auto& pnode : nodes) {
                     if (pnode->fDisconnect)
                         continue;
     
                     // Receive messages
    -                bool fMoreNodeWork = m_msgproc->ProcessMessages(pnode.get(), flagInterruptMsgProc);
    +                bool fMoreNodeWork = m_msgproc->ProcessMessages(*pnode, flagInterruptMsgProc);
                     fMoreWork |= (fMoreNodeWork && !pnode->fPauseSend);
                     if (flagInterruptMsgProc)
                         return;
                     // Send messages
    -                m_msgproc->SendMessages(pnode.get());
    +                m_msgproc->SendMessages(*pnode);
     
                     if (flagInterruptMsgProc)
                         return;
                 }
             }
     
    diff --git i/src/net.h w/src/net.h
    index 051ff50f3b..1e93dfcaa5 100644
    --- i/src/net.h
    +++ w/src/net.h
    @@ -1002,27 +1002,27 @@ public:
          * Callback to determine whether the given set of service flags are sufficient
          * for a peer to be "relevant".
          */
         virtual bool HasAllDesirableServiceFlags(ServiceFlags services) const = 0;
     
         /**
    -    * Process protocol messages received from a given node
    -    *
    -    * @param[in]   pnode           The node which we have received messages from.
    -    * @param[in]   interrupt       Interrupt condition for processing threads
    -    * @return                      True if there is more work to be done
    -    */
    -    virtual bool ProcessMessages(CNode* pnode, std::atomic<bool>& interrupt) EXCLUSIVE_LOCKS_REQUIRED(g_msgproc_mutex) = 0;
    +     * Process protocol messages received from a given node
    +     *
    +     * @param[in,out] from The node which we have received messages from.
    +     * @param[in] interrupt Interrupt condition for processing threads.
    +     * @return True if there is more work to be done.
    +     */
    +    virtual bool ProcessMessages(CNode& from, std::atomic<bool>& interrupt) EXCLUSIVE_LOCKS_REQUIRED(g_msgproc_mutex) = 0;
     
         /**
    -    * Send queued protocol messages to a given node.
    -    *
    -    * @param[in]   pnode           The node which we are sending messages to.
    -    * @return                      True if there is more work to be done
    -    */
    -    virtual bool SendMessages(CNode* pnode) EXCLUSIVE_LOCKS_REQUIRED(g_msgproc_mutex) = 0;
    +     * Send queued protocol messages to a given node.
    +     *
    +     * @param[in,out] to The node which we are sending messages to.
    +     * @return True if there is more work to be done.
    +     */
    +    virtual bool SendMessages(CNode& to) EXCLUSIVE_LOCKS_REQUIRED(g_msgproc_mutex) = 0;
     
     
     protected:
         /**
          * Protected destructor so that instances can only be deleted by derived classes.
          * If that restriction is no longer desired, this should be made public and virtual.
    diff --git i/src/net_processing.cpp w/src/net_processing.cpp
    index 78e97b6684..7153226635 100644
    --- i/src/net_processing.cpp
    +++ w/src/net_processing.cpp
    @@ -519,15 +519,15 @@ public:
             EXCLUSIVE_LOCKS_REQUIRED(!m_most_recent_block_mutex);
     
         /** Implement NetEventsInterface */
         void InitializeNode(const CNode& node, ServiceFlags our_services) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_tx_download_mutex);
         void FinalizeNode(const CNode& node) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_headers_presync_mutex, !m_tx_download_mutex);
         bool HasAllDesirableServiceFlags(ServiceFlags services) const override;
    -    bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override
    +    bool ProcessMessages(CNode& from, std::atomic<bool>& interrupt) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, !m_headers_presync_mutex, g_msgproc_mutex, !m_tx_download_mutex);
    -    bool SendMessages(CNode* pto) override
    +    bool SendMessages(CNode& to) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, g_msgproc_mutex, !m_tx_download_mutex);
     
         /** Implement PeerManager */
         void StartScheduledTasks(CScheduler& scheduler) override;
         void CheckForStaleTipAndEvictPeers() override;
         std::optional<std::string> FetchBlock(NodeId peer_id, const CBlockIndex& block_index) override
    @@ -4989,72 +4989,72 @@ bool PeerManagerImpl::MaybeDiscourageAndDisconnect(CNode& pnode, Peer& peer)
         LogDebug(BCLog::NET, "Disconnecting and discouraging peer %d!\n", peer.m_id);
         if (m_banman) m_banman->Discourage(pnode.addr);
         m_connman.DisconnectNode(pnode.addr);
         return true;
     }
     
    -bool PeerManagerImpl::ProcessMessages(CNode* pfrom, std::atomic<bool>& interruptMsgProc)
    +bool PeerManagerImpl::ProcessMessages(CNode& from, std::atomic<bool>& interruptMsgProc)
     {
         AssertLockNotHeld(m_tx_download_mutex);
         AssertLockHeld(g_msgproc_mutex);
     
    -    PeerRef peer = GetPeerRef(pfrom->GetId());
    +    PeerRef peer = GetPeerRef(from.GetId());
         if (peer == nullptr) return false;
     
         // For outbound connections, ensure that the initial VERSION message
         // has been sent first before processing any incoming messages
    -    if (!pfrom->IsInboundConn() && !peer->m_outbound_version_message_sent) return false;
    +    if (!from.IsInboundConn() && !peer->m_outbound_version_message_sent) return false;
     
         {
             LOCK(peer->m_getdata_requests_mutex);
             if (!peer->m_getdata_requests.empty()) {
    -            ProcessGetData(*pfrom, *peer, interruptMsgProc);
    +            ProcessGetData(from, *peer, interruptMsgProc);
             }
         }
     
         const bool processed_orphan = ProcessOrphanTx(*peer);
     
    -    if (pfrom->fDisconnect)
    +    if (from.fDisconnect)
             return false;
     
         if (processed_orphan) return true;
     
         // this maintains the order of responses
         // and prevents m_getdata_requests to grow unbounded
         {
             LOCK(peer->m_getdata_requests_mutex);
             if (!peer->m_getdata_requests.empty()) return true;
         }
     
         // Don't bother if send buffer is too full to respond anyway
    -    if (pfrom->fPauseSend) return false;
    +    if (from.fPauseSend) return false;
     
    -    auto poll_result{pfrom->PollMessage()};
    +    auto poll_result{from.PollMessage()};
         if (!poll_result) {
             // No message to process
             return false;
         }
     
         CNetMessage& msg{poll_result->first};
         bool fMoreWork = poll_result->second;
     
         TRACEPOINT(net, inbound_message,
    -        pfrom->GetId(),
    -        pfrom->m_addr_name.c_str(),
    -        pfrom->ConnectionTypeAsString().c_str(),
    +        from.GetId(),
    +        from.m_addr_name.c_str(),
    +        from.ConnectionTypeAsString().c_str(),
             msg.m_type.c_str(),
             msg.m_recv.size(),
             msg.m_recv.data()
         );
     
         if (m_opts.capture_messages) {
    -        CaptureMessage(pfrom->addr, msg.m_type, MakeUCharSpan(msg.m_recv), /*is_incoming=*/true);
    +        CaptureMessage(from.addr, msg.m_type, MakeUCharSpan(msg.m_recv), /*is_incoming=*/true);
         }
     
         try {
    -        ProcessMessage(*pfrom, msg.m_type, msg.m_recv, msg.m_time, interruptMsgProc);
    +        ProcessMessage(from, msg.m_type, msg.m_recv, msg.m_time, interruptMsgProc);
             if (interruptMsgProc) return false;
             {
                 LOCK(peer->m_getdata_requests_mutex);
                 if (!peer->m_getdata_requests.empty()) fMoreWork = true;
             }
             // Does this peer has an orphan ready to reconsider?
    @@ -5481,69 +5481,69 @@ bool PeerManagerImpl::SetupAddressRelay(const CNode& node, Peer& peer)
             peer.m_addr_known = std::make_unique<CRollingBloomFilter>(5000, 0.001);
         }
     
         return true;
     }
     
    -bool PeerManagerImpl::SendMessages(CNode* pto)
    +bool PeerManagerImpl::SendMessages(CNode& to)
     {
         AssertLockNotHeld(m_tx_download_mutex);
         AssertLockHeld(g_msgproc_mutex);
     
    -    PeerRef peer = GetPeerRef(pto->GetId());
    +    PeerRef peer = GetPeerRef(to.GetId());
         if (!peer) return false;
         const Consensus::Params& consensusParams = m_chainparams.GetConsensus();
     
         // We must call MaybeDiscourageAndDisconnect first, to ensure that we'll
         // disconnect misbehaving peers even before the version handshake is complete.
    -    if (MaybeDiscourageAndDisconnect(*pto, *peer)) return true;
    +    if (MaybeDiscourageAndDisconnect(to, *peer)) return true;
     
         // Initiate version handshake for outbound connections
    -    if (!pto->IsInboundConn() && !peer->m_outbound_version_message_sent) {
    -        PushNodeVersion(*pto, *peer);
    +    if (!to.IsInboundConn() && !peer->m_outbound_version_message_sent) {
    +        PushNodeVersion(to, *peer);
             peer->m_outbound_version_message_sent = true;
         }
     
         // Don't send anything until the version handshake is complete
    -    if (!pto->fSuccessfullyConnected || pto->fDisconnect)
    +    if (!to.fSuccessfullyConnected || to.fDisconnect)
             return true;
     
         const auto current_time{GetTime<std::chrono::microseconds>()};
     
    -    if (pto->IsAddrFetchConn() && current_time - pto->m_connected > 10 * AVG_ADDRESS_BROADCAST_INTERVAL) {
    -        LogDebug(BCLog::NET, "addrfetch connection timeout, %s\n", pto->DisconnectMsg(fLogIPs));
    -        pto->fDisconnect = true;
    +    if (to.IsAddrFetchConn() && current_time - to.m_connected > 10 * AVG_ADDRESS_BROADCAST_INTERVAL) {
    +        LogDebug(BCLog::NET, "addrfetch connection timeout, %s\n", to.DisconnectMsg(fLogIPs));
    +        to.fDisconnect = true;
             return true;
         }
     
    -    MaybeSendPing(*pto, *peer, current_time);
    +    MaybeSendPing(to, *peer, current_time);
     
         // MaybeSendPing may have marked peer for disconnection
    -    if (pto->fDisconnect) return true;
    +    if (to.fDisconnect) return true;
     
    -    MaybeSendAddr(*pto, *peer, current_time);
    +    MaybeSendAddr(to, *peer, current_time);
     
    -    MaybeSendSendHeaders(*pto, *peer);
    +    MaybeSendSendHeaders(to, *peer);
     
         {
             LOCK(cs_main);
     
    -        CNodeState &state = *State(pto->GetId());
    +        CNodeState &state = *State(to.GetId());
     
             // Start block sync
             if (m_chainman.m_best_header == nullptr) {
                 m_chainman.m_best_header = m_chainman.ActiveChain().Tip();
             }
     
             // Determine whether we might try initial headers sync or parallel
             // block download from this peer -- this mostly affects behavior while
             // in IBD (once out of IBD, we sync from all peers).
             bool sync_blocks_and_headers_from_peer = false;
             if (state.fPreferredDownload) {
                 sync_blocks_and_headers_from_peer = true;
    -        } else if (CanServeBlocks(*peer) && !pto->IsAddrFetchConn()) {
    +        } else if (CanServeBlocks(*peer) && !to.IsAddrFetchConn()) {
                 // Typically this is an inbound peer. If we don't have any outbound
                 // peers, or if we aren't downloading any blocks from such peers,
                 // then allow block downloads from this peer, too.
                 // We prefer downloading blocks from outbound peers to avoid
                 // putting undue load on (say) some home user who is just making
                 // outbound connections to the network, but if our only source of
    @@ -5565,14 +5565,14 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                        is up-to-date.  With a non-empty response, we can initialise
                        the peer's known best block.  This wouldn't be possible
                        if we requested starting at m_chainman.m_best_header and
                        got back an empty response.  */
                     if (pindexStart->pprev)
                         pindexStart = pindexStart->pprev;
    -                if (MaybeSendGetHeaders(*pto, GetLocator(pindexStart), *peer)) {
    -                    LogDebug(BCLog::NET, "initial getheaders (%d) to peer=%d (startheight:%d)\n", pindexStart->nHeight, pto->GetId(), peer->m_starting_height);
    +                if (MaybeSendGetHeaders(to, GetLocator(pindexStart), *peer)) {
    +                    LogDebug(BCLog::NET, "initial getheaders (%d) to peer=%d (startheight:%d)\n", pindexStart->nHeight, to.GetId(), peer->m_starting_height);
     
                         state.fSyncStarted = true;
                         peer->m_headers_sync_timeout = current_time + HEADERS_DOWNLOAD_TIMEOUT_BASE +
                             (
                              // Convert HEADERS_DOWNLOAD_TIMEOUT_PER_HEADER to microseconds before scaling
                              // to maintain precision
    @@ -5598,13 +5598,13 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                 LOCK(peer->m_block_inv_mutex);
                 std::vector<CBlock> vHeaders;
                 bool fRevertToInv = ((!peer->m_prefers_headers &&
                                      (!state.m_requested_hb_cmpctblocks || peer->m_blocks_for_headers_relay.size() > 1)) ||
                                      peer->m_blocks_for_headers_relay.size() > MAX_BLOCKS_TO_ANNOUNCE);
                 const CBlockIndex *pBestIndex = nullptr; // last header queued for delivery
    -            ProcessBlockAvailability(pto->GetId()); // ensure pindexBestKnownBlock is up-to-date
    +            ProcessBlockAvailability(to.GetId()); // ensure pindexBestKnownBlock is up-to-date
     
                 if (!fRevertToInv) {
                     bool fFoundStartingHeader = false;
                     // Try to find first header that our peer doesn't have, and
                     // then send all headers past that one.  If we come across any
                     // headers that aren't on m_chainman.ActiveChain(), give up.
    @@ -5652,42 +5652,42 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                 }
                 if (!fRevertToInv && !vHeaders.empty()) {
                     if (vHeaders.size() == 1 && state.m_requested_hb_cmpctblocks) {
                         // We only send up to 1 block as header-and-ids, as otherwise
                         // probably means we're doing an initial-ish-sync or they're slow
                         LogDebug(BCLog::NET, "%s sending header-and-ids %s to peer=%d\n", __func__,
    -                            vHeaders.front().GetHash().ToString(), pto->GetId());
    +                            vHeaders.front().GetHash().ToString(), to.GetId());
     
                         std::optional<CSerializedNetMsg> cached_cmpctblock_msg;
                         {
                             LOCK(m_most_recent_block_mutex);
                             if (m_most_recent_block_hash == pBestIndex->GetBlockHash()) {
                                 cached_cmpctblock_msg = NetMsg::Make(NetMsgType::CMPCTBLOCK, *m_most_recent_compact_block);
                             }
                         }
                         if (cached_cmpctblock_msg.has_value()) {
    -                        PushMessage(*pto, std::move(cached_cmpctblock_msg.value()));
    +                        PushMessage(to, std::move(cached_cmpctblock_msg.value()));
                         } else {
                             CBlock block;
                             const bool ret{m_chainman.m_blockman.ReadBlock(block, *pBestIndex)};
                             assert(ret);
                             CBlockHeaderAndShortTxIDs cmpctblock{block, m_rng.rand64()};
    -                        MakeAndPushMessage(*pto, NetMsgType::CMPCTBLOCK, cmpctblock);
    +                        MakeAndPushMessage(to, NetMsgType::CMPCTBLOCK, cmpctblock);
                         }
                         state.pindexBestHeaderSent = pBestIndex;
                     } else if (peer->m_prefers_headers) {
                         if (vHeaders.size() > 1) {
                             LogDebug(BCLog::NET, "%s: %u headers, range (%s, %s), to peer=%d\n", __func__,
                                     vHeaders.size(),
                                     vHeaders.front().GetHash().ToString(),
    -                                vHeaders.back().GetHash().ToString(), pto->GetId());
    +                                vHeaders.back().GetHash().ToString(), to.GetId());
                         } else {
                             LogDebug(BCLog::NET, "%s: sending header %s to peer=%d\n", __func__,
    -                                vHeaders.front().GetHash().ToString(), pto->GetId());
    +                                vHeaders.front().GetHash().ToString(), to.GetId());
                         }
    -                    MakeAndPushMessage(*pto, NetMsgType::HEADERS, TX_WITH_WITNESS(vHeaders));
    +                    MakeAndPushMessage(to, NetMsgType::HEADERS, TX_WITH_WITNESS(vHeaders));
                         state.pindexBestHeaderSent = pBestIndex;
                     } else
                         fRevertToInv = true;
                 }
                 if (fRevertToInv) {
                     // If falling back to using an inv, just try to inv the tip.
    @@ -5707,13 +5707,13 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                         }
     
                         // If the peer's chain has this block, don't inv it back.
                         if (!PeerHasHeader(&state, pindex)) {
                             peer->m_blocks_for_inv_relay.push_back(hashToAnnounce);
                             LogDebug(BCLog::NET, "%s: sending inv peer=%d hash=%s\n", __func__,
    -                            pto->GetId(), hashToAnnounce.ToString());
    +                            to.GetId(), hashToAnnounce.ToString());
                         }
                     }
                 }
                 peer->m_blocks_for_headers_relay.clear();
             }
     
    @@ -5726,26 +5726,26 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                 vInv.reserve(std::max<size_t>(peer->m_blocks_for_inv_relay.size(), INVENTORY_BROADCAST_TARGET));
     
                 // Add blocks
                 for (const uint256& hash : peer->m_blocks_for_inv_relay) {
                     vInv.emplace_back(MSG_BLOCK, hash);
                     if (vInv.size() == MAX_INV_SZ) {
    -                    MakeAndPushMessage(*pto, NetMsgType::INV, vInv);
    +                    MakeAndPushMessage(to, NetMsgType::INV, vInv);
                         vInv.clear();
                     }
                 }
                 peer->m_blocks_for_inv_relay.clear();
             }
     
             if (auto tx_relay = peer->GetTxRelay(); tx_relay != nullptr) {
                     LOCK(tx_relay->m_tx_inventory_mutex);
                     // Check whether periodic sends should happen
    -                bool fSendTrickle = pto->HasPermission(NetPermissionFlags::NoBan);
    +                bool fSendTrickle = to.HasPermission(NetPermissionFlags::NoBan);
                     if (tx_relay->m_next_inv_send_time < current_time) {
                         fSendTrickle = true;
    -                    if (pto->IsInboundConn()) {
    +                    if (to.IsInboundConn()) {
                             tx_relay->m_next_inv_send_time = NextInvToInbounds(current_time, INBOUND_INVENTORY_BROADCAST_INTERVAL);
                         } else {
                             tx_relay->m_next_inv_send_time = current_time + m_rng.rand_exp_duration(OUTBOUND_INVENTORY_BROADCAST_INTERVAL);
                         }
                     }
     
    @@ -5779,13 +5779,13 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                             if (tx_relay->m_bloom_filter) {
                                 if (!tx_relay->m_bloom_filter->IsRelevantAndUpdate(*txinfo.tx)) continue;
                             }
                             tx_relay->m_tx_inventory_known_filter.insert(inv.hash);
                             vInv.push_back(inv);
                             if (vInv.size() == MAX_INV_SZ) {
    -                            MakeAndPushMessage(*pto, NetMsgType::INV, vInv);
    +                            MakeAndPushMessage(to, NetMsgType::INV, vInv);
                                 vInv.clear();
                             }
                         }
                     }
     
                     // Determine transactions to relay
    @@ -5831,34 +5831,34 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                             }
                             if (tx_relay->m_bloom_filter && !tx_relay->m_bloom_filter->IsRelevantAndUpdate(*txinfo.tx)) continue;
                             // Send
                             vInv.push_back(inv);
                             nRelayedTransactions++;
                             if (vInv.size() == MAX_INV_SZ) {
    -                            MakeAndPushMessage(*pto, NetMsgType::INV, vInv);
    +                            MakeAndPushMessage(to, NetMsgType::INV, vInv);
                                 vInv.clear();
                             }
                             tx_relay->m_tx_inventory_known_filter.insert(hash);
                         }
     
                         // Ensure we'll respond to GETDATA requests for anything we've just announced
                         LOCK(m_mempool.cs);
                         tx_relay->m_last_inv_sequence = m_mempool.GetSequence();
                     }
             }
             if (!vInv.empty())
    -            MakeAndPushMessage(*pto, NetMsgType::INV, vInv);
    +            MakeAndPushMessage(to, NetMsgType::INV, vInv);
     
             // Detect whether we're stalling
             auto stalling_timeout = m_block_stalling_timeout.load();
             if (state.m_stalling_since.count() && state.m_stalling_since < current_time - stalling_timeout) {
                 // Stalling only triggers when the block download window cannot move. During normal steady state,
                 // the download window should be much larger than the to-be-downloaded set of blocks, so disconnection
                 // should only happen during initial block download.
    -            LogInfo("Peer is stalling block download, %s\n", pto->DisconnectMsg(fLogIPs));
    -            pto->fDisconnect = true;
    +            LogInfo("Peer is stalling block download, %s\n", to.DisconnectMsg(fLogIPs));
    +            to.fDisconnect = true;
                 // Increase timeout for the next peer so that we don't disconnect multiple peers if our own
                 // bandwidth is insufficient.
                 const auto new_timeout = std::min(2 * stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
                 if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
                     LogDebug(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", count_seconds(new_timeout));
                 }
    @@ -5870,14 +5870,14 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
             // being saturated. We only count validated in-flight blocks so peers can't advertise non-existing block hashes
             // to unreasonably increase our timeout.
             if (state.vBlocksInFlight.size() > 0) {
                 QueuedBlock &queuedBlock = state.vBlocksInFlight.front();
                 int nOtherPeersWithValidatedDownloads = m_peers_downloading_from - 1;
                 if (current_time > state.m_downloading_since + std::chrono::seconds{consensusParams.nPowTargetSpacing} * (BLOCK_DOWNLOAD_TIMEOUT_BASE + BLOCK_DOWNLOAD_TIMEOUT_PER_PEER * nOtherPeersWithValidatedDownloads)) {
    -                LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
    -                pto->fDisconnect = true;
    +                LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), to.DisconnectMsg(fLogIPs));
    +                to.fDisconnect = true;
                     return true;
                 }
             }
             // Check for headers sync timeouts
             if (state.fSyncStarted && peer->m_headers_sync_timeout < std::chrono::microseconds::max()) {
                 // Detect whether this is a stalling initial-headers-sync peer
    @@ -5885,18 +5885,18 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                     if (current_time > peer->m_headers_sync_timeout && nSyncStarted == 1 && (m_num_preferred_download_peers - state.fPreferredDownload >= 1)) {
                         // Disconnect a peer (without NetPermissionFlags::NoBan permission) if it is our only sync peer,
                         // and we have others we could be using instead.
                         // Note: If all our peers are inbound, then we won't
                         // disconnect our sync peer for stalling; we have bigger
                         // problems if we can't get any outbound peers.
    -                    if (!pto->HasPermission(NetPermissionFlags::NoBan)) {
    -                        LogInfo("Timeout downloading headers, %s\n", pto->DisconnectMsg(fLogIPs));
    -                        pto->fDisconnect = true;
    +                    if (!to.HasPermission(NetPermissionFlags::NoBan)) {
    +                        LogInfo("Timeout downloading headers, %s\n", to.DisconnectMsg(fLogIPs));
    +                        to.fDisconnect = true;
                             return true;
                         } else {
    -                        LogInfo("Timeout downloading headers from noban peer, not %s\n", pto->DisconnectMsg(fLogIPs));
    +                        LogInfo("Timeout downloading headers from noban peer, not %s\n", to.DisconnectMsg(fLogIPs));
                             // Reset the headers sync state so that we have a
                             // chance to try downloading from a different peer.
                             // Note: this will also result in at least one more
                             // getheaders message to be sent to
                             // this peer (eventually).
                             state.fSyncStarted = false;
    @@ -5910,13 +5910,13 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                     peer->m_headers_sync_timeout = std::chrono::microseconds::max();
                 }
             }
     
             // Check that outbound peers have reasonable chains
             // GetTime() is used by this anti-DoS logic so we can test this using mocktime
    -        ConsiderEviction(*pto, *peer, GetTime<std::chrono::seconds>());
    +        ConsiderEviction(to, *peer, GetTime<std::chrono::seconds>());
     
             //
             // Message: getdata (blocks)
             //
             std::vector<CInv> vGetData;
             if (CanServeBlocks(*peer) && ((sync_blocks_and_headers_from_peer && !IsLimitedPeer(*peer)) || !m_chainman.IsInitialBlockDownload()) && state.vBlocksInFlight.size() < MAX_BLOCKS_IN_TRANSIT_PER_PEER) {
    @@ -5939,15 +5939,15 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                         vToDownload, from_tip,
                         Assert(m_chainman.GetSnapshotBaseBlock()));
                 }
                 for (const CBlockIndex *pindex : vToDownload) {
                     uint32_t nFetchFlags = GetFetchFlags(*peer);
                     vGetData.emplace_back(MSG_BLOCK | nFetchFlags, pindex->GetBlockHash());
    -                BlockRequested(pto->GetId(), *pindex);
    +                BlockRequested(to.GetId(), *pindex);
                     LogDebug(BCLog::NET, "Requesting block %s (%d) peer=%d\n", pindex->GetBlockHash().ToString(),
    -                    pindex->nHeight, pto->GetId());
    +                    pindex->nHeight, to.GetId());
                 }
                 if (state.vBlocksInFlight.empty() && staller != -1) {
                     if (State(staller)->m_stalling_since == 0us) {
                         State(staller)->m_stalling_since = current_time;
                         LogDebug(BCLog::NET, "Stall started peer=%d\n", staller);
                     }
    @@ -5956,21 +5956,21 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
     
             //
             // Message: getdata (transactions)
             //
             {
                 LOCK(m_tx_download_mutex);
    -            for (const GenTxid& gtxid : m_txdownloadman.GetRequestsToSend(pto->GetId(), current_time)) {
    +            for (const GenTxid& gtxid : m_txdownloadman.GetRequestsToSend(to.GetId(), current_time)) {
                     vGetData.emplace_back(gtxid.IsWtxid() ? MSG_WTX : (MSG_TX | GetFetchFlags(*peer)), gtxid.GetHash());
                     if (vGetData.size() >= MAX_GETDATA_SZ) {
    -                    MakeAndPushMessage(*pto, NetMsgType::GETDATA, vGetData);
    +                    MakeAndPushMessage(to, NetMsgType::GETDATA, vGetData);
                         vGetData.clear();
                     }
                 }
             }
     
             if (!vGetData.empty())
    -            MakeAndPushMessage(*pto, NetMsgType::GETDATA, vGetData);
    +            MakeAndPushMessage(to, NetMsgType::GETDATA, vGetData);
         } // release cs_main
    -    MaybeSendFeefilter(*pto, *peer, current_time);
    +    MaybeSendFeefilter(to, *peer, current_time);
         return true;
     }
    diff --git i/src/test/denialofservice_tests.cpp w/src/test/denialofservice_tests.cpp
    index 9ee7e9c9fe..1d658ac9e0 100644
    --- i/src/test/denialofservice_tests.cpp
    +++ w/src/test/denialofservice_tests.cpp
    @@ -77,33 +77,33 @@ BOOST_AUTO_TEST_CASE(outbound_slow_chain_eviction)
             LOCK(cs_main);
             BOOST_CHECK(m_node.chainman->ActiveChain().Tip() != nullptr);
             BOOST_CHECK(m_node.chainman->ActiveChain().Tip()->nChainWork > 0);
         }
     
         // Test starts here
    -    BOOST_CHECK(peerman.SendMessages(&dummyNode1)); // should result in getheaders
    +    BOOST_CHECK(peerman.SendMessages(dummyNode1)); // should result in getheaders
     
         {
             LOCK(dummyNode1.cs_vSend);
             const auto& [to_send, _more, _msg_type] = dummyNode1.m_transport->GetBytesToSend(false);
             BOOST_CHECK(!to_send.empty());
         }
         connman.FlushSendBuffer(dummyNode1);
     
         int64_t nStartTime = GetTime();
         // Wait 21 minutes
         SetMockTime(nStartTime+21*60);
    -    BOOST_CHECK(peerman.SendMessages(&dummyNode1)); // should result in getheaders
    +    BOOST_CHECK(peerman.SendMessages(dummyNode1)); // should result in getheaders
         {
             LOCK(dummyNode1.cs_vSend);
             const auto& [to_send, _more, _msg_type] = dummyNode1.m_transport->GetBytesToSend(false);
             BOOST_CHECK(!to_send.empty());
         }
         // Wait 3 more minutes
         SetMockTime(nStartTime+24*60);
    -    BOOST_CHECK(peerman.SendMessages(&dummyNode1)); // should result in disconnect
    +    BOOST_CHECK(peerman.SendMessages(dummyNode1)); // should result in disconnect
         BOOST_CHECK(dummyNode1.fDisconnect == true);
     
         peerman.FinalizeNode(dummyNode1);
     }
     
     struct OutboundTest : TestingSetup {
    @@ -330,13 +330,13 @@ BOOST_AUTO_TEST_CASE(peer_discouragement)
                              /*inbound_onion=*/false};
         nodes[0]->SetCommonVersion(PROTOCOL_VERSION);
         peerLogic->InitializeNode(*nodes[0], NODE_NETWORK);
         nodes[0]->fSuccessfullyConnected = true;
         connman->AddTestNode(*nodes[0]);
         peerLogic->UnitTestMisbehaving(nodes[0]->GetId()); // Should be discouraged
    -    BOOST_CHECK(peerLogic->SendMessages(nodes[0]));
    +    BOOST_CHECK(peerLogic->SendMessages(*nodes[0]));
     
         BOOST_CHECK(banman->IsDiscouraged(addr[0]));
         BOOST_CHECK(nodes[0]->fDisconnect);
         BOOST_CHECK(!banman->IsDiscouraged(other_addr)); // Different address, not discouraged
     
         nodes[1] = new CNode{id++,
    @@ -349,21 +349,21 @@ BOOST_AUTO_TEST_CASE(peer_discouragement)
                              ConnectionType::INBOUND,
                              /*inbound_onion=*/false};
         nodes[1]->SetCommonVersion(PROTOCOL_VERSION);
         peerLogic->InitializeNode(*nodes[1], NODE_NETWORK);
         nodes[1]->fSuccessfullyConnected = true;
         connman->AddTestNode(*nodes[1]);
    -    BOOST_CHECK(peerLogic->SendMessages(nodes[1]));
    +    BOOST_CHECK(peerLogic->SendMessages(*nodes[1]));
         // [0] is still discouraged/disconnected.
         BOOST_CHECK(banman->IsDiscouraged(addr[0]));
         BOOST_CHECK(nodes[0]->fDisconnect);
         // [1] is not discouraged/disconnected yet.
         BOOST_CHECK(!banman->IsDiscouraged(addr[1]));
         BOOST_CHECK(!nodes[1]->fDisconnect);
         peerLogic->UnitTestMisbehaving(nodes[1]->GetId());
    -    BOOST_CHECK(peerLogic->SendMessages(nodes[1]));
    +    BOOST_CHECK(peerLogic->SendMessages(*nodes[1]));
         // Expect both [0] and [1] to be discouraged/disconnected now.
         BOOST_CHECK(banman->IsDiscouraged(addr[0]));
         BOOST_CHECK(nodes[0]->fDisconnect);
         BOOST_CHECK(banman->IsDiscouraged(addr[1]));
         BOOST_CHECK(nodes[1]->fDisconnect);
     
    @@ -380,13 +380,13 @@ BOOST_AUTO_TEST_CASE(peer_discouragement)
                              /*inbound_onion=*/false};
         nodes[2]->SetCommonVersion(PROTOCOL_VERSION);
         peerLogic->InitializeNode(*nodes[2], NODE_NETWORK);
         nodes[2]->fSuccessfullyConnected = true;
         connman->AddTestNode(*nodes[2]);
         peerLogic->UnitTestMisbehaving(nodes[2]->GetId());
    -    BOOST_CHECK(peerLogic->SendMessages(nodes[2]));
    +    BOOST_CHECK(peerLogic->SendMessages(*nodes[2]));
         BOOST_CHECK(banman->IsDiscouraged(addr[0]));
         BOOST_CHECK(banman->IsDiscouraged(addr[1]));
         BOOST_CHECK(banman->IsDiscouraged(addr[2]));
         BOOST_CHECK(nodes[0]->fDisconnect);
         BOOST_CHECK(nodes[1]->fDisconnect);
         BOOST_CHECK(nodes[2]->fDisconnect);
    @@ -422,13 +422,13 @@ BOOST_AUTO_TEST_CASE(DoS_bantime)
                         /*inbound_onion=*/false};
         dummyNode.SetCommonVersion(PROTOCOL_VERSION);
         peerLogic->InitializeNode(dummyNode, NODE_NETWORK);
         dummyNode.fSuccessfullyConnected = true;
     
         peerLogic->UnitTestMisbehaving(dummyNode.GetId());
    -    BOOST_CHECK(peerLogic->SendMessages(&dummyNode));
    +    BOOST_CHECK(peerLogic->SendMessages(dummyNode));
         BOOST_CHECK(banman->IsDiscouraged(addr));
     
         peerLogic->FinalizeNode(dummyNode);
     }
     
     BOOST_AUTO_TEST_SUITE_END()
    diff --git i/src/test/fuzz/p2p_handshake.cpp w/src/test/fuzz/p2p_handshake.cpp
    index d608efd87a..183896bf3c 100644
    --- i/src/test/fuzz/p2p_handshake.cpp
    +++ w/src/test/fuzz/p2p_handshake.cpp
    @@ -97,12 +97,12 @@ FUZZ_TARGET(p2p_handshake, .init = ::initialize)
                 connection.fPauseSend = false;
     
                 try {
                     more_work = connman.ProcessMessagesOnce(connection);
                 } catch (const std::ios_base::failure&) {
                 }
    -            peerman->SendMessages(&connection);
    +            peerman->SendMessages(connection);
             }
         }
     
         g_setup->m_node.connman->StopNodes();
     }
    diff --git i/src/test/fuzz/p2p_headers_presync.cpp w/src/test/fuzz/p2p_headers_presync.cpp
    index b31b74ee4f..88072c7570 100644
    --- i/src/test/fuzz/p2p_headers_presync.cpp
    +++ w/src/test/fuzz/p2p_headers_presync.cpp
    @@ -94,13 +94,13 @@ void HeadersSyncSetup::SendMessage(FuzzedDataProvider& fuzzed_data_provider, CSe
         (void)connman.ReceiveMsgFrom(connection, std::move(msg));
         connection.fPauseSend = false;
         try {
             connman.ProcessMessagesOnce(connection);
         } catch (const std::ios_base::failure&) {
         }
    -    m_node.peerman->SendMessages(&connection);
    +    m_node.peerman->SendMessages(connection);
     }
     
     CBlockHeader ConsumeHeader(FuzzedDataProvider& fuzzed_data_provider, const uint256& prev_hash, uint32_t prev_nbits)
     {
         CBlockHeader header;
         header.nNonce = 0;
    diff --git i/src/test/fuzz/process_message.cpp w/src/test/fuzz/process_message.cpp
    index 4bd38a1ac6..2a304cbd4b 100644
    --- i/src/test/fuzz/process_message.cpp
    +++ w/src/test/fuzz/process_message.cpp
    @@ -85,11 +85,11 @@ FUZZ_TARGET(process_message, .init = initialize_process_message)
         while (more_work) {
             p2p_node.fPauseSend = false;
             try {
                 more_work = connman.ProcessMessagesOnce(p2p_node);
             } catch (const std::ios_base::failure&) {
             }
    -        g_setup->m_node.peerman->SendMessages(&p2p_node);
    +        g_setup->m_node.peerman->SendMessages(p2p_node);
         }
         g_setup->m_node.validation_signals->SyncWithValidationInterfaceQueue();
         g_setup->m_node.connman->StopNodes();
     }
    diff --git i/src/test/fuzz/process_messages.cpp w/src/test/fuzz/process_messages.cpp
    index 0688868c02..342a4038b9 100644
    --- i/src/test/fuzz/process_messages.cpp
    +++ w/src/test/fuzz/process_messages.cpp
    @@ -84,12 +84,12 @@ FUZZ_TARGET(process_messages, .init = initialize_process_messages)
                 random_node.fPauseSend = false;
     
                 try {
                     more_work = connman.ProcessMessagesOnce(random_node);
                 } catch (const std::ios_base::failure&) {
                 }
    -            g_setup->m_node.peerman->SendMessages(&random_node);
    +            g_setup->m_node.peerman->SendMessages(random_node);
             }
         }
         g_setup->m_node.validation_signals->SyncWithValidationInterfaceQueue();
         g_setup->m_node.connman->StopNodes();
     }
    diff --git i/src/test/net_tests.cpp w/src/test/net_tests.cpp
    index 0036d94c2f..ce10b50417 100644
    --- i/src/test/net_tests.cpp
    +++ w/src/test/net_tests.cpp
    @@ -875,13 +875,13 @@ BOOST_AUTO_TEST_CASE(initial_advertise_from_version_message)
                         return;
                     }
                 }
             }
         };
     
    -    m_node.peerman->SendMessages(&peer);
    +    m_node.peerman->SendMessages(peer);
     
         BOOST_CHECK(sent);
     
         CaptureMessage = CaptureMessageOrig;
         chainman.ResetIbd();
         m_node.args->ForceSetArg("-capturemessages", "0");
    diff --git i/src/test/util/net.cpp w/src/test/util/net.cpp
    index 0b8070a88e..4f4496c93e 100644
    --- i/src/test/util/net.cpp
    +++ w/src/test/util/net.cpp
    @@ -28,13 +28,13 @@ void ConnmanTestMsg::Handshake(CNode& node,
                                    bool relay_txs)
     {
         auto& peerman{static_cast<PeerManager&>(*m_msgproc)};
         auto& connman{*this};
     
         peerman.InitializeNode(node, local_services);
    -    peerman.SendMessages(&node);
    +    peerman.SendMessages(node);
         FlushSendBuffer(node); // Drop the version message added by SendMessages.
     
         CSerializedNetMsg msg_version{
             NetMsg::Make(NetMsgType::VERSION,
                     version,                                        //
                     Using<CustomUintFormatter<8>>(remote_services), //
    @@ -49,13 +49,13 @@ void ConnmanTestMsg::Handshake(CNode& node,
                     relay_txs),
         };
     
         (void)connman.ReceiveMsgFrom(node, std::move(msg_version));
         node.fPauseSend = false;
         connman.ProcessMessagesOnce(node);
    -    peerman.SendMessages(&node);
    +    peerman.SendMessages(node);
         FlushSendBuffer(node); // Drop the verack message added by SendMessages.
         if (node.fDisconnect) return;
         assert(node.nVersion == version);
         assert(node.GetCommonVersion() == std::min(version, PROTOCOL_VERSION));
         CNodeStateStats statestats;
         assert(peerman.GetNodeStateStats(node.GetId(), statestats));
    @@ -63,13 +63,13 @@ void ConnmanTestMsg::Handshake(CNode& node,
         assert(statestats.their_services == remote_services);
         if (successfully_connected) {
             CSerializedNetMsg msg_verack{NetMsg::Make(NetMsgType::VERACK)};
             (void)connman.ReceiveMsgFrom(node, std::move(msg_verack));
             node.fPauseSend = false;
             connman.ProcessMessagesOnce(node);
    -        peerman.SendMessages(&node);
    +        peerman.SendMessages(node);
             assert(node.fSuccessfullyConnected == true);
         }
     }
     
     void ConnmanTestMsg::NodeReceiveMsgBytes(CNode& node, std::span<const uint8_t> msg_bytes, bool& complete) const
     {
    diff --git i/src/test/util/net.h w/src/test/util/net.h
    index 36c5dc5c17..0a0719c735 100644
    --- i/src/test/util/net.h
    +++ w/src/test/util/net.h
    @@ -70,15 +70,15 @@ struct ConnmanTestMsg : public CConnman {
                        ServiceFlags remote_services,
                        ServiceFlags local_services,
                        int32_t version,
                        bool relay_txs)
             EXCLUSIVE_LOCKS_REQUIRED(NetEventsInterface::g_msgproc_mutex);
     
    -    bool ProcessMessagesOnce(CNode& node) EXCLUSIVE_LOCKS_REQUIRED(NetEventsInterface::g_msgproc_mutex)
    +    bool ProcessMessagesOnce(CNode& from) EXCLUSIVE_LOCKS_REQUIRED(NetEventsInterface::g_msgproc_mutex)
         {
    -        return m_msgproc->ProcessMessages(&node, flagInterruptMsgProc);
    +        return m_msgproc->ProcessMessages(from, flagInterruptMsgProc);
         }
     
         void NodeReceiveMsgBytes(CNode& node, std::span<const uint8_t> msg_bytes, bool& complete) const;
     
         bool ReceiveMsgFrom(CNode& node, CSerializedNetMsg&& ser_msg) const;
         void FlushSendBuffer(CNode& node) const;
    

    </details>

  78. pinheadmz approved
  79. pinheadmz commented at 7:42 PM on May 13, 2025: member

    ACK 8b93e0894f94883f76b7df5d50a5f1a58544fb6c

    Built on macos/arm64, ran all unit and functional tests, reviewed each commit and left a few questions.

    Also ran on mainnet for a few hours to watch peers come and go.

    This is a great refactor: manual refcounting is bound to be error-prone, and the patch gives us cleaner code with some performance optimizations. Since CNode was mostly handled by raw pointer, most of the handling logic didn't have to change to support std::shared_ptr instead.

    <details><summary>Show Signature</summary>

    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA256
    
    ACK 8b93e0894f94883f76b7df5d50a5f1a58544fb6c
    -----BEGIN PGP SIGNATURE-----
    
    iQIzBAEBCAAdFiEE5hdzzW4BBA4vG9eM5+KYS2KJyToFAmgjoCIACgkQ5+KYS2KJ
    yTrA9xAA2lw5u65z3P7wsym6vKgne3bveduLNHcK/V1U/Jy6Oc9ArnO2NcD14CV8
    TWYC07Nt1E/ZRqu3qWtwYklNtQSIHxUuvhaX98XPvXA2tA6XHWxgHdU9G9whERgb
    GrxPxTh0JDA/R/poQD4+16rFHBQnd+OIhlw6x6XRLdBKcbVPdYsuH4QNmtL0smWe
    /S3OGQKmAgxK+f2FC85q8pXbrchZ5JrtOBliT+YUMxYM44ryKRosCt7nSWwceMGd
    pUD0YXR7NKJ/VJp8ZNFU8rUFxzOyKBcqKTOgqqyZUUJc/Db+JEK82zA814gck5ep
    Lwc7MLb/dgQnAYAEEY5tmUbftOSU2PDoH2/DA/JDRub1RwTMj7U8r/oMRnN5pWXC
    CS2M1E/GddeFBq19kS7wGOKwMcqHP9k9fqc7wLtZRmg8seKbw0Aac8hn+tGfHE8c
    9RvnoEFzyYx7sTWXMc2900eUcvfsV51D35Zno7tWWY1uhWo8Td7Zk0xHQ1m+zH93
    /E8RspiR3754bJLYOhqR0EQxpHvkmnGGqr3cXL3IZWIPA/CxfW+EXEqGvZ2u87Kv
    UuW1akk/hx159IEznGNxVFcoO/Gcc5xCfR3/5o99Y3/Tt7Mshl5Dg6DXIL1BEudW
    LgpHbx/uWuxQzTWI8Ce4XiSPASP1tYErCNfPeuBA/mm0oEJLE3Q=
    =0vbV
    -----END PGP SIGNATURE-----
    

    pinheadmz's public key is on openpgp.org

    </details>

  80. DrahtBot requested review from theuni on May 13, 2025
  81. DrahtBot requested review from hodlinator on May 13, 2025
  82. vasild force-pushed on May 14, 2025
  83. vasild commented at 11:26 AM on May 14, 2025: contributor

    8b93e0894f...bd6c57f387: address suggestions

  84. vasild force-pushed on May 14, 2025
  85. vasild commented at 11:39 AM on May 14, 2025: contributor

    bd6c57f387...20657a7c8e: rebase

    Edit: CI failure looks unrelated

  86. DrahtBot added the label Needs rebase on May 20, 2025
  87. vasild force-pushed on May 21, 2025
  88. vasild commented at 11:55 AM on May 21, 2025: contributor

    20657a7c8e...cbd7bd7422: rebase due to conflicts

  89. DrahtBot removed the label Needs rebase on May 21, 2025
  90. vasild force-pushed on May 21, 2025
  91. vasild commented at 1:17 PM on May 21, 2025: contributor

    cbd7bd7422...67bc008f62: fix CI

  92. pinheadmz approved
  93. pinheadmz commented at 2:12 PM on May 21, 2025: member

    ACK 67bc008f62f9eac1616558770708c1e8547b09f0

    Only rebase and trivial adjustments since last ack

    <details><summary>Show Signature</summary>

    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA256
    
    ACK 67bc008f62f9eac1616558770708c1e8547b09f0
    -----BEGIN PGP SIGNATURE-----
    
    iQIzBAEBCAAdFiEE5hdzzW4BBA4vG9eM5+KYS2KJyToFAmgt3x8ACgkQ5+KYS2KJ
    yTqEKA//W/MMHBH002NCZEngJedxatJrtDCBj/RkfA9ijfOQ+WmNjnd2QNLJa3ds
    z6NlZ0zlNNh60XZHGEgdWtEk8TvPECe/E9nEFjn6pze9REcIvizn/ZzomcNWgQxy
    KJAptOo63YUXaK8mSQD3rF805KLjxbhEoG/YNOXhm5gBtaTFciHrc/bJk4l5IBIM
    fnkC/p2+SnLZksY0ezZIZWCaBnQKEKPsda4ziwzNRQ3ER5Ze/COicn8MAXXSBgV1
    nsRnSc6Rl65EbT9sIkH8A4zCcu2NWXqcP9gKnY9NjO48nauPHg8S/RiTySIxsiKZ
    G7smIZiKTnfI6GAbolFH9NGhUZCI/AJ/dMGZXsKGl8h8zNv83Nxz9jFFsSRNuHZq
    edHaCE9t2v3GnGDEtKeXVXjlKsr0azh3P8r2uBv3iR8FKmZ9tT3mMIyGAiHaz9gi
    uihplbgiqbC67eE5NkiCykuG62rnwLqGME8eJaloRQC1PLi9XMAPbdTqOV/7797i
    UoxMemp2DpotIiKtcNAUz+5egav29v38NnmPUC7hZGJNKslcQx37McDEKTVLxvZu
    hnKM1HE/BmPMiErEUjlCmx7PAhyoToaxYf1LKVhKNp/pMNaf9HeDhO5mkAM2/+cb
    E3+4enfaBZbu2OJeA1Y76t4T3jwlBl5i4++J6xO7E/6HA84nZ/4=
    =AZtd
    -----END PGP SIGNATURE-----
    

    pinheadmz's public key is on openpgp.org

    </details>

  94. DrahtBot added the label CI failed on May 21, 2025
  95. DrahtBot removed the label CI failed on May 22, 2025
  96. hodlinator commented at 6:32 AM on May 22, 2025: contributor

    Due to my anti-shared_ptr arguments in #32015 (review), and the main point of my unique_ptr alternative #32015 (comment) (with some support in #32015 (comment)) not being incorporated, I have a hard time ACKing this as-is. Not planning to open a competing PR at this point though.

    Curious what Cory comes up with (https://github.com/bitcoin/bitcoin/pull/32015#issuecomment-2842078787), and time allowing I plan to attempt an eradication of shared_ptr from his next solution if it makes use of it by the end.

    Happy that this PR spawned off other improvements like #32326 and #32394.

  97. theuni commented at 7:31 PM on May 22, 2025: member

    I'm also not in favor of this, but now it's because I think this is addressing the wrong issue altogether.

    Generally I think this is a nice change, but it's scary, and if we're going to be changing it I'd rather do it once and for all. As part of the CConnman refactor, CNode was supposed to become internal and not passed into PeerManager at all. In that case, the lifetime and sharing issues would be completely moot. But that didn't get done... my fault.

    So, I'm working on doing that now. It's a large refactor and it will take a while, but I think it's better to fix the root issue here rather than continuing to work around it.

  98. vasild commented at 7:27 AM on May 24, 2025: contributor

    CNode was supposed to become internal and not passed into PeerManager at all

    Do you need any help with that?

  99. theuni commented at 6:56 PM on June 16, 2025: member

    @vasild Sure!

    This may be a good one to tag-team. I've been distracted for the last two weeks, but coming back to this now.

    I did some initial hacking on it and got pretty far along, but it's a substantial amount of work, both on the coding side as well as review. I'm starting to wonder if it might be a good collaborative/working group project.

    I'll ping you offline soon to share my thoughts/progress and see what you think.

  100. DrahtBot added the label Needs rebase on Jul 25, 2025
  101. vasild force-pushed on Sep 8, 2025
  102. vasild commented at 2:03 PM on September 8, 2025: contributor

    67bc008f62...0cd86a5c70: rebase due to conflicts

  103. DrahtBot removed the label Needs rebase on Sep 8, 2025
  104. DrahtBot added the label Needs rebase on Sep 17, 2025
  105. vasild force-pushed on Sep 29, 2025
  106. vasild commented at 12:52 PM on September 29, 2025: contributor

    0cd86a5c70...8218daac0f: rebase due to conflicts

  107. DrahtBot removed the label Needs rebase on Sep 29, 2025
  108. DrahtBot added the label Needs rebase on Sep 30, 2025
  109. vasild force-pushed on Oct 2, 2025
  110. vasild commented at 2:46 PM on October 2, 2025: contributor

    8218daac0f...d026d5475e: rebase due to conflicts

    This is now a little bit simplified after #32326 was merged - no need to touch FindNode() anymore.

  111. DrahtBot removed the label Needs rebase on Oct 2, 2025
  112. DrahtBot added the label Needs rebase on Oct 4, 2025
  113. net: store CNode objects as shared_ptr
    This is a non-functional change - the manual reference counting,
    disconnection, finalization and deletion are unchanged.
    
    `DeleteNode()` used to call `FinalizeNode()` + `delete` the raw pointer.
    Expand `DeleteNode()` at the call sites - call `FinalizeNode()` at the
    call sites and instead of raw `delete` either get the container vector
    go out of scope or explicitly call `clear()` on it.
    8e385bb6b8
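The ownership model this commit moves toward can be illustrated with a minimal sketch (illustrative types only, not the actual Bitcoin Core classes): finalization happens in the destructor, so whichever holder drops the last shared_ptr triggers it, and a "snapshot" is just a copy of the container.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for CNode: the destructor plays the role of
// PeerManager::FinalizeNode(), so the last shared_ptr owner runs it.
struct Node {
    bool* finalized;
    explicit Node(bool* f) : finalized(f) {}
    ~Node() { *finalized = true; } // stands in for FinalizeNode()
};

bool run_demo()
{
    bool finalized = false;
    std::vector<std::shared_ptr<Node>> nodes;
    nodes.push_back(std::make_shared<Node>(&finalized));

    // A snapshot is simply a copy of the vector: both share ownership.
    auto snapshot = nodes;
    nodes.clear();      // "disconnect": drop the primary reference
    assert(!finalized); // the snapshot still keeps the node alive
    snapshot.clear();   // last reference gone -> destructor runs
    return finalized;
}
```

In the real code the copy would be taken under `m_nodes_mutex`, but the lifetime mechanics are the same: no manual AddRef()/Release() bookkeeping is needed.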
  114. net: remove unnecessary nodes_copy and nodes_disconnected_copy temporaries
    The other copies have a purpose for locking, these were just an awkward
    convenience for deleting while iterating.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    00a533a064
  115. net: add a lock for m_nodes_disconnected
    Subsequent commits will allow this to be shared with the message
    handling thread.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    1b1e3f8666
  116. net: finalize and delete nodes from ThreadMessageHandler()
    Rather than relying on refcounting to abstract away our teardown
    procedure, make the rules explicit:
    - FinalizeNode() is only called when ProcessMessages() and
      SendMessages() is not running. This is enforced by finalizing only at
      the top of the message handler loop.
    - FinalizeNode() may only be called on nodes which have been
      disconnected, by virtue of finding them in m_nodes_disconnected. This
      enforces the order: disconnect -> finalize -> delete. This order is
      the current behavior but is not strictly necessary, and the
      restriction will be removed in a later commit.
    - Nodes are only deleted after they've been disconnected and finalized.
      This is enforced by deleting them from m_nodes_disconnected and only
      after calling FinalizeNode().
    - Callers of ForNode() and ForEachNode() are currently protected from
      finalization or deletion due to their locking of m_nodes_mutex.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    f91e36f057
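The disconnect -> finalize -> delete ordering described in this commit message can be sketched as follows (illustrative members, not the real CConnman fields): disconnected nodes are moved out of the active set, and finalization only ever happens at the top of the message handler loop, so it cannot race with ProcessMessages()/SendMessages().

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy node with the two lifecycle flags the rules above talk about.
struct Node { bool disconnected = false; bool finalized = false; };

std::vector<std::shared_ptr<Node>> m_nodes;
std::vector<std::shared_ptr<Node>> m_nodes_disconnected;

void message_handler_iteration()
{
    // 1. Finalize and delete nodes disconnected in earlier iterations.
    //    Doing this only here enforces: disconnect -> finalize -> delete.
    for (auto& node : m_nodes_disconnected) node->finalized = true;
    m_nodes_disconnected.clear();

    // 2. Move freshly disconnected nodes out of the active set.
    for (auto it = m_nodes.begin(); it != m_nodes.end();) {
        if ((*it)->disconnected) {
            m_nodes_disconnected.push_back(*it);
            it = m_nodes.erase(it);
        } else {
            ++it;
        }
    }

    // 3. ProcessMessages()/SendMessages() would run here; no finalization
    //    can happen until the top of the next iteration.
}
```

A node disconnected during one iteration is therefore guaranteed not to be finalized until the following iteration begins.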
  117. net: remove CNode reference counting and NodesSnapshot
    Snapshots of `m_nodes` are not needed anymore because finalization and
    deletion of nodes only happens in `ThreadMessageHandler()` and only a
    after nodes have been disconnected and moved from `m_nodes` to
    `m_nodes_disconnected`.
    
    The manual reference counting becomes unused, remove it too.
    
    Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
    05f669675f
  118. net: don't hold m_nodes_mutex for ForNode() callbacks
    The shared_ptrs now ensure that the CNodes will not be deleted for the
    duration of the callbacks.
    
    Disconnection/Finalization may happen while the shared_ptr is being held,
    though it won't be deleted for the duration of the callback.
    
    The majority of uses are in ProcessMessages and SendMessages, during which
    disconnection may occur but FinalizeNode will not be called.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    91f66fc3ee
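The ForNode() pattern this commit enables might look roughly like the sketch below (names are illustrative, not the actual CConnman API): the shared_ptr is picked up under the mutex, but the callback runs after the lock is released, with the shared_ptr alone keeping the node alive.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <vector>

struct Node { int id = 0; };

std::mutex g_nodes_mutex;
std::vector<std::shared_ptr<Node>> g_nodes;

// Grab a reference under the mutex, then invoke the callback without it.
// Even if another thread erases the node from g_nodes meanwhile, the
// local shared_ptr prevents deletion for the duration of the callback.
bool for_node(int id, const std::function<bool(Node&)>& func)
{
    std::shared_ptr<Node> node;
    {
        std::lock_guard<std::mutex> lock(g_nodes_mutex);
        for (const auto& n : g_nodes) {
            if (n->id == id) { node = n; break; }
        }
    }
    return node && func(*node); // callback runs without g_nodes_mutex held
}
```

Not holding the mutex across the callback avoids lock-order issues and contention with the socket handler thread.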
  119. net: optimization: wake message handler when nodes are disconnected
    This keeps us from waiting for the condvar timeout (currently 100ms) before
    finalizing a node that has been disconnected on the socket handler thread.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    abc29c4d8b
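The wake-up described in this commit can be reduced to a standard condvar notify (illustrative names only): instead of letting the message handler always sleep out its full timeout, the socket thread signals the condition variable as soon as a node is disconnected.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex g_mutex;
std::condition_variable g_cv;
bool g_have_disconnected = false;

// Called from the socket handler thread when a node is disconnected.
void on_disconnect()
{
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        g_have_disconnected = true;
    }
    g_cv.notify_one(); // wake the message handler immediately
}

// Message handler: returns true if woken by a disconnect, false on timeout.
bool wait_for_work()
{
    std::unique_lock<std::mutex> lock(g_mutex);
    // The 100ms bound mirrors the condvar timeout mentioned above.
    return g_cv.wait_for(lock, std::chrono::milliseconds(100),
                         [] { return g_have_disconnected; });
}
```

With the predicate already satisfied, `wait_for` returns immediately rather than burning the full timeout, which is exactly the latency this commit removes from node finalization.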
  120. vasild force-pushed on Oct 7, 2025
  121. vasild commented at 9:33 AM on October 7, 2025: contributor

    d026d5475e...abc29c4d8b: rebase due to conflicts

  122. DrahtBot removed the label Needs rebase on Oct 7, 2025
  123. DrahtBot added the label Needs rebase on Dec 5, 2025
  124. DrahtBot commented at 10:44 AM on December 5, 2025: contributor

    <!--cf906140f33d8803c4a75a2196329ecb-->

    🐙 This pull request conflicts with the target branch and needs rebase.

  125. sedited commented at 5:20 PM on February 13, 2026: contributor

    This has been in need of a rebase for a while. Can you push an update?


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-05-01 03:13 UTC
