net: replace manual reference counting of CNode with shared_ptr #32015

pull vasild wants to merge 7 commits into bitcoin:master from vasild:cnode_shared_ptr changing 6 files +161 −226
  1. vasild commented at 9:53 am on March 7, 2025: contributor

    Before this change the code counted references to CNode objects manually via CNode::nRefCount. Unneeded CNodes were scheduled for deletion by putting them in CConnman::m_nodes_disconnected and were deleted after their reference count reached zero. Deletion consisted of calling PeerManager::FinalizeNode() and then destroying the CNode object.

    Replace this scheme with std::shared_ptr. This simplifies the code and removes:

    • CNode::nRefCount
    • CNode::GetRefCount()
    • CNode::AddRef()
    • CNode::Release()
    • CConnman::m_nodes_disconnected
    • CConnman::NodesSnapshot

    Now creating a snapshot of CConnman::m_nodes is done by simply copying it (under the mutex).
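    For illustration, a snapshot then boils down to the one-liner the PR itself uses later in the thread (each copied shared_ptr keeps its CNode alive until the copy goes out of scope):

        // A snapshot is just a copy of the vector of shared_ptrs, taken under the mutex.
        auto nodes = WITH_LOCK(m_nodes_mutex, return m_nodes);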

    Call PeerManager::FinalizeNode() from the destructor of CNode, which is called when the reference count reaches 0.
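    Schematically (a simplified sketch of the PR's m_destruct_cb mechanism, not the exact code):

        CNode::~CNode()
        {
            // shared_ptr guarantees the destructor runs exactly once, after the
            // last reference is gone, on whichever thread dropped that reference.
            if (m_destruct_cb) m_destruct_cb(*this);
        }

        // The callback that CConnman wires in at construction (as in the PR's diffs):
        //     [this](CNode& node) { m_msgproc->FinalizeNode(node); }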

    This removes brittle code and replaces it with more robust code from the standard library. All of the following are possible with manual reference counting and impossible with shared_ptr (see the sketch after this list):

    • Use an object without adding a reference to it
    • Add too many references to an object, mismatching add/remove with too many “add”s
    • Forget to decrement the reference count, mismatching add/remove with too few “remove”s
    • Drive the reference count negative by mistakenly having too many “remove”s
    • Destroy an object while there are still references to it
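    A minimal, self-contained sketch of the difference (the ManualNode type is hypothetical, not the real CNode API): with manual counting nothing enforces the protocol, while with shared_ptr the bookkeeping is done by the library.

        #include <cassert>
        #include <memory>

        // Hypothetical stand-in for the manual scheme: nothing stops callers from
        // skipping AddRef(), calling Release() too often, or deleting too early.
        struct ManualNode {
            int nRefCount{0};
            void AddRef() { ++nRefCount; }
            void Release() { --nRefCount; } // can go negative; nothing checks
        };

        int main()
        {
            auto* m = new ManualNode{};
            m->Release();               // mismatched “remove”: count is now -1
            delete m;                   // destroyed regardless of outstanding references

            auto s = std::make_shared<ManualNode>();
            auto copy = s;              // copying is the only way to “add” a reference
            assert(s.use_count() == 2); // the count is maintained by the library
            // the object is destroyed exactly once, when the last copy goes away
        }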

    There is zero learning curve for the average C++ programmer who knows how shared_ptr works. Not so with the manual reference counting, because one has to learn about AddRef() and Release(), check and maintain where they are called, and understand their relationship with m_nodes_disconnected.


    This has some history in #28222 and #10738. See #32015 (comment) and #32015 (comment) below.

  2. DrahtBot commented at 9:53 am on March 7, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32015.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK hodlinator, theuni

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #32394 (net: make m_nodes_mutex non-recursive by vasild)
    • #32326 (net: improve the interface around FindNode() and avoid a recursive mutex lock by vasild)
    • #32278 (doc: better document NetEventsInterface and the deletion of “CNode”s by vasild)
    • #32065 (i2p: make a time gap between creating transient sessions and using them by vasild)
    • #30988 (Split CConnman by vasild)
    • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)
    • #28584 (Fuzz: extend CConnman tests by vasild)
    • #28463 (p2p: Increase inbound capacity for block-relay only connections by mzumsande)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label P2P on Mar 7, 2025
  4. vasild commented at 9:53 am on March 7, 2025: contributor
  5. vasild commented at 9:55 am on March 7, 2025: contributor
    cc @hodlinator, you might be interested in this as well
  6. fanquake commented at 10:13 am on March 7, 2025: member

    https://github.com/bitcoin/bitcoin/actions/runs/13718139296/job/38367326902?pr=32015#step:7:1603:

    Run p2p_headers_presync with args ['/Users/runner/work/bitcoin/bitcoin/ci/scratch/build-aarch64-apple-darwin23.6.0/src/test/fuzz/fuzz', PosixPath('/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync')]net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
    Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a"

    net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
    Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a"
    
  7. in src/net.cpp:1938 in 2747f135be outdated
    +
    +                // Keep a reference to this CNode object to delay its destruction until
    +                // after m_nodes_mutex has been released. Destructing a node involves
    +                // calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order
    +                // should be cs_main, m_nodes_mutex.
    +                disconnected_nodes.push_back(node);
    


    dergoegge commented at 11:03 am on March 7, 2025:

    Now that ~CNode is responsible for calling FinalizeNode, the guarantees around which thread does the actual cleanup are lost. It’s basically just whichever thread ends up holding the last reference to a CNode. This seems bug-prone.

    E.g. there are no guarantees (afaict) that disconnected_nodes will actually be the last owner.


    vasild commented at 1:29 pm on March 7, 2025:

    The idea is that it does not matter which thread will destroy the CNode objects.

    there are no guarantees (afaict) that disconnected_nodes will actually be the last owner.

    That is right, but there is no need for such a guarantee. disconnected_nodes exists to guarantee that the objects are not destroyed while m_nodes_mutex is held.


    hodlinator commented at 11:06 pm on March 7, 2025:

    In thread #32015 (review):

    Maybe one could have an additional vector<shared_ptr<CNode>> member in CConnman beyond m_nodes that always keeps the shared_ptr::use_count() above 0. Then use removal from that vector (when use_count() == 1) for controlled deletion with enforced lock order, more like how the code was before?
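    A rough sketch of that idea (hypothetical member name m_owned_nodes; assumes C++20 std::erase_if), with nodes destroyed in exactly one place and outside m_nodes_mutex:

        void CConnman::ReapDisconnectedNodes()
        {
            std::vector<std::shared_ptr<CNode>> to_destroy;
            {
                LOCK(m_nodes_mutex);
                std::erase_if(m_owned_nodes, [&](std::shared_ptr<CNode>& node) {
                    if (node.use_count() == 1) { // we are the last owner
                        to_destroy.push_back(std::move(node));
                        return true;
                    }
                    return false;
                });
            }
            to_destroy.clear(); // ~CNode (and thus FinalizeNode) runs here, lock released
        }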


    vasild commented at 8:55 am on March 11, 2025:

    I do not think that is necessary. I mean: if we assume that the code is going to be modified carelessly enough to break the logic, in other words to destroy CNode objects while holding m_nodes_mutex, then we might as well assume that the code could be modified in a way that breaks the manual reference counting. IMO the manual reference counting is more fragile and error-prone.

    Further, if the code is changed to destroy CNode objects while holding m_nodes_mutex, then 83 functional tests fail. I think it is safe to assume that such a thing will not go unnoticed.


    hodlinator commented at 8:18 pm on March 11, 2025:

    In thread #32015 (review):

    Started proving out my suggestion and landed on something closer to master.

    • Resurrect m_nodes_disconnected (now holding shared_ptr) along with clearly defined lifetimes.
    • Resurrect DeleteNode(), but make it take an r-value shared_ptr and assert that use_count() == 1 (sketched below).
    • Avoid adding ~CNode() and CNode::m_destruct_cb along with passing it in everywhere via the constructor. This means we don’t need to touch 5 of the test files.

    https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_suggestion

    Passes unit tests and functional tests.
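    The suggested DeleteNode() from the second bullet would then look roughly like this (an assumed shape; see the branch for the real code):

        // The caller must hand over the last remaining owner.
        void CConnman::DeleteNode(std::shared_ptr<CNode>&& node)
        {
            assert(node.use_count() == 1); // no snapshot may still reference it
            m_msgproc->FinalizeNode(*node);
            node.reset();                  // the CNode is destroyed here
        }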

  8. DrahtBot added the label CI failed on Mar 7, 2025
  9. DrahtBot commented at 11:14 am on March 7, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/38367328595

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen for a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  10. vasild force-pushed on Mar 7, 2025
  11. vasild commented at 1:23 pm on March 7, 2025: contributor
    2747f135be...e35a382388: fix the CI failure, thanks!
  12. vasild force-pushed on Mar 7, 2025
  13. DrahtBot removed the label CI failed on Mar 7, 2025
  14. in src/net.cpp:1888 in 36d932cebb outdated
    @@ -1884,8 +1884,15 @@ bool CConnman::AddConnection(const std::string& address, ConnectionType conn_typ
         } // no default case, so the compiler can warn about missing cases

         // Count existing connections
    -    int existing_connections = WITH_LOCK(m_nodes_mutex,
    -                                         return std::count_if(m_nodes.begin(), m_nodes.end(), [conn_type](CNode* node) { return node->m_conn_type == conn_type; }););
    +    const int existing_connections = WITH_LOCK(m_nodes_mutex,
    +        return std::count_if(
    


    hodlinator commented at 10:24 pm on March 7, 2025:
    nit: Might as well use std::ranges::count_if here and in GetFullOutboundConnCount.

    vasild commented at 4:06 pm on March 10, 2025:
    Indeed, done, thanks!
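    For reference, the ranges form has roughly this shape (a sketch assuming m_nodes holds shared_ptrs, not the exact committed code):

        const int existing_connections = WITH_LOCK(m_nodes_mutex,
            return std::ranges::count_if(m_nodes, [conn_type](const auto& node) {
                return node->m_conn_type == conn_type;
            }););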
  15. hodlinator commented at 11:13 pm on March 7, 2025: contributor
    Concept ACK 36d932cebbb235a79e90575960646e86a23d39a8
  16. vasild force-pushed on Mar 10, 2025
  17. vasild commented at 4:07 pm on March 10, 2025: contributor
    36d932cebb...af622d00ba: address #32015 (review)
  18. purpleKarrot commented at 6:24 pm on March 10, 2025: contributor
    Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics? Every C++ programmer should rewatch https://youtu.be/W2tWOdzgXHA from time to time.
  19. hodlinator commented at 7:51 pm on March 10, 2025: contributor

    Every C++ programmer should rewatch https://youtu.be/W2tWOdzgXHA from time to time.

    I remembered that as “the seminal std::rotate()-talk”, but it also has “No Raw Pointers” from 48:45. Probably stoked my hatred of smart pointers which led to #31713. To be fair, in the talk he also says that we used to do intrusive ref-counting… so this is at least one step away from that.

    Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics?

    Guess one could have an InternalCNode which is the actual meat, and make the public CNode instances hold shared_ptr<InternalCNode>. Not sure how well that plays out. Maybe you @purpleKarrot could take a stab at something along the lines of what you pointed to? Keep in mind that we try to minimize the number of lines touched in a change unless necessary.

  20. purpleKarrot commented at 9:08 pm on March 10, 2025: contributor
    After reviewing the code line by line, I realized that the only place where AddRef was ever called was directly after construction (plus one place in the fuzz tests, not sure why that is needed). It looks like the CNode instances are not actually shared (good news). Could they be passed around by std::unique_ptr instead? This would greatly simplify future refactoring.
  21. in src/net.cpp:1717 in af622d00ba outdated
    @@ -1716,7 +1714,7 @@ bool CConnman::AttemptToEvictConnection()
             return false;
         }
         LOCK(m_nodes_mutex);
    -    for (CNode* pnode : m_nodes) {
    +    for (const auto& pnode : m_nodes) {
    


    hodlinator commented at 10:01 pm on March 10, 2025:

    Not sure about adding non-transitive const here since we set pnode->fDisconnect inside the loop.

    Similar case in DisconnectNodes()-loop.


    vasild commented at 11:55 am on March 11, 2025:
    Removed the const.
  22. dergoegge commented at 10:15 pm on March 10, 2025: member

    realized that the only places where AddRef was ever called, was directly after construction

    It is also called in the RAII helper NodesSnapshot which is used in different threads (i.e. “net” and “msghand”), so unique_ptr won’t work unfortunately.

    plus one place in the fuzz tests, not sure why that is needed

    It’s probably just there to have it “covered”.

  23. in src/test/util/net.h:57 in af622d00ba outdated
    @@ -54,20 +54,19 @@ struct ConnmanTestMsg : public CConnman {
         void AddTestNode(CNode& node)
         {
            LOCK(m_nodes_mutex);
    -        m_nodes.push_back(&node);
    +        m_nodes.push_back(std::shared_ptr<CNode>(&node));
    


    purpleKarrot commented at 10:41 pm on March 10, 2025:

    This looks suspicious. No custom deleter is passed to the shared_ptr constructor, so it defaults to default_delete, which will eventually run delete on the pointer. Is the argument CNode& node guaranteed to be heap-allocated? Otherwise:

    m_nodes.push_back(std::shared_ptr<CNode>(&node, [](CNode*){}));
    

    vasild commented at 10:25 am on March 11, 2025:

    Yes, the argument CNode& node is always heap-allocated. Not sure about “guaranteed”. If it is not heap-allocated, then this is a gross error, resulting in an immediate crash or an error report by Valgrind or a memory sanitizer.

    Note that the semantics of this piece of code are not changed by this PR. It was the same before as well: it would eventually delete the object.

  24. in src/net.cpp:438 in af622d00ba outdated
    @@ -438,8 +436,7 @@ CNode* CConnman::ConnectNode(CAddress addrConnect, const char *pszDest, bool fCo
                     // It is possible that we already have a connection to the IP/port pszDest resolved to.
                     // In that case, drop the connection that was just created.
                     LOCK(m_nodes_mutex);
    


    hodlinator commented at 10:52 pm on March 10, 2025:
    Since we aren’t doing anything with pnode anymore, we should be able to remove this LOCK() too, avoiding re-entrant locking inside FindNode().

    vasild commented at 12:00 pm on March 11, 2025:
    Right! Removed.
  25. in src/net.cpp:335 in af622d00ba outdated
    @@ -332,32 +332,32 @@ bool IsLocal(const CService& addr)
         return mapLocalHost.count(addr) > 0;
     }

    -CNode* CConnman::FindNode(const CNetAddr& ip)
    


    hodlinator commented at 10:55 pm on March 10, 2025:
    It is notable how unsafe the design of the FindNode() methods was before switching to shared_ptr: returning a raw pointer while letting go of the mutex protecting it.

    purpleKarrot commented at 11:15 pm on March 10, 2025:

    This is not a raw pointer, but an observer. (Sean Parent defines “raw pointer” as “anything that implies ownership”, which explicitly includes shared_ptr. Strictly speaking, changing CNode* to std::shared_ptr<CNode> turns an observer into a raw pointer.)

    I actually think that returning an observer is fine here, even after the change.


    hodlinator commented at 8:02 am on March 11, 2025:

    If that is the case I think Parent goes one step too far in redefining terms.

    Returning a raw C pointer is not an okay design, as another thread could potentially come along and delete whatever the returned pointer references, resulting in a use-after-free. This is mitigated by FindNode being private and node deletion being done in a controlled fashion on master. But it’s a loaded foot-gun IMO.

    Adding EXCLUSIVE_LOCKS_REQUIRED() would have been a different fix, forcing the caller to protect the returned value with the mutex.

    CNode* FindNode(...) EXCLUSIVE_LOCKS_REQUIRED(m_nodes_mutex)
    

    Introducing shared_ptr solves the use-after-free design issue. (Although I agree it would be great to get rid of shared_ptr).


    vasild commented at 12:10 pm on March 11, 2025:
    I agree with @hodlinator that this has a lifetime issue in master, which this PR resolves. That issue was also discussed in another PR. Good to resolve it.
  26. purpleKarrot commented at 11:18 pm on March 10, 2025: contributor

    unique_ptr won’t work unfortunately.

    Got it. m_nodes needs to be able to be snapshotted. But that seems to be the only place where copies are made. FindNode should return non-owning observers rather than extend the ownership.
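    Combined with the EXCLUSIVE_LOCKS_REQUIRED() idea above, an observer-returning FindNode might look like this sketch (hypothetical, not what either branch does):

        // A non-owning observer whose validity is tied to holding the mutex.
        CNode* CConnman::FindNode(const CNetAddr& ip) EXCLUSIVE_LOCKS_REQUIRED(m_nodes_mutex)
        {
            AssertLockHeld(m_nodes_mutex);
            for (const auto& node : m_nodes) {
                if (static_cast<CNetAddr>(node->addr) == ip) return node.get();
            }
            return nullptr;
        }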

  27. vasild force-pushed on Mar 11, 2025
  28. vasild commented at 11:55 am on March 11, 2025: contributor
    af622d00ba...d233c7e999: address #32015 (review)
  29. vasild force-pushed on Mar 11, 2025
  30. vasild commented at 12:00 pm on March 11, 2025: contributor
    d233c7e999...4492abbf0a: rebase and address #32015 (review)
  31. in src/net.h:1361 in 4492abbf0a outdated
    @@ -1363,7 +1355,7 @@ class CConnman
         bool AlreadyConnectedToAddress(const CAddress& addr);

         bool AttemptToEvictConnection();
    -    CNode* ConnectNode(CAddress addrConnect, const char *pszDest, bool fCountFailure, ConnectionType conn_type, bool use_v2transport) EXCLUSIVE_LOCKS_REQUIRED(!m_unused_i2p_sessions_mutex);
    +    std::shared_ptr<CNode> ConnectNode(CAddress addrConnect, const char *pszDest, bool fCountFailure, ConnectionType conn_type, bool use_v2transport) EXCLUSIVE_LOCKS_REQUIRED(!m_unused_i2p_sessions_mutex);
         void AddWhitelistPermissionFlags(NetPermissionFlags& flags, const CNetAddr &addr, const std::vector<NetWhitelistPermissions>& ranges) const;

         void DeleteNode(CNode* pnode);
    


    hodlinator commented at 1:01 pm on March 11, 2025:
    Function body was removed.

    vasild commented at 5:07 am on March 12, 2025:
    Removed the declaration as well, thanks!
  32. in src/net.cpp:3481 in 4492abbf0a outdated
         WITH_LOCK(m_nodes_mutex, nodes.swap(m_nodes));
    -    for (CNode* pnode : nodes) {
    +    for (auto& pnode : nodes) {
             LogDebug(BCLog::NET, "Stopping node, %s", pnode->DisconnectMsg(fLogIPs));
             pnode->CloseSocketDisconnect();
    -        DeleteNode(pnode);
    


    hodlinator commented at 1:14 pm on March 11, 2025:
    Seems brittle on master that we just delete the node here without checking the refcount. From what I see we only call StopNodes() from tests and CConnman::Stop(), and we only create NodesSnapshots inside threads we’ve already stopped earlier in Stop().
  33. vasild force-pushed on Mar 12, 2025
  34. vasild commented at 5:06 am on March 12, 2025: contributor
    4492abbf0a...8c0ce1ca1c: address #32015 (review)
  35. Sjors commented at 10:24 am on March 12, 2025: member

    This seems like a nice simplification. It would be good to do it before nobody remembers what the old code was trying to do (and prevent). Since @theuni tried something similar before in #10738 it would be great if he could look at this again.

    In particular the 2017 PR claimed “special care must be taken” as to which thread destroys the object, but @vasild believes this doesn’t matter (here): #32015 (review)

    The end result of that PR seems a lot more complicated than what this PR does, and perhaps that’s part of why it didn’t get a huge amount of review and was closed eventually? @willcl-ark then tried again in 2023 in #28222 (initially not knowing about the earlier attempt). Similar concerns about threading were raised: #28222#pullrequestreview-1565029167

    It seems that was never really settled and the PR died.

    It would be good to address this in the PR description to prevent going in circles.

  36. vasild commented at 1:18 pm on March 12, 2025: contributor

    Some comments from the other PRs:

    from #10738#issue-240283546:

    Special care must be taken, though, to only delete CNodes from a single thread, and to control which thread that is.

    In master, deletion of CNode objects can happen in the net thread or in the main thread. In other words, some CNode object might be destroyed in the net thread and another CNode object might be destroyed in the main thread. I do not see any problem with that. Of course we do not destroy the same object from more than one thread, in master or in this PR. With this PR deletion might also happen in the msghand thread (when destroying the temporary copy of m_nodes in ThreadMessageHandler()). That is fine as well; correct me if I am wrong.

    from #28222#pullrequestreview-1565029167:

    which thread actually deletes the CNode.

    The reason why that is important is because we only want to call FinalizeNode once and we want there to be no other references to that CNode at that time.

    This PR achieves that by calling FinalizeNode() from ~CNode(). Just to clarify the obvious: the destructor is called only once, and at the time the destructor is called shared_ptr has ensured that there are no other references to the object. This is simpler and easier to reason about than the code in master, which manually waits for the number of references to drop.

  37. vasild commented at 3:48 pm on March 14, 2025: contributor

    Further idea around the second sentence of this:

    #10738#issue-240283546:

    Special care must be taken, though, to only delete CNodes from a single thread, and to control which thread that is. Eventually, this should move to the message processing thread, so that it’s not possible to hit cs_main from the main network thread.

    It is possible to decouple CNode deletion from the FinalizeNode() call and do the latter in the msghand thread regardless of when the CNode is deleted. FinalizeNode() only needs 3 properties from CNode, so they can be extracted from the CNode object and saved to a place that the msghand thread later looks up, calling FinalizeNode() on that data.

    Here is a patch that does that:

    diff --git i/src/net_processing.cpp w/src/net_processing.cpp
    index 1da3ec9d21..1124036a2a 100644
    --- i/src/net_processing.cpp
    +++ w/src/net_processing.cpp
    @@ -487,13 +487,13 @@ public:
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex);
         void NewPoWValidBlock(const CBlockIndex *pindex, const std::shared_ptr<const CBlock>& pblock) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_most_recent_block_mutex);

         /** Implement NetEventsInterface */
         void InitializeNode(const CNode& node, ServiceFlags our_services) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_tx_download_mutex);
    -    void FinalizeNode(const CNode& node) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_headers_presync_mutex, !m_tx_download_mutex);
    +    void FinalizeNode(const NodeFinalizationData& node) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_headers_presync_mutex, !m_tx_download_mutex);
         bool HasAllDesirableServiceFlags(ServiceFlags services) const override;
         bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, !m_headers_presync_mutex, g_msgproc_mutex, !m_tx_download_mutex);
         bool SendMessages(CNode* pto) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, g_msgproc_mutex, !m_tx_download_mutex);

    @@ -1560,80 +1560,78 @@ void PeerManagerImpl::ReattemptInitialBroadcast(CScheduler& scheduler)
         // Schedule next run for 10-15 minutes in the future.
         // We add randomness on every cycle to avoid the possibility of P2P fingerprinting.
         const auto delta = 10min + FastRandomContext().randrange<std::chrono::milliseconds>(5min);
         scheduler.scheduleFromNow([&] { ReattemptInitialBroadcast(scheduler); }, delta);
     }

    -void PeerManagerImpl::FinalizeNode(const CNode& node)
    +void PeerManagerImpl::FinalizeNode(const NodeFinalizationData& node)
     {
    -    NodeId nodeid = node.GetId();
         {
         LOCK(cs_main);
         {
             // We remove the PeerRef from g_peer_map here, but we don't always
             // destruct the Peer. Sometimes another thread is still holding a
             // PeerRef, so the refcount is >= 1. Be careful not to do any
             // processing here that assumes Peer won't be changed before it's
             // destructed.
    -        PeerRef peer = RemovePeer(nodeid);
    +        PeerRef peer = RemovePeer(node.id);
             assert(peer != nullptr);
             m_wtxid_relay_peers -= peer->m_wtxid_relay;
             assert(m_wtxid_relay_peers >= 0);
         }
    -    CNodeState *state = State(nodeid);
    +    CNodeState *state = State(node.id);
         assert(state != nullptr);

         if (state->fSyncStarted)
             nSyncStarted--;

         for (const QueuedBlock& entry : state->vBlocksInFlight) {
             auto range = mapBlocksInFlight.equal_range(entry.pindex->GetBlockHash());
             while (range.first != range.second) {
                 auto [node_id, list_it] = range.first->second;
    -            if (node_id != nodeid) {
    +            if (node_id != node.id) {
                     range.first++;
                 } else {
                     range.first = mapBlocksInFlight.erase(range.first);
                 }
             }
         }
         {
             LOCK(m_tx_download_mutex);
    -        m_txdownloadman.DisconnectedPeer(nodeid);
    +        m_txdownloadman.DisconnectedPeer(node.id);
         }
    -    if (m_txreconciliation) m_txreconciliation->ForgetPeer(nodeid);
    +    if (m_txreconciliation) m_txreconciliation->ForgetPeer(node.id);
         m_num_preferred_download_peers -= state->fPreferredDownload;
         m_peers_downloading_from -= (!state->vBlocksInFlight.empty());
         assert(m_peers_downloading_from >= 0);
         m_outbound_peers_with_protect_from_disconnect -= state->m_chain_sync.m_protect;
         assert(m_outbound_peers_with_protect_from_disconnect >= 0);

    -    m_node_states.erase(nodeid);
    +    m_node_states.erase(node.id);

         if (m_node_states.empty()) {
             // Do a consistency check after the last peer is removed.
             assert(mapBlocksInFlight.empty());
             assert(m_num_preferred_download_peers == 0);
             assert(m_peers_downloading_from == 0);
             assert(m_outbound_peers_with_protect_from_disconnect == 0);
             assert(m_wtxid_relay_peers == 0);
             WITH_LOCK(m_tx_download_mutex, m_txdownloadman.CheckIsEmpty());
         }
         } // cs_main
    -    if (node.fSuccessfullyConnected &&
    -        !node.IsBlockOnlyConn() && !node.IsInboundConn()) {
    +    if (node.was_connected_full_outbound) {
             // Only change visible addrman state for full outbound peers.  We don't
             // call Connected() for feeler connections since they don't have
             // fSuccessfullyConnected set.
             m_addrman.Connected(node.addr);
         }
         {
             LOCK(m_headers_presync_mutex);
    -        m_headers_presync_stats.erase(nodeid);
    +        m_headers_presync_stats.erase(node.id);
         }
    -    LogDebug(BCLog::NET, "Cleared nodestate for peer=%d\n", nodeid);
    +    LogDebug(BCLog::NET, "Cleared nodestate for peer=%d\n", node.id);
     }

     bool PeerManagerImpl::HasAllDesirableServiceFlags(ServiceFlags services) const
     {
         // Shortcut for (services & GetDesirableServiceFlags(services)) == GetDesirableServiceFlags(services)
         return !(GetDesirableServiceFlags(services) & (~services));
    diff --git i/src/net.h w/src/net.h
    index ff83ecde4d..cbe044f27e 100644
    --- i/src/net.h
    +++ w/src/net.h
    @@ -1000,14 +1000,29 @@ public:
         /** Mutex for anything that is only accessed via the msg processing thread */
         static Mutex g_msgproc_mutex;

         /** Initialize a peer (setup state) */
         virtual void InitializeNode(const CNode& node, ServiceFlags our_services) = 0;

    +    /** Data needed for `FinalizeNode()`. */
    +    struct NodeFinalizationData {
    +        NodeFinalizationData(const CNode& node)
    +            : id{node.GetId()},
    +              addr{node.addr},
    +              was_connected_full_outbound{node.fSuccessfullyConnected &&
    +                                          !node.IsBlockOnlyConn() &&
    +                                          !node.IsInboundConn()}
    +        {}
    +
    +        const NodeId id;
    +        const CService addr;
    +        const bool was_connected_full_outbound;
    +    };
    +
         /** Handle removal of a peer (clear state) */
    -    virtual void FinalizeNode(const CNode& node) = 0;
    +    virtual void FinalizeNode(const NodeFinalizationData& node_finalization_data) = 0;

         /**
          * Callback to determine whether the given set of service flags are sufficient
          * for a peer to be "relevant".
          */
         virtual bool HasAllDesirableServiceFlags(ServiceFlags services) const = 0;
    @@ -1112,15 +1127,17 @@ public:

         ~CConnman();

         bool Start(CScheduler& scheduler, const Options& options) EXCLUSIVE_LOCKS_REQUIRED(!m_total_bytes_sent_mutex, !m_added_nodes_mutex, !m_addr_fetches_mutex, !mutexMsgProc);

         void StopThreads();
    -    void StopNodes();
    -    void Stop()
    +    void StopNodes() EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex);
    +    void Stop() EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex)
         {
    +        AssertLockNotHeld(m_nodes_for_finalization_mutex);
    +
             StopThreads();
             StopNodes();
         };

         void Interrupt() EXCLUSIVE_LOCKS_REQUIRED(!mutexMsgProc);
         bool GetNetworkActive() const { return fNetworkActive; };
    @@ -1261,12 +1278,15 @@ public:

         /** Return true if we should disconnect the peer for failing an inactivity check. */
         bool ShouldRunInactivityChecks(const CNode& node, std::chrono::seconds now) const;

         bool MultipleManualOrFullOutboundConns(Network net) const EXCLUSIVE_LOCKS_REQUIRED(m_nodes_mutex);

    +    /** Remember some of the node's properties for calling `m_msgproc->FinalizeNode()` later. */
    +    void RememberForFinalization(const CNode& node) EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex);
    +
     private:
         struct ListenSocket {
         public:
             std::shared_ptr<Sock> sock;
             inline void AddSocketPermissionFlags(NetPermissionFlags& flags) const { NetPermissions::AddFlag(flags, m_permissions); }
             ListenSocket(std::shared_ptr<Sock> sock_, NetPermissionFlags permissions_)
    @@ -1287,13 +1307,13 @@ private:
         bool InitBinds(const Options& options);

         void ThreadOpenAddedConnections() EXCLUSIVE_LOCKS_REQUIRED(!m_added_nodes_mutex, !m_unused_i2p_sessions_mutex, !m_reconnections_mutex);
         void AddAddrFetch(const std::string& strDest) EXCLUSIVE_LOCKS_REQUIRED(!m_addr_fetches_mutex);
         void ProcessAddrFetch() EXCLUSIVE_LOCKS_REQUIRED(!m_addr_fetches_mutex, !m_unused_i2p_sessions_mutex);
         void ThreadOpenConnections(std::vector<std::string> connect, Span<const std::string> seed_nodes) EXCLUSIVE_LOCKS_REQUIRED(!m_addr_fetches_mutex, !m_added_nodes_mutex, !m_nodes_mutex, !m_unused_i2p_sessions_mutex, !m_reconnections_mutex);
    -    void ThreadMessageHandler() EXCLUSIVE_LOCKS_REQUIRED(!mutexMsgProc);
    +    void ThreadMessageHandler() EXCLUSIVE_LOCKS_REQUIRED(!mutexMsgProc, !m_nodes_for_finalization_mutex);
         void ThreadI2PAcceptIncoming();
         void AcceptConnection(const ListenSocket& hListenSocket);

         /**
          * Create a `CNode` object from a socket that has just been accepted and add the node to
          * the `m_nodes` member.
    @@ -1305,12 +1325,16 @@ private:
         void CreateNodeFromAcceptedSocket(std::unique_ptr<Sock>&& sock,
                                           NetPermissionFlags permission_flags,
                                           const CService& addr_bind,
                                           const CService& addr);

         void DisconnectNodes() EXCLUSIVE_LOCKS_REQUIRED(!m_reconnections_mutex, !m_nodes_mutex);
    +
    +    /** Call `m_msgproc->FinalizeNode()` on all entries from `m_nodes_for_finalization`. */
    +    void FinalizeNodes() EXCLUSIVE_LOCKS_REQUIRED(!m_nodes_for_finalization_mutex);
    +
         void NotifyNumConnectionsChanged();
         /** Return true if the peer is inactive and should be disconnected. */
         bool InactivityCheck(const CNode& node) const;

         /**
          * Generate a collection of sockets to check for IO readiness.
    @@ -1395,12 +1419,17 @@ private:
         // Whether the node should be passed out in ForEach* callbacks
         static bool NodeFullyConnected(const CNode& node);

         uint16_t GetDefaultPort(Network net) const;
         uint16_t GetDefaultPort(const std::string& addr) const;

    +    Mutex m_nodes_for_finalization_mutex;
    +
    +    /** When a node is destroyed some of its properties are saved here for calling m_msgproc->FinalizeNode() later. */
    +    std::queue<std::unique_ptr<NetEventsInterface::NodeFinalizationData>> m_nodes_for_finalization GUARDED_BY(m_nodes_for_finalization_mutex);
    +
         // Network usage totals
         mutable Mutex m_total_bytes_sent_mutex;
         std::atomic<uint64_t> nTotalBytesRecv{0};
         uint64_t nTotalBytesSent GUARDED_BY(m_total_bytes_sent_mutex) {0};

         // outbound limit & stats
    diff --git i/src/net.cpp w/src/net.cpp
    index d641579f6b..3a6ba226d0 100644
    --- i/src/net.cpp
    +++ w/src/net.cpp
    @@ -534,13 +534,13 @@ std::shared_ptr<CNode> CConnman::ConnectNode(CAddress addrConnect, const char *p
                 CalculateKeyedNetGroup(target_addr),
                 nonce,
                 addr_bind,
                 pszDest ? pszDest : "",
                 conn_type,
                 /*inbound_onion=*/false,
    -            [this](CNode& node) { m_msgproc->FinalizeNode(node); },
    +            [this](CNode& node) { RememberForFinalization(node); },
                 CNodeOptions{
                     .permission_flags = permission_flags,
                     .i2p_sam_session = std::move(i2p_transient_session),
                     .recv_flood_size = nReceiveFloodSize,
                     .use_v2transport = use_v2transport,
                 });
    @@ -1832,13 +1832,13 @@ void CConnman::CreateNodeFromAcceptedSocket(std::unique_ptr<Sock>&& sock,
             CalculateKeyedNetGroup(addr),
             nonce,
             addr_bind,
             /*addrNameIn=*/"",
             ConnectionType::INBOUND,
             inbound_onion,
    -        [this](CNode& node) { m_msgproc->FinalizeNode(node); },
    +        [this](CNode& node) { RememberForFinalization(node); },
             CNodeOptions{
                 .permission_flags = permission_flags,
                 .prefer_evict = discouraged,
                 .recv_flood_size = nReceiveFloodSize,
                 .use_v2transport = use_v2transport,
             });
    @@ -1904,14 +1904,12 @@ void CConnman::DisconnectNodes()
         AssertLockNotHeld(m_reconnections_mutex);

         // Use a temporary variable to accumulate desired reconnections, so we don't need
         // m_reconnections_mutex while holding m_nodes_mutex.
         decltype(m_reconnections) reconnections_to_add;

    -    std::vector<std::shared_ptr<CNode>> disconnected_nodes;
    -
         {
             LOCK(m_nodes_mutex);

             if (!fNetworkActive) {
                 // Disconnect any connected nodes
                 for (auto& pnode : m_nodes) {
    @@ -1925,18 +1923,12 @@ void CConnman::DisconnectNodes()
             // Disconnect unused nodes
             for (auto it = m_nodes.begin(); it != m_nodes.end();) {
                 auto node = *it;
                 if (node->fDisconnect) {
                     it = m_nodes.erase(it);

    -                // Keep a reference to this CNode object to delay its destruction until
    -                // after m_nodes_mutex has been released. Destructing a node involves
    -                // calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order
    -                // should be cs_main, m_nodes_mutex.
    -                disconnected_nodes.push_back(node);
    -
                     // Add to reconnection list if appropriate. We don't reconnect right here, because
                     // the creation of a connection is a blocking operation (up to several seconds),
                     // and we don't want to hold up the socket handler thread for that long.
                     if (node->m_transport->ShouldReconnectV1()) {
                         reconnections_to_add.push_back({
                             .addr_connect = node->addr,
    @@ -1964,12 +1956,38 @@ void CConnman::DisconnectNodes()
             // Move entries from reconnections_to_add to m_reconnections.
             LOCK(m_reconnections_mutex);
             m_reconnections.splice(m_reconnections.end(), std::move(reconnections_to_add));
         }
     }

    +void CConnman::RememberForFinalization(const CNode& node)
    +{
    +    LOCK(m_nodes_for_finalization_mutex);
    +    m_nodes_for_finalization.emplace(std::make_unique<NetEventsInterface::NodeFinalizationData>(node));
    +}
    +
    +void CConnman::FinalizeNodes()
    +{
    +    AssertLockNotHeld(m_nodes_for_finalization_mutex);
    +
    +    for (;;) {
    +        std::unique_ptr<NetEventsInterface::NodeFinalizationData> node_finalization_data;
    +
    +        {
    +            LOCK(m_nodes_for_finalization_mutex);
    +            if (m_nodes_for_finalization.empty()) {
    +                break;
    +            }
    +            node_finalization_data.swap(m_nodes_for_finalization.front());
    +            m_nodes_for_finalization.pop();
    +        }
    +
    +        m_msgproc->FinalizeNode(*node_finalization_data);
    +    }
    +}
    +
     void CConnman::NotifyNumConnectionsChanged()
     {
         size_t nodes_size;
         {
             LOCK(m_nodes_mutex);
             nodes_size = m_nodes.size();
    @@ -3050,12 +3068,14 @@ void CConnman::ThreadMessageHandler()

             WAIT_LOCK(mutexMsgProc, lock);
             if (!fMoreWork) {
                 condMsgProc.wait_until(lock, std::chrono::steady_clock::now() + std::chrono::milliseconds(100), [this]() EXCLUSIVE_LOCKS_REQUIRED(mutexMsgProc) { return fMsgProcWake; });
             }
             fMsgProcWake = false;
    +
    +        FinalizeNodes();
         }
     }

     void CConnman::ThreadI2PAcceptIncoming()
     {
         static constexpr auto err_wait_begin = 1s;
    @@ -3440,12 +3460,14 @@ void CConnman::StopThreads()
         if (threadSocketHandler.joinable())
             threadSocketHandler.join();
     }

     void CConnman::StopNodes()
     {
    +    AssertLockNotHeld(m_nodes_for_finalization_mutex);
    +
         if (fAddressesInitialized) {
             DumpAddresses();
             fAddressesInitialized = false;

             if (m_use_addrman_outgoing) {
                 // Anchor connections are only dumped during clean shutdown.
    @@ -3466,12 +3488,14 @@ void CConnman::StopNodes()
         }
         nodes.clear();

         vhListenSocket.clear();
         semOutbound.reset();
         semAddnode.reset();
    +
    +    FinalizeNodes(); // Finalize any leftovers which were added after the msghand thread was interrupted.
     }

     CConnman::~CConnman()
     {
         Interrupt();
         Stop();
    diff --git i/src/test/fuzz/p2p_handshake.cpp w/src/test/fuzz/p2p_handshake.cpp
    index d881cb35ba..99b2d70706 100644
    --- i/src/test/fuzz/p2p_handshake.cpp
    +++ w/src/test/fuzz/p2p_handshake.cpp
    @@ -61,14 +61,14 @@ FUZZ_TARGET(p2p_handshake, .init = ::initialize)

         LOCK(NetEventsInterface::g_msgproc_mutex);

         std::vector<CNode*> peers;
         const auto num_peers_to_add = fuzzed_data_provider.ConsumeIntegralInRange(1, 3);
         for (int i = 0; i < num_peers_to_add; ++i) {
    -        peers.push_back(ConsumeNodeAsUniquePtr(fuzzed_data_provider, i, [&peerman](CNode& node) {
    -                            peerman->FinalizeNode(node);
    +        peers.push_back(ConsumeNodeAsUniquePtr(fuzzed_data_provider, i, [&connman](CNode& node) {
    +                            connman.RememberForFinalization(node);
                             }).release());
             connman.AddTestNode(*peers.back());
             peerman->InitializeNode(
                 *peers.back(),
                 static_cast<ServiceFlags>(fuzzed_data_provider.ConsumeIntegral<uint64_t>()));
         }
    diff --git i/src/test/fuzz/p2p_headers_presync.cpp w/src/test/fuzz/p2p_headers_presync.cpp
    index 4c470dab1c..d6159efaee 100644
    --- i/src/test/fuzz/p2p_headers_presync.cpp
    +++ w/src/test/fuzz/p2p_headers_presync.cpp
    @@ -58,13 +58,13 @@ void HeadersSyncSetup::ResetAndInitialize()
             ConnectionType::INBOUND
         };

         for (auto conn_type : conn_types) {
             CAddress addr{};
             m_connections.push_back(new CNode(id++, nullptr, addr, 0, 0, addr, "", conn_type, false,
    -                                          [this](CNode& node) { m_node.peerman->FinalizeNode(node); }));
    +                                          [this](CNode& node) { m_node.connman->RememberForFinalization(node); }));
             CNode& p2p_node = *m_connections.back();

             connman.Handshake(
                 /*node=*/p2p_node,
                 /*successfully_connected=*/true,
                 /*remote_services=*/ServiceFlags(NODE_NETWORK | NODE_WITNESS),
    diff --git i/src/test/fuzz/process_message.cpp w/src/test/fuzz/process_message.cpp
    index b2031a60cc..a7b6913584 100644
    --- i/src/test/fuzz/process_message.cpp
    +++ w/src/test/fuzz/process_message.cpp
    @@ -64,13 +64,13 @@ FUZZ_TARGET(process_message, .init = initialize_process_message)

         const std::string random_message_type{fuzzed_data_provider.ConsumeBytesAsString(CMessageHeader::MESSAGE_TYPE_SIZE).c_str()};
         if (!LIMIT_TO_MESSAGE_TYPE.empty() && random_message_type != LIMIT_TO_MESSAGE_TYPE) {
             return;
         }
         CNode& p2p_node = *ConsumeNodeAsUniquePtr(fuzzed_data_provider, std::nullopt, [](CNode& node) {
    -                           g_setup->m_node.peerman->FinalizeNode(node);
    +                           g_setup->m_node.connman->RememberForFinalization(node);
                            }).release();

         connman.AddTestNode(p2p_node);
         FillNode(fuzzed_data_provider, connman, p2p_node);

         const auto mock_time = ConsumeTime(fuzzed_data_provider);
    diff --git i/src/test/fuzz/process_messages.cpp w/src/test/fuzz/process_messages.cpp
    index 89584e3d59..9377327f8e 100644
    --- i/src/test/fuzz/process_messages.cpp
    +++ w/src/test/fuzz/process_messages.cpp
    @@ -53,13 +53,13 @@ FUZZ_TARGET(process_messages, .init = initialize_process_messages)
         LOCK(NetEventsInterface::g_msgproc_mutex);

         std::vector<CNode*> peers;
         const auto num_peers_to_add = fuzzed_data_provider.ConsumeIntegralInRange(1, 3);
         for (int i = 0; i < num_peers_to_add; ++i) {
             peers.push_back(ConsumeNodeAsUniquePtr(fuzzed_data_provider, i, [&connman](CNode& node) {
    -                            connman.MsgProc()->FinalizeNode(node);
    +                            connman.RememberForFinalization(node);
                             }).release());
             CNode& p2p_node = *peers.back();

             FillNode(fuzzed_data_provider, connman, p2p_node);

             connman.AddTestNode(p2p_node);
         }

    Maybe worth exploring in a followup if people think that makes sense. Calling FinalizeNode() on the msghand thread to avoid locking cs_main in the net thread looks out of scope for this PR, which is about removing the manual reference counting of CNode.

  38. DrahtBot added the label Needs rebase on Mar 20, 2025
  39. vasild force-pushed on Mar 20, 2025
  40. vasild commented at 5:45 pm on March 20, 2025: contributor
    8c0ce1ca1c...dcb0be18ac: rebase due to conflicts
  41. DrahtBot removed the label Needs rebase on Mar 20, 2025
  42. hodlinator commented at 8:59 am on March 24, 2025: contributor

    unique_ptr

    Edit: Please see newer branch: #32015 (comment)

    Here is a branch implementing another possible variant, using unique_ptr as suggested by @purpleKarrot in #32015 (comment): https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_unique_ptr

    Pros

    • It keeps the CNode ownership/lifetime under control of CConnman::m_nodes/m_nodes_disconnected. This can be considered an advancement over the unclear ownership/lifetime of shared_ptr (“global variables in disguise”).

    • It avoids the need for introducing ~CNode() and m_destruct_cb together with passing a callback into CNode() everywhere.

    Cons

    • It requires more ongoing awareness of CNode lifetime than when using shared_ptr. However, once we grok CNode lifetimes, they appear fairly simple (unless I’m missing something):
    CConnman::ThreadSocketHandler()
        while (!interruptNet)
            DisconnectNodes()
                for m_nodes
                    move disconnected nodes to m_nodes_disconnected
                for m_nodes_disconnected
                    // Not referenced from other threads.
                    // And no new snapshots possible since we are in m_nodes_disconnected.
                    if refcount <= 0
                        DeleteNode()
            CConnman::SocketHandler()
                NodesSnapshot

    CConnman::ThreadMessageHandler()
        while (!flagInterruptMsgProc)
            NodesSnapshot

    Shutdown() (init.cpp)
        CConnman::Stop()
            StopThreads()
                threadMessageHandler.join()
                threadSocketHandler.join()
            StopNodes()
                for m_nodes
                    DeleteNode()
                for m_nodes_disconnected
                    DeleteNode()
    • It retains CNode::nRefCount (but documents its purpose as tracking NodesSnapshot references).
  43. DrahtBot added the label Needs rebase on Mar 24, 2025
  44. ryanofsky commented at 5:13 pm on April 1, 2025: contributor

    Comparing the current shared_ptr change dcb0be18ac64f001abaf646930bae1b5f6dba8c0 with the suggested unique_ptr change cb959340a6734c3461f6d51a4a689f6067ba5e91, both seem like reasonable approaches.

    • I agree with vasild’s point that the drawbacks of manual reference counting outweigh the drawbacks of shared_ptr, even if the new shared_ptr code makes it a little harder to understand exactly where objects will be deleted. So this PR does seem like an improvement over the status quo.
    • Not sure I understand the purpose of the FinalizeNode patch #32015 (comment) and whether it has benefits related to this PR, or unrelated to it?
    • The suggested unique_ptr branch also seems nice because the changes are broken up into smaller commits. And even though it doesn’t completely remove reference counting, it does hide it inside the NodesSnapshot class, so maybe that is enough to achieve the goals of this PR. I think the branch might also be able to go further, deleting the CNode AddSnapshot, ReleaseSnapshot, and GetSnapshotCount methods and making NodesSnapshot a friend class that accesses the reference count directly (see the sketch below). This could make it easier to see and guarantee that NodesSnapshot is the only thing accessing the reference count.

    Overall it could be nice to adopt at least parts of the unique_ptr branch here.
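    A sketch of the friend-class idea (hypothetical, reduced to a single node for brevity; as hodlinator notes below, CNode would have to name NodesSnapshot, which conflicts with it being private to CConnman):

        #include <atomic>

        class NodesSnapshot; // assumed hoisted out of CConnman for this sketch

        class CNode
        {
            friend class NodesSnapshot;          // the only code allowed to count refs
            std::atomic<int> m_snapshot_refs{0}; // no public AddSnapshot()/ReleaseSnapshot()
        };

        class NodesSnapshot
        {
        public:
            explicit NodesSnapshot(CNode& node) : m_node{node} { ++m_node.m_snapshot_refs; }
            ~NodesSnapshot() { --m_node.m_snapshot_refs; }

        private:
            CNode& m_node;
        };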

  45. hodlinator commented at 7:50 pm on April 1, 2025: contributor

    I think the branch might also be able to also go further, deleting the CNode AddSnapshot, ReleaseSnapshot, and GetSnapshotCount methods and making NodesSnapshot a friend class that accesses the reference count directly. This could make it easier to see and guarantee that NodesSnapshot is the only thing accessing the reference count.

    I thought about making a base class of CNode that exposed snapshot refcounting through a friend-relationship, but I like the fact that NodesSnapshot is a private class inside CConnman (and didn’t see a way to serve both goals).

  46. theuni commented at 9:43 pm on April 11, 2025: member

    As @Sjors mentioned, I attempted to do this a while ago, but brought in a whole new smart pointer impl to keep the threading guarantees, so it never got much review.

    In master deletion of CNode objects could happen in the net thread or in the main thread. In other words, some CNode object might be destroyed in the net thread and another CNode object might be destroyed in the main thread.

    Where can this happen? At least in the current model, that sounds like a bug to me.

    The idea is that it does not matter which thread will destroy the CNode objects.

    What makes this suddenly ok? I recall that deleting from the socket handler thread was a hard requirement for safety/sanity.

    This makes me very uncomfortable. It’s been a while since I’ve looked at this code so the details are fuzzy now, but I don’t like the idea of refactoring the simple refcounters and changing the threading guarantees for no obvious gain. Is there some strong motivation for this I’m missing?

  47. hodlinator commented at 10:26 pm on April 11, 2025: contributor

    In master deletion of CNode objects could happen in the net thread or in the main thread. In other words, some CNode object might be destroyed in the net thread and another CNode object might be destroyed in the main thread.

    Where can this happen? At least in the current model, that sounds like a bug to me.

    The “diagram” at the bottom of #32015 (comment) may be helpful. The net-thread corresponds to CConnman::ThreadSocketHandler() and the main-thread corresponds to Shutdown() (init.cpp). (See calls to DeleteNode() in the diagram).

  48. vasild force-pushed on Apr 13, 2025
  49. vasild commented at 5:22 pm on April 13, 2025: contributor

    dcb0be1...71750da: rebase due to conflicts

    I do not understand the fixation on knowing and controlling where an object is destroyed from. The elegance and simplicity of the whole idea behind shared_ptr is that it does not matter. @hodlinator, I checked https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_unique_ptr, here is some quick feedback:

    • It does not remove the manual reference counting, which is the aim of this PR. That diff mostly replaces the raw pointer CNode* with std::unique_ptr<CNode>.
    • To implement the snapshots, it takes raw pointers out of the unique_ptr and stores them, which is an anti-pattern because the lifetime of the raw pointer then has to be manually kept shorter than the lifetime of the unique_ptr. I think that defeats the purpose of unique_ptr itself.
    • Any operation on the std::vector<std::unique_ptr<CNode>> m_nodes that deletes elements, e.g. clear(), erase(), pop_back() or resize(), will automatically destroy the CNode objects even if they are referenced (manual ref count > 0). @theuni,

    The idea is that it does not matter which thread will destroy the CNode objects.

    What makes this suddenly ok? I recall that deleting from the socket handler thread was a hard requirement for safety/sanity.

    I do not know about that. Can you elaborate? PeerManagerImpl::FinalizeNode() has to be called on the CNode before it is destroyed. Could this be a misunderstanding, i.e. that from this it follows that CNode must be destroyed from a particular thread?

    … I don’t like the idea of refactoring the simple refcounters and changing the threading guarantees for no obvious gain …

    Hmm, “simple refcounters”… IMO those are “complicated”. That might be subjective. But this patch reduces the code by about 70 lines and entirely drops the need for snapshots as well. There is no changing of “threading guarantees” here. You seem to be thinking that CNode objects can and must only be destroyed from the net thread, but that can just as well happen from the main thread. I do not think that this is a bug. And it is just fine, IMO, that with this PR deletion can now happen from the msghand thread. If you think that is not ok, can you explain why?

  50. DrahtBot removed the label Needs rebase on Apr 13, 2025
  51. vasild commented at 4:53 am on April 14, 2025: contributor

    I thought a bit more about the unique_ptr + manual refcounting. It seems to me that the manual refcounting can be replaced by the internal refcounting in shared_ptr, like this:

    • drop CNode::nRefCount; the same counter lives inside shared_ptr
    • CNode::GetSnapshotCount() -> std::shared_ptr::use_count()
    • CNode::AddSnapshot() + copying the raw pointer out of the unique_ptr -> copy the shared_ptr
    • CNode::ReleaseSnapshot() -> destroy the shared_ptr copy
    • drop all of CConnman::NodesSnapshot -> just make a copy of m_nodes under the mutex

    The purpose of CConnman::m_nodes_disconnected and CConnman::DeleteNode() seems to be to call FinalizeNode() before the CNode object is destroyed. If m_nodes_disconnected and DeleteNode() are replaced with calling FinalizeNode() from ~CNode(), then it becomes this PR.

  52. theuni commented at 8:39 pm on April 14, 2025: member

    Note: This is a long and rambling comment, which mostly consists of me reminding myself about the specifics of this code. I’m leaving the whole thing in case my thought process helps any other reviewers.


    The idea is that it does not matter which thread will destroy the CNode objects.

    What makes this suddenly ok? I recall that deleting from the socket handler thread was a hard requirement for safety/sanity.

    I do not know about that. Can you elaborate? PeerManagerImpl::FinalizeNode() has to be called on the CNode before it is destroyed. Could this be some misunderstanding that from this follows that CNode must be destroyed from a particular thread?

    Yes, the main thread may call FinalizeNode(), but at that point the message handler and socket handler threads have already been joined, so that’s not really interesting.

    We currently guarantee that no node will be deleted from threadMessageHandler as it iterates. That’s the one that matters.

    This guarantee protects us from a large footgun. Note that my original motivation for #10738 had nothing to do with code cleanliness; rather, I was targeting a substantial optimization: CConnman::ForNode / CConnman::ForEachNode both hold m_nodes_mutex for the duration of their callbacks. Those callbacks may be long-lived, substantial, and/or have scary locks themselves, so blocking the socket handler thread for them really sucks.

    Crucially, PeerManagerImpl::State(nodeid) will stay valid for any reachable nodeid for the duration of SendMessages and ProcessMessages.

    Now consider if:

    with this PR deletion can now happen from the msghand thread

    A reasonable follow-up to this PR (I actually expected to see it here) would be:

    diff --git a/src/net.cpp b/src/net.cpp
    index b68d2fb37aa..d5631c056ce 100644
    --- a/src/net.cpp
    +++ b/src/net.cpp
    @@ -3902,13 +3902,16 @@ void CConnman::PushMessage(CNode* pnode, CSerializedNetMsg&& msg)

     bool CConnman::ForNode(NodeId id, std::function<bool(CNode* pnode)> func)
     {
    +    std::shared_ptr<CNode> found_node = nullptr;
    +    {
         LOCK(m_nodes_mutex);
         for (auto&& pnode : m_nodes) {
             if(pnode->GetId() == id) {
    -            return NodeFullyConnected(*pnode) && func(pnode.get());
    +            found_node = pnode;
             }
         }
    -    return false;
    +    }
    +    return found_node && NodeFullyConnected(*found_node) && func(found_node.get());
     }

     CSipHasher CConnman::GetDeterministicRandomizer(uint64_t id) const
    diff --git a/src/net.h b/src/net.h
    index e1318c8e66b..bff5c5f423f 100644
    --- a/src/net.h
    +++ b/src/net.h
    @@ -1140,11 +1140,9 @@ public:
         using NodeFn = std::function<void(CNode*)>;
         void ForEachNode(const NodeFn& func)
         {
    -        LOCK(m_nodes_mutex);
    -        for (auto&& node : m_nodes) {
    -            if (NodeFullyConnected(*node)) {
    -                func(node.get());
    -            }
    +        auto nodes = WITH_LOCK(m_nodes_mutex, return m_nodes);
    +        for (auto&& node : nodes) {
    +            if (NodeFullyConnected(*node)) func(node.get());
         }
         };
    

    Right? Because if the threading restriction is “gone”, that seems like it should be safe. Like I said, I did that as part of the PR mentioned above: https://github.com/bitcoin/bitcoin/pull/10738/commits/6dc671a22046c6c09a6de2b1d38e9e2fa3920628

    But if we’re not careful, the nodes could now be deleted as a result of ForNode/ForEachNode, or any function that copies the node vector under the lock and uses the copy.

    The problem with this is that it makes interactions with PeerManager super hard to reason about. Consider a function that does something like this:

    PeerManagerImpl::SomeFunction()
    {
        NodeId to_disconnect = CallSomeFunctionToGetWorstNode();
        m_connman.ForNode(to_disconnect, [&](CNode* pnode) {
            ...
            pnode->fDisconnect = true;
            ...
        });
        SomeFunctionThatTakesAFewSeconds();
        CNodeState *state = State(to_disconnect); // YIKES
    }
    

    The above isn’t contrived either; EvictExtraOutboundPeers/NewPoWValidBlock do things like that.

    HOWEVER

    As-is, the impl here currently protects against nodes being deleted in SendMessages and ProcessMessages using the same refcounting trick that we do in master. Frustratingly, in master it’s already abstracted away by NodesSnapshot so it’s not immediately obvious (RAII notwithstanding, it’s an annoying trend in our repo imo to hide crucial implementation details behind abstractions for the sake of having fewer lines of code). And this PR takes it one step further:

    auto nodes = WITH_LOCK(m_nodes_mutex, return m_nodes);
    

    ^^ This line is now in charge of keeping the shared_ptrs alive until the end of the message handling loop. If not for that (for ex, if we instead kept weak_ptrs and upgraded them as we passed through the loop), the ForEachNode case I explained above would be a problem.
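
    To make the contrast concrete, here is a sketch of that weak_ptr alternative (not what the PR does), which would pin each node only for its own iteration:

        std::vector<std::weak_ptr<CNode>> weak_nodes; // filled from m_nodes under the mutex
        for (const auto& weak : weak_nodes) {
            if (std::shared_ptr<CNode> node = weak.lock()) {
                // The node is kept alive only inside this block; between
                // iterations the last strong reference could drop, and ~CNode
                // (and thus FinalizeNode) could run mid-loop.
            }
        }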

    So I don’t think it’s true (and imo it’s quite dangerous to assume) that it’s “safe to delete a CNode from any thread”. What’s happening here is that it’s now possible that nodes are deleted in two places rather than one: either after a pass through the socket handler loop, or after a pass through the message handler loop. That still makes me uncomfortable, but I admit that I can’t find any problems with it.

    This PR also maintains the invariant that the socket handler thread is always completely finished with all socket operations before FinalizeNode is called.

    And as far as I can tell, this PR does allow us to take the patch above, getting rid of the nasty lock-holding in ForEachNode and ForNode. BUT that would open up the possibility of deletion from other callers. For example, it would make it possible for the RPC thread to delete the node from sendmsgtopeer(). And if any future code stashed the shared_ptrs from CConnman, it would make it possible for the deletion to happen anywhere.


    TLDR

    This is a combination of 2 changes: replacing our refcounters with shared_ptrs (style), and adding a new deletion point (functionality). As far as I can tell, the 2nd change is a side-effect of the first because it’s not straightforward to implement this with shared pointers while avoiding the possible deletion in ThreadMessageHandler(). I really don’t care for behavioral changes that are dictated by style changes. So the question for reviewers, imo, is: is the functional change a good one?

    I don’t trust that it would be safe to simply allow CNodes to be deleted just anywhere. But crucially, that’s not what’s being proposed in the code here. Instead, a single new deletion point has been introduced (at the end of the message processing loop), which is much more reasonable, as it protects SendMessages and ProcessMessages as before.

    I’m not completely opposed to this, but I still find it scary and rather unjustified. I’m not convinced by the “it reduces the lines of code” argument. In fact, I would probably prefer to add comments around the NodesSnapshot to obviate what it’s doing.

    If we’re going to move forward with this I think we need:

    • Split the style change and functional change into separate commits: first the shared_ptr change as a no-op, then the additional deletion point. Edit: Looks like this isn’t realistic.
    • Justification besides “code cleanup”. I don’t think that’s a valid reason to be making behavioral changes to such low-level code.
    • In-place comments that describe the guarantees we provide. If the safety of the p2p layer revolves around a copy of a vector of shared_ptrs, that needs to be made clear. That nodes.clear() is also too crucial to not have a comment.
    • Documentation about the finalization/deletion process in general. This should probably happen even if we don’t take these changes.
  53. ryanofsky commented at 3:21 pm on April 15, 2025: contributor

    @theuni would be curious for your feedback on hodlinator’s unique_ptr + isolated refcount approach in #32015 (comment).

    The more I read here (not fully understanding everything) the more that approach seems appealing because when you say things like “it’s now possible that nodes are deleted in two places rather than one” and “But if we’re not careful, the nodes could now be deleted as a result of ForNode/ForEachNode” it really makes shared_ptr seem like a poor fit. Shared_ptr can be great when objects have trivial destructors and it doesn’t actually matter when they are deleted, but if you need control over when resources are freed, more explicit refcounting would seem safer and simpler.

    It seems like both you and vasild like making a dichotomy between (1) using manual refcounting and (2) using shared_ptr, and not wanting to choose a middle ground like hodlinator’s approach, which provides automatic reference counting through an explicit mechanism that is less flexible but simpler than shared_ptr. I’m not really sure why that is.

  54. theuni commented at 3:47 pm on April 15, 2025: member

    The more I read here (not fully understanding everything) the more that approach seems appealing because when you say things like “it’s now possible that nodes are deleted in two places rather than one” and “But if we’re not careful, the nodes could now be deleted as a result of ForNode/ForEachNode” it really makes shared_ptr seem like a poor fit. Shared_ptr can be great when objects have trivial destructors and it doesn’t actually matter when they are deleted, but if you need control over when resources are freed, more explicit refcounting would seem safer and simpler.

    I agree.

    It seems like both you and vasild like making a dichotomy between (1) using manual refcounting and (2) using shared_ptr, and not wanting to choose a middle ground like hodlinator’s approach, which provides automatic reference counting through an explicit mechanism that is less flexible but simpler than shared_ptr. I’m not really sure why that is.

    No reason, I just haven’t looked at it yet.

    I spent all of yesterday poking around and trying to remind myself of where be the dragons in this code, hence the rambly and not very useful comment :)

    I originally intended to comment that it was definitely unsafe to change the way deletion happens, but after poking around all day I concluded that even though it makes me uneasy (because I feel like I’ve forgotten something important over the years), I can’t really find any tangible issues with the addition of a new deletion point. I wouldn’t feel comfortable enough to ACK the approach, but I recognize that I can’t argue against it either. However, since there’s no real justification for the behavioral change (which I consider scary) other than “code cleanup”, I can’t really get onboard with it.

    If the unique_ptr approach cleans things up without the behavioral change, that sounds much nicer to me. Will have a look.

  55. vasild force-pushed on Apr 15, 2025
  56. vasild commented at 4:15 pm on April 15, 2025: contributor

    71750daa51...75acdcbaa9: add some comments and rename a variable, see below

    We currently guarantee that no node will be deleted from threadMessageHandler as it iterates. That’s the one that matters.

    Yes. I just want to clarify - what do you mean by “from”? I would say “no node will be deleted (does not matter from which thread) while threadMessageHandler iterates”. Is this what you mean? This guarantee is achieved by the caller of ProcessMessages() and SendMessages() taking a snapshot of all nodes to keep them from destruction while those functions are running. That is the same in master and in this PR.

    Crucially, PeerManagerImpl::State(nodeid) will stay valid for any reachable nodeid for the duration of SendMessages and ProcessMessages.

    Yes. True in master and in this PR for the same reason - the caller takes a snapshot while calling those functions.

    Now consider if:

    with this PR deletion can now happen from the msghand thread

    Yes. Sounds good. As long as ProcessMessages() and SendMessages() are protected by the snapshot (they are in master and in this PR).

    A reasonable follow-up to this PR (I actually expected to see it here) would be … Right? Because if the threading restriction is “gone”, that seems like it should be safe.

    Right. That is safe and super reasonable because it minimizes the time m_nodes_mutex is held, like you mentioned. I think that belongs in this PR or a followup.

    But if we’re not careful, the nodes could now be deleted as a result of ForNode/ForEachNode, or any function that copies the node vector under the lock and uses the copy.

    Only if all others have released their copies. In which case it is ok.

    The problem with this is that it makes interactions with PeerManager super hard to reason about. Consider a function that does something like this: PeerManagerImpl::SomeFunction()

    This does not change rules about interactions with PeerManager (so it is not harder or easier, it is the same). The fact that nodes may be deleted if nobody holds a reference to them is true in master and in this PR. I think the rule is “Don’t delete nodes while ProcessMessages() and SendMessages() are running”.

    PeerManagerImpl::SomeFunction() could result in state being nullptr (YIKES) also in master, depending on where SomeFunction() is called from. If that is a problem for SomeFunction() it could either:

    • handle State() returning nullptr (some callers of State() already do that in master) or
    • it could increment the reference count internally at the start and reduce it at the end or demand that its callers do that, like ProcessMessages() and SendMessages().

    This reasoning is unchanged by this PR.
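
    For illustration, the first option is a pattern like this (sketch, reusing the to_disconnect example from above):

        CNodeState* state = State(to_disconnect);
        if (state == nullptr) {
            return; // the node was finalized in the meantime; nothing to do
        }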

    the impl here currently protects against nodes being deleted in SendMessages and ProcessMessages using the same refcounting trick that we do in master.

    Yes.

    auto nodes = WITH_LOCK(m_nodes_mutex, return m_nodes);

    This line is now in charge of keeping the shared_ptrs alive until the end of the message handling loop

    Renamed it to nodes_snapshot and added a comment. I find the pattern used in master and in this PR reasonable:

    1. Take a snapshot of the nodes
    2. Do something with the nodes from the snapshot (Implying they will not get deleted while we work on them, that is obvious, right? What’s the meaning of “snapshot” otherwise?)
    3. Release the snapshot // after the release this code does not care, nodes may get deleted if all others have released their references
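
    The same three steps expressed with shared_ptrs (sketch):

        auto nodes_snapshot = WITH_LOCK(m_nodes_mutex, return m_nodes); // 1. take a snapshot
        for (const auto& node : nodes_snapshot) {
            // 2. work on the nodes; the copied shared_ptrs keep them alive here
        }
        nodes_snapshot.clear(); // 3. release; deletion may follow if nobody else holds them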

    So I don’t think it’s true (and imo it’s quite dangerous to assume) that it’s “safe to delete a CNode from any thread”.

    But why? I agree it is unsafe to:

    A. Delete a node without calling FinalizeNode(), or
    B. Delete a node while there are references to it,

    but the thread that does the deletion? That does not matter.

    What’s happening here is that it’s now possible that nodes are deleted in two places rather than one: either after a pass through the socket handler loop, or after a pass through the message handler loop.

    IMO that is safe as long as A. and B. are guaranteed not to happen.

    This PR also maintains the invariant that the socket handler thread is always completely finished with all socket operations before FinalizeNode is called.

    That is also the same in master. FinalizeNode() is called after CNode::CloseSocketDisconnect(). To me the code in this PR makes that more obvious compared to master because FinalizeNode() is called from the CNode’s destructor and at that time it is clear that all socket operations are completely finished.

    And as far as I can tell, this PR does allow us to take the patch above, getting rid of the nasty lock-holding in ForEachNode and ForNode.

    I think so too.

    BUT that would open up the possibility of deletion from other callers. For example, it would make it possible for the RPC thread to delete the node from sendmsgtopeer(). And if any future code stashed the shared_ptrs from CConnman, it would make it possible for the deletion to happen anywhere.

    Yes, and that is fine. But don’t forget - only if all others have released their copies. We shouldn’t be worried that deleting a CNode object from the RPC thread will break some other thread that is using it. That is because deletion only happens when the reference count drops to zero, which means nobody is using it; at that point it is irrelevant which thread does the deletion, because there are 0 users of the object.

    • Justification besides “code cleanup”. I don’t think that’s a valid reason to be making behavioral changes to such low-level code.

    Added some to the PR description.

    In-place comments that describe the guarantees we provide. If the safety of the p2p layer revolves around a copy of a vector of shared_ptrs, that needs to be made clear. That nodes.clear() is also too crucial to not have a comment.

    Added some comments.

    Documentation about the finalization/deletion process in general. This should probably happen even if we don’t take these changes.

    I agree. Maybe if that doc existed beforehand it would have made this discussion shorter. Opened #32278

  57. vasild commented at 7:29 pm on April 15, 2025: contributor

    when you say things like “it’s now possible that nodes are deleted in two places rather than one” and

    It is actually “… in 3 places rather than 2”

    “But if we’re not careful, the nodes could now be deleted as a result of ForNode/ForEachNode”

    This sounds as if some unexpected deletion is going to happen in the middle of everything, bad things will follow, and this PR is to blame. But that is not true. Nodes could be deleted as a result of ForNode even now in master - if ForNode sets fDisconnect to true, then that will eventually result in the deletion of the node when it is safe to do so. This is both in master and in this PR. Nodes will not be deleted unexpectedly or accidentally while there are references to them just because we are not careful with ForNode (again, same in master and in this PR).

    It seems like both you and vasild like making a dichotomy between (1) using manual refcounting and (2) using shared_ptr, and not wanting to choose a middle ground like hodlinator’s approach, which provides automatic reference counting through an explicit mechanism that is less flexible but simpler than shared_ptr. I’m not really sure why that is.

    I expressed some concerns about the “unique_ptr+manual refcounting” in #32015 (comment)

  58. hodlinator commented at 7:52 am on April 16, 2025: contributor

    I do not understand the fixation on knowing and controlling where an object is destroyed from. The elegance and simplicity behind the sole idea of shared_ptr is that it does not matter.

    I used to appreciate shared_ptr-like constructs, but as @purpleKarrot wrote, they introduce global variables in disguise. Using unique_ptr prevents that and makes ownership clear, which is a step forward.

    @hodlinator, I checked master…hodlinator:bitcoin:pr/32015_unique_ptr, here is some quick feedback:

    • Does not remove the manual reference counting which is the aim of this PR. That diff mostly replaces the raw pointer CNode* with std::unique_ptr<CNode>.

    Yes, the PR instead makes clear that refcounting is there just to track live snapshots. Making that sole purpose clear removes the itch to eradicate it for me.

    • To implement the snapshots it takes and stores raw pointers away from the unique_ptr, which is an anti-pattern because then the lifetime of the raw pointer has to be manually kept shorter than the lifetime of the unique_ptr. I think that defeats the purpose of unique_ptr itself.

    In a post-unique_ptr (and shared_ptr) world, raw pointers indicate non-ownership. It is subtle but not an anti-pattern.

    • Any operation on the std::vector<std::unique_ptr<CNode>> m_nodes that deletes elements, e.g. clear(), erase(), pop_back() or resize() will automatically destroy the CNode objects even if they are referenced (manual ref count >0).

    Good point. Might be worth adding ~CNode() { Assume(GetSnapshotCount() == 0); }.

  59. vasild commented at 8:16 am on April 16, 2025: contributor

    Consider a function like that does something like this: … PeerManagerImpl::SomeFunction()

    Here is an example that YIKES on master as well:

    diff --git i/src/net.cpp w/src/net.cpp
    index fc0edc1a5c..c48a62f31e 100644
    --- i/src/net.cpp
    +++ w/src/net.cpp
    @@ -2758,12 +2758,14 @@ void CConnman::ThreadOpenConnections(const std::vector<std::string> connect, std
             const auto current_time{NodeClock::now()};
             int nTries = 0;
             const auto reachable_nets{g_reachable_nets.All()};

             while (!interruptNet)
             {
    +            m_msgproc->HasAllDesirableServiceFlags(ServiceFlags{123});
    +
                 if (anchor && !m_anchors.empty()) {
                     const CAddress addr = m_anchors.back();
                     m_anchors.pop_back();
                     if (!addr.IsValid() || IsLocal(addr) || !g_reachable_nets.Contains(addr) ||
                         !m_msgproc->HasAllDesirableServiceFlags(addr.nServices) ||
                         outbound_ipv46_peer_netgroups.count(m_netgroupman.GetGroup(addr))) continue;
    diff --git i/src/net_processing.cpp w/src/net_processing.cpp
    index 0b25396dd2..6148c1d8de 100644
    --- i/src/net_processing.cpp
    +++ w/src/net_processing.cpp
    @@ -519,12 +519,13 @@ public:
             EXCLUSIVE_LOCKS_REQUIRED(!m_most_recent_block_mutex);

         /** Implement NetEventsInterface */
         void InitializeNode(const CNode& node, ServiceFlags our_services) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_tx_download_mutex);
         void FinalizeNode(const CNode& node) override EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_headers_presync_mutex, !m_tx_download_mutex);
         bool HasAllDesirableServiceFlags(ServiceFlags services) const override;
    +    void SomeFunction() const;
         bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, !m_headers_presync_mutex, g_msgproc_mutex, !m_tx_download_mutex);
         bool SendMessages(CNode* pto) override
             EXCLUSIVE_LOCKS_REQUIRED(!m_peer_mutex, !m_most_recent_block_mutex, g_msgproc_mutex, !m_tx_download_mutex);

         /** Implement PeerManager */
    @@ -1660,14 +1661,32 @@ void PeerManagerImpl::FinalizeNode(const CNode& node)
             LOCK(m_headers_presync_mutex);
             m_headers_presync_stats.erase(nodeid);
         }
         LogDebug(BCLog::NET, "Cleared nodestate for peer=%d\n", nodeid);
     }

    +void PeerManagerImpl::SomeFunction() const
    +{
    +    const NodeId to_disconnect{0};
    +    m_connman.ForNode(to_disconnect, [&](CNode* node) {
    +        node->fDisconnect = true;
    +        return true;
    +    });
    +    while (WITH_LOCK(cs_main, return State(to_disconnect)) != nullptr) {
    +        std::this_thread::sleep_for(1s);
    +    }
    +    LogInfo("YIKES");
    +    Assert(false);
    +}
    +
     bool PeerManagerImpl::HasAllDesirableServiceFlags(ServiceFlags services) const
     {
    +    if (services == 123) {
    +        SomeFunction();
    +    }
    +
         // Shortcut for (services & GetDesirableServiceFlags(services)) == GetDesirableServiceFlags(services)
         return !(GetDesirableServiceFlags(services) & (~services));
     }

     ServiceFlags PeerManagerImpl::GetDesirableServiceFlags(ServiceFlags services) const
     {
    

    That might happen on master and on this PR. I would say that for each yikes in this PR there is an equivalent yikes on master; in other words, this PR is not prone to more yikes. The rule is the same - “if code relies on a CNode not being deleted, then it should either hold m_nodes_mutex, or increment the reference count during its usage”.

    Split the style change and functional change into separate commits: first the shared_ptr change as a no-op, then the additional deletion point. Edit: Looks like this isn’t realistic.

    I just did that, see https://github.com/vasild/bitcoin/commits/cnode_shared_ptr_split_in_two_commits/ The intermediate state after the first commit is somewhat odd - we have shared_ptr but don’t let it destroy the owned object by itself when we drop the shared_ptr, e.g. just remove it from the vector. Instead, in this intermediate state we monitor the use_count() and call reset() manually. Also NodesSnapshot is a useless wrapper that just copies the vector. The second commit cleans all that up and arrives at the same end result as this PR.

  60. hodlinator commented at 1:55 pm on April 16, 2025: contributor

    Published new iteration of unique_ptr change in separate branch: https://github.com/bitcoin/bitcoin/compare/master...hodlinator:bitcoin:pr/32015_unique_ptr2

    • Moved changes clarifying CNode::nRefCount really only being used for NodesSnapshots to happen before the unique_ptr commit.
    • Split the CNode::nRefCount commit in two: one to change usage to only be for snapshots, one to rename the methods. Now we also rename the field to m_snapshot_count and group the refcounting methods together.
    • Added ~CNode()-dtor with assert(GetSnapshotCount() == 0) in order to catch invalid operations like CConnman::m_nodes.clear() as pointed out by vasild.
    • Added speculative final commit to introduce use of NodesSnapshot in ForNode and ForEachNode to decrease time m_nodes_mutex is locked, inspired by @theuni’s comment. Feels fairly safe given that we only delete nodes with 0 snapshots prior to CConnman::Stop() & StopNodes(). If another thread was still inside ForNode/ForEachNode at that point, we are already shutting down, and the asserts for zero snapshots in DeleteNode() will simply kill the process quicker. Should it be deemed too risky this commit could be skipped.
    • Moved some paranoid TRACEPOINT changes and making StopThreads & StopNodes private into early “Minor improvements” commit.
    • DeleteNode(std::unique_ptr<CNode> pnode) -> DeleteNode(std::unique_ptr<CNode>&& pnode): Clarifies that we are assuming ownership of the unique_ptr.
    • Rebased on same parent as this PR.
  61. theuni commented at 10:17 pm on April 16, 2025: member

    I’ve been working on some experiments and tests for this again today, but unfortunately I didn’t have time to get finished.

    I’ll be away until Tuesday, but I’ll resume here next week. Sorry for the delay :(

  62. in src/net.h:1435 in 75acdcbaa9 outdated
    1431@@ -1442,8 +1432,7 @@ class CConnman
    1432     std::vector<AddedNodeParams> m_added_node_params GUARDED_BY(m_added_nodes_mutex);
    1433 
    1434     mutable Mutex m_added_nodes_mutex;
    1435-    std::vector<CNode*> m_nodes GUARDED_BY(m_nodes_mutex);
    1436-    std::list<CNode*> m_nodes_disconnected;
    1437+    std::vector<std::shared_ptr<CNode>> m_nodes GUARDED_BY(m_nodes_mutex);
    


    vasild commented at 11:49 am on April 17, 2025:

    Bringing the anti-shared_ptr discussion to its own dedicated thread here.

    #32015 (comment) Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics?

    #32015 (comment) I used to appreciate shared_ptr-like constructs, but as @purpleKarrot wrote in #32015 (comment), they introduce global variables in disguise.

    How come? The CNode objects have a lifetime equal to the lifetime of the connection with the peer, usually much less than the lifetime of a global variable (which is the lifetime of the entire program). Also those CNode objects are in a private variable inside CConnman. Global variables can be accessed from everywhere. Private class members are the opposite.

    would it be possible to increase the level of abstraction and pass around values with regular semantics?

    • ProcessMessages() and SendMessages() take CNode* and they can / should be changed to take CNode&.
    • FindNode() returns shared_ptr, this can be changed to return bool and the sole user that makes use of the return value can be changed to use ForNode().
    • ConnectNode() is the function that creates the objects and it returns shared_ptr, not sure if it makes sense to change it?
    • GenerateWaitSockets() takes a const vector of shared_ptr. It needs just a list of nodes, could take a vector of CNode, but since we already have a vector of shared_ptr we pass that. Same for SocketHandlerConnected() and both need to run on the same “snapshot” when called one after the other.

    If there is review interest I would pursue those.


    hodlinator commented at 8:07 pm on April 17, 2025:
    1. The intent behind the code on master is to delete the nodes at specific points; otherwise CNode::Release() would just do if (--nRefCount == 0) delete this (and there would be different logic around m_nodes_disconnected). You might be correct in that having been okay; nobody has been able to prove otherwise in this PR so far.

    2. I agree that shared_ptr-usage in this PR not being exposed in the public interface of CConnman is a mitigating factor.

    3. I bet you’ve checked all angles I would check when it comes to how the shared_ptr would lead to CNode destruction in the current PR.

    4. shared_ptr still opens up the door towards future changes in the direction of chaos when compared to unique_ptr. Nailing down where ownership/lifetime is managed helps fresh eyes follow the code, improving maintainability. Using shared_ptr can be perceived as a cop-out, saying it will be destroyed whenever & wherever the last reference is destroyed. shared_ptr in other cases/projects is often an indication of imprecise/prototype-quality code, so as a general rule it makes sense to avoid it if possible.

    5. This is not my first objection to shared_ptr in this project; I experimented with removing it from miniscript code here: #30866#pullrequestreview-2434704657, which led to uncovering an extra issue to fix in that PR. The PR settled on taking that fix and also replacing shared_ptr with unique_ptr for a less invasive change. I have #31713 which replaces those new unique_ptrs with straight-up instances, so I’m by no means wedded to unique_ptr if there’s a better option.

    • ProcessMessages() and SendMessages() take CNode* and they can / should be changed to take CNode&.
    • FindNode() returns shared_ptr, this can be changed to return bool and the sole user that makes use of the return value can be changed to use ForNode().
    • ConnectNode() is the function that creates the objects and it returns shared_ptr, not sure if it makes sense to change it?
    • GenerateWaitSockets() takes a const vector of shared_ptr. It needs just a list of nodes, could take a vector of CNode, but since we already have a vector of shared_ptr we pass that. Same for SocketHandlerConnected() and both need to run on the same “snapshot” when called one after the other.

    If there is review interest I would pursue those.

    IMO if we go with shared_ptr we don’t need to change those, thanks for investigating though!


    vasild commented at 2:50 pm on April 22, 2025:

    Thanks for the input, I respect and agree with most of it. The only bit I do not agree with is a general rule to avoid shared_ptr. I agree using unique_ptr is better than shared_ptr, especially if unique_ptr is used as intended - to manage the lifetime of the owned object. If we start extracting the raw pointers from unique_ptr, managing them separately and manually ref-counting them, then maybe it is better to just use shared_ptr.

    FindNode() returns shared_ptr, this can be changed to return bool and the sole user that makes use of the return value can be changed to use ForNode()

    I think that’s worth doing, regardless of this PR. Done in https://github.com/bitcoin/bitcoin/pull/32326


    hodlinator commented at 7:59 am on April 24, 2025:

    If we start extracting the raw pointers from unique_ptr, managing them separately and manually ref-counting them, then maybe it is better to just use shared_ptr.

    I think doing ref-counting for snapshots is fine, but agree raw pointers in this codebase are ambiguous. It would be one thing if we started over with strict policies for shared_ptr/unique_ptr/raw pointers. Then one could consistently use raw pointers to always mean a non-owning “view-pointer” (https://github.com/bitcoin/bitcoin/pull/32015#issuecomment-2712060971).

    Maybe that distinction could still be made by introducing something like:

    template <typename T>
    class ViewPtr
    {
        T* ptr;
    public:
        explicit ViewPtr(T* p) : ptr{p} {}
        T* operator->() const { return ptr; }
        T& operator*() const { return *ptr; }
    };
    

    + OwnerPtr<T> in the vein of GSL: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#SS-views

    I think that’s worth doing, regardless of this PR. Done in #32326

    Good call on peeling off a separate PR as it’s nice but not a requirement here.

  63. theuni commented at 7:17 pm on April 24, 2025: member

    Ok, I’m back after a week of paging all of this back in and hacking around. This is actually something I wanted to get cleaned up a long time ago, so I’m happy it has eyes on it again. Sorry in advance for the novel :)


    I think the discussions about shared_ptr/unique_ptr/refcounting have distracted from what really needs to be discussed here: the lifetime and concurrency requirements/guarantees of the various subsystems.

    The refcounters are (ab)used as a generic means of controlling the teardown order of a CNode. As I mentioned above, historically, this has meant that disconnection/finalization/deletion have always happened on the socket handler thread and always in that order (with the exception of when we’re shutting down).

    However, as @vasild has pointed out, this is neither ideal nor necessary.

    So rather than starting by discussing what types of pointers we should be using, we need to start by figuring out what we’re trying to solve/improve. And before we can do that, we need to lay out the way things currently work.

    These are the important guarantees at the moment, as I see them:

    FinalizeNode may only be called once per node.

    This one is obvious, but it’s necessary to keep in mind that we shouldn’t be introducing any possible races.

    FinalizeNode will never be called as a result of setting fDisconnect from ProcessMessages/SendMessages

    This prevents any pointers obtained by calling State(id) or GetPeerRef(id) from being yanked out from under us when we’re in those functions.

    I think this is the one that has led to the workarounds and shared_ptr discussion here. It is currently enforced by NodesSnapshot: a class which bumps the refcount for each node before calling ProcessMessages/SendMessages, then decref’ing them when finished. This has the effect of keeping them all alive for the duration of a single pass through the loop. @vasild’s observation was that this looks a lot like an ownership model which can be handled with shared_ptrs by taking advantage of their inherent refcounting and introducing a deletion point in ThreadMessageHandler.
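
    As a sketch of that mechanism (not the exact code in master, which has more details):

        class NodesSnapshot
        {
            std::vector<CNode*> m_nodes_copy;
        public:
            explicit NodesSnapshot(const std::vector<CNode*>& nodes) : m_nodes_copy{nodes}
            {
                for (auto* node : m_nodes_copy) node->AddRef(); // pin every node for one pass
            }
            ~NodesSnapshot()
            {
                for (auto* node : m_nodes_copy) node->Release(); // unpin; deletion may follow
            }
            const std::vector<CNode*>& Nodes() const { return m_nodes_copy; }
        };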

    I maintain that this is non-ideal and foot-gunny, as making a copy of m_nodes is the mechanism used to enforce the above contract, though it’s not at all obvious that it’s doing so.

    Finalization happens just before deletion, both on the socket handler thread

    Again, this one isn’t strictly necessary, but it’s currently the case.

    And again, @vasild’s shared_ptr approach handles this by removing that restriction, and calling FinalizeNode as part of CNode’s destructor, as that means that its refcount has reached zero.

    He points out that it is safe to call this sometimes from the message handler thread instead, as long as it is only called once. I agree with this and suggest (see below) taking it a step further and only calling it from the message handler thread, as that guarantees that it can’t be running during ProcessMessages/SendMessages.

    Disconnection always happens before calling FinalizeNode.

    I’ll come back to this one later as it’s not strictly necessary, but it’s currently the case. By the time FinalizeNode is called, the socket handler thread is completely done with the node (the socket has been shutdown).

    When the socket handler disconnects a peer, it is almost immediately (in the same pass through the loop) removed from m_nodes

    This one is annoying, but the unit tests depend on it. Basically, as soon as we notice that a peer has disconnected, the socket handler thread takes care to ensure that other subsystems won’t find it by looking through m_nodes. This includes ForEachNode/ForNode/CopyStats, etc.

    Only a single thread deletes CNodes.

    Duh.

    Any other function passing a CNode (for ex, a caller of ForNode() from RPC) shouldn’t have it deleted for its duration

    This is a completely separate issue from the disconnection/finalization semantics, and is a good use-case for shared_ptrs.


    In reality, the refcounting (and by extension NodesSnapshot) has been used as a loose way of enforcing teardown semantics. It’s true that we can mimic those semantics by introducing shared_ptrs to manage the nodes instead, but that further obscures what the actual requirements are. As a few people have pointed out on this PR, abusing the facilities of shared_ptr/unique_ptr to replace the manual refcounting is not the solution. Rather, we should enforce the actual semantics in code.

    I’ve pushed a branch here to illustrate the above: https://github.com/theuni/bitcoin/commits/finalize_threadheandler2/

    It makes the following changes:

    • Expose m_nodes_disconnected to the message handling thread.
    • Add CNode::m_finalized which is set once FinalizeNode has been called. This, combined with the move from m_nodes to m_nodes_disconnected is enough for us to determine when the socket handler/message handler threads are done with a CNode.
    • Call FinalizeNode from the top/bottom of the message handler thread instead, to guarantee that it will never be called from another thread during a ProcessMessages/SendMessages invocation. This also makes much more sense conceptually, as FinalizeNode is not something the socket handler should have anything to do with.

    With the above changes in place, the ordering of the teardown is fully proceduralized, so the refcounting is no longer needed and is thus removed. At that point, the CNode can be safely deleted once (in no particular order):

    • It has been moved to m_nodes_disconnected
    • CNode::m_finalized is true
    • No other threads (RPC via ForNode as an example) are holding it

    In order to guarantee the last one, CNodes are kept in m_nodes and m_nodes_disconnected as shared_ptrs. Since the teardown semantics are enforced elsewhere, deletion is now uninteresting and can happen safely from any thread.
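
    Putting those conditions together, deletion safety reduces to a check like this (CanDelete() is a hypothetical helper; the branch expresses it differently):

        // A node that sits in m_nodes_disconnected holds its last owning
        // reference there, so use_count() == 1 means no other thread pins it.
        bool CanDelete(const std::shared_ptr<CNode>& node)
        {
            return node->m_finalized && node.use_count() == 1;
        }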

    It also contains a few optimizations that become possible once the teardown procedure is proceduralized:

    • m_nodes_mutex is no longer held for the duration of ForNode/ForEachNode. This should be a nice improvement, especially for nodes with hundreds of connections.
    • FinalizeNode is now allowed to happen before disconnection, as there’s no actual need to order those events.

    I realize that I’ve duplicated a good amount of work here like the shared_ptr commit as well as some of the pre-requisite cleanup commits that are similar to @hodlinator’s. It’s also incomplete, for example it’s missing the fuzzing fixups that @vasild has already done, as well as a bunch of additional comments/docs as I’ve suggested already. I’d be happy to rebase my branch to include those. I agree with #32326 as well. It’s not necessary for my branch, but it’s a nice improvement and I did have a similar commit at one point for the exact same reason.

    I feel pretty strongly about addressing the teardown order procedurally rather than mimicking our current refcounting semantics which don’t make the requirements/guarantees clear. Ultimately I agree that shared_ptrs are best to enforce the lifetimes, but only after codifying the other rules.

    I think the code in my branch is much easier to reason about. There’s no hidden side-effects of node deletion, magic copies of m_nodes aren’t used to prevent FinalizeNode being called at the wrong time, and disconnection/finalization/deletion no longer carry hidden dependencies on each other.

  64. theuni commented at 1:56 pm on April 30, 2025: member

    Can we maybe just hold off on this for now? The investigation above prompted me to start working on a much larger refactor... one that would finish the network refactor (CConnman creation) that I never managed to fully complete almost a decade ago.

    I’ve been head-down working on it, hope to have a POC to demonstrate in the next week or two.

  65. vasild commented at 3:22 pm on April 30, 2025: contributor
    I am incorporating your feedback here, looks good so far.
  66. vasild force-pushed on May 1, 2025
  67. vasild commented at 10:10 am on May 1, 2025: contributor

    75acdcbaa9...15cc989538: incorporate feedback from above

    I took your branch https://github.com/theuni/bitcoin/commits/finalize_threadheandler2/ and made some changes:

    No need to ping-pong the teardown of a CNode between the net and msghand threads so much

    In finalize_threadheandler2:

    1. The net thread would disconnect a node and move it from m_nodes to m_nodes_disconnected.
    2. Then the msghand thread would finalize those nodes.
    3. Then the net thread would delete finalized nodes.

    I simplified this to:

    1. same as above; the sole fact that a node is in m_nodes_disconnected now means that it has been disconnected but not yet finalized.
    2. same as above; plus also delete the nodes.

    So CNode::m_finalized is not needed because it becomes implicit: while a node is in m_nodes_disconnected it has not been finalized; once it is removed from there, it has been finalized and deleted.
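
    A sketch of that simplified step on the msghand thread (the mutex name is assumed from the “add a lock for m_nodes_disconnected” commit below):

        LOCK(m_nodes_disconnected_mutex);
        for (auto it = m_nodes_disconnected.begin(); it != m_nodes_disconnected.end();) {
            if (it->use_count() == 1) {              // no other thread holds the node
                m_msgproc->FinalizeNode(**it);       // membership here implies "not yet finalized"
                it = m_nodes_disconnected.erase(it); // destroys the CNode
            } else {
                ++it;
            }
        }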

    Reduced the optimization about holding m_nodes_mutex for long-lived callbacks

    I did not remove m_nodes_mutex from ForEachNode() because in one of its usages the callback calls m_connman.MultipleManualOrFullOutboundConns() which requires m_nodes_mutex. Dealing with that deserves a separate PR.

    Omitted the optimization to finalize a node at the end of the message handler loop

    Could do that, but to avoid a second finalization it would need CNode::m_finalized just for that. I think it is not worth having an extra member in CNode for this optimization. Can be done in a followup if desired.

    Other minor tweaks and polishes

    Like using std::ranges::foo() instead of std::foo(), and a simpler CConnman::DisconnectNodes() which uses no temporary variables when moving nodes around.


    This PR deviated from the usual workflow where there is one author and reviewers that only review and don’t modify the code. The current state is that the PR has two authors - @theuni and @vasild - who both reviewed and authored code. Let’s see how this pans out…

  68. vasild commented at 10:14 am on May 1, 2025: contributor

    Can we maybe just hold off on this for now? The investigation above prompted me to start working on a much larger refactor…

    Not sure about that because I have not seen that other refactor. In general I prefer to do stuff in smaller, more manageable steps. I guess a further refactor should be possible on top of this PR. I would be happy to review. This PR should make a more solid foundation for further changes in the code.

  69. DrahtBot added the label CI failed on May 1, 2025
  70. DrahtBot commented at 11:47 am on May 1, 2025: contributor

    🚧 At least one of the CI tasks failed. Task tidy: https://github.com/bitcoin/bitcoin/runs/41477750293 LLM reason (✨ experimental): Clang-tidy found a performance warning treated as an error, causing CI failure.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  71. net: store CNode objects as shared_ptr
    This is a non-functional change - the manual reference counting,
    disconnection, finalization and deletion are unchanged.
    4326f97dc6
  72. net: remove unnecessary nodes_copy and nodes_disconnected_copy temporaries
    The other copies have a purpose for locking; these were just an awkward
    convenience for deleting while iterating.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    9c1d580dde
  73. net: add a lock for m_nodes_disconnected
    Subsequent commits will allow this to be shared with the message
    handling thread.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    04eb706008
  74. net: finalize and delete nodes from ThreadMessageHandler()
    Rather than relying on refcounting to abstract away our teardown
    procedure, make the rules explicit:
    - FinalizeNode() is only called when ProcessMessages() and
      SendMessages() are not running. This is enforced by finalizing only at
      the top of the message handler loop.
    - FinalizeNode() may only be called on nodes which have been
      disconnected, by virtue of finding them in m_nodes_disconnected. This
      enforces the order: disconnect -> finalize -> delete. This order is
      the current behavior but is not strictly necessary, and the
      restriction will be removed in a later commit.
    - Nodes are only deleted after they've been disconnected and finalized.
      This is enforced by deleting them from m_nodes_disconnected only
      after calling FinalizeNode().
    - Callers of ForNode() and ForEachNode() are currently protected from
      finalization or deletion due to their locking of m_nodes_mutex.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    9b8dc37ced
  75. net: remove CNode reference counting and NodesSnapshot
    Snapshots of `m_nodes` are not needed anymore because finalization and
    deletion of nodes only happen in `ThreadMessageHandler()` and only
    after nodes have been disconnected and moved from `m_nodes` to
    `m_nodes_disconnected`.
    
    The manual reference counting becomes unused, remove it too.
    
    Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
    9b8e9d40d6
  76. net: don't hold m_nodes_mutex for ForNode() callbacks
    The shared_ptrs now ensure that the CNodes will not be deleted for the
    duration of the callbacks.
    
    Disconnection/Finalization may happen while the shared_ptr is being held,
    though it won't be deleted for the duration of the callback.
    
    The majority of uses are in ProcessMessages and SendMessages, during which
    disconnection may occur but FinalizeNode will not be called.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    50e33ff8df
  77. net: optimization: wake message handler when nodes are disconnected
    This keeps us from waiting for the condvar timeout (currently 100ms) before
    finalizing a node that has been disconnected on the socket handler thread.
    
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    8b93e0894f
  78. vasild force-pushed on May 1, 2025
  79. vasild commented at 12:34 pm on May 1, 2025: contributor
    15cc989538...8b93e0894f: fix CI clang-tidy
  80. DrahtBot removed the label CI failed on May 1, 2025
