dergoegge
commented at 4:30 pm on August 10, 2023:
member
This PR is supposed to provide context for some of the refactoring PRs I’ve been working on (#25268, #26621, etc).
The aim is to completely decouple CConnman and its internals from PeerManager to allow for isolated testing of our message processing code (isolated from net only as isolating from validation is another can of worms). To get there, this work refactors CConnman’s API to use NodeId as the primary handle for managing connections and defines a new interface ConnectionsInterface that describes the interface between PeerManager and CConnman. The result is that CNode is no longer part of this interface and access to it is no longer needed within net processing.
Most of the API simply shifts from CConnman::DoX(CNode*) to CConnman::DoX(NodeId) and where possible the PR minimizes the need for new methods on CConnman to avoid the slight overhead of (internally) locking m_nodes_mutex to look up the right CNode by its id.
Some steps this PR is taking:
We can’t test PeerManager in isolation if it is not the owner of its own state. Therefore this PR moves all the remaining application layer data from net to net processing.
Some use of CNode in net processing only accesses constant data that is available from the time of connection establishment. This PR introduces ConnectionContext which is meant to hold all this constant connection data and for each connection a copy is stored on the corresponding Peer instance. This reduces the need for accessing CNode in net processing and avoids some For{Each}Node calls (which are a horrible layer violation).
Some use of CNode in net processing directly modifies eviction related members of CNode. Instead of wrapping access to these members in CConnman methods, this PR creates a new manager (EvictionManager) that encapsulates the eviction logic as well as the required per-peer data (entirely removing the relevant CNode members).
To disconnect a peer , PeerManager directly sets the disconnection flag CNode::fDisconnect to true. This PR wraps access to that flag on CConnman, i.e. CConnman::IsDisconnected & CConnman::DisconnectNode(NodeId).
Finally, the PR also introduces initial unit tests for PeerManager that make use of a mocked ConnectionsInterface instead of the actual CConnman.
Looking for conceptual & approach review.
This is not super polished so I would expect the CI to fail but I will break this up and polish if there is enough conceptual support. I will not keep this rebased as the general architectural direction this is taking should be clear in any case.
[test] Make all denial of service tests use their own connman/peermane86cd1f977
[eviction] Create EvictionManager stub981617d338
[eviction] Make EvictionManager keep track of candidates6de4364e8e
[net] Get candidates from eviction manager
We temporarily introduce a method to get a specific
NodeEvictionCandidate from the eviction manager for use in
`CConnman::AttemptToEvictConnection()`. `EvictionManager::GetCandidate`
is removed once we are done moving the remaining eviction state from
`CNode` to the eviction manager.
bc7e7fc60a
[net] Remove unused CNode state
Now that the eviction manager is aware of the keyed net group and
whether or not a peer is prefered for eviction, we can get rid of the
corresponding CNode members, since they are not used elsewhere.
7f7d02cdb6
[eviction] Move ping/block/tx times from CNode to EvictionManager157afa0b00
[eviction] Report services to EvictionManager5a3ecac7c0
[eviction] Move BIP37 tx relay and bloom filter status to EvictionManager98d484706e
[eviction] Move SelectNodeToEvict into EvictionManager3bdaabd27d
[refactor] Split internals of EvictionManager::SelectOutboundNodesToEvict into two functions221f5aa429
scripted-diff: Rename SelectNodeToEvict for clarification
SelectNodeToEvict only suggests inbound nodes to evict.
-BEGIN VERIFY SCRIPT-
ren() { sed -i "s:\<$1\>:$2:g" $(git grep -l "\<$1\>" ./src ./test); }
ren SelectNodeToEvict SelectInboundNodeToEvict
-END VERIFY SCRIPT-
314b533ceb
--- eviction stats ---458fb9ade0
scripted-diff: rename CNodeStateStats
Most of the stats in CNodeStateStats are now taken from the Peer object.
Rename the struct to PeerStats.
-BEGIN VERIFY SCRIPT-
ren() { sed -i "s:\<$1\>:$2:g" $(git grep -l "\<$1\>" ./src ./test); }
ren CNodeStateStats PeerStats
ren GetNodeStateStats GetPeerStats
ren nodeStateStats m_peer_stats
ren statestats peer_stats
ren fNodeStateStatsAvailable m_peer_stats_available
ren fStateStats peer_stats_available
-END VERIFY SCRIPT-
a210244ca9
[net processing] Move cleanSubVer to net_processing
Also rename to m_clean_subversion.
Since the subversion is now contained in the Peer object, it may not be
available in `getpeerinfo` while the peer connection is being torn down.
Update the rpc help text to mark the field optional and update the tests
to tolerate the field being absent.
5c16c704b0
[net processing] Remove CNode::nLocalHostNonce
Handle version nonces entirely within net_processing.
1be89db847
[net processing] Move m_bip152_highbandwidth_{to,from} to net_processingee92e0ddfa
[net processing] Move nTimeOffset to net_processing69d9d7511a
[net processing] Move nVersion to net_processingae5b4a1914
[net processing] Move m_greatest_common_version to net_processingee06a1d1cb
[net processing] Move CNode::addrLocal to Peer605e41b303
--- move app-data to net proc ---02c48eadea
[net] Introduce ConnectionContextcc46840a16
[net] CNode holds ConnectionContext3c7b8e247b
--- connection context ---128b9fa406
[net processing] Store copy of ConnectionContext on Peer2dcead90b3
[net processing] Pass ConnectionContext to AddTxAnnouncement2d121e273b
[net processing] Pass ConnectionContext to RejectIncomingTxsedca8a4efb
[net processing] Don't pass CNode to SetupAddressRelay6989fac39b
--- connection context (trivial cases net processing) ---8f4cc0ada0
[net] Convert m_nodes to std::unordered_map
For quick lookups by id.
In tests it is now necessary to use ConnmanTestMsg with AddTestNode and
ClearTestNodes, to make sure map lookups succeed.
7a2791246d
[net] Use NodeId map lookups instead of loopingafd45141fa
--- m_nodes as map ---abd5f1c86f
[net processing] Use CConnman::DisconnectNode over flipping flag6107379ce4
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
DrahtBot added the label
Needs rebase
on Aug 10, 2023
DrahtBot
commented at 5:32 pm on August 10, 2023:
contributor
🐙 This pull request conflicts with the target branch and needs rebase.
ajtowns
commented at 2:07 am on August 11, 2023:
contributor
Most of the API simply shifts from CConnman::DoX(CNode*) to CConnman::DoX(NodeId)
That seems undesirable to me? Where possible, acting directly on the node via a pointer/reference is much better than taking a global lock, doing a lookup, and then acting.
We can’t test PeerManager in isolation if it is not the owner of its own state.
I think it’s a mistake trying to have many different owners of “per-node” state – that increases the lookups, locks and coordination required when adding/removing nodes. I think a simpler approach would be to add a unique_ptr<Peer> (and perhaps likewise for other per-peer data) to CNode, which is initialized via NetEventsInterface::InitializeNode. If you want to test PeerManager without a CConnman, it then becomes the test framework’s job to set up a bunch of CNode objects and initialize them. In that world, the Peer struct is just the processing-related data members of CNode, and largely not something that should be passed around independently (let alone in addition to) the CNode its embedded in.
dergoegge
commented at 10:54 am on August 11, 2023:
member
@ajtowns Thanks for the quick feedback! I think we’ll mostly disagree here but I appreciate your opinion.
Most of the API simply shifts from CConnman::DoX(CNode*) to CConnman::DoX(NodeId)
That seems undesirable to me? Where possible, acting directly on the node via a pointer/reference is much better than taking a global lock, doing a lookup, and then acting.
While it is obviously true that directly accessing the CNode pointers is more performant (no locking or look ups), accessing internals of other components like this leads to spaghetti code and the performance win of micro or nano seconds is just not worth it.
I think a simpler approach would be to add a unique_ptr<Peer> (and perhaps likewise for other per-peer data) to CNode, which is initialized via NetEventsInterface::InitializeNode.
This sounds like doing a complete 180 on the work that happened over multiple years to decouple net and net processing. I’m not re-inventing the wheel in this PR, this general direction has been what other contributors have had in mind and worked on for multiple years. I’d prefer finishing up previous work and making things consistent instead of going in an entirely new direction.
ajtowns
commented at 11:23 am on August 11, 2023:
contributor
While it is obviously true that directly accessing the CNode pointers is more performant (no locking or look ups), accessing internals of other components like this leads to spaghetti code and the performance win of micro or nano seconds is just not worth it.
If we want to have multiple threads processing different peers, then avoiding (or at least minimising) having to go through shared data structures will be necessary. Keeping our data structures in sync is already challenging, making it more complicated and having more things to keep coordinated is a bad thing.
I think a simpler approach would be to add a unique_ptr<Peer> (and perhaps likewise for other per-peer data) to CNode, which is initialized via NetEventsInterface::InitializeNode.
This sounds like doing a complete 180 on the work that happened over multiple years to decouple net and net processing.
If you’re going in the wrong direction, doing a 180 is a good thing, and going even further in the same direction is a bad thing.
dergoegge
commented at 12:09 pm on August 11, 2023:
member
then avoiding (or at least minimising) having to go through shared data structures will be necessary
I could agree on s/necessary/more performant, but with the shared data structure design it would really only cause contention for the look ups which is really not that bad. i.e.
0// in CConnman::DoX(NodeId)
1 {
2 LOCK(m_nodes_mutex);
3// do lookup
4 }
56// do work on the CNode
Keeping our data structures in sync is already challenging,
It’s challenging right now because we are passing internal pointers around.
making it more complicated and having more things to keep coordinated is a bad thing.
It’s really not that complicated if all you have to do is call moduleX.ForgetPeer(id) in FinalizeNode.
If you’re going in the wrong direction, doing a 180 is a good thing, and going even further in the same direction is a bad thing.
I think we’re just gonna disagree on what the right direction is :) It mostly seems like a discussion of performance vs. sane interfaces and I don’t think it is worth optimizing for performance in this specific case (at the cost of bad interfaces) as it doesn’t represent a bottleneck either way.
theuni
commented at 4:04 pm on August 11, 2023:
member
We’ve debated the CNode* passing/abstracting thing ever since CConnman was merged. This has come up over and over so it’s definitely not new. And the big problem is: everyone has their own take. The question of “should we use CNode or CConnman as the interface for network stuff” could be debated forever. I think the only correct answer is “not both arbitrarily”, which is what we do now.
I commend @dergoegge for working on this. It’s tough and requires picking an approach and sticking to it.
Personally, I don’t love any solution, but NodeId as an access handle has always seemed like one of the most reasonable approaches to me. It’s not the most sophisticated, but it’s straightforward. And from my experiments years ago anyway, there were fewer footguns as there are no more external lifetime concerns. So Concept ACK from me on that part.
What matters to me is that CConnman gets a clean separation and CNode’s internals are no longer exposed. TBH I’m less concerned about the approach taken to get there. I understand that there may be a performance trade-off, but imo it’s worth it for simplicity. I fully expect people to disagree, but if performance is the argument, imo it needs to be demonstrated.
@ajtowns’s take is also totally reasonable imo, just a matter of different priorities I think.
ajtowns
commented at 2:55 am on August 14, 2023:
contributor
I think we’re just gonna disagree on what the right direction is
Isn’t the point of opening this to discuss that and at least try to come to an agreement? Being immediately dismissed out of hand feels pretty frustrating.
It’s challenging right now because we are passing internal pointers around.
I don’t really think that’s at all true? Passing pointers to objects is a completely normal thing; and whether something is “internal” or not is just a matter of how you define the api – if other things are accessing it, it’s not “internal” by definition, though perhaps it should be.
I’d contend that the challenging things is mostly that we’re letting nodes interact with other nodes’ state directly.
It’s really not that complicated if all you have to do is call moduleX.ForgetPeer(id) in FinalizeNode.
It also means either wrapping *Assert(GetState(nodeid)), doing a whole bunch of normally redundant if (peer) ... boilerplate after peer = GetPeerRef(nodeid), or else leaving potential nullptr dereferences lying about. No, it’s not that complicated, but it’s largely unnecessary, same as using a possibly null pointer in a function param, when you could use an always-valid reference instead.
It mostly seems like a discussion of performance vs. sane interfaces and I don’t think it is worth optimizing for performance in this specific case (at the cost of bad interfaces) as it doesn’t represent a bottleneck either way.
I don’t think this approach is a “sane” interface – interacting with many peers is fundamentally something we should be able to do in parallel, and adding many synchronisation points where we prevent ourselves from operating in parallel is a fundamentally broken approach. It’s broken both from the perspective of “it would be nice if we could do better in the future”, in that you’re making those improvements harder, but it’s also broken from the perspective of “it’s easier to avoid bugs when you can just focus on a single thing, and not have to worry about many interactions”.
Ignoring the performance aspect, and just focussing on the “net_processing deals with a single peer and doesn’t interfere with the state of other peers” would, I think, look like:
replace ForNode and ForEachNode with a callback – ie, pass a lambda (with copy captures) back to connman, and have connman forward that lambda back to each peer to deal with as they choose
replace the ForEachNode loops on evictions with a cached score – when you run ProcessMessages or SendMessages nodes update their eviction score, and connman tracks the worst scores so that when a node needs to be evicted it knows exactly which node to evict
…and I think that’s it? At that point all of PeerManager only needs to know about a single peer at any one time, which makes for simpler code (and makes parallelising net processing much more plausible, though that gets into “performance”) Huh, that seems much simpler than I would have expected.
Having PeerManager only focus on a single peer also seems a much better design for fuzz (or unit) testing to me – in the above model, you’d test interactions with other peers solely via the callbacks, eg, without ever necessarily having to actually instantiate multiple peers. Am I missing something there?
(Actually parallelising the msghand thread probably depends on whether we can at least do parallel lookups into the mempool perhaps via a reader/writer lock approach, but something like this would be a necessary step too)
dergoegge
commented at 1:34 pm on August 14, 2023:
member
Being immediately dismissed out of hand feels pretty frustrating.
I apologize, that wasn’t my intention.
I want to make it clear though that this PR is about continuing previous decoupling work that has seemingly dropped of peoples radar when previous contributors switched focus or left the project. I do not think the previous work went in the wrong direction nor that this breaks potentially multi threading the message processing code in the future.
If you think this design is somehow fundamentally broken w.r.t. parallelizing, please elaborate why because I don’t see it.
whether something is “internal” or not is just a matter of how you define the api – if other things are accessing it, it’s not “internal” by definition, though perhaps it should be.
It seemed well understood to me that we think of our p2p stack to consist of two layers/modules: net (lower level networking, connection management) and message processing (net processing) that each have their own responsibilities. Therefore I would classify pointers into their respective data structures as internals.
Perhaps a competing PR/design doc of what you are thinking about would help? I’ve thought about the design in this PR so much that I might be a little stuck in my ways now.
Judging by the rest of your comment, it seems to me like parallelizing message processing is of big interest to you. I agree that that is cool but I also think we should have better testing (ideally implementation agnostic) before switching to that. In any case maybe it makes more sense to discuss that in another issue?
Actually parallelising the msghand thread probably depends on whether we can at least do parallel lookups into the mempool perhaps via a reader/writer lock approach, but something like this would be a necessary step too
Using cs_main for so many validation unrelated things is probably another blocker, but perhaps easier to solve.
ajtowns
commented at 8:56 am on August 16, 2023:
contributor
I think I might be coming across as disagreeing more than I actually do? I’m all for modularisation and better testing, and getting things tightly focussed so that you don’t need to keep huge amounts of context in your head; but I think the way we’ve been doing that in net_processing in many places is adding complexity and interrelationships rather than reducing them.
Perhaps a competing PR/design doc of what you are thinking about would help? I’ve thought about the design in this PR so much that I might be a little stuck in my ways now.
I think it would be better to move towards something more like:
CTxMempool, TxRequestTracker, Chainstate, etc: global high-level state accessed by PeerManager
PeerManager: deals with updating a single peer’s logical/high-level state
CConnman: deals with collecting all our peers together, and prioritising which ones to service/connect/evict
CNode: deals with all of the state for a single peer
high-level, protocol related state: Peer m_peer; CNodeState m_state (probably via an opaque unique_ptr)
low-level, I/O state: Sock m_sock
coordination vs other peers: NodeId id; nRefCount; fPauseRecv; fDisconnect; ...
ie, I guess I don’t think net-vs-net_processing should be looked at as a “high level” vs “low level” split at all; but rather as the overall management of connections, who to keep, who to prioritise, etc (CConnman) vs the details of communicating with an individual connection, which messages to send, how to deal with messages you receive, etc (net_processing).
(There’s also a bunch of “low-level” things in net – CSerializedNetMsg, TransportSerializer, CNetMessage; but those are all more related to message handling in net_processing than CConnman)
Maybe think of it as net_processing as being like the bartender which gets you your drinks and listens to your sob stories; and CConnman is like the bouncer that lets you in in the first place or kicks you out when the bartender says you’re causing trouble, or a celebrity arrives and they need to make space, or you’re violating fire safety rules, or whatever. In that analogy, the things that go in CNode (rather than Peer) are the bits of info the bouncer cares about, either to read (fDisconnect = “kick this guy out”) or as info to pass onto the bartender, like giving someone an stamp on their hand at an all ages venue, so that the bartender knows they can serve them alcohol without having to interrupt the bouncer to ask if they checked their age or not: it’s simpler and more efficient if you can just look these things up directly, rather than having to ask someone. (The analogy breaks if you push it too far, of course)
That would imply a few implementation changes compared to master:
PeerManager doesn’t have m_peer_map or m_node_states, or their associated locks (but does still have m_txrequest and other global state)
PeerManager shouldn’t need a reference back to CConnman at all; when interacting with one peer generates work for other peers, that should be coordinated either via global state (mempool/txrequest/etc), or passing the extra work directly from ProcessMessages/SendMessages back to the CConnman that invoked it, which can then distribute the work to appropriate peers
peer I/O should be via CNode not CConnman (ie, move PushMessage into CNode, matching PollMessage)
CValidationInterface would have to implemented by CConnman instead of PeerManager (those functions apply across all nodes, so it needs to go through the container of all nodes)
similarly, the RPC functionality currently exposed through PeerManager would need to go through CConnman first, for the same reason
probably splitting CNode out of net.h, perhaps splitting Peer/CNodeState out of net_processing.h (so that net_node.h has an opaque class Peer;, but net_node.cpp can #include <net_msgstate.h> in order to implement CNode::~CNode and destruct its opaque unique_ptr<Peer> m_peer)
I think an approach like that would have a few benefits:
breaks the circular dependency between PeerManager and CConnman
lets you test PeerManager in isolation without even instantiating a dummy CConnman
can require PeerManager to be passed into CConnman at construction, rather than at connman->Start(), simplifying its API
can have CConnman own its NetEventsInterface instead of and be responsible for destructing it (make the virtual destructor public), removing the risk that peerman gets destructed while connman is still using it
potentially lets you simplify the per-peer locks, since all access to nodes’ high-level state goes via CConnman
keeps things efficient – you’re still separating things into logical modules, but not adding layers of lookups and indirection
Using cs_main for so many validation unrelated things is probably another blocker, but perhaps easier to solve.
mempool.cs can also be a blocker for effective parallelisation – you can’t quite do tx relay to two peers concurrently when even looking up a txid via the mempool to see if an INV is a duplicate requires an exclusive lock, especially if that lock’s held for an extended period (eg while validating a tx or block or reorging).
If you think this design is somehow fundamentally broken w.r.t. parallelizing, please elaborate why because I don’t see it.
If you want good performance then you keep things simple – few locks, access things directly. Often that approach is also simple and the easy to understand. Doing things in a complicated, round about way isn’t “fundamentally broken” – you can make it work, and you can always throw enough hardware at it to make it work fast enough. But I think it should be pretty obvious that:
The more complicated code here isn’t doing anything clever to justify the complexity: it’s still just reading/setting the same atomic bool. Both PeerManager and CConnman care about this bool, so it’s not “internal” to one or the other.
Now, sure: a quick map lookup while holding a global lock isn’t the end of the world, but refactoring should be making things simpler or more efficient or both, and this is doing the opposite. That’s why I think it’s the wrong direction. If the design is forcing your code to be complicated, ugly and slow, then the design’s bad.
For parallelisation in particular, you get best performance with the fewest locks (so you don’t have to stop, and potentially have another process claim the cpu) and the least shared data (so you can just process it in cache without having to synchronise with other tasks). So adding locks and lookups over shared data on things that were previously just a single atomic operation is completely backwards.
Here’s a concrete example: https://github.com/ajtowns/bitcoin/commits/202308-netnode . It first splits out CNode and related things from net.h, then moves PushMessage into CNode (and so it doesn’t require m_connman.m_total_bytes_sent_mutex or m_connman at all for sending messages), then also moves GetAddresses() into PeerManager. That drops the number of m_connman uses in net_processing from 80 to 27 by my count, so seems a decent start on cutting out the circular dep between CConnman and PeerManager. Dropping connman.ForEachNode and for (peer : m_peer_map) in this model is a whole other bunch of work, of course.
For example, if you look at net vs net_processing as low-vs-high level, moving -maxsendbuffer would be backwards; but if you look at it as “message handling” vs “dealing with the collection of all nodes” its logical: maxsendbuffer affects your communication with an individual node, it’s not something that matters across nodes.
ajtowns
commented at 1:41 am on August 17, 2023:
contributor
I guess two other things might be worth thinking about:
the approach I describe above maybe can be looked at as moving much of the low-level stuff into CNode (or net_node.h), leaving net_processing for the high-level logic about messages and peer state, and net/CConnman for the high-level logic about dealing with many peers. so it’s still something of a low/high level split, just splitting at a different point. moving as much low-level logic into CNode as possible also helps ensure we’re able to do all the low-level work across each node independently, which then may make it easier to parallelise that work.
one of the other things about parallelising message handling, is that if you’re dealing with node A, and it decides it wants to loop over all nodes, then if some other node B is super busy doing a long task, that ends up blocking A as well. having the design be “A just deals with its own stuff and adds anything for B to a queue that B will deal with when it gets around to it” works a fair bit better there, I think; the complexity is pretty much all in implementing the “queue” part.
dergoegge
commented at 8:05 pm on September 28, 2023:
member
Closing since there doesn’t seem to be sufficient support/interest for this
This is a metadata mirror of the GitHub repository
bitcoin/bitcoin.
This site is not affiliated with GitHub.
Content is generated from a GitHub metadata backup.
generated: 2025-01-21 09:12 UTC
This site is hosted by @0xB10C More mirrored repositories can be found on mirror.b10c.me