This PR restricts the period when the cs_vNodes mutex is locked, prevents the only case when cs_vNodes could be locked before the ::cs_main.
This change makes the explicit locking of recursive mutexes in the explicit order redundant.
This PR restricts the period when the cs_vNodes mutex is locked, prevents the only case when cs_vNodes could be locked before the ::cs_main.
This change makes the explicit locking of recursive mutexes in the explicit order redundant.
<!--e57a25ab6845829454e8d69fc972939a-->
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
<!--174a7506f384e20aa4161008e828411d-->
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
EDIT: This PR has substantially changed. The comment below refers to the old branch.
This does indeed seem useless. The mutex was added by @TheBlueMatt in commit d7c58ad514ee00db00589216166808258bc16b60:
Split CNode::cs_vSend: message processing and message sending
cs_vSend is used for two purposes - to lock the datastructures used to queue messages to place on the wire and to only call SendMessages once at a time per-node. I believe SendMessages used to access some of the vSendMsg stuff, but it doesn't anymore, so these locks do not need to be on the same mutex, and also make deadlocking much more likely.
We're implicitly guaranteed to only call SendMessages() serially since it's only ever called by message handler thread. If we want to be explicit about it, it'd be better to add a global lock to PeerManagerImpl (or per-Peer object), so that net_processing is enforcing its own synchronization internally.
Rebased 8ca2ee6377f534b15b47440b819b1564c9c06b88 -> b58097facb4fec80df8fa140dfefd9c461c93832 (pr21563.02 -> pr21563.03) due to the conflict with #21571.
This seems safe to me. The only thing that CNode.cs_sendProcessing enforces is preventing SendMessages() from being called for the same CNode concurrently. However, SendMessages() can only be called by threadMessageHandler, so there's no possibility of it being called concurrently.
I wonder if instead of simply removing this, we should replace it with something that more closely documents our expectations. I think that calling any of the NetEventsInterface methods (InitializeNode, FinalizeNode, SendMessages, ProcessMessages) in PeerManagerImpl concurrently could lead to problems, so perhaps a global mutex in PeerManagerImpl should be added that's taken whenever any of those functions are called? If we ever want to add concurrency to PeerManagerImpl, we could look at loosening that restriction.
I wonder if instead of simply removing this, we should replace it with something that more closely documents our expectations. I think that calling any of the
NetEventsInterfacemethods (InitializeNode,FinalizeNode,SendMessages,ProcessMessages) inPeerManagerImplconcurrently could lead to problems, so perhaps a global mutex inPeerManagerImplshould be added that's taken whenever any of those functions are called? If we ever want to add concurrency toPeerManagerImpl, we could look at loosening that restriction.
Maybe postpone it until #19398 is fulfilled?
Updated b58097facb4fec80df8fa140dfefd9c461c93832 -> 4de76058f8c3150de07d7a02faaceb261d732499 (pr21563.03 -> pr21563.04):
node0 stderr libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
node0 stderr libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
Looks like a mutex is prematurely destroyed. Weird.
Updated 4de76058f8c3150de07d7a02faaceb261d732499 -> b1e5ca217c3557ed89d985c7e8ea28c2b9ed303b (pr21563.04 -> pr21563.05, diff):
210 | - // 211 | - // Thus the implicit locking order requirement is: (1) cs_main, (2) g_cs_orphans, (3) cs_vNodes. 212 | if (node.connman) { 213 | node.connman->StopThreads(); 214 | - LOCK2(::cs_main, ::g_cs_orphans); 215 | node.connman->StopNodes();
This can now be replaced with if (node.connman) node.connman->Stop().
2632 | @@ -2636,6 +2633,9 @@ void CConnman::StopNodes() 2633 | } 2634 | 2635 | // Close sockets 2636 | + std::vector<CNode*> nodes; 2637 | + std::list<CNode*> nodes_disconnected; 2638 | + { 2639 | LOCK(cs_vNodes);
I think this lock can be held for much less time:
// Delete peer connections
std::vector<CNode*> nodes;
{
LOCK(cs_vNodes);
nodes = std::move(vNodes);
vNodes.clear();
}
for (CNode* pnode : nodes) {
pnode->CloseSocketDisconnect();
DeleteNode(pnode);
}
for (CNode* pnode : vNodesDisconnected) {
DeleteNode(pnode);
}
vNodesDisconnected.clear();
// Close listening sockets
for (ListenSocket& hListenSocket : vhListenSocket) {
if (hListenSocket.socket != INVALID_SOCKET) {
if (!CloseSocket(hListenSocket.socket)) {
LogPrintf("CloseSocket(hListenSocket) failed with error %s\n", NetworkErrorString(WSAGetLastError()));
}
}
}
vhListenSocket.clear();
semOutbound.reset();
semAddnode.reset();
237 | @@ -238,10 +238,10 @@ class PeerManagerImpl final : public PeerManager 238 | void NewPoWValidBlock(const CBlockIndex *pindex, const std::shared_ptr<const CBlock>& pblock) override; 239 | 240 | /** Implement NetEventsInterface */ 241 | - void InitializeNode(CNode* pnode) override; 242 | - void FinalizeNode(const CNode& node) override; 243 | - bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override; 244 | - bool SendMessages(CNode* pto) override EXCLUSIVE_LOCKS_REQUIRED(pto->cs_sendProcessing); 245 | + void InitializeNode(CNode* pnode) override EXCLUSIVE_LOCKS_REQUIRED(!m_net_events_mutex);
What do you think about excluding holding cs_main when calling into a NetEventsInterface method?
It is not trivial due to the https://github.com/bitcoin/bitcoin/blob/c1f480fb234856a60e9b1f4bf97114bfd9b3d0b0/src/net_processing.cpp#L662
Going to keep the scope of this PR tight.
209 | - // EraseOrphansFor, which locks g_cs_orphans. 210 | - // 211 | - // Thus the implicit locking order requirement is: (1) cs_main, (2) g_cs_orphans, (3) cs_vNodes. 212 | if (node.connman) { 213 | node.connman->StopThreads(); 214 | - LOCK2(::cs_main, ::g_cs_orphans);
You can also remove the LOCK2(::cs_main, g_cs_orphans); from fuzz/process_message.cpp and fuzz/process_messages.cpp.
Updated b1e5ca217c3557ed89d985c7e8ea28c2b9ed303b -> a3663329ad1899cc012437167a316040d77e00bc (pr21563.05 -> pr21563.06):
This seems good to me.
@MarcoFalke - you introduced the LOCK2(::cs_main, ::g_cs_orphans); in #18458. Do you see any problem with locking cs_vNodes, grabbing a copy of vNodes, releasing the lock and then deleting the nodes in CConnman::Stop()? This is similar to what happens in the socket handler and message handler threads?
Removing a mutex that guards nothing then adding a mutex that also doesn't actually guard anything seems a bit backwards...
#21527 has it guarding the extratxns data structures (as part of getting rid of g_cs_orphans). It would be good for it to be able to guard the address data structures in struct Peer, but without it being a global, that will probably be awkward at best, since Peer doesn't have a reference back to PeerManagerImpl.
The cs_vNodes changes don't seem to be mentioned in the title or PR description?
The fist commit seems separate from the other changes?
Updated a3663329ad1899cc012437167a316040d77e00bc -> 9766b7f43d4fedd1740f4444689b6ba2ff33d12d (pr21563.06 -> pr21563.07):
The cs_vNodes changes don't seem to be mentioned in the title or PR description?
The PR description updated.
I think this change is fine. It steals the CNode*s from vNodes, releases the mutex and then cleans up the nodes. That's very similar to the pattern in SocketHandler():
and in ThreadMessageHandler():
The difference being that in those cases, an extra reference is taken and nRefCount is incremented. Here, the reference count is not incremented since the pointer is stolen from vNodes before clearing that vector.
Perhaps we could document our assumptions by calling Release() on the CNode object when stealing it from vNodes or immediately before calling DeleteNode(), and then asserting that nRefCount == 0 in the CNode destructor? Maybe in a follow up.
2646 | + // Delete peer connections 2647 | + std::vector<CNode*> nodes; 2648 | + { 2649 | + LOCK(cs_vNodes); 2650 | + nodes = std::move(vNodes); 2651 | + vNodes.clear();
Could write this as:
cs_vNodes.lock();
std::vector<CNode*> nodes(std::move(vNodes)); // move constructor clears vNodes
cs_vNodes.unlock();
If I'm reading the spec right, the move constructor is guaranteed to be constant time, while operator= is linear time; and the move constructor also guarantees the original ends up cleared. Since C++17 the move constructor is also marked noexcept, so lock and unlock in place of RAII locks should be sound. Alternatively, could use WAIT_LOCK(cs_vNodes, lock); ...; REVERSE_LOCK(lock);.
Does performance matter here at all? Shutdown only happens once and flushing to disk will take magnitudes longer anyway.
No I don't think so -- at worst it's just moving pointers around, and only a hundred or so of them in normal configurations. (I wasn't sure if std::move + clear was safe / necessary, so looked into the behaviours)
Could achieve the same in one less line of code with https://en.cppreference.com/w/cpp/container/vector/swap ?
Or even remove all of this logic and re-use the existing "disconnect all nodes" logic?
diff --git a/src/net.cpp b/src/net.cpp
index b7c1b8c6c4..9a4a989e15 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -2636,31 +2636,17 @@ void CConnman::StopNodes()
}
// Delete peer connections
- std::vector<CNode*> nodes;
{
LOCK(cs_vNodes);
- nodes = std::move(vNodes);
- vNodes.clear();
+ assert(vNodes.empty());
}
-
- for (CNode* pnode : nodes) {
- pnode->CloseSocketDisconnect();
- DeleteNode(pnode);
- }
-
- for (CNode* pnode : vNodesDisconnected) {
- DeleteNode(pnode);
- }
- vNodesDisconnected.clear();
+ assert(vNodesDisconnected.empty());
// Close listening sockets
- for (ListenSocket& hListenSocket : vhListenSocket) {
- if (hListenSocket.socket != INVALID_SOCKET) {
- if (!CloseSocket(hListenSocket.socket)) {
+ for (ListenSocket& hListenSocket : vhListenSocket)
+ if (hListenSocket.socket != INVALID_SOCKET)
+ if (!CloseSocket(hListenSocket.socket))
LogPrintf("CloseSocket(hListenSocket) failed with error %s\n", NetworkErrorString(WSAGetLastError()));
- }
- }
- }
vhListenSocket.clear();
semOutbound.reset();
diff --git a/src/net.h b/src/net.h
index 13f86b6a91..5a87a5a304 100644
--- a/src/net.h
+++ b/src/net.h
@@ -867,6 +867,8 @@ public:
void Stop()
{
StopThreads();
+fNetworkActive=false;
+DisconnectNodes();
StopNodes();
};
(I tried running the tests with my diff and for some reason the node crashed)
the move constructor is guaranteed to be constant time, while operator= is linear time
This is the move assignment operator (number 2 in https://en.cppreference.com/w/cpp/container/vector/operator%3D). Its complexity is linear in the size of this - the vector being moved to. Presumably that's because the objects in this need to be destructed and the memory already owned by this needs to deallocated. In our case, this is empty, and so the operation is constant time - it just needs to copy start pointer/size/capacity from other to this.
Does performance matter here at all?
Absolutely not.
I wasn't sure if std::move + clear was safe / necessary
Moving from a vector leaves it in a "valid but unspecified state". The clear() is probably unnecessary since this is shutdown and we're not going to touch vNodes again, but I think it's good practice to not to leave vectors in an unspecified state if we can help it.
Could achieve the same in one less line of code with https://en.cppreference.com/w/cpp/container/vector/swap ?
Thanks! Updated.
Just to answer (myself) why my diff crashed: When the threads are interrupted, the node refcounts do not get decremented
I am working on a change that will address this problem of forgetting Release() due to an early return.
Seems sensible at first glance, and I think any bugs would be caught by CI. Need to have a more thorough look though.
2669 | + if (hListenSocket.socket != INVALID_SOCKET) { 2670 | + if (!CloseSocket(hListenSocket.socket)) { 2671 | + LogPrintf("CloseSocket(hListenSocket) failed with error %s\n", NetworkErrorString(WSAGetLastError())); 2672 | + } 2673 | + } 2674 | + }
minor nit: would be nice to do style-fixups in a separate commit
ACK 9766b7f43d4fedd1740f4444689b6ba2ff33d12d
Something to consider for further improvement:
It is not even necessary to lock cs_vNodes inside StopNodes(). Why? Because by the time StopNodes() is called the other threads have been shut down.
If it was necessary to protect the code in StopNodes() with a mutex, then this PR would be broken:
lock()
tmp = std::move(shared_vector)
unlock()
destroy each element in tmp
// What prevents new entries from being added to shared_vector
// here, since we have unlock()ed? Nothing. So when this function
// completes, shared_vector is still alive and kicking (with new
// elements being added to it in the meantime).
The LOCK(cs_vNodes); is only needed to keep the thread safety analysis happy. This code actually belongs to the destructor ~CConnman() where we can access vNodes without locking and without upsetting the TSA. I think it can be moved to the destructor with some further changes (outside of this PR).
This code actually belongs to the destructor ~CConnman() where we can access vNodes without locking and without upsetting the TSA. I think it can be moved to the destructor with some further changes (outside of this PR).
Maybe, but that's a much more invasive change. CConnman::Stop() calls PeerManager::DeleteNode() for all the nodes. The destructor for CConnman is called after the destructor for PeerManager (in the reverse order that they're constructed). That can be changed, but we'd need be very careful.
utACK 9766b7f43d4fedd1740f4444689b6ba2ff33d12d
Agree with @MarcoFalke that the move & clear is more elegantly expressed as std::vector::swap().
<details> <summary>Diff</summary>
diff --git a/src/net.cpp b/src/net.cpp
index b7c1b8c6c4..94029491ab 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -2637,11 +2637,7 @@ void CConnman::StopNodes()
// Delete peer connections
std::vector<CNode*> nodes;
- {
- LOCK(cs_vNodes);
- nodes = std::move(vNodes);
- vNodes.clear();
- }
+ WITH_LOCK(cs_vNodes, nodes.swap(vNodes));
for (CNode* pnode : nodes) {
pnode->CloseSocketDisconnect();
</details>
Updated 9766b7f43d4fedd1740f4444689b6ba2ff33d12d -> 8c8237a4a10feb2ac9ce46f67b5d14bf879b670f (pr21563.07 -> pr21563.08, diff).
Addressed @MarcoFalke's comments:
Could achieve the same in one less line of code with https://en.cppreference.com/w/cpp/container/vector/swap ?
minor nit: would be nice to do style-fixups in a separate commit
ACK 8c8237a4a10feb2ac9ce46f67b5d14bf879b670f
utACK 8c8237a4a10feb2ac9ce46f67b5d14bf879b670f
2656 | DeleteNode(pnode);
2657 | }
2658 | - vNodes.clear();
2659 | vNodesDisconnected.clear();
2660 | vhListenSocket.clear();
2661 | semOutbound.reset();
Might be worth adding a comment as to why vNodesDisconnected, vhListenSocket, semOutbound and semAddnode are all safe to reset here -- I think it's because:
vNodesDisconnected and vhListenSocket are only otherwise accessed by ThreadSocketHandlersemOutbound and semAddnode are only otherwise accessed via ThreadOpenConnections, Start(), Interrupt() (and Interrupt() is only safe if it's invoked in between Start() and StopNodes()), and AddConnection (which is called from rpc/net.cpp so also requires that we won't be adding connections via RPC -- might be good to add an if (semOutbound == nullptr) return false; to AddConnection())RPC must be (and is assumed to be) shutdown before Stop is called. The other threads are also assumed to be shut down. Maybe reverting my commit (https://github.com/bitcoin/bitcoin/pull/21563#issuecomment-826278304) could help to self-document that better?
99 | @@ -100,7 +100,6 @@ void fuzz_target(FuzzBufferType buffer, const std::string& LIMIT_TO_MESSAGE_TYPE 100 | g_setup->m_node.peerman->SendMessages(&p2p_node); 101 | } 102 | SyncWithValidationInterfaceQueue(); 103 | - LOCK2(::cs_main, g_cs_orphans); // See init.cpp for rationale for implicit locking order requirement 104 | g_setup->m_node.connman->StopNodes();
I think there probably should be a comment as to why explicitly calling StopNodes is still necessary
(I believe it's because if you don't do that, then m_node will get destructed, deleting its peerman, then attempting to delete it's connman which will see some entries in vNodes and try to call peerman->FinalizeNode() on them, but peerman is deleted at that point. With the lock order fixed, it may be possible to have ~NodeContext call if (peerman) { peerman->Stop(); } an remove those lines from the tests entirely. Not sure if that would also let you remove it from init.cpp)
I think changing the destruction order can be done in a separate pull?
utACK 8c8237a4a10feb2ac9ce46f67b5d14bf879b670f - logic seems sound
Couple of comment improvements that would be good; longer term would also be better to add missing guards to things in net so that the compiler can catch more errors, and simplify the ownership/interaction between CConnman/PeerManager/NodeContext...
(consider out of scope of this PR)
Could we assert that all threads are stopped when StopNodes() starts executing? Something like
assert(threadMessageHandler.get_id() == std::thread::id());
assert(threadSocketHandler.get_id() == std::thread::id());
...
Or even call StopThreads() at the start of StopNodes()?
Or even call StopThreads() at the start of StopNodes()?
This used to be the case before commit fa369651c5523f393e0bfcfa328b8e27006711aa. Maybe that commit should be reverted now?
I think the changes are nice and can be merged
review ACK 8c8237a4a10feb2ac9ce46f67b5d14bf879b670f 👢
<details><summary>Show signature and timestamp</summary>
Signature:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
review ACK 8c8237a4a10feb2ac9ce46f67b5d14bf879b670f 👢
-----BEGIN PGP SIGNATURE-----
iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
pUiYLgwAy9ZbgTGcc2V3HKj7fNh6KWOQ2DgAVsBLAmRP1D+443eCC3twNxELvtOe
Mq2gmwseUZK1oNKU0RnCHvg5bZOqcId1N1Sf6BEbJOmkGFPktaVLTz2TmQpLLE42
ZTMrqWEJreQ9DXqEDDKV222Zu4ucLmUFuNqs5MYj+moUTBdedFcbeF96cJK2hBc1
OgTJXL6MLoMAFeL4PKS6j+E2yHXOFuKHWE9WHPJ0wFKujTqlBPtntZxoWb7fby2A
ZHrA8wxjdNQbgovYRjToVe/bif2SNuFbfNwE2XYalQ50U8bY5n3hC2SUB/poGTaJ
GoSOWiv1jAUimiMUVyicP+JDfprAjPkB9hXKC7k4kvbluTpqYz4UGxhnBmKnU/qe
+gU1fPldReMkVGg5YjG3pg7Kd4izdq2YzyNMXdyIRHJaWn+gvsl4vQjPLUe7+WFL
DvX4dkpJ9fHF/DoHLsYi5/g2sG3G2G1Z4jrV/TFqAInAa0tuEqdWwYBomHi5QBQX
GzrRIkFr
=lXKU
-----END PGP SIGNATURE-----
Timestamp of file with hash 2744498e52ef3957483818d251ab7a217f1deba6d21ad39bc26dd3c078c33d07 -
</details>
There is a related pr, which is waiting for review donations: #21750