This PR restricts the period during which the cs_vNodes mutex is locked, and prevents the only case in which cs_vNodes could be locked before ::cs_main.

This change makes the explicit locking of recursive mutexes in a particular order redundant.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
EDIT: This PR has substantially changed. The comment below refers to the old branch.
This does indeed seem useless. The mutex was added by @TheBlueMatt in commit d7c58ad514ee00db00589216166808258bc16b60:
> Split CNode::cs_vSend: message processing and message sending
>
> cs_vSend is used for two purposes - to lock the datastructures used to queue messages to place on the wire and to only call SendMessages once at a time per-node. I believe SendMessages used to access some of the vSendMsg stuff, but it doesn't anymore, so these locks do not need to be on the same mutex, and also make deadlocking much more likely.
We're implicitly guaranteed to only call `SendMessages()` serially, since it's only ever called by the message handler thread. If we want to be explicit about it, it'd be better to add a global lock to `PeerManagerImpl` (or a per-`Peer` object), so that net_processing is enforcing its own synchronization internally.
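For illustration, such a `PeerManagerImpl`-level lock might look roughly like this (a sketch only; `m_net_events_mutex` is the name a later revision of this PR uses in its lock annotations, and the body is illustrative, not the actual implementation):

```cpp
class PeerManagerImpl final : public PeerManager
{
    /** Serializes all NetEventsInterface entry points. */
    Mutex m_net_events_mutex;

public:
    bool SendMessages(CNode* pto) override EXCLUSIVE_LOCKS_REQUIRED(!m_net_events_mutex)
    {
        LOCK(m_net_events_mutex);
        // ... build and queue messages for pto ...
        return true;
    }
    // InitializeNode/FinalizeNode/ProcessMessages would take the same mutex.
};
```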
This seems safe to me. The only thing that `CNode.cs_sendProcessing` enforces is preventing `SendMessages()` from being called for the same `CNode` concurrently. However, `SendMessages()` can only be called by `threadMessageHandler`, so there's no possibility of it being called concurrently.
I wonder if instead of simply removing this, we should replace it with something that more closely documents our expectations. I think that calling any of the `NetEventsInterface` methods (`InitializeNode`, `FinalizeNode`, `SendMessages`, `ProcessMessages`) in `PeerManagerImpl` concurrently could lead to problems, so perhaps a global mutex in `PeerManagerImpl` should be added that's taken whenever any of those functions are called? If we ever want to add concurrency to `PeerManagerImpl`, we could look at loosening that restriction.
> I wonder if instead of simply removing this, we should replace it with something that more closely documents our expectations. […]
Maybe postpone it until #19398 is fulfilled?
Updated b58097facb4fec80df8fa140dfefd9c461c93832 -> 4de76058f8c3150de07d7a02faaceb261d732499 (pr21563.03 -> pr21563.04):
```
node0 stderr libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
```
Looks like a mutex is prematurely destroyed. Weird.
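For context, libc++ raises this particular std::system_error when the underlying pthread_mutex_lock fails, which can happen if the std::mutex's storage has already been destroyed. A deliberately broken toy illustration (not code from this PR; this is undefined behaviour, so the exact symptom is platform-dependent):

```cpp
#include <mutex>

int main()
{
    std::mutex m;
    m.~mutex();  // destroy the mutex prematurely
    m.lock();    // UB: on libc++ this can surface as std::system_error
                 // "mutex lock failed: Invalid argument"
}
```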
Updated 4de76058f8c3150de07d7a02faaceb261d732499 -> b1e5ca217c3557ed89d985c7e8ea28c2b9ed303b (pr21563.04 -> pr21563.05, diff):
```diff
-    //
-    // Thus the implicit locking order requirement is: (1) cs_main, (2) g_cs_orphans, (3) cs_vNodes.
     if (node.connman) {
         node.connman->StopThreads();
-        LOCK2(::cs_main, ::g_cs_orphans);
         node.connman->StopNodes();
```
[…] `if (node.connman) node.connman->Stop()`.
```diff
@@ -2636,6 +2633,9 @@ void CConnman::StopNodes()
     }
 
     // Close sockets
+    std::vector<CNode*> nodes;
+    std::list<CNode*> nodes_disconnected;
+    {
         LOCK(cs_vNodes);
```
I think this lock can be held for much less time:
```cpp
// Delete peer connections
std::vector<CNode*> nodes;
{
    LOCK(cs_vNodes);
    nodes = std::move(vNodes);
    vNodes.clear();
}

for (CNode* pnode : nodes) {
    pnode->CloseSocketDisconnect();
    DeleteNode(pnode);
}

for (CNode* pnode : vNodesDisconnected) {
    DeleteNode(pnode);
}
vNodesDisconnected.clear();

// Close listening sockets
for (ListenSocket& hListenSocket : vhListenSocket) {
    if (hListenSocket.socket != INVALID_SOCKET) {
        if (!CloseSocket(hListenSocket.socket)) {
            LogPrintf("CloseSocket(hListenSocket) failed with error %s\n", NetworkErrorString(WSAGetLastError()));
        }
    }
}
vhListenSocket.clear();

semOutbound.reset();
semAddnode.reset();
```
```diff
@@ -238,10 +238,10 @@ class PeerManagerImpl final : public PeerManager
     void NewPoWValidBlock(const CBlockIndex *pindex, const std::shared_ptr<const CBlock>& pblock) override;
 
     /** Implement NetEventsInterface */
-    void InitializeNode(CNode* pnode) override;
-    void FinalizeNode(const CNode& node) override;
-    bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override;
-    bool SendMessages(CNode* pto) override EXCLUSIVE_LOCKS_REQUIRED(pto->cs_sendProcessing);
+    void InitializeNode(CNode* pnode) override EXCLUSIVE_LOCKS_REQUIRED(!m_net_events_mutex);
```
[…] `cs_main` when calling into a `NetEventsInterface` method?

It is not trivial due to https://github.com/bitcoin/bitcoin/blob/c1f480fb234856a60e9b1f4bf97114bfd9b3d0b0/src/net_processing.cpp#L662. Going to keep the scope of this PR tight.
```diff
-    // EraseOrphansFor, which locks g_cs_orphans.
-    //
-    // Thus the implicit locking order requirement is: (1) cs_main, (2) g_cs_orphans, (3) cs_vNodes.
     if (node.connman) {
         node.connman->StopThreads();
-        LOCK2(::cs_main, ::g_cs_orphans);
```
Removed the `LOCK2(::cs_main, g_cs_orphans);` from fuzz/process_message.cpp and fuzz/process_messages.cpp.
Updated b1e5ca217c3557ed89d985c7e8ea28c2b9ed303b -> a3663329ad1899cc012437167a316040d77e00bc (pr21563.05 -> pr21563.06):
The `LOCK2(::cs_main, ::g_cs_orphans);` was added in #18458. Do you see any problem with locking `cs_vNodes`, grabbing a copy of `vNodes`, releasing the lock and then deleting the nodes in `CConnman::Stop()`? This is similar to what happens in the socket handler and message handler threads.
Removing a mutex that guards nothing, then adding a mutex that also doesn't actually guard anything, seems a bit backwards…
#21527 has it guarding the extratxns data structures (as part of getting rid of g_cs_orphans). It would be good for it to be able to guard the address data structures in struct Peer, but without it being a global, that will probably be awkward at best, since Peer doesn’t have a reference back to PeerManagerImpl.
The cs_vNodes changes don’t seem to be mentioned in the title or PR description?
Updated a3663329ad1899cc012437167a316040d77e00bc -> 9766b7f43d4fedd1740f4444689b6ba2ff33d12d (pr21563.06 -> pr21563.07):
> The cs_vNodes changes don't seem to be mentioned in the title or PR description?

The PR description has been updated.
I think this change is fine. It steals the `CNode*`s from `vNodes`, releases the mutex and then cleans up the nodes. That's very similar to the pattern in `SocketHandler()` and in `ThreadMessageHandler()`, the difference being that in those cases an extra reference is taken and `nRefCount` is incremented. Here, the reference count is not incremented, since the pointer is stolen from `vNodes` before clearing that vector.
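For comparison, the handler-thread pattern referred to above looks roughly like this (simplified from the snapshot logic in net.cpp, not a verbatim excerpt):

```cpp
// Snapshot vNodes under the lock, pinning each node with an extra
// reference, then operate on the snapshot without holding cs_vNodes.
std::vector<CNode*> snapshot;
{
    LOCK(cs_vNodes);
    snapshot = vNodes;
    for (CNode* pnode : snapshot) pnode->AddRef();
}

// ... send/process messages for each node in `snapshot` ...

for (CNode* pnode : snapshot) pnode->Release();
```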
Perhaps we could document our assumptions by calling `Release()` on the `CNode` object when stealing it from `vNodes`, or immediately before calling `DeleteNode()`, and then asserting that `nRefCount == 0` in the `CNode` destructor? Maybe in a follow-up.
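A hypothetical sketch of that follow-up (illustrative only; it assumes membership in vNodes corresponds to exactly one reference):

```cpp
for (CNode* pnode : nodes) {
    pnode->CloseSocketDisconnect();
    pnode->Release();  // drop the reference that vNodes held
    DeleteNode(pnode);
}

CNode::~CNode()
{
    // Make the ownership assumption machine-checked: every AddRef()
    // must have been paired with a Release() by destruction time.
    assert(nRefCount == 0);
    CloseSocket(hSocket);
}
```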
```diff
+    // Delete peer connections
+    std::vector<CNode*> nodes;
+    {
+        LOCK(cs_vNodes);
+        nodes = std::move(vNodes);
+        vNodes.clear();
```
Could write this as:
```cpp
cs_vNodes.lock();
std::vector<CNode*> nodes(std::move(vNodes)); // move constructor clears vNodes
cs_vNodes.unlock();
```
If I'm reading the spec right, the move constructor is guaranteed to be constant time, while `operator=` is linear time; and the move constructor also guarantees the original ends up cleared. Since C++17 the move constructor is also marked noexcept, so `lock` and `unlock` in place of RAII locks should be sound. Alternatively, could use `WAIT_LOCK(cs_vNodes, lock); ...; REVERSE_LOCK(lock);`.
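A sketch of that WAIT_LOCK/REVERSE_LOCK alternative, using the macros from sync.h (REVERSE_LOCK releases the lock for the remainder of the enclosing scope and re-acquires it at scope exit):

```cpp
{
    WAIT_LOCK(cs_vNodes, lock);
    std::vector<CNode*> nodes{std::move(vNodes)};
    REVERSE_LOCK(lock);  // cs_vNodes is released from here to end of scope

    for (CNode* pnode : nodes) {
        pnode->CloseSocketDisconnect();
        DeleteNode(pnode);
    }
}  // cs_vNodes is re-acquired, then immediately released, here
```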
Or even remove all of this logic and re-use the existing “disconnect all nodes” logic?
```diff
diff --git a/src/net.cpp b/src/net.cpp
index b7c1b8c6c4..9a4a989e15 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -2636,31 +2636,17 @@ void CConnman::StopNodes()
     }
 
     // Delete peer connections
-    std::vector<CNode*> nodes;
     {
         LOCK(cs_vNodes);
-        nodes = std::move(vNodes);
-        vNodes.clear();
+        assert(vNodes.empty());
     }
-
-    for (CNode* pnode : nodes) {
-        pnode->CloseSocketDisconnect();
-        DeleteNode(pnode);
-    }
-
-    for (CNode* pnode : vNodesDisconnected) {
-        DeleteNode(pnode);
-    }
-    vNodesDisconnected.clear();
+    assert(vNodesDisconnected.empty());
 
     // Close listening sockets
-    for (ListenSocket& hListenSocket : vhListenSocket) {
-        if (hListenSocket.socket != INVALID_SOCKET) {
-            if (!CloseSocket(hListenSocket.socket)) {
+    for (ListenSocket& hListenSocket : vhListenSocket)
+        if (hListenSocket.socket != INVALID_SOCKET)
+            if (!CloseSocket(hListenSocket.socket))
                 LogPrintf("CloseSocket(hListenSocket) failed with error %s\n", NetworkErrorString(WSAGetLastError()));
-            }
-        }
-    }
     vhListenSocket.clear();
 
     semOutbound.reset();
diff --git a/src/net.h b/src/net.h
index 13f86b6a91..5a87a5a304 100644
--- a/src/net.h
+++ b/src/net.h
@@ -867,6 +867,8 @@ public:
     void Stop()
     {
         StopThreads();
+        fNetworkActive = false;
+        DisconnectNodes();
         StopNodes();
     };
```
> the move constructor is guaranteed to be constant time, while operator= is linear time

This is the move assignment operator (number 2 in https://en.cppreference.com/w/cpp/container/vector/operator%3D). Its complexity is linear in the size of `this` - the vector being moved to. Presumably that's because the objects in `this` need to be destructed and the memory already owned by `this` needs to be deallocated. In our case, `this` is empty, and so the operation is constant time - it just needs to copy the start pointer/size/capacity from `other` to `this`.
Does performance matter here at all?
Absolutely not.
> I wasn't sure if std::move + clear was safe / necessary

Moving from a vector leaves it in a "valid but unspecified state". The `clear()` is probably unnecessary since this is shutdown and we're not going to touch `vNodes` again, but I think it's good practice not to leave vectors in an unspecified state if we can help it.
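A self-contained illustration of both points (standard-library semantics, not code from this PR):

```cpp
#include <cassert>
#include <utility>
#include <vector>

int main()
{
    std::vector<int> src{1, 2, 3};
    std::vector<int> dst;   // empty: move-assignment has nothing to destroy
    dst = std::move(src);   // steals src's buffer; effectively O(1) here
    assert(dst.size() == 3);
    src.clear();            // moved-from src is "valid but unspecified";
    assert(src.empty());    // clear() restores a known-empty state
}
```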
Could achieve the same in one less line of code with https://en.cppreference.com/w/cpp/container/vector/swap ?
Thanks! Updated.
[…] `Release()` due to an early return.
```diff
+        if (hListenSocket.socket != INVALID_SOCKET) {
+            if (!CloseSocket(hListenSocket.socket)) {
+                LogPrintf("CloseSocket(hListenSocket) failed with error %s\n", NetworkErrorString(WSAGetLastError()));
+            }
+        }
+    }
```
ACK 9766b7f43d4fedd1740f4444689b6ba2ff33d12d
Something to consider for further improvement:
It is not even necessary to lock `cs_vNodes` inside `StopNodes()`. Why? Because by the time `StopNodes()` is called, the other threads have been shut down.
If it was necessary to protect the code in `StopNodes()` with a mutex, then this PR would be broken:

```
lock()
tmp = std::move(shared_vector)
unlock()
destroy each element in tmp
// What prevents new entries from being added to shared_vector
// here, since we have unlock()ed? Nothing. So when this function
// completes, shared_vector is still alive and kicking (with new
// elements being added to it in the meantime).
```
The `LOCK(cs_vNodes);` is only needed to keep the thread safety analysis happy. This code actually belongs in the destructor `~CConnman()`, where we can access `vNodes` without locking and without upsetting the TSA. I think it can be moved to the destructor with some further changes (outside of this PR).
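Hypothetically, the destructor version could look like this (a sketch only; as the reply below points out, PeerManager is destroyed before CConnman, so this is not safe without further re-ordering):

```cpp
CConnman::~CConnman()
{
    // By destruction time no other threads exist, so vNodes and
    // vNodesDisconnected can be walked without taking cs_vNodes.
    for (CNode* pnode : vNodes) DeleteNode(pnode);
    for (CNode* pnode : vNodesDisconnected) DeleteNode(pnode);
}
```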
> This code actually belongs in the destructor ~CConnman(), where we can access vNodes without locking and without upsetting the TSA. I think it can be moved to the destructor with some further changes (outside of this PR).
Maybe, but that's a much more invasive change. `CConnman::Stop()` calls `DeleteNode()` (which calls into `PeerManager::FinalizeNode()`) for all the nodes. The destructor for `CConnman` is called after the destructor for `PeerManager` (in the reverse order that they're constructed). That can be changed, but we'd need to be very careful.
utACK 9766b7f43d4fedd1740f4444689b6ba2ff33d12d
Agree with @MarcoFalke that the move & clear is more elegantly expressed as std::vector::swap().
```diff
diff --git a/src/net.cpp b/src/net.cpp
index b7c1b8c6c4..94029491ab 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -2637,11 +2637,7 @@ void CConnman::StopNodes()
 
     // Delete peer connections
     std::vector<CNode*> nodes;
-    {
-        LOCK(cs_vNodes);
-        nodes = std::move(vNodes);
-        vNodes.clear();
-    }
+    WITH_LOCK(cs_vNodes, nodes.swap(vNodes));
 
     for (CNode* pnode : nodes) {
         pnode->CloseSocketDisconnect();
```
Updated 9766b7f43d4fedd1740f4444689b6ba2ff33d12d -> 8c8237a4a10feb2ac9ce46f67b5d14bf879b670f (pr21563.07 -> pr21563.08, diff).
Addressed @MarcoFalke’s comments:
> Could achieve the same in one less line of code with https://en.cppreference.com/w/cpp/container/vector/swap ?
>
> minor nit: would be nice to do style-fixups in a separate commit
```diff
         DeleteNode(pnode);
     }
-    vNodes.clear();
     vNodesDisconnected.clear();
     vhListenSocket.clear();
     semOutbound.reset();
```
Might be worth adding a comment as to why `vNodesDisconnected`, `vhListenSocket`, `semOutbound` and `semAddnode` are all safe to reset here – I think it's because:

- `vNodesDisconnected` and `vhListenSocket` are only otherwise accessed by `ThreadSocketHandler`
- `semOutbound` and `semAddnode` are only otherwise accessed via `ThreadOpenConnections`, `Start()`, `Interrupt()` (and `Interrupt()` is only safe if it's invoked in between `Start()` and `StopNodes()`), and `AddConnection` (which is called from rpc/net.cpp, so this also requires that we won't be adding connections via RPC – might be good to add an `if (semOutbound == nullptr) return false;` to `AddConnection()`, as sketched below)
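A sketch of that suggested guard (the early return is the reviewer's suggestion; the rest of the function body is elided):

```cpp
bool CConnman::AddConnection(const std::string& address, ConnectionType conn_type)
{
    // Refuse RPC-initiated connections before Start() has created the
    // semaphores or after StopNodes() has reset them.
    if (semOutbound == nullptr) return false;

    // ... existing connection-opening logic unchanged ...
    return true;
}
```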
[…] `Stop` is called. The other threads are also assumed to be shut down. Maybe reverting my commit (https://github.com/bitcoin/bitcoin/pull/21563#issuecomment-826278304) could help to self-document that better?
```diff
@@ -100,7 +100,6 @@ void fuzz_target(FuzzBufferType buffer, const std::string& LIMIT_TO_MESSAGE_TYPE
         g_setup->m_node.peerman->SendMessages(&p2p_node);
     }
     SyncWithValidationInterfaceQueue();
-    LOCK2(::cs_main, g_cs_orphans); // See init.cpp for rationale for implicit locking order requirement
     g_setup->m_node.connman->StopNodes();
```
I think there probably should be a comment as to why explicitly calling `StopNodes` is still necessary. (I believe it's because if you don't do that, then `m_node` will get destructed, deleting its `peerman`, then attempting to delete its `connman`, which will see some entries in `vNodes` and try to call `peerman->FinalizeNode()` on them, but `peerman` is deleted at that point. With the lock order fixed, it may be possible to have `~NodeContext` call `if (peerman) { peerman->Stop(); }` and remove those lines from the tests entirely. Not sure if that would also let you remove it from init.cpp.)
utACK 8c8237a4a10feb2ac9ce46f67b5d14bf879b670f - logic seems sound
A couple of comment improvements would be good; longer term it would also be better to add missing guards to things in net so that the compiler can catch more errors, and to simplify the ownership/interaction between CConnman/PeerManager/NodeContext… (consider that out of scope of this PR)
Could we assert that all threads are stopped when `StopNodes()` starts executing? Something like:

```cpp
assert(threadMessageHandler.get_id() == std::thread::id());
assert(threadSocketHandler.get_id() == std::thread::id());
...
```
Or even call `StopThreads()` at the start of `StopNodes()`?
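That second suggestion might look like this (a sketch; the rest of StopNodes() continues unchanged after the join):

```cpp
void CConnman::StopNodes()
{
    // Make the single-threaded assumption explicit: join all handler
    // threads before touching vNodes, vhListenSocket, and the semaphores.
    StopThreads();

    // ... existing node and socket cleanup follows ...
}
```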
> Or even call StopThreads() at the start of StopNodes()?
This used to be the case before commit fa369651c5523f393e0bfcfa328b8e27006711aa. Maybe that commit should be reverted now?
I think the changes are nice and can be merged
review ACK 8c8237a4a10feb2ac9ce46f67b5d14bf879b670f 👢
Signature:
```
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

review ACK 8c8237a4a10feb2ac9ce46f67b5d14bf879b670f 👢
-----BEGIN PGP SIGNATURE-----

iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
pUiYLgwAy9ZbgTGcc2V3HKj7fNh6KWOQ2DgAVsBLAmRP1D+443eCC3twNxELvtOe
Mq2gmwseUZK1oNKU0RnCHvg5bZOqcId1N1Sf6BEbJOmkGFPktaVLTz2TmQpLLE42
ZTMrqWEJreQ9DXqEDDKV222Zu4ucLmUFuNqs5MYj+moUTBdedFcbeF96cJK2hBc1
OgTJXL6MLoMAFeL4PKS6j+E2yHXOFuKHWE9WHPJ0wFKujTqlBPtntZxoWb7fby2A
ZHrA8wxjdNQbgovYRjToVe/bif2SNuFbfNwE2XYalQ50U8bY5n3hC2SUB/poGTaJ
GoSOWiv1jAUimiMUVyicP+JDfprAjPkB9hXKC7k4kvbluTpqYz4UGxhnBmKnU/qe
+gU1fPldReMkVGg5YjG3pg7Kd4izdq2YzyNMXdyIRHJaWn+gvsl4vQjPLUe7+WFL
DvX4dkpJ9fHF/DoHLsYi5/g2sG3G2G1Z4jrV/TFqAInAa0tuEqdWwYBomHi5QBQX
GzrRIkFr
=lXKU
-----END PGP SIGNATURE-----
```
Timestamp of file with hash 2744498e52ef3957483818d251ab7a217f1deba6d21ad39bc26dd3c078c33d07 -