This continues the work of moving application layer data into net_processing, by moving all ping data into the new Peer object added in #19607.
For motivation, see #19398.
@@ -4076,31 +4076,15 @@ class CompareInvMempoolOrder
4076 };
4077 }
4078
4079-bool PeerManager::SendMessages(CNode* pto)
4080+void PeerManager::MaybeSendPing(CNode& pto)
Since you're touching pto anyway because it's now a reference not a pointer, maybe also call it node_to or similar, since it's not a "p" anymore?
@@ -217,6 +217,9 @@ class PeerManager final : public CValidationInterface, public NetEventsInterface
217 /** Send a version message to a peer */
218 void PushNodeVersion(CNode& pnode, int64_t nTime);
219
220+ /** Send a ping message on a regular schedule or if requested manually */
221+ void MaybeSendPing(CNode& pto);
116+ // Wait 10 minutes
117+ time += 10 * 60;
118+ SetMockTime(time);
119+
120+ // Clear nPingNonceSent so we don't disconnect for a ping timeout
121+ dummyNode1.nPingNonceSent = 0;
Move the test change to its own commit so it’s easy to verify behaviour doesn’t change
good idea. Will do.
I don’t follow why the test change is necessary/desirable though.
The change is required because previously ping timeouts happened in CConnman, so it was fine to bump the time in this test because there's no CConnman to time out the connection. When the ping data and timeout logic are moved to PeerManager, the connection times out in this test if you bump the time forward by 21 minutes without resetting the ping nonce.
Makes sense; might be a useful commit message for the test change?
Seems kind-of annoying to have to remember to constrain your mocktime bumps by random ping timeout limits. Shouldn’t this also (already) cause problems in functional tests?
The functional tests have an asyncio event loop thread that will automatically respond to ping messages:
The unit tests are single threaded, so any logic to handle messages needs to be done in the main test logic.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
“jnewbery deleted a comment from sipa 3 days ago” ?
ha. There were two comments from sheffine saying “why are you emailing me?” and one from sipa saying “sheffine, you’re confused”, so I deleted them all to reduce noise.
Censorship!!!
(yes, that’s indeed what happened)
@@ -217,6 +217,9 @@ class PeerManager final : public CValidationInterface, public NetEventsInterface
217 /** Send a version message to a peer */
218 void PushNodeVersion(CNode& pnode, int64_t nTime);
219
220+ /** Send a ping message on a regular schedule or if requested manually */
221+ void MaybeSendPing(CNode* pto);
Why not give this a bool return type, since you change it 3 commits later? I don't have an opinion, but why is it better to do it this way?
Just so that the commits are hygienic. For people reviewing commit by commit, it wouldn’t make sense to have a function that returns a bool which is just ignored. Changing the return type in a subsequent commit is as easy as changing void to bool in the declaration and definition.
I was considering not returning a bool here, since it's a bit of a strange interface (the function sets pto.fDisconnect and returns false). Perhaps it should stay as returning void, and have SendMessages() check for fDisconnect after calling this function?
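For illustration, a minimal self-contained sketch of that void-plus-flag interface, with stub types standing in for CNode and the PeerManager methods (hypothetical names, not the PR's code):

#include <cstdio>

struct CNodeStub { bool fDisconnect{false}; };

void MaybeSendPing(CNodeStub& node_to, bool ping_timed_out)
{
    if (ping_timed_out) {
        node_to.fDisconnect = true; // mark for disconnect rather than return a bool
        return;
    }
    // ... otherwise maybe send a ping ...
}

bool SendMessages(CNodeStub& node, bool ping_timed_out)
{
    MaybeSendPing(node, ping_timed_out);
    if (node.fDisconnect) return true; // caller checks the flag afterwards
    // ... rest of message sending ...
    return true;
}

int main()
{
    CNodeStub node;
    SendMessages(node, /*ping_timed_out=*/true);
    std::printf("fDisconnect=%d\n", node.fDisconnect); // prints fDisconnect=1
}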
1052- /** When the last ping was sent, or 0 if no ping was ever sent */
1053- std::atomic<std::chrono::microseconds> m_ping_start{0us};
1054- // Last measured round-trip time.
1055- std::atomic<int64_t> nPingUsecTime{0};
1056- // Best measured round-trip time.
1057 std::atomic<int64_t> nMinPingUsecTime{std::numeric_limits<int64_t>::max()};
Genuine question: why should best ping time be in CNode but all the others in Peer? We use best ping time for eviction logic, which seems like an application-layer thing…
620- ping_wait = GetTime<std::chrono::microseconds>() - m_ping_start.load();
621- }
622-
623- // Raw ping time is in microseconds, but show it to user as whole seconds (Bitcoin users should be well used to small numbers with many decimal places by now :)
624- stats.m_ping_usec = nPingUsecTime;
625- stats.m_min_ping_usec = nMinPingUsecTime;
Since you removed this, CNodeStats doesn't know min ping time anymore. The p2p_eviction and p2p_ping functional tests fail since getpeerinfo() always returns 0 for minping :(
Adding it back fixed the tests for me:
stats.m_min_ping_usec = m_best_ping_time;
Oh hey, if s/m_min_ping_usec/m_best_ping_time, then:
X(m_best_ping_time);
(wow so clean)
Good catch! I lost this in a rebase. I’ve now restored it.
I’m going to leave the m_min_ping_usec
name to avoid increasing the diff further.
Can’t really comment on the overall goal, as I don’t really have a lot of context regarding either locks, mutexes, or peer management, but the linked motivation sounded good to me. The PR is easy to follow and each step makes sense to me. I found the first commit message to be a bit confusing:
No need to create another shared_ptr if ProcessMessages() already has one in hand.
I take it to mean something along the lines of "Instead of creating another shared ptr, let's pass one that the callsite already has on hand."
code review ACK https://github.com/bitcoin/bitcoin/pull/20721/commits/6f0309b4743d05c512fc86d2d206f72d512c5c93
Regarding keeping best ping time in CNode: you broke a few functional tests by removing minping from copyStats(), which I assume is accidental? For renaming, I personally think "best" is more informative than "min", but there's some inconsistency in naming now: it's still called 'minping' in getpeerinfo and m_min_ping_usec in CNodeStats.
1055- std::atomic<int64_t> nPingUsecTime{0};
1056- // Best measured round-trip time.
1057- std::atomic<int64_t> nMinPingUsecTime{std::numeric_limits<int64_t>::max()};
1058- // Whether a ping is requested.
1059- std::atomic<bool> fPingQueued{false};
1060+ std::atomic<int64_t> m_best_ping_time{std::numeric_limits<int64_t>::max()};
std::atomic<int64_t> m_min_ping_time{std::numeric_limits<int64_t>::max()};
I think renaming from "min" to "best" needlessly sows confusion or causes more code churn for no gain.
I've renamed it to m_min_ping_time (it's still code churn since it was previously named nMinPingUsecTime, but I agree with you about the consistent naming).
@@ -237,6 +230,12 @@ static RPCHelpMan getpeerinfo()
230 heights.push_back(height);
231 }
232 obj.pushKV("inflight", heights);
233+ if (statestats.m_ping_usec > 0) {
234+ obj.pushKV("pingtime", ((double)statestats.m_ping_usec) / 1e6);
235+ }
236+ if (statestats.m_ping_wait_usec > 0) {
237+ obj.pushKV("pingwait", ((double)statestats.m_ping_wait_usec) / 1e6);
Can you use named casts in lines 234 and 237 while touching this code?
https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines.html#Res-casts-named
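For reference, a standalone comparison of the C-style cast from the diff and the named cast being requested (the value is made up for the demo):

#include <cstdint>
#include <cstdio>

int main()
{
    const int64_t ping_usec{123456};
    const double c_style{((double)ping_usec) / 1e6};          // cast style from the diff
    const double named{static_cast<double>(ping_usec) / 1e6}; // named cast per the guideline
    std::printf("%f %f\n", c_style, named); // both print 0.123456
}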
Concept ACK
A couple of thoughts on preliminary quick skim-through.
Genuine question, why should best ping time be in CNode but all the others in Peer? We use best ping time for eviction logic which seems like an application-layer thing…
This is a very good question. I haven’t specified the difference between connection-layer and application-layer well enough with respect to ping times.
Functionally, ping times are used in two places: timeout logic and eviction logic (they’re also presented to the user through the RPC and GUI, but that doesn’t affect what follows).
There’s timeout logic in the connection layer (in InactivityCheck()
), which should detect when there’s a problem with the connection, such as:
There’s also timeout logic in the application layer, such as here:
where we detect if block download has timed out. Ping timeouts belong in the application layer, since failing to respond to a ping message with a pong message is a P2P protocol violation. However, that ping timeout logic has previously existed in net.cpp because the ping data existed in CNode. In this PR, we move the ping timeout logic to the application layer, where it logically belongs. This logic requires the nPingNonceSent, m_ping_start and fPingQueued data to be in the net_processing layer.
(You may be asking the question: “isn’t the connection layer timeout logic somewhat redundant with the ping timeout logic?” and you’d be right - if pings are not timing out, then we must be sending/receiving data from the peer, so the connection cannot time out. However, a bit of redundancy here is fine - the net layer is responsible for maintaining the data connection, and the net_processing layer is responsible for maintaining the p2p protocol peering.)
Eviction logic exists entirely in the connection layer, in CConnman::AttemptToEvictConnection(). It requires data from both the connection layer (e.g. connection time, connection type, whether the connection is local, etc) and from the application layer (e.g. last block received time, last tx received time, etc). For that to work, the application layer must inform the connection layer of events, such as a valid tx received, a valid block received, and (now) a valid pong received, and the connection layer must store that data (as nLastBlockTime, nLastTXTime and m_min_ping_time) for use in the eviction logic. We could group those members in a structure called EvictionCriteria or similar to make it clearer what they're used for. For now, I've just expanded the comment for m_min_ping_time.
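For illustration, a hypothetical sketch of that EvictionCriteria grouping (this struct does not exist in the PR; the member names are borrowed from CNode):

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <limits>

// Data the application layer reports to the connection layer for use in
// CConnman::AttemptToEvictConnection() (hypothetical grouping only).
struct EvictionCriteria {
    std::atomic<int64_t> nLastBlockTime{0};  // updated when a valid block is received
    std::atomic<int64_t> nLastTXTime{0};     // updated when a valid tx is received
    std::atomic<int64_t> m_min_ping_time{std::numeric_limits<int64_t>::max()}; // best pong round-trip time
};

int main()
{
    EvictionCriteria criteria;
    // Record a 0.25s pong round trip as the new best, if it is one.
    criteria.m_min_ping_time = std::min<int64_t>(criteria.m_min_ping_time, 250'000);
    return 0;
}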
Writing this has made me realize that nPingUsecTime should probably stay in CNode, since it doesn't need to be used for any application logic. I've done that in the latest force-push.
This exists entirely in the connection layer, in CConnman::AttemptToEvictConnection(). It requires data from both the connection layer (e.g. connection time, connection type, whether the connection is local, etc)
Not convinced that makes sense – "connection type" and "how many local connections to evict" both seem like application logic to me. I was certainly expecting m_conn_type to eventually make its way into net_processing, given it deals with message types (block/tx relay/addrfetch/feeler) and trust issues (inbound/outbound/manual), which are all application-level concepts.
@@ -1143,6 +1144,10 @@ bool AppInitParameterInteraction(const ArgsManager& args)
1144 return InitError(Untranslated("peertimeout cannot be configured with a negative value."));
1145 }
1146
1147+ if (args.GetArg("-pingtimeout", TIMEOUT_INTERVAL) <= 0) {
nit: the error message says "negative value", but it also triggers if pingtimeout is set to 0.
@@ -1144,6 +1145,10 @@ bool AppInitParameterInteraction(const ArgsManager& args)
1145 return InitError(Untranslated("peertimeout cannot be configured with a negative value."));
1146 }
1147
1148+ if (args.GetArg("-pingtimeout", TIMEOUT_INTERVAL) <= 0) {
Argument handling is in ArgsManager::GetArg(). Since the second argument here is an int, the GetArg(string, int) version is called, which will try to convert the string into an int.
This might result in a slightly inaccurate error message, but I think it's fine. It's the same behaviour as for -peertimeout:
$ ./src/bitcoind --pingtimeout=abc
Error: pingtimeout cannot be configured with a negative value.
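To illustrate why "abc" ends up in the "negative value" branch: the int64 GetArg overload converts the stored string with an atoi64-style parse that yields 0 for non-numeric input. A standalone sketch, using strtoll as a stand-in for Bitcoin Core's conversion helper:

#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main()
{
    const char* value{"abc"};                               // from --pingtimeout=abc
    const int64_t parsed{std::strtoll(value, nullptr, 10)}; // 0 for non-numeric input
    if (parsed <= 0) {
        std::printf("Error: pingtimeout cannot be configured with a negative value.\n");
    }
}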
1077+ // Merely reporting pingtime might fool the caller into thinking the node was still responsive,
1078+ // since pingtime does not update until the ping is complete, which might take a while.
1079+ // So, if a ping is taking an unusually long time in flight,
1080+ // the caller can immediately detect that this is happening.
1081+ std::chrono::microseconds ping_wait{0};
1082+ if ((0 != peer->m_ping_nonce_sent) && (0 != peer->m_ping_start.load().count())) {
No, these aren’t required. In C++ &&
has a lower precedence than !=
so the parens aren’t changing the order of operations (https://en.cppreference.com/w/cpp/language/operator_precedence). They’re not doing any harm either.
This code is moved unaltered in this PR. I think it’s fine to keep it as it is.
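A quick standalone check of the precedence claim, showing the expression parses identically with and without the parens:

#include <cstdio>

int main()
{
    const int a{1}, b{2};
    const bool with_parens{(0 != a) && (0 != b)};
    const bool without_parens{0 != a && 0 != b}; // != binds tighter than &&
    std::printf("%d %d\n", with_parens, without_parens); // prints 1 1
}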
@@ -83,6 +83,9 @@ BOOST_AUTO_TEST_CASE(outbound_slow_chain_eviction)
83 auto peerLogic = PeerManager::make(chainparams, *connman, nullptr, *m_node.scheduler,
84 *m_node.chainman, *m_node.mempool, false);
85
86+ // Don't let pings time out
// Don't let pings time out due to mocktime
@@ -33,6 +33,7 @@ def set_test_params(self):
33 [
34 "-acceptnonstdtxn=1",
35 "-peertimeout=9999", # bump because mocktime might cause a disconnect otherwise
36+ "-pingtimeout=9999", # bump because mocktime might cause a disconnect otherwise
Weren't ping timeouts already possible before -pingtimeout is used? If not, why did some tests start to fail due to ping timeouts?
Previously, when ping timeouts were handled in CConnman::InactivityCheck(), the whole logic was gated on the following guard:
if (now <= node.nTimeConnected + m_peer_connect_timeout) {
    // Only run inactivity checks if the peer has been connected longer
    // than m_peer_connect_timeout.
    return false;
}
Because tests generally run for less than m_peer_connect_timeout, the ping timeout logic was not hit at all. Once the logic moves to net_processing, that guard no longer exists, so we start testing for ping timeouts as soon as the peer is connected.
@@ -315,7 +315,8 @@ class PeerManagerImpl final : public PeerManager
315 /** Send a version message to a peer */
316 void PushNodeVersion(CNode& pnode, int64_t nTime);
317
318- /** Send a ping message on a regular schedule or if requested manually */
319+ /** Send a ping message on a regular schedule or if requested manually. May
320+ * disconnect the peer if a ping has timed out. */
321 void MaybeSendPing(CNode& node_to);
/** Send a ping message every PING_INTERVAL or if requested via rpc. Disconnect
 * the peer if a ping has timed out. Return true upon disconnection. */
[[nodiscard]] bool ManagePing(CNode& node_to);
I’ve changed the signature to return a void, so the nodiscard tag is no longer required (good suggestion though!)
I’ve taken the suggested comment changes.
I see around ~20 functional tests that use setmocktime… but you only added -pingtimeout=9999 to two of them. Are you not worried about ping timeouts due to mocktime in any of the other tests?
Concept ACK on moving the ping data from CNode in net to Peer in net_processing.
Testing also by running src/bitcoind -pingtimeout=1 -signet (didn't want to disrupt mainnet peers) and got some expected log messages:
2021-02-03T15:34:23Z ping timeout: 1.018515s peer=5
@@ -1144,6 +1145,10 @@ bool AppInitParameterInteraction(const ArgsManager& args)
1145 return InitError(Untranslated("peertimeout cannot be configured with a negative value."));
1146 }
1147
1148+ if (args.GetArg("-pingtimeout", TIMEOUT_INTERVAL) <= 0) {
1149+ return InitError(Untranslated("pingtimeout cannot be configured with a negative value."));
Nit: would it be a bit pedantic to call it a nonpositive value here, to include zero? Although I see it's consistent with the -peertimeout error, so that's probably why :)
I see around ~20 functional tests that use setmocktime… but you only added -pingtimeout=9999 to two of them. Are you not worried about ping timeouts due to mocktime in any of the other tests?
@pinheadmz I imagine it's only necessary where setmocktime does an increase of more than the default ping timeout.
Dug into this a little bit looking for setmocktime calls with big jumps - @jnewbery do you think p2p_addr_relay.py, p2p_addrv2_relay.py, and/or p2p_getaddr_caching.py need a -pingtimeout?
Also a question that may or may not be relevant - how do the inactivity timers handle when setmocktime goes backwards in time?
I’ve updated the -pingtimeout
option to be a binary enable/disable. We don’t need fine control over it in the tests, and it was essentially being used to disable ping timeouts entirely.
Dug into this a little bit looking for setmocktime calls with big jumps - @jnewbery do you think p2p_addr_relay.py, p2p_addrv2_relay.py, and/or p2p_getaddr_caching.py need a -pingtimeout?
Very nice. Thanks @glozow. I’ve updated those tests to disable ping timeouts.
Also a question that may or may not be relevant - how do the inactivity timers handle when setmocktime goes backwards in time?
That’s not a problem for the ping timeout. The ping timeout condition is now > peer.m_ping_start.load() + std::chrono::seconds{TIMEOUT_INTERVAL}
i.e. “(the time now) is after (when the ping was sent) plus (the timeout interval)”. If time goes backwards, then (the time now) is decreased and we won’t hit this condition.
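A tiny self-contained demonstration of that reasoning (the timestamps are arbitrary; TIMEOUT_INTERVAL is 20 minutes, matching the constant in net.h):

#include <chrono>
#include <cstdio>

int main()
{
    using namespace std::chrono;
    const seconds TIMEOUT_INTERVAL{20 * 60};          // 20 minutes, as in net.h
    const microseconds ping_start{1'000'000'000'000}; // arbitrary "ping sent" timestamp

    microseconds now{ping_start + minutes{5}};        // 5 minutes after the ping
    std::printf("timed out: %d\n", static_cast<int>(now > ping_start + TIMEOUT_INTERVAL)); // 0

    now -= hours{24};                                 // mocktime jumps backwards
    std::printf("timed out: %d\n", static_cast<int>(now > ping_start + TIMEOUT_INTERVAL)); // still 0
}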
@@ -48,6 +48,7 @@ class AddrTest(BitcoinTestFramework):
48 def set_test_params(self):
49 self.setup_clean_chain = False
50 self.num_nodes = 1
51+ self.extra_args = [["-nopingtimeout"]] # mocktime might cause a disconnect otherwise
So -pingtimeout means timeout=true? And -nopingtimeout is the same thing as -pingtimeout=0? I couldn't find where the implicit "no" logic was :(
@@ -713,6 +707,12 @@ class CNode
707
708 std::string ConnectionTypeAsString() const { return ::ConnectionTypeAsString(m_conn_type); }
709
710+ //* We have completed a successful ping-pong keepalive. Update latest and best ping times. */
I appreciate the added documentation 🙏
nit:
/** We have completed a successful ping-pong keepalive. Update latest and best ping times. */
Wooo 🙌 the boolean option makes much more sense to me - in the relevant tests, we want to ignore ping timeouts rather than extend them. I feel like the interface would be more intuitive if it was just called -disablepingtimeout since it's an optional flag, but perhaps that's just personal preference?
I checked the functional tests by checking the input time against m_ping_wait_usec from GetNodeStateStats in the setmocktime RPC, and then running the functional tests that use setmocktime. Seems like feature_bip68_sequence.py is actually the only one where it tripped… but my suggestions still make sense to me, and I remember lightlike mentioning wallet_resendwallettransactions.py failing, so maybe it was a race, with no pings in flight?
Thanks for the reviews!
I feel like the interface would be more intuitive if it was just called -disablepingtimeout since it’s an optional flag, but perhaps just personal preference?
I think it’s best to avoid negative configuration names. -disablepingtimeout=0
would mean “do not disable ping timeout”, which is a double negative and slightly harder to parse than “enable ping timeout”. See other options like -listen
, -persistmempool
, etc, where the default is true, and -nolisten
, -nopersistmempool
is used to disable.
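A simplified model of that negation convention - this is not the real ArgsManager code, just a sketch of the rule that a leading "no" flips the option's value to "0":

#include <cstdio>
#include <string>
#include <utility>

// Interpret "-nofoo" as "-foo=0" and a bare "-foo" as "-foo=1".
std::pair<std::string, std::string> InterpretOption(std::string key)
{
    if (key.rfind("no", 0) == 0) {      // key starts with "no"
        return {key.substr(2), "0"};
    }
    return {key, "1"};
}

int main()
{
    const auto [name, value] = InterpretOption("nopingtimeout");
    std::printf("-%s=%s\n", name.c_str(), value.c_str()); // prints -pingtimeout=0
}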
my suggestions still make sense to me and I remember lightlike mentioning wallet_resendwallettransactions.py failing so maybe it was a race, no pings in flight?
Right, these are all races, so they won't fail consistently. I think it's just safer to disable ping timeouts on any test which makes large mocktime leaps.
ACK https://github.com/bitcoin/bitcoin/pull/20721/commits/bbd970a2123aec5071dd3dd901e478844c6e0fea 🍿
The negative config option stuff makes more sense to me now, thanks for explaining!
Light Code-Review ACK bbd970a2123aec5071dd3dd901e478844c6e0fea
This may have been answered before, but could someone tell me the exact sequence of events in the test framework involving setmocktime that would make -nopingtimeout necessary? My naive reading is that send_and_ping blocks until it receives the pong and is therefore not going to time out if the next call is a setmocktime.
I’ll also do a deeper read into [net processing] Move ping timeout logic to net processing
later this week to make sure that this move is indeed safe and doesn’t cause us to miss any opportunities to disconnect a peer.
3847- if (ping_time.count() >= 0) {
3848- // Successful ping time measurement, replace previous
3849- pfrom.nPingUsecTime = count_microseconds(ping_time);
3850- pfrom.nMinPingUsecTime = std::min(pfrom.nMinPingUsecTime.load(), count_microseconds(ping_time));
3851+ const auto ping_time = ping_end - peer->m_ping_start.load();
3852+ if (ping_time.count() > 0) {
This changes >= to >. It makes sense to call it a "timing mishap" if the pong was received in the same microsecond as we recorded for the ping going out, so this seems fine; just highlighting the change since it's easy to miss.
4337+ node_to.fPingQueued = false;
4338+ node_to.m_ping_start = GetTime<std::chrono::microseconds>();
4339+ if (node_to.GetCommonVersion() > BIP0031_VERSION) {
4340+ node_to.nPingNonceSent = nonce;
4341+ m_connman.PushMessage(&node_to, msgMaker.Make(NetMsgType::PING, nonce));
4342 } else {
pong messages were introduced in 2012 with BIP 31, for protocol versions above 60000. I imagine that prior to that, the reason for a ping message was simply a keepalive (make sure that TCP connections stay up, NAT pinholes stay open, etc). This logic maintains backwards compatibility with clients on version 60000 or older.
In practice, I expect there are almost no such old clients, but if there are any it'd be a shame to arbitrarily disconnect them for failing ping timeouts.
@@ -716,6 +710,12 @@ class CNode
710
711 std::string ConnectionTypeAsString() const { return ::ConnectionTypeAsString(m_conn_type); }
712
713+ /** A ping-pong keepalive has completed successfully. Update latest and minimum ping times. */
714+ void PingReceived(std::chrono::microseconds ping_time) {
Should this be called PongReceived?
4401+ MaybeSendPing(*pto, *peer);
4402+
4403+ // MaybeSendPing may have marked peer for disconnection
4404+ if (pto->fDisconnect) return true;
4405+
4406+ // If we get here, the outgoing message serialization version is set and can't change.
This comment previously followed the fSuccessfullyConnected check, which I think(?) was the intent -> fSuccessfullyConnected is set when processing the VERACK message, and m_greatest_common_version gets set when processing the VERSION message.
But now this comment is below MaybeSendPing, which also might retrieve m_greatest_common_version to construct the ping message. I find the intent of this comment to be less clear after these changes.
This is a comment on the line below, which is constructing a CNetMsgMaker object with the correct serialization version:
const CNetMsgMaker msgMaker(pto->GetCommonVersion());
I've moved this up to directly after the fSuccessfullyConnected check. Hopefully that's clearer.
4390+{
4391+ // We must call MaybeDiscourageAndDisconnect first, to ensure that we'll
4392+ // disconnect misbehaving peers even before the version handshake is complete.
4393+ if (MaybeDiscourageAndDisconnect(*pto)) return true;
4394+
4395+ // Don't send anything until the version handshake is complete
I know you only moved this code, but I think we could handle fDisconnect better.
This comment seems misleading, since I think it only refers to fSuccessfullyConnected. Then, we might set fDisconnect to true in MaybeSendPing, so we check this bool again a couple of lines down to also return early.
A simpler logic flow might be removing this check here, starting MaybeSendPing with a return if node_to.fDisconnect is set, and then removing this first early return from SendMessages. And the comment on 4377 could be updated to something along the lines of "Don't send messages to peers that are marked for disconnection".
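A minimal sketch of that suggested flow, again with a stub type rather than the real classes (this is the hypothetical restructuring, not what the PR implements):

#include <cstdio>

struct CNodeStub { bool fDisconnect{false}; };

void MaybeSendPing(CNodeStub& node_to)
{
    if (node_to.fDisconnect) return; // don't send messages to peers marked for disconnection
    // ... ping schedule / timeout logic ...
}

int main()
{
    CNodeStub node;
    node.fDisconnect = true;
    MaybeSendPing(node); // returns immediately; no dedicated check needed in the caller
    std::printf("fDisconnect=%d\n", node.fDisconnect); // prints fDisconnect=1
}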
@@ -315,6 +323,10 @@ class PeerManagerImpl final : public PeerManager
323 /** Send a version message to a peer */
324 void PushNodeVersion(CNode& pnode, int64_t nTime);
325
326+ /** Send a ping message every PING_INTERVAL or if requested via RPC. May
327+ * disconnect the peer if a ping has timed out. */
/** Send a ping message every PING_INTERVAL or if requested via RPC. May
 * mark the peer to be disconnected if a ping has timed out. */
@@ -25,11 +25,14 @@ static const bool DEFAULT_PEERBLOOMFILTERS = false;
25 static const bool DEFAULT_PEERBLOCKFILTERS = false;
26 /** Threshold for marking a node to be discouraged, e.g. disconnected and added to the discouragement filter. */
27 static const int DISCOURAGEMENT_THRESHOLD{100};
28+/** -pingtimeout default */
29+static const int64_t DEFAULT_PING_TIMEOUT = true;
Another comment on code that you simply moved: I found this definition odd. Why is it defined as an int but set as a boolean? Also, why is it an int64_t? That seems like more space than needed for a simple bool value.
also note that this could be constexpr, but I don't know if that actually impacts anything. I tried it out in godbolt explorer, and it looks like for such simple logic it compiled down to the same assembly regardless 🤷🏽‍♀️
code review ACK bbd970a212
carefully read through the diff and confirmed nothing substantial changed / thought about potential divergences. seems to me that the main functional difference is that marking a peer for disconnect based on ping time has changed from the socket handler thread to the message handler thread (and then the socket handler thread still executes the disconnection). I imagine this could sometimes take slightly longer to identify the peers for disconnect - eg. ThreadMessageHandler has a 100ms wait if !fMoreWork, or generally the message handler thread seems like it has more capacity for doing stuff. However, I can't see how this sort of small difference could cause any problems.
I left several comments, but they are all focused on style or documentation, and the majority of them are about code that you moved without changing. None of them are worth holding up this PR from merge or invalidating ACKs, just food for thought if you find any worthwhile for follow-ups, or have to push for a different reason.
This may have been answered before but could someone tell me the exact sequence of events in the test framework involving setmocktime that would make -nopingtimeout necessary? My naive reading is that send_and_ping blocks until it receives the pong and is therefore not going to timeout if the next call is a setmocktime.
The send_and_ping method involves the python test framework sending a ping message to the bitcoind node and waiting for a pong. The timeout that could cause a problem is the ping message in the other direction. Here's how it could fail:
1. the bitcoind node sends a ping message to the test framework at time 0
2. the test framework calls setmocktime to push the node's mocktime some large number of seconds into the future
3. before the pong response has been received/processed, SendMessages() is called by the message handler thread for that peer, and the test in MaybeSendPing() fails: now > peer.m_ping_start.load() + std::chrono::seconds{TIMEOUT_INTERVAL}
bitcoind periodically sends pings to all its peers, and is continually running the message handler thread loop, so this window condition will cause the test to fail intermittently.
4335+ // This means that setmocktime may cause pings to time out.
4336+ auto now = GetTime<std::chrono::microseconds>();
4337+
4338+ if (gArgs.GetArg("-pingtimeout", DEFAULT_PING_TIMEOUT) &&
4339+ peer.m_ping_nonce_sent &&
4340+ now > peer.m_ping_start.load() + std::chrono::seconds{TIMEOUT_INTERVAL}) {
I prefer if (x < y) rather than if (y > x) generally – smaller numbers generally go to the left of larger numbers, so keeping that convention is one less thing to have to think about. (This patch changes the tests from timeout < GetTime() to now > timeout)
Likewise, (low_value < x && x < high_value) and (x < low_value || high_value < x) read much clearer than (x > low_value && x < high_value) and (x < low_value || x > high_value).
4297+ auto now = GetTime<std::chrono::microseconds>();
4298+
4299+ if (gArgs.GetArg("-pingtimeout", DEFAULT_PING_TIMEOUT) &&
4300+ node_to.nPingNonceSent &&
4301+ now > node_to.m_ping_start.load() + std::chrono::seconds{TIMEOUT_INTERVAL}) {
4302+ LogPrint(BCLog::NET, "ping timeout: %fs peer=%d\n", 0.000001 * count_microseconds(now - node_to.m_ping_start.load()), node_to.GetId());
Would probably be better to set m_pingtimeout = gArgs.GetArg() in the constructor, rather than re-query it every loop.
I think losing the check that GetSystemTimeInSeconds() <= node_to.nTimeConnected + peer_connect_timeout (which is done in InactivityCheck but not MaybeSendPing) is a mistake – the docs for -peertimeout say "Specify a p2p connection timeout delay in seconds. After connecting to a peer, wait this amount of time before considering disconnection based on inactivity", which seems like it should apply here: not sending a ping reply certainly seems like a form of inactivity to me! (Setting m_peer_connect_timeout from gArgs in the PeerManagerImpl constructor seems like it would work here too, without violating layering.)
Having -peertimeout continue to apply to (a lack of) PONG messages seems to make -pingtimeout unnecessary; the tests don't need to be modified anymore, and I'm not seeing any reasons other than initial mocktime confusion as to why you'd want to prevent disconnections?
I've removed pingtimeout as requested.
@@ -32,7 +32,7 @@ def set_test_params(self):
32 self.extra_args = [
33 [
34 "-acceptnonstdtxn=1",
35- "-peertimeout=9999", # bump because mocktime might cause a disconnect otherwise
I think it might make more sense in general to set -peertimeout=9999 for all the test cases (ie, in the framework generated bitcoin.conf) – or at least all the test cases that use mocktime – and override it anywhere that's explicitly testing disconnection due to timeouts (p2p_ping.py).
I think it might make more sense in general to set -peertimeout=9999 for all the test cases (ie, in the framework generated bitcoin.conf)
Changing the behaviour of all tests is outside the scope of this PR, but I’m happy to review any proposal you have for making the tests more robust.
@@ -4299,7 +4312,7 @@ void PeerManagerImpl::MaybeSendPing(CNode& node_to)
4312 pingSend = true;
4313 }
4314
4315- if (node_to.nPingNonceSent == 0 && node_to.m_ping_start.load() + PING_INTERVAL < GetTime<std::chrono::microseconds>()) {
4316+ if (node_to.nPingNonceSent == 0 && now > node_to.m_ping_start.load() + PING_INTERVAL) {
@@ -4291,6 +4292,18 @@ class CompareInvMempoolOrder
4292
4293 void PeerManagerImpl::MaybeSendPing(CNode& node_to)
4294 {
4295+ // Use mockable time for ping timeouts.
If you’re doing disconnects here as well, doesn’t that make it MaybeSendPingOrDisconnectDueToLackOfPong()
? Not sure the Maybe
is useful – SendMessages
doesn’t have a Maybe
despite it not necessarily doing something every time it’s called.
This moves ping disconnections from ThreadSocketHandler (which calls SocketHandler, which calls InactivityCheck, I think on a 50ms cycle) to ThreadMessageHandler (which calls SendMessages, which calls MaybeSendPing, I think on a 100ms cycle, though it will reinvoke SendMessages more frequently if any node's ProcessMessages indicates there's more work). I think that's fine.
make it MaybeSendPingOrDisconnectDueToLackOfPong()
I don’t think the function name needs to document all the functionality. It’s documented in the function comment:
0 /** Send a ping message every PING_INTERVAL or if requested via RPC. May
1 * mark the peer to be disconnected if a ping has timed out. */
Not sure the Maybe is useful – SendMessages doesn’t have a Maybe despite it not necessarily doing something every time it’s called.
Right, and MaybeDiscourageAndDisconnect() does have maybe in it.
I don’t really care what this function is called. If you feel strongly, I can change it.
Maybe one day we'll get a PingPongMessages class that deals with both the send and receive side :pray:
@@ -77,13 +77,12 @@ static RPCHelpMan ping()
77 [&](const RPCHelpMan& self, const JSONRPCRequest& request) -> UniValue
78 {
79 NodeContext& node = EnsureNodeContext(request.context);
80- if(!node.connman)
81+ if(!node.peerman) {
3843+ if (nonce == peer->m_ping_nonce_sent) {
3844 // Matching pong received, this ping is no longer outstanding
3845 bPingFinished = true;
3846- const auto ping_time = ping_end - pfrom.m_ping_start.load();
3847+ const auto ping_time = ping_end - peer->m_ping_start.load();
3848 if (ping_time.count() >= 0) {
ping_time.count() == 0 is expected if you do setmocktime X, then a ping goes out, and the pong is received before you call setmocktime Y.
I think I’m a NACK on the -pingtimeout
changes as they stand – at a minimum it should be clear from the PR description and commit messages why those changes are needed, but there’s currently no motivation for them in those places.
I think duplicating the -peertimeout check into net_processing is enough to obsolete the need to change the tests (if it's not, I still don't understand the change well enough to ack it), and that seems like a much more obvious refactor to me, and also matches the help text for the option better. (That said, I think adding -peertimeout=9999 to those tests in case they end up taking longer than 60s to run would be justifiable, but that's true with or without these changes.)
I believe we already have a bug in InactivityCheck when setting mocktime into the past, in that nLastRecv is set via GetTime<microseconds> in ReceiveMsgBytes but then compared against GetSystemTime in InactivityCheck, leading to errors once -peertimeout has passed.
I think I’m a NACK on the -pingtimeout changes as they stand – at a minimum it should be clear from the PR description and commit messages why those changes are needed, but there’s currently no motivation for them in those places.
I think duplicating the -peertimeout check into net_processing is enough to obsolete the need to change the tests
Ok, done. I've removed -pingtimeout, and duplicated the -peertimeout setting and logic into MaybeSendPing() as requested.
Moving the application-layer data and logic is an opportunity to clean up and rationalize the weird edge cases and strange behavior. As you've noted from your observation about the InactivityCheck bug, and as sipa pointed out in #20027 (review), mixing mockable time and system time is too complicated and a source of bugs and confusion. Moving ping timeouts to the application layer means that we can lose one of those interactions between system time and mockable time - system time timeouts on the socket stay in net, and mockable timeouts live in the application layer. This is really only applicable for tests, and I don't think there's much downside to adding a hidden configuration option where it's needed.
However, I don’t think this is important enough to hold up the rest of #19398. I’ve restored the behaviour to exactly how it was, so no changes are required to tests (except the modified PeerManager ctor signature). We can perhaps rationalize mocktime/system time for pings at some later date.
4376- pto->m_ping_start = GetTime<std::chrono::microseconds>();
4377- if (pto->GetCommonVersion() > BIP0031_VERSION) {
4378- pto->nPingNonceSent = nonce;
4379- m_connman.PushMessage(pto, msgMaker.Make(NetMsgType::PING, nonce));
4380+ peer.m_ping_queued = false;
4381+ peer.m_ping_start = GetTime<std::chrono::microseconds>();
I think this could just be:
peer.m_ping_start = now;
@@ -335,6 +347,10 @@ class PeerManagerImpl final : public PeerManager
347 * on extra block-relay-only peers. */
348 bool m_initial_sync_finished{false};
349
350+ /** After connecting to a peer, wait this amount of time before considering
351+ * disconnection based on inactivity */
352+ const int64_t m_peer_connect_timeout;
RE -pingtimeout:
I understand how it could be surprising that -peertimeout would no longer apply to ping logic, especially since InactivityChecks would still bypass the version handshake timeout. However, I find it unfortunate to have the system time comparisons in MaybeSendPing. Having the application level logic be mockable makes sense to me, and seems in line with the discussion in #20027 (review). Maybe in the longer term, we could have a debug arg that handles the application level timeouts of ping & verack handshake, and there could be a param interaction that sets the new arg based on -peertimeout.
But that all seems out of the scope of this PR. I’ve reviewed the changes and the code looks correct as far as I can tell.
ACK 31cc466eb0
@@ -4297,6 +4298,20 @@ class CompareInvMempoolOrder
4298
4299 void PeerManagerImpl::MaybeSendPing(CNode& node_to)
4300 {
4301+ // Use mockable time for ping timeouts.
4302+ // This means that setmocktime may cause pings to time out.
4303+ auto now = GetTime<std::chrono::microseconds>();
4304+
4305+ if (GetSystemTimeInSeconds() > node_to.nTimeConnected + m_peer_connect_timeout &&
Adding:
/** Return true if inactivity checks should be ignored */
inline bool IgnoreInactivityChecks(const CNode& node) const
{
    return GetSystemTimeInSeconds() <= node.nTimeConnected + m_peer_connect_timeout;
}
to CConnman would let you avoid adding m_peer_connect_timeout in PeerManagerImpl and keep the non-mockable times in net.cpp. (You have access to connman anyway via PeerManagerImpl::m_connman, and access to node in order to set the disconnect flag.)
4368+ // If we get here, the outgoing message serialization version is set and can't change.
4369+ const CNetMsgMaker msgMaker(pto->GetCommonVersion());
4370+
4371+ MaybeSendPing(*pto);
4372+
4373+ PeerRef peer = GetPeerRef(pto->GetId());
Should there be an if (peer == nullptr) return true; here? We check that everywhere else.
Rather than move this line, it might be better to continue doing it first thing and pass the result into MaybeDiscourageAndDisconnect, rather than having that function look it up too. (At least, that's where I'm expecting to end up with #20758 eventually.)
This only appears in net_processing.cpp, so it doesn't need the .* there.
@@ -3969,43 +3999,39 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
3999 return;
4000 }
4001
4002-bool PeerManagerImpl::MaybeDiscourageAndDisconnect(CNode& pnode)
4003+bool PeerManagerImpl::MaybeDiscourageAndDisconnect(CNode& pnode, Peer& peer)
Is the idea that eventually MaybeDiscourageAndDisconnect() will only need to accept a Peer? And right now, even though you could grab the PeerRef from the m_peer_map in here again, passing in a Peer is a bit less overhead?
MaybeDiscourageAndDisconnect should continue needing a CNode in order to signal the disconnection, I think.
MaybeDiscourageAndDisconnect should continue needing a CNode in order to signal the disconnection, I think.
This is true. It’s also currently needed for the connection type, which I don’t think belongs in net_processing.
Even though you could grab the PeerRef from the m_peer_map in here again, passing in a Peer is a bit less overhead?
We want to grab a PeerRef at the entry functions (ProcessMessages and SendMessages for the message handler thread), and then pass that reference down through the call stack. Grabbing a PeerRef incurs overhead because m_peer_mutex needs to be locked and the shared ptr's refcount needs to be incremented/decremented, so we shouldn't do it unnecessarily.
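As an illustration of that cost, a self-contained model of the lookup; the real GetPeerRef in net_processing has roughly this shape, but the types here are simplified stubs:

#include <cstdio>
#include <map>
#include <memory>
#include <mutex>

struct Peer { int m_id; };
using PeerRef = std::shared_ptr<Peer>;

struct PeerManagerStub {
    mutable std::mutex m_peer_mutex;
    std::map<int, PeerRef> m_peer_map;

    PeerRef GetPeerRef(int id) const
    {
        std::lock_guard<std::mutex> lock{m_peer_mutex}; // every lookup takes the mutex
        auto it = m_peer_map.find(id);
        // Copying the shared_ptr atomically increments its refcount.
        return it != m_peer_map.end() ? it->second : nullptr;
    }
};

int main()
{
    PeerManagerStub pm;
    pm.m_peer_map.emplace(0, std::make_shared<Peer>(Peer{0}));
    const PeerRef peer{pm.GetPeerRef(0)};             // grab once at the entry point...
    std::printf("use_count=%ld\n", peer.use_count()); // ...then pass the reference down
}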
@@ -1221,18 +1221,17 @@ void CConnman::NotifyNumConnectionsChanged()
1221 }
1222 }
1223
1224+bool CConnman::RunInactivityChecks(const CNode& node) const
1225+{
1226+ return GetSystemTimeInSeconds() > node.nTimeConnected + m_peer_connect_timeout;
Nice to see RunInactivityChecks() instead of taking a peertimeout param 👌
Strictly speaking, I think it's possible for the time to be different from RunInactivityChecks to InactivityChecks, since they each grab the time separately (as opposed to before, where they used the same now)? I don't think it'd be a huge issue, since it's in seconds and the system time would only move forward though.
Could make this:
bool CConnman::RunInactivityChecks(const CNode& node, const uint64_t* pnow=nullptr)
{
    const uint64_t now = (pnow ? *pnow : GetSystemTimeInSeconds());
    return now > node.nTimeConnected + m_peer_connect_timeout;
}
and call it from within InactivityCheck() as if (!RunInactivityChecks(node, &now)) return false; to avoid unnecessarily getting the system time twice in quick succession. It might also make RunInactivityChecks easier to test, since you could pass in different times and check the answer, without having to worry that GetSystemTime isn't mockable.
Main changes since last review:
Using -peertimeout on both sets of timeout checks indeed looks cleaner than before. It was easier to review for me - I did my best to verify that the decision on whether or not to run ping timeout checks is identical, which makes me feel more confident than when I was trying to find which tests might time out. 🧠 And then I tried to verify that we're still timing out pings when we're supposed to, through the MaybeSendPing route.
code review ACK a3ec04b7a6
@@ -3969,43 +3970,39 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
3970 return;
3971 }
3972
3973-bool PeerManagerImpl::MaybeDiscourageAndDisconnect(CNode& pnode)
3974+bool PeerManagerImpl::MaybeDiscourageAndDisconnect(CNode& pnode, Peer& peer)
3975 {
3976- const NodeId peer_id{pnode.GetId()};
const NodeId peer_id{peer.m_id}; would've changed fewer lines :)
Hmm, actually I think I disagree with that (well, at least after exaggerating it into a bigger principle) – review is a much more important bottleneck than just about anything else, so optimising the code to be easier to review is much more helpful than optimising it to make perfect sense if we were going to erase our git history and anonymously post the current tree to a mailing list. I think just about any time there’s a choice between correct code that’s easier to review and correct code that’s better according to some other aesthetic, easier to review should be the winner. At least until review is no longer a serious bottleneck for the project.
Obviously, peer_id
to peer.m_id
is trivial, but if we’re talking what we should aim to maximise, I think ease of review is the single right answer. I don’t think you’ll often find that being far off from optimising for clarity, though. (I don’t think “expressiveness” applies either way tbh)
The point of review being a bottleneck is that we won’t ever progress to a global maximum of quality – if we could do that, by definition we wouldn’t be bottlenecked. Every merge is a reflection of that review bottleneck: did we spend our limited review cycles on working out if a while loop does the same thing as the for loop it replaces, or did we spend it on working out if muhash or minisketch does what it claims on the box? If every PR spends 5% of review time checking that shuffling code around is correct, then that’s 1-in-20 fewer PRs we can make that affect people that don’t see the code. Or it’s an incentive to get in the habit of merging code that’s not thoroughly reviewed in order to avoid delaying things because we’ve run out of review resources, which is worse.
That doesn’t mean we shouldn’t always be aiming to improve the quality of the code; just that we should be spending the resources we have on the changes that are most valuable, even if that means leaving some good things for (possibly much) later.
(Personally, I find most of the STL algorithms worse than the loops they replace – range based for loops in c++ are pretty pleasant and behave pretty sensibly; for me, dealing explicitly with iterators and lambdas generally isn’t pleasant – they’re harder to understand and easier to make mistakes with, for at best a minimal performance benefit, and for me at least any safety benefits are offset by the lack of clarity. There are more important battles to fight, and I’m hoping that perhaps familiarity might eventually breed affection, but for now they certainly don’t seem like moving toward even a local maximum to me)
Thanks AJ, these are good points.
You’re wrong about stl algorithms though :grin:
@@ -1024,6 +1024,9 @@ class CConnman
1024
1025 void SetAsmap(std::vector<bool> asmap) { addrman.m_asmap = std::move(asmap); }
1026
1027+ /** Return true if the peer has been connected for long enough to do inactivity checks. */
Maybe "Return true if the peer should be checked for inactivity"? – we could introduce a net permission to prevent that rather than just have a standard timeout for all peers, eg
@@ -1537,7 +1536,7 @@ void CConnman::SocketHandler()
1536 if (bytes_sent) RecordBytesSent(bytes_sent);
1537 }
1538
1539- if (InactivityCheck(*pnode)) pnode->fDisconnect = true;
1540+ if (RunInactivityChecks(*pnode) && InactivityCheck(*pnode)) pnode->fDisconnect = true;
This could have stayed as an early exit inside InactivityCheck(). Seems cleaner to keep the early exit, to me.
@@ -248,6 +255,7 @@ class PeerManagerImpl final : public PeerManager
255 void CheckForStaleTipAndEvictPeers() override;
256 bool GetNodeStateStats(NodeId nodeid, CNodeStateStats& stats) override;
257 bool IgnoresIncomingTxs() override { return m_ignore_incoming_txs; }
258+ void SendPings() override;
You added RunInactivityChecks as a separate commit, but I think you could reasonably have pulled SendPings() out as a separate commit prior to moving things from net to net_processing too.
ACK a3ec04b7a61378dab0a9a53840e5f504d25ac563
Plus some maybe nits.
Refactor only. No change in behaviour.
Moves the logic to prevent running inactivity checks until
the peer has been connected for -peertimeout time into its
own function. This will be reused by net_processing later.
Ping messages are an application-level mechanism. Move timeout
logic from net to net processing.
-BEGIN VERIFY SCRIPT-
sed -i 's/fPingQueued/m_ping_queued/g' src/net_processing.cpp
sed -i 's/nMinPingUsecTime/m_min_ping_time/g' src/net.* src/net_processing.cpp src/test/net_tests.cpp
sed -i 's/nPingNonceSent/m_ping_nonce_sent/g' src/net_processing.cpp
sed -i 's/nPingUsecTime/m_last_ping_time/g' src/net.*
-END VERIFY SCRIPT-
@@ -1221,18 +1208,17 @@ void CConnman::NotifyNumConnectionsChanged()
1208 }
1209 }
1210
1211+bool CConnman::RunInactivityChecks(const CNode& node) const
Maybe call this InactivityChecksEnabled or ShouldRunInactivityChecks, because right now the name suggests this function is where they are run (aka what InactivityCheck does).
reACK https://github.com/bitcoin/bitcoin/commit/a5e15ae45ccae7948a6c5b95e6621f01afb88d55
rebase-only by git range-diff a3ec04b...a5e15ae
ACK a5e15ae45c
lost track of the different versions so just re-reviewed the whole diff commit by commit.
definitely like having the m_peer_connect_timeout check as a function in net more than duplicating the system time check in net_processing
in the future, I think it could make sense to have -peertimeout be limited to regtest. it's hidden behind debug-help, but technically possible on mainnet. I don't see how it'd be useful on mainnet though.
In the scripted diff, I don't like that it may modify non-git user files (e.g. a net.h.rej file) in the same folder. It would be better to use git grep -l '...', which would also simplify review, as reviewers don't have to call git grep to verify the replacement was complete over all files, including docs.
review ACK a5e15ae45ccae7948a6c5b95e6621f01afb88d55 🥉
Signature:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

review ACK a5e15ae45ccae7948a6c5b95e6621f01afb88d55 🥉
-----BEGIN PGP SIGNATURE-----

iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
pUhixgv9EHlcd7Eo4IXpSgu8XZbUGYox8iHe1MU0pdDdnGPBjdnoB71xvMNUczaF
vDqmc9jIEfT19pPXGJfnOGQ53oWatA9be22GiLjZtjIA+w7slyjnTJ/WM2LCxA1V
EWX2X57ad+DwEHW7YCzmf5pxxm/KK9TV3h3kE9I1yUGwg6FR5INfIyzssPqULAt2
8BexUMoXXaqeAV6WOJRq/9wdtOcv2WwD7YLc340VjdmRRRkVflySKnx0/m2x6oo8
b3zBYj3YC6+UVmPRD2YIexoG+XHAVGXbc2kFRkeJ/Puknt7pWyurIsF6w+dgzM7B
YrsLzEwgO7mO4xvXfvHYwNKuNX3WJTFJ+gpL1dIN7UlJQis4p3bl56Sxtyw6cKM+
d8VZtjYkTcEpL3jKJYnIk4fHFA6WmHaRp7b4FQr6400G2bznF6AHMC2h0z5Jqhpy
63XIWO9kD89mIPxNgzUxA+id/3SoaNPRN0Mj62RxTWmbpRX2gIvCvF/MRZJsdEuC
TQhwMCuo
=FZw8
-----END PGP SIGNATURE-----
Timestamp of file with hash f094bc0ee463f6bec56659b9f20c20ab75a2f829388e57c839500e90f3c196a3 -
@@ -209,8 +208,8 @@ static RPCHelpMan getpeerinfo()
208 if (stats.m_min_ping_usec < std::numeric_limits<int64_t>::max()) {
209 obj.pushKV("minping", ((double)stats.m_min_ping_usec) / 1e6);
4327+ // This means that setmocktime may cause pings to time out.
4328+ auto now = GetTime<std::chrono::microseconds>();
4329+
4330+ if (m_connman.RunInactivityChecks(node_to) && peer.m_ping_nonce_sent &&
4331+ now > peer.m_ping_start.load() + std::chrono::seconds{TIMEOUT_INTERVAL}) {
4332+ LogPrint(BCLog::NET, "ping timeout: %fs peer=%d\n", 0.000001 * count_microseconds(now - peer.m_ping_start.load()), peer.m_id);