This continues the work of moving application layer data into net_processing, by moving all ping data into the new Peer object added in #19607.
For motivation, see #19398.
@@ -4076,31 +4076,15 @@ class CompareInvMempoolOrder
4076 };
4077 }
4078
4079-bool PeerManager::SendMessages(CNode* pto)
4080+void PeerManager::MaybeSendPing(CNode& pto)
Since you're touching pto anyway because it's now a reference not a pointer, maybe also call it node_to or similar, since it's not a "p" anymore?
@@ -217,6 +217,9 @@ class PeerManager final : public CValidationInterface, public NetEventsInterface
217 /** Send a version message to a peer */
218 void PushNodeVersion(CNode& pnode, int64_t nTime);
219
220+ /** Send a ping message on a regular schedule or if requested manually */
221+ void MaybeSendPing(CNode& pto);
116+ // Wait 10 minutes
117+ time += 10 * 60;
118+ SetMockTime(time);
119+
120+ // Clear nPingNonceSent so we don't disconnect for a ping timeout
121+ dummyNode1.nPingNonceSent = 0;
Move the test change to its own commit so it’s easy to verify behaviour doesn’t change
good idea. Will do.
I don’t follow why the test change is necessary/desirable though.
The change is required because previously ping timeouts happened in CConnman, so it was fine to bump the time in this test because there's no CConnman to time out the connection. When the ping data and timeout logic are moved to PeerManager, the connection times out in this test if you bump the time forward by 21 minutes without resetting the ping nonce.
Makes sense; might be a useful commit message for the test change?
Seems kind-of annoying to have to remember to constrain your mocktime bumps by random ping timeout limits. Shouldn’t this also (already) cause problems in functional tests?
The functional tests have an asyncio event loop thread that will automatically respond to ping messages:
The unit tests are single threaded, so any logic to handle messages needs to be done in the main test logic.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
“jnewbery deleted a comment from sipa 3 days ago” ?
ha. There were two comments from sheffine saying “why are you emailing me?” and one from sipa saying “sheffine, you’re confused”, so I deleted them all to reduce noise.
Censorship!!!
(yes, that’s indeed what happened)
@@ -217,6 +217,9 @@ class PeerManager final : public CValidationInterface, public NetEventsInterface
217 /** Send a version message to a peer */
218 void PushNodeVersion(CNode& pnode, int64_t nTime);
219
220+ /** Send a ping message on a regular schedule or if requested manually */
221+ void MaybeSendPing(CNode* pto);
Why not give this a bool return type, since you change it 3 commits later? I don't have an opinion, but why is it better to do it this way?
Just so that the commits are hygienic. For people reviewing commit by commit, it wouldn’t make sense to have a function that returns a bool which is just ignored. Changing the return type in a subsequent commit is as easy as changing void to bool in the declaration and definition.
I was considering not returning a bool here, since it's a bit of a strange interface (the function sets pto.fDisconnect and returns false). Perhaps it should stay as returning void, and have SendMessages() check for fDisconnect after calling this function?
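For illustration, a minimal self-contained sketch of that void-plus-flag interface, with stub types standing in for CNode and the PeerManager methods (hypothetical names, not the PR's code):

#include <cstdio>

struct CNodeStub { bool fDisconnect{false}; };

void MaybeSendPing(CNodeStub& node_to, bool ping_timed_out)
{
    if (ping_timed_out) {
        node_to.fDisconnect = true; // mark for disconnect rather than return a bool
        return;
    }
    // ... otherwise maybe send a ping ...
}

bool SendMessages(CNodeStub& node, bool ping_timed_out)
{
    MaybeSendPing(node, ping_timed_out);
    if (node.fDisconnect) return true; // caller checks the flag afterwards
    // ... rest of message sending ...
    return true;
}

int main()
{
    CNodeStub node;
    SendMessages(node, /*ping_timed_out=*/true);
    std::printf("fDisconnect=%d\n", node.fDisconnect); // prints fDisconnect=1
}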
1052- /** When the last ping was sent, or 0 if no ping was ever sent */
1053- std::atomic<std::chrono::microseconds> m_ping_start{0us};
1054- // Last measured round-trip time.
1055- std::atomic<int64_t> nPingUsecTime{0};
1056- // Best measured round-trip time.
1057 std::atomic<int64_t> nMinPingUsecTime{std::numeric_limits<int64_t>::max()};
Genuine question: why should best ping time be in CNode but all the others in Peer? We use best ping time for eviction logic, which seems like an application-layer thing…
620- ping_wait = GetTime<std::chrono::microseconds>() - m_ping_start.load();
621- }
622-
623- // Raw ping time is in microseconds, but show it to user as whole seconds (Bitcoin users should be well used to small numbers with many decimal places by now :)
624- stats.m_ping_usec = nPingUsecTime;
625- stats.m_min_ping_usec = nMinPingUsecTime;
Since you removed this, CNodeStats doesn't know min ping time anymore. The p2p_eviction and p2p_ping functional tests fail since getpeerinfo() always returns 0 for minping :(
Adding it back fixed the tests for me:
stats.m_min_ping_usec = m_best_ping_time;
Oh hey, if s/m_min_ping_usec/m_best_ping_time, then:
X(m_best_ping_time);
(wow so clean)
Good catch! I lost this in a rebase. I’ve now restored it.
I’m going to leave the m_min_ping_usec
name to avoid increasing the diff further.
Can’t really comment on the overall goal, as I don’t really have a lot of context regarding either locks, mutexes, or peer management, but the linked motivation sounded good to me. The PR is easy to follow and each step makes sense to me. I found the first commit message to be a bit confusing:
No need to create another shared_ptr if ProcessMessages() already has one in hand.
I take it to mean something along the lines of "Instead of creating another shared ptr, let's pass one that the callsite already has on hand."
code review ACK https://github.com/bitcoin/bitcoin/pull/20721/commits/6f0309b4743d05c512fc86d2d206f72d512c5c93
Regarding keeping best ping time in CNode: you broke a few functional tests by removing minping from copyStats(), which I assume is accidental? For renaming, I personally think "best" is more informative than "min", but there's some inconsistency in naming now: it's still called 'minping' in getpeerinfo and m_min_ping_usec in CNodeStats.
1055- std::atomic<int64_t> nPingUsecTime{0};
1056- // Best measured round-trip time.
1057- std::atomic<int64_t> nMinPingUsecTime{std::numeric_limits<int64_t>::max()};
1058- // Whether a ping is requested.
1059- std::atomic<bool> fPingQueued{false};
1060+ std::atomic<int64_t> m_best_ping_time{std::numeric_limits<int64_t>::max()};
std::atomic<int64_t> m_min_ping_time{std::numeric_limits<int64_t>::max()};
I think renaming from "min" to "best" needlessly sows confusion or causes more code churn for no gain.
I've renamed it to m_min_ping_time (it's still code churn since it was previously named nMinPingUsecTime, but I agree with you about the consistent naming).
@@ -237,6 +230,12 @@ static RPCHelpMan getpeerinfo()
230 heights.push_back(height);
231 }
232 obj.pushKV("inflight", heights);
233+ if (statestats.m_ping_usec > 0) {
234+ obj.pushKV("pingtime", ((double)statestats.m_ping_usec) / 1e6);
235+ }
236+ if (statestats.m_ping_wait_usec > 0) {
237+ obj.pushKV("pingwait", ((double)statestats.m_ping_wait_usec) / 1e6);
Can you use named casts in lines 234 and 237 while touching this code?
https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines.html#Res-casts-named
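For reference, a standalone comparison of the C-style cast from the diff and the named cast being requested (the value is made up for the demo):

#include <cstdint>
#include <cstdio>

int main()
{
    const int64_t ping_usec{123456};
    const double c_style{((double)ping_usec) / 1e6};          // cast style from the diff
    const double named{static_cast<double>(ping_usec) / 1e6}; // named cast per the guideline
    std::printf("%f %f\n", c_style, named); // both print 0.123456
}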
Concept ACK
A couple of thoughts on preliminary quick skim-through.
Genuine question, why should best ping time be in CNode but all the others in Peer? We use best ping time for eviction logic which seems like an application-layer thing…
This is a very good question. I haven’t specified the difference between connection-layer and application-layer well enough with respect to ping times.
Functionally, ping times are used in two places: timeout logic and eviction logic (they’re also presented to the user through the RPC and GUI, but that doesn’t affect what follows).
There’s timeout logic in the connection layer (in InactivityCheck()
), which should detect when there’s a problem with the connection, such as:
There’s also timeout logic in the application layer, such as here:
where we detect if block download has timed out. Ping timeouts belong in the application layer, since failing to respond to a ping message with a pong message is a P2P protocol violation. However, that ping timeout logic has previously existed in net.cpp because the ping data existed in CNode. In this PR, we move the ping timeout logic to the application layer, where it logically belongs. This logic requires the nPingNonceSent, m_ping_start and fPingQueued data to be in the net_processing layer.
(You may be asking the question: “isn’t the connection layer timeout logic somewhat redundant with the ping timeout logic?” and you’d be right - if pings are not timing out, then we must be sending/receiving data from the peer, so the connection cannot time out. However, a bit of redundancy here is fine - the net layer is responsible for maintaining the data connection, and the net_processing layer is responsible for maintaining the p2p protocol peering.)
Eviction logic exists entirely in the connection layer, in CConnman::AttemptToEvictConnection(). It requires data from both the connection layer (e.g. connection time, connection type, whether the connection is local, etc) and from the application layer (e.g. last block received time, last tx received time, etc). For that to work, the application layer must inform the connection layer of events, such as a valid tx received, a valid block received, and (now) a valid pong received, and the connection layer must store that data (as nLastBlockTime, nLastTXTime and m_min_ping_time) for use in the eviction logic. We could group those members in a structure called EvictionCriteria or similar to make it clearer what they're used for. For now, I've just expanded the comment for m_min_ping_time.
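For illustration, a hypothetical sketch of that EvictionCriteria grouping (this struct does not exist in the PR; the member names are borrowed from CNode):

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <limits>

// Data the application layer reports to the connection layer for use in
// CConnman::AttemptToEvictConnection() (hypothetical grouping only).
struct EvictionCriteria {
    std::atomic<int64_t> nLastBlockTime{0};  // updated when a valid block is received
    std::atomic<int64_t> nLastTXTime{0};     // updated when a valid tx is received
    std::atomic<int64_t> m_min_ping_time{std::numeric_limits<int64_t>::max()}; // best pong round-trip time
};

int main()
{
    EvictionCriteria criteria;
    // Record a 0.25s pong round trip as the new best, if it is one.
    criteria.m_min_ping_time = std::min<int64_t>(criteria.m_min_ping_time, 250'000);
    return 0;
}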
Writing this has made me realize that nPingUsecTime should probably stay in CNode, since it doesn't need to be used for any application logic. I've done that in the latest force-push.
This exists entirely in the connection layer, in CConnman::AttemptToEvictConnection(). It requires data from both the connection layer (e.g. connection time, connection type, whether the connection is local, etc)
Not convinced that makes sense – "connection type" and "how many local connections to evict" both seem like application logic to me. I was certainly expecting m_conn_type to eventually make its way into net_processing, given it deals with message types (block/tx relay/addrfetch/feeler) and trust issues (inbound/outbound/manual), which are all application-level concepts.
@@ -1143,6 +1144,10 @@ bool AppInitParameterInteraction(const ArgsManager& args)
1144 return InitError(Untranslated("peertimeout cannot be configured with a negative value."));
1145 }
1146
1147+ if (args.GetArg("-pingtimeout", TIMEOUT_INTERVAL) <= 0) {
nit: the error message says "negative value", but it also triggers if pingtimeout is set to 0.
@@ -1144,6 +1145,10 @@ bool AppInitParameterInteraction(const ArgsManager& args)
1145 return InitError(Untranslated("peertimeout cannot be configured with a negative value."));
1146 }
1147
1148+ if (args.GetArg("-pingtimeout", TIMEOUT_INTERVAL) <= 0) {
Argument handling is in ArgsManager::GetArg(). Since the second argument here is an int, the GetArg(string, int) version is called, which will try to convert the string into an int.
This might result in a slightly inaccurate error message, but I think it's fine. It's the same behaviour as for -peertimeout:
$ ./src/bitcoind --pingtimeout=abc
Error: pingtimeout cannot be configured with a negative value.
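To illustrate why "abc" ends up in the "negative value" branch: the int64 GetArg overload converts the stored string with an atoi64-style parse that yields 0 for non-numeric input. A standalone sketch, using strtoll as a stand-in for Bitcoin Core's conversion helper:

#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main()
{
    const char* value{"abc"};                               // from --pingtimeout=abc
    const int64_t parsed{std::strtoll(value, nullptr, 10)}; // 0 for non-numeric input
    if (parsed <= 0) {
        std::printf("Error: pingtimeout cannot be configured with a negative value.\n");
    }
}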
1077+ // Merely reporting pingtime might fool the caller into thinking the node was still responsive,
1078+ // since pingtime does not update until the ping is complete, which might take a while.
1079+ // So, if a ping is taking an unusually long time in flight,
1080+ // the caller can immediately detect that this is happening.
1081+ std::chrono::microseconds ping_wait{0};
1082+ if ((0 != peer->m_ping_nonce_sent) && (0 != peer->m_ping_start.load().count())) {
No, these aren’t required. In C++ &&
has a lower precedence than !=
so the parens aren’t changing the order of operations (https://en.cppreference.com/w/cpp/language/operator_precedence). They’re not doing any harm either.
This code is moved unaltered in this PR. I think it’s fine to keep it as it is.
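A quick standalone check of the precedence claim, showing the expression parses identically with and without the parens:

#include <cstdio>

int main()
{
    const int a{1}, b{2};
    const bool with_parens{(0 != a) && (0 != b)};
    const bool without_parens{0 != a && 0 != b}; // != binds tighter than &&
    std::printf("%d %d\n", with_parens, without_parens); // prints 1 1
}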
@@ -83,6 +83,9 @@ BOOST_AUTO_TEST_CASE(outbound_slow_chain_eviction)
83 auto peerLogic = PeerManager::make(chainparams, *connman, nullptr, *m_node.scheduler,
84 *m_node.chainman, *m_node.mempool, false);
85
86+ // Don't let pings time out
// Don't let pings time out due to mocktime
@@ -33,6 +33,7 @@ def set_test_params(self):
33 [
34 "-acceptnonstdtxn=1",
35 "-peertimeout=9999", # bump because mocktime might cause a disconnect otherwise
36+ "-pingtimeout=9999", # bump because mocktime might cause a disconnect otherwise
Weren't ping timeouts already possible before -pingtimeout is used? If not, why did some tests start to fail due to ping timeouts?
Previously, when ping timeouts were handled in CConnman::InactivityCheck(), the whole logic was gated on the following guard:
if (now <= node.nTimeConnected + m_peer_connect_timeout) {
    // Only run inactivity checks if the peer has been connected longer
    // than m_peer_connect_timeout.
    return false;
}
Because tests generally run for less than m_peer_connect_timeout, the ping timeout logic was not hit at all. Once the logic moves to net_processing, that guard no longer exists, so we start testing for ping timeouts as soon as the peer is connected.
@@ -315,7 +315,8 @@ class PeerManagerImpl final : public PeerManager
315 /** Send a version message to a peer */
316 void PushNodeVersion(CNode& pnode, int64_t nTime);
317
318- /** Send a ping message on a regular schedule or if requested manually */
319+ /** Send a ping message on a regular schedule or if requested manually. May
320+ * disconnect the peer if a ping has timed out. */
321 void MaybeSendPing(CNode& node_to);
/** Send a ping message every PING_INTERVAL or if requested via rpc. Disconnect
 * the peer if a ping has timed out. Return true upon disconnection. */
[[nodiscard]] bool ManagePing(CNode& node_to);
I’ve changed the signature to return a void, so the nodiscard tag is no longer required (good suggestion though!)
I’ve taken the suggested comment changes.
I see around ~20 functional tests that use setmocktime… but you only added -pingtimeout=9999 to two of them. Are you not worried about ping timeouts due to mocktime in any of the other tests?
Concept ACK on moving the ping data from CNode in net to Peer in net_processing.
Testing also by running src/bitcoind -pingtimeout=1 -signet (didn't want to disrupt mainnet peers) and got some expected log messages:
2021-02-03T15:34:23Z ping timeout: 1.018515s peer=5
@@ -1144,6 +1145,10 @@ bool AppInitParameterInteraction(const ArgsManager& args)
1145 return InitError(Untranslated("peertimeout cannot be configured with a negative value."));
1146 }
1147
1148+ if (args.GetArg("-pingtimeout", TIMEOUT_INTERVAL) <= 0) {
1149+ return InitError(Untranslated("pingtimeout cannot be configured with a negative value."));
Nit: would it be a bit pedantic to call it a nonpositive value here, to include zero? Although I see it's consistent with the -peertimeout error, so that's probably why :)
I see around ~20 functional tests that use setmocktime… but you only added -pingtimeout=9999 to two of them. Are you not worried about ping timeouts due to mocktime in any of the other tests?
@pinheadmz I imagine it's only necessary where setmocktime does an increase of more than the default ping timeout.
Dug into this a little bit looking for setmocktime calls with big jumps - @jnewbery do you think p2p_addr_relay.py, p2p_addrv2_relay.py, and/or p2p_getaddr_caching.py need a -pingtimeout?
Also a question that may or may not be relevant - how do the inactivity timers handle when setmocktime goes backwards in time?
I’ve updated the -pingtimeout
option to be a binary enable/disable. We don’t need fine control over it in the tests, and it was essentially being used to disable ping timeouts entirely.
Dug into this a little bit looking for setmocktime calls with big jumps - @jnewbery do you think p2p_addr_relay.py, p2p_addrv2_relay.py, and/or p2p_getaddr_caching.py need a -pingtimeout?
Very nice. Thanks @glozow. I’ve updated those tests to disable ping timeouts.
Also a question that may or may not be relevant - how do the inactivity timers handle when setmocktime goes backwards in time?
That’s not a problem for the ping timeout. The ping timeout condition is now > peer.m_ping_start.load() + std::chrono::seconds{TIMEOUT_INTERVAL}
i.e. “(the time now) is after (when the ping was sent) plus (the timeout interval)”. If time goes backwards, then (the time now) is decreased and we won’t hit this condition.
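A tiny self-contained demonstration of that reasoning (the timestamps are arbitrary; TIMEOUT_INTERVAL is 20 minutes, matching the constant in net.h):

#include <chrono>
#include <cstdio>

int main()
{
    using namespace std::chrono;
    const seconds TIMEOUT_INTERVAL{20 * 60};          // 20 minutes, as in net.h
    const microseconds ping_start{1'000'000'000'000}; // arbitrary "ping sent" timestamp

    microseconds now{ping_start + minutes{5}};        // 5 minutes after the ping
    std::printf("timed out: %d\n", static_cast<int>(now > ping_start + TIMEOUT_INTERVAL)); // 0

    now -= hours{24};                                 // mocktime jumps backwards
    std::printf("timed out: %d\n", static_cast<int>(now > ping_start + TIMEOUT_INTERVAL)); // still 0
}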
@@ -48,6 +48,7 @@ class AddrTest(BitcoinTestFramework):
48 def set_test_params(self):
49 self.setup_clean_chain = False
50 self.num_nodes = 1
51+ self.extra_args = [["-nopingtimeout"]] # mocktime might cause a disconnect otherwise
So -pingtimeout means timeout=true? And -nopingtimeout is the same thing as -pingtimeout=0? I couldn't find where the implicit "no" logic was :(
@@ -713,6 +707,12 @@ class CNode
707
708 std::string ConnectionTypeAsString() const { return ::ConnectionTypeAsString(m_conn_type); }
709
710+ //* We have completed a successful ping-pong keepalive. Update latest and best ping times. */
I appreciate the added documentation 🙏
nit:
/** We have completed a successful ping-pong keepalive. Update latest and best ping times. */
Wooo 🙌 the boolean option makes much more sense to me - in the relevant tests, we want to ignore ping timeouts rather than extend them. I feel like the interface would be more intuitive if it was just called -disablepingtimeout since it's an optional flag, but perhaps that's just personal preference?
I checked the functional tests by checking the input time against m_ping_wait_usec from GetNodeStateStats in the setmocktime RPC, and then running the functional tests that use setmocktime. Seems like feature_bip68_sequence.py is actually the only one where it tripped… but my suggestions still make sense to me, and I remember lightlike mentioning wallet_resendwallettransactions.py failing, so maybe it was a race, with no pings in flight?
Thanks for the reviews!
I feel like the interface would be more intuitive if it was just called -disablepingtimeout since it’s an optional flag, but perhaps just personal preference?
I think it’s best to avoid negative configuration names. -disablepingtimeout=0
would mean “do not disable ping timeout”, which is a double negative and slightly harder to parse than “enable ping timeout”. See other options like -listen
, -persistmempool
, etc, where the default is true, and -nolisten
, -nopersistmempool
is used to disable.
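A simplified model of that negation convention - this is not the real ArgsManager code, just a sketch of the rule that a leading "no" flips the option's value to "0":

#include <cstdio>
#include <string>
#include <utility>

// Interpret "-nofoo" as "-foo=0" and a bare "-foo" as "-foo=1".
std::pair<std::string, std::string> InterpretOption(std::string key)
{
    if (key.rfind("no", 0) == 0) {      // key starts with "no"
        return {key.substr(2), "0"};
    }
    return {key, "1"};
}

int main()
{
    const auto [name, value] = InterpretOption("nopingtimeout");
    std::printf("-%s=%s\n", name.c_str(), value.c_str()); // prints -pingtimeout=0
}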
my suggestions still make sense to me and I remember lightlike mentioning wallet_resendwallettransactions.py failing so maybe it was a race, no pings in flight?
Right, these are all races, so they won't fail consistently. I think it's just safer to disable ping timeouts on any test which makes large mocktime leaps.
ACK https://github.com/bitcoin/bitcoin/pull/20721/commits/bbd970a2123aec5071dd3dd901e478844c6e0fea 🍿
The negative config option stuff makes more sense to me now, thanks for explaining!
Light Code-Review ACK bbd970a2123aec5071dd3dd901e478844c6e0fea
This may have been answered before, but could someone tell me the exact sequence of events in the test framework involving setmocktime that would make -nopingtimeout necessary? My naive reading is that send_and_ping blocks until it receives the pong and is therefore not going to time out if the next call is a setmocktime.
I’ll also do a deeper read into [net processing] Move ping timeout logic to net processing
later this week to make sure that this move is indeed safe and doesn’t cause us to miss any opportunities to disconnect a peer.
3847- if (ping_time.count() >= 0) {
3848- // Successful ping time measurement, replace previous
3849- pfrom.nPingUsecTime = count_microseconds(ping_time);
3850- pfrom.nMinPingUsecTime = std::min(pfrom.nMinPingUsecTime.load(), count_microseconds(ping_time));
3851+ const auto ping_time = ping_end - peer->m_ping_start.load();
3852+ if (ping_time.count() > 0) {
This changes >= to >. It makes sense to call it a "timing mishap" if the pong was received in the same microsecond as we recorded for the ping going out, so this seems fine; just highlighting the change since it's easy to miss.
4337+ node_to.fPingQueued = false;
4338+ node_to.m_ping_start = GetTime<std::chrono::microseconds>();
4339+ if (node_to.GetCommonVersion() > BIP0031_VERSION) {
4340+ node_to.nPingNonceSent = nonce;
4341+ m_connman.PushMessage(&node_to, msgMaker.Make(NetMsgType::PING, nonce));
4342 } else {
pong messages were introduced in 2012 with BIP 31, for protocol versions above 60000. I imagine that prior to that, the reason for a ping message was simply a keepalive (make sure that TCP connections stay up, NAT pinholes stay open, etc). This logic maintains backwards compatibility with clients on version 60000 or older.
In practice, I expect there are almost no such old clients, but if there are any it'd be a shame to arbitrarily disconnect them for failing ping timeouts.
@@ -716,6 +710,12 @@ class CNode
710
711 std::string ConnectionTypeAsString() const { return ::ConnectionTypeAsString(m_conn_type); }
712
713+ /** A ping-pong keepalive has completed successfully. Update latest and minimum ping times. */
714+ void PingReceived(std::chrono::microseconds ping_time) {
Should this be called PongReceived?
4401+ MaybeSendPing(*pto, *peer);
4402+
4403+ // MaybeSendPing may have marked peer for disconnection
4404+ if (pto->fDisconnect) return true;
4405+
4406+ // If we get here, the outgoing message serialization version is set and can't change.
This comment previously followed the fSuccessfullyConnected check, which I think(?) was the intent -> fSuccessfullyConnected is set when processing the VERACK message, and m_greatest_common_version gets set when processing the VERSION message.
But now this comment is below MaybeSendPing, which also might retrieve m_greatest_common_version to construct the ping message. I find the intent of this comment to be less clear after these changes.
This is a comment on the line below, which is constructing a CNetMsgMaker object with the correct serialization version:
const CNetMsgMaker msgMaker(pto->GetCommonVersion());
I've moved this up to directly after the fSuccessfullyConnected check. Hopefully that's clearer.
4390+{
4391+ // We must call MaybeDiscourageAndDisconnect first, to ensure that we'll
4392+ // disconnect misbehaving peers even before the version handshake is complete.
4393+ if (MaybeDiscourageAndDisconnect(*pto)) return true;
4394+
4395+ // Don't send anything until the version handshake is complete
I know you only moved this code, but I think we could handle fDisconnect better.
This comment seems misleading, since I think it only refers to fSuccessfullyConnected. Then, we might set fDisconnect to true in MaybeSendPing, so we check this bool again a couple of lines down to also return early.
A simpler logic flow might be removing this check here, starting MaybeSendPing with a return if node_to.fDisconnect is set, and then removing this first early return from SendMessages. And the comment on 4377 could be updated to something along the lines of "Don't send messages to peers that are marked for disconnection".
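A minimal sketch of that suggested flow, again with a stub type rather than the real classes (this is the hypothetical restructuring, not what the PR implements):

#include <cstdio>

struct CNodeStub { bool fDisconnect{false}; };

void MaybeSendPing(CNodeStub& node_to)
{
    if (node_to.fDisconnect) return; // don't send messages to peers marked for disconnection
    // ... ping schedule / timeout logic ...
}

int main()
{
    CNodeStub node;
    node.fDisconnect = true;
    MaybeSendPing(node); // returns immediately; no dedicated check needed in the caller
    std::printf("fDisconnect=%d\n", node.fDisconnect); // prints fDisconnect=1
}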
@@ -315,6 +323,10 @@ class PeerManagerImpl final : public PeerManager
323 /** Send a version message to a peer */
324 void PushNodeVersion(CNode& pnode, int64_t nTime);
325
326+ /** Send a ping message every PING_INTERVAL or if requested via RPC. May
327+ * disconnect the peer if a ping has timed out. */
/** Send a ping message every PING_INTERVAL or if requested via RPC. May
 * mark the peer to be disconnected if a ping has timed out. */
@@ -25,11 +25,14 @@ static const bool DEFAULT_PEERBLOOMFILTERS = false;
25 static const bool DEFAULT_PEERBLOCKFILTERS = false;
26 /** Threshold for marking a node to be discouraged, e.g. disconnected and added to the discouragement filter. */
27 static const int DISCOURAGEMENT_THRESHOLD{100};
28+/** -pingtimeout default */
29+static const int64_t DEFAULT_PING_TIMEOUT = true;
Another comment on code that you simply moved: I found this definition odd. Why is it defined as an int but set as a boolean? Also, why is it an int64_t? That seems like more space than needed for a simple bool value.
also note that this could be constexpr, but I don't know if that actually impacts anything. I tried it out in godbolt explorer, and it looks like for such simple logic it compiled down to the same assembly regardless 🤷🏽‍♀️
code review ACK bbd970a212
carefully read through the diff and confirmed nothing substantial changed / thought about potential divergences. seems to me that the main functional difference is that marking a peer for disconnect based on ping time has changed from the socket handler thread to the message handler thread (and then the socket handler thread still executes the disconnection). I imagine this could sometimes take slightly longer to identify the peers for disconnect - eg. ThreadMessageHandler has a 100ms wait if !fMoreWork, or generally the message handler thread seems like it has more capacity for doing stuff. However, I can't see how this sort of small difference could cause any problems.
I left several comments, but they are all focused on style or documentation, and the majority of them are about code that you moved without changing. None of them are worth holding up this PR from merge or invalidating ACKs, just food for thought if you find any worthwhile for follow-ups, or have to push for a different reason.
This may have been answered before but could someone tell me the exact sequence of events in the test framework involving setmocktime that would make -nopingtimeout necessary? My naive reading is that send_and_ping blocks until it receives the pong and is therefore not going to timeout if the next call is a setmocktime.
The send_and_ping method involves the python test framework sending a ping message to the bitcoind node and waiting for a pong. The timeout that could cause a problem is the ping message in the other direction. Here's how it could fail:
1. the bitcoind node sends a ping message to the test framework at time 0
2. the test framework calls setmocktime to push the node's mocktime some large number of seconds into the future
3. before the pong response has been received/processed, SendMessages() is called by the message handler thread for that peer, and the test in MaybeSendPing() fails: now > peer.m_ping_start.load() + std::chrono::seconds{TIMEOUT_INTERVAL}
bitcoind periodically sends pings to all its peers, and is continually running the message handler thread loop, so this window condition will cause the test to fail intermittently.
4335+ // This means that setmocktime may cause pings to time out.
4336+ auto now = GetTime<std::chrono::microseconds>();
4337+
4338+ if (gArgs.GetArg("-pingtimeout", DEFAULT_PING_TIMEOUT) &&
4339+ peer.m_ping_nonce_sent &&
4340+ now > peer.m_ping_start.load() + std::chrono::seconds{TIMEOUT_INTERVAL}) {
I prefer if (x < y) rather than if (y > x) generally – smaller numbers generally go to the left of larger numbers, so keeping that convention is one less thing to have to think about. (This patch changes the tests from timeout < GetTime() to now > timeout)
Likewise, (low_value < x && x < high_value) and (x < low_value || high_value < x) read much clearer than (x > low_value && x < high_value) and (x < low_value || x > high_value).
4297+ auto now = GetTime<std::chrono::microseconds>();
4298+
4299+ if (gArgs.GetArg("-pingtimeout", DEFAULT_PING_TIMEOUT) &&
4300+ node_to.nPingNonceSent &&
4301+ now > node_to.m_ping_start.load() + std::chrono::seconds{TIMEOUT_INTERVAL}) {
4302+ LogPrint(BCLog::NET, "ping timeout: %fs peer=%d\n", 0.000001 * count_microseconds(now - node_to.m_ping_start.load()), node_to.GetId());
Would probably be better to set m_pingtimeout = gArgs.GetArg() in the constructor, rather than re-query it every loop.
I think losing the check that GetSystemTimeInSeconds() <= node_to.nTimeConnected + peer_connect_timeout (which is done in InactivityCheck but not MaybeSendPing) is a mistake – the docs for -peertimeout say "Specify a p2p connection timeout delay in seconds. After connecting to a peer, wait this amount of time before considering disconnection based on inactivity", which seems like it should apply here: not sending a ping reply certainly seems like a form of inactivity to me! (Setting m_peer_connect_timeout from gArgs in the PeerManagerImpl constructor seems like it would work here too, without violating layering.)
Having -peertimeout continue to apply to (a lack of) PONG messages seems to make -pingtimeout unnecessary; the tests don't need to be modified anymore, and I'm not seeing any reasons other than initial mocktime confusion as to why you'd want to prevent disconnections?
I've removed pingtimeout as requested.
@@ -32,7 +32,7 @@ def set_test_params(self):
32 self.extra_args = [
33 [
34 "-acceptnonstdtxn=1",
35- "-peertimeout=9999", # bump because mocktime might cause a disconnect otherwise
I think it might make more sense in general to set -peertimeout=9999 for all the test cases (ie, in the framework generated bitcoin.conf) – or at least all the test cases that use mocktime – and override it anywhere that's explicitly testing disconnection due to timeouts (p2p_ping.py).
I think it might make more sense in general to set -peertimeout=9999 for all the test cases (ie, in the framework generated bitcoin.conf)
Changing the behaviour of all tests is outside the scope of this PR, but I’m happy to review any proposal you have for making the tests more robust.
@@ -4299,7 +4312,7 @@ void PeerManagerImpl::MaybeSendPing(CNode& node_to)
4312 pingSend = true;
4313 }
4314
4315- if (node_to.nPingNonceSent == 0 && node_to.m_ping_start.load() + PING_INTERVAL < GetTime<std::chrono::microseconds>()) {
4316+ if (node_to.nPingNonceSent == 0 && now > node_to.m_ping_start.load() + PING_INTERVAL) {
@@ -4291,6 +4292,18 @@ class CompareInvMempoolOrder
4292
4293 void PeerManagerImpl::MaybeSendPing(CNode& node_to)
4294 {
4295+ // Use mockable time for ping timeouts.
If you’re doing disconnects here as well, doesn’t that make it MaybeSendPingOrDisconnectDueToLackOfPong()
? Not sure the Maybe
is useful – SendMessages
doesn’t have a Maybe
despite it not necessarily doing something every time it’s called.
This moves ping disconnections from ThreadSocketHandler (which calls SocketHandler, which calls InactivityCheck, I think on a 50ms cycle) to ThreadMessageHandler (which calls SendMessages, which calls MaybeSendPing, I think on a 100ms cycle, though it will reinvoke SendMessages more frequently if any node's ProcessMessages indicates there's more work). I think that's fine.
make it MaybeSendPingOrDisconnectDueToLackOfPong()
I don’t think the function name needs to document all the functionality. It’s documented in the function comment:
0 /** Send a ping message every PING_INTERVAL or if requested via RPC. May
1 * mark the peer to be disconnected if a ping has timed out. */
Not sure the Maybe is useful – SendMessages doesn’t have a Maybe despite it not necessarily doing something every time it’s called.
Right, and MaybeDiscourageAndDisconnect() does have maybe in it.
I don’t really care what this function is called. If you feel strongly, I can change it.
Maybe one day we'll get a PingPongMessages class that deals with both the send and receive side :pray:
@@ -77,13 +77,12 @@ static RPCHelpMan ping()
77 [&](const RPCHelpMan& self, const JSONRPCRequest& request) -> UniValue
78 {
79 NodeContext& node = EnsureNodeContext(request.context);
80- if(!node.connman)
81+ if(!node.peerman) {
3843+ if (nonce == peer->m_ping_nonce_sent) {
3844 // Matching pong received, this ping is no longer outstanding
3845 bPingFinished = true;
3846- const auto ping_time = ping_end - pfrom.m_ping_start.load();
3847+ const auto ping_time = ping_end - peer->m_ping_start.load();
3848 if (ping_time.count() >= 0) {
ping_time.count() == 0 is expected if you do setmocktime X, then a ping goes out, and the pong is received before you call setmocktime Y.
I think I’m a NACK on the -pingtimeout
changes as they stand – at a minimum it should be clear from the PR description and commit messages why those changes are needed, but there’s currently no motivation for them in those places.
I think duplicating the -peertimeout check into net_processing is enough to obsolete the need to change the tests (if it's not, I still don't understand the change well enough to ack it), and that seems like a much more obvious refactor to me, and also matches the help text for the option better. (That said, I think adding -peertimeout=9999 to those tests in case they end up taking longer than 60s to run would be justifiable, but that's true with or without these changes.)
I believe we already have a bug in InactivityCheck when setting mocktime into the past, in that nLastRecv is set via GetTime<microseconds> in ReceiveMsgBytes but then compared against GetSystemTime in InactivityCheck, leading to errors once -peertimeout has passed.
I think I’m a NACK on the -pingtimeout changes as they stand – at a minimum it should be clear from the PR description and commit messages why those changes are needed, but there’s currently no motivation for them in those places.
I think duplicating the -peertimeout check into net_processing is enough to obsolete the need to change the tests
Ok, done. I've removed -pingtimeout, and duplicated the -peertimeout setting and logic into MaybeSendPing() as requested.
Moving the application-layer data and logic is an opportunity to clean up and rationalize the weird edge cases and strange behavior. As you've noted from your observation about the InactivityCheck bug, and as sipa pointed out in #20027 (review), mixing mockable time and system time is too complicated and a source of bugs and confusion. Moving ping timeouts to the application layer means that we can lose one of those interactions between system time and mockable time - system time timeouts on the socket stay in net, and mockable timeouts live in the application layer. This is really only applicable for tests, and I don't think there's much downside to adding a hidden configuration option where it's needed.
However, I don’t think this is important enough to hold up the rest of #19398. I’ve restored the behaviour to exactly how it was, so no changes are required to tests (except the modified PeerManager ctor signature). We can perhaps rationalize mocktime/system time for pings at some later date.
4376- pto->m_ping_start = GetTime<std::chrono::microseconds>();
4377- if (pto->GetCommonVersion() > BIP0031_VERSION) {
4378- pto->nPingNonceSent = nonce;
4379- m_connman.PushMessage(pto, msgMaker.Make(NetMsgType::PING, nonce));
4380+ peer.m_ping_queued = false;
4381+ peer.m_ping_start = GetTime<std::chrono::microseconds>();
I think this could just be:
peer.m_ping_start = now;
@@ -335,6 +347,10 @@ class PeerManagerImpl final : public PeerManager
347 * on extra block-relay-only peers. */
348 bool m_initial_sync_finished{false};
349
350+ /** After connecting to a peer, wait this amount of time before considering
351+ * disconnection based on inactivity */
352+ const int64_t m_peer_connect_timeout;
RE -pingtimeout:
I understand how it could be surprising that -peertimeout would no longer apply to ping logic, especially since InactivityChecks would still bypass the version handshake timeout. However, I find it unfortunate to have the system time comparisons in MaybeSendPing. Having the application level logic be mockable makes sense to me, and seems in line with the discussion in #20027 (review). Maybe in the longer term, we could have a debug arg that handles the application level timeouts of ping & verack handshake, and there could be a param interaction that sets the new arg based on -peertimeout.
But that all seems out of the scope of this PR. I’ve reviewed the changes and the code looks correct as far as I can tell.
ACK 31cc466eb0
@@ -4297,6 +4298,20 @@ class CompareInvMempoolOrder
4298
4299 void PeerManagerImpl::MaybeSendPing(CNode& node_to)
4300 {
4301+ // Use mockable time for ping timeouts.
4302+ // This means that setmocktime may cause pings to time out.
4303+ auto now = GetTime<std::chrono::microseconds>();
4304+
4305+ if (GetSystemTimeInSeconds() > node_to.nTimeConnected + m_peer_connect_timeout &&
Adding:
/** Return true if inactivity checks should be ignored */
inline bool IgnoreInactivityChecks(const CNode& node) const
{
    return GetSystemTimeInSeconds() <= node.nTimeConnected + m_peer_connect_timeout;
}
to CConnman would let you avoid adding m_peer_connect_timeout in PeerManagerImpl and keep the non-mockable times in net.cpp. (You have access to connman anyway via PeerManagerImpl::m_connman, and access to node in order to set the disconnect flag.)
4368+ // If we get here, the outgoing message serialization version is set and can't change.
4369+ const CNetMsgMaker msgMaker(pto->GetCommonVersion());
4370+
4371+ MaybeSendPing(*pto);
4372+
4373+ PeerRef peer = GetPeerRef(pto->GetId());
Should there be an if (peer == nullptr) return true; here? We check that everywhere else.
Rather than move this line, it might be better to continue doing it first thing and pass the result into MaybeDiscourageAndDisconnect, rather than having that function look it up too. (At least, that's where I'm expecting to end up with #20758 eventually.)
This only appears in net_processing.cpp, so it doesn't need the .* there.
@@ -3969,43 +3999,39 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
3999 return;
4000 }
4001
4002-bool PeerManagerImpl::MaybeDiscourageAndDisconnect(CNode& pnode)
4003+bool PeerManagerImpl::MaybeDiscourageAndDisconnect(CNode& pnode, Peer& peer)
Is the idea that eventually MaybeDiscourageAndDisconnect() will only need to accept a Peer? And right now, even though you could grab the PeerRef from the m_peer_map in here again, passing in a Peer is a bit less overhead?
MaybeDiscourageAndDisconnect should continue needing a CNode in order to signal the disconnection, I think.
MaybeDiscourageAndDisconnect should continue needing a CNode in order to signal the disconnection, I think.
This is true. It’s also currently needed for the connection type, which I don’t think belongs in net_processing.
Even though you could grab the PeerRef from the m_peer_map in here again, passing in a Peer is a bit less overhead?
We want to grab a PeerRef at the entry functions (ProcessMessages and SendMessages for the message handler thread), and then pass that reference down through the call stack. Grabbing a PeerRef incurs overhead because m_peer_mutex needs to be locked and the shared ptr's refcount needs to be incremented/decremented, so we shouldn't do it unnecessarily.
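As an illustration of that cost, a self-contained model of the lookup; the real GetPeerRef in net_processing has roughly this shape, but the types here are simplified stubs:

#include <cstdio>
#include <map>
#include <memory>
#include <mutex>

struct Peer { int m_id; };
using PeerRef = std::shared_ptr<Peer>;

struct PeerManagerStub {
    mutable std::mutex m_peer_mutex;
    std::map<int, PeerRef> m_peer_map;

    PeerRef GetPeerRef(int id) const
    {
        std::lock_guard<std::mutex> lock{m_peer_mutex}; // every lookup takes the mutex
        auto it = m_peer_map.find(id);
        // Copying the shared_ptr atomically increments its refcount.
        return it != m_peer_map.end() ? it->second : nullptr;
    }
};

int main()
{
    PeerManagerStub pm;
    pm.m_peer_map.emplace(0, std::make_shared<Peer>(Peer{0}));
    const PeerRef peer{pm.GetPeerRef(0)};             // grab once at the entry point...
    std::printf("use_count=%ld\n", peer.use_count()); // ...then pass the reference down
}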
@@ -1221,18 +1221,17 @@ void CConnman::NotifyNumConnectionsChanged()
1221 }
1222 }
1223
1224+bool CConnman::RunInactivityChecks(const CNode& node) const
1225+{
1226+ return GetSystemTimeInSeconds() > node.nTimeConnected + m_peer_connect_timeout;
Nice to see RunInactivityChecks() instead of taking a peertimeout param 👌
Strictly speaking, I think it's possible for the time to be different from RunInactivityChecks to InactivityChecks, since they each grab the time separately (as opposed to before, where they used the same now)? I don't think it'd be a huge issue, since it's in seconds and the system time would only move forward though.
Could make this:
bool CConnman::RunInactivityChecks(const CNode& node, const uint64_t* pnow=nullptr)
{
    const uint64_t now = (pnow ? *pnow : GetSystemTimeInSeconds());
    return now > node.nTimeConnected + m_peer_connect_timeout;
}
and call it from within InactivityCheck() as if (!RunInactivityChecks(node, &now)) return false; to avoid unnecessarily getting the system time twice in quick succession. It might also make RunInactivityChecks easier to test, since you could pass in different times and check the answer, without having to worry that GetSystemTime isn't mockable.
Main changes since last review:
Using -peertimeout on both sets of timeout checks indeed looks cleaner than before. It was easier to review for me - I did my best to verify that the decision on whether or not to run ping timeout checks is identical, which makes me feel more confident than when I was trying to find which tests might time out. 🧠 And then I tried to verify that we're still timing out pings when we're supposed to, through the MaybeSendPing route.
code review ACK a3ec04b7a6
@@ -3969,43 +3970,39 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
3970 return;
3971 }
3972
3973-bool PeerManagerImpl::MaybeDiscourageAndDisconnect(CNode& pnode)
3974+bool PeerManagerImpl::MaybeDiscourageAndDisconnect(CNode& pnode, Peer& peer)
3975 {
3976- const NodeId peer_id{pnode.GetId()};
const NodeId peer_id{peer.m_id}; would've changed fewer lines :)
Hmm, actually I think I disagree with that (well, at least after exaggerating it into a bigger principle) – review is a much more important bottleneck than just about anything else, so optimising the code to be easier to review is much more helpful than optimising it to make perfect sense if we were going to erase our git history and anonymously post the current tree to a mailing list. I think just about any time there’s a choice between correct code that’s easier to review and correct code that’s better according to some other aesthetic, easier to review should be the winner. At least until review is no longer a serious bottleneck for the project.
Obviously, peer_id
to peer.m_id
is trivial, but if we’re talking what we should aim to maximise, I think ease of review is the single right answer. I don’t think you’ll often find that being far off from optimising for clarity, though. (I don’t think “expressiveness” applies either way tbh)
The point of review being a bottleneck is that we won’t ever progress to a global maximum of quality – if we could do that, by definition we wouldn’t be bottlenecked. Every merge is a reflection of that review bottleneck: did we spend our limited review cycles on working out if a while loop does the same thing as the for loop it replaces, or did we spend it on working out if muhash or minisketch does what it claims on the box? If every PR spends 5% of review time checking that shuffling code around is correct, then that’s 1-in-20 fewer PRs we can make that affect people that don’t see the code. Or it’s an incentive to get in the habit of merging code that’s not thoroughly reviewed in order to avoid delaying things because we’ve run out of review resources, which is worse.
That doesn’t mean we shouldn’t always be aiming to improve the quality of the code; just that we should be spending the resources we have on the changes that are most valuable, even if that means leaving some good things for (possibly much) later.
(Personally, I find most of the STL algorithms worse than the loops they replace – range based for loops in c++ are pretty pleasant and behave pretty sensibly; for me, dealing explicitly with iterators and lambdas generally isn’t pleasant – they’re harder to understand and easier to make mistakes with, for at best a minimal performance benefit, and for me at least any safety benefits are offset by the lack of clarity. There are more important battles to fight, and I’m hoping that perhaps familiarity might eventually breed affection, but for now they certainly don’t seem like moving toward even a local maximum to me)
Thanks AJ, these are good points.
You’re wrong about stl algorithms though :grin:
@@ -1024,6 +1024,9 @@ class CConnman
1024
1025 void SetAsmap(std::vector<bool> asmap) { addrman.m_asmap = std::move(asmap); }
1026
1027+ /** Return true if the peer has been connected for long enough to do inactivity checks. */
Maybe "Return true if the peer should be checked for inactivity"? – we could introduce a net permission to prevent that rather than just have a standard timeout for all peers, eg
@@ -1537,7 +1536,7 @@ void CConnman::SocketHandler()
1536 if (bytes_sent) RecordBytesSent(bytes_sent);
1537 }
1538
1539- if (InactivityCheck(*pnode)) pnode->fDisconnect = true;
1540+ if (RunInactivityChecks(*pnode) && InactivityCheck(*pnode)) pnode->fDisconnect = true;
This could have stayed as an early exit inside InactivityCheck(). Seems cleaner to keep the early exit, to me.
@@ -248,6 +255,7 @@ class PeerManagerImpl final : public PeerManager
255 void CheckForStaleTipAndEvictPeers() override;
256 bool GetNodeStateStats(NodeId nodeid, CNodeStateStats& stats) override;
257 bool IgnoresIncomingTxs() override { return m_ignore_incoming_txs; }
258+ void SendPings() override;
You added RunInactivityChecks as a separate commit, but I think you could reasonably have pulled SendPings() out as a separate commit prior to moving things from net to net_processing too.
ACK a3ec04b7a61378dab0a9a53840e5f504d25ac563
Plus some maybe nits.
Refactor only. No change in behaviour.
Moves the logic to prevent running inactivity checks until
the peer has been connected for -peertimeout time into its
own function. This will be reused by net_processing later.
Ping messages are an application-level mechanism. Move timeout
logic from net to net processing.
-BEGIN VERIFY SCRIPT-
sed -i 's/fPingQueued/m_ping_queued/g' src/net_processing.cpp
sed -i 's/nMinPingUsecTime/m_min_ping_time/g' src/net.* src/net_processing.cpp src/test/net_tests.cpp
sed -i 's/nPingNonceSent/m_ping_nonce_sent/g' src/net_processing.cpp
sed -i 's/nPingUsecTime/m_last_ping_time/g' src/net.*
-END VERIFY SCRIPT-
@@ -1221,18 +1208,17 @@ void CConnman::NotifyNumConnectionsChanged()
1208 }
1209 }
1210
1211+bool CConnman::RunInactivityChecks(const CNode& node) const
Maybe call this InactivityChecksEnabled or ShouldRunInactivityChecks, because right now the name suggests this function is where they are run (aka what InactivityCheck does).
reACK https://github.com/bitcoin/bitcoin/commit/a5e15ae45ccae7948a6c5b95e6621f01afb88d55
rebase-only by git range-diff a3ec04b...a5e15ae
ACK a5e15ae45c
lost track of the different versions so just re-reviewed the whole diff commit by commit.
definitely like having the m_peer_connect_timeout check as a function in net more than duplicating the system time check in net_processing
in the future, I think it could make sense to have -peertimeout be limited to regtest. it's hidden behind debug-help, but technically possible on mainnet. I don't see how it'd be useful on mainnet though.
In the scripted diff, I don't like that it may modify non-git user files (e.g. a net.h.rej file) in the same folder. It would be better to use git grep -l '...', which would also simplify review, as reviewers don't have to call git grep to verify the replacement was complete over all files, including docs.
review ACK a5e15ae45ccae7948a6c5b95e6621f01afb88d55 🥉
Signature:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

review ACK a5e15ae45ccae7948a6c5b95e6621f01afb88d55 🥉
-----BEGIN PGP SIGNATURE-----

iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
pUhixgv9EHlcd7Eo4IXpSgu8XZbUGYox8iHe1MU0pdDdnGPBjdnoB71xvMNUczaF
vDqmc9jIEfT19pPXGJfnOGQ53oWatA9be22GiLjZtjIA+w7slyjnTJ/WM2LCxA1V
EWX2X57ad+DwEHW7YCzmf5pxxm/KK9TV3h3kE9I1yUGwg6FR5INfIyzssPqULAt2
8BexUMoXXaqeAV6WOJRq/9wdtOcv2WwD7YLc340VjdmRRRkVflySKnx0/m2x6oo8
b3zBYj3YC6+UVmPRD2YIexoG+XHAVGXbc2kFRkeJ/Puknt7pWyurIsF6w+dgzM7B
YrsLzEwgO7mO4xvXfvHYwNKuNX3WJTFJ+gpL1dIN7UlJQis4p3bl56Sxtyw6cKM+
d8VZtjYkTcEpL3jKJYnIk4fHFA6WmHaRp7b4FQr6400G2bznF6AHMC2h0z5Jqhpy
63XIWO9kD89mIPxNgzUxA+id/3SoaNPRN0Mj62RxTWmbpRX2gIvCvF/MRZJsdEuC
TQhwMCuo
=FZw8
-----END PGP SIGNATURE-----
Timestamp of file with hash f094bc0ee463f6bec56659b9f20c20ab75a2f829388e57c839500e90f3c196a3 -
@@ -209,8 +208,8 @@ static RPCHelpMan getpeerinfo()
208 if (stats.m_min_ping_usec < std::numeric_limits<int64_t>::max()) {
209 obj.pushKV("minping", ((double)stats.m_min_ping_usec) / 1e6);
4327+ // This means that setmocktime may cause pings to time out.
4328+ auto now = GetTime<std::chrono::microseconds>();
4329+
4330+ if (m_connman.RunInactivityChecks(node_to) && peer.m_ping_nonce_sent &&
4331+ now > peer.m_ping_start.load() + std::chrono::seconds{TIMEOUT_INTERVAL}) {
4332+ LogPrint(BCLog::NET, "ping timeout: %fs peer=%d\n", 0.000001 * count_microseconds(now - peer.m_ping_start.load()), peer.m_id);