Cache the -capturemessages setting in CConnman rather than re-evaluating it every time we invoke PushMessage.
We run InactivityChecks() for each node every time poll()/select() returns, every
50ms or so. Rather than calculating the current time once for each node,
just calculate it once and reuse it.
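As a rough sketch of the pattern this second commit moves to (simplified stand-ins, not the real CConnman/CNode interfaces; the timeout value is just illustrative):

```cpp
#include <chrono>
#include <vector>

using namespace std::chrono_literals;

struct Node {
    std::chrono::microseconds m_last_recv{0us};
    bool fDisconnect{false};
};

// Stand-in for CConnman::InactivityCheck(): takes `now` as a parameter
// instead of reading the clock itself.
bool InactivityCheck(const Node& node, std::chrono::microseconds now)
{
    constexpr auto TIMEOUT_INTERVAL = 20min;
    return now - node.m_last_recv > TIMEOUT_INTERVAL;
}

void SocketHandler(std::vector<Node*>& nodes)
{
    // One clock read per poll()/select() pass (~every 50ms), reused for every node.
    const auto now = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::system_clock::now().time_since_epoch());
    for (Node* node : nodes) {
        if (InactivityCheck(*node, now)) node->fDisconnect = true;
    }
}

int main()
{
    Node node;
    std::vector<Node*> nodes{&node};
    SocketHandler(nodes);
}
```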
Not sure if the flame graph is usable, but:
GetBoolArg takes up 0.31% of total time, as part of PushMessage that takes up 1.75% of total time, in b-msghand.
GetTime takes up 0.82% of total time, as part of InactivityCheck that takes up 1.78% of total time, in b-net.
Converting from std::chrono::microseconds to NodeClock::time_point is a lot more intrusive (impacting at least net_processing and node/eviction as well).
Note that CConnman was a friend of CNode until #27257 (no longer relevant)
@@ -2123,6 +2122,8 @@ void CConnman::SocketHandlerConnected(const std::vector<CNode*>& nodes,
 {
     AssertLockNotHeld(m_total_bytes_sent_mutex);

+    auto now = GetTime<std::chrono::microseconds>();
GetTime() is deprecated.
const auto now = NodeClock::now();
(this is moving the deprecated call from elsewhere, but now is a good time to change it)
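Roughly, the suggested shape (just a sketch; NodeClock is aliased to the system clock here so the snippet compiles standalone):

```cpp
#include <chrono>

using NodeClock = std::chrono::system_clock; // stand-in for the clock from util/time.h

int main()
{
    // Non-deprecated form: take a time point once...
    const auto now = NodeClock::now(); // NodeClock::time_point

    // ...and derive a duration-since-epoch only where existing code still compares
    // against std::chrono::microseconds, i.e. what GetTime<std::chrono::microseconds>()
    // used to hand back.
    const auto now_us =
        std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch());
    (void)now_us;
}
```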
Converting from std::chrono::microseconds to NodeClock::time_point is a lot more intrusive (impacting at least net_processing and node/eviction as well).
I did try that; it requires a lot of changes to all the things we compare now against, such as m_connected, m_last_send, m_last_recv. m_connected in particular is a big hit compared to the rest of this PR.
This works:
@@ -2125 +2125 @@ void CConnman::SocketHandlerConnected(const std::vector<CNode*>& nodes,
-    auto now = GetTime<std::chrono::microseconds>();
+    auto now = NodeClock::now();
@@ -2218 +2218 @@ void CConnman::SocketHandlerConnected(const std::vector<CNode*>& nodes,
-    if (InactivityCheck(*pnode, now)) pnode->fDisconnect = true;
+    if (InactivityCheck(*pnode, now.time_since_epoch())) pnode->fDisconnect = true;
If GetTime() is still useful because a lot of surrounding code uses e.g. std::chrono::seconds which we need to compare against, then should GetTime() be un-deprecated?
Converting from std::chrono::microseconds to NodeClock::time_point is a lot more intrusive (impacting at least net_processing and node/eviction as well).
Happy to take a look as well in a fresh commit, either here, or in a follow-up.
If GetTime() is still useful because a lot of surrounding code uses e.g. std::chrono::seconds which we need to compare against, then should GetTime() be un-deprecated?
GetTime returning a time duration is wrong, because the current time point (now) is not a duration, but a time point. A duration arises as the difference of two points in time. This duration can then be compared with any other duration (e.g. peer timeout). I don't think it makes sense to un-deprecate something just because it is used in the current code. If this were a valid reason, nothing could ever be marked deprecated as long as it is used.
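Concretely (a tiny illustration, not Bitcoin Core code; the timeout is arbitrary):

```cpp
#include <chrono>

using namespace std::chrono_literals;
using NodeClock = std::chrono::system_clock; // stand-in for the clock from util/time.h

int main()
{
    // A time point is a fixed moment; a duration is a span of time.
    const NodeClock::time_point connected = NodeClock::now();
    // ... later ...
    const NodeClock::time_point now = NodeClock::now();

    // The difference of two time points is a duration, which can then be
    // compared against any other duration, e.g. a peer timeout.
    const auto idle = now - connected; // NodeClock::duration
    constexpr auto TIMEOUT_INTERVAL = 20min;
    return idle > TIMEOUT_INTERVAL ? 1 : 0;
}
```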
@maflcko, yes, I agree. But existing code uses “duration”:
CNode::m_last_send
CNode::m_last_recv
CNode::m_last_tx_time
CNode::m_last_block_time
CNode::m_connected
so, if one is writing new code that needs to compare “now” to those, which one is preferred:
GetTime(); or NodeClock::now() and convert using time_since_epoch() in order to compare?
Happy to take a look as well in a fresh commit, either here, or in a follow-up.
Happy to review in a followup.
When looking at this previously, it seemed like having an AtomicTimePoint<Clock> template would be helpful (for things like m_last_send which is currently atomic<seconds>), because time_point doesn’t support being an atomic. Here’s a branch where I last looked at this topic; the “Add AtomicTimePoint” commit might be worth cherry-picking.
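For reference, here is roughly what such a wrapper could look like (a hedged sketch, not the commit from that branch): store the ticks-since-epoch in an atomic of the clock's representation type and convert at the boundaries, so callers keep time_point semantics while the storage stays a plain atomic integer.

```cpp
#include <atomic>
#include <chrono>

template <typename Clock>
class AtomicTimePoint
{
    std::atomic<typename Clock::rep> m_ticks{0};

public:
    typename Clock::time_point load() const noexcept
    {
        return typename Clock::time_point{typename Clock::duration{m_ticks.load()}};
    }
    void store(typename Clock::time_point tp) noexcept
    {
        m_ticks.store(tp.time_since_epoch().count());
    }
};

int main()
{
    // e.g. a stand-in for something like CNode::m_last_send
    AtomicTimePoint<std::chrono::system_clock> last_send;
    last_send.store(std::chrono::system_clock::now());
    const auto elapsed = std::chrono::system_clock::now() - last_send.load();
    (void)elapsed;
}
```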
because time_point doesn’t support being an atomic.
Duration does not either. I think it is fine to type .load() or .store(), where needed.
I guess there is no rush here, and this can be done, when all other places (non-atomic) are switched.
There are atomic<seconds> and atomic<microseconds> in various places in net, net_processing and util/time. atomic<time_point> doesn't work because even if the duration is noexcept (necessary for atomic), that doesn't propagate to the time_point's constructor also being noexcept, which is a requirement for wrapping in atomic.
noexcept. Can you give a minimal reproducer, or link to the std docs?
What I was seeing was that the duration constructor was noexcept in practice, but time_point has an explicit constructor that wouldn’t throw but wasn’t declared as noexcept, which was then rejected by atomic…
Stackoverflow answer about it: https://stackoverflow.com/questions/22701617/should-constructors-on-stdchrono-time-point-be-noexcept-or-why-arent
Ah, it looks like it changed between C++17 and C++20, going from atomic() noexcept = default; to constexpr atomic() noexcept(is_nothrow_default_constructible_v<T>); which fixes it.
Hmm, it looks to me like libstdc++ didn’t match the C++20 spec until https://github.com/gcc-mirror/gcc/commit/613f8ddbe3d7da63d827e588bf0333813c184b8a which doesn’t seem to have been merged until March this year in libstdc++15? C++20 spec change was https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0883r2.pdf
However I can’t reproduce the “core” bug LWG#2165 CWG#1778 even with -std=c++11, so apparently modern compilers just no longer behave this way? I can’t quite figure out what the change was or when it was made though. :man_shrugging:
EDIT: ah, looks like the underlying language issue was present in g++ 9.x but fixed in g++ 10.x (May 2020).
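For what it's worth, a small probe along these lines illustrates the distinction (what it prints, and whether the commented-out line compiles, depends on the compiler/library versions discussed above):

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <type_traits>

int main()
{
    using secs = std::chrono::seconds;
    using tp = std::chrono::steady_clock::time_point;

    // Pre-C++20, std::atomic's default constructor was `atomic() noexcept = default;`,
    // so std::atomic<T>{} hinges on T's default constructor being noexcept. Durations
    // get a defaulted (hence noexcept) default constructor, while time_point's is
    // user-provided and typically not declared noexcept (see the SO link above).
    std::printf("seconds nothrow default-constructible:    %d\n",
                static_cast<int>(std::is_nothrow_default_constructible_v<secs>));
    std::printf("time_point nothrow default-constructible: %d\n",
                static_cast<int>(std::is_nothrow_default_constructible_v<tp>));

    std::atomic<secs> last_send{}; // fine on the toolchains discussed above
    last_send.store(secs{42});
    std::printf("last_send = %lld s\n", static_cast<long long>(last_send.load().count()));

    // std::atomic<tp> last_send_tp{}; // the line that older gcc/libstdc++ combinations rejected
}
```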
@@ -2214,7 +2215,7 @@ void CConnman::SocketHandlerConnected(const std::vector<CNode*>& nodes,
         }
     }

-    if (InactivityCheck(*pnode)) pnode->fDisconnect = true;
+    if (InactivityCheck(*pnode, now)) pnode->fDisconnect = true;
now is not the best name for it. What about time_at_start_of_loop or something like that?
@@ -1666,6 +1674,11 @@ class CConnman
     */
    bool whitelist_relay;

+    /**
+     * flag for whether messages are captured
+     */
+    bool capture_messages{false};
m_capture_messages
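To sketch the caching pattern this PR introduces (simplified stand-ins, not the real ArgsManager/CConnman interfaces; the member follows the m_ naming suggested above):

```cpp
#include <string>
#include <unordered_map>

// Minimal stand-in for the settings lookup that PushMessage currently performs
// per message via GetBoolArg.
struct Args {
    std::unordered_map<std::string, bool> settings;
    bool GetBoolArg(const std::string& name, bool def) const
    {
        const auto it = settings.find(name);
        return it == settings.end() ? def : it->second;
    }
};

class Connman
{
public:
    void Init(const Args& args)
    {
        // Read the option once at startup and cache it.
        m_capture_messages = args.GetBoolArg("-capturemessages", false);
    }

    void PushMessage(/* CNode& node, CSerializedNetMsg&& msg */)
    {
        // Hot path: check the cached member instead of doing a string-keyed
        // settings lookup for every message.
        if (m_capture_messages) {
            // CaptureMessage(...) would run here.
        }
        // ... queue the message for sending ...
    }

private:
    bool m_capture_messages{false};
};

int main()
{
    Args args;
    args.settings["-capturemessages"] = true;
    Connman connman;
    connman.Init(args);
    connman.PushMessage();
}
```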
@@ -978,6 +967,8 @@ class CNode
    }

private:
+    friend class CConnman;
I do not like the second commit a542bf10f0 net: [refactor] move ReceiveMsgBytes to CConnman - it reduces the encapsulation of CNode and goes against #27257. Its commit message says:
This allows ReceiveMsgBytes() to access CConnman members without needing to copy them to every CNode. This is used in the next commit.
But I don’t think the next commit b4d1007221 net: Cache -capturemessages setting does any of that, hmm?
@@ -2123,6 +2122,8 @@ void CConnman::SocketHandlerConnected(const std::vector<CNode*>& nodes,
 {
     AssertLockNotHeld(m_total_bytes_sent_mutex);

+    auto now = GetTime<std::chrono::microseconds>();
+
net: Pass time to InactivityChecks functions is well grounded. No need to retrieve the current time for every node, given that we only care about seconds-precision here. I measured on my slow node with a few tens of connections: all nodes are processed in a few milliseconds or less. So, at worst the time is stale by that much, which is fine IMO.
Re: #34025 (comment)
Just to confirm, this flamegraph is from a node that has finished syncing to the tip? i.e. IBD is not included in this graph, right?
GetTime takes up 0.82% of total time, as part of InactivityCheck that takes up 1.78% of total time, in b-net.
Interesting. I was wondering why getting the time eats so much CPU. Though, in the happy path, InactivityCheck is just loading a few atomics, which means getting the time costs as much as loading a few atomics. Also, the flame graph probably shows the CPU time, and not the wall clock time. So the patch here likely won’t cut the wall clock time between two calls of SocketHandlerConnected, but only the CPU time inside a single SocketHandlerConnected call?
Concept ACK. Seems fine to still make the changes here.
Just to confirm, this flamegraph is from a node that has finished syncing to the tip? i.e. IBD is not included in this graph, right?
Yes; it’s taken over a longish period though, iirc, I think either 30m or 2h. You can see ProcessNewBlock at 0.20% just before ProcessTransaction at 4.56% fwiw.
Also, the flame graph probably shows the CPU time, and not the wall clock time. So the patch here likely won’t cut the wall clock time between two calls of SocketHandlerConnected, but only the CPU time inside a single SocketHandlerConnected call?
Yeah, presuming SocketHandlerConnected isn’t using 100% of a core, it’ll be spending its time waiting for the SELECT_TIMEOUT_MILLISECONDS timeout to hit, which is 50ms.
ACK 5f5c1ea01955d277581f6c2acbeb982949088960
Would be happy to see the call to the deprecated GetTime() removed: #34025 (review).
I do not see it as a blocker because this PR is actually moving the call around, not planting a new one.
review ACK 5f5c1ea01955d277581f6c2acbeb982949088960 🏣
Signature:
untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: review ACK 5f5c1ea01955d277581f6c2acbeb982949088960 🏣
DIf+EMRUrE9g+3ldiGUW0pHeRyThkZk3bbOL6sas10+I4f60l14Vo/S1nCwtrHEqiue3z1X4uTE3S7uN3Tg3DQ==
ACK 5f5c1ea01955d277581f6c2acbeb982949088960
Looks like an improvement, but I wonder if, instead of doing micro-optimizations like the ones here, it wouldn’t be better to have the entire InactivityCheck() procedure run somewhere else than in SocketHandlerConnected, or at least less often - it seems a bit absurd to check every 50ms for a timeout of 20 minutes. Even the initial waiting time m_peer_connect_timeout from ShouldRunInactivityChecks() for a new peer doesn’t really need to be checked that frequently.
it seems a bit absurd to check every 50ms for a timeout of 20 minutes. Even the initial waiting time m_peer_connect_timeout from ShouldRunInactivityChecks() for a new peer doesn’t really need to be checked that frequently.
I think if you’re running a listening node with many real inbound peers, on mainnet where there seems to be maybe 4.5 tx/s, you probably expect to receive socket data from each peer every 2 seconds or so (announcing the ~9 txs that have been received in that time). With 80 (inbound, tx relay) peers, that’s an interrupt every 25ms, rather than every 50ms. Worse if you have more peers, better if many of your peers are block-relay only or inactive spy nodes or similar.
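Spelling out that arithmetic (the inputs are the rough assumptions above, not measurements):

```cpp
#include <cstdio>

int main()
{
    const double tx_per_sec = 4.5;          // rough mainnet transaction rate
    const double announce_interval_s = 2.0; // rough gap between INV batches from one peer
    const int inbound_tx_relay_peers = 80;

    std::printf("txs per INV batch: ~%.0f\n", tx_per_sec * announce_interval_s);
    std::printf("expected socket wakeup interval: ~%.0f ms\n",
                1000.0 * announce_interval_s / inbound_tx_relay_peers);
}
```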