net: Block v2->v1 transport downgrade if !fNetworkActive

hodlinator commented at 9:30 pm on March 14, 2025: contributor

We might have just set CNode::fDisconnect in the first loop because of !CConnman::fNetworkActive.

Attempting to reconnect using v1 transport just because fNetworkActive was set to false at the “right” stage in the v2 handshake does not make sense.

Issue discovered by davidgumberg.

DrahtBot commented at 9:31 pm on March 14, 2025: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32073.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	davidgumberg, vasild, mabu44, stratospher

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#32061 ([draft] Replace libevent with our own HTTP and socket-handling implementation by pinheadmz)
#32015 (net: replace manual reference counting of CNode with shared_ptr by vasild)
#30988 (Split CConnman by vasild)
#30951 (net: option to disallow v1 connection on ipv4 and ipv6 peers by stratospher)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot added the label P2P on Mar 14, 2025

hodlinator commented at 9:32 pm on March 14, 2025: contributor

How I tested

On master, run mainnet node:

0bitcoind -debug=net -daemonwait

Tail log for relevant info:

0tail -f ~/.bitcoin/debug.log | grep -E "SetNetworkActive|Network not active|retrying"

In another console, spam:

0build/bin/bitcoin-cli setnetworkactive false && sleep 0.5 && build/bin/bitcoin-cli setnetworkactive true

Expected grep results:

02025-03-14T21:12:17Z SetNetworkActive: false
12025-03-14T21:12:17Z [net] Network not active, disconnecting peer=9
22025-03-14T21:12:17Z SetNetworkActive: true
32025-03-14T21:12:18Z SetNetworkActive: false
4...
52025-03-14T21:12:23Z SetNetworkActive: true
62025-03-14T21:12:23Z SetNetworkActive: false
72025-03-14T21:12:24Z [net] Network not active, disconnecting peer=10
82025-03-14T21:12:24Z [net] retrying with v1 transport protocol for peer=10

hodlinator renamed this:
~~net: Only try downgrade v2->v1 transport if online~~
net: Only attempt v2->v1 transport downgrade if online
on Mar 15, 2025

hodlinator force-pushed on Mar 15, 2025

fanquake requested review from mzumsande on Mar 17, 2025

fanquake requested review from stratospher on Mar 17, 2025

stratospher commented at 8:43 am on March 17, 2025: contributor

fNetworkActive doesn’t indicate whether our node is online/offline - it’s just a node specific option configured using RPC/GUI to enable/disable P2P network activity. So even with this patch, we’d still attempt for a v1 reconnection if we’re trying to establish a v2 connection during the time period when our internet goes off :)

(Could you update the online/offline terminology in the PR?)

If we’re attempting a v2 connection during the time period when the fNetworkActive option is toggled, we would have the 1 extra connection attempt but less probability of this happening/OpenNetworkConnection immediately returns anyways.

Problem would be if something like #13038 happened, but i don’t think that’s the case?

in src/net.cpp:1935 in 0f4e20728d outdated

1931@@ -1932,7 +1932,7 @@ void CConnman::DisconnectNodes()
1932                 // Add to reconnection list if appropriate. We don't reconnect right here, because
1933                 // the creation of a connection is a blocking operation (up to several seconds),
1934                 // and we don't want to hold up the socket handler thread for that long.
1935-                if (pnode->m_transport->ShouldReconnectV1()) {
1936+                if (fNetworkActive && pnode->m_transport->ShouldReconnectV1()) {

vasild commented at 10:46 am on March 17, 2025:

fNetworkActive will be read 2 times in CConnman::DisconnectNodes(). Given that it can change any time, it could happen that it has one value in the first read and another value here in the second read (this is actually inside a loop so it is many second reads, one for each node). This makes it a bit involved to reason all possible outcomes and could produce unexpected results.

I think it would be better if the variable is read just once in the CConnman::DisconnectNodes() method. This will make the method atomic wrt to that - it will behave as if the variable is changed either before it starts or after it ends but never while it is executing.

0const bool network_active{fNetworkActive};
1// and then use network_active in both places.

hodlinator commented at 4:30 pm on March 17, 2025:

Also felt it was a bit brittle, fixed in latest push, with added note in commit message.

davidgumberg commented at 11:49 pm on March 17, 2025:

Not blocking, but this seemed slightly more correct to me before, since now if fNetworkActive == false when caching the result, but true when disconnecting, we might not attempt v1 reconnect when we should have.

Enumerating the possible states in the old version where fNetworkActive is checked “twice”, first to decide whether or not we should flag all m_nodes for disconnect, and the “second time” where for each node flagged for disconnect, we check whether or not we should retry v1 before actually disconnecting them:

If fNetworkActive is true in the first check and false in the second check, this is fine since we should never be attempting v1 reconnect when fNetworkActive == false
If fNetworkActive == false in the first check, causing us to flag all peers for disconnect, but becomes true for the second check, behavior is unchanged from today.
If fNetworkActive == true both times, this is the happy path, no change.
Same if fNetworkActive == false both times, nothing changes.

hodlinator commented at 2:06 pm on March 18, 2025:

Not blocking, but this seemed slightly more correct to me before, since now if fNetworkActive == false when caching the result, but true when disconnecting, we might not attempt v1 reconnect when we should have.

The probability that someone will be able to toggle fNetworkActive false -> true between the first loop starting and the second loop, when one also happens to have a peer in the right step in the handshake.. just doesn’t feel like it should be something we need to support. By keeping it “atomic” we compress down from 4 to 2 permutations.

Counter-example - Running V1 with V2 capable node

On master, one could contrive the possibility of having a V2 peer which we are handshaking with,

User toggles fNetworkActive true -> false.
Making CConnman::DisconnectNodes(): a. Set pnode->fDisconnect = true. b. Queue a v1 reconnection attempt.
User toggles fNetworkActive false -> true before we process reconnections.
We now end up using V1 with a peer capable of V2.

Behavior with current version of this PR in this case will be to drop the V2 peer and not attempt immediate reconnection.

(If we instead read fNetworkActive a second time when checking for reconnection like this PR did initially, on could reorder 2.b. and 3. in the above example to run into the same issue).

vasild commented at 10:46 am on March 17, 2025: contributor

Approach ACK 0f4e20728d231935f0140845370090a760ae93af

net: Block v2->v1 transport downgrade if !CConnman::fNetworkActive

We might have just set CNode::fDisconnect in the first loop because of being offline.

Also caches CConnman::fNetworkActive in case it's changed concurrently with our own thread.

6869fb4170

hodlinator force-pushed on Mar 17, 2025

hodlinator renamed this:
~~net: Only attempt v2->v1 transport downgrade if online~~
net: Block v2->v1 transport downgrade if !fNetworkActive
on Mar 17, 2025

hodlinator commented at 4:27 pm on March 17, 2025: contributor

Could you update the online/offline terminology in the PR?

Fixed to use fNetworkActive instead, agree it was a bit of a stretch.

If we’re attempting a v2 connection during the time period when the fNetworkActive option is toggled, we would have the 1 extra connection attempt but less probability of this happening/OpenNetworkConnection immediately returns anyways.

Yes, seems like we wouldn’t get very far with reconnection anyway, but good to avoid the log message.

hodlinator commented at 4:44 pm on March 17, 2025: contributor

Windows CI failure is unrelated:

0Downloading https://gitlab.freedesktop.org//freetype/freetype/-/archive/VER-2-13-3/freetype-VER-2-13-3.tar.gz -> freetype-freetype-VER-2-13-3.tar.gz
1error: https://gitlab.freedesktop.org//freetype/freetype/-/archive/VER-2-13-3/freetype-VER-2-13-3.tar.gz: failed: status code 503

Currently tracked as #32085.

davidgumberg commented at 2:56 am on March 18, 2025: contributor

Tested and Reviewed ACK 6869fb417096b43ba7f7

~~I could be wrong, but I think https://github.com/bitcoin/bitcoin/commit/0f4e20728d231935f0140845370090a760ae93af was more correct (see #32073 (review)), but I don’t think it’s that important.~~ I was wrong, see #32073 (review)

Anecdotally, the issue this fixes is more likely to occur on startup, probably because of how many new connections we’re making, so I used this script to reproduce:

0bitcoind -debug=net -daemonwait  && sleep 20 \
1  && bitcoin-cli setnetworkactive false && sleep 0.5 \
2  && bitcoin-cli setnetworkactive true && sleep 0.5 \
3  && bitcoin-cli  stop

Message about retrying v1 connection doesn’t occur on this branch.

DrahtBot requested review from vasild on Mar 18, 2025

vasild approved

vasild commented at 12:14 pm on March 19, 2025: contributor

ACK 6869fb417096b43ba7f74bf767ca3e41b9894899

mabu44 commented at 8:57 pm on March 19, 2025: none

ACK 6869fb417096b43ba7f74bf767ca3e41b9894899 Reviewed the code and tested it with the command provided in #32073 (comment). I was able to reproduce the problem on master only once and never on this branch.

stratospher commented at 6:34 am on March 20, 2025: contributor

ACK 6869fb4. I’ve reviewed the code but don’t have strong preference for this branch vs master since only functional change is just a single log not being printed in a low probability scenario (we happen to be attempting v2 connection when P2P network activity is being turned off).

ryanofsky assigned ryanofsky on Mar 24, 2025

ryanofsky merged this on Mar 24, 2025

ryanofsky closed this on Mar 24, 2025

hodlinator deleted the branch on Mar 24, 2025

net: Block v2->v1 transport downgrade if !fNetworkActive #32073

Code Coverage & Benchmarks

Reviews

Conflicts

How I tested

Counter-example - Running V1 with V2 capable node