Fix peers abruptly disconnecting from AssumeUTXO nodes during IBD #30807

pull furszy wants to merge 2 commits into bitcoin:master from furszy:2024_net_assumeUTXO_service changing 7 files +147 −11
  1. furszy commented at 11:37 pm on September 3, 2024: member

    Because AssumeUTXO nodes prioritize tip synchronization, they relay their local address through the network before completing the background chain sync. This, combined with the advertising of full-node service (NODE_NETWORK), can result in an honest peer in IBD connecting to the AssumeUTXO node (while syncing) and requesting an historical block the node does not have. This behavior leads to an abrupt disconnection due to perceived unresponsiveness from the AssumeUTXO node.

    This lack of response occurs because nodes ignore getdata requests when they do not have the block data available (further discussion can be found in #30385).

    Fix this by refraining from signaling full-node service support while the background chain is being synced. During this period, the node will only signal NODE_NETWORK_LIMITED support. Then, full-node (NODE_NETWORK) support will be re-enabled once the background chain sync is completed.
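
    At a high level, the change toggles the advertised service bits around the background sync. A minimal sketch using the CConnman helpers added in this PR (call sites simplified for illustration, not the exact diff):

        // On snapshot activation: stop advertising NODE_NETWORK while the
        // background chain is still being validated; NODE_NETWORK_LIMITED
        // keeps being signaled.
        connman.RemoveLocalServices(NODE_NETWORK);

        // Once the background chain sync completes: re-enable full-node service.
        connman.AddLocalServices(NODE_NETWORK);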

    Thanks to mzumsande for a post-#30385 conversation too.

    Testing notes: just cherry-pick the second commit (bb08c22) onto master. It will fail there because the IBD node requests historical blocks from the snapshot node, which is bad because the snapshot node will ignore the requests, leading to a stall and disconnection after some time.

  2. DrahtBot commented at 11:37 pm on September 3, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK naumenkogs, mzumsande, achow101
    Concept ACK tdb3
    Stale ACK fjahr

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #29641 (scripted-diff: Use LogInfo over LogPrintf by maflcko)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label CI failed on Sep 4, 2024
  4. fjahr commented at 2:29 pm on September 4, 2024: contributor

    Concept ACK

    I’ve read through #30385 and this seems like a good change.

  5. fjahr commented at 2:31 pm on September 4, 2024: contributor
    CI doesn’t seem happy though. Kind of weird, the logs don’t load for me.
  6. furszy commented at 2:43 pm on September 4, 2024: member

    CI doesn’t seem happy though. Kind of weird, the logs don’t load for me.

    Yeah, working on it. Almost there.

  7. furszy force-pushed on Sep 4, 2024
  8. furszy force-pushed on Sep 4, 2024
  9. furszy commented at 9:54 pm on September 4, 2024: member

    Updated and rebased due to conflicts with a recently merged PR. The CI shouldn’t complain now.

    On a side note: It took me longer to place my test code inside feature_assumeutxo.py than to fix the issue. I think it would be good to reorganize this file a bit and make the assumeUTXO params configurable at startup or runtime for tests. This would allow us to split tests into different files for each of the products we provide: network/p2p, local node, and wallet.

  10. achow101 added this to the milestone 28.0 on Sep 5, 2024
  11. achow101 added the label Needs backport (28.x) on Sep 5, 2024
  12. DrahtBot removed the label CI failed on Sep 5, 2024
  13. achow101 commented at 10:15 pm on September 5, 2024: member
    ACK bb08c227de158dbd88a80d904edb209e1734ab91
  14. DrahtBot requested review from fjahr on Sep 5, 2024
  15. tdb3 commented at 1:25 am on September 8, 2024: contributor

    Concept ACK

    Brief code review. Looks good. Planning to review in more detail.

  16. in test/functional/feature_assumeutxo.py:590 in fafc73bc03 outdated
    585         self.log.info(f"Loading snapshot into third node from {dump_output['path']}")
    586         loaded = n2.loadtxoutset(dump_output['path'])
    587         assert_equal(loaded['coins_loaded'], SNAPSHOT_BASE_HEIGHT)
    588         assert_equal(loaded['base_height'], SNAPSHOT_BASE_HEIGHT)
    589 
    590+        # Since n2 is a full node, it will unset the 'NETWORK' service flag during snapshot loading.
    


    fjahr commented at 2:28 pm on September 8, 2024:
    0        # Even though n2 is a full node, it will unset the 'NETWORK' service flag during snapshot loading.
    

    furszy commented at 8:24 pm on September 10, 2024:
    Done. Updated.
  17. in test/functional/feature_assumeutxo.py:396 in fafc73bc03 outdated
    391         assert_equal(loaded['coins_loaded'], SNAPSHOT_BASE_HEIGHT)
    392         assert_equal(loaded['base_height'], SNAPSHOT_BASE_HEIGHT)
    393 
    394+        self.log.info("Confirm that local services remain unchanged")
    395+        # Since n1 is a pruned node, the 'NETWORK' service flag must always be unset.
    396+        prune_node_services = n1.getnetworkinfo()['localservicesnames']
    


    fjahr commented at 2:33 pm on September 8, 2024:

    Since this check is repeated six times it could be a reusable function

    0def check_network_limited(node):
    1    node_services = node.getnetworkinfo()['localservicesnames']
    2    assert 'NETWORK' not in node_services
    3    assert 'NETWORK_LIMITED' in node_services
    

    furszy commented at 8:24 pm on September 10, 2024:
    Done. Updated.
  18. in src/net.cpp:1794 in fafc73bc03 outdated
    1790@@ -1791,7 +1791,8 @@ void CConnman::CreateNodeFromAcceptedSocket(std::unique_ptr<Sock>&& sock,
    1791     const bool inbound_onion = std::find(m_onion_binds.begin(), m_onion_binds.end(), addr_bind) != m_onion_binds.end();
    1792     // The V2Transport transparently falls back to V1 behavior when an incoming V1 connection is
    1793     // detected, so use it whenever we signal NODE_P2P_V2.
    1794-    const bool use_v2transport(nLocalServices & NODE_P2P_V2);
    1795+    ServiceFlags local_services = GetLocalServices();
    


    fjahr commented at 2:37 pm on September 8, 2024:
    nit: Seems like this isn’t needed so I would have preferred if that was left out or a separate commit because the file isn’t touched at all otherwise.

    furszy commented at 1:38 pm on September 9, 2024:

    nit: Seems like this isn’t needed so I would have preferred if that was left out or a separate commit because the file isn’t touched at all otherwise.

    This is needed to prevent a race condition. Since fafc73bc036792923ca64a112684abaf78f3b48e, nLocalServices can be accessed from different threads. So, we need to store the value once it’s read and use it both here and a few lines below (in the InitializeNode call) to maintain consistency in this flow. Otherwise, use_v2transport could have been calculated with a different nLocalServices value than the one used in InitializeNode.
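
    In other words, the value is read once and reused for both decisions. A sketch of the pattern (the InitializeNode call is paraphrased for illustration; it is not part of the quoted hunk):

        // Cache the local services so both uses below observe the same value,
        // even if another thread updates nLocalServices in between.
        const ServiceFlags local_services = GetLocalServices();
        const bool use_v2transport(local_services & NODE_P2P_V2);
        // ... a few lines later the cached value is reused, e.g.:
        // m_msgproc->InitializeNode(*pnode, local_services);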

  19. in test/functional/feature_assumeutxo.py:293 in bb08c227de outdated
    288+        headers_provider_conn.send_message(msg)
    289+
    290+        # Ensure headers arrived
    291+        default_value = {'status': ''}  # No status
    292+        headers_tip_hash = miner.getbestblockhash()
    293+        self.wait_until(lambda: next(filter(lambda x: x['hash'] == headers_tip_hash, snapshot_node.getchaintips()), default_value)['status'] == "headers-only")
    


    fjahr commented at 3:25 pm on September 8, 2024:

    nit: I think this does the same and is a bit simpler? It also allows getting rid of the default_value when applied to the ibd_node line below.

    0        self.wait_until(lambda: any(tip['hash'] == headers_tip_hash and tip['status'] == "headers-only" for tip in snapshot_node.getchaintips()))
    
  20. in test/functional/feature_assumeutxo.py:302 in bb08c227de outdated
    297+        snapshot_node.loadtxoutset(snapshot['path'])
    298+
    299+        # Connect nodes and verify the ibd_node can sync-up the headers-chain from the snapshot_node
    300+        self.connect_nodes(ibd_node.index, snapshot_node.index)
    301+        snapshot_block_hash = snapshot['base_hash']
    302+        self.wait_until(lambda: next(filter(lambda x: x['hash'] == snapshot_block_hash, ibd_node.getchaintips()), default_value)['status'] == "headers-only")
    


    fjahr commented at 10:55 pm on September 8, 2024:

    see above

    0        self.wait_until(lambda: any(tip['hash'] == snapshot_block_hash and tip['status'] == "headers-only" for tip in ibd_node.getchaintips()))
    
  21. in test/functional/feature_assumeutxo.py:276 in bb08c227de outdated
    271+        # Node3 starts clean and seeks to sync-up from snapshot_node.
    272+        miner = self.nodes[0]
    273+        snapshot_node = self.nodes[2]
    274+        ibd_node = self.nodes[3]
    275+
    276+        # Start test fresh by cleaning up node directories
    


    fjahr commented at 10:56 pm on September 8, 2024:
    If this is self-contained and you need to wipe everything, it could have been done in a separate test file, p2p_assumeutxo.py for example, copying over the setup from this file here, but it seems like there isn’t much needed.

    furszy commented at 1:49 pm on September 9, 2024:

    If this is self-contained and you need to wipe everything, it could have been done in a separate test file, p2p_assumeutxo.py for example, copying over the setup from this file here, but it seems like there isn’t much needed.

    I started there (see me crying about the feature_assumeutxo.py code organization here #30807#pullrequestreview-2281468112), but ended up moving it inside this all-in-one file because I wasn’t happy copy-pasting the costly snapshot setup into another test file (this test case doesn’t need all of that). I think we should allow assumeUTXO snapshot customization on regtest. It would make testing much simpler and self-contained. We would be able to subdivide it into network/p2p, local node, and wallet test files. But still, if you don’t mind, let’s leave this conversation for a follow-up so we can fix this bug for v28. I would be happy to tackle it if we reach consensus on it.


    fjahr commented at 2:07 pm on September 9, 2024:

    I think we should allow assumeUTXO snapshot customization on regtest. It would make testing much simpler and self-contained.

    How would it make it simpler? It’s another thing to keep track of in the test code and the implementation code. If we need more snapshot heights we can always add them but I haven’t felt the need for it so far when I have added new coverage.


    furszy commented at 2:28 pm on September 9, 2024:

    I think we should allow assumeUTXO snapshot customization on regtest. It would make testing much simpler and self-contained.

    How would it make it simpler? It’s another thing to keep track of in the test code and the implementation code.

    The custom snapshot feature shouldn’t take more than a few lines in init.cpp. And on the test side, it would be an extra argument during init. I believe we will sooner or later need different snapshots for further wallet tests, maybe more than one in order to cover different self-contained test scenarios. I think it is worth splitting feature_assumeutxo.py into rpc_assumeutxo.py, p2p_assumeutxo.py, wallet_assumeutxo.py, validation_assumeutxo.py, etc., for maintenance reasons.

    If we need more snapshot heights we can always add them but I haven’t felt the need for it so far when I have added new coverage.

    This is simpler in the sense of not requiring new hardcoded values in chainparams.cpp every time we need to test something different. For example, this p2p test case just needs a snapshot, and it can be a very simple one. It does not require all the extra transaction setup time.

  22. fjahr commented at 11:04 pm on September 8, 2024: contributor

    utACK bb08c227de158dbd88a80d904edb209e1734ab91

    If we are in a hurry for v28 this is fine to merge as is; my comments can be kept for a follow-up.

    Aside from the inline comments, I would suggest modernizing the naming of nLocalServices. I prepared a scripted-diff commit to cherry-pick, or I can open the follow-up: https://github.com/fjahr/bitcoin/commit/6cc25ffb1411285d8191bc932eb37b96348f022c

  23. DrahtBot requested review from tdb3 on Sep 8, 2024
  24. furszy commented at 2:10 pm on September 9, 2024: member
    Thanks for the review, fjahr! It seems that your scripted-diff commit https://github.com/fjahr/bitcoin/commit/6cc25ffb1411285d8191bc932eb37b96348f022c makes the CI unhappy. I haven’t checked the reason, but we could leave it for a follow-up to move forward with the fix for v28. Happy to look into it.
  25. in src/net.h:1227 in fafc73bc03 outdated
    1220@@ -1221,6 +1221,11 @@ class CConnman
    1221     //! that peer during `net_processing.cpp:PushNodeVersion()`.
    1222     ServiceFlags GetLocalServices() const;
    1223 
    1224+    //! Updates the local services that this node advertises to other peers
    1225+    //! during connection handshake.
    1226+    void AddLocalServices(ServiceFlags services) { nLocalServices = ServiceFlags(nLocalServices | services); };
    1227+    void RemoveLocalServices(ServiceFlags services) { nLocalServices = ServiceFlags(nLocalServices &~ services); }
    


    mzumsande commented at 2:33 pm on September 9, 2024:
    whitespace nit: & ~services makes more sense to me, and clang-format agrees.

    furszy commented at 8:24 pm on September 10, 2024:
    Done. Updated.
  26. mzumsande commented at 3:49 pm on September 9, 2024: contributor

    Concept ACK

    I think the description of the first commit could be improved:

    AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash.

    This is confusing to me - nodes advertise NODE_NETWORK (all historical blocks prior to tip) or NODE_NETWORK_LIMITED (last 288 from tip). They also communicate their current tip with peers, but it’s not part of the advertisement. So this includes (amongst others) blocks prior to the snapshot block hash, not posterior.
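
    For reference, the two flags being discussed (abridged from src/protocol.h; the comments paraphrase their meaning):

        enum ServiceFlags : uint64_t {
            // Can serve the complete historical block chain.
            NODE_NETWORK = (1 << 0),
            // Can only serve roughly the last 288 blocks (signaled by pruned nodes).
            NODE_NETWORK_LIMITED = (1 << 10),
        };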

    while the AssumeUTXO node is synchronizing the tip chain or the background chain.

    Only the latter is a problem: while synchronizing the tip chain, IsInitialBlockDownload is false and the node doesn’t self-advertise, and therefore wouldn’t receive inbound peers normally. Otherwise we’d have (at least in theory) a similar problem right after the snapshot was loaded, when NODE_NETWORK_LIMITED would also be incorrect because we don’t have the last 288 blocks on disk. But this shouldn’t be a problem in practice.

    after stalling for ~10 minutes

    This could be much less due to the stalling detection (m_stalling_since), so I wouldn’t mention it.

  27. furszy force-pushed on Sep 10, 2024
  28. furszy commented at 8:59 pm on September 10, 2024: member

    AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash.

    This is confusing to me - nodes advertise NODE_NETWORK (all historical blocks prior to tip) or NODE_NETWORK_LIMITED (last 288 from tip). They also communicate their current tip with peers, but it’s not part of the advertisement. So this includes (amongst others) blocks prior to the snapshot block hash, not posterior.

    Yeah ok. Changed it. Thanks!

    while the AssumeUTXO node is synchronizing the tip chain or the background chain.

    Only the latter is a problem: while synchronizing the tip chain, IsInitialBlockDownload is false and the node doesn’t self-advertise, and therefore wouldn’t receive inbound peers normally. Otherwise we’d have (at least in theory) a similar problem right after the snapshot was loaded, when NODE_NETWORK_LIMITED would also be incorrect because we don’t have the last 288 blocks on disk. But this shouldn’t be a problem in practice.

    What about us connecting to a synced limited peer during tip sync, which then tries to fetch historical blocks due to our NODE_NETWORK service, like #29183? Manual connections are also affected if the other peer is in IBD. But this is something the node operator would hopefully be aware of before connecting peers.

    Aside from these two nuances, thanks for the feedback! Updated the commit description. Let me know if further updates are needed.

  29. assumeUTXO: fix peers disconnection during sync
    Because AssumeUTXO nodes prioritize tip synchronization, they relay their local
    address through the network before completing the background chain sync.
    This, combined with the advertising of full-node service (NODE_NETWORK), can
    result in an honest peer in IBD connecting to the AssumeUTXO node (while syncing)
    and requesting an historical block the node does not have. This behavior leads to
    an abrupt disconnection due to perceived unresponsiveness (lack of response)
    from the AssumeUTXO node.
    
    This lack of response occurs because nodes ignore getdata requests when they do
    not have the block data available (further discussion can be found in PR 30385).
    
    Fix this by refraining from signaling full-node service support while the
    background chain is being synced. During this period, the node will only
    signal 'NODE_NETWORK_LIMITED' support. Then, full-node ('NODE_NETWORK')
    support will be re-enabled once the background chain sync is completed.
    6d5812e5c8
  30. test: add coverage for assumeUTXO honest peers disconnection
    Exercising and verifying the following points:
    
    1. An IBD node can sync headers from an AssumeUTXO node at
       any time.
    
    2. IBD nodes do not request historical blocks from AssumeUTXO
       nodes while they are syncing the background-chain.
    
    3. The assumeUTXO node dynamically adjusts the network services
       it offers according to its state.
    
    4. IBD nodes can fully sync from AssumeUTXO nodes after they
       finish the background-chain sync.
    992f83bb6f
  31. furszy force-pushed on Sep 10, 2024
  32. naumenkogs approved
  33. naumenkogs commented at 9:28 am on September 11, 2024: member
    ACK 992f83bb6f4b29b44f4eaace1d1a2c0001d43cac
  34. DrahtBot requested review from fjahr on Sep 11, 2024
  35. DrahtBot requested review from achow101 on Sep 11, 2024
  36. DrahtBot requested review from mzumsande on Sep 11, 2024
  37. DrahtBot added the label CI failed on Sep 11, 2024
  38. mzumsande commented at 5:17 pm on September 11, 2024: contributor

    ACK 992f83bb6f4b29b44f4eaace1d1a2c0001d43cac

    Updated the commit description. Let me know if further updates are needed.

    Looks good to me!

    What about us connecting to a synced limited peer during tip sync, which then tries to fetch historical blocks due to our NODE_NETWORK service, like #29183?

    I think we just wouldn’t connect to limited peers during tip sync - only when we are very close to the global tip do we accept those as outbound peers (GetDesirableServiceFlags).
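
    A rough sketch of that gating, simplified for illustration (not the exact GetDesirableServiceFlags implementation; the helper below is hypothetical):

        // Illustrative only: a NODE_NETWORK_LIMITED peer is an acceptable outbound
        // connection only when our tip is close enough that its last ~288 blocks
        // cover everything we still need.
        ServiceFlags DesirableOutboundServices(bool peer_is_limited, bool close_to_tip)
        {
            if (peer_is_limited && close_to_tip) {
                return ServiceFlags(NODE_NETWORK_LIMITED | NODE_WITNESS);
            }
            return ServiceFlags(NODE_NETWORK | NODE_WITNESS);
        }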

  39. achow101 commented at 5:22 pm on September 11, 2024: member
    ACK 992f83bb6f4b29b44f4eaace1d1a2c0001d43cac
  40. achow101 merged this on Sep 11, 2024
  41. achow101 closed this on Sep 11, 2024

  42. achow101 referenced this in commit c6b5db1d59 on Sep 11, 2024
  43. achow101 referenced this in commit b329ed739b on Sep 11, 2024
  44. achow101 commented at 5:43 pm on September 11, 2024: member
    Backported to 28.x in #30827
  45. achow101 removed the label Needs backport (28.x) on Sep 11, 2024
  46. furszy deleted the branch on Sep 11, 2024
  47. furszy commented at 7:20 pm on September 11, 2024: member

    I think we just wouldn’t connect to limited peers during tip sync - only when we are very close to the global tip do we accept those as outbound peers (GetDesirableServiceFlags).

    ha.. and I fixed an issue exactly there in the past.. (#28170). What a good memory I have..

  48. fjahr commented at 11:58 am on September 12, 2024: contributor
    post-merge utACK 992f83bb6f4b29b44f4eaace1d1a2c0001d43cac
  49. fanquake commented at 1:37 pm on September 12, 2024: member
    Any chance this is now causing #30878? If so, it would be good to fix it before this is backported.
  50. pablomartin4btc approved
  51. pablomartin4btc commented at 2:04 pm on September 12, 2024: member

    tACK 992f83bb6f4b29b44f4eaace1d1a2c0001d43cac

    I was reviewing it before the merge and went through the source code change and the test, which are both very well documented and easy to follow.

    I agree with splitting feature_assumeutxo.py into the different scenarios required by a particular module/area (p2p, wallet, etc.), with a base or common file too when possible.

