Fix peers abruptly disconnecting from AssumeUTXO nodes during IBD #30807

pull furszy wants to merge 2 commits into bitcoin:master from furszy:2024_net_assumeUTXO_service changing 7 files +147 −11
  1. furszy commented at 11:37 pm on September 3, 2024: member

    Because AssumeUTXO nodes prioritize tip synchronization, they relay their local address through the network before completing the background chain sync. This, combined with the advertising of full-node service (NODE_NETWORK), can result in an honest peer in IBD connecting to the AssumeUTXO node (while syncing) and requesting an historical block the node does not have. This behavior leads to an abrupt disconnection due to perceived unresponsiveness from the AssumeUTXO node.

    This lack of response occurs because nodes ignore getdata requests when they do not have the block data available (further discussion can be found in #30385).

    Fix this by refraining from signaling full-node service support while the background chain is being synced. During this period, the node will only signal NODE_NETWORK_LIMITED support. Then, full-node (NODE_NETWORK) support will be re-enabled once the background chain sync is completed.

    Thanks mzumsande for a post-#30385 convo too.

    Testing notes: Just cherry-pick the second commit (bb08c22) on master. It will fail there, due to the IBD node requesting historical blocks to the snapshot node - which is bad because the snapshot node will ignore the requests and stall + disconnect after some time.

  2. DrahtBot commented at 11:37 pm on September 3, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK naumenkogs, mzumsande, achow101
    Concept ACK tdb3
    Stale ACK fjahr

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #29641 (scripted-diff: Use LogInfo over LogPrintf by maflcko)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label CI failed on Sep 4, 2024
  4. fjahr commented at 2:29 pm on September 4, 2024: contributor

    Concept ACK

    I’ve read through #30385 and this seems like a good change.

  5. fjahr commented at 2:31 pm on September 4, 2024: contributor
    CI doesn’t seem happy though. Kind of weird, the logs don’t load for me.
  6. furszy commented at 2:43 pm on September 4, 2024: member

    CI doesn’t seem happy though. Kind of weird, the logs don’t load for me.

    Yeah, working on it. Almost there.

  7. furszy force-pushed on Sep 4, 2024
  8. furszy force-pushed on Sep 4, 2024
  9. furszy commented at 9:54 pm on September 4, 2024: member

    Updated and rebased due to conflicts with a recently merged PR. The CI shouldn’t complain now.

    On a side note: It took me longer to place my test code inside feature_assumeutxo.py than to fix the issue. I think it would be good to reorganize this file a bit and make the assumeUTXO params configurable at startup or runtime for tests. This would allow us to split tests into different files for each of the products we provide: network/p2p, local node, and wallet.

  10. achow101 added this to the milestone 28.0 on Sep 5, 2024
  11. achow101 added the label Needs backport (28.x) on Sep 5, 2024
  12. DrahtBot removed the label CI failed on Sep 5, 2024
  13. achow101 commented at 10:15 pm on September 5, 2024: member
    ACK bb08c227de158dbd88a80d904edb209e1734ab91
  14. DrahtBot requested review from fjahr on Sep 5, 2024
  15. tdb3 commented at 1:25 am on September 8, 2024: contributor

    Concept ACK

    Brief code review. Looks good. Planning to review in more detail.

  16. in test/functional/feature_assumeutxo.py:590 in fafc73bc03 outdated
    585         self.log.info(f"Loading snapshot into third node from {dump_output['path']}")
    586         loaded = n2.loadtxoutset(dump_output['path'])
    587         assert_equal(loaded['coins_loaded'], SNAPSHOT_BASE_HEIGHT)
    588         assert_equal(loaded['base_height'], SNAPSHOT_BASE_HEIGHT)
    589 
    590+        # Since n2 is a full node, it will unset the 'NETWORK' service flag during snapshot loading.
    


    fjahr commented at 2:28 pm on September 8, 2024:
    0        # Even though n2 is a full node, it will unset the 'NETWORK' service flag during snapshot loading.
    

    furszy commented at 8:24 pm on September 10, 2024:
    Done. Updated.
  17. in test/functional/feature_assumeutxo.py:396 in fafc73bc03 outdated
    391         assert_equal(loaded['coins_loaded'], SNAPSHOT_BASE_HEIGHT)
    392         assert_equal(loaded['base_height'], SNAPSHOT_BASE_HEIGHT)
    393 
    394+        self.log.info("Confirm that local services remain unchanged")
    395+        # Since n1 is a pruned node, the 'NETWORK' service flag must always be unset.
    396+        prune_node_services = n1.getnetworkinfo()['localservicesnames']
    


    fjahr commented at 2:33 pm on September 8, 2024:

    Since this check is repeated six times it could be a reusable function

    0def check_network_limited(node):
    1    node_services = node.getnetworkinfo()['localservicesnames']
    2    assert 'NETWORK' not in node_services
    3    assert 'NETWORK_LIMITED' in node_services
    

    furszy commented at 8:24 pm on September 10, 2024:
    Done. Updated.
  18. in src/net.cpp:1794 in fafc73bc03 outdated
    1790@@ -1791,7 +1791,8 @@ void CConnman::CreateNodeFromAcceptedSocket(std::unique_ptr<Sock>&& sock,
    1791     const bool inbound_onion = std::find(m_onion_binds.begin(), m_onion_binds.end(), addr_bind) != m_onion_binds.end();
    1792     // The V2Transport transparently falls back to V1 behavior when an incoming V1 connection is
    1793     // detected, so use it whenever we signal NODE_P2P_V2.
    1794-    const bool use_v2transport(nLocalServices & NODE_P2P_V2);
    1795+    ServiceFlags local_services = GetLocalServices();
    


    fjahr commented at 2:37 pm on September 8, 2024:
    nit: Seems like this isn’t needed so I would have preferred if that was left out or a separate commit because the file isn’t touched at all otherwise.

    furszy commented at 1:38 pm on September 9, 2024:

    nit: Seems like this isn’t needed so I would have preferred if that was left out or a separate commit because the file isn’t touched at all otherwise.

    This is needed to prevent a race condition. Since fafc73bc036792923ca64a112684abaf78f3b48e, nLocalServices can be accessed from different threads. So, we need to store the value once it’s read and use it here and few lines below (in the InitializeNode call), to maintain consistency in this flow. Otherwise, use_v2transport could have been calculated with a different nLocalServices value than the one used in InitializeNode.

  19. in test/functional/feature_assumeutxo.py:293 in bb08c227de outdated
    288+        headers_provider_conn.send_message(msg)
    289+
    290+        # Ensure headers arrived
    291+        default_value = {'status': ''}  # No status
    292+        headers_tip_hash = miner.getbestblockhash()
    293+        self.wait_until(lambda: next(filter(lambda x: x['hash'] == headers_tip_hash, snapshot_node.getchaintips()), default_value)['status'] == "headers-only")
    


    fjahr commented at 3:25 pm on September 8, 2024:

    nit: I think this does the same and is a bit simpler? Allows to get rid of the default_value when applied to the ibd_node line below as well.

    0        self.wait_until(lambda: any(tip['hash'] == headers_tip_hash and tip['status'] == "headers-only" for tip in snapshot_node.getchaintips()))
    
  20. in test/functional/feature_assumeutxo.py:302 in bb08c227de outdated
    297+        snapshot_node.loadtxoutset(snapshot['path'])
    298+
    299+        # Connect nodes and verify the ibd_node can sync-up the headers-chain from the snapshot_node
    300+        self.connect_nodes(ibd_node.index, snapshot_node.index)
    301+        snapshot_block_hash = snapshot['base_hash']
    302+        self.wait_until(lambda: next(filter(lambda x: x['hash'] == snapshot_block_hash, ibd_node.getchaintips()), default_value)['status'] == "headers-only")
    


    fjahr commented at 10:55 pm on September 8, 2024:

    see above

    0        self.wait_until(lambda: any(tip['hash'] == snapshot_block_hash and tip['status'] == "headers-only" for tip in ibd_node.getchaintips()))
    
  21. in test/functional/feature_assumeutxo.py:276 in bb08c227de outdated
    271+        # Node3 starts clean and seeks to sync-up from snapshot_node.
    272+        miner = self.nodes[0]
    273+        snapshot_node = self.nodes[2]
    274+        ibd_node = self.nodes[3]
    275+
    276+        # Start test fresh by cleaning up node directories
    


    fjahr commented at 10:56 pm on September 8, 2024:
    If this is self-contained and you need to wipe everything this could have been done in a separate test line p2p_assumeutxo.py for example and you copy over the setup from this file here but it seems like there isn’t much needed.

    furszy commented at 1:49 pm on September 9, 2024:

    If this is self-contained and you need to wipe everything this could have been done in a separate test line p2p_assumeutxo.py for example and you copy over the setup from this file here but it seems like there isn’t much needed.

    I started there (see me crying about the feature_assumeutxo.py code organization here #30807#pullrequestreview-2281468112), but ended up moving it inside this all-in-one file because I wasn’t happy copy pasting the costly snapshot setup into another test file (this test case doesn’t need all of that). I think we should allow assumeUTXO snapshot customization on regtest. It would make testing much simpler and self-contained. We would be able to subdivide it on network/p2p, local node, and wallet test files. But still, if you don’t mind, let’s leave this convo for a follow-up so we can fix this bug for v28. I would be happy to tackle if we reach consensus over it.


    fjahr commented at 2:07 pm on September 9, 2024:

    I think we should allow assumeUTXO snapshot customization on regtest. It would make testing much simpler and self-contained.

    How would it make it simpler? It’s another thing to keep track of in the test code and the implementation code. If we need more snapshot heights we can always add them but I haven’t felt the need for it so far when I have added new coverage.


    furszy commented at 2:28 pm on September 9, 2024:

    I think we should allow assumeUTXO snapshot customization on regtest. It would make testing much simpler and self-contained.

    How would it make it simpler? It’s another thing to keep track of in the test code and the implementation code.

    The custom snapshot feature shouldn’t take more than few lines at init.cpp. And for the test side, it would be an extra argument during init. I believe we will sooner or later need different snapshot for further wallet tests, maybe more than one in order to cover different self-contained test scenarios. I think it worth splitting feature_assumeutxo.py in rpc_assumeutxo.py, p2p_assumeutxo.py, wallet_assumeutxo.py, validation_assumeutxo.py etc.. for maintenance reasons.

    If we need more snapshot heights we can always add them but I haven’t felt the need for it so far when I have added new coverage.

    This is simpler in the sense of not requiring to add new hardcoded values at chainparams.cpp every time we need to test something different. For example, this p2p test case just need a snapshot, it can be a very simple one. It does not require all the extra txs setup time.

  22. fjahr commented at 11:04 pm on September 8, 2024: contributor

    utACK bb08c227de158dbd88a80d904edb209e1734ab91

    If we are in a hurry for v28 this is fine to merge as is, my comments can be kept for a follow-up.

    Aside from the inline comments I would suggest to modernize the naming of nLocalServices. I prepared a scripted-diff commit to cherry-pick or I can open the follow-up: https://github.com/fjahr/bitcoin/commit/6cc25ffb1411285d8191bc932eb37b96348f022c

  23. DrahtBot requested review from tdb3 on Sep 8, 2024
  24. furszy commented at 2:10 pm on September 9, 2024: member
    Thanks for the review fjahr!, it seems that your scripted-diff commit https://github.com/fjahr/bitcoin/commit/6cc25ffb1411285d8191bc932eb37b96348f022c makes CI unhappy. I haven’t checked the reason but we could leave it for a follow-up to move forward with the fix for v28. Happy to check it.
  25. in src/net.h:1227 in fafc73bc03 outdated
    1220@@ -1221,6 +1221,11 @@ class CConnman
    1221     //! that peer during `net_processing.cpp:PushNodeVersion()`.
    1222     ServiceFlags GetLocalServices() const;
    1223 
    1224+    //! Updates the local services that this node advertises to other peers
    1225+    //! during connection handshake.
    1226+    void AddLocalServices(ServiceFlags services) { nLocalServices = ServiceFlags(nLocalServices | services); };
    1227+    void RemoveLocalServices(ServiceFlags services) { nLocalServices = ServiceFlags(nLocalServices &~ services); }
    


    mzumsande commented at 2:33 pm on September 9, 2024:
    whitespace nit: & ~services makes more sense to me, and clang-format agrees.

    furszy commented at 8:24 pm on September 10, 2024:
    Done. Updated.
  26. mzumsande commented at 3:49 pm on September 9, 2024: contributor

    Concept ACK

    I think the description of the first commit could be improved:

    AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash.

    This is confusing to me - nodes advertise NODE_NETWORK (all historical blocks prior to tip) or NODE_NETWORK_LIMITED (last 288 from tip). They also communicate their current tip with peers, but it’s not part of the advertisement. So this includes (amongst others) blocks prior to the snapshot block hash, not posterior.

    while the AssumeUTXO node is synchronizing the tip chain or the background chain.

    Only the latter is a problem: while synchronzing the tip chain IsInitialBlockDownload is false and the node doesn’t self-advertise, and therefore wouldn’t receive inbound peers normally. Otherwise we’d have (at least in theory) a similar problem right after the snapshot was loaded, when also NODE_NETWORK_LIMITED would be incorrect because we don’t have the last 288 blocks on disk. But this shouldn’t be a problem in practice.

    after stalling for ~10 minutes

    This could be much less, due to the stalling detection (m_stalling_since), I wouldn’t mention it.

  27. furszy force-pushed on Sep 10, 2024
  28. furszy commented at 8:59 pm on September 10, 2024: member

    AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash.

    This is confusing to me - nodes advertise NODE_NETWORK (all historical blocks prior to tip) or NODE_NETWORK_LIMITED (last 288 from tip). They also communicate their current tip with peers, but it’s not part of the advertisement. So this includes (amongst others) blocks prior to the snapshot block hash, not posterior.

    Yeah ok. Changed it. Thanks!

    while the AssumeUTXO node is synchronizing the tip chain or the background chain.

    Only the latter is a problem: while synchronzing the tip chain IsInitialBlockDownload is false and the node doesn’t self-advertise, and therefore wouldn’t receive inbound peers normally. Otherwise we’d have (at least in theory) a similar problem right after the snapshot was loaded, when also NODE_NETWORK_LIMITED would be incorrect because we don’t have the last 288 blocks on disk. But this shouldn’t be a problem in practice.

    What about us connecting to a synced limited peer, during tip sync, who tries to fetch historical blocks due to the NODE_NETWORK service, like #29183. Also, manual connections are also affected if the other peer is in IBD. But.. this is something the node operator would hopefully be aware of before connecting peers.

    Aside from this two nuances, thanks for the feedback! Updated the commit description. Let me know if further updates are needed.

  29. assumeUTXO: fix peers disconnection during sync
    Because AssumeUTXO nodes prioritize tip synchronization, they relay their local
    address through the network before completing the background chain sync.
    This, combined with the advertising of full-node service (NODE_NETWORK), can
    result in an honest peer in IBD connecting to the AssumeUTXO node (while syncing)
    and requesting an historical block the node does not have. This behavior leads to
    an abrupt disconnection due to perceived unresponsiveness (lack of response)
    from the AssumeUTXO node.
    
    This lack of response occurs because nodes ignore getdata requests when they do
    not have the block data available (further discussion can be found in PR 30385).
    
    Fix this by refraining from signaling full-node service support while the
    background chain is being synced. During this period, the node will only
    signal 'NODE_NETWORK_LIMITED' support. Then, full-node ('NODE_NETWORK')
    support will be re-enabled once the background chain sync is completed.
    6d5812e5c8
  30. test: add coverage for assumeUTXO honest peers disconnection
    Exercising and verifying the following points:
    
    1. An IBD node can sync headers from an AssumeUTXO node at
       any time.
    
    2. IBD nodes do not request historical blocks from AssumeUTXO
       nodes while they are syncing the background-chain.
    
    3. The assumeUTXO node dynamically adjusts the network services
       it offers according to its state.
    
    4. IBD nodes can fully sync from AssumeUTXO nodes after they
       finish the background-chain sync.
    992f83bb6f
  31. furszy force-pushed on Sep 10, 2024
  32. naumenkogs approved
  33. naumenkogs commented at 9:28 am on September 11, 2024: member
    ACK 992f83bb6f4b29b44f4eaace1d1a2c0001d43cac
  34. DrahtBot requested review from fjahr on Sep 11, 2024
  35. DrahtBot requested review from achow101 on Sep 11, 2024
  36. DrahtBot requested review from mzumsande on Sep 11, 2024
  37. DrahtBot added the label CI failed on Sep 11, 2024
  38. mzumsande commented at 5:17 pm on September 11, 2024: contributor

    ACK 992f83bb6f4b29b44f4eaace1d1a2c0001d43cac

    Updated the commit description. Let me know if further updates are needed.

    Looks good to me!

    What about us connecting to a synced limited peer, during tip sync, who tries to fetch historical blocks due to the NODE_NETWORK service, like #29183.

    I think we just wouldn’t connect to limited peers during tip sync - only when we are very close to the global tip we accept those as outbound peers (GetDesirableServiceFlags).

  39. achow101 commented at 5:22 pm on September 11, 2024: member
    ACK 992f83bb6f4b29b44f4eaace1d1a2c0001d43cac
  40. achow101 merged this on Sep 11, 2024
  41. achow101 closed this on Sep 11, 2024

  42. achow101 referenced this in commit c6b5db1d59 on Sep 11, 2024
  43. achow101 referenced this in commit b329ed739b on Sep 11, 2024
  44. achow101 commented at 5:43 pm on September 11, 2024: member
    Backported to 28.x in #30827
  45. achow101 removed the label Needs backport (28.x) on Sep 11, 2024
  46. furszy deleted the branch on Sep 11, 2024
  47. furszy commented at 7:20 pm on September 11, 2024: member

    I think we just wouldn’t connect to limited peers during tip sync - only when we are very close to the global tip we accept those as outbound peers (GetDesirableServiceFlags).

    ha.. and I fixed an issue exactly there in the past.. (#28170). What a good memory I have..

  48. fjahr commented at 11:58 am on September 12, 2024: contributor
    post-merge utACK 992f83bb6f4b29b44f4eaace1d1a2c0001d43cac
  49. fanquake commented at 1:37 pm on September 12, 2024: member
    Any chance this is now causing #30878? If so, would be good to fix before this is backported.
  50. pablomartin4btc approved
  51. pablomartin4btc commented at 2:04 pm on September 12, 2024: member

    tACK 992f83bb6f4b29b44f4eaace1d1a2c0001d43cac

    I was reviewing it before merge, went thru the src code change and the test, which are both very well documented and easy to follow.

    I agree with the split of feature_assumeutxo.py for different scenarios required by a particular module/ area (p2p, wallet, etc.) having a base or common file too when possible.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-09-20 01:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me