Parallel compact block downloads, take 3 #27626

instagibbs wants to merge 6 commits into bitcoin:master from instagibbs:2023-05-parallel-block-downloads, changing 5 files (+242 −76)
  1. instagibbs commented at 2:05 pm on May 11, 2023: member

    This is an attempt at mitigating #25258, which is a revival of #10984, which is a revival of #9447.

    This PR attempts to mitigate a single case: high-bandwidth peers can bail us out when a flaky peer fails to complete a block for us. We allow up to 2 additional getblocktxns requests per unique block. This gives an honest high-bandwidth peer a chance to hand us the transactions even if the first in-flight peer stalls out.

    In contrast to the previous effort:

    1. it will not help when subsequent peers send block headers only; only high-bandwidth compact block announcements qualify this time. See: https://github.com/bitcoin/bitcoin/pull/10984/files#diff-6875de769e90cec84d2e8a9c1b962cdbcda44d870d42e4215827e599e11e90e3R1411
    2. MAX_GETBLOCKTXN_TXN_AFTER_FIRST_IN_FLIGHT is removed, in favor of aiding recovery during turbulent mempools
    3. We require that one of the 3 block-fetching slots be taken by an outbound peer. This can be the original offering peer, or a subsequent compact block given by a high-bandwidth peer. A sketch of the resulting request-gating logic follows this list.
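
    For orientation, here is a minimal sketch of that gating logic as it is discussed and refined in the review below. The helper name AllowGetBlockTxn is hypothetical; the constant and conditions mirror the diffs quoted later in this thread, not necessarily the exact merged code:

     #include <cstddef>

     static const unsigned int MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK = 3;

     // Decide whether a getblocktxn round-trip is allowed for a compact block
     // that failed to reconstruct from our mempool alone. Callers separately
     // cap total attempts at MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK.
     bool AllowGetBlockTxn(bool first_in_flight, bool peer_is_hb_to,
                           bool peer_is_inbound, bool outbound_already_in_flight,
                           std::size_t already_in_flight)
     {
         if (first_in_flight) return true; // always round-trip with the first announcer
         if (!peer_is_hb_to) return false; // extra slots only for chosen high-bandwidth peers
         return !peer_is_inbound ||           // outbound hb peers always qualify
                outbound_already_in_flight || // an outbound attempt exists, take what we can get
                already_in_flight < MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK - 1; // reserve last slot for outbound
     }
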
  2. DrahtBot commented at 2:05 pm on May 11, 2023: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK sdaftuar, mzumsande
    Concept ACK jamesob, theuni
    Stale ACK dergoegge, ajtowns

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #27596 (assumeutxo (2) by jamesob)
    • #27052 (test: rpc: add last block announcement time to getpeerinfo result by LarryRuane)
    • #26969 (net, refactor: net_processing, add ProcessCompactBlockTxns by brunoerg)
    • #26812 (test: add end-to-end tests for CConnman and PeerManager by vasild)
    • #26621 (refactor: Continue moving application data from CNode to Peer by dergoegge)
    • #24008 (assumeutxo: net_processing changes by jamesob)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. instagibbs force-pushed on May 11, 2023
  4. DrahtBot added the label CI failed on May 11, 2023
  5. glozow added the label P2P on May 11, 2023
  6. glozow requested review from sdaftuar on May 11, 2023
  7. in src/net_processing.cpp:905 in e85fa40462 outdated
    900@@ -901,7 +901,7 @@ class PeerManagerImpl final : public PeerManager
    901      */
    902     void FindNextBlocksToDownload(const Peer& peer, unsigned int count, std::vector<const CBlockIndex*>& vBlocks, NodeId& nodeStaller) EXCLUSIVE_LOCKS_REQUIRED(cs_main);
    903 
    904-    typedef std::map<uint256, std::pair<NodeId, std::list<QueuedBlock>::iterator>> BlockDownloadMap;
    905+    typedef std::multimap<uint256, std::pair<NodeId, std::list<QueuedBlock>::iterator>> BlockDownloadMap;
    


    ajtowns commented at 2:24 pm on May 11, 2023:
    Would this be better as a map<uint256, map<NodeId, list<>::iterator>> ? Seems like it’d make some of the FIXMEs easier.

    instagibbs commented at 6:21 pm on May 11, 2023:

    after attempting the change, I think that would violate https://github.com/bitcoin/bitcoin/pull/27626/commits/e424dcba60e61f014b19a3e2f59238ca3b91f3db#diff-6875de769e90cec84d2e8a9c1b962cdbcda44d870d42e4215827e599e11e90e3R1394 which requires insertion ordering to know which is the stalling peer.

    totally forgot about the FIXME…

  8. jamesob referenced this in commit 74c4a66c86 on May 11, 2023
  9. in src/net_processing.cpp:4327 in e7911fd890 outdated
    4299@@ -4300,7 +4300,7 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
    4300 
    4301         if (pindex->nChainWork <= m_chainman.ActiveChain().Tip()->nChainWork || // We know something better
    4302                 pindex->nTx != 0) { // We had this block at some point, but pruned it
    4303-            if (fAlreadyInFlight) {
    4304+            if (in_flight_same_peer) {
    4305                 // We requested this block for some reason, but our mempool will probably be useless
    4306                 // so we just grab the block via normal getdata
    


    ajtowns commented at 2:55 pm on May 11, 2023:

    I think this case is (mostly) for when a non-high-bandwidth peer tells us about a block, we request it via a compact block, and then we get a different block from some other peer, UpdateTip to that, and only afterwards see the CMPCTBLOCK response from the original peer. But now that response is probably useless, so replace it with a GETDATA.

    I wonder if that might be better expressed as: if (in_flight_same_peer && pindex->pprev != ActiveChain().Tip()) ?


    instagibbs commented at 4:17 pm on May 11, 2023:
    sounds right, I’ll sit on this for a bit

    ajtowns commented at 4:38 pm on May 11, 2023:
    Probably better to defer it as an unrelated change anyway?
  10. in src/net_processing.cpp:4291 in e85fa40462 outdated
    4289         if (pindex->nStatus & BLOCK_HAVE_DATA) // Nothing to do here
    4290             return;
    4291 
    4292+        auto range_flight = mapBlocksInFlight.equal_range(pindex->GetBlockHash());
    4293+        bool fAlreadyInFlight = range_flight.first != range_flight.second;
    4294+        bool in_flight_same_peer = false;
    


    ajtowns commented at 2:58 pm on May 11, 2023:
    Might be better to name this requested_block_from_this_peer ? (I think for new code we prefer bool x{false}; so we get errors if we’re accidentally narrowing values?)

    instagibbs commented at 4:46 pm on May 11, 2023:
    don’t love your name suggestion; changed the initialization though

    ajtowns commented at 5:16 pm on May 11, 2023:
    Let me rephrase. What I meant was that in_flight_same_peer confused me: “we just received the block from this peer, of course it was in flight, how else would it get to us?”. But what the bool is actually telling us is that we had deliberately asked this peer for this block, it wasn’t just spontaneously sent to us.

    instagibbs commented at 6:42 pm on May 11, 2023:
    ok, I can see that and how it reads cleaner, changed
  11. in src/net_processing.cpp:4300 in 1cde8b4862 outdated
    4296@@ -4297,6 +4297,7 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
    4297             }
    4298             range_flight.first++;
    4299         }
    4300+        bool first_in_flight = !already_in_flight || in_flight_same_peer;
    


    ajtowns commented at 3:03 pm on May 11, 2023:
    already_in_flight == 0 || (already_in_flight == 1 && in_flight_same_peer) ?
  12. in src/net_processing.cpp:4368 in 1cde8b4862 outdated
    4362@@ -4357,8 +4363,15 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
    4363                     blockTxnMsg << txn;
    4364                     fProcessBLOCKTXN = true;
    4365                 } else {
    4366-                    req.blockhash = pindex->GetBlockHash();
    4367-                    m_connman.PushMessage(&pfrom, msgMaker.Make(NetMsgType::GETBLOCKTXN, req));
    4368+                    // We will try to round-trip any compact blocks we get on failure,
    4369+                    // as long as it's first, or not too large
    4370+                    if (req.indexes.size() <= MAX_GETBLOCKTXN_TXN_AFTER_FIRST_IN_FLIGHT || first_in_flight) {
    


    ajtowns commented at 3:05 pm on May 11, 2023:
    (Test the bool first?)
  13. DrahtBot removed the label CI failed on May 11, 2023
  14. in src/net_processing.h:28 in 1cde8b4862 outdated
    21@@ -22,6 +22,11 @@ static const bool DEFAULT_PEERBLOOMFILTERS = false;
    22 static const bool DEFAULT_PEERBLOCKFILTERS = false;
    23 /** Threshold for marking a node to be discouraged, e.g. disconnected and added to the discouragement filter. */
    24 static const int DISCOURAGEMENT_THRESHOLD{100};
    25+/** Maximum number of outstanding CMPCTBLOCK requests for the same block. */
    26+/** Greg: Setting to 3 to match number of high bandwidth peers... ~ensures one is outbound */
    27+static const unsigned int MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK = 3;
    28+/** Maximum number of missing transactions for a non-first getblocktxn request */
    29+static const unsigned int MAX_GETBLOCKTXN_TXN_AFTER_FIRST_IN_FLIGHT = 10;
    


    ajtowns commented at 3:30 pm on May 11, 2023:

    Out of the last 612 compact blocks my node’s reconstructed that required requesting transactions, 277 needed 10 or fewer (45%), 441 needed 50 or fewer (72%), 488 needed 100 or fewer (80%), 548 needed 500 or fewer (90%). I’d probably prefer 50 or 100 than 10, I guess?

    (In simpler times, 1086/1162 (93%) needed 10 or fewer, 1140/1162 (98%) needed 50 or fewer, 1144/1162 (98.5%) needed 100 or fewer. Again, I’m not including blocks that requested 0 txs in those denominators)


    instagibbs commented at 4:32 pm on May 11, 2023:
    what if we just removed the limit, since it’s not much of a protection, especially if the missing 10 txns are *nscriptions

    ajtowns commented at 4:46 pm on May 11, 2023:
    I think the limit’s fine; when you start up with an empty mempool, you’ll be requesting most of the transactions in the block because your mempool’s empty, and there’s no need to do that three times in parallel. Could perhaps have the limit work the other way: if you’ve already got N txs and need K txs, only allow a duplicate request when 0 < K <= N/4 + 10; then if the block just has ten 100kvB txs, you won’t request them multiple times.
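
    A sketch of that relative cap, as an editorial illustration with hypothetical names (have_n for N, need_k for K):

     #include <cstddef>

     // Allow a duplicate getblocktxn only when the missing portion K is small
     // relative to the N transactions already reconstructed from the mempool.
     bool WithinRelativeLimit(std::size_t have_n, std::size_t need_k)
     {
         return need_k > 0 && need_k <= have_n / 4 + 10;
     }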

    dergoegge commented at 6:01 pm on May 11, 2023:

    Imo, it doesn’t seem worthwhile to have the limit:

    • the good case (small K) should be far more common than the bad one (large K)
    • the limit makes relay of blocks with large K less reliable w.r.t. stalling (reduces to what we have now)
    • the limit adds code complexity and a maintenance burden. 10 is not a good value according to your data, and if #10984 had been merged then we would need to re-evaluate now. Who’s to say this won’t change again?

    instagibbs commented at 8:19 pm on May 11, 2023:
    if the issue is during warmup, that’s a pretty small window, and we typically won’t actually do parallel fetching then; 3 parallel requests seem rare even when warmed up with a limit of 10. I’ll gather some more stats on that front, seeing how many parallel blocks are initialized.

    instagibbs commented at 1:50 pm on May 12, 2023:
    This “warmup” should really only be an issue after we’ve seen new blocks from 3+ peers since we don’t persist hb peers across restarts and we have to wait for MaybeSetPeerAsAnnouncingHeaderAndIDs to be called on multiple unique peers. I now suspect the warmup period handles itself? My logs show 4+ blocks per node with only one partial block.
  15. in src/net_processing.cpp:4487 in 1cde8b4862 outdated
    4471                 auto [node_id, block_it] = range_flight.first->second;
    4472                 if (node_id == pfrom.GetId() && block_it->partialBlock) {
    4473                     expected_blocktxn = true;
    4474                     break;
    4475                 }
    4476+                first_in_flight = false;
    


    ajtowns commented at 3:39 pm on May 11, 2023:
    This logic is different to the previous first_in_flight; probably should be the same?

    instagibbs commented at 6:51 pm on May 11, 2023:
    done, cleaner thank you
  16. ajtowns commented at 3:47 pm on May 11, 2023: contributor
    Seems okay at first glance. It would be nice to have a “blockrequest.cpp” module that collects all this logic together. Perhaps we want to keep this change simple so it’s a smaller backport, and refactor later, though.
  17. in src/net_processing.cpp:4370 in df1218b447 outdated
    4363@@ -4364,8 +4364,11 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
    4364                     fProcessBLOCKTXN = true;
    4365                 } else {
    4366                     // We will try to round-trip any compact blocks we get on failure,
    4367-                    // as long as it's first, or not too large
    4368-                    if (req.indexes.size() <= MAX_GETBLOCKTXN_TXN_AFTER_FIRST_IN_FLIGHT || first_in_flight) {
    4369+                    // as long as it's first, or not too large, and as long as we chose
    4370+                    // the peer for high-bandwidth relay.
    4371+                    auto hb_peer = std::find(lNodesAnnouncingHeaderAndIDs.begin(), lNodesAnnouncingHeaderAndIDs.end(), pfrom.GetId());
    4372+                    if ((hb_peer != lNodesAnnouncingHeaderAndIDs.end() && req.indexes.size() <= MAX_GETBLOCKTXN_TXN_AFTER_FIRST_IN_FLIGHT) ||
    


    dergoegge commented at 4:24 pm on May 11, 2023:
    Could use pfrom.m_bip152_highbandwidth_to?

    instagibbs commented at 4:54 pm on May 11, 2023:
    done
  18. instagibbs force-pushed on May 11, 2023
  19. instagibbs force-pushed on May 11, 2023
  20. DrahtBot added the label CI failed on May 11, 2023
  21. DrahtBot removed the label CI failed on May 11, 2023
  22. instagibbs force-pushed on May 11, 2023
  23. instagibbs force-pushed on May 11, 2023
  24. in src/net_processing.cpp:4381 in ceded96265 outdated
    4378+                    // as long as it's first, or not too large and as long as we chose
    4379+                    // the peer for high-bandwidth relay.
    4380+                    if (first_in_flight ||
    4381+                        (pfrom.m_bip152_highbandwidth_to && req.indexes.size() <= MAX_GETBLOCKTXN_TXN_AFTER_FIRST_IN_FLIGHT)) {
    4382+                        req.blockhash = pindex->GetBlockHash();
    4383+                        m_connman.PushMessage(&pfrom, msgMaker.Make(NetMsgType::GETBLOCKTXN, req));
    


    dergoegge commented at 9:28 am on May 12, 2023:
    This could be an else if of the outer if statement. That would avoid the nesting here.

    instagibbs commented at 1:35 pm on May 12, 2023:
    done
  25. instagibbs commented at 1:38 pm on May 12, 2023: member
    I’ll clean up history later when we’re closer to ACKs
  26. DrahtBot added the label CI failed on May 12, 2023
  27. Sjors commented at 5:47 pm on May 12, 2023: member

    In contrast to previous effort, it will not help if subsequent peers send block headers only

    Can you elaborate on this? Are you saying that falling back to full block download is pointless?

  28. instagibbs commented at 5:56 pm on May 12, 2023: member

    @Sjors Basically previous efforts would allow the peer to send a header, not a compact block, and we would fire off an additional getdata for the compact block. See this block of code: https://github.com/bitcoin/bitcoin/pull/10984/files#diff-6875de769e90cec84d2e8a9c1b962cdbcda44d870d42e4215827e599e11e90e3R1411

    IMO it’s unlikely to make an impact, as high-bandwidth peers are already demonstrably fast, and restricting to them lets us better control who takes up the additional slots.

  29. Sjors commented at 7:13 pm on May 12, 2023: member

    So that earlier attempt would solicit a compact block from a (BIP 152) low-bandwidth mode peer, in response to it announcing a header. That’s similar to what we already do in HeadersDirectFetchBlocks, but allowing more than 1 in transit?

    Whereas in this PR we focus on the high-bandwidth mode peers, which are allowed to send us a cmpctblock message straight away (rather than waiting for us to ask). Right now we can only have one cmpctblock -> getblocktxn -> blocktxn “session” in progress. This PR makes that 3.

    This approach seems fine, but the other attempts were never merged, which is why I am confused by the “it will not help” phrase. Are you saying these earlier attempts failed in some experiment, are they useless in general or is this new approach just better (and less complex?)?

  30. instagibbs commented at 7:17 pm on May 12, 2023: member

    This PR makes that 3.

    Makes it 3 parallel block downloads; the first can be of any kind, while the 2nd and 3rd must be compact block announcements from high-bandwidth peers.

    I am confused by the “it will not help” phrase

    This PR does not take advantage of low-bandwidth compact block peers message patterns, in other words.

  31. instagibbs commented at 3:13 pm on May 15, 2023: member

    I was getting some sybil-stalls above the missing-txn <=10 limit, so for now I have uncapped it.

    I also added in logic to ensure that at least one of the 3 slots is taken by an outbound peer (a sketch of that outbound check follows below). This should be more robust against sybils that are able to take up all inbound hb slots, e.g.:

    1. uses fresh connection to announce, then stall
    2. uses both inbound hb peers to take up 2nd and 3rd slots, but never gives txns
    3. fresh peer times out, gets disconnected
    4. repeat 1-3

    Added more extensive tests.
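
    A sketch of the outbound check referenced above, with simplified types (BlockHash standing in for uint256, and IsInboundPeer a hypothetical stand-in for the CNodeState lookup):

     #include <map>
     #include <string>

     using NodeId = long;
     using BlockHash = std::string; // stand-in for uint256 in this sketch

     // hash -> requesting peer; the real map value also carries a queue iterator
     std::multimap<BlockHash, NodeId> g_blocks_in_flight;

     bool IsInboundPeer(NodeId id); // hypothetical peer-state lookup

     // True if any outstanding request for this block went to an outbound peer.
     bool IsBlockRequestedFromOutbound(const BlockHash& hash)
     {
         for (auto range = g_blocks_in_flight.equal_range(hash);
              range.first != range.second; ++range.first) {
             if (!IsInboundPeer(range.first->second)) return true;
         }
         return false;
     }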

  32. instagibbs force-pushed on May 15, 2023
  33. DrahtBot removed the label CI failed on May 15, 2023
  34. instagibbs force-pushed on May 16, 2023
  35. in src/net_processing.cpp:1131 in ffc0aa0390 outdated
    1125@@ -1125,34 +1126,40 @@ bool PeerManagerImpl::IsBlockRequested(const uint256& hash)
    1126 
    1127 void PeerManagerImpl::RemoveBlockRequest(const uint256& hash, std::optional<NodeId> from_peer)
    1128 {
    1129-    auto it = mapBlocksInFlight.find(hash);
    1130-    if (it == mapBlocksInFlight.end()) {
    1131-        // Block was not requested
    1132+    auto range = mapBlocksInFlight.equal_range(hash);
    1133+    if (range.first == range.second) {
    1134+        // Block was not requested by any peer
    


    sdaftuar commented at 2:02 pm on May 17, 2023:
    nit: This comment is not right, should say something like “was not requested from any peer”.

    instagibbs commented at 5:04 pm on May 17, 2023:
    done
  36. in src/net_processing.cpp:1534 in ffc0aa0390 outdated
    1529+                range.first++;
    1530+            } else {
    1531+                range.first = mapBlocksInFlight.erase(range.first);
    1532+            }
    1533+        }
    1534         mapBlocksInFlight.erase(entry.pindex->GetBlockHash());
    


    sdaftuar commented at 2:47 pm on May 17, 2023:
    Doesn’t this line need to be removed here?

    instagibbs commented at 5:05 pm on May 17, 2023:
    done
  37. in src/net_processing.cpp:4323 in 999a123082 outdated
    4319@@ -4320,7 +4320,7 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
    4320         }
    4321 
    4322         // If we're not close to tip yet, give up and let parallel block fetch work its magic
    4323-        if (!fAlreadyInFlight && !CanDirectFetch()) {
    4324+        if (!in_flight_same_peer && !CanDirectFetch()) {
    


    sdaftuar commented at 3:05 pm on May 17, 2023:
    What’s the justification for this change? It seems like this changes things so that in edge-case scenarios, we wouldn’t allow the opportunistic compact block reconstruction to take place, and I don’t see why we need to do that.

    instagibbs commented at 5:05 pm on May 17, 2023:
    hmm yeah, this is old code, doesn’t make sense to me either. removed the change.
  38. in src/net_processing.cpp:1150 in e54f2d3680 outdated
    1146@@ -1132,8 +1147,8 @@ void PeerManagerImpl::RemoveBlockRequest(const uint256& hash, std::optional<Node
    1147         return;
    1148     }
    1149 
    1150-    // Currently we don't request more than one peer for same block
    1151-    Assume(mapBlocksInFlight.count(hash) == 1);
    1152+    // We should never requested too many of this block
    


    sdaftuar commented at 3:10 pm on May 17, 2023:
    nit: perhaps “We should not have requested too many of this block”

    instagibbs commented at 5:05 pm on May 17, 2023:
    done
  39. in src/net_processing.cpp:1195 in e54f2d3680 outdated
    1193@@ -1179,8 +1194,8 @@ bool PeerManagerImpl::BlockRequested(NodeId nodeid, const CBlockIndex& block, st
    1194         }
    1195     }
    1196 
    1197-    // Make sure it's not listed somewhere already.
    1198-    RemoveBlockRequest(hash, std::nullopt);
    1199+    // Make sure it's not being fetched already from same peer.
    1200+    RemoveBlockRequest(hash, nodeid);
    


    sdaftuar commented at 3:13 pm on May 17, 2023:
    Is it worth adding an Assume() here that we have strictly less than MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK requests outstanding?

    instagibbs commented at 5:05 pm on May 17, 2023:
    done
  40. in src/net_processing.cpp:4518 in e54f2d3680 outdated
    4515+            if (!requested_block_from_this_peer) {
    4516                 LogPrint(BCLog::NET, "Peer %d sent us block transactions for block we weren't expecting\n", pfrom.GetId());
    4517                 return;
    4518             }
    4519 
    4520+            bool first_in_flight = already_in_flight == 0 || (already_in_flight == 1 && requested_block_from_this_peer);
    


    sdaftuar commented at 3:34 pm on May 17, 2023:

    So if we get 3 compact block announcements from 3 peers, and we then send 3 getblocktxn messages to them, and then the first blocktxn comes back and reconstruction fails, we will wait until we get the third blocktxn message back before we send out a getdata message for the full block, I think?

    That seems suboptimal, though I don’t have a fix in mind yet.


    sdaftuar commented at 4:16 pm on May 17, 2023:
    Actually, can we use the insertion order guarantees of multimap to solve this?

    instagibbs commented at 5:36 pm on May 17, 2023:
    Looks like some logic that got lost in translation from prior PRs. Indeed, we can just peek at the first entry, if it exists. Fixed, I think.
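
    The fix works because std::multimap keeps equivalent keys in insertion order (since C++11, insertion happens at the upper bound of the equal range), so the front of equal_range is always the earliest request. A self-contained illustration with simplified types:

     #include <cassert>
     #include <map>
     #include <string>

     using NodeId = long;

     int main()
     {
         std::multimap<std::string, NodeId> in_flight;
         in_flight.emplace("blockhash", 7); // first request for this block
         in_flight.emplace("blockhash", 3); // later request, same block
         auto range = in_flight.equal_range("blockhash");
         assert(range.first->second == 7); // earliest requester is at the front
         // So "first_in_flight" is true for peer 7 and false for peer 3.
     }
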
  41. in src/net_processing.cpp:4532 in e54f2d3680 outdated
    4534+                    // Might have collided, fall back to getdata now :(
    4535+                    std::vector<CInv> invs;
    4536+                    invs.push_back(CInv(MSG_BLOCK | GetFetchFlags(*peer), resp.blockhash));
    4537+                    m_connman.PushMessage(&pfrom, msgMaker.Make(NetMsgType::GETDATA, invs));
    4538+                } else {
    4539+                    RemoveBlockRequest(resp.blockhash, pfrom.GetId()); // Reset in-flight state in case of whitelist
    


    sdaftuar commented at 3:35 pm on May 17, 2023:
    nit: this comment doesn’t seem right

    instagibbs commented at 5:05 pm on May 17, 2023:
    removed nonsense copy/paste
  42. in src/net_processing.cpp:4325 in 324af0fd5c outdated
    4322+                requested_block_from_this_peer = true;
    4323+                break;
    4324+            }
    4325+            range_flight.first++;
    4326+        }
    4327+        bool first_in_flight = already_in_flight == 0 || (already_in_flight == 1 && requested_block_from_this_peer);
    


    sdaftuar commented at 3:49 pm on May 17, 2023:

    This notion of first_in_flight is not correct, I think. Imagine we get a headers announcement from an outbound (non-hb) peer, and so we send a getdata(cmpctblock). In the meantime, we get a compact block announcement from an inbound peer which fails to reconstruct. Then the compact block comes back from the outbound peer; first_in_flight will be false because we think we have two requests outstanding.

    Further down, when we decide whether to send a GETBLOCKTXN message to this non-HB outbound peer, we will choose not to, because first_in_flight is false and the peer is not already HB. This seems like it would be a regression, as inbound HB peers would gain the ability to interfere with compact block relay happening with other peers just by making a compactblock announcement.


    sdaftuar commented at 4:18 pm on May 17, 2023:
    Similarly here, can we use the insertion order of multimap to determine which peer was actually first to make the announcement? If in the situation where an outbound (non-hb) peer is the first to announce that we still always send the getblocktxn if compact blocks fail, then this new logic would not be making anything worse, which seems sufficient to me.

    instagibbs commented at 5:35 pm on May 17, 2023:

    If in the situation where an outbound (non-hb) peer is the first to announce that we still always send the getblocktxn if compact blocks fail, then this new logic would not be making anything worse, which seems sufficient to me.

    Exactly. Fixed, I think!

  43. sdaftuar commented at 3:52 pm on May 17, 2023: member

    I have not reviewed the test or done any manual testing yet, but just finished reading the code.

    One conceptual thing that I wanted to mention: right now it is possible for none of our HB compactblock peers to be outbound. The way the HB outbound reservation logic works is that if we ever have an outbound peer taking an HB slot, we will not allow inbounds to take over all 3 slots. However, it is possible that a peer starting up for the first time could have all its slots taken by inbounds.

  44. instagibbs commented at 4:15 pm on May 17, 2023: member

    However, it is possible that a peer starting up for the first time could have all its slots taken by inbounds.

    Yeah, goes without saying, but maybe it doesn’t. This PR offers no protection on a fresh restart (no hb peers), and only offers 3 parallel downloads once an outbound is selected as hb and offers a block.

  45. instagibbs force-pushed on May 17, 2023
  46. DrahtBot added the label CI failed on May 17, 2023
  47. sdaftuar commented at 5:47 pm on May 17, 2023: member
    Thanks for the quick updates – the code looks right to me now; will test.
  48. instagibbs force-pushed on May 17, 2023
  49. sdaftuar commented at 1:52 pm on May 18, 2023: member

    ACK 2d12fab42cc0f11d5ede96c7eb71a496931df0b5. I’ve reviewed the test as well and have been running this branch live, and it looks good to me.

    I think it’d be good to merge this sooner than later so we can get some experience running this in master before it appears in a release, so that we can get a sense of whether we want to institute any better bandwidth management before deploying.

    EDIT: not sure what’s up with the CI failure?

  50. instagibbs commented at 1:59 pm on May 18, 2023: member
    “jq: command not found” weird one, probably just restart?
  51. alias BlockDownloadMap for mapBlocksInFlight 86cff8bf18
  52. Remove nBlocksInFlight a90595478d
  53. Convert mapBlocksInFlight to a multimap cce96182ba
  54. Only request full blocks from the peer we thought had the block in-flight
    This is a change in behavior so that if for some reason we request a block from a peer, we don't allow an unsolicited CMPCT_BLOCK announcement for that same block to cause a request for a full block from the uninvited peer (as some type of request is already outstanding from the original peer)
    13f9b20b4c
  55. fanquake commented at 2:03 pm on May 18, 2023: member

    “jq: command not found” weird one, probably just restart?

    Rebase should do it. Added recently for IWYU in the tidy job.

  56. instagibbs force-pushed on May 18, 2023
  57. instagibbs commented at 2:04 pm on May 18, 2023: member
    rebased
  58. fanquake added this to the milestone 26.0 on May 18, 2023
  59. sdaftuar commented at 2:48 pm on May 18, 2023: member
    ACK 42c2696ae58e07de005edf1dda952761f8c9008e
  60. DrahtBot removed the label CI failed on May 18, 2023
  61. in test/functional/p2p_compactblocks.py:850 in 42c2696ae5 outdated
    845+            with p2p_lock:
    846+                assert "getblocktxn" in peer.last_message
    847+            return block, cmpct_block
    848+
    849+        # First, make delivery_peer, inbound_peer, and outbound_peer high bandwidth
    850+        block, cmpct_block = announce_cmpct_block(node, delivery_peer, 1)
    


    brunoerg commented at 7:55 pm on May 18, 2023:

    Suggestion to simplify it, in case you touch it again:

     0index 50cce9318..cd9268f6b 100755
     1--- a/test/functional/p2p_compactblocks.py
     2+++ b/test/functional/p2p_compactblocks.py
     3@@ -847,29 +847,14 @@ class CompactBlocksTest(BitcoinTestFramework):
     4             return block, cmpct_block
     5 
     6         # First, make delivery_peer, inbound_peer, and outbound_peer high bandwidth
     8-        block, cmpct_block = announce_cmpct_block(node, delivery_peer, 1)
     9-        msg = msg_blocktxn()
    10-        msg.block_transactions.blockhash = block.sha256
    11-        msg.block_transactions.transactions = block.vtx[1:]
    12-        delivery_peer.send_and_ping(msg)
    13-        assert_equal(int(node.getbestblockhash(), 16), block.sha256)
    14-        delivery_peer.clear_getblocktxn()
    15-
    16-        block, cmpct_block = announce_cmpct_block(node, inbound_peer, 1)
    17-        msg = msg_blocktxn()
    18-        msg.block_transactions.blockhash = block.sha256
    19-        msg.block_transactions.transactions = block.vtx[1:]
    20-        inbound_peer.send_and_ping(msg)
    21-        assert_equal(int(node.getbestblockhash(), 16), block.sha256)
    22-        inbound_peer.clear_getblocktxn()
    23-
    24-        block, cmpct_block = announce_cmpct_block(node, outbound_peer, 1)
    25-        msg = msg_blocktxn()
    26-        msg.block_transactions.blockhash = block.sha256
    27-        msg.block_transactions.transactions = block.vtx[1:]
    28-        outbound_peer.send_and_ping(msg)
    29-        assert_equal(int(node.getbestblockhash(), 16), block.sha256)
    30-        outbound_peer.clear_getblocktxn()
    31+        for peer in [delivery_peer, inbound_peer, outbound_peer]:
    32+            block, _ = announce_cmpct_block(node, peer, 1)
    33+            msg = msg_blocktxn()
    34+            msg.block_transactions.blockhash = block.sha256
    35+            msg.block_transactions.transactions = block.vtx[1:]
    36+            peer.send_and_ping(msg)
    37+            assert_equal(int(node.getbestblockhash(), 16), block.sha256)
    38+            peer.clear_getblocktxn()
    39 
    40         # Test the simple parallel download case...
    41         for num_missing in [1, 5, 20]:
    

    maflcko commented at 11:15 am on May 19, 2023:
    (If this nit is taken, it would also be good to add a log statement, so that if the loop is failing, one can derive easily in which iteration it failed. Generally I think any, or almost any comment in a functional test case can be a (debug)log statement to aid debugging in case of failure)

    instagibbs commented at 1:11 pm on May 22, 2023:
    done
  62. in src/net_processing.cpp:4395 in 42c2696ae5 outdated
    4391+                    // We will try to round-trip any compact blocks we get on failure,
    4392+                    // as long as it's first...
    4393                     req.blockhash = pindex->GetBlockHash();
    4394                     m_connman.PushMessage(&pfrom, msgMaker.Make(NetMsgType::GETBLOCKTXN, req));
    4395+                } else if (pfrom.m_bip152_highbandwidth_to &&
    4396+                    (!nodestate->m_is_inbound ||
    


    dergoegge commented at 7:33 am on May 19, 2023:
    0                    (pfrom.IsOutboundOrBlockRelayConn() ||
    

    I would avoid using CNodeState when possible.


    ajtowns commented at 3:05 am on May 21, 2023:
    Should be !pfrom.IsInboundConn() presumably, to include manual connections at least.

    instagibbs commented at 1:11 pm on May 22, 2023:
    took @ajtowns suggestion

    sdaftuar commented at 3:39 pm on May 23, 2023:

    This doesn’t result in a behavior change, does it? As far as I can tell nodestate->m_is_inbound is initialized to the value of pfrom.IsInboundConn() when the CNodeState() is constructed – is that understanding correct?

    I’m good with avoiding the use of CNodeState where possible though, if that is the reason for this change.

    If these are different somehow, then perhaps we’d want to change the logic in IsBlockRequestedFromOutbound to match it.


    dergoegge commented at 3:48 pm on May 23, 2023:

    This doesn’t result in a behavior change, does it? As far as I can tell nodestate->m_is_inbound is initialized to the value of pfrom.IsInboundConn() when the CNodeState() is constructed – is that understanding correct?

    Yes that is correct, it does not change behavior (my original suggestion would have changed it slightly, as it would exclude manual conns but that wasn’t intentional).

    I’m good with avoiding the use of CNodeState where possible though, if that is the reason for this change.

    That was the only reason for this nit from my side.

  63. in src/net_processing.cpp:1132 in 42c2696ae5 outdated
    1133-        // Block was not requested
    1134-        return;
    1135+    for (auto range = mapBlocksInFlight.equal_range(hash); range.first != range.second; range.first++) {
    1136+        auto [nodeid, block_it] = range.first->second;
    1137+        CNodeState *nodestate = State(nodeid);
    1138+        assert(nodestate);
    


    dergoegge commented at 7:37 am on May 19, 2023:
    0        CNodeState& nodestate = *Assert(State(nodeid));
    

    instagibbs commented at 1:11 pm on May 22, 2023:
    done
  64. in src/net_processing.cpp:1159 in 42c2696ae5 outdated
    1174-    state->nBlocksInFlight--;
    1175-    if (state->nBlocksInFlight == 0) {
    1176-        // Last validated block on the queue was received.
    1177-        m_peers_downloading_from--;
    1178+        CNodeState *state = State(node_id);
    1179+        assert(state != nullptr);
    


    dergoegge commented at 7:38 am on May 19, 2023:
    0        CNodeState& state = *Assert(State(node_id));
    

    instagibbs commented at 1:11 pm on May 22, 2023:
    done

    Sjors commented at 12:06 pm on May 24, 2023:
    Lost in rebase?

    instagibbs commented at 2:37 pm on May 24, 2023:
    hmmm appears so

    Sjors commented at 4:22 pm on May 24, 2023:
    It ended up in 03423f8bd12b95a06a4a9d8377e781625dd38aae
  65. dergoegge approved
  66. dergoegge commented at 8:01 am on May 19, 2023: member

    Code review ACK 42c2696ae58e07de005edf1dda952761f8c9008e

    Haven’t done a lot of testing yet besides reviewing the tests and running a node with this patch, but the code looks good to me.

    (comments are nits, feel free to ignore)

  67. fanquake requested review from ajtowns on May 19, 2023
  68. instagibbs commented at 11:06 am on May 19, 2023: member
    holding off on touching for nits, can do in follow-up
  69. in test/functional/p2p_compactblocks.py:878 in 42c2696ae5 outdated
    873+
    874+        # Test the simple parallel download case...
    875+        for num_missing in [1, 5, 20]:
    876+
    877+            # Remaining low-bandwidth peer is stalling_peer, who announces first
    878+            [peer['bip152_hb_to'] for peer in node.getpeerinfo()] == [False, True, True, True]
    


    mzumsande commented at 3:36 am on May 21, 2023:
    missing assert here

    instagibbs commented at 1:11 pm on May 22, 2023:
    fixed
  70. in src/net_processing.cpp:4497 in 620b3f5483 outdated
    4495@@ -4456,18 +4496,23 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
    4496         {
    4497             LOCK(cs_main);
    4498 
    4499-            bool expected_blocktxn = false;
    4500             auto range_flight = mapBlocksInFlight.equal_range(resp.blockhash);
    4501+            size_t already_in_flight = std::distance(range_flight.first, range_flight.second);
    


    mzumsande commented at 8:23 pm on May 21, 2023:
    nit: already_in_flight could be just a bool here

    instagibbs commented at 1:12 pm on May 22, 2023:
    leaving as-is, reads more naturally to me
  71. mzumsande commented at 9:21 pm on May 21, 2023: contributor

    Concept ACK, code looks good on first walkthrough.

    I want to run this code with some additional logging for a day or two before acking.

    In particular, I wonder if the fact that we’ll now try additional peers will lead to redundancies / increased traffic under non-adversarial conditions where there’s no staller but delivering the block just needs some time - especially when blocks contain a large number of txns that weren’t in our mempool, or on slower networks like Tor. Did you consider the possibility of giving the original peer a few seconds to deliver before requesting from additional peers?

    Also, all that first->second.second in connection with the multimap is not very pretty or easy to read, but I don’t see a better way.

  72. ajtowns commented at 5:18 am on May 22, 2023: contributor

    In particular, I wonder if the fact that we’ll now try additional peers will lead to redundancies / increased traffic under non-adversarial conditions where there’s no staller but delivering the block just needs some time - especially when blocks contain a large number of txns that weren’t in our mempool, or on slower networks like Tor. Did you consider the possibility of giving the original peer a few seconds to deliver before requesting from additional peers?

    Yeah, this is the behaviour I’m seeing –

     0# peer=3 first reports it
     12023-05-22T03:13:48.490251Z Saw new header hash=000000000000000000028a6e01ad99705343eea1ff85e43aa580bbae52a3a8ab height=790827
     22023-05-22T03:13:48.490389Z [net] Saw new cmpctblock header hash=000000000000000000028a6e01ad99705343eea1ff85e43aa580bbae52a3a8ab peer=3
     32023-05-22T03:13:48.512908Z [cmpctblock] Initialized PartiallyDownloadedBlock for block 000000000000000000028a6e01ad99705343eea1ff85e43aa580bbae52a3a8ab using a cmpctblock of size 18890
     42023-05-22T03:13:48.513282Z [net] sending getblocktxn (35 bytes) peer=3
     5
     6# peer=11 0.2s later
     72023-05-22T03:13:48.740077Z [net] received: cmpctblock (18890 bytes) peer=11
     82023-05-22T03:13:48.761214Z [cmpctblock] Initialized PartiallyDownloadedBlock for block 000000000000000000028a6e01ad99705343eea1ff85e43aa580bbae52a3a8ab using a cmpctblock of size 18890
     92023-05-22T03:13:48.761562Z [net] sending getblocktxn (35 bytes) peer=11
    10
    11# peer=9 another 0.2s later
    122023-05-22T03:13:48.946673Z [net] received: cmpctblock (18890 bytes) peer=9
    132023-05-22T03:13:48.968735Z [cmpctblock] Initialized PartiallyDownloadedBlock for block 000000000000000000028a6e01ad99705343eea1ff85e43aa580bbae52a3a8ab using a cmpctblock of size 18890
    142023-05-22T03:13:48.969097Z [net] sending getblocktxn (35 bytes) peer=9
    15
    16# peer=11 responds first
    172023-05-22T03:13:49.051028Z [net] received: blocktxn (97486 bytes) peer=11
    182023-05-22T03:13:49.055872Z [cmpctblock]   blocktxn peer=11 tx=f401366ed3a40be59923388aa5e58d90c921ac162a92d14a573d24553ba0ce65
    192023-05-22T03:13:49.055891Z [cmpctblock]   blocktxn peer=11 tx=5ebf80ab6ac479bd3568cf9d1575c62b94aaf8bace4058970ca5e3e35a3b0ab1
    202023-05-22T03:13:49.081909Z [cmpctblock] Successfully reconstructed block 000000000000000000028a6e01ad99705343eea1ff85e43aa580bbae52a3a8ab with 1 txn prefilled, 3078 txn from mempool (incl at least 0 from extra pool) and 2 txn requested
    21
    22# then peer=3 about 0.4s later
    232023-05-22T03:13:49.457226Z [net] received: blocktxn (97486 bytes) peer=3
    242023-05-22T03:13:49.461365Z [net] Peer 3 sent us block transactions for block we weren't expecting
    25
    26# then peer=9 about 0.14s later
    272023-05-22T03:13:49.600309Z [net] received: blocktxn (97486 bytes) peer=9
    282023-05-22T03:13:49.604364Z [net] Peer 9 sent us block transactions for block we weren't expecting
    

    For block 790788 I received two 1.5MB blocktxn messages:

     02023-05-21T20:02:38.285186Z Saw new header hash=000000000000000000023a845351079f13015d10aa55003d30e7589e597caa2a height=790788
     12023-05-21T20:02:38.285264Z [net] Saw new cmpctblock header hash=000000000000000000023a845351079f13015d10aa55003d30e7589e597caa2a peer=3
     2
     3# peer=9 announces 0.3s later
     42023-05-21T20:02:38.659630Z [net] received: cmpctblock (4129 bytes) peer=9
     52023-05-21T20:02:38.679677Z [cmpctblock] Initialized PartiallyDownloadedBlock for block 000000000000000000023a845351079f13015d10aa55003d30e7589e597caa2a using a cmpctblock of size 4129
     6
     7# peer=3 took ~2s to respond
     82023-05-21T20:02:40.240719Z [net] received: blocktxn (1541605 bytes) peer=3
     92023-05-21T20:02:43.332966Z UpdateTip: new best=000000000000000000023a845351079f13015d10aa55003d30e7589e597caa2a height=790788 version=0x20014000 log2_work=94.193595 tx=841476965 date='2023-05-21T20:01:57Z' progress=1.000000 cache=26.9MiB(224222txo)
    10
    11# peer=9 took ~7s to respond
    122023-05-21T20:02:45.830208Z [net] received: blocktxn (1541605 bytes) peer=9
    132023-05-21T20:02:45.893063Z [net] Peer 9 sent us block transactions for block we weren't expecting
    

    (EDIT: that block had a bunch of large, low feerate consolidation txs that are below mempool minfee so aren’t eligible for cpfp. One I looked at seems to have been first broadcast around 4th May, perhaps)

    In that case, if peer=9 had announced first, and I’d waited 5s before requesting from peer=3, I’d still have doubled up the 1.5MB response, and not gotten either response until 7s after the announcement. But at least there would have been a maybe 50% chance of avoiding the second request entirely.

    Tripling the outgoing bandwidth load for peers selected for hb compact block relay seems concerning; not sure if I’d call it a blocker though.

  73. in src/net_processing.cpp:4398 in 42c2696ae5 outdated
    4397+                    IsBlockRequestedFromOutbound(blockhash) ||
    4398+                    already_in_flight < MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK - 1)) {
    4399+                    // ... or it's a hb relay peer and:
    4400+                    // - peer is outbound, or
    4401+                    // - we already have an outbound attempt in flight(so we'll take what we can get), or
    4402+                    // - it's not the final parallel download slot (which we may reserve for first outbound)
    


    ajtowns commented at 8:04 am on May 22, 2023:

    I don’t understand this logic – why is having an outbound attempt in flight so terrible that we’ll query all subsequent hb peers, even to the point of exceeding the parallel download limit? If outbound attempts are terrible, why do we also have an exception to immediately query hb outbound peers, also to the point of potentially exceeding the parallel download limit?

    I would have expected logic more like (the first option is sketched below, after the alternatives):

    • is this the first attempt?
    • otherwise, is this a hb peer, and already_in_flight < MAX-1 ?
    • otherwise, is this an outbound peer, and already_in_flight < MAX ? (ie, reserving the last slot for an outbound peer, possibly additional to one-or-more existing outbound peers)

    Or alternatively:

    • is this the first attempt?
    • is this a hb inbound peer, and already_in_flight < MAX ?
    • is this an outbound peer, are there no other outbound peers in flight? (ie request from all the hb inbound peers, and one outbound peer hb or not, avoiding putting an increased load on outbound peers)
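
    An editorial sketch of the first alternative, with hypothetical parameter names:

     #include <cstddef>

     bool AllowRequestAlternative(bool first_attempt, bool is_hb, bool is_outbound,
                                  std::size_t already_in_flight, std::size_t max_per_block)
     {
         if (first_attempt) return true;
         if (is_hb && already_in_flight < max_per_block - 1) return true;
         // Reserve the last slot for an outbound peer, possibly in addition
         // to one or more outbound peers already in flight.
         return is_outbound && already_in_flight < max_per_block;
     }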

    instagibbs commented at 11:01 am on May 22, 2023:

    https://github.com/bitcoin/bitcoin/pull/27626/files#diff-6875de769e90cec84d2e8a9c1b962cdbcda44d870d42e4215827e599e11e90e3R4346 <— this line will filter out any attempts beyond 3.

    So implicitly each of these conditions in this block hold with already_in_flight < MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK

    I could touch up my comments, add some Assumes to make things clearer, if indeed the code is doing what I think it is.


    ajtowns commented at 11:09 am on May 22, 2023:
    Well I was right in saying “I don’t understand this logic” – it’s just about reserving the “final” slot for the outbound peer.

    ajtowns commented at 12:08 pm on May 22, 2023:
    I guess I think it might be better to not do multiple parallel requests to outbound peers (on the one hand hopefully your outbounds are trustworthy and will respond quickly; on the other hand it’s undesirable if listening nodes end up having to send up to 3x as much data when non-listening nodes request blocktxns from all their hb peers), but otherwise this makes sense.

    instagibbs commented at 1:12 pm on May 22, 2023:
    added more clarifying comment and an Assume for now

    instagibbs commented at 1:20 pm on May 22, 2023:

    I guess I think it might be better to not do multiple parallel requests to outbound peers

    That makes some sense since using an outbound is mostly for the anti-sybil portion of the reasoning. Once we have an outbound hb peer, they’re likely highly reliable, and we probably don’t need to try so hard for other connections. Maybe even allow only up to two attempts if we already have an outbound in flight.

  74. instagibbs commented at 10:54 am on May 22, 2023: member

    @mzumsande This is a delicate trade-off as you’re aware.

    delivering the block just needs some time

    There’s a lot of mixing and matching we can likely do to improve the bandwidth usage, probably with only a small delay in block propagation. There are block timeouts, yes, but we also care about block prop speed for mining fairness (e.g., selfish mining attacks). So maybe there can be 2 buckets, one for each type of issue.

    e.g.,

    1. first or second slot must have outbound peer (best effort to keep blocks fast)
    2. third request is only dispatched after 5s delay (don’t let node get unduly timed out)

    Obviously a bit more complexity, but the linked issue can stay open and discussion continue with metrics?

    In other words, I don’t want a trivial sybil attacker to be able to slow down blocks at all listening nodes by 5 seconds each hop. That would be pretty bad for mining fairness.

  75. instagibbs force-pushed on May 22, 2023
  76. instagibbs commented at 1:28 pm on May 22, 2023: member

    responded to all nits

    let’s keep the tracking issue open for further discussion on bandwidth saving strategies post-merge

  77. fanquake added the label Needs backport (25.x) on May 22, 2023
  78. maflcko commented at 2:06 pm on May 22, 2023: member
    Red CI can be ignored. This is an upstream bug, see #27492 (comment)
  79. instagibbs commented at 3:51 pm on May 22, 2023: member

    Tripling the outgoing bandwidth load for peers selected for hb compact block relay seems concerning; not sure if I’d call it a blocker though.

    To be clear, this is only a tripling in the case where all your peers are roughly equally fast, and it’s a tripling of the most important traffic.

    The worst case scenario is going to be block after block of stuff we don’t have in our mempool, and them being slow to validate, allowing multiple compact blocks to arrive before receiving a response from any peer. This would result in roughly 3x bandwidth at tip for new blocks.

    To get a slightly more realistic scenario, glancing at my logs it’s ~119 extra transactions requested over the last 50 blocks due to the new logic. This will go up if there’s more transaction churn, and down if less, obviously. (edit: I see my non-listening node has significantly more, maybe 15 transactions a block, most from a single outlier?)

  80. ajtowns commented at 4:26 pm on May 22, 2023: contributor

    Tripling the outgoing bandwidth load for peers selected for hb compact block relay seems concerning; not sure if I’d call it a blocker though.

    To be clear, this is only a tripling in the case where all your peers are roughly equally fast, and it’s a tripling of the most important traffic.

    In an “everyone’s honest” environment, I think the most likely reason for there to be enough of a delay between cmpctblock / getblocktxns / blocktxns that there’ll be another cmpctblock received in the meantime (and thus another getblocktxns / blocktxns) is that the blocktxns message is large due to a lot of the block being made up of non-mempool txs (violating standardness, or fees paid out of band). That’s already pretty rare, and it’s already size limited by max block weight. The difference made by limiting the number of reqs to outbounds is exacerbated by the ratio of listening/non-listening – if there’s 80k non-listening nodes and 20k listening nodes, then the worst case has 240k requests being made to the hb listening nodes from non-listening nodes instead of just 80k (as a non-listening node doesn’t have any inbound peers in the first place), and that extra load might not just be affecting random non-listening nodes.

    I think this is more of a todo than a blocker though.

  81. in src/net_processing.cpp:1194 in ac054de305 outdated
    1190@@ -1176,8 +1191,8 @@ bool PeerManagerImpl::BlockRequested(NodeId nodeid, const CBlockIndex& block, st
    1191         }
    1192     }
    1193 
    1194-    // Make sure it's not listed somewhere already.
    1195-    RemoveBlockRequest(hash, std::nullopt);
    1196+    // Make sure it's not being fetched already from same peer.
    


    ajtowns commented at 8:00 am on May 23, 2023:
    You just iterated over all the peers requesting this block, and returned early if one of them was this node. What’s the benefit of looping again?

    instagibbs commented at 3:36 pm on May 23, 2023:
    Hmm right, this is now completely redundant and can be removed.
  82. ajtowns approved
  83. ajtowns commented at 9:42 am on May 23, 2023: contributor

    ACK 3b57b2b8d958871bec9a33953be051a75d8bf07c

    Despite thinking this is an improvement on what we currently do and that it’s ready for merge, there’s two things I’m not 100% comfortable with, and that I think would be good to improve in a followup. (1) I think this puts more demand on outbound peers than it should, as per previous comments. (2) I think the multimap stuff makes this hard to understand, and it would be good to have a custom class that abstracts the multimap behaviour. I think it could look something like this – ie, have a structure that collects the results you want from the multimap, and a function that returns that structure that you use in most places. (This also avoids querying the map for the same hash many times in the same function, which isn’t actually a problem, but was bugging me…)

  84. DrahtBot requested review from dergoegge on May 23, 2023
  85. DrahtBot requested review from sdaftuar on May 23, 2023
  86. jamesob commented at 3:16 pm on May 23, 2023: member

    Blackbox ACK

    Been running an earlier version of this branch (a9f914515df0) for 5+ days and measuring header-to-tip timings with a recent bmon addition (https://github.com/chaincodelabs/bmon/pull/27); data is here: https://gist.github.com/jamesob/b601f083d5d88fdd3432da65bf5f43ae.

    This branch seems to compare well with other versions I’m running. I’m going to redeploy with HEAD and will get back with some more specific analysis.

  87. in src/net_processing.cpp:1522 in cce96182ba outdated
    1518@@ -1511,7 +1519,15 @@ void PeerManagerImpl::FinalizeNode(const CNode& node)
    1519         nSyncStarted--;
    1520 
    1521     for (const QueuedBlock& entry : state->vBlocksInFlight) {
    1522-        mapBlocksInFlight.erase(entry.pindex->GetBlockHash());
    1523+        auto range = mapBlocksInFlight.equal_range(entry.pindex->GetBlockHash());
    


    theuni commented at 3:38 pm on May 23, 2023:
    Nit: it took me a minute to parse this. It’d be easier to read as a static function with a name imo.
  88. in src/net_processing.cpp:4475 in cce96182ba outdated
    4474                 LogPrint(BCLog::NET, "Peer %d sent us block transactions for block we weren't expecting\n", pfrom.GetId());
    4475                 return;
    4476             }
    4477 
    4478-            PartiallyDownloadedBlock& partialBlock = *it->second.second->partialBlock;
    4479+            PartiallyDownloadedBlock& partialBlock = *range_flight.first->second.second->partialBlock;
    


    theuni commented at 3:45 pm on May 23, 2023:

    Too much. This one is just too much :p

    (Call it a) Nit: Could you please break this up with a temp const ref or so for the sake of readability?

  89. in src/net_processing.cpp:1182 in ac054de305 outdated
    1178@@ -1166,6 +1179,8 @@ bool PeerManagerImpl::BlockRequested(NodeId nodeid, const CBlockIndex& block, st
    1179     CNodeState *state = State(nodeid);
    1180     assert(state != nullptr);
    1181 
    1182+    Assume(mapBlocksInFlight.count(hash) <= MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK);
    


    mzumsande commented at 3:53 pm on May 23, 2023:
    I think that this can currently be broken by FetchBlock() / getblockfrompeer, which would now allow multiple block requests in parallel, breaking the MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK limit. We could probably make sure in FetchBlock() that MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK isn’t breached (or maybe disallow all parallel requests).

    instagibbs commented at 4:35 pm on May 23, 2023:
    ah, good catch! I think the only sensible change here is to restrict parallel fetches, but that seems like a more serious API change? @Sjors ?

    instagibbs commented at 4:45 pm on May 23, 2023:
    restricted it to a single invocation per block at a time. Alternative seems more difficult to think about.

    Sjors commented at 4:55 pm on May 23, 2023:
    It’s fine by me, though currently we have no way of telling whether a request is in progress. Maybe bring this up in #27652 too.

    instagibbs commented at 5:09 pm on May 23, 2023:

    After talking offline with @Sjors, his use-case does involve asking multiple peers for the same block to detect stale tips.

    Instead, any call of getblockfrompeer will cause all of the same-block requests to be forgotten. It’s up to the user of the RPC not to let this interfere with the syncing of their own node.


    Sjors commented at 5:25 pm on May 23, 2023:
    That works for me.
  90. theuni commented at 4:01 pm on May 23, 2023: member

    Very shallow/mechanical code review ACK. Left a few nits for some head-scratchers.

    I was mostly looking for misuses of the multimap. Zero opinion about the logical change itself.

  91. instagibbs force-pushed on May 23, 2023
  92. instagibbs force-pushed on May 23, 2023
  93. Support up to 3 parallel compact block txn fetchings
    A single outbound slot is required, so if the first two slots
    are taken by inbound in-flights, the node will reject additional
    requests unless they come from outbound peers.
    
    This means in the case where a fast sybil peer is attempting to
    stall out a node, a single high bandwidth outbound peer can
    mitigate the attack.
    03423f8bd1
  94. Add tests for parallel compact block downloads d7f359b35e
  95. instagibbs force-pushed on May 23, 2023
  96. DrahtBot added the label CI failed on May 23, 2023
  97. instagibbs commented at 5:33 pm on May 23, 2023: member
    Ready for re-ACKs now that getblockfrompeer issue is fixed
  98. sdaftuar commented at 5:45 pm on May 23, 2023: member

    ACK d7f359b35e8b1e9acc4d397de262cd9ba9bbcb83

    To reiterate, I think we should continue to keep an eye on bandwidth usage (and load on outbound peers) after this patch and revisit whether any additional tuning of this behavior is needed prior to release. However it’ll be useful to have this in master so that we can monitor for the next few months.

  99. DrahtBot removed review request from sdaftuar on May 23, 2023
  100. DrahtBot requested review from ajtowns on May 23, 2023
  101. in src/net_processing.cpp:1183 in 03423f8bd1 outdated
    1178@@ -1166,6 +1179,8 @@ bool PeerManagerImpl::BlockRequested(NodeId nodeid, const CBlockIndex& block, st
    1179     CNodeState *state = State(nodeid);
    1180     assert(state != nullptr);
    1181 
    1182+    Assume(mapBlocksInFlight.count(hash) <= MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK);
    1183+
    


    mzumsande commented at 6:00 pm on May 23, 2023:
    Another nit - I think this should be < instead of <=, because it is the responsibility of the caller to call BlockRequested only if there is room for one more request.


    mzumsande commented at 6:30 pm on May 23, 2023:
    Hmm, is the case where we are at MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK but BlockRequested() is still called one that can actually happen? Because in that case the caller would need to have previously checked that one of the existing requests is for the same nodeId, otherwise the limit would be breached.

    instagibbs commented at 6:42 pm on May 23, 2023:
    make sure to check the other call locations, but they aren’t supposed to
  102. mzumsande commented at 6:08 pm on May 23, 2023: contributor

    Code Review ACK d7f359b35e8b1e9acc4d397de262cd9ba9bbcb83

    I’ve run this for >2 days now with additional logging and encountered no problems.

    One thing I frequently observe that hasn’t been mentioned above is that a low-bandwidth peer is first to give us a new header, so there is a header/getdata(CMPCT) roundtrip, and by the time that has finished we already requested the block additionally from a high-bandwidth peer that doesn’t need the roundtrip.

  103. fanquake merged this on May 24, 2023
  104. fanquake closed this on May 24, 2023

  105. Sjors commented at 10:56 am on May 24, 2023: member

    The first two commits are wonderful refactors. But cce96182ba2457335868c65dc16b081c3dee32ee breaks p2p_compactblocks.py --valgrind for me:

    2023-05-24T10:54:53.143000Z TestFramework (INFO): Testing reconstructing compact blocks from all peers...
    2023-05-24T10:54:53.304000Z TestFramework (ERROR): Assertion failed
    ...
      File "/home/sjors/.pyenv/versions/3.8.16/lib/python3.8/http/client.py", line 972, in send
        self.sock.sendall(data)
    BrokenPipeError: [Errno 32] Broken pipe
    

    The tip of the PR (with the modified tests) fails at:

    2023-05-24T11:01:14.517000Z TestFramework (INFO): Testing reconstructing compact blocks with a stalling peer...
    2023-05-24T11:01:14.678000Z TestFramework (ERROR): Assertion failed
    

    On Ubuntu 23.04 with g++ 12.2.0 and ./configure --without-bdb.

  106. maflcko commented at 11:20 am on May 24, 2023: member
    I can’t reproduce. Can you share the combined log or steps to reproduce starting on a fresh install of the operating system? Otherwise there is nothing we can do.
  107. in src/net_processing.cpp:1127 in cce96182ba outdated
    1122@@ -1122,34 +1123,40 @@ bool PeerManagerImpl::IsBlockRequested(const uint256& hash)
    1123 
    1124 void PeerManagerImpl::RemoveBlockRequest(const uint256& hash, std::optional<NodeId> from_peer)
    1125 {
    1126-    auto it = mapBlocksInFlight.find(hash);
    1127-    if (it == mapBlocksInFlight.end()) {
    1128-        // Block was not requested
    1129+    auto range = mapBlocksInFlight.equal_range(hash);
    1130+    if (range.first == range.second) {
    


    Sjors commented at 11:35 am on May 24, 2023:

    cce96182ba2457335868c65dc16b081c3dee32ee: maybe do if (!mapBlocksInFlight.count(hash)) { first and then construct the range after the Assume below?

    (for a followup; a sketch follows)
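
    The suggested followup, sketched standalone with the same multimap shape as the diff but simplified types (not the actual patch):

    #include <cstdint>
    #include <map>
    #include <string>

    using NodeId = int64_t;
    using BlockHash = std::string; // stand-in for uint256

    void RemoveBlockRequestSketch(std::multimap<BlockHash, NodeId>& blocks_in_flight,
                                  const BlockHash& hash)
    {
        if (!blocks_in_flight.count(hash)) {
            return; // block was not requested from any peer
        }
        // ...Assume()-style sanity checks would go here...
        auto range = blocks_in_flight.equal_range(hash);
        while (range.first != range.second) {
            range.first = blocks_in_flight.erase(range.first);
        }
    }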

  108. fanquake commented at 12:17 pm on May 24, 2023: member
    I also cannot reproduce.
  109. maflcko removed the label CI failed on May 24, 2023
  110. Sjors commented at 12:46 pm on May 24, 2023: member

    It’s not a virtual machine. Logs are combined by default, right?

    $ cat /tmp/bitcoin_func_test_zkozpx3h/test_framework.log | nc termbin.com 9999 
    

    https://termbin.com/c3yr (this is from cce96182ba2457335868c65dc16b081c3dee32ee)

    valgrind-3.19.0, not using depends (also reproduced with depends).

    The node seems to disconnect right after receiving msg_sendcmpct at delivery_peer.send_and_ping(msg_cmpctblock(cmpct_block.to_p2p())).

  111. Sjors commented at 1:09 pm on May 24, 2023: member

    Ah, this is more useful:

     ==2894643== Thread 25 b-msghand:
     ==2894643== Conditional jump or move depends on uninitialised value(s)
     ==2894643==    at 0x24A6EC: (anonymous namespace)::PeerManagerImpl::RemoveBlockRequest(uint256 const&, std::optional<long>) [clone .isra.0] (in /home/sjors/dev/bitcoin/src/bitcoind)
     ==2894643==    by 0x25AAA1: (anonymous namespace)::PeerManagerImpl::ProcessBlock(CNode&, std::shared_ptr<CBlock const> const&, bool, bool) (in /home/sjors/dev/bitcoin/src/bitcoind)
     ==2894643==    by 0x2718A4: (anonymous namespace)::PeerManagerImpl::ProcessMessage(CNode&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, CDataStream&, std::chrono::duration<long, std::ratio<1l, 1000000l> >, std::atomic<bool> const&) (in /home/sjors/dev/bitcoin/src/bitcoind)
     ==2894643==    by 0x274E68: (anonymous namespace)::PeerManagerImpl::ProcessMessages(CNode*, std::atomic<bool>&) (in /home/sjors/dev/bitcoin/src/bitcoind)
     ==2894643==    by 0x22A9E7: CConnman::ThreadMessageHandler() (in /home/sjors/dev/bitcoin/src/bitcoind)
     ==2894643==    by 0x620BF5: util::TraceThread(std::basic_string_view<char, std::char_traits<char> >, std::function<void ()>) (in /home/sjors/dev/bitcoin/src/bitcoind)
     ==2894643==    by 0x22095C: std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(std::basic_string_view<char, std::char_traits<char> >, std::function<void ()>), char const*, CConnman::Start(CScheduler&, CConnman::Options const&)::{lambda()#5}> > >::_M_run() (in /home/sjors/dev/bitcoin/src/bitcoind)
     ==2894643==    by 0x494F3E2: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.31)
     ==2894643==    by 0x4C6F189: start_thread (pthread_create.c:444)
     ==2894643==    by 0x4CFDAF3: clone (clone.S:100)
     ==2894643== 
     {
        <insert_a_suppression_name_here>
        Memcheck:Cond
        fun:_ZN12_GLOBAL__N_115PeerManagerImpl18RemoveBlockRequestERK7uint256St8optionalIlE.isra.0
        fun:_ZN12_GLOBAL__N_115PeerManagerImpl12ProcessBlockER5CNodeRKSt10shared_ptrIK6CBlockEbb
        fun:_ZN12_GLOBAL__N_115PeerManagerImpl14ProcessMessageER5CNodeRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEER11CDataStreamNSt6chrono8durationIlSt5ratioILl1ELl1000000EEEERKSt6atomicIbE
        fun:_ZN12_GLOBAL__N_115PeerManagerImpl15ProcessMessagesEP5CNodeRSt6atomicIbE
        fun:_ZN8CConnman20ThreadMessageHandlerEv
        fun:_ZN4util11TraceThreadESt17basic_string_viewIcSt11char_traitsIcEESt8functionIFvvEE
        fun:_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvSt17basic_string_viewIcSt11char_traitsIcEESt8functionIFvvEEEPKcZN8CConnman5StartER10CSchedulerRKNSE_7OptionsEEUlvE3_EEEEE6_M_runEv
        obj:/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.31
        fun:start_thread
        fun:clone
     }
     ==2894643== 
     ==2894643== Exit program on first error (--exit-on-first-error=yes)
    

    Having a hard time narrowing the config down, but it does seem that --enable-debug hides the issue (at least without depends).

  112. maflcko commented at 1:20 pm on May 24, 2023: member

    Maybe open a separate issue? (There are already too many comments in this thread).

    Make sure to include details, otherwise it will be harder to do something. Note that not all compiler versions are compatible with all valgrind versions.

  113. in src/net_processing.cpp:1138 in cce96182ba outdated
    1143+    while (range.first != range.second) {
    1144+        auto [node_id, list_it] = range.first->second;
    1145 
    1146-    CNodeState *state = State(node_id);
    1147-    assert(state != nullptr);
    1148+        if (from_peer && *from_peer != node_id) {
    


    Sjors commented at 2:35 pm on May 24, 2023:

    This is the line valgrind trips over. Adding something like fprintf(stderr, "from_peer=%ld\n", *from_peer); above it makes it stop complaining. Maybe this is just a valgrind bug?

    Tracking in #27741

  114. sidhujag referenced this in commit 97c5058859 on May 24, 2023
  115. in src/net_processing.cpp:1159 in cce96182ba outdated
    1172+            m_peers_downloading_from--;
    1173+        }
    1174+        state->m_stalling_since = 0us;
    1175+
    1176+        range.first = mapBlocksInFlight.erase(range.first);
    1177     }
    


    Sjors commented at 3:52 pm on May 24, 2023:
    cce96182ba2457335868c65dc16b081c3dee32ee: the goal of the above loop seems to be to find one entry: the vBlocksInFlight entry for this particular block and from_peer. A followup could move that into its own function, and/or have the subsequent operations on the entry live outside the loop.
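
    The helper suggested here might look something like the following standalone sketch (names and types are hypothetical): return the single matching entry for this (block, peer) pair so the per-entry bookkeeping can move out of the scan loop.

    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>

    using NodeId = int64_t;
    using BlockHash = std::string; // stand-in for uint256
    using InFlightMap = std::multimap<BlockHash, NodeId>;

    // Find the in-flight entry for this block from this specific peer,
    // if any; there is at most one per (hash, peer) pair.
    std::optional<InFlightMap::iterator> FindBlockRequest(InFlightMap& in_flight,
                                                          const BlockHash& hash,
                                                          NodeId from_peer)
    {
        auto [begin, end] = in_flight.equal_range(hash);
        for (auto it = begin; it != end; ++it) {
            if (it->second == from_peer) return it;
        }
        return std::nullopt;
    }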
  116. in src/net_processing.cpp:1795 in 03423f8bd1 outdated
    1795-    // If a block was already in-flight for a different peer, its BLOCKTXN
    1796-    // response will be dropped.
    1797+    // Forget about all prior requests
    1798+    RemoveBlockRequest(block_index.GetBlockHash(), std::nullopt);
    1799+
    1800+    // Mark block as in-flight
    


    Sjors commented at 4:28 pm on May 24, 2023:

    03423f8bd12b95a06a4a9d8377e781625dd38aae: some of the above deleted comment is still relevant.

      // If the peer does not send us a block, vBlocksInFlight remains non-empty,
      // causing us to timeout and disconnect.
    

    instagibbs commented at 5:36 pm on May 24, 2023:
    Not anymore, we completely forget we expect it from a peer now. No disconnect should occur.

    Sjors commented at 5:45 pm on May 24, 2023:
    That was already the case. The comment is there to point out that the current request will result in a disconnect (unless a new request is made on time).
  117. in src/net_processing.cpp:1792 in 03423f8bd1 outdated
    1788@@ -1774,11 +1789,10 @@ std::optional<std::string> PeerManagerImpl::FetchBlock(NodeId peer_id, const CBl
    1789 
    1790     LOCK(cs_main);
    1791 
    1792-    // Mark block as in-flight unless it already is (for this peer).
    1793-    // If the peer does not send us a block, vBlocksInFlight remains non-empty,
    1794-    // causing us to timeout and disconnect.
    1795-    // If a block was already in-flight for a different peer, its BLOCKTXN
    1796-    // response will be dropped.
    1797+    // Forget about all prior requests
    


    Sjors commented at 4:29 pm on May 24, 2023:

    03423f8bd12b95a06a4a9d8377e781625dd38aae This deleted comment line is still relevant when deleting the request:

    // Forget about all prior requests. If a block was already in-flight
    // for a different peer, its BLOCKTXN response will be dropped.
    
  118. in src/net_processing.cpp:4371 in 03423f8bd1 outdated
    4371+                        vInv[0] = CInv(MSG_BLOCK | GetFetchFlags(*peer), blockhash);
    4372+                        m_connman.PushMessage(&pfrom, msgMaker.Make(NetMsgType::GETDATA, vInv));
    4373+                        return;
    4374+                    } else {
    4375+                        // Give up for this peer and wait for other peer(s)
    4376+                        RemoveBlockRequest(pindex->GetBlockHash(), pfrom.GetId());
    


    Sjors commented at 4:40 pm on May 24, 2023:
    03423f8bd12b95a06a4a9d8377e781625dd38aae: why not return?

    instagibbs commented at 5:42 pm on May 24, 2023:
    oops, indeed, looks like we’ll send off a bogus getblocktxn here if it comes from a preferred peer that’s not first in flight. Needs a fix: wasted bandwidth, and it should result in “Peer %d sent us block transactions for block we weren’t expecting\n”

    Sjors commented at 6:09 pm on May 24, 2023:
  119. in src/net_processing.cpp:4387 in 03423f8bd1 outdated
    4382@@ -4361,9 +4383,24 @@ void PeerManagerImpl::ProcessMessage(CNode& pfrom, const std::string& msg_type,
    4383                     txn.blockhash = blockhash;
    4384                     blockTxnMsg << txn;
    4385                     fProcessBLOCKTXN = true;
    4386-                } else {
    4387+                } else if (first_in_flight) {
    4388+                    // We will try to round-trip any compact blocks we get on failure,
    


    Sjors commented at 4:43 pm on May 24, 2023:
    03423f8bd12b95a06a4a9d8377e781625dd38aae “on failure” -> “when missing transactions” (the word failure is ambiguous, this is different from READ_STATUS_FAILED)
  120. in src/net_processing.cpp:4394 in 03423f8bd1 outdated
    4390+                    req.blockhash = pindex->GetBlockHash();
    4391+                    m_connman.PushMessage(&pfrom, msgMaker.Make(NetMsgType::GETBLOCKTXN, req));
    4392+                } else if (pfrom.m_bip152_highbandwidth_to &&
    4393+                    (!pfrom.IsInboundConn() ||
    4394+                    IsBlockRequestedFromOutbound(blockhash) ||
    4395+                    already_in_flight < MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK - 1)) {
    


    Sjors commented at 4:45 pm on May 24, 2023:
    03423f8bd12b95a06a4a9d8377e781625dd38aae could add a static assert MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK > 1 in a followup.

    instagibbs commented at 5:40 pm on May 24, 2023:
    if it’s 1, that’s fine, it just won’t allow additional slots

    Sjors commented at 1:18 pm on May 27, 2023:
    > 0 is fine too, but I guess nobody will set this to zero accidentally.
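
    For the record, the guard might look like this (illustrative; the PR itself did not add it):

    #include <cstddef>

    static constexpr std::size_t MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK{3};
    static_assert(MAX_CMPCTBLOCKS_INFLIGHT_PER_BLOCK > 0,
                  "a value of 0 would disable compact block txn requests entirely");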
  121. in src/net_processing.cpp:4402 in 03423f8bd1 outdated
    4398+                    // - we already have an outbound attempt in flight(so we'll take what we can get), or
    4399+                    // - it's not the final parallel download slot (which we may reserve for first outbound)
    4400                     req.blockhash = pindex->GetBlockHash();
    4401                     m_connman.PushMessage(&pfrom, msgMaker.Make(NetMsgType::GETBLOCKTXN, req));
    4402+                } else {
    4403+                    // Give up for this peer and wait for other peer(s)
    


    Sjors commented at 4:52 pm on May 24, 2023:

    03423f8bd12b95a06a4a9d8377e781625dd38aae: “Give up” implies that this peer could have done something to prevent this, but that’s often not the case (Nodes sending cmpctblock messages SHOULD limit prefilledtxn to 10KB of transactions. - BIP152 footnote 5).

    // Do not followup with `GETBLOCKTXN` to this peer,
    // wait for an outbound peer to announce the compact block instead.
    
  122. in test/functional/p2p_compactblocks.py:850 in d7f359b35e
    845+            with p2p_lock:
    846+                assert "getblocktxn" in peer.last_message
    847+            return block, cmpct_block
    848+
    849+        for name, peer in [("delivery", delivery_peer), ("inbound", inbound_peer), ("outbound", outbound_peer)]:
    850+            self.log.info(f"Setting {name} as high bandwidth peer")
    


    Sjors commented at 4:59 pm on May 24, 2023:
    d7f359b35e8b1e9acc4d397de262cd9ba9bbcb83: maybe explain what’s going on here. We send an unsolicited compact block from these peers to our node. Because the block adds to our tip the message doesn’t get ignored. Once the block is successfully processed we mark the peer as high bandwidth. This is done on a first-come-first-serve basis. This results in the [False, True, True, True] pattern below.
  123. Sjors commented at 5:30 pm on May 24, 2023: member

    Post merge utACK d7f359b35e8b1e9acc4d397de262cd9ba9bbcb83 modulo question about missing return. Other comments can wait for a followup.

    Conceptually this PR still relies on our node immediately reacting to an incoming compact block message. This forces us to be a bit impatient and potentially use more bandwidth than needed.

    In a followup we could instead track which peers have announced a block and then request block transactions from other peers if the first peer takes too long. This could be a function of the number of transactions we asked for, the typical pace we’re used to, if the announcing peer was inbound, etc. That’s all much more complex though. See point 4 in #27740.

    As @ajtowns mentioned, refactoring the multimap stuff in a followup would really help future developers - including future versions of myself.

  124. fanquake referenced this in commit e43432086a on May 25, 2023
  125. fanquake referenced this in commit 907a7c9681 on May 25, 2023
  126. fanquake referenced this in commit f7b63cd202 on May 25, 2023
  127. fanquake referenced this in commit f36668a976 on May 25, 2023
  128. fanquake referenced this in commit 3be347d804 on May 25, 2023
  129. fanquake referenced this in commit 98103f921c on May 25, 2023
  130. fanquake referenced this in commit 99bf252f9e on May 25, 2023
  131. fanquake removed the label Needs backport (25.x) on May 25, 2023
  132. fanquake commented at 1:30 pm on May 25, 2023: member
    Up for backport in #27752.
  133. sidhujag referenced this in commit 4f72852f81 on May 26, 2023
  134. Sjors referenced this in commit f5bbc60963 on May 27, 2023
  135. Sjors referenced this in commit 0a4da49543 on May 27, 2023
  136. Sjors referenced this in commit 5d1a6b2b86 on May 27, 2023
  137. Sjors referenced this in commit 944a06875b on May 27, 2023
  138. Sjors referenced this in commit 0c75a94ec5 on May 27, 2023
  139. Sjors referenced this in commit 55e9aa6800 on May 27, 2023
  140. fanquake referenced this in commit da527094cf on Jun 2, 2023
  141. fanquake referenced this in commit ff01fc0052 on Jun 2, 2023
  142. fanquake referenced this in commit f1e276cae5 on Jun 2, 2023
  143. fanquake referenced this in commit debe9d292f on Jun 2, 2023
  144. fanquake referenced this in commit c17746d87f on Jun 2, 2023
  145. fanquake referenced this in commit ce969f4684 on Jun 2, 2023
  146. pinheadmz commented at 3:00 pm on June 2, 2023: member
    @instagibbs does this close #21803 ?
  147. fanquake referenced this in commit 722361e129 on Jun 16, 2023
  148. fanquake referenced this in commit a45159b8e2 on Jun 16, 2023
  149. fanquake referenced this in commit 38e3af9fad on Jun 16, 2023
  150. fanquake referenced this in commit d1a93f5d41 on Jun 16, 2023
  151. fanquake referenced this in commit e66a5cbb56 on Jun 16, 2023
  152. fanquake referenced this in commit cdd3de08e3 on Jun 16, 2023
  153. fanquake referenced this in commit 8825983716 on Jul 4, 2023
  154. bitcoin locked this on Jun 1, 2024
