Expire inflight GETDATA requests #15776

pull ajtowns wants to merge 7 commits into bitcoin:master from ajtowns:201904-inflighttxtimeout changing 6 files +270 −4
  1. ajtowns commented at 12:22 PM on April 9, 2019: member

    I think #14897 introduces a livelock between peers who don't reply to GETDATA requests with TX data.

    Scenario is likely: Alice sends Bob an INV for a new tx, Bob sends GETDATA for the tx, Alice responds with TX, Bob doesn't have the parent so sends GETDATA for the parent, adding it to the tx_in_flight set, Alice replies NOTFOUND due to the parent having expired from the relay set after being in the mempool for over 15 minutes, so Bob doesn't remove the entry from the tx_in_flight set. Once this happens 100 times (MAX_PEER_TX_IN_FLIGHT), Bob will not request any more tx info via GETDATA from Alice. If this happens to all Bob's peers, Bob will not receive tx data except when blocks are found. If Bob is not reachable by inbound nodes, this can plausibly happen for all Bob's outbound nodes.
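    The accounting failure above can be sketched with a toy model (Python for brevity; `MAX_PEER_TX_IN_FLIGHT` matches the constant in net_processing.cpp, but the class and handlers are illustrative, not Bitcoin Core code):

    ```python
    # Toy model of the per-peer in-flight GETDATA budget and the NOTFOUND bug.
    MAX_PEER_TX_IN_FLIGHT = 100  # mirrors the net_processing.cpp constant

    class PeerDownloadState:
        def __init__(self):
            self.tx_in_flight = set()

        def can_request(self):
            return len(self.tx_in_flight) < MAX_PEER_TX_IN_FLIGHT

        def on_getdata_sent(self, txid):
            self.tx_in_flight.add(txid)

        def on_tx_received(self, txid):
            self.tx_in_flight.discard(txid)

        def on_notfound_buggy(self, txid):
            pass  # bug: the NOTFOUND reply never frees the in-flight slot

        def on_notfound_fixed(self, txid):
            self.tx_in_flight.discard(txid)

    # With the buggy handler, 100 NOTFOUNDs permanently exhaust the budget:
    peer = PeerDownloadState()
    for i in range(MAX_PEER_TX_IN_FLIGHT):
        peer.on_getdata_sent(i)
        peer.on_notfound_buggy(i)
    assert not peer.can_request()  # livelocked: no further GETDATA to this peer
    ```

    With the fixed handler each NOTFOUND frees its slot, so replies that miss the relay set can't permanently exhaust the 100-request budget.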

    I think this happened for my node this week (running master-ish), causing my mempool to be empty, despite having 8 good peers who continued sending INV notifications.

    Patches are:

    • debugging info for getpeerinfo to be able to tell if either of the queues are growing
    • logging of 'notfound' messages
    • test cases for 'notfound' response to getdata, and no response at all to getdata
    • fix to expire inflight tx's that are reported as 'notfound'
    • fix to disconnect nodes that have 100 inflight tx getdata requests as well as additional tx's we'd like to request that have been waiting for 30mins or more
  2. fanquake added the label P2P on Apr 9, 2019
  3. sdaftuar commented at 12:36 PM on April 9, 2019: member

    That seems like a potentially very bad bug. However at first glance I don't see how fPauseSend could be implicated as a trigger; I don't believe we ever drop requests on the floor if the send buffer fills up, we just wait until the buffer drains and then continue as before.

    Actually, I realize now that one way this could maybe trigger is if our peer sends a NOTFOUND in response to a getdata, I don't think we would reduce the in flight count (I think this is fixed in #15505 but this seems like a bug in 0.18). So over time, as NOTFOUND's pile up, the in-flight queue would get full and we'd stop requesting transactions. So we should definitely take some kind of fix for this, yikes.

  4. fanquake added this to the milestone 0.18.0 on Apr 9, 2019
  5. DrahtBot commented at 12:45 PM on April 9, 2019: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #15505 ([p2p] Request NOTFOUND transactions immediately from other outbound peers, when possible by sdaftuar)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  6. gmaxwell commented at 6:16 PM on April 9, 2019: contributor

    We should perhaps disconnect a peer that is consistently getdata DOSing us... If all your peers do that (for whatever reason) you're not going to gain txn, regardless of what happens with this. So this fix only helps the case where a peer is randomly responding to some getdatas.

    The fact that we'd stop asking for data from peers if they were not responding wasn't lost on me in the original changes, but this seemed like a good thing (I didn't catch the notfound interaction, either through just missing it or due to having 15505 on the brain).

  7. sdaftuar commented at 6:19 PM on April 9, 2019: member

    I think we should do two things: (a) Fix the bug in NOTFOUND processing -- if a peer sends a NOTFOUND for a transaction, we should remove it from the in-flight map for that peer. I think we should definitely do this and backport to 0.18.

    (b) As a safeguard, we should also do something if a peer's in-flight count hits the max so that we're not downloading new transactions from them. I think there are two responses that make sense to me: either we eventually disconnect them, or we expire the in-flight entry and make room for a new request. For 0.18, I think expiry makes more sense, as it's more robust to other bugs that may be lurking in this new code. Eventually, I think disconnection is where I expect things to go in the long-run (which is what we do for block getdata's that are dropped by the peer).

    Also, I think I spotted another bug in this code related to emptying out the tx_in_flight set -- if a different peer serves us a transaction that we requested from another, then the entry doesn't get removed (upon receiving a TX message, we only erase from the delivering peer's in_flight set). So a "DoS"er that helpfully blasted us with transactions would be able to blind us to transactions from our other peers. The expiry solution would mitigate this, which is a point in its favor for merging into 0.18, but we should fix this too I think if we can do so easily.
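    The "expiry" alternative discussed in (b) can be sketched as follows. This is a minimal illustration, assuming the in-flight map records the request time per txid; the names (`GETDATA_TX_TIMEOUT`, the dict layout) are assumptions for this sketch, not the eventual code in #15834:

    ```python
    # Sketch: instead of disconnecting a peer whose in-flight map is full,
    # expire entries whose GETDATA was sent longer ago than a timeout,
    # freeing slots for new requests.
    GETDATA_TX_TIMEOUT = 2 * 60  # seconds; illustrative value

    def expire_old_requests(tx_in_flight, now):
        """tx_in_flight maps txid -> time the GETDATA was sent (seconds)."""
        expired = [txid for txid, sent_at in tx_in_flight.items()
                   if now - sent_at > GETDATA_TX_TIMEOUT]
        for txid in expired:
            del tx_in_flight[txid]
        return expired

    inflight = {"aa": 0, "bb": 50, "cc": 200}
    assert expire_old_requests(inflight, 250) == ["aa", "bb"]
    assert list(inflight) == ["cc"]
    ```

    Expiry is more forgiving than disconnection: a peer that silently drops a GETDATA only blinds us for the timeout window rather than costing us the connection.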

  8. ajtowns force-pushed on Apr 10, 2019
  9. ajtowns commented at 10:17 AM on April 10, 2019: member

    at first glance I don't see how fPauseSend could be implicated as a trigger

    Yep, I think you're right.

    Actually, I realize now that one way this could maybe trigger is if our peer sends a NOTFOUND in response to a getdata, I don't think we would reduce the in flight count

    Right; I think a plausible scenario is "INV tx1, GETDATA tx1, TX tx1, oh I'm missing a parent for tx1, GETDATA tx1p, nope that's over 15m old and you're not doing MEMPOOL bloom filters, so NOTFOUND" -- and I can duplicate that behaviour with real (regtest) nodes, too. Seems like it should mostly only happen in real life if you've just reconnected to the network after some time offline, and there's a reasonable backlog in the mempool (as otherwise parents that are older than 15m are probably already in a block)

    Updated the PR to remove tx's from notfound messages from the inflight set and to disconnect (and log) any outbound peers whose inflight queue gets full despite that. Has test cases for both scenarios.

  10. ajtowns commented at 10:23 AM on April 10, 2019: member

    Also, I think I spotted another bug in this code related to emptying out the tx_in_flight set -- if a different peer serves us a transaction that we requested from another, then the entry doesn't get removed (upon receiving a TX message, we only erase from the delivering peer's in_flight set).

    Maybe I'm missing something, but I think this should be okay -- if it's added to in_flight then we've sent a GETDATA to that peer, and whether or not any other peer sends a TX message, that peer should send a TX or NOTFOUND message, at which point the entry's unconditionally removed from in_flight as one of the very first steps...

  11. ajtowns force-pushed on Apr 10, 2019
  12. in src/rpc/net.cpp:184 in 12df73f5e3 outdated
     177 | @@ -178,6 +178,8 @@ static UniValue getpeerinfo(const JSONRPCRequest& request)
     178 |              obj.pushKV("banscore", statestats.nMisbehavior);
     179 |              obj.pushKV("synced_headers", statestats.nSyncHeight);
     180 |              obj.pushKV("synced_blocks", statestats.nCommonHeight);
     181 | +            obj.pushKV("tx_inflight", statestats.nTxInFlight);
     182 | +            obj.pushKV("tx_process", statestats.nTxProcess);
    


    morcos commented at 4:52 PM on April 11, 2019:

    do these need to be added to the help text?


    ajtowns commented at 5:39 AM on April 12, 2019:

    Are they even worth keeping? I mostly added them for debugging, and I am using them in one of the test cases, but they're pretty obscure info.


    ajtowns commented at 7:14 AM on April 12, 2019:

    added help text anyway

  13. gmaxwell commented at 7:05 PM on April 11, 2019: contributor

    I think it's okay-- even preferable-- to not remove the tx when we get the tx from another peer. We should either get the tx or a not found or we should eventually stop requesting tx from that peer (e.g. because we disconnected them...)

  14. in src/net_processing.cpp:3129 in 12df73f5e3 outdated
    3126 | +        //if (vInv.size() > MAX_INV_SZ)
    3127 | +        //{
    3128 | +        //    LOCK(cs_main);
    3129 | +        //    Misbehaving(pfrom->GetId(), 20, strprintf("message notfound size() = %u", vInv.size()));
    3130 | +        //    return false;
    3131 | +        //}
    


    MarcoFalke commented at 7:07 PM on April 11, 2019:

    In commit 392cf539756e2af6df5684c3444df626b446754f

    What is this adding? We generally don't add commented out code without rationale


    ajtowns commented at 5:37 AM on April 12, 2019:

    Replaced with actual comment

  15. ajtowns force-pushed on Apr 12, 2019
  16. jonasschnelli commented at 10:58 AM on April 15, 2019: contributor

    Not sure if it is related, but today I found – for the second time – that my test 0.18.0RC3 node has a "stuck" mempool (size of 6 or so).

  17. fanquake added this to the "Blockers" column in a project

  18. jonasschnelli commented at 11:00 AM on April 15, 2019: contributor

    Just looked at the debug.log and my node gets a lot of invs (but no getdata response) and mempool won't increase.

  19. jonasschnelli commented at 11:02 AM on April 15, 2019: contributor

    After an addnode, I could see getdata's and a growing mempool

  20. in src/net_processing.cpp:3132 in 599e5db907 outdated
    3126 | +        // note: we don't enforce a limit on number of inv's in a notfound
    3127 | +
    3128 | +        LogPrint(BCLog::NET, "received notfound (%u invsz) peer=%d\n", vInv.size(), pfrom->GetId());
    3129 | +
    3130 | +        if (vInv.size() > 0) {
    3131 | +            LogPrint(BCLog::NET, "received notfound for: %s peer=%d\n", vInv[0].ToString(), pfrom->GetId());
    


    naumenkogs commented at 4:47 PM on April 15, 2019:

    Could you explain what's going on here? Why are we logging only the first notfound INV?


    ajtowns commented at 4:06 AM on April 16, 2019:

    It's doing the same logging as getdata which also only logs the first entry. Didn't seem terribly valuable to log every notfound entry seen.

  21. ajtowns commented at 5:00 AM on April 16, 2019: member

    Scenario is likely: Alice sends Bob and INV for a new tx, Bob sends GETDATA for the tx, Alice responds with TX, Bob doesn't have the parent so sends GETDATA for the parent, adding it to the tx_in_flight set, Alice replies NOTFOUND due to the parent having expired from the relay set due to being in the mempool for over 15 minutes, so Bob doesn't remove the entry from the tx_in_flight set.

    I added logging for when my node sends a notfound to try and see what the reasons were. It looks like most of the time these are for tx's that aren't even in my mempool at all anymore, though that varies between peers:

    • peer=0 -- 2821 not in mempool, 0 in mempool

    • peer=3 -- 1495 not in mempool, 0 in mempool

    • peer=1 -- 1092 not in mempool, 0 in mempool

    • peer=5 -- 401 not in mempool, 117 in mempool

    • peer=4 -- 308 not in mempool, 0 in mempool

    • peer=2 -- 146 not in mempool, 0 in mempool

    • peer=102 -- 73 in mempool, 14 not in mempool

    • peer=204 -- 111 in mempool, 3 not in mempool

    For the ones that are in the mempool, the times are all just >15m as expected (ranging from 902s to 17432s -- almost 5 hours), and the parent explanation seems fine for them.

    Not sure what's causing other nodes to request tx's that aren't in my mempool anymore though. For the couple I've looked at, they seem to have been confirmed in prior blocks, eg:

    2019-04-16T03:36:38Z got inv: tx 4c5f3fdd242b0f352716fb3bccca810585e726c2f1b435e7fdbcfb8ffc1faa4c  have peer=90
    2019-04-16T03:46:25Z UpdateTip: new best=000000000000000000166ca879430bcbd524b0e3a4b333cf9646a4a4cc03c54f height=571829
    2019-04-16T04:06:43Z UpdateTip: new best=...a417 height=571830 
    2019-04-16T04:12:32Z received getdata for: witness-tx 4c5f3fdd242b0f352716fb3bccca810585e726c2f1b435e7fdbcfb8ffc1faa4c peer=1
    2019-04-16T04:12:32Z will send notfound for: 4c5f3fdd242b0f352716fb3bccca810585e726c2f1b435e7fdbcfb8ffc1faa4c -1s in mempool peer=1

    or

    2019-04-16T00:45:15Z got inv: tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39  new peer=1
    2019-04-16T00:45:15Z Requesting witness-tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39 peer=1
    2019-04-16T00:45:17Z got inv: tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39  have peer=6
    2019-04-16T00:45:21Z got inv: tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39  have peer=5
    2019-04-16T00:51:06Z Requesting block 00000000000000000022a55d495ebbec42ebcf1de23a7b6db3c4fc0361efd044 from  peer=49
    2019-04-16T00:51:06Z UpdateTip: new best=00000000000000000022a55d495ebbec42ebcf1de23a7b6db3c4fc0361efd044 height=571814
    2019-04-16T00:51:06Z received: cmpctblock (20107 bytes) peer=2
    2019-04-16T02:23:15Z received getdata for: witness-tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39 peer=2
    2019-04-16T02:23:15Z will send notfound for: c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39 -1s in mempool peer=2
  22. MarcoFalke commented at 12:11 PM on April 16, 2019: member

    Could you update the comment that says we have to wait around forever? Currently only SPV clients actually care?

    Also, I was wondering if this could be limited to just the bugfix without the RPC changes (and maybe without the tests)? You could still provide the branch with tests to the reviewers to run, but I think review should focus on the bugfix and shouldn't be distracted by RPC changes or tests. No strong opinion though.

  23. MarcoFalke commented at 12:23 PM on April 16, 2019: member
    2019-04-16T02:23:15Z received getdata for: witness-tx c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39 peer=2
    2019-04-16T02:23:15Z will send notfound for: c062e3b8e43817920b138e1675e6e33fd92c036e9e5de73c5cb2f062423f4f39 -1s in mempool peer=2
    

    I presume you are using self-made logging, since I couldn't find that logprint in the code base?

    Also, I can't follow how the mempool should influence getdata at all. We are using mapRelay to store the transaction pointers and it is populated at the same time the inv gets sent. (Of course a timeout of an entry in mapRelay might cause the issue mentioned here, but that shouldn't occur so frequently)

  24. in src/net_processing.cpp:3376 in 599e5db907 outdated
    3364 | @@ -3341,6 +3365,14 @@ void PeerLogicValidation::ConsiderEviction(CNode *pto, int64_t time_in_seconds)
    3365 |              }
    3366 |          }
    3367 |      }
    3368 | +
    3369 | +    const auto& tx_download = state.m_tx_download;
    3370 | +    if (tx_download.m_tx_in_flight.size() >= MAX_PEER_TX_IN_FLIGHT) {
    3371 | +        if (!tx_download.m_tx_process_time.empty() && (tx_download.m_tx_process_time.begin()->first + GETDATA_TX_TIMEOUT) <= time_in_seconds * 1000000) {
    3372 | +            LogPrintf("Disconnecting outbound peer %d due to tx request backlog (%d inflight, %d queued)\n", pto->GetId(), tx_download.m_tx_in_flight.size(), tx_download.m_tx_process_time.size());
    


    sdaftuar commented at 2:06 PM on April 16, 2019:

    This will disconnect both inbound and outbound peers, I think?


    ajtowns commented at 5:14 AM on April 17, 2019:

    Yeah, fixed
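    The fix discussed in this thread — gating the eviction on the peer being outbound — can be modeled like this. The function shape mimics the quoted C++ check, but the names and the simplified seconds-based timeout are illustrative only:

    ```python
    # Toy model: only outbound peers are considered for disconnection when
    # their tx download queue is backlogged.
    MAX_PEER_TX_IN_FLIGHT = 100
    GETDATA_TX_TIMEOUT = 30 * 60  # seconds (the C++ code compares microseconds)

    def should_disconnect(is_inbound, inflight_count, oldest_queued_time, now):
        if is_inbound:
            return False  # fix: never evict inbound peers on tx backlog
        return (inflight_count >= MAX_PEER_TX_IN_FLIGHT
                and oldest_queued_time is not None
                and now - oldest_queued_time >= GETDATA_TX_TIMEOUT)

    # Outbound peer, full in-flight map, oldest queued request past timeout:
    assert should_disconnect(False, 100, 0, 30 * 60)
    # Same backlog on an inbound peer is tolerated:
    assert not should_disconnect(True, 100, 0, 30 * 60)
    ```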

  25. sdaftuar changes_requested
  26. sdaftuar commented at 2:24 PM on April 16, 2019: member

    For robustness' sake, I don't think we should disconnect peers yet for ignoring transaction getdata requests. The logic for transaction download is new in 0.18, and if there are other bugs yet to be shaken out then we risk disconnecting honest peers, which would also be a problem.

    Can we go back to the expiry approach?

  27. ajtowns commented at 2:46 AM on April 17, 2019: member

    I presume you are using self-made logging, since I couldn't find that logprint in the code base?

    Yeah, it's just

    +                auto txinfo = mempool.info(inv.hash);
    +                LogPrint(BCLog::NET, "will send notfound for: %s %ds in mempool peer=%d\n", inv.hash.ToString(), (txinfo.tx ? GetTime() - txinfo.nTime : -1), pfrom->GetId());

    before vNotFound.push_back(inv); in ProcessGetData().

    Also, I can't follow how the mempool should influence getdata at all. We are using mapRelay to store the transaction pointers and it is populated at the same time the inv gets sent. (Of course a timeout of an entry in mapRelay might cause the issue mentioned here, but that shouldn't occur so frequently)

    Yeah -- if it hasn't expired from mapRelay it should be fine, I'm trying to understand why a getdata is being sent after expiry from mapRelay. Apparently that's almost always when it's also been removed from the mempool.

    Ah! I think what's happening is I'm getting a tx A, which spends B and C, and AcceptToMemoryPool is noticing I don't have B and setting fMissingInputs to true, at which point I'm doing RequestTx for both B and C: AlreadyHave() returns false for both, because they're not in the mempool, and since only the first two outputs are checked against the UTXO cache it doesn't notice that I have C via the UTXO set. (EDIT to add bolded text)

    So the cases where I'm sending getdata and getting back notfound seem to me to be:

    • I'm unnecessarily requesting tx data I already have, because I don't have some other parent of a tx I just received (most of the time)
    • My peer isn't forwarding me a parent that's still in the mempool but has expired from mapRelayTx (rarely, but non-zero)
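    The first bullet — re-requesting an already-confirmed parent — comes down to the AlreadyHave() heuristic described above. A hypothetical model (the data structures here are illustrative; the real check probes `HaveCoinInCache` for the txid's outputs):

    ```python
    # Toy model of why a confirmed tx can still look "unknown": only the
    # first two outputs are probed in the UTXO cache, so a confirmed tx
    # whose early outputs are already spent gets re-requested.
    def already_have(txid, mempool, utxo_set):
        if txid in mempool:
            return True
        # only outputs 0 and 1 are probed, mirroring the coins-cache heuristic
        return any((txid, n) in utxo_set for n in (0, 1))

    mempool = set()
    utxo_set = {("C", 2)}  # C is confirmed, but outputs 0 and 1 already spent
    assert not already_have("C", mempool, utxo_set)  # so C gets re-requested
    ```

    The peer then answers NOTFOUND because C is neither in its mapRelay nor (any longer) in its mempool, matching the "not in mempool" counts in the logs above.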
  28. Fix potential memory leak in m_process_time 50f8ba992c
  29. Report number of pending/inflight tx requests via getpeerinfo 44bafa17ce
  30. Log NOTFOUND messages e1576bf0fe
  31. Test p2p behaviour for getdata/notfound 465d2ad177
  32. Test p2p behaviour when getdata is ignored b83bf533c8
  33. Remove NOTFOUND txs from in_flight set e96932fd7e
  34. Disconnect from outbound peers with in_flight tx backlog 595b0402a5
  35. ajtowns force-pushed on Apr 17, 2019
  36. ajtowns commented at 5:16 AM on April 17, 2019: member

    Can we go back to the expiry approach?

    I don't see why not. Closing this PR in favour of #15834 . Fixed some issues in the branch anyway so it's still useful for reference.

  37. ajtowns closed this on Apr 17, 2019

  38. fanquake removed this from the "Blockers" column in a project

  39. MarcoFalke referenced this in commit 607b1b7498 on Apr 18, 2019
  40. MarcoFalke referenced this in commit f792395d13 on Jun 12, 2019
  41. sidhujag referenced this in commit 0fe7564a41 on Jun 12, 2019
  42. codablock referenced this in commit 74eabc23e5 on Apr 7, 2020
  43. ckti referenced this in commit 0333b9cdda on Mar 28, 2021
  44. DrahtBot locked this on Dec 16, 2021

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-13 18:14 UTC
