I think #14897 introduces a livelock between peers who don't reply to GETDATA requests with TX data.
The likely scenario: Alice sends Bob an INV for a new tx, Bob sends GETDATA for the tx, and Alice responds with TX. Bob doesn't have the parent, so he sends GETDATA for the parent, adding it to the tx_in_flight set. Alice replies NOTFOUND because the parent has expired from the relay set after being in the mempool for over 15 minutes, and Bob doesn't remove the entry from the tx_in_flight set. Once this happens 100 times (MAX_PEER_TX_IN_FLIGHT), Bob will not request any more tx info via GETDATA from Alice. If this happens with all of Bob's peers, Bob will not receive tx data except when blocks are found. If Bob is not reachable for inbound nodes, this can plausibly happen for all of Bob's outbound nodes.
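To make the stall concrete, here is a minimal, self-contained model of the scenario above. It is only a sketch: `PeerTxState`, `RequestTx`, `OnNotFoundBuggy` and `CanStillRequestAfterNotFounds` are hypothetical names invented for illustration, txids are shortened to integers, and none of this is the actual net_processing code. Only `tx_in_flight` and `MAX_PEER_TX_IN_FLIGHT` come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Limit named in the description above.
constexpr std::size_t MAX_PEER_TX_IN_FLIGHT = 100;

// Hypothetical, simplified per-peer download state (not the real Bitcoin Core struct).
struct PeerTxState {
    // Transactions we have sent GETDATA for and not yet resolved.
    std::set<std::uint64_t> tx_in_flight;

    // Returns true if a GETDATA would be sent, false if the in-flight limit blocks it.
    bool RequestTx(std::uint64_t txid) {
        if (tx_in_flight.size() >= MAX_PEER_TX_IN_FLIGHT) return false;
        tx_in_flight.insert(txid);
        return true;
    }

    // A TX reply clears the entry.
    void OnTx(std::uint64_t txid) { tx_in_flight.erase(txid); }

    // The problematic behaviour: a NOTFOUND reply leaves the entry in place.
    void OnNotFoundBuggy(std::uint64_t /*txid*/) {}
};

// Requests `n` missing parents from one peer, each answered with NOTFOUND,
// then reports whether one more request can still be sent to that peer.
bool CanStillRequestAfterNotFounds(std::size_t n) {
    PeerTxState alice;
    for (std::uint64_t txid = 0; txid < n; ++txid) {
        if (alice.RequestTx(txid)) alice.OnNotFoundBuggy(txid);
    }
    return alice.RequestTx(n);
}
```

In this model, `CanStillRequestAfterNotFounds(99)` is true and `CanStillRequestAfterNotFounds(100)` is false: after the hundredth unanswered parent request, nothing more is requested from that peer.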
I think this happened for my node this week (running master-ish), causing my mempool to be empty, despite having 8 good peers who continued sending INV notifications.
Patches are:
- debugging info for `getpeerinfo` to be able to tell if either of the queues is growing
- logging of 'notfound' messages
- test cases for 'notfound' response to getdata, and no response at all to getdata
- fix to expire inflight tx's that are reported as 'notfound'
- fix to disconnect nodes that have 100 inflight tx getdata requests and also have additional tx's we'd like to request that have been waiting for 30 minutes or more
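
The two fixes can be sketched roughly as follows. This is an illustrative model under assumptions, not the code in the patches: `PeerTxState`, `tx_waiting`, `MAX_TX_WAIT`, `OnNotFound` and `ShouldDisconnect` are hypothetical names, and time is passed in explicitly so the logic is easy to follow.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>

constexpr std::size_t MAX_PEER_TX_IN_FLIGHT = 100;
// Hypothetical constant name for the 30-minute threshold mentioned above.
constexpr std::chrono::minutes MAX_TX_WAIT{30};

// Hypothetical, simplified per-peer download state (not the real Bitcoin Core struct).
struct PeerTxState {
    // Transactions we have sent GETDATA for and not yet resolved.
    std::set<std::uint64_t> tx_in_flight;
    // Transactions we would like to request, mapped to the time they were queued.
    std::map<std::uint64_t, std::chrono::seconds> tx_waiting;

    // First fix: a NOTFOUND reply expires the in-flight entry.
    void OnNotFound(std::uint64_t txid) { tx_in_flight.erase(txid); }

    // Second fix: disconnect a peer whose in-flight set is full while some
    // queued request has been waiting for 30 minutes or more.
    bool ShouldDisconnect(std::chrono::seconds now) const {
        if (tx_in_flight.size() < MAX_PEER_TX_IN_FLIGHT) return false;
        for (const auto& entry : tx_waiting) {
            if (now - entry.second >= MAX_TX_WAIT) return true;
        }
        return false;
    }
};
```

With the first fix, a peer that answers NOTFOUND frees its slot immediately. The second covers peers that never answer at all, where the only signal is a full in-flight set plus a stale queue.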