Fix potential network stalling bug #27981

pull sipa wants to merge 1 commit into bitcoin:master from sipa:202306_pushback changing 2 files +30 −37
  1. sipa commented at 8:37 pm on June 26, 2023: member

    See https://github.com/ElementsProject/elements/issues/1233. There, it has been observed that if both sides of a P2P connection have a significant amount of data to send, a stall can occur, where both try to drain their own send queue before trying to receive. The same issue seems to apply to the current Bitcoin Core codebase, though I don’t know whether it’s a frequent issue for us.

    The core issue is that whenever our optimistic send fails to fully send a message, we subsequently do not even select() for receiving; if it then turns out that sending is not possible either, no progress is made at all. To address this, the solution used in this PR is to still select() for both sending and receiving when an optimistic send fails, but to skip receiving if sending succeeded yet (still) didn’t fully drain the send queue.

    This is a significant reduction in how aggressive the “receive pushback” mechanism is, because now it will only mildly push back while sending progress is made; if the other side stops receiving entirely, the pushback disappears. I don’t think that’s a serious problem though:

    • We still have a pushback mechanism at the application buffer level (when the application receive buffer overflows, receiving is paused until messages in the buffer get processed; waiting on our own net_processing thread, not on the remote party).
    • There are cases where the existing mechanism is too aggressive; e.g. when the send queue is non-empty, but tiny, and can be sent with a single send() call. In that case, I think we’d prefer to still receive within the same processing loop of the network thread.
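
    In pseudo-code, the intended behaviour is roughly the following (a sketch only, not the literal implementation; identifiers like send_ready, recv_ready and the SocketSendData return shape are placeholders, while the rest mirrors names from net.cpp and the diff below):

    // Sketch, not the actual code. 1) When building the wait set, request both
    // events instead of letting a pending send suppress the receive event:
    Sock::Event requested = 0;
    if (select_send) requested |= Sock::SEND;
    if (select_recv) requested |= Sock::RECV;
    if (requested != 0) events_per_sock.emplace(pnode->m_sock, Sock::Events{requested});

    // 2) After waiting, try to send first, and only skip receiving when the send
    //    made progress but still left data queued (mild pushback only):
    if (send_ready) {
        const auto [bytes_sent, data_left] = SocketSendData(*pnode); // assumed return shape
        if (bytes_sent > 0 && data_left) recv_ready = false;
    }
    if (recv_ready) {
        // receive and deserialize messages as usual
    }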
  2. DrahtBot commented at 8:37 pm on June 26, 2023: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Reviews

    See the guideline for information on the review process.

    Type | Reviewers
    ACK | mzumsande, ajtowns, naumenkogs

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #28222 (Use shared_ptr for CNode inside CConnman by willcl-ark)
    • #28196 (BIP324 connection support by sipa)
    • #28165 (net: transport abstraction by sipa)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. sipa commented at 8:38 pm on June 26, 2023: member
    cc @psgreco, who pointed to the issue, and helped test it.
  4. psgreco commented at 9:03 pm on June 26, 2023: contributor
    Concept ACK 5e9237891df69876e1a6f81bf158aed2a683ffe2. This potential issue is not easy to trigger on demand in bitcoin, but it’s relatively easy to trigger in elements when the node is roughly 20/24 hours behind. I tested a similar version of the patch in elements, and it does solve the stalling.
  5. kristapsk commented at 8:38 am on June 27, 2023: contributor
    Wouldn’t it be possible to trigger and test this with some functional test?
  6. psgreco commented at 11:04 am on June 27, 2023: contributor
    In theory it should be, but in our tests (mostly @lolhill’s) a big component of this situation is latency. I’ve never been able to replicate this between two local hosts, only with a host that’s ~100ms away.
  7. glozow added the label P2P on Jun 29, 2023
  8. in src/net.cpp:1221 in 5e9237891d outdated
    1232-        bool select_send;
    1233-        {
    1234-            LOCK(pnode->cs_vSend);
    1235-            select_send = !pnode->vSendMsg.empty();
    1236-        }
    1237+        bool select_send = WITH_LOCK(pnode->cs_vSend, return !pnode->vSendMsg.empty());
    


    ajtowns commented at 3:36 pm on July 4, 2023:
    Would it make sense to introduce a method bool CNode::WantsToSend() const { return !vSendMsg.empty(); }, and use that both here and above instead of returning a pair<X, bool>?

    sipa commented at 2:48 pm on July 20, 2023:
    That’d mean grabbing the lock twice, no? I added it to SocketSendData because the cs_vSend lock is already grabbed to call that.

    ajtowns commented at 0:27 am on July 21, 2023:
    Not if you make it WantsToSend() EXCLUSIVE_LOCKS_REQUIRED(cs_vSend) and require the caller to have the lock?

    ajtowns commented at 8:23 am on August 17, 2023:
    For your consideration: https://github.com/ajtowns/bitcoin/commit/87c509c6d6ee0d355d08e0d4bc60bc01d4a0ad60 . I expanded the WITH_LOCK out in GenerateWaitSockets because I thought that was clearer than trying to make it an expression. Keeps the signature of SocketSendData the same, doesn’t add any additional locking, and avoids the dummy data_left in PushMessage.
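
    For reference, the helper being discussed would look roughly like this (a sketch; the linked commit may differ in detail):

    // Declared on CNode; the annotation makes the compiler check that callers
    // already hold cs_vSend, so no second lock acquisition is needed:
    bool WantsToSend() const EXCLUSIVE_LOCKS_REQUIRED(cs_vSend) { return !vSendMsg.empty(); }

    // A call site then takes the lock once and keeps it for the send itself:
    LOCK(pnode->cs_vSend);
    const bool select_send{pnode->WantsToSend()};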
  9. ajtowns commented at 4:06 pm on July 4, 2023: contributor

    The core issue is that whenever our optimistic send fails to fully send a message, we do subsequently not even select() for receiving; if it then turns out that sending is not possible either, no progress is made at all. To address this, the solution used in this PR is to still select() for both sending and receiving when an optimistic send fails, but skip receiving if sending succeeded, and (still) doesn’t fully drain the send queue.

    AIUI (correct me if I’m wrong!) the backpressure we do is:

    • fPauseSend – we won’t do ProcessMessage (which would probably cause us to generate new data to send) when this is set, which is when we have more than 1MB of data lined up. (However we will do SendMessages, which generates relatively small messages like INVs and PING and GETDATA)
    • fPauseRecv – we won’t select the socket for reading any more once we’ve got more than 5MB in parsed messages queued up to process
    • prefer sending over receiving – if we’ve got data to send, we’ll prioritise sending it, even if we’re making no forward progress and could receive messages, to the point where we don’t even check (via select/poll) to see if there’s any data to receive when we’ve got data to send
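
    In simplified form, the first two flags amount to checks like these (a sketch; thresholds rounded, and member names other than fPauseSend, fPauseRecv and m_msg_process_queue_size are placeholders):

    // Application-level backpressure, roughly:
    node.fPauseSend = send_queue_bytes > 1'000'000;               // ~1MB of outgoing data queued
    node.fPauseRecv = node.m_msg_process_queue_size > 5'000'000;  // ~5MB of parsed, unprocessed messages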

    This patch changes the latter logic to be:

    • prefer sending over receiving – always select for both sending and receiving, but don’t actually try to receive data unless (a) we don’t have data to send in the first place, (b) we can’t send any data, or (c) we’ve successfully sent all the data we have.

    This seems pretty sensible to me: this is a peer to peer protocol, so it seems to me we should be making progress in parallel on sending and receiving whenever possible – your send is my receive after all.

    Approach ACK.

  10. naumenkogs commented at 11:53 am on July 11, 2023: member
    Approach ACK
  11. Rework receive buffer pushback
    Co-authored-by: Anthony Towns <aj@erisian.com.au>
    3388e523a1
  12. in src/net.cpp:1228 in 5e9237891d outdated
    1250-            requested = Sock::RECV;
    1251-        }
    1252-
    1253-        events_per_sock.emplace(pnode->m_sock, Sock::Events{requested});
    1254+        if (select_send) events_per_sock.emplace(pnode->m_sock, Sock::Events{Sock::SEND});
    1255+        if (select_recv) events_per_sock.emplace(pnode->m_sock, Sock::Events{Sock::RECV});
    


    mzumsande commented at 11:13 pm on July 19, 2023:
    Won’t the second emplace just do nothing if both select_send and select_recv are true? I thought the idea was to change behavior to have both send and recv requested events (instead of just giving select_send priority like in master). But wouldn’t we need to insert a combination of Sock::SEND and Sock::RECV then, instead of repeated emplaces?

    sipa commented at 11:38 pm on July 19, 2023:
    The order, or splitting, does not matter. All of these are fed to Sock::WaitMany, which will mark the ones that are ready for sending to / receiving from. It’s in the processing of those wait results that the prioritization happens: receiving is skipped if (a) the socket was ready for sending, (b) something was sent, and (c) there is yet more to send.

    mzumsande commented at 2:06 am on July 20, 2023:
    I’m still confused: events_per_sock is a std::unordered_map. If select_send and select_recv are true, we now emplace twice into it, with the same key pnode->m_sock and different values. That means that the first value stays, and the second emplace is a no-op, leaving the container unchanged. So if only the first value (in this case Sock::SEND) is fed to Sock::WaitMany, why doesn’t the order matter?

    sipa commented at 2:08 am on July 20, 2023:

    Oh, I had missed the data type, and thought you were talking about something else.

    You’re absolutely right, will address.


    ajtowns commented at 2:10 am on July 20, 2023:

    I think the concern here is that events_per_sock is a Sock::EventsPerSock which is an unordered map, so using the same key in two emplace calls would cause the second to overwrite the first. The key type here is shared_ptr<const Sock> and the std::hash for shared ptrs just looks at the address they’re pointing to, so as far as I can see the second emplace here will indeed overwrite the first, making this effectively the same as:

    if (select_recv) {
        emplace(RECV);
    } else if (select_send) {
        emplace(SEND);
    }
    

    So I think this should instead be a single emplace:

    if (select_send || select_recv) {
        Sock::Event event = (select_send ? Sock::SEND : 0) | (select_recv ? Sock::RECV : 0);
        events_per_sock.emplace(pnode->m_sock, Sock::Events{event});
    }
    

    sipa commented at 2:47 pm on July 20, 2023:
    Fixed.

    mzumsande commented at 3:00 pm on July 20, 2023:

    so using the same key in two emplace calls would cause the second to overwrite the first.

    Just for the record: It’s the other way round, the second call gets ignored and nothing overwritten, see https://en.cppreference.com/w/cpp/container/unordered_map/emplace
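
    A quick standalone illustration of those emplace semantics (added here for reference, not part of the PR):

    #include <cassert>
    #include <unordered_map>

    int main()
    {
        std::unordered_map<int, char> m;
        m.emplace(1, 'a');                             // key 1 inserted with value 'a'
        const auto [it, inserted] = m.emplace(1, 'b'); // key already present: this call is a no-op
        assert(!inserted && it->second == 'a');        // the first value survives; nothing is overwritten
        return 0;
    }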


    ajtowns commented at 0:30 am on July 21, 2023:
    Err, doesn’t that mean this behaved the same as the previous code? How did it fix anything?

    sipa commented at 0:38 am on July 21, 2023:
    @ajtowns The old code here first added a SEND, and then a RECEIVE, so if both were performed, only a SEND would be present in the map, meaning an unconditional send.

    mzumsande commented at 0:41 am on July 21, 2023:
    I could reproduce the problem in a functional test today - for me, neither version (including the current one) fixes the problem completely. Will post detailed results soon.

    ajtowns commented at 2:47 am on July 21, 2023:

    @ajtowns The old code here first added a SEND, and then a RECEIVE, so if both were performed, only a SEND would be present in the map, meaning an unconditional send.

    Right, but master does the same thing, only emplacing SEND whenever vSendMsg isn’t empty; so it seems weird that the original patch here was observed to help anything?


    sipa commented at 2:51 am on July 21, 2023:
    @ajtowns You’re right. I’m confused now. @mzumsande’s observation that stalling can also be caused by fPauseRecv being set on both sides seems like a partial explanation, but not the whole thing.
  13. sipa force-pushed on Jul 20, 2023
  14. sipa commented at 2:53 pm on July 20, 2023: member
    @psgreco See above; it turned out that what I intended to do here wasn’t actually what was implemented (it was instead unconditionally preferring send over receive). Would you mind testing again to see whether this fixes the issue for you?
  15. mzumsande commented at 1:28 am on July 21, 2023: contributor

    I wrote a functional test, see https://github.com/mzumsande/bitcoin/tree/test_sipa_netstalling (because of the 1st commit it’s obviously not intended for merge, but it makes it possible to reproduce the problem). It works by mining a few large blocks, having two nodes exchange these blocks in both directions via repeated getblockfrompeer calls, and then checking whether the deadlock happened.

    Unfortunately, the current branch doesn’t appear to fix the problem completely; the test fails for me both here and on master: when the situation is reached where we now select for both sending and receiving (because our peer doesn’t receive any data), we try to resolve the deadlock by now also receiving. This works for a little while; however, if our send buffer is full, fPauseSend will be set, and because of that we bail out early in ProcessMessages() and don’t call PollMessage() anymore. Therefore the received data piles up without being cleared by net_processing. When pnode->m_msg_process_queue_size becomes too large (5MB), fPauseRecv will also be set, and after that we are again in a deadlock situation where both peers are sending and neither is receiving. I could observe this with the added logging in the 3rd commit in my branch.
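
    The chain of events above corresponds roughly to this gating (heavily simplified, not the literal net_processing code):

    // Once the send buffer is full, fPauseSend is set...
    if (pfrom->fPauseSend) return false;   // ...so ProcessMessages() bails out here
    auto msg = pfrom->PollMessage();       // never reached while paused, so
                                           // m_msg_process_queue_size keeps growing
    // Once that queue exceeds ~5MB, fPauseRecv is set as well and this side stops
    // receiving too: both peers are sending, neither is receiving.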

    Not sure how to best fix this…

  16. ajtowns commented at 4:50 am on July 21, 2023: contributor

    I think fPauseSend getting set on both sides and causing a deadlock should probably be out of scope for this PR – at least as I understand it, this fixes an issue where we get a deadlock even without fPauseSend triggering.

    I think the scenario here is:

    • peer A sends a 2MB message to peer B. This fills up B’s socket receive buffer (200kB?) and A’s socket send buffer (200kB?) without completing. A still has 1.6MB to send to B, so stops reading from the socket.
    • peer B does the same thing at almost exactly the same time, with the same result.
    • A/B are deadlocked.

    Maybe adding a debug-only sendp2pmsg rpc would be the easiest way to simulate this and be useful for debugging p2p things in general?

    If we do want to address fPauseSend deadlocking, a few approaches come to mind:

    1. easy: make fPauseSend a timestamp rather than a boolean, and if it’s been set for >5 minutes, disconnect. This doesn’t prevent the deadlock, but at least frees up the connection slot and makes it possible to try again (a rough sketch follows this list).
    2. hard: rework net_processing so that we keep making as much progress as we can – eg, change fPauseSend to continue processing incoming block or tx messages, but to skip GETDATA messages and to defer sending out INV messages and the like, so that we’re draining as much traffic as we can, while limiting how much gets added.
    3. impossible? add more dynamic traffic throttling: if you’re bandwidth limited and getting too much TX traffic, perhaps you should be raising your feefilter level even if your mempool isn’t full? I don’t see how to generalise that if it’s blocks or header messages or something else that causes a problem though.
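
    A minimal sketch of approach 1 (purely illustrative; the type, member and constant names are assumptions, not existing code):

    #include <chrono>
    #include <cstddef>
    #include <optional>

    struct SendPauseTracker {
        // Replace the boolean fPauseSend with the time the pause began.
        static constexpr auto MAX_SEND_PAUSE{std::chrono::minutes{5}};
        std::optional<std::chrono::steady_clock::time_point> m_paused_since;

        void Update(std::size_t send_buffer_bytes)
        {
            if (send_buffer_bytes > 1'000'000) {  // the existing ~1MB threshold
                if (!m_paused_since) m_paused_since = std::chrono::steady_clock::now();
            } else {
                m_paused_since.reset();
            }
        }

        // The deadlock itself is not prevented, but a stalled peer is eventually dropped.
        bool ShouldDisconnect() const
        {
            return m_paused_since &&
                   std::chrono::steady_clock::now() - *m_paused_since > MAX_SEND_PAUSE;
        }
    };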
  17. psgreco commented at 8:05 pm on July 21, 2023: contributor

    @psgreco See above; it turned out that what I intended to do here wasn’t actually what was implemented (it was instead unconditionally preferring send over receive). Would you mind testing again to see whether this fixes the issue for you?

    It still seems to fix the issue for me with the new changes, but the refactoring I had to do to run it on elements 22 (like bitcoin 22) doesn’t let me make a hard confirmation.

  18. sipa commented at 2:03 pm on July 24, 2023: member

    @ajtowns @mzumsande Thanks, so it appears there are actually two mostly-unrelated network buffer deadlock issues, and Martin’s test is likely triggering both of them.

    I agree with AJ that we should still fix the network-side one, even if we can’t (or don’t want to) address the application-buffer-side one. Fixing the application-buffer-side one indeed seems a lot harder, and probably needs discussion beyond this PR.

    It would be good to have a test for the network-side one, without it also triggering the application-side one, to verify this PR actually fixes something. Especially as I can’t use @psgreco’s earlier observation as a means to validate it (an older version of this PR unconditionally preferred sending, which shouldn’t have improved the situation at all).

  19. mzumsande commented at 11:37 am on August 11, 2023: contributor

    Thanks, so it appears there are actually two mostly-unrelated network buffer deadlock issues, and Martin’s test is likely triggering both of them.

    So far, I haven’t been able to trigger the original deadlock issue in my test when I run it on master - only the other one described above.

  20. mzumsande commented at 10:02 pm on August 15, 2023: contributor

    Tested ACK 3388e523a129ad9c7aef418c9f57491f8c2d9df8

    I now managed to reproduce the deadlock described in the OP by

    1.) adding a debug-only sendmsgtopeer rpc as suggested by @ajtowns above (https://github.com/mzumsande/bitcoin/commit/9c90e5dc33051ff9d48160c831a63c434ddb4e4d)

    2.) Creating a functional test that uses this rpc to simultaneously have two nodes send a large message (4MB) to each other. (https://github.com/mzumsande/bitcoin/commit/70e6752c3a66ecf414be1bd16ca80ba4c3c4fa63)

    The added test creates the situation described above and fails on master and succeeds on this branch.

    If you want, feel free to include the commits from 202208_test_sendmsg - alternatively, if you’d prefer not to deal with test / rpc feedback here, I’d also be happy to open a separate PR that builds on this branch.

  21. ajtowns commented at 9:38 am on August 17, 2023: contributor

    ACK 3388e523a129ad9c7aef418c9f57491f8c2d9df8

    Test case looks good; seems reasonable to do the sendmsgtopeer change in a separate PR though.

  22. naumenkogs commented at 11:50 am on August 17, 2023: member
    ACK 3388e523a129ad9c7aef418c9f57491f8c2d9df8
  23. fanquake merged this on Aug 17, 2023
  24. fanquake closed this on Aug 17, 2023

  25. sidhujag referenced this in commit 9e088b00c1 on Aug 17, 2023
  26. Sjors commented at 3:41 pm on August 18, 2023: member
    Post merge concept ACK. From my (very) limited understanding of sockets, this makes sense. Thanks @ajtowns for the description #27981 (comment).
  27. achow101 referenced this in commit c9273f68f6 on Aug 24, 2023
  28. luke-jr referenced this in commit cd5476aa0f on Aug 29, 2023
  29. Frank-GER referenced this in commit 74aeda39e2 on Sep 8, 2023
  30. kwvg referenced this in commit b86fe80566 on Aug 4, 2024
  31. kwvg referenced this in commit 8eceb7c3d1 on Aug 4, 2024
  32. kwvg referenced this in commit 157b583a24 on Aug 6, 2024
  33. kwvg referenced this in commit 7e91aa385c on Aug 6, 2024
  34. kwvg referenced this in commit e324c89b4f on Aug 6, 2024
  35. kwvg referenced this in commit ee7affc118 on Aug 6, 2024
  36. kwvg referenced this in commit f657da6ceb on Aug 13, 2024
  37. kwvg referenced this in commit 5d534d266b on Aug 13, 2024
  38. bitcoin locked this on Aug 17, 2024
