Net: Pass `MSG_MORE` flag when sending non-final network messages (round 2) #26844

whitslack commented at 7:09 PM on January 7, 2023: contributor

N.B.: This is my second attempt at introducing this optimization. #12519 (2018) was closed in deference to switching to doing gathering socket writes using sendmsg(2), which I agree would have superior performance due to fewer syscalls, but that work was apparently abandoned in late 2018. Ever since, Bitcoin Core has continued writing tons of runt packets to the wire. Can we proceed with my halfway solution for now?

Since Nagle's algorithm is disabled, each and every call to send(2) can potentially generate a separate TCP segment on the wire. This is especially inefficient when sending the tiny header preceding each message payload.

Linux implements a MSG_MORE flag that tells the kernel not to push the passed data immediately to the connected peer but rather to collect it in the socket's internal transmit buffer where it can be combined with data from successive calls to send(2). Where available, specify this flag when calling send(2) in CConnman::SocketSendData(CNode &) if the data buffer being sent is not the last one in node.vSendMsg.

DrahtBot commented at 7:09 PM on January 7, 2023: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	vasild, sipa

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

DrahtBot added the label P2P on Jan 7, 2023

Pass MSG_MORE flag when sending non-final network messages

Since Nagle's algorithm is disabled, each and every call to send(2) can potentially generate a separate TCP segment on the wire. This is especially inefficient when sending the tiny header preceding each message payload.

Linux implements a MSG_MORE flag that tells the kernel not to push the passed data immediately to the connected peer but rather to collect it in the socket's internal transmit buffer where it can be combined with data from successive calls to send(2). Where available, specify this flag when calling send(2) in CConnman::SocketSendData(CNode &) if the data buffer being sent is not the last one in node.vSendMsg.

691eaf8873

whitslack force-pushed on Jan 7, 2023

dergoegge commented at 4:44 PM on January 16, 2023: member

At a first glance I would ask why we can't do what was suggested here: #12519 (comment), if that is preferable? At least back then, it sounded like we could just pick https://github.com/bitcoin/bitcoin/commit/77cee166e17e355306d2ca1a1d360e71c12d8350 without the more involved changes in #11227. (Not advocating against your change, just wondering why you have chosen to go with the less preferable version when both seem technically possible)

Also pinging @vasild who would likely be a good reviewer for this.

whitslack commented at 8:43 PM on January 16, 2023: contributor

just wondering why you have chosen to go with the less preferable version when both seem technically possible

Because my change is tiny, and that other change is involved. I want the change to actually be merged, not bike-shedded indefinitely. Smaller changes are easier to validate.

in src/net.cpp:811 in 691eaf8873

 807 | +#ifdef MSG_MORE
 808 | +            if (it + 1 != node.vSendMsg.end()) {
 809 | +                flags |= MSG_MORE;
 810 | +            }
 811 | +#endif
 812 | +            nBytes = node.m_sock->Send(reinterpret_cast<const char*>(data.data()) + node.nSendOffset, data.size() - node.nSendOffset, flags);

vasild commented at 8:51 AM on January 30, 2023:

If this can't send the full message one of the two breaks below will be executed which may result in this function returning with all Send() calls using MSG_MORE. For example: let there be 3 messages, 10 bytes each. Send() of the first one will use MSG_MORE, let it return 10 (full message sent). Send() of the second one will use MSG_MORE, let it return any number in [0..9]. Then this function will return, without ever calling Send() without MSG_MORE and the first message may remain buffered/unsent somewhere in the OS.

I think this is unwanted effect, defeating TCP_NODELAY and the check if (it + 1 != node.vSendMsg.end()).

I wonder if it is possible to tell the OS that we are done, without sending further data? Maybe Send(nullptr, 0, MSG_NOSIGNAL | MSG_DONTWAIT); // without MSG_MORE would do that?

whitslack commented at 5:58 PM on January 30, 2023:

If not all of our pending outgoing messages can fit in the socket's transmit buffer, then we want all of our calls to send to specify MSG_MORE. The kernel will immediately (without delay) transmit as many full segments as it can assemble and that will fit in the congestion window. It will not immediately transmit a partial segment but instead will call back to us (via returning a writable status from select, poll, epoll, etc.) to fetch more data to transmit. If we have no more data to transmit, then yes, calling send with a zero length and without MSG_MORE will tell the kernel that we have nothing left to send, and it will transmit a partial segment as soon as such segment will fit in the congestion window.

In your example, if the second of the three 10-byte messages can only partially be copied to the socket transmit buffer, then it must be the case that the transmit buffer is now full. The transmit buffer is always larger than MSS, the maximum segment size, meaning the kernel can now assemble a full segment, which it will transmit as soon as the window allows. It will then indicate the socket is writable again, and we will be called again to write more of the second message.

So, in short, your concern that we will unnecessarily delay transmitting data is unwarranted. The kernel never delays transmitting a full segment if it fits in the congestion window. The kernel does delay transmitting a partial segment that fits in the congestion window if the last call to send on that socket specified MSG_MORE, but that's exactly what we want, as we want the opportunity to write more data into the transmit buffer so the kernel will potentially transmit fewer and larger segments.

vasild commented at 12:00 PM on January 31, 2023:

If send(2) writes less bytes than requested, does it follow that the transmit buffer is full? I am not sure about that. Is it documented somewhere? What about a signal interrupting the send midway (or even before it started)? Is that not a case where send(2) writes less bytes than requested, but the transmit buffer is not full?

whitslack commented at 7:31 PM on January 31, 2023:

What about a signal interrupting the send midway (or even before it started)?

I'm pretty sure that's only possible if the socket is in blocking mode. But regardless, if send returns a short write due to invoking a signal handler, then the socket is still writable, and the user code will call send again without delay (i.e., on the next iteration of the I/O loop).

vasild commented at 12:50 PM on February 2, 2023:

That would be after accepting new connections, sending, receiving and parsing messages from all other (possibly 100s) connections. I guess this would be rare and even when it happens would probably have no adverse effects on the communication, but I would feel more comfortable if this PR does not introduce that. If a dummy Send() without MSG_MORE before the two breaks is a way to avoid that, then I think it should be added.

whitslack commented at 7:31 PM on February 2, 2023:

In no case will the kernel hold onto a partial segment in the socket's transmit buffer for longer than 250 ms if the segment fits in the congestion window, so that's the maximum latency that could be introduced by writing a complete message with MSG_MORE, maxing out the transmit buffer while writing a subsequent message also with MSG_MORE, and then going off to do other things for a nigh eternity before coming back around.

Anyway, I don't want to add more syscall overhead. I'd prefer to rewrite this to do gathering writes rather than adding yet another send call. If that's the only way you'll let this move forward, then I'll do that work, but then I'd expect that to be bike shedded for ages while people debate the merits of stack allocation versus heap allocation of the struct iovec array, platform agnosticism, abstraction leakage, readability and maintainability of code, and scope creep.

vasild commented at 2:39 PM on February 9, 2023:

... 250 ms ...

That seems like relying on an undocumented behavior (unless it is documented somewhere)? But anyway:

... If that's the only way you'll let this move forward ...

I am not blocking this. I did not NACK it. I guess if it gathers enough ACKs from other reviewers it will get merged. Actually, I do not think this is a blocker, it is more in the "I would feel more comfortable" category.

I guess something like this would be nice:

--- i/src/net.cpp
+++ w/src/net.cpp
@@ -790,57 +790,67 @@ void V1TransportSerializer::prepareForTransport(CSerializedNetMsg& msg, std::vec
 
 size_t CConnman::SocketSendData(CNode& node) const
 {
     auto it = node.vSendMsg.begin();
     size_t nSentSize = 0;
 
     while (it != node.vSendMsg.end()) {
         const auto& data = *it;
         assert(data.size() > node.nSendOffset);
         int nBytes = 0;
+        int flags = MSG_NOSIGNAL | MSG_DONTWAIT;
         {
             LOCK(node.cs_hSocket);
             if (!node.m_sock) {
                 break;
             }
-            int flags = MSG_NOSIGNAL | MSG_DONTWAIT;
 #ifdef MSG_MORE
             if (it + 1 != node.vSendMsg.end()) {
                 flags |= MSG_MORE;
             }
 #endif
             nBytes = node.m_sock->Send(reinterpret_cast<const char*>(data.data()) + node.nSendOffset, data.size() - node.nSendOffset, flags);
         }
         if (nBytes > 0) {
             node.m_last_send = GetTime<std::chrono::seconds>();
             node.nSendBytes += nBytes;
             node.nSendOffset += nBytes;
             nSentSize += nBytes;
             if (node.nSendOffset == data.size()) {
                 node.nSendOffset = 0;
                 node.nSendSize -= data.size();
                 node.fPauseSend = node.nSendSize > nSendBufferMaxSize;
                 it++;
             } else {
                 // could not send full message; stop sending more
+#ifdef MSG_MORE
+                if (flags & MSG_MORE) {
+                    node.m_sock->Send(nullptr, 0, flags & ~MSG_MORE);
+                }
+#endif
                 break;
             }
         } else {
             if (nBytes < 0) {
                 // error
                 int nErr = WSAGetLastError();
                 if (nErr != WSAEWOULDBLOCK && nErr != WSAEMSGSIZE && nErr != WSAEINTR && nErr != WSAEINPROGRESS) {
                     LogPrint(BCLog::NET, "socket send error for peer=%d: %s\n", node.GetId(), NetworkErrorString(nErr));
                     node.CloseSocketDisconnect();
                 }
             }
             // couldn't send anything at all
+#ifdef MSG_MORE
+            if (flags & MSG_MORE) {
+                node.m_sock->Send(nullptr, 0, flags & ~MSG_MORE);
+            }
+#endif
             break;
         }
     }
 
     if (it == node.vSendMsg.end()) {
         assert(node.nSendOffset == 0);
         assert(node.nSendSize == 0);
     }
     node.vSendMsg.erase(node.vSendMsg.begin(), it);
     return nSentSize;

whitslack commented at 7:38 PM on February 9, 2023:

... 250 ms ...

That seems like relying on an undocumented behavior (unless it is documented somewhere)?

It is documented. (It's actually 200 milliseconds. I misremembered.)

sendmsg(2) says:

MSG_MORE (Since Linux 2.4.4) The caller has more data to send. This flag is used with TCP sockets to obtain the same effect as the TCP_CORK socket option (see tcp(7)), with the difference that this flag can be set on a per-call basis.

tcp(7) says:

TCP_CORK (since Linux 2.2) If set, don't send out partial frames. All queued partial frames are sent when the option is cleared again. This is useful for prepending headers before calling sendfile(2), or for throughput optimization. As currently implemented, there is a 200 millisecond ceiling on the time for which output is corked by TCP_CORK. If this ceiling is reached, then queued data is automatically transmitted.

sipa commented at 7:58 PM on February 9, 2023:

I also can't imagine the timeout actually matters for us? The patch above (the one adding a send with empty non-MSG_MORE) only triggers in case a send() call managed to push fewer bytes than we tried to send. That must mean the send buffer is full, and thus there is no point why the kernel would need to wait anymore.

whitslack commented at 8:07 PM on February 9, 2023:

The kernel will never wait to send full segments. Vasil's concern is the case where we try to send a non-final message, run out of space in the kernel's transmit buffer (so only some of the message was copied in), then go off and do other things for a virtual eternity. If we haven't called send without MSG_MORE, then it is possible that there is a complete message sitting in the kernel's transmit buffer (ahead of the incomplete message that we failed to fully write) that will be delayed. That's true, but I'd say that's by design. If it's really taking us longer than 200 ms for the network I/O loop to make a full pass, we have much bigger problems!

whitslack commented at 8:17 PM on February 9, 2023:

The (very reasonable) hope is that the network I/O loop comes back around to our socket to finish writing the incomplete message (and possibly more messages) before the kernel's transmit buffer has drained below the size of a full segment. In that case, there will be no artificial delay introduced to any message we've written to the socket.

So, basically, we want the network I/O loop to come back around to service our socket again in less time than the kernel's transmit buffer takes to drain.

vasild commented at 8:01 AM on February 10, 2023:

It is documented

Even better, thanks!

sipa commented at 3:10 PM on February 10, 2023:

Thanks, I understand now.

FWIW, I switched my (public, well-connected) node to (a rebased version of) this PR, together with @vasild's proposed patch above + logging whenever these empty non-MSG_MORE message would be sent.

On this node with ~250 connections, in 16 hours time, having uploaded 7.2 GB and downloaded 1.2 GB, 221 times a MSG_MORE was interrupted (always after sending some bytes, usually at least a few kB; never with having sent nothing).

whitslack commented at 5:41 PM on February 10, 2023:

221 times a MSG_MORE was interrupted

I think what you may have meant was that a send call indicated that fewer bytes were copied than were requested, not that the call was interrupted (by a signal handler).

The real test would be querying the SIOCOUTQ ioctl on the socket the next time the network I/O loop gets around to servicing it. That's the number of untransmitted bytes sitting in the kernel's transmit buffer. If that number is smaller than the connection's MSS, then the kernel exhausted its transmit buffer and started to delay sending the remaining bytes due to MSG_MORE.

whitslack commented at 5:49 PM on February 10, 2023:

never with having sent nothing

N.B.: It would be impossible for zero bytes to have been copied to a socket in non-blocking mode if that socket had indicated that it was writable and no other thread had written to it since such indication was made. Again, I'm pretty sure that even signal handlers won't interrupt a non-blocking write since the kernel only enters an interruptible wait state during a blocking write.

(Edit: Actually, there may be an edge case where a socket indicates writable despite having no free space in its transmit buffer: specifically the case where the next write call would immediately return an error without copying any bytes. That would be a fatal error on the socket, though, such as "connection reset.")

sipa commented at 6:03 PM on February 10, 2023:

I think what you may have meant was that a send call indicated that fewer bytes were copied than were requested, not that the call was interrupted (by a signal handler).

Yes, exactly - I tried to be succinct, but that may have been confusing. Specifically, it refers to fewer bytes being accepted by send() than handed to it, while MSG_MORE was set.

N.B.: It would be impossible for zero bytes to have been copied to a socket in non-blocking mode if that socket had indicated that it was writable and no other thread had written to it since such indication was made.

Indeed, so that sounds very much expected that 0 never happened (except when the socket errorred and the connected closed, but in that case, there is obviously no point in adding a non-MSG_MORE padding either).

vasild commented at 8:59 AM on January 30, 2023: contributor

Would be nice to demonstrate that this actually brings some benefit and is not just code clutter. Maybe compare tcpdump output with and without this patch?

whitslack commented at 6:09 PM on January 30, 2023: contributor

Would be nice to demonstrate that this actually brings some benefit and is not just code clutter. Maybe compare tcpdump output with and without this patch?

tcpdump is what originally alerted me to the presence of this anti-pattern in Bitcoin Core years ago. It was the reason I looked into the code and opened my original PR to address the problem.

I'd like someone else with experience in low-level network I/O programming to comment before I'd go to the effort to produce "proof" of the merits of using MSG_MORE. That flag does exist for good reason.

sipa commented at 6:21 PM on January 30, 2023: member

Concept ACK.

Unless someone is actively pursuing a PR to switch to gathering writes, I don't see any reason not to start using MSG_MORE if it's known to be beneficial.

I also find the justification that this will never do something undesirable fairly convincing.

vasild commented at 2:35 PM on January 31, 2023: contributor

Concept ACK

Below is some basic demonstration of MSG_MORE (tldr - it works!). The logs show how send(2) is called and there is the relevant tcpdump(1) output. This is during the initial handshake up to and including verack.

master

calling Send(24) without MSG_MORE
calling Send(103) without MSG_MORE
calling Send(24) without MSG_MORE
calling Send(24) without MSG_MORE
calling Send(24) without MSG_MORE

tcpdump:

IP master.17807 > random_node.8333: Flags [S], seq 950983330, win 64240, options [mss 1420,sackOK,TS val 4155972266 ecr 0,nop,wscale 7], length 0
IP master.17807 > random_node.8333: Flags [.], ack 3334216285, win 502, options [nop,nop,TS val 4155972306 ecr 1548157682], length 0
* IP master.17807 > random_node.8333: Flags [P.], seq 0:24, ack 1, win 502, options [nop,nop,TS val 4155972306 ecr 1548157682], length 24
* IP master.17807 > random_node.8333: Flags [P.], seq 24:127, ack 1, win 502, options [nop,nop,TS val 4155972306 ecr 1548157682], length 103
IP master.17807 > random_node.8333: Flags [.], ack 25, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 0
IP master.17807 > random_node.8333: Flags [.], ack 128, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 0
IP master.17807 > random_node.8333: Flags [.], ack 152, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 0
IP master.17807 > random_node.8333: Flags [.], ack 176, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 0
IP master.17807 > random_node.8333: Flags [P.], seq 127:151, ack 176, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 24
IP master.17807 > random_node.8333: Flags [P.], seq 151:175, ack 176, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 24
IP master.17807 > random_node.8333: Flags [P.], seq 175:199, ack 176, win 502, options [nop,nop,TS val 4155972357 ecr 1548157732], length 24

PR

>>>> calling Send(24) with MSG_MORE
>>>> calling Send(103) without MSG_MORE
>>>> calling Send(24) without MSG_MORE
>>>> calling Send(24) without MSG_MORE
>>>> calling Send(24) without MSG_MORE

tcpdump:

IP pr.17827 > random_node.8333: Flags [S], seq 2038199236, win 64240, options [mss 1420,sackOK,TS val 4155323939 ecr 0,nop,wscale 7], length 0
IP pr.17827 > random_node.8333: Flags [.], ack 3514299724, win 502, options [nop,nop,TS val 4155323980 ecr 4260101434], length 0
* IP pr.17827 > random_node.8333: Flags [P.], seq 0:127, ack 1, win 502, options [nop,nop,TS val 4155324474 ecr 4260101434], length 127
IP pr.17827 > random_node.8333: Flags [.], ack 128, win 502, options [nop,nop,TS val 4155324509 ecr 4260101964], length 0
IP pr.17827 > random_node.8333: Flags [.], ack 152, win 502, options [nop,nop,TS val 4155324509 ecr 4260101964], length 0
IP pr.17827 > random_node.8333: Flags [P.], seq 127:151, ack 152, win 502, options [nop,nop,TS val 4155324510 ecr 4260101964], length 24
IP pr.17827 > random_node.8333: Flags [P.], seq 151:175, ack 176, win 502, options [nop,nop,TS val 4155324510 ecr 4260101964], length 24
IP pr.17827 > random_node.8333: Flags [.], ack 200, win 502, options [nop,nop,TS val 4155324554 ecr 4260101974], length 0
IP pr.17827 > random_node.8333: Flags [P.], seq 175:199, ack 200, win 502, options [nop,nop,TS val 4155324558 ecr 4260102014], length 24

(notice the length of the packets marked with *, at the end of each line).

vasild approved

vasild commented at 2:41 PM on February 9, 2023: contributor

ACK 691eaf8873fe2f189153ca637506a0291504c97a

I think this is good to go as it is and the above discussion is not a blocker. It would be nice to get input from other reviewers on it.

Thanks!

sipa commented at 3:24 PM on February 10, 2023: member

ACK 691eaf8873fe2f189153ca637506a0291504c97a

Post-BIP324 (#24545 etc), this is also a privacy issue (as hiding packet boundaries against passive observers is a design goal).

fanquake added this to the milestone 25.0 on Feb 13, 2023

fanquake merged this on Feb 15, 2023

fanquake closed this on Feb 15, 2023

sidhujag referenced this in commit e7839eea4d on Feb 16, 2023

bitcoin locked this on Feb 15, 2024