Set Windows TCP buffers to 64KB to match OSX and Unix #6693

ptschip wants to merge 1 commit into bitcoin:master from ptschip:issue_6554, changing 2 files +8 −0
  1. ptschip commented at 5:51 AM on September 18, 2015: contributor

    After making the changes I ran the wallet.py test (which is where the problem manifested most often) 100 times without a failure on native Windows. It was a simple problem: the Windows default socket buffer size is too small, causing buffer overflows on occasion when syncing large amounts of data.

  2. laanwj added the label Windows on Sep 18, 2015
  3. laanwj added the label P2P on Sep 18, 2015
  4. laanwj commented at 3:08 PM on September 18, 2015: member

    Is this a security issue?

    I don't really know windows networking internals, but in which circumstances can a client flood this buffer? And if it floods, what happens? Why doesn't it block (stop increasing the ACK pointer in TCP packets) and pick it up later?

  5. in src/net.h in 4db0d19869 outdated

      @@ -60,6 +60,8 @@ static const bool DEFAULT_UPNP = false;
       static const size_t MAPASKFOR_MAX_SZ = MAX_INV_SZ;
       /** The maximum number of peer connections to maintain. */
       static const unsigned int DEFAULT_MAX_PEER_CONNECTIONS = 125;
      +/** The maximum TCP send or receive buffer size */
      +static const unsigned int TCP_BUFFER_SIZE = 65535;



    MarcoFalke commented at 4:00 PM on September 18, 2015:

    Shouldn't this be named MAX_WINDOWS_TCP_BUFFER_SIZE?


    ptschip commented at 5:31 PM on September 18, 2015:

    I suppose that's true: it is a user-defined maximum, since the OS can reduce the buffer size if/when it needs to cut back on memory. And it currently applies only to Windows.

    I'll update it and push.


  6. ptschip commented at 5:21 PM on September 18, 2015: contributor

    No, I'm not referring to a security issue, just functional and performance related.

    What is happening is that the nodes are getting hung when large numbers of blocks are being downloaded. I see this clearly when running the wallet.py test, and it happens 50% of the time I run that test. I've verified this on both win32 and win64 workstations.

    What is seen is that at some point the client sends a getdata but the other node never receives the request, and it always happens when a large number of blocks has to sync (such as mining 100 blocks during the wallet.py test). That by itself could be a number of things, and I've run down quite a few rabbit trails to no avail, but when I increase the buffer size the problem goes away. I wish I could see the traffic in a network sniffer, but on Windows we can't monitor the loopback adapter with Wireshark. So I think it's a logical deduction that the buffer is getting full.

    The bigger problem though in my mind is that send buffers should not be getting full to begin with. There should be some handling in the code to check the buffer has room before data is sent down the pipe to prevent hangups, but I don't know the code base yet well enough to see if that is already being done and just not working properly, or whether it needs to be added.

    Regardless, the default 8KB Windows socket buffer is far too small (I really don't understand why they've kept it that low; I think it dates back to Windows 95). Mac has a 64KB default buffer and I would expect Linux probably does also. The bigger the buffer, the more throughput is possible and hence better performance: Throughput = Buffersize / latency.


  7. laanwj commented at 11:51 AM on September 21, 2015: member

    Right, thanks for the explanation. It does leave me kind of worried that our code isn't robust against a full buffer. I understand that a larger buffer can give better performance (fewer copy operations), but getting entirely stuck is unexpected. So I don't disagree with this change, but I hope this doesn't cover up a deeper issue.

  8. ptschip commented at 1:09 PM on September 21, 2015: contributor

    @laanwj I agree that this may be covering up a deeper functional issue, but if there's an issue there, it's been there for a long time and likely for all platforms. I don't really see any risk here in raising the buffer to a more reasonable size for Native Windows which OSX and Linux already use as defaults. IMO it really needs to be done anyway, deeper issues aside.

    And at the least, the buffer size change fixes issue 6554, will likely improve sync times for Windows users, and in addition we finally get all the python tests working on Windows and therefore (hopefully) on Travis as well.

    As a side note, and as far as security goes, there's really nothing an attacker can do with this change that they can't already do.

    Peter

  9. MarcoFalke commented at 1:23 PM on September 21, 2015: member

    Concept ACK. But I agree with @laanwj: Even though this fixes regtest mode on Windows, there are many nodes running the "old code" on main net. If the buffer issue reveals any vulnerability, fixes need to be submitted to the old release branches as well and node operators need to be notified.

  10. laanwj commented at 3:40 PM on September 21, 2015: member

    @MarcoFalke This is not a 'buffer overflow' in the security sense. That formulation confused me as well. The sync gets stuck, but there is no process crash, nor any danger of anything being overwritten. @theuni can you take a look here?

  11. ptschip force-pushed on Sep 21, 2015
  12. theuni commented at 1:53 AM on September 22, 2015: member

    Mmm, I don't like this as-is. Not because I think it's wrong (argument for raising the size seems reasonable), but because as the others have said, it seems much more likely this is a symptom of a larger underlying issue.

    If the buffer is completely full, logically the send should fail and return some error. If it's not completely full, small writes should continue to succeed. @ptschip Could you please add some logging so that nErr is always printed if the send() fails? Something like:

    diff --git a/src/net.cpp b/src/net.cpp
    index 87c4f0a..4b27a88 100644
    --- a/src/net.cpp
    +++ b/src/net.cpp
    @@ -750,15 +750,16 @@ void SocketSendData(CNode *pnode)
                     it++;
                 } else {
                     // could not send full message; stop sending more
    +                LogPrintf("send full. Only sent: %i bytes. %lu bytes remain\n", nBytes, data.size() - pnode->nSendOffset);
                     break;
                 }
             } else {
                 if (nBytes < 0) {
                     // error
                     int nErr = WSAGetLastError();
    +                LogPrintf("socket send error %s. Trying to send: %lu bytes\n", NetworkErrorString(nErr), data.size() - pnode->nSendOffset);
                     if (nErr != WSAEWOULDBLOCK && nErr != WSAEMSGSIZE && nErr != WSAEINTR && nErr != WSAEINPROGRESS)
                     {
    -                    LogPrintf("socket send error %s\n", NetworkErrorString(nErr));
                         pnode->CloseSocketDisconnect();
                     }
                 }
    

    It would also be helpful to see the send() return values printed before/after your buffer size change. Maybe a trend would emerge.

  13. ptschip commented at 4:46 AM on September 22, 2015: contributor

    @theuni I'll follow up with the logprints tomorrow, but here are a few interesting data points. I ran the test changing only the receive buffer: with 8KB send / 8KB receive the test fails; with 8KB send / 16KB receive it also fails; with 8KB send / 64KB receive it passes. So it seems more likely the problem is on the receiving end, also noting that when the test fails a getdata is sent and a log entry made, but it is never received at the other end. After reviewing the code I also don't see anything in the send-data code that is cause for alarm. However, I'll still add the logprint entries and see if anything comes out of it. But it's really starting to look like just poorly sized buffers.

  14. theuni commented at 5:19 AM on September 22, 2015: member

    @ptschip poorly sized buffers would mean more inefficient transfers. If a few thousand bytes can't be transferred locally in a matter of seconds, that points to something bigger. Looking forward to some log data.

  15. ptschip commented at 2:01 PM on September 22, 2015: contributor

    Here are the logs, with 8KB send/receive buffers, both ends are having trouble sending:

    From the node with the blocks, trying to send block data

    2015-09-22 13:43:33 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 1240 bytes
    2015-09-22 13:43:33 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 1240 bytes
    2015-09-22 13:43:33 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 1888 bytes
    2015-09-22 13:43:33 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 1888 bytes
    2015-09-22 13:43:35 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 1969 bytes
    2015-09-22 13:43:35 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 2293 bytes
    2015-09-22 13:43:35 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 2293 bytes
    2015-09-22 13:43:35 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 1969 bytes
    2015-09-22 13:43:35 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 2374 bytes
    2015-09-22 13:43:35 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 2617 bytes
    2015-09-22 13:43:35 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 2455 bytes
    2015-09-22 13:43:35 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 2455 bytes
    2015-09-22 13:43:35 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 2941 bytes

    And from the node trying to sync blocks:

    2015-09-22 13:43:33 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 637 bytes
    2015-09-22 13:43:33 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 637 bytes
    2015-09-22 13:43:33 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 637 bytes
    2015-09-22 13:43:33 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 637 bytes
    2015-09-22 13:43:35 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 637 bytes

  16. ptschip commented at 2:38 PM on September 22, 2015: contributor

    Logs for a tests with 8KB send and 64KB receive buffers show nothing out of the ordinary in the nodes that are syncing, but in the node that mines the blocks we have again the following which happens for 79 out of the 100 blocks being sent:

    2015-09-22 14:27:16 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 1240 bytes
    2015-09-22 14:27:16 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 5209 bytes
    2015-09-22 14:27:16 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 204 bytes
    2015-09-22 14:27:16 socket send error A non-blocking socket operation could not be completed immediately. (10035). Trying to send: 5371 bytes
    2

    Also, I noticed these entries, not sure if this is something new? Seems to happen when mining new blocks.

    2015-09-22 14:27:16 ERROR: CheckProofOfWork(): hash doesn't match nBits
    2015-09-22 14:27:16 ERROR: CheckProofOfWork(): hash doesn't match nBits
    2015-09-22 14:27:16 ERROR: CheckProofOfWork(): hash doesn't match nBits

  17. theuni commented at 7:50 PM on September 22, 2015: member

    @ptschip We use select() to determine when we can write, so we really shouldn't be failing in that way. The only thing I can think of is that it's the optimistic writes that are failing. Could you please try disabling those as an experiment, and seeing what changes?

    diff --git a/src/net.cpp b/src/net.cpp
    index 87c4f0a..36b10ee 100644
    --- a/src/net.cpp
    +++ b/src/net.cpp
    @@ -2373,10 +2375,6 @@ void CNode::EndMessage() UNLOCK_FUNCTION(cs_vSend)
         ssSend.GetAndClear(*it);
         nSendSize += (*it).size();
    
    -    // If write queue empty, attempt "optimistic write"
    -    if (it == vSendMsg.begin())
    -        SocketSendData(this);
    -
         LEAVE_CRITICAL_SECTION(cs_vSend);
     }
    
    
  18. theuni commented at 8:27 PM on September 22, 2015: member

    Another thing to try: According to this, we may get the observed behavior if a single write is larger than the underlying buffer. Here's a quick hack to determine if that's what we're seeing:

    diff --git a/src/net.cpp b/src/net.cpp
    index 87c4f0a..8c0ce4c 100644
    --- a/src/net.cpp
    +++ b/src/net.cpp
    @@ -738,7 +738,9 @@ void SocketSendData(CNode *pnode)
         while (it != pnode->vSendMsg.end()) {
             const CSerializeData &data = *it;
             assert(data.size() > pnode->nSendOffset);
    -        int nBytes = send(pnode->hSocket, &data[pnode->nSendOffset], data.size() - pnode->nSendOffset, MSG_NOSIGNAL | MSG_DONTWAIT);
    +        size_t nMaxSend = 4000; // 4000 as a quick hack test. This value should come from getsockopt(...SO_SNDBUF...) cached in pnode.
    +        size_t nSendSize = std::min(data.size() - pnode->nSendOffset, nMaxSend);
    +        int nBytes = send(pnode->hSocket, &data[pnode->nSendOffset], nSendSize, MSG_NOSIGNAL | MSG_DONTWAIT);
             if (nBytes > 0) {
                 pnode->nLastSend = GetTime();
                 pnode->nSendBytes += nBytes;
    
  19. ptschip commented at 3:19 AM on September 24, 2015: contributor

    @theuni Cory, I did a couple of things (see below), neither of which worked. The same problem persists. Only when the receive buffer is changed to 64KB does the problem go away and the test pass. I'm wondering if there is a problem with how truncated data is handled on the receiving end, as would happen with a receive buffer overflow?

    1. disabled optimistic sending. No change in the test. We still fail at exactly the same place.

    2. made the send socket a blocking socket, which according to Microsoft would fix the issue of a delayed ACK. So, delayed ACKing is not the cause of the problem.

    Method 1: Use Blocking Sockets

    This problem only occurs with non-blocking sockets. When you use a blocking socket, this problem does not occur because Afd.sys handles the socket buffer differently. For more information about blocking and non-blocking socket programming, see the Microsoft Platform SDK documentation.

  20. theuni commented at 3:25 PM on September 24, 2015: member

    @ptschip Thanks for trying those things. Did you try the second patch (which sets the max send size to 4000)? I actually had a good feeling about that test. Setting the socket to blocking would introduce tons of new issues since all of our code assumes non-blocking send/recv, so it's no surprise that didn't improve the situation.

    I'll keep looking at the receiving side. That's very helpful.

  21. theuni commented at 3:41 PM on September 24, 2015: member

    @ptschip Are you able to hop on IRC for some real-time troubleshooting? I'm cfields on freenode.

    There are several things I'd like to poke at, but the latency here makes it tough.

  22. ptschip commented at 3:46 PM on September 24, 2015: contributor

    @theuni yeah, i only set the socket to blocking to verify that the problem is not being caused by an overrun send buffer, i wasn't intending it as a fix.

    Didn't get to the second test. I'm not really fully available right now...in the middle of moving to a new home over the next few days/week. But I'll get to it if I can.


  23. ptschip commented at 4:18 PM on September 24, 2015: contributor

    @theuni Sorry, no time for IRC right now but i did just try out your other option, limiting the maxsendsize, but no joy there...

    I'll have more time next week...for some real time troubleshooting but can't right now.

    Peter


  24. theuni commented at 5:12 PM on September 24, 2015: member

    @ptschip Ok, no worries. Good luck with the move.

    Here's another one to try when you get a chance. Here's the theory:

    We may end up in a state where we have received lots of getdatas but not yet processed them all. The message handler and socket handler both grab the recv lock via TRY_LOCK, though it will obviously take the message handler longer to process the message than it takes to grab it from the socket. So it's likely that the socket handler will fail to grab the lock much more often.

    During that time, the receive buffer will fill up because we're not actively draining it. If it gets too full, we may start failing to send() on the other end. Increasing the receive buffer would alleviate some pressure, but it wouldn't be addressing the real issue.

    That's consistent with a "fix" of raising the receive buffer size, while also explaining the send errors.

    Here's a patch to test that theory. Fingers crossed :)

    diff --git a/src/net.cpp b/src/net.cpp
    index 87c4f0a..8f287b5 100644
    --- a/src/net.cpp
    +++ b/src/net.cpp
    @@ -1115,8 +1115,8 @@ void ThreadSocketHandler()
                         }
                     }
                     {
    -                    TRY_LOCK(pnode->cs_vRecvMsg, lockRecv);
    -                    if (lockRecv && (
    +                    LOCK(pnode->cs_vRecvMsg);
    +                    if (1 && (
                             pnode->vRecvMsg.empty() || !pnode->vRecvMsg.front().complete() ||
                             pnode->GetTotalRecvSize() <= ReceiveFloodSize()))
                             FD_SET(pnode->hSocket, &fdsetRecv);
    @@ -1174,8 +1174,8 @@ void ThreadSocketHandler()
                     continue;
                 if (FD_ISSET(pnode->hSocket, &fdsetRecv) || FD_ISSET(pnode->hSocket, &fdsetError))
                 {
    -                TRY_LOCK(pnode->cs_vRecvMsg, lockRecv);
    -                if (lockRecv)
    +                LOCK(pnode->cs_vRecvMsg);
    +                if (1)
                     {
                         {
                             // typical socket buffer is 8K-64K
    
  25. ptschip commented at 2:42 AM on October 1, 2015: contributor

    Hi Cory, I finally got around to trying this... it didn't work at all. I couldn't get past even the first 1 or 2 steps in the python script wallet.py.


  26. ptschip commented at 4:33 PM on October 3, 2015: contributor

    @theuni Cory, making a little progress here. Not sure that this is getting at the root of the issue either, but I found that the following enhancement also fixes the problem; it appears that the sleep time is rather long. (I found that 20 milliseconds also worked.)

    in the ThreadMessageHandler:

             if (fSleep)
    -            messageHandlerCondition.timed_wait(lock, boost::posix_time::microsec_clock::universal_time() + boost::posix_time::milliseconds(100));
    +            messageHandlerCondition.timed_wait(lock, boost::posix_time::microsec_clock::universal_time() + boost::posix_time::milliseconds(10));
  27. ptschip commented at 2:36 AM on October 6, 2015: contributor

    @theuni Cory, here's yet another bit of code that fixes (or masks) the issue: we don't really need to be sleeping if we've just sent messages; rather, we should be checking the receive queue for anything coming back. So now we have 3 different code changes, all performance enhancements, which on their own I think would be beneficial, but none of them gets to the root of the problem.

    -            g_signals.SendMessages(pnode, pnode == pnodeTrickle || pnode->fWhitelisted);
    +            if (g_signals.SendMessages(pnode, pnode == pnodeTrickle || pnode->fWhitelisted))
    +                fSleep = false;

    One other thing I found today, which may be just another rabbit trail, but may also be nearer to the root of the problem is that we seem to be receiving a great many blocks that we've never requested! In the logs of the nodes that fail to sync I see this pattern:

    • 137 blocks requested
    • 185 blocks downloaded
    • 203 blocks actually in the database
    • however, when we RPC a block count we only see 119

    So it appears that all blocks "have" been downloaded and synced somehow, but they are not showing up in the logs as downloaded and they can't be counted by RPC'ing for a block count. And now that I think of it, that was one of the original observations at the beginning of the issue 6554 thread.

    I'll keep digging...

  28. theuni commented at 6:53 AM on October 7, 2015: member

    @ptschip Thanks for continuing to dig.

    Your change (if I'm reading it right) is to set fSleep to false if SendMessages returns true? SendMessages always returns true :)

    So you're effectively disabling the sleep. I bet you'd find that the cpu load is pegged on one core. That does work to strengthen my locking theory, though.

    Please try building from this branch: https://github.com/theuni/bitcoin/tree/net-lock-hacking

    It's a hack, but it should be enough to test my theory. This drops the node's receive lock while it processes messages. The result should be that it races through the messagehandler much quicker.

  29. ptschip commented at 2:04 PM on October 7, 2015: contributor

    @theuni I built your code and ran it, it also fixes the issue.


  30. theuni commented at 4:30 PM on October 7, 2015: member

    Woohoo, we've found the root cause! Thanks very much for your testing.

    Now to come up with some sane way to deploy it.

  31. ptschip commented at 7:10 PM on October 7, 2015: contributor

    @theuni That's great Cory!

    And, although this PR only masks the underlying problem I still think it has merit from a performance perspective and should be merged in to bring the Windows nodes up to the same tcp buffer size as the OSX and Linux nodes. What do you think?


  32. laanwj commented at 7:43 PM on October 7, 2015: member

    What do you think?

    Fine with increasing the buffer size, but in the commit message and pull description then call it 'improve performance' instead of 'fix issue'. (and it would be nice to have measurements that show it does improve performance)

  33. theuni commented at 8:47 PM on October 7, 2015: member

    I'm a bit skeptical that it will actually improve performance in a measurable way. Assuming the change above fixed the issue for the reason I think it did, the problem is that the socket is not being drained for long periods of time due to the eventhandler thread keeping it locked. Without that lock held, the buffer should be drained within the first few passes (ideally the first), and the time between passes should be close to zero. In fact, I suspect that the single recv buffer would become the bottleneck more quickly as soon as we bump the socket buffer above 64k: https://github.com/bitcoin/bitcoin/blob/master/src/net.cpp#L1182. That directly affects how long it takes to drain the socket.

    IMO if we're going to bump, we need to bump that buffer as well, since that directly affects how many passes through select() are needed to drain a message. Alternatively we could do it in a loop so that we always drain as much as possible, though I suspect that the current behavior is intended to enforce fairness among small/large messages.

    It should be pretty easy to come up with some tests to demonstrate the effects of each possible change:

    • Benchmark a worst case for a pass through ProcessMessages()
    • Benchmark a worst case for the number of passes it takes to drain a full recv socket
    • Assume the worst: Ignoring the socket handling overhead, it could take up to time(ProcessMessages) * passes to fully drain.

    By tweaking the socket recv and loop recv buffers, we should be able to calculate a measurable effect. I'll see what I can come up with.

    Edit: 10k -> 64k

  34. ptschip commented at 4:08 AM on October 8, 2015: contributor

    @theuni Cory, that line in the code is actually 64KB in hex, so it wouldn't need to be changed for this. All we need is to up the send/receive buffer to 64KB. But if we want to go higher, then yes, we would need to bump pchBuf as well.

                        // typical socket buffer is 8K-64K
                        char pchBuf[0x10000];



  35. ptschip renamed this:
    Fix for issue 6554 - Nodes sometimes will not sync on Native Windows
    Set Windows TCP buffers to 64KB to match OSX and Unix
    on Oct 8, 2015
  36. Set Windows TCP buffers to 64KB to be match OSX and Unix 278447a15b
  37. ptschip force-pushed on Oct 8, 2015
  38. ptschip commented at 4:58 AM on October 8, 2015: contributor

    I changed both the commit message and pull description to "Set Windows TCP buffers to 64KB to match OSX and Unix". I thought there would be fewer questions about why OSX and Unix were left out than if I had just mentioned "improvements for Windows".

    I've already run tests on this last week, syncing a 600MB blockchain over a private 78Mbit wireless network, running Windows 7 on both machines, for both 8KB buffers and 64KB and got a 12% improvement in overall sync time.

    These were the results of the tests, each run 3 times:

        8KB buffer node sending to 64KB buffer node:  347 secs, 342 secs, 344 secs (344 sec average)
        64KB buffer node sending to 64KB buffer node: 309 secs, 311 secs, 294 secs (304 sec average)

    On 07/10/2015 12:45 PM, Wladimir J. van der Laan wrote:

    What do you think?
    

    Fine with increasing the buffer size, but in the commit message and pull description then call it 'improve performance' instead of 'fix issue'. (and it would be nice to have measurements that show it does improve performance)


  39. theuni commented at 5:20 AM on October 8, 2015: member

    @ptschip I'd be curious to see how it compares when using the patch above instead.

    Also, not sure about OSX, but Linux sock sizes vary a good bit. On my machine I get: recvbuf: 87380. sendbuf: 22320, but I'm pretty sure those fluctuate based on the current conditions.

  40. ptschip commented at 1:55 AM on October 9, 2015: contributor

    @theuni, I'm curious about that too, I'll run some tests when I get a chance but won't be for a few days, I'm busy with something else and it's Thanksgiving this weekend in Canada also :) ...

    On 07/10/2015 10:21 PM, Cory Fields wrote:

    @ptschip https://github.com/ptschip I'd be curious to see how it compares when using the patch above instead.

    Also, not sure about OSX, but Linux sock sizes vary a good bit. On my machine I get: recvbuf: 87380. sendbuf: 22320, but I'm pretty sure those fluctuate based on the current conditions.


  41. ptschip commented at 12:57 AM on October 14, 2015: contributor

    @theuni Cory, I got around to testing your "net-lock-hacking" code and the results are pretty good.

    Tested again on a Windows 7 with a wireless high speed connection (not over the internet)...syncing a 600MB blockchain, 3 tests each and average taken.

    Current master build with 8KB TCP send buffer, 8KB receive buffer: 236 seconds to sync. Cory's net-lock code with 8KB TCP send buffer and 8KB receive buffer: 194 seconds to sync.

    Almost an 18% improvement. Even so, we're still only using about 1/3 of the network bandwidth, so there's clearly more room to increase throughput. Judging from the logs, about 50 seconds is spent just verifying blocks; not sure much can be done about that, but there should be another 40 or 50 seconds to save somewhere to get bandwidth utilization up to 60 or 70%.

    Anyway, I thought you'd like to know the results...great stuff.

    I'm still looking at the buffer sizes and am going to run some tests tomorrow. After your code changes, and given the low latency of my network, I'm not seeing much difference in performance from the TCP buffer sizes. I'm going to try testing from a separate internet connection to see what happens when there's more latency on the wire.

  42. theuni commented at 1:31 AM on October 14, 2015: member

    @ptschip That's great to hear!

    As for the peak/trough download pattern, that's a much larger issue. Though I'm working to fix it as part of a much more invasive refactor.

  43. laanwj commented at 11:35 AM on November 5, 2015: member

    typical socket buffer is 8K-64K
    char pchBuf[0x10000];

    Be careful increasing that - it is allocated on the stack. When I investigated bitcoin's stack frames in the context of the UPnP buffer overflow, ThreadSocketHandler came out as the function with the single biggest stack frame in our source base. I think 64kB is acceptable (thread stacks are usually 4-8MB), but it is something to keep an eye on, we wouldn't want to introduce e.g. a stack overflow DoS.

    BTW: what is the outcome of this? Are you going to PR anything @theuni, or is it better to leave this be for 0.12 and aim for merging the libevent transition in 0.13?

  44. luke-jr commented at 11:40 AM on November 5, 2015: member

    (thread stacks are usually 4-8MB)

    Are we explicitly setting that? Mac OS X thread stacks are only 512k by default...

  45. laanwj commented at 12:49 PM on November 5, 2015: member

    Are we explicitly setting that? Mac OS X thread stacks are only 512k by default...

    No, we aren't changing it. 4-8MB is for Linux. But this is the reason why I say to be careful, it's indeed possible for OSes to have even smaller stacks.

  46. ptschip commented at 3:54 PM on November 5, 2015: contributor

    To allow for further investigation, I think we should hold off on this merge for now. I'm seeing unexpected performance results when buffers are increased and network latency is introduced, and I'm not sure whether it's my test setup or something deeper in the messaging system. Once @theuni submits the backend messaging refactor he's working on, I can re-test and investigate.

    On 05/11/2015 3:36 AM, Wladimir J. van der Laan wrote:

    typical socket buffer is 8K-64K
    char pchBuf[0x10000];
    

    Be careful increasing that - it is allocated on the stack. When I investigated bitcoin's stack frames in the context of the UPnP buffer overflow, ThreadSocketHandler came out as the function with the single biggest stack frame in our source base. I think 64kB is acceptable (thread stacks are 4-8MB), but it is something to keep an eye on, we wouldn't want to introduce e.g. a stack overflow DoS.

    BTW: what is the outcome of this? Are you going to PR anything @theuni https://github.com/theuni, or is it better to leave this be for 0.12 and aim for merging the libevent transition in 0.13?


  47. laanwj commented at 2:51 PM on November 10, 2015: member

    Closing this for now - to be revisited later

  48. laanwj closed this on Nov 10, 2015

  49. theuni commented at 9:22 PM on November 10, 2015: member

    @laanwj Sorry, missed your ping here. Yes, I'd like to leave this alone for 0.12. The lock interaction is far too complicated.

  50. laanwj commented at 5:27 PM on February 16, 2016: member

    Probably should revisit this for 0.13. Is this related to "Test Nodes will sometimes not sync their blocks on native Windows #6554"?

  51. ptschip commented at 6:46 PM on February 16, 2016: contributor

    It does work around issue 6554 but I believe that Cory discovered the real root cause of 6554 was in the message handling and the locking issues therein.

    On 16/02/2016 9:28 AM, Wladimir J. van der Laan wrote:

    Probably should revisit this for 0.13. Is this related to "Test Nodes will sometimes not sync their blocks on native Windows #6554"?


  52. ptschip deleted the branch on Dec 23, 2016
  53. ptschip restored the branch on Feb 18, 2018
  54. DrahtBot locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-17 03:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me