Reduce latency of ThreadMessageHandler. (3.2x speedup for IBD to 200k) #9415

pull gmaxwell wants to merge 4 commits into bitcoin:master from gmaxwell:better_sleep changing 4 files +51 −65
  1. gmaxwell commented at 10:17 pm on December 23, 2016: contributor

    Instead of using a fixed 100ms sleep in the condition variable wait, this changes it to sleep until 100ms after the loop iteration started, so if an iteration takes a long time (e.g. due to waiting on the vnodes lock at the end) the next one starts sooner.

    Also avoids a potential 100ms delay before announcing a new block that arrived via submitblock, prevents delays in processing messages for early peers when they arrive partway through the message handler loop, and reruns the message handler when some peers don’t finish SendMessages due to contention for cs_main.

    The final commit changes message handling to hold cs_vRecvMsg only long enough to move the message out. This speeds up IBD to block 200k from a single peer over GbE by about 200% compared to master, by eliminating cs_vRecvMsg contention (which the prior commits happened to make even worse). With this change on both sides the speedup is ~3.2x.
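
    For illustration only, here is a minimal C++ sketch of the first commit’s idea, with assumed names (condMsgProc, mutexMsgProc, RunMessageHandlerOnce) standing in for the real ones: the 100ms deadline is taken at the top of the loop, so a slow iteration shortens or skips the following sleep instead of adding to it.

    ```cpp
    // Hedged sketch: names and structure are illustrative, not the patch itself.
    #include <atomic>
    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    static std::condition_variable condMsgProc;
    static std::mutex mutexMsgProc;
    static std::atomic<bool> fShutdown{false};

    // Placeholder for one pass of ProcessMessages/SendMessages over all peers.
    static void RunMessageHandlerOnce() {}

    static void ThreadMessageHandler()
    {
        while (!fShutdown) {
            // Take the deadline *before* doing any work: if this iteration is slow
            // (e.g. it blocks on cs_vNodes), the wait below is correspondingly
            // shorter, or skipped entirely.
            const auto deadline = std::chrono::steady_clock::now() + std::chrono::milliseconds(100);

            RunMessageHandlerOnce();

            std::unique_lock<std::mutex> lock(mutexMsgProc);
            // No predicate: a spurious wakeup just costs one cheap extra pass.
            condMsgProc.wait_until(lock, deadline);
        }
    }
    ```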

  2. fanquake added the label P2P on Dec 23, 2016
  3. theuni commented at 7:41 am on December 24, 2016: member

    There’s so much currently wrong with this loop that I’m hesitant to make changes before rewriting it entirely. I’ve been discussing my plans with @TheBlueMatt for doing just that after #9289 is merged.

    Some other fun issues with this loop (I’m sure I’m leaving several out):

    • messageHandlerCondition wakes spuriously
    • The depth of vRecvGetData and messages to send aren’t taken into consideration
    • cs_vRecvMsg is held for the duration of message processing, leading to insane lock contention

    I have no objections to slapping some band-aids on here, but I’m really really hoping to address all of those at once in my next PR which moves this handler into net_processing. As a bonus, it should simplify @TheBlueMatt’s multi-threaded processing.

  4. gmaxwell commented at 10:19 pm on December 24, 2016: contributor

    messageHandlerCondition wakes spuriously

    who cares? :) It’s not like this loop or processing that results from random wakes ever shows up in profiles.

    If it were a real concern it would be easy to go back to sleep if the boolean hasn’t been flipped and not enough time has passed.

    but I’m really really hoping to address all of those at once in my next PR which moves this handler into net_processing.

    My guess is that we may be a bit close to the end of 0.14 to achieve much more in the way of big changes, especially ones with difficult to test concurrency implications. I just don’t want efforts to make big sweeping changes to stand in the way of meaningful improvement– we probably should have made the changes in this PR a year ago.

    I’m reviewing 9289– hopefully we can get that in soon.

  5. TheBlueMatt commented at 10:33 pm on December 24, 2016: member

    I’m hoping the next PR isn’t huge in the way of complicated locking implications, more like simplified locking implications compared to master. Let’s wait and take a look next week (and if we decide it can’t make it we can come back to here - I think the potential advantage in initial from-network sync time from “doing it right” might be worth the extra review/testing implications).


  6. theuni commented at 6:00 am on December 25, 2016: member

    who cares? :) It’s not like this loop or processing that results from random wakes ever shows up in profiles.

    My concern was more that some platforms (win32) may vary significantly in their wakeup behavior.

    Completely agree with you though. I certainly wasn’t suggesting that we not take the improvements here for the sake of something coming later. In fact, the wake from the miner is necessary either way. I was just pointing out that I’m hoping to be able to delete most of this code soon anyway :)

  7. gmaxwell commented at 11:52 am on December 27, 2016: contributor
    Mystery: this slows the IBD of the first 200k blocks from a local gigabit ethernet connected peer by about 40%. I’ve run the test 4 times each way, same result. Maybe if we solve that, we make IBD faster.
  8. gmaxwell commented at 11:39 pm on December 27, 2016: contributor
    I bisected my patch down to the cause of the IBD performance regression: cs_vSend/cs_vRecvMsg contention causing fsleep=false.
  9. TheBlueMatt commented at 11:34 am on December 28, 2016: member

    Ugh, yea, those desperately need splitting. I know @theuni has been working on a patch to do that; see also https://github.com/bitcoin/bitcoin/pull/9419/commits/c214d120a363a05ba9afdccff6b4bda6e29ae7c4


  10. gmaxwell commented at 8:12 am on December 29, 2016: contributor
    I tested reducing the message handler’s hold on the cs_vRecvMsg lock to a window just wide enough to std::move the message out; this sped up IBD to 200k from a single peer by more than a factor of two over master.
  11. Reduce latency of ThreadMessageHandler.
    Instead of using a 100ms sleep in the condition variable, this changes
     it to sleep for 100ms after the loop started, so if the loop takes a
     long time (e.g. due to waiting on the vnodes lock at the end) it will
     restart sooner.
    a832c9e10a
  12. Make submitblock wake the message handler thread.
    Avoids a potential 100ms delay before announcing a new block that
     arrived via submitblock.
    685f5047fa
  13. Introduce an atomic flag to indicate when the message handler has more work.
    This prevents delays in processing messages for early peers when they arrive
     part way through the message handler loop.
    
    This also reruns the message handler when some peers don't finish sendmessages
     due to contention for cs_main.
    b95d7a506e
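
    As a rough sketch of the wake mechanism the two commits above describe (WakeMessageHandler, fMsgProcWake and condMsgProc are assumed names, and a mutex-guarded bool is shown rather than a bare atomic so a notification cannot be missed between the check and the wait): a thread that produces work sets the flag and notifies, and the handler only sleeps while the flag is clear.

    ```cpp
    // Illustrative sketch only, not the patch itself.
    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    static std::condition_variable condMsgProc;
    static std::mutex mutexMsgProc;
    static bool fMsgProcWake = false;

    // Called from e.g. submitblock or the socket thread when new work appears,
    // so the message handler does not sit out the remainder of its sleep.
    static void WakeMessageHandler()
    {
        {
            std::lock_guard<std::mutex> lock(mutexMsgProc);
            fMsgProcWake = true;
        }
        condMsgProc.notify_one();
    }

    // In the handler loop: sleep until the deadline only if nobody signalled work.
    static void MaybeSleep(std::chrono::steady_clock::time_point deadline)
    {
        std::unique_lock<std::mutex> lock(mutexMsgProc);
        condMsgProc.wait_until(lock, deadline, [] { return fMsgProcWake; });
        fMsgProcWake = false;
    }
    ```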
  14. gmaxwell commented at 2:51 pm on December 29, 2016: contributor

    Sync to 200k with master: 243.7 blocks per second. With this PR on the receiver: 465.431 blocks per second. With the PR on both sender and receiver: 782.287 (3.2x master’s speed).

    (Chainstate reindex is about 1000/s so there is still perhaps some room for improvement.)

  15. gmaxwell renamed this:
    Reduce latency of ThreadMessageHandler.
    Reduce latency of ThreadMessageHandler. (3.2x speedup for IBD to 200k)
    on Dec 29, 2016
  16. Only hold cs_vRecvMsg long enough to move out the message.
    This also eliminates a race on nSendSize by making it atomic.
    
    This speeds up IBD up to 200k from a single peer over GbE
     compared to master by about 200%, by eliminating cs_vRecvMsg
     contention (that the prior commits happened to also make worse).
    e255f7dcae
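
    A minimal sketch of the narrow-lock pattern this commit describes (the structures are heavily simplified; only the names cs_vRecvMsg, vRecvMsg and nSendSize mirror the real ones): the lock is held only while a message is spliced out of the peer’s receive list, and processing happens with the lock released.

    ```cpp
    // Illustrative only: the surrounding logic is simplified.
    #include <atomic>
    #include <cstddef>
    #include <list>
    #include <mutex>

    struct CNetMessage { /* header, data, ... */ };

    struct CNode {
        std::mutex cs_vRecvMsg;
        std::list<CNetMessage> vRecvMsg;  // filled by the socket thread
        std::atomic<size_t> nSendSize{0}; // atomic, so readers need no lock
    };

    static void ProcessOneMessage(CNode&, const CNetMessage&) { /* ... */ }

    static void ProcessMessages(CNode& node)
    {
        std::list<CNetMessage> msgs;
        {
            // Hold cs_vRecvMsg only long enough to move the first message out;
            // the socket thread can keep appending while we do the real work.
            std::lock_guard<std::mutex> lock(node.cs_vRecvMsg);
            if (node.vRecvMsg.empty()) return;
            msgs.splice(msgs.begin(), node.vRecvMsg, node.vRecvMsg.begin());
        }
        ProcessOneMessage(node, msgs.front()); // no cs_vRecvMsg held here
    }
    ```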
  17. theuni commented at 3:46 pm on December 29, 2016: member

    Looks like we’re stepping on each other :)

    We’re definitely working to solve the same issue, though it seems we disagree on the approach. This is less flexible for a clean separation of layers as it requires constant two-way communication between the net layer and the message processor. Done this way, the net must notify the processor that new messages are available, then the processor must grab the messages from the net.

    The more typical approach (what I’ve seen, anyway) is a one-way model where the net throws its messages to the processor, which is forced to consume them. That greatly reduces the risk of the processor throttling the network. It also makes things much easier as processing becomes multi-threaded, because the processor maintains the message queue, as opposed to having each new thread contend with the net lock. So the net side doesn’t care what happens once it’s handed off a message, it only has to have a means of being corked if the processor is at capacity.

    The last abstraction step (moving processing out of CConnman, as seen in a quick/dirty attempt here) relies on this one-way model.

    All that to say: I prefer the approach in #9441, which yields the same speedups :)
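
    For comparison, a rough sketch of the one-way model described above (this is not the actual #9441 code; the interface and names are assumptions): the net layer pushes each completed message to the processor and forgets about it, using only the boolean result to decide whether to cork further receives from that peer.

    ```cpp
    // Illustrative interface only; not taken from #9441.
    #include <vector>

    struct CNetMessage { std::vector<unsigned char> data; };

    class MessageProcessor {
    public:
        virtual ~MessageProcessor() = default;
        // The net layer hands the message over and is done with it. The return
        // value is the only feedback: false means "at capacity, cork this peer".
        virtual bool ReceiveMessage(int peer_id, CNetMessage&& msg) = 0;
    };

    // Socket-side usage, roughly:
    //   if (!processor.ReceiveMessage(id, std::move(msg))) fPauseRecv[id] = true;
    // and the processor uncorks the peer once its internal queue drains.
    ```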

  18. gmaxwell commented at 4:18 pm on December 29, 2016: contributor

    I don’t follow your two-way comments. There is communication only one way here: from the ThreadSocketHandler to ThreadMessageHandler for receive, and the reverse for sending.

    The processing cannot itself throttle the network (except in the sense that the input rate can exceed the processing rate): assuming we haven’t hit the GetReceiveFloodSize limit, the only time message processing can stall reception is while holding the per-peer lock during a simple move. The network does let it know when new things are available– allowing it to sleep when there is nothing to do.

    If messages were to be handled concurrently, it would be no finer-grained than per-peer+per-direction in any case, to preserve ordering; that is exactly the granularity of the locks in this configuration. Though message handling takes cs_main for effectively everything, so at the moment there wouldn’t really be any gain in making it concurrent.

    Perhaps you’re referring to the WakeMessageHandler calls in ProcessMessages? Those are communication from ThreadMessageHandler to itself– just reusing the mechanism so that the handler doesn’t sleep when there is more work to be done. It could more cleanly be done with a return argument, but the signaling mechanism already exists.

    which yields the same speedups

    Sure takes a lot more code to do it. The purpose of its architecture is unclear to me; I don’t see what we’re gaining for the extra complexity.

  19. theuni commented at 4:45 pm on December 29, 2016: member

    I don’t follow your two-way comments. There is communication only one way here: from the ThreadSocketHandler to ThreadMessageHandler for receive, and the reverse for sending.

    ThreadSocketHandler notifies ThreadMessageHandler that it has a new message. ThreadMessageHandler then reaches into the socket handler’s buffer to get and remove it. Logically the queue belongs in the processor so that it can delegate intelligently. After all, the net is done with a message as soon as it’s come in.

    Sure takes a lot more code to do it. The purpose of its architecture is unclear to me, I don’t see what we’re gaining for the extra complexity.

    Eh? It’s essentially 2 changes. The first changes it to use a return value rather than signaling (as you suggested above): https://github.com/bitcoin/bitcoin/pull/9441/commits/1d232f47ab9ad978aa3324e390f8baca1bea72e3 And the second looks much like yours: https://github.com/bitcoin/bitcoin/pull/9441/commits/65e916c499658e656a6e23455f22666fb0068edc

    Maybe you’re referring to the boost commits that it’s built on? That’s only because they touch so much of the same code that I didn’t want to have to keep up with rebasing.

  20. gmaxwell commented at 5:12 pm on December 29, 2016: contributor

    After all, the net is done with a message as soon as it’s come in.

    I don’t think it is– e.g. we need to stop reading data off the socket if we have too much queued data. Otherwise the memory usage is unbounded: I don’t think going around and lazily asking it to stop is sufficient, as AFAIK we have no guarantee about how much data we can receive between the times the message handler makes its way back around to notice that it’s too large. E.g. with cold caches it can take 5 seconds to validate a block, and on a gigabit link (~125 MB/s) you could receive >500MB of data in that time. It looks to me like your code handles this by having the socket side stay aware of the queue usage. So it’s functionally fine, though you end up duplicating the size-management code in both message handling and socket handling.

    I don’t think that “belongs” is even the right way to think about this. The queue “belongs” to the node– as it’s the destruction of the node that ultimately destroys the queue. It is accessed by the socket thread and the message handling thread, for their respective operations.
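
    For concreteness, a sketch of the socket-side bound being argued about here (nRecvFloodSize and the member names are assumptions): the socket thread itself stops reading from a peer whose unprocessed bytes exceed the flood limit, so memory stays bounded no matter how slowly the processor drains the queue.

    ```cpp
    // Illustrative only: the check lives on the socket side, so no lag in the
    // message handler can let the per-peer queue grow past the limit.
    #include <atomic>
    #include <cstddef>

    static const size_t nRecvFloodSize = 5 * 1000 * 1000; // e.g. ~5 MB per peer

    struct CNode {
        std::atomic<size_t> nProcessQueueSize{0}; // bytes received but not yet processed
        bool fPauseRecv = false;                  // socket thread skips recv() while set
    };

    // Called by the socket thread after queuing a complete message.
    static void OnMessageQueued(CNode& node, size_t msg_bytes)
    {
        node.nProcessQueueSize += msg_bytes;
        node.fPauseRecv = node.nProcessQueueSize > nRecvFloodSize;
    }
    ```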

  21. theuni commented at 5:48 pm on December 29, 2016: member

    I don’t think it is– e.g. we need to stop reading data off the socket if we have too much queued data. Otherwise the memory usage is unbounded: I don’t think going around and lazily asking it to stop is sufficient, as AFAIK we have no guarantee about how much data we can receive between the times the message handler makes its way back around to notice that it’s too large.

    Nor do I.

    With each message, the message handler immediately notifies the socket handler whether it should cork. There’s no lag there, it’s done atomically with the message hand-off: https://github.com/bitcoin/bitcoin/pull/9441/files#diff-eff7adeaec73a769788bb78858815c91R2463

    In the refactored code, the cork is the return value of the “take my message” function: https://github.com/theuni/bitcoin/commit/1a6b10aea129ac363727c2d68fae809c2861e4da#diff-9a82240fe7dfe86564178691cc57f2f1R1255

    I don’t think that “belongs” is even the right way to think about this. The queue “belongs” to the node– as it’s the destruction of the node that ultimately destroys the queue. It is accessed by the socket thread and the message handling thread, for their respective operations.

    Disagreed. These are two separate layers. The queue could easily outlive the node. For example, during IBD a node may send 10 blocks and immediately disconnect. There’s no requirement that we destroy the queue before deleting the node.

  22. gmaxwell closed this on Dec 29, 2016

  23. DrahtBot locked this on Sep 8, 2021
