validation: Warm coins cache during prevalidation to connect blocks faster #19271

pull andrewtoth wants to merge 5 commits into bitcoin:master from andrewtoth:warm-coinscache changing 2 files +213 −12
  1. andrewtoth commented at 1:40 am on June 14, 2020: contributor

    Retrieving coins from disk is a significant source of ConnectBlock latency. This PR speeds up ConnectBlock by retrieving coins from disk on a separate thread before ConnectBlock is reached. When a block is passed into ProcessNewBlock, its coins are warmed immediately, before prevalidation checks begin.

    Benchmarking IBD with -prune=10000 and default -dbcache from blocks 600000-633000 gives a 7.1% increase in speed. Since this is only really useful when blocks take a long time to prevalidate, benchmarks from genesis only give a 2.3% increase. However, this should keep increasing as more large blocks get mined, so it will be even more useful in the future. Of course this will be less useful with high -dbcache values.
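
    Conceptually, the warming step amounts to a loop like the following (a minimal sketch rather than the exact code in the diff; the free-function form and the cancellation flag are simplifications):

        // Sketch: pull every coin the block will spend into the in-memory
        // cache, so ConnectBlock later finds it without touching disk.
        void WarmCoins(CCoinsViewCache& cache, const CBlock& block,
                       const std::atomic<bool>& cancel)
        {
            for (const auto& tx : block.vtx) {
                if (tx->IsCoinBase()) continue; // a coinbase spends no prior coins
                for (const CTxIn& txin : tx->vin) {
                    if (cancel) return; // validation wants the cache; stop early
                    cache.AccessCoin(txin.prevout); // disk read happens on a cache miss
                }
            }
        }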

  2. fanquake added the label Validation on Jun 14, 2020
  3. andrewtoth commented at 1:40 am on June 14, 2020: contributor
    Big thanks to @JeremyRubin for giving me this idea.
  4. JeremyRubin commented at 2:20 am on June 14, 2020: contributor

    Concept ACK and lite CR-ACK; the approach seems to make sense.

    Very nice work! Did this help with the variance from before?

    I don’t want to go overkill, but I wonder if there’s a nicer way to communicate the start warming/stop warming instructions (e.g. some sort of interface rather than just tweaking flags). I also wonder if it makes sense to just always completely finish warming the cache or not? Aborting warming early still means the work has to happen in serial (at some point) with the main thread. Is there ever a circumstance where the warmer fills the cache so full that it evicts its own warmed entries? Can we detect this and just stop then?

  5. DrahtBot commented at 4:15 am on June 14, 2020: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #19604 (Pass mempool pointer to UnloadBlockIndex/GetCoinsCacheSizeState by MarcoFalke)
    • #19306 (refactor: Replace RecursiveMutex with Mutex in CTxMemPool by hebasto)
    • #18766 (Disable fee estimation in blocksonly mode by darosior)
    • #18731 (refactor: Make CCheckQueue RAII-styled by hebasto)
    • #17479 (Return BlockValidationState from ProcessNewBlock if CheckBlock/AcceptBlock fails by jnewbery)
    • #15545 ([doc] explain why CheckBlock() is called before AcceptBlock by Sjors)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  6. naumenkogs commented at 8:23 am on June 14, 2020: member

    Concept ACK. I think speeding up connecting blocks is always a good idea.

    I see you’re introducing a new thread. That might be an issue. Last time I tried that it didn’t work out well: see the discussion here and an issue I created here. Maybe you want to express your opinion there or during the IRC meeting, as it’s likely you’ll face the same issue with this PR.

  7. andrewtoth commented at 1:43 pm on June 14, 2020: contributor

    Did this help with the variance from before?

    @JeremyRubin my personal benchmarking bandwidth is rather limited. I’m trying to get another machine set up to do more extensive benchmarking on this. My numbers were done on a 12 virtual core CPU with an internal SSD. I think the performance increase might vary widely depending on CPU and disk type. However, the results I had were very promising, so I submitted this for concept approval. I will come back with some more numbers soon(ish).

    I don’t want to go overkill but I wonder if there’s a nicer way to communicate the start warming/stop warming instructions (e.g. not just tweaking flags so there’s some sort of interface).

    I thought I had done a pretty good job of this by wrapping it in WarmBlock/CancelWarmingBlock methods. I’ll think some more on this. I’m open to suggestions.

    I also wonder if it makes sense to just always completely finish warming the cache or not? Aborting warming early still means the work has to happen in serial (at some point) with the main thread.

    My benchmarks showed this consistently hurt performance. I also tried locking/unlocking inside ConnectBlock only where the coins were actually accessed, but that was also less performant and would have made this diff a lot more complex.

    Is there ever a circumstance where the warmer gets the cache full that it evicts its own warmed entries? Can we detect this and just stop then?

    I don’t think this can occur. The cache is flushed only after ConnectBlock, in ConnectTip, so the effect is the same whether the coins are fetched in ConnectBlock or right before it.

  8. andrewtoth commented at 1:46 pm on June 14, 2020: contributor
    @naumenkogs I don’t think your issue applies here. The work this thread needs to do is on the order of milliseconds, but it also needs to be immediately available for there to be any benefit. If it had to wait some milliseconds for scheduling it would be useless.
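
    To illustrate the difference (a sketch under my assumptions; only the thread name ThreadWarmCoinsCache appears in the PR itself): a dedicated thread parked on a condition variable wakes as soon as it is notified, whereas a job posted to a scheduler can sit behind other queued work.

        #include <condition_variable>
        #include <memory>
        #include <mutex>

        // CBlock comes from primitives/block.h in Bitcoin Core.
        std::mutex g_warm_mutex;
        std::condition_variable g_warm_cv;
        std::shared_ptr<const CBlock> g_pending_block;

        void ThreadWarmCoinsCache()
        {
            while (true) {
                std::shared_ptr<const CBlock> block;
                {
                    std::unique_lock<std::mutex> lock(g_warm_mutex);
                    // Wakes within microseconds of the notify, with no
                    // scheduler queue in between.
                    g_warm_cv.wait(lock, [] { return g_pending_block != nullptr; });
                    block.swap(g_pending_block);
                }
                // ... warm the coins spent by *block ...
            }
        }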
  9. in src/validation.cpp:3937 in 548e17302b outdated
    3938@@ -3860,6 +3939,9 @@ bool ChainstateManager::ProcessNewBlock(const CChainParams& chainparams, const s
    3939         // Therefore, the following critical section must include the CheckBlock() call as well.
    3940         LOCK(cs_main);
    3941 
    3942+        // Before doing any checks, begin warming the cache to process this block
    3943+        ::ChainstateActive().WarmBlock(pblock);
    


    JeremyRubin commented at 5:11 pm on June 14, 2020:

    Are we always guaranteed to have checked the PoW by this point? I think so, because of AcceptBlockHeader?

    Worth documenting that this is expected or putting the WarmBlock call inside CheckBlock after the PoW check.


    andrewtoth commented at 10:28 pm on July 18, 2020:
    Updated to explicitly check PoW here if we haven’t already.
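
    A minimal sketch of such a guard (the placement and exact arguments are illustrative assumptions; CheckProofOfWork is the existing helper from pow.h):

        // Only start warming once the header's proof of work is known to be
        // valid, so a junk block cannot trigger speculative disk reads.
        if (CheckProofOfWork(pblock->GetHash(), pblock->nBits, chainparams.GetConsensus())) {
            ::ChainstateActive().WarmBlock(pblock);
        }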
  10. JeremyRubin commented at 5:15 pm on June 14, 2020: contributor

    Yeah, you have done a decent job with it!

    Concretely, it would be my preference to have a separate object for the warming thread (owned by the chainstate) so that the inner fields of the warming logic are encapsulated away from the chainstate. That way it’s more explicit what the coin warmer can access, and its state variables can only be reached through the interface. I would then also hide the variables from the CoinsTip function and access them through an interface. This is just a style preference so feel free to ignore it (some people also feel that objects are too heavyweight), but in particular I feel that an object to own/manage a thread is a good abstraction, as it helps make clear which fields are protected by m_cs_warm_block (also check out the locking annotations used elsewhere in the code; they would help with this patch).
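
    Sketched out, what I have in mind is roughly the following (the names and interface here are illustrative only, not prescriptive; the annotations follow the threadsafety.h style used elsewhere in the codebase):

        // A self-contained owner for the warming thread: its state is private
        // and only reachable through the two calls validation code needs.
        class BlockWarmer
        {
        public:
            void WarmBlock(std::shared_ptr<const CBlock> pblock); // begin warming
            void CancelWarmingBlock(); // returns once the warmer has yielded
        private:
            Mutex m_cs_warm_block;
            std::shared_ptr<const CBlock> m_block_to_warm GUARDED_BY(m_cs_warm_block);
            std::thread m_warm_thread; // parked until WarmBlock signals new work
        };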

  11. luke-jr commented at 11:40 pm on June 14, 2020: member
    Concept ACK
  12. jnewbery commented at 0:03 am on June 17, 2020: member

    This is the kind of change that immediately scares me. It’s adding complexity in a consensus-critical area, without great benefit to the critical metrics:

    • IBD: a 2.3% speedup is nice, but not worth the complexity/risk
    • Block propagation: I doubt this will have any impact since we relay compact blocks before calling ActivateBestChain(), which is where the coins are needed.

    So I’d tend to concept NACK this kind of change. It seems to go in the opposite direction from what we’d want, which is to make validation-critical code as simple as possible.

  13. andrewtoth commented at 2:45 pm on June 17, 2020: contributor

    @jnewbery I can definitely appreciate that perspective! However, I think concentrating on the 2.3% speedup is a pessimistic take on this patch. It shows a 7.1% increase in speed for recent blocks, which are representative of what we’ll continue to see in the future. While a user only has to perform IBD from genesis once, they will have to sync recent blocks every time they turn on the node. I think optimizing the common user flow of using the node, then turning off for days/weeks/months until it’s used again will prove to be a big benefit. I also think I can find a way to speed up IBD from genesis somewhat, but I’m still experimenting and will get more concrete numbers together. As I said in the description, this will only become more beneficial as the block chain grows.

    I also don’t think this is that invasive a patch. It doesn’t change any logic in the actual validation code; it just loops through the inputs of a block before the actual validation occurs. This seems like an easy win for a bump of 7%.

    You are correct that this doesn’t improve block propagation. I didn’t claim that it did. However, block propagation is only the first step for a block to be useful to a user; it still has to be connected after it is propagated. Increasing propagation to miners isn’t as useful if it still takes a long time to validate the block afterwards. Miners (and all users, really) want to have the block connected as soon as possible. This can reduce forks by allowing miners to start mining on top of the block faster, and improve miner revenue by more quickly determining the optimal block template to mine on.

  14. jnewbery commented at 3:23 pm on June 17, 2020: member

    I think optimizing the common user flow of using the node, then turning off for days/weeks/months until it’s used again will prove to be a big benefit.

    This isn’t a usage pattern that we should be optimizing for:

    • it happens very infrequently (at most once every few days/weeks/months if the user really does stop-start their node frequently)
    • I’d argue that for this scenario, a 7.1% increase isn’t that meaningful: if a node takes one hour to resync from cold with this PR, I’m probably ok waiting one hour and four minutes without it.

    I’ve had a quick read through the implementation and I don’t see how this can work. Your ThreadWarmCoinsCache thread is taking cs_main, grabbing a pointer to the coins tip, then dropping cs_main and continuing to write to the cache object. Meanwhile, the message handler thread or any other thread could also be accessing that same cache and reading/writing to it. Surely cs_main needs to be held for as long as you have the coins view cache?

  15. andrewtoth commented at 4:18 pm on June 17, 2020: contributor

    Meanwhile, the message handler thread or any other thread could also be accessing that same cache and reading/writing to it. Surely cs_main needs to be held for as long as you have the coins view cache?

    @jnewbery any other thread accessing CoinsTip requires cs_main to be held first. Calling CoinsTip will cancel the current block warming and acquire the warming lock, and the warmer can’t start warming another block until it takes cs_main again. I can make this more clear with refactoring as per the comment here.
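
    A sketch of that synchronization (the shape is illustrative; m_block_warmer is a hypothetical member name):

        // Every consumer reaches the cache through CoinsTip() while holding
        // cs_main, so cancelling here guarantees the warmer is not mid-write
        // by the time the caller touches the cache.
        CCoinsViewCache& CChainState::CoinsTip() EXCLUSIVE_LOCKS_REQUIRED(cs_main)
        {
            m_block_warmer.CancelWarmingBlock(); // waits on the warming lock
            return *m_cacheview;
        }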

  16. JeremyRubin commented at 5:46 pm on June 17, 2020: contributor

    I think the approach can be made clearly safe. While this is out of the propagation path for compact blocks, it is still in the path for mining, so there is a motivating factor IMO.

    Repeated uncontended locking is sufficiently cheap that I’d be curious to see the performance difference if the warming thread just tries to grab the lock before every read/write and receives an atomic flag from the main thread to quit warming at a certain point. This would make the algorithm more ‘obviously correct’ and I think would have negligible performance impact as the lock won’t be contended.
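
    Concretely, the inner loop of that variant might look like this (my sketch of the suggestion, not merged code; m_stop_warming is a hypothetical atomic member):

        // Same loop as bulk warming, but the lock is taken per coin access and
        // an atomic flag is checked, so correctness no longer depends on a
        // coarse handoff of the whole cache.
        void BlockWarmer::WarmCoins(CCoinsViewCache& cache, const CBlock& block)
        {
            for (const auto& tx : block.vtx) {
                if (tx->IsCoinBase()) continue;
                for (const CTxIn& txin : tx->vin) {
                    if (m_stop_warming.load(std::memory_order_relaxed)) return;
                    LOCK(m_cs_warm_block);          // cheap while uncontended
                    cache.AccessCoin(txin.prevout); // lock held for this access
                }
            }
        }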

    I’d also be curious whether we can start the warming even earlier. E.g., when we receive a compact block we could immediately warm the coins spent by the transactions we can already reconstruct. But that’s a lot of complexity for a marginal gain over the current approach.

  17. andrewtoth force-pushed on Jun 22, 2020
  18. andrewtoth commented at 0:49 am on June 22, 2020: contributor

    I had originally been benchmarking by connecting to a single peer in the local network, which processes the new block and connects each block serially. When connecting to multiple remote peers the node will process new blocks in parallel and then connect them all later. When benchmarking a true IBD with multiple remote peers, the node gets a lot more time to warm the blocks as they come in.

    Doing a full IBD with -prune=10000 -stopatheight=635000 on the same machine as before, I get an increase in speed of 39% (548 minutes in master, 337 minutes in this branch) and total block connect time speed increase of 7% (14563.66s (22.93ms/blk) in master, 13571.83s (21.37ms/blk) in this branch). Total time warming coins was 4294.83s.

    Doing a full IBD to block 635000 with no pruning on an older quad core machine with SSD, I get a speed increase in total time of 12% (28 hours in master, 24.75 hours in this branch) and total connect time speed increase of 16% (67425.83s (106.18ms/blk) in master, 56600.08s (89.13ms/blk) in this branch). Total time warming coins was 22229.61s.

    Of course there is a lot of variance due to network latency and quality of peers, and I will continue benchmarking with different configurations. However, I think these numbers warrant adding some more complexity to the validation logic as well as a new dedicated thread.

    This also means that what I said earlier about warmed coins not being evicted before block connection is wrong. The cache can be filled and the coins evicted before the block is connected. I’ll work on updating this so it won’t happen.

    I’m still working on refactoring the locking interface, but I wanted to share my findings. I also pushed a small update to cancel warming a block when a new block is processed before the previous one is connected, so it will work efficiently with multiple remote peers in case anyone else wants to try benchmarking this.

  19. JeremyRubin commented at 5:00 pm on July 18, 2020: contributor
    @andrewtoth I wanted to check in here and see how your progress is going? Anything in particular you feel stuck on?
  20. Make m_cacheview a shared_ptr 1e994f2706
  21. Move CoinsTip accesses out of critical path c294aa63fb
  22. andrewtoth force-pushed on Jul 18, 2020
  23. andrewtoth force-pushed on Jul 18, 2020
  24. Add BlockWarmer class afb9e0b101
  25. Warm new blocks with BlockWarmer 3cfb3031f8
  26. Add debug bench logs for BlockWarmer 0580245dd8
  27. andrewtoth force-pushed on Jul 18, 2020
  28. andrewtoth commented at 10:17 pm on July 18, 2020: contributor

    @JeremyRubin Thanks for checking in, as well as for all your help and encouragement!

    I’ve pushed my latest. I’ve taken your suggestion and moved the thread handling to a new class BlockWarmer, and added an instance of it as a member variable of CChainState. Also added locking annotations. I think it makes it a lot more clear what is being changed with the validation logic. I can also write unit tests for it now.

    As for what I’m feeling stuck on: it’s difficult to get a consistent benchmark for this. It’s also very time consuming, since I don’t have a bunch of machines to keep running this on. It’s consistently faster doing IBD, but the speedup varies widely, from above 20% down to low single digits. I think I might have to set up something on the local network to pretend to be multiple peers. I’m also thinking I should write a benchmark for the time to connect a block after receiving it in ProcessNewBlock. I think I need to demonstrate objectively that there is a speedup here worth the added complexity.

    Also, since we download blocks out of order, a block might be warmed but not connected until after the next cache flush, in which case its warmed coins would be evicted before they’re used. I’m not sure if it’s possible to detect that.

  29. andrewtoth commented at 11:06 pm on July 18, 2020: contributor
    Looks like I have to figure out some locking issues as well.
  30. andrewtoth marked this as a draft on Jul 18, 2020
  31. andrewtoth commented at 11:08 pm on July 18, 2020: contributor
    Converted to draft until I figure out all the issues.
  32. andrewtoth commented at 2:30 am on August 7, 2020: contributor
    Unfortunately I don’t think I’ll be able to keep working on this for the near future. Going to close for now.
  33. andrewtoth closed this on Aug 7, 2020

  34. DrahtBot locked this on Feb 15, 2022
