Since #17487 we no longer need to clear the coins cache when syncing to disk. A warm coins cache significantly speeds up block connection, and it only needs to be fully flushed when nearing the `dbcache` limit.
Periodic flushes occur every 24 hours, which empties the cache and slows down block connection afterwards. By keeping the cache through periodic flushes, a node can run for several days with an increasingly hotter cache and connect blocks much more quickly. A higher `dbcache` value is now beneficial not only for IBD, but also for connecting new blocks faster.
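To illustrate the difference, here is a minimal, self-contained sketch of a write-back cache with both behaviors: a flush that writes dirty entries and drops everything, versus a sync that writes dirty entries but keeps them warm. This is a toy model with hypothetical names, not Bitcoin Core's actual `CCoinsViewCache` API.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Toy model of a write-back coins cache over a slow backing store.
struct Coin { int64_t value; bool dirty; };

struct ToyCache {
    std::unordered_map<uint64_t, Coin> cache;   // hot, in-memory entries
    std::unordered_map<uint64_t, int64_t> disk; // slow backing store

    // Old periodic-flush behavior: write dirty entries, then drop all
    // entries. The next block connection starts with a cold cache.
    void Flush() {
        for (auto& [id, coin] : cache)
            if (coin.dirty) disk[id] = coin.value;
        cache.clear();
    }

    // New periodic-flush behavior: write dirty entries but keep them in
    // memory (now clean), so subsequent lookups still hit the cache.
    void Sync() {
        for (auto& [id, coin] : cache) {
            if (coin.dirty) disk[id] = coin.value;
            coin.dirty = false;
        }
    }
};

int main() {
    ToyCache c;
    c.cache[1] = {50, true};
    c.Sync();                            // durable on disk...
    std::cout << c.cache.size() << '\n'; // ...but still cached: prints 1
    c.Flush();
    std::cout << c.cache.size() << '\n'; // cold cache: prints 0
}
```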
To benchmark real-world usage, I spun up 6 identical `t2.small` AWS EC2 instances, all running in the same region in the same VPC. I configured 2 instances to run master, 2 to run the change in this PR, and 2 to run the change in this PR but with `dbcache=1000`. All instances had `prune=5000` and a 20 GB `gp2` EBS volume. A 7th EC2 instance in the same VPC ran master and connected only to some trusted nodes in the outside network. Each of the 6 nodes under test connected directly only to this 7th instance. I manually pruned as much as possible and uploaded the same `blocks`, `chainstate`, and `mempool.dat` to all instances. I started all 6 peers simultaneously at block height 835245 and ran them for over a week, until block 836534.
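For reference, the configuration of the nodes under test amounted to roughly the following `bitcoin.conf`. This is a reconstruction from the description above, not the exact file used; the peer address is a placeholder, and `dbcache=1000` was set only on the third pair of instances:

```
# All six nodes under test
prune=5000

# Connect only to the 7th (relay) instance; placeholder address
connect=<ip-of-7th-instance>

# Only on the two dbcache=1000 nodes
dbcache=1000
```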
The results were much faster block connection times for this branch compared to master, and much faster for this branch with `dbcache=1000` compared to default `dbcache`.
| branch | avg connect time |
| --- | --- |
| master 1 | 1995.49 ms/blk |
| master 2 | 2129.78 ms/blk |
| branch, default `dbcache` 1 | 1189.65 ms/blk |
| branch, default `dbcache` 2 | 1037.74 ms/blk |
| branch, `dbcache=1000` 1 | 393.69 ms/blk |
| branch, `dbcache=1000` 2 | 427.77 ms/blk |
The log files of all 6 instances are here.
There is a lot of noise in the exact times of individual block connections, so I plotted rolling 20-block averages of connect time. The large dots mark the points where the cache is emptied. For the red master nodes, this happens every 24 hours. The blue branch nodes with default `dbcache` filled up and emptied their caches only once, visible in the middle of the chart. The green branch nodes with `dbcache=1000` never emptied the cache. It is very clear from the chart that whenever the cache is emptied, block connect speed degrades significantly.
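For clarity on what was plotted: each point is a simple moving average over a window of 20 consecutive per-block connect times. A minimal sketch of that computation (a hypothetical standalone helper, not part of this PR):

```cpp
#include <deque>
#include <vector>

// Rolling mean over a fixed window of per-block connect times (ms),
// one sample per block in height order.
std::vector<double> RollingMean(const std::vector<double>& ms_per_block,
                                size_t window = 20)
{
    std::vector<double> out;
    std::deque<double> win;
    double sum = 0.0;
    for (double ms : ms_per_block) {
        win.push_back(ms);
        sum += ms;
        if (win.size() > window) {
            sum -= win.front();
            win.pop_front();
        }
        if (win.size() == window) out.push_back(sum / window);
    }
    return out;
}
```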
Also note that this PR still clears the cache on pruning flushes. With frequent pruning flushes, a large cache that never clears is less performant than the status quo (#15265 (comment)). See #28280.