faster & less memory for sync: bulk pool allocator for node based containers #16801

pull martinus wants to merge 1 commit into bitcoin:master from martinus:2019-08-bulkpoolallocator changing 10 files +558 −7
  1. martinus commented at 6:51 am on September 4, 2019: contributor

    This is a custom allocator for node based containers, as described in #16718 (comment). Hopefully, this is a less controversial change and easier to review than #16718.

    The allocator retains memory of nodes, and allocates them in bulk. In my benchmarks, using this pool allocator for CCoinsMap gives about 16% faster -reindex-chainstate, and decreases memory usage by ~8%. The allocator is potentially useful for all node-based containers like std::list, std::map, std::unordered_map, etc. I believe this is especially useful on systems where malloc() and free() are relatively expensive operations, since this allocator will dramatically reduce these calls.

    Some noteworthy properties about this allocator:

    • It relies on the fact that node based containers allocate nodes one at a time. If a container doesn’t behave like this, the allocator will still work but won’t be any faster.

    • The allocator determines the size of the nodes, which will then be used throughout the pool, by the first call to allocate(1). So if the container's first call to allocate(1) does not request something with the size of a node, the pool won't be usable for nodes since it was set up with a different size. With C++17 it would be possible to use a container's node_type to correctly initialize the pool's size, but for now we have to rely on this heuristic.

    • Copying / moving / swapping a map propagates the allocator. So if two different pools are used for two containers a and b, calling a = b means a will from then on use b's pool.

    • The pool is not thread-safe: two containers that use the same pool from different threads won't work.

    • A pool’s memory is only actually destroyed in the pool’s destructor.

    I’ve added a benchmark to compare std::map, std::unordered_map with and without the pool allocator for the CCoinsMap.
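    To illustrate the mechanism, here is a minimal, hypothetical sketch of the free-list idea (not the actual bulk_pool.h from this PR; it ignores alignment, the ownership check, and other details):

        #include <cstddef>
        #include <new>
        #include <vector>

        class Pool
        {
            struct ChunkNode {
                ChunkNode* next;
            };

            ChunkNode* m_free_chunks{nullptr};     // singly linked free list of returned chunks
            std::vector<void*> m_allocated_blocks; // bulk blocks; freed only in the destructor
            std::size_t m_chunk_size{0};           // fixed by the first call to Allocate()

            void AllocateBlock(std::size_t num_chunks)
            {
                char* block = static_cast<char*>(::operator new(num_chunks * m_chunk_size));
                m_allocated_blocks.push_back(block);
                // Thread every chunk of the new block onto the free list.
                for (std::size_t i = 0; i < num_chunks; ++i) {
                    Deallocate(block + i * m_chunk_size);
                }
            }

        public:
            ~Pool()
            {
                for (void* block : m_allocated_blocks) ::operator delete(block);
            }

            // Returns nullptr when the request can't be served from the pool,
            // so the caller can fall back to ::operator new.
            void* Allocate(std::size_t size)
            {
                // A chunk must be able to hold the free-list link.
                if (size < sizeof(ChunkNode)) return nullptr;
                // The first allocation fixes the chunk size for the whole pool.
                if (m_chunk_size == 0) m_chunk_size = size;
                if (size != m_chunk_size) return nullptr;

                if (m_free_chunks == nullptr) AllocateBlock(1024);
                ChunkNode* chunk = m_free_chunks;
                m_free_chunks = chunk->next;
                return chunk;
            }

            // Put the chunk back at the head of the free list; memory is
            // reused but never returned to the OS before the pool dies.
            void Deallocate(void* p)
            {
                auto* node = static_cast<ChunkNode*>(p);
                node->next = m_free_chunks;
                m_free_chunks = node;
            }
        };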

  2. fanquake added the label Utils/log/libs on Sep 4, 2019
  3. DrahtBot commented at 7:45 am on September 4, 2019: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #16910 (wallet: reduce loading time by using unordered maps by achow101)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  4. fanquake added the label Brainstorming on Sep 4, 2019
  5. fanquake added the label Needs Conceptual Review on Sep 4, 2019
  6. MarcoFalke commented at 11:26 am on September 4, 2019: member
    re-run ci
  7. MarcoFalke closed this on Sep 4, 2019

  8. MarcoFalke reopened this on Sep 4, 2019

  9. jamesob commented at 1:52 pm on September 4, 2019: member

    Concept ACK. Thanks for working on this. I did a cursory skim over the code and it looks very readable. The tests are nice too.

    I’ll do a more thorough review and some benchmark runs in the next few days.

  10. martinus renamed this:
    bulk pool allocator for node based containers
    faster & less memory for sync: bulk pool allocator for node based containers
    on Sep 5, 2019
  11. martinus force-pushed on Sep 5, 2019
  12. martinus closed this on Sep 5, 2019

  13. martinus reopened this on Sep 5, 2019

  14. in src/support/allocators/bulk_pool.h:111 in ad55a26725 outdated
    106+            // allocation didn't happen with this allocator
    107+            return false;
    108+        }
    109+
    110+        // put it into the linked list
    111+        *reinterpret_cast<void**>(p) = m_free_chunks;
    


    practicalswift commented at 10:17 am on September 7, 2019:

    <language lawyer hat on>

    Is this really defined behaviour? :-)

    </language lawyer hat on>


    martinus commented at 11:54 am on September 7, 2019:

    hm, I could add a

    struct Node {
        Node* next;
    };
    

    And then use Node* instead of void*. Then instead of

    // put it into the linked list
    *reinterpret_cast<void**>(p) = m_free_chunks;
    m_free_chunks = p;
    

    I could write

    auto n = reinterpret_cast<Node*>(p);
    n->next = m_free_chunks;
    m_free_chunks = n;
    

    I guess the code is a bit more readable that way, but I honestly don't know if it's any more defined behavior :thinking:


    martinus commented at 9:04 am on September 8, 2019:
  15. martinus commented at 6:37 pm on September 11, 2019: contributor

    Here is a graph of another comparison run of the indexing progress over time that I get on my machine (Intel i7-8700 @ 3.20GHz, 32GB of RAM)

    [graph: -reindex-chainstate progress over time, 2019-08-bulkpoolallocator vs. master]

    • 3432 sec for 2019-08-bulkpoolallocator, 4135 sec for master.
    • 6,973,488 kB max RSS for 2019-08-bulkpoolallocator, 7,643,956 kB for master.

    The straight lines where no progress is made are when the cache is full and then dumped into the leveldb database.

  16. ktprime commented at 3:16 pm on September 12, 2019: none
    I want to know how to migrate your good bulk pool to my project.
  17. martinus commented at 4:55 pm on September 12, 2019: contributor

    I want to know how to migrate your good bulk pool to my project.

    Feel free to copy the bulk_pool.h to your project. It's under the MIT license. I might extract the file into a standalone project with more benchmarks, but that's not high on my priority list.

    I am interested though in any improvements you see with this allocator!

  18. promag commented at 11:05 pm on September 15, 2019: member
    Looks like the improvement grows over time/progress? Have you checked with a greater -stopatheight?
  19. martinus commented at 5:25 am on September 16, 2019: contributor
    I've also made a graph with -stopatheight 594000: [graph: indexing progress over time with -stopatheight 594000]
  20. jamesob commented at 2:44 pm on September 27, 2019: member

    After running benchmarks across a few different machines, I’m pretty confused.

    The raw results are below, but the headline is that I’m consistently seeing branches using this allocator outperform master but also vice versa depending upon the host. It’s about half and half, and when master wins it’s by a pretty dramatic margin.

    E.g. master is reliably faster than this branch and #16718 on bench-ssd-2, bench-ssd-5, and bench-hdd-5 (except for one weird run where master took 41 hours??). On all other hosts (ssd-3, ssd-4, hdd-4), these branches beat master consistently.

    This is very confusing to me because AFAICT -reindex-chainstate should be a deterministic process - i.e. I’d expect the same logical execution for a given height on mainnet regardless of host. I guess maybe blockfile layouts can differ, but would that really be responsible for so dramatic a difference?

    Before anyone asks: these are dedicated benchmarking machines (specs listed below). Before each run I’m dropping all caches (sudo /sbin/swapoff -a; sudo /sbin/sysctl vm.drop_caches=3;), tuning with pyperf (sudo /usr/local/bin/pyperf system tune;), and I’ve ensured that the CPU governors are set to performance.

    Hostname:              bench-ssd-2
    Kernel:                Linux 4.9.0-8-amd64
    OS:                    Debian GNU/Linux 9
    RAM (GB):              7.71
    Architecture:          x86_64
    CPU(s):                4
    Thread(s) per core:    1
    Core(s) per socket:    4
    Model name:            Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
    
    -reindex-chainstate to 550,000 (dbcache 4000,4000,5000)


    hostname     branch                               runtime   peak mem  cpu%
    --------     ------                               -------   --------  ----

    bench-ssd-2  jamesob/2019-08-robinhood            8:44:15   5268.90MB 356%
                                                      8:44:08   5279.89MB 356%
                                                      8:54:16   6209.94MB 350%

                 master                               3:56:18   6257.37MB 211%
                                                      3:52:25   6322.87MB 215%
                                                      6:09:23   7184.23MB 138%

                 martinus/2019-08-bulkpoolallocator   8:57:44   5680.03MB 349%
                                                      8:57:50   5710.53MB 349%
                                                      9:19:26   6634.91MB 336%

    bench-ssd-3  jamesob/2019-08-robinhood            1:53:11   5288.62MB 88%
                                                      1:56:02   5300.98MB 87%
                                                      2:28:13   6289.06MB 71%

                 martinus/2019-08-bulkpoolallocator   2:14:50   5703.27MB 83%
                                                      2:11:09   5709.48MB 85%
                                                      3:28:15   6747.18MB 56%

                 master                               2:35:11   6272.42MB 88%
                                                      2:32:48   6341.90MB 89%
                                                      5:31:51   7212.80MB 45%

    bench-ssd-4  master                               2:54:59   6268.15MB 79%
                                                      2:35:52   6277.92MB 88%
                                                      5:14:52   7155.87MB 47%

                 jamesob/2019-08-robinhood            1:57:07   5269.11MB 85%
                                                      1:53:59   5351.89MB 88%
                                                      2:28:55   6312.53MB 71%

                 martinus/2019-08-bulkpoolallocator   2:12:36   5715.95MB 84%
                                                      2:09:30   5668.43MB 86%
                                                      3:18:38   6707.09MB 59%

    bench-ssd-5  master                               3:58:37   6263.33MB 210%
                                                      3:40:44   6214.23MB 226%
                                                      5:37:39   7109.82MB 151%

                 martinus/2019-08-bulkpoolallocator   8:53:29   5710.38MB 352%
                                                      8:47:41   5692.83MB 356%
                                                      9:25:06   6553.30MB 333%

                 jamesob/2019-08-robinhood            8:41:53   5294.47MB 357%
                                                      8:36:57   5331.02MB 360%
                                                      8:53:55   6176.95MB 350%

    bench-hdd-4  martinus/2019-08-bulkpoolallocator   3:50:01   5729.15MB 49%
                                                      3:53:44   5793.15MB 47%
                                                      17:20:06  6741.35MB 11%

                 jamesob/2019-08-robinhood            2:55:47   5310.89MB 57%
                                                      2:54:19   5306.11MB 57%
                                                      9:04:22   6220.20MB 19%

                 master                               10:11:25  6257.50MB 23%
                                                      5:25:15   6256.57MB 42%
                                                      40:32:45  7123.75MB 6%

    bench-hdd-5  martinus/2019-08-bulkpoolallocator   10:04:43  5703.16MB 310%
                                                      10:03:57  5775.29MB 311%
                                                      24:04:41  6728.74MB 130%

                 master                               8:05:00   6248.24MB 103%
                                                      6:08:48   6250.89MB 135%
                                                      41:39:38  7152.32MB 20%

                 jamesob/2019-08-robinhood            9:25:51   5295.75MB 329%
                                                      9:24:18   5297.03MB 330%
                                                      13:15:56  6134.24MB 235%
    

    I’m wondering if this allocator is behaving pathologically based upon some different initial condition across the hosts, but for now I’m pretty stumped. In the coming weeks I’ll work on tooling that will allow me to get a more precise idea of where time is being spent and will roll all that into bitcoinperf.

  21. martinus commented at 10:14 am on September 28, 2019: contributor

    It's really strange that the difference is so dramatic. Do you have the debug.log files, or some graphs of the progress? It would be interesting to see where it's slowing down: e.g. if it's in the DB sync, or continuously, or right before a sync.

    I think since your host only has 7.7GB memory, the slowdown might come from the host swapping. Maybe the problem is the way bitcoin handles the cache: as transactions are added to the cache, it slowly fills up. With some bad luck / bad configuration, the hashmap can reach its maximum load factor just as memory usage gets close to the dbcache limit. If it has to double its size right at that time, it will temporarily need a lot more memory than the dbcache allows and has to rehash; once this costly operation is done, it will be over the dbcache limit, so the cache is dumped and thrown away.

    In that case it would be much better logic to flush the cache once the load factor has reached its maximum and another insert would bump it over the limit.
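    A hypothetical sketch of the check such flushing logic could use (assuming an unordered_map-style cache like CCoinsMap; not code from this PR):

        // True when the very next insert would push the map past its max load
        // factor and thus trigger a rehash, which temporarily needs roughly
        // double the bucket memory.
        template <typename Map>
        bool NextInsertWouldRehash(const Map& map)
        {
            return map.size() + 1 > map.bucket_count() * map.max_load_factor();
        }

    Flushing when this returns true close to the dbcache limit would avoid the temporary memory spike of the rehash.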

  22. martinus force-pushed on Oct 5, 2019
  23. martinus commented at 5:08 pm on October 5, 2019: contributor
    rebased & squashed in 329754d4c33656b0d526aa456077f4667c6ecb11
  24. in src/support/allocators/bulk_pool.h:84 in 329754d4c3 outdated
    79+    {
    80+        Destroy();
    81+    }
    82+
    83+    /// Don't allow allocation for types that are smaller than a Node. This does not make sense
    84+    /// because we need to fit a pointer into the memory.
    


    ryanofsky commented at 5:18 pm on October 31, 2019:
    Would it be better to drop these methods or to use a static assert, so it could be a compile time error when sizeof(T) < sizeof(ChunkNode) instead of a runtime error?

    martinus commented at 6:08 pm on October 31, 2019:

    Thanks for having a look at the code! In that case I think it is better to return nullptr, because of how it's used in bulk_pool::Allocator<T>::allocate(size_t n). There bulk_pool::Pool::Allocate() is called, and if nullptr is returned I fall back to ::operator new (see the sketch below).

    I could probably rewrite it to be a compile error, but then I need to rewrite bulk_pool::Allocator<T>::allocate(size_t n) to handle that case - which I don’t know how to do nicely
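    For illustration, a hypothetical sketch of that fallback, reusing the Pool sketch from the PR description above (the actual signatures in bulk_pool.h may differ):

        #include <cstddef>
        #include <new>

        template <typename T>
        struct Allocator {
            Pool* m_pool; // shared pool; Pool::Allocate() may return nullptr

            T* allocate(std::size_t n)
            {
                if (n == 1) {
                    // nullptr means the pool can't serve this request, e.g. the
                    // type is smaller than a pointer or the size doesn't match
                    // the pool's chunk size.
                    if (void* p = m_pool->Allocate(sizeof(T))) return static_cast<T*>(p);
                }
                // Fall back to the default global allocator.
                return static_cast<T*>(::operator new(n * sizeof(T)));
            }
        };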


    ryanofsky commented at 6:12 pm on October 31, 2019:
    Oh, I see. I didn’t realize it would fall back to default operator new. Could mention that explicitly, but this is also fine.
  25. in src/support/allocators/bulk_pool.h:312 in 329754d4c3 outdated
    307+            return;
    308+        }
    309+        ::operator delete(p);
    310+    }
    311+
    312+    /// Calls p->~U(). Only required for g++4.8
    


    ryanofsky commented at 5:30 pm on October 31, 2019:
    More information on this? A quick web search didn’t turn up anything for me. Could say more in comment.

    martinus commented at 6:10 pm on October 31, 2019:
    It seems only g++4.8 requires that this method is present, otherwise I get a compile error. It’s deprecated in C++17 and removed in C++20. I’ll add a comment for that.
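    For reference, a sketch of the member in question (hypothetical naming; the real one is in bulk_pool.h):

        // g++ 4.8's libstdc++ reportedly calls destroy() directly on the
        // allocator instead of going through std::allocator_traits, so the
        // member must exist there. std::allocator::destroy() was deprecated
        // in C++17 and removed in C++20.
        template <typename U>
        void destroy(U* p)
        {
            p->~U();
        }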
  26. ryanofsky commented at 5:50 pm on October 31, 2019: member
    Skimmed code and looks pretty good. Just had two questions below.
  27. Add bulk pool allocator for node based containers.
    In my benchmarks, using this pool allocator for CCoinsMap gives about
    16% faster `-reindex-chainstate`, and decreases memory usage by 8%.
    The allocator is potentially useful for all node-based containers like
    `std::list`, `std::map`, `std::unordered_map`.
    54ada87348
  28. martinus force-pushed on Nov 1, 2019
  29. in src/coins.h:219 in 54ada87348
    215@@ -215,10 +216,11 @@ class CCoinsViewCache : public CCoinsViewBacked
    216      * declared as "const".
    217      */
    218     mutable uint256 hashBlock;
    219-    mutable CCoinsMap cacheCoins;
    220+    mutable bulk_pool::Pool cacheCoinsPool{};
    


    ryanofsky commented at 2:54 pm on November 1, 2019:
    Given that the pool only grows and uses increasing memory over time, and that its lifetime is effectively the lifetime of the entire node, would it make sense to add some kind of garbage collection method for CCoinsViewCache that would free unneeded memory (maybe after finishing syncing, or after flushing, or on an interval, or from an RPC a user could call)?

    JeremyRubin commented at 7:16 pm on November 1, 2019:

    I have the following code for something I'm working on. It should be amortized cheap to call after every erase; you could also make it have different parameters during sync or after (e.g., always compact after sync, or only compact if it's above 50MB free during sync).

     static void resize_if_savings(CTxMemPoolEntry::relatives& relatives) {
         assert(relatives.max_load_factor() == 1);
         // If we're at 0, clear out the map
         if (relatives.size() == 0) {
             CTxMemPoolEntry::relatives tmp;
             std::swap(tmp, relatives);
             return;
         }
         // don't bother saving for small enough sets
         // 19 buckets isn't very large, and fits in with the usual
         // prime rehash policies
         if (relatives.bucket_count() <= 19) return;
         // don't bother rehashing if we're more than half full
         const size_t full_size = relatives.bucket_count();
         if (relatives.size() > full_size/2) return;

         CTxMemPoolEntry::relatives tmp{std::make_move_iterator(relatives.begin()),
                                        std::make_move_iterator(relatives.end()),
                                        relatives.size()};
         std::swap(tmp, relatives);
     }
    

    martinus commented at 12:06 pm on November 2, 2019:
    @JeremyRubin that looks interesting. I think the if (relatives.size() > full_size/2) return; is a bit too aggressive though. For a hashmap that uses doubling as its resize strategy: if it has e.g. 256 elements and you add one, it will double its size, and when you remove one element again that would already trigger the rehashing to decrease its size.

    JeremyRubin commented at 9:34 pm on November 2, 2019:

    On my system the rehash policy is the prime less than double policy, which ensures that we have a growth factor < 2 for all resizes. So in the case of 256, we would first “double” to 503. Then we would be at 257/503 “fullness”, where half would be 251. So we can remove up to 6 elements before triggering a re-hash. So the pathological cases are a bit more precise, see below:

    If you're at exactly 251, then you add one, you double to 503, and then you can remove two before triggering a rehash (strict inequality). Let's say you choose to remove a third. Then you'll rehash to either 467 or 257 (depending on the implementation – I believe it's usually 257 because we go to the smallest prime larger than N). That means you will need to remove down to 233 (~20 more elements) before triggering a rehash OR more likely ~125 elements. Let's say instead, if you add back another N, you won't trigger a rehash until you hit 467, so you can add back another ~215 elements before having to go again OR more likely, 7 elements.

    So there is a case to be made that it’s less than ideal, but it’s not quite as bad given the primes.

    There’s also a case to be made (at least for the use cases I’m looking at) where you’re usually in a streak of erases, followed by a streak of adds. So while it is possible to trigger the bad behavior, it’s atypical.

    Therefore I don’t actually think it’s that bad to use half as the shrink test, given that it quickly gets us shrunk to a more than half full table.

    But if we do a resize at, say, n <= (buckets - (buckets/100)*54), and resize to (n + (n/100)*36), then we will always be able to add at least 36% after a resize and remove 1/(1 + 0.36*1.08) - 0.36 ≈ 36% more before resizing again. (The ~1.08 on top of the 36% is, I think, half the average overshoot from prime to prime. This is not a great estimate but can be improved.)
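    A hypothetical sketch of that generalized rule (the 54%/36% thresholds as suggested above; Map is an unordered_map-like type):

        #include <cstddef>
        #include <iterator>
        #include <utility>

        // Shrink only once occupancy drops below ~46% of the buckets, and
        // rebuild with ~36% headroom so that both growing and shrinking
        // again require a change of roughly a third of the elements.
        template <typename Map>
        void MaybeShrink(Map& map)
        {
            const std::size_t buckets = map.bucket_count();
            if (map.size() > buckets - (buckets / 100) * 54) return;
            Map tmp{std::make_move_iterator(map.begin()),
                    std::make_move_iterator(map.end()),
                    map.size() + (map.size() / 100) * 36};
            std::swap(tmp, map);
        }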

  30. ryanofsky commented at 5:32 pm on November 1, 2019: member

    James walked me through the benchmarks, and I guess a summary is that the benchmarks do generally look good and show expected decreases in runtime and memory usage at varying dbcache sizes.

    The problem is that in some benchmark setups, reindexing reliably but for no explicable reason takes a lot longer (9 hours vs 2 hours) on the bulk allocator and robinhood branches than it does on the master branch. That is, the robinhood and bulk allocator branches generally perform better in the ways we would expect, but on some particular benchmarking machines with particular datadirs, reindexing takes a lot longer when using bulk allocator and robinhood branches instead of the master branch.

    Also, very strangely, copying the datadir from a benchmarking machine where reindexing with the bulk allocator branch was fast, to a benchmarking machine where reindexing with the bulk allocator branch was slow, somehow makes reindexing on that same machine fast as well. This is suspicious and probably indicates either a mistake made in benchmark setup, or the presence of a pre-existing pathological performance bug in bitcoin somehow triggered by changing the allocator. (But even that doesn’t make much sense, because while it’s possible to imagine the layout of block data in the datadir impacting reindex time, the actual block reads done during reindexing should be identical between the master and bulk allocator branches.)

    It does seem like we need to debug the case where switching to the bulk allocator appears to cause reindexing slowdowns with particular datadirs, even though using the bulk allocator could only be causing these slowdowns very indirectly.

  31. jamesob commented at 6:17 pm on November 1, 2019: member
    Just as an update, I’ve got a plan for diagnosing the benching weirdness I’ve seen thus far. With any luck I’ll be posting an analysis (or at least additional information) in the next few days.
  32. martinus commented at 6:25 pm on November 1, 2019: contributor

    Is it possible that this weirdness can be explained by different assumevalid points? Either in the branches, or on the disk. It was a surprise to me: #17060 (comment)

    As @MarcoFalke wrote in #17060 (comment), it is probably best to either use -noassumevalid or use the same block hash, and make sure that block is actually available on all disks

  33. jamesob commented at 3:58 pm on November 8, 2019: member

    Okay, I’m back with good news and bad(ish) news. The good news is that the bench weirdness has been resolved by comparing master (6a7c40bee4, bench/master.1) to a version of this branch rebased on top of that same commit (bench/alloc.1). So I’m guessing the weirdness before may have been related to an assumevalid bump, but I’m not definitively sure.

    The bad news is that the now-very-consistent results don’t show as much of an improvement as I expected.

    I did two separate runs over the course of a few days - for each machine, the branches were tested twice back-to-back in random order (results are ordered the same). The command tested was bitcoind -reindex-chainstate -stopatheight=550000 -dbcache=$DBCACHE -connect=0.

    host         branch                 time              %     maxmem  cpu% dbcache
    bench-ssd-2  bench/master.1             9:13:57    1.00  5777.93MB 344% dbcache=4000MB
    bench-ssd-2  bench/master.1             9:16:30    1.00  5760.55MB 343% dbcache=4000MB
    bench-ssd-2  bench/alloc.1              9:01:19    0.97  5466.12MB 348% dbcache=4000MB
    bench-ssd-2  bench/alloc.1              9:01:06    0.97  5453.22MB 349% dbcache=4000MB

    bench-ssd-3  bench/alloc.1              9:03:00    0.97  5451.97MB 348% dbcache=4000MB
    bench-ssd-3  bench/alloc.1              9:01:58    0.97  5474.70MB 348% dbcache=4000MB
    bench-ssd-3  bench/master.1             9:16:33    1.00  5772.59MB 343% dbcache=4000MB
    bench-ssd-3  bench/master.1             9:18:04    1.00  5781.47MB 342% dbcache=4000MB

    bench-ssd-4  bench/master.1             9:15:28    1.00  5781.36MB 343% dbcache=4000MB
    bench-ssd-4  bench/master.1             9:15:41    1.00  5765.45MB 343% dbcache=4000MB
    bench-ssd-4  bench/alloc.1              9:02:24    0.98  5451.96MB 348% dbcache=4000MB
    bench-ssd-4  bench/alloc.1              9:01:49    0.98  5443.95MB 348% dbcache=4000MB

    bench-ssd-5  bench/alloc.1              8:57:42    0.97  5456.22MB 351% dbcache=4000MB
    bench-ssd-5  bench/alloc.1              8:57:48    0.97  5464.43MB 351% dbcache=4000MB
    bench-ssd-5  bench/master.1             9:12:06    1.00  5787.68MB 345% dbcache=4000MB
    bench-ssd-5  bench/master.1             9:11:49    1.00  5777.87MB 346% dbcache=4000MB

    bench-strong bench/alloc.1              14:58:34   0.99  5425.66MB 548% dbcache=4000MB
    bench-strong bench/alloc.1              14:57:55   0.99  5463.18MB 548% dbcache=4000MB
    bench-strong bench/master.1             15:04:45   1.00  5751.64MB 545% dbcache=4000MB
    bench-strong bench/master.1             15:04:06   1.00  5734.97MB 545% dbcache=4000MB
    

    So we see very consistent times across branches and machines, but the gain associated with this branch is pretty mild (~2-3%).

    All of the bench-ssd-* machines resemble each other pretty closely. bench-strong has much more CPU and RAM (which led me to be initially surprised by the results) but then I realized it has a spinning disk. The specs for the machines are below:

    (representative of the bench-ssd-* machines)

    key                                  value
    hostname                             bench-ssd-2
    cpu_model_name                       Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
    ram_gb                               7.712375640869141
    os                                   ['Debian GNU/Linux', '9', 'stretch']
    arch                                 x86_64
    kernel                               4.9.0-8-amd64
    read_iops (/home/ccl/bitcoinperf)    2032
    write_iops (/home/ccl/bitcoinperf)   672

    bench-strong

    key                                value
    hostname                           bench-strong
    cpu_model_name                     Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
    ram_gb                             31.353431701660156
    os                                 ['Debian GNU/Linux', '10', 'buster']
    arch                               x86_64
    kernel                             4.19.0-5-amd64
    read_iops (/home/james/.bitcoin)   332
    write_iops (/home/james/.bitcoin)  113

    In order to make certain that these runtime characteristics aren’t unstable based on dbcache, I reran this with dbcache=5000 (from dbcache=4000) - in previous benchmarking sessions I hadn’t seen the weird “bi-modal” behavior until dbcache hit 5000.

    host         branch                 time              %     maxmem  cpu% dbcache
    bench-ssd-2  bench/alloc.1              9:12:44    0.98  6397.99MB 341% dbcache=5000MB
    bench-ssd-2  bench/alloc.1              9:11:06    0.98  6397.02MB 342% dbcache=5000MB
    bench-ssd-2  bench/master.1             9:22:27    1.00  6697.26MB 340% dbcache=5000MB
    bench-ssd-2  bench/master.1             9:22:28    1.00  6609.71MB 339% dbcache=5000MB

    bench-ssd-3  bench/master.1             9:27:27    1.00  6738.98MB 336% dbcache=5000MB
    bench-ssd-3  bench/master.1             9:24:56    1.00  6770.75MB 338% dbcache=5000MB
    bench-ssd-3  bench/alloc.1              9:25:38    1.00  6454.57MB 334% dbcache=5000MB
    bench-ssd-3  bench/alloc.1              9:12:47    0.97  6418.94MB 341% dbcache=5000MB

    bench-ssd-4  bench/alloc.1              9:12:12    0.98  6543.72MB 341% dbcache=5000MB
    bench-ssd-4  bench/alloc.1              9:26:04    1.00  6394.11MB 333% dbcache=5000MB
    bench-ssd-4  bench/master.1             9:23:44    1.00  6749.35MB 338% dbcache=5000MB
    bench-ssd-4  bench/master.1             9:22:57    0.99  6673.99MB 339% dbcache=5000MB

    bench-ssd-5  bench/alloc.1              9:04:42    0.97  6380.43MB 346% dbcache=5000MB
    bench-ssd-5  bench/alloc.1              9:06:23    0.97  6380.80MB 345% dbcache=5000MB
    bench-ssd-5  bench/master.1             9:23:53    1.00  6666.49MB 338% dbcache=5000MB
    bench-ssd-5  bench/master.1             9:20:08    0.99  6604.82MB 341% dbcache=5000MB

    bench-strong bench/master.1             15:00:44   1.00  6869.63MB 545% dbcache=5000MB
    bench-strong bench/master.1             14:58:43   1.00  6706.32MB 546% dbcache=5000MB
    bench-strong bench/alloc.1              15:02:07   1.00  6511.17MB 545% dbcache=5000MB
    bench-strong bench/alloc.1              15:02:24   1.00  6415.48MB 545% dbcache=5000MB
    

    As you can see, the results are very similar. In fact I was a little surprised that an extra gig of dbcache doesn’t seem to make any real difference in terms of runtime - if nothing else it marginally hurt the runtimes.


    So anyway, take this all for what it is. Given the wide variance seen in my earlier benchmarking, I wouldn't be surprised if this is all taken with a grain of salt, but I've done what I can to rule out spurious measures. That said, anyone should be able to reproduce these results by making superficial modifications to this script.

    I would say if these benchmarks are indeed characteristic of the performance gain associated with this PR, I am not in favor of merging it as such because I think the risk associated with swapping out an allocator isn’t compensated for by a 3% reduction in IBD runtime. But that said I am (hopefully obviously) a big fan of this sort of work, and will continue to cheer @martinus on in his attempts to optimize what is certainly one of the biggest bottlenecks in bitcoind.

  34. martinus commented at 4:29 pm on November 8, 2019: contributor
    Thanks for getting to the bottom of this @jamesob! It's good to see that the allocator actually seems to do its job as it's supposed to, even if the benefit is small. It's probably bigger when validation doesn't have to be redone, but the additional code complexity is probably not worth it.
  35. ryanofsky commented at 8:44 pm on November 18, 2019: member

    It's probably bigger when validation doesn't have to be redone, but the additional code complexity is probably not worth it.

    It seems like a pool allocator could be a basic building block for a lot of future optimizations. It might also be useful for testing overhead of allocations even in places where we wouldn’t want to enable it in released code. So I could see the effort here being useful even if we aren’t looking to merge the PR now. But would suggest closing the PR or marking it work in progress to update the status.

    One thing I’m not clear on is the difference between 3% performance improvement last measured and 16% initially measured. If there are theories about this, I’d be curious.

  36. martinus commented at 7:24 am on November 19, 2019: contributor

    One thing I’m not clear on is the difference between 3% performance improvement last measured and 16% initially measured. If there are theories about this, I’d be curious.

    I'm pretty sure the difference is due to the additional validation caused by -noassumevalid. With full validation, validation is the bottleneck, so the improvement is not as pronounced. When no validation needs to happen, you see much bigger gains from this PR.

  37. martinus closed this on Nov 19, 2019

  38. MarcoFalke commented at 2:09 pm on November 19, 2019: member
    Oh, I didn’t want to imply in #17060 (comment) that -noassumevalid is the only setting that makes sense for benchmarking. Every release has a different assumed valid block set by default. So when looking at total IBD time, it makes sense to consider the setting where no validation happens because it is skipped by default. If there is a more than 10% speed-up, it should not be neglected.
  39. jamesob commented at 4:06 pm on August 9, 2021: member

    I’ve benchmarked @martinus’ rebased version of this branch (per #22640 (comment)) and I’m happy to report that I’m seeing a roughly 9% speedup along with an 8% reduction in memory usage (with a very high dbcache setting). A pull request for this branch should probably be reopened, though due to the rebase I think @martinus may have to open a separate pull request.


    ibd local range 500000 540000

    commands index

    bench name                      command
    ibd.local.range.500000.540000   bitcoind -dbcache=10000 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876

    martinus/2019-08-bulkpoolallocator vs. $mergebase (absolute)

    bench name                                     x  martinus/2019-08-bulkpoolallocator  $mergebase
    ibd.local.range.500000.540000.total_secs       3  2850.0195 (± 9.6506)                3134.5484 (± 11.6699)
    ibd.local.range.500000.540000.peak_rss_KiB     3  5831204.0000 (± 3492.0817)          6358314.6667 (± 3223.6564)
    ibd.local.range.500000.540000.cpu_kernel_secs  3  261.0567 (± 1.3968)                 267.7467 (± 0.7879)
    ibd.local.range.500000.540000.cpu_user_secs    3  12815.7433 (± 3.5496)               13162.3167 (± 13.4203)

    martinus/2019-08-bulkpoolallocator vs. $mergebase (relative)

    bench name                                     x  martinus/2019-08-bulkpoolallocator  $mergebase
    ibd.local.range.500000.540000.total_secs       3  1                                   1.100
    ibd.local.range.500000.540000.peak_rss_KiB     3  1                                   1.090
    ibd.local.range.500000.540000.cpu_kernel_secs  3  1                                   1.026
    ibd.local.range.500000.540000.cpu_user_secs    3  1                                   1.027
  40. martinus commented at 4:39 pm on August 9, 2021: contributor
    Thanks for the thorough benchmark! I’ll try to clean up the code a bit and then open another PR with the pool allocator.
  41. DrahtBot locked this on Aug 16, 2022
