Improve speed, memory efficiency with alternate hashmap #16718

pull jamesob wants to merge 8 commits into bitcoin:master from jamesob:2019-08-robinhood changing 5 files +1509 −5
  1. jamesob commented at 2:30 am on August 25, 2019: member

    tl;dr

    I try swapping out std::unordered_map for a faster third-party implementation where it matters, and see something like a 15% speedup for initial block download coincident with a 19.3% reduction in memory usage.


    When profiling initial block download, it’s evident that a decent chunk of time on the critical path is spent in CCoinsViewCache, the data structure responsible for the in-memory representation of the UTXO set.

    [Image: profile of the msghand thread at height ~300,000]

    The essential data member of the cache, cacheCoins, is a std::unordered_map that holds as many UTXOs as can fit into memory given the dbcache setting. While this map implementation is definitely preferable to std::map (which retains order and has O(log(n)) lookups instead of O(1)), both stdlib maps are subpar relative to non-std alternatives.

    After seeing cacheCoins as a bottleneck, I decided to try swapping one of these alternative implementations in to see if there is any benefit. It looks like there is.
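
    The integration point is tiny: CCoinsMap is a typedef in coins.h, so the swap is roughly the following one-liner (a sketch of the shape of the change, not the exact diff):

      // coins.h (sketch): the whole swap is changing which map backs CCoinsMap.
      // before:
      typedef std::unordered_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher> CCoinsMap;
      // after (this branch):
      typedef robin_hood::unordered_node_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher> CCoinsMap;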

    [Image: local IBD benchmark, block range 500,000 to 520,000]

    For 20,000 blocks at a representative part of the chain, we see something like a 15% speedup with a 19.3% reduction in memory usage (dbcache was set to 3000).

    The CCoinsCaching benchmark shows improvement but not as dramatically as IBD does:

    [Image: CCoinsCaching microbenchmark results]

    Obviously adding 2000 lines of new consensus-critical code isn’t appealing, but if those numbers hold for the full IBD process I think there might be a compelling case to be made for the added code.

    Implementations considered

    Of the few best-looking options, I picked the one with the most succinct implementation and fewest dependencies, robin_hood::unordered_node_map (it’s just a header file). I also looked at Facebook’s F14NodeMap and Abseil’s node_hash_map but both required too much elbow-grease (for now) to integrate here without pulling in a whole lot of code.

    Next steps

    I’m going to be running more comprehensive benchmarks, including a near-full IBD from the network, to see if these improvements hold.

    I’m also interested in benching this along with a similar change to mapBlockIndex (currently a std::unordered_map).

    Questions

    If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?

  2. Use robin_hood.unordered_node_map for CCoinsViewCache.cacheCoins 6f9882ce4a
  3. fanquake added the label Resource usage on Aug 25, 2019
  4. promag commented at 10:09 am on August 25, 2019: member

    including a near-full IBD from the network, to see if these improvements hold.

    Yap that would be interesting.

    I’m also interested in benching this along with a similar change to mapBlockIndex

    I’d be surprised to see such improvements here.

    If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?

    Either way it should include tests.

  5. fanquake added the label Needs Conceptual Review on Aug 25, 2019
  6. fanquake deleted a comment on Aug 25, 2019
  7. DrahtBot commented at 12:07 pm on August 25, 2019: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #16910 (wallet: reduce loading time by using unordered maps by achow101)
    • #16801 (faster & less memory for sync: bulk pool allocator for node based containers by martinus)
    • #15606 ([experimental] UTXO snapshots by jamesob)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  8. practicalswift commented at 12:09 pm on August 25, 2019: contributor

    @jamesob Interesting! Thanks for providing benchmarks!

    Which implementation of std::unordered_map did you benchmark against?

    It would be interesting to see this benchmarked against the three major implementations of std::unordered_map (libstdc++, libc++ and Microsoft STL).

  9. jamesob commented at 2:46 pm on August 25, 2019: member

    Which implementation of std::unordered_map did you benchmark against?

    I’ve been benching on Ubuntu and Debian systems with libstdc++ (x86_64, glibc6).

  10. fanquake commented at 2:58 am on August 26, 2019: member

    I try swapping out std::unordered_map for a faster third-party implementation where it matters, and see something like a 15% speedup for initial block download coincident with a 19.3% reduction in memory usage.

    Nice. This is the kind of PR I enjoy reading / reviewing / testing.

    I did a basic benchmark on my macOS machine: time src/bitcoind -stopatheight=350000.

    #16718 (6f9882ce4a817c9f14aa7526165ab6e278de890e): 48m49s
    master (db67101c748c208cced8e9b76a66d57cd48fbf6e): 55m22s

    Will run something to a more intense height later.

  11. jamesob commented at 4:23 pm on August 26, 2019: member

    Some additional context from further benchmarking:

    Local tests

    I did a local IBD (not real network) comparison up to height 510,000. We see a speedup of roughly 17% coupled with memory savings of roughly 13.5%.

    Given the vertical shift in the real-memory-usage curve on the robinhood branch (pictured below), I suspect there’s a problem with my implementation of the DynamicUsage() memory estimator for the new robin_hood map. We appear to be over-reporting the new map’s estimated memory usage relative to its actual usage, so fixing this might lead to additional speedups for configurations with a limiting dbcache.
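
    For reference, such an estimator has to follow the shape of the existing std::unordered_map overload in memusage.h. A sketch of what that looks like for a node-based robin-hood table - not this PR's actual code; NumSlots() is a hypothetical stand-in for the table's slot-count accessor, and the per-slot byte cost is an assumption (exactly the kind of constant that over-reports if guessed too high):

      template <typename X, typename Y, typename Z>
      static inline size_t DynamicUsage(const robin_hood::unordered_node_map<X, Y, Z>& m)
      {
          // One separately allocated node per element:
          size_t usage = MallocUsage(sizeof(std::pair<const X, Y>)) * m.size();
          // Plus the open-addressed index: assumed pointer + metadata byte per slot.
          // NumSlots() is hypothetical; the real slot-count accessor may differ.
          usage += MallocUsage((sizeof(void*) + 1) * NumSlots(m));
          return usage;
      }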

    [Image: local IBD comparison to height 510,000]

    2019-08-robinhood vs. master (absolute)

    bench name                    | x | 2019-08-robinhood       | master
    ------------------------------|---|-------------------------|-------------------------
    ibd.local.510000.total_secs   | 1 | 7827.2801 (± 0.0000)    | 9447.3154 (± 0.0000)
    ibd.local.510000.peak_rss_KiB | 1 | 4863324.0000 (± 0.0000) | 5620656.0000 (± 0.0000)

    2019-08-robinhood vs. master (relative)

    bench name                    | x | 2019-08-robinhood | master
    ------------------------------|---|-------------------|-------
    ibd.local.510000.total_secs   | 1 | 1                 | 1.207
    ibd.local.510000.peak_rss_KiB | 1 | 1                 | 1.156

    IBD from the real network

    I let IBD run overnight up to height 550,000 using peers from the real network for both master and this branch. The differences in speed were not as compelling here (a 6% gain for the robinhood branch) but memory differences were still pronounced: 14.6% savings for the robinhood branch (6502MB vs. 7615MB, both at a dbcache of 5000).

      Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
      Linux 4.4.0-119-generic x86_64

      `master` to height 550_000, dbcache=5000:

      Command being timed: "./src/bitcoind -datadir=/data/bitcoin-rh -printtoconsole=0 -stopatheight=550000 -dbcache=5000"
      User time (seconds): 54822.52
      System time (seconds): 929.91
      Percent of CPU this job got: 121%
      Elapsed (wall clock) time (h:mm:ss or m:ss): 12:44:12
      Average shared text size (kbytes): 0
      Average unshared data size (kbytes): 0
      Average stack size (kbytes): 0
      Average total size (kbytes): 0
      Maximum resident set size (kbytes): 7615768
      Average resident set size (kbytes): 0
      Major (requiring I/O) page faults: 1715
      Minor (reclaiming a frame) page faults: 72418086
      Voluntary context switches: 26934810
      Involuntary context switches: 287115
      Swaps: 0
      File system inputs: 451040
      File system outputs: 596633792
      Socket messages sent: 0
      Socket messages received: 0
      Signals delivered: 0
      Page size (bytes): 4096
      Exit status: 0

      On the robinhood branch:

      Command being timed: "./src/bitcoind -datadir=/data/bitcoin-rh -printtoconsole=0 -stopatheight=550000 -dbcache=5000"
      User time (seconds): 52074.03
      System time (seconds): 914.95
      Percent of CPU this job got: 123%
      Elapsed (wall clock) time (h:mm:ss or m:ss): 11:57:01
      Average shared text size (kbytes): 0
      Average unshared data size (kbytes): 0
      Average stack size (kbytes): 0
      Average total size (kbytes): 0
      Maximum resident set size (kbytes): 6502096
      Average resident set size (kbytes): 0
      Major (requiring I/O) page faults: 14411
      Minor (reclaiming a frame) page faults: 72379830
      Voluntary context switches: 22815471
      Involuntary context switches: 290463
      Swaps: 0
      File system inputs: 770608
      File system outputs: 597949704
      Socket messages sent: 0
      Socket messages received: 0
      Signals delivered: 0
      Page size (bytes): 4096
      Exit status: 0

    The “real” gains here dampen my enthusiasm somewhat, but the memory story is compelling. Curious for others to weigh in on whether these differences merit the increased risk of bringing in a non-std map implementation.

  12. jamesob force-pushed on Aug 26, 2019
  13. jamesob commented at 3:19 am on August 27, 2019: member

    Consider my enthusiasm renewed: the latest “real” IBD runs came back showing a hard-to-believe 40% speedup. Synced to 550,000 on different machines with identical hardware. Still a 12% memory savings.

    Obviously it’s hard to gauge variance when syncing from the real network.

    2019-08-robinhood

    [Image: sync timing output for the 2019-08-robinhood run]

    master

    [Image: sync timing output for the master run]

    Apologies for the images - tmux over SSH makes copying text a pain.

  14. JeremyRubin commented at 4:25 am on August 27, 2019: contributor

    Concept ACK, NACK this implementation. From the docs:

      Depends on good Hashing. For a really bad hash the performance will not only degrade like in std::unordered_map, the map will simply fail with an std::overflow_error. In practice, when using the standard robin_hood::hash, I have never seen this happening.

    There are some ways that we could maybe get some better performance with std::unordered_map; I will take a look and give some feedback later.
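
    For intuition on the quoted failure mode, a degenerate hash makes it concrete (a sketch, assuming robin_hood gives up once its bounded probe distance is exhausted rather than degrading gracefully):

      // A constant hash forces every key into the same probe chain.
      // std::unordered_map degrades to O(n) lookups; per the docs quoted
      // above, robin_hood instead throws std::overflow_error at some point.
      struct TerribleOutpointHasher {
          size_t operator()(const COutPoint&) const noexcept { return 0; }
      };
      // robin_hood::unordered_node_map<COutPoint, Coin, TerribleOutpointHasher> bad_map;
      // Inserting enough distinct outpoints would eventually throw.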

  15. jamesob commented at 4:00 pm on August 27, 2019: member

    the map will simply fail with an std::overflow_error

    Thanks for the close look @JeremyRubin. This is exactly the kind of change I’d be really worried about introducing. The performance difference will have to be really compelling (and verified by a few people aside from myself) to consider using a non-std implementation for something as consensus-sensitive as cacheCoins.

    I wonder if, assuming we did find a non-std implementation that we’re comfortable with, we should initially leave its use as an opt-in configure flag that, e.g., installations running on lower-power hardware can make use of.

  16. jamesob commented at 4:22 pm on August 27, 2019: member
    Alternatively, we could look at forking robin-hood-hashing and removing all instances of throwOverflowError() with something more conservative.
  17. elichai commented at 4:43 pm on August 27, 2019: contributor
    Concept ACK for replacing the hashmap, std::map is a disaster. (std::unordered_map is a little bit better but still really bad).
  18. JeremyRubin commented at 5:26 pm on August 27, 2019: contributor

    @jamesob maybe have a look at the experimental rust fork?

    You could just wrap the Rust hashmap (with at least Rust v1.35, or use the hashbrown crate). Rust’s hashmap is based on SwissTable.

    Replacing sensitive data structures with Rust seems like a good use case given long-term goals, and it could be used for low-resource nodes experimentally?

  19. elichai commented at 5:34 pm on August 27, 2019: contributor

    @jeremyrubin I just had a talk with him about this. Personally I’d love to see Rust inside of Bitcoin, but I’m not sure a hashmap is the best first step, as integrating Rust and C++ requires a C ABI in between.

    I do hope I’ll have time to test the C++ SwissTable implementation and compare benchmarks https://github.com/abseil/abseil-cpp/blob/master/absl/container/internal/raw_hash_set.h

  20. jamesob commented at 5:43 pm on August 27, 2019: member
    I agree with @elichai - I think adding Rust as a dependency for one of the most consensus-sensitive data structures is way too big a leap. A few extra thousand lines of heavily-scrutinized cpp is about the biggest risk I’m willing to take personally, and that’s assuming a ton of review, testing, and maybe a gradual deployment process.
  21. elichai commented at 7:56 pm on August 27, 2019: contributor
    1. Would love a quick explanation on how to run all these benchmarks myself (so I can get both a control before changing and after).

    2. Why did you choose unordered_node_map and not unordered_flat_map?

    3. Do you think it would be interesting in the future to replace other uses of std::unordered_map in the code? (and potentially even std::map)

  22. jamesob commented at 8:13 pm on August 27, 2019: member

    Would love a quick explanation on how to run all these benchmarks myself (so I can get both a control before changing and after).

    You can use bitcoinperf to perform the host-local comparative benchmarks I’ve posted above. I’ve committed an example file for this branch here - note that you’re going to have to change some of the paths and potentially create some partially synced datadirs to be able to use it.

    Why did you choose unordered_node_map and not unordered_flat_map?

    In the case of cacheCoins, the values can be fairly large (>55 bytes) since they include scriptPubkeys. The flat map is good for values of small size, but shuffling around medium-to-large objects isn’t the intended use. The node map allocates values separately and stores pointers instead. See Folly’s docs for more details.

    That said, I’m happy to benchmark against a flat_map implementation to see how it compares to node_map.
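
    For a schematic contrast (not either library’s actual internals), the difference in what lives in the table looks like this:

      // Flat map: the whole entry sits inline in the table, so every probe
      // shuffle and every rehash moves sizeof(FlatSlot) bytes per element.
      struct FlatSlot {
          COutPoint key;              // 36 bytes
          CCoinsCacheEntry value;     // often >55 bytes incl. the scriptPubKey
      };

      // Node map: the table holds only a pointer; the large entry stays put
      // on the heap, so shuffles are cheap and element addresses are stable.
      struct NodeSlot {
          std::pair<const COutPoint, CCoinsCacheEntry>* node; // one pointer
      };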

    Do you think it would be interesting in the future to replace other uses of std::unordered_map in the code? (and potentially even std::map)

    Yeah, I do, though I don’t expect any other piece of data in the system will give us such dramatic improvements in performance. If we’re going to use this map for arguably the most sensitive data structure in the system, we might as well use it elsewhere. I think a compelling candidate would be mapBlockIndex, but there are probably other good uses as well.

  23. elichai commented at 8:18 pm on August 27, 2019: contributor
    Sounds good :) Thanks! I’ll also try to benchmark this myself; I’m curious about the difference between unordered_node_map and unordered_flat_map over unique pointers.
  24. JeremyRubin commented at 10:25 pm on August 27, 2019: contributor

    Ok I’ve had a chance to more tightly review the code here. I didn’t want to say so earlier, because I wasn’t sure, but I think it should be possible to modify the cuckoocache code minimally to support cacheCoins (store a key/value and (maybe) get rid of atomics for erasure).

    I can put together a PoC of that… there are some other benefits as CuckooCache is specifically a cache and not a map being used as a cache.

    This is already a consensus dependency, and has been in since v0.14, so it introduces less new code.

  25. jamesob commented at 1:47 pm on August 28, 2019: member

    Latest round of benchmarks are in from two dedicated bench machines with identical specs (Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz, 8GB memory, SSD).

    First I reran the last test I reported above (/usr/bin/time -v ./src/bitcoind -stopatheight=550000 -dbcache=5000) to ensure I could recreate the results. To my surprise, they held (40% speedup, 13% memory savings). To ensure that the wide disparity in performance wasn’t due to a difference in the machines, I did a crossover: I swapped branches on each machine and reran the IBD. Same results.

    master (a7be1cc92be4946c4f042bccd3a1b007657f3241):

      Elapsed (wall clock) time (h:mm:ss or m:ss): 6:47:37
      Maximum resident set size (kbytes): 7425920

    this branch:

      Elapsed (wall clock) time (h:mm:ss or m:ss): 4:03:51
      Maximum resident set size (kbytes): 6436820

    This is a remarkable performance improvement that I’m eager to see reproduced by others.

  26. jamesob commented at 1:57 pm on August 28, 2019: member

    I think it should be possible to modify the cuckoocache code minimally to support cacheCoins

    @JeremyRubin I’d be very eager to see your PoC for this. In principle I strongly agree that if we can modify existing consensus-supporting data structures and see similar performance benefits, that’s the way to go.

    CuckooCache is specifically a cache and not a map being used as a cache

    I’m curious about what you mean here, and it highlights one of my concerns about repurposing CuckooCache. As you probably know, we can’t just evict stuff from cacheCoins - we have to be very careful about persisting cache content (FlushStateToDisk()) before eviction. At the moment it looks like CuckooCache quietly evicts data when it fills up - I’m curious to see how you’ll ensure that we can coordinate eviction with a guaranteed flush.

  27. Sjors commented at 4:20 pm on August 28, 2019: member

    Concept ACK for improving std::unordered_map behavior or swapping it for an alternative if the improvement is >10%. It’s nice how this only changes 1 line in coins.h.

    I’m able to reproduce the memory gain, but the speed is actually worse for me, on a 2019 MacBook Pro with -dbcache=10000, no indexes and no wallets, syncing over LAN to a single node:

    • IBD before: 3:35 hours plus 11 minute db flush (memory at height 564180: 8.23 GB)
    • IBD after: 5:30 hours plus 20 minute db flush (memory at height 564180: 7.5 GB)
  28. martinus commented at 4:22 pm on August 28, 2019: contributor

    Hi all! I am the author of the robin_hood::unordered_map variants, and just saw this PR by accident. Please ask away if you have any questions about my implementation!

    I have also done extensive map benchmarks here, which might be interesting: https://github.com/martinus/map_benchmark

    I agree with @JeremyRubin that the throwOverflowError() is a potential problem, if there is a way to force collisions in the map. Be aware that newer hashmaps (Facebook’s F14, Google’s SwissTable) also depend on good hashes. They don’t throw an exception, but performance can degrade so much that it’s not much different from an infinite loop.

  29. in src/test/coins_tests.cpp:67 in 33b9786e6c outdated
      @@ -64,7 +64,7 @@ class CCoinsViewTest : public CCoinsView
                           map_.erase(it->first);
                       }
                   }
      -            mapCoins.erase(it++);
      +            it = mapCoins.erase(it);


    martinus commented at 4:51 pm on August 28, 2019:
    Be aware that there is a bug in robin_hood map https://github.com/martinus/robin-hood-hashing/issues/42 (as well as in tessil’s map https://github.com/Tessil/robin-map/issues/20) where it is possible that entries are iterated over a second time before end() is reached.
  30. elichai commented at 4:59 pm on August 28, 2019: contributor

    @martinus would love to see some list of projects you know of that depend on and use your implementation :)

    I think if people felt this was a stable and vetted implementation, they would feel better about considering it for use in consensus-critical code. Thanks :)

  31. martinus commented at 5:41 pm on August 28, 2019: contributor

    @elichai, we have been using the map in several core components of Dynatrace’s proprietary monitoring platform for quite a while. That’s a very well tested and widely deployed product on multiple platforms (I work at Dynatrace, so I might be biased…).

    I don’t know much about other use cases, but here is a GitHub query to find other uses of the map: https://github.com/search?q=robin_hood_h_included&type=Code (it only finds usages that copied the file, though).

    Some usages that I had a look at:

  32. MarcoFalke commented at 5:51 pm on August 28, 2019: member
    Are there any benchmarks on low-CPU, low-memory devices? Given that this slows down IBD for @Sjors (https://github.com/bitcoin/bitcoin/pull/16718#issuecomment-525818499), I wouldn’t be surprised if it also slowed down other architectures/configurations.
  33. laanwj commented at 6:37 pm on August 28, 2019: member

    The speed-ups are promising; it’s surprising to me that this makes so much of a difference.

    But to me it sounds quite risky, to be honest, to import a custom map implementation into consensus-critical code. Of course, we can never be sure, but likely these have much less testing than whatever tried-and-true map implementation comes with C++ libraries by default.

    (Importing Rust code for this reason seems like a very bad idea to me; I’d be surprised if it helps performance at all with an FFI barrier there, and that makes it doubly risky.)

  34. TheBlueMatt commented at 6:56 pm on August 28, 2019: member
    Right, pulling in Rust for something like this is way too early/aggressive. I’m curious whether it’s possible to get similar speedups by taking a hatchet to robinhood so that we can cut it down to something that’s more reviewable, since obviously that’s the key criterion.
  35. martinus commented at 6:59 pm on August 28, 2019: contributor

    It would be very interesting to know where exactly the speedup is coming from. It might be possible to reproduce this speedup while staying with std::unordered_map. E.g. I see two possibilities for improvement:

    • The robin_hood map requires less hashing and fewer key comparisons than std::unordered_map. It might be possible to save a lot of hashing & comparisons by simply adding the precalculated hash value to the key (so instead of the COutPoint use e.g. std::pair<COutPoint, size_t>; see the sketch after this list). This unfortunately will increase memory usage a bit (but not much, since the key/value pair is already quite large).

    • I use a bulk allocator that never frees any memory, but reuses previously allocated memory. So once the cache has reached its peak size, no allocations will be necessary any more. This can also be achieved with a custom allocator for std::unordered_map.

    • If the speedup comes from the robin_hood style memory layout, that’s bad luck - can’t be achieved with std::unordered_map.
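
    As a sketch of the first bullet (hypothetical names, not code from this PR): cache the salted hash next to the key so the std map’s hasher becomes a field read instead of a SipHash pass over the 36-byte COutPoint.

      struct OutpointWithHash {
          COutPoint outpoint;
          size_t cached_hash; // the salted hash, computed once at insert time

          bool operator==(const OutpointWithHash& other) const {
              return outpoint == other.outpoint; // hash is derived, never compared
          }
      };

      struct CachedHash {
          size_t operator()(const OutpointWithHash& k) const noexcept {
              return k.cached_hash; // no re-hashing on lookup or rehash
          }
      };
      // std::unordered_map<OutpointWithHash, CCoinsCacheEntry, CachedHash>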

  36. jamesob commented at 7:03 pm on August 28, 2019: member

    First -reindex-chainstate bench finished: ~~46.7%~~ 2.14x faster with 1173 MB less memory usage.

    master (a7be1cc92be4946c4f042bccd3a1b007657f3241):

      user@bench-ssd-4:~/bitcoin$ /usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000; git branch

              Command being timed: "./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000"
              User time (seconds): 7735.97
              System time (seconds): 1081.16
              Percent of CPU this job got: 53%
              Elapsed (wall clock) time (h:mm:ss or m:ss): 4:32:10
              Average shared text size (kbytes): 0
              Average unshared data size (kbytes): 0
              Average stack size (kbytes): 0
              Average total size (kbytes): 0
              Maximum resident set size (kbytes): 7563764
              Average resident set size (kbytes): 0
              Major (requiring I/O) page faults: 15004449
              Minor (reclaiming a frame) page faults: 61431800
              Voluntary context switches: 22076432
              Involuntary context switches: 4492039
              Swaps: 0
              File system inputs: 2826327856
              File system outputs: 165988936
              Socket messages sent: 0
              Socket messages received: 0
              Signals delivered: 0
              Page size (bytes): 4096
              Exit status: 0

    2019-08-robinhood:

      user@bench-ssd-3:~/bitcoin$ /usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000; git branch

              Command being timed: "./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000"
              User time (seconds): 5598.92
              System time (seconds): 720.66
              Percent of CPU this job got: 82%
              Elapsed (wall clock) time (h:mm:ss or m:ss): 2:07:54
              Average shared text size (kbytes): 0
              Average unshared data size (kbytes): 0
              Average stack size (kbytes): 0
              Average total size (kbytes): 0
              Maximum resident set size (kbytes): 6361924
              Average resident set size (kbytes): 0
              Major (requiring I/O) page faults: 1684815
              Minor (reclaiming a frame) page faults: 73715700
              Voluntary context switches: 7033296
              Involuntary context switches: 1901858
              Swaps: 0
              File system inputs: 502322648
              File system outputs: 177410624
              Socket messages sent: 0
              Socket messages received: 0
              Signals delivered: 0
              Page size (bytes): 4096
              Exit status: 0

    @laanwj in terms of your concerns, I couldn’t agree more. The last thing I want is to be responsible for a consensus failure.

    I wonder whether, if these performance gains indeed turn out to be real, we could devote a substantial amount of effort to scrutinizing this map implementation using e.g. fuzz + stress testing and thorough manual review. After some period (months), we could make its use opt-in via configure flag to allow some hardware-constrained installations to use it.

    That said, in the grand scheme of things maybe even a 50% speedup in IBD is negligibly marginal relative to the risk.

  37. jamesob commented at 7:11 pm on August 28, 2019: member
    My mistake - that’s 2.14x faster, not 46.7%.
  38. martinus commented at 7:19 pm on August 28, 2019: contributor

    master (a7be1cc):

      User time (seconds): 7735.97
      Major (requiring I/O) page faults: 15004449

    2019-08-robinhood:

      User time (seconds): 5598.92
      Major (requiring I/O) page faults: 1684815

    I find it a bit fishy that the major page faults with robinhood are 10x lower. Could the disk cache affect the benchmark? Maybe drop the cache before benchmarking; I think sync; echo 3 > /proc/sys/vm/drop_caches should work.

  39. MarcoFalke commented at 7:22 pm on August 28, 2019: member

    we could make its use opt-in via configure flag to allow some hardware-constrained installations to use it

    I disagree about a phased deployment. Either the consensus code is safe to run, correct, and can be deployed to the whole network, or not. There is no in-between. The last thing you want to see happening is that all ARM (or whatever “hardware-constrained” means) devices are off consensus.

  40. martinus commented at 7:26 am on August 30, 2019: contributor

    I’ve benchmarked bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000 on my machine (Intel i7-8700, g++ 9.1, libstdc++), and can confirm a significant speedup with @jamesob’s changes. For accurate reproducibility I disabled frequency scaling and turbo boost for my CPU (with sudo pyperf system tune). Runtime with std::unordered_map is 6471 seconds, and 5339 seconds with the robin_hood map: 21% faster.

    I’ve modified SaltedOutpointHasher and added a custom equals function to count the number of hash and equals operations performed. To my surprise, on my system std::unordered_map needs fewer hash calls; it looks like it caches the full hash (10.7 vs. 13.2 billion hashes). I’ve benchmarked just the hashing and equals operations, and they should account for about 323 vs. 391 seconds of total CPU.
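
    The instrumentation described is straightforward to reproduce; a sketch of the counting wrappers (hypothetical, not martinus’s actual patch):

      #include <atomic>
      #include <cstdint>

      static std::atomic<uint64_t> g_hash_calls{0};
      static std::atomic<uint64_t> g_equal_calls{0};

      struct CountingOutpointHasher {
          SaltedOutpointHasher inner;
          size_t operator()(const COutPoint& o) const {
              g_hash_calls.fetch_add(1, std::memory_order_relaxed);
              return inner(o);
          }
      };

      struct CountingOutpointEqual {
          bool operator()(const COutPoint& a, const COutPoint& b) const {
              g_equal_calls.fetch_add(1, std::memory_order_relaxed);
              return a == b;
          }
      };
      // drop-in: std::unordered_map<COutPoint, CCoinsCacheEntry,
      //                             CountingOutpointHasher, CountingOutpointEqual>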

    Next I’m going to try to use a bulk pool allocator for std::unordered_map and see how that goes.

  41. martinus commented at 5:21 pm on August 30, 2019: contributor

    I have tried to use boost::pool_allocator and boost::fast_pool_allocator for the std::unordered_map, but performance was much worse than with std::allocator.

    Then I adapted my BulkPoolAllocator to be usable with std::unordered_map, and the results are much better.

    Benchmark results for bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000 on my i7-8700:

                    | unordered_map | robin_hood | unordered_map + BulkPoolAllocator
    ----------------|---------------|------------|----------------------------------
    elapsed h:mm:ss | 1:47:51       | 1:28:59    | 1:32:56
    speedup %       | 0%            | 21.2%      | 16.1%
    max RSS kbyte   | 7553716       | 6301540    | 6950432
    max RSS %       | 100%          | 83.4%      | 92.0%

    So most of the savings can also be achieved by staying with std::unordered_map and adding the BulkPoolAllocator. This should be much easier to integrate since it’s less than 200 lines of code. Also, the allocator might make sense for other node-based containers as well.
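
    To make the idea concrete, a minimal sketch of a “never free, always reuse” node allocator (illustrative names only, not the actual BulkPoolAllocator):

      #include <cstddef>
      #include <memory>

      // Freed nodes are parked on a per-type free list and handed back out on
      // the next single-node allocate(), so once the map reaches its peak size
      // it stops calling malloc entirely. Not thread-safe as written, and the
      // parked memory is deliberately never returned to the OS.
      template <typename T>
      struct PoolingAllocator {
          using value_type = T;

          PoolingAllocator() = default;
          template <typename U>
          PoolingAllocator(const PoolingAllocator<U>&) noexcept {}

          T* allocate(std::size_t n)
          {
              if (n == 1 && free_list) {              // reuse a parked node
                  FreeNode* node = free_list;
                  free_list = node->next;
                  return reinterpret_cast<T*>(node);
              }
              return std::allocator<T>{}.allocate(n); // bucket arrays, first nodes
          }

          void deallocate(T* p, std::size_t n) noexcept
          {
              if (n == 1) {                           // park the node for reuse
                  auto* node = reinterpret_cast<FreeNode*>(p);
                  node->next = free_list;
                  free_list = node;
                  return;
              }
              std::allocator<T>{}.deallocate(p, n);
          }

      private:
          struct FreeNode { FreeNode* next; };
          static_assert(sizeof(T) >= sizeof(FreeNode), "type too small to pool");
          inline static FreeNode* free_list = nullptr;
      };

      template <typename T, typename U>
      bool operator==(const PoolingAllocator<T>&, const PoolingAllocator<U>&) { return true; }
      template <typename T, typename U>
      bool operator!=(const PoolingAllocator<T>&, const PoolingAllocator<U>&) { return false; }

    Plugging it in would then be a matter of passing PoolingAllocator<std::pair<const COutPoint, CCoinsCacheEntry>> as the map’s allocator argument; the container rebinds it internally for its node type.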

  42. JeremyRubin commented at 7:02 pm on August 30, 2019: contributor

    I think we may also be able to eke out some additional performance by making the keys the salted SHA256 hash of the COutPoint rather than the COutPoint itself, and, instead of using SipHash, taking entropy from that salted hash (as is done for CuckooCache).

    At the expense of one extra SHA256 on insert/find (and less SipHash), we get to drop 4 of the key bytes and have cheaper hash re-evaluation. Furthermore (this is a little bit platform-dependent, but maybe we can detect it or wrap the hash in an is_integral type) we can set __cache_hash_code in the map to false and save another 8 bytes. Depending on padding issues we may save even more space.
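
    A sketch of that idea (hypothetical types, not code from this PR or CuckooCache):

      #include <array>
      #include <cstring>

      // Key the map by a salted SHA256 digest of the COutPoint, as CuckooCache
      // does for its entries: 32 digest bytes vs. the 36-byte COutPoint is the
      // "drop 4 key bytes", and since digest bytes are already uniform, the
      // hasher is a plain load, so caching hash codes per node buys nothing.
      using OutpointDigest = std::array<unsigned char, 32>; // SHA256(salt || outpoint)

      struct DigestHasher {
          size_t operator()(const OutpointDigest& d) const noexcept {
              size_t h;
              std::memcpy(&h, d.data(), sizeof(h)); // entropy straight from the digest
              return h;
          }
      };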

  43. [robinhood] remove custom hash 27f5bc74d2
  44. [robinhood] remove inline DataNode (flatmap=true) fa73a0f68e
  45. [robinhood] remove robin_hood::pair 7915fb73d3
  46. [robinhood] remove Cloner and Destroyer for flatmaps 25274b69e6
  47. [robinhood] remove unnecessary constructors, operators 8cc7434ec8
  48. [robinhood] remove all references to IsFlatMap cd43885b99
  49. [robinhood] remove integer sequence stuff 8739248c41
  50. jamesob force-pushed on Sep 3, 2019
  51. jamesob commented at 2:19 pm on September 3, 2019: member

    Thanks for running those benchmarks, @martinus. I also think we should try calling cacheCoins.reserve(dbcache / avg_coinscacheentry_size) somewhere during startup with both implementations to see how that affects things.
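
    A sketch of that reserve() call (the average entry cost below is an illustrative guess, not a measured constant):

      // Pre-size the table from the -dbcache budget so neither implementation
      // pays for incremental rehashing during IBD.
      constexpr size_t AVG_COINSCACHE_ENTRY_BYTES = 100; // assumption, needs measuring
      const size_t budget_bytes = gArgs.GetArg("-dbcache", nDefaultDbCache) << 20; // MiB -> bytes
      cacheCoins.reserve(budget_bytes / AVG_COINSCACHE_ENTRY_BYTES);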

    On the chance that we still want to consider using the robin_hood map, I’ve stripped about 500 lines out of robinhood.h (mostly relating to unordered_flat_map which I don’t think we have much use for). I refer to this stripped down version as robinhood-slim in the benchmarks below (and it’s now the tip of this branch).


    I’ve been running benchmarks pretty continuously for the last few days and here’s a summary of the SSD-based timings:

    [Image: summary table of SSD-based reindex timings across dbcache settings]

    Unsurprisingly, it seems that the performance improvement is most pronounced with a large dbcache. Presumably this is because as the number of entries in cacheCoins grows, std::unordered_map’s performance starts to degrade, robin_hood starts to shine, or both.

    I found it interesting that over numerous runs, a large dbcache (7000) ended up being slower across the board than a smaller dbcache (6000 or 3000).

    In general, the weird timing disparities in these results give me pause. I’m benching on dedicated, non-virtualized hardware that serves no purpose other than benchmarking, and I’m dropping caches before runs (sudo /sbin/swapoff -a; sudo /sbin/sysctl vm.drop_caches=3;). Yet we see weird outliers like the 500-minute robinhood reindex at dbcache=5000 - which I didn’t believe initially but then reproduced. Why would a dbcache=7000 run take longer than a dbcache=3000 run? Is the runtime of -reindex-chainstate somehow weirdly determined by the dbcache value used during IBD?

    I’m quite confused by some aspects of these results.

    Anyway, here’s the memory story:

    [Image: summary table of memory usage]

    We reliably see significant memory savings from the robinhood map.


    I also ran -reindex-chainstate to 550,000 on equivalent machines with HDDs. The disparity between master and this branch has been almost unbelievable, but I’ve reliably reproduced it even after doing machine crossover. The following table is characteristic of the -reindex-chainstates I’ve done on HDD:

    branch    | dbcache | time                     | memory
    ----------|---------|--------------------------|--------
    robinhood | 5000    | 8:46:59                  | 6.38 GB
    master    | 5000    | 21:32:58 (stopped early) | 7.47 GB

    Similarly, IBD to the same height with the same dbcache took 47 hours on master (!?) and 17 hours on this branch.

    I’m just kind of assuming these results are bogus somehow, although I’m very confused by the fact that I can reliably reproduce them even after swapping machines, but maybe someone with a spinning disk can confirm or deny that there’s a dramatic difference in performance between this branch and master.

  52. MarcoFalke commented at 2:27 pm on September 3, 2019: member
    Thanks for doing those additional benchmarks @jamesob. I think that confirms the finding by @Sjors that the sync is slowed down with a high dbcache in #16718 (comment)
  53. jamesob commented at 3:16 pm on September 3, 2019: member

    I think that confirms the finding by @Sjors that the sync is slowed down with a high dbcache in #16718 (comment)

    Note though that I differ from Sjors’ findings in that this branch was ~2x faster than master for dbcache=7000 (reindex-chainstate), whereas in his results this branch was slower than master for a large dbcache (10000) during IBD.

    I lightly suspect that @Sjors’ slowness on this branch relative to master was due to some kind of IBD-inherent variability since on every other benchmark run I’ve seen, this branch performs better than master.

    Sjors’ total IBD runtime (<5 hours) was also less than my reindex-chainstate runtime (up to 8 hours), which I attribute to his probably-much-better hardware.

  54. martinus commented at 6:03 pm on September 3, 2019: contributor
    In https://github.com/martinus/bitcoin/tree/2019-08-bulkpoolallocator I’m currently working on a PR where I’ve extracted (and completely rewrote…) the pool allocator so it is usable in std::unordered_map. Unfortunately the standard allocator API is trickier than I expected, but I think I’ve got it now. Its performance is still as advertised in #16718 (comment). I still have to clean up the code and add some unit tests before creating a PR, probably in the next few days.
  55. luke-jr commented at 9:24 pm on September 20, 2019: member

    Surprised nobody’s pointed out that using std::unordered_map basically gives us an unknown implementation right now… which is probably best avoided in consensus code, when possible. The fact that we can get performance improvements too really suggests to me that switching to a specific implementation would be a good idea. So concept ACK.

    Agree with need for intense care and testing to avoid a surprise protocol change. Bundling like LevelDB seems best if it’s a lot of code.

    As for issues with bad hashes: we don’t want intentionally-crafted transactions to be able to screw over performance, so we might need to re-hash with some seed?

  56. ryanofsky commented at 7:51 pm on October 21, 2019: member

    Status of this?

    • Does using std::unordered_map with custom allocator #16718 (comment) still seem like it could be a good approach?

    • Are the code changes in this PR more or less complete, or is there more robinhood code streamlining to be done?

    • Are there still concerns about performance with smaller dbcaches and more constrained hardware with these implementations?

    • Are there still benchmark reproducibility issues? Are these affecting other setups or just James’ benchmarks?

  57. martinus commented at 6:32 am on October 23, 2019: contributor

    Status of this?

    I think using a custom allocator from #16801 is a much safer choice than switching to a different map implementation. Writing a correct and fast hashmap is orders of magnitude more difficult than writing a custom allocator. This PR was good, though, in that it has highlighted what could be possible. Most of the performance benefit can be achieved with the custom allocator as well.

    Concerning benchmark reproducibility, I believe @jamesob has stumbled upon a pretty severe performance bug, but we don’t know yet if this is due to the new code or maybe the optimizations trigger an existing bug more often: #17060 (comment)

  58. jamesob commented at 10:30 pm on November 14, 2019: member

    Some new (and stabilized) benchmarks here. Following the same methodology as in similar PRs, we see a 6% difference in runtime with ~400MB less memory usage for a reindex up to 550,000 (bench/robinhood.1 vs. bench/master.1):

      host         tag                      time       time% maxmem  cpu%  dbcache
      bench-ssd-2  robinhood.1              8:43:36    0.94  5304.37MB 356%  4000MB
      bench-ssd-2  robinhood.1              8:44:29    0.94  5309.67MB 355%  4000MB
      bench-ssd-2  master.1                 9:15:02    1.00  5776.95MB 344%  4000MB
      bench-ssd-2  master.1                 9:15:26    1.00  5762.75MB 343%  4000MB

      bench-ssd-3  robinhood.1              8:44:41    0.94  5293.84MB 355%  4000MB
      bench-ssd-3  robinhood.1              8:46:29    0.94  5309.91MB 354%  4000MB
      bench-ssd-3  master.1                 9:18:07    1.00  5780.84MB 342%  4000MB
      bench-ssd-3  master.1                 9:15:53    1.00  5773.25MB 343%  4000MB

      bench-ssd-4  robinhood.1              8:46:13    0.95  5279.41MB 354%  4000MB
      bench-ssd-4  robinhood.1              8:45:36    0.94  5294.21MB 354%  4000MB
      bench-ssd-4  master.1                 9:14:42    1.00  5782.72MB 344%  4000MB
      bench-ssd-4  master.1                 9:16:17    1.00  5762.92MB 343%  4000MB

      bench-ssd-5  robinhood.1              8:42:02    0.94  5277.88MB 357%  4000MB
      bench-ssd-5  robinhood.1              8:41:34    0.94  5302.08MB 357%  4000MB
      bench-ssd-5  master.1                 9:12:43    1.00  5773.20MB 345%  4000MB
      bench-ssd-5  master.1                 9:12:08    1.00  5769.65MB 345%  4000MB

      bench-strong robinhood.1              14:16:20   0.94  5302.65MB 563%  4000MB
      bench-strong master.1                 15:07:08   1.00  5804.43MB 543%  4000MB
      bench-strong master.1                 15:08:44   1.00  5722.19MB 543%  4000MB

    Interesting, but probably not a sufficient benefit to justify adding an extra 1500 lines of consensus-critical code. I’m going to close this one and perhaps someone can revisit this later.

  59. jamesob closed this on Nov 14, 2019

  60. MarcoFalke locked this on Dec 16, 2021
