Improve speed, memory efficiency with alternate hashmap (#2)

jamesob commented at 8:19 pm on August 5, 2021: member

Resurrection of #16718

tl;dr

I try swapping out std::unordered_map for a faster third-party implementation where it matters, see something like 15% speedup for initial block download coincident with a 19.3% reduction in memory usage.

When profiling initial block download, it’s evident that a decent chunk of time on the critical path is spent in CCoinsViewCache, the data structure responsible for the in-memory representation of the UTXO set.

Selection_147 Profile of the msghand thread at height ~300,000.

The essential data member of the cache, cacheCoins, is a std::unordered_map that holds as many UTXO as can fit into memory given the dbcache setting. While this map implementation is definitely preferable to std::map (which retains order and has O(log(n)) lookups insted of O(1)), both stdlib maps are subpar relative to non-std alternatives.

After seeing cacheCoins as a bottleneck, I decided to try swapping one of these alternative implementations in to see if there is any benefit. It looks like there is.

ibd local range 500000 520000

For 20,000 blocks at a representative part of the chain, we see something like a 15% speedup with a 19.3% reduction in memory usage (dbcache was set to 3000).

The CCoinsCaching benchmark shows improvement but not as dramatically as IBD does:

microbenches

Obviously adding 2000 lines of new consensus-critical code isn’t appealing, but if those numbers hold for the full IBD process I think there might be a compelling case to be made for the added code.

Implementations considered

Of the few best-looking options, I picked the one with the most succinct implementation and fewest dependencies, robin_hood::unordered_node_map (it’s just a header file). I also looked at Facebook’s F14NodeMap and Abseil’s node_hash_map but both required too much elbow-grease (for now) to integrate here without pulling in a whole lot of code.

Next steps

I’m going to be running more comprehensive benchmarks, including a near-full IBD from the network, to see if these improvements hold.

I’m also interested in benching this along with a similar change to mapBlockIndex (currently a std::unordered_map).

Questions

If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?

Use robin_hood.unordered_node_map for CCoinsViewCache.cacheCoins df1c9674c0

[robinhood] remove custom hash dfdd18eb31

[robinhood] remove inline DataNode (flatmap=true) 960dbb5795

[robinhood] remove robin_hood::pair c04614301b

[robinhood] remove Cloner and Destroyer for flatmaps f29a8a4c1d

[robinhood] remove unnecessary constructors, operators 9acd37be68

[robinhood] remove all references to IsFlatMap c0cd24143f

[robinhood] remove integer sequence stuff f76b3681c3

[robin_hood] update to 3.11.3 c19d014826

jamesob commented at 8:34 pm on August 5, 2021: member

I’m reopening this because I think it’s worth some long-term consideration. We’re not going to get many opportunities to improve block connection by ~12%, so I think this is worth continuing to look at. Now that bitcoin is on c++17, I wonder if there aren’t some good ways to slim down @martinus’s implementation, since it contains a number of provisions to deal with <c++17 platforms. At the very least we could work on fuzz testing this to our satisfaction.

I’ve rerun some benchmarks on an idle workstation and have confirmed that I continue to see ~13% speedups along with reduction in memory use, which is especially pronounced when using a large dbcache.

Issues to consider

From the README: “For a really bad hash the performance will not only degrade like in std::unordered_map, the map will simply fail with an std::overflow_error. In practice, when using the standard robin_hood::hash, I have never seen this happening.”
- https://github.com/martinus/robin-hood-hashing/issues/117
- Possible that we can amend this behavior, or at least convince ourselves this is appropriately unlikely given our use of siphash?

Benchmarks

dbcache=10000

ibd local range 500000 540000

commands index

bench name	command
ibd.local.range.500000.540000	`bitcoind -dbcache=10000 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876`

jamesob/2019-08-robinhood vs. $mergebase (absolute)

bench name	x	jamesob/2019-08-robinhood	$mergebase
ibd.local.range.500000.540000.total_secs	3	2812.5091 (± 27.2604)	3190.3358 (± 17.5632)
ibd.local.range.500000.540000.peak_rss_KiB	3	5517118.6667 (± 1750.8682)	6344530.6667 (± 3478.0996)
ibd.local.range.500000.540000.cpu_kernel_secs	3	260.3867 (± 2.5737)	263.0400 (± 2.1568)
ibd.local.range.500000.540000.cpu_user_secs	3	12714.3000 (± 7.6987)	13223.2667 (± 8.7341)

jamesob/2019-08-robinhood vs. $mergebase (relative)

bench name	x	jamesob/2019-08-robinhood	$mergebase
ibd.local.range.500000.540000.total_secs	3	1	1.134
ibd.local.range.500000.540000.peak_rss_KiB	3	1	1.150
ibd.local.range.500000.540000.cpu_kernel_secs	3	1	1.010
ibd.local.range.500000.540000.cpu_user_secs	3	1	1.040

dbcache=300

ibd local range 500000 535000

commands index

bench name	command
ibd.local.range.500000.535000	`bitcoind -dbcache=300 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876`

2019-08-robinhood vs. master (absolute)

bench name	x	2019-08-robinhood	master
ibd.local.range.500000.535000.total_secs	3	3592.5329 (± 202.2415)	4070.7353 (± 13.1181)
ibd.local.range.500000.535000.peak_rss_KiB	3	2049018.6667 (± 147027.9395)	2110129.3333 (± 36744.0276)
ibd.local.range.500000.535000.cpu_kernel_secs	3	576.7433 (± 4.9359)	576.6567 (± 1.4616)
ibd.local.range.500000.535000.cpu_user_secs	3	10879.8767 (± 20.1319)	11260.5933 (± 8.3835)

2019-08-robinhood vs. master (relative)

bench name	x	2019-08-robinhood	master
ibd.local.range.500000.535000.total_secs	3	1.00	1.133
ibd.local.range.500000.535000.peak_rss_KiB	3	1.00	1.030
ibd.local.range.500000.535000.cpu_kernel_secs	3	1.00	1.000
ibd.local.range.500000.535000.cpu_user_secs	3	1.00	1.035

0xB10C commented at 8:50 pm on August 5, 2021: member

Cool. I plan on benchmarking this too (at least IBD speed) via the validation::connect_block USDT tracepoint.

DrahtBot added the label UTXO Db and Indexes on Aug 5, 2021

martinus commented at 7:15 am on August 6, 2021: contributor

It’s unfortunately really hard to fix https://github.com/martinus/robin-hood-hashing/issues/117 without losing at least some of the performance benefit. Also I don’t really have the time to do it properly… So I’m personally quite a bit weary about using my robin_hood map for such a critical place. I mean I’ve been using it in for a long time in many countless installations and personally have never seen that problem, but other users of the map have definitely seen it. And it definitely is possible for an malicious actor to construct sequence of values that will cause the map to always fail. That’s not possible for unordered_map, this would only degrades the performance.

I think it would be good to try the same benchmarks also with #16801 and also with tsl::sparse_map.

#16801 introduces a relatively simple bulk pool allocator for std::unordered_map. I think most of the performance improvement from robin_hood comes due to the more efficient allocation, and that pool allocator would bring the same allocation behavior to std::unordered_map. The code can be updated with C++17 features which would make it less of a hack. Such a pool allocator could also be beneficial in other places in the code too, wherever node based containers are used.
Have a look at memory efficient hashmap implementation: tsl::sparse_map seems to be excellent, header only, and in my benchmarks while slower than robin_hood it is faster than std::unordered_map and more memory efficient than both. I think being able to keep more elements in memory should be a better trade off because it needs less flushes. https://github.com/Tessil/sparse-map

martinus commented at 8:37 pm on August 6, 2021: contributor

I have now rebased my branch https://github.com/martinus/bitcoin/tree/2019-08-bulkpoolallocator and done some rudimentary benchmarks: create the CCoinsMap, and 5 times insert 1 million entries and then clear it. I measured average time per insertion in ns/op and maximum resident set size in kbyte.

I tried std::unordered_map and std::map with and without the bulk_pool, and also tried a very simple hash:

0struct Xor {
1    size_t operator()(const COutPoint& id) const noexcept {
2        return static_cast<size_t>(id.hash.GetUint64(0) ^ id.hash.GetUint64(1) ^ id.hash.GetUint64(2) ^ id.hash.GetUint64(3));
3    }
4};

I tried a few different hash implementations that I thought would be interesting, https://github.com/greg7mdp/parallel-hashmap, my https://github.com/martinus/robin-hood-hashing and https://github.com/Tessil/sparse-map.

ns/op	RSS kB	map	Pool	Hash?
62.51	311252	robin_hood::unordered_flat_map		Xor
99.93	311392	robin_hood::unordered_flat_map		SaltedOutpointHasher
106.71	135460	std::unordered_map	Yes	Xor
130.00	126124	robin_hood::unordered_node_map		Xor
158.59	126124	robin_hood::unordered_node_map		SaltedOutpointHasher
175.24	149368	std::unordered_map		Xor
264.35	311212	phmap::flat_hash_map		Xor
317.85	135556	std::unordered_map	Yes	SaltedOutpointHasher
402.43	149460	std::unordered_map		SaltedOutpointHasher
463.53	128336	tsl::sparse_map		Xor
586.76	153776	std::map
545.62	112332	tsl::sparse_map		SaltedOutpointHasher
515.25	154712	phmap::node_hash_map		SimpleHasher
1,113.55	140448	std::map	Yes

Well, I actually didn’t expect robin_hood map to be so good :)

Interestingly though, at least in that benchmark, std::unordered_map with the bulk_pool and the simple hash performs really well, even faster than robin_hood::unordered_node_map. I didn’t expect tsl::sparse_map and phmap::node_hash_map to be so slow. Maybe they fare better in a real world benchmark with lots of lookups.

So I think it would be worthwhile to properly benchmark the bulk_pool way and a faster hash. Not sure if such a strong hash like siphash is actually needed here.

jamesob commented at 4:10 pm on August 9, 2021: member

Once again, @martinus has convinced me that the relative performance benefits of swapping out the entire map implementation aren’t worth the risks inherent in robin_hood’s more uncertain failure modes. Luckily, bitcoinperf benchmarks (https://github.com/bitcoin/bitcoin/pull/16801#issuecomment-895348607) indicate that work on the allocator may (as Martin suspected) yield most of the benefits of the change here with much less risk.

I’m closing this PR (again) and I think we should focus effort on vetting the allocator in Martin’s branch.

jamesob closed this on Aug 9, 2021

DrahtBot locked this on Aug 16, 2022

Improve speed, memory efficiency with alternate hashmap (#2) #22640

Implementations considered

Next steps

Questions

Issues to consider

Benchmarks

dbcache=10000

commands index

jamesob/2019-08-robinhood vs. $mergebase (absolute)

jamesob/2019-08-robinhood vs. $mergebase (relative)

dbcache=300

commands index

2019-08-robinhood vs. master (absolute)

2019-08-robinhood vs. master (relative)