Improve speed, memory efficiency with alternate hashmap (#2) #22640

pull jamesob wants to merge 9 commits into bitcoin:master from jamesob:2019-08-robinhood changing 5 files +2541 −5
  1. jamesob commented at 8:19 pm on August 5, 2021: member

    Resurrection of #16718


    tl;dr

    I try swapping out std::unordered_map for a faster third-party implementation where it matters, see something like 15% speedup for initial block download coincident with a 19.3% reduction in memory usage.


    When profiling initial block download, it’s evident that a decent chunk of time on the critical path is spent in CCoinsViewCache, the data structure responsible for the in-memory representation of the UTXO set.

    Selection_147 Profile of the msghand thread at height ~300,000.

    The essential data member of the cache, cacheCoins, is a std::unordered_map that holds as many UTXO as can fit into memory given the dbcache setting. While this map implementation is definitely preferable to std::map (which retains order and has O(log(n)) lookups insted of O(1)), both stdlib maps are subpar relative to non-std alternatives.

    After seeing cacheCoins as a bottleneck, I decided to try swapping one of these alternative implementations in to see if there is any benefit. It looks like there is.

    ibd local range 500000 520000

    For 20,000 blocks at a representative part of the chain, we see something like a 15% speedup with a 19.3% reduction in memory usage (dbcache was set to 3000).

    The CCoinsCaching benchmark shows improvement but not as dramatically as IBD does:

    microbenches

    Obviously adding 2000 lines of new consensus-critical code isn’t appealing, but if those numbers hold for the full IBD process I think there might be a compelling case to be made for the added code.

    Implementations considered

    Of the few best-looking options, I picked the one with the most succinct implementation and fewest dependencies, robin_hood::unordered_node_map (it’s just a header file). I also looked at Facebook’s F14NodeMap and Abseil’s node_hash_map but both required too much elbow-grease (for now) to integrate here without pulling in a whole lot of code.

    Next steps

    I’m going to be running more comprehensive benchmarks, including a near-full IBD from the network, to see if these improvements hold.

    I’m also interested in benching this along with a similar change to mapBlockIndex (currently a std::unordered_map).

    Questions

    If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?

  2. Use robin_hood.unordered_node_map for CCoinsViewCache.cacheCoins df1c9674c0
  3. [robinhood] remove custom hash dfdd18eb31
  4. [robinhood] remove inline DataNode (flatmap=true) 960dbb5795
  5. [robinhood] remove robin_hood::pair c04614301b
  6. [robinhood] remove Cloner and Destroyer for flatmaps f29a8a4c1d
  7. [robinhood] remove unnecessary constructors, operators 9acd37be68
  8. [robinhood] remove all references to IsFlatMap c0cd24143f
  9. [robinhood] remove integer sequence stuff f76b3681c3
  10. [robin_hood] update to 3.11.3 c19d014826
  11. jamesob commented at 8:34 pm on August 5, 2021: member

    I’m reopening this because I think it’s worth some long-term consideration. We’re not going to get many opportunities to improve block connection by ~12%, so I think this is worth continuing to look at. Now that bitcoin is on c++17, I wonder if there aren’t some good ways to slim down @martinus’s implementation, since it contains a number of provisions to deal with <c++17 platforms. At the very least we could work on fuzz testing this to our satisfaction.

    I’ve rerun some benchmarks on an idle workstation and have confirmed that I continue to see ~13% speedups along with reduction in memory use, which is especially pronounced when using a large dbcache.

    Issues to consider

    • From the README: “For a really bad hash the performance will not only degrade like in std::unordered_map, the map will simply fail with an std::overflow_error. In practice, when using the standard robin_hood::hash, I have never seen this happening.”

    Benchmarks

    dbcache=10000

    ibd local range 500000 540000

    commands index

    bench name command
    ibd.local.range.500000.540000 bitcoind -dbcache=10000 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876

    jamesob/2019-08-robinhood vs. $mergebase (absolute)

    bench name x jamesob/2019-08-robinhood $mergebase
    ibd.local.range.500000.540000.total_secs 3 2812.5091 (± 27.2604) 3190.3358 (± 17.5632)
    ibd.local.range.500000.540000.peak_rss_KiB 3 5517118.6667 (± 1750.8682) 6344530.6667 (± 3478.0996)
    ibd.local.range.500000.540000.cpu_kernel_secs 3 260.3867 (± 2.5737) 263.0400 (± 2.1568)
    ibd.local.range.500000.540000.cpu_user_secs 3 12714.3000 (± 7.6987) 13223.2667 (± 8.7341)

    jamesob/2019-08-robinhood vs. $mergebase (relative)

    bench name x jamesob/2019-08-robinhood $mergebase
    ibd.local.range.500000.540000.total_secs 3 1 1.134
    ibd.local.range.500000.540000.peak_rss_KiB 3 1 1.150
    ibd.local.range.500000.540000.cpu_kernel_secs 3 1 1.010
    ibd.local.range.500000.540000.cpu_user_secs 3 1 1.040

    dbcache=300

    ibd local range 500000 535000

    commands index

    bench name command
    ibd.local.range.500000.535000 bitcoind -dbcache=300 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876

    2019-08-robinhood vs. master (absolute)

    bench name x 2019-08-robinhood master
    ibd.local.range.500000.535000.total_secs 3 3592.5329 (± 202.2415) 4070.7353 (± 13.1181)
    ibd.local.range.500000.535000.peak_rss_KiB 3 2049018.6667 (± 147027.9395) 2110129.3333 (± 36744.0276)
    ibd.local.range.500000.535000.cpu_kernel_secs 3 576.7433 (± 4.9359) 576.6567 (± 1.4616)
    ibd.local.range.500000.535000.cpu_user_secs 3 10879.8767 (± 20.1319) 11260.5933 (± 8.3835)

    2019-08-robinhood vs. master (relative)

    bench name x 2019-08-robinhood master
    ibd.local.range.500000.535000.total_secs 3 1.00 1.133
    ibd.local.range.500000.535000.peak_rss_KiB 3 1.00 1.030
    ibd.local.range.500000.535000.cpu_kernel_secs 3 1.00 1.000
    ibd.local.range.500000.535000.cpu_user_secs 3 1.00 1.035
  12. 0xB10C commented at 8:50 pm on August 5, 2021: member
    Cool. I plan on benchmarking this too (at least IBD speed) via the validation::connect_block USDT tracepoint.
  13. DrahtBot added the label UTXO Db and Indexes on Aug 5, 2021
  14. martinus commented at 7:15 am on August 6, 2021: contributor

    It’s unfortunately really hard to fix https://github.com/martinus/robin-hood-hashing/issues/117 without losing at least some of the performance benefit. Also I don’t really have the time to do it properly… So I’m personally quite a bit weary about using my robin_hood map for such a critical place. I mean I’ve been using it in for a long time in many countless installations and personally have never seen that problem, but other users of the map have definitely seen it. And it definitely is possible for an malicious actor to construct sequence of values that will cause the map to always fail. That’s not possible for unordered_map, this would only degrades the performance.

    I think it would be good to try the same benchmarks also with #16801 and also with tsl::sparse_map.

    • #16801 introduces a relatively simple bulk pool allocator for std::unordered_map. I think most of the performance improvement from robin_hood comes due to the more efficient allocation, and that pool allocator would bring the same allocation behavior to std::unordered_map. The code can be updated with C++17 features which would make it less of a hack. Such a pool allocator could also be beneficial in other places in the code too, wherever node based containers are used.

    • Have a look at memory efficient hashmap implementation: tsl::sparse_map seems to be excellent, header only, and in my benchmarks while slower than robin_hood it is faster than std::unordered_map and more memory efficient than both. I think being able to keep more elements in memory should be a better trade off because it needs less flushes. https://github.com/Tessil/sparse-map

  15. martinus commented at 8:37 pm on August 6, 2021: contributor

    I have now rebased my branch https://github.com/martinus/bitcoin/tree/2019-08-bulkpoolallocator and done some rudimentary benchmarks: create the CCoinsMap, and 5 times insert 1 million entries and then clear it. I measured average time per insertion in ns/op and maximum resident set size in kbyte.

    I tried std::unordered_map and std::map with and without the bulk_pool, and also tried a very simple hash:

    0struct Xor {
    1    size_t operator()(const COutPoint& id) const noexcept {
    2        return static_cast<size_t>(id.hash.GetUint64(0) ^ id.hash.GetUint64(1) ^ id.hash.GetUint64(2) ^ id.hash.GetUint64(3));
    3    }
    4};
    

    I tried a few different hash implementations that I thought would be interesting, https://github.com/greg7mdp/parallel-hashmap, my https://github.com/martinus/robin-hood-hashing and https://github.com/Tessil/sparse-map.

    ns/op RSS kB map Pool Hash?
    62.51 311252 robin_hood::unordered_flat_map Xor
    99.93 311392 robin_hood::unordered_flat_map SaltedOutpointHasher
    106.71 135460 std::unordered_map Yes Xor
    130.00 126124 robin_hood::unordered_node_map Xor
    158.59 126124 robin_hood::unordered_node_map SaltedOutpointHasher
    175.24 149368 std::unordered_map Xor
    264.35 311212 phmap::flat_hash_map Xor
    317.85 135556 std::unordered_map Yes SaltedOutpointHasher
    402.43 149460 std::unordered_map SaltedOutpointHasher
    463.53 128336 tsl::sparse_map Xor
    586.76 153776 std::map
    545.62 112332 tsl::sparse_map SaltedOutpointHasher
    515.25 154712 phmap::node_hash_map SimpleHasher
    1,113.55 140448 std::map Yes

    Well, I actually didn’t expect robin_hood map to be so good :)

    Interestingly though, at least in that benchmark, std::unordered_map with the bulk_pool and the simple hash performs really well, even faster than robin_hood::unordered_node_map. I didn’t expect tsl::sparse_map and phmap::node_hash_map to be so slow. Maybe they fare better in a real world benchmark with lots of lookups.

    So I think it would be worthwhile to properly benchmark the bulk_pool way and a faster hash. Not sure if such a strong hash like siphash is actually needed here.

  16. jamesob commented at 4:10 pm on August 9, 2021: member

    Once again, @martinus has convinced me that the relative performance benefits of swapping out the entire map implementation aren’t worth the risks inherent in robin_hood’s more uncertain failure modes. Luckily, bitcoinperf benchmarks (https://github.com/bitcoin/bitcoin/pull/16801#issuecomment-895348607) indicate that work on the allocator may (as Martin suspected) yield most of the benefits of the change here with much less risk.

    I’m closing this PR (again) and I think we should focus effort on vetting the allocator in Martin’s branch.

  17. jamesob closed this on Aug 9, 2021

  18. DrahtBot locked this on Aug 16, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-21 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me