Improve speed, memory efficiency with alternate hashmap #16718

pull jamesob wants to merge 8 commits into bitcoin:master from jamesob:2019-08-robinhood changing 5 files +1509 −5
  1. jamesob commented at 2:30 am on August 25, 2019: member

    tl;dr

I try swapping out std::unordered_map for a faster third-party implementation where it matters, and see something like a 15% speedup for initial block download along with a 19.3% reduction in memory usage.


    When profiling initial block download, it’s evident that a decent chunk of time on the critical path is spent in CCoinsViewCache, the data structure responsible for the in-memory representation of the UTXO set.

[Image Selection_147: profile of the msghand thread at height ~300,000.]

The essential data member of the cache, cacheCoins, is a std::unordered_map that holds as many UTXOs as can fit into memory given the dbcache setting. While this map implementation is definitely preferable to std::map (which keeps its keys ordered and has O(log(n)) lookups instead of O(1)), both stdlib maps are subpar relative to non-std alternatives.
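For context, the container type sits behind a single typedef in src/coins.h, so the swap itself is essentially a one-line type change (a sketch from memory, not the exact diff):

    // src/coins.h: cacheCoins' container type is one typedef, so swapping
    // implementations is a one-line change.
    typedef std::unordered_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher> CCoinsMap;

    // ...which on this branch becomes:
    typedef robin_hood::unordered_node_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher> CCoinsMap;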

    After seeing cacheCoins as a bottleneck, I decided to try swapping one of these alternative implementations in to see if there is any benefit. It looks like there is.

[Chart: local IBD benchmark over blocks 500,000–520,000.]

    For 20,000 blocks at a representative part of the chain, we see something like a 15% speedup with a 19.3% reduction in memory usage (dbcache was set to 3000).

    The CCoinsCaching benchmark shows improvement but not as dramatically as IBD does:

[Chart: microbenchmark results.]

    Obviously adding 2000 lines of new consensus-critical code isn’t appealing, but if those numbers hold for the full IBD process I think there might be a compelling case to be made for the added code.

    Implementations considered

    Of the few best-looking options, I picked the one with the most succinct implementation and fewest dependencies, robin_hood::unordered_node_map (it’s just a header file). I also looked at Facebook’s F14NodeMap and Abseil’s node_hash_map but both required too much elbow-grease (for now) to integrate here without pulling in a whole lot of code.

    Next steps

    I’m going to be running more comprehensive benchmarks, including a near-full IBD from the network, to see if these improvements hold.

    I’m also interested in benching this along with a similar change to mapBlockIndex (currently a std::unordered_map).

    Questions

    If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?

  2. Use robin_hood.unordered_node_map for CCoinsViewCache.cacheCoins 6f9882ce4a
  3. fanquake added the label Resource usage on Aug 25, 2019
  4. promag commented at 10:09 am on August 25, 2019: member

    including a near-full IBD from the network, to see if these improvements hold.

    Yap that would be interesting.

    I’m also interested in benching this along with a similar change to mapBlockIndex

    I’d be surprised to see such improvements here.

    If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?

    Either way it should include tests.

  5. fanquake added the label Needs Conceptual Review on Aug 25, 2019
  6. fanquake deleted a comment on Aug 25, 2019
  7. DrahtBot commented at 12:07 pm on August 25, 2019: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #16910 (wallet: reduce loading time by using unordered maps by achow101)
    • #16801 (faster & less memory for sync: bulk pool allocator for node based containers by martinus)
    • #15606 ([experimental] UTXO snapshots by jamesob)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  8. practicalswift commented at 12:09 pm on August 25, 2019: contributor

    @jamesob Interesting! Thanks for providing benchmarks!

    Which implementation of std::unordered_map did you benchmark against?

    It would be interesting to see this benchmarked against the three major implementations of std::unordered_map (libstdc++, libc++ and Microsoft STL).

  9. jamesob commented at 2:46 pm on August 25, 2019: member

    Which implementation of std::unordered_map did you benchmark against?

    I’ve been benching on Ubuntu and Debian systems with libstdc++ (x86_64, glibc6).

  10. fanquake commented at 2:58 am on August 26, 2019: member

    I try swapping out std::unordered_map for a faster third-party implementation where it matters, see something like 15% speedup for initial block download coincident with a 19.3% reduction in memory usage.

    Nice. This is the kind of PR I enjoy reading / reviewing / testing.

    I did a basic benchmark on my macOS machine: time src/bitcoind -stopatheight=350000.

• #16718 (6f9882ce4a817c9f14aa7526165ab6e278de890e): 48m49s
• master (db67101c748c208cced8e9b76a66d57cd48fbf6e): 55m22s

Will run a sync to a greater height later.

  11. jamesob commented at 4:23 pm on August 26, 2019: member

    Some additional context from further benchmarking:

    Local tests

    I did a local IBD (not real network) comparison up to height 510,000. We see a speedup of roughly 17% coupled with memory savings of roughly 13.5%.

    Given the vertical shift in real memory usage curve on the robinhood branch (pictured below), I suspect there’s a problem with my implementation of the DynamicUsage() memory estimator for the new robin_hood map. We appear to be over-reporting estimated memory usage of the new map relative to actual, so fixing this might lead to additional speedups for configurations with a limiting dbcache.
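For reference, the std::unordered_map estimator in src/memusage.h has roughly this shape (paraphrased, not verbatim): it charges one heap node per entry plus the bucket pointer array, which doesn't match how the robin_hood map bulk-allocates its nodes:

    // Rough shape of src/memusage.h's DynamicUsage for std::unordered_map
    // (paraphrased): per-entry node allocations plus the bucket array.
    template <typename X, typename Y>
    static inline size_t DynamicUsage(const std::unordered_map<X, Y>& m)
    {
        return MallocUsage(sizeof(unordered_node<std::pair<const X, Y>>)) * m.size() +
               MallocUsage(sizeof(void*) * m.bucket_count());
    }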

[Chart: local IBD to height 510,000.]

    2019-08-robinhood vs. master (absolute)

    bench name                     x   2019-08-robinhood         master
    ibd.local.510000.total_secs    1   7827.2801 (± 0.0000)      9447.3154 (± 0.0000)
    ibd.local.510000.peak_rss_KiB  1   4863324.0000 (± 0.0000)   5620656.0000 (± 0.0000)

    2019-08-robinhood vs. master (relative)

    bench name                     x   2019-08-robinhood   master
    ibd.local.510000.total_secs    1   1                   1.207
    ibd.local.510000.peak_rss_KiB  1   1                   1.156

    IBD from the real network

I let IBD run overnight up to height 550,000 using peers from the real network for both master and this branch. The differences in speed were not as compelling here (a 6% gain for the robinhood branch), but memory differences were still pronounced: 14.6% savings for the robinhood branch (6502MB vs. 7615MB, both at a dbcache of 5000).

    Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
    Linux 4.4.0-119-generic x86_64

    `master` to height 550_000, dbcache=5000:

    Command being timed: "./src/bitcoind -datadir=/data/bitcoin-rh -printtoconsole=0 -stopatheight=550000 -dbcache=5000"
    User time (seconds): 54822.52
    System time (seconds): 929.91
    Percent of CPU this job got: 121%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 12:44:12
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 7615768
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 1715
    Minor (reclaiming a frame) page faults: 72418086
    Voluntary context switches: 26934810
    Involuntary context switches: 287115
    Swaps: 0
    File system inputs: 451040
    File system outputs: 596633792
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0


    On the robinhood branch:

    Command being timed: "./src/bitcoind -datadir=/data/bitcoin-rh -printtoconsole=0 -stopatheight=550000 -dbcache=5000"
    User time (seconds): 52074.03
    System time (seconds): 914.95
    Percent of CPU this job got: 123%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 11:57:01
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 6502096
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 14411
    Minor (reclaiming a frame) page faults: 72379830
    Voluntary context switches: 22815471
    Involuntary context switches: 290463
    Swaps: 0
    File system inputs: 770608
    File system outputs: 597949704
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

    The “real” gains here dampen my enthusiasm somewhat, but the memory story is compelling. Curious for others to weigh in on whether these differences merit the increased risk of bringing in a non-std map implementation.

  12. jamesob force-pushed on Aug 26, 2019
  13. jamesob commented at 3:19 am on August 27, 2019: member

Consider my enthusiasm renewed, because the latest “real” IBD runs came back showing a hard-to-believe 40% speedup. Synced to 550,000 on different machines with identical hardware. Still a 12% memory savings.

    Obviously it’s hard to gauge variance when syncing from the real network.

    2019-08-robinhood

[Image Selection_149: sync output for 2019-08-robinhood.]

    master

[Image Selection_148: sync output for master.]

    Apologies for the images - tmux over SSH makes copying text a pain.

  14. JeremyRubin commented at 4:25 am on August 27, 2019: contributor

    Concept ack, NACK this implementation. From the docs:

    Depends on good Hashing. For a really bad hash the performance will not only degrade like in std::unordered_map, the map will simply fail with an std::overflow_error. In practice, when using the standard robin_hood::hash, I have never seen this happening.
    

There are some ways that we could maybe get better performance with std::unordered_map; I will take a look and give some feedback later.

  15. jamesob commented at 4:00 pm on August 27, 2019: member

    the map will simply fail with an std::overflow_error

    Thanks for the close look @JeremyRubin. This is exactly the kind of change I’d be really worried about introducing. The performance difference will have to be really compelling (and verified by a few people aside from myself) to consider using a non-std implementation for something as consensus-sensitive as cacheCoins.

    I wonder if, assuming we did find a non-std implementation that we’re comfortable with, we should initially leave its use as an opt-in configure flag that, e.g., installations running on lower-power hardware can make use of.

  16. jamesob commented at 4:22 pm on August 27, 2019: member
Alternatively, we could look at forking robin-hood-hashing and replacing all instances of throwOverflowError() with something more conservative.
  17. elichai commented at 4:43 pm on August 27, 2019: contributor
    Concept ACK for replacing the hashmap, std::map is a disaster. (std::unordered_map is a little bit better but still really bad).
  18. JeremyRubin commented at 5:26 pm on August 27, 2019: contributor

    @jamesob maybe have a look at the experimental rust fork?

You could just wrap the Rust hashmap (with at least Rust v1.35, or use the hashbrown crate). Rust’s hashmap is based on SwissTable.

Replacing sensitive data structures with Rust seems like a good use case given long-term goals, and it could be used experimentally for low-resource nodes?

  19. elichai commented at 5:34 pm on August 27, 2019: contributor

@jeremyrubin I just had a talk with him about this. Personally I’d love to see Rust inside of Bitcoin, but I’m not sure a hashmap is the best first step, as integrating Rust and C++ requires a C ABI in between.

    I do hope I’ll have time to test the C++ SwissTable implementation and compare benchmarks https://github.com/abseil/abseil-cpp/blob/master/absl/container/internal/raw_hash_set.h

  20. jamesob commented at 5:43 pm on August 27, 2019: member
    I agree with @elichai - I think adding Rust as a dependency for one of the most consensus-sensitive data structures is way too big a leap. A few extra thousand lines of heavily-scrutinized cpp is about the biggest risk I’m willing to take personally, and that’s assuming a ton of review, testing, and maybe a gradual deployment process.
  21. elichai commented at 7:56 pm on August 27, 2019: contributor
    1. Would love a quick explanation on how to run all these benchmarks myself (so I can get both a control before changing and after).

2. Why did you choose unordered_node_map and not unordered_flat_map?

    3. Do you think it would be interesting in the future for other uses of std::unordered_map in the code? (and potentially even std::map)

  22. jamesob commented at 8:13 pm on August 27, 2019: member

    Would love a quick explanation on how to run all these benchmarks myself (so I can get both a control before changing and after).

    You can use bitcoinperf to perform the host-local comparative benchmarks I’ve posted above. I’ve committed an example file for this branch here - note that you’re going to have to change some of the paths and potentially create some partially synced datadirs to be able to use it.

    Why did you choose unordered_node_map and not unordered_flat_map?

In the case of cacheCoins, the values can be fairly large (>55 bytes) since they include scriptPubKeys. The flat map is good for small values, but shuffling around medium-to-large objects isn’t its intended use. The node map allocates values separately and stores pointers instead. See Folly’s docs for more details.
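To make the layout difference concrete, a toy sketch (illustrative types only, not the robin_hood internals):

    #include <cstdint>
    #include <memory>

    struct Key   { uint8_t bytes[36]; };  // roughly a COutPoint: txid + output index
    struct Value { uint8_t bytes[64]; };  // a medium-to-large payload, e.g. with a scriptPubKey

    // Flat layout: the value lives inline in the bucket array, so every
    // rehash or shuffle moves the whole entry.
    struct FlatBucket { Key key; Value value; };

    // Node layout: the bucket holds a pointer; the value is allocated once
    // and never moves, so rehashing shuffles only pointers.
    struct NodeBucket { Key key; std::unique_ptr<Value> value; };

    static_assert(sizeof(FlatBucket) > sizeof(NodeBucket),
                  "a rehash moves far fewer bytes per entry with the node layout");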

    That said, I’m happy to benchmark against a flat_map implementation to see how it compares to node_map.

    Do you think it would be interesting in the future for other uses of std::unordered_map in the code? (and potentially even std::map)

    Yeah, I do, though I don’t expect any other piece of data in the system will give us such dramatic improvements in performance. If we’re going to use this map for arguably the most sensitive data structure in the system, we might as well use it elsewhere. I think a compelling candidate would be mapBlockIndex, but there are probably other good uses as well.

  23. elichai commented at 8:18 pm on August 27, 2019: contributor
    Sounds good :) Thanks! I’ll try also to benchmark myself, curious the difference between unordered_node_map and unordered_flat_map over unique pointers.
  24. JeremyRubin commented at 10:25 pm on August 27, 2019: contributor

    Ok I’ve had a chance to more tightly review the code here. I didn’t want to say so earlier, because I wasn’t sure, but I think it should be possible to modify the cuckoocache code minimally to support cacheCoins (store a key/value and (maybe) get rid of atomics for erasure).

    I can put together a PoC of that… there are some other benefits as CuckooCache is specifically a cache and not a map being used as a cache.

    This is already a consensus dependency, and has been in since v0.14, so it introduces less new code.

  25. jamesob commented at 1:47 pm on August 28, 2019: member

    Latest round of benchmarks are in from two dedicated bench machines with identical specs (Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz, 8GB memory, SSD).

    First I reran the last test I reported above (/usr/bin/time -v ./src/bitcoind -stopatheight=550000 -dbcache=5000) to ensure I could recreate the results. To my surprise, they held (40% speedup, 13% memory savings). To ensure that the wide disparity in performance wasn’t due to a difference in the machines, I did a crossover: I swapped branches on each machine and reran the IBD. Same results.

    master (a7be1cc92be4946c4f042bccd3a1b007657f3241):

    Elapsed (wall clock) time (h:mm:ss or m:ss): 6:47:37
    Maximum resident set size (kbytes): 7425920
    

    this branch:

    Elapsed (wall clock) time (h:mm:ss or m:ss): 4:03:51
    Maximum resident set size (kbytes): 6436820
    

    This is a remarkable performance improvement that I’m eager to see reproduced by others.

  26. jamesob commented at 1:57 pm on August 28, 2019: member

    I think it should be possible to modify the cuckoocache code minimally to support cacheCoins @JeremyRubin I’d be very eager to see your PoC for this. In principle I strongly agree that if we can modify existing consensus-supporting data structures and see similar performance benefits, that’s the way to go.

    CuckooCache is specifically a cache and not a map being used as a cache

    I’m curious about what you mean here, and it highlights one of my concerns about repurposing CuckooCache. As you probably know, we can’t just evict stuff from cacheCoins - we have to be very careful about persisting cache content (FlushStateToDisk()) before eviction. At the moment it looks like CuckooCache quietly evicts data when it fills up - I’m curious to see how you’ll ensure that we can coordinate eviction with a guaranteed flush.

  27. Sjors commented at 4:20 pm on August 28, 2019: member

    Concept ACK for improving std::unordered_map behavior or swapping it for an alternative if the improvement is >10%. It’s nice how this only changes 1 line in coins.h.

    I’m able to reproduce the memory gain, but the speed is actually worse for me, on a 2019 MacBook Pro with -dbcache=10000, no indexes and no wallets, syncing over LAN to a single node:

    • IBD before: 3:35 hours plus 11 minute db flush (memory at height 564180: 8.23 GB)
    • IBD after: 5:30 hours plus 20 minute db flush (memory at height 564180: 7.5 GB)
  28. martinus commented at 4:22 pm on August 28, 2019: contributor

Hi all! I am the author of the robin_hood::unordered_map variants, and just saw this PR by accident. Please ask away if you have any questions about my implementation!

    I have also done extensive map benchmarks here, which might be interesting: https://github.com/martinus/map_benchmark

I agree with @JeremyRubin that the throwOverflowError() is a potential problem, if there is a way to force collisions in the map. Be aware that new hashmaps (Facebook’s F14, Google’s SwissTable) also depend on good hashes. They don’t throw an exception, but performance can degrade so much that it’s not much different from an infinite loop.

  29. in src/test/coins_tests.cpp:67 in 33b9786e6c outdated
    @@ -64,7 +64,7 @@ class CCoinsViewTest : public CCoinsView
                        map_.erase(it->first);
                    }
                }
    -            mapCoins.erase(it++);
    +            it = mapCoins.erase(it);
    


    martinus commented at 4:51 pm on August 28, 2019:
    Be aware that there is a bug in robin_hood map https://github.com/martinus/robin-hood-hashing/issues/42 (as well as in tessil’s map https://github.com/Tessil/robin-map/issues/20) where it is possible that entries are iterated over a second time before end() is reached.
  30. elichai commented at 4:59 pm on August 28, 2019: contributor

    @martinus would love to see some list of projects you know are depending and using your implementation :)

    I think if people would feel this is a stable and vetted implementation they would feel better about considering using it in consensus critical code. Thanks :)

  31. martinus commented at 5:41 pm on August 28, 2019: contributor

    @elichai, we have been using the map in several core components in Dynatrace’s proprietary monitoring platform for quite a while. That’s a very well tested and widely deployed product on multiple platforms (I work at dynatrace so I might be biased..).

I don’t know much about other use cases. Here is a GitHub query to find other uses of the map, though it only finds usages that copied the file: https://github.com/search?q=robin_hood_h_included&type=Code

    Some usages that I had a look at:

  32. MarcoFalke commented at 5:51 pm on August 28, 2019: member
Are there any benchmarks on low-CPU, low-memory devices? Given that this slows down IBD for @Sjors (https://github.com/bitcoin/bitcoin/pull/16718#issuecomment-525818499), I wouldn’t be surprised if it also slowed down other architectures/configurations.
  33. laanwj commented at 6:37 pm on August 28, 2019: member

The speed-ups are promising; it’s surprising to me that this makes so much of a difference.

    But to me it sounds quite risky, to be honest, to import a custom map implementation into consensus-critical code. Of course, we can never be sure, but likely these have much less testing than whatever tried-and-true map implementation comes with C++ libraries by default.

    (Importing rust code for this reason seems a very bad idea to me, I’d be surprised if it helps performance at all with a FFI barrier there, and that makes it doubly risky)

  34. TheBlueMatt commented at 6:56 pm on August 28, 2019: member
Right, pulling in Rust for something like this is way too early/aggressive. I’m curious if it’s possible to get similar speedups by taking a hatchet to robinhood so that we can cut it down to something that’s more reviewable, since obviously that’s the key criterion.
  35. martinus commented at 6:59 pm on August 28, 2019: contributor

    It would be very interesting where exactly the speedup is coming from. It might be possible to reproduce this speedup by staying with std::unordered_map. E.g. I see two possibilities for improvement:

    • robin_hood map requires less hashing and fewer key comparisons than std::unordered_map. It might be possible to save a lot of hashing & comparisons by simply adding the precalculated hash value to the key (so instead of the COutPoint use e.g. std::pair<COutPoint, size_t>). This unfortunately will increase memory usage a bit (but not much, since the key/value pair is already quite large). A sketch of this idea follows the list below.

    • I use a bulk allocator that never frees any memory, but reuses previously allocated memory. So once the cache has reached its peak size, no allocations will be necessary any more. This can also be achieved with a custom allocator for std::unordered_map.

    • If the speedup comes from the robin_hood style memory layout, that’s bad luck - can’t be achieved with std::unordered_map.
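A minimal sketch of the first bullet, using hypothetical names (COutPointLike stands in for the real COutPoint):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <unordered_map>
    #include <utility>

    struct COutPointLike { uint8_t txid[32]; uint32_t n; };  // stand-in for COutPoint

    // The key carries its salted hash, computed once when the entry is created.
    using HashedKey = std::pair<COutPointLike, std::size_t>;

    struct CachedHasher {
        std::size_t operator()(const HashedKey& k) const noexcept { return k.second; }
    };

    struct CachedKeyEqual {
        bool operator()(const HashedKey& a, const HashedKey& b) const noexcept {
            return a.second == b.second &&  // cheap reject on the cached hash first
                   std::memcmp(&a.first, &b.first, sizeof(a.first)) == 0;
        }
    };

    // Hypothetical usage, keeping std::unordered_map:
    // std::unordered_map<HashedKey, CCoinsCacheEntry, CachedHasher, CachedKeyEqual> cacheCoins;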

  36. jamesob commented at 7:03 pm on August 28, 2019: member

First -reindex-chainstate bench finished: ~~46.7%~~ 2.14x faster with 1173 MB less memory usage.

    master (a7be1cc92be4946c4f042bccd3a1b007657f3241):

    user@bench-ssd-4:~/bitcoin$ /usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000; git branch

            Command being timed: "./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000"
            User time (seconds): 7735.97
            System time (seconds): 1081.16
            Percent of CPU this job got: 53%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 4:32:10
            Average shared text size (kbytes): 0
            Average unshared data size (kbytes): 0
            Average stack size (kbytes): 0
            Average total size (kbytes): 0
            Maximum resident set size (kbytes): 7563764
            Average resident set size (kbytes): 0
            Major (requiring I/O) page faults: 15004449
            Minor (reclaiming a frame) page faults: 61431800
            Voluntary context switches: 22076432
            Involuntary context switches: 4492039
            Swaps: 0
            File system inputs: 2826327856
            File system outputs: 165988936
            Socket messages sent: 0
            Socket messages received: 0
            Signals delivered: 0
            Page size (bytes): 4096
            Exit status: 0
    

    2019-08-robinhood:

    user@bench-ssd-3:~/bitcoin$ /usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000; git branch

            Command being timed: "./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000"
            User time (seconds): 5598.92
            System time (seconds): 720.66
            Percent of CPU this job got: 82%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 2:07:54
            Average shared text size (kbytes): 0
            Average unshared data size (kbytes): 0
            Average stack size (kbytes): 0
            Average total size (kbytes): 0
            Maximum resident set size (kbytes): 6361924
            Average resident set size (kbytes): 0
            Major (requiring I/O) page faults: 1684815
            Minor (reclaiming a frame) page faults: 73715700
            Voluntary context switches: 7033296
            Involuntary context switches: 1901858
            Swaps: 0
            File system inputs: 502322648
            File system outputs: 177410624
            Socket messages sent: 0
            Socket messages received: 0
            Signals delivered: 0
            Page size (bytes): 4096
            Exit status: 0
    

    @laanwj in terms of your concerns, I couldn’t agree more. The last thing I want is to be responsible for a consensus failure.

I wonder if, should these performance gains indeed turn out to be real, we could devote a substantial amount of effort to scrutinizing this map implementation using e.g. fuzz + stress testing and thorough manual review. After some period (months), we could make its use opt-in via configure flag to allow some hardware-constrained installations to use it.

That said, in the grand scheme of things maybe even a 50% speedup in IBD is marginal relative to the risk.

  37. jamesob commented at 7:11 pm on August 28, 2019: member
    My mistake - that’s 2.14x faster, not 46.7%.
  38. martinus commented at 7:19 pm on August 28, 2019: contributor

    master (a7be1cc):

            User time (seconds): 7735.97
            Major (requiring I/O) page faults: 15004449
    

    2019-08-robinhood:

            User time (seconds): 5598.92
            Major (requiring I/O) page faults: 1684815
    

    I find it a bit fishy that the major page faults with robinhood are 10x less. Could disk cache affect the benchmark? Maybe drop the cache before benchmarking, I think sync; echo 3 > /proc/sys/vm/drop_caches should work

  39. MarcoFalke commented at 7:22 pm on August 28, 2019: member

    we could make its use opt-in via configure flag to allow some hardware-constrained installations to use it

I disagree about a phased deployment. Either the consensus code is safe to run, correct, and can be deployed to the whole network, or not. There is no in-between. The last thing you want to see happening is that all ARM (or whatever “hardware-constrained” means) devices are off consensus.

  40. martinus commented at 7:26 am on August 30, 2019: contributor

I’ve benchmarked bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000 on my machine (Intel i7-8700, g++ 9.1, libstdc++), and can confirm a significant speedup with @jamesob’s changes. For accurate reproducibility I have disabled frequency scaling and turbo boost for my CPU (with sudo pyperf system tune). Runtime with std::unordered_map is 6471 seconds, and 5339 seconds with the robin_hood map: 21% faster.

    I’ve modified SaltedOutpointHasher and added a custom equals function to count the number of hashes and equal operations performed. To my surprise, on my system std::unordered_map needs less hash calls. It looks like it caches the full hash. (10.7 vs. 13.2 billion hashes). I’ve benchmarked just the hashing and equals operations, and this should account for about 323 vs. 391 seconds of CPU total.
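For anyone wanting to reproduce the counting, a sketch of one way to do it (not my actual patch): wrap the map’s hasher and key-equality functors with counters:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    std::atomic<uint64_t> g_hash_calls{0};  // hypothetical globals, single TU
    std::atomic<uint64_t> g_eq_calls{0};

    template <typename Hasher>
    struct CountingHash : Hasher {
        template <typename K>
        std::size_t operator()(const K& k) const {
            g_hash_calls.fetch_add(1, std::memory_order_relaxed);
            return Hasher::operator()(k);
        }
    };

    template <typename KeyEqual>
    struct CountingEqual : KeyEqual {
        template <typename K>
        bool operator()(const K& a, const K& b) const {
            g_eq_calls.fetch_add(1, std::memory_order_relaxed);
            return KeyEqual::operator()(a, b);
        }
    };

    // e.g. std::unordered_map<COutPoint, CCoinsCacheEntry,
    //                         CountingHash<SaltedOutpointHasher>,
    //                         CountingEqual<std::equal_to<COutPoint>>> instrumented;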

    Next I’m going to try to use a bulk pool allocator for std::unordered_map and see how that goes.

  41. martinus commented at 5:21 pm on August 30, 2019: contributor

I have tried to use boost::pool_allocator and boost::fast_pool_allocator for the std::unordered_map, but performance was much worse than with std::allocator.

    Then I adapted my BulkPoolAllocator to be usable for the std::unordered_map, and results are much better.

    Benchmark results for bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000 on my i7-8700:

    metric               unordered_map   robin_hood   unordered_map + BulkPoolAllocator
    elapsed (h:mm:ss)    1:47:51         1:28:59      1:32:56
    speedup %            0%              21.2%        16.1%
    max RSS (kbyte)      7553716         6301540      6950432
    max RSS %            100%            83.4%        92.0%

    So most of the savings can also be achieved by staying with std::unordered_map and adding the BulkPoolAllocator. This should be much easier to integrate since it’s less than 200 lines of code. Also, the allocator might make sense for other node-based containers as well.
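To give a flavor of the approach, here is a toy recycling allocator, much simpler than the BulkPoolAllocator (which allocates nodes in large blocks): it merely parks freed nodes on a free list for reuse. Node-based containers allocate and free one node at a time, which is exactly the case this serves:

    #include <cstddef>
    #include <new>
    #include <unordered_map>
    #include <utility>

    template <typename T>
    class FreeListAllocator {
        struct FreeNode { FreeNode* next; };
        FreeNode* free_ = nullptr;

    public:
        using value_type = T;

        FreeListAllocator() noexcept = default;
        FreeListAllocator(const FreeListAllocator&) noexcept {}    // copies start empty
        FreeListAllocator(FreeListAllocator&& o) noexcept : free_(o.free_) { o.free_ = nullptr; }
        FreeListAllocator& operator=(const FreeListAllocator&) noexcept { return *this; }
        FreeListAllocator& operator=(FreeListAllocator&&) noexcept { return *this; }
        template <typename U>
        FreeListAllocator(const FreeListAllocator<U>&) noexcept {} // rebinds start empty
        ~FreeListAllocator() {
            while (free_) { FreeNode* n = free_; free_ = n->next; ::operator delete(n); }
        }

        T* allocate(std::size_t n) {
            if (n == 1 && free_) {  // reuse a parked node instead of hitting the heap
                FreeNode* node = free_;
                free_ = node->next;
                return reinterpret_cast<T*>(node);
            }
            return static_cast<T*>(::operator new(n * sizeof(T)));
        }
        void deallocate(T* p, std::size_t n) noexcept {
            if (n == 1 && sizeof(T) >= sizeof(FreeNode)) {  // park single nodes for reuse
                auto* node = reinterpret_cast<FreeNode*>(p);
                node->next = free_;
                free_ = node;
            } else {
                ::operator delete(p);
            }
        }
        template <typename U> bool operator==(const FreeListAllocator<U>&) const noexcept { return true; }
        template <typename U> bool operator!=(const FreeListAllocator<U>&) const noexcept { return false; }
    };

    // Hypothetical usage:
    // std::unordered_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher,
    //                    std::equal_to<COutPoint>,
    //                    FreeListAllocator<std::pair<const COutPoint, CCoinsCacheEntry>>> m;

All memory still comes from global operator new, so any copy can free what another allocated; a real pool also reserves blocks up front and assumes T needs no over-alignment.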

  42. JeremyRubin commented at 7:02 pm on August 30, 2019: contributor

I think we may also be able to eke out some additional performance by making the keys the salted SHA256 hash of the COutPoint rather than the COutPoint itself, and instead of using SipHash, taking entropy from that salted hash (as done for cuckoocache).

    At the expense of one extra sha256 on insert/find (and less siphash), we get to drop 4 of the key bytes and have cheaper hash re-evaluation. Furthermore (this is a little bit platform dependent, but maybe we can detect it or wrap the hash in an is_integral type) we can set __cache_hash_code in the map to false, and save another 8 bytes. Depending on padding issues we may save even more space.
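A sketch of that idea with hypothetical types (not CuckooCache’s actual code): the salted digest itself becomes the key, and the table hash is just a slice of it:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    struct SaltedKey {
        uint8_t digest[32];  // SHA256(salt || outpoint), computed once per insert/find
    };

    inline bool operator==(const SaltedKey& a, const SaltedKey& b) noexcept {
        return std::memcmp(a.digest, b.digest, sizeof(a.digest)) == 0;
    }

    struct DigestHasher {
        std::size_t operator()(const SaltedKey& k) const noexcept {
            std::size_t h;
            std::memcpy(&h, k.digest, sizeof(h));  // digest bytes are already uniform,
            return h;                              // so no second hash pass is needed
        }
    };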

  43. [robinhood] remove custom hash 27f5bc74d2
  44. [robinhood] remove inline DataNode (flatmap=true) fa73a0f68e
  45. [robinhood] remove robin_hood::pair 7915fb73d3
  46. [robinhood] remove Cloner and Destroyer for flatmaps 25274b69e6
  47. [robinhood] remove unnecessary constructors, operators 8cc7434ec8
  48. [robinhood] remove all references to IsFlatMap cd43885b99
  49. [robinhood] remove integer sequence stuff 8739248c41
  50. jamesob force-pushed on Sep 3, 2019
  51. jamesob commented at 2:19 pm on September 3, 2019: member

    Thanks for running those benchmarks, @martinus. I also think we should try calling cacheCoins.reserve(dbcache / avg_coinscacheentry_size) somewhere during startup with both implementations to see how that affects things.

    On the chance that we still want to consider using the robin_hood map, I’ve stripped about 500 lines out of robinhood.h (mostly relating to unordered_flat_map which I don’t think we have much use for). I refer to this stripped down version as robinhood-slim in the benchmarks below (and it’s now the tip of this branch).


    I’ve been running benchmarks pretty continuously for the last few days and here’s a summary of the SSD-based timings:

[Image Selection_153: summary of SSD-based timings.]

    Unsurprisingly it seems that performance improvement is most pronounced with a large dbcache. Presumably this is because as the number of entries in cacheCoins grows, std::unordered_map’s performance starts to degrade, robin_hood starts to shine, or both.

    I found it interesting that over numerous runs, a large dbcache (7000) ended up being slower across the board than a smaller dbcache (6000 or 3000).

    In general, the weird timing disparities from these results give me pause. I’m benching on dedicated, non-virtualized hardware that is for no purpose other than benchmarking, and I’m dropping caches before runs (sudo /sbin/swapoff -a; sudo /sbin/sysctl vm.drop_caches=3;). Yet we see weird outliers like the 500-minute robinhood reindex at dbcache=5000 - which I didn’t believe initially but then reproduced. Why would a dbcache=7000 run take longer than a dbcache=3000? Is the runtime of -reindex-chainstate somehow weirdly determined by the dbcache value during IBD?

    I’m quite confused by some aspects of these results.

    Anyway, here’s the memory story:

[Image Selection_154: summary of memory usage.]

    We reliably see significant memory savings from the robinhood map.


    I also ran -reindex-chainstate to 550,000 on equivalent machines with HDDs. The disparity between master and this branch has been almost unbelievable, but I’ve reliably reproduced it even after doing machine crossover. The following table is characteristic of the -reindex-chainstates I’ve done on HDD:

    branch      dbcache   time                       memory
    robinhood   5000      8:46:59                    6.38 GB
    master      5000      21:32:58 (stopped early)   7.47 GB

    Similarly, IBD to the same height with the same dbcache took 47 hours on master (!?) and 17 hours on this branch.

    I’m just kind of assuming these results are bogus somehow, although I’m very confused by the fact that I can reliably reproduce them even after swapping machines, but maybe someone with a spinning disk can confirm or deny that there’s a dramatic difference in performance between this branch and master.

  52. MarcoFalke commented at 2:27 pm on September 3, 2019: member
    Thanks for doing those additional benchmarks @jamesob. I think that confirms the finding by @Sjors that the sync is slowed down with a high dbcache in #16718 (comment)
  53. jamesob commented at 3:16 pm on September 3, 2019: member

    I think that confirms the finding by @Sjors that the sync is slowed down with a high dbcache in #16718 (comment)

    Note though that I differ from Sjors’ findings in that this branch was ~2x faster than master for dbcache=7000 (reindex-chainstate), whereas in his results this branch was slower than master for a large dbcache (10000) during IBD.

    I lightly suspect that @Sjors’ slowness on this branch relative to master was due to some kind of IBD-inherent variability since on every other benchmark run I’ve seen, this branch performs better than master.

    Sjors’ total IBD runtime (<5 hours) was also less than my reindex-chainstate runtime (up to 8 hours), which I attribute to his probably-much-better hardware.

  54. martinus commented at 6:03 pm on September 3, 2019: contributor
In https://github.com/martinus/bitcoin/tree/2019-08-bulkpoolallocator I’m currently working on a PR where I’ve extracted (and completely rewritten…) the pool allocator so it is usable in std::unordered_map. Unfortunately the standard allocator API is trickier than I expected, but I think I’ve got it now. Its performance is still as advertised in #16718 (comment). I still have to clean up the code and add some unit tests before creating a PR, probably in the next few days.
  55. luke-jr commented at 9:24 pm on September 20, 2019: member

    Surprised nobody’s pointed out that using std::unordered_map basically gives us an unknown implementation right now… which probably is best avoided in consensus code, when possible. The fact that we can get performance improvements too really suggests to me that switching to a specific implementation would be a good idea. So concept ACK.

    Agree with need for intense care and testing to avoid a surprise protocol change. Bundling like LevelDB seems best if it’s a lot of code.

As far as issues with bad hashes: we don’t want to enable intentionally-crafted transactions to screw over performance, so we might need to re-hash with some seed?

  56. ryanofsky commented at 7:51 pm on October 21, 2019: member

    Status of this?

    • Does using std::unordered_map with a custom allocator #16718 (comment) still seem like it could be a good approach?

    • Are the code changes in this PR more or less complete, or is there more robinhood code streamlining to be done?

    • Are there still concerns about performance with smaller dbcaches and more constrained hardware with these implementations?

    • Are there still benchmark reproducibility issues? Are these affecting other setups or just James’ benchmarks?

  57. martinus commented at 6:32 am on October 23, 2019: contributor

    Status of this?

I think using a custom allocator from #16801 is a much safer choice than switching to a different map implementation. Writing a correct and fast hashmap is orders of magnitude more difficult than writing a custom allocator. This PR was valuable, though, in that it highlighted what could be possible. Most of the performance benefit can be achieved with the custom allocator as well.

    Concerning benchmark reproducibility, I believe @jamesob has stumbled upon a pretty severe performance bug, but we don’t know yet if this is due to the new code or maybe the optimizations trigger an existing bug more often: #17060 (comment)

  58. jamesob commented at 10:30 pm on November 14, 2019: member

    Some new (and stabilized) benchmarks here. Following the same methodology as in similar PRs, we see a 6% difference in runtime with ~400MB less memory usage for a reindex up to 550,000 (bench/robinhood.1 vs. bench/master.1):

    host         tag                      time       time% maxmem    cpu%  dbcache
    bench-ssd-2  robinhood.1              8:43:36    0.94  5304.37MB 356%  4000MB
    bench-ssd-2  robinhood.1              8:44:29    0.94  5309.67MB 355%  4000MB
    bench-ssd-2  master.1                 9:15:02    1.00  5776.95MB 344%  4000MB
    bench-ssd-2  master.1                 9:15:26    1.00  5762.75MB 343%  4000MB

    bench-ssd-3  robinhood.1              8:44:41    0.94  5293.84MB 355%  4000MB
    bench-ssd-3  robinhood.1              8:46:29    0.94  5309.91MB 354%  4000MB
    bench-ssd-3  master.1                 9:18:07    1.00  5780.84MB 342%  4000MB
    bench-ssd-3  master.1                 9:15:53    1.00  5773.25MB 343%  4000MB

    bench-ssd-4  robinhood.1              8:46:13    0.95  5279.41MB 354%  4000MB
    bench-ssd-4  robinhood.1              8:45:36    0.94  5294.21MB 354%  4000MB
    bench-ssd-4  master.1                 9:14:42    1.00  5782.72MB 344%  4000MB
    bench-ssd-4  master.1                 9:16:17    1.00  5762.92MB 343%  4000MB

    bench-ssd-5  robinhood.1              8:42:02    0.94  5277.88MB 357%  4000MB
    bench-ssd-5  robinhood.1              8:41:34    0.94  5302.08MB 357%  4000MB
    bench-ssd-5  master.1                 9:12:43    1.00  5773.20MB 345%  4000MB
    bench-ssd-5  master.1                 9:12:08    1.00  5769.65MB 345%  4000MB

    bench-strong robinhood.1              14:16:20   0.94  5302.65MB 563%  4000MB
    bench-strong master.1                 15:07:08   1.00  5804.43MB 543%  4000MB
    bench-strong master.1                 15:08:44   1.00  5722.19MB 543%  4000MB

    Interesting, but probably not a sufficient benefit to justify adding an extra 1500 lines of consensus-critical code. I’m going to close this one and perhaps someone can revisit this later.

  59. jamesob closed this on Nov 14, 2019

  60. MarcoFalke locked this on Dec 16, 2021
