The Zero Allocations project #18849

pull jb55 wants to merge 9 commits into bitcoin:master from jb55:zeroalloc changing 18 files +130 −93
  1. jb55 commented at 7:40 am on May 2, 2020: member

    This is an octomerge staging PR for tracking various heap-allocation reductions during IBD and regular use. Reducing heap allocations improves cache coherency and should improve performance. Maybe. You can use this PR to bench IBD.

    • ZAP1 - #18847 - compressor: use prevector in CompressScript serialization
    • ZAP2 - #18848 - threadnames: don't allocate memory in ThreadRename
    • ZAP3 - #18985 - bloom: use Span instead of std::vector for insert and contains
    • ZAP4 - no pr yet - netaddress: return a prevector from CService::GetKey()
  2. DrahtBot added the label Consensus on May 2, 2020
  3. DrahtBot added the label P2P on May 2, 2020
  4. DrahtBot added the label RPC/REST/ZMQ on May 2, 2020
  5. DrahtBot added the label Tests on May 2, 2020
  6. DrahtBot added the label Utils/log/libs on May 2, 2020
  7. DrahtBot added the label Validation on May 2, 2020
  8. DrahtBot commented at 12:30 pm on May 2, 2020: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #19314 (refactor: Use uint16_t instead of unsigned short by renepickhardt)
    • #19184 (Overhaul transaction request logic by sipa)
    • #19107 (p2p: Refactor, move all header verification into the network layer, without changing behavior by troygiorshev)
    • #19031 (Implement ADDRv2 support (part of BIP155) by vasild)
    • #18071 (Refactoring CHashWriter & Get{Prevouts,Sequence,Outputs}Hash to SHA256 by JeremyRubin)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  9. practicalswift commented at 3:38 pm on May 2, 2020: contributor
    Would be interesting to see this benchmarked to be able to reason about risk vs reward (where the ideal reward would be a significant measurable and user-perceivable performance improvement) :)
  10. elichai commented at 11:09 am on May 4, 2020: contributor
    I’d be very interested to see the benchmarks of this :) I’ve tried doing similar things in the past, but C++ makes this somewhat scary for a big project with lots of contributors because it requires more careful tracking of lifetimes. Also a few full IBD profiles I’ve done in the past show IIRC ~10-15% of local IBD time is spent on new and delete.
  11. in src/netaddress.cpp:734 in a80f49e566 outdated
    737+    CServiceKey vKey;
    738+    vKey.resize(18);
    739+    memcpy(vKey.data(), ip, 16);
    740+    vKey[16] = port / 0x100; // most significant byte of our port
    741+    vKey[17] = port & 0x0FF; // least significant byte of our port
    742+    return vKey;
    


    elichai commented at 11:12 am on May 4, 2020:
    what about returning a std::array<unsigned char, 18> instead?

    jb55 commented at 3:08 pm on May 4, 2020:
    I think this makes more sense

    jb55 commented at 4:29 pm on May 15, 2020:
    I decided against this for now since it would require me to implement a bunch of new Serialization machinery for std::array which I couldn’t figure out how to do.
  12. jb55 commented at 3:26 pm on May 4, 2020: member

    One of the hottest allocations that I found that might be the “easiest” to take a stab at is:

    cacheCoins unordered_map in CCoinsViewCache::AddCoin:

    typedef std::unordered_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher> CCoinsMap

    May04-081624

    This picture shows that (looks like the cacheCoins.emplace call in AddCoin) is doing 60 million dynamic allocations for this small heaptrack snapshot of IBD (I think I ran this one for 5-10 minutes…?). With peak memory usage of 202.5MB.

    One thing I’m trying now is a custom arena allocator for this map to see if that helps.

  13. jb55 commented at 3:38 pm on May 4, 2020: member

    Also a few full IBD profiles I’ve done in the past show IIRC ~10-15% of local IBD time is spent on new and delete.

    This is a bit higher than I expected, but yes this is why gamedevs typically ban dynamic allocations from their main game loop, it absolutely kills performance: heap fragmentation, cache incoherency, etc, etc. Most high performing games arena-allocate or use custom allocators. I’m going into this from a gamedev mindset, I see IBD as our main loop. Let’s minimize time spent per frame and maximize FPS (in our case, blocks/txs per second :sweat_smile: ). If we can arena/custom allocate some of these hotspots it might help a bunch.

  14. sipa commented at 10:17 pm on May 4, 2020: member
    @jb55 The CCoinsMap is where the actual UTXO cache is stored - it should be one allocation per UTXO. That’s a huge number, because we cache a huge number of UTXOs, but I don’t expect it will be easy to reduce. IBD performance heavily relies on being able to delete cache entries if they’re created and spent without a flush to disk in between, which seems incompatible with arena allocation.
  15. jb55 commented at 9:21 am on May 5, 2020: member

    @jb55 The CCoinsMap is where the actual UTXO cache is stored - it should be one allocation per UTXO. That’s a huge number, because we cache a huge number of UTXOs, but I don’t expect it will be easy to reduce.

    yes in this case the only thing we could do better is a custom allocator that maintains an efficient memory layout for cache friendliness, This is something I wanted to try as well in addition to reducing allocations where possible.

  16. elichai commented at 10:10 am on May 5, 2020: contributor
    We could also replace the hashmap with one that preserves cache locality, like SwissTable (CC #16718)
  17. jb55 commented at 6:23 pm on May 5, 2020: member
    re: allocator, looks like there is some prior art here: https://github.com/bitcoin/bitcoin/pull/16801
  18. DrahtBot added the label Needs rebase on May 8, 2020
  19. Make Span size type unsigned
    This matches a change in the C++20 std::span proposal.
    1f790a1147
  20. Make pointer-based Span construction safer
    This prevents constructing a Span<A> given two pointers into an array
    of B (where B is a subclass of A), at least without explicit cast to
    pointers to A.
    bb3d38fc06
  21. Add Span constructors for arrays and vectors ab303a16d1
  22. Simplify usage of Span in several places 2676aeadfa
  23. jb55 force-pushed on May 15, 2020
  24. DrahtBot removed the label Needs rebase on May 15, 2020
  25. compressor: use a prevector in compressed script serialization
    Use a prevector for stack allocation instead of heap allocation during
    script compression and decompression. These functions were doing
    millions of unnecessary heap allocations during IBD.
    
    We introduce a CompressedScript type alias for this prevector. It is
    size 33 as that is the maximum size of a compressed script.
    
    Fix the DecompressScript header to match the variable name from
    compressor.cpp
    
    Signed-off-by: William Casarin <jb55@jb55.com>
    83a425d25a
  26. bloom: use Span instead of std::vector for `insert` and `contains`
    We can avoid many unnecessary std::vector allocations by changing
    CBloomFilter to take Spans instead of std::vector's for the `insert`
    and `contains` operations.
    
    CBloomFilter currently converts types such as CDataStream and uint256
    to std::vector on `insert` and `contains`. This is unnecessary because
    CDataStreams and uint256 are already std::vectors internally. We just
    need a way to point to the right data within those types. Span gives
    us this ability.
    
    Signed-off-by: William Casarin <jb55@jb55.com>
    a6d8f9fd28
  27. netaddress: fix indentation in CService::GetKey [moveonly]
    Also run clang formatter to reorder imports
    
    Signed-off-by: William Casarin <jb55@jb55.com>
    561f045ee4
  28. netaddress: return a prevector from CService::GetKey()
    Avoid heap allocations in CService::GetKey by returning a prevector
    instead of a std::vector. This method gets called many times, so
    passing via stack helps avoid heap thrashing.
    
    Signed-off-by: William Casarin <jb55@jb55.com>
    102761516e
  29. jb55 force-pushed on May 16, 2020
  30. jamesob commented at 7:23 pm on May 20, 2020: member

    I’m Concept ACK and generally in favor of the thought behind this project, but my reindex benchmarks show that runtime is indistinguishable from this branch’s master merge-base. These changes still may be worth pursuing for other reasons, but just wanted to add some data.

    Selection_160

  31. sipa commented at 7:28 pm on May 20, 2020: member
    @jamesob Unless you’re running with -par=1, I believe there is no change in behaviour (and even then, it’s probably trivial).
  32. jb55 commented at 8:22 pm on May 20, 2020: member
    @jamesob thanks for running that, is there an easy way to run these myself as I hack on this branch? I’m guessing most perf would be dominated by IO factors/secp? Another motivating factor for this PR was to investigate heap usage in general. right now my core node uses a gig of ram and I’m looking at ways to reduce that outside of IBD.
  33. jb55 commented at 8:45 pm on May 20, 2020: member
    actually, I’m just going to try to reproduce this locally: pruned IBD in-memory filesystem, and I’ll just run it a bunch of times up to some height and then compare the time-to-height.
  34. jb55 commented at 9:40 pm on May 20, 2020: member

    looks like the current changes don’t affect IBD much at all:

    -datadir=/run/bitcoin -reindex

    0prune=550
    1printtoconsole=1
    2connect=127.0.0.2
    3noassumevalid=1
    
     0reindex master to height 100,000 (ms)
     1
     210700
     310768
     411017
     511016
     610852
     7
     8reindex zeroalloc to height 100,000 (ms)
     9
    1010846
    1110796
    1210826
    1310817
    1410764
    

    looks like I’ll have to try harder!

    but now I have a pretty good testbed with this bpftrace script + linux userspace tracepoints:

    https://jb55.com/s/ibd.bt.txt https://jb55.com/s/tracepoints.patch.txt

  35. jamesob commented at 0:48 am on May 21, 2020: member
    0reindex zeroalloc to height 100,000 (ms)
    

    Consider that testing reindex or IBD up to height 100,000 is uncharacteristic because it happens almost instantaneously relative to the rest of the chain. You may find my previous work in bitcoinperf helpful, if not a little hard to set up (happy to try and improve this with you). Here’s an example and accompanying output of benching a subset of IBD within a meaningful chain region.

  36. jb55 commented at 1:26 am on May 21, 2020: member

    @jamesob I tried a bit higher and started to see what might be a performance improvement: #18847 (comment)

    The only IBD focused commit so far is ZAP1 linked above, I’m still looking into utxo heap improvements which might be more interesting, as it is the # 1 allocator during IBD.

  37. Merge branches '2020-05-compresscript-prevector' and '2020-05-servicekey-prevector' into zeroalloc 8b88fcba80
  38. jb55 force-pushed on May 21, 2020
  39. DrahtBot commented at 1:19 pm on June 18, 2020: member

    🐙 This pull request conflicts with the target branch and needs rebase.

  40. DrahtBot added the label Needs rebase on Jun 18, 2020
  41. MarcoFalke referenced this in commit 4cfe6c37d9 on Apr 28, 2021
  42. jb55 closed this on Sep 20, 2021

  43. jamesob commented at 10:26 pm on September 20, 2021: member
    Sad to see this closed; I love the idea of exploring how we can reduce allocations in general since it seems like (in addition to any incidental performance benefits) fewer allocations probably makes behavior more predictable… although I guess while we might be able to reasonably preallocate certain things (coinsCache come to mind) others (the block index) might not be possible/desirable, so maybe “Zero Allocations” is somewhat of a misnomer.
  44. JeremyRubin commented at 2:11 am on September 21, 2021: contributor

    it’s generally really hard to get changes like this through; e.g. personally my epoch mempool stuff also has an aim to eliminate short lived allocations in many of the mempool algorithms but it is (from my perspective) stalled out for lack of sufficient review.

    If big refactorings like ZAP or the Mempool Project are going to make the kind of headway you’d want to see they probably would need to see prioritization during a specific release cycle in order for contributors like myself or @jb55 (who I can’t speak for as to why he closed it) in order for it to feel worthwhile investing effort on rebasing/keeping alive… I know that there’s also a difficult balance, contributors with full-time or part-time jobs outside of Bitcoin Core are probably often more writing code than reviewing other’s code (i.e., to scratch one’s itch or fix a specific problem), which leads to less ‘review karma’ or whatever…

  45. MarcoFalke removed the label Tests on Sep 21, 2021
  46. MarcoFalke removed the label RPC/REST/ZMQ on Sep 21, 2021
  47. MarcoFalke removed the label P2P on Sep 21, 2021
  48. MarcoFalke removed the label Validation on Sep 21, 2021
  49. MarcoFalke removed the label Consensus on Sep 21, 2021
  50. MarcoFalke removed the label Utils/log/libs on Sep 21, 2021
  51. MarcoFalke added the label Refactoring on Sep 21, 2021
  52. MarcoFalke added the label Up for grabs on Sep 21, 2021
  53. DrahtBot locked this on Oct 30, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-21 06:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me