coins: compact chainstate regularly #35465

pull l0rinc wants to merge 3 commits into bitcoin:master from l0rinc:l0rinc/forcecompactdb-ibd-exit changing 6 files +113 −7
  1. l0rinc commented at 4:02 PM on June 4, 2026: contributor

    Problem: https://github.com/bitcoin-core/leveldb-subtree/pull/61 disabled read-triggered seek compactions to avoid large chainstate write amplification from random UTXO lookups. That avoids repeated read-driven rewrites, but it also removes opportunistic cleanup that previously helped compact old chainstate data.

    After IBD, normal chainstate churn can leave obsolete entries behind until ordinary LevelDB compaction naturally reaches the affected levels, keeping the chainstate database larger than necessary.

    Also, chainstates created by pre-29 nodes can contain thousands of files from the old 2 MiB LevelDB table target. After the mmap limit dropped back to 1000 and seek compaction was disabled, continuing from such a chainstate can leave many table reads on the non-mmap path until the database is compacted.

    Fix: After each completed post-IBD full chainstate flush, give the chainstate a 1/320 chance to compact. With roughly hourly full flushes, this averages about once every two weeks and makes a six-month stretch without compaction about a one-in-a-million event.

    The randomized recurring trigger spreads compactions across nodes and keeps maintenance stateless, without storing last-compaction height or timestamp metadata in the chainstate database. Compaction runs on a background thread (utxocompact) so validation only schedules the work.

    Partially fixes #35298 and #35457

  2. DrahtBot added the label Validation on Jun 4, 2026
  3. DrahtBot commented at 4:02 PM on June 4, 2026: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--006a51241073e994b41acfe9ec718e94-->

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/35465.

    <!--021abf342d371248e50ceaed478a90ca-->

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK andrewtoth, sipa, optout21, sedited
    Concept ACK w0xlt

    If your review is incorrectly listed, please copy-paste <code>&lt;!--meta-tag:bot-skip--&gt;</code> into the comment that the bot should ignore.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #34897 (indexes: Don't commit ahead of the flushed chainstate by mzumsande)
    • #34320 (coins: delegate CCoinsViewDB::HaveCoin to GetCoin by l0rinc)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    <!--5faf32d7da4f0f540f40219e4f7537a3-->

  4. in src/validation.cpp:3418 in dbf0762734
    3414 | @@ -3385,6 +3415,7 @@ bool Chainstate::ActivateBestChain(BlockValidationState& state, std::shared_ptr<
    3415 |                      pindexMostWork = nullptr;
    3416 |                  }
    3417 |                  pindexNewTip = m_chain.Tip();
    3418 | +                force_compaction |= pindexNewTip->GetBlockHash() == m_chainman.AssumedValidBlock();
    


    sedited commented at 4:30 PM on June 4, 2026:

    Could we retrieve the assumed valid block from m_chainman.GetConsensus().defaultAssumevalid instead?


    l0rinc commented at 6:30 PM on June 4, 2026:

    That would have made testing more difficult, ended up relying on minimum chainwork instead.

  5. DrahtBot added the label CI failed on Jun 4, 2026
  6. DrahtBot commented at 5:22 PM on June 4, 2026: contributor

    <!--85328a0da195eb286784d51f73fa0af9-->

    🚧 At least one of the CI tasks failed. <sub>Task test ancestor commits: https://github.com/bitcoin/bitcoin/actions/runs/26963717717/job/79561784206</sub> <sub>LLM reason (✨ experimental): CI failed because CTest reported 1 failing test: coins_tests.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  7. l0rinc renamed this:
    validation: compact chainstate after connecting assumevalid block
    validation: compact chainstate at minimum chain work
    on Jun 4, 2026
  8. l0rinc force-pushed on Jun 4, 2026
  9. in src/validation.cpp:1951 in ed7cb385f5
    1946 | +        if (m_chainman.m_chainstates.size() != 1) return; // Skip assumeutxo.
    1947 | +        Assert(this == &m_chainman.CurrentChainstate());
    1948 | +        coins_db = &CoinsDB();
    1949 | +    }
    1950 | +
    1951 | +    if (BlockValidationState state; !FlushStateToDisk(state, FlushStateMode::FORCE_FLUSH)) { // Ensure compaction reflects the latest chainstate.
    


    andrewtoth commented at 6:48 PM on June 4, 2026:

    We want the latest in memory chainstate to be consistent with leveldb before we compact, but we don't need to clear the cache. That would needlessly cause a lot more read requests at the same time that we are compacting in the background.

        if (BlockValidationState state; !FlushStateToDisk(state, FlushStateMode::FORCE_SYNC)) { // Ensure compaction reflects the latest chainstate.
    
  10. in doc/release-notes-35465.md:1 in ed7cb385f5
       0 | @@ -0,0 +1,7 @@
       1 | +Performance
    


    andrewtoth commented at 6:48 PM on June 4, 2026:

    I'm not sure this change warrants a release note. This change should not be visible to the user.

  11. andrewtoth commented at 6:48 PM on June 4, 2026: contributor

    Concept ACK

  12. l0rinc force-pushed on Jun 4, 2026
  13. l0rinc force-pushed on Jun 4, 2026
  14. DrahtBot removed the label CI failed on Jun 4, 2026
  15. l0rinc force-pushed on Jun 4, 2026
  16. andrewtoth commented at 12:32 PM on June 5, 2026: contributor

    I tested this on mainnet with a pruned IBD with default min work params. The compaction is triggered after 938344 is connected on the scheduler thread, and completes in 2m 32s while 110 blocks were connected on msghand thread. The resulting chainstate size after 952425 was 12GB. Doing the same IBD on master had a chainstate size of 15GB.

    However, the current size of the chainstate db of 11GB is rather unfortunate, since it straddles L4 and L5. Most files would naturally sit at L4, but a full compaction moves all files to L5. This means that when spending any UTXO, the tombstone entry will never reach L5 until 11 GB of new data is added. This means that in a year the db could be 22 GB even if there is a net negative growth in the UTXO set. So, because of this I think instead we will need to have a scheduled compaction instead of doing this only once.

  17. andrewtoth commented at 9:11 PM on June 5, 2026: contributor

    I ran an experiment with a chainstate synced up to 952529.

    First, we took the chainstate that is ~14.7 GB and erased every entry via an iterator and batch write. Since many files are in L4, size compaction was able to erase many of them, reducing chainstate size by over 4 GB. Still, not ideal since our empty chainstate db is still over 10 GB!

    Level Start files Start MB Erased files Erased MB
    L0 4 9 0 0
    L1 1 3 0 0
    L2 5 81 3 96
    L3 45 983 31 998
    L4 311 9,939 172 5,471
    L5 117 3,643 117 3,643
    L6 0 0 0 0
    Total (du) 14,666 10,231

    Second, I took the synced chainstate and did a manual full compaction on it (what this PR aims to do). This reduced the size of the db to ~10.6 GB, but every file is now on L5. Then I did the same as above, erasing every single entry. Now, the empty db balloons to around ~16 GB. This can be explained because size compaction no longer pushes any entries past L4 until there's more than 10 GB of new data.

    Level Start files Start MB Erased files Erased MB
    L0 0 0 0 0
    L1 0 0 0 0
    L2 0 0 3 97
    L3 0 0 31 995
    L4 0 0 131 4,203
    L5 337 10,618 337 10,618
    L6 0 0 0 0
    Total (du) 10,618 15,936

    So, I think this PR has a decent approach to scheduling the compaction, but I think we need to also schedule a compaction randomly every month or two. After triggering another manual compaction, both dbs correctly reduce to size 0.

  18. l0rinc commented at 9:15 PM on June 5, 2026: contributor

    I have also measured the performance; the extra compaction doesn't add any meaningful slowdown.

    <details><summary>reindex-chainstate without and with compaction</summary>

    for DBCACHE in 5000; do     COMMITS="47da4f9b716d11294d4fb0f30b04a7bcf128cc14 4bcac7f699457fb001410a49795d48e4941b27bd";     STOP=950059; CC=gcc; CXX=g++;     BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs";     (echo ""; for c in $COMMITS; do git fetch -q origin "$c" 2>/dev/null || true; git log -1 --pretty='%h %s' $c || exit 1; done) &&     (echo "" && echo "$(date -I) | reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 1 && echo HDD || echo SSD)"; echo "") &&     hyperfine     --sort command     --runs 1     --export-json "$BASE_DIR/rdx-$(sed -E 's/[^ ]+/\L&/g;s/[.]/_/g;s/ /-/g'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"     --parameter-list COMMIT ${COMMITS// /,}     --prepare "killall -9 bitcoind 2>/dev/null; rm -f ./build/bin/bitcoind; git clean -fxd; git reset --hard {COMMIT} && \
          cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && ninja -C build bitcoind -j1 && \
          ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20; rm -f $DATA_DIR/debug.log; rm -rfd $DATA_DIR/indexes;"     --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log && grep 'Bitcoin Core version' $DATA_DIR/debug.log | grep -q \"\$(git rev-parse --short=12 {COMMIT})\"; \
                    cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"     "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"; done
    
    47da4f9b71 Merge bitcoin/bitcoin#35410: net: use the proxy if overriden when doing v2->v1 reconnections
    4bcac7f699 validation: compact chainstate at minimum work
    
    2026-06-04 | reindex-chainstate | 950059 blocks | dbcache 5000 | umbrel | x86_64 | Intel(R) N150 | 4 cores | 15Gi RAM | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=950059 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 47da4f9b716d11294d4fb0f30b04a7bcf128cc14)
      Time (abs ≡):        34274.786 s               [User: 32144.272 s, System: 3560.024 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=950059 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 4bcac7f699457fb001410a49795d48e4941b27bd)
      Time (abs ≡):        34745.423 s               [User: 32484.966 s, System: 3508.027 s]
    
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=950059 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 47da4f9b716d11294d4fb0f30b04a7bcf128cc14)
            1.01          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=950059 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 4bcac7f699457fb001410a49795d48e4941b27bd)
    

    </details>

    The logs also show that compaction happened close to the assumevalid height:

    <details><summary>Compaction logs</summary>

    cat ../logs/debug-4bcac7f699457fb001410a49795d48e4941b27bd-1780604207.log | grep -C4 compaction
    2026-06-05T14:56:31Z Enabling script verification at block [#938344](/bitcoin-bitcoin/938344/) (00000000000000000001731222e3df1a47a0944429559d4c4e4caf53e71676e5): block height above assumevalid height.
    2026-06-05T14:56:31Z UpdateTip: new best=00000000000000000001731222e3df1a47a0944429559d4c4e4caf53e71676e5 height=938344 version=0x2004a000 log2_work=96.100823 tx=1315808700 date='2026-02-25T21:37:38Z' progress=0.965842 cache=1973.3MiB(6654179txo)
    2026-06-05T14:56:32Z UpdateTip: new best=0000000000000000000157898e02aeb20f847dd3ef8973abe68510119b7e6a33 height=938345 version=0x26946000 log2_work=96.100834 tx=1315811999 date='2026-02-25T21:49:08Z' progress=0.965844 cache=1973.3MiB(6661457txo)
    2026-06-05T14:56:32Z UpdateTip: new best=0000000000000000000009325afaa03dd4686deea9781d754a82d2b7ba4ceba8 height=938346 version=0x34000000 log2_work=96.100844 tx=1315815668 date='2026-02-25T22:28:46Z' progress=0.965853 cache=1973.3MiB(6672983txo)
    2026-06-05T14:56:32Z Starting chainstate compaction of /mnt/my_storage/BitcoinData/chainstate
    2026-06-05T14:59:55Z Finished chainstate compaction of /mnt/my_storage/BitcoinData/chainstate
    2026-06-05T14:59:55Z UpdateTip: new best=000000000000000000003028d4699d1d2566de2934ca3f2301d990a95e7db0fc height=938347 version=0x22644000 log2_work=96.100855 tx=1315819222 date='2026-02-25T22:31:42Z' progress=0.965853 cache=1973.3MiB(6679310txo)
    2026-06-05T14:59:55Z UpdateTip: new best=000000000000000000009c20b9c211fd01503cfcabf44c939f83bc54d9a89b2c height=938348 version=0x20866000 log2_work=96.100865 tx=1315820982 date='2026-02-25T22:35:24Z' progress=0.965854 cache=1973.3MiB(6681830txo)
    2026-06-05T14:59:56Z UpdateTip: new best=00000000000000000001c0bc82a20646b8daadc382af5f06486f016e931cf332 height=938349 version=0x20036000 log2_work=96.100876 tx=1315824111 date='2026-02-25T22:45:22Z' progress=0.965857 cache=1973.3MiB(6686145txo)
    2026-06-05T14:59:56Z UpdateTip: new best=00000000000000000001dc00d7c939a9d780c72dc64e378c8c9638d3e8401ace height=938350 version=0x23f2c000 log2_work=96.100886 tx=1315826175 date='2026-02-25T22:55:02Z' progress=0.965859 cache=1973.3MiB(6690199txo)
    

    </details>

    Compaction took ~3.5 minutes, but it seems to have been blocking new block connections - I'm not yet sure why (cc: @andrewtoth).

    This means that in a year the db could be 22 GB even if there is a net negative growth in the UTXO set

    I'm still trying to simulate this, will get back to you on it.

    but I think we need to also schedule a compaction randomly every month or two

    That was my original approach, to schedule one every 10k blocks after IBD. I'll investigate it, thanks a lot for the measurements.

  19. l0rinc marked this as a draft on Jun 5, 2026
  20. l0rinc force-pushed on Jun 6, 2026
  21. l0rinc commented at 10:37 PM on June 6, 2026: contributor

    Since the last push, chainstate compaction no longer runs through the validation-interface queue (blocking block connections). CCoinsViewDB::CompactFull() now starts a one-shot background compaction job, so validation only force-syncs the latest chainstate before triggering the work.

    The first compaction is still deterministic when the active chain leaves IBD. After that, recurring maintenance is randomized: each post-IBD full chainstate flush has a 1/1000 chance to start compaction. With the current roughly hourly flush cadence, that averages to about six weeks.

  22. l0rinc marked this as ready for review on Jun 6, 2026
  23. l0rinc renamed this:
    validation: compact chainstate at minimum chain work
    validation: compact chainstate after IBD and post-IBD flushes
    on Jun 6, 2026
  24. andrewtoth commented at 11:34 PM on June 6, 2026: contributor

    If we are doing a compaction randomly after every full_flush_completed, then I don't see a need for the min work check. Since it's done in the background, we can randomly compact during IBD too. There's no need for logic that tracks IBD state. That should simplify the logic here quite a bit.

    Is there a way to prevent the compaction from occurring during shutdown? We flush a few times during shutdown and it would not be great if it triggered a compaction that prevented shutdown for several minutes.

    We should use a named TraceThread for the compaction thread.

  25. l0rinc marked this as a draft on Jun 8, 2026
  26. l0rinc force-pushed on Jun 8, 2026
  27. DrahtBot added the label CI failed on Jun 8, 2026
  28. DrahtBot commented at 5:26 PM on June 8, 2026: contributor

    <!--85328a0da195eb286784d51f73fa0af9-->

    🚧 At least one of the CI tasks failed. <sub>Task Windows-cross to x86_64, ucrt: https://github.com/bitcoin/bitcoin/actions/runs/27151249634/job/80142552112</sub> <sub>LLM reason (✨ experimental): CI failed at the link step due to an undefined reference to util::TraceThread(...) when building libbitcoinkernel.dll.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  29. l0rinc force-pushed on Jun 8, 2026
  30. l0rinc renamed this:
    validation: compact chainstate after IBD and post-IBD flushes
    validation: compact fragmented chainstate after full flushes
    on Jun 8, 2026
  31. DrahtBot removed the label CI failed on Jun 8, 2026
  32. l0rinc force-pushed on Jun 8, 2026
  33. in src/validation.cpp:134 in 849ffc18cc
     129 | +static bool ShouldCompactChainstate(CCoinsViewDB& coins_db, bool in_ibd)
     130 | +{
     131 | +    static constexpr uint32_t flush_ratio{1'000}; // Roughly every 6 weeks with hourly flushes
     132 | +    static constexpr uint32_t leveldb_compaction_threshold{1'000}; // LevelDB table-cache/open-file budget and POSIX mmap limit
     133 | +    if (sizeof(void*) < 8 || !in_ibd) return FastRandomContext().randrange(flush_ratio) == 0;
     134 | +    return CompactableFileCount(coins_db) > leveldb_compaction_threshold;
    


    sipa commented at 8:12 PM on June 8, 2026:

    If this threshold is exceeded outside of IBD somehow, wouldn't we also want to compact?


    l0rinc commented at 8:21 PM on June 8, 2026:

    But that's deterministic, isn't it?


    andrewtoth commented at 8:33 PM on June 8, 2026:

    Hmm if the chainstate ever exceeds 32GB (which we can expect at some point), then this will compact every flush.

    If we just want to compact a chainstate with older 2MB files, we can instead check if L3 has > 1GB / 32 MB file count?


    l0rinc commented at 8:51 PM on June 8, 2026:

    I'm pushing a new version since this won't reliably trigger compaction during IBD. The old code did not compare against total chainstate size, but against the file count of the level above the deepest occupied level. The new version I'm about to push checks for occupancy instead; I'll try some reindex-chainstates overnight to see how it performs.

    Note that changing the default max_open_files from 1000 to 2048 alone would avoid the speed regression, but I want to achieve the same through careful compaction instead.

    If this threshold is exceeded outside of IBD somehow, wouldn't we also want to compact?

    I deliberately wanteded to avoid that, i.e. went for fast compactions during IBD, unpredictable ones in steady state. Let me know if reviewers don't share this goal.


    andrewtoth commented at 9:38 PM on June 8, 2026:

    this won't reliably trigger compaction during IBD.

    Sorry, I'm confused. We don't want to trigger compaction during IBD, right? We want to compact afterwards. If we just compact every ~1000 flushes and also if there are too many files at a level than max file size (to compact an older 2mb file size chainstate), then we have accomplished the goal of the PR.

    I don't think this PR should be concerned with compacting during IBD. I'm not convinced we need to do that. I think doing compactions to speed up IBD can be done as a follow-up.

    I also don't think we need to benchmark this. It should not change IBD or reindex-chainstate from current master behavior.


    sipa commented at 9:41 PM on June 8, 2026:

    @l0rinc Did you mean to thumbs-up your own comment? :sweat_smile:


    l0rinc commented at 10:12 AM on June 9, 2026:

    Sorry, I'm confused. We don't want to trigger compaction during IBD, right?

    How else would we fix the regression? We can also bump max_open_files, but I want to achieve the same through careful compaction instead.

    doing compactions to speed up IBD can be done as a follow-up

    Not speed up, but revert the 8% regression that seek compaction + mmap reduction caused.

    Did you mean to thumbs-up your own comment? 😅

    I had a weak moment, felt really proud of that comment. <img width="1605" height="980" alt="image" src="https://github.com/user-attachments/assets/f4c6b659-f908-45df-b1ac-f8b829e2049e" />


    sipa commented at 12:09 PM on June 9, 2026:

    Maybe it makes sense to separate the two concerns into separate PRs?

    1. The persistent post-IBD inflated disk usage without compactions, especially after UTXO set shrinkage.
    2. The performance regression caused by many small files that haven't been compacted into large files yet, combined with the mmap file count reduction.

    The solution to the first is just a randomized occasional full compaction, post IBD. It's an easy fix that doesn't need much investigation.

    The second is possibly more involved if we want to address it during IBD as well, may need benchmarks, and safeguards against overly aggressive compaction runs if the over-1000-files condition is continuously hit. (One random idea: what if we just schedule a full compaction once, on startup, if the file count is above 1000?)


    andrewtoth commented at 12:10 PM on June 9, 2026:

    This PR was initially about fixing the disk space regression caused by https://github.com/bitcoin-core/leveldb-subtree/pull/61. That change alone has been backported, so if we want to have a change on master we can backport we should keep a fix concentrated on fixing the size regression alone.

    That change in combination with https://github.com/bitcoin-core/leveldb-subtree/pull/52 has introduced a speed regression (https://github.com/bitcoin/bitcoin/issues/35457) but is not backported. How we deal with the speed regression seems out of scope for a targeted fix to the disk size regression.

    I want to achieve the same through careful compaction instead.

    Perhaps we could discuss in #35457? I'm of the opinion https://github.com/bitcoin-core/leveldb-subtree/pull/52 should be reverted instead. More compactions will regress the disk IO improvements of https://github.com/bitcoin-core/leveldb-subtree/pull/61.


    l0rinc commented at 12:19 PM on June 9, 2026:

    Maybe it makes sense to separate the two concerns into separate PRs

    Sure, makes sense. I'm at a conference now, but I can easily split it today.

    One random idea: what if we just schedule a full compaction once, on startup, if the file count is above 1000?

    That's basically what #35467 tried (though it still warns for now since I'm not sure if it still makes sense after this PR, but @andrewtoth also mention it should do it instead of just warning).

    I'm of the opinion https://github.com/bitcoin-core/leveldb-subtree/pull/52 should be reverted instead. More compactions will regress the disk IO improvements of https://github.com/bitcoin-core/leveldb-subtree/pull/61.

    I'll split the two concerns and we can discuss alternatives on the one concerning the IBD compactions. I'm currently measuring how compactions during IBD affect the speed - and since they happen after flushes, I don't think they should affect IBD speed anymore. Also, the parallel fetcher likely changes our usage patterns, giving more time between bulk reads, so we likely have less blockage caused by background compactions (since they're not spread out evenly).

  34. l0rinc force-pushed on Jun 8, 2026
  35. l0rinc force-pushed on Jun 9, 2026
  36. l0rinc renamed this:
    validation: compact fragmented chainstate after full flushes
    validation: compact chainstate regularly after post-IBD flushes
    on Jun 9, 2026
  37. l0rinc renamed this:
    validation: compact chainstate regularly after post-IBD flushes
    coins: compact chainstate regularly after post-IBD flushes
    on Jun 9, 2026
  38. l0rinc force-pushed on Jun 9, 2026
  39. DrahtBot added the label CI failed on Jun 9, 2026
  40. in src/validation.cpp:120 in d15668d535 outdated
     112 | @@ -113,6 +113,13 @@ const std::vector<std::string> CHECKLEVEL_DOC {
     113 |   * */
     114 |  static constexpr int PRUNE_LOCK_BUFFER{10};
     115 |  
     116 | +// Return whether the completed full flush should compact chainstate
     117 | +static bool ShouldCompactChainstate(bool in_ibd)
     118 | +{
     119 | +    static constexpr uint32_t flush_ratio{1'000}; // Roughly every 6 weeks with hourly flushes
     120 | +    return !in_ibd && FastRandomContext().randrange(flush_ratio) == 0;
    


    andrewtoth commented at 1:46 PM on June 9, 2026:

    What is the rationale for not compacting if we are in IBD? The node should be able to make progress just the same if we are compacting on a background thread. Is there any evidence we would slow down if we triggered a compaction during IBD as well? Your other comments seem to suggest the opposite.


    l0rinc commented at 2:34 PM on June 9, 2026:

    I don't want randomness during IBD, makes benchmarking impossible. We already know most of the payload during IBD, we can optimize for speed through a deterministic algorithm.


    andrewtoth commented at 2:57 PM on June 9, 2026:

    I don't want randomness during IBD, makes benchmarking impossible.

    Hmm but we already randomly do periodic syncs if you have a high enough dbcache value.


    l0rinc commented at 4:41 PM on June 9, 2026:

    Isn't all of that deterministic currently? Subsequent runs are surprisingly stable, often within 1-2 minutes of each ither.


    andrewtoth commented at 4:58 PM on June 9, 2026:

    Isn't all of that deterministic currently?

    There's a 20 minute window of random time to periodically sync.

    I don't think doing a full compaction randomly during IBD will change anything here. We were doing constant seek compaction before anyways.


  41. DrahtBot removed the label CI failed on Jun 9, 2026
  42. in src/validation.cpp:117 in d15668d535 outdated
     112 | @@ -113,6 +113,13 @@ const std::vector<std::string> CHECKLEVEL_DOC {
     113 |   * */
     114 |  static constexpr int PRUNE_LOCK_BUFFER{10};
     115 |  
     116 | +// Return whether the completed full flush should compact chainstate
     117 | +static bool ShouldCompactChainstate(bool in_ibd)
    


    andrewtoth commented at 4:22 PM on June 9, 2026:

    This function seems simple enough to inline below.


    l0rinc commented at 9:53 AM on June 11, 2026:

    For now it is, but we will have to do compactions during IBD as well to avoid the 8% slowdown mentioned before. A few local measurements indicate a well-targeted compactions gets rid of the slowdown introduced.


    andrewtoth commented at 12:41 PM on June 11, 2026:

    What is the downside of instead reverting the patch that causes the 8% slowdown (the mmap limit reduction)?


    l0rinc commented at 5:46 PM on June 11, 2026:

    Just that it doesn't actually fix the underlying problem. We're already using too much (virtual) memory, likely because the mmaped files, we should be able to reduce the open file requirements by compacting more strategically. Which seems to be even faster than before instead of 8% slower. TLDR: reverting it seems like a workaround, it just revealed a problem.


    andrewtoth commented at 6:01 PM on June 11, 2026:

    Isn't the underlying problem the speed regression? You are saying it is revealing something about the virtual memory? Does it reduce the virtual memory? I thought not #33351 (comment).

  43. in src/validation.cpp:2843 in d15668d535 outdated
    2841 | +            m_chainman.m_options.signals->ChainStateFlushed(this->GetRole(), GetLocator(m_chain.Tip()));
    2842 | +        }
    2843 | +
    2844 | +        if (!m_chainman.m_interrupt && m_chainman.m_chainstates.size() == 1) { // Skip AssumeUTXO
    2845 | +            if (ShouldCompactChainstate(m_chainman.IsInitialBlockDownload())) {
    2846 | +                CoinsDB().CompactFull();
    


    andrewtoth commented at 4:23 PM on June 9, 2026:

    The thread-spawning can throw. Do we want to catch here and log?


    sedited commented at 9:21 AM on June 11, 2026:

    Agree that a log would be nice, but I think if a compaction fails a hard crash might be the correct call.


    l0rinc commented at 9:54 AM on June 11, 2026:

    @andrewtoth, @sedited, care to suggest a patch for what you'd like to see here exactly?


    andrewtoth commented at 12:24 PM on June 11, 2026:

    if a compaction fails a hard crash might be the correct call.

    Compaction failing is handled inside the thread, where it is caught and logged.

    I'm suggesting wrapping the thread spawning with a try/catch and logging. I don't think we spawn threads anywhere else in the codebase after startup. And if we throw here we just don't spawn a compaction thread. Would we want to just abort if spawning that thread fails?


    l0rinc commented at 5:47 PM on June 11, 2026:

    Thanks, added a log, let me know if it's close to what you meant.

  44. andrewtoth approved
  45. andrewtoth commented at 4:32 PM on June 9, 2026: contributor

    ACK d15668d53558ac78953b50634e8272d5bc55d654

    Nice approach with randomly compacting every 1000 flushes.

    Tested by setting flush_ratio to 1 and DATABASE_WRITE_INTERVAL_... to be between 5 and 7 minutes.

    Compaction was triggered every periodic flush in the background.

    No compaction was triggered on shutdown.

    Not convinced we need to not compact during IBD. There doesn't seem to be sufficient rationale to have a check for this and not just compact as well if one of the flushes during IBD is chosen.

    PR description should be updated. The motivation for this is not about pressure on leveldb data structures, but on the disk footprint of the chainstate db. Also, can mention #35298 is fixed by this.

  46. l0rinc marked this as ready for review on Jun 9, 2026
  47. in src/dbwrapper.cpp:259 in bc1a8f760d outdated
     255 | @@ -256,7 +256,7 @@ CDBWrapper::CDBWrapper(const DBParams& params)
     256 |  
     257 |      if (params.options.force_compact) {
     258 |          LogInfo("Starting database compaction of %s", fs::PathToString(params.path));
     259 | -        DBContext().pdb->CompactRange(nullptr, nullptr);
     260 | +        CompactFull();
    


    optout21 commented at 8:30 AM on June 11, 2026:

    bc1a8f7 validation: randomly compact chainstate:

    Maybe move the start&stop log lines to the new method as well?


    l0rinc commented at 9:03 AM on June 11, 2026:

    The logs wouldn't be optional anymore (in the other case they're debug, here it's triggered deliberately so the logs are info).

    store the time of last compaction, and compact if age is over a desired time period with some randomization (the time has to be persisted into DB).

    That's exactly what this solution is try to avoid:

    This spreads compactions across nodes and keeps recurring maintenance stateless, without storing last-compaction height or timestamp metadata in the chainstate database.


    Another possible approach is to base decision on compaction state, but that may be too specific with LevelDB internals.

    That's the other one that we're deliberately trying to avoid since it's more complicated and it would be deterministic therefore many nodes would trigger it at the same time on the network (causing a predictable slight slowdown)


    optout21 commented at 9:15 AM on June 11, 2026:

    That's exactly what this solution is try to avoid:

    Avoid exactly what? The need to store the last time? Or base the decision based on time? Randomization is an important secondary goal, but randomization can be added to either solution (e.g. desired period range of 4-6 weeks, with linearly increasing probability within the range).


    l0rinc commented at 9:48 AM on June 11, 2026:

    Avoid exactly what?

    I have mentioned it in the PR description: we want to avoid having to store the state, randomization helps with avoiding storing a new value and also with making sure this doesn't cause a predictable network-wide slowdown.

  48. in src/validation.cpp:2841 in bc1a8f760d
    2839 | +        if (m_chainman.m_options.signals) {
    2840 | +            // Update best block in wallet (so we can detect restored wallets).
    2841 | +            m_chainman.m_options.signals->ChainStateFlushed(this->GetRole(), GetLocator(m_chain.Tip()));
    2842 | +        }
    2843 | +
    2844 | +        if (!m_chainman.m_interrupt && m_chainman.m_chainstates.size() == 1) { // Skip AssumeUTXO
    


    sedited commented at 8:54 AM on June 11, 2026:

    Why skip on the assumeutxo case? Wouldn't it be better to skip if the chainstate is historical?


    l0rinc commented at 9:51 AM on June 11, 2026:

    Because of the messy resizing mostly (which I also guarded just to be sure). It's just too complicated, it's easier to just skip that case.


    sedited commented at 10:02 AM on June 11, 2026:

    Chances are that the background validation takes a long time. Maybe I am missing something here. What does resizing being messy have to do with this?


    l0rinc commented at 10:08 AM on June 11, 2026:

    We have to avoid swapping out the underlying LevelDB database while it's compacting


    sedited commented at 10:13 AM on June 11, 2026:

    Oh, my impression was your new CCoinsViewDB destructor was already doing that by waiting on the compaction to complete.


    l0rinc commented at 10:26 AM on June 11, 2026:

    Which is likely the case, I just haven't tried it with AssumeUTXO - and seeing that @andrewtoth also just skipped that case in https://github.com/bitcoin-core/leveldb-subtree/pull/61#issuecomment-4529236797, it seemed safer. I don't mind doing it here if you think it's important, but this just seemed simpler.


    andrewtoth commented at 12:21 PM on June 11, 2026:

    I think doing it during assumeutxo is also safe now that we have the mutex. That's a better solution then the diff I shared earlier, so it makes skipping assumeutxo moot.

  49. optout21 commented at 9:01 AM on June 11, 2026: contributor

    One concern: The 1/1000 probability on every call is a simple & elegant solution. However, the emergent behavior depends on the frequency of calling this method, which makes the scheme somewhat brittle. For example, some future changes (e.g. affecting flushing) may influence this, and make the desired "every 6 weeks" invalid. As an alternative, the decision could be based explicitly on time: store the time of last compaction, and compact if age is over a desired time period with some randomization (the time has to be persisted into DB). Another possible approach is to base decision on compaction state, but that may be too specific with LevelDB internals.

  50. in src/validation.cpp:119 in bc1a8f760d
     112 | @@ -113,6 +113,13 @@ const std::vector<std::string> CHECKLEVEL_DOC {
     113 |   * */
     114 |  static constexpr int PRUNE_LOCK_BUFFER{10};
     115 |  
     116 | +// Return whether the completed full flush should compact chainstate
     117 | +static bool ShouldCompactChainstate(bool in_ibd)
     118 | +{
     119 | +    static constexpr uint32_t flush_ratio{1'000}; // Roughly every 6 weeks with hourly flushes
    


    sedited commented at 9:07 AM on June 11, 2026:

    From what I can tell with this ratio some nodes will not compact for a few periods, while some will run compactions in quick succession. Can we make this distribution have a shorter and thinner tail? How about counting (could include a static counter to make it self contained) up to 1000 and then drawing from a smaller range? We're already basing this function's behaviour off the flush intervals, so I don't think adding a tiny bit of state here is particularly bad in terms of adding state and assumptions.


    l0rinc commented at 9:51 AM on June 11, 2026:

    You mean to add an in-memory state to avoid the worst-case of doing multiple compactions in close proximity? Given that it's running in the background (after an existing flush) this shouldn't put any pressure as far as I can tell.


    sedited commented at 9:58 AM on June 11, 2026:

    I'm not really too concerned by wasting work, more that a there will be a few nodes which we expect to not compact at all after half a year.


    l0rinc commented at 10:11 AM on June 11, 2026:

    yeah, with hourly flushes roughly ~1% will not flush for half a year. We could address that by having an in-memory last-compacted state and do a constant + small random deviation + an init trigger that compacts on startup if needed. That would still avoid having to store this state and probably address the long standing nodes and the restarts as well. What do other reviewers think (cc: @andrewtoth, @sipa, @optout21)?


    optout21 commented at 11:19 AM on June 11, 2026:
    • Have an in-memory-only state for the next planned compaction time, initially unset
    • If it is unset, set it to: current time + base period + randomization, and return false
    • If is is set, but in the future, return false
    • If it is set and in the past, set it again, and return true

    If you mean something like this, I think it's a good solution. It's best to avoid the geometric distribution with a long tail.

    A first time after startup could be treated separately with a shorter time period (randomized as well).

    Note that in the case of frequent restarts, compaction may never happen, but that's a corner case, and it's the result of the goal of not compacting right after start.


    andrewtoth commented at 12:19 PM on June 11, 2026:

    I think the main benefit of doing it this way is that it doesn't depend on running the node for a long time. If a user turns on the node and syncs it every few months then turns it off after a few hours, the still have a chance at compacting.


    andrewtoth commented at 12:40 PM on June 11, 2026:

    What about, along with the 1/1000 odds of compaction, nodes also compact if L3 grows to > 900MB. So, 1010MB of new data is added to the cache, which at <200MB per week would be at least 5 weeks. By that time ~56% of nodes will have already compacted randomly, so it's not deterministic, and it caps the downside of disk growth.


    l0rinc commented at 5:07 PM on June 11, 2026:

    nodes also compact if L3 grows to > 900MB

    I'm not sure - is there really a problem we're solving here? I could drop the 1/1000 to 1/200 and let it compact every ~week instead. A node that hasn't compacted for half a year would have a ~1/8 billion chance of that happening (and would still be fine).


    andrewtoth commented at 5:18 PM on June 11, 2026:

    I could drop the 1/1000 to 1/200 and let it compact every ~week instead

    Yeah, but we're using a ton more disk IO that way. It will be 200MB written every week to >15GB.

    By capping it to 1GB extra disk space we solve the objection here that there can be a long tail of no flushing, and we solve the objection of it being deterministic.

    With this combined approach we would also not want to do this during IBD (since it would continuously compact when the db is exactly 1GB, plus some other weirdness), so that also resolves my suggestion of randomly doing it during IBD as well.


    l0rinc commented at 5:30 PM on June 11, 2026:

    By capping it to 1GB extra disk space

    I don't like it; seems very ad hoc, and it doesn't seem like we're actually solving a realistic problem.

    randomly doing it during IBD as well

    As mentioned before, I don't want more randomness during IBD - we can make that deterministic. But that's for another PR.

    Yeah, but we're using a ton more disk IO that way. It will be 200MB written every week to >15GB.

    Currently it's 15 GB every hour; what I'm suggesting will reduce that to 15 GB every week. But if that's too much, sure, we can do a 1/320 chance, which is roughly every 2 weeks and roughly a 1/million chance of no compaction for 6 months.


    andrewtoth commented at 5:35 PM on June 11, 2026:

    Currently it's 15 GB every hour; what I'm suggesting will reduce that to 15 GB every week.

    It's 200mb a week. The 15gb an hour has been fixed already and backported.


    l0rinc commented at 5:48 PM on June 11, 2026:

    I am aware, but the current behavior that we're fixing is the 15 GB per hour. The same per two weeks sounds very reasonable to me - do you have a different opinion on that?

  51. sedited commented at 9:23 AM on June 11, 2026: contributor

    Concept ACK

  52. w0xlt commented at 4:22 PM on June 11, 2026: contributor

    Concept ACK

  53. coins: test chainstate flush baseline
    Add `CDBWrapper::GetProperty()` and expose it through `CCoinsViewDB::GetDBProperty()` so coins tests can inspect LevelDB runtime properties through the coins view.
    Use it in a coins DB flush baseline that records the LevelDB layout after flushing while keeping readback coverage for the flushed coin and best block.
    
    Co-authored-by: Andrew Toth <andrewstoth@gmail.com>
    b10889d107
  54. validation: randomly compact chainstate
    Full chainstate flushes are convenient maintenance points for long-term LevelDB cleanup because the chainstate was just written.
    Randomize the trigger so nodes that flush near the same height do not compact together.
    
    Add blocking chainstate compaction through `CCoinsViewDB::CompactFull()` and give each post-IBD full flush on the normal chainstate a 1/320 chance to start compaction.
    With hourly flushes this averages roughly every two weeks and makes a six-month miss about one in a million.
    This keeps the schedule stateless and leaves last-compaction height or timestamp bookkeeping out of chainstate metadata.
    
    Co-authored-by: Andrew Toth <andrewstoth@gmail.com>
    aa021b26f3
  55. coins: compact chainstate in background
    Full chainstate compaction can take minutes on large databases.
    Move `CCoinsViewDB::CompactFull()` to a named `utxocompact` one-shot background thread so validation only schedules the work.
    
    When validation selects compaction after a full flush, the chainstate was just written and another write is less likely to be needed immediately.
    The coins view destructor waits for completion, and a mutex prevents compaction from using `m_db` while `ResizeCache()` replaces it.
    
    Co-authored-by: Andrew Toth <andrewstoth@gmail.com>
    394e473d42
  56. l0rinc force-pushed on Jun 11, 2026
  57. l0rinc commented at 5:44 PM on June 11, 2026: contributor

    Rebased and addressed the latest review feedback: the randomized post-IBD compaction trigger is now 1/320 (it's important to me to keep it simple), startup failures for the background compaction job are logged, and multi-chainstate mode is no longer skipped now that DB handle replacement is guarded. The helper method wasn't inlined since I'm still planning to continue with compaction during IBD (running benchmarks indicate we can make the code even faster than before, instead of 8% slower).

    Thanks everyone for the review, testing, and suggestions.

  58. andrewtoth approved
  59. andrewtoth commented at 5:58 PM on June 11, 2026: contributor

    ACK 394e473d42ba1383dfec45a3eafa8a73a09dbe8b

  60. DrahtBot requested review from sedited on Jun 11, 2026
  61. l0rinc renamed this:
    coins: compact chainstate regularly after post-IBD flushes
    coins: compact chainstate regularly and for legacy layouts
    on Jun 12, 2026
  62. l0rinc renamed this:
    coins: compact chainstate regularly and for legacy layouts
    coins: compact chainstate regularly
    on Jun 12, 2026
  63. l0rinc commented at 9:34 AM on June 13, 2026: contributor

    I have simulated steady-state compaction during a local -reindex-chainstate run by treating heights above 200,000 as post-IBD, shortening periodic flushes to 2-3 minutes, and using a 1/10 trigger so the run produced enough compaction samples.

    The plot shows chainstate size before/after each compaction, the sampled peak size during compaction, compaction duration, temporary extra disk usage, flush workload, and the LevelDB level layout after each compaction. The green marker is the simulated post-IBD threshold, and the purple marker is a node restart during the run.

    The run produced 22 completed compactions over 244 eligible periodic flushes, close to the expected 24.4 samples for a 1/10 trigger. Compactions averaged 1.76 minutes, needed at most 1.06 GiB of temporary extra disk space, and reclaimed 0.68 GiB on average.

    <img width="2578" height="2258" alt="image" src="https://github.com/user-attachments/assets/96e90e0e-058a-4cdc-8f8b-497392f71571" />

    <details><summary>Compaction details</summary>

    parsed 44 snapshots, 22 compactions, and 246 flushes
    wrote reindex-chainstate-logs/chainstate-compactions.csv, reindex-chainstate-logs/chainstate-compactions-pairs.csv, reindex-chainstate-logs/chainstate-compactions-flushes.csv, reindex-chainstate-logs/chainstate-compactions-restarts.csv, reindex-chainstate-logs/chainstate-compactions-summary.txt, and reindex-chainstate-logs/chainstate-compactions.png
    raw snapshots: 44
    completed compactions: 22
    incomplete compactions: 0
    parsed flush commits: 246
    simulated post-IBD threshold: height > 200000
    restart marker: shutdown 2026-06-13T00:16:05Z at h=856403; started 2026-06-13T08:15:24Z; first tip 2026-06-13T08:15:29Z at h=854933
    compaction heights: 372712..930751
    compaction duration: n=22 mean=1.759 min median=1.631 min minimum=0.286 min maximum=4.207 min
    peak extra disk: n=22 mean=0.496 GiB median=0.492 GiB minimum=0.097 GiB maximum=1.061 GiB
    peak extra share: n=22 mean=7.702 % median=7.786 % minimum=2.366 % maximum=13.663 %
    disk reclaimed: n=22 mean=0.676 GiB median=0.450 GiB minimum=-0.652 GiB maximum=3.859 GiB
    disk reclaimed share: n=22 mean=11.138 % median=10.701 % minimum=-5.390 % maximum=29.532 %
    positive disk reclaimed: n=18 mean=0.895 GiB median=0.560 GiB minimum=0.188 GiB maximum=3.859 GiB
    total disk reclaimed: 14.875 GiB
    file count before compaction: n=22 mean=233.773 files median=185.000 files minimum=58.000 files maximum=527.000 files
    file count after compaction: n=22 mean=208.500 files median=164.000 files minimum=48.000 files maximum=434.000 files
    file count delta: n=22 mean=25.273 files median=20.000 files minimum=-4.000 files maximum=123.000 files
    flushes during compaction: n=22 mean=2.818 flushes median=2.500 flushes minimum=0.000 flushes maximum=7.000 flushes
    changed outputs during compaction: n=22 mean=19.689 M median=17.490 M minimum=0.000 M maximum=48.941 M
    back-to-back compactions: h=378305 after 8s, h=500553 after 2s, h=553030 after 31s, h=843571 after 9s, h=930751 after 15s
    periodic flush commits: 244
    flush changed outputs: n=244 mean=6.846 M median=6.997 M minimum=0.008 M maximum=7.006 M
    flush dirty percentage: n=244 mean=100.000 % median=100.000 % minimum=100.000 % maximum=100.000 %
    periodic flush duration: n=244 mean=6.639 s median=6.000 s minimum=0.000 s maximum=48.000 s
    eligible periodic flushes above height 200000: 244
    observed compaction rate: 22/244 = 0.090; expected compactions at 1/10: 24.40
    

    </details>

  64. gmaxwell commented at 6:29 PM on June 13, 2026: contributor

    I have also measured the performance; the extra compaction doesn't add any meaningful slowdown.

    latency also matters, if I'm correct in understanding this it can result in multi-minute delays in getblocktemplate. Not really super ideal. The random nature makes them infrequently but could also leave a user chasing a ghost as to why it was randomly so slow. Also long delays will abuse peers (e.g. who might have just requested a tip block from the node) and potentially make the node look broken or malicious.

    Have you considered any alternative options that would avoid this? E.g. always compacting at startup?

  65. andrewtoth commented at 6:38 PM on June 13, 2026: contributor

    it can result in multi-minute delays in getblocktemplate

    The compactions occur on separate threads that do not take cs_main, so it should not cause any delays in getblocktemplate or anything user facing. Compactions also do not block normal reads and writes of the database, since they happen on a separate background thread (right now only size compactions get scheduled due to writes).

    The only edge case is if a user is unlucky enough to try and shutdown the node while a compaction is already underway. In that case a warning log is printed that shutdown is delayed due to compaction.

  66. optout21 commented at 5:38 AM on June 14, 2026: contributor

    I bring up my suggestion for the probability distribution again, with more backing and explanation (after some offline communication).

    My concern with the current probability solution is that the distribution is skewed towards larger values, and it's unbounded. Simulation showed that while the average is 2 weeks, the median is only 11 days, and the maximum (in 1% of the cases) is 87 days, but can be even higher with lower probability.

    As an alternative, I propose to always chose the number of calls until the planned compaction, chosen with linear distribution (in the range 0..640). The countdown number is an extra state, but need not be persisted.

    Diff shown (against 394e473d42ba1383dfec45a3eafa8a73a09dbe8b):

    diff --git a/src/validation.cpp b/src/validation.cpp
    index 0836b3073c..73f3555383 100644
    --- a/src/validation.cpp
    +++ b/src/validation.cpp
    @@ -113,11 +113,22 @@ const std::vector<std::string> CHECKLEVEL_DOC {
      * */
     static constexpr int PRUNE_LOCK_BUFFER{10};
     
    +// Max calls till next compaction (linear distribution, average is half)
    +static constexpr uint32_t COMPACT_PERIOD{640}; // Roughly every 2 weeks with hourly flushes
    +
    +// Number of calls until next planned compaction
    +static uint32_t g_compact_countdown{FastRandomContext().randrange(COMPACT_PERIOD)};
    +
     // Return whether the completed full flush should compact chainstate
     static bool ShouldCompactChainstate(bool in_ibd)
     {
    -    static constexpr uint32_t flush_ratio{320}; // Roughly every 2 weeks with hourly flushes
    -    return !in_ibd && FastRandomContext().randrange(flush_ratio) == 0;
    +    if (in_ibd) return false;
    +    if (g_compact_countdown > 0) {
    +        --g_compact_countdown;
    +        return false;
    +    }
    +    g_compact_countdown = FastRandomContext().randrange(COMPACT_PERIOD);
    +    return true;
     }
     
     TRACEPOINT_SEMAPHORE(validation, block_connected);
    

    This solution only adds minimal complexity, but it's much more predictable, as the compaction periods fall in the 0 to 4-weeks period with equal, linear distribution.

    Notes:

    The linear distribution is of secondary importance to this PR, so I'm ready to ACK without it as well. This could be done as a follow-up if you don't want an extra change right now.

    With this setup it's easy to set a different period for the first compaction after startup, if desired (not possible with state-less random pick).

    It's also possible to use absolute times instead of call counts. This is analogous, but solves the issue of dependence of the call frequency, making the scheme less brittle against future usage changes (again, not possible with state-less pick).

    More detailed analysis on the probability distribution (with simulation results) here: https://gist.github.com/optout21/1b468cd3e381b97eccdba5c45a2450a0

  67. sedited commented at 12:07 PM on June 14, 2026: contributor

    @gmaxwell does #35465 (comment) address your concerns?

  68. sipa commented at 1:47 PM on June 14, 2026: member

    @optout21 FWIW, the number of flushes per compaction in this PR right now follows a geometric distribution with $p = 1/320$, so the probability that it takes exactly $k$ flushes is $p \cdot (1-p)^k$, and the probability that it takes at least $k$ flushes is $(1-p)^k$. The median should be around 9 days. It's indeed fairly asymmetric (plots), but that is also inevitable with any memoryless distribution (i.e., anything we can do without introducing extra state).

    I wouldn't be opposed to something that uses state to make it more uniform, but I'm not sure it's necessary, and the geometric distribution has the advantage of being maximally unpredictable. Also, if we're going to go with extra state, I'd opt for a somewhat more narrow range, as it allows avoiding very short intervals entirely, for example with a uniform distribution over $[312, 360]$ (i.e., between 13 and 15 days).

    EDIT: Maybe this explains the 11 days number? The flush frequency is not exactly once per hour, but {wait 1 hour; then wait for next block}. Taking that into account, the distribution is more complicated, but the median time should be ~10.66 days between compactions.

  69. optout21 commented at 9:56 PM on June 14, 2026: contributor

    I wouldn't be opposed to something that uses state to make it more uniform, but I'm not sure it's necessary

    The aspect mostly worrying me is the possibility for very large compaction periods (unbounded distribution), but I agree that it's not critical. I'm aware that this is not the most important detail of this PR, and I trust the PR author and reviewers (@andrewtoth) to assess the various concerns and suitable values.

    if we're going to go with extra state, I'd opt for a somewhat more narrow range

    Indeed, a narrower distribution around a value (e.g. 14 +/-1 days) may be more appropriate, and it's possible with stateful logic, but not with a memory-less one. Though I would argue for a wider interval, at least 25% of the average length. A very narrow range has the risk of being too regular, and through some unforeseen interactions (block propagation?), even converge towards synchronization across different nodes (something to be avoided).

    A more optimal distribution may be one with a strict lower and upper bound, and non-uniform distribution within, like e.g. 12 + 16 * rand^3 (days).

    The median should be around 9 days.

    Thanks for checking, indeed, the theoretical median is 221 (9.2 days). This shows up in the results as the median of the single first period (222). The median of medians measured was 238 (9.9 days), the average of medians 265.85 (hence the 11 days mentioned). I haven't given much thought to these differences, but I suspect it's an artifact of the relatively short 4320-step simulation runs (I did a standalone simulation, the exact timing of flushes was out of scope).

    (plots)

    🙏👍 (In line with my visual comparison included at the end of the gist.)

  70. l0rinc commented at 11:58 AM on June 15, 2026: contributor

    Also, if we're going to go with extra state, I'd opt for a somewhat more narrow range, as it allows avoiding very short intervals entirely, for example with a uniform distribution over [312,360] (i.e., between 13 and 15 days)

    If the state is in memory it would mean that we will only ever compact if the node is online for at least 13 days. If it's persisted, we will have many complications (based on whether the nodes is online for an extended period or not, unless it's tied to block height which we wanted to avoid). The current stateless solution works in both always-online and on-demand validation cases.

    My concern with the current probability solution is that the distribution is skewed towards larger values, and it's unbounded

    Have you considered any alternative options that would avoid this? E.g. always compacting at startup?

    We could extend this in the future for a worst-case startup compaction option - but all of my similar suggestions were rejected so far. I'm also not sure it's needed for now.

    The only part I would note is that running full compaction seems to result in a different level-distribution than previous seek compactions:

    with seek compaction <img width="3538" height="2258" alt="image" src="https://github.com/user-attachments/assets/b1b7aa4a-8c29-4a6a-9182-adfa5991e9b6" />

    no seek compaction + regular force compactions <img width="3538" height="2258" alt="image" src="https://github.com/user-attachments/assets/a84f748d-a50a-457c-bf7f-73732cf5ad7c" /> Note the shape of L4 at the bottom. Also note that this compaction seems to require 1.5 GB extra space - should we extend the space check warning to include this - or skip compaction when we don't have enough room?

  71. optout21 commented at 9:31 AM on June 16, 2026: contributor

    If the state is in memory it would mean that we will only ever compact if the node is online for at least 13 days

    That's a good point, that's why I think it makes sense to have a shorter value for the first compaction period, e.g. 1 hour (randomized). The drawback is that in that case a node that restarts every 2 hours, will compact after each restart. But that's acceptable; in any case it's not possible to satisfy all edge cases using only time-based, non-persisted state.

  72. l0rinc commented at 9:13 AM on June 17, 2026: contributor

    To get a better idea of how all of this is "supposed" to behave, I have replaced LevelDB with RocksDB locally and got a level distribution that's completely different from how LevelDB currently behaves (edited): <img width="3538" height="2258" alt="image" src="https://github.com/user-attachments/assets/03871ec8-0bc7-4edb-a7f9-85baadef521f" />

    Based on this, I have adjusted kMaxMemCompactLevel and rerun everything.

    <details><summary>kMaxMemCompactLevel = 6</summary>

    diff --git a/src/leveldb/db/dbformat.h b/src/leveldb/db/dbformat.h
    --- a/src/leveldb/db/dbformat.h	(revision e09a3346e37023a6a337f7ca4bee0f667efdd86e)
    +++ b/src/leveldb/db/dbformat.h	(date 1781626650712)
    @@ -39,7 +39,7 @@
     // expensive manifest file operations.  We do not push all the way to
     // the largest level since that can generate a lot of wasted disk
     // space if the same key space is being repeatedly overwritten.
    -static const int kMaxMemCompactLevel = 2;
    +static const int kMaxMemCompactLevel = 6;
     
     // Approximate gap in bytes between samples of data read during iteration.
     static const int kReadBytesPeriod = 1048576;
    

    </details> which resulted in a chart that doesn't look like LevelDB gets surprised by the payload all the time (no seek compaction, lower mmap, no post-IBD compaction). <img width="3538" height="2258" alt="image" src="https://github.com/user-attachments/assets/4f803660-07a2-4061-95f9-a036c9841b85" />

    I will experiment with this further - @sipa, @andrewtoth your expertise would be welcome here.

  73. andrewtoth commented at 1:49 PM on June 17, 2026: contributor

    @optout re: unbounded distribution - With 1/320, the probability that this will not occur in 6 months is 1 in a million. The worst case of writing 200MB per week would be 4.8GB in 6 months. In practice that number will be a lot less, since there will be cut-through on disk as the chainstate grows on lower levels. For reference the blockchain will grow about 50 GB in 6 months. I don't think this is something we should be worried about. @l0rinc for manual compactions, it pushes all files to the lowest level, which is now L5. I don't think this matters if we are compacting regularly.

    Changes like this are very susceptible to bike shedding. There are a lot of parameters we can tweak here, but for diminishing returns. I think this solves the problem in an elegant way, and will keep users' chainstate disk size the smallest possible over time.

    I like this change because it does not need state persisted on disk or in memory. A long running node has the same probability to compact as one that is turned on for a few hours and then shut down.

    Compactions happen in the background and are invisible to users.

    I reiterate my ACK from above.

  74. in src/validation.cpp:2841 in 394e473d42
    2837 | @@ -2838,9 +2838,11 @@ bool Chainstate::FlushStateToDisk(
    2838 |              m_chainman.m_options.signals->ChainStateFlushed(this->GetRole(), GetLocator(m_chain.Tip()));
    2839 |          }
    2840 |  
    2841 | -        if (!m_chainman.m_interrupt && m_chainman.m_chainstates.size() == 1) { // Skip AssumeUTXO
    


    sipa commented at 1:58 PM on June 17, 2026:

    The last commit here drops the && m_chainmain.m_chainstates.size() == 1 condition, introduced in the first commit. Do we want that condition or not?


    andrewtoth commented at 3:08 PM on June 17, 2026:

    I don't think it's necessary with the m_db_mutex lock. The reason was to prevent the cache resizing during compaction, but locking seems better IMO since it's a more explicit solution.


    sipa commented at 4:25 PM on June 17, 2026:

    Oh, I see. It was not clear to me at all it was related to avoiding a data race. Some comments in the code would have been helpful; I assumed it was just trying to avoid compacting for performance reasons, like during IBD.


    l0rinc commented at 12:32 PM on June 18, 2026:

    Some comments in the code would have been helpful

    Instead of code comments I tried explained this in the commit messages. Let me know if you feel strongly about code comments instead:

    Add blocking chainstate compaction through CCoinsViewDB::CompactFull()

    The coins view destructor waits for completion, and a mutex prevents compaction from using m_db while ResizeCache() replaces it.


    As mentioned in the other comment, it was easier to introduce simple blocking compaction independently from the multi-threaded (locking) async version, hence the back-and-forth.

  75. in src/txdb.h:42 in 394e473d42
      37 | @@ -37,9 +38,13 @@ class CCoinsViewDB final : public CCoinsView
      38 |  protected:
      39 |      DBParams m_db_params;
      40 |      CoinsViewOptions m_options;
      41 | +    //! Prevents CompactFull() from using m_db while ResizeCache() replaces it.
      42 | +    Mutex m_db_mutex;
    


    sipa commented at 2:21 PM on June 17, 2026:

    I'm surprising about locking of m_db here (prior to this PR).

    ResizeCache() replaces the object pointed to by m_db, but there are other functions that access m_db without any locking, and m_db is not atomic. Is this safe? Is there a guarantee that no other threads are ever accessing an CCoinsViewDB while ResizeCache() is called?


    andrewtoth commented at 3:03 PM on June 17, 2026:

    It looks like this can cause a use-after-free/data-race today on master if:

    • gettxoutsetinfo, scantxoutset, dumptxoutset are iterating after having released cs_main
    • assumeutxo background IBD completes and resets m_db in MaybeRebalanceCaches()

    Those 3 RPCs are the only ones who hold a handle to m_db after releasing cs_main.

    This change also takes a handle to m_db without cs_main. Using the new m_db_mutex to lock both the compaction thread and ResizeCache() is safe, because ResizeCache() is the only place where m_db is mutated after initialization.


    sipa commented at 4:27 PM on June 17, 2026:

    Indeed, just found the exact same thing. That should probably be addressed, but is unrelated to this PR.


    l0rinc commented at 12:27 PM on June 18, 2026:

    This part is really messy (it's why I still don't like AssumeUTXO; we keep bumping into these exceptionally messy cases). Originally, I locked AND avoided AssumeUTXO. Actually, before that I compacted on the main thread to avoid this completely, but the latest version should be safe too (I still hope we decide to remove background validation for AssumeUTXO). Thank you both for checking it out!

  76. sipa commented at 4:49 PM on June 17, 2026: member

    ACK 394e473d42ba1383dfec45a3eafa8a73a09dbe8b

  77. DrahtBot requested review from optout21 on Jun 17, 2026
  78. optout21 commented at 7:57 AM on June 18, 2026: contributor

    ACK 394e473d42ba1383dfec45a3eafa8a73a09dbe8b

    This change controls the compaction of the coins database explicitly. Full compaction is invoked with a chance of 1/320 after every flush, which happens roughly hourly. Compaction happens in the background. The result is a coins database that is regularly compacted.

    Several concerns were taken into consideration:

    • Tradeoff between infrequent compactions (uncompacted db with degrading performance) and frequent compactions (high amount of data written)
    • Avoiding network-wide compaction storms
    • Edge case of no compaction for a long time (probability, effect)
    • Edge case of a node being restarted very often
    • Tradeoff of memory-less probabilistic decision vs. one with state

    Thanks for evaluating my review comments!

  79. l0rinc commented at 12:45 PM on June 18, 2026: contributor

    Thanks you all for the reviews, I'm running a few more scenarios in the background, I personally would prefer waiting for those results before merge.

  80. l0rinc commented at 8:06 PM on June 19, 2026: contributor

    A few more benchmarks are still running, but @andrewtoth is right that it should be safe to cherry-pick these to previous releases as well, regardless of the outcome of https://github.com/bitcoin-core/leveldb-subtree/pull/63 (since they contain only https://github.com/bitcoin-core/leveldb-subtree/pull/61). We can continue arguing about mmap and max open files on master. Thanks for the reviews, rfm!

  81. sedited added the label Backport on Jun 20, 2026
  82. sedited added the label Needs backport (29.x) on Jun 20, 2026
  83. sedited added the label Needs backport (30.x) on Jun 20, 2026
  84. sedited added the label Needs Backport (31.x) on Jun 20, 2026
  85. sedited removed the label Backport on Jun 22, 2026
  86. sedited approved
  87. sedited commented at 8:32 AM on June 22, 2026: contributor

    ACK 394e473d42ba1383dfec45a3eafa8a73a09dbe8b

  88. sedited merged this on Jun 22, 2026
  89. sedited closed this on Jun 22, 2026

  90. l0rinc deleted the branch on Jun 22, 2026
  91. l0rinc commented at 9:51 AM on June 22, 2026: contributor

    Thanks for the reviews! @0xB10C would it be possible to run the latest master on any of your observer nodes - maybe with additional LevelDB level stats after for each flush - see https://github.com/bitcoin-core/leveldb-subtree/pull/61#issuecomment-4524808917?

  92. fanquake referenced this in commit fef6c8a4f2 on Jun 22, 2026
  93. fanquake referenced this in commit 711065a3b9 on Jun 22, 2026
  94. fanquake referenced this in commit ea3b318d8d on Jun 22, 2026
  95. fanquake removed the label Needs Backport (31.x) on Jun 22, 2026
  96. fanquake commented at 9:53 AM on June 22, 2026: member

    Backported to 31.x in #35331.

  97. fanquake referenced this in commit 55ead701a3 on Jun 22, 2026
  98. fanquake referenced this in commit cbdb11da24 on Jun 22, 2026
  99. fanquake referenced this in commit d8f2e590e3 on Jun 22, 2026
  100. fanquake removed the label Needs backport (30.x) on Jun 22, 2026
  101. fanquake commented at 10:16 AM on June 22, 2026: member

    Backported to 30.x in #35452.

  102. fanquake referenced this in commit e0ab4cffaa on Jun 22, 2026
  103. fanquake referenced this in commit 4798279925 on Jun 22, 2026
  104. fanquake referenced this in commit 07cf5f35b6 on Jun 22, 2026
  105. fanquake removed the label Needs backport (29.x) on Jun 22, 2026
  106. fanquake commented at 10:55 AM on June 22, 2026: member

    Backported to 29.x in #35450.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-07-01 08:51 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me