index: Force database compaction in coinstatsindex #33306

pull fjahr wants to merge 3 commits into bitcoin:master from fjahr:2025-09-csi-compaction changing 5 files +18 −1
  1. fjahr commented at 1:56 pm on September 4, 2025: contributor

    This PR forces full database compaction on coinstatsindex every 10,000 blocks. This does not seem to come with any considerable performance impact, though I did not test on many different systems.

    Motivation: Without forced compaction, coinstatsindex currently ends up with a large number of 55kb files. This is likely related to the very fast writes occurring during sync, but so far I could not pin down where exactly the difference in behavior comes from. It seems to be a LevelDB-internal thing, since I cannot make out any difference in configuration between coinstatsindex and the other indexes.
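    A minimal sketch of the periodic trigger this PR adds. The helper name and the genesis exclusion (height 0 would otherwise also satisfy the modulo check) are assumptions for illustration, not necessarily the PR's exact code:

    ```cpp
    #include <cassert>

    // Hypothetical helper mirroring the PR's trigger condition: force a full
    // database compaction once every 10,000 blocks. The genesis exclusion is
    // an assumption added here for illustration.
    constexpr int COMPACTION_INTERVAL{10'000};

    bool ShouldCompact(int height)
    {
        return height > 0 && height % COMPACTION_INTERVAL == 0;
    }

    int main()
    {
        assert(!ShouldCompact(0));      // genesis does not trigger
        assert(ShouldCompact(10'000));  // every 10,000th block triggers
        assert(!ShouldCompact(10'001)); // all other heights do not
        return 0;
    }
    ```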

    This was first reported here: #30469 (comment)

  2. DrahtBot added the label UTXO Db and Indexes on Sep 4, 2025
  3. DrahtBot commented at 1:56 pm on September 4, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33306.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30469 (index: Fix coinstats overflow by fjahr)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  4. mzumsande commented at 3:48 pm on September 4, 2025: contributor

    Do you know if the current state with many small files and one large file has actual downsides, or does it just look weird?

    I am not an expert on LevelDB, just mentioning that two AIs I consulted say that the benefits of regular compaction would only be cosmetic and suggest just letting LevelDB do its thing.

  5. fjahr commented at 4:56 pm on September 4, 2025: contributor

    Good question, and I am definitely not a LevelDB expert either. I think that at the current size and usage pattern of the coinstatsindex we are not really facing any significant problems on a modern system, since the amount of data we are storing is very small relative to what LevelDB is capable of handling. But I think the impact is probably non-zero and might increase over time as the blockchain grows. Very high-level rationale: there must be a reason LevelDB usually does these compactions.

    As far as I can tell the best resource on this topic is this docs page: https://chromium.googlesource.com/external/leveldb/%2B/HEAD/doc/impl.md#timing From my understanding, these small files are L0 files, and having many of them means we potentially have overlapping keys among them, so we are wasting disk space, and, per the document, “This may significantly increase the cost of reads due to the overhead of merging more files together on every read.”

  6. fjahr commented at 5:01 pm on September 4, 2025: contributor
    FWIW, from the patterns I have observed so far it seems likely to me that the problem only occurs during sync; afterwards the data of new blocks is compacted as it should be, and those 55kb files are just left over from the sync phase. So I am thinking about changing the approach to compact only once, when the index is set to m_synced = true, though this would make the change more complex and I need to do more testing to confirm that my anecdotal observation is correct.
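    The one-time-at-sync alternative described in that comment could look roughly like the sketch below; the class and member names are hypothetical stand-ins, not the index's actual interface:

    ```cpp
    #include <cassert>

    // Illustrative sketch (not the PR's code) of compacting exactly once at
    // the moment the index transitions to m_synced = true, instead of every
    // N blocks during sync.
    class IndexSketch
    {
        bool m_synced{false};
    public:
        int compactions{0}; // counts how often CompactFull() would have run

        void SetSynced()
        {
            if (!m_synced) {
                m_synced = true;
                ++compactions; // stand-in for a one-time CompactFull() call
            }
        }
    };

    int main()
    {
        IndexSketch index;
        index.SetSynced();
        index.SetSynced(); // second call is a no-op
        assert(index.compactions == 1);
        return 0;
    }
    ```

    The advantage would be a single expensive compaction rather than many, at the cost of tracking the sync transition.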
  7. in src/dbwrapper.cpp:351 in 62d3806127 outdated
    @@ -348,6 +348,11 @@ bool CDBWrapper::IsEmpty()
         return !(it->Valid());
     }
    
    +void CDBWrapper::CompactFull() const
    


    l0rinc commented at 7:03 pm on September 4, 2025:
    could you please separate the low-risk changes from the riskier ones? This extraction seems like a simple refactor; it would be great to see it in a very simple first commit, so that we can focus on the new call in src/index/coinstatsindex.cpp

    fjahr commented at 2:31 pm on September 8, 2025:
    Done
  8. in src/index/coinstatsindex.cpp:229 in 62d3806127 outdated
    @@ -224,6 +224,14 @@ bool CoinStatsIndex::CustomAppend(const interfaces::BlockInfo& block)
         m_muhash.Finalize(out);
         value.second.muhash = out;
    
    +    // Force compaction of the database from time to time since LevelDB doesn't
    +    // seem to be handling this well itself at the speed coinstats index is
    +    // syncing.
    


    l0rinc commented at 7:06 pm on September 4, 2025:
    as others have also mentioned, compaction is already a background process; it is triggered regularly and should already take care of these as far as I can tell. Is it urgent to do the expensive compaction regularly? Doing more compactions should take more time overall; can you please share your benchmarks in this regard?

    fjahr commented at 2:31 pm on September 8, 2025:

    As I and others have mentioned (see the linked PR), this doesn’t work for coinstatsindex and we end up with hundreds of 55kb files that seem to stay around forever or until we trigger a full compaction manually.

    Benchmarking didn’t seem to make sense to me because each compaction completes almost instantly, so comparing two runs with and without compaction may not have given an accurate answer. But I figured if I time the runtime of each compaction to the millisecond and then add them up, that gives me a good indicator of how much time the compaction takes. I have been using Signet for this test and the individual runtimes of the compactions range from 12ms to 220ms. Overall, the coinstatsindex sync (time between thread start and thread exit messages) took 5 minutes and 6 seconds. The cumulative time spent on compaction during that sync was 1,367ms. So the compaction made the full sync of the index ~0.45% slower.

    Here is the patch I used for this: https://github.com/fjahr/bitcoin/commit/817990e515c1b64de59b2838c91032069651da5d In case you want to reproduce the result.
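    The measurement described in that comment follows a generic accumulate-per-call timing pattern, which could be sketched like this (`SomeCompaction` is a hypothetical stand-in for `CDBWrapper::CompactFull()`, not the linked patch itself):

    ```cpp
    #include <cassert>
    #include <chrono>
    #include <thread>

    // Stand-in for a compaction call; sleeps ~5ms so the timing is visible.
    static void SomeCompaction()
    {
        std::this_thread::sleep_for(std::chrono::milliseconds{5});
    }

    int main()
    {
        using namespace std::chrono;
        // Time each call to the millisecond and accumulate the total, as in
        // the benchmark described above.
        milliseconds total{0};
        for (int i = 0; i < 3; ++i) {
            const auto start{steady_clock::now()};
            SomeCompaction();
            total += duration_cast<milliseconds>(steady_clock::now() - start);
        }
        // Three ~5ms calls accumulate to at least 15ms in total.
        assert(total >= milliseconds{15});
        return 0;
    }
    ```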


    l0rinc commented at 6:05 pm on September 8, 2025:

    I haven’t checked the linked PR in detail (I have to have some stopping condition, and I think the PR should explain the important context) but I didn’t get the impression until now that we weren’t doing any compaction here before, I just thought you meant that “it’s struggling to keep up”.

    Have a few questions here:

    • could we rather add this to the other background compaction thread? Do we understand why there’s no automatic compaction?
    • does it make sense to do the compaction so often? There’s a reason that’s usually not done for every change; garbage collection is usually a rare event. Wouldn’t it suffice to do it once at the very end only? Maybe 10k isn’t that often; I’ll see if there’s a way to measure it.
    • We’re doing a compaction and logging for genesis (height == 0) as well, right? Is that deliberate?

    I want to see a global, macro benchmark as well, since the continuous compaction most likely slows it down - but I haven’t worked with this area of the code yet so it may take a while for me to set it up


    Here’s what I did to benchmark the performance of only the indexing - created 3 commits:

    For simplicity the application terminates after the indexes are successfully created and we’re measuring the time it took to iterate and index them.

    I’ve started the application with -connect=0 -coinstatsindex=1 to show the progress without any additional activity for stability and measure the output folder’s size and number of files.

    COMMITS="d1e2c95c4d5db1fcfa200276e9053f432a92ff29 6028fdf13a5059c9f44e9a8c4e5a04bc4346749b 461b92236f91b6a1090222185ffa20e481ea02de"; \
    DATA_DIR="/mnt/my_storage/BitcoinData"; \
    for COMMIT in $COMMITS; do \
      git fetch -q origin "$COMMIT" >/dev/null 2>&1 || true && \
      git checkout -q "$COMMIT" && \
      git log -1 --pretty='%h %s' "$COMMIT" && \
      rm -rf "$DATA_DIR/indexes/coinstats" "$DATA_DIR/debug.log" && \
      cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release >/dev/null 2>&1 && \
      ninja -C build >/dev/null 2>&1 && \
      ./build/bin/bitcoind -datadir="$DATA_DIR" -connect=0 -coinstatsindex=1 -printtoconsole=0; \
      grep -m1 'Indexing finished migrating' "$DATA_DIR/debug.log" || true; \
      du -h "$DATA_DIR/indexes/coinstats/db"; \
      find "$DATA_DIR/indexes/coinstats/db" -type f -name '*.ldb' | wc -l; \
    done
    

    It seems the indexing takes several hours; I’ll post the results here after it finishes

  9. in src/index/coinstatsindex.cpp:231 in 62d3806127 outdated
    @@ -224,6 +224,14 @@ bool CoinStatsIndex::CustomAppend(const interfaces::BlockInfo& block)
         m_muhash.Finalize(out);
         value.second.muhash = out;
    
    +    // Force compaction of the database from time to time since LevelDB doesn't
    +    // seem to be handling this well itself at the speed coinstats index is
    +    // syncing.
    +    if (block.height % 10000 == 0) {
    +        LogInfo("Compacting database of coinstatsindex");
    


    l0rinc commented at 7:08 pm on September 4, 2025:
    shouldn’t this be a debug log? Do we need 91 instances of this? (haven’t checked, but will this be triggered for genesis?)

    fjahr commented at 2:31 pm on September 8, 2025:
    I am pretty indifferent, so making it debug. I had to add a new category since none of the existing ones seemed to fit.
  10. in src/dbwrapper.h:288 in 62d3806127 outdated
    @@ -284,6 +284,8 @@ class CDBWrapper
             ssKey2 << key_end;
             return EstimateSizeImpl(ssKey1, ssKey2);
         }
    +
    +    void CompactFull() const;
    


    l0rinc commented at 7:09 pm on September 4, 2025:
    is const the right thing to advertise here? And should it be noexcept?

    fjahr commented at 2:31 pm on September 8, 2025:
    It doesn’t modify the data, only the storage on disk, but since there are changes I think you are right that this shouldn’t be const. I don’t think it should be noexcept either, because we are doing stuff on disk and a lot of issues could come from that.
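    That signature discussion could be illustrated by the sketch below; the wrapper class and its `simulate_io_error` parameter are hypothetical, not Bitcoin Core's actual `CDBWrapper`:

    ```cpp
    #include <cassert>
    #include <stdexcept>

    // Sketch of the reasoning above: CompactFull() logically mutates on-disk
    // state (so it is non-const), and disk I/O can fail (so it is not
    // noexcept).
    class DBWrapperSketch
    {
        bool m_compacted{false};
    public:
        void CompactFull(bool simulate_io_error = false)
        {
            if (simulate_io_error) throw std::runtime_error{"disk I/O failed"};
            m_compacted = true;
        }
        bool Compacted() const { return m_compacted; }
    };

    int main()
    {
        DBWrapperSketch db;
        db.CompactFull();
        assert(db.Compacted());

        bool threw{false};
        try {
            db.CompactFull(/*simulate_io_error=*/true);
        } catch (const std::runtime_error&) {
            threw = true; // a failing disk operation surfaces as an exception
        }
        assert(threw);
        return 0;
    }
    ```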
  11. in src/index/coinstatsindex.cpp:230 in 62d3806127 outdated
    @@ -224,6 +224,14 @@ bool CoinStatsIndex::CustomAppend(const interfaces::BlockInfo& block)
         m_muhash.Finalize(out);
         value.second.muhash = out;
    
    +    // Force compaction of the database from time to time since LevelDB doesn't
    +    // seem to be handling this well itself at the speed coinstats index is
    +    // syncing.
    +    if (block.height % 10000 == 0) {
    


    l0rinc commented at 7:13 pm on September 4, 2025:

    nit: to avoid counting the 0s manually:

    if (block.height % 10'000 == 0) {
    

    or

    if (!(block.height % 10'000)) {
    

    fjahr commented at 2:31 pm on September 8, 2025:
    Done
  12. fjahr force-pushed on Sep 8, 2025
  13. db, refactor: Add full compaction helper method e8ebd8d77b
  14. log: Add log category index bf99e7fbc6
  15. index: Force database compaction in coinstatsindex
    Without forced compaction, the coinstatsindex ends up using a lot of 55kb files. This is likely related to very fast writes occurring during sync.
    7bc77c5c37
  16. fjahr force-pushed on Sep 8, 2025
  17. in src/logging.h:97 in bf99e7fbc6 outdated
    @@ -94,6 +94,7 @@ namespace BCLog {
             TXRECONCILIATION = (CategoryMask{1} << 26),
             SCAN        = (CategoryMask{1} << 27),
             TXPACKAGES  = (CategoryMask{1} << 28),
    +        INDEX       = (CategoryMask{1} << 29),
    


    l0rinc commented at 5:46 pm on September 8, 2025:
    do you think this is better than using the existing LEVELDB category? Seems these categories are in short supply and compaction seems like a LevelDB thing to me.
  18. DrahtBot added the label Needs rebase on Sep 9, 2025
  19. DrahtBot commented at 1:28 am on September 9, 2025: contributor
    🐙 This pull request conflicts with the target branch and needs rebase.
  20. in src/dbwrapper.cpp:353 in 7bc77c5c37
    @@ -348,6 +348,11 @@ bool CDBWrapper::IsEmpty()
         return !(it->Valid());
     }
    
    +void CDBWrapper::CompactFull()
    +{
    +    DBContext().pdb->CompactRange(nullptr, nullptr);
    


    l0rinc commented at 2:58 am on September 9, 2025:

    nit: since we’re touching it:

    DBContext().pdb->CompactRange(/*begin=*/nullptr, /*end=*/nullptr);
    
  21. l0rinc changes_requested
  22. l0rinc commented at 5:06 am on September 9, 2025: contributor

    I’m missing benchmark results to see how this change affects the already very slow performance. I will post my own results as soon as they finish.

    I would also like to understand why database compaction isn’t enabled for this LevelDB instance.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-09-10 03:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me