chainstate leveldb is rewritten every 30 or 60 minutes #35298

issue tuxArg opened this issue on May 15, 2026
  1. tuxArg commented at 8:43 PM on May 15, 2026: none

    Is there an existing issue for this?

    • I have searched the existing issues

    Current behaviour

    I'm trying to understand why all .ldb files in chainstate are regenerated every 60 minutes (sometimes 30 minutes). It seems some timer triggers it. I would really like to avoid this, as it writes 10GB of data in each 30 or 60 minute period.

    cat debug.log | grep '[leveldb] Level-0' | grep started
    2026-05-15T12:18:02Z [leveldb] Level-0 table #1356: started
    2026-05-15T12:18:05Z [leveldb] Level-0 table #75238: started
    2026-05-15T12:18:06Z [leveldb] Level-0 table #68702: started
    2026-05-15T13:18:21Z [leveldb] Level-0 table #75501: started
    2026-05-15T13:48:21Z [leveldb] Level-0 table #75761: started
    2026-05-15T14:18:21Z [leveldb] Level-0 table #76022: started
    2026-05-15T14:48:21Z [leveldb] Level-0 table #76285: started
    2026-05-15T15:18:21Z [leveldb] Level-0 table #76550: started
    2026-05-15T15:48:21Z [leveldb] Level-0 table #76813: started
    2026-05-15T16:48:21Z [leveldb] Level-0 table #77078: started
    2026-05-15T17:48:21Z [leveldb] Level-0 table #77348: started
    2026-05-15T18:48:21Z [leveldb] Level-0 table #77620: started
    2026-05-15T19:48:21Z [leveldb] Level-0 table #77895: started
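
    A minimal sketch for checking the spacing of these events, assuming debug.log lines have the shape shown above (UTC timestamp first, message after). The function name is hypothetical and the sample lines are copied from the output above.

```python
from datetime import datetime

def level0_start_gaps(log_lines):
    """Minutes between consecutive '[leveldb] Level-0 table ...: started' lines."""
    times = []
    for line in log_lines:
        if "[leveldb] Level-0 table" in line and line.rstrip().endswith("started"):
            stamp = line.split()[0]  # the timestamp is the first whitespace-separated field
            times.append(datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ"))
    return [round((b - a).total_seconds() / 60) for a, b in zip(times, times[1:])]

sample = [
    "2026-05-15T15:18:21Z [leveldb] Level-0 table #76550: started",
    "2026-05-15T15:48:21Z [leveldb] Level-0 table #76813: started",
    "2026-05-15T16:48:21Z [leveldb] Level-0 table #77078: started",
    "2026-05-15T17:48:21Z [leveldb] Level-0 table #77348: started",
    # Lines that are not Level-0 table starts are ignored:
    "2026-05-15T19:48:22Z [leveldb] Compacting 1@1 + 0@2 files",
]
print(level0_start_gaps(sample))  # [30, 60, 60]
```

    To run it on a real log, pass `open("debug.log")` instead of `sample`.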

    After each of these events there are many log entries like:

    2026-05-15T19:48:21Z [leveldb] Delete type=2 #77536
    2026-05-15T19:48:22Z [leveldb] Compacting 1@1 + 0@2 files
    2026-05-15T19:48:22Z [leveldb] Generated table #77896@1: 11927 keys, 558881 bytes
    2026-05-15T19:48:22Z [leveldb] compacted to: files[ 0 0 3 32 487 2 0 ]

    The whole process lasts about 40 seconds and finishes like:

    2026-05-15T19:49:03Z [leveldb] Delete type=2 #77898

    Expected behaviour

    I don't expect the chainstate db to be regenerated every hour. Better still, I would like to be able to configure how often it happens.

    Steps to reproduce

    relevant bitcoin.conf:

    server=1
    txindex=1
    dbcache=800
    maxmempool=100
    debug=leveldb

    Relevant log output

    No response

    How did you obtain Bitcoin Core

    Pre-built binaries

    What version of Bitcoin Core are you using?

    30.2

    Operating system and version

    ubuntu 24.04

    Machine specifications

    I run it inside a podman container on a shared ext4 filesystem.

  2. pinheadmz commented at 7:47 PM on May 17, 2026: member

    I think this comment in the code answers most of your question?

    https://github.com/bitcoin/bitcoin/blob/7802e578c3f1e9a5d9b57fb003349d0e032bb43b/src/validation.cpp#L93-L98

    I asked claude to help explain the rest of it:

    Why 10 GB gets written. The actual dirty data written per flush is small — just the UTXO changes from the last hour's blocks. The 10 GB write you're seeing is LevelDB's internal compaction, which is triggered by the flush but rewrites far more data.

    With -dbcache=800 and -txindex, your coins cache gets roughly 700 MB. LevelDB is configured with

    write_buffer_size = cache / 4 ≈ 175 MB

    With LevelDB's default L0_CompactionTrigger = 4, four level-0 files (≈700 MB) trigger compaction into level-1, which then cascades: level-1 (10 MB max) → level-2 (100 MB) → level-3 (1 GB) → level-4 (10 GB). The entire chainstate lives at the deeper levels, so a cascade rewrite of ~10 GB is normal.

  3. tuxArg commented at 12:42 AM on May 18, 2026: none

    @pinheadmz I appreciate you found exactly the lines that seem to trigger this issue. My question is why do we need this.

    Just before db compaction everything works fine, so there's no actual urgency to do this every hour. If it can be done 24 times a day, it could probably be done just once a day too.

    So, why is it fixed at 50 to 70 minutes then? It would be better to have an option to configure it.

  4. l0rinc commented at 7:47 AM on May 18, 2026: contributor

    So, why is it fixed at 50 to 70 minutes then? It would be better to have an option to configure it.

    This was added in #30611, see the motivation in the PR description.

    I'm trying to understand why all .ldb files in chainstate are regenerated every 60 minutes

    That's not what happens; the permanent storage is updated instead of keeping everything in memory. Can you please tell us what the problem is that you're trying to solve?

    After each of these events there are many log entries like

    You don't need to enable LevelDB debug logging. This is normal behavior: the state is written to disk (LevelDB), which does some cleanup (background compaction) to keep disk access optimal. Again, this is a feature, not a bug. If we leave things in memory for too long, any interrupt (e.g. a crash) would wipe out that state and you would need to redo the work.

    I don't expect the chainstate db to be regenerated every hour.

    It's not, it's just updated regularly to avoid data loss. We could theoretically bump the 50/70 minute range to 90/110 minutes if users think the current interval is too frequent -- what do you think @andrewtoth?

  5. tuxArg commented at 8:52 AM on May 18, 2026: none

    This was added in #30611, see the motivation in the PR description.

    I've just read it and all its thread.

    That's not what happens; the permanent storage is updated instead of keeping everything in memory. Can you please tell us what the problem is that you're trying to solve?

    I run a bitcoin node and, like many who run one, I run it continuously. What I'm trying to reduce is disk I/O (writes in this case). Each flush to disk writes around 10GB of data on a full node. That's 240GB a day. #30611 focused on reducing spikes and avoiding redoing work after power outages. Power loss can happen, but it's not frequent enough to justify optimizing for it at the expense of long-running operation.

    Reducing spikes is a step in the right direction, but what about the total data written to disk? How much data was written per day before? I don't think it was 240GB because I would have noticed that spike.

    You don't need to enable LevelDB debug logging.

    I enabled it to find out what was happening. 10GB every hour is too much to go unnoticed.

    It's not, it's just updated regularly to avoid data loss. We could theoretically bump the 50/70 minute range to 90/110 minutes if users think the current interval is too frequent -- what do you think @andrewtoth?

    It's not to avoid data loss. It was to avoid work having to be redone, but we can recover that data, so it's not data loss. I think it would be better to let users configure the interval in bitcoin.conf.

  6. andrewtoth commented at 12:08 PM on May 18, 2026: contributor

    @tuxArg did you recently upgrade from version 28 or earlier to version 30.2? In version 29 and up the leveldb max file size was increased from 2MB to 32MB, so compactions will take a long time for the first while until all 2mb files have been compacted to 32mb.

    However, that doesn't explain the frequency of compactions. We now write to the chainstate leveldb every ~hour, but just writing to the db does not trigger a compaction every time.

  7. JohnTravolski commented at 12:44 PM on May 18, 2026: none

    Hi, I am also concerned about this. I monitor cumulative disk writes using smartmontools since I'm running on an SSD. I don't want to kill it early since SSDs have limited writes. Previously my node was writing 20 GB / day (Core 27.0 + Fulcrum indexer), but after I upgraded to Core 31.0 it was writing about 220 GB / day, 11 times more. I used sudo iotop -oPa and let it sit for a few hours to ensure it was attributable to the bitcoin-qt process.

  8. l0rinc commented at 1:15 PM on May 18, 2026: contributor

    @andrewtoth's right, it's probably the file size changes after #30039. After it's done compacting, you will have fewer writes than before. This shouldn't kill an SSD - it took me almost 2 years to kill mine, often doing several full reindex-chainstates per day :)

  9. GURGPqxVwj commented at 1:55 PM on May 18, 2026: none

    I would like to add another point here.

    I read the comments above. I understand that Bitcoin Core intentionally writes the chainstate to persistent storage roughly every 50–70 minutes, and that LevelDB compaction can rewrite much more data than the actual coins cache flush.

    My concern is mainly the total SSD write volume. I still do not understand how a fully synced node in normal operation can write hundreds of GB per day to the SSD. I am not talking about IBD or reindex here.

    I have seen the same general pattern on two different hardware platforms:

    • first on a low-power mini PC with 4 GB RAM
    • now on a newer thin client with 16 GB RAM

    I have also seen the same general pattern with different Bitcoin Core versions:

    • Bitcoin Core 30.2
    • Bitcoin Core 31.0, after upgrading only the binaries and keeping the same datadir

    The current data below is from the newer system, because that setup is cleaner and easier to trust.

    Current setup:

    • Ubuntu Server 24.04.4
    • Bitcoin Core 31.0
    • same datadir previously used with 30.2
    • external NVMe SSD, ext4, noatime
    • not running in a container
    • no txindex
    • dbcache=2048
    • wallet disabled
    • debug=bench
    • debug=leveldb
    • node fully synced, not IBD
    • no kernel I/O errors
    • no EXT4 errors
    • SMART health PASSED, media errors 0
    • Linux block write counter and NVMe SMART Data Units Written match almost exactly

    I mention both the Linux block write counter and the NVMe SMART counter because I wanted to make sure I am not just misreading one tool or measuring some local artifact. During the compaction waves, both counters increased by essentially the same amount, so the writes seem to be real device writes.
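
    For anyone wanting to repeat this cross-check, here is a rough sketch of the unit conversions involved. It assumes the Linux counter is read from /sys/block/<dev>/stat, where the seventh field is sectors written in 512-byte units, and that the NVMe "Data Units Written" value counts units of 1000 × 512 bytes. The function names and the two sample snapshots are hypothetical.

```python
SECTOR_BYTES = 512               # /sys/block/<dev>/stat always counts 512-byte sectors
NVME_DATA_UNIT_BYTES = 512_000   # one NVMe data unit = 1000 * 512 bytes
GIB = 1024 ** 3

def linux_sectors_written(stat_text):
    """Return the 'sectors written' field (7th column) of a /sys/block/<dev>/stat line."""
    return int(stat_text.split()[6])

def sectors_to_gib(sectors):
    """Convert a count of 512-byte sectors to GiB."""
    return sectors * SECTOR_BYTES / GIB

def data_units_to_gib(units):
    """Convert NVMe SMART 'Data Units Written' to GiB."""
    return units * NVME_DATA_UNIT_BYTES / GIB

# Hypothetical snapshots taken before and after a compaction wave.
before = "1200 30 96000 500 8000 120 4194304 9000 0 7000 9500"
after = "1300 30 99000 520 9000 150 8388608 9900 0 7600 10400"
delta = linux_sectors_written(after) - linux_sectors_written(before)
print(sectors_to_gib(delta))  # 2.0
```

    Comparing the deltas of both counters over the same interval shows whether the writes seen by the kernel actually reached the device.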

    There was a large compaction wave directly after starting 31.0. That seems reasonable to me. The node had to catch up about 60 blocks and LevelDB recovery/startup activity was happening at the same time.

    The more interesting part happened later, after startup/catch-up was finished.

    What I repeatedly observe is this pattern:

    1. For some time, the node behaves as I would expect. New blocks arrive, UpdateTip is logged, and the reported cache grows.

    2. Then a point is reached where the reported UpdateTip cache stops growing. This is what I mean by "plateau" here.

    3. The exact cache value is not always the same. In this 31.0 run the plateau was around 44.0 MiB. In earlier observations I saw the same general pattern begin at other reported cache values. So I do not think 44 MiB itself is a fixed threshold.

    4. From that point on, new blocks are still connected, but the reported cache stays pinned.

    5. Then LevelDB compaction waves start.

    6. After a compaction wave, the reported cache does not continue growing again. It stays pinned, and later more compaction waves can follow.

    Here is the 31.0 observation.

    Shortly before the plateau, there was a normal FlushStateToDisk / BatchWrite:

    • BatchWrite: write coins cache to disk (330525 out of 337279 cached coins)
    • WriteBatch memory usage: db=chainstate, before=0.0MiB, after=26.3MiB

    After that, new blocks were connected normally. But the reported UpdateTip cache reached about 44.0 MiB and stopped growing.

    In this run I observed at least 17 consecutive UpdateTips with the reported cache at about 44.0 MiB:

    • height 949927: cache=44.0MiB
    • height 949928: cache=44.0MiB
    • height 949929: cache=44.0MiB
    • ...
    • height 949938: cache=44.0MiB
    • later also up to at least height 949943: cache=44.0MiB

    The txo count changed during that time, so the node was not idle. New blocks were connected, but the reported cache stayed pinned.

    After this cache plateau, a large LevelDB compaction wave happened.

    From my report, counted since the marker:

    • Compactions: 35
    • Generated tables: 389
    • Deleted tables: 419
    • Generated bytes: 10.903 GiB
    • Compacted bytes: 10.903 GiB
    • chainstate files since marker: 348 files, size-sum about 10.565 GiB

    The SSD write counters matched this closely:

    • Linux written since marker: 11.022 GiB
    • SMART written since marker: 11.022 GiB
    • largest measured interval: 9.933 GiB in 15.2 minutes
    • SMART-Linux difference: 0.000 GiB

    There were no storage errors in dmesg.

    The compaction wave was mostly the familiar pattern of many generated ~34.5 MB .ldb files. The later part contained several lines like:

    • Compacting 1@4 + 10@5 files
    • Generated table ... about 34.5 MB
    • Compacted ... about 345 MB

    What looks important to me is that the cache did not start growing again after the large compaction wave. It stayed around 44.0 MiB for further UpdateTips, and more LevelDB activity happened later.

    So to me the pattern does not look like "one cache flush, then one cleanup, then normal cache growth again". It looks more like the node reaches a state where the reported cache stops growing, and while it stays in that state, compaction waves repeat.

    So what I am trying to understand is this:

    Is this amount of write amplification expected for a fully synced node in normal operation?

    If the answer is "yes, this is expected", then I would like to understand why a synced node can write this much data to the SSD, and whether there is a way to tune this behaviour.

    I can provide selected debug.log snippets and the small write-counter reports if that would help.

  10. andrewtoth commented at 2:14 PM on May 18, 2026: contributor

    @GURGPqxVwj

    Since you have an already synced chainstate, was this synced on a pre-v29 node? If so, that explains the large compactions and the problem will resolve itself after some time. The hundreds of GB being written per day is due to the large compactions from 2mb -> 32mb leveldb files.

    Regarding the cache value in the UpdateTip - this value can decrease as well after some blocks and is normal. The cache increases when a block creates more outputs than it spends. If most transactions have the same number of inputs and outputs, the value will not increase. Incoming mempool transactions will also increase the cache value. Also, every ~hour the cache is written to disk, and during this process all spent entries from the cache are removed. This reduces the number of entries in the cache.

  11. tuxArg commented at 2:14 PM on May 18, 2026: none

    @tuxArg did you recently upgrade from version 28 or earlier to version 30.2? In version 29 and up the leveldb max file size was increased from 2MB to 32MB, so compactions will take a long time for the first while until all 2mb files have been compacted to 32mb.

    I did: 27.1 -> 29 -> 30 -> 30.2. I also ran reindex-chainstate before reporting this, as I thought it could be the cause, but I still see the same behavior. I mostly see 33MB ldb files:

    $ ls -l chainstate/*.ldb | awk '{printf "%.0fM\n", $5/1048576}' | sort -n | uniq -c
         71 0M
          1 1M
          5 2M
          1 5M
          1 7M
          2 8M
          1 9M
          3 14M
          1 16M
          1 18M
          2 22M
          2 23M
          2 25M
          2 26M
          2 27M
          2 31M
        552 33M
    

    However, that doesn't explain the frequency of compactions. We now write to the chainstate leveldb every ~hour, but just writing to the db does not trigger a compaction every time.

    Well, that is the bug I'm reporting here. If 10 GB must be written then we need a way to configure how often, but if they are not meant to be rewritten every hour, then we have a bug here that triggers full compaction every time.

  12. andrewtoth commented at 2:23 PM on May 18, 2026: contributor

    @tuxArg thanks, I see. I did measure disk usage in #30611 (comment) and did not see this issue of frequent compactions.

    I will investigate and see if any nodes I run are also experiencing this behavior.

  13. GURGPqxVwj commented at 3:18 PM on May 18, 2026: none

    @andrewtoth Thanks, that explanation helps.

    To answer your question: no, this chainstate was not synced on a pre-v29 node. In my setup the chainstate was created with Bitcoin Core 30.2. I never used v28 or older for this datadir. Later I upgraded only the Bitcoin Core binaries to 31.0 and kept the same datadir.

    So I do not think the 2 MB -> 32 MB migration from a pre-v29 chainstate explains my case, unless I misunderstand something.

    Your explanation about the UpdateTip cache makes sense. I understand now that this value does not have to grow monotonically and can decrease or stay flat depending on the blocks and cache flushes.

    However, after looking at the current debug.log for longer, the large compactions do not seem to be just a one-time catch-up or startup effect. After the node was running normally on 31.0, the reported cache stayed at 44.0 MiB from height 949927 to at least height 949954, while new blocks were connected.

    During this same period I see repeated large LevelDB compaction waves:

    • 11:20:42Z–11:29:11Z: about 10.23 GiB compacted/generated
    • 12:08:09Z–12:29:13Z: about 10.55 GiB
    • 13:06:07Z–13:23:47Z: about 10.55 GiB
    • 14:16:59Z–14:30:32Z: about 10.23 GiB

    Since my marker at 09:02:59Z, the debug log shows 137 compactions and about 42.6 GiB generated/compacted LevelDB output.

    So my remaining question is mainly about the amount and recurrence of this write amplification. Is that still expected for a chainstate created with 30.2, or would you expect those large compactions to settle down after some runtime?

  14. l0rinc commented at 3:34 PM on May 18, 2026: contributor

    Thanks for the detailed reports!

    would you expect those large compactions to settle down after some runtime

    I would also expect them to decrease, but it's normal to have some spikes temporarily. Please let us know if this continues for the following days.

    I will investigate and see if any nodes I run are also experiencing this behavior.

    Thanks.

  15. iotamega commented at 4:22 PM on May 18, 2026: none

    It's not to avoid data loss. It was to avoid work having to be redone, but we can recover that data, so it's not data loss. I think it would be better to let users configure the interval in bitcoin.conf.

    +1 to allow this to be a configurable option. I am seeing similar issues across many nodes. Having the ability to configure this would be helpful.

  16. ArmchairCryptologist commented at 5:14 PM on May 18, 2026: none

    I'm seeing similar behavior. Two full nodes with the chainstate on an SSD and the blocks on an HDD, running 30.2 and 31.0 respectively, have written on average 13.6 GB/hour and 13.8 GB/hour to their SSDs since last reboot ~12 days ago. Both of them mostly have ~32MB ldb files in the chainstate directory, so the aforementioned compaction seems to have completed, but the write rate is still the same.

    Bitcoin Core version v31.0.0 (release build)
    Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
    sda              41.47      3203.81      4042.79      5447.53 3300410450 4164682942 5611784428
    
    Bitcoin Core version v30.2.0 (release build)
    Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
    sdb              41.05      3155.59      3979.64         0.00 3258175332 4109016343          0
    

    Double-checked the disk stats in the hypervisor for the latter, and the write activity is still ongoing, averaging 5.4 MB/s in the last hour.

    At ~330 GB/day it could take less than a year to reach the rated write endurance for some recent QLC SSDs - for example, the WD Green SN3000 500GB is rated for only 100 TB lifetime writes, which it would reach after only ~300 days at this rate - so this probably needs to be addressed.
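
    The arithmetic behind that estimate, as a small sketch. It assumes decimal units (1 TB = 1000 GB) and a constant write rate; the function names are hypothetical.

```python
def gb_per_hour_to_gb_per_day(gb_per_hour):
    """Scale an hourly write rate to a daily one."""
    return gb_per_hour * 24

def days_to_rated_endurance(rated_tb_written, gb_per_day):
    """Days until cumulative writes reach the rated endurance (1 TB = 1000 GB)."""
    return rated_tb_written * 1000 / gb_per_day

print(round(gb_per_hour_to_gb_per_day(13.8), 1))  # 331.2
print(round(days_to_rated_endurance(100, 330)))   # 303
```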

  17. sipa commented at 6:15 PM on May 18, 2026: member

    To the people reporting this issue, what are the ages of your DATADIR/chainstate/*.ldb files? Specifically, what percentage (in terms of byte size) are up to a few hours old? LevelDB database files are immutable, so if ~all files are very recent, that would mean that it is indeed rewriting the whole database on every flush.
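
    One way to collect that number is sketched below. It assumes the modification time of each .ldb file is a good proxy for when it was written, which should hold because the files are immutable. The function name and bucket sizes are hypothetical, and the demo uses a temporary directory with fake files rather than a real datadir.

```python
import os
import tempfile
import time
from pathlib import Path

def ldb_age_distribution(chainstate_dir, hours=(1, 2, 4, 8, 12, 24), now=None):
    """Map each age bucket (in hours) to the fraction of .ldb bytes modified within it."""
    now = time.time() if now is None else now
    files = []
    for path in Path(chainstate_dir).glob("*.ldb"):
        info = path.stat()
        files.append((info.st_size, now - info.st_mtime))  # (bytes, age in seconds)
    total = sum(size for size, _ in files)
    if total == 0:
        return {}
    return {h: sum(size for size, age in files if age <= h * 3600) / total
            for h in hours}

# Demo: one 3000-byte file that is 30 minutes old, one 1000-byte file that is 30 hours old.
with tempfile.TemporaryDirectory() as demo_dir:
    now = time.time()
    for name, size, age_hours in [("000001.ldb", 3000, 0.5), ("000002.ldb", 1000, 30)]:
        path = Path(demo_dir) / name
        path.write_bytes(b"\0" * size)
        mtime = now - age_hours * 3600
        os.utime(path, (mtime, mtime))
    print(ldb_age_distribution(demo_dir, hours=(1, 24), now=now))  # {1: 0.75, 24: 0.75}
```

    On a real node, call `ldb_age_distribution` with the path to DATADIR/chainstate.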

    Just to be clear, writing something every 50-70 minutes is expected, but after initial sync (and possibly post-29.0 conversion to 32 MiB files), it shouldn't be writing gigabytes every time. If it's actually rewriting the whole thing all the time, then that is a bug.

    Also, can you share your bitcoin.conf or other notable configuration options? It's clearly not happening to everyone, so there must be something in your configuration that triggers it.

  18. andrewtoth commented at 6:24 PM on May 18, 2026: contributor

    From what I can gather, I think restarting once with -forcecompactdb=1 should help here. It seems there are a lot of scattered files at different levels from [leveldb] compacted to: files[ 0 0 3 32 487 2 0 ], and that could be impacting something here.

    Also, when we bumped max_file_size, we should probably also have bumped write_buffer_size. The latter is implicitly the l0 file size, so every hourly write is also creating a new l0 file. That will make compaction happen more frequently (although it shouldn't cause the whole db to be rewritten each time).

  19. tuxArg commented at 6:27 PM on May 18, 2026: none

    @sipa In my case, the chainstate dir has 18GB and 10GB are from the last hour. @andrewtoth I will restart with forcecompactdb=1 and I'll tell you in a few hours if it keeps doing it.

  20. ArmchairCryptologist commented at 6:54 PM on May 18, 2026: none

    @sipa In bytes, >99% of the ldb files have been touched in the last two hours on both nodes.

    Outside of binds/addnodes/etc, these are probably the relevant settings:

    disablewallet=1
    dbcache=1000
    maxmempool=1000
    persistmempool=0
    txindex=1
    server=1
    

    Will test starting with forcecompactdb.

  21. tuxArg commented at 7:32 PM on May 18, 2026: none

    From what I can gather, I think restarting once with -forcecompactdb=1 should help here. It seems there are a lot of scattered files at different levels from [leveldb] compacted to: files[ 0 0 3 32 487 2 0 ], and that could be impacting something here.

    I've restarted with forcecompactdb=1. It regenerated all files in the chainstate dir. All small files disappeared. But after one hour it compacted everything again, and 10GB were written to disk again according to iostat.

    $ ls -l chainstate/*.ldb | awk '{printf "%.0fM\n", $5/1048576}' | sort -n | uniq -c
         16 0M
          2 11M
          2 17M
          1 21M
          1 22M
          2 25M
          2 30M
          2 32M
        648 33M

    I run it with debug=leveldb if that's useful to debug.

  22. sipa commented at 7:54 PM on May 18, 2026: member

    I note both @ArmchairCryptologist and @tuxArg have -txindex enabled. I wonder if that is related; I'm enabling it on my test system too.

  23. iotamega commented at 8:03 PM on May 18, 2026: none

    I note both @ArmchairCryptologist and @tuxArg have -txindex enabled. I wonder if that is related; I'm enabling it on my test system too.

    I have it enabled on my end as well.

    txindex=1
    coinstatsindex=1
    v2transport=1
    listen=1
    port=8333
    listenonion=0
    shrinkdebugfile=0
    debug=1
    logips=1
    loglevelalways=1
    logtimemicros=1
    printpriority=1
    #capturemessages=1

  24. andrewtoth commented at 8:45 PM on May 18, 2026: contributor

    I was seeing this write amplification on my nodes as well. I instrumented leveldb with some logging to find the issue. It seems due to seek compactions. Due to the large database with random keys, there are many levels that each read will go through and trigger a seek compaction on the way to finding the entry. Seek compactions are not really useful for our workload, so it's possible to just disable it. Size compactions will still occur, so the db will still remain balanced.

    Disabling this would also resolve #29662.

    The following patch fixes the issue for me:

    diff --git a/src/leveldb/db/version_set.cc b/src/leveldb/db/version_set.cc
    index cd07346ea8..35a533a3d1 100644
    --- a/src/leveldb/db/version_set.cc
    +++ b/src/leveldb/db/version_set.cc
    @@ -7,6 +7,7 @@
     #include <stdio.h>
     
     #include <algorithm>
    +#include <limits>
     
     #include "db/filename.h"
     #include "db/log_reader.h"
    @@ -648,21 +649,8 @@ class VersionSet::Builder {
           FileMetaData* f = new FileMetaData(edit->new_files_[i].second);
           f->refs = 1;
     
    -      // We arrange to automatically compact this file after
    -      // a certain number of seeks.  Let's assume:
    -      //   (1) One seek costs 10ms
    -      //   (2) Writing or reading 1MB costs 10ms (100MB/s)
    -      //   (3) A compaction of 1MB does 25MB of IO:
    -      //         1MB read from this level
    -      //         10-12MB read from next level (boundaries may be misaligned)
    -      //         10-12MB written to next level
    -      // This implies that 25 seeks cost the same as the compaction
    -      // of 1MB of data.  I.e., one seek costs approximately the
    -      // same as the compaction of 40KB of data.  We are a little
    -      // conservative and allow approximately one seek for every 16KB
    -      // of data before triggering a compaction.
    -      f->allowed_seeks = static_cast<int>((f->file_size / 16384U));
    -      if (f->allowed_seeks < 100) f->allowed_seeks = 100;
    +      // Disable seek compaction for our workload
    +      f->allowed_seeks = std::numeric_limits<int>::max();
     
           levels_[level].deleted_files.erase(f->number);
           levels_[level].added_files->insert(f);
    
  25. ArmchairCryptologist commented at 8:46 PM on May 18, 2026: none

    I think txindex might be a red herring, but it does seem to amplify it somewhat. I finished checking my other nodes, all of which are pruning nodes without txindex enabled, running either 30.2 or 31.0, and they all have iostat reporting between 7.5 GB/hour and 8.3 GB/hour written since the last system restart (12 days for all of them). That is notably less than the txindex nodes, but still substantial. These nodes also have all ldb files in chainstate touched in the last 3-4 hours.

    These all run with the following settings; some have the wallet enabled and some do not:

    dbcache=1000
    maxmempool=1000
    persistmempool=0
    

    I can't say for sure how old the chainstate database is on most of these nodes since I usually just sync new nodes from an existing one instead of doing IBD, but at least one of the full nodes did IBD no later than May 2020 based on the file timestamps on the blocks database, so it might be related to that.

    PS: I can also confirm that doing forcecompactdb=1 did not resolve it; while it did eliminate some leftover small ldb files on startup, the chainstate database was fully rewritten again a couple of hours after startup.

  26. GURGPqxVwj commented at 9:02 PM on May 18, 2026: none

    @andrewtoth @sipa

    Thanks, this is very helpful.

    This sounds consistent with what I am seeing. In my case, txindex is not enabled and the chainstate was not created on a pre-v29 node, but I still see the recurring large rewrites.

    I also checked the chainstate .ldb file ages by bytes:

    • files total: 397
    • bytes total: 10.565 GiB
    • <= 1h: 10.231 GiB (96.84%)
    • <= 2h: 10.554 GiB (99.89%)

    So almost the whole current chainstate .ldb set was very recently rewritten.

    I will leave the node running unchanged overnight and report the write volume and file age distribution again tomorrow.

  27. andrewtoth commented at 10:58 PM on May 18, 2026: contributor

    @sipa Each periodic sync every ~hour will produce an l0 file a little over 2 MiB, because max_write_buffer is 2MiB and anything over gets written to l0. Now this file gets allowed_seeks = max(file_size / 16 KiB, 100) ~= 128 seeks before it gets seek compacted. 128 random reads going through this file will happen almost immediately from mempool or next block, and then it gets compacted without waiting for the 4 l0 files that trigger size compaction. This will produce a smaller l1 file than size compaction will, so it also has a smaller seek budget. Random reads will drain that seek budget quickly, and it will get scheduled for compaction again.
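
    To make the budget concrete, here is a small sketch of the formula as it appears in the patch above (file size divided by 16 KiB, with a minimum of 100). The Python function and its `floor` parameter are illustrative only and are not part of LevelDB.

```python
def allowed_seeks(file_size_bytes, floor=100):
    """Seek budget for a new table file: file_size / 16 KiB, but at least `floor`."""
    return max(file_size_bytes // 16384, floor)

MIB = 1024 * 1024
print(allowed_seeks(2 * MIB))    # 128: an l0 file of about 2 MiB
print(allowed_seeks(32 * MIB))   # 2048: a full-size 32 MiB table
print(allowed_seeks(558881))     # 100: a small table is raised to the minimum
print(allowed_seeks(558881, floor=0))  # 34: what the same table would get without the minimum
```

    A budget of 128 is used up after 128 reads that pass through the file without finding their key there, which is why a freshly written l0 file becomes a compaction candidate so quickly.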

    The seek compaction mechanism was designed for spinning disk reads. Not sure we need it. Another option is to increase max_write_buffer to ~32 MiB so it doesn't produce an l0 file every sync, and when it does the l0 file at least has a higher seek budget.

  28. tuxArg commented at 11:41 PM on May 18, 2026: none

    @andrewtoth What about just removing this line:

    if (f->allowed_seeks < 100) f->allowed_seeks = 100;
    

    Isn't it enough?

  29. tuxArg commented at 12:22 AM on May 19, 2026: none
     //   (1) One seek costs 10ms
     //   (2) Writing or reading 1MB costs 10ms (100MB/s)
     //   (3) A compaction of 1MB does 25MB of IO:
     //         1MB read from this level
     //         10-12MB read from next level (boundaries may be misaligned)
     //         10-12MB written to next level
     // This implies that 25 seeks cost the same as the compaction
     // of 1MB of data.

    The economics behind this do not hold up. The model treats the duration of a task as the only variable, rather than the resources it uses. A modern CPU has multiple cores but may have only one or two disks, so disk time costs much more than single core/thread time. Even if we use SSDs and have more I/O bandwidth than 100MB/s, a single seek likely costs much less than 10 ms.
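
    A sketch of how the break-even point in that comment moves with the assumed costs. The first call uses the numbers from the LevelDB comment; the SSD figures in the second call (0.125 ms per seek, 2 ms per MB) are purely hypothetical values chosen for illustration, not measurements.

```python
def seeks_per_mb_compacted(seek_ms, mb_transfer_ms, io_mb_per_compacted_mb=25):
    """Number of seeks that cost as much time as compacting 1 MB of data."""
    return io_mb_per_compacted_mb * mb_transfer_ms / seek_ms

def compaction_kb_per_seek(seek_ms, mb_transfer_ms):
    """KB of compaction that costs as much time as one seek (1 MB = 1000 KB)."""
    return 1000 / seeks_per_mb_compacted(seek_ms, mb_transfer_ms)

# Numbers from the LevelDB comment: 10 ms per seek, 10 ms per MB.
print(seeks_per_mb_compacted(10, 10))      # 25.0
print(compaction_kb_per_seek(10, 10))      # 40.0

# Hypothetical SSD: 0.125 ms per seek, 2 ms per MB.
print(seeks_per_mb_compacted(0.125, 2))    # 400.0
print(compaction_kb_per_seek(0.125, 2))    # 2.5
```

    Under these assumed SSD numbers one seek is worth about 2.5 KB of compaction instead of 40 KB, so a budget of one seek per 16 KB would trigger compaction far earlier than the model's own logic justifies.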

  30. andrewtoth commented at 12:36 AM on May 19, 2026: contributor

    @andrewtoth What about just removing this line:

    if (f->allowed_seeks < 100) f->allowed_seeks = 100;
    

    Isn't it enough?

    Removing just that line will cause small files to have an even smaller seek budget, so it would actually make this problem worse. We want all files to have a large seek budget.

  31. GURGPqxVwj commented at 6:37 AM on May 19, 2026: none

    @andrewtoth @sipa

    Overnight update from the same unmodified Bitcoin Core 31.0 run.

    I didn't use -forcecompactdb, didn't apply a patch, and didn't change the config.

    Small correction to my earlier cache observation: the reported cache did eventually continue growing. It stayed at 44.0 MiB for a long time (62 consecutive UpdateTips in my report, from height 949927 to 949988) but later increased again, up to 56.3 MiB in the overnight report. So I don't consider the 44.0 MiB plateau itself central anymore.

    Since my marker at 2026-05-18T09:02:59Z:

    • node fully synced, initialblockdownload=false
    • 120 UpdateTips
    • 637 LevelDB compactions
    • 7021 generated tables
    • 197.464 GiB generated/compacted LevelDB output

    The SSD write counters match this closely:

    • Linux block write counter: 198.533 GiB written
    • NVMe SMART Data Units Written: 198.533 GiB written
    • SMART-Linux difference: 0.000 GiB
    • measured rate: 226.65 GiB/day
    • largest measured interval: 10.259 GiB in 15.2 minutes
    • no kernel / storage errors in dmesg
    • SMART health PASSED, media errors 0

    I also checked the chainstate .ldb file ages by bytes again.

    • files total: 461
    • bytes total: 10.566 GiB
    • <= 1h: 10.232 GiB (96.84%)
    • <= 2h: 10.554 GiB (99.89%)
    • <= 4h: 10.555 GiB (99.89%)
    • <= 8h: 10.555 GiB (99.89%)
    • <= 12h: 10.555 GiB (99.89%)
    • <= 24h: 10.566 GiB (100.00%)

    So almost the whole current chainstate .ldb set is still very recent.

    txindex isn't enabled in my case:

    • no txindex setting found
    • getindexinfo returns {}

    This still looks consistent with recurring large rewrites of almost the whole chainstate .ldb set.

  32. l0rinc commented at 8:52 AM on May 19, 2026: contributor

    It looks like this LevelDB behavior has come up before. In upstream Google LevelDB, tunable allowed_seeks was suggested as far back as 2014, and another user reported in 2020 that disabling seek compaction improved their workload. I do not see an upstream Google LevelDB PR or current option that exposes such a knob.

    But there is similar work elsewhere:

    The original LevelDB heuristic assumes expensive seeks, but with bloom filters, cache, and SSDs, that cost model can be wrong for large random-key workloads like chainstate. In that case, seek-triggered compactions can become write amplification rather than an optimization.
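To make the heuristic concrete: as far as I can tell from LevelDB's version_set.cc, each new table file gets a seek budget of one seek per 16 KiB of file size, with a floor of 100. A lookup that has to consult more than one file charges a seek to the first file it consulted, and a file whose budget runs out is scheduled for compaction. The sketch below only mirrors that arithmetic in Python; the function name matches the LevelDB field, but the code is an illustration and not the LevelDB source.

```python
MIB = 1 << 20


def allowed_seeks(file_size_bytes):
    """Seek budget for a new table file: one seek per 16 KiB, at least 100."""
    return max(100, file_size_bytes // 16384)


# A small table like the 558881-byte one in the log above, a 2 MiB file, a 32 MiB file.
for size in (558881, 2 * MIB, 32 * MIB):
    print(size, allowed_seeks(size))
```

With a mempool doing random reads into a database of random keys, a budget of 100 to a few thousand wasted lookups per file can be used up quickly, which fits the short intervals between compactions reported in this issue.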

    I did some local testing with knobs Core can already influence. Increasing bloom filter bits or block cache size did not stop read-triggered seek compactions in a small LevelDB repro. Increasing the chainstate write buffer did reduce L0 churn by avoiding small steady-state syncs becoming L0 files, but it does not disable the seek-compaction path for existing files. These were already mentioned by @andrewtoth above.

    So there may be two separate changes worth considering:

    • A Core-only mitigation: raise the chainstate LevelDB write buffer, ideally for coins_db only.
    • A fuller fix: add a chainstate-only option, similar to Mojang's disable_seek_autocompaction / goleveldb's DisableSeeksCompaction, that disables seek-triggered compaction while leaving normal size compactions and manual compactions enabled.

    Thanks for the detailed reports - they were very helpful for narrowing this down!

  33. andrewtoth commented at 11:38 AM on May 19, 2026: contributor

    A fuller fix: add a chainstate-only option, similar to Mojang's disable_seek_autocompaction / goleveldb's DisableSeeksCompaction, that disables seek-triggered compaction

    @l0rinc why add a chainstate-only option? Why shouldn't we just disable it globally, like in the patch I shared?

  34. l0rinc commented at 12:18 PM on May 19, 2026: contributor

    Seems to me we can get it in faster if it's narrower - I don't usually test the other LevelDB usages, but we can of course set it globally and test all usages as well.

  35. andrewtoth commented at 1:48 PM on May 19, 2026: contributor

    Ran two independent patches - one to disable seek compaction in leveldb, and one to bump MAX_COINS_DB_CACHE to 64 MB (which gives a 16 MB max_write_buffer).

    • disabling seek compaction: 2.19 G read, 511.42 M written
    • bumping max_write_buffer: 34.74 G read, 31.97 G written
    • master: 216.66 G read, 155.92 G written
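The three runs are easier to compare as ratios against master. A quick sketch, assuming "G" and "M" are binary multiples (the comment does not say, and decimal units would shift the ratios by a few percent):

```python
UNITS = {"M": 2**20, "G": 2**30}  # assumption: binary multipliers


def to_bytes(text):
    """Parse a figure such as '511.42 M' into bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


# (read, written) per run, copied from the figures above.
runs = {
    "disable seek compaction": ("2.19 G", "511.42 M"),
    "bump max_write_buffer": ("34.74 G", "31.97 G"),
    "master": ("216.66 G", "155.92 G"),
}

master_written = to_bytes(runs["master"][1])
for name, (read, written) in runs.items():
    factor = master_written / to_bytes(written)
    print(f"{name}: master wrote {factor:.1f}x as much")
```

Under that assumption, the larger write buffer cuts writes by a little under 5x, while disabling seek compaction cuts them by roughly 300x.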

    So, a higher max_write_buffer holds off seek compactions for longer, since it doesn't produce the L0 file as quickly, but once it does, the compaction still overwrites the entire database.

    This was already a problem for compactions before (see #29662), but with #30611 it now happens every hour.

    I don't usually test the other LevelDB usages, but we can of course set it globally and test all usages as well.

    @l0rinc I think cleanly disabling it for all usages is narrower and easier to reason about. Then we also don't need to add plumbing to pass the option. The economics of seek compactions in the comments assume a disk that does random seeks in 10ms. Such disks were more common in 2011, when this comment was made. I don't think that type of disk would be usable with bitcoind today, so it's not worth considering.

  36. fanquake added this to the milestone 32.0 on May 19, 2026
  37. andrewtoth commented at 3:42 PM on May 19, 2026: contributor
  38. sipa commented at 4:11 PM on May 19, 2026: member

    So this is a 30.0 regression, right, not 31.0?

  39. andrewtoth commented at 4:20 PM on May 19, 2026: contributor

    I did not observe the cascading compactions when testing. See #30611 (comment), where the compaction spikes occur only every few days both with writing the chainstate every hour and every 24 hours. But it seems reporters here are also seeing this when running v30.2.

  40. tuxArg commented at 4:34 PM on May 19, 2026: none

    Opened a fix bitcoin-core/leveldb-subtree#61

    @andrewtoth Do you think this simple patch could be applied to the v30.2 tag? Or should other changes be made too?

  41. ArmchairCryptologist commented at 5:24 PM on May 19, 2026: none

    So this is a 30.0 regression, right, not 31.0?

    I cannot confirm the exact version of the regression, but I can confirm that there seems to be no difference in behavior or average write levels between nodes running 30.2 and 31.0; they are all experiencing this issue with roughly the same intensity.

  42. andrewtoth commented at 6:03 PM on May 19, 2026: contributor

    So, I see the measurements I made in #30611 (comment) were made before #30039 was merged. So, the compactions were likely still happening but were much less expensive. A 2 MiB file would only rewrite other 2 MiB files in lower levels. After #30039, a small ~2 MiB higher level file can rewrite many other 32 MiB files at the level below.
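A toy model of why the same seek compaction got more expensive: the bytes touched are roughly the input file plus every overlapping file in the level below. The overlap count `k` is purely hypothetical (nothing in this thread measures it), and the 2 MiB and 32 MiB file sizes are taken from the comment above.

```python
MIB = 1 << 20


def compaction_bytes(input_bytes, overlapping_files, file_bytes):
    """Rough bytes read (and rewritten) when one file is merged into the level below."""
    return input_bytes + overlapping_files * file_bytes


k = 10  # hypothetical number of overlapping files, not a measured value
before = compaction_bytes(2 * MIB, k, 2 * MIB)   # 2 MiB files in the level below
after = compaction_bytes(2 * MIB, k, 32 * MIB)   # 32 MiB files in the level below
print(before // MIB, "MiB vs", after // MIB, "MiB")
```

The model ignores cascading into further levels, so it understates the total, but it shows how a small input file can trigger a rewrite that is dominated by the size of the files it overlaps.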

  43. l0rinc commented at 6:17 PM on May 19, 2026: contributor

    Glad we reduced it from the original 128 MB...

  44. andrewtoth commented at 2:43 PM on May 20, 2026: contributor

    <img width="2048" height="1089" alt="Image" src="https://github.com/user-attachments/assets/1dfb2a6a-7077-4cc4-a369-76ac5c0fe3de" />

    Got some interesting write IO data from @0xB10C: average writes per 24-hour period for multiple nodes.

    Note that erin is running v29, so it is compacting once a day. Even so, 34 GB is way too much write IO when the actual chainstate writing is < 50 MB per day. After ~36 hours, my node with the above patch shows 1418.37 M written.

    Also note that frank is running v31 with blocks only. The reason its write io is much less is because there are no mempool txs triggering random reads into the chainstate db. So, after a batch write, an L0 file is produced, but nothing reads it so it doesn't get scheduled for compaction. The next block that comes in will trigger the seek compaction, but after the L0 files are compacted they will stop there, since they aren't getting constantly read from at L1. Eventually the files will still get seek compacted after every block, but not as quickly as getting seek compacted by mempool txs.

  45. 0xB10C commented at 12:19 PM on May 22, 2026: contributor

    I've been running https://github.com/andrewtoth/bitcoin/tree/disable_seek_compaction on one of my nodes for slightly more than 24h now. Seems promising. Note that this is total disk writes, and there's other stuff (e.g. detailed debug logs) writing to disk.

    Here's a graph of hourly disk writes on two hosts that had comparable disk writes before. jade is the one running the patch.

    <img width="1274" height="640" alt="Image" src="https://github.com/user-attachments/assets/76df5152-2e93-4a7b-9f21-a0f46ec135d4" />

  46. l0rinc commented at 1:22 PM on May 22, 2026: contributor

    @0xB10C can you also track the total chainstate size to make sure we understand how this patch will affect disk occupancy (not just speed or write count)?

  47. achow101 closed this on May 27, 2026

  48. sipa commented at 9:23 PM on May 27, 2026: member

    Should we keep this open until (a) the leveldb subtree has been updated and (b) we decide whether to do something like automatic full compaction every few days and/or at the end of initial sync?

  49. fanquake reopened this on May 27, 2026


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-06-10 06:51 UTC