dbwrapper: Bump LevelDB max file size to 32 MiB to avoid system slowdown from high disk cache flush rate #30039

Pull request: maciejsszmigiero wants to merge 1 commit into bitcoin:master from maciejsszmigiero:dbwrapper-bump-max-file-size, changing 2 files (+2 −0)
  1. maciejsszmigiero commented at 9:41 am on May 4, 2024: contributor

    The default max file size for LevelDB is 2 MiB, which results in the LevelDB compaction code generating ~4 disk cache flushes per second when syncing with the Bitcoin network. These disk cache flushes are triggered by the fdatasync() syscall, which the LevelDB compaction code issues when an output file reaches the max file size.
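To make the mechanism concrete, here is a hypothetical, POSIX-only sketch. It is not LevelDB's actual compaction code, and `WriteTablesWithSync` is an invented name. It writes a given amount of data split into files of at most `max_file_size` bytes and calls `fdatasync()` once per finished file, which is the pattern the description attributes to compaction. Under that assumption, the number of forced flushes for a fixed amount of data is inversely proportional to the file size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <string>
#include <vector>

#include <fcntl.h>      // open
#include <sys/types.h>  // ssize_t
#include <unistd.h>     // write, fdatasync, close

// Hypothetical sketch, not LevelDB code: write `total_bytes` of filler data
// into files of at most `max_file_size` bytes inside `dir`, calling
// fdatasync() once on each file after it is finished. Returns the number
// of fdatasync() calls issued.
std::size_t WriteTablesWithSync(const std::filesystem::path& dir,
                                std::size_t total_bytes,
                                std::size_t max_file_size)
{
    assert(max_file_size > 0);
    const std::vector<char> chunk(4096, 'x');  // filler data
    std::size_t written = 0;
    std::size_t syncs = 0;
    std::size_t file_no = 0;
    while (written < total_bytes) {
        const std::filesystem::path path = dir / (std::to_string(file_no++) + ".ldb");
        const int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return syncs;  // give up on I/O errors in this sketch
        std::size_t in_file = 0;
        while (in_file < max_file_size && written < total_bytes) {
            const std::size_t n = std::min({chunk.size(), max_file_size - in_file, total_bytes - written});
            const ssize_t r = ::write(fd, chunk.data(), n);
            if (r <= 0) {
                ::close(fd);
                return syncs;  // give up on I/O errors in this sketch
            }
            in_file += static_cast<std::size_t>(r);
            written += static_cast<std::size_t>(r);
        }
        ::fdatasync(fd);  // force this file's data to stable storage
        ++syncs;
        ::close(fd);
    }
    return syncs;
}
```

For 1 MiB of data, 256 KiB files cause 4 fdatasync() calls while 64 KiB files cause 16: shrinking the file size 4x multiplies the number of forced flushes by 4.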

    If the database is on an HDD, this flush rate brings the whole system to a crawl. It also results in very slow throughput, since 2 MiB * 4 flushes per second caps throughput at about 8 MiB / second, while even an old HDD can sustain 100 - 200 MiB / second of streaming throughput.

    Increase the max file size for LevelDB to 128 MiB instead so the flush rate drops to about 1 flush / 2 seconds and the system no longer gets so sluggish.
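The arithmetic above can be written down as a tiny model, assuming (as the description does) that each flush corresponds to one completed file of the max file size. The helper names are hypothetical. Under that assumption, the ~1 flush / 2 seconds quoted for 128 MiB files corresponds to roughly 64 MiB/s of compaction output, and at the 8 MiB/s ceiling of the 2 MiB case, 128 MiB files would be flushed only once every 16 seconds.

```cpp
// Simple model (an assumption, not a measurement): every flush finishes
// one output file of `file_size_mib` MiB, so flush rate and write
// throughput are proportional.
constexpr double ThroughputMiBPerSec(double file_size_mib, double flushes_per_sec)
{
    return file_size_mib * flushes_per_sec;
}

constexpr double FlushesPerSec(double throughput_mib_per_sec, double file_size_mib)
{
    return throughput_mib_per_sec / file_size_mib;
}

// 2 MiB files at ~4 flushes per second: at most ~8 MiB/s.
static_assert(ThroughputMiBPerSec(2.0, 4.0) == 8.0);
// 1 flush every 2 seconds with 128 MiB files: ~64 MiB/s.
static_assert(ThroughputMiBPerSec(128.0, 0.5) == 64.0);
// The same 8 MiB/s with 128 MiB files: one flush every 16 seconds.
static_assert(FlushesPerSec(8.0, 128.0) == 1.0 / 16.0);
```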

    The max file size value chosen also matches the MAX_BLOCKFILE_SIZE file size setting already used by the block storage.

  2. DrahtBot commented at 9:41 am on May 4, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/30039.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK l0rinc, andrewtoth, TheCharlatan, willcl-ark, tdb3, laanwj, davidgumberg
    Concept ACK sipa

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    No conflicts as of last run.

  3. sipa commented at 2:13 pm on May 4, 2024: member
    @jamesob Feel like benchmarking a reindex or so with this?
  4. laanwj commented at 4:41 pm on May 4, 2024: member
    Are there any drawbacks to this?
  5. laanwj added the label UTXO Db and Indexes on May 4, 2024
  6. willcl-ark commented at 4:52 pm on May 4, 2024: member
  7. maciejsszmigiero commented at 4:57 pm on May 4, 2024: contributor

    Are there any drawbacks to this?

    I didn’t notice any.

    It’s worth mentioning that the total amount of data stored in this database is at least two orders of magnitude higher than even 128 MiB file size.

  8. laanwj commented at 5:06 pm on May 4, 2024: member

    It’s worth mentioning that the total amount of data stored in this database is at least two orders of magnitude higher than even 128 MiB file size.

    Oh yes, I asked because I like this even from a “LevelDB creates fewer files” point of view: e.g. anecdotally, on one of my nodes the file counter exceeds 6 digits (bitcoin-core/bitcoin-maintainer-tools#161). Of course, this includes deleted files; the active number is “only” about 6000.

  9. tdb3 commented at 6:30 pm on May 5, 2024: contributor

    Are there any drawbacks to this?

    I didn’t notice any.

    It’s worth mentioning that the total amount of data stored in this database is at least two orders of magnitude higher than even 128 MiB file size.

    It would be great if there are few/no drawbacks. Do you mind sharing the methods used so far to test this? It would be great to have some data for comparison.

    Other questions that come to mind (thinking out loud before I dig deeper or perform testing):

    • Does the change from 2MB to 128MB have any impact on consistent or transient RAM usage (i.e. for resource-constrained nodes)?
    • Is the file size (or an option to use the legacy smaller file size) something we would want to expose in bitcoin.conf (e.g. as a debug option)?
  10. andrewtoth commented at 9:31 pm on May 5, 2024: contributor

    Might partially address #29662

    That issue is complaining about long compaction times. From https://github.com/bitcoin/bitcoin/blob/master/src/leveldb/include/leveldb/options.h#L111-L112:

    The downside will be longer compactions and hence longer latency/performance hiccups.

    it seems this change would make compaction times longer, so would exacerbate that issue?

  11. maciejsszmigiero commented at 10:51 pm on May 5, 2024: contributor

    Do you mind sharing the methods used so far to test this?

    I am simply watching the disk cache flush rate in iostat(1). In addition to that, the difference in the system interactivity is also pretty apparent.

    Does the change from 2MB to 128MB have any impact on consistent or transient RAM usage (i.e. for resource-constrained nodes)?

    I did not observe any such effect; the RAM usage of the Bitcoin process seems to vary within roughly the same bounds when syncing with the Bitcoin network with or without this change.

    Is the file size (or an option to use the legacy smaller file size) something we would want to expose in bitcoin.conf (e.g. as a debug option)?

    Maybe, but I don’t know whether it makes sense to expose an additional tuning option, considering, for example, its maintenance impact.

    The downside will be longer compactions and hence longer latency/performance hiccups.

    it seems this change would make compaction times longer, so would exacerbate that issue?

    For me, the biggest performance impact of compaction is from disk cache flushes this operation generates. This patch significantly reduces such flush rate and so should make compaction less painful.

  12. in src/dbwrapper.cpp:150 in 7f15e71f7e outdated
    @@ -147,6 +147,7 @@ static leveldb::Options GetOptions(size_t nCacheSize)
             // on corruption in later versions.
             options.paranoid_checks = true;
         }
    +    options.max_file_size = 128 << 20;
    


    andrewtoth commented at 0:55 am on May 6, 2024:
    Should we make this a constant? Would it be appropriate to reuse MAX_BLOCKFILE_SIZE?

    laanwj commented at 7:43 am on May 6, 2024:
    +1 on a constant, but I don’t think it’s appropriate to reuse MAX_BLOCKFILE_SIZE; better to define a new one.

    maciejsszmigiero commented at 10:00 pm on May 7, 2024:
    Added a relevant constant.
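The thread doesn't show the constant's definition. Below is a minimal sketch of what such a constant could look like. It assumes the constant is a `std::size_t` (the type of `leveldb::Options::max_file_size`) and borrows the `DBWRAPPER_MAX_FILE_SIZE` name that appears in a later revision of the diff. The value is the 128 MiB under discussion at this point; the merged commit uses 32 MiB.

```cpp
#include <cstddef>

// Sketch only; the real header, name and value may differ.
// `N << 20` is N * 2^20 bytes, i.e. N MiB. Shifting a std::size_t rather
// than a plain int keeps large values (2048 << 20 and up) from
// overflowing a 32-bit int.
static constexpr std::size_t DBWRAPPER_MAX_FILE_SIZE{std::size_t{128} << 20};

static_assert(DBWRAPPER_MAX_FILE_SIZE == std::size_t{128} * 1024 * 1024);
static_assert(DBWRAPPER_MAX_FILE_SIZE == 134217728);
```

A dedicated constant, rather than reusing MAX_BLOCKFILE_SIZE, lets the database file size and the block file size change independently.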
  13. andrewtoth commented at 2:18 am on May 6, 2024: contributor

    Benchmarked IBD with an SSD to block 800k, dbcache=450, prune=0 with a local node serving the blocks. This branch is 27% (!) faster than master :rocket:

    commit 7f15e71f7e762645dbd1ea5eba9ecc6f9ad60236 (branch)
      Time (mean ± σ):     14711.490 s ± 225.376 s    [User: 19465.517 s, System: 1147.712 s]
      Range (min … max):   14552.125 s … 14870.854 s    2 runs

    commit eb0bdbdd753bca97120247b921fd29d606fea6e9 (master)
      Time (mean ± σ):     20274.276 s ± 106.042 s    [User: 21762.310 s, System: 4546.936 s]
      Range (min … max):   20199.293 s … 20349.259 s    2 runs
    

    This patch significantly reduces such flush rate and so should make compaction less painful.

    From what I understand, this patch reduces the frequency of flushes, but they will take longer when they do occur. This is great for IBD, but for #29662 the issue is an unavoidable compaction at startup. The compaction could potentially take longer with this patch.
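Benchmark comments in this thread express speedups in two different ways, which can make the percentages hard to compare. The helpers below (hypothetical names) compute both: the fraction of wall-clock time saved, and the old/new ratio that hyperfine reports as "N times faster".

```cpp
#include <cassert>

// Fraction of wall-clock time saved by the new version: 1 - new/old.
double TimeSaved(double old_seconds, double new_seconds)
{
    assert(old_seconds > 0.0);
    return 1.0 - new_seconds / old_seconds;
}

// Speedup ratio old/new, the quantity hyperfine prints as "N times faster".
double SpeedupRatio(double old_seconds, double new_seconds)
{
    assert(new_seconds > 0.0);
    return old_seconds / new_seconds;
}
```

For the means above, TimeSaved(20274.276, 14711.490) is about 0.274 (the quoted 27%), while SpeedupRatio(20274.276, 14711.490) is about 1.38. By comparison, the later dbcache=450 figures (15338 s vs 13246 s) give a ratio of about 1.16 but a time saving of only about 14%.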

  14. laanwj commented at 7:56 am on May 6, 2024: member

    This branch is 27% (!) faster than master

    That’s impressive!

    From what I understand, this patch reduces the frequency of flushes

    Not only the frequency of flushes: another potential advantage here is that LevelDB will spend less time open()ing and close()ing files to stay within its allowed number of open files (e.g. the fd_limiter stuff).
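A rough way to see how strongly the file count depends on the max file size (and with it the open()/close() churn under a fixed open-file budget) is to divide the database size by the max file size. This is only an order-of-magnitude estimate: real directories also contain partially filled files, and LevelDB can finish a file slightly past the limit. The helper name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Estimated number of table files for `db_bytes` of data if every file
// were filled to exactly `max_file_size` bytes (ceiling division).
std::uint64_t EstimateTableFiles(std::uint64_t db_bytes, std::uint64_t max_file_size)
{
    assert(max_file_size > 0);
    return (db_bytes + max_file_size - 1) / max_file_size;
}
```

With the ~4.38 GB chainstate measured later in this thread, this gives about 2091 files at 2 MiB and 262 at 16 MiB, in the same ballpark as the 2168 and 280 files actually observed there.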

  15. luke-jr commented at 5:49 pm on May 7, 2024: member
    If there are no drawbacks, why not go even larger?
  16. willcl-ark commented at 6:38 pm on May 7, 2024: member

    I ran some benchmarks of IBD to block 800,000 vs master for comparison, and got similar, if slightly less impressive, results with default dbcache.

    With -dbcache=16384:

    • master@ fdb41e08: 9607 s
    • master@ fdb41e08 + 7f15e71f7e762645dbd1ea5eba9ecc6f9ad60236: 9351 s
    • 3% faster with this change

    With -dbcache=450:

    • master@ fdb41e08: 15338 s
    • master@ fdb41e08 + 7f15e71f7e762645dbd1ea5eba9ecc6f9ad60236: 13246 s
    • ~16% faster with this change

    I only did a single run of each, though. The sync was performed from a second local node, with the datadir on a separate SSD.

  17. andrewtoth commented at 7:18 pm on May 7, 2024: contributor
    FWIW re: #29662 I did not notice any difference in compaction time at startup on an SSD. It takes about 5 seconds to finish with debug=leveldb both on master and this branch.
  18. maciejsszmigiero force-pushed on May 7, 2024
  19. maciejsszmigiero commented at 10:07 pm on May 7, 2024: contributor

    If there are no drawbacks, why not go even larger?

    I used 128 MiB as the new size for commonality with MAX_BLOCKFILE_SIZE already used by the block storage and because it gives me a nice low disk cache flush rate of about 1 flush / 2 seconds that no longer impacts the overall system performance.

    But just to be sure, changed the patch to use std::max() around this max_file_size option so if at some point LevelDB decides to increase its default above 128 MiB we won’t be lowering it accidentally.
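The effect of wrapping the option in std::max() can be checked in isolation. A minimal sketch, where `kWantedMaxFileSize` and `EffectiveMaxFileSize` are hypothetical names and 128 MiB is the value under discussion at this point. In the patch itself, the result is assigned back to `options.max_file_size`.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the constant used in the patch (128 MiB here).
constexpr std::size_t kWantedMaxFileSize = std::size_t{128} << 20;

// Raise the library's default to our value, but never lower it. Both
// std::max() arguments must have the same type (std::size_t, like
// leveldb::Options::max_file_size), otherwise template argument
// deduction fails to compile.
constexpr std::size_t EffectiveMaxFileSize(std::size_t library_default)
{
    return std::max(library_default, kWantedMaxFileSize);
}

// Today's 2 MiB default is raised to 128 MiB...
static_assert(EffectiveMaxFileSize(std::size_t{2} << 20) == kWantedMaxFileSize);
// ...while a hypothetical future 256 MiB default would be kept.
static_assert(EffectiveMaxFileSize(std::size_t{256} << 20) == (std::size_t{256} << 20));
```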

  20. mzumsande commented at 5:02 pm on May 8, 2024: contributor
    I’ve played around with this branch a bit, upgrading and downgrading between it and master with existing datadirs on signet, and didn’t run into any issues. Also just noting that this will affect all LevelDB databases, including the indexes and the block/index db.
  21. sipa commented at 6:23 pm on May 17, 2024: member
    It appears that RocksDb (more-developed derivative of LevelDB) uses a default of 64 MiB (https://github.com/facebook/rocksdb/blob/main/include/rocksdb/advanced_options.h#L468). See also my comment in #30059 (comment).
  22. l0rinc commented at 12:18 pm on October 2, 2024: contributor

    I did a few benchmarks on HDD and SSD separately (no Raspberry Pi yet, but I understand @davidgumberg ran some of those and saw a significant speedup) to see the effect of the different values on IBD.

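    I have tried different values via #30059 (rebased), namely 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512 MiB (the current value is 2) with the default dbcache, up to block 500,000 (per the -stopatheight in the commands below) using real peers (which introduces some randomness, but the repeated runs should still indicate a trend). The first set of results below is from the SSD (/mnt/my_storage), the second from the HDD (/mnt/BitcoinData).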

    hyperfine \
      --runs 1 \
      --export-json /mnt/my_storage/ibd_benchmark.json \
      --parameter-list DBFILESIZE 1,2,4,8,16,32,64,128,256,512 \
      --prepare 'rm -rf /mnt/my_storage/BitcoinData/*' \
      './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize={DBFILESIZE} -printtoconsole=0'
    
    Benchmark 1: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=1 -printtoconsole=0
      Time (abs ≡):        9376.982 s               [User: 8939.258 s, System: 2037.366 s]

    Benchmark 2: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=2 -printtoconsole=0
      Time (abs ≡):        7809.227 s               [User: 8399.808 s, System: 1258.152 s]

    Benchmark 3: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=4 -printtoconsole=0
      Time (abs ≡):        7060.817 s               [User: 8210.950 s, System: 626.069 s]

    Benchmark 4: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=8 -printtoconsole=0
      Time (abs ≡):        7201.632 s               [User: 8046.769 s, System: 615.964 s]

    Benchmark 5: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=16 -printtoconsole=0
      Time (abs ≡):        7848.417 s               [User: 8394.320 s, System: 713.182 s]

    Benchmark 6: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=32 -printtoconsole=0
      Time (abs ≡):        8289.161 s               [User: 8183.729 s, System: 599.698 s]

    Benchmark 7: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=64 -printtoconsole=0
      Time (abs ≡):        7580.532 s               [User: 8077.446 s, System: 612.879 s]

    Benchmark 8: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=128 -printtoconsole=0
      Time (abs ≡):        9060.371 s               [User: 8140.057 s, System: 606.641 s]

    Benchmark 9: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=256 -printtoconsole=0
      Time (abs ≡):        8778.117 s               [User: 8001.854 s, System: 620.595 s]

    Benchmark 10: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=512 -printtoconsole=0
      Time (abs ≡):        7856.151 s               [User: 7970.946 s, System: 680.476 s]

    Summary
      './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=4 -printtoconsole=0' ran
        1.02 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=8 -printtoconsole=0'
        1.07 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=64 -printtoconsole=0'
        1.11 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=2 -printtoconsole=0'
        1.11 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=16 -printtoconsole=0'
        1.11 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=512 -printtoconsole=0'
        1.17 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=32 -printtoconsole=0'
        1.24 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=256 -printtoconsole=0'
        1.28 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=128 -printtoconsole=0'
        1.33 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=500000 -dbfilesize=1 -printtoconsole=0'
    
    Benchmark 1: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=1 -printtoconsole=0
      Time (abs ≡):        10150.860 s               [User: 8046.261 s, System: 1557.130 s]

    Benchmark 2: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=2 -printtoconsole=0
      Time (abs ≡):        8935.037 s               [User: 7746.422 s, System: 981.186 s]

    Benchmark 3: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=4 -printtoconsole=0
      Time (abs ≡):        7636.675 s               [User: 7348.012 s, System: 547.172 s]

    Benchmark 4: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=8 -printtoconsole=0
      Time (abs ≡):        7633.078 s               [User: 7306.267 s, System: 572.424 s]

    Benchmark 5: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=16 -printtoconsole=0
      Time (abs ≡):        7639.829 s               [User: 7266.532 s, System: 591.955 s]

    Benchmark 6: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=32 -printtoconsole=0
      Time (abs ≡):        7345.802 s               [User: 7265.908 s, System: 584.797 s]

    Benchmark 7: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=64 -printtoconsole=0
      Time (abs ≡):        7617.101 s               [User: 7092.537 s, System: 551.785 s]

    Benchmark 8: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=128 -printtoconsole=0
      Time (abs ≡):        7508.948 s               [User: 7065.206 s, System: 580.337 s]

    Benchmark 9: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=256 -printtoconsole=0
      Time (abs ≡):        7563.822 s               [User: 7093.650 s, System: 599.636 s]

    Benchmark 10: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=512 -printtoconsole=0
      Time (abs ≡):        7600.085 s               [User: 6997.129 s, System: 536.973 s]

    Summary
      ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=32 -printtoconsole=0 ran
        1.02 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=128 -printtoconsole=0
        1.03 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=256 -printtoconsole=0
        1.03 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=512 -printtoconsole=0
        1.04 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=64 -printtoconsole=0
        1.04 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=8 -printtoconsole=0
        1.04 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=4 -printtoconsole=0
        1.04 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=16 -printtoconsole=0
        1.22 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=2 -printtoconsole=0
        1.38 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=500000 -dbfilesize=1 -printtoconsole=0
    

    While these measurements aren’t definitive, both hint that -dbfilesize=4 is better than -dbfilesize=2 (the default), and that going all the way to -dbfilesize=128 may not be much better than that.

    I’ll rerun these with 2,4,8,64,128 and 800k blocks on the HDD to validate the findings.

  23. davidgumberg commented at 7:10 pm on October 4, 2024: contributor

    I cherry-picked your branch onto master and did two runs, each syncing from a stable, dedicated local node, on a Raspberry Pi 5 (4GB) using microSD for storage, with a prune of 2000 and the default dbcache, using the following command:

    ./build/src/bitcoind -daemon=0 -connect=ryzen7900xnode:8333 -stopatheight=800000 -prune=2000 -debug=bench -debug=blockstorage -debug=coindb -debug=mempool -debug=prune
    

    I saw a massive improvement: on average, your branch took ~67.7% of the time taken by the master branch to reach block height 800,000 [1]:

    Build                              | Avg (hh:mm:ss)        | Run 1                 | Run 2
    Master                             | 47:17:14 (170,234 s)  | 48:38:05 (175,085 s)  | 45:56:22 (165,382 s)
    Branch, cherry-picked onto master  | 32:01:14 (115,274 s)  | 34:06:26 (122,786 s)  | 29:56:01 (107,761 s)
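The averages and hh:mm:ss values in the table can be cross-checked with a couple of small helpers (hypothetical names). The average is rounded half a second up.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Format a non-negative duration in seconds as hh:mm:ss (hours may exceed 24).
std::string ToHms(long seconds)
{
    assert(seconds >= 0);
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%02ld:%02ld:%02ld",
                  seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    return std::string(buf);
}

// Mean of two non-negative durations, rounding half a second up.
long MeanOfTwo(long a, long b)
{
    return (a + b + 1) / 2;
}
```

MeanOfTwo(175085, 165382) gives 170234 s (47:17:14) and MeanOfTwo(122786, 107761) gives 115274 s (32:01:14); the ratio of the two averages is about 0.677.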

    For me this validates that a substantial performance improvement is possible, especially, I suspect, on disk-I/O-constrained setups; I’m really interested in making IBD on Raspberry Pis faster.

    Concept ACK on looking into the tradeoffs of different settings here. Not to try and duplicate discussion too much between this and #30059, but I second @l0rinc that looking for one good default seems better than making this configurable, unless we find evidence that different setups benefit substantially from different values.

    But I think more work is needed to identify which value works best here, and ideally to explain why. I will try to benchmark some different max file size values on the Raspberry Pi, similar to @l0rinc’s work above.



    1. These benchmarks took so long that the weather changed between run 1 and run 2, and I am not running them in a room where the temperature is well controlled; I believe this is the primary cause of run 2 being faster for both.

  24. l0rinc commented at 3:20 pm on October 7, 2024: contributor

    Finished benchmarking the default 2 MiB file size vs 4, 8, 64 and 128 MiB. This time it’s a full IBD with real peers up to block 800,000 on an HDD.

    hyperfine   --runs 1   --export-json /mnt/ibd_DBFILESIZE.json   --parameter-list DBFILESIZE 2,4,8,64,128   --prepare 'rm -rf /mnt/BitcoinData/*'   './build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbfilesize={DBFILESIZE} -printtoconsole=0'
    
    Benchmark 1: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbfilesize=2 -printtoconsole=0
      Time (abs ≡):        36403.630 s               [User: 31186.459 s, System: 5761.138 s]

    Benchmark 2: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbfilesize=4 -printtoconsole=0
      Time (abs ≡):        30540.101 s               [User: 29188.931 s, System: 3430.547 s]

    Benchmark 3: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbfilesize=8 -printtoconsole=0
      Time (abs ≡):        28913.948 s               [User: 28857.575 s, System: 2292.117 s]

    Benchmark 4: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbfilesize=64 -printtoconsole=0
      Time (abs ≡):        27911.380 s               [User: 28268.729 s, System: 2179.778 s]

    Benchmark 5: ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbfilesize=128 -printtoconsole=0
      Time (abs ≡):        28191.359 s               [User: 27915.963 s, System: 2045.088 s]
    
      ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbfilesize=64 -printtoconsole=0 ran
        1.01 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbfilesize=128 -printtoconsole=0
        1.04 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbfilesize=8 -printtoconsole=0
        1.09 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbfilesize=4 -printtoconsole=0
        1.30 times faster than ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbfilesize=2 -printtoconsole=0
    

    Edit:

    Repeated the same for SSD, very similar results:

    hyperfine   --runs 1   --export-json /mnt/ibd_DBFILESIZE-ssd.json   --parameter-list DBFILESIZE 2,8,16,32,64 --prepare 'rm -rf /mnt/my_storage/BitcoinData/*'  './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbfilesize={DBFILESIZE} -printtoconsole=0 -dbcache=1000'
    
    Benchmark 1: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbfilesize=2 -printtoconsole=0 -dbcache=1000
      Time (abs ≡):        32323.964 s               [User: 30174.040 s, System: 6349.312 s]

    Benchmark 2: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbfilesize=8 -printtoconsole=0 -dbcache=1000
      Time (abs ≡):        24513.755 s               [User: 27618.551 s, System: 1728.897 s]

    Benchmark 3: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbfilesize=16 -printtoconsole=0 -dbcache=1000
      Time (abs ≡):        24648.438 s               [User: 27925.669 s, System: 1893.671 s]

    Benchmark 4: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbfilesize=32 -printtoconsole=0 -dbcache=1000
      Time (abs ≡):        24797.871 s               [User: 27621.893 s, System: 1755.004 s]

    Benchmark 5: ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbfilesize=64 -printtoconsole=0 -dbcache=1000
      Time (abs ≡):        25078.417 s               [User: 27879.669 s, System: 2064.851 s]
    
      './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbfilesize=8 -printtoconsole=0 -dbcache=1000' ran
        1.01 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbfilesize=16 -printtoconsole=0 -dbcache=1000'
        1.01 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbfilesize=32 -printtoconsole=0 -dbcache=1000'
        1.02 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbfilesize=64 -printtoconsole=0 -dbcache=1000'
        1.32 times faster than './build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=800000 -dbfilesize=2 -printtoconsole=0 -dbcache=1000'
    

    In conclusion, it seems to me that 2 MiB is indeed too low: there is a significant jump when doubling the file size (~20-30% faster), but after that the advantage grows more slowly (8 MiB is 25% faster, 64 MiB is 30% faster and 128 MiB is 29% faster).

    Since we’re not yet sure of all the second-order effects of this change (longer compactions, more memory, migration problems, etc.), I wouldn’t yet recommend jumping to 128 MiB, but only to 8 or 16 MiB.

  25. in src/dbwrapper.cpp:150 in 3e32d23c9e outdated
    146@@ -147,6 +147,7 @@ static leveldb::Options GetOptions(size_t nCacheSize)
    147         // on corruption in later versions.
    148         options.paranoid_checks = true;
    149     }
    150+    options.max_file_size = std::max(options.max_file_size, DBWRAPPER_MAX_FILE_SIZE);
    


    l0rinc commented at 9:10 pm on October 30, 2024:

    As mentioned in the comments, it seems to me that 16 MiB may be a better default based on the measured IBDs: it is basically just as fast as 128 MiB, without having to worry about the corresponding increase in e.g. MaxGrandParentOverlapBytes and ExpandedCompactionByteSizeLimit (10x and 25x this value, respectively). The former is used e.g. in IsTrivialMove, whose comment warns that "the move could create a parent file that will require a very expensive merge later on". That (or any other such surprise) is something we likely want to avoid:

        options.max_file_size = 16 << 20;
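To see how these compaction limits scale with the file size, the arithmetic can be reproduced directly. In LevelDB's version_set.cc, both limits are derived from `max_file_size` (via `TargetFileSize()`) as 10x and 25x respectively. The helpers below only mirror that arithmetic with plain integers; they are not LevelDB's functions, which take a `const Options*`.

```cpp
#include <cstdint>

// Mirrors of LevelDB's size-derived compaction limits, taking the max
// file size in bytes directly.
constexpr std::int64_t MaxGrandParentOverlapBytes(std::int64_t max_file_size)
{
    return 10 * max_file_size;
}

constexpr std::int64_t ExpandedCompactionByteSizeLimit(std::int64_t max_file_size)
{
    return 25 * max_file_size;
}

// n MiB in bytes.
constexpr std::int64_t MiB(std::int64_t n) { return n << 20; }

static_assert(MaxGrandParentOverlapBytes(MiB(2)) == MiB(20));
static_assert(ExpandedCompactionByteSizeLimit(MiB(16)) == MiB(400));
static_assert(ExpandedCompactionByteSizeLimit(MiB(128)) == MiB(3200));
```

For 2, 16, 32 and 128 MiB files this gives grandparent-overlap limits of 20, 160, 320 and 1280 MiB, and expanded-compaction limits of 50, 400, 800 and 3200 MiB.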
    

    l0rinc commented at 7:39 pm on November 30, 2024:
    Thanks, please resolve the comment
  26. l0rinc commented at 8:54 pm on November 6, 2024: contributor

    @maciejsszmigiero, are you still working on this or should we take over?


    I can also confirm that it’s possible to switch file size values back and forth without needing a reindex. I reindexed up to block 600k with master vs 16 MiB files (instead of 128 MiB, for the reasons mentioned before).

    The LevelDB files seem to effortlessly change from ~2 MB to ~17 MB:

    • chainstate/062435.ldb - 906'412 bytes
    • chainstate/061885.ldb - 2'171'330 bytes
    • chainstate/063212.ldb - 1'936'570 bytes
    • chainstate/064711.ldb - 982'165 bytes
    • chainstate/061518.ldb - 2'171'520 bytes
    • chainstate/062708.ldb - 2'169'653 bytes
    • chainstate/061659.ldb - 2'171'631 bytes
    • chainstate/063237.ldb - 2'170'487 bytes
    • chainstate/065302.ldb - 17'347'086 bytes

    And when reverting to master, effortlessly go back:

    • chainstate/062468.ldb - 2'171'399 bytes
    • chainstate/065305.ldb - 17'358'270 bytes
    • chainstate/062543.ldb - 2'172'244 bytes
    • chainstate/065579.ldb - 2'170'605 bytes
    • chainstate/068994.ldb - 2'169'617 bytes
    • chainstate/068954.ldb - 2'170'158 bytes

    The total size on disk is basically the same, but the number of files drops considerably (which might alleviate open-file problems):

    • before 2168 files, 4'383'947'229 bytes
    • after 280 files, 4'386'553'693 bytes
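File counts and totals like these can be reproduced with a short C++17 helper using std::filesystem. This sketch assumes the table files are the *.ldb files directly inside the chainstate directory. The helper and struct names are hypothetical, and other LevelDB files (logs, the MANIFEST) are ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>

// Number and total size of LevelDB table files in one directory.
struct TableStats {
    std::uintmax_t files = 0;
    std::uintmax_t bytes = 0;
};

// Count regular files ending in ".ldb" directly inside `dir` (no recursion)
// and sum their sizes. Throws std::filesystem::filesystem_error on I/O errors.
TableStats ScanTables(const std::filesystem::path& dir)
{
    assert(std::filesystem::is_directory(dir));  // caller must pass an existing directory
    TableStats stats;
    for (const auto& entry : std::filesystem::directory_iterator(dir)) {
        if (entry.is_regular_file() && entry.path().extension() == ".ldb") {
            ++stats.files;
            stats.bytes += entry.file_size();
        }
    }
    return stats;
}
```

For example, calling `ScanTables(datadir / "chainstate")` (where `datadir` is a hypothetical path variable) and dividing `bytes` by `files` gives the average table size: about 2.0 MB before and 15.7 MB after, per the numbers above.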
  27. maciejsszmigiero commented at 9:40 pm on November 9, 2024: contributor

    @l0rinc

    are you still working on this or should we take over?

    I can obviously change the default in this PR to 16 MiB but I think having #30059 is important too: as you measured here on Oct 2 the best performing size on HDD storage actually seems to be 32 MiB.

  28. willcl-ark commented at 10:30 am on November 22, 2024: member

    I would also support slightly reducing the value in this PR, my preference though would be 32MB, for these reasons:

    • The benchmark data shows the biggest gains come from the initial increases (2MB → 4MB → 8MB)
    • There are diminishing returns after 8MB, with 128MB actually performing slightly worse than 64MB in total time in some of the benchmarks above
    • Most of the performance gains are captured by the 32MB size, esp. when including HDDs
    • Smaller files will also:
      • Be more manageable in memory-constrained environments (relevant for our current default 450MB cache)
      • Create less memory pressure during compaction operations*
      • Allow for more granular cache utilization (unclear to me if/how this affects us though, I’ve not measured this)

    * If I am understanding LevelDB compaction correctly, the 32MB filesize will use less peak memory during merge operations than a 128MB filesize, which would be useful for our resource-constrained use cases (Raspberry Pi nodes).

    The system time improvements in the benchmarks suggest we’ll get most of the benefits at 32MB, while keeping better compatibility with our default memory settings.

  29. maflcko added the label Waiting for author on Nov 22, 2024
  30. maflcko removed the label Waiting for author on Nov 22, 2024
  31. willcl-ark commented at 12:11 pm on November 26, 2024: member

    Just wanted to clarify that I’d love to see this get in sooner rather than later, as it can provide a valuable speedup and I’d be ready to ACK with a reduced (default) size selected, for the reasons I outlined above.

    We can in tandem discuss making this value configurable over in #30059.

  32. laanwj commented at 11:07 pm on November 26, 2024: member

    Just wanted to clarify that I’d love to see this get in sooner rather than later, as it can provide a valuable speedup and I’d be ready to ACK with a reduced (default) size selected, for the reasons I outlined above.

    Yes, I think there’s been enough analysis here to confirm that changing the default to 16MB or 32MB would give almost all of the gain with the least risk; let’s go for that.

  33. dbwrapper: Bump max file size to 32 MiB
    The default max file size for LevelDB is 2 MiB, which results in the
    LevelDB compaction code generating ~4 disk cache flushes per second when
    syncing with the Bitcoin network.
    These disk cache flushes are triggered by the fdatasync() syscall issued
    by the LevelDB compaction code when reaching the max file size.
    
    If the database is on an HDD, this flush rate brings the whole system to
    a crawl.
    It also results in very slow throughput since 2 MiB * 4 flushes per second
    is about 8 MiB / second max throughput, while even an old HDD can pull
    100 - 200 MiB / second streaming throughput.
    
    Increase the max file size for LevelDB to 32 MiB instead so the flush rate
    drops significantly and the system no longer gets so sluggish.
    
    The new max file size value chosen is a compromise between the one that
    works best for HDD and SSD performance, as determined by benchmarks done by
    various people.
    b73d331937
  34. maciejsszmigiero force-pushed on Nov 30, 2024
  35. maciejsszmigiero commented at 7:37 pm on November 30, 2024: contributor
    Updated the new max file size value to 32 MiB, as suggested.
  36. l0rinc commented at 7:40 pm on November 30, 2024: contributor
    ACK b73d3319377a4c9d7e2dd279c3d106002585bc36
  37. DrahtBot requested review from willcl-ark on Nov 30, 2024
  38. DrahtBot requested review from davidgumberg on Nov 30, 2024
  39. andrewtoth approved
  40. andrewtoth commented at 7:57 pm on November 30, 2024: contributor
    ACK b73d3319377a4c9d7e2dd279c3d106002585bc36
  41. TheCharlatan approved
  42. TheCharlatan commented at 8:21 pm on November 30, 2024: contributor
    ACK b73d3319377a4c9d7e2dd279c3d106002585bc36
  43. sipa commented at 8:22 pm on November 30, 2024: member
    Concept ACK. Please update the PR title and description to reflect the new size.
  44. maciejsszmigiero renamed this:
    dbwrapper: Bump LevelDB max file size to 128 MiB to avoid system slowdown from high disk cache flush rate
    dbwrapper: Bump LevelDB max file size to 32 MiB to avoid system slowdown from high disk cache flush rate
    on Nov 30, 2024
  45. willcl-ark approved
  46. willcl-ark commented at 8:24 pm on November 30, 2024: member
    ACK b73d3319377a4c9d7e2dd279c3d106002585bc36
  47. DrahtBot requested review from sipa on Nov 30, 2024
  48. tdb3 approved
  49. tdb3 commented at 8:27 pm on November 30, 2024: contributor
    ACK b73d3319377a4c9d7e2dd279c3d106002585bc36
  50. laanwj approved
  51. laanwj commented at 9:37 pm on November 30, 2024: member
    ACK b73d3319377a4c9d7e2dd279c3d106002585bc36
  52. fanquake merged this on Dec 2, 2024
  53. fanquake closed this on Dec 2, 2024


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-21 15:12 UTC
