(RFC) kernel: Replace leveldb-based BlockTreeDB with flat-file based store #32427

pull TheCharlatan wants to merge 6 commits into bitcoin:master from TheCharlatan:blocktreestore changing 23 files +1228 −106
  1. TheCharlatan commented at 2:27 pm on May 6, 2025: contributor

    Opening this PR mostly to get concept/approach feedback.

    This is motivated by the kernel library, where the internal usage of leveldb is a limiting factor to its future use cases. Specifically it is not possible to share leveldb databases between two processes. A notable use-case for the kernel library is accessing and analyzing existing block data. Currently this can only be done by first shutting down the node writing this data. Moving away from leveldb opens the door towards doing this in parallel. A flat file based approach was chosen, since the requirements for persistence here are fairly simple (no deletion, constant-size entries). The change also offers better performance by making node startup faster, and has a smaller on-disk footprint, though this is negligible in the grand scheme of things.

    The BlockTreeStore introduces a new data format for storing block indexes and headers on disk. The class is very similar to the existing CBlockTreeDB, which stores the same data in a leveldb database. Unlike CBlockTreeDB, the data stored through the BlockTreeStore is directly serialized and written to flat .dat files. The storage schema introduced is simple. It relies on the assumption that no entry is ever deleted and that no duplicate entries are written. These assumptions hold for the current users of CBlockTreeDB.

    A write-ahead log, together with boolean flags in the form of file-existence markers, ensures write atomicity. Every data entry is also given a crc32c checksum to detect data corruption.
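    [Editor's sketch] The write pattern described above can be illustrated roughly as follows. This is a minimal Python sketch, not the PR's actual C++ code: the entry layout, the helper names, and the use of zlib.crc32 in place of crc32c are all assumptions made for illustration.

    ```python
    import os
    import struct
    import zlib  # zlib.crc32 stands in for crc32c here; the PR uses crc32c

    # Hypothetical constant-size entry: 32-byte hash + 4-byte position.
    ENTRY_FMT = "<32sI"
    ENTRY_SIZE = struct.calcsize(ENTRY_FMT)

    def pack_entry(block_hash: bytes, pos: int) -> bytes:
        # Serialize the payload and append a per-entry checksum.
        payload = struct.pack(ENTRY_FMT, block_hash, pos)
        return payload + struct.pack("<I", zlib.crc32(payload))

    def unpack_entry(raw: bytes):
        # Verify the checksum before trusting the payload.
        payload, checksum = raw[:ENTRY_SIZE], raw[ENTRY_SIZE:]
        if struct.unpack("<I", checksum)[0] != zlib.crc32(payload):
            raise ValueError("corrupt entry")
        return struct.unpack(ENTRY_FMT, payload)

    def append_atomically(data_path: str, entries: list) -> None:
        # 1. Stage the entries in a write-ahead log and fsync it.
        # 2. Append them to the main file and fsync that too.
        # 3. Remove the log; a log that survives a crash would be
        #    replayed on startup (replay not shown here).
        wal_path = data_path + ".wal"
        blob = b"".join(entries)
        with open(wal_path, "wb") as wal:
            wal.write(blob)
            wal.flush()
            os.fsync(wal.fileno())
        with open(data_path, "ab") as f:
            f.write(blob)
            f.flush()
            os.fsync(f.fileno())
        os.remove(wal_path)
    ```

    Since entries are constant-size and never deleted, a torn write can only affect the tail of the file, which the checksum of the last entry (or the surviving log) detects.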

    The change also opens the door towards getting rid of having to reindex the block tree due to corruption (though switching from pruned -> un-pruned still requires extra logic). This would simplify a lot of kernel/validation code, as can be seen in this commit. The implementation here should be fairly robust against corruption (no parallel read/writes, no background compaction).

    An alternative to this pull request, that could allow the same kernel feature, would be closing and opening the leveldb database only when reading and writing. This might incur a (negligible) performance penalty, but more importantly requires careful consideration of how to handle any contentions when opening, which might have complex side effects due to our current locking mode. It would also be possible to introduce an existing database with the required features for just the block tree, but that would introduce reliance on a new dependency and come with its own tradeoffs. For these reasons I chose this approach.

  2. DrahtBot commented at 2:28 pm on May 6, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32427.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK josibake, theuni, ismaelsadeeq, marcofleon, l0rinc
    Approach ACK w0xlt

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #33274 (kernel: chainparams & headersync updates for 30.0 by fanquake)
    • #33042 (refactor: inline constant return values from dbwrapper write methods by l0rinc)
    • #31974 (Drop testnet3 by Sjors)
    • #31645 ([IBD] coins: increase default UTXO flush batch size to 32 MiB by l0rinc)
    • #30595 (kernel: Introduce initial C header API by TheCharlatan)
    • #28690 (build: Introduce internal kernel library by TheCharlatan)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    LLM Linter (✨ experimental)

    Possible typos and grammar issues:

    • In blocktreestorage.cpp comment: “// Write thefileInfo entries to the log” → “// Write the fileInfo entries to the log” [missing space between “the” and “fileInfo”]


  3. w0xlt commented at 6:12 pm on May 6, 2025: contributor

    Approach ACK

    The codebase changes seem surprisingly small for this proposal. Changing the code to reduce the dependency on LevelDB sounds good to me.

  4. shahsb commented at 3:14 am on May 7, 2025: none

    Thanks @TheCharlatan for this proposal and for making the code changes!

    Please find my review comments, suggestions and some clarifying questions:

    1. How will concurrent reads (and potentially writes) be handled with the flat file format?

    2. Even if no deletion occurs, file corruption or partial writes can happen. Are you planning mmap or memory buffering?

    3. What would be the “Corruption Recovery Strategy?” – While the write-ahead log is mentioned as future work, providing even a minimal rollback/recovery mechanism in the initial version would make this stronger.

    4. What would be the “Migration Path”? – Will there be tooling or a migration process from existing leveldb-based data?

    5. Data Integrity Guarantees? – Are checksums or hash-based verifications being added per entry or per file?

    6. Consider including a pluggable interface that allows fallback to LevelDB for testing or backwards compatibility

    7. Write contention and corruption risks: – While flat files avoid LevelDB’s process-level locking, concurrent writes require a mechanism (e.g., file locks, flock()) to prevent race conditions.

    8. Portability and Cross platform related edge cases: – Ensure file-locking mechanisms (e.g., fcntl on Unix, LockFileEx on Windows) are robust.

    9. How about Handling Large Files? – Test edge cases like file sizes approaching OS limits (e.g., 2+ GB on 32-bit systems). What would happen in such cases?

    10. On similar lines to point-5 – Corruption Detection mechanism could also be implemented to detect corruptions very early in the cycle.

    The flat-file approach is a reasonable trade-off given the simplicity of the block tree storage requirements.

    However, there are significant challenges and risks involved with this approach, as highlighted in the above 10 comments (concurrency, corruption/integrity, performance, migration, corruption detection, portability, backward compatibility, etc.).

  5. laanwj added the label UTXO Db and Indexes on May 7, 2025
  6. josibake commented at 8:57 am on May 7, 2025: member

    Concept ACK

    A notable use-case for the kernel library is accessing and analyzing existing block data

    A concrete example is index building in electrs / esplora / etc. For example, Electrs does this today by:

    1. Waiting for Bitcoin Core to finish IBD
    2. Reading all of the blocks out over JSON-RPC
    3. Parsing them and writing them into the appropriate indexes

    I started on a PoC for Electrs to use libbitcoinkernel for index building to demonstrate how this could be done much more efficiently and saw promising results. However, the requirement that Bitcoin Core be shut down before Electrs could process the block files made this approach clunky. I’ll revive this PoC as a means of testing this PR and hopefully provide some use case motivated feedback on the approach.

  7. Sjors commented at 12:58 pm on May 7, 2025: member
    Have you considered simply having one block per file? Typical blk files are 130 MB, so for “modern” blocks it would 50x the number of files. But is that actually a problem? It’s a lot simpler if we can just have $HASH.dat, maybe grouped in a directory per 10k blocks.
  8. TheCharlatan commented at 1:13 pm on May 7, 2025: contributor

    Have you considered simply having one block per file? Typical blk files are 130 MB, so for “modern” blocks it would 50x the number of files. But is that actually a problem? It’s a lot simpler if we can just have $HASH.dat, maybe grouped in a directory per 10k blocks.

    I’m not sure what you are suggesting here. Are you suggesting we create ~900k files and then have some subdivision within those files into 10k groups where each of those has a single $HASH.dat with all the headers and file pointers for the blocks in that division? Is there something we gain through that? My impression is we do the file splitting in the first place to make pruning easier. We don’t prune headers, so I don’t think splitting the file gains us anything. I don’t think there is much wrong with just having a single file. Maybe it even helps the OS a bit to manage the file buffers?

  9. Sjors commented at 1:33 pm on May 7, 2025: member

    Are you suggesting we create ~900k files

    Yes. If we’re going to redesign block storage, it seems good to wonder why we can’t let the file system handle things.

    with all the headers and file pointers for the blocks in that division

    One block per file. We can still have a single file for the block index, which would contain the header (not just the hash) and validation state. Block files themselves would be in a predictable location, so we wouldn’t need an index for that.

    My impression is we do the file splitting in the first place to make pruning easier.

    Do you mean compared to the alternative of having a single file for all blocks? I would imagine that would create I/O problems, since the operating system wouldn’t know which part of the big file changed. And it can’t defragment it.

    Having one file per block makes pruning marginally easier than now, since you don’t have to worry about keeping nearby blocks in the same file.

    One downside of what I’m suggesting is that the headers would either be stored redundantly (in the block file as well as in the index), or anyone parsing the block files has to prepend the header themselves.

  10. ryanofsky commented at 1:59 pm on May 7, 2025: contributor

    How worried are we about file corruption here? I thought the main reason we use leveldb and sqlite databases in places like this where we don’t need indexing is that they support atomic updates, so you can pull the power cord any time and next time you reboot you will see some consistent view of the data, even if it’s not the latest data. I didn’t look too closely at the implementation here but it seems like it is updating data in the files in place, so if writes are interrupted, data could be corrupt, and it’s not clear if there are even checksums in place that would detect this.

    Maybe this is not an issue for the PR, but it would be good to make clear what types of corruption BlockTreeStore can and can’t detect and what types of corruption it can recover from. If it can do simple things to detect corruption like adding checksums, or to prevent it like writing to temporary files and renaming them in place, those could be good to consider.
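    [Editor's sketch] The temporary-file-plus-rename technique mentioned above is a standard pattern; a minimal illustration (names are hypothetical, not from the PR):

    ```python
    import os

    def atomic_write(path: str, data: bytes) -> None:
        # Write the new contents to a sibling temp file first, make it
        # durable with fsync, then atomically swap it into place so a
        # reader only ever observes the old version or the new one.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    ```

    A crash before the rename leaves the old file untouched; a crash after it leaves the complete new file.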

    If this PR does introduce some increased risk of corruption, maybe that is worth it for reasons listed in the description. I also think another alternative could be to use sqlite for this since this would not necessarily introduce a new dependency and we already have a ReadKey/WriteKey/EraseKey/HasKey wrapper functions for sqlite that might help it be an easy replacement for leveldb.

  11. TheCharlatan commented at 6:44 pm on May 7, 2025: contributor

    Re #32427 (comment)

    Do you mean compared to the alternative of having a single file for all blocks? I would imagine that would create I/O problems, since the operating system wouldn’t know which part of the big file changed. And it can’t defragment it.

    Yes, that is what I meant. We never change block files, so that is not a problem. I’m also not sure how real this problem actually is. A bunch of databases just maintain one big file and have good performance doing so. I’m still not sure what the benefit of what you propose would be. Either way, I think this is a bit out of scope, since while this change implements a database migration, it does not require a reindex, which a change to the block file format would. Improving pruning behavior is also not the goal here.

  12. TheCharlatan commented at 7:05 pm on May 7, 2025: contributor

    Re #32427 (comment)

    How worried are we about file corruption here?

    I was hoping to provoke a discussion about this as I alluded to in the PR description - thanks for providing your thoughts on this. I think the proof of work and integrity checks done on loading the index already provide fairly solid guarantees on load, but agree that we should do better. I have also talked to some other people about it offline, and there seems to be some appetite for improving corruption resistance. It is my understanding that the feature for reindexing the block tree was added as a salvaging option, because leveldb does not provide strong anti-corruption guarantees, but has sprawled a bit since. Removing the need to provide code for reindexing the block tree would be a nice simplification of validation code in my eyes. I think adding a checksum for the entries and writing from a log file could be fairly simple to implement and provide strong guarantees. I’m open to suggestions here.

    I also think another alternative could be to use sqlite for this since this would not necessarily introduce a new dependency and we already have a ReadKey/WriteKey/EraseKey/HasKey wrapper functions for sqlite that might help it be an easy replacement for leveldb.

    This has also been brought up by some others. While I’m still not sure that we should be introducing a new validation dependency, maybe it would be good to implement it and open the change as an alternative draft / RFC pull request in the meantime?

  13. mzumsande commented at 9:06 pm on May 7, 2025: contributor

    A full -reindex can be necessary for two reasons:

    1. corruption in the block tree db
    2. corruption in the blk files.

    In my personal experience of running a node on crappy hardware a long time ago, it was usually 2. that would happen (I knew that because the reindex wouldn’t scan all block files but abort with an error somewhere, and switch to IBD from peers). My suspicion is that while 1. may have been the dominant reason in the early years, 2. may be just as important today.

    However, if that was the case, changing the block tree db format wouldn’t allow us to get rid of -reindex, even if the new format would never corrupt.

  14. Sjors commented at 7:01 am on May 8, 2025: member

    We never change block files, so that is not a problem. I’m also not sure how real this problem actually is.

    But we prune blocks, and they may not all be at the start of the big file.

    A bunch of databases just maintain one big file and have good performance doing so.

    Even on a spinning disk? That’s where I tend to keep my .dat files.

    I’m still not sure what the benefit of what you propose would be.

    Compared to the current situation where we bundle a bunch of, but not all, blocks in one file, it just seems simpler to have one file per block.

    In the “corruption in the blk files” example above it also makes recovery really easy: just load the block files one by one, hash them, redownload if the hash doesn’t match. No need to update any index.

  15. ryanofsky commented at 4:44 pm on May 8, 2025: contributor

    re: TheCharlatan #32427 (comment)

    Thanks for clarifying the situation with leveldb. I just assumed based on its design that it would support atomic updates pretty robustly but if it has corruption problems of its own then it doesn’t sound like we would lose much by switching to something simpler.

    I still do think using sqlite could be a nice solution because data consistency issues can be a significant source of pain for users and developers, and with sqlite we basically just don’t need to think about those issues. But I also understand not wanting to require sqlite as a kernel dependency.

    Another thing about this PR (as of fabd3ab615a7c718f37a60298a125864edb6106b) is it seems like it doesn’t actually remove much blockstorage code, and the BlockTreeDB class remains intact, I guess because migration code depends on it.

    An idea that might improve this could be to make BlockTreeStore methods pure virtual and have FlatFile and LevelDB subclasses implementing them. This could organize the code more cleanly by letting FlatFile and LevelDB implementations both live side by side outside of blockstorage instead of one being inside and one being outside. This could also let kernel applications provide alternate backends, and allow things like differential fuzz testing. (This was also the exact same approach used to replace bdb with sqlite in the wallet.)

    re: Sjors #32427 (comment)

    FWIW I also think using individual block files could be great (assuming a sharded directory structure like the .git/objects to avoid having many files per directory). That idea is mostly tangential to this PR though, I think? Possible I am missing some connections.

  16. TheCharlatan commented at 1:24 pm on May 10, 2025: contributor

    Re #32427#pullrequestreview-2823207472

    In my personal experience of running a node on crappy hardware a long time ago, it was usually 2. that would happen (I knew that because the reindex wouldn’t scan all block files but abort with an error somewhere, and switch to IBD from peers).

    That is interesting, I don’t think I’ve ever run into blk file corruption. I agree with you that if that is something we need to be able to salvage from, the reindex logic would have to remain for it. If that is the case though, shouldn’t we also be cleaning up leftover block data then? Seems like the user could otherwise end up with hundreds of GBs of unusable block data.

  17. l0rinc commented at 8:44 pm on May 14, 2025: contributor

    I didn’t have time to review this in detail - nor to form detailed concept/approach feedback - but I ran a few reindexes to see if it affects performance, because somebody was referring to this as an optimization and I wanted to understand if that’s indeed the case.

    I ran a reindex until 888,888 comparing the speed against master.

    COMMITS="14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c fabd3ab615a7c718f37a60298a125864edb6106b"; \
    STOP_HEIGHT=888888; DBCACHE=4500; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort 'command' \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g; s/-$//' <<< "$COMMITS")-$STOP_HEIGHT-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
        cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && cmake --build build -j$(nproc) --target bitcoind && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=5000 -printtoconsole=0; sleep 10" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=$DBCACHE"
    

    14b8dfb2bd Merge bitcoin/bitcoin#31398: wallet: refactor: various master key encryption cleanups
    fabd3ab615 blockstorage: Remove BlockTreeDB dead code

    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
      Time (abs ≡):        27076.605 s               [User: 32171.870 s, System: 1311.182 s]

    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)
      Time (abs ≡):        27034.197 s               [User: 32220.994 s, System: 1286.553 s]

    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)
    

    Which indicates there’s no measurable speed difference. But here the chainstate reindexing dominates, so I did one until block 1 as well.

    COMMITS="14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c fabd3ab615a7c718f37a60298a125864edb6106b"; \
    STOP_HEIGHT=1; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort 'command' \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g; s/-$//' <<< "$COMMITS")-$STOP_HEIGHT-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
        cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && cmake --build build -j$(nproc) --target bitcoind && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=5000 -printtoconsole=0; sleep 10" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=$DBCACHE"
    

    14b8dfb2bd Merge bitcoin/bitcoin#31398: wallet: refactor: various master key encryption cleanups
    fabd3ab615 blockstorage: Remove BlockTreeDB dead code

    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
      Time (abs ≡):        7718.677 s               [User: 7368.404 s, System: 174.230 s]

    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)
      Time (abs ≡):        7683.972 s               [User: 7344.276 s, System: 165.120 s]

    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)
    

    Which also indicates there’s no measurable speed difference. So at least I can confirm that - if my measurements were accurate - there doesn’t seem to be a speed regression caused by this change.

  18. theuni commented at 3:03 pm on May 15, 2025: member

    Concept ACK. Neat :)

    Maybe this is not an issue for the PR, but it would be good to make clear what types of corruption BlockTreeStore can and can’t detect and what types of corruption it can recover from. If it can do simple things to detect corruption like adding checksums, or to prevent it like writing to temporary files and renaming them in place, those could be good to consider.

    Yeah, I think this is the heart of it. I’m onboard for a new impl outside of leveldb, but before getting too deep into the implementation itself we need to decide 2 main things:

    1. Is the current block/index storage layout ideal? I think @Sjors’s one-block-per-file idea is interesting. Undo data could go in the same file without breaking any append-only guarantees. Not requiring file offset record keeping sounds nice. But what would the consequences be? Do any filesystems hate that type of dir layout? Would performance suffer due to a bajillion opens/closes?
    2. After figuring out 1, like @ryanofsky asked, what guarantees do we need to provide? Are we just protecting against power outages? Cosmic bit-flip corruption? Bad sectors? Malicious users?

    The impl here with no slicing or atomicity attempts isn’t very robust, but that’s obviously fine for an RFC.

  19. sipa commented at 3:14 pm on May 15, 2025: member

    I think a file structure of $DATADIR/blocks/[${(HEIGHT//2016)*2016}]/$HEIGHT-$HASH.dat would be a nice color for the bikeshed. That would mean typically 2016 block files per directory (if no branches appear), organized neatly per retarget period.
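    [Editor's sketch] A hypothetical helper computing the layout proposed above (the function name and signature are illustrative):

    ```python
    def block_file_path(datadir: str, height: int, block_hash: str) -> str:
        # Group blocks by retarget period: the directory name is the first
        # height of the 2016-block window the block falls into.
        period_start = (height // 2016) * 2016
        return f"{datadir}/blocks/{period_start}/{height}-{block_hash}.dat"
    ```

    So block 896819, for example, would land in the directory for period start 895104 (444 * 2016), alongside up to 2015 neighbors if no branches appear.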

    As for putting block and undo data in the same file, I’m unsure. Undo data to me feels more like a validation-level thing, while block data is more a storage-level thing.

  20. hodlinator commented at 9:31 pm on May 15, 2025: contributor

    Re: One file per block

    FWIW the idea of using one file per block gets me going too. :) It is slightly orthogonal but would be nice to avoid changing formats twice in short succession.

    My bikeshed color: Since block hashes start with zeroes, maybe one could shard based off the last two bytes:

    Genesis block ends up in something like: $DATADIR/blocks/e2/6f/000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f.dat
    Block 896819 from today ends up in: $DATADIR/blocks/84/28/000000000000000000012f13426140d43426f9db96fe9c93d3db4ebddfbf8428.dat

    Using two levels deep directories of 256 entries at each branch point means we start averaging 1000 files per leaf directory at block height 65'536'000 (a bit earlier due to re-orgs). (File system limitations: https://stackoverflow.com/a/466596).
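    [Editor's sketch] The suffix-sharding scheme above, as a hypothetical helper (names are illustrative):

    ```python
    def sharded_path(datadir: str, block_hash_hex: str) -> str:
        # Shard on the last two bytes of the displayed block hash, since
        # the leading bytes are all zeroes due to proof of work: the
        # second-to-last byte gives the first directory level, the last
        # byte the second level.
        first_level = block_hash_hex[-4:-2]
        second_level = block_hash_hex[-2:]
        return f"{datadir}/blocks/{first_level}/{second_level}/{block_hash_hex}.dat"
    ```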

    If having the height as the key is more useful than the hash, I prefer #32427 (comment).

    Possible argument against

    Spinning disks typically perform much better when sequentially accessed data is stored within the same file, so the current approach of multiple blocks per file may be more performant for some types of operations. I don’t know if or how frequently we access the contents of blocks sequentially though.

  21. davidgumberg commented at 0:57 am on May 16, 2025: contributor

    My bikeshed color: Since block hashes start with zeroes, maybe one could shard based off the last two bytes:

    Just curious because the FAT32 file limit per-directory is so small, is there any scenario where a miner could DoS nodes with this format by also mining the last two bytes?

    I think no, because at tip the additional hash needed for two bytes would be prohibitively expensive, although I’m not sure if there is a birthday-problem-like advantage because an attacker doesn’t necessarily need to target only one two-byte suffix. And for IBD, headers-first sync would prevent an attack where someone suffix-mines >65,000 blocks from genesis and tries to get nodes to download them.
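    [Editor's sketch] The back-of-the-envelope arithmetic behind this: with two bytes (16 bits) selecting the leaf directory, any candidate block hash lands in one chosen directory with probability 2**-16, so grinding blocks into a specific shard multiplies the expected hashing work per block accordingly (targeting any of k shards divides that factor by k).

    ```python
    SHARD_BITS = 16  # two bytes of the hash select the leaf directory

    def expected_extra_work_factor(k_target_shards: int = 1) -> float:
        # Expected multiplier on ordinary proof-of-work per block when the
        # attacker is happy to hit any of k_target_shards suffixes.
        return (2 ** SHARD_BITS) / k_target_shards
    ```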

  22. bitcoin deleted a comment on May 16, 2025
  23. bitcoin deleted a comment on May 16, 2025
  24. bitcoin deleted a comment on May 16, 2025
  25. bitcoin deleted a comment on May 16, 2025
  26. bitcoin deleted a comment on May 16, 2025
  27. maflcko commented at 9:44 am on May 16, 2025: member

    2. what guarantees do we need to provide? Are we just protecting against power outages? Cosmic bit-flip corruption? Bad sectors?

    I’d say ideally all of them. In the rare case where they happen, detecting them early on Bitcoin Core startup (before a validation-internal assert is hit) may help find the root cause and could also free up some developer time by making it easier to remote-diagnose hardware issues (many of them have more than 5 comments: https://github.com/bitcoin/bitcoin/issues?q=is%3Aissue%20%20memtest86). So I’d see it as a benefit if this change can provide stronger detection-checks than leveldb.

  28. theuni commented at 8:20 pm on May 16, 2025: member

    My bikeshed color: Since block hashes start with zeroes, maybe one could shard based off the last two bytes:

    Just curious because the FAT32 file limit per-directory is so small, is there any scenario where a miner could DoS nodes with this format by also mining the last two bytes?

    I think no, because at tip the additional hash needed for two bytes would be prohibitively expensive, although I’m not sure if there is a birthday-problem-like advantage because an attacker doesn’t necessarily need to target only one two-byte suffix. And for IBD, headers-first sync would prevent an attack where someone suffix-mines >65,000 blocks from genesis and tries to get nodes to download them.

    There’s also the possibility of using a local salt like we do for most other game-able data, as opposed to using the actual block hash. Block data is already xor’d with a per-node value, so doing something similar with the filenames doesn’t seem unreasonable to me. Maybe we’d even want to for the same reason we xor the data? And if already obfuscated, we could go a step further and ascii-encode to trim the filename length. Of course, if there’s no real need for that salting/obfuscation, it would just make blocks needlessly impossible to eyeball.
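    [Editor's sketch] One way the salted-filename idea could look, assuming a keyed hash of the block hash with a per-node salt (the construction and names here are purely illustrative, not anything proposed in the PR):

    ```python
    import hashlib
    import hmac

    def salted_name(node_salt: bytes, block_hash: bytes) -> str:
        # Derive the on-disk name from HMAC(salt, block_hash) so an
        # attacker who doesn't know the salt cannot grind block hashes
        # to control shard placement.
        digest = hmac.new(node_salt, block_hash, hashlib.sha256).hexdigest()
        return f"{digest[:2]}/{digest[2:4]}/{digest}.dat"
    ```

    The trade-off noted above applies: lookups stay deterministic for the node, but the files are no longer identifiable by eyeballing the directory.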

    My bikeshed color: Since block hashes start with zeroes, maybe one could shard based off the last two bytes:

    Genesis block ends up in something like: $DATADIR/blocks/e2/6f/000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f.dat Block 896819 from today ends up in: $DATADIR/blocks/84/28/000000000000000000012f13426140d43426f9db96fe9c93d3db4ebddfbf8428.dat

    Note that without the block heights as @sipa proposed, reindexing would be significantly more complicated. With the current impl the blocks on disk are going to be at least vaguely in-order, the above proposal would make them random.

  29. ismaelsadeeq commented at 10:38 pm on June 3, 2025: member

    Moving away from leveldb opens the door towards doing this in parallel.

    For this reason, I am Concept ACK. No opinion on the approach yet; I am still studying the PR and prev discussion.

    It might be a little too early for this but I’m excited and tried testing it out on Signet.

    Just by building the branch and running the node on Signet, it crashes. See logs (I didn’t start the node with the -reindex option; maybe I’m doing something wrong though): https://gist.github.com/ismaelsadeeq/18889a42b6e8bd20560198e5e6d52607

    However after the crash, starting the node again with -reindex option seems to run smoothly.

    Also, when I messed a bit with the /blocks directory, specifically by attempting to use py-bitcoinkernel to read blocks saved using this block tree store, the data got corrupted and I had to sync the node again from the genesis block.

  30. TheCharlatan commented at 7:46 am on June 4, 2025: contributor
    Thanks for giving this a try @ismaelsadeeq! I’m working on adding a write ahead log at the moment, so will draft this PR in the meantime. Bit surprised that you immediately ran into some corruption, maybe it is caused by the library attempting to still write some data like the genesis block? I think it would be good to have a test for parallel reads/writes here as well as a demo branch for the library.
  31. TheCharlatan marked this as a draft on Jun 4, 2025
  32. marcofleon commented at 11:17 am on June 5, 2025: contributor

    Concept ACK

    I’ve differentially fuzzed BlockTreeDB and BlockTreeStore for ~5000 cpu hours so far and no issues. Happy to continue testing (differentially fuzzing or otherwise) once the final approach is implemented.

  33. TheCharlatan force-pushed on Jun 9, 2025
  34. TheCharlatan force-pushed on Jun 9, 2025
  35. DrahtBot added the label CI failed on Jun 9, 2025
  36. DrahtBot commented at 8:40 pm on June 9, 2025: contributor

    🚧 At least one of the CI tasks failed. Task lint: https://github.com/bitcoin/bitcoin/runs/43761651546 LLM reason (✨ experimental): The CI failure is caused by a lint test error.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  37. TheCharlatan force-pushed on Jun 9, 2025
  38. TheCharlatan force-pushed on Jun 9, 2025
  39. TheCharlatan force-pushed on Jun 9, 2025
  40. DrahtBot removed the label CI failed on Jun 9, 2025
  41. TheCharlatan commented at 7:08 am on June 10, 2025: contributor

    The latest push updates the block tree store to use a write ahead log for atomic writes, and crc32c checksums to detect data corruption. As mentioned, taking this out of draft again.

    I did not spend too much time yet on evaluating the various proposals for reforming block storage, but I am warming up to the idea. I still think it is largely orthogonal to the work here, besides potentially needing another change to the data serialization.

  42. TheCharlatan marked this as ready for review on Jun 10, 2025
  43. bitcoin deleted a comment on Jun 28, 2025
  44. bitcoin deleted a comment on Jun 28, 2025
  45. DrahtBot added the label CI failed on Jul 6, 2025
  46. DrahtBot commented at 3:38 pm on July 6, 2025: contributor

    🚧 At least one of the CI tasks failed. Task tidy: https://github.com/bitcoin/bitcoin/runs/43764366691 LLM reason (✨ experimental): Compilation failed due to errors caused by ignoring return values of ’nodiscard’ functions, triggering compile-time errors with -Werror.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  47. TheCharlatan force-pushed on Jul 6, 2025
  48. TheCharlatan commented at 8:46 pm on July 6, 2025: contributor

    Rebased 155afe8529a611c3dcb3fb76101abd01020a24ea -> 80810b6011e30ac5ff72c43a2fbfd0e13df0c4cc (blocktreestore_0 -> blocktreestore_1, compare)

    • Fixed silent merge conflict with #29307
  49. TheCharlatan force-pushed on Jul 6, 2025
  50. DrahtBot removed the label CI failed on Jul 6, 2025
  51. DrahtBot added the label Needs rebase on Jul 8, 2025
  52. TheCharlatan force-pushed on Jul 9, 2025
  53. TheCharlatan commented at 1:27 pm on July 9, 2025: contributor

    Rebased 98a34dd55ac32f323b297bb6d77eefe096f27074 -> 254d0a75b50b0eaf91003ea8a0534981ec740090 (blocktreestore_1 -> blocktreestore_2, compare)

  54. DrahtBot removed the label Needs rebase on Jul 9, 2025
  55. in src/kernel/blocktreestorage.cpp:47 in 8858c43ee0 outdated
    42+    return data_end;
    43+}
    44+
    45+static int64_t CalculateBlockFilesPos(int nFile)
    46+{
    47+    // start position + nFile * (BLOCK_FILE_IFO_WRAPPER_SIZE + checksum)
    


    l0rinc commented at 4:10 pm on July 27, 2025:
    nit: The comment repeats the code and has a typo; consider deleting or expanding with a rationale.

    TheCharlatan commented at 2:14 pm on August 19, 2025:
    Yes, removed.
  56. in src/kernel/blocktreestorage.cpp:93 in 8858c43ee0 outdated
    88+    }
    89+
    90+    {
    91+        auto file{AutoFile{fsbridge::fopen(m_block_files_file_path, "rb")}};
    92+        if (file.IsNull()) {
    93+            throw BlockTreeStoreError(strprintf("Unable to open file %s\n", fs::PathToString(m_header_file_path)));
    


    l0rinc commented at 4:13 pm on July 27, 2025:
    0            throw BlockTreeStoreError(strprintf("Unable to open file %s\n", fs::PathToString(m_block_files_file_path)));
    

    TheCharlatan commented at 2:16 pm on August 19, 2025:
    Fixed.
  57. in src/kernel/blocktreestorage.cpp:99 in 8858c43ee0 outdated
    100+        uint32_t version;
    101+        file >> version;
    102+        if (version != BLOCK_FILES_FILE_VERSION) {
    103+            throw BlockTreeStoreError("Invalid block files file version");
    104+        }
    105+    }
    


    l0rinc commented at 4:15 pm on July 27, 2025:

    It seems we need this in multiple places, maybe we could add a sub-helper, something like:

     0void BlockTreeStore::CheckMagicAndVersion() const
     1{
     2    auto check{[](const fs::path& path, uint32_t magic_expected, uint32_t version_expected) {
     3        AutoFile file{fsbridge::fopen(path, "rb")};
     4        if (file.IsNull()) {
     5            throw BlockTreeStoreError(strprintf("Unable to open file %s", fs::PathToString(path)));
     6        }
     7        if (auto magic{ser_readdata32(file)}; magic != magic_expected) {
     8            throw BlockTreeStoreError(strprintf("Invalid magic in %s: got 0x%08x", fs::PathToString(path.filename()), magic));
     9        }
    10        if (auto version{ser_readdata32(file)}; version != version_expected) {
    11            throw BlockTreeStoreError(strprintf("Invalid version in %s: got %u", fs::PathToString(path.filename()), version));
    12        }
    13    }};
    14
    15    check(m_header_file_path, HEADER_FILE_MAGIC, HEADER_FILE_VERSION);
    16    check(m_block_files_file_path, BLOCK_FILES_FILE_MAGIC, BLOCK_FILES_FILE_VERSION);
    17}
    

    The errors would look like:

    • [error] Unable to open file …/bitcoin/demo/blocks/migration/headers.dat

    • [error] Invalid magic in headers.dat: got 0x2d5e2eb2

    • [error] Invalid version in headers.dat: got 2


    Given that we’re redoing something similar in multiple places, maybe we could extract these to static helper methods as well.


    TheCharlatan commented at 2:16 pm on August 19, 2025:
    Taken, thanks!
  58. in src/kernel/blocktreestorage.cpp:122 in 8858c43ee0 outdated
    111+      m_block_files_file_path{path / BLOCK_FILES_FILE_NAME},
    112+      m_reindex_flag_file_path{path / REINDEX_FLAG_FILE_NAME},
    113+      m_prune_flag_file_path{path / PRUNE_FLAG_FILE_NAME}
    114+{
    115+    assert(GetSerializeSize(DiskBlockIndexWrapper{}) == DISK_BLOCK_INDEX_WRAPPER_SIZE);
    116+    assert(GetSerializeSize(BlockFileInfoWrapper{}) == BLOCK_FILE_INFO_WRAPPER_SIZE);
    


    l0rinc commented at 4:16 pm on July 27, 2025:
    nit: These asserts run at runtime, consider moving to a unit test to avoid overhead.

    TheCharlatan commented at 2:17 pm on August 19, 2025:
    It would be nice if GetSerializeSize were constexpr. But yes, moving this out. EDIT: Left it for now, but leaving this unresolved.
  59. in src/kernel/blocktreestorage.cpp:124 in 8858c43ee0 outdated
    119+        fs::remove(m_header_file_path);
    120+        fs::remove(m_block_files_file_path);
    121+    }
    122+    bool header_file_exists{fs::exists(m_header_file_path)};
    123+    bool block_files_file_exists{fs::exists(m_block_files_file_path)};
    124+    if (header_file_exists ^ block_files_file_exists) {
    


    l0rinc commented at 4:16 pm on July 27, 2025:

    This may be a bit more descriptive:

    0    if (header_file_exists != block_files_file_exists) {
    

    TheCharlatan commented at 2:17 pm on August 19, 2025:
    Done.
  60. in src/kernel/blocktreestorage.cpp:68 in 8858c43ee0 outdated
    63+        return m_block_files_file_path;
    64+    case DISK_BLOCK_INDEX:
    65+    case HEADER_DATA_END:
    66+        return m_header_file_path;
    67+    }
    68+    throw BlockTreeStoreError("Unrecognized value in block tree store");
    


    l0rinc commented at 4:21 pm on July 27, 2025:

    We could include the bad value_type in these errors:

    0    default:
    1        throw BlockTreeStoreError(strprintf("Unrecognized value_type (%u) in block tree store", value_type));
    2    }
    

    TheCharlatan commented at 2:15 pm on August 19, 2025:
    Done.
  61. in src/kernel/blocktreestorage.cpp:56 in 8858c43ee0 outdated
    51+enum ValueType : uint32_t {
    52+    LAST_BLOCK,
    53+    BLOCK_FILE_INFO,
    54+    DISK_BLOCK_INDEX,
    55+    HEADER_DATA_END,
    56+};
    


    l0rinc commented at 4:23 pm on July 27, 2025:

    To make sure the serialization stays stable (e.g when somebody renames one of them and reorders for some reason):

    0enum class ValueType : uint32_t {
    1    LAST_BLOCK       = 0,
    2    BLOCK_FILE_INFO  = 1,
    3    DISK_BLOCK_INDEX = 2,
    4    HEADER_DATA_END  = 3,
    5};
    

    TheCharlatan commented at 2:15 pm on August 19, 2025:
    Yes, raw enums are also not portable. This is better, but also required a small change to allow serialization.
  62. in src/chain.h:186 in 8858c43ee0 outdated
    182@@ -161,6 +183,9 @@ class CBlockIndex
    183     //! Byte offset within rev?????.dat where this block's undo data is stored
    184     unsigned int nUndoPos GUARDED_BY(::cs_main){0};
    185 
    186+    //! Byte offset within headers.dat where this block's header data is stored
    187+    int64_t header_pos GUARDED_BY(::cs_main){0};
    


    l0rinc commented at 4:54 pm on July 27, 2025:

    How would this behave in case of a reorg when the block size differs from the previous one?

    nit: other positions were stored as unsigned


    TheCharlatan commented at 1:36 pm on August 19, 2025:
    This should not change across re-orgs. Only the Chain class mutates during re-orgs, the block index remains unchanged.
  63. in src/kernel/blocktreestorage.cpp:29 in 8858c43ee0 outdated
    24+static size_t constexpr CHECKSUM_SIZE{sizeof(uint32_t)};
    25+static size_t constexpr FILE_POSITION_SIZE{sizeof(int64_t)};
    26+
    27+static int64_t ReadHeaderFileDataEnd(AutoFile& file)
    28+{
    29+    int64_t data_end;
    


    l0rinc commented at 5:17 pm on July 27, 2025:
    we don’t have signed 64-bit reading/writing (we’re casting to unsigned and back); would it be simpler to make this unsigned in the first place?

    TheCharlatan commented at 4:38 pm on August 19, 2025:
    It would be, but we use the positions with the result of ftell, which returns a signed type, so I made them signed too.
  64. in src/kernel/blocktreestorage.cpp:37 in 8858c43ee0 outdated
    32+    file.seek(HEADER_FILE_DATA_END_POS, SEEK_SET);
    33+    file >> data_end;
    34+    data << data_end;
    35+    data << HEADER_FILE_DATA_END_POS;
    36+    uint32_t re_check = crc32c::Crc32c(UCharCast(data.data()), data.size());
    37+    file >> checksum;
    


    l0rinc commented at 5:23 pm on July 27, 2025:

    I find this reading and writing back and forth confusing. We could bring checksum and data closer to their first usage, add more info to the error, and reduce the scope of the variables we only need for validation, maybe something like:

     0static uint64_t ReadHeaderFileDataEnd(AutoFile& file)
     1{
     2    file.seek(HEADER_FILE_DATA_END_POS, SEEK_SET);
     3    auto data_end{ser_readdata64(file)};
     4    if (auto checksum{ser_readdata32(file)},
     5        re_check{crc32c::Crc32c((DataStream{} << data_end << HEADER_FILE_DATA_END_POS).str())};
     6        re_check != checksum) {
     7        throw BlockTreeStoreError(strprintf("Header file data failed integrity check: got 0x%08x, expected 0x%08x", re_check, checksum));
     8    }
     9    return data_end;
    10}
    

    Alternatively we could also have a header data structure which is read and written and skipped atomically.


    TheCharlatan commented at 2:14 pm on August 19, 2025:
    Done.
  65. in src/chain.h:441 in 8858c43ee0 outdated
    436@@ -412,6 +437,34 @@ class CDiskBlockIndex : public CBlockIndex
    437     std::string ToString() = delete;
    438 };
    439 
    440+struct DiskBlockIndexWrapper : public CDiskBlockIndex {
    441+public:
    442+    DiskBlockIndexWrapper() = default;
    


    l0rinc commented at 5:33 pm on July 27, 2025:

    isn’t public implied?

    0struct DiskBlockIndexWrapper : CDiskBlockIndex {
    1    DiskBlockIndexWrapper() = default;
    

    TheCharlatan commented at 3:00 pm on August 19, 2025:
    Yes, changed.
  66. in src/chain.h:464 in 8858c43ee0 outdated
    459+        READWRITE(obj.nVersion);
    460+        READWRITE(obj.hashPrev);
    461+        READWRITE(obj.hashMerkleRoot);
    462+        READWRITE(obj.nTime);
    463+        READWRITE(obj.nBits);
    464+        READWRITE(obj.nNonce);
    


    l0rinc commented at 5:39 pm on July 27, 2025:

    READWRITE is a variadic macro, in many other cases we’re batching these to avoid repetition, e.g. https://github.com/bitcoin/bitcoin/blob/master/src/protocol.h#L48, consider simplifying:

    0        READWRITE(obj.nHeight, obj.nStatus, obj.nTx, obj.nFile, obj.nDataPos, obj.nUndoPos, obj.header_pos);
    1        READWRITE(obj.nVersion, obj.hashPrev, obj.hashMerkleRoot, obj.nTime, obj.nBits, obj.nNonce); // block header
    

    nit: wouldn’t it make more sense to start with the header data (in case we need to read only that)?


    TheCharlatan commented at 3:01 pm on August 19, 2025:
    Done.
  67. in src/kernel/blocktreestorage.cpp:149 in 8858c43ee0 outdated
    144+        auto autofile{AutoFile{file}};
    145+        if (!autofile.Commit()) {
    146+            throw BlockTreeStoreError(strprintf("Failed to create header file %s\n", fs::PathToString(m_header_file_path)));
    147+        }
    148+        if (autofile.fclose() != 0) {
    149+            throw BlockTreeStoreError(strprintf("Failure when closing created header file %s\n", fs::PathToString(m_header_file_path)));
    


    l0rinc commented at 5:47 pm on July 27, 2025:
    nit: “Failed to close” (verb phrase) is more consistent with other such messages. nit2: do we really care if we can’t close after a successful commit?

    TheCharlatan commented at 2:18 pm on August 19, 2025:
    Fixed the message. I think we should error here, since a subsequent open might fail too, and then it seems better to fail earlier.
  68. in src/kernel/blocktreestorage.cpp:206 in 8858c43ee0 outdated
    201+    file << BLOCK_FILES_FILE_MAGIC;
    202+    file << BLOCK_FILES_FILE_VERSION;
    203+    file.seek(BLOCK_FILES_LAST_BLOCK_POS, SEEK_SET);
    204+    DataStream data;
    205+    data << 0;
    206+    file << std::span<std::byte>{data};
    


    l0rinc commented at 5:50 pm on July 27, 2025:

    What’s the reason for serializing this and HEADER_FILE_DATA_START_POS differently and not directly (like we do with other positions)?

    If we keep it, consider simplifying:

    0    file << std::span{data};
    

    TheCharlatan commented at 4:50 pm on August 19, 2025:
    We need the extra step to calculate the checksum. But I removed the unneeded cast and got rid of the superfluous seek (which was still there from a prior version).
  69. in src/kernel/blocktreestorage.cpp:260 in 8858c43ee0 outdated
    255+bool BlockTreeStore::ReadBlockFileInfo(int nFile, CBlockFileInfo& info)
    256+{
    257+    LOCK(m_mutex);
    258+    auto file{AutoFile{fsbridge::fopen(m_block_files_file_path, "rb")}};
    259+    if (file.IsNull()) {
    260+        throw BlockTreeStoreError(strprintf("Unable to open file %s\n", fs::PathToString(m_header_file_path)));
    


    l0rinc commented at 5:58 pm on July 27, 2025:

    As far as I can tell exceptions don’t need trailing newline:

    02025-07-27T17:57:41Z Loading block index db: last block file = 0
    12025-07-27T17:57:41Z [error] Unable to open file .../bitcoin/demo/blocks/index/headers.dat
    2
    32025-07-27T17:57:41Z : Error loading databases.
    

    (nit: the errors should likely reference m_block_files_file_path instead of m_header_file_path, but the mentioned helpers should take care of these problems)


    TheCharlatan commented at 5:15 pm on August 19, 2025:
    Added a helper for opening these files. Should be a bit clearer now. Also removed all the newlines.
  70. in src/kernel/blocktreestorage.cpp:264 in 8858c43ee0 outdated
    259+    if (file.IsNull()) {
    260+        throw BlockTreeStoreError(strprintf("Unable to open file %s\n", fs::PathToString(m_header_file_path)));
    261+    }
    262+    file.seek(CalculateBlockFilesPos(nFile), SEEK_SET);
    263+    if (file.feof()) {
    264+        // return in case the info was not found
    


    l0rinc commented at 5:59 pm on July 27, 2025:
    nit: redundant comment, we already have everything in the code to deduce this

    TheCharlatan commented at 2:20 pm on August 19, 2025:
    Removed.
  71. in src/kernel/blocktreestorage.cpp:361 in 8858c43ee0 outdated
    356+        (void)log_file.fclose();
    357+        fs::remove(m_log_file_path);
    358+        return false;
    359+    }
    360+    re_rolling_checksum = 0;
    361+    log_file.seek(4, SEEK_SET); // we already read the number of types, so skip ahead of it
    


    l0rinc commented at 6:07 pm on July 27, 2025:
    nit: can we avoid the magic number (and comment explaining it) here?

    TheCharlatan commented at 2:22 pm on August 19, 2025:
    Yes, replaced the magic number and removed the comment.
  72. in src/kernel/blocktreestorage.cpp:423 in 8858c43ee0 outdated
    435+    AssertLockHeld(::cs_main);
    436+    LOCK(m_mutex);
    437+
    438+    // Use a write-ahead log file that gets atomically flushed to the target files.
    439+
    440+    { // start log_file scope
    


    l0rinc commented at 6:08 pm on July 27, 2025:
    seems a bit unconventional to do this, looks like a code smell - can we extract to a method instead, if we need RAII?

    TheCharlatan commented at 5:25 pm on August 19, 2025:
    Mmh, not sure what to do here. Extracting a method seems a bit heavy?
  73. in src/kernel/blocktreestorage.cpp:445 in 8858c43ee0 outdated
    440+    { // start log_file scope
    441+    FILE* raw_log_file{fsbridge::fopen(m_log_file_path, "wb")};
    442+    if (!raw_log_file) {
    443+        throw BlockTreeStoreError(strprintf("Unable to open file %s\n", fs::PathToString(m_header_file_path)));
    444+    }
    445+    size_t log_file_prealloc_size{fileInfo.size() * (BLOCK_FILE_INFO_WRAPPER_SIZE + FILE_POSITION_SIZE) + blockinfo.size() * (DISK_BLOCK_INDEX_WRAPPER_SIZE + FILE_POSITION_SIZE)};
    


    l0rinc commented at 6:10 pm on July 27, 2025:

    We could extract these and reuse them, there’s a lot of repetition here:

    0constexpr size_t header_entry_size{DISK_BLOCK_INDEX_WRAPPER_SIZE + FILE_POSITION_SIZE};
    1size_t log_file_prealloc_size{fileInfo.size() * (BLOCK_FILE_INFO_WRAPPER_SIZE + FILE_POSITION_SIZE) + blockinfo.size() * header_entry_size};
    2stream.reserve(header_entry_size);
    

    TheCharlatan commented at 2:24 pm on August 19, 2025:
    Thanks, done.
  74. in src/kernel/blocktreestorage.cpp:532 in 8858c43ee0 outdated
    542+    }
    543+
    544+    } // end log_file scope
    545+
    546+    if (!ApplyLog()) {
    547+        LogError("Failed to apply write-ahead log to data files");
    


    l0rinc commented at 6:54 pm on July 27, 2025:
    not sure I fully understand when we’re returning and when we’re throwing - how come the WAL write isn’t fatal as well?

    TheCharlatan commented at 5:44 pm on August 19, 2025:
    It is fatal, returning false here triggers a FatalError further up the callstack in our code. I think I am trying to bend towards our previous method of indicating a failure through a boolean, but as you are pointing out, this is just inconsistent. I’d prefer to make this void and throw, but then I’d have to change some things in the callers of this function. Maybe once #33042 is in we can make all of these throw instead?
  75. in src/kernel/blocktreestorage.cpp:383 in 8858c43ee0 outdated
    378+        uint32_t entry_size = type_size + FILE_POSITION_SIZE;
    379+
    380+        DataStream stream;
    381+        stream.resize(entry_size);
    382+
    383+        for (uint32_t i = 0; i < num_iterations; i++) {
    


    l0rinc commented at 6:55 pm on July 27, 2025:
    nit: the outer loop is already i; nit2: ++i is more common in these cases

    TheCharlatan commented at 4:54 pm on August 19, 2025:
    Fixed.
  76. in src/kernel/blocktreestorage.cpp:394 in 8858c43ee0 outdated
    389+            uint32_t re_checksum = crc32c::Crc32c(UCharCast(stream.data()), entry_size);
    390+            re_rolling_checksum = crc32c::Extend(re_rolling_checksum, UCharCast(stream.data()), entry_size);
    391+            uint32_t checksum;
    392+            log_file >> checksum;
    393+            if (re_checksum != checksum) {
    394+                throw BlockTreeStoreError("Detected on-disk file corruption. The disk might be nearing its end of life");
    


    l0rinc commented at 7:07 pm on July 27, 2025:
    nit: as mentioned, I’d unify these checksum checks - and not sure we need to be this specific about interpreting the results, we’re usually not this friendly :)

    TheCharlatan commented at 5:40 pm on August 19, 2025:
    I couldn’t come up with a clean extraction. We could add a callback, but that seems like a smell too. The rolling checksum already gives us decent protection against missing anything during the dry run, so we also don’t need to be super defensive here.
  77. in src/node/blockstorage.cpp:67 in 254d0a75b5 outdated
    57@@ -58,15 +58,6 @@ bool BlockTreeDB::ReadBlockFileInfo(int nFile, CBlockFileInfo& info)
    58     return Read(std::make_pair(DB_BLOCK_FILES, nFile), info);
    59 }
    60 
    61-bool BlockTreeDB::WriteReindexing(bool fReindexing)
    62-{
    63-    if (fReindexing) {
    64-        return Write(DB_REINDEX_FLAG, uint8_t{'1'});
    65-    } else {
    66-        return Erase(DB_REINDEX_FLAG);
    


    l0rinc commented at 7:14 pm on July 27, 2025:
    is Erase still used after this?

    TheCharlatan commented at 5:50 pm on August 19, 2025:
    Good catch: no, and it looks like we did not even have coverage for it :/. Removed it.
  78. in src/kernel/caches.h:15 in 254d0a75b5 outdated
    10@@ -11,21 +11,16 @@
    11 
    12 //! Suggested default amount of cache reserved for the kernel (bytes)
    13 static constexpr size_t DEFAULT_KERNEL_CACHE{450_MiB};
    14-//! Max memory allocated to block tree DB specific cache (bytes)
    15-static constexpr size_t MAX_BLOCK_DB_CACHE{2_MiB};
    16 //! Max memory allocated to coin DB specific cache (bytes)
    17-static constexpr size_t MAX_COINS_DB_CACHE{8_MiB};
    18+static constexpr size_t MAX_COINS_DB_CACHE{10_MiB};
    


    l0rinc commented at 2:59 am on July 28, 2025:
    these changes might need some (commit message) explanations
  79. in src/node/blockstorage.cpp:454 in 254d0a75b5 outdated
    450@@ -478,13 +451,13 @@ bool BlockManager::LoadBlockIndex(const std::optional<uint256>& snapshot_blockha
    451 bool BlockManager::WriteBlockIndexDB()
    452 {
    453     AssertLockHeld(::cs_main);
    454-    std::vector<std::pair<int, const CBlockFileInfo*>> vFiles;
    455+    std::vector<std::pair<int, CBlockFileInfo*>> vFiles;
    


    l0rinc commented at 2:59 am on July 28, 2025:
    Is there a const-correct serialization where we don’t have to change this?

    TheCharlatan commented at 8:18 pm on August 19, 2025:
    I missed this from an earlier change. The serialization is already correct, but I did not revert back all the call sites again. Fixed.
  80. in src/init.cpp:1 in a2ff8f482c outdated


    l0rinc commented at 3:34 am on July 28, 2025:
    typo in commit message: “newly introduced BlockTreeStore
  81. in src/node/blockstorage.h:301 in 254d0a75b5 outdated
    297@@ -298,7 +298,7 @@ class BlockManager
    298      */
    299     std::multimap<CBlockIndex*, CBlockIndex*> m_blocks_unlinked;
    300 
    301-    std::unique_ptr<BlockTreeDB> m_block_tree_db GUARDED_BY(::cs_main);
    302+    std::unique_ptr<kernel::BlockTreeStore> m_block_tree_db GUARDED_BY(::cs_main);
    


    l0rinc commented at 3:41 am on July 28, 2025:

    do we still call the field tree db?

    0    std::unique_ptr<kernel::BlockTreeStore> m_block_tree_store GUARDED_BY(::cs_main);
    

    TheCharlatan commented at 2:29 pm on August 19, 2025:
    No; I just wanted to avoid this particular rename to make it clearer that the new store is really a drop-in replacement. It could then easily be done in a follow-up.
  82. in src/node/blockstorage.cpp:1221 in 254d0a75b5 outdated
    1208+        dump_blockindexes.reserve(m_block_index.size());
    1209+        for (auto& pair : m_block_index) {
    1210+            dump_blockindexes.push_back(&pair.second);
    1211+        }
    1212+
    1213+        if (!block_tree_store->WriteBatchSync(dump_files, max_blockfile_num, dump_blockindexes)) {
    


    l0rinc commented at 4:08 am on July 28, 2025:
    can we do a load from the new location to make absolutely sure we’ve indeed written the data correctly and only delete after we validate that?

    TheCharlatan commented at 8:26 pm on August 19, 2025:
    Mmh, I guess that would improve things a bit by guaranteeing that the migration, absent a failure during the rename, would be successful. There is also a race condition, since we could have a power outage after removing the old db but before the rename is complete.
  83. l0rinc commented at 5:29 pm on July 28, 2025: contributor

    Concept ACK, not yet sure about the approach

    LevelDB migration

    Moving away from unmaintained LevelDB makes sense and aligns with other modularization and optimization efforts. Whether that means switching to SQLite or a custom solution like this one is debatable, but removing LevelDB already paves the way for further migrations - which still seem somewhat taboo at this point.

    Before merging something like this, I’d be interested in seeing a full migration story (including other indexes, blocks, and the UTXO set). Otherwise, we risk supporting multiple formats indefinitely. Happy to help with that.

    Migration mechanism

    Do we need a big-bang migration, or would an on-demand, copy-on-first-touch scheme (with an optional background migrator) also work? We’d simply check the new location first, and if missing, migrate, delete the old entry, and serve from the new location - until all migration is done. This would avoid delays at startup, wouldn’t require doubling the space usage of the full index (only the entries in flight), and could be reused for future migrations. It could also allow us to perform both operations for a while, comparing that we’re always reading/writing the exact same data. Could be enabled on CI + background fuzzing for a few months before merging.

    Code structure

    The first commit currently does everything. I understand it’s a draft/RFC, but it would help reviewers if the commits told a story through small, focused steps.

    We’re also missing dedicated data structures for the new feature (with serialization, validation, etc.). Right now, the logic seems scattered across unrelated parts of the codebase.

    There’s also heavy repetition: magic/version checks, CRC validation, WAL record sizing. We could introduce these incrementally in separate commits, possibly splitting out helpers that could be reused elsewhere into separate refactor PRs.

    The tests seem to cover a lot of ground, but I didn’t see many negative cases (e.g. invalid headers), nothing with wipe_block_tree_data = true, and I’m not sure migration, pruning, and reorgs are covered.

    Questions & notes

    • I understand it’s not strictly part of this PR, but I agree with Sjors that we should consider letting the filesystem handle some of this. The downside is that OSs behave differently - some nodes could break simultaneously if we hit file handle or filesystem limits that we haven’t tested for. This assumes there even is an OS (i.e. not bare metal).
      • How much extra space would this cost (considering 32/64-bit platforms, various I/O-caching filesystems, file permission attributes, fragmentation)?
      • If we stored blocks separately, could we redownload only corrupted ones in parallel - even for missing pruned blocks?
      • We might also want to investigate block compression - it looked like we could gain ~20%, although maybe only when we already have all blocks.
      • Can we design this to avoid duplication? Currently we duplicate most block data in the chainstate index. Are we planning to make the blocks indexable, so we can locate a script by offset instead of storing it in the index (at least for non-pruned nodes)? This might be relevant if future migrations depend on how this one is structured.
    • Related to Russ’s concerns: are the integrity checks meant to guard against accidental bit-rot only, or also malicious tampering? Or do we assume physical access means full compromise?
    • I don’t yet understand pruned behavior (this may be orthogonal, feel free to ignore): what should happen if a node is asked for a block that’s about to be deleted?
    • You mentioned that “no entry is ever deleted”, but I’m not yet sure how that holds under reorgs.
  84. DrahtBot added the label CI failed on Jul 30, 2025
  85. DrahtBot removed the label CI failed on Jul 30, 2025
  86. DrahtBot added the label Needs rebase on Aug 18, 2025
  87. kernel: Add blocktreestorage module
    The BlockTreeStore introduces a new data format for storing block
    indexes and headers on disk. The class is very similar to the existing
    CBlockTreeDB, which stores the same data in a leveldb database. Unlike
    CBlockTreeDB, the data stored through the BlockTreeStore is directly
    serialized and written to flat .dat files. The storage schema introduced
    is simple. It relies on the assumption that no entry is ever
    deleted and that no duplicate entries are written. These assumptions
    hold for the current users of CBlockTreeDB.
    
    In order to efficiently update a CBlockIndex entry in the store, a new
    field is added to the class that tracks its position in the file. New
    serialization wrappers are added for both the CBlockIndex and
    CBlockFileInfo classes to avoid serializing integers as VARINT. Using
    VARINT encoding would make updating these fields impossible, since
    changing them might overwrite existing entries in the file.
    
    The new store supports atomic writes by using a write ahead log. Boolean
    flags are persisted through the (non-)existence of certain files. Data
    integrity is verified through the use of crc32c checksums on each data
    entry.
    
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    
    Also persist flags through file existence, instead of complicated
    boolean fields. This makes the flag operations atomic.
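    The file-existence flags can be sketched as follows (a hypothetical helper, not the PR's code): creating or removing a marker file is a single filesystem operation, which is what makes toggling the flag atomic compared to a read-modify-write of a boolean embedded in a data file.

    ```cpp
    #include <cassert>
    #include <cstdio>
    #include <filesystem>
    #include <fstream>

    namespace fs = std::filesystem;

    // Illustrative sketch: a persistent boolean represented by the
    // (non-)existence of a marker file.
    struct FileFlag {
        fs::path m_path;
        void Set() const { std::ofstream{m_path}; }    // create empty marker
        void Clear() const { fs::remove(m_path); }     // remove marker
        bool IsSet() const { return fs::exists(m_path); }
    };

    int main()
    {
        FileFlag flag{fs::temp_directory_path() / "demo.flag"};
        flag.Clear();
        assert(!flag.IsSet());
        flag.Set();
        assert(flag.IsSet());
        flag.Clear();
        assert(!flag.IsSet());
        std::puts("ok");
    }
    ```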
    fd6d563328
  88. fuzz: Use BlockTreeStore in block index fuzz test
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    4206667cb0
  89. blockstorage: Replace BlockTreeDB with BlockTreeStore
    This hooks up the newly introduced BlockTreeStore class to the actual
    codebase. It also adds a migration function to migrate old leveldb block
    indexes to the new format on startup.
    
    The migration first reads from leveldb (blocks/index), and writes it to
    a BlockTreeStore in a separate migration directory (blocks/migration).
    Once done, the original directory (blocks/index) is deleted and the
    migration directory renamed to the original name.
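    The directory swap described above can be sketched as follows (hypothetical paths and placeholder file contents; only the delete-then-rename sequence mirrors the description). A crash before the rename leaves the original leveldb directory intact, so the migration can simply be restarted.

    ```cpp
    #include <filesystem>
    #include <fstream>
    #include <iostream>

    namespace fs = std::filesystem;

    int main()
    {
        const fs::path blocks    = fs::temp_directory_path() / "blocks_demo";
        const fs::path index     = blocks / "index";
        const fs::path migration = blocks / "migration";

        // Stand-in for the existing leveldb directory.
        fs::create_directories(index);
        std::ofstream(index / "000001.ldb") << "leveldb data";

        // 1) Convert everything into the separate migration directory.
        fs::create_directories(migration);
        std::ofstream(migration / "headers.dat") << "flat-file data";

        // 2) Delete the original directory, rename migration into its place.
        fs::remove_all(index);
        fs::rename(migration, index);

        std::cout << (fs::exists(index / "headers.dat") ? "migrated" : "failed")
                  << '\n';
        fs::remove_all(blocks);
    }
    ```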
    
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    f7fa33d65d
  90. kernel: Remove block tree db params
    These are no longer needed after the migration to the new
    BlockTreeStore. The cache for the block tree db is also no longer
    needed, so grant what has been freed up to the coins db.
    
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    c8d5e1dd5d
  91. kernel: Add assumed header store to chainparams
    Adds constants for pre-allocating the file size of the header storage
    file in the BlockTreeStore. The chosen constants leave a bit of extra
    space beyond the actual requirement. They may be updated on every
    release, though it is also not a strict requirement to do so.
    
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    9eff9b6df5
  92. blockstorage: Remove BlockTreeDB dead code
    This is not called by anything anymore, so just remove it.
    
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    d35ceaeb46
  93. TheCharlatan force-pushed on Aug 20, 2025
  94. TheCharlatan commented at 7:52 am on August 20, 2025: contributor

    Thank you for the review @l0rinc!

    Rebased 254d0a75b50b0eaf91003ea8a0534981ec740090 -> fc07ce3718b5b8cc168ab634885e0317b9621e8c (blocktreestore_2 -> blocktreestore_3, compare)

    Updated fc07ce3718b5b8cc168ab634885e0317b9621e8c -> d35ceaeb463bc836ac4fc4bd6dd4f387647f33fb (blocktreestore_3 -> blocktreestore_4, compare)

    • The review comments are addressed inline; the change also includes some other smaller cleanups.

    Re #32427

    Before merging something like this, I’d be interested in seeing a full migration story (including other indexes, blocks, and the UTXO set). Otherwise, we risk supporting multiple formats indefinitely. Happy to help with that.

    I have no intention of moving any of the other dbs away from leveldb at this point in time. Currently this PR serves as a basis for a number of other applications that leverage its additional capability of letting block data be read in parallel by a separate process.

    Do we need a big-bang migration, or would an on-demand, copy-on-first-touch scheme (with an optional background migrator) also work?

    The current migration delay is pretty much negligible compared to normal startup times. Once migrated, startup is significantly faster.

    I’m not sure migration, pruning, and reorgs are covered.

    This still needs a functional test for the migration, but pruning should be covered through existing tests, and reorgs are not really relevant here, since they only influence the chain and not the topology of the block tree.

    You mentioned that “no entry is ever deleted”, but I’m not yet sure how that holds under reorgs.

    During a reorg we change the contents of the Chain data structure, i.e. we remove pointers to the block tree and add others back again. The topology of the block tree in m_block_index is not changed.

    What should happen if a node is asked for a block that’s about to be deleted?

    This behaviour is not changed in this PR and is already correctly handled in net_processing and with our prune locks. I think a similar approach would still be possible if we do one-block-one-file.

    are the integrity checks meant to guard against accidental bit-rot only, or also malicious tampering? Or do we assume physical access means full compromise?

    The CRC checks are only for data integrity. They should guard against data corruption, not malicious tampering.

    I have also briefly looked at how this might be related to a one-block-one-file approach. I think one possibility could be to store and read the header directly alongside the block in the same file. While it is not clear to me yet what we’d do with the rest of the CBlockIndex data, a clear downside of this is that at startup we’d have to read the headers from a million files, which is much slower than reading everything from a contiguous, single file. I tried benching this a bit and my startup times went from around six seconds (this PR) to around a minute.

  95. DrahtBot removed the label Needs rebase on Aug 20, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-09-02 15:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me