kernel: Replace leveldb-based BlockTreeDB with WAL and .dat file based store #32427

sedited commented at 2:27 PM on May 6, 2025: contributor

This is motivated by the kernel library, where the internal usage of leveldb is a limiting factor to its future use cases. Specifically it is not possible to share leveldb databases between two processes. A notable use-case for the kernel library is accessing and analyzing existing block data. Currently this can only be done by first shutting down the node writing this data. Moving away from leveldb opens the door towards doing this in parallel. A flat file based approach was chosen, since the requirements for persistence here are fairly simple (no deletion, constant-size entries). The change also offers better performance by making node startup faster, and has a smaller on-disk footprint, though this is negligible in the grand scheme of things.

The BlockTreeStore introduces a new data format for storing block indexes and headers on disk. The class is very similar to the existing CBlockTreeDB, which stores the same data in a leveldb database. Unlike CBlockTreeDB, the data stored through the BlockTreeStore is directly serialized and written to flat .dat files. The storage schema introduced is simple. It relies on the assumption that no entry is ever deleted and that no duplicate entries are written. These assumptions hold for the current users of CBlockTreeDB.
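To illustrate why constant-size, never-deleted entries keep the schema simple, here is a minimal Python sketch. It is not the PR's actual on-disk format: the record layout `RECORD` and the helpers `append_record` and `read_record` are assumptions made up for this example. Because every record has the same size, record i sits at byte offset i * RECORD.size, so no separate lookup structure is needed.

```python
import os
import struct

# Hypothetical fixed-size record: height, file number, data position
# (three little-endian uint32 values) followed by an 80-byte header.
RECORD = struct.Struct("<III80s")


def append_record(path, height, file_no, data_pos, header):
    """Append one record and return its index within the file."""
    with open(path, "ab") as f:
        f.seek(0, os.SEEK_END)
        index = f.tell() // RECORD.size
        f.write(RECORD.pack(height, file_no, data_pos, header))
    return index


def read_record(path, index):
    """Read record `index` by seeking straight to its computed offset."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        raw = f.read(RECORD.size)
    if len(raw) != RECORD.size:
        raise IndexError(f"no complete record at index {index}")
    return RECORD.unpack(raw)
```

A record is never rewritten or removed in this sketch, which is the property that lets the offset arithmetic stay valid for the lifetime of the file.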

A write-ahead log and boolean flags, implemented as file existence checks, ensure write atomicity. Every data entry also carries a crc32c checksum to detect data corruption.
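The general idea of a checksummed write-ahead log can be sketched as follows. This is a simplified illustration under stated assumptions, not the log format used in the PR: the entry layout (`HEADER`), the single-entry log file, and the names `encode_entry`, `decode_entry`, `write_batch` and `replay` are all hypothetical. CRC-32C is computed bit by bit here because the Python standard library does not ship it.

```python
import os
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Hypothetical log entry header: target offset, payload length, checksum.
HEADER = struct.Struct("<QII")


def encode_entry(offset: int, payload: bytes) -> bytes:
    checksum = crc32c(struct.pack("<Q", offset) + payload)
    return HEADER.pack(offset, len(payload), checksum) + payload


def decode_entry(raw: bytes):
    """Return (offset, payload), or None if the entry is torn or corrupt."""
    if len(raw) < HEADER.size:
        return None
    offset, length, checksum = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        return None
    if crc32c(struct.pack("<Q", offset) + payload) != checksum:
        return None
    return offset, payload


def replay(data_path: str, wal_path: str) -> bool:
    """Apply a leftover log entry (e.g. at startup), then remove the log."""
    if not os.path.exists(wal_path):
        return False
    with open(wal_path, "rb") as wal:
        entry = decode_entry(wal.read())
    if entry is not None:
        offset, payload = entry
        mode = "r+b" if os.path.exists(data_path) else "w+b"
        with open(data_path, mode) as data:
            data.seek(offset)  # writing at a fixed offset makes replay idempotent
            data.write(payload)
            data.flush()
            os.fsync(data.fileno())
    os.remove(wal_path)
    return entry is not None


def write_batch(data_path: str, wal_path: str, payload: bytes) -> None:
    """Log the batch durably first, then apply it to the data file."""
    offset = os.path.getsize(data_path) if os.path.exists(data_path) else 0
    with open(wal_path, "wb") as wal:
        wal.write(encode_entry(offset, payload))
        wal.flush()
        os.fsync(wal.fileno())
    replay(data_path, wal_path)
```

In this sketch the existence of the log file plays the role of the boolean flag: if it is present at startup, the previous write did not finish. A valid entry is re-applied at its recorded offset, and a torn entry is discarded, so the data file ends up either with the whole batch or without any of it.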

An alternative to this pull request, that could allow the same kernel feature, would be closing and opening the leveldb database only when reading and writing. This might incur a (negligible) performance penalty, but more importantly requires careful consideration of how to handle any contentions when opening, which might have complex side effects due to our current locking mode. It would also be possible to introduce an existing database with the required features for just the block tree, but that would introduce reliance on a new dependency and come with its own tradeoffs. For these reasons I chose this approach.

DrahtBot commented at 2:28 PM on May 6, 2025: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32427.

Reviews

See the guideline and AI policy for information on the review process.

Type	Reviewers
Concept ACK	theuni, ismaelsadeeq, marcofleon, l0rinc, stickies-v, HowHsu, craigraw
Approach ACK	edilmedeiros
Stale ACK	willcl-ark, w0xlt, josibake, janb84, alexanderwiederin, yuvicc

If your review is incorrectly listed, please copy-paste <code></code> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#35616 (refactor: Use u64 over size_t for all cache sizes to fix a 32-bit overflow by maflcko)
#35205 (kernel,node: clean up dbcache helpers and add kernel API by l0rinc)
#35167 (Convert check-deps.sh to python by ajtowns)
#35071 (Reindex: save progress to continue after interruption by pinheadmz)
#34132 (coins: drop error catcher, centralize fatal read handling by l0rinc)
#34075 (fees: Introduce Mempool Based Fee Estimation to reduce overestimation by ismaelsadeeq)
#33324 (blocks: add -reobfuscate-blocks argument to enable (de)obfuscating existing blocks by l0rinc)
#32554 (bench: replace embedded raw block with configurable block generator by l0rinc)
#28690 (build: Introduce internal kernel library by sedited)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible places where comparison-specific test macros should replace generic comparisons:

[src/test/blocktreestorage_tests.cpp] BOOST_CHECK_THROW(BlockTreeStore{block_tree_store_dir}, BlockTreeStoreError); -> Use BOOST_CHECK_EXCEPTION with a message/condition matcher if you want to verify the failure details, not just the exception type.
[src/test/blocktreestorage_tests.cpp] BOOST_CHECK_THROW(store->WriteBatchSync(file_infos_to_write, block_indexes_to_write), std::runtime_error); -> Prefer BOOST_CHECK_EXCEPTION(..., std::runtime_error, ...) so the expected error reason is checked too.
[src/test/blocktreestorage_tests.cpp] BOOST_CHECK_THROW(store->WriteBatchSync(file_infos_to_write, block_indexes_to_write), std::runtime_error); -> Prefer BOOST_CHECK_EXCEPTION(..., std::runtime_error, ...) so the expected error reason is checked too.
[src/test/blocktreestorage_tests.cpp] BOOST_CHECK_THROW(BlockTreeStore{block_tree_store_dir}, BlockTreeStoreError); -> Use BOOST_CHECK_EXCEPTION with a message/condition matcher if you want to verify the failure details, not just the exception type.
[src/test/blocktreestorage_tests.cpp] BOOST_CHECK_THROW(BlockTreeStore{block_tree_store_dir}, BlockTreeStoreError); -> Use BOOST_CHECK_EXCEPTION with a message/condition matcher if you want to verify the failure details, not just the exception type.
[src/test/blocktreestorage_tests.cpp] BOOST_CHECK_THROW(BlockTreeStore{block_tree_store_dir}, BlockTreeStoreError); -> Use BOOST_CHECK_EXCEPTION with a message/condition matcher if you want to verify the failure details, not just the exception type.
[test/functional/tool_bitcoin_chainstate.py] assert self.nodes[0].is_node_stopped() is False -> Use assert_equal(self.nodes[0].is_node_stopped(), False).
[test/functional/tool_bitcoin_chainstate.py] assert proc.returncode == 0 -> Use assert_equal(proc.returncode, 0).

2026-07-01 09:52:27

w0xlt commented at 6:12 PM on May 6, 2025: contributor

Approach ACK

The codebase changes seem surprisingly small for this proposal. Changing the code to reduce the dependency on LevelDB sounds good to me.

shahsb commented at 3:14 AM on May 7, 2025: none

Thanks @TheCharlatan for this proposal and for making the code changes!

Please find my review comments, suggestions and some clarifying questions:

1. How will concurrent reads (and potentially writes) be handled with the flat file format?
2. Even if no deletion occurs, file corruption or partial writes can happen. Are you planning mmap or memory buffering?
3. What would be the corruption recovery strategy? While the write-ahead log is mentioned as future work, providing even a minimal rollback/recovery mechanism in the initial version would make this stronger.
4. What would be the migration path? Will there be tooling or a migration process from existing leveldb-based data?
5. Data integrity guarantees: are checksums or hash-based verifications being added per entry or per file?
6. Consider including a pluggable interface that allows fallback to LevelDB for testing or backwards compatibility.
7. Write contention and corruption risks: while flat files avoid LevelDB's process-level locking, concurrent writes require a mechanism (e.g., file locks, flock()) to prevent race conditions.
8. Portability and cross-platform edge cases: ensure file-locking mechanisms (e.g., fcntl on Unix, LockFileEx on Windows) are robust.
9. Handling large files: test edge cases like file sizes approaching OS limits (e.g., 2+ GB on 32-bit systems). What would happen in such cases?
10. Similar to point 5: a corruption detection mechanism could also be implemented to detect corruption very early in the cycle.

The flat-file approach is a reasonable trade-off given the simplicity of the block tree storage requirements.

However, there are significant challenges and risks involved with this approach, as highlighted in the 10 comments above (concurrency, corruption/integrity, performance, migration, corruption detection, portability, backward compatibility, etc.).

laanwj added the label UTXO Db and Indexes on May 7, 2025

josibake commented at 8:57 AM on May 7, 2025: member

Concept ACK

A notable use-case for the kernel library is accessing and analyzing existing block data

A concrete example is index building in electrs / esplora / etc. For example, Electrs does this today by:

Waiting for Bitcoin Core to finish IBD
Reading all of the blocks out over JSON-RPC
Parsing them and writing them into the appropriate indexes

I started on a PoC for Electrs to use libbitcoinkernel for index building to demonstrate how this could be done much more efficiently and saw promising results. However, the requirement that Bitcoin Core be shut down before Electrs could process the block files made this approach clunky. I'll revive this PoC as a means of testing this PR and hopefully provide some use case motivated feedback on the approach.

Sjors commented at 12:58 PM on May 7, 2025: member

Have you considered simply having one block per file? Typical blk files are 130 MB, so for "modern" blocks it would 50x the number of files. But is that actually a problem? It's a lot simpler if we can just have $HASH.dat, maybe grouped in a directory per 10k blocks.

sedited commented at 1:13 PM on May 7, 2025: contributor

Have you considered simply having one block per file? Typical blk files are 130 MB, so for "modern" blocks it would 50x the number of files. But is that actually a problem? It's a lot simpler if we can just have $HASH.dat, maybe grouped in a directory per 10k blocks.

I'm not sure what you are suggesting here. Are you suggesting we create ~900k files and then have some subdivision within those files into 10k groups where each of those has a single $HASH.dat with all the headers and file pointers for the blocks in that division? Is there something we gain through that? My impression is we do the file splitting in the first place to make pruning easier. We don't prune headers, so I don't think splitting the file gains us anything. I don't think there is much wrong with just having a single file. Maybe it even helps the OS a bit to manage the file buffers?

Sjors commented at 1:33 PM on May 7, 2025: member

Are you suggesting we create ~900k files

Yes. If we're going to redesign block storage, it seems good to wonder why we can't let the file system handle things.

with all the headers and file pointers for the blocks in that division

One block per file. We can still have a single file for the block index, which would contain the header (not just the hash) and validation state. Block files themselves would be in a predictable location, so we wouldn't need an index for that.

My impression is we do the file splitting in the first place to make pruning easier.

Do you mean compared to the alternative of having a single file for all blocks? I would imagine that would create I/O problems, since the operating system wouldn't know which part of the big file changed. And it can't defragment it.

Having one file per block makes pruning marginally easier than now, since you don't have to worry about keeping nearby blocks in the same file.

One downside of what I'm suggesting is that the headers would either be stored redundantly (in the block file as well as in the index), or anyone parsing the block files has to prepend the header themselves.

ryanofsky commented at 1:59 PM on May 7, 2025: contributor

How worried are we about file corruption here? I thought the main reason we use leveldb and sqlite databases in places like this where we don't need indexing is that they support atomic updates, so you can pull the power cord any time and next time you reboot you will see some consistent view of the data, even if it's not the latest data. I didn't look too closely at the implementation here but it seems like it is updating data in the files in place, so if writes are interrupted, data could be corrupt, and it's not clear if there are even checksums in place that would detect this.

Maybe this is not an issue for the PR, but it would be good to make clear what types of corruption BlockTreeStore can and can't detect and what types of corruption it can recover from. If it can do simple things to detect corruption like adding checksums, or to prevent it like writing to temporary files and renaming them in place, those could be good to consider.
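For reference, the "write to a temporary file and rename it in place" pattern can be sketched like this in Python. The helper name `atomic_write` is made up for the example, and nothing here is taken from the PR itself.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` so readers see either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make the new contents durable first
        os.replace(tmp_path, path)  # the rename is atomic on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This works well for small files that are rewritten as a whole. It is a poor fit for a large append-only file, since every update would copy the entire file, and full durability across power loss would additionally need an fsync on the containing directory.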

If this PR does introduce some increased risk of corruption, maybe that is worth it for reasons listed in the description. I also think another alternative could be to use sqlite for this since this would not necessarily introduce a new dependency and we already have ReadKey/WriteKey/EraseKey/HasKey wrapper functions for sqlite that might help it be an easy replacement for leveldb.

sedited commented at 6:44 PM on May 7, 2025: contributor

Re #32427 (comment)

Do you mean compared to the alternative of having a single file for all blocks? I would imagine that would create I/O problems, since the operating system wouldn't know which part of the big file changed. And it can't defragment it.

Yes, that is what I meant. We never change block files, so that is not a problem. I'm also not sure how real this problem actually is. A bunch of databases just maintain one big file and have good performance doing so. I'm still not sure what the benefit of what you propose would be. Either way, I think this is a bit out of scope, since while this change implements a database migration, it does not require a reindex, which a change to the block file format would. Improving pruning behavior is also not the goal here.

sedited commented at 7:05 PM on May 7, 2025: contributor

Re #32427 (comment)

How worried are we about file corruption here?

I was hoping to provoke a discussion about this as I alluded to in the PR description - thanks for providing your thoughts on this. I think the proof of work and integrity checks done on loading the index already provide fairly solid guarantees on load, but agree that we should do better. I have also talked to some other people about it offline, and there seems to be some appetite for improving corruption resistance. It is my understanding that the feature for reindexing the block tree was added as a salvaging option, because leveldb does not provide strong anti-corruption guarantees, but has sprawled a bit since. Removing the need to provide code for reindexing the block tree would be a nice simplification of validation code in my eyes. I think adding a checksum for the entries and writing from a log file could be fairly simple to implement and provide strong guarantees. I'm open to suggestions here.

I also think another alternative could be to use sqlite for this since this would not necessarily introduce a new dependency and we already have ReadKey/WriteKey/EraseKey/HasKey wrapper functions for sqlite that might help it be an easy replacement for leveldb.

This has also been brought up by some others. While I'm still not sure that we should be introducing a new validation dependency, maybe it would be good to implement it and open the change as an alternative draft / RFC pull request in the meantime?

mzumsande commented at 9:06 PM on May 7, 2025: contributor

A full -reindex can be necessary for two reasons:

1. corruption in the block tree db
2. corruption in the blk files.

In my personal experience of running a node on crappy hardware a long time ago, it was usually 2. that would happen (I knew that because the reindex wouldn't scan all block files but abort with an error somewhere, and switch to IBD from peers). My suspicion is that while 1. may have been the dominant reason in the early years, 2. may be just as important today.

However, if that was the case, changing the block tree db format wouldn't allow us to get rid of -reindex, even if the new format would never corrupt.

Sjors commented at 7:01 AM on May 8, 2025: member

We never change block files, so that is not a problem. I'm also not sure how real this problem actually is.

But we prune blocks, and they may not all be at the start of the big file.

A bunch of databases just maintain one big file and have good performance doing so.

Even on a spinning disk? That's where I tend to keep my .dat files.

I'm still not sure what the benefit of what you propose would be.

Compared to the current situation where we bundle a bunch of, but not all, blocks in one file, it just seems simpler to have one file per block.

In the "corruption in the blk files" example above it also makes recovery really easy: just load the block files one by one, hash them, redownload if the hash doesn't match. No need to update any index.
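A rough Python sketch of that per-file check, assuming a hypothetical layout where each `$HASH.dat` file starts with the raw 80-byte block header (no magic bytes, size prefix or XOR obfuscation). The helper names are illustrative only. A complete check would also have to validate the transactions against the header's merkle root, which is omitted here.

```python
import hashlib
import os


def block_hash(header: bytes) -> str:
    """Double SHA-256 of an 80-byte header, in the usual reversed hex form."""
    if len(header) != 80:
        raise ValueError("a block header is exactly 80 bytes")
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return digest[::-1].hex()


def file_matches_name(file_name: str, contents: bytes) -> bool:
    """Hypothetical check: does `<hash>.dat` begin with a header hashing to <hash>?"""
    expected = os.path.splitext(os.path.basename(file_name))[0]
    return len(contents) >= 80 and block_hash(contents[:80]) == expected


# The mainnet genesis block header, assembled field by field.
GENESIS_HEADER = bytes.fromhex(
    "01000000"    # version
    + "00" * 32   # previous block hash
    + "3ba3edfd7a7b12b27ac72c3e67768f617fc81bc3888a51323a9fb8aa4b1e5e4a"  # merkle root
    + "29ab5f49"  # time
    + "ffff001d"  # bits
    + "1dac2b7c"  # nonce
)
```

Any file for which `file_matches_name` returns False would be a candidate for deletion and redownload, without touching an index.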

ryanofsky commented at 4:44 PM on May 8, 2025: contributor

re: TheCharlatan #32427 (comment)

Thanks for clarifying the situation with leveldb. I just assumed based on its design that it would support atomic updates pretty robustly but if it has corruption problems of its own then it doesn't sound like we would lose much by switching to something simpler.

I still do think using sqlite could be a nice solution because data consistency issues can be a significant source of pain for users and developers, and with sqlite we basically just don't need to think about those issues. But I also understand not wanting to require sqlite as a kernel dependency.

Another thing about this PR (as of fabd3ab615a7c718f37a60298a125864edb6106b) is it seems like it doesn't actually remove much blockstorage code, and the BlockTreeDB class remains intact, I guess because migration code depends on it.

An idea that might improve this could be to make BlockTreeStore methods pure virtual and have FlatFile and LevelDB subclasses implementing them. This could organize the code more cleanly by letting FlatFile and LevelDB implementations both live side-by side outside of blockstorage instead of one being inside and one being outside. This could also let kernel applications provide alternate backends, and allow things like differential fuzz testing. (This was also the exact same approach used to replace bdb with sqlite in the wallet.)
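As a language-neutral illustration of that shape (the real code would be C++ with pure virtual methods), here is a small Python sketch. The interface and its method names `write_index` and `read_all` are hypothetical and far smaller than the actual BlockTreeStore surface. Both backends are in-memory stand-ins, and `backends_agree` shows the kind of comparison a differential test could make.

```python
from abc import ABC, abstractmethod


class BlockTreeBackend(ABC):
    """Hypothetical storage interface; not the PR's API."""

    @abstractmethod
    def write_index(self, block_hash: bytes, entry: bytes) -> None:
        ...

    @abstractmethod
    def read_all(self) -> dict:
        ...


class KeyValueBackend(BlockTreeBackend):
    """Stand-in for a LevelDB-style key/value store."""

    def __init__(self):
        self._db = {}

    def write_index(self, block_hash, entry):
        self._db[block_hash] = entry

    def read_all(self):
        return dict(self._db)


class AppendOnlyBackend(BlockTreeBackend):
    """Stand-in for a flat-file store where entries are only appended."""

    def __init__(self):
        self._log = []

    def write_index(self, block_hash, entry):
        self._log.append((block_hash, entry))

    def read_all(self):
        return dict(self._log)


def backends_agree(operations, *backends) -> bool:
    """Feed the same writes to every backend and compare what they read back."""
    for block_hash, entry in operations:
        for backend in backends:
            backend.write_index(block_hash, entry)
    views = [backend.read_all() for backend in backends]
    return all(view == views[0] for view in views)
```

A fuzzer would generate the `operations` list and fail as soon as `backends_agree` returns False.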

re: Sjors #32427 (comment)

FWIW I also think using individual block files could be great (assuming a sharded directory structure like the .git/objects to avoid having many files per directory). That idea is mostly tangential to this PR though, I think? Possible I am missing some connections.

sedited commented at 1:24 PM on May 10, 2025: contributor

Re #32427#pullrequestreview-2823207472

In my personal experience of running a node on crappy hardware a long time ago, it was usually 2. that would happen (I knew that because the reindex wouldn't scan all block files but abort with an error somewhere, and switch to IBD from peers).

That is interesting, I don't think I've ever run into blk file corruption. I agree with you that if that is something we need to be able to recover from, the reindex logic would have to remain for it. If that is the case though, shouldn't we also be cleaning up leftover block data then? Seems like the user could just end up with hundreds of GBs of unusable block data otherwise.

fanquake added this to a project on May 10, 2025

github-project-automation[bot] changed the project status on May 10, 2025

l0rinc commented at 8:44 PM on May 14, 2025: contributor

I didn't have time to review this in detail - nor to form a detailed concept/approach feedback, but I ran a few reindexes to see if it affects performance because somebody was referring to this as an optimization and wanted to understand if that's indeed the case.

I ran a reindex until 888,888 comparing the speed against master.

<details> <summary>Details</summary>

COMMITS="14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c fabd3ab615a7c718f37a60298a125864edb6106b"; \
STOP_HEIGHT=888888; DBCACHE=4500; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done; echo "") && \
hyperfine \
  --sort 'command' \
  --runs 1 \
  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g; s/-$//' <<< "$COMMITS")-$STOP_HEIGHT-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && cmake --build build -j$(nproc) --target bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=5000 -printtoconsole=0; sleep 10" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=$DBCACHE"

14b8dfb2bd Merge bitcoin/bitcoin#31398: wallet: refactor: various master key encryption cleanups
fabd3ab615 blockstorage: Remove BlockTreeDB dead code

Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
  Time (abs ≡):        27076.605 s               [User: 32171.870 s, System: 1311.182 s]

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)
  Time (abs ≡):        27034.197 s               [User: 32220.994 s, System: 1286.553 s]

Relative speed comparison
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)

</details>

Which indicates there's no measurable speed difference. But here the chainstate reindexing dominates, so I did one until block 1 as well.

<details> <summary>Details</summary>

COMMITS="14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c fabd3ab615a7c718f37a60298a125864edb6106b"; \
STOP_HEIGHT=1; DBCACHE=450; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done; echo "") && \
hyperfine \
  --sort 'command' \
  --runs 1 \
  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g; s/-$//' <<< "$COMMITS")-$STOP_HEIGHT-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && cmake --build build -j$(nproc) --target bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=5000 -printtoconsole=0; sleep 10" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=$DBCACHE"

14b8dfb2bd Merge bitcoin/bitcoin#31398: wallet: refactor: various master key encryption cleanups
fabd3ab615 blockstorage: Remove BlockTreeDB dead code

Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
  Time (abs ≡):        7718.677 s               [User: 7368.404 s, System: 174.230 s]

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)
  Time (abs ≡):        7683.972 s               [User: 7344.276 s, System: 165.120 s]

Relative speed comparison
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)

</details>

Which also indicates there's no measurable speed difference. So at least I can confirm that - if my measurements were accurate - there doesn't seem to be a speed regression caused by this change.
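For anyone who wants the numbers behind "no measurable difference", the relative change can be computed directly from the wall-clock times reported above. This is only arithmetic on the two single-run measurements; the function name is made up for the example.

```python
# Wall-clock seconds from the hyperfine output: (master, this PR).
runs = {
    "reindex to height 888888": (27076.605, 27034.197),
    "reindex to height 1": (7718.677, 7683.972),
}


def relative_change_percent(master_s: float, branch_s: float) -> float:
    """Negative values mean the branch was faster than master."""
    return (branch_s - master_s) / master_s * 100.0


for label, (master_s, branch_s) in runs.items():
    print(f"{label}: {relative_change_percent(master_s, branch_s):+.2f}%")
```

The differences come out at roughly 0.16% and 0.45% in favor of the branch, which with a single run per commit is well within what run-to-run noise could explain.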

theuni commented at 3:03 PM on May 15, 2025: member

Concept ACK. Neat :)

Maybe this is not an issue for the PR, but it would be good to make clear what types of corruption BlockTreeStore can and can't detect and what types of corruption it can recover from. If it can do simple things to detect corruption like adding checksums, or to prevent it like writing to temporary files and renaming them in place, those could be good to consider.

Yeah, I think this is the heart of it. I'm onboard for a new impl outside of leveldb, but before getting too deep into the implementation itself we need to decide 2 main things:

1. Is the current block/index storage layout ideal? I think @Sjors's one-block-per-file idea is interesting. Undo data could go in the same file without breaking any append-only guarantees. Not requiring file offset record keeping sounds nice. But what would the consequences be? Do any filesystems hate that type of dir layout? Would performance suffer due to a bajillion opens/closes?
2. After figuring out 1, like @ryanofsky asked, what guarantees do we need to provide? Are we just protecting against power outages? Cosmic bit-flip corruption? Bad sectors? Malicious users?

The impl here with no slicing or atomicity attempts isn't very robust, but that's obviously fine for an RFC.

sipa commented at 3:14 PM on May 15, 2025: member

I think a file structure of $DATADIR/blocks/[${(HEIGHT//2016)*2016}]/$HEIGHT-$HASH.dat would be a nice color for the bikeshed. That would mean typically 2016 block files per directory (if no branches appear), organized neatly per retarget period.
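Spelled out as a small Python helper, under the assumption that the square brackets in the pattern are just notation and not part of the directory name (`block_path` is a made-up name):

```python
RETARGET_INTERVAL = 2016  # blocks per difficulty retarget period


def block_path(datadir: str, height: int, block_hash: str) -> str:
    """Directory is the first height of the retarget period the block is in."""
    period_start = (height // RETARGET_INTERVAL) * RETARGET_INTERVAL
    return f"{datadir}/blocks/{period_start}/{height}-{block_hash}.dat"
```

For example, height 896819 falls in the period starting at 895104 (444 * 2016), so that block would be stored under `blocks/895104/`.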

As for putting block and undo data in the same file, I'm unsure. Undo data to me feels more like a validation-level thing, while block data is more a storage-level thing.

hodlinator commented at 9:31 PM on May 15, 2025: contributor

Re: One file per block

FWIW the idea of using one file per block gets me going too. :) It is slightly orthogonal but would be nice to avoid changing formats twice in short succession.

My bikeshed color: Since block hashes start with zeroes, maybe one could shard based off the last two bytes:

Genesis block ends up in something like:
$DATADIR/blocks/e2/6f/000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f.dat
Block 896819 from today ends up in:
$DATADIR/blocks/84/28/000000000000000000012f13426140d43426f9db96fe9c93d3db4ebddfbf8428.dat

Using two levels deep directories of 256 entries at each branch point means we start averaging 1000 files per leaf directory at block height 65'536'000 (a bit earlier due to re-orgs). (File system limitations: https://stackoverflow.com/a/466596).
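The sharding rule and the directory arithmetic above can be written down as a short Python sketch. The function names are made up for the example, and the hash is taken in its usual displayed hex form.

```python
LEAF_DIRECTORIES = 256 * 256  # two levels with 256 entries each


def shard_path(datadir: str, block_hash_hex: str) -> str:
    """Shard on the last two bytes of the displayed block hash."""
    block_hash_hex = block_hash_hex.lower()
    if len(bytes.fromhex(block_hash_hex)) != 32:
        raise ValueError("expected a 32-byte block hash in hex")
    first, second = block_hash_hex[-4:-2], block_hash_hex[-2:]
    return f"{datadir}/blocks/{first}/{second}/{block_hash_hex}.dat"


def blocks_for_average(files_per_leaf: int) -> int:
    """Number of blocks at which leaf directories average `files_per_leaf` files."""
    return LEAF_DIRECTORIES * files_per_leaf
```

With 65,536 leaf directories, an average of 1000 files per leaf is reached after 65,536,000 blocks, assuming hashes spread evenly over the last two bytes.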

If having the height as the key is more useful than the hash, I prefer #32427 (comment).

Possible argument against

Spinning disks typically perform much better when sequentially accessed data is stored within the same file, so the current approach of multiple blocks per file may be more performant for some types of operations. I don't know if or how frequently we access the contents of blocks sequentially though.

davidgumberg commented at 12:57 AM on May 16, 2025: contributor

My bikeshed color: Since block hashes start with zeroes, maybe one could shard based off the last two bytes:

Just curious because the FAT32 file limit per-directory is so small, is there any scenario where a miner could DoS nodes with this format by also mining the last two bytes?

I think no, because at tip the additional hash needed for two bytes would be prohibitively expensive, although I'm not sure if there is a birthday-problem-like advantage because an attacker doesn't necessarily need to target only one two-byte suffix. And for IBD, headers-first sync would prevent an attack where someone suffix-mines >65,000 blocks from genesis and tries to get nodes to download them.

bitcoin deleted a comment on May 16, 2025

maflcko commented at 9:44 AM on May 16, 2025: member

2. what guarantees do we need to provide? Are we just protecting against power outages? Cosmic bit-flip corruption? Bad sectors?

I'd say ideally all of them. In the rare case where they happen, detecting them early on Bitcoin Core startup (before a validation-internal assert is hit) may help finding the root-cause and also could free up some developer time due to making it easier to remote-diagnose hardware issues (many of them have more than 5 comments: https://github.com/bitcoin/bitcoin/issues?q=is%3Aissue%20%20memtest86). So I'd see it as a benefit if this change can provide stronger detection-checks than leveldb.

theuni commented at 8:20 PM on May 16, 2025: member

My bikeshed color: Since block hashes start with zeroes, maybe one could shard based off the last two bytes:

Just curious because the FAT32 file limit per-directory is so small, is there any scenario where a miner could DoS nodes with this format by also mining the last two bytes?

I think no, because at tip the additional hash needed for two bytes would be prohibitively expensive, although I'm not sure if there is a birthday-problem-like advantage because an attacker doesn't necessarily need to target only one two-byte suffix. And for IBD, headers-first sync would prevent an attack where someone suffix-mines >65,000 blocks from genesis and tries to get nodes to download them.

There's also the possibility of using a local salt like we do for most other game-able data, as opposed to using the actual block hash. Block data is already xor'd with a per-node value, doing something similar with the filenames doesn't seem unreasonable to me. Maybe we'd even want to for the same reason we xor the data? And if already obfuscated, we could go a step further and ascii-encode to trim the file length. Of course, if there's no real need for that salting/obfuscation, it would just make blocks needlessly impossible to eyeball.
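One possible shape for such salted names, as a hedged Python sketch. It uses a keyed hash (HMAC-SHA256) instead of the XOR used for block data, truncates the result to 16 bytes, and encodes it with URL-safe base64. All of those choices, and the function names, are assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import os


def new_salt() -> bytes:
    """Per-node secret, generated once and stored alongside the block files."""
    return os.urandom(32)


def salted_name(salt: bytes, block_hash: bytes) -> str:
    """Derive a short, filesystem-safe file name that only this node can predict."""
    tag = hmac.new(salt, block_hash, hashlib.sha256).digest()
    # 16 bytes encode to 22 characters, versus 64 for the hex block hash.
    return base64.urlsafe_b64encode(tag[:16]).rstrip(b"=").decode("ascii")
```

Because the name depends on the node's salt, a miner cannot grind block hashes to target a particular directory on someone else's node. The cost is the one already mentioned: the files can no longer be matched to blocks by eye.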

My bikeshed color: Since block hashes start with zeroes, maybe one could shard based off the last two bytes:

Genesis block ends up in something like:
$DATADIR/blocks/e2/6f/000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f.dat
Block 896819 from today ends up in:
$DATADIR/blocks/84/28/000000000000000000012f13426140d43426f9db96fe9c93d3db4ebddfbf8428.dat

Note that without the block heights as @sipa proposed, reindexing would be significantly more complicated. With the current impl the blocks on disk are going to be at least vaguely in-order, the above proposal would make them random.

ismaelsadeeq commented at 10:38 PM on June 3, 2025: member

Moving away from leveldb opens the door towards doing this in parallel.

For this reason, I am Concept ACK. No opinion on the approach yet; I am still studying the PR and prev discussion.

It might be a little too early for this but I'm excited and tried testing it out on Signet.

Just by building the branch and running the node on Signet, it crashes. See logs (I didn't start the node with the -reindex option, or I might be doing something wrong): https://gist.github.com/ismaelsadeeq/18889a42b6e8bd20560198e5e6d52607

However after the crash, starting the node again with -reindex option seems to run smoothly.

Also, when I messed a bit with the /blocks directory, specifically by attempting to use py-bitcoinkernel to read blocks saved using this new block tree store, the data got corrupted and I had to sync the node again from the genesis block.

sedited commented at 7:46 AM on June 4, 2025: contributor

Thanks for giving this a try @ismaelsadeeq! I'm working on adding a write ahead log at the moment, so will draft this PR in the meantime. Bit surprised that you immediately ran into some corruption, maybe it is caused by the library attempting to still write some data like the genesis block? I think it would be good to have a test for parallel reads/writes here as well as a demo branch for the library.

sedited marked this as a draft on Jun 4, 2025

marcofleon commented at 11:17 AM on June 5, 2025: contributor

Concept ACK

I've differentially fuzzed BlockTreeDB and BlockTreeStore for ~5000 cpu hours so far and no issues. Happy to continue testing (differentially fuzzing or otherwise) once the final approach is implemented.

sedited force-pushed on Jun 9, 2025

DrahtBot added the label CI failed on Jun 9, 2025

DrahtBot commented at 8:40 PM on June 9, 2025: contributor

🚧 At least one of the CI tasks failed.
Task lint: https://github.com/bitcoin/bitcoin/runs/43761651546
LLM reason (✨ experimental): The CI failure is caused by a lint test error.

<details><summary>Hints</summary>

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

- Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
- A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
- An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

</details>

sedited force-pushed on Jun 9, 2025

DrahtBot removed the label CI failed on Jun 9, 2025

sedited commented at 7:08 AM on June 10, 2025: contributor

The latest push updates the block tree store to use a write ahead log for atomic writes, and crc32c checksums to detect data corruption. As mentioned, taking this out of draft again.
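
As a rough illustration of the idea (not the on-disk format used in this PR), the sketch below stages entries in an in-memory log together with a CRC32C per entry and applies them only after every checksum has been re-verified. LogRecord, StageEntries and the vector standing in for the data file are assumptions made for the example; ApplyLog merely borrows the name used in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Bitwise CRC32C (Castagnoli polynomial, reflected); slow but dependency-free.
uint32_t Crc32c(const std::string& data)
{
    uint32_t crc{0xFFFFFFFFu};
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit{0}; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical log record: a payload plus the checksum stored next to it.
struct LogRecord {
    std::string payload;
    uint32_t checksum;
};

// Step 1: stage every entry in the log before touching the data file.
std::vector<LogRecord> StageEntries(const std::vector<std::string>& entries)
{
    std::vector<LogRecord> log_records;
    for (const auto& entry : entries) log_records.push_back({entry, Crc32c(entry)});
    return log_records;
}

// Step 2: verify the whole log first (dry run), then apply it.
// Nothing is written to data_file if any record fails its checksum.
bool ApplyLog(const std::vector<LogRecord>& log_records, std::vector<std::string>& data_file)
{
    for (const auto& record : log_records) {
        if (Crc32c(record.payload) != record.checksum) return false;
    }
    for (const auto& record : log_records) data_file.push_back(record.payload);
    assert(data_file.size() >= log_records.size()); // every staged record was appended
    return true;
}
```

The point of the two passes is that a torn or corrupted log is rejected as a whole, so the data file is either fully updated or left untouched.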

I have not yet spent much time evaluating the various proposals for reforming block storage, but I am warming up to the idea. I still think it is largely orthogonal to the work here, besides potentially needing another change to the data serialization.

sedited marked this as ready for review on Jun 10, 2025

bitcoin deleted a comment on Jun 28, 2025

sedited changed the project status on Jul 6, 2025

DrahtBot added the label CI failed on Jul 6, 2025

DrahtBot commented at 3:38 PM on July 6, 2025: contributor

🚧 At least one of the CI tasks failed.
Task tidy: https://github.com/bitcoin/bitcoin/runs/43764366691
LLM reason (✨ experimental): Compilation failed due to errors caused by ignoring return values of 'nodiscard' functions, triggering compile-time errors with -Werror.

sedited force-pushed on Jul 6, 2025

sedited commented at 8:46 PM on July 6, 2025: contributor

Rebased 155afe8529a611c3dcb3fb76101abd01020a24ea -> 80810b6011e30ac5ff72c43a2fbfd0e13df0c4cc (blocktreestore_0 -> blocktreestore_1, compare)

Fixed silent merge conflict with #29307

sedited force-pushed on Jul 6, 2025

DrahtBot removed the label CI failed on Jul 6, 2025

DrahtBot added the label Needs rebase on Jul 8, 2025

sedited force-pushed on Jul 9, 2025

sedited commented at 1:27 PM on July 9, 2025: contributor

Rebased 98a34dd55ac32f323b297bb6d77eefe096f27074 -> 254d0a75b50b0eaf91003ea8a0534981ec740090 (blocktreestore_1 -> blocktreestore_2, compare)

Fixed conflict with #32835

DrahtBot removed the label Needs rebase on Jul 9, 2025

in src/kernel/blocktreestorage.cpp:47 in 8858c43ee0 outdated

  42 | +    return data_end;
  43 | +}
  44 | +
  45 | +static int64_t CalculateBlockFilesPos(int nFile)
  46 | +{
  47 | +    // start position + nFile * (BLOCK_FILE_IFO_WRAPPER_SIZE + checksum)

l0rinc commented at 4:10 PM on July 27, 2025:

nit: The comment repeats the code and has a typo; consider deleting or expanding with a rationale.

sedited commented at 2:14 PM on August 19, 2025:

Yes, removed.
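
For readers following along, the formula the removed comment described relies on entries being constant-size and never deleted, so entry n sits at a fixed offset. A minimal sketch with placeholder sizes (the constants below are assumptions for illustration, not the PR's actual layout):

```cpp
#include <cassert>
#include <cstdint>

// Placeholder values, chosen only to make the arithmetic visible.
constexpr int64_t ENTRIES_START_POS{16};  // bytes reserved for a file header
constexpr int64_t ENTRY_PAYLOAD_SIZE{36}; // serialized size of one entry
constexpr int64_t CHECKSUM_SIZE{sizeof(uint32_t)};

// start position + n_file * (payload size + checksum size)
constexpr int64_t EntryPos(int n_file)
{
    assert(n_file >= 0);
    return ENTRIES_START_POS + int64_t{n_file} * (ENTRY_PAYLOAD_SIZE + CHECKSUM_SIZE);
}

static_assert(EntryPos(0) == ENTRIES_START_POS);
```

Because the offset is a pure function of the index, a reader can seek straight to an entry without any lookup structure.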

in src/kernel/blocktreestorage.cpp:93 in 8858c43ee0 outdated

  88 | +    }
  89 | +
  90 | +    {
  91 | +        auto file{AutoFile{fsbridge::fopen(m_block_files_file_path, "rb")}};
  92 | +        if (file.IsNull()) {
  93 | +            throw BlockTreeStoreError(strprintf("Unable to open file %s\n", fs::PathToString(m_header_file_path)));

l0rinc commented at 4:13 PM on July 27, 2025:

            throw BlockTreeStoreError(strprintf("Unable to open file %s\n", fs::PathToString(m_block_files_file_path)));

sedited commented at 2:16 PM on August 19, 2025:

Fixed.

in src/kernel/blocktreestorage.cpp:105 in 8858c43ee0 outdated

 100 | +        uint32_t version;
 101 | +        file >> version;
 102 | +        if (version != BLOCK_FILES_FILE_VERSION) {
 103 | +            throw BlockTreeStoreError("Invalid block files file version");
 104 | +        }
 105 | +    }

l0rinc commented at 4:15 PM on July 27, 2025:

It seems we need this in multiple places, maybe we could add a sub-helper, something like:

void BlockTreeStore::CheckMagicAndVersion() const
{
    auto check{[](const fs::path& path, uint32_t magic_expected, uint32_t version_expected) {
        AutoFile file{fsbridge::fopen(path, "rb")};
        if (file.IsNull()) {
            throw BlockTreeStoreError(strprintf("Unable to open file %s", fs::PathToString(path)));
        }
        if (auto magic{ser_readdata32(file)}; magic != magic_expected) {
            throw BlockTreeStoreError(strprintf("Invalid magic in %s: got 0x%08x", fs::PathToString(path.filename()), magic));
        }
        if (auto version{ser_readdata32(file)}; version != version_expected) {
            throw BlockTreeStoreError(strprintf("Invalid version in %s: got %u", fs::PathToString(path.filename()), version));
        }
    }};

    check(m_header_file_path, HEADER_FILE_MAGIC, HEADER_FILE_VERSION);
    check(m_block_files_file_path, BLOCK_FILES_FILE_MAGIC, BLOCK_FILES_FILE_VERSION);
}

The errors would look like:

[error] Unable to open file .../bitcoin/demo/blocks/migration/headers.dat
[error] Invalid magic in headers.dat: got 0x2d5e2eb2
[error] Invalid version in headers.dat: got 2

Given that we're redoing something similar in multiple places, maybe we could extract these to static helper methods as well.

sedited commented at 2:16 PM on August 19, 2025:

Taken, thanks!

in src/kernel/blocktreestorage.cpp:116 in 8858c43ee0 outdated

 111 | +      m_block_files_file_path{path / BLOCK_FILES_FILE_NAME},
 112 | +      m_reindex_flag_file_path{path / REINDEX_FLAG_FILE_NAME},
 113 | +      m_prune_flag_file_path{path / PRUNE_FLAG_FILE_NAME}
 114 | +{
 115 | +    assert(GetSerializeSize(DiskBlockIndexWrapper{}) == DISK_BLOCK_INDEX_WRAPPER_SIZE);
 116 | +    assert(GetSerializeSize(BlockFileInfoWrapper{}) == BLOCK_FILE_INFO_WRAPPER_SIZE);

l0rinc commented at 4:16 PM on July 27, 2025:

nit: These asserts run at runtime, consider moving to a unit test to avoid overhead.

sedited commented at 2:17 PM on August 19, 2025:

It would be nice if GetSerializeSize were constexpr. But yes, moving this out. EDIT: Left it for now, but leaving this unresolved.

in src/kernel/blocktreestorage.cpp:124 in 8858c43ee0 outdated

 119 | +        fs::remove(m_header_file_path);
 120 | +        fs::remove(m_block_files_file_path);
 121 | +    }
 122 | +    bool header_file_exists{fs::exists(m_header_file_path)};
 123 | +    bool block_files_file_exists{fs::exists(m_block_files_file_path)};
 124 | +    if (header_file_exists ^ block_files_file_exists) {

l0rinc commented at 4:16 PM on July 27, 2025:

This may be a bit more descriptive:

    if (header_file_exists != block_files_file_exists) {

sedited commented at 2:17 PM on August 19, 2025:

Done.

in src/kernel/blocktreestorage.cpp:68 in 8858c43ee0 outdated

  63 | +        return m_block_files_file_path;
  64 | +    case DISK_BLOCK_INDEX:
  65 | +    case HEADER_DATA_END:
  66 | +        return m_header_file_path;
  67 | +    }
  68 | +    throw BlockTreeStoreError("Unrecognized value in block tree store");

l0rinc commented at 4:21 PM on July 27, 2025:

We could include the bad value_type in these errors:

    default:
        throw BlockTreeStoreError(strprintf("Unrecognized value_type (%u) in block tree store", value_type));
    }

sedited commented at 2:15 PM on August 19, 2025:

Done.

in src/kernel/blocktreestorage.cpp:56 in 8858c43ee0 outdated

  51 | +enum ValueType : uint32_t {
  52 | +    LAST_BLOCK,
  53 | +    BLOCK_FILE_INFO,
  54 | +    DISK_BLOCK_INDEX,
  55 | +    HEADER_DATA_END,
  56 | +};

l0rinc commented at 4:23 PM on July 27, 2025:

To make sure the serialization stays stable (e.g. when somebody renames one of them and reorders for some reason):

enum class ValueType : uint32_t {
    LAST_BLOCK       = 0,
    BLOCK_FILE_INFO  = 1,
    DISK_BLOCK_INDEX = 2,
    HEADER_DATA_END  = 3,
};

sedited commented at 2:15 PM on August 19, 2025:

Yes, raw enums are also not portable. This is better, but also required a small change to allow serialization.
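
The thread does not show the small serialization change itself. One common way to handle a scoped enum (an assumption, not necessarily what the PR does) is to convert through the underlying type when writing and to validate when reading. ToWire and FromWire are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <type_traits>

enum class ValueType : uint32_t {
    LAST_BLOCK = 0,
    BLOCK_FILE_INFO = 1,
    DISK_BLOCK_INDEX = 2,
    HEADER_DATA_END = 3,
};

// Scoped enums do not convert implicitly, so go through the underlying type.
constexpr uint32_t ToWire(ValueType type)
{
    return static_cast<std::underlying_type_t<ValueType>>(type);
}

// Reject raw values that do not name a known enumerator.
constexpr std::optional<ValueType> FromWire(uint32_t raw)
{
    if (raw <= ToWire(ValueType::HEADER_DATA_END)) return static_cast<ValueType>(raw);
    return std::nullopt;
}
```

With explicit enumerator values, reordering or renaming the entries in source no longer changes what ends up on disk.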

in src/chain.h:187 in 8858c43ee0 outdated

 182 | @@ -161,6 +183,9 @@ class CBlockIndex
 183 |      //! Byte offset within rev?????.dat where this block's undo data is stored
 184 |      unsigned int nUndoPos GUARDED_BY(::cs_main){0};
 185 |  
 186 | +    //! Byte offset within headers.dat where this block's header data is stored
 187 | +    int64_t header_pos GUARDED_BY(::cs_main){0};

l0rinc commented at 4:54 PM on July 27, 2025:

How would this behave in case of a reorg when the block size differs from the previous one?

nit: other positions were stored as unsigned

sedited commented at 1:36 PM on August 19, 2025:

This should not change across re-orgs. Only the Chain class mutates during re-orgs, the block index remains unchanged.

in src/kernel/blocktreestorage.cpp:29 in 8858c43ee0 outdated

  24 | +static size_t constexpr CHECKSUM_SIZE{sizeof(uint32_t)};
  25 | +static size_t constexpr FILE_POSITION_SIZE{sizeof(int64_t)};
  26 | +
  27 | +static int64_t ReadHeaderFileDataEnd(AutoFile& file)
  28 | +{
  29 | +    int64_t data_end;

l0rinc commented at 5:17 PM on July 27, 2025:

we don't have signed 64 bit reading/writing (we're casting to unsigned and back), would it be simpler to make this unsigned in the first place?

sedited commented at 4:38 PM on August 19, 2025:

It would be, but we use the positions with the result of ftell, which returns a signed type, so I made them signed too.

in src/kernel/blocktreestorage.cpp:37 in 8858c43ee0 outdated

  32 | +    file.seek(HEADER_FILE_DATA_END_POS, SEEK_SET);
  33 | +    file >> data_end;
  34 | +    data << data_end;
  35 | +    data << HEADER_FILE_DATA_END_POS;
  36 | +    uint32_t re_check = crc32c::Crc32c(UCharCast(data.data()), data.size());
  37 | +    file >> checksum;

l0rinc commented at 5:23 PM on July 27, 2025:

I find this reading and writing back and forth confusing. We could bring checksum and data closer to the first usage and maybe add more info to the error and reduce the scope of the variables we only need for validation, maybe something like:

static uint64_t ReadHeaderFileDataEnd(AutoFile& file)
{
    file.seek(HEADER_FILE_DATA_END_POS, SEEK_SET);
    auto data_end{ser_readdata64(file)};
    if (auto checksum{ser_readdata32(file)},
        re_check{crc32c::Crc32c((DataStream{} << data_end << HEADER_FILE_DATA_END_POS).str())};
        re_check != checksum) {
        throw BlockTreeStoreError(strprintf("Header file data failed integrity check: got 0x%08x, expected 0x%08x", re_check, checksum));
    }
    return data_end;
}

Alternatively we could also have a header data structure which is read and written and skipped atomically.

sedited commented at 2:14 PM on August 19, 2025:

Done.

in src/chain.h:442 in 8858c43ee0 outdated

 436 | @@ -412,6 +437,34 @@ class CDiskBlockIndex : public CBlockIndex
 437 |      std::string ToString() = delete;
 438 |  };
 439 |  
 440 | +struct DiskBlockIndexWrapper : public CDiskBlockIndex {
 441 | +public:
 442 | +    DiskBlockIndexWrapper() = default;

l0rinc commented at 5:33 PM on July 27, 2025:

isn't public implied?

struct DiskBlockIndexWrapper : CDiskBlockIndex {
    DiskBlockIndexWrapper() = default;

sedited commented at 3:00 PM on August 19, 2025:

Yes, changed.

in src/chain.h:464 in 8858c43ee0 outdated

 459 | +        READWRITE(obj.nVersion);
 460 | +        READWRITE(obj.hashPrev);
 461 | +        READWRITE(obj.hashMerkleRoot);
 462 | +        READWRITE(obj.nTime);
 463 | +        READWRITE(obj.nBits);
 464 | +        READWRITE(obj.nNonce);

l0rinc commented at 5:39 PM on July 27, 2025:

READWRITE is a variadic macro, in many other cases we're batching these to avoid repetition, e.g. https://github.com/bitcoin/bitcoin/blob/master/src/protocol.h#L48, consider simplifying:

        READWRITE(obj.nHeight, obj.nStatus, obj.nTx, obj.nFile, obj.nDataPos, obj.nUndoPos, obj.header_pos);
        READWRITE(obj.nVersion, obj.hashPrev, obj.hashMerkleRoot, obj.nTime, obj.nBits, obj.nNonce); // block header

nit: wouldn't it make more sense to start with the header data (in case we need to read only that)?

sedited commented at 3:01 PM on August 19, 2025:

Done.
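
As background on why batching works, a variadic serializer can expand its arguments in order with a fold expression. The sketch below is a toy, not how READWRITE is implemented in Bitcoin Core; ByteSink, WriteLE and WriteMany are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Toy byte sink standing in for a real stream class.
struct ByteSink {
    std::vector<uint8_t> bytes;
};

// Write one unsigned integer in little-endian byte order.
template <typename T>
void WriteLE(ByteSink& sink, T value)
{
    static_assert(std::is_unsigned_v<T>, "sketch only handles unsigned integers");
    for (std::size_t i{0}; i < sizeof(T); ++i) {
        sink.bytes.push_back(static_cast<uint8_t>(value >> (8 * i)));
    }
}

// Variadic front end: the comma fold serializes the arguments left to right.
template <typename... Args>
void WriteMany(ByteSink& sink, Args... args)
{
    (WriteLE(sink, args), ...);
}
```

A single WriteMany(sink, a, b, c) call then produces the same bytes as three separate calls, in the same order.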

in src/kernel/blocktreestorage.cpp:149 in 8858c43ee0 outdated

 144 | +        auto autofile{AutoFile{file}};
 145 | +        if (!autofile.Commit()) {
 146 | +            throw BlockTreeStoreError(strprintf("Failed to create header file %s\n", fs::PathToString(m_header_file_path)));
 147 | +        }
 148 | +        if (autofile.fclose() != 0) {
 149 | +            throw BlockTreeStoreError(strprintf("Failure when closing created header file %s\n", fs::PathToString(m_header_file_path)));

l0rinc commented at 5:47 PM on July 27, 2025:

nit: "Failed to close" (verb phrase) is more consistent with other such messages.

nit2: do we really care if we can't close after a successful commit?

sedited commented at 2:18 PM on August 19, 2025:

Fixed the message. I think we should error here, since a subsequent open might fail too, and then it seems better to fail earlier.

in src/kernel/blocktreestorage.cpp:206 in 8858c43ee0 outdated

 201 | +    file << BLOCK_FILES_FILE_MAGIC;
 202 | +    file << BLOCK_FILES_FILE_VERSION;
 203 | +    file.seek(BLOCK_FILES_LAST_BLOCK_POS, SEEK_SET);
 204 | +    DataStream data;
 205 | +    data << 0;
 206 | +    file << std::span<std::byte>{data};

l0rinc commented at 5:50 PM on July 27, 2025:

What's the reason for serializing this and HEADER_FILE_DATA_START_POS differently and not directly (like we do with other positions)?

If we keep it, consider simplifying:

    file << std::span{data};

sedited commented at 4:50 PM on August 19, 2025:

We need the extra step to calculate the checksum. But I removed the unneeded cast and got rid of the superfluous seek (which was still there from a prior version).

in src/kernel/blocktreestorage.cpp:260 in 8858c43ee0 outdated

 255 | +bool BlockTreeStore::ReadBlockFileInfo(int nFile, CBlockFileInfo& info)
 256 | +{
 257 | +    LOCK(m_mutex);
 258 | +    auto file{AutoFile{fsbridge::fopen(m_block_files_file_path, "rb")}};
 259 | +    if (file.IsNull()) {
 260 | +        throw BlockTreeStoreError(strprintf("Unable to open file %s\n", fs::PathToString(m_header_file_path)));

l0rinc commented at 5:58 PM on July 27, 2025:

As far as I can tell exceptions don't need trailing newline:

2025-07-27T17:57:41Z Loading block index db: last block file = 0
2025-07-27T17:57:41Z [error] Unable to open file .../bitcoin/demo/blocks/index/headers.dat

2025-07-27T17:57:41Z : Error loading databases.

(nit: the errors should likely reference m_block_files_file_path instead of m_header_file_path, but the mentioned helpers should take care of these problems)

sedited commented at 5:15 PM on August 19, 2025:

Added a helper for opening these files. Should be a bit clearer now. Also removed all the newlines.

in src/kernel/blocktreestorage.cpp:264 in 8858c43ee0 outdated

 259 | +    if (file.IsNull()) {
 260 | +        throw BlockTreeStoreError(strprintf("Unable to open file %s\n", fs::PathToString(m_header_file_path)));
 261 | +    }
 262 | +    file.seek(CalculateBlockFilesPos(nFile), SEEK_SET);
 263 | +    if (file.feof()) {
 264 | +        // return in case the info was not found

l0rinc commented at 5:59 PM on July 27, 2025:

nit: redundant comment, we already have everything in the code to deduce this

sedited commented at 2:20 PM on August 19, 2025:

Removed.

in src/kernel/blocktreestorage.cpp:361 in 8858c43ee0 outdated

 356 | +        (void)log_file.fclose();
 357 | +        fs::remove(m_log_file_path);
 358 | +        return false;
 359 | +    }
 360 | +    re_rolling_checksum = 0;
 361 | +    log_file.seek(4, SEEK_SET); // we already read the number of types, so skip ahead of it

l0rinc commented at 6:07 PM on July 27, 2025:

nit: can we avoid the magic number (and comment explaining it) here?

sedited commented at 2:22 PM on August 19, 2025:

Yes, replaced the magic number and removed the comment.

in src/kernel/blocktreestorage.cpp:440 in 8858c43ee0 outdated

 435 | +    AssertLockHeld(::cs_main);
 436 | +    LOCK(m_mutex);
 437 | +
 438 | +    // Use a write-ahead log file that gets atomically flushed to the target files.
 439 | +
 440 | +    { // start log_file scope

l0rinc commented at 6:08 PM on July 27, 2025:

seems a bit unconventional to do this, looks like a code smell - can we extract to a method instead, if we need RAII?

sedited commented at 5:25 PM on August 19, 2025:

Mmh, not sure what to do here. Extracting a method seems a bit heavy?
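
For what it's worth, the extraction can stay fairly small. Below is a sketch of the shape being suggested, with a hypothetical LogFile class that only records events so the ordering is visible; WriteLog and WriteBatch are illustrative names, not the PR's functions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical RAII handle that records when it is opened and closed.
class LogFile
{
public:
    explicit LogFile(std::vector<std::string>& events) : m_events{events} { m_events.push_back("open"); }
    ~LogFile() { m_events.push_back("close"); }
    LogFile(const LogFile&) = delete;
    LogFile& operator=(const LogFile&) = delete;
    void Write() { m_events.push_back("write"); }

private:
    std::vector<std::string>& m_events;
};

// The former "{ // start log_file scope" block becomes a function, so the
// handle is destroyed when the function returns.
void WriteLog(std::vector<std::string>& events)
{
    LogFile log_file{events};
    log_file.Write();
}

void WriteBatch(std::vector<std::string>& events)
{
    WriteLog(events);          // the log file is closed once this returns
    events.push_back("apply"); // stands in for applying the log
}
```

The function boundary guarantees the same close-before-apply ordering that the bare scope block provides.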

in src/kernel/blocktreestorage.cpp:445 in 8858c43ee0 outdated

 440 | +    { // start log_file scope
 441 | +    FILE* raw_log_file{fsbridge::fopen(m_log_file_path, "wb")};
 442 | +    if (!raw_log_file) {
 443 | +        throw BlockTreeStoreError(strprintf("Unable to open file %s\n", fs::PathToString(m_header_file_path)));
 444 | +    }
 445 | +    size_t log_file_prealloc_size{fileInfo.size() * (BLOCK_FILE_INFO_WRAPPER_SIZE + FILE_POSITION_SIZE) + blockinfo.size() * (DISK_BLOCK_INDEX_WRAPPER_SIZE + FILE_POSITION_SIZE)};

l0rinc commented at 6:10 PM on July 27, 2025:

We could extract these and reuse them, there's a lot of repetition here:

constexpr size_t header_entry_size{DISK_BLOCK_INDEX_WRAPPER_SIZE + FILE_POSITION_SIZE};
size_t log_file_prealloc_size{fileInfo.size() * (BLOCK_FILE_INFO_WRAPPER_SIZE + FILE_POSITION_SIZE) + blockinfo.size() * header_entry_size};
stream.reserve(header_entry_size);

sedited commented at 2:24 PM on August 19, 2025:

Thanks, done.

in src/kernel/blocktreestorage.cpp:547 in 8858c43ee0 outdated

 542 | +    }
 543 | +
 544 | +    } // end log_file scope
 545 | +
 546 | +    if (!ApplyLog()) {
 547 | +        LogError("Failed to apply write-ahead log to data files");

l0rinc commented at 6:54 PM on July 27, 2025:

not sure I fully understand when we're returning and when we're throwing - how come the WAL write isn't fatal as well?

sedited commented at 5:44 PM on August 19, 2025:

It is fatal, returning false here triggers a FatalError further up the callstack in our code. I think I am trying to bend towards our previous method of indicating a failure through a boolean, but as you are pointing out, this is just inconsistent. I'd prefer to make this void and throw, but then I'd have to change some things in the callers of this function. Maybe once #33042 is in we can make all of these throw instead?

in src/kernel/blocktreestorage.cpp:383 in 8858c43ee0 outdated

 378 | +        uint32_t entry_size = type_size + FILE_POSITION_SIZE;
 379 | +
 380 | +        DataStream stream;
 381 | +        stream.resize(entry_size);
 382 | +
 383 | +        for (uint32_t i = 0; i < num_iterations; i++) {

l0rinc commented at 6:55 PM on July 27, 2025:

nit: the outer loop is already i

nit2: ++i is more common in these cases

sedited commented at 4:54 PM on August 19, 2025:

Fixed.

in src/kernel/blocktreestorage.cpp:394 in 8858c43ee0 outdated

 389 | +            uint32_t re_checksum = crc32c::Crc32c(UCharCast(stream.data()), entry_size);
 390 | +            re_rolling_checksum = crc32c::Extend(re_rolling_checksum, UCharCast(stream.data()), entry_size);
 391 | +            uint32_t checksum;
 392 | +            log_file >> checksum;
 393 | +            if (re_checksum != checksum) {
 394 | +                throw BlockTreeStoreError("Detected on-disk file corruption. The disk might be nearing its end of life");

l0rinc commented at 7:07 PM on July 27, 2025:

nit: as mentioned, I'd unify these checksum checks - and not sure we need to be this specific about interpreting the results, we're usually not this friendly :)

sedited commented at 5:40 PM on August 19, 2025:

I couldn't come up with a clean extraction. We could add a callback, but seems like a smell too. We do get some decent protection through the rolling checksum from missing anything during the dry run, so we also don't need to be super defensive here.
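
The rolling checksum relies on the fact that a CRC can be continued across chunks. A minimal sketch, assuming the same semantics as Extend in the google/crc32c library, where Extend(0, data) equals the CRC32C of data; the bitwise loop is a slow stand-in for the library's optimized implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Continue a CRC32C over more data. Extend(0, data) is the CRC of data alone.
uint32_t Extend(uint32_t crc, std::string_view data)
{
    crc = ~crc; // undo the final inversion of the previous step
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit{0}; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

Feeding the entries one by one through Extend gives the same value as checksumming their concatenation, so a single trailing checksum can cover the whole log.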

in src/node/blockstorage.cpp:66 in 254d0a75b5 outdated

  57 | @@ -58,15 +58,6 @@ bool BlockTreeDB::ReadBlockFileInfo(int nFile, CBlockFileInfo& info)
  58 |      return Read(std::make_pair(DB_BLOCK_FILES, nFile), info);
  59 |  }
  60 |  
  61 | -bool BlockTreeDB::WriteReindexing(bool fReindexing)
  62 | -{
  63 | -    if (fReindexing) {
  64 | -        return Write(DB_REINDEX_FLAG, uint8_t{'1'});
  65 | -    } else {
  66 | -        return Erase(DB_REINDEX_FLAG);

l0rinc commented at 7:14 PM on July 27, 2025:

is Erase still used after this?

sedited commented at 5:50 PM on August 19, 2025:

Good catch. No, and it looks like we did not even have coverage for it :/ Removed it.

in src/kernel/caches.h:15 in 254d0a75b5 outdated

  10 | @@ -11,21 +11,16 @@
  11 |  
  12 |  //! Suggested default amount of cache reserved for the kernel (bytes)
  13 |  static constexpr size_t DEFAULT_KERNEL_CACHE{450_MiB};
  14 | -//! Max memory allocated to block tree DB specific cache (bytes)
  15 | -static constexpr size_t MAX_BLOCK_DB_CACHE{2_MiB};
  16 |  //! Max memory allocated to coin DB specific cache (bytes)
  17 | -static constexpr size_t MAX_COINS_DB_CACHE{8_MiB};
  18 | +static constexpr size_t MAX_COINS_DB_CACHE{10_MiB};

l0rinc commented at 2:59 AM on July 28, 2025:

these changes might need some (commit message) explanations

in src/node/blockstorage.cpp:454 in 254d0a75b5 outdated

 450 | @@ -478,13 +451,13 @@ bool BlockManager::LoadBlockIndex(const std::optional<uint256>& snapshot_blockha
 451 |  bool BlockManager::WriteBlockIndexDB()
 452 |  {
 453 |      AssertLockHeld(::cs_main);
 454 | -    std::vector<std::pair<int, const CBlockFileInfo*>> vFiles;
 455 | +    std::vector<std::pair<int, CBlockFileInfo*>> vFiles;

l0rinc commented at 2:59 AM on July 28, 2025:

Is there a const-correct serialization where we don't have to change this?

sedited commented at 8:18 PM on August 19, 2025:

I missed this from an earlier change. The serialization is already correct, but I did not revert back all the call sites again. Fixed.

in src/init.cpp:1 in a2ff8f482c outdated

l0rinc commented at 3:34 AM on July 28, 2025:

typo in commit message: "newly introduced BlockTreeStore

alexanderwiederin commented at 9:02 AM on June 11, 2026:

This is no longer needed I think.

stickies-v commented at 8:20 PM on June 22, 2026:

nit: I think this is now dead code? ReadLastBlockFile can no longer be out of sync with the database contents, so this second loop should never be hit.

stickies-v commented at 3:15 PM on June 25, 2026:

commit c9488cb87aaf2779cfc5df6d573bd94551da7827 message states:

Note that it explicitly does not guard log writes.

What's the intent here? This means that multiple writers can interfere with each other, i.e. each writer will delete the other's logs while they're writing to it. I think if we just acquire the store lock at the beginning of WriteBatchSync we avoid that problem, without adding any additional fs synchronization? It does of course mean writers are holding the lock for a longer time, but I think that's a good thing?

edit: as @willcl-ark pointed out, just adding an extra StoreLock would not work, because the first ApplyLog would unlock the directory after it's finished. I agree with his assessment that this is very weird behaviour. We should either not allow acquiring multiple StoreLock's, or count them and only unlock for the last destructor.

stickies-v commented at 8:13 PM on June 25, 2026:

Would be useful to add a pruned migration scenario too.

stickies-v commented at 4:32 PM on June 26, 2026:

nit: why not just move this commit before 0ca573cf98a758ec361ce7faace74b8b09568580 so we can avoid adding and removing the circular dependency (exception)? Seems like a trivial change (briefly tested).

stickies-v commented at 5:46 PM on July 2, 2026:

In 135387e63210db8f2f8dd9c977f72fc9425f1bac

nit: the commit message states:

Note that this benchmark only gives meaningful result if the directory is created in a non-tmpfs (non-ramdisk) path. Developers might be required to tweak the path for this.

I think this should be documented in the bench code, this is going to lead to confusion later on.

in src/node/blockstorage.h:301 in 254d0a75b5 outdated

 297 | @@ -298,7 +298,7 @@ class BlockManager
 298 |       */
 299 |      std::multimap<CBlockIndex*, CBlockIndex*> m_blocks_unlinked;
 300 |  
 301 | -    std::unique_ptr<BlockTreeDB> m_block_tree_db GUARDED_BY(::cs_main);
 302 | +    std::unique_ptr<kernel::BlockTreeStore> m_block_tree_db GUARDED_BY(::cs_main);

l0rinc commented at 3:41 AM on July 28, 2025:

do we still call the field tree db?

    std::unique_ptr<kernel::BlockTreeStore> m_block_tree_store GUARDED_BY(::cs_main);

sedited commented at 2:29 PM on August 19, 2025:

No, just wanted to not do this particular rename to make it clearer that the new store is really just a drop in replacement. Could then easily be done in a follow-up.

in src/node/blockstorage.cpp:1213 in 254d0a75b5 outdated

1208 | +        dump_blockindexes.reserve(m_block_index.size());
1209 | +        for (auto& pair : m_block_index) {
1210 | +            dump_blockindexes.push_back(&pair.second);
1211 | +        }
1212 | +
1213 | +        if (!block_tree_store->WriteBatchSync(dump_files, max_blockfile_num, dump_blockindexes)) {

l0rinc commented at 4:08 AM on July 28, 2025:

can we do a load from the new location to make absolutely sure we've indeed written the data correctly and only delete after we validate that?

sedited commented at 8:26 PM on August 19, 2025:

Mmh, I guess that would improve things a bit by guaranteeing that the migration succeeds, absent a failure during the rename. There is also a race condition, since we could have a power outage after removing the old db, but before the rename is complete.

l0rinc commented at 5:29 PM on July 28, 2025: contributor

Concept ACK, not yet sure about the approach

LevelDB migration

Moving away from unmaintained LevelDB makes sense and aligns with other modularization and optimization efforts. Whether that means switching to SQLite or a custom solution like this one is debatable, but removing LevelDB already paves the way for further migrations - which still seem somewhat taboo at this point.

Before merging something like this, I'd be interested in seeing a full migration story (including other indexes, blocks, and the UTXO set). Otherwise, we risk supporting multiple formats indefinitely. Happy to help with that.

Migration mechanism

Do we need a big-bang migration, or would an on-demand, copy-on-first-touch scheme (with an optional background migrator) also work? We'd simply check the new location first, and if missing, migrate, delete the old entry, and serve from the new location - until all migration is done. This would avoid delays at startup, wouldn't require doubling the space usage of the full index (only the entries in flight), and could be reused for future migrations. It could also allow us to perform both operations for a while, comparing that we're always reading/writing the exact same data. Could be enabled on CI + background fuzzing for a few months before merging.
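
To make the copy-on-first-touch idea concrete, here is a minimal in-memory sketch. The two maps stand in for the legacy database and the new store, and ReadWithMigration is a hypothetical name; a real implementation would also need to consider crash safety and concurrent access.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// In-memory stand-ins for the legacy database and the new store.
using Store = std::map<std::string, std::string>;

// Look in the new store first; on a miss, move the entry over from the old one.
std::optional<std::string> ReadWithMigration(Store& old_store, Store& new_store, const std::string& key)
{
    if (auto it{new_store.find(key)}; it != new_store.end()) return it->second;
    auto old_it{old_store.find(key)};
    if (old_it == old_store.end()) return std::nullopt;
    // Write to the new location before deleting the old entry, so an
    // interruption between the two steps never loses the value.
    auto value{old_it->second};
    new_store[key] = value;
    old_store.erase(old_it);
    return value;
}
```

Each key is migrated at most once, and later reads are served from the new store only.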

Code structure

The first commit currently does everything. I understand it's a draft/RFC, but it would help reviewers if the commits told a story through small, focused steps.

We're also missing dedicated data structures for the new feature (with serialization, validation, etc.). Right now, the logic seems scattered across unrelated parts of the codebase.

There's also heavy repetition: magic/version checks, CRC validation, WAL record sizing. We could introduce these incrementally in separate commits, possibly splitting out helpers that could be reused elsewhere into separate refactor PRs.

The tests seem to cover a lot of ground, but I didn't see many negative cases (e.g. invalid headers), nothing with wipe_block_tree_data = true, and I'm not sure migration, pruning, and reorgs are covered.

Questions & notes

I understand it's not strictly part of this PR, but I agree with Sjors that we should consider letting the filesystem handle some of this. The downside is that OSs behave differently - some nodes could break simultaneously if we hit file handle or filesystem limits that we haven't tested for. This assumes there even is an OS (i.e. not bare metal).
- How much extra space would this cost (considering 32/64-bit platforms, various I/O-caching filesystems, file permission attributes, fragmentation)?
- If we stored blocks separately, could we redownload only corrupted ones in parallel - even for missing pruned blocks?
- We might also want to investigate block compression - it looked like we could gain ~20%, although maybe only when we already have all blocks.
- Can we design this to avoid duplication? Currently we duplicate most block data in the chainstate index. Are we planning to make the blocks indexable, so we can locate a script by offset instead of storing it in the index (at least for non-pruned nodes)? This might be relevant if future migrations depend on how this one is structured.
Related to Russ's concerns: are the integrity checks meant to guard against accidental bit-rot only, or also malicious tampering? Or do we assume physical access means full compromise?
I don't yet understand pruned behavior (this may be orthogonal, feel free to ignore): what should happen if a node is asked for a block that's about to be deleted?
You mentioned that "no entry is ever deleted", but I'm not yet sure how that holds under reorgs.

DrahtBot added the label CI failed on Jul 30, 2025

DrahtBot removed the label CI failed on Jul 30, 2025

DrahtBot added the label Needs rebase on Aug 18, 2025

sedited force-pushed on Aug 20, 2025

sedited commented at 7:52 AM on August 20, 2025: contributor

Thank you for the review @l0rinc!

Rebased 254d0a75b50b0eaf91003ea8a0534981ec740090 -> fc07ce3718b5b8cc168ab634885e0317b9621e8c (blocktreestore_2 -> blocktreestore_3, compare)

Updated fc07ce3718b5b8cc168ab634885e0317b9621e8c -> d35ceaeb463bc836ac4fc4bd6dd4f387647f33fb (blocktreestore_3 -> blocktreestore_4, compare)

The review comments are addressed inline, the change also includes some other smaller cleanups.

Re #32427

Before merging something like this, I'd be interested in seeing a full migration story (including other indexes, blocks, and the UTXO set). Otherwise, we risk supporting multiple formats indefinitely. Happy to help with that.

I have no intention of moving any of the other dbs away from leveldb at this point in time. Currently this PR serves as a basis for a bunch of other applications that leverage its additional capability for allowing other applications to read block data in parallel.

Do we need a big-bang migration, or would an on-demand, copy-on-first-touch scheme (with an optional background migrator) also work?

The current migration delay is pretty much negligible compared to normal startup times. Once migrated, startup is significantly faster.

I'm not sure migration, pruning, and reorgs are covered.

This still needs a functional test for the migration, but pruning should be covered through existing tests and reorgs are not really relevant for this, since they only influence the chain and not the topology of the block tree.

You mentioned that "no entry is ever deleted", but I'm not yet sure how that holds under reorgs.

During a reorg we change the contents of the Chain data structure, i.e. we remove pointers to the block tree and add others back again. The topology of the block tree in m_block_index is not changed.
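
To make the distinction concrete, here is a minimal sketch of the idea. It assumes a toy `Tree` (standing in for `m_block_index`) and a toy chain vector (standing in for the Chain); `Node`, `Tree` and `ChainTo` are hypothetical names and not the actual Bitcoin Core types.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for CBlockIndex.
struct Node {
    std::string hash;
    const Node* prev; // parent in the block tree, nullptr for genesis
    int height;
};

// Stand-in for m_block_index: append-only, entries are never erased.
struct Tree {
    std::map<std::string, Node> nodes;

    const Node* Add(const std::string& hash, const Node* prev)
    {
        // The parent must already be part of the tree.
        assert(prev == nullptr || nodes.count(prev->hash) == 1);
        const int height{prev ? prev->height + 1 : 0};
        return &nodes.try_emplace(hash, Node{hash, prev, height}).first->second;
    }
};

// Stand-in for the Chain: pointers into the tree, indexed by height.
// Switching tips rebuilds only this vector; the tree is not modified.
std::vector<const Node*> ChainTo(const Node* tip)
{
    std::vector<const Node*> chain(tip ? static_cast<std::size_t>(tip->height) + 1 : 0);
    for (const Node* n{tip}; n != nullptr; n = n->prev) {
        chain[static_cast<std::size_t>(n->height)] = n;
    }
    return chain;
}
```

Calling `ChainTo` first with the tip of one branch and then with the tip of a longer competing branch models a reorg: the returned vector differs, while `tree.nodes.size()` stays the same, so nothing needs to be deleted from the store.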

What should happen if a node is asked for a block that's about to be deleted?

This behaviour is not changed in this PR and is already correctly handled in net_processing and with our prune locks. I think a similar approach would still be possible if we do one-block-one-file.

are the integrity checks meant to guard against accidental bit-rot only, or also malicious tampering? Or do we assume physical access means full compromise?

The CRC checks are only for data integrity. They should guard against data corruption.
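
As an illustration of what "data integrity only" means, the sketch below uses a plain bitwise CRC32C and a hypothetical `Entry` layout (payload plus stored checksum). The real on-disk entry format is not shown in this thread, so the struct and helper names are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bitwise CRC32C (Castagnoli polynomial, reflected form 0x82F63B78).
std::uint32_t Crc32c(const std::string& data)
{
    std::uint32_t crc{0xFFFFFFFFu};
    for (const unsigned char byte : data) {
        crc ^= byte;
        for (int bit{0}; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical entry layout: a payload and the checksum stored next to it.
struct Entry {
    std::string payload;
    std::uint32_t checksum;
};

Entry MakeEntry(const std::string& payload) { return Entry{payload, Crc32c(payload)}; }

// True if the stored checksum still matches the payload.
bool IsIntact(const Entry& entry) { return Crc32c(entry.payload) == entry.checksum; }
```

A flipped bit in `payload` makes `IsIntact` return false, which covers accidental corruption. The checksum is not keyed, though, so anyone who can write the file can also call `MakeEntry` on modified data and produce an entry that passes the check. That is why the CRC is no defence against tampering.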

I have also briefly looked at how this might be related to a one-block-one-file approach. I think one possibility could be to store and read the header directly alongside the block in the same file. While it is not clear to me yet what we'd do with the rest of the CBlockIndex data, a clear downside of this is that at startup we'd have to read the headers from a million files, which is much slower than reading everything from a contiguous, single file. I tried benching this a bit and my startup times went from around six seconds (this PR) to around a minute.

DrahtBot removed the label Needs rebase on Aug 20, 2025

DrahtBot added the label Needs rebase on Sep 3, 2025

sedited force-pushed on Sep 4, 2025

sedited commented at 1:26 PM on September 4, 2025: contributor

Rebased d35ceaeb463bc836ac4fc4bd6dd4f387647f33fb -> daf0e9a3d45f42889fc5895fc580c73d060d2711 (blocktreestore_4 -> blocktreestore_5, compare)

Fixed conflict with #33274

DrahtBot removed the label Needs rebase on Sep 4, 2025

DrahtBot added the label Needs rebase on Oct 31, 2025

sedited force-pushed on Nov 1, 2025

sedited commented at 1:48 PM on November 1, 2025: contributor

Rebased daf0e9a3d45f42889fc5895fc580c73d060d2711 -> 00c5a8736532a0ca3c483d300e1d09d87be948f1 (blocktreestore_5 -> blocktreestore_6, compare)

Fixed conflict with #31645

DrahtBot removed the label Needs rebase on Nov 1, 2025

sedited force-pushed on Nov 1, 2025

fanquake referenced this in commit 4da01123df on Nov 4, 2025

DrahtBot added the label Needs rebase on Nov 4, 2025

sedited force-pushed on Nov 5, 2025

sedited commented at 2:09 PM on November 5, 2025: contributor

Rebased 00c5a8736532a0ca3c483d300e1d09d87be948f1 -> a5a8bff198f9f641c10e159fa3cc51757d8f69f6 (blocktreestore_6 -> blocktreestore_7, compare)

Fixed conflict with #30595

DrahtBot removed the label Needs rebase on Nov 5, 2025

DrahtBot added the label CI failed on Nov 5, 2025

sedited force-pushed on Nov 7, 2025

DrahtBot added the label Needs rebase on Nov 11, 2025

sedited force-pushed on Nov 12, 2025

sedited commented at 1:40 PM on November 12, 2025: contributor

Rebased a5a8bff198f9f641c10e159fa3cc51757d8f69f6 -> ef1f96134098e73b47bfc988d878422917ceac1d (blocktreestore_7 -> blocktreestore_8, compare)

DrahtBot removed the label Needs rebase on Nov 12, 2025

DrahtBot added the label Needs rebase on Nov 12, 2025

sedited force-pushed on Nov 20, 2025

DrahtBot removed the label Needs rebase on Nov 20, 2025

DrahtBot removed the label CI failed on Nov 20, 2025

sedited commented at 8:20 PM on November 20, 2025: contributor

Rebased ef1f96134098e73b47bfc988d878422917ceac1d -> a2d5f75a26976ec3292606bec149ef1d325383a2 (blocktreestore_8 -> blocktreestore_9, compare)

Fixed conflict with #33724
Updated coinstatsindex compatibility test to only check upgrading the version, not downgrading.

sedited changed the project status on Dec 14, 2025

ismaelsadeeq commented at 2:15 PM on December 22, 2025: member

@sedited The PR title still has RFC on it. This seems to have a lot of C's, so is the status still in RFC or are you convinced on the approach and concept, and this is ready for review?

sedited commented at 3:42 PM on December 22, 2025: contributor

Re #32427 (comment)

are you convinced on the approach and concept, and this is ready for review?

I would appreciate some further conceptual review, particularly around whether a different approach, such as a different data structure altogether, would be more compelling, and whether the tradeoffs here are worth it. To summarize my thoughts so far:

- Getting rid of leveldb as a dependency for the block tree seems worthwhile. Clients using the kernel could then use all our existing header and block validation logic without having to rely on an external dependency.
- A single block file architecture might be workable, but I don't yet see how that would improve the block tree, or make it easier to implement this change, without a performance penalty.
- Being able to read block data in parallel from the filesystem directly still needs better proofs of concept to show that this is actually useful. However, I still find the changes here compelling enough for the first point listed here.
- Making startup faster is nice too.

in src/node/blockstorage.cpp:1210 in a2d5f75a26

1205 | +
1206 | +    {
1207 | +        // Cleanup a potentially previously failed migration
1208 | +        fs::remove_all(migration_dir);
1209 | +        LogInfo("    Writing data back to migration directory, reindexing: %b, pruned: %b", reindexing, pruned_block_files);
1210 | +        auto block_tree_store{std::make_unique<kernel::BlockTreeStore>(migration_dir, m_opts.chainparams, m_opts.wipe_block_tree_data)};

stickies-v commented at 3:27 PM on January 7, 2026:

nit: I think passing m_opts.block_tree_db_params.wipe_data here is meaningless, migration_dir is empty anyway?

in src/test/kernel/test_kernel.cpp:828 in a2d5f75a26 outdated

 824 | @@ -828,7 +825,7 @@ BOOST_AUTO_TEST_CASE(btck_chainman_in_memory_tests)
 825 |      }
 826 |  
 827 |      BOOST_CHECK(std::filesystem::exists(in_memory_test_directory.m_directory / "blocks"));
 828 | -    BOOST_CHECK(!std::filesystem::exists(in_memory_test_directory.m_directory / "blocks" / "index"));
 829 | +    BOOST_CHECK(std::filesystem::exists(in_memory_test_directory.m_directory / "blocks" / "index"));

stickies-v commented at 3:37 PM on January 7, 2026:

The block_tree_db_in_memory parameter should be cleaned up here as well:

diff --git a/src/test/kernel/test_kernel.cpp b/src/test/kernel/test_kernel.cpp
index 9f4d569f0f..7df33d79a1 100644
--- a/src/test/kernel/test_kernel.cpp
+++ b/src/test/kernel/test_kernel.cpp
@@ -655,7 +655,6 @@ BOOST_AUTO_TEST_CASE(btck_chainman_tests)
 std::unique_ptr<ChainMan> create_chainman(TestDirectory& test_directory,
                                           bool reindex,
                                           bool wipe_chainstate,
-                                          bool block_tree_db_in_memory,
                                           bool chainstate_db_in_memory,
                                           Context& context)
 {
@@ -679,7 +678,7 @@ void chainman_reindex_test(TestDirectory& test_directory)
 {
     auto notifications{std::make_shared<TestKernelNotifications>()};
     auto context{create_context(notifications, ChainType::MAINNET)};
-    auto chainman{create_chainman(test_directory, true, false, false, false, context)};
+    auto chainman{create_chainman(test_directory, true, false, false, context)};
 
     std::vector<std::string> import_files;
     BOOST_CHECK(chainman->ImportBlocks(import_files));
@@ -722,7 +721,7 @@ void chainman_reindex_chainstate_test(TestDirectory& test_directory)
 {
     auto notifications{std::make_shared<TestKernelNotifications>()};
     auto context{create_context(notifications, ChainType::MAINNET)};
-    auto chainman{create_chainman(test_directory, false, true, false, false, context)};
+    auto chainman{create_chainman(test_directory, false, true, false, context)};
 
     std::vector<std::string> import_files;
     import_files.push_back((test_directory.m_directory / "blocks" / "blk00000.dat").string());
@@ -734,7 +733,7 @@ void chainman_mainnet_validation_test(TestDirectory& test_directory)
     auto notifications{std::make_shared<TestKernelNotifications>()};
     auto validation_interface{std::make_shared<TestValidationInterface>()};
     auto context{create_context(notifications, ChainType::MAINNET, validation_interface)};
-    auto chainman{create_chainman(test_directory, false, false, false, false, context)};
+    auto chainman{create_chainman(test_directory, false, false, false, context)};
 
     // mainnet block 1
     auto raw_block = hex_string_to_byte_vec("010000006fe28c0ab6f1b372c1a6a246ae63f74f931e8365e15a089c68d6190000000000982051fd1e4ba744bbbe680e1fee14677ba1a3c3540bf7b1cdb606e857233e0e61bc6649ffff001d01e362990101000000010000000000000000000000000000000000000000000000000000000000000000ffffffff0704ffff001d0104ffffffff0100f2052a0100000043410496b538e853519c726a2c91e61ec11600ae1390813a627c66fb8be7947be63c52da7589379515d4e0a604f8141781e62294721166bf621e73a82cbf2342c858eeac00000000");
@@ -815,7 +814,7 @@ BOOST_AUTO_TEST_CASE(btck_chainman_in_memory_tests)
 
     auto notifications{std::make_shared<TestKernelNotifications>()};
     auto context{create_context(notifications, ChainType::REGTEST)};
-    auto chainman{create_chainman(in_memory_test_directory, false, false, true, true, context)};
+    auto chainman{create_chainman(in_memory_test_directory, false, false, true, context)};
 
     for (auto& raw_block : REGTEST_BLOCK_DATA) {
         Block block{hex_string_to_byte_vec(raw_block)};
@@ -844,7 +843,7 @@ BOOST_AUTO_TEST_CASE(btck_chainman_regtest_tests)
     const size_t mid{REGTEST_BLOCK_DATA.size() / 2};
 
     {
-        auto chainman{create_chainman(test_directory, false, false, false, false, context)};
+        auto chainman{create_chainman(test_directory, false, false, false, context)};
         for (size_t i{0}; i < mid; i++) {
             Block block{hex_string_to_byte_vec(REGTEST_BLOCK_DATA[i])};
             bool new_block{false};
@@ -853,7 +852,7 @@ BOOST_AUTO_TEST_CASE(btck_chainman_regtest_tests)
         }
     }
 
-    auto chainman{create_chainman(test_directory, false, false, false, false, context)};
+    auto chainman{create_chainman(test_directory, false, false, false, context)};
 
     for (size_t i{mid}; i < REGTEST_BLOCK_DATA.size(); i++) {
         Block block{hex_string_to_byte_vec(REGTEST_BLOCK_DATA[i])};


in src/kernel/chainparams.h:108 in a2d5f75a26

 103 | @@ -104,6 +104,8 @@ class CChainParams
 104 |      uint64_t AssumedBlockchainSize() const { return m_assumed_blockchain_size; }
 105 |      /** Minimum free space (in GB) needed for data directory when pruned; Does not include prune target*/
 106 |      uint64_t AssumedChainStateSize() const { return m_assumed_chain_state_size; }
 107 | +    /** Minimum free space (in MiB) needed for header store **/
 108 | +    uint64_t AssumedHeaderStoreSize() const { return m_assumed_header_store_size; }

stickies-v commented at 4:25 PM on January 7, 2026:

Headers are fixed size, and we should know the number of headers from pre-sync before writing any to disk, so I think we can avoid this hardcoded variable by initializing the file to 0MiB (or some minimal size) and then exposing a void BlockTreeStore::AllocateHeaderStore(size_t header_count) that can be called when pre-sync is finished?

sedited commented at 5:00 PM on January 13, 2026:

That's a nice suggestion, will see what that works out to.

in src/node/blockstorage.h:296 in a2d5f75a26 outdated

 292 | @@ -295,6 +293,8 @@ class BlockManager
 293 |  
 294 |      BlockfileType BlockfileTypeForHeight(int height);
 295 |  
 296 | +    std::unique_ptr<kernel::BlockTreeStore> CreateAndMigrateBlockTree();

stickies-v commented at 5:00 PM on January 7, 2026:

What's your view on how long we should keep the migration functionality alive? If we want to keep this for ~perpetuity, it might make sense to carve out the migration logic into a separate binary (that gets invoked on node startup), but I think that's probably overkill, right? If we keep it until most users have migrated (say, 5 versions), we could just force the odd remaining user to -reindex on startup?

Related note: I think the BlockTreeDB class docstring should be updated to reflect that this class is now only used for migration purposes, and should not be used for anything else.

sedited commented at 5:02 PM on January 13, 2026:

I am in favour of making this a separate binary. I think keeping the migration as painless as possible for some time into the future is desirable. If we just have a binary that takes care of that and never changes, that seems like a good way to do it to me.

stickies-v commented at 1:13 PM on January 8, 2026: contributor

Strong Concept ACK. Being able to access the block tree from multiple processes would be a significant improvement. Since the scope of rolling our own logic seems limited enough as demonstrated in this PR, I think reducing kernel's dependencies (such as e.g. on sqlite) is a good choice.

A couple of approach questions / thoughts:

The blockfiles.dat file is pretty small. I wonder if it would make sense to skip the whole m_dirty_fileinfo business and just rewrite the whole file on every WriteBlockIndexDB call, simplifying the code at the cost of increased (but I think manageable) write overhead?
If the log structure isn't going to change frequently, we could make the logic less generic and e.g. hardcode that there's going to be a single LAST_BLOCK at the beginning and a single HEADER_DATA_END at the end? Especially in combination with the previous suggestion, I think that might simplify the code quite a bit at the cost of reduced (but perhaps unnecessary) flexibility?

sedited force-pushed on Jan 27, 2026

sedited commented at 1:26 PM on January 27, 2026: contributor

Thanks for the review @stickies-v!

Updated a2d5f75a26976ec3292606bec149ef1d325383a2 -> cf9379196f5cf085ad673d793c4415d47e094d9c (blocktreestore_9 -> blocktreestore_10, compare)

Addressed @stickies-v's comment, removed wipe argument when instantiating a fresh block tree.
Addressed @stickies-v's comment, removed dangling block_tree_db_in_memory parameter in the kernel tests.
Addressed @stickies-v's comment, removed the pre-alloc altogether.

The header file pre-alloc was removed again, since I did not really measure any tangible benefit from it (even on a spinning disk). I guess this is to be expected since we usually make one big write and flush when syncing the headers in the beginning.

The blockfiles.dat file is pretty small. I wonder if it would make sense to skip the whole m_dirty_fileinfo business and just rewrite the whole file on every WriteBlockIndexDB call, simplifying the code at the cost of increased (but I think manageable) write overhead?

Mmh, I'm not sure about this. It just seems wasteful to rewrite the entire file even if we don't change most of it. At around 20 kB it is on the cusp of having a relevant size in that respect (i.e. if we ever start flushing this data on every block, we land at around another GB of writes per year), so maybe we should err on the side of keeping writes minimal?
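
The yearly figure can be checked with a short calculation. The sketch assumes a file of roughly 20 kB (the size quoted above), one flush per block, and an average of 144 blocks per day; `YearlyRewriteBytes` is a hypothetical helper for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Bytes written per year if the whole file is rewritten on every flush.
constexpr std::uint64_t YearlyRewriteBytes(std::uint64_t file_size_bytes, std::uint64_t flushes_per_day)
{
    return file_size_bytes * flushes_per_day * 365;
}

// Assumed inputs: ~20 kB file, one flush per block, 6 blocks/hour * 24 hours.
constexpr std::uint64_t FILE_SIZE_BYTES{20'000};
constexpr std::uint64_t BLOCKS_PER_DAY{6 * 24};

// 20,000 B * 144 * 365 = 1,051,200,000 B, i.e. roughly 1 GB per year.
static_assert(YearlyRewriteBytes(FILE_SIZE_BYTES, BLOCKS_PER_DAY) == 1'051'200'000);
```

Under these assumptions a full rewrite per block adds about 1.05 GB of writes per year, which matches the estimate in the comment.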

If the log structure isn't going to change frequently, we could make the logic less generic and e.g. hardcode that there's going to be a single LAST_BLOCK at the beginning and a single HEADER_DATA_END at the end? Especially in combination with the previous suggestion, I think that might simplify the code quite a bit at the cost of reduced (but perhaps unnecessary) flexibility?

Removing the header file pre-alloc removed a bunch of complexity again, since we now no longer need to track the HEADER_DATA_END. I feel like it is kind of manageable this way, but do concede that there is space for writing DRYer code. Do you think it is more readable this way?

sedited force-pushed on Jan 27, 2026

sedited commented at 2:44 PM on January 27, 2026: contributor

Rebased cf9379196f5cf085ad673d793c4415d47e094d9c -> be1ab519f0fab6571e0bc791b0eac8c297911d44 (blocktreestore_10 -> blocktreestore_11, compare)

Fixed silent merge conflicts.

DrahtBot added the label CI failed on Jan 27, 2026

DrahtBot commented at 2:45 PM on January 27, 2026: contributor

🚧 At least one of the CI tasks failed. Task test max 6 ancestor commits: https://github.com/bitcoin/bitcoin/actions/runs/21398980713/job/61604538651 LLM reason (✨ experimental): CI failure due to uncommitted changes in the index preventing git rebase/merge in the post-build step.

<details><summary>Hints</summary>

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

</details>

sedited force-pushed on Jan 28, 2026

sedited commented at 11:50 AM on January 28, 2026: contributor

Updated be1ab519f0fab6571e0bc791b0eac8c297911d44 -> 5e8dcdaba223881e060e77f7a2453567e0a0e17a (blocktreestore_11 -> blocktreestore_12, compare)

Re-worked migration to no longer use remove_all after the recent vulnerabilities.
Fixed iwyu errors.

DrahtBot removed the label CI failed on Jan 29, 2026

DrahtBot added the label Needs rebase on Feb 4, 2026

sedited force-pushed on Feb 8, 2026

sedited commented at 6:09 PM on February 8, 2026: contributor

Rebased 5e8dcdaba223881e060e77f7a2453567e0a0e17a -> 5e6621e5f7401bce35eab1a04741b5c327a5e820 (blocktreestore_12 -> blocktreestore_13, compare)

Fixed conflict with #34488

DrahtBot removed the label Needs rebase on Feb 8, 2026

DrahtBot added the label CI failed on Feb 8, 2026

DrahtBot removed the label CI failed on Feb 8, 2026

bitcoin deleted a comment on Feb 10, 2026

in src/node/blockstorage.cpp:1210 in 5e6621e5f7

1205 | +            params.path = m_opts.block_tree_dir;
1206 | +            auto block_tree_db{std::make_unique<BlockTreeDB>(params)};
1207 | +            LogInfo("   Reading data from existing leveldb block tree db...");
1208 | +            block_tree_db->ReadLastBlockFile(max_blockfile_num);
1209 | +            files.reserve(max_blockfile_num);
1210 | +            for (int i = 0; i < max_blockfile_num; i++) {

HowHsu commented at 12:45 PM on March 6, 2026:

Maybe I missed something, but here shouldn't it be i <= max_blockfile_num?
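
The difference between the two loop bounds can be shown in isolation. This sketch assumes that `max_blockfile_num` holds the number of the last block file in use, with numbering starting at 0; that is an assumption about what `ReadLastBlockFile` returns here and is not confirmed in this thread. `VisitedFiles` is a hypothetical helper.

```cpp
#include <cassert>
#include <vector>

// Returns the file numbers a loop would visit for either bound:
// `i < last_blockfile_num` (exclusive) or `i <= last_blockfile_num` (inclusive).
std::vector<int> VisitedFiles(int last_blockfile_num, bool inclusive)
{
    assert(last_blockfile_num >= 0);
    std::vector<int> visited;
    const int end{inclusive ? last_blockfile_num + 1 : last_blockfile_num};
    for (int i{0}; i < end; ++i) visited.push_back(i);
    return visited;
}
```

With the last file number equal to 2, the exclusive bound visits files 0 and 1 and skips file 2, while the inclusive bound visits all three. If the value is a count rather than the last index, the exclusive bound would already be correct.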

in src/kernel/blocktreestorage.cpp:127 in 5e6621e5f7

 122 | +AutoFile BlockTreeStore::OpenBlockFilesFile(std::string mode) const
 123 | +{
 124 | +    return OpenFile(m_block_files_file_path, mode);
 125 | +}
 126 | +
 127 | +BlockTreeStore::BlockTreeStore(const fs::path& path, const CChainParams& params, bool wipe_data)

HowHsu commented at 12:54 PM on March 6, 2026:

params seems not used in this function.

in src/kernel/blocktreestorage.cpp:310 in 5e6621e5f7 outdated

 305 | +    uint32_t number_of_types;
 306 | +    log_file >> number_of_types;
 307 | +
 308 | +    // Do a dry run to check the integrity of the log file. This should prevent corrupting the data with a corrupt/incomplete log
 309 | +    try {
 310 | +        for (uint32_t i = 0; i < number_of_types; i++) {

HowHsu commented at 1:15 PM on March 6, 2026:

The dry-run and the real apply loops are similar; maybe they can be abstracted to reduce code complexity.

stickies-v commented at 9:30 AM on June 12, 2026:

I've been thinking about this a bit more too. I think it would make sense to have a class WriteAheadLog that owns the log file format, and encapsulates the logic that is currently spread out and somewhat duplicated across ApplyLog and WriteBatchSync.

Given the size of the change and that this doesn't seem blocking for this PR, I briefly tried implementing this with my LLM, and came up with the below interface (LLM code prototype, supervised but only lightly reviewed). Leaving it here for follow-up/reference mostly.

Some key attributes:

WriteAheadLog::Open() (+ private ctor) automatically does the dry-run check, making the class ~correct-by-construction (ignoring in-flight disk corruption)
TraverseLog function is used by Open() (with no-op side-effect) as well as by ApplyLog to avoid traversal logic duplication
WriteAheadLog& operator<<(const WalRecord& record); allows streaming records into the log idiomatically (log << WalRecord{ValueType::BLOCK_FILE_INFO, CalculateBlockFilesPos(file), serialize(BlockFileInfoWrapper{info})};). This requires making the log file format flat (i.e. each record contains its type, as opposed to grouping everything). Users can still minimize file open/closes by grouping the records they stream, but it's no longer enforced by the code.

<details> <summary>WriteAheadLog interface</summary>


//! A single record to write to the log: a typed, positioned payload. On disk it
//! is laid out flat as <type> <record bytes> <position> <crc32c>; the checksum
//! covers the record bytes and position.
struct WalRecord {
    ValueType type;
    int64_t pos;                  //!< absolute target offset in the data file for `type`
    std::vector<std::byte> bytes; //!< serialized record, without position or checksum
};

//! Encapsulates the write-ahead log format and integrity. The log is a flat
//! stream of self-describing records terminated by a rolling crc32c over them
//! all; each record also carries its own crc32c. There is no grouping on disk:
//! records are independent positioned writes, and the order only matters to the
//! applier, which batches consecutive records targeting the same file.
//!
//! Records are streamed straight to disk via operator<<, so building a log never
//! holds more than one serialized record in memory. Open() replays an existing
//! log only after a full integrity dry-run, so a WriteAheadLog instance is always
//! backed by a healthy, complete log on disk. Mapping a ValueType to a target
//! file and applying the writes is the caller's responsibility.
class WriteAheadLog
{
    fs::path m_path;
    //! Held only while writing (AutoFile is not movable, so own it indirectly);
    //! readers reopen m_path locally.
    std::unique_ptr<AutoFile> m_file;
    uint32_t m_rolling_checksum{0};

    explicit WriteAheadLog(fs::path path, std::unique_ptr<AutoFile> file = nullptr)
        : m_path{std::move(path)}, m_file{std::move(file)} {}

    //! Walk the log, verifying every per-record crc32c and the trailing rolling
    //! crc32c. on_record(type, pos, record) is called for each record; `record`
    //! views a buffer reused across records (do not retain it). Returns false on
    //! any incompleteness or corruption (short read, crc mismatch); never throws
    //! on corruption. Allocates only the single reused read buffer.
    static bool TraverseLog(const fs::path& path, const std::function<void(ValueType, int64_t, std::span<const std::byte>)>& on_record);

public:
    //! Closes an un-committed file handle (e.g. on a mid-write exception) so the
    //! AutoFile written-but-open contract is not violated during stack unwind.
    ~WriteAheadLog();
    WriteAheadLog(WriteAheadLog&&) = default;
    WriteAheadLog(const WriteAheadLog&) = delete;
    WriteAheadLog& operator=(const WriteAheadLog&) = delete;
    WriteAheadLog& operator=(WriteAheadLog&&) = delete;

    //! Open a fresh log for writing.
    static WriteAheadLog Create(const fs::path& path);
    //! Stream one record to the log: serialized, checksummed and written immediately.
    WriteAheadLog& operator<<(const WalRecord& record);
    //! Write the rolling checksum, fsync and close. After this the log is durable.
    void Commit();

    //! Open and integrity-check an existing log. Returns nullopt and removes the
    //! file if it is absent, incomplete or corrupt (safe to ignore).
    static std::optional<WriteAheadLog> Open(const fs::path& path);
    //! Replay validated records in write order. `record` views a reused buffer
    //! (do not retain it). Throws if corruption is detected (not expected after
    //! a successful Open()).
    void ForEachRecord(const std::function<void(ValueType, int64_t pos, std::span<const std::byte>)>& fn) const;
    //! Delete the log file.
    void Remove() const;
};

</details>
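
For readers who want to experiment with the flat record layout outside of the codebase, the following in-memory sketch mirrors it under simplifying assumptions: every record has a fixed 4-byte payload, integers are little-endian, and the log lives in a byte vector instead of a file. `ToyLog`, `Traverse`, `PutLE` and `GetLE` are hypothetical names, and the bitwise CRC32C stands in for the crc32c library calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Bytes = std::vector<unsigned char>;
using OnRecord = std::function<void(std::uint8_t type, std::int64_t pos, const Bytes& payload)>;

// Toy sizes (assumptions): every record carries a 4-byte payload.
constexpr std::size_t PAYLOAD_SIZE{4};
constexpr std::size_t POS_SIZE{8};
constexpr std::size_t CRC_SIZE{4};
constexpr std::size_t BODY_SIZE{PAYLOAD_SIZE + POS_SIZE};
constexpr std::size_t RECORD_SIZE{1 + BODY_SIZE + CRC_SIZE};

// Bitwise CRC32C; passing crc = 0 starts a fresh checksum.
std::uint32_t Crc32cExtend(std::uint32_t crc, const unsigned char* data, std::size_t size)
{
    crc = ~crc;
    for (std::size_t i{0}; i < size; ++i) {
        crc ^= data[i];
        for (int bit{0}; bit < 8; ++bit) crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Append `size` little-endian bytes of `value`.
void PutLE(Bytes& out, std::uint64_t value, std::size_t size)
{
    for (std::size_t i{0}; i < size; ++i) out.push_back(static_cast<unsigned char>(value >> (8 * i)));
}

// Read `size` little-endian bytes starting at `offset`.
std::uint64_t GetLE(const Bytes& in, std::size_t offset, std::size_t size)
{
    std::uint64_t value{0};
    for (std::size_t i{0}; i < size; ++i) value |= std::uint64_t{in[offset + i]} << (8 * i);
    return value;
}

struct ToyLog {
    Bytes bytes;
    std::uint32_t rolling{0};

    // Flat layout: <type> <payload> <position> <crc32c over payload and position>.
    void Append(std::uint8_t type, const Bytes& payload, std::int64_t pos)
    {
        assert(payload.size() == PAYLOAD_SIZE);
        Bytes body(payload);
        PutLE(body, static_cast<std::uint64_t>(pos), POS_SIZE);
        const std::uint32_t crc{Crc32cExtend(0, body.data(), body.size())};
        rolling = Crc32cExtend(rolling, body.data(), body.size());
        bytes.push_back(type);
        bytes.insert(bytes.end(), body.begin(), body.end());
        PutLE(bytes, crc, CRC_SIZE);
    }

    // The trailing rolling checksum marks the log as complete.
    void Commit() { PutLE(bytes, rolling, CRC_SIZE); }
};

// Walks the log and reports each record whose own checksum is valid.
// Returns false on truncation or on any checksum mismatch.
bool Traverse(const Bytes& log, const OnRecord& on_record)
{
    if (log.size() < CRC_SIZE) return false;
    const std::size_t records_end{log.size() - CRC_SIZE};
    std::uint32_t rolling{0};
    std::size_t cursor{0};
    while (cursor < records_end) {
        if (records_end - cursor < RECORD_SIZE) return false; // short record
        const unsigned char* body{log.data() + cursor + 1};
        if (Crc32cExtend(0, body, BODY_SIZE) != GetLE(log, cursor + 1 + BODY_SIZE, CRC_SIZE)) return false;
        rolling = Crc32cExtend(rolling, body, BODY_SIZE);
        const std::int64_t pos{static_cast<std::int64_t>(GetLE(log, cursor + 1 + PAYLOAD_SIZE, POS_SIZE))};
        on_record(log[cursor], pos, Bytes(body, body + PAYLOAD_SIZE));
        cursor += RECORD_SIZE;
    }
    return rolling == GetLE(log, records_end, CRC_SIZE);
}
```

As in the prototype, `Traverse` invokes the callback before the trailing rolling checksum has been verified, so a caller would run it once with a no-op callback as the dry run and only then with the callback that applies the writes.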

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index f2afd9285b..e30352b188 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -117,6 +117,114 @@ static AutoFile OpenFile(const fs::path& path, std::string_view mode)
     return AutoFile{file.release()};
 }
 
+WriteAheadLog WriteAheadLog::Create(const fs::path& path)
+{
+    return WriteAheadLog{path, std::make_unique<AutoFile>(OpenFile(path, "wb").release())};
+}
+
+WriteAheadLog& WriteAheadLog::operator<<(const WalRecord& record)
+{
+    // On disk: <type> <record bytes> <position> <crc32c>. The checksum covers the
+    // record bytes and position (not the type, which only selects the size).
+    DataStream stream;
+    stream << std::span{record.bytes};
+    stream << record.pos;
+    const uint32_t checksum{crc32c::Crc32c(UCharCast(stream.data()), stream.size())};
+    m_rolling_checksum = crc32c::Extend(m_rolling_checksum, UCharCast(stream.data()), stream.size());
+
+    WriteValueType(*m_file, record.type);
+    *m_file << std::span{stream};
+    *m_file << checksum;
+    return *this;
+}
+
+void WriteAheadLog::Commit()
+{
+    *m_file << m_rolling_checksum;
+    if (!m_file->Commit()) {
+        throw BlockTreeStoreError(strprintf("Failed to commit write to log file %s", PathToString(m_path)));
+    }
+    DirectoryCommit(m_path.parent_path());
+    if (m_file->fclose() != 0) {
+        throw BlockTreeStoreError(strprintf("Failed to close after write to log file %s", PathToString(m_path)));
+    }
+    m_file.reset(); // closed cleanly; nothing for the destructor to do
+}
+
+WriteAheadLog::~WriteAheadLog()
+{
+    // A still-open file means the log was abandoned before Commit() (e.g. an
+    // exception mid-write). Close it so AutoFile's written-but-open assertion is
+    // not tripped during unwind; the partial log on disk is rejected on Open().
+    if (m_file && !m_file->IsNull()) (void)m_file->fclose();
+}
+
+void WriteAheadLog::Remove() const
+{
+    fs::remove(m_path);
+}
+
+bool WriteAheadLog::TraverseLog(const fs::path& path, const std::function<void(ValueType, int64_t, std::span<const std::byte>)>& on_record)
+{
+    auto file{OpenFile(path, "rb")};
+    file.seek(0, SEEK_END);
+    const int64_t records_end{file.tell() - static_cast<int64_t>(CHECKSUM_SIZE)}; // rolling crc occupies the final bytes
+    if (records_end < 0) return false;
+    file.seek(0, SEEK_SET);
+
+    uint32_t re_rolling_checksum{0};
+    std::vector<std::byte> buffer; // reused across records; grows to the largest record size
+    try {
+        // Records run until the trailing rolling checksum; each is self-delimiting
+        // via its leading type byte.
+        while (file.tell() < records_end) {
+            const ValueType value_type{ReadValueType(file)};
+            const uint8_t type_size{SizeFromValueType(value_type)};
+            const uint32_t payload_size = type_size + FILE_POSITION_SIZE; // record bytes + position
+            buffer.resize(payload_size);
+            file.read(buffer);
+            const uint32_t re_checksum{crc32c::Crc32c(UCharCast(buffer.data()), payload_size)};
+            re_rolling_checksum = crc32c::Extend(re_rolling_checksum, UCharCast(buffer.data()), payload_size);
+            uint32_t checksum;
+            file >> checksum;
+            if (checksum != re_checksum) return false;
+
+            int64_t pos;
+            SpanReader{std::span{buffer}.last(FILE_POSITION_SIZE)} >> pos;
+            on_record(value_type, pos, std::span{buffer}.first(type_size));
+        }
+        // The cursor should now sit exactly at the rolling checksum.
+        if (file.tell() != records_end) return false;
+        uint32_t rolling_checksum;
+        file >> rolling_checksum;
+        return rolling_checksum == re_rolling_checksum;
+    } catch (const std::ios_base::failure&) {
+        return false;
+    }
+}
+
+std::optional<WriteAheadLog> WriteAheadLog::Open(const fs::path& path)
+{
+    if (!fs::exists(path)) return std::nullopt;
+
+    // Integrity dry run: a WriteAheadLog only exists for a complete, valid log.
+    if (!TraverseLog(path, [](ValueType, int64_t, std::span<const std::byte>) {})) {
+        LogDebug(BCLog::BLOCKSTORAGE, "Corrupt or incomplete blocktree store log file found. Will not apply log.");
+        fs::remove(path);
+        return std::nullopt;
+    }
+    return WriteAheadLog{path};
+}
+
+void WriteAheadLog::ForEachRecord(const std::function<void(ValueType, int64_t, std::span<const std::byte>)>& fn) const
+{
+    // Open() already validated the log; corruption now is unexpected (e.g. the
+    // file changed underneath us) and is fatal.
+    if (!TraverseLog(m_path, fn)) {
+        throw BlockTreeStoreError("Detected on-disk file corruption.");
+    }
+}
+
 void BlockTreeStore::CheckMagicAndVersion() const
 {
     AssertLockHeld(m_mutex);
@@ -313,119 +421,61 @@ bool BlockTreeStore::ApplyLog() const
 {
     AssertLockHeld(m_mutex);
 
-    if (!fs::exists(m_log_file_path)) {
-        return false;
-    }
-
-    auto log_file{OpenFile(m_log_file_path, "rb")};
-
-    uint32_t re_rolling_checksum = 0;
-    uint32_t rolling_checksum = 0;
-    uint32_t number_of_types = 0;
-
-    // Do a dry run to check the integrity of the log file. This should prevent corrupting the data with a corrupt/incomplete log
-    try {
-        log_file >> number_of_types;
-        for (uint32_t i = 0; i < number_of_types; i++) {
-            ValueType value_type{ReadValueType(log_file)};
-            uint8_t type_size{SizeFromValueType(value_type)};
-            uint32_t entry_size = type_size + FILE_POSITION_SIZE;
-            uint64_t num_iterations;
-            log_file >> num_iterations;
-
-            std::vector<std::byte> buffer;
-            buffer.resize(entry_size);
-
-            for (uint64_t j = 0; j < num_iterations; j++) {
-                log_file.read(buffer);
-
-                uint32_t re_checksum = crc32c::Crc32c(UCharCast(buffer.data()), entry_size);
-                re_rolling_checksum = crc32c::Extend(re_rolling_checksum, UCharCast(buffer.data()), entry_size);
-                uint32_t checksum;
-                log_file >> checksum;
-                if (checksum != re_checksum) {
-                    LogDebug(BCLog::BLOCKSTORAGE, "Found invalid entry in blocktree store log file. Will not apply log.");
-                    (void)log_file.fclose();
-                    fs::remove(m_log_file_path);
-                    return false;
-                }
-            }
+    // Open() performs the integrity dry run; a missing/incomplete/corrupt log is
+    // discarded and yields nullopt.
+    auto log{WriteAheadLog::Open(m_log_file_path)};
+    if (!log) return false;
+
+    // Apply each record to its target data file: seek to the recorded absolute
+    // position and write the record with its (record+pos) checksum. The target
+    // file is kept open across a run of records hitting the same file and only
+    // committed/closed when the target file changes (or at the end).
+    std::optional<AutoFile> data_file;
+    fs::path open_path;
+    bool simulated_crash{false};
+
+    auto finish_file{[&] {
+        if (!data_file) return;
+        if (!data_file->Commit()) {
+            throw BlockTreeStoreError(strprintf("Failed to commit write to data file %s", PathToString(open_path)));
         }
-
-        log_file >> rolling_checksum;
-        if (rolling_checksum != re_rolling_checksum) {
-            LogDebug(BCLog::BLOCKSTORAGE, "Found incomplete blocktree store log file. Will not apply log.");
-            (void)log_file.fclose();
-            fs::remove(m_log_file_path);
-            return false;
+        if (data_file->fclose() != 0) {
+            throw BlockTreeStoreError(strprintf("Failed to close after write to data file %s", PathToString(open_path)));
+        }
+        data_file.reset();
+    }};
+
+    // Records targeting one file are contiguous (see WriteBatchSync), so each
+    // target file is opened and fsynced once; reordering would be correct but slow.
+    log->ForEachRecord([&](ValueType type, int64_t pos, std::span<const std::byte> record) {
+        if (simulated_crash) return;
+        const fs::path& target{GetDataFilePath(type)};
+        if (!data_file || open_path != target) {
+            finish_file();
+            open_path = target;
+            data_file.emplace(OpenFile(target, "rb+").release());
         }
-    } catch (const std::ios_base::failure& e) {
-        LogDebug(BCLog::BLOCKSTORAGE, "Corrupt or incomplete log file found, not applying: %s", e.what());
-        (void)log_file.fclose();
-        fs::remove(m_log_file_path);
-        return false;
-    }
-
-    re_rolling_checksum = 0;
-    log_file.seek(sizeof(uint32_t), SEEK_SET);
-
-    // Run through the file again, but this time write it to the target data file.
-    for (uint32_t i = 0; i < number_of_types; ++i) {
-        ValueType value_type = ReadValueType(log_file);
-        auto data_file_path = GetDataFilePath(value_type);
-        auto data_file{OpenFile(data_file_path, "rb+")};
-        uint8_t type_size{SizeFromValueType(value_type)};
-        uint32_t entry_size = type_size + FILE_POSITION_SIZE;
-
-        uint64_t num_iterations;
-        log_file >> num_iterations;
-
-        std::vector<std::byte> buffer;
-        buffer.resize(entry_size);
 
-        for (uint64_t j = 0; j < num_iterations; ++j) {
-            log_file.read(buffer);
-            SpanReader entry_reader{buffer};
-            entry_reader.ignore(type_size);
-            int64_t pos;
-            entry_reader >> pos;
+        DataStream stream;
+        stream << record;
+        stream << pos;
+        const uint32_t checksum{crc32c::Crc32c(UCharCast(stream.data()), stream.size())};
 
-            uint32_t re_checksum = crc32c::Crc32c(UCharCast(buffer.data()), entry_size);
-            re_rolling_checksum = crc32c::Extend(re_rolling_checksum, UCharCast(buffer.data()), entry_size);
-            uint32_t checksum;
-            log_file >> checksum;
-            if (re_checksum != checksum) {
-                throw BlockTreeStoreError("Detected on-disk file corruption.");
-            }
-
-            if (data_file.tell() != pos) {
-                data_file.seek(pos, SEEK_SET);
-            }
-
-            data_file << std::span<std::byte>{buffer.data(), type_size};
-            data_file << checksum;
-
-            // TEST ONLY
-            if (m_incomplete_log_apply) {
-                (void)data_file.fclose();
-                return false;
-            }
-        }
+        if (data_file->tell() != pos) data_file->seek(pos, SEEK_SET);
+        *data_file << record;
+        *data_file << checksum;
 
-        if (!data_file.Commit()) {
-            throw BlockTreeStoreError(strprintf("Failed to commit write to data file %s", PathToString(data_file_path)));
-        }
-        if (data_file.fclose() != 0) {
-            throw BlockTreeStoreError(strprintf("Failed to close after write to data file %s", PathToString(data_file_path)));
+        // TEST ONLY: simulate a crash mid-apply, leaving the log to be re-applied.
+        if (m_incomplete_log_apply) {
+            (void)data_file->fclose();
+            data_file.reset();
+            simulated_crash = true;
         }
-    }
-
-    if (rolling_checksum != re_rolling_checksum) {
-        throw BlockTreeStoreError("Detected on-disk file corruption.");
-    }
+    });
+    if (simulated_crash) return false;
+    finish_file();
 
-    (void)log_file.fclose();
-    fs::remove(m_log_file_path);
+    log->Remove();
     return true;
 }
 
@@ -438,98 +488,55 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
     // This may occur if a previous write threw an exception when writing the logged data to the .dat files.
     if (fs::exists(m_log_file_path)) (void)ApplyLog();
 
-    std::vector<std::pair<CBlockIndex*, int64_t>> pending_header_positions;
-    pending_header_positions.reserve(blockinfo.size());
-
-    // Use a write-ahead log file that gets atomically flushed to the target files.
+    auto serialize{[](auto&& wrapper) {
+        DataStream s;
+        s << wrapper;
+        return std::vector<std::byte>{s.begin(), s.end()};
+    }};
 
-    { // start log_file scope
-    auto log_file{OpenFile(m_log_file_path, "wb")};
+    auto log{WriteAheadLog::Create(m_log_file_path)};
 
-    constexpr size_t block_index_entry_size{DISK_BLOCK_INDEX_WRAPPER_SIZE + FILE_POSITION_SIZE};
+    // Records are written grouped by target file so the applier batches them; the
+    // log format itself imposes no ordering. Last block file number first:
+    log << WalRecord{ValueType::LAST_BLOCK, BLOCK_FILES_LAST_BLOCK_POS, serialize(last_file)};
 
-    DataStream stream;
-    stream.reserve(block_index_entry_size);
-    uint32_t rolling_checksum = 0;
-
-    log_file << uint32_t{3}; // We are writing three different types to the log file for now.
-
-    // Write the last block file number to the log
-    WriteValueType(log_file, ValueType::LAST_BLOCK);
-    log_file << uint64_t{1}; // just the one entry
-    stream << last_file;
-    stream << BLOCK_FILES_LAST_BLOCK_POS;
-    uint32_t checksum = crc32c::Crc32c(UCharCast(stream.data()), stream.size());
-    rolling_checksum = crc32c::Extend(rolling_checksum, UCharCast(stream.data()), stream.size());
-    log_file << std::span<std::byte>{stream};
-    log_file << checksum;
-    stream.clear();
-
-    // Write the fileInfo entries to the log
-    WriteValueType(log_file, ValueType::BLOCK_FILE_INFO);
-    log_file << uint64_t{fileInfo.size()};
-    constexpr size_t block_file_entry_size{BLOCK_FILE_INFO_WRAPPER_SIZE + FILE_POSITION_SIZE};
+    // Block file info records, positioned by file number.
     for (const auto& [file, info] : fileInfo) {
-        int64_t pos{CalculateBlockFilesPos(file)};
-        stream << BlockFileInfoWrapper{info};
-        stream << pos;
-        checksum = crc32c::Crc32c(UCharCast(stream.data()), block_file_entry_size);
-        rolling_checksum = crc32c::Extend(rolling_checksum, UCharCast(stream.data()), block_file_entry_size);
-        log_file.write(stream);
-        log_file << checksum;
-        stream.clear();
+        log << WalRecord{ValueType::BLOCK_FILE_INFO, CalculateBlockFilesPos(file), serialize(BlockFileInfoWrapper{info})};
     }
 
-    // TEST ONLY
+    // TEST ONLY: simulate a crash mid-write. The log on disk is missing its
+    // remaining records and rolling checksum, so Open()'s dry run rejects it.
     if (m_incomplete_log_write) {
-        (void)log_file.fclose();
         throw std::runtime_error("failed to write file");
     }
 
-    // Read the header data end position
+    // Block index records. New entries (header_pos == 0) are appended at the end
+    // of headers.dat; their positions are recorded in memory only after commit.
     int64_t header_data_end;
     {
         auto header_file{OpenFile(m_header_file_path, "rb")};
         header_file.seek(0, SEEK_END);
         header_data_end = header_file.tell();
     }
-
-    // Write the header data to the log
-    WriteValueType(log_file, ValueType::DISK_BLOCK_INDEX);
-    log_file << uint64_t{blockinfo.size()};
-
+    std::vector<std::pair<CBlockIndex*, int64_t>> pending_header_positions;
+    pending_header_positions.reserve(blockinfo.size());
     for (CBlockIndex* bi : blockinfo) {
-        int64_t pos = bi->header_pos == 0 ? header_data_end : bi->header_pos;
-        auto disk_bi{CDiskBlockIndex{bi}};
-        stream << DiskBlockIndexWrapper{&disk_bi};
-        stream << pos;
-        checksum = crc32c::Crc32c(UCharCast(stream.data()), block_index_entry_size);
-        rolling_checksum = crc32c::Extend(rolling_checksum, UCharCast(stream.data()), block_index_entry_size);
-        log_file.write(stream);
-        log_file << checksum;
-        stream.clear();
+        const int64_t pos{bi->header_pos == 0 ? header_data_end : bi->header_pos};
         if (bi->header_pos == 0) {
             pending_header_positions.emplace_back(bi, header_data_end);
             header_data_end += DISK_BLOCK_INDEX_WRAPPER_SIZE + CHECKSUM_SIZE;
         }
+        auto disk_bi{CDiskBlockIndex{bi}};
+        log << WalRecord{ValueType::DISK_BLOCK_INDEX, pos, serialize(DiskBlockIndexWrapper{&disk_bi})};
     }
 
-    // Finally write the rolling checksum and commit.
-    log_file << rolling_checksum;
-    if (!log_file.Commit()) {
-        throw BlockTreeStoreError(strprintf("Failed to commit write to log file %s", PathToString(m_log_file_path)));
-    }
-    DirectoryCommit(m_log_file_path.parent_path());
+    log.Commit();
 
-    // Once committed, apply the header positions to the index and close the file.
+    // Once committed, apply the header positions to the index.
     for (const auto& [block_index, header_pos] : pending_header_positions) {
         block_index->header_pos = header_pos;
     }
-    if (log_file.fclose() != 0) {
-        throw BlockTreeStoreError(strprintf("Failed to close after write to log file %s", PathToString(m_log_file_path)));
-    }
-
-    } // end log_file scope
 
     if (!ApplyLog()) {
         throw BlockTreeStoreError("Failed to apply write-ahead log to data files");
diff --git a/src/kernel/blocktreestorage.h b/src/kernel/blocktreestorage.h
index 0b49839e1c..210a213223 100644
--- a/src/kernel/blocktreestorage.h
+++ b/src/kernel/blocktreestorage.h
@@ -11,8 +11,12 @@
 #include <sync.h>
 #include <util/fs.h>
 
+#include <cstddef>
 #include <cstdint>
 #include <functional>
+#include <memory>
+#include <optional>
+#include <span>
 #include <stdexcept>
 #include <string>
 #include <string_view>
@@ -69,6 +73,71 @@ public:
     explicit BlockTreeStoreError(const std::string& msg) : std::runtime_error(msg) {}
 };
 
+//! A single record to write to the log: a typed, positioned payload. On disk it
+//! is laid out flat as <type> <record bytes> <position> <crc32c>; the checksum
+//! covers the record bytes and position.
+struct WalRecord {
+    ValueType type;
+    int64_t pos;                  //!< absolute target offset in the data file for `type`
+    std::vector<std::byte> bytes; //!< serialized record, without position or checksum
+};
+
+//! Encapsulates the write-ahead log format and integrity. The log is a flat
+//! stream of self-describing records terminated by a rolling crc32c over them
+//! all; each record also carries its own crc32c. There is no grouping on disk:
+//! records are independent positioned writes, and the order only matters to the
+//! applier, which batches consecutive records targeting the same file.
+//!
+//! Records are streamed straight to disk via operator<<, so building a log never
+//! holds more than one serialized record in memory. Open() replays an existing
+//! log only after a full integrity dry-run, so a WriteAheadLog instance is always
+//! backed by a healthy, complete log on disk. Mapping a ValueType to a target
+//! file and applying the writes is the caller's responsibility.
+class WriteAheadLog
+{
+    fs::path m_path;
+    //! Held only while writing (AutoFile is not movable, so own it indirectly);
+    //! readers reopen m_path locally.
+    std::unique_ptr<AutoFile> m_file;
+    uint32_t m_rolling_checksum{0};
+
+    explicit WriteAheadLog(fs::path path, std::unique_ptr<AutoFile> file = nullptr)
+        : m_path{std::move(path)}, m_file{std::move(file)} {}
+
+    //! Walk the log, verifying every per-record crc32c and the trailing rolling
+    //! crc32c. on_record(type, pos, record) is called for each record; `record`
+    //! views a buffer reused across records (do not retain it). Returns false on
+    //! any incompleteness or corruption (short read, crc mismatch); never throws
+    //! on corruption. Allocates only the single reused read buffer.
+    static bool TraverseLog(const fs::path& path, const std::function<void(ValueType, int64_t, std::span<const std::byte>)>& on_record);
+
+public:
+    //! Closes an un-committed file handle (e.g. on a mid-write exception) so the
+    //! AutoFile written-but-open contract is not violated during stack unwind.
+    ~WriteAheadLog();
+    WriteAheadLog(WriteAheadLog&&) = default;
+    WriteAheadLog(const WriteAheadLog&) = delete;
+    WriteAheadLog& operator=(const WriteAheadLog&) = delete;
+    WriteAheadLog& operator=(WriteAheadLog&&) = delete;
+
+    //! Open a fresh log for writing.
+    static WriteAheadLog Create(const fs::path& path);
+    //! Stream one record to the log: serialized, checksummed and written immediately.
+    WriteAheadLog& operator<<(const WalRecord& record);
+    //! Write the rolling checksum, fsync and close. After this the log is durable.
+    void Commit();
+
+    //! Open and integrity-check an existing log. Returns nullopt and removes the
+    //! file if it is absent, incomplete or corrupt (safe to ignore).
+    static std::optional<WriteAheadLog> Open(const fs::path& path);
+    //! Replay validated records in write order. `record` views a reused buffer
+    //! (do not retain it). Throws if corruption is detected (not expected after
+    //! a successful Open()).
+    void ForEachRecord(const std::function<void(ValueType, int64_t pos, std::span<const std::byte>)>& fn) const;
+    //! Delete the log file.
+    void Remove() const;
+};
+
 class CBlockFileInfo
 {
 public:

</details>
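
The flat record layout described in the `WalRecord` comment above (`<type> <record bytes> <position> <crc32c>`, followed by one trailing rolling checksum) can be sketched in standalone form. This is a minimal illustration, not the PR's implementation: the one-byte type tag, the fixed payload size shared by all records, the little-endian encoding, and the names `Crc32cExtend`, `SketchRecord`, `AppendRecord` and `VerifyLog` are all assumptions made for the example. The CRC is a plain bitwise CRC-32C rather than the optimized crc32c library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// Passing a previous result as `crc` extends it over more data.
inline uint32_t Crc32cExtend(uint32_t crc, const uint8_t* data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // The mask is all ones when the low bit is set, zero otherwise.
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical record: one-byte type tag, absolute target offset, payload.
struct SketchRecord {
    uint8_t type;
    int64_t pos;
    std::vector<uint8_t> bytes;
};

// Append the low `n` bytes of `v` in little-endian order (assumed encoding).
inline void AppendLE(std::vector<uint8_t>& out, uint64_t v, int n)
{
    for (int i = 0; i < n; ++i) out.push_back(static_cast<uint8_t>((v >> (8 * i)) & 0xFF));
}

inline uint64_t ReadLE(const uint8_t* p, int n)
{
    uint64_t v{0};
    for (int i = 0; i < n; ++i) v |= uint64_t{p[i]} << (8 * i);
    return v;
}

// Append <type> <bytes> <pos> <crc32c(bytes || pos)> and fold the
// checksummed bytes into the rolling checksum.
inline void AppendRecord(std::vector<uint8_t>& wal, uint32_t& rolling, const SketchRecord& r)
{
    wal.push_back(r.type);
    const size_t start{wal.size()};
    wal.insert(wal.end(), r.bytes.begin(), r.bytes.end());
    AppendLE(wal, static_cast<uint64_t>(r.pos), 8);
    const size_t covered{wal.size() - start};
    const uint32_t crc{Crc32cExtend(0, wal.data() + start, covered)};
    rolling = Crc32cExtend(rolling, wal.data() + start, covered);
    AppendLE(wal, crc, 4);
}

// Dry run: check every per-record checksum and the trailing rolling checksum.
// Returns false for a truncated or corrupt log; never writes anything.
inline bool VerifyLog(const std::vector<uint8_t>& wal, size_t payload_size)
{
    const size_t covered{payload_size + 8};
    const size_t record_size{1 + covered + 4};
    size_t off{0};
    uint32_t rolling{0};
    while (wal.size() - off > 4) {
        if (wal.size() - off < record_size + 4) return false; // truncated record
        const uint8_t* body{wal.data() + off + 1};
        if (Crc32cExtend(0, body, covered) != ReadLE(body + covered, 4)) return false;
        rolling = Crc32cExtend(rolling, body, covered);
        off += record_size;
    }
    return wal.size() - off == 4 && rolling == ReadLE(wal.data() + off, 4);
}
```

In this sketch a log that was cut off before the trailing rolling checksum fails verification, which mirrors the idea that an incomplete log is discarded rather than applied.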

HowHsu commented at 2:14 PM on March 6, 2026: contributor

Concept ACK, with minor comments.

DrahtBot added the label Needs rebase on Mar 9, 2026

sedited force-pushed on Mar 10, 2026

sedited commented at 4:35 PM on March 10, 2026: contributor

Thank you for the review @HowHsu,

Rebased 5e6621e5f7401bce35eab1a04741b5c327a5e820 -> 06f4f4b28ab36074faf6a175fd7cd9b4d4bc6270 (blocktreestore_13 -> blocktreestore_14, compare)

Fixed conflict with #34705

Updated 06f4f4b28ab36074faf6a175fd7cd9b4d4bc6270 -> 62edb8df903d3002f394ea4f3fa12114825ea92d (blocktreestore_14 -> blocktreestore_15, compare)

Addressed @HowHsu's comment, fixed off-by-one error in migration.
Addressed @HowHsu's comment, removed left over params arg from a previous update.

DrahtBot removed the label Needs rebase on Mar 10, 2026

in src/kernel/blocktreestorage.cpp:100 in 62edb8df90

  95 | +        }
  96 | +        if (auto magic{ser_readdata32(file)}; magic != magic_expected) {
  97 | +            throw BlockTreeStoreError(strprintf("Invalid magic in %s: 0x%08x (expected: 0x%08x)", fs::PathToString(path), magic, magic_expected));
  98 | +        }
  99 | +        if (auto version{ser_readdata32(file)}; version != version_expected) {
 100 | +            throw BlockTreeStoreError(strprintf("Invalid magic in %s: 0x%08x (expected: 0x%08x)", fs::PathToString(path), version, version_expected));

craigraw commented at 9:15 AM on March 12, 2026:

Should be Invalid version in...

in src/node/blockstorage.cpp:1249 in 62edb8df90

1244 | +        }
1245 | +    }
1246 | +
1247 | +    {
1248 | +        // Cleanup a potentially previously failed migration by setting wipe_data
1249 | +        LogInfo("   Writing data back to a new block tree store, reindexing: %b, pruned: %b", reindexing, pruned_block_files);

craigraw commented at 9:33 AM on March 12, 2026:

Should this be %d instead?

in src/kernel/blocktreestorage.cpp:275 in 62edb8df90

 270 | +        file >> checksum;
 271 | +        uint32_t re_check = crc32c::Crc32c(UCharCast(data.data()), BLOCK_FILE_INFO_WRAPPER_SIZE + FILE_POSITION_SIZE);
 272 | +        if (re_check != checksum) {
 273 | +            throw BlockTreeStoreError("Block files data failed integrity check.");
 274 | +        }
 275 | +    } catch (std::ios_base::failure::exception&) {

craigraw commented at 9:38 AM on March 12, 2026:

Should this be std::ios_base::failure& to match line 349?

in src/kernel/blocktreestorage.cpp:541 in 62edb8df90

 536 | +    AssertLockHeld(::cs_main);
 537 | +    LOCK(m_mutex);
 538 | +
 539 | +    auto file{OpenHeaderFile("rb")};
 540 | +
 541 | +    int64_t data_end_pos = fs::file_size(m_header_file_path) - 1;

craigraw commented at 9:44 AM on March 12, 2026:

Is the - 1 necessary here? Possibly a (harmless) off-by-one error.

craigraw commented at 10:03 AM on March 12, 2026: none

tACK, with minor comments.

A notable use-case for the kernel library is accessing and analyzing existing block data. Currently this can only be done by first shutting down the node writing this data. Moving away from leveldb opens the door towards doing this in parallel.

I thought it would be useful for reviewers of this PR to have data around this use case. Frigate, an Electrum server implementing BIP352 silent payments scanning, reads block headers, block data, and undo data from a running Bitcoin Core node to create an index of tweak keys.

In order to test this, I adapted Frigate to load and parse blocks directly from disk using this PR. While indexing has always been network I/O bound in the past, it immediately became CPU bound. On implementing concurrency, the indexing then became disk I/O bound writing the tweak database. Results from indexing testnet4 from genesis (125,000+ blocks) are as follows. Bitcoin Core was running for all 3 indexing runs:

| Configuration | Time | Throughput | Bottleneck |
|---|---|---|---|
| JSON-RPC | 1422s | 88 blocks/sec | Network I/O (RPC round-trips) |
| Flat file, sequential | 841s | 149 blocks/sec | Single-core CPU |
| Flat file, concurrent | 202s | 623 blocks/sec | DB writes |

I would expect the difference to be larger on mainnet. I find these results quite compelling, and I believe this PR to have significant value.
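
For reference, the relative speedups implied by the timings above can be worked out directly. The helper names `Speedup` and `ImpliedBlocks` are made up for this sketch; the inputs are the reported durations and throughputs, and the outputs are only as precise as those rounded figures.

```cpp
#include <cassert>
#include <cmath>

// Speedup of a run relative to a baseline, from wall-clock seconds.
constexpr double Speedup(double baseline_s, double run_s) { return baseline_s / run_s; }

// Approximate block count implied by a duration and a throughput.
constexpr double ImpliedBlocks(double seconds, double blocks_per_sec) { return seconds * blocks_per_sec; }

// Using the reported figures:
//   Speedup(1422, 841) is about 1.69 (flat file, sequential vs JSON-RPC)
//   Speedup(1422, 202) is about 7.04 (flat file, concurrent vs JSON-RPC)
//   Speedup(841, 202)  is about 4.16 (concurrent vs sequential)
//   ImpliedBlocks(202, 623) is 125846, consistent with "125,000+ blocks"
```

So on this data set the sequential flat-file run is roughly 1.7 times faster than JSON-RPC, and the concurrent run roughly 7 times faster.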

hodlinator commented at 1:23 PM on March 12, 2026: contributor

Thanks for testing this in Frigate and providing some hard numbers!

PR concept push-back

I'm slightly nervous about other processes accessing bitcoind's files directly, especially while it's running. Feels like we are incurring maintenance burden going forward even if access happens through the kernel API and this PR makes it technically safe in the would-be incarnation. What happens the day bitcoind wants to migrate storage formats again?

I'd be an easier sell on changing the file format and only allowing kernel API access when bitcoind is not running.

Poor JSON-RPC performance

I wonder if most of the difference between "JSON-RPC" and "Flat file, sequential" could be mitigated through sharing XOR-key + file offset and supporting sendfile()-like syscalls as I was experimenting with here: #32541#pullrequestreview-3155477609. sendfile() resulted in a ~29% speedup for transaction-sized chunks in my test case.

Apart from that I would expect there to still be a lot of lock-contention if one attempted to do concurrent RPC calls. Maybe most of it could be handled through taking shared (read-only) rather than unique (write) locks. Not sure what other hurdles one might run into along that path though.

(I'd also prefer switching from JSON-RPC to a Cap'n Proto RPC for this kind of thing when possible).
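
The shared versus unique locking idea mentioned above can be illustrated with a generic reader/writer lock. This is only a sketch using the standard library's `std::shared_mutex`; the `SharedIndex` class and its height-to-offset map are hypothetical and do not reflect how Bitcoin Core's RPC handlers or its own locks are structured.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <optional>
#include <shared_mutex>

// Hypothetical read-mostly index guarded by a reader/writer lock.
class SharedIndex
{
    mutable std::shared_mutex m_mutex;
    std::map<int, uint64_t> m_offsets; // height -> file offset (illustrative)

public:
    // Writers take the lock exclusively and block all other access.
    void Put(int height, uint64_t offset)
    {
        std::unique_lock lock{m_mutex};
        m_offsets[height] = offset;
    }

    // Readers take the lock in shared mode, so lookups can run concurrently
    // with each other and only wait for an active writer.
    std::optional<uint64_t> Get(int height) const
    {
        std::shared_lock lock{m_mutex};
        const auto it{m_offsets.find(height)};
        if (it == m_offsets.end()) return std::nullopt;
        return it->second;
    }
};
```

Whether this pattern would remove the contention in practice depends on how much of the work done under the lock is read-only, which the sketch does not try to model.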

sedited force-pushed on Mar 13, 2026

sedited commented at 8:45 AM on March 13, 2026: contributor

Thanks for trying this out and leaving a review @craigraw!

62edb8df903d3002f394ea4f3fa12114825ea92d -> b01a2082db0db33d333469ea9c21289a499e6cbd (blocktreestore_15 -> blocktreestore_16, compare)

Addressed @craigraw's comment, fixing log copy pasta.
Addressed @craigraw's comment, use %s to format.
Addressed @craigraw's comment, catch on ::failure instead of ::failure::exception.
Addressed @craigraw's comment, fix off by one error in read end condition. This was harmless, since hitting the error would indicate a corrupted file anyway.

sedited commented at 9:44 AM on March 13, 2026: contributor

Re #32427 (comment)

I'm slightly nervous about other processes accessing bitcoind's files directly, especially while it's running.

I think what is important to contextualize here is that this is already the status quo. A bunch of indexers and explorers already ingest raw block files and do so while core is running (e.g. Blockstream's electrs). So the question really is if we want to offer a supported way for users to do this.

Feels like we are incurring maintenance burden going forward even if access happens through the kernel API and this PR makes it technically safe in the would-be incarnation. What happens the day bitcoind wants to migrate storage formats again?

I'm not sure how this would complicate future migrations. The contract should already be: If you want to ingest data from a running bitcoind, you need to use tools with a compatible version. This doesn't seem to change that to me. As for increasing the maintenance burden in general, this approach seems simpler to me than either leaving things as is and dealing with its restrictions in perpetuity, or a different low-level approach attempting to do something similar through another channel.

Apart from that I would expect there to still be a lot of lock-contention if one attempted to do concurrent RPC calls. Maybe most of it could be handled through taking shared (read-only) rather than unique (write) locks. Not sure what other hurdles one might run into along that path though.

All your ideas here sound good to me, but due to the restrictions you've laid out, I'm also not sure if they would support current needs. The approach in this PR also was demonstrated now to be simple enough for an external developer to implement in reasonable time. I don't think the same could be said for an IPC+sendfile approach, especially given their portability limitations.

DrahtBot added the label CI failed on Mar 13, 2026

DrahtBot removed the label CI failed on Mar 13, 2026

hodlinator commented at 2:04 PM on March 13, 2026: contributor

I think what is important to contextualize here is that this is already the status quo. A bunch of indexers and explorers already ingest raw blocks files and do so while core is running (e.g. blockstream's electrs). So the question really is if we want to offer a supported way for users to do this.

Blocks are fairly immutable by nature, so it makes sense that a specialized fork of electrs would ingest them directly, despite your caveats in the PR description regarding reading LevelDB files opened by another process.

I was somehow under the impression that this PR also made using the kernel API to read blocks easier, sorry about that. The experimental Frigate branch ingests the new format directly from disk as memory mapped files^1. Dropping LevelDB does indeed make this much more approachable. Downstream projects being intended to read files directly without going through the kernel library means we can't provide backwards compatibility and/or error messages on kernel library version vs disk format mismatch, so less maintenance lands on Core.

I'm not sure how this would complicate future migrations. The contract should already be: If you want to ingest data from a running bitcoind, you need to use tools with a compatible version. This doesn't seem to change that to me. As for increasing the maintenance burden in general, this approach seems simpler to me than either leaving things as is and dealing with its restrictions in perpetuity, or a different low-level approach attempting to do something similar through another channel.

Blockstream's fork will need to be upgraded to support the format in this PR. Same will be true the next time bitcoind changes how blocks are stored. The programming effort will probably be small in both cases, but the version mismatches will provide users / node solutions with reasons to delay upgrading Core.

My hypothetical alternative to optimize JSON-RPC does add the XOR logic to the RPC API for downstream projects to deal with, and represents more work on our side, and I agree it would probably incur some maintenance burden. The Remote part of JSON-RPC is a strength of this approach though.

All your ideas here sound good to me, but due to the restrictions you've laid out, I'm also not sure if they would support current needs. The approach in this PR also was demonstrated now to be simple enough for an external developer to implement in reasonable time.

Using JSON-RPC or future RPC interfaces and ensuring it is optimized would benefit all users, and make reaching around to memory map files directly less of an advantage.

I don't think the same could be said for an IPC+sendfile approach, especially given their portability limitations.

It seems that Linux/FreeBSD/macOS support sendfile(), Windows has TransmitFile(), OpenBSD seems to require more explicit memory mapping + sending to avoid userspace copying. I guess libevent does something like that in its evbuffer_add_file flow.

In the end I'm positive to the LevelDB -> flat file switch of this PR, even though I'm still on the fence re encouraging downstream projects to read our files directly. Regardless, it does not prevent anyone from optimizing our RPCs (even if both use the same finite supply of PR review).

DrahtBot added the label Needs rebase on Apr 23, 2026

sedited force-pushed on Apr 24, 2026

sedited commented at 12:34 PM on April 24, 2026: contributor

Rebased b01a2082db0db33d333469ea9c21289a499e6cbd -> 546a9c1132ebf277262232387e7ea45e6edf3491 (blocktreestore_16 -> blocktreestore_17, compare)

DrahtBot removed the label Needs rebase on Apr 24, 2026

sedited renamed this:
~~(RFC) kernel: Replace leveldb-based BlockTreeDB with flat-file based store~~
kernel: Replace leveldb-based BlockTreeDB with flat-file based store
on May 9, 2026

DrahtBot added the label Needs rebase on May 26, 2026

sedited force-pushed on May 26, 2026

sedited commented at 8:04 PM on May 26, 2026: contributor

Rebased 546a9c1132ebf277262232387e7ea45e6edf3491 -> ace6e2f816792b2b67ada3d58faede197be98517 (blocktreestore_17 -> blocktreestore_18, compare)

DrahtBot removed the label Needs rebase on May 26, 2026

in src/kernel/blocktreestorage.cpp:139 in 79ca7ff3a3 outdated

 134 | +    assert(GetSerializeSize(BlockFileInfoWrapper{}) == BLOCK_FILE_INFO_WRAPPER_SIZE);
 135 | +    fs::create_directories(path);
 136 | +    if (wipe_data) {
 137 | +        fs::remove(m_header_file_path);
 138 | +        fs::remove(m_block_files_file_path);
 139 | +    }

josibake commented at 11:05 AM on May 27, 2026:

I think you want to wipe everything here, or at least also the WAL (log.dat). I haven't confirmed but feels like you could wipe data and then also read an out of date WAL log and end up in a weird state? I'll try to write a test for this to confirm.

Ideally, wipe_data wipes everything: reindex.dat, prune.dat, log.dat, headers.dat, blockfiles.dat. I suspect some of these were missed because they were added later, e.g. log.dat, but I could also be missing something.
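
A wipe that covers every file could look roughly like the sketch below. The file names are the ones listed in the comment above; treating them as the complete set, and the helper name `WipeStoreFiles`, are assumptions for illustration and not the PR's actual code.

```cpp
#include <array>
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string_view>

namespace fs = std::filesystem;

// File names as listed in the review comment; treating them as the full
// set of block tree store files is an assumption of this sketch.
constexpr std::array<std::string_view, 5> STORE_FILES{
    "reindex.dat", "prune.dat", "log.dat", "headers.dat", "blockfiles.dat"};

// Remove every listed file under `dir` and return how many existed.
// fs::remove returns false for a missing file and throws on other errors.
inline int WipeStoreFiles(const fs::path& dir)
{
    int removed{0};
    for (const auto name : STORE_FILES) {
        if (fs::remove(dir / fs::path{name})) ++removed;
    }
    return removed;
}
```

Removing the log together with the data files means a stale `log.dat` cannot be replayed onto freshly created files.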

in src/kernel/blocktreestorage.cpp:184 in 79ca7ff3a3

 179 | +
 180 | +void BlockTreeStore::WriteReindexing(bool reindexing) const
 181 | +{
 182 | +    LOCK(m_mutex);
 183 | +    if (reindexing) {
 184 | +        std::ofstream{fs::PathToString(m_reindex_flag_file_path)}.close();

josibake commented at 11:14 AM on May 27, 2026:

Seems worth checking that this write succeeds, no?
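
One possible shape for a checked flag write is sketched below, using only the standard library. The helper name `CreateFlagFile` and the choice to throw `std::runtime_error` are assumptions for the example. It checks both that the file could be opened and that closing it did not report an error. It does not fsync the file or its directory.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <stdexcept>

namespace fs = std::filesystem;

// Hypothetical helper: create an empty flag file and fail loudly if that
// does not work, instead of silently ignoring the stream state.
inline void CreateFlagFile(const fs::path& path)
{
    std::ofstream file{path};
    if (!file.is_open()) {
        throw std::runtime_error("Failed to create flag file " + path.string());
    }
    file.close(); // close() sets failbit on the stream if closing fails
    if (file.fail()) {
        throw std::runtime_error("Failed to close flag file " + path.string());
    }
}
```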

in src/kernel/blocktreestorage.cpp:243 in 79ca7ff3a3

 238 | +
 239 | +void BlockTreeStore::WritePruned(bool pruned) const
 240 | +{
 241 | +    LOCK(m_mutex);
 242 | +    if (pruned) {
 243 | +        std::ofstream{fs::PathToString(m_prune_flag_file_path)}.close();

josibake commented at 11:15 AM on May 27, 2026:

Same comment as above, I think we should check that this succeeds.

in src/test/fuzz/block_index.cpp:62 in 766caecbbf

  62 | -        .path = "", // Memory only.
  63 | -        .cache_bytes = 1_MiB,
  64 | -        .memory_only = true,
  65 | -    });
  66 | +    fs::path block_tree_store_dir{g_setup->m_args.GetDataDirBase()};
  67 | +    kernel::BlockTreeStore block_index{block_tree_store_dir};

josibake commented at 11:21 AM on May 27, 2026:

Assuming you take my previous suggestion to wipe all of the files, shouldn't this then be wipe_data=true?

in src/kernel/blocktreestorage.cpp:11 in 79ca7ff3a3

   6 | +
   7 | +#include <chain.h>
   8 | +#include <crc32c/include/crc32c/crc32c.h>
   9 | +#include <kernel/cs_main.h>
  10 | +#include <logging.h>
  11 | +#include <node/blockstorage.h>

josibake commented at 11:36 AM on May 27, 2026:

Feels like a layer violation to me :/ I'm guessing this is to keep things minimal, but this gives me the ick. Have you looked at how invasive it would be to pull CBlockFileInfo into kernel?

in src/node/blockstorage.cpp:1251 in fa95e3e21c outdated

1246 | +    int max_blockfile_num{0};
1247 | +    bool reindexing{false};
1248 | +    bool pruned_block_files{false};
1249 | +
1250 | +    {
1251 | +        LogInfo("Migrating leveldb block tree db to new block tree store.");

josibake commented at 11:41 AM on May 27, 2026:

I might be misreading, but I think your commit message is out of date: I don't see a migration directory getting created.

in src/node/blockstorage.cpp:1270 in fa95e3e21c

1265 | +                throw std::runtime_error("Failed to load block index guts");
1266 | +            }
1267 | +            block_tree_db->ReadReindexing(reindexing);
1268 | +            block_tree_db->ReadFlag("prunedblockfiles", pruned_block_files);
1269 | +        } catch (const std::exception&) {
1270 | +            LogWarning("   Failed to read existing leveldb block tree data. Removing old db and creating new block tree store.");

josibake commented at 11:46 AM on May 27, 2026:

Could also log the exception message? Might be useful for debugging a migration failure.

josibake commented at 11:49 AM on May 27, 2026: member

Been a while since I've reviewed C++, so likely a bit rusty 😅 Overall, approach is clean and I think @craigraw's benchmark numbers are very compelling.

I left a few questions that feel blocking to me, but those aside this looks in great shape.

sedited force-pushed on May 28, 2026

sedited commented at 1:05 PM on May 28, 2026: contributor

Thank you for the review @josibake

Updated ace6e2f816792b2b67ada3d58faede197be98517 -> b4badec2f7bd95df73b08b89788e23ec84e3473d (blocktreestore_18 -> blocktreestore_19, compare)

Addressed @josibake's comment, expand wipe to include all the block tree store files.
Addressed @josibake's comment_1 and comment_2, improved flag writing error checking.
Addressed @josibake's comment, wipe data on every fuzzing loop iteration.
Addressed @josibake's comment, added a commit moving CBlockFileInfo into the blocktreestorage modules.
Addressed @josibake's comment, corrected commit message describing migration procedure.
Addressed @josibake's comment, added exception message to migration failure warning log message.

josibake commented at 1:51 PM on May 28, 2026: member

ACK https://github.com/bitcoin/bitcoin/pull/32427/commits/b4badec2f7bd95df73b08b89788e23ec84e3473d :shipit:

DrahtBot requested review from marcofleon on May 28, 2026

DrahtBot requested review from HowHsu on May 28, 2026

DrahtBot requested review from craigraw on May 28, 2026

DrahtBot requested review from theuni on May 28, 2026

DrahtBot requested review from l0rinc on May 28, 2026

DrahtBot requested review from ismaelsadeeq on May 28, 2026

DrahtBot requested review from stickies-v on May 28, 2026

DrahtBot added the label CI failed on May 28, 2026

DrahtBot removed the label CI failed on May 29, 2026

in src/kernel/blocktreestorage.cpp:381 in b4badec2f7

 376 | +        log_file >> type_size;
 377 | +        uint64_t num_iterations;
 378 | +        log_file >> num_iterations;
 379 | +        uint32_t entry_size = type_size + FILE_POSITION_SIZE;
 380 | +
 381 | +        DataStream stream;

w0xlt commented at 6:32 AM on May 30, 2026:

nit (robustness/readability, non-blocking):

In BlockTreeStore::ApplyLog(), the second pass reads each log.dat record into a DataStream, reads the target position from it, and then still uses stream.data() to verify and write the record data.

The record layout is: [data bytes: type_size][target position: FILE_POSITION_SIZE].

The current code does:

log_file.read(std::span<std::byte>(stream));
stream.ignore(type_size); // advances past the data bytes in the DataStream
int64_t pos;
stream >> pos;            // reads the remaining position bytes

uint32_t re_checksum = crc32c::Crc32c(UCharCast(stream.data()), entry_size);
...
data_file << std::span<std::byte>{stream.data(), type_size};

After stream.ignore(type_size) and stream >> pos, the full record has been consumed from the DataStream. At that point stream.size() is 0, but the code still creates a type_size span from stream.data().

This can be confirmed by adding:

             log_file.read(std::span<std::byte>(stream));
             stream.ignore(type_size);
             int64_t pos;
             stream >> pos;
+            assert(stream.size() >= type_size);
 
             uint32_t re_checksum = crc32c::Crc32c(UCharCast(stream.data()), entry_size);

Then running:

build/bin/test_bitcoin --run_test=blocktreestorage_tests

The assertion fails because stream.size() is 0 while type_size is 4.

The reason this works is that DataStream::clear() uses std::vector::clear(), and clear() often leaves the old allocation and bytes in memory. So stream.data() may still point to bytes that look correct, but the DataStream itself is empty.

Maybe a better approach would be to keep the record bytes in a plain byte buffer, like in the first loop and use SpanReader only to read the target position.

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index 5960f9e733..740035bdf9 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -378,17 +378,18 @@ bool BlockTreeStore::ApplyLog() const
         log_file >> num_iterations;
         uint32_t entry_size = type_size + FILE_POSITION_SIZE;
 
-        DataStream stream;
-        stream.resize(entry_size);
+        std::vector<std::byte> buffer;
+        buffer.resize(entry_size);
 
         for (uint32_t j = 0; j < num_iterations; ++j) {
-            log_file.read(std::span<std::byte>(stream));
-            stream.ignore(type_size);
+            log_file.read(buffer);
+            SpanReader entry_reader{buffer};
+            entry_reader.ignore(type_size);
             int64_t pos;
-            stream >> pos;
+            entry_reader >> pos;
 
-            uint32_t re_checksum = crc32c::Crc32c(UCharCast(stream.data()), entry_size);
-            re_rolling_checksum = crc32c::Extend(re_rolling_checksum, UCharCast(stream.data()), entry_size);
+            uint32_t re_checksum = crc32c::Crc32c(UCharCast(buffer.data()), entry_size);
+            re_rolling_checksum = crc32c::Extend(re_rolling_checksum, UCharCast(buffer.data()), entry_size);
             uint32_t checksum;
             log_file >> checksum;
             if (re_checksum != checksum) {
@@ -399,11 +400,9 @@ bool BlockTreeStore::ApplyLog() const
                 data_file.seek(pos, SEEK_SET);
             }
 
-            data_file << std::span<std::byte>{stream.data(), type_size};
+            data_file << std::span<std::byte>{buffer.data(), type_size};
             data_file << checksum;
 
-            stream.resize(entry_size);
-
             // TEST ONLY
             if (m_incomplete_log_apply) {
                 (void)data_file.fclose();

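As a standalone illustration of the layout discussed above ([data bytes: type_size][target position]), here is a minimal sketch that keeps the record in a plain byte buffer and only copies the trailing position out of it. The names `kPositionSize`, `ParsedRecord`, `MakeRecord` and `ParseRecord` are hypothetical and not part of the PR; the position is copied in host byte order for simplicity, which is an assumption and not necessarily what the store's serialization does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical stand-in for FILE_POSITION_SIZE in the PR.
constexpr std::size_t kPositionSize = sizeof(std::int64_t);

struct ParsedRecord {
    std::vector<std::uint8_t> data; // the first type_size bytes of the record
    std::int64_t pos;               // target position stored after the data
};

// Build a record laid out as [data][position] (host byte order, for simplicity).
inline std::vector<std::uint8_t> MakeRecord(const std::vector<std::uint8_t>& data, std::int64_t pos)
{
    std::vector<std::uint8_t> record(data);
    record.resize(data.size() + kPositionSize);
    std::memcpy(record.data() + data.size(), &pos, kPositionSize);
    return record;
}

// Split a record without consuming it: the caller's buffer stays fully valid
// afterwards, so it can still be checksummed and written out as a whole.
inline std::optional<ParsedRecord> ParseRecord(const std::vector<std::uint8_t>& record, std::size_t type_size)
{
    if (record.size() != type_size + kPositionSize) return std::nullopt;
    ParsedRecord out;
    out.data.assign(record.begin(), record.begin() + static_cast<std::ptrdiff_t>(type_size));
    std::memcpy(&out.pos, record.data() + type_size, kPositionSize);
    return out;
}
```

Because the buffer is never drained, there is no window in which `record.data()` refers to bytes that the container no longer considers part of its contents.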

in src/kernel/blocktreestorage.cpp:154 in b4badec2f7

 149 | +        CreateHeaderFile();
 150 | +        CreateBlockFilesFile();
 151 | +    }
 152 | +    CheckMagicAndVersion();
 153 | +    LOCK(m_mutex);
 154 | +    (void)ApplyLog(); // Ignore an incomplete log file here, the integrity of the data is still intact.

w0xlt commented at 7:25 AM on May 30, 2026:

If I am understanding correctly, ApplyLog() returns false for two different situations:

harmless ones: there is no log, or the log is incomplete/corrupt and can be skipped because the data files should still contain the last completed state; and
real I/O failures: a log exists but cannot be opened, or Commit()/fclose() on a data file fails.

The constructor ignores the return value anyway:

(void)ApplyLog();

That is fine when log.dat is missing, incomplete, or fails validation.

But after the log validates, it represents a write that should be replayed. If Commit() or fclose() fails while writing to headers.dat / blockfiles.dat, the node may run using data files that did not receive the complete logged update.

Maybe we can keep returning false for harmless cases like missing/incomplete logs, but throw on real apply failures?

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index 5960f9e733..fe79b344ba 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -151,7 +151,7 @@ BlockTreeStore::BlockTreeStore(const fs::path& path, bool wipe_data)
     }
     CheckMagicAndVersion();
     LOCK(m_mutex);
-    (void)ApplyLog(); // Ignore an incomplete log file here, the integrity of the data is still intact.
+    (void)ApplyLog(); // Missing or incomplete logs are safe to ignore; apply failures throw.
 }
 
 void BlockTreeStore::CreateHeaderFile() const
@@ -309,7 +309,7 @@ bool BlockTreeStore::ApplyLog() const
 
     auto log_file{AutoFile{fsbridge::fopen(m_log_file_path, "rb")}};
     if (log_file.IsNull()) {
-        return false;
+        throw BlockTreeStoreError(strprintf("Unable to open blocktree store log file %s", fs::PathToString(m_log_file_path)));
     }
 
     uint32_t re_rolling_checksum = 0;
@@ -413,11 +413,12 @@ bool BlockTreeStore::ApplyLog() const
 
         if (!data_file.Commit()) {
             LogError("Failed to commit write to data file %s", PathToString(data_file_path));
-            return false;
+            (void)data_file.fclose();
+            throw BlockTreeStoreError(strprintf("Failed to commit write to data file %s", PathToString(data_file_path)));
         }
         if (data_file.fclose() != 0) {
             LogError("Failed to close after write to data file %s", PathToString(data_file_path));
-            return false;
+            throw BlockTreeStoreError(strprintf("Failed to close after write to data file %s", PathToString(data_file_path)));
         }
     }

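The split between harmless and fatal outcomes can be sketched independently of the store. This is a simplified model under stated assumptions: `LogState`, `ReplayError` and `ReplayLog` are hypothetical names, and the boolean `apply_succeeds` stands in for the real Commit()/fclose() calls on the data files.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical classification of what was found on disk at startup.
enum class LogState { Missing, Incomplete, Valid };

struct ReplayError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Returns true if a log was replayed and false if there was nothing that
// needed replaying. Throws if a validated log could not be applied, since in
// that case the data files may be missing a completed write.
inline bool ReplayLog(LogState state, bool apply_succeeds)
{
    switch (state) {
    case LogState::Missing:
    case LogState::Incomplete:
        // The data files still hold the last completed state.
        return false;
    case LogState::Valid:
        if (!apply_succeeds) {
            throw ReplayError("failed to apply validated log to data files");
        }
        return true;
    }
    return false; // not reached for valid enum values
}
```

With this shape a caller can safely discard the boolean, because the only case that must not be ignored surfaces as an exception.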

in src/kernel/blocktreestorage.cpp:507 in b4badec2f7 outdated

 502 | +    log_file << uint64_t{blockinfo.size()};
 503 | +    constexpr size_t header_entry_size{DISK_BLOCK_INDEX_WRAPPER_SIZE + FILE_POSITION_SIZE};
 504 | +
 505 | +    for (CBlockIndex* bi : blockinfo) {
 506 | +        int64_t pos = bi->header_pos == 0 ? header_data_end : bi->header_pos;
 507 | +        auto disk_bi{CDiskBlockIndex{bi}};

w0xlt commented at 7:51 AM on May 30, 2026:

When a new header entry is written, disk_bi is copied from bi while bi->header_pos is still 0, so the entry is persisted with header_pos = 0.

On the next startup that 0 is read back, so the following WriteBatchSync thinks the entry is new again and appends a duplicate instead of overwriting it in place. Setting disk_bi.header_pos = pos stores the real offset and fixes it.

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index 5960f9e733..81b944e902 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -505,6 +505,7 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
     for (CBlockIndex* bi : blockinfo) {
         int64_t pos = bi->header_pos == 0 ? header_data_end : bi->header_pos;
         auto disk_bi{CDiskBlockIndex{bi}};
+        disk_bi.header_pos = pos;
         stream << DiskBlockIndexWrapper{&disk_bi};
         stream << pos;
         checksum = crc32c::Crc32c(UCharCast(stream.data()), header_entry_size);
diff --git a/src/test/blocktreestorage_tests.cpp b/src/test/blocktreestorage_tests.cpp
index 115180a54e..822c5c6a5b 100644
--- a/src/test/blocktreestorage_tests.cpp
+++ b/src/test/blocktreestorage_tests.cpp
@@ -66,6 +66,7 @@ void check_block_map(const std::unordered_map<uint256, CBlockIndex, BlockHasher>
         BOOST_CHECK_EQUAL(index.nBits, block->nBits);
         BOOST_CHECK_EQUAL(index.nStatus, block->nStatus);
         BOOST_CHECK_EQUAL(index.nFile, block->nFile);
+        BOOST_CHECK_EQUAL(index.header_pos, block->header_pos);
     }
 }

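The duplicate-append behaviour described above can be reproduced with a small in-memory model. Everything here is hypothetical (`DiskEntry`, `MemEntry`, `WriteEntry`, `Load`, `kDataStart`): the "file" is a vector of fixed-size entries, position 0 is the "not yet written" sentinel, and `persist_real_pos` toggles between the reported behaviour and the proposed fix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DiskEntry {
    int value;        // payload
    std::int64_t pos; // position as persisted inside the entry itself
};

struct MemEntry {
    int value;
    std::int64_t pos; // 0 means "not yet written"
};

// Positions start at 1 so that 0 can serve as the sentinel.
constexpr std::int64_t kDataStart = 1;

// Append a new entry or overwrite an existing one in place.
inline void WriteEntry(std::vector<DiskEntry>& file, MemEntry& e, bool persist_real_pos)
{
    const bool is_new = (e.pos == 0);
    const std::int64_t pos = is_new ? kDataStart + static_cast<std::int64_t>(file.size()) : e.pos;
    // Without the fix, the stale in-memory position (0 for new entries) is persisted.
    const DiskEntry disk{e.value, persist_real_pos ? pos : e.pos};
    if (is_new) {
        file.push_back(disk);
    } else {
        file[static_cast<std::size_t>(pos - kDataStart)] = disk;
    }
    e.pos = pos;
}

// Reload as on startup: positions come from what was persisted.
inline std::vector<MemEntry> Load(const std::vector<DiskEntry>& file)
{
    std::vector<MemEntry> out;
    for (const DiskEntry& d : file) out.push_back(MemEntry{d.value, d.pos});
    return out;
}
```

In this model, writing an entry, reloading, and writing it again grows the file to two entries when the stale position is persisted, and keeps it at one entry when the real position is stored.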

w0xlt commented at 7:52 AM on May 30, 2026: contributor

A few review comments:

DrahtBot requested review from w0xlt on May 30, 2026

sedited force-pushed on Jun 1, 2026

sedited commented at 7:11 PM on June 1, 2026: contributor

Thank you for the review @w0xlt!

Updated b4badec2f7bd95df73b08b89788e23ec84e3473d -> e45d56f8f27b88dfb566199a88bb26569162c71c (blocktreestore_19 -> blocktreestore_20, compare)

Addressed @w0xlt's comment, replaced data stream buffer with a std::array buffer in LoadBlockIndexGuts.
Addressed @w0xlt's comment, changing returning false to throwing when writing the contents of the log file to the actual data files fails. This should be an error, because failing to apply good data to the files can leave a significant chunk of block files without block index entries pointing to them.
Addressed @w0xlt's comment, removed header pos from on-disk serialization. This was originally added as a defense-in-depth measure for the serialization. It is no longer required, however, because the CRC now provides fairly robust tamper detection.
Added a bunch of tests that should further protect from similar regressions as reported by @w0xlt: Roundtrip checks for reading/writing, checks that mutations don't change the file size, and adding a chain of indexes to the written test data.

Since the on-disk serialization changed, external clients consuming the data from a node running this branch should update.

willcl-ark commented at 8:50 AM on June 2, 2026: member

One thing that wasn't 100% clear to me on first reading is, if I run a pruned node and the migration fails, do I get thrown into a full reindex? It kinda read like I might, but I didn't fully examine (or test) it yet.

josibake commented at 11:05 AM on June 2, 2026: member

One thing that wasn't 100% clear to me on first reading is, if I run a pruned node and the migration fails, do I get thrown into a full reindex? It kinda read like I might, but I didn't fully examine (or test) it yet.

Nice catch! I didn't write a test to confirm but I think you're right. Could be made tighter with something like:

diff --git a/src/node/blockstorage.cpp b/src/node/blockstorage.cpp
index 85d3b44d69..ff36a26434 100644
--- a/src/node/blockstorage.cpp
+++ b/src/node/blockstorage.cpp
@@ -1229,11 +1229,15 @@ std::unique_ptr<kernel::BlockTreeStore> BlockManager::CreateAndMigrateBlockTree(
             params.path = m_opts.block_tree_dir;
             auto block_tree_db{std::make_unique<BlockTreeDB>(params)};
             LogInfo("   Reading data from existing leveldb block tree db...");
-            block_tree_db->ReadLastBlockFile(max_blockfile_num);
+            if (!block_tree_db->ReadLastBlockFile(max_blockfile_num)) {
+                throw std::runtime_error{"Failed to read last block file"};
+            }
             files.reserve(max_blockfile_num + 1);
             for (int i = 0; i <= max_blockfile_num; i++) {
                 CBlockFileInfo info;
-                block_tree_db->ReadBlockFileInfo(i, info);
+                if (!block_tree_db->ReadBlockFileInfo(i, info)) {
+                    throw std::runtime_error{strprintf("Failed to read block file info for file %d", i)};
+                }
                 files.emplace_back(i, info);
             }
 
@@ -1243,13 +1247,11 @@ std::unique_ptr<kernel::BlockTreeStore> BlockManager::CreateAndMigrateBlockTree(
             }
             block_tree_db->ReadReindexing(reindexing);
             block_tree_db->ReadFlag("prunedblockfiles", pruned_block_files);
-        } catch (const std::exception& e) {
-            LogWarning("   Failed to read existing leveldb block tree data. Removing old db and creating new block tree store. (%s)", e.what());
-            auto block_tree_store{std::make_unique<kernel::BlockTreeStore>(m_opts.block_tree_dir)};
-            block_tree_store->WriteReindexing(true);
-            DestroyDB(fs::PathToString(m_opts.block_tree_dir));
-            m_block_index.clear();
-            return block_tree_store;
+        } catch (const std::exception& e) {
+            throw std::runtime_error{strprintf(
+                "Failed to migrate existing leveldb block tree db to new block tree store: (%s) ",
+                e.what())};
         }
     }

...but I think it's still an interesting question as to whether or not this is the desired behaviour. In the case of a pruned node, I would lean more towards this approach, but in the case of a full node with a corrupted leveldb index, it seems nicer to just create the block tree database with a reindex flag and let the node do its thing.

Perhaps the best of both worlds would be to fail without deleting anything in the case of a corrupt leveldb index and, in the logs, prompt the user to try again with an explicit reindex flag? This also (hopefully) gives the user a chance to read the exception messages we are throwing and attempt their own manual solution before falling back to a reindex, or worse, a full IBD in the case of a pruned node.
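The policy proposed here can be written down as a small decision function. This is only a sketch of the idea, with hypothetical names (`StartupAction`, `DecideStartup`); it does not mirror the actual control flow in BlockManager.

```cpp
#include <cassert>

// Hypothetical outcomes of the startup decision.
enum class StartupAction {
    UseMigratedStore,      // migration worked, continue normally
    AbortAndPromptReindex, // leave everything on disk, tell the user what to do
    WipeAndReindex,        // only taken when the user explicitly asked for it
};

// Nothing is deleted on a failed migration unless the user opted in to a reindex.
inline StartupAction DecideStartup(bool user_requested_reindex, bool migration_ok)
{
    if (user_requested_reindex) return StartupAction::WipeAndReindex;
    return migration_ok ? StartupAction::UseMigratedStore
                        : StartupAction::AbortAndPromptReindex;
}
```

The point of the design is that the destructive branch is reachable only through an explicit user choice, never as an automatic fallback.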

willcl-ark commented at 12:46 PM on June 2, 2026: member

but I think it's still an interesting question as to whether or not this is the desired behaviour.

Yes. If you've updated your software (and therefore need migration to succeed), and migration is failing, your only choice may be reindex. Although keeping the old index about would let you downgrade software versions, if you preferred that.

josibake commented at 12:57 PM on June 2, 2026: member

may be reindex

What I'm trying to say is may is the important word here. I don't think we should automatically choose that path for the user. We should prompt them to explicitly tell us to do that. The user might be able to come up with a solution (manually editing leveldb, something I've done in the past to recover), or they may need to reindex but now might not be a good time, e.g., I'll do it later when my laptop has a better wifi connection, etc.

I don't think keeping the old files around after a successful migration is a good idea, either in the happy case of a clean migration or in the failure mode we are talking about, where something fails and the user says okay, migrate with a reindex.

The reason I don't think it's a good idea is that it leaves old state cluttering the directory. If a user does want to downgrade, I think it's reasonable (and safer) for them to need to do a reindex to recreate leveldb. It's unfortunate that in the case of a downgrading pruned node this would restart IBD, but I think overall that will be the path with the least surprises and fewer opportunities for them to hit weird edge cases and bugs. These are good things to call out in release notes, however.

sedited force-pushed on Jun 3, 2026

sedited commented at 7:35 AM on June 3, 2026: contributor

Thank you for the review @josibake and @willcl-ark!

Updated e45d56f8f27b88dfb566199a88bb26569162c71c -> 3ac29ab3b8bc7222548250911406124b29089d9a (blocktreestore_20 -> blocktreestore_21, compare)

Addressed @josibake's comment, removed the automatic cleanup and reindex. Instead let the user reindex manually upon failure. GUI users are prompted to reindex or shutdown.
Added a functional test exercising the migration and its failure mode.

josibake commented at 8:58 AM on June 3, 2026: member

ACK 3ac29ab

Thanks for taking the suggestion @sedited ! Functional test looks good.

in src/kernel/blocktreestorage.cpp:338 in 95053febe7 outdated

 333 | +            buffer.resize(entry_size);
 334 | +
 335 | +            for (uint64_t j = 0; j < num_iterations; j++) {
 336 | +                log_file.read(buffer);
 337 | +
 338 | +                uint32_t re_checksum = crc32c::Crc32c(UCharCast(buffer.data()), entry_size);

willcl-ark commented at 9:42 AM on June 3, 2026:

In 95053febe769e303a1ef2ae0f510a64bcac5a4f5

Why not include number_of_types, value_type, type_size, and num_iterations in the checksum here too?

These are used during replay so feel like they should probably be covered here with buffer.

sedited commented at 11:39 AM on June 3, 2026:

I think they are implicitly covered by the rolling checksum, since they describe the size and shape of the data. If any of them deserializes into a bad value, this should show up as an incorrect rolling checksum.

willcl-ark commented at 1:13 PM on June 3, 2026:

OK, I guess I was thinking that if a bitflip or disk corruption changed value_type on disk, the buffer would still checksum correctly, but the entries might be redirected into the wrong file. I agree the other two are fine/implicit.

Writing it out, it does seem quite unlikely though.

sedited commented at 2:39 PM on June 3, 2026:

Tightened this up a bit in my last push. Seems better to derive the type size from the value type instead of writing it to the log too. This then binds the checksum to the value type too.
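For illustration, the explicit variant asked about at the start of this thread (feeding value_type into the per-entry checksum) could look like the sketch below. This is not the approach the PR took, which derives the type size from the value type instead. `Crc32cExtend` is a slow bitwise CRC-32C written here so the example is self-contained (the PR uses the crc32c library), and `EntryChecksum` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78).
// Passing a previous result as `crc` continues the checksum over more data.
inline std::uint32_t Crc32cExtend(std::uint32_t crc, const std::uint8_t* data, std::size_t len)
{
    crc = ~crc;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Checksum over the value type byte followed by the entry payload, so that a
// flipped value type no longer verifies against the stored checksum.
inline std::uint32_t EntryChecksum(std::uint8_t value_type, const std::vector<std::uint8_t>& payload)
{
    const std::uint32_t crc = Crc32cExtend(0, &value_type, 1);
    return Crc32cExtend(crc, payload.data(), payload.size());
}
```

The same payload checksummed under two different value types yields two different checksums, which is the property the question was after.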

in src/node/blockstorage.cpp:548 in 0d66c50511 outdated

 544 | @@ -544,7 +545,7 @@ bool BlockManager::LoadBlockIndexDB(const std::optional<uint256>& snapshot_block
 545 |      m_blockfile_info.resize(max_blockfile_num + 1);
 546 |      LogInfo("Loading block index db: last block file = %i", max_blockfile_num);
 547 |      for (int nFile = 0; nFile <= max_blockfile_num; nFile++) {
 548 | -        m_block_tree_db->ReadBlockFileInfo(nFile, m_blockfile_info[nFile]);
 549 | +        (void)m_block_tree_db->ReadBlockFileInfo(nFile, m_blockfile_info[nFile]);

willcl-ark commented at 9:46 AM on June 3, 2026:

In 0d66c50511fd258395cadde2b2df48743bfd6758

Is this safe to have ReadBlockFileInfo failures ignored?

sedited commented at 11:54 AM on June 3, 2026:

I think this is meant to ignore a missing entry on fresh startup: The max blockfile num is 0, so it attempts to read a block file info entry that does not exist yet.

willcl-ark commented at 1:28 PM on June 3, 2026:

That's fair. But if ReadLastBlockFile() returns 1000 and the block file info entry for file 888 is missing or truncated or whatever, this will hide the failed read here and leave m_blockfile_info[888] default-initialised (when we should be needing to prompt a -reindex).

sedited commented at 2:30 PM on June 3, 2026:

Mmh, might that need fixing outside of this PR?

josibake commented at 12:52 PM on June 10, 2026:

I think you could argue this needs fixing in a followup, considering the surrounding code is inherited. But I also think a small fix like this is defensible in this PR:

const bool empty_store{max_blockfile_num == 0 && m_block_index.empty()};

for (int nFile = 0; nFile <= max_blockfile_num; nFile++) {
    if (!m_block_tree_db->ReadBlockFileInfo(nFile, m_blockfile_info[nFile])) {
        if (empty_store && nFile == 0) {
            continue;
        }
        LogError("%s: failed to read block file info for file %i\n", __func__, nFile);
        return false;
    }
}

Worth doing because I don't think it's wise to conflate errors with valid states. Here, because of the conflation, we end up suppressing an error. Better to handle the empty start case explicitly, and perhaps refactor how ReadBlockFileInfo works in a follow-up. However, not a blocking comment!

in src/node/blockstorage.cpp:1304 in 0d66c50511

1299 | +        block_tree_store->WriteBatchSync(dump_files, max_blockfile_num, dump_blockindexes);
1300 | +    }
1301 | +
1302 | +    // Re-open to ensure that the migration was successful
1303 | +    auto block_tree_store{std::make_unique<kernel::BlockTreeStore>(m_opts.block_tree_db_params.path)};
1304 | +    DestroyDB(fs::PathToString(m_opts.block_tree_db_params.path));

willcl-ark commented at 9:55 AM on June 3, 2026:

In 0d66c50511fd258395cadde2b2df48743bfd6758

Wondering about ignoring failures here, too... db.h has:

// Destroy the contents of the specified database.
// Be very careful using this method.
//
// Note: For backwards compatibility, if DestroyDB is unable to list the
// database files, Status::OK() will still be returned masking this failure.
LEVELDB_EXPORT Status DestroyDB(const std::string& name,
                                const Options& options);

so it seems like the result from this function might be a bit vague to act on in a vacuum. That said, we return a bool, so perhaps we could throw something in case of error.

if (!DestroyDB(fs::PathToString(m_opts.block_tree_dir)) ||
    fs::exists(m_opts.block_tree_dir / "CURRENT")) {
    throw kernel::BlockTreeStoreError(
        strprintf("Failed to remove legacy leveldb block tree db at %s",
                  fs::PathToString(m_opts.block_tree_dir)));
}

I think the danger here is that we might:

migrate successfully
fail to delete the db (for some reason)
run for a while
shutdown
on restart we see CURRENT is still present, so attempt migration again, using old data (and repeat).

Maybe even something like this is warranted if we wanted belt-and-suspenders:

const bool destroyed{DestroyDB(fs::PathToString(m_opts.block_tree_dir))};
if (!destroyed) {
    throw kernel::BlockTreeStoreError(
        strprintf("Failed to remove legacy leveldb block tree db at %s",
                  fs::PathToString(m_opts.block_tree_dir)));
}
if (fs::exists(m_opts.block_tree_dir / "CURRENT")) {
    throw kernel::BlockTreeStoreError(
        strprintf("Legacy leveldb block tree db marker still exists at %s",
                  fs::PathToString(m_opts.block_tree_dir / "CURRENT")));
}

willcl-ark commented at 9:56 AM on June 3, 2026: member

Left a few more questions. Will run some tests on it now.

Really like the overall shape and concept!

sedited force-pushed on Jun 3, 2026

sedited commented at 12:13 PM on June 3, 2026: contributor

Updated 3ac29ab3b8bc7222548250911406124b29089d9a -> 20c932180c963eb8be937a1ea3ca85a04f9c0e96 (blocktreestore_21 -> blocktreestore_22, compare)

Addressed @willcl-ark's comment, added belt and suspenders check for removing legacy db after a migration.

josibake commented at 12:48 PM on June 3, 2026: member

ACK https://github.com/bitcoin/bitcoin/pull/32427/commits/20c932180c963eb8be937a1ea3ca85a04f9c0e96

willcl-ark approved

willcl-ark commented at 1:31 PM on June 3, 2026: member

ACK 20c932180c963eb8be937a1ea3ca85a04f9c0e96

Did various forms of migration testing including:

fresh migration of a complete block tree plus restart

<snip>
2026-06-03T13:16:02Z Using obfuscation key for blocksdir *.dat files (/mnt/nvme0/.bitcoin/blocks): 'fe3ab28aa07dd9ea'
2026-06-03T13:16:02Z Migrating leveldb block tree db to new block tree store.
2026-06-03T13:16:02Z Opening LevelDB in /mnt/nvme0/.bitcoin/blocks/index
2026-06-03T13:16:02Z Opened LevelDB successfully
2026-06-03T13:16:02Z Using obfuscation key for /mnt/nvme0/.bitcoin/blocks/index: 0000000000000000
2026-06-03T13:16:02Z    Reading data from existing leveldb block tree db...
2026-06-03T13:16:03Z    Writing data back to a new block tree store, reindexing: false, pruned: false
2026-06-03T13:16:03Z    Successfully migrated the leveldb block tree db to new block tree store.
<snip>

/mnt/nvme0
❯ diff datadir/blocks/index/ .bitcoin/blocks/index
Only in datadir/blocks/index/: 000063.ldb
Only in datadir/blocks/index/: 000064.ldb
Only in datadir/blocks/index/: 000164.log
Only in datadir/blocks/index/: 000165.ldb
Only in datadir/blocks/index/: 000166.ldb
Only in datadir/blocks/index/: 000167.ldb
Only in .bitcoin/blocks/index: blockfiles.dat
Only in datadir/blocks/index/: CURRENT
Only in .bitcoin/blocks/index: headers.dat
Only in datadir/blocks/index/: LOCK
Only in datadird/blocks/index/: MANIFEST-000162

-reindex on startup with an old db:

2026-06-03T13:18:05Z Using obfuscation key for blocksdir *.dat files (/mnt/nvme0/.bitcoin/blocks): 'fe3ab28aa07dd9ea'
2026-06-03T13:18:05Z Detected legacy leveldb block tree db - removing it
2026-06-03T13:18:05Z Using 16 MiB out of 16 MiB requested for signature cache, able to store 524288 elements
2026-06-03T13:18:05Z Using 16 MiB out of 16 MiB requested for script execution cache, able to store 524288 elements
2026-06-03T13:18:05Z init message: Loading block index…
2026-06-03T13:18:05Z Assuming ancestors of block 00000000000000000000ccebd6d74d9194d8dcdc1d177c478e094bfad51ba5ac have valid signatures.
2026-06-03T13:18:05Z Setting nMinimumChainWork=0000000000000000000000000000000000000001128750f82f4c366153a3a030
2026-06-03T13:18:05Z Initializing chainstate Chainstate [ibd] @ height -1 (null)
2026-06-03T13:18:05Z Wiping LevelDB in /mnt/nvme0/.bitcoin/chainstate
2026-06-03T13:18:06Z Opening LevelDB in /mnt/nvme0/.bitcoin/chainstate
2026-06-03T13:18:06Z Opened LevelDB successfully
2026-06-03T13:18:06Z Wrote new obfuscation key for /mnt/nvme0/.bitcoin/chainstate: c92a47b594732ca3
2026-06-03T13:18:06Z Using obfuscation key for /mnt/nvme0/.bitcoin/chainstate: c92a47b594732ca3
2026-06-03T13:18:06Z init message: Verifying blocks…
2026-06-03T13:18:06Z Block index and chainstate loaded
2026-06-03T13:18:06Z Setting NODE_NETWORK in non-prune mode
2026-06-03T13:18:06Z initload thread start
2026-06-03T13:18:06Z Reindexing block file blk00000.dat (0% complete)...

I will test a few more migration paths, but this LGTM now in its current state.

Left one more comment, but IMO it's a nice-to-have not a blocker, as we'd usually prescribe a -reindex for "issues" with the block tree (metadata) in most cases anyway, which is what would be needed here.

in src/kernel/blocktreestorage.cpp:428 in 20c932180c outdated

 423 | +    (void)log_file.fclose();
 424 | +    fs::remove(m_log_file_path);
 425 | +    return true;
 426 | +}
 427 | +
 428 | +void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBlockFileInfo*>>& fileInfo, int32_t last_file, const std::vector<CBlockIndex*>& blockinfo)

janb84 commented at 1:52 PM on June 3, 2026:

Maybe a nit:

As I currently read BlockTreeStore::WriteBatchSync(), at a high level it:

1. computes a future position for a new header entry,
2. writes that future position into the in-memory CBlockIndex immediately, and
3. finishes with a WAL write and a call to ApplyLog() to write the data to disk.

So if something fails between steps 2 and 3, memory has already advanced to the new header entry while the corresponding header entry has not been written to disk yet. That leaves the in-memory state inconsistent with the on-disk state.

The suggestion is to keep those future positions in a temporary list and, after a successful ApplyLog(), copy the pending positions back into the CBlockIndex objects.

void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBlockFileInfo*>>& fileInfo, int32_t last_file, const std::vector<CBlockIndex*>& blockinfo)
{
    AssertLockHeld(::cs_main);
    LOCK(m_mutex);

    std::vector<std::pair<CBlockIndex*, int64_t>> pending_header_positions;
    pending_header_positions.reserve(blockinfo.size());

    // Use a write-ahead log file that gets atomically flushed to the target files.

    { // start log_file scope
    auto log_file{OpenFile(m_log_file_path, "wb")};
    if (log_file.IsNull()) {
        throw BlockTreeStoreError(strprintf("Unable to open file %s", fs::PathToString(m_log_file_path)));
    }

    constexpr size_t block_index_entry_size{DISK_BLOCK_INDEX_WRAPPER_SIZE + FILE_POSITION_SIZE};

    DataStream stream;
    stream.reserve(block_index_entry_size);
    uint32_t rolling_checksum = 0;

    log_file << uint32_t{3}; // We are writing three different types to the log file for now.

    // Write the last block file number to the log
    WriteValueType(log_file, ValueType::LAST_BLOCK);
    log_file << uint8_t{sizeof(uint32_t)};
    log_file << uint64_t{1}; // just the one entry
    stream << last_file;
    stream << BLOCK_FILES_LAST_BLOCK_POS;
    uint32_t checksum = crc32c::Crc32c(UCharCast(stream.data()), stream.size());
    rolling_checksum = crc32c::Extend(rolling_checksum, UCharCast(stream.data()), stream.size());
    log_file << std::span<std::byte>{stream};
    log_file << checksum;
    stream.clear();

    // Write the fileInfo entries to the log
    WriteValueType(log_file, ValueType::BLOCK_FILE_INFO);
    log_file << BLOCK_FILE_INFO_WRAPPER_SIZE;
    log_file << uint64_t{fileInfo.size()};
    constexpr size_t block_file_entry_size{BLOCK_FILE_INFO_WRAPPER_SIZE + FILE_POSITION_SIZE};
    for (const auto& [file, info] : fileInfo) {
        int64_t pos{CalculateBlockFilesPos(file)};
        stream << BlockFileInfoWrapper{info};
        stream << pos;
        checksum = crc32c::Crc32c(UCharCast(stream.data()), block_file_entry_size);
        rolling_checksum = crc32c::Extend(rolling_checksum, UCharCast(stream.data()), block_file_entry_size);
        log_file.write(stream);
        log_file << checksum;
        stream.clear();
    }

    // TEST ONLY
    if (m_incomplete_log_write) {
        (void)log_file.fclose();
        throw std::runtime_error("failed to write file");
    }

    // Read the header data end position
    int64_t header_data_end;
    {
        auto header_file{AutoFile{fsbridge::fopen(m_header_file_path, "rb")}};
        if (header_file.IsNull()) {
            throw BlockTreeStoreError(strprintf("Unable to open file %s", fs::PathToString(m_header_file_path)));
        }
        header_file.seek(0, SEEK_END);
        header_data_end = header_file.tell();
    }

    // Write the header data to the log
    WriteValueType(log_file, ValueType::DISK_BLOCK_INDEX);
    log_file << DISK_BLOCK_INDEX_WRAPPER_SIZE;
    log_file << uint64_t{blockinfo.size()};
    constexpr size_t header_entry_size{DISK_BLOCK_INDEX_WRAPPER_SIZE + FILE_POSITION_SIZE};

    for (CBlockIndex* bi : blockinfo) {
        int64_t pos = bi->header_pos == 0 ? header_data_end : bi->header_pos;
        auto disk_bi{CDiskBlockIndex{bi}};
        stream << DiskBlockIndexWrapper{&disk_bi};
        stream << pos;
        checksum = crc32c::Crc32c(UCharCast(stream.data()), header_entry_size);
        rolling_checksum = crc32c::Extend(rolling_checksum, UCharCast(stream.data()), header_entry_size);
        log_file.write(stream);
        log_file << checksum;
        stream.clear();
        if (bi->header_pos == 0) {
            pending_header_positions.emplace_back(bi, header_data_end);
            header_data_end += DISK_BLOCK_INDEX_WRAPPER_SIZE + CHECKSUM_SIZE;
        }
    }

    // Finally write the rolling checksum, commit, and close
    log_file << rolling_checksum;
    if (!log_file.Commit()) {
        throw BlockTreeStoreError(strprintf("Failed to commit write to log file %s", PathToString(m_log_file_path)));
    }
    if (log_file.fclose() != 0) {
        throw BlockTreeStoreError(strprintf("Failed to close after write to log file %s", PathToString(m_log_file_path)));
    }

    } // end log_file scope

    if (!ApplyLog()) {
        throw BlockTreeStoreError("Failed to apply write-ahead log to data files");
    }

    for (const auto& [block_index, header_pos] : pending_header_positions) {
        block_index->header_pos = header_pos;
    }
}

patch for blocktreestorage_tests.cpp

diff --git a/src/test/blocktreestorage_tests.cpp b/src/test/blocktreestorage_tests.cpp
index 68e6651275..c7bb199c41 100644
--- a/src/test/blocktreestorage_tests.cpp
+++ b/src/test/blocktreestorage_tests.cpp
@@ -236,6 +236,9 @@ BOOST_AUTO_TEST_CASE(BlockTreeStoreIncompleteWrites)
     // Now simulate a crash in the middle of writing the data.
     store->SetSimulateIncompleteLogApply(true);
     BOOST_CHECK_THROW(store->WriteBatchSync(fileinfo, last_file, blockinfo), std::runtime_error);
+    // failed apply must not advance the
+    // in-memory header position before the write-ahead log is applied.
+    BOOST_CHECK_EQUAL(block_index->header_pos, 0);
     BOOST_CHECK(fs::exists(log_file));
     BOOST_CHECK(store->LoadBlockIndexGuts(
         params->GetConsensus(),
@@ -250,7 +253,10 @@ BOOST_AUTO_TEST_CASE(BlockTreeStoreIncompleteWrites)
         params->GetConsensus(),
         [&](const uint256& hash) { return InsertBlockIndex(block_map, hash); },
         m_interrupt));
-    BOOST_CHECK_EQUAL(block_index->header_pos, HEADER_FILE_DATA_START_POS);
+    // Restart recovery should restore the persisted header position from disk,
+    // but it does not mutate the pre-crash in-memory object.
+    BOOST_CHECK_EQUAL(block_index->header_pos, 0);
+    block_index->header_pos = HEADER_FILE_DATA_START_POS;
     CheckBlockMap(block_map, blockinfo);
     CheckBlockFileInfo(0, info, *store);
 }

</details>

sedited commented at 2:23 PM on June 3, 2026:

Mmh, I'm not sure if this is an actual concern. If this write fails, it triggers a fatal error. If for some reason that gets ignored, and we re-try writing the same dirty entries, we'll just overwrite the already allocated positions again. Do you see a scenario where this might be dangerous?

janb84 commented at 2:36 PM on June 3, 2026:

Dangerous, no; otherwise it would not be just a NIT. It's more about defensive programming, especially for future code changes, and relying on “the process probably exits soon anyway” as a correctness strategy isn't "the best code there is".

janb84 commented at 1:58 PM on June 3, 2026: contributor

Concept ACK 20c932180c963eb8be937a1ea3ca85a04f9c0e96

I've been reviewing this PR, but I keep coming in second with my findings, others beat me to the punch. Might have a small NIT still.

Edit: Will a release-notes file be added?

sedited force-pushed on Jun 3, 2026

sedited commented at 2:39 PM on June 3, 2026: contributor

Thank you for the review @willcl-ark,

Updated 20c932180c963eb8be937a1ea3ca85a04f9c0e96 -> 3c3f53b488701e53ba912f2058f13883b96610d2 (blocktreestore_22 -> blocktreestore_23, compare)

Addressed @willcl-ark's comment, tightened the conditions where the log file is rejected with a bad value type. The size of the serialized entries is now derived from the deserialized value type instead of being written to the log.

josibake commented at 3:15 PM on June 3, 2026: member

reACK https://github.com/bitcoin/bitcoin/pull/32427/commits/3c3f53b488701e53ba912f2058f13883b96610d2

DrahtBot requested review from willcl-ark on Jun 3, 2026

DrahtBot requested review from janb84 on Jun 3, 2026

in src/kernel/blocktreestorage.cpp:159 in 3c3f53b488 outdated

 154 | +        fs::remove(m_log_file_path);
 155 | +        fs::remove(m_reindex_flag_file_path);
 156 | +        fs::remove(m_prune_flag_file_path);
 157 | +    }
 158 | +    bool header_file_exists{fs::exists(m_header_file_path)};
 159 | +    bool block_files_file_exists{fs::exists(m_block_files_file_path)};

w0xlt commented at 6:30 PM on June 3, 2026:

nit: There is a crash-recovery edge case: if the node, during initial block tree store creation, creates a valid empty headers.dat but crashes before creating blockfiles.dat, the next startup will fail. In this specific case, no block tree data has been written yet; it was just an interrupted empty initialization. So it would be safe to recover by creating the missing blockfiles.dat.

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index fe07af9c0e..0ce916c097 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -22,6 +22,7 @@
 #include <array>
 #include <cstddef>
 #include <cstdio>
+#include <exception>
 #include <ios>
 #include <span>
 #include <string_view>
@@ -128,6 +129,18 @@ static AutoFile OpenFile(const fs::path& path, std::string_view mode)
     return AutoFile{file.release()};
 }
 
+static bool IsEmptyHeaderFile(const fs::path& path)
+{
+    try {
+        if (fs::file_size(path) != HEADER_FILE_DATA_START_POS) return false;
+        AutoFile file{fsbridge::fopen(path, "rb")};
+        if (file.IsNull()) return false;
+        return ser_readdata32(file) == HEADER_FILE_MAGIC && ser_readdata32(file) == HEADER_FILE_VERSION;
+    } catch (const std::exception&) {
+        return false;
+    }
+}
+
 AutoFile BlockTreeStore::OpenHeaderFile(std::string_view mode) const
 {
     return OpenFile(m_header_file_path, mode);
@@ -157,6 +170,10 @@ BlockTreeStore::BlockTreeStore(const fs::path& path, bool wipe_data)
     }
     bool header_file_exists{fs::exists(m_header_file_path)};
     bool block_files_file_exists{fs::exists(m_block_files_file_path)};
+    if (header_file_exists && !block_files_file_exists && IsEmptyHeaderFile(m_header_file_path)) {
+        CreateBlockFilesFile();
+        block_files_file_exists = true;
+    }
     if (header_file_exists != block_files_file_exists) {
         throw BlockTreeStoreError("Block tree store is in an inconsistent state");
     }
diff --git a/src/test/blocktreestorage_tests.cpp b/src/test/blocktreestorage_tests.cpp
index 68e6651275..8b6e15d118 100644
--- a/src/test/blocktreestorage_tests.cpp
+++ b/src/test/blocktreestorage_tests.cpp
@@ -163,6 +163,20 @@ BOOST_AUTO_TEST_CASE(BlockTreeStoreInvalidFiles)
     BOOST_CHECK(fs::exists(header_file_path));
     BOOST_CHECK(fs::exists(block_files_file_path));
 
+    // If initialization was interrupted after creating the empty header file,
+    // recover by creating the missing block files file.
+    fs::remove(block_files_file_path);
+    BlockTreeStore{block_tree_store_dir};
+    BOOST_CHECK(fs::exists(header_file_path));
+    BOOST_CHECK(fs::exists(block_files_file_path));
+
+    // Once header data has been written, a missing block files file is no
+    // longer safe to recreate.
+    {
+        AutoFile header_file{fsbridge::fopen(header_file_path, "ab")};
+        header_file << uint8_t{0};
+        (void)header_file.fclose();
+    }
     fs::remove(block_files_file_path);
     BOOST_CHECK_THROW(BlockTreeStore{block_tree_store_dir}, BlockTreeStoreError);
     fs::remove(header_file_path);

</details>

sedited commented at 7:45 PM on June 3, 2026:

Mmh, I think this might be unlikely enough to occur to just let the user trigger a reindex? If it occurs during a migration we already handle it by wiping the data again on the following migration attempt.

w0xlt commented at 1:58 AM on June 4, 2026:

Yes, letting the user trigger -reindex is technically sufficient, and migration retry already handles this case. It is not ideal UX, since this specific failure is avoidable, but this crash window is also negligible so it probably isn't worth the extra complexity.

If anything is changed, maybe a clearer error message telling the user to retry with -reindex would be preferable.

sedited commented at 8:47 AM on June 4, 2026:

If I manually remove blockfiles.dat and restart, the following will be printed in the logs:

2026-06-04T08:44:51Z [error] Block tree store is in an inconsistent state
2026-06-04T08:44:51Z Error opening block database.
Please restart with -reindex or -reindex-chainstate to recover.
Error opening block database.
Please restart with -reindex or -reindex-chainstate to recover.

w0xlt commented at 6:31 PM on June 3, 2026: contributor

ACK 3c3f53b488701e53ba912f2058f13883b96610d2

willcl-ark approved

willcl-ark commented at 10:32 AM on June 4, 2026: member

ACK 3c3f53b488701e53ba912f2058f13883b96610d2

Excellent work :)

janb84 commented at 11:57 AM on June 4, 2026: contributor

ACK 3c3f53b488701e53ba912f2058f13883b96610d2

marcofleon commented at 6:15 PM on June 9, 2026: contributor

I've been running this differential test on top of this branch for a while now to hopefully give some more confidence in this change. So far, all good. You can see what the test covers here.

in src/kernel/blocktreestorage.cpp:168 in 3c3f53b488

 163 | +    if (!header_file_exists && !block_files_file_exists) {
 164 | +        CreateHeaderFile();
 165 | +        CreateBlockFilesFile();
 166 | +    }
 167 | +    CheckMagicAndVersion();
 168 | +    LOCK(m_mutex);

alexanderwiederin commented at 7:30 PM on June 9, 2026:

We acquire m_mutex after calling CreateHeaderFile and CreateBlockFilesFile here, which both require holding m_mutex. Surprised this is not flagged by the CI.

sedited commented at 7:37 PM on June 9, 2026:

I think it is ignoring them, because this is in a constructor. Might be better to just remove their annotations. The lock still needs to be taken for ApplyLog since we assert there.
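The behaviour guessed at above matches a documented limitation of Clang's thread safety analysis: constructor and destructor bodies are not checked. A minimal sketch of the situation follows; `SketchMutex`, `SketchStore`, and the `SKETCH_*` macros are hypothetical stand-ins, not the annotations or classes used in this PR.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical annotation macros. They expand to nothing on compilers that
// lack Clang's thread safety attributes, so the sketch builds anywhere.
#if defined(__clang__)
#define SKETCH_CAPABILITY(x) __attribute__((capability(x)))
#define SKETCH_ACQUIRE(...) __attribute__((acquire_capability(__VA_ARGS__)))
#define SKETCH_RELEASE(...) __attribute__((release_capability(__VA_ARGS__)))
#define SKETCH_REQUIRES(...) __attribute__((requires_capability(__VA_ARGS__)))
#define SKETCH_NO_ANALYSIS __attribute__((no_thread_safety_analysis))
#else
#define SKETCH_CAPABILITY(x)
#define SKETCH_ACQUIRE(...)
#define SKETCH_RELEASE(...)
#define SKETCH_REQUIRES(...)
#define SKETCH_NO_ANALYSIS
#endif

// Annotated wrapper around std::mutex (illustrative only).
class SKETCH_CAPABILITY("mutex") SketchMutex
{
public:
    void lock() SKETCH_ACQUIRE() SKETCH_NO_ANALYSIS { m_impl.lock(); }
    void unlock() SKETCH_RELEASE() SKETCH_NO_ANALYSIS { m_impl.unlock(); }

private:
    std::mutex m_impl;
};

class SketchStore
{
public:
    SketchStore()
    {
        // Called without holding m_mutex. In an ordinary member function
        // -Wthread-safety would warn here; inside a constructor it does not.
        CreateFiles();
        m_mutex.lock();
        ApplyLog();
        m_mutex.unlock();
    }

    int Steps()
    {
        m_mutex.lock();
        const int steps{m_steps};
        m_mutex.unlock();
        return steps;
    }

private:
    SketchMutex m_mutex;
    int m_steps{0};

    void CreateFiles() SKETCH_REQUIRES(m_mutex) { ++m_steps; }
    void ApplyLog() SKETCH_REQUIRES(m_mutex) { ++m_steps; }
};
```

Because the analysis gives no signal in constructors either way, taking the lock at the top of the constructor or dropping the annotations from helpers only used there both seem like reasonable ways to avoid the confusion.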

yuvicc commented at 5:42 AM on June 10, 2026: contributor

Tested ACK 3c3f53b488701e53ba912f2058f13883b96610d2

I successfully migrated from LevelDB to flat file on my v31.0 full node against this change. Here's the log:

2026-06-09T08:05:53Z Using obfuscation key for blocksdir *.dat files (/root/.bitcoin/blocks): '4a55f0349d7b0a02'
2026-06-09T08:05:53Z Migrating leveldb block tree db to new block tree store.
2026-06-09T08:05:53Z Opening LevelDB in /root/.bitcoin/blocks/index
2026-06-09T08:05:54Z Opened LevelDB successfully
2026-06-09T08:05:54Z Using obfuscation key for /root/.bitcoin/blocks/index: 0000000000000000
2026-06-09T08:05:54Z    Reading data from existing leveldb block tree db...
2026-06-09T08:06:01Z    Writing data back to a new block tree store, reindexing: false, pruned: false
2026-06-09T08:06:05Z    Successfully migrated the leveldb block tree db to new block tree store.
2026-06-09T08:06:06Z Using 16 MiB out of 16 MiB requested for signature cache, able to store 524288 elements
2026-06-09T08:06:06Z Using 16 MiB out of 16 MiB requested for script execution cache, able to store 524288 elements
2026-06-09T08:06:06Z init message: Loading block index…
2026-06-09T08:06:06Z Assuming ancestors of block 00000000000000000000ccebd6d74d9194d8dcdc1d177c478e094bfad51ba5ac have valid signatures.
2026-06-09T08:06:06Z Setting nMinimumChainWork=0000000000000000000000000000000000000001128750f82f4c366153a3a030
2026-06-09T08:06:13Z Loading block index db: last block file = 5587
2026-06-09T08:06:13Z Loading block index db: last block file info: CBlockFileInfo(blocks=58, size=92159083, heights=952882...952939, time=2026-06-08...2026-06-09)
2026-06-09T08:06:13Z Checking all blk files are present...
2026-06-09T08:06:15Z Initializing chainstate Chainstate [ibd] @ height -1 (null)
2026-06-09T08:06:15Z Opening LevelDB in /root/.bitcoin/chainstate
2026-06-09T08:06:15Z Opened LevelDB successfully
2026-06-09T08:06:15Z Using obfuscation key for /root/.bitcoin/chainstate: a6b1fced96717c7d
2026-06-09T08:06:16Z Leaving InitialBlockDownload (latching to false)
2026-06-09T08:06:16Z Loaded best chain: hashBestChain=0000000000000000000009f0d28499328e1f94ac4f35834be2b3507497f556d4 height=952939 date=2026-06-09T06:52:23Z progress=1.000000

sedited force-pushed on Jun 10, 2026

sedited commented at 8:26 AM on June 10, 2026: contributor

Thanks for the reviews and A-C-Ks. Decided to address the two nits, which I think improve the robustness a bit. Also took the opportunity to improve some other smaller things.

3c3f53b488701e53ba912f2058f13883b96610d2 -> 0c08cd4f624b5506f1c9b31afcb97b4735e2095c (blocktreestore_23 -> blocktreestore_24, compare)

Addressed @janb84's comment. This led me to improve the commit path in general: besides only applying the header positions on a successful log write, the log can now also be recovered and applied on a subsequent write. Previously a failed apply and subsequent write would leave the store in an inconsistent state: the dirty block index entries are cleared before a write is attempted, so a subsequent write would not persist the full data. This is now guarded by re-applying an existing log.
Addressed @alexanderwiederin's comment, moved the confusing lock to the top of the constructor. Also removed locking requirements from reading and writing the flag files, as well as the file open helpers.
Use the OpenFile helper more consistently.
Add DirectoryCommit calls to write operations that create or remove files.
Add a test case simulating a write that fails due to a failed log application, followed by a successful write.
Remove left over in-memory test variables.
Move CBlockFileInfo's ToString member function to the block tree store as well.
Update doc/files.md to no longer mention leveldb for the block index.

josibake commented at 12:53 PM on June 10, 2026: member

reACK 0c08cd4

left one small non-blocking comment: #32427 (review)

DrahtBot requested review from w0xlt on Jun 10, 2026

DrahtBot requested review from willcl-ark on Jun 10, 2026

DrahtBot requested review from janb84 on Jun 10, 2026

in src/chain.h:117 in 3c3f53b488

 113 | @@ -114,6 +114,9 @@ class CBlockIndex
 114 |      //! Byte offset within rev?????.dat where this block's undo data is stored
 115 |      unsigned int nUndoPos GUARDED_BY(::cs_main){0};
 116 |  
 117 | +    //! Byte offset within headers.dat where this block's header data is stored

stickies-v commented at 4:43 PM on June 10, 2026:

    //! Byte offset within headers.dat where this block's header data is stored. 0 is a sentinel for when the data is not yet stored.

in src/kernel/blocktreestorage.h:143 in 0c08cd4f62

 138 | +    BlockTreeStore(const fs::path& path, bool wipe_data = false);
 139 | +
 140 | +    void ReadReindexing(bool& reindexing) const;
 141 | +    void WriteReindexing(bool reindexing) const;
 142 | +
 143 | +    void ReadLastBlockFile(int32_t& last_block) const EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);

stickies-v commented at 5:00 PM on June 10, 2026:

nit: the last_block name tripped me up several times during review. Generally, brief docstrings would be useful for this class. Without context, my first intuition for ReadLastBlockFile is that it loads a file. For e.g. ReadPruned, it's not obvious that it's reading a flag, etc. I'm aware this issue exists in BlockTreeDB too, but this seems like a good time to add?

    void ReadLastBlockFile(int32_t& last_block_file) const EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);

in src/chain.h:118 in 0c08cd4f62

 113 | @@ -114,6 +114,9 @@ class CBlockIndex
 114 |      //! Byte offset within rev?????.dat where this block's undo data is stored
 115 |      unsigned int nUndoPos GUARDED_BY(::cs_main){0};
 116 |  
 117 | +    //! Byte offset within headers.dat where this block's header data is stored
 118 | +    int64_t header_pos GUARDED_BY(::cs_main){0};

stickies-v commented at 7:04 PM on June 10, 2026:

Adding header_pos to CBlockIndex seems like a layering violation. CBlockIndex is already a bit of a sink, so I don't think this needs to be a blocker, but I think adding a map to BlockTreeStore could be a workable alternative approach? Since we're dealing with a large amount of block headers, it would add a non-trivial amount of memory usage though. The below implementation also assumes each CBlockIndex has a non-null phashBlock, but I think that's fine?

diff --git a/src/chain.h b/src/chain.h
index 21e8aa3a22..78a0a95ee9 100644
--- a/src/chain.h
+++ b/src/chain.h
@@ -114,9 +114,6 @@ public:
     //! Byte offset within rev?????.dat where this block's undo data is stored
     unsigned int nUndoPos GUARDED_BY(::cs_main){0};
 
-    //! Byte offset within headers.dat where this block's header data is stored
-    int64_t header_pos GUARDED_BY(::cs_main){0};
-
     //! (memory only) Total amount of work (expected number of hashes) in the chain up to and including this block
     arith_uint256 nChainWork{};
 
diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index f2afd9285b..d20d3e151f 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -429,6 +429,14 @@ bool BlockTreeStore::ApplyLog() const
     return true;
 }
 
+int64_t BlockTreeStore::HeaderPosition(const uint256& hash)
+{
+    AssertLockHeld(m_mutex);
+    constexpr int64_t entry_size{DISK_BLOCK_INDEX_WRAPPER_SIZE + CHECKSUM_SIZE};
+    const int64_t next_pos{HEADER_FILE_DATA_START_POS + static_cast<int64_t>(m_header_positions.size()) * entry_size};
+    return m_header_positions.emplace(hash, next_pos).first->second;
+}
+
 void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBlockFileInfo*>>& fileInfo, int32_t last_file, const std::vector<CBlockIndex*>& blockinfo)
 {
     AssertLockHeld(::cs_main);
@@ -438,9 +446,6 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
     // This may occur if a previous write threw an exception when writing the logged data to the .dat files.
     if (fs::exists(m_log_file_path)) (void)ApplyLog();
 
-    std::vector<std::pair<CBlockIndex*, int64_t>> pending_header_positions;
-    pending_header_positions.reserve(blockinfo.size());
-
     // Use a write-ahead log file that gets atomically flushed to the target files.
 
     { // start log_file scope
@@ -486,32 +491,19 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
         throw std::runtime_error("failed to write file");
     }
 
-    // Read the header data end position
-    int64_t header_data_end;
-    {
-        auto header_file{OpenFile(m_header_file_path, "rb")};
-        header_file.seek(0, SEEK_END);
-        header_data_end = header_file.tell();
-    }
-
     // Write the header data to the log
     WriteValueType(log_file, ValueType::DISK_BLOCK_INDEX);
     log_file << uint64_t{blockinfo.size()};
 
-    for (CBlockIndex* bi : blockinfo) {
-        int64_t pos = bi->header_pos == 0 ? header_data_end : bi->header_pos;
+    for (const CBlockIndex* bi : blockinfo) {
         auto disk_bi{CDiskBlockIndex{bi}};
         stream << DiskBlockIndexWrapper{&disk_bi};
-        stream << pos;
+        stream << HeaderPosition(bi->GetBlockHash());
         checksum = crc32c::Crc32c(UCharCast(stream.data()), block_index_entry_size);
         rolling_checksum = crc32c::Extend(rolling_checksum, UCharCast(stream.data()), block_index_entry_size);
         log_file.write(stream);
         log_file << checksum;
         stream.clear();
-        if (bi->header_pos == 0) {
-            pending_header_positions.emplace_back(bi, header_data_end);
-            header_data_end += DISK_BLOCK_INDEX_WRAPPER_SIZE + CHECKSUM_SIZE;
-        }
     }
 
     // Finally write the rolling checksum and commit.
@@ -521,10 +513,6 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
     }
     DirectoryCommit(m_log_file_path.parent_path());
 
-    // Once committed, apply the header positions to the index and close the file.
-    for (const auto& [block_index, header_pos] : pending_header_positions) {
-        block_index->header_pos = header_pos;
-    }
     if (log_file.fclose() != 0) {
         throw BlockTreeStoreError(strprintf("Failed to close after write to log file %s", PathToString(m_log_file_path)));
     }
@@ -550,6 +538,8 @@ bool BlockTreeStore::LoadBlockIndexGuts(
     int64_t data_end_pos = file.tell();
     file.seek(HEADER_FILE_DATA_START_POS, SEEK_SET);
 
+    m_header_positions.clear();
+
     DataStream pos;
     DiskBlockIndexWrapper diskindex;
     uint32_t checksum;
@@ -573,9 +563,10 @@ bool BlockTreeStore::LoadBlockIndexGuts(
         pos.clear();
 
         // Construct block index object
-        CBlockIndex* pindexNew = insertBlockIndex(diskindex.ConstructBlockHash());
+        const uint256 block_hash{diskindex.ConstructBlockHash()};
+        CBlockIndex* pindexNew = insertBlockIndex(block_hash);
         pindexNew->pprev = insertBlockIndex(diskindex.hashPrev);
-        pindexNew->header_pos = record_start;
+        m_header_positions.emplace(block_hash, record_start);
         pindexNew->nHeight = diskindex.nHeight;
         pindexNew->nFile = diskindex.nFile;
         pindexNew->nDataPos = diskindex.nDataPos;
diff --git a/src/kernel/blocktreestorage.h b/src/kernel/blocktreestorage.h
index 0b49839e1c..2a72aaee72 100644
--- a/src/kernel/blocktreestorage.h
+++ b/src/kernel/blocktreestorage.h
@@ -9,18 +9,20 @@
 #include <serialize.h>
 #include <streams.h>
 #include <sync.h>
+#include <uint256.h>
 #include <util/fs.h>
+#include <util/hasher.h>
 
 #include <cstdint>
 #include <functional>
 #include <stdexcept>
 #include <string>
 #include <string_view>
+#include <unordered_map>
 #include <utility>
 #include <vector>
 
 class CBlockIndex;
-class uint256;
 
 namespace Consensus {
 struct Params;
@@ -125,10 +127,18 @@ private:
 
     mutable Mutex m_mutex;
 
+    //! Byte offset within headers.dat of each block's index entry, keyed by hash.
+    //! Dense and append-only, so the next offset is derived from the map size.
+    std::unordered_map<uint256, int64_t, BlockHasher> m_header_positions GUARDED_BY(m_mutex);
+
     void CreateHeaderFile() const EXCLUSIVE_LOCKS_REQUIRED(m_mutex);
     void CreateBlockFilesFile() const EXCLUSIVE_LOCKS_REQUIRED(m_mutex);
     void CheckMagicAndVersion() const EXCLUSIVE_LOCKS_REQUIRED(m_mutex);
 
+    //! Byte offset of the block's header record, allocating the next free slot
+    //! on first request and stable thereafter.
+    int64_t HeaderPosition(const uint256& block_hash) EXCLUSIVE_LOCKS_REQUIRED(m_mutex);
+
     AutoFile OpenBlockFilesFile(std::string_view mode) const;
     AutoFile OpenHeaderFile(std::string_view mode) const;
 
diff --git a/src/test/blocktreestorage_tests.cpp b/src/test/blocktreestorage_tests.cpp
index 7849f22991..826a38a516 100644
--- a/src/test/blocktreestorage_tests.cpp
+++ b/src/test/blocktreestorage_tests.cpp
@@ -90,7 +90,6 @@ void CheckBlockMap(const std::unordered_map<uint256, CBlockIndex, BlockHasher>&
         BOOST_CHECK_EQUAL(index.nFile, block->nFile);
         BOOST_CHECK_EQUAL(index.nDataPos, block->nDataPos);
         BOOST_CHECK_EQUAL(index.nUndoPos, block->nUndoPos);
-        BOOST_CHECK_EQUAL(index.header_pos, block->header_pos);
 
         BOOST_CHECK_EQUAL(index.nVersion, block->nVersion);
         if (index.pprev == nullptr || block->pprev == nullptr) {
@@ -209,8 +208,8 @@ BOOST_AUTO_TEST_CASE(BlockTreeStoreIncompleteWrites)
     int32_t last_file{1};
     std::vector<CBlockIndex*> blockinfo;
     auto block_index = std::make_unique<CBlockIndex>(params->GenesisBlock());
-    auto header_pos = block_index->header_pos;
-    BOOST_CHECK_EQUAL(header_pos, 0);
+    const uint256 genesis_hash{params->GenesisBlock().GetHash()};
+    block_index->phashBlock = &genesis_hash;
     blockinfo.emplace_back(block_index.get());
 
     store->SetSimulateIncompleteLogWrite(true);
@@ -250,7 +249,6 @@ BOOST_AUTO_TEST_CASE(BlockTreeStoreIncompleteWrites)
         params->GetConsensus(),
         [&](const uint256& hash) { return InsertBlockIndex(block_map, hash); },
         m_interrupt));
-    BOOST_CHECK_EQUAL(block_index->header_pos, HEADER_FILE_DATA_START_POS);
     CheckBlockMap(block_map, blockinfo);
     CheckBlockFileInfo(0, info, *store);
 
@@ -374,9 +372,7 @@ BOOST_AUTO_TEST_CASE(BlockTreeStoreRW)
     info = CreateUniqueFileInfo(counter);
     fileinfo.emplace_back(0, &info);
     CBlockIndex* block_index = AddTestBlockIndex(test_map, params->GenesisBlock(), /*prev=*/nullptr);
-    BOOST_CHECK_EQUAL(block_index->header_pos, 0);
     WriteAndCheckBlockIndex(store, test_map, fileinfo, m_interrupt, *params);
-    BOOST_CHECK_EQUAL(block_index->header_pos, HEADER_FILE_DATA_START_POS);
     CheckBlockFileInfo(0, info, store);
 
     // Write another CBlockFileInfo and update the CBlockIndex

</details>

sedited commented at 12:17 PM on June 22, 2026:

The non-null assumption seems fine, but this is quite a bit of additional memory for what is essentially a small refactoring gain. I'm hesitant to add that here, since it opens another trade-off that needs to be litigated. As overcrowded as CBlockIndex is, I'm also not entirely sold that this is a true layer violation. This is the indexing object after all that allows us to retrieve blocks and undo data by their position. Adding another index for the header does not seem out of the ordinary to me.

stickies-v commented at 12:41 PM on June 22, 2026:

I'm also not entirely sold that this is a true layer violation.

My understanding of CBlockIndex is that at the highest level, it's an index of a block's position in the tree (as reflected by its kernel name BlockTreeEntry). We are also leaking other concerns into it, such as 1) validation logic (nStatus), 2) BlockManager storage logic (n{Data,Undo}Pos), and, with this PR, 3) BlockTreeStore logic (header_pos). Are these not all separate concerns?

I'm hesitant to add that here

This I understand. Even if I think it shouldn't be a CBlockIndex concern, from a scope creep pov it makes more sense to do it in a separate PR. So I'm happy for this to be marked as resolved.

w0xlt commented at 10:51 PM on June 10, 2026: contributor

reACK 0c08cd4f624b5506f1c9b31afcb97b4735e2095c

in src/kernel/blocktreestorage.cpp:497 in 5638154e1b

 492 | +    // Write the header data to the log
 493 | +    WriteValueType(log_file, ValueType::DISK_BLOCK_INDEX);
 494 | +    log_file << uint64_t{blockinfo.size()};
 495 | +
 496 | +    for (CBlockIndex* bi : blockinfo) {
 497 | +        int64_t pos = bi->header_pos == 0 ? header_data_end : bi->header_pos;

alexanderwiederin commented at 7:28 AM on June 11, 2026:

I would recommend something like

int64_t pos = (bi->header_pos == UNSET_HEADER_POS) ? header_data_end : bi->header_pos;

and setting UNSET_HEADER_POS to -1. 0 being invalid is implicit - I have to read the code to understand why it's not valid.
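A minimal sketch of that suggestion, assuming a hypothetical `UNSET_HEADER_POS` constant and a stripped-down `SketchBlockIndex` in place of `CBlockIndex`. It shows only the position selection, not the surrounding log write.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical named sentinel. -1 can never be a valid file offset, so the
// "not yet stored" state no longer depends on knowing the file layout.
inline constexpr int64_t UNSET_HEADER_POS{-1};

// Stand-in for the one CBlockIndex field relevant here.
struct SketchBlockIndex {
    int64_t header_pos{UNSET_HEADER_POS};
};

// Position an entry should be written to: its existing slot if it was stored
// before, otherwise the current end of the header data.
inline int64_t TargetHeaderPos(const SketchBlockIndex& bi, int64_t header_data_end)
{
    return bi.header_pos == UNSET_HEADER_POS ? header_data_end : bi.header_pos;
}
```

With the code as written in the PR, 0 appears to work as a sentinel because the header file begins with magic and version fields, so no entry can start at offset 0. A named constant makes that reasoning unnecessary for the reader.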

in src/kernel/blocktreestorage.cpp:482 in 5638154e1b outdated

 477 | +
 478 | +    // TEST ONLY
 479 | +    if (m_incomplete_log_write) {
 480 | +        (void)log_file.fclose();
 481 | +        throw std::runtime_error("failed to write file");
 482 | +    }

alexanderwiederin commented at 7:31 AM on June 11, 2026:

Can we put this behind compile time guards?

in src/kernel/blocktreestorage.cpp:450 in 5638154e1b

 445 | +
 446 | +    DataStream stream;
 447 | +    stream.reserve(block_index_entry_size);
 448 | +    uint32_t rolling_checksum = 0;
 449 | +
 450 | +    log_file << uint32_t{3}; // We are writing three different types to the log file for now.

alexanderwiederin commented at 7:43 AM on June 11, 2026:

Could use a static constexpr uint32_t LOG_NUM_TYPES{3};

in src/test/fuzz/block_index.cpp:118 in e75b4319c5

 115 | @@ -118,13 +116,16 @@ FUZZ_TARGET(block_index, .init = init_block_index)
 116 |  
 117 |      // We should be able to set and read the value of any random flag.
 118 |      const std::string flag_name = fuzzed_data_provider.ConsumeRandomLengthString(100);

alexanderwiederin commented at 7:52 AM on June 11, 2026:

This can be removed I think.

in src/node/blockstorage.cpp:1318 in 486d033254 outdated

1313 | +    // Re-open to ensure that the migration was successful
1314 | +    auto block_tree_store{std::make_unique<kernel::BlockTreeStore>(m_opts.block_tree_db_params.path)};
1315 | +    cleanup_leveldb();
1316 | +
1317 | +    LogInfo("   Successfully migrated the leveldb block tree db to new block tree store.");
1318 | +    m_block_index.clear();

alexanderwiederin commented at 8:08 AM on June 11, 2026:

Would suggest a comment above:

// Clear m_block_index so it can be repopulated normally during LoadBlockIndexDB.

janb84 commented at 8:20 AM on June 11, 2026: contributor

re ACK 0c08cd4f624b5506f1c9b31afcb97b4735e2095c

tnx for taking my NIT suggestion.

in test/functional/feature_blocktree_migration.py:43 in 486d033254

  38 | +        self.generate(legacy_node, nblocks, sync_fun=self.no_op)
  39 | +        assert_equal(legacy_node.getblockchaininfo()["blocks"], nblocks)
  40 | +        self.stop_node(1)
  41 | +
  42 | +        migrate_log = "Successfully migrated the leveldb block tree db to new block tree store."
  43 | +        reindex_log = "Detected legacy leveldb block tree db - removing it"

alexanderwiederin commented at 8:22 AM on June 11, 2026:

reindex_log is slightly misleading. Would suggest detect_legacy_log.

in src/kernel/blocktreestorage.h:83 in 0c08cd4f62 outdated

  78 | +    uint32_t nHeightFirst{}; //!< lowest height of block in file
  79 | +    uint32_t nHeightLast{};  //!< highest height of block in file
  80 | +    uint64_t nTimeFirst{};   //!< earliest time of block in file
  81 | +    uint64_t nTimeLast{};    //!< latest time of block in file
  82 | +
  83 | +    SERIALIZE_METHODS(CBlockFileInfo, obj)

alexanderwiederin commented at 8:48 AM on June 11, 2026:

Would suggest a comment like

// Note: The SERIALIZE_METHODS here use VARINT encoding for compatibility with
// the legacy leveldb block tree db, used during migration in CreateAndMigrateBlockTree.
// BlockFileInfoWrapper uses fixed-width encoding for the new flat file storage.

alexanderwiederin commented at 10:12 AM on June 11, 2026: contributor

ACK 0c08cd4f624b5506f1c9b31afcb97b4735e2095c

Migration completed in ~3 seconds on a full mainnet chain (953k blocks)
On-disk size (blocks/index) dropped from 215M to 99M (~54% reduction)
Node starts and operates correctly after migration (non-pruned and non-reindexed)

Left suggestions inline.

I would suggest we make it clear to people that downgrading after migration requires a full reindex.

willcl-ark commented at 11:14 AM on June 11, 2026: member

Do we want any docs/release notes here? It might be good to consider mentioning:

that the db is changing
that downgrading will require a -reindex
anything else relevant to users I have not thought of :)

josibake commented at 11:26 AM on June 11, 2026: member

Do we want any docs/release notes here?

I'd suggest release notes in a follow-up PR

in src/kernel/blocktreestorage.h:113 in 0c08cd4f62 outdated

 108 | +        if (nTimeIn > nTimeLast)
 109 | +            nTimeLast = nTimeIn;
 110 | +    }
 111 | +};
 112 | +
 113 | +class BlockTreeStore

stickies-v commented at 2:21 PM on June 11, 2026:

This class cannot currently be safely used in multiple processes (including for reading), but unlike leveldb, doesn't throw when it is abused. I think that's currently fine, because other layers check for a datadir lock, but I think that should be documented for this class. In the future, we can improve on this by e.g. adding a read/write lock.

in src/kernel/blocktreestorage.h:61 in 0c08cd4f62

  56 | +//! number of types, [type, number of entries, [entry, target position, checksum]], rolling checksum
  57 | +//! uint32_t,        [uint8_t, uint64_t,      [variable size, int64_t, uint32_t]], uint32_t
  58 | +inline constexpr const char* LOG_FILE_NAME{"log.dat"};
  59 | +
  60 | +enum class ValueType : uint8_t {
  61 | +    LAST_BLOCK = 0,

stickies-v commented at 3:13 PM on June 11, 2026:

I think the LAST_BLOCK record is quite confusing. It partially seems to be a (currently incomplete: doesn't catch truncation of multiples of 40 bytes) corruption check, and partially a mirror of the BlockTreeDB implementation.

It is only used to pre-allocate in m_blockfile_info.resize(max_blockfile_num + 1). However, since blockfiles.dat only consists of fixed-size records, that information can equally well be derived from the file size, except that that does not allow us to catch unexpected truncation (which again, we currently don't throw for anyway).

I think it makes more sense to either drop the field (if truncation is not a problem) and simplify things, or to generalize it into a NUM_RECORDS and use it in headers.dat too, because I think if truncation is an issue for blockfiles.dat it's an issue for headers.dat too?
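To illustrate the point about deriving the count from the file size, here is a rough sketch. The record size of 40 bytes is taken from the comment above; the data start offset of 8 and all names are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Assumed layout: a fixed-size file header followed by fixed-size records.
inline constexpr int64_t SKETCH_DATA_START{8};
inline constexpr int64_t SKETCH_RECORD_SIZE{40};

// Number of complete records implied by the file size, or nullopt if the size
// is not a file header plus a whole number of records.
inline std::optional<int64_t> RecordCountFromSize(int64_t file_size)
{
    if (file_size < SKETCH_DATA_START) return std::nullopt;
    const int64_t payload{file_size - SKETCH_DATA_START};
    if (payload % SKETCH_RECORD_SIZE != 0) return std::nullopt;
    return payload / SKETCH_RECORD_SIZE;
}
```

A file truncated in the middle of a record is rejected, but one truncated by exactly one or more whole records still yields a plausible count. Catching that case needs a separately stored record count (the NUM_RECORDS idea) or an equivalent check.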

in src/node/blockstorage.h:59 in 0c08cd4f62

  95 | -        if (nTimeIn > nTimeLast)
  96 | -            nTimeLast = nTimeIn;
  97 | -    }
  98 | -};
  99 | -
 100 |  /** Access to the block database (blocks/index/) */

stickies-v commented at 3:18 PM on June 11, 2026:

nit: docstring is out of date, this is migration-only and no longer the way to access blocks/index/.

in src/kernel/blocktreestorage.cpp:113 in 0c08cd4f62

 108 | +    return static_cast<ValueType>(raw);
 109 | +}
 110 | +
 111 | +static AutoFile OpenFile(const fs::path& path, std::string_view mode)
 112 | +{
 113 | +    AutoFile file{fsbridge::fopen(path, mode.data())};

stickies-v commented at 9:35 PM on June 11, 2026:

I think we should take mode as a const std::string& here to ensure it's null-terminated. .data() and std::string_view is a dangerous combination.
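A small self-contained sketch of the hazard, independent of the PR's code: `CStringLength` stands in for any C API (such as `fopen`) that reads a mode string up to the first null byte, and the other two function names are made up for this example.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <string_view>

// Stand-in for a C API such as fopen() that expects a null-terminated mode
// string: it reads until the first '\0' and knows nothing about view sizes.
std::size_t CStringLength(const char* mode) { return std::strlen(mode); }

// Hazard: a string_view of "rb" taken from the longer literal "rb+".
// data() still points into "rb+", so the C API sees three characters.
std::size_t LengthSeenViaStringView()
{
    constexpr std::string_view full{"rb+"};
    const std::string_view mode{full.substr(0, 2)}; // logically "rb"
    return CStringLength(mode.data());
}

// Safe: std::string owns a copy and c_str() is null-terminated at size().
std::size_t LengthSeenViaString()
{
    constexpr std::string_view full{"rb+"};
    const std::string mode(full.substr(0, 2)); // copies exactly "rb"
    return CStringLength(mode.c_str());
}
```

Here the view happens to sit inside a null-terminated literal, so the result is merely wrong. A view into a buffer without any terminator would make the read undefined behavior.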

yuvicc commented at 6:51 AM on June 12, 2026: contributor

lgtm! ACK 0c08cd4f624b5506f1c9b31afcb97b4735e2095c

in src/kernel/blocktreestorage.cpp:349 in 0c08cd4f62

 344 | +                uint32_t checksum;
 345 | +                log_file >> checksum;
 346 | +                if (checksum != re_checksum) {
 347 | +                    LogDebug(BCLog::BLOCKSTORAGE, "Found invalid entry in blocktree store log file. Will not apply log.");
 348 | +                    (void)log_file.fclose();
 349 | +                    fs::remove(m_log_file_path);

stickies-v commented at 8:36 AM on June 12, 2026:

For dry-run failures, we assume that, since the log is invalid, no records from it can have been applied yet, so it's safe to remove it - the atomicity holds.

One edge case that breaks that invariant is when WriteBatchSync is interrupted after it has started persisting the log to disk, and disk corruption (e.g. a bitflip) happens at any time during WriteBatchSync. Currently, when restarting the node, we'd remove the partially-applied log, and continue with an inconsistent block tree store. I suspect in most cases such inconsistency would either be auto-corrected on next startup or lead to a crash, but I think it might be possible for there to exist a consensus failure path as well.

I don't know if there's a good solution to this (most likely very rare) problem, but if we can't prevent it I think we should at least document it. The conservative approach would be to treat all log failures as an error, and require a reindex whenever they happen, acknowledging that they'll be false positives in most cases.

in src/kernel/blocktreestorage.h:135 in 0c08cd4f62 outdated

 130 | +    void CheckMagicAndVersion() const EXCLUSIVE_LOCKS_REQUIRED(m_mutex);
 131 | +
 132 | +    AutoFile OpenBlockFilesFile(std::string_view mode) const;
 133 | +    AutoFile OpenHeaderFile(std::string_view mode) const;
 134 | +
 135 | +    [[nodiscard]] bool ApplyLog() const EXCLUSIVE_LOCKS_REQUIRED(m_mutex);

stickies-v commented at 8:40 AM on June 12, 2026:

ApplyLog failures can manifest as either a false return value or a thrown BlockTreeStoreError. I think that interface should be documented; it's not trivial.

My understanding is that false is used when an invalid log is found during the dry-run, most likely indicating an incomplete WriteBatchSync run. In the 2nd pass, failures are assumed to be disk corruption or other unrecoverable issues, so we throw BlockTreeStoreError.

(Since the log format is an implementation detail and the file is managed by this class, invalid logs are not a part of normal program flow, so I do wonder if using a bool return value here makes sense at all (vs e.g. throwing different exceptions), but I think the current approach works.)
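A sketch of how the two failure channels could be documented and consumed, under my reading of the behavior described above. `StoreError`, `LogState`, `ApplyLogSketch` and `StartupSucceeds` are hypothetical names; the real function inspects the log file rather than taking a state argument.

```cpp
#include <cassert>
#include <stdexcept>

struct StoreError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical summary of what the two passes over the log could find.
enum class LogState { Missing, Incomplete, Valid, FailsOnApply };

/**
 * Apply the write-ahead log.
 * @return true if a valid log was applied, false if there was no complete log
 *         to apply (the data files are untouched in that case).
 * @throws StoreError if applying a validated log fails, which leaves the data
 *         files in an unknown state.
 */
[[nodiscard]] bool ApplyLogSketch(LogState state)
{
    switch (state) {
    case LogState::Missing:
    case LogState::Incomplete:
        return false; // first pass: nothing was written to the data files
    case LogState::Valid:
        return true;
    case LogState::FailsOnApply:
        throw StoreError("failed to apply log"); // second pass: unrecoverable
    }
    throw StoreError("unknown log state");
}

// Hypothetical caller: a benign `false` is ignored, an exception is surfaced.
bool StartupSucceeds(LogState state)
{
    try {
        (void)ApplyLogSketch(state);
        return true;
    } catch (const StoreError&) {
        return false;
    }
}
```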

in src/kernel/blocktreestorage.cpp:447 in 0c08cd4f62 outdated

 442 | +    pending_header_positions.reserve(blockinfo.size());
 443 | +
 444 | +    // Use a write-ahead log file that gets atomically flushed to the target files.
 445 | +
 446 | +    { // start log_file scope
 447 | +    auto log_file{OpenFile(m_log_file_path, "wb")};

stickies-v commented at 9:36 AM on June 12, 2026:

Should we add a version header (and while we're at it, magic?) here too? Could be helpful when we change the log file format and the user upgrades after an unclean shutdown, and pretty low cost?

stickies-v commented at 9:41 AM on June 12, 2026: contributor

Partially reviewed 0c08cd4f624b5506f1c9b31afcb97b4735e2095c, this PR is now my highest priority so will continue to (re)-review until it's merged.

Most comments are non-blocking and I'm happy to keep for a follow-up. I would like to have some more eyes on the log file removal edge case, in case I'm underestimating the likelihood or the impact.

DrahtBot requested review from stickies-v on Jun 12, 2026

sedited force-pushed on Jun 14, 2026

sedited commented at 1:17 PM on June 14, 2026: contributor

Thank you for the reviews @stickies-v and @alexanderwiederin!

Updated 0c08cd4f624b5506f1c9b31afcb97b4735e2095c -> aff0b7ba660e23a1c76ab04197c3cef754f18ef1 (blocktreestore_24 -> blocktreestore_25, compare)

Addressed @alexanderwiederin's comment and @stickies-v's comment, using -1 as a new unset header position sentinel and giving it its own UNSET_HEADER_POS constant
Addressed @alexanderwiederin's comment, added log_num_types constexpr
Addressed @alexanderwiederin's comment, removed unused flag_name string in block_index fuzz test.
Addressed @alexanderwiederin's comment, added comment explaining why m_block_index is cleared again during migration.
Addressed @alexanderwiederin's comment, renamed reindex_log to detect_legacy_log in feature_blocktree_migration.py.
Addressed @alexanderwiederin's comment, add comment why the legacy CBlockFileInfo serialization needs to be kept around for varint encoding.
Addressed @stickies-v's comment, renamed last_block to last_block_file in ReadLastBlockFile function argument.
Addressed @stickies-v's comment, correct docstring describing the purpose of the legacy leveldb database.
Addressed @stickies-v's comment, use const std::string& when passing around the file open mode.
Addressed @stickies-v's comment, added a short docstring to ApplyLog explaining its failure modes.
Partially addressed @stickies-v's comment, moved a bunch of the duplicated checksum/writing/reading logic to separate helper functions. I took some of the suggestions, like using "record" for a log file <data, position, checksum> tuple. I'm not sure if further abstraction really aids readability.

The format of the files was also changed again in response to @stickies-v's findings. TLDR: a complete log file is identified by a separate flag file, LastBlockFile is read from the number of CBlockFileInfo values, and the log file now carries a version and magic.

Addressed @stickies-v's comment, removed the LAST_BLOCK field from the block files file. As suggested the value returned by LastBlockFile is now instead derived from the number of blockfileinfo values. As explained by his comment, this is correct, because the value is only ever used to pre-allocate the vector containing the CBlockFileInfo, so there should be no chance of undersizing it. This also allows for removing this argument from WriteBatchSync and simplifies the calling code a bit.
Addressed @stickies-v's comment, marking a log file write completion with a separate flag file and making the dry run checks throw. In combination, this protects against corruption of the log file and stops corrupted log file entries from being persisted to the data files without the user noticing. A torn log file is now identified and discarded by checking for the log flag file first.
Addressed @stickies-v's comment, added magic and version to the log file.
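For readers following along, the flag-file idea can be sketched in isolation as below. This is a simplified illustration under stated assumptions, not the code in this PR: the file names, `StorePaths` and the helper functions are made up, the "apply" step just overwrites one data file, and the fsync calls a real store needs are omitted.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Hypothetical file names, chosen for this sketch only.
struct StorePaths {
    fs::path data, log, log_flag;
    explicit StorePaths(const fs::path& dir)
        : data{dir / "data.dat"}, log{dir / "log.dat"}, log_flag{dir / "log_complete"} {}
};

std::string ReadAll(const fs::path& path)
{
    std::ifstream in{path, std::ios::binary};
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}

void WriteAll(const fs::path& path, const std::string& bytes)
{
    std::ofstream out{path, std::ios::binary | std::ios::trunc};
    out << bytes;
}

// Step 1: persist the batch to the log. A crash here leaves a torn log, no flag.
void WriteLog(const StorePaths& p, const std::string& batch) { WriteAll(p.log, batch); }

// Step 2: mark the log as complete. A real store would fsync the log first.
void MarkLogComplete(const StorePaths& p) { WriteAll(p.log_flag, ""); }

// Step 3 (also run on startup): apply a complete log, discard a torn one.
// Returns true if a log was applied. Applying is idempotent here because the
// data file is simply overwritten with the log contents.
bool ApplyLog(const StorePaths& p)
{
    if (!fs::exists(p.log_flag)) {
        fs::remove(p.log); // torn or missing log: nothing was applied yet
        return false;
    }
    WriteAll(p.data, ReadAll(p.log));
    fs::remove(p.log_flag); // remove the flag before the log
    fs::remove(p.log);
    return true;
}
```

The removal order matters in this sketch: if the process dies between the two removals, the next start sees no flag and discards a log that has already been applied, rather than seeing a flag without a log.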

sedited force-pushed on Jun 14, 2026

sedited commented at 1:22 PM on June 14, 2026: contributor

Rebased aff0b7ba660e23a1c76ab04197c3cef754f18ef1 -> bbc9f2b2a185fe0cea096e7c8b30cccbb2ce93b2 (blocktreestore_25 -> blocktreestore_26, compare)

Fixed conflict with https://github.com/bitcoin/bitcoin/pull/35359

sedited renamed this:
~~kernel: Replace leveldb-based BlockTreeDB with flat-file based store~~
kernel: Replace leveldb-based BlockTreeDB with WAL and .dat file based store
on Jun 14, 2026

edilmedeiros commented at 6:43 PM on June 16, 2026: contributor

Approach ACK.

I would like to have seen the approach going in the direction of splitting the block files as well. But since the idea is to consume this data via the kernel library, it's probably better to not change the file layout to minimize changes here. In the future, this can be abstracted to provide more than one implementation, if needed.

in src/node/blockstorage.cpp:1309 in dc32993ad0 outdated

1304 | +        }
1305 | +
1306 | +        block_tree_store->WriteBatchSync(dump_files, dump_blockindexes);
1307 | +    }
1308 | +
1309 | +    // Re-open to ensure that the migration was successful

willcl-ark commented at 8:53 AM on June 17, 2026:

In dc32993ad013addd39d2e31cd42b965b942ab1f2

We re-open here which checks the magic and version, but I wonder if we should also run LoadBlockIndexGuts() or do some per-record checksum validation before we cleanup_levedb, in case migration has written bad data.

We could do something like read the new store into a temporary map and compare it to the (loaded) leveldb store, if you feel like this may be worth doing.
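Roughly what I have in mind, as a sketch with stand-in types. `IndexRecord`, `IndexMap` and `FirstMismatch` are hypothetical; the real check would compare the block index entries loaded from both stores, keyed by block hash.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for a block index entry.
struct IndexRecord {
    int32_t height;
    uint32_t status;
    bool operator==(const IndexRecord& other) const
    {
        return height == other.height && status == other.status;
    }
};
using IndexMap = std::map<std::string, IndexRecord>; // keyed by block hash

// Returns the first key that is missing, extra, or different between the two
// stores, or std::nullopt if the migrated store matches the legacy one.
std::optional<std::string> FirstMismatch(const IndexMap& legacy, const IndexMap& migrated)
{
    for (const auto& [hash, record] : legacy) {
        const auto it = migrated.find(hash);
        if (it == migrated.end() || !(it->second == record)) return hash;
    }
    for (const auto& [hash, record] : migrated) {
        if (legacy.find(hash) == legacy.end()) return hash;
    }
    return std::nullopt;
}
```

The legacy database would only be deleted when this returns `std::nullopt`; otherwise the migration could abort and leave the old store in place.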

sedited commented at 9:10 AM on June 17, 2026:

Yeah, why not.

willcl-ark commented at 8:55 AM on June 17, 2026: member

Going through the recent force-push now. Mostly looking pretty good to me. One question about double-checking the migrated store before we delete the old one.

l0rinc commented at 9:03 AM on June 17, 2026: contributor

I'm planning on reviewing this in detail after we're finished with other UTXO storage related changes. Is my understanding correct that this is still in the RFC stage and is not expected to be merged anytime soon, i.e. basically a draft? Are we still experimenting with the format (e.g. try storing blocks in separate files), or is it urgent that we spend time on this?

sedited commented at 9:39 AM on June 17, 2026: contributor

> I'm planning on reviewing this in detail after we're finished with other UTXO storage related changes. Is my understanding correct that this is still in the RFC stage and is not expected to be merged anytime soon, i.e. basically a draft? Are we still experimenting with the format (e.g. try storing blocks in separate files), or is it urgent that we spend time on this?

In principle, anything that is not a bug is not urgent. I have also asked the other maintainers to apply a high bar for review on this change before merging. I think for this particular approach the changes are at a point where they are ready for review.

I have so far explored using a one-block-one-file approach, which also removes the need for persisting the header chain in a separate index. Reading the headers from a million files is very slow. Startup becomes slower by over an order of magnitude, which I don't think is acceptable. The alternative is keeping the current index, but still storing blocks in disjoint files. I'm just not fully convinced that such an approach is the best route for this project. We don't know what the effects of suddenly having millions of files (an increase of over two orders of magnitude) are going to be. My impression is that the current approach with pre-allocating flat files, amortizing read costs and having file cursors is pretty close to ideal performance-wise. The approach here is probably the least intrusive way to achieve the goals laid out in the pull request description. If reviewers would nevertheless like me to fully flesh out such an implementation, I can still do that.

in src/kernel/blocktreestorage.cpp:68 in bbc9f2b2a1

  63 | +    return strprintf("CBlockFileInfo(blocks=%u, size=%u, heights=%u...%u, time=%s...%s)", nBlocks, nSize, nHeightFirst, nHeightLast, FormatISO8601Date(nTimeFirst), FormatISO8601Date(nTimeLast));
  64 | +}
  65 | +
  66 | +static int64_t CalculateBlockFileInfoPosition(int file_index)
  67 | +{
  68 | +    Assume(file_index >= 0);

stickies-v commented at 1:14 PM on June 17, 2026:

nit: seems like this should be assert?

in src/kernel/blocktreestorage.cpp:107 in bbc9f2b2a1

 102 | +{
 103 | +    file << magic;
 104 | +    file << version;
 105 | +}
 106 | +
 107 | +static void ReadAndCheckMagicAndVersion(AutoFile& file, const fs::path& path, uint32_t magic_expected, uint32_t version_expected)

stickies-v commented at 5:44 PM on June 17, 2026:

nit: I think OpenFileAndVerifyHeader would help clean up the interface around opening files and verifying them, and would make it harder to forget to verify the header (such as in e.g. LoadBlockIndexGuts)

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index e8e153902f..1580a432c3 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -104,16 +104,6 @@ static void WriteMagicAndVersion(AutoFile& file, uint32_t magic, uint32_t versio
     file << version;
 }
 
-static void ReadAndCheckMagicAndVersion(AutoFile& file, const fs::path& path, uint32_t magic_expected, uint32_t version_expected)
-{
-    if (auto magic{ser_readdata32(file)}; magic != magic_expected) {
-        throw BlockTreeStoreError(strprintf("Invalid magic in %s: 0x%08x (expected: 0x%08x)", fs::PathToString(path), magic, magic_expected));
-    }
-    if (auto version{ser_readdata32(file)}; version != version_expected) {
-        throw BlockTreeStoreError(strprintf("Invalid version in %s: 0x%08x (expected: 0x%08x)", fs::PathToString(path), version, version_expected));
-    }
-}
-
 static AutoFile OpenFile(const fs::path& path, const std::string& mode)
 {
     AutoFile file{fsbridge::fopen(path, mode.c_str())};
@@ -123,6 +113,19 @@ static AutoFile OpenFile(const fs::path& path, const std::string& mode)
     return AutoFile{file.release()};
 }
 
+/** Open a file, verify its magic and version header, and seek just past it. */
+static AutoFile OpenFileAndVerifyHeader(const fs::path& path, const std::string& mode, uint32_t magic, uint32_t version)
+{
+    auto file{OpenFile(path, mode)};
+    if (auto file_magic{ser_readdata32(file)}; file_magic != magic) {
+        throw BlockTreeStoreError(strprintf("Invalid magic in %s: 0x%08x (expected: 0x%08x)", fs::PathToString(path), file_magic, magic));
+    }
+    if (auto file_version{ser_readdata32(file)}; file_version != version) {
+        throw BlockTreeStoreError(strprintf("Invalid version in %s: 0x%08x (expected: 0x%08x)", fs::PathToString(path), file_version, version));
+    }
+    return AutoFile{file.release()};
+}
+
 static void CreateDataFile(const fs::path& path, uint32_t magic, uint32_t version)
 {
     auto file{OpenFile(path, "wb")};
@@ -137,12 +140,6 @@ static void CreateDataFile(const fs::path& path, uint32_t magic, uint32_t versio
     }
 }
 
-void BlockTreeStore::OpenAndCheckMagicAndVersion(const fs::path& path, uint32_t magic_expected, uint32_t version_expected) const
-{
-    auto file{OpenFile(path, "rb")};
-    ReadAndCheckMagicAndVersion(file, path, magic_expected, version_expected);
-}
-
 BlockTreeStore::BlockTreeStore(const fs::path& path, bool wipe_data)
     : m_header_file_path{path / HEADER_FILE_NAME},
       m_log_file_path{path / LOG_FILE_NAME},
@@ -172,8 +169,8 @@ BlockTreeStore::BlockTreeStore(const fs::path& path, bool wipe_data)
         CreateDataFile(m_header_file_path, HEADER_FILE_MAGIC, HEADER_FILE_VERSION);
         CreateDataFile(m_block_files_file_path, BLOCK_FILES_FILE_MAGIC, BLOCK_FILES_FILE_VERSION);
     }
-    OpenAndCheckMagicAndVersion(m_header_file_path, HEADER_FILE_MAGIC, HEADER_FILE_VERSION);
-    OpenAndCheckMagicAndVersion(m_block_files_file_path, BLOCK_FILES_FILE_MAGIC, BLOCK_FILES_FILE_VERSION);
+    (void)OpenFileAndVerifyHeader(m_header_file_path, "rb", HEADER_FILE_MAGIC, HEADER_FILE_VERSION);
+    (void)OpenFileAndVerifyHeader(m_block_files_file_path, "rb", BLOCK_FILES_FILE_MAGIC, BLOCK_FILES_FILE_VERSION);
     (void)ApplyLog(); // Missing or incomplete logs are safe to ignore; apply failures throw.
 }
 
@@ -206,7 +203,7 @@ void BlockTreeStore::WriteReindexing(bool reindexing) const
 void BlockTreeStore::ReadLastBlockFile(int32_t& last_block_file) const
 {
     LOCK(m_mutex);
-    auto file{OpenFile(m_block_files_file_path, "rb")};
+    auto file{OpenFileAndVerifyHeader(m_block_files_file_path, "rb", BLOCK_FILES_FILE_MAGIC, BLOCK_FILES_FILE_VERSION)};
 
     constexpr int64_t entry_size = BLOCK_FILE_INFO_WRAPPER_SIZE + CHECKSUM_SIZE;
     const int64_t file_data_size{file.size() - BLOCK_FILES_FILE_DATA_START_POSITION};
@@ -301,7 +298,7 @@ static void ReadDataValue(AutoFile& file, std::span<std::byte> value_buffer)
 bool BlockTreeStore::ReadBlockFileInfo(int file_index, CBlockFileInfo& info)
 {
     LOCK(m_mutex);
-    auto file{OpenFile(m_block_files_file_path, "rb")};
+    auto file{OpenFileAndVerifyHeader(m_block_files_file_path, "rb", BLOCK_FILES_FILE_MAGIC, BLOCK_FILES_FILE_VERSION)};
     file.seek(CalculateBlockFileInfoPosition(file_index), SEEK_SET);
 
     BlockFileInfoWrapper info_wrapper;
@@ -332,9 +329,7 @@ bool BlockTreeStore::ApplyLog() const
         return false;
     }
 
-    auto log_file{OpenFile(m_log_file_path, "rb")};
-
-    ReadAndCheckMagicAndVersion(log_file, m_log_file_path, LOG_FILE_MAGIC, LOG_FILE_VERSION);
+    auto log_file{OpenFileAndVerifyHeader(m_log_file_path, "rb", LOG_FILE_MAGIC, LOG_FILE_VERSION)};
 
     uint32_t rolling_checksum = 0;
     uint32_t stored_rolling_checksum = 0;
@@ -448,7 +443,7 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
     // Read the header data end position
     int64_t header_data_end;
     {
-        auto header_file{OpenFile(m_header_file_path, "rb")};
+        auto header_file{OpenFileAndVerifyHeader(m_header_file_path, "rb", HEADER_FILE_MAGIC, HEADER_FILE_VERSION)};
         header_data_end = header_file.size();
     }
 
@@ -496,10 +491,9 @@ bool BlockTreeStore::LoadBlockIndexGuts(
     AssertLockHeld(::cs_main);
     LOCK(m_mutex);
 
-    auto file{OpenFile(m_header_file_path, "rb")};
+    auto file{OpenFileAndVerifyHeader(m_header_file_path, "rb", HEADER_FILE_MAGIC, HEADER_FILE_VERSION)};
 
     int64_t data_end_position = file.size();
-    file.seek(HEADER_FILE_DATA_START_POSITION, SEEK_SET);
 
     DiskBlockIndexWrapper disk_index;
     std::array<std::byte, DISK_BLOCK_INDEX_WRAPPER_SIZE> buffer;
diff --git a/src/kernel/blocktreestorage.h b/src/kernel/blocktreestorage.h
index feedcb0363..2514201243 100644
--- a/src/kernel/blocktreestorage.h
+++ b/src/kernel/blocktreestorage.h
@@ -138,8 +138,6 @@ private:
 
     mutable Mutex m_mutex;
 
-    void OpenAndCheckMagicAndVersion(const fs::path& path, uint32_t magic_expected, uint32_t version_expected) const EXCLUSIVE_LOCKS_REQUIRED(m_mutex);
-
     void WriteFlag(const fs::path& path, bool value) const;
 
     /**


in src/kernel/blocktreestorage.cpp:232 in bbc9f2b2a1

 227 | +}
 228 | +
 229 | +static uint32_t ExtendChecksum(uint32_t checksum, std::span<const std::byte> value_data, int64_t position)
 230 | +{
 231 | +    checksum = crc32c::Extend(checksum, UCharCast(value_data.data()), value_data.size());
 232 | +    std::array<std::byte, FILE_POSITION_SIZE> position_bytes;

stickies-v commented at 6:54 PM on June 17, 2026:

nit: if we alias the position and checksum types, we can use those aliases and derive the sizes from them? It's a bit awkward to have an int64_t position parameter and then use FILE_POSITION_SIZE in the std::array declaration, when they refer to the same thing.

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index e8e153902f..bbea1f30b1 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -33,10 +33,11 @@
 
 namespace kernel {
 
+using Checksum = uint32_t;
+using FilePosition = int64_t;
+
 static constexpr uint8_t BLOCK_FILE_INFO_WRAPPER_SIZE{36};
 static constexpr uint8_t DISK_BLOCK_INDEX_WRAPPER_SIZE{104};
-static constexpr size_t CHECKSUM_SIZE{sizeof(uint32_t)};
-static constexpr size_t FILE_POSITION_SIZE{sizeof(int64_t)};
 
 /** A wrapper for creating a constant-sized serialization without varint encoding */
 struct BlockFileInfoWrapper : CBlockFileInfo {
@@ -63,10 +64,10 @@ std::string CBlockFileInfo::ToString() const
     return strprintf("CBlockFileInfo(blocks=%u, size=%u, heights=%u...%u, time=%s...%s)", nBlocks, nSize, nHeightFirst, nHeightLast, FormatISO8601Date(nTimeFirst), FormatISO8601Date(nTimeLast));
 }
 
-static int64_t CalculateBlockFileInfoPosition(int file_index)
+static FilePosition CalculateBlockFileInfoPosition(int file_index)
 {
     Assume(file_index >= 0);
-    return BLOCK_FILES_FILE_DATA_START_POSITION + file_index * (BLOCK_FILE_INFO_WRAPPER_SIZE + CHECKSUM_SIZE);
+    return BLOCK_FILES_FILE_DATA_START_POSITION + file_index * (BLOCK_FILE_INFO_WRAPPER_SIZE + sizeof(Checksum));
 }
 
 const fs::path& BlockTreeStore::GetDataFilePath(ValueType value_type) const
@@ -208,7 +209,7 @@ void BlockTreeStore::ReadLastBlockFile(int32_t& last_block_file) const
     LOCK(m_mutex);
     auto file{OpenFile(m_block_files_file_path, "rb")};
 
-    constexpr int64_t entry_size = BLOCK_FILE_INFO_WRAPPER_SIZE + CHECKSUM_SIZE;
+    constexpr int64_t entry_size = BLOCK_FILE_INFO_WRAPPER_SIZE + sizeof(Checksum);
     const int64_t file_data_size{file.size() - BLOCK_FILES_FILE_DATA_START_POSITION};
     if (file_data_size < 0 || file_data_size % entry_size != 0) {
         throw BlockTreeStoreError("Invalid block files file data");
@@ -226,15 +227,15 @@ void BlockTreeStore::WritePruned(bool pruned) const
     WriteFlag(m_prune_flag_file_path, pruned);
 }
 
-static uint32_t ExtendChecksum(uint32_t checksum, std::span<const std::byte> value_data, int64_t position)
+static Checksum ExtendChecksum(Checksum checksum, std::span<const std::byte> value_data, FilePosition position)
 {
     checksum = crc32c::Extend(checksum, UCharCast(value_data.data()), value_data.size());
-    std::array<std::byte, FILE_POSITION_SIZE> position_bytes;
+    std::array<std::byte, sizeof(FilePosition)> position_bytes;
     WriteLE64(UCharCast(position_bytes.data()), static_cast<uint64_t>(position));
     return crc32c::Extend(checksum, UCharCast(position_bytes.data()), position_bytes.size());
 }
 
-static uint32_t Checksum(std::span<const std::byte> value_data, int64_t position)
+static Checksum ComputeChecksum(std::span<const std::byte> value_data, FilePosition position)
 {
     return ExtendChecksum(0, value_data, position);
 }
@@ -255,21 +256,21 @@ static std::pair<ValueType, uint64_t> ReadLogFileSectionHeader(AutoFile& log_fil
 
 struct LogFileRecord {
     std::vector<std::byte> m_value_buffer;
-    int64_t m_position;
-    uint32_t m_checksum;
+    FilePosition m_position;
+    Checksum m_checksum;
 
     LogFileRecord(ValueType value_type) : m_value_buffer(ValueSize(value_type)) {}
 };
 
-static void ReadLogFileRecord(AutoFile& log_file, LogFileRecord& record, uint32_t& rolling_checksum)
+static void ReadLogFileRecord(AutoFile& log_file, LogFileRecord& record, Checksum& rolling_checksum)
 {
     log_file.read(record.m_value_buffer);
     log_file >> record.m_position;
 
-    record.m_checksum = Checksum(record.m_value_buffer, record.m_position);
+    record.m_checksum = ComputeChecksum(record.m_value_buffer, record.m_position);
     rolling_checksum = ExtendChecksum(rolling_checksum, record.m_value_buffer, record.m_position);
 
-    uint32_t stored_checksum;
+    Checksum stored_checksum;
     log_file >> stored_checksum;
     if (stored_checksum != record.m_checksum) {
         throw BlockTreeStoreError("Detected on-disk log file corruption: Checksum mismatch");
@@ -277,10 +278,10 @@ static void ReadLogFileRecord(AutoFile& log_file, LogFileRecord& record, uint32_
 }
 
 template <typename Wrapper>
-static void WriteLogFileRecord(AutoFile& log_file, std::span<std::byte> value_buffer, const Wrapper& wrapper, int64_t position, uint32_t& rolling_checksum)
+static void WriteLogFileRecord(AutoFile& log_file, std::span<std::byte> value_buffer, const Wrapper& wrapper, FilePosition position, Checksum& rolling_checksum)
 {
     SpanWriter{value_buffer} << wrapper;
-    const uint32_t checksum{Checksum(value_buffer, position)};
+    const Checksum checksum{ComputeChecksum(value_buffer, position)};
     rolling_checksum = ExtendChecksum(rolling_checksum, value_buffer, position);
     log_file.write(value_buffer);
     log_file << position;
@@ -289,11 +290,11 @@ static void WriteLogFileRecord(AutoFile& log_file, std::span<std::byte> value_bu
 
 static void ReadDataValue(AutoFile& file, std::span<std::byte> value_buffer)
 {
-    const int64_t position{file.tell()};
+    const FilePosition position{file.tell()};
     file.read(value_buffer);
-    uint32_t checksum;
+    Checksum checksum;
     file >> checksum;
-    if (Checksum(value_buffer, position) != checksum) {
+    if (ComputeChecksum(value_buffer, position) != checksum) {
         throw BlockTreeStoreError("Record data failed integrity check");
     }
 }
@@ -336,8 +337,8 @@ bool BlockTreeStore::ApplyLog() const
 
     ReadAndCheckMagicAndVersion(log_file, m_log_file_path, LOG_FILE_MAGIC, LOG_FILE_VERSION);
 
-    uint32_t rolling_checksum = 0;
-    uint32_t stored_rolling_checksum = 0;
+    Checksum rolling_checksum = 0;
+    Checksum stored_rolling_checksum = 0;
     uint32_t number_of_types = 0;
 
     // Do a dry run to check the integrity of the log file. This should help prevent cascading errors in case of log file corruption.
@@ -419,7 +420,7 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
 
     if (file_info.empty() && block_info.empty()) return;
 
-    std::vector<std::pair<CBlockIndex*, int64_t>> pending_header_positions;
+    std::vector<std::pair<CBlockIndex*, FilePosition>> pending_header_positions;
     pending_header_positions.reserve(block_info.size());
 
     // Use a write-ahead log file that gets atomically flushed to the target files.
@@ -431,7 +432,7 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
     log_file << log_num_types;
 
     std::array<std::byte, BLOCK_FILE_INFO_WRAPPER_SIZE> block_file_info_value_buffer;
-    uint32_t rolling_checksum = 0;
+    Checksum rolling_checksum = 0;
 
     // Write the file_info entries to the log
     WriteLogFileSectionHeader(log_file, ValueType::BLOCK_FILE_INFO, file_info.size());
@@ -446,7 +447,7 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
     }
 
     // Read the header data end position
-    int64_t header_data_end;
+    FilePosition header_data_end;
     {
         auto header_file{OpenFile(m_header_file_path, "rb")};
         header_data_end = header_file.size();
@@ -456,12 +457,12 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
     WriteLogFileSectionHeader(log_file, ValueType::DISK_BLOCK_INDEX, block_info.size());
     std::array<std::byte, DISK_BLOCK_INDEX_WRAPPER_SIZE> block_index_value_buffer;
     for (CBlockIndex* block_index : block_info) {
-        int64_t position = block_index->header_pos == CBlockIndex::UNSET_HEADER_POS ? header_data_end : block_index->header_pos;
+        FilePosition position = block_index->header_pos == CBlockIndex::UNSET_HEADER_POS ? header_data_end : block_index->header_pos;
         auto disk_index{CDiskBlockIndex{block_index}};
         WriteLogFileRecord(log_file, block_index_value_buffer, DiskBlockIndexWrapper{&disk_index}, position, rolling_checksum);
         if (block_index->header_pos == CBlockIndex::UNSET_HEADER_POS) {
             pending_header_positions.emplace_back(block_index, header_data_end);
-            header_data_end += DISK_BLOCK_INDEX_WRAPPER_SIZE + CHECKSUM_SIZE;
+            header_data_end += DISK_BLOCK_INDEX_WRAPPER_SIZE + sizeof(Checksum);
         }
     }
 
@@ -498,7 +499,7 @@ bool BlockTreeStore::LoadBlockIndexGuts(
 
     auto file{OpenFile(m_header_file_path, "rb")};
 
-    int64_t data_end_position = file.size();
+    FilePosition data_end_position = file.size();
     file.seek(HEADER_FILE_DATA_START_POSITION, SEEK_SET);
 
     DiskBlockIndexWrapper disk_index;


in src/kernel/blocktreestorage.cpp:280 in bbc9f2b2a1

 275 | +        throw BlockTreeStoreError("Detected on-disk log file corruption: Checksum mismatch");
 276 | +    }
 277 | +}
 278 | +
 279 | +template <typename Wrapper>
 280 | +static void WriteLogFileRecord(AutoFile& log_file, std::span<std::byte> value_buffer, const Wrapper& wrapper, int64_t position, uint32_t& rolling_checksum)

stickies-v commented at 1:00 PM on June 18, 2026:

nit: if we assign SERIALIZED_SIZE members to the Wrapper types, we:

- keep it closer to where the serialization is defined
- are able to use it at compile time
- can avoid leaking value_buffer into the function signature here

diff --git a/src/chain.h b/src/chain.h
index d40daa05c8..bfd7808288 100644
--- a/src/chain.h
+++ b/src/chain.h
@@ -383,6 +383,8 @@ public:
 
 /** A wrapper for creating a constant-sized serialization without varint encoding */
 struct DiskBlockIndexWrapper : CDiskBlockIndex {
+    static constexpr size_t SERIALIZED_SIZE{104};
+
     DiskBlockIndexWrapper() = default;
 
     explicit DiskBlockIndexWrapper(const CDiskBlockIndex* pindex) : CDiskBlockIndex(*pindex)
diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index e8e153902f..5512fbc715 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -33,13 +33,13 @@
 
 namespace kernel {
 
-static constexpr uint8_t BLOCK_FILE_INFO_WRAPPER_SIZE{36};
-static constexpr uint8_t DISK_BLOCK_INDEX_WRAPPER_SIZE{104};
 static constexpr size_t CHECKSUM_SIZE{sizeof(uint32_t)};
 static constexpr size_t FILE_POSITION_SIZE{sizeof(int64_t)};
 
 /** A wrapper for creating a constant-sized serialization without varint encoding */
 struct BlockFileInfoWrapper : CBlockFileInfo {
+    static constexpr size_t SERIALIZED_SIZE{36};
+
     BlockFileInfoWrapper() = default;
 
     explicit BlockFileInfoWrapper(const CBlockFileInfo* info) : CBlockFileInfo(*info)
@@ -66,7 +66,7 @@ std::string CBlockFileInfo::ToString() const
 static int64_t CalculateBlockFileInfoPosition(int file_index)
 {
     Assume(file_index >= 0);
-    return BLOCK_FILES_FILE_DATA_START_POSITION + file_index * (BLOCK_FILE_INFO_WRAPPER_SIZE + CHECKSUM_SIZE);
+    return BLOCK_FILES_FILE_DATA_START_POSITION + file_index * (BlockFileInfoWrapper::SERIALIZED_SIZE + CHECKSUM_SIZE);
 }
 
 const fs::path& BlockTreeStore::GetDataFilePath(ValueType value_type) const
@@ -84,9 +84,9 @@ static uint8_t ValueSize(const ValueType value_type)
 {
     switch (value_type) {
     case ValueType::BLOCK_FILE_INFO:
-        return BLOCK_FILE_INFO_WRAPPER_SIZE;
+        return BlockFileInfoWrapper::SERIALIZED_SIZE;
     case ValueType::DISK_BLOCK_INDEX:
-        return DISK_BLOCK_INDEX_WRAPPER_SIZE;
+        return DiskBlockIndexWrapper::SERIALIZED_SIZE;
     }
     throw BlockTreeStoreError(strprintf("Unrecognized value type (%u) in block tree store", static_cast<std::underlying_type_t<ValueType>>(value_type)));
 }
@@ -151,8 +151,8 @@ BlockTreeStore::BlockTreeStore(const fs::path& path, bool wipe_data)
       m_reindex_flag_file_path{path / REINDEX_FLAG_FILE_NAME},
       m_prune_flag_file_path{path / PRUNE_FLAG_FILE_NAME}
 {
-    assert(GetSerializeSize(DiskBlockIndexWrapper{}) == DISK_BLOCK_INDEX_WRAPPER_SIZE);
-    assert(GetSerializeSize(BlockFileInfoWrapper{}) == BLOCK_FILE_INFO_WRAPPER_SIZE);
+    assert(GetSerializeSize(DiskBlockIndexWrapper{}) == DiskBlockIndexWrapper::SERIALIZED_SIZE);
+    assert(GetSerializeSize(BlockFileInfoWrapper{}) == BlockFileInfoWrapper::SERIALIZED_SIZE);
     LOCK(m_mutex);
     fs::create_directories(path);
     if (wipe_data) {
@@ -208,7 +208,7 @@ void BlockTreeStore::ReadLastBlockFile(int32_t& last_block_file) const
     LOCK(m_mutex);
     auto file{OpenFile(m_block_files_file_path, "rb")};
 
-    constexpr int64_t entry_size = BLOCK_FILE_INFO_WRAPPER_SIZE + CHECKSUM_SIZE;
+    constexpr int64_t entry_size = BlockFileInfoWrapper::SERIALIZED_SIZE + CHECKSUM_SIZE;
     const int64_t file_data_size{file.size() - BLOCK_FILES_FILE_DATA_START_POSITION};
     if (file_data_size < 0 || file_data_size % entry_size != 0) {
         throw BlockTreeStoreError("Invalid block files file data");
@@ -277,8 +277,9 @@ static void ReadLogFileRecord(AutoFile& log_file, LogFileRecord& record, uint32_
 }
 
 template <typename Wrapper>
-static void WriteLogFileRecord(AutoFile& log_file, std::span<std::byte> value_buffer, const Wrapper& wrapper, int64_t position, uint32_t& rolling_checksum)
+static void WriteLogFileRecord(AutoFile& log_file, const Wrapper& wrapper, int64_t position, uint32_t& rolling_checksum)
 {
+    std::array<std::byte, Wrapper::SERIALIZED_SIZE> value_buffer;
     SpanWriter{value_buffer} << wrapper;
     const uint32_t checksum{Checksum(value_buffer, position)};
     rolling_checksum = ExtendChecksum(rolling_checksum, value_buffer, position);
@@ -305,7 +306,7 @@ bool BlockTreeStore::ReadBlockFileInfo(int file_index, CBlockFileInfo& info)
     file.seek(CalculateBlockFileInfoPosition(file_index), SEEK_SET);
 
     BlockFileInfoWrapper info_wrapper;
-    std::array<std::byte, BLOCK_FILE_INFO_WRAPPER_SIZE> buffer;
+    std::array<std::byte, BlockFileInfoWrapper::SERIALIZED_SIZE> buffer;
 
     try {
         ReadDataValue(file, buffer);
@@ -430,13 +431,12 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
     constexpr uint32_t log_num_types{2}; // We are writing two different types to the log file.
     log_file << log_num_types;
 
-    std::array<std::byte, BLOCK_FILE_INFO_WRAPPER_SIZE> block_file_info_value_buffer;
     uint32_t rolling_checksum = 0;
 
     // Write the file_info entries to the log
     WriteLogFileSectionHeader(log_file, ValueType::BLOCK_FILE_INFO, file_info.size());
     for (const auto& [file, info] : file_info) {
-        WriteLogFileRecord(log_file, block_file_info_value_buffer, BlockFileInfoWrapper{info}, CalculateBlockFileInfoPosition(file), rolling_checksum);
+        WriteLogFileRecord(log_file, BlockFileInfoWrapper{info}, CalculateBlockFileInfoPosition(file), rolling_checksum);
     }
 
     // TEST ONLY
@@ -454,14 +454,13 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
 
     // Write the block_info data to the log
     WriteLogFileSectionHeader(log_file, ValueType::DISK_BLOCK_INDEX, block_info.size());
-    std::array<std::byte, DISK_BLOCK_INDEX_WRAPPER_SIZE> block_index_value_buffer;
     for (CBlockIndex* block_index : block_info) {
         int64_t position = block_index->header_pos == CBlockIndex::UNSET_HEADER_POS ? header_data_end : block_index->header_pos;
         auto disk_index{CDiskBlockIndex{block_index}};
-        WriteLogFileRecord(log_file, block_index_value_buffer, DiskBlockIndexWrapper{&disk_index}, position, rolling_checksum);
+        WriteLogFileRecord(log_file, DiskBlockIndexWrapper{&disk_index}, position, rolling_checksum);
         if (block_index->header_pos == CBlockIndex::UNSET_HEADER_POS) {
             pending_header_positions.emplace_back(block_index, header_data_end);
-            header_data_end += DISK_BLOCK_INDEX_WRAPPER_SIZE + CHECKSUM_SIZE;
+            header_data_end += DiskBlockIndexWrapper::SERIALIZED_SIZE + CHECKSUM_SIZE;
         }
     }
 
@@ -502,7 +501,7 @@ bool BlockTreeStore::LoadBlockIndexGuts(
     file.seek(HEADER_FILE_DATA_START_POSITION, SEEK_SET);
 
     DiskBlockIndexWrapper disk_index;
-    std::array<std::byte, DISK_BLOCK_INDEX_WRAPPER_SIZE> buffer;
+    std::array<std::byte, DiskBlockIndexWrapper::SERIALIZED_SIZE> buffer;
 
     while (file.tell() < data_end_position) {
         if (interrupt) return false;

</details>

in src/kernel/blocktreestorage.cpp:393 in bbc9f2b2a1 outdated

 388 | +                return false;
 389 | +            }
 390 | +        }
 391 | +
 392 | +        if (!data_file.Commit()) {
 393 | +            throw BlockTreeStoreError(strprintf("Failed to commit write to data file %s", PathToString(data_file_path)));

stickies-v commented at 3:47 PM on June 19, 2026:

I think we need to (void)data_file.fclose() before throwing to avoid hitting the Assume(IsNull()) in AutoFile dtor?

sedited commented at 8:26 PM on June 23, 2026:

Thought a bit about this and I think it might actually be better to hit the assume? If this happens in a debug build, I'm not sure that it would make things materially worse to debug for us. It is kind of hard to ensure that we are not throwing on any of these operations. Basically we would have to wrap all of this in another try/catch that then closes the file cleanly. Are we doing that consistently for all the other AutoFile sites? This API is kind of evil.

stickies-v commented at 8:27 PM on June 25, 2026:

I agree the API is bad, and I suppose it doesn't make a huge difference if we abort via Assume or via throwing here, these are all pretty bad failures. I just think it's a bit weird that debug and non-debug builds have different failure modes, but no biggie either way.
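
A minimal sketch of the two failure paths discussed in this thread, using a toy type rather than the real AutoFile. `ToyFile`, `CommitAlwaysFails` and `WriteAndCommit` are hypothetical names introduced only for illustration; the counter stands in for whatever check the real destructor performs when an object is destroyed while still open.

```cpp
#include <cassert>
#include <stdexcept>

// Toy stand-in (NOT the real AutoFile): it counts how often an object is
// destroyed while still open, where a real wrapper might trip a debug check.
struct ToyFile {
    static inline int destroyed_while_open{0};
    bool open{true};

    ToyFile() = default;
    ToyFile(const ToyFile&) = delete;
    ToyFile& operator=(const ToyFile&) = delete;

    void close() { open = false; }
    ~ToyFile()
    {
        if (open) ++destroyed_while_open;
    }
};

// Hypothetical commit step that always reports failure.
bool CommitAlwaysFails(ToyFile&) { return false; }

// Throws on commit failure; optionally closes the file first so that the
// destructor does not observe an open file during stack unwinding.
void WriteAndCommit(bool close_before_throw)
{
    ToyFile file;
    if (!CommitAlwaysFails(file)) {
        if (close_before_throw) file.close();
        throw std::runtime_error("Failed to commit write to data file");
    }
    file.close();
}
```

With `close_before_throw` set, the exception is the only visible failure in every build type. Without it, the destructor-side check fires as well, which is the debug versus non-debug difference raised above.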

in src/kernel/blocktreestorage.cpp:98 in bbc9f2b2a1

  93 | +
  94 | +static ValueType ReadValueType(AutoFile& file)
  95 | +{
  96 | +    std::underlying_type_t<ValueType> raw;
  97 | +    file >> raw;
  98 | +    return static_cast<ValueType>(raw);

stickies-v commented at 3:54 PM on June 19, 2026:

nit: would make sense to do input validation upon parsing it, rather than relying on downstream to do it?

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index e8e153902f..20d40609e1 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -77,7 +77,7 @@ const fs::path& BlockTreeStore::GetDataFilePath(ValueType value_type) const
     case ValueType::DISK_BLOCK_INDEX:
         return m_header_file_path;
     }
-    throw BlockTreeStoreError(strprintf("Unrecognized value type (%u) in block tree store", static_cast<std::underlying_type_t<ValueType>>(value_type)));
+    assert(false);
 }
 
 static uint8_t ValueSize(const ValueType value_type)
@@ -88,14 +88,20 @@ static uint8_t ValueSize(const ValueType value_type)
     case ValueType::DISK_BLOCK_INDEX:
         return DISK_BLOCK_INDEX_WRAPPER_SIZE;
     }
-    throw BlockTreeStoreError(strprintf("Unrecognized value type (%u) in block tree store", static_cast<std::underlying_type_t<ValueType>>(value_type)));
+    assert(false);
 }
 
 static ValueType ReadValueType(AutoFile& file)
 {
     std::underlying_type_t<ValueType> raw;
     file >> raw;
-    return static_cast<ValueType>(raw);
+
+    switch (auto value_type{static_cast<ValueType>(raw)}) {
+    case kernel::ValueType::BLOCK_FILE_INFO:
+    case kernel::ValueType::DISK_BLOCK_INDEX:
+        return value_type;
+    }
+    throw BlockTreeStoreError(strprintf("Unrecognized value type (%u) in block tree store", raw));
 }
 
 static void WriteMagicAndVersion(AutoFile& file, uint32_t magic, uint32_t version)

</details>

in src/node/blockstorage.cpp:1248 in bbc9f2b2a1

1243 | +                }
1244 | +                files.emplace_back(i, info);
1245 | +            }
1246 | +
1247 | +            if (!block_tree_db->LoadBlockIndexGuts(
1248 | +                    GetConsensus(), [this](const uint256& hash) EXCLUSIVE_LOCKS_REQUIRED(cs_main) { return this->InsertBlockIndex(hash); }, m_interrupt)) {

stickies-v commented at 4:23 PM on June 19, 2026:

Using m_block_index for migration seems like a boundary violation. I think we can use a local map here, and avoid the m_block_index.clear() call later?

diff --git a/src/node/blockstorage.cpp b/src/node/blockstorage.cpp
index c17248c2eb..79dc9fbfe0 100644
--- a/src/node/blockstorage.cpp
+++ b/src/node/blockstorage.cpp
@@ -1224,9 +1224,17 @@ std::unique_ptr<kernel::BlockTreeStore> BlockManager::CreateAndMigrateBlockTree(
     int max_blockfile_num{0};
     bool reindexing{false};
     bool pruned_block_files{false};
+    BlockMap migration_index;
 
     {
         LogInfo("Migrating leveldb block tree db to new block tree store.");
+        auto insert_block_index{[&](const uint256& hash) EXCLUSIVE_LOCKS_REQUIRED(::cs_main) -> CBlockIndex* {
+            if (hash.IsNull()) return nullptr;
+            const auto [mi, inserted]{migration_index.try_emplace(hash)};
+            CBlockIndex* pindex{&mi->second};
+            if (inserted) pindex->phashBlock = &mi->first;
+            return pindex;
+        }};
         try {
             DBParams params{};
             params.path = m_opts.block_tree_dir;
@@ -1244,8 +1252,7 @@ std::unique_ptr<kernel::BlockTreeStore> BlockManager::CreateAndMigrateBlockTree(
                 files.emplace_back(i, info);
             }
 
-            if (!block_tree_db->LoadBlockIndexGuts(
-                    GetConsensus(), [this](const uint256& hash) EXCLUSIVE_LOCKS_REQUIRED(cs_main) { return this->InsertBlockIndex(hash); }, m_interrupt)) {
+            if (!block_tree_db->LoadBlockIndexGuts(GetConsensus(), insert_block_index, m_interrupt)) {
                 throw std::runtime_error("Failed to load block index guts");
             }
             block_tree_db->ReadReindexing(reindexing);
@@ -1268,9 +1275,9 @@ std::unique_ptr<kernel::BlockTreeStore> BlockManager::CreateAndMigrateBlockTree(
             dump_files.emplace_back(file.first, &file.second);
         }
         std::vector<CBlockIndex*> dump_blockindexes;
-        dump_blockindexes.reserve(m_block_index.size());
-        for (auto& pair : m_block_index) {
-            dump_blockindexes.push_back(&pair.second);
+        dump_blockindexes.reserve(migration_index.size());
+        for (auto& [hash, index] : migration_index) {
+            dump_blockindexes.push_back(&index);
         }
 
         block_tree_store->WriteBatchSync(dump_files, dump_blockindexes);
@@ -1278,11 +1285,9 @@ std::unique_ptr<kernel::BlockTreeStore> BlockManager::CreateAndMigrateBlockTree(
 
     // Re-open to ensure that the migration was successful
     auto block_tree_store{std::make_unique<kernel::BlockTreeStore>(m_opts.block_tree_dir)};
-    // Clear m_block_index so it can be repopulated normally during LoadBlockIndexDB.
     cleanup_leveldb();
 
     LogInfo("   Successfully migrated the leveldb block tree db to new block tree store.");
-    m_block_index.clear();
 
     return block_tree_store;
 }

</details>

in src/kernel/blocktreestorage.cpp:418 in bbc9f2b2a1

 413 | +    AssertLockHeld(::cs_main);
 414 | +    LOCK(m_mutex);
 415 | +
 416 | +    // If there is a complete log waiting to be applied, write that first. An incomplete log is discarded.
 417 | +    // This may occur if a previous write threw an exception when writing the logged data to the .dat files.
 418 | +    (void)ApplyLog();

stickies-v commented at 11:19 AM on June 22, 2026:

nit: IIUC, leftover logs can only happen in case of unexpected behaviour, such as unclean/sudden shutdown, (temporary/intermittent) hardware/system issues, etc. It probably makes sense to log (here, and in BlockTreeStore ctor) when ApplyLog returns true, e.g. LogWarning("Applied block tree store write-ahead log left over from a previous failure, potentially caused by unclean shutdown or intermittent hardware issue.");?

in src/kernel/blocktreestorage.cpp:420 in bbc9f2b2a1

 415 | +
 416 | +    // If there is a complete log waiting to be applied, write that first. An incomplete log is discarded.
 417 | +    // This may occur if a previous write threw an exception when writing the logged data to the .dat files.
 418 | +    (void)ApplyLog();
 419 | +
 420 | +    if (file_info.empty() && block_info.empty()) return;

stickies-v commented at 11:22 AM on June 22, 2026:

nit: block_info seems like a misnomer?

    if (file_infos_to_write.empty() && block_indexes_to_write.empty()) return;

in src/kernel/blocktreestorage.cpp:425 in bbc9f2b2a1

 420 | +    if (file_info.empty() && block_info.empty()) return;
 421 | +
 422 | +    std::vector<std::pair<CBlockIndex*, int64_t>> pending_header_positions;
 423 | +    pending_header_positions.reserve(block_info.size());
 424 | +
 425 | +    // Use a write-ahead log file that gets atomically flushed to the target files.

stickies-v commented at 11:24 AM on June 22, 2026:

nit: I think this comment is misleading: flushing is not atomic, which is the reason why this class currently is not safe to use without synchronization

in src/kernel/blocktreestorage.cpp:460 in bbc9f2b2a1

 455 | +    // Write the block_info data to the log
 456 | +    WriteLogFileSectionHeader(log_file, ValueType::DISK_BLOCK_INDEX, block_info.size());
 457 | +    std::array<std::byte, DISK_BLOCK_INDEX_WRAPPER_SIZE> block_index_value_buffer;
 458 | +    for (CBlockIndex* block_index : block_info) {
 459 | +        int64_t position = block_index->header_pos == CBlockIndex::UNSET_HEADER_POS ? header_data_end : block_index->header_pos;
 460 | +        auto disk_index{CDiskBlockIndex{block_index}};

stickies-v commented at 12:45 PM on June 22, 2026:

        CDiskBlockIndex disk_index{block_index};

in src/test/fuzz/block_index.cpp:62 in bbc9f2b2a1 outdated

  62 | -        .path = "", // Memory only.
  63 | -        .cache_bytes = 1_MiB,
  64 | -        .memory_only = true,
  65 | -    });
  66 | +    fs::path block_tree_store_dir{g_setup->m_args.GetDataDirBase()};
  67 | +    kernel::BlockTreeStore block_index{block_tree_store_dir, /*wipe_data=*/true};

stickies-v commented at 8:43 PM on June 22, 2026:

Review note: I was worried no longer running this in-memory would kill performance and hardware, but it seems like that is mostly not a concern (did not test). The default datadir is a temporary one, and on Linux temporary storage is usually backed by a ramdisk. IIUC other platforms like macOS don't do that, so if we were ever to support fuzzing on macOS again this could become problematic without precautions.

stickies-v commented at 9:36 PM on June 22, 2026: contributor

I've now reviewed most of the code (bbc9f2b2a185fe0cea096e7c8b30cccbb2ce93b2), and will do another full round for my hopefully final review. None of the comments are blocking, and can be done in a follow-up too.

Did some performance testing (from debac5f2cd16 to bbc9f2b2a185) on signet (300k blocks), and am getting shorter startup times (0.362s -> 0.330s), smaller disk size (59.4 MiB -> 30.9 MiB), quick migration (0.545s). The only thing that's a regression on my machine (M4 Pro, macOS) is reindex time (590s -> 628s), most of which stems from the reindex-chainstate step. I suspect this will impact IBD too, but haven't tested specifically.

All these parameters are within acceptable bounds imo, and I'd prefer keeping non-trivial performance optimizations for a follow-up. I think correctness, reliability and reviewability are more important for this PR.

=== Results (height 300000) ===

                            debac5f2cd16        bbc9f2b2a185
--------------------------  ------------------  ------------------
Migration                   n/a                 0.545s
Steady startup (median)     0.362s              0.330s
  all runs                  0.365 0.362 0.361   0.315 0.333 0.330
Index size                  59.4 MiB            30.9 MiB
  format                    leveldb             flat
Reindex total               589.5s              627.9s
  block tree rebuild        57.8s               58.4s
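
For readers who want the relative differences rather than the raw numbers, the table above can be reduced with a one-line helper. This is only a sketch: `RelativeChange` is a hypothetical name, and the inputs are the median and total values copied from the table.

```cpp
#include <cassert>
#include <cmath>

// Relative change from `before` to `after`; negative means the value went down.
constexpr double RelativeChange(double before, double after)
{
    return (after - before) / before;
}

// Values taken from the results table (debac5f2cd16 -> bbc9f2b2a185).
constexpr double STARTUP_CHANGE{RelativeChange(0.362, 0.330)};  // seconds
constexpr double INDEX_SIZE_CHANGE{RelativeChange(59.4, 30.9)}; // MiB
constexpr double REINDEX_CHANGE{RelativeChange(589.5, 627.9)};  // seconds
```

This works out to roughly 9% faster steady startup, a 48% smaller index, and a 6.5% slower reindex on this one machine.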

sedited force-pushed on Jun 23, 2026

sedited commented at 8:21 PM on June 23, 2026: contributor

Thank you for another round of great comments @stickies-v!

Rebased bbc9f2b2a185fe0cea096e7c8b30cccbb2ce93b2 -> 746533a1e9b17ce4d086e143235b94fba75d4296 (blocktreestore_26 -> blocktreestore_27, compare)

Updated 746533a1e9b17ce4d086e143235b94fba75d4296 -> fdd26dfa9a2f3407a4802891b17b3c1d79d51210 (blocktreestore_27 -> blocktreestore_28, compare)

Addressed @stickies-v's comment, changed Assume to assert in CalculateBlockFileInfoPosition.
Addressed @stickies-v's comment, took the suggestion for consolidating file magic and version checks.
Addressed @stickies-v's comment, aliasing the checksum and file position types.
Addressed @stickies-v's comment, added SERIALIZED_SIZE constexpr to wrapper types that can be used in templated serialization functions.
Addressed @stickies-v's comment, moved ValueType time of check earlier.
Addressed @stickies-v's comment, use locally instantiated BlockMap instead of the m_block_index member during migration.
Addressed @stickies-v's comment, added a log line when a left over log is applied. Unlike the suggestion, it uses info level though, which seems appropriate, since this may not be something completely out of the ordinary or requiring attention.
Addressed @stickies-v's comment, applied suggested WriteBatchSync renames.
Addressed @stickies-v's comment, clarified comment on log application scope.
Addressed @stickies-v's comment, cleaned up disk_index initialization.
Addressed @stickies-v's comment, removed dead block file info double iteration.

Also added two new commits:

Added a benchmark for WriteBatchSync to compare against leveldb. This tracks performance relevant to the FlushStateToDisk hot path. The benchmark only gives meaningful results if the directory is created in a non-tmpfs (non-ramdisk) path. Previous versions of this pull request did less synchronization work, so this wasn't relevant to track at the time.
Added a directory lock file that gets engaged while applying the log and reading data. If the lock file is already taken, it gets polled every millisecond. This should be enough to support a basic external reader.

Currently the block tree store's write performance is about 3 times slower compared to leveldb. This boils down to syncing four times with the filesystem instead of once. Leveldb only needs to journal the write, which typically produces a single filesystem synchronization. The block tree store on the other hand synchronizes the log file write, the directory entry for the flag file, and then the two writes to the respective data files.

This can be optimized: Writes could skip applying the log file by just appending to the log and delegate that responsibility to reads. A torn log could be identified with tag bytes at the end of the file. The flag file was introduced after @stickies-v highlighted in a review comment that a corrupt log file might be interpreted as a torn log, which would in turn lead to inconsistent data. Leveldb has a similar issue, so this might be a risk worth taking.
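
A minimal sketch of the tag-byte idea mentioned above, operating on an in-memory buffer rather than a file. The tag value and the names `LOG_END_TAG`, `FinishLog` and `HasEndTag` are assumptions made for illustration and are not part of the pull request's log format.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical end-of-log marker; the actual bytes would be a format decision.
constexpr std::array<std::uint8_t, 4> LOG_END_TAG{0xde, 0xad, 0xbe, 0xef};

// Append the tag as the very last step of writing a log.
std::vector<std::uint8_t> FinishLog(std::vector<std::uint8_t> payload)
{
    payload.insert(payload.end(), LOG_END_TAG.begin(), LOG_END_TAG.end());
    return payload;
}

// A log whose trailing bytes do not match the tag is treated as torn.
bool HasEndTag(const std::vector<std::uint8_t>& log)
{
    if (log.size() < LOG_END_TAG.size()) return false;
    return std::equal(LOG_END_TAG.begin(), LOG_END_TAG.end(), log.end() - LOG_END_TAG.size());
}
```

Note that this check alone cannot tell a torn log from a log whose trailing bytes were corrupted after a complete write, which is the risk described in the paragraph above.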

An experimental implementation of this brings its performance on par with leveldb again. Note that the log file format already supports writing multiple data sections, so no change in the file format is required. FlushStateToDisk currently needs to synchronize upwards of 4 times (blocktreedb, coinsdb, block, and undo files) with the filesystem, so this ends up being an increase in filesystem synchronization time by less than a factor of two, which in the grand scheme of things is not super significant.
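
The "less than a factor of two" estimate can be reproduced with simple counting. This sketch assumes that each of the four named components costs exactly one synchronization in the baseline (the text says "upwards of 4", so this is the least favourable case) and that all synchronizations cost the same; the constant names are hypothetical.

```cpp
#include <cassert>

// Baseline: one sync each for blocktreedb, coinsdb, block files, undo files (assumed).
constexpr int BASELINE_SYNCS{4};
// Block tree store: log file, flag file directory entry, and two data files.
constexpr int STORE_SYNCS{4};
// The store's four syncs replace the single leveldb sync.
constexpr int NEW_TOTAL_SYNCS{BASELINE_SYNCS - 1 + STORE_SYNCS};

constexpr double SyncFactor()
{
    return static_cast<double>(NEW_TOTAL_SYNCS) / BASELINE_SYNCS;
}
```

Under these assumptions the total goes from 4 to 7 synchronizations, a factor of 1.75.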

in src/kernel/blocktreestorage.cpp:409 in 9a9e005588

 404 | +{
 405 | +    AssertLockHeld(::cs_main);
 406 | +    LOCK(m_mutex);
 407 | +
 408 | +    if (ApplyLog()) {
 409 | +        LogInfo("Applied block tree store write-ahead log left over from a previous failure, potentially caused by unclean shutdown or intermitten hardware issue.");

willcl-ark commented at 9:08 AM on June 24, 2026:

In 9a9e00558807d30142222dfd8f133208c0f183f6

slight typo: s/intermitten/intermittent

willcl-ark commented at 10:32 AM on June 24, 2026: member

Re-reviewing this, I have a few new questions around the StoreLock / WAL interaction, and around StoreLock in general.

a4a5291c6f5 says the directory lock intentionally does not guard log writes, only log application and reads. That makes sense if the lock is meant to keep readers from observing data files while a completed WAL is being applied. But I think there may still be a race around publishing the completed log.

WriteBatchSync() (from 9a9e0055880) writes and commits log.dat, then creates log_flag.dat, then calls ApplyLog(). Since log_flag.dat is visible before the writer enters the final ApplyLog() critical section, another process can observe the completed WAL, apply it, and clear the flag first. The original writer's final ApplyLog() would then return false and throw Failed to apply write-ahead log to data files, even though the write was already applied.

This feels theoretical and unlikely, but I don't think it requires another writer. The BlockTreeStore constructor also calls ApplyLog(), so I think this could happen if another kernel process opens the same datadir by constructing its own btck_ChainstateManager at the wrong time. Based on this PR's use case, I think this might be worth considering.

I also think the current StoreLock RAII wrapper makes this harder to fix safely. util::LockDirectory() returns success if the current process already holds the lock, but every StoreLock destructor unconditionally calls UnlockDirectory(). So an inner StoreLock can release an outer lock early. This means simply adding a broader StoreLock around WriteBatchSync() while leaving ApplyLog() to take its own lock would not be safe.

Could we make the locking model more explicit somehow? For example, ApplyLog() could require a caller-held store lock, perhaps by taking a const StoreLock& token parameter, and WriteBatchSync() could hold that lock across the write-log-and-apply sequence. Alternatively, WriteBatchSync() could treat "the completed log was already applied by another process" as success, though I am less convinced by that. Either way, I think the lock ownership could be clearer to avoid accidental nested unlocks now or in the future.

note: I can't find any current instances of nested StoreLock, but the current wrapper looks like a nestable RAII lock (to me) because of its scoped-lock usage pattern, while in actual fact it is not. For example, future code might try to do:

{
    StoreLock lock_file{dir};
    ApplyLog();
    // Do things assuming the lock is still held.
}
// Assume lock released.
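
The nested-unlock concern and the suggested token parameter can be modelled with a toy lock registry. Everything here is hypothetical (`HeldLocks`, `ToyStoreLock`, and both `ApplyLog...` functions); it only mimics the described behaviour of a lock call that succeeds when the process already holds the lock, paired with a destructor that always unlocks.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Toy process-wide registry of held directory locks (not util::LockDirectory).
std::set<std::string>& HeldLocks()
{
    static std::set<std::string> held;
    return held;
}

class ToyStoreLock
{
    std::string m_dir;

public:
    // Re-locking an already held directory silently succeeds.
    explicit ToyStoreLock(std::string dir) : m_dir{std::move(dir)} { HeldLocks().insert(m_dir); }
    // Unlocks unconditionally, even if an outer scope still relies on the lock.
    ~ToyStoreLock() { HeldLocks().erase(m_dir); }
    ToyStoreLock(const ToyStoreLock&) = delete;
    ToyStoreLock& operator=(const ToyStoreLock&) = delete;
    const std::string& dir() const { return m_dir; }
};

// Variant that takes its own lock: releases the caller's lock on return.
void ApplyLogSelfLocking(const std::string& dir) { ToyStoreLock inner{dir}; }

// Variant that requires a caller-held lock, passed as a token.
bool ApplyLogWithToken(const ToyStoreLock& lock) { return HeldLocks().count(lock.dir()) > 0; }

bool OuterHeldAfterSelfLocking(const std::string& dir)
{
    ToyStoreLock outer{dir};
    ApplyLogSelfLocking(dir);
    return HeldLocks().count(dir) > 0;
}

bool OuterHeldAfterToken(const std::string& dir)
{
    ToyStoreLock outer{dir};
    (void)ApplyLogWithToken(outer);
    return HeldLocks().count(dir) > 0;
}
```

In this model the self-locking variant leaves the outer scope without its lock, while the token variant keeps ownership with the caller for the whole write-log-and-apply sequence.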

sedited force-pushed on Jun 24, 2026

sedited commented at 5:06 PM on June 24, 2026: contributor

Thanks for taking another look @willcl-ark,

Updated fdd26dfa9a2f3407a4802891b17b3c1d79d51210 -> f9253ae389bdc22dacae70ecdd22170795a67c5e (blocktreestore_28 -> blocktreestore_29, compare)

Gated lock file flushing behind a read-only flag to avoid the cross-process log file contention issues flagged by @willcl-ark.
Introduced an OpenMode mode for the BlockTreeStore constructor.
Addressed @willcl-ark's comment, fixed typo.

For testing the single writer/multi reader pattern, I prepared a branch that has one binary continuously write, while the other binary continuously reads. This surfaced the issue reported by @willcl-ark in the above comment. If any other reviewers are interested in exercising this, or extending the case, I have a branch here: https://github.com/sedited/bitcoin/tree/blocktreestore_bins .

note: I can't find any current instances of nested StoreLock, but the current wrapper looks like a nestable RAII lock (to me) because of its scoped-lock usage pattern, while in actual fact it is not. For example, future code might try to do

I'm not quite sure what the problem is here. You should be allowed to double lock in the same process, or at least that is what the directory lock should afford us. The mutex should take care of any cross-thread concurrency issues.

in src/kernel/blocktreestorage.h:39 in f9253ae389 outdated

  34 | +// from the log file record to be used for the actual data files.
  35 | +
  36 | +//! The data layout of the headers file is as follows:
  37 | +//! <magic> <version> [<DiskBlockIndexWrapper> <checksum>]
  38 | +inline constexpr uint32_t HEADER_FILE_MAGIC{0x1d5e2eb2}; // sha256sum("BLOCK_HEADER_FILE_MAGIC")
  39 | +inline constexpr uint32_t HEADER_FILE_VERSION{1};

stickies-v commented at 12:59 PM on June 25, 2026:

nit: these could be self-documenting

diff --git a/src/kernel/blocktreestorage.h b/src/kernel/blocktreestorage.h
index b4a657d501..bde1d91824 100644
--- a/src/kernel/blocktreestorage.h
+++ b/src/kernel/blocktreestorage.h
@@ -37,7 +37,7 @@ namespace kernel {
 //! <magic> <version> [<DiskBlockIndexWrapper> <checksum>]
 inline constexpr uint32_t HEADER_FILE_MAGIC{0x1d5e2eb2}; // sha256sum("BLOCK_HEADER_FILE_MAGIC")
 inline constexpr uint32_t HEADER_FILE_VERSION{1};
-inline constexpr int64_t HEADER_FILE_DATA_START_POSITION{8}; // after magic (4bytes), version (4bytes)
+inline constexpr int64_t HEADER_FILE_DATA_START_POSITION{sizeof(HEADER_FILE_MAGIC) + sizeof(HEADER_FILE_VERSION)};
 inline constexpr const char* HEADER_FILE_NAME{"headers.dat"};
 
 //! The flag is persisted by presence or absence of this file.
@@ -54,7 +54,7 @@ inline constexpr const char* LOG_FLAG_FILE_NAME{"log_flag.dat"};
 //! <magic> <version> [<BlockFileInfoWrapper> <checksum>]
 inline constexpr uint32_t BLOCK_FILES_FILE_MAGIC{0x6e2e2f44}; // sha256sum("BLOCK_FILES_FILE_MAGIC")
 inline constexpr uint32_t BLOCK_FILES_FILE_VERSION{1};
-inline constexpr int64_t BLOCK_FILES_FILE_DATA_START_POSITION{8}; // after magic (4bytes), version (4bytes)
+inline constexpr int64_t BLOCK_FILES_FILE_DATA_START_POSITION{sizeof(BLOCK_FILES_FILE_MAGIC) + sizeof(BLOCK_FILES_FILE_VERSION)};
 inline constexpr const char* BLOCK_FILES_FILE_NAME{"blockfiles.dat"};
 
 //! The data layout of the log file is as follows:
@@ -64,7 +64,7 @@ inline constexpr const char* BLOCK_FILES_FILE_NAME{"blockfiles.dat"};
 //! The (value, target position, checksum) tuple is referred to as a log file record.
 inline constexpr uint32_t LOG_FILE_MAGIC{0xa0346f91}; // sha256sum("LOG_FILE_MAGIC")
 inline constexpr uint32_t LOG_FILE_VERSION{1};
-inline constexpr int64_t LOG_FILE_DATA_START_POSITION{8}; // after magic (4bytes), version (4bytes)
+inline constexpr int64_t LOG_FILE_DATA_START_POSITION{sizeof(LOG_FILE_MAGIC) + sizeof(LOG_FILE_VERSION)};
 inline constexpr const char* LOG_FILE_NAME{"log.dat"};
 
 enum class ValueType : uint8_t {

</details>

in src/kernel/blocktreestorage.cpp:57 in f9253ae389

  52 | +            case util::LockResult::ErrorWrite:
  53 | +                throw BlockTreeStoreError(strprintf(
  54 | +                    "Cannot create write-lock file in %s", fs::PathToString(m_dir)));
  55 | +            case util::LockResult::ErrorLock:
  56 | +                // Read and write access is typically short, so wait a bit and try again.
  57 | +                UninterruptibleSleep(std::chrono::milliseconds(1));

stickies-v commented at 2:00 PM on June 25, 2026:

Since a goal of this change is to allow other applications to access the database, I think we need to be defensive and assume that a lock can be held without limit. Adding a timeout here seems sensible? And maybe even adding a ContendedLock could be helpful?

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index 875c5b2f5e..176cef082d 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -23,6 +23,7 @@
 #include <util/time.h>
 
 #include <array>
+#include <chrono>
 #include <cstddef>
 #include <cstdio>
 #include <ios>
@@ -43,9 +44,13 @@ class StoreLock
     const fs::path m_dir;
 
 public:
-    explicit StoreLock(const fs::path& dir) : m_dir{dir}
+    explicit StoreLock(const fs::path& dir, std::chrono::milliseconds timeout = 30s) : m_dir{dir}
     {
+        SteadyClock::time_point start{SteadyClock::now()};
         for (;;) {
+            if (SteadyClock::now() > start + timeout) {
+                throw BlockTreeStoreError(strprintf("Operation timed out waiting to acquire lock on %s", dir));
+            }
             switch (util::LockDirectory(m_dir, m_dir / STORE_LOCK_NAME)) {
             case util::LockResult::Success:
                 return;
@@ -54,7 +59,7 @@ public:
                     "Cannot create write-lock file in %s", fs::PathToString(m_dir)));
             case util::LockResult::ErrorLock:
                 // Read and write access is typically short, so wait a bit and try again.
-                UninterruptibleSleep(std::chrono::milliseconds(1));
+                UninterruptibleSleep(1ms);
             }
         }
     }

</details>

in src/kernel/blocktreestorage.cpp:49 in f9253ae389

  44 | +
  45 | +public:
  46 | +    explicit StoreLock(const fs::path& dir) : m_dir{dir}
  47 | +    {
  48 | +        for (;;) {
  49 | +            switch (util::LockDirectory(m_dir, m_dir / STORE_LOCK_NAME)) {

stickies-v commented at 2:27 PM on June 25, 2026:

nit

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index 875c5b2f5e..b0dca46a5d 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -46,7 +46,7 @@ public:
     explicit StoreLock(const fs::path& dir) : m_dir{dir}
     {
         for (;;) {
-            switch (util::LockDirectory(m_dir, m_dir / STORE_LOCK_NAME)) {
+            switch (util::LockDirectory(m_dir, STORE_LOCK_NAME)) {
             case util::LockResult::Success:
                 return;
             case util::LockResult::ErrorWrite:
@@ -59,7 +59,7 @@ public:
         }
     }
 
-    ~StoreLock() { UnlockDirectory(m_dir, m_dir / STORE_LOCK_NAME); }
+    ~StoreLock() { UnlockDirectory(m_dir, STORE_LOCK_NAME); }
 
     StoreLock(const StoreLock&) = delete;
     StoreLock& operator=(const StoreLock&) = delete;

</details>

in src/kernel/blocktreestorage.cpp:352 in f9253ae389 outdated

 347 | +    std::array<std::byte, BlockFileInfoWrapper::SERIALIZED_SIZE> buffer;
 348 | +
 349 | +    try {
 350 | +        ReadDataValue(file, buffer);
 351 | +        SpanReader{buffer} >> info_wrapper;
 352 | +    } catch (std::ios_base::failure&) {

stickies-v commented at 2:34 PM on June 25, 2026:

We're catching SpanReader failures here, but not ReadDataValue failures (which throw BlockTreeStoreError). Is the inconsistency on purpose?

sedited commented at 5:14 PM on June 29, 2026:

Yes, this imitates the dbwrapper's current Read behaviour, where we return false on a serialization error, which we then just ignore anyways. I don't really like that to be honest. Maybe we should just let the exception bubble up here?

stickies-v commented at 10:11 AM on June 30, 2026:

Yes, this imitates the dbwrapper's current Read behaviour, where we return false on a serialization error, which we then just ignore anyways.

My point is that in BlockTreeDB (through CDBWrapper::Read) we catch all exceptions, whereas in BlockTreeStore we return false for a std::ios_base::failure and we throw for a BlockTreeStoreError from ReadDataValue, so that seems like a difference in behaviour? I think the consistent thing would be to catch std::exception in BlockTreeStore too?

I don't really like that to be honest. Maybe we should just let the exception bubble up here?

Yes I think these exceptions should bubble up, but I believe I've seen you argue on this PR to keep behaviour the same as BlockTreeDB in this PR, and then make improvements later on - and I agree with that approach.

sedited commented at 10:14 AM on June 30, 2026:

I think the consistent thing would be to catch std::exception in BlockTreeStore too?

Ok, will do this on the next push.
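
The behavioural difference being discussed can be shown with a small stand-alone sketch. `ToyStoreError`, `Failure`, `ReadEntry`, `ReadNarrow` and `ReadBroad` are hypothetical names; the real ReadDataValue and BlockTreeStoreError are only imitated here.

```cpp
#include <cassert>
#include <exception>
#include <ios>
#include <stdexcept>

// Toy stand-in for a store-specific error type.
struct ToyStoreError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class Failure { NONE, DESERIALIZE, CHECKSUM };

// Imitates a read that can fail in two different ways.
void ReadEntry(Failure failure)
{
    if (failure == Failure::DESERIALIZE) throw std::ios_base::failure("bad serialization");
    if (failure == Failure::CHECKSUM) throw ToyStoreError("checksum mismatch");
}

// Catches only deserialization failures; store errors propagate to the caller.
bool ReadNarrow(Failure failure)
{
    try {
        ReadEntry(failure);
    } catch (const std::ios_base::failure&) {
        return false;
    }
    return true;
}

// Catches every std::exception and reports all failures as `false`.
bool ReadBroad(Failure failure)
{
    try {
        ReadEntry(failure);
    } catch (const std::exception&) {
        return false;
    }
    return true;
}
```

`ReadBroad` corresponds to the agreed change of catching std::exception, so both failure kinds surface as a `false` return instead of one of them escaping as an exception.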

in src/kernel/blocktreestorage.cpp:46 in f9253ae389

  41 | +class StoreLock
  42 | +{
  43 | +    const fs::path m_dir;
  44 | +
  45 | +public:
  46 | +    explicit StoreLock(const fs::path& dir) : m_dir{dir}

stickies-v commented at 3:19 PM on June 25, 2026:

nit: it's not obvious that StoreLock only synchronizes across processes, but not within each process. I think it would be useful to document that in that case an additional mutex is necessary.

in src/kernel/blocktreestorage.cpp:440 in f9253ae389 outdated

 435 | +    if (rolling_checksum != stored_rolling_checksum) {
 436 | +        throw BlockTreeStoreError("Detected on-disk log file corruption: Rolling checksum mismatch");
 437 | +    }
 438 | +
 439 | +    (void)log_file.fclose();
 440 | +    WriteFlag(m_log_flag_file_path, /*value=*/false, /*directory_commit=*/false);

stickies-v commented at 3:29 PM on June 25, 2026:

    // Reapplying a complete log (in case of a later failure) is idempotent, so avoid an unnecessary directory commit.
    WriteFlag(m_log_flag_file_path, /*value=*/false, /*directory_commit=*/false);
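The idempotency claim in the suggested comment can be illustrated with a small sketch. It assumes a hypothetical `LogRecord` that overwrites bytes at a fixed position, and models the data file as an in-memory byte vector; the PR's actual record layout and file handling are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical log record: overwrite `value` at byte offset `pos`.
struct LogRecord {
    std::size_t pos;
    std::vector<std::uint8_t> value;
};

// Apply every record by overwriting bytes at a fixed position. Each record
// fully determines the bytes it touches, so replaying the same complete log
// leaves the data unchanged.
void ApplyLog(std::vector<std::uint8_t>& data, const std::vector<LogRecord>& log)
{
    for (const auto& record : log) {
        const std::size_t end{record.pos + record.value.size()};
        assert(end >= record.pos); // guard against position overflow
        if (data.size() < end) data.resize(end); // grow the "file" if needed
        for (std::size_t i{0}; i < record.value.size(); ++i) {
            data[record.pos + i] = record.value[i];
        }
    }
}
```

If the flag file survives a crash because its removal was never committed to the directory, the next start simply replays the same log and ends up in the same state.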

in src/kernel/blocktreestorage.cpp:189 in f9253ae389 outdated

 184 | +{
 185 | +    assert(GetSerializeSize(DiskBlockIndexWrapper{}) == DiskBlockIndexWrapper::SERIALIZED_SIZE);
 186 | +    assert(GetSerializeSize(BlockFileInfoWrapper{}) == BlockFileInfoWrapper::SERIALIZED_SIZE);
 187 | +    LOCK(m_mutex);
 188 | +    fs::create_directories(path);
 189 | +    if (m_mode == OpenMode::WIPE) {

stickies-v commented at 3:30 PM on June 25, 2026:

I think all writes should hold the StoreLock

in src/node/blockstorage.cpp:1295 in f9253ae389

1290 | +        block_tree_store->WriteBatchSync(dump_files, dump_blockindexes);
1291 | +    }
1292 | +
1293 | +    // Re-open to ensure that the migration was successful
1294 | +    auto block_tree_store{std::make_unique<kernel::BlockTreeStore>(m_opts.block_tree_dir)};
1295 | +    // Clear m_block_index so it can be repopulated normally during LoadBlockIndexDB.

stickies-v commented at 3:31 PM on June 25, 2026:

nit: stale docstring

in test/functional/feature_blocktree_migration.py:62 in f9253ae389

  57 | +            self.start_node(0)
  58 | +        assert_equal(block_tree_store_node.getblockchaininfo()["blocks"], nblocks)
  59 | +        self.stop_node(0)
  60 | +
  61 | +        self.log.info("A corrupt legacy block index fails the migration and kills the node")
  62 | +        shutil.move(block_tree_store_node.chain_path, block_tree_store_node.chain_path.with_name("regtest_migrated"))

stickies-v commented at 3:55 PM on June 25, 2026:

nit: why do we keep this data? We don't reuse it, and that part of the test has already passed? rmtree seems more appropriate?

in src/test/blocktreestorage_tests.cpp:33 in f9253ae389

  28 | +using kernel::LOG_FILE_NAME;
  29 | +using kernel::LOG_FLAG_FILE_NAME;
  30 | +
  31 | +BOOST_FIXTURE_TEST_SUITE(blocktreestorage_tests, BasicTestingSetup)
  32 | +
  33 | +CBlockIndex* InsertBlockIndex(std::unordered_map<uint256, CBlockIndex, BlockHasher>& block_map, const uint256& hash)

stickies-v commented at 1:24 PM on June 26, 2026:

CBlockIndex* InsertBlockIndex(node::BlockMap& block_map, const uint256& hash)

in 2 other places:

diff --git a/src/test/blocktreestorage_tests.cpp b/src/test/blocktreestorage_tests.cpp
index bf65ab05b0..12312c3e92 100644
--- a/src/test/blocktreestorage_tests.cpp
+++ b/src/test/blocktreestorage_tests.cpp
@@ -30,7 +30,7 @@ using kernel::LOG_FLAG_FILE_NAME;
 
 BOOST_FIXTURE_TEST_SUITE(blocktreestorage_tests, BasicTestingSetup)
 
-CBlockIndex* InsertBlockIndex(std::unordered_map<uint256, CBlockIndex, BlockHasher>& block_map, const uint256& hash)
+CBlockIndex* InsertBlockIndex(node::BlockMap& block_map, const uint256& hash)
 {
     if (hash.IsNull()) {
         return nullptr;
@@ -75,7 +75,7 @@ void CheckBlockFileInfo(uint32_t file, CBlockFileInfo& file_info, BlockTreeStore
     BOOST_CHECK_EQUAL(a.str(), b.str());
 }
 
-void CheckBlockMap(const std::unordered_map<uint256, CBlockIndex, BlockHasher>& block_map, const std::vector<CBlockIndex*>& blocks)
+void CheckBlockMap(const node::BlockMap& block_map, const std::vector<CBlockIndex*>& blocks)
 {
     LOCK(::cs_main);
     BOOST_CHECK_EQUAL(block_map.size(), blocks.size());
@@ -190,7 +190,7 @@ BOOST_AUTO_TEST_CASE(BlockTreeStoreIncompleteWrites)
     auto params{CreateChainParams(gArgs, ChainType::REGTEST)};
     auto store{std::make_unique<BlockTreeStore>(block_tree_store_dir)};
 
-    std::unordered_map<uint256, CBlockIndex, BlockHasher> block_map;
+    node::BlockMap block_map;
     std::vector<std::pair<int, const CBlockFileInfo*>> fileinfo;
 
     // Write and read a CBlockFileInfo and a CBlockIndex


in src/test/blocktreestorage_tests.cpp:85 in f9253ae389

  80 | +    LOCK(::cs_main);
  81 | +    BOOST_CHECK_EQUAL(block_map.size(), blocks.size());
  82 | +    for (const auto& block : blocks) {
  83 | +        auto hash{block->GetBlockHeader().GetHash()};
  84 | +        auto it = block_map.find(hash);
  85 | +        BOOST_CHECK(it != block_map.end());

stickies-v commented at 1:31 PM on June 26, 2026:

        BOOST_REQUIRE(it != block_map.end());

in src/test/blocktreestorage_tests.cpp:204 in f9253ae389

 199 | +    info.nSize = 2;
 200 | +    info.nUndoSize = 3;
 201 | +    info.nHeightFirst = 4;
 202 | +    info.nHeightLast = 5;
 203 | +    info.nTimeFirst = 6;
 204 | +    info.nTimeLast = 7;

stickies-v commented at 1:57 PM on June 26, 2026:

    int32_t seed{0};
    CBlockFileInfo info{CreateUniqueFileInfo(seed)};

in src/test/blocktreestorage_tests.cpp:231 in f9253ae389

 226 | +        params->GetConsensus(),
 227 | +        [&](const uint256& hash) { return InsertBlockIndex(block_map, hash); },
 228 | +        m_interrupt));
 229 | +    BOOST_CHECK(block_map.empty());
 230 | +
 231 | +    // Now simulate a crash in the middle of writing the data.

stickies-v commented at 2:10 PM on June 26, 2026:

nit: "writing the data" is ambiguous, writing a log file is also writing data

    // Now simulate a crash in the middle of applying the log.

in src/test/blocktreestorage_tests.cpp:222 in f9253ae389

 217 | +        params->GetConsensus(),
 218 | +        [&](const uint256& hash) { return InsertBlockIndex(block_map, hash); },
 219 | +        m_interrupt));
 220 | +    BOOST_CHECK(block_map.empty());
 221 | +
 222 | +    // The constructor should cleanup the log file and not apply any new state to the data files

stickies-v commented at 2:13 PM on June 26, 2026:

This is incorrect, the log file is currently only cleared at the start of WriteBatchSync. I was adding some more invariants that failed. I'm not sure what's preferable here. I think updating the BlockTreeStore ctor to remove the log file after a failed ApplyLog() could make sense?

diff --git a/src/test/blocktreestorage_tests.cpp b/src/test/blocktreestorage_tests.cpp
index bf65ab05b0..c3aaa4e89a 100644
--- a/src/test/blocktreestorage_tests.cpp
+++ b/src/test/blocktreestorage_tests.cpp
@@ -213,6 +213,7 @@ BOOST_AUTO_TEST_CASE(BlockTreeStoreIncompleteWrites)
     // The log file should exist in an unclean state if we abort in the middle of writing to it
     BOOST_CHECK_THROW(store->WriteBatchSync(fileinfo, blockinfo), std::runtime_error);
     BOOST_CHECK(fs::exists(log_file));
+    BOOST_CHECK(!fs::exists(log_flag_file));
     BOOST_CHECK(store->LoadBlockIndexGuts(
         params->GetConsensus(),
         [&](const uint256& hash) { return InsertBlockIndex(block_map, hash); },
@@ -220,7 +221,9 @@ BOOST_AUTO_TEST_CASE(BlockTreeStoreIncompleteWrites)
     BOOST_CHECK(block_map.empty());
 
     // The constructor should cleanup the log file and not apply any new state to the data files
+    BOOST_CHECK(fs::exists(log_file));
     store = std::make_unique<BlockTreeStore>(block_tree_store_dir);
+    BOOST_CHECK(!fs::exists(log_file));
     BOOST_CHECK(!fs::exists(log_flag_file));
     BOOST_CHECK(store->LoadBlockIndexGuts(
         params->GetConsensus(),


in src/test/blocktreestorage_tests.cpp:351 in f9253ae389

 346 | +    fs::path block_tree_store_dir{m_args.GetDataDirBase()};
 347 | +    auto header_file{block_tree_store_dir / HEADER_FILE_NAME};
 348 | +    auto block_files_file{block_tree_store_dir / BLOCK_FILES_FILE_NAME};
 349 | +    auto params{CreateChainParams(gArgs, ChainType::REGTEST)};
 350 | +    auto store_ptr{std::make_unique<BlockTreeStore>(block_tree_store_dir)};
 351 | +    auto& store = *store_ptr;

stickies-v commented at 2:51 PM on June 26, 2026:

nit: not sure why we use unique_ptr here and in other test cases? Just BlockTreeStore store{block_tree_store_dir} seems fine?

in src/test/blocktreestorage_tests.cpp:343 in f9253ae389 outdated

 338 | +        [&](const uint256& hash) { return InsertBlockIndex(block_map, hash); },
 339 | +        interrupt));
 340 | +    CheckBlockMap(block_map, blocks);
 341 | +}
 342 | +
 343 | +BOOST_AUTO_TEST_CASE(BlockTreeStoreRW)

stickies-v commented at 3:25 PM on June 26, 2026:

nit: instead of doing a bunch of ad-hoc checks for store consistency, we could just have one (mutable) struct for the full expected state, and then a function that compares the store with the expected state. It cleans the test up a bit, and also makes the testing more complete.

something like:

struct ExpectedStoreState {
    std::map<int, CBlockFileInfo> file_infos;
    std::vector<CBlockIndex*> blocks;
    bool pruned{false};
    bool reindexing{false};
};

void CheckStoreContents(BlockTreeStore& store,
        const ExpectedStoreState& expected,
        util::SignalInterrupt& interrupt, const CChainParams& params,
        const std::string& context)
{
    BOOST_TEST_CONTEXT(context) {
        for (auto& [file, info] : expected.file_infos) {
            CBlockFileInfo copy{info};
            CheckBlockFileInfo(file, copy, store);
        }
        int32_t last_block;
        store.ReadLastBlockFile(last_block);
        int32_t expected_last = expected.file_infos.empty() ? 0 : expected.file_infos.rbegin()->first;
        BOOST_CHECK_EQUAL(last_block, expected_last);

        LOCK(::cs_main);
        node::BlockMap block_map;
        BOOST_CHECK(store.LoadBlockIndexGuts(
            params.GetConsensus(),
            [&](const uint256& hash) { return InsertBlockIndex(block_map, hash); },
            interrupt));
        CheckBlockMap(block_map, expected.blocks);

        bool pruned, reindexing;
        store.ReadPruned(pruned);
        store.ReadReindexing(reindexing);
        BOOST_CHECK_EQUAL(pruned, expected.pruned);
        BOOST_CHECK_EQUAL(reindexing, expected.reindexing);
    }
}


stringintech commented at 12:01 AM on June 29, 2026: contributor

Re #32427 (review): I'm wondering whether it would make sense to add a separate directory lock for writers only, one that's taken for the whole lifetime of the store (in WRITE/WIPE mode) and fails fast on a second writer process instead of busy-waiting (similar to how bitcoind takes its datadir directory lock), while leaving the existing StoreLock solely for the writer-vs-reader clashes it already handles. That would go in the direction of treating a second writer as misuse to be rejected loudly, rather than trying to guarantee correctness for concurrent-writer scenarios.

I experimented with this and tested it by extending the tool_bitcoin_chainstate.py functional test to run a kernel process (the bitcoin-chainstate binary) against a running node's datadir:

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index 875c5b2f5e..0cc104c0e9 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -37,6 +37,7 @@ using Checksum = uint32_t;
 using FilePosition = int64_t;
 
 static constexpr const char* STORE_LOCK_NAME{".lock"};
+static constexpr const char* WRITER_LOCK_NAME{".writer-lock"};
 
 class StoreLock
 {
@@ -65,6 +66,24 @@ public:
     StoreLock& operator=(const StoreLock&) = delete;
 };
 
+WriterLock::WriterLock(const fs::path& dir) : m_dir{dir}
+{
+    switch (util::LockDirectory(m_dir, m_dir / WRITER_LOCK_NAME)) {
+    case util::LockResult::Success:
+        return;
+    case util::LockResult::ErrorWrite:
+        throw BlockTreeStoreError(strprintf(
+            "Cannot create writer-lock file in %s", fs::PathToString(m_dir)));
+    case util::LockResult::ErrorLock:
+        LogError("Another writer process is already using the block tree store in %s.\n",
+                 fs::PathToString(m_dir));
+        throw BlockTreeStoreError(strprintf(
+            "Another writer process is already using the block tree store in %s", fs::PathToString(m_dir)));
+    }
+}
+
+WriterLock::~WriterLock() { UnlockDirectory(m_dir, m_dir / WRITER_LOCK_NAME); }
+
 /** A wrapper for creating a constant-sized serialization without varint encoding */
 struct BlockFileInfoWrapper : CBlockFileInfo {
     static constexpr size_t SERIALIZED_SIZE{36};
@@ -186,6 +205,9 @@ BlockTreeStore::BlockTreeStore(const fs::path& path, const OpenMode open_mode)
     assert(GetSerializeSize(BlockFileInfoWrapper{}) == BlockFileInfoWrapper::SERIALIZED_SIZE);
     LOCK(m_mutex);
     fs::create_directories(path);
+    if (m_mode != OpenMode::READ) {
+        m_writer_lock.emplace(path);
+    }
     if (m_mode == OpenMode::WIPE) {
         fs::remove(m_header_file_path);
         fs::remove(m_block_files_file_path);
diff --git a/src/kernel/blocktreestorage.h b/src/kernel/blocktreestorage.h
index b4a657d501..37e8ffe5c2 100644
--- a/src/kernel/blocktreestorage.h
+++ b/src/kernel/blocktreestorage.h
@@ -12,6 +12,7 @@
 
 #include <cstdint>
 #include <functional>
+#include <optional>
 #include <stdexcept>
 #include <string>
 #include <utility>
@@ -29,6 +30,18 @@ class SignalInterrupt;
 
 namespace kernel {
 
+class WriterLock
+{
+    fs::path m_dir;
+
+public:
+    explicit WriterLock(const fs::path& dir);
+    ~WriterLock();
+
+    WriterLock(const WriterLock&) = delete;
+    WriterLock& operator=(const WriterLock&) = delete;
+};
+
 // Checksums are calculated from the serialized value and its position in the
 // file. This protects against out of order data and allows the same checksum
 // from the log file record to be used for the actual data files.
@@ -146,6 +159,8 @@ private:
     mutable Mutex m_mutex;
     OpenMode m_mode;
 
+    std::optional<WriterLock> m_writer_lock;
+
     void WriteFlag(const fs::path& path, bool value, bool directory_commit) const;
 
     /**
diff --git a/test/functional/tool_bitcoin_chainstate.py b/test/functional/tool_bitcoin_chainstate.py
index 6b7128898d..75aae4f9a1 100755
--- a/test/functional/tool_bitcoin_chainstate.py
+++ b/test/functional/tool_bitcoin_chainstate.py
@@ -13,6 +13,8 @@ snapshot and extend the snapshot chain with new blocks.
 
 import subprocess
 
+from pathlib import Path
+
 from test_framework.test_framework import BitcoinTestFramework
 from test_framework.util import assert_equal
 from test_framework.wallet import MiniWallet
@@ -37,6 +39,31 @@ class BitcoinChainstateTest(BitcoinTestFramework):
         self.add_nodes(2)
         self.start_nodes()
 
+    def writer_lock_test(self):
+        assert self.nodes[0].is_node_stopped() is False
+
+        def run_chainstate(datadir):
+            return subprocess.run(
+                self.get_binaries().chainstate_argv() + ["-regtest", datadir],
+                stdin=subprocess.DEVNULL,
+                capture_output=True,
+                text=True,
+            )
+
+        self.log.info("Test kernel process refuses a datadir owned by a running node")
+        proc = run_chainstate(self.nodes[0].chain_path)
+        assert proc.returncode != 0
+        # The block tree store is opened before the (leveldb) coins db, so the
+        # store's own writer lock is what rejects the process here, surfacing
+        # this message.
+        assert "Another writer process is already using the block tree store" in proc.stdout
+
+        self.log.info("Test kernel process on an unrelated datadir is unaffected")
+        other_datadir = Path(self.options.tmpdir) / "chainstate_other"
+        proc = run_chainstate(other_datadir)
+        assert proc.returncode == 0
+        assert "Another writer process is already using the block tree store" not in proc.stdout
+
     def generate_snapshot_chain(self):
         self.log.info(f"Generate deterministic chain up to block {SNAPSHOT_BASE_BLOCK_HEIGHT} for node0 while node1 disconnected")
         n0 = self.nodes[0]
@@ -98,6 +125,7 @@ class BitcoinChainstateTest(BitcoinTestFramework):
         self.add_block(datadir, n0.getblock(new_tip_hash, 0), expected_stdout="Block tip changed")
 
     def run_test(self):
+        self.writer_lock_test()
         dump_output = self.generate_snapshot_chain()
         self.basic_test()
         self.assumeutxo_test(dump_output['path'])


sedited force-pushed on Jun 29, 2026

sedited commented at 2:56 PM on June 29, 2026: contributor

Thank you for the reviews @stickies-v, @willcl-ark, @stringintech!

Updated f9253ae389bdc22dacae70ecdd22170795a67c5e -> adbd59eafae4e5c8d27af4ca13dda16e989938b3 (blocktreestore_29 -> blocktreestore_30, compare)

Addressed @stickies-v's comment, made data start positions self-documenting.
Addressed @stickies-v's comment, added a timeout to the lock waiting logic.
Addressed @stickies-v's comment, drop directory in directory lock name.
Addressed @stickies-v's comment, added note to StoreAccessLock that it is only for cross-process exclusivity.
Addressed @stickies-v's comment, added documentation string for why we are not committing the flag file after successfully applying the log.
Addressed @stickies-v's comment, added StoreLock to WIPE operation.
Addressed @stickies-v's comment, removed stale docstring.
Addressed @stickies-v's comment, removed instead of moved stale data directory from block tree migration functional test.
Addressed @stickies-v's comment, added functional test for pruned node migration.
Addressed @stickies-v's comment, consistently use BlockMap in the new unit test.
Addressed @stickies-v's comment, use BOOST_REQUIRE for checking iterator.
Addressed @stickies-v's comment, cleanup block file info test data initialization.
Addressed @stickies-v's comment, corrected doc string in block tree storage tests.
Addressed @stickies-v's comment, corrected test case that was not migrated to current behaviour in a previous iteration of this pull request.
Addressed @stickies-v's comment, got rid of the mentioned unique_ptr indirection. Kept the unique_ptr where it made things easier to reset and re-initialize.
Addressed @stickies-v's comment, took suggestion for an expected data structure that makes the tests easier to follow.
Addressed @stickies-v's comment, moved the commit moving CBlockFileInfo to the block tree store module earlier, which avoids having to deal with a circular dependency.

Addressed @stringintech's comment, took the suggested functional test as a new first commit.

Addressed @stringintech's comment, @stickies-v's comment, and @willcl-ark's comment, by restricting read mode to just read functions. Read mode instances now no longer invoke log file application. Also introduced a write directory lock that gets taken on instantiation and yielded on destruction.

A remaining issue I did not address here is how to prevent later changes of the code from introducing nested StoreAccessLocks, whose destructors would race to close the directory lock again and yield it too early. I made various attempts at this, both with reference counting and with a mutex-protected map similar to the one the directory locks already have, but found none of the attempts fully satisfying. I think that could be left for a different PR, maybe in concert with attempting to make the directory locks maintain less global state in general.
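To make the nesting problem concrete, here is a rough sketch of the reference-counting idea mentioned above. It is not what the PR does, and it keeps exactly the kind of global state the comment hopes to reduce. All names (`CountedDirLock`, `g_locked_dirs`) are hypothetical, and the real directory lock is replaced by an in-memory set.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <set>
#include <string>
#include <utility>

// Stand-in for the real directory lock: records which directories are locked.
// Purely illustrative; the PR uses util::LockDirectory / UnlockDirectory.
inline std::set<std::string> g_locked_dirs;

// Hypothetical reference-counted guard: only the first guard for a directory
// acquires the lock, and only the last one to be destroyed releases it.
class CountedDirLock
{
    static inline std::mutex s_mutex;
    static inline std::map<std::string, int> s_counts;
    const std::string m_dir;

public:
    explicit CountedDirLock(std::string dir) : m_dir{std::move(dir)}
    {
        std::lock_guard<std::mutex> guard{s_mutex};
        if (s_counts[m_dir]++ == 0) g_locked_dirs.insert(m_dir); // first holder acquires
    }

    ~CountedDirLock()
    {
        std::lock_guard<std::mutex> guard{s_mutex};
        auto it{s_counts.find(m_dir)};
        assert(it != s_counts.end() && it->second > 0);
        if (--it->second == 0) { // last holder releases
            g_locked_dirs.erase(m_dir);
            s_counts.erase(it);
        }
    }

    CountedDirLock(const CountedDirLock&) = delete;
    CountedDirLock& operator=(const CountedDirLock&) = delete;
};
```

A nested guard then no longer yields the lock when it goes out of scope. The sketch only counts holders within one process; it says nothing about ordering between threads that hold guards on the same directory.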

DrahtBot added the label CI failed on Jun 29, 2026

DrahtBot commented at 4:42 PM on June 29, 2026: contributor

🚧 At least one of the CI tasks failed.

Task test ancestor commits: https://github.com/bitcoin/bitcoin/actions/runs/28381255295/job/84084157432

LLM reason (✨ experimental): CI failed due to a C++ build error: blockstorage.h references an unknown type CBlockFileInfo, causing the bitcoin_wallet compilation to fail (treated as -Werror).

<details><summary>Hints</summary>

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

</details>

sedited force-pushed on Jun 29, 2026

DrahtBot removed the label CI failed on Jun 29, 2026

in src/kernel/blocktreestorage.cpp:231 in bf8451a6aa

 230 | @@ -186,6 +231,11 @@ BlockTreeStore::~BlockTreeStore()
 231 |      UnlockDirectory(m_header_file_path.parent_path(), WRITE_LOCK_NAME);

willcl-ark commented at 10:49 AM on June 30, 2026:

In bf8451a6aa97939809970ab1b9e31b4099ec4e2e

I think in this destructor, which could be for a read-only store, we will remove a global write lock, which doesn't seem correct. Probably need to track if we have locked this, or short-circuit for read-only stores (similar to the constructor).

stringintech commented at 1:47 PM on June 30, 2026:

I was thinking along similar lines from the constructor side: what happens if the write lock leaks when the ctor throws after locking? It doesn't seem problematic in most cases (since the lock would be freed anyway if the process exits), but it seems easier to reason about if the write lock is an RAII member (e.g. std::optional<WriteLock> in the #32427 (comment) diff) - it's only constructed for WRITE/WIPE, so a READ store has nothing to release, and a ctor throw after locking releases it automatically.

That said, the member does add a bit of surface to the header, which may be why the lock was kept as plain ctor/dtor state in the latest force push instead.
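A minimal sketch of the RAII-member approach described above, under stated assumptions: `WriteLock`, `Store` and the `fail_after_lock` parameter are hypothetical stand-ins, and a global bool replaces the real lock file. It shows the two properties raised in this thread: a READ store never touches the write lock, and a constructor that throws after locking still releases it.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

enum class OpenMode { WRITE, WIPE, READ };

// Stand-in for the process-wide write lock; a bool replaces the lock file.
inline bool g_write_lock_held{false};

// Hypothetical RAII wrapper: acquire in the constructor, release in the destructor.
class WriteLock
{
public:
    WriteLock()
    {
        if (g_write_lock_held) throw std::runtime_error("write lock is already held");
        g_write_lock_held = true;
    }
    ~WriteLock()
    {
        assert(g_write_lock_held);
        g_write_lock_held = false;
    }
    WriteLock(const WriteLock&) = delete;
    WriteLock& operator=(const WriteLock&) = delete;
};

class Store
{
    std::optional<WriteLock> m_write_lock; // stays empty for READ stores

public:
    explicit Store(OpenMode mode, bool fail_after_lock = false)
    {
        if (mode != OpenMode::READ) m_write_lock.emplace();
        // If construction fails from here on, the already-constructed member
        // is destroyed during unwinding, which releases the lock.
        if (fail_after_lock) throw std::runtime_error("simulated constructor failure");
    }
};
```

No user-written destructor is needed on `Store`, so there is no code path in which a read-only instance could release a lock it never took.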

sedited commented at 2:49 PM on June 30, 2026:

Probably need to track if we have locked this, or short-circuit for read-only stores (similar to the constructor).

Yeah, of course. Thanks for the suggestions, will see what I think fits best.

willcl-ark commented at 1:03 PM on June 30, 2026: member

If I'm not mistaken, it looks like btck_chainstate_manager_create() does not currently expose BlockTreeStore::OpenMode::READ, so external kernel users cannot open the block tree store in read-only mode yet. Is that intentional, and is wiring this through the kernel API planned as a follow-up?

sedited commented at 3:01 PM on June 30, 2026: contributor

Re #32427#pullrequestreview-4599628705

Is that intentional, and is wiring this through the kernel API planned as a follow-up?

Yes, intentional. Ideally we'd have a separate discussion on what that would look like. Maybe a prototype of it could make sense at this point?

in src/kernel/blocktreestorage.h:146 in adbd59eafa outdated

 141 | +
 142 | +    // TEST ONLY
 143 | +    bool m_incomplete_log_write{false};
 144 | +    bool m_incomplete_log_apply{false};
 145 | +
 146 | +    mutable Mutex m_mutex;

stickies-v commented at 4:22 PM on June 30, 2026:

We are currently using one primitive for thread synchronization (m_mutex) and one for process synchronization (StoreAccessLock). I think we can quite elegantly combine them into a single BlockTreeStore::Mutex that does both, and remains compatible with clang's thread safety annotations:

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index 27bd50e834..a7e5ddf136 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -40,43 +40,46 @@ using FilePosition = int64_t;
 static constexpr const char* STORE_ACCESS_LOCK_NAME{".lock"};
 static constexpr const char* WRITE_LOCK_NAME{".write"};
 
-//! Blocks cross-process simultaneous read or write access to the data files.
-//! Only a single instance may be created per directory at any one time.
-class StoreAccessLock
+void BlockTreeStore::Mutex::lock()
 {
-    const fs::path m_dir;
-
-public:
-    explicit StoreAccessLock(const fs::path& dir) : m_dir{dir}
-    {
-        std::chrono::milliseconds timeout = 30s;
-        SteadyClock::time_point start{SteadyClock::now()};
-        for (;;) {
-            switch (util::LockDirectory(m_dir, STORE_ACCESS_LOCK_NAME)) {
-            case util::LockResult::Success:
-                return;
-            case util::LockResult::ErrorWrite:
-                throw BlockTreeStoreError(strprintf(
-                    "Cannot create write-lock file in %s", fs::PathToString(m_dir)));
-            case util::LockResult::ErrorLock: {
-                if (SteadyClock::now() > start + timeout) {
-                    throw BlockTreeStoreError(strprintf("Operation timed out waiting to acquire lock on %s", fs::PathToString(m_dir)));
-                }
-                // Read and write access is typically short, so wait a bit and try again.
-                UninterruptibleSleep(1ms);
-            }
+    m_mutex.lock();
+    std::chrono::milliseconds timeout{30s};
+    SteadyClock::time_point start{SteadyClock::now()};
+    for (;;) {
+        switch (util::LockDirectory(m_dir, STORE_ACCESS_LOCK_NAME)) {
+        case util::LockResult::Success:
+            return;
+        case util::LockResult::ErrorWrite:
+            m_mutex.unlock();
+            throw BlockTreeStoreError(strprintf(
+                "Cannot create write-lock file in %s", fs::PathToString(m_dir)));
+        case util::LockResult::ErrorLock: {
+            if (SteadyClock::now() > start + timeout) {
+                m_mutex.unlock();
+                throw BlockTreeStoreError(strprintf("Operation timed out waiting to acquire lock on %s", fs::PathToString(m_dir)));
             }
+            // Read and write access is typically short, so wait a bit and try again.
+            UninterruptibleSleep(1ms);
+        }
         }
     }
+}
 
-    ~StoreAccessLock()
-    {
-        UnlockDirectory(m_dir, STORE_ACCESS_LOCK_NAME);
-    }
+void BlockTreeStore::Mutex::unlock()
+{
+    UnlockDirectory(m_dir, STORE_ACCESS_LOCK_NAME);
+    m_mutex.unlock();
+}
 
-    StoreAccessLock(const StoreAccessLock&) = delete;
-    StoreAccessLock& operator=(const StoreAccessLock&) = delete;
-};
+bool BlockTreeStore::Mutex::try_lock()
+{
+    if (!m_mutex.try_lock()) return false;
+    if (util::LockDirectory(m_dir, STORE_ACCESS_LOCK_NAME) != util::LockResult::Success) {
+        m_mutex.unlock();
+        return false;
+    }
+    return true;
+}
 
 /** A wrapper for creating a constant-sized serialization without varint encoding */
 struct BlockFileInfoWrapper : CBlockFileInfo {
@@ -193,6 +196,7 @@ BlockTreeStore::BlockTreeStore(const fs::path& path, const OpenMode open_mode)
       m_block_files_file_path{path / BLOCK_FILES_FILE_NAME},
       m_reindex_flag_file_path{path / REINDEX_FLAG_FILE_NAME},
       m_prune_flag_file_path{path / PRUNE_FLAG_FILE_NAME},
+      m_store_mutex{path},
       m_mode{open_mode}
 {
     assert(GetSerializeSize(DiskBlockIndexWrapper{}) == DiskBlockIndexWrapper::SERIALIZED_SIZE);
@@ -200,15 +204,14 @@ BlockTreeStore::BlockTreeStore(const fs::path& path, const OpenMode open_mode)
 
     if (m_mode == OpenMode::READ) return;
 
-    LOCK(m_mutex);
     fs::create_directories(path);
+    LOCK(m_store_mutex);
 
     if (util::LockDirectory(path, WRITE_LOCK_NAME) != util::LockResult::Success) {
         throw BlockTreeStoreError("Block tree store write lock is already held. Cannot have stores on multiple processes share write access.");
     }
 
     if (m_mode == OpenMode::WIPE) {
-        StoreAccessLock lock_file{m_header_file_path.parent_path()};
         fs::remove(m_header_file_path);
         fs::remove(m_block_files_file_path);
         fs::remove(m_log_file_path);
@@ -273,8 +276,7 @@ void BlockTreeStore::WriteReindexing(bool reindexing) const
 
 void BlockTreeStore::ReadLastBlockFile(int32_t& last_block_file) const
 {
-    LOCK(m_mutex);
-    StoreAccessLock lock_file{m_log_file_path.parent_path()};
+    LOCK(m_store_mutex);
     auto file{OpenFileAndVerifyHeader(m_block_files_file_path, BLOCK_FILES_FILE_MAGIC, BLOCK_FILES_FILE_VERSION)};
 
     constexpr uint64_t entry_size = BlockFileInfoWrapper::SERIALIZED_SIZE + sizeof(Checksum);
@@ -371,8 +373,7 @@ static void ReadDataValue(AutoFile& file, std::span<std::byte> value_buffer)
 
 bool BlockTreeStore::ReadBlockFileInfo(int file_index, CBlockFileInfo& info)
 {
-    LOCK(m_mutex);
-    StoreAccessLock lock_file{m_log_file_path.parent_path()};
+    LOCK(m_store_mutex);
 
     auto file{OpenFileAndVerifyHeader(m_block_files_file_path, BLOCK_FILES_FILE_MAGIC, BLOCK_FILES_FILE_VERSION)};
     file.seek(CalculateBlockFileInfoPosition(file_index), SEEK_SET);
@@ -393,8 +394,7 @@ bool BlockTreeStore::ReadBlockFileInfo(int file_index, CBlockFileInfo& info)
 
 bool BlockTreeStore::ApplyLog() const
 {
-    AssertLockHeld(m_mutex);
-    StoreAccessLock lock_file{m_log_file_path.parent_path()};
+    AssertLockHeld(m_store_mutex);
 
     if (!fs::exists(m_log_file_path) || !fs::exists(m_log_flag_file_path)) {
         return false;
@@ -478,7 +478,7 @@ void BlockTreeStore::WriteBatchSync(const std::vector<std::pair<int, const CBloc
 {
     CheckWriteAccess();
     AssertLockHeld(::cs_main);
-    LOCK(m_mutex);
+    LOCK(m_store_mutex);
 
     // If there is a complete log waiting to be applied, write that first. An incomplete log is discarded.
     // This may occur if a previous write threw an exception when writing the logged data to the .dat files.
@@ -566,8 +566,7 @@ bool BlockTreeStore::LoadBlockIndexGuts(
     const util::SignalInterrupt& interrupt)
 {
     AssertLockHeld(::cs_main);
-    LOCK(m_mutex);
-    StoreAccessLock lock_file{m_log_file_path.parent_path()};
+    LOCK(m_store_mutex);
 
     auto file{OpenFileAndVerifyHeader(m_header_file_path, HEADER_FILE_MAGIC, HEADER_FILE_VERSION)};
 
diff --git a/src/kernel/blocktreestorage.h b/src/kernel/blocktreestorage.h
index 8a1b48297e..1089cfc761 100644
--- a/src/kernel/blocktreestorage.h
+++ b/src/kernel/blocktreestorage.h
@@ -131,6 +131,28 @@ public:
         READ
     };
 
+    //! Interthread mutex combined with a cross-process file lock. Not re-entrant.
+    class LOCKABLE Mutex
+    {
+        //! Mutex for synchronization across threads.
+        std::mutex m_mutex;
+        //! File lock for synchronization across processes.
+        const fs::path m_dir;
+
+    public:
+        explicit Mutex(const fs::path& dir) : m_dir{dir} {}
+
+        void lock() EXCLUSIVE_LOCK_FUNCTION();
+        void unlock() UNLOCK_FUNCTION();
+        bool try_lock() EXCLUSIVE_TRYLOCK_FUNCTION(true);
+
+        using unique_lock = std::unique_lock<Mutex>;
+
+#ifdef __clang__
+        const Mutex& operator!() const { return *this; }
+#endif
+    };
+
 private:
     fs::path m_header_file_path;
     fs::path m_log_file_path;
@@ -143,7 +165,7 @@ private:
     bool m_incomplete_log_write{false};
     bool m_incomplete_log_apply{false};
 
-    mutable Mutex m_mutex;
+    mutable Mutex m_store_mutex;
     OpenMode m_mode;
 
     void CheckWriteAccess() const;
@@ -157,7 +179,7 @@ private:
      * nothing to apply, either by no log file existing, or it not being
      * complete.
      */
-    [[nodiscard]] bool ApplyLog() const EXCLUSIVE_LOCKS_REQUIRED(m_mutex);
+    [[nodiscard]] bool ApplyLog() const EXCLUSIVE_LOCKS_REQUIRED(m_store_mutex);
 
 public:
     BlockTreeStore(const fs::path& path, OpenMode open_mode = OpenMode::WRITE);
@@ -167,7 +189,7 @@ public:
     void WriteReindexing(bool reindexing) const;
 
     //! Block files are zero indexed. Returns 0 when there are no block files indexed yet.
-    void ReadLastBlockFile(int32_t& last_block_file) const EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);
+    void ReadLastBlockFile(int32_t& last_block_file) const EXCLUSIVE_LOCKS_REQUIRED(!m_store_mutex);
 
     void ReadPruned(bool& pruned) const;
     void WritePruned(bool pruned) const;
@@ -179,15 +201,15 @@ public:
     void SetSimulateIncompleteLogApply(bool val) { m_incomplete_log_apply = val; }
 
     void WriteBatchSync(const std::vector<std::pair<int, const CBlockFileInfo*>>& file_infos_to_write, const std::vector<CBlockIndex*>& block_indexes_to_write)
-        EXCLUSIVE_LOCKS_REQUIRED(::cs_main, !m_mutex);
+        EXCLUSIVE_LOCKS_REQUIRED(::cs_main, !m_store_mutex);
 
-    [[nodiscard]] bool ReadBlockFileInfo(int file_index, CBlockFileInfo& info) EXCLUSIVE_LOCKS_REQUIRED(!m_mutex);
+    [[nodiscard]] bool ReadBlockFileInfo(int file_index, CBlockFileInfo& info) EXCLUSIVE_LOCKS_REQUIRED(!m_store_mutex);
 
     [[nodiscard]] bool LoadBlockIndexGuts(
         const Consensus::Params& consensus_params,
         std::function<CBlockIndex*(const uint256&)> insert_block_index,
         const util::SignalInterrupt& interrupt)
-        EXCLUSIVE_LOCKS_REQUIRED(::cs_main, !m_mutex);
+        EXCLUSIVE_LOCKS_REQUIRED(::cs_main, !m_store_mutex);
 };
 
 } // namespace kernel

</details>

sedited commented at 9:50 AM on July 1, 2026:

This suggestion doesn't seem to work with our lock order annotations. Not sure we can extend our Mutex types outside of sync.h

stickies-v commented at 12:07 PM on July 1, 2026:

You're right, I didn't test debug builds, thanks for pointing this out. A solution could be to have a generic FileMutex in sync.h, but it looks like this is leading us too far astray. I think the current synchronization interface is a bit more confusing than it could be, but it works well enough, so I suggest marking this as resolved.

test: Pin bitcoin-chainstate cross-process directory lock exclusion

Co-authored-by: stringintech <stringintech@gmail.com>

e69c570b55

kernel: Add blocktreestorage module

The BlockTreeStore introduces a new data format for storing block
indexes and headers on disk. The class is very similar to the existing
CBlockTreeDB, which stores the same data in a leveldb database. Unlike
CBlockTreeDB, the data stored through the BlockTreeStore is directly
serialized and written to flat .dat files. The storage schema introduced
is simple. It relies on the assumption that no entry is ever
deleted and that no duplicate entries are written. These assumptions
hold for the current users of CBlockTreeDB.

In order to efficiently update a CBlockIndex entry in the store, a new
field is added to the class that tracks its position in the file. New
serialization wrappers are added for both the CBlockIndex and
CBlockFileInfo classes to avoid serializing integers as VARINT. Using
VARINT encoding would make updating these fields impossible, since
changing them might overwrite existing entries in the file.

The new store supports atomic writes by using a write ahead log. Boolean
flags are persisted through the (non-)existence of certain files. Data
integrity is verified through the use of crc32c checksums on each data
entry.

Also makes flag values, such as reindexing and pruning, based on file
existence, instead of boolean fields. This makes the operations atomic.

This commit is part of a series to replace the leveldb database
currently used for storing block indexes and headers with a flat file
storage. This is motivated by the kernel library, where the usage of
leveldb is a limiting factor to its future use cases.

Co-authored-by: stickies-v <stickies-v@protonmail.com>

f03ffc8d8d
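
The point about VARINT in the commit message above can be illustrated with a minimal sketch. It assumes a generic base-128 variable-length encoding (not necessarily Bitcoin Core's exact VARINT format) and an in-memory byte vector standing in for the .dat file; `VarLenSize`, `WriteFixed32` and `ReadFixed32` are hypothetical helper names, not functions from this PR. With a fixed width, a field can be overwritten at its recorded position without touching its neighbours, whereas a variable-length field may grow and spill into the next entry.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of bytes a generic base-128 variable-length
// encoding (7 payload bits per byte) needs for `value`.
std::size_t VarLenSize(uint64_t value)
{
    std::size_t size{1};
    while (value >= 0x80) {
        value >>= 7;
        ++size;
    }
    return size;
}

// Overwrite 4 bytes at `offset` with `value` in little-endian order.
// The entry always occupies exactly 4 bytes, so neighbours are untouched.
void WriteFixed32(std::vector<uint8_t>& file, std::size_t offset, uint32_t value)
{
    for (std::size_t i{0}; i < 4; ++i) {
        file.at(offset + i) = static_cast<uint8_t>(value >> (8 * i));
    }
}

// Read back a 4-byte little-endian value from `offset`.
uint32_t ReadFixed32(const std::vector<uint8_t>& file, std::size_t offset)
{
    uint32_t value{0};
    for (std::size_t i{0}; i < 4; ++i) {
        value |= static_cast<uint32_t>(file.at(offset + i)) << (8 * i);
    }
    return value;
}
```

In this sketch, updating a field from 1 to 300000 keeps its 4-byte footprint, while the same update under the variable-length encoding would grow the field from one byte to three and overwrite whatever follows it.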

kernel: Add write file lock

This excludes other processes from writing to the block tree store while
a process is already running.

This commit is part of a series to replace the leveldb database
currently used for storing block indexes and headers with a flat file
storage. This is motivated by the kernel library, where the usage of
leveldb is a limiting factor to its future use cases.

Co-authored-by: stringintech <stringintech@gmail.com>

370802a18c

bench: Track block index write sync performance

During FlushStateToDisk, changes to the block index are written to disk.
Since this happens in a fairly hot path, it is useful to track this
performance.

Note that this benchmark only gives meaningful results if the directory
is created in a non-tmpfs (non-ramdisk) path. Developers might be
required to tweak the path for this.

Currently the block tree store's write performance is about 3 times
slower compared to leveldb. This boils down to syncing four times with the
filesystem instead of once. Leveldb only needs to journal the write,
which typically produces a single filesystem synchronization. The block
tree store on the other hand synchronizes the log file write, the
directory entry for the flag file, and then the two writes to the
respective data files.

This can be optimized: Writes could skip applying the log file and
delegate that responsibility to reads. A torn log could be identified
with tag bytes at the end of the file. This would bring synchronization
requirements back to par with leveldb again.

FlushStateToDisk currently needs to synchronize between 4 and 6 times
(blocktreedb, coinsdb, block, and undo files) with the filesystem, so
this ends up being an increase in filesystem synchronization time by
less than a factor of two.

This commit is part of a series to replace the leveldb database
currently used for storing block indexes and headers with a flat file
storage. This is motivated by the kernel library, where the usage of
leveldb is a limiting factor to its future use cases.

135387e632
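
The tag-byte idea mentioned in the commit message above could look roughly like the following sketch. It is only an illustration under stated assumptions: the tag value `LOG_END_TAG` and the helpers `FinishLog` and `IsLogComplete` are hypothetical, and the log is modelled as an in-memory byte vector rather than a file. A writer appends the tag as the very last step, so a reader that finds no tag at the end can treat the log as torn and ignore it.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical end-of-log marker; the value is illustrative only.
constexpr std::array<uint8_t, 4> LOG_END_TAG{0xB1, 0x0C, 0x10, 0x6F};

// Append the tag after the payload. In a real store this would be the last
// bytes written before the log is synced.
std::vector<uint8_t> FinishLog(std::vector<uint8_t> payload)
{
    payload.insert(payload.end(), LOG_END_TAG.begin(), LOG_END_TAG.end());
    return payload;
}

// A log counts as complete only if it ends with the tag bytes.
bool IsLogComplete(const std::vector<uint8_t>& log)
{
    if (log.size() < LOG_END_TAG.size()) return false;
    const auto tail{log.end() - static_cast<std::ptrdiff_t>(LOG_END_TAG.size())};
    return std::equal(LOG_END_TAG.begin(), LOG_END_TAG.end(), tail);
}
```

A trailing tag alone cannot rule out a payload that happens to end in the same bytes, so a design along these lines would presumably still rely on the per-entry and rolling checksums to validate the content itself.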

kernel: Add directory lock to block tree store

The lock is taken during log application and during reads. It is
supposed to protect the read integrity of the files in the case of
external file readers. Note that it explicitly does not guard log
writes.

This commit is part of a series to replace the leveldb database
currently used for storing block indexes and headers with a flat file
storage. This is motivated by the kernel library, where the usage of
leveldb is a limiting factor to its future use cases. It also offers better
performance and has a smaller on-disk footprint, though this is mostly
negligible in the grand scheme of things.

ef5fbe65b3

fuzz: Use BlockTreeStore in block index fuzz test

This commit is part of a series to replace the leveldb database
currently used for storing block indexes and headers with a flat file
storage. This is motivated by the kernel library, where the usage of
leveldb is a limiting factor to its future use cases.

Co-authored-by: stickies-v <stickies-v@protonmail.com>

635dcc4aa7

blockstorage: Move CBlockFileInfo to blocktreestorage

This eliminates the circular dependency between the blockstorage and the
blocktreestorage modules again.

0d99708acb

blockstorage: Replace BlockTreeDB with BlockTreeStore

This hooks up the newly introduced BlockTreeStore class to the actual
codebase. It also adds a migration function to migrate old leveldb block
indexes to the new format on startup.

The migration first reads from leveldb (blocks/index), and writes it to
a fresh BlockTreeStore in the same directory. Once done, the original
leveldb database is deleted.

On migration failure, the node shuts down again and prompts the user to
reindex. This is exercised in the added functional test. The functional
test uses version 28.2 to generate a legacy leveldb block tree db. This
version was chosen, since it is already required by other back-compat
tests (though other older versions would have worked too).

This commit is part of a series to replace the leveldb database
currently used for storing block indexes and headers with a flat file
storage. This is motivated by the kernel library, where the usage of
leveldb is a limiting factor to its future use cases.

e8f9125db1

kernel: Remove block tree db params

These are no longer needed after the migration to the new
BlockTreeStore. The cache for the block tree db is also no longer
needed, so grant what has been freed up to the coins db.

This commit is part of a series to replace the leveldb database
currently used for storing block indexes and headers with a flat file
storage. This is motivated by the kernel library, where the usage of
leveldb is a limiting factor to its future use cases.

1c9a998549

blockstorage: Remove BlockTreeDB dead code

This is not called by anything anymore, so just remove it.

The max block file number and the last persisted block file info are now
always in lockstep, so remove the extra loop guarding against that.

This commit is part of a series to replace the leveldb database
currently used for storing block indexes and headers with a flat file
storage. This is motivated by the kernel library, where the usage of
leveldb is a limiting factor to its future use cases.

50133ca57b

sedited force-pushed on Jul 1, 2026

sedited commented at 9:52 AM on July 1, 2026: contributor

Thanks for the reviews @stickies-v @stringintech @willcl-ark and bearing with me!

adbd59eafae4e5c8d27af4ca13dda16e989938b3 -> 50133ca57be6aea9360d4bae140159a75b884212 (blocktreestore_30 -> blocktreestore_31, compare)

Addressed @stickies-v's comment, catching on std::exception instead of std::ios_base when reading the block file info.
Addressed @stringintech's comment and @willcl-ark's comment, implementing the directory write lock as an RAII wrapper class. Also moved it to a separate commit.

in src/bench/write_block_index.cpp:99 in 135387e632

  94 | +    BuildBlockIndex(block_map);
  95 | +    for (auto& entry : block_map) {
  96 | +        blocks.push_back(&entry.second);
  97 | +    }
  98 | +
  99 | +    bench.run("leveldb", [&] {

stickies-v commented at 5:42 PM on July 2, 2026:

in 135387e63210db8f2f8dd9c977f72fc9425f1bac:

nit: these string arguments override the function name, meaning it's now shown as "leveldb" instead of "WriteBlockIndexLevelDB" in the bench results (and similarly, what's used for --filter). I think the function name is much more descriptive, and it's also the pattern we generally use.

(here + WriteBlockIndexBlockTreeStore)

in src/kernel/blocktreestorage.cpp:269 in 50133ca57b

 264 | +        if (ec && ec != std::errc::no_such_file_or_directory) {
 265 | +            throw BlockTreeStoreError(strprintf("Could not remove flag file %s", fs::PathToString(path)));
 266 | +        }
 267 | +    }
 268 | +    if (directory_commit) {
 269 | +        DirectoryCommit(path.parent_path());

stickies-v commented at 7:52 PM on July 2, 2026:

This is a no-op on Windows. I think this can lead to several edge case issues, like the block tree being marked as unpruned when blocks have already been removed, or the WAL not being applied atomically.

From my reading it seems like FlushFileBuffers (through FileCommit) can be used to persist the creation, but not the deletion, of flags. So perhaps the better persistence model for flags is to use their contents rather than their existence? There are various approaches here, but I think just keeping one file per flag is reasonable, with contents e.g. <value><checksum>, so that we have pretty decent torn write and corruption guarantees?
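
A minimal sketch of what such a `<value><checksum>` flag record might look like, assuming a 1-byte value followed by a 4-byte little-endian CRC32C. The layout and the names `FlagRecord`, `EncodeFlag` and `DecodeFlag` are hypothetical and not part of this PR. The checksum is a plain bitwise CRC32C written out here only to keep the example self-contained.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
uint32_t Crc32c(const uint8_t* data, std::size_t len)
{
    uint32_t crc{0xFFFFFFFFu};
    for (std::size_t i{0}; i < len; ++i) {
        crc ^= data[i];
        for (int bit{0}; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial if the dropped bit was set.
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical on-disk layout: 1 value byte + 4 checksum bytes.
using FlagRecord = std::array<uint8_t, 5>;

FlagRecord EncodeFlag(bool value)
{
    FlagRecord rec{};
    rec[0] = value ? 1 : 0;
    const uint32_t crc{Crc32c(rec.data(), 1)};
    for (std::size_t i{0}; i < 4; ++i) {
        rec[1 + i] = static_cast<uint8_t>(crc >> (8 * i));
    }
    return rec;
}

// Returns the flag value, or nullopt if the record is corrupt or torn.
std::optional<bool> DecodeFlag(const FlagRecord& rec)
{
    if (rec[0] > 1) return std::nullopt;
    uint32_t stored{0};
    for (std::size_t i{0}; i < 4; ++i) {
        stored |= static_cast<uint32_t>(rec[1 + i]) << (8 * i);
    }
    if (stored != Crc32c(rec.data(), 1)) return std::nullopt;
    return rec[0] == 1;
}
```

With this shape, a reader distinguishes three states (set, unset, unreadable) instead of inferring the value from whether a file exists, which is the property the comment above is after.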

in src/kernel/blocktreestorage.cpp:482 in 50133ca57b

 477 | +    if (rolling_checksum != stored_rolling_checksum) {
 478 | +        throw BlockTreeStoreError("Detected on-disk log file corruption: Rolling checksum mismatch");
 479 | +    }
 480 | +
 481 | +    (void)log_file.fclose();
 482 | +    // Reapplying a complete log (in case of a later failure) is idempotent, so avoid an unnecessary directory commit.

stickies-v commented at 8:05 PM on July 2, 2026:

Upon further thought, this is not necessarily idempotent. When the next WriteBatchSync starts before the dir has been committed, it is possible to have a log_flag and an incomplete log. However, this will just throw and I don't see how it can lead to consistency issues, so I don't think we need to handle it, the user will just be instructed to -reindex and that seems fine.

Perhaps a mention in the docstring would be useful to help debug in case it ever comes up, though.

in src/kernel/blocktreestorage.cpp:198 in 50133ca57b

 193 | +}
 194 | +
 195 | +static AutoFile OpenFileAndVerifyHeader(const fs::path& path, uint32_t magic_expected, uint32_t version_expected)
 196 | +{
 197 | +    auto file{OpenFile(path, "rb")};
 198 | +    if (auto magic{ser_readdata32(file)}; magic != magic_expected) {

stickies-v commented at 8:28 PM on July 2, 2026:

nit: for < 8 bytes, this will throw ios_base::failure, and we only catch BlockTreeStoreError higher up, so I think we should explicitly handle this:

diff --git a/src/kernel/blocktreestorage.cpp b/src/kernel/blocktreestorage.cpp
index bcc26101a5..a5fd2552e5 100644
--- a/src/kernel/blocktreestorage.cpp
+++ b/src/kernel/blocktreestorage.cpp
@@ -195,6 +195,9 @@ static void CreateDataFile(const fs::path& path, uint32_t magic, uint32_t versio
 static AutoFile OpenFileAndVerifyHeader(const fs::path& path, uint32_t magic_expected, uint32_t version_expected)
 {
     auto file{OpenFile(path, "rb")};
+    if (file.size() < int64_t(sizeof(decltype(magic_expected)) + sizeof(decltype(version_expected)))) {
+        throw BlockTreeStoreError(strprintf("Empty or truncated header in %s", fs::PathToString(path)));
+    }
     if (auto magic{ser_readdata32(file)}; magic != magic_expected) {
         throw BlockTreeStoreError(strprintf("Invalid magic in %s: 0x%08x (expected: 0x%08x)", fs::PathToString(path), magic, magic_expected));
     }

</details>

willcl-ark commented at 4:43 PM on July 7, 2026: member

thanks for the latest changes @sedited.

I still feel like the storage model is slightly in-between two WAL designs. Perhaps this is OK for our purposes though.

As I understand it, the current model is: the writer commits a WAL, readers ignore the WAL, and then the writer immediately applies the WAL to the data files so readers can continue treating the data files as the only readable state. That can work, but then do READ-mode readers need to be protected from observing a committed-but-unapplied WAL?

The alternative model would be closer to sqlite WAL mode, where the WAL is part of the readable state. Readers read the data files plus any committed WAL overlay, and applying/checkpointing the WAL into the data files is a later maintenance step. That would be a bigger design change for this PR (at this stage), but it avoids the “committed but not readable until applied” state.

Sticking with the current design then, are we happy with OpenMode::READ returning without checking for a pending committed WAL? In ef5fbe65b39, READ mode skips the constructor path that verifies the data files and calls ApplyLog(). Not applying here is intentional (it's read-only mode), but since READ-mode methods then read the data files directly, a reader could observe stale or partially applied data if log_flag.dat is present from a failed writer. Is that OK in this use case? I think it probably is, but was wondering what you thought of it.

I still don’t think READ mode should silently apply the WAL if our goal is true read-only access on readers. But I wonder if we need to detect this state and fail (or block) clearly, or something else.
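
To make the two models discussed above concrete, here is a rough in-memory sketch of the overlay variant, where committed WAL entries are part of the readable state. Everything in it is an assumption for illustration: `OverlayStore`, its members and the key/value types are hypothetical and do not correspond to the BlockTreeStore API.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical store: `data` stands in for the data files, `wal` for the
// committed but not yet checkpointed log entries.
struct OverlayStore {
    std::map<uint32_t, uint64_t> data;
    std::vector<std::pair<uint32_t, uint64_t>> wal;

    // Readers consult the WAL first (newest entry wins), then the data files.
    std::optional<uint64_t> Read(uint32_t key) const
    {
        for (auto it{wal.rbegin()}; it != wal.rend(); ++it) {
            if (it->first == key) return it->second;
        }
        if (auto it{data.find(key)}; it != data.end()) return it->second;
        return std::nullopt;
    }

    // Checkpointing folds the WAL into the data files as a later step.
    // What Read() returns is the same before and after.
    void Checkpoint()
    {
        for (const auto& [key, value] : wal) data[key] = value;
        wal.clear();
    }
};
```

In the current design, by contrast, `Read` would look only at `data`, which is why a committed entry stays invisible until the writer has applied it.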

DrahtBot added the label Needs rebase on Jul 9, 2026

DrahtBot commented at 9:31 AM on July 9, 2026: contributor

🐙 This pull request conflicts with the target branch and needs rebase.

sedited commented at 9:42 AM on July 9, 2026: contributor

Thanks for all the review here. Going to convert this to draft for now.

> I still don’t think READ mode should silently apply the WAL if our goal is true read-only access on readers. But I wonder if we need to detect this state and fail (or block) clearly, or something else.

Yes, I think I prefer the log being applied on the writer's side. I also think that we should implement true parallel reader access without readers blocking each other, and I'd like to mitigate the write penalty by getting rid of the log flag file again and applying the log in a separate thread. Will take a bit of time to implement all of this.

sedited marked this as a draft on Jul 9, 2026