(RFC) kernel: Replace leveldb-based BlockTreeDB with flat-file based store #32427

pull TheCharlatan wants to merge 6 commits into bitcoin:master from TheCharlatan:blocktreestore changing 22 files +887 −99
  1. TheCharlatan commented at 2:27 pm on May 6, 2025: contributor

    Opening this PR mostly to get concept/approach feedback.

    This is motivated by the kernel library, where the internal usage of leveldb is a limiting factor to its future use cases. Specifically, it is not possible to share a leveldb database between two processes. A notable use case for the kernel library is accessing and analyzing existing block data. Currently this can only be done by first shutting down the node writing this data. Moving away from leveldb opens the door towards doing this in parallel. A flat-file-based approach was chosen, since the requirements for persistence here are fairly simple (no deletion, constant-size entries). The change also offers better performance by making node startup faster, and has a smaller on-disk footprint, though this is negligible in the grand scheme of things.

    The change also opens the door to eliminating the need to reindex the block tree after corruption (though switching from pruned -> un-pruned still requires extra logic). This would simplify a lot of kernel/validation code, as can be seen in this commit. While the implementation here should be fairly robust against corruption, it could also be extended with a full write-ahead log that could eliminate most corruption vectors (e.g. loss of power).

    The BlockTreeStore introduces a new data format for storing block indexes and headers on disk. The class is very similar to the existing CBlockTreeDB, which stores the same data in a leveldb database. Unlike CBlockTreeDB, the data stored through the BlockTreeStore is directly serialized and written to flat .dat files. The storage schema introduced is simple. It relies on the assumption that no entry is ever deleted and that no duplicate entries are written. These assumptions hold for the current users of CBlockTreeDB.

    An alternative to this pull request, that could allow the same kernel feature, would be closing and opening the leveldb database only when reading and writing. This might incur a (negligible) performance penalty, but more importantly requires careful consideration of how to handle any contentions when opening, which might have complex side effects due to our current locking mode. It would also be possible to introduce an existing database with the required features for just the block tree, but that would introduce reliance on a new dependency and come with its own tradeoffs. For these reasons I chose this approach.
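For readers unfamiliar with the approach, an append-only store with constant-size entries can be sketched roughly as follows. This is a hypothetical record layout and hypothetical helper names, not the actual BlockTreeStore format; it only illustrates why "no deletion, constant-size entries" makes the schema simple:

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>

// Hypothetical fixed-size record; real BlockTreeStore entries hold
// serialized CBlockIndex / CBlockFileInfo data instead.
struct Record {
    uint8_t key[32];   // e.g. a block hash
    uint64_t value;    // e.g. a file position or status flags
};
static_assert(sizeof(Record) == 40, "records must be constant-size");

// Append a record and return its position. Because entries are never
// deleted, a position stays valid for the lifetime of the file.
size_t AppendRecord(const std::string& path, const Record& rec)
{
    std::ofstream f(path, std::ios::binary | std::ios::app);
    f.write(reinterpret_cast<const char*>(&rec), sizeof(rec));
    f.close();
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    return static_cast<size_t>(in.tellg()) / sizeof(Record) - 1;
}

// Update a record in place at a known position; only possible because
// every record occupies exactly sizeof(Record) bytes.
void UpdateRecord(const std::string& path, size_t pos, const Record& rec)
{
    std::fstream f(path, std::ios::binary | std::ios::in | std::ios::out);
    f.seekp(pos * sizeof(Record));
    f.write(reinterpret_cast<const char*>(&rec), sizeof(rec));
}

std::optional<Record> ReadRecord(const std::string& path, size_t pos)
{
    std::ifstream f(path, std::ios::binary);
    f.seekg(pos * sizeof(Record));
    Record rec;
    if (!f.read(reinterpret_cast<char*>(&rec), sizeof(rec))) return std::nullopt;
    return rec;
}
```

Lookups by hash would still need an in-memory map from key to position, which the node already builds at startup when loading the block index.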

  2. kernel: Add headerstorage module
    The BlockTreeStore introduces a new data format for storing block
    indexes and headers on disk. The class is very similar to the existing
    CBlockTreeDB, which stores the same data in a leveldb database. Unlike
    CBlockTreeDB, the data stored through the BlockTreeStore is directly
    serialized and written to flat .dat files. The storage schema introduced
    is simple. It relies on the assumption that no entry is ever
    deleted and that no duplicate entries are written. These assumptions
    hold for the current users of CBlockTreeDB.
    
    In order to efficiently update a CBlockIndex entry in the store, a new
    field is added to the class that tracks its position in the file. New
    serialization wrappers are added for both the CBlockIndex and
    CBlockFileInfo classes to avoid serializing integers as VARINT. Using
    VARINT encoding would make updating these fields impossible, since
    changing them might overwrite existing entries in the file.
    
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    fac1853471
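The update-in-place constraint described in this commit message can be illustrated with a toy encoder. This is illustrative only: Bitcoin Core's actual VARINT format differs in detail, but shares the property that the encoded width grows with the value, so rewriting a field could overwrite the bytes that follow it:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy LEB128-style varint, for illustration only.
std::vector<uint8_t> EncodeVarInt(uint64_t n)
{
    std::vector<uint8_t> out;
    do {
        uint8_t byte = n & 0x7f;
        n >>= 7;
        if (n) byte |= 0x80; // continuation bit: more bytes follow
        out.push_back(byte);
    } while (n);
    return out;
}

// Fixed-width little-endian encoding: always 8 bytes, so the field can
// be rewritten in place without shifting anything after it.
std::vector<uint8_t> EncodeFixed64(uint64_t n)
{
    std::vector<uint8_t> out(8);
    for (int i = 0; i < 8; ++i) out[i] = (n >> (8 * i)) & 0xff;
    return out;
}
```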
  3. fuzz: Use BlockTreeStore in block index fuzz test
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    5d9c087cef
  4. blockstorage: Replace BlockTreeDB with BlockTreeStore
    This hooks up the newly introduced BlockTreeStore class to the actual
    codebase. It also adds a migration function that migrates the old
    leveldb block index to the new format on startup.
    
    The migration first reads the data from leveldb (blocks/index) and
    writes it to a BlockTreeStore in a separate migration directory
    (blocks/migration). Once done, the original directory (blocks/index)
    is deleted and the migration directory is renamed to the original
    name.
    
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    21d0d6d590
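The migrate-then-rename flow described in this commit message can be sketched roughly as follows. The helper name is hypothetical, and a placeholder file stands in for the actual leveldb-to-flat-file conversion; the point is that the rename acts as the commit point:

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Sketch of the migration pattern: build the new store in a side
// directory, then swap it in with a rename. A crash before the rename
// leaves the old data intact; a crash after it leaves the new data
// intact.
void MigrateIndexDir(const fs::path& datadir)
{
    const fs::path old_dir = datadir / "blocks" / "index";
    const fs::path migration_dir = datadir / "blocks" / "migration";

    fs::create_directories(migration_dir);
    // ... read every entry from the old store and write it to the new
    // one; a single placeholder file stands in for that step here.
    std::ofstream(migration_dir / "headers.dat") << "migrated";

    fs::remove_all(old_dir);            // drop the old format
    fs::rename(migration_dir, old_dir); // take its place
}
```

Note that the window between `remove_all` and `rename` is not itself crash-safe; a real implementation would need a marker or ordering that lets startup detect and finish a half-completed migration.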
  5. kernel: Remove block tree db params
    These are no longer needed after the migration to the new
    BlockTreeStore.
    
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    d78aa833b8
  6. kernel: Add assumed header store to chainparams
    Adds constants for pre-allocating the file size of the header storage
    file in the BlockTreeStore. The chosen constants leave a bit of extra
    space beyond the actual requirement. They may be updated on every
    release, though it is also not a strict requirement to do so.
    
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    4201dcf625
  7. blockstorage: Remove BlockTreeDB dead code
    This is not called by anything anymore, so just remove it.
    
    This commit is part of a series to replace the leveldb database
    currently used for storing block indexes and headers with a flat file
    storage. This is motivated by the kernel library, where the usage of
    leveldb is a limiting factor to its future use cases. It also offers better
    performance and has a smaller on-disk footprint, though this is mostly
    negligible in the grand scheme of things.
    fabd3ab615
  8. DrahtBot commented at 2:28 pm on May 6, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32427.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK josibake
    Approach ACK w0xlt

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #31974 (Drop testnet3 by Sjors)
    • #31533 (fuzz: Add fuzz target for block index tree and related validation events by mzumsande)
    • #31382 (kernel: Flush in ChainstateManager destructor by TheCharlatan)
    • #30595 (kernel: Introduce initial C header API by TheCharlatan)
    • #19463 (Prune locks by luke-jr)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    LLM Linter (✨ experimental)

    Possible typos and grammar issues:

    • fs::PathToString(m_header_file_path) -> fs::PathToString(m_block_files_file_path)
    • char -> chars
  9. w0xlt commented at 6:12 pm on May 6, 2025: contributor

    Approach ACK

    The codebase changes seem surprisingly small for this proposal. Changing the code to reduce the dependency on LevelDB sounds good to me.

  10. shahsb commented at 2:43 am on May 7, 2025: none

    Thanks @TheCharlatan for this proposal and for making the code changes!

    Please find my review comments, suggestions and some clarifying questions:

    1. How will concurrent reads (and potentially writes) be handled with the flat file format?

    2. Even if no deletion occurs, file corruption or partial writes can happen. Are you planning to use mmap or memory buffering?

    3. What would be the “Corruption Recovery Strategy”? – While the write-ahead log is mentioned as future work, providing even a minimal rollback/recovery mechanism in the initial version would make this stronger.

    4. What would be the “Migration Path”? – Will there be tooling or a migration process from existing leveldb-based data?

    5. Data Integrity Guarantees? – Are checksums or hash-based verifications being added per entry or per file?

  11. shahsb commented at 2:44 am on May 7, 2025: none
    1. Consider including a pluggable interface that allows fallback to LevelDB for testing or backwards compatibility
  12. shahsb commented at 2:55 am on May 7, 2025: none
    1. Write contention and corruption risks: – While flat files avoid LevelDB’s process-level locking, concurrent writes require a mechanism (e.g., file locks, flock()) to prevent race conditions.
    2. Portability and Cross platform related edge cases: – Ensure file-locking mechanisms (e.g., fcntl on Unix, LockFileEx on Windows) are robust.
  13. shahsb commented at 2:56 am on May 7, 2025: none
    1. How about Handling Large Files? – Test edge cases like file sizes approaching OS limits (e.g., 2+ GB on 32-bit systems). What would happen in such cases?
  14. shahsb commented at 3:02 am on May 7, 2025: none
    1. On similar lines to point 5 – a corruption-detection mechanism could also be implemented to detect corruption very early in the cycle.
  15. shahsb commented at 3:14 am on May 7, 2025: none

    The flat-file approach is a reasonable trade-off given the simplicity of the block tree storage requirements.

    However, there are significant challenges and risks involved with this approach, as highlighted in the comments above (concurrency, corruption/integrity, performance, migration, corruption detection, portability, backward compatibility, etc.).

  16. laanwj added the label UTXO Db and Indexes on May 7, 2025
  17. josibake commented at 8:57 am on May 7, 2025: member

    Concept ACK

    A notable use-case for the kernel library is accessing and analyzing existing block data

    A concrete example is index building in electrs / esplora / etc. For example, Electrs does this today by:

    1. Waiting for Bitcoin Core to finish IBD
    2. Reading all of the blocks out over JSON-RPC
    3. Parsing them and writing them into the appropriate indexes

    I started on a PoC for Electrs to use libbitcoinkernel for index building to demonstrate how this could be done much more efficiently and saw promising results. However, the requirement that Bitcoin Core be shut down before Electrs could process the block files made this approach clunky. I’ll revive this PoC as a means of testing this PR and hopefully provide some use case motivated feedback on the approach.

  18. Sjors commented at 12:58 pm on May 7, 2025: member
    Have you considered simply having one block per file? Typical blk files are 130 MB, so for “modern” blocks it would mean roughly 50x the number of files. But is that actually a problem? It’s a lot simpler if we can just have $HASH.dat, maybe grouped in a directory per 10k blocks.
  19. TheCharlatan commented at 1:13 pm on May 7, 2025: contributor

    Have you considered simply having one block per file? Typical blk files are 130 MB, so for “modern” blocks it would mean roughly 50x the number of files. But is that actually a problem? It’s a lot simpler if we can just have $HASH.dat, maybe grouped in a directory per 10k blocks.

    I’m not sure what you are suggesting here. Are you suggesting we create ~900k files and then have some subdivision within those files into 10k groups where each of those has a single $HASH.dat with all the headers and file pointers for the blocks in that division? Is there something we gain through that? My impression is we do the file splitting in the first place to make pruning easier. We don’t prune headers, so I don’t think splitting the file gains us anything. I don’t think there is much wrong with just having a single file. Maybe it even helps the OS a bit to manage the file buffers?

  20. Sjors commented at 1:33 pm on May 7, 2025: member

    Are you suggesting we create ~900k files

    Yes. If we’re going to redesign block storage, it seems good to ask why we can’t just let the file system handle things.

    with all the headers and file pointers for the blocks in that division

    One block per file. We can still have a single file for the block index, which would contain the header (not just the hash) and validation state. Block files themselves would be in a predictable location, so we wouldn’t need an index for that.

    My impression is we do the file splitting in the first place to make pruning easier.

    Do you mean compared to the alternative of having a single file for all blocks? I would imagine that would create I/O problems, since the operating system wouldn’t know which part of the big file changed. And it can’t defragment it.

    Having one file per block makes pruning marginally easier than now, since you don’t have to worry about keeping nearby blocks in the same file.

    One downside of what I’m suggesting is that the headers would either be stored redundantly (in the block file as well as in the index), or anyone parsing the block files has to prepend the header themselves.

  21. ryanofsky commented at 1:59 pm on May 7, 2025: contributor

    How worried are we about file corruption here? I thought the main reason we use leveldb and sqlite databases in places like this, where we don’t need indexing, is that they support atomic updates, so you can pull the power cord at any time and on the next reboot you will see some consistent view of the data, even if it’s not the latest data. I didn’t look too closely at the implementation here, but it seems like it is updating data in the files in place, so if writes are interrupted, data could be corrupt, and it’s not clear if there are even checksums in place that would detect this.

    Maybe this is not an issue for the PR, but it would be good to make clear what types of corruption BlockTreeStore can and can’t detect and what types of corruption it can recover from. If it can do simple things to detect corruption like adding checksums, or to prevent it like writing to temporary files and renaming them in place, those could be good to consider.
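The two safeguards suggested here could be sketched as follows. This is illustrative only: CRC32 and the helper names are assumptions for the sketch, not part of the PR, and a real implementation would also fsync before the rename:

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Per-record checksum to *detect* torn or bit-flipped writes. CRC32 is
// used here for brevity; the real store might prefer a stronger hash.
uint32_t Crc32(const std::vector<uint8_t>& data)
{
    uint32_t crc = 0xffffffff;
    for (uint8_t b : data) {
        crc ^= b;
        for (int i = 0; i < 8; ++i)
            crc = (crc >> 1) ^ (0xedb88320u & (0u - (crc & 1)));
    }
    return ~crc;
}

// Write-to-temp-then-rename to *prevent* partial whole-file updates:
// readers only ever see the old file or the complete new one.
void AtomicWriteFile(const std::filesystem::path& path, const std::string& contents)
{
    const auto tmp = path.string() + ".tmp";
    std::ofstream(tmp, std::ios::binary) << contents; // (+ fsync in real code)
    std::filesystem::rename(tmp, path); // POSIX rename() replaces atomically
}
```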

    If this PR does introduce some increased risk of corruption, maybe that is worth it for reasons listed in the description. I also think another alternative could be to use sqlite for this since this would not necessarily introduce a new dependency and we already have a ReadKey/WriteKey/EraseKey/HasKey wrapper functions for sqlite that might help it be an easy replacement for leveldb.

  22. TheCharlatan commented at 6:44 pm on May 7, 2025: contributor

    Re #32427 (comment)

    Do you mean compared to the alternative of having a single file for all blocks? I would imagine that would create I/O problems, since the operating system wouldn’t know which part of the big file changed. And it can’t defragment it.

    Yes, that is what I meant. We never change block files, so that is not a problem. I’m also not sure how real this problem actually is. A bunch of databases just maintain one big file and have good performance doing so. I’m still not sure what the benefit of what you propose would be. Either way, I think this is a bit out of scope, since while this change implements a database migration, it does not require a reindex, which a change to the block file format would. Improving pruning behavior is also not the goal here.

  23. TheCharlatan commented at 7:05 pm on May 7, 2025: contributor

    Re #32427 (comment)

    How worried are we about file corruption here?

    I was hoping to provoke a discussion about this as I alluded to in the PR description - thanks for providing your thoughts on this. I think the proof of work and integrity checks done on loading the index already provide fairly solid guarantees on load, but agree that we should do better. I have also talked to some other people about it offline, and there seems to be some appetite for improving corruption resistance. It is my understanding that the feature for reindexing the block tree was added as a salvaging option, because leveldb does not provide strong anti-corruption guarantees, but has sprawled a bit since. Removing the need to provide code for reindexing the block tree would be a nice simplification of validation code in my eyes. I think adding a checksum for the entries and writing from a log file could be fairly simple to implement and provide strong guarantees. I’m open to suggestions here.

    I also think another alternative could be to use sqlite for this since this would not necessarily introduce a new dependency and we already have a ReadKey/WriteKey/EraseKey/HasKey wrapper functions for sqlite that might help it be an easy replacement for leveldb.

    This has also been brought up by some others. While I’m still not sure that we should be introducing a new validation dependency, maybe it would be good to implement it and open the change as an alternative draft / RFC pull request in the meantime?

  24. mzumsande commented at 9:06 pm on May 7, 2025: contributor

    A full -reindex can be necessary for two reasons:

    1. corruption in the block tree db
    2. corruption in the blk files.

    In my personal experience of running a node on crappy hardware a long time ago, it was usually 2. that would happen (I knew that because the reindex wouldn’t scan all block files but abort with an error somewhere, and switch to IBD from peers). My suspicion is that while 1. may have been the dominant reason in the early years, 2. may be just as important today.

    However, if that was the case, changing the block tree db format wouldn’t allow us to get rid of -reindex, even if the new format would never corrupt.

  25. Sjors commented at 7:01 am on May 8, 2025: member

    We never change block files, so that is not a problem. I’m also not sure how real this problem actually is.

    But we prune blocks, and they may not all be at the start of the big file.

    A bunch of databases just maintain one big file and have good performance doing so.

    Even on a spinning disk? That’s where I tend to keep my .dat files.

    I’m still not sure what the benefit of what you propose would be.

    Compared to the current situation where we bundle a bunch of, but not all, blocks in one file, it just seems simpler to have one file per block.

    In the “corruption in the blk files” example above it also makes recovery really easy: just load the block files one by one, hash them, redownload if the hash doesn’t match. No need to update any index.

  26. ryanofsky commented at 4:44 pm on May 8, 2025: contributor

    re: TheCharlatan #32427 (comment)

    Thanks for clarifying the situation with leveldb. I just assumed based on its design that it would support atomic updates pretty robustly but if it has corruption problems of its own then it doesn’t sound like we would lose much by switching to something simpler.

    I still do think using sqlite could be a nice solution, because data consistency issues can be a significant source of pain for users and developers, and with sqlite we basically just don’t need to think about those issues. But I also understand not wanting to require sqlite as a kernel dependency.

    Another thing about this PR (as of fabd3ab615a7c718f37a60298a125864edb6106b) is that it doesn’t seem to actually remove much blockstorage code, and the BlockTreeDB class remains intact, I guess because the migration code depends on it.

    An idea that might improve this could be to make the BlockTreeStore methods pure virtual and have FlatFile and LevelDB subclasses implementing them. This could organize the code more cleanly by letting the FlatFile and LevelDB implementations live side by side outside of blockstorage, instead of one being inside and one being outside. It could also let kernel applications provide alternate backends, and allow things like differential fuzz testing. (This was also the exact same approach used to replace bdb with sqlite in the wallet.)
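A minimal sketch of such an interface (hypothetical method names, not the actual BlockTreeStore API), with an in-memory stand-in where the FlatFile and LevelDB backends would plug in:

```cpp
#include <map>
#include <string>

// Pure-virtual interface: backends are interchangeable behind it.
class BlockTreeStoreInterface
{
public:
    virtual ~BlockTreeStoreInterface() = default;
    virtual bool WriteEntry(const std::string& key, const std::string& value) = 0;
    virtual bool ReadEntry(const std::string& key, std::string& value) const = 0;
};

// In-memory stand-in for a real FlatFile or LevelDB backend.
class MemoryBlockTreeStore final : public BlockTreeStoreInterface
{
    std::map<std::string, std::string> m_entries;

public:
    bool WriteEntry(const std::string& key, const std::string& value) override
    {
        m_entries[key] = value;
        return true;
    }
    bool ReadEntry(const std::string& key, std::string& value) const override
    {
        const auto it = m_entries.find(key);
        if (it == m_entries.end()) return false;
        value = it->second;
        return true;
    }
};
```

Differential fuzzing would then drive two implementations with the same operation sequence and assert that every read returns the same result.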

    re: Sjors #32427 (comment)

    FWIW I also think using individual block files could be great (assuming a sharded directory structure like .git/objects to avoid having too many files per directory). That idea is mostly tangential to this PR though, I think? Possible I am missing some connections.
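The .git/objects-style sharding mentioned here amounts to something like the following (hypothetical path layout): the first two hex characters of the block hash pick a subdirectory, capping the fan-out of the top-level directory at 256 entries.

```cpp
#include <string>

// Derive a sharded on-disk path from a block hash, git-object style:
// "blocks/<first two hex chars>/<rest>.dat".
std::string ShardedBlockPath(const std::string& hash_hex)
{
    return "blocks/" + hash_hex.substr(0, 2) + "/" + hash_hex.substr(2) + ".dat";
}
```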


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-05-09 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me