Reducing in-RAM block index size (reviving #24760 with measurements) #35612

ptrinh commented at 12:36 AM on June 27, 2026: none

Following up on #24760 ("The BlockIndex/BlockMap should not live in memory all the time"), which was closed as stale in 2024 with "feel free to open a new issue". In that thread the main open question was empirical: is the block index actually a meaningful share of memory, and is it worth the complexity? Here are some measurements, plus a motivation that is sharper than the default-node case discussed in 2024: memory-constrained / low-power nodes.

Measurements

BlockMap = std::unordered_map<uint256, CBlockIndex> at current mainnet height (~955k entries).

Microbench (real unordered_map, default allocator, 955k entries, peak RSS):

| entry layout | sizeof | BlockMap RSS | bytes/entry | delta |
| --- | --- | --- | --- | --- |
| current CBlockIndex | 144 B | 231 MB | ~254 B | - |
| drop cached header (re-read from BlockTreeDB) | 96 B | 173 MB | ~190 B | -58 MB |
| DB-backed essentials only | 40 B | 114 MB | ~126 B | -117 MB |

(The ~254 B/entry includes real map node + bucket overhead, matching the ~216-224 B figure from #24760. It is notably larger than sizeof-based math, so measuring matters.)
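To illustrate why sizeof-based math undercounts, here is a minimal sketch (not the microbench used above) that counts the bytes an unordered_map requests per entry through a counting allocator. The names CountingAllocator, Payload, KeyHash and RequestedBytesPerEntry are hypothetical stand-ins, the 32-byte array stands in for uint256, and the count excludes malloc's own per-chunk overhead and size-class rounding, so real RSS per entry would be higher still.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>

// Bytes currently requested through CountingAllocator (requested sizes only;
// allocator-internal overhead is not visible at this level).
static std::size_t g_requested_bytes = 0;

template <typename T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        g_requested_bytes += n * sizeof(T);
        return std::allocator<T>{}.allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        g_requested_bytes -= n * sizeof(T);
        std::allocator<T>{}.deallocate(p, n);
    }
};
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

// Hypothetical stand-ins: a 32-byte key and an opaque payload of N bytes.
using Key = std::array<std::uint8_t, 32>;
template <std::size_t N>
struct Payload { std::uint8_t bytes[N]; };

struct KeyHash {
    std::size_t operator()(const Key& k) const noexcept {
        std::size_t h;
        std::memcpy(&h, k.data(), sizeof(h));  // keys are already well mixed
        return h;
    }
};

// Requested bytes per entry (map nodes plus bucket array) for a payload size.
template <std::size_t N>
double RequestedBytesPerEntry(std::size_t entries) {
    using Value = std::pair<const Key, Payload<N>>;
    using Map = std::unordered_map<Key, Payload<N>, KeyHash, std::equal_to<Key>,
                                   CountingAllocator<Value>>;
    const std::size_t before = g_requested_bytes;
    Map map;
    for (std::size_t i = 0; i < entries; ++i) {
        Key k{};
        // Multiplying by an odd constant is a bijection on 64-bit values,
        // so every key is distinct.
        const std::uint64_t x = static_cast<std::uint64_t>(i) * 0x9E3779B97F4A7C15ULL;
        std::memcpy(k.data(), &x, sizeof(x));
        map.emplace(k, Payload<N>{});
    }
    return static_cast<double>(g_requested_bytes - before) / static_cast<double>(entries);
}
```

Calling RequestedBytesPerEntry<144>(n) and RequestedBytesPerEntry<96>(n) for the same n gives a rough per-layout comparison; the absolute numbers depend on the standard library's node layout and bucket growth policy.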

Real-node confirmation (a synced mainnet node, restart, RSS sampled through the load phases): the fresh process goes from ~120 MB to a settled ~340 MB across the ~6 s block-index load ("Loading block index" -> "Loaded best chain"), i.e. ~220 MB of resident BlockMap, consistent with the microbench. ("Using 2.0 MiB for block index database" confirms this is the in-RAM map, not LevelDB cache.) It grows with chain height.

Why this matters more for constrained nodes

On a default node, dbcache + mempool dominate and the block index is a minor share, which is roughly the conclusion of the 2024 thread. But on a deliberately lean low-power node (-blocksonly, no txindex/blockfilterindex, minimal -dbcache), those consumers shrink to near-nothing and the block index becomes the dominant non-reclaimable floor - roughly 220 MB out of a ~300-400 MB total. The chainstate is mmap'd and reclaimable; the BlockMap is not. For a 512 MB-class device, 58-117 MB is 11-23% of total RAM.

Possible directions (in increasing scope)

1. Lazy-load the cached block header (nVersion, hashMerkleRoot, nTime, nBits, nNonce; 48 B) and re-read from BlockTreeDB on demand, keeping a small recent-window cache for hot paths (MTP, header relay). These fields are already persisted on disk, and are read through a bounded surface (GetBlockHeader() plus ~34 direct accesses). ~58 MB. Lowest risk.
2. sipa's suggestion in #24760: decouple CBlockIndex::pprev, pull prev from DB/cache and link by hash (looping pprev is rare outside deep reorgs; height arithmetic covers the common case). Enables a DB-backed, much smaller resident index. ~117 MB+. Larger change.
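As a sanity check on the 48 B figure in the first direction, the five header fields can be laid out in a stand-in struct. This is only a sketch: CachedHeaderFields is a hypothetical name, the field types are assumed from the usual block header layout, and the sizeof-only saving below ignores allocator size classes, which is why it need not equal the measured RSS deltas in the table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the five cached header fields named above.
struct CachedHeaderFields {
    std::int32_t nVersion;
    std::array<std::uint8_t, 32> hashMerkleRoot;
    std::uint32_t nTime;
    std::uint32_t nBits;
    std::uint32_t nNonce;
};
static_assert(sizeof(CachedHeaderFields) == 4 + 32 + 4 + 4 + 4,
              "expected 48 bytes with no padding");

// sizeof reported for the current entry in the measurement table.
constexpr std::size_t kCurrentEntrySizeof = 144;

// Entry size once the cached header fields are removed.
constexpr std::size_t SlimEntrySizeof() {
    return kCurrentEntrySizeof - sizeof(CachedHeaderFields);
}
static_assert(SlimEntrySizeof() == 96, "matches the 96 B row in the table");

// Saving predicted by sizeof alone, before any allocator effects.
constexpr std::uint64_t SizeofOnlySavingBytes(std::uint64_t entries) {
    return entries * sizeof(CachedHeaderFields);
}
```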

Caveats and open questions

The savings figures are from a faithful microbench + a real-node baseline confirmation, not yet a full modified-bitcoind before/after. Happy to build that prototype if there is appetite.
Tradeoff is RAM vs occasional disk reads (header serve, deep-reorg traversal) and added complexity in foundational code. Whether that tradeoff is worth it for the constrained-node use case is the real question.
Is there interest in this with the low-power framing, and if so which direction is preferred before someone invests in a full implementation?

maflcko commented at 7:34 AM on June 27, 2026: member

For a 512 MB-class device, 58-117 MB is 11-23% of total RAM.

Are 512MB devices common or even supported? I had the impression that 1024MB was the minimum. Even compilation recommends 1.5 GB (https://github.com/bitcoin/bitcoin/blob/master/doc/build-unix.md#memory-requirements)

Tradeoff is RAM vs occasional disk reads (header serve, deep-reorg traversal) and added complexity in foundational code. Whether that tradeoff is worth it for the constrained-node use case is the real question.

Hmm, is the unordered_map in different memory pages? If the historic entries (loaded after a restart) are next to each other in memory pages, I'd presume that you can achieve the same with swap already today, without any code changes?

maflcko added the label Resource usage on Jun 27, 2026

ptrinh commented at 7:56 AM on June 27, 2026: none

Good points, thanks.

On 512MB: fair, I overstated that - ~1GB is the realistic floor and the build docs back that up. A few reframes that don't hinge on 512MB:

The value isn't a specific spec, it's lowering the cost/barrier to running a full node. Being able to repurpose older or cheaper low-RAM hardware - which gets more attractive as RAM prices climb - lets more people run a node cheaply, and more reachable nodes is a decentralization benefit for the network. Lower-end support is a means to that, not the goal in itself.
The BlockMap is also monotonic: it grows ~13 MB/year (~250 B/entry) and is never reclaimed, so it's a slowly-worsening non-reclaimable floor for every node, not just small ones. At a 1GB floor the 58-117 MB saving is still ~6-12% of RAM.
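The growth and share figures above follow from simple arithmetic, sketched here under the assumption of the 10-minute target block spacing (so actual yearly block counts will vary slightly). The function names are hypothetical helpers for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Blocks per year at a given target spacing (600 s assumed by default).
constexpr std::uint64_t BlocksPerYear(std::uint64_t target_spacing_seconds = 600) {
    return 365ULL * 24 * 60 * 60 / target_spacing_seconds;
}

// Yearly BlockMap growth in bytes for a given resident cost per entry.
constexpr std::uint64_t GrowthBytesPerYear(std::uint64_t bytes_per_entry) {
    return BlocksPerYear() * bytes_per_entry;
}

// Share of `whole` taken by `part`, in percent (same unit for both).
constexpr double PercentOf(double part, double whole) {
    return 100.0 * part / whole;
}

static_assert(BlocksPerYear() == 52560, "6 blocks/hour * 24 * 365");
static_assert(GrowthBytesPerYear(250) == 13140000, "about 13 MB per year");
```

With these helpers, PercentOf(58, 1024) and PercentOf(117, 1024) bracket the 1GB case, and PercentOf(58, 512) and PercentOf(117, 512) bracket the 512 MB case from the opening post.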

Not claiming it's urgent - just that the motivation holds beyond tiny devices.

On swap: I don't think swap gets you there, for a few reasons:

The entries aren't page-contiguous by recency. BlockMap is an unordered_map with one heap-allocated node per entry, and LoadBlockIndexGuts inserts them in BlockTreeDB key order (~block hash), so hot (recent) and cold (historic) entries end up interleaved across pages. Almost every page holds some hot entry, so there are few/no all-cold pages for the kernel to evict.
It's anonymous memory, so reclaim requires swap specifically - the kernel can't just drop the pages. Many constrained/SBC deployments run swapless, and where swap exists, faulting back in on access adds latency.
The directions here move that data to file-backed leveldb, where it's already persisted: the kernel can then drop those clean pages for free under pressure, no swap required. That's the part swap can't replicate.

So the page interleaving is sort of the crux - it's exactly what a code change (segregating or DB-backing cold entries) would address and what swap can't. Whether that's worth the complexity is still the open question; I just wanted to cover why "use swap" doesn't already get it for free.

maflcko commented at 9:59 AM on June 27, 2026: member

LoadBlockIndexGuts inserts them in BlockTreeDB key order (~block hash)

Ok, I see. In theory the BlockMap could be backed by an append-only hive (https://en.cppreference.com/cpp/container/hive) where insertion is roughly height-based, but of course it won't work on swapless systems.

I understand the current temporary RAM price climb, but I wonder if there are real users asking for such optimizations. There is a chance that the RAM prices will fall again over the next few years and an additional ~13 MB/year doesn't sound that expensive anyway.

ptrinh commented at 11:30 AM on June 27, 2026: none

Agreed on all of that, and the hive angle is nice - height-ordered allocation would make the cold entries page-adjacent so swap could evict them, though as you note that does not help the swapless case.

To follow up on your earlier 512MB question: I am not arguing 512MB is a comfortable default - you are right that the realistic floor is ~1GB. The case I have in mind is deliberate memory budgeting on a shared host rather than rare hardware. Concretely: capping a node VM/container at ~512MB on a Proxmox box or NAS so it co-exists with other services, or fitting a cheaper VPS tier. That is my own setup, and it lowers both hardware and maintenance cost versus a dedicated larger box. Raspberry Pi Zero-class devices are the embedded end of the same spectrum. IBD on such a device is impractical, but the common pattern is to sync on a fast machine and copy the datadir over, so the constrained box only ever runs steady state - which is exactly where this RAM sits. So it is a real deployment pattern, even if I cannot claim broad demand beyond that.

That said, I take your cost/benefit point. Two clarifications rather than a push:

The main lever is the one-time ~58-117 MB off the current ~220 MB baseline, not the ~13 MB/year growth - I mentioned the yearly figure only to note it is monotonic.
What it frees is non-reclaimable (anon) memory, which is the part that actually drives OOM under pressure, versus the reclaimable chainstate mmap. So it helps tight-memory stability a bit more than the raw MB suggests, though I agree that is niche.

If RAM prices normalize and nobody is hitting this in practice, I am fine leaving it as a documented measurement/analysis for if/when it becomes more pressing, rather than pushing a foundational change without demonstrated demand. Thanks for digging into it.

bitcoin blocked a user on Jun 27, 2026

ptrinh commented at 6:36 PM on July 7, 2026: none

Following up with the modified-bitcoind before/after I offered earlier.

I built a prototype of direction 1: the five cached header fields (nVersion, hashMerkleRoot, nTime, nBits, nNonce; 48 bytes) are removed from CBlockIndex (sizeof 144 -> 96) and re-read from BlockTreeDB on demand through a bounded LRU (20k entries) plus a pinned tier for headers not yet persisted. Disk format unchanged. Branch: https://github.com/ptrinh/bitcoin/tree/lazy-block-header
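For readers who have not opened the branch, the general shape of a bounded LRU with a pinned tier can be sketched as follows. This is an illustration of the technique, not the prototype's code: HeaderCache, Header, Loader, Pin and Unpin are hypothetical names, the 64-bit id stands in for the block hash, and the loader callback stands in for a BlockTreeDB read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <optional>
#include <unordered_map>
#include <utility>

// Hypothetical trimmed-down header; the real fields are listed above.
struct Header { std::int32_t nVersion; std::uint32_t nTime, nBits, nNonce; };

class HeaderCache {
public:
    using Id = std::uint64_t;  // stand-in for the block hash
    using Loader = std::function<std::optional<Header>(Id)>;

    HeaderCache(std::size_t capacity, Loader loader)
        : m_capacity(capacity), m_loader(std::move(loader)) {}

    // Pinned headers (e.g. not yet persisted) are never evicted.
    void Pin(Id id, const Header& h) { m_pinned[id] = h; }

    // Once persisted, demote the header into the evictable LRU tier.
    void Unpin(Id id) {
        auto p = m_pinned.find(id);
        if (p == m_pinned.end()) return;
        InsertLru(id, p->second);
        m_pinned.erase(p);
    }

    std::optional<Header> Get(Id id) {
        if (auto p = m_pinned.find(id); p != m_pinned.end()) return p->second;
        if (auto it = m_index.find(id); it != m_index.end()) {
            // Move to the front; list iterators stay valid across splice.
            m_lru.splice(m_lru.begin(), m_lru, it->second);
            return it->second->second;
        }
        std::optional<Header> loaded = m_loader(id);  // miss: one backing read
        if (loaded) InsertLru(id, *loaded);
        return loaded;
    }

private:
    void InsertLru(Id id, const Header& h) {
        if (m_capacity == 0) return;
        if (auto it = m_index.find(id); it != m_index.end()) {
            it->second->second = h;
            m_lru.splice(m_lru.begin(), m_lru, it->second);
            return;
        }
        if (m_index.size() >= m_capacity) {  // evict least recently used
            m_index.erase(m_lru.back().first);
            m_lru.pop_back();
        }
        m_lru.emplace_front(id, h);
        m_index[id] = m_lru.begin();
    }

    std::size_t m_capacity;
    Loader m_loader;
    std::list<std::pair<Id, Header>> m_lru;  // front = most recently used
    std::unordered_map<Id, std::list<std::pair<Id, Header>>::iterator> m_index;
    std::unordered_map<Id, Header> m_pinned;
};
```

The pinned tier exists because a miss falls through to the backing store, which cannot answer for a header that has not been written yet. The sketch does no locking; any real use would need to decide which lock guards the cache.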

Measurement: same real mainnet blocks/index (~957k entries), same machine (macOS arm64), identical startup path through the block-index load, peak RSS sampled at 100ms, 3 runs each:

| binary | peak RSS through index load | delta |
| --- | --- | --- |
| master (a64df338e6) | 422.4 MB | - |
| lazy-header prototype | 377.5 MB | -44.9 MB |

Run-to-run variance was under 0.3 MB. The saving is somewhat below the ~58 MB my earlier microbench predicted for this layout; two honest reasons: the peak still includes a ~16 MB transient used during load to compute nChainWork without DB re-reads, and the microbench numbers were Linux/glibc while this measurement is macOS malloc (different size classes). The steady-state map saving should land between those figures on glibc; I can produce a Linux number if useful.

Tests: full unit suite (711 cases) and the full functional suite pass. One interesting wrinkle worth flagging for anyone attempting this for real: lazy LevelDB point reads verify checksums while the iterator-based initial load tolerates corruption, so corrupt-DB handling during init needed explicit care to keep producing the canonical "Error loading block database" failure (exercised by feature_init.py).

Caveats: prototype-grade, not PR-ready. The header cache is a global (mirroring cs_main style, but a real PR would want it inside BlockManager), the pinning lifecycle is minimal, and there are no IBD / headers-sync performance numbers yet. The LRU covers hot paths (MTP, recent-header relay); serving deep historic headers now costs a LevelDB point read per miss.

Happy to clean this up into a draft PR if the measured saving justifies review time; equally happy to leave it as a reference measurement if the demand question stays open.

l0rinc commented at 6:56 PM on July 7, 2026: contributor

Thanks a lot for investigating this and for providing reproducers. That makes the tradeoff much easier to reason about.

I do not think we realistically support 1 GB nodes today. I have access to a 2 GB node, and even there I can barely compile Core; IBD only works with considerable swap and an extremely low -dbcache=100.

It would be useful to reduce memory usage where the effort and risk are low. But given that the proposed savings are roughly in the same range as what a single flush can add anyway, I am not sure this is worth it yet. I also have memory-saving PRs where the implementation effort is low, although based on the feedback there the risk is still debatable.

So it is not obvious to me that we have a pressing memory problem here yet, unless we can find low-risk, high-reward cases. The promise should not be that "crappy 10 year old hardware will always work", but rather that "cheap nodes can run Core". A Raspberry Pi 5 with 8 GB RAM is less than $200, so I am not sure this specific issue is urgent enough to justify substantial complexity.

ptrinh referenced this in commit 17fb68f460 on Jul 7, 2026