Fountain Codes: a way to reduce blockchain storage costs

# · Lucas Lima · #1 · 20 Jun 2026, 10:31 UTC

Over the past few months, I have been conducting research on the application of Fountain Codes in Bitcoin, together with Vinteum.

Fountain codes allow the Bitcoin blockchain to be split into encoded fragments called droplets, such that any node can reconstruct the full blockchain by collecting and decoding a sufficient number of them. Each node only needs to store a fraction of this encoded data, reducing the storage burden per node.

I wrote a blog post explaining the idea and covering the progress I’ve made in the research so far. You can read it at https://lucasdbr05.com/posts/fountain-codes/.

# · Anthony Towns · #2 · 21 Jun 2026, 10:34 UTC

Few comments:

On the other hand, a larger k also requires the node to buffer more data before closing the epoch and generating the encoding.

I don’t think that’s true: if you have the droplet composition be deterministic based on a per-peer seed and the height, then you can construct the s droplets as you go. I think the idea was that peers would share the seed they use for working out what droplets they’ll select, both to save a little bit of data, but also to help guarantee unbiased random selection?

So with k=1,000 and s=10 you might decide your droplets for blocks 0-999 are going to be:

[18, 321, 688, 875, 890]
[36, 170]
[39, 178, 980]
[73, 104, 153, 167, 370, 577, 586, 864, 865, 978]
[91, 980]
[164, 359, 741, 842]
[208, 820]
[252, 330, 361, 399, 477, 478, 528, 667, 688, 964]
[351, 876]
[679]

But you can just build those up as you go, so at height 215, your droplets look like:

[18]
[36, 170]
[39, 178]
[73, 104, 153, 167]
[91]
[164]
[208]
empty
empty
empty

So I think your storage is just the fraction s/k versus storing all blocks. I think it would be reasonable to paremeterize based on k and r=s/k instead, with the potential for r being a per-node setting.

So, if we choose k = 1000 and s = 10, the target is approximately 100x storage savings.

Choosing 100x savings per-node means that you need to connect to 100 nodes (or more) in order to actually reconstruct all the blocks, whereas you’d only need to connect to 1 full archival node. Reportedly there are currently 27k NODE_NETWORK_LIMITED listening nodes and 24k NODE_NETWORK listening nodes; so if all current pruned nodes adopted this, it would only act as a bump from 24,000 copies of the blockchain available for download to 24,030. So I think this is much more a “robust engineering for future scenarios” than making a tangible improvement today.

With a 100x savings, that’s presumably an extra ~7GB on disk. A 20x savings (~35GB) would probably be pretty feasible for many node operators who aren’t able to store the full ~700GB of old blocks and would boost the additional number of copies of the blockchain in the above scenario from +30 to +150.

If you have to connect to 100+ peers instead of 1 (or 10) to do IBD, that’s a significantly increased connection load on pruned peers (who currently don’t participate in helping IBD at all), so it might be sensible to have connections made during IBD be block-relay-only, and, once IBD is complete, drop and reconnect in order to start doing tx relay.

the protocol also needs to define how to handle murky droplets.

My first thought is that you’d disconnect and ban a peer that gave you bad data (so not request any additional droplets), but might keep the remaining droplets they gave you on the offchance any of them turned out to be useful.

I wonder what the impact of different serialization sizes of blocks is here: if you have a droplet that’s [208, 820] and 208 is full of inscriptions and 4MB while 820 is full of runestones and 1MB, I think the droplet has to be the full 4MB. That seems like it could result in a multiplier effect, though probably isn’t too substantial unless k is very large.

# · gmaxwell · #3 · 21 Jun 2026, 16:51 UTC

I made a prior development list post where I proposed using RS codes to divide partial nodes into a collection of flavors such that syncing required connecting to e.g. 8 distinct flavors. I actively eschewed fountain codes in this proposal (in spite of the fact that we used them in Bitcoin Fibre and the satellite broadcast) for two reasons:

Bitcoin has to be robust against actively malicious peers. If you get data from a bunch of different peers and some are malicious there didn’t seem have anything short of exptime algorithms to find the set of malicious peers and successfully reconstruct the data in spite of them. For an RS code you can decode in polytime using berlekamp once you have enough data (e.g. two extra good peers for each malicious peer) to correct errors and not just erasures. Decoding absent malicious peers can still be very fast (even for big codes e.g. GitHub - catid/leopard: Leopard-RS : O(N Log N) MDS Reed-Solomon Block Erasure Code for Large Data · GitHub ).

The other reason was that fingerprinting is a concern-- when a node changes network identity it is preferable if it is difficult to distinguish from other nodes, and many protocol decisions have been shaped by this, e.g. the fact that there is a single prune height advertised for limited nodes. Because of this I think it makes sense to limit the number of distinct shards. It does mean that distribution wouldn’t be as perfectly uniform where you need literally any N peers to reconstruct but I think not making every partial node perfectly tracable over its whole lifetime is probably worth that limit.

Sticking to a simpler code also has some potential IPR benefits. ::shrugs::

# · Anthony Towns · #4 · 22 Jun 2026, 03:00 UTC · in reply to #3

I think this is that post:

The blog post linked from the original message in this thread points at a 2019 paper that uses the header chain for solving this: you download a selection of “droplets” each represent an xor combination of 1-to-many blocks. You resolve the 1-block droplets immediately, and verify they’re correctness versus the header-chain, then xor-those blocks out of the other droplets, reducing some of them down from an n-block droplet to an (n-1)-block droplet. Because you’re only using validated data to affect droplets from other nodes, punishment is straightforward.

That seems a fair criticism, particularly if the goal here is that every node (including non-listening ones) will eventually assist new nodes doing IBD.

If you’re just dividing non-archival nodes into N groups, then I think you’d just have each node in group k of N store any block whose height h \equiv k \pmod{N}, rather than doing anything more complex? With N=128, then that would mean storing ~6GB per node, and nodes who are happy to store ~44GB could store/serve k, k+16, k+32, k+48, k+64, k+80, k+96, k+112, while remaining indistinguishable amongst the 6.25% of nodes willing to store 44GB, ie the ones who chose the same k.

# · gmaxwell · #5 · 22 Jun 2026, 03:15 UTC

IIRC the figures you get for share of nodes you need to DOS to block synchronization is much worse than if you use a code.

# · Anthony Towns · #6 · 22 Jun 2026, 03:32 UTC · in reply to #5

Presumably that only works if you have N anonymity groups, but only need to connect to rN groups, with r < 1 (and you need a code to achieve that)? Then you need to DoS every node in (1-r)N+1 groups instead of just 1 group; though in either case you also need to DoS all the remaining full archival nodes as well, assuming there are any.

# · Lucas Lima · #7 · 22 Jun 2026, 04:14 UTC · in reply to #2

I think the idea was that peers would share the seed they use for working out what droplets they’ll select, both to save a little bit of data, but also to help guarantee unbiased random selection?

Yes. The peers share the seed used to generate the droplets in the respective epoch. In the post, where I talk about:

Each droplet carries information about which original blocks were combined to produce it. Using this information, the decoder builds a bipartite graph connecting droplets to their source blocks.

this information might include the seed needed to rebuild the same bipartite graph on both nodes.

So I think your storage is just the fraction 𝑠/𝑘 versus storing all blocks.

Actually it’s approximately s/k, because blocks don’t have exactly the same size, so different encodings produce droplets with variable size. In the last paragraph you say that:

I wonder what the impact of different serialization sizes of blocks is here: if you have a droplet that’s [208, 820] and 208 is full of inscriptions and 4MB while 820 is full of runestones and 1MB, I think the droplet has to be the full 4MB.

That’s one of the reasons the storage isn’t exactly s/k of the total blocks’ data size. In a case where a difference like the one you showed occurs, could this be optimized by concatenating blocks up to a target size threshold, to avoid XORing a large block against a much smaller one, where most of the operation would just be canceling out the zero-padding on the smaller block.

I think it would be reasonable to parameterize based on 𝑘 and 𝑟 =𝑠/𝑘 instead, with the potential for 𝑟 being a per-node setting.

I agree with the above statement. The parameter s being variable per node’s storage capacity is really interesting. But I think using a fixed k is important for node coordination.

My first thought is that you’d disconnect and ban a peer that gave you bad data (so not request any additional droplets), but might keep the remaining droplets they gave you on the offchance any of them turned out to be useful.

I also agree with this statement. The remaining droplets from a murky peer should still be processed by the decoder, which will naturally accept them if they pass the header and merkle root verification when they become singletons, or reject them otherwise.

# · gmaxwell · #8 · 22 Jun 2026, 06:00 UTC · in reply to #6

Right, and slicing is on each block, so it doesn’t matter that blocks are different in size. Each block is encoded in say, 246 parts, such that you need any 10 parts to recover the block. So to block reconstruction you must block reaching at least 96% of nodes (and all the archival nodes, if any exist). And each node would need 1/10th the storage. Discovery is not a big deal because just random selecting likely gives you 86% chance of getting all distinct selections even not having any idea what lane peers are in.

I hadn’t seen the 2019 paper, but a pure peeling code has some unattractive recovery properties, e.g. 30%+ overhead in terms of symbol counts. And the scheme you describe has poor locality… so I don’t see how storage would be handled during recovery on a pruning node. I guess I need to look at some concrete numbers. But I don’t see the advantage vs striping at a block level.

# · Anthony Towns · #9 · 22 Jun 2026, 06:15 UTC · in reply to #6

So I think a way of combining these ideas might be something like this:

Pick k=1008 blocks matching half a retarget interval
Pick N=100 as your anonymity groups
Pick 1.6% (1/63) as your storage reduction factor (~11GB total) for 16 droplets per interval (~32MB additional storage per week rather than ~2GB)
Calculate in advance (hardcode) 1600 unique droplet compositions according to the Robust Soliton distribution with c=0.06 , \delta=0.1 (versus c=0.1, \delta=0.01) using some deterministic pseudo-random scheme. Order these by degree, so \deg(d_1) = 1, \deg(d_{1600}) = 57
Each node picks its anonymity set once from \alpha \in 1..N, and keeps droplet d_i in each interval whenever i \equiv \alpha \pmod{N}.

With that setup, I believe you need to connect to 76-80 honest peers across different anonymity groups to be ~certain of recovering all data, but you can calculate exactly 1008 droplets to download and peel to obtain all the blocks you want, so you don’t pay too much of a bandwidth cose. Meanwhile you detect a dishonest peer essentially as soon as you download a single bad droplet from them.

If you want to make r be a per node parameter, then you could select m \in \{1,2,5,10,20\} and multiplying your storage used by m and storing droplet d_i whenever i \equiv \alpha \pmod{\frac{N}{m}}.

With a hardcoded set of 1600 droplet compositions you might be able to do something better than random, but I’m not sure what that would look like.

I believe that needing 80 honest peers with distinct sets rather than 63 is the tradeoff between using an optimal encoding like Reed-Solomon versus being able to precisely detect who’s responsible for introducing errors via the SeF-peeling scheme.

In practice this would mean during IBD you’re downloading 1008 blocks at once, then processing them all, then downloading the next set. Picking a smaller size for k might be appropriate if that means you can keep the sequence of resolved blocks in memory for peeling, rather than having to reread it repeatedly off disk; after all, if you’re interleaving processing the last set of 1008 while downloading the next, then the block cache for peeling would compete for memory with the leveldb utxo cache which might be annoying for (very) low-memory systems. Unfortunately dropping from k = 1008 significantly increases the number of peers you need or the amount they need to store afaics.

# · Anthony Towns · #10 · 24 Jun 2026, 13:06 UTC · in reply to #8

Okay so a sketch of the 246/10 system. You use GF(2^8) with 10-byte code words, padding each block to a multiple of 10 bytes. Each archival node effectively provides shards 0-9 directly, each sharded node can provide one shard from 10-255 for each block, and the sharded nodes are effectively divided into 246 anonymity sets, each storing 70GB of blockchain data rather than 700GB, growing at ~200MB a week.

You connect to peers providing twelve different shards, perhaps [15, 25, 28, 36, 39, 130, 137, 138, 162, 176, 194, 222]. You ask ten of those nodes for their shard of block 420,004 which is 87,166 bytes each (after padding the block to the next multiple of 10 bytes). You put those shards together, and discover it doesn’t give you the right hash, so someone’s malicious/buggy. You ask the other two nodes for their shards, and try error-correction – if that gives you the correct block, you re-generate the 10 original shards, and determine which peer is malicious and disconnect/ban them. If it doesn’t, perhaps you add another two peers and repeat; or alternatively download the block from an honest archival node, and use that to determine which of your peers are giving you bad shards.

That seems like it works okay to me, and allows you to connect to far fewer peers (10 or 12 rather than 80 or more), but as-specced does have a higher storage requirement (~90% saving rather than ~99% saving). The storage requirement is tighter, with no inflation due to combining small and large blocks together.

# · Ruben Somsen · #11 · 26 Jun 2026, 11:04 UTC

I saw the original presentation at the Scaling Bitcoin conference in 2019. We also revisited the topic in the 2140 office last year and I discussed it with people at last year’s Floresta/utreexo retreat.

I think it’s academically interesting and appreciate that a fresh set of eyes is evaluating it, but in my view it’s currently not worth pursuing.

Reasons:

It seem to me that blockchain storage/serving is simply not an issue right now, even with space restricted nodes choosing to prune - a solution looking for a problem
We are primarily bandwidth bottlenecked (especially considering SwiftSync) - the faster you can get the chain downloaded, the faster IBD will be, and this proposal makes that worse in multiple dimensions:
- The protocol requires you to download ~10% extra data to account for block recovery (slower IBD)
- You need blocks from many different peers to successfully reconstruct the chain. At 1% retention per peer you need ~110 peers (also affecting memory). But you’ll need significant redundancy - every peer can only send you ~1% of the needed data, so your fastest peers are first to lose their utility and you are likely to get stuck with a bunch of slower peers (again slower IBD)
You lose the ability to granularly download blocks and instead have to download sets of blocks in batches (“epochs”). This is restrictive in various ways. You lose the ability to request single blocks from the network (e.g. for BIP157/158 light clients), and it restricts certain IBD optimizations such as sequentially validating the chain from multiple points in parallel

Some of this may be improvable with hybrid networks where some peers do serve all blocks, but I suspect that will significantly blow up the complexity.

I only briefly skimmed the rest of the conversation in this thread, but the concern about peer fingerprinting via their droplet choices stood out to me as another valid concern.

# · gmaxwell · #12 · 10 Jul 2026, 20:50 UTC · in reply to #10

the 10% extra data can be avoided by not using low density parity check codes, peer limit is why my comments had been limited to more miner sharing, etc. so I think I addressed all those points except the big one: it’s addressing a thing which is not the limiting factor. And even without the bandwidth overhead computational overhead would slow it down. So I don’t disagree-- it’s why the idea hadn’t been pursued years ago.