RFC: Macro Regression Test Suite for Historical Reorgs

l0rinc commented at 10:17 am on March 24, 2025: contributor

Context and Motivation

Whenever we’re modifying caching behavior (optimizations, refactors, new features, calculating additional metrics), a common concern is: “Sweet, but have you tested it via a reorg?!”

We already have tests covering basic reorg scenarios (feature_block.py, p2p_unrequested_blocks.py, feature_pruning.py). However, we’re lacking a macro regression test suite that systematically verifies Bitcoin Core behavior against historical mainnet reorg events, especially across releases, different configurations (pruned, txindex, varying memory settings), or with random undo/redo cycles.

Proposal: Historical Reorg Macro Test Suite

The goal is to create a robust regression test ensuring the latest Bitcoin Core handles historical reorgs identically to when they originally occurred. This would increase confidence that new versions do not introduce regressions in complex reorg and undo/redo logic, especially when modifying sensitive code paths. While partially covered by existing synthetic tests, this proposal is a more heavyweight alternative, using real historical blocks, performing a full IBD, and explicitly checking for behavior changes related to reorgs. Making sure mainnet behavior is retained is critical, but we might as well extend this to cover testnet behavior too (which is a lot more volatile anyway).

Dedicated Historical Stale Block Proxy

Leveraging the existing bitcoin-data/stale-blocks dataset, which currently contains over 200 real historical stale blocks, we propose:

A dedicated fake node (“stale-block proxy”) that replays historical mainnet headers and blocks exactly as originally observed (since we can’t have reorgs during IBD otherwise, but this way we can simulate the ones that did actually happen).
The node under test would exclusively connect to this proxy node.
The proxy sequentially presents each historical stale block as a temporary chain tip (once the stale block is reached, the proxy moves on to the next available stale block, routing real blocks via the network), forcing the test node into realistic mainnet reorg conditions during a full IBD.
Once we reach a given height we could validate the resulting UTXO set against known AssumeUTXO hashes.
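
The announcement order the proxy would follow can be sketched as below. This is only a minimal illustration, not an implementation of the proxy: `Block`, `build_replay_schedule`, and the string hashes are hypothetical names made up for the example, and the sketch assumes the node under test keeps the first-seen block as its tip when two candidates have equal work.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for a real block: only height and hash are modeled."""
    height: int
    block_hash: str
    is_stale: bool = False


def build_replay_schedule(
    main_chain: List[Block],
    stale_by_height: Dict[int, List[Block]],
) -> List[Block]:
    """Return blocks in the order the proxy would announce them.

    At every height with known stale blocks, the stale blocks are announced
    first so the node adopts one of them as its tip. The main-chain block at
    the same height follows, and a later main-chain block gives the main
    chain more work, which triggers the reorg.
    """
    schedule: List[Block] = []
    for block in main_chain:
        schedule.extend(stale_by_height.get(block.height, []))
        schedule.append(block)
    return schedule


if __name__ == "__main__":
    main_chain = [Block(h, f"m{h}") for h in range(4)]
    stale = {2: [Block(2, "s2", is_stale=True)]}
    for b in build_replay_schedule(main_chain, stale):
        print(b.height, b.block_hash, "stale" if b.is_stale else "main")
```

A real proxy would additionally have to withhold headers for the blocks it has not announced yet, otherwise the node would never treat the stale block as the best tip.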

Key Testing Scenarios:

Perform full initial block download (IBD) against the stale-block proxy, ensuring natural and historically accurate chain reorgs.
Test various node configurations explicitly:
- Default setup
- Pruned nodes
- Nodes with small and large dbcache memory allocations
- Nodes running with txindex=1
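
As a rough sketch, the configurations above could be expressed as a small matrix of `bitcoind` argument lists. The concrete values (`-prune=550`, `-dbcache=4`, `-dbcache=4000`) are assumptions picked for illustration, and `NODE_CONFIGS` and `is_incompatible` are hypothetical names. The check reflects that, to my understanding, pruning and `-txindex` cannot be enabled together, so they need separate runs.

```python
from typing import Dict, List

# Illustrative values only; the actual sizes would be chosen by the test authors.
NODE_CONFIGS: Dict[str, List[str]] = {
    "default": [],
    "pruned": ["-prune=550"],
    "small_dbcache": ["-dbcache=4"],
    "large_dbcache": ["-dbcache=4000"],
    "txindex": ["-txindex=1"],
}


def is_incompatible(args: List[str]) -> bool:
    """Return True if the argument list enables both pruning and txindex."""
    pruned = any(a.startswith("-prune=") and a != "-prune=0" for a in args)
    txindex = "-txindex=1" in args
    return pruned and txindex


if __name__ == "__main__":
    for name, args in NODE_CONFIGS.items():
        assert not is_incompatible(args), name
        print(f"{name}: bitcoind {' '.join(args)}".rstrip())
```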

Additional Randomized Undo/Redo Testing:

In addition to historical scenarios, randomly trigger smaller undo/redo reorg cycles at various block heights to further stress-test UTXO consistency, using CoinsTip::SanityCheck() for validation before and after each reorg (confirming that undoing and reapplying a block results in the same state).
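
The property being checked (undoing and reapplying a block yields the same state) can be illustrated with a toy model. This is not Bitcoin Core's code: the UTXO set is a plain dictionary, a block is just a pair of spent outpoints and created outputs, and all function names are hypothetical. It also assumes a block never spends an output it creates itself.

```python
from typing import Dict, List, Tuple

Utxos = Dict[str, int]                        # outpoint -> amount
ToyBlock = Tuple[List[str], Dict[str, int]]   # (spent outpoints, created outputs)


def connect_block(utxos: Utxos, block: ToyBlock) -> Dict[str, int]:
    """Apply the block in place and return undo data (the coins it spent)."""
    spent, created = block
    undo = {outpoint: utxos.pop(outpoint) for outpoint in spent}
    utxos.update(created)
    return undo


def disconnect_block(utxos: Utxos, block: ToyBlock, undo: Dict[str, int]) -> None:
    """Reverse connect_block: drop created outputs, restore spent coins."""
    _, created = block
    for outpoint in created:
        del utxos[outpoint]
    utxos.update(undo)


def check_undo_redo(utxos: Utxos, block: ToyBlock) -> None:
    """Connect, disconnect, and reconnect a block, asserting state round-trips."""
    before = dict(utxos)
    undo = connect_block(utxos, block)
    after = dict(utxos)
    disconnect_block(utxos, block, undo)
    assert utxos == before, "undo did not restore the previous state"
    connect_block(utxos, block)
    assert utxos == after, "redo did not reproduce the connected state"


if __name__ == "__main__":
    utxos = {"a:0": 50, "b:0": 25}
    check_undo_redo(utxos, (["a:0"], {"c:0": 30, "c:1": 20}))
    print(sorted(utxos))
```

In the real suite the equality check would be replaced by the node's own consistency checks plus a comparison of UTXO set hashes before and after the cycle.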

RFC / Questions:

Should this be an optional functional test (run periodically, monthly, pre-release), or triggered automatically via a GitHub label for relevant PRs?
Are there any additional scenarios or configurations we should consider?
How could we gather more historical stale blocks for our dataset (do we even have data for >1 reorgs)?

maflcko added the label Tests on Mar 24, 2025

maflcko added the label Brainstorming on Mar 24, 2025

0xB10C commented at 11:43 am on March 24, 2025: contributor

Instead of using mainnet blocks with a reorg maybe every 1000-3000 blocks, one could consider using testnet4 blocks as it often has multiple reorgs per block and blocks are a lot smaller and faster to load.

See for example https://fork.observer/?network=4

l0rinc commented at 11:55 am on March 24, 2025: contributor

Nice, we should do that as well! I think it’s important to be close to the historical behavior: it may not be a tragedy if testnet behavior happens to change accidentally, but it is one if a mainnet logic change isn’t caught. That’s why I insist on making it as realistic as possible. Added it to the description.

mzumsande commented at 3:50 pm on March 25, 2025: contributor

since we can’t have reorgs during IBD otherwise

I’d say we shouldn’t really encounter reorgs during IBD in general - reorgs typically happen when the node is synced to the tip, a new node doing IBD today usually won’t experience any of the historical reorgs. Do I understand it correctly that what you suggest is to simulate a situation where the syncing node thinks it is out of IBD to replay these reorgs - by changing -minchainwork for the node under test to a lower value, and having some orchestration in the fake node to send incomplete headers?

I think it’s important to be close to the historical behavior

Can you expand on this a bit? I would have thought that historical reorgs weren’t particularly deep or complicated, so that synthetic scenarios might be more of a stress test than historical ones.

l0rinc commented at 8:18 am on March 26, 2025: contributor

you suggest is to simulate a situation where the syncing node thinks it is out of IBD to replay these reorgs

Exactly!

historical reorgs weren’t particularly deep or complicated

That’s my understanding as well, but they’re real, so they don’t include our testing biases.

maflcko commented at 8:44 am on March 26, 2025: member

I’d say the testing bias should be towards full coverage of the consensus rules via synthetic tests (whether it is a unit, fuzz, or functional test). Obviously it can’t hurt to confirm this with real data, but as soon as a shortcoming or lack of testing is seen, it should be added as a “synthetic” test.